Performance Optimization Slides 2
📚 Session 2: Memory Hierarchy
Lecturer: Anne Reinarz
🧪 Exercise 1: Sum Reduction Benchmark
- SIMD: 4 plateaus
- Scalar: 3 plateaus
- Run-to-run variability caused by CPU frequency boosting
❓ Why doesn’t SIMD reach peak performance?
- Not due to instruction throughput
- The limit is memory bandwidth, which does not scale with vector width
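Back-of-the-envelope check (using the ~17.5 GB/s RAM bandwidth measured in Exercises 2 & 3 below): one float is 4 bytes, so RAM can supply at most ~4.4 billion floats per second. A single core at an assumed ~4 GHz can already consume floats at that rate with scalar adds, so once the data no longer fits in cache, wider vectors cannot raise throughput.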
🧠 Memory Hierarchy Overview
Types of Memory:
- Small and fast
- Large and slow
Physics prevents memory from being both large and fast.
Optimization strategy:
- Restructure algorithms to keep data in fast memory
🔗 Latency reference by Colin Scott
🗂️ Cache Memory
Features:
- Hierarchy of small, fast memory
- Stores frequently accessed data
Issues:
- Which data will be accessed frequently is not known in advance
- Use principle of locality to predict
📍 Principle of Locality
🕒 Temporal Locality
- If data is accessed now, it will likely be accessed again soon
🗺️ Spatial Locality
- If one address is accessed, nearby addresses will likely be accessed
📥 Cache Access Behavior
- First access:
- Load from main memory to register
- Store in cache
- Subsequent accesses:
- Load from cache (faster)
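To make this concrete, here is a minimal sketch (illustrative, not from the slides) that times a cold first pass against a warm second pass over an array small enough to stay cached; exact numbers depend on page faults and prefetching:

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Time one summing pass over the array, in seconds. */
static double pass(const float *a, size_t n) {
    struct timespec t0, t1;
    volatile float sum = 0.0f;           /* volatile: keep the loop from being optimized away */
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < n; i++)
        sum += a[i];
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
}

int main(void) {
    size_t n = 1 << 16;                  /* 256 KB of floats: fits in L2 */
    float *a = calloc(n, sizeof *a);     /* untouched pages: pass 1 pays the cold misses */
    printf("pass 1 (cold): %g s\n", pass(a, n));
    printf("pass 2 (warm): %g s\n", pass(a, n));
    free(a);
    return 0;
}
```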
🧮 Example: Sum Reduction
```c
float s[16] = {0};
for (int i = 0; i < N; i++) {
    s[i % 16] += a[i];
}
```
- `s`: good temporal locality
- `a`: good spatial locality
🧰 Cache Design Questions
- Where to put loaded data?
- How to check if data is in cache?
- What to do when cache is full?
- Address broken into:
- Tag
- Index
- Offset
📌 Direct Mapped Cache
- Each address maps to one specific cache slot
- Conflict resolution is trivial: the new line evicts whatever currently occupies the slot
📏 Cache Line Size
- Data is loaded one cache line at a time
- 64 bytes is typical
- Optimize algorithms to use cache-line-sized chunks
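As an illustration (a sketch, not from the slides): with 64-byte lines, 16 consecutive floats share one line, so a contiguous traversal amortizes each line load over 16 elements while a strided one does not:

```c
#include <stddef.h>

#define N 1024

float m[N][N];

/* Good: walks memory contiguously; each 64-byte line is loaded once
 * and all 16 floats in it are used. */
float sum_rows(void) {
    float s = 0.0f;
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            s += m[i][j];
    return s;
}

/* Bad: stride of N*4 bytes between accesses; each access pulls in a
 * fresh line but uses only 4 of its 64 bytes. */
float sum_cols(void) {
    float s = 0.0f;
    for (size_t j = 0; j < N; j++)
        for (size_t i = 0; i < N; i++)
            s += m[i][j];
    return s;
}
```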
🔄 Cache Thrashing Example
```c
int a[64], b[64], r = 0;
for (int i = 0; i < 100; i++)
    for (int j = 0; j < 64; j++)
        r += a[j] + b[j];
```
- Although the cache is big enough to hold both arrays, `a[j]` and `b[j]` map to the same cache slot, so each access evicts the other.
🧩 Cache Associativity Types
Type | Description |
---|---|
Direct Mapped | One memory block maps to one cache line |
Fully Associative | Can go anywhere, costly to implement |
Set Associative | N-way set (common values: 2, 4, 8, 16) |
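A minimal software model of an N-way lookup (all names are illustrative; real hardware compares the tags of all ways in parallel):

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_SETS 8           /* index = 3 bits */
#define WAYS     2           /* 2-way set associative */

struct line { bool valid; uint32_t tag; };
static struct line cache[NUM_SETS][WAYS];

/* Assumes 4-byte lines (2 offset bits) and 3 index bits. */
bool lookup(uint32_t addr) {
    uint32_t set = (addr >> 2) & (NUM_SETS - 1);
    uint32_t tag = addr >> 5;
    for (int w = 0; w < WAYS; w++)
        if (cache[set][w].valid && cache[set][w].tag == tag)
            return true;     /* hit: the block may live in either way */
    return false;            /* miss: choose a victim way (e.g. LRU) to fill */
}
```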
📊 Exercise 2 & 3: Bandwidth vs Array Size
Bandwidth from different levels:
- L1 (<32KB): ~370GB/s
- L2 (<512KB): ~100GB/s
- L3 (<16MB): ~78GB/s
- RAM: ~17.5GB/s
Bandwidth decreases as the data sits farther from the CPU.
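A sketch of the kind of measurement behind these numbers (assumed methodology, not the exercise's exact code): stream over an array of a given size repeatedly and report bytes moved per second; the plateaus appear as the array outgrows each cache level.

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Achieved read bandwidth in GB/s for an array of `bytes` bytes. */
static double bandwidth(size_t bytes, int reps) {
    size_t n = bytes / sizeof(float);
    float *a = malloc(bytes);
    volatile float sum = 0.0f;           /* keeps the loop alive; costs some accuracy */
    for (size_t i = 0; i < n; i++) a[i] = 1.0f;   /* first touch / warm-up */
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int r = 0; r < reps; r++)
        for (size_t i = 0; i < n; i++)
            sum += a[i];
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    free(a);
    return (double)bytes * reps / sec / 1e9;
}

int main(void) {
    for (size_t kb = 16; kb <= 64 * 1024; kb *= 2)
        printf("%6zu KB: %6.1f GB/s\n", kb, bandwidth(kb * 1024, 100));
    return 0;
}
```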
🧠 Summary: Hardware Architecture Insights
Key Factors:
- Instruction count
- Execution efficiency
- Data movement time
Hardware Parallelism:
- Sockets: 1–4 CPUs
- Cores: 4–32 per CPU
- Vectorization: 2–16 floats per vector
- Superscalar: 2–8 instructions per cycle
🧠 Direct Mapped Caches: Indexing Explained
🎯 Goal
Understand how a memory address is split into Tag, Index, and Offset in a Direct Mapped Cache.
📦 Imagine the Cache as Slots
Suppose we have 8 cache slots labeled 0 to 7:
```
+-------+-------+-------+-------+-------+-------+-------+-------+
| slot0 | slot1 | slot2 | slot3 | slot4 | slot5 | slot6 | slot7 |
+-------+-------+-------+-------+-------+-------+-------+-------+
```
When we load data from memory, we must decide:
- Which slot to put it in → this is the Index
- How to recognize if the data in the slot is correct → this is the Tag
🧮 Splitting the Address
Imagine an 8-bit memory address:
```
Address: 1101 0110
```
We divide it into three parts:
✅ 1. Byte Offset (lowest bits)
These bits select which byte inside the cache line we want.
- If the cache line = 4 bytes → we need 2 bits
- In this example: `Offset = 10`
✅ 2. Cache Index (middle bits)
These bits tell us which cache slot to use.
- If cache has 8 slots → need 3 bits
- In this example: `Index = 101` → slot 5
✅ 3. Tag (highest bits)
These bits are used to identify the data stored in that slot.
- In this example: `Tag = 110`
📌 Final Breakdown:
```
Address: 1101 0110
       = 110 | 101 | 10
         Tag   Index  Offset
```
- Put the data in slot 5
- When we check slot 5 later, compare its tag with 110
- If it matches → cache hit
- If not → cache miss
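The same split in code (a minimal sketch, assuming the 8-bit address, 4-byte lines, and 8 slots from this example):

```c
#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint8_t addr   = 0xD6;               /* 1101 0110 */
    uint8_t offset = addr & 0x3;         /* low 2 bits  -> 10            */
    uint8_t index  = (addr >> 2) & 0x7;  /* next 3 bits -> 101 = slot 5  */
    uint8_t tag    = addr >> 5;          /* high 3 bits -> 110           */
    printf("tag=%u index=%u offset=%u\n", tag, index, offset);
    return 0;
}
```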
🔁 Example of Conflict
```
&a[0] = 1101 0110
&b[0] = 0101 0110
```
- Both have `Index = 101` → both map to slot 5
- But the `Tag` differs (110 vs 010)

So:

- Access `a[0]` → slot 5 now holds tag = 110
- Then access `b[0]` → overwrites slot 5 with tag = 010
- Re-access `a[0]` → tag mismatch → cache miss
🎢 This causes cache thrashing — constant replacement and reloading!
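One common remedy (a sketch, not from the slides): change the data layout so the conflicting elements no longer share an index, either by padding between the arrays or by interleaving them so `a[j]` and `b[j]` land in the same line.

```c
/* Option 1: padding shifts b's addresses onto different cache slots.
 * The pad size here is hypothetical; tune it to the cache geometry. */
int a[64];
int pad[16];
int b[64];

/* Option 2: interleave, so each a/b pair shares a cache line. */
struct pair { int a, b; };
struct pair ab[64];
```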
✅ Summary Table
Part | What it does |
---|---|
Tag | Uniquely identifies the memory block |
Index | Selects which cache slot to use |
Offset | Selects which byte within the cache line |
🧪 Memory Operation Benchmarks: `clload`, `clstore`, `clcopy`
These three operation types are commonly used in memory-bandwidth tests; each represents a different memory-access pattern:
Operation | Behavior | Purpose |
---|---|---|
clload | Load only (read) | Measures read bandwidth |
clstore | Store only (write) | Measures write bandwidth |
clcopy | Read then write (copy) | Measures combined transfer bandwidth, as in memcpy |
🧠 Behavior Explained
✅ clload
- Read only
- Loads data from memory without writing back
- Fastest: modern CPUs optimize reads very well
- Example code:

```c
for (int i = 0; i < N; i++) {
    sum += a[i];   // read only
}
```
✅ clstore
- Write only
- Writes data to memory
- May trigger a read-modify-write (write-allocate), adding a hidden read
- Example code:

```c
for (int i = 0; i < N; i++) {
    a[i] = 0;   // write only
}
```
✅ clcopy
- Read + write
- Each element is read from the source array and written to the destination
- Bandwidth demand is the sum of clload and clstore, making it the slowest
- Example code:

```c
for (int i = 0; i < N; i++) {
    b[i] = a[i];   // read + write
}
```
📊 Performance Comparison (Bandwidth)
Operation | Bandwidth demand | Relative speed | Notes |
---|---|---|---|
clload | 1x | 🥇 Fastest | Read only; cache-friendly |
clstore | ~1.5x | 🥈 Medium | Writes may trigger a read-modify-write |
clcopy | 2x | 🥉 Slowest | Reads and writes at once; saturates memory bandwidth |
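One way to avoid the hidden read on the store side (a hedged sketch using x86 AVX intrinsics, which the slides do not cover): non-temporal stores bypass the write-allocate line fill.

```c
#include <immintrin.h>
#include <stddef.h>

/* Assumes a is 32-byte aligned and n is a multiple of 8. */
void clstore_nt(float *a, size_t n) {
    __m256 zero = _mm256_setzero_ps();
    for (size_t i = 0; i < n; i += 8)
        _mm256_stream_ps(&a[i], zero);   /* write-combining store, no line fill */
    _mm_sfence();                        /* order streaming stores before reuse */
}
```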
This post is licensed under CC BY 4.0 by the author.