
Performance Optimization PPT 2

📚 Session 2: Memory Hierarchy

Lecturer: Anne Reinarz


🧪 Exercise 1: Sum Reduction Benchmark

  • SIMD: 4 plateaus
  • Scalar: 3 plateaus
  • Run-to-run performance variability is caused by CPU frequency boosting (turbo)

❓ Why doesn't SIMD reach peak performance?

  • Not due to instruction throughput
  • The limit is memory bandwidth, which decreases as the vector (array) size grows

🧠 Memory Hierarchy Overview

Types of Memory:

  1. Small and fast
  2. Large and slow

Physics prevents memory from being both large and fast.

Optimization strategy:

  • Restructure algorithms to keep data in fast memory
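
One classic instance of this strategy (a sketch, not taken from the slides) is loop blocking: process the data in tiles small enough to stay resident in cache, so each tile is reused before it is evicted.

```c
#define N 1024
#define B 64   // tile size, tuned so a BxB block fits in L1/L2

// Blocked transpose: each BxB tile of src is read while it is still
// cached, instead of striding across the whole matrix per element.
void transpose_blocked(float dst[N][N], const float src[N][N]) {
    for (int ii = 0; ii < N; ii += B)
        for (int jj = 0; jj < N; jj += B)
            for (int i = ii; i < ii + B; i++)
                for (int j = jj; j < jj + B; j++)
                    dst[j][i] = src[i][j];
}
```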

🔗 Latency reference by Colin Scott


🗂️ Cache Memory

Features:

  • Hierarchy of small, fast memory
  • Stores frequently accessed data

Issues:

  • Which data will be accessed frequently is not known beforehand
  • The principle of locality is used to predict it

📍 Principle of Locality

🕒 Temporal Locality

  • If data is accessed now, it will likely be accessed again soon

🗺️ Spatial Locality

  • If one address is accessed, nearby addresses will likely be accessed

📥 Cache Access Behavior

  • First access:
    • Load from main memory to register
    • Store in cache
  • Subsequent accesses:
    • Load from cache (faster)
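
One way to observe this behavior (a minimal sketch; `clock_gettime` is POSIX, and exact numbers vary by machine): time a first pass that pulls the array from main memory, then a second pass over the same, now-cached data.

```c
#include <stdio.h>
#include <time.h>

#define N (1 << 14)   // 16K floats = 64 KB, small enough to stay cached

static double now(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void) {
    static float a[N];   // zero-initialized; the contents don't matter here
    float s = 0.0f;

    double t0 = now();
    for (int i = 0; i < N; i++) s += a[i];   // cold: loads from main memory, fills the cache
    double t1 = now();
    for (int i = 0; i < N; i++) s += a[i];   // warm: served from the cache
    double t2 = now();

    printf("cold %.2f us, warm %.2f us (sum=%g)\n",
           (t1 - t0) * 1e6, (t2 - t1) * 1e6, s);
    return 0;
}
```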

🧮 Example: Sum Reduction

```c
float s[16] = {0};            // 16 partial sums: one cache line of floats
for (int i = 0; i < N; i++) {
  s[i % 16] += a[i];
}
```
  • s: good temporal locality (the same 16 partial sums are reused on every iteration)
  • a: good spatial locality (elements are accessed consecutively)

🧰 Cache Design Questions

  1. Where to put loaded data?
  2. How to check if data is in cache?
  3. What to do when cache is full?
  • Address broken into:
    • Tag
    • Index
    • Offset

📌 Direct Mapped Cache

  • Each address maps to one specific cache slot
  • Conflict resolution: the newer line replaces the older one (with only one candidate slot, no replacement policy such as LRU is needed)

📏 Cache Line Size

  • Data is loaded one cache line at a time
  • 64 bytes is typical
  • Optimize algorithms to use cache-line-sized chunks
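
A small illustration (a sketch, not from the slides): with 64-byte lines holding 16 floats, traversal order decides how much of each loaded line is actually used.

```c
#define N 1024

// Row-major traversal: consecutive elements share a cache line,
// so every loaded 64-byte line is fully used (16 floats per line).
float sum_rows(const float m[N][N]) {
    float sum = 0.0f;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += m[i][j];
    return sum;
}

// Column-major traversal: each access jumps N * sizeof(float) bytes,
// so only one float of every loaded 64-byte line is used.
float sum_cols(const float m[N][N]) {
    float sum = 0.0f;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += m[i][j];
    return sum;
}
```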

🔄 Cache Thrashing Example

```c
int a[64], b[64], r = 0;
for (int i = 0; i < 100; i++)
  for (int j = 0; j < 64; j++)
    r += a[j] + b[j];
```
  • Although the cache is large enough to hold both arrays, a[j] and b[j] can map to the same cache line (same index bits), so each access evicts the other's data.
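
One common remedy (again a sketch, not from the slides) is to pad between the arrays so that a[j] and b[j] no longer share the same index bits:

```c
// Padding shifts b by one cache line (64 bytes = 16 ints), so a[j]
// and b[j] land in different slots of a direct-mapped cache.
int a[64];
int pad[16];   // one cache line of padding
int b[64];
```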

🧩 Cache Associativity Types

| Type | Description |
|------|-------------|
| Direct Mapped | One memory block maps to exactly one cache line |
| Fully Associative | A block can go anywhere; costly to implement |
| Set Associative | N-way sets (common values: 2, 4, 8, 16) |
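
A toy lookup routine (a sketch with assumed parameters, not a real implementation) shows how the three schemes relate: the index selects a set, and every way in that set is searched for a matching tag. With one way this degenerates to direct mapped; with a single set it becomes fully associative.

```c
#include <stdbool.h>
#include <stdint.h>

#define OFFSET_BITS 6                    // 64-byte cache lines
#define INDEX_BITS  7                    // 128 sets
#define NUM_SETS    (1u << INDEX_BITS)
#define WAYS        4                    // 4-way set associative

typedef struct {
    bool     valid;
    uint64_t tag;
} Line;

static Line cache[NUM_SETS][WAYS];

bool lookup(uint64_t addr) {
    uint64_t index = (addr >> OFFSET_BITS) & (NUM_SETS - 1);
    uint64_t tag   = addr >> (OFFSET_BITS + INDEX_BITS);
    for (int way = 0; way < WAYS; way++)          // check every way in the set
        if (cache[index][way].valid && cache[index][way].tag == tag)
            return true;                          // cache hit
    return false;                                 // cache miss
}
```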

📊 Exercise 2 & 3: Bandwidth vs Array Size

Bandwidth from different levels:

  • L1 (<32KB): ~370GB/s
  • L2 (<512KB): ~100GB/s
  • L3 (<16MB): ~78GB/s
  • RAM: ~17.5GB/s

Performance decreases as data moves away from CPU.
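
A stripped-down version of such a measurement (a sketch, not the exercise's actual harness; `clock_gettime` is POSIX) sweeps the array size and reports the achieved read bandwidth, which steps down at each cache boundary:

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static double now(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void) {
    for (size_t n = 1 << 12; n <= 1 << 26; n <<= 1) {  // 16 KB .. 256 MB working sets
        float *a = malloc(n * sizeof(float));
        for (size_t i = 0; i < n; i++) a[i] = 1.0f;    // touch once to warm up

        const int reps = 10;
        float s = 0.0f;
        double t0 = now();
        for (int r = 0; r < reps; r++)
            for (size_t i = 0; i < n; i++) s += a[i];  // sequential reads
        double gbs = (double)reps * n * sizeof(float) / (now() - t0) / 1e9;

        printf("%9zu KB: %6.1f GB/s (sum=%g)\n", n * sizeof(float) / 1024, gbs, s);
        free(a);
    }
    return 0;
}
```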


🧠 Summary: Hardware Architecture Insights

Key Factors:

  • Instruction count
  • Execution efficiency
  • Data movement time

Hardware Parallelism:

  • Sockets: 1–4 CPUs
  • Cores: 4–32 per CPU
  • Vectorization: 2–16 floats per vector
  • Superscalar: 2–8 instructions per cycle
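
As a rough worked example (the specific numbers are illustrative, not from the slides): 1 socket × 16 cores × 8 floats per vector × 2 instructions per cycle gives roughly 256 floating-point operations in flight per cycle, which is the gap a purely serial scalar code leaves unused.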

🧠 Direct Mapped Caches: Indexing Explained

🎯 Goal

Understand how a memory address is split into Tag, Index, and Offset in a Direct Mapped Cache.


📦 Imagine the Cache as Slots

Suppose we have 8 cache slots labeled 0 to 7:

```
+-------+-------+-------+-------+-------+-------+-------+-------+
| slot0 | slot1 | slot2 | slot3 | slot4 | slot5 | slot6 | slot7 |
+-------+-------+-------+-------+-------+-------+-------+-------+
```

When we load data from memory, we must decide:

  • Which slot to put it in → this is the Index
  • How to recognize if the data in the slot is correct → this is the Tag

🧮 Splitting the Address

Imagine an 8-bit memory address:

```
Address: 1101 0110
```

We divide it into three parts:

✅ 1. Byte Offset (lowest bits)

These bits tell which byte inside the cache line we want.

  • If cache line = 4 bytes → we need 2 bits
  • In this example: Offset = 10

✅ 2. Cache Index (middle bits)

These bits tell us which cache slot to use.

  • If cache has 8 slots → need 3 bits
  • In this example: Index = 101 → means slot 5

✅ 3. Tag (highest bits)

These bits are used to identify the data stored in that slot.

  • In this example: Tag = 110

📌 Final Breakdown:

```
Address:  1101 0110
          Tag = 110, Index = 101, Offset = 10
```
  • Put the data in slot 5
  • When we check slot 5 later, compare its tag with 110
  • If it matches → cache hit
  • If not → cache miss
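
The field extraction is easy to reproduce in code. Here is a minimal sketch matching this example's parameters (8-bit addresses, 8 slots, 4-byte lines):

```c
#include <stdio.h>

int main(void) {
    unsigned addr   = 0xD6u;              // 1101 0110
    unsigned offset = addr & 0x3u;        // lowest 2 bits  -> 10  (byte 2)
    unsigned index  = (addr >> 2) & 0x7u; // middle 3 bits  -> 101 (slot 5)
    unsigned tag    = addr >> 5;          // highest 3 bits -> 110
    printf("tag=%u index=%u offset=%u\n", tag, index, offset);
    return 0;
}
```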

🔁 Example of Conflict

```
&a[0] = 1101 0110
&b[0] = 0101 0110
```
  • Both have Index = 101 → both map to slot 5
  • But Tag is different (110 vs 010)

So:

  • Access a[0] → slot 5 now has tag = 110
  • Then access b[0] → overwrites slot 5 with tag = 010
  • Re-access a[0] → tag mismatch → cache miss

🎢 This causes cache thrashing: constant replacement and reloading!


✅ Summary Table

| Part | What it does |
|------|--------------|
| Tag | Uniquely identifies the memory block |
| Index | Selects which cache slot to use |
| Offset | Selects which byte within the cache line |

🧪 Memory Operation Benchmarks: clload, clstore, clcopy

These three operations are commonly used in memory-bandwidth benchmarks; each represents a different memory access pattern:

| Operation | Behavior | What it measures |
|-----------|----------|------------------|
| clload | Load only | Read bandwidth |
| clstore | Store only | Write bandwidth |
| clcopy | Load then store | Overall transfer bandwidth, as in memcpy |

🧠 How Each Behaves

clload - load only

  • Loads data from memory without writing anything back
  • Fastest of the three, because modern CPUs are heavily optimized for reads
  • Example code:

```c
for (int i = 0; i < N; i++) {
  sum += a[i]; // load only
}
```

clstore - store only

  • Writes data to memory
  • A store miss may trigger a read-modify-write (write-allocate): the cache line is loaded before being overwritten, adding read traffic
  • Example code:

```c
for (int i = 0; i < N; i++) {
  a[i] = 0; // store only
}
```
    

clcopy - load + store

  • Each element is read from the source array and written to the destination array
  • Bandwidth demand is the sum of clload and clstore, making it the slowest
  • Example code:

```c
for (int i = 0; i < N; i++) {
  b[i] = a[i]; // load + store
}
```
    

📊 Performance Comparison (Bandwidth)

| Operation | Bandwidth demand | Speed | Notes |
|-----------|------------------|-------|-------|
| clload | 1x | 🥇 Fastest | Reads only; makes full use of the cache |
| clstore | ~1.5x | 🥈 Medium | Stores may trigger read-modify-write |
| clcopy | 2x | 🥉 Slowest | Reads and writes at once; saturates memory bandwidth |
