Performance Optimization Slides 2
📚 Session 2: Memory Hierarchy
Lecturer: Anne Reinarz
🧪 Exercise 1: Sum Reduction Benchmark
- SIMD: 4 plateaus
- Scalar: 3 plateaus
- Run-to-run variability caused by CPU frequency boosting
❓ Why doesn’t SIMD reach peak performance?
- Not due to instruction throughput
- The limit is memory bandwidth, which does not scale with vector width
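Back-of-the-envelope check (using the ~17.5 GB/s RAM bandwidth measured in Exercises 2 & 3 below): one float is 4 bytes, so RAM can supply at most ~4.4 billion floats per second. A single core at an assumed ~4 GHz can already consume floats at that rate with scalar adds, so once the data no longer fits in cache, wider vectors cannot raise throughput.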
🧠 Memory Hierarchy Overview
Types of Memory:
- Small and fast
- Large and slow
Physics prevents memory from being both large and fast.
Optimization strategy:
- Restructure algorithms to keep data in fast memory
🔗 Latency reference by Colin Scott
🗂️ Cache Memory
Features:
- Hierarchy of small, fast memory
- Stores frequently accessed data
Issues:
- Which data will be accessed frequently is not known in advance
- Use principle of locality to predict
📍 Principle of Locality
🕒 Temporal Locality
- If data is accessed now, it will likely be accessed again soon
🗺️ Spatial Locality
- If one address is accessed, nearby addresses will likely be accessed
📥 Cache Access Behavior
- First access:
- Load from main memory to register
- Store in cache
- Subsequent accesses:
- Load from cache (faster)
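To make this concrete, here is a minimal sketch (illustrative, not from the slides) that times a cold first pass against a warm second pass over an array small enough to stay cached; exact numbers depend on page faults and prefetching:

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Time one summing pass over the array, in seconds. */
static double pass(const float *a, size_t n) {
    struct timespec t0, t1;
    volatile float sum = 0.0f;           /* volatile: keep the loop from being optimized away */
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < n; i++)
        sum += a[i];
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
}

int main(void) {
    size_t n = 1 << 16;                  /* 256 KB of floats: fits in L2 */
    float *a = calloc(n, sizeof *a);     /* untouched pages: pass 1 pays the cold misses */
    printf("pass 1 (cold): %g s\n", pass(a, n));
    printf("pass 2 (warm): %g s\n", pass(a, n));
    free(a);
    return 0;
}
```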
🧮 Example: Sum Reduction
```c
float s[16] = {0};
for (int i = 0; i < N; i++) {
    s[i % 16] += a[i];
}
```
- `s`: good temporal locality
- `a`: good spatial locality
🧰 Cache Design Questions
- Where to put loaded data?
- How to check if data is in cache?
- What to do when cache is full?
- Address broken into:
- Tag
- Index
- Offset
📌 Direct Mapped Cache
- Each address maps to one specific cache slot
- Conflict resolution is trivial: the new line evicts whatever currently occupies the slot
📏 Cache Line Size
- Data is loaded one cache line at a time
- 64 bytes is typical
- Optimize algorithms to use cache-line-sized chunks
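As an illustration (a sketch, not from the slides): with 64-byte lines, 16 consecutive floats share one line, so a contiguous traversal amortizes each line load over 16 elements while a strided one does not:

```c
#include <stddef.h>

#define N 1024

float m[N][N];

/* Good: walks memory contiguously; each 64-byte line is loaded once
 * and all 16 floats in it are used. */
float sum_rows(void) {
    float s = 0.0f;
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            s += m[i][j];
    return s;
}

/* Bad: stride of N*4 bytes between accesses; each access pulls in a
 * fresh line but uses only 4 of its 64 bytes. */
float sum_cols(void) {
    float s = 0.0f;
    for (size_t j = 0; j < N; j++)
        for (size_t i = 0; i < N; i++)
            s += m[i][j];
    return s;
}
```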
🔄 Cache Thrashing Example
```c
int a[64], b[64], r = 0;
for (int i = 0; i < 100; i++)
    for (int j = 0; j < 64; j++)
        r += a[j] + b[j];
```
- Although the cache is big enough to hold both arrays, `a[j]` and `b[j]` map to the same cache slot, so each access evicts the other.
🧩 Cache Associativity Types
Type | Description |
---|---|
Direct Mapped | One memory block maps to one cache line |
Fully Associative | Can go anywhere, costly to implement |
Set Associative | N-way set (common values: 2, 4, 8, 16) |
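A minimal software model of an N-way lookup (all names are illustrative; real hardware compares the tags of all ways in parallel):

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_SETS 8           /* index = 3 bits */
#define WAYS     2           /* 2-way set associative */

struct line { bool valid; uint32_t tag; };
static struct line cache[NUM_SETS][WAYS];

/* Assumes 4-byte lines (2 offset bits) and 3 index bits. */
bool lookup(uint32_t addr) {
    uint32_t set = (addr >> 2) & (NUM_SETS - 1);
    uint32_t tag = addr >> 5;
    for (int w = 0; w < WAYS; w++)
        if (cache[set][w].valid && cache[set][w].tag == tag)
            return true;     /* hit: the block may live in either way */
    return false;            /* miss: choose a victim way (e.g. LRU) to fill */
}
```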
📊 Exercise 2 & 3: Bandwidth vs Array Size
Bandwidth from different levels:
- L1 (<32KB): ~370GB/s
- L2 (<512KB): ~100GB/s
- L3 (<16MB): ~78GB/s
- RAM: ~17.5GB/s
Bandwidth decreases as the data sits farther from the CPU.
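A sketch of the kind of measurement behind these numbers (assumed methodology, not the exercise's exact code): stream over an array of a given size repeatedly and report bytes moved per second; the plateaus appear as the array outgrows each cache level.

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Achieved read bandwidth in GB/s for an array of `bytes` bytes. */
static double bandwidth(size_t bytes, int reps) {
    size_t n = bytes / sizeof(float);
    float *a = malloc(bytes);
    volatile float sum = 0.0f;           /* keeps the loop alive; costs some accuracy */
    for (size_t i = 0; i < n; i++) a[i] = 1.0f;   /* first touch / warm-up */
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int r = 0; r < reps; r++)
        for (size_t i = 0; i < n; i++)
            sum += a[i];
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    free(a);
    return (double)bytes * reps / sec / 1e9;
}

int main(void) {
    for (size_t kb = 16; kb <= 64 * 1024; kb *= 2)
        printf("%6zu KB: %6.1f GB/s\n", kb, bandwidth(kb * 1024, 100));
    return 0;
}
```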
🧠 Summary: Hardware Architecture Insights
Key Factors:
- Instruction count
- Execution efficiency
- Data movement time
Hardware Parallelism:
- Sockets: 1–4 CPUs
- Cores: 4–32 per CPU
- Vectorization: 2–16 floats per vector
- Superscalar: 2–8 instructions per cycle
🧠 Direct Mapped Caches: Indexing Explained
🎯 Goal
Understand how a memory address is split into Tag, Index, and Offset in a Direct Mapped Cache.
📦 Imagine the Cache as Slots
Suppose we have 8 cache slots labeled 0 to 7:
```
+-------+-------+-------+-------+-------+-------+-------+-------+
| slot0 | slot1 | slot2 | slot3 | slot4 | slot5 | slot6 | slot7 |
+-------+-------+-------+-------+-------+-------+-------+-------+
```
When we load data from memory, we must decide:
- Which slot to put it in → this is the Index
- How to recognize if the data in the slot is correct → this is the Tag
🧮 Splitting the Address
Imagine an 8-bit memory address:
```
Address: 1101 0110
```
We divide it into three parts:
✅ 1. Byte Offset (lowest bits)
These bits select which byte inside the cache line we want.
- If the cache line = 4 bytes → we need 2 bits
- In this example: `Offset = 10`
✅ 2. Cache Index (middle bits)
These bits tell us which cache slot to use.
- If cache has 8 slots → need 3 bits
- In this example: `Index = 101` → slot 5
✅ 3. Tag (highest bits)
These bits are used to identify the data stored in that slot.
- In this example: `Tag = 110`
📌 Final Breakdown:
```
Address: 1101 0110
       = 110 | 101 | 10
         Tag   Index  Offset
```
- Put the data in slot 5
- When we check slot 5 later, compare its tag with 110
- If it matches → cache hit
- If not → cache miss
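The same split in code (a minimal sketch, assuming the 8-bit address, 4-byte lines, and 8 slots from this example):

```c
#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint8_t addr   = 0xD6;               /* 1101 0110 */
    uint8_t offset = addr & 0x3;         /* low 2 bits  -> 10            */
    uint8_t index  = (addr >> 2) & 0x7;  /* next 3 bits -> 101 = slot 5  */
    uint8_t tag    = addr >> 5;          /* high 3 bits -> 110           */
    printf("tag=%u index=%u offset=%u\n", tag, index, offset);
    return 0;
}
```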
🔁 Example of Conflict
```
&a[0] = 1101 0110
&b[0] = 0101 0110
```
- Both have `Index = 101` → both map to slot 5
- But the `Tag` differs (110 vs 010)

So:

- Access `a[0]` → slot 5 now holds tag = 110
- Then access `b[0]` → overwrites slot 5 with tag = 010
- Re-access `a[0]` → tag mismatch → cache miss
🎢 This causes cache thrashing — constant replacement and reloading!
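One common remedy (a sketch, not from the slides): change the data layout so the conflicting elements no longer share an index, either by padding between the arrays or by interleaving them so `a[j]` and `b[j]` land in the same line.

```c
/* Option 1: padding shifts b's addresses onto different cache slots.
 * The pad size here is hypothetical; tune it to the cache geometry. */
int a[64];
int pad[16];
int b[64];

/* Option 2: interleave, so each a/b pair shares a cache line. */
struct pair { int a, b; };
struct pair ab[64];
```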
✅ Summary Table
Part | What it does |
---|---|
Tag | Uniquely identifies the memory block |
Index | Selects which cache slot to use |
Offset | Selects which byte within the cache line |
🧪 Memory Operation Benchmarks: `clload`, `clstore`, `clcopy`
These three operation types are commonly used in memory-bandwidth tests; each represents a different memory-access pattern:
Operation | Behavior | Purpose |
---|---|---|
clload | Load only (read) | Measures read bandwidth |
clstore | Store only (write) | Measures write bandwidth |
clcopy | Read then write (copy) | Measures combined transfer bandwidth, as in memcpy |
🧠 Behavior Explained
✅ clload
- Read only
- Loads data from memory without writing back
- Fastest: modern CPUs optimize reads very well
- Example code:

```c
for (int i = 0; i < N; i++) {
    sum += a[i];   // read only
}
```
✅ clstore
- Write only
- Writes data to memory
- May trigger a read-modify-write (write-allocate), adding a hidden read
- Example code:

```c
for (int i = 0; i < N; i++) {
    a[i] = 0;   // write only
}
```
✅ clcopy
- Read + write
- Each element is read from the source array and written to the destination
- Bandwidth demand is the sum of clload and clstore, making it the slowest
- Example code:

```c
for (int i = 0; i < N; i++) {
    b[i] = a[i];   // read + write
}
```
📊 Performance Comparison (Bandwidth)
Operation | Bandwidth demand | Relative speed | Notes |
---|---|---|---|
clload | 1x | 🥇 Fastest | Read only; cache-friendly |
clstore | ~1.5x | 🥈 Medium | Writes may trigger a read-modify-write |
clcopy | 2x | 🥉 Slowest | Reads and writes at once; saturates memory bandwidth |
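One way to avoid the hidden read on the store side (a hedged sketch using x86 AVX intrinsics, which the slides do not cover): non-temporal stores bypass the write-allocate line fill.

```c
#include <immintrin.h>
#include <stddef.h>

/* Assumes a is 32-byte aligned and n is a multiple of 8. */
void clstore_nt(float *a, size_t n) {
    __m256 zero = _mm256_setzero_ps();
    for (size_t i = 0; i < n; i += 8)
        _mm256_stream_ps(&a[i], zero);   /* write-combining store, no line fill */
    _mm_sfence();                        /* order streaming stores before reuse */
}
```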
This post is licensed under CC BY 4.0 by the author.