GPU Programming PPT 1
📚 COMP52315 GPU Programming
Lecture 01 & 02 Notes
🧠 What is Parallelism? What is Concurrency?
Concurrency
- Two processes A and B are concurrent iff B may start before A finishes, and vice versa.
- On single-core machines, concurrency is possible via:
  - Time multiplexing
  - Space multiplexing
Types of Concurrency
- Interleaving: alternating execution of A and B on the same unit.
- Parallelism: simultaneous execution of A and B on different units.
🏗️ Classification of Parallel Architectures
Based on:
- Memory organization:
  - Shared Memory
  - Distributed Memory
  - Distributed Shared Memory
- Concurrent processing capabilities:
  - Flynn's Taxonomy
  - Erlangen classification
🧩 Flynn’s Taxonomy
| Type | Description | Examples |
| --- | --- | --- |
| SISD | Single Instruction, Single Data | Single-core CPU |
| SIMD | Single Instruction, Multiple Data | GPU, vector unit |
| MISD | Multiple Instruction, Single Data | Pipelining (debated) |
| MIMD | Multiple Instruction, Multiple Data | Multi-core CPUs, GPU |
→ GPUs can be SIMD or MIMD depending on context.
🎮 GPU Architecture
Evolution
- 1970s: Arcade graphics circuits
- 1981: First GPU (NEC µPD7220)
- 1990s: OpenGL spreads GPU usage
- 2000s: GPGPU: graphics pipelines (mis)used for general matrix math
- 2006/07: Nvidia G80 + CUDA
- 2010s: Unified graphics/compute
- Today: Common in AI and HPC
- Future: Ubiquitous heterogeneous computing
General Architecture
- Hierarchical schedulers
- Memory/cache hierarchy
- Compute Units → Processing Elements
- Vendors: Nvidia, AMD, Intel (terminology varies)
🔁 SIMT Execution Model (Nvidia)
- SIMT = Single Instruction, Multiple Thread
- Warp = a group of threads (32 on Nvidia) executing the same instruction in lockstep
- Branch divergence is handled via lane masking (illustrated in the sketch below)
- Conceptually similar to SIMD but abstracted for developers
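To make lane masking concrete, here is a small illustrative kernel (not from the slides; `divergent` is a made-up name). When threads of the same warp take different branches, the warp executes both paths one after the other, with the inactive lanes masked off:

```cuda
__global__ void divergent(int *out) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    // Adjacent threads of a warp take different branches here, so the
    // hardware runs both paths back-to-back, masking the inactive lanes.
    if (tid % 2 == 0)
        out[tid] = 2 * tid;   // even lanes active, odd lanes masked
    else
        out[tid] = -tid;      // odd lanes active, even lanes masked
}
```

Launched e.g. as `divergent<<<1, 64>>>(d_out)`, every thread still makes progress, but the divergent section takes roughly twice as long as a branch-free equivalent.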
🧠 CPU vs. GPU
| Feature | CPU | GPU |
| --- | --- | --- |
| Optimization | Latency | Throughput |
| Chip | AMD EPYC 7702 | AMD MI100 |
| Die Size | 592 mm² | 750 mm² |
| Cores | 64 | 128 |
| Base Clock | 2.0 GHz | 1.0 GHz |
| Peak FP64 | 2048 GFLOPS | 8192 GFLOPS |
| Compute Density | 3.46 GFLOPS/mm² | 10.9 GFLOPS/mm² |
🚀 CUDA Programming Basics
Architecture Model
```text
[CPU (Host)] ⇄ [Main Memory]
        ⇅ PCIe/NVLink
[GPU (Device)] ⇄ [Global Memory]
```
Manual Memory Movement
Cost model for processing N elements (initialisation time t_A, per-element compute time t_C, per-element transfer time t_T, p CPU cores, P GPU processing elements):

CPU only:
- Init: t_A
- Compute: N t_C / p

GPU offload:
- Init: t_A
- Transfer to GPU: N t_T
- Compute: N t_C / P
- Transfer back: N t_T

Offloading is worthwhile when the CPU-only time exceeds the GPU path,

N t_C / p > N t_C / P + 2 N t_T,

which simplifies to

(p⁻¹ - P⁻¹) > 2 t_T / t_C

→ Tips:
- Maximize t_C (do enough compute per element to amortize the transfers)
- Minimize memory transfers
- Large P (GPU parallelism) helps
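As a quick sanity check, a minimal sketch with made-up parameter values (not from the lecture) that simply evaluates the inequality:

```cpp
#include <cstdio>

// True when offloading pays off under the simple cost model above:
// (1/p - 1/P) > 2 * t_T / t_C.
bool offload_worthwhile(double p, double P, double t_T, double t_C) {
    return (1.0 / p - 1.0 / P) > 2.0 * t_T / t_C;
}

int main() {
    // Hypothetical numbers: 64 CPU cores, 8192 GPU lanes,
    // per-element transfer cost 10x the per-element compute cost.
    double p = 64, P = 8192, t_C = 1.0, t_T = 10.0;
    printf("offload worthwhile: %s\n",
           offload_worthwhile(p, P, t_T, t_C) ? "yes" : "no");
}
```

With these values the answer is "no": the transfers dominate unless each element carries much more compute.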
CUDA Workflow
- Initialise context
- Allocate host & device memory
- Copy data to device
- Configure & launch kernel
- Copy data back to host
- Free memory
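The vector scalar multiplication example at the end of these notes follows these steps one-to-one (the context is initialised implicitly by the first CUDA call).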
CUDA Kernels
- Kernel launches are asynchronous
- Launch syntax: `kernel<<<gridDim, blockDim>>>(args)`
- Kernels are declared with the `__global__` qualifier and must return void
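Because launches are asynchronous, a kernel error only becomes visible at the next synchronisation point. A minimal sketch of the usual checking pattern (the empty kernel and error messages are placeholders, not from the lecture):

```cuda
#include <cstdio>

__global__ void emptyKernel() {}   // placeholder kernel

int main() {
    // The launch returns immediately; the kernel runs asynchronously.
    emptyKernel<<<1, 1>>>();

    // Did the launch itself succeed (e.g. a valid configuration)?
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        printf("launch failed: %s\n", cudaGetErrorString(err));

    // Wait for the kernel to finish and surface any execution errors.
    err = cudaDeviceSynchronize();
    if (err != cudaSuccess)
        printf("kernel failed: %s\n", cudaGetErrorString(err));
    return 0;
}
```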
💻 Examples
Hello World
CPU Version
```c
#include <stdio.h>

int main() {
    printf("Hello World from CPU!\n");
}
```
GPU Version
```cuda
#include <stdio.h>

// __global__: runs on the device, callable from the host, must return void.
__global__ void helloFromGPU() {
    printf("Hello World from GPU!\n");
}

int main() {
    printf("Hello World from CPU!\n");
    helloFromGPU<<<1, 1>>>();   // 1 block of 1 thread
    cudaDeviceReset();          // implicitly synchronises, flushing the GPU printf output
    return 0;
}
```
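To build and run (file name assumed): `nvcc hello.cu -o hello && ./hello`. The `cudaDeviceReset()` call matters here: the launch is asynchronous, so without it (or a `cudaDeviceSynchronize()`) the host can exit before the GPU's message has been flushed.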
Vector Scalar Multiplication (CUDA)
```cuda
#include <iostream>

__global__ void scalar_mul(double *v, double a, const int N) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < N)                         // guard against extra threads in the last block
        v[tid] = a * v[tid];
}

int main() {
    const int N = 256;
    double *v = new double[N];
    for (int i = 0; i < N; i++) v[i] = i;   // initialise the host vector

    double *v_d;
    cudaMalloc((void**)&v_d, sizeof(double) * N);                      // allocate device memory
    cudaMemcpy(v_d, v, sizeof(double) * N, cudaMemcpyHostToDevice);    // copy host -> device

    scalar_mul<<<1, 256>>>(v_d, 5.0, N);                               // 1 block of 256 threads

    cudaMemcpy(v, v_d, sizeof(double) * N, cudaMemcpyDeviceToHost);    // copy device -> host (synchronises)

    for (int i = 0; i < N; i++) std::cout << v[i] << ",\n";

    cudaFree(v_d);
    delete[] v;
}
```
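Compile with e.g. `nvcc scalar_mul.cu -o scalar_mul` (file name assumed); with the initialisation above, element i prints as 5·i. Note that `<<<1, 256>>>` launches a single block, so only one compute unit is used; a more general launch would be `<<<(N + 255) / 256, 256>>>`, which the `tid < N` guard already supports.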
This post is licensed under CC BY 4.0 by the author.