GPU Programming PPT 1
📚 COMP52315 GPU Programming
Lecture 01 & 02 Notes
🧠 What is Parallelism? What is Concurrency?
Concurrency
- Two processes A and B are concurrent iff B may start before A finishes, and vice versa.
- On single-core machines, concurrency is possible via:
  - Time multiplexing
  - Space multiplexing
Types of Concurrency
- Interleaving: alternating execution of A and B on the same unit.
- Parallelism: simultaneous execution of A and B on different units.
🏗️ Classification of Parallel Architectures
Based on:
- Memory organization:
  - Shared Memory
  - Distributed Memory
  - Distributed Shared Memory
- Concurrent processing capabilities:
  - Flynn's Taxonomy
  - Erlangen classification
🧩 Flynn’s Taxonomy
| Type | Description | Examples |
| --- | --- | --- |
| SISD | Single Instruction, Single Data | Single-core CPU |
| SIMD | Single Instruction, Multiple Data | GPU, vector unit |
| MISD | Multiple Instruction, Single Data | Pipelining (debated) |
| MIMD | Multiple Instruction, Multiple Data | Multi-core CPUs, GPU |
→ GPUs can be SIMD or MIMD depending on context.
🎮 GPU Architecture
Evolution
- 1970s: Arcade graphics circuits
- 1981: First GPU (NEC µPD7220)
- 1990s: OpenGL spreads GPU usage
- 2000s: GPGPU: graphics pipelines (mis)used for general matrix math
- 2006/07: Nvidia G80 + CUDA
- 2010s: Unified graphics/compute
- Today: Common in AI and HPC
- Future: Ubiquitous heterogeneous computing
General Architecture
- Hierarchical schedulers
- Memory/cache hierarchy
- Compute Units → Processing Elements
- Vendors: Nvidia, AMD, Intel (terminology varies)
🔁 SIMT Execution Model (Nvidia)
- SIMT = Single Instruction, Multiple Thread
- Warp = a group of threads (32 on Nvidia) executing the same instruction in lockstep
- Branch divergence is handled via lane masking (illustrated in the sketch below)
- Conceptually similar to SIMD but abstracted for developers
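To make lane masking concrete, here is a small illustrative kernel (not from the slides; `divergent` is a made-up name). When threads of the same warp take different branches, the warp executes both paths one after the other, with the inactive lanes masked off:

```cuda
__global__ void divergent(int *out) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    // Adjacent threads of a warp take different branches here, so the
    // hardware runs both paths back-to-back, masking the inactive lanes.
    if (tid % 2 == 0)
        out[tid] = 2 * tid;   // even lanes active, odd lanes masked
    else
        out[tid] = -tid;      // odd lanes active, even lanes masked
}
```

Launched e.g. as `divergent<<<1, 64>>>(d_out)`, every thread still makes progress, but the divergent section takes roughly twice as long as a branch-free equivalent.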
🧠 CPU vs. GPU
| Feature | CPU | GPU |
| --- | --- | --- |
| Optimization | Latency | Throughput |
| Chip | AMD EPYC 7702 | AMD MI100 |
| Die Size | 592 mm² | 750 mm² |
| Cores | 64 | 128 |
| Base Clock | 2.0 GHz | 1.0 GHz |
| Peak FP64 | 2048 GFLOPS | 8192 GFLOPS |
| Compute Density | 3.46 GFLOPS/mm² | 10.9 GFLOPS/mm² |
🚀 CUDA Programming Basics
Architecture Model
```text
[CPU (Host)] ⇄ [Main Memory]
        ⇅ PCIe/NVLink
[GPU (Device)] ⇄ [Global Memory]
```
Manual Memory Movement
Cost model for processing N elements (initialisation time t_A, per-element compute time t_C, per-element transfer time t_T, p CPU cores, P GPU processing elements):

CPU only:
- Init: t_A
- Compute: N t_C / p

GPU offload:
- Init: t_A
- Transfer to GPU: N t_T
- Compute: N t_C / P
- Transfer back: N t_T

Offloading is worthwhile when the CPU-only time exceeds the GPU path,

N t_C / p > N t_C / P + 2 N t_T,

which simplifies to

(p⁻¹ - P⁻¹) > 2 t_T / t_C

→ Tips:
- Maximize t_C (do enough compute per element to amortize the transfers)
- Minimize memory transfers
- Large P (GPU parallelism) helps
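As a quick sanity check, a minimal sketch with made-up parameter values (not from the lecture) that simply evaluates the inequality:

```cpp
#include <cstdio>

// True when offloading pays off under the simple cost model above:
// (1/p - 1/P) > 2 * t_T / t_C.
bool offload_worthwhile(double p, double P, double t_T, double t_C) {
    return (1.0 / p - 1.0 / P) > 2.0 * t_T / t_C;
}

int main() {
    // Hypothetical numbers: 64 CPU cores, 8192 GPU lanes,
    // per-element transfer cost 10x the per-element compute cost.
    double p = 64, P = 8192, t_C = 1.0, t_T = 10.0;
    printf("offload worthwhile: %s\n",
           offload_worthwhile(p, P, t_T, t_C) ? "yes" : "no");
}
```

With these values the answer is "no": the transfers dominate unless each element carries much more compute.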
CUDA Workflow
- Initialise context
- Allocate host & device memory
- Copy data to device
- Configure & launch kernel
- Copy data back to host
- Free memory
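The vector scalar multiplication example at the end of these notes follows these steps one-to-one (the context is initialised implicitly by the first CUDA call).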
CUDA Kernels
- Kernel launches are asynchronous
- Launch syntax: `kernel<<<gridDim, blockDim>>>(args)`
- Kernels are declared with the `__global__` qualifier and must return void
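Because launches are asynchronous, a kernel error only becomes visible at the next synchronisation point. A minimal sketch of the usual checking pattern (the empty kernel and error messages are placeholders, not from the lecture):

```cuda
#include <cstdio>

__global__ void emptyKernel() {}   // placeholder kernel

int main() {
    // The launch returns immediately; the kernel runs asynchronously.
    emptyKernel<<<1, 1>>>();

    // Did the launch itself succeed (e.g. a valid configuration)?
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        printf("launch failed: %s\n", cudaGetErrorString(err));

    // Wait for the kernel to finish and surface any execution errors.
    err = cudaDeviceSynchronize();
    if (err != cudaSuccess)
        printf("kernel failed: %s\n", cudaGetErrorString(err));
    return 0;
}
```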
💻 Examples
Hello World
CPU Version
```c
#include <stdio.h>

int main() {
    printf("Hello World from CPU!\n");
}
```
GPU Version
```cuda
#include <stdio.h>

// __global__: runs on the device, callable from the host, must return void.
__global__ void helloFromGPU() {
    printf("Hello World from GPU!\n");
}

int main() {
    printf("Hello World from CPU!\n");
    helloFromGPU<<<1, 1>>>();   // 1 block of 1 thread
    cudaDeviceReset();          // implicitly synchronises, flushing the GPU printf output
    return 0;
}
```
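To build and run (file name assumed): `nvcc hello.cu -o hello && ./hello`. The `cudaDeviceReset()` call matters here: the launch is asynchronous, so without it (or a `cudaDeviceSynchronize()`) the host can exit before the GPU's message has been flushed.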
Vector Scalar Multiplication (CUDA)
```cuda
#include <iostream>

__global__ void scalar_mul(double *v, double a, const int N) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < N)                         // guard against extra threads in the last block
        v[tid] = a * v[tid];
}

int main() {
    const int N = 256;
    double *v = new double[N];
    for (int i = 0; i < N; i++) v[i] = i;   // initialise the host vector

    double *v_d;
    cudaMalloc((void**)&v_d, sizeof(double) * N);                      // allocate device memory
    cudaMemcpy(v_d, v, sizeof(double) * N, cudaMemcpyHostToDevice);    // copy host -> device

    scalar_mul<<<1, 256>>>(v_d, 5.0, N);                               // 1 block of 256 threads

    cudaMemcpy(v, v_d, sizeof(double) * N, cudaMemcpyDeviceToHost);    // copy device -> host (synchronises)

    for (int i = 0; i < N; i++) std::cout << v[i] << ",\n";

    cudaFree(v_d);
    delete[] v;
}
```
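Compile with e.g. `nvcc scalar_mul.cu -o scalar_mul` (file name assumed); with the initialisation above, element i prints as 5·i. Note that `<<<1, 256>>>` launches a single block, so only one compute unit is used; a more general launch would be `<<<(N + 255) / 256, 256>>>`, which the `tid < N` guard already supports.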
This post is licensed under CC BY 4.0 by the author.