
GPU Programming PPT 1


📚 COMP52315 GPU Programming

Lecture 01 & 02 Notes


🧠 What is Parallelism? What is Concurrency?

Concurrency

  • Two processes A and B are concurrent iff B may start before A finishes, and vice versa.
  • Concurrency can be realised via:
    • Time multiplexing: sharing one execution unit over time (this is how a single-core machine achieves concurrency)
    • Space multiplexing: spreading work across separate execution units

Types of Concurrency

  • Interleaving:
    Alternating execution of A and B on the same unit.
  • Parallelism:
    Simultaneous execution of A and B on different units.

🏗️ Classification of Parallel Architectures

Based on:

  • Memory organization:
    • Shared Memory
    • Distributed Memory
    • Distributed Shared Memory
  • Concurrent processing capabilities
  • Flynn’s Taxonomy
  • Erlangen classification

🧩 Flynn’s Taxonomy

Type | Description                         | Examples
SISD | Single Instruction, Single Data     | Single-core CPU
SIMD | Single Instruction, Multiple Data   | GPU, vector unit
MISD | Multiple Instruction, Single Data   | Pipelining (debated)
MIMD | Multiple Instruction, Multiple Data | Multi-core CPUs, GPU

→ GPUs can behave as SIMD (threads within a compute unit execute in lock-step) or MIMD (different compute units run independently), depending on the level at which you look.


🎮 GPU Architecture

Evolution

  • 1970s: Arcade graphics circuits
  • 1981: First single-chip graphics processor (NEC µPD7220)
  • 1990s: OpenGL spreads GPU usage
  • 2000s: Early GPGPU, with graphics pipelines repurposed for general matrix maths
  • 2006–07: Nvidia G80 architecture and CUDA
  • 2010s: Unified graphics/compute
  • Today: Common in AI and HPC
  • Future: Ubiquitous heterogeneous computing

General Architecture

  • Hierarchical schedulers
  • Memory/cache hierarchy
  • Compute Units → Processing Elements
  • Vendors: Nvidia, AMD, Intel (terminology varies)
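
These layers can be inspected at runtime. A minimal sketch, assuming an Nvidia GPU and the CUDA runtime API (not part of the slides): cudaGetDeviceProperties reports, among other things, the number of compute units (streaming multiprocessors), the warp size, and the memory capacity.

#include <cstdio>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);               // query device 0

    printf("Device:            %s\n",  prop.name);
    printf("Compute units:     %d\n",  prop.multiProcessorCount);   // SMs on Nvidia hardware
    printf("Warp size:         %d\n",  prop.warpSize);
    printf("Max threads/block: %d\n",  prop.maxThreadsPerBlock);
    printf("Global memory:     %zu bytes\n", prop.totalGlobalMem);
    return 0;
}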

🔁 SIMT Execution Model (Nvidia)

  • SIMT = Single Instruction, Multiple Thread
  • Warp = group of threads executing same instruction concurrently
  • Branch divergence is handled via lane masking
  • Conceptually similar to SIMD but abstracted for developers
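
A hedged sketch (not from the slides) of what branch divergence looks like in practice: within each warp, even and odd lanes take different branches, so the hardware masks the inactive lanes and executes the two paths one after the other.

__global__ void divergent(int *out) {
    int lane = threadIdx.x % 32;               // lane index within the warp
    if (lane % 2 == 0)
        out[threadIdx.x] = 2 * lane;           // even lanes run this path first...
    else
        out[threadIdx.x] = 2 * lane + 1;       // ...then odd lanes run this one
}

int main() {
    int *out_d;
    cudaMalloc((void **)&out_d, sizeof(int) * 64);
    divergent<<<1, 64>>>(out_d);               // one block = two warps of 32 threads
    cudaDeviceSynchronize();
    cudaFree(out_d);
    return 0;
}

Divergence is only an issue within a warp; threads in different warps never pay this serialisation cost for each other.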

🧠 CPU vs. GPU

Feature         | CPU             | GPU
Optimization    | Latency         | Throughput
Chip            | AMD EPYC 7702   | AMD MI100
Die Size        | 592 mm²         | 750 mm²
Cores           | 64              | 128
Base Clock      | 2.0 GHz         | 1.0 GHz
Peak FP64       | 2048 GFLOPS     | 8192 GFLOPS
Compute Density | 3.46 GFLOPS/mm² | 10.9 GFLOPS/mm²

🚀 CUDA Programming Basics

Architecture Model

[CPU (Host)] ⇄ [Main Memory]
     ⇅ PCIe/NVLink
[GPU (Device)] ⇄ [Global Memory]

Manual Memory Movement

For a problem of size N, write t_A for the initialisation cost, t_C for the per-element compute cost, t_T for the per-element transfer cost, p for the number of CPU cores, and P for the number of GPU processing elements.

CPU:

  • Init: t_A
  • Compute: N t_C / p

GPU:

  • Init: t_A
  • Transfer to GPU: N t_T
  • Compute: N t_C / P
  • Transfer back: N t_T

For offloading to be worthwhile, the total CPU time must exceed the total GPU time, i.e. t_A + N t_C / p > t_A + 2 N t_T + N t_C / P, which simplifies to:

(p⁻¹ - P⁻¹) > 2 t_T / t_C

→ Tips:

  1. Maximise the compute performed per element (t_C) relative to the cost of transferring it (t_T)
  2. Minimize memory transfers
  3. Large P (GPU parallelism) helps
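
A rough illustration with made-up numbers: with p = 64 CPU cores and P = 8192 GPU processing elements, p⁻¹ - P⁻¹ ≈ 0.0155, so the inequality requires 2 t_T / t_C < 0.0155, i.e. the per-element compute cost t_C must be roughly 130 times the per-element transfer cost t_T before offloading pays off, hence tips 1 and 2.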

CUDA Workflow

  1. Initialise the context (done implicitly by the first CUDA runtime call)
  2. Allocate host & device memory
  3. Copy data to device
  4. Configure & launch kernel
  5. Copy data back to host
  6. Free memory

CUDA Kernels

  • Kernel launches are asynchronous
  • Syntax: <<<gridDim, blockDim>>>
  • Use qualifiers like __global__ (void-return only)
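
Because the launch returns immediately, the host has to synchronise before relying on the kernel's results. A minimal sketch, where add_one is a made-up kernel (not from the slides):

__global__ void add_one(int *x) { *x += 1; }    // hypothetical kernel

int main() {
    int h = 41, *d;
    cudaMalloc((void **)&d, sizeof(int));
    cudaMemcpy(d, &h, sizeof(int), cudaMemcpyHostToDevice);

    add_one<<<1, 1>>>(d);                        // returns immediately (asynchronous)
    cudaError_t err = cudaGetLastError();        // reports launch-configuration errors
    cudaDeviceSynchronize();                     // block until the kernel has finished

    cudaMemcpy(&h, d, sizeof(int), cudaMemcpyDeviceToHost);   // this copy also synchronises
    cudaFree(d);
    return err == cudaSuccess ? 0 : 1;
}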

💻 Examples

Hello World

CPU Version

#include <stdio.h>
int main() {
    printf("Hello World from CPU!\n");
}

GPU Version

#include <stdio.h>

__global__ void helloFromGPU() {
    printf("Hello World from GPU!\n");
}

int main() {
    printf("Hello World from CPU!\n");
    helloFromGPU<<<1, 1>>>();   // asynchronous launch: 1 block of 1 thread
    cudaDeviceReset();          // tears down the device and flushes the GPU's printf buffer
    return 0;
}
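
The GPU version is compiled with Nvidia's nvcc compiler, e.g. (assuming the file is saved as hello.cu):

nvcc hello.cu -o hello
./hello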

Vector Scalar Multiplication (CUDA)

#include <iostream>

__global__ void scalar_mul(double *v, double a, const int N) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (tid < N)                                       // guard against excess threads
        v[tid] = a * v[tid];
}

int main() {
    const int N = 256;
    double *v = new double[N];
    for (int i = 0; i < N; i++) v[i] = i;              // initialise host data

    double *v_d;
    cudaMalloc((void**)&v_d, sizeof(double) * N);                    // allocate device memory
    cudaMemcpy(v_d, v, sizeof(double) * N, cudaMemcpyHostToDevice);  // host -> device

    scalar_mul<<<1, 256>>>(v_d, 5.0, N);                             // 1 block of 256 threads

    cudaMemcpy(v, v_d, sizeof(double) * N, cudaMemcpyDeviceToHost);  // device -> host
    for (int i = 0; i < N; i++) std::cout << v[i] << ",\n";

    cudaFree(v_d);
    delete[] v;
    return 0;
}
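
Note that the launch configuration <<<1, 256>>> covers every element only because N happens to equal 256. For a general N, one would round the grid size up and rely on the bounds check in the kernel, e.g.:

int block = 256;
int grid = (N + block - 1) / block;    // enough blocks to cover all N elements
scalar_mul<<<grid, block>>>(v_d, 5.0, N);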

This post is licensed under CC BY 4.0 by the author.