
GPU course


2025-01-27

Heterogeneous Computing

  • Host Memory -- Main Memory (RAM)
  • Device Memory

    • The memory the GPU uses
  • Simple Processing Flow (see the sketch after this list)

    1. Copy input data from CPU memory (host) to GPU memory (device)
    2. Load the GPU program (the program that will use the GPU) and execute it, caching data on chip for performance
    3. Copy the results from GPU memory (device) back to CPU memory (host)
  • Throughput Oriented

    • CUDA core based (parallelism for heterogeneous computing)
      • CUDA, OpenSSE, PyCUDA
    • Tensor Core based
      • Google's TPU (Tensor Processing Unit)
      • These cores are specialised for the matrix operations used in Artificial Intelligence, Machine Learning and Deep Learning applications.
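
A minimal sketch of the three-step processing flow above, using a trivial kernel that adds two integers on the device (the file name, the kernel, the d_ names and the single-thread launch are illustrative, not from the notes):

add_single.cu --> CUDA
---
#include <stdio.h>

// runs on the device: adds two integers held in device memory
__global__ void add(int *a, int *b, int *c) {
    *c = *a + *b;
}

int main(void) {
    int a = 2, b = 7, c = 0;          // host copies
    int *d_a, *d_b, *d_c;             // device pointers
    int size = sizeof(int);

    cudaMalloc((void **)&d_a, size);
    cudaMalloc((void **)&d_b, size);
    cudaMalloc((void **)&d_c, size);

    // 1. copy input data from host memory to device memory
    cudaMemcpy(d_a, &a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, &b, size, cudaMemcpyHostToDevice);

    // 2. load the GPU program (kernel) and execute it
    add<<<1, 1>>>(d_a, d_b, d_c);

    // 3. copy the result back from device memory to host memory
    cudaMemcpy(&c, d_c, size, cudaMemcpyDeviceToHost);

    printf("%d + %d = %d\n", a, b, c);

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}
---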

PCIe is the legacy peripheral connection; today NVIDIA uses NVLink. We use the scp command to copy our data from the host machine to the device (PARAM).

hello.cu --> CUDA
---
#include <stdio.h>

int main(void) {
    printf("Hello!\n");
    return 0;
}
---
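
To build and run this (assuming nvcc is installed and on the PATH): nvcc hello.cu -o hello, then ./hello.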
  • The NVIDIA compiler (nvcc) can also be used to compile programs with no device code

  • Triple angle brackets <<<...>>> mark a call from host code to device code (a kernel launch)

  • Host Pointers --> point to CPU memory
    • May be dereferenced in host code only
  • Device Pointers --> point to GPU memory

    • May be dereferenced in device code only
  • GPU Computing is about massive parallelism

    • add <<<1,1>>>() -- executes the kernel once, on a single thread
    • add <<<N,1>>>() -- executes the kernel N times in parallel
      • N blocks with 1 thread per block
      • The set of blocks is referred to as the grid
      • Each invocation can refer to its block index using blockIdx.x
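
A sketch of the block-parallel vector addition this describes; the kernel name add and the d_ pointers are illustrative:

---
// each of the N blocks (one thread each) computes one element, selected by blockIdx.x
__global__ void add(int *a, int *b, int *c) {
    c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
}

// launched from the host as: add<<<N, 1>>>(d_a, d_b, d_c);
---
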
CUDA Threads
  • A block can be split into parallel threads
  • threadIdx.x gives the index of the current thread within its block.
  • add <<<1,N>>>() -- 1 block with N parallel threads
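
The same addition written with threads instead of blocks (again an illustrative sketch):

---
// a single block of N threads; each thread picks its element with threadIdx.x
__global__ void add(int *a, int *b, int *c) {
    c[threadIdx.x] = a[threadIdx.x] + b[threadIdx.x];
}

// launched from the host as: add<<<1, N>>>(d_a, d_b, d_c);
---
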
Shared Memory
  • add <<<M,N>>>()
  • blockIdx.x , threadIdx.x
  • With M blocks of N threads each, a unique global index is given by \(index = threadIdx.x + blockIdx.x \times blockDim.x\)
  • blockDim.x is a built-in variable equal to the number of threads per block (N here)
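
A sketch combining blocks and threads with that index (illustrative names):

---
// global index built from the block index and the thread index within the block
__global__ void add(int *a, int *b, int *c) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    c[index] = a[index] + b[index];
}

// launched from the host as: add<<<M, N>>>(d_a, d_b, d_c);   // M blocks of N threads each
---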

2025-01-28

  • A block consists of warps; each warp has 32 threads.
  • Handling arbitrary vector sizes
    • Avoid accessing beyond the end of the arrays
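
A sketch of the bounds-checked kernel and a launch that rounds the block count up; THREADS_PER_BLOCK and n are illustrative names:

---
#define THREADS_PER_BLOCK 256

// n may not be a multiple of the block size, so guard against running past the end
__global__ void add(int *a, int *b, int *c, int n) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index < n)
        c[index] = a[index] + b[index];
}

// launched with enough blocks to cover all n elements:
// add<<<(n + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK, THREADS_PER_BLOCK>>>(d_a, d_b, d_c, n);
---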

Threads
  • Within a block, threads can share data via shared memory.
  • The __shared__ qualifier declares a variable in fast on-chip shared memory, visible to all threads of the current block.
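
A sketch of the classic 1D stencil that uses __shared__ memory together with __syncthreads(); BLOCK_SIZE, RADIUS and the assumption that the input array is padded by RADIUS elements at both ends are illustrative:

---
#define BLOCK_SIZE 256
#define RADIUS 3

__global__ void stencil_1d(int *in, int *out) {
    __shared__ int temp[BLOCK_SIZE + 2 * RADIUS];          // on-chip, shared by the whole block
    int gindex = threadIdx.x + blockIdx.x * blockDim.x;    // assumes blockDim.x == BLOCK_SIZE
    int lindex = threadIdx.x + RADIUS;

    // each thread loads one element; the first RADIUS threads also load the halo
    temp[lindex] = in[gindex];
    if (threadIdx.x < RADIUS) {
        temp[lindex - RADIUS] = in[gindex - RADIUS];
        temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
    }

    __syncthreads();   // wait until every thread's load is visible to the block

    // each output element is the sum of the 2*RADIUS+1 neighbouring inputs
    int result = 0;
    for (int offset = -RADIUS; offset <= RADIUS; offset++)
        result += temp[lindex + offset];

    out[gindex] = result;
}
---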

Synchronous v/s Asynchronous
  • Kernel launches are asynchronous
    • Control returns to the CPU immediately
    • The CPU needs to synchronise before consuming the results

  • cudaMemcpy() - Blocks the CPU until the copy is complete
  • cudaMemcpyAsync() - Asynchronous, does not block the CPU
  • cudaDeviceSynchronize() - Blocks the CPU until all previously issued CUDA calls have completed
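
A sketch of an asynchronous copy on a CUDA stream; the square kernel, the pinned buffer and the stream are illustrative:

---
#include <stdio.h>

__global__ void square(float *x, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) x[i] = x[i] * x[i];
}

int main(void) {
    const int n = 1 << 20;
    const size_t size = n * sizeof(float);

    float *h, *d;
    cudaMallocHost(&h, size);          // pinned host memory, needed for truly asynchronous copies
    cudaMalloc(&d, size);
    for (int i = 0; i < n; i++) h[i] = (float)i;

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    cudaMemcpyAsync(d, h, size, cudaMemcpyHostToDevice, stream);   // returns immediately
    square<<<(n + 255) / 256, 256, 0, stream>>>(d, n);             // kernel launch is asynchronous too
    cudaMemcpyAsync(h, d, size, cudaMemcpyDeviceToHost, stream);

    cudaStreamSynchronize(stream);     // the CPU waits here before consuming the results
    printf("h[3] = %f\n", h[3]);

    cudaStreamDestroy(stream);
    cudaFree(d);
    cudaFreeHost(h);
    return 0;
}
---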

  • All CUDA API calls return an error code

    • Error either in API call or in an earlier async operation
  • Get error code -> cudaError_t cudaGetLastError(void)
  • Get a string to describe the error -> __host__ __device__ const char* cudaGetErrorString(cudaError_t error)
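
A sketch (a fragment of host code, not a complete program) of checking for errors after a kernel launch; the add kernel and its arguments are illustrative:

---
// a kernel launch itself returns no error code, so query it afterwards
add<<<blocks, threads>>>(d_a, d_b, d_c, n);

cudaError_t err = cudaGetLastError();          // errors from the launch (e.g. a bad configuration)
if (err != cudaSuccess)
    printf("launch failed: %s\n", cudaGetErrorString(err));

err = cudaDeviceSynchronize();                 // errors from the kernel's execution surface here
if (err != cudaSuccess)
    printf("kernel failed: %s\n", cudaGetErrorString(err));
---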

  • The command nvidia-smi provides monitoring and management capabilities for each of NVIDIA's Tesla, Quadro, GRID and GeForce devices from the Fermi and higher architecture families.

  • __syncthreads()

    - synchronises all threads within a block (used in the stencil sketch above)

Compute Capability
  • Describes the device architecture: number of registers, memory sizes, features and capabilities (see the query sketch after this list)

  • Transparent Scalability -- the same grid of blocks runs correctly on GPUs with different numbers of SMs, because blocks can be scheduled in any order

  • The number and design of streaming multiprocessors (SMs) on a GPU depend on the micro-architecture it belongs to.
  • NVCC
    • Host C++ code is handed to the host compiler (GCC)
    • Device code is compiled to virtual PTX (Parallel Thread Execution); a target compiler then turns PTX into GPU instructions
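
A sketch of querying the compute capability and some of these per-device limits with cudaGetDeviceProperties:

---
#include <stdio.h>

int main(void) {
    int count = 0;
    cudaGetDeviceCount(&count);

    for (int dev = 0; dev < count; dev++) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);

        printf("device %d: %s\n", dev, prop.name);
        printf("  compute capability    : %d.%d\n", prop.major, prop.minor);
        printf("  multiprocessors (SMs) : %d\n", prop.multiProcessorCount);
        printf("  registers per block   : %d\n", prop.regsPerBlock);
        printf("  shared mem per block  : %zu bytes\n", prop.sharedMemPerBlock);
        printf("  max threads per block : %d\n", prop.maxThreadsPerBlock);
        printf("  max grid size         : %d x %d x %d\n",
               prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    }
    return 0;
}
---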

Ques
  • Pthread issue - synchronisation; it is the programmer's headache to manage thread creation, limits and synchronisation
  • What is the difference between Pthreads and the threads that are created on the GPU?
  • Why use Pthreads / GPU threads?


Tasks
  • Find out the declaratives and keywords specific to NVAA
  • There is a limit on the number of blocks and on the number of threads per block; find these limits
  • Find the number of threads you can create per CPU

2025-02-11