GPU course
Heterogeneous Computing¶
- Host memory: the CPU's main memory (RAM)
- Device memory: the memory the GPU uses
Simple Processing Flow
- Copy input data from CPU memory (host) to GPU memory (device).
- Load the GPU program and execute it, caching data on chip for performance.
- Copy results from GPU memory (device) back to CPU memory (host).
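The three steps above can be sketched in CUDA C++ as follows. This is a minimal illustration, not the course's own code: the kernel `double_elements` and the helper `double_on_gpu` are hypothetical names, error checking is left out, and the single-block launch assumes the array is no longer than the per-block thread limit.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel, for illustration only: each thread doubles one element.
__global__ void double_elements(int *data) {
    data[threadIdx.x] *= 2;
}

// Hypothetical helper: runs the three-step flow on n ints.
// Assumes n does not exceed the per-block thread limit; no error checking.
void double_on_gpu(int *host_data, int n) {
    int *dev_data = nullptr;
    size_t bytes = n * sizeof(int);
    cudaMalloc(&dev_data, bytes);

    // Step 1: copy input data from host memory to device memory.
    cudaMemcpy(dev_data, host_data, bytes, cudaMemcpyHostToDevice);

    // Step 2: run the GPU program (1 block of n threads).
    double_elements<<<1, n>>>(dev_data);

    // Step 3: copy the results back to host memory.
    // cudaMemcpy waits for the preceding kernel to finish.
    cudaMemcpy(host_data, dev_data, bytes, cudaMemcpyDeviceToHost);

    cudaFree(dev_data);
}

int main(void) {
    int values[4] = {1, 2, 3, 4};
    double_on_gpu(values, 4);
    printf("%d %d %d %d\n", values[0], values[1], values[2], values[3]);
    return 0;
}
```

A file like this would typically be compiled with `nvcc` and needs a CUDA-capable GPU to run.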
Throughput Oriented
- CUDA Core (Parallelise for Heterogeneous)
- CUDA, OpenSSE, PyCuda
- Tensor Core based
- Tensor Cores accelerate the matrix multiply-accumulate operations that dominate Artificial Intelligence, Machine Learning and Deep Learning applications.
- Google's TPU (Tensor Processing Unit) is a separate accelerator built around the same kind of matrix operations.
PCI/PCIe is the traditional peripheral connection between host and device. Today Nvidia also uses NVLink.
Use the scp command to copy our data from the host to the device (PARAM).
int main(void) {
    return 0;
}
NVIDIA Compiler (nvcc) can be used to compile programs with no device code
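Triple angle brackets mark a call from host code to device (GPU) code, i.e. a kernel launch.
Host Pointers
- Point to CPU memory
- May be passed to/from device code, but must not be dereferenced there
Device Pointers
- Point to GPU memory
- May be passed to/from host code, but must not be dereferenced there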
GPU Computing is about massive parallelism
add <<<1,1>>>()
add <<<N,1>>>()
- Executes add() N times in parallel: N blocks with 1 thread per block
- The set of blocks is referred to as a grid
- Each invocation can refer to its block index using blockIdx.x
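As a rough sketch of the N-blocks launch, the kernel below adds two vectors with one block per element, using `blockIdx.x` as the array index. The helper name `add_with_blocks` and the sample data are assumptions made for illustration; error checking is omitted.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each block handles one element; blockIdx.x selects which one.
__global__ void add(const int *a, const int *b, int *c) {
    int i = blockIdx.x;
    c[i] = a[i] + b[i];
}

// Hypothetical helper: adds two host arrays of length n on the GPU.
void add_with_blocks(const int *a, const int *b, int *c, int n) {
    int *d_a, *d_b, *d_c;
    size_t bytes = n * sizeof(int);
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, bytes, cudaMemcpyHostToDevice);

    // n blocks, 1 thread per block.
    add<<<n, 1>>>(d_a, d_b, d_c);

    cudaMemcpy(c, d_c, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
}

int main(void) {
    const int n = 8;
    int a[n], b[n], c[n];
    for (int i = 0; i < n; ++i) { a[i] = i; b[i] = 10 * i; }
    add_with_blocks(a, b, c, n);
    for (int i = 0; i < n; ++i) printf("%d + %d = %d\n", a[i], b[i], c[i]);
    return 0;
}
```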
CUDA Threads¶
- A block can be split into parallel threads
- threadIdx.x gives the index of the current thread within its block.
add <<<1,N>>>()
Shared Memory¶
add <<<M,N>>>()
- blockIdx.x , threadIdx.x
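- With N threads per block, as in add<<<M,N>>>(), a unique index for each thread is given by \(\text{index} = \text{threadIdx.x} + \text{blockIdx.x} \times \text{blockDim.x}\)
- blockDim.x is a built-in variable holding the number of threads per block (N in this launch)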
2025-01-28
- A block consists of warps. Each warp has 32 threads.
- Handling arbitrary vector sizes: avoid accessing beyond the end of the arrays.
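One way to combine blocks and threads while staying inside the array is sketched below. The block size of 256 and the helper name `add_on_gpu` are illustrative assumptions; the key points are the `threadIdx.x + blockIdx.x * blockDim.x` index, the `index < n` guard, and rounding the block count up so that every element is covered.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define THREADS_PER_BLOCK 256  // illustrative choice, a multiple of the warp size

__global__ void add(const int *a, const int *b, int *c, int n) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    // Threads past the end of the arrays do nothing.
    if (index < n) c[index] = a[index] + b[index];
}

// Hypothetical helper: adds two host arrays of arbitrary length n.
void add_on_gpu(const int *a, const int *b, int *c, int n) {
    int *d_a, *d_b, *d_c;
    size_t bytes = n * sizeof(int);
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, bytes, cudaMemcpyHostToDevice);

    // Round up so a partial last block still covers the tail of the array.
    int blocks = (n + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK;
    add<<<blocks, THREADS_PER_BLOCK>>>(d_a, d_b, d_c, n);

    cudaMemcpy(c, d_c, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
}

int main(void) {
    const int n = 1000;  // not a multiple of THREADS_PER_BLOCK on purpose
    int a[n], b[n], c[n];
    for (int i = 0; i < n; ++i) { a[i] = i; b[i] = 2 * i; }
    add_on_gpu(a, b, c, n);

    int mismatches = 0;
    for (int i = 0; i < n; ++i) {
        if (c[i] != a[i] + b[i]) ++mismatches;
    }
    printf("mismatches: %d\n", mismatches);
    return 0;
}
```

With n = 1000 and 256 threads per block, 4 blocks (1024 threads) are launched and the last 24 threads are skipped by the guard.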
- Threads share data via shared memory, within a block.
- The __shared__ qualifier places a variable in the ultra-fast on-chip shared memory of the current thread's block.
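A small sketch of threads sharing data within a block: each thread stages one element into a `__shared__` array, the block waits at `__syncthreads()`, and each thread then reads an element written by a different thread. The kernel name `reverse_in_block`, the helper `reverse_on_gpu` and the 64-element limit are assumptions chosen for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define MAX_N 64  // illustrative upper bound on the array length

// Reverses an array of n <= MAX_N ints using one block of n threads.
__global__ void reverse_in_block(int *d, int n) {
    __shared__ int s[MAX_N];   // one copy per block, visible to all its threads
    int t = threadIdx.x;
    s[t] = d[t];               // each thread stages one element
    __syncthreads();           // wait until every thread has written its element
    d[t] = s[n - 1 - t];       // read an element written by another thread
}

// Hypothetical helper: reverses a host array in place (n <= MAX_N).
void reverse_on_gpu(int *host_data, int n) {
    int *dev_data = nullptr;
    size_t bytes = n * sizeof(int);
    cudaMalloc(&dev_data, bytes);
    cudaMemcpy(dev_data, host_data, bytes, cudaMemcpyHostToDevice);
    reverse_in_block<<<1, n>>>(dev_data, n);
    cudaMemcpy(host_data, dev_data, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dev_data);
}

int main(void) {
    int values[5] = {1, 2, 3, 4, 5};
    reverse_on_gpu(values, 5);
    for (int i = 0; i < 5; ++i) printf("%d ", values[i]);
    printf("\n");
    return 0;
}
```

Without the `__syncthreads()` barrier, a thread could read `s[n - 1 - t]` before the thread responsible for that slot has written it.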
Synchronous v/s Asynchronous
- Kernel launches are asynchronous
- Control returns to the CPU immediately
- CPU needs to synchronise before consuming the results.
- cudaMemcpy(): blocks the CPU until the copy is complete
- cudaMemcpyAsync(): asynchronous, does not block the CPU
All CUDA API calls return an error code
- Error either in API call or in an earlier async operation
- Get the error code -> cudaError_t cudaGetLastError(void)
- Get a string to describe the error -> __host__ __device__ const char* cudaGetErrorString(cudaError_t error)
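A hedged sketch of how these calls are commonly combined after a kernel launch: `cudaGetLastError()` picks up launch errors, and `cudaDeviceSynchronize()` waits for the kernel and returns errors raised while it ran. The empty kernel `noop` and the helper `try_launch` are hypothetical; the second launch deliberately asks for far more threads per block than any device allows, so that an error is reported.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical empty kernel, used only to exercise the launch path.
__global__ void noop() {}

// Hypothetical helper: launches noop and returns the first error seen,
// either from the launch itself or from waiting for the kernel to finish.
cudaError_t try_launch(int threads_per_block) {
    noop<<<1, threads_per_block>>>();
    cudaError_t err = cudaGetLastError();  // errors from the launch itself
    if (err != cudaSuccess) return err;
    return cudaDeviceSynchronize();        // errors from the asynchronous execution
}

int main(void) {
    // The second size is far above any per-block thread limit.
    int sizes[2] = {32, 1 << 20};
    for (int i = 0; i < 2; ++i) {
        cudaError_t err = try_launch(sizes[i]);
        printf("%d threads per block: %s\n", sizes[i], cudaGetErrorString(err));
    }
    return 0;
}
```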
command -> nvidia-smi
- provides monitoring and management capabilities for each of NVIDIA's Tesla, Quadro, GRID and GeForce devices from Fermi and higher architecture families.
__syncthreads() - syncs all threads within a block¶
Compute Capability¶
The compute capability of a device describes its architecture (number of registers, memory sizes, features and capabilities).
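The compute capability and the related limits can be read at run time. The sketch below queries device 0 (an assumption; a machine may have several GPUs) with `cudaGetDeviceProperties` and prints a few fields of `cudaDeviceProp`, including the limits on threads per block and blocks per grid.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main(void) {
    cudaDeviceProp prop;
    // Device 0 is assumed here; cudaGetDeviceCount() reports how many exist.
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess) {
        printf("cudaGetDeviceProperties failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("Device:                    %s\n", prop.name);
    printf("Compute capability:        %d.%d\n", prop.major, prop.minor);
    printf("Streaming multiprocessors: %d\n", prop.multiProcessorCount);
    printf("Registers per block:       %d\n", prop.regsPerBlock);
    printf("Shared memory per block:   %zu bytes\n", prop.sharedMemPerBlock);
    printf("Global memory:             %zu bytes\n", prop.totalGlobalMem);
    printf("Warp size:                 %d\n", prop.warpSize);
    printf("Max threads per block:     %d\n", prop.maxThreadsPerBlock);
    printf("Max grid size (x):         %d\n", prop.maxGridSize[0]);
    return 0;
}
```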
Transparent Scalability
- The design of a streaming multiprocessor (SM) on a GPU depends on the micro-architecture to which it belongs.
- Host C++ code goes to the host compiler (e.g. GCC)
- Device code goes to virtual PTX (Parallel Thread Execution) -> PTX to target compiler -> GPU instructions
Ques
- Pthread issue: synchronisation. It is the programmer's headache to manage thread creation, limits and synchronisation.
- What is the difference between Pthreads and the threads that are created on the GPU?
- Why use Pthreads/GPU threads?
- Find out the declaratives and keywords specific to NVAA
- There is a limit on the number of blocks and on the number of threads per block. Find these limits.
- Find the number of threads you can create per CPU
20250211 ->