
GPU course


2025-01-27

Heterogeneous Computing

  • Host Memory -- Main Memory (RAM)
  • Device Memory

    • The memory the GPU uses (on the graphics card)
  • Simple Processing Flow

    1. Copy input data from CPU memory (host) to GPU memory (device)
    2. Load the GPU program (the program that will use the GPU) and execute it, caching data on chip for performance
    3. Copy results from GPU memory (device) back to CPU memory (host) -- see the sketch after this list
  • Throughput Oriented

    • CUDA core based (parallelism for heterogeneous computing)
      • CUDA, OpenACC, PyCuda
    • Tensor Core based
      • Google's TPU (Tensor Processing Unit)
      • These cores accelerate the matrix (tensor) operations used by Artificial Intelligence, Machine Learning, and Deep Learning applications once the dataset has been moved from host memory to device memory.
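A minimal sketch of the three-step processing flow above (the array size and the element-doubling kernel are illustrative choices, not from the course):

---
#include <stdio.h>

// Illustrative kernel: doubles each element, run by a single thread for simplicity.
__global__ void double_elements(int *d_data, int n) {
    for (int i = 0; i < n; i++)
        d_data[i] *= 2;
}

int main(void) {
    const int N = 8;
    int h_data[N];                                  // host memory (RAM)
    for (int i = 0; i < N; i++) h_data[i] = i;

    int *d_data;                                    // device memory (GPU)
    cudaMalloc((void **)&d_data, N * sizeof(int));

    // 1. Copy input data from host memory to device memory
    cudaMemcpy(d_data, h_data, N * sizeof(int), cudaMemcpyHostToDevice);

    // 2. Load the GPU program and execute it
    double_elements<<<1, 1>>>(d_data, N);

    // 3. Copy results from device memory back to host memory
    cudaMemcpy(h_data, d_data, N * sizeof(int), cudaMemcpyDeviceToHost);

    for (int i = 0; i < N; i++) printf("%d ", h_data[i]);
    printf("\n");

    cudaFree(d_data);
    return 0;
}
---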

PCIe is the legacy peripheral connection; today NVIDIA also offers NVLink for faster transfers. The scp command is used to copy our data from the host to the remote device (PARAM).

hello.cu --> CUDA source, compiled with nvcc
---
#include <stdio.h>

int main(void) {
    printf("Hello!\n");
    return 0;
}
---
  • NVIDIA Compiler (nvcc) can be used to compile programs with no device code

  • Triple angle brackets <<<...>>> mark a kernel launch -- a call from the host (CPU) to the device (GPU)

  • Host Pointers --> Point to CPU Memory
    • May be passed to/from host only
  • Device Pointers --> Point to GPU memory

    • May be passed to/from device only
  • GPU Computing is about massive parallelism

    • add <<<1,1>>>() -- 1 block, 1 thread
    • add <<<N,1>>>() -- execute N times in parallel
      • N blocks with 1 thread per block
      • The set of blocks is referred to as the grid
      • Each invocation can refer to its block index using blockIdx.x (see the sketch after this list)
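A sketch of the add<<<N,1>>> pattern as a complete program (the array contents and size are illustrative):

---
#include <stdio.h>

#define N 16   // illustrative vector size

// one block per element: blockIdx.x selects the element
__global__ void add(int *a, int *b, int *c) {
    c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
}

int main(void) {
    int h_a[N], h_b[N], h_c[N];
    for (int i = 0; i < N; i++) { h_a[i] = i; h_b[i] = 2 * i; }

    int *d_a, *d_b, *d_c;                           // device pointers
    cudaMalloc((void **)&d_a, N * sizeof(int));
    cudaMalloc((void **)&d_b, N * sizeof(int));
    cudaMalloc((void **)&d_c, N * sizeof(int));

    cudaMemcpy(d_a, h_a, N * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, N * sizeof(int), cudaMemcpyHostToDevice);

    add<<<N, 1>>>(d_a, d_b, d_c);                   // N blocks, 1 thread per block

    cudaMemcpy(h_c, d_c, N * sizeof(int), cudaMemcpyDeviceToHost);
    for (int i = 0; i < N; i++) printf("%d ", h_c[i]);
    printf("\n");

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}
---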
CUDA Threads
  • A block can be split into parallel threads
  • threadIdx.x gives the index of the current thread within its block
  • add <<<1,N>>>() -- 1 block with N threads (see the sketch below)
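The same addition written with one block of N threads; the host setup from the previous sketch is reused, only the kernel and the launch change:

---
// one thread per element: threadIdx.x selects the element
__global__ void add(int *a, int *b, int *c) {
    c[threadIdx.x] = a[threadIdx.x] + b[threadIdx.x];
}

// launch: add<<<1, N>>>(d_a, d_b, d_c);   // 1 block, N threads
---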
Shared Memory
  • add <<<M,N>>>() -- M blocks with N threads each
  • blockIdx.x , threadIdx.x
  • With M blocks of N threads each, a unique global index is given by \(Index = threadIdx.x + blockIdx.x \times blockDim.x\) (see the sketch below)
  • blockDim.x is a built-in variable holding the number of threads per block (N here)
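A sketch combining blocks and threads with the global index formula above (host setup as in the earlier sketches):

---
// M blocks of N threads each: combine block and thread indices
__global__ void add(int *a, int *b, int *c) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;   // blockDim.x == N
    c[index] = a[index] + b[index];
}

// launch: add<<<M, N>>>(d_a, d_b, d_c);
---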

2025-01-28

  • A block consists of warps; each warp has 32 threads.
  • Handling arbitrary vector sizes
    • Avoid accessing beyond the end of the arrays (see the sketch below)
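A sketch of handling an arbitrary vector size n with a bounds guard (the block size of 256 is an illustrative choice; any multiple of the warp size is typical):

---
// arbitrary vector size n: guard against reading/writing past the end
__global__ void add(int *a, int *b, int *c, int n) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index < n)                       // threads of the last block may fall outside the array
        c[index] = a[index] + b[index];
}

// launch with enough blocks to cover n elements:
// int threadsPerBlock = 256;            // a multiple of the warp size (32)
// add<<<(n + threadsPerBlock - 1) / threadsPerBlock, threadsPerBlock>>>(d_a, d_b, d_c, n);
---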

Threads
  • Threads share data via shared memory, within a block.
  • The __shared__ qualifier places a variable in fast on-chip shared memory, visible to all threads of the current block (see the sketch below).
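A sketch of using __shared__ memory for a per-block sum; the reduction itself is an illustrative choice, not an example from the notes, and __syncthreads() (covered further below) keeps the block's threads in step:

---
#define BLOCK_SIZE 256   // illustrative block size (power of two)

// each block sums BLOCK_SIZE elements of `in` into one entry of `out`
__global__ void block_sum(const int *in, int *out) {
    __shared__ int cache[BLOCK_SIZE];                // on-chip, one copy per block
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    cache[threadIdx.x] = in[i];
    __syncthreads();                                 // wait until all loads are done

    // tree reduction within the block
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            cache[threadIdx.x] += cache[threadIdx.x + stride];
        __syncthreads();                             // wait before the next round
    }
    if (threadIdx.x == 0)
        out[blockIdx.x] = cache[0];
}

// launch with BLOCK_SIZE threads per block, e.g.:
// block_sum<<<n / BLOCK_SIZE, BLOCK_SIZE>>>(d_in, d_out);   // assumes n is a multiple of BLOCK_SIZE
---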

Synchronous vs. Asynchronous
  • Kernel launches are asynchronous
    • Control returns to the CPU immediately
    • The CPU needs to synchronise before consuming the results

  • cudaMemcpy() - blocks the CPU until the copy is complete
  • cudaMemcpyAsync() - returns immediately; the copy runs asynchronously
  • cudaDeviceSynchronize() - blocks the CPU until all previously issued CUDA work has completed (see the sketch below)
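A sketch of the launch-then-synchronise pattern, reusing the names from the earlier sketches (this is a host-code fragment, not a complete program):

---
// kernel launch is asynchronous: control returns to the CPU immediately
add<<<blocks, threads>>>(d_a, d_b, d_c, n);

// blocking copy: waits for the kernel on the same stream, then blocks the CPU until the copy is done
cudaMemcpy(h_c, d_c, n * sizeof(int), cudaMemcpyDeviceToHost);

// with cudaMemcpyAsync() the CPU must synchronise explicitly before reading h_c:
// cudaMemcpyAsync(h_c, d_c, n * sizeof(int), cudaMemcpyDeviceToHost);
// cudaDeviceSynchronize();
---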

  • All CUDA API calls return an error code

    • Error either in API call or in an earlier async operation
  • Get error code -> cudaError_t cudaGetLastError(void)
  • Get a string to describe the error -> __host__ __device__ const char* cudaGetErrorString(cudaError_t error)
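A sketch of checking these error codes; the CUDA_CHECK macro name is my own convention, not part of the CUDA API:

---
#include <stdio.h>
#include <stdlib.h>

// wrap API calls; report and abort on failure (macro name is illustrative)
#define CUDA_CHECK(call)                                                \
    do {                                                                \
        cudaError_t err = (call);                                       \
        if (err != cudaSuccess) {                                       \
            fprintf(stderr, "CUDA error %s at %s:%d\n",                 \
                    cudaGetErrorString(err), __FILE__, __LINE__);       \
            exit(EXIT_FAILURE);                                         \
        }                                                               \
    } while (0)

// usage:
// CUDA_CHECK(cudaMemcpy(d_a, h_a, n * sizeof(int), cudaMemcpyHostToDevice));
// add<<<blocks, threads>>>(d_a, d_b, d_c, n);   // launches return no error code...
// CUDA_CHECK(cudaGetLastError());               // ...so query the last error instead
---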

  • command -> nvidia-smi provides monitoring and management capabilities for each of NVIDIA's Tesla, Quadro, GRID and GeForce devices from the Fermi and higher architecture families.

  • __syncthreads()

    - syncs all threads within a block

Compute Capability
  • Describes the architecture (number of registers, memory sizes, features and capabilities) -- see the device-query sketch below

  • Transparent Scalability

  • The streaming multiprocessor (SM) design of a GPU depends entirely on the micro-architecture to which it belongs.
  • NVCC
    • Host C++ code is handed to GCC
    • Device code is compiled to virtual PTX (Parallel Thread Execution) -> PTX goes to a target compiler -> GPU instructions
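A sketch of querying the compute capability and the per-device limits mentioned above (the selection of fields is mine; all are standard cudaDeviceProp members):

---
#include <stdio.h>

int main(void) {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int dev = 0; dev < count; dev++) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        printf("Device %d: %s\n", dev, prop.name);
        printf("  compute capability : %d.%d\n", prop.major, prop.minor);
        printf("  multiprocessors    : %d\n", prop.multiProcessorCount);
        printf("  warp size          : %d\n", prop.warpSize);
        printf("  max threads/block  : %d\n", prop.maxThreadsPerBlock);
        printf("  max grid size (x)  : %d\n", prop.maxGridSize[0]);
        printf("  registers/block    : %d\n", prop.regsPerBlock);
        printf("  global memory (MB) : %zu\n", prop.totalGlobalMem >> 20);
    }
    return 0;
}
---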

Ques - Pthread issues: synchronisation; it is the programmer's headache to manage thread creation, limits, and synchronisation. What is the difference between Pthreads and the threads created on the GPU? Why use Pthreads / GPU threads?


Tasks
  • Find out the declaratives and keywords specific to NVIDIA CUDA
  • There is a limit on the number of blocks and on the number of threads per block; find these limits
  • Find the number of threads you can create per CPU

2025-03-25

  • Ray tracing - Chapter 6 of the CUDA by Example book
  • Bind / unbind - texture memory
  • Constant memory - usage, advantages, disadvantages, when to use and when not to
  • Streams - from the Shane Cook book
  • Remaining - PyCuda prefix sum; introduction to Docker containers


2025-04-01

  • The PyCuda Module
  • OpenMP overcomes the drawback of raw threads, where the programmer is responsible for everything, which is risky.
    • The OpenMP library is a layer on top of threads that abstracts away the intricacies of parallelism, so programmers can get their work done without worrying about them.
  • PyCuda Workflow
    • Edit -> Run -> SourceModule("...") -> Cache?
      • Cache? No -> nvcc -> .cubin -> update the cache -> upload to GPU -> run on GPU
      • Cache? Yes -> upload to GPU -> run on GPU

  • Data dependency - Exercise 1: SAXPY (see the sketch below)
  • Pointer aliasing in C - different pointers are allowed to access the same object; this can induce an implicit data dependency in a loop
  • Laplace solver
  • Data clauses - data copyout transfer
  • Relative performance -> speedup
  • Specify the reduction operator and __ explicitly
  • NVIDIA GPU (CUDA) task granularity - OpenACC task granularity - (45-page ppt)
  • Docker theory question
  • Data in, data out, Laplace, data movement in OpenACC
  • Dual warp scheduler -> Hennessy & Patterson, Chapter 4
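A sketch of SAXPY as a CUDA kernel; the __restrict__ qualifiers promise the compiler that the pointers do not alias, which is one way to address the aliasing issue noted above (the OpenACC versions from the exercise are not reproduced here):

---
// y = a*x + y, with a bounds guard for arbitrary n
__global__ void saxpy(int n, float a,
                      const float * __restrict__ x,
                      float * __restrict__ y) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}

// launch: saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, d_x, d_y);
---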


Paper Pattern - 5 * 12 (2x6) each -


Case study - compute capability, warp size, blocks, threads, etc.; everything. One case study on material sir will send, related to the GPU. NVIDIA DGX Workstation with


Recommendations
  • Shane Cook --> Read the "Streams" chapter