What is a GPU?
A graphics processing unit (GPU) is an electronic circuit that processes images and accelerates the rendering of 3D computer graphics. GPUs are designed to operate on large blocks of data simultaneously: they are optimized for parallel work, distributing a workload across many cores and performing many computations at the same time. This makes them well suited to areas such as machine learning, gaming, and cryptocurrency mining.
GPUs are used in a variety of applications, a few of which are outlined below.
Gaming
GPUs are used in gaming consoles and other devices to render 2D and 3D graphics, including textures, effects, and video.
Video editing
GPUs can process many pieces of data simultaneously, making them useful for video editing.
Machine learning
GPUs are used in machine learning and artificial intelligence (AI) because they can perform complex calculations quickly and efficiently.
Virtual reality
GPUs are used in virtual reality (VR) applications.
Cryptocurrency mining
GPUs are used in cryptocurrency mining to validate transactions within a blockchain network and earn crypto rewards. This is achieved by having GPUs perform the intensive mathematical calculations the network requires.
Difference between CPU and GPU
CPU and GPU architectures differ significantly, and image 1 depicts their different approaches to handling processes and data. A CPU (central processing unit) consists of an arithmetic logic unit (ALU), a control unit, and a cache, interfaced with DRAM and I/O. In a multi-core CPU, the ALUs, control units, and small caches are replicated across cores to support concurrency.
In contrast, a GPU consists of a large number of small ALUs with shared control units and caches, as shown in image 1. A GPU typically contains thousands of ALUs that perform the same operation on different parts of the data, reducing the total time needed to complete a task.
Below are a few key differences between a CPU and a GPU.
| CPU | GPU |
| --- | --- |
| 4-8 cores | Hundreds to a few thousand cores |
| Low latency | High throughput |
| Good for serial processing | Good for parallel processing |
| Quickly processes tasks that require interactivity | Breaks jobs into separate tasks that run simultaneously |
| Traditional programs are written for sequential execution on the CPU | Requires additional software to convert CPU functions into GPU functions for parallel execution |
Introduction to CUDA
Now that we've learned what GPUs are and where they are employed, let's try to understand how to program GPUs to leverage their full potential. CUDA (Compute Unified Device Architecture) is a parallel computing platform and application programming interface (API) model created by Nvidia.
NVIDIA released the first version of CUDA in November 2006, and it came with a software environment that allowed you to use C as a high-level programming language. Today CUDA supports C, C++, Fortran, Python, and MATLAB, and developers can express parallelism through extensions in the form of a few basic keywords.
Hello World Using CUDA
Before we proceed further, let me introduce two keywords widely used in CUDA programming model: host and device.
The host is the CPU available in the system, and the system memory associated with the CPU is called host memory. The GPU is called the device, and GPU memory is likewise called device memory.
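As a quick check of this terminology in practice, the host can ask the CUDA runtime which devices are present. The short sketch below is not part of the article's examples; it only uses the standard runtime calls cudaGetDeviceCount and cudaGetDeviceProperties.
#include "cuda_runtime.h"
#include <stdio.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);              // how many CUDA-capable devices the host can see
    printf("CUDA devices found: %d\n", count);

    cudaDeviceProp prop;
    if (count > 0 && cudaGetDeviceProperties(&prop, 0) == cudaSuccess) {
        // name and compute capability of device 0
        printf("Device 0: %s (compute capability %d.%d)\n", prop.name, prop.major, prop.minor);
    }
    return 0;
}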
To execute any CUDA program, there are three main steps:
Copy the input data from host memory to device memory, also known as host-to-device transfer.
Load the GPU program and execute, caching data on-chip for performance.
Copy the results from device memory to host memory, also called device-to-host transfer.
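A minimal sketch of these three steps in code might look like the following; the kernel scale_by_two is purely illustrative and not one of the article's examples.
#include "cuda_runtime.h"
#include <stdio.h>

// Illustrative kernel: each thread doubles one element of the array.
__global__ void scale_by_two(int* data) {
    data[threadIdx.x] *= 2;
}

int main() {
    const int N = 8;
    int host_data[N] = { 1, 2, 3, 4, 5, 6, 7, 8 };
    int* device_data = nullptr;
    cudaMalloc(&device_data, N * sizeof(int));

    // Step 1: host-to-device transfer
    cudaMemcpy(device_data, host_data, N * sizeof(int), cudaMemcpyHostToDevice);

    // Step 2: load and execute the GPU program (1 block of N threads)
    scale_by_two<<<1, N>>>(device_data);

    // Step 3: device-to-host transfer
    cudaMemcpy(host_data, device_data, N * sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(device_data);
    for (int i = 0; i < N; ++i) printf("%d ", host_data[i]);  // prints 2 4 6 ... 16
    printf("\n");
    return 0;
}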
Since this is our first CUDA example, we'll keep it simple. I'm using the Visual Studio 2019 C++ compiler and CUDA Toolkit 11.8; detailed installation steps are available in Nvidia's official documentation.
// Example 1
// Filename: kernel.cu
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>

/** CUDA kernel device code - CUDA Sample Codes
 * Sample code to print hello world.
 */
__global__ void hello_cuda() {
    printf("Hello Cuda World\n");
}

int main() {
    dim3 grid(2);   // equivalent to (2, 1, 1), i.e. grid.x = 2, grid.y = 1, grid.z = 1
    dim3 block(10); // 10 threads per block
    // launches a grid of 2 blocks with 10 threads each, i.e. 20 threads in total
    hello_cuda<<<grid, block>>>();
    // make the host wait until all device threads have finished
    cudaDeviceSynchronize();
    return 0;
}
Output
./hello_cuda.exe
Hello Cuda World
Hello Cuda World
Hello Cuda World
Hello Cuda World
Hello Cuda World
Hello Cuda World
Hello Cuda World
Hello Cuda World
Hello Cuda World
Hello Cuda World
Hello Cuda World
Hello Cuda World
Hello Cuda World
Hello Cuda World
Hello Cuda World
Hello Cuda World
Hello Cuda World
Hello Cuda World
Hello Cuda World
Hello Cuda World
Let's try to understand what's going on in the kernel.cu file from example 1. In the example above we've included certain header files; a brief description of these CUDA interface files, along with a few other important ones, is given below.
cuda.h defines the public host functions and types for the CUDA driver API.
cuda_runtime_api.h defines the public host functions and types for the CUDA runtime API.
cuda_runtime.h defines everything cuda_runtime_api.h does, as well as built-in type definitions and function overlays for the CUDA language extensions and device intrinsic functions.
device_launch_parameters.h contains device launch parameters like grid and block.
Next we've created the function hello_cuda(), which could perform some computation on the GPU but in our example simply prints a hello message. To turn this into a GPU-compatible function, we put __global__ in front of it to mark it as a CUDA kernel function. To execute the kernel on a CUDA-compatible GPU, we call the function with the special language extension <<<...>>> specifying the number of threads. In our example we've specified 20 threads, and these threads are organized into blocks within a grid, each of which can have up to three dimensions. Refer to image 2 for the memory and thread hierarchy.
__global__ void hello_cuda();
// launches a grid of 2 blocks with 10 threads each, i.e. 20 threads in total
hello_cuda<<<grid, block>>>();
A kernel is defined using the __global__ declaration specifier, and the number of CUDA threads that execute that kernel for a given call is specified using the <<<...>>> execution configuration syntax.
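To make the thread hierarchy concrete, a small variation of the kernel (hypothetical, not part of example 1) can report which block and thread it runs in, using the built-in blockIdx, blockDim, and threadIdx variables.
__global__ void hello_cuda_with_ids() {
    // global index = block index * threads per block + thread index within the block
    int global_id = blockIdx.x * blockDim.x + threadIdx.x;
    printf("Hello from block %d, thread %d (global id %d)\n",
           blockIdx.x, threadIdx.x, global_id);
}

// launched exactly like before: hello_cuda_with_ids<<<grid, block>>>();
With grid(2) and block(10), this prints 20 lines, one per thread, with global ids 0 through 19.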
Finally we use cudaDeviceSynchronize() to make the host wait until all threads on the device have finished. Once we compile the CUDA source file kernel.cu, either by calling Nvidia's CUDA compiler nvcc with appropriate flags or by building the MSVC solution directly, we can execute the resulting binary, which runs the kernel function on the device and prints 'Hello Cuda World' from each thread on the GPU.
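For example, assuming nvcc from the toolkit is on the PATH (exact flags and paths will vary with your setup), building and running from the command line looks roughly like this:
nvcc kernel.cu -o hello_cuda
./hello_cuda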
Hello Cuda World
Hello Cuda World
...
Similarly, if we want to add two large arrays and store the result in a third array, the kernel definition and kernel launch would be as shown in example 2 below.
// Example 2
// Kernel definition
__global__ void VecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}

int main()
{
    ...
    // Kernel invocation with N threads
    VecAdd<<<1, N>>>(A, B, C);
    ...
}
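The "..." in example 2 stands for host-side setup that the snippet leaves out. A possible completion, offered as a hedged sketch rather than the exact listing from Nvidia's guide, allocates the arrays, copies them to the device, launches the kernel, and copies the result back:
#include "cuda_runtime.h"
#include <stdio.h>
#include <stdlib.h>

__global__ void VecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}

int main()
{
    const int N = 256;                 // one block of N threads
    size_t size = N * sizeof(float);

    // host arrays
    float* h_A = (float*)malloc(size);
    float* h_B = (float*)malloc(size);
    float* h_C = (float*)malloc(size);
    for (int i = 0; i < N; ++i) { h_A[i] = (float)i; h_B[i] = 2.0f * i; }

    // device arrays
    float *d_A, *d_B, *d_C;
    cudaMalloc(&d_A, size);
    cudaMalloc(&d_B, size);
    cudaMalloc(&d_C, size);

    // host-to-device transfer
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    // Kernel invocation with N threads
    VecAdd<<<1, N>>>(d_A, d_B, d_C);

    // device-to-host transfer
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
    printf("C[10] = %f\n", h_C[10]);   // expected 30.0

    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    free(h_A); free(h_B); free(h_C);
    return 0;
}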
In future articles we will go deeper into CUDA programming to understand how to leverage GPUs to optimize lengthy, compute-intensive tasks.
Summary
GPUs give developers very powerful capabilities that, a few decades ago, were accessible only to a handful of specialists in prestigious institutes and labs. With Nvidia's CUDA compute platform and API, it has become much easier for developers to write highly parallel programs, whether as part of machine learning algorithms or commercial games. With a simple language extension, we have learned how to write GPU-compatible code.