Starting GPU computing with CUDA
With recent advances in AI, there’s a lot of talk about how critical GPU computation is and why NVIDIA’s CUDA platform matters so much for training models. Coming from a web development background, I saw it as a distant and complex field, reserved for specialists with deep knowledge of hardware and parallel programming libraries. Still, I got curious about what makes this technology so important. When I tried CUDA, it turned out to have a very elegant API and was quite easy to get started with.
I started exploring GPU programming based on recommendations in this Reddit thread and picked up the book Professional CUDA C Programming (2014). While it’s quite old, the fundamentals still hold and it’s a great hands-on guide for beginners. Another widely recommended guide to GPU computing and CUDA is Programming Massively Parallel Processors: A Hands-on Approach, which is an interesting book to follow up with.
Professional CUDA C Programming starts with an introduction to heterogeneous computing. This is an approach to designing software that splits a program so that sequential logic runs on the CPU while parallelizable tasks run on the GPU. CUDA implements this approach elegantly. You write both CPU code (in C) and GPU code (in CUDA C, an extension of C) in the same file. The CUDA compiler then separates these parts, invokes the appropriate compiler for each part, and links everything together with the necessary CUDA libraries to enable execution on the GPU. Let’s look at an example program to get a feel for GPU computing.
To mark a piece of code for GPU execution, CUDA has a special keyword __global__. By prepending this to a function, we define it as a kernel - a function that runs on the GPU. For example: __global__ void helloGPU (void).
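We can then call this kernel with helloGPU <<<1, 10>>>() from the main flow executed on the CPU. The numbers inside the <<< >>> brackets are the launch configuration: the first is the number of blocks (a block is a group of threads on the GPU) and the second is the number of threads per block. Here we launch 1 block of 10 threads.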
Here is what a basic “Hello World” GPU program looks like in CUDA C:
#include <stdio.h>

// GPU kernel function that prints a message
__global__ void helloGPU (void)
{
    // threadIdx.x is a built-in variable identifying the specific thread
    printf("Hello from GPU! (thread %d)\n", threadIdx.x);
}

int main(void)
{
    printf("Hello from CPU!\n");

    // Launch kernel with 10 threads
    helloGPU <<<1, 10>>>();

    // Wait for the GPU to finish before the program exits
    cudaDeviceSynchronize();

    return 0;
}
We can easily run this code on Google Colab, which provides free NVIDIA T4 runtimes. To execute the program above in a Colab notebook, we first need to write it to a file by adding this magic command as the first line of the code cell:
%%writefile hello.cpp
We are saving the file with a .cpp extension. This is a neat trick to enable C++ syntax highlighting for our CUDA C code inside the Colab notebook. Once we run this cell, the hello.cpp file is created in our runtime, which we can access via Colab’s built-in terminal.
To compile it, we run the following command in the terminal:
nvcc -x cu -arch=sm_75 hello.cpp -o hello
Let’s break down what these flags do:
- -x cu: forces the compiler to treat hello.cpp as CUDA code. nvcc normally picks the language from the file extension, so without this flag a .cpp file would be compiled as plain C++ instead of being handled like a .cu file.
- -arch=sm_75: tells the compiler which GPU hardware capability level to target, in this case Compute Capability 7.5 (matching our NVIDIA T4 GPU). Without this, the compiler targets an older compute capability and displays a deprecation warning.
- -o hello: names our final executable file “hello”.
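If you are not sure which value to pass to -arch, one option is to ask the CUDA runtime what the GPU reports. Below is a minimal sketch, not taken from the book: the CHECK macro and the archFlag helper are hypothetical names introduced here, and the code assumes the GPU we care about is device 0. It also checks the result of the kernel launch, because a kernel compiled for an architecture the GPU cannot run will typically fail without printing anything unless we look at the error codes.

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Hypothetical helper macro (not part of CUDA): print a readable message
// and exit if a CUDA runtime call returns anything other than cudaSuccess.
#define CHECK(call)                                                  \
    do {                                                             \
        cudaError_t err_ = (call);                                   \
        if (err_ != cudaSuccess) {                                   \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",             \
                    __FILE__, __LINE__, cudaGetErrorString(err_));   \
            exit(EXIT_FAILURE);                                      \
        }                                                            \
    } while (0)

// Hypothetical helper: format a compute capability as an -arch value,
// e.g. major 7 and minor 5 become "sm_75".
static void archFlag(int major, int minor, char *buf, size_t size)
{
    snprintf(buf, size, "sm_%d%d", major, minor);
}

// Same kernel as in the Hello World program
__global__ void helloGPU(void)
{
    printf("Hello from GPU! (thread %d)\n", threadIdx.x);
}

int main(void)
{
    int device = 0;  // assumption: the first visible GPU is the one we want

    // Ask the runtime for the properties of that GPU
    cudaDeviceProp prop;
    CHECK(cudaGetDeviceProperties(&prop, device));

    char flag[16];
    archFlag(prop.major, prop.minor, flag, sizeof flag);
    printf("%s has compute capability %d.%d, so -arch=%s should match\n",
           prop.name, prop.major, prop.minor, flag);

    helloGPU<<<1, 10>>>();
    CHECK(cudaGetLastError());       // reports errors from the launch itself
    CHECK(cudaDeviceSynchronize());  // reports errors raised while the kernel ran

    return 0;
}
```

It can be compiled the same way as the Hello World program, for example with nvcc -x cu -arch=sm_75 check.cpp -o check, where check.cpp is just a placeholder file name.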
Executing our program with ./hello gives us the following output:
/content# ./hello
Hello from CPU!
Hello from GPU! (thread 0)
Hello from GPU! (thread 1)
Hello from GPU! (thread 2)
Hello from GPU! (thread 3)
Hello from GPU! (thread 4)
Hello from GPU! (thread 5)
Hello from GPU! (thread 6)
Hello from GPU! (thread 7)
Hello from GPU! (thread 8)
Hello from GPU! (thread 9)
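All ten thread numbers above come from a single block, so threadIdx.x alone is enough to tell the threads apart. With more than one block, threadIdx.x restarts at 0 in each block, and the usual way to get a unique number is to combine it with the built-in blockIdx.x and blockDim.x variables. Here is a minimal sketch of that idea. The kernel name whoAmI and the helper globalIndex are hypothetical names made up for this illustration, and the 2 blocks of 5 threads are an arbitrary choice.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Hypothetical helper: flatten (block, thread) into one global index.
// __host__ __device__ asks the compiler to build this function for both
// the CPU and the GPU, so it can be called from either side.
__host__ __device__ int globalIndex(int block, int threadsPerBlock, int thread)
{
    return block * threadsPerBlock + thread;
}

// Each thread prints its block, its index within the block,
// and the resulting global index.
__global__ void whoAmI(void)
{
    int block = (int)blockIdx.x;            // which block this thread is in
    int thread = (int)threadIdx.x;          // index within the block
    int threadsPerBlock = (int)blockDim.x;  // block size from the launch
    printf("block %d, thread %d -> global index %d\n",
           block, thread, globalIndex(block, threadsPerBlock, thread));
}

int main(void)
{
    // 2 blocks of 5 threads each: 10 threads in total
    whoAmI<<<2, 5>>>();

    // Wait for the GPU and report a failure if there was one
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    return 0;
}
```

Each of the ten threads should print a different global index between 0 and 9. The order in which the two blocks print is not guaranteed, so the lines may not appear sorted.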
And here is how it looks in the actual Colab interface:

While CUDA syntax looks quite elegant, there are newer approaches that go further. For example, the Mojo programming language allows CPU and GPU code to be written in the same language and provides memory safety with a borrow checker.
Coming next:
- Summing up vectors with CUDA (Chapter 2 of Professional CUDA C Programming)
- Summing up vectors with Metal (Apple’s GPU framework)
- Summing up vectors with the Mojo programming language
8 Apr 2026: Post partially rewritten for clarity and to include a Google Colab example.