CUDA/Best Practice
Introduction
CUDA C programming involves running code on two different platforms concurrently: a host system with one or more CPUs and one or more devices (frequently graphics adapter cards) with CUDA-enabled NVIDIA GPUs.
While NVIDIA devices are frequently associated with rendering graphics, they are also powerful arithmetic engines capable of running thousands of lightweight threads in parallel. This capability makes them well suited to computations that can be expressed as many independent operations executing in parallel.
However, the device is based on a distinctly different design from the host system. To use CUDA effectively, it is important to understand those differences and how they determine the performance of CUDA applications.
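The split between host code and device code can be illustrated with a minimal sketch. The kernel name `addOne`, the helpers `blocksFor` and `check`, the array size, and the block size of 256 are illustrative assumptions, not taken from this guide; only the CUDA runtime calls themselves are standard.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Device code: runs on the GPU, one lightweight thread per array element.
// (addOne is a hypothetical kernel used only for illustration.)
__global__ void addOne(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] += 1.0f;
    }
}

// Host helper: number of blocks needed to cover n elements (rounds up).
static int blocksFor(int n, int threadsPerBlock)
{
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

// Host helper: abort with a message if a runtime call failed.
static void check(cudaError_t err, const char *what)
{
    if (err != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
}

// Host code: runs on the CPU, manages device memory and launches the kernel.
int main()
{
    const int n = 1 << 20;               // assumed problem size
    const int threadsPerBlock = 256;     // assumed block size (a multiple of the warp size)
    const size_t bytes = n * sizeof(float);

    std::vector<float> host(n, 1.0f);
    float *dev = nullptr;

    check(cudaMalloc((void **)&dev, bytes), "cudaMalloc");
    check(cudaMemcpy(dev, host.data(), bytes, cudaMemcpyHostToDevice), "copy to device");

    addOne<<<blocksFor(n, threadsPerBlock), threadsPerBlock>>>(dev, n);
    check(cudaGetLastError(), "kernel launch");

    // This blocking copy also waits for the kernel to finish.
    check(cudaMemcpy(host.data(), dev, bytes, cudaMemcpyDeviceToHost), "copy to host");
    check(cudaFree(dev), "cudaFree");

    for (int i = 0; i < n; ++i) {
        if (host[i] != 2.0f) {
            fprintf(stderr, "unexpected value at index %d\n", i);
            return EXIT_FAILURE;
        }
    }
    printf("all %d elements updated on the device\n", n);
    return EXIT_SUCCESS;
}
```

With these assumed sizes the launch creates roughly a million threads, far more than a CPU would schedule at once. The explicit copies in each direction reflect that host and device each have their own memory.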
Differences Between Host and Device
The primary differences occur in threading and memory access:
- Threading resources.
- Execution pipelines on host systems can support a limited number of concurrent threads. Servers that have four quad-core processors today can run only 16 threads concurrently (32 if the CPUs support Hyper-Threading). By comparison, the smallest executable unit of parallelism on a CUDA device comprises 32 threads (a warp). All NVIDIA GPUs can support at least 768 concurrently active threads per multiprocessor, and some GPUs support 1,024 or more active threads per multiprocessor (see Section G.1 of the CUDA C Programming Guide). On devices that have 30 multiprocessors (such as the NVIDIA® GeForce® GTX 280), this leads to more than 30,000 active threads (30 × 1,024 = 30,720).
- Threads.
- Threads on a CPU are generally heavyweight entities. The operating system must swap threads on and off CPU execution channels to provide multithreading capability. Context switches (when one thread is swapped out for another) are therefore slow and expensive. By comparison, threads on GPUs are extremely lightweight. In a typical system, thousands of threads are queued up for work (in warps of 32 threads each). If the GPU must wait on one warp of threads, it simply begins executing work on another. Because separate registers are allocated to all active threads, no swapping of registers or state need occur between GPU threads. Resources stay allocated to each thread until it completes its execution.
- RAM.
- Both the host system and the device have RAM. On the host system, RAM is generally equally accessible to all code (within the limitations enforced by the operating system). On the device, RAM is divided virtually and physically into different types, each of which has a special purpose and fulfills different needs. The types of device RAM are explained in the CUDA C Programming Guide and in Chapter 3 of this document.
These are the primary hardware differences between CPU hosts and GPU devices with respect to parallel programming. Other differences are discussed as they arise elsewhere in this document.
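The thread counts and memory types described above can be read from a particular device at run time. The following is a minimal sketch using `cudaGetDeviceProperties`; the helper `maxResidentThreads` is a hypothetical name introduced here, and the value it computes is only an upper bound, since registers and shared memory can reduce how many threads are actually resident.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: upper bound on concurrently active threads for a
// device, i.e. multiprocessors x maximum active threads per multiprocessor.
static long long maxResidentThreads(const cudaDeviceProp &prop)
{
    return (long long)prop.multiProcessorCount * prop.maxThreadsPerMultiProcessor;
}

int main()
{
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        err = cudaGetDeviceProperties(&prop, dev);
        if (err != cudaSuccess) {
            fprintf(stderr, "cudaGetDeviceProperties failed: %s\n", cudaGetErrorString(err));
            return 1;
        }

        printf("Device %d: %s\n", dev, prop.name);

        // Threading resources
        printf("  multiprocessors:                %d\n", prop.multiProcessorCount);
        printf("  warp size:                      %d\n", prop.warpSize);
        printf("  max threads per multiprocessor: %d\n", prop.maxThreadsPerMultiProcessor);
        printf("  max active threads (bound):     %lld\n", maxResidentThreads(prop));

        // Some of the distinct device memory types
        printf("  global memory:                  %zu bytes\n", prop.totalGlobalMem);
        printf("  constant memory:                %zu bytes\n", prop.totalConstMem);
        printf("  shared memory per block:        %zu bytes\n", prop.sharedMemPerBlock);
        printf("  32-bit registers per block:     %d\n", prop.regsPerBlock);
    }
    return 0;
}
```

For a device reporting 30 multiprocessors and 1,024 threads per multiprocessor, the bound evaluates to 30,720, matching the figure quoted for the GeForce GTX 280.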