CUDA/Best Practice
This document uses a lot of text from the CUDA C Best Practices Guide
Introduction
CUDA C programming involves running code on two different platforms concurrently: a host system with one or more CPUs and one or more devices (frequently graphics adapter cards) with CUDA-enabled NVIDIA GPUs.
While NVIDIA devices are frequently associated with rendering graphics, they are also powerful arithmetic engines capable of running thousands of lightweight threads in parallel. This capability makes them well suited to computations that can leverage parallel execution well.
However, the device is based on a distinctly different design from the host system, and it’s important to understand those differences and how they determine the performance of CUDA applications to use CUDA effectively.
Differences Between Host and Device
The primary differences occur in threading and memory access:
- Threading resources.
- Execution pipelines on host systems can support a limited number of concurrent threads. Servers that have four quad-core processors today can run only 16 threads concurrently (32 if the CPUs support HyperThreading.) By comparison, the smallest executable unit of parallelism on a CUDA device comprises 32 threads (a warp). All NVIDIA GPUs can support at least 768 concurrently active threads per multiprocessor, and some GPUs support 1,024 or more active threads per multiprocessor (see Section G.1 of the CUDA C Programming Guide). On devices that have 30 multiprocessors (such as the NVIDIA® GeForce® GTX 280), this leads to more than 30,000 active threads.
- Threads.
- Threads on a CPU are generally heavyweight entities. The operating system must swap threads on and off of CPU execution channels to provide multithreading capability. Context switches (when two threads are swapped) are therefore slow and expensive. By comparison, threads on GPUs are extremely lightweight. In a typical system, thousands of threads are queued up for work (in warps of 32 threads each). If the GPU must wait on one warp of threads, it simply begins executing work on another. Because separate registers are allocated to all active threads, no swapping of registers or state need occur between GPU threads. Resources stay allocated to each thread until it completes its execution.
- RAM.
- Both the host system and the device have RAM. On the host system, RAM is generally equally accessible to all code :(within the limitations enforced by the operating system). On the device, RAM is divided virtually and physically into different types, each of which has a special purpose and fulfills different needs. The types of device RAM are explained in the CUDA C Programming Guide and in Chapter 3 of this document.
These are the primary hardware differences between CPU hosts and GPU devices with respect to parallel programming. Other differences are discussed as they arise elsewhere in this document.
What Runs on a CUDA-Enabled Device?
Because of the considerable differences between the host and the device, it’s important to partition applications so that each hardware system is doing the work it does best. The following issues should be considered when determining what parts of an application to run on the device:
- The device is ideally suited for computations that can be run on numerous data elements simultaneously in parallel. This typically involves arithmetic on large data sets (such as matrices) where the same operation can be performed across thousands, if not millions, of elements at the same time. This is a requirement for good performance on CUDA: the software must use a large number of threads. The support for running numerous threads in parallel derives from the CUDA architecture’s use of a lightweight threading model.
- There should be some coherence in memory access by device code. Certain memory access patterns enable the hardware to coalesce groups of reads or writes of multiple data items into one operation. Data that cannot be laid out so as to enable coalescing, or that doesn’t have enough locality to use textures or L1 efficiently, will not enjoy much of a performance benefit when used in computations on CUDA.
- To use CUDA, data values must be transferred from the host to the device along the PCI Express (PCIe) bus. These transfers are costly in terms of performance and should be minimized. (See Section 3.1.) This cost has several ramifications:
- The complexity of operations should justify the cost of moving data to and from the device. Code that transfers data for brief use by a small number of threads will see little or no performance benefit. The ideal scenario is one in which many threads perform a substantial amount of work.
For example, transferring two matrices to the device to perform a matrix addition and then transferring the results back to the host will not realize much performance benefit. The issue here is the number of operations performed per data element transferred. For the preceding procedure, assuming matrices of size N×N, there are N2 operations (additions) and 3N2