CoE Cluster April 2012

{{#img: image=Super-computer-artw.jpg | page=Linux Cluster til Center of Excelence/Beskrivelse til CoE West | width=200px | title=Linux Supercomputer projekt }}

=Assignments=
*[[CoE_Cluster_november_2011/MD5_attack_singlethreaded|MD5 attack singlethreaded]]
*[[/Dell Cluster installation|Dell Cluster Installation]]
*[[CoE_Cluster_november_2011/MPI|Programming with MPI]]
*[[/CUDA|Programming with CUDA]]

=Results=

Having investigated CUDA C programming for Nvidia graphics cards and the CUDA architecture, we made performance measurements across a range of block and thread counts executing in parallel. The test program calculated values of the [http://en.wikipedia.org/wiki/Mandelbrot_set Mandelbrot Set] on a pixel-by-pixel basis. The Mandelbrot Set was a good choice for this test because:

* It is relatively easy to program (see code listing below)
* Every pixel has to be calculated individually - there is no correlation between the values of neighbouring pixels
* The image can be separated into parts which can be calculated separately and in parallel
* The time taken to calculate a complete image without parallelization is long enough for the performance gains from parallelization to be clearly seen
* The resulting images are very pretty (if a little strange)! :-)

The results of the benchmarking are available as an interactive chart (hover your mouse to see which values each line represents, drag to zoom, and use the legend beneath to hide or reveal the curves for each block count):

*[http://mars.tekkom.dk/js/cuda_benchmark.htm CUDA Benchmark] (External JavaScript)

Along with the CUDA C Programming Guide (see Literature List), the [http://developer.download.nvidia.com/compute/cuda/CUDA_Occupancy_calculator.xls CUDA Occupancy Calculator] provides a good basis for understanding and explaining the results seen. In general, it was found that more than 32 blocks and/or 128 threads per block did not provide noticeable performance gains, whereas there were significant performance gains for every step increase below these values. This independent presentation, [[Media:cuda_presentation.ppt]], provides a comprehensive explanation and confirmation of the results we obtained.

[[File:cuda_mandelbrot.png]]

''Above: Mandelbrot Set drawn in 0.12 seconds, using a single Nvidia CUDA-capable graphics card (GeForce GTX 460) running 32 blocks with 128 threads in each block. Total image dimensions 512 x 512 pixels.''
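
The sweep itself is straightforward to reproduce. Below is a minimal sketch of how kernel runs can be timed across block and thread counts using CUDA events, assuming it is compiled together with the Count kernel from the listing below (e.g. with nvcc -rdc=true); the image geometry, iteration count and Mandelbrot window used here are illustrative assumptions, not the exact values we benchmarked.

<source lang=c>
// benchmark_sweep.cu - illustrative sweep over block/thread counts (assumed values)
#include <stdio.h>

// Kernel from the listing below; compile the two files together
__global__ void Count(unsigned char *img, int rows, int cols, int step, int max_iterations,
                      double centre_x, double centre_y, double size, int image_size);

int main(void)
{
    const int rows = 512, cols = 512, step = 3 * cols;   // assumed packed RGB image
    unsigned char *CUDAimg;
    cudaMalloc((void**) &CUDAimg, rows * cols * 3);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    for (int blocks = 1; blocks <= 64; blocks *= 2) {
        for (int threads = 16; threads <= 256; threads *= 2) {
            cudaEventRecord(start, 0);
            Count<<<blocks, threads>>>(CUDAimg, rows, cols, step, 2048, -0.5, 0.0, 3.0, rows);
            cudaEventRecord(stop, 0);
            cudaEventSynchronize(stop);                  // wait for the kernel to finish
            float ms = 0.0f;
            cudaEventElapsedTime(&ms, start, stop);
            printf("%2d blocks x %3d threads: %8.2f ms\n", blocks, threads, ms);
        }
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(CUDAimg);
    return 0;
}
</source>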

For the reasons given above, the Mandelbrot Set was also a good candidate for testing Message Passing Interface (MPI) programming capabilities and strategies. To support graphical output, the final program used a "Master" process which collected calculated values from "Worker" processes and assembled them in an X Window. This was a good illustration of the idea that some tasks, like input-output, have to be run serially, and that perfect parallelization is rarely achievable in practice (cf. [http://en.wikipedia.org/wiki/Amdahl%27s_law Amdahl's Law]).

[[File:Mpi_mandelbrot.png]]

''Above: Mandelbrot Set drawn in 1.2 seconds, using a Master-Worker MPI pattern with 16 worker nodes and 1 master node (which collected calculated results from different parts of the image and output them to an X Window). Total image dimensions 800 x 800 pixels.''
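
As an illustration of the Master-Worker pattern described above, the sketch below divides the image into one strip of rows per worker and lets rank 0 collect the strips. This is only a skeleton under assumed dimensions, message tags and coordinate window - not a listing of the program we actually ran.

<source lang=c>
// mpi_mandelbrot_sketch.c - illustrative Master-Worker skeleton
// run with at least 2 processes, e.g. mpirun -np 17 (1 master + 16 workers)
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define WIDTH    800
#define HEIGHT   800
#define MAX_ITER 1024

// Escape-time count, equivalent to calcMandelbrot in the CUDA listing (counting up)
static int mandel(double x0, double y0)
{
    double x = 0.0, y = 0.0;
    int i = 0;
    while (i < MAX_ITER && x * x + y * y < 4.0) {
        double xt = x * x - y * y + x0;
        y = 2.0 * x * y + y0;
        x = xt;
        i++;
    }
    return i;
}

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    if (nprocs < 2) { MPI_Finalize(); return 1; }   // needs at least one worker

    int workers = nprocs - 1;                 // rank 0 acts as the master
    int rows_per_worker = HEIGHT / workers;   // assumes HEIGHT divides evenly

    if (rank == 0) {
        // Master: receive one strip of iteration counts from each worker,
        // then (in the real program) colour them and draw to an X Window
        int *image = (int *) malloc(sizeof(int) * WIDTH * HEIGHT);
        for (int w = 1; w <= workers; w++)
            MPI_Recv(image + (w - 1) * rows_per_worker * WIDTH,
                     rows_per_worker * WIDTH, MPI_INT, w, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("all strips collected\n");
        free(image);
    } else {
        // Worker: compute the assigned strip and send it to the master
        int *strip = (int *) malloc(sizeof(int) * rows_per_worker * WIDTH);
        int first_row = (rank - 1) * rows_per_worker;
        for (int r = 0; r < rows_per_worker; r++)
            for (int c = 0; c < WIDTH; c++)
                strip[r * WIDTH + c] = mandel(-2.0 + 3.0 * c / WIDTH,
                                              -1.5 + 3.0 * (first_row + r) / HEIGHT);
        MPI_Send(strip, rows_per_worker * WIDTH, MPI_INT, 0, 0, MPI_COMM_WORLD);
        free(strip);
    }

    MPI_Finalize();
    return 0;
}
</source>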

=CUDA Code Listing=

<source lang=c>

// mandelbrot.cu

// Calculate number of iterations required to make each point
// in Mandelbrot Set diverge and colour the corresponding pixel
// Tends to work faster with floats rather than doubles - but 
// at the expense of "colour blocking" at lower resolutions

// Paul Saunders
// Mercantec
// 03/11-2011

//#include <stdio.h>
//#include <stdlib.h>
//#include <unistd.h>

__device__ inline int calcMandelbrot(const double xPos, const double yPos, const int crunch)
{
    double y = yPos;
    double x = xPos;
    double yy = y * y;
    double xx = x * x;
    int i = crunch;

    while (--i && (xx + yy < 4.0)) {
        y = x * y * 2.0 + yPos;
        x = xx - yy + xPos;
        yy = y * y;
        xx = x * x;
    }
    return i;
} // CalcMandelbrot - count down until iterations are used up or until the calculation diverges

__device__ void RGB(int x, int y, unsigned char* m, int step, int iter_count)
{
  unsigned char *p;
  unsigned int rgb;
  p = ((unsigned char *) (m + step*x)+3*y);
  // the colour comes directly from the iteration count (the previous pixel value is not needed)
  rgb = iter_count*2048;
  
  *p = (unsigned char) (rgb&0xff);
  *(p+1) = (unsigned char) ((rgb>>8)&0xff);
  *(p+2) = (unsigned char) ((rgb>>16)&0xff);
  return;
} //Use calculated iteration count to determine the colour for each pixel 

__global__ void Count(unsigned char *img, int rows, int cols, int step, int max_iterations, double centre_x,
double centre_y, double size, int image_size)
{
  double rowfac = ((double) rows)/gridDim.x;
  int rowstart = blockIdx.x*rowfac;
  int rowend = (blockIdx.x+1)*rowfac;
  double colfac = ((double) cols)/blockDim.x;
  int colstart = threadIdx.x*colfac;
  int colend = (threadIdx.x+1)*colfac;
  
  double left_edge = centre_x - size/2.0;
  double top_edge = centre_y - size/2.0;
  double pixel_step = size/image_size;
  
  for (int i=rowstart; i<rowend; i++)
    {
      for (int j=colstart; j<colend; j++)
      {
        RGB(i, j, img, step, calcMandelbrot(left_edge + j * pixel_step, top_edge + i * pixel_step, max_iterations));
      }
    }
}  //Divide calculations between the requested number of blocks and threads, having used the matrix's geometry to
//determine the values input to the calculation for each pixel

void startCUDA(int blocks, int threads, int iterations, double centre_x, double centre_y,
double size, unsigned char* img, int rows, int cols, int step, int image_size)
{
   if (img!=NULL)
   {
      dim3 dimBlock(threads, 1, 1);
      dim3 dimGrid(blocks, 1, 1);
      
      unsigned char *CUDAimg;
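      // Note: allocation assumes a packed image (step == 3*cols); CUDA calls are left unchecked for brevity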
      cudaMalloc((void**) &CUDAimg, rows*cols*3);
      cudaMemcpy(CUDAimg, img, rows*cols*3, cudaMemcpyHostToDevice);
      Count<<<dimGrid, dimBlock>>>(CUDAimg, rows, cols, step, iterations, centre_x, centre_y, size, image_size);
      cudaMemcpy(img, CUDAimg, rows*cols*3, cudaMemcpyDeviceToHost);
      cudaFree(CUDAimg);
   }
}  // Allocate sufficient memory for the whole image (@3 bytes per pixel),
//transfer it to the graphics card (host to device), start the calculation process and,
//when complete, transfer the memory (containing the calculated values) back to the host
</source>
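
For completeness, here is a minimal sketch of a host-side caller for startCUDA. The image allocation and the centre/size/iteration values are illustrative assumptions, since the original host program is not part of the listing.

<source lang=c>
// host_main.cu - illustrative caller; compile together with mandelbrot.cu (nvcc)
#include <stdlib.h>

void startCUDA(int blocks, int threads, int iterations, double centre_x, double centre_y,
               double size, unsigned char* img, int rows, int cols, int step, int image_size);

int main(void)
{
    const int rows = 512, cols = 512, step = 3 * cols;        // packed 24-bit RGB image
    unsigned char *img = (unsigned char *) calloc(rows * step, 1);
    if (img == NULL) return 1;

    // 32 blocks x 128 threads: the point beyond which we measured no further gains
    startCUDA(32, 128, 2048, -0.5, 0.0, 3.0, img, rows, cols, step, rows);

    // img now holds the coloured pixels; hand it to an X Window or write it to a file here
    free(img);
    return 0;
}
</source>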

=Slides=

=Literature List=

=Evaluation=