This question is more about seeking general knowledge on the subject than about a specific problem.
I've been reading about the graphics pipeline and found some good explanations of how a pipeline works. For example, I found this site, which explains it in quite simple yet powerful terms: link text
But when it comes to parallelism I'm stumped. I've found a couple of PowerPoint presentations related to the Frostbite engine, but that's about it. I'm looking for the why and how here.
Why does it improve performance and how does it do it?
For a general overview of parallel processing, see Andres' link in his comment.
Here's my take on GPU parallelization:
Imagine a simple scenario where you want to tint every pixel on the screen blue. If you were doing all of this on the CPU, in a single thread, on a hypothetical 1024x1024-pixel display, you might write something like this:
/// Increase the blue component of an individual pixel.
RGB32 TintPixelBlue(RGB32 inputPixel)
{
    // Increase the blue component by at most 10, but don't overflow the
    // byte by going over 0xFF.
    inputPixel.Blue += Math.Min(10, 0xFF - inputPixel.Blue);
    return inputPixel;
}

void DrawImageToScreen(Image image)
{
    for (int y = 0; y < image.Height; y++)
        for (int x = 0; x < image.Width; x++)
            image[x, y] = TintPixelBlue(image[x, y]);

    DrawMyImageToScreen(image);   // hand the tinted image to whatever actually blits it
}
For a 1024x1024 image, this has to execute 1,048,576 times, one pixel after another. That can take quite a while. If you have to do it at, say, 60 frames/sec, while also drawing a bunch of other stuff (your scene or other geometry), your machine can grind to a screeching halt. It gets even worse with a larger image (1920x1080, for instance).
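To put a rough number on that: 1024 x 1024 = 1,048,576 pixels per frame, and at 60 frames/sec that works out to roughly 63 million calls to TintPixelBlue() per second, all on a single core, before anything else in the frame gets drawn.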
Enter parallelization. (REALLY rough pseudo-code; see HLSL, CUDA or OpenCL for the real thing)
RGB32 TintPixelBlue(RGB32 inputPixel)
{
    // Increase the blue component by at most 10, but don't overflow the
    // byte by going over 0xFF.
    inputPixel.Blue += Math.Min(10, 0xFF - inputPixel.Blue);
    return inputPixel;
}

void DrawImageToScreen(Image image)
{
    GPU.SetImage(image);
    GPU.SetPixelShader(TintPixelBlue);
    Draw();
}
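For a flavour of what the "real thing" might look like, here's a minimal CUDA sketch of the same idea. This is my own illustrative code, not from any particular engine; the kernel name, the uchar4 pixel layout, and the 16x16 block size are just assumptions.

#include <cuda_runtime.h>

// One thread per pixel: each thread tints exactly one pixel, just like one
// invocation of the TintPixelBlue() "shader" in the pseudo-code above.
__global__ void tintPixelBlueKernel(uchar4* pixels, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    uchar4 p = pixels[y * width + x];
    // Increase the blue component by at most 10 without overflowing the byte.
    p.z = (unsigned char)(p.z + min(10, 0xFF - p.z));
    pixels[y * width + x] = p;
}

void tintImageBlueOnGpu(uchar4* devicePixels, int width, int height)
{
    // Launch enough 16x16 thread blocks to cover the whole image; the GPU
    // schedules these blocks across all of its cores.
    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
    tintPixelBlueKernel<<<grid, block>>>(devicePixels, width, height);
    cudaDeviceSynchronize();
}

The loop over x and y has disappeared: instead of iterating, you launch a grid of threads and let the hardware spread them over the available cores.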
With a single, multi-core GPU (up to 512 cores on NVIDIA Fermi and Tesla cards), you can write the TintPixelBlue() function in a shader language, which compiles to the GPU's native instruction set. You then pass the Image object to the GPU and tell it to run TintPixelBlue() on every pixel. The GPU can then use all 512 cores in parallel, which effectively divides the required time by the number of cores (minus overhead and some other stuff we won't get into here).
Instead of 2^20=1,048,576 iterations on the CPU, you get 1,048,576/512, or 2^11=2048 iterations. That's (obviously) a performance increase of around 500x.
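The GPU.SetImage() step above corresponds to copying the pixel data into GPU memory before the kernel runs (and copying it back afterwards if the CPU needs the result). A rough sketch, reusing the hypothetical tintImageBlueOnGpu() helper from the CUDA example above:

#include <cuda_runtime.h>

// Hypothetical helper from the CUDA sketch above.
void tintImageBlueOnGpu(uchar4* devicePixels, int width, int height);

void tintImageBlueFromHost(uchar4* hostPixels, int width, int height)
{
    size_t bytes = (size_t)width * height * sizeof(uchar4);
    uchar4* devicePixels = nullptr;

    cudaMalloc(&devicePixels, bytes);                                     // roughly GPU.SetImage(image)
    cudaMemcpy(devicePixels, hostPixels, bytes, cudaMemcpyHostToDevice);

    tintImageBlueOnGpu(devicePixels, width, height);                      // roughly SetPixelShader + Draw

    cudaMemcpy(hostPixels, devicePixels, bytes, cudaMemcpyDeviceToHost);  // read the result back
    cudaFree(devicePixels);
}

In a real renderer you'd keep the image resident in GPU memory between frames rather than copying it back and forth every frame; the round trip here is just to mirror the pseudo-code.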
The key here is that each input is independent: no pixel's result depends on any other pixel's, so any free core can work on any pending input without having to synchronize with the other cores.
The real fun starts when you put multiple GPUs in the system. Tesla arrays are incredibly fast and power some of the world's fastest supercomputers. Given that they're significantly cheaper than an equivalent array of traditional CPUs (compare the cost of 512 1.3GHz CPUs, plus RAM, rack space, etc., to a $3,000 USD Tesla card), they're becoming very popular in the scientific community for hard-core number crunching.
Hope that helps.