Some days ago, a friend of mine at work asked me what the big difference is in the way GPUs and CPUs operate. Even though I went into a fairly deep description of the inner workings of GPUs in some older posts, I want to elaborate specifically on that question.
Let's start with a fundamental concept: latency, that is, the time we have to wait, after submitting an instruction, to have its result computed. If we have only one computational stage, then the reciprocal of the latency is effectively the number of instructions we can process per unit of time.
So we want latencies to be small, right? Well, it turns out that in recent years they have actually been growing! And yet our processors seem to run faster than before. Why? Because they are good at hiding those latencies!
How? Simple: let's say that instead of having a single computational stage, you have more stages, a pipeline of workers. Then you can move an instruction being processed from one stage to the next (conceptually) like on a conveyor belt, and while one stage is processing it, the other stages can accept more instructions. Any given instruction still has to go through the whole pipeline, but the rate at which instructions complete, called throughput, can be much higher than the reciprocal of the latency.
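To make that concrete, here is a tiny C++ sketch with made-up numbers (a hypothetical five-stage pipeline, a thousand independent instructions, no stalls): the latency of a single instruction stays at five cycles, but once the pipeline is full one instruction completes every cycle, so throughput approaches one instruction per cycle.

#include <cstdio>

int main()
{
    const int stages = 5;           // hypothetical pipeline depth
    const int instructions = 1000;  // independent instructions, no stalls
    // the first instruction needs 'stages' cycles, then one completes per cycle
    const int cycles = stages + (instructions - 1);
    printf("latency: %d cycles per instruction\n", stages);
    printf("throughput: %.2f instructions per cycle\n", (double)instructions / cycles);
    return 0;
}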
Why did we like those kinds of designs? Well, in the era of the gigahertz wars (which has now largely scaled back), it was an easy way of reaching higher frequencies. If a single instruction was split into a number of tiny steps, then each of them could be simpler, thus requiring less work, thus enabling designers to run at higher frequencies, as each small step required less time.
Unfortunately, if something stalls this pipeline, if we can't fetch more instructions to keep it always full, then our theoretical performance can't be reached, and our code will run slower than on less deeply pipelined architectures.
The causes of those stalls are various. We could have a "branch misprediction": we thought some work was needed, but we were wrong, and we started processing instructions that are not useful. Or we might not be able to find instructions to process that do not depend on the results of the ones currently being processed. The worst example of this latter kind of stall is on memory accesses. Memory is slow, and it's evolving at a slower pace than processors too, so the gap is becoming bigger and bigger (there was no such gap twenty years ago; the Commodore 64's processor, for example, did not need caches).
If one instruction is a memory fetch, and we can't find any instruction to process after it that does not depend on that memory fetch, we are stalled. Badly. That's why hyper-threading and similar architectures exist. That's why memory does matter, and why cache-friendly code is important.
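A little (purely illustrative, CPU-side) C++ example of the dependency problem: summing an array with a single accumulator chains every addition behind the previous one, exposing the full latency of each add, while splitting the sum into several independent accumulators gives the pipeline (and the out-of-order logic) work it can overlap. Note that the two versions may give slightly different results for floating point, since the additions are reassociated.

#include <cstddef>

// every iteration depends on the previous sum: each add waits for the last one
float sum_serial(const float* data, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i)
        sum += data[i];
    return sum;
}

// four independent chains: the hardware can keep several adds in flight at once
float sum_unrolled(const float* data, size_t n)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;
    for (; i + 4 <= n; i += 4)
    {
        s0 += data[i + 0];
        s1 += data[i + 1];
        s2 += data[i + 2];
        s3 += data[i + 3];
    }
    for (; i < n; ++i)  // leftover elements
        s0 += data[i];
    return (s0 + s1) + (s2 + s3);
}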
CPUs have become better and better at this job of optimizing their pipelines. Their architectures, and their decoding stages (taking instructions, decomposing them into simpler operations, scheduling them in the pipeline and rearranging them, which is called out-of-order execution), are so complicated that it's virtually impossible to predict the behaviour of our code at the cycle level. Strangely, transistor counts did evolve according to Moore's law, but we did not use those transistors to get more raw power; we mostly used them to build ever more refined iterations of those pipelines and of the logic that controls them.
Most people say that GPU computational power is evolving at a faster pace than Moore's law predicted. That is not true, as that law is about transistor counts and does not account for frequency improvements (i.e. smaller process nodes), so it's not about computational power at all! The fact that CPU computational power merely kept pace with that law means that we were wasting those extra transistors; in other words, those transistors did not increase the power linearly.
Why are GPUs different? Well, let me give a little code example. Let's say we want to compute this:
for i=0 to intArray.length do boolArray[i] = (intArray[i] * 10 + 10) > 0
GPUs will actually refactor the computation to be more like the following (plus a lot of unrolling...):
for i=0 to intArray.length do tempArray[i] = intArray[i]
for i=0 to intArray.length do tempArray[i] = tempArray[i] * 10
for i=0 to intArray.length do tempArray[i] = tempArray[i] + 10
for i=0 to intArray.length do boolArray[i] = tempArray[i] > 0
(this example would be much easier to express in functional pseudocode than in imperative pseudocode, but anyway...)
Odd! Why are we doing this? Basically, what we want to do is hide latency in width instead of in depth! Since we have to perform the same operation on a huge number of items, we are sure that we always have enough to do to hide latencies, without much effort. And it's quite straightforward to turn transistors into computational power too: we simply add more width, more computational units working in parallel on the tempArray! In fact, that kind of operation, a "parallel for", is a very useful primitive to have in your multithreading library (see the sketch below)... :)
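Here is a minimal "parallel for" sketch in C++, just to illustrate the idea (this is my own toy version, not how any particular GPU or threading library actually implements it): since every element is independent, the index range can simply be split across workers.

#include <algorithm>
#include <functional>
#include <thread>
#include <vector>

void parallel_for(int begin, int end, int workers, const std::function<void(int)>& body)
{
    std::vector<std::thread> pool;
    const int chunk = (end - begin + workers - 1) / workers;  // elements per worker
    for (int w = 0; w < workers; ++w)
    {
        const int lo = begin + w * chunk;
        const int hi = std::min(end, lo + chunk);
        if (lo >= hi)
            break;
        pool.emplace_back([=] { for (int i = lo; i < hi; ++i) body(i); });
    }
    for (auto& t : pool)
        t.join();
}

int main()
{
    const int n = 1024;
    std::vector<int> intArray(n, -1);
    std::vector<char> boolArray(n);
    // the loop from the example: each element is computed independently
    parallel_for(0, n, 8, [&](int i) { boolArray[i] = (intArray[i] * 10 + 10) > 0; });
    return 0;
}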
Many GPUs work exactly like that. The only big difference is that the "tempArray" is implemented in GPU registers, so it has a fixed size, and thus the work has to be subdivided into smaller pieces.
There are some caveats.
The first one is that if we need more than one temp register to execute our operation (because our computation is not as simple as the one in my example!), then our register array will hold fewer independently operating threads (because each one requires a given amount of space), and so we will have less latency hiding. That's why the number of registers a shader uses is more important than the number of instructions (which we can now clearly see as passes!) it needs to perform!
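A back-of-the-envelope illustration in C++, with completely made-up numbers (real register file sizes differ from GPU to GPU): the more temporary registers each shader thread needs, the fewer threads fit in the register file at once, and the less latency hiding we get.

#include <cstdio>

int main()
{
    const int registerFileSize = 8192;  // hypothetical number of registers per unit
    const int regsPerThread[] = {2, 4, 8, 16, 32};
    for (int regs : regsPerThread)
        printf("%2d registers per thread -> %4d threads in flight\n",
               regs, registerFileSize / regs);
    return 0;
}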
Second, this kind of computation is inherently SIMD: even if GPUs do support different execution paths on the same data (i.e. branches), those are still limited in a number of ways.
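To see why branches are limited, here is a tiny (and deliberately simplified) C++ illustration of what a SIMD machine effectively does when its lanes disagree on a branch: both sides get evaluated, and a per-lane mask selects which result to keep, so a divergent branch costs roughly the sum of both paths.

#include <cstdio>

int main()
{
    int x[4] = {3, -1, 7, -5};  // four SIMD "lanes" running in lockstep
    int out[4];
    for (int lane = 0; lane < 4; ++lane)
    {
        const int thenPath = x[lane] * 2;             // computed for every lane
        const int elsePath = -x[lane];                // also computed for every lane
        const bool keepThen = x[lane] > 0;            // per-lane predicate
        out[lane] = keepThen ? thenPath : elsePath;   // select instead of branching
    }
    for (int i = 0; i < 4; ++i)
        printf("%d ", out[i]);
    printf("\n");
    return 0;
}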
Another one is that our computations have to be independent; there's no communication between processing threads, so we can't directly compute operations like:
for i=0 to boolArray.length do result = result LOGICAL_OR boolArray[i]
That one is called, in stream processing lingo, a gather operation (or, if you're familiar with functional programming, a reduce or fold), the inverse of which is called a scatter operation. Luckily for the GPGPU community, a workaround exists to do those kinds of computations on the GPU: map the data to be processed into a texture/rendertarget, use the many pixel-processing threads to work on multiple elements in parallel, and use texture reads, which can address arbitrary locations, to gather data. Scatter is still very hard, and there are limitations on the number of texture reads too; for example, that code will usually be processed by doing multiple reduction passes, going from a boolArray of size N to one of size N/2 (N/4 really, as textures are two-dimensional) until reaching the final result... but that's getting too far away from the original question...
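Here is what those reduction passes look like, sketched in plain C++ rather than with rendertargets and texture reads (so this is just the shape of the algorithm, not actual GPGPU code): each pass combines pairs of elements, producing an array half the size, until a single value is left.

#include <cstdio>
#include <vector>

bool reduce_or(std::vector<char> data)  // char used as a simple boolean
{
    if (data.empty())
        return false;
    while (data.size() > 1)
    {
        std::vector<char> next((data.size() + 1) / 2);  // the N/2-sized pass
        for (size_t i = 0; i < next.size(); ++i)
        {
            const bool a = data[2 * i] != 0;
            const bool b = (2 * i + 1 < data.size()) ? data[2 * i + 1] != 0 : false;
            next[i] = a || b;  // one output "pixel" of this pass
        }
        data.swap(next);  // the output of this pass is the input of the next
    }
    return data[0] != 0;
}

int main()
{
    std::vector<char> boolArray = {0, 0, 1, 0, 0};
    printf("%s\n", reduce_or(boolArray) ? "true" : "false");
    return 0;
}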
Are those two worlds going to meet? Probably. CPUs already do not have a single pipeline, so they're not all about depth. Plus, both CPUs and GPUs have SIMD data types and operations. And now multicore is the current trend, and we will see more and more cores, each simpler and simpler (e.g. the IBM Cell or the Intel Larrabee). On the other hand, GPUs are becoming more refined in their scheduling abilities: the Xbox 360 GPU, for example, does not only hide latency in width, but can also choose which instructions from which shader to schedule in order to further hide memory latencies across multiple passes (basically implementing fibers)... NVidia's G80 has computational units with independent memory storage...
Still, I think that GPU processing is inherently more parallel than CPU processing, so a specialized unit will always be nice to have: we are solving a very specific problem, applying a small computational kernel to huge amounts of data... On the other hand, pushing the stream computing paradigm too far onto CPUs is not that useful, as there are problems that do not map well onto it, because they neither work on huge amounts of data nor perform uniform operations...