Cell Processor Details
IBM et al. have released a lot of information about the Cell processor recently. Unless you’ve been living under a PS1 for the last five years, you know that the Cell processor will be powering the upcoming PS3 gaming console from Sony.
One of the most interesting bits is IBM’s claim of superior memory access. Current processors spend far too much time waiting on memory, so the Cell tries to ameliorate the problem with a new memory architecture.
This three-level organization of storage (register file, local store, main storage) — with asynchronous DMA transfers between local store and main storage — is a radical break with conventional architecture and programming models because it explicitly parallelizes computation and the transfers of data and instructions.
The reason for this radical change is that memory latency, measured in processor cycles, has gone up several hundredfold in the last 20 years. The result is that application performance is often limited by memory latency rather than peak compute capability or peak bandwidth. When a sequential program on a conventional architecture performs a load instruction that misses in the caches, program execution now comes to a halt for several hundred cycles. Compared with this penalty, the few cycles needed to set up a DMA transfer on an SPE (Synergistic Processor Element) are a minor cost. Even with deep and costly speculation, conventional processors manage to get at best a handful of independent memory accesses in flight. The result can be compared to a bucket brigade in which a hundred people are required to cover the distance to the water needed to put the fire out, but only a few buckets are available.
In contrast, the explicit DMA model allows each SPE to have many concurrent memory accesses in flight without the need for speculation.
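The bucket-brigade argument can be put in numbers with Little's law: sustained throughput is roughly the amount of data in flight divided by the latency. The sketch below is only a back-of-the-envelope model. The 400-cycle latency, the 128-byte transfer size, and the in-flight counts of 4 and 16 are illustrative assumptions, not figures from IBM.

```python
def sustained_bytes_per_cycle(in_flight, bytes_per_access, latency_cycles):
    """Little's law: throughput = data in flight / latency."""
    return in_flight * bytes_per_access / latency_cycles


# Illustrative assumptions only (not measured or published figures).
LATENCY_CYCLES = 400    # "several hundred cycles" per miss
BYTES_PER_ACCESS = 128  # hypothetical transfer size

# A "handful" of outstanding accesses versus "many" outstanding DMA transfers.
few_buckets = sustained_bytes_per_cycle(4, BYTES_PER_ACCESS, LATENCY_CYCLES)
many_buckets = sustained_bytes_per_cycle(16, BYTES_PER_ACCESS, LATENCY_CYCLES)

print(f"few in flight:  {few_buckets:.2f} bytes/cycle")
print(f"many in flight: {many_buckets:.2f} bytes/cycle")
```

In this model the latency stays the same in both cases. Only the number of buckets changes, and throughput scales with it.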
The most productive SPE memory-access model appears to be the one in which a list (such as a scatter-gather list) of DMA transfers is constructed in an SPE’s local store so that the SPE’s DMA controller can process the list asynchronously while the SPE operates on previously transferred data. In several cases, this new approach to accessing memory has led to application performance exceeding that of conventional processors by almost two orders of magnitude, significantly more than anyone would expect from the peak performance ratio (about 10x) between the Cell Broadband Engine and conventional PC processors.
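The list-driven model described above amounts to double buffering: the next transfer is already under way while the current chunk is being processed. Here is a minimal sketch of that pattern in portable Python, not Cell SDK code. The names fetch and process_double_buffered are hypothetical, a single worker thread stands in for the SPE's DMA controller, and a plain list slice stands in for a DMA transfer from main storage to local store.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch(main_storage, offset, size):
    """Stand-in for one DMA transfer: copy a chunk into a 'local store' buffer."""
    return list(main_storage[offset:offset + size])


def process_double_buffered(main_storage, chunk_size, compute):
    """Run compute() on each chunk while the next chunk is fetched in the background."""
    # The "DMA list": one (offset, size) entry per transfer, built up front.
    transfer_list = [(offset, chunk_size)
                     for offset in range(0, len(main_storage), chunk_size)]
    results = []
    if not transfer_list:
        return results

    # One worker thread plays the role of the asynchronous DMA controller.
    with ThreadPoolExecutor(max_workers=1) as dma:
        pending = dma.submit(fetch, main_storage, *transfer_list[0])
        for next_entry in transfer_list[1:]:
            current = pending.result()                     # wait for the transfer
            pending = dma.submit(fetch, main_storage, *next_entry)  # start the next one
            results.append(compute(current))               # overlap compute with it
        results.append(compute(pending.result()))          # last chunk
    return results


if __name__ == "__main__":
    print(process_double_buffered(list(range(10)), 4, sum))
```

On a real SPE the transfers would be issued through the memory flow controller rather than a thread, but the control flow has the same shape: start the next transfer, work on the data that has already arrived, then wait.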