Parallel Microprocessors

(In conjunction with Ph.D. students Srinivas Vadlamani and Sevin Fide)

Did you know that the way most compilers and programmers write code for multithreaded processors (e.g., the Pentium 4 with HyperThreading) and chip multiprocessors (e.g., the Athlon64 X2, Core Duo, and Pentium D) is wrong? The problem is that the memory interfaces stayed the same as parallel execution units (threads or cores) were added, so memory is even more of a bottleneck than before! The typical parallel-processing model, spatial decomposition, splits the work among the threads or cores, each of which simultaneously accesses memory to perform its share of the computation, which exacerbates the problem. (It can also cause interference in shared caches, but others are working on that problem.) How bad is it? Most people turn HyperThreading off on their Pentium 4 machines because it provides them no benefit. We have examples of memory-intensive parallel programs that scale very well on clusters but actually run slower on dual-processor machines such as the PowerMac G5, dual Opterons, and Athlon64 X2 systems.

Our Synchronized Pipelined Parallelism Method (SPPM) restructures programs into producer/consumer chains. The threads in these chains communicate through shared levels of the memory hierarchy (i.e., shared caches or the front-side bus), which are normally much faster than the memory bus.

How well does it work? We can get significant performance improvement, even on a Pentium 4 with HyperThreading. The SPPM versions of our programs can run 20-50% faster than the sequential versions while maintaining the same number of L2 cache misses! The spatial decomposition versions tend to be much slower and incur many more cache misses. SPPM can even get parallel performance from programs that are not typically parallelizable, like ARC4 encryption. So SPPM can make your code take advantage of modern parallel microprocessors!

We currently have a detailed journal paper under review, but some preliminary results can be found in this reference:

It turns out that AMD processors have particularly poor cache-coherence performance. It is so bad that it is faster to fetch data from memory than from the other core's cache! On Opterons, it is even faster to fetch from a remote processor's memory than from the remote processor's cache, so the problem clearly isn't HyperTransport. This made SPPM perform badly on these systems: the consumer was slower than the producer, so the producer had to wait. We therefore invented a new approach that keeps producer/consumer communication in the cache while reducing remote cache transfers. This idea, called Polymorphic Threads, works extremely well and is documented in:

  • “Architectural considerations for efficient software execution on parallel microprocessors,” presented at the 21st IEEE International Parallel & Distributed Processing Symposium (Long Beach, California, USA), Mar. 2007. (PDF)(BibTeX)

To understand this surprising behavior, we developed methods to measure cache-to-cache communication behavior. Because this benchmarking tool should be of general interest, we are releasing it as open source; please see our page on SourceForge.

So if this works, what is left to do? Lots, actually. We have shown that the idea works well, but it is currently hand-coded. We need to make it easier to use, possibly with the use of a synchronized loop class. We also need to develop detailed analytical models so we can predict how a program/machine pair will perform before effort is spent converting the code to SPPM. Ideally, SPPM could be integrated into a compiler, but that is a longer term goal.

In addition, we are working on several hardware architectural changes that could greatly alleviate this bottleneck. Our simulations are very promising.