Hardware techniques to reduce communication costs in multiprocessors
Abstract
This dissertation explores techniques for reducing the costs of inter-processor
communication in shared memory multiprocessors (MP). We seek to improve MP
performance by enhancing three aspects of multiprocessor cache designs: miss reduction,
low communication latency, and high coherence bandwidth. We propose three
techniques, one for each aspect: a shared non-uniform cache architecture,
coherence decoupling, and subspace snooping.
As a miss reduction technique, we investigate shared cache designs for future
Chip-Multiprocessors (CMPs). Cache sharing can reduce cache misses by eliminating
unnecessary data duplication and by reallocating the cache capacity dynamically.
We propose a reconfigurable shared non-uniform cache architecture and evaluate
the trade-offs of cache sharing at varied sharing degrees. Although shared caches
improve caching efficiency, their most significant disadvantage is increased
cache hit latency. To mitigate the effect of these longer latencies,
we evaluate two latency management techniques, dynamic block migration and L1
prefetching.
However, improving the caching efficiency does not reduce the cache misses
induced by MP communication. For such communication misses, the latencies of
cache coherence should be either reduced or hidden and the coherence bandwidth
should scale with the number of processors. To mitigate long communication latencies,
coherence decoupling uses speculation for communication data. Coherence
decoupling allows processors to run speculatively at communication misses with predicted
values. Our prediction mechanism, called the Speculative Cache Lookup (SCL)
protocol, uses stale values in the local caches. We show that the SCL read component
can hide false sharing and silent store misses effectively. We also investigate
the SCL update component to hide the latencies of truly shared misses by updating
invalid blocks speculatively.
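The stale-value idea behind the SCL read component can be illustrated with a
minimal sketch. It is not the dissertation's implementation: the names
CacheLine, speculative_read, and verify are hypothetical, and the sketch assumes
that an invalidated line keeps its old data and that speculation is checked by
comparing the predicted word with the coherent data once it arrives.

```python
from dataclasses import dataclass


@dataclass
class CacheLine:
    """Hypothetical cache line; data is retained after invalidation."""
    words: list
    valid: bool = True


def speculative_read(line, word_index):
    """Return (value, is_speculative).

    An invalid line still supplies its stale word so the processor can
    keep running while the coherent data is fetched.
    """
    return line.words[word_index], not line.valid


def verify(line, word_index, coherent_words):
    """Install the coherent data and report whether the prediction held.

    Assumption: a mismatch would trigger a rollback, which is not modeled.
    """
    predicted = line.words[word_index]
    line.words = list(coherent_words)
    line.valid = True
    return predicted == coherent_words[word_index]
```

Under these assumptions, a false sharing miss (a remote write to a different
word of the block) or a silent store (a remote write of the same value) verifies
successfully, while a truly shared miss with a changed value does not.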
To improve the coherence bandwidth, we propose subspace snooping, which
increases the snooping bandwidth of future large-scale shared-memory machines.
Even with huge optical bus bandwidth, traditional snooping protocols may not scale
to hundreds of processors, since every processor must respond to every bus access.
Subspace snooping allows only a subset of processors to be snooped for a bus access,
thus increasing the effective snoop tag bandwidth. We evaluate subspace snooping
with a large broadcasting bandwidth provided by optical interconnects.
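The snoop-filtering effect can be sketched as follows. The abstract does not
say how the subset of processors is chosen, so this sketch assumes a
hypothetical table that records, per fixed-size memory region, every processor
that has accessed the region. SubspaceTable, REGION_BITS, and the method names
are illustrative only.

```python
REGION_BITS = 12  # assumed region size of 4 KiB (hypothetical choice)


class SubspaceTable:
    """Hypothetical map from memory region to processors that touched it."""

    def __init__(self, num_processors):
        self.num_processors = num_processors
        self.subspaces = {}  # region number -> set of processor ids

    def record_access(self, proc, addr):
        """Add proc to the subspace of the region containing addr."""
        self.subspaces.setdefault(addr >> REGION_BITS, set()).add(proc)

    def snoop_targets(self, requester, addr):
        """Processors whose tags must be checked for this bus access."""
        region = addr >> REGION_BITS
        return self.subspaces.get(region, set()) - {requester}

    def broadcast_targets(self, requester):
        """Baseline: a traditional snooping bus checks every other processor."""
        return set(range(self.num_processors)) - {requester}
```

In this sketch, a region touched by two of 64 processors costs one snoop tag
lookup per bus access instead of 63, which is the sense in which the effective
snoop tag bandwidth increases.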