The design of packet processing systems is guided by two requirements: (1) high packet processing throughput, and (2) ease-of-programming. Meeting these two requirements simultaneously has proved challenging, primarily because packet processing aggravates the well-known problem of the memory bottleneck. To overcome the memory bottleneck, today's packet processing systems support a wide range of mechanisms such as exposed memory hierarchies, multithreading, and non-blocking multi-word accesses. However, supporting many such mechanisms without clear guidelines for their usage complicates programmability and wastes system resources.
In this dissertation, we ask two fundamental questions: (1) what minimal set
of mechanisms should a packet processing system support to simultaneously achieve
the goals of high throughput and ease-of-programming? and (2) how should one
allocate common system resources such as chip area and off-chip memory bandwidth
to these competing mechanisms? We make three contributions.
First, we demonstrate that the minimal set must include data caches and multithreading, for two reasons: (1) contrary to widely held belief, packet processing exhibits considerable data locality, and (2) relying exclusively on either data locality or packet-level parallelism leads to low packet throughput.
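To make the locality claim concrete, the following minimal sketch (in C, with hypothetical names; an illustration, not code from the dissertation) shows one common source of reuse: packets of the same flow tend to arrive in bursts, so consecutive lookups touch the same flow-table entry, which a data cache can keep on chip.

```c
/* Minimal sketch (hypothetical names): temporal locality in
 * flow-state lookups. Consecutive packets of one flow reuse the
 * same flow_entry, so every lookup after the first can hit in a
 * data cache. */
#include <stdint.h>
#include <stdio.h>

#define FLOW_TABLE_SIZE 4096

struct flow_entry {
    uint32_t src_ip, dst_ip;   /* flow identifier (simplified)      */
    uint64_t pkt_count;        /* per-flow state updated per packet */
};

static struct flow_entry flow_table[FLOW_TABLE_SIZE];

/* Trivial hash; a real system would use a stronger function. */
static struct flow_entry *lookup(uint32_t src_ip, uint32_t dst_ip) {
    uint32_t h = (src_ip ^ dst_ip) % FLOW_TABLE_SIZE;
    return &flow_table[h];
}

int main(void) {
    /* A burst of packets from one flow: each lookup after the
     * first touches memory that is already cache-resident. */
    for (int i = 0; i < 8; i++) {
        struct flow_entry *e = lookup(0x0A000001u, 0x0A000002u);
        e->pkt_count++;
    }
    printf("packets seen: %llu\n",
           (unsigned long long)lookup(0x0A000001u, 0x0A000002u)->pkt_count);
    return 0;
}
```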
Second, we demonstrate that no fixed configuration of caches and multithreading
works well across a spectrum of deployments (combinations of application,
workload, and system characteristics). Achieving high throughput in all cases
requires a malleable architecture that allows deployments to trade off resources between
data caching and multithreading.
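As a rough illustration of this trade-off, the sketch below (with assumed, illustrative parameters; the budget and per-context cost are not figures from the dissertation) shows how every hardware thread context provisioned out of a fixed on-chip SRAM budget reduces the capacity left for the data cache, which is the resource split a malleable architecture would let each deployment choose.

```c
/* Minimal sketch (assumed parameters): splitting a fixed on-chip
 * SRAM budget between data-cache capacity and hardware thread
 * contexts. A locality-rich deployment favors cache; a
 * parallelism-rich one favors threads. */
#include <stdio.h>

#define SRAM_BUDGET_BYTES (64 * 1024) /* total on-chip SRAM (assumed)   */
#define CONTEXT_BYTES     512         /* state per thread context (assumed) */

int main(void) {
    /* Sweep the split: each extra thread context costs cache capacity. */
    for (int threads = 0; threads <= 32; threads += 8) {
        int cache_bytes = SRAM_BUDGET_BYTES - threads * CONTEXT_BYTES;
        printf("%2d threads -> %5d bytes of data cache\n",
               threads, cache_bytes);
    }
    return 0;
}
```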
Third, we develop such a malleable architecture based on a novel predictive register prefetcher that efficiently overlaps the execution of a thread with the thread switch. Our predictor is accurate and imposes no significant overhead. We demonstrate that, for the same chip area and memory bandwidth, our architecture achieves four times the throughput of current commercial systems.
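The following simplified software model (hypothetical names; a sketch of the idea, not the actual hardware design) illustrates the mechanism: when the running thread misses in the cache, the processor switches to another thread, and a predictor begins loading that thread's register state so the switch is overlapped with the outstanding miss.

```c
/* Minimal sketch (simplified model): switch-on-miss multithreading
 * with a predictive register prefetcher. The programmer writes no
 * explicit scheduling code; switches happen on cache misses. */
#include <stdbool.h>
#include <stdio.h>

#define NTHREADS 4

struct thread_ctx {
    int  id;
    bool regs_loaded;  /* register state already on chip?          */
    int  pc;           /* abstract progress counter for the sketch */
};

static struct thread_ctx threads[NTHREADS];

/* Predictor stub: guess which registers the next thread needs and
 * begin loading them while the current thread drains. Modeled here
 * as instantly effective. */
static void prefetch_registers(struct thread_ctx *t) {
    t->regs_loaded = true;
}

int main(void) {
    for (int i = 0; i < NTHREADS; i++)
        threads[i] = (struct thread_ctx){ .id = i };

    int cur = 0;
    for (int step = 0; step < 12; step++) {
        threads[cur].pc++;                   /* one unit of work      */
        bool cache_miss = (step % 3 == 2);   /* synthetic miss pattern */
        if (cache_miss) {
            int next = (cur + 1) % NTHREADS;
            prefetch_registers(&threads[next]); /* hide switch latency */
            printf("thread %d missed; switch to thread %d (regs %s)\n",
                   cur, next, threads[next].regs_loaded ? "ready" : "cold");
            cur = next;
        }
    }
    return 0;
}
```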
We argue that our architecture simplifies programmability. Data caches, unlike the mechanisms used today, are transparent to programmers and compilers. Further, our architecture eliminates the need for explicit thread scheduling, as threads can be switched on cache misses. Thus, our approach simultaneously achieves both high throughput and ease-of-programming. In addition, our findings lay the foundation for managing data accesses in the broader class of high-throughput systems.