Addressing the memory bottleneck in packet processing systems

Mudigonda, Jayaram

Addressing the memory bottleneck in packet processing systems

Access full-text files

mudigondaj86677.pdf (944.3 KB)

Date

2005

Authors

Mudigonda, Jayaram

Abstract

The design of packet processing systems is guided by two requirements: (1) high packet processing throughput, and (2) ease-of-programming. Meeting these two requirements simultaneously has proved to be challenging primarily because, packet processing aggravates the well-known problem of memory bottleneck. To overcome the memory bottleneck, today’s packet processing systems support a wide-range of mechanisms such as exposed memory hierarchy, multithreading, and non-blocking multi-word accesses. However, supporting many such mechanisms without clear guidelines for their usage complicates programmability and wastes system resources. In this dissertation, we ask two fundamental questions: (1) what minimal set of mechanisms should a packet processing system support to simultaneously achieve the goals of high throughput and ease-of-programming? and (2) how should one allocate common system resources such as chip area and off-chip memory bandwidth to these competing mechanisms? We make three contributions. First, we demonstrate that the minimal set must include data caches and multithreading for two reasons: (1) contrary to the widely-held belief, packet proviii cessing exhibits considerable data locality and (2) relying exclusively on either data locality or packet-level parallelism leads to low packet throughput. Second, we demonstrate that no fixed configuration of caches and multithreading works well across a spectrum of deployments (combinations of application, workload, and system characteristics). Achieving high throughput in all cases requires a malleable architecture that allows deployments to trade off resources between data caching and multithreading. Third, we develop such a malleable architecture based on a novel predictive register prefetcher that efficiently overlaps the execution of a thread with the thread-switch. Our predictor is accurate and imposes no significant overheads. We demonstrate that for the same chip area and memory bandwidth, our architecture achieves four times the throughput of current commercial systems. We argue that our architecture simplifies programmability. Data caches, unlike the mechanisms used today, are transparent to programmers and compilers. Further, our architecture eliminates the need for explicit thread scheduling as threads can be switched on cache misses. Thus, our approach simultaneously achieves both, high throughput and ease-of-programming. In addition, our findings lay the foundation for managing data accesses in the broader class of highthroughput, request-processing systems.