CPU performance in the age of big data : a case study with Hive
Distributed SQL Query Engines (DSQEs), like Hive, Shark, and Impala, have become the de-facto database set-up for Decision Support Systems with large database sizes. Unlike their single-threaded counterparts like MySQL, DSQEs experience inefficiencies related to the algorithm, code base, OS, and CPU micro-architecture that limit throughput despite the speedup from distributed execution. In my thesis, I present a detailed performance analysis of a DSQE called Hive, comparing it to MySQL, a single-threaded database application. Hive has difficulty converting queries into a set of MapReduce jobs for distributed execution. Hive also experiences a startup phase that is a significant overhead for short running queries. Additionally, both Hive and MySQL, like other server applications, experience high L1I miss rates due to a large code footprint. However, because MySQL is algorithmically efficient and traverses the database at a faster rate, it incurs a larger back-end bottleneck from LLC misses, which hides the front-end bottleneck. In contrast, Hive does not hide the high L1I cache miss rate with back-end stalls. Additionally, the higher context switch rates experienced by multi-process Hive setups thrash the first level caches, further inflaming the L1I cache miss rate. To address this micro-architectural inefficiency, I propose an instruction prefetch mechanism called Runahead Prefetch. It is similar to previously proposed branch prediction base prefetchers , but designed to easily extend modern Intel microarchitectures. Despite newer instruction prefetch mechanisms that discount branch prediction based prefching potential   , I show Runahead Prefetch can eliminate 92% of L1I misses and 96% of icache stalls on average given modern branch misprediction rates and sufficient runahead.