Improving sampled microprocessor simulation

Luo, Yue

Evaluating microprocessors with detailed cycle-accurate simulation is prohibitively time-consuming, and sampling is the most widely used technique for reducing simulation time. This dissertation proposes new sampling designs that exploit the characteristics of the workload, the microarchitecture being simulated, and the user’s specific objective. They improve accuracy and reduce simulation time and storage cost. Statistical sampling theory is employed to study the choice of sampling unit size for simple random sampling with perfect warm-up. More importantly, the inherent characteristic of the benchmarks that affects the choice of sampling unit size is identified. Previous research has focused on the accuracy of Cycles Per Instruction (CPI). However, most simulations are used to measure the speedup due to some microarchitectural enhancement. A new sampling scheme that employs the ratio estimator from statistical theory is proposed to measure speedup and to quantify its error. In the experiment, 9X fewer instructions are simulated than when estimating CPI for the same relative error limit. This dissertation also extends sampling techniques to the simulation of commercial workloads such as On-Line Transaction Processing (OLTP), used by banks, airlines, and other businesses. The applicability of simple random sampling and representative sampling to OLTP workloads is investigated. A dynamic stopping rule is proposed for sampling OLTP workloads; it requires only one simulation and thus eliminates the second simulation needed by previous random sampling methods. To achieve accurate sampling results, microarchitectural structures must be adequately warmed up before each measurement. Previous warm-up techniques have not considered the cache configuration being simulated, an important factor in the warm-up length.
This dissertation presents a new cache warm-up technique for sampled microprocessor simulation that adapts the warm-up length to the cache configuration and to the variability characteristics of the benchmark. As a result, warm-up length is greatly reduced, especially for small caches, without loss of accuracy. For trace-driven simulation, the sampled traces have to be stored. Another contribution of the dissertation is the Locality Based Trace Compression (LBTC) technique, which exploits both spatial locality and temporal locality in program memory references. It efficiently compresses not only the address but also other attributes associated with each memory reference.
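
To make the ratio-estimator idea in the abstract concrete, the following is a minimal sketch of the textbook ratio estimator applied to speedup. It is not taken from the dissertation: the function name `estimate_speedup`, its parameters, and the use of a normal-approximation confidence interval without a finite population correction are all assumptions made for illustration. Each sampling unit is assumed to be simulated on both the baseline and the enhanced microarchitecture, yielding one pair of cycle counts.

```python
import math


def estimate_speedup(base_cycles, enh_cycles, z=1.96):
    """Textbook ratio estimator for speedup (hypothetical helper).

    base_cycles[i] and enh_cycles[i] are the cycle counts of the same
    sampling unit on the baseline and enhanced designs (assumed inputs).
    Returns (speedup, standard_error, relative_error_at_z).
    """
    n = len(base_cycles)
    if n != len(enh_cycles) or n < 2:
        raise ValueError("need at least two paired sampling units")

    mean_enh = sum(enh_cycles) / n
    # Ratio of totals: estimated speedup = baseline cycles / enhanced cycles.
    speedup = sum(base_cycles) / sum(enh_cycles)

    # Residuals of each unit around the line base = speedup * enh.
    ss_resid = sum((b - speedup * e) ** 2
                   for b, e in zip(base_cycles, enh_cycles))
    # Approximate variance of the ratio, ignoring the finite
    # population correction (an assumption for large populations).
    var_ratio = ss_resid / (n - 1) / (n * mean_enh ** 2)
    std_err = math.sqrt(var_ratio)

    # Half-width of the confidence interval relative to the estimate.
    rel_err = z * std_err / speedup
    return speedup, std_err, rel_err


if __name__ == "__main__":
    base = [220, 380, 600]   # made-up cycle counts, baseline design
    enh = [100, 200, 300]    # made-up cycle counts, enhanced design
    s, se, rel = estimate_speedup(base, enh)
    print(f"speedup={s:.3f} se={se:.4f} rel_err={rel:.2%}")
```

The variance depends only on how far each unit deviates from the common ratio, so when baseline and enhanced cycle counts rise and fall together across sampling units, the error of the speedup estimate can be small even if the cycle counts themselves vary widely. This is one plausible reading of why fewer instructions may suffice for speedup than for CPI at the same relative error limit.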