Browsing by Subject "Performance modeling"

Now showing 1 - 3 of 3

Accurate modeling of core and memory locality for proxy generation targeting emerging applications and architectures
(2017-12) Panda, Reena; John, Lizy Kurian; Swartzlander, Earl E; Khurshid, Sarfraz; Gerstlauer, Andreas; Ganesan, Karthik
Designing optimal computer systems for improved performance and energy efficiency requires architects and designers to have a deep understanding of the end-user workloads. However, many end-users (e.g., large corporations, banks, defense organizations, etc.) are apprehensive to share their applications with designers due to the confidential nature of software code and data. In addition, emerging applications pose significant challenges to early design space exploration due to their long-running nature and the highly complex nature of their software stack that cannot be supported on many early performance models. The above challenges can be overcome by using a proxy benchmark. A miniaturized proxy benchmark can be used as a substitute of the original workload to perform early computer performance evaluation. The process of generating a proxy benchmark consists of extracting a set of key statistics to summarize the behavior of end-user applications through profiling and using the collected statistics to synthesize a representative proxy benchmark. Using such proxy benchmarks can help designers to understand the behavior of end-user’s workloads in a reasonable time without the users having to disclose sensitive information about their workloads. Prior proxy benchmarking schemes leverage micro-architecture independent metrics, derived from detailed simulation tools, to generate proxy benchmarks. However, many emerging workloads do not work reliably with many profiling or simulation tools, in which case it becomes impossible to apply prior proxy generation techniques to generate proxy benchmarks for such complex applications. Furthermore, these techniques model instruction pipeline-level locality in great detail, but abstract out memory locality modeling using simple stride-based models. This results in poor cloning accuracy especially for emerging applications, which have larger memory footprints and complex access patterns. A few detailed cache and memory locality modeling techniques have also been proposed in literature. However, these techniques either model limited locality metrics and suffer from poor cloning accuracy or are fairly accurate, but at the expense of significant metadata overhead. Finally, none of the prior proxy benchmarking techniques model both core and memory locality with high accuracy. As a result, they are not useful for studying system-level performance behavior. Keeping the above key limitations and shortcomings of prior work in mind, this dissertation presents several techniques that expand the frontiers of workload proxy benchmarking, thereby enabling computer designers to gain a better and faster understanding of end-user application behavior without compromising the privileged nature of software or data. This dissertation first presents a core-level proxy benchmark generation methodology that leverages performance metrics derived from hardware performance counter measurements to create miniature proxy benchmarks targeting emerging big-data applications. The presented performance counter based characterization and associated extrapolation into generic parameters for proxy generation enables faster analysis (runs almost at native hardware speeds, unlike prior workload cloning proposals) and proxy generation for emerging applications that do not work with simulators or profiling tools. The generated proxy benchmarks are representative of the performance of the real-world big-data applications, including operating system and run-time effects, and yet converge to results quickly without needing any complex software stack support. Next, to improve upon the accuracy and efficiency of prior memory proxy benchmarking techniques, this dissertation presents a novel memory locality modeling technique that leverages localized pattern detection to create miniature memory proxy benchmarks. The presented technique models memory reference locality by decomposing an application’s memory accesses into a set of independent streams (localized by using address region based localization property), tracking fine-grained patterns within the localized streams and, finally, chaining or interleaving accesses from different localized memory streams to create an ordered proxy memory access sequence. This dissertation further extends the workload cloning approach to Graphics Processing Units (GPUs) and presents a novel proxy generation methodology to model the inherent memory access locality of GPU applications, while also accounting for the GPU’s parallel execution model. The generated memory proxy benchmarks help to enable fast and efficient design space exploration of futuristic memory hierarchies. Finally, this dissertation presents a novel technique to integrate accurate core and memory locality models to create system-level proxy benchmarks targeting emerging applications. This is a new capability that can facilitate efficient overall system (core, cache and memory subsystem) design-space exploration. This dissertation further presents a novel methodology that exploits the synthetic benchmark generation framework to create hypothetical workloads with performance behavior that does not currently exist. Such proxies can be generated to cover anticipated code trends and can represent futuristic workloads before the workloads even exist.
High-speed performance and power modeling
(2010-05) Sunwoo, Dam; Chiou, Derek; Patt, Yale N.; Swartzlander, Earl E.; Pinagli, Keshav; Touba, Nur A.; Holt, James C.
The high cost of designing, testing and manufacturing semiconductor chips makes simulation essential to predict performance and power throughout the design cycle of hardware components. However, standard detailed software performance/power simulators are too slow to finish real-life benchmarks within the design cycle. To compensate, reduced accuracy is often traded for improved simulator performance. This dissertation explores the FPGA-Accelerated Simulation Technologies (FAST) methodology that can dramatically improve simulation performance without sacrificing accuracy. Design trade-offs of the functional model partition of a FAST simulator are discussed and QUICK, an implementation of a FAST functional model that is designed to provide fast functional execution as well as the ability to rollback and execute down different paths is described. QUICK is general enough to be useful beyond FPGA-accelerated simulators and provides complex ISA (x86) and full-system support. A complete FAST simulator that combines QUICK with an FPGA-based timing model runs in the millions of x86 instructions per seconds, several orders of magnitude faster than software simulators of comparable accuracy capability, and boots unmodified Windows XP and Linux. Ideally, one could model power at the same speeds as performance modeling in a FAST simulator. However, traditional software-implemented power estimation techniques are very slow. PrEsto, a new power modeling methodology that automatically generates accurate power models that can efficiently fit and operate within FAST simulators, is proposed. Such models can dramatically improve the accuracy and performance of architectural power estimation. Improving high-accuracy simulator performance will open research directions that could not be explored economically in the past. The combination of simulation performance, accuracy, and power estimation capabilities extend the usefulness of such simulators, thus enabling the co-design of architecture, hardware implementation, operating systems, and software.
Resource management for efficient single-ISA heterogeneous computing
(2011-05) Chen, Jian, doctor of electrical and computer engineering; John, Lizy Kurian; Swartzlander, Earl E.; Ghosh, Joydeep; Pan, David Z.; Eeckhout, Lieven
Single-ISA heterogeneous multi-core processors (SHMP) have become increasingly important due to their potential to significantly improve the execution efficiency for diverse workloads and thereby alleviate the power density constraints in Chip Multiprocessors (CMP). The importance of SHMP is further underscored by the fact that manufacturing defects and process variation could also cause single-ISA heterogeneity in CMPs even though the CMP is originally designed as homogeneous. However, to fully exploit the execution efficiency that SHMP has to offer, programs have to be efficiently mapped/scheduled to the appropriate cores such that the hardware resources of the cores match the resource demands of the programs, which is challenging and remains an open problem. This dissertation presents a comprehensive set of off-line and on-line techniques that leverage analytical performance modeling to bridge the gap between the workload diversity and the hardware heterogeneity. For the off-line scenario, this dissertation presents an efficient resource demand analysis framework that can estimate the resource demands of a program based on the inherent characteristics of the program without using any detailed simulation. Based on the estimated resource demands, this dissertation further proposes a multi-dimensional program-core matching technique that projects program resource demands and core configurations to a unified multi-dimensional space, and uses the weighted Euclidean distance between these two to identify the matching program-core pair. This dissertation also presents a dynamic and predictive application scheduler for SHMPs. It uses a set of hardware-efficient online profilers and an analytical performance model to simultaneously predict the application’s performance on different cores. Based on the predicted performance, the scheduler identifies and enforces near-optimal application assignment for each scheduling interval without any trial runs or off-line profiling. Using only a few kilo-bytes of extra hardware, the proposed heterogeneity-aware scheduler improves the weighted speedup by 11.3% compared with the commodity OpenSolaris scheduler and by 6.8% compared with the best known research scheduler. Finally, this dissertation presents a predictive yet cost effective mechanism to manage intra-core and/or inter-core resources in dynamic SHMP. It also uses a set of hardware-efficient online profilers and an analytical performance model to predict the application’s performance with different resource allocations. Based on the predicted performance, the resource allocator identifies and enforces near optimum resource partitions for each epoch without any trial runs. The experimental results show that the proposed predictive resource management framework could improve the weighted speedup of the CMP system by an average of 11.6% compared with the equal partition scheme, and 9.3% compared with existing reactive resource management scheme.