Browsing by Subject "Microprocessors--Design and construction"
Now showing 1–10 of 10
Item Delay-sensitive branch predictors for future technologies (2002-05)
Jiménez, Daniel Angel, 1969-; Lin, Yun Calvin

Item Design of wide-issue high-frequency processors in wire delay dominated technologies (2004)
Murukkathampoondi, Hrishikesh Sathyavasu; Burger, Douglas C., Ph. D.; Chase, Craig M.

Item A design validation methodology for high performance microprocessors (2003)
Krishnamurthy, Narayanan; Abraham, Jacob A.
The task of checking whether a circuit implementation satisfies an abstract specification, prior to manufacturing the circuit, is extremely important, because the abstract specification is relied upon to predict silicon behavior. It is also important to know the exact conditions under which the prediction is guaranteed to be valid. This dissertation delves into the fundamental bottlenecks and issues in model extraction and the inherent difficulties in verifying the equivalence of transistor circuit implementations with respect to higher-level specifications. A novel implementation verification methodology based on symbolic simulation is presented. In addition, the dissertation presents the general theory of automatic constraint generation required for a sound verification strategy and proposes an enhanced implementation verification methodology to eliminate gate/switch-level full-chip simulations. The practical aspects of developing a tool to dovetail into this methodology are also presented.

Item Efficient adaptation of multiple microprocessor resources for energy reduction using dynamic optimization (2005)
Hu, Shiwen; John, Lizy Kurian
The continuing advances in VLSI technology have fueled dramatic performance gains for general-purpose microprocessors, but microprocessor energy consumption has been increasing substantially in the past decade. The steady increase of microprocessor energy consumption significantly affects circuit reliability, cooling and packaging costs, and the battery life of embedded systems.
Adaptive microarchitectures are among the commonly used techniques for dynamically identifying configurations that are desirable from both performance and power perspectives. By matching hardware resources to a program's runtime requirements, adaptive microarchitectures can effectively reduce energy with minimal performance loss. However, the task of searching for the most energy-efficient configurations is complicated by configuration-space explosion, which may considerably impair an adaptive microarchitecture's performance and energy efficiency. This dissertation presents a hardware adaptation framework for efficient management of multiple configurable units (CUs), utilizing a dynamic optimization system's inherent capability to detect and optimize dominant code regions (hot spots). The framework uses hot-spot boundaries for phase detection and hardware adaptation. Since hot spots are of variable sizes and are often nested, the framework can decouple the reconfiguration of CUs with diverse adaptation costs by adjusting the granularity of adaptation based on each CU's reconfiguration cost. This dissertation also studies the interference that one CU's configuration changes impose on the adaptation of others. CUs with minimal mutual interference can be adapted in parallel. In addition, for some pairs of CUs, a size reduction in one usually prompts the other to choose a smaller size for energy reduction; the search for those CUs' best configurations is therefore biased toward certain paths, which prunes the tuning space. Employing these tuning-reduction strategies, the proposed framework significantly improves the energy efficiency of an adaptive microarchitecture. The energy and hardware-adaptation impact of two important dynamic optimization services, JIT optimization and garbage collection, is also investigated in this work. By stressing the data caches, both dynamic optimization services decrease the average power dissipated by a dynamic optimization system.
Furthermore, owing to their distinct runtime characteristics and their ability to alter program runtime behavior, the two dynamic optimization services change the adaptation preferences of configurable hardware units and influence the energy efficiency of an adaptive microarchitecture.

Item A hybrid-scheduling approach for energy-efficient superscalar processors (2005)
Valluri, Madhavi Gopal; John, Lizy Kurian
The management of power consumption while simultaneously delivering acceptable levels of performance is becoming a critical task in high-performance, general-purpose microarchitectures. Nearly a third of the energy consumed in these processors can be attributed to the dynamic scheduling hardware that identifies multiple instructions to issue in parallel. The energy consumption of this complex logic structure is projected to grow dramatically in future wide-issue processors. This research develops a novel Hybrid-Scheduling approach that synergistically combines the advantages of compile-time instruction scheduling and dynamic scheduling to reduce energy consumption in the dynamic issue hardware. This approach is predicated on the key observation that not all instructions and basic blocks in a program are equal: some blocks are inherently easy to schedule at compile time, whereas others are not. In this scheme, programs are thus partitioned into low-power "static regions" and high-power "dynamic regions". Static regions are regions of the program for which the compiler can generate schedules comparable to the dynamic schedules created by the run-time hardware. These regions bypass the dynamic issue units and execute on specially designed low-power, low-complexity hardware.
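As a rough illustration of the compile-time side of this trade-off (not the dissertation's actual compiler, whose contribution is deciding which regions are static), a generic greedy list scheduler for a basic block can be sketched as follows; all names and the dependence graph are hypothetical:

```python
# Illustrative sketch of compile-time list scheduling for one basic block:
# an instruction issues as soon as its dependences have completed and an
# issue slot is free, approximating what dynamic issue hardware would do
# for an "easy" (static) region. Assumes an acyclic dependence graph.

def list_schedule(deps, latency, width=2):
    """deps: {instr: set of instrs it depends on}
    latency: {instr: execution latency in cycles}
    Returns {instr: issue cycle} under a `width`-issue machine model."""
    issue_cycle = {}
    remaining = set(deps)
    cycle = 0
    while remaining:
        issued = 0
        for instr in sorted(remaining):
            if issued >= width:
                break  # no issue slots left this cycle
            ready = all(d in issue_cycle and
                        issue_cycle[d] + latency[d] <= cycle
                        for d in deps[instr])
            if ready:
                issue_cycle[instr] = cycle
                issued += 1
        remaining = {i for i in remaining if i not in issue_cycle}
        cycle += 1
    return issue_cycle

# a and b are independent 2-cycle loads; c consumes both results.
deps = {"a": set(), "b": set(), "c": {"a", "b"}}
lat = {"a": 2, "b": 2, "c": 1}
sched = list_schedule(deps, lat)  # a and b issue at cycle 0, c at cycle 2
```

For blocks like this, where dependences are visible statically, the compile-time schedule matches what out-of-order issue hardware would discover at run time, which is the intuition behind routing such regions to low-power static issue logic.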
An extensive evaluation of the proposed scheme reveals that the Hybrid-Scheduling approach, wherein instructions are routed to a scheduling engine tuned to a region's characteristics, can provide a substantial reduction in processor energy consumption while preserving high levels of performance.

Item Instruction history management for high-performance microprocessors (2003)
Bhargava, Ravindra Nath; John, Lizy Kurian
History-driven dynamic optimization is an important factor in improving instruction throughput in future high-performance microprocessors. History-based techniques have the ability to improve instruction-level parallelism by breaking program dependencies, eliminating long-latency microarchitecture operations, and improving prioritization within the microarchitecture. However, a combination of factors, such as wider issue widths, smaller transistors, larger die area, and increasing clock frequency, has led to microprocessors that are sensitive to both wire delays and energy consumption. In this environment, the global structures and long-distance communications that characterize current history data management are limiting instruction throughput. This dissertation proposes the ScatterFlow Framework for Instruction History Management. Execution history management tasks, such as history data storage, access, distribution, collection, and modification, are partitioned and dispersed throughout the instruction execution pipeline. History data packets are then associated with active instructions and flow with them as they execute, encountering the history management tasks along the way. Between dynamic instances of the instructions, the history data packets reside in trace-based history storage that is synchronized with the instruction trace cache.
Compared to traditional history data management, this ScatterFlow method improves instruction coverage, increases history data access bandwidth, shortens communication distances, improves history data accuracy in many cases, and decreases the effective history data access time. A comparison of general history management effectiveness between the ScatterFlow Framework and traditional hardware tables shows that the ScatterFlow Framework provides superior history maturity and instruction coverage. The unique properties that arise from trace-based history storage and partitioned history management are analyzed, and novel design enhancements are presented to increase the usefulness of instruction history data within the ScatterFlow Framework. To demonstrate the potential of the proposed framework, specific dynamic optimization techniques are implemented using the ScatterFlow Framework. These illustrative examples combine the history-capture advantages with the access-latency improvements while exhibiting desirable dynamic energy consumption properties. Compared to a traditional table-based predictor, ScatterFlow value prediction improves execution time and reduces dynamic energy consumption. In other detailed examples, ScatterFlow-enabled cluster assignment demonstrates improved execution time over previous cluster assignment schemes, and ScatterFlow instruction-level profiling detects more useful execution traits than traditional fixed-size and infinite-size hardware tables.

Item OS-aware architecture for improving microprocessor performance and energy efficiency (2004)
Li, Tao; John, Lizy Kurian
The Operating System (OS), which manages both hardware and software resources, constitutes a major component of today's complex systems implemented with high-end, general-purpose microprocessors, memory hierarchies, and heterogeneous I/O devices. Modern and emerging applications (e.g., database, web server, and file/e-mail workloads) exercise the OS significantly.
However, microprocessor designs and (performance/power) optimizations have largely ignored the impact of the OS. This dissertation characterizes OS activity in the execution of emerging applications and demonstrates the necessity, advantages, and benefits of integrating the OS component into processor architecture design. It is essential to understand the characteristics of today's emerging workloads in order to design efficient architectures for them. Given that modern and emerging applications involve significant system activity, this research uses complete-system evaluation. These evaluations result in several system performance and power optimizations targeting emerging applications with heavy OS activity. The OS dissipates a significant portion of total power during the execution of many modern applications. Therefore, modeling OS power is imperative for accurate software power evaluation, as well as for power management (e.g., dynamic thermal control and equal-energy scheduling). This research characterizes the power behavior of a modern, commercial OS across a wide spectrum of applications to understand OS energy profiles, and then proposes various models to cost-effectively estimate its run-time energy dissipation. To reduce software power, hardware can provide resources that closely match the needs of the software. However, because OS execution is exception-driven and intermittent in nature, it is difficult to accurately predict and adapt processor resources in a timely fashion for OS power savings without significant performance degradation. This dissertation proposes a methodology that permits precise processor adaptations for the operating system with low overhead. Low power has also been an important consideration in instruction cache (I-cache) design. This research goes beyond previous work to explore opportunities to design energy-efficient I-caches by exploiting the interactions of hardware, OS, and applications.
This dissertation presents two techniques (OS-aware cache way lookup and OS-aware cache set drowsy mode) to reduce the dynamic and static power consumption of the I-cache. The proposed mechanisms require minimal hardware modification and addition. The OS component affects control-flow transfer in the execution environment because the exception-driven, intermittent invocation of OS code significantly increases mispredictions in both user and kernel code. This indicates that adapting branch prediction hardware to the OS has become important for improving microprocessor performance. This research proposes two OS-aware branch prediction techniques to alleviate this destructive impact.

Item Scalable hardware memory disambiguation (2007-12)
Sethumadhavan, Lakshminarasimhan, 1978-; Burger, Douglas C., Ph. D.
This dissertation deals with one of the long-standing problems in computer architecture: memory disambiguation. Microprocessors typically reorder memory instructions during execution to improve concurrency. Such microprocessors use hardware memory structures for memory disambiguation, known as Load-Store Queues (LSQs), to ensure that memory instruction dependences are satisfied even when the memory instructions execute out of order. A typical LSQ implementation (circa 2006) holds all in-flight memory instructions in a physically centralized LSQ and performs a fully associative search on all buffered instructions to ensure that memory dependences are satisfied. These LSQ implementations do not scale because they use large, fully associative structures, which are known to be slow and power hungry. The increasing trend towards distributed microarchitectures further exacerbates these problems. As on-chip wire delays increase and high-performance processors become necessarily distributed, centralized structures such as the LSQ can limit scalability.
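The fully associative LSQ search just described can be modeled in software as a simple store-queue lookup; this is an illustrative sketch only (real LSQs are hardware CAMs, and the names here are hypothetical):

```python
# Software model of the fully associative store-queue search an LSQ performs:
# every buffered store is compared against an executing load's address, so the
# cost grows with the number of in-flight memory instructions.

class StoreQueue:
    def __init__(self):
        # In-flight stores in program order: (sequence number, address, value).
        self.entries = []

    def execute_store(self, seq, addr, value):
        self.entries.append((seq, addr, value))

    def search_for_load(self, load_seq, addr):
        """Compare the load's address against *every* buffered store (the
        expensive associative operation). Returns the value of the youngest
        older matching store, or None if the load must read from memory."""
        match = None
        for seq, st_addr, value in self.entries:
            if seq < load_seq and st_addr == addr:
                if match is None or seq > match[0]:
                    match = (seq, value)
        return None if match is None else match[1]

sq = StoreQueue()
sq.execute_store(seq=1, addr=0x100, value=42)
sq.execute_store(seq=3, addr=0x100, value=99)
load_value = sq.search_for_load(load_seq=5, addr=0x100)  # forwards 99
```

In a centralized hardware LSQ this comparison happens against all entries in parallel every cycle, which is why the structure is slow and power hungry at large sizes.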
This dissertation describes techniques to create scalable LSQs in both centralized and distributed microarchitectures. The problems and solutions described in this thesis are motivated and validated by real system designs. The dissertation starts with a description of the partitioned primary memory system of the TRIPS processor, of which the LSQ is an important component, and then, through a series of optimizations, describes how the power, area, and centralization problems of the LSQ can be solved with minor performance losses (if any), even for large numbers of in-flight memory instructions. The four solutions described in this dissertation (partitioning, filtering, late binding, and efficient overflow management) enable power- and area-efficient, distributed, scalable LSQs, which in turn enable aggressive large-window processors capable of simultaneously executing thousands of instructions. To mitigate the power problem, we replaced the power-hungry, fully associative search with a power-efficient hash-table lookup using a simple address-based Bloom filter. Bloom filters are probabilistic data structures for testing set membership; they can quickly check whether an instruction with the same data address is likely to be found in the LSQ without performing the associative search. Bloom filters typically eliminate more than 80% of the associative searches, and they are highly effective because in most programs it is uncommon for loads and stores to have the same data address and be in execution simultaneously. To rectify the area problem, we observe that only a small fraction of all memory instructions are dependent, that only such dependent instructions need to be buffered in the LSQ, and that these instructions need to be in the LSQ only for certain parts of their pipelined execution. We propose two mechanisms to exploit these observations.
The first mechanism, area filtering, is a hardware mechanism that couples Bloom filters and dependence predictors to dynamically identify and buffer only those instructions that are likely to be dependent. The second mechanism, late binding, reduces the occupancy, and hence the size, of the LSQ. Both of these optimizations allow the number of LSQ slots to be reduced by up to one-half compared to a traditional organization, without any performance degradation. Finally, we describe a new decentralized LSQ design for handling LSQ structural hazards in distributed microarchitectures. Decentralization of LSQs, and to a large extent distributed microarchitectures with memory speculation, has proved impractical because of the high performance penalties associated with the mechanisms for dealing with hazards. To solve this problem, we applied classic flow-control techniques from interconnection networks to handle resource conflicts. The first method, memory-side buffering, buffers the overflowing instructions in a separate buffer near the LSQs. The second scheme, execution-side NACKing, sends the overflowing instruction back to the issue window, from which it is later re-issued. The third scheme, network buffering, uses the buffers in the interconnection network between the execution units and memory to hold instructions when the LSQ is full, and uses virtual-channel flow control to avoid deadlocks. The network buffering scheme is the most robust of the overflow schemes and shows less than 1% performance degradation due to overflows for a subset of the SPEC CPU 2000 and EEMBC benchmarks on a cycle-accurate simulator that closely models the TRIPS processor. The techniques proposed in this dissertation are independent and architecture-neutral, and their cumulative benefits result in LSQs that can be partitioned at a fine granularity and have low design complexity.
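The address-based Bloom-filter pre-check described above can be sketched in software as follows. This is an illustrative model only: sizes and hash functions are arbitrary choices for the sketch (real hardware would use cheap wired hashes, not SHA-256).

```python
import hashlib

# Sketch of a Bloom filter guarding the LSQ's associative search: a "no"
# answer is definitive (the search can be skipped), a "yes" answer may be
# a false positive (the search must be performed to be sure).

class AddressBloomFilter:
    def __init__(self, num_bits=256, num_hashes=2):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _indices(self, addr):
        # Derive num_hashes bit positions from the data address.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{addr}".encode()).digest()
            yield int.from_bytes(digest[:4], "big") % self.num_bits

    def insert(self, addr):
        # Called when a memory instruction with this address enters the LSQ.
        for i in self._indices(addr):
            self.bits[i] = True

    def may_contain(self, addr):
        # False: definitely no matching address buffered -> skip the search.
        # True: possibly buffered (false positives occur, false negatives do not).
        return all(self.bits[i] for i in self._indices(addr))

lsq_filter = AddressBloomFilter()
lsq_filter.insert(0x1000)              # a store to 0x1000 is now in flight
hit = lsq_filter.may_contain(0x1000)   # always True for inserted addresses
```

Because loads and stores to the same address are rarely in flight simultaneously, most lookups return False, which is why the filter can eliminate the large majority of associative searches.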
Each of these partitions selectively buffers only memory instructions with true dependences and can be closely coupled with the execution units, thus minimizing power, area, and latency. Such LSQ designs with near-ideal characteristics are well suited to microarchitectures with thousands of instructions in flight and may enable even more aggressive microarchitectures in the future.

Item Scalable primary cache memory architectures (2004)
Agarwal, Vikas; John, Lizy Kurian
For the past decade, microprocessors have been improving in overall performance at a rate of approximately 50–60% per year by exploiting a rapid increase in clock rate and improving instruction throughput. Part of this trend has been the growth of on-chip caches, which in modern processors can be as large as 2 MB. However, as smaller technologies become prevalent, achieving low average memory access time by simply scaling existing designs becomes more difficult because of process limitations. This research shows that scaling an existing design, by either keeping the latency of various structures constant or allowing the latency to vary while keeping the capacity constant, leads to degradation in instructions per cycle (IPC). The goal of this research is to improve IPC at small feature sizes, using a combination of circuit and architectural techniques. This research develops technology-based models to estimate cache access times and uses these models for architectural performance estimation. The performance of a microarchitecture with clustered functional units coupled with a partitioned primary data cache is estimated using the cache access time models. This research evaluates both static and dynamic data mapping on the partitioned primary data cache and shows that dynamic mapping in combination with the partitioned cache outperforms both the unified cache and the statically mapped design.
In conjunction with the dynamic data mapping, this research proposes and evaluates predictive instruction steering strategies that help improve the performance of clustered processor designs. It shows that a hybrid predictive instruction steering policy, coupled with aggressive dynamic mapping of data in a partitioned primary data cache, can significantly improve the IPC of a clustered processor relative to dependence-based steering with a unified data cache.

Item Switch-based Fast Fourier Transform processor (2008-12)
Mohd, Bassam Jamil, 1968-; Swartzlander, Earl E.
The demand for high-performance, power-scalable DSP processors for telecommunication and portable devices has increased significantly in recent years. The Fast Fourier Transform (FFT) computation is essential to such designs. This work presents a switch-based architecture for designing radix-2 FFT processors. The processor employs M processing elements, 2M memory arrays, and M read-only memories (ROMs). Each processing element performs one radix-2 butterfly operation. The memory arrays are designed as single-port memories, each of size N/(2M), where N is the number of FFT points. Compared with a single processing element, this approach provides a speedup of M. If not addressed, memory collisions degrade the processor's performance. A novel algorithm to detect and resolve the collisions is presented: when a collision is detected, a memory management operation is executed. The performance of the switch architecture can be further enhanced by pipelining the design, where each pipeline stage employs a switch component; the result is a speedup of M·log2(N) compared with a single processing element. The use of single-port memory reduces design complexity and area. Furthermore, memory arrays significantly reduce power compared with the delay elements used in some FFT processors.
The switch-based architecture facilitates deactivating processing elements for power scalability. It also facilitates implementing different FFT sizes. The VLSI implementation of a non-pipelined switch-based processor is presented. MATLAB simulations are conducted to analyze the performance. The timing, power, and area results from RTL, synthesis, and layout simulations are discussed and compared with those of other processors.
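The radix-2 butterfly that each processing element computes is the building block of the whole transform. As a software reference for that operation (not the hardware design described in the abstract), an iterative radix-2 decimation-in-time FFT looks like this:

```python
import cmath

def fft_radix2(x):
    """Iterative radix-2 DIT FFT; len(x) must be a power of two.
    Software reference only: the processor above maps the inner butterfly
    onto M parallel processing elements fed from banked single-port memories."""
    n = len(x)
    assert n and (n & (n - 1)) == 0, "length must be a power of two"
    a = list(x)
    # Bit-reversal permutation of the input.
    j = 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            a[i], a[j] = a[j], a[i]
    # log2(n) stages, each containing n/2 butterflies.
    size = 2
    while size <= n:
        w_step = cmath.exp(-2j * cmath.pi / size)  # twiddle factor (from ROM)
        for start in range(0, n, size):
            w = 1.0
            for k in range(start, start + size // 2):
                # One radix-2 butterfly: the operation a single PE performs.
                u, t = a[k], w * a[k + size // 2]
                a[k], a[k + size // 2] = u + t, u - t
                w *= w_step
        size *= 2
    return a

spectrum = fft_radix2([1, 1, 1, 1])  # DC-only input: energy in bin 0
```

The N/2 butterflies per stage over log2(N) stages are what make the pipelined design's M·log2(N) speedup plausible: M butterflies proceed in parallel within a stage, and the log2(N) stages are overlapped in the pipeline.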