Browsing by Subject "Accelerators"
Item: 3D system-circuit-device design methodologies for advanced CMOS (2021-05-03)
Mathur, Rahul; Kulkarni, Jaydeep P.; John, Lizy; Dodabalapur, Ananth; Banerjee, Sanjay; Yeric, Greg; Sinha, Saurabh
The emergence of 5G, automotive, and AI-based applications is creating new capabilities and a huge amount of data, driving the need for a broad expansion in energy-efficient compute capacity. At the same time, the typical gains in Power, Performance, Area, and Cost (PPAC) that dimensional scaling has brought over the past several decades are slowing down. To offset the slowdown in 2D scaling and continue the trajectory of PPAC improvements, coordinated innovations are needed across the system, circuit, and device abstraction levels. 3D integration may offer complementary gains to transistor density scaling. Meanwhile, 3D expands the design space of SoCs, adding considerations such as partitioning, power delivery, signaling, and thermal management. This dissertation studies these considerations in detail. The work spans thermal analysis of a 3D CPU, system-level design space exploration of 3D ML accelerators, circuit design of a 3D-Split SRAM macro, and a novel use of the device-level 3D construct of the Buried Power Rail (BPR) for SRAM signaling to enable next-generation computing systems in advanced CMOS.

Item: Lightweight offload engines for worklist management and worklist-directed prefetching (2017-12)
Zhang, Dan; Chiou, Derek; Erez, Mattan; Gerstlauer, Andreas; Pingali, Keshav; Khubaib, Khubaib
The importance of irregular applications such as graph analytics is rapidly growing with the rise of Big Data. However, parallel graph workloads tend to perform poorly on general-purpose chip multiprocessors (CMPs) due to poor cache locality, low compute intensity, frequent synchronization, uneven task sizes, and dynamic task generation. At high thread counts, execution time is dominated by worklist synchronization overhead and cache misses.
Researchers have proposed hardware worklist accelerators to address scheduling costs, but these proposals often harden a specific scheduling policy and do not address high cache miss rates. This thesis presents Minnow, a technique that addresses these bottlenecks by augmenting each core in a CMP with a lightweight engine that is optimized for memory throughput and connected through an accelerator interface. These engines offload worklist operations from worker threads, reducing synchronization costs and improving scalability. The engines also perform worklist-directed prefetching, a software prefetching technique that exploits knowledge of upcoming tasks to perform nearly perfectly accurate and timely prefetch operations. In this thesis, we first characterize several graph applications within a popular graph analytics framework to determine their performance and bottlenecks. Next, Minnow and worklist-directed prefetching are discussed in detail, including the Minnow accelerator interface, microarchitecture, and prefetch flow control mechanism. Finally, the benefits of Minnow and worklist-directed prefetching are evaluated within a cycle-accurate microarchitectural simulator.

Item: Single-shot diagnostics of laser driven plasma accelerators (2018-07-09)
Chang, Yen-Yu; Downer, Michael Coffin; Bernstein, Aaron; Fink, Manfred; Breizman, Boris; Becker, Michael
We demonstrated single-shot diagnostics of laser-plasma accelerators (LPAs). Using the Faraday rotation diagnostic, we observed the structure and evolution of the blow-out region, the nonlinear wave (plasma bubble) induced by the driving beam. We obtained the evolution of the plasma bubble in a single shot using the Faraday rotation diagnostic with multiple probe beams. The diameter of the bubble changed from 300 μm to 50 μm over 2 cm, which revealed the transition of the acceleration stages from "bubble expanding mode" to "bubble stabilizing mode".
Moreover, we demonstrated the broad-bandwidth frequency-domain streak camera (B-FDSC), which can resolve the dynamics of LPAs in a single shot. We improved the temporal resolution of the B-FDSC to 10 fs by broadening the bandwidth of the probe beam to 100 nm using supercontinuum generation, and we performed a prototype experiment to show that the B-FDSC was capable of resolving the evolution of pulse self-steepening and temporal splitting in a single shot.
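
As a rough plausibility check on the figures quoted in the last abstract (10 fs resolution from a 100 nm probe bandwidth), the sketch below computes the transform-limited duration of a Gaussian pulse with that bandwidth. It is not the thesis's own analysis. The 800 nm center wavelength is an assumption (the abstract does not state it), as are the Gaussian pulse shape and the function name `transform_limited_fwhm_fs`.

```python
import math

C = 299_792_458.0  # speed of light in vacuum, m/s

# Time-bandwidth product (FWHM duration x FWHM frequency width) of a
# transform-limited Gaussian pulse: 2*ln(2)/pi, approximately 0.441.
GAUSSIAN_TBP = 2.0 * math.log(2.0) / math.pi


def transform_limited_fwhm_fs(center_nm: float, bandwidth_nm: float) -> float:
    """Shortest Gaussian pulse duration (FWHM, in fs) supported by a spectrum
    of the given FWHM bandwidth centered at center_nm.

    Both arguments are in nanometres. The frequency width is taken between
    the two band edges rather than from the small-bandwidth approximation.
    """
    lo_m = (center_nm - bandwidth_nm / 2.0) * 1e-9  # short-wavelength edge, m
    hi_m = (center_nm + bandwidth_nm / 2.0) * 1e-9  # long-wavelength edge, m
    delta_nu = C / lo_m - C / hi_m                  # frequency width, Hz
    return GAUSSIAN_TBP / delta_nu * 1e15           # seconds -> femtoseconds


if __name__ == "__main__":
    # ASSUMPTION: 800 nm center wavelength; only the 100 nm bandwidth
    # comes from the abstract.
    print(f"{transform_limited_fwhm_fs(800.0, 100.0):.1f} fs")
```

Under these assumptions the limit comes out at about 9.4 fs, the same order as the 10 fs resolution reported in the abstract. Halving the bandwidth roughly doubles the limit, which is why broadening the probe spectrum improves temporal resolution.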