Browsing by Subject "Parallelization"
Now showing 1 - 8 of 8
Item: A GPU-accelerated MRI sequence simulator for differentiable optimization and learning (2021-05-10)
Rakshit, Somnath; Tamir, Jon (Jonathan I.)
The Extended Phase Graph (EPG) algorithm is a powerful tool for magnetic resonance imaging (MRI) sequence simulation and quantitative fitting, but simulators based on EPG are mostly written to run only on the CPU and (with some exceptions) are poorly parallelized. A parallelized simulator compatible with learning-based frameworks would be a useful tool for optimizing scan parameters, estimating tissue parameter maps, and combining with other learning frameworks. In this thesis, I present our work on an open-source, GPU-accelerated EPG simulator written in PyTorch. Since the simulator is fully differentiable by means of automatic differentiation, it can be used to take derivatives with respect to sequence parameters, e.g., flip angles, as well as tissue parameters, e.g., T₁ and T₂. Here, I describe the simulator package and demonstrate its use for a number of MRI tasks, including tissue parameter estimation and sequence optimization.

Item: A parallelized Eulerian VIScous Vorticity Equation (VISVE) method in an inertial and non-inertial frame of reference (2021-05)
Wu, Chunlin; Kinnas, Spyros A.; Liljestrand, Howard M.; Sepehrnoori, Kamy; Raja, Laxminarayan L.; Johnson, Blair A.
A parallelized VIScous Vorticity Equation (VISVE) method that solves the vorticity transport equation to simulate incompressible viscous flow in both the inertial and non-inertial frames of reference is proposed in this dissertation. The local concentration of vorticity near the wall and in the wake is fully exploited by the current method, which is designed to efficiently solve the flow past a body with irregular wall boundaries surrounded by a small, compact computational domain of general shape.
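The vorticity transport equation at the heart of this method can be illustrated with a toy explicit finite-difference step on a periodic 2-D grid. The grid size, time step, viscosity, and uniform advecting velocity below are invented for illustration and are unrelated to the VISVE discretization:

```python
import numpy as np

# Toy explicit update for the 2-D vorticity transport equation
#   dw/dt + u*dw/dx + v*dw/dy = nu * (d2w/dx2 + d2w/dy2)
# on a periodic grid, using central differences in space and forward
# Euler in time. All parameters are illustrative, not from VISVE.
def step_vorticity(w, u, v, nu, dx, dt):
    wxp, wxm = np.roll(w, -1, axis=1), np.roll(w, 1, axis=1)  # x neighbors
    wyp, wym = np.roll(w, -1, axis=0), np.roll(w, 1, axis=0)  # y neighbors
    adv = u * (wxp - wxm) / (2 * dx) + v * (wyp - wym) / (2 * dx)
    diff = nu * (wxp + wxm + wyp + wym - 4 * w) / dx**2
    return w + dt * (diff - adv)

n, dx, dt, nu = 64, 1.0 / 64, 1e-4, 1e-3
x = np.linspace(0, 1, n, endpoint=False)
X, Y = np.meshgrid(x, x)
w = np.exp(-((X - 0.5) ** 2 + (Y - 0.5) ** 2) / 0.01)  # compact vortex blob
w0 = w.copy()
for _ in range(100):
    w = step_vorticity(w, u=1.0, v=0.0, nu=nu, dx=dx, dt=dt)
```

On a periodic grid, both the central-difference advection and the diffusion stencil conserve the total vorticity to roundoff, which makes a useful sanity check for a sketch like this.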
The method is also parallelized using both Open Multi-Processing (OpenMP) and the Message Passing Interface (MPI) in a hybrid manner to boost the speed of computations. The present method is applied to flow past a 3-D hydrofoil with one periodic direction and to flow past a 3-D wing in the inertial frame of reference. The method is also applied to oscillatory flow past a 2-D circular cylinder, uniform flow past a rotating 2-D circular cylinder, uniform flow past a rotating 3-D circular cylinder, and uniform flow past a rotating propeller in the non-inertial frame of reference. The current results, such as the velocity field, vorticity field, and body forces, exhibit a high level of agreement with established numerical and experimental benchmarks, demonstrating the accuracy and robustness of the method.

Item: Detailed-routability-driven and timing-driven scalable parallel global routing (2022-05-06)
He, Jiayuan; Pingali, Keshav; Manohar, Rajit; Fussell, Donald S.; Rossbach, Christopher J.
Electronic Design Automation (EDA) tools are used to design computer chips, which nowadays may contain billions of transistors. The complexity of EDA increases rapidly as chip designs grow larger, and the algorithms as well as the objectives keep evolving as more complicated design rules are introduced in modern technology nodes. There are many stages in chip design. Routing is one of the most time-consuming stages, and it is also the last stage that directly determines the quality of the chip. The routing problem is usually solved in two steps known as global routing and detailed routing. Global routing generates a preferred routing region (a "guide") for each net, and detailed routing focuses on reducing design rule violations within that region. Detailed routing can be easily parallelized and scaled up by routing disjoint regions in parallel, but global routing requires complex parallelization techniques to manage the shared routing resources.
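The maze-routing kernel at the core of global routing can be sketched as a Lee-style breadth-first search on a routing grid; the grid dimensions, pin locations, and obstacle wall below are invented:

```python
from collections import deque

# Toy Lee-style maze router: BFS from source to target on a routing grid,
# treating cells in `blocked` as unavailable. BFS guarantees a shortest
# path in grid steps. The grid and pins here are invented for illustration.
def maze_route(width, height, source, target, blocked):
    prev = {source: None}            # also serves as the visited set
    q = deque([source])
    while q:
        cell = q.popleft()
        if cell == target:           # backtrace the path to the source
            path = []
            while cell is not None:
                path.append(cell)
                cell = prev[cell]
            return path[::-1]
        x, y = cell
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            nx, ny = nxt
            if (0 <= nx < width and 0 <= ny < height
                    and nxt not in blocked and nxt not in prev):
                prev[nxt] = cell
                q.append(nxt)
    return None                      # unroutable

# a vertical wall at x = 3 with a gap at y = 7 forces a detour
path = maze_route(8, 8, (0, 0), (7, 7), blocked={(3, y) for y in range(7)})
```

A rip-up-and-reroute global router repeats searches like this over congested nets, which is where the shared-resource conflicts described above arise.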
Therefore, we believe it is imperative to develop high-quality, fast parallel algorithms for global routing. This dissertation studies the global routing problem. Global routing must generate routing guides such that (1) detailed routability is considered, (2) routing is fast, and (3) timing constraints are satisfied. This dissertation proposes new algorithms as well as novel parallelization strategies to improve both the quality of result (QoR) and the runtime of global routing. The following work is presented in this dissertation.

• SPRoute 1.0 proposes a non-deterministic parallelization strategy for rip-up-and-reroute (RRR) maze routing. In net-level parallelization of maze routing, the livelock problem may result in redundant work and reduce the speedup of parallelization; in the worst case, it may even prevent maze routing from converging. To solve these problems, we propose a novel two-phase maze routing algorithm. It resolves the livelock issue in net-level parallelism and achieves an 11.0× speedup using 28 threads compared with a state-of-the-art single-threaded global router.

• SPRoute 2.0 proposes a detailed-routability-driven technique and a deterministic parallel strategy for maze routing. We first introduce soft capacity, which reserves routing space for detailed routing based on pin density and Rectangular Uniform wire Density (RUDY). We then propose a deterministic parallelization approach that partitions the netlist into batches and bulk-synchronously maze-routes one batch of nets at a time. The advantage of this approach is that it guarantees determinacy without requiring the nets routed in parallel to be disjoint, thus guaranteeing scalability. We design a scheduler that mitigates the load-imbalance and livelock issues in this bulk-synchronous execution model.
The experimental results show that SPRoute 2.0 generates good-quality results with 43% fewer shorts, 14% fewer DRC violations, and a 7.4× speedup over a state-of-the-art global router.

• SPRoute 3.0 targets timing-driven layer assignment in global routing. It proposes a critical-net identification algorithm to dynamically determine the critical rate and a parallelization technique for layer assignment. The proposed methodology can be easily integrated with an asynchronous physical design flow and a timer. SPRoute 3.0 reduces the maximum net delay in global routing by more than 61.7% with only a 1.3% increase in the quality score. It achieves good scalability, with a 7.0× speedup using eight threads compared with a single thread.

• Finally, I present a power router called PWRoute for the gridded-cell placement flow Dali [119]. In the gridded-cell flow, the cell height and width can be any integer multiple of two grid values, so traditional power grid generation methods for standard cells do not apply and a dedicated power router is needed. The power router consists of two steps: (i) power mesh generation and (ii) detailed power routing. PWRoute generates DRC-clean solutions and is verified by commercial tools and a chip tape-out.

Item: Effectively parallelizing electronic design automation algorithms using the operator formulation (2022-05-04)
Lu, Yi-Shan; Pingali, Keshav; Fussell, Donald; Rossbach, Christopher; Manohar, Rajit
Electronic design automation (EDA) algorithms are essential for circuit designers to handle huge circuits, modeled as graphs or hypergraphs, while meeting tight design schedules. Shortening the turnaround time of EDA algorithms through parallelization is attractive because multicore machines have been ubiquitous and inexpensive for more than a decade. However, parallelizing EDA algorithms effectively is difficult due to the compute irregularity of graph algorithms.
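A typical irregular EDA graph computation is static timing analysis, which propagates worst-case arrival times through a gate-level DAG in topological order; a minimal sequential sketch, with the netlist and delays invented for illustration:

```python
from collections import defaultdict, deque

# Toy graph-based static timing analysis: compute worst-case arrival times
# by propagating edge delays through a gate-level DAG in topological order
# (Kahn's algorithm). The circuit and delays below are invented.
def arrival_times(edges):
    # edges: list of (from_pin, to_pin, delay)
    succ = defaultdict(list)
    indeg = defaultdict(int)
    nodes = set()
    for u, v, d in edges:
        succ[u].append((v, d))
        indeg[v] += 1
        nodes.update((u, v))
    at = {n: 0.0 for n in nodes}             # primary inputs arrive at t = 0
    ready = deque(n for n in nodes if indeg[n] == 0)
    while ready:
        u = ready.popleft()
        for v, d in succ[u]:
            at[v] = max(at[v], at[u] + d)    # worst-case (max) arrival time
            indeg[v] -= 1
            if indeg[v] == 0:                # all fan-in edges relaxed
                ready.append(v)
    return at

edges = [("a", "g1", 1.0), ("b", "g1", 2.0),
         ("g1", "g2", 1.5), ("b", "g2", 1.0), ("g2", "out", 0.5)]
at = arrival_times(edges)
```

The irregularity is visible even here: which nodes become ready, and when, depends entirely on the graph's shape, which is what makes load-balanced parallelization hard.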
In this dissertation, we study the effective parallelization of static timing analysis (STA), the core module of timing-driven EDA algorithms. We first analyze the parallelism within graph-based STA algorithms for synchronous circuits using the operator formulation and identify suitable schedules for both full and incremental timing updates. We then apply the same methodology to path-based STA algorithms for synchronous circuits and distill out subgraph tagging as a building block for accuracy enhancement in STA. We prototype Blitz, a parallel STA engine built with Galois 6.0, and compare it against state-of-the-art open-source parallel timers to demonstrate effective parallelization of STA without loss of timing accuracy. Finally, we study the parallelization of STA for AND-causal asynchronous circuits. We extract the parallelism available in asynchronous timing algorithms, devise efficient schedules for fast convergence when cyclic dependencies exist, and deploy subgraph tagging to check asynchronous timing violations. The result is Cyclone, the first parallel STA engine for large asynchronous circuits.

Item: Parallel layered-medium integral-equation methods for electromagnetic analysis of high-fidelity full-size electronic package models (2020-06-30)
Liu, Chang (Ph.D., electrical and computer engineering); Yilmaz, Ali E.; Neikirk, Dean P.; Gharpurey, Ranjit; Sun, Nan; Aygun, Kemal
A parallel iterative layered-medium integral-equation solver is presented for fast, scalable network parameter extraction from high-fidelity full-size electronic package models. The solver relies on a 2-D fast-Fourier-transform (FFT)-based acceleration method, a sparse preconditioner, and a scalable parallelization method to reduce computational costs, as well as advanced lumped-port models to improve the accuracy of network parameter extraction.
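The FFT-based acceleration idea can be illustrated in one dimension: a matrix-vector product with a translation-invariant kernel is a convolution, so it can be evaluated in O(N log N) with zero-padded FFTs rather than O(N²) directly. The kernel below is an invented smooth stand-in for a Green's function, not the layered-medium kernel used by the solver:

```python
import numpy as np

# Toy 1-D illustration of FFT-based acceleration: a matrix-vector product
# with a translation-invariant kernel K(|i-j|) is a convolution, so it can
# be evaluated in O(N log N) via circulant embedding and zero-padded FFTs.
def kernel(r):
    return 1.0 / (1.0 + r)          # invented translation-invariant kernel

def matvec_direct(x):
    # reference O(N^2) evaluation: y_i = sum_j K(|i-j|) x_j
    n = len(x)
    return np.array([sum(kernel(abs(i - j)) * x[j] for j in range(n))
                     for i in range(n)])

def matvec_fft(x):
    n = len(x)
    # embed the symmetric Toeplitz matrix in a 2n-by-2n circulant:
    # first column is [K(0)..K(n-1), 0, K(n-1)..K(1)]
    col = kernel(np.abs(np.arange(n)))
    c = np.concatenate([col, [0.0], col[:0:-1]])
    y = np.fft.ifft(np.fft.fft(c) * np.fft.fft(x, 2 * n)).real
    return y[:n]                    # first n entries give the Toeplitz product

x = np.random.default_rng(0).standard_normal(256)
```

In the solver described above the same trick is applied in 2-D on a uniform grid, which is what makes iterative solution of the dense integral-equation system affordable.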
To be able to analyze full-size electronic package models, the solver is integrated into a reduced-domain layered-medium integral-equation formulation that models large portions of planar conductors using layered-medium Green's functions. To demonstrate the method's efficiency and applicability to complex electronic packaging problems, various package models of increasingly higher fidelity and larger size are analyzed, their numerical results are validated, and the costs are quantified. The studied models include a full-size package model with 304 differential-pair traces that fan out on two different routing layers and carry signals from the chip to the board through 10 metallization layers; the model contains 3127 plated-through-hole core-layer vias, 1216 stacked microvias connected to the traces, and ~3×10⁴ single-layer microvias that stitch metallization layers. The proposed parallel layered-medium integral-equation methods reduced the number of unknowns for the full-wave analysis of this model to only ~4.9×10⁶; using 1536 processes on a petascale cluster, the methods required a wall-clock time of ~0.58 seconds per iteration, ~28 minutes per excitation, and ~560 hours per frequency to extract the 1216-port network parameters.

Item: Parallelization of ordered irregular algorithms (2016-12)
Hassaan, Muhammad Amber; Pingali, Keshav; Patt, Yale N.; Arvind; Garg, Vijay K.; Chiou, Derek; Chase, Craig M.
Parallel computing hardware is ubiquitous, ranging from cell phones with multiple cores to supercomputers with hundreds of thousands of cores. Applications need to take advantage of this parallelism to reduce their execution time as well as to scale to larger input sizes. This has proved very challenging because programmers are used to thinking serially, and because current programming notations for writing parallel software are very low level and hard to use in general.
At the same time, new problem domains like social network analysis and data analytics are giving rise to "irregular" applications that have far more complicated patterns of parallelism and locality than applications in more traditional parallel computing fields like computational science. This dissertation focuses on ordered irregular applications, in which computations must appear to have been performed in an application-specific order. One example considered in this dissertation is the simulation of colliding balls on a billiards table: collisions in different parts of the table may be performed in parallel as long as, in the end, all collisions appear to have been performed in time order. Ordered algorithms are more complex to parallelize than unordered algorithms, such as some mesh refinement algorithms, which lack a notion of algorithmic order. Existing languages and runtime systems are limited in their ability to express and parallelize ordered applications. To address these problems, we introduce a new abstraction for understanding and exploiting parallelism in ordered applications, called the Kinetic Dependence Graph (KDG). We have implemented a high-level user interface that permits programmers to write ordered algorithms without needing to understand the KDG or its implementation details. This interface permits programmers to write sequential C++ programs with implicitly parallel loop constructs, using our parallel data structures and runtime libraries. We have also implemented several runtimes that use the KDG to run these applications in parallel. One runtime uses the KDG directly to find and exploit parallelism; the other is based on optimistic (speculative) execution and uses the KDG to improve the efficiency of speculation.
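The ordering constraint can be seen in a sequential discrete-event loop, the kind of computation a KDG-based runtime would parallelize while preserving the apparent time order; the event names, times, and scheduling rule below are invented:

```python
import heapq

# Toy sequential discrete-event loop: events must appear to execute in
# time order, which is exactly the ordering constraint that makes such
# loops hard to parallelize. Events and their effects are invented.
def simulate(initial_events):
    pq = list(initial_events)          # (time, event_name) pairs
    heapq.heapify(pq)
    log = []
    while pq:
        t, ev = heapq.heappop(pq)      # always the earliest pending event
        log.append((t, ev))            # "execute" the event
        # executing an event may schedule later events, e.g. a break shot
        # predicting two future collisions
        if ev == "break":
            heapq.heappush(pq, (t + 1.5, "collision-a"))
            heapq.heappush(pq, (t + 2.0, "collision-b"))
    return log

log = simulate([(0.0, "break"), (3.0, "cushion")])
```

Events far apart on the table often do not interact, so a runtime may execute them concurrently, provided the final log is indistinguishable from this sequential time-ordered execution.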
Our results show a speedup of up to 33× on a 40-core Intel server, performing better than or on par with third-party implementations, most of which were implemented and optimized by hand using low-level tools and intricate knowledge of algorithms and hardware.

Item: Parallelization of virtual machines for efficient multi-processor emulation (2010-05)
Chakravarthy, Ramapriyan Sreenath; Chiou, Derek; Erez, Mattan
Simulation is an essential tool for computer systems research. The speed of the simulator has a first-order effect on which experiments can be run. The recent industry shift to multi-core processors has put even more pressure on simulation performance, especially since most simulators are single-threaded. In this thesis, we present QEMU-MP, a parallelized version of the fast functional simulator QEMU.

Item: Prioritization via stochastic optimization (2010-05)
Koc, Ali; Popova, Elmira; Morton, David P.; Bard, Jonathan; Caramanis, Constantine; Hess, Stephen; Kutanoglu, Erhan
We take a novel perspective on real-life decision-making problems involving binary activity-selection decisions that compete for scarce resources. The current operations research literature approaches these problems by forming an optimal portfolio of activities that meets the specified resource constraints. However, practitioners in industry and government often do not take the optimal-portfolio approach. Instead, they form a rank-ordered list of activities and select those with the highest priority. The academic literature tends to discredit such ranking schemes because they ignore dependencies among the activities. Practitioners, on the other hand, sometimes discredit the optimal-portfolio approach because if the problem parameters change, the set of activities that was once optimal no longer remains optimal. Even worse, the new optimal set of activities may exclude some of the previously optimal activities, which may already have been selected.
Our approach takes both viewpoints into account. We rank activities considering both the uncertainty in the problem parameters and the optimal portfolio that will be obtained once the uncertainty is revealed. We use stochastic integer programming as a modeling framework. We develop several mathematical formulations and discuss their relative merits, comparing them theoretically and computationally. We also develop cutting planes for these formulations to improve computation times. To handle larger real-life problem instances, we develop parallel branch-and-price algorithms for a capital budgeting application. Specifically, we construct a column-based reformulation, develop two branching strategies and a tabu-search-based primal heuristic, propose two parallelization schemes, and compare these schemes in parallel computing environments using commercial and open-source software. We give applications of prioritization in facility location and capital budgeting problems. In the latter application, we rank maintenance and capital-improvement projects at the South Texas Project Nuclear Operating Company, a two-unit nuclear power plant in Wadsworth, Texas. We compare our approach with several ad hoc ranking schemes similar to those used in practice.
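The underlying idea, ranking activities by how they fare in the optimal portfolio once the uncertainty is revealed, can be sketched with a tiny budget-constrained selection model. The activities, costs, values, and scenarios are invented, and brute-force enumeration stands in for the stochastic integer programs developed in the dissertation:

```python
from itertools import combinations

# Toy sketch of prioritization under uncertainty: solve a small knapsack
# for each scenario, then rank activities by how often they appear in a
# scenario's optimal portfolio. Data are invented; the dissertation uses
# stochastic integer programming, not this enumeration.
def best_portfolio(values, costs, budget):
    items = range(len(values))
    best, best_val = frozenset(), 0.0
    for r in range(len(values) + 1):
        for subset in combinations(items, r):
            if sum(costs[i] for i in subset) <= budget:
                val = sum(values[i] for i in subset)
                if val > best_val:
                    best, best_val = frozenset(subset), val
    return best

costs = [3, 4, 2, 5]
scenarios = [                  # activity values under each scenario
    [6.0, 5.0, 3.0, 7.0],
    [4.0, 2.0, 3.5, 6.0],
    [5.0, 5.5, 1.0, 6.5],
]
budget = 7
counts = [0] * len(costs)
for values in scenarios:
    for i in best_portfolio(values, costs, budget):
        counts[i] += 1         # count appearances in optimal portfolios
ranking = sorted(range(len(costs)), key=lambda i: -counts[i])
```

Unlike a single optimal portfolio, the resulting rank-ordered list degrades gracefully when parameters shift: activities chosen first are those that remain attractive across scenarios.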