Browsing by Subject "Compilers"
Now showing 1 - 7 of 7
Item: Atomic block formation for explicit data graph execution architectures (2010-08)
Maher, Bertrand Allen; McKinley, Kathryn S.; Burger, Douglas C., Ph. D.; Keckler, Stephen W.; Mahlke, Scott A.; Pingali, Keshav

Limits on power consumption, complexity, and on-chip latency have focused computer architects on power-efficient designs that exploit parallelism. One approach divides programs into atomic blocks of operations that execute semi-independently, efficiently creating a large window of potentially concurrent operations. This dissertation studies the intertwined roles of the compiler, architecture, and microarchitecture in achieving efficiency and high performance with a block-atomic architecture. For such an architecture to achieve high performance, the compiler must form blocks effectively: it must create large blocks of instructions to amortize the per-block overhead, but control-flow and content restrictions limit its options. Block formation should consider factors such as execution frequency and block size, for example by selecting control-flow paths that are frequently executed and by exploiting locality of computation to reduce communication overheads. This dissertation determines which characteristics of programs influence block formation and proposes techniques to generate effective blocks. The first contribution is a method for solving the phase-ordering problems inherent to block formation, mitigating the tension between block-enlarging optimizations (if-conversion, tail duplication, loop unrolling, and loop peeling) and scalar optimizations. Given these optimizations, analysis shows that the remaining obstacles to creating larger blocks are inherent in the control-flow structure of applications, and furthermore that any fixed block size entails a sizable amount of wasted space.
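The waste entailed by fixed block sizes can be made concrete with a small sketch. The greedy packing policy, the 128-instruction cap, and the basic-block sizes below are illustrative assumptions, not the dissertation's actual block-formation algorithm:

```python
# Illustrative sketch of why fixed-size blocks waste space: basic blocks are
# merged greedily until the next one would overflow a fixed instruction cap.
# The cap and the basic-block sizes are hypothetical, not measured data.
BLOCK_CAP = 128  # fixed architectural block size (illustrative)

def form_blocks(basic_block_sizes):
    """Greedily pack basic blocks into fixed-capacity atomic blocks."""
    blocks, current = [], []
    for size in basic_block_sizes:
        if current and sum(current) + size > BLOCK_CAP:
            blocks.append(current)
            current = []
        current.append(size)
    if current:
        blocks.append(current)
    return blocks

def wasted_slots(blocks):
    """Unused instruction slots when each block occupies a fixed-size frame."""
    return sum(BLOCK_CAP - sum(b) for b in blocks)

sizes = [40, 70, 25, 90, 10, 15]   # instruction counts of successive basic blocks
packed = form_blocks(sizes)        # [[40, 70], [25, 90, 10], [15]]
print(wasted_slots(packed))        # 134 of 384 slots go unused
```

Even with greedy merging, more than a third of the slots in this toy example are empty, which is the kind of overhead that motivates variable-size blocks.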
To eliminate this overhead, this dissertation proposes an architectural implementation of variable-size blocks that allows the compiler to improve block efficiency dramatically. We use these mechanisms to develop policies for block formation that achieve high performance on a range of applications and processor configurations, and we find that the best policies differ significantly depending on the number of participating cores. Using machine learning, we discover generalized policies for particular hardware configurations and find that the best policy varies significantly between applications and with the number of parallel resources available in the microarchitecture. These results show that effective and efficient block-atomic execution is possible when the compiler and microarchitecture are designed cooperatively.

Item: Compiler and system for resilient distributed heterogeneous graph analytics (2020-05)
Gill, Gurbinder Singh; Pingali, Keshav; Fussell, Donald S; Rossbach, Christopher J; Peri, Ramesh; Mytkowicz, Todd

Graph analytics systems are used in a wide variety of applications, including health care, electronic circuit design, machine learning, and cybersecurity. These systems must handle very large graphs such as the Facebook friends graph, which has more than a billion nodes and 200 billion edges. Since individual machines have limited main memory, distributed-memory clusters with sufficient memory and computational power are required to process such graphs. In distributed graph analytics, the graph is partitioned among the machines in a cluster, and communication between partitions is implemented using a substrate like MPI. However, programming distributed-memory systems is not easy, and the recent trend toward processor heterogeneity has added to this complexity.
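The partition-and-communicate structure of distributed graph analytics can be sketched in a few lines. The round-robin owner function and four-host cluster below are hypothetical simplifications, not the partitioning policies studied in the dissertation:

```python
# Hypothetical sketch of distributed graph partitioning (not Abelian/Galois
# code): each node is assigned an owning host, and edges whose endpoints live
# on different hosts require communication between partitions.
NUM_HOSTS = 4  # assumed cluster size, for illustration only

def owner(node):
    """Assign nodes to hosts round-robin by id (a simple hash partitioning)."""
    return node % NUM_HOSTS

def communication_edges(edges):
    """Count edges that cross partition boundaries and thus need messages."""
    return sum(1 for u, v in edges if owner(u) != owner(v))

edges = [(0, 1), (1, 2), (4, 8), (2, 6), (3, 7)]
print(communication_edges(edges))  # 2 of 5 edges cross hosts
```

A good partitioning policy minimizes such cross-partition edges while balancing load across hosts; the experimental study described below evaluates far richer policies than this round-robin assignment.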
To simplify the programming of graph applications on such platforms, this dissertation first presents a compiler called Abelian that translates shared-memory descriptions of graph algorithms written in the Galois programming model into efficient code for distributed-memory platforms with heterogeneous processors. An important runtime parameter to the compiler-generated distributed code is the partitioning policy. We present an experimental study of partitioning strategies for distributed work-efficient graph analytics applications on clusters of different CPU architectures at large scale (up to 256 machines), and based on this study we present a simple rule of thumb for selecting among the myriad policies. Another challenge of distributed graph analytics addressed in this dissertation is machine fail-stop failures, an important concern especially for long-running graph analytics applications on large clusters. We present a novel communication and synchronization substrate called Phoenix that leverages the algorithmic properties of graph analytics applications to recover from faults with zero overhead during fault-free execution, and we show that Phoenix is 24x faster than previous state-of-the-art systems. This dissertation also examines the new opportunities for graph analytics on massive datasets created by byte-addressable memory technologies with higher density and lower cost than DRAM, such as Intel Optane DC Persistent Memory, which enables the design of affordable systems that support up to 6 TB of randomly accessible memory. We present key runtime and algorithmic principles to consider when performing graph analytics on massive datasets on Optane DC Persistent Memory, and we highlight ideas that apply to graph analytics on all large-memory platforms.
Finally, we show that our distributed graph analytics infrastructure can be used for a new domain of applications, in particular embedding algorithms such as Word2Vec. Word2Vec trains vector representations of words (also known as word embeddings) on a large text corpus, and the resulting embeddings have been shown to capture semantic and syntactic relationships among words. Other examples include Node2Vec, Code2Vec, Sequence2Vec, etc. (collectively known as Any2Vec), with a wide variety of uses. We formulate the training of such applications as a graph problem and present GraphAny2Vec, a distributed Any2Vec training framework that leverages the state-of-the-art distributed heterogeneous graph analytics infrastructure developed in this dissertation to scale Any2Vec training to large distributed clusters. GraphAny2Vec also demonstrates a novel way of combining model gradients during training, which allows it to scale without losing accuracy.

Item: Elixir: synthesis of parallel irregular algorithms (2015-05)
Prountzos, Dimitrios; Pingali, Keshav; Misra, Jayadev; Batory, Don; Cook, William; Sagiv, Mooly; Gulwani, Sumit

Algorithms in new application areas like machine learning and data analytics usually operate on unstructured sparse graphs. Writing efficient parallel code to implement these algorithms is very challenging for a number of reasons. First, there may be many algorithms for solving a problem, and each algorithm may have many implementations. Second, synchronization, which is necessary for correct parallel execution, introduces potential problems such as data races and deadlocks. These issues interact in subtle ways, making the best solution dependent both on the parallel platform and on properties of the input graph. Consequently, implementing and selecting the best parallel solution can be a daunting task for non-experts, since we have few performance models for predicting the performance of parallel sparse graph programs on parallel hardware.
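The synchronization pitfalls of parallel graph programs (data races and deadlocks) can be illustrated with a minimal sketch. The tiny graph state, the relaxation operator, and the ordered-locking discipline are illustrative only, not code that Elixir generates:

```python
# Minimal sketch of the synchronization problem in parallel graph algorithms:
# concurrent relaxations of shared node data can race, and careless lock
# acquisition can deadlock. Graph state and operator are invented examples.
import threading

dist = {0: 0, 1: float("inf"), 2: float("inf")}   # tiny SSSP-style state
locks = {n: threading.Lock() for n in dist}

def relax(u, v, weight):
    """Atomically apply an edge-relaxation operator to nodes u and v."""
    first, second = sorted((u, v))      # fixed lock order prevents deadlock
    with locks[first], locks[second]:   # locking prevents data races on dist
        if dist[u] + weight < dist[v]:
            dist[v] = dist[u] + weight

relax(0, 1, 5)
relax(1, 2, 2)
print(dist)  # {0: 0, 1: 5, 2: 7}
```

Getting this discipline right by hand for every operator and schedule is exactly the burden that an automatic synthesis system aims to remove.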
This dissertation presents a synthesis methodology and a system, Elixir, that addresses these problems by (i) allowing programmers to specify solutions at a high level of abstraction, and (ii) generating many parallel implementations automatically and using search to find the best one. An Elixir specification consists of a set of operators capturing the main algorithm logic and a schedule specifying how to apply the operators efficiently. Elixir employs sophisticated automated reasoning to merge these two components, and uses techniques based on automated planning to insert synchronization and synthesize efficient parallel code. Experimental evaluation demonstrates that the performance of Elixir-generated code is competitive with, and can even outperform, hand-optimized code written by expert programmers for many interesting graph benchmarks.

Item: Explicit data graph compilation (2009-12)
Smith, Aaron Lee, 1977-; Burger, Douglas C., Ph. D.; John, Lizy K.; Keckler, Stephen W.; Lin, Calvin; McKinley, Kathryn S.

Technology trends such as growing wire delays, power consumption limits, and diminishing clock rate improvements present conventional instruction set architectures such as RISC, CISC, and VLIW with difficult challenges. To show continued performance growth, future microprocessors must exploit concurrency power-efficiently. An important question for any future system is the division of responsibilities among programmer, compiler, and hardware in discovering and exploiting concurrency. In this research we develop the first compiler for an Explicit Data Graph Execution (EDGE) architecture and show how to solve the new challenge of compiling to a block-based architecture. In EDGE architectures, the compiler is responsible for partitioning the program into a sequence of structured blocks that logically execute atomically. The EDGE ISA defines the structure of, and the restrictions on, these blocks.
The TRIPS prototype processor is an EDGE architecture that employs four restrictions on blocks, intended to strike a balance between software and hardware complexity: (1) fixed block sizes (maximum of 128 instructions); (2) a restricted number of loads and stores (no more than 32 may issue per block); (3) restricted register accesses (no more than eight reads and eight writes to each of four banks per block); and (4) a constant number of block outputs (each block must always generate a constant number of register writes and stores, plus exactly one branch). The challenges addressed in this thesis are twofold. First, we develop the algorithms and internal representations necessary to support the new structural constraints imposed by the block-based EDGE execution model. This first step provides correct execution and demonstrates the feasibility of EDGE compilers. Next, we show how to optimize blocks using a dataflow predication model and provide results showing how the compiler meets this challenge on the SPEC2000 benchmarks. Using basic blocks as the baseline, we show that optimizations using the dataflow predication model achieve up to 64% speedup on SPEC2000, with an average speedup of 31%.

Item: Exploring synergies between program synthesis and machine learning (2022-07-01)
Singh, Shikhar; Khurshid, Sarfraz; Garg, Vijay K; Vikalo, Haris; Julien, Christine; Zhang, Lingming

Program synthesis describes a family of techniques that automatically generate code according to user-provided specifications. Search-based program synthesis offers numerous benefits in developing reliable and performant software and has been deployed successfully in several domains. However, searching through intractable program spaces remains a daunting challenge, precluding the widespread adoption of this technique.
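The flavor of search-based synthesis can be shown with a toy enumerative synthesizer. The grammar, depth bound, and examples below are invented for illustration; real synthesizers prune the space far more aggressively, which is precisely where learned guidance can help:

```python
# Toy enumerative, search-based synthesizer: exhaustively generate expressions
# over a tiny grammar until one satisfies all input/output examples. Grammar,
# depth bound, and examples are invented; real tools prune far more heavily.
import itertools

def enumerate_programs(max_depth):
    """All expressions over x, 1, 2 combined with + and * up to max_depth."""
    terms = ["x", "1", "2"]
    for _ in range(max_depth):
        terms = terms + [f"({a} {op} {b})"
                         for a, b in itertools.product(terms, repeat=2)
                         for op in ("+", "*")]
    return terms

def synthesize(examples, max_depth=2):
    """Return the first enumerated program consistent with every example."""
    for prog in enumerate_programs(max_depth):
        if all(eval(prog, {"x": x}) == y for x, y in examples):
            return prog
    return None

print(synthesize([(1, 3), (2, 5)]))  # (x + (x + 1)), i.e. 2x + 1
```

Even this three-symbol grammar produces hundreds of candidates at depth two; the combinatorial explosion in realistic grammars is what makes unguided enumeration intractable.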
Machine learning has had remarkable success on incredibly challenging tasks, especially in computer vision and language translation. This thesis introduces two learning-guided techniques that make the program synthesis search more efficient at navigating the program space. In the domain of superoptimization, a solver-aided, search-based code optimization technique, we develop machine learning models that direct the enumerative search toward program spaces where an optimal solution is likely to exist. By integrating our models with a modern superoptimizer, we make the search more efficient by reducing enumeration overheads. Deep learning compilers also use a search-based approach to map computations onto target hardware. These mappings, also called schedules, specify how the loop nests underlying each layer are mapped onto the hardware, and the search must navigate an intractable space of schedules to determine an efficient mapping for a given deep learning model. We develop a graph convolutional network-based analytical cost model that enables the schedule search to estimate the performance of a candidate schedule without depending on slow and expensive executions on hardware. The graph-based representation enables us to surpass the accuracy of existing cost models used in state-of-the-art compilers. In addition to learning-guided techniques, we also explore parallel execution to address the problem of state-space explosion. We study this in the context of symbolic execution, a search-based software testing technique. In this setting, multiple executors work on non-overlapping regions of a program, covering more program space and finding bugs faster than a single instance. While machine learning can help alleviate challenges encountered in program synthesis, we can also use program synthesis to construct machine learning models that provide certain correctness guarantees.
We propose a novel approach to synthesizing binary classifiers, built using neural networks, from user-provided specifications. By adopting a synthesis-based approach instead of traditional learning from datasets, we can create networks that are correct by construction. The results show that our technique can create networks that are difficult to train using data.

Item: High-level programming languages for low-level programming (2021-01-21)
Peters, Arthur, 1984-; Rossbach, Christopher J.; Batory, Don S; Biros, George; Burtscher, Martin; Pingali, Keshav K

Many programming problems are assumed to require low-level programming approaches due to their highly specific requirements. As such, these problems are solved in low-level programming languages, which require the programmer to specify every detail of execution. This work challenges that viewpoint by developing a series of models and techniques that enable high-level programming techniques to be applied to low-level problems. OrcO generalizes the concept of objects to decouple them from the concurrent structure of the program. This allows programs to be expressed more naturally and to be more maintainable, because it eliminates adverse coupling between abstraction and concurrency: OrcO allows objects to hide their concurrent structure from clients, and allows clients to access objects in a concurrency-agnostic way. PorcE is a compiler and runtime for a high-level concurrent programming language that attempts to automatically eliminate excess concurrency to produce an optimized concurrent program. It hides the details of execution from the programmer and asks the programmer to write a maximally concurrent program for it to optimize. PorcE shows both the promise of deparallelizing for performance and the challenges of optimizing programs without programmer-provided insight into the program's structure. Lapis 2 is a specification language for concisely expressing the interaction semantics of low-level APIs.
Lapis 2 is based on Lapis 1, which was developed for AvA. Both can be used to automatically generate compatible API-remoting systems for low-level C APIs, and they can be extended to support virtualization-specific features such as resource usage tracking and migration of an application from one physical resource to another. Lapis 2 builds on Lapis 1 with a rule system that can abstract away low-level details while still allowing detailed control when required; the rule system dramatically reduces the size of API specifications. Lapis does not hide the low-level details, but instead makes it easy to express them clearly and concisely. As such, Lapis 2 enables techniques that are often restricted to high-level managed languages, such as compatible automatic API remoting, to be applied to low-level APIs written in C.

Item: Practical virtualization for reconfigurable accelerators (2024-05)
Landgraf, Joshua James; Rossbach, Christopher J.; Eric Schkufza; Emmett Witchel; Derek Chiou; Aditya Akella

Field-Programmable Gate Arrays (FPGAs) are reconfigurable accelerators that provide hardware-like efficiency with software-like customizability. FPGAs have become increasingly popular in modern applications and are already available on demand from major cloud providers. In recent years, technology advancements have greatly increased FPGA density, providing the capacity for both extremely large accelerators and multiple simultaneous accelerators. At the same time, FPGA platforms have grown increasingly diversified, providing a variety of sizes, architectures, I/O mechanisms, and memory options to meet many different application needs. However, FPGA adoption has been limited by the difficulty of deploying accelerators across the wide range of FPGA platforms that exist today.
Most platforms support only a single resident application, or statically partition FPGA resources into fixed-size slots, leading to reduced concurrency and wasted space. FPGAs also lack a truly general virtualization solution, making it difficult to manage contention for shared data center hardware or to migrate workloads between FPGAs from different vendors, especially when accelerators need to preserve their internal state. Furthermore, FPGA I/O interfaces are often exposed at a low level, making it difficult and time-consuming to create accelerators that are both performant and portable. This thesis covers four systems that make FPGAs more accessible by virtualizing their hardware and supporting OS abstractions. The first, AmorphOS, enables protected and flexible sharing of FPGA fabric for efficient use by multiple accelerators in the cloud. The second, Synergy, fully virtualizes FPGA logic via a combined compiler/runtime solution, enabling transparent multiplexing, suspend/resume, and live migration of accelerators across different hardware platforms. The third, FSRF, abstracts FPGA I/O for memory-mapped data by providing access to the file system through virtual memory with near-native performance. Finally, Starla abstracts FPGA I/O for streaming data sources while also enabling efficient and flexible inter-accelerator communication.
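The file-through-virtual-memory access model that FSRF provides to accelerators has a familiar software analogue in OS memory mapping. The sketch below uses the standard mmap facility purely to illustrate the access model; it is host software, not FSRF's FPGA-side implementation:

```python
# Software analogy for the FSRF access model: mapping a file into virtual
# memory so its bytes are addressable like ordinary memory, with no explicit
# read() calls. File path and contents here are invented for the example.
import mmap, os, tempfile

path = os.path.join(tempfile.mkdtemp(), "data.bin")
with open(path, "wb") as f:
    f.write(bytes(range(16)))          # a small file holding bytes 0..15

with open(path, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as view:
        byte4 = view[4]                # indexing reads straight from the map
        chunk = view[8:12]             # slicing returns ordinary bytes
print(byte4, chunk)
```

The appeal of this model is that consumers address file data with loads and stores rather than an I/O API, which is what lets FSRF deliver near-native performance to accelerators.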