Browsing by Subject "Parallel processing (Electronic computers)"
Now showing 1 - 12 of 12
Item
Architectural techniques to accelerate multimedia applications on general-purpose processors (2001-08)
Talla, Deependra, 1975-; John, Lizy Kurian

General-purpose processors (GPPs) have been augmented with multimedia extensions to improve performance on multimedia-rich workloads. These extensions operate in a single instruction multiple data (SIMD) fashion to extract data-level parallelism in multimedia and digital signal processing (DSP) applications. This dissertation consists of a comprehensive evaluation of the execution characteristics of multimedia applications on SIMD-enhanced GPPs, detection of bottlenecks in the execution of multimedia applications on SIMD-enhanced GPPs, and the design and implementation of architectural techniques that eliminate or alleviate the impact of those bottlenecks to accelerate multimedia applications. The dissertation identifies several bottlenecks in the processing of SIMD-enhanced multimedia and DSP applications on GPPs. It finds that approximately 75-85% of the instructions in the dynamic instruction stream of media workloads perform no useful computation; they merely support the useful computations by performing address generation, address transformation/data reorganization, loads/stores, and loop branches. This leads to an underutilization of the SIMD computation units, with only 1-12% of the peak SIMD throughput being achieved. The dissertation proposes hardware support to efficiently execute these overhead/supporting instructions by overlapping them with the useful computation instructions. A 2-way GPP with SIMD extensions augmented with the proposed MediaBreeze hardware significantly outperforms a 16-way SIMD GPP without MediaBreeze hardware on multimedia kernels. On multimedia applications, a 2-/4-way SIMD GPP augmented with MediaBreeze hardware is superior to a 4-/8-way SIMD GPP without it. These performance improvements are achieved at an area cost of less than 0.3% of current GPPs and a power cost of less than 1% of total processor power, without lengthening the processor's critical path.
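As a rough, hypothetical illustration of the overhead instructions the abstract above quantifies, consider a saturating-add kernel written in C with SSE2 intrinsics (the kernel itself is not from the dissertation): in each vector step only one instruction performs useful SIMD computation, while the loads, the store, the index update, and the loop branch exist purely to support it.

    #include <emmintrin.h>  /* SSE2 intrinsics */

    /* Hypothetical media kernel: saturating 8-bit addition of two arrays.
       Per 16-byte step, one instruction (_mm_adds_epu8) does the useful
       SIMD work; the two loads, the store, the index update, and the
       loop branch are the "supporting" instructions the dissertation
       finds dominating (75-85% of) the dynamic instruction stream. */
    void add_sat_u8(const unsigned char *a, const unsigned char *b,
                    unsigned char *dst, int n)
    {
        for (int i = 0; i + 16 <= n; i += 16) {                   /* loop branch */
            __m128i va = _mm_loadu_si128((const __m128i *)(a + i)); /* load  */
            __m128i vb = _mm_loadu_si128((const __m128i *)(b + i)); /* load  */
            __m128i vs = _mm_adds_epu8(va, vb);    /* useful SIMD computation */
            _mm_storeu_si128((__m128i *)(dst + i), vs);             /* store */
        }
    }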
Item
Compiler directed speculation for embedded clustered EPIC machines (2004)
Pillai, Satish; Jacome, Margarida F.

Very Large Instruction Word (VLIW)/Explicitly Parallel Instruction Computing (EPIC) processors are a very attractive platform for many of today's multimedia and communications applications. In particular, clustered VLIW/EPIC machines can take aggressive advantage of the available instruction-level parallelism (ILP) while maintaining high energy-delay efficiency. However, multicluster machines are more challenging compilation targets than centralized machines. In this thesis, we propose a novel compiler-directed, resource-aware ILP extraction technique, called predicated switching, targeted at such multicluster VLIW/EPIC machines. The proposed technique integrates three powerful ILP extraction techniques, namely predication, speculation, and software pipelining, in a combined framework. The three novel contributions of this dissertation are: (1) a compiler transformation, denoted Static Single Assignment - Predicated Switching (SSA-PS), that leverages required data transfers between clusters for performance gains; (2) a static speculation algorithm that decides which specific kernel operations should actually be speculated in a region of code (hyperblock), possibly while being simultaneously software pipelined, so as to maximize execution performance on the target processor; and (3) an ILP extraction flow incorporating several code generation phases critical to profitable ILP extraction by the compiler. Experimental results on a representative set of time-critical kernels compiled for a number of target machines show that, compared to two baseline "resource-unaware" speculation techniques (one that speculates aggressively and one that speculates conservatively), predicated switching improves performance with respect to at least one of the baselines in 65% of the cases, by up to 50%. Moreover, we show that code size and register pressure are not adversely affected by our technique. Finally, we show that our ILP extraction framework, combining speculation and software pipelining, can effectively exploit the relative merits of both techniques.
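To make the predication component concrete, here is a generic if-conversion sketch in C (this shows the general transformation that predication enables, not the SSA-PS algorithm itself): both arms of a branch become straight-line, predicate-guarded code that a VLIW/EPIC scheduler can overlap with surrounding operations.

    /* Branchy original: the two arms are control-dependent on p. */
    int select_branchy(int p, int x, int y)
    {
        int r;
        if (p)
            r = x + 1;
        else
            r = y - 1;
        return r;
    }

    /* If-converted form: both arms execute unconditionally (speculation)
       and a predicate selects the result, typically compiled to a
       conditional move rather than a branch. */
    int select_predicated(int p, int x, int y)
    {
        int then_val = x + 1;            /* executed regardless of p */
        int else_val = y - 1;            /* executed regardless of p */
        return p ? then_val : else_val;  /* predicate selects        */
    }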
Item
Design and evaluation of a technology-scalable architecture for instruction-level parallelism (2007)
Nagarajan, Ramadass, 1977-; Burger, Douglas C., Ph. D.

Future performance improvements must come from the exploitation of concurrency at all levels. Recent approaches that focus on thread-level and data-level concurrency are a natural fit for certain application domains, but it is unclear whether they can be adapted efficiently to eliminate serial bottlenecks. Conventional superscalar hardware that instead focuses on instruction-level parallelism (ILP) is limited by power inefficiency, on-chip wire latency, and design complexity. Ultimately, poor single-thread performance and Amdahl's law will inhibit overall performance growth even on parallel workloads. To address this problem, we undertook the challenge of designing a scalable, wide-issue, large-window processor that mitigates complexity, reduces power overheads, and exploits ILP to improve single-thread performance at future wire-delay-dominated technologies. This dissertation describes the design and evaluation of the TRIPS architecture for exploiting ILP. The TRIPS architecture belongs to a new class of instruction set architectures called Explicit Data Graph Execution (EDGE) architectures, which use large dataflow graphs of computation and explicit producer-consumer communication to express concurrency to the hardware. We describe how these architectures match the characteristics of future sub-45 nm CMOS technologies to mitigate complexity and improve concurrency at reduced overheads. We describe the architectural and microarchitectural principles of the TRIPS architecture, which exploits ILP by issuing instructions widely, in dynamic dataflow fashion, from a large distributed window of instructions. We then describe our specific contributions to the development of the TRIPS prototype chip, which was implemented in a 130 nm ASIC technology and consists of more than 170 million transistors. In particular, we describe the implementation of the distributed control protocols that offer various services for executing a single program in the hardware. Finally, we describe a detailed evaluation of the TRIPS architecture and identify the key determinants of its performance. In particular, we describe the development of the infrastructure required for a detailed analysis, including a validated performance model, a highly optimized suite of benchmarks, and critical-path models that identify architectural and microarchitectural bottlenecks at a fine level of granularity. On a set of highly optimized benchmark kernels, the manufactured TRIPS parts outperform a conventional superscalar processor by a factor of 3× on average. We find that automatically compiled versions of the same kernels have yet to reap the benefits of the high-ILP TRIPS core, but exceed the performance of the superscalar processor in many cases. Our results indicate that the overhead of the various control protocols that manage overall execution in the processor has only a modest effect on performance. However, operand communication between components of the distributed microarchitecture accounts for nearly a third of the execution cycles. Fanout instructions, which are necessitated by the limited, fixed-width encoding of the dataflow instruction set, also contribute non-trivial performance overheads. Our results point to an exciting line of future research to overcome these limitations and achieve low-overhead distributed dataflow execution.
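The EDGE execution model described above can be illustrated with a toy dataflow interpreter in C (a sketch of the general dataflow model, not the TRIPS ISA; all names are invented for illustration): each instruction names its consumers explicitly, operands are delivered point to point, and an instruction fires as soon as its last operand arrives.

    #include <stdio.h>

    enum op { ADD, MUL, OUT };

    /* Toy "instruction": encodes its consumer explicitly, EDGE-style,
       rather than naming source registers. */
    struct insn {
        enum op op;
        int nwait;       /* operands still outstanding         */
        int operand[2];  /* operand buffer                     */
        int target;      /* consumer instruction (-1 = none)   */
        int slot;        /* which operand slot of the consumer */
    };

    static struct insn g[4];

    static void deliver(int t, int slot, int v);

    static void fire(int i)
    {
        int v = 0;
        switch (g[i].op) {
        case ADD: v = g[i].operand[0] + g[i].operand[1]; break;
        case MUL: v = g[i].operand[0] * g[i].operand[1]; break;
        case OUT: printf("%d\n", g[i].operand[0]); return;
        }
        if (g[i].target >= 0)
            deliver(g[i].target, g[i].slot, v);  /* producer-consumer send */
    }

    static void deliver(int t, int slot, int v)
    {
        g[t].operand[slot] = v;
        if (--g[t].nwait == 0)
            fire(t);                             /* dataflow firing rule */
    }

    int main(void)
    {
        /* (a+b)*(c+d): two ADDs feed one MUL, which feeds OUT. */
        g[0] = (struct insn){ADD, 2, {0, 0},  2, 0};
        g[1] = (struct insn){ADD, 2, {0, 0},  2, 1};
        g[2] = (struct insn){MUL, 2, {0, 0},  3, 0};
        g[3] = (struct insn){OUT, 1, {0, 0}, -1, 0};
        deliver(0, 0, 1); deliver(0, 1, 2);      /* a = 1, b = 2; prints 21 */
        deliver(1, 0, 3); deliver(1, 1, 4);      /* c = 3, d = 4            */
        return 0;
    }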
Item
Dynamic, distributed resource allocation on regular SW-banyans (1986)
Feo, John T., 1955-; Browne, James C.

In order to achieve the orders-of-magnitude increases in speed and fault tolerance desired by many of today's users, it is necessary to execute the parallel subtasks of processes concurrently. A class of computer systems able to achieve such high-performance parallel computing is configurable multiprocessors. Such systems can be both flexible and fast; however, the degree to which they are is determined by the properties of their interconnection networks and by the efficiency and robustness of the algorithms used to configure the networks into disjoint subsystems, each with the resources and the architecture desired by the intended user. This dissertation establishes a basis for configuring regular SW-banyans (a large class of configurable networks) that is both theoretically sound and practical in the context of an existing system, such as the Texas Reconfigurable Array Computer. In addition, specific algorithms to construct a wide variety of architectures on regular SW-banyans are presented. These algorithms are dynamic, distributed, of complexity N log N or better, and work correctly in the presence of faults.

Item
Instruction history management for high-performance microprocessors (2003)
Bhargava, Ravindra Nath; John, Lizy Kurian

History-driven dynamic optimization is an important factor in improving instruction throughput in future high-performance microprocessors. History-based techniques can improve instruction-level parallelism by breaking program dependencies, eliminating long-latency microarchitecture operations, and improving prioritization within the microarchitecture. However, a combination of factors, such as wider issue widths, smaller transistors, larger die area, and increasing clock frequency, has led to microprocessors that are sensitive to both wire delays and energy consumption. In this environment, the global structures and long-distance communication that characterize current history data management are limiting instruction throughput. This dissertation proposes the ScatterFlow Framework for Instruction History Management. Execution history management tasks, such as history data storage, access, distribution, collection, and modification, are partitioned and dispersed throughout the instruction execution pipeline. History data packets are associated with active instructions and flow with them as they execute, encountering the history management tasks along the way. Between dynamic instances of the instructions, the history data packets reside in trace-based history storage that is synchronized with the instruction trace cache. Compared to traditional history data management, the ScatterFlow method improves instruction coverage, increases history data access bandwidth, shortens communication distances, improves history data accuracy in many cases, and decreases the effective history data access time. A comparison of general history management effectiveness between the ScatterFlow Framework and traditional hardware tables shows that the ScatterFlow Framework provides superior history maturity and instruction coverage. The unique properties that arise from trace-based history storage and partitioned history management are analyzed, and novel design enhancements are presented to increase the usefulness of instruction history data within the ScatterFlow Framework. To demonstrate the potential of the proposed framework, specific dynamic optimization techniques are implemented using it. These illustrative examples combine the history capture advantages with the access latency improvements while exhibiting desirable dynamic energy consumption properties. Compared to a traditional table-based predictor, ScatterFlow value prediction improves execution time and reduces dynamic energy consumption. In other detailed examples, ScatterFlow-enabled cluster assignment demonstrates improved execution time over previous cluster assignment schemes, and ScatterFlow instruction-level profiling detects more useful execution traits than traditional fixed-size and infinite-size hardware tables.
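A minimal sketch of the history-data-packet idea described above, with entirely hypothetical fields (none of the names below are taken from the dissertation): the per-instruction history rides through the pipeline with the instruction instead of living in a centralized table.

    /* Hypothetical history packet that flows with an in-flight
       instruction; the fields are illustrative placeholders only. */
    struct history_packet {
        unsigned int  last_value;  /* e.g. for value prediction       */
        signed char   stride;      /* observed stride between values  */
        unsigned char confidence;  /* saturating confidence counter   */
        unsigned char cluster;     /* preferred execution cluster     */
    };

    /* The packet accompanies the instruction through the pipeline
       stages and, between dynamic instances, resides in storage
       synchronized with the instruction trace cache. */
    struct inflight_insn {
        unsigned int          pc;
        struct history_packet hist;
    };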
Item
Jack Rabbit : an effective Cell BE programming system for high performance parallelism (2011-05)
Ellis, Apollo Isaac Orion; Lin, Yun Calvin; Fussell, Donald S., 1951-

The Cell processor is an example of the trade-offs made when designing a mass-market, power-efficient multi-core machine, but the machine-exposing architecture and raw communication mechanisms of Cell are hard for a programmer to manage. Cell's simple design pushes complexity into software, particularly in achieving low threading overhead, good bandwidth efficiency, and load balance. Several attempts have been made to produce efficient and effective programming systems for Cell, but these attempts have been too specialized and thus fall short. We present Jack Rabbit, an efficient thread pool work queue implementation with load balancing mechanisms and double buffering. Our system incurs low threading overhead, achieves good load balance and bandwidth efficiency, and represents a step towards an effective way to program Cell and any similar current or future processors.

Item
Load balancing strategies for parallel architectures (2003)
Iqbal, Saeed; Jacome, Margarida F.; Carey, Graham F.

Item
Matching algorithms with parallel architectures : a quantitative approach (1995-12)
Wang, Dze-chaung; Not available

Item
Parallel machine scheduling with time windows (2004)
Rojanasoonthon, Siwate; Bard, Jonathan F.

This dissertation addresses the problem of scheduling a set of jobs with multiple priorities on nonhomogeneous parallel machines. The application of interest involves the Tracking and Data Relay Satellite System (TDRSS) run by the U.S. National Aeronautics and Space Administration. This system acts as a relay platform for Earth-orbiting vehicles that wish to communicate periodically with ground stations. An additional feature of the problem is that each job falls into one of ρ priority classes. The objective is to maximize the number of jobs scheduled, where a job in a higher priority class has infinitely more value than a job in a lower priority class. The problem is introduced and then compared to other more common scheduling and routing problems. A formal proof that the problem is NP-hard is presented. Next, mixed-integer linear programming formulations are explained. The difficulty experienced in solving even small instances with CPLEX led to the development of a dynamic programming-like heuristic and a greedy randomized adaptive search procedure (GRASP). The GRASP consists of two phases. The first phase produces feasible solutions by ranking each job with a greedy function and then selecting one at random from a restricted candidate list; the process is repeated until no more jobs can be scheduled. The second phase seeks a local optimum by searching over a series of neighborhoods defined by job insertions and exchanges. Each is described in some detail and then compared using data from a typical busy-day scenario. Extensive testing was done on a series of problems randomly generated to reflect a wide range of job characteristics. An exact method was also developed: the problem is reformulated as a set packing problem and a branch-and-price method is applied. The GRASP was used to provide the initial starting solution as well as lower bounds during the branching process. Two methods of branching are presented, one based on time windows and the other based on the special ordered sets associated with one of the constraints. The proposed procedure was found to be very effective, providing the true optimum for instances with up to 100 jobs and 2 machines. An additional contribution of this work is the development of a technique to randomly generate problem sets with real-world attributes.
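A compact C sketch of the GRASP construction phase just described, under assumed helpers (greedy_score, try_schedule, and the alpha parameter controlling candidate-list size are all hypothetical): jobs are ranked by a greedy function, the best fraction forms the restricted candidate list (RCL), and one candidate is chosen at random until the job pool is exhausted.

    #include <stdlib.h>

    extern double greedy_score(int job);  /* higher = more attractive  */
    extern int    try_schedule(int job);  /* 1 if the job fit a machine */

    void grasp_construct(int *jobs, int njobs, double alpha)
    {
        int remaining = njobs;
        while (remaining > 0) {
            /* Rank remaining jobs by greedy score, best first
               (selection sort, for brevity only). */
            for (int i = 0; i < remaining; ++i)
                for (int j = i + 1; j < remaining; ++j)
                    if (greedy_score(jobs[j]) > greedy_score(jobs[i])) {
                        int t = jobs[i]; jobs[i] = jobs[j]; jobs[j] = t;
                    }
            int rcl = (int)(alpha * remaining);  /* restricted candidate list */
            if (rcl < 1)
                rcl = 1;
            int pick = rand() % rcl;             /* randomized greedy choice  */
            int job  = jobs[pick];
            jobs[pick] = jobs[--remaining];      /* remove from the pool      */
            (void)try_schedule(job);  /* jobs that no longer fit are simply
                                         passed over, matching the
                                         maximize-jobs-scheduled objective */
        }
    }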
Item
Performance enhancing software loop transformations for embedded VLIW/EPIC processors (2001-12)
Akturan, Cagdas, 1973-; Jacome, Margarida F.

Software pipelining is a performance-enhancing loop optimization technique widely used in optimizing compilers. This technique is particularly effective in the context of multimedia and signal processing embedded applications, since the time-critical segments of such applications are typically loops. Although software pipelining can dramatically increase the performance of a large segment of today's embedded applications market, it has two important potential drawbacks. First, it may lead to a significant increase in code size, and thus to a costly increase in program memory requirements. Second, it typically increases register pressure. In the context of register-limited embedded processors, such an increase may lead to more spills to memory, and thus to significant performance degradation. In this research, we studied the difficult cost-performance demands posed by the embedded systems market and developed effective performance-enhancing loop optimization techniques and algorithms that directly take these two critical cost factors into consideration. In this dissertation we propose a novel software pipelining framework suitable for compilers targeting clustered embedded VLIW/EPIC processors. The key difference between our approach and previous ones is that our software pipelining framework can handle code size constraints along with latency and resource constraints, while minimizing the increase in register pressure (register file size requirements) typically incurred by software pipelining. This powerful and unique combination of optimization features allows embedded system designers to perform compiler-assisted exploration of "Pareto optimal" points with respect to performance, code size, and register requirements, all important figures of merit for embedded software.
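As a hand-worked illustration of software pipelining and the two costs the abstract highlights (a generic sketch in C, not the dissertation's framework): the rewritten loop overlaps the load of iteration i+1 with the compute and store of iteration i; the extra prologue and epilogue code is the code-size cost, and the value kept live across iterations in t is the register-pressure cost.

    /* Original loop: each iteration is a serial load, multiply, store. */
    void scale(const int *a, int *b, int n, int k)
    {
        for (int i = 0; i < n; ++i)
            b[i] = a[i] * k;
    }

    /* Software-pipelined by hand: the load for iteration i+1 overlaps
       the compute/store of iteration i. The prologue and epilogue show
       the code-size growth; the cross-iteration value `t` shows the
       added register pressure. */
    void scale_pipelined(const int *a, int *b, int n, int k)
    {
        if (n <= 0)
            return;
        int t = a[0];                 /* prologue: prime the pipeline  */
        for (int i = 0; i < n - 1; ++i) {
            int next = a[i + 1];      /* load for iteration i + 1      */
            b[i] = t * k;             /* compute/store for iteration i */
            t = next;
        }
        b[n - 1] = t * k;             /* epilogue: drain the pipeline  */
    }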
Item
Polymorphous architectures: a unified approach for extracting concurrency of different granularities (2006)
Sankaralingam, Karthikeyan; Keckler, Stephen W.

Processor architects today are faced with two daunting challenges: emerging applications with heterogeneous computation needs, and technology limitations of power, wire delay, and process variation. Designing multiple application-specific processors or specialized architectures introduces design complexity, creates a software programmability problem, and reduces economies of scale. There is a pressing need for design methodologies that can provide support for heterogeneous applications, combat processor complexity, and achieve economies of scale. In this dissertation, we introduce the notion of architectural polymorphism to build scalable processors that support heterogeneous computation by supporting different granularities of parallelism. Polymorphism configures coarse-grained microarchitecture blocks to provide an adaptive and flexible processor substrate. Technology scalability is achieved by designing an architecture using scalable and modular microarchitecture blocks. We use the dataflow graph as the unifying abstraction layer across three granularities of parallelism: instruction-level, thread-level, and data-level. To first order, this granularity of parallelism is the main difference between different classes of applications. All programs are expressed in terms of dataflow graphs and directly mapped to the hardware, appropriately partitioned as required by the granularity of parallelism. We introduce Explicit Data Graph Execution (EDGE) ISAs, a class of ISAs that serve as an architectural solution for efficiently expressing parallelism and building technology-scalable architectures. We developed the TRIPS architecture, which implements an EDGE ISA using a heavily partitioned and distributed microarchitecture to achieve technology scalability. The two most significant features of the TRIPS microarchitecture are its heavily partitioned, modular design and its use of microarchitecture networks for communication across modules. We have also built a prototype TRIPS chip in 130 nm ASIC technology, composed of two processor cores and a distributed 1 MB Non-Uniform Cache Access (NUCA) on-chip memory system. Our performance results show that the TRIPS microarchitecture, which provides a 16-issue machine with a 1024-entry instruction window, can sustain good instruction-level parallelism. On a set of hand-optimized kernels, IPCs in the range of 4 to 6 are seen; on a set of benchmarks with ample data-level parallelism (DLP), compiler-generated code produces IPCs in the range of 1 to 4. On the EEMBC and SPEC CPU2000 benchmarks we see IPCs in the range of 0.5 to 2.3. Compared to the Alpha 21264, a high-performance architecture tuned for ILP, TRIPS is up to 3.4 times better on the hand-optimized kernels. However, compiler-generated binaries for the DLP, EEMBC, and SPEC CPU2000 benchmarks perform worse on TRIPS than on the Alpha 21264. With more aggressive compiler optimization we expect the performance of the compiler-produced binaries to improve. The polymorphous mechanisms proposed in this dissertation are effective at exploiting thread-level parallelism and data-level parallelism. When executing four threads on a single processor, significantly high levels of processor utilization are seen; IPCs are in the range of 0.7 to 3.9 for an application mix consisting of EEMBC and SPEC CPU2000 workloads. When executing programs with DLP, the polymorphous mechanisms we propose provide harmonic mean speedups of 2.1X across a set of DLP workloads, compared to an execution model that exploits only ILP. Compared to specialized architectures, these mechanisms provide competitive performance using a single execution substrate.
Item
Using parallel computation to apply the singular value decomposition (SVD) in solving for large Earth gravity fields based on satellite data (2004)
Hinga, Mark Brandon; Tapley, Byron D.

Using satellite data alone to estimate an Earth gravity field introduces an ill-conditioned system of equations. This mathematical difficulty grows as the number of unknown gravity field parameters increases, requiring stabilization of the inversion. The number of parameters to be estimated can also be too large to allow inversion with a sequential (single-processor) algorithm. The challenge is therefore two-fold: a stabilized inversion must be performed with a parallel (multi-processor) algorithm. To this end, new code was developed in the parallel computing infrastructure of the Parallel Linear Algebra Package (PLAPACK) to apply the Singular Value Decomposition (SVD) to invert for (and stabilize) very large gravity fields of well over 25,000 unknown parameters. This new code is named PLASS (Parallel LArge Svd Solver). The SVD was chosen because it offers multiple stabilization techniques: poorly observed parameter corrections are removed from the culpable eigenspace of the normal matrix (CHAMP) or from the singular vector space of the upper triangular matrix R (GRACE). Solutions were stabilized by removing eigenvalues or singular values according to four standard optimization criteria (inspection, relative error, norm-norm minimization, and the trace of the Mean Square Error (MSE) matrix), plus a fifth method, introduced independently in this investigation, that optimizes the removal of eigenvalues or singular values based on Kaula's power rule of thumb. This method is named the "Kaula Eigenvalue (KEV) or Kaula Singular Value (KSV) relation". For the gravity fields of this investigation, orbital fits, geodetic evaluations, and error propagations of the best of the resulting SVD gravity fields were performed and shown to be comparable to the CHAMP solution obtained by the GeoForschungsZentrum (GFZ) and to the full-rank GRACE solution obtained by the Center for Space Research (CSR).
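A serial C sketch of the truncation step underlying this kind of stabilization (the parallel PLASS implementation is not shown, and a simple threshold stands in for the five removal criteria the abstract lists): given a precomputed thin SVD A = U diag(s) V^T, directions with small singular values are discarded and the stabilized solution is accumulated as x = sum over retained i of (u_i^T b / s_i) v_i.

    /* Truncated-SVD solve: U is m x n (column i = u_i), s holds the n
       singular values, V is n x n (column i = v_i), b is the m-vector
       of observations; both matrices are stored row-major. Directions
       whose singular value falls at or below `cutoff` are removed,
       illustrating the stabilization by eigenvalue/singular-value
       removal described above. */
    void tsvd_solve(int m, int n, const double *U, const double *s,
                    const double *V, const double *b,
                    double cutoff, double *x)
    {
        for (int j = 0; j < n; ++j)
            x[j] = 0.0;
        for (int i = 0; i < n; ++i) {
            if (s[i] <= cutoff)
                continue;                  /* drop ill-determined direction */
            double c = 0.0;
            for (int k = 0; k < m; ++k)    /* c = u_i^T b                   */
                c += U[k * n + i] * b[k];
            c /= s[i];                     /* scale by 1 / s_i              */
            for (int j = 0; j < n; ++j)
                x[j] += c * V[j * n + i];  /* x += c * v_i                  */
        }
    }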