Browsing by Subject "High performance computing"
Now showing 1 - 14 of 14
Item A computational framework for the solution of infinite-dimensional Bayesian statistical inverse problems with application to global seismic inversion (2015-08) Martin, James Robert, Ph. D.; Ghattas, Omar N.; Biros, George; Demkowicz, Leszek; Fomel, Sergey; Marzouk, Youssef; Moser, Robert
Quantifying uncertainties in large-scale forward and inverse PDE simulations has emerged as a central challenge facing the field of computational science and engineering. The promise of modeling and simulation for prediction, design, and control cannot be fully realized unless uncertainties in models are rigorously quantified, since this uncertainty can potentially overwhelm the computed result. While statistical inverse problems can be solved today for smaller models with a handful of uncertain parameters, this task is computationally intractable using contemporary algorithms for complex systems characterized by large-scale simulations and high-dimensional parameter spaces. In this dissertation, I address issues regarding the theoretical formulation, numerical approximation, and algorithms for the solution of infinite-dimensional Bayesian statistical inverse problems, and apply the entire framework to a problem in global seismic wave propagation. Classical (deterministic) approaches to solving inverse problems attempt to recover the “best-fit” parameters that match the given observation data, as measured in a particular metric. In the statistical inverse problem, we go one step further and return not only a point estimate of the best medium properties but also a complete statistical description of the uncertain parameters. The result is a posterior probability distribution that describes our state of knowledge after learning from the available data and provides a complete description of parameter uncertainty. This dissertation describes a computational framework for such problems that wraps around existing forward solvers, as long as they are appropriately equipped, for a given physical problem. A collection of tools, insights, and numerical methods can then be applied to solve the problem and to interrogate the resulting posterior distribution, which describes our final state of knowledge. We demonstrate the framework with numerical examples, including inference of a heterogeneous compressional wavespeed field for a problem in global seismic wave propagation with 10⁶ parameters.
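The posterior distribution referred to in this abstract follows from Bayes' rule. A minimal sketch of the finite-dimensional form, assuming an additive Gaussian noise model (a common choice; the abstract does not state the specific likelihood used):

```latex
% m = uncertain parameters, d = observed data, F(m) = forward PDE model
\pi_{\mathrm{post}}(m \mid d) \;\propto\; \pi_{\mathrm{like}}(d \mid m)\,\pi_{\mathrm{prior}}(m),
\qquad
\pi_{\mathrm{like}}(d \mid m) \;\propto\;
\exp\!\Big( -\tfrac{1}{2}\,\big\| F(m) - d \big\|_{\Gamma_{\mathrm{noise}}^{-1}}^{2} \Big)
```

Under this view, the classical “best-fit” estimate is the maximizer of the posterior (the MAP point), whereas the framework above characterizes the full distribution around it.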
Item Development of a framework for parallel reservoir simulation (2017-09-15) Barrios Molano, Hector Emilio; Sepehrnoori, Kamy, 1951-
Parallel reservoir simulation is a topic of special interest to reservoir engineers and reservoir simulator developers. Parallel reservoir simulators provide several advantages over non-parallel reservoir simulators, such as:
• the capability to run bigger models;
• the capability to obtain simulation results faster by using several processing units at once;
• no limitation to a single computer's memory, since available memory increases as more computers are used.
All of these are compelling reasons for reservoir engineers. For reservoir simulator developers, however, creating a parallel reservoir simulator is a more complex task than creating a non-parallel one. Problems related to parallel implementation, such as parallel communication, division of the model among processors, and management of data distributed among processors, must be addressed and solved on top of the already complex task of simulator development. Hence, developing a parallel reservoir simulator is more time intensive than traditional development on single-processor computers. The objective of this work is to separate the development focus of parallel reservoir simulators into two parts: parallel development and reservoir simulator development. To achieve this separation, a parallel framework was developed. The framework implements and handles the parallel complexity and provides easy-to-use programming interfaces to accelerate the development of new parallel reservoir simulators or the parallelization of existing ones. The University of Texas Compositional Simulator (UTCOMP) was used with the framework to create a new parallel reservoir simulator. Several cases were used to verify accuracy, assess usability, and test the parallel performance of the new simulator. The parallel reservoir simulator developed in this work has all of UTCOMP's features and is able to run models with up to 102.4 million cells using up to 1,024 processors.

Item Distributed selective re-execution for EDGE architectures (2005) Desikan, Rajagopalan; Burger, Douglas C., Ph. D.
Speculation is a key technique that modern processors use to achieve high performance. Traditionally, speculation meant control speculation, in which the processor predicts the outcome of control instructions when they are fetched and validates the prediction when the instructions are executed. More recently, processors have adopted another form of speculation, called data speculation, to improve performance. Data speculation involves predicting the data values produced by instructions and forwarding the predicted values to consumers in the data-flow graph. For both control and data speculation, mis-speculation recovery is required when the speculation is incorrect. The conventional mechanism for mis-speculation recovery consists of flushing the processor pipeline of all incorrect state and restarting execution from the corrected state. However, pipeline flushes have become increasingly expensive in modern microprocessors with large instruction windows and deep pipelines. Selective re-execution is a technique that can reduce the penalty of mis-speculation recovery by re-executing only the instructions that received incorrect values due to the mis-speculation. Conventional mechanisms to implement selective re-execution have had limited success because of the enormous complexity involved in the implementation. In this dissertation, we introduce a new selective re-execution mechanism that exploits the properties of a data-flow-like Explicit Data Graph Execution (EDGE) architecture to support efficient mis-speculation recovery while scaling to large window sizes. This Distributed Selective Re-Execution (DSRE) mechanism permits multiple speculative waves of computation to traverse a data-flow graph simultaneously. The mechanism has no centralized state and uses simple state bits to determine which instructions to re-execute on a mis-speculation, thus reducing the complexity of selective re-execution. We evaluate DSRE as a recovery mechanism for load-store dependence mis-speculation on a high-level EDGE architecture simulator, the Grid Processor Architecture (GPA) simulator, and on the more detailed TRIPS prototype processor simulator. DSRE provides 17% and 4.2% speedup, respectively, over dependence prediction on the two simulators. Our results show that DSRE needs to be used in conjunction with pipeline flushing to achieve high performance. Predictors need to be aware of the costs associated with each mechanism and use the appropriate recovery mechanism for each speculation.
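The core idea of selective re-execution can be illustrated with a small software analogy (a hypothetical sketch, not the hardware mechanism described in the dissertation): when a speculated value turns out to be wrong, only the operations downstream of it in the dataflow graph are recomputed, instead of flushing and redoing everything.

```python
from collections import defaultdict, deque

class DataflowGraph:
    """Toy dataflow graph: each node computes a value from its inputs."""

    def __init__(self):
        self.ops = {}                        # node -> (function, input nodes)
        self.consumers = defaultdict(list)   # node -> nodes that read its value
        self.values = {}                     # node -> most recently computed value

    def add(self, name, fn, inputs=()):
        self.ops[name] = (fn, list(inputs))
        for src in inputs:
            self.consumers[src].append(name)

    def execute(self, name):
        fn, inputs = self.ops[name]
        self.values[name] = fn(*(self.values[i] for i in inputs))

    def selective_reexecute(self, mispredicted, correct_value):
        """Repair only the nodes downstream of a mis-speculated value.

        Repair order is simplified here; real mechanisms respect dataflow
        order so that each node sees fully repaired inputs.
        """
        self.values[mispredicted] = correct_value
        worklist = deque(self.consumers[mispredicted])
        redone = set()
        while worklist:
            node = worklist.popleft()
            if node in redone:
                continue
            redone.add(node)
            self.execute(node)               # recompute with the corrected input
            worklist.extend(self.consumers[node])
        return redone                        # nodes that were re-executed
```

Nodes that never depended on the mis-speculated value are untouched, which is the source of the savings over a full pipeline flush.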
Item Fast algorithms for biophysically-constrained inverse problems in medical imaging (2017-07-20) Gholaminejad, Amir; Biros, George; Oden, J. Tinsley; Ghattas, Omar; van de Geijn, Robert; Vuduc, Richard; Ren, Kui
We present algorithms and software for parameter estimation for forward and inverse tumor growth problems and for diffeomorphic image registration. Our methods target the following scenarios: automatic registration of healthy images to tumor-bearing medical images, and parameter estimation/calibration of tumor models. This thesis focuses on robust and scalable algorithms for these problems. Although the proposed framework applies to many problems in oncology, we focus on primary brain tumors, in particular low- and high-grade gliomas. For the tumor model, the main quantity of interest is the extent of tumor infiltration into the brain beyond what is visible in imaging. The inverse tumor problem assumes that we have patient images at two (or more) well-separated times so that we can observe the tumor growth. The inverse problem also requires that the two images be segmented, but in a clinical setting such information is usually not available; in a typical case, we have only multimodal magnetic resonance images with no segmentation. We address this lack of information by solving a coupled inverse registration and tumor problem. The role of image registration is to find a plausible mapping between the patient's tumor-bearing image and a normal brain (atlas) with known segmentation. Solving this coupled inverse problem has a prohibitive computational cost, especially in 3D. To address this challenge we have developed novel schemes, scaled up to 200K cores. Our main contribution is the design and implementation of fast solvers for these problems. We also study the performance of the tumor parameter estimation and registration solvers and their algorithmic scalability. In particular, we introduce the following novel algorithms: an adjoint formulation for tumor-growth problems with and without mass effect; the first parallel 3D Newton-Krylov method for large diffeomorphic image registration; a novel parallel semi-Lagrangian algorithm for solving advection equations in image registration, with a parallel implementation on shared- and distributed-memory architectures; and Accelerated FFT (AccFFT), an open-source parallel FFT library for CPUs and GPUs, scaled up to 131,000 cores, with optimized kernels for computing spectral operators. The scientific outcomes of this thesis have appeared in the proceedings of three ACM/IEEE SCxy conferences (two best student paper finalists and one ACM SRC gold medal), two journal papers, two papers in review, four papers in preparation (coupling, mass effect, segmentation, and multi-species tumor model), and seven conference presentations.
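As a small illustration of the kind of spectral operator an FFT library like the one mentioned above is used for (a generic NumPy sketch, not AccFFT's actual API): applying a Laplacian to a periodic 3D field by multiplying in Fourier space.

```python
import numpy as np

def spectral_laplacian(f, lengths=(2 * np.pi, 2 * np.pi, 2 * np.pi)):
    """Apply the Laplacian to a periodic 3D field via the FFT.

    Differentiation becomes multiplication by -(kx^2 + ky^2 + kz^2) in
    Fourier space; this transform-multiply-inverse pattern is what
    parallel FFT libraries accelerate at scale.
    """
    ks = [2 * np.pi * np.fft.fftfreq(n, d=L / n)
          for n, L in zip(f.shape, lengths)]
    kx, ky, kz = np.meshgrid(*ks, indexing="ij")
    fhat = np.fft.fftn(f)
    return np.real(np.fft.ifftn(-(kx**2 + ky**2 + kz**2) * fhat))

# Quick check on a field with a known Laplacian: f = sin(x) gives -sin(x).
n = 32
x = np.linspace(0, 2 * np.pi, n, endpoint=False)
X, Y, Z = np.meshgrid(x, x, x, indexing="ij")
assert np.allclose(spectral_laplacian(np.sin(X)), -np.sin(X), atol=1e-8)
```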
Item GPU-accelerated high-performance computing for architecture-aware wave simulation based on discontinuous Galerkin algorithms (2020-05-09) Hanindhito, Bagus; John, Lizy Kurian
Full-waveform inversion has been an essential method for the oil and gas industry to approximate the properties of the Earth's subsurface without the need to observe them directly by digging, drilling, or tunneling, thus lowering exploration costs. The method relies on generated seismic waves and the acquisition of the reflected wave data. Since each type of rock, sediment, and material has different properties, the acquired data can be used, for example, to approximate the location of mineral and oil deposits. The first problem, which is the focus of this research, is called the forward problem; it aims to generate synthetic seismograms from a given model. The second problem is the inverse problem, which tries to find the optimum model that best describes the acquired data. Generally, the area of a seismic survey is massive and can easily generate a vast amount of data, which is used to find the best Earth model. Therefore, a considerable amount of computing power is required to help solve these problems. Industrial-scale wave simulators typically use multiple CPUs to accelerate computation, and as the size of the problem increases, the time needed to run the simulation increases accordingly. In this thesis, we investigated the implementation of a CPU-based wave simulator to find parallelism that can be extracted. We mapped this massive amount of parallelism to the GPU, which has thousands of cores and is thus well suited for the job, and performed additional optimization of the basic code to improve performance. We also developed a method to verify the functionality of our implementation against the original code. The GPU-accelerated version of the code was then compared to the original CPU code, running the simulation at different levels of discretization on both consumer-class and datacenter-class GPUs. For the double-precision runs, our benchmark results show speed-ups of over 120x, 210x, and 330x on the GeForce GTX 1080 Ti, Tesla P100, and Tesla V100 GPUs, respectively, compared to dual Intel Xeon Platinum 8160 CPUs with a total of 48 cores.

Item High performance algorithms for medical image registration with applications in neuroradiology (2022-07-01) Himthani, Naveen; Biros, George; Ghattas, Omar; Pingali, Keshav; Yankeelov, Thomas; Mang, Andreas
This dissertation concerns the design, analysis, and high-performance computing (HPC) implementation of fast algorithms for large-deformation diffeomorphic registration and their application in quantifying abnormal anatomical deformations in magnetic resonance imaging (MRI) scans of brain tumor patients. Image registration finds point correspondences between two images by solving an optimization problem. It is a fundamental and computationally expensive operation with applications in computer vision and medical image analysis. Diffeomorphic registration is a non-convex and nonlinear inverse problem and, as a result, presents significant numerical and computational challenges. Designing and implementing efficient and accurate numerical schemes on modern computer architectures is the key to accelerating, and sometimes even enabling, the development of image analysis workflows. In this dissertation, we contribute to several aspects of diffeomorphic registration: (i) a novel preconditioner that improves performance and scalability, (ii) algorithms and their scalable implementation on heterogeneous compute architectures, and (iii) applications in neuroradiology. Our work on diffeomorphic image registration is based on CLAIRE, a formulation, algorithmic framework, and software developed at The University of Texas at Austin. As the first highlight of our contributions, we introduce a novel two-level Hessian preconditioner that results in a 2.5× improvement in CLAIRE's performance.
As a second highlight, our optimized HPC implementation yields orders-of-magnitude speedups, as CLAIRE now supports GPU architectures and distributed-memory parallelism via GPU-aware MPI (Message Passing Interface). CLAIRE can register clinical-grade brain MRI scans of size 256³ in under 5 seconds on a single NVIDIA V100 GPU. For research-grade high-resolution volumetric images, e.g., mouse brain CLARITY images of size 2816 × 3016 × 1162, CLAIRE takes under 30 minutes using 256 NVIDIA V100 GPUs on the Texas Advanced Computing Center's (TACC) Longhorn supercomputer. To the best of our knowledge, CLAIRE is the most scalable image registration algorithm and software. CLAIRE has been open-sourced under the GNU v3 license and is available on GitHub at https://github.com/andreasmang/claire. Our target clinical application concerns the use of image registration to characterize mass effect in MRI scans of patients with glioblastoma, a fatal brain cancer. Mass effect is the mechanical deformation of surrounding healthy tissue caused by the growing tumor; its location and degree could aid in differential diagnosis and treatment planning. Toward this end, we introduce an algorithm that integrates CLAIRE, statistical analysis for abnormality detection, and machine learning to quantify and localize mass effect. Given a patient's brain tumor scan, we generate a clinical summary with (i) an estimate of the degree of mass effect along with a severity label (mild, moderate, or severe) with up to 62% accuracy, (ii) a heatmap of mass effect for the brain scan, and (iii) a list of specific anatomical regions, e.g., the frontal lobe, that are statistically likely to exhibit significant mass effect.

Item Jack Rabbit: an effective Cell BE programming system for high performance parallelism (2011-05) Ellis, Apollo Isaac Orion; Lin, Yun Calvin; Fussell, Donald S., 1951-
The Cell processor is an example of the trade-offs made when designing a mass-market, power-efficient multi-core machine, but the machine-exposing architecture and raw communication mechanisms of Cell are hard for a programmer to manage. Cell's design is simple, which pushes complexity into software in the areas of achieving low threading overhead, good bandwidth efficiency, and load balance. Several attempts have been made to produce efficient and effective programming systems for Cell, but these attempts have been too specialized and fall short. We present Jack Rabbit, an efficient thread pool work queue implementation with load balancing mechanisms and double buffering. Our system incurs low threading overhead, achieves good load balance and bandwidth efficiency, and represents a step towards an effective way to program Cell and any similar current or future processors.
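To make the "thread pool work queue" idea concrete, here is a generic Python sketch (not the Cell/SPE implementation described above, which would additionally use DMA double buffering into each core's local store): a fixed pool of workers repeatedly pulls tasks from a shared queue, so threads are created once rather than per task, and the shared queue naturally balances load.

```python
import queue
import threading

class ThreadPoolWorkQueue:
    """Minimal thread-pool work queue: workers are created once (low
    threading overhead) and pull tasks from a shared queue (load balance)."""

    def __init__(self, num_workers=4):
        self.tasks = queue.Queue()
        self.workers = [threading.Thread(target=self._run, daemon=True)
                        for _ in range(num_workers)]
        for w in self.workers:
            w.start()

    def _run(self):
        while True:
            fn, args = self.tasks.get()
            try:
                fn(*args)
            finally:
                self.tasks.task_done()

    def submit(self, fn, *args):
        self.tasks.put((fn, args))

    def wait(self):
        self.tasks.join()

# Usage: process independent chunks of work; on Cell, each task would also
# prefetch its next chunk (double buffering) to overlap transfer and compute.
pool = ThreadPoolWorkQueue(num_workers=4)
results = [0] * 8
def work(i):
    results[i] = sum(range(i * 1000))
for i in range(8):
    pool.submit(work, i)
pool.wait()
```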
Item The Lagniappe programming environment (2008-08) Riché, Taylor Louis, 1978-; Vin, Harrick M.
Multicore, multithreaded processors are rapidly becoming the platform of choice for designing high-throughput request-processing applications. We refer to this class of modern parallel architectures as multi-[star] systems. In this dissertation, we describe the design and implementation of Lagniappe, a programming environment that simplifies the development of portable, high-throughput request-processing applications on multi-[star] systems. Lagniappe makes the following four key contributions. First, Lagniappe defines and uses a hybrid programming model for this domain that separates the concerns of writing applications for uni-processor, single-threaded execution platforms (single-[star] systems) from the concerns of writing applications that execute efficiently on a multi-[star] system; we provide separate tools to the programmer to address each set of concerns. Second, we present meta-models of applications and multi-[star] systems that identify the entities necessary for reasoning about the application domain and multi-[star] platforms. Third, we design and implement a platform-independent mechanism called the load-distributing channel that factors out the key functionality required to move an application from a single-[star] architecture to a multi-[star] one. Finally, we implement a platform-independent adaptation framework that defines custom adaptation policies from application and system characteristics to change resource allocations as the workload changes. Furthermore, applications written in the Lagniappe programming environment are portable, because the programming model separates the concerns of application programming from those of system programming. We implement Lagniappe on a cluster of servers, each with multiple multicore processors, and demonstrate its effectiveness by implementing several stateful request-processing applications and showing their performance on our multi-[star] system.

Item Memory management for high-performance applications (2002) Berger, Emery David; McKinley, Kathryn S.
Memory managers are a source of performance and robustness problems for application software. Current general-purpose memory managers do not scale on multiprocessors, cause false sharing of heap objects, and systematically leak memory. Even on uniprocessors, the memory manager is often a performance bottleneck. General-purpose memory managers also do not provide the bulk-deletion semantics required by many applications, including web servers and compilers. The approaches taken to date to address these and other memory management problems have been largely ad hoc. Programmers often attempt to work around these problems by writing custom memory managers. This approach leads to further difficulties, including data corruption caused when programmers inadvertently free custom-allocated objects to the general-purpose memory manager. In this thesis, we develop a framework for analyzing and designing high-quality memory managers. We develop a memory management infrastructure called heap layers that allows programmers to compose efficient memory managers from reusable and independently testable components. We conduct the first comprehensive examination of custom memory managers and show that most of them achieve slight or no performance improvement over a state-of-the-art general-purpose memory manager. Building on the knowledge gained in this study, we develop a hybrid memory management abstraction called reaps that combines the best of both approaches, allowing server applications to manage memory quickly and flexibly while avoiding memory leaks. We identify a number of previously unnoticed problems with concurrent memory management and analyze previous work in light of these discoveries. We then present a concurrent memory manager called Hoard and prove that it avoids these problems.
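Heap layers are composed in C++ via mixin templates; the following Python sketch is only a loose analogy (hypothetical class names, not the actual heap layers API), but it shows the compositional idea: each layer adds one policy, such as a free list, on top of the allocator beneath it.

```python
class MallocHeap:
    """Bottom layer: stand-in for the system allocator."""
    def malloc(self, size):
        return bytearray(size)
    def free(self, obj):
        pass  # dropped here; a real bottom layer would return memory to the OS

class FreelistLayer:
    """Layer that recycles freed blocks of one size before asking the
    layer beneath it, mirroring how heap layers stack policies."""
    def __init__(self, parent, block_size):
        self.parent = parent
        self.block_size = block_size
        self.free_blocks = []
    def malloc(self, size):
        if size == self.block_size and self.free_blocks:
            return self.free_blocks.pop()      # reuse a recycled block
        return self.parent.malloc(size)
    def free(self, obj):
        if len(obj) == self.block_size:
            self.free_blocks.append(obj)       # keep it for reuse
        else:
            self.parent.free(obj)

# Compose a custom allocator from independently testable pieces.
heap = FreelistLayer(MallocHeap(), block_size=64)
a = heap.malloc(64)
heap.free(a)
assert heap.malloc(64) is a  # the freed block was recycled by the layer
```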
Item Petrophysical modeling and simulation study of geological CO₂ sequestration (2014-05) Kong, Xianhui; Delshad, Mojdeh; Wheeler, Mary F. (Mary Fanett)
Global warming and greenhouse gas (GHG) emissions have recently become a significant focus of engineering research. The geological sequestration of greenhouse gases such as carbon dioxide (CO₂) is one approach that has been proposed to reduce greenhouse gas emissions and slow down global warming. Geological sequestration involves injecting produced CO₂ into subsurface formations and trapping the gas through many geological mechanisms, such as structural trapping, capillary trapping, dissolution, and mineralization. While some progress has been made in our understanding of fluid flow in porous media, many petrophysical phenomena that occur during geological CO₂ sequestration, such as multi-phase flow, capillarity, geochemical reactions, and geomechanical effects, remain inadequately studied and pose a challenge for continued study. It is critical to continue research on these important issues. Numerical simulators are essential tools for developing a better understanding of the geologic characteristics of brine reservoirs and for building support for future CO₂ storage projects. Modeling CO₂ injection requires the implementation of a multiphase flow model and an equation-of-state (EOS) module to compute the dissolution of CO₂ in brine and vice versa. In this study, we used the Integrated Parallel Accurate Reservoir Simulator (IPARS), developed at the Center for Subsurface Modeling at The University of Texas at Austin, to model the injection and storage of CO₂ in saline aquifers. We developed and implemented new petrophysical models in IPARS and applied these models to study the process of CO₂ sequestration. The research presented in this dissertation is divided into three parts. The first part discusses petrophysical and computational models for the mechanical, geological, and petrophysical phenomena occurring during CO₂ injection and sequestration. The effectiveness of CO₂ storage in saline aquifers is governed by the interplay of capillary, viscous, and buoyancy forces. Recent experimental data reveal the impact of pressure, temperature, and salinity on the interfacial tension (IFT) between CO₂ and brine. The dependence of CO₂-brine relative permeability and capillary pressure on IFT is also clearly evident in published experimental results. Improved understanding of the mechanisms that control the migration and trapping of CO₂ in the subsurface is crucial to designing future storage projects for long-term, safe containment. We have developed numerical models for CO₂ trapping and migration in aquifers, including a compositional flow model, a relative permeability model, a capillary model, an interfacial tension model, and others. Heterogeneities in porosity and permeability are also coupled to the petrophysical models. We have developed and implemented a general relative permeability model that combines the effects of pressure gradient, buoyancy, and capillary pressure in a compositional, parallel simulator. The significance of IFT variations for CO₂ migration and trapping is assessed. The variation of residual saturation is modeled based on interfacial tension and trapping number, and a hysteretic trapping model is also presented.
The second part of this dissertation is a model validation and sensitivity study using coreflood simulation data derived from laboratory study. The motivation is to gain confidence in the results of the numerical simulator by validating the models and numerical accuracy against laboratory and field pilot-scale results. Published steady-state, core-scale CO₂/brine displacement results were selected as the reference basis for our numerical study. High-resolution compositional simulations of brine displacement with supercritical CO₂ are presented using IPARS. A three-dimensional (3D) numerical model of the Berea sandstone core was constructed using heterogeneous permeability and porosity distributions based on geostatistical data. The measured capillary pressure curve was scaled using the Leverett J-function to include local heterogeneity at the sub-core scale. Simulation results indicate that accurate representation of capillary pressure at sub-core scales is critical. Water drying and the shift in relative permeability had a significant impact on the final CO₂ distribution along the core. This study provided insight into the role of heterogeneity in the final CO₂ distribution, where a slight variation in porosity gives rise to a large variation in the CO₂ saturation distribution. The third part of this study is a simulation study using IPARS of the Cranfield pilot CO₂ sequestration field test conducted by the Bureau of Economic Geology (BEG) at The University of Texas at Austin. In this project, a total of approximately 2.5 million tons of supercritical CO₂ was injected over two years, beginning December 1, 2009, into a saline aquifer at a depth of about 10,000 ft. In this chapter, we use the simulation capabilities of IPARS to numerically model the CO₂ injection process at Cranfield. We conducted a corresponding history-matching study and obtained good agreement with field observations. Extensive sensitivity studies were also conducted for CO₂ trapping, fluid phase behavior, relative permeability, wettability, gravity and buoyancy, and capillary effects on sequestration. Simulation results are consistent with the observed CO₂ breakthrough time at the first observation well. Numerical results are also consistent with the bottomhole injection flowing pressure for the first 350 days, before the rate increase. The abnormal pressure response accompanying the rate increase on day 350 indicates possible geomechanical issues, which can be represented in the simulation using an induced fracture near the injection well. The recorded injection-well bottomhole pressure data were successfully matched after modeling the fracture in the simulation model. Results also illustrate the importance of using accurate trapping models to predict CO₂ immobilization behavior. The impact of CO₂/brine relative permeability curves and the trapping model on bottomhole injection pressure is also demonstrated.
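The Leverett J-function scaling mentioned above converts a capillary pressure curve measured on one rock sample to grid cells with different porosity and permeability. A small sketch of the standard relation (the reference values below are made up for illustration):

```python
import numpy as np

def scale_capillary_pressure(pc_ref, k_ref, phi_ref, k_cell, phi_cell):
    """Leverett J-function scaling of a capillary pressure curve.

    J(Sw) = Pc / (sigma * cos(theta)) * sqrt(k / phi) is assumed to be the
    same for similar rock, so a reference Pc curve can be rescaled to each
    cell: Pc_cell = Pc_ref * sqrt((k_ref / phi_ref) / (k_cell / phi_cell)).
    The IFT and contact-angle factors cancel when both curves correspond
    to the same fluid pair and conditions.
    """
    return pc_ref * np.sqrt((k_ref / phi_ref) / (k_cell / phi_cell))

# Illustrative values only: a lower-permeability cell gets a higher Pc curve.
pc_ref = np.array([2.0, 5.0, 12.0])   # psi, at a few saturation points
pc_cell = scale_capillary_pressure(pc_ref, k_ref=200.0, phi_ref=0.20,
                                   k_cell=50.0, phi_cell=0.18)
print(pc_cell)   # about 1.9x the reference values
```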
Item Practical fast matrix multiplication algorithms (2018-10-08) Huang, Jianyu; Van de Geijn, Robert A.; Batory, Don S.; Biros, George; Henry, Greg M.; Pingali, Keshav K.
Matrix multiplication is a core building block for numerous scientific computing and, more recently, machine learning applications. Strassen's algorithm, the original fast matrix multiplication (FMM) algorithm, has long fascinated computer scientists due to its startling property of reducing the number of computations required for multiplying n × n matrices from O(n³) to O(n^2.807). Over the last half century, this has fueled many theoretical improvements, such as other variations of Strassen-like FMM algorithms. Previous implementations of these FMM algorithms led to the "street wisdom" that they are practical only for large, relatively square matrices, that they require considerable workspace, and that thread-level parallelism is difficult to achieve. This work dispels these notions by demonstrating significant benefits for small and non-square matrices, by requiring no workspace beyond what is already incorporated in high-performance implementations of matrix multiplication, and by achieving performance benefits on multi-core, many-core, and distributed-memory architectures.
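For reference, the O(n^2.807) bound comes from replacing the eight recursive block products of the classical algorithm with seven. A compact sketch of the textbook recursion for matrices whose dimension is a power of two (not the blocked, workspace-free formulation developed in the thesis):

```python
import numpy as np

def strassen(A, B, cutoff=64):
    """Strassen's algorithm: 7 block multiplies instead of 8, giving
    O(n^log2(7)) ≈ O(n^2.807). Falls back to the classical product
    below a cutoff, as practical implementations do."""
    n = A.shape[0]
    if n <= cutoff:
        return A @ B
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    M1 = strassen(A11 + A22, B11 + B22, cutoff)
    M2 = strassen(A21 + A22, B11, cutoff)
    M3 = strassen(A11, B12 - B22, cutoff)
    M4 = strassen(A22, B21 - B11, cutoff)
    M5 = strassen(A11 + A12, B22, cutoff)
    M6 = strassen(A21 - A11, B11 + B12, cutoff)
    M7 = strassen(A12 - A22, B21 + B22, cutoff)
    C = np.empty_like(A)
    C[:h, :h] = M1 + M4 - M5 + M7
    C[:h, h:] = M3 + M5
    C[h:, :h] = M2 + M4
    C[h:, h:] = M1 - M2 + M3 + M6
    return C

A, B = np.random.rand(128, 128), np.random.rand(128, 128)
assert np.allclose(strassen(A, B), A @ B)
```

The temporaries M1 through M7 are the "considerable workspace" of naive implementations; avoiding that extra workspace is one of the contributions summarized above.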
Item Resilient heterogeneous systems with Containment Domains (2020-02-03) Lee, Kyushick; Erez, Mattan; Touba, Nur A.; Tiwari, Mohit; Rossbach, Christopher J.; Sullivan, Michael B.
Resilience is a continuing concern for extreme-scale scientific applications. Tolerating ever-increasing hardware fault rates demands a scalable end-to-end resilience scheme. The fundamental issue with current system-wide techniques, such as checkpoint-restart, is their one-size-fits-all approach, which recovers from local failures globally. The challenges of supporting efficient resilience grow at scale with the trend of adopting accelerators. Exploiting resiliency tailored to an application can offer a potential breakthrough that enables efficient localized recovery, because an individual node maintains a low failure rate even at scale. I propose a framework realizing Containment Domains (CDs) that addresses the resilience challenges of future-scale heterogeneous systems. My dissertation consists of two parts: tackling the resilience problem for CPU-only systems with CDs, and extending CDs to systems with GPUs. In the first part, I develop the CDs framework and adapt CD-based resilience to real-world applications to verify its analytical model and show its feasibility. CDs elevate resilience to a first-class abstraction and exploit application properties to hierarchically decompose applications into local domains that contain errors. Confining the range of errors within such logical domains in a program enables localized recovery. The CDs framework validates the analytical model of CDs, which matches the trend of the efficiency results measured by running CD-enabled applications under error injection. Based on the analytical model, I develop an automated workflow to tune Containment Domains by leveraging the different likelihoods of failures, the storage hierarchy, and application characteristics. The CD-based resiliency estimated by the analytical model projects higher efficiency than the state of the art and promises scalability toward exascale computing. In the second part of the dissertation, I extend CDs to CUDA applications on high-performance computing (HPC) systems with GPUs. GPUs offer higher computational power at lower energy and cost than homogeneous CPU-only nodes, and the heterogeneous nodes in modern HPC systems show a tendency toward high GPU-to-CPU ratios. While an accelerator-rich machine reduces the total number of compute nodes required to achieve a performance target, a single node becomes vulnerable to accelerator failures as well as congested intra-node resources. Preserving a large amount of local state within accelerators for checkpointing incurs significant overheads. Node-level resilience thus reveals a new challenge at the accelerator density of HPC systems. I apply CDs to isolate and recover from GPU failures in HPC CUDA applications (CD-CUDA). The extension of CDs to CUDA programs allows logical domains to be expressed at kernel boundaries. CD-CUDA improves system-level resilience efficiency over host-only CDs by containing GPU failures. Furthermore, I propose and evaluate a hardware component to resolve the bursty device-local preservation traffic within a node, which is a new challenge as GPU density grows in the system.

Item Scalable electronic structure methods to solve the Kohn-Sham equation (2018-01-23) Lena, Charles Manuel; Chelikowsky, James R.; Demkov, Alexander A.; Ekerdt, John G.; Hwang, Gyeong S.; Gamba, Irene M.
From the single hydrogen atom to proteins of hundreds of thousands of kilodaltons, scientists can use the electronic structure of interacting atoms to predict their material properties. Knowing the material properties through solving the electronic structure problem would allow for the controlled prediction and corresponding design of materials. The Kohn-Sham equations, based on density functional theory, transform a many-body problem impossible to solve for anything but the smallest molecules into a practical problem that can be used to predict material properties. Although Kohn-Sham DFT scales as the cube of the number of electrons in the system, there are additional, well-documented approximations to further reduce the number of electrons, such as the pseudopotential method. The incoming exascale era will bring unavoidable challenges in solving the Kohn-Sham equations, including communication and hardware considerations. Old paradigms, epitomized by repeated series of globally forced synchronization points, will give way to new breeds of algorithms that maximize scaling performance while maintaining portability. This thesis focuses on the solution of Kohn-Sham DFT in real space at scale. Key to this effort is a parallel treatment of the numerical elements involved in the Rayleigh-Ritz method. At minimum, the Rayleigh-Ritz projection requires a number of distributed matrix-vector operations equal to the number of electrons solved for in a system. Furthermore, the projection requires that number, squared and then halved, of dot products. The memory cost of such an algorithm also grows large very quickly, and explicit, intelligent management is not an option. I demonstrate the computational requirements for the various steps in solving the electronic structure problem for both large and small molecular systems. This thesis also discusses opportunities in real-space Kohn-Sham DFT to further utilize floating-point-optimized hardware with higher-order stencils.
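A dense-algebra sketch of the Rayleigh-Ritz step referred to above (a serial NumPy stand-in for the distributed operations; the Hamiltonian and trial subspace here are random placeholders): project the Hamiltonian onto the current subspace, solve the small eigenproblem, and rotate the basis.

```python
import numpy as np

def rayleigh_ritz(H, V):
    """One Rayleigh-Ritz projection step.

    V holds n_states trial vectors as columns. Forming V^T (H V) costs
    n_states (distributed) matrix-vector products plus roughly
    n_states^2 / 2 dot products, since the projected matrix is symmetric;
    these are the operations whose parallel treatment the thesis targets.
    """
    HV = H @ V                           # the matrix-vector products
    H_sub = V.T @ HV                     # the dot products (small matrix)
    eigvals, C = np.linalg.eigh(H_sub)   # small dense eigenproblem
    return eigvals, V @ C                # rotate to the Ritz vectors

# Random placeholder "Hamiltonian" and orthonormal trial subspace.
rng = np.random.default_rng(0)
n, n_states = 200, 8
H = rng.standard_normal((n, n))
H = 0.5 * (H + H.T)
V, _ = np.linalg.qr(rng.standard_normal((n, n_states)))
ritz_values, ritz_vectors = rayleigh_ritz(H, V)
```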
Item xBFT: Byzantine fault tolerance with high performance, low cost, and aggressive fault isolation (2008-05) Kotla, Ramakrishna Rao, 1976-; Dahlin, Michael
We are increasingly relying on online services to store, access, share, and disseminate critical information from anywhere and at all times. Such services include email, digital storage, photos, video, and health and financial services. With increasing evidence of non-fail-stop failures in practical systems, Byzantine fault tolerant (BFT) state machine replication is becoming increasingly attractive for building highly reliable services that tolerate such failures. However, existing Byzantine fault tolerant techniques fall short of providing high availability, high performance, and long-term data durability guarantees at a competitive replication cost. In this dissertation, we present BFT replication techniques that facilitate the design and implementation of such highly reliable services by providing high availability, high performance, and high durability with a competitive replication cost (hardware, software, network, and management). First, we propose CBASE, a BFT state machine replication architecture that leverages application-level parallelism to improve the throughput of the replicated system by identifying and executing independent requests concurrently. Traditional BFT state machine replication techniques provide high availability and security but fail to provide high throughput; this limitation stems from the fundamental assumption of generalized state machine replication that all replicas execute requests sequentially in the same total order to ensure consistency across replicas. Our architecture thus provides a general way to exploit application parallelism in order to provide high throughput without compromising correctness. Second, we present Zyzzyva, an efficient BFT agreement protocol that uses speculation to significantly reduce the performance overhead and replication cost of BFT state machine replication. In Zyzzyva, replicas respond to a client's request without first running an expensive three-phase commit protocol to reach agreement on the order in which the request must be processed. Instead, they optimistically adopt the order proposed by the primary and respond immediately to the client. Replicas can thus become temporarily inconsistent with one another, but clients detect inconsistencies, help correct replicas converge on a single total ordering of requests, and rely only on responses that are consistent with this total order. This approach allows Zyzzyva to reduce replication overheads to near their theoretical minima. Third, we design and implement SafeStore, a distributed storage system designed to maintain long-term data durability despite conventional hardware and software faults, environmental disruptions, and administrative failures caused by human error or malice. The architecture of SafeStore is based on fault isolation, which SafeStore applies aggressively along administrative, physical, and temporal dimensions by spreading data across autonomous storage service providers (SSPs). SafeStore also performs an efficient end-to-end audit of SSPs to detect data loss quickly and to improve data durability by reducing the mean time to recovery (MTTR). SafeStore offers durable storage with cost, performance, and availability competitive with traditional storage systems. We evaluate these techniques by implementing BFT replication libraries, and we further demonstrate the practicality of these approaches by implementing an NFS-based replicated file system (CBASE-FS) and a durable storage system (SafeStore-FS).
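A rough illustration of the speculation described for Zyzzyva, reduced to the client-side decision rule only (greatly simplified; the actual protocol also checks ordering histories and cryptographic authenticators): with n = 3f + 1 replicas, a client completes a request on the fast path only when all 3f + 1 speculative responses match, and with at least 2f + 1 matching responses it must gather a commit certificate before relying on the result.

```python
from collections import Counter

def classify_speculative_responses(responses, f):
    """Classify speculative replica responses for one request.

    responses: list of (replica_id, result) pairs.
    Returns one of "fast-path complete", "needs commit phase", or
    "insufficient". This is a sketch of the client-side rule only.
    """
    n = 3 * f + 1
    counts = Counter(result for _, result in responses)
    result, matching = counts.most_common(1)[0] if counts else (None, 0)
    if matching == n:
        return "fast-path complete", result   # all 3f+1 replicas agree
    if matching >= 2 * f + 1:
        return "needs commit phase", result   # commit certificate required
    return "insufficient", None               # wait, retransmit, or suspect the primary

# Example with f = 1 (four replicas): one missing response forces the slow path.
responses = [(0, "ok:42"), (1, "ok:42"), (2, "ok:42")]
print(classify_speculative_responses(responses, f=1))  # ('needs commit phase', 'ok:42')
```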