Browsing by Subject "Distributed systems"
Now showing 1 - 15 of 15
Item: A client-centric approach to transactional datastores (2020-05-05)
Crooks, Natacha; Alvisi, Lorenzo; Peter, Simon, Ph. D.; Witchel, Emmett; Bailis, Peter
Modern applications must collect and store massive amounts of data. Cloud storage offers these applications simplicity: the abstraction of a failure-free, perfectly scalable black-box. While appealing, offloading data to the cloud is not without its challenges. These cloud storage systems often favour weaker levels of isolation and consistency, and these weaker guarantees introduce behaviours that, without care, can break application logic. Offloading data to an untrusted third party like the cloud also raises questions of security and privacy. This thesis seeks to improve the performance, the semantics, and the security of transactional cloud storage systems. It centers around a simple idea: defining consistency guarantees from the perspective of the applications that observe these guarantees, rather than from the perspective of the systems that implement them. This new perspective brings forth several benefits. First, it offers simpler and cleaner definitions of weak isolation and consistency guarantees. Second, it enables scalable implementations of existing guarantees like causal consistency. Finally, it has applications to security: it allows us to efficiently augment transactional cloud storage systems with obliviousness guarantees.

Item: Active replication vs. fusion as fault tolerance mechanisms (2016-05)
Boyd, Jeremy J.; Garg, Vijay K. (Vijay Kumar), 1963-; Aziz, Adnan
This report compares two strategies for crash fault tolerance of nodes in distributed systems: active replication and fusion. To tolerate f crash faults, active replication maintains f backup servers for each primary. Fusion, by contrast, maintains a set of f backup servers that contain the replicated data for all primaries in coded form. If n primaries each contain m data items to be backed up, then active replication requires O(nmf) space, while fusion requires only O(mf) space. These savings come at the cost of additional time during the recovery process, due to additional messages and computation. For this report, we have implemented an application in which primary nodes maintain increasingly large data structures and periodically crash. Both active replication and fusion are evaluated as recovery mechanisms for the crashed nodes, across three metrics: backup size, time spent updating the backup, and recovery time. Our results validate the theoretical expectations put forward in the literature: fusion claims significant space savings at the cost of high recovery time. In the most extreme measured case, fusion costs approximately 83% of the space that replication does, while recovery time is three orders of magnitude more expensive in fusion (3.4 s) than in replication (0.0037 s). However, we also find that the gap between fusion and replication grows as nodes are introduced to the system, and that fusion performs more consistently in update time, owing to the high variability of multicasting within active replication systems.
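The space trade-off described in the report above is easy to make concrete. Below is a minimal back-of-the-envelope sketch in Python; the function names and the sample values of n, m, and f are illustrative, not taken from the report:

```python
# Illustrative comparison of backup space for active replication vs. fusion.
# Assumes n primaries, each with m data items, tolerating f crash faults.

def replication_space(n: int, m: int, f: int) -> int:
    """Active replication: f full copies of every primary's data -> O(nmf)."""
    return n * m * f

def fusion_space(n: int, m: int, f: int) -> int:
    """Fusion: f coded backups, each the size of one primary's data -> O(mf)."""
    return m * f

if __name__ == "__main__":
    n, m, f = 10, 1000, 2
    print("replication:", replication_space(n, m, f))  # 20000 items
    print("fusion:     ", fusion_space(n, m, f))       # 2000 items
```

As the sketch suggests, fusion's savings scale with n, which matches the report's observation that the gap between the two mechanisms grows as nodes are added.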
Item: Alternative implementations for storage and communication abstractions in distributed systems (2010-08)
Aiyer, Amitanand S.; Alvisi, Lorenzo; Bazzi, Rida A.; Dahlin, Michael; Gouda, Mohamed; Wylie, Jay
Abstractions are widely used in building reliable distributed systems, as they simplify the task of building complex systems and aid in reasoning about them. Implementing these abstractions, however, requires making certain assumptions about the environment in which they will be used. We find that there is a mismatch in the set of assumptions used to implement abstractions in the different layers of a distributed system. This leads to a costlier design and may render the implementation unusable in situations where the assumptions do not hold. In this dissertation we provide alternative implementations for the abstractions of distributed registers and communication channels that rely on a unified set of assumptions across the different layers of a distributed system.

Item: Computational process networks : a model and framework for high-throughput signal processing (2011-05)
Allen, Gregory Eugene; Evans, Brian L. (Brian Lawrence), 1965-; Browne, James C.; Chase, Craig M.; John, Lizy K.; Loeffler, Charles M.
Many signal and image processing systems for high-throughput, high-performance applications require concurrent implementations in order to realize the desired performance. Developing software for concurrent systems is widely acknowledged to be difficult, with common industry practice leaving the burden of preventing concurrency problems on the programmer. The Kahn Process Network model provides the mathematically provable property that a program's result is deterministic regardless of the execution order of its processes, including concurrent execution. This model is also natural for describing streams of data samples in a signal processing system, where processes transform streams from one data type to another. However, a Kahn Process Network may require infinite memory to execute. I present the dynamic distributed deadlock detection and resolution (D4R) algorithm, which permits execution of Process Networks in bounded memory whenever that is possible. It detects local deadlocks in a Process Network, determines whether the deadlock can be resolved and, if so, identifies the process that must take action to resolve it. I propose the Computational Process Network (CPN) model, which is based on the formalisms of Kahn's PN model but with enhancements designed to make it efficiently implementable. These enhancements include multi-token transactions to reduce execution overhead, multi-channel queues for multi-dimensional synchronous data, zero-copy semantics, and consumer and producer firing thresholds for queues. Firing thresholds enable memoryless computation of sliding-window algorithms, which are common in signal processing systems. I show that the Computational Process Network model preserves the formal properties of Process Networks while reducing the operations required to implement sliding-window algorithms on continuous streams of data. I also present a high-throughput software framework that implements the Computational Process Network model in C++ and maps naturally onto distributed targets. This framework uses POSIX threads, and can exploit parallelism in both multi-core and distributed systems. Finally, I present case studies that exercise this framework and demonstrate its performance and utility. The final case study is a three-dimensional circular convolution sonar beamformer and replica correlator, which demonstrates the high throughput and scalability of a real-time signal processing algorithm using the CPN model and framework.
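The firing-threshold idea above lends itself to a small illustration. The sketch below shows how a consumer threshold yields an overlapping (sliding) window without copying data into process-local memory; the class and method names are invented for this example and are not the CPN framework's C++ API:

```python
from collections import deque

# Toy illustration of a consumer firing threshold: a process fires only once
# `threshold` tokens are available, reads the whole window, but dequeues only
# `step` tokens, so successive windows overlap (a sliding window).

class ThresholdQueue:
    def __init__(self, threshold: int, step: int):
        self.buf = deque()
        self.threshold = threshold
        self.step = step

    def put(self, token):
        self.buf.append(token)

    def ready(self) -> bool:
        return len(self.buf) >= self.threshold

    def fire(self):
        """Return a window of `threshold` tokens, consuming only `step`."""
        window = [self.buf[i] for i in range(self.threshold)]
        for _ in range(self.step):
            self.buf.popleft()
        return window

q = ThresholdQueue(threshold=4, step=1)
for x in range(6):
    q.put(x)
while q.ready():
    print(q.fire())   # [0,1,2,3], then [1,2,3,4], then [2,3,4,5]
```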
Item: Distributed trigger counting algorithms (2010-12)
Casas, Juan Manual, 1978-; Garg, Vijay K. (Vijay Kumar), 1963-; McCann, Bruce
A distributed system consists of a set of N processor nodes and a finite set of communication channels, and is frequently described as a directed graph in which each vertex represents a processor node and each edge represents a communication channel. A global snapshot of a distributed system consists of the local states of all the processor nodes and all of the in-transit messages of a distributed computation; it corresponds to a global state in which the local states and communication channels of all the processor nodes are recorded simultaneously. A classic application of snapshots is failure recovery, where the system restarts from the last global snapshot; this forms the basis for fault tolerance in distributed programs and aids serviceability as a distributed program debugging mechanism. Another important application is checkpointing and monitoring systems, where a series of continuous global snapshots is employed to detect when a certain number of triggers have been received by the system. As the distributed system scales, both in the number of processor nodes and in the number of expected triggers, the message complexity increases and adds to the total communication and computation overhead of the global snapshot algorithm. In such a large distributed system, an efficient algorithm is vital so that the distributed application employing the snapshots does not suffer performance degradation as the system grows over time. We are interested in global snapshot algorithms that achieve lower-bound message complexity and lower-bound MaxLoad for large values of N (processor nodes) and W (expected triggers). In this report we study and simulate the Centralized, Grid-based, Tree-based, and LayeredRand global snapshot algorithms, then evaluate them on the total number of messages (sent and received) and the MaxLoad messages (sent and received) for the trigger counting problem in distributed computing. The report concludes with simulation results comparing the algorithms on the total number of messages and the MaxLoad messages each requires to detect that W triggers have been delivered to the distributed system.
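As a point of reference for the algorithms compared in this report, the centralized baseline is simple enough to sketch: every node forwards each trigger to a single coordinator, which detects when W triggers have arrived. Its total message count is W, but so is the coordinator's load, which is exactly the MaxLoad problem the grid, tree, and LayeredRand schemes aim to avoid. The names below are illustrative:

```python
# Minimal sketch of the centralized baseline for distributed trigger
# counting: every node reports each trigger to one coordinator, which
# detects when W triggers have arrived. Total messages: W; the
# coordinator's load (MaxLoad) is also W.

class Coordinator:
    def __init__(self, w: int):
        self.w = w
        self.count = 0
        self.messages = 0

    def report(self) -> bool:
        """Called once per trigger delivered anywhere in the system."""
        self.messages += 1
        self.count += 1
        return self.count >= self.w   # detection condition

coord = Coordinator(w=100)
done = False
for _ in range(100):                  # 100 triggers arrive at various nodes
    done = coord.report()
assert done and coord.messages == 100
```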
Item: Fault tolerance in distributed systems : a coding-theoretic approach (2012-08)
Balasubramanian, Bharath; Garg, Vijay K. (Vijay Kumar), 1963-; Chase, Craig M.; Julien, Christine L.; Plaxton, Greg; Touba, Nur A.; Vishwanath, Sriram
Distributed systems are rapidly increasing in importance due to the need for scalable computations on huge volumes of data. This fact is reflected in many real-world distributed applications such as Amazon's EC2 cloud computing service, Facebook's Cassandra key-value store, and Apache's Hadoop MapReduce framework. Multi-core architectures developed by companies such as Intel and AMD have further brought this to prominence, since workloads can now be distributed across many individual cores. The nodes or entities in such systems are often built from commodity hardware and are prone to physical failures and security vulnerabilities. Achieving fault tolerance in such systems is a challenging task, since it is not easy to observe and control these distributed entities. Replication is a standard approach to fault tolerance in distributed systems. Its main advantage is that the backups incur very little overhead in terms of the time taken for normal operation or recovery. However, replication is grossly wasteful in terms of the number of backups required for fault tolerance, which has two major implications: the total space or memory required for fault tolerance is considerably high, and resources such as the power required to run the backup processes carry a significant cost. Given the large number of distributed servers employed in real-world applications, it is hard to provide fault tolerance while achieving both space and operational efficiency. In the world of data fault tolerance and communication, coding theory is used as the space-efficient alternative to replication. A direct application of coding theory to distributed servers, treating the servers as blocks of data, is very inefficient in terms of updates to the backups, primarily because each update to a server affects many blocks in memory, all of which must be re-encoded at the backups. This leads us to the following thesis statement: can we design a mechanism for fault tolerance in distributed systems that combines the space efficiency of coding theory with the low operational overhead of replication? We present a new paradigm to solve this problem, broadly referred to as fusion. We provide fusion-based solutions for two models of computation that are representative of a large class of applications: (i) systems modeled as deterministic finite state machines and (ii) systems modeled as programs containing data structures. For finite state machines, we use the notion of Hamming distances to present a polynomial-time algorithm that generates efficient backup state machines. For programs hosting data structures, we use a combination of erasure codes and selective replication to generate efficient backups for most commonly used data structures, such as queues, array lists, linked lists, vectors, and maps. We present theoretical and experimental results that demonstrate the efficiency of our schemes over replication. Finally, we use our schemes to design an efficient solution for fault tolerance in two real-world applications: Amazon's Dynamo key-value store and Google's MapReduce framework.
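The core intuition behind fusion, coded backups that are updated incrementally rather than re-encoded wholesale, can be shown with a single XOR-coded backup word that tolerates one crash fault. This is a toy sketch of the idea only; the dissertation's schemes combine erasure codes with selective replication for richer data structures:

```python
from functools import reduce
from operator import xor

# One XOR-coded backup word covers the state of all n primaries and
# tolerates a single crash fault, where replication would need one full
# copy per primary. Updates are incremental: when a primary changes its
# value, the backup XORs out the old value and XORs in the new one.

states = [12, 7, 33, 5]              # per-primary state
backup = reduce(xor, states)         # fused backup

# Incremental update: primary 2 changes 33 -> 40.
backup ^= 33 ^ 40
states[2] = 40

# Primary 2 crashes; recover its state from the survivors plus the backup.
recovered = backup
for i, s in enumerate(states):
    if i != 2:
        recovered ^= s
assert recovered == 40
```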
Item: Kubernetes provenance (2020-09-14)
Lin, William, M.S. in Computer Sciences; Chidambaram, Vijay; Rossbach, Christopher J.
The field of machine learning (ML) has experienced a renaissance since the 2000s. First, the exponential increase in computational power and improvements in hardware have finally allowed machine learning algorithms to process the same amount of data in minutes and hours rather than hundreds of years. Second, the cloud computing model made large-scale clusters inexpensive and available to anyone at the click of a button, allowing practitioners to scale their algorithms without having to personally maintain hundreds or even thousands of machines. However, despite the huge rise in popularity of machine learning in both research and industry, the ML community is facing a crisis of reproducibility. Although existing machine learning frameworks can all re-execute the same piece of code saved by a researcher, a typical workflow may involve multiple frameworks and accesses to data on remote machines. These cross-framework workflows cannot be replicated by any single framework's provenance system, and they often contain customized scripts and processes that further obscure future replication and repeatability. I argue in this thesis that, because of machine learning's need for scale and frequent training on large clusters, Kubernetes serves as a good common layer at which the systems community can interpose provenance collection, to aid the ML community in reproducing results that make use of multiple machines, frameworks, and hardware platforms. In addition, I propose two new mechanisms for collecting fine-grained provenance information from Kubernetes without modifying the application or the host operating system.
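One way to see what interposing provenance collection at the Kubernetes layer could look like is to watch pod events at the API server and log who ran what, where, and when. The sketch below uses the standard kubernetes Python client; it is an illustration of the general approach, not the mechanism proposed in the thesis:

```python
from kubernetes import client, config, watch   # pip install kubernetes

# Sketch: collect coarse provenance by watching pod events at the API
# server, recording container images as they are created, modified, or
# deleted, without touching the application or the host OS.

def watch_pod_provenance(log_path="provenance.log", timeout=60):
    config.load_kube_config()                  # or load_incluster_config()
    v1 = client.CoreV1Api()
    w = watch.Watch()
    with open(log_path, "a") as log:
        for event in w.stream(v1.list_pod_for_all_namespaces,
                              timeout_seconds=timeout):
            pod = event["object"]
            images = [c.image for c in pod.spec.containers]
            log.write(f'{event["type"]} {pod.metadata.namespace}/'
                      f'{pod.metadata.name} images={images}\n')

if __name__ == "__main__":
    watch_pod_provenance()
```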
Item: Necessary and sufficient conditions on partial orders for modeling concurrent computations (2014-05)
Chauhan, Himanshu; Garg, Vijay K. (Vijay Kumar), 1963-
Concurrent computations have been modeled using partial orders in both event-based and state-based domains. We give necessary and sufficient conditions on partial orders for them to be valid state-based or event-based models of concurrent computations. In particular, we define the notions of width-extensibility and interleaving-consistency of partial orders, and show that a partial order can be a valid state-based model of a concurrent computation iff it is width-extensible. Distributed computations that involve asynchronous message passing are a subset of concurrent computations; for asynchronous distributed computations, a partial order can be a valid state-based model iff it is both width-extensible and interleaving-consistent. We show a duality between the event-based and state-based models of concurrent computations, and give algorithms to convert partial orders from the event-based domain to the state-based domain and vice versa.

Item: A new approach to detecting failures in distributed systems (2015-08)
Leners, Joshua Blaise; Alvisi, Lorenzo; Aguilera, Marcos K.; Shmatikov, Vitaly; Walfish, Michael; Witchel, Emmett
Fault-tolerant distributed systems often handle failures in two steps: first, detect the failure and, second, take some recovery action. A common approach to detecting failures is end-to-end timeouts, but using timeouts brings problems. First, timeouts are inaccurate: just because a process is unresponsive does not mean that the process has failed. Second, choosing a timeout is hard: short timeouts exacerbate the problem of inaccuracy, and long timeouts can make the system wait unnecessarily. In fact, a good timeout value, one that balances accuracy against speed, may not even exist, owing to the variance in a system's end-to-end delays. This dissertation posits a new approach to detecting failures in distributed systems: use information about failures that is local to each component, e.g., the contents of an OS's process table. We call such information inside information, and use it as the basis for the design and implementation of three failure reporting services for data center applications, which we call Falcon, Albatross, and Pigeon. Falcon deploys a network of software modules to gather inside information in the system, and it guarantees that it never reports a working process as crashed, sometimes by terminating unresponsive components. This choice helps applications by making reports of failure reliable, meaning that applications can treat them as ground truth. Unfortunately, Falcon cannot handle network failures, because guaranteeing that a process has crashed requires network communication; we address this problem in Albatross and Pigeon. Instead of killing, Albatross blocks suspected processes from using the network, allowing applications to make progress during network partitions. Pigeon renounces interference altogether, reporting inside information to applications directly and in more detail, to help applications make better recovery decisions. By using these services, applications can improve their recovery from failures both quantitatively and qualitatively. Quantitatively, these services reduce detection time by one to two orders of magnitude relative to the end-to-end timeouts commonly used by data center applications, thereby reducing the unavailability caused by failures. Qualitatively, these services provide more specific information about failures, which can reduce the logic required for recovery and can help applications better decide when recovery is not necessary.
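The inside-information idea is concrete enough to sketch: a spy co-located with the monitored process can ask the OS process table directly instead of guessing from an end-to-end timeout. On POSIX systems, signal 0 performs the existence check without delivering a signal; the function name below is illustrative, not Falcon's API:

```python
import errno
import os

# Sketch of "inside information": a local spy consults the OS process
# table to decide whether a monitored process is still alive, rather than
# inferring failure from an end-to-end timeout.

def process_alive(pid: int) -> bool:
    try:
        os.kill(pid, 0)                 # signal 0: check only, no delivery
    except OSError as e:
        if e.errno == errno.ESRCH:      # no such process: it has crashed
            return False
        if e.errno == errno.EPERM:      # exists, but owned by another user
            return True
        raise
    return True

print(process_alive(os.getpid()))       # True: we are clearly running
```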
Item: OurFileSystem (2013-12)
Gass, Robert Benjamin; Garg, Vijay K. (Vijay Kumar), 1963-
OurFileSystem (OFS) is a peer-to-peer file and metadata sharing program. Peers freely join the network, but must be granted access to the groups in which metadata and files are shared. Any peer may create a group and grant others access to it; group members have different degrees of authority to grant others access and to set their authority. Metadata for files is created by users within the context of a group and distributed to all members of the group in the form of a post. Post templates can be created to set fields of metadata; templates are distributed to all members of a group, and one can be selected when creating a post or searching for files. Metadata in posts is indexed, and sophisticated searches over the metadata can be performed locally to help users find files of interest quickly. Files found during a search may be downloaded from peers upon request, with pieces of files downloaded from as many different peers as possible to maximize bandwidth. Peers within a group may also be marked as bad locally. If a user marks another peer as bad within the context of a group, posts from that peer to the group are deleted and not shared with others. Furthermore, any peer that was granted access by a peer marked as bad is also marked bad, and no further posts or authorizations are ever accepted from any peer marked as bad. OFS also supports small public and private messages, which are distributed to all peers in the network; private messages are encrypted so that only the intended peer can decrypt them. Lastly, OFS integrates well with anonymous overlay networks that support SOCKS proxies, such as Tor; I2P support has also been explicitly added.

Item: Policy architecture for distributed storage systems (2009-08)
Belaramani, Nalini Moti; Dahlin, Michael
Distributed data storage is a building block for many distributed systems, such as mobile file systems, web service replication systems, and enterprise file systems. New distributed data storage systems are frequently built as new environments, requirements, or workloads emerge. The goal of this dissertation is to develop the science of distributed storage systems by making it easier to build new ones. To achieve this goal, it proposes a new policy architecture, PADS, based on two key ideas: first, by providing a set of common mechanisms in an underlying layer, new systems can be implemented by defining policies that orchestrate these mechanisms; second, policy can be separated into routing policy and blocking policy, each addressing a different part of system design. Routing policy specifies how data flows among nodes in order to meet performance, availability, and resource usage goals, whereas blocking policy specifies when it is safe to access data in order to meet consistency and durability goals. This dissertation presents a PADS prototype that defines a set of distributed storage mechanisms flexible and general enough to support a large range of systems, a small policy API that is easy to use and captures the right abstractions for distributed storage, and a declarative language for specifying policy that enables quick, concise implementations of complex systems. We demonstrate that PADS significantly reduces development effort by constructing, over the prototype, a dozen significant distributed storage systems spanning a large portion of the design space. We find that each system required only a couple of weeks of implementation effort and a few dozen lines of policy code.
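A toy rendering of the routing/blocking split might look like the following, where a routing rule decides where a read pulls data from and a blocking rule decides when a write may be acknowledged. PADS itself expresses such policies in a declarative language over a common mechanism layer; everything below is invented for illustration:

```python
# Toy illustration of PADS's split between routing policy (how data flows
# among nodes) and blocking policy (when it is safe to access data).

class Node:
    def __init__(self, name, peers=()):
        self.name = name
        self.peers = list(peers)
        self.store = {}                        # key -> (value, version)

    # Routing policy: on a read, pull the newest version held locally or
    # at any peer, then cache it locally.
    def read(self, key):
        candidates = [n.store[key] for n in [self] + self.peers
                      if key in n.store]
        value, version = max(candidates, key=lambda vv: vv[1])
        self.store[key] = (value, version)
        return value

    # Blocking policy: acknowledge a write (treat it as durable) only once
    # a majority of peers also hold it.
    def write(self, key, value, version):
        self.store[key] = (value, version)
        acks = 0
        for p in self.peers:
            p.store[key] = (value, version)
            acks += 1
        return acks >= len(self.peers) // 2 + 1   # safe to acknowledge?

server = Node("server")
edge = Node("edge", peers=[server])
print(edge.write("x", 42, version=1))   # True: peer quorum holds x
print(edge.read("x"))                   # 42
```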
Item: Supporting device-to-device search and sharing of hyper-localized data (2015-05)
Michel, Jonas Reinhardt; Julien, Christine, D. Sc.; Garg, Vijay; Lam, Simon; de Veciana, Gustavo; Vishwanath, Sriram
Supporting emerging mobile applications in densely populated environments requires connecting mobile users and their devices with the surrounding digital landscape. Specifically, the volume of digitally available data in such computing spaces presents an imminent need for expressive mechanisms that enable humans and applications to share and search for relevant information within their digitally accessible physical surroundings. Device-to-device communications will play a critical role in facilitating transparent access to proximate digital resources. A wide variety of approaches exist that support device-to-device dissemination and query-driven data access; very few, however, capitalize on the contextual history of the shared data itself to distribute additional data or to guide queries. This dissertation presents Gander, an application substrate and mobile middleware designed to ease the burden of creating applications that share and search hyper-localized data in situ. Gander employs a novel trajectory-driven model of spatiotemporal provenance that enriches shared data with its contextual history: annotations that capture the data's geospatial and causal history across a lifetime of device-to-device propagation. We demonstrate the value of spatiotemporal data provenance both as a tool for improving ad hoc routing performance and as a driver of complex application behavior. This dissertation discusses the design and implementation of Gander's middleware model, which abstracts away tedious implementation details by enabling developers to write high-level rules that govern when, where, and how data is distributed, and to execute expressive queries across proximate digital resources. We evaluate Gander within several simulated large-scale environments and one real-world deployment on the UT Austin campus. The goal of this research is to provide formal constructs, realized within a software framework, that ease the software engineering challenges encountered during the design and deployment of applications in emerging mobile environments.

Item: The lattice agreement problem in distributed systems (2021-04-29)
Zheng, Xiong, Ph. D.; Garg, Vijay K. (Vijay Kumar), 1963-; Rajsbaum, Sergio; Soloveichik, David; Vishwanath, Sriram; Khurshid, Sarfraz; Vaidyan, Nitin
The lattice agreement problem is an important decision problem in distributed systems. It has applications in implementing atomic snapshot objects and in building a special class of replicated state machines. In this work, we design novel algorithms for the lattice agreement problem in a variety of settings. We first focus on distributed message-passing systems with only crash failures, and then turn our attention to message-passing systems with Byzantine failures. Finally, we explore the application of lattice agreement in implementing linearizable and sequentially consistent snapshot objects.
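In the failure-free case, the structure of a lattice agreement decision is easy to sketch over the join-semilattice of sets, where join is union: each process's output contains its own input, is contained in the join of all inputs, and is comparable to every other output. Tolerating crash or Byzantine failures requires the multi-round algorithms developed in the dissertation; the sketch below shows only the decision structure:

```python
# Idealized, failure-free sketch of lattice agreement on the join-
# semilattice of sets (join = union): every process broadcasts its input,
# joins everything it receives, and outputs the result.

def lattice_agreement(inputs):
    """inputs: one set per process; returns one output set per process."""
    everything = set().union(*inputs)            # all-to-all broadcast
    return [inp | everything for inp in inputs]  # join received values

outs = lattice_agreement([{1}, {2}, {1, 3}])
# Comparability: every pair of outputs is ordered by containment.
assert all(a <= b or b <= a for a in outs for b in outs)
print(outs)   # [{1, 2, 3}, {1, 2, 3}, {1, 2, 3}]
```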
Item: UbiPAL : secure messaging and access control for ubiquitous computing (2015-05)
Bielstein, Cameron Taylor; Alvisi, Lorenzo; Dickerson, Robert F.
The ubiquitous computing environment and modern trends in personal computing, such as body sensor networks and smart houses, create unique challenges in privacy and access control. The lack of centralized computing and the dynamic nature of human environments and access rules render most access control systems insufficient for this new category of systems. UbiPAL is an object-oriented communication framework for ubiquitous systems that provides secure communication and decentralized access control. UbiPAL uses a modified SecPAL implementation to provide reliable, ad hoc access control. The system uses cryptographically signed, publicly held namespace certificates and access control lists in the style of TLS certificates. This approach allows message authentication and authorization in an ad hoc, completely decentralized manner while keeping the policy language human-readable. UbiPAL was implemented as a C++ library, made freely available at (1), and evaluated to have minimal overhead: even on the slowest device evaluated, a Raspberry Pi, UbiPAL authentication and authorization add less than 20 milliseconds to the delivery of a message, with a message overhead of 153 bytes. The UbiPAL programming model separates access policy from application programming, requiring only small amounts of code from the application programmer and creating an accessible paradigm for programming ubiquitous computing systems.

Item: The weighted Byzantine Agreement Problem (2012-05)
Bridgman, John Francis, III; Garg, Vijay K. (Vijay Kumar), 1963-; Evans, Brian L.
This report presents a weighted version of the Byzantine Agreement Problem and its solution under various conditions. In this version, each machine is assigned a weight depending on the application. Instead of assuming that at most $f$ out of $N$ machines fail, the algorithm assumes that the total weight of the machines that fail is at most $\rho < 1/3$. When each machine has weight $1/N$, this problem reduces to the standard Byzantine Generals Agreement Problem. By choosing weights appropriately, the weighted Byzantine Agreement Problem can be applied to situations where a subset of processes is more trusted. By using weights, the system can reach consensus in the presence of Byzantine failures even when more than $N/3$ processes fail, as long as the total weight of the failed processes is less than $1/3$. Some properties of the weighted Byzantine Agreement algorithms when the weight vectors are not the same at every process are discussed. Also, a method to update the weights of the processes after execution of weighted Byzantine Agreement is given. The update method guarantees that the weight of any correct process is never reduced, while the weight of any faulty process suspected by correct processes whose total weight is at least $1/4$ is reduced to $0$ for future instances. A short discussion of some weight assignment strategies is also given.
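The weight-update rule described above can be sketched directly: zero out any process suspected by accusers holding total weight of at least $1/4$, then renormalize, so that no correct process's weight ever decreases. The data layout below is illustrative, not the report's implementation:

```python
# Sketch of the weight-update bookkeeping: weights sum to 1, and the
# algorithm assumes the total weight of faulty processes is rho < 1/3.
# A process suspected by accusers holding total weight >= 1/4 is zeroed
# for future instances; renormalizing over the survivors guarantees that
# no correct process's weight is ever reduced.

def update_weights(weights, suspicions):
    """weights: {pid: w} with sum(w) == 1.
    suspicions: {suspected_pid: set of accusing pids}."""
    zeroed = {p for p, accusers in suspicions.items()
              if sum(weights[a] for a in accusers) >= 0.25}
    surviving = sum(w for p, w in weights.items() if p not in zeroed)
    return {p: (0.0 if p in zeroed else w / surviving)
            for p, w in weights.items()}

w = {"a": 0.4, "b": 0.3, "c": 0.2, "d": 0.1}
# a and b (total weight 0.7 >= 1/4) both suspect d after a run:
print(update_weights(w, {"d": {"a", "b"}}))
# d -> 0.0; a, b, c are scaled up by 1/0.9, so no correct weight decreases.
```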