Fault tolerance in distributed systems : a coding-theoretic approach




Balasubramanian, Bharath

Journal Title

Journal ISSN

Volume Title



Distributed systems are rapidly increasing in importance due to the need for scalable computations on huge volumes of data. This fact is reflected in many real-world distributed applications such as Amazon's EC2 cloud computing service, Facebook's Cassandra key-value store or Apache's Hadoop MapReduce framework. Multi-core architectures developed by companies such as Intel and AMD have further brought this to prominence, since workloads can now be distributed across many individual cores. The nodes or entities in such systems are often built using commodity hardware and are prone to physical failures and security vulnerabilities. Achieving fault tolerance in such systems is a challenging task, since it is not easy to observe and control these distributed entities. Replication is a standard approach for fault tolerance in distributed systems. The main advantage of this approach is that the backups incur very little overhead in terms of the time taken for normal operation or recovery. However, replication is grossly wasteful in terms of the number of backups required for fault tolerance. The large number of backups has two major implications. First, the total space or memory required for fault tolerance is considerably high. Second, there is a significant cost of resources such as the power required to run the backup processes. Given the large number of distributed servers employed in real-world applications, it is a hard task to provide fault tolerance while achieving both space and operational efficiency. In the world of data fault tolerance and communication, coding theory is used as the space efficient alternate for replication. A direct application of coding theory to distributed servers, treating the servers as blocks of data, is very inefficient in terms of the updates to the backups. This is primarily because each update to the server will affect many blocks in memory, all of which have to be re-encoded at the backups. This leads us to the following thesis statement: Can we design a mechanism for fault tolerance in distributed systems that combines the space efficiency of coding theory with the low operational overhead of replication? We present a new paradigm to solve this problem, broadly referred to as fusion. We provide fusion-based solutions for two models of computation that are representative of a large class of applications: (i) Systems modeled as deterministic finite state machines and, (ii) Systems modeled as programs containing data structures. For finite state machines, we use the notion of Hamming distances to present a polynomial time algorithm to generate efficient backup state machines. For programs hosting data structures, we use a combination of erasure codes and selective replication to generate efficient backups for most commonly used data structures such as queues, array lists, linked lists, vectors and maps. We present theoretical and experimental results that demonstrate the efficiency of our schemes over replication. Finally, we use our schemes to design an efficient solution for fault tolerance in two real-world applications: Amazons Dynamo key-value store, and Google's MapReduce framework.



LCSH Subject Headings