Nearly free resilient memory architectures that balance resilience, performance, and cost
Memory reliability has been a major design constraint for mission-critical and large-scale systems for many years. Continued innovation is still necessary because the rate of faults, and of the errors they lead to, grows with system size, and because some faults become more likely as fabrication technology advances. Furthermore, recent field studies have shown that more severe permanent/intermittent and multi-bit faults are roughly as frequent as single-bit and transient ones. Therefore, strong error checking and correcting (ECC) schemes that can correct multi-bit errors have been developed and are in use. However, using ECC to correct the numerous recurring errors from permanent faults forces a trade-off among cost, performance, and reliability. First, a permanent fault is likely to result in numerous erroneous accesses, each incurring a possibly high correction overhead. Second, once redundancy has been consumed by correction, further errors may go uncorrected and lead to data loss, which is called a detected uncorrectable error (DUE), or, worse, go undetected and result in silent data corruption (SDC). Stronger ECC can be used to tolerate more errors, but at higher overhead. The straightforward solution to this problem of repeated costly corrections and reduced coverage is to replace faulty memory devices; doing so, however, is expensive and requires either increased system downtime or increased storage and bandwidth overheads. An economical alternative is to retire, and possibly remap, just the faulty memory regions. Existing retirement techniques, however, either require sophisticated software support; impact capacity, reliability, and/or performance; or introduce additional storage and hardware structures. Implementing a strong ECC such as Single Device Data Correction (SDDC) ECC (or chipkill-level ECC) is typically expensive in terms of storage and complexity. It is especially challenging to implement SDDC-level ECC in emerging high-bandwidth memories such as HBM2.
This is because, in HBM2 for instance, a single ECC codeword is transferred from one memory device, and thus simply adding a redundant device results in high storage, energy, and bandwidth overheads. Such wide-data-width memories are nevertheless widely used in graphics processing units (GPUs) and Intel’s Xeon Phi processors (e.g., Knights Landing) to exploit high memory bandwidth. Because GPUs are widely used to build large-scale high-performance systems, improving the resilience of GPU memory systems has become an important concern, yet current GPU memory ECC is limited to single-bit error correction and double-bit error detection (SECDED). Further improving GPU memory resilience therefore requires coordinating multiple techniques. One promising addition is a software-driven memory repair technique that retires affected pages using virtual memory support. This approach reduces the risk of future uncorrected or even undetected memory errors. Because the underlying ECC code is weak, however, retiring a page only after an error has occurred may not be enough to avoid a DUE; susceptible memory blocks must instead be retired proactively to avert potential system failures. In this dissertation, I develop strong, low-cost memory fault tolerance mechanisms that improve both system resilience and availability without wasting resources, increasing memory access complexity, or compromising the fault tolerance of existing resilience schemes. I first identify two notable characteristics of DRAM failures in the field: permanent faults are as frequent as transient faults, and most faults affect small memory regions. Based on this analysis of DRAM failure patterns, I propose and evaluate two novel hardware-only memory repair mechanisms that improve memory system reliability significantly without compromising performance or increasing overhead. I also develop a strong, low-cost GPU memory fault tolerance mechanism based on three insights.
First, ECC for GPUs should not interfere with the high-bandwidth DRAM system, which may use just one DRAM device for each data access. Second, the way in which GPUs are used as accelerators offers unique opportunities to tolerate even severe memory errors without relying on SDDC ECC. Third, nearly all permanent memory faults can be repaired with low-overhead techniques. Building on these insights, I propose and evaluate a multi-tier, strong GPU memory fault tolerance mechanism in which the techniques at each level work closely together to significantly improve accelerator memory system resilience at low overhead.
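To make the SECDED limitation discussed above concrete, the following is a minimal, purely illustrative sketch of a Hamming-plus-overall-parity SECDED code over 8 data bits (a (13,8) toy code; real DRAM ECC uses wider codes such as (72,64) and is implemented in hardware). It shows the behavior the abstract relies on: a single-bit error is corrected, while a double-bit error can only be flagged as a detected uncorrectable error (DUE).

```python
# Toy Hamming SECDED code: 8 data bits -> 12-bit Hamming codeword
# plus 1 overall parity bit (13 bits total). Illustrative only.

def _parity_positions(n_code):
    # Parity bits occupy power-of-two positions 1, 2, 4, 8, ...
    p, pos = 1, []
    while p <= n_code:
        pos.append(p)
        p <<= 1
    return pos

def encode(data_bits):
    """Encode 8 data bits (0/1 list) into a 13-bit SECDED codeword."""
    n = 12
    code = [0] * (n + 1)            # 1-indexed positions 1..12
    ppos = _parity_positions(n)     # [1, 2, 4, 8]
    d = iter(data_bits)
    for i in range(1, n + 1):       # data bits fill non-parity positions
        if i not in ppos:
            code[i] = next(d)
    for p in ppos:                  # even parity over each covered set
        code[p] = sum(code[i] for i in range(1, n + 1) if i & p) % 2
    overall = sum(code[1:]) % 2     # extra bit distinguishes 1 vs 2 errors
    return code[1:] + [overall]

def decode(word):
    """Return (status, data_bits); status is 'ok', 'corrected', or 'DUE'."""
    n = 12
    code = [0] + word[:n]
    overall_ok = sum(word) % 2 == 0
    syndrome = 0
    for p in _parity_positions(n):  # syndrome encodes the error position
        if sum(code[i] for i in range(1, n + 1) if i & p) % 2:
            syndrome |= p
    if syndrome == 0 and overall_ok:
        status = 'ok'
    elif not overall_ok:            # odd error count: assume one, correct it
        if syndrome:
            code[syndrome] ^= 1
        status = 'corrected'
    else:                           # even error count with nonzero syndrome:
        return 'DUE', None          # two errors -> detected, uncorrectable
    data = [code[i] for i in range(1, n + 1)
            if i not in _parity_positions(n)]
    return status, data
```

A permanent fault that keeps flipping one bit forces the `'corrected'` path on every access (the recurring correction overhead the abstract describes), and a second fault in the same codeword pushes it into the `'DUE'` path, which is why retiring or remapping the faulty region is attractive.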