Memory protection techniques for DRAM scaling-induced errors

Date

2018-10-09

Authors

Gong, Seong-Lyong

Abstract

Continued scaling of DRAM technologies produces more faulty DRAM cells than before. These inherent faults increase significantly at sub-20nm technology nodes, so traditional remapping schemes such as row/column sparing become very inefficient. Because the inherent faults manifest as single-bit errors, DRAM vendors are proposing to embed single-bit error-correcting (SEC) ECC modules inside each DRAM chip, a scheme called In-DRAM ECC (IECC). However, IECC achieves only limited reliability improvement because of its weak correction capability. Specifically, at high scaling error rates, multi-bit scaling errors occur readily in practice and escape IECC protection. Because of these escaped scaling errors, overall reliability may degrade even though overall overheads increase. For example, in highly reliable systems that apply a strong ECC at the rank level (i.e., across DRAM chips that are accessed simultaneously), Chipkill protection can no longer be guaranteed once scaling errors escape. In this dissertation, I address this scaling-induced error problem as follows. First, I propose a more sophisticated fault-error model that includes intermittent scaling errors. In general, the perceived effectiveness of a proposed solution depends strongly on the evaluation methodology. Prior work evaluated solutions against scaling errors using only a simple model and concluded that efficient remapping schemes cope effectively with scaling errors. However, intermittent scaling errors cannot be easily detected and remapped. This implies that forward error correction, rather than the proposed remapping schemes, may be the only solution to the scaling error problem. With the new evaluation model, proposed solutions to scaling errors can be evaluated more comprehensively than before. Second, I propose two alternatives to In-DRAM ECC for highly reliable systems: Dual Use of On-chip redundancy (DUO) and Why-Pay-More (YPM).
DUO achieves higher reliability than In-DRAM ECC-based solutions by transferring on-chip redundancy to the rank level. The transferred redundancy is then combined with the original rank-level redundancy to apply a stronger rank-level ECC. YPM is the first rank-level-only ECC protection against scaling errors. To enable this cost-saving design, YPM optimizes correction capability by exploiting erasure Reed-Solomon (RS) decoding and an iterative bit-flipping search. Each alternative is industry-changing: DUO achieves much higher reliability than current rank-level ECC, and YPM does not require In-DRAM ECC at all. Both alternatives are practical in that they require only small changes to DRAM designs.
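
The claim that multi-bit scaling errors escape SEC protection at high error rates can be illustrated with a rough calculation. The sketch below is not the dissertation's fault-error model: it assumes independent, identically distributed bit errors and a hypothetical 136-bit In-DRAM ECC codeword (128 data bits plus 8 check bits), neither of which is stated in the abstract, and it ignores intermittent errors entirely. The function name `p_multi_bit` is illustrative only.

```python
from math import comb


def p_multi_bit(p_bit: float, n_bits: int = 136) -> float:
    """Probability that one codeword contains two or more bit errors.

    Assumes independent bit errors with probability p_bit each.
    n_bits = 136 is an assumed codeword size (128 data + 8 check bits).
    A SEC code corrects exactly one error, so any word with two or more
    errors is beyond its correction capability.
    """
    # Sum the binomial tail directly (k >= 2) to avoid the cancellation
    # that 1 - P(0 errors) - P(1 error) suffers at very small p_bit.
    return sum(
        comb(n_bits, k) * p_bit**k * (1.0 - p_bit) ** (n_bits - k)
        for k in range(2, n_bits + 1)
    )


if __name__ == "__main__":
    # Illustrative bit error rates only; not measured values.
    for p in (1e-6, 1e-5, 1e-4):
        print(f"p_bit={p:.0e}  P(>=2 errors per word)={p_multi_bit(p):.3e}")
```

Under these assumptions the per-word probability grows roughly with the square of the bit error rate, so a tenfold increase in scaling error rate yields about a hundredfold increase in words that a SEC code cannot correct.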
