Architecting memory systems upon highly scaled error-prone memory technologies

Wang, Hao
Other Contributors
Zhang, Tong
Le Coz, Yannick L.
Saulnier, Gary J.
Carothers, Christopher D.
Issue Date
Electrical engineering
Terms of Use
Attribution-NonCommercial-NoDerivs 3.0 United States
This electronic version is a licensed copy owned by Rensselaer Polytechnic Institute, Troy, NY. Copyright of original work retained by author.
Full Citation
This thesis presents a series of orthogonal memory system design techniques that leverage the characteristics of various applications to optimize memory fault tolerance in highly scaled memory technologies. It advocates a system-aided memory scaling and data-dependent error-tolerance design strategy that allows memory chips to deliver erroneous bits. These erroneous bits are directly visible to, and tolerated by, the system-level memory controller rather than the memory chips themselves. This design is evaluated for the case of using DRAM and STT-RAM in solid-state drives (SSDs). By dynamically and jointly adjusting ECC configurations, the memory controller is able to adapt to runtime data access characteristics. This approach yields significant savings in ECC redundancy and improves data reliability.
Besides hardware-based strategies, this thesis also presents a software-based solution for using DRAM with unrepaired weak cells in computing systems. The solution is based on the simple idea that the operating system (OS) withholds from use all the error-prone pages, i.e., pages that contain at least one unrepaired weak cell. Under a relatively high error-prone page rate (e.g., 8%), it is almost impossible for the OS to allocate a contiguous, fragmentation-free physical memory space for some critical operations. Moreover, withholding all the error-prone pages from practical use could waste a noticeable amount of memory. To address these issues, this thesis presents a controller-based selective page remapping strategy that ensures a contiguous critical memory region for the OS, and develops a software-based memory error tolerance scheme that recycles all the error-prone pages for the zRAM function in Linux. Experiments are carried out using SPEC CPU2006, and the latency, hardware cost and effectiveness of recycling error-prone pages for zRAM in Linux are further studied. The experimental results show that the proposed software-based error tolerance scheme degrades the speed of zRAM by at most 7%.
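The difficulty of finding a contiguous error-free region at an 8% error-prone page rate can be illustrated with a short calculation. The sketch below is not taken from the thesis: it assumes that error-prone pages occur independently and uniformly, and the function name, the 4 KB page size and the example run lengths are hypothetical choices made only for illustration.

```python
def clean_run_probability(page_error_rate: float, run_pages: int) -> float:
    """Probability that one given run of `run_pages` consecutive pages
    contains no error-prone page.

    Assumption (not from the thesis): pages are error-prone independently
    of each other, each with probability `page_error_rate`.
    """
    if not 0.0 <= page_error_rate <= 1.0:
        raise ValueError("page_error_rate must lie in [0, 1]")
    if run_pages < 0:
        raise ValueError("run_pages must be non-negative")
    return (1.0 - page_error_rate) ** run_pages


if __name__ == "__main__":
    rate = 0.08  # the example error-prone page rate quoted in the abstract
    # Hypothetical run lengths; 512 pages is 2 MB if pages are 4 KB.
    for pages in (1, 16, 128, 512):
        prob = clean_run_probability(rate, pages)
        print(f"{pages:4d} pages: P(all error-free) = {prob:.3e}")
```

Under these assumptions, the chance that a particular 512-page run is entirely error-free is on the order of 10⁻¹⁹, which is consistent with the observation that large contiguous allocations become impractical without remapping.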
3D memory chip stacking is another promising technology: it enables an entirely new category of high-performance memory that delivers unprecedented system performance and bandwidth. Although the emerging 3D DRAM products can significantly improve computing system performance, their relatively high cost is one of the most critical issues preventing wide real-life adoption. Fortunately, system-aided DRAM scaling fits very naturally with the emerging 3D DRAM-controller integrated chips such as the hybrid memory cube (HMC). Under such a system-aided DRAM scaling design framework, the most crucial challenge is how to compensate most effectively for the memory errors caused by erroneous cells, at minimal overhead in terms of data access latency and redundancy. Conventional ECC designs for memory focus on random errors and pay no attention to the error patterns introduced by weak cells. The design strategy proposed in this thesis can tolerate weak cell rates as high as 10⁻⁴ and 6×10⁻⁵ when 100% and 90% of all the weak cells, respectively, are known in advance. Using Micron’s HMC 3D DRAM chips as the test vehicle, the evaluated implementation consumes less than 0.4 mm² (45 nm node) on the logic die. Using CPU and DRAM simulators, simulations are further carried out over a variety of computing benchmarks, and the results show that this design solution incurs less than 2% performance degradation on average.
DRAM (dynamic random access memory) technology has been fueling the computing industry for almost five decades and plays an essential role in enabling modern information technology infrastructure. However, as DRAM technology scaling approaches 20 nm and below, it has become increasingly challenging to maintain the historical bit cost reduction. In particular, with DRAM technology scaling towards sub-20 nm, it becomes more and more difficult to achieve sufficient DRAM data retention time. DRAM cells with relatively short retention time are referred to as weak cells and may fail to keep stored data within a given refresh period (e.g., 64 ms or 128 ms in current practice). Thus, tremendous efforts have been devoted to seeking alternative memory technologies. Several emerging memory technologies have been considered promising candidates, for example, Spin-Transfer-Torque (STT) RAM and Phase Change Memory (PCM). Although these emerging memory technologies may have advantages in scaling, they inevitably face cost, capacity and reliability challenges. In conventional practice, all the erroneous memory cells are masked by redundancy repair and error control codes (ECC), so that they are invisible to the outside. However, it becomes impractical for the memory industry to keep this design philosophy in the sub-20 nm regime.
August 2017
School of Engineering
Dept. of Electrical, Computer, and Systems Engineering
Rensselaer Polytechnic Institute, Troy, NY
Rensselaer Theses and Dissertations Online Collection
CC BY-NC-ND. Users may download and share copies with attribution in accordance with a Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 License. No commercial use or derivatives are permitted without the explicit approval of the author.