Browsing by Subject "DRAM"
Now showing 1 - 11 of 11
Item: Compute-in-memory designs for deep neural network and combinatorial optimization problems accelerators (2023-04-23). Xie, Shanshan, Ph. D.; Kulkarni, Jaydeep P.; Pan, David Z.; Orshansky, Michael; Jia, Yaoyao; Hamzaoglu, Fatih
The unprecedented growth in Deep Neural Network (DNN) model size has resulted in a massive amount of data movement from off-chip memory to on-chip processing cores in modern Machine Learning (ML) accelerators. Compute-In-Memory (CIM) designs, which perform analog DNN computations within a memory array along with peripheral data converter circuits, are being explored to mitigate the latency and energy overheads of this ‘Memory Wall’ bottleneck. Embedded non-volatile magnetic [Wei et al. [2019]; Chih et al. [2020]; Dong et al. [2018]; Shih et al. [2019]] and resistive [Jain et al. [2019]; Chou et al. [2020]; Chang et al. [2014]; Lee et al. [2017]] memories, as well as standalone Flash memories, suffer from low write speeds and poor write endurance and cannot be used for programmable accelerators requiring fast and frequent model updates. Similarly, cost-sensitive commodity DRAM (Dynamic Random Access Memory) cannot be leveraged for high-speed, custom CIM designs: limited metal layers and dense floorplan constraints often lead instead to compute-near-memory designs with limited throughput benefits [Aga et al. [2019]]. Among the prevalent semiconductor memories, eDRAM (embedded DRAM), which integrates the DRAM bitcell monolithically along with high-performance logic transistors and interconnects, can enable custom CIM designs by offering the densest embedded bitcell, low pJ/bit access energy, high endurance, high performance, and high bandwidth; all desired attributes for ML accelerators [Fredeman et al. [2015]; Berry et al. [2020]]. Yet eDRAM has been used only in niche applications due to its high cost/bit, low retention time, and high noise sensitivity.
On the DNN algorithms front, the landscape is rapidly changing with the adoption of 8-bit integer arithmetic for both DNN inference and training algorithms [Jouppi et al. [2017]; Yang et al. [2020]]. These reduced bit-width computations are extremely conducive to CIM designs, which have shown promising results for integer arithmetic [Biswas and Chandrakasan [2018]; Gonugondla et al. [2018a]; Zhang et al. [2017]; Si et al. [2019]; Yang et al. [2019]; Khwa et al. [2018]; Chen et al. [2019]; Dong et al. [2020]; Valavi et al. [2019]; Dong et al. [2017]; Jiang et al. [2019]; Yin et al. [2020]]. Thus, the high cost/bit of eDRAM can now be amortized by repurposing existing eDRAM in high-end processors to enable CIM circuits. Despite the potential of eDRAM technology and the progress in DNN integer arithmetic, no hardware demonstration of an eDRAM-based CIM design has been reported so far. Therefore, in this dissertation, the first project explores the compute-in-memory concept with the dense 1T1C eDRAM bitcells as charge-domain circuits for convolutional neural network (CNN) multiply-accumulation-averaging (MAV) computation. This method minimizes area overhead by leveraging existing 1T1C eDRAM columns to construct an adaptive data converter, dot-product, averaging, pooling, and ReLU activation on the memory array. The second project presents a leakage and read bitline (RBL) swing-aware compute-in-memory (CIM) design leveraging a promising high-density gain-cell embedded DRAM bitcell and the intrinsic RBL capacitors to perform CIM computations within the limited RBL swing available in a 2T1C eDRAM. The CIM D/A converters (DAC) are realized intrinsically with variable RBL precharge voltage levels. A/D converters (ADC) are realized using Schmitt Triggers (ST) as compact and reconfigurable Flash comparators.
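The multiply-accumulate-average operation these CIM arrays perform can be modeled numerically. The sketch below is a plain-arithmetic illustration only, not the authors' circuit: the bitline dot product, charge-sharing average, ReLU, and a coarse flash-ADC quantization are all idealized, and the function name and ADC resolution are illustrative assumptions.

```python
import numpy as np

def cim_mac_average(inputs, weights, adc_levels=8):
    """Idealized charge-domain multiply-accumulate-average:
    each column computes a dot product, charge sharing averages it
    over the rows, and a coarse flash ADC quantizes the result."""
    mav = inputs @ weights / len(inputs)      # analog dot product + averaging
    mav = np.maximum(mav, 0.0)                # ReLU activation on the array
    # Coarse quantization, as a stand-in for Schmitt-trigger comparators.
    step = mav.max() / adc_levels if mav.max() > 0 else 1.0
    return np.round(mav / step).astype(int)

x = np.array([1, 0, 1, 1])                    # binary activations
W = np.array([[2, 1], [1, 3], [0, 2], [1, 1]])  # two columns of weights
print(cim_mac_average(x, W))
```

The averaging step is what distinguishes MAV from a plain MAC: charge sharing across the bitline capacitors divides the accumulated sum by the number of contributing rows for free.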
Similar to machine learning applications, combinatorial optimization problems (COP) also require data-intensive computations, which are naturally suitable for adopting the compute-in-memory concept as well. Combinatorial optimization problems find many real-world social and industrial data-intensive computing applications. Examples include optimization of mRNA sequences for COVID-19 vaccines [Leppek et al. [2021]; Pardi et al. [2018]], semiconductor supply chains [Crama [1997]; Kempf [2004]], and financial index tracking [Benidis et al. [2018]], to name a few. Such COPs are predominantly NP-hard [Yuqi Su and Kim [2020]], and performing an exhaustive brute-force search becomes untenable as the COP size increases. An efficient way to solve COPs is to let nature perform the exhaustive search in the physical world using the Ising model, which can map many types of COPs [Lucas [2014]]. The Ising model describes spin dynamics in a ferromagnet [Peierls [1936]], wherein spins naturally orient to achieve the lowest ensemble energy state of the Ising model, representing the optimal COP solution [Yoshimura et al. [2015]]. Therefore, in order to accelerate the COP computations, the third project focuses on implementing analog compute-in-memory techniques for Ising computation to eliminate unnecessary data movement and to reduce energy costs. The COPs can be mapped into a generic Ising model framework, and the computations are performed directly on the bitlines. Spin updates are performed locally using the existing sense amplifier in the peripheral circuits and the write-after-read mechanism in the memory array controller. Beyond that, the fourth project explores CIM designs for solving Boolean Satisfiability (SAT) problems, a non-deterministic polynomial time (NP)-complete class of problems with many practical and industrial data-intensive applications.
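The Ising formulation above can be illustrated with a toy software solver, not the analog in-memory hardware: spins flip locally whenever that lowers the ensemble energy, mimicking the sense-amplifier spin update on the bitlines. The coupling matrix, greedy tie-breaking rule, and example graph (a MAX-CUT instance) are all illustrative assumptions.

```python
import random

def ising_energy(J, s):
    """Ising energy H = sum over i<j of J[i][j] * s[i] * s[j] (no field)."""
    n = len(s)
    return sum(J[i][j] * s[i] * s[j] for i in range(n) for j in range(i + 1, n))

def solve(J, n, sweeps=100, seed=0):
    """Greedy local spin updates: each spin aligns against its local field,
    flipping whenever that lowers the energy."""
    rng = random.Random(seed)
    s = [rng.choice([-1, 1]) for _ in range(n)]
    for _ in range(sweeps):
        for i in range(n):
            # Local field seen by spin i (J stored upper-triangular).
            h = sum(J[min(i, j)][max(i, j)] * s[j] for j in range(n) if j != i)
            s[i] = -1 if h > 0 else 1
    return s, ising_energy(J, s)

# MAX-CUT on a triangle plus a pendant edge: antiferromagnetic J = 1 per edge.
n = 4
J = [[0] * n for _ in range(n)]
for i, j in [(0, 1), (1, 2), (0, 2), (2, 3)]:
    J[i][j] = 1
spins, energy = solve(J, n)
print(spins, energy)
```

The triangle is frustrated, so the lowest reachable energy is -2 (three of the four edges cut); the greedy sweeps settle there from any starting configuration on this small instance.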
An all-digital SAT solver, called Snap-SAT, is presented to accelerate these iterative computations using a static random-access memory (SRAM) array, reducing frequent memory accesses and minimizing hardware implementation cost. This design demonstrates a promising, fast, reliable, reconfigurable, and scalable compute-in-memory approach for solving and accelerating large-scale hard SAT problems, suggesting its potential for time-critical SAT problems in real-life applications (e.g., defense, vaccine development).
Item: DRAM-aware prefetching and cache management (2010-12). Lee, Chang Joo, 1975-; Patt, Yale N.; Touba, Nur A.; Chiou, Derek; Namazi, Hossein; Mutlu, Onur
Main memory system performance is crucial for high-performance microprocessors. Even though the peak bandwidth of main memory systems has increased through improvements in the microarchitecture of Dynamic Random Access Memory (DRAM) chips, conventional on-chip memory systems of microprocessors do not fully take advantage of it. This results in underutilization of the DRAM system: in other words, many idle cycles on the DRAM data bus. The main reason is that conventional on-chip memory system designs do not fully take into account important DRAM characteristics. As a result, the high bandwidth of DRAM-based main memory systems cannot be realized and exploited by the processor. This dissertation identifies three major performance-related characteristics that can significantly affect DRAM performance and makes a case for DRAM characteristic-aware on-chip memory system design. We show that on-chip memory resource management policies (such as prefetching, buffer, and cache policies) that are aware of these DRAM characteristics can significantly enhance overall system performance.
The key idea of the proposed mechanisms is to send the DRAM system useful memory requests that can be serviced with low latency or in parallel with other requests, rather than requests that are serviced with high latency or serially. Our evaluations demonstrate that each of the proposed DRAM-aware mechanisms significantly improves performance by increasing DRAM utilization for useful data. We also show that when employed together, the performance benefit of each mechanism is achieved additively: they work synergistically and significantly improve the overall system performance of both single-core and Chip MultiProcessor (CMP) systems.
Item: Efficient error correcting codes for emerging and high-density memory systems (2019-12). Das, Abhishek (Ph. D. in Electrical and Computer Engineering); Touba, Nur A.; Abraham, Jacob A.; Pan, Zhigang; Orshansky, Michael; Bhargava, Mudit
As memory technology scales, the demand for higher performance and reliable operation is increasing as well. Field studies show increased error rates in dynamic random-access memories. The high density comes at the cost of more marginal cells and higher power consumption. Multiple-bit upsets caused by high-energy radiation are now the most common source of soft errors in static random-access memories, affecting multiple cells. Phase change memories have been in focus as an attractive alternative to DRAMs due to their low power consumption, lower bit cost, and high density, but these memories suffer from various reliability issues. The errors caused by such mechanisms can cause large overheads for conventional error correcting codes. This research addresses the issue of memory reliability under these new constraints due to technology scaling. The goal of the research is to address the different error mechanisms as well as increased error rates while keeping the error correction time low so as to enable high throughput.
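The kind of single-error-correcting code such memories build on can be shown in miniature. The sketch below is a textbook Hamming(7,4) code, not any of the codes proposed in this dissertation: three parity bits cover four data bits, and the recomputed parity syndrome directly names the erroneous bit position.

```python
def hamming74_encode(d):
    """Encode 4 data bits with 3 parity bits (Hamming(7,4)).
    Bit positions are 1..7; positions 1, 2, and 4 hold parity."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4            # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4            # covers positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4            # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_correct(c):
    """Recompute parities; the syndrome is the 1-based error position."""
    c = list(c)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3
    if syndrome:
        c[syndrome - 1] ^= 1     # flip the erroneous bit back
    return [c[2], c[4], c[5], c[6]]   # extract the data bits

code = hamming74_encode([1, 0, 1, 1])
code[4] ^= 1                     # inject a single-bit error
print(hamming74_correct(code))   # recovers [1, 0, 1, 1]
```

The dissertation's point is precisely that codes of this single-bit class are no longer sufficient for burst, multi-bit, and limited-magnitude error mechanisms, which motivate the stronger constructions it proposes.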
Various schemes have been proposed, such as addressing multiple-bit upsets in SRAMs through a burst error correcting code whose complexity increases linearly, as compared to exponentially for existing methods [Das 18b], as well as a double error correcting code with lower complexity and lower correction time for the increased error rates in DRAMs [Das 19]. This research also addresses limited-magnitude errors in emerging multilevel cell memories, e.g., phase change memories. A scheme that extends binary Orthogonal Latin Square codes is presented [Das 17], utilizing a few bits from each cell to provide protection based on the error magnitude. The issue of write disturbance errors in multilevel cells is also addressed [Das 18a] using a modified Reed-Solomon code. The proposed scheme achieves a very low decoding time compared to existing methods through the use of a new construction methodology and a simplified decoding procedure. Finally, a new scheme using non-binary Hamming codes is presented that protects more memory cells for the same amount of redundancy [Das 18c] through the use of unused columns in the code space of the design.
Item: Energy efficient high bandwidth DRAM for throughput processors (2021-05-03). O'Connor, James Michael; Swartzlander, Earl E., Jr., 1945-; Erez, Mattan; Fussell, Donald; John, Lizy K.; Keckler, Stephen W.; Reddi, Vijay J.
Graphics Processing Units (GPUs) and other throughput processing architectures have scaled performance through simultaneous improvements in compute capability and aggregate memory bandwidth. Satisfying the increasing bandwidth demands of future systems without a significant increase in the power budget for the DRAM is a key challenge going forward. A new DRAM architecture, Fine-Grained DRAM, significantly reduces energy consumption by partitioning the DRAM die into many small independent units, called grains, each of which has a local I/O link to the processor.
With this architecture, the on-DRAM data movement energy is greatly reduced due to the much shorter wiring distance between the cell array and the local I/O. At the same time, the energy on the link between the DRAM and the GPU remains low by leveraging novel energy-efficient encoding techniques well suited to the narrow buses. Furthermore, wasteful row overfetch energy due to sparse accesses to large DRAM rows is significantly reduced by shrinking the effective DRAM row size in an area-efficient manner. This Fine-Grained DRAM architecture enables future reliable, multi-TB/sec memory systems within a power budget comparable to current GPU memory systems.
Item: The feasibility of memory encryption and authentication (2013-05). Owen, Donald Edward, Jr.; John, Lizy Kurian
This thesis presents an analysis of the implementation feasibility of RAM authentication and encryption. Past research has used simulations to establish that it is possible to authenticate and encrypt the contents of RAM with reasonable performance penalties by using clever implementations of tree data structures over the contents of RAM. However, previous work has largely bypassed implementation issues such as the power consumption and silicon area required to implement the proposed schemes, leaving implementation details unspecified. This thesis studies the implementation cost of AES-GCM hardware and software solutions for memory authentication and encryption and shows that software solutions are infeasible because they are too costly in terms of performance and power, whereas hardware solutions are more feasible.
Item: Memory protection techniques for DRAM scaling-induced errors (2018-10-09). Gong, Seong-Lyong; Erez, Mattan; Swartzlander, Earl; Touba, Nur; Dimakis, Alex; Lin, Calvin; Sullivan, Mike
Continued scaling of DRAM technologies induces more faulty DRAM cells than before.
These inherent faults increase significantly at sub-20nm technology, and hence traditional remapping schemes such as row/column sparing become very inefficient. Because the inherent faults manifest as single-bit errors, DRAM vendors are proposing to embed single-bit error correctable (SEC) ECC modules inside each DRAM chip, called In-DRAM ECC (IECC). However, IECC achieves only limited reliability improvement due to its weak correction capability. Specifically, at high scaling error rates, multi-bit scaling errors will easily occur in practice and escape IECC protection. Because of these escaped scaling errors, overall reliability may degrade despite the added overhead. For highly reliable systems that apply a strong ECC at the rank level (i.e., across DRAM chips that are accessed simultaneously), guarantees such as Chipkill can no longer be maintained once escaped errors occur. In this dissertation, I address this scaling-induced error problem as follows. First, I propose a more sophisticated fault-error model that includes intermittent scaling errors. In general, the effectiveness of proposed solutions strongly relies on the evaluation methodology. Prior related work evaluated solutions against scaling errors only with a simple model and concluded that efficient remapping schemes effectively cope with scaling errors. However, intermittent scaling errors cannot be easily detected and remapped. This implies that, rather than the proposed remapping schemes, forward error correction may be the only solution to the scaling error problem. Using the new evaluation model, the proposed solutions to scaling errors can be evaluated in a more comprehensive way than before. Second, I propose two alternatives to In-DRAM ECC, Dual Use of On-chip redundancy (DUO) and Why-Pay-More (YPM), for highly reliable systems. DUO achieves higher reliability than In-DRAM ECC-based solutions by transferring on-chip redundancy to the rank level.
Then, using the transferred redundancy together with the original rank-level redundancy, a stronger rank-level ECC is applied. YPM is the first rank-level-only ECC protection against scaling errors. For this cost-saving design, YPM optimizes the correction capability by exploiting erasure Reed-Solomon (RS) decoding and iterative bit-flipping search. Each alternative is industry-changing in that DUO achieves much higher reliability than current rank-level ECC and YPM does not require In-DRAM ECC at all. Both alternatives are practical in that they require only small changes to DRAM designs.
Item: Mitigating bank conflicts in main memory via selective data duplication and migration (2021-05-07). Lin, Ching-Pei; Patt, Yale N.; Chiou, Derek; Erez, Mattan; Witchel, Emmett; Wilkerson, Chris
Main memory is organized as a hierarchy of banks, rows, and columns. Only data from a single row can be accessed from each bank at any given time. Switching between different rows of the same bank requires serializing long-latency operations to the bank. Consequently, memory performance suffers on bank conflicts, when concurrent requests access different rows of the same bank. Many prior solutions to the bank conflict problem required modifications to the memory device and/or the memory access protocol. Such modifications create hurdles for adoption due to the commodity nature of the memory business. Instead, I propose two new runtime solutions that work with unmodified memory devices and access protocols. The first, Duplicon Cache, duplicates select data to multiple banks, allowing duplicated data to be sourced from either the original bank or the alternate bank, whichever is more lightly loaded. The second, Continuous Row Compaction, identifies data that are frequently accessed together, then migrates them to non-conflicting rows across different banks. To limit the data transfer overhead from data duplication and migration, only select data are duplicated/migrated.
The key is to identify large working sets of the running applications that remain stable over very long time intervals, and to slowly duplicate/migrate them over time, amortizing the cost of duplication/migration. In effect, the set of duplicated/migrated data forms a cache within main memory that captures large stable working sets of the application.
Item: Mitigating DRAM complexities through coordinated scheduling policies (2011-05). Stuecheli, Jeffrey Adam; John, Lizy Kurian; Ambler, Tony; Erez, Mattan; Swartzlander, Earl; Zhang, Lixin
Contemporary DRAM systems have maintained impressive scaling by managing a careful balance between performance, power, and storage density. Achieving these goals, however, has come at the cost of significant operational complexity. To realize good performance, systems must properly manage the significant number of structural and timing restrictions of the DRAM devices. DRAM's efficient use is further complicated in many-core systems, where the memory interface must be shared among multiple cores/threads competing for memory bandwidth. In computer architecture, caches have primarily been viewed as a means to hide memory latency from the CPU. Cache policies have focused on anticipating the CPU's data needs and are mostly oblivious to the main memory. This work demonstrates that the era of many-core architectures has created new main memory bottlenecks and mandates a new approach: coordination of cache policy with main memory characteristics. Using the cache for memory optimization purposes dramatically expands the memory controller's visibility of processor behavior, at low implementation overhead. Through memory-centric modification of existing policies, such as scheduled writebacks, this work demonstrates that the performance-limiting effects of highly threaded architectures combined with complex DRAM operation can be overcome.
This work shows that, with awareness of the physical main memory layout and a focus on writes, both average read and write latency can be shortened, memory power reduced, and overall system performance improved. The use of the "Page-Mode" feature of DRAM devices can mitigate many DRAM constraints. Current open-page policies attempt to garner the highest level of page hits. In an effort to achieve this, such greedy schemes map sequential address sequences to a single DRAM resource. This non-uniform resource usage pattern introduces high levels of conflict when multiple workloads in a many-core system map to the same set of resources. This work presents a scheme that provides a careful balance between the benefits (increased performance and decreased power) and the detractors (unfairness) of page-mode accesses. In the proposed Minimalist approach, the system targets "just enough" page-mode accesses to garner page-mode benefits while avoiding system unfairness. This is accomplished with the use of a fair memory hashing scheme to control the maximum number of page-mode hits. High-density memory is becoming ever more important as many execution streams are consolidated onto single-chip many-core processors. DRAM is ubiquitous as a main memory technology, but while DRAM's per-chip density and frequency continue to scale, the time required to refresh its dynamic cells has grown at an alarming rate. This work shows how currently employed methods to schedule refresh operations are ineffective in mitigating the significant performance degradation caused by longer refresh times. Current approaches are deficient: they do not effectively exploit the flexibility of DRAMs to postpone refresh operations. This work proposes dynamically reconfigurable predictive mechanisms that exploit the full dynamic range allowed in the industry-standard DRAM memory specifications. The proposed mechanisms are shown to mitigate much of the penalty seen with dense DRAM devices.
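The page-mode behavior underlying these policies can be illustrated with a toy timing model. The cycle counts and address mapping below are illustrative assumptions, not values from any datasheet or from this dissertation: a row-buffer hit costs only a column access, while switching rows within one bank adds precharge and activate delays, which is exactly why two interleaved streams conflicting in one bank are so expensive.

```python
T_CAS, T_RCD, T_RP = 4, 11, 11   # illustrative cycle counts
COLS = 64                         # cache-line-sized columns per row

class Bank:
    def __init__(self):
        self.open_row = None
    def access(self, addr):
        row = addr // COLS
        if self.open_row == row:              # row-buffer ("page-mode") hit
            return T_CAS
        # Row miss: precharge the old row (if any), then activate + access.
        cost = (T_RP if self.open_row is not None else 0) + T_RCD + T_CAS
        self.open_row = row
        return cost

def total_latency(trace):
    bank = Bank()
    return sum(bank.access(a) for a in trace)

seq = list(range(64))                                       # one sequential stream
mix = [a for p in zip(range(32), range(4096, 4128)) for a in p]  # two interleaved streams

print("sequential: ", total_latency(seq))   # 1 activate, then 63 page hits
print("interleaved:", total_latency(mix))   # every access is a row conflict
```

With the same number of requests, the interleaved trace pays a full precharge-activate cycle on every access; spreading the streams across banks (or hashing their mapping, as the Minimalist approach does) is what recovers the page-mode benefit.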
In summary, this work presents a significant improvement in the ability to exploit the capabilities of high-density, high-frequency DRAM devices in a many-core environment. This is accomplished through coordination of previously disparate system components in highly integrated system designs.
Item: Nearly free resilient memory architectures that balance resilience, performance, and cost (2017-08-29). Kim, Dong Wan; Erez, Mattan; Touba, Nur A.; Fussell, Donald S.; Reddi, Vijay Janapa; Tsai, Timothy K.
Memory reliability has been a major design constraint for mission-critical and large-scale systems for many years. Continued innovation is still necessary because the rate of faults, and the errors they lead to, grows with system size, and because some faults become more likely as fabrication technology advances. Furthermore, recent field studies have shown that more severe permanent/intermittent and multi-bit faults are roughly as frequent as single-bit and transient ones. Therefore, strong error checking and correcting (ECC) schemes that can correct multi-bit errors have been developed and are in use. However, using ECC to correct the numerous recurring errors from permanent faults forces a trade-off between cost, performance, and reliability. First, a permanent fault is likely to result in numerous erroneous accesses, each requiring possibly high correction overhead. Second, once redundancy is used for correction, further errors may go uncorrected, leading to data loss, which is called a detected uncorrectable error (DUE), or worse, go undetected and result in silent data corruption (SDC). Stronger ECC can then be used to tolerate more errors, but at higher overhead. The straightforward solution to this issue of repeated costly corrections and reduced coverage is to replace faulty memory devices; doing so, however, is expensive and requires either increased system downtime or increased storage and bandwidth overheads.
An economical alternative is to retire and possibly remap just the faulty memory regions. Existing retirement techniques, however, either require sophisticated software support, impact capacity, reliability, and/or performance, or introduce additional storage and hardware structures. Implementing a strong ECC such as Single Device Data Correction (SDDC) ECC (or chipkill-level ECC) is typically expensive in terms of storage and complexity. It is even challenging to implement SDDC-level ECC in emerging high-bandwidth memories such as HBM2, because a single ECC codeword is transferred from one memory device, and thus simply adding a redundant device results in high storage, energy, and bandwidth overheads. Such wide-data-width memories are, however, widely used in graphics processing units (GPUs) and Intel's Xeon Phi processors (e.g., Knights Landing) to exploit high memory bandwidth. As GPUs are popular for building large-scale high-performance systems, improving the resilience of GPU memory systems has become an important issue, but current GPU memory ECC is limited to single-bit error correction, double-bit error detection (SECDED). Improving GPU memory resilience further therefore requires coordinating multiple techniques. One interesting addition is a software-driven memory repair technique, which retires affected pages with virtual memory support. This approach reduces the risk of uncorrected or even undetected memory errors in the future. However, because the underlying ECC code is weak, a DUE cannot always be avoided; susceptible memory blocks must therefore be retired proactively to avert the threat of a system failure in the future. In this dissertation, I develop strong, low-cost memory fault tolerance mechanisms that improve both system resilience and availability without wasting resources, increasing memory access complexity, or compromising the fault tolerance of existing resilience schemes.
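The page-retirement idea described above can be sketched as simple bookkeeping. The class, threshold policy, and remap-on-retire behavior below are hypothetical illustrations of the general technique, not the dissertation's hardware or driver mechanism:

```python
class PageRetirement:
    """Toy model of software-driven page retirement: a physical page that
    accumulates corrected errors past a threshold is treated as permanently
    faulty, pulled from the free pool, and its data remapped elsewhere."""
    def __init__(self, n_pages, threshold=2):
        self.free = set(range(n_pages))
        self.errors = {}              # physical page -> corrected-error count
        self.retired = set()
        self.threshold = threshold

    def allocate(self):
        return self.free.pop() if self.free else None

    def report_error(self, page):
        self.errors[page] = self.errors.get(page, 0) + 1
        # Repeated corrected errors suggest a permanent fault: retire the page
        # proactively, before an uncorrectable error (DUE) can occur.
        if self.errors[page] >= self.threshold and page not in self.retired:
            self.retired.add(page)
            self.free.discard(page)
            return self.allocate()    # replacement page to remap the data onto
        return None                   # likely transient: tolerate and continue

mm = PageRetirement(n_pages=8)
mm.report_error(3)                    # first error: tolerated
new_home = mm.report_error(3)         # second error: retire and remap
print(sorted(mm.retired), new_home is not None)
```

The trade-off the abstract highlights lives in the threshold: retiring too eagerly wastes capacity on transient faults, while retiring too lazily leaves weakly protected blocks exposed to a future DUE.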
I first identify two interesting characteristics of DRAM failures in the field. First, permanent faults are as frequent as transient faults. Second, most faults affect small memory regions. Based on this analysis of DRAM failure patterns, I propose and evaluate two novel hardware-only memory repair mechanisms that improve memory system reliability significantly without compromising performance or increasing overhead. I also develop a strong, low-cost GPU memory fault tolerance mechanism based on three insights. First, ECC for GPUs should not interfere with the high-bandwidth DRAM system, which may use just one DRAM device for each data access. The second insight is that the way in which GPUs are used as accelerators offers unique opportunities to tolerate even severe memory errors without relying on SDDC ECC. The third insight is that nearly all permanent memory faults can be repaired with low-overhead techniques. Based on these observations, I propose and evaluate a multi-tier, strong GPU memory fault tolerance mechanism in which the techniques at each level work closely together to significantly improve accelerator memory system resilience with low overhead.
Item: Strong, thorough, and efficient memory protection against existing and emerging DRAM errors (2016-12). Kim, Jungrae; Erez, Mattan; Patt, Yale; Touba, Nur; Lin, Calvin; Alameldeen, Alaa
Memory protection is necessary to ensure the correctness of data in the presence of unavoidable faults. As such, large-scale systems typically employ Error Correcting Codes (ECC) to trade off redundant storage and bandwidth for increased reliability. Single Device Data Correction (SDDC) ECC mechanisms are required to meet the reliability demands of servers and large-scale systems by tolerating even severe faults that disable an entire memory chip. In the future, however, stronger memory protection will be required due to increasing levels of system integration, shrinking process technology, and growing transfer rates.
The energy efficiency of memory protection is also important, as DRAM already consumes a significant fraction of the system energy budget. This dissertation develops a novel set of ECC schemes to provide strong, safe, flexible, and thorough protection against existing and emerging types of DRAM errors. This research also reduces the energy consumption of such protection while only marginally impacting performance. First, this dissertation develops Bamboo ECC, a technique with stronger-than-SDDC correction and very safe detection capabilities (≥ 99.999994% of data errors of any severity are detected). Bamboo ECC changes the ECC layout based on frequent DRAM error patterns, can correct concurrent errors from multiple devices, and all but eliminates the risk of silent data corruption. Bamboo ECC also provides flexible configurations that enable more adaptive graceful-downgrade schemes, in which the system continues to operate correctly after even severe chip faults, albeit at a reduced capacity to protect against future faults. These strength, safety, and flexibility advantages translate to a significantly more reliable memory subsystem for future exascale computing. Then, this dissertation focuses on emerging error types from scaling process technology and increasing data bandwidth. As DRAM process technology scales down below 10nm, DRAM cells are becoming more vulnerable to errors from an imperfect manufacturing process. At the same time, DRAM signal transfers are getting more susceptible to timing and electrical noise as DRAM interfaces keep increasing signal transfer rates and decreasing I/O voltage levels. With individual DRAM chips getting more vulnerable to errors, industry and academia have proposed mechanisms to tolerate these emerging error types; yet they are inefficient because they rely on multiple levels of redundancy for cell errors and on ad-hoc schemes with suboptimal protection coverage for transmission errors.
Active Guardband ECC and All-Inclusive ECC make systematic use of ECC and existing mechanisms to provide thorough end-to-end protection without requiring redundancy beyond what is common today. Finally, this dissertation targets the energy efficiency of memory protection. Frugal ECC combines ECC with fine-grained compression to provide versatile and energy-efficient protection. Frugal ECC compresses main memory at cache-block granularity, using any leftover space to store ECC information. Frugal ECC allows more energy-efficient memory configurations while maintaining SDDC protection. Its tailored compression scheme minimizes insufficiently compressed blocks and results in acceptable performance overhead. The strong, thorough, and efficient protection described in this dissertation may allow for more aggressive design of future computing systems with larger integration, finer process technology, higher transfer rates, and better energy efficiency.
Item: The use of memory state knowledge to improve computer memory system organization (2011-05). Isen, Ciji; John, Lizy Kurian; McKinley, Kathryn S.; Erez, Mattan; Aziz, Adnan; Bhargava, Ravi; Gratz, Paul V.
The trends in virtualization as well as multi-core, multiprocessor environments have translated into a massive increase in the amount of main memory each individual system must be fitted with so as to effectively utilize this growing compute capacity. The increasing demand on main memory implies that main memory devices and their issues are as important a part of system design as the central processors. The primary issues of modern memory are power, energy, and scaling of capacity. Nearly a third of system power and energy can come from the memory subsystem. At the same time, modern main memory devices are limited by technology in their ability to scale and keep pace with modern program demands, thereby requiring exploration of alternatives to current main memory storage technology.
This dissertation exploits dynamic knowledge of memory state and memory data values to improve memory performance and reduce memory energy consumption. A cross-boundary approach is proposed to communicate information about dynamic memory management state (allocated and deallocated memory) between software and the hardware memory subsystem through a combination of ISA support and hardware structures. These mechanisms help identify memory operations to regions of memory that have no impact on the correct execution of the program because the regions were either freshly allocated or deallocated. This inference stems from the fact that data in deallocated memory regions are no longer useful to the program, and data in freshly allocated memory are not yet useful because the program has not defined them. Being cognizant of this, the system avoids such memory operations, thereby saving energy and improving the usefulness of main memory. Furthermore, when stores write zeros to memory, this research reduces the number of stores to memory by capturing the zeros as compressed information stored along with the memory management state. Using the methods outlined above, this dissertation harnesses memory management state and data value information to achieve significant savings in energy consumption while extending the endurance limit of memory technologies.
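The zero-store optimization can be sketched as a per-block zero bitmap in front of the memory array. The class and block size below are an illustrative software model of the general idea, not the hardware structures proposed in the dissertation:

```python
class ZeroAwareMemory:
    """Toy model: stores of all-zero blocks are recorded in a bitmap instead
    of being written to the array, eliding the memory traffic; loads of such
    blocks synthesize zeros without touching the array."""
    BLOCK = 64                              # bytes per block

    def __init__(self):
        self.blocks = {}                    # block index -> stored bytes
        self.zero_map = set()               # blocks known to be all zero
        self.writes_elided = 0

    def store(self, idx, data):
        if data == bytes(self.BLOCK):       # an all-zero store
            self.zero_map.add(idx)
            self.blocks.pop(idx, None)      # no array write needed
            self.writes_elided += 1
        else:
            self.zero_map.discard(idx)
            self.blocks[idx] = data

    def load(self, idx):
        if idx in self.zero_map:
            return bytes(self.BLOCK)        # synthesized, no array read
        return self.blocks.get(idx, bytes(self.BLOCK))

mem = ZeroAwareMemory()
mem.store(0, b"\x01" * 64)
mem.store(0, bytes(64))                     # the zeroing store is elided
print(mem.load(0) == bytes(64), mem.writes_elided)
```

For write-endurance-limited technologies, each elided zero store is a write the cells never absorb, which is how this kind of data-value knowledge extends device lifetime as well as saving energy.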