NVSim: A Circuit-Level Performance, Energy, and Area Model for Emerging Nonvolatile Memory

Abstract—Various new nonvolatile memory (NVM) technologies have emerged recently. Among all the investigated new NVM candidate technologies, spin-torque-transfer memory (STT-RAM, or MRAM), phase-change random-access memory (PCRAM), and resistive random-access memory (ReRAM) are regarded as the most promising candidates. As the ultimate goal of this NVM research is to deploy them into multiple levels of the memory hierarchy, it is necessary to explore the wide NVM design space and find the proper implementation at different memory hierarchy levels, from highly latency-optimized caches to highly density-optimized secondary storage. While abundant tools are available as SRAM/DRAM design assistants, similar tools for NVM designs are currently missing. Thus, in this paper, we develop NVSim, a circuit-level model for NVM performance, energy, and area estimation, which supports various NVM technologies, including STT-RAM, PCRAM, ReRAM, and legacy NAND Flash. NVSim is successfully validated against industrial NVM prototypes, and it is expected to help boost architecture-level NVM-related studies.

Index Terms—Analytical circuit model, MRAM, NAND Flash, nonvolatile memory, phase-change random-access memory (PCRAM), resistive random-access memory (ReRAM), spin-torque-transfer memory (STT-RAM).

Manuscript received March 17, 2011; revised June 22, 2011, September 26, 2011, and December 16, 2011; accepted January 22, 2012. Date of current version June 20, 2012. This work was supported in part by a Semiconductor Research Corporation grant, in part by the National Science Foundation under Grants 1147388 and 0903432, and in part by the DoE under Award DE-SC0005026. This paper was recommended by Associate Editor S. Mitra.
X. Dong is with Qualcomm, Inc., San Diego, CA 92121 USA (e-mail: [email protected]).
C. Xu and Y. Xie are with the Department of Computer Science and Engineering, Pennsylvania State University, University Park, PA 16802 USA (e-mail: [email protected]; [email protected]).
N. P. Jouppi is with the Intelligent Infrastructure Laboratory, Hewlett-Packard Labs, Palo Alto, CA 94304 USA (e-mail: [email protected]).

I. Introduction

UNIVERSAL MEMORY that provides fast random access, high storage density, and nonvolatility within one memory technology becomes possible thanks to the emergence of various new nonvolatile memory (NVM) technologies, such as spin-torque-transfer random-access memory (STT-RAM, or MRAM), phase-change random-access memory (PCRAM), and resistive random-access memory (ReRAM). As the ultimate goal of this NVM research is to devise a universal memory that could work across multiple layers of the memory hierarchy, each of these emerging NVM technologies has to supply a wide design space that covers a spectrum from highly latency-optimized microprocessor caches to highly density-optimized secondary storage. Therefore, specialized peripheral circuitry is required for each optimization target. However, since few of these NVM technologies are mature so far, only a limited number of prototype chips have been demonstrated, and they cover just a small portion of the entire design space. In order to facilitate architecture-level NVM research by estimating the NVM performance, energy, and area values under different design specifications before fabricating a real chip, in this paper we build NVSim,¹ a circuit-level model for NVM performance, energy, and area estimation, which supports various NVM technologies, including STT-RAM, PCRAM, ReRAM, and legacy NAND Flash.

¹The latest NVSim binary release and related documentation are available at our wiki site http://www.rioshering.com/nvsimwiki.

The main goals of developing the NVSim tool are as follows.
1) Estimate the access time, access energy, and silicon area of NVM chips with a given organization and specific design options before the effort of actual fabrication.
2) Explore the NVM chip design space to find the optimized chip organization and design options that achieve the best performance, energy, or area.
3) Find the optimal NVM chip organization and design options that are optimized for one design metric while keeping other metrics under constraints, as sketched below.
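To make goals 2) and 3) concrete, the selection step behind them can be viewed as a filter-then-minimize pass over already-evaluated candidate designs. The following Python sketch is illustrative only and does not reflect NVSim's internal code; the Candidate structure and the metric names are hypothetical.

# Illustrative sketch of optimization goals 2) and 3): pick, from a set of evaluated
# candidate designs, the one that minimizes a target metric while keeping the other
# metrics under user-given upper bounds. Field names are hypothetical, not NVSim's.
from dataclasses import dataclass
from typing import Dict, List, Optional

@dataclass
class Candidate:
    name: str
    metrics: Dict[str, float]   # e.g. {"read_latency_ns": 1.8, "area_mm2": 5.5}

def pick_best(candidates: List[Candidate],
              target: str,
              constraints: Dict[str, float]) -> Optional[Candidate]:
    """Smallest `target` metric among candidates whose other metrics stay under the
    given bounds (goal 3); an empty constraint dict reduces this to goal 2."""
    feasible = [c for c in candidates
                if all(c.metrics[k] <= bound for k, bound in constraints.items())]
    return min(feasible, key=lambda c: c.metrics[target], default=None)

if __name__ == "__main__":
    designs = [
        Candidate("A", {"read_latency_ns": 1.8, "area_mm2": 5.5}),
        Candidate("B", {"read_latency_ns": 5.7, "area_mm2": 3.0}),
    ]
    # Fastest design whose area stays under 4 mm^2.
    best = pick_best(designs, target="read_latency_ns", constraints={"area_mm2": 4.0})
    print(best.name if best else "no feasible design")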
We build NVSim by using the same empirical modeling methodology as CACTI [1], [2], but starting from a new framework and adding specific features for NVM technologies. Compared to CACTI, the framework of NVSim includes the following new features.
1) It allows sense amplifiers to be moved from the inner memory subarrays to the outer bank level and factored out to improve the overall area efficiency of the memory module.
2) It provides more flexible array organizations and data activation modes by considering any combination of memory data allocation and address distribution.
3) It models various types of data sensing schemes instead of the voltage-sensing scheme only.
4) It allows memory banks to be formed in a bus-like manner rather than in the H-tree manner only.
5) It provides multiple design options for buffers instead of only the latency-optimized option that uses logical effort.
6) It models cross-point memory cells rather than MOS-accessed memory cells only.
Fig. 1. Basic string block of NAND Flash, and the conceptual view of a floating-gate Flash memory cell (BL = bitline, WL = wordline, SG = select gate).

Fig. 5. Working mechanism of ReRAM cells.

Ohmic behavior in the case of very high doping and rectifying in the case of low doping [4]. In Fig. 5, the TiOx region is semi-insulating, indicating a lower oxygen vacancy concentration, while the TiO2−x region is conductive, indicating a higher concentration. The oxygen vacancy in metal oxide is an n-type dopant, whose drift under the electric field can change the doping profile. Thus, applying an electric current can modulate the I–V curve of the ReRAM cell and switch the cell from one state to the other. Usually, for bipolar ReRAM, the cell can be switched ON (SET operation) only by applying a negative bias and OFF (RESET operation) only by applying the opposite bias [4]. Several ReRAM prototypes [7]–[9] have been demonstrated and show promising fast switching speed and low energy consumption.

B. Read Operations

The read operations of these NVM technologies are almost the same. Since an NVM memory cell has different resistance in the ON and OFF states, the read operation can be accomplished either by applying a small voltage on the bitline and sensing the current that passes through the memory cell, or by injecting a small current into the bitline and sensing the voltage across the memory cell. Unlike SRAM, which generates complementary read signals from each cell, NVM usually has a group of dummy cells to generate the reference current or reference voltage. The generated current (or voltage) from the to-be-read cell is then compared to the reference current (or voltage) by using sense amplifiers. Various types of sense amplifiers are modeled in NVSim, as discussed in Section V-B.
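A minimal numerical sketch of this reference-based read is given below; the resistance values, read bias, and mid-point reference rule are illustrative assumptions rather than NVSim parameters.

# Simplified illustration of the resistance-based read described above: bias the
# bitline with a small read voltage, convert the cell state into a current, and
# compare it against a reference current derived from dummy LRS/HRS cells.
# All numbers and the averaging rule are illustrative assumptions.

V_READ = 0.2          # read bias on the bitline (V)
R_LRS  = 10e3         # low-resistance (ON) state, ohms
R_HRS  = 100e3        # high-resistance (OFF) state, ohms

def cell_current(r_cell: float, v_read: float = V_READ) -> float:
    return v_read / r_cell

def reference_current() -> float:
    # Dummy cells in both states provide the reference, here simply their average.
    return 0.5 * (cell_current(R_LRS) + cell_current(R_HRS))

def sense(r_cell: float) -> int:
    """Return the stored bit: 1 if the cell conducts more than the reference."""
    return 1 if cell_current(r_cell) > reference_current() else 0

print(sense(R_LRS), sense(R_HRS))   # -> 1 0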
C. Write Endurance Issue

Write endurance is the number of times that an NVM cell can be overwritten. Among all the NVM technologies modeled in NVSim, only STT-RAM does not suffer from the write endurance issue. NAND Flash, PCRAM, and ReRAM all have limited write endurance. NAND Flash only has a write endurance of 10^5–10^6 cycles. The PCRAM endurance is now in the range between 10^5 and 10^9 [10]–[12]. ReRAM research currently shows endurance numbers in the range between 10^5 and 10^10 [13], [14]. The ITRS projection for 2024 for emerging NVM, i.e., PCRAM and ReRAM, targets endurance on the order of 10^15 or more write cycles [15]. NVSim does not model the write endurance limit, since NVSim is a circuit-level modeling tool.
D. Retention Time Issue

Retention time is the time that data can be retained in NVM cells. Typically, NVM technologies are required to provide a retention time of more than 10 years. However, in some cases, such a long retention time is not necessary. For example, Smullen et al. relaxed the retention time requirement to improve the timing and energy profile of STT-RAM [16]. Since the tradeoff between NVM retention time and other NVM parameters (e.g., the duration and amplitude of write pulses) is at the device level, NVSim, as a circuit-level tool, does not model this tradeoff directly but instead takes different sets of NVM parameters with various retention times as device-level input.

E. MOS-Accessed Structure Versus Cross-Point Structure

Some NVM technologies (e.g., PCRAM [17] and ReRAM [13], [17], [18]) have the capability of building cross-point memory arrays without access devices. Conventionally, in the MOS-accessed structure, memory cells are isolated by MOS access devices, and the cell size is dominated by the large MOS access device that is necessary to drive enough write current, even though the NVM cell itself is much smaller. However, taking advantage of the cell nonlinearity, an NVM array can be accessed without any extra access devices. The removal of MOS access devices leads to a memory cell size of only 4F^2, where F is the process feature size. Unfortunately, the cross-point structure also brings extra peripheral circuitry design challenges, and a tradeoff among performance, energy, and area is always necessary, as discussed in our previous work [19]. NVSim models both the MOS-accessed and cross-point structures, and the modeling methodology is described in the following sections.
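As a rough illustration of the density argument, the sketch below compares the cell-array footprint of the two structures. The 4F^2 cross-point figure comes from the text, while the MOS-accessed cell size and the feature size are assumed example values, and peripheral circuitry is ignored.

# Back-of-the-envelope cell-array footprint for the two structures. A cross-point
# cell occupies 4F^2 (as stated above); the MOS-accessed (1T1R) cell size is an
# input because it is set by the access-transistor width needed to drive the
# write current. The example numbers (20F^2, 32 nm) are assumptions.

def array_area_mm2(num_cells: int, cell_size_f2: float, feature_nm: float) -> float:
    f_mm = feature_nm * 1e-6             # feature size in mm
    return num_cells * cell_size_f2 * f_mm ** 2

CELLS = 8 * 1024 * 1024 * 8              # an 8 MB array, one cell per bit
print(array_area_mm2(CELLS, 4.0, 32))    # cross-point: 4F^2 per cell
print(array_area_mm2(CELLS, 20.0, 32))   # MOS-accessed: assumed 20F^2 per cell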
III. NVSim Framework

The framework of NVSim is modified from CACTI [2], [20]. We add several new features, such as more flexible data activation modes and alternative bank organizations.

A. Device Model

NVSim uses device data from the ITRS report [15] and the MASTAR tool [21] to obtain process parameters. NVSim covers the process nodes of 180 nm, 120 nm, 90 nm, 65 nm, 45 nm, 32 nm, and 22 nm and supports three transistor types: high performance, low operating power, and low standby power.

B. Array Organization

Fig. 6 shows the array organization. There are three hierarchy levels in this organization: bank, mat, and subarray. Basically, the descriptions of these levels are as follows.
1) Bank is the top-level structure modeled in NVSim. One nonvolatile memory chip can have multiple banks. A bank is a fully functional memory unit, and it can be operated independently. In each bank, multiple mats are connected together in either an H-tree or a bus-like manner.
2) Mat is the building block of a bank. Multiple mats in a bank operate simultaneously to fulfill a memory operation. Each mat consists of multiple subarrays and one predecoder block.
3) Subarray is the elementary structure modeled in NVSim. Every subarray contains peripheral circuitry including row decoders, column multiplexers, and output drivers.

Conventionally, sense amplifiers are integrated at the subarray level, as modeled in CACTI [2], [20]. However, in the NVSim model, sense amplifiers can be placed either at the subarray level or at the mat level.
C. Memory Bank Type

For practical memory designs, memory cells are grouped together to form memory modules of different types. For instance:
1) the main memory is a typical RAM, which takes the address of data as input and returns the content of the data;
2) the set-associative cache contains two separate RAMs (data array and tag array), and it can return the data if there is a cache hit for the given set address and tag;
3) the fully-associative cache usually contains a content-addressable memory (CAM).

To cover all the possible memory designs, we model five types of memory banks in NVSim: one for RAM, one for CAM, and three for set-associative caches with different access manners. The functionalities of these five types of memory banks are listed as follows.
1) RAM: output the data content at the I/O interface given the data address.
2) CAM: output the data address at the I/O interface given the data content if there is a hit.
3) Cache with normal access: start to access the cache data array and tag array at the same time; the data content is temporarily buffered in each mat; if there is a hit, the cache hit signal generated from the tag array is routed to the proper mats and the content of the desired cache line is output to the I/O interface.
4) Cache with sequential access: access the cache tag array first; if there is a hit, then access the cache data array with the set address and the tag hit information, and finally output the desired cache line to the I/O interface.
5) Cache with fast access: access the cache data array and tag array simultaneously; read the entire set content from the mats to the I/O interface; selectively output the desired cache line if there is a cache hit signal generated from the tag array (a simplified timing comparison of these three cache access modes is sketched below).
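The sketch below gives a simplified latency composition for the three cache access modes, derived only from the behavioral descriptions above; the actual NVSim timing model contains many more components, so treat the formulas as an approximation.

# Simplified latency composition for the three cache access modes described above.
# Only the ordering of tag and data accesses implied by the text is captured.

def normal_access(tag_lat, data_lat, hit_route_lat):
    # Tag and data arrays start together; the data waits in the mats until the hit
    # signal arrives and the selected line is routed out.
    return max(tag_lat, data_lat) + hit_route_lat

def sequential_access(tag_lat, data_lat):
    # Tag array first, then the data array with the known hit way.
    return tag_lat + data_lat

def fast_access(tag_lat, data_lat_full_set, select_lat):
    # Tag and data proceed in parallel; the whole set is read out and the desired
    # line is selected once the hit signal is known.
    return max(tag_lat, data_lat_full_set) + select_lat

# Arbitrary example numbers (ns) just to show the three compositions.
print(normal_access(1.0, 1.5, 0.4), sequential_access(1.0, 1.5), fast_access(1.0, 1.8, 0.1))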
D. Activation Mode

We model the array organization and the data activation modes using eight parameters, which are as follows:
1) NMR: number of rows of mat arrays in each bank;
2) NMC: number of columns of mat arrays in each bank;
3) NAMR: number of active rows of mat arrays during data accessing;
4) NAMC: number of active columns of mat arrays during data accessing;
5) NSR: number of rows of subarrays in each mat;
6) NSC: number of columns of subarrays in each mat;
7) NASR: number of active rows of subarrays during data accessing;
8) NASC: number of active columns of subarrays during data accessing.

The values of these parameters are all constrained to be powers of two. NMR and NMC define the number of mats in a bank, and NSR and NSC define the number of subarrays in a mat. NAMR, NAMC, NASR, and NASC define the activation patterns, and they can take any values smaller than NMR, NMC, NSR, and NSC, respectively. In contrast, the array organization and data activation pattern in CACTI are limited by several constraints on these parameters, such as NAMR = 1, NAMC = NMC, and NSR = NSC = NASR = NASC = 2. With these flexible activation patterns, NVSim is able to model sophisticated memory accessing techniques, such as single subarray activation [22].
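A small sketch of these parameter rules is shown below; it checks the power-of-two constraint and that each active count does not exceed the corresponding total, and reports how many mats and subarrays are touched per access. It is not NVSim's actual input handling.

# Sketch of the activation-mode parameters described above.

def is_pow2(x: int) -> bool:
    return x > 0 and (x & (x - 1)) == 0

def check_activation(nmr, nmc, namr, namc, nsr, nsc, nasr, nasc):
    params = [nmr, nmc, namr, namc, nsr, nsc, nasr, nasc]
    assert all(is_pow2(p) for p in params), "all parameters must be powers of two"
    assert namr <= nmr and namc <= nmc, "active mats cannot exceed the mat array"
    assert nasr <= nsr and nasc <= nsc, "active subarrays cannot exceed the mat"
    active_mats = namr * namc
    active_subarrays = active_mats * (nasr * nasc)
    return active_mats, active_subarrays

# Mat-level numbers follow the example in the text (4 x 4 mats, 2 x 2 activated);
# the subarray-level numbers are chosen arbitrarily for illustration.
print(check_activation(4, 4, 2, 2, 2, 2, 1, 1))   # -> (4, 4)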
E. Routing to Mats

In order to route the data and address signals from the I/O port to the edges of the memory mats and then from each mat to the edges of the memory subarrays, we divide all the interconnect wires into three categories: address wires, broadcast data wires, and distributed data wires. Depending on the memory module type and the activation mode, the initial number of wires in each group is assigned according to the rules listed in Table I. We use the term block to refer to the memory words in RAM and CAM designs and the cache lines in cache designs. In Table I, Nblock is the number of blocks, Wblock is the block size, and A is the associativity in cache designs. The number of broadcast data wires is always kept unchanged, the number of distributed data wires is cut in half at each routing point where data are merged, and the number of address wires is reduced by one at each routing point where data are multiplexed.

TABLE I
Initial Number of Wires in Each Routing Group

RAM: NAW = log2 Nblock; NBW = 0; NDW = Wblock.
CAM: NBW = Wblock; NDW = 0.
Cache with normal access, data array: NAW = log2(Nblock/A); NBW = log2 A; NDW = Wblock.
Cache with normal access, tag array: NAW = log2(Nblock/A); NBW = Wblock; NDW = A.
Cache with sequential access, data array: NAW = log2 Nblock; NBW = 0; NDW = Wblock.
Cache with sequential access, tag array: NAW = log2(Nblock/A); NBW = Wblock; NDW = A.
Cache with fast access, data array: NAW = log2(Nblock/A); NBW = 0; NDW = Wblock × A.
Cache with fast access, tag array: NAW = log2(Nblock/A); NBW = Wblock; NDW = A.

We use the case of the cache bank with normal access to demonstrate how the wires are routed from the I/O port to the edges of the mats. For simplicity, we suppose the data array and the tag array are two separate modules. While the data and tag arrays usually have different mat organizations in practice, we use the same 4 × 4 mat organization for both for demonstration purposes, as shown in Figs. 7 and 8. The 16 mats are positioned in a 4 × 4 formation and connected by a 4-level H-tree. Therefore, NMR and NMC are 4. As an example, we use the activation mode in which two rows and two columns of the mat array are activated for each data access, and the activation groups are Mat {0, 2, 8, 10}, Mat {1, 3, 9, 11}, Mat {4, 6, 12, 14}, and Mat {5, 7, 13, 15}. Thereby, NAMR and NAMC are 2. In addition, we set the cache line size (block size) to 64 B, the cache associativity to A = 8, and the cache bank capacity to 1 MB, so that the number of cache lines (blocks) is Nblock = 8M/512 = 16 384, the block size in the data array is Wblock,data = 512, and the block size in the tag array is Wblock,tag = 16 (assuming 32-bit addressing and labeling a dirty block with one bit).

Fig. 7. Example of the wire routing in a 4 × 4 mat organization for the data array of an 8-way 1 MB cache with 64 B cache lines.

Fig. 8. Example of the wire routing in a 4 × 4 mat organization for the tag array of an 8-way 1 MB cache with 64 B cache lines.

According to Table I, the initial number of address wires (NAW) is log2(Nblock/A) = 11 for both the data and tag arrays. For the data array, the initial number of broadcast data wires (NBW,data) is log2 A = 3, which is used to transmit the tag hit signals from the tag array to the corresponding mats in the data array; the initial number of distributed data wires (NDW,data) is Wblock,data = 512, which is used to output the desired cache line from the mats to the I/O port. For the tag array, the number of broadcast data wires (NBW,tag) is Wblock,tag = 16, which is the tag sent from the I/O port to each mat in the tag array; the initial number of distributed data wires (NDW,tag) is A = 8, which is used to collect the tag hit signals from each mat to the I/O port and then send them to the data array after an 8-to-3 encoding process.
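The initial wire counts quoted above can be checked with a few lines of arithmetic; the sketch below simply evaluates the Table I expressions for this example configuration.

# Worked check of the initial wire counts for the 8-way 1 MB cache with 64 B lines
# (normal access), using the Table I expressions. Pure arithmetic, no NVSim internals.
from math import log2

N_BLOCK = 16384          # cache lines: 8 Mb / 512 b
A       = 8              # associativity
W_DATA  = 512            # data-array block size in bits (64 B line)
W_TAG   = 16             # tag-array block size in bits

naw      = int(log2(N_BLOCK / A))   # address wires, both arrays      -> 11
nbw_data = int(log2(A))             # broadcast wires, data array     -> 3 (tag hit signals)
ndw_data = W_DATA                   # distributed wires, data array   -> 512
nbw_tag  = W_TAG                    # broadcast wires, tag array      -> 16
ndw_tag  = A                        # distributed wires, tag array    -> 8

print(naw, nbw_data, ndw_data, nbw_tag, ndw_tag)   # 11 3 512 16 8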
From the I/O port to the edges of the mats, the numbers of wires in the three categories change as follows, as demonstrated in Figs. 7 and 8, respectively.
1) At node A, the activated mats are distributed in both the upper and the lower parts, so node A is a merging node. As per the routing rule, the address wires and broadcast data wires remain the same, but the distributed data wires are cut in half. Thus, the wire segments between nodes A and B have NAW = 11, NBW,data = 3, NDW,data = 256, NBW,tag = 16, and NDW,tag = 4.
2) Node B is again a merging node. Thus, the wire segments between nodes B and C have NAW = 11, NBW,data = 3, NDW,data = 128, NBW,tag = 16, and NDW,tag = 2.
3) At node C, the activated mats are allocated only on one side, either from Mat 0/1 or from Mat 4/5, so node C is a multiplexing node. As per the routing rule, the distributed data wires and broadcast data wires remain the same, but the address wires are decremented by 1. Thus, the wire segments between nodes C and D have NAW = 10, NBW,data = 3, NDW,data = 128, NBW,tag = 16, and NDW,tag = 2.
4) Finally, node D is another multiplexing node. Thus, the wire segments at the mat edges have NAW = 9, NBW,data = 3, NDW,data = 128, NBW,tag = 16, and NDW,tag = 2.

Thereby, each mat in the data array takes as input a 9-bit set address and 3-bit tag hit signals (which can be treated as the block address within an 8-way associative set), and it generates a 128-bit data output. A group of four data mats provides the desired output of a 512-bit (64 B) cache line, and four such groups cover the entire 11-bit set address space. On the other hand, each mat in the tag array takes as input a 9-bit set address and a 16-bit tag, and it generates 2-bit hit signals (01 or 10 for a hit and 00 for a miss). A group of four tag mats concatenates their hit signals to indicate whether a 16-bit tag hits in an 8-way associative cache within a 9-bit address space, and four such groups extend the address space from 9 bits to the desired 11 bits.

Other configurations in Table I can be explained in a similar manner.
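The node-by-node walk above can be reproduced mechanically by applying the three routing rules; the sketch below does exactly that for both the data-array and tag-array bundles of this example.

# Broadcast wires stay constant, distributed wires halve at merging nodes, and
# address wires lose one bit at multiplexing nodes.

def route(naw, nbw, ndw, node_types):
    """node_types is a sequence of 'merge' / 'mux' events from the I/O port
    toward the mats; returns the wire counts after each node."""
    trace = []
    for t in node_types:
        if t == "merge":
            ndw //= 2           # distributed data wires are cut in half
        elif t == "mux":
            naw -= 1            # one address bit is consumed
        trace.append((naw, nbw, ndw))
    return trace

# Data array of the example: nodes A and B merge, nodes C and D multiplex.
print(route(11, 3, 512, ["merge", "merge", "mux", "mux"]))
# -> [(11, 3, 256), (11, 3, 128), (10, 3, 128), (9, 3, 128)]

# Tag array: the 16 tag wires are broadcast; the per-mat hit signals (8 initially)
# are merged down to 2 at the mat edge.
print(route(11, 16, 8, ["merge", "merge", "mux", "mux"]))
# -> [(11, 16, 4), (11, 16, 2), (10, 16, 2), (9, 16, 2)]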
F. Routing to Subarrays

The interconnect wires from a mat to the edges of its memory subarrays are routed using the same H-tree organization, as shown in Fig. 9, and the routing strategy follows the same wire partitioning rule described in Section III-E. However, NVSim

Fig. 11. Maximum subarray size versus nonlinearity and driving current.

Fig. 14. Conceptual view of a MOS-accessed cell (1T1R) and its connected word line, bit line, and source line.

turns on/off the access path to the storage element by tuning the voltage applied to its gate. The MOS-accessed cell usually has the best isolation among neighboring cells due to the property of the MOSFET.

equations are subject to change depending on the technology, though the proportional relationship between the current and the width-to-length ratio (W/L) still holds for very advanced technologies.
³Usually, the transistor length (L) is fixed at the minimal feature size, and the transistor width (W) is adjustable.

Fig. 15. Conceptual view of a cross-point cell array without diode (0T1R) and its connected word lines and bit lines.

Fig. 17. Buffer designs with different transistor sizing. (a) Latency-optimized. (b) Balanced. (c) Area-optimized.
Fig. 20. Analysis model for voltage-divider sensing scheme.

Energy_dynamic = C × V_DD^2   (11)
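Assuming (11) expresses the dynamic energy of charging a capacitance C through a full V_DD swing, a quick numerical instance looks as follows; the capacitance and supply voltage are arbitrary illustrative values, not NVSim defaults.

# Quick numerical instance of (11).
C_LOAD = 10e-15     # 10 fF
V_DD   = 1.0        # volts

energy_dynamic = C_LOAD * V_DD ** 2      # joules, per (11)
print(f"{energy_dynamic:.2e} J")         # -> 1.00e-14 J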
TABLE II
Delay and Power Look-Up Table of the Current-Voltage Converter

Process node                   130 nm            90 nm             65 nm             45 nm              32 nm
Delay                          0.49 ns           0.53 ns           0.62 ns           0.80 ns            1.07 ns
Dynamic energy per operation   8.52 × 10^-14 J   8.72 × 10^-14 J   9.00 × 10^-14 J   10.26 × 10^-14 J   12.56 × 10^-14 J
Leakage power                  1.40 × 10^-8 W    1.87 × 10^-8 W    2.57 × 10^-8 W    4.41 × 10^-8 W     12.54 × 10^-8 W
Fig. 23. Circuit schematic of the slow quench pulse shaper used in [10].

TABLE IX
New PCRAM Parameters After NVSim Latency Optimization

TABLE X
Projection of a Future ReRAM Technology
TABLE XI
Predicted Full Design Spectrum of a 32 nm 8 MB ReRAM Chip

                      Area Opt.           Read Latency Opt.   Write Latency Opt.  Read Energy Opt.    Write Energy Opt.   Leakage Opt.
Area (mm^2)           0.664               5.508               8.071               2.971               3.133               1.399
Read latency (ns)     107.1               1.773               1.917               5.711               6.182               426.8
Write latency (ns)    204.3               200.7               100.6               202.8               203.1               518.2
Read energy (nJ)      1.884               0.195               0.234               0.012               0.014               4.624
Write energy (nJ)     13.72               25.81               13.06               12.82               12.81               12.99
Leakage (mW)          1372                3872                7081                6819                7841                26.64
Array structure       Cross-point         Cross-point         MOS-accessed        Cross-point         Cross-point         MOS-accessed
Subarray size         512 × 512           128 × 128           1024 × 2048         512 × 512           256 × 256           2048 × 4096
Inter-array routing   Non-H-tree          H-tree              H-tree              H-tree              H-tree              Non-H-tree
Sense amp placement   External            Internal            Internal            Internal            Internal            External
Sense amp type        Current-in voltage  Current             Current             Voltage-divider     Voltage-divider     Voltage-divider
Write method          SET-before-RESET    Erase-before-RESET  Normal              SET-before-RESET    SET-before-RESET    Normal
Interconnect wire     Normal              Repeated            Repeated            Low-swing           Low-swing           Normal
Output buffer type    Area opt.           Latency opt.        Latency opt.        Area opt.           Area opt.           Area opt.
As shown in the result, NVSim can optimize the same design toward different optimization targets by exploring the full design space, which means that NVSim automatically tunes all the design knobs, such as the array structure, subarray size, sense amplifier design, write method, repeater design, and buffer design. If necessary, NVSim can also explore different types of transistor or wire models to get the best result.
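Conceptually, this exhaustive tuning is a sweep over the discrete knobs listed in Table XI; the sketch below illustrates such a sweep with a placeholder cost function standing in for NVSim's circuit models. The knob names and values are taken from Table XI, but nothing here is NVSim's real interface.

# Illustration of an exhaustive design-knob sweep: enumerate combinations of the
# discrete options and keep the best candidate for a chosen metric.
from itertools import product

KNOBS = {
    "array_structure": ["Cross-point", "MOS-accessed"],
    "subarray_size":   ["128x128", "512x512", "1024x2048"],
    "routing":         ["H-tree", "Non-H-tree"],
    "sense_amp":       ["Current", "Voltage-divider"],
    "buffer":          ["Area opt.", "Latency opt."],
}

def evaluate(config):
    # Placeholder cost model so the sketch runs standalone; NVSim would invoke
    # its RC delay, energy, and area models here instead.
    return {"read_latency_ns": sum(len(v) for v in config.values()) / 10.0}

def sweep(metric):
    best_cfg, best_val = None, float("inf")
    for values in product(*KNOBS.values()):
        cfg = dict(zip(KNOBS.keys(), values))
        val = evaluate(cfg)[metric]
        if val < best_val:
            best_cfg, best_val = cfg, val
    return best_cfg, best_val

print(sweep("read_latency_ns"))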
IX. Related Work

Many modeling tools have been developed during the last decade to enable system-level design exploration for SRAM- or DRAM-based caches and memories. For example, CACTI [1], [2] is a tool that has been widely used in the computer architecture community to estimate the performance, energy, and area of SRAM and DRAM caches. Evans and Franzon [41] developed an energy model for SRAMs and used it to predict an optimum organization for caches. eCACTI [42] incorporated a leakage power model into CACTI. Muralimanohar et al. [43] modeled large-capacity caches through the use of an interconnect-centric organization composed of mats and request/reply H-tree networks.

In addition, CACTI has also been extended to evaluate the performance, energy, and area of STT-RAM [44], PCRAM [45], [46], cross-point ReRAM [19], and NAND Flash [47]. However, as CACTI was originally designed to model an SRAM-based cache, some of its fundamental assumptions do not match actual NVM circuit implementations, and thereby the NVM array organization modeled in these CACTI-like estimation tools deviates from the NVM chips that have been fabricated.
X. Conclusion

STT-RAM, PCRAM, and ReRAM are emerging memory technologies for future nonvolatile memories. The versatility of these upcoming NVM technologies makes it possible to use these NVM modules at other levels in the memory hierarchy, such as execute-in-place memory, main memory, or even on-chip cache. Such emerging NVM design options can vary for different applications by tuning circuit structure parameters, such as the array organization and the peripheral circuitry types, or by using devices and interconnects with different properties. To enable system-level design space exploration of these NVM technologies and to help computer architects leverage these emerging technologies, it is necessary to have a quick estimation tool. While abundant estimation tools are available as SRAM/DRAM design assistants, similar tools for NVM designs are currently missing. Therefore, in this paper, we build NVSim, a circuit-level model for NVM performance, energy, and area estimation, which supports various NVM technologies including STT-RAM, PCRAM, ReRAM, and conventional NAND Flash. This model is successfully validated against industrial NVM prototypes, and the new NVSim tool is expected to help boost NVM-related studies such as the design of the next-generation memory hierarchy.
References

[1] S. J. E. Wilton and N. P. Jouppi, "CACTI: An enhanced cache access and cycle time model," IEEE J. Solid-State Circuits, vol. 31, no. 5, pp. 677–688, May 1996.
[2] S. Thoziyoor, et al., "CACTI 5.1 technical report," HP Labs, Palo Alto, CA, Tech. Rep. HPL-2008-20, 2008.
[3] S. Raoux, et al., "Phase-change random access memory: A scalable technology," IBM J. Res. Development, vol. 52, nos. 4–5, pp. 465–479, Jul. 2008.
[4] J. J. Yang, et al., "Memristive switching mechanism for metal/oxide/metal nanodevices," Nature Nanotechnol., vol. 3, no. 7, pp. 429–433, 2008.
[5] Z. Wei, et al., "Highly reliable TaOx ReRAM and direct evidence of redox reaction mechanism," in Proc. IEDM, 2008, pp. 293–296.
[6] Y. S. Chen, et al., "Highly scalable hafnium oxide memory with improvements of resistive distribution and read disturb immunity," in Proc. IEDM, 2009, pp. 105–108.
[7] M.-J. Lee, et al., "2-stack 1D-1R cross-point structure with oxide diodes as switch elements for high density resistance RAM applications," in Proc. IEEE IEDM, Dec. 2007, pp. 771–774.
[8] W. C. Chien, et al., "Multi-level operation of fully CMOS compatible WOx resistive random access memory (RRAM)," in Proc. Int. Memory Workshop, 2009, pp. 228–229.
[9] S.-S. Sheu, et al., "A 4 Mb embedded SLC resistive-RAM macro with 7.2 ns read-write random-access time and 160 ns MLC-access capability," in Proc. IEEE Int. Solid-State Circuits Conf., Feb. 2011, pp. 200–202.
[10] K.-J. Lee, et al., "A 90 nm 1.8 V 512 Mb diode-switch PRAM with 266 MB/s read throughput," IEEE J. Solid-State Circuits, vol. 43, no. 1, pp. 150–162, Jan. 2008.
[11] F. Pellizzer, et al., "Novel μTrench phase-change memory cell for embedded and stand-alone nonvolatile memory applications," in Proc. Int. Symp. VLSI Technol., 2004, pp. 18–19.
[12] S. J. Ahn, et al., "Highly manufacturable high density phase change memory of 64 Mb and beyond," in Proc. IEDM, 2004, pp. 907–910.
[13] K.-H. Kim, et al., "Nanoscale resistive memory with intrinsic diode characteristics and long endurance," Appl. Phys. Lett., vol. 96, no. 5, pp. 053106.1–053106.3, 2010.
[14] H. Y. Lee, et al., "Evidence and solution of over-RESET problem for HfOx based resistive memory with sub-ns switching speed and high endurance," in Proc. IEDM, 2010, pp. 19.7.1–19.7.4.
[15] International Technology Roadmap for Semiconductors. (2010). Process Integration, Devices, and Structures Update [Online]. Available: http://www.itrs.net
[16] C. W. Smullen, et al., "Relaxing non-volatility for fast and energy-efficient STT-RAM caches," in Proc. Int. Symp. High Performance Comput. Architecture, Feb. 2011, pp. 50–61.
[17] D.-C. Kau, et al., "A stackable cross point phase change memory," in Proc. IEEE IEDM, Dec. 2009, pp. 27.1.1–27.1.4.
[18] Y.-C. Chen, et al., "An access-transistor-free (0T/1R) non-volatile resistance random access memory (RRAM) using a novel threshold switching, self-rectifying chalcogenide device," in Proc. IEDM, 2003, pp. 750–753.
[19] C. Xu, et al., "Design implications of memristor-based RRAM cross-point structures," in Proc. Des. Autom. Test Eur., 2011, pp. 1–6.
[20] S. Thoziyoor, et al., "A comprehensive memory modeling tool and its application to the design and analysis of future memory hierarchies," in Proc. Int. Symp. Comput. Architecture, 2008, pp. 51–62.
[21] International Technology Roadmap for Semiconductors. The Model for Assessment of CMOS Technologies and Roadmaps (MASTAR) [Online]. Available: http://www.itrs.net/models.html
[22] A. Udipi, et al., "Rethinking DRAM design and organization for energy-constrained multi-cores," ACM SIGARCH Comput. Architecture News, vol. 38, no. 3, pp. 175–186, 2010.
[23] J. Liang and H. S. P. Wong, "Cross-point memory array without cell selectors: Device characteristics and data storage pattern dependencies," IEEE Trans. Electron Devices, vol. 57, no. 10, pp. 2531–2538, Oct. 2010.
[24] M. Hosomi, et al., "A novel nonvolatile memory with spin torque transfer magnetization switching: Spin-RAM," in Proc. IEDM, 2005, pp. 459–462.
[25] T. Kawahara, et al., "2 Mb spin-transfer torque RAM (SPRAM) with bit-by-bit bidirectional current write and parallelizing-direction current read," in Proc. IEEE Int. Solid-State Circuits Conf., Feb. 2007, pp. 480–617.
[26] K. Tsuchida, et al., "A 64 Mb MRAM with clamped-reference and adequate-reference schemes," in Proc. Int. Solid-State Circuits Conf., 2010, pp. 268–269.
[27] H.-R. Oh, et al., "Enhanced write performance of a 64-Mb phase-change random access memory," IEEE J. Solid-State Circuits, vol. 41, no. 1, pp. 122–126, Jan. 2006.
[28] S. Hanzawa, et al., "A 512 kB embedded phase change memory with 416 kB/s write throughput at 100 μA cell write current," in Proc. Int. Solid-State Circuits Conf., 2007, pp. 474–616.
[29] S. Kang, et al., "A 0.1 μm 1.8 V 256 Mb phase-change random access memory (PRAM) with 66 MHz synchronous burst-read operation," IEEE J. Solid-State Circuits, vol. 42, no. 1, pp. 210–218, Jan. 2007.
[30] F. Fishburn, et al., "A 78 nm 6F2 DRAM technology for multigigabit densities," in Proc. Symp. VLSI Technol., 2004, pp. 28–29.
[31] J. H. Oh, et al., "Full integration of highly manufacturable 512 Mb PRAM based on 90 nm technology," in Proc. IEDM, 2006, pp. 49–52.
[32] Y. Zhang, et al., "An integrated phase change memory cell with Ge nanowire diode for cross-point memory," in Proc. IEEE Symp. VLSI Technol., Jun. 2007, pp. 98–99.
[33] Y. Sasago, et al., "Cross-point phase change memory with 4F2 cell size driven by low-contact-resistivity poly-Si diode," in Proc. Symp. VLSI Technol., 2009, pp. 24–25.
[34] I. E. Sutherland, R. F. Sproull, and D. Harris, Logical Effort: Designing Fast CMOS Circuits. San Francisco, CA: Morgan Kaufmann, 1999.
[35] M. A. Horowitz, "Timing models for MOS circuits," Stanford University, Stanford, CA, Tech. Rep. SEL-TR-83-003, 1983.
[36] E. Seevinck, P. J. van Beers, and H. Ontrop, "Current-mode techniques for high-speed VLSI circuits with application to current sense amplifier for CMOS SRAM's," IEEE J. Solid-State Circuits, vol. 26, no. 4, pp. 525–536, Apr. 1991.
[37] Y. Moon, et al., "1.2 V 1.6 Gb/s 56 nm 6F2 4 Gb DDR3 SDRAM with hybrid-I/O sense amplifier and segmented sub-array architecture," in Proc. Int. Solid-State Circuits Conf., 2009, pp. 128–129.
[38] G. W. Burr, et al., "Phase change memory technology," J. Vac. Sci. Technol. B, vol. 28, no. 2, pp. 223–262, 2010.
[39] K. Ishida, et al., "A 1.8 V 30 nJ adaptive program-voltage (20 V) generator for 3D-integrated NAND Flash SSD," in Proc. IEEE Int. Solid-State Circuits Conf., Feb. 2009, pp. 238–239, 239a.
[40] L. M. Grupp, et al., "Characterizing Flash memory: Anomalies, observations, and applications," in Proc. Int. Symp. Microarchitecture, 2009, pp. 24–33.
[41] R. J. Evans and P. D. Franzon, "Energy consumption modeling and optimization for SRAMs," IEEE J. Solid-State Circuits, vol. 30, no. 5, pp. 571–579, May 1995.
[42] M. Mamidipaka and N. Dutt, "eCACTI: An enhanced power estimation model for on-chip caches," Center Embedded Comput. Syst., Univ. California, Irvine, Tech. Rep. TR04-28, 2004.
[43] N. Muralimanohar, R. Balasubramonian, and N. P. Jouppi, "Architecting efficient interconnects for large caches with CACTI 6.0," IEEE Micro, vol. 28, no. 1, pp. 69–79, Jan.–Feb. 2008.
[44] X. Dong, et al., "Circuit and microarchitecture evaluation of 3-D stacking magnetic RAM (MRAM) as a universal memory replacement," in Proc. Des. Autom. Conf., 2008, pp. 554–559.
[45] P. Mangalagiri, et al., "A low-power phase change memory based hybrid cache architecture," in Proc. Great Lakes Symp. VLSI, 2008, pp. 395–398.
[46] X. Dong, N. P. Jouppi, and Y. Xie, "PCRAMsim: System-level performance, energy, and area modeling for phase-change RAM," in Proc. Int. Conf. Comput.-Aided Des., 2009, pp. 269–275.
[47] V. Mohan, S. Gurumurthi, and M. R. Stan, "FlashPower: A detailed power model for NAND Flash memory," in Proc. Des. Autom. Test Eur., 2010, pp. 502–507.
Xiangyu Dong (S'09–M'12) received the B.S. degree in electronic engineering from Shanghai Jiao Tong University, Shanghai, China, in 2006, and the Ph.D. degree in computer science and engineering from Pennsylvania State University, University Park, in 2011.
He is currently a Senior Engineer with Qualcomm, Inc., San Diego, CA. His current research interests include computer architectures, emerging nonvolatile memory, and 3-D integration technology.

Cong Xu (S'09) received the B.S. degree from Peking University, Beijing, China, and joined Pennsylvania State University, University Park, in 2009, where he is currently pursuing the Ph.D. degree with the Department of Computer Science and Engineering.
His current research interests include nonvolatile memory system design based on emerging memory technologies, low power very large scale integration design, and computer architectures.

Yuan Xie (SM'07) received the B.S. degree in electronic engineering from Tsinghua University, Beijing, China, in 1997, and the M.S. and Ph.D. degrees in electrical engineering from Princeton University, Princeton, NJ, in 1999 and 2002, respectively.
He is currently an Associate Professor with the Department of Computer Science and Engineering, Pennsylvania State University, University Park. Before joining Pennsylvania State University in 2003, he was with the IBM Microelectronics Division, Worldwide Design Center, Essex Junction, VT. His current research interests include very large scale integration design, computer architectures, embedded systems design, and electronic design automation.
Dr. Xie is a Senior Member of ACM. He received the SRC Inventor Recognition Award in 2002 and the U.S. National Science Foundation Faculty Early Career Development Award in 2006.

Norman P. Jouppi (F'03) received the Ph.D. degree in electrical engineering from Stanford University, Stanford, CA.
He is a Fellow and the Director of the Intelligent Infrastructure Laboratory, Hewlett-Packard Labs, Palo Alto, CA. His current research interests include computer memory systems, networking for cluster computing, blade system architectures, graphics accelerators, video, audio, and physical telepresence.
Dr. Jouppi is a Fellow of ACM.