
NVSim: A Circuit-Level Performance, Energy, and Area Model for Emerging Nonvolatile Memory

Xiangyu Dong, Member, IEEE, Cong Xu, Student Member, IEEE, Yuan Xie, Senior Member, IEEE, and Norman P. Jouppi, Fellow, IEEE

Abstract—Various new nonvolatile memory (NVM) technologies have emerged recently. Among all the investigated new NVM candidate technologies, spin-torque-transfer memory (STT-RAM, or MRAM), phase-change random-access memory (PCRAM), and resistive random-access memory (ReRAM) are regarded as the most promising candidates. As the ultimate goal of this NVM research is to deploy them into multiple levels of the memory hierarchy, it is necessary to explore the wide NVM design space and find the proper implementation at different memory hierarchy levels, from highly latency-optimized caches to highly density-optimized secondary storage. While abundant tools are available as SRAM/DRAM design assistants, similar tools for NVM designs are currently missing. Thus, in this paper, we develop NVSim, a circuit-level model for NVM performance, energy, and area estimation, which supports various NVM technologies, including STT-RAM, PCRAM, ReRAM, and legacy NAND Flash. NVSim is successfully validated against industrial NVM prototypes, and it is expected to help boost architecture-level NVM-related studies.

Index Terms—Analytical circuit model, MRAM, NAND Flash, nonvolatile memory, phase-change random-access memory (PCRAM), resistive random-access memory (ReRAM), spin-torque-transfer memory (STT-RAM).

Manuscript received March 17, 2011; revised June 22, 2011, September 26, 2011, and December 16, 2011; accepted January 22, 2012. Date of current version June 20, 2012. This work was supported in part by the Semiconductor Research Corporation Grant, in part by the National Science Foundation, under Grants 1147388 and 0903432, and in part by the DoE, under Award DE-SC0005026. This paper was recommended by Associate Editor S. Mitra.
X. Dong is with Qualcomm, Inc., San Diego, CA 92121 USA (e-mail: [email protected]).
C. Xu and Y. Xie are with the Department of Computer Science and Engineering, Pennsylvania State University, University Park, PA 16802 USA (e-mail: [email protected]; [email protected]).
N. P. Jouppi is with the Intelligent Infrastructure Laboratory, Hewlett-Packard Labs, Palo Alto, CA 94304 USA (e-mail: [email protected]).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TCAD.2012.2185930

I. Introduction

Universal memory that provides fast random access, high storage density, and nonvolatility within one memory technology becomes possible thanks to the emergence of various new nonvolatile memory (NVM) technologies, such as spin-torque-transfer random-access memory (STT-RAM, or MRAM), phase-change random-access memory (PCRAM), and resistive random-access memory (ReRAM). As the ultimate goal of this NVM research is to devise a universal memory that could work across multiple layers of the memory hierarchy, each of these emerging NVM technologies has to supply a wide design space that covers a spectrum from highly latency-optimized microprocessor caches to highly density-optimized secondary storage. Therefore, specialized peripheral circuitry is required for each optimization target. However, since few of these NVM technologies are mature so far, only a limited number of prototype chips have been demonstrated, and they cover just a small portion of the entire design space. In order to facilitate architecture-level NVM research by estimating the NVM performance, energy, and area values under different design specifications before fabricating a real chip, in this paper we build NVSim,1 a circuit-level model for NVM performance, energy, and area estimation, which supports various NVM technologies, including STT-RAM, PCRAM, ReRAM, and legacy NAND Flash.

The main goals of developing the NVSim tool are as follows.
1) Estimate the access time, access energy, and silicon area of NVM chips with a given organization and specific design options before the effort of actual fabrication.
2) Explore the NVM chip design space to find the optimized chip organization and design options that achieve the best performance, energy, or area.
3) Find the optimal NVM chip organization and design options that are optimized for one design metric while keeping other metrics under constraints.

We build NVSim by using the same empirical modeling methodology as CACTI [1], [2], but starting from a new framework and adding specific features for NVM technologies. Compared to CACTI, the framework of NVSim includes the following new features.
1) It allows us to move sense amplifiers from the inner memory subarrays to the outer bank level and factor them out to achieve overall area efficiency of the memory module.
2) It provides more flexible array organizations and data activation modes by considering any combination of memory data allocation and address distribution.
3) It models various types of data sensing schemes instead of the voltage-sensing scheme only.
4) It allows memory banks to be formed in a bus-like manner rather than in the H-tree manner only.
5) It provides multiple buffer design options instead of only the latency-optimized option that uses logical effort.
6) It models cross-point memory cells rather than MOS-accessed memory cells only.
7) It considers the subarray size limit by analyzing the current sneak path.
8) It allows advanced target users to redefine memory cell properties by providing a customization interface.

NVSim is validated against several industry prototype chips within an error range of 30%. In addition, we show how to use this model to facilitate the architecture-level performance, energy, and area analysis for applications that adopt the emerging NVM technologies.

1 The latest NVSim binary release and related documentation are available at our wiki site http://www.rioshering.com/nvsimwiki.

II. Background of Nonvolatile Memory

In this section, we first review the technology background of the four types of NVMs modeled in NVSim, which are STT-RAM, PCRAM, ReRAM, and legacy NAND Flash.

A. NVM Physical Mechanisms and Write Operations

Different NVM technologies have their particular storage mechanisms and corresponding write methods.

1) NAND Flash: The physical mechanism of Flash memory is to store bits in the floating gate and control the gate threshold voltage. The series bit-cell string of NAND Flash, as shown in Fig. 1(a), eliminates contacts between the cells and approaches the minimum cell size of 4F^2 for low-cost manufacturing. The small cell size, low cost, and strong application demand make NAND Flash dominate the traditional nonvolatile memory market. Fig. 1(b) shows that a Flash memory cell consists of a floating gate and a control gate aligned vertically. The Flash memory cell modifies its threshold voltage VT by adding electrons to or subtracting electrons from the isolated floating gate.

Fig. 1. Basic string block of NAND Flash, and the conceptual view of a floating-gate Flash memory cell (BL = bitline, WL = wordline, SG = select gate).

NAND Flash usually charges or discharges the floating gate by using Fowler-Nordheim tunneling or hot carrier injection. A program operation adds tunneling charges to the floating gate and the threshold voltage becomes negative, while an erase operation subtracts charges and the threshold voltage returns positive.

2) STT-RAM: STT-RAM uses a magnetic tunnel junction (MTJ) as the memory storage element and leverages the difference in magnetic directions to represent the memory bit. As shown in Fig. 2, an MTJ contains two ferromagnetic layers. One ferromagnetic layer has a fixed magnetization direction and is called the reference layer, while the other layer has a free magnetization direction that can be changed by passing a write current and is called the free layer. The relative magnetization direction of the two ferromagnetic layers determines the resistance of the MTJ. If the two ferromagnetic layers have the same direction, the resistance of the MTJ is low, indicating a "1" state; if the two layers have different directions, the resistance of the MTJ is high, indicating a "0" state.

Fig. 2. Demonstration of an MRAM cell. (a) Structural view. (b) Schematic view (BL = bitline, WL = wordline, SL = sourceline).

As shown in Fig. 2, when writing the "0" state into STT-RAM cells (RESET operation), a positive voltage difference is established between SL and BL; when writing the "1" state (SET operation), vice versa. The current amplitude required to reverse the direction of the free ferromagnetic layer is determined by the size and aspect ratio of the MTJ and the write pulse duration.

3) PCRAM: PCRAM uses chalcogenide material (e.g., GST) to store information. The chalcogenide materials can be switched between a crystalline phase (SET state) and an amorphous phase (RESET state) with the application of heat. The crystalline phase shows low resistivity, while the amorphous phase is characterized by high resistivity. Fig. 3 shows an example of a MOS-accessed PCRAM cell.

Fig. 3. Schematic view of a PCRAM cell with a MOSFET selector transistor (BL = bitline, WL = wordline, SL = sourceline).

The SET operation crystallizes GST by heating it above its crystallization temperature, and the RESET operation melt-quenches GST to make the material amorphous, as illustrated in Fig. 4. The temperature is controlled by passing a specific electrical current profile and generating the required Joule heat. High-power pulses are required for the RESET operation to heat the memory cell above the GST melting temperature. In contrast, the SET operation uses moderate-power but longer-duration pulses to heat the cell above the GST crystallization temperature but below the melting temperature [3].

Fig. 4. Temperature-time relationship during SET and RESET operations.

4) ReRAM: Although many nonvolatile memory technologies (e.g., the aforementioned STT-RAM and PCRAM) are based on electrically induced resistive switching effects, we define ReRAM as the technology that involves electro- and thermochemical effects in the resistance change of a metal/oxide/metal system. In addition, we confine our definition to bipolar ReRAM. Fig. 5 illustrates the general concept of the ReRAM working mechanism. A ReRAM cell consists of a metal oxide layer (e.g., Ti [4], Ta [5], and Hf [6]) sandwiched by two metal (e.g., Pt [4]) electrodes. The electronic behavior of the metal/oxide interfaces depends on the oxygen vacancy concentration of the metal oxide layer.

Fig. 5. Working mechanism of ReRAM cells.

Typically, the metal/oxide interface shows Ohmic behavior in the case of very high doping and rectifying behavior in the case of low doping [4]. In Fig. 5, the TiOx region is semi-insulating, indicating a lower oxygen vacancy concentration, while the TiO2-x region is conductive, indicating a higher concentration. The oxygen vacancy in the metal oxide is an n-type dopant, whose drift under the electric field can cause a change in the doping profile. Thus, applying an electric current can modulate the I-V curve of the ReRAM cell and switch the cell from one state to the other. Usually, for bipolar ReRAM, the cell can be switched ON (SET operation) only by applying a negative bias and OFF (RESET operation) only by applying the opposite bias [4]. Several ReRAM prototypes [7]-[9] have been demonstrated and show promising fast switching speed and low energy consumption.

B. Read Operations

The read operations of these NVM technologies are almost the same. Since the NVM memory cell has different resistances in the ON and OFF states, the read operation can be accomplished either by applying a small voltage on the bitline and sensing the current that passes through the memory cell, or by injecting a small current into the bitline and sensing the voltage across the memory cell. Unlike SRAM, which generates complementary read signals from each cell, NVM usually has a group of dummy cells to generate the reference current or reference voltage. The generated current (or voltage) from the to-be-read cell is then compared to the reference current (or voltage) by using sense amplifiers. Various types of sense amplifiers are modeled in NVSim, as we discuss in Section V-B.

C. Write Endurance Issue

Write endurance is the number of times that an NVM cell can be overwritten. Among all the NVM technologies modeled in NVSim, only STT-RAM does not suffer from the write endurance issue. NAND Flash, PCRAM, and ReRAM all have limited write endurance. NAND Flash only has a write endurance of 10^5-10^6 cycles. The PCRAM endurance is now in the range between 10^5 and 10^9 [10]-[12]. ReRAM research currently shows endurance numbers in the range between 10^5 and 10^10 [13], [14]. A projection by ITRS for 2024 for emerging NVM, i.e., PCRAM and ReRAM, highlights endurance on the order of 10^15 or more write cycles [15]. In NVSim, the write endurance limit is not modeled since NVSim is a circuit-level modeling tool.

D. Retention Time Issue

Retention time is the time that data can be retained in NVM cells. Typically, NVM technologies require a retention time of higher than 10 years. However, in some cases, such a high retention time is not necessary. For example, Smullen et al. relaxed the retention time requirement to improve the timing and energy profile of STT-RAMs [16]. Since the tradeoff between NVM retention time and other NVM parameters (e.g., the duration and amplitude of write pulses) is at the device level, NVSim, as a circuit-level tool, does not model this tradeoff directly but instead takes different sets of NVM parameters with various retention times as device-level input.

E. MOS-Accessed Structure Versus Cross-Point Structure

Some NVM technologies (e.g., PCRAM [17] and ReRAM [13], [17], [18]) have the capability of building cross-point memory arrays without access devices. Conventionally, in the MOS-accessed structure, memory cell arrays are isolated by MOS access devices, and the cell size is dominated by the large MOS access device that is necessary to drive enough write current, even though the NVM cell itself is much smaller. However, taking advantage of the cell nonlinearity, an NVM array can be accessed without any extra access devices. The removal of MOS access devices leads to a memory cell size of only 4F^2, where F is the process feature size. Unfortunately, the cross-point structure also brings extra peripheral circuitry design challenges, and a tradeoff among performance, energy, and area is always necessary, as discussed in our previous work [19]. NVSim models both the MOS-accessed and the cross-point structures, and the modeling methodology is described in the following sections.

III. NVSim Framework

The framework of NVSim is modified from CACTI [2], [20]. We add several new features, such as more flexible data activation modes and alternative bank organizations.

A. Device Model

NVSim uses device data from the ITRS report [15] and the MASTAR tool [21] to obtain the process parameters. NVSim covers the process nodes of 180 nm, 120 nm, 90 nm, 65 nm, 45 nm, 32 nm, and 22 nm, and supports three transistor types, which are high performance, low operating power, and low stand-by power.
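For readers who script such device-level inputs, the following Python sketch shows one hypothetical way to bundle the parameters discussed above (technology node, transistor type, and cell-level switching properties) into a single record; the field names are illustrative only and are not NVSim's actual configuration keys.

    from dataclasses import dataclass

    # Hypothetical device-level input record; field names are illustrative only.
    @dataclass
    class DeviceConfig:
        process_node_nm: int      # one of 180, 120, 90, 65, 45, 32, 22
        transistor_type: str      # "HP", "LOP", or "LSTP" (high performance,
                                  # low operating power, low stand-by power)
        set_pulse_ns: float       # SET pulse duration
        reset_pulse_ns: float     # RESET pulse duration
        set_current_uA: float     # SET current
        reset_current_uA: float   # RESET current
        retention_years: float    # device-level retention target

        def validate(self) -> None:
            assert self.process_node_nm in (180, 120, 90, 65, 45, 32, 22)
            assert self.transistor_type in ("HP", "LOP", "LSTP")

    # Example: an assumed 65 nm parameter set.
    cfg = DeviceConfig(65, "LOP", set_pulse_ns=20, reset_pulse_ns=20,
                       set_current_uA=150, reset_current_uA=150,
                       retention_years=10)
    cfg.validate()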

B. Array Organization

Fig. 6 shows the array organization. There are three hierarchy levels in this organization, which are bank, mat, and subarray. Basically, the descriptions of these levels are as follows.

Fig. 6. Memory array organization modeled in NVSim: a hierarchical memory organization includes banks, mats, and subarrays with decoders, multiplexers, sense amplifiers, and output drivers.

1) Bank is the top-level structure modeled in NVSim. One nonvolatile memory chip can have multiple banks. The bank is a fully functional memory unit, and it can be operated independently. In each bank, multiple mats are connected together in either an H-tree or a bus-like manner.
2) Mat is the building block of a bank. Multiple mats in a bank operate simultaneously to fulfill a memory operation. Each mat consists of multiple subarrays and one predecoder block.
3) Subarray is the elementary structure modeled in NVSim. Every subarray contains peripheral circuitry, including row decoders, column multiplexers, and output drivers.

Conventionally, sense amplifiers are integrated at the subarray level, as modeled in CACTI [2], [20]. However, in the NVSim model, sense amplifiers can be placed either at the subarray level or at the mat level.

C. Memory Bank Type

For practical memory designs, memory cells are grouped together to form memory modules of different types. For instance:
1) the main memory is a typical RAM, which takes the address of data as input and returns the content of the data;
2) the set-associative cache contains two separate RAMs (data array and tag array), and can return the data if there is a cache hit for the given set address and tag;
3) the fully associative cache usually contains a content-addressable memory (CAM).
To cover all the possible memory designs, we model five types of memory banks in NVSim: one for RAM, one for CAM, and three for set-associative caches with different access manners. The functionalities of these five types of memory banks are listed as follows.
1) RAM: output the data content at the I/O interface given the data address.
2) CAM: output the data address at the I/O interface given the data content if there is a hit.
3) Cache with normal access: start to access the cache data array and tag array at the same time; the data content is temporarily buffered in each mat; if there is a hit, the cache hit signal generated from the tag array is routed to the proper mats and the content of the desired cache line is output to the I/O interface.
4) Cache with sequential access: access the cache tag array first; if there is a hit, then access the cache data array with the set address and the tag hit information, and finally output the desired cache line to the I/O interface.
5) Cache with fast access: access the cache data array and tag array simultaneously; read the entire set content from the mats to the I/O interface; selectively output the desired cache line if there is a cache hit signal generated from the tag array.

D. Activation Mode

We model the array organization and the data activation modes using eight parameters, which are as follows:
1) NMR: number of rows of mat arrays in each bank;
2) NMC: number of columns of mat arrays in each bank;
3) NAMR: number of active rows of mat arrays during data accessing;
4) NAMC: number of active columns of mat arrays during data accessing;
5) NSR: number of rows of subarrays in each mat;
6) NSC: number of columns of subarrays in each mat;
7) NASR: number of active rows of subarrays during data accessing;
8) NASC: number of active columns of subarrays during data accessing.
The values of these parameters are all constrained to be powers of two. NMR and NMC define the number of mats in a bank, and NSR and NSC define the number of subarrays in a mat. NAMR, NAMC, NASR, and NASC define the activation patterns, and they can take any values not exceeding NMR, NMC, NSR, and NSC, respectively. In contrast, the limitation of array organization and data activation pattern in CACTI is caused by several constraints on these parameters, such as NAMR = 1, NAMC = NMC, and NSR = NSC = NASR = NASC = 2. NVSim has these flexible activation patterns, and is able to model sophisticated memory accessing techniques such as single subarray activation [22]. A small constraint-checking sketch is given below.
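As a quick illustration of these constraints, the following Python sketch (illustrative variable names, not NVSim source code) checks that a chosen array organization and activation pattern satisfy the power-of-two requirement and the bounds described above.

    def is_pow2(x: int) -> bool:
        return x > 0 and (x & (x - 1)) == 0

    def check_activation_mode(NMR, NMC, NAMR, NAMC, NSR, NSC, NASR, NASC):
        """Validate an NVSim-style array organization / activation pattern.
        All eight parameters must be powers of two, and each 'active' count
        must not exceed its corresponding total count."""
        params = dict(NMR=NMR, NMC=NMC, NAMR=NAMR, NAMC=NAMC,
                      NSR=NSR, NSC=NSC, NASR=NASR, NASC=NASC)
        for name, value in params.items():
            assert is_pow2(value), f"{name} must be a power of two"
        assert NAMR <= NMR and NAMC <= NMC, "active mats exceed mat array"
        assert NASR <= NSR and NASC <= NSC, "active subarrays exceed subarray array"
        return params

    # Example: the 4 x 4 mat organization of Section III-E with 2 x 2 active mats.
    check_activation_mode(NMR=4, NMC=4, NAMR=2, NAMC=2,
                          NSR=1, NSC=1, NASR=1, NASC=1)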

E. Routing to Mats

To route the data and address signals first from the I/O port to the edges of the memory mats, and then from the mat to the edges of the memory subarrays, we divide all the interconnect wires into three categories: address wires, broadcast data wires, and distributed data wires. Depending on the memory module type and the activation mode, the initial number of wires in each group is assigned according to the rules listed in Table I. We use the terminology block to refer to the memory words in RAM and CAM designs and the cache lines in cache designs. In Table I, Nblock is the number of blocks, Wblock is the block size, and A is the associativity in cache designs. The number of broadcast data wires is always kept unchanged, the number of distributed data wires is cut by half at each routing point where data are merged, and the number of address wires is decremented by one at each routing point where data are multiplexed.

TABLE I
Initial Number of Wires in Each Routing Group

                                      Address Wires (NAW)   Broadcast Data Wires (NBW)   Distributed Data Wires (NDW)
  RAM                                 log2(Nblock)          0                            Wblock
  CAM                                 log2(Nblock)          Wblock                       0
  Cache, normal access, data array    log2(Nblock/A)        log2(A)                      Wblock
  Cache, normal access, tag array     log2(Nblock/A)        Wblock                       A
  Cache, sequential access, data      log2(Nblock)          0                            Wblock
  Cache, sequential access, tag       log2(Nblock/A)        Wblock                       A
  Cache, fast access, data array      log2(Nblock/A)        0                            Wblock x A
  Cache, fast access, tag array       log2(Nblock/A)        Wblock                       A

We use the case of the cache bank with normal access to demonstrate how the wires are routed from the I/O port to the edges of the mats. For simplicity, we suppose the data array and the tag array are two separate modules. While the data and tag arrays usually have different mat organizations in practice, we use the same 4 x 4 mat organization for demonstration purposes, as shown in Figs. 7 and 8. The total of 16 mats is positioned in a 4 x 4 formation and connected by a 4-level H-tree. Therefore, NMR and NMC are 4. As an example, we use the activation mode in which two rows and two columns of the mat array are activated for each data access, and the activation groups are Mat {0, 2, 8, 10}, Mat {1, 3, 9, 11}, Mat {4, 6, 12, 14}, and Mat {5, 7, 13, 15}. Thereby, NAMR and NAMC are 2. In addition, we set the cache line size (block size) to 64 B, the cache associativity to A = 8, and the cache bank capacity to 1 MB, so that the number of cache lines (blocks) is Nblock = 8M/512 = 16 384, the block size in the data array is Wblock,data = 512, and the block size in the tag array is Wblock,tag = 16 (assuming 32-bit addressing and labeling a dirty block with one bit).

Fig. 7. Example of the wire routing in a 4 x 4 mat organization for the data array of an 8-way 1 MB cache with 64 B cache lines.
Fig. 8. Example of the wire routing in a 4 x 4 mat organization for the tag array of an 8-way 1 MB cache with 64 B cache lines.

According to Table I, the initial number of address wires (NAW) is log2(Nblock/A) = 11 for both data and tag arrays. For the data array, the initial number of broadcast data wires (NBW,data) is log2(A) = 3, which is used to transmit the tag hit signals from the tag array to the corresponding mats in the data array; the initial number of distributed data wires (NDW,data) is Wblock,data = 512, which is used to output the desired cache line from the mats to the I/O port. For the tag array, the number of broadcast data wires (NBW,tag) is Wblock,tag = 16, which is sent from the I/O port to each of the mats in the tag array; the initial number of distributed data wires (NDW,tag) is A = 8, which is used to collect the tag hit signals from each mat to the I/O port and then send them to the data array after an 8-to-3 encoding process.

From the I/O port to the edges of the mats, the numbers of wires in the three categories change as follows, as demonstrated in Figs. 7 and 8, respectively.
1) At node A, the activated mats are distributed in both the upper and the lower parts, so node A is a merging node. As per the routing rule, the address wires and broadcast data wires remain the same but the distributed data wires are cut in half. Thus, the wire segments between nodes A and B have NAW = 11, NBW,data = 3, NDW,data = 256, NBW,tag = 16, and NDW,tag = 4.
2) Node B is again a merging node. Thus, the wire segments between nodes B and C have NAW = 11, NBW,data = 3, NDW,data = 128, NBW,tag = 16, and NDW,tag = 2.
3) At node C, the activated mats are allocated only on one side, either from Mat 0/1 or from Mat 4/5, so node C is a multiplexing node. As per the routing rule, the distributed data wires and broadcast data wires remain the same but the address wires are decremented by 1. Thus, the wire segments between nodes C and D have NAW = 10, NBW,data = 3, NDW,data = 128, NBW,tag = 16, and NDW,tag = 2.
4) Finally, node D is another multiplexing node. Thus, the wire segments at the mat edges have NAW = 9, NBW,data = 3, NDW,data = 128, NBW,tag = 16, and NDW,tag = 2.

Thereby, each mat in the data array takes as input a 9-bit set address and 3-bit tag hit signals (which can be treated as the block address within an 8-way associative set), and it generates a 128-bit data output. A group of four data mats provides the desired output of a 512-bit (64 B) cache line, and four such groups cover the entire 11-bit set address space. On the other hand, each mat in the tag array takes as input a 9-bit set address and a 16-bit tag, and it generates 2-bit hit signals (01 or 10 for hit and 00 for miss). A group of four tag mats concatenates their hit signals and provides the information of whether a 16-bit tag hits in an 8-way associative cache with a 9-bit address space, and four such groups extend the address space from 9 bits to the desired 11 bits.

Other configurations in Table I can be explained in a similar manner; a small arithmetic sketch of the routing rules follows.
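The routing arithmetic in this example is simple enough to capture in a few lines of code. The Python sketch below (an illustration, not NVSim source code) starts from the Table I initial counts for a normal-access cache and applies the merge/multiplex rules at each H-tree node; it reproduces the NAW = 11, 11, 10, 9 and NDW,data = 512, 256, 128 progression described above.

    import math

    # Initial wire counts for a normal-access cache (Table I), data and tag arrays.
    N_block, W_block_data, W_block_tag, A = 16384, 512, 16, 8
    wires = {
        "NAW":      int(math.log2(N_block // A)),  # 11 address wires
        "NBW_data": int(math.log2(A)),             # 3 broadcast (tag-hit) wires
        "NDW_data": W_block_data,                  # 512 distributed data wires
        "NBW_tag":  W_block_tag,                   # 16 broadcast tag wires
        "NDW_tag":  A,                             # 8 distributed hit-signal wires
    }

    def route(wires, node_type):
        """Apply the wire-partitioning rule at one H-tree routing node."""
        w = dict(wires)
        if node_type == "merge":     # data merged: halve the distributed data wires
            w["NDW_data"] //= 2
            w["NDW_tag"]  //= 2
        elif node_type == "mux":     # data multiplexed: drop one address wire
            w["NAW"] -= 1
        return w

    # Nodes A and B are merging nodes; nodes C and D are multiplexing nodes.
    for node in ("merge", "merge", "mux", "mux"):
        wires = route(wires, node)
    print(wires)   # NAW=9, NBW_data=3, NDW_data=128, NBW_tag=16, NDW_tag=2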

F. Routing to Subarrays

The interconnect wires from the mat to the edges of the memory subarrays are routed using the same H-tree organization, as shown in Fig. 9, and the routing strategy follows the same wire partitioning rule described in Section III-E. However, NVSim also provides an option of building the mat using a bus-like routing organization, as illustrated in Fig. 10. The wire partitioning rule described in Section III-E can also be applied to the bus-like organization with a few extensions. For example, a multiplexing node with a fanout of N decrements the number of address wires by log2(N) instead of 1, and a merging node with a fanout of N divides the number of distributed data wires by N instead of 2.

Fig. 9. Example of a mat using internal sensing and H-tree routing.
Fig. 10. Example of a mat using external sensing and bus-like routing.

Furthermore, the default setting of including sense amplifiers in each subarray can account for a dominant portion of the total array area. As a result, for high-density memory module designs, NVSim provides an option of moving the sense amplifiers out of the subarray and using external sensing. In addition, a bus-like routing organization is designed to work with the external sensing scheme.

Fig. 9 shows a common mat using the H-tree organization to connect all the sense-amplifier-included subarrays together. In contrast, the new external sensing scheme is illustrated in Fig. 10. In this external sensing scheme, all the sense amplifiers are located at the mat level and the output signals from each sense-amplifier-free subarray are partial-swing. It is obvious that the external sensing scheme has much higher area efficiency compared to its internal sensing counterpart. However, as a penalty, sophisticated global interconnect techniques, such as repeater insertion, cannot be used in the external sensing scheme since all the global signals are partial-swing before passing through the sense amplifiers.

G. Subarray Size Limit

The subarray size is a critical parameter in designing a memory module. Basically, smaller subarrays are preferred for latency-optimized designs since they reduce local bitline and wordline latencies and leave the global interconnects to be handled by the sophisticated H-tree solution. In contrast, larger subarrays are preferred for area-optimized designs since they can greatly amortize the peripheral circuitry area. However, the subarray size has an upper limit in practice.

For MOS-accessed subarrays, the leakage current paths from unselected wordlines are the main constraint on the bitline length. For cross-point subarrays, the leakage current path issue is much more severe, as there is no MOSFET in such subarrays that can isolate selected and unselected cells [23]. The half-selected cells in cross-point subarrays serve as current dividers in the selected row and columns, preventing the array size from growing unbounded since the available driving current is limited. The minimum current that a column write driver should provide is determined by

    I_driver = I_write + (N_r - 1) x I(V_write/2)    (1)

where I_write and V_write are the current and voltage of either the RESET or SET operation. The nonlinearity of memory cells is reflected by the fact that the current through cross-point memory cells is not directly proportional to the voltage applied to them, which means a nonconstant resistance of the memory cell. In NVSim, we define a nonlinearity coefficient, K_r, to quantify the current divider effect of the half-selected memory cells as follows:

    K_r = R(V_write/2) / R(V_write)    (2)

where R(V_write/2) and R(V_write) are the equivalent static resistances of cross-point memory cells biased at V_write/2 and V_write, respectively. Then, we derive the upper limit on the cross-point subarray size by

    N_r = (I_driver / I_write - 1) x 2 x K_r + 1         (3)
    N_c = (I_driver / I_write - N_sc) x 2 x K_r + N_sc   (4)

where I_driver is the maximum driving current that the write driver attached to the selected row/column can provide and N_sc is the number of selected columns per row. Thus, N_r and N_c are the maximum numbers of rows and columns in a cross-point subarray.

Fig. 11. Maximum subarray size versus nonlinearity and driving current.

As shown in Fig. 11, the maximum cross-point subarray size increases with larger current driving capability or a larger nonlinearity coefficient.
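A minimal Python sketch of (3) and (4), using assumed illustrative parameter values rather than NVSim defaults:

    def max_crosspoint_subarray(I_driver, I_write, K_r, N_sc):
        """Upper bounds on cross-point subarray dimensions, following (3) and (4).
        I_driver : maximum current the write driver can supply
        I_write  : cell SET/RESET current
        K_r      : nonlinearity coefficient, R(Vwrite/2) / R(Vwrite)
        N_sc     : number of selected columns per row
        """
        N_r = (I_driver / I_write - 1) * 2 * K_r + 1
        N_c = (I_driver / I_write - N_sc) * 2 * K_r + N_sc
        return int(N_r), int(N_c)

    # Example with assumed numbers: a 1 mA driver, 50 uA write current,
    # nonlinearity coefficient of 10, and 4 selected columns per row.
    print(max_crosspoint_subarray(I_driver=1e-3, I_write=50e-6, K_r=10, N_sc=4))
    # -> (381, 324); a larger drive current or nonlinearity enlarges the bound.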

H. Two-Step Write in Cross-Point Subarrays

In cross-point structures, SET and RESET operations cannot be performed simultaneously. Thus, two write steps are required in the cross-point structure when multiple cells are selected in a row.

In NVSim, we model two write methods for cross-point subarrays. The first one separates the SET and RESET operations, as Fig. 12 shows, and it is called SET-before-RESET. The second one erases all the cells in the selected row before the selective RESET operation, as Fig. 13 shows, and it is called ERASE-before-RESET. Supposing the 4-bit word to write is "0101," we first write "x1x1" ("x" here means biasing the row and column of the corresponding cell at the same voltage to keep its original state) and then write "0x0x" in the SET-before-RESET method; or we first SET all four cells and then write "0x0x" in the ERASE-before-RESET method. The latter method has smaller write latency since the erase operation can be performed before the arrival of the column selector signal, but it needs more write energy due to the redundant SET on the cells that are RESET back in the second step. Here, ERASE-before-RESET is chosen rather than ERASE-before-SET because the SET operation usually consumes less energy than the RESET operation does.

Fig. 12. Sequential write method: SET-before-RESET. (a) SET step. (b) RESET step.
Fig. 13. Sequential write method: ERASE-before-RESET. (a) ERASE step. (b) RESET step.
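The two write steps can be derived mechanically from the data word. The short Python sketch below (an illustration, not NVSim code) produces the bit masks for both methods using the "x" notation from the example above.

    def set_before_reset(word: str):
        """Step 1 SETs the '1' positions, step 2 RESETs the '0' positions."""
        step1 = "".join("1" if b == "1" else "x" for b in word)   # e.g. "x1x1"
        step2 = "".join("0" if b == "0" else "x" for b in word)   # e.g. "0x0x"
        return step1, step2

    def erase_before_reset(word: str):
        """Step 1 SETs (erases) every cell in the row, step 2 RESETs the '0' positions."""
        step1 = "1" * len(word)                                   # e.g. "1111"
        step2 = "".join("0" if b == "0" else "x" for b in word)   # e.g. "0x0x"
        return step1, step2

    print(set_before_reset("0101"))    # ('x1x1', '0x0x')
    print(erase_before_reset("0101"))  # ('1111', '0x0x')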
IV. Area Model

Since NVSim estimates the performance, energy, and area of nonvolatile memory modules, the area model is an essential component of NVSim, especially given the facts that interconnect wires contribute a large portion of the total access latency and access energy and that the geometry of the module becomes highly important. In this section, we describe the NVSim area model from the memory cell level to the bank level in detail.

A. Cell Area Estimation

Three types of memory cells are modeled in NVSim: MOS-accessed, cross-point, and NAND-string.

1) MOS-Accessed Cell: The MOS-accessed cell corresponds to the typical 1-transistor-1-resistor (1T1R) structure used by many NVM chips [12], [24]-[29], in which an NMOS access device is connected in series with the nonvolatile storage element (i.e., the MTJ in STT-RAM, GST in PCRAM, and metal oxide in ReRAM), as shown in Fig. 14. Such an NMOS device turns the access path to the storage element on or off by tuning the voltage applied to its gate. The MOS-accessed cell usually has the best isolation among neighboring cells due to the properties of the MOSFET.

Fig. 14. Conceptual view of a MOS-accessed cell (1T1R) and its connected word line, bit line, and source line.

In MOS-accessed cells, the size of the NMOS is bounded by the current needed by the write operation: the NMOS in each MOS-accessed cell needs to be sufficiently large to drive enough write current. The driving current of the NMOS, I_DS, can be estimated to first order as follows:2

    I_DS = K (W/L) [ (V_GS - V_TH) V_DS - V_DS^2 / 2 ]    (5)

if the NMOS is working in the linear region, or calculated by

    I_DS = (K/2) (W/L) (V_GS - V_TH)^2 (1 + lambda V_DS)    (6)

if the NMOS is working in the saturation region. Hence, no matter in which region the NMOS is working, the current driving ability of the NMOS is proportional to its W/L ratio,3 which determines the NMOS size. To achieve high cell density, we model the MOS-accessed cell area by referring to DRAM design rules [30]. As a result, the cell size of a MOS-accessed cell in NVSim is calculated as follows:

    Area_cell,MOS-accessed = 3 (W/L + 1) F^2    (7)

in which the W/L ratio is determined by (5) or (6), and the required write current is configured as one of the input values of NVSim. In NVSim, we also allow advanced users to override this cell size calculation by directly importing a user-defined cell size.

2 Equations (5) and (6) are for long-channel drift/diffusion devices; the equations are subject to change depending on the technology, though the proportional relationship between the current and the width-to-length ratio (W/L) still holds for very advanced technologies.
3 Usually, the transistor length (L) is fixed at the minimum feature size, and the transistor width (W) is adjustable.

2) Cross-Point Cell: The cross-point cell corresponds to the 1-diode-1-resistor (1D1R) [7], [10], [31]-[33] or 0-transistor-1-resistor (0T1R) [13], [17], [18] structures used by several recent high-density NVM chips. Fig. 15 shows a cross-point array without diodes (i.e., the 0T1R structure). For the 1D1R structure, a diode is inserted between the word line and the storage element. Such cells either rely on the one-way connectivity of the diode (i.e., 1D1R) or leverage the material's nonlinearity (i.e., 0T1R) to control the memory access path. As illustrated in Fig. 15, the widths of the word lines and bit lines can be the minimum value of 1F and the spacing in each direction is also 1F, thus the cell size of each cross-point cell is as follows:

    Area_cell,cross-point = 4 F^2.    (8)

Fig. 15. Conceptual view of a cross-point cell array without diodes (0T1R) and its connected word lines and bit lines.

Compared to MOS-accessed cells, cross-point cells have worse cell isolation but provide a way of building high-density memory chips because they have much smaller cell sizes. In some cases where the cross-point cell size is constrained by the diode due to its limited current density, NVSim allows the user to override the default 4F^2 setting.

3) NAND-String Cell: NAND-string cells are modeled specifically for NAND Flash. In NAND-string cells, a group of floating gates is connected in series and two ordinary gates with contacts are added at the string ends, as shown in Fig. 16. Since the area of the floating gates can be minimized to 2F x 2F, the total area of a NAND-string cell is as follows:

    Area_cell,NAND-string = 2 (2N + 5) F^2    (9)

where N is the number of floating gates in a string and we assume the addition of two gates and two contacts adds 5F to the total string length.

Fig. 16. Layout of the NAND-string cell modeled in NVSim.

B. Peripheral Circuitry Area Estimation

Besides the area occupied by memory cells, a large portion of the memory chip area is contributed by the peripheral circuitry. In NVSim, we have peripheral circuitry components such as row decoders, prechargers, and column multiplexers at the subarray level, predecoders at the mat level, and sense amplifiers and write drivers at either the subarray or the mat level, depending on whether the internal or external data sensing scheme is used. In addition, at every level, interconnect wires might occupy extra silicon area if the wires are relayed using repeaters.

In order to estimate the area of each peripheral circuitry component, we delve into the actual gate-level logic design, similarly to CACTI [2]. However, in NVSim, we size transistors in a more generalized way than CACTI does.

The sizing philosophy of CACTI is to use logical effort [34] to size the circuits for minimum delay. NVSim's goal is to estimate the properties of a broad range of NVM chips, and these chips might be optimized for density or energy consumption instead of minimum delay; thus, we provide optional sizing methods rather than only applying logical effort. In addition, for some peripheral circuitry in NVM chips, the size of some transistors is determined by their required driving current instead of their capacitive load, and this violates the basic rules of using logical effort.

Therefore, we offer three transistor sizing choices in the area model of NVSim: one optimizing latency, one optimizing area, and another balancing latency and area. An example is illustrated in Fig. 17, demonstrating the different sizing methods when an output buffer with 4096 times the capacitance of a minimum-sized inverter is to be designed. In a latency-optimized buffer design, the number of stages and all of the inverter sizes in the inverter chain are calculated by logical effort to achieve minimum delay (30 units) while paying a huge area penalty (1365 units). In an area-optimized buffer design, there are only two stages of inverters, and the size of the last stage is determined by the minimum driving current requirement. This type of buffer has the minimum area (65 units), but is much slower than the latency-optimized buffer. The balanced option determines the size of the last-stage inverter by its driving current requirement and calculates the sizes of the other inverters by logical effort. This results in balanced delay and area metrics.

Fig. 17. Buffer designs with different transistor sizing. (a) Latency-optimized. (b) Balanced. (c) Area-optimized.
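The numbers quoted for Fig. 17 can be reproduced with a few lines of logical-effort arithmetic. The sketch below is a simplified illustration, assuming a unit parasitic delay per stage, a minimum-size inverter area of 1 unit, and a last-stage size of 64 units for the area-optimized buffer (inferred from the 65-unit total area quoted above); it is not NVSim's internal sizing code.

    import math

    def latency_optimized_chain(load=4096, stage_effort=4, parasitic=1):
        """Logical-effort inverter chain: pick the stage count so each stage
        drives roughly stage_effort, then size the stages geometrically."""
        n = round(math.log(load, stage_effort))         # 6 stages for load = 4096
        sizes = [stage_effort ** i for i in range(n)]   # 1, 4, 16, 64, 256, 1024
        delay = n * (stage_effort + parasitic)          # 30 delay units
        area = sum(sizes)                               # 1365 area units
        return delay, area

    def area_optimized_buffer(load=4096, last_stage=64, parasitic=1):
        """Two-stage buffer; the last-stage size is fixed by the minimum
        driving-current requirement (assumed 64 units in this example)."""
        delay = (last_stage / 1 + parasitic) + (load / last_stage + parasitic)
        area = 1 + last_stage                           # 65 area units
        return delay, area

    print(latency_optimized_chain())   # (30, 1365)
    print(area_optimized_buffer())     # (130.0, 65) -- far slower, far smaller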

V. Timing and Power Models

As an analytical modeling tool, NVSim uses RC analysis for timing and power estimation. In this section, we describe how resistances and capacitances are estimated in NVSim and how they are combined to calculate the delay and power consumption.

A. Generic Timing and Power Estimation

In NVSim, we consider the wire resistance and wire capacitance from interconnects; the turn-on resistance, switching resistance, and gate and drain capacitances from transistors; and the equivalent resistance and capacitance from memory storage elements (e.g., the MTJ in STT-RAM and GST in PCRAM). The methods of estimating wire and parasitic resistances and capacitances are modified from the previous versions of CACTI [1], [2] with several enhancements. The enhancements include updating the transistor models with the latest ITRS report [15], considering the thermal impact on the wire resistance calculation, adding the drain-to-channel capacitance in the drain capacitance calculation, and so on. We build a look-up table to model the equivalent resistance and capacitance of memory storage elements since they are properties of the specific nonvolatile memory technology. Considering that NVSim is a system-level estimation tool, we only model the static behavior of the storage elements and record the equivalent resistances and capacitances of the RESET and SET states (i.e., R_RESET, R_SET, C_RESET, C_SET).4

4 One exception is that NVSim records the detailed I-V curves for cross-point ReRAM cells without diodes because we need to leverage the nonlinearity of the storage element.

After calculating the resistances and capacitances of the nodes, the delay of each logic component is calculated by using a simplified version of Horowitz's timing model [35] as follows:

    Delay = tau x sqrt( (ln(1/2))^2 + alpha x beta )    (10)

where alpha is the slope of the input, beta = gm x R is the input transconductance normalized by the output resistance, and tau = RC is the RC time constant.

The dynamic energy and leakage power consumption can be modeled as follows:

    Energy_dynamic = C x V_DD^2      (11)
    Power_leakage  = V_DD x I_leak   (12)

where we model both the gate leakage and the sub-threshold leakage currents in I_leak.

The overall memory access latency and energy consumption are estimated by combining all the timing and power values of the circuit components together. NVSim follows the same methodology that CACTI [2] uses, with minor modifications.
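A direct transcription of (10)-(12) in Python, with assumed example values for the RC constant, input slope, transconductance product, node capacitance, and leakage current:

    import math

    def horowitz_delay(tau, alpha, beta):
        """Simplified Horowitz delay, (10): tau * sqrt(ln(1/2)^2 + alpha*beta)."""
        return tau * math.sqrt(math.log(0.5) ** 2 + alpha * beta)

    def dynamic_energy(c, vdd):
        """Switching energy of a node, (11)."""
        return c * vdd ** 2

    def leakage_power(vdd, i_leak):
        """Static power of a gate, (12); i_leak combines gate and sub-threshold leakage."""
        return vdd * i_leak

    # Assumed example values: 50 ps RC constant, unit input slope, beta = 0.5,
    # 10 fF node capacitance, 1.0 V supply, 10 nA leakage per gate.
    print(horowitz_delay(tau=50e-12, alpha=1.0, beta=0.5))  # ~5.0e-11 s
    print(dynamic_energy(c=10e-15, vdd=1.0))                # 1e-14 J
    print(leakage_power(vdd=1.0, i_leak=10e-9))             # 1e-8 W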


B. Data Sensing Models

Unlike other peripheral circuitry, the sense amplifier is an analog design instead of a logic design. Thus, in NVSim, we develop a separate timing model for the data sensing schemes. Different sensing schemes have their own impacts on the tradeoff among performance, energy, and area. In NVSim, we consider three types of sensing schemes: current sensing, current-in voltage sensing, and voltage-divider sensing.

In the current sensing scheme, as shown in Fig. 18, the state of the memory cell (STT-RAM, PCRAM, or ReRAM) is read out by measuring the resulting current through the selected memory cell when a read voltage is applied: the current on the bitline is compared to the reference current generated by reference cells, the current difference is amplified by current-mode sense amplifiers, and it is eventually converted to voltage signals.

Fig. 18. Analysis model for the current sensing scheme.

Fig. 19 demonstrates an alternative sensing method: applying a current source to the selected memory cell and sensing the voltage via a voltage-mode sense amplifier.

Fig. 19. Analysis model for the current-in voltage sensing scheme.

The voltage-divider sensing scheme is implemented by introducing a resistor (R_x) in series with the memory cell, as illustrated in Fig. 20. The resistance value is selected to achieve the maximum read sensing margin, and it is calculated as follows:

    R_x = sqrt( R_on x R_off )    (13)

where R_on and R_off are the equivalent resistance values of the memory cell in the LRS and HRS, respectively.

Fig. 20. Analysis model for the voltage-divider sensing scheme.

1) Bitline RC Model: We model the bitline RC delay analytically for each sensing scheme. The most significant difference between current-mode sensing and voltage-mode sensing is that the input resistance of ideal current-mode sensing is zero while that of ideal voltage-mode sensing is infinite. In turn, the most significant difference between current-in voltage sensing and voltage-divider sensing is that the internal resistance of an ideal current source is infinite, while the resistor R_x serving as a voltage divider can be treated as the internal resistance of a voltage source. The delays of current-in voltage sensing, voltage-divider sensing, and current sensing are given by the following equations using Seevinck's delay expression [36]:

    dt_v  = (R_T C_T / 2) x (1 + 2 R_B / R_T)             (14)
    dt_vd = (R_T C_T / 2) x (1 + 2 (R_B || R_x) / R_T)    (15)
    dt_i  = (R_T C_T / 2) x (R_B + R_T/3) / (R_B + R_T)   (16)

where R_T and C_T are the total line resistance and capacitance, R_B is the equivalent resistance of the memory cell, and R_x is the resistance of the voltage divider. In these equations, dt_v, dt_vd, and dt_i are the RC delays of the current-in voltage sensing, voltage-divider sensing, and current sensing schemes, respectively. R_x || R_B, instead of R_B, is used as the new effective pull-down resistance in (15) according to the transformation from a Thevenin equivalent to a Norton equivalent.

Equations (14) and (15) show that voltage-divider sensing is faster than current-in voltage sensing, with the extra cost of fabricating a large resistor. Comparing (16) with (14) and (15), we can see that current sensing is much faster than current-in voltage sensing and voltage-divider sensing, since the former delay is less than the intrinsic line delay R_T C_T / 2 while the latter delays are larger than R_T C_T / 2. The bitline delay analytical models are verified by comparing them with HSPICE simulation results. As shown in Fig. 21, the RC delays derived by our analytical RC models are consistent with the HSPICE simulation results.

Fig. 21. Delay model verification of the three sensing schemes against HSPICE simulations.
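A small numerical check of (14)-(16), using assumed bitline and cell values, shows the ordering discussed above (current sensing is faster than the intrinsic R_T C_T / 2 delay, the other two are slower):

    def sensing_delays(R_T, C_T, R_B, R_x):
        """Bitline RC delays of the three sensing schemes, per (14)-(16)."""
        half = R_T * C_T / 2.0
        dt_v  = half * (1.0 + 2.0 * R_B / R_T)            # current-in voltage sensing
        R_par = (R_B * R_x) / (R_B + R_x)                 # R_B || R_x
        dt_vd = half * (1.0 + 2.0 * R_par / R_T)          # voltage-divider sensing
        dt_i  = half * (R_B + R_T / 3.0) / (R_B + R_T)    # current sensing
        return dt_v, dt_vd, dt_i, half

    # Assumed example: 10 kOhm / 100 fF bitline, 25 kOhm cell, Rx = 50 kOhm.
    dt_v, dt_vd, dt_i, half = sensing_delays(R_T=10e3, C_T=100e-15, R_B=25e3, R_x=50e3)
    print(dt_v, dt_vd, dt_i, half)   # dt_i < R_T*C_T/2 < dt_vd < dt_v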

2) Current-Voltage Converter Model: As shown in Fig. 18, the current-voltage converter in our current-mode sensing scheme is actually the first-level sense amplifier, and the CACTI-modeled voltage sense amplifier is still kept in the bitline model as the final stage of the sensing scheme. The current-voltage converter senses the current difference I1 - I2, which is then converted into a voltage difference V1 - V2. The required voltage difference produced by the current-voltage converter is set by default to 80 mV. Although this value is the minimum sensible voltage difference of the CACTI-modeled voltage sense amplifier, advanced users can override it for a specific sense amplifier design. We refer to a previous current-voltage converter design [36], and the circuit schematic is shown in Fig. 22. This sensing scheme is similar to the hybrid-I/O approach [37], which can achieve high-speed, robust sensing, and low-power operation.

Fig. 22. Current-voltage converter modeled in NVSim.

To avoid unnecessary calculation, the current-voltage converter is modeled by directly using HSPICE-simulated values and building a look-up table of delay, dynamic energy, and leakage power (Table II).

TABLE II
Delay and Power Look-Up Table of the Current-Voltage Converter

  Process node                   130 nm            90 nm             65 nm             45 nm              32 nm
  Delay                          0.49 ns           0.53 ns           0.62 ns           0.80 ns            1.07 ns
  Dynamic energy per operation   8.52 x 10^-14 J   8.72 x 10^-14 J   9.00 x 10^-14 J   10.26 x 10^-14 J   12.56 x 10^-14 J
  Leakage power                  1.40 x 10^-8 W    1.87 x 10^-8 W    2.57 x 10^-8 W    4.41 x 10^-8 W     12.54 x 10^-8 W

C. Cell Switching Model

Different NVM technologies have their specific switching mechanisms. Usually, the switching phenomenon involves magnetoresistive, phase-change, thermochemical, and electrochemical effects, and it cannot be estimated by RC analysis. Hence, the cell switching model in NVSim relies largely on the NVM cell definition. The predefined NVM cell switching properties include the SET/RESET pulse duration (i.e., t_SET and t_RESET) and the SET/RESET current (i.e., I_SET and I_RESET) or voltage. NVSim does not model the dynamic behavior during the switching of the cell state; the switching latency (i.e., cell write latency) is directly the pulse duration, and the switching energy (i.e., cell write energy) is estimated using Joule's first law, that is

    Energy_SET   = I_SET^2   x R x t_SET
    Energy_RESET = I_RESET^2 x R x t_RESET    (17)

in which the resistance value R can be the equivalent resistance of the corresponding SET or RESET state (i.e., R_SET or R_RESET). However, for NVM technologies that have a threshold switching phenomenon (e.g., PCRAM and ReRAM), the resistance value R always equals the resistance of the low-resistance state. This is because when a voltage above a particular threshold is applied to these NVM cells in the high-resistance state, the resulting large electric fields greatly increase the electrical conductivity [38].

VI. Miscellaneous Circuitry

Some specialized circuitry is required for certain types of NVMs. For instance, some PCRAM chips need a pulse shaper to reform accurate SET and RESET pulses, and NAND Flash and some PCRAM chips need a charge pump to generate the high-voltage power plane that is necessary for write operations.

A. Pulse Shaper

Some PCRAM chips need specialized circuits to handle their RESET and SET operations. Specific pulse shapes are required to heat up the GST quickly and to cool it down gradually, especially for SET operations. This pulse shaping requirement is met by using a slow quench pulse shaper. As shown in Fig. 23, the slow quench pulse shaper is composed of an arbitrary slow-quench waveform generator and a write driver.

Fig. 23. Circuit schematic of the slow quench pulse shaper used in [10].

In NVSim, the delay impact of the slow quench shaper is neglected because it is already included in the RESET/SET calculation of the timing model. The energy impact of the shaper is modeled by adding an energy efficiency factor to the RESET/SET operation, for which we set the default value to 35% [28]; it can be overridden by advanced users. The area of slow quench shapers is modeled by measuring die photos [10], [28].

B. Charge Pump

The write operations of NAND Flash and some PCRAM chips require voltages higher than the chip supply voltage. Therefore, a charge pump that uses capacitors as energy storage elements to create a higher voltage is necessary in a NAND Flash chip design. In NVSim, we neglect the silicon area occupied by the charge pump, since the charge pump area can vary a lot depending on its underlying circuit design techniques and is relatively small compared to the cell array area in a large-capacity NAND chip. However, we do model the energy dissipated by the charge pumps during the program and erase operations, because it contributes a considerable portion of the total energy consumption. The energy consumed by charge pumps is taken from an actual NAND Flash chip design [39], which specifies that a conventional charge pump consumes 0.25 μJ at a 1.8 V supply voltage. We use this value as the default in NVSim.

VII. Validation Result

NVSim is built on generic assumptions about memory cell layouts, circuit design rules, and CMOS fabrication parameters, whereas the performance, energy, and area of a real nonvolatile memory design depend on the specific choices of all of these. However, as described in the previous sections, we provide a set of knobs in NVSim to adjust design parameters such as the memory organization, wire type, transistor type, data sensing scheme, and others. Therefore, NVSim is capable of emulating a real memory chip, and comparing the NVSim estimation results to the actual memory chip parameters can show the accuracy of NVSim.

Hence, we validate NVSim against NAND Flash chips and several industrial prototype designs of STT-RAM [26], PCRAM [12], [29], and ReRAM [9] in terms of area, latency, and energy. We first extract information from the real chip design specifications to set the input parameters required by NVSim, such as the capacity, line size, technology node, and array organization. Then, we compare the performance, energy, and area estimates generated by NVSim to the actual numbers reported for those chip designs. The validation results are listed in this section. Note that all the simulation results are for nominal cases, since process variations are not supported in the current version of NVSim.

A. NAND Flash Validation

It is challenging to validate the NAND Flash model in NVSim since public manufacturer datasheets do not disclose sufficient data on operation latency and power consumption for validation purposes. Instead, Grupp et al. [40] reported both latency and power consumption measurements of several commercial NAND Flash chips from different vendors. Grupp's report does not include the NAND Flash silicon area, hence we set the actual NAND Flash chip area by assuming an area efficiency of 90%. The comparison between the measurements [40] and the estimations given by NVSim is listed in Table III. The estimation error is within 20%.

TABLE III
NVSim's NAND Flash Model Validation With Respect to a 50 nm 2 Gb NAND Flash Chip (B-SLC2) [40]

  Metric           Actual       Projected    Error
  Area             23.85 mm^2   22.61 mm^2   -5.20%
  Read latency     21 μs        25.2 μs      +20.0%
  Program latency  200 μs       200.1 μs     +0.1%
  Erase latency    1.25 ms      1.25 ms      +0.0%
  Read energy      1.56 μJ      1.85 μJ      +18.6%
  Program energy   3.92 μJ      4.24 μJ      +8.2%
  Erase energy     34.5 μJ      36.0 μJ      +4.3%

B. STT-RAM Validation

We validate the STT-RAM model against a 65 nm prototype chip [26]. We let 1 bank = 32 x 8 mats, and 1 mat = 1 subarray,

TABLE IV
NVSim's STT-RAM Model Validation With Respect to a 65 nm 64 Mb STT-RAM Prototype Chip [26]

  Metric         Actual       Projected    Error
  Area           39.1 mm^2    38.05 mm^2   -2.69%
  Read latency   11 ns        11.47 ns     +4.27%
  Write latency  <30 ns       27.50 ns     -
  Write energy   N/A          0.26 nJ      -

TABLE V
NVSim's PCRAM Model Validation With Respect to a 0.12 μm 64 Mb MOS-Accessed PCRAM Prototype Chip [12]

  Metric         Actual       Projected    Error
  Area           64 mm^2      57.44 mm^2   -10.25%
  Read latency   70.0 ns      65.93 ns     -5.81%
  Write latency  >180.0 ns    180.17 ns    -
  Write energy   N/A          6.31 nJ      -

TABLE VI
NVSim's PCRAM Model Validation With Respect to a 90 nm 512 Mb Diode-Selected PCRAM Prototype Chip [10]

  Metric         Actual       Projected    Error
  Area           91.50 mm^2   93.04 mm^2   +1.68%
  Read latency   78 ns        59.76 ns     -23.40%
  Write latency  430 ns       438.55 ns    +1.99%
  Write energy   54 nJ        47.22 nJ     -12.56%

TABLE VII
NVSim's ReRAM Model Validation With Respect to a 0.18 μm 4 Mb MOSFET-Selected ReRAM Prototype Chip [9]

  Metric         Actual          Projected    Error
  Area (5)       187.69 mm^2     33.42 mm^2   -
  Read latency   7.2 ns          7.72 ns      +7.22%
  Write latency  0.3 ns-7.2 ns   6.56 ns      -
  Write energy   N/A             0.46 nJ      -

(5) A large portion of the chip area is contributed by the MLC control and test circuits, which are not modeled in NVSim.

TABLE VIII
Using PCRAM as a Direct Replacement of NAND

  A typical 90 nm 512 Mb NAND (source: K9F1208X0C datasheet)
    Access unit    Page
    Read latency   15 μs
    Write latency  200 μs
    Erase latency  2 ms
  A 90 nm 512 Mb PCRAM (source: [10]; see Table VI for more details)
    Access unit    Byte
    Read latency   78 ns (59.76 ns, NVSim estimation)
    Write latency  430 ns (438.55 ns, NVSim estimation)
  A typical 90 nm 512 Mb DRAM (source: K4T51043Q datasheet)
    Access unit    Byte
    tRCD           15 ns
    tRP            15 ns

TABLE IX
New PCRAM Parameters After NVSim Latency Optimization

  Parameter       Before Optimization   After Optimization
  Subarray size   1024 × 1024           512 × 32
  Area            93.04 mm²             102.34 mm²
  Read latency    59.76 ns              16.23 ns
  Write latency   438.55 ns             416.23 ns

TABLE X
Projection of a Future ReRAM Technology

  Cell size:                         20F² (MOS-accessed), 4F² (cross-point)
  Maximum NMOS driver size:          100 F
  RESET voltage and pulse duration:  2.0 V, 100 ns
  SET voltage and pulse duration:    −2.0 V, 100 ns
  READ input:                        0.4 V voltage source, or 2 μA current source
  LRS resistance:                    10 kΩ
  HRS resistance:                    500 kΩ
  Half-select resistance:            100 kΩ (cross-point only)
B. STT-RAM Validation

We validate the STT-RAM model against a 65 nm prototype chip [26]. We let 1 bank = 32×8 mats and 1 mat = 1 subarray to simulate the memory array organization, and we exclude the chip area of the I/O pads and duplicated cells to make a fair comparison. As the write latency is not disclosed, we assume a write pulse duration of 20 ns. The validation result is listed in Table IV, which shows that the area and latency estimation errors are within 3% and 5%, respectively.
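As a quick back-of-the-envelope check on this organization (our own arithmetic; the prototype's actual subarray dimensions are not stated in the text, so the square-subarray assumption below is ours):

```cpp
// Sanity check of the modeled organization: 1 bank of 32x8 mats with one
// subarray per mat for a 64 Mb prototype [26]. Purely illustrative.
#include <cmath>
#include <cstdio>

int main() {
    const double capacityBits = 64.0 * 1024 * 1024;      // 64 Mb chip
    const int mats = 32 * 8;                              // mats per bank
    const double bitsPerSubarray = capacityBits / mats;   // 1 subarray per mat
    const double side = std::sqrt(bitsPerSubarray);       // if square
    std::printf("Bits per subarray: %.0f (%.0f x %.0f if square)\n",
                bitsPerSubarray, side, side);
    return 0;
}
```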
C. PCRAM Validation

We first validate the PCRAM model against a 0.12 μm MOS-accessed prototype. The array organization is configured to have two banks, each of which has 8×8 mats, and every mat contains only one subarray. Table V lists the validation result, which shows a 10% underestimation of area and a 6% underestimation of read latency. The projected write latency (SET latency as the worst case) is also consistent with the actual value. Another PCRAM validation is made against a 90 nm diode-accessed prototype [10].
D. ReRAM Validation

We validate the ReRAM model against a 180 nm 4 Mb HfO2-based MOS-accessed ReRAM prototype [9]. According to the disclosed data, the subarray size is configured to 128 kb, and we further model a bank with 4 × 8 mats in which each mat contains a single subarray. The validation result is listed in Table VII. Note that the chip area estimated by NVSim is much smaller than the actual value because the prototype chip has SLC/MLC dual modes and a large portion of its chip area is contributed by the MLC control and test circuits, which are not modeled in the current version of NVSim.
E. Comparison to CACTI

We also test the closeness between NVSim and CACTI by simulating identical SRAM caches and DRAM chips. The results show that NVSim models SRAM and DRAM more accurately than CACTI does, since some false assumptions made in CACTI are corrected in NVSim.
VIII. Case Studies by Using NVSim

In this section, we conduct two case studies to demonstrate how NVSim can be used in two ways: 1) to optimize an NVM design toward a certain design metric; and 2) to estimate the performance, energy, and area before fabricating a real prototype chip, especially when the emerging NVM device technology is still under development and no standard exists so far.
A. Use NVSim for Design Optimization

NAND Flash is currently the widely used firmware storage or disk in embedded systems. However, code stored in NAND Flash must be copied into randomly accessible memory such as DRAM before execution, because the page-accessible structure of NAND Flash causes poor random-access performance. If emerging NVM technologies such as STT-RAM, PCRAM, and ReRAM can be adopted in such systems, their byte-accessibility can eliminate the need for DRAM modules. The issue with directly adopting emerging NVM technologies as a NAND Flash substitute, however, is that current prototypes have much slower read/write latencies than DRAM. In this case study, we use PCRAM as an example without loss of generality. The technology node used in this case study is 90 nm. Table VIII shows the latency difference among a NAND chip, a DRAM chip, and a PCRAM prototype chip with the same 512 Mb capacity.

The comparison shows that the PCRAM prototype chip is much slower than its DRAM counterpart. To overcome this obstacle, it is necessary to optimize PCRAM chips for latency at the expense of area efficiency by aggressively cutting wordlines and bitlines or by inserting repeaters. Such an area/performance tradeoff is also available for DRAM designs; however, in this case study we keep the DRAM chip parameters unchanged, since the current DRAM specification is already the sweet spot explored by the DRAM industry for many years. For PCRAM, in contrast, such performance optimization is necessary.

Table IX shows the comparison before and after NVSim optimization. The result shows that the PCRAM read latency can be reduced from 59.76 ns to 16.23 ns simply by cutting subarrays into a smaller size (from 1024×1024 to 512×32), since shorter wordlines and bitlines present far less RC delay on the access path (see the first-order sketch at the end of this subsection). Although the PCRAM write latency does not decrease much, owing to the inherent SET/RESET pulse duration, write latency is typically not on the critical path and can be tolerated using write buffers. As a result, the optimized PCRAM chip projected by NVSim can properly replace the traditional NAND+DRAM solution in the embedded system. The latency optimization comes at the expense of increased chip area, which rises from 93.04 mm² to 102.34 mm².
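The following back-of-the-envelope sketch (not NVSim's internal timing model; the per-cell resistance and capacitance values are hypothetical placeholders) illustrates why shortening a subarray's wordlines and bitlines has such a large effect: the distributed RC delay of a line grows roughly quadratically with the number of cells on it.

```cpp
// First-order Elmore-delay estimate for a uniform RC ladder of n cells.
// Illustrative only; per-cell resistance/capacitance values are made up.
#include <cstdio>

double elmoreDelayNs(int cellsOnLine, double rPerCellOhm, double cPerCellFF) {
    // Elmore delay at the far end of an n-stage RC ladder:
    //   sum_{k=1..n} (k * r) * c = r * c * n * (n + 1) / 2
    // Unit conversion: 1 ohm * 1 fF = 1e-15 s = 1e-6 ns.
    const double n = static_cast<double>(cellsOnLine);
    return rPerCellOhm * cPerCellFF * n * (n + 1.0) / 2.0 * 1e-6;
}

int main() {
    // Shrinking a 1024-cell line down to 32 cells cuts this term by ~1000x.
    std::printf("1024 cells: %.4f ns\n", elmoreDelayNs(1024, 2.0, 0.05));
    std::printf("  32 cells: %.6f ns\n", elmoreDelayNs(32, 2.0, 0.05));
    return 0;
}
```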
B. Use NVSim for Early Stage Estimation

Considering that the research on some emerging NVM technologies (e.g., ReRAM) is still at an early stage and that only a limited number of NVM prototype chips are available for high-level computer architects to understand the technologies, we expect NVSim to be helpful in providing performance, energy, and area estimations at an early design stage. In this case study, we demonstrate how NVSim can predict the full design spectrum of a projected ReRAM technology when such a device is fabricated as an 8 MB memory chip. Table X lists the projection, and Table XI tabulates the full design spectrum of this 32 nm 8 MB ReRAM chip by listing the details of each design corner.
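For reference, the projection in Table X can be captured as a small parameter set; the struct below is our own illustration (the field names are hypothetical and do not reflect NVSim's cell-file syntax), shown for the cross-point variant.

```cpp
// Illustrative encoding of the Table X device projection; not NVSim syntax.
struct ReRamCellProjection {
    double cellAreaF2;           // 4 F^2 for cross-point, 20 F^2 for MOS-accessed
    double maxNmosDriverWidthF;  // 100 F driver-width budget
    double resetVoltageV;        //  2.0 V, 100 ns pulse
    double setVoltageV;          // -2.0 V, 100 ns pulse
    double pulseWidthNs;         // 100 ns for both SET and RESET
    double readVoltageV;         // 0.4 V voltage-mode READ input
    double readCurrentUa;        // or 2 uA current-mode READ input
    double lrsResistanceOhm;     // 10 kOhm low-resistance state
    double hrsResistanceOhm;     // 500 kOhm high-resistance state
    double halfSelectResOhm;     // 100 kOhm (cross-point arrays only)
};

// Cross-point variant of the projection.
static const ReRamCellProjection kCrossPointReRam = {
    4.0, 100.0, 2.0, -2.0, 100.0, 0.4, 2.0, 10e3, 500e3, 100e3
};
```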
TABLE XI
Predicted Full Design Spectrum of a 32 nm 8 MB ReRAM Chip

                       Area Opt.     Read Latency   Write Latency  Read Energy    Write Energy    Leakage Opt.
                                     Opt.           Opt.           Opt.           Opt.
  Area (mm²)           0.664         5.508          8.071          2.971          3.133           1.399
  Read latency (ns)    107.1         1.773          1.917          5.711          6.182           426.8
  Write latency (ns)   204.3         200.7          100.6          202.8          203.1           518.2
  Read energy (nJ)     1.884         0.195          0.234          0.012          0.014           4.624
  Write energy (nJ)    13.72         25.81          13.06          12.82          12.81           12.99
  Leakage (mW)         1372          3872           7081           6819           7841            26.64
  Array structure      Cross-point   Cross-point    MOS-accessed   Cross-point    Cross-point     MOS-accessed
  Subarray size        512 × 512     128 × 128      1024 × 2048    512 × 512      256 × 256       2048 × 4096
  Inter-array routing  Non-H-tree    H-tree         H-tree         H-tree         H-tree          Non-H-tree
  Sense amp placement  External      Internal       Internal       Internal       Internal        External
  Sense amp type       Current-in    Current        Current        Voltage-       Voltage-        Voltage-
                       voltage                                     divider        divider         divider
  Write method         SET-before-   Erase-before-  Normal         SET-before-    SET-before-     Normal
                       RESET         RESET                         RESET          RESET
  Interconnect wire    Normal        Repeated       Repeated       Low-swing      Low-swing       Normal
  Output buffer type   Area opt.     Latency opt.   Latency opt.   Area opt.      Area opt.       Area opt.
As shown in the result, NVSim can optimize the same design toward different optimization targets by exploring the full design space, which means that NVSim automatically tunes all the design knobs, such as the array structure, subarray size, sense amplifier design, write method, repeater design, and buffer design. If necessary, NVSim can also explore different types of transistor or wire models to obtain the best result.
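Conceptually, such a full design-space exploration is an exhaustive sweep over the knobs listed above; the sketch below is our own simplified illustration of that idea (the types, knob lists, and cost function are hypothetical and far smaller than the real search space), not NVSim's actual implementation.

```cpp
// Simplified illustration of a design-space sweep over a few knobs,
// keeping the best candidate for one metric (read latency). Hypothetical.
#include <initializer_list>
#include <limits>

enum class ArrayStructure { MosAccessed, CrossPoint };
enum class Routing { HTree, NonHTree };

struct DesignPoint {
    ArrayStructure structure;
    Routing routing;
    int rows, cols;          // subarray dimensions
    double readLatencyNs;
};

// Stand-in for the analytical latency model: an arbitrary toy cost function.
static double evaluateReadLatency(const DesignPoint& dp) {
    const double s = (dp.structure == ArrayStructure::CrossPoint) ? 1.0 : 1.2;
    const double r = (dp.routing == Routing::HTree) ? 1.0 : 1.3;
    return s * r * (dp.rows * 0.004 + dp.cols * 0.002);
}

DesignPoint findBestReadLatency() {
    DesignPoint best{};
    double bestLatency = std::numeric_limits<double>::max();
    for (ArrayStructure s : {ArrayStructure::MosAccessed, ArrayStructure::CrossPoint})
        for (Routing r : {Routing::HTree, Routing::NonHTree})
            for (int rows = 128; rows <= 4096; rows *= 2)
                for (int cols = 128; cols <= 4096; cols *= 2) {
                    DesignPoint dp{s, r, rows, cols, 0.0};
                    dp.readLatencyNs = evaluateReadLatency(dp);
                    if (dp.readLatencyNs < bestLatency) {
                        bestLatency = dp.readLatencyNs;
                        best = dp;
                    }
                }
    return best;
}
```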
IX. Related Work

Many modeling tools have been developed during the last decade to enable system-level design exploration for SRAM- or DRAM-based caches and memories. For example, CACTI [1], [2] is a tool that has been widely used in the computer architecture community to estimate the performance, energy, and area of SRAM and DRAM caches. Evans and Franzon [41] developed an energy model for SRAMs and used it to predict an optimum organization for caches. eCACTI [42] incorporated a leakage power model into CACTI. Muralimanohar et al. [43] modeled large-capacity caches through the use of an interconnect-centric organization composed of mats and request/reply H-tree networks.

In addition, CACTI has also been extended to evaluate the performance, energy, and area of STT-RAM [44], PCRAM [45], [46], cross-point ReRAM [19], and NAND Flash [47]. However, as CACTI was originally designed to model an SRAM-based cache, some of its fundamental assumptions do not match actual NVM circuit implementations, and thereby the NVM array organization modeled in these CACTI-like estimation tools deviates from the NVM chips that have been fabricated.
X. Conclusion

STT-RAM, PCRAM, and ReRAM are emerging memory technologies for future nonvolatile memories. The versatility of these upcoming NVM technologies makes it possible to use NVM modules at various levels of the memory hierarchy, such as execute-in-place memory, main memory, or even on-chip cache. Such emerging NVM design options can vary for different applications by tuning circuit structure parameters, such as the array organization and the peripheral circuitry types, or by using devices and interconnects with different properties. To enable the system-level design space exploration of these NVM technologies and to help computer architects leverage these emerging technologies, it is necessary to have a quick estimation tool. While abundant estimation tools are available as SRAM/DRAM design assistants, similar tools for NVM designs are currently missing. Therefore, in this paper we build NVSim, a circuit-level model for NVM performance, energy, and area estimation, which supports various NVM technologies including STT-RAM, PCRAM, ReRAM, and conventional NAND Flash. This model is successfully validated against industrial NVM prototypes, and this new NVSim tool is expected to help boost NVM-related studies such as the next-generation memory hierarchy.

References

[1] S. J. E. Wilton and N. P. Jouppi, "CACTI: An enhanced cache access and cycle time model," IEEE J. Solid-State Circuits, vol. 31, no. 5, pp. 677–688, May 1996.
[2] S. Thoziyoor, et al., "CACTI 5.1 technical report," HP Labs, Palo Alto, CA, Tech. Rep. HPL-2008-20, 2008.
[3] S. Raoux, et al., "Phase-change random access memory: A scalable technology," IBM J. Res. Development, vol. 52, nos. 4–5, pp. 465–479, Jul. 2008.
[4] J. J. Yang, et al., "Memristive switching mechanism for metal/oxide/metal nanodevices," Nature Nanotechnol., vol. 3, no. 7, pp. 429–433, 2008.
[5] Z. Wei, et al., "Highly reliable TaOx ReRAM and direct evidence of redox reaction mechanism," in Proc. IEDM, 2008, pp. 293–296.
[6] Y. S. Chen, et al., "Highly scalable hafnium oxide memory with improvements of resistive distribution and read disturb immunity," in Proc. IEDM, 2009, pp. 105–108.
[7] M.-J. Lee, et al., "2-stack 1D-1R cross-point structure with oxide diodes as switch elements for high density resistance RAM applications," in Proc. IEEE IEDM, Dec. 2007, pp. 771–774.
[8] W. C. Chien, et al., "Multi-level operation of fully CMOS compatible WOx resistive random access memory (RRAM)," in Proc. Int. Memory Workshop, 2009, pp. 228–229.
[9] S.-S. Sheu, et al., "A 4 Mb embedded SLC resistive-RAM macro with 7.2 ns read-write random-access time and 160 ns MLC-access capability," in Proc. IEEE Int. Solid-State Circuits Conf., Feb. 2011, pp. 200–202.
[10] K.-J. Lee, et al., "A 90 nm 1.8 V 512 Mb diode-switch PRAM with 266 MB/s read throughput," IEEE J. Solid-State Circuits, vol. 43, no. 1, pp. 150–162, Jan. 2008.
[11] F. Pellizzer, et al., "Novel μTrench phase-change memory cell for embedded and stand-alone nonvolatile memory applications," in Proc. Int. Symp. VLSI Technol., 2004, pp. 18–19.
[12] S. J. Ahn, et al., "Highly manufacturable high density phase change memory of 64 Mb and beyond," in Proc. IEDM, 2004, pp. 907–910.
[13] K.-H. Kim, et al., "Nanoscale resistive memory with intrinsic diode characteristics and long endurance," Appl. Phys. Lett., vol. 96, no. 5, pp. 053106-1–053106-3, 2010.
[14] H. Y. Lee, et al., "Evidence and solution of over-RESET problem for HfOx based resistive memory with sub-ns switching speed and high endurance," in Proc. IEDM, 2010, pp. 19.7.1–19.7.4.
[15] International Technology Roadmap for Semiconductors. (2010). Process Integration, Devices, and Structures Update [Online]. Available: http://www.itrs.net
[16] C. W. Smullen, et al., "Relaxing non-volatility for fast and energy-efficient STT-RAM caches," in Proc. Int. Symp. High Performance Comput. Architecture, Feb. 2011, pp. 50–61.
[17] D.-C. Kau, et al., "A stackable cross point phase change memory," in Proc. IEEE IEDM, Dec. 2009, pp. 27.1.1–27.1.4.
[18] Y.-C. Chen, et al., "An access-transistor-free (0T/1R) non-volatile resistance random access memory (RRAM) using a novel threshold switching, self-rectifying chalcogenide device," in Proc. IEDM, 2003, pp. 750–753.
[19] C. Xu, et al., "Design implications of memristor-based RRAM cross-point structures," in Proc. Des. Autom. Test Eur., 2011, pp. 1–6.
[20] S. Thoziyoor, et al., "A comprehensive memory modeling tool and its application to the design and analysis of future memory hierarchies," in Proc. Int. Symp. Comput. Architecture, 2008, pp. 51–62.
[21] International Technology Roadmap for Semiconductors. The Model for Assessment of CMOS Technologies and Roadmaps (MASTAR) [Online]. Available: http://www.itrs.net/models.html
[22] A. Udipi, et al., "Rethinking DRAM design and organization for energy-constrained multi-cores," ACM SIGARCH Comput. Architecture News, vol. 38, no. 3, pp. 175–186, 2010.
[23] J. Liang and H. S. P. Wong, "Cross-point memory array without cell selectors: Device characteristics and data storage pattern dependencies," IEEE Trans. Electron Devices, vol. 57, no. 10, pp. 2531–2538, Oct. 2010.
[24] M. Hosomi, et al., "A novel nonvolatile memory with spin torque transfer magnetization switching: Spin-RAM," in Proc. IEDM, 2005, pp. 459–462.
[25] T. Kawahara, et al., "2 Mb spin-transfer torque RAM (SPRAM) with bit-by-bit bidirectional current write and parallelizing-direction current read," in Proc. IEEE Int. Solid-State Circuits Conf., Feb. 2007, pp. 480–617.
[26] K. Tsuchida, et al., "A 64 Mb MRAM with clamped-reference and adequate-reference schemes," in Proc. Int. Solid-State Circuits Conf., 2010, pp. 268–269.
[27] H.-R. Oh, et al., "Enhanced write performance of a 64-Mb phase-change random access memory," IEEE J. Solid-State Circuits, vol. 41, no. 1, pp. 122–126, Jan. 2006.
[28] S. Hanzawa, et al., "A 512 kB embedded phase change memory with 416 kB/s write throughput at 100 μA cell write current," in Proc. Int. Solid-State Circuits Conf., 2007, pp. 474–616.
[29] S. Kang, et al., "A 0.1 μm 1.8 V 256 Mb phase-change random access memory (PRAM) with 66 MHz synchronous burst-read operation," IEEE J. Solid-State Circuits, vol. 42, no. 1, pp. 210–218, Jan. 2007.
[30] F. Fishburn, et al., "A 78 nm 6F² DRAM technology for multigigabit densities," in Proc. Symp. VLSI Technol., 2004, pp. 28–29.
[31] J. H. Oh, et al., "Full integration of highly manufacturable 512 Mb PRAM based on 90 nm technology," in Proc. IEDM, 2006, pp. 49–52.
[32] Y. Zhang, et al., "An integrated phase change memory cell with Ge nanowire diode for cross-point memory," in Proc. IEEE Symp. VLSI Technol., Jun. 2007, pp. 98–99.
[33] Y. Sasago, et al., "Cross-point phase change memory with 4F² cell size driven by low-contact-resistivity poly-Si diode," in Proc. Symp. VLSI Technol., 2009, pp. 24–25.
[34] I. E. Sutherland, R. F. Sproull, and D. Harris, Logical Effort: Designing Fast CMOS Circuits. San Francisco, CA: Morgan Kaufmann, 1999.
[35] M. A. Horowitz, "Timing models for MOS circuits," Stanford University, Stanford, CA, Tech. Rep. SEL-TR-83-003, 1983.
[36] E. Seevinck, P. J. van Beers, and H. Ontrop, "Current-mode techniques for high-speed VLSI circuits with application to current sense amplifier for CMOS SRAM's," IEEE J. Solid-State Circuits, vol. 26, no. 4, pp. 525–536, Apr. 1991.
[37] Y. Moon, et al., "1.2 V 1.6 Gb/s 56 nm 6F² 4 Gb DDR3 SDRAM with hybrid-I/O sense amplifier and segmented sub-array architecture," in Proc. Int. Solid-State Circuits Conf., 2009, pp. 128–129.
[38] G. W. Burr, et al., "Phase change memory technology," J. Vac. Sci. Technol. B, vol. 28, no. 2, pp. 223–262, 2010.
[39] K. Ishida, et al., "A 1.8 V 30 nJ adaptive program-voltage (20 V) generator for 3D-integrated NAND Flash SSD," in Proc. IEEE Int. Solid-State Circuits Conf., Feb. 2009, pp. 238–239, 239a.
[40] L. M. Grupp, et al., "Characterizing Flash memory: Anomalies, observations, and applications," in Proc. Int. Symp. Microarchitecture, 2009, pp. 24–33.
[41] R. J. Evans and P. D. Franzon, "Energy consumption modeling and optimization for SRAMs," IEEE J. Solid-State Circuits, vol. 30, no. 5, pp. 571–579, May 1995.
[42] M. Mamidipaka and N. Dutt, "eCACTI: An enhanced power estimation model for on-chip caches," Center Embedded Comput. Syst., Univ. California, Irvine, Tech. Rep. TR04-28, 2004.
[43] N. Muralimanohar, R. Balasubramonian, and N. P. Jouppi, "Architecting efficient interconnects for large caches with CACTI 6.0," IEEE Micro, vol. 28, no. 1, pp. 69–79, Jan.–Feb. 2008.
[44] X. Dong, et al., "Circuit and microarchitecture evaluation of 3-D stacking magnetic RAM (MRAM) as a universal memory replacement," in Proc. Des. Autom. Conf., 2008, pp. 554–559.
[45] P. Mangalagiri, et al., "A low-power phase change memory based hybrid cache architecture," in Proc. Great Lakes Symp. VLSI, 2008, pp. 395–398.
[46] X. Dong, N. P. Jouppi, and Y. Xie, "PCRAMsim: System-level performance, energy, and area modeling for phase-change RAM," in Proc. Int. Conf. Comput.-Aided Des., 2009, pp. 269–275.
[47] V. Mohan, S. Gurumurthi, and M. R. Stan, "FlashPower: A detailed power model for NAND Flash memory," in Proc. Des. Autom. Test Eur., 2010, pp. 502–507.

Xiangyu Dong (S'09–M'12) received the B.S. degree in electronic engineering from Shanghai Jiao Tong University, Shanghai, China, in 2006, and the Ph.D. degree in computer science and engineering from Pennsylvania State University, University Park, in 2011. He is currently a Senior Engineer with Qualcomm, Inc., San Diego, CA. His current research interests include computer architectures, emerging nonvolatile memory, and 3-D integration technology.

Cong Xu (S'09) received the B.S. degree from Peking University, Beijing, China, and joined Pennsylvania State University, University Park, in 2009, where he is currently pursuing the Ph.D. degree with the Department of Computer Science and Engineering. His current research interests include nonvolatile memory system design based on emerging memory technologies, low power very large scale integration design, and computer architectures.

Yuan Xie (SM'07) received the B.S. degree in electronic engineering from Tsinghua University, Beijing, China, in 1997, and the M.S. and Ph.D. degrees in electrical engineering from Princeton University, Princeton, NJ, in 1999 and 2002, respectively. He is currently an Associate Professor with the Department of Computer Science and Engineering, Pennsylvania State University, University Park. Before joining Pennsylvania State University in 2003, he was with the IBM Microelectronic Division, Worldwide Design Center, Essex Junction, VT. His current research interests include very large scale integration design, computer architectures, embedded systems design, and electronic design automation. Dr. Xie is a Senior Member of ACM. He received the SRC Inventor Recognition Award in 2002 and the U.S. National Science Foundation Faculty Early Career Development Award in 2006.

Norman P. Jouppi (F'03) received the Ph.D. degree in electrical engineering from Stanford University, Stanford, CA. He is a Fellow and the Director of the Intelligent Infrastructure Laboratory, Hewlett-Packard Labs, Palo Alto, CA. His current research interests include computer memory systems, networking for cluster computing, blade system architectures, graphics accelerators, video, audio, and physical telepresence. Dr. Jouppi is a Fellow of ACM.