Hardware Memory Models
A hardware memory model defines the guarantees that assembly-language programmers get from a particular processor; the compiler is not involved. Twenty-five years ago, people started trying to write memory models defining what a high-level programming language like Java or C++ guarantees to programmers writing code in that language. Including the compiler in the model makes the job of defining a reasonable model much more complicated.
This is the first of a pair of posts about hardware memory models and
programming language memory models, respectively. My goal in writing these
posts is to build up background for discussing potential changes we might want
to make in Go’s memory model. But to understand where Go is and where we
might want to head, first we have to understand where other hardware memory
models and language memory models are today and the precarious paths they
took to get there.
Again, this post is about hardware. Let’s assume we are writing assembly language for a multiprocessor computer. What guarantees do programmers need from the computer hardware in order to write correct programs? Computer scientists have been searching for good answers to this question for over forty years.
Sequential Consistency
Leslie Lamport’s 1979 paper “How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs” introduced the concept of sequential consistency:
The customary approach to designing and proving the correctness of multiprocess algorithms for such a computer assumes that the following condition is satisfied: the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program. A multiprocessor satisfying this condition will be called sequentially consistent.
Today we talk about not just computer hardware but also programming languages guaranteeing sequential consistency, when the only possible executions of a program correspond to some kind of interleaving of thread operations into a sequential execution. Sequential consistency is usually considered the ideal model, the one most natural for programmers to work with. It lets you assume programs execute in the order they appear on the page, and the executions of individual threads are simply interleaved in some order but not otherwise rearranged.
One might reasonably question whether sequential consistency should be the ideal model, but that’s beyond the scope of this post. I will note only that considering all possible thread interleavings remains, today as in 1979, “the customary approach to designing and proving the correctness of multiprocess algorithms.” In the intervening four decades, nothing has replaced it.
Earlier I asked whether this program can print 0:
// Thread 1 // Thread 2
x = 1; while(done == 0) { /* loop */ }
done = 1; print(x);
To make the program a bit easier to analyze, let’s remove the loop and the print
and ask about the possible results from reading the shared variables:
Litmus Test: Message Passing
Can this program see r1 = 1, r2 = 0?
// Thread 1 // Thread 2
x = 1 r1 = y
y = 1 r2 = x
We assume every example starts with all shared variables set to zero. Because we’re trying to establish what hardware is allowed to do, we assume that each thread is executing on its own dedicated processor and that there’s no compiler to reorder what happens in the thread: the instructions in the listings are the instructions the processor executes. The name rN denotes a thread-local register, not a shared variable, and we ask whether a particular setting of thread-local registers is possible at the end of an execution.
This kind of question about execution results for a sample program is called a litmus test. Because it has a binary answer (is this outcome possible or not?), a litmus test gives us a clear way to distinguish memory models: if one model allows a particular execution and another does not, the two models are clearly different. Unfortunately, as we will see later, the answer a particular model gives to a particular litmus test is often surprising.
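(An aside: it is also possible to hammer on a litmus test empirically. Here is a quick Go sketch of mine, not part of the formal analysis: it runs the message passing test over and over and tallies the outcomes. The program is deliberately racy, so it can only ever show that an outcome is possible, never that one is impossible, and the compiler is free to rearrange the racy accesses.)

// litmus.go: empirically hammer on the message passing litmus test.
// A racy sketch for intuition only; it cannot prove an outcome impossible.
package main

import (
	"fmt"
	"sync"
)

func main() {
	counts := map[[2]int]int{}
	for i := 0; i < 100000; i++ {
		var x, y, r1, r2 int
		var wg sync.WaitGroup
		wg.Add(2)
		go func() { x = 1; y = 1; wg.Done() }()   // Thread 1
		go func() { r1 = y; r2 = x; wg.Done() }() // Thread 2
		wg.Wait()
		counts[[2]int{r1, r2}]++
	}
	fmt.Println(counts) // tally of each (r1, r2) result observed
}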
If the execution of this litmus test is sequentially consistent, there are only six possible interleavings:

x = 1         x = 1         x = 1         r1 = y (0)    r1 = y (0)    r1 = y (0)
y = 1         r1 = y (0)    r1 = y (0)    x = 1         x = 1         r2 = x (0)
r1 = y (1)    y = 1         r2 = x (1)    y = 1         r2 = x (1)    x = 1
r2 = x (1)    r2 = x (1)    y = 1         r2 = x (1)    y = 1         y = 1
Since no interleaving ends with r1 = 1, r2 = 0, that result is impossible on sequentially consistent hardware. A good mental model for a sequentially consistent machine is all the processors connected directly to a single shared memory, which serves one read or write at a time:

[Diagram: processors, each issuing reads (R) and writes (W), all connected directly to a single shared memory.]
(The three memory model hardware diagrams in this post are adapted from
Maranget et al., “A Tutorial Introduction to the ARM and POWER Relaxed
Memory Models.”)
This diagram is a model for a sequentially consistent machine, not the only way to build one. Indeed, it is possible to build a sequentially consistent machine using multiple shared memory modules and caches to help predict the result of memory fetches, but being sequentially consistent means that machine must behave indistinguishably from this model.

x86 Total Store Order (x86-TSO)

The memory model for modern x86 systems corresponds to a different diagram:
[Diagram: each processor feeds its writes (W) through a local write queue into a single shared memory; each processor’s reads (R) consult its own write queue before the shared memory.]
All the processors are still connected to a single shared memory, but each processor queues writes to that memory in a local write queue. The processor continues executing new instructions while the writes make their way out to the shared memory. A memory read on one processor consults the local write queue before consulting main memory, but it cannot see the write queues on other processors. The effect is that a processor sees its own writes before others do. But (and this is very important) all processors do agree on the (total) order in which writes (stores) reach the shared memory, giving the model its name: total store order, or TSO. At the moment that a write reaches shared memory, any future read on any processor will see it and use that value (until it is overwritten by a later write, or perhaps by a buffered write from another processor).
The write queue is a standard first-in, first-out queue: the memory writes are applied to the shared memory in the same order that they were executed by the processor. Because the write order is preserved by the write queue, and because other processors see the writes to shared memory immediately, the message passing litmus test we considered earlier has the same outcome as before: r1 = 1, r2 = 0 remains impossible.
Litmus Test: Message Passing
Can this program see r1 = 1, r2 = 0?
// Thread 1 // Thread 2
x = 1 r1 = y
y = 1 r2 = x
However, the write queue does let both threads read before either write reaches shared memory, allowing a result impossible on sequentially consistent hardware:

Litmus Test: Write Queue (also called Store Buffer)
Can this program see r1 = 0, r2 = 0?

// Thread 1 // Thread 2
x = 1 y = 1
r1 = y r2 = x

On a TSO machine, both writes can sit in their local write queues while both reads consult main memory, so r1 = 0, r2 = 0 really can happen.
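To make the write queue mechanics concrete, here is a toy model in Go. It is a sketch of my own for illustration (the cpu and write types are invented, not any real simulator), walking one legal TSO schedule that produces r1 = 0, r2 = 0:

// tso.go: a toy TSO machine with per-processor FIFO write queues.
package main

import "fmt"

type write struct{ addr, val int }

type cpu struct{ buf []write } // local FIFO write queue

func (c *cpu) store(addr, val int) { c.buf = append(c.buf, write{addr, val}) }

func (c *cpu) load(mem map[int]int, addr int) int {
	// A read consults the local write queue first (newest match wins)...
	for i := len(c.buf) - 1; i >= 0; i-- {
		if c.buf[i].addr == addr {
			return c.buf[i].val
		}
	}
	return mem[addr] // ...and falls back to shared memory.
}

func (c *cpu) flushOne(mem map[int]int) {
	// The oldest buffered write drains first, so each processor's
	// stores reach shared memory in program order.
	if len(c.buf) > 0 {
		mem[c.buf[0].addr] = c.buf[0].val
		c.buf = c.buf[1:]
	}
}

func main() {
	const x, y = 0, 1 // addresses of the shared variables
	mem := map[int]int{x: 0, y: 0}
	var c1, c2 cpu

	// One legal schedule for the write queue litmus test:
	c1.store(x, 1)        // Thread 1: x = 1 (buffered, not yet visible)
	c2.store(y, 1)        // Thread 2: y = 1 (buffered, not yet visible)
	r1 := c1.load(mem, y) // Thread 1: r1 = y reads 0
	r2 := c2.load(mem, x) // Thread 2: r2 = x reads 0
	c1.flushOne(mem)      // only now do the writes drain
	c2.flushOne(mem)
	fmt.Printf("r1 = %d, r2 = %d\n", r1, r2) // r1 = 0, r2 = 0
}

Because the queue is FIFO, this toy machine still applies each processor’s stores to memory in program order; that is the “total store order” the next paragraph relies on.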
Weakening sequential consistency this way raises a new question: do all processors at least agree on the order in which writes reach shared memory? The IRIW litmus test asks exactly that:

Litmus Test: Independent Reads of Independent Writes (IRIW)
Can this program see r1 = 1, r2 = 0, r3 = 1, r4 = 0?
(Can Threads 3 and 4 see x and y change in different orders?)

// Thread 1   // Thread 2   // Thread 3   // Thread 4
x = 1         y = 1         r1 = x        r3 = y
                            r2 = y        r4 = x

If Thread 3 sees x change before y, can Thread 4 see y change before x? For x86 and other TSO machines, the answer is no: there is a total order over all stores (writes) to main memory, and all processors agree on that order, subject to the wrinkle that each processor knows about its own writes before they reach main memory.
Linux kernel developers found themselves in November 1999 in similar confusion over the guarantees provided by Intel processors. In response to more and more people running into these difficulties over the decade that followed, a group of architects at Intel took on the task of writing down useful guarantees about processor behavior, for both current and future processors. The first result was the “Intel 64 Architecture Memory Ordering White Paper,” published in August 2007, which aimed to “provide software writers with a clear understanding of the results that different sequences of memory access instructions may produce.” AMD published a similar description later that year in the AMD64 Architecture Programmer’s Manual revision 3.14.

These descriptions were based on a model called “total lock order + causal consistency” (TLO+CC), intentionally weaker than TSO. In public talks, the Intel architects said that TLO+CC was “as strong as required but no stronger.” In particular, the model reserved the right for x86 processors to answer “yes” to the IRIW litmus test. Unfortunately, the definition of the memory barrier was not strong enough to reestablish sequentially consistent memory semantics, even with a barrier after every instruction. Even worse, researchers observed actual Intel x86 hardware violating the TLO+CC model. For example:
Litmus Test: n6 (Paul Loewenstein)
Can this program end with r1 = 1, r2 = 0, x = 1?
// Thread 1 // Thread 2
x = 1 y = 1
r1 = x x = 2
r2 = y
Litmus Test: n5
Can this program end with r1 = 2, r2 = 1?

// Thread 1 // Thread 2
x = 1 x = 2
r1 = x r2 = x
Defining a memory model that leaves room for hardware optimization while still making useful guarantees for compiler writers and assembly-language programmers is hard: “as strong as required but no stronger” is a difficult balancing act.
ARM/POWER Relaxed Memory Model

The conceptual model for ARM and POWER systems is that each processor reads from and writes to its own complete copy of memory, and each write propagates to the other processors independently, with reordering allowed as the writes propagate:

[Diagram: several threads, each reading from (R) and writing to (W) its own complete copy of memory, with writes (w) propagating between the copies independently.]
Here, there is no total store order. Not depicted, each processor is also allowed to postpone a read until it needs the result: a read can be delayed until after a later write. In this relaxed model, the answer to every litmus test we’ve seen so far is “yes, that really can happen.”

For the original message passing litmus test, the reordering of writes by a single processor means that Thread 1’s writes may not be observed by other threads in the same order:
Litmus Test: Message Passing
Can this program see r1 = 1, r2 = 0?
// Thread 1 // Thread 2
x = 1 r1 = y
y = 1 r2 = x
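In the same toy style as the TSO sketch earlier, here is a Go model of the per-processor-memory picture (again, a sketch of my own with invented names). Delivering Thread 1’s two writes to the other processor out of order produces exactly the result above:

// relaxed.go: a toy relaxed machine; each cpu has its own copy of memory.
package main

import "fmt"

type cpu struct {
	mem     map[int]int // this processor's private copy of memory
	pending []func()    // writes from elsewhere, still in flight
}

// store applies a write locally right away and queues it for the other cpu.
func store(self, other *cpu, addr, val int) {
	self.mem[addr] = val
	other.pending = append(other.pending, func() { other.mem[addr] = val })
}

func main() {
	const x, y = 0, 1
	c1 := &cpu{mem: map[int]int{}}
	c2 := &cpu{mem: map[int]int{}}

	// Thread 1 runs on c1:
	store(c1, c2, x, 1) // x = 1
	store(c1, c2, y, 1) // y = 1

	// The writes propagate to c2 independently; y's write arrives first:
	c2.pending[1]()                          // y = 1 reaches c2
	r1 := c2.mem[y]                          // Thread 2: r1 = y reads 1
	r2 := c2.mem[x]                          // Thread 2: r2 = x reads 0
	c2.pending[0]()                          // x = 1 arrives too late
	fmt.Printf("r1 = %d, r2 = %d\n", r1, r2) // r1 = 1, r2 = 0
}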
The write queue (store buffer) litmus test can likewise see both zeros:

Litmus Test: Write Queue (also called Store Buffer)
Can this program see r1 = 0, r2 = 0?

// Thread 1 // Thread 2
x = 1 y = 1
r1 = y r2 = x
And because a read can be delayed until after a later write, the load buffering litmus test can also succeed:

Litmus Test: Load Buffering
Can this program see r1 = 1, r2 = 1?
(Can each thread’s read happen after the other thread’s write?)

// Thread 1 // Thread 2
r1 = x r2 = y
y = 1 x = 1
Even when a particular ARM or POWER chip never actually produces one of these results, that is little comfort: hardware that is stronger than its documented model encourages dependence on the stronger behavior and means that future, weaker hardware will break programs, validly or not.
As on TSO systems, ARM and POWER have barriers that we can insert into the examples above to force sequentially consistent behaviors. But the obvious question is whether ARM/POWER without barriers excludes any behavior at all. Can the answer to any litmus test ever be “no, that can’t happen”? It can, when we focus on a single memory location.
Here’s a litmus test for something that can’t happen even on ARM and POWER:
Litmus Test: Coherence
Can this program see r1 = 1, r2 = 2, r3 = 2, r4 = 1?
(Can Thread 3 see x = 1 before x = 2 while Thread 4 sees the reverse?)
// Thread 1   // Thread 2   // Thread 3   // Thread 4
x = 1         x = 2         r1 = x        r3 = x
                            r2 = x        r4 = x

Even on ARM and POWER, the answer is no: threads in the system must agree on a total order for the writes to a single memory location. That is, threads must agree which writes overwrite other writes. This property is called coherence.

Weak Ordering and DRF-SC

In their 1990 paper “Weak Ordering - A New Definition,” Sarita Adve and Mark Hill proposed a different kind of guarantee: if software avoids data races by using designated synchronization operations, then hardware behaves as if it were sequentially consistent. To make that precise, we first need to define a data race. Consider a single thread that writes a variable x and then reads it:
Thread 1
W(x)
 ↓
R(x)
The vertical arrow marks the order of execution within a single thread: the write
happens, then the read. There is no race in this program, since everything is in
a single thread.
In contrast, there is a race in this two-thread program:
Thread 1        Thread 2
W(x)
 ↓              W(x)
R(x)
Here, thread 2 writes to x without coordinating with thread 1. Thread 2’s write
races with both the write and the read by thread 1. If thread 2 were reading x
instead of writing it, the program would have only one race, between the write
in thread 1 and the read in thread 2. Every race involves at least one write: two
uncoordinated reads do not race with each other.
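In Go terms (my illustration; the discussion above is language-independent), the two-thread racy program looks like the sketch below, and running it with the race detector enabled, as in go run -race, should report the race:

// race.go: thread 2 writes x with no coordination with thread 1.
package main

import "fmt"

func main() {
	x := 0
	ch := make(chan bool)
	go func() { // thread 2: W(x), uncoordinated
		x = 2
		ch <- true
	}()
	x = 1          // thread 1: W(x)
	fmt.Println(x) // thread 1: R(x)
	<-ch
}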
To avoid races, we must add synchronization operations, which force an order between operations on different threads sharing a synchronization variable. If the synchronization S(a) (synchronizing on variable a, marked by the dashed arrow) forces thread 2’s write to happen after thread 1 is done, the race is eliminated:
Thread 1        Thread 2
W(x)
 ↓
R(x)
 ↓
S(a) - - - ->   S(a)
                 ↓
                W(x)
Now the write by thread 2 cannot happen at the same time as thread 1’s operations.
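Here is the same picture in Go (a sketch of mine, with a channel playing the role of the synchronization variable a):

// sync.go: the S(a) edge realized with a channel.
package main

import "fmt"

func main() {
	x := 0
	a := make(chan bool) // the synchronization variable "a"
	done := make(chan bool)
	go func() { // thread 2
		<-a   // S(a): wait for thread 1 to finish with x
		x = 2 // W(x), now ordered after thread 1's write and read
		done <- true
	}()
	// thread 1
	x = 1          // W(x)
	fmt.Println(x) // R(x): always prints 1
	a <- true      // S(a): hand x off to thread 2
	<-done
}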
If thread 2 were only reading, we would only need to synchronize with thread
1’s write. The two reads can still proceed concurrently:
Thread 1        Thread 2
W(x)
 ↓
S(a) - - - ->   S(a)
 ↓               ↓
R(x)            R(x)
Threads can be ordered by a sequence of synchronizations, even using an intermediate thread. This program has no race:
Thread 1        Thread 2        Thread 3
W(x)
 ↓
S(a) - - - ->   S(a)
                 ↓
                S(b) - - - ->   S(b)
                                 ↓
                                R(x)
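In Go (again my sketch, with channels a and b standing in for the two synchronization variables), the chain looks like this, and thread 3 is guaranteed to observe thread 1’s write:

// chain.go: ordering established through an intermediate thread.
package main

import "fmt"

func main() {
	x := 0
	a := make(chan bool)
	b := make(chan bool)
	done := make(chan bool)
	go func() { // thread 2: only relays the signal
		<-a       // S(a)
		b <- true // S(b)
	}()
	go func() { // thread 3
		<-b            // S(b)
		fmt.Println(x) // R(x): always prints 1
		done <- true
	}()
	// thread 1
	x = 1     // W(x)
	a <- true // S(a)
	<-done
}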
On the other hand, the use of synchronization variables does not by itself eliminate races: it is possible to use them incorrectly. This program does have a race:
Thread 1        Thread 2        Thread 3
W(x)                            W(x)
 ↓                               ↓
S(a) - - - ->   S(a)
                 ↓
                S(b) <- - - -   S(b)
                 ↓
                R(x)
Thread 2’s read is properly synchronized with the writes in the other threads (it definitely happens after both), but the two writes are not themselves synchronized. This program is not data-race-free.
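The same mistake is easy to write in Go (my sketch): the read is ordered after both writes, yet the two writes still race with each other, and go run -race should report it:

// badsync.go: synchronized read, unsynchronized writes.
package main

import "fmt"

func main() {
	x := 0
	a := make(chan bool)
	b := make(chan bool)
	go func() { x = 1; a <- true }() // thread 1: W(x), then S(a)
	go func() { x = 2; b <- true }() // thread 3: W(x), then S(b)
	// thread 2 (here, main): S(a), S(b), then R(x)
	<-a
	<-b
	fmt.Println(x) // ordered after both writes, but the two writes
	// are unordered with each other: a write-write data race
}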
Adve and Hill presented weak ordering as “a contract between software and
hardware,” specifically that if software avoids data races, then hardware acts as
if it is sequentially consistent, which is easier to reason about than the models
we were examining in the earlier sections. But how can hardware satisfy its end
of the contract?
Adve and Hill gave a proof that hardware “is weakly ordered by DRF,” meaning that it executes data-race-free programs as if by a sequentially consistent ordering, provided it meets a certain set of minimum requirements. I’m not going to go through the details, but the point is that after the Adve and Hill paper, hardware designers had a cookbook recipe backed by a proof: do these things, and you can assert that your hardware will appear sequentially consistent to data-race-free programs. And in fact, most relaxed hardware did behave this way and has continued to do so, assuming appropriate implementations of the synchronization operations. Adve and Hill were concerned originally with the VAX, but certainly x86, ARM, and POWER can satisfy these constraints too. This idea that a system guarantees to data-race-free programs the appearance of sequential consistency is often abbreviated DRF-SC.
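To connect DRF-SC back to the litmus tests: here is the message passing test written as a data-race-free Go program. This sketch assumes Go 1.19’s atomic.Int32; because every access to the shared variables is a synchronization operation, the program is data-race-free, and DRF-SC says only the sequentially consistent outcomes remain.

// drfsc.go: message passing, data-race-free via sync/atomic.
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

func main() {
	var x, y atomic.Int32 // every access below is a synchronization operation
	var r1, r2 int32
	var wg sync.WaitGroup
	wg.Add(2)
	go func() { x.Store(1); y.Store(1); wg.Done() }()       // Thread 1
	go func() { r1 = y.Load(); r2 = x.Load(); wg.Done() }() // Thread 2
	wg.Wait()
	fmt.Printf("r1 = %d, r2 = %d\n", r1, r2) // never prints r1 = 1, r2 = 0
}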
DRF-SC marked a turning point in hardware memory models, providing a clear strategy for both hardware designers and software authors, at least those writing software in assembly language. As we will see in the next post, the question of a memory model for a higher-level programming language does not have as neat and tidy an answer.
The next post in this series is about programming language memory models.
Acknowledgements
This series of posts benefited greatly from discussions with and feedback from
a long list of engineers I am lucky to work with at Google. My thanks to them.
I take full responsibility for any mistakes or unpopular opinions.