Distributed Shared Memory
Outline
- Introduction
- Shared Memory vs. Distributed Memory
- Distributed Shared Memory Architecture
- DSM vs. Message Passing
- Design and Implementation
- Consistency Models
- DSM Algorithms
- Conclusion
Introduction
From the system interconnection perspective, parallel systems fall into two main categories:

Multiprocessors (tightly-coupled)
[Figure: several CPUs attached to a single shared memory]
- Shared Memory architecture

Multicomputers (loosely-coupled)
[Figure: CPU/memory nodes, each with its own private memory, connected by a network]
- Distributed Memory architecture
Shared Memory vs. Distributed Memory

Shared Memory
- Global address space
- Cache coherent
- Lack of scalability
- Expensive to build
- Easy to program; reusability
- Data sharing fast and uniform

Distributed Memory
- No concept of global address space
- No concept of cache coherency
- Scalable performance
- Cost-effective: can use commodity, off-the-shelf processors and networking
- Programmer is responsible for data communication
- Data sharing by message passing; non-uniform memory access times

Cache coherent means that if one processor updates a location in shared memory, all the other processors know about the update.
Distributed-Shared Memory Architecture

Definition
A Distributed Shared Memory (DSM) is an abstraction that allows physically separated memories to be addressed as one logically shared address space.

General Characteristics
- Hybrid architecture
- Virtual space shared between all processes
- Shared-memory model implemented over physically distributed memory
- Shared-memory programming techniques can be used: when reading and updating, processes see the DSM as ordinary memory within their address space
- Mapping manager: maps a shared-memory address to physical memory (remote or local)
[Figure: processes P1..Pn with memories M1..Mn, each with a mapping manager (MM), connected by an interconnection network into one shared virtual space]
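The mapping manager's role can be illustrated with a small sketch. All class and method names here are hypothetical, and the static page-striping ownership policy is an assumption chosen only to keep the example self-contained:

```python
# Minimal sketch of a DSM mapping manager (hypothetical names).
# Each node runs a mapping manager that translates a shared virtual
# address into a page owned either locally or by a remote node; a local
# access is served from local memory, while a remote one would trigger
# communication over the interconnection network.

PAGE_SIZE = 4096

class MappingManager:
    def __init__(self, node_id, num_nodes):
        self.node_id = node_id
        self.num_nodes = num_nodes
        self.local_memory = {}  # local page frames: page_no -> bytearray

    def owner_of(self, addr):
        # Assumed policy for illustration: pages striped across nodes.
        page = addr // PAGE_SIZE
        return page % self.num_nodes

    def read(self, addr):
        page, offset = divmod(addr, PAGE_SIZE)
        if self.owner_of(addr) == self.node_id:
            frame = self.local_memory.setdefault(page, bytearray(PAGE_SIZE))
            return frame[offset]
        # Remote page: a real DSM would send a request to the owner;
        # here we only signal the miss.
        raise LookupError(f"page {page} owned by node {self.owner_of(addr)}")

mm = MappingManager(node_id=0, num_nodes=4)
print(mm.read(100))  # page 0 is local to node 0; untouched memory reads 0
```

The point of the sketch is that the process issues an ordinary address; whether the data is local or remote is decided entirely inside the mapping manager.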
Distributed-Shared Memory Architecture (contd.)

General Characteristics
- Hidden communication operations: communications are still needed to exchange data, but they are hidden from the programmer (inter-process communication transparency)
- Heterogeneous nodes: the shared-memory component can be a cache-coherent SMP machine and/or a graphics processing unit (GPU)
- Processes on different computers observe the updates made by one another
- Cache-coherent SMP or UMA (bus-based): a model with identical processors that have equal access times to a shared memory
  - UMA: Uniform Memory Access
  - SMP: Symmetric Multiprocessor
Distributed-Shared Memory Architecture (contd.)

Advantages
- Implicit data sharing: shields the programmer from Send/Receive primitives
- Less expensive to build and scalable (inherited from the distributed-memory architecture)
- Very large total physical memory across all nodes: large programs can run more efficiently
- Software portability and reusability: programs written for shared-memory multiprocessors can run on DSM systems with minimal changes

Disadvantages
- Little programmer control over the actual messages being generated
- DSM implementations use asynchronous message passing, and thus cannot be more efficient than message-passing implementations
Distributed-Shared Memory Architecture (contd.)

Best suited
- When individual shared data items can be accessed directly
- e.g., parallel applications

Less appropriate
- When data is accessed by request
- e.g., client-server systems
- A server may still be used to assist in providing DSM functionality for data shared between clients
DSM vs. Message Passing

Property             | DSM                                  | Message Passing
---------------------|--------------------------------------|----------------------------------
Marshalling          | No; variables are shared directly    | Yes; the programmer's job
Address space        | Single; interference may occur       | Private; processes are protected
Data representation  | Uniform                              | Heterogeneous
Synchronization      | Normal shared-memory constructs      | Message-passing primitives
Process execution    | Lifetimes may be non-overlapping     | Must execute at the same time
Communication cost   | Invisible                            | Obvious

There is no conclusive evidence for or against either of the two communication mechanisms.
Design and Implementation

Main Issues
- Granularity: the size of the sharing unit, which can be a byte, a page, or a complex data structure
  - Small pages: increased parallelism, but an increase in directory size
  - Large pages: reduce paging overhead, but increase sharing overhead
- Structure: the arrangement of shared data; most systems view DSM as a linear array of words, but it can also be uniform chunks of memory or data structures
- Synchronization primitives: DSM must allow simultaneous access to shared data on different machines (single writer / multiple readers, etc.); coherence protocols must ensure the consistency of shared data
- Replacement strategies: similar to caching mechanisms in multiprocessors
  - In cache systems, LRU is often used
  - In DSM, shared pages need to be given higher priority than exclusively owned pages, so exclusively owned pages could be replaced first
Consistency Models

Definition
A memory consistency model for a shared address space specifies constraints on the order in which memory operations must appear to be performed (i.e., become visible to the processors) with respect to one another.

Strict Consistency Model
- Any read of a memory location returns the value stored by the most recent write to that address, irrespective of the locations of the processors performing the read and the write.

Sequential Consistency Model
- "... the result of any execution is the same as if the operations of all processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program." (Leslie Lamport)
- Definition restated: sequential consistency requires that a shared-memory multiprocessor appear to be a multiprogramming uniprocessor to any program running on it
  - All instructions are executed in order
  - Every write operation becomes instantaneously visible throughout the system
Consistency Models (contd.)

Sequential Consistency Model

Example 1

    P1:             P2:
    Data = 2000     while (Head == 0) {}
    Head = 1        ... = Data

- Sequential consistency requires program order:
  - The write to Data has to complete before the write to Head can begin
  - The read of Head has to complete before the read of Data can begin
Example 2

Initially A = B = 0

    P1:        P2:             P3:
    A = 1      if (A == 1)     if (B == 1)
                   B = 1           register = A

- Sequential consistency can be maintained if a process makes sure that everyone has seen an update before that value is read
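Example 1 can be run as a small two-thread sketch. Note this relies on CPython's threading semantics making the writes visible across threads; on real hardware with a weaker memory model, the same pattern would need fences or atomic operations:

```python
# Example 1 as a runnable sketch: P1 publishes Data and then sets Head;
# P2 spins on Head and only then reads Data. Sequential consistency is
# what makes the handoff safe: the write to Data must become visible
# before the write to Head, and the read of Head completes before the
# read of Data.
import threading

Data = 0
Head = 0
result = []

def p1():
    global Data, Head
    Data = 2000
    Head = 1          # ordered after Data = 2000 under sequential consistency

def p2():
    while Head == 0:  # spin until P1 signals via Head
        pass
    result.append(Data)

t2 = threading.Thread(target=p2)
t2.start()
p1()
t2.join()
print(result)  # [2000] -- P2 observes the value published by P1
```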
Consistency Models (contd.)

Causal Consistency Model
- Writes that are potentially causally related must be seen by all processors in the same order
- Writes that are not potentially causally related may be seen in a different order on different machines

Processor Consistency Model
- Writes done by a single processor are seen by all other processors in the order in which they were issued on that processor
- Writes from different processors may be seen in a different order by different processors

Release Consistency Model
- Weak consistency with two types of synchronization operations: acquire and release
- Each type of operation is guaranteed to be processor consistent
DSM Algorithms

The Central Server Algorithm
[Figure: clients send read/write requests to a central server and receive data or acknowledgements]
- A central server maintains all shared data
  - Read request: the server just returns the data
  - Write request: the server updates the data and sends an acknowledgement to the client
- Two messages for each data access
- Implementation
  - A timeout is used to resend a request if an acknowledgement fails to arrive
  - Sequence numbers associated with requests can be used to detect duplicate write requests
  - If an application's request to access shared data fails repeatedly, a failure condition is sent to the application
- Issues: performance and reliability; the server becomes a bottleneck
- Possible solutions
  - Partition shared data between several servers
  - Use a mapping function to distribute/locate data
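A minimal sketch of the central-server algorithm, with the duplicate detection via sequence numbers described above. The message format and names are assumptions for illustration, not a real protocol:

```python
# Sketch of the central-server algorithm. The server holds all shared
# data; clients issue read/write requests and the server replies with
# data or an acknowledgement -- two messages per access. Retransmitted
# writes (same client, same sequence number) are detected and dropped.

class CentralServer:
    def __init__(self):
        self.store = {}
        self.seen = set()  # (client_id, seq) pairs already applied

    def handle(self, request):
        kind, client_id, seq, key, value = request
        if kind == "read":
            return ("data", self.store.get(key))
        if kind == "write":
            if (client_id, seq) not in self.seen:  # drop duplicate writes
                self.seen.add((client_id, seq))
                self.store[key] = value
            return ("ack", seq)

server = CentralServer()
server.handle(("write", "c1", 1, "x", 42))
server.handle(("write", "c1", 1, "x", 99))  # retransmission of seq 1: ignored
print(server.handle(("read", "c1", 2, "x", None)))  # ('data', 42)
```

The duplicate write with sequence number 1 is ignored, so a client timing out and resending a request cannot apply the same write twice.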
DSM Algorithms (contd.)

The Migration Algorithm
[Figure: a data block migrating to the requesting node in response to a migration request]
- Data is always migrated to the site where it is accessed
- Only one node may access a shared data item at a time: single reader / single writer (SRSW) protocol
- Data is typically migrated between servers in a fixed-size unit called a block, to facilitate management instead of migrating individual data items
- Advantages
  - Takes advantage of locality of reference
  - No communication costs are incurred when a process accesses data held locally
- DSM can be integrated with the virtual memory of the OS at each node
  - The size of the block is chosen to be equal to a virtual-memory page, or a multiple thereof
  - A locally held shared-memory page can be mapped into the application's virtual address space, so normal machine instructions for accessing memory can be used
  - Access to data items on blocks not held locally triggers a page fault; the fault handler communicates with remote hosts to obtain the requested data
  - When a data block is migrated away, it is removed from any local address space it was mapped to
- To locate a remote data object:
  - Use a location server
  - Broadcast a query
- Issues
  - Pages can thrash between hosts; to minimize thrashing, set a minimum time for data objects to reside at a node
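The block-migration behaviour above can be sketched in a few lines. The class and counter names are hypothetical, and the "page fault" is modelled as an ordinary method call rather than a real hardware trap:

```python
# Sketch of the migration (SRSW) algorithm: each block lives at exactly
# one node at a time; an access from any other node migrates the whole
# block there before the access proceeds.

BLOCK_SIZE = 4096  # typically one virtual-memory page

class MigratingDSM:
    def __init__(self):
        self.location = {}   # block_no -> node currently holding it
        self.blocks = {}     # block_no -> {offset: value}
        self.migrations = 0  # counts blocks moved between nodes

    def _fault(self, node, block):
        # Fault handler: migrate the block to the faulting node if needed.
        if self.location.get(block, node) != node:
            self.migrations += 1
        self.location[block] = node
        self.blocks.setdefault(block, {})

    def write(self, node, addr, value):
        block, off = divmod(addr, BLOCK_SIZE)
        self._fault(node, block)
        self.blocks[block][off] = value

    def read(self, node, addr):
        block, off = divmod(addr, BLOCK_SIZE)
        self._fault(node, block)
        return self.blocks[block].get(off)

dsm = MigratingDSM()
dsm.write(node=0, addr=10, value="a")  # block 0 now resides at node 0
print(dsm.read(node=1, addr=10))       # migrates block 0 to node 1 -> a
print(dsm.migrations)                  # 1
```

Repeated alternating accesses from nodes 0 and 1 would bump the migration counter on every access: exactly the thrashing problem the minimum-residence-time heuristic is meant to limit.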
DSM Algorithms (contd.)

The problem with the previous techniques is the sequential access to each data block.

The Read-Replication Algorithm
[Figure: a block request, followed by a multicast invalidate to all replica holders on a write]
- Extends the migration algorithm: replicates data blocks at multiple nodes for read access
- Replication can reduce the average cost of read operations
- Multiple nodes can have read access, or one node write access: multiple readers / single writer (MRSW) protocol
- After a write, all copies are invalidated or updated
- DSM has to keep track of the locations of all copies of data objects
  - IVY: the owner node of a data object knows all nodes that have copies
  - PLUS: a distributed linked list tracks all nodes that have copies
- Advantages
  - Read replication can lead to substantial performance improvements if the ratio of reads to writes is large
- Disadvantages
  - Write operations may be more expensive, since replicas may have to be invalidated or updated to maintain consistency
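The invalidation-based MRSW scheme can be sketched as follows, tracking the copy set at the owner in the IVY style. The names and the per-call bookkeeping are assumptions for illustration:

```python
# Sketch of the read-replication (MRSW) algorithm: reads add the reader
# to the block's copy set; a write invalidates every other copy holder
# before it proceeds (IVY-style: the owner tracks the copy set).

class ReadReplicatedDSM:
    def __init__(self):
        self.value = {}          # block -> current value at the owner
        self.copyset = {}        # block -> set of nodes holding a read copy
        self.invalidations = 0   # invalidate messages sent so far

    def read(self, node, block):
        self.copyset.setdefault(block, set()).add(node)
        return self.value.get(block)

    def write(self, node, block, value):
        # Multicast invalidate to every other copy holder.
        holders = self.copyset.get(block, set()) - {node}
        self.invalidations += len(holders)
        self.copyset[block] = {node}   # writer keeps the only valid copy
        self.value[block] = value

dsm = ReadReplicatedDSM()
dsm.write(0, block=7, value="v1")
dsm.read(1, block=7)
dsm.read(2, block=7)
dsm.write(0, block=7, value="v2")  # invalidates the copies at nodes 1 and 2
print(dsm.invalidations)           # 2
```

With many readers per write, reads are served locally and the occasional multicast invalidate is cheap; with frequent writes, the invalidation traffic dominates, matching the advantage/disadvantage trade-off above.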
DSM Algorithms (contd.)

The Full-Replication Algorithm
[Figure: clients send writes to a sequencer, which multicasts ordered updates to all replica sites]
- Extension of the read-replication algorithm
- Multiple nodes have both read and write access to shared data blocks: multiple readers / multiple writers (MRMW) protocol
- Issue: consistency of data with multiple writers
- Solution: use a gap-free sequencer
  - All writes are sent to the sequencer
  - The sequencer assigns a sequence number and sends the write request to all sites that have copies
  - Each node performs writes according to the sequence numbers
  - A gap in the sequence numbers indicates a missing write request; the node asks for retransmission of the missing write requests
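The gap-free sequencer described above can be sketched as follows; the class names and the held-back-writes buffer are assumptions for illustration:

```python
# Sketch of the full-replication (MRMW) algorithm with a gap-free
# sequencer: every write gets a global sequence number, and each replica
# applies writes strictly in sequence order. A gap leaves later writes
# buffered until the missing one is retransmitted.

class Sequencer:
    def __init__(self):
        self.next_seq = 0

    def order(self, write):
        seq = self.next_seq
        self.next_seq += 1
        return (seq, write)   # stamped write, multicast to all replicas

class Replica:
    def __init__(self):
        self.store = {}
        self.applied = 0      # next sequence number this replica expects
        self.pending = {}     # out-of-order writes, held back

    def deliver(self, seq, write):
        self.pending[seq] = write
        # Apply in order; a gap stops the loop, so later writes stay
        # pending until the missing one arrives (via retransmission).
        while self.applied in self.pending:
            key, value = self.pending.pop(self.applied)
            self.store[key] = value
            self.applied += 1

seq = Sequencer()
r = Replica()
w0 = seq.order(("x", 1))
w1 = seq.order(("x", 2))
r.deliver(*w1)    # arrives first: gap at 0, held back
print(r.store)    # {} -- nothing applied yet
r.deliver(*w0)    # missing write arrives (e.g. after retransmission)
print(r.store)    # {'x': 2} -- both writes applied in sequence order
```

Because every replica applies the same writes in the same global order, concurrent writers cannot leave the copies inconsistent, which is exactly the role of the sequencer.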
DSM Algorithms

Performance Measure
- Needs to take into account the cost of accessing local and remote data blocks
- Parameters
  - p: cost of sending or receiving a short packet
  - P: cost of sending or receiving a data block (assume P/p equals 20)
  - S: number of sites participating in the distributed shared memory
  - r: read/write ratio
  - f: probability of an access fault on a non-replicated data block
  - f': probability of an access fault on replicated data blocks
Conclusion
Being a hybrid of the distributed- and shared-memory architectures, DSM systems offer a trade-off between the ease of programming of shared-memory machines and the efficiency and scalability of distributed-memory systems.

While programmers are relieved of the communication details, they still have to take care of many design and implementation issues. The algorithms above offer various solutions, with cost and performance varying for each.

- No single algorithm is good for all applications
- Algorithms need to be adaptive to application characteristics

Thank you for paying attention