Title: Distributed Shared Memory over IP Networks - Revisited
1. Distributed Shared Memory over IP Networks - Revisited

"Which would you rather use to plow a field: two strong oxen, or 1024 chickens?"
- Seymour Cray

CS/ECE 757, Spring 2005. Instructor: Mark Hill
2. DSM over IP

- Abstract a single, large, powerful machine from a group of many commodity machines
- Workstations can be idle for long periods
- Networking infrastructure already exists
- Increase performance through parallel execution
- Increase utilization of existing hardware
- Fault tolerance, fault isolation
3. Overview - 1

- Source-level distributed shared memory system for x86-Linux
- Code written with DSM/IP API
- Memory is identically named at high level (C/C++)
- Only explicitly shared variables are shared
- User-level library checks coherence on each access
- Similar to programming in an SMP environment
4. Overview - 2

- Evaluated on a small (Ethernet) network of heterogeneous machines
- 4 x86-based machines of varying quality
- Small, simulated workloads
- Application-dependent performance
- Communication latency must be tolerated
- Relaxed consistency models improve performance for some workloads
- Heterogeneity vastly affects performance
- All machines operate as slowly as the slowest machine in barrier-synchronized workloads
- A Pentium 4 at 4.0 GHz is faster than a Pentium 3 at 600 MHz!
5. Motivation

- DSMoIP similar to building an SMP with an unordered interconnect
- Shared memory over message passing
- Race conditions possible
- Coherence concerns
- DSMoIP offers a tangible result
- Runs on real HW
6. Challenges
- IP on Ethernet is slow!
- Point-to-point peak bandwidth < 12 MB/s
- Latency ~100 µs, RTT ~200 µs
- Common case must be fast
- Software check of coherence on every access
- Many coherence changes require at least one RTT
- IP semantics difficult for correctness
- Packets can be
- Dropped
- Delivered More than Once
- Delivered Very Late, etc.
7. Related Projects

- TreadMarks
- DSM over UDP/IP, page-level granularity, source-level compatibility, release consistency
- Brazos (Rice)
- DSM using multicast IP, MPI compatibility, scope consistency, source-level compatibility
- Plurix (Universität Ulm)
- Distributed OS for shared memory, Java-source compatible
8. More Related Projects

- Quarks (Utah)
- DSM over IP, user-level package interface, source-level compatibility, many coherence options
- SHRIMP (Princeton), StarDust (IRISA)
- And more
- DSM over IP is not a new idea
- Most have demonstrated speedup for some workloads
9. Implementation Overview

- All participating machines have space reserved for each shared object at all times
- Each shared object has accompanying coherence data
- Depending on coherence, the object may or may not be accessible
- If permissions are needed on a shared object, network communication is initiated
- Identical semantics to a long-latency memory operation
10. Implementation Overview

11. Implementation Overview

- Shared memory objects use altered syntax for declaration and access
- User-level DSMoIP library, consisting of:
- Communication backbone
- Coherence engine
- Programmer uses a small API for setup and
synchronization
12. API: Accessing Shared Objects

- SMP
- a = x + b
- x = y - 1
- z = array[i] + 7
- array[i]++
- DSMoIP
- a = x.Read() + b
- x.Write(y.Read() - 1)
- z = array.Read(i) + 7
- array.Write(array.Read(i) + 1, i)

Variables x, y, z, and array are shared. All others are unshared.
13. API: Accessing Shared Objects

- Naming of shared objects is syntactically different, but semantically unchanged
- Use of .Read() and .Write() functions allows DSM to interpose if necessary
- Use of C/C++ operator overloading can remove some of the changes in syntax (not implemented)
14. Communication Backbone (CB)

- A substrate on which to build coherence mechanisms
- Provide primitives for communication
- Guaranteed delivery over an unreliable network
- At-most-once delivery
- Arbitrary message passing to a logical thread or
broadcast to all threads
15. Coherence Engine (CE)
- Uses primitives in the communication backbone to abstract shared memory from a message-passing system
- Swappable, based on needs of application
- CE determines consistency model
- CE plays a major role in performance
16. CE/CB Interaction Example
MACHINE 1: a / None    MACHINE 2: a / Read

17. CE/CB Interaction Example
MACHINE 1: a / Read    MACHINE 2: a / Read

18. CE/CB Interaction Example
(Already have permission)
MACHINE 1: a / Write   MACHINE 2: a / None

19. CE/CB Interaction Example
MACHINE 1: a / Write   MACHINE 2: a / None

20. CE/CB Interaction Example
MACHINE 1: a / None    MACHINE 2: a / Write
21. Coherence Engines Implemented

- ENGINE_NC
- Fast, simple engine that pays no attention to consistency
- Reads can occur at any time
- Writes are broadcast, can be observed by any processor in any order, and are not guaranteed to ever become visible at all processors
- Communicates only on writes, non-blocking
22. Coherence Engines Implemented

- ENGINE_OU
- Owned/Unowned
- Sequentially-consistent engine
- Ensures SC by actually enforcing a total order of accesses
- Each shared object is valid at only one processor for the duration of execution
- Slow, cumbersome (nearly every access generates traffic)
- Blocking, latent reads and writes
23. Coherence Engines Implemented

- ENGINE_MSI
- Based on the MSI cache coherence protocol
- Shared objects can exist in multiple locations in S state
- Writing is exclusive to a single, M-state machine
- Communicates as needed by typical MSI protocol
- Sequentially consistent
24. Evaluation

- DSM/IP evaluated using four heterogeneous x86/Linux machines
- Microbenchmarks test correctness
- Simulated workloads explore performance issues
- Early results indicate that DSM/IP overwhelms a 100 Mb/s LAN
- 1-10k broadcast packets/second per machine
- TCP/IP streams in the subnet are suppressed by UDP packets in the absence of Quality of Service measures
- Process is self-throttling, but makes DSM/IP unattractive to network administrators
25. Measurement

- Event counting
- Regions of DSM library augmented with event counters
- Timers
- gettimeofday(), setitimer() for low-granularity timing
- Built-in x86 cycle timers for high-granularity measurement
- Validated with gettimeofday()
26. Timer Validation

- Assumed system-call-based timers are accurate
- x86 cycle timers validated for long executions against gettimeofday()
27. Microbenchmarks

- message_test
- Each thread sends a specified number of messages to the next thread
- Attempts to overload network/backbone to produce an incorrect execution
- barrier_test
- Each thread iterates over a specified number of barriers
- Attempts to cause failures in the DSM_Barrier() mechanism
- lock_test
- All threads attempt to acquire a lock and increment a global variable
- Puts maximal pressure on the DSM_Lock primitive
28. Microbenchmarks: Validation

- message_test
- nMessages = 1M, N = 1, 4; no lost messages (runtime ~2 minutes)
- barrier_test
- nBarriers = 1M, N = 1, 4; consistent synchronization (runtime ~4 minutes)
- lock_test
- nIncs = 1M, ENGINE_OU (SC), N = 1, 4; final value of variable = N million (correct)
29. Simulated Workloads

- Three simulated workloads
- sea, similar to OCEAN from SPLASH-2
- genetic, a typical iterative genetic algorithm
- xstate, an exhaustive solver for complex expressions
- Implementations are relatively simplified, for quick development
(figure: workloads ordered by increasing communication)
30. Simulated Workloads - sea

- Integer version of SPLASH-2's OCEAN
- Simple iterative simulation, taking averages of surrounding points
- Uses a large integer array
- Coherence granularity depends on number of threads
- Uses DSM_Barrier() primitives for synchronization
31. Simulated Workloads - genetic

- Multithreaded genetic algorithm
- Uses a distributed genetic process to breed a specific integer from a random population
- Iterates in two phases
- Breeding: Generate new potential solutions from the most fit members of the current population
- Survival of the Fittest: Remove the least fit solutions from the population
- Uses DSM_Barrier() for synchronization
32. Simulated Workloads - xstate

- Integer solver for arbitrary expressions
- Uses space exploration to find fixed points of a hard-coded expression
- Employs a global list of solutions, protected with a DSM_Lock primitive
33. Methodology

- Each simulated workload run for each coherence engine
- Normalized parallel-phase runtime (versus uniprocessor case) measured with high-resolution counters
- Does not include startup overhead for DSM system (1.5 seconds)
- Does not include other initialization overheads
34. Results - SEA

Engine | ABT    | Wait
OU     | 0.6 s  | 82%
NC     | 0.6 s  | 10%
MSI    | 0.05 s | 15%
35. Results - GENETIC

Engine | ABT    | Wait
OU     | 0.02 s | 20%
NC     | 0.02 s | 9%
MSI    | 0.01 s | 1%
36. Results - XSTATE

Engine | ABT   | Wait
OU     | 3.1 s | 8%
NC     | 2.7 s | 2%
MSI    | 2.6 s | 9%
37. Observations

- ENGINE_NC
- Speedup for all workloads (but no consistency)
- Scalability concerns: broadcasts on every write!
- ENGINE_OU
- Speedup for workload with lightest communication load
- Scalability concerns: unicast on every read/write!
- ENGINE_MSI
- Reduces network traffic; has speedup for some workloads
- Impressive performance on SEA
- Effect of heterogeneity is significant!
- Est. fastest machine spends 25-35% of its time waiting for other machines to arrive at barriers
38. Conclusions

- Performance for DSM/IP implementation is marginal for small networks (N ≤ 4)
- Coherence engine significantly affects performance
- Naïve implementation of SC has far too much overhead
39. Conclusions

- Common case might not be fast enough
- Best case
- Single load or store becomes a function call, ~10 instrs.
- Worst case
- Single load or store becomes an RTT on the network (1M-100M instruction times)
- Average case (depends on engine)
- LD/ST becomes 100-1K instruction times?
40. More Questions

- How might other engine implementations behave?
- Is the communication substrate to blame for all performance problems?
- How would DSM/IP perform for N = 8, 16, 64, 128?
41. Questions

42.

43. Initial Synchronization

44. Initial Synchronization

45. Initial Synchronization