Title: Distributed Shared Memory over IP Networks - Revisited
1. Distributed Shared Memory over IP Networks - Revisited

"Which would you rather use to plow a field: two strong oxen, or 1024 chickens?"
- Seymour Cray

CS/ECE 757, Spring 2005. Instructor: Mark Hill
2. DSM over IP

- Abstract a single, large, powerful machine from a group of many commodity machines
- Workstations can be idle for long periods
- Networking infrastructure already exists
- Increase performance through parallel execution
- Increase utilization of existing hardware
- Fault tolerance, fault isolation
3. Overview - 1

- Source-level distributed shared memory system for x86-Linux
- Code written with DSM/IP API
- Memory is identically named at high level (C/C++)
- Only explicitly shared variables are shared
- User-level library checks coherence on each access
- Similar to programming in an SMP environment
4. Overview - 2

- Evaluated on a small (Ethernet) network of heterogeneous machines
- 4 x86-based machines of varying quality
- Small, simulated workloads
- Application-dependent performance
- Communication latency must be tolerated
- Relaxed consistency models improve performance for some workloads
- Heterogeneity vastly affects performance
- All machines operate as slowly as the slowest machine in barrier-synchronized workloads
- A Pentium 4 at 4.0 GHz is faster than a Pentium 3 at 600 MHz!
5. Motivation

- DSMoIP similar to building an SMP with an unordered interconnect
- Shared memory over message passing
- Race conditions possible
- Coherence concerns
- DSMoIP offers a tangible result
- Runs on real HW
6. Challenges
- IP on Ethernet is slow!
- Point-to-point peak bandwidth < 12 MB/s
- Latency ~100 µs, RTT ~200 µs
- Common case must be fast
- Software check of coherence on every access
- Many coherence changes require at least one RTT
- IP semantics difficult for correctness
- Packets can be
- Dropped
- Delivered More than Once
- Delivered Very Late, etc.
7. Related Projects

- TreadMarks
- DSM over UDP/IP, page-level granularity, source-level compatibility, release consistency
- Brazos (Rice)
- DSM using multicast IP, MPI compatibility, scope consistency, source-level compatibility
- Plurix (Universität Ulm)
- Distributed OS for shared memory, Java-source compatible
8. More Related Projects

- Quarks (Utah)
- DSM over IP, user-level package interface, source-level compatibility, many coherence options
- SHRIMP (Princeton), StarDust (IRISA)
- And more
- DSM over IP is not a new idea
- Most have demonstrated speedup for some workloads
9. Implementation Overview

- All participating machines have space reserved for each shared object at all times
- Each shared object has accompanying coherence data
- Depending on coherence, the object may or may not be accessible
- If permissions are needed on a shared object, network communication is initiated
- Identical semantics to a long-latency memory operation
10. Implementation Overview

11. Implementation Overview

- Shared memory objects use altered syntax for declaration and access
- User-level DSMoIP library, consisting of:
- Communication backbone
- Coherence engine
- Programmer uses a small API for setup and
synchronization
12. API: Accessing Shared Objects

- SMP
- a = x + b
- x = y - 1
- z = array[i] + 7
- array[i]++
- DSMoIP
- a = x.Read() + b
- x.Write(y.Read() - 1)
- z = array.Read(i) + 7
- array.Write(array.Read(i) + 1, i)

Variables x, y, z, and array are shared. All others are unshared.
13. API: Accessing Shared Objects

- Naming of shared objects is syntactically different, but semantically unchanged
- Use of .Read() and .Write() functions allows DSM to interpose if necessary
- Use of C/C++ operator overloading can remove some of the changes in syntax (not implemented)
14. Communication Backbone (CB)

- A substrate on which to build coherence mechanisms
- Provide primitives for communication
- Guaranteed delivery over an unreliable network
- At-most-once delivery
- Arbitrary message passing to a logical thread or
broadcast to all threads
15. Coherence Engine (CE)
- Uses primitives in the communication backbone to abstract shared memory from a message-passing system
- Swappable, based on needs of application
- CE determines consistency model
- CE plays a major role in performance
16. CE/CB Interaction Example
MACHINE 1: a / None    MACHINE 2: a / Read

17. CE/CB Interaction Example
MACHINE 1: a / Read    MACHINE 2: a / Read

18. CE/CB Interaction Example
(Already have permission)
MACHINE 1: a / Write   MACHINE 2: a / None

19. CE/CB Interaction Example
MACHINE 1: a / Write   MACHINE 2: a / None

20. CE/CB Interaction Example
MACHINE 1: a / None    MACHINE 2: a / Write
21. Coherence Engines Implemented

- ENGINE_NC
- Fast, simple engine that pays no attention to consistency
- Reads can occur at any time
- Writes are broadcast, can be observed by any processor in any order, and are not guaranteed to ever become visible at all processors
- Communicates only on writes, non-blocking
22. Coherence Engines Implemented

- ENGINE_OU
- Owned/Unowned
- Sequentially-consistent engine
- Ensures SC by actually enforcing a total order of accesses
- Each shared object is valid at only one processor for the duration of execution
- Slow, cumbersome (nearly every access generates traffic)
- Blocking, latent reads and writes
23. Coherence Engines Implemented

- ENGINE_MSI
- Based on the MSI cache coherence protocol
- Shared objects can exist in multiple locations in S state
- Writing is exclusive to a single, M-state machine
- Communicates as needed by typical MSI protocol
- Sequentially consistent
24. Evaluation

- DSM/IP evaluated using four heterogeneous x86/Linux machines
- Microbenchmarks test correctness
- Simulated workloads explore performance issues
- Early results indicate that DSM/IP overwhelms a 100 Mb/s LAN
- 1-10k broadcast packets/second per machine
- TCP/IP streams in the subnet are suppressed by UDP packets in the absence of Quality of Service measures
- Process is self-throttling, but makes DSM/IP unattractive to network administrators
25. Measurement

- Event counting
- Regions of DSM library augmented with event counters
- Timers
- gettimeofday(), setitimer() for low-granularity timing
- Built-in x86 cycle timers for high-granularity measurement
- Validated with gettimeofday()
26. Timer Validation

- Assumed system-call-based timers are accurate
- x86 cycle timers validated for long executions against gettimeofday()
27. Microbenchmarks

- message_test
- Each thread sends a specified number of messages to the next thread
- Attempts to overload network/backbone to produce an incorrect execution
- barrier_test
- Each thread iterates over a specified number of barriers
- Attempts to cause failures in the DSM_Barrier() mechanism
- lock_test
- All threads attempt to acquire a lock and increment a global variable
- Puts maximal pressure on the DSM_Lock primitive
28. Microbenchmarks: Validation

- message_test
- nMessages = 1M, N = 1, 4; no lost messages (runtime ~2 minutes)
- barrier_test
- nBarriers = 1M, N = 1, 4; consistent synchronization (runtime ~4 minutes)
- lock_test
- nIncs = 1M, ENGINE_OU (SC), N = 1, 4; final value of variable = N million (correct)
29. Simulated Workloads

- Three simulated workloads
- sea, similar to OCEAN from SPLASH-2
- genetic, a typical iterative genetic algorithm
- xstate, an exhaustive solver for complex expressions
- Implementations are relatively simplified, for quick development
(figure: workloads ordered by increasing communication)
30. Simulated Workloads - sea

- Integer version of SPLASH-2's OCEAN
- Simple iterative simulation, taking averages of surrounding points
- Uses a large integer array
- Coherence granularity depends on number of threads
- Uses DSM_Barrier() primitives for synchronization
31. Simulated Workloads - genetic

- Multithreaded genetic algorithm
- Uses a distributed genetic process to breed a specific integer from a random population
- Iterates in two phases
- Breeding: Generate new potential solutions from the most fit members of the current population
- Survival of the Fittest: Remove the least fit solutions from the population
- Uses DSM_Barrier() for synchronization
32. Simulated Workloads - xstate

- Integer solver for arbitrary expressions
- Uses space exploration to find fixed points of a hard-coded expression
- Employs a global list of solutions, protected with a DSM_Lock primitive
33. Methodology

- Each simulated workload run for each coherence engine
- Normalized parallel-phase runtime (versus uniprocessor case) measured with high-resolution counters
- Does not include startup overhead for DSM system (1.5 seconds)
- Does not include other initialization overheads
34. Results - SEA

Engine | ABT    | Wait
OU     | 0.6 s  | 82%
NC     | 0.6 s  | 10%
MSI    | 0.05 s | 15%
35. Results - GENETIC

Engine | ABT    | Wait
OU     | 0.02 s | 20%
NC     | 0.02 s | 9%
MSI    | 0.01 s | 1%
36. Results - XSTATE

Engine | ABT   | Wait
OU     | 3.1 s | 8%
NC     | 2.7 s | 2%
MSI    | 2.6 s | 9%
37. Observations

- ENGINE_NC
- Speedup for all workloads (but no consistency)
- Scalability concerns: broadcasts on every write!
- ENGINE_OU
- Speedup for workload with lightest communication load
- Scalability concerns: unicast on every read/write!
- ENGINE_MSI
- Reduces network traffic; has speedup for some workloads
- Impressive performance on SEA
- Effect of heterogeneity is significant!
- Est. fastest machine spends 25-35% of its time waiting for other machines to arrive at barriers
38. Conclusions

- Performance for DSM/IP implementation is marginal for small networks (N ≤ 4)
- Coherence engine significantly affects performance
- Naïve implementation of SC has far too much overhead
39. Conclusions

- Common case might not be fast enough
- Best case
- Single load or store becomes a function call, ~10 instrs.
- Worst case
- Single load or store becomes an RTT on the network (1M-100M instruction times)
- Average case (depends on engine)
- LD/ST becomes 100-1K instruction times?
40. More Questions

- How might other engine implementations behave?
- Is the communication substrate to blame for all performance problems?
- How would DSM/IP perform for N = 8, 16, 64, 128?
41. Questions

42.

43. Initial Synchronization

44. Initial Synchronization

45. Initial Synchronization