1
Distributed Shared Memory over IP Networks - Revisited
  • Dan Gibson
  • Chuck Tsen

"Which would you rather use to plow a field: two
strong oxen, or 1024 chickens?" -Seymour Cray
CS/ECE 757, Spring 2005. Instructor: Mark Hill
2
DSM over IP
  • Abstract a single, large, powerful machine from a
    group of many commodity machines
  • Workstations can be idle for long periods
  • Networking infrastructure already exists
  • Increase performance through parallel execution
  • Increase utilization of existing hardware
  • Fault tolerance, fault isolation

3
Overview - 1
  • Source-level distributed shared memory system for
    x86-Linux
  • Code written with DSM/IP API
  • Memory is identically named at high level (C/C++)
  • Only explicitly shared variables are shared
  • User-level library checks coherence on each
    access
  • Similar to programming in an SMP environment

4
Overview - 2
  • Evaluated on a small (Ethernet) network of
    heterogeneous machines
  • 4 x86-based machines of varying quality
  • Small, simulated workloads
  • Application-dependent performance
  • Communication latency must be tolerated
  • Relaxed consistency models improve performance
    for some workloads
  • Heterogeneity vastly affects performance
  • All machines operate as slowly as the slowest
    machine in barrier-synchronized workloads
  • A Pentium 4 at 4.0 GHz is faster than a Pentium 3
    at 600 MHz!

5
Motivation
  • DSMoIP is similar to building an SMP on an
    unordered interconnect
  • Shared-memory over message-passing
  • Race conditions possible
  • Coherence concerns
  • DSMoIP offers a tangible result
  • Runs on real HW

6
Challenges
  • IP on Ethernet is slow!
  • Point-to-point peak bandwidth <12 MB/s
  • Latency ~100 µs, RTT ~200 µs
  • Common case must be fast
  • Software check of coherence on every access
  • Many coherence changes require at least one RTT
  • IP semantics difficult for correctness
  • Packets can be
  • Dropped
  • Delivered More than Once
  • Delivered Very Late, etc.

7
Related Projects
  • TreadMarks
  • DSM over UDP/IP, page-level granularity,
    source-level compatibility, release consistency
  • Brazos (Rice)
  • DSM using multicast IP, MPI compatibility, scope
    consistency, source-level compatibility
  • Plurix (Universität Ulm)
  • Distributed OS for shared memory, Java-source
    compatible

8
More Related Projects
  • Quarks (Utah)
  • DSM over IP, user-level package interface,
    source-level compatibility, many coherence
    options
  • SHRIMP (Princeton), StarDust (IRISA)
  • And more
  • DSM over IP is not a new idea
  • Most have demonstrated speedup for some workloads

9
Implementation Overview
  • All participating machines have space reserved
    for each shared object at all times
  • Each shared object has accompanying coherence
    data
  • Depending on coherence, the object may or may not
    be accessible
  • If permissions are needed on a shared object,
    network communication is initiated
  • Identical semantics to a long-latency memory
    operation

10
Implementation Overview
11
Implementation Overview
  • Shared memory objects use altered syntax for
    declaration and access
  • User-level DSMoIP library, consisting of
  • Communication backbone
  • Coherence engine
  • Programmer uses a small API for setup and
    synchronization

12
API - Accessing Shared Objects
  • SMP
  • a = x + b
  • x = y - 1
  • z = array[i] + 7
  • array[i]++
  • DSMoIP
  • a = x.Read() + b
  • x.Write(y.Read() - 1)
  • z = array.Read(i) + 7
  • array.Write(array.Read(i) + 1, i)

Variables x, y, z, and array are shared; all
others are unshared.
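A minimal sketch of what such a wrapper API might look like; Shared<T>, SharedArray<T>, and the permission placeholders are illustrative, not the actual DSM/IP library:

    // Hypothetical sketch of the shared-object wrapper idea; class and
    // method names are illustrative, not the real DSM/IP classes.
    #include <cstddef>
    #include <cstdio>
    #include <vector>

    template <typename T>
    class Shared {
    public:
        T Read() {
            // Placeholder: the real library would consult the coherence
            // engine here and possibly block on network traffic.
            return value_;
        }
        void Write(const T& v) {
            // Placeholder: acquire write permission first, then update.
            value_ = v;
        }
    private:
        T value_{};
    };

    template <typename T>
    class SharedArray {
    public:
        explicit SharedArray(std::size_t n) : elems_(n) {}
        T Read(std::size_t i) { return elems_[i].Read(); }
        void Write(const T& v, std::size_t i) { elems_[i].Write(v); }
    private:
        std::vector<Shared<T>> elems_;
    };

    int main() {
        Shared<int> x, y;
        SharedArray<int> array(16);
        int a, b = 2, z, i = 3;
        a = x.Read() + b;                   // a = x + b
        x.Write(y.Read() - 1);              // x = y - 1
        z = array.Read(i) + 7;              // z = array[i] + 7
        array.Write(array.Read(i) + 1, i);  // array[i]++
        std::printf("a=%d z=%d\n", a, z);
        return 0;
    }
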
13
API - Accessing Shared Objects
  • Naming of shared objects is syntactically
    different, but semantically unchanged
  • Use of .Read() and .Write() functions allows DSM
    to interpose if necessary
  • Use of C/C++ operator overloading can remove some
    of the changes in syntax (not implemented)
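As a rough illustration of that last bullet, C++ conversion and assignment operators could hide the Read()/Write() calls behind ordinary syntax. This is only a sketch of the idea the slide marks as unimplemented:

    // Sketch: operator overloading hides .Read()/.Write() behind
    // ordinary assignment and expression syntax.
    template <typename T>
    class Shared {
    public:
        operator T() { return Read(); }   // using x as a value calls Read()
        Shared& operator=(const T& v) {   // x = v calls Write()
            Write(v);
            return *this;
        }
        T Read() { /* coherence check would go here */ return value_; }
        void Write(const T& v) { /* coherence check */ value_ = v; }
    private:
        T value_{};
    };

    int main() {
        Shared<int> x, y;
        x = 41;                  // Write(41)
        y = x + 1;               // Read(), then Write(42)
        return y == 42 ? 0 : 1;  // conversion operator calls Read()
    }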

14
Communication Backbone (CB)
  • A substrate on which to build coherence
    mechanisms
  • Provides primitives for communication
  • Guaranteed delivery over an unreliable network
  • At-most-once delivery
  • Arbitrary message passing to a logical thread or
    broadcast to all threads
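The slides leave the backbone's internals out, but the standard recipe for these guarantees over UDP is per-sender sequence numbers, acks with retransmission, and duplicate suppression. A sketch of that bookkeeping follows; the udp_* calls in comments are assumed stand-ins, not a real API:

    // Sketch of backbone bookkeeping for guaranteed, at-most-once
    // delivery over unreliable UDP/IP; socket plumbing is omitted.
    #include <cstdint>
    #include <map>
    #include <set>
    #include <utility>
    #include <vector>

    struct Message {
        uint32_t sender;            // logical thread id of the sender
        uint32_t seq;               // per-sender sequence number
        std::vector<char> payload;
    };

    class ReliableChannel {
    public:
        // Sender: keep a copy until acked; a timer would retransmit it.
        void Send(const Message& m) {
            unacked_[m.seq] = m;
            // udp_sendto(m) -- retransmit on timeout until OnAck(m.seq)
        }
        void OnAck(uint32_t seq) { unacked_.erase(seq); }

        // Receiver: ack every copy, but deliver each (sender, seq) once.
        bool OnReceive(const Message& m) {
            // udp_send_ack(m.sender, m.seq) -- re-ack even duplicates
            return seen_.insert({m.sender, m.seq}).second;  // false = dup
        }
    private:
        std::map<uint32_t, Message> unacked_;           // awaiting acks
        std::set<std::pair<uint32_t, uint32_t>> seen_;  // delivered ids
    };

    int main() {
        ReliableChannel ch;
        Message m{1, 7, {'h', 'i'}};
        ch.OnReceive(m);             // first copy: delivered
        bool dup = ch.OnReceive(m);  // retransmission: suppressed
        return dup ? 1 : 0;
    }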

15
Coherence Engine (CE)
  • Uses primitives in communication backbone to
    abstract shared memory from a message-passing
    system
  • Swappable, based on needs of application
  • CE determines consistency model
  • CE plays a major role in performance
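One plausible shape for that swappable interface, sketched below; the hook names are assumptions, since the slides do not show the library's actual signatures:

    // Sketch of a swappable coherence-engine interface; the DSM library
    // would call these hooks around every shared access.
    #include <cstddef>

    class CoherenceEngine {
    public:
        virtual ~CoherenceEngine() = default;
        virtual void BeforeRead(std::size_t objectId) = 0;   // may block
        virtual void BeforeWrite(std::size_t objectId) = 0;  // may block
        // Invoked by the communication backbone on incoming traffic.
        virtual void OnMessage(const char* data, std::size_t len) = 0;
    };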

16
CE/CB Interaction Example
[Diagram] MACHINE 1: a / None; MACHINE 2: a / Read
17
CE/CB Interaction Example
[Diagram] MACHINE 1: a / Read; MACHINE 2: a / Read
18
CE/CB Interaction Example
[Diagram] "Already have perm." MACHINE 1: a / Write; MACHINE 2: a / None
19
CE/CB Interaction Example
[Diagram] MACHINE 1: a / Write; MACHINE 2: a / None
20
CE/CB Interaction Example
[Diagram] MACHINE 1: a / None; MACHINE 2: a / Write
21
Coherence Engines Implemented
  • ENGINE_NC
  • Fast, simple engine that pays no attention to
    consistency
  • Reads can occur at any time
  • Writes are broadcast, can be observed by any
    processor in any order, and are not guaranteed to
    ever become visible at all processors
  • Communicates only on writes, non-blocking
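A minimal sketch of ENGINE_NC's three paths, assuming a backbone broadcast primitive; all names here are illustrative, not the project's actual code:

    // Sketch of ENGINE_NC: reads are purely local, writes are broadcast
    // fire-and-forget, and remote updates apply in whatever order they
    // happen to arrive.
    #include <cstdint>

    struct WriteUpdate { uint32_t objectId; int value; };

    static void BroadcastToAll(const WriteUpdate&) { /* backbone bcast */ }

    static int localCopy[64];  // every machine keeps a copy of each object

    int NC_Read(uint32_t id) { return localCopy[id]; }  // never communicates

    void NC_Write(uint32_t id, int v) {
        localCopy[id] = v;
        BroadcastToAll({id, v});  // non-blocking; no ordering guarantees
    }

    void NC_OnUpdate(const WriteUpdate& u) { localCopy[u.objectId] = u.value; }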

22
Coherence Engines Implemented
  • ENGINE_OU
  • Owned/Unowned
  • Sequentially-consistent engine
  • Ensures SC by actually enforcing total order of
    accesses
  • Each shared object is valid at only one processor
    for the duration of execution
  • Slow, cumbersome (nearly every access generates
    traffic)
  • Blocking, high-latency reads and writes
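The owned/unowned scheme can be pictured as a single migrating valid copy per object; a minimal sketch, with the ownership transfer stubbed out (function names are illustrative):

    // Sketch of ENGINE_OU: one valid copy per object; both reads and
    // writes block until ownership migrates here, serializing accesses.
    #include <cstdint>

    struct OuObject {
        bool ownedHere = false;
        int value = 0;
    };

    static OuObject object[64];

    static void FetchOwnership(uint32_t id) {
        // Unicast a request to the current owner and block until the
        // object (and ownership) arrives -- roughly one RTT per miss.
        object[id].ownedHere = true;
    }

    int OU_Read(uint32_t id) {
        if (!object[id].ownedHere) FetchOwnership(id);
        return object[id].value;
    }

    void OU_Write(uint32_t id, int v) {
        if (!object[id].ownedHere) FetchOwnership(id);
        object[id].value = v;
    }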

23
Coherence Engines Implemented
  • ENGINE_MSI
  • Based on MSI cache coherence protocol
  • Shared objects can exist in multiple locations in
    S state
  • Writing is exclusive to a single, M-state machine
  • Communicates as needed by typical MSI protocol
  • Sequentially consistent
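A sketch of the per-object state machine an MSI-style engine maintains; the GetShared/GetExclusive message names are borrowed from textbook MSI, not necessarily the project's:

    // Sketch of an MSI-style per-object state machine; network sends
    // are stubbed out as comments.
    enum class MsiState { Modified, Shared, Invalid };

    struct MsiObject {
        MsiState state = MsiState::Invalid;

        void OnLocalRead() {
            if (state == MsiState::Invalid) {
                // send GetShared; block until a copy of the data arrives
                state = MsiState::Shared;
            }                          // Shared and Modified read locally
        }
        void OnLocalWrite() {
            if (state != MsiState::Modified) {
                // send GetExclusive; block until all sharers invalidate
                state = MsiState::Modified;
            }
        }
        void OnRemoteGetShared() {     // another machine wants to read
            if (state == MsiState::Modified) {
                // reply with data, then keep a read-only copy
                state = MsiState::Shared;
            }
        }
        void OnRemoteGetExclusive() {  // another machine wants to write
            // reply with data if Modified, then drop our copy
            state = MsiState::Invalid;
        }
    };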

24
Evaluation
  • DSM/IP evaluated using four heterogeneous
    x86/Linux machines
  • Microbenchmarks test correctness
  • Simulated workloads explore performance issues
  • Early results indicate that DSM/IP overwhelms a
    100 Mb/s LAN
  • 1-10k broadcast packets/second per machine
  • TCP/IP streams in subnet are suppressed by UDP
    packets in absence of Quality of Service measures
  • Process is self-throttling, but makes DSM/IP
    unattractive to network administrators

25
Measurement
  • Event counting
  • Regions of DSM Library augmented with event
    counters
  • Timers
  • gettimeofday(), setitimer() for low-granularity
    timing
  • Built-in x86 cycle timers for high-granularity
    measurement
  • Validated with gettimeofday()

26
Timer Validation
  • Assumed system-call-based timers are accurate
  • x86 cycle timers validated for long executions
    against gettimeofday()
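A sketch of the cycle-counter approach on x86 Linux, cross-checked against gettimeofday() as described; sleep(1) stands in for the interval under measurement, and a fixed-frequency TSC is assumed (reasonable for the 2005-era machines):

    // Read the x86 time-stamp counter and validate it against
    // gettimeofday() over a known interval.
    #include <cstdint>
    #include <cstdio>
    #include <sys/time.h>
    #include <unistd.h>

    static inline uint64_t rdtsc() {
        uint32_t lo, hi;
        __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
        return ((uint64_t)hi << 32) | lo;
    }

    int main() {
        timeval t0{}, t1{};
        gettimeofday(&t0, nullptr);
        uint64_t c0 = rdtsc();

        sleep(1);  // the interval under measurement

        uint64_t c1 = rdtsc();
        gettimeofday(&t1, nullptr);

        double us = (t1.tv_sec - t0.tv_sec) * 1e6
                  + (t1.tv_usec - t0.tv_usec);
        printf("%llu cycles over %.0f us => ~%.1f cycles/us\n",
               (unsigned long long)(c1 - c0), us, (c1 - c0) / us);
        return 0;
    }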

27
Microbenchmarks
  • message_test
  • Each thread sends a specified number of messages
    to the next thread
  • Attempts to overload network/backbone to produce
    an incorrect execution
  • barrier_test
  • Each thread iterates over a specified number of
    barriers
  • Attempts to cause failures in DSM_Barrier()
    mechanism
  • lock_test
  • All threads attempt to acquire a lock and
    increment global variable
  • Puts maximal pressure on the DSM_Lock primitive
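A single-machine sketch of lock_test's loop; the DSM_Lock/DSM_Unlock signatures and stub bodies are assumptions, since the slides name the primitive but not its interface:

    // Sketch of lock_test: every thread increments one shared counter
    // under the distributed lock. Stubs stand in for DSM primitives.
    #include <cstdio>

    static void DSM_Lock(int /*lockId*/)   { /* block until acquired */ }
    static void DSM_Unlock(int /*lockId*/) { /* release, notify waiters */ }

    static int counter = 0;  // stands in for a shared DSM object

    int main() {
        const int nIncs = 1000000;
        for (int i = 0; i < nIncs; ++i) {
            DSM_Lock(0);
            counter = counter + 1;  // read-modify-write, safe under lock
            DSM_Unlock(0);
        }
        // With N machines and a correct lock, final value is N * nIncs.
        printf("final value: %d\n", counter);
        return 0;
    }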

28
Microbenchmarks - Validation
  • message_test
  • nMessages = 1M, N = 1, 4: no lost messages
    (runtime ~2 minutes)
  • barrier_test
  • nBarriers = 1M, N = 1, 4: consistent
    synchronization (runtime ~4 minutes)
  • lock_test
  • nIncs = 1M, ENGINE_OU (SC), N = 1, 4: final value
    of variable = N million (correct)

29
Simulated Workloads
  • Three simulated workloads
  • sea, similar to OCEAN from SPLASH-2
  • genetic, a typical iterative genetic algorithm
  • xstate, an exhaustive solver for complex
    expressions
  • Implementations are deliberately simplified for
    quick development

[Diagram: workloads ordered along a "More Communication" axis]
30
Simulated Workloads - sea
  • Integer version of SPLASH-2's OCEAN
  • Simple iterative simulation, taking averages of
    surrounding points
  • Uses large integer array
  • Coherence granularity depends on number of
    threads
  • Uses DSM_Barrier() primitives for synchronization
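A sketch of sea's iterate-and-average pattern, reduced to one dimension for brevity; DSM_Barrier's signature is assumed, and the local vectors stand in for the shared integer array:

    // Sketch of sea's stencil sweep with a barrier between iterations.
    #include <cstdio>
    #include <vector>

    static void DSM_Barrier() { /* all threads wait here between sweeps */ }

    int main() {
        const int n = 16, sweeps = 4;
        std::vector<int> grid(n, 0), next(n, 0);
        grid[n / 2] = 900;  // an initial disturbance to diffuse

        for (int t = 0; t < sweeps; ++t) {
            // Each thread would update its own contiguous slice.
            for (int i = 1; i < n - 1; ++i)
                next[i] = (grid[i - 1] + grid[i] + grid[i + 1]) / 3;
            grid.swap(next);
            DSM_Barrier();  // no thread starts sweep t+1 until all finish t
        }
        for (int v : grid) printf("%d ", v);
        printf("\n");
        return 0;
    }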

31
Simulated Workloads - genetic
  • Multithreaded genetic algorithm
  • Uses a distributed genetic process to breed a
    specific integer from a random population
  • Iterates in two phases
  • Breeding: generate new potential solutions from
    the most fit members of the current population
  • Survival of the fittest: remove the least fit
    solutions from the population
  • Uses DSM_Barrier() for synchronization
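A single-threaded sketch of that two-phase loop; the fitness and crossover rules are placeholders, and DSM_Barrier is again a stub:

    // Sketch of genetic's breed/cull iteration toward a target integer.
    #include <algorithm>
    #include <cstdio>
    #include <cstdlib>
    #include <vector>

    static void DSM_Barrier() { /* all threads synchronize here */ }

    static int fitness(int x, int target) { return -std::abs(x - target); }

    int main() {
        const int target = 757;
        std::vector<int> pop(32);
        for (int& p : pop) p = std::rand() % 1024;  // random population

        for (int gen = 0; gen < 100; ++gen) {
            auto fitter = [&](int a, int b) {
                return fitness(a, target) > fitness(b, target);
            };
            // Phase 1: breeding -- cross the fittest half into new
            // candidates appended to the population.
            std::sort(pop.begin(), pop.end(), fitter);
            const size_t n = pop.size();
            for (size_t i = 0; i + 1 < n / 2; ++i)
                pop.push_back((pop[i] + pop[i + 1]) / 2);
            DSM_Barrier();
            // Phase 2: survival of the fittest -- cull back to size n.
            std::sort(pop.begin(), pop.end(), fitter);
            pop.resize(n);
            DSM_Barrier();
        }
        printf("best: %d (target %d)\n", pop[0], target);
        return 0;
    }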

32
Simulated Workloads - xstate
  • Integer solver for arbitrary expressions
  • Uses space exploration to find fixed points of
    hard-coded expression
  • Employs a global list of solutions, protected
    with a DSM_Lock primitive

33
Methodology
  • Each simulated workload run for each coherence
    engine
  • Normalized parallel-phase runtime (versus
    uniprocessor case) measured with high-resolution
    counters
  • Does not include startup overhead for DSM system
    (1.5 seconds)
  • Does not include other initialization overheads

34
Results - SEA
Engine   ABT      Wait
OU       0.6 s    82%
NC       0.6 s    10%
MSI      0.05 s   15%
35
Results - GENETIC
Engine   ABT      Wait
OU       0.02 s   20%
NC       0.02 s    9%
MSI      0.01 s    1%
36
Results - XSTATE
Engine   ABT      Wait
OU       3.1 s     8%
NC       2.7 s     2%
MSI      2.6 s     9%
37
Observations
  • ENGINE_NC
  • Speedup for all workloads (but no consistency)
  • Scalability concerns: broadcasts on every write!
  • ENGINE_OU
  • Speedup for the workload with the lightest
    communication load
  • Scalability concerns: unicast on every read and
    write!
  • ENGINE_MSI
  • Reduces network traffic; has speedup for some
    workloads
  • Impressive performance on SEA
  • Effect of heterogeneity is significant!
  • Est. fastest machine spends 25-35% of its time
    waiting for other machines to arrive at barriers

38
Conclusions
  • Performance of the DSM/IP implementation is
    marginal for small networks (N ≤ 4)
  • Coherence engine significantly affects
    performance
  • Naïve implementation of SC has far too much
    overhead

39
Conclusions
  • Common case might not be fast enough
  • Best case
  • Single load or store becomes a function call
    (~10 instrs.)
  • Worst case
  • Single load or store becomes an RTT on the network
    (1M-100M instruction times)
  • Average case (depends on engine)
  • LD/ST becomes 100-1K instruction times?

40
More Questions
  • How might other engine implementations behave?
  • Is the communication substrate to blame for all
    performance problems?
  • How would DSM/IP perform for N = 8, 16, 64, 128?

41
Questions
  • SEA Demo

42
  • Extra Slides

43
Initial Synchronization
44
Initial Synchronization
45
Initial Synchronization