ECE 1747: Parallel Programming

Transcript and Presenter's Notes
1
ECE 1747: Parallel Programming
  • Basics of Parallel Architectures
  • Shared-Memory Machines

2
Two Parallel Architectures
  • Shared memory machines.
  • Distributed memory machines.

3
Shared Memory Logical View
[Diagram: processors proc1 .. procN all accessing a single shared memory space]
4
Shared Memory Machines
  • Small number of processors: shared memory with
    coherent caches (SMP).
  • Larger number of processors: distributed shared
    memory with coherent caches (CC-NUMA).

5
SMPs
  • 2- or 4-processor PCs are now commodity.
  • Good price/performance ratio.
  • Memory is sometimes a bottleneck (see later).
  • Typical price (8-node): 20-40k.

6
Physical Implementation
[Diagram: proc1 .. procN, each with a private cache (cache1 .. cacheN), connected by a bus to shared memory]
7
Shared Memory Machines
  • Small number of processors: shared memory with
    coherent caches (SMP).
  • Larger number of processors: distributed shared
    memory with coherent caches (CC-NUMA).

8
CC-NUMA Physical Implementation
[Diagram: proc1 .. procN, each with a private cache and a local memory (mem1 .. memN), connected by an interconnect]
9
Caches in Multiprocessors
  • Suffer from the coherence problem:
  • the same line appears in two or more caches
  • one processor writes a word in the line
  • other processors can now read stale data
  • Leads to the need for a coherence protocol
  • avoids coherence problems
  • Many exist; we will just look at a simple one.

10
What is coherence?
  • What does it mean to be shared?
  • Intuitively: a read returns the last value written.
  • This notion is not well-defined in a system without
    a global clock.

11
The Notion of "Last Written" in a Multiprocessor System
[Timeline diagram: P0 performs r(x), P1 performs w(x), P2 performs w(x), P3 performs r(x); without a global clock it is unclear which write is "last"]
12
The Notion of "Last Written" in a Single-Machine System
[Timeline diagram: w(x), w(x), r(x), r(x) all fall on one timeline, so the last write before each read is well defined]
13
Coherence: a Clean Definition
  • A clean definition is obtained by referring back to
    the single-machine case.
  • Called sequential consistency.

14
Sequential Consistency (SC)
  • Memory is sequentially consistent if and only if
    it behaves as if the processors were executing
    in a time-shared fashion on a single machine.

15
Returning to our Example
[Timeline diagram, revisited: P0 performs r(x), P1 performs w(x), P2 performs w(x), P3 performs r(x)]
16
Another Way of Defining SC
  • All memory references of a single process execute
    in program order.
  • All writes are globally ordered.
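To make these two rules concrete, here is a minimal C11 sketch (an illustration, not part of the slides; file and variable names are assumptions) of the kind of test used in the examples that follow. With the default sequentially consistent atomics, the outcome r(y) == 1 together with r(x) == 0 is impossible: a reader cannot observe the second write without the first also being visible.

    /* sc_litmus.c: illustrative only.
     * Build: cc -std=c11 -pthread sc_litmus.c */
    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    atomic_int x = 0, y = 0;            /* shared, initially 0 */

    void *writer(void *arg) {
        (void)arg;
        atomic_store(&x, 1);            /* w(x,1) */
        atomic_store(&y, 1);            /* w(y,1), after w(x,1) in program order */
        return NULL;
    }

    void *reader(void *arg) {
        (void)arg;
        int ry = atomic_load(&y);       /* r(y) */
        int rx = atomic_load(&x);       /* r(x) */
        /* Under SC, ry == 1 && rx == 0 cannot happen. */
        printf("r(y)=%d r(x)=%d\n", ry, rx);
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, writer, NULL);
        pthread_create(&t2, NULL, reader, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        return 0;
    }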

17
SC Example 1
Initial values of x,y are 0.
[Diagram: operations w(x,1), w(y,1), r(x), r(y) distributed across processors]
What are possible final values?
18
SC Example 2
[Diagram: operations w(x,1), w(y,1), r(y), r(x) distributed across processors]
19
SC Example 3
[Diagram: operations w(x,1), w(y,1), r(y), r(x) distributed across processors]
20
SC Example 4
[Diagram: operations r(x), w(x,1), w(x,2), r(x) distributed across processors]
21
Implementation
  • There are many ways of implementing SC.
  • In fact, implementations sometimes provide stronger
    conditions.
  • We will look at a simple one: the MSI protocol.

22
Physical Implementation
[Diagram: proc1 .. procN, each with a private cache (cache1 .. cacheN), connected by a bus to shared memory]
23
Fundamental Assumption
  • The bus is a reliable, ordered broadcast bus.
  • Every message sent by a processor is received by
    all other processors in the same order.
  • Also called a snooping bus
  • Processors (or caches) snoop on the bus.

24
States of a Cache Line
  • Invalid
  • Shared
  • read-only, one of many cached copies
  • Modified
  • read-write, sole valid copy

25
Processor Transactions
  • processor read(x)
  • processor write(x)

26
Bus Transactions
  • bus read(x)
  • asks for copy with no intent to modify
  • bus read-exclusive(x)
  • asks for copy with intent to modify

27
State Diagram Step 0
[Diagram: the three line states I (Invalid), S (Shared), M (Modified), with no transitions yet]
28
State Diagram Step 1
[Diagram adds: I --PrRd/BuRd--> S]
29
State Diagram Step 2
[Diagram adds: S --PrRd/- --> S (read hit)]
30
State Diagram Step 3
[Diagram adds: S --PrWr/BuRdX--> M]
31
State Diagram Step 4
[Diagram adds: I --PrWr/BuRdX--> M]
32
State Diagram Step 5
[Diagram adds: M --PrWr/- --> M (write hit)]
33
State Diagram Step 6
[Diagram adds: M --BuRd/Flush--> S]
34
State Diagram Step 7
[Diagram adds: S --BuRd/- --> S]
35
State Diagram Step 8
[Diagram adds: S --BuRdX/- --> I]
36
State Diagram Step 9
[Diagram adds: M --BuRdX/Flush--> I]
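The completed diagram can be read as a small state machine per cache line. Below is a minimal, simulation-style C sketch of that machine under assumed event and action names; it is an illustration of the transitions above, not a real cache controller.

    /* msi.c: illustrative per-cache-line MSI state machine
     * (names and structure are assumptions for this sketch). */
    typedef enum { INVALID, SHARED, MODIFIED } line_state_t;
    typedef enum { PR_RD, PR_WR, BU_RD, BU_RDX } event_t;     /* processor or snooped bus event */
    typedef enum { NONE, BUS_RD, BUS_RDX, FLUSH } action_t;   /* bus transaction to issue */

    /* Given the current state of a line and an event, update the state
     * and return the action the cache must take on the bus. */
    action_t msi_transition(line_state_t *state, event_t ev) {
        switch (*state) {
        case INVALID:
            if (ev == PR_RD) { *state = SHARED;   return BUS_RD;  }   /* PrRd/BuRd  */
            if (ev == PR_WR) { *state = MODIFIED; return BUS_RDX; }   /* PrWr/BuRdX */
            return NONE;                     /* bus traffic is ignored when Invalid */
        case SHARED:
            if (ev == PR_RD)  return NONE;                             /* PrRd/-     */
            if (ev == PR_WR)  { *state = MODIFIED; return BUS_RDX; }   /* PrWr/BuRdX */
            if (ev == BU_RD)  return NONE;                             /* BuRd/-     */
            if (ev == BU_RDX) { *state = INVALID;  return NONE;    }   /* BuRdX/-    */
            return NONE;
        case MODIFIED:
            if (ev == PR_RD || ev == PR_WR) return NONE;               /* PrRd/-, PrWr/- */
            if (ev == BU_RD)  { *state = SHARED;  return FLUSH; }      /* BuRd/Flush  */
            if (ev == BU_RDX) { *state = INVALID; return FLUSH; }      /* BuRdX/Flush */
            return NONE;
        }
        return NONE;
    }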
37
In Reality
  • Most machines use a slightly more complicated
    protocol (4 states instead of 3).
  • See architecture books (MESI protocol).

38
Problem: False Sharing
  • Occurs when two or more processors access
    different data in the same cache line, and at least
    one of them writes.
  • Leads to a ping-pong effect.

39
False Sharing Example (1 of 3)
  • #pragma omp parallel for schedule(cyclic)
  • for (i = 0; i < n; i++)
  •     a[i] = b[i];
  • Let's assume:
  • p = 2
  • each element of a takes 4 words
  • a cache line has 32 words
  • (a runnable variant of this loop appears after the
    diagrams below)

40
False Sharing Example (2 of 3)
[Diagram: one cache line holding a[0] .. a[7]; under the cyclic schedule with p = 2, even-indexed elements are written by processor 0 and odd-indexed elements by processor 1]
41
False Sharing Example (3 of 3)
[Diagram: the cache line ping-pongs between P0 and P1; each write of a[0], a[1], a[2], ... triggers an invalidation and a data transfer between the two caches]
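For reference, here is a runnable C/OpenMP sketch of the loop from the example. schedule(static, 1) is used as the standard OpenMP spelling of the slide's cyclic schedule; the array size, element type, and the contrasting block-scheduled version are assumptions for illustration.

    /* false_sharing.c: sketch of the loop above.
     * Build (GCC/Clang): cc -O2 -fopenmp false_sharing.c */
    #include <stdio.h>

    #define N (1 << 20)
    static double a[N], b[N];

    int main(void) {
        for (int i = 0; i < N; i++) b[i] = i;

        /* Cyclic distribution: with 2 threads, a[i] and a[i+1] are written by
         * different threads but sit in the same cache line, so the line
         * ping-pongs between the two caches (false sharing). */
        #pragma omp parallel for schedule(static, 1) num_threads(2)
        for (int i = 0; i < N; i++)
            a[i] = b[i];

        /* Block distribution: each thread writes a contiguous half of a[],
         * so (except at the single boundary line) no cache line is written
         * by both threads, and the ping-pong effect disappears. */
        #pragma omp parallel for schedule(static) num_threads(2)
        for (int i = 0; i < N; i++)
            a[i] = b[i];

        printf("a[N-1] = %f\n", a[N - 1]);
        return 0;
    }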
42
Summary
  • Sequential consistency.
  • Bus-based coherence protocols.
  • False sharing.

43
Algorithms for Scalable Synchronization on
Shared-Memory Multiprocessors
  • J.M. Mellor-Crummey, M.L. Scott
  • (MCS Locks)

44
Introduction
  • Busy-waiting techniques are heavily used for
    synchronization on shared-memory MPs
  • Two general categories: locks and barriers
  • Locks ensure mutual exclusion
  • Barriers provide phase separation in an
    application

45
Problem
  • Busy-waiting synchronization constructs tend to:
  • have a significant impact on network traffic due to
    cache invalidations
  • Contention leads to poor scalability
  • Main cause: spinning on remote variables

46
The Proposed Solution
  • Minimize access to remote variables
  • Instead, spin on local variables
  • Claim:
  • It can be done all in software (no need for fancy
    and costly hardware support)
  • Spinning on local variables will minimize
    contention, allow for good scalability, and give
    good performance

47
Spin Lock 1: Test-and-Set Lock
  • Repeatedly test-and-set a boolean flag indicating
    whether the lock is held
  • Problem: contention for the flag
    (read-modify-write instructions are expensive)
  • Causes lots of network traffic, especially on
    cache-coherent architectures (because of cache
    invalidations)
  • Variation: test-and-test-and-set generates less
    traffic

48
Test-and-Set with Backoff Lock
  • Pause between successive test-and-set attempts
    (backoff)
  • TS with backoff idea (sketched in code below):
  • while test_and_set(L) fails
  •     pause(delay)
  •     delay = delay * 2
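A minimal C11 sketch of this idea, assuming C11 atomics; the delay loop, the cap on the delay, and the names are illustrative assumptions. The inner read-only loop is the test-and-test-and-set refinement from the previous slide.

    /* tas_backoff.c: test-and-test-and-set lock with exponential backoff
     * (illustrative sketch; constants and names are assumptions). */
    #include <stdatomic.h>

    typedef struct { atomic_int held; } tas_lock_t;      /* 0 = free, 1 = held */

    static void pause_for(unsigned delay) {
        /* crude busy delay; real code might use a pause/yield hint instead */
        for (volatile unsigned i = 0; i < delay; i++)
            ;
    }

    void tas_acquire(tas_lock_t *l) {
        unsigned delay = 1;
        while (atomic_exchange(&l->held, 1) != 0) {       /* test_and_set(L) fails */
            pause_for(delay);
            if (delay < (1u << 16))
                delay *= 2;                               /* delay = delay * 2 */
            while (atomic_load(&l->held) != 0)            /* test before retrying TAS */
                ;
        }
    }

    void tas_release(tas_lock_t *l) {
        atomic_store(&l->held, 0);
    }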

49
Spin Lock 2: The Ticket Lock
  • Two counters (nr_requests and nr_releases)
  • Lock acquire: fetch-and-increment on the
    nr_requests counter; wait until the ticket obtained
    is equal to the value of the nr_releases counter
  • Lock release: increment the nr_releases counter

50
Spin Lock 2: The Ticket Lock
  • Advantage over TS: polls with read operations
    only
  • Still generates lots of traffic and contention
  • Can further improve by using backoff (see the
    sketch below)
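A minimal C11 sketch of the ticket lock as described above; the counter names follow the slide, while the use of C11 atomics and the absence of proportional backoff are assumptions of this sketch.

    /* ticket_lock.c: illustrative ticket lock using C11 atomics. */
    #include <stdatomic.h>

    typedef struct {
        atomic_uint nr_requests;    /* next ticket to hand out */
        atomic_uint nr_releases;    /* ticket currently being served */
    } ticket_lock_t;

    void ticket_acquire(ticket_lock_t *l) {
        /* fetch-and-increment gives each acquirer a unique ticket */
        unsigned my_ticket = atomic_fetch_add(&l->nr_requests, 1);
        /* poll with reads only until our ticket comes up
         * (proportional backoff could be inserted in this loop) */
        while (atomic_load(&l->nr_releases) != my_ticket)
            ;
    }

    void ticket_release(ticket_lock_t *l) {
        /* only the lock holder advances nr_releases */
        atomic_fetch_add(&l->nr_releases, 1);
    }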

51
Array-Based Queueing Locks
  • Each CPU spins on a different location, in a
    distinct cache line
  • Each CPU clears the lock for its successor (sets
    it from must-wait to has-lock)
  • Lock-acquire:
  • while (slots[my_place] == must_wait) ;
  • Lock-release:
  • slots[(my_place + 1) % P] = has_lock
  • (a fuller C sketch follows below)
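A minimal C11 sketch of an array-based queueing lock in the style described above. The constants P and CACHE_LINE, the padding layout, and the requirement that the caller keep my_place between acquire and release are assumptions of this sketch; it also assumes at most P CPUs compete at once.

    /* array_lock.c: illustrative array-based queueing lock. */
    #include <stdatomic.h>

    #define P 64                     /* max number of competing CPUs (power of 2) */
    #define CACHE_LINE 64            /* assumed line size in bytes */

    enum { MUST_WAIT = 0, HAS_LOCK = 1 };

    typedef struct {
        struct {
            atomic_int flag;
            char pad[CACHE_LINE - sizeof(atomic_int)];   /* one slot per cache line */
        } slots[P];
        atomic_uint next_slot;       /* fetch-and-increment hands out places */
    } array_lock_t;

    /* Initialization: slots[0].flag = HAS_LOCK, all other flags = MUST_WAIT. */

    unsigned array_acquire(array_lock_t *l) {
        unsigned my_place = atomic_fetch_add(&l->next_slot, 1) % P;
        while (atomic_load(&l->slots[my_place].flag) == MUST_WAIT)
            ;                        /* spin on our own cache line only */
        return my_place;             /* caller passes this to array_release */
    }

    void array_release(array_lock_t *l, unsigned my_place) {
        atomic_store(&l->slots[my_place].flag, MUST_WAIT);            /* reset own slot */
        atomic_store(&l->slots[(my_place + 1) % P].flag, HAS_LOCK);   /* pass the lock on */
    }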

52
List-Based Queueing Locks (MCS Locks)
  • Spins on local flag variables only
  • Requires a small constant amount of space per lock

53
List-Based Queueing Locks (MCS Locks)
  • CPUs are all in a linked list; upon release by the
    current CPU, the lock is acquired by its successor
  • Spinning is on a local flag
  • Lock points at the tail of the queue (null if not
    held)
  • Compare-and-swap allows a CPU to detect whether it
    is the only processor in the queue and to atomically
    remove itself from the queue

54
List-Based Queueing Locks (MCS Locks)
  • Spin in acquire_lock: waits for the lock to become
    free
  • Spin in release_lock: compensates for the time
    window between fetch-and-store and the assignment to
    predecessor->next in acquire_lock
  • Without compare_and_swap, release is cumbersome
    (see the sketch below)
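A minimal C11 sketch of the MCS acquire and release described above. The queue node is supplied by the caller (typically one per CPU or thread); the use of C11 atomics with default sequential consistency is an assumption, and this is a simplified rendering rather than a drop-in version of the paper's pseudocode. The second spin in release covers the fetch-and-store to predecessor->next window mentioned on the slide.

    /* mcs_lock.c: illustrative MCS list-based queueing lock. */
    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stddef.h>

    typedef struct mcs_node {
        struct mcs_node *_Atomic next;
        atomic_bool locked;                 /* true = must wait */
    } mcs_node_t;

    typedef struct { mcs_node_t *_Atomic tail; } mcs_lock_t;   /* NULL = free */

    void mcs_acquire(mcs_lock_t *l, mcs_node_t *me) {
        atomic_store(&me->next, (mcs_node_t *)NULL);
        /* fetch-and-store: atomically append ourselves at the tail */
        mcs_node_t *pred = atomic_exchange(&l->tail, me);
        if (pred != NULL) {                 /* queue was not empty */
            atomic_store(&me->locked, true);
            atomic_store(&pred->next, me);  /* link behind our predecessor */
            while (atomic_load(&me->locked))   /* spin on our own flag only */
                ;
        }
    }

    void mcs_release(mcs_lock_t *l, mcs_node_t *me) {
        mcs_node_t *succ = atomic_load(&me->next);
        if (succ == NULL) {
            /* no visible successor: if we are still the tail, compare-and-swap
             * the lock back to free and return */
            mcs_node_t *expected = me;
            if (atomic_compare_exchange_strong(&l->tail, &expected, (mcs_node_t *)NULL))
                return;
            /* otherwise a successor is between its fetch-and-store and its
             * assignment to our next field: wait for the link to appear */
            while ((succ = atomic_load(&me->next)) == NULL)
                ;
        }
        atomic_store(&succ->locked, false); /* hand the lock to the successor */
    }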

55
The MCS Tree-Based Barrier
  • Uses a pair of trees over the P (number of CPUs)
    processors: an arrival tree and a wakeup tree
  • Arrival tree: each node has 4 children
  • Wakeup tree: binary tree
  • Fastest way to wake up all P processors

56
Hardware Description
  • BBN Butterfly 1: DSM multiprocessor
  • Supports up to 256 CPUs, 80 used in experiments
  • Atomic primitives: fetch_and_add,
    fetch_and_store (swap), test_and_set
  • Sequent Symmetry Model B: cache-coherent,
    shared-bus multiprocessor
  • Supports up to 30 CPUs, 18 used in experiments
  • Snooping cache-coherence protocol
  • Neither supports compare-and-swap

57
Measurement Technique
  • Results averaged over 10k (Butterfly) or 100k
    (Symmetry) acquisitions
  • For 1 CPU, time represents latency between
    acquire and release of lock
  • Otherwise, time represents time elapsed between
    successive acquisitions

58
Spin Locks on Butterfly
59
Spin Locks on Butterfly
60
Spin Locks on Butterfly
  • Anderson's lock fares poorer because the Butterfly
    lacks coherent caches, and CPUs may spin on
    statically unpredictable locations which may
    not be local
  • TS with exponential backoff, the Ticket lock with
    proportional backoff, and MCS all scale very well,
    with slopes of 0.0025, 0.0021, and 0.00025 µs,
    respectively

61
Spin Locks on Symmetry
62
Spin Locks on Symmetry
63
Latency and Impact of Spin Locks
64
Latency and Impact of Spin Locks
  • Latency results are poor on the Butterfly because:
  • Atomic operations are inordinately expensive in
    comparison to non-atomic ones
  • 16-bit atomic primitives on the Butterfly cannot
    manipulate 24-bit pointers

65
Barriers on Butterfly
66
Barriers on Butterfly
67
Barriers on Symmetry
68
Barriers on Symmetry
  • Results differ from the Butterfly because:
  • More CPUs can spin on the same location (each has
    its own copy in its local cache)
  • Distributing writes across different memory
    modules yields no benefit because the bus
    serializes all communication

69
Conclusions
  • Criteria for evaluating spin locks
  • Scalability and induced network load
  • Single-processor latency
  • Space requirements
  • Fairness
  • Implementability with available atomic operations

70
Conclusions
  • The MCS lock algorithm scales best, together with
    array-based queueing on cache-coherent machines
  • TS and Ticket locks with proper backoff also
    scale well, but incur more network load
  • Anderson's and G&T's (Graunke and Thakkar) locks
    have prohibitive space requirements for large
    numbers of CPUs

71
Conclusions
  • MCS, array-based, and Ticket locks guarantee
    fairness (FIFO ordering)
  • MCS benefits significantly from the existence of
    compare-and-swap
  • MCS is best when contention is expected: excellent
    scaling, FIFO ordering, least interconnect
    contention, low space requirements