Includes slides from course CS162 at UC Berkeley, by prof Anthony D. Joseph and Ion Stoica and from course CS194, by prof. Katherine Yelick - PowerPoint PPT Presentation
1
Includes slides from course CS162 at UC Berkeley,
by prof Anthony D. Joseph and Ion Stoica and from
course CS194, by prof. Katherine Yelick
  • Shared Memory Programming: Synchronization
    primitives
  • Ing. Andrea Marongiu (a.marongiu@unibo.it)

2
Shared Memory Programming
  • Program is a collection of threads of control.
  • Can be created dynamically, mid-execution, in
    some languages
  • Each thread has a set of private variables, e.g.,
    local stack variables
  • Also a set of shared variables, e.g., static
    variables, shared common blocks, or global heap.
  • Threads communicate implicitly by writing and
    reading shared variables.
  • Threads coordinate by synchronizing on shared
    variables

[Figure: shared memory holding s, accessible to all processors P0..Pn; each processor also has its own private memory (e.g., variable y).]
3
Shared Memory code for computing a sum
static int s = 0;

Thread 1:
    for i = 0, n/2-1
        s = s + sqr(A[i])

Thread 2:
    for i = n/2, n-1
        s = s + sqr(A[i])
  • Problem: there is a race condition on variable s
    in the program
  • A race condition or data race occurs when two
    processors (or two threads) access the same
    variable, and at least one does a write
  • The accesses are concurrent (not synchronized), so
    they could happen simultaneously

4
Shared Memory code for computing a sum
static int s = 0;

Thread 1:                              Thread 2:
... compute f(A[i]) and put in reg0    compute f(A[i]) and put in reg0
reg1 = s                               reg1 = s
reg1 = reg1 + reg0                     reg1 = reg1 + reg0
s = reg1                               s = reg1

(Example trace with A = [3,5] and f = square: Thread 1 computes
reg0 = 9, Thread 2 computes reg0 = 25; if both read s = 0 before
either writes, each writes only its own contribution.)
  • Assume A = [3,5], f is the square function, and
    s = 0 initially
  • For this program to work, s should be 34 at the
    end
  • but it may be 34, 9, or 25
  • The atomic operations are reads and writes
  • You never see half of one number, but the update
    of s is not atomic
  • All computations happen in (private) registers

5
Shared Memory code for computing a sum
static int s = 0;

Thread 1:                              Thread 2:
local_s1 = 0                           local_s2 = 0
for i = 0, n/2-1                       for i = n/2, n-1
    local_s1 = local_s1 + sqr(A[i])    local_s2 = local_s2 + sqr(A[i])
s = s + local_s1   /* ATOMIC */        s = s + local_s2   /* ATOMIC */
  • Since addition is associative, it's OK to
    rearrange order
  • Right?
  • Most computation is on private variables
  • Sharing frequency is also reduced, which might
    improve speed
  • But there is still a race condition on the update
    of shared s

6
Atomic Operations
  • To understand a concurrent program, we need to
    know what the underlying indivisible operations
    are!
  • Atomic Operation: an operation that always runs
    to completion or not at all
  • It is indivisible: it cannot be stopped in the
    middle, and its state cannot be modified by
    someone else in the middle
  • Fundamental building block: without atomic
    operations, threads have no way to work
    together
  • On most machines, memory references and
    assignments (i.e. loads and stores) of words are
    atomic

7
Role of Synchronization
  • A parallel computer is a collection of
    processing elements that cooperate and
    communicate to solve large problems fast.
  • Types of Synchronization
  • Mutual Exclusion
  • Event synchronization
  • point-to-point
  • group
  • global (barriers)
  • How much hardware support?

Most used forms of synchronization in shared
memory parallel programming
8
Motivation: Too Much Milk
  • Example: people need to coordinate

9
Definitions
  • Synchronization: using atomic operations to
    ensure cooperation between threads
  • For now, only loads and stores are atomic
  • It is hard to build anything useful with only
    reads and writes
  • Mutual Exclusion: ensuring that only one thread
    does a particular thing at a time
  • One thread excludes the other while doing its
    task
  • Critical Section: a piece of code that only one
    thread can execute at once
  • Critical section and mutual exclusion are two
    ways of describing the same thing
  • The critical section defines sharing granularity

10
More Definitions
  • Lock: prevents someone from doing something
  • Lock before entering a critical section and before
    accessing shared data
  • Unlock when leaving, after accessing shared data
  • Wait if locked
  • Important idea: all synchronization involves
    waiting
  • Example: fix the milk problem by putting a lock
    on the refrigerator
  • Lock it and take the key if you are going to go
    buy milk
  • Fixes too much (coarse granularity): the roommate
    is angry if he only wants orange juice

11
Too Much Milk: Correctness Properties
  • Need to be careful about correctness of
    concurrent programs, since non-deterministic
  • Always write down desired behavior first
  • think first, then code
  • What are the correctness properties for the Too
    much milk problem?
  • Never more than one person buys
  • Someone buys if needed
  • Restrict ourselves to use only atomic load and
    store operations as building blocks

12
Too Much Milk: Solution 1
  • Use a note to avoid buying too much milk
  • Leave a note before buying (kind of lock)
  • Remove the note after buying (kind of unlock)
  • Don't buy if there is a note (wait)
  • Suppose a computer tries this (remember, only
    memory reads/writes are atomic):

    if (noMilk)
        if (noNote)
            leave Note
            buy milk
            remove note

  • Result?

13
Too Much Milk: Solution 1

    Thread A                  Thread B
    if (noMilk)
      if (noNote)
                              if (noMilk)
                                if (noNote)
                                  leave Note
                                  buy milk
                                  remove note
        leave Note
        buy milk
        remove note

Need to atomically update lock variable
14
How to Implement Lock?
  • Lock: prevents someone from accessing something
  • Lock before entering a critical section (e.g.,
    before accessing shared data)
  • Unlock when leaving, after accessing shared data
  • Wait if locked
  • Important idea: all synchronization involves
    waiting
  • Should sleep if waiting for a long time
  • Hardware atomic instructions?

15
Examples of hardware atomic instructions
  • test&set (address)              /* most architectures */
        result = M[address];
        M[address] = 1;
        return result;
  • swap (address, register)        /* x86 */
        temp = M[address];
        M[address] = register;
        register = temp;
  • compare&swap (address, reg1, reg2)   /* 68000 */
        if (reg1 == M[address]) {
            M[address] = reg2;
            return success;
        } else {
            return failure;
        }

Atomic operations!
16
Implementing Locks with test&set
  • Simple solution:

    int value = 0;   // Free
    Acquire() {
        while (test&set(value));   // while busy
    }
    Release() {
        value = 0;
    }

  • Simple explanation:
  • If the lock is free, test&set reads 0 and sets
    value = 1, so the lock is now busy. It returns 0,
    so the while exits
  • If the lock is busy, test&set reads 1 and sets
    value = 1 (no change). It returns 1, so the while
    loop continues
  • When we set value = 0, someone else can get the
    lock

test&set (address)
    result = M[address];
    M[address] = 1;
    return result;
17
Too Much Milk: Solution 2
  • Lock.Acquire(): wait until the lock is free, then
    grab it
  • Lock.Release(): unlock, waking up anyone waiting
  • Atomic operations: if two threads are waiting for
    the lock, only one succeeds in grabbing it
  • Then, our milk problem is easy:

    milklock.Acquire();
    if (noMilk)
        buy milk;
    milklock.Release();

  • Once again, the section of code between Acquire()
    and Release() is called a Critical Section

18
Shared Memory code for computing a sum
static int s = 0;

Thread 1:                              Thread 2:
local_s1 = 0                           local_s2 = 0
for i = 0, n/2-1                       for i = n/2, n-1
    local_s1 = local_s1 + sqr(A[i])    local_s2 = local_s2 + sqr(A[i])
s = s + local_s1                       s = s + local_s2
  • Since addition is associative, it's OK to
    rearrange order
  • Right?
  • Most computation is on private variables
  • Sharing frequency is also reduced, which might
    improve speed
  • But there is still a race condition on the update
    of shared s

19
Performance Criteria for Synch. Ops
  • Latency (time per op)
  • How long does it take if you always win?
  • Especially relevant under light contention
  • Bandwidth (ops per sec)
  • Especially relevant under high contention
  • How long does it take (averaged over threads)
    when many others are trying for it?
  • Traffic
  • How many events on shared resources (bus,
    crossbar, ...)?
  • Storage
  • How much memory is required?
  • Fairness
  • Can any one thread be starved and never get
    the lock?

20
Barriers
  • Software algorithms implemented using locks,
    flags, counters
  • Hardware barriers
  • Wired-AND line separate from address/data bus
  • Set input high when arrive, wait for output to be
    high to leave
  • In practice, multiple wires to allow reuse
  • Useful when barriers are global and very frequent
  • Difficult to support arbitrary subset of
    processors
  • Even harder with multiple processes per processor
  • Difficult to dynamically change the number and
    identity of participants
  • e.g., the latter due to process migration
  • Not common today on bus-based machines

21
A Simple Centralized Barrier
  • Shared counter maintains number of processes that
    have arrived
  • increment when arrive (lock), check until reaches
    numprocs
  • Problem?

struct bar_type {
    int counter;
    struct lock_type lock;
    int flag = 0;
} bar_name;

BARRIER (bar_name, p) {
    LOCK(bar_name.lock);
    if (bar_name.counter == 0)
        bar_name.flag = 0;             /* reset flag if first to reach */
    mycount = ++bar_name.counter;      /* mycount is private */
    UNLOCK(bar_name.lock);
    if (mycount == p) {                /* last to arrive */
        bar_name.counter = 0;          /* reset for next barrier */
        bar_name.flag = 1;             /* release waiters */
    } else
        while (bar_name.flag == 0) ;   /* busy wait for release */
}
22
A Working Centralized Barrier
  • Consecutively entering the same barrier doesn't
    work
  • Must prevent process from entering until all have
    left previous instance
  • Could use another counter, but increases latency
    and contention
  • Sense reversal wait for flag to take different
    value consecutive times
  • Toggle this value only when all processes reach

BARRIER (bar_name, p) {
    local_sense = !(local_sense);      /* toggle private sense variable */
    LOCK(bar_name.lock);
    mycount = ++bar_name.counter;      /* mycount is private */
    if (bar_name.counter == p) {       /* last to arrive */
        bar_name.counter = 0;          /* reset for next barrier */
        UNLOCK(bar_name.lock);
        bar_name.flag = local_sense;   /* release waiters */
    } else {
        UNLOCK(bar_name.lock);
        while (bar_name.flag != local_sense) ;
    }
}
23
Centralized Barrier Performance
  • Latency
  • Centralized has critical path length at least
    proportional to p
  • Traffic
  • About 3p bus transactions
  • Storage Cost
  • Very low: just a centralized counter and flag
  • Fairness
  • Same processor should not always be last to exit
    barrier
  • No such bias in centralized
  • Key problems for centralized barrier are latency
    and traffic
  • Especially with distributed memory, traffic goes
    to same node

24
Improved Barrier Algorithm
  • Master-Slave barrier
  • Master core gathers slaves on the barrier and
    releases them
  • Use separate, per-core polling flags for
    different wait stages

[Figure: contention at the central counter in the centralized barrier vs. the master-slave scheme]
  • Separate gather and release trees
  • Advantage: use of ordinary reads/writes instead
    of locks (array of flags)
  • 2x(p-1) messages exchanged over the network
  • Valuable in a distributed network: communicate
    along different paths

25
Improved Barrier Algorithm
  • What if implemented on top of NUMA
    (cluster-based) shared memory system?
  • e.g., p2012

Master-Slave
  • Not all messages have same latency
  • Need for locality-aware implementation

26
Improved Barrier Algorithm
  • Software combining tree
  • Only k processors access the same location, where
    k is degree of tree

[Figure: heavy contention at the central counter in the centralized barrier vs. little contention in the combining tree]
  • Separate arrival and exit trees, and use sense
    reversal
  • Valuable in a distributed network: communicate
    along different paths
  • Higher latency (log p steps of work, and O(p)
    serialized bus transactions)
  • Advantage: use of ordinary reads/writes instead
    of locks

28
Improved Barrier Algorithm
  • What if implemented on top of NUMA
    (cluster-based) shared memory system?
  • e.g., p2012

Tree
  • Hierarchical synchronization
  • locality-aware implementation

29
Barrier performance
30
Parallel programming models
  • A programming model is made up of the languages
    and libraries that create an abstract view of
    the machine
  • Control
  • How is parallelism created?
  • How are dependencies (orderings) enforced?
  • Data
  • Can data be shared, or is it all private?
  • How is shared data accessed or private data
    communicated?
  • Synchronization
  • What operations can be used to coordinate
    parallelism?
  • What are the atomic (indivisible) operations?

31
Parallel programming models
  • In this and the upcoming lectures we will see
    different programming models and the features
    that each provides with respect to:
  • Control
  • Data
  • Synchronization
  • Pthreads
  • OpenMP
  • OpenCL