

1
CS 267 Applications of Parallel Computers
Lecture 4: More about Shared Memory
Processors and Programming
  • Jim Demmel
  • http://www.cs.berkeley.edu/~demmel/cs267_Spr99

2
Recap of Last Lecture
  • There are several standard programming models
    (plus variations) that were developed to support
    particular kinds of architectures
  • shared memory
  • message passing
  • data parallel
  • The programming models are no longer strictly
    tied to particular architectures, and so offer
    portability of correctness
  • Portability of performance still depends on
    tuning for each architecture
  • In each model, parallel programming has 4 phases
  • decomposition into parallel tasks
  • assignment of tasks to threads
  • orchestration of communication and
    synchronization among threads
  • mapping threads to processors

3
Outline
  • Performance modeling and tradeoffs
  • Shared memory architectures
  • Shared memory programming

4
Cost Modeling and Performance Tradeoffs
5
Example
  • s = f(A[1]) + ... + f(A[n])
  • Decomposition
  • computing each f(Aj)
  • n-fold parallelism, where n may be >> p
  • computing sum s
  • Assignment
  • thread k sums sk = f(A[k*n/p]) + ... + f(A[(k+1)*n/p - 1])
  • thread 1 sums s = s1 + ... + sp
  • for simplicity of this example, will be improved
  • thread 1 communicates s to other threads
  • Orchestration
  • starting up threads
  • communicating, synchronizing with thread 1
  • Mapping
  • processor j runs thread j
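As a concrete illustration of these four phases (a minimal C sketch, not from the original slides; the names NTHREADS, worker, and partial are illustrative, and the Solaris thr_create/thr_join calls introduced later in the lecture are used here):

    #include <thread.h>                       /* Solaris threads */
    #define NTHREADS 4
    #define N 1000000

    double A[N], partial[NTHREADS], s;
    double f(double x) { return x * x; }      /* stand-in for f */

    void *worker(void *arg) {                 /* decomposition + assignment */
        int k = (int)(long) arg;              /* thread index */
        int lo = k * (N / NTHREADS), hi = (k + 1) * (N / NTHREADS);
        double sum = 0.0;
        for (int i = lo; i < hi; i++)         /* each f(A[i]) is a parallel task */
            sum += f(A[i]);
        partial[k] = sum;                     /* thread k owns partial sum sk */
        return NULL;
    }

    int main(void) {
        thread_t tid[NTHREADS];
        for (int k = 1; k < NTHREADS; k++)    /* orchestration: start threads */
            thr_create(NULL, 0, worker, (void *)(long) k, 0, &tid[k]);
        worker((void *) 0);                   /* thread 0 handles its own chunk */
        for (int k = 1; k < NTHREADS; k++)    /* wait for the others */
            thr_join(tid[k], NULL, NULL);
        s = 0.0;
        for (int k = 0; k < NTHREADS; k++)    /* combine s = s1 + ... + sp */
            s += partial[k];
        return 0;
    }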

6
Identifying enough Concurrency
  • Parallelism profile
  • area is total work done

[Concurrency profile: an n-wide parallel phase of duration time(f) (area = total work = n x time(f)), followed by a 1-wide sequential phase of duration time(sum(n)). Simple decomposition: computing each f(A[i]) is the parallel task; the sum is sequential.]
  • Amdahl's law bounds speedup
  • let s = the fraction of total work done
    sequentially
  • then Speedup(p) <= 1 / (s + (1-s)/p) <= 1/s

[After mapping onto p processors: concurrency p during the f phase, which now takes (n/p) x time(f), followed by the sequential sum.]
7
Algorithmic Trade-offs
  • Parallelize the partial sums of the f's
  • what fraction of the computation is sequential?
  • what does this do for communication? locality?
  • what if you sum what you own?

[Concurrency profile: a p-wide phase of duration (n/p) x time(f) computing the f's, a p-wide phase of duration time(sum(n/p)) for the partial sums, then a 1-wide combine of duration time(sum(p)).]
8
Problem Size is Critical
Amdahl's Law Bounds
  • Total work = n + P
  • Serial work: P
  • Parallel work: n
  • s = serial fraction = P / (n + P)
  • Speedup(P) = n / (n/P + P)
  • Speedup decreases for large P if n is small

In general, seek to exploit a fraction of the peak
parallelism in the problem.
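An illustrative calculation (not from the original slides): with n = 10^6 units of parallel work, Speedup(P) = n / (n/P + P) is about 99 for P = 100, peaks near 500 at P = 1000 (roughly sqrt(n)), and falls back to about 99 at P = 10000, since for large P the serial term P dominates the n/P term.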
9
Algorithmic Trade-offs
  • Parallelize the final summation (tree sum)
  • Generalize Amdahl's law for an arbitrary ideal
    parallelism profile

10
  • Shared Memory Architectures

11
Recap Basic Shared Memory Architecture
  • Processors all connected to a large shared memory
  • Local caches for each processor
  • Cost: much cheaper to access cache than main memory
  • Simplest to program, but hard to build with many
    processors
  • Now take a closer look at structure, costs,
    limits

12
Limits of using Bus as Network
  • Assume a 100 MB/s bus
  • 50 MIPS processor w/o cache
  • => 200 MB/s inst BW per processor
  • => 60 MB/s data BW at 30% load-store
  • Suppose 98% inst hit rate and 95% data hit rate
    (16-byte block)
  • => 4 MB/s inst BW per processor
  • => 12 MB/s data BW per processor
  • => 16 MB/s combined BW
  • Therefore 8 processors will saturate the bus

[Figure: two processors with caches sharing a bus to memory and I/O; each cached processor presents about 16 MB/s to the bus, versus about 260 MB/s without a cache.]
Cache provides a bandwidth filter as well as
reducing average access time.
13
Cache Coherence The Semantic Problem
  • p1 and p2 both have cached copies of x (as 0)
  • p1 writes x = 1 and then the flag, f = 1, as a signal
    to other processors that it has updated x
  • writing f pulls it into p1's cache
  • both of these writes write through to memory
  • p2 reads f (bringing it into p2's cache) to see
    if it is 1, which it is
  • p2 therefore reads x, expecting the value written
    by p1, but gets the stale cached copy

[Figure: main memory holds x = 1, f = 1; p1's cache holds x = 1, f = 1; p2's cache holds f = 1 but a stale x = 0.]
  • SMPs have complicated caches to enforce
    coherence

14
Programming SMPs
  • Coherent view of shared memory
  • All addresses equidistant
  • don't worry about data partitioning
  • Caches automatically replicate shared data close
    to processor
  • If the program concentrates on a block of the data
    set that no one else updates => very fast
  • Communication occurs only on cache misses
  • cache misses are slow
  • Processor cannot distinguish communication misses
    from regular cache misses
  • Cache block may introduce unnecessary
    communication
  • two distinct variables in the same cache block
  • false sharing
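A minimal sketch of false sharing (illustrative; the 64-byte block size and the padding fix are assumptions, not from the slides):

    /* Two threads each update only their own counter, yet because both
       counters likely sit in the same cache block, every write by one
       thread invalidates the block in the other thread's cache. */
    struct { long count0; long count1; } counters;            /* false sharing */

    /* Padding the counters onto (assumed) separate 64-byte blocks removes
       the communication without changing the program's meaning. */
    struct { long count0; char pad[64]; long count1; } padded_counters;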

15
Where are things going?
  • High-end
  • collections of almost complete workstations/SMP
    on high-speed network (Millennium)
  • with specialized communication assist integrated
    with memory system to provide global access to
    shared data
  • Mid-end
  • almost all servers are bus-based CC SMPs
  • high-end servers are replacing the bus with a
    network
  • Sun Enterprise 10000, IBM J90, HP/Convex SPP
  • volume approach is Pentium Pro quad pack + SCI
    ring
  • Sequent, Data General
  • Low-end
  • SMP desktop is here
  • Major change ahead
  • SMP on a chip as a building block

16
Programming Shared Memory Machines
  • Creating parallelism in shared memory models
  • Synchronization
  • Building shared data structures
  • Performance programming (throughout)

17
Programming with Threads
  • Several Threads Libraries
  • PTHREADS is the POSIX standard
  • Solaris threads are very similar
  • Relatively low level
  • Portable but possibly slow
  • P4 (Parmacs) is a widely used portable package
  • Higher level than Pthreads
  • OpenMP is a newly proposed standard
  • Support for scientific programming on shared
    memory
  • Currently a Fortran interface
  • Initiated by SGI; Sun is not currently supporting
    it

18
Creating Parallelism
19
Language Notions of Thread Creation
  • cobegin/coend
  • fork/join
  • cobegin is cleaner, but fork is more general

cobegin  job1(a1);  job2(a2);  coend
  • Statements in block may run in parallel
  • cobegins may be nested
  • Scoped, so you cannot have a missing coend

tid1 = fork(job1, a1);  job2(a2);  join tid1;
  • Forked function runs in parallel with current
  • join waits for completion (may be in different
    function)

20
Forking Threads in Solaris
Signature:
    int thr_create(void *stack_base, size_t stack_size,
                   void *(*start_func)(void *), void *arg,
                   long flags, thread_t *new_tid);
Example:
    thr_create(NULL, NULL, start_func, arg, NULL, &tid);
  • start_func defines the thread body
  • start_func takes one argument of type void * and
    returns void *
  • an argument can be passed as arg
  • j-th thread gets arg = j so it knows who it is
  • stack_base and stack_size give the stack
  • standard default values
  • flags controls various attributes
  • standard default values for now
  • new_tid returns the thread id (for the thread creator
    to identify threads)

21
Example Using Solaris Threads
main() {
    thread_ptr = (thrinfo_t *) malloc(NTHREADS * sizeof(thrinfo_t));
    thread_ptr[0].chunk = 0;
    thread_ptr[0].tid = myID;
    for (i = 1; i < NTHREADS; i++) {
        thread_ptr[i].chunk = i;
        if (thr_create(0, 0, worker, (void *) thread_ptr[i].chunk,
                       0, &thread_ptr[i].tid)) {
            perror("thr_create");
            exit(1);
        }
    }
    worker(0);
    for (i = 1; i < NTHREADS; i++)
        thr_join(thread_ptr[i].tid, NULL, NULL);
}
22
Synchronization
23
Basic Types of Synchronization Barrier
  • Barrier -- global synchronization
  • fork multiple copies of the same function "work"
  • SPMD = Single Program Multiple Data
  • simple use of barriers -- all threads hit the same
    one
  • more complicated -- barriers on branches
  • or in loops -- need equal number of barriers
    executed
  • barriers are not provided in many thread
    libraries
  • need to build them (a sketch follows the examples
    below)

work_on_my_subgrid();
barrier;
read_neighboring_values();
barrier;

if (tid % 2 == 0) { work1(); barrier; }
else              {          barrier; }
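Since many thread libraries do not provide barriers, here is a minimal counting-barrier sketch built from a Pthreads mutex and condition variable (the type and function names are illustrative, not a library API; initialize count = nthreads = number of threads and episode = 0):

    #include <pthread.h>

    typedef struct {
        pthread_mutex_t lock;       /* e.g., PTHREAD_MUTEX_INITIALIZER */
        pthread_cond_t  all_here;   /* e.g., PTHREAD_COND_INITIALIZER  */
        int count;                  /* threads still to arrive this episode */
        int nthreads;               /* total number of threads */
        int episode;                /* distinguishes successive barrier uses */
    } my_barrier_t;

    void my_barrier(my_barrier_t *b) {
        pthread_mutex_lock(&b->lock);
        int my_episode = b->episode;
        if (--b->count == 0) {                    /* last thread to arrive */
            b->count = b->nthreads;               /* reset for the next use */
            b->episode++;
            pthread_cond_broadcast(&b->all_here); /* release the waiters */
        } else {
            while (b->episode == my_episode)      /* guard against spurious wakeups */
                pthread_cond_wait(&b->all_here, &b->lock);
        }
        pthread_mutex_unlock(&b->lock);
    }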
24
Basic Types of Synchronization Mutexes
  • Mutexes -- mutual exclusion aka locks
  • threads are working mostly independently
  • need to access common data structure
  • Java and other languages have lexically scoped
    synchronization
  • similar to cobegin/coend vs. fork and join
  • Semaphores give guarantees on fairness in
    getting the lock, but the same idea of mutual
    exclusion
  • Locks only affect processors using them
  • pair-wise synchronization

lock *l = alloc_and_init();   /* shared */
acquire(l);
    ... access data ...
release(l);
25
Basic Types of Synchronization Post/Wait
  • Post/Wait -- producer-consumer synchronization
  • "post/wait" is not as common a term
  • could be done with generalization of locks to
    reader/writer locks
  • sometimes done with barrier, if there is global
    agreement

lock *l = alloc_and_init();   /* shared */

P1:                        P2:
    big_data = new_value;
    post(l);                   wait(l);
                               ... use big_data ...
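A common way to get post/wait is a counting semaphore; a minimal POSIX-semaphore sketch (the helpers compute_new_value and use are hypothetical, and sem_init(&ready, 0, 0) is assumed to run once before the threads start):

    #include <semaphore.h>

    sem_t ready;                            /* shared, initially 0 */
    double big_data;                        /* shared */

    void producer(void) {                   /* runs as P1 */
        big_data = compute_new_value();     /* hypothetical helper */
        sem_post(&ready);                   /* "post": the data is ready */
    }

    void consumer(void) {                   /* runs as P2 */
        sem_wait(&ready);                   /* "wait": block until posted */
        use(big_data);                      /* hypothetical helper */
    }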
26
Synchronization at Different Levels
  • Can build it yourself out of flags
  • while (!flag) ;
  • Lock/Unlock primitives build in the waiting
  • typically well tested
  • system friendly
  • sometimes optimized for the machine
  • sometimes higher overhead than building your own
  • Most systems provide higher level synchronization
    primitives
  • barrier - global synchronization
  • semaphores
  • monitors

27
Solaris Threads Example
mutex_t mul_lock;  barrier_t ba;  int sum;
main() {
    sync_type = USYNC_PROCESS;
    mutex_init(&mul_lock, sync_type, NULL);
    barrier_init(&ba, NTHREADS, sync_type, NULL);
    /* ... spawn all the threads as above ... */
}
worker(int me) {
    int mine = all_do_work(me);
    barrier_wait(&ba);
    mutex_lock(&mul_lock);
    sum += mine;
    mutex_unlock(&mul_lock);
}
28
Producer-Consumer Synchronization
  • A very common pattern in parallel programs
  • special case is write-once variable
  • Motivated language constructs that combine
    parallelism and synchronization
  • future as in Multilisp
  • next_job(future(job1(x1)), future(job2(x2)))
  • job1 and job2 will run in parallel; next_job will
    run until the args are needed, e.g., arg1 + arg2
    inside next_job
  • implemented using presence bits (hardware?)
  • promise
  • like future, but need to explicitly ask for a
    promise
  • T and promise(T) are different types
  • more efficient, but requires more programmer
    control
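A sketch of the write-once special case in software (illustrative names; a Pthreads mutex and condition variable stand in for hardware presence bits):

    #include <pthread.h>

    typedef struct {
        pthread_mutex_t lock;
        pthread_cond_t  filled;
        int present;                /* software "presence bit" */
        void *value;
    } future_t;

    void future_set(future_t *f, void *v) {      /* producer: writes once */
        pthread_mutex_lock(&f->lock);
        f->value = v;
        f->present = 1;
        pthread_cond_broadcast(&f->filled);
        pthread_mutex_unlock(&f->lock);
    }

    void *future_get(future_t *f) {              /* consumer: blocks until set */
        pthread_mutex_lock(&f->lock);
        while (!f->present)
            pthread_cond_wait(&f->filled, &f->lock);
        void *v = f->value;
        pthread_mutex_unlock(&f->lock);
        return v;
    }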

29
Rolling Your Own Synchronization
  • Natural to use a variable for producer/consumer
  • This works as long as your compiler and machine
    implement sequential consistency [Lamport]
  • The parallel execution must behave as if it were
    an interleaving of the sequences of memory
    operations from each of the processors.

P1:                        P2:
    big_data = new_value;      while (flag != 1) ;
    flag = 1;                  ... big_data ...
  • There must exist some
  • serial (total) order
  • consistent with the partial order
  • that is a correct sequential execution

[Figure: each processor's program order of reads and writes (w x, w y, r x, r z, ...); the union of these per-processor orders forms a partial order.]
30
But Machines Aren't Always Sequentially
Consistent
  • hardware does out-of-order execution
  • hardware issues multiple writes
  • placed into (mostly FIFO) write buffer
  • second write to same location can be merged into
    earlier
  • hardware reorders reads
  • first misses in cache, second does not
  • compiler puts something in a register,
    eliminating some reads and writes
  • compiler reorders reads or writes
  • writes are going to physically remote locations
  • first write is further away than second

31
Programming Solutions
  • At the compiler level, the best defense is
  • declaring variables as volatile
  • At the hardware level, there are instructions
  • memory barriers (a bad term) or fences that force
    serialization
  • only serialize operations executed by one
    processor
  • different flavors, depending on the machine
  • Sequential consistency is sufficient but not
    necessary for hand-made synchronization
  • many machines only reorder reads with respect to
    writes
  • the flag example breaks only if read/read pairs
    reordered or write/write pairs reordered
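A sketch of the compiler-level defense applied to the flag example (volatile only keeps the compiler from caching or reordering these accesses; it does not, by itself, prevent hardware reordering):

    volatile int flag = 0;          /* compiler must re-read / re-write these */
    volatile double big_data;

    void p1(double new_value) {     /* producer */
        big_data = new_value;
        flag = 1;                   /* post */
    }

    double p2(void) {               /* consumer */
        while (flag != 1) ;         /* spin until posted */
        return big_data;
    }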

32
Foundation Behind Sequential Consistency
  • All violations of sequential consistency are
    figure 8s
  • Generalized to multiple processors (can be
    longer)
  • Intuition Time cannot move backwards
  • All violations appear as simple cycles that
    visit each processor at most twice [Shasha & Snir]

P1:                        P2:
    big_data = new_value;      while (flag != 1) ;
    flag = 1;                  ... big_data ...
33
Building Shared Data Structures
34
Shared Address Allocation
  • Some systems provide a special form of
    malloc/free
  • p4_shmalloc, p4_shfree
  • p4_malloc, p4_free are just basic malloc/free
  • sharing unspecified
  • Solaris threads
  • malloc'd and static variables are shared

35
Building Parallel Data Structures
  • Since data is shared, this is easy
  • Some awkwardness comes in setup
  • only 1 argument passed to all threads
  • need to package data (pointers to) into a
    structure and pass that
  • otherwise everything global (hard to read for
    usual reasons)
  • Depends on type of data structure

36
Data Structures
  • For data structures that are static (regular in
    time)
  • typically partition logically, although not
    physically
  • need to divide work evenly, often done by
    dividing data
  • use the "owner computes" rule
  • true for irregular (unstructured) data like
    meshes too
  • usually barriers are sufficient synchronization
  • For Dynamic data structures
  • need locking or other synchronization
  • Example: tree-based particle computation
  • parts of this computation are naturally
    partitioned
  • each processor/thread has a set of particles
  • each works on updating a part of the tree
  • during a tree walk, need to look at other parts
    of the tree
  • locking used (on nodes) to prevent simultaneous
    updates

37
Summary
38
Uniform Shared Address Space
  • Programmer's view is still processor-centric
  • Specifies what each processor/thread does
  • Global view implicit in pattern of data sharing

[Figure: a single shared data structure in the uniform address space, plus per-thread local / stack data.]
39
Segmented Shared Address Space
  • Programmer has local and global view
  • Specifies what each processor/thread does
  • Global data, operation, and synchronization

[Figure: the shared data structure divided into per-thread segments, plus per-thread local / stack data.]
40
Work vs. Data Assignment
for (I = MyProc; I < n; I += PROCS)  A[I] = f(I);
  • Assignment of work is easier in a global address
    space
  • It is faster if it corresponds to the data
    placement
  • Hardware replication moves data to where it is
    accessed
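For contrast, a blocked assignment (a sketch using the same assumed names MyProc, PROCS, n, with PROCS dividing n) gives each thread one contiguous range of A, which is more likely to line up with where the data is placed and cached:

    int chunk = n / PROCS;
    for (I = MyProc * chunk; I < (MyProc + 1) * chunk; I++)
        A[I] = f(I);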