CS 267: Shared Memory Machines Programming Example: Sharks and Fish - PowerPoint PPT Presentation

About This Presentation

Title: CS 267: Shared Memory Machines Programming Example: Sharks and Fish

Description: Hardware evolves to try to match speeds. Program semantics evolve too ... Performance evolves as well. Well tuned programs today may be inefficient tomorrow ...

Slides: 61
Provided by: kathyy
Transcript and Presenter's Notes



1
CS 267: Shared Memory Machines Programming Example: Sharks and Fish
  • James Demmel
  • demmel@cs.berkeley.edu
  • www.cs.berkeley.edu/demmel/cs267_Spr05

2
Basic Shared Memory Architecture
  • Processors all connected to a large shared memory
  • Where are caches?

[Diagram: processors P1, P2, …, Pn connected through an interconnect to a shared memory]
  • Now take a closer look at structure, costs,
    limits, programming

3
Outline
  • Evolution of Hardware and Software
    • CPUs are getting exponentially faster than the memory they share
    • Hardware evolves to try to match speeds
    • Program semantics evolve too
      • Programs change from correct to buggy, unless programmed carefully
    • Performance evolves as well
      • Well-tuned programs today may be inefficient tomorrow
  • Goal: teach a programming style likely to stay correct, if not always as efficient as possible
    • Use locks to avoid race conditions
    • Current research seeks best of both worlds
  • Example: Sharks and Fish (part of next homework)

4
Processor-DRAM Gap (latency)

[Chart, 1980-2000, log-scale performance: processor performance ("Moore's Law") grows ~60%/yr while DRAM latency improves only ~7%/yr, so the processor-memory performance gap grows ~50%/year]
5
Shared Memory Code for Computing a Sum s = f(A[0]) + f(A[1])

static int s = 0;

Thread 0: s = s + f(A[0]);
Thread 1: s = s + f(A[1]);

  • Might get f(A[0]) + f(A[1]), or just f(A[0]), or just f(A[1])
  • Problem: there is a race condition on variable s in the program
  • A race condition or data race occurs when:
    • two processors (or two threads) access the same variable, and at least one does a write
    • the accesses are concurrent (not synchronized), so they could happen simultaneously

6
Approaches to Building Parallel Machines
[Diagrams, in order of increasing scale:]
  • Shared Cache: processors P1..Pn connect through a switch to a first-level (interleaved) cache, backed by (interleaved) main memory
  • Centralized Memory (UMA, Uniform Memory Access): processors with private caches share memory modules (Mem) over an interconnection network
  • Distributed Memory (NUMA, Non-Uniform Memory Access): each processor has its own local memory; processors communicate over an interconnection network
7
Shared Cache Advantages and Disadvantages
  • Advantages
    • Cache placement identical to single cache
      • Only one copy of any cached block
      • Can't have different values for the same memory location in different caches
    • Fine-grain sharing is possible
    • Good interference
      • One processor may prefetch data for another
      • Can share data within a line without moving the line
  • Disadvantages
    • Bandwidth limitation
    • Bad interference
      • One processor may flush another processor's data

8
Limits of Shared Cache Approach
  • Assume:
    • 1 GHz processor w/o cache
      • ⇒ 4 GB/s instruction BW per processor (32-bit)
      • ⇒ 1.2 GB/s data BW at 30% load/store
  • Need 5.2 GB/s of bus bandwidth per processor!
  • Typical off-chip bus bandwidth is closer to 1 GB/s

9
Evolution of Shared Cache
  • Alliant FX-8 (early 1980s)
    • eight 68020s with x-bar to 512 KB interleaved cache
  • Encore & Sequent (1980s)
    • first 32-bit micros (NS32032)
    • two to a board with a shared cache
  • Disappeared for a while, and then …
    • Cray X1 shares L3 cache
    • IBM Power 4 and Power 5 share L2 cache
  • If switch and cache are on chip, may have enough bandwidth again

10
Approaches to Building Parallel Machines
[Diagrams, in order of increasing scale:]
  • Shared Cache: processors P1..Pn connect through a switch to a first-level (interleaved) cache, backed by (interleaved) main memory
  • Centralized Memory (UMA, Uniform Memory Access): processors with private caches share memory modules (Mem) over an interconnection network
  • Distributed Memory (NUMA, Non-Uniform Memory Access): each processor has its own local memory; processors communicate over an interconnection network
11
Intuitive Memory Model
  • Reading an address should return the last value
    written to that address
  • Easy in uniprocessors
  • except for I/O
  • Cache coherence problem in MPs is more pervasive
    and more performance critical
  • More formally, this is called sequential consistency:
    • "A multiprocessor is sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program." [Lamport, 1979]

12
Sequential Consistency Intuition
  • Sequential consistency says the machine behaves
    as if it does the following

13
Memory Consistency Semantics
  • What does this imply about program behavior?
    • No process ever sees garbage values, i.e., half of one value and half of another
    • Processors always see values written by some processor
    • The value seen is constrained by program order on all processors
      • Time always moves forward
  • Example: spin lock
    • P1 writes data=1, then writes flag=1
    • P2 waits until flag==1, then reads data

If P2 sees the new value of flag (=1), it must see the new value of data (=1)

initially: flag = 0, data = 0

P1:                          P2:
data = 1;                    10: if (flag == 0) goto 10;
flag = 1;                    … = data;
14
If Caches are Not Coherent
  • Coherence means different copies of the same location have the same value
  • p1 and p2 both have cached copies of data (as 0)
  • p1 writes data=1
    • May write through to memory
  • p2 reads data, but gets the stale cached copy
    • This may happen even if it read an updated value of another variable, flag, that came from memory

[Diagram: after p1's write, p1's cache holds data = 1 while p2's cache still holds the stale data = 0]
15
Snoopy Cache-Coherence Protocols
[Diagram: processors P0..Pn, each with a cache that snoops the shared memory bus; memory modules also sit on the bus]
  • Memory bus is a broadcast medium
  • Caches contain information on which addresses
    they store
  • Cache Controller snoops all transactions on the
    bus
  • A transaction is a relevant transaction if it
    involves a cache block currently contained in
    this cache
  • Take action to ensure coherence
  • invalidate, update, or supply value
  • Many possible designs (see CS252 or CS258)

16
Limits of Bus-Based Shared Memory
  • Assume:
    • 1 GHz processor w/o cache
      • ⇒ 4 GB/s instruction BW per processor (32-bit)
      • ⇒ 1.2 GB/s data BW at 30% load/store
  • Suppose 98% instruction hit rate and 95% data hit rate
    • ⇒ 80 MB/s instruction BW per processor
    • ⇒ 60 MB/s data BW per processor
    • 140 MB/s combined BW
  • Assuming 1 GB/s bus bandwidth
    • ∴ 8 processors will saturate the bus

[Diagram: each processor demands 5.2 GB/s; with caches, only 140 MB/s per processor reaches the shared memory bus, which also serves memory and I/O]
17
Sample Machines
  • Intel Pentium Pro Quad
  • Coherent
  • 4 processors
  • Sun Enterprise server
  • Coherent
  • Up to 16 processor and/or memory-I/O cards
  • IBM Blue Gene/L
  • L1 not coherent, L2 shared

18
Approaches to Building Parallel Machines
[Diagrams, in order of increasing scale:]
  • Shared Cache: processors P1..Pn connect through a switch to a first-level (interleaved) cache, backed by (interleaved) main memory
  • Centralized Memory (UMA, Uniform Memory Access): processors with private caches share memory modules (Mem) over an interconnection network
  • Distributed Memory (NUMA, Non-Uniform Memory Access): each processor has its own local memory; processors communicate over an interconnection network
19
Basic Choices in Memory/Cache Coherence
  • Keep a directory to track which memory stores the latest copy of data
  • The directory, like a cache, may keep information such as:
    • Valid/invalid
    • Dirty (inconsistent with memory)
    • Shared (in other caches)
  • When a processor executes a write operation to shared data, basic design choices are:
    • With respect to memory:
      • Write-through cache: do the write in memory as well as the cache
      • Write-back cache: wait and do the write later, when the item is flushed
    • With respect to other cached copies:
      • Update: give all other processors the new value
      • Invalidate: all other processors remove it from their caches
  • See CS252 or CS258 for details

20
SGI Altix 3000
  • A node contains up to 4 Itanium 2 processors and
    32GB of memory
  • Network is SGI's NUMAlink, the NUMAflex interconnect technology
  • Uses a mixture of snoopy and directory-based
    coherence
  • Up to 512 processors that are cache coherent
    (global address space is possible for larger
    machines)

21
Cache Coherence and Sequential Consistency
  • There is a lot of hardware/work to ensure coherent caches
    • Never more than 1 version of data for a given address in caches
    • Data is always a value written by some processor
  • But other HW/SW features may break sequential consistency (SC):
    • The compiler reorders/removes code (e.g., your spin lock)
      • The compiler allocates a register for flag on Processor 2 and spins on that register value without ever completing
    • Write buffers (place to store writes while waiting to complete)
      • Processors may reorder writes to merge addresses (not FIFO)
      • Write X=1, Y=1, X=2 (second write to X may happen before Y's)
    • Prefetch instructions cause read reordering (read data before flag)
    • The network reorders the two write messages
      • The write to flag is nearby, whereas data is far away
  • Some of these can be prevented by declaring variables volatile
  • Most current commercial SMPs give up SC

22
Programming with Weaker Memory Models than SC
  • Possible to reason about machines with fewer
    properties, but difficult
  • Some rules for programming with these models
  • Avoid race conditions
  • Use system-provided synchronization primitives
  • If you have race conditions on variables, make
    them volatile
  • At the assembly level, may use fences (or an analog) directly
  • The high-level language support for these differs
  • Built-in synchronization primitives normally include the necessary fence operations
    • lock(); … only one thread at a time allowed here …; unlock();
    • The region between lock/unlock is called a critical region
    • For performance, need to keep critical regions short

23
Improved Code for Computing a Sum s = f(A[0]) + … + f(A[n-1])

static int s = 0;

Thread 1:
  local_s1 = 0;
  for i = 0, n/2-1
    local_s1 = local_s1 + f(A[i]);
  s = s + local_s1;

Thread 2:
  local_s2 = 0;
  for i = n/2, n-1
    local_s2 = local_s2 + f(A[i]);
  s = s + local_s2;

  • Since addition is associative, it's OK to rearrange order

24
Improved Code for Computing a Sum s = f(A[0]) + … + f(A[n-1])

static int s = 0;

Thread 1:
  local_s1 = 0;
  for i = 0, n/2-1
    local_s1 = local_s1 + f(A[i]);
  lock(); s = s + local_s1; unlock();

Thread 2:
  local_s2 = 0;
  for i = n/2, n-1
    local_s2 = local_s2 + f(A[i]);
  lock(); s = s + local_s2; unlock();

  • Since addition is associative, it's OK to rearrange order
  • Critical section is smaller
    • Most work is outside it

25
Caches and Scientific Computing
  • Caches tend to perform worst on demanding applications that operate on large data sets:
    • transaction processing
    • operating systems
    • sparse matrices
  • Modern scientific codes use tiling/blocking to become cache friendly
    • easier for dense matrix codes (e.g., matmul) than for sparse
    • tiling and parallelism are similar transformations

26
Sharing: A Performance Problem
  • True sharing
    • Frequent writes to a variable can create a bottleneck
    • OK for read-only or infrequently written data
    • Technique: make copies of the value, one per processor, if this is possible in the algorithm
    • Example problem: the data structure that stores the freelist/heap for malloc/free
  • False sharing
    • Cache blocks may also introduce artifacts
    • Two distinct variables in the same cache block
    • Technique: allocate data used by each processor contiguously, or at least avoid interleaving
    • Example problem: an array of ints, one written frequently by each processor

27
What to Take Away?
  • Programming shared memory machines
  • May allocate data in large shared region without
    too many worries about where
  • Memory hierarchy is critical to performance
  • Even more so than on uniprocs, due to coherence
    traffic
  • For performance tuning, watch sharing (both true
    and false)
  • Semantics
  • Need to lock access to shared variable for
    read-modify-write
  • Sequential consistency is the natural semantics
  • Architects worked hard to make this work
  • Caches are coherent with buses or directories
  • No caching of remote data on shared address space
    machines
  • But compiler and processor may still get in the
    way
  • Non-blocking writes, read prefetching, code
    motion…
  • Avoid races or use machine-specific fences
    carefully

28
Creating Parallelism with Threads
29
Programming with Threads
  • Several Thread Libraries
  • PTHREADS is the POSIX standard
    • Solaris threads are very similar
    • Relatively low level
    • Portable but possibly slow
  • OpenMP is a newer standard
    • Support for scientific programming on shared memory
    • http://www.openMP.org
  • P4 (Parmacs) is another portable package
    • Higher level than Pthreads
    • http://www.netlib.org/p4/index.html

30
Language Notions of Thread Creation
  • cobegin/coend
  • fork/join
  • cobegin cleaner, but fork is more general

cobegin
  job1(a1);
  job2(a2);
coend
  • Statements in block may run in parallel
  • cobegins may be nested
  • Scoped, so you cannot have a missing coend

tid1 = fork(job1, a1);
job2(a2);
join tid1;
  • Forked function runs in parallel with current
  • join waits for completion (may be in different
    function)

31
Forking Posix Threads
Signature:
int pthread_create(pthread_t *,
                   const pthread_attr_t *,
                   void * (*)(void *),
                   void *);

Example call:
errcode = pthread_create(&thread_id, &thread_attribute,
                         &thread_fun, &fun_arg);

  • thread_id is the thread id or handle (used to halt, etc.)
  • thread_attribute: various attributes
    • standard default values obtained by passing a NULL pointer
  • thread_fun: the function to be run (takes and returns void *)
  • fun_arg: an argument that can be passed to thread_fun when it starts
  • errorcode will be set nonzero if the create operation fails

32
Posix Thread Example
#include <pthread.h>

void * print_fun( void * message ) {
  printf("%s \n", message);
}

main() {
  pthread_t thread1, thread2;
  char * message1 = "Hello";
  char * message2 = "World";

  pthread_create( &thread1, NULL,
                  (void *) &print_fun,
                  (void *) message1);
  pthread_create( &thread2, NULL,
                  (void *) &print_fun,
                  (void *) message2);
  return 0;
}

Compile using gcc -lpthread. See Millennium/Seaborg docs for paths/modules.
Note: there is a race condition in the print statements.
33
Loop Level Parallelism
  • Many scientific applications have parallelism in loops
  • With threads:

… my_stuff[n][n];
for (int i = 0; i < n; i++)
  for (int j = 0; j < n; j++)
    … pthread_create(update_cell, …, my_stuff[i][j]);

  • But the overhead of thread creation is nontrivial
    • Also need to pass i and j
34
Shared Data and Threads
  • Variables declared outside of main are shared
  • Objects allocated on the heap may be shared (if a pointer is passed)
  • Variables on the stack are private: passing pointers to these around to other threads can cause problems
  • Often done by creating a large "thread data" struct
    • Passed into all threads as an argument

35
Basic Types of Synchronization Barrier
  • Barrier -- global synchronization
    • fork multiple copies of the same function "work"
    • SPMD: Single Program Multiple Data
  • Simple use of barriers -- all threads hit the same one

work_on_my_subgrid();
barrier;
read_neighboring_values();
barrier;

  • More complicated -- barriers on branches (or loops)

if (tid % 2 == 0) {
  work1();
  barrier;
} else { barrier; }

  • Barriers are not provided in many thread libraries

36
Basic Types of Synchronization Mutexes
  • Mutexes -- mutual exclusion, a.k.a. locks
    • threads are working mostly independently
    • need to access a common data structure

lock *l = alloc_and_init();   /* shared */
acquire(l);
  access data
release(l);

  • Java and other languages have lexically scoped synchronization
    • similar to cobegin/coend vs. fork and join
  • Semaphores give guarantees on fairness in getting the lock, but the same idea of mutual exclusion
  • Locks only affect processors using them
    • pair-wise synchronization

37
A Model Problem Sharks and Fish
  • Illustration of parallel programming
    • Original version (discrete event only) proposed by Geoffrey Fox
    • Called WATOR
  • Sharks and fish living in a 2D toroidal ocean
  • We can imagine several variations to show different physical phenomena
  • Basic idea: sharks and fish living in an ocean
    • rules for movement
    • breeding, eating, and death
    • forces in the ocean
    • forces between sea creatures

38
Particle Systems
  • A particle system has:
    • a finite number of particles
    • moving in space according to Newton's Laws (i.e., F = ma)
    • time is continuous
  • Examples:
    • stars in space with laws of gravity
    • electron beam and ion beam semiconductor manufacturing
    • atoms in a molecule with electrostatic forces
    • neutrons in a fission reactor
    • cars on a freeway with Newton's laws plus a model of driver and engine
  • Many simulations combine particle simulation techniques with some discrete event techniques (e.g., Sharks and Fish)

39
Forces in Particle Systems
  • Force on each particle decomposed into near and far:
    • force = external_force + nearby_force + far_field_force
  • External force
    • ocean current in the sharks and fish world
    • externally imposed electric field in an electron beam
  • Nearby force
    • sharks attracted to eat nearby fish; balls on a billiard table bounce off of each other
    • Van der Waals forces in a fluid (1/r^6)
  • Far-field force
    • fish attract other fish by a gravity-like (1/r^2) force
    • gravity, electrostatics
    • forces governed by elliptic PDEs

40
Parallelism in External Forces
  • External forces are the simplest to implement.
  • The force on each particle is independent of
    other particles.
  • Called embarrassingly parallel.
  • Evenly distribute particles on processors
  • Any even distribution works.
  • Locality is not an issue, no communication.
  • For each particle on processor, apply the
    external force.

41
Parallelism in Nearby Forces
  • Nearby forces require interaction and therefore communication.
  • Force may depend on other nearby particles:
    • Example: collisions.
    • simplest algorithm is O(n^2): look at all pairs to see if they collide.
  • Usual parallel model is decomposition of the physical domain:
    • O(n/p) particles per processor if evenly distributed.
    • Often called domain decomposition (which also refers to a numerical alg.)
  • Challenges:
    • Dealing with particles near processor boundaries
    • Dealing with load imbalance from nonuniformly distributed particles

42
Parallelism in Far-Field Forces
  • Far-field forces involve all-to-all interaction and therefore communication.
  • Force depends on all other particles:
    • Examples: gravity, protein folding
    • Simplest algorithm is O(n^2)
    • Just decomposing space does not help, since every particle needs to visit every other particle.
  • Use more clever algorithms to lower O(n^2) to O(n log n)
    • Several later lectures
  • Implement by rotating particle sets.
    • Keeps processors busy
    • All processors eventually see all particles

43
Examine Sharks and Fish code
  • Gravitational forces among fish only
  • Use Euler's method to move fish numerically
  • Sequential and Shared Memory with Pthreads
  • www.cs.berkeley.edu/demmel/cs267_Spr05/SharksAndFish

44
Extra Slides
45
Engineering Intel Pentium Pro Quad
  • SMP for the masses
  • All coherence and multiprocessing glue in
    processor module
  • Highly integrated, targeted at high volume
  • Low latency and bandwidth

46
Engineering SUN Enterprise
  • Proc + mem card – I/O card
  • 16 cards of either type
  • All memory accessed over bus, so symmetric
  • Higher bandwidth, higher latency bus

47
Outline
  • Historical perspective
  • Bus-based machines
  • Pentium SMP
  • IBM SP node
  • Directory-based (CC-NUMA) machine
  • Origin 2000
  • Global address space machines
  • Cray t3d and (sort of) t3e

48
60s Mainframe Multiprocessors
  • Enhance memory capacity or I/O capabilities by
    adding memory modules or I/O devices
  • How do you enhance processing capacity?
  • Add processors
  • Already need an interconnect between slow memory
    banks and processor I/O channels
  • cross-bar or multistage interconnection network

49
70s Breakthrough: Caches
  • Memory system scaled by adding memory modules
  • Both bandwidth and capacity
  • Memory was still a bottleneck
  • Enter… Caches!
  • Cache does two things
  • Reduces average access time (latency)
  • Reduces bandwidth requirements to memory

[Diagram: a fast processor P (or I/O device) connected through an interconnect to slow memory]
50
Technology Perspective
            Capacity          Speed
Logic:      2x in 3 years     2x in 3 years
DRAM:       4x in 3 years     1.4x in 10 years
Disk:       2x in 3 years     1.4x in 10 years

DRAM Year   Size      Cycle Time
1980        64 Kb     250 ns
1983        256 Kb    220 ns
1986        1 Mb      190 ns
1989        4 Mb      165 ns
1992        16 Mb     145 ns
1995        64 Mb     120 ns

            1000:1!   2:1!
51
Example: Write-thru Invalidate
  • Update and write-thru both use more memory
    bandwidth if there are writes to the same address
  • Update to the other caches
  • Write-thru to memory

52
Write-Back/Ownership Schemes
  • When a single cache has ownership of a block,
    processor writes do not result in bus writes,
    thus conserving bandwidth.
  • reads by others cause it to return to shared
    state
  • Most bus-based multiprocessors today use such
    schemes.
  • Many variants of ownership-based protocols

53
Directory-Based Cache-Coherence
54
90s: Scalable, Cache Coherent Multiprocessors
55
Cache Coherence and Memory Consistency
56
Violations of Sequential Consistency
  • Flag/data program is one example that relies on SC
  • Given coherent memory, all violations of SC based on reordering of independent operations are "figure 8s"
  • See paper by Shasha and Snir for more details
  • Operations can be linearized (moving forward in time) if SC

[Diagram: operations on processors P0, P1, P2 (e.g., read y, write x) forming a cycle that violates SC]
57
Sufficient Conditions for Sequential Consistency
  • Processors issue memory operations in program order
  • Processor waits for a store to complete before issuing any more memory operations
    • E.g., wait for write-through and invalidations
  • Processor waits for a load to complete before issuing any more memory operations
    • E.g., data in another cache may have to be marked as shared rather than exclusive
  • A load must also wait for the store that produced the value to complete
    • E.g., if data is in a cache and an update event changes the value, all other caches must also have processed that update
  • There are much more aggressive ways of implementing SC, but most current commercial SMPs give up

Based on slide by Mark Hill et al
58
Classification for Relaxed Models
  • Optimizations can generally be categorized by:
    • Program order relaxation:
      • Write → Read
      • Write → Write
      • Read → Read, Write
    • Read others' write early
    • Read own write early
  • All models provide a safety net, e.g.:
    • A write fence instruction waits for writes to complete
    • A read fence prevents prefetches from moving before this point
    • Prefetches may be synchronized automatically on use
  • All models maintain uniprocessor data and control dependences, write serialization
  • Memory models differ on orders to two different locations

Slide source Sarita Adve et al
59
Some Current System-Centric Models

Model     W→R    W→W    R→RW   Read Others'  Read Own     Safety Net
          Order  Order  Order  Write Early   Write Early
IBM 370   ✓                                               serialization instructions
TSO       ✓                                  ✓            RMW
PC        ✓                    ✓             ✓            RMW
PSO       ✓      ✓                           ✓            RMW, STBAR
WO        ✓      ✓      ✓                    ✓            synchronization
RCsc      ✓      ✓      ✓                    ✓            release, acquire, nsync, RMW
RCpc      ✓      ✓      ✓      ✓             ✓            release, acquire, nsync, RMW
Alpha     ✓      ✓      ✓                    ✓            MB, WMB
RMO       ✓      ✓      ✓                    ✓            various MEMBARs
PowerPC   ✓      ✓      ✓      ✓             ✓            SYNC

Slide source: Sarita Adve et al
60
Data-Race-Free-0: Some Definitions
  • (Consider SC executions → global total order)
  • Two conflicting operations race if:
    • From different processors
    • Execute one after another (consecutively)

P1                  P2
Write, A, 23
Write, B, 37
                    Read, Flag, 0
Write, Flag, 1
                    Read, Flag, 1
                    Read, B, ___
                    Read, A, ___

  • Races are usually labeled as synchronization, others as data
  • Can optimize operations that never race

Slide source: Sarita Adve et al
61
Cache-Coherent Shared Memory and Performance