Transcript and Presenter's Notes

Title: Computer Architecture


1
Computer Architecture
Iolanthe II racing in Waitemata Harbour
  • MIMD Parallel Processors

2
Classification of Parallel Processors
  • Flynn's Taxonomy
  • Classifies according to instruction and data
    streams
  • Single Instruction Single Data
  • Sequential processors
  • Single Instruction Multiple Data
  • CM-2 multiple small processors
  • Vector processors
  • Parts of commercial processors - MMX, Altivec
  • Multiple Instruction Single Data
  • ?
  • Multiple Instruction Multiple Data
  • General Parallel Processors

3
MIMD Systems
  • Recipe
  • Buy a few high performance commercial PEs
  • DEC Alpha
  • MIPS R10000
  • UltraSPARC
  • Pentium?
  • Put them together with some memory and
    peripherals on a common bus
  • Instant parallel processor!
  • How to program it?

4
Programming Model
  • Problem not unique to MIMD
  • Even sequential machines need one
  • von Neumann (stored program) model
  • Parallel - Splitting the work load
  • Data
  • Distribute data to PEs
  • Instructions
  • Distribute tasks to PEs
  • Synchronization
  • Having divided the data and tasks, how do we
    synchronize the tasks?

5
Programming Model Shared Memory Model
  • Shared Memory Model
  • Flavour of the year
  • Generally thought to be simplest to manage
  • All PEs see a common (virtual) address space
  • PEs communicate by writing into the common
    address space

6
Data Distribution
  • Trivial
  • All the data sits in the common address space
  • Any PE can access it!
  • Uniform Memory Access (UMA) systems
  • All PEs access all data with the same access
    time, tacc
  • Non-UMA (NUMA) systems
  • Memory is physically distributed
  • Some PEs are closer to some addresses
  • More later!

7
Synchronisation
  • Read static shared data
  • No problem!
  • Update problem
  • PE0 writes x
  • PE1 reads x
  • How to ensure that PE1 reads the last value
    written by PE0?
  • Semaphores
  • Lock resources (memory areas or ...) while being
    updated by one PE

8
Synchronisation
  • Semaphore
  • Data structure in memory
  • Count of waiters
  • -1 : resource free
  • ≥ 0 : resource in use
  • Pointer to list of waiters
  • Two operations
  • Wait
  • Proceed immediately if resource free (waiter
    count -1)
  • Notify
  • Advise semaphore that you have finished with
    resource
  • Decrement waiter count
  • First waiter will be given control
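As a concrete illustration (not code from the slides), here is a minimal C sketch of this semaphore. The count convention follows the slide (-1 = free, ≥ 0 = in use / waiter count); tcb_t, block() and make_runnable() are hypothetical stand-ins for the OS scheduler interface.

```c
/* Sketch of the semaphore described above.
 * Count convention: -1 = resource free, >= 0 = in use / waiter count. */

typedef struct tcb {                /* task control block (placeholder) */
    struct tcb *next;
} tcb_t;

typedef struct {
    int    count;                   /* -1: free, >= 0: in use            */
    tcb_t *waiters;                 /* list of blocked tasks             */
} semaphore_t;

static void block(tcb_t *t)         { (void)t; /* scheduler: sleep */ }
static void make_runnable(tcb_t *t) { (void)t; /* scheduler: wake  */ }

/* Wait: proceed at once if free, otherwise queue and block.
 * Note the read-modify-write of count is deliberately NOT atomic here:
 * that is exactly the race shown on the next slide. */
void sem_wait(semaphore_t *s, tcb_t *self)
{
    if (s->count == -1) {
        s->count = 0;               /* free -> in use, no waiters */
    } else {
        s->count++;                 /* one more waiter            */
        self->next = s->waiters;    /* push onto waiter list      */
        s->waiters = self;
        block(self);
    }
}

/* Notify: finished with the resource; hand it to the first waiter, if any. */
void sem_notify(semaphore_t *s)
{
    if (s->count == 0) {
        s->count = -1;              /* nobody waiting: free again */
    } else {
        tcb_t *t = s->waiters;      /* wake one waiter            */
        s->waiters = t->next;
        s->count--;
        make_runnable(t);
    }
}
```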

9
Semaphores - Implementation
  • Scenario
  • Semaphore free (-1)
  • PE0 wait ..
  • Resource free, so PE0 uses it (sets 0)
  • PE1 wait ..
  • Reads count (0)
  • Starts to increment it ..
  • PE0 notify ..
  • Gets bus and writes -1
  • PE1 (finishing wait)
  • Adds 1 to 0, writes 1 to count, adds PE1 TCB to
    list
  • Stalemate!
  • Who issues notify to free the resource?

10
Atomic Operations
  • Problem
  • PE0 wrote a new value (-1) after PE1 had read
    the counter
  • PE1 increments the value it read (0) and writes
    it back
  • Solution
  • PE1's read and update must be atomic
  • No other PE must gain access to the counter while
    PE1 is updating it
  • Usually an architecture will provide
  • Test and set instruction
  • Read a memory location, test it; if it's 0, write
    a new value, else do nothing
  • Atomic or indivisible .. No other PE can access
    the value until the operation is complete

11
Atomic Operations
  • Test & Set
  • Read a memory location, test it; if it's 0, write
    a new value, else do nothing
  • Can be used to guard a resource (see the spinlock
    sketch below)
  • When the location contains 0, access to the
    resource is allowed
  • Non-zero value means the resource is locked
  • Semaphore
  • Simple semaphore (no wait list)
  • Implement directly
  • Waiter backs off and tries again (rather than
    being queued)
  • Complex semaphore (with wait list)
  • Guards the wait counter
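A hedged sketch of the "simple semaphore" case above, using the C11 atomic_flag test-and-set as a stand-in for a hardware test-and-set instruction (0 / clear = free, set = locked); the waiter simply backs off and tries again rather than being queued.

```c
#include <stdatomic.h>

/* Lock built on test-and-set: clear = resource free, set = locked. */
static atomic_flag lock = ATOMIC_FLAG_INIT;

void acquire(void)
{
    /* test-and-set: atomically set the flag and return its old value;
     * keep retrying while some other PE already holds the lock */
    while (atomic_flag_test_and_set_explicit(&lock, memory_order_acquire))
        ;   /* back off and try again */
}

void release(void)
{
    atomic_flag_clear_explicit(&lock, memory_order_release);
}

/* Usage (complex semaphore case): the lock guards the waiter count, e.g.
 *     acquire();
 *     sem->count++;      // the read-modify-write is now safe
 *     release();
 */
```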

12
Atomic Operations
  • Processor must provide an atomic operation for
  • Multi-tasking or multi-threading on a single PE
  • Multiple processes
  • Interrupts occur at arbitrary points in time
  • including timer interrupts signaling end of
    time-slice
  • Any process can be interrupted in the middle of a
    read-modify-write sequence
  • Shared memory multi-processors
  • One PE can lose control of the bus after the read
    of a read-modify-write
  • Cache?
  • Later!

13
Atomic Operations
  • Variations
  • Provide equivalent capability
  • Sometimes appear in strange guises!
  • Read-modify-write bus transactions
  • Memory location is read, modified and written
    back as a single, indivisible operation
  • Test and exchange
  • Check a register's value; if 0, exchange it with
    memory
  • Reservation Register (PowerPC)
  • lwarx - load word and reserve indexed
  • stwcx - store word conditional indexed
  • Reservation register stores address of reserved
    word
  • Reservation and use can be separated by sequence
    of instructions
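lwarx/stwcx is a load-linked / store-conditional pair. As an illustration (assuming C11 atomics, which compilers typically lower to an lwarx ... stwcx. loop on PowerPC), here is an atomic increment in that style: the reservation and the conditional store are separated by ordinary instructions.

```c
#include <stdatomic.h>

/* Atomic increment in the lwarx/stwcx style: load (and reserve), compute,
 * then store only if no other PE has touched the word in the meantime;
 * otherwise the reservation is lost and the loop retries. */
void atomic_increment(_Atomic int *counter)
{
    int old = atomic_load_explicit(counter, memory_order_relaxed); /* "lwarx"  */
    int desired;
    do {
        desired = old + 1;                   /* arbitrary work can go here     */
    } while (!atomic_compare_exchange_weak_explicit(
                 counter, &old, desired,     /* "stwcx.": fails and retries if */
                 memory_order_acq_rel,       /* the reservation was lost       */
                 memory_order_relaxed));
}
```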

14
Synchronization - High level
15
Barriers
  • In a shared memory environment
  • PEs must know when another PE has produced a
    result
  • Simplest case: a barrier for all PEs
  • Must be inserted by the programmer
  • Potentially expensive
  • All PEs stall and waste time in the barrier
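As an illustration (not part of the slides), a POSIX-threads barrier expresses the same idea: every thread stalls in pthread_barrier_wait until all participants have arrived.

```c
#include <pthread.h>
#include <stdio.h>

#define NUM_PES 4
static pthread_barrier_t barrier;

static void *worker(void *arg)
{
    long id = (long)arg;
    /* ... each PE produces its partial result here ... */
    printf("PE%ld reached the barrier\n", id);

    /* every PE stalls here until the last one arrives - this is the
     * "potentially expensive" global synchronisation point */
    pthread_barrier_wait(&barrier);

    /* ... now safe to consume results produced by the other PEs ... */
    return NULL;
}

int main(void)
{
    pthread_t t[NUM_PES];
    pthread_barrier_init(&barrier, NULL, NUM_PES);
    for (long i = 0; i < NUM_PES; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (int i = 0; i < NUM_PES; i++)
        pthread_join(t[i], NULL);
    pthread_barrier_destroy(&barrier);
    return 0;
}
```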

16
PE-PE synchronization
  • Barriers are global and potentially wasteful
  • Small group of PEs (subset of total) may be
    working on a sub-task
  • Need to synchronize within the group
  • Steps
  • Allocate a semaphore (it's just a block of memory)
  • PEs within the group access a shared location
    guarded by this semaphore
  • e.g.
  • the shared location is a count of PEs which have
    completed their tasks; each PE increments the
    count when it completes
  • the master monitors the count until all PEs have
    finished
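A minimal sketch of the counting scheme above, using a C11 atomic counter for the shared location rather than an explicit semaphore; GROUP_SIZE and the function names are illustrative.

```c
#include <stdatomic.h>
#include <sched.h>

#define GROUP_SIZE 4
static atomic_int done_count = 0;   /* shared completion count for the group */

/* Called by each PE in the sub-group when it finishes its task. */
void task_complete(void)
{
    atomic_fetch_add_explicit(&done_count, 1, memory_order_release);
}

/* The master polls (yielding) until every PE in the group has finished. */
void master_wait_for_group(void)
{
    while (atomic_load_explicit(&done_count, memory_order_acquire) < GROUP_SIZE)
        sched_yield();              /* be polite while polling */
}
```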

17
Cache
  • Performance of a modern PE depends on the
    cache(s)!

18
Cache?
  • What happens to cached locations?

19
Multiple Caches
  • Coherence
  • PEA reads location x from memory
  • Copy in cache A
  • PEB reads location x from memory
  • Copy in cache B
  • PEA adds 1

20
Multiple Caches - Inconsistent states
  • Coherence
  • PEA reads location x from memory
  • Copy in cache A
  • PEB reads location x from memory
  • Copy in cache B
  • PEA adds 1
  • A's copy is now 201
  • PEB reads location x
  • Reads 200 from cache B!!

21
Multiple Caches - Inconsistent states
  • Coherence
  • PEA reads location x from memory
  • Copy in cache A
  • PEB reads location x from memory
  • Copy in cache B
  • PEA adds 1
  • A's copy is now 201
  • PEB reads location x
  • Reads 200 from cache B
  • Caches and memory are now inconsistent, or not
    coherent

22
Cache - Maintaining Coherence
  • Invalidate on write
  • PEA reads location x from memory
  • Copy in cache A
  • PEB reads location x from memory
  • Copy in cache B
  • PEA adds 1
  • A's copy is now 201
  • PEA issues "invalidate x"
  • Cache B marks x invalid
  • Invalidate is address transaction only

23
Cache - Maintaining Coherence
  • Reading the new value
  • PEB reads location x
  • Main memory is wrong also
  • PEA snoops the read
  • Realises it has a valid copy
  • PEA issues retry

24
Cache - Maintaining Coherence
  • Reading the new value
  • PEB reads location x
  • Main memory is wrong also!
  • PEA snoops the read
  • Realises it has a valid copy
  • PEA issues retry
  • PEA writes x back
  • Memory now correct
  • PEB reads location x again
  • Reads latest version

25
Coherent Cache - Snooping
  • The SIU (system interface unit) snoops the bus for
    transactions
  • Addresses are compared with the local cache
  • On matches (hits in the local cache)
  • Initiate retries
  • when the local copy is modified; the local copy
    is then written to the bus
  • Invalidate local copies
  • when another PE is writing
  • Mark local copies shared
  • when a second PE is reading the same value

26
Coherent Cache - MESI protocol
  • Cache line has 4 states
  • Invalid
  • Modified
  • Only valid copy
  • Memory copy is invalid
  • Exclusive
  • Only cached copy
  • Memory copy is valid
  • Shared
  • Multiple cached copies
  • Memory copy is valid

27
MESI State Diagram
  • Note the number of bus transactions needed!

Legend: WH = Write Hit, WM = Write Miss, RH = Read Hit,
RMS = Read Miss Shared, RME = Read Miss Exclusive,
SHW = Snoop Hit on a Write
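A simplified sketch of the processor-side transitions behind this diagram, using the event names in the legend above (RH, RMS, RME, WH, WM); the bus-snoop transitions such as SHW are omitted for brevity, so this is an illustration rather than the full protocol.

```c
/* The four MESI states of a cache line and a simplified view of how a
 * line moves between them on local reads and writes. */
typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } mesi_t;

/* State of a line after the local PE reads it.
 * 'others_have_copy' is learned by snooping the read miss on the bus. */
mesi_t on_local_read(mesi_t s, int others_have_copy)
{
    if (s == INVALID)                        /* read miss                  */
        return others_have_copy ? SHARED     /* RMS: other caches hold it  */
                                : EXCLUSIVE; /* RME: only cached copy      */
    return s;                                /* RH: read hit, state kept   */
}

/* State of a line after the local PE writes it.  A write hit on Shared
 * must broadcast an invalidate first; a write miss must fetch the line
 * first.  Either way the local copy ends up Modified and the memory
 * copy becomes stale. */
mesi_t on_local_write(mesi_t s)
{
    (void)s;                                 /* WH and WM both end here    */
    return MODIFIED;
}
```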
28
Coherent Cache - The Cost
  • Cache coherency transactions
  • Additional transactions needed
  • Shared
  • Write Hit
  • Other caches must be notified
  • Modified
  • Other PE read
  • Push-out needed
  • Other PE write
  • Push-out needed - writing one word of n-word line
  • Invalid - modified in other cache
  • Read or write
  • Wait for push-out

29
Clusters
  • A bus which is too long becomes slow!
  • e.g. PCI is limited to 10 TTL loads
  • Lots of processors?
  • On the same bus
  • Bus speed must be limited
  • Low communication rate
  • Better to use a single PE!
  • Clusters
  • 8 processors on a bus

30
Clusters
8 cache coherent (CC) processors on a bus
Interconnect network
100? clusters
31
Clusters
The Network Interface Unit (NIU) detects requests
for remote memory
32
Clusters
A message (a memory request message) is despatched
to the remote cluster's NIU
33
Clusters - Shared Memory
  • Non Uniform Memory Access
  • Access time to memory depends on location!

From the PEs in this cluster, the local memory is much
closer than memory in a remote cluster!
34
Clusters - Shared Memory
  • Non Uniform Memory Access
  • Access time to memory depends on location!

Worse! NIU needs to maintain cache
coherence across the entire machine
35
Clusters - Maintaining Cache Coherence
  • NIU (or equivalent) maintains directory
  • Directory Entries
  • All lines from local memory cached elsewhere
  • NIU software (firmware)
  • Checks memory requests against the directory
  • Updates the directory
  • Sends invalidate messages to other clusters
  • Fetches modified (dirty) lines from other clusters
  • Remote memory access cost
  • 100s of cycles!

Directory (Cluster 2)

  Address   Status   Clusters
  4340      S        1, 3, 8
  5260      E        9
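A sketch of one directory entry like those above, assuming a bit-vector of sharing clusters; send_invalidate() is a hypothetical NIU message primitive, not an API from the slides.

```c
#include <stdint.h>

void send_invalidate(int cluster, uint32_t line_addr);  /* hypothetical NIU message */

/* One entry per locally-homed line cached elsewhere, mirroring the table
 * above: address, status, and the set of clusters holding a copy. */
typedef enum { DIR_SHARED, DIR_EXCLUSIVE } dir_state_t;

typedef struct {
    uint32_t    line_addr;   /* e.g. 4340, 5260 above               */
    dir_state_t state;       /* S: clean copies, E: a single owner  */
    uint64_t    sharers;     /* bit i set => cluster i has a copy   */
} dir_entry_t;

/* NIU firmware hook (sketch): another cluster wants to write the line. */
void handle_remote_write(dir_entry_t *e, int writer_cluster)
{
    /* invalidate every other cluster's copy: one message per set bit */
    for (int c = 0; c < 64; c++)
        if (((e->sharers >> c) & 1) && c != writer_cluster)
            send_invalidate(c, e->line_addr);

    e->sharers = 1ULL << writer_cluster;     /* writer is now the sole owner */
    e->state   = DIR_EXCLUSIVE;
}
```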
36
Clusters - Off the shelf
  • Commercial clusters
  • Provide page migration
  • Make copy of a remote page on the local PE
  • Programmer remains responsible for coherence
  • Don't provide hardware support for cache
    coherence (across the network)
  • Fully CC machines may never be available!
  • Software Systems
  • ... (next slide)

37
Shared Memory Systems
  • Software Systems
  • e.g. TreadMarks
  • Provide shared memory on page basis
  • Software
  • detects references to remote pages
  • moves copy to local memory
  • Reduces shared memory overhead
  • Provides some of the shared memory model
    convenience
  • Without swamping interconnection network with
    messages
  • Message overhead is too high for a single word!
  • Word basis is too expensive!!
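A hedged sketch of how such a page-based system can detect references to "remote" pages in software (an illustration of the mechanism, not TreadMarks' actual implementation): the shared region is mapped with no access, the first touch of a page faults, and the handler fetches a copy and re-enables access. fetch_page_from_owner() is a stub for the network request.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <sys/mman.h>
#include <unistd.h>
#include <stdint.h>

#define SHARED_SIZE (1 << 20)            /* 1 MB of "shared" address space */
static size_t page_size;

/* Stub: a real DSM would request the page's contents from whichever
 * cluster currently owns it, over the interconnect. */
static void fetch_page_from_owner(void *page) { (void)page; }

/* Any access to a not-yet-present page faults; the handler "fetches" a
 * copy and re-enables access, so the faulting program simply continues. */
static void dsm_fault_handler(int sig, siginfo_t *si, void *ctx)
{
    (void)sig; (void)ctx;
    void *page = (void *)((uintptr_t)si->si_addr & ~(page_size - 1));
    fetch_page_from_owner(page);
    mprotect(page, page_size, PROT_READ | PROT_WRITE);
}

/* Map the shared region with no access so that every first touch of a
 * page traps into dsm_fault_handler.  Returns the base of the region. */
void *dsm_init(void)
{
    page_size = (size_t)sysconf(_SC_PAGESIZE);

    struct sigaction sa = {0};
    sa.sa_flags     = SA_SIGINFO;
    sa.sa_sigaction = dsm_fault_handler;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, NULL);

    return mmap(NULL, SHARED_SIZE, PROT_NONE,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
}
```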

38
Granularity in Parallel Systems
39
Shared Memory Systems - Granularity
  • Granularity
  • Keeping data coherent on a word basis is too
    expensive!!
  • Sharing data at low granularity
  • Fine grain sharing
  • Access / sharing for individual words
  • Overheads too high
  • Number of messages
  • Message overhead is high for one word
  • Compare
  • Burst access to memory
  • Don't fetch a single word -
  • Overhead (bus protocol) is too high
  • Amortize cost of access over multiple words

40
Shared Memory Systems - Granularity
  • Coarse Grain Systems
  • Transferring data from cluster to cluster
  • Overhead
  • Messages
  • Updating directory
  • Amortise the overhead over a whole page
  • Lower relative overhead
  • Applies to thread size also
  • Split program into small threads of control
  • Parallel Overhead
  • Cost of setting up and starting each thread
  • Cost of synchronising at the end of a set of
    threads
  • Can be more efficient to run a single sequential
    thread!

41
Coarse Grain Systems
  • So far ...
  • Most experiments suggest that fine grain systems
    are impractical
  • Larger, coarser grain
  • Blocks of data
  • Threads of computation
  • needed to reduce overall computation time by
    using multiple processors
  • Parallel systems that are too fine grained
  • can run slower than a single processor!

42
Parallel Overhead
  • Ideal
  • T(n) = time to solve the problem with n PEs
  • Sequential time = T(1)
  • We'd like
  • T(n) = T(1) / n
  • Add Overhead
  • Time > optimal
  • No point in using more than 4 PEs!!

(Graph: actual T(n) versus the ideal T(1) / n.)
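A small numeric illustration (the numbers are assumed, not from the slides) of why overhead caps the useful number of PEs: model the actual time as T(1)/n plus a per-PE overhead and watch it stop improving.

```c
#include <stdio.h>

int main(void)
{
    /* Illustrative numbers only: T(1) = 1000 time units of real work,
     * plus an assumed 40 units of overhead per PE (thread start-up,
     * synchronisation, coherence traffic). */
    const double t1 = 1000.0, overhead_per_pe = 40.0;

    for (int n = 1; n <= 8; n++) {
        double ideal  = t1 / n;                       /* T(1) / n       */
        double actual = t1 / n + overhead_per_pe * n; /* with overhead  */
        printf("n=%d  ideal=%6.1f  actual=%6.1f\n", n, ideal, actual);
    }
    /* With these numbers 'actual' bottoms out at n = 5 and then rises
     * again - beyond that point extra PEs make the program slower. */
    return 0;
}
```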
43
Parallel Overhead
  • Ideal
  • Time = T(1) / n
  • Add Overhead
  • Time > optimal
  • No point in using more than 4 PEs!!

44
Parallel Overhead
  • Shared memory systems
  • Best results if you
  • Share on large block basis
  • eg page
  • Split the program into coarse grain (long running)
    threads
  • Give away some parallelism to achieve any
    parallel speedup!
  • Coarse grain
  • Data
  • Computation

There's parallelism at the instruction level
too! The instruction issue unit in a sequential
processor is trying to exploit it!
45
Clusters - Improving multiple PE performance
  • Bandwidth to memory
  • Cache reduces dependency on the memory-CPU
    interface
  • 95% cache hits
  • 5% of memory accesses crossing the interface
  • but add
  • a few PEs and
  • a few CC transactions
  • even if the interface was coping before, it won't
    in a multiprocessor system!

A major bottleneck!
46
Clusters - Improving multiple PE performance
  • Bus protocols add to access time
  • Request / Grant / Release phases needed
  • Point-to-point is faster!
  • Cross-bar switch interface to memory
  • No PE contends with any other for the common
    bus

Cross-bar? Name taken from old telephone
exchanges!
47
Clusters - Memory Bandwidth
  • Modern Clusters
  • Use Point-to-point X-bar interfaces to memory
    to get bandwidth!
  • Cache coherence?
  • Now really hard!!
  • How does each cache snoop all transactions?

48
Programming Model - Distributed Memory
  • Distributed Memory
  • also Message passing
  • Alternative to shared memory
  • Each PE has own address space
  • PEs communicate with messages
  • Messages provide synchronisation
  • A PE can block or wait for a message

49
Programming Model - Distributed Memory
  • Distributed Memory Systems
  • Hardware is simple!
  • Network can be as simple as Ethernet
  • Networks of Workstations model
  • Commodity (cheap!) PEs
  • Commodity Network
  • Standard
  • Ethernet
  • ATM
  • Proprietary
  • Myrinet
  • Achilles (UWA!)

50
Programming Model - Distributed Memory
  • Distributed Memory Systems
  • Software is considered harder
  • Programmer responsible for
  • Distributing data to individual PEs
  • Explicit Thread control
  • Starting, stopping and synchronising
  • At least two commonly available systems
  • Parallel Virtual Machine (PVM)
  • Message Passing Interface (MPI)
  • Built on two operations
  • Send ( data, destPE, block / don't block )
  • Receive ( data, srcPE, block / don't block )
  • Blocking ensures synchronisation
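A minimal example of the two operations using MPI's blocking MPI_Send / MPI_Recv; the blocking receive also provides the synchronisation mentioned above (run with at least two ranks, e.g. mpirun -np 2).

```c
#include <mpi.h>
#include <stdio.h>

/* Minimal message-passing example: rank 0 sends a value to rank 1.
 * Rank 1 cannot proceed past MPI_Recv until the message has arrived,
 * so the blocking receive doubles as synchronisation. */
int main(int argc, char **argv)
{
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;                                       /* data to distribute */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("PE %d received %d\n", rank, value);
    }

    MPI_Finalize();
    return 0;
}
```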

51
Programming Model - Distributed Memory
  • Distributed Memory Systems
  • Performance generally better (versus shared
    memory)
  • Shared memory has hidden overheads
  • Grain size poorly chosen
  • e.g. data doesn't fit into pages
  • Unnecessary coherence transactions
  • Updating a shared region (each page) before the
    end of the computation
  • MP system waits and updates page when computation
    is complete

52
Programming Model - Distributed Memory
  • Distributed Memory Systems
  • Performance generally better (versus shared
    memory)
  • False sharing
  • Severely degrades performance
  • May not be apparent on superficial analysis

(Figure: a single memory page; PEa accesses the data at
one end and PEb the data at the other, so the whole page
ping-pongs between PEa and PEb.)
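A sketch of that ping-pong: two threads update adjacent counters that share a cache line (or, at page granularity, a page); padding each counter out to its own line (64 bytes assumed) removes the false sharing.

```c
#include <pthread.h>

/* The two PEs update *different* words, but the words share one cache
 * line, so the line ping-pongs between the two caches as in the figure.
 * Padding each counter to its own (assumed 64-byte) line fixes it. */
struct counters {
    long a;                    /* updated only by PEa                    */
    /* char pad[64 - sizeof(long)];  <- uncomment to stop false sharing  */
    long b;                    /* updated only by PEb - same line as a!  */
};

static struct counters c;

static void *pe_a(void *arg) { for (long i = 0; i < 100000000; i++) c.a++; return arg; }
static void *pe_b(void *arg) { for (long i = 0; i < 100000000; i++) c.b++; return arg; }

int main(void)
{
    pthread_t ta, tb;
    pthread_create(&ta, NULL, pe_a, NULL);
    pthread_create(&tb, NULL, pe_b, NULL);
    pthread_join(ta, NULL);
    pthread_join(tb, NULL);
    return 0;
}
```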
53
Distributed Memory - Summary
  • Simpler (almost trivial) hardware
  • Software
  • More programmer effort
  • Explicit data distribution
  • Explicit synchronisation
  • Performance generally better
  • Programmer knows more about the problem
  • Communicates only when necessary
  • Communication grain size can be optimum
  • Lower overheads