Transcript and Presenter's Notes

Title: Computer Architecture


1
Computer Architecture
Iolanthe II racing in Waitemata Harbour
  • MIMD Parallel Processors

2
Classification of Parallel Processors
  • Flynn's Taxonomy
  • Classifies according to instruction and data
    streams
  • Single Instruction Single Data
  • Sequential processors
  • Single Instruction Multiple Data
  • CM-2 multiple small processors
  • Vector processors
  • Parts of commercial processors - MMX, Altivec
  • Multiple Instruction Single Data
  • ?
  • Multiple Instruction Multiple Data
  • General Parallel Processors

3
MIMD Systems
  • Recipe
  • Buy a few high performance commercial PEs
  • DEC Alpha
  • MIPS R10000
  • UltraSPARC
  • Pentium?
  • Put them together with some memory and
    peripherals on a common bus
  • Instant parallel processor!
  • How to program it?

4
Programming Model
  • Problem not unique to MIMD
  • Even sequential machines need one
  • von Neumann (stored program) model
  • Parallel - Splitting the work load
  • Data
  • Distribute data to PEs
  • Instructions
  • Distribute tasks to PEs
  • Synchronization
  • Having divided the data and tasks, how do we
    synchronize the tasks?

5
Programming Model Shared Memory Model
  • Shared Memory Model
  • Flavour of the year
  • Generally thought to be simplest to manage
  • All PEs see a common (virtual) address space
  • PEs communicate by writing into the common
    address space

6
Data Distribution
  • Trivial
  • All the data sits in the common address space
  • Any PE can access it!
  • Uniform Memory Access (UMA) systems
  • All PEs access all data with the same access
    time, tacc
  • Non-UMA (NUMA) systems
  • Memory is physically distributed
  • Some PEs are closer to some addresses
  • More later!

7
Synchronisation
  • Read static shared data
  • No problem!
  • Update problem
  • PE0 writes x
  • PE1 reads x
  • How to ensure that PE1 reads the last value
    written by PE0?
  • Semaphores
  • Lock resources (memory areas or ...) while being
    updated by one PE

8
Synchronisation
  • Semaphore
  • Data structure in memory
  • Count of waiters
  • -1 : resource free
  • ≥ 0 : resource in use
  • Pointer to list of waiters
  • Two operations
  • Wait
  • Proceed immediately if resource free (waiter
    count -1)
  • Notify
  • Advise semaphore that you have finished with
    resource
  • Decrement waiter count
  • First waiter will be given control
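As a concrete illustration (not code from the slides), here is a minimal C sketch of this semaphore. The count convention follows the slide (-1 = free, ≥ 0 = in use / waiter count); tcb_t, block() and make_runnable() are hypothetical stand-ins for the OS scheduler interface.

```c
/* Sketch of the semaphore described above.
 * Count convention: -1 = resource free, >= 0 = in use / waiter count. */

typedef struct tcb {                /* task control block (placeholder) */
    struct tcb *next;
} tcb_t;

typedef struct {
    int    count;                   /* -1: free, >= 0: in use            */
    tcb_t *waiters;                 /* list of blocked tasks             */
} semaphore_t;

static void block(tcb_t *t)         { (void)t; /* scheduler: sleep */ }
static void make_runnable(tcb_t *t) { (void)t; /* scheduler: wake  */ }

/* Wait: proceed at once if free, otherwise queue and block.
 * Note the read-modify-write of count is deliberately NOT atomic here:
 * that is exactly the race shown on the next slide. */
void sem_wait(semaphore_t *s, tcb_t *self)
{
    if (s->count == -1) {
        s->count = 0;               /* free -> in use, no waiters */
    } else {
        s->count++;                 /* one more waiter            */
        self->next = s->waiters;    /* push onto waiter list      */
        s->waiters = self;
        block(self);
    }
}

/* Notify: finished with the resource; hand it to the first waiter, if any. */
void sem_notify(semaphore_t *s)
{
    if (s->count == 0) {
        s->count = -1;              /* nobody waiting: free again */
    } else {
        tcb_t *t = s->waiters;      /* wake one waiter            */
        s->waiters = t->next;
        s->count--;
        make_runnable(t);
    }
}
```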

9
Semaphores - Implementation
  • Scenario
  • Semaphore free (-1)
  • PE0 wait ..
  • Resource free, so PE0 uses it (sets 0)
  • PE1 wait ..
  • Reads count (0)
  • Starts to increment it ..
  • PE0 notify ..
  • Gets bus and writes -1
  • PE1 (finishing wait)
  • Adds 1 to 0, writes 1 to count, adds PE1 TCB to
    list
  • Stalemate!
  • Who issues notify to free the resource?

10
Atomic Operations
  • Problem
  • PE0 wrote a new value (-1) after PE1 had read
    the counter
  • PE1 increments the value it read (0) and writes
    it back
  • Solution
  • PE1's read and update must be atomic
  • No other PE must gain access to the counter while
    PE1 is updating it
  • Usually an architecture will provide
  • Test and set instruction
  • Read a memory location, test it; if it's 0, write
    a new value, else do nothing
  • Atomic or indivisible .. No other PE can access
    the value until the operation is complete

11
Atomic Operations
  • Test & Set
  • Read a memory location, test it; if it's 0, write
    a new value, else do nothing
  • Can be used to guard a resource (see the spinlock
    sketch below)
  • When the location contains 0, access to the
    resource is allowed
  • Non-zero value means the resource is locked
  • Semaphore
  • Simple semaphore (no wait list)
  • Implement directly
  • Waiter backs off and tries again (rather than
    being queued)
  • Complex semaphore (with wait list)
  • Guards the wait counter
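A hedged sketch of the "simple semaphore" case above, using the C11 atomic_flag test-and-set as a stand-in for a hardware test-and-set instruction (0 / clear = free, set = locked); the waiter simply backs off and tries again rather than being queued.

```c
#include <stdatomic.h>

/* Lock built on test-and-set: clear = resource free, set = locked. */
static atomic_flag lock = ATOMIC_FLAG_INIT;

void acquire(void)
{
    /* test-and-set: atomically set the flag and return its old value;
     * keep retrying while some other PE already holds the lock */
    while (atomic_flag_test_and_set_explicit(&lock, memory_order_acquire))
        ;   /* back off and try again */
}

void release(void)
{
    atomic_flag_clear_explicit(&lock, memory_order_release);
}

/* Usage (complex semaphore case): the lock guards the waiter count, e.g.
 *     acquire();
 *     sem->count++;      // the read-modify-write is now safe
 *     release();
 */
```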

12
Atomic Operations
  • Processor must provide an atomic operation for
  • Multi-tasking or multi-threading on a single PE
  • Multiple processes
  • Interrupts occur at arbitrary points in time
  • including timer interrupts signaling end of
    time-slice
  • Any process can be interrupted in the middle of a
    read-modify-write sequence
  • Shared memory multi-processors
  • One PE can lose control of the bus after the read
    of a read-modify-write
  • Cache?
  • Later!

13
Atomic Operations
  • Variations
  • Provide equivalent capability
  • Sometimes appear in strange guises!
  • Read-modify-write bus transactions
  • Memory location is read, modified and written
    back as a single, indivisible operation
  • Test and exchange
  • Check a register's value; if 0, exchange it with
    memory
  • Reservation Register (PowerPC)
  • lwarx - load word and reserve indexed
  • stwcx - store word conditional indexed
  • Reservation register stores address of reserved
    word
  • Reservation and use can be separated by sequence
    of instructions
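lwarx/stwcx is a load-linked / store-conditional pair. As an illustration (assuming C11 atomics, which compilers typically lower to an lwarx ... stwcx. loop on PowerPC), here is an atomic increment in that style: the reservation and the conditional store are separated by ordinary instructions.

```c
#include <stdatomic.h>

/* Atomic increment in the lwarx/stwcx style: load (and reserve), compute,
 * then store only if no other PE has touched the word in the meantime;
 * otherwise the reservation is lost and the loop retries. */
void atomic_increment(_Atomic int *counter)
{
    int old = atomic_load_explicit(counter, memory_order_relaxed); /* "lwarx"  */
    int desired;
    do {
        desired = old + 1;                   /* arbitrary work can go here     */
    } while (!atomic_compare_exchange_weak_explicit(
                 counter, &old, desired,     /* "stwcx.": fails and retries if */
                 memory_order_acq_rel,       /* the reservation was lost       */
                 memory_order_relaxed));
}
```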

14
Synchronization - High level
15
Barriers
  • In a shared memory environment
  • PEs must know when another PE has produced a
    result
  • Simplest case: a barrier for all PEs
  • Must be inserted by the programmer
  • Potentially expensive
  • All PEs stall and waste time in the barrier
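As an illustration (not part of the slides), a POSIX-threads barrier expresses the same idea: every thread stalls in pthread_barrier_wait until all participants have arrived.

```c
#include <pthread.h>
#include <stdio.h>

#define NUM_PES 4
static pthread_barrier_t barrier;

static void *worker(void *arg)
{
    long id = (long)arg;
    /* ... each PE produces its partial result here ... */
    printf("PE%ld reached the barrier\n", id);

    /* every PE stalls here until the last one arrives - this is the
     * "potentially expensive" global synchronisation point */
    pthread_barrier_wait(&barrier);

    /* ... now safe to consume results produced by the other PEs ... */
    return NULL;
}

int main(void)
{
    pthread_t t[NUM_PES];
    pthread_barrier_init(&barrier, NULL, NUM_PES);
    for (long i = 0; i < NUM_PES; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (int i = 0; i < NUM_PES; i++)
        pthread_join(t[i], NULL);
    pthread_barrier_destroy(&barrier);
    return 0;
}
```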

16
PE-PE synchronization
  • Barriers are global and potentially wasteful
  • Small group of PEs (subset of total) may be
    working on a sub-task
  • Need to synchronize within the group
  • Steps
  • Allocate a semaphore (it's just a block of memory)
  • PEs within the group access a shared location
    guarded by this semaphore
  • e.g.
  • the shared location is a count of PEs which have
    completed their tasks; each PE increments the
    count when it completes
  • the master monitors the count until all PEs have
    finished
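A minimal sketch of the counting scheme above, using a C11 atomic counter for the shared location rather than an explicit semaphore; GROUP_SIZE and the function names are illustrative.

```c
#include <stdatomic.h>
#include <sched.h>

#define GROUP_SIZE 4
static atomic_int done_count = 0;   /* shared completion count for the group */

/* Called by each PE in the sub-group when it finishes its task. */
void task_complete(void)
{
    atomic_fetch_add_explicit(&done_count, 1, memory_order_release);
}

/* The master polls (yielding) until every PE in the group has finished. */
void master_wait_for_group(void)
{
    while (atomic_load_explicit(&done_count, memory_order_acquire) < GROUP_SIZE)
        sched_yield();              /* be polite while polling */
}
```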

17
Cache
  • Performance of a modern PE depends on the
    cache(s)!

18
Cache?
  • What happens to cached locations?

19
Multiple Caches
  • Coherence
  • PEA reads location x from memory
  • Copy in cache A
  • PEB reads location x from memory
  • Copy in cache B
  • PEA adds 1

20
Multiple Caches - Inconsistent states
  • Coherence
  • PEA reads location x from memory
  • Copy in cache A
  • PEB reads location x from memory
  • Copy in cache B
  • PEA adds 1
  • A's copy is now 201
  • PEB reads location x
  • Reads 200 from cache B!!

21
Multiple Caches - Inconsistent states
  • Coherence
  • PEA reads location x from memory
  • Copy in cache A
  • PEB reads location x from memory
  • Copy in cache B
  • PEA adds 1
  • A's copy is now 201
  • PEB reads location x
  • Reads 200 from cache B
  • Caches and memory are now inconsistent, or not
    coherent

22
Cache - Maintaining Coherence
  • Invalidate on write
  • PEA reads location x from memory
  • Copy in cache A
  • PEB reads location x from memory
  • Copy in cache B
  • PEA adds 1
  • A's copy is now 201
  • PEA issues "invalidate x"
  • Cache B marks x invalid
  • Invalidate is address transaction only

23
Cache - Maintaining Coherence
  • Reading the new value
  • PEB reads location x
  • Main memory is wrong also
  • PEA snoops the read
  • Realises it has a valid copy
  • PEA issues retry

24
Cache - Maintaining Coherence
  • Reading the new value
  • PEB reads location x
  • Main memory is wrong also!
  • PEA snoops the read
  • Realises it has a valid copy
  • PEA issues retry
  • PEA writes x back
  • Memory now correct
  • PEB reads location x again
  • Reads latest version

25
Coherent Cache - Snooping
  • The SIU (system interface unit) snoops the bus for
    transactions
  • Addresses are compared with the local cache
  • On matches (hits in the local cache)
  • Initiate retries
  • when the local copy is modified; the local copy
    is then written to the bus
  • Invalidate local copies
  • when another PE is writing
  • Mark local copies shared
  • when a second PE is reading the same value

26
Coherent Cache - MESI protocol
  • Cache line has 4 states
  • Invalid
  • Modified
  • Only valid copy
  • Memory copy is invalid
  • Exclusive
  • Only cached copy
  • Memory copy is valid
  • Shared
  • Multiple cached copies
  • Memory copy is valid

27
MESI State Diagram
  • Note the number of bus transactions needed!

Legend: WH = Write Hit, WM = Write Miss, RH = Read Hit,
RMS = Read Miss Shared, RME = Read Miss Exclusive,
SHW = Snoop Hit on a Write
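A simplified sketch of the processor-side transitions behind this diagram, using the event names in the legend above (RH, RMS, RME, WH, WM); the bus-snoop transitions such as SHW are omitted for brevity, so this is an illustration rather than the full protocol.

```c
/* The four MESI states of a cache line and a simplified view of how a
 * line moves between them on local reads and writes. */
typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } mesi_t;

/* State of a line after the local PE reads it.
 * 'others_have_copy' is learned by snooping the read miss on the bus. */
mesi_t on_local_read(mesi_t s, int others_have_copy)
{
    if (s == INVALID)                        /* read miss                  */
        return others_have_copy ? SHARED     /* RMS: other caches hold it  */
                                : EXCLUSIVE; /* RME: only cached copy      */
    return s;                                /* RH: read hit, state kept   */
}

/* State of a line after the local PE writes it.  A write hit on Shared
 * must broadcast an invalidate first; a write miss must fetch the line
 * first.  Either way the local copy ends up Modified and the memory
 * copy becomes stale. */
mesi_t on_local_write(mesi_t s)
{
    (void)s;                                 /* WH and WM both end here    */
    return MODIFIED;
}
```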
28
Coherent Cache - The Cost
  • Cache coherency transactions
  • Additional transactions needed
  • Shared
  • Write Hit
  • Other caches must be notified
  • Modified
  • Other PE read
  • Push-out needed
  • Other PE write
  • Push-out needed - writing one word of n-word line
  • Invalid - modified in other cache
  • Read or write
  • Wait for push-out

29
Clusters
  • A bus which is too long becomes slow!
  • e.g. PCI is limited to 10 TTL loads
  • Lots of processors?
  • On the same bus
  • Bus speed must be limited
  • Low communication rate
  • Better to use a single PE!
  • Clusters
  • 8 processors on a bus

30
Clusters
8 cache coherent (CC) processors on a bus
Interconnect network
100? clusters
31
Clusters
The Network Interface Unit (NIU) detects requests
for remote memory
32
Clusters
A message (a memory request message) is despatched
to the remote cluster's NIU
33
Clusters - Shared Memory
  • Non Uniform Memory Access
  • Access time to memory depends on location!

From the PEs in this cluster, the local memory is much
closer than memory in a remote cluster!
34
Clusters - Shared Memory
  • Non Uniform Memory Access
  • Access time to memory depends on location!

Worse! NIU needs to maintain cache
coherence across the entire machine
35
Clusters - Maintaining Cache Coherence
  • NIU (or equivalent) maintains directory
  • Directory Entries
  • All lines from local memory cached elsewhere
  • NIU software (firmware)
  • Checks memory requests against the directory
  • Updates the directory
  • Sends invalidate messages to other clusters
  • Fetches modified (dirty) lines from other clusters
  • Remote memory access cost
  • 100s of cycles!

Directory (Cluster 2)

  Address   Status   Clusters
  4340      S        1, 3, 8
  5260      E        9
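A sketch of one directory entry like those above, assuming a bit-vector of sharing clusters; send_invalidate() is a hypothetical NIU message primitive, not an API from the slides.

```c
#include <stdint.h>

void send_invalidate(int cluster, uint32_t line_addr);  /* hypothetical NIU message */

/* One entry per locally-homed line cached elsewhere, mirroring the table
 * above: address, status, and the set of clusters holding a copy. */
typedef enum { DIR_SHARED, DIR_EXCLUSIVE } dir_state_t;

typedef struct {
    uint32_t    line_addr;   /* e.g. 4340, 5260 above               */
    dir_state_t state;       /* S: clean copies, E: a single owner  */
    uint64_t    sharers;     /* bit i set => cluster i has a copy   */
} dir_entry_t;

/* NIU firmware hook (sketch): another cluster wants to write the line. */
void handle_remote_write(dir_entry_t *e, int writer_cluster)
{
    /* invalidate every other cluster's copy: one message per set bit */
    for (int c = 0; c < 64; c++)
        if (((e->sharers >> c) & 1) && c != writer_cluster)
            send_invalidate(c, e->line_addr);

    e->sharers = 1ULL << writer_cluster;     /* writer is now the sole owner */
    e->state   = DIR_EXCLUSIVE;
}
```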
36
Clusters - Off the shelf
  • Commercial clusters
  • Provide page migration
  • Make copy of a remote page on the local PE
  • Programmer remains responsible for coherence
  • Don't provide hardware support for cache
    coherence (across the network)
  • Fully CC machines may never be available!
  • Software Systems
  • ... (next slide)

37
Shared Memory Systems
  • Software Systems
  • e.g. TreadMarks
  • Provide shared memory on page basis
  • Software
  • detects references to remote pages
  • moves copy to local memory
  • Reduces shared memory overhead
  • Provides some of the shared memory model
    convenience
  • Without swamping interconnection network with
    messages
  • Message overhead is too high for a single word!
  • Word basis is too expensive!!
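A hedged sketch of how such a page-based system can detect references to "remote" pages in software (an illustration of the mechanism, not TreadMarks' actual implementation): the shared region is mapped with no access, the first touch of a page faults, and the handler fetches a copy and re-enables access. fetch_page_from_owner() is a stub for the network request.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <sys/mman.h>
#include <unistd.h>
#include <stdint.h>

#define SHARED_SIZE (1 << 20)            /* 1 MB of "shared" address space */
static size_t page_size;

/* Stub: a real DSM would request the page's contents from whichever
 * cluster currently owns it, over the interconnect. */
static void fetch_page_from_owner(void *page) { (void)page; }

/* Any access to a not-yet-present page faults; the handler "fetches" a
 * copy and re-enables access, so the faulting program simply continues. */
static void dsm_fault_handler(int sig, siginfo_t *si, void *ctx)
{
    (void)sig; (void)ctx;
    void *page = (void *)((uintptr_t)si->si_addr & ~(page_size - 1));
    fetch_page_from_owner(page);
    mprotect(page, page_size, PROT_READ | PROT_WRITE);
}

/* Map the shared region with no access so that every first touch of a
 * page traps into dsm_fault_handler.  Returns the base of the region. */
void *dsm_init(void)
{
    page_size = (size_t)sysconf(_SC_PAGESIZE);

    struct sigaction sa = {0};
    sa.sa_flags     = SA_SIGINFO;
    sa.sa_sigaction = dsm_fault_handler;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, NULL);

    return mmap(NULL, SHARED_SIZE, PROT_NONE,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
}
```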

38
Granularity in Parallel Systems
39
Shared Memory Systems - Granularity
  • Granularity
  • Keeping data coherent on a word basis is too
    expensive!!
  • Sharing data at low granularity
  • Fine grain sharing
  • Access / sharing for individual words
  • Overheads too high
  • Number of messages
  • Message overhead is high for one word
  • Compare
  • Burst access to memory
  • Don't fetch a single word -
  • Overhead (bus protocol) is too high
  • Amortize cost of access over multiple words

40
Shared Memory Systems - Granularity
  • Coarse Grain Systems
  • Transferring data from cluster to cluster
  • Overhead
  • Messages
  • Updating directory
  • Amortise the overhead over a whole page
  • Lower relative overhead
  • Applies to thread size also
  • Split program into small threads of control
  • Parallel Overhead
  • Cost of setting up and starting each thread
  • Cost of synchronising at the end of a set of
    threads
  • Can be more efficient to run a single sequential
    thread!

41
Coarse Grain Systems
  • So far ...
  • Most experiments suggest that fine grain systems
    are impractical
  • Larger, coarser grain
  • Blocks of data
  • Threads of computation
  • needed to reduce overall computation time by
    using multiple processors
  • Parallel systems that are too fine grained
  • can run slower than a single processor!

42
Parallel Overhead
  • Ideal
  • T(n) = time to solve the problem with n PEs
  • Sequential time = T(1)
  • We'd like
  • T(n) = T(1) / n
  • Add Overhead
  • Time > optimal
  • No point in using more than 4 PEs!!

(Graph: actual T(n) versus the ideal T(1) / n.)
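A small numeric illustration (the numbers are assumed, not from the slides) of why overhead caps the useful number of PEs: model the actual time as T(1)/n plus a per-PE overhead and watch it stop improving.

```c
#include <stdio.h>

int main(void)
{
    /* Illustrative numbers only: T(1) = 1000 time units of real work,
     * plus an assumed 40 units of overhead per PE (thread start-up,
     * synchronisation, coherence traffic). */
    const double t1 = 1000.0, overhead_per_pe = 40.0;

    for (int n = 1; n <= 8; n++) {
        double ideal  = t1 / n;                       /* T(1) / n       */
        double actual = t1 / n + overhead_per_pe * n; /* with overhead  */
        printf("n=%d  ideal=%6.1f  actual=%6.1f\n", n, ideal, actual);
    }
    /* With these numbers 'actual' bottoms out at n = 5 and then rises
     * again - beyond that point extra PEs make the program slower. */
    return 0;
}
```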
43
Parallel Overhead
  • Ideal
  • Time = T(1) / n
  • Add Overhead
  • Time > optimal
  • No point in using more than 4 PEs!!

44
Parallel Overhead
  • Shared memory systems
  • Best results if you
  • Share on large block basis
  • eg page
  • Split the program into coarse grain (long running)
    threads
  • Give away some parallelism to achieve any
    parallel speedup!
  • Coarse grain
  • Data
  • Computation

There's parallelism at the instruction level
too! The instruction issue unit in a sequential
processor is trying to exploit it!
45
Clusters - Improving multiple PE performance
  • Bandwidth to memory
  • Cache reduces dependency on the memory-CPU
    interface
  • 95% cache hits
  • 5% of memory accesses crossing the interface
  • but add
  • a few PEs and
  • a few CC transactions
  • even if the interface was coping before, it won't
    in a multiprocessor system!

A major bottleneck!
46
Clusters - Improving multiple PE performance
  • Bus protocols add to access time
  • Request / Grant / Release phases needed
  • Point-to-point is faster!
  • Cross-bar switch interface to memory
  • No PE contends with any other for the common
    bus

Cross-bar? Name taken from old telephone
exchanges!
47
Clusters - Memory Bandwidth
  • Modern Clusters
  • Use Point-to-point X-bar interfaces to memory
    to get bandwidth!
  • Cache coherence?
  • Now really hard!!
  • How does each cache snoop all transactions?

48
Programming Model - Distributed Memory
  • Distributed Memory
  • also Message passing
  • Alternative to shared memory
  • Each PE has own address space
  • PEs communicate with messages
  • Messages provide synchronisation
  • A PE can block or wait for a message

49
Programming Model - Distributed Memory
  • Distributed Memory Systems
  • Hardware is simple!
  • Network can be as simple as Ethernet
  • Networks of Workstations model
  • Commodity (cheap!) PEs
  • Commodity Network
  • Standard
  • Ethernet
  • ATM
  • Proprietary
  • Myrinet
  • Achilles (UWA!)

50
Programming Model - Distributed Memory
  • Distributed Memory Systems
  • Software is considered harder
  • Programmer responsible for
  • Distributing data to individual PEs
  • Explicit Thread control
  • Starting, stopping and synchronising
  • At least two commonly available systems
  • Parallel Virtual Machine (PVM)
  • Message Passing Interface (MPI)
  • Built on two operations
  • Send ( data, destPE, block / don't block )
  • Receive ( data, srcPE, block / don't block )
  • Blocking ensures synchronisation
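A minimal example of the two operations using MPI's blocking MPI_Send / MPI_Recv; the blocking receive also provides the synchronisation mentioned above (run with at least two ranks, e.g. mpirun -np 2).

```c
#include <mpi.h>
#include <stdio.h>

/* Minimal message-passing example: rank 0 sends a value to rank 1.
 * Rank 1 cannot proceed past MPI_Recv until the message has arrived,
 * so the blocking receive doubles as synchronisation. */
int main(int argc, char **argv)
{
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;                                       /* data to distribute */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("PE %d received %d\n", rank, value);
    }

    MPI_Finalize();
    return 0;
}
```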

51
Programming Model - Distributed Memory
  • Distributed Memory Systems
  • Performance generally better (versus shared
    memory)
  • Shared memory has hidden overheads
  • Grain size poorly chosen
  • e.g. data doesn't fit into pages
  • Unnecessary coherence transactions
  • Updating a shared region (each page) before the
    end of the computation
  • MP system waits and updates page when computation
    is complete

52
Programming Model - Distributed Memory
  • Distributed Memory Systems
  • Performance generally better (versus shared
    memory)
  • False sharing
  • Severely degrades performance
  • May not be apparent on superficial analysis

(Figure: a single memory page; PEa accesses the data at
one end and PEb the data at the other, so the whole page
ping-pongs between PEa and PEb.)
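A sketch of that ping-pong: two threads update adjacent counters that share a cache line (or, at page granularity, a page); padding each counter out to its own line (64 bytes assumed) removes the false sharing.

```c
#include <pthread.h>

/* The two PEs update *different* words, but the words share one cache
 * line, so the line ping-pongs between the two caches as in the figure.
 * Padding each counter to its own (assumed 64-byte) line fixes it. */
struct counters {
    long a;                    /* updated only by PEa                    */
    /* char pad[64 - sizeof(long)];  <- uncomment to stop false sharing  */
    long b;                    /* updated only by PEb - same line as a!  */
};

static struct counters c;

static void *pe_a(void *arg) { for (long i = 0; i < 100000000; i++) c.a++; return arg; }
static void *pe_b(void *arg) { for (long i = 0; i < 100000000; i++) c.b++; return arg; }

int main(void)
{
    pthread_t ta, tb;
    pthread_create(&ta, NULL, pe_a, NULL);
    pthread_create(&tb, NULL, pe_b, NULL);
    pthread_join(ta, NULL);
    pthread_join(tb, NULL);
    return 0;
}
```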
53
Distributed Memory - Summary
  • Simpler (almost trivial) hardware
  • Software
  • More programmer effort
  • Explicit data distribution
  • Explicit synchronisation
  • Performance generally better
  • Programmer knows more about the problem
  • Communicates only when necessary
  • Communication grain size can be optimum
  • Lower overheads