Title: Computer Architecture


1
  • Computer Architecture
  • Key Points
  • John Morris
  • Electrical & Computer Engineering / Computer
    Science, The University of Auckland

Iolanthe II drifts off Waiheke Island
2
Memory Bottleneck
  • State-of-the-art processor
  • f = 3 GHz
  • tclock = 330 ps
  • 1-2 instructions per cycle
  • ~25% of instructions reference memory
  • Memory response needed:
  • 4 instructions x 330 ps ≈ 1.3 ns!
  • Bulk semiconductor RAM
  • ~100 ns for a random access!
  • Processor will spend most of its time waiting for
    memory!

3
Memory Bottleneck
  • Assume
  • Clock speed, f = 3 GHz
  • ⇒ Cycle time, tcyc = 1/f ≈ 330 ps
  • 32-bit = 4-byte machine word
  • ⇒ Internal bandwidth = (bytes per word) x f
    = 4 x f = 12 GB/s
  • 64-bit PCI bus, fbus = 32 MHz
    (the bandwidth arithmetic is sketched below)

Arrow width (roughly) indicates data bandwidth
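To make the mismatch concrete, a worked comparison of the two bandwidths (using the 32 MHz bus figure quoted above; nominal PCI clocks are 33 MHz):

    \[
    BW_{\mathrm{internal}} = 4\,\text{B} \times 3\,\text{GHz} = 12\,\text{GB/s},
    \qquad
    BW_{\mathrm{PCI}} = 8\,\text{B} \times 32\,\text{MHz} \approx 0.26\,\text{GB/s}
    \approx \frac{BW_{\mathrm{internal}}}{47}
    \]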
4
Cache
  • Small, fast memory
  • Typically 50kbytes (1998)
  • 2 cycle access time
  • Same die as processor
  • Off-chip cache possible
  • Custom cache chip closely coupled to processor
  • Use fast static RAM (SRAM) rather than slower
    dynamic RAM
  • Several levels possible
  • 2nd level of the memory hierarchy
  • Caches most recently used memory locations
    closer to the processor
  • 'closer' = closer in time

7
Memory hierarchy performance
  • Usual metric is machine cycle time, tcyc = 1/f
  • Visible to programmer
  • Registers: < 1 cycle latency
    (respond in same cycle)
  • Transparent to programmer
  • Level 1 (L1) cache: 2 cycle latency
  • L2 cache: 5-6 cycles
  • L3 cache: about 10 cycles
  • Main memory: ~100 cycles
    for a random access
  • Disc: > 1 ms, or > 10^6 cycles
  • Effective memory access time, teff = Σ fi ti
    where fi = fraction of hits at level i,
    ti = access time at level i
    (a small numeric sketch follows)
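A small numeric sketch of this formula in C; the hit fractions are assumptions chosen only to illustrate the calculation, not measured values:

    #include <stdio.h>

    int main(void) {
        /* Assumed hit fractions per level (sum to 1.0) and latencies in cycles */
        double f[] = { 0.90, 0.07, 0.02, 0.01 };   /* L1, L2, L3, main memory */
        double t[] = { 2.0,  6.0,  10.0, 100.0 };
        double t_eff = 0.0;
        for (int i = 0; i < 4; i++)
            t_eff += f[i] * t[i];                  /* t_eff = sum over i of f_i * t_i */
        printf("Effective access time = %.2f cycles\n", t_eff);   /* prints 3.42 */
        return 0;
    }

Even with a 90% L1 hit rate, the occasional trip to main memory dominates the average.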

8
Cache - organisation
  • Direct-mapped cache
  • Each word in the cache has a tag
  • Assume
  • cache size - 2^k words
  • machine words - p bits
  • byte-addressed memory
  • m = log2(p/8) low-order bits select the byte
    within a word (not needed to address words)
  • m = 2 for 32-bit machines

Address format (p bits):
tag (p-k-m bits) | cache address (k bits) | byte address (m bits)
(an address-splitting sketch in C follows)
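A minimal sketch of the split for a direct-mapped cache, assuming p = 32, k = 10 (a 1024-line cache) and m = 2; the constants and the example address are illustrative only:

    #include <stdint.h>
    #include <stdio.h>

    #define K 10   /* log2(number of cache lines) - assumed for illustration */
    #define M 2    /* log2(bytes per word) for a 32-bit word                 */

    int main(void) {
        uint32_t addr  = 0x12345678;                       /* arbitrary example address  */
        uint32_t byte  =  addr       & ((1u << M) - 1);    /* low m bits                 */
        uint32_t index = (addr >> M) & ((1u << K) - 1);    /* next k bits: cache address */
        uint32_t tag   =  addr >> (M + K);                 /* remaining p-k-m bits       */
        printf("tag = 0x%x, cache address = %u, byte = %u\n", tag, index, byte);
        return 0;
    }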
9
Cache - organisation
  • Direct-mapped cache

[Figure: each cache line holds a tag (p-k-m bits) and a data word (p bits); the cache has 2^k lines. The memory address from the CPU splits into tag | cache address (k bits) | byte address (m bits); the cache address selects a line and the stored tag is compared with the address tag to generate Hit?]
10
Cache - Conflicts
  • Conflicts
  • Two addresses separated by 2^(k+m) bytes will hit
    the same cache location

Address format (p bits): tag (p-k-m) | cache address (k) | byte address (m)
Addresses in which these k bits are the same will
map to the same cache line
11
Cache - Conflicts
  • When a word is modified in cache
  • Write-back cache
  • Only writes data back when needed
  • Misses
  • Two memory accesses
  • Write modified word back
  • Read new word
  • Write-through cache
  • Low priority write to main memory is queued
  • Processor is delayed by read only
  • Memory write occurs in parallel with other work
  • Instruction and necessary data fetches take
    priority

12
Cache - Write-through or write-back?
  • Write-through
  • Seems a good idea!
  • but ...
  • Multiple writes to the same location waste memory
    bus bandwidth
  • Typical programs better with write-back caches
  • however
  • Often you can easily predict which will be best
  • Some processors (e.g. PowerPC) allow you to
    classify memory regions as write-back or
    write-through

13
Cache - more bits
  • Cache lines need some status bits
  • Tag bits ..
  • Valid
  • All set to false on power up
  • Set to true as words are loaded into cache
  • Dirty
  • Needed by write-back cache
  • Write-through cache always queues the write, so
    lines are never dirty

Cache line fields: Tag (p-k-m bits) | V (1 bit) | M (1 bit) | Data (p bits)
(see the C struct sketched below)
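A minimal C rendering of such a line, assuming 32-bit words; the bit-field layout is illustrative only:

    #include <stdint.h>

    /* One cache line: tag, valid (V) and dirty (M) status bits, and a data word. */
    struct cache_line {
        uint32_t     tag;        /* the p-k-m tag bits (held in a full word here)     */
        unsigned int valid : 1;  /* V: cleared at power-up, set as the line is loaded */
        unsigned int dirty : 1;  /* M: set on a write in a write-back cache           */
        uint32_t     data;       /* one p-bit (here 32-bit) machine word              */
    };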
14
Cache Improving Performance
  • Conflicts (addresses 2^(k+m) bytes apart)
  • Degrade cache performance
  • Lower hit rate
  • Murphy's Law operates
  • Addresses are never random!
  • Some locations thrash in cache
  • Continually replaced and restored
  • Alternatively
  • Ideal cache performance depends on uniform access
    to all parts of memory
  • Never happens in real programs!

15
Cache - Fully Associative
  • All tags are compared at the same time
  • Words can use any cache line

16
Cache - Fully Associative
  • Associative
  • Each tag is compared at the same time
  • Any match ⇒ hit
  • Avoids unnecessary flushing
  • Replacement
  • Least Recently Used - LRU
  • Needs extra status bits
  • Cycles since last accessed
  • Hardware cost high
  • Extra comparators
  • Wider tags
  • p-m bits vs p-k-m bits

17
Cache - Set Associative

2-way set associative
Each line holds two words ⇒ two comparators only
18
Cache - Set Associative
  • n-way set associative caches
  • n can be small: 2, 4, 8
  • Best performance
  • Reasonable hardware cost
  • Most high performance processors
  • Replacement policy
  • LRU choice from n
  • Reasonable LRU approximation
  • 1 or 2 bits
  • Set on access
  • Cleared / decremented by timer
  • Choose a cleared word for replacement
    (an n-way lookup sketch follows)
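A minimal sketch of an n-way lookup in C (n = 2; the set count and field names are assumptions for illustration; real hardware compares all n tags in parallel):

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define WAYS 2            /* n = 2-way set associative          */
    #define K    8            /* log2(number of sets) - assumed     */
    #define M    2            /* byte-within-word bits, 32-bit word */
    #define SETS (1u << K)

    struct line { uint32_t tag; bool valid; uint32_t data; };
    static struct line cache[SETS][WAYS];

    bool lookup(uint32_t addr, uint32_t *data) {
        uint32_t set = (addr >> M) & (SETS - 1);
        uint32_t tag =  addr >> (M + K);
        for (int w = 0; w < WAYS; w++)       /* done in parallel by the comparators */
            if (cache[set][w].valid && cache[set][w].tag == tag) {
                *data = cache[set][w].data;
                return true;                 /* hit */
            }
        return false;                        /* miss: fetch line, replace an LRU/cleared way */
    }

    int main(void) {
        uint32_t d;
        printf("hit = %d\n", lookup(0x1000, &d));   /* cold cache, so this misses */
        return 0;
    }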

19
Cache - Locality of Reference
  • Temporal Locality
  • Same location will be referenced again soon
  • Access same data again
  • Program loops - access same instruction again
  • Caches described so far exploit temporal locality
  • Spatial Locality
  • Nearby locations will be referenced soon
  • Next element of an array
  • Next instruction of a program

20
Cache - Line Length
  • Spatial Locality
  • Use very long cache lines
  • Fetch one datum
  • Neighbours fetched also
  • PowerPC 601 (Motorola/Apple/IBM) - first of the
    single-chip Power processors
  • 64 sets
  • 8-way set associative
  • 32 bytes per line
  • 32 bytes (8 instructions) fetched into
    instruction buffer in one cycle
  • 64 x 8 x 32 = 16 Kbytes total

21
Cache - Separate I- and D-caches
  • Unified cache
  • Instructions and Data in same cache
  • Two caches -
  • Instructions + Data
  • Increases total bandwidth
  • MIPS R10000
  • 32 Kbyte Instruction + 32 Kbyte Data
  • Instruction cache is pre-decoded! (32 → 36 bits)
  • Data
  • 8-word (64-byte) line, 2-way set associative
  • 256 sets
  • Replacement policy?

22
COMPSYS 304
  • Computer Architecture
  • Memory Management Units

Reefed down - heading for Great Barrier Island
23
Memory Management Unit
  • Virtual Address Space
  • Each user has a private address space

User D's Address Space
24
Virtual Addresses
  • Mappings between user space and physical memory
    created by OS

25
Memory Management Unit (MMU)
  • Responsible for VIRTUAL → PHYSICAL address
    mapping
  • Sits between CPU and cache
  • Cache operates on Physical Addresses (mostly -
    some research on VA caches)

[Figure: the CPU issues virtual addresses (VA); the MMU translates VA → PA; the cache and main memory are accessed with physical addresses (PA), for both data (D) and instructions (I).]
26
MMU - operation
[Figure: MMU operation - the virtual page number (the upper q-k bits of the virtual address) is looked up in the page table; the low-order page-offset bits pass through unchanged.]
27
MMU - Virtual memory space
  • Page Table Entries can also point to disc blocks
  • Valid bit
  • Set ⇒ page in memory, address is the physical
    page address
  • Cleared ⇒ page swapped out, address is the
    disc block address
  • MMU hardware generates a page fault when a
    swapped-out page is requested
  • Allows the virtual memory space to be larger than
    physical memory
  • Only the working set is in physical memory
  • Remainder on the paging disc (a PTE-lookup sketch
    follows)
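A minimal sketch of the valid-bit check, assuming 4 KB pages; the field and function names are invented for illustration:

    #include <stdint.h>

    /* Illustrative page table entry: the valid bit decides how 'frame' is read. */
    struct pte {
        unsigned int valid : 1;   /* 1: page resident in memory                 */
        unsigned int frame : 31;  /* physical page number, or disc block number */
    };

    /* Hypothetical translation of one virtual page number (vpn) plus offset. */
    uint32_t translate(struct pte *page_table, uint32_t vpn, uint32_t offset,
                       void (*page_fault)(uint32_t disc_block)) {
        struct pte e = page_table[vpn];
        if (!e.valid) {
            page_fault(e.frame);            /* OS loads the page from disc, fixes the PTE */
            e = page_table[vpn];            /* re-read the now-valid entry                */
        }
        return ((uint32_t)e.frame << 12) | offset;   /* 4 KB pages: PA = frame | offset   */
    }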

28
Page Fault
[Figure: page fault - the page table entry for the requested virtual page has its valid bit cleared, so the MMU raises a page fault and the OS brings the page in from disc.]
29
MMU Page faults
  • Very expensive!
  • Gap in access times
  • Main memory ~100 ns
  • Disc ~1 ms
  • A factor of 10^4 slower!!
  • May require write-back of old (but modified) page
  • May require reading of Page Table Entries from
    disc!
  • Good way to make a system thrash!

30
MMU Access control
  • Provides additional protection to programmer
  • Pages can be marked
  • Read only
  • Execute only
  • Can prevent wayward programmes from corrupting
    their own programme code or vital data
  • Protection is hardware!
  • MMU will raise exception if illegal access
    attempted
  • OS traps the exception and processes it

31
MMU
  • Inverted page tables
  • Scheme which saves memory for page tables
  • One PTE per page of physical memory
  • Hash function used
  • Collisions probable
  • Possibly slower
  • Sharing
  • Map virtual pages for several users to same
    physical page
  • Good for sharing program code
  • Data also (read/write control provided by OS)
  • Saves physical memory
  • Reduces pressure on main memory

32
MMU
  • TLB
  • Cache for page table entries
  • Enables the MMU to translate VA → PA in time!
  • Can be quite small: 50-100 entries
  • Often fully associative
  • Small size avoids one cost of FA cache
  • Only 50-100 comparators needed
  • TLB Coverage
  • Amount of memory covered by TLB entries
  • Size of a program for which VA → PA translation
    will be fast (a coverage example follows)
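As a worked example, assuming 64 TLB entries and 4 KB pages (both figures are illustrative assumptions, not values from the slides):

    \[
    \text{coverage} = \text{entries} \times \text{page size}
                    = 64 \times 4\,\text{KB} = 256\,\text{KB}
    \]

Programs whose working set exceeds the coverage start taking TLB misses even when the data itself is still in cache.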

33
Memory Hierarchy - Operation
34
System Interface Unit
  • Tasks
  • Control bus
  • Match cache line length to bus width
  • Follow bus protocol
  • Request / Grant / Data cycles
  • Manage burst transactions
  • Burst transactions ⇒ greater bus efficiency
  • More work (data cycles) per transaction
  • Overhead (request + grant + address) is a smaller
    fraction of the total bus cycles per transaction
  • Maintain transaction queues
  • Read (high priority)
  • Write (low priority)
  • Reads check write Q for latest copy of data

35
System Interface Unit Bus efficiency
  • Split phase transactions
  • Separate address and data buses
  • Separate address and data phases
  • Overlap ⇒ greater bus utilization
  • Multiple transactions in flight at any time
  • Slow peripheral devices don't hog the bus and
    prevent fast transactions (e.g. memory) from
    accessing the bus

2nd transaction starts before 1st completes
Overhead cycles
Work cycles
36
System Interface Unit Bus efficiency
  • Single purpose bus
  • Graphics, memory
  • Simpler, faster
  • Single direction (CPU → graphics buffer)
  • Single device (e.g. memory)
  • Simpler protocol (only one type of device)
  • Point to point wiring
  • Shorter, faster
  • Single driver (no need for delay in switch from
    read to write)

37
Superscalar Processors
  • Superpipelined
  • Deep pipeline (>5 stages)
  • Hazards and dependencies limit depth
  • Each stage has overhead
  • Registers needed
  • Larger circuit
  • Speed reduction
  • >8 stages ⇒ decrease in efficiency
  • vs
  • Superscalar
  • next slide

38
Superscalar Processors
  • Superscalar
  • Multiple functional units
  • Integer ALUs, FPUs, branches, load/store
  • Floating point typically 3 internal stages
  • Usually several integer ALUs per FPU
  • Addressing, loop calcs need integer ALU
  • Instruction issue unit is now more complex
  • Determines which instructions can be issued in
    each cycle
  • What data is ready?
  • Which functional units are free?
  • Typically tries to issue 4 instructions / cycle
  • Achieves 2-3 instructions / cycle on average
  • Out of order execution
  • Instructions executed when data is available
  • Dependent instructions may stall while later ones
    execute
  • Number of functional units > instruction issue
    width
  • e.g. 6 FUs, max 4 instructions / cycle

39
Speculation
  • Data prefetch
  • Try to get data into cache well in advance
  • No stall for memory read when data actually
    needed
  • PowerPC: dcbt = data cache block touch
  • Advice to the system - a low-priority read
  • Pentium: prefetchTx (x = 0, 1, 2)
  • Semantics vary between Pentium 3 and Pentium 4
  • Pentium 4 fetches into L2 cache only
  • Compiler can detect many patterns
  • e.g. sequential access of array elements
  • for (j = 0; j < n; j++) sum += x[j];
  • Programmer can insert pre-fetch instructions
    (a prefetch sketch follows)
  • Speculative because data may not be needed
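A minimal sketch of programmer-inserted prefetching using GCC's __builtin_prefetch; the 16-element prefetch distance is an assumption, and the right value depends on memory latency and loop cost:

    /* Sum an array while prefetching ahead (compiles with GCC or Clang). */
    double sum_with_prefetch(const double *x, int n) {
        double sum = 0.0;
        for (int j = 0; j < n; j++) {
            if (j + 16 < n)
                __builtin_prefetch(&x[j + 16], 0, 1);  /* read hint, low temporal locality */
            sum += x[j];
        }
        return sum;
    }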

40
Speculation - branching
  • Branches are expensive
  • Stall pipeline
  • More expensive as pipeline depth increases!
  • Fetching useless instructions wastes bandwidth!
  • Couple Branch unit with Instruction Issue unit
  • Conditional branches
  • if ( cond ) s1 else s2
  • Execute both s1 and s2
  • If functional units and data available
  • Use idle resources!
  • Squash results from wrong branch when value of
    cond known
  • MIPS allows 4 streams of speculative execution
  • Pentium 4: up to 126 'in flight'?
  • From a web article by an obvious Intel fan
  • Starts with 'The Pentium still kicks butt.'
  • Not a good flag for an objective article!
  • Probably counts instruction issue unit buffers and
    system interface transactions too!

41
Parallel Processing
42
Parallel Processing
  • Communications bottleneck!
  • (Again!)
  • Limits ability to write efficient parallel
    systems
  • Exception
  • Small group of embarrassingly parallel systems
  • Very high computation : communication ratios
  • Long computation on small data sets
  • Results communicated to master PE are small
  • Ideal: n PEs ⇒ time t_n = t_1/n
  • Actual: t_n > t_1/n
  • Eventually t_n > t_(n-1)
  • Adding PEs slows things down!
  • Communications and thread management overhead
    (a simple speedup model is sketched below)
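One simple way to see why t_n eventually rises; the linear communication term is an illustrative model, not a measured law:

    \[
    t_n \approx \frac{t_1}{n} + t_{\mathrm{comm}}(n), \qquad S(n) = \frac{t_1}{t_n}
    \]

If, say, t_comm(n) = c·n, then t_n is minimised at n* = sqrt(t_1/c); beyond that point every extra PE makes the run slower.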

43
Parallel Processing
  • Flynn's Taxonomy
  • Simple, but useful starting point
  • Classification based on
  • I (instruction stream) and
  • D (data stream)
  • 4 classes
  • SISD (sequential PEs)
  • SIMD (many simple PEs, vector machines, MMX,
    Altivec),
  • MISD (no known examples),
  • MIMD (general parallel processor)

44
Parallel Processing Programming Models
  • Shared Memory Model
  • All PEs see a common address space
  • Trivial data distribution (none!)
  • Threads of computation need explicit
    synchronization
  • Synchronization is an overhead!
  • Dataflow or Functional
  • Message Passing
  • Details follow

45
Parallel Processing Programming Models
  • Dataflow Model
  • Execution is data-driven
  • Used as model for both hardware and software
  • Dataflow machines
  • Functional languages
  • Theoretically important
  • Produce provably correct programs
  • Slow in practice
  • Cilk: hybrid dataflow/imperative

46
Parallel Processing Programming Models
  • Message Passing
  • Execution is control-driven
  • Threads run in their own address spaces on each
    PE
  • Data transferred by sending and receiving
    messages
  • Available as libraries of functions
  • Can be invoked from any language
  • Commonly used
  • Message Passing Interface (MPI)
  • C, FORTRAN, libraries readily available
  • Parallel Virtual Machine (PVM)
  • Came first; generally considered less efficient
    than MPI
  • Two basic primitive operations
  • Send
  • send( destination_PE, data_address, n_bytes )
  • Receive
  • receive( destination_PE, data_address, n_bytes )
    (a minimal MPI sketch follows below)
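A minimal MPI rendering of the send/receive pattern in C; MPI_Send and MPI_Recv are the standard MPI calls, while the tag and the message contents here are arbitrary:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, value;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0) {                       /* PE 0 computes and sends ...            */
            value = 42;
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {                /* ... PE 1 blocks until the data arrives */
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("PE 1 received %d\n", value);
        }
        MPI_Finalize();
        return 0;
    }

The blocking receive is what provides the implicit synchronisation described on the next slide.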

47
Parallel Processing Programming Models
  • Message Passing
  • Two basic primitive operations
  • Send
  • send( destination_PE, data_address, n_bytes )
  • Receive
  • receive( destination_PE, data_address, n_bytes )
  • Data distributed explicitly by programmer
  • Using sends
  • Synchronisation is implicit
  • Receive waits for sender to complete computation
    and send data
  • Considered low level as a programming model
  • Programmer does everything!

48
Parallel Processing Programming Models
  • Message Passing
  • Efficient
  • Runs faster than shared memory
  • Cilk (hybrid dataflow/imperative) is faster, though
  • Programmer usually knows more about problem
  • Codes the minimum data distribution and
    synchronization needed
  • Libraries are easy to implement
  • Can use any communications network
  • Ethernet
  • ATM
  • Myrinet
  • etc
  • Popular
  • Most used in practice
  • Libraries are widely available
  • Programming concept is simple
  • Even though it requires slightly more work!

49
Architectures
  • Overriding messages
  • Communication overhead is the killer
  • If communication patterns do not match the needs
    of the problem being solved
  • then parallel overheads will swamp the benefit of
    adding PEs
  • Main overhead is sending and receiving data
  • Message overheads
  • Synchronization dead time
  • Coarse grain is the key
  • Give away low level parallelism
  • Minimize overheads
  • Larger messages
  • Longer running threads
  • Reduce the communication : computation ratio

50
Architectures
  • SIMD
  • Large numbers of small PEs connected in grid
  • Easy to build
  • Can solve certain problems efficiently
  • Variations
  • PEs are trivial - just an ALU
  • Simple PE, e.g. a microprocessor
  • Complex PE, e.g. a Pentium, with local memory
  • Systolic arrays
  • Linear communication patterns
  • Very limited range of problems
  • Idea appears in ALUs of modern processors
  • MMX (Intel), Altivec (Motorola),
  • Useful for graphics operations

51
Architectures
  • Vector machines
  • Three main components
  • Address generation unit
  • Streams vector data efficiently to and from
    memory
  • Handles address computation overhead
  • Vector registers
  • Fast FIFO queues for data
  • ALUs
  • Very fast floating point ALU
  • Efficient for wide range of problems requiring
    vector and matrix computations
  • Including sparse matrices (e.g. diagonal)
  • Expensive
  • n x 10^6 each

52
Architectures
  • Dataflow machines
  • Data driven not control driven
  • Dataflow graph exposes all possible parallelism
  • Originally expected to be able to extract maximum
    parallelism
  • and therefore maximum speedup!
  • Fine grain dataflow dies because of communication
    overhead
  • Coarse grain dataflow has potential
  • But difficult to attract interest in
    non-mainstream (i.e. non-Pentium) architectures
  • New processor development is expensive
  • n x 10^8 for each new 10^8-transistor chip

53
Architectures
  • Dataflow machines
  • Data driven not control driven
  • Idea survives in instruction issue unit of high
    performance superscalars
  • It issues instructions as
  • Data is available
  • Functional units are available
  • Checks dependencies, hazards
  • Finds instruction level parallelism (ILP)
  • Limited parallelism
  • Issues maximum 4-8 instructions in each cycle

54
Architectures
  • Network architectures
  • Crossbar
  • Ideal: any PE ↔ any PE direct communication
  • Only possible with low orders
  • Ethernet
  • Essentially linear common bus
  • Switches and world-wide grids provide
    additional paths and increase useful inter-PE
    bandwidth
  • Rectangular Grids
  • Easily implemented on 2-D circuit boards
  • Hypercubes
  • Reasonable compromise
  • Effective bandwidth between arbitrary PEs
  • Low order interconnection nodes
  • Useful theoretical properties
  • Simple definition of sub-cubes
  • Match between interconnection pattern and problem
    shape vital
  • Otherwise gains from additional PEs lost in comms
    overhead!