CS136, Advanced Architecture

Slides: 50
Provided by: csH2
Learn more at: https://www.cs.hmc.edu

Transcript and Presenter's Notes


1
CS136, Advanced Architecture
  • Limits to ILP
  • Simultaneous Multithreading

2
Outline
  • Limits to ILP (another perspective)
  • Thread Level Parallelism
  • Multithreading
  • Simultaneous Multithreading
  • Power 4 vs. Power 5
  • Head to Head VLIW vs. Superscalar vs. SMT
  • Commentary
  • Conclusion

3
Limits to ILP
  • Conflicting studies of amount
  • Benchmarks (vectorized Fortran FP vs. integer C
    programs)
  • Hardware sophistication
  • Compiler sophistication
  • How much ILP is available using existing
    mechanisms with increasing HW budgets?
  • Do we need to invent new HW/SW mechanisms to keep
    on processor performance curve?
  • Intel MMX, SSE (Streaming SIMD Extensions): 64-bit
    ints
  • Intel SSE2: 128-bit, including 2 64-bit Fl. Pt.
    per clock
  • Motorola AltiVec: 128-bit ints and FPs
  • Supersparc multimedia ops, etc.

4
Overcoming Limits
  • Advances in compiler technology plus significantly
    new and different hardware techniques may be able
    to overcome limitations assumed in studies
  • However, it is unlikely that such advances, when
    coupled with realistic hardware, will overcome
    these limits in near future

5
Limits to ILP
  • Initial HW model here; MIPS compilers.
  • Assumptions for ideal/perfect machine to start:
  • 1. Register renaming: infinite virtual
    registers, so all register WAW & WAR hazards are
    avoided
  • 2. Branch prediction: perfect; no
    mispredictions
  • 3. Jump prediction: all jumps perfectly
    predicted (returns, case statements)
  • 2 & 3 => no control dependencies; perfect
    speculation & an unbounded buffer of
    instructions available
  • 4. Memory-address alias analysis: addresses
    known & a load can be moved before a store
    provided addresses not equal
  • 1 & 4 eliminate all but RAW
  • Also: perfect caches; 1 cycle latency for all
    instructions (FP *,/); unlimited instructions
    issued/clock cycle
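
A minimal sketch of what the ideal-machine assumptions imply: with perfect renaming, prediction, and alias analysis, only true (RAW) dependences delay an instruction, so IPC is the instruction count divided by the length of the longest dependence chain. The trace format (destination register, list of source registers) and the function name `ideal_ipc` are assumptions made for illustration, not part of the original study.

```python
from typing import Dict, List, Tuple

# Hypothetical trace format: (destination register, source registers).
Instr = Tuple[str, List[str]]

def ideal_ipc(trace: List[Instr]) -> float:
    """IPC when only RAW dependences constrain issue (unit latency, unlimited issue)."""
    ready_cycle: Dict[str, int] = {}  # register -> cycle its newest value is available
    last_cycle = 0
    for dest, sources in trace:
        # Issue as soon as every source value exists; WAW/WAR never delay us
        # because renaming gives each write a fresh register.
        issue = max((ready_cycle.get(s, 0) for s in sources), default=0)
        done = issue + 1  # 1-cycle latency for all instructions
        ready_cycle[dest] = done
        last_cycle = max(last_cycle, done)
    return len(trace) / last_cycle if trace else 0.0

# Five instructions; the second write to r1 does not wait for the reader of the first.
trace = [("r1", []), ("r2", []), ("r3", ["r1", "r2"]), ("r1", []), ("r4", ["r1"])]
print(ideal_ipc(trace))
```

A fully serial chain gives an IPC of 1.0 under this model, however wide the machine is.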

6
Limits to ILP HW Model comparison
7
Upper Limit to ILP: Ideal Machine (Figure 3.1)
  • FP: 75 - 150 instructions per clock
  • Integer: 18 - 60 instructions per clock
8
Limits to ILP HW Model comparison
9
More Realistic HW: Window Impact (Figure 3.2)
  • Change from infinite window to 2048, 512, 128, 32
  • FP: 9 - 150 IPC
  • Integer: 8 - 63 IPC
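
The effect of a finite window can be sketched with a toy scheduler. This is a simplification, not the model used in the study: the window here is taken to be the oldest `window` not-yet-issued instructions, every instruction has unit latency, and issue width is unlimited. The trace format and the name `windowed_ipc` are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Instr = Tuple[str, List[str]]  # (destination register, source registers)

def windowed_ipc(trace: List[Instr], window: int) -> float:
    """IPC when only the oldest `window` unissued instructions may issue each cycle."""
    if not trace:
        return 0.0
    # For each instruction, find the earlier instructions that produce its sources.
    last_writer: Dict[str, int] = {}
    producers: List[List[int]] = []
    for i, (dest, sources) in enumerate(trace):
        producers.append([last_writer[s] for s in sources if s in last_writer])
        last_writer[dest] = i
    done: List[Optional[int]] = [None] * len(trace)  # cycle each result is ready
    pending = list(range(len(trace)))                # unissued, in program order
    cycle = 0
    while pending:
        candidates = pending[:window]
        issued = {i for i in candidates
                  if all(done[p] is not None and done[p] <= cycle
                         for p in producers[i])}
        for i in issued:
            done[i] = cycle + 1
        pending = [i for i in pending if i not in issued]
        cycle += 1
    return len(trace) / cycle

# A 4-long dependence chain followed by 4 independent instructions.
trace = [("r1", []), ("r1", ["r1"]), ("r1", ["r1"]), ("r1", ["r1"]),
         ("r2", []), ("r3", []), ("r4", []), ("r5", [])]
print(windowed_ipc(trace, 8), windowed_ipc(trace, 2))
```

With a window of 8 the independent instructions overlap the chain; with a window of 2 they sit outside the window until the chain drains, so IPC drops.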
10
Limits to ILP HW Model comparison
11
More Realistic HW: Branch Impact (Figure 3.3)
  • Change from infinite window to 2048, and maximum
    issue of 64 instructions per clock cycle
  • FP: 15 - 45 IPC
  • Integer: 6 - 12 IPC
  • Predictors compared: Perfect, Tournament, BHT (512),
    Profile, No prediction
12
Misprediction Rates
13
Limits to ILP HW Model comparison
14
More Realistic HW: Renaming Register Impact (N
int + N fp) (Figure 3.5)
  • Change to 2048 instr window, 64 instr issue, 8K
    2-level prediction
  • FP: 11 - 45 IPC
  • Integer: 5 - 15 IPC
  • Renaming registers compared: Infinite, 256, 128, 64,
    32, None
15
Limits to ILP HW Model comparison
16
More Realistic HW: Memory Address Alias
Impact (Figure 3.6)
  • Change to 2048 instr window, 64 instr issue, 8K
    2-level prediction, 256 renaming registers
  • FP: 4 - 45 IPC (Fortran, no heap)
  • Integer: 4 - 9 IPC
  • Alias analysis compared: Perfect, Global/stack
    perfect with heap conflicts, Compiler inspection, None
17
Limits to ILP HW Model comparison
18
Realistic HW: Window Impact (Figure 3.7)
  • Perfect disambiguation (HW), 1K selective
    prediction, 16-entry return, 64 registers, issue
    as many as window
  • FP: 8 - 45 IPC
  • Integer: 6 - 12 IPC
  • Window sizes compared: Infinite, 256, 128, 64, 32,
    16, 8, 4
19
Outline
  • Limits to ILP (another perspective)
  • Thread Level Parallelism
  • Multithreading
  • Simultaneous Multithreading
  • Power 4 vs. Power 5
  • Head to Head VLIW vs. Superscalar vs. SMT
  • Commentary
  • Conclusion

20
How to Exceed ILP Limits of This Study?
  • These are not laws of physics
  • Just practical limits for today
  • Could be overcome via research
  • Compiler and ISA advances could change results
  • WAR and WAW hazards through memory: the study
    eliminated WAW and WAR hazards through register
    renaming, but not in memory usage
  • Can get conflicts via allocation of stack frames
  • Because called procedure reuses memory addresses
    of previous stack frames

21
HW vs. SW to increase ILP
  • Memory disambiguation: HW best
  • Speculation:
  • HW best when dynamic branch prediction better
    than compile-time prediction
  • Exceptions easier for HW
  • HW doesn't need bookkeeping code or compensation
    code
  • Very complicated to get right in SW
  • Scheduling: SW can look ahead to schedule better
  • Compiler independence: HW does not require new
    compiler to run well

22
Performance Beyond Single-Thread ILP
  • Much higher natural parallelism in some
    applications
  • Database or scientific codes
  • Explicit thread-level or data-level parallelism
  • Thread has own instructions and data
  • May be part of parallel program or independent
    program
  • Each thread has all state (instructions, data,
    PC, register state, and so on) needed to execute
  • Data-level parallelism: perform identical
    operations on lots of data

23
Thread Level Parallelism (TLP)
  • ILP exploits implicit parallel operations within
    loop or straight-line code segment
  • TLP explicitly represented by multiple threads of
    execution that are inherently parallel
  • Goal: use multiple instruction streams to improve
  • Throughput of computers that run many programs
  • Execution time of multi-threaded programs
  • TLP could be more cost-effective to exploit than
    ILP

24
Do Both ILP and TLP?
  • TLP and ILP exploit two different kinds of
    parallel structure in a program
  • Could a processor oriented to ILP still exploit
    TLP?
  • Functional units are often idle in data path
    designed for ILP because of either stalls or
    dependencies in the code
  • Could TLP be used as source of independent
    instructions that might keep the processor busy
    during stalls?
  • Could TLP be used to employ functional units that
    would otherwise lie idle when insufficient ILP
    exists?

25
New Approach: Multithreaded Execution
  • Multithreading: multiple threads share functional
    units of 1 processor via overlapping
  • Processor must duplicate independent state of
    each thread
  • Separate copy of register file, PC
  • Separate page table if different process
  • Memory sharing via virtual memory mechanisms
  • Already supports multiple processes
  • HW for fast thread switch
  • Must be much faster than full process switch
    (which is 100s to 1000s of clocks)
  • When to switch?
  • Alternate instruction per thread (fine
    grain): round robin
  • When thread is stalled (coarse grain)
  • E.g., cache miss

26
Fine-Grained Multithreading
  • Switches between threads on each instruction,
    interleaving execution of multiple threads
  • Usually done round-robin, skipping stalled
    threads
  • CPU must be able to switch threads every clock
  • Advantage: can hide both short and long stalls
  • Instructions from other threads always available
    to execute
  • Easy to insert on short stalls
  • Disadvantage: slows individual threads
  • Thread ready to execute without stalls will be
    delayed by instructions from other threads
  • Used on Sun's Niagara (will see later)
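
The round-robin policy with stalled threads skipped can be sketched as follows. The input `stalled_until` (first cycle at which each thread is ready) and the function name are hypothetical, and the model issues at most one instruction per cycle.

```python
from typing import List, Optional

def fine_grained_schedule(stalled_until: List[int], cycles: int) -> List[Optional[int]]:
    """Per cycle, the thread chosen to issue, or None if every thread is stalled."""
    n = len(stalled_until)
    schedule: List[Optional[int]] = []
    next_thread = 0  # where the round-robin search starts this cycle
    for cycle in range(cycles):
        chosen = None
        for offset in range(n):
            t = (next_thread + offset) % n
            if stalled_until[t] <= cycle:  # thread t is ready: take it
                chosen = t
                next_thread = (t + 1) % n
                break
        schedule.append(chosen)
    return schedule

# Thread 1 is stalled until cycle 3; threads 0 and 2 fill its slots meanwhile.
print(fine_grained_schedule([0, 3, 0], 6))  # [0, 2, 0, 1, 2, 0]
```

The output also shows the disadvantage noted above: thread 0 is always ready, yet it issues in only three of the six cycles.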

27
Coarse-Grained Multithreading
  • Switches threads only on costly stalls
  • E.g., L2 cache misses
  • Advantages
  • Relieves need to have very fast thread switching
  • Doesn't slow thread
  • Other threads only issue instructions when main
    one would stall (for long time) anyway
  • Disadvantage: pipeline startup costs make it hard
    to hide throughput losses from shorter stalls
  • Pipeline must be emptied or frozen on stall,
    since CPU issues instructions from only one
    thread
  • New thread must fill pipe before instructions can
    complete
  • Thus, better for reducing penalty of high-cost
    stalls where pipeline refill time is much less
    than stall time
  • Used in IBM AS/400
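
A toy cost model shows why coarse-grained switching only pays off on long stalls. It assumes that a switch costs one pipeline refill during which nothing completes, and that without a switch the whole stall is lost. The names `idle_cycles` and `switch_threshold` are hypothetical.

```python
def idle_cycles(stall_cycles: int, refill_cycles: int, switch_threshold: int) -> int:
    """Cycles with no completed instruction under a simple switch-on-long-stall policy."""
    if stall_cycles >= switch_threshold:
        # Switch threads: the new thread must refill the pipeline first.
        return refill_cycles
    # Stay on the stalled thread: the full stall is exposed.
    return stall_cycles

print(idle_cycles(200, 10, 20))  # long stall (e.g. L2 miss): only the refill is lost
print(idle_cycles(5, 10, 20))    # short stall: not hidden at all
print(idle_cycles(5, 10, 0))     # switching on a short stall costs more than waiting
```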

28
Simultaneous Multithreading (SMT)
  • Simultaneous multithreading (SMT): insight that
    dynamically scheduled processor already has many
    HW mechanisms to support multithreading
  • Large set of virtual registers that can be used
    to hold register sets for independent threads
  • Register renaming provides unique register
    identifiers
  • Instructions from multiple threads can be mixed
    in data path
  • Without confusing sources and destinations across
    threads!
  • Out-of-order completion allows the threads to
    execute out of order, and get better utilization
    of the HW
  • Just add per-thread renaming table and keep
    separate PCs
  • Independent commitment can be supported via
    separate reorder buffer for each thread

Source: Microprocessor Report, December 6, 1999,
"Compaq Chooses SMT for Alpha"
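
The per-thread renaming table idea can be sketched as below: each thread keeps its own map from architectural to physical registers, while all threads draw from one shared pool of physical registers, so identical register names in different threads never collide. The class `SmtRenamer` is a hypothetical illustration; it never frees registers and omits commit and recovery.

```python
from typing import Dict, List, Tuple

class SmtRenamer:
    """Shared physical register pool with one rename table per thread."""

    def __init__(self, num_physical: int, num_threads: int) -> None:
        self.free: List[int] = list(range(num_physical))
        self.tables: List[Dict[str, int]] = [{} for _ in range(num_threads)]

    def rename(self, thread: int, dest: str, sources: List[str]) -> Tuple[int, List[int]]:
        table = self.tables[thread]
        # Sources use this thread's current mappings (KeyError if never written).
        phys_sources = [table[s] for s in sources]
        if not self.free:
            raise RuntimeError("out of physical registers")
        phys_dest = self.free.pop(0)  # fresh register for every write
        table[dest] = phys_dest
        return phys_dest, phys_sources

renamer = SmtRenamer(num_physical=8, num_threads=2)
a, _ = renamer.rename(0, "r1", [])           # thread 0 writes r1
b, _ = renamer.rename(1, "r1", [])           # thread 1 writes its own r1
_, srcs0 = renamer.rename(0, "r2", ["r1"])   # reads thread 0's r1
_, srcs1 = renamer.rename(1, "r2", ["r1"])   # reads thread 1's r1
print(a, b, srcs0, srcs1)
```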
29
Simultaneous Multithreading ...
[Figure: functional-unit usage per cycle on 8 units (M, M, FX, FX, FP, FP, BR, CC), one thread vs. two threads]
M = Load/Store, FX = Fixed Point, FP = Floating
Point, BR = Branch, CC = Condition Codes
30
Multithreaded Categories
[Figure: issue slots over time (processor cycles) for Superscalar, Fine-Grained, Coarse-Grained, Multiprocessing, and SMT; legend: Threads 1 - 5 and idle slots]
31
Design Challenges in SMT
  • What is impact on single-thread performance?
  • Preferred-thread approach
  • Sacrifices neither throughput nor single-thread
    performance?
  • Nope: processor will sacrifice some throughput
    when preferred thread stalls
  • Larger register file needed to hold multiple
    contexts
  • Must not affect clock cycle, especially in:
  • Instruction issue: more candidate instructions to
    consider
  • Instruction completion: hard to choose which to
    commit
  • Must ensure that cache and TLB conflicts caused
    by SMT don't degrade performance

32
Digression: Covert Channels
  • Imagine you're a spy with an account on Knuth
  • Want to communicate a secret to Geoff
  • Secret is reasonably small
  • FBI is watching your account and your e-mail
  • Solution: process spawning
  • Once a second, Geoff spawns process
  • Records own PID, waits 10 ms, forks, records
    child PID
  • Once a second, you send one bit of information
  • If bit is zero, you do nothing
  • If bit is one, you spawn processes as fast as
    possible
  • If Geoff sees big PID gap, he records 1, else
    0
  • Many variations on this basic idea
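
The receiver's side of the scheme above can be sketched without spawning any real processes. The simulation assumes PIDs are handed out sequentially with no wraparound; the background and burst process counts and the threshold are made-up numbers, and both function names are hypothetical.

```python
from typing import List

def simulate_pid_gaps(bits: List[int], background: int = 3, burst: int = 500) -> List[int]:
    """PID gap the receiver would observe each second (assumed sequential PIDs)."""
    # A 1 bit means the sender spawned `burst` extra processes during the
    # receiver's 10 ms wait; `background` models unrelated system activity.
    return [background + (burst if bit else 0) for bit in bits]

def decode_bits(pid_gaps: List[int], threshold: int) -> List[int]:
    """Record 1 for a big gap between own PID and child PID, else 0."""
    return [1 if gap > threshold else 0 for gap in pid_gaps]

secret = [1, 0, 1, 1, 0]
gaps = simulate_pid_gaps(secret)
print(decode_bits(gaps, threshold=100))  # [1, 0, 1, 1, 0]
```

On a busy system the background count is noisy, so a real receiver would need a threshold chosen with that noise in mind.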

33
Covert-Channel Attacks on Crypto
  • Most (not all) crypto code behaves differently on
    1 bit in key vs. 0 bit
  • Runs longer or shorter
  • Uses more or less power
  • Accesses different memory
  • Etc.
  • Usually called information leakage
  • Has been successfully used in lab to crack strong
    crypto
  • Even recovering some bits from key makes
    brute-force attack practical for getting
    remainder
  • Some modern implementations try to fight by doing
    wasted work on shorter path of if, etc.
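
One familiar case of key-dependent behavior is square-and-multiply modular exponentiation, sketched below next to a variant that does wasted work on the 0-bit path. This is an illustration chosen here, not code from the presentation. The multiplication counter stands in for running time or power, and the balanced variant is not actually constant-time in Python.

```python
from typing import Tuple

def modexp_leaky(base: int, exponent: int, modulus: int) -> Tuple[int, int]:
    """Square-and-multiply: multiplies only on 1 bits, so work depends on the key."""
    result, multiplies = 1, 0
    for bit in bin(exponent)[2:]:  # most significant bit first
        result = (result * result) % modulus
        if bit == "1":
            result = (result * base) % modulus
            multiplies += 1
    return result, multiplies

def modexp_balanced(base: int, exponent: int, modulus: int) -> Tuple[int, int]:
    """Same result, but the multiply is always performed and discarded on 0 bits."""
    result, multiplies = 1, 0
    for bit in bin(exponent)[2:]:
        result = (result * result) % modulus
        product = (result * base) % modulus  # wasted work when bit is 0
        multiplies += 1
        if bit == "1":
            result = product
    return result, multiplies

print(modexp_leaky(5, 0b1011, 97), modexp_leaky(5, 0b1000, 97))        # counts differ
print(modexp_balanced(5, 0b1011, 97), modexp_balanced(5, 0b1000, 97))  # counts match
```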

34
SMT Attack on SSH
  • On SMT machine, lower-priority thread's execution
    rate depends on higher-priority one's
    instructions
  • More stalls in top thread mean more speed in
    bottom one
  • Stalls vary depending on what crypto code is
    doing
  • Operates at very low level
  • Thus much harder to defend against
  • Successful attack on ssh keys has been
    demonstrated in lab
  • Best known defense: don't do SMT
  • Careful coding of crypto could probably also work
  • Note that this also applies to things like cache
    and TLB
  • Lots of ways to leak information unintentionally!

35
Power 4
Single-threaded predecessor to Power 5. Eight
execution units in out-of-order engine; each can
issue an instruction each cycle.
36
Power 4
2 commits (architected register sets)
Power 5
2 fetch (PC), 2 initial decodes
37
Power 5 data flow ...
Why only 2 threads? With 4, one of the shared
resources (physical registers, cache, memory
bandwidth) would be prone to bottleneck
38
Power 5 thread performance ...
Relative priority of each thread controllable in
hardware.
For balanced operation, both threads run slower
than if they owned the machine.
39
Changes in Power 5 to support SMT
  • Increased associativity of L1 instruction cache
    and instruction address translation buffers
  • Added per-thread load and store queues
  • Increased size of L2 (1.92 vs. 1.44 MB) and L3
    caches
  • Added separate instruction prefetch and buffering
    per thread
  • Increased virtual registers from 152 to 240
  • Increased size of several issue queues
  • Power5 core is about 24% larger than Power4
    because of SMT support

40
Initial Performance of SMT
  • Pentium 4 Extreme SMT yields 1.01 speedup for
    SPECint_rate benchmark, 1.07 for SPECfp_rate
  • Pentium 4 is dual-threaded SMT
  • SPECRate requires each benchmark to be run
    against vendor-selected number of copies of same
    benchmark
  • Pairing each of 26 SPEC benchmarks with every
    other on Pentium 4 (26² = 676 runs) gives speedups
    from 0.90 to 1.58; average was 1.20
  • 8-processor Power 5 server: 1.23 faster for
    SPECint_rate w/ SMT, 1.16 faster for SPECfp_rate
  • Power 5 running 2 copies of each app had speedup
    between 0.89 and 1.41
  • Most gained some
  • Floating-point apps had most cache conflicts and
    least gains

41
Head-to-Head ILP Competition
42
Performance on SPECint2000
43
Performance on SPECfp2000
44
Normalized Performance Efficiency
45
No Silver Bullet for ILP
  • No obvious overall leader in performance
  • AMD Athlon leads on SPECInt performance, followed
    by the Pentium 4, Itanium 2, and Power5
  • Itanium 2 and Power5 clearly dominate Athlon and
    Pentium 4 on SPECFP
  • Itanium 2 is most inefficient processor both for
    floating-point and integer code for all but one
    efficiency measure (SPECFP/Watt)
  • Athlon and Pentium 4 both use transistors and
    area efficiently
  • IBM Power5 is most effective user of energy on
    SPECFP, essentially tied on SPECINT

46
Limits to ILP
  • Doubling issue rates above today's 3-6
    instructions per clock probably requires
    processor to
  • Issue 3-4 data-memory accesses per cycle,
  • Resolve 2-3 branches per cycle,
  • Rename and access over 20 registers per cycle,
    and
  • Fetch 12-24 instructions per cycle.
  • Complexity of implementing these capabilities is
    likely to mean sacrifices in maximum clock rate
  • E.g., widest-issue processor is Itanium 2
  • It also has slowest clock rate
  • Despite consuming the most power!

47
Limits to ILP (cont'd)
  • Most ways to increase performance also boost
    power consumption
  • Key question is energy efficiency: does a method
    increase power consumption faster than it boosts
    performance?
  • Multiple-issue techniques are energy inefficient
  • Incurs logic overhead that grows faster than
    issue rate
  • Growing gap between peak issue rates and
    sustained performance
  • Number of transistors switching = f(peak issue
    rate), and performance = f(sustained rate);
    growing gap between peak and sustained performance
    => increasing energy per unit of performance

48
Commentary
  • Itanium is not significant breakthrough in
    scaling ILP or in avoiding problems of complexity
    and power consumption
  • Instead of pursuing more ILP, architects turning
    to TLP using single-chip multiprocessors
  • In 2000, IBM announced Power4, 1st commercial
    single-chip, general-purpose multiprocessor; has
    two Power3 processors and integrated L2 cache
  • Sun Microsystems, AMD, and Intel have also
    switched focus from aggressive uniprocessors to
    single-chip multiprocessors
  • Right balance of ILP and TLP is unclear today
  • Maybe desktops (mostly single-threaded?) need
    different design than servers (can do lots of TLP)

49
And in conclusion
  • Limits to ILP (power efficiency, compilers,
    dependencies, ...) seem to limit to 3 to 6 issue for
    practical options
  • Explicitly parallel (data-level parallelism or
    thread-level parallelism) is next step to
    performance
  • Coarse-grained vs. fine-grained multithreading
  • Only on big stall vs. every clock cycle
  • Simultaneous multithreading is fine-grained
    multithreading based on OOO superscalar
    microarchitecture
  • Instead of replicating registers, reuse rename
    registers
  • Itanium/EPIC/VLIW is not a breakthrough in ILP
  • Balance of ILP and TLP decided in marketplace