1
Power-Efficient Microarchitectures
  • Krste Asanovic
  • krste@mit.edu
  • MIT Computer Science and Artificial Intelligence Laboratory
  • http://cag.csail.mit.edu/scale
  • IBM ACEED Conference, Austin, TX
  • 1 March 2005

2
Academic Computer Architectures
  • Build one flimsy (but expensive) prototype that is never really used
  • Eventually, some ideas are adopted, enter mass
    production, millions sold

3
SpecInt 2000
[Figure (Horowitz, ISSCC 2004)]
4
Power
[Figure (Horowitz, ISSCC 2004)]
5
Where does the power go?
IBM, HPCA 2005
  • Parallel instruction fetch and decode
  • Register renaming, issue window, reorder buffer
  • Multiported register files and bypass networks
  • Load and store queues
  • Multiported primary caches and TLBs
  • Energy-oblivious instruction sets (e.g., 360, x86, RISC) require most of this microarchitectural machinery to achieve high performance

6
Energy-Oblivious Instruction Sets
  • Current RISC/VLIW ISAs only expose hardware
    features that affect critical path through
    computation
  • Most energy is consumed in microarchitectural
    operations that are hidden from software!

7
Energy-Exposed Instruction Sets
  • Rethinking the hardware-software interface for
    lower power
  • Use compile-time knowledge to reduce run-time
    energy dissipation
  • Without reducing performance
  • Without using excessive energy to transmit
    compile-time knowledge to hardware at run time

8
IBM's Instruction Sets
  • Pre-1964: IBM 701, 650, 702, 1401, ...
  • Prehistoric times
  • 1964: IBM System/360
  • Invention of the instruction set architecture (ISA)
  • 1978: IBM System/38, AS/400
  • Object-based capability systems
  • 1990: IBM POWER
  • Superscalar RISC
  • Maybe time to start working on the next energy-aware ISA?

9
Talk Outline
  • Variable-Length Instruction Formats
  • Vectors
  • Exception Management
  • The Vector-Thread Architecture

10
Problems with Fixed-Length Instructions
  • Waste memory bandwidth/power at all levels of
    instruction hierarchy
  • Reduce effective cache capacity
  • Introduce unnecessary serial dependencies to work
    around length limits
    lui r1, 0x8765      # MIPS code to load the 32-bit
    ori r1, r1, 0x4321  # constant 0x87654321 into r1
  • Advantages?
  • Easier pipelined or parallel fetch and decode.

11
Heads and Tails Format
  • Each instruction split into two portions: a fixed-length head and a variable-length tail
  • Multiple instructions packed into a fixed-length bundle
  • A cache line can hold multiple bundles

14
Heads and Tails Format
[Figure: each fixed-length bundle packs the fixed-length heads (H0, H1, ...) forward from the front and the variable-length tails (T0, T1, ...) backward from the back; unused space sits in the middle, and a last-instr field at the start of each bundle gives the index of its last instruction]
  • Not all heads need tails
  • Tails stored at a fixed granularity, independent of the size of the heads
  • The PC becomes a (bundle, instruction) pair
  • Sequential execution: instruction index incremented
  • End of bundle: bundle number incremented, instruction index reset to 0
  • Branch: target instruction index checked
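A minimal C sketch of this (bundle, instruction) sequencing, with illustrative field widths and an explicit last-instr check (assumptions for clarity, not the exact HAT encoding):

    #include <stdbool.h>
    #include <stdint.h>

    /* A HAT program counter is a (bundle, instruction) pair. */
    typedef struct {
        uint32_t bundle;  /* which fixed-length bundle            */
        uint8_t  instr;   /* instruction index within that bundle */
    } hat_pc_t;

    /* Advance past one instruction; last_instr is the field stored
     * at the front of the current bundle. */
    hat_pc_t hat_next(hat_pc_t pc, uint8_t last_instr) {
        if (pc.instr < last_instr) {
            pc.instr++;        /* sequential: bump instruction index */
        } else {
            pc.bundle++;       /* end of bundle: advance bundle,     */
            pc.instr = 0;      /* reset instruction index to 0       */
        }
        return pc;
    }

    /* A branch target's instruction index must not exceed the
     * target bundle's last-instr field. */
    bool hat_branch_ok(hat_pc_t target, uint8_t target_last_instr) {
        return target.instr <= target_last_instr;
    }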
26
Conventional VL Length-Decoding
[Figure: Instr 1, Instr 2, Instr 3 packed end to end, with serially chained length decoders]
  • 2nd length decoder needs to know Length 1 first
  • 3rd length decoder needs to know Length 1 + Length 2
  • Need to know all lengths to fetch and align more instructions
30
HAT Length-Decoding
[Figure: Head1, Head2, Head3 at the front of the bundle; Tail3, Tail2, Tail1 at the back; one length decoder per head, operating side by side]
  • Length decoding done in parallel
  • Only the tail-length adders depend on previous length information (carry-save adders, delay O(log W))
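A C sketch of why decode parallelizes (the per-head tail-length encoding and the widths are assumptions for illustration): every head sits at a fixed offset, so all heads decode at once; only the tail offsets need a running sum of already-decoded lengths, which hardware evaluates with carry-save adder trees in O(log W) delay.

    #include <stdint.h>

    #define W 8  /* assumed fetch-group width, in instructions */

    /* Given the tail length encoded in each fixed-offset head,
     * compute each tail's offset back from the end of the bundle.
     * This loop is a prefix sum; hardware computes it with adder
     * trees rather than serially. */
    void hat_tail_offsets(const uint8_t tail_len[W], uint16_t tail_off[W]) {
        uint16_t sum = 0;
        for (int i = 0; i < W; i++) {
            sum += tail_len[i];  /* total length of tails 0..i       */
            tail_off[i] = sum;   /* tail i starts sum bytes from end */
        }
    }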
34
Heads and Tails Summary
  • Density of variable-length instructions while
    retaining pipelined or superscalar instruction
    fetch
  • For a recoded MIPS ISA, saves 25% of static and dynamic instruction bits using 256-bit bundles
  • Can design an ISA to exploit HT (e.g., avoid
    spurious serializations)

35
Vectors
36
Parallelism is Good
[Figure (Horowitz, ISSCC 2004)]
37
Forms of Parallelism and Energy per Op
[Figure: energy per operation vs. performance, with a scalar pipelined machine as the baseline]
38
Vectors
  • Omission of vectors is the single biggest mistake in commercial computer architectures
  • Simple
  • High performance
  • Low power
  • Works great with caches
  • Mature compiler technology
  • Easily understood performance-programming model
  • Good for everything, not just scientific
    computing
  • Possibly only valid reasons for omission
  • A little harder to make work with virtual memory
    and rapid context swaps (see restart markers)
  • Large vector register files (see vector-thread
    architecture)

39
Automatic Code Vectorization
    for (i = 0; i < N; i++)
        C[i] = A[i] + B[i];
Vectorization is a massive compile-time reordering of operation sequencing that avoids many run-time overheads
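A sketch of the stripmined code a vectorizing compiler would generate, written in plain C for illustration (VLMAX, the hardware vector length, is an assumed parameter; the inner loop stands in for single vector-load, vector-add, and vector-store instructions):

    enum { VLMAX = 32 };  /* assumed maximum hardware vector length */

    void vadd(const int *A, const int *B, int *C, int N) {
        for (int i = 0; i < N; ) {
            int vl = (N - i < VLMAX) ? N - i : VLMAX; /* set vector length */
            for (int j = 0; j < vl; j++)  /* one vector load/add/store */
                C[i + j] = A[i + j] + B[i + j];
            i += vl;  /* control processor bumps the loop counter */
        }
    }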
40
Vector Energy Advantages
  • Instruction fetch amortized over vector of
    operations
  • Loop bookkeeping factored out into separate
    control processor
  • Efficient vector memory operations move multiple memory operands with one cache tag/TLB lookup
  • All arithmetic operations only access local lane,
    no cross-lane wiring
  • Length of vector register effectively provides
    register renaming and loop unrolling without
    additional hardware

41
Vector Instruction Parallelism
  • Can overlap execution of multiple vector
    instructions
  • example machine has 32 elements per vector
    register and 8 lanes

[Figure: overlapped execution timeline across the Load, Multiply, and Add units as instructions issue]
Complete 24 operations/cycle (3 functional units x 8 lanes) while issuing 1 short instruction/cycle
42
Why SIMD extensions fall short of Vectors
  • Only executes one cycle's worth of operands per instruction fetch
  • Requires superscalar dispatch to keep multiple
    functional units busy
  • Scalar unit cannot run ahead to find next vector
    loop
  • Tied up issuing SIMD instructions for current
    loop
  • No long vector memory operations
  • Memory system can't get ahead in fetching data without speculation
  • Doesn't scale to wider datapaths without software rewrite
  • Doesn't scale to large register files without bigger instructions
  • Awkward interface for compilers
  • Extensive microarchitecture-specific loop
    unrolling and software pipelining required to
    keep pipelines busy
  • Load/store alignment constraints
  • No vector length register
  • No scatter/gather
  • Causes larger loop startup delays than vectors

43
Vectors vs. Superscalar on General-Purpose Applications
[Figure: single-scalar runtime breakdown]
  • Accelerating 28% of the code by a factor of 8 gives the same speedup as accelerating all of the code by 1.3x (vectorizing SpecInt95)
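This is Amdahl's Law with vectorized fraction f = 0.28 and vector speedup s = 8:

    Speedup = 1 / ((1 - f) + f/s) = 1 / (0.72 + 0.28/8) = 1 / 0.755 ≈ 1.32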

44
Vectorizable Workloads
  • Vectors known to work well for scientific and
    media applications, but also can help many other
    codes, e.g.,
  • Databases
  • Hash-joins vectorizable
  • String operations
  • Operating Systems
  • Bzero/bcopy
  • Many other important commercial algorithms can
    be vectorized
  • All vendors will soon be telling customers to
    multithread their code to get better performance
  • Vectorization can be simpler, and give much
    better power-performance than multithreading

45
Exceptions
46
Exception Management Overhead
  • A large part of the power cost in modern microarchitectures comes from the need to provide precise exceptions
  • Reorder buffer to track original program order
  • Register renaming or bypass networks to allow
    undo of speculative register writes
  • Store queues to allow undo of speculative memory
    writes
  • (Even in-order architectures speculate on
    exceptions)
  • But there is also a large opportunity cost, because some things are too difficult to make precise
  • Deeply exposed machine state
  • Overlapped execution of multiple highly parallel
    instructions
  • Special purpose execution units with embedded
    state

47
What's important in Exceptions?
  • For operating system with multiprogramming and
    virtual memory
  • Must allow fast (and simple) process state save,
    to allow process restart later
  • These swappable exceptions are much easier to
    provide than precise exceptions, especially in
    highly parallel machines with large quantities of
    architectural state

48
Software Restart Markers
  • Software explicitly marks restart points, e.g., by setting a barrier bit on each instruction
  • Hardware saves the next-PC into a machine register as each barrier instruction completes
  • branches store target PC
  • must also wait for any earlier potentially
    exception-causing instructions to clear exception
    checks (trap barrier)
  • After any trap, OS resumes execution at saved PC

49
Idempotent Regions
  • Hardware does not buffer state updates and cannot
    undo state changes if trap occurs in middle of
    region
  • Can only restart cleanly if regions are
    idempotent, i.e., can re-execute from beginning
    multiple times with same effect

    add r3, r1, r2
    st.bar r3, 0(r5)      # restart point
    ld r2, 4(r7)
    ld r3, 8(r7)
    add.bar r4, r2, r3    # restart point
    st r4, 4(r7)
    st.bar r7, 8(r7)      # restart point
50
Rules for Idempotent Regions
  • A sufficient rule: the region's external read set is disjoint from its internal write set
  • OK to overwrite a value if it was produced within the region
  • Not a necessary rule, because of idempotent update operations, e.g.,
  • X <- X AND Y
  • Y <- Y OR Z
  • Also require that any prefix of the region is idempotent, since a trap can occur anywhere inside it
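A tiny C illustration of why such updates stay safe to replay even though they read what they write (the function name is hypothetical):

    #include <assert.h>

    /* Re-executing an AND/OR update from a restart point yields the
     * same result, keeping the region idempotent. */
    void idempotent_update_demo(unsigned x, unsigned y, unsigned z) {
        unsigned once  = x & y;      /* first execution               */
        unsigned twice = once & y;   /* replay after a mid-region trap */
        assert(twice == once);       /* (x & y) & y == x & y          */
        assert(((y | z) | z) == (y | z));
    }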

51
Some Idempotent Functions
  • void matmul(int m, int k, int n, const double *a, const double *b, double *c)
  • int sprintf(char *s, const char *format, ...)
  • int sscanf(const char *s, const char *format, ...)
  • char *strcpy(char *, const char *)  /* also strcmp, strlen, ... */
  • void *memcpy(void *, const void *, size_t)  /* also memset, ... */
  • double sin(double)  /* also sqrt, exp, etc. */
  • double atof(const char *)  /* also atoi, atol, strtod, ... */
  • Can be protected with a single restart marker on the calling instruction, saving only the entry PC
  • assuming arguments untouched in stack memory
  • For a vector machine, almost no (<1%) overhead to add restart markers to common loops

52
Temporary State
  • Temporary state is only visible inside restart
    region.
  • Thrown away at any exception.
  • Will be rebuilt when restart region is restarted.
  • For SCALE, all vector-thread unit state is
    temporary
  • OS unaware of vector-thread unit.
  • Provides advantages of exposing more machine
    state without the headaches.

53
Vector-Thread Architecture
54
Vector and Multithreaded Architectures
[Figure: a vector machine (control processor issuing vector control to PE0...PEN) beside a multithreaded machine (independent thread control in PE0...PEN), both over memory]
  • Vector processors provide efficient DLP execution
  • Amortize instruction control
  • Amortize loop bookkeeping overhead
  • Exploit structured memory accesses
  • Unable to execute loops with loop-carried
    dependencies or complex internal control flow
  • Multithreaded processors can flexibly exploit TLP
  • Unable to amortize common control overhead across
    threads
  • Unable to exploit structured memory accesses
    across threads
  • Costly memory-based synchronization and
    communication between threads

55
Vector-Thread Architecture
  • VT unifies the vector and multithreaded compute
    models
  • A control processor interacts with a vector of
    virtual processors (VPs)
  • Vector-fetch: the control processor fetches instructions for all VPs in parallel
  • Thread-fetch: a VP fetches its own instructions
  • VT allows a seamless intermixing of vector and
    thread control

[Figure: control processor vector-fetching to VP0...VPN over memory, with individual VPs issuing their own thread-fetches]
56
Virtual Processor Abstraction
vector-fetch
  • VPs contain a set of registers
  • VPs execute RISC-like instructions grouped into
    atomic instruction blocks (AIBs)
  • VPs have no automatic program counter; AIBs must be explicitly fetched
  • VPs contain pending vector-fetch and thread-fetch
    addresses
  • A fetch instruction allows a VP to fetch its own
    AIB
  • May be predicated for conditional branch
  • If an AIB does not execute a fetch, the VP thread
    stops


[Figure: a VP thread executes as a chain of AIBs; each AIB ends with an explicit fetch or thread-fetch naming the next AIB, and the thread stops if none is issued]
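A C sketch of this abstraction (the types and helpers are hypothetical stand-ins, not SCALE's real interface): a VP holds only a pending fetch target, which each AIB must explicitly re-arm.

    #include <stdint.h>

    /* Model an AIB as a function that updates VP state and may
     * issue the VP's next fetch. */
    typedef struct vp vp_t;
    typedef void (*aib_fn)(vp_t *);

    struct vp {
        uint32_t regs[32];  /* VP registers                             */
        aib_fn   fetch;     /* pending vector-fetch/thread-fetch target */
    };

    /* No automatic program counter: if an AIB finishes without
     * fetching, the VP thread stops. */
    void vp_thread_run(vp_t *vp) {
        while (vp->fetch) {
            aib_fn aib = vp->fetch;
            vp->fetch = 0;   /* fetch consumed; AIB must re-arm it */
            aib(vp);         /* may set vp->fetch (a thread-fetch) */
        }
    }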
57
Virtual Processor Vector
  • A VT architecture includes a control processor
    and a virtual processor vector
  • Two interacting instruction sets
  • A vector-fetch command allows the control
    processor to fetch an AIB for all the VPs in
    parallel
  • Vector-load and vector-store commands transfer
    blocks of data between memory and the VP registers

[Figure: the control processor issues vector-fetch, vector-load, and vector-store commands to VP0...VPN; the vector memory unit moves blocks of data between memory and the VP registers]
58
Cross-VP Data Transfers
  • Cross-VP connections provide fine-grain data
    operand communication and synchronization
  • VP instructions may target nextVP as destination
    or use prevVP as a source
  • The crossVP queue holds wrap-around data; the control processor can push and pop it
  • Restricted ring communication pattern is cheap to
    implement, scalable, and matches the software
    usage model for VPs

[Figure: VP0...VPN linked in a ring by nextVP/prevVP connections; the crossVP queue holds wrap-around data that the control processor can push and pop]
59
Mapping Loops to VT
  • A broad class of loops map naturally to VT
  • Vectorizable loops
  • Loops with loop-carried dependencies
  • Loops with internal control flow
  • Each VP executes one loop iteration
  • Control processor manages the execution
  • Stripmining enables implementation-dependent
    vector lengths
  • Programmer or compiler only schedules one loop
    iteration on one VP
  • No cross-iteration scheduling

60
Vectorizable Loops
  • Data-parallel loops with no internal control flow
    mapped using vector commands
  • predication for small conditionals

[Figure: per-iteration DAG (two loads, a shift, a multiply, and a store) replicated across VP0...VPN; two vector-loads feed all VPs, a vector-fetch runs the compute on every VP, and a vector-store writes the results]
61
Loop-Carried Dependencies
  • Loops with cross-iteration dependencies mapped
    using vector commands with cross-VP data
    transfers
  • Vector-fetch introduces chain of prevVP receives
    and nextVP sends
  • Vector-memory commands with non-vectorizable
    compute

[Figure: the same per-iteration DAG, now with a cross-VP chain through the compute step: each VP receives from prevVP and sends to nextVP]
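An illustrative C loop of this shape (the concrete operations are assumptions patterned on the figure): the shift and multiply are independent per-iteration work, while the accumulation forms the prevVP-to-nextVP chain.

    /* Each VP executes one iteration; on a VT machine, carry travels
     * over the cross-VP network instead of through memory. */
    void chained_loop(const int *A, const int *B, int *C,
                      int N, int s, int carry) {
        for (int i = 0; i < N; i++) {
            int x = (A[i] << s) * B[i]; /* independent work per VP     */
            carry = carry + x;          /* prevVP -> nextVP data chain */
            C[i] = carry;
        }
    }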
62
Loops with Internal Control Flow
  • Data-parallel loops with large conditionals or
    inner-loops mapped using thread-fetches
  • Vector-commands and thread-fetches freely
    intermixed
  • Once launched, the VP threads execute to
    completion before the next control processor
    command

[Figure: after a vector-load and a vector-fetch, each VP branches independently and issues its own thread-fetches (e.g., extra loads) before a final vector-store]
63
VT Physical Model
  • A Vector-Thread Unit contains an array of lanes
    with physical register files and execution units
  • VPs map to lanes and share physical resources, VP
    execution is time-multiplexed on the lanes
  • Independent parallel lanes exploit parallelism
    across VPs and data operand locality within VPs

64
VP Execution Interleaving
  • Hardware provides the benefits of loop unrolling
    by interleaving VPs
  • Time-multiplexing can hide thread-fetch, memory,
    and functional unit latencies

[Figure: VPs time-multiplexed across four lanes: Lane 0 interleaves VP0, VP4, VP8, VP12; Lane 1 interleaves VP1, VP5, VP9, VP13; and so on]
65
VP Execution Interleaving
  • Dynamic scheduling of cross-VP data transfers
    automatically adapts to software critical path
    (in contrast to static software pipelining)
  • No static cross-iteration scheduling
  • Tolerant to variable dynamic latencies

[Figure: the same four-lane interleaving shown over time, with cross-VP transfers scheduled dynamically between vector-fetches]
66
SCALE Registers and VP Configuration
  • Atomic instruction blocks allow VPs to share temporary state, only valid within the AIB
  • VP general registers divided into private and shared
  • Chain registers (cr0, cr1) at the ALU inputs avoid reading and writing the general register file, to save energy
  • Number of VP registers in each cluster is configurable
  • The hardware can support more VPs when they each have fewer private registers
  • Low overhead: a control processor instruction configures the VPs before entering a stripmine loop; VP state is undefined across reconfigurations
[Figure: one cluster's register file configured three ways: 4 VPs with 0 shared regs and 8 private regs; 7 VPs with 4 shared regs and 4 private regs; 25 VPs with 7 shared regs and 1 private reg]
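The three configurations follow from simple arithmetic against the 32-register-per-cluster file listed on the next slide; a sketch:

    /* VPs that fit in one cluster: shared registers are allocated
     * once, and each VP gets its own private registers. */
    unsigned max_vps(unsigned regfile, unsigned shared, unsigned priv) {
        return (regfile - shared) / priv;
    }
    /* max_vps(32, 0, 8) == 4,  max_vps(32, 4, 4) == 7,
       max_vps(32, 7, 1) == 25, matching the figure above. */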
67
SCALE Prototype and Simulator
  • Prototype SCALE processor in development
  • Control processor: MIPS, 1 instr/cycle
  • VTU: 4 lanes, 4 clusters/lane, 32 registers/cluster, 128 VPs max
  • Primary I/D caches: 32 KB, 4x128b per cycle, non-blocking
  • DRAM: 64b, 200 MHz DDR2 (64b at 400 Mb/s = 3.2 GB/s)
  • Estimated 10 mm2 in TSMC 180nm, 400 MHz (25 FO4)

68
Summary
  • Energy/operation for a given performance level is the key parameter
  • Increasing parallelism and locality are standard
    tricks for improving performance
  • But standard microarchitectural techniques to
    achieve better parallelism and locality increase
    energy/operation
  • Energy-exposed instruction set allows software to
    increase parallelism, increase locality, and
    reduce microarchitectural waste for lower
    energy/op

69
SCALE Group
http://cag.csail.mit.edu/scale
  • Seongmoo Heo
  • Ronny Krashinsky
  • Jae Lee
  • Rose Liu
  • Albert Ma
  • Heidi Pan
  • Brian Pharris
  • Jessica Tseng
  • Michael Zhang
  • Krste Asanovic
  • Gautham Arumilli
  • Ken Barr
  • Elizabeth Basha
  • Chris Batten
  • Vimal Bhalodia
  • Jared Casper
  • Steve Gerding
  • Mark Hampton

Funding provided by DARPA, NSF, CMI, IBM,
Infineon, Intel, SGI, Xilinx, MIT Project Oxygen
71
Backup
72
Lane Execution
  • Lanes execute decoupled from each other
  • Command management unit handles vector-fetch and
    thread-fetch commands
  • Execution cluster executes instructions in order from a small AIB cache (e.g., 32 instructions)
  • AIB caches exploit locality to reduce instruction fetch energy (on par with a register read)
  • Execute directives point to AIBs and indicate
    which VP(s) the AIB should be executed for
  • For a thread-fetch command, the lane executes the
    AIB for the requesting VP
  • For a vector-fetch command, the lane executes the
    AIB for every VP
  • AIBs and vector-fetch commands reduce control
    overhead
  • 10s-100s of instructions executed per fetch-address tag-check, even for non-vectorizable loops

[Figure: Lane 0's command management unit turns vector-fetch and thread-fetch commands into per-VP execute directives; a tagged AIB cache feeds the ALU, with misses triggering AIB fills]
73
SCALE Vector-Thread Processor
  • SCALE is designed to be a complexity-effective
    all-purpose embedded processor
  • Exploit all available forms of parallelism and
    locality to achieve high performance and low
    energy
  • Constrained to small area (estimated 10 mm2 in
    0.18 µm)
  • Reduce wire delay and complexity
  • Support tiling of multiple SCALE processors for
    increased throughput
  • Careful balance between software and hardware for
    code mapping and scheduling
  • Optimize runtime energy, area efficiency, and
    performance while maintaining a clean scalable
    programming model

74
SCALE Clusters
  • VPs partitioned into four clusters to exploit ILP
    and allow lane implementations to optimize area,
    energy, and circuit delay
  • Clusters are heterogeneous: c0 can execute loads and stores, c1 can execute fetches, c3 has an integer mult/div
  • Clusters execute decoupled from each other

[Figure: the control processor and four lanes share an L1 cache; each lane stacks clusters c0-c3, fed by the AIB Fill Unit, and a SCALE VP occupies one cluster column]
75
SCALE Micro-Ops
  • Assembler translates portable software ISA into
    hardware micro-ops
  • Per-cluster micro-op bundles access local
    registers only
  • Inter-cluster data transfers broken into
    transports and writebacks

[Figure: software VP code translated into per-cluster hardware micro-op bundles (cluster 3 not shown)]
76
SCALE Cluster Decoupling
  • Cluster execution is decoupled
  • Cluster AIB caches hold micro-op bundles
  • Each cluster has its own execute-directive queue,
    and local control
  • Inter-cluster data transfers synchronize with
    handshake signals
  • Memory Access Decoupling (see paper)
  • Load-data-queue enables continued execution after
    a cache miss
  • Decoupled-store-queue enables loads to slip ahead
    of stores

[Figure: clusters 0-3, each with its own AIB cache, registers, ALU, and writeback/compute/transport paths, exchanging data through inter-cluster transports]
77
Why it might be time for a new ISA
  • Power-performance crisis
  • Single-thread performance plateau
  • for real this time
  • Memory wall
  • Reliability scaling
  • Hope for everyday large-scale multithreading
  • Software quality crisis

78
SpecInt/MHz
79
Clock Frequency Scaling
80
Clock Cycle in FO4
[Figure: clock cycle in FO4 delays over time, with Alpha processors marked]
81
Forms of Parallelism and Energy per Op
[Figure: energy per operation vs. performance, with a scalar pipelined machine as the baseline (repeat of slide 37)]
82
Idempotency is Non-Monotonic with Region Size
    C:  st r1, (r4)
    B:  ld r1, (r2)
    A:  st r1, (r3)    # r3 != r2
        add r1, 1
  • Region A alone is not idempotent: the st reads r1, which the add then overwrites, so a replay would store r1+1
  • Growing the region to B makes it idempotent: the ld reproduces r1 from memory, and r3 != r2 keeps the st from clobbering the loaded location
  • Growing it to C breaks idempotency again: the st at C reads the pre-region r1, which the region later overwrites