Embedded Systems in Silicon TD5102: Advanced Architectures with emphasis on ILP exploitation

Transcript and Presenter's Notes

1
Embedded Systems in Silicon TD5102
Advanced Architectures with emphasis on ILP exploitation
Henk Corporaal
http://www.ics.ele.tue.nl/heco/courses/EmbSystems
Technical University Eindhoven
DTI / NUS Singapore, 2005/2006
2
Future
  • We foresee that many characteristics of current high-performance architectures will find their way into the embedded domain.

3
What are we talking about?
ILP = Instruction-Level Parallelism: the ability to perform multiple operations (or instructions), from a single instruction stream, in parallel
4
Processor Components
  • Overview
  • Motivation and Goals
  • Trends in Computer Architecture
  • ILP Processors
  • Transport Triggered Architectures
  • Configurable components
  • Summary and Conclusions

5
Motivation for ILP (and other types of
parallelism)
  • Increasing VLSI densities (decreasing feature size)
  • Increasing performance requirements
  • New application areas, like
  • multimedia (image, audio, video, 3-D)
  • intelligent search and filtering engines
  • neural, fuzzy, genetic computing
  • More functionality
  • Use of existing code (compatibility)
  • Low power: P = α·f·C·V²

6
Low power through parallelism
  • Sequential Processor
  • Switching capacitance C
  • Frequency f
  • Voltage V
  • P ?fCV2
  • Parallel Processor (two times the number of
    units)
  • Switching capacitance 2C
  • Frequency f/2
  • Voltage V lt V
  • P ?f/2 2C V2 ?fCV2

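For illustration (numbers assumed, not from the slides): suppose halving the frequency allows the supply voltage to be lowered to V' = 0.7·V. Then

    P_seq = α·f·C·V²
    P_par = α·(f/2)·(2C)·V'² = α·f·C·(0.7·V)² ≈ 0.5·α·f·C·V²

so the two-unit parallel processor delivers roughly the same throughput at about half the power.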
7
ILP Goals
  • Making the most powerful single chip processor
  • Exploiting parallelism between independent
    instructions (or operations) in programs
  • Exploit hardware concurrency
  • multiple FUs, buses, reg files, bypass paths,
    etc.
  • Code compatibility
  • at binary level: superscalar and super-pipelined
  • at HLL level: VLIW
  • Incorporate enhanced functionality (ASIP)

8
Overview
  • Motivation and Goals
  • Trends in Computer Architecture
  • ILP Processors
  • Transport Triggered Architectures
  • Configurable components
  • Summary and Conclusions

9
Trends in Computer Architecture
  • Bridging the semantic gap
  • Performance increase
  • VLSI developments
  • Architecture developments design space
  • The role of the compiler
  • The right match

10
Bridging the Semantic Gap
  • Programming domains
  • Application domain
  • Architecture domain
  • Data path domain

11
Bridging the Semantic Gap: Different Methods
12
Bridging the Semantic Gap: What happens to the semantic level?
(Figure: semantic level versus year, 1950-2010, for the application, architecture and datapath domains; the gap between application and architecture is bridged by compilation and/or interpretation, with CISC and RISC marking shifts of the architecture level.)
13
Performance Increase
(Figure: SPECint92 and SPECfp92 ratings, 0.1 to 1000 on a logarithmic scale, plotted against year, 1978-2002, with growth trend lines.)
  • Microprocessor SPEC ratings
  • about 50% SPECint improvement per year
  • about 60% SPECfp improvement per year

14
VLSI Developments
Transistors per chip (DRAM): about 2^((year − 1956)·2/3)
(Figure: density in transistors/chip (10^3 to 10^7) and minimum feature size in um (0.1 to 10) plotted against year, 1970-2000.)
Cycle time: t_cycle = t_gate · gate_levels + wiring_delay + pad_delay. What happens to these contributions?
15
Architecture Developments
  • How to improve performance?
  • (Super)-pipelining
  • Powerful instructions
  • MD-technique
  • multiple data operands per operation
  • MO-technique
  • multiple operations per instruction
  • Multiple instruction issue

16
Architecture Developments: Pipelined Execution of Instructions
IF = Instruction Fetch, DC = Instruction Decode, RF = Register Fetch, EX = Execute instruction, WB = Write result register
(Figure: simple 5-stage pipeline; four instructions overlapped across cycles 1-8.)
  • Purpose
  • Reduce gate_levels in the critical path
  • Reduce CPI close to one
  • More efficient hardware
  • Problems
  • Hazards cause pipeline stalls
  • Structural hazards: add more hardware
  • Control hazards (branch penalties): use branch prediction
  • Data hazards: bypassing required
  • Superpipelining: split one or more of the critical pipeline stages (a standard CPI relation is sketched below)

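As a reminder, the standard textbook relation (not on the original slide) that connects these points: with an ideal CPI of one,

    CPI_pipelined = 1 + average stall cycles per instruction
    speedup ≈ pipeline depth / (1 + average stall cycles per instruction)

so a 5-stage pipeline suffering 0.4 stall cycles per instruction achieves a speedup of about 5 / 1.4 ≈ 3.6 over the unpipelined machine.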
17
Architecture Developments: Powerful Instructions (1)
  • MD-technique
  • Multiple data operands per operation
  • Two Styles
  • Vector
  • SIMD

18
Architecture Developments: Powerful Instructions (1)
  • Vector Computing
  • FU mix may match the application domain
  • Use of interleaved memory
  • FUs need to be tightly connected
  • SIMD computing
  • Nodes used for independent operations
  • Mesh or hypercube connectivity
  • Exploit data locality of e.g. image processing
    applications
  • SIMD on a restricted scale: multimedia instructions
  • MMX, Sun VIS, HP MAX-2, AMD K7/Athlon 3DNow!, TriMedia, ......
  • Example: Σ i=1..4 |a_i − b_i| (a scalar C sketch follows below)

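A minimal scalar C version of this example (illustrative only; multimedia instruction sets execute the whole computation in one or a few packed operations):

    #include <stdlib.h>

    /* Sum of absolute differences over 4 elements: the kind of loop
       that a single SIMD multimedia instruction can replace. */
    int sad4(const unsigned char a[4], const unsigned char b[4])
    {
        int sum = 0;
        for (int i = 0; i < 4; i++)
            sum += abs(a[i] - b[i]);
        return sum;
    }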
19
Architecture Developments: Powerful Instructions (2)
  • MO-technique: multiple operations per instruction
  • CISC (Complex Instruction Set Computer)
  • VLIW (Very Long Instruction Word)

VLIW instruction example (one field per FU):
  FU 1: sub r8,r5,3 | FU 2: and r1,r5,12 | FU 3: mul r6,r5,r2 | FU 4: ld r3,0(r5) | FU 5: bnez r5,13
20
Architecture Developments: Powerful Instructions (2), VLIW Characteristics
  • Only RISC-like operation support
  • Short cycle times
  • Flexible: can implement any FU mixture
  • Extensible
  • Tight inter-FU connectivity required
  • Large instructions
  • Not binary compatible

21
Architecture Developments: Multiple instruction issue (per cycle)
  • Who guarantees semantic correctness?
  • User specifies multiple instruction streams
  • MIMD (Multiple Instruction Multiple Data)
  • Run-time detection of ready instructions
  • Superscalar
  • Compile into dataflow representation
  • Dataflow processors

22
Multiple instruction issue: Three Approaches
Example code:
  a := b + 15
  c := 3.14 * d
  e := c / f
Translation to DDG (Data Dependence Graph):
(Figure: dataflow graph; b is loaded and incremented by 15, the result stored to a; d is loaded and multiplied by 3.14, the result stored to c; f is loaded, c is divided by f, and the result stored to e.)
23
  • Generated Code

Instr.  Sequential Code      Dataflow Code
I1      ld   r1,M(b)         ld M(b)     -> I2
I2      addi r1,r1,15        addi 15     -> I3
I3      st   r1,M(a)         st M(a)
I4      ld   r1,M(d)         ld M(d)     -> I5
I5      muli r1,r1,3.14      muli 3.14   -> I6, I8
I6      st   r1,M(c)         st M(c)
I7      ld   r2,M(f)         ld M(f)     -> I8
I8      div  r1,r1,r2        div         -> I9
I9      st   r1,M(e)         st M(e)
  • Notes
  • An MIMD may execute two streams: (1) I1-I3, (2) I4-I9
  • No dependencies between streams; in practice, communication and synchronization are required between streams
  • A superscalar issues multiple instructions from a sequential stream
  • It must obey dependencies (true and name dependencies)
  • Reverse engineering of the DDG is needed at run time
  • Dataflow code is a direct representation of the DDG

24
Instruction Pipeline Overview
(Figure: instruction pipeline organizations of CISC, RISC, superscalar, superpipelined, dataflow and VLIW processors.)
25
Four-dimensional representation of the architecture design space <I, O, D, S>
(Figure: architectures placed along four axes: data/operation D (vector around 10, SIMD around 100), instructions/cycle I (superscalar, dataflow and MIMD, up to around 100), operations/instruction O (VLIW around 10), and superpipelining degree S (superpipelined around 10); CISC and RISC sit near the origin.)
26
Architecture design space
Typical values of K (number of functional units or processor nodes) and <I, O, D, S> for different architectures

S(architecture) = Σ over Op ∈ I_set of f(Op) · lt(Op)
(the superpipelining degree: occurrence frequency times latency, summed over the operation set)

Mpar = I · O · D · S (an illustrative calculation follows below)
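For illustration (numbers assumed, not from the slides): a 4-issue superscalar with single-operation instructions (O = 1), scalar operands (D = 1) and superpipelining degree S = 1.2 reaches

    Mpar = I · O · D · S = 4 · 1 · 1 · 1.2 = 4.8

while a 2-issue machine whose operations work on 4-element SIMD data (D = 4) reaches Mpar = 2 · 1 · 4 · 1.2 = 9.6.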
27
The Role of the Compiler
  • 9 steps are required to translate an HLL program
  • Front-end compilation
  • Determine dependencies
  • Graph partitioning: make multiple threads (or tasks)
  • Bind partitions to compute nodes
  • Bind operands to locations
  • Bind operations to time slots: scheduling
  • Bind operations to functional units
  • Bind transports to buses
  • Execute operations and perform transports

28
Division of responsibilities between hardware and
compiler
(Figure: translation steps from the application through front-end, determine dependencies, binding of operands, scheduling, binding of operations and binding of transports to execution. The compiler/hardware boundary moves down this chain: a superscalar leaves dependence analysis and everything after it to hardware, dataflow machines leave operand binding onward, multi-threaded machines scheduling onward, independence architectures operation binding onward, VLIWs transport binding onward, and TTAs only the execution itself.)
29
The Right Match: if it does not fit on a chip you have a disadvantage!
(Figure: transistors per CPU chip, 10^3 to 10^9, plotted against year, 1972-2000, marking when each class first fits on a single chip: 8-bit microprocessors, 32-bit RISC cores, 32-bit CISC cores, RISC with MMU and 64-bit FP, VLIW/superscalar/dataflow with L1 cache, bus-based MIMD with L2 cache, and NoC/System-on-Chip.)
30
Overview
  • Motivation and Goals
  • Trends in Computer Architecture
  • ILP Processors
  • Transport Triggered Architectures
  • Configurable components
  • Summary and Conclusions

31
ILP Processors
  • Overview
  • General ILP organization
  • VLIW concept
  • examples like TriMedia, Mpact, TMS320C6x, IA-64
  • Superscalar concept
  • examples like HP PA-8000, Alpha 21264, MIPS R10k/R12k, Pentium I-IV, AMD K5-K7, UltraSparc
  • (Ref: IEEE Micro, April 1996 (HotChips issue))
  • Comparing Superscalar and VLIW

32
General ILP processor organization
(Figure: a central processing unit containing an instruction fetch unit, an instruction decode unit, a shared register file and functional units FU-1 .. FU-K, connected to an instruction memory and a data memory.)
33
ILP processor characteristics
  • Issue multiple operations/instructions per cycle
  • Multiple concurrent Function Units
  • Pipelined execution
  • Shared register file
  • Four Superscalar variants
  • In-order/Out-of-order execution
  • In-order/Out-of-order completion

34
VLIW: Very Long Instruction Word Architecture
35
VLIW concept
(Figure: a VLIW architecture with 7 FUs: three integer FUs, two load/store units and two FP FUs, fed from an instruction memory, with an integer register file and a floating-point register file connected to the data memory.)
36
VLIW example: TriMedia
(Figure: TriMedia overview. Around the VLIW processor core sit an SDRAM memory interface, timers, a 32-bit 33 MHz PCI interface, video in (40 Mpix/s) and video out (19 Mpix/s), audio in (stereo digital audio) and audio out (2-8 channel digital audio), I2C and serial interfaces, and a Huffman decoder / VLD coprocessor for MPEG-1/2. The VLIW core: 5-issue, 128 registers, 27 FUs, 32-bit, 32 KB instruction cache, 16 KB data cache, 8-way set-associative caches, dual-ported data cache, guarded operations.)
37
VLIW example: TMS320C62
  • TMS320C62 VelociTI processor
  • 8 operations (of 32 bits each) per instruction (256 bits)
  • Two clusters
  • 8 FUs: 4 FUs per cluster (2 multipliers, 6 ALUs)
  • 2 x 16 registers
  • One port available to read from the register file of the other cluster
  • Flexible addressing modes (like circular addressing; see the sketch after this list)
  • Flexible instruction packing
  • All operations conditional
  • 5 ns, 200 MHz, 0.25 um, 5-layer CMOS
  • 128 KB on-chip RAM

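A minimal C sketch of what a circular addressing mode provides (illustrative; on the C6x the wrap-around is done by the address-generation hardware rather than by explicit code, and block sizes are typically powers of two):

    #define BUF_SIZE 64                      /* power-of-two circular buffer */

    /* Software view of a circular (modulo) address update. */
    static unsigned next_index(unsigned i)
    {
        return (i + 1) & (BUF_SIZE - 1);     /* wrap around without a branch */
    }

    /* FIR-style use: read successive samples from a circular delay line. */
    float read_sample(const float delay_line[BUF_SIZE], unsigned *pos)
    {
        float s = delay_line[*pos];
        *pos = next_index(*pos);
        return s;
    }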
38
VelociTI / C64 datapath
(Figure: one cluster of the datapath.)
39
VLIW example: IA-64
  • Intel/HP 64-bit VLIW-like architecture
  • 128-bit instruction bundle containing 3 instructions
  • 128 integer and 128 floating-point registers: 7-bit register ids
  • Guarded (predicated) instructions
  • 64-entry boolean register file; relies heavily on if-conversion to remove branches (see the sketch after this list)
  • Specifies instruction independence
  • some extra bits per bundle
  • Fully interlocked
  • i.e. no delay slots; operations are latency compatible within a family of architectures
  • Split loads
  • non-trapping load plus exception check

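A minimal illustration of if-conversion (my own example, not from the slides): the compiler turns the control dependence into a data dependence so that a guarded/predicated machine needs no branch at all.

    /* Original: contains a (possibly hard-to-predict) branch. */
    int sel_add(int a, int b, int c, int d)
    {
        if (a > 0)
            b = c + d;
        return b;
    }

    /* After if-conversion, expressed in C: the condition becomes a predicate
       and the update is selected by it.  On IA-64 this maps to a compare that
       sets a predicate register plus a predicated add; no branch remains. */
    int sel_add_predicated(int a, int b, int c, int d)
    {
        int p = (a > 0);            /* predicate */
        b = p ? (c + d) : b;        /* guarded update, branch-free */
        return b;
    }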
40
Intel Itanium 2
  • EPIC
  • 0.18 um, 6 metal layers
  • 8 issue slots
  • 1 GHz (8000 MIPS)
  • 130 W (max)
  • 61 MOPS/W
  • 128-bit bundle (3 x 41 bits + 5-bit template)

41
Superscalar: Multiple Instructions per Cycle
42
Superscalar Concept
(Figure: instructions flow from the instruction memory through an instruction cache and decoder into reservation stations that feed a branch unit, two ALUs, a logic/shift unit, a load unit and a store unit; loads and stores access a data cache backed by the data memory, and results retire through a reorder buffer into the register file.)
43
Intel Pentium 4
  • Superscalar
  • 0.12 um, 6 metal layers
  • 1.0 V
  • 3 issue
  • >3 GHz
  • 58 W
  • 20-stage pipeline
  • ALUs clocked at 2x
  • Trace cache

44
Example of Superscalar Processor Execution
  • Superscalar processor organization
  • simple pipeline IF, EX, WB
  • fetches 2 instructions each cycle
  • 2 ld/st units (dual-ported memory), 2 FP adders, 1 FP multiplier
  • Instruction window (buffer between IF and EX stage) is of size 2
  • FP ld/st takes 1 cc; FP +/- takes 2 cc; FP * takes 4 cc; FP / takes 8 cc
  • Cycle 1 2 3 4 5 6 7
  • L.D F6,32(R2)
  • L.D F2,48(R3)
  • MUL.D F0,F2,F4
  • SUB.D F8,F2,F6
  • DIV.D F10,F0,F6
  • ADD.D F6,F8,F2
  • MUL.D F12,F2,F4

45
Example of Superscalar Processor Execution
  • Superscalar processor organization
  • simple pipeline IF, EX, WB
  • fetches 2 instructions each cycle
  • 2 ld/st units (dual-ported memory), 2 FP adders, 1 FP multiplier
  • Instruction window (buffer between IF and EX stage) is of size 2
  • FP ld/st takes 1 cc; FP +/- takes 2 cc; FP * takes 4 cc; FP / takes 8 cc
  • Cycle 1 2 3 4 5 6 7
  • L.D F6,32(R2) IF
  • L.D F2,48(R3) IF
  • MUL.D F0,F2,F4
  • SUB.D F8,F2,F6
  • DIV.D F10,F0,F6
  • ADD.D F6,F8,F2
  • MUL.D F12,F2,F4

46
Example of Superscalar Processor Execution
  • Superscalar processor organization
  • simple pipeline IF, EX, WB
  • fetches 2 instructions each cycle
  • 2 ld/st units (dual-ported memory), 2 FP adders, 1 FP multiplier
  • Instruction window (buffer between IF and EX stage) is of size 2
  • FP ld/st takes 1 cc; FP +/- takes 2 cc; FP * takes 4 cc; FP / takes 8 cc
  • Cycle 1 2 3 4 5 6 7
  • L.D F6,32(R2) IF EX
  • L.D F2,48(R3) IF EX
  • MUL.D F0,F2,F4 IF
  • SUB.D F8,F2,F6 IF
  • DIV.D F10,F0,F6
  • ADD.D F6,F8,F2
  • MUL.D F12,F2,F4

47
Example of Superscalar Processor Execution
  • Superscalar processor organization
  • simple pipeline IF, EX, WB
  • fetches 2 instructions each cycle
  • 2 ld/st units (dual-ported memory), 2 FP adders, 1 FP multiplier
  • Instruction window (buffer between IF and EX stage) is of size 2
  • FP ld/st takes 1 cc; FP +/- takes 2 cc; FP * takes 4 cc; FP / takes 8 cc
  • Cycle 1 2 3 4 5 6 7
  • L.D F6,32(R2) IF EX WB
  • L.D F2,48(R3) IF EX WB
  • MUL.D F0,F2,F4 IF EX
  • SUB.D F8,F2,F6 IF EX
  • DIV.D F10,F0,F6 IF
  • ADD.D F6,F8,F2 IF
  • MUL.D F12,F2,F4

48
Example of Superscalar Processor Execution
  • Superscalar processor organization
  • simple pipeline IF, EX, WB
  • fetches 2 instructions each cycle
  • 2 ld/st units (dual-ported memory), 2 FP adders, 1 FP multiplier
  • Instruction window (buffer between IF and EX stage) is of size 2
  • FP ld/st takes 1 cc; FP +/- takes 2 cc; FP * takes 4 cc; FP / takes 8 cc
  • Cycle 1 2 3 4 5 6 7
  • L.D F6,32(R2) IF EX WB
  • L.D F2,48(R3) IF EX WB
  • MUL.D F0,F2,F4 IF EX EX
  • SUB.D F8,F2,F6 IF EX EX
  • DIV.D F10,F0,F6 IF
  • ADD.D F6,F8,F2 IF
  • MUL.D F12,F2,F4

stall because of data dep.
cannot be fetched because window full
49
Example of Superscalar Processor Execution
  • Superscalar processor organization
  • simple pipeline IF, EX, WB
  • fetches 2 instructions each cycle
  • 2 ld/st units (dual-ported memory), 2 FP adders, 1 FP multiplier
  • Instruction window (buffer between IF and EX stage) is of size 2
  • FP ld/st takes 1 cc; FP +/- takes 2 cc; FP * takes 4 cc; FP / takes 8 cc
  • Cycle 1 2 3 4 5 6 7
  • L.D F6,32(R2) IF EX WB
  • L.D F2,48(R3) IF EX WB
  • MUL.D F0,F2,F4 IF EX EX EX
  • SUB.D F8,F2,F6 IF EX EX WB
  • DIV.D F10,F0,F6 IF
  • ADD.D F6,F8,F2 IF EX
  • MUL.D F12,F2,F4 IF

50
Example of Superscalar Processor Execution
  • Superscalar processor organization
  • simple pipeline IF, EX, WB
  • fetches 2 instructions each cycle
  • 2 ld/st units (dual-ported memory), 2 FP adders, 1 FP multiplier
  • Instruction window (buffer between IF and EX stage) is of size 2
  • FP ld/st takes 1 cc; FP +/- takes 2 cc; FP * takes 4 cc; FP / takes 8 cc
  • Cycle 1 2 3 4 5 6 7
  • L.D F6,32(R2) IF EX WB
  • L.D F2,48(R3) IF EX WB
  • MUL.D F0,F2,F4 IF EX EX EX EX
  • SUB.D F8,F2,F6 IF EX EX WB
  • DIV.D F10,F0,F6 IF
  • ADD.D F6,F8,F2 IF EX EX
  • MUL.D F12,F2,F4 IF

cannot execute: structural hazard
51
Example of Superscalar Processor Execution
  • Superscalar processor organization
  • simple pipeline IF, EX, WB
  • fetches 2 instructions each cycle
  • 2 ld/st units (dual-ported memory), 2 FP adders, 1 FP multiplier
  • Instruction window (buffer between IF and EX stage) is of size 2
  • FP ld/st takes 1 cc; FP +/- takes 2 cc; FP * takes 4 cc; FP / takes 8 cc
  • Cycle 1 2 3 4 5 6 7
  • L.D F6,32(R2) IF EX WB
  • L.D F2,48(R3) IF EX WB
  • MUL.D F0,F2,F4 IF EX EX EX EX WB
  • SUB.D F8,F2,F6 IF EX EX WB
  • DIV.D F10,F0,F6 IF EX
  • ADD.D F6,F8,F2 IF EX EX WB
  • MUL.D F12,F2,F4 IF ?

52
Register Renaming
  • A technique to eliminate anti- and output dependencies
  • Can be implemented
  • by the compiler
  • advantage: low cost
  • disadvantage: old codes perform poorly
  • in hardware
  • advantage: binary compatibility
  • disadvantage: extra hardware needed
  • Implemented in the Tomasulo algorithm, but we describe the general idea

53
Register Renaming
  • there is a physical register file larger than the logical register file
  • a mapping table associates logical registers with physical registers
  • when an instruction is decoded
  • its physical source registers are obtained from the mapping table
  • its physical destination register is obtained from a free list
  • the mapping table is updated

Example: add r3,r3,4 is renamed to add R2,R1,4.
Before: mapping table r0->R8, r1->R7, r2->R5, r3->R1, r4->R9; free list {R2, R6}.
After:  r3 now maps to R2 (taken from the free list); free list {R6}.
54
Eliminating False Dependencies
  • How register renaming eliminates false dependencies
  • Before
  • addi r1,r2,1
  • addi r2,r0,0
  • addi r1,r0,1
  • After (free list R7, R8, R9)
  • addi R7,R5,1
  • addi R8,R0,0
  • addi R9,R0,1
  • (a small software sketch of the renaming step follows below)

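A minimal software sketch of that renaming step (illustrative data structures, not taken from the slides): each decoded instruction looks up its physical sources in the mapping table, takes a fresh physical register for its destination from the free list, and then updates the table.

    #define NUM_LOGICAL  32
    #define NUM_PHYSICAL 64

    static int map[NUM_LOGICAL];           /* logical -> physical mapping table */
    static int free_list[NUM_PHYSICAL];    /* stack of free physical registers  */
    static int free_top;                   /* number of free registers (initialisation not shown) */

    /* Rename one instruction "dst = op(src1, src2)".  Sources are looked up
       before the destination is remapped; that is what removes anti- and
       output (name) dependencies. */
    void rename(int dst, int src1, int src2,
                int *pdst, int *psrc1, int *psrc2)
    {
        *psrc1 = map[src1];
        *psrc2 = map[src2];
        *pdst  = free_list[--free_top];    /* fresh physical destination */
        map[dst] = *pdst;                  /* later readers of dst see it */
    }

The previous physical register of dst is returned to the free list only after all its readers have completed (not shown here).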
55
Branch Prediction
56
Branch PredictionMotivation
  • High branch penalties in pipelined processors
  • With on average one out of five instructions
    being a branch, the maximum ILP is five
  • Situation even worse for multiple-issue
    processors, because we need to provide an
    instruction stream of n instructions per cycle.
  • Idea: predict the outcome of branches based on their history and execute instructions speculatively

57
5 Branch Prediction Schemes
  • 1-bit Branch Prediction Buffer
  • 2-bit Branch Prediction Buffer
  • Correlating Branch Prediction Buffer
  • Branch Target Buffer
  • Return Address Predictors
  • A way to get rid of those malicious branches

58
Branch Prediction Buffer: 1-bit prediction
(Figure: the low-order PC bits index a Branch History Table (BHT) of single prediction bits.)
  • The buffer is like a cache without tags
  • Does not help for the simple MIPS pipeline, because the target address calculation happens in the same stage as the branch condition calculation
59
Branch Prediction Buffer: 1-bit prediction
(Figure: K low-order bits of the branch address index a table of 2^K one-bit prediction entries.)
  • Problems
  • Aliasing: the lower K bits of different branch instructions could be the same
  • Solution: use tags (the buffer then works like a cache); however, this is very expensive
  • Loops are predicted wrong twice
  • Solution: use an n-bit saturating counter for the prediction
  • taken if counter >= 2^(n-1)
  • not-taken if counter < 2^(n-1)
  • A 2-bit saturating counter predicts a loop wrong only once

60
2-bit Branch Prediction Buffer
  • Solution: 2-bit scheme where the prediction is changed only if mispredicted twice
  • Can be implemented as a saturating counter (a code sketch follows the figure)
(Figure: 2-bit saturating counter state machine with two 'predict taken' and two 'predict not taken' states; taken (T) outcomes move the counter toward 'predict taken', not-taken (NT) outcomes toward 'predict not taken'.)
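A minimal C sketch of one such 2-bit saturating counter entry (illustrative; a real predictor keeps a table of these, indexed by low-order branch-address bits):

    /* Counter values: 0,1 = predict not taken; 2,3 = predict taken. */
    typedef unsigned char counter2_t;

    int predict_taken(counter2_t c)
    {
        return c >= 2;
    }

    /* Update once the real outcome is known; saturates at 0 and 3. */
    counter2_t update(counter2_t c, int taken)
    {
        if (taken) return (c < 3) ? c + 1 : 3;
        else       return (c > 0) ? c - 1 : 0;
    }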
61
Correlating Branches
  • Fragment from SPEC92 benchmark eqntott:

      if (aa==2)           subi R3,R1,2
        aa = 0;       b1:  bnez R3,L1
      if (bb==2)           add  R1,R0,R0
        bb = 0;       L1:  subi R3,R2,2
      if (aa!=bb)     b2:  bnez R3,L2
                           add  R2,R0,R0
                      L2:  sub  R3,R1,R2
                      b3:  beqz R3,L3

62
Correlating Branch Predictor
  • Idea: the behavior of this branch is related to the taken/not-taken history of recently executed branches
  • The behavior of recent branches then selects between, say, 4 predictions for the next branch, updating just that prediction
  • (2,2) predictor: 2-bit global history, 2-bit local predictors
  • (k,n) predictor: uses the behavior of the last k branches to choose from 2^k predictors, each of which is an n-bit predictor
(Figure: 4 bits from the branch address select a row of 2-bit per-branch local predictors; a 2-bit global branch history shift register (e.g. 01 = not taken, then taken) selects which of them provides the prediction.)
63
Branch Correlation: the most general scheme
  • Two schemes: (a, k, m, n)
  • PA: per-address history, a > 0
  • GA: global history, a = 0
(Figure: a Branch History Table with 2^a entries of k bits each, indexed by a branch-address bits, supplies k history bits; together with m branch-address bits these index a Pattern History Table of 2^k x 2^m n-bit saturating up/down counters that deliver the prediction.)
Table size (usually n = 2): k·2^a + 2^k·2^m·n bits.
Variant: Gshare (Scott McFarling '93): a GA predictor that indexes the pattern table with the exclusive OR (XOR) of PC address bits and branch history bits (a code sketch follows below).
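A minimal C sketch of gshare-style indexing on top of 2-bit counters (illustrative; the table size and history length are assumptions, not values from the slides):

    #include <stdint.h>

    #define PHT_BITS 12                          /* 4096-entry pattern table */
    #define PHT_SIZE (1u << PHT_BITS)

    static uint8_t  pht[PHT_SIZE];               /* 2-bit counters, 0..3 */
    static uint32_t ghr;                         /* global history register */

    static uint32_t gshare_index(uint32_t pc)
    {
        /* XOR of PC bits and global history selects the counter. */
        return ((pc >> 2) ^ ghr) & (PHT_SIZE - 1);
    }

    int gshare_predict(uint32_t pc)
    {
        return pht[gshare_index(pc)] >= 2;       /* taken if counter is 2 or 3 */
    }

    void gshare_update(uint32_t pc, int taken)
    {
        uint32_t i = gshare_index(pc);
        if (taken) { if (pht[i] < 3) pht[i]++; }
        else       { if (pht[i] > 0) pht[i]--; }
        ghr = ((ghr << 1) | (taken ? 1u : 0u)) & (PHT_SIZE - 1);
    }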
64
Accuracy (taking the best combination of parameters)
(Figure: branch prediction accuracy, 89-98%, versus predictor size, 64 bytes to 64 KB. GAs predictors reach about 98% with GA(0,11,5,2) and PAs about 97% with PA(10,6,4,2), while the simple bimodal predictor levels off a few percent lower.)
65
Accuracy of Different Branch Predictors
(Figure: misprediction rates, 0-18%, for a 4096-entry 2-bit BHT, an unlimited-entry 2-bit BHT and a 1024-entry (2,2) BHT.)
66
Branch Target Buffer
  • For the MIPS pipeline we need the target address at the same time as the prediction
  • Branch Target Buffer (BTB): stores tag address and target address
  • Note: must check for a branch match now, since it can be any instruction
(Figure: the PC is compared against the BTB tags. On a hit the instruction is a branch, and the predicted target PC is used as the next PC if the branch is predicted taken; on a miss the instruction is not a branch and fetch proceeds normally.)
67
Instruction Fetch Stage
(Figure: the fetch stage consults the BTB, which returns 'found', 'taken' and the target address used to select the next PC.)
  • Not shown: the hardware needed when the prediction was wrong

68
Special Case: Return Addresses
  • Register-indirect branches: hard to predict the target address
  • MIPS instruction: jr r31 (PC := r31)
  • useful for
  • implementing switch/case statements
  • FORTRAN computed GOTOs
  • procedure returns (mainly)
  • SPEC89: 85% of such branches are procedure returns
  • Since procedures follow a stack discipline, save return addresses in a small buffer that acts like a stack; 8 to 16 entries already give a small miss rate (a sketch follows below)

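A minimal C sketch of such a return-address stack (illustrative; the depth and the wrap-around overflow policy are assumptions):

    #include <stdint.h>

    #define RAS_DEPTH 16

    static uint32_t ras[RAS_DEPTH];
    static unsigned ras_top;                 /* wraps, overwriting old entries */

    /* Pushed when a call instruction is fetched. */
    void ras_push(uint32_t return_pc)
    {
        ras[ras_top] = return_pc;
        ras_top = (ras_top + 1) % RAS_DEPTH;
    }

    /* Popped to predict the target of a return (e.g. jr r31). */
    uint32_t ras_predict_return(void)
    {
        ras_top = (ras_top + RAS_DEPTH - 1) % RAS_DEPTH;
        return ras[ras_top];
    }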
69
Dynamic Branch Prediction Summary
  • Prediction is becoming an important part of scalar execution
  • Branch History Table: 2 bits for loop accuracy
  • Correlation: recently executed branches are correlated with the next branch
  • either different branches
  • or different executions of the same branch
  • Branch Target Buffer: includes branch target address prediction
  • Return address stack for prediction of indirect jumps

70
Comparing Superscalar and VLIW
71
Limitations of Multiple-Issue Processors
  • Available ILP is limited (we are not programming with parallelism in mind)
  • Hardware cost
  • adding more functional units is easy
  • more memory ports and register ports are needed
  • the dependency check needs O(n^2) comparisons
  • Limitations of VLIW processors
  • loop unrolling increases code size
  • unfilled slots waste bits
  • a cache miss stalls the pipeline
  • research topic: scheduling loads
  • binary incompatibility (not the case for EPIC)

72
Overview
  • Motivation and Goals
  • Trends in Computer Architecture
  • ILP Processors
  • Transport Triggered Architectures
  • Configurable components
  • Summary and Conclusions

73
Reducing Datapath Complexity: TTA
  • TTA: Transport Triggered Architecture
  • Overview
  • Philosophy
  • MIRROR THE PROGRAMMING PARADIGM
  • Program the transports; operations are side effects of transports
  • The compiler is in control of the hardware transport capacity

74
Transport Triggered Architecture
General structure of a TTA
(Figure: several functional units, an integer register file, an FP register file and a boolean register file connect through sockets to a set of data-transport buses (move buses).)
75
Programming TTAs
  • How to do data operations?
  • 1. Transport of operands to the FU
  • operand move(s)
  • trigger move
  • 2. Transport of results from the FU
  • result move(s)

Example: add r3,r1,r2 becomes
  r1 -> Oint     // operand move to integer unit
  r2 -> Tadd     // trigger move to integer unit
  ...            // addition operation in progress
  Rint -> r3     // result move from integer unit

(Figure: FU pipeline.)

How to do control flow?
  1. Jumps:  jump-address -> pc
  2. Branch: displacement -> pcd
  3. Call:   pc -> r; call-address -> pcd
76
Programming TTAs
Scheduling advantages of Transport Triggered Architectures

1. Software bypassing:
   Rint -> r1; r1 -> Tadd        =>   Rint -> r1; Rint -> Tadd

2. Dead writeback removal:
   Rint -> r1; Rint -> Tadd      =>   Rint -> Tadd

3. Common operand elimination:
   4 -> Oint; r1 -> Tadd         =>   4 -> Oint; r1 -> Tadd
   4 -> Oint; r2 -> Tadd              r2 -> Tadd

4. Decouple operand, trigger and result moves completely:
   r1 -> Oint; r2 -> Tadd        =>   r1 -> Oint
   Rint -> r3                         ---; r2 -> Tadd
                                      ---; Rint -> r3
77
TTA Advantages
  • Summary of advantages of TTAs
  • Better usage of transport capacity
  • instead of 3 transports per dyadic operation, about 2 are needed
  • register ports reduced by at least 50%
  • inter-FU connectivity reduced by 50-70%
  • no full connectivity required
  • Both the transport capacity and the register ports become independent design parameters; this removes one of the major bottlenecks of VLIWs
  • Flexible: FUs can incorporate arbitrary functionality
  • Scalable: FUs, register files, etc. can be changed
  • TTAs are easy to design and can have short cycle times

78
TTA automatic DSE (design space exploration)
(Figure: an optimizer, guided by user interaction, proposes architecture parameters; a parametric compiler produces parallel object code within the Move framework and a hardware generator produces the chip, both feeding results back to the optimizer.)
79
Overview
  • Motivation and Goals
  • Trends in Computer Architecture
  • RISC processors
  • ILP Processors
  • Transport Triggered Architectures
  • Configurable HW components
  • Summary and Conclusions

80
Tensilica Xtensa
  • Configurable RISC
  • 0.13um
  • 0.9V
  • 1 issue slot / 5 stage pipeline
  • 490 MHz typical
  • 39.2 mW (no mem.)
  • 12500 MOPS / W
  • Tool support
  • Optional vector unit
  • Special Function Units

81
Fine-grained reconfigurable: Xilinx XC4000 FPGA
(Figure: an array of Configurable Logic Blocks (CLBs) surrounded by programmable interconnect and I/O blocks (IOBs).)
82
Coarse-grained reconfigurable: Chameleon CS2000
  • Highlights
  • 32-bit datapath (ALU/Shift)
  • 16x24 Multiplier
  • distributed local memory
  • fixed timing

83
Hybrid FPGAs: Xilinx Virtex-II Pro
(Figure: die plot showing memory blocks, PowerPC cores and reconfigurable logic blocks. Courtesy of Xilinx (Virtex-II Pro).)
84
HW or SW reconfigurable?
(Figure: design space spanned by reconfiguration time (from 1 cycle via loop buffer and context switching up to full reset) and datapath granularity (fine to coarse); subword parallelism marks an intermediate point.)
85
Granularity Makes a Difference
86
Overview
  • Motivation and Goals
  • Trends in Computer Architecture
  • RISC processors
  • ILP Processors
  • Transport Triggered Architectures
  • Configurable components
  • Multi-threading
  • Summary and Conclusions

87
Multi-threading
  • Definition
  • A multi-threading architecture is a single-processor architecture which can execute 2 or more threads (or processes) either
  • simultaneously (SMT architecture), or
  • with an extremely short context switch (1 or a few cycles)
  • A multi-core architecture has 2 or more processor cores on the same die. It can (also) execute 2 or more threads simultaneously!

88
Simultaneous Multithreading Characteristics
  • An SMT is an extension of a superscalar
    architecture allowing multiple threads to run
    simultaneously.
  • It has separate front-ends for the different
    threads but shares the back-end between all
    threads.
  • Each thread has its own
  • Program counter
  • Re-order buffer (if used)
  • Branch History Register
  • General registers, caches, branch prediction
    tables, reservation stations, FUs, etc. can be
    shared.

89
Superscalar SMT
(Figure: the superscalar organization shown earlier, extended with two program counters (PC_1, PC_2) and an instruction buffer feeding the shared decoder, reservation stations, functional units, reorder buffer and register file.)
90
Multi-threading in Uniprocessor Architectures
(Figure: occupation of the issue slots over clock cycles for a superscalar, a concurrent-multithreading and a simultaneous-multithreading processor; slots are filled by threads 1-4, with empty slots left blank.)
91
Future Processor Components
  • New TriMedia
  • VLIW with a deeper pipeline, L1 and L2 caches, branch prediction
  • used in the SpaceCake cell
  • Sony-IBM PS3 Cell architecture
  • Merrimac (Stanford successor of Imagine)
  • combines operation-level (VLIW) and data-level (SIMD) parallelism
  • TRIPS (Texas Austin / IBM) and SCALE (MIT)
  • processors that combine task-, operation- and data-level parallelism
  • Silicon Hive (Philips)
  • coarse-grain programmable, a kind of VLIW with many ALUs, multipliers, ...
  • See also, for many more architectures and platforms
  • WWW Computer Architecture Page: www.cs.wisc.edu/arch/www
  • HOT Chips: www.hotchips.org (especially the archives)

92
Summary and Conclusions
  • ILP architectures have great potential
  • Superscalars
  • Binary-compatible upgrade path
  • VLIWs
  • Very flexible ASIPs
  • TTAs
  • Avoid control and datapath bottlenecks
  • Completely compiler controlled
  • Very good cost-performance ratio
  • Low power
  • Multi-threading
  • Surpasses the exploitable ILP in applications
  • How to choose threads?

93
What should you choose?
  • Depends on
  • application characteristics
  • what types of parallelism can you exploit
  • what is a good memory hierarchy
  • size of each level
  • bandwidth of each level
  • performance requirements
  • energy budget
  • money budget
  • available tooling