ELEC 669 Low Power Design Techniques Lecture 1 - PowerPoint PPT Presentation

Transcript and Presenter's Notes



1
ELEC 669 Low Power Design Techniques, Lecture 1
  • Amirali Baniasadi
  • amirali_at_ece.uvic.ca

2
ELEC 669 Low Power Design Techniques
  • Instructor: Amirali Baniasadi
  • EOW 441, only by appointment; call or email with your schedule.
  • Email: amirali_at_ece.uvic.ca  Office Tel: 721-8613
  • Web page for this class:
    http://www.ece.uvic.ca/amirali/courses/ELEC669/elec669.html
  • Will use paper reprints.
  • Lecture notes will be posted on the course web page.

3
Course Structure
  • Lectures:
  • 1-2 weeks on processor review
  • 5 weeks on low power techniques
  • 6 weeks of discussion, presentations, and meetings
  • A reading paper is posted on the web for each week.
  • Bring a one-page review of each paper.
  • Presentations: each student should give two presentations in class.

4
Course Philosophy
  • Papers are a supplement to the lectures (if a topic or a detail is
    not covered in class, I expect you to read on your own to learn it).
  • One project (50%)
  • Presentation (30%); will be announced in advance.
  • Final exam, take-home (20%)
  • IMPORTANT NOTE: you must get a passing grade in all components to
    pass the course. Failing any of the three components will result in
    failing the course.

5
Project
  • More on project later

6
Topics
  • High Performance Processors?
  • Low-Power Design
  • Low Power Branch Prediction
  • Low-Power Register Renaming
  • Low-Power SRAMs
  • Low-Power Front-End
  • Low-Power Back-End
  • Low-Power Issue Logic
  • Low-Power Commit
  • AND more

7
A Modern Processor
1. What does each part do?  2. What power optimizations are possible?
Front-end
Back-end
8
Power Breakdown
(Charts: power breakdown for the Pentium Pro and the Alpha 21464.)
9
Instruction Set Architecture (ISA)
  • Instruction Execution Cycle

10
What Should we Know?
  • A specific ISA (MIPS)
  • Performance issues - vocabulary and motivation
  • Instruction-Level Parallelism
  • How to Use Pipelining to improve performance
  • Exploiting Instruction-Level Parallelism w/
    Dynamic Approach
  • Memory caches and virtual memory

11
What is Expected From You?
  • Read papers!
  • Be up-to-date!
  • Come back with your input and questions for discussion!

12
Power?
  • Everything is done by tiny switches
  • Their charge represents logic values
  • Changing charge costs energy
  • Power = energy over time
  • Devices are non-ideal: power becomes heat
  • Excess heat makes circuits break down
  • Need to keep power within acceptable limits
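The chain on this slide (switching charge costs energy; power is energy per unit time) is usually quantified with the standard first-order CMOS dynamic-power formula P = a * C * Vdd^2 * f. The formula and the numbers below are not from the slides; they are a sketch with illustrative values:

```python
def dynamic_power(alpha, c_load, vdd, freq):
    """First-order CMOS dynamic power: P = alpha * C * Vdd^2 * f.

    alpha:  activity factor (fraction of capacitance switched per cycle)
    c_load: total switched capacitance (farads)
    vdd:    supply voltage (volts)
    freq:   clock frequency (Hz)
    """
    return alpha * c_load * vdd ** 2 * freq

# Illustrative numbers only: halving Vdd cuts dynamic power by 4x.
p_full = dynamic_power(0.1, 1e-9, 1.2, 2e9)   # 0.288 W
p_half = dynamic_power(0.1, 1e-9, 0.6, 2e9)   # 0.072 W
```

The quadratic dependence on Vdd is why voltage scaling is the single biggest lever in low-power design.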

13
POWER in the real world
14
Power as a Performance Limiter
Conventional Performance Scaling
Goal: max. performance with min. cost/complexity.
How: more and faster transistors; more complex structures.
Power: "Don't fix it if it ain't broken."
Not true anymore: power has increased rapidly. Power-aware
architecture is a necessity.
15
Power-Aware Architecture
Conventional Architecture
Goal: max. performance.
How: do as much as you can.
This Work: Power-Aware Architecture
Goal: min. power while maintaining performance.
How: do as little as you can, while maintaining performance.
A challenging and new area.
16
Why is this challenging?
  • Identify actions that can be delayed/eliminated.
  • Don't touch those that boost performance.
  • The cost/power of doing so must not outweigh the benefits.

17
Definitions
  • Performance is in units of things-per-second; bigger is better.
  • If we are primarily concerned with response time:
    performance(X) = 1 / execution_time(X)
  • "X is n times faster than Y" means
    n = Performance(X) / Performance(Y)
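The two definitions above can be written out directly; a minimal sketch (the helper names are mine, not from the slides):

```python
def performance(execution_time):
    # Performance is the reciprocal of execution time.
    return 1.0 / execution_time

def times_faster(time_x, time_y):
    # "X is n times faster than Y": n = Performance(X) / Performance(Y)
    return performance(time_x) / performance(time_y)

# If X runs a program in 2 s and Y takes 6 s, X is 3 times faster.
n = times_faster(2.0, 6.0)
```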

18
Amdahl's Law
  • Speedup due to enhancement E:
    Speedup(E) = ExTime(without E) / ExTime(with E)
               = Performance(with E) / Performance(without E)
  • Suppose that enhancement E accelerates a fraction F of the task
    by a factor S, and the remainder of the task is unaffected. Then:
    ExTime(with E) = ((1-F) + F/S) x ExTime(without E)
    Speedup(with E) = ExTime(without E) / ExTime(with E)
                    = 1 / ((1-F) + F/S)

19
Amdahl's Law-example
  • A new CPU makes web serving 10 times faster. The old CPU spent 40%
    of the time on computation and 60% waiting for I/O. What is the
    overall enhancement?
  • Fraction enhanced = 0.4
  • Speedup enhanced = 10
  • Speedup overall = 1 / (0.6 + 0.4/10) = 1.56
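The worked example can be checked with a one-line implementation of Amdahl's Law (the function name is my own):

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    # Speedup = 1 / ((1 - F) + F / S)
    return 1.0 / ((1.0 - fraction_enhanced)
                  + fraction_enhanced / speedup_enhanced)

# The web-serving example: F = 0.4, S = 10
overall = amdahl_speedup(0.4, 10)   # 1 / (0.6 + 0.04) = 1.5625, i.e. ~1.56
```

Note that even an infinite speedup on the 40% fraction would only give 1/0.6 = 1.67x overall; the unimproved part dominates.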

20
Why Do Benchmarks?
  • How we evaluate differences:
  • Different systems
  • Changes to a single system
  • Provide a target
  • Benchmarks should represent a large class of important programs
  • Improving benchmark performance should help many programs
  • For better or worse, benchmarks shape a field
  • Good ones accelerate progress
  • A good target for development
  • Bad benchmarks hurt progress
  • Do they help real programs, or just sell machines/papers?
  • Inventions that help real programs may not help the benchmarks

21
SPEC first round
  • First round, 1989: 10 programs, with a single number to summarize
    performance
  • One program spent 99% of its time in a single line of code
  • A new front-end compiler could improve it dramatically

22
SPEC Evolution
  • Second round: SpecInt92 (6 integer programs) and SpecFP92 (14
    floating-point programs)
  • Added SPECbase: one flag setting for integer programs, one for FP
  • Third round, 1995: new set of programs
  • Benchmarks stay useful for about 3 years
  • Now: SPEC 2000

23
SPEC95
  • Eighteen application benchmarks (with inputs)
    reflecting a technical computing workload
  • Eight integer
  • go, m88ksim, gcc, compress, li, ijpeg, perl,
    vortex
  • Ten floating-point intensive
  • tomcatv, swim, su2cor, hydro2d, mgrid, applu,
    turb3d, apsi, fpppp, wave5
  • Must run with standard compiler flags
  • eliminate special undocumented incantations that
    may not even generate working code for real
    programs

24
Summary
  • Time is the measure of computer performance!
  • Remember Amdahl's Law: improvement is limited by the unimproved
    part of the program

25
Execution Cycle
Instruction Fetch: obtain instruction from program storage
Instruction Decode: determine required actions and instruction size
Operand Fetch: locate and obtain operand data
Execute: compute result value or status
Result Store: deposit results in storage for later use
Next Instruction: determine successor instruction
26
What Must be Specified?
  • Instruction format or encoding
  • how is it decoded?
  • Location of operands and result
  • where other than memory?
  • how many explicit operands?
  • how are memory operands located?
  • which can or cannot be in memory?
  • Data type and size
  • Operations
  • what is supported?
  • Successor instruction
  • jumps, conditions, branches

(These annotate the execution-cycle stages, Instruction Fetch through
Next Instruction.)
27
What Is ILP?
  • Principle: many instructions in the code do not depend on each other
  • Result: it is possible to execute them in parallel
  • ILP: potential overlap among instructions (so they can be evaluated
    in parallel)
  • Issues:
  • Building compilers to analyze the code
  • Building special/smarter hardware to handle the code
  • Goal: increase the amount of parallelism exploited among
    instructions
  • Seeks good results out of pipelining

28
What Is ILP?
  • Code A: LD R1, (R2)100; ADD R4, R1; SUB R5, R1; CMP R1, R2;
    ADD R3, R1
  • Code B: LD R1, (R2)100; ADD R4, R1; SUB R5, R4; SW R5, (R2)100;
    LD R1, (R2)100
  • Code A: possible to execute 4 instructions in parallel.
  • Code B: can't execute more than one instruction per cycle.
  • Code A has higher ILP.
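A rough way to see why Code A has more ILP is to assign each instruction its depth in the RAW dependence graph: instructions at the same depth could run in parallel. The sketch below deliberately ignores WAR/WAW and memory dependences, and the representation and helper names are mine:

```python
def dependence_levels(code):
    """Depth of each instruction in the RAW dependence graph.
    code: list of (dest, [sources]).  Level 1 = inputs already available;
    instructions that share a level could execute in parallel."""
    last_writer = {}                    # register -> level of its last writer
    levels = []
    for dest, srcs in code:
        level = 1 + max((last_writer.get(r, 0) for r in srcs), default=0)
        levels.append(level)
        last_writer[dest] = level
    return levels

# Code A from the slide ("F" stands in for the flags written by CMP):
code_a = [("R1", ["R2"]),           # LD  R1, (R2)100
          ("R4", ["R4", "R1"]),     # ADD R4, R1
          ("R5", ["R5", "R1"]),     # SUB R5, R1
          ("F",  ["R1", "R2"]),     # CMP R1, R2
          ("R3", ["R3", "R1"])]     # ADD R3, R1
levels_a = dependence_levels(code_a)  # [1, 2, 2, 2, 2]: 4-wide at level 2
```

For Code B each instruction feeds the next, so the same analysis yields one instruction per level: a serial chain.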

29
Out of Order Execution
Programmer: instructions execute in order.
Processor: instructions may execute in any order, if the results remain
the same at the end.

Out-of-order example:
B ADD R3, R4
C ADD R3, R5
A LD R1, (R2)
D CMP R3, R1
30
Assumptions
  • Five-stage integer pipeline
  • Branches have delay of one clock cycle
  • ID stage: comparisons done, decisions made, and PC loaded
  • No structural hazards
  • Functional units are fully pipelined or
    replicated (as many times as the pipeline depth)
  • FP latencies (see table)

Integer load latency: 1. Integer ALU operation latency: 0.
31
Simple Loop Assembler Equivalent
  • for (i = 1000; i > 0; i--) x[i] = x[i] + s;
  • Loop: LD F0, 0(R1)     ; F0 = array element
  •       ADDD F4, F0, F2  ; add scalar in F2
  •       SD F4, 0(R1)     ; store result
  •       SUBI R1, R1, 8   ; decrement pointer by 8 bytes (DW)
  •       BNE R1, R2, Loop ; branch if R1 != R2
  • x[i] and s are double-precision floating point
  • R1 initially holds the address of the array element with the
    highest address
  • F2 contains the scalar value s
  • Register R2 is pre-computed so that 8(R2) is the last element to
    operate on

32
Where are the stalls?
  • Unscheduled
  • Loop LD F0, 0(R1)
  • stall
  • ADDD F4, F0, F2
  • stall
  • stall
  • SD F4, 0(R1)
  • SUBI R1, R1, 8
  • stall
  • BNE R1, R2, Loop
  • stall
  • 10 clock cycles
  • Can we minimize?
  • Scheduled
  • Loop LD F0, 0(R1)
  • SUBI R1, R1, 8
  • ADDD F4, F0, F2
  • stall
  • BNE R1, R2, Loop
  • SD F4, 8(R1)
  • 6 clock cycles
  • 3 cycles of actual work + 3 cycles of overhead
  • Can we minimize further?
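The 10-cycle and 6-cycle counts above can be reproduced with a toy in-order, single-issue model. The latency table mixes the assumption slide's numbers (load latency 1, integer ALU latency 0) with my assumed classic five-stage values for the FP-op-to-store and ALU-to-branch waits, so treat this as a sketch rather than the lecture's exact model:

```python
# (producer kind, consumer kind) -> extra cycles the consumer must wait.
LATENCY = {("LD", "ALU"): 1, ("ALU", "SD"): 2, ("ALU", "BNE"): 1}

def count_cycles(code):
    """In-order, single-issue cycle count with forwarding.
    code: list of (kind, dest, sources).  Returns the issue cycle of the
    last instruction, plus 1 if the code ends in a branch whose delay
    slot is an unfilled stall (as in the slide's unscheduled loop)."""
    produced = {}                       # register -> (producer kind, cycle)
    cycle = 0
    for kind, dest, srcs in code:
        cycle += 1                      # earliest issue: the next cycle
        for r in srcs:
            if r in produced:
                pk, pc = produced[r]
                cycle = max(cycle, pc + 1 + LATENCY.get((pk, kind), 0))
        if dest:
            produced[dest] = (kind, cycle)
    return cycle + (1 if code[-1][0] == "BNE" else 0)

unscheduled = [("LD", "F0", ["R1"]), ("ALU", "F4", ["F0", "F2"]),
               ("SD", None, ["F4", "R1"]), ("ALU", "R1", ["R1"]),
               ("BNE", None, ["R1", "R2"])]
scheduled = [("LD", "F0", ["R1"]), ("ALU", "R1", ["R1"]),
             ("ALU", "F4", ["F0", "F2"]), ("BNE", None, ["R1", "R2"]),
             ("SD", None, ["F4", "R1"])]
```

Running `count_cycles` on the two sequences gives 10 and 6, matching the slide; scheduling wins by moving SUBI into a load-delay slot and the store into the branch-delay slot.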

33
Loop Unrolling
Four copies of the loop, then the unrolled four-iteration code:
  • LD F0, 0(R1)
  • ADDD F4, F0, F2
  • SD F4 , 0(R1)
  • SUBI R1, R1, 8
  • BNE R1, R2, Loop
  • LD F0, 0(R1)
  • ADDD F4, F0, F2
  • SD F4 , 0(R1)
  • SUBI R1, R1, 8
  • BNE R1, R2, Loop
  • LD F0, 0(R1)
  • ADDD F4, F0, F2
  • SD F4 , 0(R1)
  • SUBI R1, R1, 8
  • BNE R1, R2, Loop
  • LD F0, 0(R1)
Unrolled four-iteration code:
  • Loop LD F0, 0(R1)
  • ADDD F4, F0, F2
  • SD F4, 0(R1)
  • LD F6, -8(R1)
  • ADDD F8, F6, F2
  • SD F8, -8(R1)
  • LD F10, -16(R1)
  • ADDD F12, F10, F2
  • SD F12, -16(R1)
  • LD F14, -24(R1)
  • ADDD F16, F14, F2
  • SD F16, -24(R1)
  • SUBI R1, R1, 32
  • BNE R1, R2, Loop

Assumption: R1 is initially a multiple of 32, or the number of loop
iterations is a multiple of 4.
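At the source level, unrolling just replicates the loop body so the counter update and branch execute once per four elements instead of once per element; a sketch (assuming, as the slide does, that the element count is a multiple of 4):

```python
def scale_add(x, s):
    # Rolled loop: one element, one counter update, one branch per
    # iteration.
    for i in range(len(x)):
        x[i] += s

def scale_add_unrolled(x, s):
    # Four-way unrolled body: one counter update and one branch per
    # four elements.  Assumes len(x) is a multiple of 4.
    i, n = 0, len(x)
    while i < n:
        x[i] += s
        x[i + 1] += s
        x[i + 2] += s
        x[i + 3] += s
        i += 4
```

Both functions compute the same result; the unrolled one simply amortizes the loop overhead, and its longer straight-line body is what gives the scheduler room to hide latencies.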
34
Loop Unroll Schedule
  • Loop LD F0, 0(R1)
  • stall
  • ADDD F4, F0, F2
  • stall
  • stall
  • SD F4, 0(R1)
  • LD F6, -8(R1)
  • stall
  • ADDD F8, F6, F2
  • stall
  • stall
  • SD F8, -8(R1)
  • LD F10, -16(R1)
  • stall
  • ADDD F12, F10, F2
  • stall
  • stall
  • SD F12, -16(R1)
  • LD F14, -24(R1)

Loop LD F0, 0(R1) LD F6, -8(R1) LD F10,
-16(R1) LD F14, -24(R1) ADDD F4, F0,
F2 ADDD F8, F6, F2 ADDD F12, F10, F2 ADDD F16,
F14, F2 SD F4, 0(R1) SD F8, -8(R1) SD F12,
-16(R1) SUBI R1, R1, 32 BNE R1, R2,
Loop SD F16, 8(R1)
The unscheduled unrolled loop (first listing): 28 clock cycles, or 7 per
iteration. The scheduled version (second listing): no stalls! 14 clock
cycles, or 3.5 per iteration. Can we minimize further?
35
Summary
Original iteration: 10 cycles
  with scheduling: 6 cycles
  with unrolling: 7 cycles per iteration
  with unrolling and scheduling: 3.5 cycles per iteration (no stalls)
36
Multiple Issue
  • Multiple Issue is the ability of the processor to
    start more than one instruction in a given cycle.
  • Superscalar processors
  • Very Long Instruction Word (VLIW) processors

37
A Modern Processor
Multiple Issue
Front-end
Back-end
38
1990s Superscalar Processors
  • Bottleneck: CPI > 1
  • Limit on scalar performance (single instruction issue)
  • Hazards
  • Superpipelining? Diminishing returns (hazards + overhead)
  • How can we make the CPI = 0.5?
  • Multiple instructions in every pipeline stage (superscalar),
    two per cycle:

    Cycle:  1   2   3   4   5   6   7
    Inst0:  IF  ID  EX  MEM WB
    Inst1:  IF  ID  EX  MEM WB
    Inst2:      IF  ID  EX  MEM WB
    Inst3:      IF  ID  EX  MEM WB
    Inst4:          IF  ID  EX  MEM WB
    Inst5:          IF  ID  EX  MEM WB

39
Elements of Advanced Superscalars
  • High performance instruction fetching
  • Good dynamic branch and jump prediction
  • Multiple instructions per cycle, multiple
    branches per cycle?
  • Scheduling and hazard elimination
  • Dynamic scheduling
  • Not necessarily: the Alpha 21064 and Pentium were statically
    scheduled
  • Register renaming to eliminate WAR and WAW
  • Parallel functional units, paths/buses/multiple
    register ports
  • High performance memory systems
  • Speculative execution

40
SS + DS + Speculation
  • Superscalar + Dynamic scheduling + Speculation
  • Three great tastes that taste great together
  • CPI > 1?
  • Overcome with superscalar
  • Superscalar increases hazards
  • Overcome with dynamic scheduling
  • RAW dependences still a problem?
  • Overcome with a large window
  • Branches a problem for filling a large window?
  • Overcome with speculation

41
The Big Picture
(Diagram: static program -> fetch & branch predict -> issue ->
execution -> reorder & commit.)
42
Superscalar Microarchitecture
(Block diagram: pre-decode -> instruction cache -> decode, rename,
dispatch -> instruction buffers (floating-point and integer/address) ->
register files -> functional units and data cache -> memory interface,
with reorder and commit at the end.)
43
Register renaming methods
  • First method:
  • Physical register file vs. logical (architectural) register file
  • A mapping table associates a physical register with the current
    value of each logical register
  • Uses a free list of physical registers
  • The physical register file is bigger than the logical register file
  • Second method:
  • The physical register file is the same size as the logical one
  • Also uses a buffer with one entry per instruction: the reorder
    buffer
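A minimal sketch of the first method, with a mapping table and a free list (the class and register naming are mine, not any real processor's implementation):

```python
class Renamer:
    """Map architectural registers to physical registers via a mapping
    table plus a free list of spare physical registers."""
    def __init__(self, n_arch, n_phys):
        # Initially each architectural register has its own physical one.
        self.table = {f"r{i}": f"p{i}" for i in range(n_arch)}
        self.free = [f"p{i}" for i in range(n_arch, n_phys)]

    def rename(self, dest, srcs):
        # Sources read the current mappings; the destination gets a
        # fresh physical register, which removes WAR and WAW hazards.
        new_srcs = [self.table[r] for r in srcs]
        new_dest = self.free.pop(0)
        self.table[dest] = new_dest
        return new_dest, new_srcs

r = Renamer(n_arch=4, n_phys=8)
# add r3, r3, 4: reads the old mapping of r3, writes a fresh register.
d, s = r.rename("r3", ["r3"])   # d == "p4", s == ["p3"]
```

Renaming the same architectural register again hands out yet another physical register, so a later writer can never clobber a value an earlier reader still needs.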

44
Register Renaming Example
  • Loop LD F0, 0(R1)
  • stall
  • ADDD F4, F0, F2
  • stall
  • stall
  • SD F4, 0(R1)
  • LD F6, -8(R1)
  • stall
  • ADDD F8, F6, F2
  • stall
  • stall
  • SD F8, -8(R1)
  • LD F10, -16(R1)
  • stall
  • ADDD F12, F10, F2
  • stall
  • stall
  • SD F12, -16(R1)
  • LD F14, -24(R1)

Loop LD F0, 0(R1) LD F6, -8(R1) LD F10,
-16(R1) LD F14, -24(R1) ADDD F4, F0,
F2 ADDD F8, F6, F2 ADDD F12, F10, F2 ADDD F16,
F14, F2 SD F4, 0(R1) SD F8, -8(R1) SD F12,
-16(R1) SUBI R1, R1, 32 BNE R1, R2,
Loop SD F16, 8(R1)
Unscheduled (first listing): 28 clock cycles, or 7 per iteration.
Scheduled (second listing): no stalls! 14 clock cycles, or 3.5 per
iteration. Note how each unrolled iteration uses fresh registers
(F0, F6, F10, F14), which is renaming done by hand.
45
Register renaming first method
(Figure: mapping table and free list before and after renaming
"add r3, r3, 4".)
46
Superscalar Processors
  • Issues a varying number of instructions per clock
  • Scheduling: static (by the compiler) or dynamic (by the hardware)
  • Superscalar has a varying number of instructions/cycle (1 to 8),
    scheduled by the compiler or by hardware (Tomasulo).
  • IBM PowerPC, Sun UltraSPARC, DEC Alpha, HP 8000

47
More Realistic HW Register Impact
  • Effect of limiting the number of renaming
    registers

(Chart: IPC versus the number of renaming registers; FP: 11 - 45,
Integer: 5 - 15.)
48
Reorder Buffer
  • Reserve an entry at the tail when the instruction is dispatched
  • Place data in the entry when execution finishes
  • Bypass results to other instructions when needed
  • Remove from the head when complete
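The bullets describe a FIFO protocol; a minimal sketch of it (my own simplified representation, with bypassing omitted):

```python
from collections import deque

class ReorderBuffer:
    """Reserve at the tail at dispatch, fill in data when execution
    finishes, retire from the head only when the head entry is done."""
    def __init__(self, size):
        self.size = size
        self.entries = deque()           # each entry: [dest, value, done]

    def dispatch(self, dest):
        if len(self.entries) == self.size:
            return None                  # ROB full: dispatch must stall
        entry = [dest, None, False]
        self.entries.append(entry)       # reserve an entry at the tail
        return entry

    def finish(self, entry, value):
        entry[1], entry[2] = value, True # place data when execution ends

    def commit(self):
        # Remove from the head only when the oldest instruction is done;
        # younger finished instructions must wait (in-order commit).
        if self.entries and self.entries[0][2]:
            return self.entries.popleft()
        return None
```

Even when a younger instruction finishes first, `commit()` keeps returning None until the head entry completes, which is what keeps architectural state updates in program order.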
49
Register renaming with a reorder buffer
(Figure: before renaming, "add r3, r3, 4"; after, "add rob8, rob6, 4".
The source r3 is currently mapped to reorder-buffer entry 6, and the
result is allocated new tail entry 8.)
50
Instruction Buffers
(The same block diagram as slide 42, with the instruction buffers
highlighted.)
51
Issue Buffer Organization
  • a) Single, shared queue: no out-of-order issue, no renaming
  • b) Multiple queues, one per instruction type: no out-of-order issue
    inside a queue, but the queues issue out of order with respect to
    each other
52
Issue Buffer Organization
  • c) Multiple reservation stations (one per instruction type, or one
    big pool)
  • No FIFO ordering
  • Execution starts when operands are ready and hardware is available
  • Proposed by Tomasulo

(Figure: reservation stations fed from instruction dispatch.)
53
Typical reservation station
Fields: operation; source 1, data 1, valid 1; source 2, data 2,
valid 2; destination.
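A minimal sketch of such an entry with Tomasulo-style operand capture (the field layout follows the slide; the class and method names are mine):

```python
class ReservationStation:
    """One entry: operation, two (source-tag, data, valid) operand
    pairs, and a destination tag.  The entry is ready to issue once
    both operands are valid."""
    def __init__(self, op, src1, src2, dest):
        self.op = op
        self.src = [src1, src2]          # producer tags (None = no producer)
        self.data = [None, None]
        self.valid = [src1 is None, src2 is None]
        self.dest = dest

    def capture(self, tag, value):
        # Snoop a broadcast result: fill any operand waiting on the tag.
        for i in (0, 1):
            if not self.valid[i] and self.src[i] == tag:
                self.data[i], self.valid[i] = value, True

    def ready(self):
        return all(self.valid)

rs = ReservationStation("ADD", "rob6", None, "rob8")  # add rob8, rob6, 4
rs.data[1] = 4                    # the immediate operand is known up front
rs.capture("rob6", 12)            # rob6's result broadcast on the bus
```

Every reservation station snoops the same result bus, so one broadcast can wake up any number of waiting instructions.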
54
Memory Hazard Detection Logic