Title: Introduction to VLSI Programming Lecture 9: High Performance DLX
1. Introduction to VLSI Programming, Lecture 9: High Performance DLX
- (course 2IN30)
- Prof. dr. ir. Kees van Berkel
2. Time table 2005
date      class  lab      subject
Aug. 30   2      0 hours  intro VLSI
Sep. 6    3      0 hours  handshake circuits
Sep. 13   3      0 hours  handshake circuits; assignment
Sep. 20   3      0 hours  Tangram
Sep. 27   no lecture
Oct. 4    no lecture
Oct. 11   1      2 hours  demo, fifos, registers; deadline assignment
Oct. 18   1      2 hours  design cases
Oct. 25   1      2 hours  DLX introduction
Nov. 1    1      2 hours  low-cost DLX
Nov. 8    1      2 hours  high-speed DLX
Dec. 13   deadline final report
3. Lecture 9
- Recapitulation of Lecture 8
- 3-stage DLX, using branch-delay slots
- Lab work: 3-stage DLX in Tangram, > 80 MIPS
- Industrial applications of asynchronous technology
- Conclusion of course 2IN30
4. Pipelining in Tangram
- Compare three programs:
- P0: a?x0 ; b!f2(f1(f0(x0)))
- P1: a?x0 ; x1 := f0(x0) ; x2 := f1(x1) ; b!f2(x2)
- P2: (a?x0 ; a1!f0(x0)) || (a1?x1 ; a2!f1(x1)) || (a2?x2 ; b!f2(x2))
5. Pipelining in Tangram (cont'd)
- Output sequence b is identical for P0, P1, and P2.
- P0 and P1 have the same communication behavior; P1 is larger, slower, and warmer.
- P2 vs P1: similar in size, energy, and latency, but up to 3 times higher throughput, depending on the (relative) complexity of f0, f1, f2.
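The "up to 3 times" figure can be illustrated with a rough timing model. The sketch below is not Tangram and not part of the course material; it assumes that a sequential program needs the sum of the three function delays per item, that a pipelined program in steady state is limited by its slowest stage, and that handshake and communication overhead is zero. The delay values are made up for illustration.

```python
def sequential_cycle_time(stage_delays):
    """Time per item when f0, f1, f2 run one after another (as in P1)."""
    return sum(stage_delays)


def pipelined_cycle_time(stage_delays):
    """Steady-state time per item when the stages run in parallel (as in P2).

    Assumption: the slowest stage sets the pace; handshake overhead is ignored.
    """
    return max(stage_delays)


def throughput_gain(stage_delays):
    """Ratio of pipelined to sequential throughput under the model above."""
    return sequential_cycle_time(stage_delays) / pipelined_cycle_time(stage_delays)


# Hypothetical delays in ns for f0, f1, f2.
balanced = [10.0, 10.0, 10.0]
skewed = [5.0, 20.0, 5.0]

print(throughput_gain(balanced))  # 3.0
print(throughput_gain(skewed))    # 1.5
```

The gain reaches 3 only when the three functions are equally complex; one dominant stage pulls it toward 1.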
6. DLX0 instruction loop
do -halted then
    ROMaddr!PC ;
    ROMdata?ir ;
    PC := PC+4 ;        [aux := PC+4 ; PC := aux]
    case (ir cast Itype.0)
    is <<t,f,f,f,f,f>> then LW()
    or <<t,f,f,f,f,t>> then SW()
    or <<f,f,f,f,f,f>> then if (ir cast Rtype.4 = 1) then SLT() fi
    or <<f,t,f,f,f,f>> then BEQZ()
    or <<f,t,f,f,f,t>> then J()
    or <<f,f,t,f,f,f>> then halted := true
    si
od
7. DLX0 instruction loop
- Each instruction cycle:
- 4 sequential commands for each instruction type
- 1-3 sequential commands for specific instructions
- 5-7 sequential commands each cycle
- Pipelining:
- split these 5-7 commands over 2 stages, in a (more or less) balanced way.
- is simple when an instruction does not affect the PC, but more difficult for jump and branch instructions.
8. 2-stage DLX example template
9. DLX 3-stage pipelined execution
[Figure: pipeline diagram. Horizontal axis: time, in instruction cycles (1 to 7, ...); vertical axis: program execution, in instructions.]
Stage EX includes memory access and writeback.
10. 3-stage DLX example template
11. Reducing pipeline branch penalties
- Problem:
- which instruction to fetch after a branch instruction?
- Strategies:
- wait until branch address is computed (DLX0)
- predict branch not taken
- predict branch taken
- introduce branch-delay slots (today's assignment)
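The cost of each strategy can be compared with the usual textbook estimate: an ideal pipeline completes one instruction per cycle, and every penalised branch adds some stall cycles. The sketch below only illustrates that arithmetic; the function name, the branch fraction, and the percentages are hypothetical and not measurements of any DLX implementation.

```python
def effective_cpi(branch_fraction, penalty_cycles, penalised_fraction=1.0):
    """Average cycles per instruction for an otherwise ideal pipeline.

    branch_fraction    -- fraction of executed instructions that are branches/jumps
    penalty_cycles     -- stall cycles paid by a penalised branch
    penalised_fraction -- fraction of branches that actually pay the penalty
    """
    return 1.0 + branch_fraction * penalised_fraction * penalty_cycles


# Hypothetical workload: 20% branches, 1-cycle penalty.
always_wait = effective_cpi(0.20, 1)             # every branch stalls
predict_not_taken = effective_cpi(0.20, 1, 0.6)  # assume 60% of branches are taken
delay_slot = effective_cpi(0.20, 1, 0.3)         # assume 30% of slots hold a NOP

print(round(always_wait, 2), round(predict_not_taken, 2), round(delay_slot, 2))
```

With a branch-delay slot, a penalty remains only when the slot cannot be filled with useful work.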
12. Branch delay slots
- Single branch delay slot:
- branch instruction
- branch-delay instruction
- branch target (if taken)
- Branch-delay instruction, various possibilities:
- e.g. the instruction preceding the branch instruction (if the branch condition does not depend on its outcome)
- ... or an instruction succeeding the branch, if
- NOP instruction if no productive alternative is available.
- This constitutes a change in the ISA!
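The effect of a single branch delay slot on execution order can be shown with a toy interpreter. This is a minimal sketch, not the DLX and not Tangram: the tuple-based instruction format, the register names, and the function run_with_delay_slot are invented for illustration, and instruction addresses are simply list indices. The instruction directly after a branch always executes, whether or not the branch is taken.

```python
def run_with_delay_slot(program, regs, max_steps=1000):
    """Run a toy program with one branch delay slot; return executed indices."""
    pc, next_pc = 0, 1  # next_pc is already committed when pc executes
    trace = []
    for _ in range(max_steps):
        op = program[pc]
        trace.append(pc)
        after_next = next_pc + 1  # default: keep going sequentially
        if op[0] == "halt":
            break
        if op[0] == "addi":
            _, rd, rs, imm = op
            regs[rd] = regs[rs] + imm
        elif op[0] == "beqz":
            _, rs, target = op
            if regs[rs] == 0:
                after_next = target  # takes effect after the delay slot
        elif op[0] == "j":
            after_next = op[1]
        elif op[0] != "nop":
            raise ValueError(f"unknown instruction {op!r}")
        pc, next_pc = next_pc, after_next
    return trace


program = [
    ("addi", "r1", "r0", 0),  # 0: r1 := 0
    ("beqz", "r1", 4),        # 1: branch to 4 if r1 == 0
    ("addi", "r2", "r2", 1),  # 2: delay slot, always executed
    ("addi", "r3", "r3", 1),  # 3: skipped when the branch is taken
    ("halt",),                # 4
]
regs = {"r0": 0, "r1": 0, "r2": 0, "r3": 0}
trace = run_with_delay_slot(program, regs)
print(trace)  # [0, 1, 2, 4]
```

Instruction 2 runs even though the branch at 1 is taken, which is why assembler text written for a machine without delay slots has to be rearranged or padded with NOPs.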
13. Final assignment
- 3-stage DLX, with instruction rate exceeding 80 MIPS when executing GCD (measured over several GCD cycles).
- NB1: exploit branch delay slots. This requires a different version of the assembler text!
- NB2: can be achieved using command-level parallelism and pipelining. (Expression-level parallelism may yield a bonus.)
- NB3: speed up the environment (RAM, ROM) when necessary.
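To check the 80 MIPS target from a simulation run, divide the number of executed instructions by the elapsed time. The helper names and the example figures below are hypothetical; the sketch assumes the simulator reports elapsed time in nanoseconds.

```python
def mips(instruction_count, elapsed_ns):
    """Instruction rate in millions of instructions per second."""
    return instruction_count * 1e3 / elapsed_ns


def max_avg_instruction_time_ns(target_mips):
    """Largest average time per instruction that still meets target_mips."""
    return 1e3 / target_mips


# Hypothetical measurement: 1000 instructions over several GCD cycles in 12 us.
print(mips(1000, 12000.0))
print(max_avg_instruction_time_ns(80.0))  # 12.5
```

An instruction rate above 80 MIPS therefore means an average of less than 12.5 ns per instruction, including the ROM and RAM accesses.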
14. VLSI programming of asynchronous circuits
[Figure: design flow. Tangram program, via compiler, to handshake circuit, via expander, to asynchronous circuit (netlist of gates); a simulator provides feedback on behavior, area, time, energy, and test coverage.]
15. Demonstrator ICs
16. Added value
- 1985: modularity, ease of design (no value added to product!)
- 1990: low power (ESPRIT project ?????)
- 1992: low noise, low EME (Electro-Magnetic Emission)
- 2000: ...
17. Added value: low power (DCC Error Corrector)
18. A sync-async arms race
19. Added value: low power
Synchronous 80C51 - Asynchronous 80C51
20. Added value: low EM emission
21. Roadblock: circuit size; the 80C51 learning curve
1995/6
1999/4
22. Just-in-time processing
23. ADPCM
24. ADPCM
25. ADPCM
26. Industrialization of the Technology
- Philips Semiconductors Zürich (1994 Dec): "We want to set a world record in low power, .. by using asynchronous technology."
- Their choice for a vehicle: the 80C51 micro-controller (used in many consumer products).
- Result: 4× less power, minimal EME.
- Follow-up: pager baseband ICs, ...
- In parallel: transfer and upgrade of tools and design flow
27. Pager Baseband Controller ICs
- Myna pager:
- FLEX protocol
- 32 alphanumeric messages
- a single AAA battery (1 V)
- up to 25 weeks battery life
- Pager baseband controller ICs:
- PCA5007, PCA5010
- http://www.semiconductors.philips.com/pip/PCA5007
- http://www.win.tue.nl/pa/wsinap/async.html
28. 1998-Sep: the PCA5007
29. A new generation of pagers: a common platform for all standards
30. EMI: a critical design factor (Electro-Magnetic Interference)
- Antenna signal may be as small as 25 µV.
- Clock harmonics of synchronous micro-controllers interfere with RF (X00 MHz).
- With the asynchronous 80C51: signal decoding by means of (standard-specific) software. (This also enables upgrading/downloading!)
- Furthermore, no shielding is required between controller and RF receiver.
31. PCA5007 block diagram
32. Contactless smartcard IC (ESPRIT project DESCALE)
[Figure: block diagram. On-chip blocks: power regulator, 80C51 micro-controller, DES engine, UART, RAM, ROM, EEPROM. Radio link: 13.56 MHz clock, power (a few mW), bi-directional communication (106 kbit/s).]
33. Contactless smartcard IC
- Properties:
- a) low average power
- b) lower peak power
- c) speed adaptation
- Merits:
- Maximum speed for received power (a, c)
- Robust operation against voltage drops (c)
- Smaller buffer capacitor (b, c)
34. Conclusion
- First asynchronous VLSI circuits on the market (high volume sales).
- Prospects for more async products look good.
- Added value: low power, EME performance.
- Added costs: test, IC area, being different.
- Asynchronous VLSI technology:
- there is room for it in market niches,
- but it may contribute to main-stream VLSI.
35. Bibliography
- Computer Architecture: A Quantitative Approach (3rd Ed.), John L. Hennessy and David A. Patterson, Morgan Kaufmann Publishers Inc., 1996.
- ARM System Architecture, Steve Furber, Addison Wesley, 1996.
- DSP Processor Fundamentals: Architectures and Features, Phil Lapsley et al. (Berkeley Design Technology Inc.), IEEE, 1996.
- www.handshakesolutions.com
- www.arm.com/news/6936.html
- www.research.philips.com/newscenter/archive/2004/handshake.html
36. Lab work and report
- You are allowed to team up with a colleague. (Not mandatory.)
- Report: more than a listing of functional Tangram programs:
- analyze the specifications and requirements
- present design options, alternatives, trade-offs
- motivate your design choices
- explain the functional correctness of your Tangram programs
- analyze and explain the area, time, and energy of your programs.
37. Lab work
- Assignment 6:
- create a 3-stage pipelined dlx3.tg
- design a reduced-cost version dlx3s.tg
- Kees van Berkel
- Attn: Cecile Brouwers, HG 5.06, Wisk & Informatica
- Good luck, and have fun!