Introduction to Silicon Programming in the Tangram/Haste language

Learn more at: http://www.cs.unc.edu

1
Introduction to Silicon Programming in the Tangram/Haste language

Material adapted from lectures by
Prof.dr.ir. Kees van Berkel
Dr. Johan Lukkien
Dr.ir. Ad Peeters
at the Technical University of Eindhoven, the
Netherlands

2
VLSI programming for

Low costs: introduce resource sharing.
Low delay (high throughput): introduce parallelism.
Low energy (low power): reduce activity.

3
VLSI programming for high performance

Keep it simple!
Focus the analysis on bottlenecks.
Introduce parallelism: in expressions, commands, loops, and pipelining.
Enable parallelism by reducing dependencies, such as resource sharing.

4
Expression-level parallelism

Examples:
balancing: (v+w)+(x+y) is faster than v+w+x+y
substitution: z := g(f(x)) is faster than y := f(x); z := g(y)
carry-select adder
carry-save multiplier
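
A rough way to see why balancing helps is to count the operator levels on the longest path. The sketch below assumes every binary operator has the same unit delay, which the slide does not state, and the function names are hypothetical.

```python
from math import ceil, log2

def chain_depth(n_operands: int) -> int:
    """Operator levels in a left-to-right chain such as ((v+w)+x)+y."""
    return max(n_operands - 1, 0)

def balanced_depth(n_operands: int) -> int:
    """Operator levels in a balanced tree such as (v+w)+(x+y)."""
    return ceil(log2(n_operands)) if n_operands > 1 else 0

# Compare both shapes for a few operand counts.
for n in (4, 8, 16):
    print(n, chain_depth(n), balanced_depth(n))
```

For four operands the chain has three levels and the balanced tree has two; the gap widens as the operand count grows.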

5
Command-level parallelism

If S2 does not depend on the outcome of S1, then S1; S2 can be transformed into S1 || S2 (dependencies: data, sharing, synchronization).
This reduces the computation time t, unless ordering is enforced through external synchronization:
t(S1; S2) = t(;) + t(S1) + t(S2)
t(S1 || S2) = t(||) + max(t(S1), t(S2))
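
The two timing rules on this slide (sequential composition adds the durations, parallel composition takes the maximum) can be tried with numbers. This is a minimal sketch; the function names and the single overhead parameter `t_op` are assumptions made for illustration.

```python
def seq_time(t1: float, t2: float, t_op: float = 0.0) -> float:
    """Time of S1 ; S2: operator overhead plus both durations."""
    return t_op + t1 + t2

def par_time(t1: float, t2: float, t_op: float = 0.0) -> float:
    """Time of S1 || S2: operator overhead plus the slower command."""
    return t_op + max(t1, t2)

# Example durations of 3 and 5 time units with an overhead of 1.
print(seq_time(3, 5, 1))
print(par_time(3, 5, 1))
```

The gain is largest when the two commands take about equal time and vanishes when one of them dominates.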

6
Exposure of cmd-level parallelism

Let *[S] be shorthand for forever do S od.
Assume S0 must precede S1 and S1 must precede S2.
How to speed up *[S0; S1; S2]?
*[S0; S1; S2]
= {loop unfolding} S0; *[S1; S2; S0]
= {S0 does not depend on S2} S0; *[S1; (S2 || S0)]

7
wagging

*[a?x; b!f(x)]
= {loop unrolling, renaming}
*[a?x; b!f(x); a?y; b!f(y)]
= {loop folding}
a?x; *[b!f(x); a?y; b!f(y); a?x]
becomes {increases slack by 1}
a?x; *[(b!f(x) || a?y); (b!f(y) || a?x)]
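
The unrolling-and-renaming step can be checked with a small sequential model: alternating between two variables x and y must produce the same output sequence as using x alone. This sketch does not model the concurrency or the extra slack; the function names and the list-based channels are assumptions.

```python
_END = object()  # sentinel marking an exhausted input stream

def plain(inputs, f):
    """One variable x: receive a value, then send f(x), repeatedly."""
    out = []
    for x in inputs:
        out.append(f(x))
    return out

def wagged(inputs, f):
    """Loop unrolled twice with renaming: values alternate between x and y."""
    out = []
    it = iter(inputs)
    for x in it:
        out.append(f(x))
        y = next(it, _END)   # the second half of the unrolled body
        if y is _END:
            break
        out.append(f(y))
    return out

data = [1, 2, 3, 4, 5]
print(plain(data, lambda v: v * 2) == wagged(data, lambda v: v * 2))
```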

8
Parallel reads from REG file

Let RF be a register file. Then x := RF[i]; y := RF[j] cannot be parallelized. (Register files have a single read port.)
Parallel read actions can be realized by doubling the register file into RF and RG: every write of z updates both copies, and the read becomes x := RF[i] || y := RG[j].
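
Doubling the register file can be pictured as keeping two identical copies, each with its own single read port. The class below is a minimal sketch of that idea; the class and method names are hypothetical and nothing here reflects how a Tangram compiler would build it.

```python
class DoubledRegisterFile:
    """Two copies of a single-read-port register file, kept identical on writes."""

    def __init__(self, size: int):
        self.rf = [0] * size   # copy read through the first port
        self.rg = [0] * size   # copy read through the second port

    def write(self, k: int, z: int) -> None:
        # Every write updates both copies so they never diverge.
        self.rf[k] = z
        self.rg[k] = z

    def read_pair(self, i: int, j: int):
        # One read per copy, so neither copy needs a second read port.
        return self.rf[i], self.rg[j]

regs = DoubledRegisterFile(32)
regs.write(3, 7)
regs.write(5, 9)
print(regs.read_pair(3, 5))
```

The price is twice the storage and twice the write activity, which is the cost/delay trade-off named at the start of the deck.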

9
Pipelining in Tangram

Compare three programs
P0: *[a?x0; b!f2(f1(f0(x0)))]
P1: *[a?x0; x1 := f0(x0); x2 := f1(x1); b!f2(x2)]
P2: *[a?x0; a1!f0(x0)] || *[a1?x1; a2!f1(x1)] || *[a2?x2; b!f2(x2)]

10
Pipelining in Tangram (cntd)

The output sequence on b is identical for P0, P1, and P2.
P0 and P1 have the same communication behavior; P1 is larger, slower, and warmer.
P2 vs. P1: similar in size, energy, and latency, but up to 3 times higher throughput, depending on the (relative) complexity of f0, f1, f2.
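
The "up to 3 times" figure follows from a simple steady-state model: a sequential loop spends the sum of the stage delays per item, while a three-stage pipeline is paced by its slowest stage. The sketch below assumes that model and ignores communication overhead; the function names and delay values are made up for illustration.

```python
def cycle_time_sequential(delays):
    """P1-style loop: one item passes through every stage before the next enters."""
    return sum(delays)

def cycle_time_pipelined(delays):
    """P2-style pipeline: in steady state the slowest stage sets the pace."""
    return max(delays)

def speedup(delays):
    """Throughput ratio of the pipelined program over the sequential one."""
    return cycle_time_sequential(delays) / cycle_time_pipelined(delays)

print(speedup([1, 1, 1]))   # balanced stages
print(speedup([1, 1, 4]))   # one dominant stage
```

With balanced stages the ratio reaches 3; when one function dominates, the ratio drops toward 1.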

11
A Processor Example: DLX (Deluxe)

(AMD 29K, DECstation 3100, HP850, IBM801, Intel i860, MIPS M/120A, MIPS M/1000, Motorola 88K, RISC I, SGI 4D/60, SPARCstation-1, Sun-4/110, Sun-4/260) / 13 = 560 = DLX
Other RISC examples include Cray-1,2,3, AMD2900, DEC Alpha, ARM.

12
DLX instruction formats
Field boundaries (bit positions): 31-26, 25-21, 20-16, 15-11, 10-0
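
The bit boundaries above split a 32-bit instruction word into fields of 6, 5, 5, 5, and 11 bits. In the usual DLX description bits 31-26 carry the opcode, but the per-format meaning of the other fields is not given in this transcript, so the sketch below names the fields by bit position only. The function name is hypothetical.

```python
def decode_fields(word: int) -> dict:
    """Split a 32-bit word at the boundaries 31-26, 25-21, 20-16, 15-11, 10-0."""
    return {
        "bits_31_26": (word >> 26) & 0x3F,   # 6 bits
        "bits_25_21": (word >> 21) & 0x1F,   # 5 bits
        "bits_20_16": (word >> 16) & 0x1F,   # 5 bits
        "bits_15_11": (word >> 11) & 0x1F,   # 5 bits
        "bits_10_0": word & 0x7FF,           # 11 bits
    }

# Example word built from arbitrary field values.
example = (0x23 << 26) | (0 << 21) | (1 << 16) | 4
print(decode_fields(example))
```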
13
Example instructions
14
GCD in DLX assembler

pre:   LW   R1,4(R0)     ; R1 := Mem[4+0]
       LW   R2,8(R0)     ; R2 := Mem[8+0]
loop:  SUB  R3,R1,R2     ; R3 := R1-R2
       BEQZ R3,exit      ; if (R3=0) then PC := exit
       SLT  R4,R1,R2     ; R4 := (R1 < R2)
       BEQZ R4,pos2      ; if (R4=0) then PC := pos2
pos1:  SUB  R2,R2,R1     ; R2 := R2-R1
       J    loop         ; PC := loop
pos2:  SUB  R1,R1,R2     ; R1 := R1-R2
       J    loop         ; PC := loop
exit:  SW   20(R0),R1    ; Mem[20+0] := R1
       HLT
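
The assembler program is the subtractive form of Euclid's algorithm. The Python sketch below follows it instruction by instruction, with memory modelled as a dictionary keyed by address (an assumption made for illustration); it expects both inputs to be positive, as the loop would not terminate otherwise.

```python
def gcd_dlx_style(mem: dict) -> dict:
    """Follow the assembly step by step; assumes both inputs are positive."""
    r1 = mem[4]                      # LW R1,4(R0)
    r2 = mem[8]                      # LW R2,8(R0)
    while True:
        r3 = r1 - r2                 # SUB R3,R1,R2
        if r3 == 0:                  # BEQZ R3,exit
            break
        r4 = 1 if r1 < r2 else 0     # SLT R4,R1,R2
        if r4 == 0:                  # BEQZ R4,pos2
            r1 = r1 - r2             # pos2: SUB R1,R1,R2
        else:
            r2 = r2 - r1             # pos1: SUB R2,R2,R1
    mem[20] = r1                     # SW 20(R0),R1
    return mem

print(gcd_dlx_style({4: 12, 8: 18})[20])
```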

15
DLX interface, state
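[Figure: the DLX CPU with its state (pc and register file Reg, r0 to r31). It connects to the instruction memory (address, instruction) and to the data memory Mem (address, data, r/w), and has clock and interrupt inputs.]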
16
DLX Moore machine (ignoring interrupts)

⟨Reg[0], pc⟩ := ⟨0, 0⟩;
do ⟨Mem[Reg[rs1] + immediate], pc, Reg[rd]⟩ :=
   ⟨ if SW → Reg[rd] fi
   , if J → pc + 4 + offset
     [] BEQZ → if Reg[rs] = 0 → pc + 4 + immediate
               [] Reg[rs] ≠ 0 → pc + 4 fi
     [] else → pc + 4
     fi
   , if LW → Mem[Reg[rs1] + immediate]
     [] ADD → ALU(add, Reg[rs1], Reg[rs2])
     fi ⟩
od
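
The state machine on this slide updates memory, pc, and one register together from the old state. The sketch below models one such step for the five instructions named on the slide. The tuple encoding of instructions and the function name are assumptions made for illustration, not the DLX binary format, and interrupts are ignored as on the slide.

```python
def step(reg, pc, mem, instr):
    """Return the next (reg, pc, mem) for one instruction; inputs are not mutated."""
    op = instr[0]
    new_reg, new_mem, new_pc = list(reg), dict(mem), pc + 4

    if op == "LW":                      # Reg[rd] := Mem[Reg[rs1] + immediate]
        _, rd, rs1, imm = instr
        new_reg[rd] = mem[reg[rs1] + imm]
    elif op == "SW":                    # Mem[Reg[rs1] + immediate] := Reg[rd]
        _, rd, rs1, imm = instr
        new_mem[reg[rs1] + imm] = reg[rd]
    elif op == "ADD":                   # Reg[rd] := Reg[rs1] + Reg[rs2]
        _, rd, rs1, rs2 = instr
        new_reg[rd] = reg[rs1] + reg[rs2]
    elif op == "J":                     # pc := pc + 4 + offset
        new_pc = pc + 4 + instr[1]
    elif op == "BEQZ":                  # branch only if Reg[rs] = 0
        _, rs, imm = instr
        if reg[rs] == 0:
            new_pc = pc + 4 + imm
    else:
        raise ValueError(f"unsupported instruction: {op}")

    new_reg[0] = 0   # assumption: register 0 always reads as zero, as in DLX
    return new_reg, new_pc, new_mem

# Load a value, double it, and store the result.
reg, pc, mem = [0] * 32, 0, {4: 12}
for instr in [("LW", 1, 0, 4), ("ADD", 2, 1, 1), ("SW", 2, 0, 20)]:
    reg, pc, mem = step(reg, pc, mem, instr)
print(pc, mem[20])
```

Computing the new state from copies keeps the three updates independent of each other, which mirrors the simultaneous assignment in the slide.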

17
DLX 5-step sequential execution
Steps: IF, ID, EX, MM, WB
18
DLX pipelined execution
[Figure: pipeline diagram. Horizontal axis: time in clock cycles (1 to 8, ...). Vertical axis: program execution in instructions.]
19
DLX pipelined execution
[Figure: pipelined datapath with the stages Instruction Fetch, Instruction Decode, Execute, Memory, and Write Back. Labeled elements: pc, 4, Instr. mem, Reg, 0?, Mem.]
20
DLX system organization
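[Figure: system_dlx() contains dlx(), rom(), and ram(). Channels between dlx() and rom(): ROMaddr, ROMdata. Channels between dlx() and ram(): RAMaddr, datatoRAM, datafromRAM. Outside the system boundary: file gcd.bin and the files RAMout and RAMin.]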
21
dlx0.ht

include types.ht
dlx0 export proc ( ROMaddr!chan adtype
                   ROMdata?chan word
                   RAMaddr!chan rwadtype
                   datatoRAM!chan S30
                   datafromRAM?chan S30
                 ) .
begin
  RF ram array U5 of S30
end

22
system_dlx0.ht

include "dlx0.ht"
dlx0 proc ( ROMaddr!chan adtype
            ROMdata?chan word
            RAMaddr!chan rwadtype
            datatoRAM!chan S30
            datafromRAM?chan S30
          ) . import
env_dlx4 main proc (
ROMfile? chan word
RAMinfile? chan S30
RAMfile! chan S30 /
/
) .
begin
next slide
end

23
system_dlx0.ht main body

begin
ROMaddr chan adtype
ROMdata chan word
RAMaddr chan rwadtype
datatoRAM chan S30
datafromRAM chan S30
ROMinterface proc() . begin .. end
RAMinterface proc() . begin .. end
initialise() ;
( ROMinterface() || RAMinterface() ||
  dlx0( ROMaddr, ROMdata, RAMaddr, datatoRAM, datafromRAM ) )
end

24
script