Introduction to Silicon Programming in the Tangram/Haste language

About This Presentation
Slides: 26
Provided by: csU8
Learn more at: http://www.cs.unc.edu
Transcript and Presenter's Notes

1
Introduction to Silicon Programming in the Tangram/Haste language
  • Material adapted from lectures by
    Prof.dr.ir. Kees van Berkel,
    Dr. Johan Lukkien, and
    Dr.ir. Ad Peeters
    at the Technical University of Eindhoven, the Netherlands

2
VLSI programming for
  • Low cost: introduce resource sharing.
  • Low delay (high throughput): introduce parallelism.
  • Low energy (low power): reduce activity.

3
VLSI programming for high performance
  • Keep it simple!
  • Focus the analysis on bottlenecks.
  • Introduce parallelism: in expressions, commands, loops, and through pipelining.
  • Enable parallelism by reducing dependencies, such as resource sharing.

4
Expression-level parallelism
  • Examples:
  • balancing: (v+w)+(x+y) is faster than v+w+x+y
  • substitution: z := g(f(x)) is faster than y := f(x) ; z := g(y)
  • carry-select adder
  • carry-save multiplier
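The balancing example can be illustrated with a small depth count. This is only a sketch: it assumes the operator is a two-input adder, that every adder has the same delay, and that the critical path is simply the number of adders in series. The function names are made up for this illustration.

```python
def chain_depth(n_operands: int) -> int:
    """Adders in series when v+w+x+... is evaluated strictly left to right."""
    return max(n_operands - 1, 0)


def balanced_depth(n_operands: int) -> int:
    """Adders in series when the same sum is arranged as a balanced tree.

    (n - 1).bit_length() equals ceil(log2(n)) for n >= 1.
    """
    return (n_operands - 1).bit_length() if n_operands > 1 else 0


if __name__ == "__main__":
    for n in (4, 8, 16):
        print(n, "operands: chain depth", chain_depth(n),
              "balanced depth", balanced_depth(n))
```

For four operands the chain has three adders in series and the balanced form has two, which is the sense in which (v+w)+(x+y) is faster.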

5
Command-level parallelism
  • If S2 does not depend on the outcome of S1, then S1 ; S2
    can be transformed into S1 || S2.
    (Dependencies: data, sharing, synchronization.)
  • This reduces the computation time τ, unless ordering
    is enforced through external synchronization.
  • τ(S1 ; S2) = τ(;) + τ(S1) + τ(S2)
  • τ(S1 || S2) = τ(||) + max(τ(S1), τ(S2))
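The two timing rules above can be tried out numerically. The sketch below is a plain arithmetic model, not a circuit simulation; the parameter t_op stands for the overhead of the composition operator itself and defaults to zero as an assumption.

```python
def time_sequential(t_s1, t_s2, t_op=0):
    """Computation time of S1 ; S2: operator overhead plus both commands."""
    return t_op + t_s1 + t_s2


def time_parallel(t_s1, t_s2, t_op=0):
    """Computation time of S1 || S2: operator overhead plus the slower command."""
    return t_op + max(t_s1, t_s2)


if __name__ == "__main__":
    # Hypothetical delays in arbitrary time units.
    print(time_sequential(3, 5))  # 8
    print(time_parallel(3, 5))    # 5
```

The gain is largest when the two commands take about equally long; if one dominates, parallel composition saves little.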

6
Exposure of cmd-level parallelism
  • Let *[S] be a shorthand for forever do S od.
  • Assume S0 must precede S1, and S1 must precede S2.
    How can *[S0 ; S1 ; S2] be sped up?
  • *[S0 ; S1 ; S2]
  • loop unfolding: S0 ; *[S1 ; S2 ; S0]
  • if S0 does not depend on S2: S0 ; *[S1 ; (S2 || S0)]
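To see what the unfolded loop buys, the following sketch builds a timeline for the form in which S2 runs in parallel with the next S0. It assumes fixed, hypothetical durations for S0, S1 and S2 and ignores the cost of the composition operators.

```python
def schedule_overlapped(t0, t1, t2, iterations):
    """Timeline of: S0, then repeatedly S1 followed by (S2 in parallel with S0).

    Returns (events, total_time); each event is (name, iteration, start, end).
    """
    events = []
    now = 0
    events.append(("S0", 0, now, now + t0))  # the unfolded first S0
    now += t0
    for i in range(iterations):
        events.append(("S1", i, now, now + t1))
        now += t1
        # S2 of iteration i overlaps with S0 of iteration i + 1.
        events.append(("S2", i, now, now + t2))
        events.append(("S0", i + 1, now, now + t0))
        now += max(t2, t0)
    return events, now


if __name__ == "__main__":
    events, total = schedule_overlapped(t0=2, t1=3, t2=4, iterations=3)
    for event in events:
        print(event)
    print("overlapped total:", total)         # 2 + 3 * (3 + 4) = 23
    print("sequential total:", 3 * (2 + 3 + 4))  # 27
```

In steady state the period drops from t0 + t1 + t2 to t1 + max(t2, t0), while every S1 still starts after its S0 has finished and every S2 after its S1.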

7
wagging
  • *[a?x ; b!f(x)]
  • loop unrolling, renaming:
  • *[a?x ; b!f(x) ; a?y ; b!f(y)]
  • loop folding:
  • a?x ; *[b!f(x) ; a?y ; b!f(y) ; a?x]
  • parallelizing (this increases slack by 1):
  • a?x ; *[(b!f(x) || a?y) ; (b!f(y) || a?x)]
8
Parallel reads from REG file
  • Let RF be a register file. Then x := RF[i] ; y := RF[j]
    cannot be parallelized. (Register files have a single
    read port.)
  • Parallel read actions can be realized by doubling the
    register file: every write of z goes to both copies,
    RF and RG, and the reads become x := RF[i] || y := RG[j].
9
Pipelining in Tangram
  • Compare three programs:
  • P0: *[a?x0 ; b!f2(f1(f0(x0)))]
  • P1: *[a?x0 ; x1 := f0(x0) ; x2 := f1(x1) ; b!f2(x2)]
  • P2: *[a?x0 ; a1!f0(x0)] || *[a1?x1 ; a2!f1(x1)]
    || *[a2?x2 ; b!f2(x2)]
10
Pipelining in Tangram (cont'd)
  • The output sequence on b is identical for P0, P1, and P2.
  • P0 and P1 have the same communication behavior; P1
    is larger, slower, and warmer.
  • P2 vs. P1: similar in size, energy, and latency,
    but up to 3 times higher throughput, depending
    on the (relative) complexity of f0, f1, f2.
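The comparison of P0, P1 and P2 can be mimicked in software. In the sketch below, chained generators play the role of the three pipeline stages; they are lazy rather than truly concurrent, so the code only demonstrates that the output sequences agree. The throughput estimate assumes each stage delay is fixed and ignores communication overhead; all names are made up for this illustration.

```python
def p0(xs, f0, f1, f2):
    """One nested expression per input."""
    return [f2(f1(f0(x))) for x in xs]


def p1(xs, f0, f1, f2):
    """Same computation with intermediate variables x1 and x2."""
    out = []
    for x0 in xs:
        x1 = f0(x0)
        x2 = f1(x1)
        out.append(f2(x2))
    return out


def stage(f, source):
    """One pipeline stage: take a value from source, pass f(value) on."""
    for value in source:
        yield f(value)


def p2(xs, f0, f1, f2):
    """Three stages connected by (generator) channels."""
    return list(stage(f2, stage(f1, stage(f0, xs))))


def throughput_gain(stage_delays):
    """Ideal gain of the pipeline over the sequential form: sum / max."""
    return sum(stage_delays) / max(stage_delays)


if __name__ == "__main__":
    fs = (lambda v: v + 1, lambda v: v * 2, lambda v: v - 3)
    print(p0([1, 2, 3], *fs), p1([1, 2, 3], *fs), p2([1, 2, 3], *fs))
    print(throughput_gain([1, 1, 1]), throughput_gain([1, 1, 4]))
```

With three equally complex stages the ideal gain is 3; when one stage dominates, the gain shrinks, which is why the factor is stated as "up to 3".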

11
A Processor Example: DLX (Deluxe)
  • (AMD 29K + DECstation 3100 + HP850 + IBM801 +
    Intel i860 + MIPS M/120A + MIPS M/1000 + Motorola
    88K + RISC I + SGI 4D/60 + SPARCstation-1 +
    Sun-4/110 + Sun-4/260) / 13 = DLX
  • Other RISC examples include
    Cray-1,2,3, AMD2900, DEC
    Alpha, ARM.
12
DLX instruction formats
[Figure: instruction formats. Bit-field boundaries: 31-26, 25-21, 20-16, 15-11, 10-0]
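The field boundaries listed on this slide translate directly into shifts and masks. The sketch below splits a 32-bit word at exactly those boundaries; the field names are deliberately neutral, because the slide text does not say what each field holds, and the example word is arbitrary.

```python
def split_fields(word):
    """Split a 32-bit instruction word at bits 31-26, 25-21, 20-16, 15-11, 10-0."""
    return {
        "bits_31_26": (word >> 26) & 0x3F,   # 6 bits
        "bits_25_21": (word >> 21) & 0x1F,   # 5 bits
        "bits_20_16": (word >> 16) & 0x1F,   # 5 bits
        "bits_15_11": (word >> 11) & 0x1F,   # 5 bits
        "bits_10_0": word & 0x7FF,           # 11 bits
    }


if __name__ == "__main__":
    # Arbitrary example word assembled from known field values.
    word = (0x23 << 26) | (1 << 21) | (2 << 16) | (3 << 11) | 5
    print(split_fields(word))
```

Formats that use a wider immediate or offset field would merge several of the low fields; the same shift-and-mask pattern applies.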
13
Example instructions
14
GCD in DLX assembler
  • pre:  LW   R1,4(R0)      R1 := Mem[4+0]
  •       LW   R2,8(R0)      R2 := Mem[8+0]
  • loop: SUB  R3,R1,R2      R3 := R1-R2
  •       BEQZ R3,exit       if (R3=0) then PC := exit
  •       SLT  R4,R1,R2      R4 := (R1<R2)
  •       BEQZ R4,pos2       if (R4=0) then PC := pos2
  • pos1: SUB  R2,R2,R1      R2 := R2-R1
  •       J    loop          PC := loop
  • pos2: SUB  R1,R1,R2      R1 := R1-R2
  •       J    loop          PC := loop
  • exit: SW   20(R0),R1     Mem[20+0] := R1
  •       HLT
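For readers less used to assembler, the same subtraction-based GCD can be written in Python with one variable per register. This is a line-by-line paraphrase of the listing, assuming both inputs are positive integers (otherwise the loop does not terminate); the function name is made up.

```python
def gcd_by_subtraction(a, b):
    """GCD by repeated subtraction, following the DLX listing step by step."""
    r1, r2 = a, b                     # LW R1,4(R0) ; LW R2,8(R0)
    while True:
        r3 = r1 - r2                  # loop: SUB R3,R1,R2
        if r3 == 0:                   # BEQZ R3,exit
            break
        r4 = 1 if r1 < r2 else 0      # SLT R4,R1,R2
        if r4 == 0:                   # BEQZ R4,pos2
            r1 = r1 - r2              # pos2: SUB R1,R1,R2
        else:
            r2 = r2 - r1              # pos1: SUB R2,R2,R1
    return r1                         # exit: SW 20(R0),R1


if __name__ == "__main__":
    print(gcd_by_subtraction(12, 18))
    print(gcd_by_subtraction(35, 14))
```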

15
DLX interface, state
[Figure: the DLX CPU with its state (pc and the register file Reg, r0 to r31) and its interfaces: instruction memory (address, instruction), data memory Mem (address, data, r/w), clock, and interrupt]
16
DLX: Moore machine (ignoring interrupts)
  • ⟨Reg[0], pc⟩ := ⟨0, 0⟩
  • ; do ⟨Mem[Reg[rs1] + immediate], pc, Reg[rd]⟩ :=
  •    ⟨ if SW → Reg[rd] fi
  •    , if J → pc + 4 + offset
  •      [] BEQZ → if Reg[rs1] = 0 → pc + 4 + immediate
  •                [] Reg[rs1] ≠ 0 → pc + 4 fi
  •      [] else → pc + 4
  •      fi
  •    , if LW → Mem[Reg[rs1] + immediate]
  •      [] ADD → ALU(add, Reg[rs1], Reg[rs2])
  •      fi ⟩
  •   od

17
DLX 5-step sequential execution
[Figure: the five steps in sequence: IF, ID, EX, MM, WB]
18
DLX pipelined execution
[Figure: pipeline diagram. Horizontal axis: time in clock cycles (1, 2, 3, 4, 5, 6, 7, 8, ...); vertical axis: program execution, in instructions]
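The kind of diagram this slide shows can be regenerated with a short script. The sketch assumes an ideal pipeline with one instruction entering per clock cycle and no stalls; the stage abbreviations are those of the five-step execution slide, and the function names are hypothetical.

```python
STAGES = ["IF", "ID", "EX", "MM", "WB"]


def pipeline_rows(n_instructions):
    """One row per instruction; column c holds the stage occupied in cycle c + 1."""
    total = cycles_pipelined(n_instructions)
    rows = []
    for i in range(n_instructions):
        row = ["--"] * i + STAGES
        rows.append(row + ["--"] * (total - len(row)))
    return rows


def cycles_sequential(n_instructions):
    """Each instruction finishes all five steps before the next one starts."""
    return n_instructions * len(STAGES)


def cycles_pipelined(n_instructions):
    """Fill the pipeline once, then complete one instruction per cycle."""
    if n_instructions == 0:
        return 0
    return len(STAGES) + n_instructions - 1


if __name__ == "__main__":
    for row in pipeline_rows(4):
        print(" ".join(row))
    print(cycles_sequential(4), "cycles sequential,",
          cycles_pipelined(4), "cycles pipelined")
```

Four instructions take 20 cycles one after another and 8 cycles when overlapped, under the no-stall assumption.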
19
DLX pipelined execution
[Figure: pipelined datapath with the stages Instruction Fetch, Inst. Decode, EXecute, Memory, Write Back; labelled elements: pc, 4, 0?, Instr. mem, Reg, Mem]
20
DLX system organization
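[Figure: system_dlx() contains dlx(), rom(), and ram() inside the system boundary. dlx() is connected to ram() through RAMaddr, datatoRAM, datafromRAM and to rom() through ROMaddr, ROMdata. Outside the boundary: files RAMout and RAMin, and file gcd.bin]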
21
dlx0.ht
  • include "types.ht"
  • dlx0 : export proc ( ROMaddr!chan adtype
  •                    & ROMdata?chan word
  •                    & RAMaddr!chan rwadtype
  •                    & datatoRAM!chan S30
  •                    & datafromRAM?chan S30
  •                    ) .
  • begin
  •   RF : ram array U5 of S30
  • end

22
system_dlx0.ht
  • include "dlx0.ht"
  • dlx0 : proc ( ROMaddr!chan adtype
  •             & ROMdata?chan word
  •             & RAMaddr!chan rwadtype
  •             & datatoRAM!chan S30
  •             & datafromRAM?chan S30
  •             ) . import
  • & env_dlx4 : main proc ( ROMfile?chan word
  •                        & RAMinfile?chan S30
  •                        & RAMfile!chan S30
  •                        ) .
  • begin
  •   next slide
  • end

23
system_dlx0.ht main body
  • begin
  •   ROMaddr : chan adtype
  • & ROMdata : chan word
  • & RAMaddr : chan rwadtype
  • & datatoRAM : chan S30
  • & datafromRAM : chan S30
  • & ROMinterface : proc() . begin .. end
  • & RAMinterface : proc() . begin .. end
  • | initialise() ;
    ( ROMinterface() || RAMinterface()
      || dlx0( ROMaddr, ROMdata, RAMaddr, datatoRAM, datafromRAM ) )
  • end

24
script
  • htcomp system_dlx0
  • htsim -limit 1000 system_dlx0 RAMin RAMout
  • htview system_dlx0
  • htmap system_dlx0

25
DLX0 instruction loop
  • do -halted then
  • ROMaddr!PC
  • ROMdata?ir
  • PC := PC+4
    (via an auxiliary variable: aux := PC+4 ; PC := aux)
  • case (ir cast Itype.0)
  • is then LW()
  • or then SW()
  • or then if (ir cast Rtype.4 1) then SLT() fi
  • or then BEQZ()
  • or then J()
  • or then halted := true
  • si
  • od
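The fetch, increment, dispatch structure of this loop can be imitated by a tiny interpreter. The sketch below is not the Haste design: it works on symbolic tuples instead of binary instruction words, resolves branch targets through a label table holding absolute addresses, and adds a SUB case so that the GCD program from the earlier slide can run. The names run_dlx0, GCD_PROGRAM and LABELS are hypothetical.

```python
def run_dlx0(program, labels, mem, limit=1000):
    """Execute symbolic instructions until HLT or until `limit` steps have run."""
    reg = [0] * 32
    pc = 0
    halted = False
    steps = 0
    while not halted and steps < limit:
        op, *args = program[pc // 4]   # fetch: ROMaddr!PC ; ROMdata?ir
        pc += 4                        # PC := PC+4
        steps += 1
        if op == "LW":
            rd, imm, rs1 = args
            reg[rd] = mem.get(reg[rs1] + imm, 0)
        elif op == "SW":
            imm, rs1, rd = args
            mem[reg[rs1] + imm] = reg[rd]
        elif op == "SUB":
            rd, rs1, rs2 = args
            reg[rd] = reg[rs1] - reg[rs2]
        elif op == "SLT":
            rd, rs1, rs2 = args
            reg[rd] = 1 if reg[rs1] < reg[rs2] else 0
        elif op == "BEQZ":
            rs1, target = args
            if reg[rs1] == 0:
                pc = labels[target]    # assumption: absolute target address
        elif op == "J":
            (target,) = args
            pc = labels[target]
        elif op == "HLT":
            halted = True
        reg[0] = 0                     # R0 always reads as zero
    return mem


GCD_PROGRAM = [
    ("LW", 1, 4, 0),         # pre:  LW R1,4(R0)
    ("LW", 2, 8, 0),         #       LW R2,8(R0)
    ("SUB", 3, 1, 2),        # loop: SUB R3,R1,R2
    ("BEQZ", 3, "exit"),     #       BEQZ R3,exit
    ("SLT", 4, 1, 2),        #       SLT R4,R1,R2
    ("BEQZ", 4, "pos2"),     #       BEQZ R4,pos2
    ("SUB", 2, 2, 1),        # pos1: SUB R2,R2,R1
    ("J", "loop"),           #       J loop
    ("SUB", 1, 1, 2),        # pos2: SUB R1,R1,R2
    ("J", "loop"),           #       J loop
    ("SW", 20, 0, 1),        # exit: SW 20(R0),R1
    ("HLT",),                #       HLT
]
LABELS = {"loop": 8, "pos2": 32, "exit": 40}

if __name__ == "__main__":
    memory = run_dlx0(GCD_PROGRAM, LABELS, {4: 12, 8: 18})
    print(memory[20])  # gcd(12, 18)
```

Keeping each instruction in its own branch mirrors the case statement on the slide, where every alternative calls one procedure such as LW() or SW().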