1
Transforming an implementation into a
cycle-accurate simulator using BDN
  • Murali Vijayaraghavan and Arvind
  • Computer Science and Artificial Intelligence
    Laboratory
  • M.I.T.
  • RAMP Workshop, Austin, TX
  • June 25, 2009

2
IBM/MIT Collaboration (Sept 2007)
  • Motivation: Create an ecosystem to foster and
    promote the use of Power architecture in system
    research
  • Initial Goal: Create a flexible and synthesizable
    multithreaded, multicore PowerPC model that
    facilitates rapid architectural exploration
  • parameterized for the number of threads and the
    number and functionality of pipeline stages
  • Current Goals
  • Cycle-accurate modeling
  • Open source distribution on widely available
    FPGAs by summer 2010

3
The Team
  • Architecture and Bluespec Coding
  • K. Ekanadham, Jessica Tseng
  • MIT: Asif Khan, Murali Vijayaraghavan
  • Linux OS Bring-up Team
  • Hubertus Franke, Jimi Xenidis
  • FPGA Prototyping Team
  • Richard Kaufman, Kai Schleupen
  • Managers
  • Nancy Greco, Pratap Pattnaik
  • MIT: Arvind

4
Results
  • A 64-bit embedded PowerPC was created from
    scratch in Bluespec SystemVerilog (BSV)
  • Implemented on an IBM internal FPGA platform that
    uses a Xilinx Virtex-5 LX330 chip
  • Linux was booted on it by Nov 2008
  • Jessica has ported this design onto Xilinx XUPV5
  • Takes up 92% of the area
  • Running at 20 MHz but probably can be jacked up to
    40 MHz

5
Issues in Prototype RTL to FPGA mapping
  • Some structures consume a disproportionate amount
    of FPGA resources
  • multiported register file
  • CAM
  • multiply, divide
  • Prototype RTL implementations on FPGAs need to
    compensate for external memory timing
  • Lack of tools for mapping on multiple FPGAs

These structures can be implemented in multiple cycles to save
resources
⇒ Cycle-accurate modeling
6
  • Bounded Data Flow Networks (BDNs) as a
    theoretical framework for cycle-accurate modeling of
    synchronous sequential machines (SSMs)
  • Murali Vijayaraghavan and Arvind, MEMOCODE 2009

7
Implementing RTL on FPGAs
[Figure: a 3-read, 2-write register file in the target RTL (ASIC) is simulated on the FPGA using BRAMs in multiple cycles]
In general, functional correctness requires cycle
accuracy
8
BDN as a refinement of an SSM
  • There is a bijective mapping between the inputs
    (outputs) of S and R
  • for all n > 0,
  • I(k) matches for S and R (1 ≤ k ≤ n)
  • ⇒ O(j) matches for S and R (1 ≤ j ≤ n)

This condition is cycle accuracy. For a BDN, I(k)
refers to the kth enqueue in each input FIFO.
9
Patient SSMs: SSMs with a start signal (enable) to
update registers
10
SSM to BDN and refinements
[Figure: an SSM is cut into S1, S2 (big) and S3 (big); the pieces become patient SSMs, the patient SSMs are turned into BDNs, and BDNs are then replaced by refinements]
11
SSM to BDN
  • The translation has to be done such that the
    generated BDN is latency-insensitive, i.e., the
    input-output behavior of the BDN does not change
    if we change the latency of one of its component
    BDNs or the size of the FIFOs connecting the
    components

12
Implementing an SSM as a BDN
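[Figure: an SSM with inputs a, b and outputs c, d, where c is computed by f from a and b and d is taken from b, shown next to the corresponding BDN with a FIFO on each of a, b, c, d]
rule O when (¬a.empty ∧ ¬b.empty ∧ ¬c.full ∧ ¬d.full)
  ⇒ c.enq(f(a.first, b.first)); d.enq(b.first);
    a.deq; b.deq
This description can be easily translated into
logic that serves as a wrapper for the original
logic. The SSM and BDN have the same input-output
behavior.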
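As a rough illustration of the claim above, the following Python sketch models the single rule O with bounded FIFOs and compares its output stream with a cycle-by-cycle SSM. The `Fifo` class, the function `f` (here addition), the FIFO depth and the randomly stalling environment are all assumptions made for the example, not part of the slides.

```python
from collections import deque
import random

class Fifo:
    """Bounded FIFO with the empty/full/enq/deq/first interface used in rule O."""
    def __init__(self, depth):
        self.depth, self.q = depth, deque()
    def empty(self): return len(self.q) == 0
    def full(self): return len(self.q) >= self.depth
    def first(self): return self.q[0]
    def enq(self, x):
        assert not self.full()
        self.q.append(x)
    def deq(self): return self.q.popleft()

def f(x, y):
    # Hypothetical combinational function; the slides leave f unspecified.
    return x + y

def ssm(a_vals, b_vals):
    """Reference SSM: one (c, d) pair per clock cycle."""
    return [(f(a, b), b) for a, b in zip(a_vals, b_vals)]

def bdn(a_vals, b_vals, depth=2, seed=0):
    """BDN wrapper: rule O plus an environment that stalls at random."""
    assert len(a_vals) == len(b_vals)
    rng = random.Random(seed)
    a, b, c, d = (Fifo(depth) for _ in range(4))
    a_in, b_in = deque(a_vals), deque(b_vals)
    c_out, d_out = [], []
    n = len(a_vals)
    while len(c_out) < n or len(d_out) < n:
        # Environment: enqueue inputs and dequeue outputs at arbitrary times.
        if a_in and not a.full() and rng.random() < 0.5: a.enq(a_in.popleft())
        if b_in and not b.full() and rng.random() < 0.5: b.enq(b_in.popleft())
        if not c.empty() and rng.random() < 0.5: c_out.append(c.deq())
        if not d.empty() and rng.random() < 0.5: d_out.append(d.deq())
        # rule O
        if not a.empty() and not b.empty() and not c.full() and not d.full():
            c.enq(f(a.first(), b.first()))
            d.enq(b.first())
            a.deq()
            b.deq()
    return list(zip(c_out, d_out))

a_vals, b_vals = [1, 2, 3, 4, 5], [10, 20, 30, 40, 50]
print(bdn(a_vals, b_vals) == ssm(a_vals, b_vals))
```

Changing `depth` or `seed` changes when each value moves, but the nth value dequeued from c and d should stay the same as the SSM's nth-cycle output.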
13
Deadlocks
[Figure: the same SSM and BDN as on the previous slide, with inputs a, b, function f and outputs c, d]
rule O when (¬a.empty ∧ ¬b.empty ∧ ¬c.full ∧ ¬d.full)
  ⇒ c.enq(f(a.first, b.first)); d.enq(b.first);
    a.deq; b.deq
Extraneous dependencies: d unnecessarily
depends upon a and c
14
Another behavior for the same BDN
[Figure: the same BDN with inputs a, b, function f and outputs c, d, plus two state bits cDone and dDone]
  • rule O1 when (¬a.empty ∧ ¬b.empty ∧ ¬c.full ∧ ¬cDone)
  • ⇒ c.enq(f(a.first, b.first)); cDone <= True
  • rule O2 when (¬b.empty ∧ ¬d.full ∧ ¬dDone)
  • ⇒ d.enq(b.first); dDone <= True
  • rule In when (cDone ∧ dDone)
  • ⇒ a.deq; b.deq; cDone <= False; dDone <= False

No extraneous dependencies; no deadlock
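One way to see the difference between the two rule sets is to place the BDN in an environment where each value of a is derived from the matching value of d. The sketch below is a minimal Python model under several assumptions that are not in the slides: FIFOs are unbounded, `f` is addition, the environment doubles d to produce a, and rules are tried in a fixed order each step.

```python
from collections import deque

def f(x, y):
    # Hypothetical combinational function, chosen only for illustration.
    return x + y

def run(split_rules, b_vals, max_steps=100):
    """Return the values enqueued into c before the network stops making progress."""
    a, b, d = deque(), deque(b_vals), deque()
    c = []                      # infinite sink for output c
    c_done = d_done = False
    for _ in range(max_steps):
        fired = False
        if d:                   # environment: a(n) is derived from d(n)
            a.append(2 * d.popleft())
            fired = True
        if split_rules:
            if a and b and not c_done:          # rule O1
                c.append(f(a[0], b[0])); c_done = True; fired = True
            if b and not d_done:                # rule O2
                d.append(b[0]); d_done = True; fired = True
            if c_done and d_done:               # rule In
                a.popleft(); b.popleft()
                c_done = d_done = False; fired = True
        elif a and b:                           # single rule O
            c.append(f(a[0], b[0])); d.append(b[0])
            a.popleft(); b.popleft(); fired = True
        if not fired:
            break
    return c

print(run(split_rules=False, b_vals=[1, 2, 3]))  # single rule O: no progress
print(run(split_rules=True, b_vals=[1, 2, 3]))   # rules O1, O2, In
```

With the single rule, d is never produced because the rule waits for a, and a never arrives because it is computed from d. With the split rules, O2 can enqueue d using only b, so the loop through the environment makes progress.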
15
Latency-Insensitive BDNs
  • No extraneous dependency property: if output Oi
    is not enqueued n times, assuming it is not full
    and all the inputs are enqueued n-1 times, then
    it must be that one of the inputs in
    Depends-on(Oi) is not enqueued n times
  • Self-cleaning property: if all outputs are
    enqueued n times then all inputs must be dequeued
    n times

LI-BDNs are BDNs that have these properties and do not deadlock
16
Writing an LI-BDN wrapper for an SSM
  • Given the SSM
  • oj(t) = fj(ij1(t), ... , ijIj(t), s(t))
  • // ij1, ij2, ... ijIj are in Depends-on(oj)
  • s(t+1) = g(i1(t), i2(t), ... , s(t))

LI-BDN
rule Oj when (¬donej)
  ⇒ donej <= True; oj.enq(fj(ij1.first, ... , ijIj.first, s))
rule Finish when (done1 ∧ done2 ∧ ...)
  ⇒ done1 <= False; done2 <= False; ...;
    s <= g(i1.first, i2.first, ... , s);
    i1.deq; i2.deq; ...
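The wrapper scheme above can be sketched generically in Python. This is only an illustrative model: the class name `LIBDN`, the unbounded `deque` FIFOs, the explicit "inputs non-empty" checks (which the slide leaves as implicit guards of `first` and `deq`) and the running-sum example SSM are all assumptions.

```python
from collections import deque

class LIBDN:
    """Wrapper for an SSM given as output functions f_j and a next-state function g.

    outputs is a list of (f_j, deps_j) pairs, where deps_j lists the indices
    of the inputs in Depends-on(o_j). FIFOs are unbounded in this sketch.
    """
    def __init__(self, n_inputs, outputs, g, s0):
        self.ins = [deque() for _ in range(n_inputs)]
        self.outs = [deque() for _ in outputs]
        self.outputs, self.g, self.s = outputs, g, s0
        self.done = [False] * len(outputs)

    def step(self):
        fired = False
        for j, (fj, deps) in enumerate(self.outputs):
            # rule Oj: needs only the inputs that o_j depends on
            if not self.done[j] and all(self.ins[i] for i in deps):
                args = [self.ins[i][0] for i in deps]
                self.outs[j].append(fj(*args, self.s))
                self.done[j] = True
                fired = True
        # rule Finish: update state and consume one value from every input
        if all(self.done) and all(self.ins):
            self.s = self.g(*[q[0] for q in self.ins], self.s)
            for q in self.ins:
                q.popleft()
            self.done = [False] * len(self.outputs)
            fired = True
        return fired

    def run(self):
        while self.step():
            pass

# Hypothetical SSM: o1(t) = s(t), o2(t) = i1(t) + s(t), s(t+1) = s(t) + i1(t)
acc = LIBDN(
    n_inputs=1,
    outputs=[(lambda s: s, []), (lambda x, s: x + s, [0])],
    g=lambda x, s: s + x,
    s0=0,
)
acc.run()                       # no input yet: only o1 can fire
early = list(acc.outs[0])
for x in (1, 2, 3):
    acc.ins[0].append(x)
acc.run()
print(early, list(acc.outs[0]), list(acc.outs[1]))
```

Because o1 depends on no input, its first value is available before anything is enqueued, which is the kind of behavior the no-extraneous-dependency property asks for.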

17
The Wrapper Circuit
18
PPC In-order Pipeline
  • The designer specifies the FSM for each stage
  • The FIFOs are latency-insensitive, that is, the
    correctness of the specification does not depend
    upon the depth of FIFOs or the number of stages

19
The steps in Cycle-accurate implementation on
FPGAs
  • The specs are turned into Bluespec code to give a
    target SSM
  • Once the size of FIFOs is fixed the whole design
    has a precise timing specification
  • If the FPGA implementation requires refining some
    stages then cuts are made in the design to
    isolate the stages (SSMs) to be refined
  • Each SSM is turned into a BDN by introducing
    FIFOs for each input and output wire, including
    the wires going in and out of model FIFOs of the
    SSM
  • This converts the nth time cycle of the SSM into
    the nth enqueue into input FIFOs and nth dequeue
    from output FIFOs
  • Atomic rules for the operation of each BDN are
    defined so that no extraneous dependencies are
    introduced
  • This also ensures deadlock-free operation

20
Preliminary results
  • Cycle-accurate refinements onto Xilinx XUPV5
    (Asif and Murali)
  • Slice Logic Utilization
  • Number of Slice Registers: 15448 out of 69120 (22%)
  • Number of Slice LUTs: 16702 out of 69120 (24%)
  • Specific Feature Utilization
  • Number of Block RAM/FIFO: 1 out of 148 (0%) (only 1
    BRAM for the register file)
  • Number of DSP48Es: 12 out of 64 (18%) (these are
    used for the divider)
  • Minimum period: 7.988 ns (Maximum Frequency:
    125.188 MHz)
  • Partially verified by running a 50-instruction
    program
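The utilization figures and the maximum frequency above can be cross-checked with a few lines of Python. The dictionary layout and variable names are just for this check; the numbers are the ones on the slide.

```python
# (used, available) pairs as reported on the slide
resources = {
    "Slice Registers": (15448, 69120),
    "Slice LUTs": (16702, 69120),
    "Block RAM/FIFO": (1, 148),
    "DSP48Es": (12, 64),
}

# Integer (truncated) percentage of each resource that is used
percent = {name: 100 * used // total for name, (used, total) in resources.items()}

# Maximum frequency in MHz is 1000 divided by the minimum period in ns
min_period_ns = 7.988
fmax_mhz = 1000.0 / min_period_ns

print(percent)
print(f"{fmax_mhz:.3f} MHz")
```

The truncated percentages come out as 22, 24, 0 and 18, and the frequency as 125.188 MHz. The DSP48E figure (12 of 64 is 18.75%) suggests the reported values are truncated rather than rounded.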

Compared to Jessica's port onto Xilinx XUPV5:
takes up 92% of the area, 20 MHz → 40 MHz
No numbers yet for actual work done
21
Conclusion
  • Cycle-accurate modeling of processors on FPGAs is
    feasible and offers a three-orders-of-magnitude
    improvement in performance over software
    simulators
  • BDNs offer a way to refine RTL without losing
    cycle-accuracy
  • Bluespec makes quick RTL generation feasible
  • The generation of BDNs can be automated
  • We plan to release our Bluespec designs under
    open source licensing to strengthen the PowerPC
    ecosystem.

22
Related work
  • Luca Carloni et al. for latency-insensitive
    refinements
  • HAsim: Joel Emer, Michael Pellauer, et al. at
    Intel/MIT
  • Cycle-accurate modeling using the A-ports
    abstraction
  • UTFast: Derek Chiou and students at UT Austin
  • speculative functional model, corrected by timing
    model when necessary
  • Protoflex: James Hoe, Eric Chung et al. at CMU
  • RAMP Gold: Krste Asanovic et al. at Berkeley

23
Thanks!
24
BDN Input/Output notation
  • Ii(n) represents the nth value enqueued in input
    buffer Ii
  • I(n) represents the nth values enqueued in all
    input buffers
  • Oj(n) represents the nth value dequeued from
    output buffer Oj
  • O(n) represents the nth values dequeued from all
    output buffers

25
Examples of primitive BDNs: Register
A register whose reads and writes must match
Behavior
rule RO when (¬b.full ∧ ¬bDone)
  ⇒ b.enq(r); bDone <= True
rule RI when (¬a.empty ∧ bDone)
  ⇒ r <= a.first; a.deq; bDone <= False
Initial Values
bDone = False, r = r0
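The register rules can be traced with a short Python sketch. It assumes an unbounded output sink in place of the FIFO b (so the ¬b.full guard is always true) and tries RO before RI in each pass; the function name `register_bdn` is made up for the example.

```python
from collections import deque

def register_bdn(a_vals, r0):
    """Run rules RO and RI until neither can fire; return the values enqueued into b."""
    a, b = deque(a_vals), []    # b is an unbounded sink here
    r, b_done = r0, False
    progress = True
    while progress:
        progress = False
        if not b_done:          # rule RO: read the register
            b.append(r)
            b_done = True
            progress = True
        if a and b_done:        # rule RI: write the register
            r = a.popleft()
            b_done = False
            progress = True
    return b

print(register_bdn([5, 6, 7], r0=0))
```

Each write is accepted only after the matching read has been issued, so the nth value read is the value written in step n-1, with r0 as the first read.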
26
Examples of primitive BDNs: Mux
A mux that accepts an input value on each input
port but passes only the appropriate value to the
output
Behavior
rule MuxO when ¬c.full ∧ ¬p.empty
  ⇒ if (p.first ∧ ¬a.empty) then
      c.enq(a.first); a.deq; bCnt <= bCnt + 1
    else if (!(p.first) ∧ ¬b.empty) then
      c.enq(b.first); b.deq; aCnt <= aCnt + 1
rule MuxI1 when aCnt > 0 ∧ ¬a.empty
  ⇒ a.deq; aCnt <= aCnt - 1
rule MuxI2 when bCnt > 0 ∧ ¬b.empty
  ⇒ b.deq; bCnt <= bCnt - 1
Initial values
aCnt = 0, bCnt = 0
27
Composition of BDNs
  • If R1 and R2 are BDNs then so is the parallel
    composition of R1 and R2 (R = R1 ⊕ R2)
  • If R1 is a BDN then so is the (Ii, Oj) iterative
    composition of R1 (R = (i,j) ⊗ R1) provided Ii ∉
    Depends-on(Oj)


The condition Ii ∉ Depends-on(Oj) means there is no direct combinational path from Ii to Oj
28
Deadlock-free BDN
  • Assuming an infinite sink, a BDN is deadlock-free
    if for all n > 0, if n values are enqueued into I
    then eventually n values will be dequeued from
    both O and I
  • We need a stronger property for deadlock-freeness
    to be preserved under composition