DART: A Programmable Architecture for NoC Simulation on FPGAs

1
DART: A Programmable Architecture for NoC Simulation on FPGAs
  • Danyao Wang, Natalie Enright Jerger, J. Gregory Steffan

Department of Electrical & Computer Engineering, University of Toronto
Google Inc.
2
Why yet another NoC simulator?
  • Software simulators
  • Stand-alone or integrated
  • Parallel NoC simulator (DARSIM)
  • FPGA-based Models
  • Direct map NoC emulators (Genko et al., NoCem)
  • Dynamic reconfiguration (DRNoC)
  • Decoupled timing and functional model (RAMPGold,
    ProtoFlex, A-Ports)
  • Analytical models (FIST)

3
Why yet another NoC simulator?
Requirement          | Software Simulation
---------------------|--------------------------
Accurate             | Possible
Fast to run          | <10 KIPS to 100s of KIPS
Easy to implement    | Yes
Easy to use & modify | Yes
Available early      | Yes

@ 100 KIPS: 1 s of execution @ 1 GHz = 10K sec ≈ 2.8 hrs
The benefit of thread-based parallelization is limited by high synchronization overhead
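
The 2.8-hour figure follows from simple arithmetic. A minimal sketch, assuming one simulated instruction per target clock cycle (the slide does not state this, it is an assumption made here); the helper name is hypothetical:

```python
def simulation_wall_time_s(target_seconds, target_hz, sim_ips):
    """Wall-clock seconds needed to simulate `target_seconds` of execution.

    Assumes one instruction per target clock cycle (illustrative assumption).
    """
    instructions = target_seconds * target_hz
    return instructions / sim_ips

# 1 s of execution on a 1 GHz target, simulated at 100 KIPS
secs = simulation_wall_time_s(1.0, 1e9, 100e3)
hours = secs / 3600
print(f"{secs:.0f} s = {hours:.1f} h")
```

At the 10s of MIPS quoted for FPGA-based approaches, the same helper gives a wall time three orders of magnitude shorter than at 10s of KIPS.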
4
Why yet another NoC simulator?
Requirement          | Software Simulation      | FPGA-based Emulators
---------------------|--------------------------|----------------------
Accurate             | Possible                 | Possible
Fast to run          | <10 KIPS to 100s of KIPS | 10s to 100s of MIPS
Easy to implement    | Yes                      | No
Easy to use & modify | Yes                      | No
Available early      | Yes                      | Yes

Orders of magnitude faster!
Hardware changes: hours of synthesis-place-route time
5
DART: Hybrid Approach
[Figure: a PC sends configuration and commands over a UART to the Control FSM and DART Simulator on the FPGA; simulation results are returned to the PC]
  • Generic NoC simulation engine
  • Fixed function nodes for basic NoC building
    blocks
  • Router, traffic generator, link
  • Software configurable parameters in each node

Simulate different NoCs without changing hardware
6
Why yet another NoC simulator?
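Requirement          | Software Simulation      | FPGA-based Emulators | DART
---------------------|--------------------------|----------------------|-------------
Accurate             | Possible                 | Possible             | Yes
Fast to run          | <10 KIPS to 100s of KIPS | 10s to 100s of MIPS  | 10s of MIPS
Easy to implement    | Yes                      | No                   | No
Easy to use & modify | Yes                      | No                   | Yes
Available early      | Yes                      | Yes                  | Yes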
7
DART Simulator Architecture
8
Generic NoC Model
  • Topology
  • Routing algorithm
  • Flow control
  • Router microarchitecture
  • Simulated traffic
  • Link properties

[Figure: generic NoC model with Traffic Generator, Router and Flit Queue nodes and a Global Interconnect]
9
DART Architecture
Synchronize all network transfers to a global
time counter
10
DART Nodes
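Node             | Parameters                                                                | Statistics Counters
-----------------|---------------------------------------------------------------------------|------------------------------------------------------------------------
TrafficGenerator | Traffic pattern; injection intervals; packet size (# of flits)            | # of injected packets; # of received packets; cumulative packet latency
Flit Queue       | Latency (flit cycles); bandwidth (flits / cycle)                          | More can be added easily
Routers          | Routing table; input buffer sizes (credits); pipeline delay (flit cycles) | More can be added easily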
  • Parameters implemented using a shift register
  • Configuration byte stream generated on the PC and
    sent to the FPGA
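
As an illustration of how a PC-side tool might build such a configuration byte stream and how a shift register would receive it, here is a minimal sketch. The field widths, field order and class name are hypothetical; the slides do not give DART's actual parameter layout.

```python
def pack_config(fields):
    """Concatenate (value, width) fields MSB-first and pad to whole bytes."""
    bits = 0
    nbits = 0
    for value, width in fields:
        if not 0 <= value < (1 << width):
            raise ValueError(f"{value} does not fit in {width} bits")
        bits = (bits << width) | value
        nbits += width
    pad = (-nbits) % 8          # zero-pad the tail to a byte boundary
    bits <<= pad
    return bits.to_bytes((nbits + pad) // 8, "big")


class ShiftRegister:
    """Models a parameter register loaded one bit per clock (hypothetical)."""

    def __init__(self, width):
        self.width = width
        self.value = 0

    def shift_in(self, bit):
        mask = (1 << self.width) - 1
        self.value = ((self.value << 1) | bit) & mask


# Hypothetical traffic-generator parameters:
# pattern id (4 bits), injection interval (8 bits), packet size in flits (4 bits)
fields = [(2, 4), (10, 8), (2, 4)]
stream = pack_config(fields)

reg = ShiftRegister(16)
for byte in stream:
    for i in range(7, -1, -1):      # most significant bit first
        reg.shift_in((byte >> i) & 1)

print(stream.hex(), hex(reg.value))
```

Because every parameter lives in one serial chain, changing the simulated NoC only requires sending a new byte stream, not resynthesizing the hardware.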

11
Simulating a NoC
  1. Map simulated NoC to DART nodes
  2. Program the routing tables to implement the
    simulated topology
  3. Record timing of flit transfers
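
Step 2 can be sketched for a mesh with XY routing (the configuration used in the correctness experiment later in the deck). This is an illustrative sketch only: the port numbering, the row-major node numbering and the function names are assumptions, not DART's actual encoding.

```python
LOCAL, EAST, WEST, NORTH, SOUTH = range(5)  # hypothetical port numbering


def xy_routing_table(router, width, height):
    """Output port for every destination, routing X first and then Y.

    Nodes are numbered row-major; y grows toward SOUTH (assumption).
    """
    rx, ry = router % width, router // width
    table = []
    for dest in range(width * height):
        dx, dy = dest % width, dest // width
        if dx > rx:
            port = EAST
        elif dx < rx:
            port = WEST
        elif dy > ry:
            port = SOUTH
        elif dy < ry:
            port = NORTH
        else:
            port = LOCAL
        table.append(port)
    return table


def next_router(router, port, width):
    """Neighbor reached through `port` in a row-major mesh."""
    step = {EAST: 1, WEST: -1, NORTH: -width, SOUTH: width}
    return router + step[port]


def trace(tables, src, dst, width):
    """Follow the programmed tables hop by hop from src to dst."""
    path = [src]
    while tables[path[-1]][dst] != LOCAL:
        path.append(next_router(path[-1], tables[path[-1]][dst], width))
    return path


tables = [xy_routing_table(r, 3, 3) for r in range(9)]
print(trace(tables, 0, 8, 3))
```

Programming a different topology amounts to writing different tables into the same routers, which is what lets the hardware stay fixed.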

12-25
Example Walk-Through
[Animated figure, slides 12 to 25: an example network with labels 0 to 7, built from Traffic Generator, Router and Flit Queue nodes attached to the Global Interconnect. A counter advances from 0 to 7 as the animation steps through a flit transfer; the last slide shows the Global Timer.]
Counters at the end of the walk-through: injected 1; received 1, Σlatency 6
26
DART Router
  • Virtualizes the ports → replace crossbar with MUX
  • No large switch allocators and crossbars
  • Routes 1 flit per DART cycle
  • N cycles for N ports
  • Input ports selected based on timestamp

Multiplexing in time saves area
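
A minimal software sketch of this time-multiplexed router, assuming each input port is a FIFO of flit descriptors carrying a timestamp and a destination. The data layout, the tie-break rule (lowest port index) and the function name are assumptions for illustration, not the hardware design.

```python
from collections import deque


def route_one_flit(input_queues, routing_table):
    """Serve one flit per call (one DART cycle in this model).

    Picks the input port whose head flit has the smallest timestamp;
    ties go to the lowest port index (assumption).
    Returns (output_port, flit), or None if every queue is empty.
    """
    candidates = [(q[0]["ts"], port) for port, q in enumerate(input_queues) if q]
    if not candidates:
        return None
    _, port = min(candidates)
    flit = input_queues[port].popleft()
    return routing_table[flit["dest"]], flit


# Three input ports; port 2 is idle.
inputs = [
    deque([{"ts": 5, "dest": 1}]),
    deque([{"ts": 3, "dest": 0}, {"ts": 9, "dest": 1}]),
    deque(),
]
routing_table = {0: 2, 1: 0}   # destination -> output port (made-up values)

served = []
while True:
    result = route_one_flit(inputs, routing_table)
    if result is None:
        break
    out_port, flit = result
    served.append((flit["ts"], out_port))

dart_cycles = len(served)      # one flit routed per DART cycle
print(served, dart_cycles)
```

A single selection step replaces the per-port allocators and crossbar, at the cost of spending one DART cycle per routed flit.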
27
DART Summary
  • Configurable functional model of an NoC
  • Easy to modify and reuse
  • Fast by exploiting fine-grained parallelism
  • Decouples simulated cycles from FPGA cycles
  • Trade simulation speed for area and
    programmability
  • Software configurable parameters
  • Familiar simulation flow and fast turn-around time

28
Evaluation Results
  • Overhead
  • Architecture Scalability
  • Implementation Performance

29
Methodology
  • C++ cycle-accurate architecture simulator
  • Explore various DART architectures
  • Evaluate performance trade-offs
  • 9-node implementation on a Virtex-II Pro FPGA
  • Baseline: Booksim 2.0
  • Cycle-based software simulator (C++)
  • Metrics
  • Overhead: DART cycles per simulated cycle (CPS)
  • Performance: thousands of simulated cycles per second
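
The two metrics are linked through the FPGA clock: simulated cycles per second equals the DART clock frequency divided by CPS. A small sketch, ignoring configuration and host-communication time; the CPS value below is made up for illustration and is not a measured result.

```python
def simulated_kcycles_per_second(fpga_hz, cps):
    """Thousands of simulated cycles per second for a given overhead.

    fpga_hz: DART clock frequency in Hz
    cps:     DART cycles spent per simulated cycle
    """
    return fpga_hz / cps / 1e3


# 50 MHz DART clock (the frequency quoted in the deck), hypothetical CPS of 10
perf = simulated_kcycles_per_second(50e6, 10.0)
print(perf)
```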

30-33
Programmability Overhead
  • Measure performance overhead of global
    interconnect and simplified Router model
  • Four combinations of two options
  • Interconnect: dedicated vs. global
  • Router: 5-port vs. 1-port
  • Baseline: dedicated + 5-port
  • Benchmarks: 9-node mesh and 64-node mesh

34
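Overhead: 9-node DART
[Chart: overhead for the simulated 9-node DART; an arrow marks the direction of lower overhead]
35
Overhead: 64-node DART
[Chart: overhead for the simulated 64-node DART; an arrow marks the direction of lower overhead]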
36
Scalability
  • Compare DART's performance scaling to Booksim
    beyond 9 nodes
  • 64-node DART with 8-partition global
    interconnect
  • Benchmarks: mesh sizes from 9 to 64 nodes
  • DART performance extrapolated from architecture
    simulator assuming 50 MHz clock

37
Scalability: Mesh Benchmarks
[Chart: Booksim vs. 64-node DART across mesh benchmarks; an arrow marks the "faster" direction]
DART simulation speed depends on network load only
Higher speedups over Booksim for large NoCs
38
An Implementation of DART
  • 9 Nodes (max. that fit)
  • 8-partition interconnect
  • 50 MHz

Component        | Utilization (LUTs)
-----------------|-------------------
Router (x9)      | 612
TrafficGen (x9)  | 691
FlitQueue (x36)  | 305
Interconnect     | 2,144
Control FSM      | 152
Total            | 26,385 (96%)

XUPV2P Development Board, Virtex-II Pro XC2VP30
39
Real Speed-up vs. Booksim
[Chart: DART speedup relative to Booksim; an arrow marks the "faster" direction]
Large NoC simulations can become more interactive
40
Future Work
  • Virtualize DART nodes using multithreading
  • Further trade performance for area
  • Off-chip traffic generation
  • Integrate with full-system evaluation framework
  • Better coverage of the router design space
  • Adaptive routing, speculative routing, etc.
  • Investigate specialized soft processors

41
Summary
  • A software-configurable FPGA-based NoC simulator is
    feasible
  • Area overhead vs. existing emulators is
    negligible
  • Over 100x speedup over software NoC simulator
    (Booksim)
  • Hardware and software tools available at
    http://www.eecg.toronto.edu/DART

42
Q & A
  • Thank you!

43
Backup Slides
  • Classic Router Microarchitecture
  • Global Interconnect
  • DART Software Flow
  • Correctness Analysis
  • Interconnect Performance vs. Resource Utilization
  • DART vs. Booksim Speedup

44
Classic Router Microarchitecture
45
Global Interconnect
46
DART Software
  • DARTgen
  • Placement of simulated nodes in DART partitions
  • Evenly distribute nodes across partitions to
    balance load
  • Generate configuration bytes
  • DARTportal
  • Communicates with the DART simulator on FPGA
    through serial port
  • Interactive
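
One simple way to spread simulated nodes evenly over interconnect partitions is round-robin assignment, sketched below. This is an assumed placement policy for illustration; the slides do not say which algorithm DARTgen uses, and the function name is hypothetical.

```python
def place_nodes(num_nodes, num_partitions):
    """Assign simulated nodes to partitions round-robin.

    Partition sizes differ by at most one node.
    """
    partitions = [[] for _ in range(num_partitions)]
    for node in range(num_nodes):
        partitions[node % num_partitions].append(node)
    return partitions


# 9 simulated nodes on an 8-partition interconnect
placement = place_nodes(9, 8)
sizes = [len(p) for p in placement]
print(placement, sizes)
```

A load-aware tool could instead weight nodes by expected traffic; round-robin only balances node counts.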

47
Correctness (1/2)
Parameter           | Value
--------------------|-------------------
Topology            | 3 x 3 mesh
Router architecture | Input queued
Routing algorithm   | XY
# of VCs per port   | 2
VC allocation       | Round-robin
Traffic pattern     | Random permutation
Packet size         | 2 flits

  • booksim: 5-cycle routing delay
  • booksim2: 4-cycle routing delay + 1-cycle switch
    allocation delay

48
Correctness (2/2)
Booksim has longer tail
49
Interconnect Scalability (1/2)
Flit injection rate 0.1
Flit injection rate 0.5
50
Interconnect Scalability (2/2)
51
DART vs. Booksim Speedup
Better speedup for larger NoCs
52
Related Work (1/2)
  • FPGA-based processor simulation
  • RAMPGold (Tan et al., DAC 2010)
  • ProtoFlex (Chung et al., IPDPS 2007)
  • A-Ports (Pellauer et al., FPGA 2008)
  • Direct NoC emulation
  • Genko et al., DATE 2005
  • NoCem (Schelle and Grunwald, WARFP 2006)

53
Related Work (2/2)
  • DRNoC: exploits dynamic reconfiguration of Xilinx
    FPGAs (Krasteva et al., Reconfig. 2008)
  • Virtualized simulation (Wolkotte et al., NoCS
    2007)
  • DARSIM: parallel software NoC simulator (Lis et
    al., MoBS 2010)

54
Software Simulators
  • Modular design (typically in an OO language)
  • Stand-alone or integrated
  • Pros
  • Easy to implement new models
  • Fast to develop and debug
  • As detailed and accurate as desired
  • Cons
  • Simulating large NoCs in detail can be slow
  • <10 KIPS to 100s of KIPS
  • Parallelizing using threads is non-trivial
  • High synchronization overhead

@ 100 KIPS: 1 s of execution @ 1 GHz = 10K sec ≈ 2.8 hrs
55
FPGA-based Models
  • FPGAs have become big enough
  • Map entire NoC to FPGA
  • Pros
  • Faster than software simulation (10s to 100s
    MIPS)
  • Lots of parallelism
  • Low-overhead synchronization
  • Cons
  • Emulators can't be reused to evaluate different
    NoCs
  • Redesign is difficult and time-consuming
  • Max simulatable NoC size limited by FPGA size

56
DART: Configurable Simulator on FPGA
  • Emulators can't be reused to evaluate different
    NoCs
  • A generic NoC simulation model that is decoupled
    from the architecture of a specific NoC
  • Redesign is difficult and time-consuming
  • Software configurable, no hardware redesign
    needed
  • Max simulatable NoC size limited by FPGA size
  • Optimize simulator architecture for area by
    trading off some speed

Fixed framework, configurable settings, still
fast!
57
Architecture Evaluation Methods
Requirement     | Software Simulation      | FPGA Prototypes | FPGA-based Emulators | DART
----------------|--------------------------|-----------------|----------------------|-------------
Accurate        | Possible                 | Very            | Possible             | Yes
Fast to run     | <10 KIPS to 100s of KIPS | 100s of MIPS    | 10s to 100s of MIPS  | 10s of MIPS
Easy to build   | Yes                      | No              | No                   | No
Easy to modify  | Yes                      | No              | No                   | Yes
Available early | Yes                      | No              | Yes                  | Yes

KIPS: Thousands of Instructions per Second
MIPS: Millions of Instructions per Second
58
DART Simulator Model (cont'd)
  • Descriptors without data payload
  • Flits: 36 bits
  • Credits: 12 bits
  • 10-bit timestamp
  • Correctly captures latency up to 1024 cycles
  • Scale up to 256 nodes, 8 ports/node, 4 VCs/port
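
The 1024-cycle limit is what a 10-bit wrapping timestamp gives under modular subtraction. The sketch below uses that standard technique as an assumption (the slides do not show how DART computes latency) and also derives the identifier widths implied by the scaling limits; how those fields are arranged inside the 36-bit flit descriptor is not stated in the deck.

```python
TS_BITS = 10
TS_MOD = 1 << TS_BITS          # 1024 distinct timestamp values


def latency(inject_ts, arrive_ts):
    """Latency from two wrapped 10-bit timestamps (modular subtraction).

    Exact as long as the true latency is below 1024 cycles.
    """
    return (arrive_ts - inject_ts) % TS_MOD


def id_bits(count):
    """Bits needed to number `count` items starting from 0."""
    return (count - 1).bit_length()


# Injected at global time 1000, received at 1030: the counter wrapped to 6.
wrapped_ok = latency(1000 % TS_MOD, 1030 % TS_MOD)

# A true latency of 1030 cycles aliases to 6 and is reported incorrectly.
aliased = latency(0 % TS_MOD, 1030 % TS_MOD)

widths = (id_bits(256), id_bits(8), id_bits(4))   # node, port, VC
print(wrapped_ok, aliased, widths)
```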

59
NoC Basics
  • Topology
  • Routing algorithm
  • Flow Control
  • Router microarchitecture

60
Motivation
  • Multi-core is here to stay
  • Communication is performance bottleneck
  • Network-on-Chip (NoC) advantages
  • Higher bandwidth
  • More efficient sharing of on-chip resources
  • Easier to build, verify, fabricate
  • Need high quality evaluation tools

Intel SCC: 48 cores, mesh NoC
Cell Processor: 8 SPEs, ring NoC
61
The Ideal Simulator
  • Accurate
  • Fast
  • Easy to implement, use and modify
  • Available early in the design process

Existing tools don't offer all four properties