Title: DART: A Programmable Architecture for NoC Simulation on FPGAs
1DART: A Programmable Architecture for NoC Simulation on FPGAs
- Danyao Wang, Natalie Enright Jerger, J. Gregory Steffan
Department of Electrical & Computer Engineering, University of Toronto
Google Inc.
2Why yet another NoC simulator?
- Software simulators
- Stand-alone or integrated
- Parallel NoC simulator (DARSIM)
- FPGA-based Models
- Direct map NoC emulators (Genko et al., NoCem)
- Dynamic reconfiguration (DRNoC)
- Decoupled timing and functional model (RAMPGold, ProtoFlex, A-Ports)
- Analytical models (FIST)
3Why yet another NoC simulator?
Requirement | Software Simulation
Accurate | Possible
Fast to run | < 10 KIPS to 100s of KIPS
Easy to implement | Yes
Easy to use and modify | Yes
Available early | Yes
At 100 KIPS, 1 s of execution at 1 GHz takes 10K sec ≈ 2.8 hrs
The benefit of thread-based parallelization is limited by high synchronization overhead
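The 2.8-hour figure follows from simple arithmetic. A minimal sketch, assuming one simulated instruction per target clock cycle (an assumption made here only to relate the 1 GHz clock to the KIPS rate; the function name is illustrative):

```python
def simulation_seconds(target_hz: float, target_seconds: float, sim_ips: float) -> float:
    """Wall-clock seconds needed to simulate `target_seconds` of execution.

    Assumes, for illustration only, one instruction per target clock cycle.
    """
    instructions = target_hz * target_seconds
    return instructions / sim_ips


wall = simulation_seconds(target_hz=1e9, target_seconds=1.0, sim_ips=100e3)
print(f"{wall:.0f} s = {wall / 3600:.1f} h")  # 10000 s = 2.8 h
```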
4Why yet another NoC simulator?
Requirement | Software Simulation | FPGA-based Emulators
Accurate | Possible | Possible
Fast to run | < 10 KIPS to 100s of KIPS | 10s to 100s MIPS
Easy to implement | Yes | No
Easy to use and modify | Yes | No
Available early | Yes | Yes
Orders of magnitude faster!
Hardware changes: hours of synthesis, place, and route time
5DART: Hybrid Approach
[Figure: a PC connected over UART to the Control FSM and DART Simulator on the FPGA. The PC sends configuration and commands; simulation results come back.]
- Generic NoC simulation engine
- Fixed function nodes for basic NoC building blocks
- Router, traffic generator, link
- Software configurable parameters in each node
Simulate different NoCs without changing hardware
6Why yet another NoC simulator?
Requirement | Software Simulation | FPGA-based Emulators | DART
Accurate | Possible | Possible | Yes
Fast to run | < 10 KIPS to 100s of KIPS | 10s to 100s MIPS | 10s MIPS
Easy to implement | Yes | No | No
Easy to use and modify | Yes | No | Yes
Available early | Yes | Yes | Yes
7DART Simulator Architecture
8Generic NoC Model
Global interconnect
- Topology
- Routing algorithm
- Flow control
- Router microarchitecture
- Simulated traffic
- Link properties
Router
Traffic Generator
Flit Queue
9DART Architecture
Synchronize all network transfers to a global
time counter
10DART Nodes
Node | Parameters | Statistics Counter
Traffic Generator | Traffic pattern; injection intervals; packet size (# of flits) | # of injected packets; # of received packets; cumulative packet latency
Flit Queue | Latency (flit cycles); bandwidth (flits / cycle) | More can be added easily
Routers | Routing table; input buffer sizes (credits); pipeline delay (flit cycles) | More can be added easily
- Parameters implemented using a shift register
- Configuration byte stream generated on the PC and
sent to the FPGA
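A rough software model of this configuration path: parameters are packed into a byte stream on the PC, then shifted bit by bit into a node's parameter register. The field names, field widths, and bit order below are hypothetical and do not reflect DART's actual layout.

```python
from typing import Dict, List, Tuple

# Hypothetical parameter fields as (name, width in bits); not DART's real layout.
ROUTER_FIELDS: List[Tuple[str, int]] = [
    ("pipeline_delay", 4),
    ("input_buffer_credits", 4),
]


def pack_bits(values: Dict[str, int], fields: List[Tuple[str, int]]) -> List[int]:
    """Flatten parameter values into a list of bits, MSB first per field."""
    bits: List[int] = []
    for name, width in fields:
        value = values[name]
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        bits.extend((value >> i) & 1 for i in reversed(range(width)))
    return bits


def to_bytes(bits: List[int]) -> bytes:
    """Pad the bit list to a whole number of bytes and pack it, MSB first."""
    padded = bits + [0] * (-len(bits) % 8)
    return bytes(
        int("".join(map(str, padded[i:i + 8])), 2)
        for i in range(0, len(padded), 8)
    )


class ShiftRegister:
    """Software model of a serial-in parameter register."""

    def __init__(self, length: int):
        self.bits = [0] * length

    def shift_in(self, bit: int) -> None:
        # The new bit enters at the end; the oldest bit falls off the front.
        self.bits = self.bits[1:] + [bit]


stream = to_bytes(pack_bits({"pipeline_delay": 3, "input_buffer_credits": 8}, ROUTER_FIELDS))
reg = ShiftRegister(8)
for byte in stream:
    for i in reversed(range(8)):
        reg.shift_in((byte >> i) & 1)
print(stream.hex(), reg.bits)
```

In this model a node's parameters cost only a register chain and a serial input, which is one plausible reason to prefer a shift register over an addressable configuration memory.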
11Simulating a NoC
- Map simulated NoC to DART nodes
- Program the routing tables to implement the simulated topology
- Record timing of flit transfers
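As an illustration of programming routing tables to realize a topology, the sketch below builds per-node tables for a 3 x 3 mesh with XY routing (the configuration listed on the Correctness backup slide). The row-major node numbering and the port names are assumptions for illustration, not DART's encoding.

```python
from typing import Dict


def xy_routing_table(node: int, width: int, height: int) -> Dict[int, str]:
    """Map each destination node to an output port using XY routing.

    Nodes are numbered row-major; port names are illustrative.
    """
    x, y = node % width, node // width
    table: Dict[int, str] = {}
    for dest in range(width * height):
        dx, dy = dest % width, dest // width
        if dx > x:
            port = "EAST"    # correct the X coordinate first
        elif dx < x:
            port = "WEST"
        elif dy > y:
            port = "SOUTH"   # then correct the Y coordinate
        elif dy < y:
            port = "NORTH"
        else:
            port = "LOCAL"   # destination reached
        table[dest] = port
    return table


tables = {n: xy_routing_table(n, 3, 3) for n in range(9)}
print(tables[4])
```

Changing the simulated topology then amounts to generating a different set of tables, with no change to the hardware.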
12Example Walk-Through
[Figure: simulated NoC with eight nodes, labeled 0 to 7.]
13Example Walk-Through
[Figure: nodes 0 to 7, with the Traffic Generator, Router, Flit Queues, and Global Interconnect labeled.]
14Example Walk-Through (animation, slides 14 to 24)
[Figure: nodes 0 to 7 attached to the Global Interconnect; a second counter steps from 0 up to 7 across the slides.]
25Example Walk-Through
[Figure: nodes 0 to 7 attached to the Global Interconnect, with the Global Timer shown for cycles 0 to 6. Statistics counters shown: "injected = 1" (twice) and "received = 1, Σlatency = 6" (twice).]
26DART Router
- Virtualizes the ports → replace crossbar with MUX
- No large switch allocators and crossbars
- Routes 1 flit per DART cycle
- N cycles for N ports
- Input ports selected based on timestamp
Multiplexing in time saves area
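The time-multiplexed router can be pictured with a small software model: each DART cycle, one flit is taken from the input port whose head flit carries the earliest timestamp. This is a sketch under stated assumptions. The class name, the flit tuple layout, and the lowest-port tie-break are hypothetical, and timestamp wraparound is ignored.

```python
from collections import deque
from typing import Deque, List, Optional, Tuple

Flit = Tuple[int, str]  # (timestamp, label); layout is illustrative


class TimeMultiplexedRouter:
    """Serves one flit per DART cycle from the port with the oldest head flit."""

    def __init__(self, num_ports: int):
        self.inputs: List[Deque[Flit]] = [deque() for _ in range(num_ports)]

    def step(self) -> Optional[Flit]:
        # Look only at the head of each non-empty input queue.
        candidates = [(q[0][0], port) for port, q in enumerate(self.inputs) if q]
        if not candidates:
            return None
        _, port = min(candidates)  # earliest timestamp; ties go to the lowest port
        return self.inputs[port].popleft()


router = TimeMultiplexedRouter(num_ports=5)
router.inputs[0].append((3, "a"))
router.inputs[2].append((1, "b"))
router.inputs[4].append((1, "c"))
order = [router.step() for _ in range(4)]
print(order)  # [(1, 'b'), (1, 'c'), (3, 'a'), None]
```

Serving N ports this way takes N DART cycles where a crossbar would take one, which is the speed-for-area trade the slide describes.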
27DART Summary
- Configurable functional model of an NoC
- Easy to modify and reuse
- Fast by exploiting fine grained parallelism
- Decouple simulated cycle from FPGA cycles
- Trade simulation speed for area and programmability
- Software configurable parameters
- Familiar simulation flow and fast turn-around time
28Evaluation Results
- Overhead
- Architecture Scalability
- Implementation Performance
29Methodology
- C++ cycle-accurate architecture simulator
- Explore various DART architectures
- Evaluate performance trade-offs
- 9-node implementation on a Virtex-II Pro FPGA
- Baseline: Booksim 2.0
- Cycle-based software simulator (C++)
- Metrics
- Overhead: DART cycles per simulated cycle (CPS)
- Performance: thousands of simulated cycles per second
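The two metrics are linked by the FPGA clock rate. A minimal sketch, assuming one DART cycle per FPGA clock cycle and the 50 MHz clock used elsewhere in these slides; the CPS values in the loop are hypothetical, not measured results.

```python
def simulated_cycles_per_second(fpga_hz: float, cps: float) -> float:
    """Simulated NoC cycles per wall-clock second at a given overhead (CPS)."""
    if cps <= 0:
        raise ValueError("CPS must be positive")
    return fpga_hz / cps


# Hypothetical CPS values, for illustration only; not measured results.
for cps in (1, 5, 20):
    kcps = simulated_cycles_per_second(50e6, cps) / 1e3
    print(f"CPS={cps:>2}: {kcps:,.0f} thousand simulated cycles/s")
```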
33Programmability Overhead
- Measure performance overhead of global interconnect and simplified Router model
- Four combinations of two options
- Interconnect: dedicated vs. global
- Router: 5-port vs. 1-port
- Baseline: dedicated + 5-port
- Benchmarks: 9-node mesh and 64-node mesh
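34Overhead: 9-node DART
[Figure: overhead of the simulated 9-node DART, annotated with the direction of lower overhead.]
35Overhead: 64-node DART
[Figure: overhead of the simulated 64-node DART, annotated with the direction of lower overhead.]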
36Scalability
- Compare DART's performance scaling to Booksim beyond 9 nodes
- 64-node DART with 8-partition global interconnect
- Benchmarks: mesh sizes from 9 to 64
- DART performance extrapolated from architecture simulator assuming 50 MHz clock
37Scalability: Mesh Benchmarks
[Figure: simulation speed of Booksim vs. the 64-node DART on the mesh benchmarks, annotated with the "faster" direction.]
DART simulation speed depends on network load only
Higher speedups over Booksim for large NoCs
38An Implementation of DART
- 9 Nodes (max. that fit)
- 8-partition interconnect
- 50 MHz
Component | Utilization (LUTs)
Router (x9) | 612
TrafficGen (x9) | 691
FlitQueue (x36) | 305
Interconnect | 2,144
Control FSM | 152
Total | 26,385 (96%)
XUPV2P Development Board, Virtex-II Pro XC2VP30
39Real Speed-up vs. Booksim
[Figure: DART speedup relative to Booksim, annotated with the "faster" direction.]
Large NoC simulations can become more interactive
40Future Work
- Virtualize DART nodes using multithreading
- Further trade performance for area
- Off-chip traffic generation
- Integrate with full-system evaluation framework
- Better coverage of the router design space
- Adaptive routing, speculative routing, etc.
- Investigate specialized soft processors
41Summary
- Software configurable FPGA-based NoC simulator is feasible
- Area overhead vs. existing emulators is negligible
- Over 100x speedup over software NoC simulator (Booksim)
- Hardware and software tools available at http://www.eecg.toronto.edu/DART
42Q & A
43Backup Slides
- Classic Router Microarchitecture
- Global Interconnect
- DART Software Flow
- Correctness Analysis
- Interconnect Performance vs. Resource Utilization
- DART vs. Booksim Speedup
44Classic Router Microarchitecture
45Global Interconnect
46DART Software
- DARTgen
- Placement of simulated nodes in DART partitions
- Evenly distribute nodes across partitions to balance load
- Generate configuration bytes
- DARTportal
- Communicates with the DART simulator on FPGA through serial port
- Interactive
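One simple way to spread simulated nodes evenly across partitions, as the DARTgen placement step calls for, is round-robin assignment. This is a stand-in for illustration and not necessarily the algorithm DARTgen uses; the function name and return shape are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def place_nodes(num_nodes: int, num_partitions: int) -> Dict[int, List[int]]:
    """Assign simulated nodes to partitions round-robin.

    Partition sizes differ by at most one node.
    """
    placement: Dict[int, List[int]] = defaultdict(list)
    for node in range(num_nodes):
        placement[node % num_partitions].append(node)
    return dict(placement)


placement = place_nodes(num_nodes=9, num_partitions=8)
sizes = [len(nodes) for nodes in placement.values()]
print(placement)
print(max(sizes) - min(sizes))
```

Balancing node counts is only a proxy for balancing load; a placement that accounts for traffic between nodes could do better.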
47Correctness (1/2)
Topology | 3 x 3 mesh
Router architecture | Input queued
Routing algorithm | XY
# of VCs per port | 2
VC allocation | Round-robin
Traffic pattern | Random permutation
Packet size | 2 flits
- booksim: 5-cycle routing delay
- booksim2: 4-cycle routing delay, 1-cycle switch allocation delay
48Correctness (2/2)
Booksim has longer tail
49Interconnect Scalability (1/2)
Flit injection rate 0.1
Flit injection rate 0.5
50Interconnect Scalability (2/2)
51DART vs. Booksim Speedup
Better speedup for larger NoCs
52Related Work (1/2)
- FPGA-based processor simulation
- RAMPGold: Tan et al., DAC 2010.
- ProtoFlex: Chung et al., IPDPS 2007.
- A-Ports: Pellauer et al., FPGA 2008.
- Direct NoC emulation
- Genko et al., DATE 2005.
- NoCem: Schelle and Grunwald, WARFP 2006.
53Related Work (2/2)
- DRNoC: exploits dynamic reconfiguration of Xilinx FPGAs. Krasteva et al., Reconfig. 2008.
- Virtualized simulation: Wolkotte et al., NoCS 2007.
- DARSIM: parallel software NoC simulator. Lis et al., MoBS 2010.
54Software Simulators
- Modular design (typically in an OO language)
- Stand-alone or integrated
- Pros
- Easy to implement new models
- Fast to develop and debug
- As detailed and accurate as desired
- Cons
- Simulating large NoCs in detail can be slow
- < 10 KIPS to 100s of KIPS
- Parallelizing using threads is non-trivial
- High synchronization overhead
At 100 KIPS, 1 s of execution at 1 GHz takes 10K sec ≈ 2.8 hrs
55FPGA-based Models
- FPGAs have become big enough
- Map entire NoC to FPGA
- Pros
- Faster than software simulation (10s to 100s MIPS)
- Lots of parallelism
- Low-overhead synchronization
- Cons
- Emulators can't be reused to evaluate different NoCs
- Redesign is difficult and time-consuming
- Max simulatable NoC size limited by FPGA size
56DART: Configurable Simulator on FPGA
- Emulators can't be reused to evaluate different NoCs
- A generic NoC simulation model that is decoupled from the architecture of a specific NoC
- Redesign is difficult and time-consuming
- Software configurable, no hardware redesign needed
- Max simulatable NoC size limited by FPGA size
- Optimize simulator architecture for area by trading off some speed
Fixed framework, configurable settings, still fast!
57Architecture Evaluation Methods
Requirement | Software Simulation | FPGA Prototypes | FPGA-based Emulators | DART
Accurate | Possible | Very | Possible | Yes
Fast to run | < 10 KIPS to 100s of KIPS | 100s MIPS | 10s to 100s MIPS | 10s MIPS
Easy to build | Yes | No | No | No
Easy to modify | Yes | No | No | Yes
Available early | Yes | No | Yes | Yes
KIPS: thousands of instructions per second. MIPS: millions of instructions per second.
58DART Simulator Model (cont'd)
- Descriptors without data payload
- Flits: 36 bits
- Credits: 12 bits
- 10-bit timestamp
- Correctly captures latency up to 1024 cycles
- Scale up to 256 nodes, 8 ports/node, 4 VCs/port
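A fixed-width timestamp bounds the latency that can be measured. The sketch below assumes latency is recovered as a modular difference of two 10-bit timestamps (an assumption about the mechanism, not taken from the slides); the function names are illustrative.

```python
TIMESTAMP_BITS = 10
TIMESTAMP_MOD = 1 << TIMESTAMP_BITS  # 1024


def stamp(cycle: int) -> int:
    """Truncate a global cycle count to a 10-bit timestamp."""
    return cycle % TIMESTAMP_MOD


def latency(sent: int, received: int) -> int:
    """Latency from two 10-bit timestamps.

    Exact only while the true latency stays below 1024 cycles.
    """
    return (received - sent) % TIMESTAMP_MOD


# The 10-bit counter wraps between these two cycles, yet the difference is right.
print(latency(stamp(1000), stamp(1030)))  # 30
# Under this modular scheme a latency of exactly 1024 cycles aliases to 0.
print(latency(stamp(0), stamp(1024)))     # 0
```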
59NoC Basics
- Topology
- Routing algorithm
- Flow Control
- Router microarchitecture
60Motivation
- Multi-core is here to stay
- Communication is performance bottleneck
- Network-on-Chip (NoC) advantages
- Higher bandwidth
- More efficient sharing of on-chip resources
- Easier to build, verify, fabricate
- Need high quality evaluation tools
Intel SCC: 48 cores, mesh NoC
Cell Processor: 8 SPEs, ring NoC
61The Ideal Simulator
- Accurate
- Fast
- Easy to implement, use and modify
- Available early in the design process
Existing tools don't offer all four properties