Implementation Analysis of NoC: A MPSoC Trace-Driven Approach

1
Implementation Analysis of NoC: A MPSoC Trace-Driven Approach
  • Sergio Tota¹, Mario R. Casu¹, Luca Macchiarulo²

¹ Politecnico di Torino
² University of Hawaii
2
Outline
  • Motivations
  • Network-on-Chip Paradigm
  • Definitions and Terminology
  • NoC Topologies and Routing Strategies
  • Switch Design
  • Trace-Based Emulation Experiments
  • Results
  • Conclusions

3
Motivations
  • The number of processors on the same die
    increases at each technology node (MPSoC)
  • This trend creates the need for a scalable
    communication infrastructure
  • Shared buses are not a long-term solution
  • On-chip micronetworks better suit the demand for
    scalability and performance

4
Network-on-Chip (NoC)
  • On-chip networks inherit some of the features of
    computer networks
  • New constraints emerge for the on-chip
    implementation (e.g. area, power)
  • The NoC characteristics depend on the choice of
    topology and routing strategy
  • Regular on-chip networks facilitate modular
    design and improve performance

5
Definitions and Terminology
  • Flit: The elementary unit of information
    exchanged in the communication network in a clock
    cycle.
  • Packet: An element of information that a
    processing element (PE) sends to another PE. A
    packet may consist of a variable number of
    flits.
  • Switch: The component of the network that is
    in charge of flit routing.

6
Definitions and Terminology (cont'd)
  • Flit Latency: The time needed for a flit to
    reach its target PE from its source PE.
  • Packet Latency: The time needed for a packet to
    reach its target PE from its source PE.
  • Packet Spread: The time from the reception of
    the first flit of a packet to the reception of
    the last one.
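
The three metrics above can be sketched in a few lines of Python. This is only an illustration: the `FlitRecord` type and its timestamp fields are hypothetical names, and measuring packet latency from the injection of the first flit to the reception of the last one is an assumption, since the slide does not say which flits mark the start and end.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FlitRecord:
    """One flit of a packet; timestamps are in clock cycles (hypothetical fields)."""
    injected_at: int  # cycle in which the flit left its source PE
    received_at: int  # cycle in which the flit reached its target PE


def flit_latency(flit: FlitRecord) -> int:
    """Time for a single flit to travel from source PE to target PE."""
    return flit.received_at - flit.injected_at


def packet_latency(flits: List[FlitRecord]) -> int:
    """Assumption: from injection of the first flit to reception of the last."""
    return max(f.received_at for f in flits) - min(f.injected_at for f in flits)


def packet_spread(flits: List[FlitRecord]) -> int:
    """Time between reception of the first flit and reception of the last."""
    received = [f.received_at for f in flits]
    return max(received) - min(received)


# A made-up three-flit packet whose last flit is delayed in the network.
packet = [FlitRecord(0, 5), FlitRecord(1, 6), FlitRecord(2, 9)]
```

With this made-up packet, the first flit has a latency of 5 cycles, the packet latency is 9 cycles and the spread is 4 cycles.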

7
Network Topology
[Figure: 4 x 4 mesh with nodes numbered 1 to 16, and its physical implementation]
8
Network Topology (cont'd)
[Figure: 4 x 4 torus with nodes numbered 1 to 16, and its physical implementation]
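
The structural difference between the mesh and the torus can be sketched as a neighbour function. The 0-based (x, y) coordinates and the function names are assumptions made for illustration; the slides label nodes 1 to 16 instead.

```python
def mesh_neighbors(x, y, n):
    """Neighbours of node (x, y) in an n x n mesh: edge nodes have fewer links."""
    candidates = [(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)]
    return [(cx, cy) for cx, cy in candidates if 0 <= cx < n and 0 <= cy < n]


def torus_neighbors(x, y, n):
    """Neighbours of node (x, y) in an n x n torus: links wrap around the edges."""
    return [((x + 1) % n, y), ((x - 1) % n, y), (x, (y + 1) % n), (x, (y - 1) % n)]
```

In a 4 x 4 network, the corner node (0, 0) has two neighbours in the mesh and four in the torus, because the wrap-around links connect it to (3, 0) and (0, 3).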
9
Routing
  • What is important to us:
  • Minimum latency is of paramount importance in
    MP-SoC (interprocess communication)
  • Ideally 1 clock cycle of latency per switch (a flit
    enters at time t and exits at t+1)
  • Maximum switch clock frequency (technology
    and routing logic limits)
  • Deadlock freedom
  • No flits are ever lost: once a flit is injected
    into the NoC, it will eventually reach its
    destination

10
Routing Strategy: Wormhole
  • In wormhole routing, a header flit digs the
    hole
  • Successive flits are routed in the same
    direction
  • To handle blocking in a lossless NoC we need:
  • Buffers
  • A backpressure mechanism (unless you have
    infinite FIFOs)
  • We assume X-Y static routing (first X, then Y;
    proven deadlock free)
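
The X-Y static routing rule (first X, then Y) can be sketched as follows. This is an illustrative sketch, not the authors' switch logic: the port names, the 0-based (x, y) coordinates and the choice of NORTH as increasing y are all assumptions.

```python
def xy_next_port(cur, dst):
    """Output port chosen at switch `cur` for a flit headed to `dst`."""
    (cx, cy), (dx, dy) = cur, dst
    if dx > cx:
        return "EAST"   # X offset is resolved first
    if dx < cx:
        return "WEST"
    if dy > cy:
        return "NORTH"  # Y is considered only once X matches
    if dy < cy:
        return "SOUTH"
    return "LOCAL"      # the flit has arrived: deliver it to the PE


def xy_path(src, dst):
    """Sequence of switches visited from `src` to `dst` under X-Y routing."""
    step = {"EAST": (1, 0), "WEST": (-1, 0), "NORTH": (0, 1), "SOUTH": (0, -1)}
    path = [src]
    cur = src
    port = xy_next_port(cur, dst)
    while port != "LOCAL":
        ox, oy = step[port]
        cur = (cur[0] + ox, cur[1] + oy)
        path.append(cur)
        port = xy_next_port(cur, dst)
    return path
```

For example, `xy_path((0, 0), (2, 1))` visits (0, 0), (1, 0), (2, 0) and then (2, 1): the route never turns from Y back to X.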

11-22
Worm-Hole
[Animation over 12 slides: a packet of five flits (HF, F2, F3, F4, TF) travels from Src to Dest]
23
Routing Strategy: Deflection Routing
  • Every flit can be routed in a different direction
    (no packet notion at the switch level)
  • If the optimal direction is blocked, the flit is
    deflected to another direction
  • Switch latency of 1 clock cycle regardless of
    congestion
  • Minimal buffer requirements
  • Also known as Hot Potato; deadlock free by
    construction
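
The deflection idea above can be sketched as a per-cycle port assignment. This is a minimal sketch under stated assumptions, not the switch presented in these slides: it models an interior mesh switch with four output ports, at most four incoming flits, none of them addressed to the local PE, and it resolves contention by list order. The names `PORTS`, `preferred_ports` and `assign_ports` are hypothetical.

```python
PORTS = ["EAST", "WEST", "NORTH", "SOUTH"]


def preferred_ports(cur, dst):
    """Ports that move a flit closer to `dst` (mesh distances, no wrap-around)."""
    (cx, cy), (dx, dy) = cur, dst
    ports = []
    if dx > cx:
        ports.append("EAST")
    if dx < cx:
        ports.append("WEST")
    if dy > cy:
        ports.append("NORTH")
    if dy < cy:
        ports.append("SOUTH")
    return ports


def assign_ports(cur, destinations):
    """Give every incoming flit an output port in the same cycle.

    `destinations` lists the target of each incoming flit in priority order
    (assumption: earlier entries win contention). No flit is ever buffered.
    """
    free = list(PORTS)
    assignment = []
    for dst in destinations:
        wanted = [p for p in preferred_ports(cur, dst) if p in free]
        # Deflect to any free port when every productive port is taken.
        port = wanted[0] if wanted else free[0]
        free.remove(port)
        assignment.append(port)
    return assignment
```

At switch (1, 1), `assign_ports((1, 1), [(3, 1), (2, 1), (1, 0)])` sends the first flit EAST, deflects the second (which also wanted EAST) to WEST, and sends the third SOUTH. Every flit leaves in the same cycle, which is why the switch latency does not depend on congestion.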

24-33
Hot-Potato
[Animation over 10 slides: a packet of four flits (HF, F2, F3, TF) travels from Src to Dest]
34
Routing Techniques
Wormhole
  + No packet reordering
  - Static routing
  - Buffering (≥ 2 flits/port)
  - Back pressure
  - XY routing needs a mesh
Hot-Potato
  - Packet reordering
  + Adaptive routing
  + No buffering
  + No back pressure
  + Works with torus and mesh
35
Switch logic scheme
36
Physical Implementation (IBM CMOS 0.13 µm @ 500 MHz), 160 Gbit/s
                     Deflection-Routing   Wormhole 10²    Wormhole 2¹
Area @ 64 bits       0.038 mm²            0.273 mm²       0.086 mm²
Area @ 128 bits      0.07 mm²             0.502 mm²       0.140 mm²
Area @ 256 bits      0.14 mm²             0.910 mm²       0.234 mm²
Power @ 64 bits      29 µW/MHz            190 µW/MHz      54 µW/MHz
Power @ 128 bits     52 µW/MHz            380 µW/MHz      92 µW/MHz
Power @ 256 bits     102 µW/MHz           655 µW/MHz      171 µW/MHz
¹ Two buffers per port
² Ten buffers per port
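
The figures above lend themselves to a quick worked calculation. The sketch below assumes that each group of three values is listed in the column order Deflection-Routing, Wormhole 10, Wormhole 2, and that the per-MHz power figures scale linearly with the 500 MHz clock; the dictionary keys `HP`, `WH10` and `WH2` are shorthand introduced here.

```python
CLOCK_MHZ = 500

# Switch area in mm², keyed by data width in bits.
AREA_MM2 = {
    64: {"HP": 0.038, "WH10": 0.273, "WH2": 0.086},
    128: {"HP": 0.07, "WH10": 0.502, "WH2": 0.140},
    256: {"HP": 0.14, "WH10": 0.910, "WH2": 0.234},
}

# Switch power in µW/MHz, keyed by data width in bits.
POWER_UW_PER_MHZ = {
    64: {"HP": 29, "WH10": 190, "WH2": 54},
    128: {"HP": 52, "WH10": 380, "WH2": 92},
    256: {"HP": 102, "WH10": 655, "WH2": 171},
}


def power_mw(width, switch):
    """Power in mW at CLOCK_MHZ, assuming linear scaling with frequency."""
    return POWER_UW_PER_MHZ[width][switch] * CLOCK_MHZ / 1000


def area_ratio(width, switch, baseline="HP"):
    """Area of `switch` relative to the deflection-routing switch."""
    return AREA_MM2[width][switch] / AREA_MM2[width][baseline]
```

Under these assumptions, the 64-bit deflection-routing switch draws 14.5 mW at 500 MHz against 27 mW for the two-buffer wormhole switch, and the two-buffer wormhole switch occupies roughly 2.3 times the area.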
37
Emulation environment
38
Emulation Environment
  • Statistically generated traffic is not realistic
  • Stanford SPLASH-2 multiprocessor benchmarks
    (radix, lu, fft, ocean, raytrace)
  • RSIM cycle-accurate simulator (UIUC) to extract
    traffic traces between processors
  • Simulation in VHDL using behavioural traffic
    generators and RTL NoC implementations
  • 1 dual-Opteron and 5 dual-Xeon servers with 64-bit
    Linux, ModelSim, Synopsys and Encounter

39
Experiments parameters
  • NoC size: N x N, with N = 2 and N = 4
  • Packet size: 10 flits
  • Wormhole buffer size: 2 flits/port (minimum
    allowed value)
  • Standard simulation: PE computation time 1 X
  • Accelerated simulation: PE computation time 20
    X, same NoC clock frequency (worse traffic
    conditions)
  • WH = wormhole, HP = hot potato

40
Results
Mesh, uniform traffic, ideal latency: Mean Lat. (2/3)N, Peak Lat. 2N-2
Torus, uniform traffic, ideal latency: Mean Lat. (1/2)N, Peak Lat. N
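
The mean and peak expressions above can be checked by brute force over all source/destination pairs. The sketch assumes that latency is counted in hops between switches (one clock per hop) and that traffic is uniform over node pairs; whether a node may send to itself is not stated in the slides, so the helper takes it as a parameter.

```python
from itertools import product


def mesh_hops(a, b, n):
    """Hop count between nodes a and b in an n x n mesh."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def torus_hops(a, b, n):
    """Hop count in an n x n torus, taking the shorter way around each ring."""
    dx, dy = abs(a[0] - b[0]), abs(a[1] - b[1])
    return min(dx, n - dx) + min(dy, n - dy)


def hop_stats(hops, n, include_self):
    """Return (mean, peak) hop count over all ordered node pairs."""
    nodes = list(product(range(n), repeat=2))
    distances = [
        hops(a, b, n)
        for a in nodes
        for b in nodes
        if include_self or a != b
    ]
    return sum(distances) / len(distances), max(distances)
```

For N = 4, the mesh mean equals (2/3)N when self-pairs are excluded, with a peak of 2N-2 = 6. The torus mean equals (1/2)N when self-pairs are included, with a peak of N = 4. The expressions therefore hold exactly under slightly different counting conventions, and approximately otherwise.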
41
Results
Ideal packet latency, packet size 10
42
Comments
  • Peak flit latency occurs sporadically (peak >>
    average)
  • WH and HP show comparable latency (both peak and
    average)
  • HP packet mean latency: the 10 flits arrive
    in order (with few exceptions)
  • No substantial differences between standard and
    accelerated simulations
  • Benchmark execution time overhead due to NoC
    congestion negligible
  • Overall, WH is neither better nor worse than HP

43
Conclusions
  • RTL NoC based on Worm-Hole and Hot Potato
  • Switch design and synthesis in 0.13 µm CMOS
  • 500 MHz clock frequency
  • Real MP-SoC trace simulations
  • WH and HP show similar performance, but HP needs
    less area and power
  • The strength of HP is expected to emerge under
    higher load: benchmarks that generate more
    traffic are needed
  • Ongoing work:
  • New RTL design working at 700 MHz
  • Reconstruction interface for HP implemented