
1
Prediction Router
Yet another low-latency on-chip router
architecture
  • Hiroki Matsutani (Keio Univ., Japan)
  • Michihiro Koibuchi (NII, Japan)
  • Hideharu Amano (Keio Univ., Japan)
  • Tsutomu Yoshinaga (UEC, Japan)

2
Why is a low-latency router needed?
  • Tile architecture
  • Many cores (e.g., processors and caches)
  • On-chip interconnection network [Dally, DAC'01]

[Figure: 16-core tile architecture; each tile pairs a core with a router, and the routers form a packet-switched network]
The on-chip router affects the performance and cost of the chip
3
Why is a low-latency router needed?
Low-latency router architectures have been extensively studied
4
Outline: Prediction router for low-latency NoC
  • Existing low-latency routers
    • Speculative router
    • Look-ahead router
    • Bypassing router
  • Prediction router
    • Architecture and the prediction algorithms
    • Hit rate analysis
  • Evaluations
    • Hit rate, gate count, and energy consumption
    • Case study 1: 2-D mesh (small core size)
    • Case study 2: 2-D mesh (large core size)
    • Case study 3: Fat tree network

5
Wormhole router: Hardware structure
[Figure: wormhole router with five input ports (X+, X-, Y+, Y-, CORE), each with a FIFO, an arbiter, and a 5x5 crossbar to five output ports (X+, X-, Y+, Y-, CORE)]
Routing, arbitration, and switch traversal are performed in a pipelined manner
6
Pipeline structure: 3-cycle router
  • At least 3 cycles to traverse a router
    • RC (Routing computation)
    • VSA (Virtual channel and switch allocations)
    • ST (Switch traversal)
  • Example: a packet transfer from router (a) to router (c)

VA and SA are speculatively performed in parallel
[Figure: pipeline timing chart for HEAD and DATA 1 to DATA 3 passing through routers A, B, and C; the HEAD flit takes RC, VSA, and ST at each router, and each data flit takes SA and ST; x-axis: elapsed time [cycles], 1 to 12]
To perform RC and VSA in parallel, look-ahead routing is used
At least 12 cycles to transfer a packet from router (a) to router (c)
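The 12-cycle figure can be checked with simple arithmetic. This is a minimal sketch, assuming zero load (no contention), one cycle per pipeline stage, and no separate link cycles, as in the timing chart; the function name `zero_load_cycles` is hypothetical and not from the talk.

```python
def zero_load_cycles(routers: int, stages_per_router: int, data_flits: int) -> int:
    """Cycles for a whole packet to cross `routers` routers with no contention.

    The head flit spends `stages_per_router` cycles in every router; each
    following data flit streams out one cycle behind the previous flit.
    """
    head_cycles = routers * stages_per_router
    return head_cycles + data_flits


# 3-stage router (RC, VSA, ST), routers A to C, HEAD plus 3 data flits
print(zero_load_cycles(routers=3, stages_per_router=3, data_flits=3))  # 12
```

Under the same assumptions, calling it with `stages_per_router=2` gives the 9 cycles quoted for the 2-cycle look-ahead router.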
7
Look-ahead router: RC/VA in parallel
  • At least 3 cycles to traverse a router
    • NRC (Next routing computation)
    • VSA (Virtual channel and switch allocations)
    • ST (Switch traversal)

VSA can be performed without waiting for NRC
Routing computation for the next hop → the output port of router (i+1) is selected by router i
[Figure: pipeline timing chart for HEAD and DATA 1 to DATA 3 passing through routers A, B, and C; the HEAD flit takes NRC, VSA, and ST at each router, and each data flit takes SA and ST; x-axis: elapsed time [cycles], 1 to 12]
8
Look-ahead router: RC/VA in parallel
  • At least 2 cycles to traverse a router
    • NRC + VSA (Next routing computation / arbitrations)
    • ST (Switch traversal)

No dependency between NRC and VSA → NRC and VSA are performed in parallel [Dally's book, 2004]
[Figure: pipeline timing chart for HEAD and DATA 1 to DATA 3 passing through routers A, B, and C; the HEAD flit takes NRC and VSA together, then ST, at each router; x-axis: elapsed time [cycles], 1 to 9]
Typical example of a 2-cycle router
Packing NRC, VSA, and ST into a single stage → operating frequency is harmed
At least 9 cycles to transfer a packet from router (a) to router (c)
10
Bypassing router: skip some stages
  • Bypassing between intermediate nodes
    • E.g., Express VCs [Kumar, ISCA'07]
  • Pipeline bypassing utilizing the regularity of DOR
    • E.g., Mad postman [Izu, PDP'94]
  • Pipeline stages on frequently used paths are skipped
    • E.g., Dynamic fast path [Park, HOTI'07]
  • Pipeline stages on user-specific paths are skipped
    • E.g., Preferred path [Michelogiannakis, NOCS'07]
    • E.g., DBP [Koibuchi, NOCS'08]

[Figure: path from SRC to DST through five routers, each taking 3 cycles]
We propose a low-latency router based on multiple predictors
12
Prediction router for 1-cycle transfer [Yoshinaga, IWIA'06] [Yoshinaga, IWIA'07]
  • Each input channel has predictors
  • When an input channel is idle,
    • Predict an output port to be used (RC pre-execution)
    • Arbitrate to use the predicted port (SA pre-execution)

RC and VSA are skipped if the prediction hits → 1-cycle transfer
[Figure: pipeline timing chart for HEAD and DATA 1 to DATA 3 passing through routers A, B, and C, before any prediction result is applied; the HEAD flit takes RC, VSA, and ST at each router; x-axis: elapsed time [cycles], 1 to 12]
E.g., we can expect a 1.6-cycle transfer per router on average if 70% of predictions hit
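The 1.6-cycle figure is a weighted average of the hit and miss cases. A minimal sketch, assuming a hit costs 1 cycle and a miss costs the original 3 cycles (the values given in the talk); the function name `expected_cycles_per_hop` is hypothetical.

```python
def expected_cycles_per_hop(hit_rate: float,
                            hit_cycles: int = 1,
                            miss_cycles: int = 3) -> float:
    """Average router traversal time for a given prediction hit rate."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * hit_cycles + (1.0 - hit_rate) * miss_cycles


# 70% of predictions hit: 0.7 * 1 + 0.3 * 3
print(round(expected_cycles_per_hop(0.7), 2))  # 1.6
```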
13
Prediction router for 1-cycle transfer
[Figure: timing chart with the prediction at router A marked MISS; the HEAD flit takes RC, VSA, and ST there; routers B and C are not yet resolved]
14
Prediction router for 1-cycle transfer
[Figure: timing chart with one prediction marked MISS and one marked HIT; the HEAD flit needs only ST at the router where the prediction hits; router C is not yet resolved]
15
Prediction router for 1-cycle transfer
[Figure: timing chart with one MISS and two HITs; the HEAD flit takes RC, VSA, and ST at the router that missed and only ST at the two routers that hit]
16
Prediction router: Prediction algorithms [Yoshinaga, IWIA'06] [Yoshinaga, IWIA'07]
  • An efficient predictor is key
  • Prediction router
    • Multiple predictors for each input channel
    • Select one of them in response to a given network environment

A single predictor isn't enough for applications with different traffic patterns
17
Basic operation @ Correct prediction
2nd cycle: the next flit is transferred to X+ without RC and VSA
1-cycle transfer using the reserved crossbar port when the prediction hits
18
Basic operation @ Misprediction
2nd/3rd cycle: the dead flit is removed, and the flit is retransmitted to the correct port
More energy is consumed for the retransmission
Even on a misprediction, a flit is transferred in 3 cycles, as in the original router
20
Prediction hit rate analysis
  • Formulas to calculate the prediction hit rates on
    • 2-D torus (Random, LP, SS, FCM, and SPM)
    • 2-D mesh (Random, LP, SS, FCM, and SPM)
    • Fat tree (Random and LRU)
  • To forecast which prediction algorithm is suited for a given network environment without simulations
  • Accuracy of the analytical model is confirmed through simulations

Derivation of the formulas is omitted in this talk (see Section 4 of our paper for more detail)
22
Evaluation items
[Figure: evaluation flow. Flit-level network simulation gives hit rate and communication latency (how many cycles?); Design Compiler (synthesis), Astro (place and route), NC-Verilog (simulation), and Power Compiler, with the Fujitsu 65nm library and SDF/SAIF files, give area (gate count) and energy consumption [pJ / bit]]
Table 1: Router and network parameters
Table 2: Process library
Table 3: CAD tools used
Topology and traffic are mentioned later
23
3 case studies of prediction router
[Figure: evaluation flow repeated from the previous slide]
  • Case studies 1 & 2: 2-D mesh network
    • The most popular network topology
      • MIT's RAW [Taylor, ISCA'04]
      • Intel's 80-core [Vangal, ISSCC'07]
    • Dimension-order routing (XY routing)
    • → Here, we show the results of case studies 1 and 2 together
  • Case study 3: Fat tree network

24
Case study 1: Zero-load comm. latency
  • Original router
  • Pred router (SS)
  • Pred router (100% hit)

Uniform random traffic on 4x4 to 16x16 meshes
(*) 1-cycle transfer for a correct prediction, 3-cycle transfer for a wrong prediction
→ Simulation results (the analytical model also shows the same result)
[Figure: communication latency [cycles] vs. network size (k-ary 2-mesh)]
The latency reduction grows as the network size increases (48% for k = 16)
27
Case study 2: Hit rate @ 8x8 mesh
  • SS: go straight (efficient for long straight comm.)
  • LP: the last one (efficient for short repeated comm.)
  • FCM: frequently used pattern (all-rounder!)

[Figure: prediction hit rate for 7 NAS parallel benchmark programs and 4 synthesized traffics]
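The three predictors can be sketched in software to make the one-line descriptions concrete. This is only an illustration under stated assumptions: the class and method names are hypothetical, port names follow the X+/X-/Y+/Y-/CORE labels of the router figure, and the frequency counter is a simplified stand-in for FCM (it counts how often each output port was used rather than matching patterns), not the hardware design from the talk.

```python
from collections import Counter
from typing import Optional

# A flit entering on the X- side is travelling in the +X direction, so
# "straight" means leaving through X+ (and likewise for the other ports).
OPPOSITE = {"X+": "X-", "X-": "X+", "Y+": "Y-", "Y-": "Y+"}


class InputChannelPredictors:
    """Per-input-channel predictors: SS, LP, and a simplified frequency count."""

    def __init__(self, input_port: str) -> None:
        self.input_port = input_port
        self.last_output: Optional[str] = None
        self.usage: Counter = Counter()

    def predict_ss(self) -> Optional[str]:
        # SS: go straight. No prediction for the CORE input (assumption).
        return OPPOSITE.get(self.input_port)

    def predict_lp(self) -> Optional[str]:
        # LP: the output port used by the previous packet on this channel.
        return self.last_output

    def predict_freq(self) -> Optional[str]:
        # Simplified stand-in for FCM: the most frequently used output port.
        if not self.usage:
            return None
        return self.usage.most_common(1)[0][0]

    def record(self, actual_output: str) -> None:
        # Update predictor state once the real routing result is known.
        self.last_output = actual_output
        self.usage[actual_output] += 1


channel = InputChannelPredictors("X-")
for out in ["X+", "X+", "Y+"]:
    channel.record(out)
print(channel.predict_ss(), channel.predict_lp(), channel.predict_freq())  # X+ Y+ X+
```

After two straight transfers and one turn, SS and the frequency counter still predict X+, while LP follows the most recent turn to Y+.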
28
Case study 2: Area & Energy
  • Area (gate count)
    • Original router
    • Pred router (SS + LP)
    • Pred router (SS + LP + FCM)
  • Energy consumption

Light-weight (small overhead)
Verilog-HDL designs, synthesized with a 65nm library
[Figure: router area [kilo gates]]
Area increased by 6.4% to 15.9%, depending on the type and number of predictors
29
Case study 2: Area & Energy
  • Area (gate count)
    • Original router
    • Pred router (SS + LP)
    • Pred router (SS + LP + FCM)
  • Energy consumption
    • Original router
    • Pred router (70% hit)
    • Pred router (100% hit)
  • This estimation is pessimistic.
    • More energy is consumed in links → the effect of the router energy overhead is reduced
    • The application will finish earlier → more energy is saved

[Figure: router area [kilo gates] and flit switching energy [pJ / bit]]
Area increased by 6.4% to 15.9%, depending on the type and number of predictors
Mispredictions consume power: energy increased by 9.5% if the hit rate is 70%
Latency reduced by 35.8% to 48.2% with reasonable area/energy overheads
30
3 case studies of prediction router
  • Case studies 1 & 2: 2-D mesh network
  • Case study 3: Fat tree network
31
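Case study 3: Fat tree network
[Figure: fat tree with upward (Up) and downward (Down) transfers]
  1. LRU algorithm: the LRU (least recently used) output port is selected for upward transfer
  2. LRU + LP algorithm: in addition, LP is used for downward transfer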
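The LRU choice for upward transfers can be sketched as follows. How the router tracks recency is not stated in the slides, so this is only an illustration: the class name `LruUpPortPredictor` and the port names `U0`, `U1` are hypothetical.

```python
from collections import OrderedDict
from typing import Iterable


class LruUpPortPredictor:
    """Predicts the least recently used upward output port."""

    def __init__(self, up_ports: Iterable[str]) -> None:
        # Oldest entry first: the front of the dict is the LRU port.
        self.order = OrderedDict((port, None) for port in up_ports)

    def predict(self) -> str:
        # The port that has gone unused for the longest time.
        return next(iter(self.order))

    def record(self, used_port: str) -> None:
        # The port just used becomes the most recently used one.
        self.order.move_to_end(used_port)


predictor = LruUpPortPredictor(["U0", "U1"])
first = predictor.predict()    # "U0": nothing used yet, so the first port
predictor.record(first)
print(first, predictor.predict())  # U0 U1
```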
32
Case study 3: Fat tree network
  • Comm. latency @ uniform
    • Original router
    • Pred router (LRU)
    • Pred router (LRU + LP)

[Figure: communication latency [cycles] vs. network size (# of cores); fat tree with upward (Up) and downward (Down) transfers]
  1. LRU algorithm: the LRU (least recently used) output port is selected for upward transfer
  2. LRU + LP algorithm: in addition, LP is used for downward transfer
Latency reduced by 30.7% @ 256 cores; small area overhead (7.8%)
33
Summary of the prediction router
  • Prediction router for low-latency NoCs
    • Multiple predictors, which can be switched in a cycle
    • Architecture and six prediction algorithms
    • Analytical model of prediction hit rates
  • Evaluations of prediction router
    • Case study 1: 2-D mesh (small core size)
    • Case study 2: 2-D mesh (large core size)
    • Case study 3: Fat tree network
  • Results (from three case studies)
    1. Prediction router can be applied to various NoCs
    2. Communication latency reduced with small overheads
    3. Prediction router with multiple predictors can accelerate a wider range of applications

34
Thank you for your attention
It would be very helpful if you would speak
slowly. Thank you in advance.
35
Prediction router: New modifications
  • Predictors for each input channel
  • Kill mechanism to remove dead flits
  • Two-level arbiter
    • "Reservation" → higher priority
    • "Tentative reservation" by the pre-execution of VSA

[Figure: router datapath with input ports X+, X-, Y+, Y-, and CORE, a FIFO, the arbiter, KILL signals, and a 5x5 XBAR to output ports X+, X-, Y+, Y-, and CORE]
Currently, the critical path is related to the arbiter
36
Prediction router: Predictor selection
  • Static scheme
    • A predictor is selected by the user per application
    • Simple, but pre-analysis is needed
  • Dynamic scheme
    • A predictor is adaptively selected
    • Flexible, but consumes more energy

[Figure: predictors A, B, and C with a configuration table, and predictors A, B, and C with hit counters]
Count up each predictor's counter when that predictor hits
A predictor is selected every n cycles (e.g., n = 10,000)
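The dynamic scheme can be sketched as hit counters plus a periodic selection step. This is a minimal sketch under assumptions that the slides do not state: counters are reset after each selection, ties go to the predictor listed first, and the class name `DynamicSelector` and its methods are hypothetical.

```python
from typing import Dict, List, Optional


class DynamicSelector:
    """Picks the predictor with the most hits every `interval` cycles."""

    def __init__(self, names: List[str], interval: int = 10_000) -> None:
        self.hits: Dict[str, int] = {name: 0 for name in names}
        self.interval = interval
        self.cycle = 0
        self.active = names[0]  # initial choice is an assumption

    def observe(self, predictions: Dict[str, Optional[str]], actual: str) -> None:
        # Count up every predictor whose guess matched the real output port.
        for name, guess in predictions.items():
            if guess == actual:
                self.hits[name] += 1

    def tick(self) -> str:
        # Advance one cycle; re-select at the end of each interval.
        self.cycle += 1
        if self.cycle % self.interval == 0:
            self.active = max(self.hits, key=self.hits.get)
            self.hits = dict.fromkeys(self.hits, 0)  # assumed reset
        return self.active


selector = DynamicSelector(["SS", "LP", "FCM"], interval=4)
selector.observe({"SS": "X+", "LP": "Y+", "FCM": "X+"}, actual="Y+")
selector.observe({"SS": "X+", "LP": "Y+", "FCM": "Y+"}, actual="Y+")
for _ in range(4):
    active = selector.tick()
print(active)  # LP
```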
37
Case study 1: Router critical path
  • RC: Routing comp.
  • VSA: Arbitration
  • ST: Switch traversal

[Figure: stage delay [FO4s] for the original router and the pred router (SS)]
ST can occur in these stages of the prediction router (marked in the figure)
Critical path delay increased by 6.2% compared with the original router
38
Case study 2: Hit rate @ 8x8 mesh
  • SS: go straight (efficient for long straight comm.)
  • LP: the last one (efficient for short repeated comm.)
  • FCM: frequently used pattern (all-rounder!)
  • Custom: user-specific path (efficient for simple comm.)

[Figure: prediction hit rate for 7 NAS parallel benchmark programs and 4 synthesized traffics]
40
Case study 4: Spidergon network
  • Spidergon topology [Coppola, ISSOC'04]
    • Ring + across links
    • Each router has 3 ports
    • Mesh-like 2-D layout
    • Across-first routing
  • Hit rate @ Uniform
    • SS: Go straight
    • LP: Last used one
    • FCM: Frequently used one

[Figure: prediction hit rate vs. network size (# of cores)]
Hit rates of SS and FCM are almost the same
A high hit rate is achieved (80% for 64 cores; 94% for 256 cores)
41
4 case studies of prediction router
  • Case studies 1 & 2: 2-D mesh network
  • Case study 3: Fat tree network
  • Case study 4: Spidergon network