Title: Reducing NoC Energy Consumption Through Compiler-Directed Channel Voltage Scaling
1Reducing NoC Energy Consumption Through
Compiler-Directed Channel Voltage Scaling
- Guangyu Chen, Feihui Li, Mahmut Kandemir, Mary Jane Irwin
- Microsystems Design Lab, Department of CSE
- The Pennsylvania State University
- mdl_at_cse.psu.edu
2Why NoCs?
- Scalability
  - Support for a large number of processing units
- Flexibility
  - Topology and routing policy can be configured according to the needs of a particular application
  - Point-to-point, broadcasting (one-to-multiple), gathering (multiple-to-one)
- Performance
  - Low latency, high bandwidth
- Reliability
  - Multiple routes between a source/target pair
  - Signal strengthening in routers
3Mesh-Based NoC Abstraction
[Figure: mesh of nodes. Each node pairs a CPU and a memory with a router, and the routers are linked by communication channels.]
4Related Work
- Communication channels can account for a significant portion of the chip energy consumption (between 20% and 45%)
- Prior efforts
  - Simunic and Boyd: NoC power modeling (DATE02)
  - Benini and De Micheli: Design methodology for energy-efficient reliable SoC networks (ISSS01)
  - Shang et al.: Hardware-directed DVS for communication links (HPCA03)
  - Kim et al.: Communication link shutdown (ISLPED03)
  - Soteriou and Peh: Design space exploration for link turn on/off (ICCD04)
  - Soteriou et al.: Software-directed power-aware interconnection networks (CASES05)
  - Li et al.: Software-directed DVS for communication links (CASES05)
  - Li et al.: Compiler-directed link turnoff and routing (ICCAD05, EMSOFT05, POPL06)
- Our goal is to save network energy through voltage/frequency scaling
5Motivational Example (1)
Node 1:
  for i = 0 to N
    send(2, A[i][0..1023])
    receive(2, buffer)

Node 2:
  for i = 0 to N
    send(1, A[i][0..255])
    receive(1, buffer)

[Figure: timeline of iterations i = 0 to i = 4 on the two nodes.]
6Motivational Example (2)
Node 1:
  for i = 0 to N
    send(2, A[i][0..255])
    short computation
    receive(2, buffer)

Node 2:
  for i = 0 to N
    send(1, A[i][0..255])
    long computation
    receive(1, buffer)

[Figure: timeline of iterations i = 0 to i = 4 on Node 1 and Node 2.]
7Overview of Our Approach
[Figure: compilation flow. Input: Parallel Code -> Building IPCG -> IPCG -> Critical Path Analysis -> Scaling Factor for Each Connection -> Code Modification -> Output: Parallel Code]
8Assumptions
- Array-based embedded applications
- Message-passing based parallel program
  - For each send(p, m) instruction, the destination node p and the size of message m can be statically determined at compilation time
  - For each receive(p, m) instruction, the source node p can be determined at compilation time
  - A send instruction is blocked if the previous message sent by the same node has not been delivered to the destination node
  - A receive instruction is blocked if the message is not ready in the buffer of the receiver node
- Code is parallelized and process-to-node mapping is performed
- Network is exposed to the compiler
9Inter-Process Communication Graph (IPCG)
- IPCG G(P) captures the communication behavior of application P
- G(P) = (V(P), E(P), α, β)
  - V(P): the set of vertices
  - E(P): the set of edges
  - α, β: the weights for edges, capturing minimum/maximum execution latencies
10Vertices of IPCG
- V(P) = X(P) ∪ B(P) ∪ S(P) ∪ D(P) ∪ R(P)
  - x ∈ X(P): the entry point of a loop in program P
  - b ∈ B(P): the back jump of a loop in program P
  - s ∈ S(P): the point in P at which a message is sent
  - d ∈ D(P): the point in P at which a message is delivered
  - r ∈ R(P): the point in P at which a message is used

[Figure: Node 1 executes send(2,..) at point s; on Node 2 the message is delivered at point d and used by receive(1,..) at point r.]
11Edges of IPCG
- Task edges
  - Communication edge (s, d): a message is sent at point s ∈ S(P) and delivered at point d ∈ D(P)
  - Computation edge (u, v): a computation task starts at point u and ends at point v
    - u, v ∈ X(P) ∪ S(P) ∪ R(P)
- Control edges
  - Enforce the order in which the points of the given program can be reached
  - Back-jump edge
  - Other control edges
12α and β Functions
- α(u,v) and β(u,v): the minimum and maximum times required to execute task (u,v)
- For communication edge (s,d)
  - α(s,d) = (min. message size) / (max. data rate)
  - β(s,d) = (max. message size) / (max. data rate)
- For computation edge (u,v)
  - α(u,v): the minimum time for executing the instructions between u and v
  - β(u,v): the maximum time for executing the instructions between u and v
- For control edge (u,v)
  - α(u,v) = β(u,v) = 0
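
The communication-edge weights on this slide can be illustrated with a minimal Python sketch. The function name `comm_edge_weights`, the variable names `alpha` and `beta`, the units (bits and bits per cycle), and the sample numbers are all assumptions made for illustration; they do not come from the slides.

```python
def comm_edge_weights(min_bits, max_bits, max_rate_bits_per_cycle):
    """Return (alpha, beta): minimum and maximum latency of a communication edge.

    Both bounds use the channel's maximum data rate, as on the slide;
    only the message size differs between the two.
    """
    if max_rate_bits_per_cycle <= 0:
        raise ValueError("data rate must be positive")
    if min_bits > max_bits:
        raise ValueError("min_bits must not exceed max_bits")
    alpha = min_bits / max_rate_bits_per_cycle
    beta = max_bits / max_rate_bits_per_cycle
    return alpha, beta


# Hypothetical message of 2000 to 2500 bits on a 100 bit/cycle channel.
alpha, beta = comm_edge_weights(2000, 2500, 100)
print(alpha, beta)
```

A control edge would simply carry the pair (0, 0), and a computation edge would take its two bounds from a timing analysis of the instructions between its end points.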
13IPCG Example (1)
// Process 1
x3: for(...)
r1: receive(2,..)
    20-25 cycles
s2: send(2,..)

// Process 2
x1: for(...)
s1: send(1,..)
x2: for(...)
    10 cycles
s3: send(3,..)
    10-15 cycles
s4: send(3,..)
    80-90 cycles
r5: receive(3,..)
    20 cycles
r2: receive(1,..)

// Process 3
x4: for(...)
    10 cycles
r3: receive(2,..)
    15 cycles
r4: receive(2,..)
    40-50 cycles
s5: send(2,..)
14IPCG Example (2)
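[Figure: IPCG for processes p1, p2, and p3 of the previous slide, with loop entries x1 to x4, back jumps b1 to b4, send points s1 to s5, delivery points d1 to d5, and receive points r1 to r5. Each edge is labeled with its min/max latency in cycles (for example 10/15, 20/25, 40/50, 80/90); one edge is labeled 120/?.]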
25Parallel Loop Group
- A set of loops that communicate with each other
- Unit of granularity for optimization
[Figure: the example IPCG with its parallel loop groups marked.]
26Representative Iterations
- A set of loop iterations that represent the
timing behavior of the entire parallel loop group
[Figure: timeline of loop iterations; horizontal axis: Time.]
27Critical Path Analysis
- Determine q and Q such that iterations q, ..., Q-1 are the set of representative loop iterations
- Determine t^α_{i,j}: the earliest time that node v_i at the jth iteration (j ∈ [q, Q-1]) can be reached, assuming each task is completed in the shortest time
- Determine t^β_{i,j}: the earliest time that node v_i at the jth iteration (j ∈ [q, Q-1]) can be reached, assuming each task takes the longest time
- Determine the scaling factor for each communication channel such that the overall performance degradation due to voltage scaling is within δ (a preset bound)
28Determining t^α_{i,j} - Constraints
where
- the set of intra-iteration edges: at each iteration j, u must be reached before v
- the set of inter-iteration edges: u at the (j-1)th iteration must be reached before v at the jth iteration
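
The two edge sets described in this legend suggest constraints of the following shape. This is a sketch rather than the slide's own formula: the set names E_intra and E_inter, the symbol α for the minimum edge latency, and indexing times by vertex are assumptions.

```latex
\begin{aligned}
t_{v,j} &\ge t_{u,j} + \alpha(u,v)   && \text{for all } (u,v) \in E_{\mathrm{intra}} \\
t_{v,j} &\ge t_{u,j-1} + \alpha(u,v) && \text{for all } (u,v) \in E_{\mathrm{inter}}
\end{aligned}
```

Under this reading, the earliest time at which v can be reached in iteration j is the smallest value that satisfies every constraint, that is, the maximum of the right-hand sides over all incoming edges of v.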
29Examples of Intra- and Inter-Iteration Edges
[Figure: the example IPCG for p1, p2, and p3, with each edge marked as either an intra-iteration edge or an inter-iteration edge.]
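30Determining t^α_{i,j} - Example
[Figure: IPCG for three processes p1, p2, and p3 with vertices x_k, s_k, d_k, r_k, b_k (k = 1, 2, 3). Edges are labeled with min/max latencies: 20/25, 20/25, 20/25, 25/30, 20/20, 20/25, 25/30, 15/15, 10/10.]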
31Determining t^α_{i,j} - Example
              x1   s1   d1   r1   b1   x2   s2   d2   r2   b2   x3   s3   d3   r3   b3
t^α_{i,0}      0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
32Determining t^α_{i,j} - Example
              x1   s1   d1   r1   b1   x2   s2   d2   r2   b2   x3   s3   d3   r3   b3
t^α_{i,0}      0    0   20   20   30    0    0   20   25   50    0    0   20   20   35
t^α_{i,1}     30   20    0    0    0   20   50    0    0    0   35   20    0    0    0
33Determining t^α_{i,j} - Example
              x1   s1   d1   r1   b1   x2   s2   d2   r2   b2   x3   s3   d3   r3   b3
t^α_{i,0}      0    0   20   20   30    0    0   20   25   50    0    0   20   20   35
t^α_{i,1}     30   30   50   55   65   50   50   70   75  100   35   35   55   70   85
34Determining t^α_{i,j} - Example
              x1   s1   d1   r1   b1   x2   s2   d2   r2   b2   x3   s3   d3   r3   b3
t^α_{i,0}      0    0   20   20   30    0    0   20   25   50    0    0   20   20   35
t^α_{i,1}     30   30   50   55   65   50   50   70   75  100   35   35   55   70   85
t^α_{i,2}     65   65   85  105  115  100  100  120  125  150   85   85  105  120  135
t^α_{i,3}    115  115  135  155  165  150  150  170  175  200  135  135  155  170  185
t^α_{i,4}    165 .... .... .... ....  200 .... .... .... ....  185 .... .... .... ....
q = 2, Q = 4, T = 50
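
The table above can be reproduced with a short longest-path recurrence. The following Python sketch is only an illustration: the edge structure (a ring in which p1 sends to p2, p2 to p3, and p3 to p1) and the assignment of minimum latencies to edges are inferred from the table values, not read from the figure, and every name in the code is hypothetical.

```python
# Minimum latencies in cycles. The assignment to edges is an assumption
# inferred from the table, not taken from the original figure.
COMM_MIN = 20                        # s_k -> d_k on every connection
COMP_MIN = {1: 20, 2: 25, 3: 20}     # s_k -> r_k (local computation)
TAIL_MIN = {1: 10, 2: 25, 3: 15}     # r_k -> b_k
SENDER_OF = {1: 3, 2: 1, 3: 2}       # process k receives from SENDER_OF[k]


def earliest_times(iterations):
    """Return one dict per iteration, mapping vertex name to earliest time."""
    rows = []
    loop_start = {k: 0 for k in COMP_MIN}      # x_k of the current iteration
    for _ in range(iterations):
        t = {}
        for k in COMP_MIN:
            t[f"x{k}"] = loop_start[k]
            t[f"s{k}"] = t[f"x{k}"]             # control edge, weight 0
            t[f"d{k}"] = t[f"s{k}"] + COMM_MIN  # communication edge
        for k in COMP_MIN:
            # r_k waits for both the local computation and the delivery.
            t[f"r{k}"] = max(t[f"s{k}"] + COMP_MIN[k], t[f"d{SENDER_OF[k]}"])
            t[f"b{k}"] = t[f"r{k}"] + TAIL_MIN[k]
            loop_start[k] = t[f"b{k}"]          # inter-iteration edge b_k -> x_k
        rows.append(t)
    return rows


rows = earliest_times(5)
for j, row in enumerate(rows):
    print(j, [row[f"{v}{k}"] for k in (1, 2, 3) for v in "xsdrb"])
```

From iteration 2 onward every vertex advances by the same 50 cycles per iteration, which is consistent with q = 2 and T = 50 on the slide.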
35Determining t^β_{i,j} - Constraints
where
- the set of intra-iteration edges
- the set of inter-iteration edges
36Determining Scaling Factor - Constraints
where
- the set of intra-iteration and inter-iteration edges
- the node that executes operation v
- δ: the maximum performance degradation allowed
- k(n1, n2): the scaling factor for the network connection from node n1 to n2
We try to maximize k(n1, n2) for each connection
37Determining Scaling Factor - Algorithm
- repeat
  - select a connection C
  - scale down the data rate of C by one grade
  - determine t_{i,j} under the scaled data rates
  - if no t_{i,Q} exceeds its bound t^max_{i,Q}
    - make the data rate of C permanent
  - else
    - restore the data rate of C
- until no more connections can be scaled down
38Determining Scaling Factor - Example
q = 2, Q = 4, T = 100, δ = 10%
k ∈ {1, 0.8, 0.6, 0.4, 0.2}
                x1   s1   d1   r1   b1   x2   s2   d2   r2   b2   x3   s3   d3   r3   b3
t^α_{i,q}       65   65   85  105  115  100  100  120  125  150   85   85  105  120  135
t^α_{i,Q}      165 .... .... .... ....  200 .... .... .... ....  185 .... .... .... ....
t^β_{i,Q}      170 .... .... .... ....  210 .... .... .... ....  190 .... .... .... ....
t^max_{i,Q}    175 .... .... .... ....  210 .... .... .... ....  195 .... .... .... ....
39Determining Scaling Factor - Example
Trial: k(1,2) = 0.8, k(2,3) = 1, k(3,1) = 1
t_{i,Q}: x1 = 170, x2 = 210, x3 = 190 (all within t^max_{i,Q}; accepted)
40Determining Scaling Factor - Example
Trial: k(1,2) = 0.8, k(2,3) = 0.8, k(3,1) = 1
t_{i,Q}: x1 = 170, x2 = 210, x3 = 196.25 (x3 exceeds 195; rejected)
41Determining Scaling Factor - Example
Trial: k(1,2) = 0.8, k(2,3) = 1, k(3,1) = 0.8
t_{i,Q}: x1 = 176.25, x2 = 210, x3 = 190 (x1 exceeds 175; rejected)
42Determining Scaling Factor - Example
Trial: k(1,2) = 0.6, k(2,3) = 1, k(3,1) = 1
t_{i,Q}: x1 = 170, x2 = 210, x3 = 190 (accepted)
43Determining Scaling Factor - Example
Trial: k(1,2) = 0.4, k(2,3) = 1, k(3,1) = 1
t_{i,Q}: x1 = 170, x2 = 210, x3 = 190 (accepted)
44Determining Scaling Factor - Example
Trial: k(1,2) = 0.2, k(2,3) = 1, k(3,1) = 1
45Determining Scaling Factor - Example
Trial: k(1,2) = 0.2, k(2,3) = 1, k(3,1) = 1
t_{i,Q}: x1 = 170, x2 = 270, x3 = 190 (x2 exceeds 210; rejected)
RESULT: k(1,2) = 0.4, k(2,3) = 1, k(3,1) = 1
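
The trial sequence in this example can be replayed with a small greedy loop. The Python sketch below rests on several assumptions that are inferred from the numbers rather than stated on the slides: the maximum latencies per edge, the ring of connections (1,2), (2,3), (3,1), a bound equal to the minimum-latency time plus 10 cycles (10% of T = 100), and evaluation of the last representative iteration only, starting from the x_k times 115, 150, and 135. All identifiers are hypothetical.

```python
GRADES = [1.0, 0.8, 0.6, 0.4, 0.2]     # scaling factors k listed on the slide
COMM_MAX = 25                          # s -> d at full data rate (assumed)
COMP_MAX = {1: 25, 2: 30, 3: 20}       # s_k -> r_k, maximum latency (assumed)
TAIL_MAX = {1: 10, 2: 30, 3: 15}       # r_k -> b_k, maximum latency (assumed)
START = {1: 115, 2: 150, 3: 135}       # x_k at iteration Q-1
BOUND = {1: 175, 2: 210, 3: 195}       # t^max at iteration Q
CONNECTIONS = [(1, 2), (2, 3), (3, 1)]


def finish_times(k):
    """Time at which each process reaches iteration Q for scaling factors k."""
    # A message sent at START[src] takes COMM_MAX / k cycles on a scaled link.
    delivered = {dst: START[src] + COMM_MAX / k[(src, dst)]
                 for src, dst in CONNECTIONS}
    return {p: max(START[p] + COMP_MAX[p], delivered[p]) + TAIL_MAX[p]
            for p in START}


def choose_scaling():
    """Greedy loop: lower one connection by one grade, keep it if bounds hold."""
    level = {c: 0 for c in CONNECTIONS}        # index into GRADES
    active = list(CONNECTIONS)
    while active:
        for c in list(active):
            if level[c] + 1 == len(GRADES):
                active.remove(c)               # already at the lowest grade
                continue
            trial = dict(level)
            trial[c] += 1
            times = finish_times({x: GRADES[i] for x, i in trial.items()})
            if all(times[p] <= BOUND[p] for p in BOUND):
                level = trial                  # make the lower rate permanent
            else:
                active.remove(c)               # restore and stop trying c
    return {c: GRADES[i] for c, i in level.items()}


print(choose_scaling())
```

Under these assumptions the loop ends with connection (1,2) at 0.4 and the other two at full rate, matching the RESULT line. Visiting connections in round-robin order is one possible selection policy; the slides do not say how a connection is selected.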
46Shared Communication Channels
- The voltage level of a channel shared by multiple connections is determined by the connection that requires the highest voltage level

[Figure: connections a, b, and c and the voltage levels v1, v2, and v3 assigned to the channels they traverse.]
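
The rule for shared channels amounts to taking a maximum per channel. A minimal Python sketch follows; the function name, the connection and channel identifiers, the routes, and the voltage values are hypothetical and serve only to illustrate the rule.

```python
def channel_voltages(connections):
    """Map each channel to the highest voltage required by any connection on it.

    connections: dict of connection name -> (required voltage, list of channel ids).
    """
    levels = {}
    for required, route in connections.values():
        for channel in route:
            levels[channel] = max(levels.get(channel, 0.0), required)
    return levels


# Hypothetical routes: a and b share channel "c1"; b and c share channel "c2".
example = {
    "a": (0.7, ["c0", "c1"]),
    "b": (1.1, ["c1", "c2"]),
    "c": (0.9, ["c2", "c3"]),
}
levels = channel_voltages(example)
print(levels)
```

In this made-up case, channels c1 and c2 both run at 1.1 V because connection b needs it, even though a and c alone could run lower.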
47Code Modification
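[Figure: node p0 reaches p1 over channels at voltage levels v1, v2, v3 and reaches p2 over channels at voltage levels v4, v5, v6.]

// loop executed on p0
for(...)
  ...
  send(p1, ...)
  send(p2, ...)
  ...

// after code modification
send(p1, CTRL, v1, v2, v3)
send(p2, CTRL, v4, v5, v6)
for(...)
  ...
  send(p1, ...)
  send(p2, ...)
  ...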
48Experimental Setup
Voltage (V)   Rate (bps)   Energy (pJ/bit)
0.7           200M         4.21
0.9           660M         5.25
1.1           1.33G        6.49
1.3           1.93G        8.31
1.5           2.50G        10.21

Parameter               Value
NoC topology            5 × 5 mesh
Idle channel power      8.6 pJ/cycle
Voltage switch energy   1020pJ
Voltage delay           120 cycles
Processor               1 GHz, 2-issue
Node local memory       20 KB
Packet header size      3 flits
Flit size               39 bits
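
To give a feel for the trade-off in the voltage table, the following Python sketch computes the transfer time and link energy of one packet on one channel at each voltage level. It is a simplification under stated assumptions: it ignores idle channel power, voltage switch energy, and voltage switch delay, and the payload size of 7 flits is a made-up value.

```python
# (voltage in V, data rate in bit/s, energy in pJ/bit), from the setup table
LEVELS = [
    (0.7, 200e6, 4.21),
    (0.9, 660e6, 5.25),
    (1.1, 1.33e9, 6.49),
    (1.3, 1.93e9, 8.31),
    (1.5, 2.50e9, 10.21),
]
FLIT_BITS = 39       # flit size from the setup table
HEADER_FLITS = 3     # header size from the setup table


def transfer_cost(payload_flits, level):
    """Return (seconds, picojoules) for one packet crossing one channel."""
    _, rate, pj_per_bit = level
    bits = (HEADER_FLITS + payload_flits) * FLIT_BITS
    return bits / rate, bits * pj_per_bit


# Hypothetical packet with 7 payload flits (10 flits, 390 bits in total).
for level in LEVELS:
    seconds, picojoules = transfer_cost(7, level)
    print(f"{level[0]} V: {seconds * 1e9:.1f} ns, {picojoules:.1f} pJ")
```

The lowest level moves the same bits for well under half the energy of the highest one but takes more than ten times as long, which is why slack on non-critical connections matters.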
49Impact on Energy Consumption
50Energy Consumption Breakdown
51Accuracy of Voltage Selection
52Conclusions and Research Directions
- NoC presents unique opportunities for compilers
  - Expose network layout to compiler for energy reduction through voltage scaling and channel shutdown
- We implemented a compiler-directed voltage scaling algorithm and compared its performance to a hardware scheme
  - Promising results
- Research Directions
  - Evaluating impact of process-to-node mapping
  - Combined voltage/frequency scaling for NoC and CPUs
  - Metrics other than energy (e.g., temperature, reliability)
53Thank you!
- http://www.cse.psu.edu/mdl
- mdl_at_cse.psu.edu
Funded in part by GSRC and NSF