High-Performance Power-Aware Computing (Presentation Transcript)

1
High-Performance Power-Aware Computing
  • Vincent W. Freeh
  • Computer Science
  • NCSU
  • vin@csc.ncsu.edu

2
Acknowledgements
  • NCSU
  • Tyler K. Bletsch
  • Mark E. Femal
  • Nandini Kappiah
  • Feng Pan
  • Daniel M. Smith
  • U of Georgia
  • Robert Springer
  • Barry Rountree
  • Prof. David K. Lowenthal

3
The case for power management
  • Eric Schmidt, Google CEO
  • "It's not speed but power (low power), because data centers
    can consume as much electricity as a small city."
  • Power/energy consumption becoming key issue
  • Power limitations
  • Energy → heat; heat dissipation is costly
  • Non-trivial amount of money
  • Consequence
  • Excessive power consumption limits performance
  • Fewer nodes can operate concurrently
  • Goal
  • Increase power/energy efficiency
  • More performance per unit power/energy

4
CPU scaling
power ∝ frequency × voltage²
  • How CPU scaling works
  • Reduce frequency and voltage
  • Reduces power and performance
  • Energy/power gears
  • Frequency-voltage pair
  • Power-performance setting
  • Energy-time tradeoff
  • Why CPU scaling?
  • Large power consumer
  • Mechanism exists

[Figures: power vs. frequency/voltage; application throughput vs. frequency/voltage]
5
Is CPU scaling a win?
[Figure: power vs. time at the full gear over runtime T; system power Psystem splits into PCPU and Pother, with corresponding energies ECPU and Eother]
6
Is CPU scaling a win?
[Figure: full vs. reduced gear; the reduced gear lowers PCPU (hence ECPU) but extends the runtime from T to T + ΔT, during which Pother keeps drawing energy]
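To make the figure concrete, here is a minimal sketch with hypothetical power and time numbers (not measurements from this work): scaling wins only when the CPU energy saved exceeds the extra energy Pother draws during the extension ΔT.

```c
#include <stdio.h>

/* Hypothetical numbers for illustration only. Scaling is a win when
 * the CPU energy saved exceeds the extra energy the rest of the node
 * (Pother) draws during the runtime extension dT. */
int main(void) {
    double p_cpu_full = 60.0;  /* W, CPU at full gear (assumed)    */
    double p_cpu_red  = 35.0;  /* W, CPU at reduced gear (assumed) */
    double p_other    = 40.0;  /* W, rest of the node (assumed)    */
    double t          = 100.0; /* s, runtime at full gear          */
    double dt         = 8.0;   /* s, slowdown at reduced gear      */

    double e_full = (p_cpu_full + p_other) * t;        /* Psystem * T */
    double e_red  = (p_cpu_red  + p_other) * (t + dt); /* reduced gear */
    printf("E_full = %.0f J, E_reduced = %.0f J -> scaling %s\n",
           e_full, e_red, e_red < e_full ? "wins" : "loses");
    return 0;
}
```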
7
Our work
  • Exploit bottlenecks
  • Application waiting on bottleneck resource
  • Reduce power consumption (non-critical resource)
  • Generally CPU not on critical path
  • Bottlenecks we exploit
  • Intra-node (memory)
  • Inter-node (load imbalance)
  • Contributions
  • Impact studies [HPPAC '05, IPDPS '05]
  • Varying gears/nodes [PPoPP '05; PPoPP '06 (submitted)]
  • Leveraging load imbalance [SC '05]

8
Methodology
  • Cluster of 10 nodes, AMD Athlon-64
  • Processor supports 7 frequency-voltage settings
    (gears)
  • Frequency (MHz): 2000  1800  1600  1400  1200  1000   800
  • Voltage (V):      1.5   1.4  1.35   1.3   1.2   1.1   1.0
  • Measure
  • Wall clock time (gettimeofday system call)
  • Energy (external power meter)
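A small sanity-check sketch applying the power ∝ frequency × voltage² relation from slide 4 to this gear table; it ignores static and non-CPU power, so the ratios are only indicative.

```c
#include <stdio.h>

/* Relative dynamic CPU power per gear via P ∝ f × V², normalized to
 * gear 0 (2000 MHz, 1.5 V). Static and non-CPU power are ignored. */
int main(void) {
    const double f[] = {2000, 1800, 1600, 1400, 1200, 1000, 800}; /* MHz */
    const double v[] = {1.5, 1.4, 1.35, 1.3, 1.2, 1.1, 1.0};      /* V   */
    const double p0  = f[0] * v[0] * v[0];
    for (int g = 0; g < 7; g++)
        printf("gear %d: %4.0f MHz @ %.2f V -> %.2fx power\n",
               g, f[g], v[g], f[g] * v[g] * v[g] / p0);
    return 0;
}
```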

9
NAS
10
CG 1 node
[Figure: CG time and energy at 2000 MHz vs. 800 MHz]
  • Not CPU bound
  • Little time penalty
  • Large energy savings

11
EP 1 node
[Figure: EP time and energy by gear]
  • CPU bound
  • Big time penalty
  • No (little) energy savings

12
Operations per miss
CG: 8.60
13
Multiple nodes EP
14
Multiple nodes LU
S8 = 5.8, E8 = 1.28
S4 = 3.3, E4 = 1.15
S2 = 1.9, E2 = 1.03
(SN: speedup on N nodes; EN: normalized energy on N nodes)
Good speedup; the E-T tradeoff holds as N increases
15
Multiple nodes MG
Poor speedup; E increases as N increases
S8 = 2.7, E8 = 2.29
S4 = 1.6, E4 = 1.99
S2 = 1.2, E2 = 1.41
16
Phases
17
Phases LU
18
Phase detection
  • First, divide program into blocks
  • All code in a block executes in the same gear
  • Block boundaries:
  • MPI operation
  • Expected OPM change
  • Then, merge adjacent blocks into phases
  • Merge if similar memory pressure
  • Use OPM: |OPMi - OPMi+1| small
  • Also merge if the block is small (short time)
  • Note, in future:
  • Leverage large body of phase-detection research
  • [Kennedy & Kremer 1998; Sherwood et al. 2002]
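A minimal sketch of the merge rule above; the 10% OPM-similarity and 0.5 s short-block thresholds are illustrative assumptions, not the system's actual values.

```c
#include <math.h>
#include <stdio.h>

#define NBLOCKS 6

/* Merge adjacent blocks into phases. Inputs are per-block OPM and
 * runtime (both assumed here). A block joins the previous phase if its
 * OPM is within 10%, or if it is too short to be worth its own gear. */
int main(void) {
    double opm[NBLOCKS]  = {8.6, 8.9, 31.0, 30.2, 29.8, 8.4}; /* assumed */
    double secs[NBLOCKS] = {2.0, 0.1, 5.0,  4.0,  0.2,  3.0}; /* assumed */
    int phase[NBLOCKS];

    phase[0] = 0;
    for (int i = 1; i < NBLOCKS; i++) {
        int similar = fabs(opm[i] - opm[i-1]) / opm[i-1] < 0.10;
        int tiny    = secs[i] < 0.5;   /* too short for its own gear */
        phase[i] = (similar || tiny) ? phase[i-1] : phase[i-1] + 1;
    }
    for (int i = 0; i < NBLOCKS; i++)
        printf("block %d: OPM %5.1f -> phase %d\n", i, opm[i], phase[i]);
    return 0;
}
```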

19
Data collection
[Figure: MPI-jack code interposed between the MPI application and the MPI library]
  • Use MPI-jack
  • Pre and post hooks
  • For example
  • Program tracing
  • Gear shifting
  • Gather profile data during execution
  • Define MPI-jack hook for every MPI operation
  • Insert pseudo MPI call at end of loops
  • Information collected
  • Type of call and location (PC)
  • Status (gear, time, etc)
  • Statistics (uops and L2 misses for OPM
    calculation)

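MPI-jack's own API is not shown in the deck; as a sketch of the same pre/post-hook idea, the following uses the standard PMPI profiling interface to wrap MPI_Barrier and accumulate blocking time. The hook bodies are illustrative.

```c
#include <mpi.h>

/* Sketch of the hook mechanism using the standard PMPI profiling
 * interface (not MPI-jack itself). The wrapper runs a pre hook,
 * forwards to the real library, then a post hook; here the hooks
 * accumulate time spent blocked in barriers. */
static double blocked_seconds = 0.0;

double get_blocked_seconds(void) { return blocked_seconds; }

int MPI_Barrier(MPI_Comm comm)
{
    double start = MPI_Wtime();             /* pre hook: timestamp     */
    int rc = PMPI_Barrier(comm);            /* the real MPI operation  */
    blocked_seconds += MPI_Wtime() - start; /* post hook: record block */
    return rc;
}
```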
20
Example: bt
21
Comparing two schedules
  • What is the best schedule?
  • Depends on user
  • User supplies better function
  • bool better(i, j)
  • Several metrics can be used
  • Energy-delay
  • Energy-delay squared [Cameron et al., SC 2004]

22
Slope metric
  • Project uses slope
  • Energy-time tradeoff
  • Slope = -1 → energy savings equal the time delay
  • User defines the limit
  • Limit = 0 → minimize energy
  • Limit = -∞ → minimize time
  • If slope < limit, then better
  • We do not advocate this metric over others
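A minimal sketch of better() under one reading of the slope metric: slope taken as relative energy change over relative time change, consistent with slope = -1 meaning energy savings equal the time delay. The values in main are hypothetical.

```c
#include <stdio.h>

/* Sketch of better() using the slope metric. Assumption: slope is the
 * relative energy change divided by the relative time change between
 * two schedules, so slope = -1 means energy saved equals time lost. */
typedef struct { double energy; double time; } schedule;

int better(schedule from, schedule to, double limit)
{
    double slope = ((to.energy - from.energy) / from.energy) /
                   ((to.time   - from.time)   / from.time);
    return slope < limit;   /* e.g. limit = -1.5, as on the next slide */
}

int main(void) {
    schedule a = {100.0, 50.0};  /* hypothetical: 100 J, 50 s */
    schedule b = { 90.0, 52.0};  /* hypothetical:  90 J, 52 s */
    printf("move a -> b? %s\n", better(a, b, -1.5) ? "yes" : "no");
    return 0;
}
```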

23
Example: bt
Step  Transition  Slope   Slope < -1.5?
  1   00 → 01     -11.7   true
  2   01 → 02     -1.78   true
  3   02 → 03     -1.19   false
  4   02 → 12     -1.44   false
⇒ 02 is the best schedule
24
Benefit of multiple gears: mg
25
Current work: no. of nodes, gear/phase
26
Load imbalance
27
Node bottleneck
  • Best course is to keep the load balanced
  • But load balancing is hard
  • Slow down if not the critical node
  • How to tell if not the critical node?
  • Suppose a barrier:
  • All nodes must arrive before any leave
  • No benefit to arriving early
  • Measure blocking time
  • Assume it is (mostly) the same between iterations
  • Assumptions:
  • Iterative application
  • Past predicts future

28
Example
[Figure: two iterations between synch points; iteration k runs at performance 1, iteration k+1 at performance (t - slack)/t]
Reduced performance and power → energy savings
29
Measuring slack
  • Blocking operations
  • Receive
  • Wait
  • Barrier
  • Measure with MPI-jack
  • Too frequent
  • Can be hundreds or thousands per second
  • Aggregate slack for one or more iterations
  • Computing slack, S
  • Measure times for computing and blocking phases:
  • T = C1 + B1 + C2 + B2 + … + Cn + Bn
  • Compute aggregate slack:
  • S = (B1 + B2 + … + Bn) / T
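The formulas above transcribe directly; a minimal sketch with hypothetical phase times:

```c
#include <stdio.h>

/* Aggregate slack for one iteration: S = (B1 + ... + Bn) / T, where
 * T = C1 + B1 + ... + Cn + Bn. Phase times below are hypothetical. */
int main(void) {
    double c[] = {1.2, 0.8, 1.5};   /* Ci: computing phases (s) */
    double b[] = {0.3, 0.5, 0.2};   /* Bi: blocking phases (s)  */
    double t = 0.0, blocked = 0.0;
    for (int i = 0; i < 3; i++) {
        t += c[i] + b[i];
        blocked += b[i];
    }
    printf("aggregate slack S = %.2f\n", blocked / t);
    return 0;
}
```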

30
Slack
[Figure: communication slack measured for CG, Aztec, Sweep3d]
  • Slack
  • Varies between nodes
  • Varies between applications
  • Use net slack
  • Each node individually determines slack
  • Reduction to find min slack

31
Shifting
  • When to reduce performance?
  • When there is enough slack
  • When to increase performance?
  • When application performance suffers
  • Create high and low limits for slack
  • Need damping
  • Dynamically learn
  • Not the same for all applications
  • Range starts small
  • Increase if necessary

[Figure: slack over window T mapped to an action: above the high limit, reduce gear; within the band, keep the same gear; below the low limit, increase gear]
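A minimal sketch of this shifting rule with a hysteresis band; the slack limits and gear bounds are assumptions, and the dynamic learning of the range described above is omitted.

```c
#include <stdio.h>

/* Shifting rule sketch: slack above the high limit -> reduce
 * performance (higher gear number = lower frequency); slack below the
 * low limit -> increase performance; otherwise keep the gear. The band
 * [0.08, 0.20] is assumed; the real system learns it dynamically. */
static int next_gear(int gear, double slack, double low, double high)
{
    if (slack > high && gear < 6) return gear + 1;  /* slow down   */
    if (slack < low  && gear > 0) return gear - 1;  /* speed up    */
    return gear;                                    /* in the band */
}

int main(void) {
    double slack[] = {0.30, 0.28, 0.12, 0.03};  /* per interval, assumed */
    int g = 0;
    for (int i = 0; i < 4; i++) {
        g = next_gear(g, slack[i], 0.08, 0.20);
        printf("slack %.2f -> gear %d\n", slack[i], g);
    }
    return 0;
}
```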
32
Aztec gears
33
Performance
[Figures: performance for Aztec and Sweep3d]
34
Synthetic benchmark
35
Summary
  • Contributions
  • Improved energy efficiency of HPC applications
  • Found simple metric for phase boundary location
  • Developed simple, effective linear time algorithm
    for determining proper gears
  • Leveraged load imbalance
  • Future work
  • Reduce sampling interval to a handful of iterations
  • Reduce algorithm time with modeling and prediction
  • Develop AMPERE
  • A message-passing environment for reducing energy
  • http://fortknox.csc.ncsu.edu/osr/
  • vin@csc.ncsu.edu, dkl@cs.uga.edu

36
End
37
Shifting test
NAS LU 1 node
[Figure: shifting test results]
38
Beta
  • Hsu & Kremer [PLDI '03]
  • Relates application slowdown to CPU slowdown: β
  • β = 1 → time is CPU dependent
  • β = 0 → time is independent of CPU
  • OPM vs. β
  • Correlated
  • log(OPM) predicts β

39
OPM and β and slack
  • OPM not strongly correlated to β in multi-node runs
  • Why? There is another bottleneck:
  • Communication slack
  • Waiting time
  • E.g., MPI_Recv, MPI_Wait, MPI_Barrier
  • MG: OPM = 70.6, slack = 25%
  • LU: OPM = 73.5, slack = 11%
  • Can predict β with
  • log(OPM) and
  • slack

40
Energy savings (synthetic)
41
Normalized MG
With a communication bottleneck, the E-T tradeoff improves as N increases
42
SPEC FP
43
SPEC INT
44
Single node MG
[Figure: MG time and energy by gear]
Modest memory pressure; gears offer an E-T tradeoff
45
Dynamically adjust performance
[Figure: net slack over time; gear per interval: 0, 2, 1]
46
Adjust performance
[Figure: net slack over time; gear per interval: 0, 0, 0, 1, 1, 1]
47
Dampening
[Figure: net slack over time; gear per interval: 0, 0, 1, 1, 1, 0; dampening avoids oscillating between gears]
48
Power consumption
Average for the NAS suite
49
Related work Energy conservation
  • Goal: conserve energy
  • Performance degradation acceptable
  • Usually in mobile environments (finite energy
    source, battery)
  • Primary goal
  • Extend battery life
  • Secondary goal
  • Re-allocate energy
  • Increase value of energy use
  • Tertiary goal
  • Increase energy efficiency
  • More tasks per unit energy
  • Example
  • Feedback-driven energy conservation
  • Control average power usage
  • Pave = (E0 - Ef) / T

50
Related work Realtime DVS
  • Goal
  • Reduce energy consumption
  • With no performance degradation
  • Mechanism
  • Eliminate slack time in system
  • Savings
  • Eidle (with frequency scaling)
  • Additionally, Etask is reduced (with voltage scaling)

[Figure: two power-vs-time panels with the deadline marked; running slower fills idle time up to the deadline, reclaiming Eidle; voltage scaling additionally reduces Etask below the Pmax case]
51
Related work
  • Previous studies in power-aware HPC
  • Cameron et al. [SC 2004, IPDPS 2005]; Freeh et al. [IPDPS 2005]
  • Energy-aware server clusters
  • Many projects, e.g., Heath [PPoPP 2005]
  • Low-power supercomputer design
  • Green Destiny (Warren et al., 2002)
  • Orion Multisystems

52
Related work Fixed installations
  • Goal
  • Reduce cost (in heat generation or dollars)
  • Goal is not to conserve a battery
  • Mechanisms
  • Scaling
  • Fine-grain DVS
  • Coarse-grain power down
  • Load balancing

53
Memory pressure
  • Why different tradeoffs?
  • CG is memory bound: the CPU is not on the critical path
  • EP is CPU bound: the CPU is on the critical path
  • Operations per miss
  • Metric of memory pressure
  • Indicates criticality of CPU
  • Use performance counters
  • Count micro operations and cache misses
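A minimal sketch of the metric; reading the counters (micro-ops retired, L2 misses) is platform specific and omitted, and the sample values are hypothetical, chosen to reproduce CG's 8.60 from slide 12.

```c
#include <stdio.h>

/* Operations per miss from two counter readings. Collecting the
 * counters themselves is platform specific and omitted here. */
static double opm(unsigned long long uops, unsigned long long l2_misses)
{
    return l2_misses ? (double)uops / (double)l2_misses : 0.0;
}

int main(void) {
    /* hypothetical counts -> OPM = 8.60, matching CG on slide 12 */
    printf("OPM = %.2f\n", opm(8600000ULL, 1000000ULL));
    return 0;
}
```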

54
Single node MG
55
Single node LU