Full-System Chip Multiprocessor Power Evaluations Using FPGA-Based Emulation - PowerPoint PPT Presentation

1
Full-System Chip Multiprocessor Power
Evaluations Using FPGA-Based Emulation
  • Abhishek Bhattacharjee
  • Gilberto Contreras
  • Margaret Martonosi
  • Princeton University

2
Problem: SW Simulators for Architectural Power
Estimation
  • Power has become a first-class design problem
  • Affects power density, thermal behavior,
    packaging constraints
  • Early stage µ-arch perf/power evaluation is
    crucial
  • Conventional SW simulators (Wattch, SimplePower,
    Hotspot)
  • Flexible, low development time
  • But SW simulations are too slow
  • Chips getting more complex: core counts,
    interconnect, etc.
  • Design space getting more complex:
    perf/power/thermal
  • Must consider OS, workload interaction

3
Alternatives to Long Simulations
  • Run application snippets, ignore OS
  • Compromises result accuracy and credibility
  • Parallelize simulator [Falsafi et al., ACM
    Modeling 97; Mukherjee et al., IEEE Concurrency
    00; Chidester et al., ACM Modeling 02; and
    others]
  • Shared structures (LLC, coherence) limit
    scalability
  • Hardware runtime monitoring [Joseph et al.,
    ISLPED 01; Bellosa et al., ACM SIGOPS; and
    others]
  • Fast evaluation time
  • Restricted view of components
  • Requires existing design

4
Our Approach: FPGA-Based Full-System Emulation
  • Develop FPGA-based perf./power emulator of a
    proposed CMP machine
  • Emulation rate of 50-300 MHz → run full apps, OS
  • Similar to HW monitoring
  • Programmable → insert relevant monitors, model
    various designs
  • Similar to SW simulations
  • Bottom line: get the detail and full-system effects of
    real measurements before the design is built
  • First full-system FPGA emulation of CMP running
    Linux
  • Demonstrate use on activity migration example

5
Recent Related Work on FPGA-Based Emulation
  • Memory controller emulation
  • RPM [Oner et al., ISFPGA 95]
  • Purely performance emulation
  • HASim [Emer et al., ISFPGA 06], RAMP [Wawrzynek
    et al. 06]
  • Modular, parameterizable perf. models on FPGAs
  • Purely power emulation [Coburn et al., DAC 05]
  • RTL with power models on FPGA (area/latency
    overhead analysis)
  • Performance and power emulation [Atienza et al.,
    DAC 06]
  • Performance and thermal emulation of MPSoCs for
    existing cores
  • Runs OS on host and communicates with FPGA

6
Presentation Outline
  • Designing the emulator
  • Validating emulator power models
  • Evaluating emulator speedup
  • Profiling application runtime power behavior
  • Case study: Activity migration
  • Conclusion

7
Steps in Designing Emulator
  • 1. Choose target platform
  • 2. Choose candidate core design
  • 3. Design event counters
  • 4. Design power models
  • 5. Boot OS and run full apps.

8
Target Emulation Platform
  • Target FPGA platform: BEE2
  • 5 Xilinx V2P 70 FPGAs (1 control/4 user)
  • Current design on control unit
  • Methodology extensible to other platforms

9
Candidate Core Design: Leon3 SparcV8 CMP
Candidate Core: Leon3 Sparc V8 VHDL core
Organization: L1 snoopy cache coherence (ARM bus)
Pipeline: Single-issue, in-order, 7-stage
Functional Units: Adder, Shifter, Pipelined Mul/Div
L1 I-Cache: 4 KB, 2-way, 32-byte lines, LRR
L1 D-Cache: 4 KB, 2-way, 32-byte lines, LRR, write-through, virtually addressed
MMU: 8-entry I and D TLBs, LRU
  • Paper emulates 2 cores; subsequently scaled to 4
    cores
  • Currently use 60% LUTs, 20% BRAM on 1 Virtex2P 70
  • Current synthesized FPGA clock rate: 65 MHz
  • Future: further scale core count, L1 caches, add
    LLC, FPU
  • Methodology extensible to other core designs

10
Inserting Event Counters
[Block diagram: SparcV8 cores 1 through N, each with a 3-port register file, a 7-stage integer pipeline, and 4KB I- and D-caches, attached to the AHB bus and AHB controller; 64-bit event counters]
  • Memory-mapped counters
  • Add to ISA: start/stop/reset counters
  • 36 counters → 3% LUTs, no impact on freq.
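A minimal software model of the counter interface on this slide, assuming 36 counters of 64 bits that can be started, stopped, and reset, and that wrap on overflow. The class name, method names, and event indices are hypothetical and are not taken from the Leon3 design.

```python
class CounterBank:
    """Toy model of a bank of 64-bit event counters (names are hypothetical)."""

    WIDTH_MASK = (1 << 64) - 1  # counters are 64 bits wide, so they wrap at 2**64

    def __init__(self, num_counters=36):
        self.values = [0] * num_counters
        self.running = False

    def start(self):
        self.running = True

    def stop(self):
        self.running = False

    def reset(self):
        self.values = [0] * len(self.values)

    def record(self, index, count=1):
        """Called when a monitored event fires; ignored while stopped."""
        if self.running:
            self.values[index] = (self.values[index] + count) & self.WIDTH_MASK

    def read(self, index):
        return self.values[index]


bank = CounterBank()
bank.record(0)           # ignored: counters have not been started
bank.start()
bank.record(0)
bank.record(0, count=4)
bank.stop()
bank.record(0)           # ignored again after stop
```

In the real design the counters sit in hardware next to each module; this model only illustrates the start/stop/reset semantics.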
11
Power Model Development
  • General form of component power model
  • How to assign event Ei?
  • Want power of emulated machine, not of the FPGA!
  • Calibrate with gate-level simulations and
    microbenchmarks
  • Write 500-1000 instruction benchmarks exercising
    events
  • Get Leon3 gate-level netlist from Synopsys Design
    Compiler
  • Feed µ-benchmarks and netlist into Synopsys
    PrimeTime to get component power breakdown
  • Please refer to paper for details

[Equation annotations: dynamic power term; un-clock-gated leakage power]
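One way to read the calibration step on this slide: run microbenchmarks at several event rates, obtain component power from gate-level simulation, and fit a line whose slope is the per-event energy and whose intercept is the idle power. The sketch below assumes that linear form, power = idle + energy × rate. The function name and the sample points are made up for illustration; the paper should be consulted for the actual procedure.

```python
def fit_energy_and_idle(rates_per_s, powers_w):
    """Ordinary least-squares line fit: power = idle + energy * rate.

    Returns (energy_per_event_joules, idle_power_watts).
    """
    n = len(rates_per_s)
    mean_x = sum(rates_per_s) / n
    mean_y = sum(powers_w) / n
    sxx = sum((x - mean_x) ** 2 for x in rates_per_s)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(rates_per_s, powers_w))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Synthetic points (NOT measured data): generated from a known line so the
# fit can be checked. Event rates are in events per second.
true_energy = 0.5e-9   # 0.5 nJ per event (illustrative)
true_idle = 20e-3      # 20 mW idle (illustrative)
rates = [0.0, 10e6, 20e6, 40e6]
powers = [true_idle + true_energy * r for r in rates]

energy, idle = fit_energy_and_idle(rates, powers)
```

With real data the points would not lie exactly on a line, and the residual of the fit gives a first indication of how well a single per-event energy describes the component.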
12
Register File Switching Power Model
  • Write 500-instruction microbenchmarks
  • Vary event/nop ratio
  • Idle power: 18.83 mW; write: 0.53 nJ;
    single read: 0.29 nJ; double read: 0.39 nJ
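Using the register-file numbers on this slide, average power over an interval can be estimated from counter values. The sketch assumes the model is idle power plus the sum of per-event energy times event count, divided by the interval length. That form and the event counts below are assumptions for illustration; only the idle-power and energy constants come from the slide.

```python
IDLE_POWER_W = 18.83e-3          # idle power from the slide
ENERGY_J = {                     # per-event energies from the slide
    "write": 0.53e-9,
    "single_read": 0.29e-9,
    "double_read": 0.39e-9,
}


def regfile_power(event_counts, interval_s):
    """Average register-file power (W) over an interval.

    event_counts maps an event name in ENERGY_J to how often it occurred.
    """
    dynamic_energy = sum(ENERGY_J[name] * n for name, n in event_counts.items())
    return IDLE_POWER_W + dynamic_energy / interval_s


# Hypothetical counter deltas over one 10 ms sampling interval.
counts = {"write": 200_000, "single_read": 100_000, "double_read": 150_000}
power_w = regfile_power(counts, interval_s=10e-3)
```

For these made-up counts the estimate comes to roughly 38.2 mW, of which 18.83 mW is the idle term.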

13
Full-System Emulator with OS and Applications
[System diagram]
  • FPGA platform: BEE2 control unit
  • Emulated CMP: SparcV8 Core 0 and Core 1 on the AHB bus, with main memory and I/O
  • Event counters for all modules
  • Software: Linux 2.6, applications (Spec2006, Splash-2, PARSEC), knowledge of power models
  • Host PC connected via RS-232 and Ethernet
14
Presentation Outline
  • Designing the emulator
  • Validating emulator power models
  • Evaluating emulator speedup
  • Profiling application runtime power behavior
  • Case study: Activity migration
  • Conclusion

15
Validating Emulator Power Models
  • Extensive validation with Synopsys PrimeTime PX
    using:
  • Validation micro-benchmarks
  • 2x calibration micro-benchmarks, multiple event
    types
  • Spec 2006 benchmarks
  • Mcf, Libquantum, Bzip2, Gcc, Sjeng (train problem
    size)
  • Run 5 distinct 1-million instruction snapshots
    (short snippets due to PrimeTime)

Module       µ-benchmarks   Spec 2006
Pipeline     7.51           7.58
Reg. File    6.03           6.23
I-Cache      6.81           7.21
D-Cache      7.21           7.41
AHB          5.66           7.30
16
Results: Emulation Speedup
  • Speedup over architectural simulator, Multifacet
    GEMS
  • 2-core, 4KB L1 caches
  • Mcf, Libquantum, Bzip2, Gcc, Sjeng on each core
    with train size
  • With Ruby: max. 35x
  • Even greater speedup expected for:
  • Detailed pipeline modeling
  • Modeling greater core counts
  • Collecting power/thermal data
  • Greater FPGA clock

NOTE: GEMS host uses a 64-bit, 2 GHz dual-core
AMD Athlon processor
17
Presentation Outline
  • Designing the emulator
  • Validating emulator power models
  • Evaluating emulator speedup
  • Profiling application runtime power behavior
  • Case study: Activity migration
  • Conclusion

18
Runtime Power Profiling
  • Important for OS-controlled power-aware
    scheduling
  • Modify Linux kernel to feed counter values to
    power models
  • Read counters within 10 ms timer interrupt
  • Sampling rate: multiples of 10 ms
  • Access 36 counters in 5700 cycles → max. 0.87%
    perturbation
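The perturbation figure can be checked with a short calculation, assuming the 65 MHz clock reported elsewhere in the talk and one full counter read per 10 ms timer tick. The names are illustrative.

```python
CLOCK_HZ = 65e6            # synthesized FPGA clock rate reported in the talk
READ_CYCLES = 5700         # cycles to access all 36 counters
TIMER_PERIOD_S = 10e-3     # Linux timer interrupt period


def sampling_overhead(read_cycles, clock_hz, period_s):
    """Fraction of each sampling period spent reading counters."""
    cycles_per_period = clock_hz * period_s
    return read_cycles / cycles_per_period


overhead = sampling_overhead(READ_CYCLES, CLOCK_HZ, TIMER_PERIOD_S)
overhead_percent = 100 * overhead
```

This gives about 0.877%, in line with the 0.87 maximum quoted on the slide. Sampling at a larger multiple of 10 ms lowers the overhead proportionally.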

19
Runtime Power for LU (2 threads)
  • CPU1 master, CPU0 idle (380 mW)
  • Barrier: CPU0 spin-waiting
  • Possible Reg. File hotspot cannot be tracked on
    CPU composite profile
  • Low power numbers and swing: no L2, no FPU, no
    gating, simple pipeline
20
Case Study: Activity Migration
  • Goal: Demonstrate use of emulation system on
    problem of real-world relevance
  • Problem: Use activity migration (AM) to mitigate
    CMP hotspots [Heo et al., ISLPED 03; Choi et al.,
    ISLPED 07]
  • Our Solution: Modify Linux kernel scheduler to
    read counters, deduce power trends, and migrate
    threads accordingly
  • Our emulator is the ideal platform for AM studies
  • Hotspots depend on component power
  • Emulator directly provides this
  • On-chip temperature rise/fall times: 100 ms
  • Emulator fast enough to run OS and apps. beyond
    this time range
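A rough sketch of the kind of policy the scheduler change implies: sample per-core power estimates, smooth them to capture the trend, and swap threads between the hottest and coolest cores when the gap exceeds a threshold. The smoothing factor, threshold, sample values, and all names are hypothetical; the talk does not specify the actual kernel policy.

```python
class MigrationPolicy:
    """Toy activity-migration policy for a CMP (all parameters hypothetical)."""

    def __init__(self, num_cores, threshold_w=0.05, alpha=0.5):
        self.trend = [0.0] * num_cores   # exponentially smoothed power per core
        self.threshold_w = threshold_w   # power gap that triggers a migration
        self.alpha = alpha               # weight given to the newest sample

    def update(self, power_samples_w):
        """Fold in one power sample per core.

        Returns (hot_core, cool_core) if threads should be swapped, else None.
        """
        for core, sample in enumerate(power_samples_w):
            self.trend[core] = (
                self.alpha * sample + (1 - self.alpha) * self.trend[core]
            )
        hot = max(range(len(self.trend)), key=self.trend.__getitem__)
        cool = min(range(len(self.trend)), key=self.trend.__getitem__)
        if self.trend[hot] - self.trend[cool] > self.threshold_w:
            return hot, cool
        return None


# Hypothetical per-core power samples in watts.
policy = MigrationPolicy(num_cores=2)
first = policy.update([0.42, 0.38])    # small gap: no migration
second = policy.update([0.60, 0.38])   # core 0 trending hotter: swap suggested
```

A fuller version would also enforce a minimum time between migrations so that the migration cost stays bounded.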

21
Case Study: AM on Bzip2, Mcf
  • Bzip2: small working set, high activity, high
    power
  • Mcf: large working set, high stalls, low power
  • CPU 0 (Bzip2) overheats
  • Migration triggered
  • CPU 0 (Mcf) cools off
  • Mcf: data cached, computation proceeds
22
Presentation Outline
  • Designing the emulator
  • Validating emulator power models
  • Evaluating emulator speedup
  • Profiling application runtime power behavior
  • Case study: Activity migration
  • Conclusion

23
Conclusion
  • First FPGA-based emulation for CMPs for
    full-system power-performance modeling of
    early-stage designs
  • Emulator combines HW speeds (65 MHz) with SW
    programmability: up to 35x speedup over GEMS (Ruby)
  • Power models accurate within 10% of Synopsys
    simulations
  • Can model range of proposed designs
  • Moore's Law applies to FPGAs too!
  • Ongoing/future work:
  • Emulate architecture with GHz frequency using raw
    FPGA clock in MHz
  • DVFS emulation
  • Thermal models

24
Linux Kernel Scheduler for AM
  • Avg. migration time ≈ 300 ms (65 MHz clock and
    small caches)
  • 2 s interval for max. 15% migration penalty
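The interval on this slide follows from a simple ratio, assuming at most one migration per interval and reading the penalty as a share of run time. The function names are illustrative.

```python
def migration_penalty(migration_time_s, interval_s):
    """Worst-case fraction of time spent migrating, one migration per interval."""
    return migration_time_s / interval_s


def min_interval_for_penalty(migration_time_s, max_penalty):
    """Shortest interval between migrations that keeps the penalty under a cap."""
    return migration_time_s / max_penalty


penalty = migration_penalty(0.300, 2.0)             # 300 ms every 2 s
interval = min_interval_for_penalty(0.300, 0.15)    # cap the penalty at 15%
```

A 300 ms migration every 2 s costs 15% of run time; equivalently, a 15% cap implies an interval of at least 2 s.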
25
Modeling Leakage Power
  • General form of component power model
  • Leakage power depends heavily on temp.
  • Separate voltage/temperature dependent leakage
    term possible
  • Emulator runs fast enough to collect accurate
    temperature data → more accurate leakage power
    estimates
  • Calibrate with leakage estimates from Synopsys
    PrimeTime
  • Write µ-benchmarks across range of temperatures
    and see per-component leakage variation

[Equation annotations: dynamic power term; un-clock-gated leakage power]