Title: Orion: A Power-Performance Simulator for Interconnection Networks
1. Orion: A Power-Performance Simulator for Interconnection Networks
MICRO 2002
- Hang-Sheng Wang, Xinping Zhu, Li-Shiuan Peh, Sharad Malik
- Princeton University
2. Introduction
- Single-chip multiprocessor systems are adopting interconnection networks as the only scalable solution to inter-processor communication.
- Interconnection networks consume a significant fraction of power:
- Alpha 21364: 25 W of 125 W
- Mellanox server blade (InfiniBand switch): 15 W of 40 W
3. Orion
- A network power-performance simulator
- Plug-and-play router and link components
- Run different communication workloads
- Constructed within LSE, a complete platform for exploring interconnected µ-processors
- Provides designers with a framework for rapid exploration of interconnected µ-processor systems
- Enables research in power-efficient H/W and compiler techniques for interconnected µ-processors
4. Simulation infrastructure
- LSE (Liberty Simulation Environment)
- Constructs concurrent structural models and retargetable simulators from a unified structural machine description and specification DB
- Targets fast design space exploration for modern µ-processors
5. Building blocks of an interconnection network
- Message transporting class: links and crossbars
- Message processing class: sources, sinks, buffers and arbiters
- All modules support different types of operational and timing behavior depending on the dynamic configuration
- A wide range of interconnection networks can be constructed through careful parameterization
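The two module classes and the idea of building a router purely by parameterization can be sketched as follows. This is an illustrative sketch only, not LSE or Orion code: the names `Module`, `ModuleClass`, `build_router`, and the parameter keys (including the 32-bit link width) are hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class ModuleClass(Enum):
    TRANSPORT = "message transporting"  # links, crossbars
    PROCESS = "message processing"      # sources, sinks, buffers, arbiters


# Mapping of building-block kinds to the two classes named on the slide.
MODULE_CLASS = {
    "link": ModuleClass.TRANSPORT,
    "crossbar": ModuleClass.TRANSPORT,
    "source": ModuleClass.PROCESS,
    "sink": ModuleClass.PROCESS,
    "buffer": ModuleClass.PROCESS,
    "arbiter": ModuleClass.PROCESS,
}


@dataclass
class Module:
    kind: str
    params: dict = field(default_factory=dict)  # per-instance configuration

    @property
    def module_class(self) -> ModuleClass:
        return MODULE_CLASS[self.kind]


def build_router(ports: int, buffer_flits: int) -> List[Module]:
    """Assemble one input-buffered router from parameterized blocks (hypothetical)."""
    modules = [Module("buffer", {"depth_flits": buffer_flits}) for _ in range(ports)]
    modules.append(Module("arbiter", {"requesters": ports}))
    modules.append(Module("crossbar", {"inputs": ports, "outputs": ports}))
    # Link width is an assumed placeholder value, not taken from the slides.
    modules += [Module("link", {"width_bits": 32}) for _ in range(ports)]
    return modules
```

Changing only the arguments, for example `build_router(5, 64)` versus `build_router(5, 16)`, yields a different router from the same blocks.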
6. Component power modeling
7. Power modeling: Discussion
- Architectural-level modeling
- Estimation based on transistor count and area is only useful for average power
- Information such as transistor count and area is typically not available at the time of architectural exploration
- Model hierarchy and reusability
- Maximize reuse of our power models
- Can extend them to new microarchitectures
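One way to picture a reusable, activity-driven component model is sketched below. It assumes the standard CMOS dynamic-energy relation E = 1/2 * C * Vdd^2 per switching event; the class `ComponentPowerModel`, its fields, and `average_power_watts` are hypothetical names, and the capacitance values a user supplies are placeholders rather than Orion's actual models.

```python
from dataclasses import dataclass


@dataclass
class ComponentPowerModel:
    name: str
    switch_cap_farads: float  # assumed effective switched capacitance per event
    events: int = 0           # activity count gathered during simulation

    def record(self, n: int = 1) -> None:
        """Count n switching events observed by the performance simulator."""
        self.events += n

    def energy_joules(self, vdd: float) -> float:
        # Standard CMOS dynamic energy: 1/2 * C * Vdd^2 per switching event.
        return 0.5 * self.switch_cap_farads * vdd ** 2 * self.events


def average_power_watts(models, vdd: float, cycles: int, freq_hz: float) -> float:
    """Average power = total energy / simulated time, with time = cycles / freq."""
    total_energy = sum(m.energy_joules(vdd) for m in models)
    return total_energy * freq_hz / cycles
```

Because every component exposes the same interface, a new microarchitecture can reuse existing models and add only the ones it lacks.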
8. Power modeling: Discussion
- Validation
- Against measured power
- Against low-level power estimation tools
- Alpha 21364, InfiniBand switch
- Link power modeling
- Plug in actual power numbers of specific links obtained from published datasheets
- Developing parameterized link power models
9. Walkthrough example of a simple wormhole router
10. Case Studies
- Exploring different configurations
- Exploring different workloads (feedback to compiler or application programmer)
- Exploring new microarchitectures
11. Experimental setup
- 16-node network, 4x4 torus
- Credit-based flow control
- Source dimension-ordered routing
- Uniformly distributed traffic to random destinations
- 59 modules
- Simulator size: 5.2 MB
- Simulation speed: 1000 simulation cycles/s
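Source dimension-ordered routing on the 4x4 torus can be sketched as below: the whole path is computed at the source, resolving one dimension completely before the other. The function names, the `y_first` flag, and the tie-break (going in the increasing direction when both ways around the ring are equally short) are assumptions for illustration, not details from the slides.

```python
def torus_step(src: int, dst: int, k: int) -> int:
    """Return -1, 0, or +1: the direction of the shorter way around a ring of size k."""
    if src == dst:
        return 0
    forward = (dst - src) % k
    # Assumed tie-break: prefer the increasing direction when distances are equal.
    return 1 if forward <= k - forward else -1


def dimension_ordered_route(src, dst, k=4, y_first=False):
    """List the nodes visited from src to dst, finishing one dimension before the other."""
    pos = list(src)
    path = [tuple(pos)]
    dims = (1, 0) if y_first else (0, 1)  # index 0 is x, index 1 is y
    for d in dims:
        while pos[d] != dst[d]:
            pos[d] = (pos[d] + torus_step(pos[d], dst[d], k)) % k
            path.append(tuple(pos))
    return path
```

For example, `dimension_ordered_route((0, 0), (3, 1))` uses the wraparound link in x and then takes one hop in y; passing `y_first=True` gives the YX order.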
12. Exploring different configurations
- Wormhole (WH) vs. virtual-channel (VC) routers
- WH64: wormhole router with 64-flit input buffer per port
- VC16: VC router with 2 VCs per port and 8-flit input buffer per VC
- VC64: VC router with 8 VCs per port and 8-flit input buffer per VC
- VC128: VC router with 8 VCs per port and 16-flit input buffer per VC
- VC router: 3-stage router pipeline
- WH router: 2-stage router pipeline
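The configuration names encode total input buffering per port (VCs per port times flits per VC). A small sketch makes the arithmetic explicit; the dictionary layout and key names are hypothetical, and treating the wormhole router as a single-VC case is a modeling assumption.

```python
# Configurations from the slide; WH64 is modeled as one VC (assumption).
CONFIGS = {
    "WH64":  {"vcs": 1, "flits_per_vc": 64, "pipeline_stages": 2},
    "VC16":  {"vcs": 2, "flits_per_vc": 8,  "pipeline_stages": 3},
    "VC64":  {"vcs": 8, "flits_per_vc": 8,  "pipeline_stages": 3},
    "VC128": {"vcs": 8, "flits_per_vc": 16, "pipeline_stages": 3},
}


def buffer_flits_per_port(cfg: dict) -> int:
    """Total input buffering per port, in flits."""
    return cfg["vcs"] * cfg["flits_per_vc"]


for name, cfg in CONFIGS.items():
    print(name, buffer_flits_per_port(cfg), "flits per port")
```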
13. WH vs. VC: average packet latency
[Figure: average packet latency of WH and VC routers, with the saturation point marked]
- VC16 outperforms WH64 despite having a smaller buffer.
14. WH vs. VC: total network power
[Figure: total network power of WH and VC routers, with the saturation point marked]
- VC64 dissipates approximately the same amount of power as WH64, even though a VC router requires more complex hardware.
15. VC64 average power breakdown
- Buffer and crossbar are the dominant power consumers (85%).
- E(VC128) ≈ E(VC64) ≈ E(VC16), L(VC128) ≈ L(VC64) < L(VC16)
- Arbiter consumes less than 1%.
- E(VC) ≈ E(WH)
16. Exploring different workloads
- Broadcast vs. uniform traffic
[Figure: broadcast workload, with node (1,2) labeled]
- Broadcast workload change: L?L/4 ?L/8
- YX routing: (1,1) and (1,3) consume higher power than (0,2) and (2,2)
- All nodes with the same x coordinate have identical power consumption.
- Orion can be interfaced with actual communication traces.
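The two synthetic workloads compared above can be sketched as destination generators for a 4x4 network. The function names are hypothetical, and excluding the source node from its own destination set is an assumption, not something the slides state.

```python
import random


def all_other_nodes(src, k=4):
    """Every node of a k x k network except src."""
    return [(x, y) for x in range(k) for y in range(k) if (x, y) != src]


def uniform_destination(src, k=4, rng=random):
    """Uniform traffic: one destination chosen at random among the other nodes."""
    return rng.choice(all_other_nodes(src, k))


def broadcast_destinations(src, k=4):
    """Broadcast traffic: the source sends a copy to every other node."""
    return all_other_nodes(src, k)
```

A recorded communication trace could replace either generator by supplying (source, destination) pairs directly.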
17. Exploring a new microarchitectural technique
- Central buffered routers (CB)
- Shared central buffer forwards flits between input and output ports
- Deployed in IBM SP/2 and InfiniBand routers
- Higher throughput than input-buffered crossbar-based routers (XB)
- No head-of-line blocking
- Configurations: same area
- Chip-to-chip 4x4 network
19. CB vs. XB
- Performance
- Random traffic: CB < XB
- Due to the fewer ports in CB (25)
- Broadcast traffic: CB > XB
- Packets from the same input port need not line up behind one another if they are destined for different output ports.
- Power
- Random traffic: CB > XB
- Broadcast traffic: CB XB
- Central buffer consumes much more energy than a crossbar due to its higher switching capacitance
20. On-chip vs. chip-to-chip
- On-chip
- Links take up less than 15% of node power
- Power consumption depends heavily on traffic
- Chip-to-chip
- Links take up more than 70% of node power
- Traffic insensitive
21. Conclusions
- Orion
- An architecture-level power-performance simulator for interconnection networks that provides a platform for rapid exploration of power-performance tradeoffs
- Future work
- Extensive modules with detailed validation