Title: NePSim: A Network Processor Simulator with a Power Evaluation Framework
1- NePSim: A Network Processor Simulator with a Power Evaluation Framework
- Yan Luo, Jun Yang, Laxmi N. Bhuyan, and Li Zhao
- University of California, Riverside
- By ???
2Outline
- Introduction
- Cycle-level simulation
- Validation
- Power modeling
- Power and performance analysis
- Reducing processing core power
3Introduction (1)
- Network processor (NP)
- Provides both high performance and flexibility in building powerful routers.
- With the exponential increase in clock frequency and core complexity, power dissipation will become a major design consideration in NP development.
- Cycle-accurate architecture simulators exist for commercial NPs, but they are not open source:
- Intel: Software Development Kit (SDK)
- Motorola: C-Ware
- These simulators don't incorporate power modeling and evaluation.
4Introduction (2)
- NePSim
- Includes a cycle-accurate architecture simulator
- An automatic formal verification engine
- Parameterizable power estimator
- Execution cores
- Memory controllers
- I/O ports
- Packet buffers
- High-speed buses
- We define our system to comply with the Intel IXP1200.
5Introduction (3)
- We propose low-power techniques tailored to NPs and evaluate them using our NePSim system.
- Dynamic voltage scaling (DVS)
- Applied to each execution core
- Motivated by abundant idle time (avg. 10-23%) resulting from contention in the shared memory.
- Achieved
- Up to 17% power savings for the NP over four application benchmarks
- Less than 6% performance loss
6Cycle-level simulation
- A high-level overview of the IXP1200, followed by a description of our simulator's software structure
- Background: Intel IXP1200 and its microengines
- The simulator
7Background: Intel IXP1200 and its microengines
- StrongARM core
- 6 microengines (MEs)
- Standard memory interfaces
- SDRAM
- SRAM
- High-speed bus interfaces
- IX bus
8The simulator: NePSim
- Implements most of the functionality of the IXP1200
- Does not model the StrongARM core, because its main tasks are control-plane functions that don't affect the critical path of packet processing.
- Why simulate microcode rather than binaries?
- Leaves room for instruction extensions in future research
- Makes the program easier to modify
9NePSim body-the ME simulation core
- The NePSim body is the module that simulates the following five stages of the ME pipeline:
- Instruction lookup
- Initial instruction decoding and formation of the source register address
- Reading of operands from the source registers
- ALU operations, shift or compare operations, and generation of condition codes
- Writing of the result to the destination register
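The five stages listed above can be pictured as an in-order pipeline in which each instruction advances one stage per cycle. The following is a minimal sketch under stated assumptions, not NePSim's actual code: the short stage names and the `run_pipeline` helper are hypothetical, and stalls, branches, and memory latency are ignored.

```python
from collections import deque

# Hypothetical short names for the five ME pipeline stages listed above.
STAGES = ["lookup", "decode", "read_operands", "execute", "write_back"]


def run_pipeline(instructions):
    """Advance instructions through an ideal in-order pipeline.

    Assumes one stage per cycle with no stalls, branches, or memory
    latency (a simplification). Returns one snapshot per cycle, each
    mapping a stage name to the instruction occupying it (or None).
    """
    pending = deque(instructions)
    slots = [None] * len(STAGES)  # slots[i] holds the instruction in stage i
    trace = []
    while pending or any(s is not None for s in slots):
        # Shift everything one stage forward; the last stage retires.
        slots = [pending.popleft() if pending else None] + slots[:-1]
        if any(s is not None for s in slots):
            trace.append(dict(zip(STAGES, slots)))
    return trace
```

In this idealized model, n instructions drain in n + 4 cycles; a cycle-level simulator additionally has to account for the stalls that this sketch leaves out.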
10Model components
- Device: implements I/O devices such as the I/O ports and the MACs.
- Dlite: resembles the debugger in SimpleScalar.
- For example, it lets users set breakpoints, print pipeline status, display register values, and dump memory contents.
- Configurable parameters
- Different clock rates and supply voltages for the MEs
- Different latencies and bandwidths for the SRAM and SDRAM
- Incoming traffic with different arrival rates and patterns
11Our simulator vs. Intel's SDK: several advantages
- Enables new architecture designs
- Permits the number of MEs and threads to vary
- Provides instruction set extensibility in microcode assembly code
- Provides faster execution speed
12Validation
- Average error of 1% in throughput and 6% in average processing time across the four benchmarks.
- The simulator can therefore produce relatively dependable results.
13Power modeling
- The IXP1200 uses 0.28 um technology.
- We use 0.25 um technology because it is the closest available feature size to 0.28 um.
- The IXP1200 core's power, excluding I/O, is 4.5 W at 232 MHz. The modeled components include:
- All the MEs (0.468 W each)
- Memory units (SRAM: 0.0639 W, SDRAM: 0.0643 W)
- IX bus unit (0.363 W)
- StrongARM (0.5 W)
- 0.468 × 6 + 0.363 + 0.0639 + 0.0643 + 0.5 ≈ 3.8 W
- 4.5 - 3.8 = 0.7 W
- The gap results from our use of a smaller technology (0.25 um instead of 0.28 um).
- We also didn't model the internal buses and the clock.
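The power budget on this slide can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted above, with six MEs as on the IXP1200 background slide; the variable and function names are illustrative.

```python
# Per-component power estimates in watts, as quoted on the slide.
ME_POWER_W = 0.468
NUM_MES = 6
OTHER_COMPONENTS_W = {
    "ix_bus": 0.363,
    "sram": 0.0639,
    "sdram": 0.0643,
    "strongarm": 0.5,
}
CORE_POWER_W = 4.5  # IXP1200 core power excluding I/O, at 232 MHz


def modeled_power_w():
    """Sum of the power of all modeled components."""
    return ME_POWER_W * NUM_MES + sum(OTHER_COMPONENTS_W.values())


def unmodeled_gap_w():
    """Difference between the 4.5 W figure and the modeled total."""
    return CORE_POWER_W - modeled_power_w()
```

Rounded to one decimal place, `modeled_power_w()` gives 3.8 W and `unmodeled_gap_w()` gives 0.7 W, matching the slide.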
14Power and performance analysis (1)
- Assume
- Max packet arrival rate
- 16 Ethernet interfaces for receiving
- 16 Ethernet interfaces for transmitting
- SRAM and SDRAM frequency is 116 MHz
15Power and performance analysis (2)
- Benchmarks
- ipfwdr
- url
- nat
- md4
- 4 receiving MEs and 2 transmitting MEs.
- Researchers have found that this 4:2 ratio provides maximum throughput, so we adopted this configuration throughout our experiments.
16Benchmark descriptions-ipfwdr
- ipfwdr
- A sample IP forwarding program provided in Intel's SDK
- Processing includes Ethernet and IP header validation and trie-based routing-table lookup.
- The routing table resides in SRAM, and the output port information is in SDRAM.
- Packets are forwarded to the next-hop router on the basis of the output port information.
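A trie-based routing-table lookup of the kind ipfwdr performs can be sketched as a longest-prefix match over the bits of the destination address. The slide does not describe the SDK's trie layout, so this is an assumption: a one-bit-per-level trie with Python dictionaries standing in for SRAM. The `RouteTrie` class and its methods are hypothetical names.

```python
import ipaddress


class RouteTrie:
    """Binary trie keyed on IPv4 prefix bits, one bit per level (an assumption)."""

    def __init__(self):
        # Each node is a dict with optional children "0" and "1" and an
        # optional "port" entry holding the output port for that prefix.
        self.root = {}

    def insert(self, prefix, port):
        net = ipaddress.ip_network(prefix)
        bits = format(int(net.network_address), "032b")[: net.prefixlen]
        node = self.root
        for b in bits:
            node = node.setdefault(b, {})
        node["port"] = port

    def lookup(self, address):
        """Return the output port of the longest matching prefix, or None."""
        bits = format(int(ipaddress.ip_address(address)), "032b")
        node, best = self.root, self.root.get("port")
        for b in bits:
            node = node.get(b)
            if node is None:
                break
            best = node.get("port", best)
        return best
```

Each level visited corresponds to one table access, which suggests why a lookup of this kind generates repeated SRAM traffic per packet.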
17Benchmark descriptions-URL
- URL
- Routes packets on the basis of the URL request they contain.
- Must therefore examine the payload of packets when processing them.
- Performs a string-matching algorithm that we ported from NetBench.
- Because the string patterns are initialized in SRAM, url's code must generate SRAM accesses in later comparisons.
- Because payloads must also be scanned for pattern matching, many requests are generated to SDRAM, which stores the payload data.
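The payload scan described above can be sketched as follows. The slide does not say which string-matching algorithm was ported from NetBench, so a naive scan is substituted here as an assumption; the `match_url` function and the pattern table are hypothetical.

```python
def match_url(payload: bytes, patterns: dict):
    """Return the destination for the first pattern found in the payload.

    `patterns` maps byte-string patterns to destinations (a hypothetical
    table). A naive scan is used; the NetBench algorithm may differ.
    """
    for start in range(len(payload)):
        for pattern, destination in patterns.items():
            # Compare the pattern against the payload at this offset.
            if payload.startswith(pattern, start):
                return destination
    return None
```

In the benchmark, reads of `patterns` correspond to SRAM accesses and reads of `payload` to SDRAM accesses, which is why url stresses both memories.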
18Benchmark descriptions-nat
- nat: network address translation
- Uses the source and destination IP addresses and port numbers to compute an index.
- The index is used for a hash-table lookup that retrieves a replacement address and port.
- Each packet accesses the SRAM to look up the hash table.
- SDRAM accesses aren't necessary.
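The index computation and hash-table lookup can be sketched as below. The benchmark's hash function and table size are not given on the slide, so CRC32 and a 1024-bucket table are stand-ins chosen for illustration; `nat_index` and `NatTable` are hypothetical names.

```python
import zlib

TABLE_SIZE = 1024  # hypothetical number of hash-table buckets


def nat_index(src_ip, dst_ip, src_port, dst_port):
    """Compute a table index from the four header fields.

    CRC32 is a stand-in; the benchmark's actual hash is not specified.
    """
    key = f"{src_ip}:{src_port}>{dst_ip}:{dst_port}".encode()
    return zlib.crc32(key) % TABLE_SIZE


class NatTable:
    def __init__(self):
        # The bucket array stands in for the hash table held in SRAM.
        self.buckets = [[] for _ in range(TABLE_SIZE)]

    def add(self, flow, replacement):
        self.buckets[nat_index(*flow)].append((flow, replacement))

    def translate(self, flow):
        """Return the replacement (address, port) for a flow, or None."""
        for stored_flow, replacement in self.buckets[nat_index(*flow)]:
            if stored_flow == flow:
                return replacement
        return None
```

Only packet header fields are consulted, which is consistent with the observation that nat needs no SDRAM accesses.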
19Benchmark descriptions-md4
- md4
- The MD4 algorithm works on arbitrary-length messages and produces a 128-bit fingerprint, or digital signature.
- It can be used to implement a Secure Sockets Layer or firewall at edge routers.
- Moves data packets from SDRAM to SRAM and accesses SRAM multiple times to compute the digital signature.
- The program is both memory and computation intensive.
20Performance observations (1)
- The impact of having more MEs on the total packet throughput
- For memory-intensive benchmarks (url, md4)
- Increasing the number of threads increases memory contention, because the memories are shared among all threads.
- Doubling the core frequency doesn't double the throughput.
- A strange result: throughput decreases for nat, which is not memory bound.
21Performance observations (2)
- nat doesn't have much ME idle time, even when the ME-to-memory speed ratio is 4:1.
- This implies that all MEs are busy.
- The receiving MEs are fast enough, but the transmitting MEs don't release memory slots fast enough.
- Receiving MEs
- Busy requesting new memory slots
- Transmitting MEs
- Busy sending packets
- A receiving-to-transmitting ME ratio of 3:3 might be better than the traditional 4:2 configuration.
22Where does the power go? (1)
- The IXP1200's power distribution among the MEs.
- MEs 0-3: receiving; MEs 4-5: transmitting
23Where does the power go? (2)
- Power consumption
- ALU: 45%
- Control: 28%
- Storage holding instruction operands and results: 13%
- Static power: 7%
24Performance and power observations
- Performance variation is consistent with power variation.
- Both performance and power consumption grow with the addition of MEs, except for nat's performance.
- Two MEs consume less than twice the power of one ME, because ME idle time increases.
- The gap between the two curves widens as the number of MEs increases.
- Power consumption increases faster than performance.
26Reducing processing core power
- Uses dynamic voltage scaling (DVS), a widely used technique, to conserve power in the MEs.
- Reduces voltage and frequency when the processor has low activity
- Increases them when there's a demand for peak processor performance.
- The following ranges comply with Intel's IXP2400 configurations:
- Frequency: 600 MHz to 400 MHz
- Voltage: 1.3 V to 1.1 V
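To get a feel for what these ranges can buy, the standard first-order CMOS model, in which dynamic power scales with V² × f, gives a rough estimate. This model is not stated on the slide; it is a textbook approximation that ignores static power and assumes constant switched capacitance and activity. The function name is illustrative.

```python
def relative_dynamic_power(voltage_v, freq_mhz,
                           ref_voltage_v=1.3, ref_freq_mhz=600.0):
    """Dynamic power relative to a reference operating point.

    Uses the first-order approximation P ~ C * V^2 * f, ignoring static
    power and assuming constant capacitance and activity.
    """
    return (voltage_v / ref_voltage_v) ** 2 * (freq_mhz / ref_freq_mhz)
```

Under this approximation, running at 1.1 V and 400 MHz draws roughly 48% of the dynamic power drawn at 1.3 V and 600 MHz. The overall saving is smaller, because an ME spends only part of its time at the low setting.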
27DVS policy
- ME idle time is abundant because most of the benchmarks are memory bound.
- DVS is applied while the MEs are not very active.
28DVS scheme
- Hardware observes the ME idle time periodically.
- If the percentage of idle time in the past period exceeds a threshold, scale down the voltage and frequency (VF).
- If it is below the threshold, scale up the VF.
29Deploying DVS
- The hardware required to implement the DVS policy is trivial.
- A timer signals after a certain number of cycles (T).
- An accumulator counts the number of ME idle cycles.
- When the timer signals, the accumulator compares its count with a preinitialized threshold.
- Example:
- T = 20000 cycles
- Idle-time threshold = 10% of T = 2000 cycles
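The timer-and-accumulator policy can be sketched in software as follows, using the example values of T = 20000 cycles and a 10% idle threshold. Two details are assumptions not taken from the slides: the intermediate operating point (500 MHz, 1.2 V) and the rule of moving one level per window. `DvsController` and `VF_LEVELS` are hypothetical names.

```python
from dataclasses import dataclass

# Operating points between the slide's limits. The intermediate level and
# the one-step-per-window policy are assumptions, not from the slides.
VF_LEVELS = [(400, 1.1), (500, 1.2), (600, 1.3)]  # (MHz, volts)


@dataclass
class DvsController:
    window_cycles: int = 20000        # timer period T
    idle_threshold: int = 2000        # 10% of T
    level: int = len(VF_LEVELS) - 1   # start at the highest VF
    elapsed: int = 0                  # cycles since the last timer signal
    idle_count: int = 0               # accumulator of idle cycles

    def tick(self, idle: bool):
        """Call once per ME cycle; returns the current (MHz, volts) pair."""
        self.elapsed += 1
        self.idle_count += int(idle)
        if self.elapsed == self.window_cycles:  # the timer signals
            if self.idle_count > self.idle_threshold:
                self.level = max(self.level - 1, 0)  # scale VF down
            elif self.idle_count < self.idle_threshold:
                self.level = min(self.level + 1, len(VF_LEVELS) - 1)  # scale VF up
            self.elapsed = 0
            self.idle_count = 0
        return VF_LEVELS[self.level]
```

A simulator would call `tick` once per simulated cycle for each ME. The sketch leaves out the stall penalty incurred while the voltage and frequency change.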
30Control mechanism
31- DVS can save up to 17% of power consumption with a performance loss of less than 6%.
- DVS hardly affects throughput because the MEs have enough idle cycles to cover the stall penalty.
- We also tested thresholds other than 10% and achieved similar results.