NePSim: A Network Processor Simulator with a Power Evaluation Framework

Provided by: cialCsie

1

NePSim: A Network Processor Simulator with a Power Evaluation Framework
Yan Luo, Jun Yang, Laxmi N. Bhuyan, and Li Zhao
University of California, Riverside

2
Outline

Introduction
Cycle-level simulation
Validation
Power modeling
Power and performance analysis
Reducing processing core power

3
Introduction (1)

Network Processor (NP)
Provides both high performance and flexibility in building powerful routers.
With the exponential increase in clock frequency and core complexity, power dissipation will become a major design consideration in NP development.
Cycle-accurate architecture simulators exist for commercial NPs, but they are not open source:
Intel: Software Development Kit (SDK)
Motorola: C-Ware
These simulators don't incorporate power modeling and evaluation.

4
Introduction (2)

NePSim includes:
A cycle-accurate architecture simulator
An automatic formal verification engine
A parameterizable power estimator covering:
Execution cores
Memory controllers
I/O ports
Packet buffers
High-speed buses
We define our system to comply with the Intel IXP1200.

5
Introduction (3)

We propose low-power techniques tailored to NPs and evaluate them with our NePSim system.
Dynamic voltage scaling (DVS)
Applied to each execution core
Motivated by the abundant idle time we observe (10% to 23% on average), which results from contention in the shared memory
Achieved:
17% power savings for the NP over four application benchmarks
Less than 6% performance loss

6
Cycle-level simulation

A high-level overview of the IXP1200, followed by a description of our simulator's software structure:
Background: Intel IXP1200 and its microengines
The simulator

7
Background: Intel IXP1200 and its microengines

StrongARM core
6 microengines (MEs)
Standard memory interfaces
SDRAM
SRAM
High-speed bus interfaces
IX bus

8
The simulator: NePSim

Implements most of the functionality of the IXP1200.
Does not model the StrongARM core, whose main task is control-plane functions that don't affect the critical path of packet processing.

Why use microcode (not binary)?

Leaves room for instruction extensions in future research
Makes the program easier to modify

9
NePSim body: the ME simulation core

The NePSim body is the module that simulates the following five stages of the ME pipeline:
Instruction lookup
Initial instruction decoding and formation of the source register address
Reading of operands from the source registers
ALU operations, shift or compare operations, and generation of condition codes
Writing of the result to the destination register
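
The way instructions move through these five stages can be sketched with a toy cycle-level loop. This is only an illustration and not NePSim's actual code: the stage names, the `simulate` function, and the instruction labels are hypothetical, and stalls, branches, and memory accesses are deliberately left out.

```python
# Toy model of an ideal 5-stage pipeline (no stalls, no hazards).
# Stage names paraphrase the slide; everything else is illustrative.
STAGES = ["lookup", "decode", "read_operands", "execute", "write_back"]


def simulate(instructions):
    """Return a per-cycle trace.

    trace[c][s] is the instruction occupying stage s during cycle c,
    or None if that stage is empty.
    """
    pipeline = [None] * len(STAGES)
    pending = list(instructions)
    trace = []
    while pending or any(slot is not None for slot in pipeline):
        # Each cycle, every instruction advances one stage; the instruction
        # in the last stage retires and a new one (if any) enters stage 0.
        next_instr = pending.pop(0) if pending else None
        pipeline = [next_instr] + pipeline[:-1]
        if any(slot is not None for slot in pipeline):
            trace.append(list(pipeline))
    return trace


if __name__ == "__main__":
    for cycle, slots in enumerate(simulate(["i0", "i1", "i2"]), start=1):
        print(cycle, dict(zip(STAGES, slots)))
```

With no stalls, n instructions drain in n + 4 cycles. A real cycle-level simulator would add the stall conditions (for example, threads waiting on SRAM or SDRAM) on top of a loop of this shape.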

10
Model components

Device: implements I/O devices such as the I/O ports and the MACs.
Dlite: resembles the debugger in SimpleScalar. For example, it lets users set breakpoints, print pipeline status, display register values, and dump memory content.
Configurable parameters:
Different clock rates and supply voltages for the MEs
SRAM and SDRAM with different latencies and bandwidths
Incoming traffic with different arrival rates and patterns

11
Our simulator vs. Intel's SDK: several advantages

Enables new architecture design
Permits the number of MEs and threads to vary
Provides instruction set extensibility in microcode assembly code
Provides faster execution speed

12
Validation

Average error of 1% in throughput and 6% in average processing time across the four benchmarks.
The simulator can therefore produce relatively dependable results.

13
Power modeling

The IXP1200 uses 0.28 µm technology.
We use 0.25 µm technology because it is the closest available feature size to 0.28 µm.
The IXP1200 core's power, excluding I/O, is 4.5 W at 232 MHz. The modeled components are:
All the MEs (0.468 W each)
Memory units (SRAM: 0.0639 W, SDRAM: 0.0643 W)
IX bus unit (0.363 W)
StrongARM (0.5 W)
Modeled total: 0.468 × 6 + 0.363 + 0.0639 + 0.0643 + 0.5 ≈ 3.8 W
Unaccounted power: 4.5 - 3.8 = 0.7 W
This gap results from our use of a smaller technology (0.25 µm instead of 0.28 µm) and from not modeling the internal buses and the clock.
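
The power budget on this slide is simple arithmetic, reproduced here as a short check. The component figures are the ones listed above and the count of six MEs comes from the IXP1200 overview; the variable names are only illustrative.

```python
# Per-component power figures from the slide, in watts.
ME_POWER_W = 0.468
NUM_MES = 6
OTHER_UNITS_W = {
    "SRAM": 0.0639,
    "SDRAM": 0.0643,
    "IX bus unit": 0.363,
    "StrongARM": 0.5,
}
CHIP_POWER_W = 4.5  # IXP1200 power excluding I/O at 232 MHz, per the slide

# Sum of everything the slide lists, then the part it leaves unaccounted for.
modeled_w = NUM_MES * ME_POWER_W + sum(OTHER_UNITS_W.values())
gap_w = CHIP_POWER_W - modeled_w

print(f"modeled = {modeled_w:.2f} W, unaccounted = {gap_w:.2f} W")
# prints "modeled = 3.80 W, unaccounted = 0.70 W"
```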

14
Power and performance analysis (1)

Assumptions:
Maximum packet arrival rate
16 Ethernet interfaces for receiving
16 Ethernet interfaces for transmitting
SRAM and SDRAM frequency is 116 MHz

15
Power and performance analysis (2)

Benchmarks:
ipfwdr
url
nat
md4
4 receiving MEs and 2 transmitting MEs.
Researchers have found that this 4:2 ratio provides maximum throughput, and we adopted this configuration throughout our experiments.

16
Benchmark descriptions: ipfwdr

ipfwdr
A sample IP forwarding application provided in Intel's SDK.
Processing includes Ethernet and IP header validation and a trie-based routing-table lookup.
The routing table resides in SRAM, and the output port information is in SDRAM.
The packet is sent to the next-hop router on the basis of the output port information.
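
A trie-based routing-table lookup of the kind ipfwdr performs can be sketched as a longest-prefix match over a binary trie. This is a minimal sketch under stated assumptions, not the SDK's implementation: the slides do not describe the trie layout, so the one-bit-per-level `TrieNode`, the `insert` and `lookup` helpers, and the example prefixes and port numbers are all hypothetical. It handles IPv4 only.

```python
import ipaddress


class TrieNode:
    """One bit of the destination address per trie level (illustrative layout)."""

    def __init__(self):
        self.children = [None, None]  # child for bit 0 and bit 1
        self.port = None              # output port if a prefix ends here


def insert(root, prefix, port):
    """Add an IPv4 prefix such as '10.1.0.0/16' with its output port."""
    net = ipaddress.ip_network(prefix)
    bits = int(net.network_address)
    node = root
    for i in range(net.prefixlen):
        bit = (bits >> (31 - i)) & 1
        if node.children[bit] is None:
            node.children[bit] = TrieNode()
        node = node.children[bit]
    node.port = port


def lookup(root, addr):
    """Return the port of the longest matching prefix, or None if none matches."""
    bits = int(ipaddress.ip_address(addr))
    node, best = root, root.port
    for i in range(32):
        node = node.children[(bits >> (31 - i)) & 1]
        if node is None:
            break
        if node.port is not None:
            best = node.port  # remember the deepest match seen so far
    return best


if __name__ == "__main__":
    table = TrieNode()
    insert(table, "10.0.0.0/8", 1)
    insert(table, "10.1.0.0/16", 2)
    print(lookup(table, "10.1.2.3"))  # matches the /16 entry
    print(lookup(table, "10.2.0.1"))  # falls back to the /8 entry
```

Each level of the walk corresponds to one table read, which is why a lookup of this kind generates a stream of SRAM accesses per packet.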

17
Benchmark descriptions: url

url
Routes packets on the basis of the URL request they contain.
Often has to examine the payload of packets when processing them.
Performs a string-matching algorithm that we ported from NetBench.
String patterns are initialized in SRAM, so url's code must generate SRAM accesses in later comparisons.
Also, because payloads must be scanned for pattern matching, many requests are generated to SDRAM, which stores the payload data.

18
Benchmark descriptions: nat

nat: network address translation
Uses the source and destination IP addresses and port numbers to compute an index.
The index is used for a hash-table lookup that retrieves a replacement address and port.
Each packet accesses the SRAM to look up the hash table.
SDRAM accesses aren't necessary.
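
The nat lookup described above can be sketched as follows. The slides do not give the hash function or the table layout, so the XOR-fold in `nat_index`, the table size, the chained buckets, and the example addresses are all assumptions made for illustration.

```python
import ipaddress

TABLE_SIZE = 256  # hypothetical number of hash buckets


def nat_index(src_ip, dst_ip, src_port, dst_port):
    """Fold the 4-tuple into a bucket index (illustrative hash, not nat's own)."""
    mixed = int(ipaddress.ip_address(src_ip)) ^ int(ipaddress.ip_address(dst_ip))
    mixed ^= (src_port << 16) | dst_port
    mixed ^= mixed >> 16
    mixed ^= mixed >> 8
    return mixed % TABLE_SIZE


# Each bucket holds (flow, replacement) pairs so colliding flows stay distinct.
table = [[] for _ in range(TABLE_SIZE)]


def add_mapping(flow, replacement):
    """Register the replacement (address, port) for a flow 4-tuple."""
    table[nat_index(*flow)].append((flow, replacement))


def translate(flow):
    """Return the replacement (address, port) for a flow, or None if unknown."""
    for key, replacement in table[nat_index(*flow)]:
        if key == flow:
            return replacement
    return None


if __name__ == "__main__":
    flow = ("192.168.1.10", "93.184.216.34", 40000, 80)
    add_mapping(flow, ("203.0.113.5", 61000))
    print(translate(flow))
```

On the NP the table would live in SRAM, so each call to `translate` corresponds to the per-packet SRAM access mentioned on the slide.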

19
Benchmark descriptions: md4

md4
The MD4 algorithm works on arbitrary-length messages and produces a 128-bit fingerprint, or digital signature.
It is used to implement a Secure Sockets Layer or firewall at edge routers.
Moves data packets from SDRAM to SRAM and accesses SRAM multiple times to compute the digital signature.
The program is both memory and computation intensive.

20
Performance observations (1)

The impact of having more MEs on the total packet throughput:
For memory-intensive benchmarks (url and md4), increasing the number of threads means increasing memory contention, because memories are shared among all threads.
Doubling the core frequency doesn't double the throughput.
Strange result: nat's throughput decreases, even though nat is not memory bound.

21
Performance observations (2)

nat doesn't have much ME idle time, even when the ME-to-memory speed ratio is 4:1.
This implies that all MEs are busy.
The receiving MEs are fast enough, but the transmitting MEs don't release memory slots fast enough.
Receiving MEs:
Busy requesting new memory slots
Transmitting MEs:
Busy sending packets
A receive/transmit ME ratio of 3:3 might be better than the traditional 4:2 configuration.

22
Where does the power go? (1)

IXP1200's power distribution among the MEs.
MEs 0-3: receiving; MEs 4-5: transmitting

23
Where does the power go? (2)

Power consumption breakdown:
ALU: 45%
Control: 28%
Registers, where instruction operands and results reside: 13%
Static power: 7%

24
Performance and power observations

Performance variation is consistent with power variation.
Both performance and power consumption grow with the addition of MEs, except for nat's performance.
Two MEs consume less than twice the power of one ME because ME idle time increases.
The gap between the two curves widens as the number of MEs increases.
Power consumption increases faster than performance.

26
Reducing processing core power

Dynamic voltage scaling (DVS) is a widely used way to conserve power, applied here to the MEs.
It reduces voltage and frequency when the processor has low activity,
and increases them when there's a demand for peak processor performance.
The ranges below comply with Intel's IXP2400 configurations:
Frequency: 600 MHz down to 400 MHz
Voltage: 1.3 V down to 1.1 V
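
To get a feel for what this range can buy, the sketch below applies the standard first-order CMOS relation, dynamic power proportional to V² × f. This is a textbook approximation and not NePSim's power model: it ignores static power and assumes equal switching activity at both operating points. The function name is hypothetical.

```python
def relative_dynamic_power(v, f_mhz, v_ref=1.3, f_ref_mhz=600.0):
    """Dynamic power at (v, f_mhz) relative to the reference operating point.

    Uses the first-order approximation P_dyn ~ V^2 * f.
    """
    return (v / v_ref) ** 2 * (f_mhz / f_ref_mhz)


# Lowest operating point from the slide versus the highest.
ratio = relative_dynamic_power(1.1, 400.0)
print(round(ratio, 3))  # prints 0.477
```

Under these assumptions an ME running at the lowest setting draws a little under half the dynamic power of one at the highest setting. The chip-level saving is smaller, because MEs spend only part of the time scaled down and other units are not scaled.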

27
DVS policy

ME idle time is abundant because most of the
benchmarks are memory bound.
Apply DVS while the MEs are not very active.

28
DVS scheme

Hardware observes the ME idle time periodically.
If the percentage of idle time in the past period exceeds a threshold, scale down the voltage and frequency (VF).
If it is below the threshold, scale up the VF.

29
Deploying DVS

The hardware required to implement the DVS policy is trivial.
A timer signals after a certain number of cycles, T.
An accumulator counts the number of ME idle cycles.
When the timer signals, the accumulator's count is compared with a preinitialized idle-time threshold.
Example:
T = 20,000 cycles
Idle-time threshold = 10% of T = 2,000 cycles
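
The timer-and-accumulator mechanism can be sketched in software as follows. The period of 20,000 cycles, the 10% threshold, and the 400 to 600 MHz range come from the slides; the `DvsController` class, the 50 MHz step between levels, and the choice to start at the highest level are assumptions made for illustration, and the matching voltage for each level is omitted.

```python
class DvsController:
    """Per-ME idle-time monitor that steps frequency down or up once per period."""

    def __init__(self, period=20000, threshold_percent=10,
                 levels_mhz=(400, 450, 500, 550, 600)):
        self.period = period
        # Integer arithmetic: 10% of 20000 cycles is exactly 2000.
        self.threshold = period * threshold_percent // 100
        self.levels = levels_mhz          # hypothetical 50 MHz steps
        self.level = len(levels_mhz) - 1  # assume we start at full speed
        self.cycles = 0                   # timer
        self.idle = 0                     # accumulator of idle cycles

    def tick(self, is_idle):
        """Advance one cycle and return the frequency currently in force."""
        self.cycles += 1
        if is_idle:
            self.idle += 1
        if self.cycles == self.period:    # the timer signals
            if self.idle > self.threshold:
                self.level = max(self.level - 1, 0)        # scale VF down
            elif self.idle < self.threshold:
                self.level = min(self.level + 1, len(self.levels) - 1)  # scale VF up
            self.cycles = 0
            self.idle = 0
        return self.levels[self.level]


if __name__ == "__main__":
    ctrl = DvsController()
    for cycle in range(20000):
        freq = ctrl.tick(is_idle=cycle < 3000)  # 15% idle in this window
    print(freq)
```

A window with 3,000 idle cycles exceeds the 2,000-cycle threshold, so the controller drops one level; a following window with little idle time raises it again.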

30
Control mechanism
31

DVS can save up to 17% of power consumption with a performance loss of less than 6%.
DVS hardly affects throughput because the MEs have enough idle cycles to cover the stall penalty.
We also tested thresholds other than 10% and achieved similar results.
