Frequent Loop Detection Using Efficient Non-Intrusive On-Chip Hardware

Slides: 35

Learn more at: http://www.ann.ece.ufl.edu

Transcript and Presenter's Notes

1
Frequent Loop Detection Using Efficient
Non-Intrusive On-Chip Hardware

Ann Gordon-Ross and Frank Vahid
Department of Computer Science and Engineering
University of California, Riverside
Also with the Center for Embedded Computer
Systems, UC Irvine

This work was supported in part by the U.S.
National Science Foundation and a U.S. Dept. of
Education GAANN Fellowship.

International Conference on Compilers,
Architecture, and Synthesis for Embedded Systems,
2003.
2
System Optimizations - Static

Specialization of a system for a particular
application or suite of applications to improve
power consumption and/or performance
Static optimizations are performed at design time
by the designer
There are many static optimization approaches
Critical regions can be partitioned to
configurable logic
Constant propagation and code specialization for
statically determined invariant variables
Critical regions can be locked to a specialized
cache
Many more

3
System Optimizations - Static
[Figure labels: Power Consumption, Modeled input stimulus, Design Time. Image source: http://www.ims.uconn.edu/metal/Undergrad/microchip.jpg]
4
Static Optimization Drawbacks

Designer must perform optimizations
May disrupt standard software tool flows
Runtime environment may present optimization
opportunities that are not evident during design
time simulation
Simulation is usually used
Setting up a testing framework with realistic
input stimuli may take months
Exploration may take weeks to run just one of
hundreds of possible configurations

5
System Optimizations - Dynamic

Dynamic software optimizations are becoming
increasingly popular for improving software
performance and power.
Dynamic optimizations are performed in system
during runtime

[Figure labels: Designer, End Product, Power Consumption, Runtime Input, Execution Time]
6
System Optimizations - Dynamic

There are many dynamic optimization approaches
Dynamo performs dynamic software optimizations on
the most frequently executed regions of code
Frequently executed regions of code can be
remapped to non-interfering cache locations
Dynamic binary translation methods store
translation results of frequently executed
regions of code for quick look-up
Value profiling can determine runtime invariant
variables for constant propagation and/or code
specialization
Many others.

7
Dynamic Optimizations - Effectiveness

To be most effective, dynamic optimizations are
typically applied to the most frequently executed
regions of code
For a large selection of the MediaBench benchmark
suite, we observed that 90% of the execution time
was spent in approximately 10% of the code
Profiling is used to determine the critical
regions of code

8
Previous Profiling Methods
Desktop
Desktop-targeted profiling methods
Instrumentation and sampling
These methods are unsuitable for embedded systems
They disrupt run-time behavior

Embedded
Early methods used logic analyzers
Not possible for today's systems-on-a-chip (SoCs)
The JTAG standard allows internal registers to be
read
Typically used for testing and debugging
Interrupts the processor to write internal
information to external pins
9
Profiling Methodology Goal

The goal of our profiling approach is to design a
profiling tool suitable for embedded systems to
determine the most critical regions of code

10
Critical Region Detection - Operational
Requirements

Non-intrusion
Important for real-time systems
Minimizes the impact on current tool chains, i.e.,
no special compilers or binary modification tools
Low power
Battery operated systems
Systems with limited cooling
Small area
Less significant due to the large transistor
capacities of current and future chips
Accuracy
Exact results are not required for the
information to be useful -- instead, reasonable
accuracy is acceptable

11
Frequent Loop Detection

We analyzed the critical regions for various
Powerstone and MediaBench benchmarks
We translate the problem of finding the critical
regions to finding the frequently executed loops
A short backwards branch (sbb) instruction is
typically the last instruction of a loop

All Critical Regions
15% - Subroutines with no inner loops
85% - Small inner loops
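
The sbb test above can be sketched in C as follows. This is only an illustration: the slides do not say how "short" is defined, so the `max_distance` parameter (in bytes) and the function name `is_sbb` are assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Returns true when a taken branch located at `pc` jumps backwards to
 * `target` by at most `max_distance` bytes. The threshold is a
 * hypothetical parameter; the slides do not give a value. */
static bool is_sbb(uint32_t pc, uint32_t target, uint32_t max_distance)
{
    assert(max_distance > 0);
    return target < pc && (pc - target) <= max_distance;
}
```

A branch at 0x1040 that jumps to 0x1000 would qualify with a 256-byte threshold, while a forward branch or a long backward jump would not.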
12
Percentage of Execution Time for Frequent Loops

In addition to detecting frequent loops, we also
want to know each loop's percentage contribution
to total execution time.

          Application X   Application Y
Loop A         32%             10%
Loop B         33%             10%
Loop C         35%             80%
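
A minimal sketch of how per-loop percentages could be derived from raw counts, assuming the count of sbb executions alone is used as a proxy for execution time (the slides do not state whether loop body size is also weighted in). The name `loop_percentages` is illustrative.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Writes each loop's share of the total count, in percent, into pct[].
 * If every count is zero, all shares are reported as 0. */
static void loop_percentages(const uint32_t *freq, size_t n, double *pct)
{
    assert(freq != NULL && pct != NULL);
    uint64_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += freq[i];
    for (size_t i = 0; i < n; i++)
        pct[i] = total ? 100.0 * (double)freq[i] / (double)total : 0.0;
}
```

With counts of 10, 10, and 80 this reproduces the Application Y column above.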
13
Frequent Loop Detection - Cache Based Architecture
[Figure: cache-based frequent loop detector. Labels: To L1 Memory, rd/wr, addr, data, sbb, saturation]
14
Cache Operation
Sbb Trace
1111001
15
Cache Operation - Conflict Resolution

Resolve most conflicts using associativity and an
LRU replacement policy for further conflicts
Further conflicts may cause frequent loops to
constantly be replaced in the cache - thrashing
Our experiments did not suffer from this
contention but a victim buffer may be added if
necessary
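
As a rough software model of the cache described above, the sketch below uses the 2-way, 32-entry organization reported later in the talk. The field names, the timestamp-based LRU, and the choice to index by word address are assumptions made for illustration, not details taken from the slides.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define NUM_SETS 16  /* 32 entries / 2 ways */
#define NUM_WAYS 2

typedef struct {
    bool     valid;
    uint32_t tag;       /* sbb address bits above the set index */
    uint32_t freq;      /* execution count for this sbb */
    uint32_t last_use;  /* timestamp for LRU; larger means more recent */
} Entry;

typedef struct {
    Entry    sets[NUM_SETS][NUM_WAYS];
    uint32_t clock;
} LoopCache;

static void cache_init(LoopCache *c) { memset(c, 0, sizeof *c); }

/* Records one execution of the sbb at `addr`. Returns true on a hit. */
static bool cache_update(LoopCache *c, uint32_t addr)
{
    assert(c != NULL);
    uint32_t word = addr >> 2;  /* assumption: 4-byte aligned instructions */
    Entry *set = c->sets[word % NUM_SETS];
    uint32_t tag = word / NUM_SETS;
    Entry *victim = &set[0];

    c->clock++;
    for (int w = 0; w < NUM_WAYS; w++) {
        if (set[w].valid && set[w].tag == tag) {
            set[w].freq++;
            set[w].last_use = c->clock;
            return true;
        }
        /* Prefer an empty way; otherwise pick the least recently used. */
        if (!set[w].valid) {
            if (victim->valid)
                victim = &set[w];
        } else if (victim->valid && set[w].last_use < victim->last_use) {
            victim = &set[w];
        }
    }
    victim->valid = true;
    victim->tag = tag;
    victim->freq = 1;
    victim->last_use = c->clock;
    return false;
}
```

If three sbb addresses map to the same set, the one touched least recently is evicted, which is the thrashing risk the slide mentions for frequent loops that keep colliding.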

16
Cache Operation - Frequency Width

Our goal is to find the smallest possible cache
needed to determine the frequent loops
We keep the cache small by allowing the frequency
field width to be varied
If the frequency field is too small, saturations
can occur and frequency information may be lost

17
Cache Operation - Frequency Counter Saturation
All frequencies are divided by 2 with a shift
right (built as a special feature of the cache
and activated by asserting the saturation signal
to the cache)

Example from the figure (sbb address, frequency):
1111001   255
1101010   100
1011010   2
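
The saturation step can be sketched as below, assuming counters of a configurable bit width and assuming the pending increment is applied after the shift (the slides do not say which happens first). The function name is illustrative.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Increments freq[hit] in a table of `n` counters, each `width` bits wide.
 * If that counter is already at its maximum, every counter is first
 * halved with a right shift, which roughly preserves relative ratios. */
static void increment_with_saturation(uint32_t *freq, size_t n,
                                      size_t hit, unsigned width)
{
    assert(freq != NULL && width >= 1 && width <= 32 && hit < n);
    uint32_t max = (width == 32) ? UINT32_MAX
                                 : ((UINT32_C(1) << width) - 1);
    if (freq[hit] == max) {
        for (size_t i = 0; i < n; i++)
            freq[i] >>= 1;
    }
    freq[hit]++;
}
```

With 8-bit counters holding 255, 100, and 2, another hit on the first entry leaves 128, 50, and 1. The small counter loses precision, which is the information loss the previous slide warns about.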
18
Experimental Setup

We ran extensive experiments to determine the
best frequent loop cache configuration
Cache configurations simulated (336 configurations)
Cache Sizes: 16, 32, and 64 entries
X
Cache Associativities: 1, 2, 4, and 8-way
X
Frequency Counter Field Widths: 4 to 32 bits
To determine the accuracy of each cache
configuration we wrote a trace simulator for the
cache architecture in C
19
Experimental Setup

Benchmarks
Selected Powerstone benchmarks running on a
32-bit MIPS instruction set simulator
Selected MediaBench benchmarks running on
SimpleScalar
Power consumption
UMC 0.18-micron CMOS technology running at 250
MHz at 1.8V
Cache memory power consumption obtained using the
Artisan memory compiler
Additional logic and functionality modeled in
synthesizable VHDL using Synopsys Design Compiler

20
Accuracy - Sum of Differences

We computed the average difference between the
actual loop execution time percentage and the
computed loop execution time percentage for the
ten most frequently executed loops
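
One plausible reading of this metric, sketched in C: the mean of the absolute differences between actual and computed percentages. Whether the authors used absolute values is an assumption; the slide only says "average difference".

```c
#include <assert.h>
#include <stddef.h>

/* Mean absolute difference between two arrays of percentages. */
static double average_difference(const double *actual,
                                 const double *computed, size_t n)
{
    assert(actual != NULL && computed != NULL && n > 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        double d = actual[i] - computed[i];
        sum += (d < 0.0) ? -d : d;
    }
    return sum / (double)n;
}
```

For the ten most frequent loops, `n` would be 10 and each array would hold one percentage per loop.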

21
Results - Sum of Differences

Sum of differences results averaged over all
Powerstone benchmarks

22
Results - Best Cache Configuration

We determine the smallest possible cache
configuration necessary to give good results
Overall best cache configuration -
2-way 32-entry cache with a frequency width of 24
bits
95% accuracy for Powerstone
90% accuracy for MediaBench
No change to system performance

23
Results - Base System Power and Area

MIPS32 4Kp microprocessor core
Embedded processor with a cache
Small - area of 1.7 mm²
Low power - 528 mW

24
Results - Frequent Loop Detector Power Overhead

For the best cache configuration -
Increase in average power consumption of the
total system with frequent loop detector is 2.4%

Power Consumption of Operations
142 mW per cache read and increment
156 mW per cache write
20.7 mW per saturation
Average Frequency of Operations
Cache updates - 4.25%
Saturations - 0.000051%
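
The figures on this slide appear to combine as a weighted average. The sketch below is one way to arrive at roughly the reported overhead. It assumes that the operation frequencies are percentages of execution, that each cache update costs a read-and-increment followed by a write, and that the 528 mW base power from the previous slide is the reference. None of these assumptions is stated explicitly in the slides.

```c
#include <assert.h>

/* Average detector power in mW. Arguments are the fractions of
 * execution (0.0 to 1.0) during which each operation occurs. */
static double detector_avg_power_mw(double update_fraction,
                                    double saturation_fraction)
{
    assert(update_fraction >= 0.0 && saturation_fraction >= 0.0);
    const double read_inc_mw = 142.0;    /* cache read and increment */
    const double write_mw = 156.0;       /* cache write */
    const double saturation_mw = 20.7;   /* shift of all counters */
    return update_fraction * (read_inc_mw + write_mw)
         + saturation_fraction * saturation_mw;
}

/* Overhead relative to the base system, in percent. */
static double overhead_percent(double detector_mw, double base_mw)
{
    assert(base_mw > 0.0);
    return 100.0 * detector_mw / base_mw;
}
```

Calling `overhead_percent(detector_avg_power_mw(0.0425, 0.00000051), 528.0)` gives about 2.4, in line with the overhead quoted above.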
25
Results - Frequent Loop Detector Area Overhead

Resulting area overhead of 10.5% compared to the
reported size of the MIPS32 4Kp
Our numbers are pessimistic while reported
microprocessor areas are likely optimistic

Area Overhead
Frequent loop cache controller, incrementor and additional control/steering logic: 1400 gates (0.0012 mm²)
Cache including saturation logic: 0.167 mm²
26
Reducing Power Overhead via Frequent Update
Coalescing

Since frequent loops tend to iterate many times,
the same entry is updated in the frequent loop
cache many times in a row
Coalesce consecutive sbb executions into one
cache update to reduce cache updates

Sbb Trace
1110 1110 1110 1110

Without coalescing: increment frequency four
times (one cache update per sbb)
With coalescing: add 4 to frequency (one cache
update)
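
A minimal sketch of coalescing, assuming a single pending-run register in front of the cache. The `Coalescer` type and the out-parameter interface are hypothetical; the slides do not describe the hardware at this level.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t addr;   /* sbb address of the run being coalesced */
    uint32_t count;  /* consecutive executions seen; 0 means empty */
} Coalescer;

/* Feeds one sbb execution to the coalescer. Returns true when the
 * previous run has ended; *out_addr and *out_amount then describe the
 * single cache update to perform ("add amount to frequency"). */
static bool coalesce_sbb(Coalescer *c, uint32_t addr,
                         uint32_t *out_addr, uint32_t *out_amount)
{
    assert(c != NULL && out_addr != NULL && out_amount != NULL);
    if (c->count > 0 && c->addr == addr) {
        c->count++;              /* same loop again: only count it */
        return false;
    }
    bool flush = c->count > 0;   /* a different sbb ends the old run */
    if (flush) {
        *out_addr = c->addr;
        *out_amount = c->count;
    }
    c->addr = addr;
    c->count = 1;
    return flush;
}
```

Feeding the trace 1110 four times produces no cache traffic; the first different sbb then triggers one update that adds 4 to the entry for 1110.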
27
Reducing Power Overhead via Frequent Update
Coalescing
[Charts: results for MediaBench and Powerstone]
28
Sampling for Further Reduced Power Overhead

Instead of tallying every sbb executed, only
tally sbbs that occur at a fixed sampling
interval
This method does not require interrupting the
processor.
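
Sampling can be sketched with a simple down-counter. This assumes the interval is measured in sbb executions; the slides do not say whether it is counted in sbbs, instructions, or cycles. The `Sampler` type is illustrative.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t interval;   /* tally one sbb out of every `interval` */
    uint32_t countdown;  /* sbbs remaining until the next tally */
} Sampler;

static void sampler_init(Sampler *s, uint32_t interval)
{
    assert(s != NULL && interval > 0);
    s->interval = interval;
    s->countdown = interval;
}

/* Call once per executed sbb. Returns true when this sbb should be
 * tallied in the frequent loop cache. */
static bool sampler_tick(Sampler *s)
{
    assert(s != NULL && s->countdown > 0);
    if (--s->countdown == 0) {
        s->countdown = s->interval;
        return true;
    }
    return false;
}
```

With an interval of 50, only every fiftieth sbb reaches the cache. Because the counter runs beside the processor, no interrupt is needed.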

29
Sum of Differences for Sampling
[Charts: sum of differences for Powerstone and MediaBench]
30
Sum of Differences for Sampling

When going from a sampling interval of 1 to 50 -
Average accuracy decreases for Powerstone
benchmarks by 5%
Average accuracy increases for MediaBench
benchmarks by 2%

31
Results for a Sampling Interval of 50

Coalescing plus sampling reduces the average
system power overhead to a mere 0.02%
Still no change to system performance

Average Frequency of Operations
Cache updates - 0.03%
Coalesces - 0.06%
No Saturations
32
Example Use - Warp Processing

The detector has been successfully incorporated
into a novel prototype system-on-a-chip
architecture performing what is presently known
as warp processing (also being developed at UCR)

33
Warp Processing
[Figure: warp processing architecture. Blocks: µP, Profiler, Mem, Configurable Logic, Dynamic Partitioning Module]
34
Conclusions

We have presented a frequent loop detector that
is small, power-efficient, and non-intrusive, and
that accurately provides relative frequencies of
loops
2-way set-associative 32-entry cache with a
24-bit frequency counter
Power overhead of 2.4% compared to a low-power
32-bit embedded processor
Power overhead is easily reducible to well below
0.1% using simple coalescing and sampling methods
Currently being used in the profiling step of the
Warp processor at UCR