Adaptive Cache Compression for High-Performance Processors
Alaa R. Alameldeen and David A. Wood
Computer Sciences Department, University of Wisconsin-Madison
Outline
- Introduction
- Motivation
- Adaptive Cache Compression
- Evaluation Methodology
- Reported Performance
- Review Conclusion
- Critique/Suggestions
Introduction
- The increasing performance gap between processors and memory calls for faster memory access.
- Cache memories reduce average memory latency.
- Cache compression improves the effectiveness of cache memories.
- Adaptive cache compression is the theme of this discussion.
Motivation
- Cache compression can improve the effectiveness of cache memories by increasing effective cache capacity.
- Increasing effective cache capacity reduces the miss rate.
- A lower miss rate improves performance.
Adaptive Cache Compression: An Overview
- Dynamically optimize cache performance.
- Use the past to predict the future:
  - How likely is compression to help, hurt, or make no difference to the next reference?
- Feedback from previous compressions helps decide whether to compress the next write to the cache.
Adaptive Cache Compression: Implementation
- Two-level cache hierarchy:
  - L1 caches (data, instruction) are uncompressed.
  - L2 cache is unified and optionally compressed.
  - Compression/decompression is used or skipped as necessary.
- Pros: L1 cache performance is not affected.
- Cons: compression/decompression introduces latency.
Adaptive Cache Compression: L2 Cache Detail
- 8-way set associative.
- A compression information tag is stored with each address tag.
- 32 segments (8 bytes each) in each set.
- An uncompressed line occupies 8 segments (at most 4 uncompressed lines per set).
- Compressed lines are 1 to 7 segments long.
- Maximum number of lines in each set: 8.
- Least recently used (LRU) lines are evicted.
- Compaction may be used to make room for a new line.
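To make the accounting concrete, here is a minimal C++ sketch with hypothetical names and structure (the slides describe the organization, not an implementation): each set holds up to 8 lines sharing 32 data segments, and LRU lines are evicted until a new line fits.

#include <cstddef>
#include <cstdint>
#include <deque>
#include <iostream>

// One L2 set: up to 8 lines sharing 32 data segments of 8 bytes each.
constexpr int kSegmentsPerSet = 32;
constexpr std::size_t kMaxLinesPerSet = 8;

struct Line {
    uint64_t tag;       // address tag
    int      segments;  // compressed size: 1..7; uncompressed: 8
};

struct Set {
    std::deque<Line> lines;  // front = most recently used, back = LRU

    int usedSegments() const {
        int used = 0;
        for (const Line& l : lines) used += l.segments;
        return used;
    }

    // Allocate a new line (size 1..8 segments), evicting LRU lines until it fits.
    void allocate(uint64_t tag, int segments) {
        while (lines.size() >= kMaxLinesPerSet ||
               usedSegments() + segments > kSegmentsPerSet) {
            lines.pop_back();  // evict least recently used
        }
        lines.push_front({tag, segments});
        // Compaction (sliding segments together) is implied: we only
        // track total occupancy, not segment placement.
    }
};

int main() {
    Set set;
    for (uint64_t t = 0; t < 6; ++t) set.allocate(t, 5);  // 6 compressed lines
    std::cout << "lines: " << set.lines.size()
              << ", segments used: " << set.usedSegments() << "\n";  // 6, 30
}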
Adaptive Cache Compression: To Compress or Not to Compress?
- While compression eliminates some L2 misses, it increases the latency of L2 hits (which are more frequent).
- However, the penalty for an L2 miss is usually large, and the extra latency due to decompression is usually small.
- Compression helps if:
  (avoided L2 misses) x (L2 miss penalty) > (penalized L2 hits) x (decompression penalty)
- Example: for a 5-cycle decompression penalty and a 400-cycle L2 miss penalty, compression wins if it eliminates at least one L2 miss for every 400/5 = 80 penalized L2 hits.
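The break-even condition is easy to check numerically; a toy C++ sketch using the example's penalty values (the function name is hypothetical):

#include <iostream>

// Compression wins when avoided misses repay the decompression penalties.
bool compressionHelps(long avoidedMisses, long penalizedHits,
                      long missPenaltyCycles, long decompPenaltyCycles) {
    return avoidedMisses * missPenaltyCycles >
           penalizedHits * decompPenaltyCycles;
}

int main() {
    const long missPenalty = 400, decompPenalty = 5;
    // Break-even ratio: one avoided miss pays for 400/5 = 80 penalized hits.
    std::cout << "penalized hits per avoided miss at break-even: "
              << missPenalty / decompPenalty << "\n";
    std::cout << std::boolalpha
              << compressionHelps(1, 79, missPenalty, decompPenalty) << "\n"   // true
              << compressionHelps(1, 81, missPenalty, decompPenalty) << "\n";  // false
}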
Adaptive Cache Compression: Classification of Cache References
- Classification of hits:
  - Unpenalized hit (e.g., a reference to address A)
  - Penalized hit (e.g., a reference to address C)
  - Avoided miss (e.g., a reference to address E)
- Classification of misses:
  - Avoidable miss (e.g., a reference to address G)
  - Unavoidable miss (e.g., a reference to address H)
[Figure: example set contents for addresses A-H, with evicted lines marked]
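The taxonomy can be summarized in code. The sketch below is a simplified, hypothetical rendering: it reduces the "would the line be resident uncompressed/compressed?" tests to an LRU-depth check and a single flag.

#include <iostream>

// Reference classes for an 8-way compressed set, where an uncompressed set
// holds at most 4 lines (8 segments each out of 32).
enum class RefClass {
    UnpenalizedHit,  // hit on an uncompressed line
    PenalizedHit,    // hit on a compressed line (pays decompression)
    AvoidedMiss,     // hit only because compression kept the line resident
    AvoidableMiss,   // miss, but compression could have kept the line
    UnavoidableMiss  // miss even with all lines compressed
};

// lruDepth: 1 = most recently used. fitsCompressed: would the line still
// fit in the set's 32 segments if all lines were stored compressed?
RefClass classify(bool hit, bool lineCompressed, int lruDepth,
                  bool fitsCompressed) {
    if (hit) {
        // An uncompressed set holds at most 4 lines, so a hit deeper in the
        // LRU stack is resident only thanks to compression (simplified rule).
        if (lruDepth > 4) return RefClass::AvoidedMiss;
        return lineCompressed ? RefClass::PenalizedHit
                              : RefClass::UnpenalizedHit;
    }
    return fitsCompressed ? RefClass::AvoidableMiss
                          : RefClass::UnavoidableMiss;
}

int main() {
    // A hit at LRU depth 6 exists only thanks to compression: an avoided miss.
    std::cout << (classify(true, true, 6, true) == RefClass::AvoidedMiss)
              << "\n";  // prints 1
}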
Adaptive Cache Compression: Hardware Used in Decision-Making
- Global Compression Predictor (GCP): estimates the recent cost or benefit of compression.
- On a penalized hit, the controller biases against compression by decrementing the counter (subtracted value = decompression penalty).
- On an avoided or avoidable miss, the controller increments the counter by the L2 miss penalty.
- The controller consults the GCP when allocating a line in the L2 cache:
  - Positive value -> compression has helped, so compress.
  - Negative value -> compression has been penalizing, so don't compress.
- The size of the GCP determines its sensitivity to changes; this paper uses a 19-bit counter (saturating at 262143 or -262144).
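A sketch of the GCP as a 19-bit saturating counter, following the update rules above; class and method names are hypothetical, and the penalty values in main() are the ones from the earlier example.

#include <algorithm>
#include <cstdint>
#include <iostream>

// Global Compression Predictor: a 19-bit saturating counter that tracks
// the recent net benefit (in cycles) of storing lines compressed.
class GlobalCompressionPredictor {
    int32_t counter_ = 0;
    static constexpr int32_t kMax = 262143;   //  2^18 - 1
    static constexpr int32_t kMin = -262144;  // -2^18

public:
    // Penalized hit: compression cost us a decompression; bias against it.
    void onPenalizedHit(int32_t decompressionPenalty) {
        counter_ = std::max<int32_t>(kMin, counter_ - decompressionPenalty);
    }
    // Avoided or avoidable miss: compression saved (or could have saved)
    // a full miss penalty; bias toward it.
    void onAvoidedOrAvoidableMiss(int32_t missPenalty) {
        counter_ = std::min<int32_t>(kMax, counter_ + missPenalty);
    }
    // Consulted on each L2 allocation; the zero initial state is treated
    // as "compress" in this sketch.
    bool shouldCompress() const { return counter_ >= 0; }
};

int main() {
    GlobalCompressionPredictor gcp;
    gcp.onAvoidedOrAvoidableMiss(400);                    // one avoided miss...
    for (int i = 0; i < 79; ++i) gcp.onPenalizedHit(5);   // ...pays for 79 hits
    std::cout << std::boolalpha << gcp.shouldCompress() << "\n";  // true
}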
Adaptive Cache Compression: Sensitivity
- Effectiveness depends on the workload's size, the cache sizes, and latencies.
- Sensitive to L2 cache size (effective for small L2 caches).
- Sensitive to L1 cache size (observe the trade-offs).
- Adapting to benchmark phases:
  - Changes in phase behaviour may hurt the adaptive policy.
  - Adaptation takes time.
Evaluation Methodology
- Host system: dynamically-scheduled SPARC V9 uniprocessor
- Target system: superscalar processor with out-of-order execution
- Simulation parameters: [table not reproduced]
Evaluation Methodology (continued)
- Simulator: Simics full-system simulator, extended with a detailed processor simulator (TFSim) and a detailed memory-system timing simulator.
- Workloads:
  - Multi-threaded commercial workloads from the Wisconsin Commercial Workload Suite
  - Eight of the SPECcpu2000 benchmarks:
    - Integer benchmarks (bzip, gcc, mcf, twolf)
    - Floating-point benchmarks (ammp, applu, equake, swim)
- Workloads were selected to cover a wide range of compressibility properties, miss rates, and working-set sizes.
Evaluation Methodology (continued)
- To understand the utility of adaptive compression, it was compared with two extreme policies: Never compress and Always compress.
- Never strives to reduce hit latency.
- Always strives to reduce miss rate.
- Adaptive strives to optimize the trade-off between the two.
Reported Performance: Average Cache Capacity

[Figure: average cache capacity during benchmark runs (4 MB uncompressed)]
Reported Performance: Cache Miss Rate

[Figure: L2 cache miss rate, normalized to the Never miss rate]
Reported Performance: Runtime

[Figure: runtime for the three compression alternatives, normalized to Never]
Reported Performance: Sensitivity of Adaptive Compression to Benchmark Phase Changes

[Figure: top, temporal changes in Global Compression Predictor values; bottom, effective cache size]
Review Conclusion
- Compressing all compressible cache lines only improves memory-intensive applications; applications with low miss rates or low compressibility suffer.
- Optimizations achieved by the adaptive scheme:
  - Up to 26% speedup (over the uncompressed scheme) for memory-intensive, highly-compressible benchmarks.
  - Performance degradation for other benchmarks is below 0.4%.
Critique/Suggestions
- Data inconsistency: a 17% performance improvement for memory-intensive commercial workloads is claimed on page 2, but 26% is claimed on page 11.
- Miscalculation on page 4: the sum of the compressed sizes at stack depths 1 through 7 is stated to total 29; however, the miss cannot be avoided because the sum of compressed sizes exceeds the total number of segments (i.e., 35 > 32).
- All in all, the proposed technique does not seem to enhance performance significantly with respect to Always.
Thank you!