Title: The Performance Impact of Kernel Prefetching on Buffer Cache Replacement Algorithms
Slide 1: The Performance Impact of Kernel Prefetching on Buffer Cache Replacement Algorithms
- ACM SIGMETRICS '05 (ACM International Conference on Measurement and Modeling of Computer Systems)
- Ali R. Butt, Chris Gniady, Y. Charlie Hu
- Purdue University
Presented by Hsu Hao Chen
Slide 2: Outline
- Introduction
- Motivation
- Replacement Algorithm
- OPT
- LRU
- LRU-2
- 2Q
- LIRS
- LRFU
- MQ
- ARC
- Performance Evaluation
- Conclusion
Slide 3: Introduction
- Improving file system performance
- Design effective block replacement algorithms for the buffer cache
- Almost all buffer cache replacement algorithms have been proposed and studied comparatively without taking into account the file system prefetching that exists in all modern operating systems
- Cache hit ratio is used as the sole performance metric
- What about the actual number of disk I/O requests?
- What about the actual running time of applications?
Slide 4: Introduction (Cont.)
- Kernel prefetching in Linux
- Beneficial for sequential accesses
[Figure: various kernel components on the path from a file system operation to the disk]
Slide 5: Motivation
- The goals of a buffer cache replacement algorithm
- Minimize the number of disk I/Os
- Reduce the running time of applications
- Example
- Without prefetching, Belady's algorithm results in 16 misses while LRU results in 23 misses
- With prefetching, Belady's algorithm is no longer optimal!
Slide 6: Replacement Algorithm
- OPT
- Evicts the block that will be referenced farthest in the future
- Often used for comparative studies
- Prefetched blocks are assumed to be the most recently accessed, so OPT can immediately determine right and wrong prefetches (see the sketch below)
Slide 7: Replacement Algorithm
- LRU
- Replaces the block that has not been accessed for the longest time
- Prefetched blocks are inserted at the MRU position just like regular blocks (see the sketch below)
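A minimal LRU sketch in Python (class and method names are ours); per the slide, a prefetched block is inserted at the MRU end exactly like an on-demand block:

from collections import OrderedDict

class LRUCache:
    """Minimal LRU buffer cache sketch (illustrative only)."""
    def __init__(self, size):
        self.size = size
        self.blocks = OrderedDict()          # LRU end first, MRU end last

    def access(self, block):                 # on-demand or prefetch alike
        hit = block in self.blocks
        if hit:
            self.blocks.move_to_end(block)   # refresh to MRU position
        else:
            if len(self.blocks) >= self.size:
                self.blocks.popitem(last=False)  # evict the LRU block
            self.blocks[block] = True        # insert at MRU position
        return hit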
Slide 8: Replacement Algorithm
- LRU pathological case
- The working set size is larger than the cache
- The application has a looping access pattern
- In this case, LRU replaces every block just before it is used again, as the demo below shows
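A self-contained demonstration of this pathology (the cache and trace sizes are ours): a 3-block LRU cache scanning 4 blocks in a loop misses on every access, because LRU always evicts exactly the block the loop needs next:

from collections import OrderedDict

cache, size, misses = OrderedDict(), 3, 0
for block in [1, 2, 3, 4] * 5:                # looping access pattern
    if block in cache:
        cache.move_to_end(block)
    else:
        misses += 1
        if len(cache) >= size:
            cache.popitem(last=False)         # evicts the next block needed
        cache[block] = True
print(misses)  # 20: all 20 accesses miss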
Slide 9: Replacement Algorithm
- LRU-2
- Tries to avoid the pathological cases of LRU
- LRU-K replaces a block based on its Kth-to-last reference; the authors recommend K = 2
- LRU-2 can quickly remove cold blocks from the cache
- Each block access requires log(N) operations to manipulate a priority queue, where N is the number of blocks in the cache (see the sketch below)
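A minimal LRU-2 sketch (names ours). It keeps the last two reference times per block and evicts the block whose second-to-last reference is oldest; blocks seen only once rank oldest of all, which is why cold blocks leave quickly. A real implementation keeps these values in a priority queue for the log(N) cost the slide mentions, and adds a correlated-reference period that this sketch omits:

class LRU2Cache:
    def __init__(self, size):
        self.size = size
        self.hist = {}     # block -> (2nd-to-last, last) access times
        self.clock = 0

    def access(self, block):
        self.clock += 1
        hit = block in self.hist
        if not hit and len(self.hist) >= self.size:
            # Evict the smallest 2nd-to-last timestamp; 0 marks blocks
            # referenced only once, so they are evicted first.
            victim = min(self.hist, key=lambda b: self.hist[b][0])
            del self.hist[victim]
        last = self.hist[block][1] if hit else 0
        self.hist[block] = (last, self.clock)   # slide the history window
        return hit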
Slide 10: Replacement Algorithm
- 2Q
- Proposed to achieve page replacement performance similar to LRU-2 with low overhead (constant time, like LRU)
- All missed blocks go into the A1in queue
- Addresses of blocks replaced from A1in go into the A1out queue
- Re-referenced blocks go into the Am queue
- Prefetched blocks are treated as on-demand blocks; if a prefetched block is evicted from the A1in queue before an on-demand access, it is simply discarded (see the sketch below)
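A minimal 2Q sketch (the class name and queue-size defaults are ours; the original 2Q paper treats the thresholds Kin and Kout as tunables). A1in is a FIFO of newly missed blocks, A1out remembers only addresses, and Am is an LRU list of re-referenced blocks:

from collections import OrderedDict, deque

class TwoQCache:
    def __init__(self, size, kin=None, kout=None):
        self.size = size
        self.kin = kin or max(size // 4, 1)    # A1in capacity threshold
        self.kout = kout or max(size // 2, 1)  # A1out (ghost) capacity
        self.a1in = deque()                    # FIFO of new resident blocks
        self.a1out = deque()                   # addresses of evicted blocks
        self.am = OrderedDict()                # LRU of re-referenced blocks

    def access(self, block):
        if block in self.am:                   # hot block: refresh MRU
            self.am.move_to_end(block)
            return True
        if block in self.a1in:                 # recent block: stays put
            return True
        # Miss. A hit in A1out is still a disk miss (only the address
        # was kept), but it proves re-reference, so the block enters Am.
        if block in self.a1out:
            self.a1out.remove(block)
            self.am[block] = True
        else:
            self.a1in.append(block)
        while len(self.a1in) + len(self.am) > self.size:
            if len(self.a1in) > self.kin:
                self.a1out.append(self.a1in.popleft())  # keep address only
                if len(self.a1out) > self.kout:
                    self.a1out.popleft()
            elif self.am:
                self.am.popitem(last=False)             # evict LRU of Am
            else:
                self.a1in.popleft()
        return False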
Slide 12: Replacement Algorithm
- LIRS (Low Inter-reference Recency Set)
- LIR block: has been accessed again since being inserted into the LRU stack
- HIR block: referenced less frequently
- Prefetched blocks are inserted into the part of the cache that maintains HIR blocks
Slide 13: Replacement Algorithm
- LRFU (Least Recently/Frequently Used)
- Replaces the block with the smallest C(x) value
- Prefetched blocks are treated as the most recently accessed
- Problem: how to assign the initial weight C(x) to a prefetched block
- Solution: a prefetched flag is set, and the weight is assigned when the block is accessed on-demand
- C(x) is updated for every block x at every time t:
    C(x) = 1 + 2^(-λ) · C(x)   if x is referenced at time t
    C(x) = 2^(-λ) · C(x)       otherwise
  where λ is a tunable parameter; initially, C(x) = 0 (see the sketch below)
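A minimal LRFU sketch (names ours). Instead of touching every block at every tick, it decays C(x) lazily using the closed form C · 2^(-λ·Δt), which is equivalent to applying the per-step rule above Δt times. λ = 0 behaves like LFU, and large λ approaches LRU:

class LRFUCache:
    def __init__(self, size, lam=0.5):
        self.size, self.lam = size, lam
        self.clock = 0
        self.crf = {}      # block -> (C(x) at last touch, touch time)

    def _value_now(self, block):
        c, t = self.crf[block]
        return c * 2 ** (-self.lam * (self.clock - t))   # lazy decay

    def access(self, block):
        self.clock += 1
        hit = block in self.crf
        if not hit and len(self.crf) >= self.size:
            # Evict the block with the smallest current C(x) value;
            # a real implementation would keep a heap instead.
            victim = min(self.crf, key=self._value_now)
            del self.crf[victim]
        old = self._value_now(block) if hit else 0.0     # initially 0
        self.crf[block] = (1.0 + old, self.clock)        # reference bonus
        return hit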
Slide 14: Replacement Algorithm
- MQ (Multi-Queue)
- Uses m LRU queues (typically m = 8)
- Q0, Q1, ..., Qm-1, where Qi contains blocks that have been referenced at least 2^i times but no more than 2^(i+1) - 1 times recently
- Does not increment the reference counter when a block is prefetched (see the sketch below)
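A simplified MQ sketch (names ours): a block with reference count f sits in queue min(floor(log2(f)), m-1). The paper's MQ also expires blocks into lower queues over time, which this sketch omits. Per the slide, a prefetch does not increment the reference counter:

from collections import OrderedDict
from math import log2

class MQCache:
    def __init__(self, size, m=8):
        self.size, self.m = size, m
        self.queues = [OrderedDict() for _ in range(m)]
        self.freq = {}                         # block -> reference count

    def _queue_of(self, block):
        # Queue index from the count; count 0 (pure prefetch) maps to Q0.
        return min(int(log2(max(self.freq[block], 1))), self.m - 1)

    def access(self, block, prefetch=False):
        hit = block in self.freq
        if hit:
            self.queues[self._queue_of(block)].pop(block)
        elif len(self.freq) >= self.size:
            for q in self.queues:              # evict the LRU block of the
                if q:                          # lowest non-empty queue
                    victim, _ = q.popitem(last=False)
                    del self.freq[victim]
                    break
        if not hit:
            self.freq[block] = 0
        if not prefetch:                       # prefetches do not bump
            self.freq[block] += 1              # the reference counter
        self.queues[self._queue_of(block)][block] = True
        return hit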
Slide 16: Replacement Algorithm
- ARC (Adaptive Replacement Cache)
- Maintains two LRU lists
- L1: pages that have been referenced only once
- L2: pages that have been referenced at least twice
- Each list has the same length c as the cache
- The cache contains the tops of both lists, T1 and T2
[Figure: lists L1 and L2 with their resident tops T1 and T2, whose combined size is c]
Slide 17: Replacement Algorithm
- ARC attempts to maintain a target size B_T1 for list T1
- When the cache is full, ARC replaces
- the LRU page from T1, if |T1| > B_T1
- the LRU page from T2, otherwise
- If a prefetched block is already in the ghost queue, it is not moved to the second queue but to the first queue (see the sketch below)
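A simplified ARC sketch (names ours; p plays the role of B_T1, and the ±1 adaptation plus unbounded ghost lists are simplifications of the real algorithm). It includes the slide's prefetch rule: a prefetched block found in a ghost list re-enters the first queue (T1), not T2:

from collections import OrderedDict

class ARCCache:
    def __init__(self, c):
        self.c, self.p = c, 0                            # p ~ B_T1 target
        self.t1, self.t2 = OrderedDict(), OrderedDict()  # resident tops
        self.b1, self.b2 = OrderedDict(), OrderedDict()  # ghost addresses

    def _replace(self):
        if self.t1 and len(self.t1) > self.p:            # T1 over target
            old, _ = self.t1.popitem(last=False)
            self.b1[old] = True                          # remember address
        elif self.t2:
            old, _ = self.t2.popitem(last=False)
            self.b2[old] = True
        else:
            self.t1.popitem(last=False)

    def access(self, block, prefetch=False):
        if block in self.t1 or block in self.t2:         # real cache hit
            self.t1.pop(block, None)
            self.t2.pop(block, None)
            self.t2[block] = True                        # promote to T2
            return True
        ghost = block in self.b1 or block in self.b2
        if block in self.b1:
            self.p = min(self.p + 1, self.c)             # favor recency
            del self.b1[block]
        elif block in self.b2:
            self.p = max(self.p - 1, 0)                  # favor frequency
            del self.b2[block]
        if len(self.t1) + len(self.t2) >= self.c:
            self._replace()
        if ghost and not prefetch:
            self.t2[block] = True                        # on-demand re-reference
        else:
            self.t1[block] = True                        # new or prefetched block
        return False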
Slide 18: Performance Evaluation
- Simulation environment
- The authors implement a buffer cache simulator that is functionally equivalent to Linux (prefetching, I/O clustering)
- With DiskSim, they simulate the I/O time of applications
- Applications
- Sequential access: cscope, glimpse
- Random access: tpc-h, tpc-r
- Multi1: workload in a code development environment
- Multi2: workload in graphics development and simulation
- Multi3: workload in a database and a web index server
Slide 19: Performance Evaluation (Cont.)
[Graphs for cscope (sequential): hit ratio, number of clustered disk requests, and execution time]
Slide 20: Performance Evaluation (Cont.)
[Graphs for cscope (sequential): hit ratio, number of clustered disk requests, and execution time]
Slide 21: Performance Evaluation (Cont.)
[Graphs for glimpse (sequential): hit ratio, number of clustered disk requests, and execution time]
Slide 22: Performance Evaluation (Cont.)
[Graphs for tpc-h (random): hit ratio, number of clustered disk requests, and execution time]
Slide 23: Performance Evaluation (Cont.)
[Graphs for tpc-r (random): hit ratio, number of clustered disk requests, and execution time]
Slide 24: Performance Evaluation (Cont.)
- Concurrent applications
- Multi1: hit ratios and disk requests with and without prefetching exhibit behavior similar to cscope
- Multi2: behavior is similar to Multi1, but prefetching does not improve the execution time (CPU-bound viewperf)
- Multi3: behavior is similar to tpc-h
- Synchronous vs. asynchronous prefetching
- With prefetching, the number of disk requests is at least 30% lower than without prefetching for all algorithms except OPT, especially when asynchronous prefetching is used
[Table: number and size of disk I/Os for cscope at a 128MB cache size]
Slide 25: Conclusion
- Kernel prefetching can have a significant performance impact on different replacement algorithms
- Application file access patterns matter for prefetching disk data
- Sequential access
- Random access
- With or without prefetching, hit ratio is not the sole performance metric