Title: A chicken in every pot: a persistent snapshot memory scaled in time
1 A chicken in every pot: a persistent snapshot memory scaled in time
- Liuba Shrira and Hao Xu
- Brandeis University
2 Storage systems: the 7-year itch
- 1984 rotational delay - FFS
- 1991 large memory - LFS
- 1998 cheaper disk - Elephant
- 2005 .. a chicken in every pot
- snapshot box on the side..
3 Trends
- Hardware: disk
- Cheap ($1/GB) and getting cheaper
- Software industry: Forbes (12/2004) says
- need for keeping past state is growing
4 Trends, cont.
- A casino chases a card counter
- IT dept. chased by Sarbanes-Oxley
- Hippocratic DB audited about patient privacy preservation
- Need to analyze past activity
5 SNAP: a snapshot system for an object storage system
- Goal
- Storage system capability for
- back-in-time execution (BITE)
- application runs against
- read-only snapshots
- without synchronization: analysis in retrospect
6 Baseline requirements for BITE
- Consistent snapshots: same (old) invariants hold
- BITE of general code: after-the-fact, ad-hoc analysis (vs predefined SQL access methods)
- App chooses the snapshot: snapshot state meaningful to app (vs some time in the past)
- High time resolution: fine-grained past analysis (vs backup for recovery)
7 Over long time-scales..
- Living with the past: how close?
- today: too close (Temporal DB, CVFS)
- or too far (warehouse - Netezza)
- Snapshots can be of long-term importance, or transient
- today: uniform - apps cannot discriminate
- Inherent tension
- latency of access vs
- cost of representation (space and time)
- today: limited adaptation - compress or not
8 Capturing past states
- Two ways
- Cheap - no-overwrite update
- past stays put, copy new
- less to write, but
- bloated DB, past inherits same rep
- Opportunistic - in-place update
- past is copied out, separated
- more to write, but can write smartly, can
- tailor past rep, and DB stays clustered (vigor)
9 Our requirements
- Non-disruptive past: just the right distance
- separated
- at adaptive distance
- e.g. faster BITE on more recent states
- Discriminated past
- application classifies, snapshot system filters
- some snapshots outlive others,
- some can be accessed faster
- Flexible classification, e.g. after the fact
10 Snapshot system operations
- Request to take a snapshot (declaration)
- sid snapshot_request (filter_spec)
- Request to access a snapshot v
- snapshot_access (sid)
- Request to specify a filter for a snapshot v
- lazy_filter (sid,filter_spec)
- T1, T2, S1, T3, T4, T5, S2,
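The three operations above can be pictured with a toy in-memory stand-in. This is only a sketch under assumptions: the class name `SnapshotSystem`, the `commit` helper, and the use of plain strings such as "long-lived" for `filter_spec` are hypothetical, and a snapshot is modeled simply as the list of transactions committed before it was declared.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class SnapshotSystem:
    """Toy model of the snapshot operations (not the SNAP implementation)."""
    history: List[str] = field(default_factory=list)   # e.g. T1, T2, S1, ...
    filters: Dict[int, Optional[str]] = field(default_factory=dict)
    next_sid: int = 1

    def commit(self, txn_name: str) -> None:
        """Record a committed transaction in serialization order."""
        self.history.append(txn_name)

    def snapshot_request(self, filter_spec: Optional[str] = None) -> int:
        """Declare a snapshot after the last committed transaction."""
        sid = self.next_sid
        self.next_sid += 1
        self.filters[sid] = filter_spec
        self.history.append(f"S{sid}")
        return sid

    def snapshot_access(self, sid: int) -> List[str]:
        """Return the transactions whose effects snapshot `sid` reflects."""
        cut = self.history.index(f"S{sid}")
        return [e for e in self.history[:cut] if e.startswith("T")]

    def lazy_filter(self, sid: int, filter_spec: str) -> None:
        """Attach or change a filter after the snapshot was declared."""
        if sid not in self.filters:
            raise KeyError(f"unknown snapshot {sid}")
        self.filters[sid] = filter_spec


# Replay the history from the slide: T1, T2, S1, T3, T4, T5, S2
system = SnapshotSystem()
system.commit("T1")
system.commit("T2")
s1 = system.snapshot_request()
for name in ("T3", "T4", "T5"):
    system.commit(name)
s2 = system.snapshot_request("short-lived")
system.lazy_filter(s1, "long-lived")   # classification after the fact
```

In this model `snapshot_access(s1)` sees only T1 and T2, while `snapshot_access(s2)` sees all five transactions.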
11 Baseline storage system
- General interface
- pages and a page table
- transactions access objects on pages
- Server
- DB disk: slotted pages of objects
- physical oid (page, o)
- and a page table
- Transaction log
- Cache: pages and modified object cache
12 Storage system, cont.: optimistic CC, ARIES
- Clients
- fetch pages, run transactions
- send modified objects to server
- Server
- validates, commits (WAL)
- caches committed modifications
- no-force, no-STEAL
13 The snapshot system
- Archive separated from DB
- Archive I/O sequential, DB I/O random
- Copy-on-write (COW)
- copy out snapshot states into archive
- just before updating DB
- during cleaning.
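A minimal sketch of copy-out at cleaning time, assuming whole pages are the unit of copying and that at most one copy per page is made for the latest declared snapshot. The class `CopyOutStore` and its method names are hypothetical, and the sketch ignores the non-blocking and metadata issues the talk treats separately.

```python
from typing import Dict, List, Optional, Set, Tuple


class CopyOutStore:
    """Toy page store: DB updated in place, pre-states appended to an archive."""

    def __init__(self, pages: Dict[str, str]) -> None:
        self.db: Dict[str, str] = dict(pages)            # current state
        self.archive: List[Tuple[int, str, str]] = []    # (snapshot, page, pre-state)
        self.latest: Optional[int] = None                # latest declared snapshot
        self.copied: Set[str] = set()                    # pages copied for `latest`

    def declare_snapshot(self, sid: int) -> None:
        self.latest = sid
        self.copied = set()

    def clean(self, page: str, new_contents: str) -> None:
        """Install a committed modification into a DB page.

        Just before the in-place update, the pre-state is copied out if the
        latest snapshot has not recorded this page yet.
        """
        if self.latest is not None and page not in self.copied:
            self.archive.append((self.latest, page, self.db[page]))  # sequential append
            self.copied.add(page)
        self.db[page] = new_contents


store = CopyOutStore({"P": "p0", "S": "s0", "Q": "q0"})
store.declare_snapshot(4)
store.clean("P", "p1")
store.clean("S", "s1")
store.clean("P", "p2")   # P was already copied out for snapshot 4
store.declare_snapshot(5)
store.clean("P", "p3")
store.clean("Q", "q1")
```

After this run the archive holds P and S for snapshot 4 followed by P and Q for snapshot 5, in write order.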
14 Snapshot interface
- Same as DB:
- Snapshot Pages
- Snapshot Page Table
- So BITE is transparent
- BITE on snapshot S(v) uses PageTable(v)
15 Snapshot system: below the interface
- Some S(v) pages are in the archive,
- some in DB
- and pages in the archive can have
- a different representation
16 BITE(v): namespace redirection
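One way to picture the redirection is a per-snapshot page table that points either into the archive or back into the DB. The sketch below is an assumption-laden illustration: it builds the table with a naive linear scan over an append-only archive of `(recording snapshot, page, contents)` entries, which is not the incrementally archived page table the talk mentions, and the function names are hypothetical.

```python
from typing import Dict, List, Tuple, Union

ArchiveEntry = Tuple[int, str, str]          # (recording snapshot, page, contents)
Location = Tuple[str, Union[int, str]]       # ("archive", address) or ("db", page)


def build_page_table(v: int, archive: List[ArchiveEntry],
                     db: Dict[str, str]) -> Dict[str, Location]:
    """Map every page to where its state as of snapshot v lives."""
    table: Dict[str, Location] = {page: ("db", page) for page in db}
    # Scan newest to oldest so the earliest copy recorded at or after v wins.
    for addr in range(len(archive) - 1, -1, -1):
        sid, page, _ = archive[addr]
        if sid >= v:
            table[page] = ("archive", addr)
    return table


def read_page(table: Dict[str, Location], page: str,
              archive: List[ArchiveEntry], db: Dict[str, str]) -> str:
    """BITE read: follow the snapshot page table, transparently to the caller."""
    where, loc = table[page]
    return archive[loc][2] if where == "archive" else db[loc]


archive = [(4, "P", "p0"), (4, "S", "s0"), (5, "P", "p2"), (5, "Q", "q0")]
db = {"P": "p3", "S": "s1", "Q": "q1", "R": "r0"}

pt4 = build_page_table(4, archive, db)
pt5 = build_page_table(5, archive, db)
```

Here page R was never overwritten, so both tables point at the DB for it, and snapshot 5 also reads S from the DB because S has not been overwritten since snapshot 5 was declared.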
17 Creating non-disruptive snapshots (I/O-bound system)
- Archiving snapshot states when cleaning
- can slow down cleaning
- compared to a system without snapshots.
- Copying to the archive disk (sequential I/O)
- in parallel
- to database I/O (random)
- can partially hide archiving cost
- behind database I/O.
18 Creating snapshots: how well can you hide?
- Is determined by
- how much is archived:
- compactness of snapshot representation,
- snapshot frequency,
- update workload (overwriting)
- cost of archiving:
- sequential, other archive traffic (BITE)
19 Creating snapshots: some issues
- Issue
- avoid overwriting snapshot states
- (without blocking, pinning etc)
- Issue
- update snapshot meta data efficiently
- (large, dynamic page tables )
- Issue
- filter out long-lived snaps (focus here)
20 New techniques for copy-out snapshots
- VMOB: in-memory versioned data structure preserves snapshot states w/out blocking
- LPT: incrementally archived page table with logarithmic reconstruction cost
- Filtering: exploit smart representation for past states (focus here)
21 Filtering: motivation
- Want unlimited past at high resolution
- but
- some snapshots are transient
- others of long-term interest to application
- application needs to discriminate between snapshots
22 Thresher: a filtering system for SNAP
23 Snapshot representation
- What can representation do for filtering?
- lifetime-based allocation
- avoids fragmentation
- diff-based encoding
- reduces cost of copying
- adaptive combination:
- real winner
24 Example: hierarchical snapshots at multiple time granularities
- ICU patient monitoring DB takes snapshots
- minute by minute: vital sign monitor readings
- hourly: includes nurses' write-up summarizing monitor readings
- daily: includes doctors' notes summarizing nurses' checkups
- Doctors have longer lifetime than nurses
25 Brief overview: snapshot creation
- Some notation
- Snapshot span
- Recorded pages
- example
- .. v4, T w(x_P), T w(y_S), v5, T..
- Span of v4: T, T (the two transactions between v4 and v5)
- Pages recorded by snapshot v4: P, S
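The notation can be made concrete with a short sketch. It assumes the span of a snapshot is the run of transactions committed between its declaration and the next one, and that a snapshot records each page modified within its span once. The transaction names `Ta` to `Td` are placeholders, since the subscripts did not survive in the slide text, and the post-v5 writes are taken from the following slide's example.

```python
from typing import Dict, List, Optional, Tuple

Event = Tuple[str, ...]   # ("snap", "v4") or ("txn", "Ta", "P")


def spans_and_recorded(history: List[Event]):
    """Return (span, recorded pages) per snapshot, in declaration order."""
    spans: Dict[str, List[str]] = {}
    recorded: Dict[str, List[str]] = {}
    current: Optional[str] = None
    for event in history:
        if event[0] == "snap":
            current = event[1]
            spans[current] = []
            recorded[current] = []
        elif current is not None:
            _, txn, page = event
            if txn not in spans[current]:
                spans[current].append(txn)
            if page not in recorded[current]:   # record a page once per snapshot
                recorded[current].append(page)
    return spans, recorded


history = [
    ("snap", "v4"),
    ("txn", "Ta", "P"),   # Ta writes an object on page P
    ("txn", "Tb", "S"),   # Tb writes an object on page S
    ("snap", "v5"),
    ("txn", "Tc", "P"),
    ("txn", "Td", "Q"),
]
spans, recorded = spans_and_recorded(history)
```

With this history, v4 spans `Ta, Tb` and records pages P and S, matching the example on the slide.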
26 Incremental snapshot creation
- Archived snapshot pages: dispersed
- v4: P S   v5: P Q
- ----------------------------------------------->
- Archived snapshot page tables (PT)
- PT(v4): addr(P4), addr(S4)   PT(v5): addr(P5), addr(Q5) ..
- ----------------------------------------------->
- Another talk: how to construct archived page tables
- Construct APT(v4)   recorded(v4)   Construct APT(v5)
27 Filtering example: filter out short-lived v5
- Doctors   Nurses
- v4: P S   v5: P Q   v6
- -----------------------------------------------> Archive
- Filter long-lived v4, reclaim v5
- reclaim P5
- retain Q5 (v4 needs it)
- filtering incremental snapshots creates fragmentation
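The reclaim decision in this example can be reproduced with a small sketch. It assumes that an archived copy of page X recorded by snapshot k belongs to the state of every snapshot declared after the previous recording of X, up to and including k. The function names are hypothetical, and the history is assumed to start at v4.

```python
from typing import Dict, List, Set, Tuple

Copy = Tuple[str, str]   # (page, recording snapshot), e.g. ("Q", "v5") for Q5


def needed_by(recorded: Dict[str, List[str]], order: List[str]) -> Dict[Copy, List[str]]:
    """Map each archived copy to the snapshots whose state includes it."""
    needs: Dict[Copy, List[str]] = {}
    last_recorder: Dict[str, int] = {}   # page -> index of previous recording snapshot
    for k, snap in enumerate(order):
        for page in recorded[snap]:
            start = last_recorder.get(page, -1) + 1
            needs[(page, snap)] = order[start:k + 1]
            last_recorder[page] = k
    return needs


def reclaimable(recorded: Dict[str, List[str]], order: List[str],
                retained: Set[str]) -> List[Copy]:
    """Archived copies that no retained snapshot needs."""
    return [copy for copy, users in needed_by(recorded, order).items()
            if not any(user in retained for user in users)]


order = ["v4", "v5"]
recorded = {"v4": ["P", "S"], "v5": ["P", "Q"]}
needs = needed_by(recorded, order)
free = reclaimable(recorded, order, retained={"v4"})
```

Dropping v5 frees only P5. Q5 stays because v4 never recorded Q itself, so the retained copies end up interleaved with freed space in the archive.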
28 Problem: fragmentation
- fragmented archive, over time
- non-sequential archive writes
- or
- random reads to copy out long-lived states
29 Our approach: filter-spec
- Filter spec determines
- relative snapshot lifetime
- App knows best
- the app supplies a filter spec
- the system filters
30 Avoid fragmentation with filter-spec
- Known at snapshot declaration:
- use lifetime-based allocation
- After the fact:
- use a flexible rep to filter lazily
- rep allows adaptive trade-off
- cost of filtering vs cost of BITE
31 App specifies filter at declaration
- long-lived pages: P4 S4 Q5 ----------------------------------------------->
- short-lived: P5 ----------------------------------------------->
- Invariant: to reclaim w/out fragmentation, short-lived areas store no long-lived pages
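A minimal sketch of lifetime-based allocation that maintains this invariant, assuming just two lifetime classes ("long" and "short") and one archive area per class. The class `LifetimeAllocator` and its methods are hypothetical. A copied-out page goes to the long-lived area if any snapshot that needs it is long-lived.

```python
from typing import Dict, List, Optional


class LifetimeAllocator:
    """Toy allocator: one archive area per lifetime class."""

    def __init__(self) -> None:
        self.names: List[str] = []             # snapshots in declaration order
        self.lifetimes: List[str] = []         # "long" or "short", parallel to names
        self.last_copy: Dict[str, int] = {}    # page -> index of last recording snapshot
        self.areas: Dict[str, List[str]] = {"long": [], "short": []}

    def declare(self, name: str, lifetime: str) -> None:
        self.names.append(name)
        self.lifetimes.append(lifetime)

    def copy_out(self, page: str) -> Optional[str]:
        """Place the pre-state of `page`; return the chosen area, or None if skipped."""
        k = len(self.names) - 1
        if k < 0 or self.last_copy.get(page) == k:
            return None                        # no snapshot yet, or already recorded
        start = self.last_copy.get(page, -1) + 1
        # Snapshots start..k all need this copy; the longest lifetime decides.
        area = "long" if "long" in self.lifetimes[start:k + 1] else "short"
        self.areas[area].append(f"{page}@{self.names[k]}")
        self.last_copy[page] = k
        return area


alloc = LifetimeAllocator()
alloc.declare("v4", "long")
alloc.copy_out("P")
alloc.copy_out("S")
alloc.declare("v5", "short")
alloc.copy_out("P")
alloc.copy_out("Q")
```

The result mirrors the slide: P4, S4 and Q5 land in the long-lived area and only P5 lands in the short-lived area, so reclaiming v5 frees a whole area without leaving holes.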
32 FilterTree: filter pages for free
33 After-the-fact (lazy) filtering
- Some applications want
- to defer filter specification
- Lazy filtering requires copying
- We can specialize representation (compact)
- to reduce copying cost
34 Compact representation: diffs
- Two components, filtered separately
- compact diffs reduce cost of copying
- (diffs clustered by page)
- checkpoints accelerate BITE
- (page-based snapshots,
- system-declared, can use FilterTree)
35 Adaptive trade-off
- Like recovery log
- less frequent checkpoints
- increase compactness
- more frequent checkpoints
- accelerate BITE
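The trade-off can be illustrated with a generic checkpoint-plus-diff encoding. This is a sketch under assumptions, not the talk's actual representation: a page is modeled as a dict of objects, a diff holds only the objects that changed since the previous version, objects are never deleted, and `encode`, `reconstruct` and `stored_objects` are hypothetical names.

```python
from typing import Dict, List, Tuple

Page = Dict[str, int]
Stored = List[Tuple[str, Page]]   # ("ckp", full page) or ("diff", changed objects)


def encode(versions: List[Page], interval: int) -> Stored:
    """Store a full checkpoint every `interval` versions and diffs in between."""
    stored: Stored = []
    for i, page in enumerate(versions):
        if i % interval == 0:
            stored.append(("ckp", dict(page)))
        else:
            prev = versions[i - 1]
            stored.append(("diff", {k: v for k, v in page.items() if prev.get(k) != v}))
    return stored


def reconstruct(stored: Stored, i: int) -> Tuple[Page, int]:
    """Rebuild version i; also return how many diffs had to be applied."""
    start = i
    while stored[start][0] != "ckp":      # walk back to the nearest checkpoint
        start -= 1
    page = dict(stored[start][1])
    for j in range(start + 1, i + 1):     # roll forward, as with a recovery log
        page.update(stored[j][1])
    return page, i - start


def stored_objects(stored: Stored) -> int:
    """Crude space measure: number of object values kept."""
    return sum(len(payload) for _, payload in stored)


# Eight versions of a four-object page; each version changes one object.
versions: List[Page] = []
page: Page = {"a": 0, "b": 0, "c": 0, "d": 0}
for i in range(8):
    page = dict(page)
    page["abcd"[i % 4]] = i + 1
    versions.append(page)

sparse = encode(versions, interval=4)   # fewer checkpoints: more compact
dense = encode(versions, interval=2)    # more checkpoints: faster reconstruction
```

In this toy workload the sparse encoding keeps 14 object values against 20 for the dense one, but rebuilding version 3 takes three diff applications instead of one.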
36 Lazy filtering: checkpoints filtered for free
- [Figure: archive regions for diff extents and FilterTree for checkpoints; labels G1(diffs), G2(diffs), E1, E2, E3, B1, B2, B3]
37 But some applications want more
- lazy filtering
- and
- faster BITE
- e.g.
- app runs BITE on batch of recent snapshots
- to decide which ones to retain
- needs fast BITE to keep up..
38 Combined: hybrid
- Faster BITE in recent window
- and
- Lazy filtering
39 Hybrid: checkpoints and checkpoint filtered for free
40 Status
- Implemented
- SNAP and Thresher for Thor storage system
- Performance results
- encouraging.
- here is a 5000-foot view
41 Performance metrics
- Cost of filtering
- non-disruptiveness: rate-of-drain / rate-of-pour
- t_clean determines rate-of-drain
- workload parameter: overwriting
- Compactness of diff-based rep
- retention relative to page-based rep
- R_diff - fixed
- R_ckp - tunable by frequency of checkpoints
- workload parameter: density
- BITE: page-based snapshots vs diff-based vs DB
42 Non-disruptiveness
- Storage system w/hybrid snapshots vs
- w/out snapshots (Thor)
- How much drop in
- rate-of-drain / rate-of-pour
43 Experimental configuration
- Workloads
- extend multiuser OO7 to control
- density
- overwriting
- System configuration
- single client, medium OO7: small DB, 185MB
- multiple clients: large DB, 140GB
44 FilterTree
45 Non-disruptiveness / single client: summertime, life is easy
46 Non-disruptiveness / multi-user: DB works harder
47 Summary: non-disruptive snapshot memory
- Unlimited filtered past
- is cheaper than you may think.
- .. A chicken in every pot..
- Every storage system
- can have a snapshot box on the side..
48 To get there
- Generalize
- ARIES / STEAL: underway
- file systems: need extended interfaces
- Beyond
- upgrades: have techniques
- provenance: need ideas..