Multi-level Selective Deduplication for VM Snapshots in Cloud Storage


1
Multi-level Selective Deduplication for
VM Snapshots in Cloud Storage
  • Wei Zhang, Hong Tang, Hao Jiang, Tao Yang,
    Xiaogang Li, Yue Zeng
  • University of California at Santa Barbara
  • Aliyun.com Inc.

2
Motivations
  • Virtual machines on the cloud use frequent
    backup to improve service reliability
  • Used in Alibaba's Aliyun - the largest public
    cloud service in China
  • High storage demand
  • Daily backup workload: hundreds of TB @ Aliyun
  • Number of VMs per cluster: 10,000
  • Large content duplicates
  • Limited resources for deduplication
  • No special hardware or dedicated machines
  • Small CPU and memory footprint

3
Focus and Related Work
  • Previous work
  • Version-based incremental snapshot backup
  • Inter-block/VM duplicates are not detected.
  • Chunk-based file deduplication
  • High cost for chunk lookup
  • Focus on
  • Parallel backup of a large number of virtual
    disks.
  • Large files for VM disk images.
  • Contributions
  • Cost-constrained solution with very limited
    computing resource
  • Multi-level selective duplicate detection and
    parallel backup.

4
Requirements
  • Negligible impact on existing cloud service and
    VM performance
  • Must minimize CPU and IO bandwidth consumption
    for backup and deduplication workload
    (e.g. <1% of total resources).
  • Fast backup speed
  • Compute backup for 10,000 users within a few
    hours each day during light cloud workload.
  • Fault tolerance constraint
  • Addition of data deduplication should not
    decrease the degree of fault tolerance

5
Design Considerations
  • Design alternatives
  • An external and dedicated backup storage system.
  • A decentralized and co-hosted backup system with
    full deduplication

(Figure: cluster nodes co-hosting the cloud service and the backup service)
6
Design Considerations
  • Decentralized architecture running on a
    general-purpose cluster
  • Co-hosting both elastic computing and the backup
    service
  • Multi-level deduplication
  • Localize backup traffic and exploit data
    parallelism
  • Increase fault tolerance
  • Selective deduplication
  • Use minimal resources while still removing most
    redundant content and achieving good efficiency

7
Key Observations
  • Inner-VM data characteristics
  • Exploit unchanged data to localize deduplication
  • Cross-VM data characteristics
  • Small common data dominates duplicates
  • Zipf-like distribution of VM OS/user data
  • Separate consideration of OS and user data

8
VM Snapshot Representation
Segments are fixed-size
Data blocks are variable-size
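A minimal Python sketch of how such a two-level snapshot recipe could be laid out, using the deck's parameters (2MB fixed segments, ~4KB average variable-sized blocks); the class and field names are illustrative assumptions, not the paper's actual on-disk format:

    import hashlib
    from dataclasses import dataclass, field
    from typing import List, Optional

    SEGMENT_SIZE = 2 * 1024 * 1024  # fixed-size segments (2MB, per slide 12)

    @dataclass
    class BlockRef:
        """One variable-sized data block inside a segment (~4KB on average)."""
        signature: bytes  # content hash of the block, e.g. SHA-1
        size: int

    @dataclass
    class SegmentRecipe:
        """Metadata for one fixed-size segment: the blocks composing it."""
        segment_index: int
        blocks: List[BlockRef] = field(default_factory=list)

    @dataclass
    class SnapshotRecipe:
        """A snapshot: a list of segment recipes plus a link to its parent."""
        vm_id: str
        parent: Optional["SnapshotRecipe"]
        segments: List[SegmentRecipe] = field(default_factory=list)

    def block_signature(data: bytes) -> bytes:
        return hashlib.sha1(data).digest()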
9
Processing Flow of Multi-level Deduplication
10
Data Processing Steps
  • Segment-level checkup
  • Use the dirty bitmap to see which segments are
    modified
  • Block-level checkup
  • Divide a segment into variable-sized blocks and
    compare their signatures with the parent
    snapshot's
  • Checkup against the common dataset (CDS)
  • Identify duplicate chunks from the CDS
  • Write new snapshot blocks
  • Write new content chunks to storage
  • Save recipes
  • Save segment metadata information
(These steps are sketched in code below.)
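To make these steps concrete, here is a hedged Python sketch of the per-segment flow, reusing the recipe types from the earlier sketch; dirty_bitmap, cds, chunk_store, and the variable-size chunker chunk_fn are assumed interfaces that the deck does not specify:

    def backup_segment(seg_index, read_segment, dirty_bitmap, parent,
                       cds, chunk_store, chunk_fn):
        """Multi-level deduplication for one segment of a VM disk image.

        Level 1: segment-level check via the dirty bitmap.
        Level 2: block-level check against the parent snapshot.
        Level 3: common dataset (CDS) check across VMs.
        """
        # Level 1: an unmodified segment is deduplicated wholesale by
        # referencing the parent snapshot's recipe for this segment.
        if not dirty_bitmap[seg_index]:
            return parent.segments[seg_index]

        # Level 2: divide the dirty segment into variable-sized blocks
        # and compare signatures with the same segment in the parent.
        parent_sigs = {b.signature for b in parent.segments[seg_index].blocks}
        recipe = SegmentRecipe(segment_index=seg_index)
        for data in chunk_fn(read_segment(seg_index)):
            sig = block_signature(data)
            recipe.blocks.append(BlockRef(signature=sig, size=len(data)))
            if sig in parent_sigs:
                continue                 # duplicate within this VM
            # Level 3: check the common dataset shared across VMs.
            if cds.contains(sig):
                continue                 # duplicate across VMs
            chunk_store.put(sig, data)   # genuinely new content
        return recipe                    # saved as segment metadata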

11
Architecture of Multi-level VM snapshot backup
(Figure: backup and deduplication architecture within a cluster node)
12
Status Evaluation
  • Prototype system running on Alibaba's Aliyun
    cloud
  • Based on Xen
  • 100 nodes, each with 16 cores, 48GB memory, and
    25 VMs
  • Uses <150MB per machine for backup/deduplication
  • Evaluation data from Aliyun's production cluster
  • 41TB
  • 10 snapshots per VM
  • Segment size: 2MB
  • Avg. block size: 4KB

13
Data Characteristics of the Benchmark
  • Each VM uses 40GB storage space on average
  • OS and user data disks each take 50% of the space
  • OS data
  • 7 mainstream OS releases
  • Debian, Ubuntu, RedHat, CentOS, Win2003 32-bit,
    Win2003 64-bit, and Win2008 64-bit
  • User data
  • From 1323 VM users

14
Impacts of 3-Level Deduplication
Level 1: Segment-level detection within a VM
Level 2: Block-level detection within a VM
Level 3: Common data block detection across VMs
15
Impact for Different OS Releases
16
Separate consideration of OS and user data
Both have a Zipf-like data distribution, but
popularity growth differs as the cluster size and
number of VM users increase.
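Selective deduplication exploits this skew: fill the CDS with only the most popular blocks up to a size budget. A minimal sketch, assuming (signature, size, frequency) statistics collected offline from snapshot traces (the statistics format is an assumption, not from the deck):

    def build_cds(block_stats, budget_bytes):
        """Fill the common dataset (CDS) with the most popular blocks,
        stopping at a size budget (e.g. 100GB, per slide 19).
        block_stats: iterable of (signature, size, frequency) tuples,
        assumed to be gathered offline."""
        cds, used = set(), 0
        # Under a Zipf-like distribution, a small prefix of the most
        # referenced blocks covers most of the duplicated bytes.
        for sig, size, freq in sorted(block_stats, key=lambda s: -s[2]):
            if used + size > budget_bytes:
                break
            cds.add(sig)
            used += size
        return cds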
17
Commonality among OS releases
1GB of common OS metadata covers 70%
18
Cumulative coverage of popular user data
Coverage is the summation of covered data
block size × frequency
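The cumulative curve plotted on this slide can be computed roughly as follows, over the same assumed (signature, size, frequency) statistics:

    def cumulative_coverage(block_stats):
        """Cumulative coverage after ranking blocks by popularity: the
        value at rank k is the sum of size * frequency over the top-k
        blocks, i.e. the quantity this slide plots."""
        curve, total = [], 0
        for _sig, size, freq in sorted(block_stats, key=lambda s: -s[2]):
            total += size * freq
            curve.append(total)
        return curve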
19
Space saving compared to perfect deduplication as
CDS size increases
100GB CDS (1GB index) -> 75% of perfect dedup
20
Impact of dataset-size increase
21
Conclusions
  • Contributions
  • A multi-level selective deduplication scheme
    among VM snapshots
  • Inner-VM deduplication localizes backup and
    exposes more parallelism
  • Global deduplication with a small common dataset
    drawn from OS and user data disks
  • Uses less than 0.5% of memory per node to meet a
    stringent cloud resource requirement ->
    accomplishes 75% of what perfect deduplication
    does
  • Experiments
  • Achieve 500TB/hour on a 1000-node cloud cluster
  • Reduce bandwidth by 92% -> 40TB/hour