Multi-level Selective Deduplication for VM Snapshots in Cloud Storage


1
Multi-level Selective Deduplication for
VM Snapshots in Cloud Storage
  • Wei Zhang, Hong Tang, Hao Jiang, Tao Yang,
    Xiaogang Li, Yue Zeng
  • University of California at Santa Barbara
  • Aliyun.com Inc.

2
Motivations
  • Virtual machines on the cloud use frequent
    backup to improve service reliability
  • Used in Alibaba's Aliyun - the largest public
    cloud service in China
  • High storage demand
  • Daily backup workload: hundreds of TB @ Aliyun
  • Number of VMs per cluster: 10,000
  • Large content duplicates
  • Limited resources for deduplication
  • No special hardware or dedicated machines
  • Small CPU and memory footprint

3
Focus and Related Work
  • Previous work
  • Version-based incremental snapshot backup
  • Inter-block/VM duplicates are not detected.
  • Chunk-based file deduplication
  • High cost for chunk lookup
  • Focus on
  • Parallel backup of a large number of virtual
    disks.
  • Large files for VM disk images.
  • Contributions
  • Cost-constrained solution with very limited
    computing resource
  • Multi-level selective duplicate detection and
    parallel backup.

4
Requirements
  • Negligible impact on existing cloud service and
    VM performance
  • Must minimize CPU and IO bandwidth consumption
    for backup and deduplication workload
    (e.g. <1% of total resources).
  • Fast backup speed
  • Compute backup for 10,000 users within a few
    hours each day during light cloud workload.
  • Fault tolerance constraint
  • Addition of data deduplication should not
    decrease the degree of fault tolerance

5
Design Considerations
  • Design alternatives
  • An external and dedicated backup storage system.
  • A decentralized and co-hosted backup system with
    full deduplication

(Figure: cluster nodes co-hosting the cloud service and the backup service)
6
Design Considerations
  • Decentralized architecture running on a
    general-purpose cluster
  • Co-hosting both elastic computing and the backup
    service
  • Multi-level deduplication
  • Localize backup traffic and exploit data
    parallelism
  • Increase fault tolerance
  • Selective deduplication
  • Use minimal resources while still removing most
    redundant content and achieving good efficiency

7
Key Observations
  • Inner-VM data characteristics
  • Exploit unchanged data to localize deduplication
  • Cross-VM data characteristics
  • Small common data dominates duplicates
  • Zipf-like distribution of VM OS/user data
  • Separate consideration of OS and user data

8
VM Snapshot Representation
Segments are fixed-size
Data blocks are variable-size
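A minimal Python sketch of how such a two-level snapshot recipe could be laid out, using the deck's parameters (2MB fixed segments, ~4KB average variable-sized blocks); the class and field names are illustrative assumptions, not the paper's actual on-disk format:

    import hashlib
    from dataclasses import dataclass, field
    from typing import List, Optional

    SEGMENT_SIZE = 2 * 1024 * 1024  # fixed-size segments (2MB, per slide 12)

    @dataclass
    class BlockRef:
        """One variable-sized data block inside a segment (~4KB on average)."""
        signature: bytes  # content hash of the block, e.g. SHA-1
        size: int

    @dataclass
    class SegmentRecipe:
        """Metadata for one fixed-size segment: the blocks composing it."""
        segment_index: int
        blocks: List[BlockRef] = field(default_factory=list)

    @dataclass
    class SnapshotRecipe:
        """A snapshot: a list of segment recipes plus a link to its parent."""
        vm_id: str
        parent: Optional["SnapshotRecipe"]
        segments: List[SegmentRecipe] = field(default_factory=list)

    def block_signature(data: bytes) -> bytes:
        return hashlib.sha1(data).digest()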
9
Processing Flow of Multi-level Deduplication
10
Data Processing Steps
  • Segment-level checkup
  • Use the dirty bitmap to see which segments are
    modified
  • Block-level checkup
  • Divide a segment into variable-sized blocks and
    compare their signatures with the parent
    snapshot's
  • Checkup against the common dataset (CDS)
  • Identify duplicate chunks from the CDS
  • Write new snapshot blocks
  • Write new content chunks to storage
  • Save recipes
  • Save segment metadata information
(These steps are sketched in code below.)
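To make these steps concrete, here is a hedged Python sketch of the per-segment flow, reusing the recipe types from the earlier sketch; dirty_bitmap, cds, chunk_store, and the variable-size chunker chunk_fn are assumed interfaces that the deck does not specify:

    def backup_segment(seg_index, read_segment, dirty_bitmap, parent,
                       cds, chunk_store, chunk_fn):
        """Multi-level deduplication for one segment of a VM disk image.

        Level 1: segment-level check via the dirty bitmap.
        Level 2: block-level check against the parent snapshot.
        Level 3: common dataset (CDS) check across VMs.
        """
        # Level 1: an unmodified segment is deduplicated wholesale by
        # referencing the parent snapshot's recipe for this segment.
        if not dirty_bitmap[seg_index]:
            return parent.segments[seg_index]

        # Level 2: divide the dirty segment into variable-sized blocks
        # and compare signatures with the same segment in the parent.
        parent_sigs = {b.signature for b in parent.segments[seg_index].blocks}
        recipe = SegmentRecipe(segment_index=seg_index)
        for data in chunk_fn(read_segment(seg_index)):
            sig = block_signature(data)
            recipe.blocks.append(BlockRef(signature=sig, size=len(data)))
            if sig in parent_sigs:
                continue                 # duplicate within this VM
            # Level 3: check the common dataset shared across VMs.
            if cds.contains(sig):
                continue                 # duplicate across VMs
            chunk_store.put(sig, data)   # genuinely new content
        return recipe                    # saved as segment metadata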

11
Architecture of Multi-level VM snapshot backup
(Figure: backup and deduplication architecture within a cluster node)
12
Status Evaluation
  • Prototype system running on Alibaba's Aliyun
    cloud
  • Based on Xen
  • 100 nodes, each with 16 cores, 48GB memory, and
    25 VMs
  • Uses <150MB per machine for backup/deduplication
  • Evaluation data from Aliyun's production cluster
  • 41TB
  • 10 snapshots per VM
  • Segment size: 2MB
  • Avg. block size: 4KB

13
Data Characteristics of the Benchmark
  • Each VM uses 40GB storage space on average
  • OS and user data disks each take 50% of the space
  • OS data
  • 7 mainstream OS releases
  • Debian, Ubuntu, RedHat, CentOS, Win2003 32-bit,
    Win2003 64-bit, and Win2008 64-bit
  • User data
  • From 1323 VM users

14
Impacts of 3-Level Deduplication
Level 1: Segment-level detection within a VM
Level 2: Block-level detection within a VM
Level 3: Common data block detection across VMs
15
Impact for Different OS Releases
16
Separate consideration of OS and user data
Both have a Zipf-like data distribution, but
popularity growth differs as the cluster size and
number of VM users increase.
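Selective deduplication exploits this skew: fill the CDS with only the most popular blocks up to a size budget. A minimal sketch, assuming (signature, size, frequency) statistics collected offline from snapshot traces (the statistics format is an assumption, not from the deck):

    def build_cds(block_stats, budget_bytes):
        """Fill the common dataset (CDS) with the most popular blocks,
        stopping at a size budget (e.g. 100GB, per slide 19).
        block_stats: iterable of (signature, size, frequency) tuples,
        assumed to be gathered offline."""
        cds, used = set(), 0
        # Under a Zipf-like distribution, a small prefix of the most
        # referenced blocks covers most of the duplicated bytes.
        for sig, size, freq in sorted(block_stats, key=lambda s: -s[2]):
            if used + size > budget_bytes:
                break
            cds.add(sig)
            used += size
        return cds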
17
Commonality among OS releases
1GB of common OS metadata covers 70%
18
Cumulative coverage of popular user data
Coverage is the summation of covered data
block size × frequency
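The cumulative curve plotted on this slide can be computed roughly as follows, over the same assumed (signature, size, frequency) statistics:

    def cumulative_coverage(block_stats):
        """Cumulative coverage after ranking blocks by popularity: the
        value at rank k is the sum of size * frequency over the top-k
        blocks, i.e. the quantity this slide plots."""
        curve, total = [], 0
        for _sig, size, freq in sorted(block_stats, key=lambda s: -s[2]):
            total += size * freq
            curve.append(total)
        return curve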
19
Space saving compared to perfect deduplication as
CDS size increases
100GB CDS (1GB index) -> 75% of perfect dedup
20
Impact of dataset-size increase
21
Conclusions
  • Contributions
  • A multi-level selective deduplication scheme
    among VM snapshots
  • Inner-VM deduplication localizes backup and
    exposes more parallelism
  • Global deduplication with a small common dataset
    drawn from OS and user data disks
  • Uses less than 0.5% of memory per node to meet a
    stringent cloud resource requirement ->
    accomplishes 75% of what perfect deduplication
    does
  • Experiments
  • Achieve 500TB/hour on a 1000-node cloud cluster
  • Reduce bandwidth by 92% -> 40TB/hour