cdfSync: Networked Synchronization of netCDF Datasets

1
cdfSync: Networked Synchronization of netCDF Datasets
  • Joe Sirott
  • L.C. Sun, Donald W. Denbo
  • NODC/PMEL NOAA
  • University of Washington

2
What is cdfSync?
  • Synchronizes netCDF datasets over the Internet
  • Only differences between datasets are transmitted
  • Based on the rsync algorithm and program (Tridgell,
    2003)

3
Applications
  • Local mirroring of dynamic datasets for faster
    access
  • Mobile applications where network access may be
    unreliable

4
Rsync Algorithm
  • Client divides its file into blocks
  • Client calculates a hash-based signature for each block
  • Client sends the signatures to the server
  • Server compares the signatures from the client and sends
    only the data that isn't already on the client

5
Rsync Algorithm (client)
WH(B[0..S-1]), SH(B[0..S-1])
WH(B[S..2S-1]), SH(B[S..2S-1])
WH(B[2S..3S-1]), SH(B[2S..3S-1])
WH = weak rolling hash, SH = strong (MD4) hash, S = block size,
B[i..j] = bytes i through j of the file
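
The client-side step above can be sketched in a few lines of Python. This is only an illustration, not the rsync or cdfSync source: the function name `block_signatures` is hypothetical, `zlib.adler32` stands in for rsync's weak rolling checksum, and MD5 stands in for MD4 because MD4 is often unavailable in current Python builds.

```python
import hashlib
import zlib


def block_signatures(data: bytes, block_size: int):
    """Return one (weak, strong) signature pair per block of `data`."""
    signatures = []
    for start in range(0, len(data), block_size):
        block = data[start:start + block_size]  # the last block may be shorter
        weak = zlib.adler32(block)              # assumption: stand-in for WH
        strong = hashlib.md5(block).digest()    # assumption: stand-in for SH (MD4)
        signatures.append((weak, strong))
    return signatures


sigs = block_signatures(b"abcdabcdxy", block_size=4)
print(len(sigs))           # 3 blocks: "abcd", "abcd", "xy"
print(sigs[0] == sigs[1])  # True: identical blocks give identical signatures
```

Only these signature pairs travel to the server, so the upload cost depends on the number of blocks rather than on the file size.
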
6
Rsync Algorithm (server)
WH(B[0..S-1]), WH(B[1..S]), WH(B[2..S+1]), ...
WH = weak rolling hash, SH = strong (MD4) hash, S = block size
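
The server evaluates the weak hash at every byte offset, which is only affordable because the hash can be "rolled" from one window to the next in constant time. Below is a minimal sketch of a checksum in the style of the rsync weak checksum (two 16-bit running sums); the names `weak_hash` and `roll` are illustrative, and the exact constants used by rsync or cdfSync may differ.

```python
M = 1 << 16  # both running sums are kept modulo 2^16


def weak_hash(block: bytes):
    """Compute the two-part weak checksum of a block from scratch."""
    n = len(block)
    a = sum(block) % M
    b = sum((n - i) * x for i, x in enumerate(block)) % M
    return a, b


def roll(a, b, out_byte, in_byte, block_size):
    """Slide the window one byte: drop `out_byte`, append `in_byte`."""
    a = (a - out_byte + in_byte) % M
    b = (b - block_size * out_byte + a) % M
    return a, b


data = bytes(range(50))
S = 8
a, b = weak_hash(data[0:S])
for k in range(1, len(data) - S + 1):
    a, b = roll(a, b, data[k - 1], data[k + S - 1], S)
    assert (a, b) == weak_hash(data[k:k + S])  # rolled value matches a full recompute
```

Each call to `roll` costs a few arithmetic operations regardless of S, whereas recomputing the hash from scratch costs O(S) per offset.
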
7
Rsync Algorithm (server)
WH = weak rolling hash, SH = strong (MD4) hash, S = block size
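
The diagram for this slide did not survive in the transcript. As a rough sketch of how server-side matching works in the standard rsync algorithm: the server looks up the weak hash of each window in a table of client signatures, confirms candidates with the strong hash, and emits either a reference to a client block or literal bytes. All names below are hypothetical, MD5 replaces MD4, and the weak hash is recomputed per window for brevity rather than rolled.

```python
import hashlib

M = 1 << 16


def weak_hash(block: bytes):
    n = len(block)
    return sum(block) % M, sum((n - i) * x for i, x in enumerate(block)) % M


def strong_hash(block: bytes):
    return hashlib.md5(block).digest()  # assumption: stand-in for MD4


def make_delta(server_data: bytes, client_sigs, block_size: int):
    """Return a list of ('copy', client_block_index) and ('data', bytes) items."""
    table = {}
    for index, (weak, strong) in enumerate(client_sigs):
        table.setdefault(weak, []).append((strong, index))
    delta, literal, pos = [], bytearray(), 0
    while pos < len(server_data):
        window = server_data[pos:pos + block_size]
        match = None
        if len(window) == block_size:
            for strong, index in table.get(weak_hash(window), []):
                if strong == strong_hash(window):  # strong hash confirms the match
                    match = index
                    break
        if match is None:
            literal.append(server_data[pos])  # byte the client does not have
            pos += 1
        else:
            if literal:
                delta.append(("data", bytes(literal)))
                literal = bytearray()
            delta.append(("copy", match))
            pos += block_size
    if literal:
        delta.append(("data", bytes(literal)))
    return delta


def apply_delta(client_data: bytes, delta, block_size: int):
    """Rebuild the server's file on the client from its own blocks plus literals."""
    out = bytearray()
    for kind, value in delta:
        if kind == "copy":
            out += client_data[value * block_size:(value + 1) * block_size]
        else:
            out += value
    return bytes(out)


client, server, S = b"abcdefghijkl", b"abcdXXefghijkl", 4
sigs = [(weak_hash(client[i:i + S]), strong_hash(client[i:i + S]))
        for i in range(0, len(client), S)]
delta = make_delta(server, sigs, S)
assert apply_delta(client, delta, S) == server
```

In this example only the two inserted bytes are sent as literal data; the three unchanged blocks are sent as references.
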
8
cdfSync enhancements
  • Take advantage of netCDF block structure
  • Compress file metadata for efficient updates of
    large number of small files
  • In-place updates for small updates to large files

9
cdfSync algorithm (server)
WH = weak rolling hash, SH = strong (MD4) hash,
S(i) = size of block i
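
The slides say only that cdfSync uses the netCDF block structure and a per-block size S(i). One plausible reading, shown here purely as an assumption, is that block boundaries follow the file layout: one block for the header and one block per record. The function names and the fixed header and record sizes are hypothetical, and adler32 and MD5 again stand in for WH and SH.

```python
import hashlib
import zlib


def layout_blocks(header_size: int, record_size: int, file_size: int):
    """Return (offset, size) pairs: the header first, then one block per record."""
    blocks = [(0, min(header_size, file_size))]
    offset = header_size
    while offset < file_size:
        size = min(record_size, file_size - offset)
        blocks.append((offset, size))
        offset += size
    return blocks


def signatures(data: bytes, blocks):
    """Hash each variable-size block; no rolling search is needed at aligned offsets."""
    return [(zlib.adler32(data[o:o + s]), hashlib.md5(data[o:o + s]).digest())
            for o, s in blocks]


def changed_blocks(old_sigs, new_sigs):
    """Indices of blocks in the new file that are missing or different in the old one."""
    return [i for i, sig in enumerate(new_sigs)
            if i >= len(old_sigs) or old_sigs[i] != sig]


records = [bytes([i]) * 16 for i in range(4)]
old = b"HDR:3rec" + b"".join(records[:3])  # toy 8-byte header, 3 records
new = b"HDR:4rec" + b"".join(records)      # header updated, 1 record appended
old_sigs = signatures(old, layout_blocks(8, 16, len(old)))
new_sigs = signatures(new, layout_blocks(8, 16, len(new)))
print(changed_blocks(old_sigs, new_sigs))  # [0, 4]: the header and the new record
```

When records are appended, aligned blocks let unchanged records be recognized by direct comparison at known offsets instead of by scanning every byte offset.
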
11
Compressed File Metadata
  • In-situ data frequently consists of large numbers
    (10^5 to 10^6) of small files
  • Few files change between updates (10^1 to 10^2)
  • Transfer of file metadata (file name,
    modification date, etc.) dominates update time
  • cdfSync compresses this data with gzip
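
A small sketch of why compressing the file list helps: file names and modification times in a large archive are highly repetitive, so gzip shrinks them considerably. The tab-separated record format and the sample paths are invented for illustration and are not the cdfSync wire format.

```python
import gzip


def encode_file_list(entries):
    """Serialize (path, size, mtime) tuples as one tab-separated line each."""
    lines = [f"{path}\t{size}\t{mtime}" for path, size, mtime in entries]
    return "\n".join(lines).encode("utf-8")


# Hypothetical archive of many small files with similar names.
entries = [(f"buoy/2003/station{n:06d}.nc", 4096, 1050000000 + n)
           for n in range(10000)]
raw = encode_file_list(entries)
packed = gzip.compress(raw, compresslevel=9)

print(len(raw), len(packed))           # compressed list is much smaller
assert gzip.decompress(packed) == raw  # lossless round trip
```

On a slow link the transfer time of the list scales with its size, so the saving applies on every update even when no file has changed.
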

13
In-place Updates
  • Rsync is inefficient for large netCDF datasets
    (> 1 GB) with small updates
  • It writes all data (even data already present locally)
    to a temporary file and then renames the temporary file
  • Data write time can be much greater than network
    transmission time

14
In-place Updates (cont)
  • cdfSync has an option that allows data updates to be
    written to the existing file
  • If a data block hasn't moved, no data is written
  • Much more efficient for datasets where data is
    appended along the netCDF record dimension
  • Downside: the file is corrupted if the update is interrupted
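
The saving for appended data can be illustrated with a short sketch. The operation format and the name `apply_in_place` are assumptions made for this example, not the cdfSync protocol, and the ordering of moved blocks is deliberately ignored here.

```python
import io


def apply_in_place(f, ops):
    """Apply update operations to an open file and return the bytes written.

    ops are ("copy", src_offset, dst_offset, length) for data already in the
    file, or ("data", dst_offset, payload) for bytes received from the server.
    """
    written = 0
    for op in ops:
        if op[0] == "copy":
            _, src, dst, length = op
            if src == dst:
                continue  # block has not moved: nothing to write
            f.seek(src)
            chunk = f.read(length)
            f.seek(dst)
            f.write(chunk)
            written += length
        else:
            _, dst, payload = op
            f.seek(dst)
            f.write(payload)
            written += len(payload)
    return written


# An in-memory stand-in for a file that gains one appended record.
f = io.BytesIO(b"HEADrec1rec2")
ops = [("copy", 0, 0, 4), ("copy", 4, 4, 4), ("copy", 8, 8, 4),
       ("data", 12, b"rec3")]
print(apply_in_place(f, ops))  # 4: only the appended record is written
print(f.getvalue())            # b'HEADrec1rec2rec3'
```

A temporary-file update of the same data would rewrite all 16 bytes; for a file over 1 GB with a small append, that difference is what the slide refers to.
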

15
In-place Updates (cont)
  • Find and resolve cyclic dependencies

[Diagram: Server and Client files, with blocks labeled 1 and 2 and an element labeled M]
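
When blocks move within a file that is updated in place, one copy can overwrite data that another copy still needs to read; if two blocks swap places, the dependency is cyclic. A minimal sketch of one way to handle this, assuming cycles are broken by buffering one block in memory (the slides do not say which strategy cdfSync uses, and all names are hypothetical):

```python
def overlaps(start_a, len_a, start_b, len_b):
    """True if byte ranges [start_a, start_a+len_a) and [start_b, start_b+len_b) intersect."""
    return start_a < start_b + len_b and start_b < start_a + len_a


def apply_copies_in_place(buf: bytearray, copies):
    """Apply (src, dst, length) copies within buf; return how many blocks were buffered."""
    pending = [{"src": s, "dst": d, "len": n, "saved": None}
               for s, d, n in copies if s != d]
    buffered = 0
    while pending:
        for op in pending:
            # op is blocked if its destination would overwrite the unread
            # source of another pending copy.
            blocked = any(
                other is not op and other["saved"] is None
                and overlaps(op["dst"], op["len"], other["src"], other["len"])
                for other in pending)
            if not blocked:
                break
        else:
            # Every remaining copy is blocked, so there is a cycle:
            # break it by reading one source block into memory.
            victim = next(o for o in pending if o["saved"] is None)
            victim["saved"] = bytes(buf[victim["src"]:victim["src"] + victim["len"]])
            buffered += 1
            continue
        data = op["saved"]
        if data is None:
            data = bytes(buf[op["src"]:op["src"] + op["len"]])
        buf[op["dst"]:op["dst"] + op["len"]] = data
        pending.remove(op)
    return buffered


swap = bytearray(b"AAAABBBB")
print(apply_copies_in_place(swap, [(0, 4, 4), (4, 0, 4)]))  # 1 block buffered
print(bytes(swap))                                          # b'BBBBAAAA'
```

Copies that form a simple chain are reordered without any buffering; only a true cycle costs extra memory.
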
16
In-place Updates (cont)
17
Results (use netCDF blocks)
  • Synchronize identical 512 MB netCDF files and
    compare with synchronization of identical 512 MB
    non-netCDF files. Use in-place mode so that only
    disk reads, not writes, are measured
  • Rsync: 105 sec
  • cdfSync: 72 sec

18
Results (compressed file list)
  • 1.5 million identical netCDF files over a
    low-bandwidth (256 Kb/sec) link
  • Rsync: 910 sec
  • cdfSync: 206 sec

19
Results (in-place updates)
  • 1.5 GB netCDF file with extra data appended to
    the record dimension
  • Rsync: 434 sec
  • cdfSync: 175 sec
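
For comparison, the three timing results reported on the slides can be turned into speedup factors. The short script below only does that arithmetic; the dictionary labels are paraphrases of the slide titles.

```python
def speedup(rsync_sec: float, cdfsync_sec: float) -> float:
    """Ratio of rsync time to cdfSync time (greater than 1 means cdfSync is faster)."""
    return rsync_sec / cdfsync_sec


# (rsync seconds, cdfSync seconds) as reported on the results slides
results = {
    "netCDF blocks, 512 MB identical files": (105, 72),
    "compressed file list, 1.5 million files": (910, 206),
    "in-place update, 1.5 GB file with append": (434, 175),
}

for name, (rsync_sec, cdfsync_sec) in results.items():
    print(f"{name}: {speedup(rsync_sec, cdfsync_sec):.2f}x")
```

The ratios come to roughly 1.5x, 4.4x, and 2.5x respectively, so the compressed file list gives the largest gain among the three tests.
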

20
Availability
  • http://www.epic.noaa.gov
  • Joe.Sirott@noaa.gov