Transcript and Presenter's Notes

Title: Explicit Control in a Batch-aware Distributed File System


1
Explicit Control in a Batch-aware Distributed
File System
  • John Bent
  • Douglas Thain
  • Andrea Arpaci-Dusseau
  • Remzi Arpaci-Dusseau
  • Miron Livny
  • University of Wisconsin, Madison

2
Grid computing
Physicists invent distributed computing!
Astronomers develop virtual supercomputers!
3
Grid computing
[Diagram: remote compute clusters connected over the Internet to home storage]
If it looks like a duck . . .
4
Are existing distributed file systems adequate
for batch computing workloads?
  • NO. Internal decisions are inappropriate
  • Caching, consistency, replication
  • A solution: the Batch-Aware Distributed File System
    (BAD-FS)
  • Combines knowledge with external storage control
  • Detailed information about the workload is known
  • Storage layer allows external control
  • External scheduler makes informed storage
    decisions
  • Combining information and control results in
  • Improved performance
  • More robust failure handling
  • Simplified implementation

5
Outline
  • Introduction
  • Batch computing
  • Systems
  • Workloads
  • Environment
  • Why not DFS?
  • Our answer: BAD-FS
  • Design
  • Experimental evaluation
  • Conclusion

6
Batch computing
  • Not interactive computing
  • Job description languages
  • Users submit
  • System itself executes
  • Many different batch systems
  • Condor
  • LSF
  • PBS
  • Sun Grid Engine

7
Batch computing
[Diagram: a scheduler at the home storage dispatches jobs 1-4 over the Internet to remote compute nodes]
8
Batch workloads
Pipeline and Batch Sharing in Grid Workloads,
Douglas Thain, John Bent, Andrea Arpaci-Dusseau,
Remzi Arpaci-Dusseau, Miron Livny. HPDC-12, 2003.
  • General properties
  • Large number of processes
  • Process and data dependencies
  • I/O intensive
  • Different types of I/O
  • Endpoint
  • Batch
  • Pipeline
  • Our focus: scientific workloads
  • More generally applicable
  • Many others use batch computing
  • video production, data mining, electronic design,
    financial services, graphic rendering

9
Batch workloads
[Diagram: a batch workload of parallel pipelines; endpoint I/O enters and leaves at the ends of each pipeline and pipeline I/O flows between stages]
10
Cluster-to-cluster (c2c)
  • Not quite p2p
  • More organized
  • Less hostile
  • More homogeneity
  • Correlated failures
  • Each cluster is autonomous
  • Run and managed by different entities
  • An obvious bottleneck is the wide-area link

[Diagram: multiple autonomous clusters connected over the Internet to the home store]
How to manage flow of data into, within and out
of these clusters?
11
Why not DFS?
  • Distributed file system would be ideal
  • Easy to use
  • Uniform name space
  • Designed for wide-area networks
  • But . . .
  • Not practical
  • Embedded decisions are wrong

12
DFSs make bad decisions
  • Caching
  • Must guess what and how to cache
  • Consistency
  • Output: must guess when to commit
  • Input: needs a mechanism to invalidate caches
  • Replication
  • Must guess what to replicate

13
BAD-FS makes good decisions
  • Removes the guesswork
  • Scheduler has detailed workload knowledge
  • Storage layer allows external control
  • Scheduler makes informed storage decisions
  • Retains simplicity and elegance of DFS
  • Practical and deployable

14
Outline
  • Introduction
  • Batch computing
  • Systems
  • Workloads
  • Environment
  • Why not DFS?
  • Our answer: BAD-FS
  • Design
  • Experimental evaluation
  • Conclusion

15
Practical and deployable
  • User-level: requires no privilege
  • Packaged as a modified batch system
  • A new batch system which includes BAD-FS
  • General: will work on all batch systems
  • Tested thus far on multiple batch systems

[Diagram: BAD-FS storage servers running alongside SGE batch systems on two remote clusters, connected over the Internet to the home store]
16
Contributions of BAD-FS
[Diagram: compute nodes running CPU managers and BAD-FS storage servers, a job queue, and the BAD-FS scheduler alongside home storage]
  • 1) Storage managers
  • 2) Batch-Aware Distributed File System
  • 3) Expanded job description language
  • 4) BAD-FS scheduler
17
BAD-FS knowledge
  • Remote cluster knowledge
  • Storage availability
  • Failure rates
  • Workload knowledge
  • Data type (batch, pipeline, or endpoint)
  • Data quantity
  • Job dependencies

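As a rough illustration of the knowledge listed above, the scheduler's view could be captured in two Python records, one per remote cluster and one per job; all field names here are hypothetical and not taken from the BAD-FS implementation.

from dataclasses import dataclass, field

@dataclass
class ClusterInfo:
    # Remote cluster knowledge gathered by the scheduler.
    storage_available_mb: int   # free space on the cluster's storage servers
    failure_rate: float         # observed failure rate within the cluster

@dataclass
class JobInfo:
    # Workload knowledge taken from the job description language.
    name: str
    batch_mb: int               # shared, read-only input data
    pipeline_mb: int            # intermediate data passed to the next stage
    endpoint_mb: int            # final output that must reach home storage
    parents: list = field(default_factory=list)  # job dependencies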
18
Control through volumes
  • Guaranteed storage allocations
  • Containers for job I/O
  • Scheduler
  • Creates volumes to cache input data
  • Subsequent jobs can reuse this data
  • Creates volumes to buffer output data
  • Destroys pipeline volumes, copies endpoint data home
  • Configures workload to access containers

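A minimal Python sketch of this volume lifecycle, assuming a hypothetical scheduler object whose create_volume, bind, run, extract, and destroy calls are illustrative names rather than the real BAD-FS interfaces.

def run_pipeline(scheduler, cluster, jobs, batch_url, home_url):
    # Cache volume: read-only view of the batch input, reusable by later jobs.
    batch_vol = scheduler.create_volume(cluster, kind="cache", source=batch_url)
    # Scratch volume: private read-write space that buffers pipeline output.
    pipe_vol = scheduler.create_volume(cluster, kind="scratch", size_mb=500)
    for job in jobs:
        # Configure the workload to access the containers.
        scheduler.bind(job, {"/data": batch_vol, "/tmp": pipe_vol})
        scheduler.run(job)
    # Endpoint output is copied home; pipeline data is simply discarded.
    scheduler.extract(pipe_vol, "out", home_url)
    scheduler.destroy(pipe_vol)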
19
Knowledge plus control
  • Enhanced performance
  • I/O scoping
  • Capacity-aware scheduling
  • Improved failure handling
  • Cost-benefit replication
  • Simplified implementation
  • No cache consistency protocol

20
I/O scoping
  • Technique to minimize wide-area traffic
  • Allocate storage to cache batch data
  • Allocate storage for pipeline and endpoint
  • Extract endpoint

[Diagram: AMANDA (200 MB pipeline, 500 MB batch, 5 MB endpoint I/O) running on remote compute nodes under the BAD-FS scheduler. Steady state: only 5 of 705 MB traverse the wide-area.]
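A back-of-the-envelope check of the steady-state figure above, in Python, assuming the 500 MB batch set is already cached at the remote cluster and pipeline data stays in local scratch volumes.

# Per-pipeline AMANDA I/O from the slide, in MB.
pipeline_mb, batch_mb, endpoint_mb = 200, 500, 5
total_mb = pipeline_mb + batch_mb + endpoint_mb   # 705 MB touched per pipeline
# With I/O scoping, only endpoint data leaves the cluster in steady state.
wide_area_mb = endpoint_mb
print(f"{wide_area_mb} of {total_mb} MB traverse the wide-area")   # 5 of 705 MB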
21
Capacity-aware scheduling
  • Technique to avoid over-allocations
  • Scheduler runs only as many jobs as fit

22
Capacity-aware scheduling
[Diagram: parallel pipelines with endpoint outputs, illustrating capacity-aware scheduling]
23
Capacity-aware scheduling
  • 64 batch-intensive synthetic pipelines
  • Vary size of batch data
  • 16 compute nodes

24
Improved failure handling
  • Scheduler understands data semantics
  • Data is not just a collection of bytes
  • Losing data is not catastrophic
  • Output can be regenerated by rerunning jobs
  • Cost-benefit replication
  • Replicates only data whose replication cost is
    cheaper than cost to rerun the job
  • Results in paper

25
Simplified implementation
  • Data dependencies known
  • Scheduler ensures proper ordering
  • No need for cache consistency protocol in
    cooperative cache

26
Real workloads
  • AMANDA
  • Astrophysics: study of cosmic events such as
    gamma-ray bursts
  • BLAST
  • Biology: search for proteins within a genome
  • CMS
  • Physics: simulation of large particle colliders
  • HF
  • Chemistry: study of non-relativistic interactions
    between atomic nuclei and electrons
  • IBIS
  • Ecology: global-scale simulation of the earth's
    climate, used to study the effects of human activity
    (e.g. global warming)

27
Real workload experience
  • Setup
  • 16 jobs
  • 16 compute nodes
  • Emulated wide-area
  • Configuration
  • Remote I/O
  • AFS-like with /tmp
  • BAD-FS
  • Result is an order-of-magnitude improvement

28
BAD Conclusions
  • Existing DFSs insufficient
  • Schedulers have workload knowledge
  • Schedulers need storage control
  • Caching
  • Consistency
  • Replication
  • Combining this control with knowledge
  • Enhanced performance
  • Improved failure handling
  • Simplified implementation

29
For more information
  • http://www.cs.wisc.edu/adsl
  • http://www.cs.wisc.edu/condor
  • Questions?

30
Why not the BAD-FS scheduler with a traditional DFS?
  • Cooperative caching
  • Data sharing
  • Traditional DFS
  • assume sharing is the exception
  • provision for arbitrary, unplanned sharing
  • In batch workloads, sharing is the rule
  • Sharing behavior is completely known
  • Data committal
  • Traditional DFS must guess when to commit
  • AFS uses close, NFS uses 30 seconds
  • Batch workloads precisely define when

31
Is capacity-aware scheduling important in the real world?
  • Heterogeneity of remote resources
  • Shared disk
  • Workloads are changing; some are very, very large.

32
User burden
  • Additional info is needed in the declarative language
  • User probably already knows this info
  • Or can readily obtain it
  • Typically, this info already exists
  • Scattered across collection of scripts,
    Makefiles, etc.
  • BAD-FS improves current situation by collecting
    this info into one central location

33
Enhanced performance
  • I/O scoping
  • Scheduler knows I/O types
  • Creates storage volumes accordingly
  • Only endpoint I/O traverses wide-area
  • Capacity-aware scheduling
  • Scheduler knows I/O quantities
  • Throttles workloads, avoids over-allocations

34
Improved failure handling
  • Scheduler understands data semantics
  • Lost data is not catastrophic
  • Pipe data can be regenerated
  • Batch data can be refetched
  • Enables cost-benefit replication
  • Measure
  • replication cost
  • data generation cost
  • failure rate
  • Replicate only data whose replication cost is
    cheaper than expected cost to reproduce
  • Improves workload throughput

35
Capacity-aware scheduling
  • Goal
  • Avoid overallocations
  • Cache thrashing
  • Write failures
  • Method
  • Breadth-first
  • Depth-first
  • Idleness

36
Capacity-aware scheduling evaluation
  • Workload
  • 64 synthetic pipelines
  • Varied pipe size
  • Environment
  • 16 compute nodes
  • Configuration
  • Breadth-first
  • Depth-first
  • BAD-FS

Failures directly correlate to workload
throughput.
37
Workload example AMANDA
  • Astrophysics study of cosmic events such as
    gamma-ray bursts
  • Four stage pipeline
  • 200 MB pipeline I/O
  • 500 MB batch I/O
  • 5 MB endpoint I/O
  • Focus
  • Scientific workloads
  • Many others use batch computing
  • video production, data mining, electronic design,
    financial services, graphic rendering

38
BAD-FS and scheduler
  • BAD-FS
  • Allows external decisions via volumes
  • A guaranteed storage allocation
  • Size, lifetime, and type
  • Cache volumes
  • Read-only view of an external server
  • Can be bound together into cooperative cache
  • Scratch volumes
  • Private read-write name space
  • Batch-aware scheduler
  • Rendezvous of control and information
  • Understands storage needs and availability
  • Controls storage decisions

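One way to picture the volume abstraction in Python; the field names and the example sizes are illustrative assumptions, not the paper's definitions.

from dataclasses import dataclass

@dataclass
class Volume:
    kind: str        # "cache" (read-only view of an external server) or "scratch" (private read-write)
    size_mb: int     # guaranteed storage allocation
    lifetime_s: int  # lease after which the allocation may be reclaimed
    source: str = "" # for cache volumes, the home or external server being viewed

batch = Volume(kind="cache", size_mb=500, lifetime_s=3600, source="ftp://home/data")
pipe = Volume(kind="scratch", size_mb=200, lifetime_s=3600)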
39
Scheduler controls storage decisions
  • What and how to cache?
  • Answer: batch data, cached cooperatively
  • Technique: I/O scoping and capacity-aware
    scheduling
  • What and when to commit?
  • Answer: endpoint data, when it is ready
  • Technique: I/O scoping and capacity-aware
    scheduling
  • What and when to replicate?
  • Answer: data whose cost to regenerate is high
  • Technique: cost-benefit replication

40
I/O scoping
  • Goal
  • Minimize wide-area traffic
  • Means
  • Information about data type
  • Storage volumes
  • Method
  • Create coop cache volumes for batch data
  • Create scratch volumes to contain pipe data
  • Result
  • Only endpoint data traverses wide-area
  • Improved workload throughput

41
I/O scoping evaluation
  • Workload
  • 64 synthetic pipelines
  • 100 MB of I/O each
  • Varied data mix
  • Environment
  • 32 compute nodes
  • Emulated wide-area
  • Configuration
  • Remote I/O
  • Cache volumes
  • Scratch volumes
  • BAD-FS

Wide-area traffic directly correlates to workload
throughput.
42
Capacity-aware scheduling
  • Goal
  • Avoid over-allocations of storage
  • Means
  • Information about data quantities
  • Information about storage availability
  • Storage volumes
  • Method
  • Use depth-first scheduling to free pipe volumes
  • Use breadth-first scheduling to free batch volumes
  • Result
  • No thrashing due to over-allocation of batch data
  • No failures due to over-allocation of pipe data
  • Improved throughput

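A Python sketch of the admission test behind capacity-aware scheduling, under the simplifying assumption that each job declares the volumes it needs; the real scheduler additionally chooses between breadth-first and depth-first traversal as described above.

from dataclasses import dataclass

@dataclass(frozen=True)
class Vol:
    name: str
    size_mb: int

@dataclass
class Job:
    name: str
    volumes: tuple   # volumes this job needs (shared batch cache plus private pipe scratch)

def try_admit(job, free_mb, live):
    """Admit a job only if its not-yet-allocated volumes fit in free storage."""
    new = [v for v in job.volumes if v not in live]   # shared volumes cost nothing extra
    need = sum(v.size_mb for v in new)
    if need > free_mb:
        return False, free_mb       # would over-allocate: thrashing or write failures
    live.extend(new)
    return True, free_mb - need

# Example: a 500 MB shared batch cache plus a 200 MB private pipe volume per job.
batch = Vol("batch", 500)
live, free = [], 1000
for i in range(4):
    ok, free = try_admit(Job(f"job{i}", (batch, Vol(f"pipe{i}", 200))), free, live)
    print(f"job{i}: admitted={ok}, free={free} MB")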
43
Capacity-aware scheduling evaluation
  • Workload
  • 64 synthetic pipelines
  • Pipe-intensive
  • Environment
  • 16 compute nodes
  • Configuration
  • Breadth-first
  • Depth-first
  • BAD-FS

44
Capacity-aware scheduling evaluation
  • Workload
  • 64 synthetic pipelines
  • Pipe-intensive
  • Environment
  • 16 compute nodes
  • Configuration
  • Breadth-first
  • Depth-first
  • BAD-FS

Failures directly correlate to workload
throughput.
45
Cost-benefit replication
  • Goal
  • Avoid wasted replication overhead
  • Means
  • Knowledge of data semantics
  • Data loss is not catastrophic
  • Can be regenerated or refetched
  • Method
  • Measure
  • Failure rate, f, within each cluster
  • Cost, p, to reproduce data
  • Time to rerun jobs to regenerate pipe data
  • Time to refetch batch data from home
  • Cost, r, to replicate data
  • Replicate only when p × f > r
  • Result
  • Data is replicated only when it should be
  • Can improve throughput

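The decision rule above, written out in Python; here f is interpreted as the probability that the data is lost before it is no longer needed, which is a simplifying assumption layered on the measured failure rate.

def should_replicate(p_reproduce_s, f_loss_probability, r_replicate_s):
    # Expected cost of not replicating is the reproduction cost weighted by
    # the probability of loss; replicate only if that exceeds the copy cost.
    return p_reproduce_s * f_loss_probability > r_replicate_s

# Example: 600 s to rerun the jobs, 10% chance of loss, 30 s to replicate.
print(should_replicate(600, 0.10, 30))   # True: 600 * 0.10 = 60 > 30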
46
Cost-benefit replication evaluation
  • Workload
  • Synthetic pipelines of depth 3
  • Runtime 60 seconds
  • Environment
  • Artificially injected failures
  • Configuration
  • Always-copy
  • Never-copy
  • BAD-FS

Trade off overhead in an environment without failures
to gain throughput in an environment with failures.
47
Real workloads
  • Workload
  • Real workloads
  • 64 pipelines
  • Environment
  • 16 compute nodes
  • Emulated wide-area
  • Cold and warm
  • First 16 are cold
  • Subsequent 48 warm
  • Configuration
  • Remote I/O
  • AFS-like
  • BAD-FS

48
Experimental results not shown here
  • I/O scoping
  • Capacity planning
  • Cost-benefit replication
  • Other real workload results
  • Large in-the-wild demonstration
  • Works in c2c
  • Works across multiple batch systems

49
Existing approaches
  • Remote I/O
  • Interpose and redirect all I/O home
  • CON: quickly saturates the wide-area connection
  • Pre-staging
  • Manually push all input (endpoint and batch) data
  • Manually pull all endpoint output
  • Manually configure workload to find pre-staged
    data
  • CON: repetitive, error-prone, laborious
  • Traditional distributed file systems
  • Locate remote compute nodes within same name
    space as home (e.g. AFS)
  • Not truly an existing approach: impractical to deploy

50
Declarative language
  • Existing languages express process
  • specification
  • requirements
  • dependencies
  • Add primitives to describe I/O behavior
  • Modified language can express data
  • dependencies
  • type (i.e. endpoint, batch, pipe)
  • quantities

51
Example: AMANDA on AFS
  • Caching
  • Batch data redundantly fetched
  • Callback overhead
  • Consistency
  • Pipeline data committed on close
  • Replication
  • No idea which data is important

[Diagram: AMANDA on AFS; each pipeline's 200 MB of pipeline I/O and 500 MB of batch I/O cross the wide-area. AMANDA: 200 MB pipeline I/O, 500 MB batch I/O, 5 MB endpoint I/O.]
This is the slide on which I'm most interested in
feedback.
52
Overview
53
I/O Scoping
54
Capacity-aware scheduling, batch-intense
55
Capacity-aware scheduling evaluation
  • Workload
  • 64 synthetic pipelines
  • Pipe-intensive
  • Environment
  • 16 compute nodes

56
Failure handling
57
Workload experience
58
In the wild
59
Example workflow language: Condor DAGMan
  • Keyword job names a file with execute instructions
  • Keywords parent and child express relations
  • no declaration of data

job A instructions.A
job B instructions.B
job C instructions.C
job D instructions.D
parent A child B
parent C child D
60
Adding data primitives to a workflow language
  • New keywords for container operations
  • volume: create a container
  • scratch: specify the container type
  • mount: how the app addresses the container
  • extract: copy the desired endpoint output home
  • User must provide complete, exact I/O information
    to the scheduler
  • Specify which procs use which data
  • Specify size of data read and written

61
Extended workflow language
job A instructions.A
job B instructions.B
job C instructions.C
job D instructions.D
parent A child B
parent C child D
volume B1 ftp://home/data 1GB
volume P1 scratch 500 MB
volume P2 scratch 500 MB
A mount B1 /data
C mount B1 /data
A mount P1 /tmp
B mount P1 /tmp
C mount P2 /tmp
D mount P2 /tmp
extract P1/out ftp://home/out.1
extract P2/out ftp://home/out.2
62
Terminology
  • Application
  • Process
  • Workload
  • Pipeline I/O
  • Batch I/O
  • Endpoint I/O
  • Pipe-depth
  • Batch-width
  • Scheduler
  • Home storage
  • Catalogue

63
Remote resources
64
Example scenario
  • Workload
  • Width 100, depth 2
  • 1 GB batch
  • 1 GB pipe
  • 1 KB endpoint
  • Environment
  • Batch data archived at home
  • Remote compute cluster available

[Diagram: example workload data sizes: 1 GB batch data archived at the home store, 1 GB of pipe data per pipeline, and 1 KB endpoint files]
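A rough Python comparison for this scenario, assuming the 1 GB batch set is shared by all pipelines, each pipeline writes 1 GB of pipe data that the next stage reads, and endpoint I/O is 1 KB per process; the accounting for the remote I/O case is an assumption, not a measured result.

width, depth = 100, 2
batch_gb, pipe_gb, endpoint_gb = 1, 1, 1e-6

# Remote I/O: batch is re-read by every pipeline, and pipe data crosses the
# wide-area twice (written by one stage, read back by the next).
remote_gb = width * batch_gb + 2 * width * pipe_gb + width * depth * endpoint_gb
# I/O scoping: batch is fetched once and cached, pipe data stays in the cluster.
scoped_gb = batch_gb + width * depth * endpoint_gb

print(f"remote I/O ~{remote_gb:.1f} GB vs scoped ~{scoped_gb:.4f} GB over the wide-area")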
65
Ideal utilization of remote storage
  • Minimize wide-area traffic by scoping I/O
  • Transfer batch data once and cache
  • Contain pipe data within compute cluster
  • Only endpoint data should traverse wide-area
  • Improve throughput through space management
  • Avoid thrashing due to excessive batch data
  • Avoid failure due to excessive pipe data
  • Cost-benefit checkpointing and replication
  • Track data generation and replication costs
  • Measure failure rates
  • Use cost-benefit checkpointing algorithm
  • Apply independent policy for each pipeline

66
Remote I/O
  • Simplest conceptually
  • Requires least amount of remote privilege
  • But . . .
  • Batch data fetched redundantly
  • Pipe I/O unnecessarily crosses wide-area
  • Wide-area bottleneck quickly saturates

67
Pre-staging
  • Requires a large user burden
  • Needs access to the local file system of each cluster
  • Manually pushes batch data
  • Manually configures workload to use /tmp
  • Must manually pull endpoint outputs
  • Good performance through I/O scoping, but
  • Tedious, repetitive, mistake-prone
  • Availability of /tmp can't be guaranteed
  • Scheduler lacks knowledge to checkpoint