Transcript and Presenter's Notes

Title: Explicit Control in a Batch-aware Distributed File System


1
Explicit Control in a Batch-Aware Distributed File System
2
Focus of work
  • Harnessing and managing remote storage
  • Batch-pipelined, I/O-intensive workloads
  • Scientific workloads
  • Wide-area grid computing

3
Batch-pipelined workloads
  • General properties
  • Large number of processes
  • Process and data dependencies
  • I/O intensive
  • Different types of I/O
  • Endpoint
  • Batch
  • Pipeline

4
Batch-pipelined workloads
[Figure: batch-pipelined workload showing pipeline data flowing between the jobs within each pipeline and endpoint data at the workload inputs and outputs]
5
Wide-area grid computing
[Figure: home storage connected to remote compute clusters across the Internet]
6
Cluster-to-cluster (c2c)
  • Not quite p2p
  • More organized
  • Less hostile
  • More homogeneity
  • Correlated failures
  • Each cluster is autonomous
  • Run and managed by different entities
  • The obvious bottleneck is the wide-area link

[Figure: home storage and several autonomous clusters connected over the Internet]
How to manage the flow of data into, within, and out of these clusters?
7
Current approaches
  • Remote I/O
  • Condor standard universe
  • Very easy
  • Consistency through serialization
  • Prestaging
  • Condor vanilla universe
  • Manual and labor-intensive
  • Good performance through knowledge
  • Distributed file systems (AFS, NFS)
  • Easy to use, uniform name space
  • Impractical in this environment

8
Pros and cons
                 Practical   Easy to use   Leverages workload info
Remote I/O       yes         yes           no
Pre-staging      yes         no            yes
Trad. DFS        no          yes           no
9
BAD-FS
  • Solution: the Batch-Aware Distributed File System
  • Leverages workload info with storage control
  • Detailed information about the workload is known
  • Storage layer allows external control
  • External scheduler makes informed storage
    decisions
  • Combining information and control results in
  • Improved performance
  • More robust failure handling
  • Simplified implementation

                 Practical   Easy to use   Leverages workload info
BAD-FS           yes         yes           yes
10
Practical and deployable
  • User-level: requires no privilege
  • Packaged as a modified Condor system
  • A Condor system which includes BAD-FS
  • General glide-in works everywhere

[Figure: remote clusters running SGE with BAD-FS glided in on each node, all connected to the home store over the Internet]
11
BAD-FS + Condor
[Figure: compute nodes each running a Condor startd and a BAD-FS storage server; the home storage site runs Condor DAGMan, the job queue, and the BAD-FS scheduler]
1) NeST storage management
2) Batch-Aware Distributed File System
3) Expanded Condor submit language
4) BAD-FS scheduler
12
BAD-FS knowledge
  • Remote cluster knowledge
  • Storage availability
  • Failure rates
  • Workload knowledge
  • Data type (batch, pipeline, or endpoint)
  • Data quantity
  • Job dependencies (see the data-structure sketch after this list)
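
A minimal sketch of how this knowledge might be represented, in Python; the class and field names are illustrative assumptions, not the BAD-FS scheduler's actual data structures.

from dataclasses import dataclass, field
from enum import Enum

class DataType(Enum):
    BATCH = "batch"        # shared, read-only input reused across pipelines
    PIPELINE = "pipeline"  # intermediate data passed between dependent jobs
    ENDPOINT = "endpoint"  # final input/output that must reach home storage

@dataclass
class VolumeInfo:
    name: str
    kind: DataType         # data type: batch, pipeline, or endpoint
    size_mb: int           # data quantity, declared up front

@dataclass
class JobInfo:
    name: str
    parents: list[str]                                # job dependencies
    volumes: list[VolumeInfo] = field(default_factory=list)

@dataclass
class ClusterInfo:
    name: str
    free_storage_mb: int   # storage availability
    failure_rate: float    # observed failure rate at the remote cluster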

13
Control through lots
  • An abstraction that allows external storage control (sketched after this list)
  • Guaranteed storage allocations
  • Containers for job I/O
  • e.g., "I need 2 GB of space for at least 24 hours"
  • Scheduler
  • Creates lots to cache input data
  • Subsequent jobs can reuse this data
  • Creates lots to buffer output data
  • Destroys pipeline, copies endpoint
  • Configures workload to access lots
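
A minimal Python sketch of a lot as described above: a guaranteed space allocation with a lifetime and a type that job I/O is directed into. The class and its methods are hypothetical, not the NeST or BAD-FS interface.

import time

class Lot:
    """A guaranteed storage allocation: a container for job I/O."""

    def __init__(self, size_mb: int, duration_s: int, kind: str):
        self.size_mb = size_mb                       # e.g. 2048 for "2 GB of space"
        self.expires_at = time.time() + duration_s   # "... for at least 24 hours"
        self.kind = kind                             # "cache" (batch) or "scratch" (pipeline)
        self.used_mb = 0

    def write(self, mb: int) -> None:
        # A lot never over-commits: writes beyond the guarantee are refused.
        if self.used_mb + mb > self.size_mb:
            raise RuntimeError("lot allocation exceeded")
        self.used_mb += mb

# Scheduler usage: cache batch input once so later jobs can reuse it, buffer
# pipeline output in a scratch lot, then destroy the scratch lot and copy
# only the endpoint data back to home storage.
batch_cache = Lot(size_mb=2048, duration_s=24 * 3600, kind="cache")
pipeline_scratch = Lot(size_mb=500, duration_s=4 * 3600, kind="scratch")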

14
Knowledge plus control
  • Enhanced performance
  • I/O scoping
  • Capacity-aware scheduling
  • Improved failure handling
  • Cost-benefit replication
  • Simplified implementation
  • No cache consistency protocol

15
I/O scoping
  • Technique to minimize wide-area traffic
  • Allocate lots to cache batch data
  • Allocate lots for pipeline and endpoint data
  • Extract endpoint
  • Cleanup

[Figure: AMANDA pipelines running on remote compute nodes, coordinated by the BAD-FS scheduler across the Internet]
AMANDA, per pipeline: 200 MB pipeline data, 500 MB batch data, 5 MB endpoint data.
Steady state: only 5 of the 705 MB traverse the wide area.
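
A minimal Python sketch of the I/O-scoping arithmetic using the AMANDA numbers above; the traffic model is an illustration, not the BAD-FS implementation.

# AMANDA per-pipeline I/O, taken from the figure above.
PIPELINE_MB, BATCH_MB, ENDPOINT_MB = 200, 500, 5

def wide_area_traffic_mb(scoped: bool) -> int:
    """Wide-area megabytes moved per pipeline in steady state."""
    if scoped:
        # Batch data is already cached in a lot at the remote cluster and
        # pipeline data never leaves its scratch lot; only endpoint data
        # crosses the wide area.
        return ENDPOINT_MB
    # Unscoped remote I/O: every byte is read from or written to home storage.
    return PIPELINE_MB + BATCH_MB + ENDPOINT_MB

print(wide_area_traffic_mb(scoped=False))  # 705
print(wide_area_traffic_mb(scoped=True))   # 5
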
16
Capacity-aware scheduling
  • Technique to avoid over-allocations
  • Scheduler has knowledge of
  • Storage availability
  • Storage usage within the workload
  • Scheduler runs only as many jobs as fit (see the sketch after this list)
  • Avoids wasted utilization
  • Improves job throughput
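
A minimal Python sketch of capacity-aware admission, assuming each job declares the lot space it needs and each cluster reports its free storage; the names are illustrative, not the BAD-FS scheduler interface.

from dataclasses import dataclass

@dataclass
class Job:
    name: str
    space_mb: int            # lot space the job's volumes require

@dataclass
class Cluster:
    name: str
    free_mb: int             # storage currently available for lots

def schedule(jobs: list[Job], clusters: list[Cluster]) -> dict[str, str]:
    """Dispatch only as many jobs as the available storage allows."""
    placement: dict[str, str] = {}
    for job in jobs:
        for cluster in clusters:
            if job.space_mb <= cluster.free_mb:
                cluster.free_mb -= job.space_mb    # reserve the lot up front
                placement[job.name] = cluster.name
                break
        # Jobs that do not fit are held rather than started, so running jobs
        # never lose their cached data to over-allocation.
    return placement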

17
Improved failure handling
  • Scheduler understands data semantics
  • Data is not just a collection of bytes
  • Losing data is not catastrophic
  • Output can be regenerated by rerunning jobs
  • Cost-benefit replication
  • Replicates only data whose replication cost is cheaper than the cost to rerun the job (see the sketch after this list)
  • Can improve throughput in a lossy environment
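
A minimal Python sketch of the cost-benefit test, assuming the scheduler weighs the time to copy a volume home against the expected time to regenerate it; the failure-probability weighting is an assumed illustration, not the exact BAD-FS policy.

def should_replicate(copy_cost_s: float,
                     rerun_cost_s: float,
                     failure_prob: float) -> bool:
    """Replicate only if copying is cheaper than the expected cost of rerunning."""
    expected_rerun_s = failure_prob * rerun_cost_s
    return copy_cost_s < expected_rerun_s

# Example: output that takes 5 s to copy home from a 60 s job is worth
# replicating once the chance of losing the node exceeds about 8%.
print(should_replicate(copy_cost_s=5, rerun_cost_s=60, failure_prob=0.10))  # True
print(should_replicate(copy_cost_s=5, rerun_cost_s=60, failure_prob=0.05))  # False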

18
Simplified implementation
  • Data dependencies known
  • Scheduler ensures proper ordering
  • Build a distributed file system
  • With cooperative caching
  • But without a cache consistency protocol (see the sketch after this list)
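
A minimal Python sketch of cooperative caching without a consistency protocol, assuming the external scheduler has already ordered jobs so that a volume's readers never run before its writer completes; the names are hypothetical.

class CacheServer:
    """A storage server that checks peers' lots before going to home storage."""

    def __init__(self, peers: list["CacheServer"], home: dict[str, bytes]):
        self.local: dict[str, bytes] = {}   # data held in this node's lots
        self.peers = peers
        self.home = home                    # stand-in for the home store

    def read(self, path: str) -> bytes:
        if path in self.local:              # 1) this node's lot
            return self.local[path]
        for peer in self.peers:             # 2) a peer's lot within the cluster
            if path in peer.local:
                self.local[path] = peer.local[path]
                return self.local[path]
        data = self.home[path]              # 3) fall back to a wide-area fetch
        self.local[path] = data
        return data

    def write(self, path: str, data: bytes) -> None:
        # No callbacks or invalidations: the scheduler orders jobs so that a
        # volume's readers never run until its writer has finished.
        self.local[path] = data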

19
Real workloads
  • AMANDA
  • Astrophysics: study of cosmic events such as gamma-ray bursts
  • BLAST
  • Biology: search for proteins within a genome
  • CMS
  • Physics: simulation of large particle colliders
  • HF
  • Chemistry: study of non-relativistic interactions between atomic nuclei and electrons
  • IBIS
  • Ecology: global-scale simulation of the earth's climate used to study the effects of human activity (e.g., global warming)

20
Real workload experience
  • Setup
  • 16 jobs
  • 16 compute nodes
  • Emulated wide-area
  • Configuration
  • Remote I/O
  • AFS-like with /tmp
  • BAD-FS
  • Result: an order-of-magnitude improvement

21
BAD Conclusions
  • Schedulers can obtain workload knowledge
  • Schedulers need storage control
  • Caching
  • Consistency
  • Replication
  • Combining this control with knowledge
  • Enhanced performance
  • Improved failure handling
  • Simplified implementation

22
For more information
Explicit Control in a Batch-Aware Distributed File System. John Bent, Douglas Thain, Andrea Arpaci-Dusseau, Remzi Arpaci-Dusseau, Miron Livny. NSDI '04, 2004.

Pipeline and Batch Sharing in Grid Workloads. Douglas Thain, John Bent, Andrea Arpaci-Dusseau, Remzi Arpaci-Dusseau, Miron Livny. HPDC-12, 2003.
  • http://www.cs.wisc.edu/condor/publications.html
  • Questions?
23
Why not BAD-scheduler and traditional DFS?
  • Practical reasons
  • Deployment
  • Interoperability
  • Technical reasons
  • Cooperative caching
  • Data sharing
  • Traditional DFS
  • Assume sharing is the exception
  • Provision for arbitrary, unplanned sharing
  • In batch workloads, sharing is the rule
  • Sharing behavior is completely known
  • Data committal
  • Traditional DFS must guess when to commit
  • AFS commits on close; NFS commits every 30 seconds
  • Batch workloads define precisely when to commit

24
Is capacity awareness important in real world?
  1. Heterogeneity of remote resources
  2. Shared disk
  3. Workloads are changing: some are very, very large and still growing.

25
User burden
  • Additional info is needed in the declarative language
  • The user probably already knows this info
  • Or can readily obtain it
  • Typically, this info already exists
  • Scattered across a collection of scripts, Makefiles, etc.
  • BAD-FS improves the current situation by collecting this info in one central location

26
In the wild
27
Capacity-aware scheduling evaluation
  • Workload
  • 64 synthetic pipelines
  • Varied pipeline size
  • Environment
  • 16 compute nodes
  • Configuration
  • Breadth-first
  • Depth-first
  • BAD-FS

Failures directly correlate with workload throughput.
28
I/O scoping evaluation
  • Workload
  • 64 synthetic pipelines
  • 100 MB of I/O each
  • Varied data mix
  • Environment
  • 32 compute nodes
  • Emulated wide-area
  • Configuration
  • Remote I/O
  • Cache volumes
  • Scratch volumes
  • BAD-FS

Wide-area traffic directly correlates with workload throughput.
29
Cost-benefit replication evaluation
  • Workload
  • Synthetic pipelines of depth 3
  • Runtime: 60 seconds
  • Environment
  • Artificially injected failures
  • Configuration
  • Always-copy
  • Never-copy
  • BAD-FS

Trades off overhead in an environment without failures to gain throughput in an environment with failures.
30
Real workloads
  • Workload
  • Real workloads
  • 64 pipelines
  • Environment
  • 16 compute nodes
  • Emulated wide-area
  • Cold and warm
  • First 16 are cold
  • Subsequent 48 warm
  • Configuration
  • Remote I/O
  • AFS-like
  • BAD-FS

31
Example workflow language: Condor DAGMan
  • The keyword job names a file with execute instructions
  • The keywords parent and child express relations
  • No declaration of data

job A instructions.A
job B instructions.B
job C instructions.C
job D instructions.D
parent A child B
parent C child D
32
Adding data primitives to a workflow language
  • New keywords for container operations
  • volume: create a container
  • scratch: specify the container type
  • mount: how the application addresses the container
  • extract: the desired endpoint output
  • The user must provide complete, exact I/O information to the scheduler
  • Specify which processes use which data
  • Specify the size of data read and written

33
Extended workflow language
job A instructions.A
job B instructions.B
job C instructions.C
job D instructions.D
parent A child B
parent C child D
volume B1 ftp://home/data 1 GB
volume P1 scratch 500 MB
volume P2 scratch 500 MB
A mount B1 /data
C mount B1 /data
A mount P1 /tmp
B mount P1 /tmp
C mount P2 /tmp
D mount P2 /tmp
extract P1/out ftp://home/out.1
extract P2/out ftp://home/out.2