IO Access in Condor and Grid - PowerPoint PPT Presentation

Title:

IO Access in Condor and Grid

Provided by: MironL
1
I/O Access in Condor and Grid
2
What is Condor?
  • Condor is a batch job system
  • Goal: High-throughput computing
  • Different from high-performance computing
  • Goal: High reliability
  • Goal: Support distributed ownership

3
High Throughput Computing
  • Worry about FLOPS/year, not FLOPS/second
  • Use all resources effectively
  • Dedicated clusters
  • Non-dedicated computers (desktop)

4
Effective resource use
  • Requires high reliability
  • Computers come and go; your jobs shouldn't.
  • Checkpointing
  • Be prepared for everything breaking
  • Requires distributed ownership
  • Requires distributed access
  • Must deal with lack of shared filesystem

5
Jobs in Condor
  • Standard Universe
  • Checkpointing & migration
  • Remote I/O
  • Available to many (not all) jobs
  • Vanilla Universe
  • Any job you want
  • No checkpointing, No remote I/O
  • Other Universes
  • MPI, PVM, Java

6
Machines in Condor
  • Distributed ownership
  • Your desktop can be in a Condor pool
  • You choose how it is used
  • You choose when it is used
  • Dedicated computers or non-dedicated

7
Notable Users
  • UW-Madison Condor pool
  • 900 CPUs, millions of CPU hours/year
  • INFN
  • 150 CPUs?
  • Oracle
  • 2000-3000 CPUs, worldwide
  • Hundreds of pools worldwide

8
Working with files in Condor
  • Today
  • Shared file systems
  • Transferring files
  • Remote I/O
  • Tomorrow
  • Pluggable File System
  • NeST
  • Stork

9
Review: Submitting a job
  • Write a submit file
  • Executable = dowork
  • Input = dowork.in
  • Output = dowork.out
  • Arguments = 1 alpha beta
  • Universe = vanilla
  • Log = dowork.log
  • Queue
  • Give it to Condor
  • condor_submit <submit-file>
  • Watch it run: condor_q
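A submit file is just a list of key/value settings followed by a Queue statement, so it is easy to generate from a script. The following is a minimal sketch, not part of Condor: the helper name `render_submit_file` is hypothetical, and it only reproduces the settings listed on this slide.

```python
def render_submit_file(settings, queue_count=1):
    """Render a dict of settings as submit-file text (illustrative helper)."""
    lines = [f"{key} = {value}" for key, value in settings.items()]
    # A bare "Queue" submits one job; "Queue N" submits N jobs.
    lines.append("Queue" if queue_count == 1 else f"Queue {queue_count}")
    return "\n".join(lines) + "\n"


# The settings from the slide above.
job = {
    "Executable": "dowork",
    "Input": "dowork.in",
    "Output": "dowork.out",
    "Arguments": "1 alpha beta",
    "Universe": "vanilla",
    "Log": "dowork.log",
}

submit_text = render_submit_file(job)
print(submit_text)
```

The resulting text can be written to a file and handed to condor_submit.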

Files on shared fs
10
What happens when it runs?
  • Condor requires a shared filesystem for this job.
    Why?
  • You have a vanilla job
  • You did not tell it to transfer any files
  • Therefore, Condor adds a requirement
  • (TARGET.FileSystemDomain == MY.FileSystemDomain)
  • What does this mean?
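One way to read this requirement: a machine is eligible only if its FileSystemDomain attribute equals the job's. The sketch below models that comparison with plain dictionaries. It is not Condor's matchmaking code, and the machine names and domain values are made up for illustration.

```python
def same_filesystem_domain(job_ad, machine_ad):
    """True when the job (MY) and the machine (TARGET) share a filesystem domain."""
    return job_ad["FileSystemDomain"] == machine_ad["FileSystemDomain"]


# Hypothetical attribute values: the job is submitted from the Bologna domain.
job_ad = {"FileSystemDomain": "bologna.example"}
machines = [
    {"Name": "padova-node", "FileSystemDomain": "padova.example"},
    {"Name": "bologna-node", "FileSystemDomain": "bologna.example"},
]

# Only machines that satisfy the requirement can run the job.
eligible = [m["Name"] for m in machines if same_filesystem_domain(job_ad, m)]
print(eligible)
```

With no file transfer requested, the job can only land where its files are already visible.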

11
What happens when it runs?
12
No shared filesystem
  • Tell Condor to transfer files
  • Executable = dowork
  • Input = dowork.in
  • Output = dowork.out
  • Arguments = 1 alpha beta
  • Universe = vanilla
  • Log = dowork.log
  • Transfer_Files = ONEXIT
  • Transfer_Input_Files = dataset
  • Queue
  • Job can run in Padova or Bologna
  • Files are always transferred

13
Shared Filesystem?
  • Even better (Condor 6.5.3 and later)
  • Input = dowork.in
  • Output = dowork.out
  • Universe = vanilla
  • Should_Transfer_Files = IF_NEEDED
  • Transfer_Input_Files = dataset
  • Rank = (MY.FileSystemDomain ==
    TARGET.FileSystemDomain)
  • Job can run in Padova or Bologna
  • Files are transferred for Padova, not Bologna
  • We prefer avoiding transfer
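The combination of a rank on matching filesystem domains and transfer-if-needed can be sketched as two small functions: one that scores a machine and one that decides whether files must move. This is an illustration only, not Condor's implementation. It assumes a true rank expression counts as 1 and a false one as 0, and the machine names and domain values are hypothetical.

```python
def rank(job_ad, machine_ad):
    """Prefer machines in the job's filesystem domain (assumed: true -> 1, false -> 0)."""
    return 1 if job_ad["FileSystemDomain"] == machine_ad["FileSystemDomain"] else 0


def needs_transfer(job_ad, machine_ad):
    """With transfer-if-needed, files move only when the domains differ."""
    return job_ad["FileSystemDomain"] != machine_ad["FileSystemDomain"]


# Hypothetical values: the job is submitted from the Bologna domain.
job_ad = {"FileSystemDomain": "bologna.example"}
machines = [
    {"Name": "padova-node", "FileSystemDomain": "padova.example"},
    {"Name": "bologna-node", "FileSystemDomain": "bologna.example"},
]

# Both machines are acceptable; the higher-ranked one is tried first.
ordered = sorted(machines, key=lambda m: rank(job_ad, m), reverse=True)
plan = [(m["Name"], needs_transfer(job_ad, m)) for m in ordered]
print(plan)
```

Unlike a hard requirement, the rank only expresses a preference, so the job can still run in the other domain at the cost of a transfer.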

14
Standard Universe
  • Standard universe provides
  • Checkpointing
  • Remote I/O
  • Requires re-link of your program
  • No recompilation
  • Doesn't work for all programs
  • No threads
  • No dynamic libraries on Linux
  • Limited networking

15
Remote I/O
  • All system calls are redirected to submission
    computer
  • No files are transferred
  • It looks just like home
  • Job runs as nobody on remote computer
  • Files should be read or write only, not both

16
Is remote I/O efficient?
  • If you read a file only once, yes
  • If you read less than a whole file, yes
  • If you read a file many times, it may be less
    efficient
  • We find it is not a problem for most jobs
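A rough way to see why is to count bytes crossing the network. The sketch below compares remote I/O against transferring the whole file once, under simplifying assumptions that are not from the slides: every remote read goes over the network, nothing is cached on the execute machine, and latency is ignored. The 100 MB file size is an arbitrary example.

```python
def remote_io_bytes(bytes_per_pass, passes):
    """Network bytes if every read is forwarded to the submit machine (no caching assumed)."""
    return bytes_per_pass * passes


def transfer_bytes(file_size):
    """Network bytes if the whole file is copied once before the job runs."""
    return file_size


file_size = 100 * 1024**2  # hypothetical 100 MB input file

costs = {
    "read once": remote_io_bytes(file_size, 1),
    "read a tenth": remote_io_bytes(file_size // 10, 1),
    "read five times": remote_io_bytes(file_size, 5),
}

# Remote I/O wins whenever it moves no more data than a full transfer would.
cheaper = {
    name: "remote I/O" if cost <= transfer_bytes(file_size) else "file transfer"
    for name, cost in costs.items()
}
print(cheaper)
```

The three cases line up with the bullets above: reading once or partially favors remote I/O, repeated reads favor a one-time transfer.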

17
How do you use it?
  • Compile your program
  • gcc -c somejob.c → somejob.o
  • Link your program
  • condor_compile gcc -o somejob somejob.o
  • Use the standard universe
  • Executable = somejob
  • Universe = standard

18
What happens when it runs?
  • Condor does not require a shared filesystem
  • You used the standard universe
  • Files will not be transferred
  • Condor does not modify the requirements

19
What happens when it runs?
Execute Computer
Submit Computer
Job
Shadow
Remote I/O library
Remote I/O handler
Disk
20
Summary: Condor Today
  • Vanilla jobs work with
  • Shared file system
  • Transferring files
  • Standard universe
  • Remote I/O
  • Checkpointing

21
Pluggable File System (PFS)
  • Bypass: an interposition agent
  • Shared library squeezes into program
  • Intercepts calls to specific functions
  • PFS
  • Intercepts file access
  • Use FTP/GSI-FTP/HTTP/NeST
  • Like Condor remote I/O except
  • No relinking
  • Usable outside of Condor

22
Bypass (Work by Doug Thain)
  • A tool for interposition agents
  • Uses dynamic library preload
  • setenv LD_PRELOAD /usr/lib/pfs.so
  • Just write replacement code
  • ssize_t read(int fd)
  • agent_action
  • //code to do something
  • return read() // call real read
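The interposition pattern itself (run some agent code, then call the real function) can be shown without a preloaded shared library. The sketch below is only an analogy in Python: it temporarily replaces the built-in `open` with a wrapper that records the path and then delegates to the real `open`. It does not use Bypass or LD_PRELOAD, and the name `interpose_open` is hypothetical.

```python
import builtins
import contextlib
import os
import tempfile


@contextlib.contextmanager
def interpose_open(log):
    """Temporarily wrap the built-in open() with an agent that records each path."""
    real_open = builtins.open

    def agent_open(file, *args, **kwargs):
        log.append(str(file))                    # the agent's own action
        return real_open(file, *args, **kwargs)  # then call the real open

    builtins.open = agent_open
    try:
        yield
    finally:
        builtins.open = real_open                # always restore the original


accessed = []
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.txt")
    with open(path, "w") as f:                   # not intercepted: agent not installed yet
        f.write("hello")
    with interpose_open(accessed):
        with open(path) as f:                    # intercepted by the agent
            content = f.read()

print(accessed, content)
```

The calling code is unchanged in both cases, which is the point of an interposition agent.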

23
PFS (Doug Thain)
  • PFS uses Bypass: it intercepts all file accesses
  • When you access a URL, it implements it.
  • /http/www.yahoo.com/index.html
  • Just use pfsrun
  • pfsrun vi /http/www.yahoo.com/index.html
  • Warning: it mostly works
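The path convention shown here puts the protocol in the first path component and the host in the second. A minimal sketch of that mapping follows. It is not PFS code: the function name and the scheme table are assumptions, and treating the anonftp prefix as plain FTP is a guess made for illustration.

```python
# Assumed mapping from the first path component to a URL scheme.
SCHEMES = {"http": "http", "anonftp": "ftp"}


def pfs_path_to_url(path):
    """Map a path like /http/host/file to a URL, or return None for ordinary paths."""
    parts = path.lstrip("/").split("/", 1)
    if len(parts) != 2 or parts[0] not in SCHEMES or not parts[1]:
        return None  # not a remote-style path: leave it to the local filesystem
    scheme, rest = parts
    return f"{SCHEMES[scheme]}://{rest}"


print(pfs_path_to_url("/http/www.yahoo.com/index.html"))
print(pfs_path_to_url("/etc/passwd"))
```

An interposed open call could apply this kind of mapping first and fall through to the real call when it returns None.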

24
Live Demo
  • pfsrun vi /http/www.yahoo.com/index.html
  • pfsrun tcsh -f
  • cd /anonftp/ftp.cs.wisc.edu/condor
  • pwd
  • more RTab
  • cd
  • grep -i infom /http/www.bo.infn.it/index.html

25
PFS Status
  • You can download it today
  • http://www.cs.wisc.edu/thain/
  • It works well, but not all the time
  • Can give remote I/O in the vanilla universe
  • An alternative is being explored: ptrace

26
NeST
  • Network storage for the grid
  • A file server on steroids
  • A work in progress
  • Work by John Bent and Joseph Stanley

27
NeST as a file server
  • Any user can run it, not just root: you do not
    need to be a system administrator
  • Multiple protocols: access however you like
  • GridFTP, FTP, HTTP, NFS, and more

28
Lots
  • NeST supports storage allocations, or "lots"
  • You request 500MB for 10 hours
  • You can rely on your storage being there
  • (Well, as much as you can rely on anything in a
    grid)
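A lot can be thought of as a space guarantee with a size and a lifetime. The sketch below models only that idea. It is not the NeST API: the class and method names are hypothetical, and timestamps are passed in explicitly so the behavior is easy to follow.

```python
from dataclasses import dataclass


@dataclass
class Lot:
    """A storage allocation with a capacity and an expiry time (illustrative model)."""
    capacity_bytes: int
    expires_at: float      # seconds since some reference time
    used_bytes: int = 0

    def store(self, nbytes, now):
        """Account for nbytes of new data, refusing if the lot is expired or full."""
        if now >= self.expires_at:
            raise RuntimeError("lot has expired")
        if self.used_bytes + nbytes > self.capacity_bytes:
            raise ValueError("lot is full")
        self.used_bytes += nbytes


MB = 1024**2
HOUR = 3600

# The slide's example: request 500 MB for 10 hours, starting at time 0.
lot = Lot(capacity_bytes=500 * MB, expires_at=10 * HOUR)
lot.store(400 * MB, now=1 * HOUR)
print(lot.used_bytes // MB)
```

Writes beyond the capacity or after the lifetime are rejected, which is what lets a job plan around the space it reserved.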

29
NeST as a Cache
  • Run one NeST as a master
  • Your home file server
  • Make it reliable: well-maintained machine, UPS
  • Run other NeSTs as caches
  • Point to master NeST
  • Cache data locally
  • May be unreliable

30
Using a NeST cache
  • Use the NeST protocol to talk to the cache
  • If the cache disappears, you will talk to the
    master NeST
  • Is it inconvenient to use the NeST protocol?

31
PFS + NeST
  • PFS can speak to NeST
  • Your applications can speak to a NeST cache with
    no modification
  • You can work in a wide-area or grid environment
    with no modification
  • You get local data access for free
  • Recall the question: is remote I/O a good idea?

32
Scenario 1
  • Submit a script as a job. The script:
  • Runs NeST
  • Runs your job with PFS file arguments pointing
    at NeST

Submit Machine
Execute Machine
Job
NeST Cache
NeST Master
PFS
33
Scenario 1
  • When your job reads data, PFS redirects requests
    to NeST cache
  • If the data is not present it is requested from
    the NeST master
  • If the NeST cache fails, the NeST master is used

34
Scenario 2
  • Submit one job that is the NeST cache
  • Submit many jobs that access this NeST cache

Execute Machine 1
NeST Cache
Submit Machine
NeST Master
35
Condor or a Grid?
  • These scenarios work in Condor or on a grid
  • Mostly useful across a wide-area, not a local
    Condor pool
  • Clever way to use it on a grid:
  • Condor-G + Glide-in

36
What is Condor-G?
  • Another Condor universe: the Globus universe
  • When you submit jobs, they are not run by Condor,
    but are given to Globus
  • Condor-G provides job persistence and reliability
    for Globus jobs

37
Condor-G
Globus Gatekeeper
Submit Machine
Job Exec Args Universe Globus
Cluster
38
Glide-in
  • Problem: You want the whole world to be your
    Condor pool
  • Solution: Create an overlay Condor pool
  • Run Condor daemons as a job on another pool
  • You have a larger Condor pool

39
Glide-in
40
NeST + Glide-in
  • Submit a glide-in job that runs a NeST cache and
    Condor daemons
  • Your remote jobs access the NeST cache
  • Your local jobs access the NeST master
  • Everything looks like Condor
  • Good performance everywhere

41
NeST Status
  • In active development
  • You can download it today
  • http://www.cs.wisc.edu/condor/nest
  • Cache feature is experimental
  • Paper: "Pipeline and Batch Sharing in Grid
    Workloads"

42
Stork
  • Background: the job problem
  • Globus-job-run
  • Unreliable: no persistent job queue
  • No retries
  • Condor-G
  • Reliable: persistent job queue
  • Retry after failures
  • Submit it and forget it!

43
Stork: the file problem
  • Background: the file transfer problem
  • Globus-url-copy (or wget, or ...)
  • Unreliable: no persistent queue
  • No retries
  • Stork
  • Reliable: persistent queue of file transfers to
    be done
  • Retries on failure
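The two properties named here, a persistent queue and retries, can be sketched in a few lines. This is not Stork's implementation: the queue is a JSON file, the function names are hypothetical, and the transfer itself is a stand-in that fails twice before succeeding.

```python
import json
import os
import tempfile


def run_queue(queue_path, transfer, max_attempts=3):
    """Process queued transfers, retrying failures and saving the queue after each job."""
    with open(queue_path) as f:
        pending = json.load(f)
    finished = []
    while pending:
        job = pending[0]
        status, attempts = "failed", 0
        while attempts < max_attempts and status != "done":
            attempts += 1
            try:
                transfer(job["src_url"], job["dest_url"])
                status = "done"
            except OSError:
                pass  # treat as transient and retry
        finished.append({**job, "status": status, "attempts": attempts})
        pending.pop(0)
        with open(queue_path, "w") as f:
            json.dump(pending, f)  # persist, so a restart resumes from here
    return finished


calls = {"count": 0}


def flaky_transfer(src_url, dest_url):
    """Stand-in transfer that fails on its first two calls."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise OSError("simulated network failure")


with tempfile.TemporaryDirectory() as tmp:
    queue_path = os.path.join(tmp, "queue.json")
    with open(queue_path, "w") as f:
        json.dump([{"src_url": "ftp://src.example/a", "dest_url": "ftp://dst.example/a"}], f)
    finished = run_queue(queue_path, flaky_transfer)
    with open(queue_path) as f:
        remaining = json.load(f)

print(finished, remaining)
```

Because the queue lives on disk, the work that remains survives a crash of the process driving the transfers.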

44
Why bother?
  • You could do Stork with Condor, but:
  • Stork understands file transfers
  • Local files
  • FTP
  • GSI-FTP
  • NeST
  • SRB
  • SRM
  • Stork understands space reservations
  • Stork recovers from failures
  • Just submit & forget

45
A Stork job
  • dap_type = "transfer"
  • src_url = "srb://ghidorac.sdsc.edu/test1.txt"
  • dest_url = "nest://db18.cs.wisc.edu/test8.txt"
  • stork_submit: queue the transfer
  • stork_status: show progress

46
One job isn't enough
  • Reserve space
  • Transfer files
  • Run job
  • Release space
  • How do we combine these?

47
Condor DAGMan
  • DAG: Directed Acyclic Graph
  • A DAG is the data structure used by DAGMan to
    represent these dependencies.
  • Each job is a node in the DAG.
  • Each node can have any number of parent or
    children nodes as long as there are no loops!
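The "no loops" rule is what makes it possible to put the nodes in a valid run order. The sketch below uses Python's standard graphlib module (Python 3.9 or later) to order a reserve, transfer, run, transfer, release chain and to show that a loop is rejected. It is not DAGMan, and the node names are chosen for illustration.

```python
from graphlib import CycleError, TopologicalSorter

# Each node maps to the set of parent nodes that must finish first.
dag = {
    "transfer_in": {"reserve"},
    "run": {"transfer_in"},
    "transfer_out": {"run"},
    "release": {"transfer_out"},
}

# A valid execution order: every node appears after all of its parents.
order = list(TopologicalSorter(dag).static_order())
print(order)

# Adding an edge that closes a loop makes the graph impossible to order.
looped = dict(dag, reserve={"release"})
try:
    list(TopologicalSorter(looped).static_order())
    has_cycle = False
except CycleError:
    has_cycle = True
print(has_cycle)
```

A node with several parents simply lists them all in its set and runs only after every one of them has finished.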

48
DAGMan + Stork
  • A DAG can combine Condor jobs with Stork jobs
  • Useful in a grid
  • Can be used with a NeST or without

Reserve
Transfer
Transfer
Run
Transfer
Release
49
Summary
  • Condor Today
  • Files on shared filesystems
  • Transfer files
  • Remote I/O
  • Condor & grids tomorrow
  • Pluggable file system
  • NeST
  • Stork
  • Some combination of the above