Title: Real-life experiences with grids: It's not as easy as it looks
1. Real-life experiences with grids: It's not as easy as it looks
- Alain Roy
- roy_at_cs.wisc.edu
- University of Wisconsin-Madison
- Condor Team
2. Who Am I?
- Member of Condor Team
- Experience with Condor
- Experience with grid deployment
- Developer of Virtual Data Toolkit
- Used by GriPhyN, EDG, LCG
- Packaging of Globus, Condor, etc.
- Collaborator with INFN
- Working with Paolo Mazzanti
- In Bologna for four weeks
3. Italy
- Italy is beautiful
- The food is wonderful
- The people are friendly
4. Background
- Condor's environment is a little like a grid
- Not all computers (grid sites) are under Condor's control
- Computers (grid sites) disappear at the owner's whim
- Everything changes constantly
- Condor was built to deal with this dynamic environment
- Grid software needs to do the same
5. Background
- Late 1980s until today
- Condor developed and deployed on hundreds of sites
- Condor built to deal with failures
- Recently
- Condor-G: your window to the grid
- Condor team has helped deploy grid technology for real use, not just experiments
6. Background: Condor
- Condor is a batch job system
- Goal: High-throughput computing
- Different from high-performance computing
- Goal: High reliability
- Goal: Support distributed ownership
7. High-Throughput Computing
- Worry about FLOPS/year, not FLOPS/second
- Use all resources effectively
- Dedicated clusters
- Non-dedicated computers (desktop)
8. Effective Resource Use
- Requires high reliability
- Computers come and go; your jobs shouldn't.
- Checkpointing
- Be prepared for everything breaking
- Requires distributed ownership
9. Condor-G
- Condor-G submits Globus jobs
- Jobs are in persistent queue
- Unlike globus-job-run
- Jobs are retried on system failures
- Jobs are held on some failures
- Condor-G makes it easy to submit grid jobs
10. Background: USCMS
- CMS
- Detector online in 2007
- Needs to simulate and reconstruct millions of events
- USCMS testbed
- Joint PPDG/GriPhyN effort
- Integrate CMS tools with grid tools
- Globus
- Condor-G
- Contribute real work to CMS
11. Background: USCMS
- 7 sites, 250 CPUs
- Spring 2002: Deploy and test
- Fall 2002
- Last-minute production
- 150,000 events in two weeks
- Successful, but lots of work
- Today
- Wider deployment and use
12. Background: DØ
- Experiment at Fermilab
- Already doing real production, real analysis
- Deploying on grid sites today
- Condor-G
- Globus
- SAM
13. DØ and Condor-G
- They liked Condor-G
- Condor-G was missing a feature
- Deciding which grid site to use
- SAM (data handling software) knows where data is located
- SAMGrid
- Condor-G asks SAM for advice
- Condor-G decides where to run jobs
14. DØ deployment
- Spring: Beginning of deployment
- Late summer: Production
- Early results
- It looks good
- We have more work to do
- Better error reporting
- Better matchmaking
- What will we learn later?
15. Problems and Lessons
- During our experiences, we've
- Encountered many problems
- Developed solutions to these problems
- Learned many lessons about grids
- This talk
- Shares some interesting problems
- Gives some advice and solutions
16. Taking a taxi
Problem
- How do you take a taxi in Paestum, Italy?
- We don't need to walk 4 km there
- The ruins were lovely
- The ruins were outside
- It was about 35°C
- Wife is pregnant
17. Use all your resources
Lesson
- Walk up to storekeeper
- Ask "Dovay Ooon Taxi?" (Dov'è un taxi?)
- Be patient: Wait ten minutes
- Take taxi
- I assumed my resources (local knowledge, Italian) were insufficient, but they saved me time when I used them
18. Use all your resources
Lesson
- Condor
- Uses dedicated machines (I can walk)
- Uses non-dedicated machines (I can sometimes ask for help)
- Grids
- Connect your machine rooms
- Can you take advantage of other resources?
- Avoid the mentality "I must control all resources," and you will prosper
19. Grid: distributed machine room?
- You can have good control
- You can pre-install applications
- You know how everything works
- BUT
- You lose flexibility
- How quickly can you upgrade sites?
- Did they install everything correctly?
- Can you use new grid sites easily?
20. Grid: Use all resources
- Assume basic grid software is installed
- Assume nothing else is installed
- Bring your software with you
- Submit one job: install software
- Submit N jobs: use software
- You control software
- You ensure correct installation
- Easy to use any grid site
21. Long-running programs
Problem
- Long-running programs crash
- Condor has daemons on each machine
- User (job) agent
- Machine agent
- Matchmaker
- They crash
- Programming errors
- Network failures
- Disk failures
22. Watch programs
Lesson
- Condor master
- Small program, rarely changed
- Runs Condor daemons
- When daemon crashes
- Restart daemon, send email
- If it crashes again, restart after backoff
- Result
- Many errors are silently fixed
- Yet we don't just ignore crashes
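The restart policy on this slide can be sketched in a few lines of Python. This is only an illustration under assumptions, not the condor_master code: `supervise`, its parameters, and the doubling delay are hypothetical, `run` is any callable that runs the daemon and returns its exit code, and `notify` stands in for sending email.

```python
import time


def supervise(run, max_restarts=5, base_delay=1.0, max_delay=60.0,
              sleep=time.sleep, notify=print):
    """Run a daemon via run(), restarting it whenever it crashes.

    The first crash triggers an immediate restart; repeated crashes wait
    for a delay that doubles each time (hypothetical policy). Returns the
    number of restarts once run() exits with code 0.
    """
    restarts = 0
    delay = base_delay
    while True:
        exit_code = run()
        if exit_code == 0:
            return restarts
        # Never ignore a crash silently: report every one of them.
        notify(f"daemon exited with code {exit_code}")
        if restarts >= max_restarts:
            raise RuntimeError("daemon keeps crashing; giving up")
        if restarts > 0:
            # Second and later crashes: back off before restarting.
            sleep(delay)
            delay = min(delay * 2, max_delay)
        restarts += 1
```

A caller might pass `lambda: subprocess.call(["my_daemon"])` as `run`, where `my_daemon` is a placeholder name. Keeping the supervisor this small follows the slide's point: the watcher should be simple and rarely changed.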
23. Short-running programs
Problem
- Short-running programs crash/hang
- Example: globus-url-copy
- USCMS testbed: staging data
- Some fraction of copies hang or fail
- Programming error or delicate network
- Hard to reproduce and fix
24. Watch programs
Lesson
- When copy exceeds timeout, kill and retry
- Possible to do in shell scripting languages, but not easy
- Use Fault Tolerant Shell to watch programs
25. Fault Tolerant Shell
- Shell language built for coping with errors

    try for 30 minutes
        wget http://www.example.com/file.tar.gz
        gunzip file.tar.gz
        tar xf file.tar
    end
26. FTSH: exponential backoff
- Why exponential backoff?
- What if 100 ftsh scripts are executing?
- Avoid synchronization → reduce load, increase chance of success
- Similar to Ethernet
27. Fault Tolerant Shell
- Easier to cope with failures

    try 5 times
        wget http://www.example.com/file.tar.gz
    catch
        rm -f file.tar.gz
        failure
    end

The catch block cleans up the partially downloaded file, if it exists
28. Fault Tolerant Shell
- Flexible

    try for 30 minutes
        try for 5 minutes
            wget http://example.com/file.tar.gz
        end
        try for 1 minute or 3 times
            gunzip file.tar.gz
            tar xf file.tar
        catch
            rm -rf file.tar
        end
    end
29. FTSH: More information
- Work of Doug Thain
- thain_at_cs.wisc.edu
- Excellent paper
- "The Ethernet Approach to Grid Computing," by Doug Thain
- Available from http://www.cs.wisc.edu/thain
- Even if you don't use FTSH, read this paper!
30. Whose error is it?
Problem
- The source of an error is not always obvious
- The source of an error influences how you react to the error
- Example: Java universe in Condor
31. Java Universe
- Users submit Java jobs to Condor
- Whose error is it? Check result code
- 1: Program dereferenced NULL pointer → Job shouldn't run again
- 1: Job's image is corrupt → Job shouldn't run again
- 1: VM doesn't have enough memory to run program → Try another machine with more memory
- 1: Java installation is misconfigured → Don't use this machine for Java
32. Don't trust configuration
Lesson
- User tells Condor Java is installed
- This is just a hint!
- Condor verifies Java configuration
- Run simple job, verify output
- If Java works, Condor advertises that Java can be used
- If Java fails, error is reported, Java can't be used
33. Look for error scope
Lesson
- Add Java wrapper to all Java jobs
- Run program
- Examine return code/exception
- Write all details to file
- Examine output of wrapper, or exception from JVM
- We know if job is bad
- We know if JVM is insufficient for job
- We know if JVM is bad
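The wrapper idea can be illustrated with a Python analogue. This is not the Java wrapper Condor uses: the `Scope` names and the mapping from exception types to scopes are assumptions chosen to mirror the three cases above.

```python
from enum import Enum


class Scope(Enum):
    JOB = "job is bad: do not run it again"
    RESOURCE = "machine is insufficient: try another machine"
    INSTALLATION = "installation is bad: stop using this machine"


def run_wrapped(job):
    """Run job() and return (scope, detail); scope is None on success.

    The caller reads the scope instead of a bare exit code, which would
    look the same for all three kinds of failure.
    """
    try:
        job()
    except MemoryError as error:
        # The job may be fine; this machine simply lacks the memory.
        return Scope.RESOURCE, repr(error)
    except ImportError as error:
        # The runtime itself is incomplete or misconfigured.
        return Scope.INSTALLATION, repr(error)
    except Exception as error:
        # Anything else is treated as a bug in the job itself.
        return Scope.JOB, repr(error)
    return None, "ok"
```

A real wrapper would also write the details to a file, as the slide suggests, so the submitting side can read them after the job exits.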
34. Error Scope
- We could have an entire talk on error scope
- Excellent paper: "Error Scope on a Computational Grid: Theory and Practice," by Doug Thain
- Useful paper even if you don't use Condor or Java
35. Many layers in a grid
Problem
36. We forgot inetd
- We submitted 300 jobs at once
- Inetd noticed many connections per second
- Inetd presumed there was a denial-of-service attack and refused connections for five minutes
- Lots of debugging!
37. There are more layers!
USCMS Testbed Architecture (a bit dated)
[Diagram: Master Site (Impala, MOP, DAGMan, Condor-G) and Worker (Globus, Batch System (Condor, PBS), Real Work)]
38. More layers than that!
USCMS Testbed Architecture (A bit dated)
- MCRunJob
- Impala
- MOP
- condor_schedd
- DAGMan
- Condor-G condor_schedd
- condor_gridmanager
- gahp_server
- globus-gatekeeper
- globus-job-manager
- globus-job-manager-script.pl
- local batch system submit
- local batch system execute
- MOP wrapper
- Impala wrapper
- actual job
This disregards inetd, network, file servers, file transfers
39. Recovery at multiple levels
Lesson
- Fault tolerance and recovery are built in at many levels
- condor_master: restarts daemons
- condor_schedd: job queue
- DAGMan: checkpoints DAG of jobs
- gahp_server: isolates Globus libraries
- And others
40. Allocate debugging time
Lesson
- Allocate lots of debugging time
- It is very hard to propagate errors
- How does a user find a remote error?
- Call system administrator
- Admin looks through log files for each layer (not accessible to user)
- We need better debugging methods
41. Everything will fail (Everything)
Problem
- In the USCMS testbed production
- Power outage for several hours
- Network outages: a few minutes to 11 hours
- Failed configuration change
- Site upgraded
- Jobs accidentally removed
- Software bugs everywhere
42. How do you cope?
- Condor-G
- "Error: job cannot run." This is not good enough
- Resubmit jobs that can be resubmitted, perhaps after a delay
- Put jobs on hold in queue
- User examines hold reason (e.g., proxy is expired)
- User fixes error
- User restarts job
43. Everything will fail (Even the little things)
Problem
- Condor Matchmaker
- Collects descriptions of machines and jobs
- Soft state in matchmaker (push smarts to edge, like Internet)
- UDP packets to advertise machines
- Less overhead than many TCP connections
- Works great in a LAN
- But
44. Everything will fail: UDP
- But you lose some UDP packets
- Send packets every five minutes
- Keep stale information for 15 minutes
- Be prepared to cope with stale information
- This has worked for years in Condor
- DØ: matchmaking on grid
- UDP packets from Korea to Chicago were completely lost on weekdays
- Added TCP option
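The soft-state scheme (advertise every five minutes, keep information for 15) can be sketched as a table whose entries expire unless refreshed. This is an illustration, not the Condor matchmaker: `SoftStateTable` and its methods are hypothetical names, and only the 15-minute lifetime comes from the slide.

```python
import time


class SoftStateTable:
    """Keep machine advertisements only as long as they are refreshed."""

    def __init__(self, max_age_seconds=15 * 60, clock=time.monotonic):
        self._max_age = max_age_seconds
        self._clock = clock
        self._ads = {}  # machine name -> (time received, description)

    def advertise(self, name, description):
        """Record or refresh one machine's advertisement."""
        self._ads[name] = (self._clock(), description)

    def live_machines(self):
        """Drop stale advertisements and return the remaining ones."""
        now = self._clock()
        self._ads = {name: (seen, ad)
                     for name, (seen, ad) in self._ads.items()
                     if now - seen <= self._max_age}
        return {name: ad for name, (seen, ad) in self._ads.items()}
```

With a five-minute advertising interval, a machine can lose two packets in a row and still stay in the table. If every packet is lost, as on the Korea to Chicago link, the machine disappears entirely, which is why a reliable transport option was needed.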
45. Be prepared
Lesson
- Assume everything will fail
- Have recovery at multiple levels
- Understand scope of errors
- Don't trust configuration
- Verify it
- Install and configure software on the fly
- Assume bugs are everywhere
- Build software to cope with errors