Real-life experiences with grids: It's not as easy as it looks

1
Real-life experiences with grids: It's not as easy as it looks
  • Alain Roy
  • roy_at_cs.wisc.edu
  • University of Wisconsin-Madison
  • Condor Team

2
Who Am I?
  • Member of Condor Team
  • Experience with Condor
  • Experience with grid deployment
  • Developer of Virtual Data Toolkit
  • Used by GriPhyN, EDG, LCG
  • Packaging of Globus, Condor, etc.
  • Collaborator with INFN
  • Working with Paolo Mazzanti
  • In Bologna for four weeks

3
Italy
  • Italy is beautiful
  • The food is wonderful
  • The people are friendly

4
Background
  • Condor's environment is a little like a grid
  • Not all computers (grid sites) are under Condor's
    control
  • Computers (grid sites) disappear at the owner's
    whim
  • Everything changes constantly
  • Condor was built to deal with this dynamic
    environment
  • Grid software needs to do the same

5
Background
  • Late 1980s until today
  • Condor developed and deployed on hundreds of
    sites
  • Condor built to deal with failures
  • Recently
  • Condor-G: your window to the grid
  • Condor team has helped deploy grid technology for
    real use, not just experiments

6
Background: Condor
  • Condor is a batch job system
  • Goal: High-throughput computing
  • Different than high-performance
  • Goal: High reliability
  • Goal: Support distributed ownership

7
High-Throughput Computing
  • Worry about FLOPS/year, not FLOPS/second
  • Use all resources effectively
  • Dedicated clusters
  • Non-dedicated computers (desktop)

8
Effective Resource Use
  • Requires high reliability
  • Computers come and go; your jobs shouldn't.
  • Checkpointing
  • Be prepared for everything breaking
  • Requires distributed ownership

9
Condor-G
  • Condor-G submits Globus jobs
  • Jobs are in persistent queue
  • Unlike globus-job-run
  • Jobs are retried on system failures
  • Jobs are held on some failures
  • Condor-G makes it easy to submit grid jobs

10
Background: USCMS
  • CMS
  • Detector online in 2007
  • Needs to simulate and reconstruct millions of
    events
  • USCMS testbed
  • Joint PPDG/GriPhyN effort
  • Integrate CMS tools with grid tools
  • Globus
  • Condor-G
  • Contribute real work to CMS

11
Background: USCMS
  • 7 sites, 250 CPUs
  • Spring 2002: deploy and test
  • Fall 2002
  • Last-minute production
  • 150,000 events in two weeks
  • Successful, but lots of work
  • Today
  • Wider deployment and use

12
Background: DØ
  • Experiment at Fermilab
  • Already doing real production, real analysis
  • Deploying on grid sites today
  • Condor-G
  • Globus
  • SAM

13
DØ and Condor-G
  • They liked Condor-G
  • Condor-G was missing a feature
  • Deciding which grid site to use
  • SAM (data handling software) knows where data is
    located
  • SAMGrid
  • Condor-G asks SAM for advice
  • Condor-G decides where to run jobs

14
DØ deployment
  • Spring: beginning of deployment
  • Late summer: production
  • Early results
  • It looks good
  • We have more work to do
  • Better error reporting
  • Better matchmaking
  • What will we learn later?

15
Problems and Lessons
  • During our experiences, we've
  • Encountered many problems
  • Developed solutions to these problems
  • Learned many lessons about grids
  • This talk
  • Shares some interesting problems
  • Gives some advice and solutions

16
Taking a taxi
Problem
  • How do you take a taxi in Paestum, Italy?
  • We don't need to walk 4 km there
  • The ruins were lovely
  • The ruins were outside
  • It was about 35°C
  • Wife is pregnant

17
Use all your resources
Lesson
  • Walk up to storekeeper
  • Ask "Dovay Ooon Taxi?" (Dov'è un taxi?)
  • Be patient: wait ten minutes
  • Take taxi
  • I assumed my resources (local knowledge, Italian)
    were insufficient, but they saved me time when I
    used them

18
Use all your resources
Lesson
  • Condor
  • Uses dedicated machines (I can walk)
  • Uses non-dedicated machines (I can sometimes ask
    for help)
  • Grids
  • Connect your machine rooms
  • Can you take advantage of other resources?
  • Avoid the mentality "I must control all resources,"
    and you will prosper

19
Grid: distributed machine room?
  • You can have good control
  • You can pre-install applications
  • You know how everything works
  • BUT
  • You lose flexibility
  • How quickly can you upgrade sites?
  • Did they install everything correctly?
  • Can you use new grid sites easily?

20
Grid: Use all resources
  • Assume basic grid software is installed
  • Assume nothing else is installed
  • Bring your software with you
  • Submit one job: install software
  • Submit N jobs: use software
  • You control software
  • You ensure correct installation
  • Easy to use any grid site
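
A minimal Python sketch of the "one install job, then N jobs that use it" ordering. The job names and the `plan_jobs` and `runnable` helpers are hypothetical, made up for illustration; in practice a workflow tool such as DAGMan would express this kind of dependency.

```python
def plan_jobs(n_work_jobs):
    """Return (name, parents) pairs: one install job, then N jobs that depend on it."""
    install = ("install_software", [])
    work = [(f"work_{i}", ["install_software"]) for i in range(n_work_jobs)]
    return [install] + work


def runnable(plan, finished):
    """Names of unfinished jobs whose parents have all finished."""
    return [name for name, parents in plan
            if name not in finished and all(p in finished for p in parents)]
```

Before anything has finished, only the install job is runnable; once it completes, all N work jobs become runnable at the same time.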

21
Long-running programs
Problem
  • Long-running programs crash
  • Condor has daemons on each machine
  • User (job) agent
  • Machine agent
  • Matchmaker
  • They crash
  • Programming errors
  • Network failures
  • Disk failures

22
Watch programs
Lesson
  • Condor master
  • Small program, rarely changed
  • Runs Condor daemons
  • When daemon crashes
  • Restart daemon, send email
  • If it crashes again, restart after backoff
  • Result
  • Many errors are silently fixed
  • Yet we don't just ignore crashes
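
The restart policy above can be sketched in Python roughly as follows. This is not the Condor master's actual code: `supervise`, its parameters, and the use of `print` in place of email are assumptions for illustration.

```python
import subprocess
import time


def supervise(cmd, max_restarts=5, base_delay=1.0, max_delay=60.0,
              notify=print, sleep=time.sleep):
    """Run cmd; on a non-zero exit, report it and restart with growing delays."""
    delay = 0.0  # the first restart is immediate
    for attempt in range(max_restarts + 1):
        code = subprocess.call(cmd)
        if code == 0:
            return attempt  # clean exit: number of restarts that were needed
        # Stand-in for "send email": crashes are reported, not silently ignored.
        notify(f"{cmd[0]} exited with code {code} (run {attempt + 1})")
        if attempt < max_restarts:
            sleep(delay)
            # Later restarts wait longer: base_delay, then doubling up to max_delay.
            delay = base_delay if delay == 0 else min(delay * 2, max_delay)
    raise RuntimeError(f"{cmd[0]} kept crashing; giving up")
```

The supervisor itself stays small and does nothing except start, watch, report, and restart, which is what makes it less likely to crash than the programs it watches.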

23
Short-running programs
Problem
  • Short-running programs crash/hang
  • Example: globus-url-copy
  • USCMS testbed: staging data
  • Some fraction of copies hang or fail
  • Programming error or delicate network
  • Hard to reproduce and fix

24
Watch programs
Lesson
  • When copy exceeds timeout, kill and retry
  • Possible to do in shell scripting languages, but
    not easy
  • Use Fault Tolerant Shell to watch programs
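
For comparison, a minimal Python sketch of "kill and retry when the timeout is exceeded". The helper name `run_with_timeout` and its parameters are assumptions; the command is whatever copy program you need to watch.

```python
import subprocess


def run_with_timeout(cmd, timeout_s, attempts):
    """Run cmd; kill it if it exceeds timeout_s, retrying up to `attempts` times."""
    for attempt in range(1, attempts + 1):
        try:
            # subprocess.run kills the child and raises if the timeout expires.
            result = subprocess.run(cmd, timeout=timeout_s)
        except subprocess.TimeoutExpired:
            continue  # hung: the child has been killed, so try again
        if result.returncode == 0:
            return attempt  # succeeded on this attempt
    raise RuntimeError(f"{cmd[0]} failed or hung {attempts} times")
```

This sketch retries immediately; a real deployment would normally wait between attempts so that many clients do not hammer a struggling server.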

25
Fault Tolerant Shell
  • Shell language built for coping with errors
  • try for 30 minutes
  •   wget http://www.example.com/file.tar.gz
  •   gunzip file.tar.gz
  •   tar xf file.tar
  • end

26
FTSH: exponential backoff
  • Why exponential backoff?
  • What if 100 ftsh scripts are executing?
  • Avoid synchronization: reduce load, increase
    chance of success
  • Similar to Ethernet
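
A small Python sketch of randomized exponential backoff, to show why it breaks up synchronization. It is one common variant, not necessarily the exact scheme FTSH uses; `backoff_delays` and its parameters are assumptions.

```python
import random


def backoff_delays(attempts, base=1.0, cap=300.0, rng=random):
    """Yield one randomized delay per retry; the upper bound doubles each time."""
    limit = base
    for _ in range(attempts):
        # A random delay keeps many clients from retrying in lockstep.
        yield rng.uniform(0, limit)
        limit = min(limit * 2, cap)
```

If 100 scripts fail at the same moment, each one draws a different delay, so their retries spread out over time instead of hitting the server together again.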

27
Fault Tolerant Shell
  • Easier to cope with failures
  • try 5 times
  •   wget http://www.example.com/file.tar.gz
  • catch
  •   rm -f file.tar.gz   # clean up partially downloaded file, if it exists
  •   failure
  • end

28
Fault Tolerant Shell
  • Flexible
  • try for 30 minutes
  •   try for 5 minutes
  •     wget http://example.com/file.tar.gz
  •   end
  •   try for 1 minute or 3 times
  •     gunzip file.tar.gz
  •     tar xf file.tar
  •   catch
  •     rm -rf file.tar
  •   end
  • end
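
For readers who do not use FTSH, the shape of a try/catch block can be approximated in Python as below. This is a loose sketch: `try_for` and its parameters are made up for illustration, and unlike FTSH it does not interrupt an action that is still running when the time limit passes.

```python
import time


def try_for(seconds, action, max_tries=None, cleanup=None,
            clock=time.monotonic, pause=time.sleep, retry_pause_s=1.0):
    """Repeat action() until it succeeds, time runs out, or tries are used up."""
    deadline = clock() + seconds
    tries = 0
    while True:
        tries += 1
        try:
            return action()
        except Exception as err:
            out_of_time = clock() >= deadline
            out_of_tries = max_tries is not None and tries >= max_tries
            if out_of_time or out_of_tries:
                if cleanup is not None:
                    cleanup()  # plays the role of the catch block
                raise RuntimeError("try block failed") from err
            pause(retry_pause_s)
```

Blocks nest by passing one `try_for` call as the action of another, for example `try_for(30 * 60, lambda: (try_for(5 * 60, download), try_for(60, unpack, max_tries=3, cleanup=remove_tar)))`, where `download`, `unpack`, and `remove_tar` are hypothetical functions. A failure of an inner block raises, so the outer block retries the whole sequence.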

29
FTSH: More information
  • Work of Doug Thain
  • thain_at_cs.wisc.edu
  • Excellent paper:
  • "The Ethernet Approach to Grid Computing," by Doug
    Thain
  • Available from http://www.cs.wisc.edu/thain
  • Even if you don't use FTSH, read this paper!

30
Whose error is it?
Problem
  • The source of an error is not always obvious
  • The source of an error influences how you react
    to the error
  • Example: Java universe in Condor

31
Java Universe
  • Users submit Java jobs to Condor
  • Whose error is it? Check result code
  • 1: Program dereferenced NULL pointer
    (job shouldn't run again)
  • 1: Job's image is corrupt
    (job shouldn't run again)
  • 1: VM doesn't have enough memory to run program
    (try another machine with more memory)
  • 1: Java installation is misconfigured
    (don't use this machine for Java)

32
Don't trust configuration
Lesson
  • User tells Condor Java is installed
  • This is just a hint!
  • Condor verifies Java configuration
  • Run simple job, verify output
  • If Java works, Condor advertises that Java can be
    used
  • If Java fails, error is reported, Java can't be
    used
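
The "treat configuration as a hint, then verify it" idea can be sketched in Python like this. It is not Condor's implementation: `probe`, `advertise`, and the feature names are assumptions, and the probe command is whatever simple job exercises the feature.

```python
import subprocess


def probe(cmd, expected_output, timeout_s=30):
    """Return True only if cmd runs, exits 0, and prints the expected text."""
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    except (OSError, subprocess.TimeoutExpired):
        return False  # not installed, not executable, or hung
    return result.returncode == 0 and result.stdout.strip() == expected_output


def advertise(claimed_features, probes):
    """Keep only the claimed features whose probe actually passes."""
    return {name for name in claimed_features
            if name in probes and probes[name]()}
```

A feature the administrator claims but that fails its probe is simply never advertised, so no job is matched to the machine for that feature.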

33
Look for error scope
Lesson
  • Add Java wrapper to all Java jobs
  • Run program
  • Examine return code/exception
  • Write all details to file
  • Examine output of wrapper, or exception from JVM
  • We know if job is bad
  • We know if JVM is insufficient for job
  • We know if JVM is bad
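
A hedged Python sketch of how a wrapper's report might be mapped to an error scope. The report format (a dict with `jvm_started` and `exception` keys) and the `classify` function are assumptions for illustration; only the Java exception class names are real.

```python
from enum import Enum


class Scope(Enum):
    JOB = "job is bad: do not run it again"
    RESOURCE = "machine is insufficient for this job: try another machine"
    INSTALLATION = "Java is broken here: stop using this machine for Java"


def classify(report):
    """Map a wrapper report (hypothetical format) to an error scope, or None on success."""
    if not report.get("jvm_started", False):
        return Scope.INSTALLATION  # the JVM never got as far as running the wrapper
    exc = report.get("exception")
    if exc is None:
        return None  # the program ran to completion
    if exc == "java.lang.OutOfMemoryError":
        return Scope.RESOURCE
    return Scope.JOB  # e.g. NullPointerException or a corrupt class file
```

The point is that the bare exit code is the same in every case, so the wrapper has to record enough detail for the scope to be decided afterwards.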

34
Error Scope
  • We could have an entire talk on error scope
  • Excellent paper: "Error Scope on a Computational
    Grid: Theory and Practice," by Doug Thain
  • Useful paper even if you don't use Condor or Java

35
Many layers in a grid
Problem
36
We forgot inetd
  • We submitted 300 jobs at once
  • Inetd noticed many connections per second
  • Inetd presumed there was a denial of service
    attack and refused connections for five minutes
  • Lots of debugging!

37
There are more layers!
[Diagram: USCMS Testbed Architecture (a bit dated). Master site: Impala, MOP, DAGMan, Condor-G. Worker: Globus, batch system (Condor, PBS), real work.]

38
More layers than that!
USCMS Testbed Architecture (A bit dated)
  • MCRunJob
  • Impala
  • MOP
  • condor_schedd
  • DAGMan
  • Condor-G condor_schedd
  • condor_gridmanager
  • gahp_server
  • globus-gatekeeper
  • globus-job-manager
  • globus-job-manager-script.pl
  • local batch system submit
  • local batch system execute
  • MOP wrapper
  • Impala wrapper
  • actual job

This disregards inetd, network, file servers,
file transfers
39
Recovery at multiple levels
Lesson
  • Fault tolerance and recovery are built in at many
    levels
  • condor_master: restart daemons
  • condor_schedd: job queue
  • DAGMan: checkpoint DAG of jobs
  • gahp_server: isolate Globus libraries
  • And others

40
Allocate debugging time
Lesson
  • Allocate lots of debugging time
  • It is very hard to propagate errors
  • How does a user find a remote error?
  • Call system administrator
  • Admin looks through log files for each layer (not
    accessible to user)
  • We need better debugging methods

41
Everything will fail (everything)
Problem
  • In the USCMS testbed production
  • Power outage for several hours
  • Network outages: a few minutes to 11 hours
  • Failed configuration change
  • Site upgraded
  • Jobs accidentally removed
  • Software bugs everywhere

42
How do you cope?
  • Condor-G
  • Error: "job cannot run." This is not good enough
  • Resubmit jobs that can be resubmitted, perhaps
    after a delay
  • Put jobs on hold in queue
  • User examines hold reason (e.g., proxy is expired)
  • User fixes error
  • User restarts job
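
A minimal Python sketch of the resubmit-or-hold decision. This is not Condor-G's actual logic: the failure names, the job dictionary, and the `handle_failure` and `release` helpers are all hypothetical.

```python
# Hypothetical failure categories; a real system would derive these from error codes.
RETRYABLE = {"network_outage", "site_unreachable", "job_lost"}


def handle_failure(job, reason, max_retries=5):
    """Requeue the job for transient failures; otherwise hold it with a reason."""
    if reason in RETRYABLE and job["retries"] < max_retries:
        job["retries"] += 1
        job["state"] = "idle"   # stays in the queue and will be resubmitted
        job["hold_reason"] = None
    else:
        job["state"] = "held"   # waits for the user to look at it
        job["hold_reason"] = reason
    return job


def release(job):
    """Called after the user has fixed the problem (e.g. renewed the proxy)."""
    job["state"] = "idle"
    job["hold_reason"] = None
    return job
```

Either way the job never leaves the queue, so a failure costs a delay rather than losing the work.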

43
Everything will fail (even the little things)
Problem
  • Condor Matchmaker
  • Collects descriptions of machines and jobs
  • Soft state in matchmaker (push smarts to edge,
    like Internet)
  • UDP packets to advertise machines
  • Less overhead than many TCP connections
  • Works great in a LAN
  • But

44
Everything will fail: UDP
  • But you lose some UDP packets
  • Send packets every five minutes
  • Keep stale information for 15 minutes
  • Be prepared to cope with stale information
  • This has worked for years in Condor
  • DØ: matchmaking on grid
  • UDP packets from Korea to Chicago were completely
    lost on weekdays
  • Added TCP option
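
The soft-state idea (advertise every five minutes, keep information for 15) can be sketched in Python as a table whose entries expire. `SoftStateTable` is a hypothetical name and this is not the Condor matchmaker's code; only the 15-minute lifetime comes from the slide.

```python
import time


class SoftStateTable:
    """Remember the latest ad per machine; forget machines that stop advertising."""

    def __init__(self, ttl_s=15 * 60, clock=time.monotonic):
        self.ttl_s = ttl_s
        self.clock = clock
        self._ads = {}  # machine name -> (time received, ad)

    def update(self, machine, ad):
        """Record an advertisement; a newer ad replaces the older one."""
        self._ads[machine] = (self.clock(), ad)

    def live(self):
        """Drop ads older than the lifetime and return the rest."""
        now = self.clock()
        self._ads = {m: (t, ad) for m, (t, ad) in self._ads.items()
                     if now - t <= self.ttl_s}
        return {m: ad for m, (t, ad) in self._ads.items()}
```

With ads every five minutes and a 15-minute lifetime, a machine survives two consecutive lost packets. If nearly every packet is lost, as on the Korea to Chicago link, the machine disappears from the table, which is why a reliable transport was needed there.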

45
Be prepared
Lesson
  • Assume everything will fail
  • Have recovery at multiple levels
  • Understand scope of errors
  • Don't trust configuration
  • Verify it
  • Install and configure software on the fly
  • Assume bugs are everywhere
  • Build software to cope with errors