Real-life experiences with grids: It's not as easy as it looks

1
Real-life experiences with grids: It's not as easy as it looks

Alain Roy
roy_at_cs.wisc.edu
University of Wisconsin-Madison
Condor Team

2
Who Am I?

Member of Condor Team
Experience with Condor
Experience with grid deployment
Developer of Virtual Data Toolkit
Used by GriPhyN, EDG, LCG
Packaging of Globus, Condor, etc.
Collaborator with INFN
Working with Paolo Mazzanti
In Bologna for four weeks

3
Italy

Italy is beautiful
The food is wonderful
The people are friendly

4
Background

Condor's environment is a little like a grid
Not all computers (grid sites) are under Condor's control
Computers (grid sites) disappear at the owner's whim
Everything changes constantly
Condor was built to deal with this dynamic
environment
Grid software needs to do the same

5
Background

Late 1980s until today
Condor developed and deployed on hundreds of
sites
Condor built to deal with failures
Recently
Condor-G: your window to the grid
Condor team has helped deploy grid technology for real use, not just experiments

6
Background: Condor

Condor is a batch job system
Goal: High-throughput computing
Different from high-performance computing
Goal: High reliability
Goal: Support distributed ownership

7
High-Throughput Computing

Worry about FLOPS/year, not FLOPS/second
Use all resources effectively
Dedicated clusters
Non-dedicated computers (desktop)

8
Effective Resource Use

Requires high reliability
Computers come and go, your jobs shouldn't.
Checkpointing
Be prepared for everything breaking
Requires distributed ownership

9
Condor-G

Condor-G submits Globus jobs
Jobs are in persistent queue
Unlike globus-job-run
Jobs are retried on system failures
Jobs are held on some failures
Condor-G makes it easy to submit grid jobs

10
Background: USCMS

CMS
Detector online in 2007
Needs to simulate and reconstruct millions of events
USCMS testbed
Joint PPDG/GriPhyN effort
Integrate CMS tools with grid tools
Globus
Condor-G
Contribute real work to CMS

11
Background: USCMS

7 sites, 250 CPUs
Spring 2002: Deploy test
Fall 2002
Last-minute production
150,000 events in two weeks
Successful, but lots of work
Today
Wider deployment and use

12
Background: DØ

Experiment at Fermilab
Already doing real production, real analysis
Deploying on grid sites today
Condor-G
Globus
SAM

13
DØ and Condor-G

They liked Condor-G
Condor-G missing a feature
Deciding which grid-site to use
SAM (data handling software) knows where data is
located
SAMGrid
Condor-G asks SAM for advice
Condor-G decides where to run jobs

14
DØ deployment

Spring: Beginning of deployment
Late summer: Production
Early results
It looks good
We have more work to do
Better error reporting
Better matchmaking
What will we learn later?

15
Problems and Lessons

During our experiences, we've
Encountered many problems
Developed solutions to these problems
Learned many lessons about grids
This talk
Shares some interesting problems
Gives some advice and solutions

16
Taking a taxi
Problem

How do you take a taxi in Paestum, Italy?
We don't need to walk 4 km there
The ruins were lovely
The ruins were outside
It was about 35°C
Wife is pregnant

17
Use all your resources
Lesson

Walk up to storekeeper
Ask "Dovay Ooon Taxi?" (Dov'è un taxi?)
Be patient: wait ten minutes
Take taxi
I assumed my resources (local knowledge, Italian)
were insufficient, but they saved me time when I
used them

18
Use all your resources
Lesson

Condor
Uses dedicated machines (I can walk)
Uses non-dedicated machines (I can sometimes ask
for help)
Grids
Connect your machine rooms
Can you take advantage of other resources?
Avoid the mentality "I must control all resources", and you will prosper

19
Grid = distributed machine room?

You can have good control
You can pre-install applications
You know how everything works
BUT
You lose flexibility
How quickly can you upgrade sites?
Did they install everything correctly?
Can you use new grid sites easily?

20
Grid: Use all resources

Assume basic grid software is installed
Assume nothing else is installed
Bring your software with you
Submit one job: install software
Submit N jobs: use software
You control software
You ensure correct installation
Easy to use any grid site

21
Long-running programs
Problem

Long-running programs crash
Condor has daemons on each machine
User (job) agent
Machine agent
Matchmaker
They crash
Programming errors
Network failures
Disk failures

22
Watch programs
Lesson

Condor master
Small program, rarely changed
Runs Condor daemons
When daemon crashes
Restart daemon, send email
If it crashes again, restart after backoff
Result
Many errors are silently fixed
Yet we don't just ignore crashes
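
The restart policy on this slide can be sketched in a few lines of Python. This is only an illustration, not Condor's code: the function name `supervise`, its parameters, and the use of a `notify` callback in place of email are all assumptions made for the example. The first crash triggers an immediate restart; repeated crashes wait longer each time, and every crash is reported.

```python
import subprocess
import time


def supervise(cmd, max_restarts=3, base_delay=1.0, notify=print, sleep=time.sleep):
    """Run cmd; restart it after an abnormal exit, backing off on repeats.

    Returns the number of restarts that were needed (hypothetical helper).
    """
    restarts = 0
    while True:
        exit_code = subprocess.call(cmd)
        if exit_code == 0:
            return restarts  # clean exit: nothing more to do
        # Report every crash so that silent repair never hides a real bug.
        notify(f"{cmd[0]} exited with code {exit_code} (restarts so far: {restarts})")
        if restarts >= max_restarts:
            raise RuntimeError(f"giving up on {cmd[0]} after {restarts} restarts")
        if restarts > 0:
            # Second and later crashes: wait, doubling the delay each time.
            sleep(base_delay * 2 ** (restarts - 1))
        restarts += 1
```

Keeping the watcher this small is the point of "small program, rarely changed": the less it does, the less likely it is to crash itself.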

23
Short-running programs
Problem

Short-running programs crash/hang
Example: globus-url-copy
USCMS testbed: staging data
Some fraction of copies hang or fail
Programming error and delicate network
Hard to reproduce and fix

24
Watch programs
Lesson

When copy exceeds timeout, kill and retry
Possible to do in shell scripting languages, but
not easy
Use Fault Tolerant Shell to watch programs
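
For comparison, here is a rough Python sketch of "kill and retry when the copy exceeds a timeout". It is not how the USCMS testbed did it; `run_with_timeout` and its parameters are names invented for this example, and the command to run is whatever copy tool you use.

```python
import subprocess


def run_with_timeout(cmd, timeout_seconds, max_attempts=3):
    """Run cmd; if it hangs past the timeout or fails, kill it and try again.

    Returns the attempt number that succeeded; raises if every attempt fails.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            # subprocess.run kills the child itself when the timeout expires.
            subprocess.run(cmd, timeout=timeout_seconds, check=True)
            return attempt
        except subprocess.TimeoutExpired:
            print(f"attempt {attempt}: no result after {timeout_seconds}s, killed")
        except subprocess.CalledProcessError as err:
            print(f"attempt {attempt}: exit code {err.returncode}")
    raise RuntimeError(f"{cmd[0]} failed {max_attempts} times")
```

A call might look like `run_with_timeout(["globus-url-copy", src_url, dst_url], timeout_seconds=300)`, where `src_url` and `dst_url` are placeholders. Note that this version retries immediately, without any backoff.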

25
Fault Tolerant Shell

Shell language built for coping with errors
try for 30 minutes
    wget http://www.example.com/file.tar.gz
    gunzip file.tar.gz
    tar xf file.tar
end

26
FTSH: exponential backoff

Why exponential backoff?
What if 100 ftsh scripts are executing?
Avoid synchronization → reduce load, increase chance of success
Similar to Ethernet
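
A small Python sketch of the idea, assuming a randomized variant of exponential backoff. This is not ftsh's actual algorithm, and `backoff_delays` with its `base`, `cap`, and `rng` parameters is invented for illustration. The upper bound on the wait doubles after each failure, and each client picks a random point below that bound.

```python
import random


def backoff_delays(base=1.0, cap=60.0, attempts=6, rng=random.random):
    """Return randomized delays whose upper bound doubles after each failure."""
    delays = []
    for attempt in range(attempts):
        ceiling = min(cap, base * 2 ** attempt)
        # A random point below the ceiling keeps many clients that failed
        # at the same moment from all retrying at the same moment.
        delays.append(rng() * ceiling)
    return delays
```

With 100 scripts failing together, the random factor spreads their retries out, and the doubling bound lowers the retry rate for as long as the failures continue.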

27
Fault Tolerant Shell

Easier to cope with failures
try 5 times
    wget http://www.example.com/file.tar.gz
catch
    rm -f file.tar.gz
    failure
end

Clean up partially downloaded file, if it exists

28
Fault Tolerant Shell

Flexible
try for 30 minutes
    try for 5 minutes
        wget http://example.com/file.tar.gz
    end
    try for 1 minute or 3 times
        gunzip file.tar.gz
        tar xf file.tar
    catch
        rm -rf file.tar
    end
end

29
FTSH: More information

Work of Doug Thain
thain_at_cs.wisc.edu
Excellent paper
"The Ethernet Approach to Grid Computing", by Doug Thain
Available from http://www.cs.wisc.edu/thain
Even if you don't use FTSH, read this paper!

30
Whose error is it?
Problem

The source of an error is not always obvious
The source of an error influences how you react
to the error
Example: Java universe in Condor

31
Java Universe

Users submit Java jobs to Condor
Whose error is it? Check result code
1: Program dereferenced NULL pointer → Job shouldn't run again
1: Job's image is corrupt → Job shouldn't run again
1: VM doesn't have enough memory to run program → Try another machine with more memory
1: Java installation is misconfigured → Don't use this machine for Java

32
Don't trust configuration
Lesson

User tells Condor Java is installed
This is just a hint!
Condor verifies Java configuration
Run simple job, verify output
If Java works, Condor advertises that Java can be
used
If Java fails, error is reported, Java can't be used
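
"Run simple job, verify output" can be sketched as a small probe. This is an illustration in Python, not Condor's implementation; the function `probe` and its parameters are hypothetical. It treats the configured command as a hint and only reports success if a trivial test run produces exactly the expected output.

```python
import subprocess


def probe(cmd, expected_output, timeout_seconds=30):
    """Run a trivial test command and report whether the tool really works."""
    try:
        result = subprocess.run(cmd, capture_output=True, text=True,
                                timeout=timeout_seconds)
    except (OSError, subprocess.TimeoutExpired) as err:
        # Covers a missing or non-executable binary as well as a hang.
        return False, f"could not run {cmd[0]}: {err}"
    if result.returncode != 0:
        return False, f"exit code {result.returncode}"
    if result.stdout.strip() != expected_output:
        return False, f"unexpected output: {result.stdout!r}"
    return True, "ok"
```

For Java, the test command might be something like `["java", "-cp", ".", "Hello"]`, where `Hello` is a placeholder class that prints a known string. The capability is advertised only when the probe returns `True`; otherwise the reason string is what gets reported.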

33
Look for error scope
Lesson

Add Java wrapper to all Java jobs
Run program
Examine return code/exception
Write all details to file
Examine output of wrapper, or exception from JVM
We know if job is bad
We know if JVM is insufficient for job
We know if JVM is bad
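
Once the wrapper records details, a bare result code of 1 is no longer the only information available, and the three cases above can be told apart. The Python sketch below is a guess at the shape of such a classification, not Condor's wrapper: the `Scope` enum, the `classify` function, and the report format are all invented for illustration.

```python
from enum import Enum


class Scope(Enum):
    JOB = "job is bad: do not run it again"
    MACHINE_TOO_SMALL = "JVM is insufficient: try a machine with more memory"
    MACHINE_BROKEN = "JVM is bad: do not use this machine for Java"


def classify(wrapper_report):
    """Map the details recorded by a job wrapper to an error scope.

    wrapper_report is a hypothetical dict written by the wrapper, e.g.
    {"started": True, "exception": "java.lang.NullPointerException"}.
    """
    if not wrapper_report.get("started", False):
        # The wrapper itself never ran, so the Java installation is at fault.
        return Scope.MACHINE_BROKEN
    exception = wrapper_report.get("exception")
    if exception is None:
        return None  # the program finished without an exception
    if exception == "java.lang.OutOfMemoryError":
        return Scope.MACHINE_TOO_SMALL
    # Anything else thrown by the program is treated as the job's own fault.
    return Scope.JOB
```

A real wrapper would need a finer mapping than this; the sketch only shows that the reaction depends on the scope, not on the result code.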

34
Error Scope

We could have an entire talk on error scope
Excellent paper: "Error Scope on a Computational Grid: Theory and Practice", by Doug Thain
Useful paper even if you don't use Condor or Java

35
Many layers in a grid
Problem

36
We forgot inetd

We submitted 300 jobs at once
Inetd noticed many connections per second
Inetd presumed there was a denial of service
attack and refused connections for five minutes
Lots of debugging!

37
There are more layers!
USCMS Testbed Architecture (A bit dated)
[Diagram; labels: Master Site, Worker, Impala, Globus, MOP, Batch System (Condor, PBS), DAGMan, Real Work, Condor-G]

38
More layers than that!
USCMS Testbed Architecture (A bit dated)