Title: Error Scope on a Computational Grid: Theory and Practice
1 Error Scope on a Computational Grid: Theory and Practice
- Douglas Thain
- Computer Sciences Department
- University of Wisconsin
- USC Reliability Workshop
- July 2002
2 Outline
- An Exercise: Condor and Java
- Bad News: Error Explosion
- A Theory of Error Propagation
- Down with Generic Errors!
- Condor Revisited
- Parting Thoughts
3 An Exercise: Coupling Condor and Java
- The Condor Project, est. 1985.
- Production high-throughput computing facility.
- Provides a stable execution environment on a Grid
of unstable, autonomous resources.
- The Java Language, est. 1991.
- Production language, compiler, and interpreter.
- Provides a standard instruction set and libraries
on any processor and system.
- The Grid, est. ????
- Execute any code anywhere at any time.
- Dependable, consistent, pervasive, inexpensive...
- Are we there yet?
4 The Condor High Throughput Computing System
- HTC != HPC
- Measured in sims/week, frames/month, cycles/year.
- All participants are autonomous.
- Users give constraints on usable machines.
- Machines give constraints on jobs and users.
- ClassAds: a language for matchmaking.
- If you are willing to re-link jobs...
- Remote system calls for transparent mobility.
- Binary checkpointing for migration and
fault-tolerance.
- Can't re-link? All other features available.
- Special universes support software
environments.
- PVM, MPI, Master-Worker, Vanilla, Globus, Java
5 [Diagram]
- Submission Site: User Agent (schedd), Home File System
- Matchmaker
- Execution Site: Machine Agent (startd)
6 Java Universe
- Execution
- User specifies .class and .jar files.
- Machine provides the JVM details.
- Input and Output
- Know all of your files?
- Condor transfers whole files for you.
- Need online I/O?
- Link program with Chirp I/O Library.
- Execution site provides proxy to home site.
7 [Diagram]
- Submission Site: Job Agent (shadow), I/O Server, Home File System
- Execution Site: Job Agent (starter), I/O Proxy, Wrapper, The Job, I/O Library
8 Initial Experience
- Bad news! Any kind of error sent the job back to
the user with an exception message:
- NullPointerException - Program is faulty.
- OutOfMemory - Program outgrew machine.
- ClassNotFoundError - Machine incorrectly
installed.
- ConnectionRefused - Network temporarily
unavailable.
- Users were frustrated because they had to
evaluate whether the job failed or the system
failed.
- These messages were correct in the sense that they were true.
- These were not bugs. We deliberately trapped all
possible errors and passed them up the chain.
9 What's the Problem?
- To reason about this problem, we began to
construct a theory of error propagation.
- This theory offers some common definitions and
four principles that outline a design discipline.
- We re-examined the Java Universe according to
this theory.
- Our most serious mistake: we failed to propagate
errors according to their scope.
10 We are NOT Talking About
- Fault Tolerance
- What algorithms are fault-resistant?
- How many disks can I lose without losing data?
- How many copies should I make for five nines?
- Language Structures
- Should I use Objects or Strings to represent
errors?
- Should I use Exceptions or Signals to communicate
errors?
- These are important and valuable questions, but
we are asking something different!
11 We ARE Talking About
- Where is the problem?
- How should a program respond to an error?
- Who should receive an error message?
- What information should an error carry?
- How can we even reason about this stuff?
12 Engineering Perspective
- Fault
- A physical disruption of the machine.
- Error
- An information state that reflects a fault.
- Failure
- A violation of documented/guaranteed behavior.
- Fault
- (A failure in one's underlying components.)
13 Interface Perspective
- Implicit Error
- A result presented as valid, but found to be
false.
- Example: sqrt(3) -> 2.
- Explicit Error
- A result describing an inability to carry out the
request.
- Example: open(file) -> ENOENT.
- Escaping Error
- A return to a higher level of abstraction.
- Example: read -> virt mem failure -> process
abort.
- Example: server out of memory -> shutdown socket.
14 [Diagram: a Program issues a load to the Virtual Memory System (Physical Memory, Backing Store), which returns data.]
- Would like to return an explicit error, but a
load instruction has no exit code.
- Could return a default value, but that creates an
implicit error.
- Escaping error: Tell the parent that the program
could not complete.
15 Interface Contracts
- int load( int address )
- The implementor must either compute a result that
conforms to the contract or cause an escaping
error.
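The three possible outcomes for load can be sketched in Java. This is a minimal illustration and not code from Condor: the class SketchMemory, its methods, and the BackingStoreOffline error are hypothetical names invented for the example.

```java
import java.util.OptionalInt;

// Hypothetical escaping error: callers of load() are not expected to handle it.
final class BackingStoreOffline extends Error {
    BackingStoreOffline(String detail) { super(detail); }
}

// Hypothetical stand-in for a virtual memory system with a backing store.
final class SketchMemory {
    private final int[] cells;
    private boolean online = true;

    SketchMemory(int... cells) { this.cells = cells.clone(); }

    void setOnline(boolean online) { this.online = online; }

    // Implicit error: when the store is offline, a default value is
    // returned and presented as if it were valid data.
    int loadWithDefault(int address) {
        return online ? cells[address] : 0;
    }

    // Explicit error: the result type can say "request not carried out".
    OptionalInt tryLoad(int address) {
        if (!online || address < 0 || address >= cells.length) {
            return OptionalInt.empty();
        }
        return OptionalInt.of(cells[address]);
    }

    // The contract "int load(int address)" has no room for failure, so the
    // implementor returns a conforming result or causes an escaping error.
    int load(int address) {
        if (!online) {
            throw new BackingStoreOffline("cannot page in address " + address);
        }
        return cells[address];
    }
}
```

With the store offline, loadWithDefault(1) quietly yields 0 while load(1) leaves the interface altogether. The second behavior is the one the contract asks for.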
16 Exceptions
- int open( String filename ) throws FileNotFound, AccessDenied
- A language with exceptions provides more
structure to the contract. A declared exception
is an explicit error. Yet, escaping errors are
still possible.
17 [Diagram: a Program calls open on the Virtual File System.]
- INTERFACE: Success, FileNotFound, AccessDenied
- IMPLEMENTATION (Disk, Memory): MemoryCorrupt, DiskOffline, PigeonLost
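The split between interface errors and implementation errors can be sketched as follows. Only the exception names come from the slides; the class VfsSketch, its helper methods, and the choice of an unchecked exception to model the escaping error are assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory file system; not an API from Condor or the JDK.
final class VfsSketch {
    // Explicit errors: declared in the interface of open().
    static class FileNotFound extends Exception {
        FileNotFound(String name) { super(name); }
    }
    static class AccessDenied extends Exception {
        AccessDenied(String name) { super(name); }
    }
    // Escaping error: an implementation failure the interface does not list.
    static class DiskOffline extends RuntimeException {
        DiskOffline(String detail) { super(detail); }
    }

    private final Map<String, Integer> descriptors = new HashMap<>();
    private final Set<String> unreadable = new HashSet<>();
    private boolean diskOnline = true;

    void addFile(String name, boolean readable) {
        descriptors.put(name, descriptors.size() + 3); // arbitrary descriptor numbers
        if (!readable) unreadable.add(name);
    }

    void setDiskOnline(boolean online) { diskOnline = online; }

    int open(String name) throws FileNotFound, AccessDenied {
        if (!diskOnline) {
            // Not an answer about the file, so it is not reported as one.
            throw new DiskOffline("disk offline while opening " + name);
        }
        Integer fd = descriptors.get(name);
        if (fd == null) throw new FileNotFound(name);
        if (unreadable.contains(name)) throw new AccessDenied(name);
        return fd;
    }
}
```

A caller that catches only FileNotFound and AccessDenied never sees DiskOffline by accident; it passes through to whichever outer layer chooses to catch it.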
18 Error Scope
- In order to be accepted by end users, a
distributed system must be able to distinguish
between errors computed by the program and errors
forced upon it by the environment.
- We use the term scope to draw the distinction.
19 Error Scope
- The scope of an error is the portion of the
system that it invalidates.
- An error must be delivered to the process
responsible for managing that scope.
Error                    Scope        Handler
FileNotFound             File         Calling Function
RPC Disconnect           Process      Parent Process
Cache Coherency Problem  Machine      Hypervisor or Operator
PVM Node Crash           PVM Cluster  Parent Process
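The table can be read as a lookup from an error to its scope and handler. The sketch below encodes it that way; the enum names and the dispose helper are hypothetical, and only the table rows come from the slide.

```java
// Hypothetical encoding of the error/scope/handler table.
final class ScopeTableSketch {
    enum Scope { FILE, PROCESS, MACHINE, PVM_CLUSTER }

    enum ErrorKind {
        FILE_NOT_FOUND(Scope.FILE, "calling function"),
        RPC_DISCONNECT(Scope.PROCESS, "parent process"),
        CACHE_COHERENCY_PROBLEM(Scope.MACHINE, "hypervisor or operator"),
        PVM_NODE_CRASH(Scope.PVM_CLUSTER, "parent process");

        final Scope scope;     // portion of the system the error invalidates
        final String handler;  // who manages that scope

        ErrorKind(Scope scope, String handler) {
            this.scope = scope;
            this.handler = handler;
        }
    }

    // Code that manages one scope acts on an error only when the error's
    // scope matches; any other error is passed on to its own handler.
    static String dispose(ErrorKind kind, Scope managedHere) {
        if (kind.scope == managedHere) {
            return "handle here";
        }
        return "propagate to " + kind.handler;
    }
}
```

For example, code managing only file scope would handle FILE_NOT_FOUND itself but would pass RPC_DISCONNECT on to the parent process.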
20 Error Detail
- The detail of an error describes in
phenomenological terms the cause of the error.
- In the right hands, the detail is useful. In the
wrong hands, the detail can be misleading.
- Suppose open returns AccessDenied...
- File is not accessible - Ok.
- Library containing open is not accessible -
Problem!
21 What To Do With An Error?
- A program cannot possibly know what to do with an
error outside its scope.
- Should sin(x) deal with "math library not
available"?
- Propagate an error to the manager of the scope as
directly as possible.
- Sometimes, a direct mechanism:
- Signal, exception, dropped connection, message.
- Sometimes, an indirect mechanism:
- Touch a file, then exit by any means available.
22 Principles for Error Design
- Principle 1
- A routine must not generate an implicit error as
a result of receiving an explicit error.
- Principle 2
- An escaping error converts a potential implicit
error into an explicit error at a higher level.
- Principle 3
- An escaping error must be propagated to the
program that manages the error's scope.
- Principle 4
- Error interfaces must be concise and finite.
23 Return to Condor
- What did we do wrong?
- We failed to carefully consider the scope of an
error.
- We fell prey to the deadly generic error.
- What's the solution?
- Identify error scopes in Condor.
- Find more direct mechanisms to send escaping
errors to the managing process.
24 [Diagram: error scopes in Condor and the process that manages each]
- schedd: Job Scope
- shadow: Local Resource Scope
- starter: Remote Resource Scope
- JVM: Virtual Machine Scope
- program: Program Scope
- Other labels: Prog Image, User Policy, Prog Args, I/O Server, Owner Policy, Code, Data, Mem, CPU, Input Data, Output Space, Java Pkg
25 Scope in Condor
Detail                     Scope            Handler
Program exited normally.   Program          User
Null pointer exception.    Program          User
Out of memory.             Virtual Machine  JVM
Java misconfigured.        Remote Resource  Starter
Home file system offline.  Local Resource   Shadow
Program image corrupt.     Job              Schedd
26 Scope in Condor: JVM Exit Code
Detail                     Scope            Handler  Exit Code
Program exited normally.   Program          User     (x)
Null pointer exception.    Program          User     1
Out of memory.             Virtual Machine  JVM      1
Java misconfigured.        Remote Resource  Starter  1
Home file system offline.  Local Resource   Shadow   1
Program image corrupt.     Job              Schedd   1
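27 [Diagram: how results travel from The Job back to the shadow]
- Execution side: Job Agent (starter), JVM, Wrapper, The Job, I/O Library, Result File
- Submission side: Job Agent (shadow), Home File System
- Result File contents: Program Result or Error and Scope
- Results reported: JVM Result, Starter Result, Program Result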
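One way a wrapper could record "program result or error and scope" in a result file is sketched below. This is an illustration under stated assumptions, not Condor's actual wrapper: the class WrapperSketch, the Job interface, the scope labels, the file format, and the mapping from Java throwables to scopes are all hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical wrapper that runs a job and writes a result file for the starter.
final class WrapperSketch {
    // Stand-in for the user's program; a real wrapper would invoke its main().
    interface Job { int run() throws Exception; }

    // Assumed mapping from throwable to scope, loosely following the
    // "Scope in Condor" table. Treating every LinkageError as a
    // misconfigured Java installation is a simplification.
    static String scopeOf(Throwable t) {
        if (t instanceof OutOfMemoryError) return "virtual-machine";
        if (t instanceof LinkageError) return "remote-resource";
        return "program";
    }

    // Run the job and record either its result or the error and its scope.
    // The JVM exit code alone cannot carry this distinction.
    static void runAndRecord(Job job, Path resultFile) throws IOException {
        String record;
        try {
            int code = job.run();
            record = "scope=program\nresult=exit " + code + "\n";
        } catch (Throwable t) {
            record = "scope=" + scopeOf(t) + "\nresult=" + t.getClass().getName() + "\n";
        }
        Files.write(resultFile, record.getBytes(StandardCharsets.UTF_8));
    }
}
```

A starter-like process could then read the first line of the file to decide whether the outcome belongs to the user (program scope) or should be handled by a larger scope without bothering the user.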
28 [Diagram regions: Errors of Larger Scope; Errors Inside Program Scope]
29 Half-Way Conclusion
- Small but powerful changes drastically improved
the Java Universe.
- Our mistake was to represent all possible errors
explicitly in the closest interface.
- Error scope is an analytic tool that helps the
designer decide how to propagate an error.
- But, we were initially confused by the presence
of the deadly generic error.
30 The Deadly Generic Error
- Whereas, a program may fail in more ways than we
can possibly imagine...
- And whereas, generality and flexibility are
virtues of programming...
- Be it therefore resolved that interfaces should
return general, flexible, arbitrary values:
- int open( String name ) throws IOException
31 What's Wrong with Generality?
- The structure and types of errors are as
essential to an interface as the arguments and
return values.
- Every error requires a different recovery
mechanism, according to its scope:
- EINTR - try again right away
- ETIMEDOUT - will be available again in the future
- EPERM - you can't at all without talking to a
person
- ESTALE - must kill process
- A program must know the specific details of an
error in order to take the right action. Guesses
don't work.
- Exit on unknown errors? Program is brittle.
- Retry on unknown errors? Program waits endlessly.
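The recovery list above amounts to a mapping from specific error to specific action. A minimal Java sketch follows; the enums and the recoveryFor method are hypothetical names, and only the four error codes and their recoveries come from the slide.

```java
// Hypothetical mapping from specific errors to the recoveries listed above.
final class RecoverySketch {
    enum Errno { EINTR, ETIMEDOUT, EPERM, ESTALE }

    enum Action { RETRY_NOW, RETRY_LATER, ASK_A_PERSON, KILL_PROCESS }

    static Action recoveryFor(Errno e) {
        switch (e) {
            case EINTR:     return Action.RETRY_NOW;     // try again right away
            case ETIMEDOUT: return Action.RETRY_LATER;   // available again in the future
            case EPERM:     return Action.ASK_A_PERSON;  // needs human intervention
            case ESTALE:    return Action.KILL_PROCESS;  // process cannot continue
            default:
                // No guess is made for an unknown error: neither "exit"
                // nor "retry" is a safe default.
                throw new IllegalStateException("no recovery known for " + e);
        }
    }
}
```

The mapping is only possible because each error is named. A single generic error type would leave the caller with nothing to switch on.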
32 An Example of Generality
int open( String name ) throws IOException
int write( int data ) throws IOException
33 An Example of Generality
- Java defines several types of IOException:
- AccessDenied, FileNotFound, EndOfFile...
- Can open throw...?
- FileNotFound
- EndOfFile
- DiskFull
- Can write throw...?
- AccessDenied
- FileNotFound
- DiskFull
- Trick Question!
34 My Disk Runneth Over!
- What can a program expect for a full disk?
- DiskFullException
- OutOfSpaceException
- It's really neither! (How would we know?)
- What should an implementor do when the disk fills
up?
- There is no appropriate exception to throw.
- Making up an exception is not useful.
- Only solution: an escaping error. (Example
later.)
35 Advice for Constructing Error Interfaces
- Export a small set of expected error types.
- Bad Arguments
- Lost Connection
- No Such File
- Choose an internal error management strategy.
You know the cost of retry vs the cost of
failure.
- Retry internally
- Abort process
- Drop connection
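The "retry internally" strategy might look like the sketch below: transient failures stay inside the implementation, and only one expected error type is exported once the retry budget is spent. The names RetrySketch, Attempt, TransientFailure, and withRetry are hypothetical; LostConnection is taken from the list above.

```java
// Hypothetical sketch of retrying internally behind a small error interface.
final class RetrySketch {
    // Exported, expected error type.
    static class LostConnection extends Exception {}

    // Internal failure: never shown to callers of withRetry().
    static class TransientFailure extends RuntimeException {}

    interface Attempt { int call(); }

    // Try the operation up to maxTries times. The implementor, who knows the
    // cost of retry versus the cost of failure, picks maxTries.
    static int withRetry(Attempt attempt, int maxTries) throws LostConnection {
        for (int i = 0; i < maxTries; i++) {
            try {
                return attempt.call();
            } catch (TransientFailure e) {
                // Absorb the failure and try again.
            }
        }
        throw new LostConnection();
    }
}
```

The caller sees either a result or LostConnection, never the internal detail of how many attempts were made.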
36 A Better Interface
int open( String name ) throws AccessDenied, FileNotFound
int write( int data ) throws DiskFull
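Written out as a Java interface, the specific declarations let a caller choose a distinct response for each error. In this sketch the exception names come from the slide, while BetterInterfaceSketch, FileOps, openOrExplain, and the response strings are hypothetical.

```java
// Hypothetical rendering of the better interface and one possible caller.
final class BetterInterfaceSketch {
    static class AccessDenied extends Exception {}
    static class FileNotFound extends Exception {}
    static class DiskFull extends Exception {}

    interface FileOps {
        int open(String name) throws AccessDenied, FileNotFound;
        int write(int data) throws DiskFull;
    }

    // Each declared error gets its own response. The compiler rejects a
    // catch clause for DiskFull here, because open() cannot throw it.
    static String openOrExplain(FileOps ops, String name) {
        try {
            return "opened descriptor " + ops.open(name);
        } catch (FileNotFound e) {
            return "no such file: ask for another name";
        } catch (AccessDenied e) {
            return "access denied: contact the file's owner";
        }
    }
}
```

With the generic `throws IOException` version, the same caller would have one catch clause and no basis for choosing between these responses.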
37 Conclusion
- Small but powerful changes drastically improved
the Java Universe.
- Our mistake was to represent all possible errors
explicitly in the closest interface.
- Error scope is an analytic tool that helps the
designer decide how to propagate an error.
- An error discipline saves precious resources:
time and aggravation!
38 For more information...
- Error Scope Paper
- http://www.cs.wisc.edu/thain
- Douglas Thain
- thain@cs.wisc.edu
- Miron Livny
- miron@cs.wisc.edu
- Condor Software, Manuals, Papers, and More
- http://www.cs.wisc.edu/condor
- Questions now?