Error Scope on a Computational Grid: Theory and Practice

Douglas Thain, Computer Sciences Department, University of Wisconsin. USC Reliability Workshop.

1
Error Scope on a Computational Grid: Theory and Practice
  • Douglas Thain
  • Computer Sciences Department
  • University of Wisconsin
  • USC Reliability Workshop
  • July 2002

2
Outline
  • An Exercise: Condor and Java
  • Bad News: Error Explosion
  • A Theory of Error Propagation
  • Down with Generic Errors!
  • Condor Revisited
  • Parting Thoughts

3
An Exercise: Coupling Condor and Java
  • The Condor Project, est. 1985.
  • Production high-throughput computing facility.
  • Provides a stable execution environment on a Grid
    of unstable, autonomous resources.
  • The Java Language, est. 1991.
  • Production language, compiler, and interpreter.
  • Provides a standard instruction set and libraries
    on any processor and system.
  • The Grid, est. ????
  • Execute any code anywhere at any time.
  • Dependable, consistent, pervasive, inexpensive...
  • Are we there yet?

4
The Condor High Throughput Computing System
  • HTC != HPC
  • Measured in sims/week, frames/month, cycles/year.
  • All participants are autonomous.
  • Users give constraints on usable machines.
  • Machines give constraints on jobs and users.
  • ClassAds: a language for matchmaking.
  • If you are willing to re-link jobs...
  • Remote system calls for transparent mobility.
  • Binary checkpointing for migration and
    fault-tolerance.
  • Can't relink? All other features are still available.
  • Special universes support software
    environments.
  • PVM, MPI, Master-Worker, Vanilla, Globus, Java

5
[Diagram: Condor architecture. Submission Site: User Agent (schedd), Home File System. Execution Site: Machine Agent (startd). The Matchmaker sits between the two sites.]

6
Java Universe
  • Execution
  • User specifies .class and .jar files.
  • Machine provides the JVM details.
  • Input and Output
  • Know all of your files?
  • Condor transfers whole files for you.
  • Need online I/O?
  • Link program with Chirp I/O Library.
  • Execution site provides proxy to home site.

7
[Diagram: Java Universe architecture. Submission Site: Job Agent (shadow), I/O Server, Home File System. Execution Site: Job Agent (starter), I/O Proxy, Wrapper, The Job, I/O Library.]

8
Initial Experience
  • Bad news! Any kind of error sent the job back to
    the user with an exception message:
  • NullPointerException - Program is faulty.
  • OutOfMemory - Program outgrew machine.
  • ClassNotFoundError - Machine incorrectly
    installed.
  • ConnectionRefused - Network temporarily
    unavailable.
  • Users were frustrated because they had to
    evaluate whether the job failed or the system
    failed.
  • These were correct in the sense they were true.
  • These were not bugs. We deliberately trapped all
    possible errors and passed them up the chain.

9
What's the Problem?
  • To reason about this problem, we began to
    construct a theory of error propagation.
  • This theory offers some common definitions and
    four principles that outline a design discipline.
  • We re-examined the Java Universe according to
    this theory.
  • Our most serious mistake: we failed to propagate
    errors according to their scope.

10
We are NOT Talking About
  • Fault Tolerance
  • What algorithms are fault-resistant?
  • How many disks can I lose without losing data?
  • How many copies should I make for five nines?
  • Language Structures
  • Should I use Objects or Strings to represent
    errors?
  • Should I use Exceptions or Signals to communicate
    errors?
  • These are important and valuable questions, but
    we are asking something different!

11
We ARE Talking About
  • Where is the problem?
  • How should a program respond to an error?
  • Who should receive an error message?
  • What information should an error carry?
  • How can we even reason about this stuff?

12
Engineering Perspective
  • Fault
  • A physical disruption of the machine.
  • Error
  • An information state that reflects a fault.
  • Failure
  • A violation of documented/guaranteed behavior.
  • Fault
  • (A failure in one's underlying components.)

13
Interface Perspective
  • Implicit Error
  • A result presented as valid, but found to be
    false.
  • Example: sqrt(3) -> 2.
  • Explicit Error
  • A result describing an inability to carry out the
    request.
  • Example: open(file) -> ENOENT.
  • Escaping Error
  • A return to a higher level of abstraction.
  • Example: read -> virtual memory failure -> process
    abort.
  • Example: server out of memory -> shutdown socket.
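The three categories above can be sketched in a few lines of Java. This is only an illustration: the class `ErrorKinds`, its methods, and the `EscapingError` type are hypothetical names chosen to mirror the slide's examples, not anything taken from Condor.

```java
import java.io.FileNotFoundException;

// Hypothetical names throughout; this only mirrors the slide's three examples.
class ErrorKinds {
    // Stand-in for "a return to a higher level of abstraction".
    static class EscapingError extends Error {
        EscapingError(String detail) { super(detail); }
    }

    // Implicit error: a false result presented as valid (sqrt(3) -> 2).
    static double brokenSqrt(double x) {
        return 2.0;
    }

    // Explicit error: the result describes the inability to carry out the request.
    static int open(String file) throws FileNotFoundException {
        throw new FileNotFoundException(file); // plays the role of ENOENT
    }

    // Escaping error: a plain load has no channel for reporting failure,
    // so the problem leaves through a path outside the normal interface.
    static int load(int address) {
        throw new EscapingError("virtual memory failure at address " + address);
    }

    // Exercises all three and reports what a caller is able to observe.
    static String run() {
        StringBuilder seen = new StringBuilder();
        seen.append("implicit:").append(brokenSqrt(3.0));
        try {
            open("missing.txt");
        } catch (FileNotFoundException e) {
            seen.append(" explicit:").append(e.getMessage());
        }
        try {
            load(42);
        } catch (EscapingError e) {
            seen.append(" escaping");
        }
        return seen.toString();
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

Note that the implicit case is the only one the caller cannot detect: `brokenSqrt` returns normally, so nothing in `run` can tell that the value is wrong.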

14
[Diagram: a Program issues a load to the Virtual Memory System (Physical Memory, Backing Store) and expects data in return.]
  • Would like to return an explicit error, but a
    load instruction has no exit code.
  • Could return a default value, but that creates an
    implicit error.
  • Escaping error: tell the parent that the program
    could not complete.

15
Interface Contracts
  • int load( int address )
  • The implementor must either compute a result that
    conforms to the contract or cause an escaping
    error.

16
Exceptions
  • int open( String filename )
  • throws FileNotFound, AccessDenied
  • A language with exceptions provides more
    structure to the contract. A declared exception
    is an explicit error. Yet, escaping errors are
    still possible.
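As a rough sketch of this contract, the Java below declares the two explicit errors from the slide and lets one undeclared problem escape. `ToyFileSystem`, its fields, the descriptor value, and the choice of an unchecked `DiskOffline` error are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Declared, explicit errors from the contract on the slide.
class FileNotFound extends Exception {}
class AccessDenied extends Exception {}

// Hypothetical escaping error; unchecked, so it is not part of the contract.
class DiskOffline extends Error {}

class ToyFileSystem {
    // File name -> whether it may be read (a toy stand-in for a real disk).
    private final Map<String, Boolean> readable = new HashMap<>();
    boolean diskOnline = true;

    void add(String name, boolean canRead) {
        readable.put(name, canRead);
    }

    // The contract: succeed, or raise one of the two declared errors.
    int open(String filename) throws FileNotFound, AccessDenied {
        if (!diskOnline) throw new DiskOffline(); // outside the contract: escapes
        Boolean canRead = readable.get(filename);
        if (canRead == null) throw new FileNotFound();
        if (!canRead) throw new AccessDenied();
        return 3; // toy file descriptor
    }

    // A caller that handles what the contract names; the final catch stands in
    // for whatever manages the larger scope.
    static String tryOpen(ToyFileSystem fs, String name) {
        try {
            fs.open(name);
            return "Success";
        } catch (FileNotFound e) {
            return "FileNotFound";
        } catch (AccessDenied e) {
            return "AccessDenied";
        } catch (DiskOffline e) {
            return "escaped";
        }
    }

    public static void main(String[] args) {
        ToyFileSystem fs = new ToyFileSystem();
        fs.add("notes.txt", true);
        System.out.println(tryOpen(fs, "notes.txt"));
        System.out.println(tryOpen(fs, "absent.txt"));
    }
}
```

The compiler forces callers to deal with `FileNotFound` and `AccessDenied`, but says nothing about `DiskOffline`, which is the sense in which escaping errors remain possible.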

17
[Diagram: a Program calls open on the Virtual File System. At the INTERFACE, open yields Success, FileNotFound, or AccessDenied. Inside the IMPLEMENTATION (Disk, Memory), MemoryCorrupt, DiskOffline, and PigeonLost can occur.]

18
Error Scope
  • In order to be accepted by end users, a
    distributed system must be able to distinguish
    between errors computed by the program and errors
    forced upon it by the environment.
  • We use the term scope to draw the distinction.

19
Error Scope
  • The scope of an error is the portion of the
    system that it invalidates.
  • An error must be delivered to the process
    responsible for managing that scope.

Error                     Scope         Handler
FileNotFound              File          Calling Function
RPC Disconnect            Process       Parent Process
Cache Coherency Problem   Machine       Hypervisor or Operator
PVM Node Crash            PVM Cluster   Parent Process

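One way to picture delivery by scope is an error object that carries its scope, with each layer handling only the scope it manages and passing everything else upward unchanged. The sketch below assumes this design; `ScopeRouting`, `ScopedError`, and the two handler methods are hypothetical names, and only three of the scopes listed above are modeled.

```java
class ScopeRouting {
    enum Scope { FILE, PROCESS, MACHINE }

    // Hypothetical error type that carries its scope along with its detail.
    static class ScopedError extends RuntimeException {
        final Scope scope;

        ScopedError(Scope scope, String detail) {
            super(detail);
            this.scope = scope;
        }
    }

    // The calling function manages file scope only.
    static String callingFunction(Runnable work) {
        try {
            work.run();
            return "ok";
        } catch (ScopedError e) {
            if (e.scope != Scope.FILE) throw e; // not ours: pass it up untouched
            return "function handled " + e.getMessage();
        }
    }

    // The parent process manages process scope only.
    static String parentProcess(Runnable work) {
        try {
            return callingFunction(work);
        } catch (ScopedError e) {
            if (e.scope != Scope.PROCESS) throw e; // e.g. machine scope
            return "parent handled " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(parentProcess(() -> {
            throw new ScopedError(Scope.FILE, "FileNotFound");
        }));
        System.out.println(parentProcess(() -> {
            throw new ScopedError(Scope.PROCESS, "RPC Disconnect");
        }));
    }
}
```

A machine-scope error passes through both layers untouched, which is the intended behavior: neither of them manages that scope.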
20
Error Detail
  • The detail of an error describes in
    phenomenological terms the cause of the error.
  • In the right hands, the detail is useful. In the
    wrong hands, the detail can be misleading.
  • Suppose open returns AccessDenied...
  • File is not accessible - Ok.
  • Library containing open is not accessible -
    Problem!

21
What To Do With An Error?
  • A program cannot possibly know what to do with an
    error outside its scope.
  • Should sin(x) deal with math library not
    available?
  • Propagate an error to the manager of the scope as
    directly as possible.
  • Sometimes, a direct mechanism:
  • Signal, exception, dropped connection, message.
  • Sometimes, an indirect mechanism:
  • Touch a file, then exit by any means available.
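The indirect mechanism, "touch a file, then exit by any means available", might look like the following in Java. The class `IndirectEscape`, the marker file name, and the use of `Runtime.halt` as the exit are assumptions for illustration; the slides do not say how Condor implements this.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

class IndirectEscape {
    // Leave a marker file for the managing process to find later.
    static boolean touch(Path marker) {
        try {
            Files.write(marker, new byte[0],
                    StandardOpenOption.CREATE, StandardOpenOption.WRITE);
            return true;
        } catch (IOException e) {
            return false; // even the marker failed; the exit itself must speak
        }
    }

    // "Touch a file, then exit by any means available."
    static void escape(Path marker) {
        touch(marker);
        Runtime.getRuntime().halt(1); // terminates at once, skipping shutdown hooks
    }

    public static void main(String[] args) {
        // The marker path is an assumption; a real system would agree on it in advance.
        Path marker = Paths.get(System.getProperty("java.io.tmpdir"), "job.escaped");
        System.out.println("marker written: " + touch(marker));
    }
}
```

The managing process would check for the marker after the child disappears, so the message survives even though the child had no direct channel to its parent.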

22
Principles for Error Design
  • Principle 1
  • A routine must not generate an implicit error as
    a result of receiving an explicit error.
  • Principle 2
  • An escaping error converts a potential implicit
    error into an explicit error at a higher level.
  • Principle 3
  • An escaping error must be propagated to the
    program that manages the error's scope.
  • Principle 4
  • Error interfaces must be concise and finite.
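Principle 1 is easy to violate by accident. The hypothetical Java sketch below contrasts a routine that swallows an explicit error and returns a plausible value with one that keeps the error explicit; the class and method names are invented for this example.

```java
class PrincipleOne {
    // Violates Principle 1: the explicit error from parseInt is swallowed and
    // replaced by 0, which the caller cannot tell apart from a real 0.
    static int lenientParse(String text) {
        try {
            return Integer.parseInt(text);
        } catch (NumberFormatException e) {
            return 0;
        }
    }

    // Respects Principle 1: the explicit error stays explicit.
    static int strictParse(String text) throws NumberFormatException {
        return Integer.parseInt(text);
    }

    public static void main(String[] args) {
        // Both calls yield 0, so the caller cannot distinguish them.
        System.out.println(lenientParse("0") == lenientParse("oops"));
        try {
            strictParse("oops");
        } catch (NumberFormatException e) {
            System.out.println("explicit error preserved");
        }
    }
}
```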

23
Return to Condor
  • What did we do wrong?
  • We failed to carefully consider the scope of an
    error.
  • We fell prey to the deadly generic error.
  • What's the solution?
  • Identify error scopes in Condor.
  • Find more direct mechanisms to send escaping
    errors to the managing process.

24
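[Diagram: nested error scopes in Condor and the process attached to each: Job Scope (schedd), Local Resource Scope (shadow), Remote Resource Scope (starter), Virtual Machine Scope (JVM), Program Scope (program). Other labels: Prog Image, Prog Args, User Policy, Owner Policy, I/O Server, Code, Data, Mem, CPU, Input Data, Output Space, Java Pkg.]
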
25
Scope in Condor
Detail                      Scope             Handler
Program exited normally.    Program           User
Null pointer exception.     Program           User
Out of memory.              Virtual Machine   JVM
Java misconfigured.         Remote Resource   Starter
Home file system offline.   Local Resource    Shadow
Program image corrupt.      Job               Schedd

26
Scope in Condor: JVM Exit Code
Detail                      Scope             Handler   Exit Code
Program exited normally.    Program           User      (x)
Null pointer exception.     Program           User      1
Out of memory.              Virtual Machine   JVM       1
Java misconfigured.         Remote Resource   Starter   1
Home file system offline.   Local Resource    Shadow    1
Program image corrupt.      Job               Schedd    1

27
[Diagram: error reporting through a result file. Labels: Job Agent (starter), Job Agent (shadow), Starter Result, Program Result, Result File, JVM Result, JVM, Home File System, Wrapper, The Job, I/O Library. The Result File carries the Program Result or Error and Scope.]

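The diagram's result file, which carries either the program result or an error together with its scope, can be approximated as follows. This is a sketch under stated assumptions: the two-line file format, the `ResultFileSketch` class, and the mapping from Java throwables to scopes are all invented here, since the slides do not describe Condor's actual format or classification rules.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;

class ResultFileSketch {
    enum Scope { PROGRAM, VIRTUAL_MACHINE, REMOTE_RESOURCE, LOCAL_RESOURCE }

    // Assumed classification of throwables into scopes.
    static Scope scopeOf(Throwable t) {
        if (t instanceof OutOfMemoryError) return Scope.VIRTUAL_MACHINE;
        if (t instanceof LinkageError || t instanceof ClassNotFoundException) {
            return Scope.REMOTE_RESOURCE; // e.g. a misconfigured Java installation
        }
        if (t instanceof IOException) return Scope.LOCAL_RESOURCE;
        return Scope.PROGRAM; // e.g. NullPointerException
    }

    // Wrapper side: run the job and record the outcome together with its scope.
    static void runAndRecord(Callable<Void> job, Path resultFile) {
        String scope = Scope.PROGRAM.name();
        String detail = "exited normally";
        try {
            job.call();
        } catch (Throwable t) {
            scope = scopeOf(t).name();
            detail = t.getClass().getSimpleName();
        }
        try {
            // Line 1 is the scope, line 2 the detail (an invented format).
            Files.write(resultFile, Arrays.asList(scope, detail), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Starter side: recover the scope from the file, not from the JVM exit code.
    static Scope readScope(Path resultFile) {
        try {
            List<String> lines = Files.readAllLines(resultFile, StandardCharsets.UTF_8);
            return Scope.valueOf(lines.get(0));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because the scope travels in the file, the reader no longer has to infer it from an exit code that is identical for every kind of failure.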
28
[Diagram: errors of larger scope contrasted with errors inside program scope.]

29
Half-Way Conclusion
  • Small but powerful changes drastically improved
    the Java Universe.
  • Our mistake was to represent all possible errors
    explicitly in the closest interface.
  • Error scope is an analytic tool that helps the
    designer decide how to propagate an error.
  • But, we were initially confused by the presence
    of the deadly generic error.

30
The Deadly Generic Error
  • Whereas, a program may fail in more ways than we
    can possibly imagine...
  • And whereas, generality and flexibility are
    virtues of programming...
  • Be it therefore resolved that interfaces should
    return general, flexible, arbitrary values:
  • int open( String name ) throws IOException

31
What's Wrong with Generality?
  • The structure and types of errors are as
    essential to an interface as the arguments and
    return values.
  • Every error requires a different recovery
    mechanism, according to its scope:
  • EINTR - try again right away
  • ETIMEDOUT - will be available again in the future
  • EPERM - you can't at all without talking to a
    person
  • ESTALE - must kill process
  • A program must know the specific details of an
    error in order to take the right action. Guesses
    don't work.
  • Exit on unknown errors? Program is brittle.
  • Retry on unknown errors? Program waits endlessly.
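The point about specific errors needing specific recoveries can be written down as a lookup with no catch-all guess. The enum and method names below are hypothetical, and the actions are simply the slide's four bullets restated.

```java
class RecoveryByError {
    enum Errno { EINTR, ETIMEDOUT, EPERM, ESTALE }

    enum Action { RETRY_NOW, RETRY_LATER, ASK_A_PERSON, KILL_PROCESS }

    // One specific action per specific error, following the slide's list.
    static Action actionFor(Errno error) {
        switch (error) {
            case EINTR:
                return Action.RETRY_NOW;
            case ETIMEDOUT:
                return Action.RETRY_LATER;
            case EPERM:
                return Action.ASK_A_PERSON;
            case ESTALE:
                return Action.KILL_PROCESS;
            default:
                // No guessing: an error outside the interface is not ours to handle.
                throw new IllegalArgumentException("unknown error: " + error);
        }
    }

    public static void main(String[] args) {
        for (Errno e : Errno.values()) {
            System.out.println(e + " -> " + actionFor(e));
        }
    }
}
```

The table only works because the set of errors is finite and known in advance, which is what a generic error type takes away.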

32
An Example of Generality
int open( String name ) throws IOException
int write( int data ) throws IOException

33
An Example of Generality
  • Java defines several types of IOException:
  • AccessDenied, FileNotFound, EndOfFile...
  • Can open throw...?
  • FileNotFound
  • EndOfFile
  • DiskFull
  • Can write throw...?
  • AccessDenied
  • FileNotFound
  • DiskFull
  • Trick Question!

34
My Disk Runneth Over!
  • What can a program expect for a full disk?
  • DiskFullException
  • OutOfSpaceException
  • It's really neither! (How would we know?)
  • What should an implementor do when the disk fills
    up?
  • There is no appropriate exception to throw.
  • Making up an exception is not useful.
  • Only solution: an escaping error. (Example
    later.)

35
Advice for Constructing Error Interfaces
  • Export a small set of expected error types.
  • Bad Arguments
  • Lost Connection
  • No Such File
  • Choose an internal error management strategy.
    You know the cost of retry vs the cost of
    failure.
  • Retry internally
  • Abort process
  • Drop connection
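The advice above can be combined in one small interface: export a couple of expected error types and keep the retry policy inside the layer. The sketch below is an assumption-laden illustration; `Channel`, `Transport`, and the fixed attempt count are hypothetical, and a real system would choose its own strategy from the cost of retry versus failure.

```java
import java.io.IOException;

// The small exported error set (two of the slide's three examples).
class BadArguments extends Exception {}
class LostConnection extends Exception {}

// Hypothetical lower layer that may fail in many transient ways.
interface Transport {
    void transmit(byte[] data) throws IOException;
}

class Channel {
    private final Transport transport;
    private final int maxAttempts;

    Channel(Transport transport, int maxAttempts) {
        this.transport = transport;
        this.maxAttempts = maxAttempts;
    }

    // Callers see only BadArguments or LostConnection, never a raw IOException.
    void send(byte[] data) throws BadArguments, LostConnection {
        if (data == null) throw new BadArguments();
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                transport.transmit(data);
                return;
            } catch (IOException e) {
                // Internal strategy: retry; the detail stays inside this layer.
            }
        }
        throw new LostConnection();
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        Channel flaky = new Channel(d -> {
            if (++calls[0] < 3) throw new IOException("transient");
        }, 3);
        flaky.send(new byte[] {1});
        System.out.println("delivered after " + calls[0] + " attempts");
    }
}
```

The caller never learns how many attempts were made or why they failed; it only has to handle the two outcomes the interface promises.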

36
A Better Interface
int open( String name ) throws AccessDenied, FileNotFound
int write( int data ) throws DiskFull

37
Conclusion
  • Small but powerful changes drastically improved
    the Java Universe.
  • Our mistake was to represent all possible errors
    explicitly in the closest interface.
  • Error scope is an analytic tool that helps the
    designer decide how to propagate an error.
  • An error discipline saves precious resources:
    time and aggravation!

38
For more information...
  • Error Scope Paper
  • http://www.cs.wisc.edu/thain
  • Douglas Thain
  • thain_at_cs.wisc.edu
  • Miron Livny
  • miron_at_cs.wisc.edu
  • Condor Software, Manuals, Papers, and More
  • http://www.cs.wisc.edu/condor
  • Questions now?