Error Scope on a Computational Grid

Provided by: dougla9
Learn more at: https://www3.nd.edu
Transcript and Presenter's Notes

Title: Error Scope on a Computational Grid


1
Error Scope on a Computational Grid
  • Douglas Thain
  • University of Wisconsin
  • 4 March 2002

2
Overview
  • We have added a Java Universe to Condor. (More
    from Todd.)
  • Adding this code forced us to think about the
    fundamental problem of coupling systems and
    representing errors.
  • A lesson: one must consider the scope of an error
    as well as its detail.

3
Java for Scientific Computing
  • Java is emerging as a tool for large scale
    (Grande) scientific computing.
  • More accessible to domain scientists.
  • Simplified porting.
  • Faster development, debugging.
  • User communities are forming:
  • ACM Java Grande Conference
  • The Java Grande Forum

4
The Hype
  • Java: Write once, run anywhere!
  • Condor: Submit once, run everywhere!
  • The Grid: Uniform, dependable, consistent, pervasive, and
    inexpensive computing.

5
The Reality
  • Coupling systems is not trivial!
  • The easy part: putting "java" in front of the program name.
  • The tricky parts: dealing with unexpected events!
  • Bad Java installation.
  • Unavailable file system.
  • Temporary resource exhaustion.

6
Architecture
  • Execution:
  • User just specifies the java universe.
  • Execution site gives details of the JVM.
  • I/O:
  • Know all of your files? Condor transfers whole files for you.
  • Need online I/O? Link the program with the Chirp I/O Library.
  • Execution site provides a proxy to the home site.

7
[Diagram: Execution Site, Submission Site, starter, shadow, Home File System]

8
[Diagram: Execution Site, Submission Site, starter, shadow, Fork, JVM, Home File System]

9
[Diagram: Execution Site, Submission Site, starter, shadow, Fork, JVM, Home File System, The Job]

10
[Diagram: Execution Site, Submission Site, starter, shadow, Secure Remote I/O, I/O Server, I/O Proxy, Local I/O (Chirp), Fork, Local System Calls, JVM, Home File System, The Job, I/O Library]

11
Initial Experience
  • Bad news: nearly any unexpected failure would
    cause the job to be returned to the user:
  • Out of memory at execution site.
  • Java misconfigured at execution site.
  • I/O proxy can't initialize.
  • Home file system offline.

12
What do Users Want?
  • This was correct in a certain sense:
  • The information was true.
  • But, still frustrating.
  • Users want to know when their program fails by
    design (e.g., a NullPointerException), but not when
    it fails due to the environment.

13
What Did We Do Wrong?
  • We thought that we were very careful to propagate
    errors:
  • I/O errors: server -> proxy -> library -> job
  • JVM exit code: JVM -> starter -> home
  • But, we failed to draw a distinction:
  • Errors that are a natural property of the
    program.
  • Errors that were an incidental result of the
    environment.
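
The problem can be illustrated with a minimal Python sketch. This is not Condor code: every name in it is hypothetical, and the two failing functions only stand in for the two kinds of error listed above. The point it tries to show is that an exit code can be propagated faithfully end to end and still make a program bug and an environment failure look identical.

```python
# Hypothetical illustration; none of these names come from Condor.

class OutOfMemoryAtSite(Exception):
    """Stands in for an environment failure at the execution site."""


def program_with_bug():
    # An error that is a natural property of the program.
    data = None
    return data.size  # AttributeError, Python's analogue of a NullPointerException


def program_on_bad_site():
    # An error that is an incidental result of the environment.
    raise OutOfMemoryAtSite("execution site ran out of memory")


def naive_exit_code(program):
    """Run the program and collapse every failure into exit code 1."""
    try:
        program()
        return 0
    except Exception:
        return 1


if __name__ == "__main__":
    print(naive_exit_code(program_with_bug))     # prints 1
    print(naive_exit_code(program_on_bad_site))  # prints 1
```

Both calls report the same code, so whoever receives it cannot tell whether the user should fix the program or the system should simply retry the job elsewhere.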

14
Scope and Detail
  • The scope of an error is the portion of the
    system that it invalidates.
  • The detail of an error describes its
    philosophical cause.
  • An error must be delivered to the handler that
    manages its scope.
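
One way to picture these three statements is the Python sketch below. The scope names, the handler functions, and the actions they return are all assumptions made for illustration; the slides do not enumerate the scopes or say which component handles each one.

```python
from dataclasses import dataclass
from enum import Enum


class Scope(Enum):
    """Hypothetical scopes: the portion of the system an error invalidates."""
    PROGRAM = "program"
    EXECUTION_SITE = "execution site"
    SUBMISSION_SITE = "submission site"


@dataclass(frozen=True)
class ScopedError:
    scope: Scope  # what the error invalidates
    detail: str   # why it happened


# Assumed handlers, one per scope. The actions are illustrative only.
def return_to_user(err):
    return "return job to user: " + err.detail


def try_another_site(err):
    return "run job at another site: " + err.detail


def wait_for_home_site(err):
    return "hold job until home site recovers: " + err.detail


HANDLERS = {
    Scope.PROGRAM: return_to_user,
    Scope.EXECUTION_SITE: try_another_site,
    Scope.SUBMISSION_SITE: wait_for_home_site,
}


def deliver(err):
    """Route an error to the handler that manages its scope."""
    return HANDLERS[err.scope](err)


if __name__ == "__main__":
    print(deliver(ScopedError(Scope.PROGRAM, "NullPointerException")))
    print(deliver(ScopedError(Scope.EXECUTION_SITE, "out of memory")))
    print(deliver(ScopedError(Scope.SUBMISSION_SITE, "home file system offline")))
```

In this sketch the detail travels with the error for reporting, but only the scope decides which handler receives it.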

15
Examples
16
An Example
  • With this understanding, we reconsidered many
    elements of the Java Universe.
  • One example:
  • The JVM exit code is not a useful result.
  • It gives results that ignore error scope.
  • Solution:
  • Trap the program exit at a higher level.
  • Report the result and scope on a separate channel.
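
The solution above can be sketched in Python as a wrapper that runs the program, traps how it ends, and writes the result together with its scope to a result file. This is an illustration under stated assumptions, not the actual Java Universe wrapper: the JSON format, the field names, and the choice of `MemoryError` and `OSError` as stand-ins for environment failures are all hypothetical.

```python
import json
import os
import tempfile


def run_wrapped(program, result_path):
    """Run `program` and record its outcome and scope in a result file."""
    try:
        value = program()
        record = {"scope": "program", "outcome": "exited", "value": value}
    except (MemoryError, OSError) as exc:
        # Assumption: these exception types stand in for environment failures.
        record = {"scope": "environment", "outcome": "error",
                  "detail": type(exc).__name__}
    except Exception as exc:
        # Anything else is treated as the program failing by design.
        record = {"scope": "program", "outcome": "exception",
                  "detail": type(exc).__name__}
    with open(result_path, "w") as f:
        json.dump(record, f)


def read_result(result_path):
    """Read back the record written by run_wrapped."""
    with open(result_path) as f:
        return json.load(f)


if __name__ == "__main__":
    def buggy():
        raise ValueError("bad input")

    def starved():
        raise MemoryError()

    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "result.json")
        for job in (lambda: 0, buggy, starved):
            run_wrapped(job, path)
            print(read_result(path))
```

Because the record carries an explicit scope, the reader of the result file no longer has to guess it from a bare exit code.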

17
JVM Exit Code
18
[Diagram: starter, shadow, Starter Result, Program Result, Result File, JVM Result, JVM, Home File System, Program Result or Error and Scope, Wrapper, The Job, I/O Library]

19
[Diagram: Errors of Larger Scope, Errors Inside Program Scope]

20
Conclusion
  • We started building the Java Universe with some
    naive assumptions about errors.
  • On encountering practical difficulties, we
    thought more abstractly about errors and
    developed the notion of scope and detail.
  • By routing errors according to their scope, we
    made the system more robust and usable.
  • Details in an upcoming paper.

21
Deeper Problems
  • Systems have deep semantic differences that cross
    multiple functions.
  • Consider this self-cleaning program:
  • Open a file.
  • Delete the file.
  • Close the file.
  • Works on UNIX, fails on WinNT.
  • Can we really provide a uniform interface?
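
The self-cleaning program can be written as a short Python sketch. The function name and its return values are hypothetical, and the extra cleanup in the `finally` block is not part of the three steps; it only keeps the example from leaving a file behind when the delete step fails.

```python
import os
import tempfile


def self_cleaning(path):
    """Open a file, delete it while it is still open, then close it.

    Returns "ok" if the delete succeeded, otherwise the name of the
    exception raised by the delete step.
    """
    f = open(path, "w")
    try:
        try:
            os.remove(path)  # delete while the file is still open
        except OSError as exc:
            # Typically PermissionError on Windows, where an open file
            # cannot be removed.
            return type(exc).__name__
        f.write("scratch data")  # on UNIX the open handle stays usable
        return "ok"
    finally:
        f.close()
        if os.path.exists(path):
            os.remove(path)  # extra cleanup, only needed if the delete failed


if __name__ == "__main__":
    fd, path = tempfile.mkstemp()
    os.close(fd)
    print(self_cleaning(path))
```

The same three calls succeed or fail depending on the platform, which is the kind of semantic difference that a uniform interface would have to hide.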

22
More Info
  • Demo on Wednesday Morning
  • Room 3381 CS anytime
  • The Condor Project:
  • http://www.cs.wisc.edu/condor
  • These slides:
  • http://www.cs.wisc.edu/thain
  • Douglas Thain:
  • thain@cs.wisc.edu
  • Questions now?