Systems Seminar Schedule

1
Systems Seminar Schedule
  • Monday, 18 February, 4pm
  • New Wine in Old Bottles - Douglas Thain
  • 4 March
  • No seminar: Paradyn/Condor Week
  • Tuesday, 19 March, 3pm
  • The Microsoft .NET System - Mike Litzkow
  • Tuesday, 2 April, 3pm
  • Condor and the Grid - Miron Livny
  • Monday, 15 April, 4pm
  • Exploiting Gray-Box Knowledge of Buffer-Cache
    Management - Nathan Burnett
  • Monday, 29 April, 4pm
  • Bridging the Information Gap in Storage Protocol
    Stacks - Tim Denehy

2
New Wine in Old Bottles: Java on Condor
  • Douglas Thain
  • University of Wisconsin
  • 18 February 2002

3
Abstract
  • We have added Java support to Condor. I'll tell
    you how it works and how to use it. There are
    some nifty features for end users.
  • Adding this code forced us to think about the
    fundamental problem of coupling systems and
    representing errors.
  • A lesson: One must consider the scope of an error
    as well as its detail.

4
Disclaimer
  • This is still rough around the edges.
  • (Someone had to go first!)

5
Outline
  • Why Java and Condor?
  • Architecture
  • Initial Experience
  • A Little Error Theory
  • Changes for the Better
  • Conclusions

6
Java for Scientific Computing
  • Java is emerging as a tool for large scale
    (Grande) scientific computing.
  • More accessible to domain scientists.
  • Simplified porting.
  • Faster development, debugging.
  • User communities are forming:
  • ACM Java Grande Conference
  • The Java Grande Forum

A. Globus, E. Langhirt, M. Livny, R. Ramamurthy,
M. Solomon, and S. Traugott. JavaGenes and
Condor: Cycle-scavenging genetic algorithms. ACM
Conference on Java Grande, 2000.
7
Limitations
  • Java floating point and complex arithmetic do not
    yet satisfy all of the scientific community.
  • Arguments continue between industry and academia.
  • Java is still slower than comparable programs in
    C/C++/Fortran.
  • WAT compilers and JIT compilers are catching up.
  • You choose: 2x slowdown vs. 5x machines.
  • Can we really harness 5x machines while still
    maintaining platform independence?

8
Condor for Scientific Computing
  • Condor creates a high-throughput computing system
    on a community of computers.
  • A high-throughput computing system seeks to
    maximize the amount of work done over a long
    period of time.
  • A community of computers may be any collection of
    machines that agree to work together.

9
Condor Enables Ordinary Users
10
Top 10 Condor Pools
226 Condor Pools
5576 Condor Hosts
11
The Hype
  • Java: Write once, run anywhere!
  • Condor: Submit once, run everywhere!
  • The Grid: Uniform, dependable, consistent, pervasive, and
    inexpensive computing.

12
The Reality
  • Coupling systems is not trivial!
  • The easy part:
  • Putting java in front of the program name.
  • The tricky parts:
  • Java installation messes.
  • Unavailable file systems.
  • Distinguishing program errors from environmental
    errors.

13
Outline
  • Why Java and Condor?
  • Architecture
  • Initial Experience
  • A Little Error Theory
  • Changes for the Better
  • Conclusions

14
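[Figure: architecture diagram. Components: Match Maker, schedd,
startd, Home File System. The schedd carries Job Policies and the
startd carries Machine Policies. Annotations: "Creates the execution
environment." and "Exports the details, policy, and I/O services."]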
15
Home File System
16
User Interface
  • condor_status -java

    Name           JavaVendor   Ver    State    Activity  LoadAv  Mem
    aish.cs.wisc.  Sun Microsy  1.2.2  Owner    Idle      0.000   249
    anfrom.cs.wis  Sun Microsy  1.2.2  Owner    Idle      0.030   249
    babe.cs.wisc.  Sun Microsy  1.2.2  Claimed  Busy      1.120   123
    ...

                 Machines  Owner  Claimed  Unclaimed  Matched  Preempting
    INTEL/LINUX       514    101      408          5        0           0
    Total             514    101      408          5        0           0

17
User Interface
  • universe = java
  • executable = Main.class
  • jar_files = MyLibrary.jar
  • input = infile
  • output = outfile
  • arguments = Main 1 2 3
  • queue

condor_submit
18
I/O Interface
  • Input, output, and error files are automatically
    transferred to/from the execution site.
  • Any other named files may be transferred as well.
  • To do online I/O without transferring whole
    files, you must make small changes to the code:
  • FileInputStream -> ChirpInputStream
  • FileOutputStream -> ChirpOutputStream
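
A minimal sketch of the "small change" this slide describes, assuming the program is otherwise written against java.io.InputStream so that only the constructor call differs. StreamChoice, openInput, and countBytes are hypothetical names. The Chirp constructor appears only in a comment because its exact signature is an assumption here, not something the slides state.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper: isolates the one line that changes when a program
// moves from whole-file transfer to online I/O.
class StreamChoice {
    static InputStream openInput(String name) throws IOException {
        // Local version, as in an unmodified program:
        return new FileInputStream(name);
        // Online version (assumed constructor shape, following the slide's
        // "FileInputStream -> ChirpInputStream"):
        // return new ChirpInputStream(name);
    }

    // Everything downstream depends only on InputStream, so it does not
    // care which constructor was called above.
    static int countBytes(InputStream in) throws IOException {
        int n = 0;
        while (in.read() != -1) {
            n++;
        }
        return n;
    }
}
```

Keeping the constructor call in one place limits the edit to a single line per stream, which is about as close to transparent as fully qualified class names allow.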

19
Application
Added a new library on existing interfaces. User
must call new constructors.
Chirp I/O Library
Java symbols are fully qualified, so transparent
replacement of classes is not possible.
Java Standard Libraries
Java Virtual Machine
JNI
Could replace native methods in the JVM, but this
ties us to open-source JVMs.
C Standard Library
Could trap real system calls, but these are
complex (asynchronous, nonblocking, threaded) and
may be difficult to distinguish from the JVM's
own operations.
Operating System
20
Outline
  • Why Java and Condor?
  • Architecture
  • Initial Experience
  • A Little Error Theory
  • Changes for the Better
  • Conclusions

21
Initial Experience
  • Bad news: Nearly any unexpected failure would
    cause the job to be returned to the user:
  • Out of memory at execution site.
  • Java misconfigured at execution site.
  • I/O proxy can't initialize.
  • Home file system offline.

22
Initial Experience
  • Although this was correct in some sense -- the
    information was true -- it was very frustrating.
  • Users want to know when their program fails by
    design (e.g., NullPointerException), but not when it
    fails due to the environment.
  • What did we do wrong?

23
Outline
  • Why Java and Condor?
  • Architecture
  • Initial Experience
  • A Little Error Theory
  • Changes for the Better
  • Conclusions

24
A Little Error Theory
  • Build on standard definitions from
    fault-tolerance and programming languages.
  • Some brief examples to get the idea.
  • Return to Condor and use the theory to understand
    our design mistakes.

25
Fault Tolerance Terminology
  • Failure
  • An externally-visible deviation from
    specifications.
  • Error
  • An internal data state that leads to a failure.
  • Fault
  • An external event that creates an error.

A. Avizienis and J.C. Laprie. Dependable
computing: From concepts to design diversity.
Proceedings of the IEEE 74(5), May 1986.
26
Example
[Figure: a Client sends requests to a Server. Labels mark the FAULT,
the ERROR, and the FAILURE. Captions: "Hmm, sqrt(4) is..." and
"Hmm, sqrt(9) is..."]
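
The three terms can be sketched with a toy square-root server. This is an illustration under assumptions, not the figure's exact scenario: SqrtServer and its methods are hypothetical, and the fault is modeled as a single overwritten table entry.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical square-root server that answers from an internal table.
// Only the inputs 4 and 9 are supported in this sketch.
class SqrtServer {
    private final Map<Integer, Integer> table = new HashMap<>();

    SqrtServer() {
        table.put(4, 2);
        table.put(9, 3);
    }

    // FAULT: an external event (say, a stray memory write) that damages
    // the server. Modeled here by overwriting one table entry.
    void injectFault() {
        table.put(4, 3);
    }

    // ERROR: true when the internal state is wrong, whether or not any
    // client has noticed yet.
    boolean hasError() {
        return table.get(4) * table.get(4) != 4;
    }

    // FAILURE happens only when the bad state becomes externally visible,
    // that is, when this reply deviates from the specification.
    int sqrt(int x) {
        return table.get(x);
    }
}
```

After injectFault(), the server holds an error even if no client ever asks for sqrt(4); the failure occurs only when a client receives 3.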
27
  • Implicit errors
  • The system claims to have reached a valid result,
    but an auditor claims it is invalid. Example:
    sqrt(3) = 2
  • Explicit errors
  • The system tells us it cannot complete the
    desired action. Example: file not found.
  • Escaping errors
  • The system detects an error, but has no method of
    reporting it, so it escapes by an alternate
    route. Example: core dump, kernel panic.

J. B. Goodenough. Exception handling: Issues
and a proposed notation. CACM 18(12), December
1975.
K. Ekanadham and A. Bernstein. Some new
transitions in hierarchical level structures.
Operating Systems Review 12(4), 1978.
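
The three categories can be placed side by side in a short sketch. All names here (ErrorKinds, NegativeInput, brokenSqrt, and so on) are hypothetical; only the categories themselves come from the slide.

```java
// Hypothetical illustration of implicit, explicit, and escaping errors.
class ErrorKinds {
    // Declared in a method signature, so it is an explicit error.
    static class NegativeInput extends Exception {
        NegativeInput(String msg) {
            super(msg);
        }
    }

    // Implicit error: the routine claims success, but the value is wrong.
    static int brokenSqrt(int x) {
        return 2; // "sqrt(3) = 2": looks valid, is not
    }

    // Only an outside auditor can catch an implicit error.
    static boolean audit(int x, int claimedRoot) {
        return claimedRoot * claimedRoot == x;
    }

    // Explicit error: the routine says it cannot complete the action.
    static double checkedSqrt(double x) throws NegativeInput {
        if (x < 0) {
            throw new NegativeInput("no real square root of " + x);
        }
        return Math.sqrt(x);
    }

    // Escaping error: the problem is detected, but the signature offers
    // no way to report it, so it leaves by an alternate route (an
    // unchecked Error here, a core dump in a C program).
    static int tableSqrt(int[] table, int x) {
        if (table == null) {
            throw new InternalError("lookup table missing");
        }
        return table[x];
    }
}
```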
28
Would like to return an explicit error, but a
load instruction has no exit code.
Program
Could return a default value, but that creates an
implicit error.
load
data
Escaping error: Tell the parent that the program
could not complete.
Virtual Memory System
Backing Store
Physical Memory
29
Interface Contracts
  • int load( int address )
  • The implementor must either compute a result that
    conforms to the contract, or is obliged to cause
    an escaping error.

C. Hoare. An axiomatic basis for computer
programming. CACM 12(10):576-580, October 1969.
B. Meyer. Object-Oriented Software
Construction. Prentice Hall, 1997.
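
A sketch of the load contract, assuming a toy model in which memory is a pair of maps. VirtualMemory and its methods are hypothetical and stand in for the virtual memory system pictured earlier; the point is only that every int is a legal return value, so a failed load has no explicit way out.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of a virtual memory system; not Condor code.
class VirtualMemory {
    private final Map<Integer, Integer> physical = new HashMap<>();
    private final Map<Integer, Integer> backingStore = new HashMap<>();
    private boolean backingStoreOnline = true;

    void store(int address, int value) {
        backingStore.put(address, value);
    }

    void takeBackingStoreOffline() {
        backingStoreOnline = false;
    }

    // Contract: return the value stored at 'address'. Every int is a
    // legal value, so there is no spare return code meaning "failed".
    int load(int address) {
        Integer cached = physical.get(address);
        if (cached != null) {
            return cached;
        }
        if (!backingStoreOnline) {
            // Returning 0 here would be an implicit error. The remaining
            // option is to escape to whoever manages the process.
            throw new Error("backing store unavailable for address " + address);
        }
        // In this model, memory that was never written reads as zero.
        int value = backingStore.getOrDefault(address, 0);
        physical.put(address, value);
        return value;
    }
}
```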
30
Exceptions
  • int open( String filename )
  • throws FileNotFound, AccessDenied
  • A language with exceptions provides more
    structure to the contract. A declared exception
    is an explicit error. Yet, escaping errors are
    still possible.

31
Program
Success, FileNotFound, AccessDenied
open
MemoryCorrupt, DiskOffline, PigeonLost
INTERFACE
Virtual File System
IMPLEMENTATION
Disk
Memory
32
Error Scope
  • In order to be accepted by end users, a
    distributed system must be able to distinguish
    between errors computed by the program and errors
    forced upon it by the environment.
  • We use the term scope to draw the distinction.

33
Error Scope
  • The scope of an error is the portion of the
    system that it invalidates.
  • An error must be delivered to the process
    responsible for managing that scope.

Error                    Scope        Handler
FileNotFound             File         Calling Function
RPC Disconnect           Process      Parent Process
Cache Coherency Problem  Machine      Hypervisor or Operator
PVM Node Crash           PVM Cluster  Parent Process
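
The table can be read as a lookup from scope to handler. A minimal sketch, assuming the handlers are represented by descriptive strings; ScopeRouting and handlerFor are hypothetical names.

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical encoding of the scope table: each scope names the party
// responsible for managing errors of that scope.
class ScopeRouting {
    enum Scope { FILE, PROCESS, MACHINE, PVM_CLUSTER }

    private static final Map<Scope, String> HANDLER = new EnumMap<>(Scope.class);

    static {
        HANDLER.put(Scope.FILE, "calling function");
        HANDLER.put(Scope.PROCESS, "parent process");
        HANDLER.put(Scope.MACHINE, "hypervisor or operator");
        HANDLER.put(Scope.PVM_CLUSTER, "parent process");
    }

    // An error is delivered according to what it invalidates,
    // not according to what caused it.
    static String handlerFor(Scope scope) {
        return HANDLER.get(scope);
    }
}
```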
34
Error Detail
  • The detail of an error describes in
    phenomenological terms the cause of the error.
  • In the right hands, the detail is useful. In the
    wrong hands, the detail can be misleading.
  • Suppose open returns AccessDenied...
  • File is not accessible - Ok.
  • Library containing open is not accessible -
    Problem!

35
Lessons
  • Principle 1
  • A routine must not generate an implicit error as
    a result of receiving an explicit error.
  • Principle 2
  • An escaping error converts a potential implicit
    error into an explicit error at a higher level.
  • Principle 3
  • An escaping error must be propagated to the
    program that manages the error's scope.
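
Principle 1 is the easiest of the three to violate by accident. A sketch, assuming a map stands in for the file system; PrincipleOne, readSetting, and the "timeout" setting are hypothetical.

```java
import java.io.FileNotFoundException;
import java.util.Map;

// Hypothetical example of a routine receiving an explicit error.
class PrincipleOne {
    // Stand-in for a file system: maps a file name to an integer setting.
    static int readSetting(Map<String, Integer> files, String name)
            throws FileNotFoundException {
        Integer value = files.get(name);
        if (value == null) {
            throw new FileNotFoundException(name);
        }
        return value;
    }

    // Violates Principle 1: the explicit error is swallowed and replaced
    // by a plausible-looking value, which is an implicit error.
    static int timeoutBad(Map<String, Integer> files) {
        try {
            return readSetting(files, "timeout");
        } catch (FileNotFoundException e) {
            return 0;
        }
    }

    // Respects Principle 1: the explicit error stays explicit.
    static int timeoutGood(Map<String, Integer> files)
            throws FileNotFoundException {
        return readSetting(files, "timeout");
    }
}
```

A caller of timeoutBad cannot tell a configured timeout of 0 from a missing file, which is exactly the situation an auditor would be needed to detect.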

36
Outline
  • Why Java and Condor?
  • Architecture
  • Initial Experience
  • A Little Error Theory
  • Changes for the Better
  • Conclusions

37
Java and Condor Revisited
  • What did we do wrong?
  • We focussed on error detail without considering
    error scope.

38
(No Transcript)
39
Java and Condor Revisited
  • To fix the system, we revisited the notion of
    error scope throughout.
  • Two examples:
  • JVM exit code
  • I/O errors

40
JVM Exit Code
Detail                                  Scope            Exit Code
Program exited by completing main       Program          0
Program exited through System.exit(x)   Program          x
Exception: Null pointer.                Program          1
Exception: Out of memory.               Virtual Machine  1
Exception: Java misconfigured.          Remote Resource  1
Exception: Home file system offline.    Local Resource   1
Exception: Program image corrupt.       Job              1
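
The table shows five different scopes collapsing onto exit code 1. A sketch of why the exit code alone is not enough, assuming an outcome is recorded as a (detail, scope, exit code) triple; JvmOutcome and its methods are hypothetical names, and the rule that only program-scope outcomes go back to the user follows the earlier "Initial Experience" slides.

```java
// Hypothetical record of how one execution attempt ended.
class JvmOutcome {
    enum Scope { PROGRAM, VIRTUAL_MACHINE, REMOTE_RESOURCE, LOCAL_RESOURCE, JOB }

    final String detail;
    final Scope scope;
    final int exitCode;

    JvmOutcome(String detail, Scope scope, int exitCode) {
        this.detail = detail;
        this.scope = scope;
        this.exitCode = exitCode;
    }

    // Users want to hear about failures of their own program, not about
    // failures of the environment it happened to run in.
    boolean belongsToUser() {
        return scope == Scope.PROGRAM;
    }

    // With only the exit code to go on, two outcomes can be told apart
    // only if their codes differ.
    static boolean distinguishableByExitCode(JvmOutcome a, JvmOutcome b) {
        return a.exitCode != b.exitCode;
    }
}
```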
41
starter
shadow
Starter Result Program Result
Result File
JVM Result
JVM
Home File System
Result of Execution Attempt Result of Program,
If any.
42
I/O Error Scope
  • All Java I/O operations throw a single exception
    type -- IOException.
  • Our mistake: we converted all detected errors into
    IOExceptions and passed them to the program.
  • Makes sense for FileNotFound, but not for
    ProxyUnavailable or CredentialsExpired.
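
One way to separate the two cases is to keep IOException for errors inside program scope and use an unchecked Error for everything else. A sketch under assumptions: ScopedIo, ProxyReply, and OutOfScopeError are hypothetical names, and the proxy reply is passed in directly rather than obtained from a real I/O proxy.

```java
import java.io.FileNotFoundException;
import java.io.IOException;

// Hypothetical I/O layer that routes errors by scope.
class ScopedIo {
    // Possible replies from an I/O proxy (assumed for this sketch).
    enum ProxyReply { OK, FILE_NOT_FOUND, PROXY_UNAVAILABLE, CREDENTIALS_EXPIRED }

    // Unchecked, so it passes straight through the program's
    // catch (IOException ...) blocks.
    static class OutOfScopeError extends Error {
        OutOfScopeError(String detail) {
            super(detail);
        }
    }

    static void open(String name, ProxyReply reply) throws IOException {
        switch (reply) {
            case OK:
                return;
            case FILE_NOT_FOUND:
                // Within program scope: the program asked for a bad name.
                throw new FileNotFoundException(name);
            default:
                // Outside program scope: nothing the program can fix.
                throw new OutOfScopeError(reply + " while opening " + name);
        }
    }

    // A typical program: it is prepared for IOException and nothing more.
    static String programTries(String name, ProxyReply reply) {
        try {
            open(name, reply);
            return "opened";
        } catch (IOException e) {
            return "program handled: " + e.getMessage();
        }
    }
}
```

With this split, a program that catches IOException still sees FileNotFound, while ProxyUnavailable and CredentialsExpired bypass it and reach whatever wraps the program.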

43
starter
To I/O Proxy
Result of Execution Attempt Result of Program,
If any.
Result File
JVM Result
JVM
Error Outside Program Scope
Error Inside Program Scope
44
Outline
  • Why Java and Condor?
  • Architecture
  • Initial Experience
  • A Little Error Theory
  • Changes for the Better
  • Conclusions

45
Conclusion
  • We started building the Java Universe with some
    naive assumptions about errors.
  • On encountering practical difficulties, we
    thought more abstractly about errors and
    developed the notion of scope and detail.
  • By routing errors according to their scope, we
    made the system more robust and usable.

46
Food for Thought
  • There isn't always an easy way to propagate an
    error to the scope handler.
  • Escaping error to parent process:
  • Raise a POSIX signal.
  • Escaping error to the starter:
  • Throw a Java Error, trapped by the Wrapper,
    placed in a file, read after the process exits.
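
The second route can be sketched as a wrapper around the user's code. This is a simplification under stated assumptions, not the Condor wrapper itself: WrapperSketch, the one-line result file format, and the rule that every Error counts as outside program scope are all hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical wrapper: runs the program and records in a file how the
// attempt ended, so a parent can read it after the process exits.
class WrapperSketch {
    static void run(Runnable program, Path resultFile) throws IOException {
        String result;
        try {
            program.run();
            result = "program: exited normally";
        } catch (Exception e) {
            // Ordinary exceptions are treated as program scope.
            result = "program: " + e;
        } catch (Error e) {
            // Assumption of this sketch: any Error is outside program scope.
            result = "environment: " + e;
        }
        Files.write(resultFile, result.getBytes(StandardCharsets.UTF_8));
    }

    // Parent side: decide from the file whether the user should see
    // this outcome or whether the attempt failed for other reasons.
    static boolean shouldReturnToUser(Path resultFile) throws IOException {
        String text = new String(Files.readAllBytes(resultFile), StandardCharsets.UTF_8);
        return text.startsWith("program:");
    }
}
```

The file gives the parent a channel that is independent of the JVM exit code, which is what allows scope to travel alongside detail.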

47
Food for Thought
  • The mere use of exceptions in a program does not
    imply a disciplined error management.
  • For example, throws IOException is a very vague
    statement about an interface.
  • What is an implementor allowed to throw?
  • Can open() return FileNotFound?
  • (Probably.)
  • Can read() throw FileNotFound?
  • (Asking for trouble.)
  • What about ConnectionRefused?

48
Food for Thought
  • A contract can govern more than simply the
    interface specification.
  • Consider this self-cleaning program:
  • fd = open(file)
  • unlink(file)
  • close(fd)
  • Works on UNIX, fails on WinNT.
  • Can an interface (code + docs) really state all the
    necessary semantic information?
  • Should it?
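
The self-cleaning program translates to Java roughly as follows. A sketch only: SelfCleaning and deleteWhileOpen are hypothetical names, and the method reports what the platform did rather than asserting a particular outcome, since that outcome is the platform-dependent part.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

// Hypothetical Java version of the open / unlink / close sequence.
class SelfCleaning {
    // Returns whether the delete issued while the file was still open
    // succeeded on this platform.
    static boolean deleteWhileOpen(File file) throws IOException {
        boolean deletedWhileOpen;
        try (FileInputStream in = new FileInputStream(file)) { // fd = open(file)
            deletedWhileOpen = file.delete();                  // unlink(file)
        }                                                      // close(fd)
        if (!deletedWhileOpen) {
            // Clean up after the fact where the early delete was refused.
            file.delete();
        }
        return deletedWhileOpen;
    }
}
```

The signatures of the calls involved are the same everywhere; only the return value of deleteWhileOpen reveals the semantic difference between platforms.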

49
Deployment
  • As of February 14th, the Java Universe is running
    on 515 RedHat 7.2 machines.
  • Will be rolled out as part of Condor 6.3.2 on all
    platforms in the regular release schedule.
  • Sun JDK 1.2.2 on UNIX machines.
  • Sun JDK 1.3.2 on WinNT machines.
  • Is the Java Universe available on my machine?
  • condor_status -java

50
skywalker.cs.wisc.edu
c2 cluster
tux lab
istat
51
Acknowledgements
  • Although we may take credit (or blame) for the
    most recent changes, the Condor architecture has
    dealt with errors for many years. Much credit
    goes to the core designers, esp. Mike Litzkow,
    Todd Tannenbaum, and Derek Wright.

52
More Info
  • The Condor Project
  • http://www.cs.wisc.edu/condor
  • These slides
  • http://www.cs.wisc.edu/thain
  • Douglas Thain
  • thain@cs.wisc.edu
  • Questions now?