Errors, Status, and Asynchrony Discussion Session


1
Errors, Status, and Asynchrony Discussion Session
  • PPDG Data Replication Meeting
  • 10 January 2002
  • Douglas Thain, Condor Project
  • University of Wisconsin

2
Agenda
  • A Working Model
  • Two Error-Management Issues
  • Thinking of Data-Movement as Jobs
  • Reconciling Error Representations
  • Example Problem
  • Discussion
  • Open Issues
  • Hints and Absolutes in Replica Management
  • Tradeoff between consistency and availability

3
Discussion Points
  • Data Job Management and Fault-Tolerance
  • What faults do we intend to tolerate/expose/ignore?
  • Can we develop a general transaction
    infrastructure for replication-related
    activities?
  • How should we evaluate designs that may be error
    sensitive? (design review, stress testing)
  • Error Identification and Representation
  • Should we have a uniform error space?
  • Is it feasible to translate between existing
    error spaces?
  • What systems have unusual error modes that
    outsiders may not expect?
  • How do we deal with unusual errors that must pass
    through existing APIs?

4
A Working Model: Giggle
[Figure: a GRIN index maps logical names L1, L2, L3 to site B. Replica Site B maps L1 to P1, L2 to P2, and L3 to P3.]
Foster, Iamnitchi, Ripeanu, Chervenak, Deelman,
Kesselman, Hoschek, Kunszt, Stockinger,
Stockinger, Tierney, "Giggle: A Framework for
Constructing Scalable Replica Location Services."

5
The Problem
  • Replication systems will be subject to a wide
    variety of errors.
  • How do we build systems that maintain consistency
    in the face of errors?
  • Answer: Use transactions to manage jobs, but...
  • How do we build systems that make reasonable
    performance decisions in the face of errors?
  • Answer: Informative errors, but...

6
Fault Tolerance Terminology
  • Failure
  • An externally-visible deviation from
    specifications.
  • Error
  • An internal data state that leads to a failure.
  • Fault
  • An external event that creates an error.

A. Avizienis and J.C. Laprie, "Dependable
Computing: From Concepts to Design Diversity,"
Proc. IEEE 74, 5 (May 1986), 629-638.

7
Example
[Figure: a client and a server, annotated with the labels FAULT, ERROR, and FAILURE. Speech bubbles read "Hmm, sqrt(4) is..." and "Hmm, sqrt(9) is...".]

8
  • Silent errors (failures)
  • The system claims to have reached a valid result,
    but an auditor claims it is invalid.
  • Explicit errors (failures)
  • The system tells us it cannot complete the
    desired action.
  • Escaping errors (failures)
  • The system detects an error, but has no method of
    reporting it, so it escapes by an alternate route
    -- drop connection, core dump, kernel panic.
    (exception)

John B. Goodenough, "Exception Handling: Issues
and a Proposed Notation," CACM 18(12) (1975),
pp. 683-696.

9
What Errors to Expect in a Replication System?
  • Errors of communication
  • File transfer was broken between bytes.
  • Collection transfer was broken between files.
  • Errors of omission
  • Requested some files, but the response was slow,
    so the caller gave up and left. (With or without
    an abort?)
  • Errors in configuration
  • Space at the target server can't admit all
    incoming data at once.

10
What Must Be Consistent?
[Figure: the Replica Catalog maps L1, L2, L3 to site B. Replica Site B maps L1 to P1, L2 to P2, and L3 to P3, and holds the physical files P1, P2, P3.]

11
Data Movement as a Job
  • Each request issued for replication must have a
    past, present, and future
  • Who issued it, and why?
  • What is it doing now?
  • Is it done? Did it succeed?
  • Enough information to roll back after a failure.
  • A complete program execution
  • Data jobs, CPU jobs, and dependencies:
    DAGMan/DaPMan

12
Job Management
  • Primary technique for reliably interacting with
    the job queue: the transaction.
  • ACID test: Atomicity, Consistency, Isolation,
    Durability.
  • Of course, the natural interface to a db, but not
    all participants are a full db.
  • Interface
  • 2PL and friends
  • Implementation
  • Logging, shadowing, a real db?

13
Two-Phase Commit
[Figure: a client and a server, with the labels Stable Storage, Work Space, and Archival Space.]
J. Eliot Moss, "Nested Transactions: An Approach
to Reliable Distributed Computing," MIT Press,
1985.

14
Two-Phase Commit
[Figure: a client and a server, with the labels Stable Storage, Work Space, and Archival Space.]
James Frey, Todd Tannenbaum, Ian Foster, Miron
Livny, and Steven Tuecke, "Condor-G: A
Computation Management Agent for
Multi-Institutional Grids," Proceedings of the
Tenth IEEE Symposium on High Performance
Distributed Computing (HPDC10), 2001.

15
Transactions and Status
  • The transaction ID then becomes a persistent job
    number for later queries
  • Success, failure, abort, timeout
  • unknown-past, unknown-future.
  • For this status to be useful, a record of the job
    must be kept around for a certain period of time.
  • Also ok to time out, cancel, or otherwise remove
    data movement jobs.
  • But, a committed transaction must be kept.
  • Can't reuse a job number!
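
The status rules above can be sketched as a small job table. This is an illustration only, not the Condor implementation: the `JobRegistry` class and its method names are hypothetical, the table lives in memory rather than on stable storage, "committed" is assumed to mean a job that finished with success, and "unknown-past" / "unknown-future" are assumed to mean "record already purged" and "number not yet issued".

```python
import itertools
from enum import Enum


class Status(Enum):
    WORKING = "working"
    SUCCESS = "success"
    FAILURE = "failure"
    ABORT = "abort"
    TIMEOUT = "timeout"
    UNKNOWN_PAST = "unknown-past"      # assumed: record existed but was purged
    UNKNOWN_FUTURE = "unknown-future"  # assumed: number has not been issued yet


class JobRegistry:
    """In-memory stand-in for a persistent job table (hypothetical)."""

    def __init__(self):
        self._ids = itertools.count(1)  # monotonic, so a number is never reused
        self._highest = 0
        self._records = {}

    def submit(self):
        job_id = next(self._ids)
        self._highest = job_id
        self._records[job_id] = Status.WORKING
        return job_id

    def finish(self, job_id, status):
        self._records[job_id] = status

    def purge(self, job_id):
        # Timed-out or cancelled jobs may be removed, but a committed
        # (here: successful) transaction must be kept.
        if self._records.get(job_id) is Status.SUCCESS:
            raise ValueError("committed transaction must be kept")
        self._records.pop(job_id, None)

    def status(self, job_id):
        if job_id in self._records:
            return self._records[job_id]
        if job_id > self._highest:
            return Status.UNKNOWN_FUTURE
        return Status.UNKNOWN_PAST
```

Because the counter only moves forward, purging a record never frees its number, so a late query for an old job cannot be confused with a newer one.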

16
Transaction Implementations
  • Logging
  • Keep a log of all actions, new and old values.
  • Read forward to redo, backwards to undo.
  • Shadowing
  • Add changed data to unallocated space.
  • Atomically commit new pointers to data.

[Figure: one block labeled M and seven blocks labeled D.]

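
The logging approach (keep old and new values, read forward to redo, backward to undo) can be sketched as follows. This is a minimal illustration using an in-memory dict and list; the function names are hypothetical, the log is not written to stable storage, and stored values are assumed never to be `None` so that `None` can stand for "key did not exist".

```python
def apply_with_log(store, log, key, new_value):
    """Record (key, old value, new value) in the log, then change the store."""
    log.append((key, store.get(key), new_value))
    store[key] = new_value


def redo(store, log):
    """Read the log forward, reapplying every new value."""
    for key, _old, new in log:
        store[key] = new


def undo(store, log):
    """Read the log backward, restoring every old value."""
    for key, old, _new in reversed(log):
        if old is None:
            store.pop(key, None)  # the key did not exist before this action
        else:
            store[key] = old
```

For example, after logging two updates to an LFN-to-PFN map, `undo` returns the map to its starting state and `redo` brings it back to the updated state.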
17
Transaction Implementations
  • If a standard file system is the underlying
    storage, then shadowing is a natural fit.
  • Most metadata updates are designed to be atomic
    and synchronous.
  • Most large data updates are designed to provide
    good throughput, but are asynchronous and not
    guaranteed until after an explicit commit.

18
Atomic File Update
    fd = creat(file.tmp)
    write(fd, data, length)
    fsync(fd)
    close(fd)
    rename(file.tmp, file)
  • On success: Done.
  • On failure or abort: unlink(file.tmp)
  • On reboot: unlink(*.tmp)
(Technique used on Condor checkpoint servers and
scheduler processes.)
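
The same write, sync, rename sequence can be sketched in Python. This is an illustration under assumptions, not the Condor code: the function names are hypothetical, the temporary file is assumed to live in the same directory as the target so that the rename stays within one file system, and `os.replace` stands in for `rename`.

```python
import glob
import os


def atomic_write(path, data):
    """Replace `path` with `data` (bytes); readers see the old or the new file."""
    tmp = path + ".tmp"
    try:
        with open(tmp, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # force the data to stable storage first
        os.replace(tmp, path)     # atomic rename: this is the commit point
    except BaseException:
        # On failure or abort: discard the partial temporary file.
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise


def cleanup_after_reboot(directory):
    """On reboot: remove temporaries left behind by interrupted updates."""
    removed = []
    for tmp in glob.glob(os.path.join(directory, "*.tmp")):
        os.unlink(tmp)
        removed.append(tmp)
    return removed
```

A crash before the rename leaves only a stray `.tmp` file, which the reboot pass removes; a crash after it leaves the complete new file.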
19
Unifying Storage Services
[Figure: App, POSIX, Virtual Operating System, and a row of drivers: UNIX Driver, SRB Driver, GridFTP Driver, NeST Driver, Kangaroo Driver, GASS Driver.]
An Alphabet Soup of Protocols, APIs, Systems,
Authorities, and Authors

20
Error Representation: A Problem of Depth
[Figure: the layers between an application and its storage. Labels: App, POSIX, Bypass Agent, ???, PPDG API, Win32, Replica Access Library, Replica Catalog, RAP, RMP (twice), Replica Server (twice), FTP, FTP Server, Disk Cache, Tape Archive.]

21
A Problem of Design Direction
[Figure: two software stacks contrasting "Bottom Up Design" with "Outside In Design". Labels: App (twice), ???, POSIX (twice), Application Library, Virtual OS, ANSI, PPDG API, Standard Library, Replica Access, SRB, OS Kernel, Replica Server.]

22
The End-to-End Argument
  • In complex software, the outermost layer has the
    ultimate responsibility for interpreting and
    recovering from errors.
  • Recovery in a lower layer is an optimization of
    performance or convenience.
  • If the possibility of error is very high,
    lower-level recovery is needed for good
    performance.

Saltzer, Reed, and Clark, "End-to-End Arguments in
System Design," ACM Transactions on Computer
Systems 2(4), pp. 277-288, 1984.

23
UNIX Errnos
  • A single namespace of integer errors that apply
    to all levels of the system.
  • Any call is free to return any possible error.
    (124)
  • General vs specific
  • ENOENT vs ECHILD
  • Some artifacts
  • EACCES vs EPERM
  • EADV and EDOTDOT

EPERM    1   /* Operation not permitted */
ENOENT   2   /* No such file or directory */
ESRCH    3   /* No such process */
EINTR    4   /* Interrupted system call */
EIO      5   /* I/O error */
ENXIO    6   /* No such device or address */
E2BIG    7   /* Arg list too long */
ENOEXEC  8   /* Exec format error */
EBADF    9   /* Bad file number */
ECHILD  10   /* No child processes */
EAGAIN  11   /* Try again */
ENOMEM  12   /* Out of memory */
EACCES  13   /* Permission denied */
...

24
FTP Reply Codes
  • Integer codes indicate the severity of a response
    to an action.
  • Many transfer problems are identified, but few
    file system problems are.
  • Third digit specified infrequently, and for wide
    classes of errors.

100 - Positive Preliminary
200 - Positive Completion
300 - Positive Intermediate
400 - Transient Negative
500 - Permanent Negative

000 - Syntax
010 - Information
020 - Connections
030 - Authentication
040 - Unspecified
050 - File System

550 - e.g. File not found, no access
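
The digit scheme in the table above can be decoded mechanically. The sketch below is illustrative: `decode_reply` and `should_retry` are hypothetical helper names, the labels are copied from the table, and treating every 4xx reply as retryable is an assumption based on the "Transient Negative" label.

```python
SEVERITY = {
    1: "Positive Preliminary",
    2: "Positive Completion",
    3: "Positive Intermediate",
    4: "Transient Negative",
    5: "Permanent Negative",
}

CATEGORY = {
    0: "Syntax",
    1: "Information",
    2: "Connections",
    3: "Authentication",
    4: "Unspecified",
    5: "File System",
}


def decode_reply(code):
    """Split a three-digit reply code into (severity, category, third digit)."""
    first, second, third = code // 100, (code // 10) % 10, code % 10
    return SEVERITY.get(first, "Unknown"), CATEGORY.get(second, "Unknown"), third


def should_retry(code):
    """Assumption: only transient negative replies are worth retrying."""
    return code // 100 == 4
```

For 550, this yields a permanent file system problem with third digit 0, which still does not say whether the file is missing or merely inaccessible.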
25
SRB Reply Codes
  • Error space is an amalgam of all back end error
    spaces.
  • Pros: No information is ever lost in translation.
  • Cons: Very difficult to write code that switches
    on the error number (1026 cases).

UNIX_EPERM             -1301
UNIX_ENOENT            -1302
...
UNIX_EDEADLOCK         -1356
HPSS_EPERM             -1401
HPSS_ENOENT            -1402
...
HPSS_NOCOS             -1499
SQL_RSLT_TOO_LONG      -1600
HTTP_ERR_BAD_PATH      -1700
MCAT_OPEN_ERROR        -3001
MCAT_CONNECT_ERROR     -3002
...
MCAT_USER_NOT_IN_DOMN  -3032
26
Globus Error Objects
  • Pros
  • Errors may be identified at varying levels of
    granularity.
  • Easily expandable.
  • Lots of debug info.
  • Cons
  • Can be difficult to decide in which class to
    place an external error.
  • In practice, most errors are returned as objects
    of type string.

[Figure: error class hierarchy. Labels: Error; Authentication, Authorization, Communication, String; No Creds, Expired Creds, No Trust.]
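
The idea of identifying errors at varying levels of granularity can be sketched with an ordinary exception hierarchy. This is plain Python, not the Globus error object API: the class names are hypothetical, and placing No Creds, Expired Creds, and No Trust under Authentication is an assumption about the figure.

```python
class GridError(Exception):
    """Root of the hierarchy ("Error" in the figure)."""


class AuthenticationError(GridError):
    pass


class NoCredentials(AuthenticationError):      # assumed parent
    pass


class ExpiredCredentials(AuthenticationError):  # assumed parent
    pass


class NoTrust(AuthenticationError):             # assumed parent
    pass


class AuthorizationError(GridError):
    pass


class CommunicationError(GridError):
    pass


class StringError(GridError):
    """Catch-all carrying only a message, the case the slide warns about."""


def handle(action):
    """Run `action`, reacting at whatever granularity the caller cares about."""
    try:
        action()
        return "ok"
    except ExpiredCredentials:
        return "renew credentials and retry"
    except AuthenticationError:
        return "ask the user to authenticate"
    except CommunicationError:
        return "retry later"
    except GridError:
        return "pass up one level"
```

A caller that gets only a `StringError` cannot make any of the targeted choices above, which is the practical cost of returning most errors as strings.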
27
Translation Can Be Done to a Point
[Figure: SRB error codes on one side are mapped onto a smaller set of UNIX errnos on the other.]

SRB codes:
UNIX_EPERM             -1301
UNIX_ENOENT            -1302
...
UNIX_EDEADLOCK         -1356
HPSS_EPERM             -1401
HPSS_ENOENT            -1402
...
HPSS_NOCOS             -1499
SQL_RSLT_TOO_LONG      -1600
HTTP_ERR_BAD_PATH      -1700
MCAT_OPEN_ERROR        -3001
MCAT_CONNECT_ERROR     -3002
...
MCAT_USER_NOT_IN_DOMN  -3032

UNIX errnos:
EPERM, ENOENT, ESRCH, EINTR, EIO, EACCES, EISDIR, OTHER
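
A translation layer of this kind might look like the sketch below. Only the four code pairs whose meaning is obvious from their names are mapped; everything else falls through to OTHER. The table, the `OTHER` sentinel value, and the `translate` function are hypothetical, not part of SRB.

```python
import errno

# SRB code -> UNIX errno, for the few cases where the name makes the
# correspondence unambiguous.
SRB_TO_ERRNO = {
    -1301: errno.EPERM,   # UNIX_EPERM
    -1302: errno.ENOENT,  # UNIX_ENOENT
    -1401: errno.EPERM,   # HPSS_EPERM
    -1402: errno.ENOENT,  # HPSS_ENOENT
}

OTHER = -1  # hypothetical sentinel for "no UNIX equivalent"


def translate(srb_code):
    """Map an SRB code to an errno, or OTHER when no mapping is known."""
    return SRB_TO_ERRNO.get(srb_code, OTHER)
```

The translation is lossy in both directions: UNIX_EPERM and HPSS_EPERM collapse into one value, and a code such as HPSS_NOCOS (-1499) has nowhere to go but OTHER.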
28
Grope in the Dark
    if GET succeeds
      return success
    else
      if CHDIR succeeds
        return EISDIR
      else
        if LIST succeeds
          return EACCES
        else
          return ENOENT
        end
      end
    end

[Figure: the probe sequence GET, CHDIR, LIST, ending in EACCES.]
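
The probing logic above can be written as a small function. In this sketch the three probes are caller-supplied callables that return True on success; they are hypothetical stand-ins for whatever issues the real GET, CHDIR, and LIST commands, and returning 0 for success is an assumption.

```python
import errno


def classify_get_failure(get, chdir, list_dir, path):
    """Infer an errno for a failed GET by probing with CHDIR and LIST."""
    if get(path):
        return 0               # success
    if chdir(path):
        return errno.EISDIR    # the name exists, but it is a directory
    if list_dir(path):
        return errno.EACCES    # the name is visible, but cannot be read
    return errno.ENOENT        # nothing by that name
```

Each extra probe costs a round trip, so the caller pays in latency for the detail the server did not report.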
29
Error Identification Is a Performance Concern
  • We can always find some way to produce an
    execution that avoids a silent failure.
  • Pass all errors up one level.
  • Retry all errors until time expires.
  • Abort process completely.
  • But a known, finite error space allows the caller
    to make targeted decisions about what to do next:
  • Not Authorized -- best to pass up one level.
  • Operation Interrupted -- best to retry here.
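
A targeted policy of this kind can be sketched with Python's `OSError`. The function name and the attempt limit are hypothetical; the only rules encoded are the two from the slide: retry an interrupted operation here, and pass everything else (including "not authorized") up one level.

```python
import errno

RETRY_HERE = {errno.EINTR}  # operation interrupted: best to retry here


def run_with_policy(operation, max_attempts=3):
    """Call `operation`, retrying locally only for errors in RETRY_HERE."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except OSError as exc:
            if exc.errno in RETRY_HERE and attempt + 1 < max_attempts:
                continue  # cheap local recovery
            raise         # e.g. EACCES: pass up one level
```

Without a known error space the function could only retry everything or nothing, which is the blunt behavior listed in the first three bullets.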

30
Give the Essence or Give the Details?
  • Example in file systems
  • "Fell off the end of the directory linked list."
  • or "No file by that name."
  • Example in networking
  • "Timer went off, but no network interrupt
    received."
  • or "Connection lost."
  • Example in security
  • "Failure in PEM_do_header while reading
    password."
  • or "You have no credentials."
  • Example in storage
  • "HPSS_NOCOS"
  • or ?????

31
Example and Discussion
32
Example
  • Goal
  • User requests replication of a file from B to A.
  • Data Structures at each Node
  • A persistent map from LFNs to PFNs.
  • A persistent store for transactions.
  • A persistent store for data.
  • Assumptions
  • Files are read-only, no need for invalidation.
  • All nodes must survive reboot cleanly.
  • File transfers may be resumed from any point.

33
[Figure: a client consults the Replica Catalog, which maps L1, L2, L3 to site B. Replica Site B maps L1 to P1, L2 to P2, and L3 to P3.]

34
[Figure: client and server at Replica Site A. An LFN-to-TRN table maps L2 to T53. Transaction records shown:]
T53.tmp   LFN: L2   PFN: P16   State: Working
T53.tmp   LFN: L2   PFN: P16   State: Working
T53.tmp   LFN: L2   PFN: P16   State: Done
P16       Physical Data File
T53       LFN: L2   PFN: P16   State: Working
T53       LFN: L2   PFN: P16   State: Done

35
More Issues
  • Cleanup at Reboot
  • Remove uncommitted transactions.
  • Jobs in progress: update LFN->TRN entry.
  • Client Status Check
  • Requesting client examines state of transaction.
  • Or, other clients indirect through LFN entry.
  • Notification of Status Change
  • Unreliable -- Server sends messages to client.
  • Reliable -- Server must do transaction to client.
  • (See Condor-G Paper)
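
The reboot cleanup step can be sketched as a scan over a transaction store. Everything about the layout is an assumption made for illustration: one JSON file per transaction, a `.tmp` suffix marking an uncommitted record, and `lfn`, `pfn`, and `state` as field names.

```python
import json
import os


def cleanup_at_reboot(txn_dir, lfn_to_trn):
    """Drop uncommitted records and re-link jobs that were still in progress.

    Returns the names of the in-progress transactions that were found.
    """
    in_progress = []
    for name in sorted(os.listdir(txn_dir)):
        full = os.path.join(txn_dir, name)
        if name.endswith(".tmp"):
            os.unlink(full)  # uncommitted transaction: remove it
            continue
        with open(full) as f:
            record = json.load(f)
        if record["state"] == "Working":
            lfn_to_trn[record["lfn"]] = name  # update the LFN->TRN entry
            in_progress.append(name)
    return in_progress
```

After this pass, another client that looks up L2 is directed to the in-progress transaction rather than to a physical file that may be incomplete.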