Checkpoint - PowerPoint PPT Presentation

Provided by: srik2
Learn more at: https://users.sdsc.edu

Transcript and Presenter's Notes

Title: Checkpoint


1
Checkpoint and Restart for Distributed Components in XCAT3
  • Sriram Krishnan
  • Indiana University, San Diego Supercomputer
    Center
  • Dennis Gannon
  • Indiana University
  • srikrish_at_cs.indiana.edu

2
Long-running Distributed Applications on the Grid
The Problem:
  1. Launch simulation at Y
  2. Launch simulation at Z
  3. Link both simulations
  4. Execute both simulations
  5. Store results at X
[Figure: sites X, Y, and Z on the Grid]
Need an effective way to orchestrate such computations
3
Checkpoint and Restart
  • Motivation
  • Basic fault tolerance via periodic checkpointing
  • Rollback to saved checkpoint upon failure
  • Dynamic rescheduling of jobs
  • Checkpoint and restart on another location
  • Checkpointing Goals
  • Correctness
  • Portability
  • Minimal checkpoint size
  • Scalability
  • Interoperability
  • Checkpoint Availability

4
Outline
  • Motivation
  • Background
  • The XCAT3 framework
  • Checkpoint and Restart
  • Checkpointing and Restart in XCAT3
  • Software Techniques
  • Algorithms
  • Experiments
  • Conclusions and Future work

5
Application Orchestration: Component Architectures
  • A Component Architecture consists of two parts
  • Components
  • Software objects that implement a set of required
    behaviors
  • Frameworks
  • A runtime environment
  • A set of services used by components
  • Benefits
  • Encapsulation, modular construction of programs
    (via composition), reuse
  • Component Architectures adopted in various
    domains
  • Business: EJB, CCM, COM/DCOM
  • Scientific Computing: CCA

6
Common Component Architecture
  • A ComponentID for identification and management
    purposes
  • Ports: the public interfaces of a component
  • Defines the different ways we can interact with a
    component and the ways the component uses other
    services and components.

[Figure: Image Processing Component, labeled with setImage(Image I), Image getImage(), adjustColor(), setFilter(Filter), and "calls doFFT()"]
Uses Ports: interfaces of services used by the component
Provides Ports: interfaces of functions provided by the component
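The split between Provides and Uses ports can be illustrated with a minimal plain-Java sketch. It does not use the CCA or XCAT3 APIs: FftPort, ImageProcessingComponent, and connectFft are hypothetical names chosen to mirror the figure, and the "FFT" provider is a stand-in that only doubles each value.

```java
import java.util.Arrays;

// Hypothetical port type: a service this component needs from some other component.
interface FftPort {
    double[] doFFT(double[] samples);
}

// Hypothetical component, loosely modeled on the figure (not XCAT3 code).
class ImageProcessingComponent {
    private FftPort fft;                     // Uses port: empty until a connection is made
    private double[] image = new double[0];

    // Provides-port methods: what other components may call on this one.
    void setImage(double[] pixels) { image = pixels.clone(); }
    double[] getImage() { return image.clone(); }

    // Composition step: connect the Uses port to a provider of the same port type.
    void connectFft(FftPort provider) { fft = provider; }

    // A provided operation that in turn calls out through the Uses port.
    void adjustColor() {
        if (fft == null) {
            throw new IllegalStateException("Uses port FftPort is not connected");
        }
        image = fft.doFFT(image);
    }
}

class PortsDemo {
    // Stand-in provider: doubles every sample (NOT a real FFT).
    static double[] doubling(double[] samples) {
        double[] out = new double[samples.length];
        for (int i = 0; i < samples.length; i++) {
            out[i] = 2 * samples[i];
        }
        return out;
    }

    public static void main(String[] args) {
        ImageProcessingComponent component = new ImageProcessingComponent();
        component.connectFft(PortsDemo::doubling);
        component.setImage(new double[] {1.0, 2.0});
        component.adjustColor();
        System.out.println(Arrays.toString(component.getImage())); // prints [2.0, 4.0]
    }
}
```

Because the component only sees the FftPort interface, the provider can be swapped at connection time without touching the component itself.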
7
XCAT3: CCA Framework for the Grid
  • Grid Service Extensions (GSX) Toolkit used for
    OGSI Compatible Grid services
  • Standard protocols used by Grid services: SOAP,
    HTTP
  • http://www.extreme.indiana.edu/xgws/GSX
  • A Component is represented as a set of Grid
    services
  • Provides ports, ComponentIDs are Grid services
  • Uses ports are Grid service clients
  • Sriram Krishnan and Dennis Gannon. XCAT3: A
    Framework for CCA Components as OGSA Services. In
    HIPS 2004, 9th International Workshop on
    High-Level Parallel Programming Models and
    Supportive Environments. April 2004.

8
Checkpointing: Software Techniques
  • System-level Techniques
  • Automatic transparent checkpointing for an
    application at the operating system or middleware
    level
  • User-defined Techniques
  • Non-transparent checkpointing for an application
    that relies on the programmer to identify the
    minimal information needed for restart

9
Checkpointing: Software Techniques
System-Level
  • Transparent to the user: no expertise required
  • Not very portable across platforms
  • Larger checkpoint sizes: typically complete
    process images are stored
  • Less flexible: application is treated as a black
    box
User-defined
  • Not transparent to the user: considerable
    expertise required
  • More portable across platforms
  • Smaller checkpoint sizes: only minimal state is
    stored
  • More flexible: application information can be used

10
Checkpointing: Examples
  • System-level Techniques
  • Condor
  • LAM-MPI
  • Enterprise Java Beans
  • CORBA Components
  • User-defined Techniques
  • CUMULVS
  • Enterprise Java Beans
  • CORBA Components
  • Global Grid Forum Grid Checkpoint/Recovery Group
  • User-defined checkpointing APIs for Grid services
  • Do not address consistent global checkpoints for
    distributed applications
  • Consistent global checkpoint: a set of individual
    checkpoints that constitute a state that occurs in
    a failure-free, correct execution

11
Checkpointing Technique in XCAT3
  • User-defined and system-assisted
  • User is responsible for identifying local
    component state
  • Framework is responsible for
  • Generating complete state of the component, viz.
    local component state, connection state, and
    environment state
  • Algorithms for generating global component
    states, and storing them into stable storage
  • Component writer implements the following
    methods
  • generateComponentState()
  • loadComponentState()
  • resumeExecution()
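As a rough illustration of what a component writer might supply, here is a minimal Java sketch. The slides give only the three method names; the signatures below (a byte[] for the state, no arguments for resumeExecution) are assumptions, and Checkpointable and SumWorker are hypothetical names rather than XCAT3 types.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical interface; the method names come from the slides, the signatures are assumed.
interface Checkpointable {
    byte[] generateComponentState();
    void loadComponentState(byte[] state);
    void resumeExecution();
}

// Toy worker that sums 1..limit. Its minimal restart state is (next, sum).
class SumWorker implements Checkpointable {
    private final int limit;
    private int next = 1;
    private long sum = 0;

    SumWorker(int limit) { this.limit = limit; }

    // One unit of work.
    void step() {
        if (next <= limit) {
            sum += next;
            next++;
        }
    }

    long sum() { return sum; }

    // Save only what is needed to continue, not the whole process image.
    public byte[] generateComponentState() {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeInt(next);
            out.writeLong(sum);
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Restore the fields in the same order they were written.
    public void loadComponentState(byte[] state) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(state));
            next = in.readInt();
            sum = in.readLong();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Continue from the restored position until the work is done.
    public void resumeExecution() {
        while (next <= limit) {
            step();
        }
    }
}
```

A worker stopped after four steps can be checkpointed, and a fresh SumWorker that loads that state and resumes reaches the same total as an uninterrupted run.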

12
Distributed Checkpointing
  • Algorithm Overview: coordinated blocking
    checkpoint algorithm
  • Block all port communication between components
  • Take individual checkpoints, and commit them
    atomically
  • Resume port communication between components
  • Novelty: application to an RPC-based component
    framework
  • Typically, such algorithms are applied to
    messaging frameworks
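The slides do not say how port communication is blocked in an RPC setting. One possible approach, sketched below in Java, is a gate built on a read-write lock: every port call holds the read lock, and blocking takes the write lock, which waits for in-flight calls to finish and holds back new ones. PortGate and PortGateDemo are hypothetical names, not part of XCAT3.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;
import java.util.function.Supplier;

// Hypothetical gate placed in front of a component's port calls.
class PortGate {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock(true);

    // Every remote call passes through here; many calls may run concurrently.
    <T> T invoke(Supplier<T> call) {
        lock.readLock().lock();
        try {
            return call.get();
        } finally {
            lock.readLock().unlock();
        }
    }

    // Returns once in-flight calls have drained; later calls wait until unblock().
    void block() { lock.writeLock().lock(); }

    // Must be called by the same thread that called block().
    void unblock() { lock.writeLock().unlock(); }

    boolean isBlocked() { return lock.isWriteLocked(); }
}

class PortGateDemo {
    // Issues one call from a separate thread and reports whether it
    // completed within waitMillis (which should be greater than zero).
    static boolean callFinishesWithin(PortGate gate, long waitMillis) {
        Thread caller = new Thread(() -> gate.invoke(() -> "reply"));
        caller.setDaemon(true); // do not keep the JVM alive if the gate stays blocked
        caller.start();
        try {
            caller.join(waitMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return !caller.isAlive();
    }
}
```

With this design a call that arrives during a checkpoint is delayed rather than rejected, so callers need no retry logic, at the cost of stalling them for the duration of the checkpoint.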

13
The Big Picture
[Figure: distributed components on the Grid, the Application Coordinator, and Persistent Storage formed by a federation of Master (MS) and Individual Storage (IS) services]
14
Checkpoint Algorithm
[Figure, built up over slides 14 to 21: Application Coordinator, components, and Persistent Storage]
  • Slide 14: Checkpoint components
  • Slide 15: Block all port communication between components
  • Slide 16: All communication between components blocked
  • Slide 17: Find best available Storage service URLs
  • Slide 18: Store checkpoints into Storage services
  • Slide 19: Return storageIDs for stored state
  • Slide 20: Atomically update locators for individual
    checkpoints
  • Slide 21: Un-block communication between components
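The sequence of steps on these slides can be sketched in Java as follows. This is a single-process simplification under stated assumptions: one in-memory store stands in for the federation of Storage services (so the "find best available Storage service" step is omitted), and ManagedComponent, InMemoryStorage, CheckpointCoordinator, and CounterComponent are hypothetical names.

```java
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical view of a component as seen by the coordinator.
interface ManagedComponent {
    String id();
    void blockPorts();
    void unblockPorts();
    byte[] generateComponentState();
}

// Stand-in for a Storage service: returns a storage ID for each stored state.
class InMemoryStorage {
    private final Map<String, byte[]> blobs = new HashMap<>();
    private int counter = 0;

    String store(byte[] state) {
        String storageId = "storage-" + counter++;
        blobs.put(storageId, state.clone());
        return storageId;
    }

    byte[] load(String storageId) { return blobs.get(storageId).clone(); }
}

class CheckpointCoordinator {
    private final InMemoryStorage storage;
    // componentId -> storageId for the latest complete global checkpoint
    private volatile Map<String, String> locators = Collections.emptyMap();

    CheckpointCoordinator(InMemoryStorage storage) { this.storage = storage; }

    Map<String, String> checkpoint(List<? extends ManagedComponent> components) {
        // 1. Block all port communication between components.
        for (ManagedComponent c : components) c.blockPorts();
        try {
            // 2. Store each individual checkpoint and collect its storage ID.
            Map<String, String> fresh = new HashMap<>();
            for (ManagedComponent c : components) {
                fresh.put(c.id(), storage.store(c.generateComponentState()));
            }
            // 3. Publish all locators at once by swapping a single reference.
            locators = Collections.unmodifiableMap(fresh);
            return locators;
        } finally {
            // 4. Un-block communication, even if a checkpoint failed.
            for (ManagedComponent c : components) c.unblockPorts();
        }
    }

    Map<String, String> latestLocators() { return locators; }
}

// Toy component whose whole state is one integer.
class CounterComponent implements ManagedComponent {
    private final String id;
    final int value;
    boolean blocked;

    CounterComponent(String id, int value) { this.id = id; this.value = value; }

    public String id() { return id; }
    public void blockPorts() { blocked = true; }
    public void unblockPorts() { blocked = false; }
    public byte[] generateComponentState() {
        return Integer.toString(value).getBytes(StandardCharsets.UTF_8);
    }
}
```

If any component throws while generating its state, the locator map is never swapped, so the previously published global checkpoint stays in effect.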
22
Checkpointing Correctness
  • Consistency of Global Checkpoint
  • A flavor of coordinated blocking algorithms,
    which are well accepted to be correct
  • Atomicity of Checkpoints
  • Locators for the global checkpoint are updated
    atomically after all components have been
    checkpointed
  • Not possible to have a scenario where a global
    checkpoint consists of a combination of old and
    new individual checkpoints
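The atomicity argument can be made concrete with a small Java sketch. It assumes the locators are held as one immutable map behind a single reference, which is an illustrative choice and not necessarily how XCAT3 stores them; LocatorTable is a hypothetical name.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical table mapping componentId -> storageId for the current global checkpoint.
class LocatorTable {
    private final AtomicReference<Map<String, String>> current =
            new AtomicReference<>(Collections.<String, String>emptyMap());

    // Publishes newLocators only if it covers exactly the expected components.
    // Readers see either the old map or the new one, never a mixture.
    boolean commit(Set<String> expectedComponents, Map<String, String> newLocators) {
        if (!newLocators.keySet().equals(expectedComponents)) {
            return false; // incomplete checkpoint: keep the previous one
        }
        current.set(Collections.unmodifiableMap(new HashMap<>(newLocators)));
        return true;
    }

    // The locators of the most recently committed global checkpoint.
    Map<String, String> snapshot() { return current.get(); }
}
```

Since the map is copied and replaced as a whole, a failure before commit leaves every locator pointing at the previous round of individual checkpoints.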

23
Restart Algorithm
  • Also implemented by the Application Coordinator
  • Details
  • Destroy executing instances, if need be
  • Restart all components (possibly on other
    resources)
  • Load state of components from the Storage
    services
  • Resume execution of all control threads, after
    the states of every component have been loaded
    from the Storage services
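The two-phase shape of the restart (load everything first, then resume) can be sketched in Java as below. The sketch omits destroying the old instances and takes the saved states as an in-memory map instead of fetching them from Storage services; Restartable, RestartCoordinator, and LoggingComponent are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical restart-side view of a component; signatures are assumed.
interface Restartable {
    void loadComponentState(byte[] state);
    void resumeExecution();
}

class RestartCoordinator {
    // savedStates: componentId -> checkpointed state.
    // factory: creates a fresh instance for an ID (possibly on another resource).
    static Map<String, Restartable> restart(Map<String, byte[]> savedStates,
                                            Function<String, Restartable> factory) {
        Map<String, Restartable> instances = new LinkedHashMap<>();
        // Phase 1: re-create every component and load its saved state.
        for (Map.Entry<String, byte[]> entry : savedStates.entrySet()) {
            Restartable fresh = factory.apply(entry.getKey());
            fresh.loadComponentState(entry.getValue());
            instances.put(entry.getKey(), fresh);
        }
        // Phase 2: resume only after every component has loaded its state.
        for (Restartable component : instances.values()) {
            component.resumeExecution();
        }
        return instances;
    }
}

// Toy component that records the order in which the coordinator calls it.
class LoggingComponent implements Restartable {
    private final String id;
    private final List<String> log;

    LoggingComponent(String id, List<String> log) {
        this.id = id;
        this.log = log;
    }

    public void loadComponentState(byte[] state) { log.add("load:" + id); }
    public void resumeExecution() { log.add("resume:" + id); }
}
```

Keeping the phases separate means no component starts issuing port calls to a peer whose state has not been restored yet.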

24
Test Application: Chem-Eng Simulation
  • Based on the simulation of copper
    electro-deposition on resistive substrate
    (NCSA-UIUC)
  • Master-Worker model of execution
  • Variable number of workers, and data size per
    worker
  • generateComponentState(), loadComponentState(),
    and resumeExecution() methods added to support
    checkpointing and restart
  • Required identification of the various execution
    states of the master and worker components

25
Experiment Setup
  • Hardware setup
  • 8-node Linux cluster
  • Dual 2.8 GHz Intel Xeon processors
  • Red Hat Linux 8.0
  • 2 GB memory
  • 1 Gbps Ethernet
  • Sun's JDK 1.4.2_04
  • Federation of 1 Master and 8 Individual Storage
    services used
  • Single GSX-based Handle Resolver

26
Checkpointing Master Processing
27
Checkpointing Workers Processing
28
Future Work
  • Framework
  • Integration with the Web Service Resource
    Framework (WSRF)
  • Fault Tolerance
  • Fault Monitoring
  • Reliable communication between components
  • Checkpoint Optimizations
  • Storage Service Optimizations
  • Applications
  • Use of XCAT3 for LEAD (http://lead.ou.edu)

29
Conclusions
  • A framework for checkpointing and restart of
    distributed applications on the Grid
  • CCA-based component framework consistent with
    Grid standards
  • User-defined, platform-independent checkpoints
  • APIs for checkpointing, and algorithms for
    capturing global checkpoints and for restart
    provided by the framework
  • http://www.extreme.indiana.edu/xcat/

30
Appendix
31
OGSI Compatibility
  • Representation for Provides ports
  • In traditional Grid/Web services, multiple ports
    of the same portType are semantically equivalent
  • CCA allows multiple ports of the same type
  • CCA ports cannot be mapped to Web service ports!
  • Hence, every Provides port is mapped as a
    separate Grid service
  • A single portType containing the Provides port
    interface
  • Representation for Uses ports
  • Clients of Grid services (Provides ports)
  • Connections to Provides ports made at runtime

32
OGSI Compatibility
  • Representation for the ComponentID
  • Also a Grid service
  • Acts as a Manager for the other Provides ports
  • Contains SDEs containing GSH/GSRs for the various
    Provides ports
  • The Provides ports and ComponentID services, and
    the Uses ports communicate via shared state

33
Building Applications by Composition
  • Connect Uses Ports to Provides Ports.

[Figure: Image database component, Image Processing Component, Acme FFT component, and Image tool graphical interface component, connected through setImage(), getImage(), doFFT(), and adjustColor()]
34
Restart Algorithm
35
Test Application: Chem-Eng Simulation