Ewa Deelman, deelman_at_isi.edu, www.isi.edu/deelman, http://pegasus.isi.edu
1
Workflow Technologies in Support of Science
  • Ewa Deelman
  • University of Southern California
  • Information Sciences Institute

2
Motivation
  • Codes are being developed by many individuals
  • Finding the right code for the job can be
    difficult
  • Understanding how to invoke somebody else's code can be challenging
  • An analysis can be composed of a sequence of computations
  • Data can exist in different formats
  • The process of analysis definition can be tedious
    and labor intensive
  • Need to save the output of one computation and use it to invoke the next computation
  • The execution of the analysis can be time
    consuming and error prone
  • The codes fail, networks go down, computers crash
  • Finding resources to conduct the computations can
    be difficult
  • A particular computer does not have enough
    memory, disk space, etc.

3
Generating mosaics of the sky (Bruce Berriman,
Caltech)
The full moon is 0.5 deg. sq. when viewed from Earth; the full sky is 400,000 deg. sq.
4
Specification: Place Y = F(x) at L
  • Find where x is --- S1, S2, ...
  • Find where F can be computed --- C1, C2, ...
  • Choose c and s subject to constraints (performance, space availability, ...)
  • Move x from s to c
  • Move F to c
  • Compute F(x) at c
  • Move Y from c to L
  • Register Y in data registry
  • Record provenance of Y, performance of F(x) at c

Error! x was not at s!
Error! F(x) failed!
Error! c crashed!
Error! there is not enough space at L!
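The steps above, and the failure points flagged on the slide, can be sketched as runnable pseudocode. This is a toy model only: the catalogs are plain dicts and every name here is illustrative, not a Pegasus API.

```python
# Toy sketch of "Place Y = F(x) at L"; all names are illustrative.
def place(f, x_name, target, replicas, site_data, registry, provenance):
    # Find where x is --- S1, S2, ...
    sources = replicas[x_name]
    # Find where F can be computed --- C1, C2, ... (any known site here)
    sites = list(site_data)
    # Choose s and c subject to constraints (first available, in this sketch)
    s, c = sources[0], sites[0]
    # Move x from s to c and move F to c (modeled as marking it staged)
    site_data[c][x_name] = f"staged from {s}"
    # Compute F(x) at c
    y = f(x_name)
    # Move Y from c to the target location L, register it in the data
    # registry, and record provenance of Y and where it was computed
    registry[y] = [target]
    provenance.append({"output": y, "site": c})
    return y

replicas = {"f.input": ["S1", "S2"]}   # where x can be found
site_data = {"C1": {}, "C2": {}}       # candidate compute sites
registry, provenance = {}, []
y = place(lambda n: n + ".out", "f.input", "L",
          replicas, site_data, registry, provenance)
```

Each step in the real system can fail in the ways listed above (data missing at s, F(x) failing, c crashing, no space at L), which is what motivates the automation that follows.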
5
(Diagram) Pegasus WMS: takes an abstract workflow description (devoid of resource bindings, portable across resources, with logical names for data and codes), produces tasks, records provenance, performance, and monitoring information, and delivers results to a user-specified location.
6
Pegasus workflow system
  • Allows scientists to design an analysis at a high level without worrying about how to invoke or execute it
  • Automatically finds compute resources and data
    needed for the computation
  • Automatically executes computations on
    computational resources available to the
    community or individual
  • When failures occur, it tries to recover from
    them using a variety of mechanisms
  • Records provenance --- provides information about how the results were obtained, which codes were invoked, what parameters were used, and what input data was used in the processing

7
Pegasus Workflow Management System
  • A client tool with no special requirements on the infrastructure

Abstract Workflow (input)
A reliable, scalable workflow management system that an application or workflow composition service can depend on to get the job done:
Pegasus mapper: a decision system that develops strategies for reliable and efficient execution in a variety of environments
DAGMan: reliable and scalable execution of dependent tasks
Condor Schedd: reliable, scalable execution of independent tasks (locally, across the network), priorities, scheduling
Cyberinfrastructure: local machine, cluster, Condor pool, OSG, TeraGrid
8
Basic Workflow Mapping
  • Select where to run the computations
  • Change task nodes into nodes with executable
    descriptions
  • Execution location
  • Environment variables initialized
  • Appropriate command-line parameters set
  • Select which data to access
  • Add stage-in nodes to move data to computations
  • Add stage-out nodes to transfer data out of
    remote sites to storage
  • Add data transfer nodes between computation nodes
    that execute on different resources
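The stage-in, stage-out, and inter-site transfer steps above can be sketched as graph surgery on a task DAG. The data structures below are illustrative, not Pegasus internals.

```python
# Toy sketch of basic workflow mapping: add stage-in, stage-out, and
# inter-site transfer nodes to a DAG of site-assigned tasks.
def add_data_nodes(tasks, edges, site_of):
    """tasks: iterable of task names; edges: list of (parent, child);
    site_of: {task: execution site}. Returns (nodes, edges)."""
    nodes = {t: {"type": "compute", "site": site_of[t]} for t in tasks}
    new_edges = list(edges)
    for t in tasks:
        # Stage-in node: move input data to the chosen compute site
        nodes[f"stage_in_{t}"] = {"type": "transfer", "to": site_of[t]}
        new_edges.append((f"stage_in_{t}", t))
        # Stage-out node: move outputs from the remote site to storage
        nodes[f"stage_out_{t}"] = {"type": "transfer", "from": site_of[t]}
        new_edges.append((t, f"stage_out_{t}"))
    # Transfer nodes between dependent tasks on different resources
    for parent, child in edges:
        if site_of[parent] != site_of[child]:
            tx = f"transfer_{parent}_{child}"
            nodes[tx] = {"type": "transfer",
                         "from": site_of[parent], "to": site_of[child]}
            new_edges.remove((parent, child))
            new_edges += [(parent, tx), (tx, child)]
    return nodes, new_edges

nodes, edges = add_data_nodes(["A", "B"], [("A", "B")],
                              {"A": "siteX", "B": "siteY"})
```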

9
Basic Workflow Mapping
  • Add nodes to create an execution directory on a
    remote site
  • Add nodes that register the newly-created data
    products
  • Add data cleanup nodes to remove data from remote
    sites when no longer needed
  • reduces workflow data footprint
  • Provide provenance capture steps
  • Information about source of data, executables
    invoked, environment variables, parameters,
    machines used, performance

10
Pegasus Workflow Mapping
(Figure) Original workflow: 15 compute nodes, devoid of resource assignment, mapped by Pegasus to an executable workflow of 60 tasks.
11
Catalogs used for discovery
  • To execute on the grid, Pegasus needs to discover:
  • Data (the input data that is required by the workflows)
  • Can use project-specific capabilities
  • Executables (are there any application executables installed beforehand?)
  • Can be configured by one person and shared
  • Site layout (what services are running on a system, for example)
  • Can be built automatically using information
    services or by hand

12
How to make Pegasus Work
Metadata Catalog
Workflow generator
Analysis parameters
Workflow (DAX)
tasks
Pegasus
Transformation Catalog
Condor or Globus
Site Catalog
Computing resource
Replica Catalog (optional)
Properties (for Pegasus Behavior)
13
Pegasus DAX
  <!-- part 1: list of all files used (may be empty) -->
  <filename file="f.input" link="input"/>
  <filename file="f.intermediate" link="input"/>
  <filename file="f.output" link="output"/>
  <filename file="keg" link="input"/>
  <!-- part 2: definition of all jobs (at least one) -->
  <job id="ID000001" namespace="pegasus" name="preprocess" version="1.0">
    <argument>-a top -T 6 -i <filename file="f.input"/> -o <filename file="f.intermediate"/></argument>
    <uses file="f.input" link="input" register="false" transfer="true"/>
    <uses file="f.intermediate" link="output" register="false" transfer="false"/>
    <!-- specify any extra executables the job needs. Optional -->
    <uses file="keg" link="input" register="false" transfer="true" type="executable"/>
  </job>
  <job id="ID000002" namespace="pegasus" name="analyze" version="1.0">
    <argument>-a top -T 6 -i <filename file="f.intermediate"/> -o <filename file="f.output"/></argument>
    <uses file="f.intermediate" link="input" register="false" transfer="true"/>
    <uses file="f.output" link="output" register="true" transfer="true"/>
  </job>
  <!-- part 3: list of control-flow dependencies (empty for single jobs) -->
  <child ref="ID000002">
    <parent ref="ID000001"/>
  </child>
(excerpted for display)
  • Resource-independent
  • Portable across platforms

14
How to generate a DAX
  • Write the XML directly
  • Use the Pegasus Java API
  • In the works: Python and Perl APIs
  • Looking at visual composition
  • You can add flags in the DAX to save and/or
    register intermediate data products.
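Since the Python API is still in the works, the first option above is to write the XML directly. A minimal sketch with the standard library, following the element and attribute names from the slide 13 excerpt (the `adag` root element is the DAX document root):

```python
# Generate a fragment of a DAX by writing the XML directly.
import xml.etree.ElementTree as ET

adag = ET.Element("adag")
job = ET.SubElement(adag, "job", id="ID000001", namespace="pegasus",
                    name="preprocess", version="1.0")
arg = ET.SubElement(job, "argument")
arg.text = "-a top -T 6 -i "
ET.SubElement(arg, "filename", file="f.input")
# transfer/register flags control saving and registering data products
ET.SubElement(job, "uses", file="f.input", link="input",
              register="false", transfer="true")
ET.SubElement(job, "uses", file="f.intermediate", link="output",
              register="false", transfer="false")
dax = ET.tostring(adag, encoding="unicode")
```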

15
Discovery of Execution Site Layout
  • Pegasus queries a site catalog to discover site
    layout
  • Installed job-managers for different types of
    schedulers
  • Installed GridFTP servers
  • Local Replica Catalogs where data residing at that site has to be catalogued
  • Site-wide profiles like environment variables
  • Work and storage directories

This catalog will need to be updated as you add
new execution environments.
16
Discovery of Executables
  • Transformation Catalog maps logical
    transformations to their physical locations
  • Used to
  • discover application codes installed on the grid
    sites
  • discover statically compiled codes that can be deployed at grid sites on demand

As new versions of the code are developed, this
catalog needs to be updated.
How to: a single client, tc-client, interfaces with all types of transformation catalogs
17
Discovery of Data
  • Replica Catalog stores mappings between logical
    files and their target locations.
  • Globus RLS
  • discover input files for the workflow
  • track data products created
  • data reuse
  • Pegasus also interfaces with a variety of replica
    catalogs
  • File based Replica Catalog
  • useful for small datasets (like this tutorial)
  • cannot be shared across users.
  • Database based Replica Catalog
  • useful for medium sized datasets.
  • can be used across users.
  • Project-specific systems

How to: a single client, rc-client, interfaces with all types of replica catalogs
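To make the file-based option concrete, here is a sketch of parsing lines that map a logical file name (LFN) to physical locations. The line format shown is illustrative; consult the rc-client documentation for the exact format Pegasus expects.

```python
# Sketch of a file-based replica catalog lookup (illustrative format).
def parse_replica_catalog(text):
    """Parse 'LFN PFN' lines into logical name -> list of locations."""
    catalog = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        lfn, pfn = line.split()[:2]
        catalog.setdefault(lfn, []).append(pfn)
    return catalog

rc = parse_replica_catalog("""
# logical-name  physical-location
f.input  gsiftp://host1.example.org/data/f.input
f.input  gsiftp://host2.example.org/data/f.input
""")
```

Because one logical name can map to several physical replicas, the mapper can do replica selection among them, and a database-backed catalog can hold the same mapping for sharing across users.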
18
Optimizations during Mapping
  • Fully automated optimizations
  • Data reuse in case intermediate data products are
    available
  • Performance and reliability advantages: workflow-level checkpointing
  • Data cleanup nodes can reduce workflow data footprint
  • by 50% for Montage; applications such as LIGO need restructuring
  • Partially automated optimizations
  • Node clustering for fine-grained computations
  • Can obtain significant performance benefits for some applications (in Montage 80%, SCEC 50%)
  • Workflow partitioning to adapt to changes in the
    environment
  • Map and execute small portions of the workflow at
    a time
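The data-reuse optimization can be sketched as pruning: drop any task whose outputs are already registered in the replica catalog, and stop walking up through its ancestors, since their results can be reused. This is a toy illustration, not the actual Pegasus reduction algorithm.

```python
# Toy sketch of workflow reduction (data reuse).
def reduce_workflow(tasks, available):
    """tasks: {name: {"outputs": [...], "parents": [...]}};
    available: set of files already in the replica catalog.
    Returns the set of tasks that still need to run."""
    needed = set()

    def visit(name):
        t = tasks[name]
        if all(o in available for o in t["outputs"]):
            return  # outputs can be reused; skip this task and its parents
        needed.add(name)
        for p in t["parents"]:
            visit(p)

    # Start from the sinks (tasks nobody depends on) and walk upward
    sinks = [n for n in tasks
             if not any(n in t["parents"] for t in tasks.values())]
    for sink in sinks:
        visit(sink)
    return needed

tasks = {"A": {"outputs": ["f.a"], "parents": []},
         "B": {"outputs": ["f.b"], "parents": ["A"]},
         "C": {"outputs": ["f.c"], "parents": ["B"]}}
kept = reduce_workflow(tasks, {"f.a", "f.b"})  # A and B are pruned
```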

19
Reliability Features of Pegasus and DAGMan
  • Provides workflow-level checkpointing through
    data re-use
  • Allows for automatic re-tries of
  • task execution
  • overall workflow execution
  • workflow mapping
  • Tries alternative data sources for staging data
  • Provides a rescue-DAG when all else fails
  • Clustering techniques can reduce some of the failures
  • Reduces load on CI services
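The retry behavior above (automatic re-tries plus alternative data sources, with a rescue path when all else fails) can be sketched as a wrapper. Parameter names and the retry policy are illustrative, not DAGMan configuration.

```python
# Minimal sketch of retry-with-alternative-sources (illustrative only).
def run_with_retries(task, sources, retries=3):
    """Try task(source) against each data source, up to `retries`
    attempts per source; raise if everything fails (rescue-DAG case)."""
    last_error = None
    for source in sources:            # alternative data sources for staging
        for _ in range(retries):      # automatic re-tries of task execution
            try:
                return task(source)
            except Exception as err:
                last_error = err
    raise RuntimeError("all retries failed; rescue DAG needed") from last_error

attempts = {"n": 0}

def flaky(source):
    """A task that fails twice before succeeding."""
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise IOError("staging failed")
    return source

result = run_with_retries(flaky, ["S1", "S2"], retries=3)
```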

20
Pegasus Applications: LIGO
Support for LIGO on Open Science Grid
LIGO workflows: 185,000 nodes, 466,000 edges; 10 TB of input data, 1 TB of output data
LIGO collaborators: Kent Blackburn, Duncan Brown, Britta Daubert, Scott Koranda, Stephen Fairhurst, and others
21
SCEC (Southern California Earthquake Center)
SCEC CyberShake workflows run using Pegasus-WMS
on the TeraGrid and USC resources
Cumulatively, the workflows consisted of over
half a million tasks and used over 2.5 CPU Years.
The largest CyberShake workflow contained on
the order of 100,000 nodes and accessed 10TB of
data
SCEC collaborators: Scott Callahan, Robert Graves, Gideon Juve, Philip Maechling, David Meyers, David Okaya, Mona Wong-Barnum
22
National Virtual Observatory and Montage
NVO's Montage mosaic application: transformed a single-processor code into a workflow and parallelized computations to process larger-scale images
  • Pegasus mapped a workflow of 4,500 nodes onto NSF's TeraGrid
  • Pegasus improved runtime by 90% through automatic workflow restructuring and minimizing execution overhead
  • Montage is a collaboration between IPAC, JPL and
    CACR

23
Portal Interfaces for Pegasus workflows
SCEC
Gridsphere-based portal for workflow monitoring
24
Ensemble Manager
  • Ensemble: a set of workflows
  • Command-line interfaces to submit, start, monitor
    ensembles and their elements
  • The state of the workflows and ensembles is
    stored in a DB
  • Priorities can be given to workflows and
    ensembles
  • Future work
  • Kill
  • Suspend
  • Restart
  • Web-based interface

25
What does Pegasus do for an application?
  • Provides a Grid-aware workflow management tool
  • Interfaces with data registries to discover data
  • Performs replica selection
  • Manages data transfer by interfacing to various transfer services: GridFTP, HTTP, others
  • No need to stage in data beforehand; we do it within the workflow as and when it is required
  • Reduces storage footprint; data is cleaned up as the workflow progresses
  • Improves successful application execution
  • Improves application performance
  • Data Reuse
  • Avoids duplicate computations
  • Can reuse data that has been generated earlier.

26
Relevant Links
  • Pegasus: http://pegasus.isi.edu
  • Distributed as part of the VDT (http://vdt.cs.wisc.edu/)
  • Can be downloaded directly from http://pegasus.isi.edu/code.php
  • A pure Pegasus+DAGMan release coming soon
  • Interested in trying out Pegasus?
  • Do the tutorial: http://pegasus.isi.edu/tutorial/mardi08/index.php
  • Send email to pegasus_at_isi.edu to do the tutorial on the ISI cluster
  • Quickstart Guide: available at http://pegasus.isi.edu/doc.php
  • More detailed documentation appearing soon
  • Support lists: pegasus-support_at_mailman.isi.edu

27
Acknowledgments
  • Pegasus: Gaurang Mehta, Mei-Hui Su, Karan Vahi
  • DAGMan: Miron Livny, Kent Wenger, and the Condor team
  • LIGO: Kent Blackburn, Duncan Brown, Stephen Fairhurst, David Meyers
  • Montage: Bruce Berriman, John Good, Dan Katz, and Joe Jacobs
  • SCEC: Tom Jordan, Robert Graves, Phil Maechling, David Okaya, Li Zhao
  • Other collaborators: Yolanda Gil, Jihie Kim, Varun Ratnakar (Wings System)

28
Pegasus optimizations in detail
29
Workflow Reduction (Data Reuse)
How to: to trigger workflow reduction, the files need to be cataloged in the replica catalog at runtime. The registration flags for these files need to be set in the DAX.
30
Job clustering
Level-based clustering
Arbitrary clustering
Vertical clustering
Useful for small granularity jobs
How to: to turn job clustering on, pass --cluster to pegasus-plan
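Level-based clustering, the first variant above, can be sketched as grouping tasks by their depth in the DAG and merging each level into cluster jobs of bounded size. This is a toy illustration of the idea; pegasus-plan --cluster does this for real workflows.

```python
# Toy sketch of level-based job clustering for small-granularity jobs.
from collections import defaultdict

def level_cluster(parents, k=2):
    """parents: {task: [parent tasks]}. Returns {level: [clusters]},
    where each cluster holds at most k tasks from the same DAG level."""
    def depth(t):
        ps = parents[t]
        return 0 if not ps else 1 + max(depth(p) for p in ps)

    levels = defaultdict(list)
    for t in sorted(parents):
        levels[depth(t)].append(t)
    # Chunk each level into clusters of at most k jobs
    return {lvl: [ts[i:i + k] for i in range(0, len(ts), k)]
            for lvl, ts in levels.items()}

clusters = level_cluster(
    {"a": [], "b": [], "c": [], "d": ["a", "b"], "e": ["c"]}, k=2)
```

Clustering jobs this way amortizes scheduling overhead across several fine-grained tasks, which is where the Montage and SCEC gains cited earlier come from.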
31
Managing execution environment changes through
partitioning
Provides reliability: can replan at the partition level. Provides scalability: can handle portions of the workflow at a time.
  • How to: 1) Partition the workflow into smaller partitions at runtime using the partitiondax tool.
  • 2) Pass the partitioned DAX to pegasus-plan using the --pdax option.
  • Paper: "Pegasus: a Framework for Mapping Complex Scientific Workflows onto Distributed Systems", E. Deelman et al., Scientific Programming Journal, Volume 13, Number 3, 2005

Ewa Deelman, deelman_at_isi.edu, www.isi.edu/deelman, pegasus.isi.edu