Practical approaches to Grid workload management in the EGEE project, Massimo Sgaravatto, INFN Padova, on behalf of the EGEE JRA1 IT-CZ cluster

1
Practical approaches to Grid workload management
in the EGEE project
Massimo Sgaravatto, INFN Padova
On behalf of the EGEE JRA1 IT-CZ cluster
CHEP 2004
www.eu-egee.org
EGEE is a project funded by the European Union
under contract INFSO-RI-508833
2
EGEE project
  • EGEE project
  • Aim: build a consistent, robust and secure Grid
    infrastructure
  • Focus first on two pilot applications areas
    (HENP, Biomedical applications)
  • But the goal is to bring in other researchers
    from academia and industry
  • Middleware activity (JRA1)
  • Re-engineer Grid software to provide production
    quality middleware
  • Evolution towards emerging standards, based on
    Service Oriented Architectures
  • Taking into account application requirements and
    production/ deployment/ management needs
  • See talk 247 (E. Laure)

3
Workload management
  • Grid workload and resource management is one of
    the key Grid middleware functionalities
  • How to efficiently schedule a large number of
    different data-intensive jobs, submitted by a
    distributed community of users, to a Grid
    encompassing many heterogeneous resources
  • Progress was made in various projects with
    different integrated software solutions
  • DataGrid Workload Management System
  • Condor
  • EuroGrid-Unicore resource broker
  • Still a lot to do
  • Scalability, reliability
  • Identification and handling of failures
    originating from different software layers, and
    possibly from 'foreign' Grid systems and resources
  • Distributed (hierarchical ?) super-scheduling
  • Proper semantics of resource information
    collection and distribution (push, pull, index,
    cache, refresh)

4
Workload Management System
  • Provision of Grid Workload Management System
    services is assigned to the EGEE JRA1
    Italian-Czech (IT-CZ) cluster
  • CESNET
  • Datamat S.p.A.
  • INFN
  • Architecture of the EGEE WMS designed and being
    implemented
  • Taking into account feedback and requirements
    from reference applications and
    deployment/production/management activities
  • Taking into account previous experiences from
    other Grid projects (in particular the DataGrid
    WMS)
  • Set of Grid services
  • Workload Manager (WM)
  • Computing Element (CE): resource access
  • Logging & Bookkeeping (LB)
  • Job Provenance (JP)
  • Grid Accounting service
  • Interoperating among them and with other EGEE
    Grid Services

5
Workload Manager
6
Workload Manager
Job management requests (submission,
cancellation) expressed via a Job
Description Language (JDL)
7
Workload Manager
Keeps submission requests. Requests are kept
for a while if no matching resources are available
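
The holding behaviour on this slide can be sketched in a few lines of Python. This is only an illustration, not the WMS code: the class name TaskQueue is borrowed from the Status slide, and max_age, hold, expire and pending are hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass
class TaskQueue:
    """Hypothetical holder for submission requests that found no match yet."""
    max_age: float                                  # seconds a request may wait
    _pending: dict = field(default_factory=dict)    # job_id -> time it was queued

    def hold(self, job_id: str, now: float) -> None:
        # Keep the request; do not restart the clock if it is already held.
        self._pending.setdefault(job_id, now)

    def expire(self, now: float) -> list:
        """Drop and return the requests that waited longer than max_age."""
        expired = [j for j, t in self._pending.items() if now - t > self.max_age]
        for j in expired:
            del self._pending[j]
        return expired

    def pending(self) -> list:
        return list(self._pending)


queue = TaskQueue(max_age=60.0)
queue.hold("job-1", now=0.0)
queue.hold("job-2", now=50.0)
expired = queue.expire(now=70.0)   # job-1 waited 70 s, job-2 only 20 s
```

Time is passed in explicitly so the example stays deterministic; a real service would read a clock.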
8
Workload Manager
Repository of resource information available to
the matchmaker. Updated via notifications and/or
active polling on sources
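
A minimal Python sketch of such a repository, combining both update paths. All names here (ResourceInfoCache, notify, poll_stale, free_cpus) are assumptions made for illustration; the slide does not describe the real interface.

```python
from typing import Callable, Dict


class ResourceInfoCache:
    """Hypothetical repository of resource information read by a matchmaker."""

    def __init__(self, max_age: float):
        self.max_age = max_age                  # seconds before an entry is stale
        self._info: Dict[str, dict] = {}
        self._updated: Dict[str, float] = {}

    def notify(self, ce_id: str, info: dict, now: float) -> None:
        # Notification path: the source sends its current state.
        self._info[ce_id] = info
        self._updated[ce_id] = now

    def poll_stale(self, fetch: Callable[[str], dict], now: float) -> list:
        # Polling path: actively refresh the entries older than max_age.
        stale = [ce for ce, t in self._updated.items() if now - t > self.max_age]
        for ce in stale:
            self.notify(ce, fetch(ce), now)
        return stale

    def snapshot(self) -> Dict[str, dict]:
        return dict(self._info)


cache = ResourceInfoCache(max_age=30.0)
cache.notify("ce-a", {"free_cpus": 4}, now=0.0)
cache.notify("ce-b", {"free_cpus": 0}, now=20.0)
# At t=40 only ce-a is older than 30 s, so only ce-a is polled.
refreshed = cache.poll_stale(lambda ce_id: {"free_cpus": 1}, now=40.0)
```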
9
Workload Manager
Finds an appropriate CE for each submission
request, taking into account job requests and
preferences, Grid status, utilization policies
on resources
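
As a rough illustration of this matchmaking step, the Python sketch below filters CEs by a utilization policy and by the job's requirements, then picks the one the job prefers. The record layout (allowed_vos, free_cpus, requirements, rank) is hypothetical and stands in for what a JDL description and the resource information would carry.

```python
def match(job: dict, ces: list):
    """Return the id of the best CE for the job, or None if nothing matches."""
    candidates = [
        ce for ce in ces
        if job["vo"] in ce["allowed_vos"]     # utilization policy of the resource
        and job["requirements"](ce)           # hard requirements of the job
    ]
    if not candidates:
        return None                           # request stays held by the WM
    return max(candidates, key=job["rank"])["id"]   # job preferences


ces = [
    {"id": "ce-a", "os": "linux", "free_cpus": 2, "allowed_vos": {"atlas"}},
    {"id": "ce-b", "os": "linux", "free_cpus": 8, "allowed_vos": {"atlas", "biomed"}},
    {"id": "ce-c", "os": "linux", "free_cpus": 16, "allowed_vos": {"cms"}},
]
job = {
    "vo": "atlas",
    "requirements": lambda ce: ce["os"] == "linux" and ce["free_cpus"] >= 1,
    "rank": lambda ce: ce["free_cpus"],       # prefer the CE with most free CPUs
}
best = match(job, ces)
unmatched = match({**job, "vo": "lhcb"}, ces)
```

Here ce-c has the most free CPUs but does not admit the job's VO, so the policy filter removes it before ranking.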
10
Scheduling policies
  • Different possible policies
  • Eager scheduling: a job is bound to a resource as
    soon as possible
  • Job is then forwarded to that CE, where very
    likely it will end up in a queue
  • Lazy scheduling: the job is held by the WM until a
    resource becomes available
  • Job is then forwarded to that CE for immediate
    execution
  • WM architecture able to accommodate both models
    (and the intermediate solutions)
  • Eager scheduling: matching a job against multiple
    resources
  • Lazy scheduling: matching a resource against
    multiple jobs
  • Need to better investigate the strengths and
    weaknesses of different policies in different
    scenarios
  • Evaluation of relevant metrics, covering both
    resource utilization and user satisfaction
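
The two policies can be contrasted in a short Python sketch. It is a simplification under stated assumptions: the function names, the cpus/queued fields and the "shortest queue" tie-break are invented for the example and are not taken from the WM.

```python
def eager_schedule(job, ces, matches):
    """Eager: match ONE job against MANY resources and bind it right away."""
    candidates = [ce for ce in ces if matches(job, ce)]
    if not candidates:
        return None
    # Pick the shortest queue; the job will probably still wait there.
    return min(candidates, key=lambda ce: ce["queued"])["id"]


def lazy_schedule(free_ce, held_jobs, matches):
    """Lazy: a resource became free, match it against MANY held jobs."""
    for job in held_jobs:                 # oldest held job first
        if matches(job, free_ce):
            held_jobs.remove(job)         # safe: we return immediately
            return job["id"]
    return None


def matches(job, ce):
    return ce["cpus"] >= job["cpus"]


ces = [
    {"id": "ce-a", "cpus": 4, "queued": 10},
    {"id": "ce-b", "cpus": 8, "queued": 3},
]
eager_choice = eager_schedule({"id": "j0", "cpus": 2}, ces, matches)

held = [{"id": "j1", "cpus": 16}, {"id": "j2", "cpus": 4}]
lazy_choice = lazy_schedule(ces[0], held, matches)   # ce-a reports a free slot
```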

11
Computing Element
  • Service representing a computing resource
  • Main functionality: job management
  • Run jobs
  • Cancel jobs
  • Suspend and resume jobs
  • Provide info on quality of service
  • How many resources match the job requirements?
  • What is the estimated time until the job
    starts its execution?
  • Used by the WM or by any other client (e.g.
    end-user)
  • CE architecture designed to support both push
    and pull models
  • Push model: the job is pushed to the CE by the WM
  • Pull model: the CE asks the WM for jobs
  • These two models are somewhat mirrored in the
    resource information flow
  • In order to 'pull' a job a resource must choose
    where to 'push' information about itself
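
A toy Python model of the two modes, including the mirrored information flow: before it can pull a job, the CE pushes information about itself to the WM. Class and method names (advertise, request_job, pull_from) are hypothetical and do not correspond to the real CE operations.

```python
from collections import deque


class WorkloadManager:
    def __init__(self):
        self.held = deque()        # jobs waiting in the WM
        self.known_ces = {}        # filled by CEs pushing info about themselves

    def advertise(self, ce_id, free_slots):
        self.known_ces[ce_id] = free_slots

    def request_job(self, ce_id):
        # Pull model: only CEs that advertised themselves are served.
        if ce_id not in self.known_ces or not self.held:
            return None
        return self.held.popleft()


class ComputingElement:
    def __init__(self, ce_id):
        self.ce_id = ce_id
        self.running = []

    def submit(self, job_id):
        # Push model: the WM (or any client) hands the job to the CE.
        self.running.append(job_id)

    def pull_from(self, wm):
        # Pull model: push resource info first, then ask for a job.
        wm.advertise(self.ce_id, free_slots=1)
        job_id = wm.request_job(self.ce_id)
        if job_id is not None:
            self.running.append(job_id)
        return job_id


wm = WorkloadManager()
wm.held.extend(["job-1", "job-2"])

ce_push = ComputingElement("ce-push")
ce_push.submit(wm.held.popleft())      # WM pushes job-1

ce_pull = ComputingElement("ce-pull")
pulled = ce_pull.pull_from(wm)         # CE pulls job-2
```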

12
CE Architecture
Client requests: JobSubmit, JobAssess, JobKill, JobSuspend, JobResume, JobGetStatus
Web service accepting job management requests
[Diagram: Client; web service front ends (WEB) for CE and Mon; batch systems LSF, PBS, ?; Worker Nodes]
13
CE Architecture
Client: notifications, job requests
Async. notifications about job/CE events; job
requests (for a CE working in pull mode)
[Diagram: Client; web service front ends (WEB) for CE and Mon; batch systems LSF, PBS, ?; Worker Nodes]
14
Logging & Bookkeeping
  • Collects and manages job-related events (e.g.
    submission, suitable CE found, start of
    execution, ...) from the WMS components
  • Processes these events to give a higher level
    view on job states
  • Both job states and raw data available to users
  • Also via Web Service interface
  • Possible to subscribe to receive notifications on
    particular job state changes
  • LB event trail can be analyzed to identify
    problems with resources ("black holes", unusual
    failure rates, etc.)
  • See poster 419 for more details
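
The step from raw events to a higher-level job state, and the use of the event trail to spot problem resources, can be sketched as follows. The event names, state names and record shapes are assumptions made for the example; they are not the LB event schema.

```python
# Hypothetical mapping from raw event names to job states.
EVENT_TO_STATE = {
    "submitted": "SUBMITTED",
    "ce_found": "SCHEDULED",
    "running": "RUNNING",
    "done": "DONE",
}


def job_state(events):
    """Fold (timestamp, event_name) pairs into the latest known job state."""
    state = "UNKNOWN"
    for _timestamp, name in sorted(events):       # events may arrive unordered
        state = EVENT_TO_STATE.get(name, state)   # ignore unknown events
    return state


def failure_rates(outcomes):
    """outcomes: (ce_id, succeeded) pairs -> failure rate per CE."""
    totals, failures = {}, {}
    for ce_id, succeeded in outcomes:
        totals[ce_id] = totals.get(ce_id, 0) + 1
        failures[ce_id] = failures.get(ce_id, 0) + (0 if succeeded else 1)
    return {ce_id: failures[ce_id] / totals[ce_id] for ce_id in totals}


state = job_state([(3, "running"), (1, "submitted"), (2, "ce_found")])
rates = failure_rates([("ce-a", True), ("ce-a", True),
                       ("ce-b", False), ("ce-b", False)])
suspects = [ce_id for ce_id, rate in rates.items() if rate > 0.9]
```

A CE on which nearly every job fails quickly is the kind of "black hole" the slide mentions; the 0.9 threshold is arbitrary.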

15
Job Provenance
  • Keeps track of definition of submitted jobs,
    execution conditions and job life cycle for a
    long time
  • Job life logs (JDL, timestamps, job IDs, ...)
  • Executable and input/output files
  • Execution environment (OS, installed software
    version, ...)
  • Custom data provided by user
  • Used for
  • Debugging
  • Post-mortem analysis
  • Comparison of job executions in an evolving
    environment
  • Service components
  • Primary Storage Server
  • Keeps data in the most compact and economic form
  • Index Servers
  • Configured to support a set of queryable
    attributes
  • See poster 419 for more details
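
The split between one primary store and index servers configured for a set of queryable attributes can be illustrated with a small Python sketch. The classes, methods and record fields below are hypothetical; the slide does not specify the interfaces or the storage format.

```python
class PrimaryStorage:
    """Hypothetical long-term store holding the full record of each job."""

    def __init__(self):
        self._records = {}

    def store(self, job_id, record):
        self._records[job_id] = record

    def get(self, job_id):
        return self._records[job_id]


class IndexServer:
    """Hypothetical index answering queries only on configured attributes."""

    def __init__(self, attributes):
        self.attributes = tuple(attributes)
        self._index = {name: {} for name in self.attributes}

    def feed(self, job_id, record):
        for name in self.attributes:
            if name in record:
                self._index[name].setdefault(record[name], set()).add(job_id)

    def query(self, attribute, value):
        if attribute not in self._index:
            raise KeyError(f"attribute not indexed: {attribute}")
        return sorted(self._index[attribute].get(value, set()))


storage = PrimaryStorage()
index = IndexServer(attributes=["owner", "os"])
records = {
    "job-1": {"owner": "alice", "os": "SL3", "jdl": "..."},
    "job-2": {"owner": "bob", "os": "SL3", "jdl": "..."},
}
for job_id, record in records.items():
    storage.store(job_id, record)
    index.feed(job_id, record)

same_os = index.query("os", "SL3")
alice_jobs = index.query("owner", "alice")
```

The index returns job identifiers only; the full record is then fetched from the primary storage, which is what lets the primary store stay compact.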

16
Grid Accounting
  • Accumulates information about the usage of Grid
    resources by users / groups (e.g. VOs)
  • To be used
  • To track resource usage
  • To discover abuses (and help avoid them)
  • Also possible to charge users for the resources
    they have used
  • Allows implementation of submission policies
    based on resource usage
  • Exchange market among Grid users and Grid
    resource owners, which should result in market
    equilibrium
  • → Load balancing on the Grid

17
Accounting architecture
Accounting
Resource metering: getting info about resource
usage
Storage Element
Computing Element
18
Accounting architecture
Accounting
Reports about resource usage per user / VO /
resource
Storage Element
Computing Element
19
Accounting architecture
Resource pricing
Accounting
Storage Element
Computing Element
Resource owner
20
Accounting architecture
Resource pricing
Cost computation
Accounting
Storage Element
Computing Element
Resource owner
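
The chain of metering, pricing and cost computation shown in these diagrams can be sketched in Python. The record layout, the CPU-hour unit and the price values are assumptions for illustration only; the slides do not define how prices are set or expressed.

```python
def compute_costs(usage_records, prices):
    """Charge each user for metered usage at the price set by the owner.

    usage_records: (user, resource, cpu_hours) tuples from resource metering.
    prices: resource -> price per CPU hour, published by the resource owner.
    """
    costs = {}
    for user, resource, cpu_hours in usage_records:
        costs[user] = costs.get(user, 0.0) + cpu_hours * prices[resource]
    return costs


usage_records = [
    ("alice", "ce-a", 2.0),
    ("alice", "ce-b", 1.0),
    ("bob", "ce-a", 4.0),
]
prices = {"ce-a": 1.5, "ce-b": 4.0}
costs = compute_costs(usage_records, prices)
```

If owners raise the price of busy resources, cost-aware users drift towards cheaper, less loaded ones, which is the load-balancing effect the Grid Accounting slide refers to.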
21
Status
  • Workload Manager, Logging & Bookkeeping, Grid
    Accounting: software inherited from the DataGrid
    WMS
  • Being revised and complemented according to the
    new architecture
  • E.g. Information Supermarket and TaskQueue are new
    developments
  • Web services interfaces
  • First implementation already deployed in the EGEE
    gLite prototype testbed
  • Computing Element
  • New development
  • CEMon prototype already implemented
  • Job Provenance
  • New component being implemented

22
Links
  • EGEE JRA1 IT-CZ cluster homepage
  • http://egee-jra1-wm.mi.infn.it/egee-jra1-wm
  • EGEE JRA1 (middleware activity) homepage
  • http://egee-jra1.web.cern.ch/egee-jra1
  • EGEE project homepage
  • http://www.eu-egee.org