Title: Practical approaches to Grid workload management in the EGEE project, Massimo Sgaravatto, INFN Padova, on behalf of the EGEE JRA1 IT-CZ cluster
1. Practical approaches to Grid workload management in the EGEE project
Massimo Sgaravatto, INFN Padova
On behalf of the EGEE JRA1 IT-CZ cluster
CHEP 2004
www.eu-egee.org
EGEE is a project funded by the European Union
under contract INFSO-RI-508833
2. EGEE project
- EGEE project
  - Aim: build a consistent, robust and secure Grid infrastructure
  - Focus first on two pilot application areas (HENP, Biomedical applications)
  - But the goal is to take in other researchers in academia and industry
- Middleware activity (JRA1)
  - Re-engineer Grid software to provide production-quality middleware
  - Evolution towards emerging standards, based on Service Oriented Architectures
  - Taking into account application requirements and production/deployment/management needs
  - See talk 247 (E. Laure)
3. Workload management
- Grid workload and resource management is one of the key Grid middleware functionalities
  - How to efficiently schedule a large number of different data-intensive jobs, submitted by a distributed community of users, to a Grid encompassing many heterogeneous resources
- Progress was made in various projects with different integrated software solutions
  - DataGrid Workload Management System
  - Condor
  - EuroGrid-Unicore resource broker
  - ...
- Still a lot to do
  - Scalability, reliability
  - Identification and handling of failures originating from different software layers, and possibly from 'foreign' Grid systems and resources
  - Distributed (hierarchical?) super-scheduling
  - Proper semantics of resource information collection and distribution (push, pull, index, cache, refresh)
  - ...
4. Workload Management System
- Provision of Grid Workload Management System services assigned to the EGEE JRA1 Italian-Czech cluster
  - CESNET
  - Datamat S.p.A.
  - INFN
- Architecture of the EGEE WMS designed and being implemented
  - Taking into account feedback and requirements from reference applications and deployment/production/management activities
  - Taking into account previous experiences from other Grid projects (in particular the DataGrid WMS)
- Set of Grid services
  - Workload Manager (WM)
  - Computing Element (CE): resource access
  - Logging and Bookkeeping (LB)
  - Job Provenance (JP)
  - Grid Accounting service
- Interoperating among them and with other EGEE Grid Services
5. Workload Manager
6. Workload Manager
Job management requests (submission, cancellation) are expressed via a Job Description Language (JDL).
7. Workload Manager
Keeps submission requests. Requests are kept for a while if no matching resources are available.
8. Workload Manager
Repository of resource information available to the matchmaker. Updated via notifications and/or active polling on sources.
9. Workload Manager
Finds an appropriate CE for each submission request, taking into account job requests and preferences, Grid status, and utilization policies on resources.
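The matchmaking step described on the Workload Manager slides can be illustrated with a minimal Python sketch. This is not the EGEE implementation: the `Job` and `ComputingElement` classes, their attribute names, and the `match` function are hypothetical, and real requests would be expressed in JDL rather than as Python callables.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class ComputingElement:
    # Hypothetical snapshot of CE information as held in the WM's repository.
    name: str
    attrs: Dict[str, float]


@dataclass
class Job:
    # Hypothetical stand-in for a JDL request: a boolean requirement and a
    # numeric rank (higher is better), both evaluated against CE attributes.
    job_id: str
    requirements: Callable[[Dict[str, float]], bool]
    rank: Callable[[Dict[str, float]], float]


def match(job: Job, ces: List[ComputingElement]) -> Optional[ComputingElement]:
    """Return the best-ranked CE satisfying the job requirements, or None."""
    candidates = [ce for ce in ces if job.requirements(ce.attrs)]
    if not candidates:
        return None  # the request would be kept for a while by the WM
    return max(candidates, key=lambda ce: job.rank(ce.attrs))


ces = [
    ComputingElement("ce-a", {"free_cpus": 2, "max_wall_hours": 12}),
    ComputingElement("ce-b", {"free_cpus": 40, "max_wall_hours": 48}),
    ComputingElement("ce-c", {"free_cpus": 90, "max_wall_hours": 6}),
]
job = Job(
    job_id="job-1",
    requirements=lambda a: a["max_wall_hours"] >= 10,  # job requirement
    rank=lambda a: a["free_cpus"],                     # job preference
)
best = match(job, ces)
print(best.name if best else "no match")  # prints "ce-b"
```

Here `ce-c` has the most free CPUs but fails the requirement, so the preference is applied only among the CEs that satisfy it.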
10. Scheduling policies
- Different possible policies
  - Eager scheduling: a job is bound to a resource as soon as possible
    - Job is then forwarded to that CE, where it will very likely end up in a queue
  - Lazy scheduling: job held by the WM until a resource becomes available
    - Job then forwarded to that CE for immediate execution
- WM architecture able to accommodate both models (and the intermediate solutions)
  - Eager scheduling: matching a job against multiple resources
  - Lazy scheduling: matching a resource against multiple jobs
- Need to better investigate strengths and weaknesses of different policies in different scenarios
  - Evaluation of relevant metrics, covering both resource utilization and user satisfaction
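The symmetry between the two policies (one job against many resources versus one resource against many jobs) can be sketched as follows. The function names, the dictionary layout, and the selection rules (largest resource, oldest job) are assumptions chosen for illustration only; the slides do not prescribe them.

```python
from typing import Dict, List, Optional


def eager_schedule(job_cpus: int, resources: Dict[str, int]) -> Optional[str]:
    """Eager: match one job against many resources as soon as it arrives.

    Picks the largest resource able to host the job, regardless of current
    load, so the job may still wait in a queue at that CE.
    """
    fitting = {name: cpus for name, cpus in resources.items() if cpus >= job_cpus}
    return max(fitting, key=fitting.get) if fitting else None


def lazy_schedule(free_cpus: int, pending: List[dict]) -> Optional[dict]:
    """Lazy: match one newly available resource against many held jobs.

    Picks the oldest held job that fits the CPUs that just became free.
    """
    fitting = [job for job in pending if job["cpus"] <= free_cpus]
    return min(fitting, key=lambda job: job["submitted"]) if fitting else None


# Hypothetical data: total CPUs per CE, and jobs held by the WM.
resources = {"ce-a": 8, "ce-b": 32}
pending = [
    {"id": "j1", "cpus": 16, "submitted": 1},
    {"id": "j2", "cpus": 4, "submitted": 2},
    {"id": "j3", "cpus": 2, "submitted": 3},
]

print(eager_schedule(16, resources))      # job needing 16 CPUs, bound at once
print(lazy_schedule(8, pending)["id"])    # 8 CPUs became free on some CE
```

Intermediate solutions would sit between these two functions, for example binding eagerly only when the estimated queue wait is short.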
11. Computing Element
- Service representing a computing resource
- Main functionality: job management
  - Run jobs
  - Cancel jobs
  - Suspend and resume jobs
  - Provide info on quality of service
    - How many resources match the job requirements?
    - What is the estimated time until the job starts executing?
    - ...
  - ...
- Used by the WM or by any other client (e.g. end-user)
- CE architecture designed to support both push and pull models
  - Push model: the job is pushed to the CE by the WM
  - Pull model: the CE asks the WM for jobs
  - These two models are somewhat mirrored in the resource information flow
    - In order to 'pull' a job, a resource must choose where to 'push' information about itself
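The push and pull models above can be contrasted in a small Python sketch. The class and method names (`WorkloadManager.push`, `request_job`, `ComputingElement.pull`) are hypothetical and do not correspond to the actual CE or WM interfaces; the sketch only shows who initiates the transfer and which way resource information flows.

```python
from collections import deque
from typing import Deque, List, Optional


class WorkloadManager:
    """Hypothetical WM holding jobs described by an id and a CPU need."""

    def __init__(self) -> None:
        self.held: Deque[dict] = deque()

    def push(self, job: dict, ce: "ComputingElement") -> None:
        # Push model: the WM has chosen the CE and hands the job over.
        ce.accept(job)

    def request_job(self, ce_info: dict) -> Optional[dict]:
        # Pull model: the CE has pushed information about itself (ce_info)
        # and receives the first held job that fits, if any.
        for job in list(self.held):
            if job["cpus"] <= ce_info["free_cpus"]:
                self.held.remove(job)
                return job
        return None


class ComputingElement:
    def __init__(self, name: str, free_cpus: int) -> None:
        self.name = name
        self.free_cpus = free_cpus
        self.running: List[dict] = []

    def accept(self, job: dict) -> None:
        # Simplification: no local queue, the job starts immediately.
        self.running.append(job)
        self.free_cpus -= job["cpus"]

    def pull(self, wm: WorkloadManager) -> bool:
        # The CE chooses which WM to advertise itself to, then asks for work.
        job = wm.request_job({"name": self.name, "free_cpus": self.free_cpus})
        if job is None:
            return False
        self.accept(job)
        return True


wm = WorkloadManager()
ce = ComputingElement("ce-a", free_cpus=4)
wm.push({"id": "j1", "cpus": 1}, ce)                 # push model
wm.held.extend([{"id": "j2", "cpus": 8}, {"id": "j3", "cpus": 2}])
ce.pull(wm)                                          # pull model: gets j3
```

In the pull call the CE sends its own state along with the request, which is the mirrored information flow mentioned on the slide.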
12. CE Architecture
[Diagram: a Client sends job management requests (JobSubmit, JobAssess, JobKill, JobSuspend, JobResume, JobGetStatus) to a Web service accepting such requests. Boxes shown: CE, CE Mon, local batch systems (LSF, PBS, ?), Worker Nodes.]
13. CE Architecture
[Diagram: Client, Web service interfaces, CE, CE Mon, local batch systems (LSF, PBS, ?), Worker Nodes. Flows shown: asynchronous notifications about job/CE events, and job requests (for a CE working in pull mode).]
14. Logging and Bookkeeping
- Collects and manages job-related events (e.g. submission, suitable CE found, start of execution, ...) from the WMS components
- Processes these events to give a higher-level view of job states
- Both job states and raw data available to users
  - Also via Web Service interface
  - Possible to subscribe to receive notifications on particular job state changes
- LB event trail can be analyzed to identify problems with resources ("black holes", unusual failure rates, etc.)
- See poster 419 for more details
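How raw events can be folded into job states, and how the same event trail can point to a problematic resource, is sketched below. The event names, the state names, and the thresholds in `suspicious_ces` are assumptions made for this example; they are not the actual LB event types or state machine.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical event names and the job state each one implies.
EVENT_TO_STATE = {
    "submitted": "SUBMITTED",
    "matched": "SCHEDULED",
    "running": "RUNNING",
    "done": "DONE",
    "failed": "ABORTED",
}

Event = Tuple[float, str, str, str]  # (timestamp, job_id, event, ce)


def job_states(events: List[Event]) -> Dict[str, str]:
    """Fold raw events, in timestamp order, into the latest state per job."""
    states: Dict[str, str] = {}
    for _, job_id, event, _ in sorted(events):
        states[job_id] = EVENT_TO_STATE[event]
    return states


def suspicious_ces(events: List[Event], min_jobs: int = 3,
                   failure_threshold: float = 0.8) -> List[str]:
    """Flag CEs where most started jobs failed (a possible "black hole")."""
    started: Counter = Counter()
    failed: Counter = Counter()
    for _, _, event, ce in events:
        if event == "running":
            started[ce] += 1
        elif event == "failed":
            failed[ce] += 1
    return sorted(ce for ce, n in started.items()
                  if n >= min_jobs and failed[ce] / n >= failure_threshold)


events = [
    (1.0, "j1", "submitted", ""), (2.0, "j1", "matched", "ce-x"),
    (3.0, "j1", "running", "ce-x"), (3.5, "j1", "failed", "ce-x"),
    (2.1, "j2", "running", "ce-x"), (2.2, "j2", "failed", "ce-x"),
    (1.2, "j3", "running", "ce-x"), (1.3, "j3", "failed", "ce-x"),
    (1.4, "j4", "running", "ce-y"), (5.0, "j4", "done", "ce-y"),
]
print(job_states(events))
print(suspicious_ces(events))
```

Users would see the folded states, while the raw tuples remain available for the kind of resource analysis done by `suspicious_ces`.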
15. Job Provenance
- Keeps track of the definition of submitted jobs, execution conditions and job life cycle for a long time
  - Job life logs (JDL, timestamps, jobids, ...)
  - Executable and input/output files
  - Execution environment (OS, installed software version, ...)
  - Custom data provided by user
- Used for
  - Debugging
  - Post-mortem analysis
  - Comparison of job executions in an evolving environment
- Service components
  - Primary Storage Server
    - Keeps data in the most compact and economic form
  - Index Servers
    - Configured to support a set of queryable attributes
- See poster 419 for more details
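The split between a compact primary store and index servers configured for a fixed set of queryable attributes can be sketched as follows. Compressed JSON merely stands in for "the most compact and economic form"; the class names, method names, and record fields are hypothetical and not taken from the Job Provenance design.

```python
import json
import zlib
from collections import defaultdict
from typing import Dict, List, Set


class PrimaryStorage:
    """Keeps each job record as compressed JSON (an assumed compact form)."""

    def __init__(self) -> None:
        self._blobs: Dict[str, bytes] = {}

    def store(self, job_id: str, record: dict) -> None:
        self._blobs[job_id] = zlib.compress(json.dumps(record).encode("utf-8"))

    def load(self, job_id: str) -> dict:
        return json.loads(zlib.decompress(self._blobs[job_id]).decode("utf-8"))


class IndexServer:
    """Indexes only its configured attributes; other fields are not queryable."""

    def __init__(self, attributes: List[str]) -> None:
        self.attributes = attributes
        self._index: Dict[str, Dict[str, Set[str]]] = {
            attr: defaultdict(set) for attr in attributes
        }

    def feed(self, job_id: str, record: dict) -> None:
        for attr in self.attributes:
            if attr in record:
                self._index[attr][str(record[attr])].add(job_id)

    def query(self, attr: str, value: str) -> Set[str]:
        if attr not in self._index:
            raise KeyError(f"attribute {attr!r} is not indexed")
        return set(self._index[attr].get(value, set()))


storage = PrimaryStorage()
index = IndexServer(["owner", "os"])
record = {"owner": "alice", "os": "linux", "jdl": "(job description)"}
storage.store("j1", record)
index.feed("j1", record)

job_ids = index.query("os", "linux")          # find jobs by indexed attribute
full = [storage.load(j) for j in job_ids]     # then fetch the full records
```

A query on an attribute outside the configured set (here `jdl`) is rejected, which is the trade-off of keeping the indexes small.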
16. Grid Accounting
- Accumulates information about the usage of Grid resources by users / groups (e.g. VOs)
- To be used
  - To track resource usage
  - To discover abuses (and help avoid them)
- Also possible to charge users for the resources they have used
  - Allows implementation of submission policies based on resource usage
  - Exchange market among Grid users and Grid resource owners, which should result in market equilibrium
    - -> Load balancing on the Grid
17. Accounting architecture
[Diagram: Accounting, Storage Element, Computing Element.] Resource metering: getting info about resource usage.
18. Accounting architecture
[Diagram: Accounting, Storage Element, Computing Element.] Reports about resource usage per user / VO / resource.
19. Accounting architecture
[Diagram: Accounting, Storage Element, Computing Element, Resource owner.] Resource pricing.
20. Accounting architecture
[Diagram: Accounting, Storage Element, Computing Element, Resource owner.] Resource pricing and cost computation.
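The steps shown on the accounting slides (metering, usage reports, pricing, cost computation) can be sketched in Python. The record fields, the `usage_report` and `cost` functions, and the rule "cost = CPU hours times a per-resource price" are assumptions for illustration; the slides do not specify a pricing formula.

```python
from collections import defaultdict
from typing import Dict, List


def usage_report(records: List[dict], key: str) -> Dict[str, float]:
    """Sum CPU hours per value of `key` ("user", "vo" or "resource")."""
    totals: Dict[str, float] = defaultdict(float)
    for rec in records:
        totals[rec[key]] += rec["cpu_hours"]
    return dict(totals)


def cost(records: List[dict], prices: Dict[str, float]) -> Dict[str, float]:
    """Charge each user CPU hours times the price set by the resource owner."""
    bill: Dict[str, float] = defaultdict(float)
    for rec in records:
        bill[rec["user"]] += rec["cpu_hours"] * prices[rec["resource"]]
    return dict(bill)


# Hypothetical usage records as they might come from resource metering.
records = [
    {"user": "alice", "vo": "atlas", "resource": "ce-a", "cpu_hours": 10.0},
    {"user": "bob", "vo": "biomed", "resource": "ce-b", "cpu_hours": 4.0},
    {"user": "alice", "vo": "atlas", "resource": "ce-b", "cpu_hours": 2.0},
]
# Hypothetical prices per CPU hour, set by each resource owner.
prices = {"ce-a": 1.0, "ce-b": 2.5}

print(usage_report(records, "vo"))
print(cost(records, prices))
```

If owners raise the price of a heavily used resource, cost-aware users drift to cheaper ones, which is the load-balancing effect the exchange market is meant to produce.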
21. Status
- Workload Manager, Logging and Bookkeeping, and Grid Accounting software inherited from the DataGrid WMS software
  - Being revised and complemented according to the new architecture
    - E.g. Information Supermarket, TaskQueue: new developments
    - Web services interfaces
  - First implementation already deployed in the EGEE GLITE prototype testbed
- Computing Element
  - New developments
  - CEMon prototype already implemented
- Job Provenance
  - New component being implemented
22. Links
- EGEE JRA1 IT-CZ cluster homepage
  - http://egee-jra1-wm.mi.infn.it/egee-jra1-wm
- EGEE JRA1 (middleware activity) homepage
  - http://egee-jra1.web.cern.ch/egee-jra1
- EGEE project homepage
  - http://www.eu-egee.org