Contribution for discussion on analysis control and monitor

Provided by: massimo9
1
Contribution for discussion on analysis control
and monitor
  • Inspired by the ASAP experience
  • Discussions (Andrea, Craig, Juha, Julia, Massimo,
    Simone, Tao-Sheng, et al.)
  • at
  • http://cmsdoc.cern.ch/cms/cpt/Computing/Technical/workshops/wm/jul2005/Agenda.txt

2
Example ASAP (ARDA/CMS)
  • Components shown: Monalisa, RefDB, PubDB, gLite, JDL, ASAP UI, ASAP Job
    Monitoring service, job running on the Worker Node, job monitoring
    directory
  • Job submission, checking job status, resubmission in case of failure,
    fetching results, storing results to Castor
  • Delegates user credentials using MyProxy
  • Application, application version, executable, data sample, working
    directory, Castor directory to save output, number of events to be
    processed, number of events per job, data cards for ORCA application
  • Publishing job status on the web
  • Output files location
3
CMS - Using MonAlisa for user job monitoring
  • Dynamic monitoring of the total number of events processed by all
    sub-jobs belonging to the same Master job
  • A single job is submitted to gLite
  • The JDL contains job-splitting instructions
  • The Master job is split by gLite into sub-jobs
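
The per-master-job event total described on this slide can be sketched as follows. This is only an illustration: the report format (master job ID, sub-job ID, events processed so far) and the function name are assumptions, not the actual ASAP or MonAlisa interface.

```python
from collections import defaultdict


def total_events_by_master(reports):
    """Sum the events processed by all sub-jobs of each master job.

    `reports` is a hypothetical stream of (master_id, subjob_id,
    events_processed) tuples, where each sub-job repeatedly reports its
    running counter.
    """
    latest = {}
    for master_id, subjob_id, events_processed in reports:
        # Keep only the most recent counter reported by each sub-job.
        latest[(master_id, subjob_id)] = events_processed

    totals = defaultdict(int)
    for (master_id, _subjob_id), events in latest.items():
        totals[master_id] += events
    return dict(totals)


reports = [
    ("master-1", 0, 100),
    ("master-1", 1, 50),
    ("master-1", 0, 250),  # later update from sub-job 0 replaces the 100
    ("master-2", 0, 10),
]
print(total_events_by_master(reports))  # {'master-1': 300, 'master-2': 10}
```

Keeping the latest counter per sub-job, rather than adding every report, avoids double counting when a sub-job reports more than once.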
4
Job Monitoring
  • ASAP Monitor

5
ARDA/ASAP Merging the results
6
Experience with ASAP
  • ASAP uses a server to provide useful
    functionality
  • Job submission, control, resubmission, results
    collection on behalf of the user
  • Job information gathering and presentation
  • The ASAP prototype gives direction on the next
    version
  • Monitor is very important
  • Split functionality (control and monitor)
  • Task Manager and Job Monitor
  • What can be done on top of the already
    implemented functionality

7
Information to be collected
8
How is information collected?
  • COBRA sends information to Monalisa
  • (Switches in orcarc)
  • Independent from the job submission tool
  • In the new framework, this information should come through the
    Framework Job Report
  • Useful to discuss now what we want to be there!
  • Alternative/complementary to job wrappers
  • It would be nice to have a standardized way to
    express common error conditions (non-grid
    failures)
  • software distribution is not found
  • failures caused by catalogs not downloaded from
    PubDB (analysis)
  • failures caused by input files not being provided to the worker node
    (production)
  • It might be very useful to require that a new
    framework has a clear indication of the job
    failure (through exit code?)
  • Reduce the necessity to check the log file, which
    is normally not straightforward.
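
One possible shape for a standardized set of non-grid error conditions, signalled through the exit code, is sketched below. The numeric codes and names are hypothetical and chosen only for illustration; the slides do not define any.

```python
from enum import IntEnum


class JobFailure(IntEnum):
    """Hypothetical exit codes for common non-grid failures."""
    OK = 0
    SOFTWARE_NOT_FOUND = 10        # software distribution is not found
    CATALOG_DOWNLOAD_FAILED = 11   # catalogs not downloaded from PubDB
    INPUT_FILE_UNAVAILABLE = 12    # input files not provided to the worker node
    UNKNOWN = 99                   # anything the convention does not cover


def classify_exit_code(code):
    """Map a raw exit code to a JobFailure, falling back to UNKNOWN."""
    try:
        return JobFailure(code)
    except ValueError:
        return JobFailure.UNKNOWN


print(classify_exit_code(11).name)  # CATALOG_DOWNLOAD_FAILED
print(classify_exit_code(42).name)  # UNKNOWN
```

With a convention of this kind, a monitor could report the failure reason without parsing the log file.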

9
Monitor mechanics
  • In the current implementation, ASAP needs user credentials to get
    information about the user's jobs from the LB
  • In the current implementation, it is just a stand-alone prototype
  • Components shown in the diagram
  • ASAP job monitor
  • Analysis job generation framework: Crab
  • Production job generation framework: McRunjob/MCPS
  • Logging Bookkeeping
  • Myproxy server
  • Monitoring DB
  • Monalisa
  • RB
  • CE
  • WN (four shown)
  • Random user
10
How to decide if a job should be resubmitted
  • How to decide if a job should be resubmitted
    (user driven)
  • The job failed because of the Grid
  • In some cases, it could be done by the RB
    (shallow resubmission)
  • The job failed after execution had started: how to
    understand what to do?
  • Presently we consider
  • Any event analysed?
  • First failed attempt?
  • Failing on the same event?
  • What is happening to the other jobs in the task?
  • Not difficult to add things like
  • Correlation with the CE?
  • Monitor information needed
  • Presently the task manager and the job
    monitor are the same entity (in ASAP)

11
Control mechanics
  • It maintains the list of jobs
  • Very similar concepts (and working system!) in Ganga (ATLAS, LHCb)
  • User task is registered to the manager
  • User is registered to the monitor
  • Takes a decision about resubmission
  • In the current implementation, it is just a stand-alone prototype
  • Components shown in the diagram
  • ASAP task manager
  • ASAP job monitor
  • Job generation framework
  • Job repository
  • Monitoring DB
  • Monalisa
  • Myproxy server
  • Logging Bookkeeping
  • RB
  • CE
  • WN (four shown)
12
Examples of re-submission scenarios
  • Job is failing on the same event, stop
    resubmission after 2-3 attempts
  • Data file corrupted or event triggering
    exceptions
  • All jobs of the task are failing on the same CE,
    exclude this CE from the valid CE list for this
    task and resubmit
  • All jobs of the task are failing everywhere: no
    point in resubmitting, warn the user, stop task
    processing
  • Jobs are queued too long on a given site, kill
    them and resubmit
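
The scenarios above can be sketched as a small decision function. This is a minimal illustration, not the ASAP task manager logic: the `Job` fields, the `Action` names, the thresholds, and the reading of "failing on the same CE" (every job of the task failed and all ran on one CE) are assumptions.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    WAIT = "wait"
    RESUBMIT = "resubmit"
    RESUBMIT_EXCLUDING_CE = "resubmit excluding CE"
    KILL_AND_RESUBMIT = "kill and resubmit"
    STOP_JOB = "stop resubmitting this job"
    STOP_TASK = "warn user and stop task"


@dataclass
class Job:
    ce: str
    status: str                  # "queued", "running", "failed" or "done"
    same_event_failures: int = 0
    queued_seconds: int = 0


def decide(job, task_jobs, max_same_event_failures=3,
           max_queued_seconds=6 * 3600):
    """Pick an action for `job`, given all jobs of its task."""
    if job.status == "queued":
        # Queued too long on a given site: kill and resubmit.
        if job.queued_seconds > max_queued_seconds:
            return Action.KILL_AND_RESUBMIT
        return Action.WAIT
    if job.status != "failed":
        return Action.WAIT

    if all(j.status == "failed" for j in task_jobs):
        ces = {j.ce for j in task_jobs}
        if len(ces) > 1:
            # Failing everywhere: no point in resubmitting.
            return Action.STOP_TASK
        if len(task_jobs) > 1:
            # Every job failed on one CE: exclude it and retry.
            return Action.RESUBMIT_EXCLUDING_CE

    # Repeated failure on the same event suggests a corrupted data file
    # or an event that triggers an exception.
    if job.same_event_failures >= max_same_event_failures:
        return Action.STOP_JOB
    return Action.RESUBMIT


a, b = Job("ce1", "failed"), Job("ce1", "failed")
print(decide(a, [a, b]).value)  # resubmit excluding CE
```

Correlating with the CE, as suggested on the earlier slide, would refine the single-CE rule, for example by comparing failure rates per CE instead of requiring that every job failed.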

13
DashBoard
  • Passive display
  • Quick view of the CMS activities
  • Users
  • Activities
  • Services
  • Resources
  • Useful tools for many
  • Expose usage
  • Expose performances
  • Expose activities
  • A very important tool
  • Assume other tools will be available for other
    specific activities (production, analysis, grid
    management, site management, ...)
  • Examples
  • How many CMS users are running?
  • Who are the top 5 users?
  • Who are the top 5 analysis groups?
  • Where are they running?
  • Top 5 centres delivering CPU time
  • What is running?
  • Top 5 data samples
  • Top 5 applications
  • Failures
  • Top 5 failure reasons
  • And of course
  • Quantities could be displayed
  • Last hour, last 24h, last week
  • Some correlations
  • Under/over replicated data samples
  • Data sample/analysis groups
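
The "top 5" views listed above amount to simple aggregations over job records. The sketch below assumes a hypothetical record layout (`user`, `site`, `sample`, `cpu_hours`, `failure`); a real dashboard would read these from the monitoring database.

```python
from collections import Counter, defaultdict


def top_by_count(records, field, n=5):
    """Most frequent values of `field`, ignoring records where it is None."""
    values = (r[field] for r in records if r[field] is not None)
    return Counter(values).most_common(n)


def top_by_sum(records, key_field, value_field, n=5):
    """Keys with the largest summed `value_field`, e.g. CPU time per centre."""
    totals = defaultdict(float)
    for r in records:
        totals[r[key_field]] += r[value_field]
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:n]


# Hypothetical job records, for illustration only.
records = [
    {"user": "alice", "site": "T2_A", "sample": "ttbar",
     "cpu_hours": 4.0, "failure": None},
    {"user": "alice", "site": "T2_B", "sample": "ttbar",
     "cpu_hours": 1.5, "failure": "software not found"},
    {"user": "bob", "site": "T2_A", "sample": "zmumu",
     "cpu_hours": 2.0, "failure": None},
]

print(top_by_count(records, "user"))             # [('alice', 2), ('bob', 1)]
print(top_by_sum(records, "site", "cpu_hours"))  # [('T2_A', 6.0), ('T2_B', 1.5)]
print(top_by_count(records, "failure"))          # [('software not found', 1)]
```

Restricting `records` to the last hour, 24 hours, or week before calling these functions gives the time-windowed views.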

14
Many possible clients
CMS Task manager
CMS dashboard
CMS Job monitor
CMS site acceptance
Monitoring DB
PhEDEx trigger

15
Investigating the Monitor DB Structure
  • Experimenting with different backends
  • Currently PostgreSQL, SQLite
  • Know-how and experience with DB services
  • GANGA4 job repository
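
As an illustration of what a small SQLite-backed monitoring table could look like, here is a minimal sketch using Python's standard `sqlite3` module. The table name, columns, and helper functions are assumptions made for this example; the slides do not give the actual schema.

```python
import sqlite3

# Hypothetical schema: one row per job, grouped into tasks.
SCHEMA = """
CREATE TABLE IF NOT EXISTS job (
    job_id           INTEGER PRIMARY KEY,
    task_id          TEXT NOT NULL,
    user_name        TEXT NOT NULL,
    ce               TEXT,
    status           TEXT NOT NULL,
    events_processed INTEGER NOT NULL DEFAULT 0
);
"""


def open_monitor_db(path=":memory:"):
    """Open (or create) the database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def record_job(conn, task_id, user_name, ce, status, events_processed=0):
    """Insert one job row and return its generated job_id."""
    cur = conn.execute(
        "INSERT INTO job (task_id, user_name, ce, status, events_processed) "
        "VALUES (?, ?, ?, ?, ?)",
        (task_id, user_name, ce, status, events_processed),
    )
    conn.commit()
    return cur.lastrowid


def task_summary(conn, task_id):
    """Return {status: (job count, total events)} for one task."""
    rows = conn.execute(
        "SELECT status, COUNT(*), SUM(events_processed) FROM job "
        "WHERE task_id = ? GROUP BY status ORDER BY status",
        (task_id,),
    ).fetchall()
    return {status: (count, events) for status, count, events in rows}


conn = open_monitor_db()
record_job(conn, "task-1", "alice", "ce1", "done", 500)
record_job(conn, "task-1", "alice", "ce2", "done", 300)
record_job(conn, "task-1", "alice", "ce2", "failed", 20)
print(task_summary(conn, "task-1"))  # {'done': (2, 800), 'failed': (1, 20)}
```

The SQL here is plain enough that the same table and queries should carry over to PostgreSQL with small changes (for example the auto-incrementing key and the `?` placeholder style).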