Report from USA - PowerPoint PPT Presentation

1
Report from USA
  • Massimo Sgaravatto
  • INFN Padova

2
Introduction
  • Workload management system for productions
  • Monte Carlo productions, data reconstructions and
    production analyses
  • Scheduled activities
  • Goal: optimization of overall throughput

3
Possible architecture
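[Diagram: jobs are submitted to the Master, which performs resource discovery through the GIS (information on characteristics and status of local resources). The Master passes jobs with condor_submit (Globus Universe) to a Personal Condor, where they are managed with condor_q and condor_rm. Personal Condor uses globusrun to reach the GRAM at each of three sites (Site1, Site2, Site3), which front the local resource managers (CONDOR, LSF, PBS).]

4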
Overview
  • GRAM as uniform interface to different local
    resource management systems
  • Personal Condor able to provide robustness and
    reliability
  • The user can submit 10,000 jobs and be sure that
    they will be completed (even if there are
    problems in the submitting machine, in the
    executing machines, in the network, ...) without
    human intervention
  • Usage of the Condor interface and tools to manage
    the jobs
  • Robust tools with all the required capabilities
    (monitoring, logging, ...)
  • Master smart enough to decide to which Globus
    resources the jobs must be submitted
  • The Master uses the information on the
    characteristics and status of resources published
    in the GIS
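
As a rough illustration of the Master's decision, the sketch below picks a Globus resource from a snapshot of the kind of information the GIS might publish. The record fields (free_cpus, queued_jobs), the example.org contact strings and the ranking rule are all assumptions made for this example; the slides do not specify them.

```python
from dataclasses import dataclass


@dataclass
class Resource:
    contact: str      # hypothetical GRAM contact string
    lrms: str         # local resource management system behind the GRAM
    free_cpus: int    # assumed to be published in the GIS
    queued_jobs: int  # assumed to be published in the GIS


def choose_resource(resources):
    """Pick the resource with the most free CPUs; break ties by shortest queue.

    Returns None when no resource has a free CPU.
    """
    usable = [r for r in resources if r.free_cpus > 0]
    if not usable:
        return None
    return max(usable, key=lambda r: (r.free_cpus, -r.queued_jobs))


# Hypothetical snapshot for the three sites of the architecture slide.
snapshot = [
    Resource("site1.example.org/jobmanager-condor", "CONDOR", 12, 3),
    Resource("site2.example.org/jobmanager-lsf", "LSF", 40, 25),
    Resource("site3.example.org/jobmanager-pbs", "PBS", 0, 7),
]

best = choose_resource(snapshot)
print(best.contact if best else "no resource available")
```

A real Master would presumably weigh more than idle CPUs, but the shape of the decision (query the GIS, rank, submit) stays the same.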

5
Globus GRAM
  • Fixed problems
      • I/O with vanilla Condor jobs
      • globus-job-status with LSF and Condor
      • Publishing of Globus LSF and Condor jobs in the
        GIS
  • Open problems
      • Submission of multiple instances of the same job
        to an LSF cluster
          • Necessary to modify the Globus LSF scripts
      • Scalability
      • Fault tolerance

6
Globus GRAM Architecture
[Diagram: a Client contacts the Globus front-end machine, where a Jobmanager passes the Job to the local resource management system (LSF / Condor / PBS / ...).]
  • globusrun -b -r lxpd.pd.infn.it/jobmanager-lsf \
    -f file.rsl
  • file.rsl
      • (executable=$(CMS)/startcmsim.sh)
      • (stdin=$(CMS)/Pythia/inp)
      • (stdout=$(CMS)/Cmsim/out)
      • (count=1)
      • (queue=cmsprod)

7
Scalability
  • One jobmanager for each globusrun
  • What if I want to submit 1000 jobs?
      • 1000 globusrun commands
      • 1000 jobmanagers running on the front-end
        machine!
  • globusrun -b -r lxpd.pd.infn.it/jobmanager-lsf \
    -f file.rsl
  • file.rsl
      • (executable=$(CMS)/startcmsim.sh)
      • (stdin=$(CMS)/Pythia/inp)
      • (stdout=$(CMS)/Cmsim/out)
      • (count=1000)
      • (queue=cmsprod)
  • Problems with LSF
  • It is not possible to specify in the RSL file
    1000 different input files and 1000 different
    output files
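
To make the trade-off concrete, the sketch below builds one RSL description and one globusrun command line per job, which is the only way to give each job its own input and output file, and which therefore costs one jobmanager per job. The numbered file names (inp1, out1, file1.rsl, ...) are a hypothetical naming scheme, not taken from the slides; nothing is written to disk or executed.

```python
def make_rsl(job_index, queue="cmsprod"):
    """Return the RSL text for a single job with its own stdin and stdout.

    The inp<N>/out<N> suffixes are a hypothetical naming convention.
    """
    return (
        "(executable=$(CMS)/startcmsim.sh)"
        f"(stdin=$(CMS)/Pythia/inp{job_index})"
        f"(stdout=$(CMS)/Cmsim/out{job_index})"
        "(count=1)"
        f"(queue={queue})"
    )


def make_submissions(n_jobs, contact="lxpd.pd.infn.it/jobmanager-lsf"):
    """Return one (rsl_file_name, rsl_text, argv) triple per job."""
    submissions = []
    for i in range(1, n_jobs + 1):
        rsl_name = f"file{i}.rsl"  # hypothetical file name
        argv = ["globusrun", "-b", "-r", contact, "-f", rsl_name]
        submissions.append((rsl_name, make_rsl(i), argv))
    return submissions


submissions = make_submissions(1000)
# One globusrun per job means one jobmanager per job on the front-end.
print(len(submissions), "globusrun invocations")
print(" ".join(submissions[0][2]))
print(submissions[0][1])
```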

8
Fault tolerance
  • The jobmanager is not persistent
  • If the jobmanager can't be contacted, Globus
    assumes that the job(s) have been completed
  • Example
      • Submission of n jobs on an LSF cluster
      • Reboot of the front-end machine
      • The jobmanager(s) no longer run
      • Orphan jobs -> Globus assumes that the jobs have
        been successfully completed
  • Globus is not able to tell whether a job exited
    normally or no longer runs because of a problem
    (e.g. a crash of the executing machine) and
    therefore must be re-submitted
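
The ambiguity can be sketched in a few lines. The code below is only a model of the behaviour described above, not Globus code: the Contact enum and both functions are hypothetical names introduced for illustration.

```python
from enum import Enum


class Contact(Enum):
    """What the client observes when it asks the jobmanager about a job."""
    REPORTS_ACTIVE = "jobmanager answers: job still active"
    REPORTS_DONE = "jobmanager answers: job done"
    UNREACHABLE = "jobmanager cannot be contacted"


def assumed_status(contact):
    """Status as described on the slide: an unreachable jobmanager
    is treated the same as a completed job."""
    if contact is Contact.REPORTS_ACTIVE:
        return "ACTIVE"
    return "DONE"


def outcome_is_ambiguous(contact):
    """True when 'DONE' may hide an orphaned or crashed job that
    would actually need to be re-submitted."""
    return contact is Contact.UNREACHABLE


for contact in Contact:
    print(f"{contact.value}: {assumed_status(contact)}, "
          f"ambiguous={outcome_is_ambiguous(contact)}")
```

After a reboot of the front-end machine every job falls into the last case, so the client has no basis for deciding which jobs to re-submit.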

9
Globus Universe
  • Condor-G tested with
      • Workstation using the fork system call
      • LSF cluster
      • Condor pool
  • Submission (condor_submit), monitoring
    (condor_q) and removal (condor_rm) seem to work
    fine, but ...

10
Globus Universe problems
  • It is not possible to keep the input/output/error
    files on the submitting machine
  • Errors are very difficult to diagnose
  • Condor-G is not able to provide fault tolerance
    and robustness (because Globus doesn't provide
    these features)
  • Fault tolerance only on the submitting side
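
For orientation, a Globus Universe submission of the kind tested here might be described to condor_submit roughly as rendered by the sketch below. The keywords universe = globus and globusscheduler are given as understood for Condor-G of that era and should be checked against the manual for the version in use; the executable and log file names are hypothetical. No input/output/error files are listed, in line with the first problem above.

```python
def submit_description(executable, globusscheduler, log_file, n_jobs=1):
    """Render the text of a Condor-G submit description file.

    Only builds a string; nothing is submitted.
    """
    lines = [
        "universe        = globus",
        f"globusscheduler = {globusscheduler}",
        f"executable      = {executable}",
        f"log             = {log_file}",
        f"queue {n_jobs}",
    ]
    return "\n".join(lines) + "\n"


text = submit_description(
    executable="startcmsim.sh",                         # hypothetical name
    globusscheduler="lxpd.pd.infn.it/jobmanager-lsf",   # contact from the slides
    log_file="cmsim.log",                               # hypothetical name
)
print(text)
```

The resulting file would be passed to condor_submit, after which condor_q and condor_rm operate on the job as on any other Condor job.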

11
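Condor-G Architecture
[Diagram: on the Personal Condor (Globus Client), condor_submit -> globusrun towards the Globus front-end machine, where a Jobmanager passes the Job to the local resource management system (LSF / Condor / PBS / ...); job status is obtained by polling (globus_job_status).]

12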
Possible solutions
  • Some improvements foreseen with Condor 6.3 (but
    they will not solve all the problems)
  • Persistent Globus jobmanager
      • ???
  • Direct interaction between Condor and local
    resource management systems (LSF)
      • Necessary to modify the Condor startd
  • GlideIn
      • Only ready-to-use solution if robustness is
        considered a fundamental requirement

13
GlideIn
  • Condor daemons run on Globus resources
  • Local resource management systems used only to
    run Condor daemons
  • Robustness and fault tolerance
  • Use of Condor matchmaking system
  • Viable solution if the goal is just to find idle
    CPUs
  • But what if we have to take other parameters
    into account (e.g. location of input files)?
  • Various changes have been necessary in the
    condor_glidein script

14
GlideIn
  • GlideIn tested with
      • Workstation using the fork system call as job
        manager
          • Seems to work
      • Condor pool
          • Seems to work
          • Condor flocking is a better solution if
            authentication is not required
      • LSF cluster
          • Problems (because Globus assumes SMP machines
            managed by LSF, while there are some problems
            with clusters)
          • Necessary to modify the Globus LSF scripts

15
Conclusions
  • Major problems with scalability and fault
    tolerance in Globus
  • Necessary to re-implement the GRAM service
  • The foreseen architecture doesn't work
  • Personal Condor able to provide robustness only
    on the submitting side