1
Grid Infrastructure for Caltech CMS Distributed
Production on Alliance Resources
  • Vladimir Litvin, Harvey Newman
  • Caltech CMS
  • Scott Koranda, Bruce Loftis, John Towns
  • NCSA
  • Miron Livny, Peter Couvares, Todd Tannenbaum,
    Jamie Frey
  • Wisconsin Condor

2
CMS Physics
  • The CMS detector at the LHC will probe
    fundamental forces in our Universe and search for
    the yet undetected Higgs Boson
  • Detector expected to come online in 2006

3
CMS Physics
4
ENORMOUS Data Challenges
  • One second of CMS running will produce a data
    volume equivalent to 10,000 copies of the
    Encyclopaedia Britannica
  • Data rate handled by the CMS event builder (500
    Gbit/s) will be equivalent to the amount of data
    currently exchanged by the world's telecom
    networks
  • Number of processors in the CMS event filter will
    equal number of workstations at CERN today (4000)

5
Leveraging Alliance Grid Resources
  • The Caltech CMS group is using Alliance Grid
    resources today for detector simulation and data
    processing prototyping
  • Even during this simulation and prototyping phase
    the computational and data challenges are
    substantial

6
Challenges of a CMS Run
  • CMS run naturally divided into two phases
    • Monte Carlo detector response simulation
      • 100s of jobs per run
      • each generating 1 GB
      • all data passed to next phase and archived
    • physics reconstruction from the simulated data
      • 100s of jobs per run
      • jobs coupled via Objectivity database access
      • 100 GB data archived
  • Specific challenges
    • each run generates 100 GB of data to be moved
      and archived
    • many, many runs necessary
    • simulation and reconstruction jobs run at
      different sites
    • large human effort in starting and monitoring
      jobs and moving data

7
Meeting Challenge With Globus and Condor
  • Globus
    • middleware deployed across entire Alliance Grid
    • remote access to computational resources
    • dependable, robust, automated data transfer
  • Condor
    • strong fault tolerance including checkpointing
      and migration
    • job scheduling across multiple resources
    • layered over Globus as personal batch system
      for the Grid
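
For orientation, a minimal command-line sketch of the
Globus services named above (the hostname is a
placeholder); Condor layers its scheduling and fault
tolerance on top of these:

  # create a short-lived proxy from the user's digital certificate
  grid-proxy-init
  # run a command on a remote Globus-enabled host through its jobmanager
  # (hostname is illustrative)
  globus-job-run some-cluster.ncsa.uiuc.edu /bin/hostname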

8
CMS Run on the Alliance Grid
  • Caltech CMS staff prepares input files on local
    workstation
  • Pushes one button to launch master Condor job
  • Input files transferred by master Condor job to
    Wisconsin Condor pool (700 CPUs) using Globus
    GASS file transfer

[Diagram: Caltech workstation -> WI Condor pool, input files transferred via Globus GASS]
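
This transfer is driven by the master job's submit
description (shown in full under "Condor Details for
Experts"); in this fragment Condor-G ships the listed
executable and input file to Wisconsin over its
built-in GASS transfer:

  universe        = globus
  globusscheduler = beak.cs.wisc.edu/jobmanager-condor-INTEL-LINUX
  executable      = CMS/condor_dagman_run
  input           = CMS/hg_90.tar.gz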
9
CMS Run on the Alliance Grid
  • Master Condor job at Caltech launches secondary
    Condor job on Wisconsin pool
  • Secondary Condor job launches 100 Monte Carlo
    jobs on Wisconsin pool (sketched below)
    • each runs 12-24 hours
    • each generates 1 GB of data
    • Condor handles checkpointing and migration
    • no staff intervention
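
A sketch of what the submit description for the 100
Monte Carlo jobs could look like (executable name and
arguments are hypothetical); Condor's standard
universe is what provides the checkpointing and
migration:

  universe   = standard
  executable = cmsim.condor       # hypothetical name, relinked with condor_compile
  arguments  = $(Process)         # hypothetical per-job index/seed
  output     = mc_$(Process).out
  error      = mc_$(Process).err
  log        = mc.log
  queue 100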

10
CMS Run on the Alliance Grid
  • When each Monte Carlo job completes, data is
    automatically transferred to UniTree at NCSA
    • each file 1 GB
    • transferred using the Globus-enabled FTP client
      gsiftp (sketched below)
    • NCSA UniTree runs a Globus-enabled FTP server
    • authentication to the FTP server on the user's
      behalf using a digital certificate

[Diagram: 100 Monte Carlo jobs on Wisconsin Condor pool -> 100 data files (1 GB each) transferred via gsiftp -> NCSA UniTree with Globus-enabled FTP server]
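
A sketch of one such transfer from the command line,
shown here with globus-url-copy, a Globus-enabled FTP
client standing in for the gsiftp client named above
(hostname and paths are illustrative); the proxy
created by grid-proxy-init supplies the
certificate-based authentication:

  globus-url-copy file:///scratch/cms/hg_90_sim_632.fz \
      gsiftp://unitree.ncsa.uiuc.edu/u/cms/prod/hg_90_sim_632.fz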
11
CMS Run on the Alliance Grid
  • When all Monte Carlo jobs complete, the secondary
    Condor job reports to the Master Condor at Caltech
  • Master Condor at Caltech launches a job to stage
    data from NCSA UniTree to the NCSA Linux cluster
    • job launched via Globus jobmanager on the
      cluster (sketched below)
    • data transferred using Globus-enabled FTP
    • authentication on the user's behalf using a
      digital certificate

[Diagram: Master starts job via Globus jobmanager on cluster to stage data]
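
A sketch of the staging job's submit description (the
cluster contact string and staging script are
hypothetical); the globus universe routes it through
the cluster's Globus jobmanager:

  universe        = globus
  globusscheduler = cluster.ncsa.uiuc.edu/jobmanager   # hypothetical contact string
  executable      = CMS/stage_from_unitree.csh         # hypothetical staging script
  output          = CMS/stage.out
  error           = CMS/stage.err
  log             = CMS/condor.log
  queue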
12
CMS Run on the Alliance Grid
  • Master Condor at Caltech launches physics
    reconstruction jobs on the NCSA Linux cluster
    • jobs launched via Globus jobmanager on the
      cluster
    • Master Condor continually monitors the jobs and
      logs progress locally at Caltech (sketched below)
    • no user intervention required
    • authentication on the user's behalf using a
      digital certificate

[Diagram: Master starts reconstruction jobs via Globus jobmanager on cluster]
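
The monitoring needs no special tooling; for example,
from the Caltech side the staff can watch the Condor
queue and the local user log named in the submit
description:

  condor_q                  # state of the jobs Condor-G is managing
  tail -f CMS/condor.log    # progress events logged locally at Caltech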
13
CMS Run on the Alliance Grid
  • When reconstruction jobs complete, data is
    automatically archived to NCSA UniTree
    • data transferred using Globus-enabled FTP
  • After data is transferred, the run is complete and
    Master Condor at Caltech emails notification to
    staff (fragment below)

[Diagram: data files transferred via gsiftp to UniTree for archiving]
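
The e-mail comes from Condor's own notification
mechanism; an illustrative fragment of the master
submit description (the address is a placeholder):

  notification = Always
  notify_user  = cms-production@example.edu   # placeholder address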
14
Production Data
  • 7 signal data sets of 50,000 events each have been
    simulated and reconstructed without pileup and
    with low luminosity (ORCA 4.3.2 and 4.4.0)
  • A large QCD background data set (1M events) has
    been simulated through this system
  • Data has been stored in both NCSA UniTree and
    Caltech HPSS

15
Condor Details for Experts
  • Use Condor-G
    • Condor + Globus
    • allows Condor to submit jobs to a remote host via
      a Globus jobmanager
    • any Globus-enabled host reachable (with
      authorization)
    • Condor jobs run in the Globus universe
    • use familiar Condor ClassAds for submitting jobs

  universe          = globus
  globusscheduler   = beak.cs.wisc.edu/jobmanager-condor-INTEL-LINUX
  environment       = CONDOR_UNIVERSE=scheduler
  executable        = CMS/condor_dagman_run
  arguments         = -f -t -l . -Lockfile cms.lock -Condorlog cms.log -Dag cms.dag -Rescue cms.rescue
  input             = CMS/hg_90.tar.gz
  remote_initialdir = Prod2001
  output            = CMS/hg_90.out
  error             = CMS/hg_90.err
  log               = CMS/condor.log
  notification      = always
  queue
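
This description runs the DAGMan wrapper as a Globus
universe job at Wisconsin; from the Caltech
workstation it is handed to condor_submit in the
usual way (the file name is illustrative):

  condor_submit cms_hg_90.cmd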
16
Condor Details for Experts
  • Exploit Condor DAGMan
    • DAG = directed acyclic graph
    • submission of Condor jobs based on dependencies
    • job B runs only after job A completes, job D runs
      only after job C completes, job E only after
      A, B, C, and D complete
    • includes both pre- and post-job script execution
      for data staging, cleanup, or the like
  Job jobA_632 Prod2000/hg_90_gen_632.cdr
  Job jobB_632 Prod2000/hg_90_sim_632.cdr
  Script pre jobA_632 Prod2000/pre_632.csh
  Script post jobB_632 Prod2000/post_632.csh
  PARENT jobA_632 CHILD jobB_632
  Job jobA_633 Prod2000/hg_90_gen_633.cdr
  Job jobB_633 Prod2000/hg_90_sim_633.cdr
  Script pre jobA_633 Prod2000/pre_633.csh
  Script post jobB_633 Prod2000/post_633.csh
  PARENT jobA_633 CHILD jobB_633
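
A DAG file like this is normally handed to DAGMan with
condor_submit_dag, which generates and submits the
condor_dagman job that enforces the dependencies (here
that role is played by the wrapper job shown on the
previous slide):

  condor_submit_dag cms.dag   # cms.dag as named in the submit description above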

17
Future Directions
  • Include the Alliance LosLobos Linux cluster at
    AHPCC in two ways
    • add a path so that physics reconstruction jobs
      may run on the Alliance LosLobos Linux cluster
      at AHPCC in addition to the NCSA cluster
    • allow Monte Carlo jobs at Wisconsin to glide
      into LosLobos
  • Merge with MOP (FNAL)

[Diagram: 75 Monte Carlo jobs on Wisconsin Condor pool, 25 Monte Carlo jobs on LosLobos via Condor glide-in]