1
Grid Infrastructure for Caltech CMS Distributed
Production on Alliance Resources
  • Vladimir Litvin, Harvey Newman
  • Caltech CMS
  • Scott Koranda, Bruce Loftis, John Towns
  • NCSA
  • Miron Livny, Peter Couvares, Todd Tannenbaum,
    Jamie Frey
  • Wisconsin Condor

2
CMS Physics
  • The CMS detector at the LHC will probe
    fundamental forces in our Universe and search for
    the yet undetected Higgs Boson
  • Detector expected to come online in 2006

3
CMS Physics
4
ENORMOUS Data Challenges
  • One second of CMS running will produce a data
    volume equivalent to 10,000 copies of the
    Encyclopaedia Britannica
  • Data rate handled by the CMS event builder (500
    Gbit/s) will be equivalent to the amount of data
    currently exchanged by the world's telecom
    networks
  • Number of processors in the CMS event filter will
    equal number of workstations at CERN today (4000)

5
Leveraging Alliance Grid Resources
  • The Caltech CMS group is using Alliance Grid
    resources today for detector simulation and data
    processing prototyping
  • Even during this simulation and prototyping phase
    the computational and data challenges are
    substantial

6
Challenges of a CMS Run
  • CMS run naturally divided into two phases
    • Monte Carlo detector response simulation
      • 100s of jobs per run
      • each generating 1 GB
      • all data passed to next phase and archived
    • physics reconstruction from the simulated data
      • 100s of jobs per run
      • jobs coupled via Objectivity database access
      • 100 GB data archived
  • Specific challenges
    • each run generates 100 GB of data to be moved
      and archived
    • many, many runs necessary
    • simulation and reconstruction jobs run at
      different sites
    • large human effort in starting and monitoring
      jobs and moving data

7
Meeting Challenge With Globus and Condor
  • Globus
    • middleware deployed across entire Alliance Grid
    • remote access to computational resources
    • dependable, robust, automated data transfer
  • Condor
    • strong fault tolerance including checkpointing
      and migration
    • job scheduling across multiple resources
    • layered over Globus as personal batch system
      for the Grid
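
For orientation, a minimal command-line sketch of the
Globus services named above (the hostname is a
placeholder); Condor layers its scheduling and fault
tolerance on top of these:

  # create a short-lived proxy from the user's digital certificate
  grid-proxy-init
  # run a command on a remote Globus-enabled host through its jobmanager
  # (hostname is illustrative)
  globus-job-run some-cluster.ncsa.uiuc.edu /bin/hostname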

8
CMS Run on the Alliance Grid
  • Caltech CMS staff prepares input files on local
    workstation
  • Pushes one button to launch master Condor job
  • Input files transferred by master Condor job to
    Wisconsin Condor pool (700 CPUs) using Globus
    GASS file transfer

[Diagram: Caltech workstation -> WI Condor pool, input files transferred via Globus GASS]
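
This transfer is driven by the master job's submit
description (shown in full under "Condor Details for
Experts"); in this fragment Condor-G ships the listed
executable and input file to Wisconsin over its
built-in GASS transfer:

  universe        = globus
  globusscheduler = beak.cs.wisc.edu/jobmanager-condor-INTEL-LINUX
  executable      = CMS/condor_dagman_run
  input           = CMS/hg_90.tar.gz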
9
CMS Run on the Alliance Grid
  • Master Condor job at Caltech launches secondary
    Condor job on Wisconsin pool
  • Secondary Condor job launches 100 Monte Carlo
    jobs on Wisconsin pool (sketched below)
    • each runs 12-24 hours
    • each generates 1 GB of data
    • Condor handles checkpointing and migration
    • no staff intervention
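
A sketch of what the submit description for the 100
Monte Carlo jobs could look like (executable name and
arguments are hypothetical); Condor's standard
universe is what provides the checkpointing and
migration:

  universe   = standard
  executable = cmsim.condor       # hypothetical name, relinked with condor_compile
  arguments  = $(Process)         # hypothetical per-job index/seed
  output     = mc_$(Process).out
  error      = mc_$(Process).err
  log        = mc.log
  queue 100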

10
CMS Run on the Alliance Grid
  • When each Monte Carlo job completes, data is
    automatically transferred to UniTree at NCSA
    • each file 1 GB
    • transferred using the Globus-enabled FTP client
      gsiftp (sketched below)
    • NCSA UniTree runs a Globus-enabled FTP server
    • authentication to the FTP server on the user's
      behalf using a digital certificate

[Diagram: 100 Monte Carlo jobs on Wisconsin Condor pool -> 100 data files (1 GB each) transferred via gsiftp -> NCSA UniTree with Globus-enabled FTP server]
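
A sketch of one such transfer from the command line,
shown here with globus-url-copy, a Globus-enabled FTP
client standing in for the gsiftp client named above
(hostname and paths are illustrative); the proxy
created by grid-proxy-init supplies the
certificate-based authentication:

  globus-url-copy file:///scratch/cms/hg_90_sim_632.fz \
      gsiftp://unitree.ncsa.uiuc.edu/u/cms/prod/hg_90_sim_632.fz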
11
CMS Run on the Alliance Grid
  • When all Monte Carlo jobs complete, the secondary
    Condor job reports to the Master Condor at Caltech
  • Master Condor at Caltech launches a job to stage
    data from NCSA UniTree to the NCSA Linux cluster
    • job launched via Globus jobmanager on the
      cluster (sketched below)
    • data transferred using Globus-enabled FTP
    • authentication on the user's behalf using a
      digital certificate

[Diagram: Master starts job via Globus jobmanager on cluster to stage data]
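
A sketch of the staging job's submit description (the
cluster contact string and staging script are
hypothetical); the globus universe routes it through
the cluster's Globus jobmanager:

  universe        = globus
  globusscheduler = cluster.ncsa.uiuc.edu/jobmanager   # hypothetical contact string
  executable      = CMS/stage_from_unitree.csh         # hypothetical staging script
  output          = CMS/stage.out
  error           = CMS/stage.err
  log             = CMS/condor.log
  queue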
12
CMS Run on the Alliance Grid
  • Master Condor at Caltech launches physics
    reconstruction jobs on the NCSA Linux cluster
    • jobs launched via Globus jobmanager on the
      cluster
    • Master Condor continually monitors the jobs and
      logs progress locally at Caltech (sketched below)
    • no user intervention required
    • authentication on the user's behalf using a
      digital certificate

[Diagram: Master starts reconstruction jobs via Globus jobmanager on cluster]
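
The monitoring needs no special tooling; for example,
from the Caltech side the staff can watch the Condor
queue and the local user log named in the submit
description:

  condor_q                  # state of the jobs Condor-G is managing
  tail -f CMS/condor.log    # progress events logged locally at Caltech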
13
CMS Run on the Alliance Grid
  • When reconstruction jobs complete, data is
    automatically archived to NCSA UniTree
    • data transferred using Globus-enabled FTP
  • After data is transferred, the run is complete and
    Master Condor at Caltech emails notification to
    staff (fragment below)

[Diagram: data files transferred via gsiftp to UniTree for archiving]
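
The e-mail comes from Condor's own notification
mechanism; an illustrative fragment of the master
submit description (the address is a placeholder):

  notification = Always
  notify_user  = cms-production@example.edu   # placeholder address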
14
Production Data
  • 7 signal data sets of 50,000 events each have been
    simulated and reconstructed without pileup and
    with low luminosity (ORCA 4.3.2 and 4.4.0)
  • A large QCD background data set (1M events) has
    been simulated through this system
  • Data has been stored in both NCSA UniTree and
    Caltech HPSS

15
Condor Details for Experts
  • Use Condor-G
    • Condor + Globus
    • allows Condor to submit jobs to a remote host via
      a Globus jobmanager
    • any Globus-enabled host reachable (with
      authorization)
    • Condor jobs run in the Globus universe
    • use familiar Condor ClassAds for submitting jobs

  universe          = globus
  globusscheduler   = beak.cs.wisc.edu/jobmanager-condor-INTEL-LINUX
  environment       = CONDOR_UNIVERSE=scheduler
  executable        = CMS/condor_dagman_run
  arguments         = -f -t -l . -Lockfile cms.lock -Condorlog cms.log -Dag cms.dag -Rescue cms.rescue
  input             = CMS/hg_90.tar.gz
  remote_initialdir = Prod2001
  output            = CMS/hg_90.out
  error             = CMS/hg_90.err
  log               = CMS/condor.log
  notification      = always
  queue
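
This description runs the DAGMan wrapper as a Globus
universe job at Wisconsin; from the Caltech
workstation it is handed to condor_submit in the
usual way (the file name is illustrative):

  condor_submit cms_hg_90.cmd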
16
Condor Details for Experts
  • Exploit Condor DAGMan
    • DAG = directed acyclic graph
    • submission of Condor jobs based on dependencies
    • job B runs only after job A completes, job D runs
      only after job C completes, job E only after
      A, B, C, and D complete
    • includes both pre- and post-job script execution
      for data staging, cleanup, or the like
  Job jobA_632 Prod2000/hg_90_gen_632.cdr
  Job jobB_632 Prod2000/hg_90_sim_632.cdr
  Script pre jobA_632 Prod2000/pre_632.csh
  Script post jobB_632 Prod2000/post_632.csh
  PARENT jobA_632 CHILD jobB_632
  Job jobA_633 Prod2000/hg_90_gen_633.cdr
  Job jobB_633 Prod2000/hg_90_sim_633.cdr
  Script pre jobA_633 Prod2000/pre_633.csh
  Script post jobB_633 Prod2000/post_633.csh
  PARENT jobA_633 CHILD jobB_633
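
A DAG file like this is normally handed to DAGMan with
condor_submit_dag, which generates and submits the
condor_dagman job that enforces the dependencies (here
that role is played by the wrapper job shown on the
previous slide):

  condor_submit_dag cms.dag   # cms.dag as named in the submit description above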

17
Future Directions
  • Include the Alliance LosLobos Linux cluster at
    AHPCC in two ways
    • add a path so that physics reconstruction jobs
      may run on the Alliance LosLobos Linux cluster
      at AHPCC in addition to the NCSA cluster
    • allow Monte Carlo jobs at Wisconsin to glide
      into LosLobos
  • Merge with MOP (FNAL)

[Diagram: 75 Monte Carlo jobs on Wisconsin Condor pool, 25 Monte Carlo jobs on LosLobos via Condor glide-in]