1
Infrastructure for CMS Production Runs on
NCSA/Alliance Resources: A Prototype
  • Vladimir Litvin, Caltech HEP
  • Scott Koranda, NCSA and University of
    Wisconsin-Milwaukee

2
Why build this prototype now?
  • NCSA/Alliance wants to help US CMS use Alliance
    resources as efficiently as possible (of course)
  • but a successful distributed run is also an
    important part of the upcoming NCSA proposal for
    the DTF
  • given roughly two weeks, start to finish, to make
    the success story happen!

3
Distributed Terascale Facility
  • Alliance and NPACI proposal to NSF (April 19):
    NCSA (UIUC), SDSC, Argonne, Caltech linked at
    OC-192
  • To deploy a DTF based on Linux clusters,
    large-scale data archives and high-bandwidth
    national networks
  • Atop the DTF hardware, deploy a TeraGrid: a new
    unified model of distributed data analysis,
    computing and communication for science
  • Integration partners: IBM, Intel, Qwest
  • Four complementary foci:
  • Computing-intensive applications (NCSA):
    6 TF IA-64, Myrinet, > 100 TB disk, 1 PB tertiary
  • Data-intensive applications (SDSC):
    4 TF Linux cluster, > 100 TB disk, multi-PB
    tertiary
  • Remote rendering and visualization (Argonne):
    Linux clusters and graphics cards serving remote
    imagery
  • Applications consortia (Caltech)
  • Software: Linux and vendor (IBM) cluster software;
    Globus, Condor and other Grid tools

4
Build on top of Condor/Globus
  • Leverage new Condor-G from Wisc
  • a personal Condor with the ability to submit
    universe = globus jobs
  • use Condor to launch jobs through the Globus
    gatekeeper
  • Includes Condor DAGMan
  • Directed Acyclic Graph Manager, a meta-scheduler
  • graph described using parent-child relationships
    (see the sketch below)
  • pre- and post-scripts for each job
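
For illustration (not from the slides), a minimal DAGMan
input file looks like this; the job and script names are
hypothetical:

    # two-node DAG: B runs only after A succeeds
    JOB A a.sub
    JOB B b.sub
    # pre-script runs before A is submitted,
    # post-script after A exits
    SCRIPT PRE  A stage_inputs.sh
    SCRIPT POST A check_a.sh
    PARENT A CHILD B

When a POST script is present, its exit code determines
whether DAGMan considers the node successful.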

5
What we started with
  • CMSIM already running directly on the Wisc Condor
    pool
  • usually runs of 100 Condor jobs
  • each of approximately 500 events
  • NCSA UniTree running a GSI-enabled FTP server
  • 2 TB disk cache
  • http://www.ncsa.uiuc.edu/SCD/Hardware/UniTree
  • 32 nodes (64 processors) of 1 GHz IBM Linux
    plugged in but never yet used
  • 1 frame of the 1024-processor Platinum cluster
  • friendly-user status next week, full production in
    May
  • login node with Globus 1.1.4

6
Timeline for a Run
  • At Caltech, submit a Condor DAGMan job
  • two jobs in the DAG (see the sketch below)
  • Job 1: launch the Wisc part of the run
  • CMSIM
  • transfer zebra data files to NCSA UniTree
  • Job 2: launch the NCSA part of the run
  • retrieve zebra data files from UniTree
  • ooHits
  • ooDigis (ORCA reconstruction)
  • post-script to run after the Wisc job
  • DAGMan fires up Job 1
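
A sketch of the Caltech-side DAG just described; the
two-node structure and the Job 1 post-script come from the
slides, while the file names are hypothetical:

    # cmsprod.dag (hypothetical name): the whole run
    # Job 1: run the 100-job CMSIM DAG at Wisc and
    # push the zebra files to UniTree
    JOB wisc wisc.sub
    # Job 2: retrieve zebra files at NCSA, run
    # ooHits and ooDigis
    JOB ncsa ncsa.sub
    # post-script checks Job 1 really succeeded
    SCRIPT POST wisc check_wisc.sh
    PARENT wisc CHILD ncsa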

7
Timeline for a Run
  • Job 1 is the Wisc part of the run
  • itself a Condor job:
  • universe = globus
  • globusscheduler = beak.cs.wisc.edu/jobmanager-INTEL-LINUX
  • executable = condor_dagman
  • input = <file including all CMSIM inputs and DAG>
  • output for this job logged to the Caltech machine
  • the DAG run at Wisc is for all 100 CMSIM jobs
  • at the end of each CMSIM job a post-script runs to
    gsincftpput the file to NCSA UniTree
  • an X.509 proxy cert authenticates on the user's
    behalf
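
Reconstructing the attributes above into a Condor-G submit
description gives roughly the following sketch; the host
and jobmanager name come from the slide, the file names
are hypothetical, and condor_dagman's command-line
arguments are omitted:

    # wisc.sub (sketch): submit description for Job 1
    universe        = globus
    globusscheduler = beak.cs.wisc.edu/jobmanager-INTEL-LINUX
    executable      = condor_dagman
    # the <file including all CMSIM inputs and DAG>
    input           = cmsim_run.in
    # stdout/stderr/log come back to Caltech machine
    output          = job1.out
    error           = job1.err
    log             = job1.log
    queue

Each CMSIM node's post-script then pushes its zebra file
to UniTree with something like the line below (ncftpput-
style arguments; the host and paths are hypothetical):

    gsincftpput unitree.ncsa.uiuc.edu /cms/zebra cmsim_042.fz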

8
Timeline for a Run
  • Job 1 at Wisc completes; DAGMan at Caltech runs
    the Job 1 post-script
  • since Globus doesn't (currently!) pass along the
    return value, we need to check whether Job 1
    succeeded
  • the post-script checks a log file on the Caltech
    machine for success (see the sketch below)
  • DAGMan then starts Job 2 at NCSA
  • Job 2 is another Condor job:
  • universe = globus
  • globusscheduler = posic.ncsa.uiuc.edu
  • executable = <script on posic>
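
A minimal sketch of that post-script; the slides say only
that it greps a log on the Caltech machine, so the log
name and success marker here are hypothetical:

    #!/bin/sh
    # check_wisc.sh (sketch): POST script for Job 1.
    # Globus doesn't pass the remote exit status back,
    # so grep the log written back to Caltech for a
    # success marker.
    if grep -q 'All jobs Completed' job1.out
    then
        exit 0   # node succeeds; DAGMan starts Job 2
    else
        exit 1   # node fails; Job 2 is never started
    fi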

9
Timeline for a Run
  • Why not use jobmanager-PBS?
  • it turned out that the PBS installation was a bit
    customized
  • somewhat of a culture issue: Globus admins are not
    always the same as the systems admins
  • the default Globus scripts for submitting PBS jobs
    wouldn't work
  • no time to customize, so punt and use the default
    fork jobmanager (a one-line difference, shown
    below)
  • in the future, definitely fix this!
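
The difference is one line in the Job 2 submit
description, following the Globus jobmanager naming
convention (the PBS variant is what we would have used had
its submit scripts worked on this host):

    # default fork jobmanager on login node (used)
    globusscheduler = posic.ncsa.uiuc.edu
    # PBS jobmanager (wanted, once it works)
    # globusscheduler = posic.ncsa.uiuc.edu/jobmanager-pbs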

10
Timeline for a Run
  • The executable run through the Globus jobmanager
    was a script prepared ahead of time (sketched
    below)
  • In the future:
  • have direct access to the batch system (PBS) for
    better control over the entire process, and to
    leverage more of Condor
  • necessary non-data input files prepared and
    transferred directly from Caltech
  • the obvious goal: no need to connect to a Wisc or
    NCSA machine and do anything by hand
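
A sketch of what that pre-staged script might look like;
the UniTree retrieval, ooHits, and ooDigis steps come from
the slides, while the host, paths, and invocation details
are hypothetical (gsincftpget is assumed to exist as the
companion of the gsincftpput used at Wisc):

    #!/bin/sh
    # run on posic via the fork jobmanager (sketch)
    # 1. pull zebra data files back from NCSA UniTree
    gsincftpget unitree.ncsa.uiuc.edu . /cms/zebra/cmsim_042.fz
    # 2. run the ORCA reconstruction chain
    ooHits  > oohits.log  2>&1
    ooDigis > oodigis.log 2>&1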

11
Future Directions
  • Automatic preparation for launch
  • add the Alliance LosLobos cluster as a resource
  • 512-processor (733 MHz) Linux cluster
  • use it both for CMSIM, by gliding in to the Wisc
    Condor pool, and for ORCA reconstruction
  • add third-party file transfers so Condor-G at
    Caltech manages the Wisc-to-NCSA data transfer
  • more sophisticated monitoring and logging