Dave Newbold, University of Bristol, 24/6/2003

1
CMS MC production tools
  • A lot of work in this area recently!
  • Context: PCP03 (100 TB) has just started
  • Short-term: development team of 10 people; core
    deployment team of 10 people? (incl. UK).
  • New generation of tools
  • Based upon the existing distributed toolset:
    IMPALA, BOSS, RefDB
  • Evolution draws from experience gained in DC02
  • Not explicitly designed for use on the LCG testbed,
    but intended to operate on the Grid later (experience
    from the CMS EDG stress test, etc.)
  • New umbrella project: OCTOPUS
  • Covers all CMS distributed production and Grid
    tools
  • Overtly Contrived Toolkit of Previously
    Unrelated Stuff?
  • Oh Crap Time to Operate Production
    Uber-Software
  • Formal support system / bug tracking now in place
    (via Savannah)
  • Our worldwide Octopus has more than eight arms

2
The problems to solve
  • The nature of CMS production
  • Highly distributed (30 sites)
  • Some sites have MUCH more resources (kit, people)
    than others
  • We produce useful data, so DQM (data quality
    monitoring) is very important
  • The application chain is somewhat complex
  • Different event types require different
    processing chains
  • High-lumi background simulation presents some
    special problems
  • Some key issues:
  • Communication (fortnightly VRVS meetings, very
    useful)
  • Documentation, support for installation and use
    of tools
  • Adaptability of production system to local
    conditions (now easier)
  • Real-time data and metadata validation
  • Data storage and migration between sites (data is
    NOT bunged off to CERN)
  • Hotspots in the distributed computing system (CERN,
    RAL, FNAL)

3
Core user-side toolset
  • McRunjob: generic Python local production
    framework
  • Originally a D0 tool; D0 and CMS versions almost
    merged
  • Glues together the various stages of a
    production chain in a consistent and generic way;
    handles job setup and input / output tracking (a
    rough sketch of this pattern appears at the end of
    this slide)
  • CMS-specific classes are provided to configure
    our applications.
  • ImpalaLite: CMS-specific modules in McRunjob
  • Core functionality from IMPALA, handling job
    preparation
  • Interfaces to the global CMS bookkeeping database
    (RefDB), data validation, job submission
  • BOSS: local job submission and tracking
  • Provides a uniform interface to the various batch
    systems (PBS, LSF, BQS, MOP, etc.)
  • Based on a MySQL job-tracking database
  • BODE is a web-based front end for local job
    management
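
The "gluing" pattern above can be made concrete with a minimal
sketch, assuming a toy chain of CMKIN -> CMSIM -> ORCA stages in
which each stage's outputs become the next stage's inputs. All
class, script and file names below are invented for illustration;
this is not the real McRunjob / ImpalaLite API.

# Hypothetical sketch of a production-chain "gluer"; names are illustrative.

class Stage:
    """One step of the chain (e.g. CMKIN, CMSIM, ORCA digitisation)."""
    def __init__(self, name, executable, params):
        self.name = name
        self.executable = executable
        self.params = params                    # run number, event count, ...

    def make_job(self, inputs):
        """Build a job description from upstream outputs plus stage parameters."""
        outputs = ["%s_run%d.out" % (self.name, self.params["run"])]
        script = "%s %s" % (self.executable, " ".join(inputs))
        return {"script": script, "inputs": inputs, "outputs": outputs}


class ProductionChain:
    """Glues stages together and tracks which files flow between them."""
    def __init__(self, stages):
        self.stages = stages

    def prepare(self, primary_inputs):
        jobs, inputs = [], primary_inputs
        for stage in self.stages:
            job = stage.make_job(inputs)
            jobs.append(job)
            inputs = job["outputs"]             # one stage feeds the next
        return jobs


if __name__ == "__main__":
    chain = ProductionChain([
        Stage("cmkin", "run_cmkin.sh", {"run": 2001}),
        Stage("cmsim", "run_cmsim.sh", {"run": 2001}),
        Stage("orca_digi", "run_orca.sh", {"run": 2001}),
    ])
    for job in chain.prepare(["request_2001.cards"]):
        print(job["script"], "->", job["outputs"])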

4
System-side toolset
  • RefDB: central bookkeeping / metadata database
  • Provides the (physicist) user interface for
    requesting data
  • Web interface allows users to track their
    requests, drill down into detailed metadata
    corresponding to produced data
  • Used remotely by ImpalaLite at job preparation
    time to establish job input parameters, etc. (a
    hypothetical query sketch appears at the end of
    this slide)
  • Based upon a MySQL database at CERN
  • DAR: packaging of applications
  • Very simple way of automatically packaging CMS
    software components (CMKIN, CMSIM, OSCAR, ORCA)
    with required libraries, etc
  • Minimal dependence upon site conditions
  • Ensures uniformity of application versions, etc,
    across sites.
  • NB: only one current platform for production,
    Linux RH 7.3
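
As a rough illustration of "ImpalaLite queries RefDB at job
preparation time": a minimal sketch assuming a MySQL connection via
mysql-connector-python and an invented requests table. The real
RefDB schema, host and credentials are not reproduced here.

# Hypothetical sketch only: table / column names, host and credentials
# are placeholders, not the real RefDB schema.

import mysql.connector

def fetch_request_parameters(request_id):
    """Pull the parameters of one production request from a central MySQL DB."""
    conn = mysql.connector.connect(
        host="refdb.example.cern.ch",            # placeholder host
        user="reader", password="secret",        # placeholder credentials
        database="refdb",
    )
    try:
        cur = conn.cursor(dictionary=True)
        cur.execute(
            "SELECT dataset, generator, nevents, first_run, geometry "
            "FROM requests WHERE request_id = %s",
            (request_id,),
        )
        return cur.fetchone()                    # one row of request metadata
    finally:
        conn.close()

if __name__ == "__main__":
    print("Preparing jobs for", fetch_request_parameters(1234))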

5
RefDB web user interface
One drawback: need a big laptop screen for the browser!
6
Data handling
  • dCache: pileup background serving
  • Highly challenging from the hardware point of
    view
  • e.g. need to serve up to 200 MByte/s to the RAL
    farm during the high-lumi digitisation step; cheap
    disk servers don't cut it due to the random-seek
    access pattern
  • Some large sites planning to use dCache for the
    background library
  • Each sub-farm (workers on one network switch)
    has its own local disk pool; should provide a
    scalable solution without killing the network (a
    rough throughput estimate appears at the end of
    this slide)
  • SRB: wide-area data management
  • Subject of some debate in CMS (versus Grid tools)
  • SRB is a short-term solution, since nothing else
    works (at the 100 TB scale); results from the CMS
    EDG stress test and UK / US work in '03.
  • Supported via UCSD / FNAL and the RAL e-Science
    centre
  • RAL will host central MCAT server for PCP03
    (thanks RAL).
  • Generic interface to the RAL datastore in testing
    phase
  • CMS UK responsible for roll-out and support for
    PCP03
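
A back-of-envelope check of the per-sub-farm pool idea, as a minimal
sketch: only the 200 MByte/s aggregate figure comes from the slide;
the farm layout and disk-server numbers below are illustrative
assumptions.

# Rough throughput estimate; all numbers except the 200 MByte/s
# aggregate figure are illustrative assumptions.

AGGREGATE_RATE_MB_S = 200.0   # pileup rate needed by the whole farm (from slide)
N_SUBFARMS = 8                # assumed: worker nodes grouped per network switch
WORKERS_PER_SUBFARM = 20      # assumed
POOL_RATE_MB_S = 40.0         # assumed sustained rate of one local pool server

per_subfarm = AGGREGATE_RATE_MB_S / N_SUBFARMS
per_worker = per_subfarm / WORKERS_PER_SUBFARM

print("Each sub-farm pool must serve ~%.0f MByte/s" % per_subfarm)
print("i.e. ~%.2f MByte/s per worker node" % per_worker)
print("Pool headroom factor: %.1fx" % (POOL_RATE_MB_S / per_subfarm))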

7
Grid integration
  • Current status
  • Toolset designed for distributed use but not
    built on Grid middleware
  • Reflection of the current scalability of many
    Grid components?
  • EDG stress test taught us a lot about what is
    possible (now).
  • Plan: Grid tools to be introduced and tested
    during PCP03
  • The goal: Grid data handling, monitoring, job
    scheduling for DC04
  • Some first targets:
  • BOSS + R-GMA for real-time monitoring
  • Replica management to supplement / replace SRB
  • CMS-owned testbed (LCG-0) in place at several
    sites
  • Yes, yet another testbed
  • Based upon LCG pilot + VOMS + R-GMA + Ganglia
  • Can test the CMSprod product, integrating the
    existing toolset with Grid middleware
  • NB: many crucial local issues unaddressed by the
    Grid model; discuss!

8
The worrying side effects of PCP