
1
The CMS Computing Software and Analysis
Challenge 2006
N. De Filippis
Department of Physics and INFN Bari
On behalf of the CMS collaboration
2
(No Transcript)
3
  • A 50 million event exercise to test the workflow
    and dataflow as defined in the CMS computing
    model
  • A test at 25% of the capacity needed in 2008
  • Main components
  • Preparation of large MC simulated datasets (some
    with HLT-tags)
  • Prompt reconstruction at Tier-0
  • Reconstruction at 40 Hz (out of the nominal 150 Hz)
    using CMSSW
  • Application of calibration constants from the
    offline DB
  • Generation of Reco, AOD, and AlCaReco datasets
  • Splitting of an HLT-tagged sample into 10 streams
  • Distribution of all AOD and some FEVT to all
    participating Tier-1s
  • Calibration jobs on AlCaReco datasets at some
    Tier-1s and the CAF
  • Re-reconstruction performed at Tier-1s
  • Skim jobs at some Tier-1s with data propagated to
    Tier-2s
  • Physics jobs at Tier-2s and Tier-1s on AOD and
    Reco

Italian contribution
4
June 1    Computing systems ready for Service Challenge SC4
June 14   First version of detector and physics reconstruction SW for CSA06
June 15   Physics simulation validation complete
July 1    Start MC production
Aug. 15   Calibration, alignment, HLT (and first version of L1 simulation),
          reconstruction, and analysis tools ready
Aug. 30   50 Mevt produced, 5M with HLT pre-processing
Sep. 1    Computing systems ready for CSA
Sep. 15   Start CSA06
Oct. 1    Smooth operation for CSA06
Oct. 30   End of smooth operation for CSA06
Nov. 15   Finish CSA06
5
  • Most of the performance metrics of CSA06 are
  • Number of participating Tier-1 sites - Goal 7 -
    Threshold 5
  • Number of participating Tier-2 sites - Goal 20 -
    Threshold 15
  • Weeks of running at sustained rate - Goal 4 -
    Threshold 2
  • Tier-0 efficiency - Goal 80% - Threshold 30%,
    measured as unattended uptime fraction over the 2
    best weeks of the running period
  • Running grid jobs (Tier-1 + Tier-2) per day (2h
    jobs typ.) - Goal 50K - Threshold 30K
  • Grid job efficiency - Goal 90% - Threshold 70%
  • Data serving capability at each participating
    site from the disk storage to CPU - Goal
    1 MB/s/execution slot - Threshold 400 MB/s
    (Tier-1) or 100 MB/s (Tier-2)
  • Data transfer Tier-0 to Tier-1 to tape -
    Individual goals (threshold at 50% of goal); for
    CNAF it was 25 MB/s
  • Data transfer Tier-1 to Tier-2 - Goal 20 MB/s
    into each Tier-2 - Threshold 5 MB/s
  • Overall "success" is to have 50% of participants
    at or above goal and 90% above threshold (see the
    sketch after this list).
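The success criterion can be made concrete with a small sketch; the
site names and metric values below are purely illustrative, not CSA06
results.

    # Illustrative check of the CSA06 overall success criterion:
    # >= 50% of participants at or above goal and >= 90% above threshold.
    metrics = [
        # (site, measured value, goal, threshold), e.g. MB/s into the site
        ("Tier-1 A", 28.0, 25.0, 12.5),
        ("Tier-1 B", 18.0, 20.0, 10.0),
        ("Tier-2 C", 22.0, 20.0, 5.0),
        ("Tier-2 D", 6.0, 20.0, 5.0),
    ]

    at_goal = sum(1 for _, v, goal, _ in metrics if v >= goal)
    above_thr = sum(1 for _, v, _, thr in metrics if v >= thr)
    n = len(metrics)
    success = (at_goal / n >= 0.50) and (above_thr / n >= 0.90)
    print(f"{at_goal}/{n} at goal, {above_thr}/{n} above threshold -> {success}")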

6
  • Tier-0 (CERN)
  • 1.4M SI2K (1400 CPUs at CERN)
  • 240 TB
  • Tier-1 (7 sites)
  • 2500 CPUs in total
  • 70 TB disk + tape as a minimum to participate
  • Tier-2 (25 sites)
  • 2400 CPUs in total
  • Average 10 TB disk at participating Tier-2s

7
(No Transcript)
8
  • ProdAgent tool used to automate the production
  • consists of many agents running in parallel:
    JobCreator, JobSubmitter, JobTracking,
    MergeSensor, ...
  • output files are registered in the Data Bookkeeping
    Service (DBS); blocks of files are registered in
    the Data Location Service (DLS), which takes care
    of the mapping between file blocks and the storage
    elements where they exist
  • Files are merged to an optimum size before
    transfer to CERN
  • CMS software (CMSSW) installed via grid tools or
    directly by site admins at remote sites. A local
    catalogue is used to map LFNs to local PFNs via a
    set of rules (see the sketch after this list)
  • Storage technologies deployed: CASTOR, dCache,
    DPM
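A minimal sketch of such a rule-based LFN-to-PFN mapping is shown below;
the regular expressions and storage paths are invented for illustration
and do not reproduce the actual CMS local file catalogue of any site.

    import re

    # Rule-based local catalogue: each rule maps an LFN pattern to a
    # site-local PFN template. Patterns and prefixes are made up.
    RULES = [
        (r"^/store/(.*)$", r"rfio://castor.example.infn.it//castor/cms/store/\1"),
        (r"^(/.*)$",       r"file:/data/cms\1"),
    ]

    def lfn_to_pfn(lfn):
        """Return the first matching PFN for an LFN, applying rules in order."""
        for pattern, template in RULES:
            if re.match(pattern, lfn):
                return re.sub(pattern, template, lfn)
        raise ValueError("no rule matches LFN " + lfn)

    print(lfn_to_pfn("/store/CSA06/minbias/RECO/file001.root"))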

9
  • 4 production teams active
  • 1 for OSG, with contact person
    Ajit Mohapatra (Wisconsin),
    taking care of the 7 OSG CMS Tier-2s
  • 3 for LCG
  • -- LCG(1) with contact person
    Jose Hernandez (Madrid): Spain,
    France, Belgium, CERN
  • -- LCG(2) with contact person
    Carsten Hof (Aachen): Germany,
    Estonia, Taiwan, Russia,
    Switzerland, FNAL
  • -- LCG(3) with contact person Nicola
    De Filippis (Bari): Italy, UK, Hungary
  • Large participation of CMS T1s and T2s

10
Maximum rate per day: 1.15 M events
11
T1-CNAF, Pisa, LNL, Bari
Most of the failures at CNAF were related to
stage-out and stage-in problems with CASTOR2
12
Total: 66 M events; total FEVT: O(150) TB
  • 1. Minimum bias (40M)
  • 2. Z→µµ (2M)
  • 3. T-Tbar (6M)
  • All decays
  • 4. W→eν (4M)
  • events selected in a narrow range to illuminate 2
    SMs
  • 5. Electroweak soup (5M)
  • W→lν + Drell-Yan (m > 15 GeV) + WW + H→WW
  • 6. HLT soup (5M), 10 effective MC HLT triggers
    (no taus pass)
  • W (leptons) + Drell-Yan (leptons) + t-tbar (all
    modes) + dijets
  • 7. Jet calibration soup (1M)
  • dijet + Z+jet, various pt-hat ranges
  • 8. Soft Muon Soup (2M)
  • Inclusive muons in minbias + J/Psi production
  • 9. Exotics Soup (1M)
  • LM1 SUSY, Z' (700 GeV), and excited quark (2000
    GeV), all decays

12 M events produced by the LCG(3) team
13
  • Efficiency
  • Overall efficiency: 88%
  • Probability for a job to end successfully once it
    is submitted
  • Grid efficiency: 95%
  • Aborted jobs: jobs not submitted because
    requirements were not met (merge jobs), or jobs
    that, once submitted, failed for Grid
    infrastructure reasons
  • Problems
  • stage-out was the main cause of job failures.
    More robust checks were implemented: more
    attempts to stage out, a fallback strategy, etc.
    (see the sketch after this list)
  • merge jobs typically caused an overload of the
    storage system because of the high rate of read
    accesses; CASTOR2 at CNAF was tuned to cope with
    the needs of the production (D. Bonacorsi and
    CNAF admins)
  • site validation: storage, software tag, software
    mount points, matching of CEs
  • consistency between fileblocks/files in DBS/DLS
    and the reality at sites.
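A minimal sketch of the stage-out hardening described above: several
attempts against the local storage element, then a fallback destination.
The copy command, destinations, and retry counts are illustrative
placeholders, not the actual CSA06 job wrapper.

    import subprocess
    import time

    def stage_out(local_file, destinations, attempts_per_se=3, wait_s=30):
        """Try to copy a job output file to each destination SE in order.

        The first entries would be the local SE, the last ones the
        fallback SEs; 'some-copy-tool' stands in for the real grid copy
        command used by the jobs.
        """
        for dest in destinations:
            for _ in range(attempts_per_se):
                result = subprocess.run(["some-copy-tool", local_file, dest])
                if result.returncode == 0:
                    return dest              # stage-out succeeded
                time.sleep(wait_s)           # back off and retry
        raise RuntimeError("stage-out of %s failed on all SEs" % local_file)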

Support from the Italian Tier-1 and Tier-2 sites
was very effective, also in August
14
(No Transcript)
15
  • Reconstruction with CMSSW_1_0_x (x ≤ 6)
  • All main reconstruction components included
  • Detector-specific local reconstruction and
    clustering
  • Tracking (only 1 algorithm used), vertexing,
    standalone µ, jets
  • Global µ (with tracker), electrons, photons,
    b/τ tagging
  • Reconstruction time small (no pile-up!): 4.5 s/ev
    for minbias, 20 s/ev for TTbar
  • The computing model assumes 25 s/ev
  • Calibration/Alignment
  • Ability to pull in constants from the offline DB
    included for ECAL, Tracker, and Muon
    reconstruction
  • Direct access to Oracle or via the Frontier cache

16
  • Processing for CSA officially launched October 2
  • First week mostly minbias (with some EWK) using
    CMSSW_1_0_2, while bugs were fixed to improve
    robustness on signal samples
  • Second-week processing included signal samples at
    rates generally matched to the T1 bandwidth
    metrics, using CMSSW_1_0_3
  • After having run for about 23 days, 120M events
    at 100% uptime, it was decided to increase the
    scale for the last days
  • Reprocessed all signal samples in 5 days using
    CMSSW_1_0_6 and maximum CPU usage
  • Useful to re-do some samples (FEVT, Reco, AOD,
    AlCaReco) because of some problems/mistakes in
    the earlier generation (missing files, missing
    muon objects)
  • Performance
  • 160 Hz processing rate, peaking at 300 Hz
  • signals, minbias, and HLT split samples
  • 1250 CPUs for prompt reconstruction
  • 150 CPUs for AOD and AlCaReco production
    (separate step)
  • All constants pulled from Frontier
  • i.e. the full complexity of the CSA exercise
  • 4 weeks uptime (goal), 207M events processed
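A back-of-the-envelope consistency check, assuming an illustrative
event mix (the 80/20 split below is an assumption, not a CSA06 number):
with the per-event reconstruction times quoted on the previous slide,
1250 CPUs give roughly the observed prompt-reconstruction rate.

    # Rough throughput estimate: rate ~ n_cpu / <time per event>.
    n_cpu = 1250
    t_minbias = 4.5    # s/event, from the reconstruction slide
    t_ttbar = 20.0     # s/event, from the reconstruction slide

    mix = 0.8          # assumed fraction of minbias-like events
    t_avg = mix * t_minbias + (1 - mix) * t_ttbar

    rate_hz = n_cpu / t_avg
    print("average %.1f s/event -> sustained rate ~%.0f Hz" % (t_avg, rate_hz))
    # ~164 Hz, in the ballpark of the observed ~160 Hz processing rate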

17
  • Calibration/alignment tasks
  • Specialized tasks to align/calibrate subsystems
    using start-up miscalibrated samples, e.g.
  • Align a portion of the Tracker with the HIP
    algorithm by using the Z→µµ sample on the central
    analysis facility (CAF) for prompt
    calibration/alignment
  • Intercalibrate ECAL crystals by phi symmetry in
    minbias events, by π0/η, or by isolated electrons
    from W/Z
  • Specialized reduced RECO data format (AlCaReco)
    to be used for the calibration/alignment stream
    from Tier-0
  • Mechanism to write constants back into the offline
    DB to be used
  • Re-reconstruction at Tier-1 required to test new
    constants
  • Proposal that the miscalibration is applied at RECO
  • Dataset for the alignment exercise: Z→µµ

18
  • CSA06 misalignment scenario: TIB dets and TOB
    rods misaligned by applying
  • random shifts, drawn from a flat distribution
    of width ±100 µm, in (x,y,z) for the double-sided
    modules and in x (the sensitive coordinate) for
    the single-sided ones
  • random rotations, drawn from a flat distribution
    of width ±10 mrad, in (alpha,beta,gamma) for all
    the modules

TIB double-sided dets positions
  • Alignment exercise (see the sketch after this list)
  • read the objects from the DB and apply the
    initial misalignment
  • run the iterative HIP algorithm and determine the
    alignment constants
  • 1M events used and 10 iterations
  • jobs running in parallel on 20 CPUs on a
    dedicated queue at Tier-0
  • new constants inserted into the DB
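A short sketch of how such random misalignments could be drawn, with
flat distributions of the widths quoted above; the module bookkeeping
is simplified and hypothetical (the real exercise used the CMSSW
alignment tools).

    import random

    SHIFT_CM = 100e-4   # +/-100 micrometres, expressed in cm
    ROT_RAD = 10e-3     # +/-10 mrad

    def misalign_module(double_sided):
        """Draw one set of random misalignment constants for a module.

        Double-sided modules are shifted in (x, y, z); single-sided ones
        only along x, the sensitive coordinate.  All modules get random
        rotations in (alpha, beta, gamma).  Flat distributions, as in the
        CSA06 scenario.
        """
        if double_sided:
            shift = tuple(random.uniform(-SHIFT_CM, SHIFT_CM) for _ in range(3))
        else:
            shift = (random.uniform(-SHIFT_CM, SHIFT_CM), 0.0, 0.0)
        rotation = tuple(random.uniform(-ROT_RAD, ROT_RAD) for _ in range(3))
        return shift, rotation

    print(misalign_module(double_sided=True))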

19
  • Tomcat and squids (caching servers) in place
    and tested before the CSA
  • DB populated with some sets of constants
  • No miscalib., start-up miscalib. (4), etc.
  • But multiple failures on first tests
  • Crashes (needed a CORAL patch)
  • Logging of 28K queries/job killed the servers
    (so it was disabled)
  • Successfully in the CSA by Oct. 24

(Plot legend: In CSA / Good tests / Failed tests)
20
  • All 7 Tier-1 centers participated in the
    challenge, performing very well
  • some storage element software or hardware
    problems at individual sites
  • but all have recovered and rapidly cleared any
    accumulated backlogs
  • The longest downtime at any site has been about
    18 hours
  • Files are injected into the CMS data transfer
    system PhEDEx and transferred using FTS
  • One central service failure
  • Recovery has been rapid
  • Highest rate from CERN was 550 MB/s
First 3-week average transfer rates
Site    Rate
ASGC    14.3 MB/s
CNAF    18.0 MB/s
FNAL    47.8 MB/s
GridKa  21.7 MB/s
IN2P3   14.6 MB/s
PIC     14.4 MB/s
RAL     16.4 MB/s
Total   147 MB/s
21
...after the prompt reconstruction at Tier-0.
Transfer to the Tier-1 CNAF overall successful.
22
  • To fit data at the T2s, and to reduce primary
    datasets to manageable sizes, it was necessary to
    run skim jobs at the T1s to select events
    according to the analyses
  • Skim configuration files prepared according to
    the RECO and AOD format (also including some MC
    truth information)
  • Organized skim jobs run with ProdAgent
  • Different skim procedures prepared by the users
    for running on the same dataset were unified in a
    single skim job producing different streams (see
    the sketch after this list)
  • 10 filters prepared by the Italian groups to cope
    with the analyses being prepared
  • 4 teams for running skim jobs at the Tier-1s
  • N. De Filippis: Electroweak soup (RAL, CNAF,
    ASGC, IN2P3)
  • D. Mason: Jets (FNAL)
  • C. Hof: TTbar (FZK and FNAL)
  • J. Hernandez: Zmumu (PIC and CNAF)
  • Skim job output files shipped to Tier-2s for
    end-user analyses
  • Oct. 9: T1 skim jobs started
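A minimal sketch of the idea of unifying several user filters into one
skim pass that writes different output streams; the filters and the
event interface below are hypothetical stand-ins for the real CMSSW
skim configurations run through ProdAgent.

    # One pass over the input dataset, one output stream per filter.
    def zmumu_filter(event):
        return event.get("n_muons", 0) >= 2

    def ttbar_filter(event):
        return event.get("n_jets", 0) >= 4 and event.get("n_leptons", 0) >= 1

    FILTERS = {"ZmumuStream": zmumu_filter, "TTbarStream": ttbar_filter}

    def run_skim(events):
        streams = {name: [] for name in FILTERS}
        for event in events:                     # dataset is read only once
            for name, accept in FILTERS.items():
                if accept(event):
                    streams[name].append(event)  # event goes to that stream
        return streams

    sample = [{"n_muons": 2}, {"n_jets": 5, "n_leptons": 1}, {}]
    print({name: len(evts) for name, evts in run_skim(sample).items()})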

23
  • First RECO/AOD definition completed for the CSA06
    production
  • RECO content
  • Tracker clusters
  • Rec-hits skipped for disk space reasons
  • Can be recomputed from clusters
  • ECAL/HCAL/Muon RecHits
  • Track core plus extra attached RecHits
  • Refitting is straightforward from attached hits
  • Vertices, ECAL clusters, calo towers
  • High-level objects
  • Photons, Electrons (links with tracks missing),
    Muons, Jets, MET (from calo towers and generator)
  • Tau tagging
  • HLT output summary
  • Trigger bits + links to high-level objects (as
    candidates)
  • HepMC generator
  • Geant4 tracks/vertices
  • AOD content: a proper subset of RECO
  • Clusters and hits are dropped
  • Only the track core is saved

24
(No Transcript)
25
  • Problems related to
  • wrong configuration of Tier-2 sites
  • wrong setup of download agents with FTS
  • CNAF-related problems (FTS server, CASTOR)

26
Exceeded 1 PB in 1 month!
27
P. Govoni
28
  • All INFN Tier-2s took part in the last step of
    CSA06: the physics analyses, starting from the
    output of the skim procedures

Legnaro/Padua (W→µν selection)
Pisa (tau validation)
(Study of minimum bias/underlying event)
Rome (electron reconstruction)
Bari (tracker misalignment)
29
  • Three analyses with the goals
  • to study the electron reconstruction in Z→ee
    events (Meridiani)
  • to measure the W mass in W→eν events
    (Tabarelli De Fatis, Malberti, CMS NOTE 2006-061)
  • to run a simple calibration with W→eν events
    (Govoni)
  • Electron and Z mass reconstruction using the
    hybrid supercluster energy (barrel only)

(Plots: efficiency vs η, mZ, efficiency vs pT)
30
  • The general idea is to simulate an "early data
    taking" activity of the τ group
  • the goal is to study the tau-tag efficiency from
    Z→ττ events (as described in CMS AN 2006/074)
  • the goal is to study the misidentification with
    the recoiling jet in Z+jet, Z→µµ events
  • In addition, run the τ validation package on
    skimmed events

3) The τ validation package has been run on a pure
di-tau sample and on a skimmed ttbar sample (S.
Gennai, G. Bagliesi).
(Plots: pT of the jet; isolation efficiency vs isolation cone)
31
Study of minimum bias/underlying event (Fanò,
Ambroglini, Bartalini)
  • Monte Carlo tuning for the LHC
  • Pile-up understanding
  • UE contribution measurements in MB events

(Plots: MinBias, UE)
32
Goal: to study the W→µν preselection with
different Monte Carlo data samples
Two data samples were considered (Torassa,
Margoni, Gasparini): (1) the electroweak soup
(3.4 M evts, 50% W→µν and 50% DY); (2) the soft
muons (1.8 M evts, 50% minimum bias and 50% J/ψ,
pTµ > 4 GeV)
EWK soup
The transverse momentum, and the efficiency vs η
and vs pT, as obtained with the GlobalMuon
reconstructor (to be compared with standalone)
33
  • Goal: to study the effect of tracker
    misalignment on track reconstruction performance
    (De Filippis)
  • with the perfect tracker geometry
  • in the short-term and long-term misalignment
    scenarios, by reading misalignment positions and
    errors via Frontier/squid from the offline
    database ORCAOFF
  • by using the tracker module positions and errors
    as obtained from the output of the alignment
    process that will be run at the CERN T0
  • Data samples used: Z→µµ and TTbar (the latter for
    computing the fake rate)

34
  • CRAB_1_4_0 used to submit 1.8 k jobs
  • grid efficiency 99%, application efficiency 94%
  • Bunches of 150 jobs run in different time slots
  • max 45 jobs run in parallel
  • the configuration of squid was tuned to ensure
    that the alignment data were read from the local
    squid cache via the Frontier client rather than
    from CERN (blue histogram).

→ Frontier/squid works as expected at the Bari
Tier-2 when accessing alignment data
35
  • Goals
  • to demonstrate re-reconstruction from some RAW
    data at Tier-1s as part of the calibration
    exercise
  • Status
  • access to the offline database via Frontier is
    working
  • re-reconstruction demonstrated at ASGC, FNAL,
    IN2P3, PIC and CNAF
  • running at RAL, and further tests at CNAF
PIC
36
  • Problems with CMSSW
  • the "reasonableness" of the code was not much
    taken into account. Operations were driven by
    computing, and the feeling was "whatever you run,
    we do not care, as long as it is not crashing".
  • as often happens in such cases, the release
    schedule was crazy. The initial milestones were
    also somewhat crazy, and it took really hard
    work to cope with them.
  • CSA06 meant blocking developments for some time,
    to make sure we were maintaining
    backward-compatibility. But it also meant a lot
    of code had to live either in the head or in
    pre-releases for some time. It would be better to
    have two releases ongoing at a time:
    a production one and a development one.
  • The framework proved to be usable for T0
    reconstruction. HLT was not attempted in CSA06,
    so no conclusions on that.

37
  • Storage system
  • CASTOR and DPM support (in general rfio access)
    for the CMS application had a lot of problems
    (libdpm patched; > 2 GB files required a patch)
  • CASTOR updates were too critical for operation
    during CSA06; they caused a lot of problems and
    an emergency status for CNAF
  • Integration issues
  • all the pieces of CSA06 worked (e.g. CMSSW
    releases, PA, skim jobs, DBS/DLS interactions),
    but
  • a lot of effort from the operations teams was
    needed to integrate them with each other
  • the PA tool required a lot of distributed
    expertise, a dedicated hw/sw setup (at least three
    machines), and real-time monitoring
  • the CMS SW installation at remote sites was
    problematic
  • LCG/OSG performance was very good

38
  • CSA06 was successful at INFN (all the steps were
    executed), but thanks to the 100% commitment of a
    few experts and to the coordinated effort of many
    people at the Tier-1 and Tier-2 sites.
  • CSA06 was supposed to be a challenge to
    commission the computing/software/analysis system,
    but in some cases it also required
    development/deployment of the tools
  • The CSA06 analysis exercises could serve as the
    ramp-up for the physics program/organization in
    Italy
  • A new CSA would be best for 2007, with simulated
    and real data, focusing on start-up operations
    (calibration and alignment) and analysis
    preparation

39
(No Transcript)
40
(Diagram: various productions managed and monitored by different
ProdAgent versions PA_035, PA_041, PA_045, PA_047; pccms30 used as the
test and backup setup, PhEDEx injection, ProdAgent UI)
41
The first prototype of the monitoring was developed
by the Bari team
42
(No Transcript)
43
Overwhelming response from the CSA analysis
demonstrations: about 25 filters producing 37
(plus 21 jet) datasets! Variety of outputs and
sizes: FEVT, RECOSim, AlCaReco
44
  • Goals: to study the effect of tracker
    misalignment on track reconstruction
    performance.
  • with the perfect tracker geometry
  • in the short-term and long-term misalignment
    scenarios, by reading misalignment positions and
    errors via Frontier/squid from the offline
    database ORCAOFF. This step requires refitting
    tracks with the misaligned geometry, but it can
    be done at the T2. The effect of the alignment
    position error (APE) is to be checked.
  • by using the tracker module positions and errors
    as obtained from the output of the alignment
    process that will be run at the CERN T0, to verify
    the efficiency of the alignment procedure on the
    track reconstruction. The refit of tracks is to be
    done at the T2.
  • The global efficiency of track reconstruction,
    the track parameter resolution, and the fake rate
    are compared in the a), b) and c) cases.
  • The same analysis was performed in ORCA. Plots
    and documents at
    http://webcms.ba.infn.it/cms-software/cms-grid/index.php/Main/StudiesOfCMSTrackerMisalignment
  • Data samples needed: Z→µµ and TTbar (the latter
    for computing the fake rate)

45
  • Z→µµ and TTbar samples produced during the CSA06
    pre-production with CMSSW_0_8_2.
  • CSA06 events reconstructed at the T0 with
    CMSSW_1_0_3 (and Z→µµ with CMSSW_1_0_5 in
    transfer)
  • 2 skim cfg files used for skimming the Z→µµ and
    TTbar samples. Skim jobs were run at the T1 with
    CMSSW_1_0_4 and CMSSW_1_0_5, and output data in
    the reduced format RECOSIM were produced. RECOSIM
    includes enough information for the misalignment
    analysis.
  • Z→µµ filter to select HepMC muons from the Z decay
    with |η| < 2.55, pT > 5 GeV/c, and 50 < m(Z→µµ) <
    130 GeV/c². Filter efficiency between 50% and 60%
    (a sketch of this selection follows below).
  • TTbar filter to select events with two muons with
    |η| < 2.5 and pT > 15 GeV/c
  • RECOSIM produced with CMSSW_1_0_4 transferred to
    T2-Bari and the misalignment analysis run over
    RECOSIM with CMSSW_1_0_6.
  • ¼ of the full statistics already analyzed at
    T2-Bari; waiting for the full statistics of the
    samples.
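A rough sketch of the generator-level Z→µµ selection described above,
using made-up (px, py, pz, E) tuples in GeV; the real filter ran on
HepMC particles inside a CMSSW skim configuration.

    import math

    def pt(p):
        return math.hypot(p[0], p[1])

    def eta(p):
        transverse = pt(p)
        return math.asinh(p[2] / transverse) if transverse > 0 else float("inf")

    def inv_mass(p1, p2):
        """Invariant mass of the sum of two (px, py, pz, E) four-vectors."""
        px, py, pz, e = (a + b for a, b in zip(p1, p2))
        return math.sqrt(max(e**2 - px**2 - py**2 - pz**2, 0.0))

    def zmumu_accept(mu1, mu2):
        """CSA06-style cuts: |eta| < 2.55, pT > 5 GeV/c, 50 < m < 130 GeV/c^2."""
        for mu in (mu1, mu2):
            if abs(eta(mu)) >= 2.55 or pt(mu) <= 5.0:
                return False
        return 50.0 < inv_mass(mu1, mu2) < 130.0

    # Two roughly back-to-back 45 GeV muons pass the selection.
    print(zmumu_accept((45.0, 0.0, 10.0, 46.1), (-45.0, 0.0, -10.0, 46.1)))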

46
  • Selection
  • track seeding, building, ambiguity resolution,
    smoothing with the KF
  • ctfWithMaterialTracks refit after applying
    alignment uncertainties
  • track associator by χ² to match simtracks with
    rectracks
  • Efficiency = number of reco tracks matched to
    simulated tracks / number of simulated tracks
    (see the sketch after this list)
  • - Simulated track: pT > 0.9 GeV/c, 0 < |η| < 2.5,
    d0 < 3 cm, z0 < 30 cm, nhit > 0
  • Reco track: pT > 0.7 GeV/c, 0 < |η| < 2.6,
    d0 < 120 cm, z0 < 170 cm, nhit ≥ 8
  • Fake rate = number of reco tracks not associated
    to simulated tracks / number of reco tracks
  • - Simulated track: pT > 0.7 GeV/c, 0 < |η| < 2.6,
    d0 < 300 cm, z0 < 300 cm, nhit > 8; not used
    because SimTrack does not have a number-of-simhits
    method → TrackingParticle will have it, but TP is
    not compatible with the CSA data samples
  • Reco track: pT > 0.9 GeV/c, 0 < |η| < 2.5,
    d0 < 3 cm, z0 < 30 cm, nhit ≥ 8
  • Track parameter resolution: sigma of a Gaussian
    fit to the distribution of residuals
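The efficiency and fake-rate definitions above have the simple
structure sketched below; the matching itself (the χ² track associator)
is treated as an abstract input, and the toy numbers are illustrative.

    def tracking_metrics(n_sim, n_reco, matches):
        """Efficiency and fake rate from a set of (sim, reco) index pairs.

        Efficiency = matched reco tracks / simulated tracks;
        fake rate  = unmatched reco tracks / reco tracks.
        """
        matched_reco = {r for _, r in matches}
        efficiency = len(matched_reco) / n_sim if n_sim else 0.0
        fake_rate = (n_reco - len(matched_reco)) / n_reco if n_reco else 0.0
        return efficiency, fake_rate

    # Toy example: 10 simulated tracks, 9 reconstructed, 8 matched pairs.
    print(tracking_metrics(10, 9, {(i, i) for i in range(8)}))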

47
  • CRAB_1_4_0 used to submit 1.8 k jobs
  • grid efficiency 99%, application efficiency 94%
  • Bunches of 150 jobs run in different time slots
  • max 45 jobs run in parallel
  • the configuration of squid was tuned to ensure
    that the alignment data were read from the local
    squid cache via the Frontier client rather than
    from CERN (blue histogram).

→ Frontier/squid works as expected at the Bari
Tier-2 when accessing alignment data
48
  • Misalignment affects the global track
    reconstruction efficiency in the first data-taking
    scenario.
  • The effect of tracker misalignment is quite
    relevant for the track parameter resolution (a
    factor of 2-3 degradation)

49
  • A factor between 2 and 3 in impact parameter
    resolution due to misalignment

50
Using CSA06 Z?mm sample
The Z mass resolution is increased by a factor
larger than 2 in the first data taking scenario
(RMS from 1.3 to 2.8)