
1
The CMS Computing Software and Analysis
Challenge 2006
N. De Filippis
Department of Physics and INFN Bari
On behalf of the CMS collaboration
2
(No Transcript)
3
  • A 50 million event exercise to test the workflow
    and dataflow as defined in the CMS computing
    model
  • A test at 25% of the capacity needed in 2008
  • Main components
  • Preparation of large MC simulated datasets (some
    with HLT-tags)
  • Prompt reconstruction at Tier-0
  • Reconstruction at 40 Hz (out of the nominal 150 Hz)
    using CMSSW
  • Application of calibration constants from the
    offline DB
  • Generation of Reco, AOD, and AlCaReco datasets
  • Splitting of an HLT-tagged sample into 10 streams
  • Distribution of all AOD and some FEVT to all
    participating Tier-1s
  • Calibration jobs on AlCaReco datasets at some
    Tier-1s and the CAF
  • Re-reconstruction performed at Tier-1s
  • Skim jobs at some Tier-1s with data propagated to
    Tier-2s
  • Physics jobs at Tier-2s and Tier-1s on AOD and
    Reco

Italian contribution
4
June 1    Computing systems ready for Service Challenge SC4
June 14   First version of detector and physics reconstruction SW for CSA06
June 15   Physics simulation validation complete
July 1    Start MC production
Aug. 15   Calibration, alignment, HLT (and first version of L1 simulation),
          reconstruction, and analysis tools ready
Aug. 30   50 Mevt produced, 5M with HLT pre-processing
Sep. 1    Computing systems ready for CSA
Sep. 15   Start CSA06
Oct. 1    Smooth operation for CSA06
Oct. 30   End of smooth operation for CSA06
Nov. 15   Finish CSA06
5
  • Most of the performance metrics of CSA06 are
  • Number of participating Tier-1 sites - Goal 7 -
    Threshold 5
  • Number of participating Tier-2 sites - Goal 20 -
    Threshold 15
  • Weeks of running at sustained rate - Goal 4 -
    Threshold 2
  • Tier-0 efficiency - Goal 80% - Threshold 30%,
    measured as unattended uptime fraction over the 2
    best weeks of the running period
  • Running grid jobs (Tier-1 + Tier-2) per day (2h
    jobs typ.) - Goal 50K - Threshold 30K
  • Grid job efficiency - Goal 90% - Threshold 70%
  • Data serving capability at each participating
    site from the disk storage to CPU - Goal
    1 MB/s/execution slot - Threshold 400 MB/s
    (Tier-1) or 100 MB/s (Tier-2)
  • Data transfer Tier-0 to Tier-1 to tape -
    Individual goals (threshold at 50% of goal); for
    CNAF it was 25 MB/s
  • Data transfer Tier-1 to Tier-2 - Goal 20 MB/s
    into each Tier-2 - Threshold 5 MB/s
  • Overall "success" is to have 50% of participants
    at or above goal and 90% above threshold (see the
    sketch after this list).
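The success criterion can be made concrete with a small sketch; the
site names and metric values below are purely illustrative, not CSA06
results.

    # Illustrative check of the CSA06 overall success criterion:
    # >= 50% of participants at or above goal and >= 90% above threshold.
    metrics = [
        # (site, measured value, goal, threshold), e.g. MB/s into the site
        ("Tier-1 A", 28.0, 25.0, 12.5),
        ("Tier-1 B", 18.0, 20.0, 10.0),
        ("Tier-2 C", 22.0, 20.0, 5.0),
        ("Tier-2 D", 6.0, 20.0, 5.0),
    ]

    at_goal = sum(1 for _, v, goal, _ in metrics if v >= goal)
    above_thr = sum(1 for _, v, _, thr in metrics if v >= thr)
    n = len(metrics)
    success = (at_goal / n >= 0.50) and (above_thr / n >= 0.90)
    print(f"{at_goal}/{n} at goal, {above_thr}/{n} above threshold -> {success}")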

6
  • Tier-0 (CERN)
  • 1.4M SI2K (1400 CPUs at CERN)
  • 240 TB
  • Tier-1 (7 sites)
  • 2500 CPUs in total
  • 70 TB disk + tape as a minimum to participate
  • Tier-2 (25 sites)
  • 2400 CPUs in total
  • Average 10 TB disk at participating Tier-2s

7
(No Transcript)
8
  • ProdAgent tool used to automate the production
  • consists of many agents running in parallel:
    JobCreator, JobSubmitter, JobTracking,
    MergeSensor, ...
  • output files are registered in the Data Bookkeeping
    Service (DBS); blocks of files are registered in
    the Data Location Service (DLS), which takes care
    of the mapping between file blocks and the storage
    elements where they exist
  • Files are merged to an optimum size before
    transfer to CERN
  • CMS software (CMSSW) installed via grid tools or
    directly by site admins at remote sites. A local
    catalogue is used to map LFNs to local PFNs via a
    set of rules (see the sketch after this list)
  • Storage technologies deployed: CASTOR, dCache,
    DPM
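A minimal sketch of such a rule-based LFN-to-PFN mapping is shown below;
the regular expressions and storage paths are invented for illustration
and do not reproduce the actual CMS local file catalogue of any site.

    import re

    # Rule-based local catalogue: each rule maps an LFN pattern to a
    # site-local PFN template. Patterns and prefixes are made up.
    RULES = [
        (r"^/store/(.*)$", r"rfio://castor.example.infn.it//castor/cms/store/\1"),
        (r"^(/.*)$",       r"file:/data/cms\1"),
    ]

    def lfn_to_pfn(lfn):
        """Return the first matching PFN for an LFN, applying rules in order."""
        for pattern, template in RULES:
            if re.match(pattern, lfn):
                return re.sub(pattern, template, lfn)
        raise ValueError("no rule matches LFN " + lfn)

    print(lfn_to_pfn("/store/CSA06/minbias/RECO/file001.root"))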

9
  • 4 production teams active
  • 1 for OSG, with contact person
    Ajit Mohapatra (Wisconsin),
    taking care of the 7 OSG CMS Tier-2s
  • 3 for LCG
  • -- LCG(1) with contact person
    Jose Hernandez (Madrid): Spain,
    France, Belgium, CERN
  • -- LCG(2) with contact person
    Carsten Hof (Aachen): Germany,
    Estonia, Taiwan, Russia,
    Switzerland, FNAL
  • -- LCG(3) with contact person Nicola
    De Filippis (Bari): Italy, UK, Hungary
  • Large participation of CMS T1s and T2s

10
Maximum rate per day: 1.15 M events
11
T1-CNAF, Pisa, LNL, Bari
Most of the failures at CNAF were related to
stage-out and stage-in problems with CASTOR2
12
Total: 66 M events; total FEVT: O(150) TB
  • 1. Minimum bias (40M)
  • 2. Z→µµ (2M)
  • 3. T-Tbar (6M)
  • All decays
  • 4. W→eν (4M)
  • events selected in a narrow range to illuminate 2
    SMs
  • 5. Electroweak soup (5M)
  • W→lν + Drell-Yan (m > 15 GeV) + WW + H→WW
  • 6. HLT soup (5M), 10 effective MC HLT triggers
    (no taus pass)
  • W (leptons) + Drell-Yan (leptons) + t-tbar (all
    modes) + dijets
  • 7. Jet calibration soup (1M)
  • dijet + Z+jet, various pt-hat ranges
  • 8. Soft Muon Soup (2M)
  • Inclusive muons in minbias + J/Psi production
  • 9. Exotics Soup (1M)
  • LM1 SUSY, Z' (700 GeV), and excited quark (2000
    GeV), all decays

12 M events produced by the LCG(3) team
13
  • Efficiency
  • Overall efficiency: 88%
  • Probability for a job to end successfully once it
    is submitted
  • Grid efficiency: 95%
  • Aborted jobs: jobs not submitted because
    requirements were not met (merge jobs), or jobs
    that, once submitted, failed for Grid
    infrastructure reasons
  • Problems
  • stage-out was the main cause of job failures.
    More robust checks were implemented: more
    attempts to stage out, a fallback strategy, etc.
    (see the sketch after this list)
  • merge jobs typically caused an overload of the
    storage system because of the high rate of read
    accesses; CASTOR2 at CNAF was tuned to cope with
    the needs of the production (D. Bonacorsi and
    CNAF admins)
  • site validation: storage, software tag, software
    mount points, matching of CEs
  • consistency between fileblocks/files in DBS/DLS
    and the reality at sites.
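A minimal sketch of the stage-out hardening described above: several
attempts against the local storage element, then a fallback destination.
The copy command, destinations, and retry counts are illustrative
placeholders, not the actual CSA06 job wrapper.

    import subprocess
    import time

    def stage_out(local_file, destinations, attempts_per_se=3, wait_s=30):
        """Try to copy a job output file to each destination SE in order.

        The first entries would be the local SE, the last ones the
        fallback SEs; 'some-copy-tool' stands in for the real grid copy
        command used by the jobs.
        """
        for dest in destinations:
            for _ in range(attempts_per_se):
                result = subprocess.run(["some-copy-tool", local_file, dest])
                if result.returncode == 0:
                    return dest              # stage-out succeeded
                time.sleep(wait_s)           # back off and retry
        raise RuntimeError("stage-out of %s failed on all SEs" % local_file)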

Support from the Italian Tier-1 and Tier-2 sites
was very effective, also in August
14
(No Transcript)
15
  • Reconstruction with CMSSW_1_0_x (x ≤ 6)
  • All main reconstruction components included
  • Detector-specific local reconstruction and
    clustering
  • Tracking (only 1 algorithm used), vertexing,
    standalone µ, jets
  • Global µ (with tracker), electrons, photons,
    b/τ tagging
  • Reconstruction time small (no pile-up!): 4.5 s/ev
    for minbias, 20 s/ev for TTbar
  • The computing model assumes 25 s/ev
  • Calibration/Alignment
  • Ability to pull in constants from the offline DB
    included for ECAL, Tracker, and Muon
    reconstruction
  • Direct access to Oracle or via the Frontier cache

16
  • Processing for CSA officially launched October 2
  • First week mostly minbias (with some EWK) using
    CMSSW_1_0_2, while bugs were fixed to improve
    robustness on signal samples
  • Second-week processing included signal samples at
    rates generally matched to the T1 bandwidth
    metrics, using CMSSW_1_0_3
  • After having run for about 23 days, 120M events
    at 100% uptime, it was decided to increase the
    scale for the last days
  • Reprocessed all signal samples in 5 days using
    CMSSW_1_0_6 and maximum CPU usage
  • Useful to re-do some samples (FEVT, Reco, AOD,
    AlCaReco) because of some problems/mistakes in
    the earlier generation (missing files, missing
    muon objects)
  • Performance
  • 160 Hz processing rate, peaking at 300 Hz
  • signals, minbias, and HLT split samples
  • 1250 CPUs for prompt reconstruction
  • 150 CPUs for AOD and AlCaReco production
    (separate step)
  • All constants pulled from Frontier
  • i.e. the full complexity of the CSA exercise
  • 4 weeks uptime (goal), 207M events processed
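A back-of-the-envelope consistency check, assuming an illustrative
event mix (the 80/20 split below is an assumption, not a CSA06 number):
with the per-event reconstruction times quoted on the previous slide,
1250 CPUs give roughly the observed prompt-reconstruction rate.

    # Rough throughput estimate: rate ~ n_cpu / <time per event>.
    n_cpu = 1250
    t_minbias = 4.5    # s/event, from the reconstruction slide
    t_ttbar = 20.0     # s/event, from the reconstruction slide

    mix = 0.8          # assumed fraction of minbias-like events
    t_avg = mix * t_minbias + (1 - mix) * t_ttbar

    rate_hz = n_cpu / t_avg
    print("average %.1f s/event -> sustained rate ~%.0f Hz" % (t_avg, rate_hz))
    # ~164 Hz, in the ballpark of the observed ~160 Hz processing rate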

17
  • Calibration/alignment tasks
  • Specialized tasks to align/calibrate subsystems
    using start-up miscalibrated samples, e.g.
  • Align a portion of the Tracker with the HIP
    algorithm by using the Z→µµ sample on the central
    analysis facility (CAF) for prompt
    calibration/alignment
  • Intercalibrate ECAL crystals by phi symmetry in
    minbias events, by π0/η, or by isolated electrons
    from W/Z
  • Specialized reduced RECO data format (AlCaReco)
    to be used for the calibration/alignment stream
    from Tier-0
  • Mechanism to write constants back into the offline
    DB to be used
  • Re-reconstruction at Tier-1 required to test new
    constants
  • Proposal that the miscalibration is applied at RECO
  • Dataset for the alignment exercise: Z→µµ

18
  • CSA06 misalignment scenario: TIB dets and TOB
    rods misaligned by applying
  • random shifts, drawn from a flat distribution
    of width ±100 µm, in (x,y,z) for the double-sided
    modules and in x (the sensitive coordinate) for
    the single-sided ones
  • random rotations, drawn from a flat distribution
    of width ±10 mrad, in (alpha,beta,gamma) for all
    the modules

TIB double-sided dets positions
  • Alignment exercise (see the sketch after this list)
  • read the objects from the DB and apply the
    initial misalignment
  • run the iterative HIP algorithm and determine the
    alignment constants
  • 1M events used and 10 iterations
  • jobs running in parallel on 20 CPUs on a
    dedicated queue at Tier-0
  • new constants inserted into the DB
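A short sketch of how such random misalignments could be drawn, with
flat distributions of the widths quoted above; the module bookkeeping
is simplified and hypothetical (the real exercise used the CMSSW
alignment tools).

    import random

    SHIFT_CM = 100e-4   # +/-100 micrometres, expressed in cm
    ROT_RAD = 10e-3     # +/-10 mrad

    def misalign_module(double_sided):
        """Draw one set of random misalignment constants for a module.

        Double-sided modules are shifted in (x, y, z); single-sided ones
        only along x, the sensitive coordinate.  All modules get random
        rotations in (alpha, beta, gamma).  Flat distributions, as in the
        CSA06 scenario.
        """
        if double_sided:
            shift = tuple(random.uniform(-SHIFT_CM, SHIFT_CM) for _ in range(3))
        else:
            shift = (random.uniform(-SHIFT_CM, SHIFT_CM), 0.0, 0.0)
        rotation = tuple(random.uniform(-ROT_RAD, ROT_RAD) for _ in range(3))
        return shift, rotation

    print(misalign_module(double_sided=True))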

19
  • Tomcat and squids (caching servers) in place
    and tested before the CSA
  • DB populated with some sets of constants
  • No miscalib., start-up miscalib. (4), etc.
  • But multiple failures on first tests
  • Crashes (needed a CORAL patch)
  • Logging of 28K queries/job killed the servers
    (so it was disabled)
  • Successfully in the CSA by Oct. 24

(Plot legend: In CSA / Good tests / Failed tests)
20
  • All 7 Tier-1 centers participated in the
    challenge, performing very well
  • some storage element software or hardware
    problems at individual sites
  • but all have recovered and rapidly cleared any
    accumulated backlogs
  • The longest downtime at any site has been about
    18 hours
  • Files are injected into the CMS data transfer
    system PhEDEx and transferred using FTS
  • One central service failure
  • Recovery has been rapid
  • Highest rate from CERN was 550 MB/s
First 3-week average transfer rates
Site    Rate
ASGC    14.3 MB/s
CNAF    18.0 MB/s
FNAL    47.8 MB/s
GridKa  21.7 MB/s
IN2P3   14.6 MB/s
PIC     14.4 MB/s
RAL     16.4 MB/s
Total   147 MB/s
21
...after the prompt reconstruction at Tier-0.
Transfer to the Tier-1 CNAF overall successful.
22
  • To fit data at the T2s, and to reduce primary
    datasets to manageable sizes, it was necessary to
    run skim jobs at the T1s to select events
    according to the analyses
  • Skim configuration files prepared according to
    the RECO and AOD format (also including some MC
    truth information)
  • Organized skim jobs run with ProdAgent
  • Different skim procedures prepared by the users
    for running on the same dataset were unified in a
    single skim job producing different streams (see
    the sketch after this list)
  • 10 filters prepared by the Italian groups to cope
    with the analyses being prepared
  • 4 teams for running skim jobs at the Tier-1s
  • N. De Filippis: Electroweak soup (RAL, CNAF,
    ASGC, IN2P3)
  • D. Mason: Jets (FNAL)
  • C. Hof: TTbar (FZK and FNAL)
  • J. Hernandez: Zmumu (PIC and CNAF)
  • Skim job output files shipped to Tier-2s for
    end-user analyses
  • Oct. 9: T1 skim jobs started
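A minimal sketch of the idea of unifying several user filters into one
skim pass that writes different output streams; the filters and the
event interface below are hypothetical stand-ins for the real CMSSW
skim configurations run through ProdAgent.

    # One pass over the input dataset, one output stream per filter.
    def zmumu_filter(event):
        return event.get("n_muons", 0) >= 2

    def ttbar_filter(event):
        return event.get("n_jets", 0) >= 4 and event.get("n_leptons", 0) >= 1

    FILTERS = {"ZmumuStream": zmumu_filter, "TTbarStream": ttbar_filter}

    def run_skim(events):
        streams = {name: [] for name in FILTERS}
        for event in events:                     # dataset is read only once
            for name, accept in FILTERS.items():
                if accept(event):
                    streams[name].append(event)  # event goes to that stream
        return streams

    sample = [{"n_muons": 2}, {"n_jets": 5, "n_leptons": 1}, {}]
    print({name: len(evts) for name, evts in run_skim(sample).items()})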

23
  • First RECO/AOD definition completed for the CSA06
    production
  • RECO content
  • Tracker clusters
  • Rec-hits skipped for disk space reasons
  • Can be recomputed from clusters
  • ECAL/HCAL/Muon RecHits
  • Track core plus extra attached RecHits
  • Refitting is straightforward from attached hits
  • Vertices, ECAL clusters, calo towers
  • High-level objects
  • Photons, Electrons (links with tracks missing),
    Muons, Jets, MET (from calo towers and generator)
  • Tau tagging
  • HLT output summary
  • Trigger bits + links to high-level objects (as
    candidates)
  • HepMC generator
  • Geant4 tracks/vertices
  • AOD content: a proper subset of RECO
  • Clusters and hits are dropped
  • Only the track core is saved

24
(No Transcript)
25
  • Problems related to
  • wrong configuration of Tier-2 sites
  • wrong setup of download agents with FTS
  • CNAF-related problems (FTS server, CASTOR)

26
Exceeded 1 PB in 1 month!
27
P. Govoni
28
  • All INFN Tier-2s took part in the last step of
    CSA06: the physics analyses, starting from the
    output of the skim procedures

Legnaro/Padua (W→µν selection)
Pisa (tau validation)
(Study of minimum bias/underlying event)
Rome (electron reconstruction)
Bari (tracker misalignment)
29
  • Three analyses with the goals
  • to study the electron reconstruction in Z→ee
    events (Meridiani)
  • to measure the W mass in W→eν events
    (Tabarelli De Fatis, Malberti, CMS NOTE 2006-061)
  • to run a simple calibration with W→eν events
    (Govoni)
  • Electron and Z mass reconstruction using the
    hybrid supercluster energy (barrel only)

(Plots: efficiency vs η, mZ, efficiency vs pT)
30
  • The general idea is to simulate an "early data
    taking" activity of the τ group
  • the goal is to study the tau-tag efficiency from
    Z→ττ events (as described in CMS AN 2006/074)
  • the goal is to study the misidentification with
    the recoiling jet in Z+jet, Z→µµ events
  • In addition, run the τ validation package on
    skimmed events

3) The τ validation package has been run on a pure
di-tau sample and on a skimmed ttbar sample (S.
Gennai, G. Bagliesi).
(Plots: pT of the jet; isolation efficiency vs isolation cone)
31
Study of minimum bias/underlying event (Fanò,
Ambroglini, Bartalini)
  • Monte Carlo tuning for the LHC
  • Pile-up understanding
  • UE contribution measurements in MB events

(Plots: MinBias, UE)
32
Goal: to study the W→µν preselection with
different Monte Carlo data samples
Two data samples were considered (Torassa,
Margoni, Gasparini): (1) the electroweak soup
(3.4 M evts, 50% W→µν and 50% DY); (2) the soft
muons (1.8 M evts, 50% minimum bias and 50% J/ψ,
pTµ > 4 GeV)
EWK soup
The transverse momentum, and the efficiency vs η
and vs pT, as obtained with the GlobalMuon
reconstructor (to be compared with standalone)
33
  • Goal: to study the effect of tracker
    misalignment on track reconstruction performance
    (De Filippis)
  • with the perfect tracker geometry
  • in the short-term and long-term misalignment
    scenarios, by reading misalignment positions and
    errors via Frontier/squid from the offline
    database ORCAOFF
  • by using the tracker module positions and errors
    as obtained from the output of the alignment
    process that will be run at the CERN T0
  • Data samples used: Z→µµ and TTbar (the latter for
    computing the fake rate)

34
  • CRAB_1_4_0 used to submit 1.8 k jobs
  • grid efficiency 99%, application efficiency 94%
  • Bunches of 150 jobs run in different time slots
  • max 45 jobs run in parallel
  • the configuration of squid was tuned to ensure
    that the alignment data were read from the local
    squid cache via the Frontier client rather than
    from CERN (blue histogram).

→ Frontier/squid works as expected at the Bari
Tier-2 when accessing alignment data
35
  • Goals
  • to demonstrate re-reconstruction from some RAW
    data at Tier-1s as part of the calibration
    exercise
  • Status
  • access to the offline database via Frontier is
    working
  • re-reconstruction demonstrated at ASGC, FNAL,
    IN2P3, PIC and CNAF
  • running at RAL, and further tests at CNAF
PIC
36
  • Problems with CMSSW
  • the "reasonableness" of the code was not much
    taken into account. Operations were driven by
    computing, and the feeling was "whatever you run,
    we do not care, as long as it is not crashing".
  • as often happens in such cases, the release
    schedule was crazy. The initial milestones were
    also somewhat crazy, and it took really hard
    work to cope with them.
  • CSA06 meant blocking developments for some time,
    to make sure we were maintaining
    backward-compatibility. But it also meant a lot
    of code had to live either in the head or in
    pre-releases for some time. It would be better to
    have two releases ongoing at a time:
    a production one and a development one.
  • The framework proved to be usable for T0
    reconstruction. HLT was not attempted in CSA06,
    so no conclusions on that.

37
  • Storage system
  • CASTOR and DPM support (in general rfio access)
    for the CMS application had a lot of problems
    (libdpm patched; > 2 GB files required a patch)
  • CASTOR updates were too critical for operation
    during CSA06; they caused a lot of problems and
    an emergency status for CNAF
  • Integration issues
  • all the pieces of CSA06 worked (e.g. CMSSW
    releases, PA, skim jobs, DBS/DLS interactions),
    but
  • a lot of effort from the operations teams was
    needed to integrate them with each other
  • the PA tool required a lot of distributed
    expertise, a dedicated hw/sw setup (at least three
    machines), and real-time monitoring
  • the CMS SW installation at remote sites was
    problematic
  • LCG/OSG performance was very good

38
  • CSA06 was successful at INFN (all the steps were
    executed), but thanks to the 100% commitment of a
    few experts and to the coordinated effort of many
    people at the Tier-1 and Tier-2 sites.
  • CSA06 was supposed to be a challenge to
    commission the computing/software/analysis system,
    but in some cases it also required
    development/deployment of the tools
  • The CSA06 analysis exercises could serve as the
    ramp-up for the physics program/organization in
    Italy
  • A new CSA would be best for 2007, with simulated
    and real data, focusing on start-up operations
    (calibration and alignment) and analysis
    preparation

39
(No Transcript)
40
(Diagram: various productions managed and monitored by different
ProdAgent versions PA_035, PA_041, PA_045, PA_047; pccms30 used as the
test and backup setup, PhEDEx injection, ProdAgent UI)
41
The first prototype of the monitoring was developed
by the Bari team
42
(No Transcript)
43
Overwhelming response from the CSA analysis
demonstrations: about 25 filters producing 37
(plus 21 jet) datasets! Variety of outputs and
sizes: FEVT, RECOSim, AlCaReco
44
  • Goals: to study the effect of tracker
    misalignment on track reconstruction
    performance.
  • with the perfect tracker geometry
  • in the short-term and long-term misalignment
    scenarios, by reading misalignment positions and
    errors via Frontier/squid from the offline
    database ORCAOFF. This step requires refitting
    tracks with the misaligned geometry, but it can
    be done at the T2. The effect of the alignment
    position error (APE) is to be checked.
  • by using the tracker module positions and errors
    as obtained from the output of the alignment
    process that will be run at the CERN T0, to verify
    the efficiency of the alignment procedure on the
    track reconstruction. The refit of tracks is to be
    done at the T2.
  • The global efficiency of track reconstruction,
    the track parameter resolution, and the fake rate
    are compared in the a), b) and c) cases.
  • The same analysis was performed in ORCA. Plots
    and documents at
    http://webcms.ba.infn.it/cms-software/cms-grid/index.php/Main/StudiesOfCMSTrackerMisalignment
  • Data samples needed: Z→µµ and TTbar (the latter
    for computing the fake rate)

45
  • Z→µµ and TTbar samples produced during the CSA06
    pre-production with CMSSW_0_8_2.
  • CSA06 events reconstructed at the T0 with
    CMSSW_1_0_3 (and Z→µµ with CMSSW_1_0_5 in
    transfer)
  • 2 skim cfg files used for skimming the Z→µµ and
    TTbar samples. Skim jobs were run at the T1 with
    CMSSW_1_0_4 and CMSSW_1_0_5, and output data in
    the reduced format RECOSIM were produced. RECOSIM
    includes enough information for the misalignment
    analysis.
  • Z→µµ filter to select HepMC muons from the Z decay
    with |η| < 2.55, pT > 5 GeV/c, and 50 < m(Z→µµ) <
    130 GeV/c². Filter efficiency between 50% and 60%
    (a sketch of this selection follows below).
  • TTbar filter to select events with two muons with
    |η| < 2.5 and pT > 15 GeV/c
  • RECOSIM produced with CMSSW_1_0_4 transferred to
    T2-Bari and the misalignment analysis run over
    RECOSIM with CMSSW_1_0_6.
  • ¼ of the full statistics already analyzed at
    T2-Bari; waiting for the full statistics of the
    samples.
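A rough sketch of the generator-level Z→µµ selection described above,
using made-up (px, py, pz, E) tuples in GeV; the real filter ran on
HepMC particles inside a CMSSW skim configuration.

    import math

    def pt(p):
        return math.hypot(p[0], p[1])

    def eta(p):
        transverse = pt(p)
        return math.asinh(p[2] / transverse) if transverse > 0 else float("inf")

    def inv_mass(p1, p2):
        """Invariant mass of the sum of two (px, py, pz, E) four-vectors."""
        px, py, pz, e = (a + b for a, b in zip(p1, p2))
        return math.sqrt(max(e**2 - px**2 - py**2 - pz**2, 0.0))

    def zmumu_accept(mu1, mu2):
        """CSA06-style cuts: |eta| < 2.55, pT > 5 GeV/c, 50 < m < 130 GeV/c^2."""
        for mu in (mu1, mu2):
            if abs(eta(mu)) >= 2.55 or pt(mu) <= 5.0:
                return False
        return 50.0 < inv_mass(mu1, mu2) < 130.0

    # Two roughly back-to-back 45 GeV muons pass the selection.
    print(zmumu_accept((45.0, 0.0, 10.0, 46.1), (-45.0, 0.0, -10.0, 46.1)))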

46
  • Selection
  • track seeding, building, ambiguity resolution,
    smoothing with the KF
  • ctfWithMaterialTracks refit after applying
    alignment uncertainties
  • track associator by χ² to match simtracks with
    rectracks
  • Efficiency = number of reco tracks matched to
    simulated tracks / number of simulated tracks
    (see the sketch after this list)
  • - Simulated track: pT > 0.9 GeV/c, 0 < |η| < 2.5,
    d0 < 3 cm, z0 < 30 cm, nhit > 0
  • Reco track: pT > 0.7 GeV/c, 0 < |η| < 2.6,
    d0 < 120 cm, z0 < 170 cm, nhit ≥ 8
  • Fake rate = number of reco tracks not associated
    to simulated tracks / number of reco tracks
  • - Simulated track: pT > 0.7 GeV/c, 0 < |η| < 2.6,
    d0 < 300 cm, z0 < 300 cm, nhit > 8; not used
    because SimTrack does not have a number-of-simhits
    method → TrackingParticle will have it, but TP is
    not compatible with the CSA data samples
  • Reco track: pT > 0.9 GeV/c, 0 < |η| < 2.5,
    d0 < 3 cm, z0 < 30 cm, nhit ≥ 8
  • Track parameter resolution: sigma of a Gaussian
    fit to the distribution of residuals
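The efficiency and fake-rate definitions above have the simple
structure sketched below; the matching itself (the χ² track associator)
is treated as an abstract input, and the toy numbers are illustrative.

    def tracking_metrics(n_sim, n_reco, matches):
        """Efficiency and fake rate from a set of (sim, reco) index pairs.

        Efficiency = matched reco tracks / simulated tracks;
        fake rate  = unmatched reco tracks / reco tracks.
        """
        matched_reco = {r for _, r in matches}
        efficiency = len(matched_reco) / n_sim if n_sim else 0.0
        fake_rate = (n_reco - len(matched_reco)) / n_reco if n_reco else 0.0
        return efficiency, fake_rate

    # Toy example: 10 simulated tracks, 9 reconstructed, 8 matched pairs.
    print(tracking_metrics(10, 9, {(i, i) for i in range(8)}))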

47
  • CRAB_1_4_0 used to submit 1.8 k jobs
  • grid efficiency 99%, application efficiency 94%
  • Bunches of 150 jobs run in different time slots
  • max 45 jobs run in parallel
  • the configuration of squid was tuned to ensure
    that the alignment data were read from the local
    squid cache via the Frontier client rather than
    from CERN (blue histogram).

→ Frontier/squid works as expected at the Bari
Tier-2 when accessing alignment data
48
  • Misalignment affects the global track
    reconstruction efficiency in the first data-taking
    scenario.
  • The effect of tracker misalignment is quite
    relevant for the track parameter resolution (a
    factor of 2-3 degradation)

49
  • A factor between 2 and 3 in impact parameter
    resolution due to misalignment

50
Using CSA06 Z?mm sample
The Z mass resolution is increased by a factor
larger than 2 in the first data taking scenario
(RMS from 1.3 to 2.8)