Distributed Computing and Data Analysis for CMS in view of the LHC startup

Peter Kreuzer, RWTH-Aachen IIIa. International Symposium on Grid Computing (ISGC)

1
Distributed Computing and Data Analysis for CMS
in view of the LHC startup

Peter Kreuzer
RWTH-Aachen IIIa

International Symposium on Grid Computing
(ISGC) Taipei, April 9, 2008
2
Outline

Brief overview of the Worldwide LHC Computing Grid (WLCG)
Distributed Computing Challenges at CMS
Simulation
Reconstruction
Analysis
The physicist view
The road to the LHC startup

3
From local to distributed Analysis

Before: centrally organised Analysis

Example CMS: 4-6 PBytes of data per year, 2900 scientists, 40 countries, 184 institutes!

Solution: Tiered Computing Model

4
Worldwide LHC Computing GRID

Level of distribution motivated by the desire to leverage and empower resources, and to share load, infrastructure and funding

Tier-0 at CERN
Prompt Reconstruction
Calibration and low-latency work
Archiving
1.0 GByte/s

Tier-1s at large national
labs or universities
Re-Reconstruction
Physics skimming
Data Serving
Archiving

Aggregate rate from CERN to Tier-1s: > 1.0 GByte/s

Transfer rate to Tier-2: 50-500 MBytes/s

Tier-2s primarily at Universities
Simulation
User Analysis

Tier-3s at Institutes with
modest Infrastructure
Local User Analysis
Opportunistic Simulation
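
To put the quoted transfer rates in perspective, the sketch below converts a sustained rate into a daily data volume. It is only illustrative arithmetic: the function name is hypothetical, decimal units (1 TB = 1,000,000 MB) are assumed, and real links are not expected to run at a constant rate all day.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def daily_volume_tb(rate_mbyte_per_s: float) -> float:
    """Volume in TB moved in one day at a sustained rate (decimal units assumed)."""
    return rate_mbyte_per_s * SECONDS_PER_DAY / 1_000_000


# Rates quoted on the slide, expressed in MBytes/s.
tier1_aggregate = daily_volume_tb(1000)  # CERN to Tier-1s at 1.0 GByte/s
tier2_low = daily_volume_tb(50)          # lower end of the Tier-2 range
tier2_high = daily_volume_tb(500)        # upper end of the Tier-2 range

print(f"CERN to Tier-1s: {tier1_aggregate:.1f} TB/day")
print(f"To one Tier-2:   {tier2_low:.2f} to {tier2_high:.1f} TB/day")
```

At a sustained 1.0 GByte/s this comes to roughly 86 TB per day leaving CERN.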

5
WLCG Infrastructure

EGEE: Enabling Grids for E-sciencE
OSG: Open Science Grid

1 Tier-0, 11 Tier-1, 67 Tier-2
CMS: 1 Tier-0, 7 Tier-1, 35 Tier-2
Tier-0 to Tier-1: dedicated 10 Gb/s optical network
6
Examples of Sites

T2 RWTH (Aachen)
CPU: 540 KSI2k, 360 cores
Disc: 100 TB
Network (WAN): 2 Gbit/s
(2009: 450 cores, 150 TB)

T1 ASGC
CPU: 2.4 MSI2k, 1800 cores
Disc: 930 TB -> 1.5 PB
Tape: 586 TB -> 800 TB
Network: 10 Gbit/s
T2 Taiwan
CPU: 150 KSI2k
Disc: 19 TB -> 62 TB
Network: up to 10 Gbit/s

7
Pledged WLCG Resources
[Plot: pledged CPU in MSI2k; 2008: 66,000 cores; top of scale: 250,000 cores]
1 MSI2k = 670 cores

[Plot: pledged disc storage in PetaBytes; 2008: 40 PetaBytes]
(Tape storage: 33 PBytes in 2008)

(Reference: LCG Project Planning, 1.3.08)
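
The conversion quoted on this slide (1 MSI2k for 670 cores) can be applied to the other numbers in the talk. The sketch below does so; the helper names are hypothetical, and the factor is an average that individual sites will not match exactly.

```python
CORES_PER_MSI2K = 670  # conversion factor quoted on the slide


def msi2k_to_cores(msi2k: float) -> float:
    """Approximate number of cores for a given capacity in MSI2k."""
    return msi2k * CORES_PER_MSI2K


def cores_to_msi2k(cores: float) -> float:
    """Approximate capacity in MSI2k for a given number of cores."""
    return cores / CORES_PER_MSI2K


# 2008 pledge of 66,000 cores expressed in MSI2k.
pledged_2008_msi2k = cores_to_msi2k(66_000)

# Site examples from the previous slide, for comparison with the quoted core counts.
aachen_cores = msi2k_to_cores(0.540)  # slide quotes 360 cores
asgc_cores = msi2k_to_cores(2.4)      # slide quotes 1800 cores

print(f"2008 pledge: {pledged_2008_msi2k:.1f} MSI2k")
print(f"Aachen: {aachen_cores:.0f} cores, ASGC: {asgc_cores:.0f} cores")
```

The Aachen figure comes out close to the quoted 360 cores, while ASGC differs more, which is consistent with the factor being a rough average over hardware generations.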
8
Challenges for Experiments Example CMS

Scale-up and test distributed Computing Infrastructure
Mass Storage Systems and Computing Elements
Data Transfer
Calibration and Reconstruction
Event skimming
Simulation
Distributed Data Analysis
Test CMS Software Analysis Framework
Operate in quasi-real data taking conditions and simultaneously at various Tier levels
-> Computing, Software and Analysis (CSA) Challenge

9
CMS Computing and Software Analysis Challenges

CMS scaling-up in the last 4 years

Test (year)   Goal Jobs/day   Scale
DC04          15,000          5%
2005 - 2006: New Data Model and New Software Framework
CSA06         50,000          25%
CSA07         100,000         50%
CSA08         150,000         100%

Requires 100s of millions of simulated events as input
10
The CSA07 data Challenge
Input: 100M simulated events
TIER-0 (CASTOR, CAF, HLT): Reconstruction at 100 Hz, Calibration, Express Analysis
TIER-0 to TIER-1: 300 MB/s
TIER-1s: Re-Reconstruction, Skims, 25k jobs/day
Between TIER-1 and TIER-2: 20-200 MB/s and 10 MB/s
TIER-2s: Analysis 75k jobs/day, Simulation 50M evt/month
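
As a rough cross-check of the Tier-0 target, the sketch below estimates how long 100M events take at a sustained 100 Hz. The function name is hypothetical and the calculation assumes the rate is held continuously, which is an idealisation.

```python
SECONDS_PER_DAY = 86_400


def days_to_process(n_events: int, rate_hz: float) -> float:
    """Days needed to process n_events at a constant rate in events per second."""
    return n_events / (rate_hz * SECONDS_PER_DAY)


events_per_day = 100 * SECONDS_PER_DAY             # events per day at 100 Hz
days_for_100m = days_to_process(100_000_000, 100)  # full CSA07 input sample

print(f"{events_per_day:,} events/day, {days_for_100m:.1f} days for 100M events")
```

Under these assumptions the full sample needs a little under 12 days of uninterrupted reconstruction.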
11
In this presentation

Mainly covering the CMS Simulation, Reconstruction and Analysis challenges
Data transfer challenges are covered in the talk by Daniele Bonacorsi during this session

12
CMS Simulation System
CMS Physicist: "Please simulate new physics" / "Where are my data?"
Components: ProdRequest, Production Manager, ProdAgent (several instances), Global Data Bookkeeping (DBS)
Resources: GRID, Tier-1 and Tier-2 sites
13
ProdAgent workflows
1) Processing
2) Merging

Data processing / bookkeeping / tracking / monitoring in local scope
Output promoted to global-scope DBS and to the data transfer system PhEDEx
Scaling achieved by running multiple ProdAgent instances in parallel
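
The two-step workflow (processing, then merging) can be sketched as follows. This is not ProdAgent code: the class, function names, file names and the event-count threshold are all hypothetical, and merging by event count is an assumption made to keep the example small.

```python
from dataclasses import dataclass


@dataclass
class OutputFile:
    name: str
    n_events: int


def process(job_ids, events_per_job):
    """Step 1: each processing job writes one small, unmerged output file."""
    return [OutputFile(f"unmerged_{j}.root", events_per_job) for j in job_ids]


def merge(files, max_events_per_merged_file):
    """Step 2: group small files into larger ones, up to an event threshold."""
    merged, batch_events, batch_size = [], 0, 0
    for f in files:
        # Close the current batch if adding this file would exceed the threshold.
        if batch_size and batch_events + f.n_events > max_events_per_merged_file:
            merged.append(OutputFile(f"merged_{len(merged)}.root", batch_events))
            batch_events, batch_size = 0, 0
        batch_events += f.n_events
        batch_size += 1
    if batch_size:
        merged.append(OutputFile(f"merged_{len(merged)}.root", batch_events))
    return merged


unmerged = process(range(10), events_per_job=500)
merged = merge(unmerged, max_events_per_merged_file=2000)
for f in merged:
    print(f.name, f.n_events)
```

Only the merged files would then be candidates for promotion to the global scope, which keeps the number of globally tracked files small.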

14
CMS Simulation Performance

250M events in 5 months (June to November 2007)
Tier-2 alone: 72%
OSG alone: 50%
(Overall 07-08: 450M)

Production rate x 1.8
20k jobs/day reached
<Job efficiency>: 75%

[Plot: production rate in M Evts / Month]
15
Utilization of CMS Resources

Average: 50%
In best production periods: 75%

[Plot: utilization of 5000 job slots, June to November 2007, with periods of missing requests marked]
16
CSA07 Simulation lessons

Major boost in scale and reliability of production machinery
Still too many manual operations. From 2008 on:
Deploy ProdManager component (in CSA07 this was a human!)
Deploy Resource Monitor
Deploy CleanUpSchedule component
Further improvements in scale and reliability
gLite WMS bulk submission: 20k jobs/day with 1 WMS server
Condor-G JobRouter bulk submission: 100k jobs/day, and can saturate all OSG resources in 1 hour
Threaded JobTracking and Central Job Log Archival
Introduced task force for CMS Site Commissioning
help detect site issues via stress-test tool (enforce metrics)
couple site state to production and analysis machinery
Regular CMS Site Availability Monitoring (SAM) checks

17
CMS Site Availability Monitoring
[Plot: Availability Ranking (ARDA Dashboard), 03/22/08 to 04/03/08, scale 0 to 100]

Important tool to protect CMS use cases at sites
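
A minimal sketch of how an availability ranking could be coupled to job submission is shown below. It does not use the SAM or Dashboard interfaces: the site names, the pass/fail check results and the 70% threshold are all made up for illustration.

```python
def availability(results):
    """Percentage of passed checks; `results` is a list of booleans."""
    return 100.0 * sum(results) / len(results) if results else 0.0


def rank_sites(checks):
    """Return (site, availability) pairs sorted from best to worst."""
    scores = {site: availability(results) for site, results in checks.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


def usable_sites(checks, threshold):
    """Sites whose availability meets the (hypothetical) threshold."""
    return [site for site, score in rank_sites(checks) if score >= threshold]


# Hypothetical check results for three sites.
checks = {
    "T2_Example_A": [True, True, True, False],
    "T2_Example_B": [True, False, False, False],
    "T1_Example_C": [True, True, True, True],
}

ranking = rank_sites(checks)
good = usable_sites(checks, threshold=70.0)
print(ranking)
print(good)
```

Production and analysis tools could then restrict submission to the sites in `good`, which is one way to read "couple site state to production and analysis machinery".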

18
CSA07 Reconstruction and Skimming
0) Preparation of Primary Datasets: mimics real CMS Detector Trigger data
1) Archive and Reconstruction at CERN T0
2) Archive and Re-Reconstruction at T1s
3) Skimming at T1s
4) Express analysis and Calibration at CERN Analysis Facility
-> 3 different calibrations: 10 pb-1, 100 pb-1, 0 pb-1
19
Produced CSA07 Data Volumes
[Plot: DIGI-RAW-HLT-RECO events (x1e8), 10/07 to 02/08]

Total CSA07 event counts:
80M GEN-SIM
80M DIGI-RAW
80M HLT
330M RECO (3 diff. calibrations)
250M AOD
100M skims
---------------------------
920M events

Total data volume: 2 PB
Corresponds to expected 2008 volume!

CMS data in CASTOR@CERN: 3.7 PB
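
The quoted total can be checked by summing the per-tier counts. The sketch below also derives a mean size per event, under the assumption (not stated on the slide) that the 2 PB is spread over all 920M events and that decimal units apply; in practice the data tiers differ strongly in event size.

```python
# Event counts from the slide, in millions of events.
csa07_events_millions = {
    "GEN-SIM": 80,
    "DIGI-RAW": 80,
    "HLT": 80,
    "RECO": 330,
    "AOD": 250,
    "skims": 100,
}

total_millions = sum(csa07_events_millions.values())

# Mean size per event in MB, assuming 2 PB = 2e9 MB covers all tiers.
total_volume_mb = 2 * 1e9
mean_mb_per_event = total_volume_mb / (total_millions * 1e6)

print(f"Total: {total_millions}M events")
print(f"Mean size: {mean_mb_per_event:.2f} MB/event (all tiers averaged)")
```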
20
CSA07 Reconstruction lessons
T0 and T1 processing: 2k running jobs

T0: Reconstruction at 100 Hz only in bursts, mainly due to stream splitting activity
Heavy load on CASTOR
Useful feedback to ProdAgent developers to prepare 2008 data taking (repacker, ...)
T1: Processing submission rate was the main limitation. Now based on gLite bulk submission and reaching 12-14k jobs/day with 1 ProdAgent instance
Further rate improvement to be expected with T1 resource up-scaling

21
CMS Analysis System
CRAB: CMS Remote Analysis Builder
An interface to the GRID for CMS physicists
Challenge: match processing resources with large quantities of data (chaotic processing)

CMS Physicist: "Please analyse datasets X/Y" / "Where are my jobs?"
Components: CRAB, CRAB Server, Global Data Bookkeeping (DBS)
Resources: GRID, Tier-1 and Tier-2 sites
22
CRAB Architecture

Easy and transparent means for CMS users to submit analysis jobs via the GRID (LCG RB, gLite WMS, Condor-G)

CSA07 analysis: direct submission by user to GRID. Simple, but lacking automation and scalability
-> 2008: CRAB server

Other new feature: local DBS for private users

23
CSA07 Analysis

100k jobs/day not achieved
mainly due to lack of data during the challenge
still limited by data distribution: 55% of jobs at the 3 largest Tier-1s
and failure rate too high

53% successful jobs, 20% failed jobs, 27% unknown
20k jobs/day achieved regularly 30k/day
JobRobot submissions
[Plot: number of jobs]
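
The outcome breakdown above is the kind of summary a simple tally over job records produces. The sketch below reproduces it from a made-up list of 100 statuses; the status labels and the function name are hypothetical, and no real monitoring data is used.

```python
from collections import Counter


def outcome_percentages(statuses):
    """Share of each status in percent of all jobs."""
    counts = Counter(statuses)
    total = sum(counts.values())
    return {status: 100.0 * n / total for status, n in counts.items()}


# Synthetic sample matching the proportions quoted on the slide.
statuses = ["success"] * 53 + ["failed"] * 20 + ["unknown"] * 27
shares = outcome_percentages(statuses)

# Success rate among jobs whose outcome is actually known.
known = shares["success"] + shares["failed"]
success_among_known = 100.0 * shares["success"] / known

print(shares)
print(f"Success among known outcomes: {success_among_known:.1f}%")
```

The large "unknown" share matters here: depending on how those jobs really ended, the true success rate could lie anywhere between 53% and 80%.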
24
CMS Grid Users since 1 year

[Plot: distinct users per month; label: CRAB Server]
300 users during February 2008
20 most active users carry 1/3 of jobs
25
The Physicist View

SUSY search in di-lepton + jets + MET
Goal: simulate excess over Standard Model (LM1 at 1 fb-1)
Infrastructure
1 desktop PC
CMS Software Environment (CMSSW, CRAB, Discovery GUI, ...)
GRID certificate + membership of a Virtual Organisation (CMS)
Input data (CSA07 simulation/production)
Signal (RECO): 120k events, 360 GB
Skimmed Background (AOD): 3.3M events, 721 GB
WW / WZ / ZZ / single top
ttbar / Z / W + jets
Unskimmed Background: 27M events, 4 TB (for detailed studies only)
Location of input data
T0/T1: CERN (CH), FNAL (US), FZK (Germany)
T2: Legnaro (Italy), UCSD (US), IFCA (Spain)

1.1 TB (signal + skimmed background)
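
The sample sizes above imply quite different per-event footprints for RECO and AOD. The sketch below works them out; the dictionary keys are hypothetical labels and decimal units (1 GB = 1000 MB) are assumed.

```python
# (number of events, size in GB) as quoted on the slide.
samples = {
    "signal_RECO": (120_000, 360),
    "skimmed_background_AOD": (3_300_000, 721),
    "unskimmed_background": (27_000_000, 4000),
}


def mb_per_event(n_events: int, size_gb: float) -> float:
    """Average event size in MB, assuming 1 GB = 1000 MB."""
    return size_gb * 1000 / n_events


per_event = {name: mb_per_event(n, gb) for name, (n, gb) in samples.items()}

# Signal plus skimmed background, the inputs of the main analysis.
total_gb = samples["signal_RECO"][1] + samples["skimmed_background_AOD"][1]

for name, size in per_event.items():
    print(f"{name}: {size:.3f} MB/event")
print(f"Signal + skimmed background: {total_gb} GB")
```

The sum of the signal and skimmed background samples is consistent with the 1.1 TB quoted on the slide.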
26
GRID Analysis Result
Analysis latency
Signal + background: 322 jobs, 22h to produce this result!
Detailed studies: 1300 jobs, 3.5 days

[Plot: End-Point Signal; Z peak from SUSY cascades; x-axis in GeV]
Georgia Karapostoli, Athens Univ.
27
CSA07 Analysis lessons

Improve Analysis scalability, automation and reliability
CRAB-Server
Automate job re-submission
Optimize job distribution
Decrease failure rate
Move Analysis to Tier-2s
To protect Tier-0/1 LSF and storage systems
To make use of all available GRID resources
Encourage Tier-2 to Physics group association
In close collaboration with sites
With solid overall Data Management strategy
Assess local-scope DM for Physics groups and storage of user data
Aim for 500 users by June, exceeding the capacity of several gLite WMS
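
Automated re-submission, one of the lessons listed above, can be sketched as a bounded retry loop. This is not the CRAB server logic: the function names, the `submit` callback and the limit of three attempts are hypothetical, and the residual failure estimate assumes independent failures, which real site problems often violate.

```python
def run_with_resubmission(submit, job_id, max_attempts=3):
    """Call submit(job_id) until it returns True or attempts run out.

    Returns (succeeded, attempts_used).
    """
    for attempt in range(1, max_attempts + 1):
        if submit(job_id):
            return True, attempt
    return False, max_attempts


def residual_failure_rate(failure_rate: float, max_attempts: int) -> float:
    """Probability that every attempt fails, assuming independent failures."""
    return failure_rate ** max_attempts


# Hypothetical job that fails twice and succeeds on the third attempt.
attempts_seen = []


def flaky_submit(job_id):
    attempts_seen.append(job_id)
    return len(attempts_seen) >= 3


result = run_with_resubmission(flaky_submit, job_id=42)
print(result)
print(f"{100 * residual_failure_rate(0.20, 3):.1f}% of jobs would still fail")
```

With the 20% failure rate seen in CSA07 and up to three attempts, the idealised residual failure rate drops below 1%, which illustrates why automation was a priority.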

28
Goals for CSA08 (May 08)

Play through first 3 months of data taking
Simulation
150M events at 1 pb-1 (S43)
150M events at 10 pb-1 (S156)
Tier-0: Prompt reconstruction
S43 with startup calibration
S156 with improved calibration
CERN Analysis Facility (CAF)
Demonstrate low turn-around Alignment and Calibration workflows
Coordinated and time-critical physics analyses
Proof-of-principle of CAF Data and Workflow Management Systems
Tier-1: Re-Reconstruction with new calibration constants
S43 with improved constants based on 1 pb-1
S156 with improved constants based on 10 pb-1
Tier-2
iCSA08 simulation (GEN-SIM-DIGI-RAW-HLT)
repeat CAF-based Physics analyses with Re-Reco data?

29
2008
Two tracks: Detector installation, commissioning and operation; Preparation of Software, Computing and Physics analysis
Timeline: Jan Feb Mar Apr May Jun Jul Aug Sep Oct
Milestones, in the order shown on the slide:
2007 Physics Analyses results
Cooldown of magnet
Private global runs (2 days/week)
Private mini-daq
CCRC08-1
GRUMM
CMSSW 1.8.0 sample production
CMSSW 2.0 release, production start-up MC samples
Low i test
2 weeks of 2.0 testing
Beam-pipe baked-out
Pixels installed
CR 0T
iCSA08 sample generation
CR 0T
CR 0T
iCSA08 / CCRC08-2
CMS closed
pre CR 4T
CRAFT
CMSSW 2.1 release: all basic sw components ready for LHC, new T0 prod tools
Initial CMS ready for run
CR 4T
fCSA08 or beam!
Must keep exercises mostly non-overlapped
CCRC: Common-VO Computing Readiness Challenge
CR: Commissioning Run
30
Where do we stand ?