Distributed Computing and Data Analysis for CMS in view of the LHC startup
1
Distributed Computing and Data Analysis for CMS
in view of the LHC startup
  • Peter Kreuzer
  • RWTH-Aachen IIIa

International Symposium on Grid Computing (ISGC), Taipei, April 9, 2008
2
Outline
  • Brief overview of the Worldwide LHC Computing Grid (WLCG)
  • Distributed Computing Challenges at CMS
  • Simulation
  • Reconstruction
  • Analysis
  • The physicist view
  • The road to the LHC startup

3
From local to distributed Analysis
  • Before: centrally organised Analysis
  • Example CMS: 4-6 PBytes of data per year, 2900 scientists,
    40 countries, 184 institutes!
  • Solution: a tiered Computing Model

4
Worldwide LHC Computing GRID
  • Level of distribution motivated by the desire to leverage and
    empower resources: share load, infrastructure and funding
  • Tier-0 at CERN: Prompt Reconstruction, Calibration and
    low-latency work, Archiving
  • Aggregate rate from CERN to Tier-1s: > 1.0 GByte/s
  • Tier-1s at large national labs or universities:
    Re-Reconstruction, Physics skimming, Data serving, Archiving
  • Transfer rate to Tier-2s: 50-500 MBytes/s
  • Tier-2s, primarily at universities: Simulation, User Analysis
  • Tier-3s at institutes with modest infrastructure:
    Local User Analysis, Opportunistic Simulation

5
WLCG Infrastructure
  • EGEE: Enabling Grids for E-sciencE
  • OSG: Open Science Grid

WLCG: 1 Tier-0, 11 Tier-1s, 67 Tier-2s
CMS: 1 Tier-0, 7 Tier-1s, 35 Tier-2s
Tier-0 -- Tier-1: dedicated 10 Gb/s optical network
6
Examples of Sites
  • T2 RWTH (Aachen)
    CPU: 540 kSI2k (360 cores); Disk: 100 TB;
    Network (WAN): 2 Gbit/s
    (2009: 450 cores, 150 TB)
  • T1 ASGC
    CPU: 2.4 MSI2k (1800 cores); Disk: 930 TB → 1.5 PB;
    Tape: 586 TB → 800 TB; Network: 10 Gbit/s
  • T2 Taiwan
    CPU: 150 kSI2k; Disk: 19 TB → 62 TB;
    Network: up to 10 Gbit/s

7
Pledged WLCG Resources
  • CPU pledges (plot in MSI2k): 66,000 cores in 2008, rising to
    250,000 cores (1 MSI2k ≈ 670 cores; a rough conversion check
    follows below)
  • Disk storage pledges (plot in PetaBytes): 40 PetaBytes in 2008
  • (Tape storage: 33 PBytes in 2008)

(Reference: LCG Project Planning, 1.3.08)
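As a back-of-the-envelope cross-check (not a figure from the slide), the quoted conversion of roughly 670 cores per MSI2k translates the pledged core counts into the MSI2k units of the plot:

```latex
\frac{66\,000\ \text{cores (2008)}}{670\ \text{cores/MSI2k}} \approx 98\ \text{MSI2k},
\qquad
\frac{250\,000\ \text{cores}}{670\ \text{cores/MSI2k}} \approx 373\ \text{MSI2k}.
```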
8
Challenges for Experiments: Example CMS
  • Scale-up and test distributed Computing
    Infrastructure
  • Mass Storage Systems and Computing Elements
  • Data Transfer
  • Calibration and Reconstruction
  • Event skimming
  • Simulation
  • Distributed Data Analysis
  • Test the CMS Software and Analysis Framework
  • Operate in quasi-real data-taking conditions and
    simultaneously at various Tier levels
  • → Computing, Software and Analysis (CSA) Challenge

9
CMS Computing and Software Analysis Challenges
  • CMS scaling-up over the last 4 years:

    Test (year)    Goal (jobs/day)    Scale
    DC04           15,000             5%
    2005 - 2006    New Data Model and New Software Framework
    CSA06          50,000             25%
    CSA07          100,000            50%
    CSA08          150,000            100%

  • Requires 100s of millions of simulated events as input
10
The CSA07 data Challenge
(Diagram: CSA07 dataflow)
  • Input: 100M simulated events
  • HLT → TIER-0: Reconstruction at 100 Hz; archiving to CASTOR
  • CAF: Calibration and Express Analysis
  • TIER-0 → TIER-1s: 300 MB/s (a rough per-event estimate
    follows after this list)
  • TIER-1s: Re-Reconstruction and skims, 25k jobs/day
  • TIER-1 → TIER-2: 20-200 MB/s; TIER-2 → TIER-1: 10 MB/s
  • TIER-2s: Analysis, 75k jobs/day; Simulation, 50M evt/month
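As a rough sanity check (assuming, which the slide does not state explicitly, that the 300 MB/s Tier-0 → Tier-1 export corresponds to the 100 Hz reconstruction output), the implied average data volume per event is:

```latex
\frac{300\ \text{MB/s}}{100\ \text{Hz}} \approx 3\ \text{MB/event}.
```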
11
In this presentation
  • Mainly covering CMS Simulation, Reconstruction
    and Analysis challenges
  • Data transfer challenges are covered in the talk by
    Daniele Bonacorsi during this session

12
CMS Simulation System
(Diagram) A CMS physicist submits a request ("Please simulate new
physics") through ProdRequest; the Production Manager hands the
work to one or more ProdAgent instances, which run the jobs on
Tier-1 and Tier-2 sites over the GRID; the resulting data are
registered in the Global Data Bookkeeping system (DBS), which
answers the physicist's question "Where are my data?"
13
ProdAgent workflows
1) Processing  2) Merging
  • Data processing / bookkeeping / tracking / monitoring in local scope
  • Output promoted to the global-scope DBS and to the data
    transfer system PhEDEx
  • Scaling achieved by running multiple ProdAgent instances in
    parallel (see the sketch after this list)
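The following is only an illustrative sketch of the processing-then-merging pattern described above; all names (WorkflowSpec, run_processing, run_merge, ...) are hypothetical and do not reproduce the real ProdAgent code.

```python
# Minimal sketch of a "processing then merging" production workflow:
# split a request into processing jobs, track their unmerged outputs,
# then group them into merge jobs whose merged files would be promoted
# to the global-scope DBS and handed to PhEDEx for transfer.

from dataclasses import dataclass
from typing import List

@dataclass
class WorkflowSpec:
    dataset: str              # e.g. "/ExampleMC/GEN-SIM" (hypothetical)
    total_events: int
    events_per_job: int

@dataclass
class JobReport:
    output_file: str
    events: int

def run_processing(spec: WorkflowSpec) -> List[JobReport]:
    """Split the request into jobs and track each unmerged output
    in a local-scope bookkeeping catalogue (here: just a list)."""
    n_jobs = -(-spec.total_events // spec.events_per_job)  # ceiling division
    return [JobReport(f"{spec.dataset}/unmerged_{i}.root",
                      min(spec.events_per_job,
                          spec.total_events - i * spec.events_per_job))
            for i in range(n_jobs)]

def run_merge(reports: List[JobReport], target_size: int) -> List[List[JobReport]]:
    """Group small unmerged files into merge jobs of roughly target_size events."""
    groups, current, acc = [], [], 0
    for r in reports:
        current.append(r)
        acc += r.events
        if acc >= target_size:
            groups.append(current)
            current, acc = [], 0
    if current:
        groups.append(current)
    return groups

if __name__ == "__main__":
    spec = WorkflowSpec("/ExampleMC/GEN-SIM", total_events=100_000, events_per_job=500)
    unmerged = run_processing(spec)
    merged_groups = run_merge(unmerged, target_size=5_000)
    print(f"{len(unmerged)} processing jobs, {len(merged_groups)} merge jobs")
```

Running several such instances in parallel, each with its own local-scope bookkeeping, is the scaling mechanism mentioned in the last bullet.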

14
CMS Simulation Performance
  • 250M events in 5 months (June – November 2007)
  • Tier-2 sites alone: 72%
  • OSG alone: 50%
  • (Overall 2007-08: 450M events)
  • Production rate increased by a factor 1.8
    (plot: M events/month vs. month)
  • 20k jobs/day reached
  • Average job efficiency: 75%
15
Utilization of CMS Resources
  • Average: 50%
  • In best production periods: 75%
  • (Plot: usage of ~5000 job slots, June – November 2007;
    annotation: "Missing Requests")
16
CSA07 Simulation lessons
  • Major boost in scale and reliability of
    production machinery
  • Still too many manual operations. From 2008 on:
  • Deploy ProdManager component (in CSA07 this was a human!)
  • Deploy Resource Monitor
  • Deploy CleanUpSchedule component
  • Further improvements in scale and reliability:
  • gLite WMS bulk submission: 20k jobs/day with 1 WMS server
  • Condor-G JobRouter bulk submission: 100k jobs/day, able to
    saturate all OSG resources in 1 hour
  • Threaded JobTracking and Central Job Log Archival
  • Introduced task-force for CMS Site Commissioning
  • help detect site issues via stress-test tool
    (enforce metrics)
  • couple site-state to production and analysis
    machinery
  • Regular CMS Site Availability Monitoring (SAM)
    checks

17
CMS Site Availability Monitoring
(Plot: site availability ranking from the ARDA Dashboard,
03/22/08 – 04/03/08, scale 0 – 100%)
  • Important tool to protect CMS use cases at sites
    (a sketch of how such an availability figure can be computed
    follows below)
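The following is purely an illustrative sketch of how a per-site availability figure like the Dashboard ranking could be derived from periodic SAM-style test results; it is not the actual SAM or ARDA Dashboard code, and the test names and data layout are invented for the example (the site names follow the real CMS naming scheme).

```python
# Compute a per-site availability figure as the fraction of SAM-style
# test runs that passed, then rank the sites by that fraction.

from collections import defaultdict
from typing import Dict, List, Tuple

# (site, test_name, passed) -- one entry per test run (illustrative layout)
TestResult = Tuple[str, str, bool]

def availability(results: List[TestResult]) -> Dict[str, float]:
    """Availability per site = 100 * passed test runs / total test runs."""
    passed = defaultdict(int)
    total = defaultdict(int)
    for site, _test, ok in results:
        total[site] += 1
        passed[site] += ok
    return {site: 100.0 * passed[site] / total[site] for site in total}

if __name__ == "__main__":
    sample = [
        ("T2_DE_RWTH", "CE-sft-job", True),
        ("T2_DE_RWTH", "SRM-put", True),
        ("T2_DE_RWTH", "SRM-get", False),
        ("T1_TW_ASGC", "CE-sft-job", True),
        ("T1_TW_ASGC", "SRM-put", True),
    ]
    for site, avail in sorted(availability(sample).items(), key=lambda x: -x[1]):
        print(f"{site}: {avail:.0f}% available")
```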

18
CSA07 Reconstruction and Skimming
0) Preparation of Primary Datasets, mimicking real CMS detector trigger data
1) Archive and Reconstruction at the CERN T0
2) Archive and Re-Reconstruction at T1s
3) Skimming at T1s
4) Express analysis and Calibration at the CERN Analysis Facility
→ 3 different calibrations: 10 pb-1, 100 pb-1, 0 pb-1
19
Produced CSA07 Data Volumes
(Plot: cumulative DIGI-RAW-HLT-RECO events, x1e8, 10/07 – 02/08)
Total CSA07 event counts:
  80M GEN-SIM
  80M DIGI-RAW
  80M HLT
  330M RECO (3 different calibrations)
  250M AOD
  100M skims
  ---------------------------
  920M events
  • Total data volume: 2 PB
  • Corresponds to the expected 2008 volume!
CMS data in CASTOR at CERN: 3.7 PB
20
CSA07 Reconstruction lessons
(Plot: T0 and T1 processing, up to ~2k running jobs)
  • T0 Reconstruction at 100 Hz only in bursts, mainly due to
    stream-splitting activity
  • Heavy load on CASTOR
  • Useful feedback to ProdAgent developers to prepare 2008 data
    taking (repacker, ...)
  • T1 processing: the submission rate was the main limitation.
    Now based on gLite bulk submission and reaching 12-14k
    jobs/day with 1 ProdAgent instance
  • Further rate improvement to be expected with T1 resource up-scaling

21
CMS Analysis System
CRAB (CMS Remote Analysis Builder): an interface to the GRID for
CMS physicists. Challenge: match processing resources with large
quantities of data ("chaotic" processing).
(Diagram) The CMS physicist asks "Please analyse datasets X/Y";
CRAB looks the data up in the Global Data Bookkeeping system
(DBS) and submits jobs, via the CRAB Server, to Tier-1 and Tier-2
sites over the GRID; job status answers the follow-up question
"Where are my jobs?"
22
CRAB Architecture
  • Easy and transparent means for CMS users to submit analysis
    jobs via the GRID (LCG RB, gLite WMS, Condor-G)
  • CSA07 analysis: direct submission by the user to the GRID.
    Simple, but lacking automation and scalability
  • → 2008: CRAB Server (see the sketch after this list)
  • Other new feature: local DBS for private users
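The sketch below only illustrates the kind of flow a CRAB-like tool follows for a user task: take the user's request (dataset, CMSSW configuration, job granularity), ask a bookkeeping/data-location service which sites host the data, and build job descriptions targeted at those sites for GRID submission. All names, structures and the hard-coded catalogue are hypothetical; this is not the real CRAB API.

```python
# Turn a user analysis request into GRID job descriptions bound to the
# sites that host the input dataset (illustrative only).

from dataclasses import dataclass
from typing import Dict, List

@dataclass
class UserTask:
    dataset: str          # dataset path as published in DBS (hypothetical)
    pset: str             # user's CMSSW configuration file
    total_events: int
    events_per_job: int

@dataclass
class GridJob:
    index: int
    first_event: int
    n_events: int
    candidate_sites: List[str]   # sites hosting the input data

def locate_dataset(dataset: str) -> List[str]:
    """Stand-in for a DBS/data-location lookup (hard-coded for the example)."""
    catalogue: Dict[str, List[str]] = {
        "/CSA07Signal/RECO": ["T1_US_FNAL", "T2_IT_Legnaro", "T2_US_UCSD"],
    }
    return catalogue.get(dataset, [])

def build_jobs(task: UserTask) -> List[GridJob]:
    """Split the requested event range into jobs, each pointed at the
    sites where the data reside; submission (gLite WMS, Condor-G, ...)
    would happen downstream of this step."""
    sites = locate_dataset(task.dataset)
    jobs, first, index = [], 0, 0
    while first < task.total_events:
        n = min(task.events_per_job, task.total_events - first)
        jobs.append(GridJob(index, first, n, sites))
        first += n
        index += 1
    return jobs

if __name__ == "__main__":
    task = UserTask("/CSA07Signal/RECO", "susy_dilepton_cfg.py", 120_000, 5_000)
    jobs = build_jobs(task)
    print(f"{len(jobs)} jobs, e.g. job 0 -> sites {jobs[0].candidate_sites}")
```

A CRAB Server adds the automation layer on top of this: it re-submits failed jobs and tracks status on behalf of the user instead of relying on direct user submission.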

23
CSA07 Analysis
  • 100k jobs/day not achieved
  • mainly due to lacking data during the challenge
  • still limited by data distribution: 55% of jobs at the
    3 largest Tier-1s
  • and a failure rate that was still too high

53% successful jobs, 20% failed jobs, 27% unknown
20k jobs/day achieved; regularly 30k/day including JobRobot submissions
(Plot: number of jobs per day)
24
CMS Grid users over the past year
  • Plot showing distinct users per month (axes: Users vs. Month;
    annotation: CRAB Server)
  • 300 users during February 2008
  • The 20 most active users carry 1/3 of the jobs
25
The Physicist View
  • SUSY search in di-lepton + jets + MET
  • Goal: simulate an excess over the Standard Model (LM1 at 1 fb-1)
  • Infrastructure:
  • 1 desktop PC
  • CMS Software Environment (CMSSW, CRAB, Discovery GUI, ...)
  • GRID Certificate, membership of a Virtual Organisation (CMS)
  • Input data (CSA07 simulation/production):
  • Signal (RECO): 120k events, 360 GB
  • Skimmed Background (AOD): 3.3M events, 721 GB
  • WW / WZ / ZZ / single top
  • ttbar / Z / W + jets
  • (Signal + skimmed background: 1.1 TB)
  • Unskimmed Background: 27M events, 4 TB (for detailed studies only)
  • Location of input data:
  • T0/T1: CERN (CH), FNAL (US), FZK (Germany)
  • T2: Legnaro (Italy), UCSD (US), IFCA (Spain)
26
GRID Analysis Result
(Plot: di-lepton invariant mass in GeV, showing the end-point
signal and the Z peak from SUSY cascades; Georgia Karapostoli,
Athens Univ.)
  • Analysis latency (signal + background): 322 jobs → 22 h
    to produce this result!
  • Detailed studies: 1300 jobs → 3.5 days
27
CSA07 Analysis lessons
  • Improve Analysis scalability, automation and
    reliability
  • CRAB-Server
  • Automate job re-submission
  • Optimize job distribution
  • Decrease failure rate
  • Move Analysis to Tier-2s
  • To protect Tier-0/1 LSF and storage systems
  • To make use of all available GRID resources
  • Encourage Tier-2-to-Physics-group associations
  • In close collaboration with sites
  • With a solid overall Data Management strategy
  • Assess local-scope Data Management for Physics groups and
    storage of user data
  • Aim for 500 users by June, exceeding the capacity of several
    gLite WMS servers

28
Goals for CSA08 (May 08)
  • Play through the first 3 months of data taking
  • Simulation:
  • 150M events at 1 pb-1 (S43)
  • 150M events at 10 pb-1 (S156)
  • Tier-0: Prompt reconstruction
  • S43 with start-up calibration
  • S156 with improved calibration
  • CERN Analysis Facility (CAF):
  • Demonstrate low turn-around Alignment/Calibration workflows
  • Coordinated and time-critical physics analyses
  • Proof-of-principle of the CAF Data and Workflow Management Systems
  • Tier-1: Re-Reconstruction with new calibration constants
  • S43 with improved constants based on 1 pb-1
  • S156 with improved constants based on 10 pb-1
  • Tier-2:
  • iCSA08 simulation (GEN-SIM-DIGI-RAW-HLT)
  • Repeat CAF-based physics analyses with Re-Reco data?

29
2008: Detector installation, commissioning and operation, in
parallel with the preparation of Software, Computing and Physics
analysis (timeline, January – October 2008):
  • 2007 physics analyses results; cooldown of the magnet
  • Private global runs (2 days/week); private mini-daq
  • CCRC08-1; GRUMM
  • CMSSW 1.8.0 sample production
  • CMSSW 2.0 release: production of start-up MC samples;
    low-i test; 2 weeks of 2.0 testing
  • Beam-pipe baked out, pixels installed; CR 0T commissioning runs
  • iCSA08 sample generation; iCSA08 / CCRC08-2
  • CMS closed; pre-CR 4T; CRAFT
  • CMSSW 2.1 release: all basic software components ready for
    LHC, new T0 production tools
  • Initial CMS ready for run; CR 4T; fCSA08 or beam!
  • Must keep exercises mostly non-overlapping
  • CCRC = Common-VO Computing Readiness Challenge; CR = Commissioning Run
30
Where do we stand?
  • WLCG: major up-scaling over the last 2 years!
  • CMS: impressive results and valuable lessons from CSA07
  • Major boost in Simulation
  • Produced 2 PBytes of data in T0/T1 Reconstruction and Skimming
  • Analysis: the number of CMS Grid users is ramping up fast!
  • Software: addressed memory footprint and data size issues
  • Further challenges for CMS: scale from 50% to 100%
  • Simultaneous and continuous operations at all
    Tier levels
  • Analysis distribution and automation
  • Transfer rates (see talk by D.Bonacorsi)
  • Upscale and commission the CERN Analysis Facility
    (CAF)
  • CSA08, CCRC08, Commissioning Runs
  • Challenging and motivating goals in view of Day-1
    LHC !