Title: Management of Large Scale Data Productions for the CMS Experiment
1. Management of Large Scale Data Productions for the CMS Experiment
- Presented by L.M. Barone (Università di Roma / INFN)
2. The Framework
- The CMS experiment is producing a large amount of MC data for the development of High Level Trigger (HLT) algorithms for fast data reduction at LHC
- Current production is half traditional (Pythia + CMSIM/Geant3) and half OO (ORCA using Objectivity/DB)
3. The Problem
- Dealing with actual MC productions, not with 2005 data taking
- Data size: 10^6 - 10^7 events at ~1 MB/event → ~10^4 files (typically 500 events/file)
- Resource dispersion: many production sites (CERN, FNAL, Caltech, INFN, etc.)
4. The Problem (contd)
- Data relocation: data produced in site A are stored centrally (CERN); site B may need a fraction of them → combinatorics increasing
- Objectivity/DB does not make life easier (but the problem would exist anyway)
5. ORCA Production 2000
[Flow diagram: MC production generates HEPEVT ntuples for signal and minimum bias (MB); CMSIM produces Zebra files with HITS; the ORCA ooHit Formatter writes them into an Objectivity database; after catalog import, ORCA digitization merges signal and MB into an Objectivity database; ORCA production and a further catalog import feed the HLT algorithms, whose new reconstructed objects go into an Objectivity database; HLT group databases are mirrored to the US, Russia, Italy, ...]
6. The Old Days
- Question: how was it done before? A mix of ad hoc scripts/programs with a lot of manual intervention... but the problem was smaller and less dispersed
7. Requirements for a Solution
- Solution must be as automatic as possible → decrease manpower
- Tools should be independent from data type and from site
- Network traffic should be optimized (or minimized?)
- Users need complete information on data location
8. Present Status
- Job creation is managed by a variety of scripts in different sites
- Job submission again goes through diverse methods, from UNIX commands to LSF or Condor
- File transfer has been managed up to now by Perl scripts → not generic, not site independent
9. Present Status (contd)
- The autumn 2000 production round is a trial towards standardization:
  - same layout (OS, installation)
  - same scripts (T. Wildish) for non-Objy data transfer
  - first use of GRID tools (see talk by A. Samar)
  - validation procedure for production sites
10. Collateral Activities
- Linux CMS software automatic installation kit (INFN)
- Globus installation kit (INFN)
- Production monitoring tools with Web interface
11. What is missing?
- Scripts and tools are still too specific and not robust enough → need practice on this scale
- Information service needs a clear definition in our context and then an effective implementation (see later)
- File replication management is just appearing and needs careful evaluation
12. Ideas for Replica Management
- A case study with Objectivity/DB (thanks to C. Grandi, INFN Bologna)
- Can be extended to any kind of file
13. Cloning federations
- Cloned federations have a local catalog (boot file)
- It is possible to manage each of them in an independent way; some databases may be attached (or exist) only in one site
- Manual work is needed to keep the schemas synchronized (this is not the key point today...)
14. Cloning federations
[Diagram: cloning a federation (Clone FD)]
15. Productions
- Using a DB-id pre-allocation system it is possible to produce databases at RCs which can then be exported to other sites
- A notification system is needed to inform other sites when a database is completed
- This is today accomplished by GDMP using a publish-subscribe mechanism
16. Productions (contd)
- When a site receives a notification, it can:
  - ooattachdb to the remote site DB
  - copy the DB and ooattachdb it locally
  - ignore it
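As a minimal sketch of this publish-subscribe flow and the three per-site choices, assuming invented class names and a per-dataset policy map (the real mechanism is GDMP's, not reimplemented here):

```python
# Illustrative sketch of the publish-subscribe notification described above.
# All names (ProductionSite, Publisher, the policy keys) are hypothetical.

class ProductionSite:
    def __init__(self, name, policy):
        self.name = name
        self.policy = policy   # dataset -> "attach_remote" | "copy_and_attach" | "ignore"
        self.catalog = {}      # DBid -> path this site will use

    def on_db_completed(self, db_id, dataset, remote_path):
        """React to a 'database completed' notification from another site."""
        action = self.policy.get(dataset, "ignore")
        if action == "attach_remote":
            # ooattachdb pointing at the remote copy: catalog stays off-site
            self.catalog[db_id] = remote_path
        elif action == "copy_and_attach":
            # transfer the file, then ooattachdb the local copy
            filename = remote_path.rsplit("/", 1)[-1]
            self.catalog[db_id] = f"{self.name}/data/{filename}"
        # "ignore": do nothing
        return action

class Publisher:
    def __init__(self):
        self.subscribers = []

    def publish(self, db_id, dataset, path):
        """Notify every subscribed site that a database is complete."""
        for site in self.subscribers:
            site.on_db_completed(db_id, dataset, path)

bo = ProductionSite("pccms1.bo.infn.it", {"Hmm": "copy_and_attach"})
pub = Publisher()
pub.subscribers.append(bo)
pub.publish(12345, "Hmm", "shift23.cern.ch/db45/Hmm1.hits.DB")
print(bo.catalog[12345])   # -> pccms1.bo.infn.it/data/Hmm1.hits.DB
```

The policy map is the key design point: each subscriber decides locally what a notification means for its own catalog, so no central component needs to know every site's storage layout.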
17. Productions (contd)
[Diagram]
18. Analysis
- In each site a complete catalog with the location of all the datasets is needed; some DBs are local and some are remote
- In case more copies of a DB are available, it would be nice to have in the local catalog the closest one (NWS)
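A hedged sketch of "closest copy" selection: the cost numbers below stand in for Network Weather Service (NWS) estimates and are entirely invented; the hostnames are taken from the examples in this talk.

```python
# Pick, among several physical copies of a DB, the one whose host has the
# lowest measured transfer cost (NWS-style estimate; numbers are invented).

def closest_replica(replicas, cost_to_host):
    """Return the replica path whose host is cheapest to reach."""
    def cost(path):
        host = path.split("/", 1)[0]
        return cost_to_host.get(host, float("inf"))  # unknown host = worst
    return min(replicas, key=cost)

replicas = [
    "shift23.cern.ch/db45/Hmm1.hits.DB",
    "pccms1.bo.infn.it/data1/Hmm1.hits.DB",
]
# Hypothetical NWS readings as seen from a site in Bologna (lower = closer)
costs = {"pccms1.bo.infn.it": 1.0, "shift23.cern.ch": 25.0}

print(closest_replica(replicas, costs))
# -> pccms1.bo.infn.it/data1/Hmm1.hits.DB
```

The local catalog would then record only the winning path for each DBid, keeping analysis jobs unaware of the other replicas.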
19. Information service
- Create an Information Service with information about all the replicas of the databases (GIS?)
- In each RC there is a reference catalog which is updated taking into account the available replicas
- It is even possible to have a catalog created on-the-fly only for the datasets needed by a job
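The on-the-fly catalog is just a filter over the IS data. A sketch, assuming an invented catalog layout (dataset name mapped to its replica list; this is not an actual IS schema):

```python
# Build a mini-catalog restricted to the datasets a job declares it needs.
# The full_catalog shape (name -> replica paths) is an assumption.

full_catalog = {
    "Hmm.1.hits.DB": ["pccms1.bo.infn.it/data1/Hmm1.hits.DB",
                      "shift23.cern.ch/db45/Hmm1.hits.DB"],
    "Hmm.3.hits.DB": ["shift23.cern.ch/db45/Hmm3.hits.DB"],
    "Hee.1.hits.DB": ["shift49.cern.ch/db123/Hee1.hits.DB"],
}

def mini_catalog(full_catalog, needed):
    """Keep only the entries a given job declared it needs."""
    return {name: replicas for name, replicas in full_catalog.items()
            if name in needed}

job_needs = {"Hmm.1.hits.DB", "Hmm.3.hits.DB"}
print(sorted(mini_catalog(full_catalog, job_needs)))
# -> ['Hmm.1.hits.DB', 'Hmm.3.hits.DB']
```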
20. Analysis
[Diagram: the CERN boot file/federation and the RC1 boot file/federation each catalog a mix of local and remote databases (DB1, DB2, DB3, ..., DBn, DBn1, ..., DBnm, DBnm1, ..., DBnmk)]
21. Logical vs Physical Datasets
- Each dataset is composed of one or more databases
  - datasets are managed by application-sw
- Each DB is uniquely identified by a DBid
  - DBid assignment is a logical-db creation
- The physical-db is the file
  - zero, one or more instances
- The IS manages the link between a dataset, its logical-dbs and its physical-dbs
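One way to model the three levels described above: a dataset groups logical DBs (each uniquely identified by a DBid), and each logical DB has zero or more physical instances (files). The identifiers and paths echo the examples in this talk; the class layout itself is an assumption for illustration.

```python
# Hypothetical data model for dataset -> logical-db -> physical-db links.

from dataclasses import dataclass, field
from typing import List

@dataclass
class LogicalDB:
    db_id: int                  # DBid assignment = logical-db creation
    name: str
    physical: List[str] = field(default_factory=list)  # replica files

@dataclass
class Dataset:
    name: str
    logical_dbs: List[LogicalDB] = field(default_factory=list)

hmm = Dataset("H -> 2mu", [
    LogicalDB(12345, "Hmm.1.hits.DB",
              ["pccms1.bo.infn.it/data1/Hmm1.hits.DB",
               "shift23.cern.ch/db45/Hmm1.hits.DB"]),
    LogicalDB(12347, "Hmm.3.hits.DB", []),   # "zero instances" is allowed
])

# The IS would answer "where are the files of Hmm.1.hits.DB?" like this:
print(hmm.logical_dbs[0].physical)
```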
22. Logical vs Physical Datasets

Dataset H → 2μ:
- id12345 (Hmm.1.hits.DB): pccms1.bo.infn.it/data1/Hmm1.hits.DB; shift23.cern.ch/db45/Hmm1.hits.DB
- id12346 (Hmm.2.hits.DB): pccms1.bo.infn.it/data1/Hmm2.hits.DB; shift23.cern.ch/db45/Hmm2.hits.DB; pccms3.pd.infn.it/data3/Hmm2.hits.DB
- id12347 (Hmm.3.hits.DB): shift23.cern.ch/db45/Hmm3.hits.DB

Dataset H → 2e:
- id5678 (Hee.1.hits.DB): pccms5.roma1.infn.it/data/Hee1.hits.DB; shift49.cern.ch/db123/Hee1.hits.DB
- id5679 (Hee.2.hits.DB): pccms5.roma1.infn.it/data/Hee2.hits.DB; shift49.cern.ch/db123/Hee2.hits.DB
- id5680 (Hee.3.hits.DB): pccms5.roma1.infn.it/data/Hee3.hits.DB; shift49.cern.ch/db123/Hee3.hits.DB
23. Database creation
- In each production site we have:
  - a production federation including incomplete databases
  - a reference federation with only complete databases (both local and remote ones)
- When a DB is completed it is attached to the site reference federation
- The IS monitors the reference federations of all the sites and updates the database list
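The IS update step can be sketched as a merge of the per-site reference federation listings into one global replica table. The dict shapes and site names are illustrative assumptions:

```python
# Merge each site's reference federation (which lists only complete DBs)
# into a global DBid -> replica-paths table, as the IS would maintain.

def update_database_list(site_catalogs):
    """site_catalogs: {site: {DBid: path}} -> {DBid: sorted list of paths}"""
    merged = {}
    for catalog in site_catalogs.values():
        for db_id, path in catalog.items():
            merged.setdefault(db_id, []).append(path)
    return {db_id: sorted(paths) for db_id, paths in merged.items()}

site_catalogs = {
    "CERN": {1: "shift.cern.ch/shift/data/DB1.DB",
             4: "shift.cern.ch/shift/data/DB4.DB"},
    "RC1":  {4: "pc.rc1.net/pc/data/DB4.DB"},
}
print(update_database_list(site_catalogs)[4])
# -> ['pc.rc1.net/pc/data/DB4.DB', 'shift.cern.ch/shift/data/DB4.DB']
```

Because only complete databases appear in a reference federation, the IS never has to distinguish finished from in-progress files; the production federation stays invisible to it.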
24. Database creation
[Diagram: shift.cern.ch hosts the CERN FD with DB1-DB4 attached; successive snapshots of the catalog show entries being added as databases are completed, ending with:
0001 DB1.DB shift.cern.ch/shift/data
0002 DB2.DB shift.cern.ch/shift/data
0003 DB3.DB shift.cern.ch/shift/data
0004 DB4.DB pc.rc1.net/pc/data; shift.cern.ch/shift/data
0005 DB5.DB pc.rc1.net/pc/data; shift.cern.ch/shift/data]
25. Replica Management
- In case of multiple copies of the same DB, each site may choose which copy to use
- It should be possible to update the reference federation at given times
- It should be possible to create on-the-fly a mini-catalog with information only about the datasets requested by a job
- This kind of operation is managed by application-sw (e.g. ORCA)
26. Replica Management
[Diagram: the CERN FD at shift.cern.ch holds DB1, DB2 and DB3; pc1.bo.infn.it holds the BO reference federation and pc1.pd.infn.it the PD reference federation. Two example catalog snapshots:
0001 DB1.DB shift.cern.ch/shift/data; pc1.bo.infn.it/data
0002 DB2.DB shift.cern.ch/shift/data
0003 DB3.DB shift.cern.ch/shift/data
and
0001 DB1.DB shift.cern.ch/shift/data; pc1.bo.infn.it/data
0002 DB2.DB shift.cern.ch/shift/data; pc1.bo.infn.it/data
0003 DB3.DB shift.cern.ch/shift/data]
27. Summary of the Case Study
- Basic functionalities of a Replica Manager for production are already implemented in GDMP
- The use of an Information Server would allow easy synchronization of federations and optimized data access during analysis
- The same functionalities offered by the Objectivity/DB catalog may be implemented for other kinds of files
28. Conclusions (?)
- Globus and the various GRID projects try to address the issue of Large Scale distributed data access
- Their effectiveness is still to be proven
- The problem again is not the software, it is the organization