1
Management of Large Scale Data Productions for
the CMS Experiment
  • Presented by
  • L.M. Barone
  • Università di Roma and INFN

2
The Framework
  • The CMS experiment is producing a large amount of
    MC data for the development of High Level Trigger
    (HLT) algorithms for fast data reduction at the LHC
  • Current production is half traditional (Pythia +
    CMSIM/Geant3), half OO (ORCA using Objectivity/DB)

3
The Problem
We are dealing with actual MC productions, not with
the 2005 data taking
  • Data size: 10^6 - 10^7 events at 1 MB/ev, i.e. about
    10^4 files (typically 500 evts/file)
  • Resource dispersion: many production sites
    (CERN, FNAL, Caltech, INFN, etc.)

4
The Problem (contd)
  • Data relocation: data produced at site A are
    stored centrally (at CERN), while site B may need a
    fraction of them, so the combinatorics keeps growing
  • Objectivity/DB does not make life easier (but the
    problem would exist anyway)

5
ORCA Production 2000
[Flow diagram: MC production (HEPEVT ntuples) feeds CMSIM, which writes
Zebra files with HITS for signal and minimum bias (MB); the ORCA ooHit
Formatter loads them into an Objectivity database; after a catalog
import, ORCA digitization merges signal and MB into an Objectivity
database; ORCA production and the HLT algorithms then produce new
reconstructed objects, stored in Objectivity databases for the HLT
groups and mirrored to other sites (US, Russia, Italy, ...)]
6
The Old Days
  • Question: how was it done before? A mix of ad hoc
    scripts/programs with a lot of manual
    intervention... but the problem was smaller and
    less dispersed

7
Requirements for a Solution
  • The solution must be as automatic as possible, to
    decrease the manpower needed
  • Tools should be independent of data type and of
    site
  • Network traffic should be optimized (or minimized?)
  • Users need complete information on data location

8
Present Status
  • Job creation is managed by a variety of scripts
    at different sites
  • Job submission again goes through diverse
    methods, from plain UNIX commands to LSF or Condor
  • File transfer has been managed up to now by Perl
    scripts, which are neither generic nor site
    independent

9
Present Status (contd)
  • The autumn 2000 production round is a trial
    towards standardization:
  • same layout (OS, installation)
  • same scripts (T. Wildish) for non-Objectivity data
    transfer
  • first use of GRID tools (see talk by A. Samar)
  • validation procedure for production sites

10
Collateral Activities
  • Linux CMS software automatic installation kit
    (INFN)
  • Globus installation kit (INFN)
  • Production monitoring tools with Web interface

11
What is missing ?
  • Scripts and tools are still too specific and not
    robust enough; we need practice at this scale
  • The information service needs a clear definition in
    our context and then an effective implementation
    (see later)
  • File replication management is just appearing and
    needs careful evaluation

12
Ideas for Replica Management
  • A case study with Objectivity/DB (thanks to
    C. Grandi, INFN Bologna)
  • It can be extended to any kind of file

13
Cloning federations
  • Cloned federations have a local catalog (boot
    file)
  • It is possible to manage each of them
    independently; some databases may be attached
    (or exist) only at one site
  • Manual work is needed to keep the schemas
    synchronized (this is not the key point today...)

14
Cloning federations
[Diagram: cloning a federation database (Clone FD)]
15
Productions
  • Using a DB-id pre-allocation system it is
    possible to produce databases at the Regional
    Centres (RCs), which can then be exported to other
    sites
  • A notification system is needed to inform other
    sites when a database is completed
  • This is today accomplished by GDMP using a
    publish-subscribe mechanism (see the sketch below)
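A minimal Python sketch of the idea; the allocator and the
publish/subscribe hub below are illustrative assumptions, not the
actual GDMP code.

  class DBIdAllocator:
      """Hands out disjoint DB-id ranges to Regional Centres (RCs)."""
      def __init__(self, start=1):
          self.next_id = start
          self.ranges = {}              # site -> (first_id, last_id)

      def allocate(self, site, count):
          first, last = self.next_id, self.next_id + count - 1
          self.ranges[site] = (first, last)
          self.next_id = last + 1
          return first, last

  class Publisher:
      """Tiny publish/subscribe hub: sites register a callback and are
      told when a database has been completed somewhere."""
      def __init__(self):
          self.subscribers = []

      def subscribe(self, callback):
          self.subscribers.append(callback)

      def publish(self, db_id, db_name, location):
          for callback in self.subscribers:
              callback(db_id, db_name, location)

  if __name__ == "__main__":
      alloc = DBIdAllocator()
      print("CERN ids:", alloc.allocate("CERN", 1000))   # (1, 1000)
      print("INFN ids:", alloc.allocate("INFN", 1000))   # (1001, 2000)

      hub = Publisher()
      hub.subscribe(lambda i, n, loc: print("FNAL notified:", i, n, loc))
      hub.publish(1001, "Hmm.1.hits.DB", "pccms1.bo.infn.it/data1")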

16
Productions
  • When a site receives a notification, it can (as
    sketched below)
  • ooattachdb the remote site's DB
  • copy the DB and ooattachdb it locally
  • ignore it
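A minimal sketch of these three choices, with the local federation
catalog modelled as a plain dict; the policy names and the copy step
are assumptions, and the real attach would use Objectivity's
ooattachdb.

  import os
  import shutil

  def handle_notification(catalog, db_name, remote_path,
                          policy="attach_remote", local_dir="/data"):
      """Apply one of the three policies to a completion notice."""
      if policy == "ignore":
          return catalog
      if policy == "attach_remote":
          # reference the database where it already sits, at the remote site
          catalog[db_name] = remote_path
      elif policy == "copy_then_attach":
          # bring a physical copy here first, then attach the local file
          local_path = os.path.join(local_dir, os.path.basename(remote_path))
          shutil.copy(remote_path, local_path)  # stand-in for the real transfer
          catalog[db_name] = local_path
      return catalog

  # e.g. handle_notification({}, "Hmm.1.hits.DB",
  #                          "/remote/data1/Hmm1.hits.DB", "attach_remote")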

17
Productions
18
Analysis
  • In each site a complete catalog with the location
    of all the datasets is needed; some DBs are local
    and some are remote
  • In case more copies of a DB are available, it
    would be nice to have the closest one in the local
    catalog (e.g. using NWS estimates), as sketched
    below
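A minimal sketch of that choice, assuming per-host network costs are
supplied by an external monitor (e.g. NWS); the hosts and cost values
below are illustrative.

  def closest_replica(replicas, cost_to_site):
      """replicas: list of 'host/path' strings;
      cost_to_site: host -> estimated network cost (smaller is closer)."""
      def cost(path):
          host = path.split("/", 1)[0]
          return cost_to_site.get(host, float("inf"))
      return min(replicas, key=cost)

  if __name__ == "__main__":
      replicas = ["shift23.cern.ch/db45/Hmm1.hits.DB",
                  "pccms1.bo.infn.it/data1/Hmm1.hits.DB"]
      costs = {"pccms1.bo.infn.it": 1.0, "shift23.cern.ch": 8.5}  # assumed
      print(closest_replica(replicas, costs))   # picks the Bologna copy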

19
Information service
  • Create an Information Service with information
    about all the replicas of the databases (a GIS?)
  • In each RC there is a reference catalog which is
    updated taking into account the available
    replicas
  • It is even possible to have a catalog created
    on-the-fly only for the datasets needed by a job
    (sketched below)
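A minimal sketch of such an on-the-fly mini-catalog, assuming the IS
can be queried as a mapping from dataset to logical DBs to replicas;
all names and numbers below are illustrative.

  def build_job_catalog(info_service, needed_datasets, cost_to_site):
      """Return {logical_db_name: chosen_replica} for one job, keeping
      only the datasets the job asks for."""
      def cost(path):
          return cost_to_site.get(path.split("/", 1)[0], float("inf"))
      catalog = {}
      for dataset in needed_datasets:
          for db_name, replicas in info_service[dataset].items():
              catalog[db_name] = min(replicas, key=cost)
      return catalog

  if __name__ == "__main__":
      info_service = {   # toy snapshot of what the IS would publish
          "H->2mu": {"Hmm.1.hits.DB": ["shift23.cern.ch/db45/Hmm1.hits.DB",
                                       "pccms1.bo.infn.it/data1/Hmm1.hits.DB"]}}
      print(build_job_catalog(info_service, ["H->2mu"],
                              {"pccms1.bo.infn.it": 1.0,
                               "shift23.cern.ch": 8.5}))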

20
Analysis
[Diagram: the CERN boot file points to the CERN FD and the RC1 boot
file to the RC1 FD; each federation catalog references both local and
remote databases (DB1, DB2, DB3, ..., DBn, DBn1, ..., DBnm, ...,
DBnmk)]
21
Logical vs Physical Datasets
  • Each dataset is composed of one or more databases
  • Datasets are managed by the application software
  • Each DB is uniquely identified by a DBid
  • Assigning a DBid creates the logical DB
  • The physical DB is the file; there may be zero,
    one or more instances of it
  • The IS manages the link between a dataset, its
    logical DBs and its physical DBs (see the sketch
    below)
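A minimal sketch of the bookkeeping the IS would need; the dataclasses
are an assumption, and the example values mirror the next slide.

  from dataclasses import dataclass, field
  from typing import Dict, List

  @dataclass
  class LogicalDB:
      db_id: int                                   # unique DBid, e.g. 12345
      name: str                                    # e.g. "Hmm.1.hits.DB"
      replicas: List[str] = field(default_factory=list)  # 0, 1 or more files

  @dataclass
  class Dataset:
      name: str                                    # e.g. "H -> 2mu"
      databases: Dict[int, LogicalDB] = field(default_factory=dict)

  if __name__ == "__main__":
      ds = Dataset("H -> 2mu")
      ds.databases[12345] = LogicalDB(12345, "Hmm.1.hits.DB")
      ds.databases[12345].replicas.append("pccms1.bo.infn.it/data1/Hmm1.hits.DB")
      ds.databases[12345].replicas.append("shift23.cern.ch/db45/Hmm1.hits.DB")
      print(ds)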

22
Logical vs Physical Datasets
           DBid     logical DB     physical replicas
Dataset H → 2μ
           id12345  Hmm.1.hits.DB  pccms1.bo.infn.it/data1/Hmm1.hits.DB
                                   shift23.cern.ch/db45/Hmm1.hits.DB
           id12346  Hmm.2.hits.DB  pccms1.bo.infn.it/data1/Hmm2.hits.DB
                                   shift23.cern.ch/db45/Hmm2.hits.DB
           id12347  Hmm.3.hits.DB  pccms3.pd.infn.it/data3/Hmm2.hits.DB
                                   shift23.cern.ch/db45/Hmm3.hits.DB
Dataset H → 2e
           id5678   Hee.1.hits.DB  pccms5.roma1.infn.it/data/Hee1.hits.DB
                                   shift49.cern.ch/db123/Hee1.hits.DB
           id5679   Hee.2.hits.DB  pccms5.roma1.infn.it/data/Hee2.hits.DB
                                   shift49.cern.ch/db123/Hee2.hits.DB
           id5680   Hee.3.hits.DB  pccms5.roma1.infn.it/data/Hee3.hits.DB
                                   shift49.cern.ch/db123/Hee3.hits.DB
23
Database creation
  • In each production site we have
  • a production federation including incomplete
    databases
  • a reference federation with only complete
    databases (both local and remote ones)
  • When a DB is completed it is attached to the site's
    reference federation
  • The IS monitors the reference federations of all
    the sites and updates the database list (see the
    sketch below)
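A minimal sketch of this two-federation workflow; the class and
function names are assumptions, and the real attach would use
Objectivity's ooattachdb against the reference federation.

  class SiteFederations:
      """Two catalogs per site: production (DBs being filled) and
      reference (complete DBs only)."""
      def __init__(self, site):
          self.site = site
          self.production = {}   # db_name -> path of a possibly incomplete DB
          self.reference = {}    # db_name -> path of a complete DB

      def start_db(self, db_name, path):
          self.production[db_name] = path

      def complete_db(self, db_name):
          # move the finished DB to the reference federation
          self.reference[db_name] = self.production.pop(db_name)

  def is_snapshot(sites):
      """What the IS collects by monitoring every reference federation."""
      return {db: (s.site, path)
              for s in sites for db, path in s.reference.items()}

  if __name__ == "__main__":
      cern = SiteFederations("shift.cern.ch")
      cern.start_db("DB1.DB", "shift.cern.ch/shift/data/DB1.DB")
      cern.complete_db("DB1.DB")
      print(is_snapshot([cern]))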

24
Database creation
[Diagram: successive states of the CERN federation catalog (CERN FD at
shift.cern.ch) as databases DB1-DB5 are completed. DB1-DB3 (ids
0001-0003) sit at shift.cern.ch/shift/data; DB4 (0004) is produced at
pc.rc1.net/pc/data and also replicated to shift.cern.ch/shift/data;
DB5 (0005) first appears only at pc.rc1.net/pc/data and is later
replicated to shift.cern.ch/shift/data as well]
25
Replica Management
  • In case of multiple copies of the same DB, each
    site may choose which copy to use
  • it should be possible to update the reference
    federation at given times (see the sketch below)
  • it should be possible to create on-the-fly a
    mini-catalog only with information about the
    datasets requested by a job
  • this kind of operation is managed by the
    application software (e.g. ORCA)
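A minimal sketch of such a periodic update, preferring a local copy
when one exists; the function name and the selection rule are
assumptions.

  def refresh_reference_catalog(local_catalog, is_snapshot, local_host):
      """is_snapshot: db_name -> list of replica paths; keep a local
      copy when one exists, otherwise the first known replica."""
      for db_name, replicas in is_snapshot.items():
          local = [r for r in replicas if r.startswith(local_host)]
          local_catalog[db_name] = local[0] if local else replicas[0]
      return local_catalog

  if __name__ == "__main__":
      snapshot = {"DB1.DB": ["shift.cern.ch/shift/data/DB1.DB",
                             "pc1.bo.infn.it/data/DB1.DB"],
                  "DB3.DB": ["shift.cern.ch/shift/data/DB3.DB"]}
      print(refresh_reference_catalog({}, snapshot, "pc1.bo.infn.it"))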

26
Replica Management
[Diagram: the CERN FD at shift.cern.ch and the reference federations
at Bologna (pc1.bo.infn.it, BO Ref) and Padova (pc1.pd.infn.it, PD
Ref). The catalogs list DB1 with replicas at shift.cern.ch/shift/data
and pc1.bo.infn.it/data, DB2 at shift.cern.ch/shift/data (in one
catalog also at pc1.bo.infn.it/data), and DB3 at
shift.cern.ch/shift/data only]
27
Summary of the Case Study
  • Basic functionalities of a Replica Manager for
    production are already implemented in GDMP
  • The use of an Information Service would allow easy
    synchronization of federations and optimized data
    access during analysis
  • The same functionalities offered by the
    Objectivity/DB catalog may be implemented for
    other kinds of files

28
Conclusions (?)
  • Globus and the various GRID projects try to
    address the issue of large scale distributed data
    access
  • Their effectiveness is still to be proven
  • The problem, again, is not the software, it is the
    organization