Title: LHCb Distributed Computing and the Grid (Nick Brook, University of Bristol)
1. LHCb Distributed Computing and the Grid
Nick Brook, University of Bristol
- D. Galli, U. Marconi, V. Vagnoni INFN Bologna
- N. Brook Bristol
- K. Harrison Cambridge
- E. Van Herwijnen, J. Closier, P. Mato CERN
- A. Khan Edinburgh
- A. Tsaregorodtsev Marseille
- H. Bulten, S. Klous Nikhef
- F. Harris, I. McArthur, A. Soroko Oxford
- G. N. Patrick, G. Kuznetsov RAL
2. Overview of presentation
- Current organisation of LHCb distributed computing
- UK facilities and support through GridPP
- Current use of Globus and EDG middleware
- Planning for data challenges and the use of the Grid
- Current LHCb Grid/applications R&D
- Conclusions
3. History of distributed MC production
- The distributed system has been running for 3 years and has processed many millions of events for the LHCb design.
- Main production sites:
  - CERN, Bologna, Liverpool, Lyon, NIKHEF, RAL
- Globus already used for job submission to RAL and Lyon.
- System interfaced to the Grid and demonstrated at the EU-DG Review and the NeSC/UK Opening.
- For the 2002 Data Challenges, adding new institutes:
  - Bristol, Cambridge, Oxford, ScotGrid
- In 2003, add:
  - Barcelona, Moscow, Germany, Switzerland, Poland
4Current Architecture
Production Manager Create no. of jobs (500 events
each) Determine configuration Run
executable Check data Copy data/logs
Physics Coordinator
Physicist
Job Creation/Submission via Web Identify
outstanding requests Select workflow Create
scripts via Java servlets.
Monitoring via PVSS Submit jobs to distributed
sites See what jobs are running Check
configuration Kill jobs, etc
Bookkeeping Database
5. Logical flow
(Diagram) Submit jobs remotely via Web -> Execute on farm -> Data quality check -> Update bookkeeping database -> Transfer data to mass store -> Analysis
6. Monitoring and Control of MC jobs
- LHCb has adopted PVSS II as prototype control and monitoring system for MC production.
- PVSS is a commercial SCADA (Supervisory Control And Data Acquisition) product developed by ETM.
- Adopted as control framework for the LHC Joint Controls Project (JCOP).
- Available for Linux and Windows platforms.
7. (No transcript)
8. UK Tier 1 - RAL
- New computing farm: 4 racks holding 156 dual 1.4 GHz Pentium III CPUs. Each box has 1 GB of memory, a 40 GB internal disk and 100 Mb ethernet.
- Tape robot upgraded last year: uses 60 GB STK 9940 tapes; 45 TB current capacity, could hold 330 TB.
- 50 TByte disk-based mass storage unit after RAID 5 overhead.
- PCs are clustered on network switches with up to 8 x 1000 Mb ethernet out of each rack.
- 2004 scale: 1000 CPUs, 0.5 PBytes
9UK Regional Centres
Local Perspective Consolidate Research
Computing Optimisation of Number of
Nodes? 4 Relative size dependent on funding
dynamics
10. UK Prototype Tier 2 - ScotGrid
- ScotGrid processing nodes at Glasgow:
  - 59 IBM X Series 330 dual 1 GHz Pentium III with 2 GB memory
  - 2 IBM X Series 340 dual 1 GHz Pentium III with 2 GB memory and dual ethernet
  - 3 IBM X Series 340 dual 1 GHz Pentium III with 2 GB memory and 100/1000 Mbit/s ethernet
  - 1 TB disk
  - LTO/Ultrium tape library
  - Cisco ethernet switches
- ScotGrid storage at Edinburgh:
  - IBM X Series 370 PIII Xeon with 32 x 512 MB RAM
  - 70 x 73.4 GB IBM FC hot-swap HDD
- 2004 scale: 300 CPUs, 0.1 PBytes
11. GridPP support
- 2 LHCb posts:
  - to work on Gaudi (software framework) persistency services
  - to work on MC monitoring and control software
- 2 ATLAS/LHCb Gaudi/GANGA posts:
  - interface between the software framework and Grid services
12. Current use of Grid middleware in the development system
- Authentication
- grid-proxy-init
- Job submission to DataGrid
- dg-job-submit
- Monitoring and control
- dg-job-status
- dg-job-cancel
- dg-job-get-output
- Data publication and replication
- globus-url-copy, GDMP
- Resource scheduling and use of CERN MSS
- JDL, sandboxes, storage elements
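As a rough illustration of how these tools chain together (not part of the production system itself), the sketch below drives the same command-line tools from a short Python script, reusing the JDL file and output directory from Example 1. The way the job identifier is parsed from the dg-job-submit output is an assumption and may differ between EDG releases.

import subprocess

def run(cmd):
    # Run a Grid command-line tool and return its standard output.
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout

# Authentication: create a Grid proxy. This prompts for the certificate
# passphrase, so it is run interactively rather than with captured output.
subprocess.run(["grid-proxy-init"], check=True)

# Job submission to DataGrid, using the JDL file from Example 1.
submit_out = run(["dg-job-submit",
                  "/home/evh/sicb/sicb/bbincl1600061.jdl",
                  "-o", "/home/evh/logsub/"])

# The broker prints the assigned job identifier; here it is assumed to be the
# last non-empty line of the output (the exact format varies by EDG release).
job_id = [line for line in submit_out.splitlines() if line.strip()][-1]

# Monitoring and control.
print(run(["dg-job-status", job_id]))       # query current state
# run(["dg-job-cancel", job_id])            # would cancel the job

# Once the job has finished, fetch the output sandbox (logs, small files).
print(run(["dg-job-get-output", job_id]))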
13. Example 1: Job Submission
- dg-job-submit /home/evh/sicb/sicb/bbincl1600061.jdl -o /home/evh/logsub/
- bbincl1600061.jdl:
  Executable = "script_prod";
  Arguments = "1600061,v235r4dst,v233r2";
  StdOutput = "file1600061.output";
  StdError = "file1600061.err";
  InputSandbox = {"/home/evhtbed/scripts/x509up_u149", "/home/evhtbed/sicb/mcsend", "/home/evhtbed/sicb/fsize", "/home/evhtbed/sicb/cdispose.class", "/home/evhtbed/v235r4dst.tar.gz", "/home/evhtbed/sicb/sicb/bbincl1600061.sh", "/home/evhtbed/script_prod", "/home/evhtbed/sicb/sicb1600061.dat", "/home/evhtbed/sicb/sicb1600062.dat", "/home/evhtbed/sicb/sicb1600063.dat", "/home/evhtbed/v233r2.tar.gz"};
  OutputSandbox = {"job1600061.txt", "D1600063", "file1600061.output", "file1600061.err", "job1600062.txt", "job1600063.txt"};
14. Example 2: Data Publishing and Replication
(Diagram) On the CERN testbed, a job running on a Compute Element writes its data to local disk, copies it to a Storage Element / MSS with globus-url-copy, then registers the file (register-local-file) and publishes it to the Replica Catalogue at NIKHEF, Amsterdam. Elsewhere on the Grid, a job obtains the data from its local Storage Element via replica-get.
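The same flow can be sketched as a script. In the minimal sketch below, only globus-url-copy is a command taken directly from the diagram; the file paths and Storage Element URL are hypothetical, and the GDMP registration/publication commands are placeholders whose real names and options depend on the installed GDMP release.

import subprocess

def run(cmd):
    # Execute one command-line step of the publish/replicate flow.
    subprocess.run(cmd, check=True)

# The job has written its output to local disk on the Compute Element
# (hypothetical paths/URLs, for illustration only).
local_file = "file:///home/evhtbed/data/output1600061.dat"
se_file = "gsiftp://some-storage-element.cern.ch/data/output1600061.dat"

# Copy the data from local disk to the Storage Element / MSS.
run(["globus-url-copy", local_file, se_file])

# Register the file locally and publish it to the Replica Catalogue at
# NIKHEF, Amsterdam. GDMP-style command names are used as placeholders;
# consult the installed GDMP release for the actual commands and options.
run(["gdmp_register_local_file", "/data/output1600061.dat"])  # placeholder
run(["gdmp_publish_catalogue"])                               # placeholder

# At a rest-of-Grid site, a job would then pull the data from its nearest
# Storage Element, corresponding to the "replica-get" step in the diagram.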
15. LHCb Data Challenge 1 (July-September 2002)
- Physics Data Challenge (PDC) for detector, physics and trigger evaluations:
  - based on the existing MC production system; small amount of Grid technology to start with
  - generate 3x10^7 events (signal, specific background, generic b and c, minimum bias)
- Computing Data Challenge (CDC) for checking and developing software:
  - will make more extensive use of Grid middleware
  - components will be incorporated into the PDC once proven in the CDC
16. LHCb software framework - Gaudi
17. GANGA: Gaudi ANd Grid Alliance
Joint ATLAS (C. Tull) and LHCb (P. Mato) project, formally supported by GridPP/UK with 2 joint ATLAS/LHCb research posts at Cambridge and Oxford.
- Application facilitating the use of Grid services by end-user physicists and production managers for running Gaudi/Athena jobs.
- A GUI-based application that should help throughout the complete job lifetime:
  - job preparation and configuration
  - resource booking
  - job submission
  - job monitoring and control
(Diagram) GANGA provides a GUI on top of collective and resource Grid services; it passes JobOptions and Algorithms to the GAUDI program and receives histograms, monitoring information and results back.
18. Required functionality
- Before the Gaudi/Athena program starts:
  - Security (obtaining certificates and credentials)
  - Job configuration (algorithm configuration, input data selection, ...)
  - Resource booking and policy checking (CPU, storage, network)
  - Installation of required software components
  - Job preparation and submission
- While the Gaudi/Athena program is running:
  - Job monitoring (generic and specific)
  - Job control (suspend, abort, ...)
- After the program has finished:
  - Data management (registration)
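To make the grouping concrete, the sketch below outlines a hypothetical Python job class covering the three phases listed above; the class and method names are invented for illustration and are not the actual GANGA interface.

class GridJob:
    """Hypothetical wrapper grouping the functionality listed above;
    not the actual GANGA interface."""

    def __init__(self, executable, job_options, input_data):
        self.executable = executable      # e.g. a Gaudi/Athena application
        self.job_options = job_options    # algorithm configuration
        self.input_data = input_data      # selected input datasets
        self.job_id = None

    # --- before the program starts ---
    def authenticate(self):
        """Obtain Grid certificates/credentials (e.g. a proxy)."""

    def book_resources(self, cpu_hours, storage_gb):
        """Check policies and reserve CPU, storage and network resources."""

    def prepare(self):
        """Install required software components and build the input sandbox."""

    def submit(self):
        """Hand the prepared job to the Grid scheduler; store the job id."""

    # --- while the program is running ---
    def status(self):
        """Return generic and application-specific monitoring information."""

    def control(self, action):
        """Suspend, resume or abort the running job."""

    # --- after the program has finished ---
    def register_output(self):
        """Register produced data in the bookkeeping/replica catalogues."""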
19. Python Bus Design (a possible model for implementation)
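A minimal sketch of what such a Python bus could look like is given below: components (GUI panels, Grid-service wrappers, Gaudi job handlers) subscribe to named topics and exchange messages through the bus. The topic and message names are invented for illustration.

from collections import defaultdict

class PythonBus:
    """Minimal publish/subscribe bus: components register callbacks for
    named topics and post messages to them (illustrative only)."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        self._subscribers[topic].append(callback)

    def publish(self, topic, message):
        for callback in self._subscribers[topic]:
            callback(message)

# Example wiring: a GUI panel listens for monitoring updates published by a
# Grid-services module (topic name is hypothetical).
bus = PythonBus()
bus.subscribe("job.status", lambda msg: print("GUI update:", msg))
bus.publish("job.status", {"job": "bbincl1600061", "state": "Running"})

The point of such a bus is that GUI, Grid and application components stay decoupled: each only needs to know the bus and the topics it uses.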
20. Conclusions
- LHCb already has distributed MC production using Grid facilities for job submission.
- We are embarking on large-scale data challenges commencing July 2002, and we are developing our analysis model.
- Grid middleware will be progressively integrated into our production environment as it matures (starting with EDG, and looking forward to GLUE).
- R&D projects are in place:
  - for interfacing users (production and analysis) and the Gaudi/Athena software framework to Grid services
  - for putting the production system into an integrated Grid environment with monitoring and control
- All work is being conducted in close participation with the EDG and LCG projects:
  - ongoing evaluations of EDG middleware with physics jobs
  - participation in LCG working groups, e.g. report on common use cases for a HEP Common Application Layer (http://cern.ch/fca/HEPCAL.doc)