LHCb on the Grid - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: LHCb on the Grid


1
LHCb on the Grid
  • Raja Nandakumar
  • (with contributions from Greig Cowan)

GridPP21 3rd September 2008
2
LHCb computing model
  • CERN (Tier-0) is the hub of all activity
  • Full copy at CERN of all raw data and DSTs
  • All Tier-1s have a full copy of the DSTs
  • Simulation at all possible sites (CERN, T1, T2)
  • LHCb has used about 120 sites on 5 continents so
    far
  • Reconstruction, Stripping and Analysis at T0 / T1
    sites only
  • Some analysis may be possible at large T2 sites
    in the future
  • Almost all the computing (except for development
    / tests) will be run on the grid.
  • Large productions are handled by the production
    team
  • Ganga (on top of DIRAC) is the grid user
    interface (see the sketch below)

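As a rough illustration of the Ganga (DIRAC) interface mentioned above, here is a minimal sketch of submitting a job through Ganga's Python interface with the Dirac backend. The executable and its arguments are placeholders, not part of the original slides.

  # Minimal Ganga sketch (run inside a Ganga session); the executable
  # and arguments are hypothetical placeholders.
  j = Job(name='lhcb-demo')
  j.application = Executable(exe='echo', args=['Hello from the grid'])
  j.backend = Dirac()     # route the job through DIRAC
  j.submit()              # DIRAC then matches the job to a site
  print(j.status)         # e.g. 'submitted', later 'running' / 'completed'
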
3
LHCb on the grid
  • Small amount of activity over the past year
  • DIRAC3 has been under development
  • Physics groups have not asked for new productions
  • The situation has changed recently...

4
LHCb on the grid
  • DIRAC3
  • Nearing stable production release
  • Extensive experience with CCRC08 and follow-up
    exercises
  • Used as THE production system for LHCb
  • Ganga developers are now testing its interfaces
  • Generic pilot agent framework
  • Critical problems found with the gLite WMS 3.0,
    3.1
  • Mixing of VOMS roles under certain reasonably
    common conditions
  • Cannot have people with different VOMS roles!
    (a proxy-checking sketch follows this list)
  • Savannah bug 39641
  • Being worked on by developers
  • Waiting for this to be solved before restarting
    tests

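Because the WMS bug above mixes up VOMS roles, it can be useful to confirm which role a proxy actually carries before submitting. The following is a minimal sketch, assuming only that the standard voms-proxy-info client is on the PATH; it is not part of the LHCb tooling described in these slides.

  # Sketch: list the VOMS FQANs (and hence the role) of the current
  # proxy using the standard voms-proxy-info client.
  import subprocess

  def proxy_fqans():
      out = subprocess.check_output(['voms-proxy-info', '-fqan'])
      return [line.strip() for line in out.decode().splitlines() if line.strip()]

  if __name__ == '__main__':
      for fqan in proxy_fqans():
          print(fqan)   # e.g. /lhcb/Role=user or /lhcb/Role=production
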
5
DIRAC3 Production
>90,000 jobs in the past 2 months
Real production activity and testing of gLite WMS
6
DIRAC3 Job Monitor
https://lhcbweb.pic.es/DIRAC/jobs/JobMonitor/display
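
Besides the web monitor, job status can also be queried from a script. The sketch below is an assumption about the DIRAC Python client API (the Dirac class and its status() method) rather than anything shown on the slide, and the job ID is a hypothetical placeholder.

  # Sketch: query a DIRAC job's status from Python instead of the web
  # monitor.  Requires a DIRAC client installation; the job ID below
  # is a hypothetical placeholder.
  from DIRAC.Core.Base import Script
  Script.parseCommandLine()                 # initialise the DIRAC environment
  from DIRAC.Interfaces.API.Dirac import Dirac

  result = Dirac().status(12345)            # S_OK/S_ERROR style return dict
  if result['OK']:
      print(result['Value'])                # e.g. {12345: {'Status': 'Done', ...}}
  else:
      print('Query failed:', result['Message'])
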
7
LHCb storage at RAL
  • LHCb storage primarily on the Tier-1s and CERN
  • CASTOR used as storage system at RAL
  • Fully moved out of dCache in May 2008
  • One tape damaged and the file on it marked lost
  • Was stable (more or less) until 20 Aug 2008
  • Has not been able to take a great load on the
    servers
  • Low upper limit (8) on LSF job slots on various
    CASTOR diskservers
  • Too many jobs (>500) can come into the batch
    system; the affected service class then hangs
  • Temporarily fixed for now. Needs to be monitored
    (probably by the shifter on duty?)
  • Increase limit to >100 rfio jobs per server
  • Not all hardware can handle a limit of 200 jobs
    (they start using swap space)
  • Problem seen many times now over the last few
    months
  • CASTOR now in downtime
  • This is worrying given how close we are to data
    taking

8
LHCb at RAL
  • Move to srm-v2 by LHCb
  • Needed so that RAL can retire the srm-v1
    endpoints and hardware
  • When DIRAC3 becomes the baseline for user analysis
  • Already used for almost all production
  • Ganga working on submitting through DIRAC3
  • Also needs LHCb to rename files in the LFC (see
    the sketch after this list)
  • All space tokens, etc. have been set up
  • Target: turn off srm-v1 access by end of September
  • Currently use srm-v1 for user analysis
  • DIRAC2 does not support srm-v2
  • Batch system
  • Pausing of jobs during downtime?
  • Not clear about the status of this
  • For now, stop the batch system from accepting
    LHCb jobs a few hours before scheduled downtimes
  • No LHCb job should run for >24 hours
  • Announce beginning and end of downtimes
  • Problems with broadcast tools
  • GGUS ticket opened by Derek Ross

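The LFC renaming mentioned above is really a bulk catalogue operation driven by LHCb's data-management tools; purely as an illustration, the sketch below renames a single entry using the LFC Python bindings (assumed to be available as the lfc module with an lfc_rename call). The paths are hypothetical placeholders and error handling is minimal.

  # Sketch: rename one LFC entry via the LFC Python bindings (assumed
  # to be installed).  Paths are hypothetical placeholders, not real
  # LHCb namespace entries.
  import lfc

  old_path = '/grid/lhcb/old/name.dst'   # placeholder
  new_path = '/grid/lhcb/new/name.dst'   # placeholder

  # lfc_rename is expected to return 0 on success, -1 on error
  if lfc.lfc_rename(old_path, new_path) == 0:
      print('Renamed', old_path, '->', new_path)
  else:
      print('Rename failed for', old_path)
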
9
LHCb and CCRC08
  • Planned tasks: test the LHCb computing model
  • Raw data distribution from the pit to the T0
    centre
  • Use of rfcp into CASTOR from the pit (T1D0); a
    command sketch follows this list
  • Raw data distribution from T0 to the T1 centres
  • Use of FTS (T1D0)
  • Reconstruction of raw data at CERN and the T1
    centres
  • Production of rDST data (T1D0)
  • Use of SRM 2.2
  • Stripping of data at CERN and the T1 centres
  • Input data: RAW and rDST (T1D0)
  • Output data: DST (T1D1)
  • Use of SRM 2.2
  • Distribution of DST data to all other centres
  • Use of FTS

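For the pit-to-CASTOR step above, the basic building block is an rfcp copy into the CASTOR namespace; in production this is driven by DIRAC's data management rather than by hand. The sketch below simply wraps the command from Python, with both paths as hypothetical placeholders.

  # Sketch: copy a local raw-data file into CASTOR with rfcp (assumed
  # to be installed on the node).  Both paths are placeholders.
  import subprocess

  local_file = '/data/raw/run1234_0001.raw'                        # placeholder
  castor_dest = '/castor/cern.ch/grid/lhcb/data/run1234_0001.raw'  # placeholder

  # rfcp <source> <destination>; a non-zero exit code means failure
  rc = subprocess.call(['rfcp', local_file, castor_dest])
  if rc != 0:
      raise RuntimeError('rfcp failed with exit code %d' % rc)
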
10
LHCb and CCRC08
  • Reconstruction
  • Stripping

11
LHCb CCRC08 Problems
  • CCRC08 highlighted areas to be improved
  • File access problems
  • Random or permanent failure to open files using
    gsidcap
  • Request IN2P3 and NL-T1 to allow dcap protocol
    for local read access
  • Now using xroot at IN2P3, which appears to be
    successful
  • Wrong file status returned by dCache SRM after a
    put
  • bringOnline was not doing anything
  • Software area access problems
  • Site banned for a while until problem is fixed
  • Application crashes
  • Fixed with new SW release and deployment
  • Major issues with LHCb bookkeeping
  • Especially for stripping
  • Lessons learned
  • Better error reporting in pilot logs and workflow
  • Alternative forms of data access needed in
    emergencies
  • Downloading of files to the WN (used at IN2P3,
    RAL); a fallback sketch follows this list

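Purely as an illustration of the download-to-worker-node fallback above: try to open the file remotely and, if that fails, copy it locally first and open the copy. The URL, the use of PyROOT and the xrdcp copy tool are assumptions for the example, not the actual LHCb application code.

  # Sketch: open an input file via a remote protocol, falling back to
  # a local copy on the worker node.  URL and filenames are
  # hypothetical placeholders.
  import subprocess
  import ROOT  # PyROOT, assumed available in the job environment

  remote_url = 'root://some.storage.element//lhcb/data/file.dst'  # placeholder

  f = ROOT.TFile.Open(remote_url)
  if not f or f.IsZombie():
      # Remote open failed: copy the file to the worker node and retry.
      local_copy = 'file.dst'
      subprocess.check_call(['xrdcp', remote_url, local_copy])
      f = ROOT.TFile.Open(local_copy)

  print('Opened', f.GetName())
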
12
LHCb Grid Operations
  • A Grid Operations and Production team has been
    created

13
Communications
  • LHCb sites
  • Grid operations team keep track of problems
  • Report to sites via GGUS and eLogger
  • All posts are reported on lhcb-production@cern.ch
  • Please subscribe if you want to know what is
    going on
  • LHCb users
  • Mailing lists
  • lhcb-distributed-analysis@cern.ch
  • All problems directed here
  • Specific lists for each LHCb application and
    Ganga
  • Ticketing systems (Savannah, GGUS) for DIRAC,
    Ganga, apps
  • Used by developers and power users
  • Software weeks provide training sessions for
    using Grid tools
  • Weekly distributed analysis meetings (starts
    Friday)
  • DIRAC, Ganga, core software developers along with
    some users
  • Aims to identify needs and coordinate release
    plans

http://lblogbook.cern.ch/Operations
RSS feed available
14
Summary
  • Concerned about CASTOR stability close to data
    taking
  • DIRAC3 workload and data management system now
    online
  • Has been extensively tested when running LHCb
    productions
  • Now moving it into the user analysis system
  • Ganga needs some additional development
  • Grid operations team working with sites, users
    and devs to identify and resolve problems quickly
    and efficiently
  • LHCb is looking forward to the imminent
    switch-on of the LHC!

15
Backup - CCRC08 Throughput