LHCb on the Grid - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: LHCb on the Grid


1
LHCb on the Grid
  • Raja Nandakumar
  • (with contributions from Greig Cowan)

GridPP21 3rd September 2008
2
LHCb computing model
  • CERN (Tier-0) is the hub of all activity
  • Full copy at CERN of all raw data and DSTs
  • All Tier-1s have a full copy of the DSTs
  • Simulation at all possible sites (CERN, T1, T2)
  • LHCb has used about 120 sites on 5 continents so
    far
  • Reconstruction, Stripping and Analysis at T0 / T1
    sites only
  • Some analysis may be possible at large T2 sites
    in the future
  • Almost all the computing (except for development
    / tests) will be run on the grid.
  • Large productions are handled by the production
    team
  • Ganga (on top of DIRAC) is the grid user
    interface (see the sketch below)

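As a rough illustration of the Ganga (DIRAC) interface mentioned above, here is a minimal sketch of submitting a job through Ganga's Python interface with the Dirac backend. The executable and its arguments are placeholders, not part of the original slides.

  # Minimal Ganga sketch (run inside a Ganga session); the executable
  # and arguments are hypothetical placeholders.
  j = Job(name='lhcb-demo')
  j.application = Executable(exe='echo', args=['Hello from the grid'])
  j.backend = Dirac()     # route the job through DIRAC
  j.submit()              # DIRAC then matches the job to a site
  print(j.status)         # e.g. 'submitted', later 'running' / 'completed'
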
3
LHCb on the grid
  • Small amount of activity over the past year
  • DIRAC3 has been under development
  • Physics groups have not asked for new productions
  • The situation has changed recently...

4
LHCb on the grid
  • DIRAC3
  • Nearing stable production release
  • Extensive experience with CCRC08 and follow-up
    exercises
  • Used as THE production system for LHCb
  • Ganga developers are now testing its interfaces
  • Generic pilot agent framework
  • Critical problems found with the gLite WMS 3.0,
    3.1
  • Mixing of VOMS roles under certain reasonably
    common conditions
  • Cannot have people with different VOMS roles!
    (a proxy-checking sketch follows this list)
  • Savannah bug 39641
  • Being worked on by developers
  • Waiting for this to be solved before restarting
    tests

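Because the WMS bug above mixes up VOMS roles, it can be useful to confirm which role a proxy actually carries before submitting. The following is a minimal sketch, assuming only that the standard voms-proxy-info client is on the PATH; it is not part of the LHCb tooling described in these slides.

  # Sketch: list the VOMS FQANs (and hence the role) of the current
  # proxy using the standard voms-proxy-info client.
  import subprocess

  def proxy_fqans():
      out = subprocess.check_output(['voms-proxy-info', '-fqan'])
      return [line.strip() for line in out.decode().splitlines() if line.strip()]

  if __name__ == '__main__':
      for fqan in proxy_fqans():
          print(fqan)   # e.g. /lhcb/Role=user or /lhcb/Role=production
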
5
DIRAC3 Production
>90,000 jobs in the past 2 months
Real production activity and testing of gLite WMS
6
DIRAC3 Job Monitor
https://lhcbweb.pic.es/DIRAC/jobs/JobMonitor/display
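
Besides the web monitor, job status can also be queried from a script. The sketch below is an assumption about the DIRAC Python client API (the Dirac class and its status() method) rather than anything shown on the slide, and the job ID is a hypothetical placeholder.

  # Sketch: query a DIRAC job's status from Python instead of the web
  # monitor.  Requires a DIRAC client installation; the job ID below
  # is a hypothetical placeholder.
  from DIRAC.Core.Base import Script
  Script.parseCommandLine()                 # initialise the DIRAC environment
  from DIRAC.Interfaces.API.Dirac import Dirac

  result = Dirac().status(12345)            # S_OK/S_ERROR style return dict
  if result['OK']:
      print(result['Value'])                # e.g. {12345: {'Status': 'Done', ...}}
  else:
      print('Query failed:', result['Message'])
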
7
LHCb storage at RAL
  • LHCb storage primarily on the Tier-1s and CERN
  • CASTOR used as storage system at RAL
  • Fully moved out of dCache in May 2008
  • One tape damaged and the file on it marked lost
  • Was stable (more or less) until 20 Aug 2008
  • Has not been able to take a great load on the
    servers
  • Low upper limit (8) on LSF job slots on various
    CASTOR diskservers
  • Too many jobs (>500) can come into the batch
    system; the affected service class then hangs
  • Temporarily fixed for now. Needs to be monitored
    (probably by the shifter on duty?)
  • Increase limit to >100 rfio jobs per server
  • Not all hardware can handle a limit of 200 jobs
    (they start using swap space)
  • Problem seen many times now over the last few
    months
  • CASTOR now in downtime
  • This is worrying given how close we are to data
    taking

8
LHCb at RAL
  • Move to srm-v2 by LHCb
  • Needed so that RAL can retire the srm-v1
    endpoints and hardware
  • When DIRAC3 becomes the baseline for user analysis
  • Already used for almost all production
  • Ganga working on submitting through DIRAC3
  • Also needs LHCb to rename files in the LFC (see
    the sketch after this list)
  • All space tokens, etc. have been set up
  • Target: turn off srm-v1 access by end of September
  • Currently use srm-v1 for user analysis
  • DIRAC2 does not support srm-v2
  • Batch system
  • Pausing of jobs during downtime?
  • Not clear about the status of this
  • For now, stop the batch system from accepting
    LHCb jobs a few hours before scheduled downtimes
  • No LHCb job should run for >24 hours
  • Announce beginning and end of downtimes
  • Problems with broadcast tools
  • GGUS ticket opened by Derek Ross

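The LFC renaming mentioned above is really a bulk catalogue operation driven by LHCb's data-management tools; purely as an illustration, the sketch below renames a single entry using the LFC Python bindings (assumed to be available as the lfc module with an lfc_rename call). The paths are hypothetical placeholders and error handling is minimal.

  # Sketch: rename one LFC entry via the LFC Python bindings (assumed
  # to be installed).  Paths are hypothetical placeholders, not real
  # LHCb namespace entries.
  import lfc

  old_path = '/grid/lhcb/old/name.dst'   # placeholder
  new_path = '/grid/lhcb/new/name.dst'   # placeholder

  # lfc_rename is expected to return 0 on success, -1 on error
  if lfc.lfc_rename(old_path, new_path) == 0:
      print('Renamed', old_path, '->', new_path)
  else:
      print('Rename failed for', old_path)
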
9
LHCb and CCRC08
  • Planned tasks: test the LHCb computing model
  • Raw data distribution from the pit to the T0
    centre
  • Use of rfcp into CASTOR from the pit (T1D0); a
    command sketch follows this list
  • Raw data distribution from T0 to the T1 centres
  • Use of FTS (T1D0)
  • Reconstruction of raw data at CERN and the T1
    centres
  • Production of rDST data (T1D0)
  • Use of SRM 2.2
  • Stripping of data at CERN and the T1 centres
  • Input data: RAW and rDST (T1D0)
  • Output data: DST (T1D1)
  • Use of SRM 2.2
  • Distribution of DST data to all other centres
  • Use of FTS

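For the pit-to-CASTOR step above, the basic building block is an rfcp copy into the CASTOR namespace; in production this is driven by DIRAC's data management rather than by hand. The sketch below simply wraps the command from Python, with both paths as hypothetical placeholders.

  # Sketch: copy a local raw-data file into CASTOR with rfcp (assumed
  # to be installed on the node).  Both paths are placeholders.
  import subprocess

  local_file = '/data/raw/run1234_0001.raw'                        # placeholder
  castor_dest = '/castor/cern.ch/grid/lhcb/data/run1234_0001.raw'  # placeholder

  # rfcp <source> <destination>; a non-zero exit code means failure
  rc = subprocess.call(['rfcp', local_file, castor_dest])
  if rc != 0:
      raise RuntimeError('rfcp failed with exit code %d' % rc)
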
10
LHCb and CCRC08
  • Reconstruction
  • Stripping

11
LHCb CCRC08 Problems
  • CCRC08 highlighted areas to be improved
  • File access problems
  • Random or permanent failure to open files using
    gsidcap
  • Request IN2P3 and NL-T1 to allow dcap protocol
    for local read access
  • Now using xroot at IN2P3, which appears to be
    successful
  • Wrong file status returned by dCache SRM after a
    put
  • bringOnline was not doing anything
  • Software area access problems
  • Site banned for a while until problem is fixed
  • Application crashes
  • Fixed with new SW release and deployment
  • Major issues with LHCb bookkeeping
  • Especially for stripping
  • Lessons learned
  • Better error reporting in pilot logs and workflow
  • Alternative forms of data access needed in
    emergencies
  • Downloading of files to the WN (used at IN2P3,
    RAL); a fallback sketch follows this list

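Purely as an illustration of the download-to-worker-node fallback above: try to open the file remotely and, if that fails, copy it locally first and open the copy. The URL, the use of PyROOT and the xrdcp copy tool are assumptions for the example, not the actual LHCb application code.

  # Sketch: open an input file via a remote protocol, falling back to
  # a local copy on the worker node.  URL and filenames are
  # hypothetical placeholders.
  import subprocess
  import ROOT  # PyROOT, assumed available in the job environment

  remote_url = 'root://some.storage.element//lhcb/data/file.dst'  # placeholder

  f = ROOT.TFile.Open(remote_url)
  if not f or f.IsZombie():
      # Remote open failed: copy the file to the worker node and retry.
      local_copy = 'file.dst'
      subprocess.check_call(['xrdcp', remote_url, local_copy])
      f = ROOT.TFile.Open(local_copy)

  print('Opened', f.GetName())
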
12
LHCb Grid Operations
  • A Grid Operations and Production team has been
    created

13
Communications
  • LHCb sites
  • Grid operations team keep track of problems
  • Report to sites via GGUS and eLogger
  • All posts are reported on lhcb-production@cern.ch
  • Please subscribe if you want to know what is
    going on
  • LHCb users
  • Mailing lists
  • lhcb-distributed-analysis@cern.ch
  • All problems directed here
  • Specific lists for each LHCb application and
    Ganga
  • Ticketing systems (Savannah, GGUS) for DIRAC,
    Ganga, apps
  • Used by developers and power users
  • Software weeks provide training sessions for
    using Grid tools
  • Weekly distributed analysis meetings (starts
    Friday)
  • DIRAC, Ganga, core software developers along with
    some users
  • Aims to identify needs and coordinate release
    plans

http://lblogbook.cern.ch/Operations
RSS feed available
14
Summary
  • Concerned about CASTOR stability close to data
    taking
  • DIRAC3 workload and data management system now
    online
  • Has been extensively tested when running LHCb
    productions
  • Now moving it into the user analysis system
  • Ganga needs some additional development
  • Grid operations team working with sites, users
    and devs to identify and resolve problems quickly
    and efficiently
  • LHCb is looking forward to the imminent
    switch-on of the LHC!

15
Backup - CCRC08 Throughput