1
Castor Review LHCb Experiences
2
Overview
  • Use of Castor2
  • From its introduction during SC3 to the current LHCb instance
  • Experience during staging operations
  • LHCb migration to Castor2
  • Users' experience

3
LHCb Data Management
  • Data management handled by DIRAC
  • DIRAC is the LHCb-specific grid software

(Diagram: DIRAC Data Management components. Clients - UserInterface, WMS, TransferAgent - sit on top of the ReplicaManager, which uses file-catalogue plugins (FileCatalogA/B/C) and a StorageElement with protocol plugins (GridFTPStorage, HTTPStorage, SRMStorage); the SE Service fronts the physical storage.)
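
The plugin structure above can be illustrated with a minimal sketch. This is not the actual DIRAC code: the class and method names below are hypothetical and only show how a ReplicaManager can dispatch to interchangeable catalogue and storage back-ends.

    # Minimal sketch assuming only the component names from the diagram;
    # these classes do not reproduce the real DIRAC interfaces.
    class StoragePlugin:
        """Base class for access-protocol plugins (SRM, GridFTP, HTTP)."""
        def get_file(self, lfn, dest):
            raise NotImplementedError

    class SRMStorage(StoragePlugin):
        def get_file(self, lfn, dest):
            print(f"SRM copy {lfn} -> {dest}")      # placeholder for a real SRM call
            return True

    class GridFTPStorage(StoragePlugin):
        def get_file(self, lfn, dest):
            print(f"GridFTP copy {lfn} -> {dest}")  # placeholder for a real GridFTP call
            return True

    class StorageElement:
        """Tries each configured protocol plugin in order of preference."""
        def __init__(self, plugins):
            self.plugins = plugins

        def get_file(self, lfn, dest):
            return any(p.get_file(lfn, dest) for p in self.plugins)

    class ReplicaManager:
        """Single entry point used by the UserInterface, WMS and TransferAgent."""
        def __init__(self, catalogues, storage_elements):
            self.catalogues = catalogues              # FileCatalogA/B/C equivalents
            self.storage_elements = storage_elements  # site name -> StorageElement

        def get_replica(self, lfn, site, dest):
            return self.storage_elements[site].get_file(lfn, dest)

    # Example wiring: one site with SRM preferred over GridFTP.
    rm = ReplicaManager([], {"CERN": StorageElement([SRMStorage(), GridFTPStorage()])})
    rm.get_replica("/lhcb/production/file.dst", "CERN", "/tmp/file.dst")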
4
DM Setup for SC3
  • Central data movement model based at CERN
  • Replication of 8 TB of data to 6 Tier-1 sites
  • The DIRAC Transfer Agent submits and monitors FTS jobs (sketched after the diagram below)
  • DIRAC is thereby removed from direct interaction with the underlying storage
  • This led to a problem

(Diagram: the LHCb DIRAC DMS - Replica Manager and Transfer Agent - drives the LCG SC3 machinery's File Transfer Service through the Transfer Manager Interface.)
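
A transfer-agent loop of this kind can be sketched as below. It is not the DIRAC Transfer Agent itself: it assumes the gLite FTS command-line client (glite-transfer-submit, glite-transfer-status), and the endpoint URL and state names are placeholders.

    # Hedged sketch of an FTS submit-and-poll loop, not the real DIRAC TransferAgent.
    import subprocess
    import time

    FTS_ENDPOINT = "https://fts.example.cern.ch:8443/fts"   # hypothetical endpoint

    def submit_transfer(source_surl, dest_surl):
        """Submit one file transfer and return the FTS job identifier."""
        out = subprocess.run(
            ["glite-transfer-submit", "-s", FTS_ENDPOINT, source_surl, dest_surl],
            capture_output=True, text=True, check=True)
        return out.stdout.strip()                            # FTS prints the job ID

    def wait_for_job(job_id, poll_seconds=60):
        """Poll the job until it reaches a terminal state and return that state."""
        while True:
            out = subprocess.run(
                ["glite-transfer-status", "-s", FTS_ENDPOINT, job_id],
                capture_output=True, text=True, check=True)
            state = out.stdout.strip()
            if state in ("Done", "Failed", "Canceled"):      # terminal states, illustrative
                return state
            time.sleep(poll_seconds)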
5
No FTS Staging
  • Data to be used for SC3 was stored on tape
  • FTS issues an SRM get, triggering a stage from tape
  • The operation takes longer than the FTS agent's timeout
  • Failed on SRM get: "SRM getRequestStatus timed out on get"
  • The DIRAC retry policy can cope, but:
  • it fills transfer slots
  • and dramatically reduces the effective bandwidth
  • Decided to pre-stage to make progress (see the sketch below)
  • 50,000 files in dedicated Castor2 pools
  • 10 TB used for CERN-Tier-1 transfers
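
The pre-stage step can be sketched as follows. stager_get is the CASTOR2 command named on later slides; stager_qry, the "lhcbdata" service class and the output parsing are assumptions for illustration, not LHCb's actual scripts.

    # Hedged sketch of the pre-stage step before handing files to FTS.
    import subprocess

    SVC_CLASS = "lhcbdata"                       # dedicated pool name, assumed

    def prestage(castor_paths):
        """Ask the stager to recall the files from tape into the dedicated disk pool."""
        cmd = ["stager_get", "-S", SVC_CLASS]
        for path in castor_paths:
            cmd += ["-M", path]                  # one -M option per file
        subprocess.run(cmd, check=True)

    def is_staged(castor_path):
        """True once the stager reports the file as disk-resident."""
        out = subprocess.run(["stager_qry", "-S", SVC_CLASS, "-M", castor_path],
                             capture_output=True, text=True)
        return "STAGED" in out.stdout            # status keyword assumed

    # Only files reported as staged are handed to FTS, so the SRM get issued by
    # FTS never waits for a tape recall and cannot hit the agent's timeout.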

6
Some teething problems
  • Castor2 pools set up 12/10/05
  • Began staging data into the pools
  • 13-14/10: CASTOR2 Oracle instance problem blocking scheduling
  • 15-16/10: LSF hang -> DB load too high -> NO_Contact alarm
  • 17/10: stager dead
  • 19/10: intervention

7
Achieved Performance
  • Pre-intervention: many problems; the system felt unstable
  • Post-intervention: stable staging and access; peak rates out of Castor at double the SC3 target rate
(Plot: rate out of Castor, in MB/s, versus date from 9/10/05 to 6/11/05.)
8
Some stager_get errors
  • Several large stage requests during SC3
  • Observed some requests being returned with "Error: Internal error"
  • The request is effectively lost
  • Confirmation of receipt arrives for only a portion of the files
  • e.g. submit a stage request for 100 files
  • the stager returns receipts for only a fraction: 58 responses received
  • Conducted subsequent stages to trace the behaviour (see the sketch below)
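
A defensive pattern against such partially acknowledged requests is sketched below: compare the files the stager acknowledged with those submitted and resubmit the remainder. The parsing of stager_get output and the function names are hypothetical, not LHCb code.

    # Hedged workaround sketch for partially acknowledged stage requests.
    import subprocess

    def submit_stage_request(castor_paths, svc_class="lhcbdata"):
        """Run stager_get and return the set of paths the stager acknowledged."""
        cmd = ["stager_get", "-S", svc_class]
        for path in castor_paths:
            cmd += ["-M", path]
        out = subprocess.run(cmd, capture_output=True, text=True)
        # Assume one output line per acknowledged file, starting with its path.
        seen = {line.split()[0] for line in out.stdout.splitlines() if line.strip()}
        return seen & set(castor_paths)

    def stage_until_acknowledged(castor_paths, max_rounds=3):
        """Resubmit any files whose receipt never came back (e.g. 58 of 100)."""
        pending = set(castor_paths)
        for _ in range(max_rounds):
            if not pending:
                break
            pending -= submit_stage_request(sorted(pending))
        return pending                           # files still unacknowledged, if any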

9
Subsequent Stage 1
  • 1138 analysis files staged to the lhcb Castor2 instance, lhcbdata pool (4/4/06)
  • stager_get with 50 files at a time
  • on server response -> issue the next request
  • 15 minutes for submission of all files
  • Only 2 files encountered problems
  • "unexpected RFIO error" - not retried, so the request failed
  • Retry logic now in place as a result (sketched below)
  • a tape recall was on its way to exit and too late for a new file
  • Exit handling reviewed
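
The chunked submission described above can be sketched as follows; this is not the actual LHCb script. The chunk size of 50 comes from the slide, while the retry count and the all-or-nothing treatment of a chunk are simplifications (a per-file check as in the earlier sketch could be combined with it).

    # Hedged sketch of the chunked staging loop with simple retries.
    import subprocess

    def stage_chunk(paths, svc_class="lhcbdata"):
        """Submit one stager_get for a chunk and wait for the server's response."""
        cmd = ["stager_get", "-S", svc_class]
        for p in paths:
            cmd += ["-M", p]
        return subprocess.run(cmd, capture_output=True, text=True).returncode == 0

    def stage_all(paths, chunk_size=50, retries=2):
        failed = []
        for i in range(0, len(paths), chunk_size):
            chunk = paths[i:i + chunk_size]
            # Issue the next request only after the server has answered this one.
            for _attempt in range(1 + retries):
                if stage_chunk(chunk):
                    break
            else:
                failed.extend(chunk)             # still failing after retries (e.g. RFIO errors)
        return failed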

10
Subsequent Stage 1 (continued)
  • The Lemon disk-usage plot showed the progress of the stage
  • The first 600 GB staged at a constant rate
  • 2.5 hours: 240 GB/hr, or 400 files/hr (cross-checked below)
  • After this the rate decreases
  • Queues not saturated?
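
A quick cross-check of these numbers; the average file size is inferred, not given on the slide.

    # 600 GB in 2.5 hours -> 240 GB/h; at 400 files/h this implies an average
    # file size of about 0.6 GB (an inference, not a number from the slide).
    staged_gb, hours = 600, 2.5
    rate_gb_per_h = staged_gb / hours            # 240.0
    files_per_h = 400
    avg_file_gb = rate_gb_per_h / files_per_h    # 0.6
    print(rate_gb_per_h, avg_file_gb)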

11
Subsequent Stage 2
  • 10,000 RAW files staged to lhcb Castor2 instance,
    wan pool (26/5/06)
  • Same method as before
  • Mid-stage, lxfsrc6201 went down
  • Many files disappeared from the stager
  • When the machine came back, all the requests returned
  • All (bar two) files staged successfully
  • Impressed by resilience and ability to recover

12
Reporting of problems
  • Contact with the Castor group by email
  • castor-deployment@cern.ch
  • Found this to be responsive and helpful
  • Problems/bugs taken on board and fixed quickly

13
Castor2 migration
  • For new data (i.e. using the latest software):
  • Initial problem: ROOT was not Castor2-aware; fixed since ROOT 5.10.00c (13/4/06)
  • Gaudi forces Castor2 usage since 17/5/06 (next Gaudi release)
  • For legacy data (using ROOT 3):
  • A private version of ROOT 3 built for Castor2 on 17/5/06
  • Castor2 usage forced since 30/5/06
  • Default mapping of all LHCb users to Castor2 on 8/6/06 (today!)

14
Users' Castor2 Experience
  • Very little experience so far (late migration)
  • Users are mapped by default to the "default" pool
  • Files are disk-resident on lhcbdata
  • Small penalty (negligible in time), but wasteful use of disk space, due to the copy between lhcbdata and default
  • No plan to let users define their mapping manually
  • Expect that moving to SRM and its related mapping will fix this caveat
  • Welcome the distribution of libshift.so through LCG (allows consistency across applications)

15
Summary
  • Began using Castor2 as part of SC3
  • Initially many problems
  • After the intervention, stability increased
  • Subsequent stages (mostly) without problems
  • Castor team responsive when contacted
  • LHCb migration delayed by POOL problem
  • Users mapped to Castor2 as of today
  • Their experience will be reported

16
Questions?