1
HEP Grid Computing in the UK: moving towards the LHC era
  • International Symposium on Grid Computing 2007

Jeremy Coles (J.Coles@rl.ac.uk)
27th March 2007
2
1 High-level metrics and results
2 Availability and reliability
3 Current activities
4 Problems and issues
5 The future
6 Summary
3
Progress in deploying CPU resources
UKI is 23% (20% at GridPP17) of the contribution to EGEE. Max EGEE: 46795; max UKI: 10393.
http://goc.grid.sinica.edu.tw/gstat/UKI.html
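As a rough cross-check of the quoted share, the UKI maximum divided by the EGEE maximum from the gstat figures gives about the same fraction; a minimal Python sketch (the numbers are the ones quoted above, the helper name is illustrative):

# Rough cross-check of the UKI share of EGEE capacity, using the maxima quoted above from gstat.
def share(uki_max: int, egee_max: int) -> float:
    """Return UKI's share of the EGEE total as a percentage."""
    return 100.0 * uki_max / egee_max

print(f"UKI share: {share(10393, 46795):.1f}%")   # ~22.2%, consistent with the quoted ~23%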
4
Estimated utilisation based on gstat job slots/usage

Guess what happened
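One plausible reading of "job slots/usage" is simply running jobs divided by published job slots; a minimal sketch of that estimate, with made-up figures:

# Minimal sketch: estimated utilisation from gstat-style job-slot figures.
# The figures below are illustrative, not the values behind the plot.
def utilisation(running_jobs: int, total_slots: int) -> float:
    return 100.0 * running_jobs / total_slots

print(f"Estimated utilisation: {utilisation(6200, 10393):.0f}%")   # ~60% with these figures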
5
Usage of CPU is still dominated by LHCb
Two weeks ago:
Note the presence of the new VOs camont and totalep, and also many non-LHC HEP collaborations.
LHCb has been a major user of the CPU resources. There will be increasing competition as ATLAS and CMS double their MC targets every 3-6 months in 2007.
https://gfe03.hep.ph.ic.ac.uk:4175/cgi-bin/load
6
2006 Outturn
Many sites have seen large changes this year. For
example
Columns: Site; CPU promised (KSI2K); CPU delivered (KSI2K); CPU ratio (%); storage promised (TB); storage delivered (TB); storage ratio (%)
Brunel 155 480 310 21 6.3 30
Imperial 1165 807 69 93.3 60.3 65
QMUL 917 1209 132 58.5 18 31
RHUL 204 163 80 23.2 8.8 38
UCL 60 121 202 0.7 1.1 149
Lancaster 510 484 95 86.7 72 83
Liverpool 605 592 98 80.3 2.8 3
Manchester 1305 1840 141 372.6 145 39
Sheffield 183 183 100 3 2 67
Durham 86 99 115 5 4 79
Edinburgh 7 11 152 70.5 20 28
Glasgow 246 800 325 14.8 40 270
Birmingham 196 223 114 9.3 9.3 100
Bristol 39 12 31 1.9 3.8 200
Cambridge 33 40 123 4.4 4.4 101
Oxford 414 150 36 24.5 27 110
RAL PPD 199 320 161 17.4 66.1 381
London 2501 2780 111 196.7 94.4 48
NorthGrid 2602 3099 119 542.6 221.8 41
ScotGrid 340 910 268 90.3 64 71
SouthGrid 880 745 85 57.5 110.6 192
Total 6322 7534 119 887.1 490.8 55
             
Tier-1 1604 1034 64 1495 712 48
Plot annotations: 1) Glasgow's new cluster; 2) RAL PPD new CPU commissioned (timeline markers: August 28, September 1, October 13).
Delivered/promised for 2006: T2 disk 55%, T2 CPU 119%, T1 disk 48%, T1 CPU 64%.
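The ratio columns above are simply delivered over promised, expressed as a percentage; a minimal sketch of the calculation (two sites copied from the table, the dictionary layout is illustrative):

# Delivered/promised ratios as percentages, matching the ratio columns above.
# CPU in KSI2K, storage in TB; figures copied from the table, layout illustrative.
sites = {
    "Glasgow": {"cpu_promised": 246, "cpu_delivered": 800, "disk_promised": 14.8, "disk_delivered": 40},
    "RAL PPD": {"cpu_promised": 199, "cpu_delivered": 320, "disk_promised": 17.4, "disk_delivered": 66.1},
}

def ratio(delivered: float, promised: float) -> float:
    return 100.0 * delivered / promised

for name, s in sites.items():
    print(f"{name}: CPU {ratio(s['cpu_delivered'], s['cpu_promised']):.0f}%, "
          f"disk {ratio(s['disk_delivered'], s['disk_promised']):.0f}%")
# e.g. Glasgow: CPU 325%, disk 270%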
7
LCG Disk Usage
Columns: Site; available (TB) for 1Q06, 2Q06, 3Q06, 4Q06; used (TB) for 1Q06, 2Q06, 3Q06, 4Q06; used/available ratio (%) for 1Q06, 2Q06, 3Q06, 4Q06
Brunel   1.5 1.1 4.7   0.1 0.2 4.3   6.70 18.10 91.10
Imperial 0.3 3.2 5.6 35.4 0.3 2.2 2.9 25.5 88.80 69.40 51.70 72.00
QMUL 18.2 15.9 18.2 18.2 14.3 3.6 3.4 4.8 78.40 22.60 18.40 26.40
RHUL 2.7 2.7 2.7 5.5 2.5 0.3 0.2 1.5 90.50 10.60 7.70 27.30
UCL 1.1 0 1 2 0.9 0 0.3 1.4 82.60 54.30 32.60 70.00
Lancaster 63.4 53.1 47.7 60 29.9 13.1 26.9 12.8 47.10 24.70 56.30 21.30
Liverpool   2.8 0.6 2.8 0 0 0.1 1.4   0.80 16.30 50.00
Manchester   66.9 67.6 176.8 0 1.9 3.9 5.4   2.80 5.80 3.10
Sheffield 4.5 3.9 2.3 2.2 4.4 1.2 0.3 0.1 95.80 32.10 12.40 4.50
Durham 1.9 1.9 3.5 3.5 0.6 1.3 0.9 1.2 30.90 68.10 25.40 34.30
Edinburgh 31 30 29 20 16.6 13.5 2.8 3.9 53.60 45.10 9.50 19.50
Glasgow 4.3 4.3 1.6 34 3.8 0.6 1.1 4.1 89.90 15.00 70.80 12.10
Birmingham 1.8 1.8 1.9 1.8 1.3 0.6 0.8 1.3 73.30 31.80 41.60 72.20
Bristol 0.2 0.2 2.1 1.8 0.2 0 0.3 0.4 89.60 12.00 16.00 22.20
Cambridge 3.2 3.2 3 3.1 3.1 0 0.8 2.1 94.70 0.60 26.30 67.70
Oxford 3.2 1.6 3.2 3.2 2.5 0 0 0.5 80.10 1.10 0.00 15.60
RAL PPD 6.8 6.8 6.4 16.6 6.4 0.6 0.3 13.5 93.50 9.40 4.20 81.30
London 22.4 23.4 28.7 65.8 17.9 6.2 7 37.5 80.30 26.60 24.40 57.00
NorthGrid 67.9 126.7 118.2 241.8 34.2 16.2 31.2 19.7 50.40 12.80 26.40 8.10
ScotGrid 37.1 36.2 34.1 57.5 21 15.5 4.8 9.2 56.60 42.80 14.00 16.00
SouthGrid 15.2 13.6 16.6 26.5 13.4 1.3 2.2 17.8 88.60 9.30 13.20 67.20
Total Tier-2 142.5 199.8 197.5 391.6 86.6 39.2 45.1 84.2 60.70 19.60 22.80 21.50
                         
Tier-1 121.1 114.4 123.1 145.3 56.4 107.2 149.4 177.7 46.60 93.70 121.40 122.30
Q4 2006 used/available: Tier-2 21.5%, Tier-1 122.3%.
8
Storage accounting has become more stable
http://www.gridpp.ac.uk/storage/status/gridppDiscStatus.html
ATLAS and LHCb have been steadily increasing their stored data across UK sites (into 2007). A new issue is dealing with full Storage Elements.
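Spotting nearly-full Storage Elements before they fail SAM tests is largely a matter of watching the space attributes sites already publish in the information system; a minimal sketch using python-ldap against a top-level BDII (the BDII host and the 5% threshold are assumptions, not GridPP policy):

# Minimal sketch: query a BDII for GLUE storage-area space and flag nearly-full SEs.
# The BDII host and the 5% threshold below are illustrative assumptions.
import ldap

BDII = "ldap://lcg-bdii.cern.ch:2170"   # any top-level BDII would do; this host is an assumption
BASE = "o=grid"                          # GLUE 1.x base DN

conn = ldap.initialize(BDII)
for dn, attrs in conn.search_s(
    BASE, ldap.SCOPE_SUBTREE, "(objectClass=GlueSA)",
    ["GlueSAStateAvailableSpace", "GlueSAStateUsedSpace"],
):
    avail = int(attrs.get("GlueSAStateAvailableSpace", [b"0"])[0])   # nominally kB in GLUE 1.x
    used = int(attrs.get("GlueSAStateUsedSpace", [b"0"])[0])
    total = avail + used
    if total and avail / total < 0.05:                               # flag storage areas with <5% free
        print(f"Nearly full: {dn} ({avail} kB free of {total} kB)")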
9
We need to address the KSI2K/TB ratios with
future purchases
Columns: Site, CE, SE, CPU (KSI2K) / storage size (TB)
UKI-LT2-IC-HEP ce00.hep.ph.ic.ac.uk gfe02.hep.ph.ic.ac.uk 8
UKI-LT2-IC-HEP gw39.hep.ph.ic.ac.uk gfe02.hep.ph.ic.ac.uk 0.54
UKI-LT2-IC-LeSC mars-ce2.mars.lesc.doc.ic.ac.uk gfe02.hep.ph.ic.ac.uk 5.99
RAL-LCG2 lcgce01.gridpp.rl.ac.uk dcache.gridpp.rl.ac.uk 9.43
ScotGRID-Edinburgh ce.epcc.ed.ac.uk srm.epcc.ed.ac.uk 0.14
UKI-LT2-QMUL ce02.esc.qmul.ac.uk se01.esc.qmul.ac.uk n
UKI-LT2-RHUL ce1.pp.rhul.ac.uk se1.pp.rhul.ac.uk 20.53
UKI-LT2-UCL-CENTRAL gw-2.ccc.ucl.ac.uk gw-3.ccc.ucl.ac.uk 164.06
UKI-LT2-UCL-HEP pc90.hep.ucl.ac.uk pc55.hep.ucl.ac.uk 82.82
UKI-NORTHGRID-LANCS-HEP fal-pygrid-18.lancs.ac.uk fal-pygrid-20.lancs.ac.uk n
UKI-NORTHGRID-MAN-HEP ce01.tier2.hep.manchester.ac.uk dcache01.tier2.hep.manchester.ac.uk 12.05
UKI-NORTHGRID-MAN-HEP ce02.tier2.hep.manchester.ac.uk dcache02.tier2.hep.manchester.ac.uk 10.45
UKI-NORTHGRID-SHEF-HEP lcgce0.shef.ac.uk lcgse1.shef.ac.uk 89.73
UKI-SCOTGRID-DURHAM helmsley.dur.scotgrid.ac.uk gallows.dur.scotgrid.ac.uk 20.65
UKI-SCOTGRID-GLASGOW svr016.gla.scotgrid.ac.uk svr018.gla.scotgrid.ac.uk 23.35
UKI-SOUTHGRID-BHAM-HEP epgce1.ph.bham.ac.uk epgse1.ph.bham.ac.uk 12.36
UKI-SOUTHGRID-BHAM-HEP epgce2.ph.bham.ac.uk epgse1.ph.bham.ac.uk 24.81
UKI-SOUTHGRID-BRIS-HEP lcgce01.phy.bris.ac.uk lcgse01.phy.bris.ac.uk 2.35
UKI-SOUTHGRID-OX-HEP t2ce02.physics.ox.ac.uk t2se01.physics.ox.ac.uk 41.17
UKI-SOUTHGRID-RALPP heplnx206.pp.rl.ac.uk heplnx204.pp.rl.ac.uk n
The LCG experiments need well defined ratios of
Tier-2 CPU to Disk (KSI2K/TB) which are about 2
for ATLAS, 3 for CMS, and almost entirely CPU for
LHCb.
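A simple way to see which sites sit far from the experiment targets is to compute the CPU-to-disk ratio and compare it with the figures above (roughly 2 for ATLAS and 3 for CMS); a minimal sketch with illustrative site numbers:

# Minimal sketch: compare a site's KSI2K/TB ratio with the experiment targets quoted above.
# The site figures below are illustrative, not taken from the table.
TARGETS = {"ATLAS": 2.0, "CMS": 3.0}    # KSI2K per TB

def ksi2k_per_tb(cpu_ksi2k: float, disk_tb: float) -> float:
    return cpu_ksi2k / disk_tb

site_cpu, site_disk = 800.0, 40.0       # example: 800 KSI2K of CPU, 40 TB of disk
r = ksi2k_per_tb(site_cpu, site_disk)
for expt, target in TARGETS.items():
    balance = "CPU-heavy" if r > target else "disk-heavy"
    print(f"{expt}: site ratio {r:.1f} vs target {target:.1f} ({balance})")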
10
Overall stability is improving but this picture
is not seen very often!

All SAM results green! This lasted for a single
iteration of the tests. A production grid should
have this as the norm.
11
The GridView availability figures need cross-checking, but we will use them to start setting Tier-2 targets to improve performance.
"BDII failures in SRM and SE tests" were the problem. More than that, it was not failures in the site-designated BDII, but that the BDII had been hardcoded to sam-bdii.cern.ch, and all the failures were from this component. (ScotGrid blog, 23rd March; https://gus.fzk.de/pages/ticket_details.php?ticket=19989)
Targets: April >80%, June >85%, July >90%, September >95%.
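Setting the Tier-2 targets amounts to comparing each site's measured availability (the fraction of test intervals passed) with the monthly threshold; a minimal sketch (targets copied from above, the measured figure is illustrative):

# Minimal sketch: check measured availability against the monthly targets quoted above.
# The measured value is illustrative; the targets are from the slide.
TARGETS = {"April": 0.80, "June": 0.85, "July": 0.90, "September": 0.95}

def meets_target(availability: float, month: str) -> bool:
    """availability = passed test intervals / total intervals, in [0, 1]."""
    return availability > TARGETS[month]

measured = 0.87   # e.g. 87% availability reported by GridView
for month, target in TARGETS.items():
    status = "OK" if meets_target(measured, month) else "below target"
    print(f"{month}: need >{target:.0%}, measured {measured:.0%} -> {status}")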
12
From a user perspective things are nowhere near stable, so we need other measures: these are UK ATLAS test results starting Jan 2007 (http://hepwww.ph.qmul.ac.uk/lloyd/atlas/atest.php), but note that there is no special priority for these test jobs.
[Plot: average job success vs. date]
13
From a user perspective things are nowhere near stable, so we need other measures: this is the UK ATLAS test average success rate.
[Plot: average job success vs. date; annotations: more sites and tests introduced, system failures]
14
From a user perspective things are nowhere near stable, so we need other measures: one example of a service problem encountered.
Matching on one RB is too slow. Do we spend time investigating the underlying problem?
[Plot: average job success vs. date]
15
Current networking (related) activities
LAN tests: example rfio test on DPM at Glasgow (shown at the WLCG workshop). Other sites are beginning such testing.
T0 -> T1: still uncovering problems with CASTOR.
T1 -> T2 and T2 -> T2: T1 to T2 target rate 300 Mb/s or better; intra-T2 target rate 200 Mb/s or better, reading/writing. 2008 targets: 1 Gb/s.
Bottleneck: Tier-1 outbound.
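For the transfer-rate targets the bookkeeping is just bytes moved over wall-clock time, converted to megabits per second; a minimal sketch (the transfer figures are illustrative):

# Minimal sketch: convert a measured transfer into Mb/s and compare it with the
# inter-site targets quoted above. The transfer figures are illustrative.
TARGET_MBPS = {"T1->T2": 300.0, "intra-T2": 200.0}   # from the slide; 2008 target is 1 Gb/s

def rate_mbps(bytes_moved: int, seconds: float) -> float:
    return bytes_moved * 8 / seconds / 1e6

moved = 50 * 1024**3     # 50 GiB transferred
elapsed = 1500.0         # seconds of wall-clock time
r = rate_mbps(moved, elapsed)
for link, target in TARGET_MBPS.items():
    verdict = "meets" if r >= target else "misses"
    print(f"{link}: measured {r:.0f} Mb/s {verdict} the {target:.0f} Mb/s target")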
16
GridMon network monitoring
Network monitoring control nodes have been placed
at each site to provide a background for
network analysis

Tier-2 examples
http://gridmon3.dl.ac.uk/gridmon/graph.html
T1 example
17
Issues that we are facing
  • As job loads increase, the information system is showing some signs of a lack of scaling (query static information at site).
  • The regular m/w releases have made things easier. Recently several problems have been noted (e.g. with gLite 3.0 r16, DPM, Torque/Maui, the first major gLite update).
  • Site administrators are having to quickly learn
    new skills to keep up with growing resource
    levels and stricter availability requirements.
    Improved monitoring is likely to be a theme for
    this year.
  • Workarounds are required in several areas, especially to compensate for the lack of VOMS-aware middleware, such as enabling job priorities on batch queues and access control on storage (the ATLAS ACL request caused concern and confusion!).
  • Memory per job vs core vs 64-bit may become an
    issue as could available WN disk space unless we
    have a clear strategy
  • Setting up disk pools also requires careful
    thought since disk quotas are not settable (disks
    are filling up and sites fail SAM tests)
  • The middleware is still not available for SL4
    32-bit let alone SL4 64-bit.

18
Tier-1 organisation going forward in 2007
19
Disk and CASTOR
Problems throughout most of 2006 with new disk purchases: drives were ejected from arrays under normal load. Many theories were put forward and tests conducted. WD, using analyzers on SATA interconnects at two sites, uncovered the problem as being due to the drive head staying in one place for too long! Fixed with a "return following reposition" firmware update. Disk is now all deployed or ready to be deployed.
  • Successes
  • Experiments see good rates (progress pushed by
    CSA06)
  • More reliable than dCache
  • Success with disk1tape0 service
  • Bug-fix release should solve many problems
  • CASTOR ongoing problems
  • Garbage collection does not always work
  • Under heavy load, jobs are submitted to the wrong servers (ones not in the correct storage class) -> jobs hang. Jobs in the LSF queue build up -> this can be catastrophic!
  • File systems within disk pool fill unevenly
  • Tape migration sometimes loops -> mount/unmount without any work done
  • Significant number of disk-to-disk copies hang
  • Unstable releases need more testing
  • Lack of admin tools (configuration risky and
    time consuming) and good user interface
  • No way to throttle job load
  • Logging is inadequate (problem resolution
    difficult)
  • Some of these are being addressed, but the current situation has impacted experiment migration to CASTOR and requires more support than anticipated.

20
CPU efficiency still a concern at T1 and some T2 sites
90% CPU efficiency (given i/o bottlenecks) is OK; the concern is that this fell to 75% (target line marked on the plot).
Each experiment needs to work to improve their
system/deployment practice anticipating e.g.
hanging gridftp connections during batch work
21
CPU efficiency still a concern at T1 and some T2 sites
90% CPU efficiency (given i/o bottlenecks) is OK; the concern is that this fell to 75%, with 47% marked on the plot (target line also shown).
Each experiment needs to work to improve their
system/deployment practice anticipating e.g.
hanging gridftp connections during batch work
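CPU efficiency here is total CPU time divided by total wall-clock time over the jobs, which is why a hanging gridftp connection (wall time accruing with almost no CPU use) drags the figure down; a minimal sketch with illustrative job records:

# Minimal sketch: CPU efficiency = total CPU time / total wall-clock time over a set of jobs.
# The job records are illustrative; a hanging transfer shows up as wall time with little CPU time.
jobs = [
    {"cpu_s": 3500, "wall_s": 3600},   # healthy job: ~97% efficient
    {"cpu_s": 1800, "wall_s": 3600},   # I/O-bound job
    {"cpu_s": 100,  "wall_s": 3600},   # hanging gridftp connection: wall time, almost no CPU
]

def efficiency(records) -> float:
    cpu = sum(j["cpu_s"] for j in records)
    wall = sum(j["wall_s"] for j in records)
    return 100.0 * cpu / wall

print(f"Batch CPU efficiency: {efficiency(jobs):.0f}%")   # 50% for these records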
22
6-month focus with LHC start-up in mind
1. Improve monitoring at (especially) T2 sites and persuade more sites to join in with cross-site working
2. Strengthen experiment-deployment team interaction
3. Site readiness reviews: these have started (site visits plus questionnaire)
4. Continue site testing, now using experiment tools
5. Address the KSI2K/TB ratios
23
The Future 1: GridPP3 Project Approved
Two weeks ago, funding for GridPP3 was announced: £25.9m of new money plus contingency etc. (a £30m project).
GridPP1: Sep 2001 to Sep 2004, £17.0m or £5.7m/year. GridPP2: Sep 2004 to Sep 2007, £15.9m or £5.3m/year. GridPP2 (extension): Sep 2007 to Apr 2008. GridPP3: Apr 2008 to Apr 2011, £25.9m or £7.2m/year.
In the current UK funding environment this is a very good outcome. The total funding is consistent with a 70% scenario presented to the project review body.
24
(No Transcript)
25
The Future 2: Working with the UK National Grid Service
-> EGI
  • NGS components
  • Heterogeneous hardware and middleware
  • Computation services based on GT2
  • Data services: SRB, Oracle, OGSA-DAI
  • NGS portal, P-Grade
  • BDII using GLUE schema from 1995
  • RB - gLite WMS-LB deployed Feb 2007
  • Interoperability
  • NGS RB and BDII currently configured to work with the core NGS nodes (Oxford, Leeds, Manchester, RAL)
  • Other NGS sites to follow soon
  • GridPP sites reporting to RAL GridPP resources
  • GridPP and the NGS VO (ngs.ac.uk)
  • YAIM problem for DNS-style VO names. Workaround is to edit the GLUE static information and restart the GRIS, but sites ...

Issues: direction of CE, policies, ...
With input from Matt Viljoen
26
1. Metrics: CPU on track and disk to catch up as demand increases
2. Availability: many sources; working on targets
3. Ongoing work: e.g. monitoring and bandwidth (WAN and LAN) testing
4. Tier-1 is shaping up: fabric better but CASTOR a concern
5. GridPP3 is funded; interoperation with NGS is progressing
6. Still lots to do; many areas not mentioned!
Acknowledgments and references: this talk relies on contributions from many sources, but notably from talks given at GridPP18 (http://www.gridpp.ac.uk/gridpp18/). Most material is linked from http://www.gridpp.ac.uk/deployment/links.html. Our blogs and wiki (http://www.gridpp.ac.uk/w/index.php?title=Special:Categories&article=Main_Page) may also be of interest.