David Colling
1
Performance of the LHC Computing Grid (LCG)
2
Thanks: Slides/pictures/text taken from several
people including Les Robertson, Jamie Shears, Bob
Jones, Gabriel Zaquine, Jeremy Coles, Gidon Moont.

Caveat: LCG means different things to different
people and funding bodies.
3
Contents
  • Description of the LCG, what the targets are, how
    it works.
  • The monitoring that is currently in place
  • The current and future metrics.
  • The Service Challenges
  • The testing and release procedure

4
The LHC
5
What is the LHC?
  • LHC will collide beams of protons at an energy of
    14 TeV
  • Using the latest superconducting technologies,
    it will operate at about -270ºC, just above the
    absolute zero of temperature
  • With its 27 km circumference, the accelerator
    will be the largest superconducting installation
    in the world.
  • Four detectors constructed and operated by
    international collaborations of thousands of
    physicists, engineers and technicians.

The largest terrestrial scientific endeavour ever
undertaken. The LHC is due to switch on and start
taking data in 2007. Four experiments, with
detectors as big as cathedrals: ALICE, ATLAS, CMS
and LHCb.
6
Data Volume
Data accumulating at 15 PetaBytes/year, equivalent
to writing a CD every 2 seconds.
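
As a quick sanity check on that comparison (assuming a nominal 700 MB
CD and a 365-day year, neither of which is specified on the slide),
the arithmetic works out roughly as follows.

# Quick check of the "CD every ~2 seconds" comparison
# (assumes 15 PB/year and a nominal 700 MB CD; decimal units).
PB = 1e15
MB = 1e6
SECONDS_PER_YEAR = 365 * 24 * 3600

bytes_per_second = 15 * PB / SECONDS_PER_YEAR
seconds_per_cd = 700 * MB / bytes_per_second

print(f"average rate ~{bytes_per_second / MB:.0f} MB/s")      # ~476 MB/s
print(f"one 700 MB CD filled every ~{seconds_per_cd:.1f} s")  # ~1.5 s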
7
The Role of LCG
  • LCG is the system on which this data will be
    analysed and similar volumes of MC simulation
    generated
  • High Energy Physics jobs have particular
    characteristics, e.g. they are (thankfully)
    parallel
  • However, LCG and EGEE are very closely linked,
    and EGEE has a more general remit, e.g. biomed,
    earth observation, etc. applications as well as HEP

8
Middleware and Deployment
  • Current Middleware based on EDG but hardened
    and extended
  • New middleware being developed with the EGEE
    project
  • Deployment and monitoring is also done jointly
    with EGEE

9
The System (ATLAS Case)
(Diagram of the ATLAS computing model: 10 Tier-1s
reprocess data and house simulation; group analysis;
workstations.)
10
The World as seen by the EDG
This is the world without Grids:
  • Sites are not identical.
  • Different Computers
  • Different Storage
  • Different Files
  • Different Usage Policies

The result is a confused and unhappy user. So let's introduce some grid
infrastructure: security and an information system, plus a VO server.
(Diagram label: "Each site consists of ...")
So now the user knows what machines are out there and can communicate
with them; however, where to submit the job is too complex a decision
for the user alone. What is needed is an automated system: the Workload
Management System (Resource Broker). The WMS, using the Replica
Location Service (Replica Catalogue), decides on the execution
location, and the job is tracked by Logging and Bookkeeping. The user
retrieves the output with edg-job-get-output <dg-job-id> and is now a
happy user.
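
The command-line lifecycle behind that picture can be sketched roughly
as below, wrapped in Python for readability. This is an illustrative
sketch only: the JDL attributes, file names and the way the job
identifier is parsed from the submit output are assumptions, not
details taken from the slide.

# Illustrative sketch of a user's job lifecycle through the EDG/LCG
# workload management system described above.
import subprocess

# A minimal (assumed) JDL describing the job and its output sandbox.
JDL = """\
Executable    = "/bin/hostname";
StdOutput     = "std.out";
StdError      = "std.err";
OutputSandbox = {"std.out", "std.err"};
"""

with open("hello.jdl", "w") as f:
    f.write(JDL)

# Submit via the Resource Broker; the WMS consults the information
# system and the Replica Catalogue to choose an execution site.
submit = subprocess.run(["edg-job-submit", "hello.jdl"],
                        capture_output=True, text=True)

# The job identifier printed by edg-job-submit is an https:// URL
# (assumed parsing; the real output contains surrounding text).
job_id = next(line.strip() for line in submit.stdout.splitlines()
              if line.strip().startswith("https://"))

# Query the Logging & Bookkeeping service, then retrieve the output sandbox.
subprocess.run(["edg-job-status", job_id])
subprocess.run(["edg-job-get-output", job_id])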
11
So what is actually there now?
  • Currently, 138 sites in 36 countries
  • 14K CPUs, 10 PB storage
  • 1000 registered users (>100 active users)

12
Monitoring LCG/EGEE
  • Four forms of monitoring (demos)
  • What is the state of a given site
  • What is currently being used
  • Accounting: how many resources have been used by
    a given Virtual Organisation.
  • EGEE quality assurance

These different activities are not always well
connected
13
What is the state of a site?
  • A series of site functional tests is run
    automatically at every site; some involve asking
    a site questions, others involve running jobs
  • These tests are defined as critical or
    non-critical. If a site consistently fails
    critical tests, automated messages are sent to
    the site and it will be removed from the
    information system if the error is not corrected
    (a minimal sketch of this logic follows below).
  • The information published by a site is also
    analysed.
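
A minimal sketch of that critical-test decision logic, assuming
hypothetical test names and an arbitrary threshold of three consecutive
failing runs (the slide specifies neither).

# Sketch (not the actual site functional test implementation) of the
# decision described above: sites consistently failing *critical*
# tests are warned and eventually removed from the information system.
CRITICAL_TESTS = {"job-submission", "replica-management", "ca-certificates"}
FAILURE_THRESHOLD = 3   # consecutive failing runs before removal (assumed)

def evaluate_site(history):
    """history: list of dicts mapping test name -> True (pass) / False (fail),
    most recent run last.  Returns 'ok', 'warn' or 'remove'."""
    consecutive = 0
    for run in history:
        failed_critical = [t for t in CRITICAL_TESTS if not run.get(t, True)]
        consecutive = consecutive + 1 if failed_critical else 0
    if consecutive >= FAILURE_THRESHOLD:
        return "remove"   # drop from the information system, notify the site
    if consecutive > 0:
        return "warn"     # automated message to the site
    return "ok"

# Example: a site failing the critical job-submission test three runs in a row.
runs = [{"job-submission": False}] * 3
print(evaluate_site(runs))   # -> 'remove'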

14
What is the state of a site?
Information gathered at two GOCs:
http://goc.grid.sinica.edu.tw/ and
http://goc.grid-support.ac.uk/gridsite/gocmain/
15
What is the state of a site?
Maps as well
16
What is currently being used: GridIce
http://gridice2.cnaf.infn.it:50080/gridice/site/site.php
Kind of like the gstat asked for earlier today
17
Accounting: APEL
Uses the local batch system logs and publishes
information over RGMA
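
As a rough illustration of the aggregation step an APEL-like accounting
service performs, the sketch below sums per-VO usage from parsed
batch-log records; the record layout and field names are assumptions,
not APEL's actual schema.

# Illustrative aggregation of batch-system usage records into per-VO totals.
from collections import defaultdict

# (vo, cpu_seconds, wall_seconds) tuples, as might be parsed from batch logs.
records = [
    ("atlas", 7200, 7500),
    ("cms",   3600, 4000),
    ("atlas", 1800, 2000),
]

usage = defaultdict(lambda: {"cpu": 0, "wall": 0, "jobs": 0})
for vo, cpu, wall in records:
    usage[vo]["cpu"]  += cpu
    usage[vo]["wall"] += wall
    usage[vo]["jobs"] += 1

for vo, totals in sorted(usage.items()):
    print(f"{vo}: {totals['jobs']} jobs, "
          f"{totals['cpu']/3600:.1f} CPU-hours, {totals['wall']/3600:.1f} wall-hours")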
18
Quality assurance
Interrogates the Logging and Bookkeeping service
  • Overall job success, from January 2005
  • Job success rate = Done(OK) / (Submitted - Cancelled),
    computed in a short sketch below
  • Results should be validated
  • http://ccjra2.in2p3.fr/EGEE-JRA2/QAmeasurement/showstatsVO.php
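
The success-rate formula above, as a small sketch with made-up job
counts (not real LCG numbers).

# Success rate = Done(OK) / (Submitted - Cancelled), as defined above.
def job_success_rate(submitted, cancelled, done_ok):
    considered = submitted - cancelled
    return done_ok / considered if considered else 0.0

# Example with illustrative numbers only:
print(f"{job_success_rate(submitted=10000, cancelled=500, done_ok=7600):.1%}")
# -> 80.0%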

19
Quality assurance
  • VOs' job throughput and success rate, from January
    to May 2005

20
Quality assurance
The next step is to understand these failures.
By the end of June we will also measure the
overhead caused by running via the LCG, by
measuring the ratio of running time to total time.
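
A tiny illustration of that overhead measure, with made-up timings.

# Grid overhead as described above: the fraction of a job's total
# (wall-clock) time not spent actually running.  Numbers are illustrative.
running_time = 3.0 * 3600   # seconds the job spent executing
total_time   = 3.5 * 3600   # seconds from submission to output retrieval

efficiency = running_time / total_time
print(f"running/total = {efficiency:.2f}, overhead = {1 - efficiency:.0%}")
# -> running/total = 0.86, overhead = 14%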
21
Quality assurance
  • Many other metrics have been suggested (especially
    in the UK), including
  • Number of users (from different communities)
  • Training quality
  • Maintenance and reliability (already measured)
  • Upgrade time, etc.

(UK sites only; the target was 3 weeks.)
22
Demos
http://www.hep.ph.ic.ac.uk/e-science/projects/demo/index.html
23
How will we know if we are going to get there?
  • There is an ongoing set of service challenges
  • Each Service Challenge grows in complexity,
    approaching the full production service
  • Currently we are between SC2 and SC3.
  • SC2 involved only the T0 and T1s.
  • SC3 will involve 5 T2s as well, and SC4 will
    involve all T2 sites.

24
Service Challenge 2
  • Goal for throughput was a >600 MB/s daily average
    for 10 days; this was achieved from midday 23rd
    March to midday 2nd April (rough total volume
    estimated in the sketch below)
  • Not without outages, but the system showed it
    could recover the rate again after outages
  • Load reasonably evenly divided over sites (given
    the network bandwidth constraints of Tier-1 sites)
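
Assuming the 600 MB/s daily-average rate held for the full 10 days (an
idealisation), the implied total volume moved is roughly:

# Rough total volume implied by the SC2 target rate (assumes the
# 600 MB/s daily average throughout the 10-day window).
MB = 1e6
rate_bytes_per_s = 600 * MB
duration_s = 10 * 24 * 3600

total_bytes = rate_bytes_per_s * duration_s
print(f"~{total_bytes / 1e15:.2f} PB (~{total_bytes / 1e12:.0f} TB) transferred")
# -> ~0.52 PB (~518 TB) transferred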

25
Service Challenge 3 and beyond
26
Testing and deployment
  • Multi-stage release
  • New components are first tested on the testing
    testbed, with rapid feedback to developers. This
    testing is to include performance/scalability
    testing. Currently, this is only at 4 (5) sites:
    CERN, NIKHEF, RAL, Imperial (two installations)
  • Pre-production testbed
  • Releases onto production every 3 months

27
Conclusions
  • Very hard deadline by which this must work
  • We are monitoring as much as we can to try to
    understand where our current failures come from.
  • We have a release process that hopefully will
    improve performance of future releases

28
http://goc.grid-support.ac.uk/gridsite/monitoring/
http://goc.grid.sinica.edu.tw/gstat/
http://gridice2.cnaf.infn.it:50080/gridice/site/site.php
http://www.hep.ph.ic.ac.uk/e-science/projects/demo/index.html