David Colling
1
Performance of the LHC Computing Grid (LCG)
2
Thanks: Slides/pictures/text taken from several
people including Les Robertson, Jamie Shears, Bob
Jones, Gabriel Zaquine, Jeremy Coles, Gidon Moont.

Caveat: LCG means different things to different
people and funding bodies.
3
Contents
  • Description of the LCG, what the targets are, how
    it works.
  • The monitoring that is currently in place
  • The current and future metrics.
  • The Service Challenges
  • The testing and release procedure

4
The LHC
5
What is the LHC?
  • LHC will collide beams of protons at an energy of
    14 TeV
  • Using the latest superconducting technologies,
    it will operate at about -270ºC, just above the
    absolute zero of temperature
  • With its 27 km circumference, the accelerator
    will be the largest superconducting installation
    in the world.
  • Four detectors constructed and operated by
    international collaborations of thousands of
    physicists, engineers and technicians.

The largest terrestrial scientific endeavour ever
undertaken. The LHC is due to switch on and start
taking data in 2007. Four experiments, with
detectors as big as cathedrals: ALICE, ATLAS, CMS
and LHCb.
6
Data Volume
Data accumulating at 15 PetaBytes/year, equivalent
to writing a CD every 2 seconds.
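
As a quick sanity check on that comparison (assuming a nominal 700 MB
CD and a 365-day year, neither of which is specified on the slide),
the arithmetic works out roughly as follows.

# Quick check of the "CD every ~2 seconds" comparison
# (assumes 15 PB/year and a nominal 700 MB CD; decimal units).
PB = 1e15
MB = 1e6
SECONDS_PER_YEAR = 365 * 24 * 3600

bytes_per_second = 15 * PB / SECONDS_PER_YEAR
seconds_per_cd = 700 * MB / bytes_per_second

print(f"average rate ~{bytes_per_second / MB:.0f} MB/s")      # ~476 MB/s
print(f"one 700 MB CD filled every ~{seconds_per_cd:.1f} s")  # ~1.5 s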
7
The Role of LCG
  • LCG is the system on which this data will be
    analysed and similar volumes of MC simulation
    generated
  • High Energy Physics jobs have particular
    characteristics, e.g. they are (thankfully)
    parallel
  • However, LCG and EGEE are very closely linked,
    and EGEE has a more general remit, e.g. biomed,
    earth observation, etc. applications as well as HEP

8
Middleware and Deployment
  • Current Middleware based on EDG but hardened
    and extended
  • New middleware being developed with the EGEE
    project
  • Deployment and monitoring is also done jointly
    with EGEE

9
The System (ATLAS Case)
(Diagram of the ATLAS computing model: 10 Tier-1s
reprocess data and house simulation; group analysis;
workstations.)
10
The World as seen by the EDG
This is the world without Grids:
  • Sites are not identical.
  • Different Computers
  • Different Storage
  • Different Files
  • Different Usage Policies

The result is a confused and unhappy user. So let's introduce some grid
infrastructure: security and an information system, plus a VO server.
(Diagram label: "Each site consists of ...")
So now the user knows what machines are out there and can communicate
with them; however, where to submit the job is too complex a decision
for the user alone. What is needed is an automated system: the Workload
Management System (Resource Broker). The WMS, using the Replica
Location Service (Replica Catalogue), decides on the execution
location, and the job is tracked by Logging and Bookkeeping. The user
retrieves the output with edg-job-get-output <dg-job-id> and is now a
happy user.
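
The command-line lifecycle behind that picture can be sketched roughly
as below, wrapped in Python for readability. This is an illustrative
sketch only: the JDL attributes, file names and the way the job
identifier is parsed from the submit output are assumptions, not
details taken from the slide.

# Illustrative sketch of a user's job lifecycle through the EDG/LCG
# workload management system described above.
import subprocess

# A minimal (assumed) JDL describing the job and its output sandbox.
JDL = """\
Executable    = "/bin/hostname";
StdOutput     = "std.out";
StdError      = "std.err";
OutputSandbox = {"std.out", "std.err"};
"""

with open("hello.jdl", "w") as f:
    f.write(JDL)

# Submit via the Resource Broker; the WMS consults the information
# system and the Replica Catalogue to choose an execution site.
submit = subprocess.run(["edg-job-submit", "hello.jdl"],
                        capture_output=True, text=True)

# The job identifier printed by edg-job-submit is an https:// URL
# (assumed parsing; the real output contains surrounding text).
job_id = next(line.strip() for line in submit.stdout.splitlines()
              if line.strip().startswith("https://"))

# Query the Logging & Bookkeeping service, then retrieve the output sandbox.
subprocess.run(["edg-job-status", job_id])
subprocess.run(["edg-job-get-output", job_id])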
11
So what is actually there now?
  • Currently, 138 sites in 36 countries
  • 14K CPUs, 10 PB storage
  • 1000 registered users (>100 active users)

12
Monitoring LCG/EGEE
  • Four forms of monitoring (demos)
  • What is the state of a given site
  • What is currently being used
  • Accounting: how many resources have been used by
    a given Virtual Organisation.
  • EGEE quality assurance

These different activities are not always well
connected
13
What is the state of a site?
  • A series of site functional tests is run
    automatically at every site; some involve asking
    a site questions, others involve running jobs
  • These tests are defined as critical or
    non-critical. If a site consistently fails
    critical tests, automated messages are sent to
    the site and it will be removed from the
    information system if the error is not corrected
    (a minimal sketch of this logic follows below).
  • The information published by a site is also
    analysed.
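
A minimal sketch of that critical-test decision logic, assuming
hypothetical test names and an arbitrary threshold of three consecutive
failing runs (the slide specifies neither).

# Sketch (not the actual site functional test implementation) of the
# decision described above: sites consistently failing *critical*
# tests are warned and eventually removed from the information system.
CRITICAL_TESTS = {"job-submission", "replica-management", "ca-certificates"}
FAILURE_THRESHOLD = 3   # consecutive failing runs before removal (assumed)

def evaluate_site(history):
    """history: list of dicts mapping test name -> True (pass) / False (fail),
    most recent run last.  Returns 'ok', 'warn' or 'remove'."""
    consecutive = 0
    for run in history:
        failed_critical = [t for t in CRITICAL_TESTS if not run.get(t, True)]
        consecutive = consecutive + 1 if failed_critical else 0
    if consecutive >= FAILURE_THRESHOLD:
        return "remove"   # drop from the information system, notify the site
    if consecutive > 0:
        return "warn"     # automated message to the site
    return "ok"

# Example: a site failing the critical job-submission test three runs in a row.
runs = [{"job-submission": False}] * 3
print(evaluate_site(runs))   # -> 'remove'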

14
What is the state of a site?
Information gathered at two GOCs:
http://goc.grid.sinica.edu.tw/ and
http://goc.grid-support.ac.uk/gridsite/gocmain/
15
What is the state of a site?
Maps as well
16
What is currently being used: GridIce
http://gridice2.cnaf.infn.it:50080/gridice/site/site.php
Kind of like the gstat asked for earlier today
17
Accounting: APEL
Uses the local batch system logs and publishes
information over RGMA
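
As a rough illustration of the aggregation step an APEL-like accounting
service performs, the sketch below sums per-VO usage from parsed
batch-log records; the record layout and field names are assumptions,
not APEL's actual schema.

# Illustrative aggregation of batch-system usage records into per-VO totals.
from collections import defaultdict

# (vo, cpu_seconds, wall_seconds) tuples, as might be parsed from batch logs.
records = [
    ("atlas", 7200, 7500),
    ("cms",   3600, 4000),
    ("atlas", 1800, 2000),
]

usage = defaultdict(lambda: {"cpu": 0, "wall": 0, "jobs": 0})
for vo, cpu, wall in records:
    usage[vo]["cpu"]  += cpu
    usage[vo]["wall"] += wall
    usage[vo]["jobs"] += 1

for vo, totals in sorted(usage.items()):
    print(f"{vo}: {totals['jobs']} jobs, "
          f"{totals['cpu']/3600:.1f} CPU-hours, {totals['wall']/3600:.1f} wall-hours")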
18
Quality assurance
Interrogates the Logging and Bookkeeping service
  • Overall job success, from January 2005
  • Job success rate = Done(OK) / (Submitted - Cancelled),
    computed in a short sketch below
  • Results should be validated
  • http://ccjra2.in2p3.fr/EGEE-JRA2/QAmeasurement/showstatsVO.php
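
The success-rate formula above, as a small sketch with made-up job
counts (not real LCG numbers).

# Success rate = Done(OK) / (Submitted - Cancelled), as defined above.
def job_success_rate(submitted, cancelled, done_ok):
    considered = submitted - cancelled
    return done_ok / considered if considered else 0.0

# Example with illustrative numbers only:
print(f"{job_success_rate(submitted=10000, cancelled=500, done_ok=7600):.1%}")
# -> 80.0%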

19
Quality assurance
  • VOs' job throughput and success rate, from January
    to May 2005

20
Quality assurance
The next step is to understand these failures.
By the end of June we will also measure the
overhead caused by running via the LCG, by
measuring the ratio of running time to total time.
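
A tiny illustration of that overhead measure, with made-up timings.

# Grid overhead as described above: the fraction of a job's total
# (wall-clock) time not spent actually running.  Numbers are illustrative.
running_time = 3.0 * 3600   # seconds the job spent executing
total_time   = 3.5 * 3600   # seconds from submission to output retrieval

efficiency = running_time / total_time
print(f"running/total = {efficiency:.2f}, overhead = {1 - efficiency:.0%}")
# -> running/total = 0.86, overhead = 14%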
21
Quality assurance
  • Many other metrics have been suggested (especially
    in the UK), including
  • Number of users (from different communities)
  • Training quality
  • Maintenance and reliability (already measured)
  • Upgrade time, etc.

(UK sites only; the target was 3 weeks.)
22
Demos
http://www.hep.ph.ic.ac.uk/e-science/projects/demo/index.html
23
How will we know if we are going to get there?
  • There is an ongoing set of service challenges
  • Each Service Challenge grows in complexity,
    approaching the full production service
  • Currently we are between SC2 and SC3.
  • SC2 involved only the T0 and T1s.
  • SC3 will involve 5 T2s as well, and SC4 will
    involve all T2 sites.

24
Service Challenge 2
  • Goal for throughput was a >600 MB/s daily average
    for 10 days; this was achieved from midday 23rd
    March to midday 2nd April (rough total volume
    estimated in the sketch below)
  • Not without outages, but the system showed it
    could recover the rate again after outages
  • Load reasonably evenly divided over sites (given
    the network bandwidth constraints of Tier-1 sites)
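
Assuming the 600 MB/s daily-average rate held for the full 10 days (an
idealisation), the implied total volume moved is roughly:

# Rough total volume implied by the SC2 target rate (assumes the
# 600 MB/s daily average throughout the 10-day window).
MB = 1e6
rate_bytes_per_s = 600 * MB
duration_s = 10 * 24 * 3600

total_bytes = rate_bytes_per_s * duration_s
print(f"~{total_bytes / 1e15:.2f} PB (~{total_bytes / 1e12:.0f} TB) transferred")
# -> ~0.52 PB (~518 TB) transferred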

25
Service Challenge 3 and beyond
26
Testing and deployment
  • Multi-stage release
  • New components are first tested on the testing
    testbed, with rapid feedback to developers. This
    testing is to include performance/scalability
    testing. Currently, this is only at 4 (5) sites:
    CERN, NIKHEF, RAL, Imperial (two installations)
  • Pre-production testbed
  • Releases onto production every 3 months

27
Conclusions
  • Very hard deadline by which this must work
  • We are monitoring as much as we can to try to
    understand where our current failures come from.
  • We have a release process that hopefully will
    improve performance of future releases

28
http://goc.grid-support.ac.uk/gridsite/monitoring/
http://goc.grid.sinica.edu.tw/gstat/
http://gridice2.cnaf.infn.it:50080/gridice/site/site.php
http://www.hep.ph.ic.ac.uk/e-science/projects/demo/index.html