1 Grid Computing: a Project for Coimbra in ATLAS Physics and beyond
Helmut Wolters, Coimbra, 26/5/2006
2 Overview
- What is the GRID?
- Role of LIP in the GRID
- Advanced computing in Coimbra at the Center for Computational Physics (CFC): Centopeia, a joint venture of LIP and CFC
- Building the Tier2 node at LIP Coimbra/Lisbon
3 What is the GRID?
- So you have the Internet
- it links all computers into a common network (if you want) - what more do you need?
- The GRID is much more than that!
- it offers completely new perspectives
- and it is not easy to implement
4 What is the GRID?
- GRID computing is a recent concept which takes distributed computing a step further
- The name GRID was chosen by analogy with the electric power grid
- transparent plug-in to obtain computing power without worrying where it comes from
- permanent and available everywhere
- pay per use
- The World Wide Web provides seamless access to information stored in many millions of different geographical locations.
- In contrast, the GRID is a new computing infrastructure which provides seamless access to computing power and data storage distributed all over the globe.
5 Motivation to build a GRID
- Single institutions are no longer able to support the computing power and storage capacity needed for modern scientific research.
- Compute-intensive sciences which are presently driving GRID development:
- Physics/Astronomy: data from different kinds of research instruments
- Medical/Healthcare: imaging, diagnosis and treatment
- Bioinformatics: study of the human genome and proteome to understand genetic diseases
- Nanotechnology: design of new materials from the molecular scale
- Engineering: design optimization, simulation, failure analysis, and remote instrument access and control
- Natural Resources and the Environment: weather forecasting, earth observation, modeling and prediction of complex systems, river floods and earthquake simulation
6 GRID vs. Distributed Computing
- Distributed infrastructures already exist, but
- they normally tend to be specialized, local systems
- intended for a single purpose or user group
- restricted to a limited number of users
- do not allow coherent interactions with resources from other institutions.
- The GRID goes further and takes into account:
- Different kinds of resources
- not always the same hardware, data, applications and admin. policies
- Different kinds of interactions
- user groups or applications want to interact with Grids in different ways
- access to computing power / storage capacity across different administrative domains by an unlimited set of non-local users
- Dynamic nature
- resources added/removed/changed frequently
- world-wide dimension
7 The GRID metaphor
- The interaction between heterogeneous resources (owned by geographically spread organizations), applications and users is only possible through the use of a specialized layer of software called middleware.
- The middleware hides all the technical details of the infrastructure, allowing transparent and uniform interaction with the grid.
8 How is LIP involved?
- LIP is involved in:
- LHC Computing GRID (LCG)
- the biggest worldwide GRID infrastructure
- will be used in the analysis of the data produced by the LHC accelerator built at CERN
- Enabling GRIDs for E-Science in Europe (EGEE)
- a European GRID infrastructure being built for multi-disciplinary sciences
- Iniciativa Nacional GRID (national GRID initiative)
- Portuguese Government
9 The LCG project
- The LHC will be the world's most powerful particle accelerator (2007)
- It accelerates two beams of particles (protons or Pb ions) in opposite directions around a 27-km tunnel, at velocities close to the speed of light
- These beams are smashed against each other, producing new particles which will be detected by the 4 LHC experiments (CMS, ATLAS, ALICE and LHCb).
- Such collisions are expected to produce states of matter which may have existed in the very first instants of the Universe.
10 The LCG project
- The simulation and reconstruction of a full (central) PbPb collision at the LHC (ALICE, about 84000 primary tracks!) takes 12 hours on a top PC and produces more than 2 GB of output.
11 The LCG project
- pp in ATLAS: still 1000 s and 5 MB for a ttbar production event on a Pentium IV 2.6 GHz
- global estimate: 140 CPUs and up to 40 TB of storage
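The quoted 140 CPUs / 40 TB can be checked with a back-of-envelope sketch. Only the per-event cost (1000 s, 5 MB) is from the slide; the production targets below are hypothetical round numbers chosen to illustrate how such an estimate is built.

```python
import math

sec_per_event = 1000      # ~1000 s per ttbar event on a Pentium IV 2.6 GHz (from the slide)
mb_per_event = 5          # ~5 MB of output per event (from the slide)

target_events_per_day = 12_000   # assumed production target, not from the slide
cpus_needed = math.ceil(target_events_per_day * sec_per_event / 86_400)

total_events = 8_000_000         # assumed total sample size, not from the slide
storage_tb = total_events * mb_per_event / 1_000_000

# With these assumptions: 139 CPUs and 40.0 TB, close to the quoted 140 CPUs / 40 TB
print(cpus_needed, storage_tb)
```

The point is that a CPU farm of this size saturates quickly: one day of 140 CPUs buys only about twelve thousand fully simulated events.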
12 The LCG project
- LHC Computing GRID
- LCG aims to build and maintain a data storage and analysis infrastructure for the large LHC physics community.
- The LHC experiments are expected to produce 10 Petabytes of experimental data annually:
- 10^9 collisions/second (1 GHz) → 100 Hz (after filtering)
- 1 collision ≈ 1 MB of data
- 100 MB/s → 10 PB per year (a 20 km stack of CDs)
- Computing power: 100,000 of today's fastest PCs for analysis and simulation
- needed to be available during the 15-year lifetime of the LHC machine
- fully accessible to about 5000 scientists from more than 500 institutes around the world.
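The data-rate chain above is simple arithmetic. In the sketch below the accelerator live time per year is an assumption (the slide does not state it), and the quoted 10 PB/year presumably sums all four experiments plus derived and simulated data, so it is larger than the single raw stream computed here.

```python
collisions_per_s = 1e9          # raw interaction rate (1 GHz)
filtered_hz = 100               # events kept after the trigger/filter
mb_per_event = 1.0              # ~1 MB per recorded collision

raw_mb_per_s = filtered_hz * mb_per_event       # 100 MB/s to storage
rejection = collisions_per_s / filtered_hz      # only 1 in 10 million collisions is kept

live_seconds = 1e7              # assumed live time per year, not from the slide
pb_per_year = raw_mb_per_s * live_seconds / 1e9 # ~1 PB of raw data per experiment

print(raw_mb_per_s, rejection, pb_per_year)
```

The striking number is the rejection factor: the online filter throws away 10 million collisions for every one that reaches tape.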
13 EGEE project
- EGEE is a European Union project whose fundamental goal is the deployment of a secure, reliable and robust grid for all fields of science.
- EGEE is built on top of the infrastructures developed for LCG
- it integrates national/regional grids
- strong requirements for middleware development, to provide the proper tools to science fields such as High-Energy Physics and Bio-medical applications
- very strong commitment to the dissemination of grid technologies
- extensive support and training tutorials for both users and administrators.
14 LIP and the EGEE organization
- EGEE South-West federation (SWE)
- includes LIP and several Spanish sites
- responsible for the operation of essential core services
- responsible for the monitoring of grid resources
- receives, responds to and coordinates GRID operation problems
- Regional Operations Centre (ROC)
- coordinates the EGEE federations
- shared among the different SWE institutes
- certifies that a site fulfills all requirements to join the production infrastructure
- negotiates service level agreements (SLAs)
- The EGEE South-West federation offers:
- 8 Resource Brokers and 8 top-level BDII machines as production core services
- Local sites deploy:
- 738.3 CPUs (value normalized to 1000 SpecInt2000)
- 7.7 TB of online storage, 2.9 PB of nearline storage (tape backend)
- shared by more than 20 virtual organizations.
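The fractional CPU count (738.3) comes from benchmark normalization: each physical CPU is weighted by its SpecInt2000 rating relative to a reference machine rated at 1000. A minimal sketch, with an entirely hypothetical inventory (the ratings below are illustrative, not LIP's actual benchmark figures):

```python
# (number of CPUs, SpecInt2000 rating per CPU) - hypothetical values
inventory = {
    "Xeon 2.8 GHz": (50, 1100),
    "PIII 933 MHz": (40, 400),
}

# Each CPU contributes rating/1000 "reference CPUs" to the normalized total.
normalized = sum(n * rating / 1000.0 for n, rating in inventory.values())
print(normalized)  # 71.0 reference CPUs for 90 physical ones
```

This is why the federation total is not an integer: slow and fast machines are added in fractional reference units rather than counted head by head.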
15 LCG tier model

[Figure: the LCG tier hierarchy. Tier 0: the accelerator centre (CERN). Tier 1 centres: RAL, IN2P3, FNAL, TRIUMF, CNAF, FZK, Taipei, BNL, PIC, NIKHEF, Nordic. Tier 2 centres include IC, IFCA, UB, LIP, Budapest, Prague, Legnaro, CSCS, IFIC, Rome, CIEMAT, MSU, USC, Krakow, ICEPP, Cambridge. Tier 3: small centres. Tier 4: desktops and portables.]

- Tier-0: the accelerator centre
- filter raw data → reconstruction → event summary data (ESD)
- record the master copy of raw data and ESD
- Tier-1
- managed mass storage: permanent storage of raw data, ESD, calibration data, meta-data, analysis data and databases (tape)
- data-heavy (ESD-based) analysis
- re-processing of raw data
- national and regional support
- online to the data acquisition process: high availability, long-term commitment
- Tier-2
- well-managed, grid-enabled disk storage
- end-user analysis, batch and interactive
- simulation
16 Advanced computing in Coimbra
CFC - Center for Computational Physics: Centopeia
17 Centopeia
- System for parallelized computing
- 108 "legs": Intel Pentium 4
- 12 x 2.2 GHz
- 24 x 2.4 GHz
- 60 x 2.8 GHz
- 12 x 3.0 GHz
- ⇒ GRID-like technical infrastructure
- UPS 23 kVA
- Air conditioning
18 Centopeia - how is it used?
- It is open to external users
- 30 from the University of Coimbra (18 from CFC, the rest from other Centers/Departments)
- 25 from outside UC (Porto, IST, Aveiro, Évora, Braga, etc.)
- Some topics:
- Dynamic processes in molecules: TDDFT (Time-Dependent Density Functional Theory)
- Lattice QCD (gluon propagator)
- Protein unfolding (NAMD code)
- Optimization of molecular structures (PSM-MGP)
- Monte Carlo simulations of various problems
19 Centopeia - how is it used?
- Is this a GRID?
- Of course not: the system is open to everyone, but it is more like a single parallel supercomputer
- Can it become part of the GRID?
- Not easily, because it is a uniform parallelized system
- gridification takes away many optimizations of parallel processing
20 Where is CFC going?
- New system (coming soon)
- 130 Sun Fire X4100, each with 8 GB RAM and 2 x AMD Opteron dual-core 2.2 GHz: 520 CPU cores, 1.5 Teraflops
- 2 x Gigabit network (data/administration)
- 6 TB central storage
- SuSE Linux Enterprise Server 9
- Sun GRID Engine
- Gigabit link to the University Computing Center
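The 520-core figure follows directly from the node count, and the quoted 1.5 Teraflops can be put in context against a theoretical peak. The flops-per-cycle factor below is an assumption (a common figure for double-precision SSE2 on Opteron), not a vendor specification; the slide's 1.5 Tflops is plausibly a sustained rather than peak number.

```python
nodes = 130
sockets_per_node = 2     # 2 x AMD Opteron per Sun Fire X4100
cores_per_socket = 2     # dual core
clock_ghz = 2.2

total_cores = nodes * sockets_per_node * cores_per_socket   # 520 cores

# Assumed 2 double-precision flops per cycle per core:
peak_tflops = total_cores * clock_ghz * 2 / 1000

print(total_cores, round(peak_tflops, 2))  # 520 cores, ~2.29 Tflops theoretical peak
```

Sustained throughput around 1.5 Tflops against a ~2.3 Tflops peak would correspond to roughly 65% efficiency, a typical ratio for a Linpack-style benchmark on a cluster of this era.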
- Challenges and necessities
- strengthen links with the HPC community in Coimbra and Portugal, including enterprises
- integration in GRID projects, in Coimbra and nationwide
- large bandwidth allocation by RCCN (the public scientific and educational network) (dedicated lines?)
- So there are challenges in common with the LHC Grid!
21 Why should we join efforts?
- LIP needs much more computing power for implementing the GRID Tier2 node
- we need computing power and storage for simulation, reconstruction and data analysis in ATLAS
- LIP Lisbon is very restricted due to physical space limitations
- Basic infrastructure is a very expensive part of computing installations at this level
- we can share the space
- we can share the network: not really, but we can join efforts
- even if the computing part were kept aside
22 How do we join efforts?
- joint venture
- gridify Centopeia once the new cluster is up and running
- install an LHC-compatible GRID structure
- we have 108 Pentium IV CPUs
- 200 Gigaflops
- this is quite a good start
- Share efforts (funding requests, ...) in:
- infrastructure
- manpower
- computing power (?)
23 How can we get closer?
- Share efforts in computing power (?)
- LHC experiments will need a lot of computing power
- convince the CFC people to gridify their new cluster in the LHC sense?
- So we could become clients
24 How can we get closer?
- No way!
- LHC simulation and analysis software is not optimized for parallel computing
- in fact, it depends on a very CERN-like software environment (Scientific Linux, libraries, ...)
- our simulation software only runs on Scientific Linux 3.0.6
- it has to be gcc 3.2.3
- if we adapt the system to LCG needs, the system loses the best part of its capabilities
- you probably won't even get SLC 3 to run on this architecture easily
25 How can we get closer?
- the CFC philosophy is "adapt your software to the system": a parallel supercomputer
- the LHC Grid philosophy is "adapt the system to the software's needs", at least for now
- LHC simulation and analysis software is a very complex system of libraries and tools that is not easy to adapt
- results are needed now, or soon
26 How can we get closer?
- maybe we can meet somewhere in a common GRID, but not yet
- challenge: create parallelized analysis/simulation tools?
- CERN software must become more CERN-infrastructure independent. This is a long-term process.
- Collaboration with CFC is a very interesting opportunity
27 Maybe we can share something?
- sometimes it could run as a GRID and sometimes as a parallel computer
- this is nothing completely strange in LCG
- Example: ATLAS TDAQ Event Filter (online!), Event Handler Requirements Use Cases - UC010:
- "The hardware of the EF can be shared with the offline computing, in particular with the Tier 0 of the LHC Computing GRID model. During data taking periods, a part of the Tier 0 machines, possibly variable in size with time, can be used by the EF. Conversely, the machines normally devoted only to the EF can be used for offline work when ATLAS is not running."
28 Maybe we can share something?
- probably we cannot use the same operating system
- We could try:
- dual boot: very confusing to configure and maintain
- chroot: a virtual environment contained within the host operating system
- XEN: emulate a virtual machine on top of the host operating system
- But these are virtual hypotheses: they can be tried once we have the systems up and running in production mode
29 LHC Computing GRID (LCG)
- The largest GRID infrastructure in operation (data from 14/5/2006)
30 LIP Tier2 - Site 1
LIP Lisboa
31 LIP Tier2 - Site 1
- CPUs for GRID
- 6 dual Xeon 2.8 GHz
- 9 dual PIII 933 MHz
- 20 single-CPU (PIV and PIII)
- partly for GRID
- 17 AMD 2.2 GHz (2 dual-core CPUs)
- 18 AMD 2.2 GHz (2 single-core CPUs)
- 8 TB disk storage
- future plans
- 2006: 150 CPUs
- 2007: 300 CPUs
- 2008: 450 CPUs
32 LIP Tier2
[Figure: map of Portugal showing the two sites, Coimbra and Lisboa, 200 km apart]
33 LIP Tier2 - Site 2
LIP Coimbra
34 Gridification of Centopeia
[Figure: CFC + LIP → Gridification]
35 Gridification of Centopeia - initial phase
- 12 "legs" taken out of Centopeia for testing
- installing SLC 3.0.6
- the stock kernel does not recognize the disk interface
- kernel upgrade from 2.4.x to 2.6.x
- implementing a minimum GRID environment:
- Computing Element (CE)
- Storage Element (SE)
- Monitoring Box (MonBox)
- User Interface (UI)
- worker nodes
- gaining experience
- join in all 108 nodes as soon as they become available
36 GRIDs - still an unfinished work
- Many key concepts identified, known and tested
- Major efforts now on establishing:
- standards - a slow process (Global Grid Forum, http://www.gridforum.org/)
- production Grids for multiple Virtual Organisations
- production: reliable, sustainable quality of service
- in Europe: EGEE
- in the US: TeraGrid
- one stack of middleware that serves many research (and other!) communities
- operational procedures and services (people, policy, ...)
- new user communities
- Research and development continues
37 The final idea
- GRID technologies are already deployed at more than 200 EGEE sites worldwide
- a strong indication that the scientific community is benefiting from such a computing infrastructure
- GRIDs will become more powerful, sophisticated and user-friendly
38 Until the final idea is realized
- we have to do simulations and analysis
- we use the LCG
- we are part of the LCG
- we collaborate in developing the LCG
- we should also study how the CFC supercomputer can be useful for us (simulation, analysis)
- this could help us on the way to a common GRID
- it gives access to much more computing infrastructure in the future - in Coimbra!
39 Local motivation
- Portugal is somewhat peripheral in Europe
- Coimbra is somewhat peripheral in Portugal
40 Local motivation
- Coimbra, "cidade do conhecimento" (city of knowledge)
- we will have to do something in order to stay in the game as a global player, not just as an educated spectator with 700 years of great history and knowledge
- be in that game:
- top physics (LIP, CFC, and more)
- top ICT infrastructure (LIP, CFC)
- much more (but that is probably not for us)
- CERN: where the Web was born
- visibility also matters

GRID