Title: The ARDA project: Grid analysis prototypes of the LHC experiments (Massimo Lamanna, ARDA Project Leader)
1. The ARDA project: Grid analysis prototypes of the LHC experiments
Massimo Lamanna, ARDA Project Leader
Massimo.Lamanna@cern.ch
DESY, 10 May 2004
http://cern.ch/arda
www.eu-egee.org
cern.ch/lcg
EGEE is a project funded by the European Union
under contract IST-2003-508833
2. Contents
- A bit of history
- LHC experiments and the LCG project
- EGEE project
- ARDA Project
- Mandate and organisation
- ARDA activities during 2004
- Now
- Second half of 2004
- Conclusions and Outlook
3. LHC Experiments
ATLAS, CMS, ALICE, LHCb
- Storage: raw recording rate 0.1-1 GByte/s; accumulating 5-8 PetaByte/year; 10 PetaByte of disk
- Processing: 200,000 of today's fastest PCs
4. Multi-Tiered View of LHC Computing
5. The LHC Computing Grid Project
- Prepare and deploy the computing environment for the LHC experiments
- Common applications, tools, frameworks and environments
- Move from testbed systems to real production services
- Experiments need a dependable system
- Operated and supported 24x7 globally
- Computing fabrics run as production physics services
- The computing environment must be robust, stable, predictable and supportable
- Foster collaboration and coherence of the LHC computing centres
- LCG is not a grid technology R&D project
- Enable physics data analysis and distributed collaboration on a new scale
6. The LHC Computing Grid Project: Phase 1 and Phase 2
- Phase 1: 2002-05
- Development and prototyping
- Approved by CERN Council, 20 September 2001
- Phase 2: 2006-08
- Installation and operation of the full world-wide initial production Grid
- Exploiting the Phase 1 experience
- Costs (materials and staff) included in the LHC cost-to-completion estimates
7. The LCG Phase 1 Goals
- Prepare the LHC computing environment
- Provide the common tools and infrastructure for the physics application software
- Establish the technology for fabric, network and grid management
- Operate a series of data challenges for the experiments
- Build a solid collaboration and a fertile exchange of experience within the community of the centres contributing to the LCG
- Validate the technology and models by building progressively more complex Grid prototypes
- Develop models for building the Phase 2 Grid
- Maintain reasonable opportunities for the re-use of the results of the project in other fields
- Deploy a 50% model production Grid including the committed LHC Regional Centres
- 50% of the complexity of one of the LHC experiments
- Produce a Technical Design Report for the full LHC Computing Grid to be built in Phase 2 of the project
8. Too early?
- First collisions in Spring 2007
- 1 year to procure, install, and test the full LHC computing fabrics
- Infrastructure work (like civil engineering) already started
- The Computing TDR must be ready in mid-2005
- At least 1 year of experience in operating a production grid is needed to validate the computing model
- The experiments' data challenges should run within LCG in 2004
- With a reasonable level of production service
- How do we evolve the present services (LCG-2) into the final system?
9. The EGEE project
- Create a Europe-wide production-quality Grid infrastructure for multiple sciences
- Profit from current and planned national and regional Grid programmes, building on:
- the results of existing projects such as DataGrid (EDG), LCG and others
- the EU Research Network and industrial Grid developers
- Support Grid computing needs common to the different communities
- Integrate the computing infrastructures and agree on common access policies
- Exploit international connections (US and Asia-Pacific)
- Provide interoperability with other major Grid initiatives such as the US NSF Cyberinfrastructure, establishing a worldwide Grid infrastructure
- Leverage national resources in a more effective way
- 70 leading institutions in 27 countries (including Russia and the US)
10. EGEE Scope
- The project started April 2004
- The first phase will last 2 years with EU funding of €32M
- Possibility of a 2nd phase if successful
- EGEE scope: all-inclusive for academic applications
- Open to the industrial and socio-economic world as well
- Industrial participation both as potential end-users and as IT technology and service suppliers
- EGEE organises an Industry Forum to keep industrial and commercial parties in close contact
- Services developed in 2004-5 may be tendered to industry in the second phase (2006-7)
- The major success criterion of EGEE: how many satisfied users from how many different domains?
- 5000 users from at least 5 disciplines
- 2 pilot application domains: Physics and Bioinformatics
11. EGEE and LCG
- Strong links already established between EDG and LCG; this approach will continue in the scope of EGEE
- The core infrastructure of the LCG and EGEE grids will be operated as a single service, and will grow out of the LCG service
- LCG includes US and Asia
- EGEE includes other sciences
- A substantial part of the infrastructure is common to both
- Parallel production lines:
- LCG-2
- 2004 data challenges
- Pre-production prototype
- EGEE MW
- ARDA playground
12. ARDA working group recommendations
- New service decomposition
- Strong influence of the AliEn system
- the Grid system developed by the ALICE experiment and used by a wide scientific community (not only HEP)
- Role of experience and existing technology
- Web service framework
- Interfacing to existing middleware to enable its use in the experiment frameworks
- Early deployment of (a series of) prototypes to ensure functionality and coherence
[Diagram: EGEE Middleware / ARDA project]
13. Web Services
"The term Web services describes a standardized way of integrating Web-based applications using the XML, SOAP, WSDL and UDDI open standards over an Internet protocol backbone. XML is used to tag the data, SOAP is used to transfer the data, WSDL is used for describing the services available and UDDI is used for listing what services are available. Used primarily as a means for businesses to communicate with each other and with clients, Web services allow organizations to communicate data without intimate knowledge of each other's IT systems behind the firewall.
Unlike traditional client/server models, such as a Web server/Web page system, Web services do not provide the user with a GUI. Web services instead share business logic, data and processes through a programmatic interface across a network. The applications interface, not the users. Developers can then add the Web service to a GUI (such as a Web page or an executable program) to offer specific functionality to users.
Web services allow different applications from different sources to communicate with each other without time-consuming custom coding, and because all communication is in XML, Web services are not tied to any one operating system or programming language. For example, Java can talk with Perl; Windows applications can talk with UNIX applications.
N.B. Web services do not require the use of browsers or HTML."
From http://www.webopedia.com
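The quoted definition above can be made concrete with a minimal sketch: a client calls a remote function through a programmatic XML-over-HTTP interface, with no knowledge of the server's internals. Python's built-in XML-RPC is used here as a lightweight stand-in for SOAP/WSDL; the service function and catalogue contents are invented for illustration.

```python
# Minimal web-service sketch: the client sees only the programmatic
# interface (list_datasets), not the server's implementation.
import threading
from xmlrpc.server import SimpleXMLRPCServer
from xmlrpc.client import ServerProxy

def list_datasets(experiment):
    # Hypothetical lookup; a real service would query a database.
    catalogue = {"LHCb": ["DC04-sim", "DC04-reco"], "ALICE": ["PDC04"]}
    return catalogue.get(experiment, [])

# Serve on an ephemeral port in a background thread.
server = SimpleXMLRPCServer(("localhost", 0), logRequests=False)
server.register_function(list_datasets)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

# The client and server exchange XML over HTTP behind the scenes.
proxy = ServerProxy(f"http://localhost:{port}")
result = proxy.list_datasets("LHCb")
print(result)  # ['DC04-sim', 'DC04-reco']
server.shutdown()
```

The same decoupling is what lets, say, a Java server talk to a Perl or Python client, as the quote notes: only the XML wire format is shared.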
14. End-to-end prototypes: why?
- Provide fast feedback to the EGEE MW development team
- Avoid uncoordinated evolution of the middleware
- Coherence between users' expectations and the final product
- Experiments ready to benefit from the new MW as soon as possible
- Frequent snapshots of the middleware available
- Expose the experiments (and the community in charge of the deployment) to the current evolution of the whole system
- Experiment systems are very complex and still evolving
- Move forward towards new-generation real systems (analysis!)
- Prototypes should be exercised with realistic workloads and conditions
- No academic exercises or synthetic demonstrations
- LHC experiment users are absolutely required here!
- A lot of work (and useful software) goes into the current experiments' data challenges; this will be used as a starting point
- Adapt/complete/refactor the existing systems: we do not need another system!
15. E2E Prototypes: implementation
- Every experiment already has at least one system
- Analysis/Production typically distinct entities
- Using a variety of back-ends (batch systems, different grid systems)
- ARDA will put its effort on the experiment (sub)system the experiment chooses
- EGEE MW as foundation layer
- Multi-grid interfaces are outside our scope
- Experiments do know how to deal with this
- By default, we expect 4 systems
- There is nothing like an "ARDA prototype"
- Adapt/complete/refactor the existing (sub)system!
- Collaborative effort (not a parallel development)
- Commonality is not ruled out, but it should emerge and become attractive for the experiments; in any case it will not be imposed from outside
- Users, users, users!!!
- First important checkpoint: December 2004
16. Experiment End-to-End Prototypes
- The initial prototype will have a reduced scope
- Component selection for the first prototype
- Experiment components not in use for the first prototype are not ruled out (and used/selected ones might be replaced later on)
- Not all use cases/operation modes will be supported
- Attract and involve users
- Many users are absolutely required
- The Use Cases are still being defined
- Example:
- A physicist selects a data sample (from current Data Challenges)
- With an example/template as starting point, (s)he prepares a job to scan the data
- The job is split into sub-jobs and dispatched to the Grid; some error recovery is performed automatically; the results are merged back into a single output
- The output (histograms, ntuples) is returned together with simple information on the job-end status
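The split/dispatch/recover/merge flow in the example above can be sketched in a few lines; everything here (the dataset, the scan function, the retry policy) is an invented stand-in for the experiment's real analysis job and Grid submission machinery.

```python
# Sketch of the analysis use case: split a job over a data sample into
# sub-jobs, retry failures, and merge the partial outputs.
DATASET = list(range(100))  # stand-in for a selected event sample

def scan(chunk):
    # Stand-in analysis sub-job; in reality this would run the
    # experiment's framework on a Grid worker node.
    return sum(chunk)

def run_with_retry(chunk, retries=1):
    # Simple automatic error recovery: re-run a failed sub-job.
    for attempt in range(retries + 1):
        try:
            return scan(chunk)
        except RuntimeError:
            if attempt == retries:
                raise

def split(data, n_subjobs):
    # Divide the sample into roughly equal chunks, one per sub-job.
    size = (len(data) + n_subjobs - 1) // n_subjobs
    return [data[i:i + size] for i in range(0, len(data), size)]

partials = [run_with_retry(c) for c in split(DATASET, 10)]
merged = sum(partials)  # merge back into a single output
print(merged)           # 4950
```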
17. E2E Prototypes
Experiment software
- Each experiment chooses the starting point (1 system)
- Subset of the existing system
- Emphasis on analysis
- EGEE MW as foundation layer
- There is nothing like an "ARDA prototype"!
- Adapt/complete/refactor the existing one together with the experiment teams
- The initial prototype will have a reduced scope
- Just the most sensible starting point
[Diagram: experiment-specific middleware on top of an EGEE Middleware Interface Layer and other systems in use (LCG-2, G2003, NorduGrid, LSF, PBS, ...); generic middleware services below: FileCatalog, CE, Workload, SE]
18. ARDA Project: current set-up
- LCG
- Project leader (Massimo Lamanna/CERN)
- 4 LCG staff (100% at CERN) matching the 4 EGEE staff
- 1 more staff member from LCG (100% at CERN)
- About 4 FTEs from other sources (20% at CERN)
- EGEE
- 4 NA4 staff (100% at CERN)
- Experiments
- 4 experiment interfaces
- Represent the experiments in project definition, implementation and evaluation
- Identify and coordinate the experiment contributions
- Analysis groups in the experiments with whom the middleware people can work to specify the services and validate the implementations
- Upper middleware teams (experiment-specific MW)
[Diagram: strong links with the experiment teams, the regional centres, the users and the experiment systems]
19. People
- Massimo Lamanna
- Birger Koblitz
- Dietrich Liko
- Frederik Orellana
- Derek Feichtinger
- Andreas Peters
- Julia Andreeva
- Juha Herrala
- Andrew Maier
- Kuba Moscicki
Russia:
- Andrey Demichev
- Viktor Pose
Taiwan:
- Wei-Long Ueng
- Tao-Sheng Chen
Experiment interfaces: Piergiorgio Cerello (ALICE), David Adams (ATLAS), Lucia Silvestris (CMS), Ulrik Egede (LHCb)
20. ARDA @ Regional Centres
- Deployability is a key factor of MW success
- A few Regional Centres will have the responsibility to provide early installations for ARDA
- Understand deployability issues
- Extend the ARDA test bed
- The ARDA test bed will be the next step after the most complex EGEE middleware test bed
- Stress and performance tests could ideally be located outside CERN
- This is for experiment-specific components (e.g. a metadata catalogue)
- Leverage the Regional Centres' local know-how:
- Database technologies
- Web services
- ...
- Pilot sites might enlarge the resources available and give fundamental feedback in terms of deployability, complementing the EGEE SA1 activity (EGEE/LCG operations)
21. Coordination and forum activities
- The coordination activities would flow naturally from the fact that ARDA will be open to provide demonstration benches
- Since it is neither necessary nor possible that all projects be hosted inside the ARDA experiment prototypes, some coordination is needed to ensure that new technologies can be exposed to the relevant community
- Transparent process
- ARDA should organise a set of regular meetings (one per quarter?) to discuss results, problems and new/alternative solutions, and possibly agree on a coherent programme of work
- The ARDA project leader organises this activity, which will be truly distributed and led by the active partners
- Special relation with the LCG GAG
- The LCG forum for Grid requirements and use cases
- Its experiment representatives coincide with the EGEE NA4 experiment representatives
- ARDA will channel this information to the appropriate recipients
- ARDA workshop (January 2004 at CERN; open; over 150 participants)
- ARDA workshop (June 21-23 at CERN; by invitation): "The first 30 days of EGEE middleware"
- ARDA workshop (September 2004?; open)
22. Coordination and forum activities
[Diagram: the ALICE, ATLAS, CMS and LHCb Distributed Analysis efforts feed experience and use cases into ARDA (collaboration, coordination, integration, specification, priorities, planning), alongside EGEE NA4 (application identification and support), GAE, PROOF, LCG-GAG (Grid Application Group), SEAL, POOL, the EGEE middleware and the resource-providers community]
23. Plans and activity within the experiments
- General pattern
- Planning
- Example
- LHCb
- CMS
- ATLAS
- ALICE
24. Example of activity
- Existing system as starting point
- Every experiment has different implementations of the standard services
- Used mainly in production environments
- Few expert users
- Coordinated update and read actions
- ARDA:
- Interface with the EGEE middleware
- Verify (and help to evolve) such components for analysis environments
- Many users
- Robustness
- Concurrent read actions
- Performance
- One prototype per experiment
- A Common Application Layer might emerge in future
- ARDA's emphasis is to enable each experiment to do its job
25. LHCb
- The LHCb system within ARDA uses GANGA as its principal component
- The LHCb/GANGA plan, to enable physicists to use GANGA to analyse the data being produced during 2004 for their studies, naturally matches the ARDA mandate
- At the beginning, the emphasis will be on validating the tool, focusing on usability and on validation of the splitting and merging functionality for user jobs
- The DIRAC system (the LHCb grid system, used mainly in production so far) could be a useful playground to understand the detailed behaviour of some components, like the file catalogue
26. GANGA: Gaudi/Athena aNd Grid Alliance
- Gaudi/Athena: the LHCb/ATLAS frameworks
- Athena uses Gaudi as a foundation
- A single desktop for a variety of tasks
- Helps users configure and submit analysis jobs
- Keeps track of what they have done, hiding all technicalities completely
- Resource Broker, LSF, PBS, DIRAC, Condor
- Job registry stored locally or in the roaming profile
- Automates the config/submit/monitor procedures
- Provides a palette of possible choices and specialized plug-ins (pre-defined application configurations, batch/grid systems, etc.)
- A friendly user interface (CLI/GUI) is essential
- GUI wizard interface
- Helps users to explore new capabilities
- Browse the job registry
[Diagram: the GANGA GUI/UI sits on top of collective and resource Grid services (Bookkeeping Service, WorkLoad Manager, Profile Service, Monitor), driving an instrumented GAUDI program that uses the file catalog, SE and CE; JobOptions and algorithms go in, histograms, monitoring information and results come back]
27. ARDA contribution to Ganga
- Integration with EGEE middleware
- While waiting for the EGEE middleware, we developed an interface to Condor
- Use of Condor DAGMan for splitting/merging and error-recovery capability
- Design and development
- Command Line Interface
- Future evolution of Ganga
- Release management
- Software process and integration
- Testing, tagging policies, etc.
- Infrastructure
- Installation, packaging, etc.
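The DAGMan-based splitting and merging mentioned above can be pictured as a small DAG description. All file and node names below are invented for illustration, but JOB, PARENT/CHILD and RETRY are actual Condor DAGMan directives, RETRY providing the automatic error recovery:

```
# ganga-split.dag : illustrative Condor DAGMan sketch (all names invented)
# Split the user job, run the sub-jobs, then merge their outputs.
JOB  Split   split.sub
JOB  Sub1    subjob.sub
JOB  Sub2    subjob.sub
JOB  Merge   merge.sub
PARENT Split CHILD Sub1 Sub2
PARENT Sub1 Sub2 CHILD Merge
# Error recovery: re-run a failed sub-job up to twice before failing the DAG
RETRY Sub1 2
RETRY Sub2 2
```

DAGMan only starts Merge once both sub-jobs have succeeded, which is exactly the split/merge dependency structure a user analysis job needs.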
28. LHCb metadata catalogue
- Used in production (for large productions)
- Web Service layer being developed (main developers in the UK)
- Oracle back end
- ARDA contributes testing focused on the analysis usage:
- Robustness
- Performance under high concurrency (read mode)
Measured: network rate vs. number of concurrent clients
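The concurrency measurement described above can be sketched as follows: simulated clients issue read-only queries in parallel and the aggregate query rate is recorded. The real ARDA tests exercised the LHCb web-service layer over the network; `query_catalog` here is a local stub standing in for that remote call, and the catalogue contents are invented.

```python
# Sketch of a "network rate vs. number of concurrent clients" test.
import time
from concurrent.futures import ThreadPoolExecutor

CATALOGUE = {f"file{i:04d}": {"size_kb": i} for i in range(1000)}

def query_catalog(name):
    # Read-only lookup; a real client would make an XML-RPC/SOAP call.
    return CATALOGUE[name]

def measure(n_clients, queries_per_client=200):
    # Run n_clients workers, each issuing queries_per_client reads,
    # and return the aggregate rate in queries per second.
    names = list(CATALOGUE)[:queries_per_client]
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_clients) as pool:
        futures = [pool.submit(query_catalog, name)
                   for _ in range(n_clients) for name in names]
        results = [f.result() for f in futures]
    elapsed = time.perf_counter() - start
    return len(results) / elapsed

for n in (1, 5, 10):
    print(f"{n:2d} concurrent clients: {measure(n):10.0f} queries/s")
```

Plotting the measured rate against the client count reveals where the service saturates, which is the kind of curve the slide refers to.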
29. CERN/Taiwan tests
- Clone the Bookkeeping DB in Taiwan
- Install the WS layer
- Performance tests:
- Database I/O sensor
- Bookkeeping Server performance tests (Taiwan/CERN Bookkeeping Server DB)
- Web XML-RPC service performance tests: CPU load, network send/receive sensor, process time
- Client host performance tests: CPU load, network send/receive sensor, process time
[Diagram: a client at CERN drives virtual users against the Bookkeeping Servers (with Oracle DB back-ends) at CERN and in Taiwan, monitoring CPU load, network traffic and process time]
30. CMS
- The CMS system within ARDA is still under discussion
- This Wednesday: CMS session during the CMS software week
- It is already clear that the complex RefDB system (the heart of the recently finished data challenge DC04) will be one of the areas of collaboration between CMS and the corresponding ARDA team
- RefDB is the bookkeeping engine used to plan and steer the production across the different phases (simulation, reconstruction and, to some degree, the analysis phase). It contained all necessary information except physical file locations (RLS) and information related to the transfer management system (TMDB)
- Performance measurements are underway (similar philosophy as for the LHCb metadata catalogue measurements)
31. DC04 data flow at T0 (CERN)
[Diagram: RefDB sends reconstruction instructions to McRunjob, which submits reconstruction jobs to the T0 worker nodes; summaries of successful jobs update RefDB. Reconstructed data go to the GDB Castor pool and to tapes; the transfer agent checks what has arrived, updates RLS and TMDB, and moves the reconstructed data to the export buffers]
32. ATLAS
- The ATLAS system within ARDA has been agreed
- ATLAS has a complex strategy for distributed analysis, addressing different areas with specific projects (fast response, user-driven analysis, massive production, etc.; see http://www.usatlas.bnl.gov/ADA/)
- The starting point is the DIAL system
- The AMI metadata catalogue is a key component
- MySQL as a back end
- Genuine Web Service implementation
- Robustness and performance tests from ARDA
- In the start-up phase, ARDA provided some help in developing ATLAS production tools (now finishing)
33. What is DIAL?
34. ATLAS Metadata Catalogue (AMI)
- ATLAS metadata catalogue: contains file metadata
- Simulation/reconstruction version
- File content, event types
- Does not contain physical filenames
- SOAP proxy (in Java) as front-end to hierarchical databases (institute → collaboration)
- The proxy allows database schema evolution
- SOAP allows automatic code generation for the client (planned)
35. AMI studies in ARDA
- Studied behaviour using many concurrent clients
[Diagram: many users query the SOAP proxy, which fronts the metadata store (MySQL)]
- Many problems still open:
- Large network traffic overhead due to schema-independent tables
- The SOAP proxy is supposed to provide the DB properties
- Browsable results
- Note that Web Services are stateless (no automatic handles for the notion of session, transaction, etc.): 1 query = 1 (full) response
- Large queries crashed the server
- Should the proxy re-implement all the database functionality?
- Nice collaboration in place with ATLAS-Grenoble
36. ATLAS: ATCOM
- AtCom II: the planned successor of AtCom
- Graphical interactive tool to support production management in ATLAS
- Large-scale job definition, submission and progress monitoring
- Linked to several bookkeeping databases (AMI and Magda)
- Plug-ins for LSF, EDG and NorduGrid
37. ALICE
- The ALICE system within ARDA will be the evolution of the analysis system presented by ALICE at SuperComputing 2003 (SC2003)
- With the new EGEE middleware (at SC2003, AliEn was used)
- Some activity on the PROOF system
- Robustness
- Error recovery
38. AliEn system / Grid-enabled PROOF (SC2003 demo)
[Diagram: PROOF slaves at sites A, B and C connect through per-site TcpRouter services to the PROOF master server and the user session]
39. ALICE-ARDA prototype improvements
- SC2003:
- The setup was heavily tied to the middleware services
- Somewhat inflexible configuration
- No chance to use PROOF on federated grids like LCG in AliEn
- The TcpRouter service needs incoming connectivity at each site
- Libraries cannot be distributed using the standard rootd functionality
- Improvement ideas:
- Distribute another daemon with ROOT, which replaces the need for a TcpRouter service
- Connect each slave proofd/rootd via this daemon to two central proofd/rootd master multiplexer daemons, which run together with the PROOF master
- Use Grid functionality for daemon start-up and booking policies through a plug-in interface from ROOT
- Put PROOF/ROOT on top of the grid services
- Improve on dynamic configuration and error recovery
40. ALICE-ARDA improved system
[Diagram: proxy proofd and proxy rootd daemons, together with the Grid services and a booking service, sit between the master machine and the remote slaves]
- The remote PROOF slaves look like a local PROOF slave on the master machine
- The booking service is usable also on local clusters
41. Conclusions and Outlook
- ARDA is starting
- Main tool: experiment prototypes for analysis
- Detailed project plan being prepared
- Good feedback from the LHC experiments
- Good collaboration with EGEE NA4
- Good collaboration with Regional Centres
- We look forward to contributing to the success of EGEE
- Helping the EGEE middleware to deliver a fully functional solution
- ARDA main focus: collaborate with the LHC experiments to set up the end-to-end prototypes
- Aggressive schedule: the first milestone for the end-to-end prototypes is December 2004
42. Links
- LCG: http://cern.ch/lcg
- EGEE: www.eu-egee.org
- NA4 (Application Identification and Support): http://egee-na4.ct.infn.it/index.php
- NA4 HEP: http://egee-na4.ct.infn.it/hep/
- ARDA: http://cern.ch/arda
- GAG: http://project-lcg-gag.web.cern.ch/project-lcg-gag/