Title: The ARDA project: Grid analysis prototypes of the LHC experiments. Massimo Lamanna, ARDA Project Leader
1. The ARDA project: Grid analysis prototypes of the LHC experiments
Massimo Lamanna, ARDA Project Leader (Massimo.Lamanna@cern.ch)
RAL, 13 May 2004
http://cern.ch/arda
www.eu-egee.org
cern.ch/lcg
EGEE is a project funded by the European Union under contract IST-2003-508833
2. Contents
- ARDA Project
- Mandate and organisation
- ARDA activities during 2004
- General pattern
- LHCb
- CMS
- ATLAS
- ALICE
- Conclusions and Outlook
3. ARDA working group recommendations: our starting point
- New service decomposition
- Strong influence of the AliEn system
  - the Grid system developed by the ALICE experiment and used by a wide scientific community (not only HEP)
- Role of experience, existing technology
- Web service framework
- Interfacing to existing middleware to enable its use in the experiment frameworks
- Early deployment of (a series of) prototypes to ensure functionality and coherence
(Diagram: EGEE Middleware and the ARDA project)
4. EGEE and LCG
- Strong links already established between EDG and LCG; these will continue in the scope of EGEE
- The core infrastructure of the LCG and EGEE grids will be operated as a single service, and will grow out of the LCG service
  - LCG includes many US and Asia partners
  - EGEE includes other sciences
- A substantial part of the infrastructure is common to both
- Parallel production lines as well
  - LCG-2: the 2004 data challenges
  - Pre-production prototype: EGEE MW, the ARDA playground for the LHC experiments
5. End-to-end prototypes: why?
- Provide fast feedback to the EGEE MW development team
  - Avoid uncoordinated evolution of the middleware
  - Coherence between user expectations and the final product
- Experiments ready to benefit from the new MW as soon as possible
  - Frequent snapshots of the middleware available
- Expose the experiments (and the community in charge of the deployment) to the current evolution of the whole system
  - Experiment systems are very complex and still evolving
- Move forward towards new-generation real systems (analysis!)
  - Prototypes should be exercised with realistic workload and conditions
  - No academic exercises or synthetic demonstrations
  - LHC experiment users are absolutely required here (EGEE Pilot Application)
- A lot of work (experience and useful software) is involved in the current experiment data challenges
  - Concrete starting point
  - Adapt/complete/refactorise the existing systems: we do not need another system!
6. End-to-end prototypes: how?
- The initial prototype will have a reduced scope
  - Component selection for the first prototype
  - Experiment components not in use for the first prototype are not ruled out (and the used/selected ones might be replaced later on)
  - Not all use cases/operation modes will be supported
- Every experiment has a production system (with multiple backends, like PBS, LCG, G2003, NorduGrid, ...). We focus on end-user analysis on an EGEE MW based infrastructure
  - Adapt/complete/refactorise the existing experiment (sub)system!
- Collaborative effort (not a parallel development)
- Attract and involve users
  - Many users are absolutely required
- Informal use cases are still being defined; a typical flow (sketched in code after this list):
  - A physicist selects a data sample (from the current Data Challenges)
  - With an example/template as a starting point, (s)he prepares a job to scan the data
  - The job is split into sub-jobs and dispatched to the Grid; some error recovery is performed automatically, and the results are merged back into a single output
  - The output (histograms, ntuples) is returned together with simple information on the job-end status
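To make this use case concrete, below is a minimal Python sketch of the split/dispatch/recover/merge flow. Everything in it (scan_files, the simulated failure rate, the thread pool standing in for Grid dispatch) is illustrative and is not taken from any experiment framework.

import random
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

MAX_RETRIES = 2

def scan_files(file_block):
    """Toy 'analysis' of one sub-job: fill a histogram-like Counter."""
    if random.random() < 0.2:                      # simulate a transient Grid failure
        raise RuntimeError("sub-job failed")
    return Counter({"events": 100 * len(file_block)})

def run_subjob(file_block):
    """Run one sub-job with simple automatic error recovery (retries)."""
    for _ in range(MAX_RETRIES + 1):
        try:
            return scan_files(file_block), "ok"
        except RuntimeError:
            continue
    return Counter(), f"failed after {MAX_RETRIES + 1} attempts"

def split(sample, files_per_job=5):
    """Split the selected data sample into blocks of files (one block per sub-job)."""
    return [sample[i:i + files_per_job] for i in range(0, len(sample), files_per_job)]

if __name__ == "__main__":
    sample = [f"lfn:/dc04/file{i:03d}.root" for i in range(40)]    # selected data sample
    merged, statuses = Counter(), []
    with ThreadPoolExecutor(max_workers=8) as grid:                # stands in for Grid dispatch
        for histos, status in grid.map(run_subjob, split(sample)):
            merged.update(histos)                                  # merge back into a single output
            statuses.append(status)
    print("merged output:", dict(merged))
    print("job-end status summary:", dict(Counter(statuses)))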
7. ARDA @ Regional Centres
- Deployability is a key factor of MW success
  - A few Regional Centres will have the responsibility to provide early installations for ARDA
  - Understand deployability issues
- Extend the ARDA test bed
  - The ARDA test bed will be the next step after the most complex EGEE Middleware test bed
  - Stress and performance tests could ideally be located outside CERN
  - This applies to experiment-specific components (e.g. a metadata catalogue)
- Leverage Regional Centre local know-how
  - Database technologies
  - Web services
  - ...
- Pilot sites might enlarge the resources available and give fundamental feedback in terms of deployability, complementing the EGEE SA1 activity (EGEE/LCG operations)
  - Running ARDA pilot installations
  - Experiment data available where the experiment prototype is deployed
8. Coordination and forum activities
- The coordination activities would flow naturally from the fact that ARDA will be open to provide demonstration benches
  - Since it is neither necessary nor possible that all projects be hosted inside the ARDA experiment prototypes, some coordination is needed to ensure that new technologies can be exposed to the relevant community
  - Transparent process
- ARDA should organise a set of regular meetings (one per quarter?) to discuss results, problems and new/alternative solutions, and possibly agree on a coherent programme of work
  - The ARDA project leader organises this activity, which will be truly distributed and led by the active partners
- ARDA is embedded in EGEE NA4, namely NA4-HEP
- Special relation with the LCG GAG
  - The LCG forum for Grid requirements and use cases
  - Experiment representatives coincide with the EGEE NA4 experiment representatives
  - ARDA will channel this information to the appropriate recipients
- ARDA workshop (January 2004 at CERN, open, over 150 participants)
- ARDA workshop (June 21-23 at CERN, by invitation): the first 30 days of EGEE middleware
- NA4 meeting in mid July (NA4/JRA1 and NA4/SA1 sessions foreseen; organised by M. Lamanna and F. Harris)
- ARDA workshop (September 2004?, open)
9. People
- Massimo Lamanna
- Birger Koblitz
- Dietrich Liko
- Frederik Orellana
- Derek Feichtinger
- Andreas Peters
- Julia Andreeva
- Juha Herrala
- Andrew Maier
- Kuba Moscicki
- Russia: Andrey Demichev, Viktor Pose
- Taiwan: Wei-Long Ueng, Tao-Sheng Chen
- Experiment interfaces: Piergiorgio Cerello (ALICE), David Adams (ATLAS), Lucia Silvestris (CMS), Ulrik Egede (LHCb)
10. Example of activity
- Existing system as starting point
  - Every experiment has different implementations of the standard services
  - Used mainly in production environments
  - Few expert users
  - Coordinated update and read actions
- ARDA
  - Interface with the EGEE middleware
  - Verify such components, and help them evolve, for analysis environments
  - Many users
  - Robustness
  - Concurrent read actions
  - Performance
- One prototype per experiment
  - A Common Application Layer might emerge in the future
  - The ARDA emphasis is to enable each of the experiments to do its job
(Slide callouts: "Already started", "Very very soon")
11. LHCb
- The LHCb system within ARDA uses GANGA as its principal component (see next slide)
- The LHCb/GANGA plan: enable physicists (via GANGA) to analyse the data being produced during 2004 for their studies
  - It naturally matches the ARDA mandate: have the prototype where the LHCb data will be the key
- At the beginning, the emphasis will be on validating the tool, focusing on usability and on validation of the splitting and merging functionality for user jobs
- The DIRAC system (the LHCb grid system, used mainly in production so far) could be a useful playground to understand the detailed behaviour of some components, like the file catalog
12. GANGA: Gaudi/Athena aNd Grid Alliance
- Gaudi/Athena: the LHCb/ATLAS frameworks (Athena uses Gaudi as a foundation)
- Single desktop for a variety of tasks
  - Help configuring and submitting analysis jobs
  - Keep track of what users have done, hiding all technicalities completely (Resource Broker, LSF, PBS, DIRAC, Condor)
  - Job registry stored locally or in the roaming profile
- Automate the config/submit/monitor procedures (see the sketch after the diagram)
  - Provide a palette of possible choices and specialized plug-ins (pre-defined application configurations, batch/grid systems, etc.)
- A friendly user interface (CLI/GUI) is essential
  - GUI wizard interface
  - Help users to explore new capabilities
  - Browse the job registry
(Architecture diagram: the GANGA GUI and UI sit on top of an internal model with bookkeeping (BkSvc), workload management (WLM), profile (ProSvc) and monitoring components, which talk to the Grid services: Bookkeeping Service, WorkLoad Manager, Profile Service, file catalog, SE and CE. The instrumented GAUDI program takes JobOptions and algorithms as input and returns histograms, monitoring information and results.)
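As an illustration of the configure/submit/monitor idea, the sketch below shows what a GANGA-style Python session could look like, with swappable application and backend plug-ins. The class names (Job, Executable, LocalBackend) are illustrative only and do not reproduce the actual 2004 GANGA API.

from dataclasses import dataclass, field

@dataclass
class Executable:                          # "application" plug-in: what to run
    exe: str = "echo"
    args: list = field(default_factory=list)

@dataclass
class LocalBackend:                        # "backend" plug-in: where to run it
    name: str = "Local"
    def submit(self, job):
        print(f"[{self.name}] submitting {job.application.exe} {job.application.args}")
        return "submitted"

@dataclass
class Job:
    application: Executable = field(default_factory=Executable)
    backend: object = field(default_factory=LocalBackend)
    status: str = "new"
    def submit(self):
        self.status = self.backend.submit(self)    # backend-specific submission behind one interface
        return self.status

if __name__ == "__main__":
    j = Job()
    j.application = Executable(exe="DaVinci", args=["myOptions.opts"])   # hypothetical Gaudi-based job
    j.backend = LocalBackend(name="Condor")        # swap in another backend plug-in here
    j.submit()
    print("job status:", j.status)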
13. ARDA contribution to Ganga
- Integration with EGEE middleware
  - While waiting for the EGEE middleware, we developed an interface to Condor
  - Use of Condor DAGMan for splitting/merging and error-recovery capability (a sketch follows this list)
- Design and development
  - Command Line Interface
  - Future evolution of Ganga
- Release management
  - Software process and integration
  - Testing, tagging policies, etc.
- Infrastructure
  - Installation, packaging, etc.
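A minimal sketch of how a split/merge workflow with retries can be expressed for Condor DAGMan: a small generator writes a DAG with one node per sub-job, per-node retries for error recovery, and a final merge node. The submit-file and node names (scan.sub, merge.sub) are hypothetical; the JOB/VARS/RETRY/PARENT...CHILD statements are standard DAGMan syntax.

N_SUBJOBS = 4

def write_dag(path="analysis.dag", n=N_SUBJOBS):
    """Generate a DAGMan description for an n-way split followed by a merge."""
    lines = []
    for i in range(n):
        lines.append(f"JOB scan{i} scan.sub")        # one sub-job per data block
        lines.append(f'VARS scan{i} block="{i}"')    # pass the block index to the submit file
        lines.append(f"RETRY scan{i} 2")             # automatic error recovery: retry twice
    lines.append("JOB merge merge.sub")              # final merge of all sub-job outputs
    lines.append("PARENT " + " ".join(f"scan{i}" for i in range(n)) + " CHILD merge")
    with open(path, "w") as dag:
        dag.write("\n".join(lines) + "\n")
    return path

if __name__ == "__main__":
    print("wrote", write_dag())
    # The DAG would then be submitted with: condor_submit_dag analysis.dag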
14. LHCb Metadata catalog
- Used in production (for large productions)
- Web Service layer being developed (main developers in the UK); Oracle backend
- ARDA contributes testing focused on the analysis usage
  - Robustness
  - Performance under high concurrency (read mode); a client sketch follows below
(Plot: measured network rate vs. number of concurrent clients)
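A sketch of such a concurrency measurement, assuming an XML-RPC read interface like the one mentioned on the next slide: N client threads issue read-only queries and the aggregate byte rate is reported for each client count. The endpoint URL and the getFiles method are placeholders, not the real bookkeeping interface.

import time
import xmlrpc.client
from concurrent.futures import ThreadPoolExecutor

ENDPOINT = "http://bookkeeping.example.org:8080/RPC2"    # hypothetical service URL
QUERIES_PER_CLIENT = 20

def one_client(_):
    """Issue read-only queries; return (approx. bytes received, elapsed seconds)."""
    proxy = xmlrpc.client.ServerProxy(ENDPOINT)
    received, start = 0, time.time()
    for _ in range(QUERIES_PER_CLIENT):
        reply = proxy.getFiles({"eventType": "DC04"})    # hypothetical read query
        received += len(str(reply))                      # rough proxy for the payload size
    return received, time.time() - start

def rate(n_clients):
    """Aggregate byte rate seen by n_clients concurrent readers."""
    with ThreadPoolExecutor(max_workers=n_clients) as pool:
        results = list(pool.map(one_client, range(n_clients)))
    total_bytes = sum(r[0] for r in results)
    wall = max(r[1] for r in results)
    return total_bytes / wall if wall else 0.0

if __name__ == "__main__":
    for n in (1, 5, 10, 20, 50):
        print(f"{n:3d} concurrent clients -> {rate(n):10.0f} bytes/s")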
15. CERN/Taiwan tests
(Setup diagram: a client host running virtual users and a network monitor queries the Bookkeeping Server, a web XML-RPC service layer in front of an Oracle DB; one installation runs at CERN and one in Taiwan, with CPU load, network and process-time sensors on both sides.)
- Clone the Bookkeeping DB in Taiwan and install the WS layer
- Performance tests (sensor sampling sketched below)
  - Bookkeeping Server performance tests: Taiwan/CERN Bookkeeping Server DB, database I/O sensor
  - XML-RPC Service performance tests: CPU load, network send/receive sensor, process time
  - Client host performance tests: CPU load, network send/receive sensor, process time
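A possible way to sample the host sensors listed above (CPU load, network send/receive, process time) during a test run; the psutil library is used here only as an example tool and is not necessarily what the original tests relied on.

import time
import psutil

def sample_sensors(duration_s=10, period_s=1.0):
    """Print CPU load and network send/receive rates once per period."""
    prev = psutil.net_io_counters()
    for _ in range(int(duration_s / period_s)):
        cpu = psutil.cpu_percent(interval=period_s)          # CPU load over the period
        now = psutil.net_io_counters()
        sent = (now.bytes_sent - prev.bytes_sent) / period_s
        recv = (now.bytes_recv - prev.bytes_recv) / period_s
        prev = now
        print(f"cpu={cpu:5.1f}%  net_send={sent:10.0f} B/s  net_recv={recv:10.0f} B/s")

if __name__ == "__main__":
    start = time.process_time()
    sample_sensors(duration_s=5)
    print("process time used by the monitor itself:", time.process_time() - start, "s")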
16. CMS
- The CMS system within ARDA is still under discussion
- Providing easy access to (and possibly sharing of) data for the CMS users is a key issue
- RefDB is the bookkeeping engine used to plan and steer the production across the different phases (simulation, reconstruction, and to some degree the analysis phase)
  - It contains all the necessary information except the physical file locations (RLS) and the information related to the transfer management system (TMDB)
- The actual mechanism to provide these data to analysis users is under discussion (a possible lookup flow is sketched after the diagram)
- Performance measurements are underway (similar philosophy as for the LHCb Metadata catalog measurements)
(Diagram: RefDB in CMS DC04. RefDB sends reconstruction instructions to McRunjob, which creates the reconstruction jobs on the T0 worker nodes; summaries of successful jobs flow back to RefDB. Reconstructed data go to the GDB castor pool and to tape; a transfer agent checks what has arrived, updates RLS and TMDB, and moves reconstructed data to the export buffers.)
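Purely to illustrate the gap described above (bookkeeping without physical file locations), the sketch below shows the two-step lookup an analysis user would need: ask the bookkeeping catalogue for the logical files of a dataset, then resolve each one to a physical location via an RLS-like replica catalogue. Both catalogues are stubbed with dictionaries; no real RefDB or RLS interface is implied.

BOOKKEEPING = {  # dataset -> logical file names (what a RefDB-like catalogue knows)
    "dc04_ttH_sample": ["lfn:/dc04/ttH/reco_001.root", "lfn:/dc04/ttH/reco_002.root"],
}
REPLICA_CATALOGUE = {  # logical -> physical file name (what an RLS-like catalogue knows)
    "lfn:/dc04/ttH/reco_001.root": "srm://castorgrid.cern.ch/castor/cern.ch/cms/reco_001.root",
    "lfn:/dc04/ttH/reco_002.root": "srm://t1.example.org/cms/reco_002.root",
}

def files_for_analysis(dataset):
    """Return (lfn, pfn) pairs for a dataset, flagging files with no known replica."""
    return [(lfn, REPLICA_CATALOGUE.get(lfn, "NO REPLICA FOUND"))
            for lfn in BOOKKEEPING.get(dataset, [])]

if __name__ == "__main__":
    for lfn, pfn in files_for_analysis("dc04_ttH_sample"):
        print(f"{lfn} -> {pfn}")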
17. ATLAS
- The ATLAS system within ARDA has been agreed
- ATLAS has a complex strategy for distributed analysis, addressing different areas with specific projects (fast response, user-driven analysis, massive production, etc.; see http://www.usatlas.bnl.gov/ADA/)
- The starting point is the DIAL system
- The AMI metadata catalog is a key component
  - MySQL as a back end
  - Genuine web server implementation
  - Robustness and performance tests from ARDA
- In the start-up phase, ARDA provided some help in developing ATLAS production tools (being finalised)
18. What is DIAL? (Distributed Interactive Analysis of Large datasets)
19. AMI studies in ARDA
- AMI is the ATLAS metadata catalogue; it contains file metadata (e.g. simulation/reconstruction version)
  - It does not contain physical file names
- Many problems still open
  - Large network traffic overhead due to schema-independent tables
  - A SOAP proxy is supposed to provide DB access
  - Note that web services are stateless (no automatic handles for the concept of session, transaction, etc.): 1 query = 1 (full) response
  - Large queries might crash the server
  - Should the proxy re-implement all the database functionality?
- Good collaboration in place with ATLAS-Grenoble
(Diagram: many users access the metadata (MySQL) through the SOAP proxy)
- Studied behaviour using many concurrent clients (see the sketch below)
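One way to mitigate the "1 query = 1 (full) response" problem is for the client (or the proxy) to page a large query in bounded chunks, as sketched below. The fetch_page stub and its limit/offset parameters are hypothetical and do not reflect the real AMI SOAP interface.

def fetch_page(query, offset, limit):
    """Stand-in for a SOAP call returning at most `limit` rows starting at `offset`."""
    TOTAL = 2500                                         # pretend the full result has 2500 rows
    upper = min(offset + limit, TOTAL)
    return [f"{query}-row{i}" for i in range(offset, upper)]

def fetch_all(query, page_size=500):
    """Iterate over the whole result set without ever holding it all at once."""
    offset = 0
    while True:
        page = fetch_page(query, offset, page_size)      # one bounded request per iteration
        if not page:
            return
        yield from page
        offset += len(page)

if __name__ == "__main__":
    n_rows = sum(1 for _ in fetch_all("datasets WHERE version='DC2'"))
    print("rows fetched in pages:", n_rows)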
20. ALICE: Grid-enabled PROOF, SuperComputing 2003 (SC2003) demo
(Diagram: PROOF slaves at sites A, B and C connect through per-site TcpRouter services to the PROOF master server and the user session.)
- Strategy
  - ALICE/ARDA will evolve the analysis system presented by ALICE at SuperComputing 2003
  - With the new EGEE middleware (at SC2003, AliEn was used)
- Activity on PROOF
  - Robustness
  - Error recovery
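For orientation, a minimal sketch of a PROOF session from the user's side via PyROOT; the master host, tree name, file list and selector are placeholders, and this shows generic PROOF usage rather than the actual SC2003 configuration.

import ROOT

def run_proof_analysis():
    # Connect the user session to the PROOF master, which coordinates the slaves
    proof = ROOT.TProof.Open("proof-master.example.org")
    if not proof:
        raise RuntimeError("could not open a PROOF session")

    # Build a chain of input files and route its processing through PROOF
    chain = ROOT.TChain("esdTree")                              # placeholder tree name
    chain.Add("root://se.example.org//alice/sim/run1.root")     # placeholder input files
    chain.Add("root://se.example.org//alice/sim/run2.root")
    chain.SetProof()                                            # process this chain on the PROOF cluster

    # The selector implements the per-event analysis and is compiled on the workers
    chain.Process("MySelector.C+")                              # placeholder selector

if __name__ == "__main__":
    run_proof_analysis()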
21. ALICE-ARDA prototype improvements
- SC2003
  - The setup was heavily tied to the middleware services
  - Somewhat inflexible configuration
  - No chance to use PROOF on federated grids like LCG in AliEn
  - The TcpRouter service needs incoming connectivity at each site
  - Libraries cannot be distributed using the standard rootd functionality
- Improvement ideas
  - Distribute another daemon with ROOT, which replaces the need for a TcpRouter service
  - Connect each slave proofd/rootd via this daemon to two central proofd/rootd master multiplexer daemons, which run together with the PROOF master
  - Use Grid functionality for daemon start-up and booking policies through a plug-in interface from ROOT
  - Put PROOF/ROOT on top of the grid services
  - Improve on dynamic configuration and error recovery
22. ALICE-ARDA improved system
(Diagram: proxy proofd and proxy rootd daemons on the master machine, backed by the Grid services and a booking service.)
- The remote PROOF slaves look like local PROOF slaves on the master machine
- The booking service is also usable on local clusters
23. Conclusions and Outlook
- ARDA is starting
  - Main tool: experiment prototypes for analysis
  - Detailed project plan being prepared
- Good feedback from the LHC experiments
- Good collaboration with EGEE NA4
- Good collaboration with Regional Centres; more help needed
- Looking forward to contributing to the success of EGEE
  - Helping the EGEE Middleware to deliver a fully functional solution
- ARDA main focus: collaborate with the LHC experiments to set up the end-to-end prototypes
- Aggressive schedule: the first milestone for the end-to-end prototypes is December 2004