Title: The Grid Observatory: towards collaborations between EGEE and CoreGRID
1. The Grid Observatory: towards collaborations between EGEE and CoreGRID
- C. Germain-Renaud
- Laboratoire de Recherche en Informatique (LRI)
- Laboratoire de l'Accélérateur Linéaire (LAL)
- CNRS / Université Paris-Sud XI
- Dresden, 9-11 May 2007
2. The EGEE Grid
- FP6 and FP7 infrastructure
- Large-scale, production-quality grid infrastructure for e-Science
- 91 partners, 32 countries
- 30K CPUs, 10PB of storage
- 30K jobs
- 24x7 operation
3. Applications
- Applications from a growing number of domains:
  - High Energy Physics
  - Life Sciences
  - Astrophysics
  - Computational Chemistry
  - Earth Sciences
  - Financial Simulation
  - Fusion
  - Geophysics
  - Multimedia
  - Material Sciences
- Applications have moved from testing to routine, daily usage
4. Collaborating e-infrastructures
- Potential for linking 80 countries by 2008
5. Sustainability
- Need to prepare for a permanent grid infrastructure
- Ensure a high quality of service for all user communities
- Independent of short project funding cycles
- Infrastructure managed in collaboration with National Grid Initiatives (NGIs)
6. The Grid Observatory: overview
- Part of the EGEE-III proposal (NA4 cluster)
- Partners: LRI, LAL, Imperial College London, Trinity College Dublin, Università del Piemonte Orientale
- Goal: integrate the collection of data on the behaviour of the EGEE grid and its users with the development of models and of an ontology for the domain knowledge
7. Data Collection and Publication
- Design, implement and deploy a set of tools for the acquisition, consolidation, long-term conservation, and publication of traces of EGEE activity
- Permanent storage of reliable, exhaustive, filtered information
- Publication service
  - Navigation and structured access
  - Requires indexing along the needs of the scientific communities
- Operational basis for collaboration with national research initiatives and other EU projects
- Input for Machine Learning research, e.g. the PASCAL (Pattern Analysis, Statistical Modelling and Computational Learning) Challenge, or the KD-Ubiq Summer School 2008
- Input for research in grid architecture and middleware
  - Baseline references
8. Data Collection and Publication: scientific challenges
- Most data is organized along operational semantics: redundant and often undocumented features
- A grid ontology would be required for efficient indexing and querying. The GLUE schema is an ontology of the grid components. It is a good starting point, but we need concepts for the grid dynamics (e.g. the job lifecycle or the user relation graph), and for performance metrics
- Lack of usage-oriented sensors at the network level
9. Data Collection and Publication: technical and legal challenges
- Technical challenges
  - Ecosystem of sources with very different scopes, deployments and institutional statuses: CIC tools (GOCDB, SAM, SFT), core gLite (LB, BDII, ...), sites (Maui/PBS logs), gLite integrators (R-GMA, Job Provenance), experiment integrators (DashBoard), external software (Alice/MonALISA)
  - Volume, operational constraints
- Legal challenge
  - Privacy: scientific usage helps, but it has to be considered from the beginning
10. Example: operational monitoring tools (from Ian Bird's talk at EGEE'06)
- Tools used by the Grid Operator on Duty team to detect problems
- Distributed responsibility
- CIC portal
  - Single entry point
  - Integrated view of monitoring tools
- Site Functional Tests (SFT) -> Service Availability Monitoring (SAM)
- Grid Operations Centre Core Database (GOCDB)
- GIIS monitor (Gstat)
- GOC certificate lifetime
- GOC job monitor
- Others
11. The Logging and Bookkeeping (LB)
- Behavioural trace of a job lifecycle: users, middleware and infrastructure
- LAL broker, 2005-2006: 300K jobs, 3M events, 2GB
12. Pre-processing the LB logs
- A lot of the information is in blobs: incremental verbatim copies of the reports from the various services
- More than 400 attributes
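As an illustration of this pre-processing step, the sketch below flattens report blobs into attribute/value pairs. The semicolon-separated key=value layout and the field names are hypothetical; the real L&B blob format is considerably richer than this.

```python
# Hypothetical sketch: flattening a Logging and Bookkeeping report blob
# into a flat attribute dict. The key=value layout and field names are
# invented for illustration; the real L&B format differs.

def parse_blob(blob: str) -> dict:
    """Split a semicolon-separated key=value blob into attributes."""
    attrs = {}
    for field in blob.split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs

events = [
    "event=Transfer; source=UserInterface; dest=NetworkServer; result=OK",
    "event=Running; node=worker042.example.org",
]
table = [parse_blob(e) for e in events]
```

Consolidating many such events per job is what produces the 400+ attributes mentioned above.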
13. Models
- Models are required for dimensioning, capacity planning, and optimizing grid middleware and applications
- Self-aware grids need behavioural models
  - Intrinsic characterizations of grid traffic (distributions), e.g. job running times, application data locality
  - Characterizations of middleware-dependent metrics, e.g. queuing delays, SE load
  - Inference of models for middleware components, applications (including, but not limited to, workflows), users and usage profiles
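One concrete instance of the "intrinsic characterizations" bullet: job running times are often modelled with heavy-tailed distributions. The sketch below fits a lognormal by maximum likelihood in log space; the distribution family, parameters, and data are illustrative assumptions, not results from EGEE traces.

```python
import math
import random

# Illustrative sketch: characterizing job running times by fitting a
# lognormal distribution (a common heavy-tailed model for workloads;
# the slide does not prescribe a family). Data here is synthetic.
random.seed(0)
runtimes = [random.lognormvariate(mu=5.0, sigma=1.2) for _ in range(10_000)]

# MLE for a lognormal: mean and std of the log-runtimes.
logs = [math.log(t) for t in runtimes]
mu_hat = sum(logs) / len(logs)
sigma_hat = math.sqrt(sum((x - mu_hat) ** 2 for x in logs) / len(logs))
# mu_hat and sigma_hat approximately recover the generating parameters
```

The same fit on real L&B traces would give a baseline reference for dimensioning and capacity planning.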
14. Evaluation
- Assessing performance at the grid scale is a challenge
- Needs a snapshot of the inputs and of the grid state, e.g. workload and available services during a relevant time range
- Classical optimization does not scale
- Advanced optimization: anytime algorithms
- Experiments in controlled settings
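The "anytime algorithms" bullet can be illustrated by the simplest possible pattern: keep a best-so-far solution and refine it until the time budget expires, so the optimizer can be interrupted at any moment and still return a valid answer. The objective below is a toy stand-in for a real scheduling cost, and the budget is arbitrary.

```python
import random
import time

# Minimal sketch of the "anytime" idea: improve a best-so-far answer
# until the time budget runs out; interruption at any point still
# yields a usable solution.

def anytime_minimize(cost, dim, budget_s=0.05):
    best = [random.uniform(-10, 10) for _ in range(dim)]
    best_cost = cost(best)
    deadline = time.monotonic() + budget_s
    while time.monotonic() < deadline:
        cand = [x + random.gauss(0, 0.5) for x in best]  # local perturbation
        c = cost(cand)
        if c < best_cost:            # keep only improving moves
            best, best_cost = cand, c
    return best, best_cost           # valid whenever we stop

sphere = lambda v: sum(x * x for x in v)   # toy stand-in objective
sol, val = anytime_minimize(sphere, dim=5)
```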
15. Autonomic grids
- Autonomic computing: computing systems that manage themselves in accordance with high-level objectives from humans
  - Self-configuration, self-healing, self-optimization, self-protection
  - Industrial (IBM, Bouygues) and academic (IEEE conference since 2004, ECML tutorial 2006, ...) activity
  - Machine learning techniques
- Relevant for grids
  - The components are known to be complex systems
  - Very large datasets, but actually sparse in the state space
  - Non-steady-state dynamic behaviour
  - High computational (analysis) and operational (probing) costs
16. Autonomic grids (continued)
- Techniques
  - Pruning data: feature selection and construction
  - Identifying the main modes of the grid: supervised (classification, regression, collaborative prediction) and unsupervised (clustering, ...) learning
  - On-line decision making: active learning, reinforcement learning
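A minimal sketch of the first technique, pruning data by feature selection: drop features with (near-)zero variance across examples, i.e. features that carry no information. Real pipelines on L&B data would use richer criteria (the slide also mentions feature construction); the data and threshold here are illustrative.

```python
# Toy sketch of data pruning via feature selection: remove features
# whose variance across examples is (near) zero.

def variance(col):
    m = sum(col) / len(col)
    return sum((x - m) ** 2 for x in col) / len(col)

def select_features(rows, min_var=1e-9):
    cols = list(zip(*rows))                       # per-feature columns
    keep = [i for i, c in enumerate(cols) if variance(c) > min_var]
    return keep, [[r[i] for i in keep] for r in rows]

rows = [[1.0, 3.2, 7.0],
        [1.0, 0.1, 6.5],
        [1.0, 4.8, 7.2]]                          # feature 0 is constant
keep, pruned = select_features(rows)
```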
17. Autonomic dependability
- On-line failure detection and anticipation
- Passive vs. active probing: a lot of information is available from useful work
- Black-box approaches
  - On-line statistics from similar actions (executions, data accesses, middleware modules): sequential testing
  - Off-line supervised and unsupervised learning
- Active probing
  - Adaptive on-line test selection for the best coverage of possibly faulty components
  - Experiment planning
18. Example: blackhole detection
- Blackholes arise from software bugs or site faults
- Page-Hinckley statistics
  - A time-sequential version of Wald's statistic, also known as CUSUM
  - Provides an intelligent threshold test
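The Page-Hinckley test above can be sketched as a short streaming statistic over per-job failure indicators: accumulate the deviations above the running mean and raise an alarm when the statistic rises too far above its historical minimum. The 0/1 failure encoding and the thresholds `delta` and `lam` are illustrative assumptions; the talk gives no concrete parameter values.

```python
# Sketch of the Page-Hinckley change-detection statistic applied to a
# stream of per-job failure indicators (1 = job failed at this site).
# delta (tolerated drift) and lam (alarm threshold) are illustrative.

def page_hinkley(stream, delta=0.05, lam=2.0):
    """Return the index at which a mean increase is flagged, or None."""
    mean, m, m_min = 0.0, 0.0, 0.0
    for t, x in enumerate(stream, start=1):
        mean += (x - mean) / t          # running mean of the stream
        m += x - mean - delta           # cumulative deviation above the mean
        m_min = min(m_min, m)
        if m - m_min > lam:             # "intelligent threshold" alarm
            return t
    return None

healthy = [0, 0, 1, 0, 0, 0, 0, 1, 0, 0] * 5   # ~20% background failures
blackhole = healthy + [1] * 20                 # site starts failing everything
```

On the synthetic streams, the healthy trace stays below the threshold while the blackhole trace is flagged a few jobs after the failures begin.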
19. Example: mining the Logging and Bookkeeping data
- Aggressive sub-sampling
- Constructive feature induction
- SVM and ROGER (ROC-based genetic learner)
- Double clustering (Slonim and Tishby, 2000) with K-means
  - First clustering: compress the features along the examples
  - Second clustering: cluster the examples along the synthetic features
- Stable clusters
- Good prediction performance when ABU is excluded
- Mostly pure clusters, natural use for detection; diagnostic in progress
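The two-pass recipe above can be sketched with a tiny plain K-means: first cluster the *features* (columns) and average each feature cluster into one synthetic feature, then cluster the *examples* in that compressed space. The data, the choice of k = 2, and the deterministic initialization are toy assumptions; the actual study ran on the L&B attributes.

```python
# Hedged sketch of double clustering with K-means: pass 1 compresses the
# features, pass 2 clusters the examples along the synthetic features.

def kmeans(points, k, iters=20):
    """Plain Lloyd k-means; deterministic init from the first k points."""
    centers = [list(p) for p in points[:k]]
    def nearest(p):
        return min(range(k),
                   key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p)].append(p)
        centers = [[sum(v) / len(g) for v in zip(*g)] if g else centers[j]
                   for j, g in enumerate(groups)]
    return [nearest(p) for p in points]

X = [[0.1, 0.2, 5.0, 5.1],
     [0.0, 0.1, 4.9, 5.2],
     [9.0, 9.1, 0.2, 0.1],
     [8.9, 9.2, 0.0, 0.3]]

# Pass 1: cluster the features (columns of X) into 2 feature groups.
cols = [list(c) for c in zip(*X)]
flabels = kmeans(cols, k=2)
# Compress: one synthetic feature per feature cluster (mean of its columns).
Z = [[sum(row[i] for i in range(len(row)) if flabels[i] == g) / flabels.count(g)
      for g in range(2)] for row in X]
# Pass 2: cluster the examples in the compressed space.
elabels = kmeans(Z, k=2)
```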
20. Other areas for interaction
- Differentiated services without advance reservation
  - An urgent, but not fully explicit, requirement: priorities (analysis), hard real-time (critical), soft real-time (interactive), latency (isolated jobs), bandwidth, ...
- Two EGEE working groups
  - SDJ (Short Deadline Jobs), based on controlled time-sharing
  - Priorities based on refinements of VO (role) and CE (VOView)
  - Necessarily short-term solutions, but useful for exploring constraints and basic support mechanisms
- Basic research: no quantum; fair share operates at the accounting time scale
- Working with gLite
  - A turn-key deployment of gLite on Xen is planned
21. Conclusion
- The Grid Observatory targets
  - Contributions to a quantitative approach to grid middleware and architecture, in the RISC sense, through data collection, publication and characterization
  - Statistically significant benchmarks
  - Basic research in autonomic computing
  - Operational impacts on EGEE: evaluation, autonomic dependability
  - Advances in grid ontology
- The Grid Observatory data and analyses are relevant for computer science, middleware development and system administration, well beyond EGEE