Title: The Grid Observatory: towards collaborations between EGEE and CoreGRID
1. The Grid Observatory: towards collaborations between EGEE and CoreGRID
- C. Germain-Renaud
- Laboratoire de Recherche en Informatique (LRI)
- Laboratoire de l'Accélérateur Linéaire (LAL)
- CNRS / Université Paris-Sud XI
- Dresden, 9-11 May 2007
2. The EGEE Grid
- FP6 and FP7 infrastructure
- Large-scale, production-quality grid infrastructure for e-Science
- 91 partners, 32 countries
- 30K CPUs, 10PB of storage
- 30K jobs
- 24x7 operation
3. Applications
- Applications from a growing number of domains:
  - High Energy Physics
  - Life Sciences
  - Astrophysics
  - Computational Chemistry
  - Earth Sciences
  - Financial Simulation
  - Fusion
  - Geophysics
  - Multimedia
  - Material Sciences
- Applications have moved from testing to routine, daily usage
4. Collaborating e-infrastructures
- Potential for linking 80 countries by 2008
5. Sustainability
- Need to prepare for a permanent grid infrastructure
- Ensure a high quality of service for all user communities
- Independent of short project funding cycles
- Infrastructure managed in collaboration with National Grid Initiatives (NGIs)
6. The Grid Observatory: overview
- Part of the EGEE-III proposal (NA4 cluster)
- Partners: LRI, LAL, Imperial College London, Trinity College Dublin, Università del Piemonte Orientale
- Goal: integrate the collection of data on the behaviour of the EGEE grid and its users with the development of models and of an ontology for the domain knowledge
7. Data Collection and Publication
- Design, implement and deploy a set of tools for the acquisition, consolidation, long-term conservation, and publication of traces of EGEE activity
- Permanent storage of reliable, exhaustive, filtered information
- Publication service
  - Navigation and structured access
  - Requires indexing along the needs of the scientific communities
- Operational basis for collaboration with national research initiatives and other EU projects
- Input for Machine Learning research, e.g. the PASCAL (Pattern Analysis, Statistical Modelling and Computational Learning) Challenge, or the KD-Ubiq Summer School 2008
- Input for research in grid architecture and middleware
  - Baseline references
8. Data Collection and Publication: scientific challenges
- Most data is organized along operational semantics: redundant and often undocumented features
- A grid ontology would be required for efficient indexing and querying. The GLUE schema is an ontology of the grid components. It is a good starting point, but we need concepts for the grid dynamics (e.g. the job lifecycle or the user relation graph), and for performance metrics
- Lack of usage-oriented sensors at the network level
9. Data Collection and Publication: technical and legal challenges
- Technical challenges
  - Ecosystem of sources with very different scopes, deployments and institutional statuses: CIC tools (GOCDB, SAM, SFT), core gLite (LB, BDII, ...), sites (Maui/PBS logs), gLite integrators (R-GMA, Job Provenance), experiment integrators (DashBoard), external software (Alice/MonALISA)
  - Volume, operational constraints
- Legal challenge
  - Privacy: scientific usage helps, but it has to be considered from the beginning
10. Example: operational monitoring tools (from Ian Bird's talk at EGEE'06)
- Tools used by the Grid Operator on Duty team to detect problems
- Distributed responsibility
- CIC portal
  - Single entry point
  - Integrated view of monitoring tools
- Site Functional Tests (SFT) -> Service Availability Monitoring (SAM)
- Grid Operations Centre Core Database (GOCDB)
- GIIS monitor (Gstat)
- GOC certificate lifetime
- GOC job monitor
- Others
11. The Logging and Bookkeeping (LB)
- Behavioural trace of a job lifecycle: users, middleware and infrastructure
- LAL broker, 2005-2006: 300K jobs, 3M events, 2GB
12. Pre-processing the LB logs
- A lot of the information is in blobs: incremental verbatim copies of the reports from the various services
- More than 400 attributes
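As an illustration of this pre-processing step, the sketch below flattens report blobs into attribute/value pairs. The semicolon-separated key=value layout and the field names are hypothetical; the real L&B blob format is considerably richer than this.

```python
# Hypothetical sketch: flattening a Logging and Bookkeeping report blob
# into a flat attribute dict. The key=value layout and field names are
# invented for illustration; the real L&B format differs.

def parse_blob(blob: str) -> dict:
    """Split a semicolon-separated key=value blob into attributes."""
    attrs = {}
    for field in blob.split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs

events = [
    "event=Transfer; source=UserInterface; dest=NetworkServer; result=OK",
    "event=Running; node=worker042.example.org",
]
table = [parse_blob(e) for e in events]
```

Consolidating many such events per job is what produces the 400+ attributes mentioned above.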
13. Models
- Models are required for dimensioning, capacity planning, and optimizing grid middleware and applications
- Self-aware grids need behavioural models
  - Intrinsic characterizations of grid traffic (distributions), e.g. job running times, application data locality
  - Characterizations of middleware-dependent metrics, e.g. queuing delays, SE load
  - Inference of models for middleware components, applications (including, but not limited to, workflows), users and usage profiles
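One concrete instance of the "intrinsic characterizations" bullet: job running times are often modelled with heavy-tailed distributions. The sketch below fits a lognormal by maximum likelihood in log space; the distribution family, parameters, and data are illustrative assumptions, not results from EGEE traces.

```python
import math
import random

# Illustrative sketch: characterizing job running times by fitting a
# lognormal distribution (a common heavy-tailed model for workloads;
# the slide does not prescribe a family). Data here is synthetic.
random.seed(0)
runtimes = [random.lognormvariate(mu=5.0, sigma=1.2) for _ in range(10_000)]

# MLE for a lognormal: mean and std of the log-runtimes.
logs = [math.log(t) for t in runtimes]
mu_hat = sum(logs) / len(logs)
sigma_hat = math.sqrt(sum((x - mu_hat) ** 2 for x in logs) / len(logs))
# mu_hat and sigma_hat approximately recover the generating parameters
```

The same fit on real L&B traces would give a baseline reference for dimensioning and capacity planning.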
14. Evaluation
- Assessing performance at the grid scale is a challenge
- Needs a snapshot of the inputs and of the grid state, e.g. workload and available services during a relevant time range
- Classical optimization does not scale
- Advanced optimization: anytime algorithms
- Experiments in controlled settings
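The "anytime algorithms" bullet can be illustrated by the simplest possible pattern: keep a best-so-far solution and refine it until the time budget expires, so the optimizer can be interrupted at any moment and still return a valid answer. The objective below is a toy stand-in for a real scheduling cost, and the budget is arbitrary.

```python
import random
import time

# Minimal sketch of the "anytime" idea: improve a best-so-far answer
# until the time budget runs out; interruption at any point still
# yields a usable solution.

def anytime_minimize(cost, dim, budget_s=0.05):
    best = [random.uniform(-10, 10) for _ in range(dim)]
    best_cost = cost(best)
    deadline = time.monotonic() + budget_s
    while time.monotonic() < deadline:
        cand = [x + random.gauss(0, 0.5) for x in best]  # local perturbation
        c = cost(cand)
        if c < best_cost:            # keep only improving moves
            best, best_cost = cand, c
    return best, best_cost           # valid whenever we stop

sphere = lambda v: sum(x * x for x in v)   # toy stand-in objective
sol, val = anytime_minimize(sphere, dim=5)
```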
15. Autonomic grids
- Autonomic computing: computing systems that manage themselves in accordance with high-level objectives from humans
  - Self-configuration, self-healing, self-optimization, self-protection
  - Industrial (IBM, Bouygues) and academic (IEEE conference since 2004, ECML tutorial 2006, ...) activity
  - Machine learning techniques
- Relevant for grids
  - The components are known to be complex systems
  - Very large datasets, but actually sparse in the state space
  - Non-steady-state dynamic behaviour
  - High computational (analysis) and operational (probing) costs
16. Autonomic grids (continued)
- Techniques
  - Pruning data: feature selection and construction
  - Identifying the main modes of the grid: supervised (classification, regression, collaborative prediction) and unsupervised (clustering, ...) learning
  - On-line decision making: active learning, reinforcement learning
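A minimal sketch of the first technique, pruning data by feature selection: drop features with (near-)zero variance across examples, i.e. features that carry no information. Real pipelines on L&B data would use richer criteria (the slide also mentions feature construction); the data and threshold here are illustrative.

```python
# Toy sketch of data pruning via feature selection: remove features
# whose variance across examples is (near) zero.

def variance(col):
    m = sum(col) / len(col)
    return sum((x - m) ** 2 for x in col) / len(col)

def select_features(rows, min_var=1e-9):
    cols = list(zip(*rows))                       # per-feature columns
    keep = [i for i, c in enumerate(cols) if variance(c) > min_var]
    return keep, [[r[i] for i in keep] for r in rows]

rows = [[1.0, 3.2, 7.0],
        [1.0, 0.1, 6.5],
        [1.0, 4.8, 7.2]]                          # feature 0 is constant
keep, pruned = select_features(rows)
```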
17. Autonomic dependability
- On-line failure detection and anticipation
- Passive vs. active probing: a lot of information is available from useful work
- Black-box approaches
  - On-line statistics from similar actions (executions, data accesses, middleware modules): sequential testing
  - Off-line supervised and unsupervised learning
- Active probing
  - Adaptive on-line test selection for the best coverage of possibly faulty components
  - Experiment planning
18. Example: blackhole detection
- Blackholes arise from software bugs or site faults
- Page-Hinckley statistics
  - A time-sequential version of Wald's statistic, also known as CUSUM
  - Provides an intelligent threshold test
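The Page-Hinckley test above can be sketched as a short streaming statistic over per-job failure indicators: accumulate the deviations above the running mean and raise an alarm when the statistic rises too far above its historical minimum. The 0/1 failure encoding and the thresholds `delta` and `lam` are illustrative assumptions; the talk gives no concrete parameter values.

```python
# Sketch of the Page-Hinckley change-detection statistic applied to a
# stream of per-job failure indicators (1 = job failed at this site).
# delta (tolerated drift) and lam (alarm threshold) are illustrative.

def page_hinkley(stream, delta=0.05, lam=2.0):
    """Return the index at which a mean increase is flagged, or None."""
    mean, m, m_min = 0.0, 0.0, 0.0
    for t, x in enumerate(stream, start=1):
        mean += (x - mean) / t          # running mean of the stream
        m += x - mean - delta           # cumulative deviation above the mean
        m_min = min(m_min, m)
        if m - m_min > lam:             # "intelligent threshold" alarm
            return t
    return None

healthy = [0, 0, 1, 0, 0, 0, 0, 1, 0, 0] * 5   # ~20% background failures
blackhole = healthy + [1] * 20                 # site starts failing everything
```

On the synthetic streams, the healthy trace stays below the threshold while the blackhole trace is flagged a few jobs after the failures begin.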
19. Example: mining the Logging and Bookkeeping data
- Aggressive sub-sampling
- Constructive feature induction
- SVM and ROGER (ROC-based genetic learner)
- Double clustering (Slonim and Tishby, 2000) with K-means
  - First clustering: compress the features along the examples
  - Second clustering: cluster the examples along the synthetic features
- Stable clusters
- Good prediction performance when ABU is excluded
- Mostly pure clusters, natural use for detection; diagnostic in progress
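The two-pass recipe above can be sketched with a tiny plain K-means: first cluster the *features* (columns) and average each feature cluster into one synthetic feature, then cluster the *examples* in that compressed space. The data, the choice of k = 2, and the deterministic initialization are toy assumptions; the actual study ran on the L&B attributes.

```python
# Hedged sketch of double clustering with K-means: pass 1 compresses the
# features, pass 2 clusters the examples along the synthetic features.

def kmeans(points, k, iters=20):
    """Plain Lloyd k-means; deterministic init from the first k points."""
    centers = [list(p) for p in points[:k]]
    def nearest(p):
        return min(range(k),
                   key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p)].append(p)
        centers = [[sum(v) / len(g) for v in zip(*g)] if g else centers[j]
                   for j, g in enumerate(groups)]
    return [nearest(p) for p in points]

X = [[0.1, 0.2, 5.0, 5.1],
     [0.0, 0.1, 4.9, 5.2],
     [9.0, 9.1, 0.2, 0.1],
     [8.9, 9.2, 0.0, 0.3]]

# Pass 1: cluster the features (columns of X) into 2 feature groups.
cols = [list(c) for c in zip(*X)]
flabels = kmeans(cols, k=2)
# Compress: one synthetic feature per feature cluster (mean of its columns).
Z = [[sum(row[i] for i in range(len(row)) if flabels[i] == g) / flabels.count(g)
      for g in range(2)] for row in X]
# Pass 2: cluster the examples in the compressed space.
elabels = kmeans(Z, k=2)
```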
20. Other areas for interaction
- Differentiated services without advance reservation
  - An urgent, but not fully explicit, requirement: priorities (analysis), hard real-time (critical), soft real-time (interactive), latency (isolated jobs), bandwidth, ...
- Two EGEE working groups
  - SDJ (Short Deadline Jobs), based on controlled time-sharing
  - Priorities based on refinements of VO (role) and CE (VOView)
  - Necessarily short-term solutions, but useful for exploring constraints and basic support mechanisms
- Basic research: no quantum; fair share operates at the accounting time scale
- Working with gLite
  - A turn-key deployment of gLite on Xen is planned
21. Conclusion
- The Grid Observatory targets
  - Contributions to a quantitative approach to grid middleware and architecture, in the RISC sense, through data collection, publication and characterization
  - Statistically significant benchmarks
  - Basic research in autonomic computing
  - Operational impacts on EGEE: evaluation, autonomic dependability
  - Advances in grid ontology
- The Grid Observatory data and analyses are relevant for computer science, middleware development and system administration, well beyond EGEE