The Grid Observatory: towards collaborations between EGEE and CoreGRID

1
The Grid Observatory: towards collaborations
between EGEE and CoreGRID
  • C. Germain-Renaud
  • Laboratoire de Recherche en Informatique
  • Laboratoire Accélérateur Linéaire
  • CNRS Université Paris-Sud XI
  • Dresden, 9-11 May 2007

2
The EGEE Grid
  • FP6 and FP7 infrastructure
  • Large-scale, production-quality grid
    infrastructure for e-Science
  • 91 partners, 32 countries
  • 30K CPUs, 10 PB
  • 30K jobs
  • 24x7

3
Applications
  • Applications from a growing number of domains
  • High Energy Physics
  • Life Sciences
  • Astrophysics
  • Computational Chemistry
  • Earth Sciences
  • Financial Simulation
  • Fusion
  • Geophysics
  • Multimedia
  • Material Sciences

Applications have moved from testing to routine
and daily usage

4
Collaborating e-infrastructures
Potential for linking 80 countries by 2008

5
Sustainability
  • Need to prepare for permanent Grid infrastructure
  • Ensure a high quality of service for all user
    communities
  • Independent of short project funding cycles
  • Infrastructure managed in collaboration with
    National Grid Initiatives (NGIs)

6
The Grid Observatory - overview
  • Part of the EGEE-III proposal (NA4 Cluster)
  • LRI, LAL, Imperial College London, Trinity
    College Dublin, Università del Piemonte Orientale
  • Integrate the collection of data on the behaviour
    of the EGEE grid and users with the development
    of models and of an ontology for the domain
    knowledge.

7
Data Collection and Publication
  • Design, implement and deploy a set of tools for
    the acquisition, consolidation, long-term
    conservation, and publication of traces of EGEE
    activities
  • Permanent storage of reliable, exhaustive,
    filtered information
  • Publication service
  • Navigation and structured access
  • Requires indexing along the needs of the
    scientific communities
  • Operational basis for collaboration with national
    research initiatives and other EU projects
  • Input for Machine Learning research, e.g. a PASCAL
    (Pattern Analysis, Statistical Modelling and
    Computational Learning) Challenge, or the KD-Ubiq
    Summer School 2008
  • Input for research in grid architecture and
    middleware
  • Baseline references

8
Data Collection and Publication
  • Scientific Challenges
  • Most data is organized along operational
    semantics: redundant and often undocumented
    features
  • A Grid Ontology would be required for efficient
    indexing and querying. The GLUE schema is an
    ontology of the grid components. It is a good
    starting point, but we need concepts for the grid
    dynamics (e.g. job lifecycle or user relation
    graph), and for performance metrics
  • Lack of usage-oriented sensors at the network
    level

9
Data Collection and Publication
  • Technical challenges
  • Ecosystem of sources with very different scopes,
    deployment and institutional status: CIC tools
    (GOCDB, SAM, SFT), core gLite (LB, BDII, ...),
    sites (Maui/PBS logs), gLite integrators (R-GMA,
    Job Provenance), experiment integrators
    (DashBoard), external software (ALICE/MonALISA)
  • Volume, operational constraints
  • Legal challenge
  • Privacy
  • Scientific usage helps, but to be considered from
    the beginning

10
Example: operational monitoring tools (from Ian
Bird's talk at EGEE'06)
  • Tools used by the Grid Operator on Duty team to
    detect problems
  • Distributed responsibility
  • CIC portal
  • single entry point
  • Integrated view of monitoring tools
  • Site Functional Tests (SFT) -> Service
    Availability Monitoring (SAM)
  • Grid Operations Centre Core Database (GOCDB)
  • GIIS monitor (Gstat)
  • GOC certificate lifetime
  • GOC job monitor
  • Others

11
The Logging and Bookkeeping
  • Behavioural trace of a job lifecycle: users,
    middleware and infrastructure
  • LAL broker, 2005-2006: 300K jobs, 3M events, 2 GB

12
Pre-processing the LB logs
  • A lot of information is in blobs: incremental
    verbatim copies of the reports from the various
    services
  • More than 400 attributes

13
Models
  • Models are required for dimensioning, capacity
    planning, optimizing grid middleware and
    applications
  • Self-aware grids: behavioural models
  • Intrinsic characterizations of grid traffic
    (distributions), e.g. job running times,
    application data locality
  • Characterizations of middleware-dependent metrics,
    e.g. queuing delays, SE load
  • Inference of models for middleware components,
    applications (including, but not limited to,
    workflows), users and usage profiles
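
As an illustration of the characterizations listed above, the following minimal sketch summarizes the distributions of queuing delay and running time with quartiles. The job records and their (submit, start, end) layout are hypothetical and are not the actual Logging and Bookkeeping schema.

```python
import statistics

# Hypothetical job records: (submit, start, end) timestamps in seconds.
# This layout is an assumption for illustration, not the L&B schema.
jobs = [
    (0, 30, 330),
    (10, 70, 190),
    (20, 50, 1250),
    (40, 400, 460),
    (60, 90, 690),
]


def summarize(values):
    """Return the quartiles of a sample as a small dictionary."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {"q1": q1, "median": q2, "q3": q3}


# Middleware-dependent metric: time spent waiting before execution.
queuing_delays = [start - submit for submit, start, _ in jobs]
# Intrinsic metric: time spent actually running.
running_times = [end - start for _, start, end in jobs]

print("queuing delay:", summarize(queuing_delays))
print("running time:", summarize(running_times))
```

Quartiles are used here rather than means because such samples are often skewed by a few very long jobs; on real traces a full empirical distribution would likely be more informative.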

14
Evaluation
  • Assessing performance at the grid scale is a
    challenge
  • Need a snapshot of the inputs and grid state, e.g.
    workload and available services, during a relevant
    time range
  • Classical optimization does not scale
  • Advanced optimization: anytime algorithms
  • Experiments in controlled settings
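
To make the notion of an anytime algorithm concrete, here is a minimal sketch under stated assumptions: a toy problem (assigning jobs to sites so as to minimize the largest site load) solved by a local search that holds a valid answer at every step. The problem, the function names and the `budget` parameter are illustrative choices, not something taken from the talk.

```python
import random


def makespan(assignment, durations, n_sites):
    """Largest total load on any site for a job -> site assignment."""
    loads = [0.0] * n_sites
    for job, site in enumerate(assignment):
        loads[site] += durations[job]
    return max(loads)


def anytime_schedule(durations, n_sites, budget, seed=0):
    """Local search that always holds a valid best-so-far schedule.

    budget is the number of improvement attempts (hypothetical knob);
    stopping earlier still returns a usable, if worse, schedule.
    """
    rng = random.Random(seed)
    # Round-robin assignment as the initial valid answer.
    best = [job % n_sites for job in range(len(durations))]
    best_cost = makespan(best, durations, n_sites)
    for _ in range(budget):
        # Move one randomly chosen job to a randomly chosen site.
        candidate = list(best)
        candidate[rng.randrange(len(durations))] = rng.randrange(n_sites)
        cost = makespan(candidate, durations, n_sites)
        if cost <= best_cost:  # never accept a worse schedule
            best, best_cost = candidate, cost
    return best, best_cost


durations = [8, 7, 6, 5, 4, 3, 2, 1]
for budget in (0, 10, 200):
    _, cost = anytime_schedule(durations, n_sites=3, budget=budget)
    print(budget, cost)
```

The defining property is that the answer quality never degrades as the budget grows, so the search can be interrupted whenever a decision is needed.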

15
Autonomic grids
  • Autonomic computing: computing systems that
    manage themselves in accordance with high-level
    objectives from humans
  • Self-configuration, self-healing, self-optimization,
    self-protection
  • Industrial (IBM, Bouygues) and academic (IEEE
    conference since 2004, ECML tutorial 2006, ...)
    activity
  • Machine learning techniques
  • Relevant for grids
  • The components are known to be complex systems
  • Very large datasets, but actually sparse in the
    state space
  • Non steady-state dynamic behavior
  • High computational (analysis) or operational
    (probing) costs

16
Autonomic grids
  • Autonomic computing: computing systems that
    manage themselves in accordance with high-level
    objectives from humans
  • Self-configuration, self-healing, self-optimization,
    self-protection
  • Industrial (IBM, Bouygues) and academic (IEEE
    conference since 2004, ECML tutorial 2006, ...)
    activity
  • Machine learning techniques
  • Techniques
  • Pruning data: feature selection and construction
  • Identifying the main modes of the grid:
    supervised (classification, regression,
    collaborative prediction) and unsupervised
    (clustering, ...) learning
  • On-line decision making: active learning,
    reinforcement learning

17
Autonomic dependability
  • On-line failure detection and anticipation
  • Passive vs. active probing: a lot of information
    is available from useful work
  • Black-box
  • On-line: statistics from "similar" actions
    (executions, data access, middleware modules),
    sequential testing
  • Off-line: supervised and unsupervised learning
  • Active probing
  • Adaptive on-line test selection for best coverage
    of possibly faulty components
  • Experiment planning
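
The sequential testing mentioned above can be sketched with Wald's sequential probability ratio test applied to a stream of success/failure outcomes from similar actions. The failure rates `p0` and `p1` and the error levels `alpha` and `beta` are assumed values chosen for illustration; the talk does not give any.

```python
import math


def sprt(outcomes, p0=0.05, p1=0.30, alpha=0.01, beta=0.01):
    """Wald's sequential probability ratio test on a failure stream.

    outcomes: iterable of booleans, True meaning the action failed.
    p0, p1: assumed failure rates of a healthy / faulty component.
    alpha, beta: tolerated false-alarm and missed-detection rates.
    Returns (decision, number_of_observations_used).
    """
    upper = math.log((1 - beta) / alpha)  # crossing it accepts "faulty"
    lower = math.log(beta / (1 - alpha))  # crossing it accepts "healthy"
    llr = 0.0  # cumulated log-likelihood ratio of faulty vs healthy
    n = 0
    for n, failed in enumerate(outcomes, start=1):
        if failed:
            llr += math.log(p1 / p0)
        else:
            llr += math.log((1 - p1) / (1 - p0))
        if llr >= upper:
            return "faulty", n
        if llr <= lower:
            return "healthy", n
    return "undecided", n


print(sprt([True, True, True, True]))
print(sprt([False] * 50))
```

The appeal for passive probing is that the test stops as soon as the evidence is sufficient, so a component that fails repeatedly is flagged after very few observations.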

18
Example: blackhole detection
  • Page-Hinkley statistics
  • Time-sequential version of Wald's statistics,
    also known as CUSUM
  • Provides an "intelligent threshold" test
  • [Figure labels: Software bug, Site fault, Blackhole]

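
A minimal sketch of the Page-Hinkley test in its usual form for detecting an upward shift of the mean. The monitored signal is assumed here to be a per-job failure indicator for one site (1 for a failed job, 0 otherwise); the talk does not specify the signal, and the `delta` and `threshold` values are illustrative.

```python
def page_hinkley(stream, delta=0.05, threshold=2.0):
    """Return the 1-based index at which an upward drift is flagged.

    delta: tolerated deviation from the running mean.
    threshold: alarm level (often written lambda).
    Returns None if no drift is detected.
    """
    mean = 0.0        # running mean of the observations
    cumulative = 0.0  # m_t: cumulated deviation from the running mean
    minimum = 0.0     # M_t: smallest m_t seen so far
    for t, x in enumerate(stream, start=1):
        mean += (x - mean) / t
        cumulative += x - mean - delta
        minimum = min(minimum, cumulative)
        if cumulative - minimum > threshold:
            return t
    return None


# Assumed scenario: 20 successful jobs, then every job fails.
signal = [0] * 20 + [1] * 20
print(page_hinkley(signal))
```

The gap between the cumulated deviation and its running minimum stays at zero while the signal is stable and grows quickly once failures accumulate, which is what makes the threshold adaptive rather than fixed on the raw failure rate.
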
19
Example: mining the Logging and Bookkeeping data
  • Aggressive sub-sampling
  • Constructive feature induction
  • SVM and ROGER (ROC-based genetic learner)
  • Double clustering (Slonim & Tishby 2000) with
    K-means
  • First clustering: compress the features along
    the examples
  • Second clustering: cluster the examples along the
    synthetic features
  • Stable clusters
  • Good prediction performance when ABU is excluded
  • Mostly pure clusters; natural use for detection;
    diagnosis in progress
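
The two clustering stages can be sketched as follows, under stated assumptions: a plain Lloyd's K-means seeded with the first k points, and synthetic features built by averaging the original features within each feature cluster. Both choices, and the toy data, are simplifications for illustration; they are not necessarily what was used on the L&B data.

```python
def kmeans(points, k, iterations=20):
    """Plain Lloyd's k-means; the first k points seed the centroids."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid in squared Euclidean distance.
        for i, p in enumerate(points):
            labels[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def double_clustering(rows, n_feature_clusters, n_example_clusters):
    """Cluster features first, then cluster examples on merged features."""
    # First clustering: each feature is a point described by the examples.
    columns = [list(col) for col in zip(*rows)]
    feature_labels = kmeans(columns, n_feature_clusters)
    # Synthetic features: average of the features in each feature cluster.
    synthetic = []
    for row in rows:
        sums = [0.0] * n_feature_clusters
        counts = [0] * n_feature_clusters
        for value, lab in zip(row, feature_labels):
            sums[lab] += value
            counts[lab] += 1
        synthetic.append([s / c if c else 0.0 for s, c in zip(sums, counts)])
    # Second clustering: examples described by the synthetic features.
    return feature_labels, kmeans(synthetic, n_example_clusters)


# Toy data: 6 examples, 4 features; features 0 and 2 move together.
rows = [
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [1, 0, 0.9, 0],
    [0, 1, 0, 0.9],
]
feature_labels, example_labels = double_clustering(rows, 2, 2)
print(feature_labels, example_labels)
```

Compressing the features first reduces the dimension in which the examples are compared, which matters when the raw description has several hundred attributes.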

20
Other areas for interaction
  • Differentiated services without advance
    reservation
  • Urgent, but not fully explicit, requirement:
    priorities (analysis), hard real-time (critical),
    soft real-time (interactive), latency (isolated
    job), bandwidth, ...
  • Two EGEE working groups
  • SDJ (Short Deadline Jobs) based on controlled
    time-sharing
  • Priorities based on refinements of VO (role) and
    CE (VOView)
  • Necessarily short-term solutions, but useful for
    exploring constraints and basic support
    mechanisms
  • Basic research: no quantum; fair share is at the
    accounting time scale
  • Working with gLite
  • A turn-key deployment of gLite on Xen is planned

21
Conclusion
  • The Grid Observatory targets
  • Contributions to a quantitative approach of grid
    middleware and architecture, in the RISC sense,
    through data collection, publication and
    characterization
  • Statistically significant benchmarks
  • Basic research in autonomic computing
  • Operational impacts on EGEE: evaluation,
    autonomic dependability
  • Advances in grid ontology
  • The Grid Observatory data and analysis are
    relevant for computer science, middleware
    development and system administration, well
    beyond EGEE