Title: Systèmes distribués à grande échelle (Large-Scale Distributed Systems)
1. A Nation-Wide Experimental Grid
2. Grid: A Renewal of Distributed System Problems
Grid raises many research issues: security, performance, fault tolerance, load balancing, fairness, coordination, message passing, data storage, programming, communication protocols and architecture, deployment, etc.
Theoretical models and simulators cannot capture real-life conditions. Production platforms have great difficulty reproducing experimental conditions.
- How to test and compare?
- - Fault tolerance protocols
- - Security mechanisms
- - Deployment tools
- - etc.
3. Tools for Distributed System Studies
To investigate Distributed System issues, we need:
1) Tools (models, simulators, emulators, experimental platforms)
2) Strong interaction between these research tools
Tools for Large Scale Distributed Systems
[Figure: research tools plotted by log(cost) against log(realism)]
- math: models of systems, applications, platforms, conditions
- simulation: key system mechanisms, algorithms and application kernels, virtual platforms, synthetic conditions
- emulation: real systems, real applications, in-lab platforms, synthetic conditions
- live systems: real systems, real applications, real platforms, real conditions
4. We need a Grid experimental platform
To our current knowledge, there is no large-scale testbed dedicated to Grid experiments.
- Grid5000 as a live system
- Grid eXplorer as a large scale emulator
[Figure: existing tools placed on a log(cost) versus log(realism) plot]
- math: models, protocol proofs
- simulation: SimGrid, MicroGrid, Bricks, NS, etc.
- emulation: Grid eXplorer, WANinLab, Emulab
- live systems: Grid5000, TERAGrid, PlanetLab, Naregi Testbed
5. What do we need for Grid experiments?
- Remotely controllable Grid nodes installed in geographically distributed laboratories
- A controllable and monitorable network between the Grid nodes
- A middleware infrastructure connecting the nodes (security)
- A playground to prepare experiments
- A toolkit to deploy, manage, run experiments and collect results
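The last requirement, a toolkit to deploy, manage, run experiments and collect results, can be pictured as a small pipeline. The sketch below is purely illustrative: `Node`, `Experiment`, and the site, host, and environment names are hypothetical and do not correspond to any actual Grid5000 tool. Remote execution is simulated by a local function so the example runs anywhere.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Node:
    """A remotely controllable Grid node (hypothetical model)."""
    site: str
    name: str
    environment: str = ""  # software environment currently deployed


@dataclass
class Experiment:
    """Deploy an environment, run a task on every node, collect results."""
    nodes: List[Node]
    results: Dict[str, str] = field(default_factory=dict)

    def deploy(self, environment: str) -> None:
        # A real toolkit would reconfigure each node remotely;
        # here we only record which environment the node should run.
        for node in self.nodes:
            node.environment = environment

    def run(self, task: Callable[[Node], str]) -> None:
        # Run the task on every node and store its output per node.
        for node in self.nodes:
            self.results[f"{node.site}/{node.name}"] = task(node)

    def collect(self) -> Dict[str, List[str]]:
        # Group the stored outputs by site for later analysis.
        by_site: Dict[str, List[str]] = {}
        for key, output in sorted(self.results.items()):
            site = key.split("/", 1)[0]
            by_site.setdefault(site, []).append(output)
        return by_site


if __name__ == "__main__":
    nodes = [Node("site1", "node-1"), Node("site1", "node-2"), Node("site2", "node-1")]
    xp = Experiment(nodes)
    xp.deploy("test-env")
    xp.run(lambda n: f"{n.name} ran under {n.environment}")
    print(xp.collect())
```

Keeping deployment, execution, and collection as separate steps mirrors the wording of the requirement and makes it possible to repeat the run step under the same deployed environment.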
6. The Grid5000 Project
- 1) Build a nation-wide experimental platform for Grid research (like a particle accelerator for computer scientists)
- - 10/11 geographically distributed sites
- - Every site hosts a cluster (from 256 CPUs to 1K CPUs)
- - All sites are connected by RENATER (French academic network)
- - RENATER hosts probes to trace network load conditions
- - Design and develop a system/middleware environment to safely test and repeat experiments
- 2) Use the platform for Grid experiments
- - Address critical issues of Grid system/middleware: programming, scalability, fault tolerance, scheduling
- - Address critical issues of Grid networking: high-performance transport protocols, QoS
- - Port and test applications
- - Investigate original mechanisms: P2P resource discovery, Desktop Grids
7. Grid5000 Big Picture
[Figure: Grid5000 architecture diagram. Recoverable labels: Control site; Site 1; Site 2; Site 3; Users (SSH login, password); Front end; Control Master; Control Slave (shown twice); LAB/Firewall; Router; Test Cluster (shown twice); Firewall/NAT; Labs Network; Gateway VPN (192. for all nodes)]
One machine can be seen as a Virtual Grid Gateway.
8. Grid5000 Committees
Technical Committee:
- David Gueldrech (Sophia)
- Jean Claude Barbet (Orsay)
- Franck Bonnassieux (UREC)
- Julien le duc (Grenoble)
- Fred Desprez (Lyon)
- Yvon Jégou (Rennes)
- Olivier Coulaud (Bordeaux)
- Frédéric Barbaresco (Toulouse)
Steering Committee (organizer: Franck Cappello, Orsay):
- Thierry Priol (ACI Grid Director)
- Brigitte Plateau (President of ACI Grid SC)
- Dani Vandrome (Director of Renater)
- Frédéric Desprez (Lyon)
- Michel Daydé (Toulouse)
- Yvon Jégou (Rennes)
- Stéphane Lantéri (Sophia)
- Raymond Namyst (Bordeaux)
- Pascale Primet (Lyon)
- Olivier Richard (Grenoble)
Forums:
- Deployment/exploitation: Franck Cappello (AS1, RTP8)
- Programming models: Raymond Namyst (AS2, RTP8)
9. Grid5000 Schedule
[Figure: project timeline]
- Milestones: Call for Expression Of Interest; Vendor selection; Installation, first tests; Final review; First Demo (SC04); Call for proposals; Selection of 7 sites; ACI GRID Funding
- Activities: Grid5000 Hardware; Grid5000 System/middleware Forum; Security Prototypes; Control Prototypes; Grid5000 Programming Forum; Grid5000 Builder Community; Grid5000 Experiments
- Dates on the time axis: Sept 03, Nov 03, Jan 04, March 04, Jun/July 04, Sept 04, Oct 04, Nov 04
10. Grid5000 Funding (ACI, Local District/Prefecture)
[Figure: funding amounts shown on the slide: 0.6M, 0.4, 0.5, 0.35, 0.5, 0.3?, 0.35]
Grid5000 2004: 3M for hardware only
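As a quick arithmetic check, the seven amounts listed on this slide can be summed and compared with the 3M hardware figure. This is only a sketch under stated assumptions: decimal commas are read as decimal points, all seven amounts are assumed to be in the same unit (M), and the entry marked with a question mark is taken at face value. Which site or funder each amount belongs to is not recoverable from the slide.

```python
from decimal import Decimal

# Amounts as listed on the slide, in M (decimal commas read as points).
# The sixth entry carries a "?" on the slide, so it is an uncertain value.
amounts_m = [Decimal(x) for x in ("0.6", "0.4", "0.5", "0.35", "0.5", "0.3", "0.35")]

total_m = sum(amounts_m)
hardware_budget_m = Decimal("3")

print(f"sum of listed amounts: {total_m}M")
print(f"matches the 3M hardware figure: {total_m == hardware_budget_m}")
```

Under these assumptions the listed amounts add up to exactly 3M, which is consistent with the "3M for hardware only" note.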
11. Grid5000 in September 2004
[Figure: Grid5000 nodes. Recoverable labels: "3", "(soon 4)"]
12. Summary of Grid5000 XPs
- Networking
- - End Host Communication layer
- - High performance long distance protocols
- - High Speed Network Emulation
- - Grid Networking Layer
- Middleware / OS
- - Grid5000 control/access/experiment automation
- - Scheduling / data distribution in Grid
- - Fault tolerance in Grid
- - Resource management
- - Computational Steering
- - Grid SSI OS and Grid I/O
- - Desktop Grid/P2P systems
- Programming
- - Component programming for the Grid (Java, Corba)
- - GRID-RPC
- - GRID-MPI
- - Code Coupling
- Applications
13. Middleware 1 (XP) Grid5000
XP: eXPeriments on
- Grid5000 control
- - Computing environment deployment (Ka-tools)
- - Experiment automation (security and control)
- - VGrid: mapping a virtual Grid on a real testbed
- - Monitoring, benchmarking, performance characterization and analysis
- Grid scheduling / data distribution
- - Scheduling: data transfers, global communications, work stealing, ...
- - Data redistribution in Grid
- - Task distribution and load balancing in heterogeneous Grid
- - Mixed parallelism (task and data parallelism)
- - Mixing data management and task scheduling
- - Hierarchical and distributed scheduling
- Fault tolerance in Grid
- - Fault-tolerant Grid-RPC (RPC-V)
- - Hierarchical fault-tolerant MPI (MPICH-V)
- - Fault tolerance in the data-flow approach (Athapascan)
14. Middleware 2 (XP) Grid5000
- Grid Management
- - AROMA tool: resource management over a Grid of clusters with different classes of services
- - Mobile agents for open Grid management
- - Management of Grids and hosted services (security, QoS, monitoring control, dynamic configuration, ...)
- Optimization for wide area distributed query processing
- Tools to support the development, administration and usage of heterogeneous resources over the Grid
- Virtualization of data storage on Grids
- Automatic deployment of GridRPC middle tier
- - Multiclusters and lightweight Grid resource management (OAR/CIGRI)
- Global Computing/P2P Middleware
- - Executing Web Services on Desktop Grid workers (XtremWeb)
- - Distributing the coordination in Desktop Grids (XtremWeb)
- - Harnessing clusters as parallel workers
- - Probabilistic certification in peer-to-peer systems
- - Large scale data sharing service based on JXTA (JuxMem)
- - Management services for textual documents in P2P systems
15. Network (XP) Grid5000
- End Host Communication layer
- Communication libraries: Madeleine, MPICH/Madeleine
- - Intelligent usage of NICs for local and wide area communications
- - Direct file access over Myrinet: ORFA/NFS and ORFA/LUSTRE
- High performance long distance protocols
- - Alternative transport for very high speed networks (backpressure)
- - Differentiated transport with delay control on WAN
- Reliable active and non-active multicast
- Network bandwidth optimization in Grid (VTHD, Paco)
- - High performance communication across heterogeneous networks
- Fast forwarding and multiplexing of data on gateway nodes
- High Speed Network Emulation
- - Automatic deployment of emulated high speed domains
- - Experiment design for grid flow interaction studies
- Grid Networking Layer
- - Network resource and QoS on demand
- - Grid overlay and programmable routers
- Measurement services for network-aware middleware
16. Programming (XP) Grid5000
- Component programming on the grid
- - ProActive: a Java library (parallel, distributed, concurrent computing with security and mobility)
- Assessment of scalability, deployment, security and fault tolerance issues
- Hierarchical components architecture
- PadicoTM/Paco: combining parallel and distributed computing
- RPC Environment
- Large scale experimentation of the DIET platform (Distributed Interactive Engineering Toolbox)
- Client/Agent/Server model following the GridRPC standard with distributed scheduling agents
- MPI Environment
- - Time sharing Grid resources
- Migration over clusters with heterogeneous high speed networks
- Code Coupling
17. Applications 1 (XP) Grid5000
- Multi-parametric applications
- - ACI GRID-TLSE Project: expertise site for sparse linear algebra
- - Climate modeling and Global Change
- DataGène Project: functional genomics
- Large scale experimentation of distributed applications
- MECAGRID (ACI GRID project, Smash project-team)
- Massively parallel computations in multi-material fluid mechanics
- Study of numerical algorithms for heterogeneous computing platforms
- Grid computing for medical applications (Epidaure project-team)
- Interoperable medical image registration grid service
- Optimal design of complex systems (Coprin project-team)
- Evaluation of parallel optimization algorithms based on interval analysis techniques
- Study of load balancing strategies on heterogeneous resources
- Fluid mechanics, molecular dynamics and host-parasite systems in population dynamics, etc.
- CFD, astrophysics, applications
- Collaborating tools in virtual 3D environment
18. Applications 2 (XP) Grid5000
- Steering
- JECS: a Java Environment for Computational Steering
- Distributed computing and interactive visualization of 3D numerical simulations (Caiman and Oasis project-teams)
- Collaborative environment
- Computational Electromagnetism application (JEM3D)
- Steering of numerical simulations (ACI GRID-EPSN Project)
- Parallel on-line visualization / monitoring
- Data Redistribution
- Computational Steering by direct image manipulation