Design and Implementation of a Single System Image Operating System for High Performance Computing on Clusters

Provided by: cmo79

Learn more at: https://www.cs.sandia.gov

1
Design and Implementation of a Single System Image Operating System for High Performance Computing on Clusters

Christine MORIN
PARIS project-team, IRISA/INRIA (Rennes, France)

2
Motivation

Clusters as an alternative to multiprocessor machines for high performance computing
Workloads of scientific applications
Independent sequential processes
Compute intensive, huge memory requirements
Parallel applications
Shared memory (multithreaded applications, OpenMP)
Message passing (MPI)
Hybrid applications

3
Some Issues

No obvious solution to support standard POSIX multithreaded applications on clusters
Memory distribution
Need for efficient placement and load-balancing strategies to take advantage of all cluster resources
Efficient process migration
Execution time of scientific applications may exceed the cluster MTBF
High availability and checkpointing
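
To make the MTBF concern concrete, here is a minimal sketch under a simplifying assumption that is not in the slides: failures are taken to follow an exponential distribution whose mean is the cluster MTBF. The function name `prob_no_failure` and the example figures are hypothetical.

```python
import math


def prob_no_failure(run_hours: float, mtbf_hours: float) -> float:
    """Probability that a run of `run_hours` completes without any failure.

    Assumes exponentially distributed failure times (an assumption made
    for illustration, not a claim from the presentation).
    """
    if run_hours < 0 or mtbf_hours <= 0:
        raise ValueError("run_hours must be >= 0 and mtbf_hours must be > 0")
    return math.exp(-run_hours / mtbf_hours)


# A run exactly as long as the MTBF finishes untouched about 37% of the time.
print(round(prob_no_failure(100.0, 100.0), 3))  # 0.368
```

Under this model, a run longer than the MTBF is more likely to be interrupted than not, which is why the slide pairs long-running applications with checkpointing.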

4
Single System Image Operating System

Vision of a single machine (virtual SMP)
Same interface as a traditional OS for an SMP machine
Same vision for all applications
Efficiency
Properties of an SSI OS
Resource distribution transparency
Intra- and inter-application resource sharing
High availability
Scalability

5
Kerrighed SSI OS

Combining high performance, high availability and ease of programming
Global resource management
Processor, memory, disk
Integrated resource management
Dynamic resource management
To deal with configuration changes
Extension of the standard OS running on each node
Small clusters (< 100 nodes)

6
Outline

Global process management
Global memory management
Conclusion and Perspectives

7
Global Process Management

Global scheduling policy
Load balancing
Several policies
Configurable modular global scheduler
The policy can be changed without stopping the operating system or the applications
The local scheduler on each node is not modified
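
The idea of a policy that can be replaced while everything keeps running can be sketched as follows. This is only an illustration, not Kerrighed's implementation: the class `GlobalScheduler`, the two policy functions, and the load metric (number of runnable processes per node) are all hypothetical names chosen for the example.

```python
import threading
from typing import Callable, Dict, Optional, Tuple

# A policy maps per-node load to an optional (source, destination) migration.
Decision = Optional[Tuple[str, str]]
Policy = Callable[[Dict[str, int]], Decision]


def never_migrate(loads: Dict[str, int]) -> Decision:
    """Policy that leaves every process where it is."""
    return None


def balance_load(loads: Dict[str, int]) -> Decision:
    """Policy that moves work from the busiest node to the idlest one."""
    busiest = max(loads, key=loads.get)
    idlest = min(loads, key=loads.get)
    if loads[busiest] - loads[idlest] > 1:
        return (busiest, idlest)
    return None


class GlobalScheduler:
    """Holds the current policy; the per-node local schedulers are untouched."""

    def __init__(self, policy: Policy) -> None:
        self._policy = policy
        self._lock = threading.Lock()

    def set_policy(self, policy: Policy) -> None:
        """Swap the policy without stopping the scheduler or the applications."""
        with self._lock:
            self._policy = policy

    def decide(self, loads: Dict[str, int]) -> Decision:
        with self._lock:
            policy = self._policy
        return policy(loads)


sched = GlobalScheduler(never_migrate)
loads = {"node1": 5, "node2": 1}
print(sched.decide(loads))      # None
sched.set_policy(balance_load)  # changed on the fly
print(sched.decide(loads))      # ('node1', 'node2')
```

Keeping the policy behind a single replaceable reference is one simple way to get the "configurable modular" property: the decision logic changes, while the code that applies decisions does not.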

8
Architecture of the Global Scheduler

[Figure: two nodes (Node 1, Node 2), each showing a Global scheduler, Local Analyzers, and Monitors on top of the Standard OS]
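
The figure names three components per node but does not describe their roles. The sketch below assumes one plausible reading: monitors sample the unmodified OS, local analyzers turn samples into events, and the global scheduler receives those events. All class names, the `cpu_load` metric, and the threshold and window values are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class Sample:
    node: str
    cpu_load: float  # hypothetical metric in [0, 1]


class Monitor:
    """Reads raw measurements from the node's standard OS (stubbed here)."""

    def __init__(self, node: str, readings: Iterable[float]) -> None:
        self.node = node
        self._readings = iter(readings)

    def sample(self) -> Sample:
        return Sample(self.node, next(self._readings))


class LocalAnalyzer:
    """Flags a node once its load stays above a threshold for a full window."""

    def __init__(self, threshold: float = 0.8, window: int = 3) -> None:
        self.threshold = threshold
        self.window = window
        self._recent: List[float] = []

    def analyze(self, s: Sample) -> Optional[str]:
        self._recent = (self._recent + [s.cpu_load])[-self.window:]
        if len(self._recent) == self.window and min(self._recent) > self.threshold:
            return s.node
        return None


class GlobalSchedulerStub:
    """Records which nodes were reported as overloaded."""

    def __init__(self) -> None:
        self.overloaded: List[str] = []

    def notify(self, node: str) -> None:
        if node not in self.overloaded:
            self.overloaded.append(node)


monitor = Monitor("node1", [0.5, 0.9, 0.95, 0.9])
analyzer = LocalAnalyzer()
scheduler = GlobalSchedulerStub()
for _ in range(4):
    event = analyzer.analyze(monitor.sample())
    if event is not None:
        scheduler.notify(event)
print(scheduler.overloaded)  # ['node1']
```

Filtering in the analyzer means the global scheduler only hears about sustained conditions, not every transient spike.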

9
Process Management Mechanisms

[Figure: Global scheduler (Application management); Process creation, Process checkpointing, Process migration; Process state extraction]
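
One way to read this figure is that creation, checkpointing, and migration all rest on a single state-extraction mechanism. The sketch below illustrates that reading only; it is an interpretation, not a description of Kerrighed's code. `ProcessState`, `Node`, and the three operations are hypothetical, and remote creation is assumed here to copy a parent's state, fork-style.

```python
import copy
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class ProcessState:
    pid: int
    registers: Dict[str, int] = field(default_factory=dict)
    memory: Dict[int, bytes] = field(default_factory=dict)  # page -> contents


def extract_state(proc: ProcessState) -> ProcessState:
    """Take a self-contained snapshot of a process (the shared mechanism)."""
    return copy.deepcopy(proc)


class Node:
    def __init__(self, name: str) -> None:
        self.name = name
        self.processes: Dict[int, ProcessState] = {}
        self.checkpoints: Dict[int, ProcessState] = {}

    def restore(self, state: ProcessState) -> None:
        self.processes[state.pid] = state


def checkpoint(node: Node, pid: int) -> None:
    """Checkpoint = extract the state and keep it."""
    node.checkpoints[pid] = extract_state(node.processes[pid])


def migrate(src: Node, dst: Node, pid: int) -> None:
    """Migration = extract the state, restore it elsewhere, drop the original."""
    dst.restore(extract_state(src.processes[pid]))
    del src.processes[pid]


def remote_create(src: Node, dst: Node, parent_pid: int, child_pid: int) -> None:
    """Remote creation = extract the parent's state and restore it as a child."""
    child = extract_state(src.processes[parent_pid])
    child.pid = child_pid
    dst.restore(child)


n1, n2 = Node("node1"), Node("node2")
n1.restore(ProcessState(pid=1, registers={"pc": 42}, memory={0: b"data"}))
checkpoint(n1, 1)
remote_create(n1, n2, parent_pid=1, child_pid=2)
migrate(n1, n2, 1)
print(sorted(n2.processes))  # [1, 2]
```

The three operations differ only in what they do with the extracted snapshot, which is the appeal of building them on one mechanism.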

10
Checkpointing

Common mechanisms for supporting checkpointing protocols for both shared memory and message-passing applications
Efficient checkpoint creation
Several memory checkpoints between two disk checkpoints
Disk checkpoints stored on local disks
Incremental checkpoints
Combination of data replication for efficiency and for high availability for shared memory applications
Data replication due to data sharing exploited to decrease the cost of checkpoint creation
Recovery data can be used for the computation until the first modification
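
Two of the points above, incremental checkpoints and several memory checkpoints between two disk checkpoints, can be sketched together. This is a minimal single-process illustration under stated assumptions: pages are tracked in a dictionary, a plain dictionary stands in for the local disk, and the class name `IncrementalCheckpointer` and the `disk_every` parameter are hypothetical.

```python
from typing import Dict, Set


class IncrementalCheckpointer:
    def __init__(self, disk_every: int = 3) -> None:
        self.pages: Dict[int, bytes] = {}        # live memory
        self._dirty: Set[int] = set()            # pages written since last checkpoint
        self.memory_ckpt: Dict[int, bytes] = {}  # latest in-memory recovery copy
        self.disk_ckpt: Dict[int, bytes] = {}    # stand-in for the local disk
        self.disk_every = disk_every
        self._count = 0

    def write(self, page: int, data: bytes) -> None:
        self.pages[page] = data
        self._dirty.add(page)

    def checkpoint(self) -> int:
        """Save only the pages modified since the previous checkpoint.

        Every `disk_every`-th checkpoint is also copied to the disk stand-in.
        Returns the number of pages saved.
        """
        saved = len(self._dirty)
        for page in self._dirty:
            self.memory_ckpt[page] = self.pages[page]
        self._dirty.clear()
        self._count += 1
        if self._count % self.disk_every == 0:
            self.disk_ckpt = dict(self.memory_ckpt)
        return saved

    def rollback(self) -> None:
        """Discard changes made since the last memory checkpoint."""
        self.pages = dict(self.memory_ckpt)
        self._dirty.clear()


ckpt = IncrementalCheckpointer(disk_every=2)
ckpt.write(0, b"a")
ckpt.write(1, b"b")
print(ckpt.checkpoint())  # 2 (both pages are new)
ckpt.write(1, b"B")
print(ckpt.checkpoint())  # 1 (only page 1 changed; also the disk checkpoint)
ckpt.write(0, b"x")
ckpt.rollback()
print(ckpt.pages)         # {0: b'a', 1: b'B'}
```

The sketch leaves out the replication aspect of the slide: in a cluster, the recovery copy would live on another node rather than in the same address space.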