Design and Implementation of a Single System Image Operating System for High Performance Computing on Clusters

1
Design and Implementation of a Single System
Image Operating System for High Performance
Computing on Clusters
  • Christine MORIN
  • PARIS project-team, IRISA/INRIA (Rennes, France)

2
Motivation
  • Clusters as an alternative to multiprocessor
    machines for high performance computing
  • Workloads of scientific applications
  • Independent sequential processes
  • Compute intensive, huge memory requirements
  • Parallel applications
  • Shared memory (multithreaded applications,
    OpenMP)
  • Message passing (MPI)
  • Hybrid applications

3
Some Issues
  • No obvious solution to support standard POSIX
    multithreaded applications on clusters
  • Memory distribution
  • Need for efficient placement and load-balancing
    strategies to take advantage of all cluster
    resources
  • Efficient process migration
  • The execution time of scientific applications may
    exceed the cluster MTBF
  • High availability and checkpointing
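
The MTBF concern above can be made concrete with a little arithmetic. The sketch below assumes independent node failures with exponentially distributed lifetimes, which the slides do not state; the node MTBF, node count and run length are hypothetical numbers chosen only for illustration.

```python
import math


def cluster_mtbf(node_mtbf_hours: float, nodes: int) -> float:
    """Cluster MTBF when the failure of any single node counts as a failure.

    Assumes independent nodes with exponentially distributed lifetimes.
    """
    return node_mtbf_hours / nodes


def survival_probability(run_hours: float, mtbf_hours: float) -> float:
    """Probability that a run of the given length sees no failure at all."""
    return math.exp(-run_hours / mtbf_hours)


# Hypothetical figures, for illustration only.
node_mtbf = 10_000.0   # hours, per node
nodes = 100
run = 200.0            # hours of computation

mtbf = cluster_mtbf(node_mtbf, nodes)      # 100 hours for the whole cluster
p = survival_probability(run, mtbf)        # exp(-2)
print(f"cluster MTBF = {mtbf:.0f} h, P(no failure) = {p:.3f}")
# prints: cluster MTBF = 100 h, P(no failure) = 0.135
```

Under these assumptions a run twice as long as the cluster MTBF completes without a failure only about one time in seven, which is why checkpointing appears next to high availability in the list.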

4
Single System Image Operating System
  • Vision of a single machine (virtual SMP)
  • Same interface as a traditional OS for an SMP
    machine
  • Same vision for all applications
  • Efficiency
  • Properties of an SSI OS
  • Resource distribution transparency
  • Intra- and inter-application resource sharing
  • High availability
  • Scalability

5
Kerrighed SSI OS
  • Combining high performance, high availability and
    ease of programming
  • Global resource management
  • Processor, memory, disk
  • Integrated resource management
  • Dynamic resource management
  • To deal with configuration changes
  • Extension of the standard OS running on each node
  • Small clusters
  • < 100 nodes

6
Outline
  • Global process management
  • Global memory management
  • Conclusion and Perspectives

7
Global Process Management
  • Global scheduling policy
  • Load balancing
  • Several policies
  • Configurable modular global scheduler
  • The policy can be changed without stopping the
    operating system or the applications
  • The local scheduler on each node is not modified
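
A minimal sketch of what a runtime-swappable policy could look like. It is not Kerrighed's implementation: the names `GlobalScheduler`, `threshold_policy`, `never_migrate` and the load-report format are all hypothetical, and "load" is reduced to a count of runnable processes per node.

```python
from typing import Callable, Dict, Optional, Tuple

# Hypothetical format: node name -> number of runnable processes.
LoadReport = Dict[str, int]
# A policy returns (source_node, target_node) for one migration, or None.
Policy = Callable[[LoadReport], Optional[Tuple[str, str]]]


def threshold_policy(load: LoadReport, gap: int = 2) -> Optional[Tuple[str, str]]:
    """Move one process from the busiest to the idlest node if the gap is large."""
    busiest = max(load, key=load.get)
    idlest = min(load, key=load.get)
    if load[busiest] - load[idlest] >= gap:
        return busiest, idlest
    return None


def never_migrate(load: LoadReport) -> Optional[Tuple[str, str]]:
    """Trivial policy: leave every process where it is."""
    return None


class GlobalScheduler:
    """Holds the current policy; the per-node local scheduler is not involved."""

    def __init__(self, policy: Policy) -> None:
        self.policy = policy

    def set_policy(self, policy: Policy) -> None:
        # Only a reference is swapped: nothing is stopped or restarted.
        self.policy = policy

    def step(self, load: LoadReport) -> LoadReport:
        """Apply one scheduling decision and return the resulting load."""
        decision = self.policy(load)
        new_load = dict(load)
        if decision is not None:
            src, dst = decision
            new_load[src] -= 1
            new_load[dst] += 1
        return new_load


scheduler = GlobalScheduler(never_migrate)
load = {"node1": 5, "node2": 1}
load = scheduler.step(load)               # current policy: do nothing
scheduler.set_policy(threshold_policy)    # policy changed while running
load = scheduler.step(load)               # one process moves node1 -> node2
print(load)  # {'node1': 4, 'node2': 2}
```

Keeping the policy behind a plain callable is one way to separate the decision (which process goes where) from the mechanisms that carry it out, so the decision logic can be replaced independently.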

8
Architecture of the Global Scheduler
  • [Figure: Node 1 and Node 2 each show the same components: global scheduler, local analyzers, monitors, standard OS]

9
Process Management Mechanisms
  • [Figure, same components shown twice: global scheduler (application management); process creation, process checkpointing, process migration; process state extraction]

10
Checkpointing
  • Common mechanisms for supporting checkpointing
    protocols for both shared memory and
    message-passing applications
  • Efficient checkpoint creation
  • Several memory checkpoints between two disk
    checkpoints
  • Disk checkpoints stored on local disks
  • Incremental checkpoints
  • Data replication serves both efficiency and high
    availability for shared memory applications
  • Replicas already created by data sharing are
    exploited to decrease the cost of checkpoint creation
  • Recovery data can be used for the computation
    until the first modification
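
The interplay of incremental, memory and disk checkpoints listed above can be sketched as follows. This is a toy model under stated assumptions, not the Kerrighed mechanism: pages are plain byte strings, the "disk" is a dictionary, the class name `IncrementalCheckpointer` and the `disk_every` parameter are hypothetical, and memory checkpoints are kept locally rather than replicated on another node.

```python
from typing import Dict, List, Set


class IncrementalCheckpointer:
    """Toy model: pages are byte strings keyed by page number."""

    def __init__(self, disk_every: int = 3) -> None:
        self.pages: Dict[int, bytes] = {}
        self.dirty: Set[int] = set()               # pages written since last checkpoint
        self.memory_ckpt: Dict[int, bytes] = {}    # latest recovery copy of each page
        self.disk_ckpt: Dict[int, bytes] = {}      # stands in for the local disk
        self.disk_every = disk_every
        self.count = 0

    def write(self, page: int, data: bytes) -> None:
        self.pages[page] = data
        self.dirty.add(page)

    def checkpoint(self) -> List[int]:
        """Save only the pages modified since the last checkpoint; return them."""
        saved = sorted(self.dirty)
        for page in saved:
            self.memory_ckpt[page] = self.pages[page]
        self.dirty.clear()
        self.count += 1
        if self.count % self.disk_every == 0:
            # Every disk_every-th checkpoint is also written to disk.
            self.disk_ckpt = dict(self.memory_ckpt)
        return saved

    def restore(self) -> None:
        """Roll back to the most recent memory checkpoint."""
        self.pages = dict(self.memory_ckpt)
        self.dirty.clear()


ckpt = IncrementalCheckpointer(disk_every=2)
ckpt.write(0, b"a")
ckpt.write(1, b"b")
first = ckpt.checkpoint()      # saves pages 0 and 1 (memory only)
ckpt.write(1, b"B")
second = ckpt.checkpoint()     # saves only page 1, then flushes to disk
ckpt.write(0, b"lost")         # modification after the last checkpoint
ckpt.restore()                 # page 0 is b"a" again
```

Only the second checkpoint touches the disk, and it copies a single page; the cost of a checkpoint follows the amount of modified data rather than the size of the address space.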

11
Process Migration
  • Communicating processes can migrate
  • Processes sharing memory
  • Processes communicating with data streams
    (sockets, pipes, ...)
  • Efficiency of the process transfer
  • Address space transferred on demand (containers)
  • Efficiency of the process execution after
    migration
  • Efficient access to open files (containers)
  • Global management of data streams
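
On-demand transfer of the address space can be illustrated with a small model in which a migrated process pulls a page from its origin node only when it first touches it. The classes `Node` and `MigratedProcess` are hypothetical and stand in for mechanisms the slides attribute to containers; real page faults, networking and coherence are left out.

```python
from typing import Dict


class Node:
    """A cluster node with the pages physically present in its memory."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.pages: Dict[int, bytes] = {}


class MigratedProcess:
    """A process whose pages stay on the origin node until first touched."""

    def __init__(self, origin: Node, current: Node) -> None:
        self.origin = origin
        self.current = current
        self.remote_fetches = 0

    def read(self, page: int) -> bytes:
        if page not in self.current.pages:
            # Simulated page fault: pull just this page from the origin node.
            self.current.pages[page] = self.origin.pages.pop(page)
            self.remote_fetches += 1
        return self.current.pages[page]


node1, node2 = Node("node1"), Node("node2")
node1.pages = {n: str(n).encode() for n in range(1000)}   # large address space

proc = MigratedProcess(origin=node1, current=node2)   # migration copies no pages
proc.read(7)
proc.read(7)     # already local, no second fetch
proc.read(42)
print(proc.remote_fetches, len(node2.pages), len(node1.pages))  # 2 2 998
```

In this model the migration itself moves no memory, and pages the process never touches again never cross the network.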

12
Global Memory Management
  • Different services
  • Shared virtual memory
  • Remote paging
  • Cooperative file cache
  • A unique concept: the container
  • Software object to store and share data
    cluster-wide (COMA-like management)
  • Global management of physical memory
  • Segments of a process address space and files are
    associated with containers
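
A very small sketch of the container idea: one object that stores pages and tracks which nodes currently hold a copy, so that data follows its users in a COMA-like way. The `Container` class and its write-invalidate behaviour are assumptions made for illustration; the slides do not specify the coherence protocol.

```python
from typing import Dict, Set


class Container:
    """Toy cluster-wide object: tracks which nodes hold a copy of each page."""

    def __init__(self) -> None:
        self.data: Dict[int, bytes] = {}
        self.copies: Dict[int, Set[str]] = {}

    def read(self, node: str, page: int) -> bytes:
        # Reading attracts a copy of the page into the reader's memory.
        self.copies.setdefault(page, set()).add(node)
        return self.data.get(page, b"")

    def write(self, node: str, page: int, value: bytes) -> None:
        # Writing invalidates all other copies; the writer holds the only one.
        self.data[page] = value
        self.copies[page] = {node}


# The same interface could back a memory segment or a file.
segment = Container()
segment.write("node1", 0, b"x=1")
segment.read("node2", 0)
segment.read("node3", 0)
holders_after_reads = set(segment.copies[0])   # node1, node2 and node3
segment.write("node3", 0, b"x=2")
holders_after_write = set(segment.copies[0])   # only node3
```

Because memory segments and files would go through the same object, shared virtual memory, remote paging and a cooperative file cache can in principle share one mechanism, which is the point the slide makes with "a unique concept".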

13
Integration of Containers in a Standard OS

14
Conclusion and Perspectives
  • An SSI OS for clusters is still missing in 2003
  • Kerrighed represents a promising approach
  • A first prototype based on Linux is available
  • Current work directions
  • High availability and checkpointing
  • OpenMP on Kerrighed
  • Experimentation with industrial applications
  • EDF, DGA
  • Grid-aware OS for a federation of clusters

15
http://www.kerrighed.org
Kerrighed has been filed as a community
trademark.