Title: OpenMOSIX approach to build scalable HPC farms with an easy management infrastructure
OpenMOSIX approach to build scalable HPC farms
with an easy management infrastructure
Rosario Esposito (1), Paolo Mastroserio (1), Francesco Maria Taurino (1,2), Gennaro Tortone (1)
(1) INFN - Napoli   (2) INFM - UDR Napoli
CHEP 2003, La Jolla (San Diego)
Index
- Introduction
- OpenMosix overview
- Farm setup
- Use cases
- Conclusions
What makes clusters hard?
- Setup (administrator)
  - setting up a 16-node farm by hand is prone to errors
- Maintenance (administrator)
  - ever tried to update a package on every node in the farm?
- Running jobs (users)
  - running a parallel program or a set of sequential programs requires the users to figure out which hosts are available and manually assign tasks to the nodes, or to use software tools based on static process allocation (queue managers)
What is OpenMosix?
- Description
  - OpenMosix is an open-source enhancement to the Linux kernel providing adaptive (on-line) load balancing between x86 Linux machines. It uses preemptive process migration to assign and reassign processes among the nodes to take the best advantage of the available resources
  - OpenMosix moves processes around the Linux farm to balance the load, using less loaded machines first
- URL
  - http://www.openmosix.org
OpenMosix introduction
- Execution environment
  - farm of diskless x86-based nodes, both UP and SMP, connected by a standard or high-speed LAN
- Implementation level
  - Linux kernel (no library to link with sources)
- System image model
  - virtual machine with a lot of memory and CPU
- Granularity
  - process
- Goal
  - improve the overall (cluster-wide) performance and create a convenient multi-user, time-sharing environment for the execution of both sequential and parallel applications
OpenMosix architecture (1/5)
- Network transparency
  - the interactive user and the application-level programs are provided with a virtual machine that looks like a single MP machine
- Preemptive process migration
  - any user's process, transparently and at any time, can migrate to any available node
  - the migrating process is divided into two contexts:
    - system context (deputy), which may not be migrated away from the unique home node (UHN)
    - user context (remote), which can be migrated to a diskless node
OpenMosix architecture (2/5)
- Preemptive process migration
  [diagram: a process migrating between the master node and a diskless node]
OpenMosix architecture (3/5)
- Dynamic load balancing
  - initiates process migrations in order to balance the load of the farm
  - responds to variations in the load of the nodes, the runtime characteristics of the processes, and the number of nodes and their speeds
  - makes continuous attempts to reduce the load differences between pairs of nodes, dynamically migrating processes from nodes with a higher load to nodes with a lower load
  - the policy is symmetrical and decentralized: all of the nodes execute the same algorithm and the reduction of the load differences is performed independently by each pair of nodes (a conceptual sketch follows)
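A minimal sketch of this pairwise load-reduction idea, for illustration only (the real algorithm lives in the kernel and works on normalized load indices; the node names and load values below are invented):

    import random

    # invented example: number of runnable processes per node
    loads = {"node1": 8, "node2": 1, "node3": 5, "node4": 0}

    def balance_step(loads):
        # each node independently compares itself with one randomly chosen peer
        for node in list(loads):
            peer = random.choice([n for n in loads if n != node])
            # move one process from the more loaded node of the pair to the less loaded one
            if loads[node] - loads[peer] > 1:
                loads[node] -= 1
                loads[peer] += 1
            elif loads[peer] - loads[node] > 1:
                loads[peer] -= 1
                loads[node] += 1

    for _ in range(20):   # repeated local steps gradually even out the global load
        balance_step(loads)
    print(loads)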
OpenMosix architecture (4/5)
- Memory sharing
  - places the maximal number of processes in the farm's main memory, even if this implies an uneven load distribution among the nodes
  - delays swapping out of pages for as long as possible
  - the decision of which process to migrate, and where to migrate it, is based on the knowledge of the amount of free memory on the other nodes (see the sketch below)
- Efficient kernel communication
  - specifically developed to reduce the overhead of the internal kernel communications (e.g. between the process and its home site, when it is executing on a remote site)
  - fast and reliable protocol with low startup latency and high throughput
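A toy sketch of the memory-driven placement decision, assuming invented per-node free-memory figures (not the actual OpenMosix policy code):

    # free memory per node in MB (invented example values)
    free_mem = {"node1": 64, "node2": 512, "node3": 256}

    def pick_target(free_mem, needed_mb):
        # keep the process in RAM somewhere in the farm, even at the cost of an
        # uneven load distribution, rather than letting the local node swap
        candidates = {n: m for n, m in free_mem.items() if m >= needed_mb}
        return max(candidates, key=candidates.get) if candidates else None

    print(pick_target(free_mem, needed_mb=128))   # -> node2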
OpenMosix architecture (5/5)
- Probabilistic information dissemination algorithms
  - provide each node with sufficient knowledge about the available resources in the other nodes, without polling
  - measure the amount of available resources on each node
  - receive the resource indices that each node sends at regular intervals to a randomly chosen subset of nodes
  - the use of a randomly chosen subset of nodes supports dynamic configuration and overcomes partial node failures (a small sketch follows)
- Decentralized control and autonomy
  - each node makes its own control decisions independently and there is no master-slave relationship between nodes
  - each node is capable of operating as an independent system; this property allows a dynamic configuration, where nodes may join or leave the farm with minimal disruption
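A toy gossip-style sketch of this dissemination scheme (node names, subset size and load values are invented; the real implementation exchanges kernel-level resource indices):

    import random

    nodes = ["node%d" % i for i in range(1, 9)]
    known = {n: {} for n in nodes}          # what each node currently knows about the others

    def gossip_round(local_load):
        for sender in nodes:
            # send own load index to a randomly chosen subset of peers; the random
            # choice tolerates node failures and configuration changes
            for receiver in random.sample([n for n in nodes if n != sender], k=2):
                known[receiver][sender] = local_load[sender]

    local_load = {n: random.randint(0, 10) for n in nodes}
    for _ in range(5):
        gossip_round(local_load)
    print(known["node1"])   # a partial, probabilistic view of the farm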
Farm setup: PXE + ClusterNFS
- diskless nodes
  - low cost
  - eliminates install/upgrade of hardware and software on the diskless client side
  - backups are centralized on one single main server
  - zero administration on the diskless client side
Diskless farm setup: traditional method (1/2)
- Traditional method
  - Server
    - BOOTP server
    - NFS server
    - separate root directory for each client
  - Client
    - BOOTP to obtain an IP address
    - TFTP to load a tagged kernel image
    - root over NFS to load the root filesystem
Diskless farm setup: traditional method (2/2)
- Traditional method: problems
  - separate root directory structure for each node
  - hard to set up
    - lots of directories with slightly different contents
  - difficult to maintain
    - changes must be propagated to each directory
ClusterNFS
- Description
  - cNFS is a patch to the standard Universal-NFS server code that parses file requests to determine an appropriate match on the server
- Example
  - when client machine foo2 asks for the file /etc/hostname it gets the contents of /etc/hostname$$HOST=foo2$$
- URL
  - https://sourceforge.net/projects/clusternfs
ClusterNFS features
- ClusterNFS allows all machines (including the server) to share the root filesystem
  - all files are shared by default
  - files for all clients are named filename$$CLIENT$$
  - files for a specific client are named filename$$IP=xxx.xxx.xxx.xxx$$ or filename$$HOST=host.domain.com$$
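A sketch of how this name translation can be resolved on the server side, written here in Python purely to illustrate the naming convention (the actual cNFS server is a C patch, and its exact matching precedence may differ):

    import os

    def resolve(path, client_host, client_ip):
        # try the most specific candidate first: per-host, per-IP, per-client, then the shared file
        for candidate in (path + "$$HOST=" + client_host + "$$",
                          path + "$$IP=" + client_ip + "$$",
                          path + "$$CLIENT$$",
                          path):
            if os.path.exists(candidate):
                return candidate
        return None

    # e.g. client foo2 asking for /etc/hostname gets /etc/hostname$$HOST=foo2$$ if it exists
    print(resolve("/etc/hostname", "foo2", "192.168.0.2"))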
Diskless farm setup with ClusterNFS (1/2)
- ClusterNFS method
  - Server
    - DHCP and TFTP server
    - ClusterNFS server
    - single root directory for server and clients
  - Clients
    - DHCP to obtain an IP address
    - TFTP to load the PXE boot loader and then the kernel image
    - root over NFS to load the root filesystem
Diskless farm setup with ClusterNFS (2/2)
- ClusterNFS method: advantages
  - easy to set up
    - just copy (or create) the files that need to be different (see the sketch below)
  - easy to maintain
    - changes to shared files are global
  - easy to add nodes
    - a node can be added to a running farm in 1 minute
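As an illustration of how little is needed per node, a hypothetical helper that creates only the per-host override files for a new client (the file names, layout and export path below are invented examples, not a prescribed ClusterNFS structure):

    from pathlib import Path

    def add_node(nfs_root, hostname, ip):
        etc = Path(nfs_root) / "etc"
        etc.mkdir(parents=True, exist_ok=True)
        # per-host hostname file; everything else keeps using the shared copies
        (etc / ("hostname$$HOST=%s$$" % hostname)).write_text(hostname + "\n")
        # per-host network settings (illustrative file name and format)
        (etc / ("network.conf$$HOST=%s$$" % hostname)).write_text("IPADDR=%s\n" % ip)

    add_node("/tmp/nfsroot", "node17", "192.168.0.117")   # /tmp stands in for the exported root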
VIRGO experiment (Jun 2001) (1/4)
VIRGO is a collaboration between Italian and French research teams for the realization of an interferometric gravitational-wave detector. The main goal of the VIRGO project is the first direct detection of gravitational waves emitted by astrophysical sources. Interferometric gravitational-wave detectors produce a large amount of raw data that require significant computing power to be analysed. To satisfy such a strong computing-power requirement we decided to build a Linux cluster running MOSIX (and now OpenMosix).
VIRGO experiment (Jun 2001) (2/4)
Hardware
- Farm nodes: SuperMicro 6010H
  - dual Pentium III 1 GHz
  - RAM: 512 MB
  - HD: 18 GB
  - 2 Fast Ethernet interfaces
  - 1 Gbit Ethernet interface (only on master node)
- Storage: Alpha Server 4100
  - HD: 144 GB
VIRGO experiment (Jun 2001) (3/4)
- The Linux farm has been thoroughly tested by executing intensive data-analysis procedures based on the Matched Filter algorithm, one of the best ways to search for known waveforms within a signal affected by background noise.
- Matched Filter analysis has a high computational cost, as the method consists of an exhaustive comparison between the source signal and a set of known waveforms, called templates, to find possible matches. Using a large number of templates improves the identification of known signals, but a great number of floating-point operations has to be performed (a toy example follows).
- Running Matched Filter test procedures on the OpenMosix cluster has shown a progressive reduction of execution times, due to the high scalability of the computing nodes and an efficient dynamic load distribution.
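A minimal matched-filter toy example in NumPy, only to illustrate the template-comparison idea (this is not the VIRGO analysis code; the waveforms, noise level and template bank are invented):

    import numpy as np

    rng = np.random.default_rng(0)
    t = np.linspace(0.0, 1.0, 1024)
    templates = [np.sin(2 * np.pi * f * t) for f in (10, 20, 40)]   # toy template bank
    signal = templates[1] + 0.5 * rng.standard_normal(t.size)       # hidden waveform + noise

    def matched_filter(signal, template):
        # peak of the correlation between the data and a normalized template
        return np.max(np.correlate(signal, template / np.linalg.norm(template), mode="full"))

    scores = [matched_filter(signal, tpl) for tpl in templates]
    print("best matching template:", int(np.argmax(scores)))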
VIRGO experiment (Jun 2001) (4/4)
[plot: speed-up of repeated Matched Filter executions vs. number of processors]
The increase of computing speed with respect to the number of processors does not follow an exactly linear curve; this is mainly due to the growth of communication time, spent by the computing nodes to transmit data over the local area network.
ARGO experiment (Jan 2002) (1/3)
The aim of the ARGO-YBJ experiment is to study
cosmic rays, mainly cosmic gamma-radiation, at an
energy threshold of 100 GeV, by means of the
detection of small size air showers. This goal
will be achieved by operating a full coverage
array in the Yangbajing Laboratory (Tibet, P.R.
China) at 4300m a.s.l. As we have seen for the
Virgo experiment, the analysis of data produced
by Argo requires a significant amount of
computing power. To satisfy this requirement we
decided to implement an OpenMOSIX cluster.
ARGO experiment (Jan 2002) (2/3)
- currently Argo researchers are using a small Linux farm, located in Naples, consisting of:
  - 5 machines (dual 1 GHz Pentium III with 1 GB RAM) running RedHat 7.2 with openmosix 2.4.13
  - 1 file server with 1 TB of disk space
ARGO experiment (Jan 2002) (3/3)
- At this time the Argo OpenMOSIX farm is mainly used to run Monte Carlo simulations with Corsika, a Fortran application developed to simulate and analyse extensive air showers.
- The farm is also used to run other applications, such as GEANT, to simulate the behaviour of the Argo detector.
- The OpenMOSIX farm is responding very well to the researchers' computing requirements and we have already decided to upgrade the cluster in the near future, adding more computing nodes and starting the analysis of real data produced by Argo.
- Currently ARGO researchers in Naples have produced 400 GB of simulated data with this OpenMOSIX cluster.
Conclusions (1/2)
- the most noticeable features of OpenMOSIX are its load-balancing and process-migration algorithms, which imply that users need no knowledge of the current state of the nodes
- this is most useful in time-sharing, multi-user environments, where users do not have the means to know (and usually are not interested in) the status (e.g. the load) of the nodes
- a parallel application can be executed by forking many processes, just like on an SMP machine, and OpenMOSIX continuously attempts to optimize the resource allocation (see the sketch below)
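A minimal sketch of this usage model, assuming an OpenMosix farm where forked processes are migrated transparently by the kernel (the workload below is a placeholder CPU-bound loop):

    import os

    def work(task_id):
        # placeholder CPU-bound task; on an OpenMosix farm each such process may be
        # migrated to a less loaded node without any change to the code
        total = sum(i * i for i in range(10_000_000))
        print("task", task_id, "done:", total)

    children = []
    for task_id in range(16):        # fork one process per task, just as on an SMP machine
        pid = os.fork()
        if pid == 0:
            work(task_id)
            os._exit(0)
        children.append(pid)

    for pid in children:
        os.waitpid(pid, 0)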
Conclusions (2/2)
- Building up a farm with the OpenMosix + ClusterNFS approach requires no more than 2 hours
- With this approach, management of a farm = management of a single server
- This solution has proven to be scalable in farms of up to 32 nodes