CLUSTERMATIC An Innovative Approach to Cluster Computing - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: CLUSTERMATIC An Innovative Approach to Cluster Computing


1
CLUSTERMATIC An Innovative Approach to Cluster
Computing
  • 2004 LACSI Symposium
  • The Los Alamos Computer Science Institute

LA-UR-03-8015
2
Tutorial Outline (Morning)
3
Tutorial Outline (Afternoon)
4
Tutorial Introduction
  • Tutorial is divided into modules
  • Each module has clear objectives
  • Modules comprise short theory component, followed
    by hands on
  • Icons indicate whether a segment is theory or hands-on

Please ask questions at any time!
5
Module 1: Overview of Clustermatic
Presenter: Greg Watson
  • Objective
  • To provide a brief overview of the Clustermatic
    architecture
  • Contents
  • What is Clustermatic?
  • Why Use Clustermatic?
  • Clustermatic Components
  • Installing Clustermatic
  • More Information
  • http://www.clustermatic.org

6
What is Clustermatic?
  • Clustermatic is a suite of software that
    completely controls a cluster from the BIOS to
    the high level programming environment
  • Clustermatic is modular
  • Each component is responsible for a specific set
    of activities in the cluster
  • Each component can be used independently of other
    components

7
Why Use Clustermatic?
  • Clustermatic clusters are easy to build, manage
    and program
  • A cluster can be installed and operational in a
    few minutes
  • The architecture is designed for simplicity,
    performance and reliability
  • Utilization is maximized by ensuring machine is
    always available
  • Supports machines from 2 to 1024 nodes (and
    counting)
  • System administration is no more onerous than for
    a single machine, regardless of the size of the
    cluster
  • Upgrade O/S on entire machine with a single
    command
  • No need to synchronize node software versions
  • The entire software suite is GPL open-source

8
Clustermatic Components
  • LinuxBIOS
  • Replaces normal BIOS
  • Improves boot performance and node startup times
  • Eliminates reliance on proprietary BIOS
  • No interaction required, important for 100s of
    nodes

9
Clustermatic Components
  • Linux
  • Mature O/S
  • Demonstrated performance in HPC applications
  • No proprietary O/S issues
  • Extensive hardware and network device support

10
Clustermatic Components
  • V9FS
  • Avoids problems associated with global mounts
  • Processes are provided with a private shared
    filesystem
  • Namespace exists only for duration of process
  • Nodes are returned to pristine state once
    process is complete

[Diagram: the Clustermatic software stack - LinuxBIOS at the bottom, then Linux, v9fs, Beoboot, MPI, Supermon, BProc and BJS, with compilers/debuggers and users at the top; the diagram repeats on the following slides with the component under discussion highlighted]
11
Clustermatic Components
  • Beoboot
  • Manages booting cluster nodes
  • Employs a tree-based boot scheme for
    fast/scalable booting
  • Responsible for configuring nodes once they have
    booted

12
Clustermatic Components
  • BProc
  • Manages a single process-space across machine
  • Responsible for process startup and management
  • Provides commands for starting processes, copying
    files to nodes, etc.

13
Clustermatic Components
  • BJS
  • BProc Job Scheduler
  • Enforces policies for allocating jobs to nodes
  • Nodes are allocated to pools which can have
    different policies

14
Clustermatic Components
  • Supermon
  • Provides a system monitoring infrastructure
  • Provides kernel and hardware status information
  • Low overhead on compute nodes and interconnect
  • Extensible protocol based on s-expressions

15
Clustermatic Components
  • MPI
  • Uses standard MPICH 1.2 (ANL) or LA-MPI (LANL)
  • Supports Myrinet (GM) and Ethernet (P4) devices
  • Supports debugging with TotalView

16
Clustermatic Components
  • Compilers Debuggers
  • Commercial and non-commercial compilers available
  • GNU, Intel, Absoft
  • Commercial and non-commercial debuggers available
  • Gdb, TotalView, DDT

17
Linux Support
  • Linux Variants
  • For RedHat Linux
  • Installed as a series of RPMs
  • Supports RH 9 2.4.22 kernel
  • For other Linux distributions
  • Must be compiled and installed from source

18
Tutorial CD Contents
  • RPMs for all Clustermatic components
  • Architectures included for x86, x86_64, athlon,
    ppc and alpha
  • Full distribution available on Clustermatic web
    site (www.clustermatic.org)
  • SRPMs for all Clustermatic components
  • Miscellaneous RPMs
  • Full source tree for LinuxBIOS (gzipped tar
    format)
  • Source for MPI example programs
  • Presentation handouts

19
Cluster Hardware Setup
  • Laptop installed with RH9
  • Will act as the master node
  • Two slave nodes
  • Preloaded with LinuxBIOS and a phase 1 kernel in
    flash
  • iTuner M-100 VIA EPIA 533MHz 128MB
  • 8 port 100baseT switch
  • Total cost (excluding laptop): $800

20
Clustermatic Installation
  • Installation process for RedHat
  • Log into laptop
  • Username: root
  • Password: lci2004
  • Insert and mount CD-ROM
  • mount /mnt/cdrom
  • Locate install script
  • cd /mnt/cdrom/LCI
  • Install Clustermatic
  • ./install_clustermatic
  • Reboot to load new kernel
  • reboot

21
Module 2: BProc & Beoboot
Presenter: Erik Hendriks
  • Objective
  • To introduce BProc and gain a basic understanding
    of how it works
  • To introduce Beoboot and understand how it fits
    together with BProc
  • To configure and manage a BProc cluster
  • Contents
  • Overview of BProc
  • Overview of Beoboot
  • Configuring BProc For Your Cluster
  • Bringing Up BProc
  • Bringing Up The Nodes
  • Using the Cluster
  • Managing a Cluster
  • Troubleshooting Techniques

22
BProc Overview
  • BProc = Beowulf Distributed Process Space
  • BProc is a Linux kernel modification which
    provides
  • A single system image for process control in a
    cluster
  • Process migration for creating processes in a
    cluster
  • BProc is the foundation for the rest of the
    Clustermatic software

23
Process Space
  • A process space is
  • A pool of process IDs
  • A process tree
  • A set of parent/child relationships
  • Every instance of the Linux kernel has a process
    space
  • A distributed process space allows parts of one
    node's process space to exist on other nodes

24
Distributed Process Space
  • With a distributed process space, some processes
    will exist on other nodes
  • Every remote process has a place holder in the
    process tree
  • All remote processes remain visible
  • Process related system calls (fork, wait, kill,
    etc.) work identically on local and remote
    processes
  • Kill works on remote processes
  • No runaway processes
  • Ptrace works on remote processes
  • Strace, gdb, TotalView transparently work on
    remote processes!

A node with two remote processes
25
Distributed Process Space Example
[Diagram: master and slave process trees, with process B visible on both]
  • The master starts processes on slave nodes
  • These remote processes remain visible on the
    master node
  • Not all processes on the slave are part of the
    master's process space

26
Process Creation Example
[Diagram: process A on the master and slave; fork() creates child process B]
  • Process A migrates to the slave node
  • Process A calls fork() to create a child
    process B
  • A new place holder for B is created
  • Once the place holder exists B is allowed to run

27
BProc in a Cluster
  • In a BProc cluster, there is a single master and
    many slaves
  • Users (including root) only log into the master
  • The master's process space is the process space
    for the cluster
  • All processes in the cluster are
  • Created from the master
  • Visible on the master
  • Controlled from the master

28
Process Migration
  • BProc provides a process migration system to
    place processes on other nodes in the cluster
  • Process migration on BProc is not
  • Transparent
  • Preemptive
  • A process must call the migration system call in
    order to move
  • Process migration on BProc is
  • Very fast (1.9s to place a 16MB process on 1024
    nodes)
  • Scalable
  • It can create many copies of the same process
    (e.g. MPI startup) very efficiently
  • O(log n) in the number of copies

29
Process Migration
  • Process migration does preserve
  • The contents of memory and memory related
    metadata
  • CPU State (registers)
  • Signal handler state
  • Process migration does not preserve
  • Shared memory regions
  • Open files
  • SysV IPC resources
  • Just about anything else that isn't memory

30
Running on a Slave Node
  • BProc is a process management system
  • All other system calls are handled locally on the
    slave node
  • BProc does not impose any extra overhead on
    non-process related system calls
  • File and Network I/O are always handled locally
  • Calling open() will not cause contact with the
    master node
  • This means network and file I/O are as fast as
    they can be

31
Implications
  • All processes are started from the master with
    process migration
  • All processes remain visible on the master
  • No runaways
  • Normal UNIX process control works for ALL
    processes in the cluster
  • No need for direct interaction
  • There is no need to log into a node to control
    what is running there
  • No software is required on the nodes except the
    BProc slave daemon
  • ZERO software maintenance on the nodes!
  • Diskless nodes without NFS root
  • Reliable nodes

32
Beoboot
  • BProc does not provide any mechanism to get a
    node booted
  • Beoboot fills this role
  • Hardware detection and driver loading
  • Configuration of network hardware
  • Generic network boot using Linux
  • Starts the BProc slave daemon
  • Beoboot also provides the corresponding boot
    servers and utility programs on the front end

33
Booting a Slave Node
[Diagram: boot message sequence between master and slave]
  • Phase 1 (small kernel, minimal functionality): slave asks "who am I?";
    master responds with IPs, servers, etc.
  • Slave requests the phase 2 image; master sends it
  • Slave loads the phase 2 image ("using magic": Two Kernel Monte, next slide)
  • Phase 2 (operational kernel, full featured): slave asks "who am I again?";
    master responds
  • BProc slave connects to the master
34
Loading the Phase 2 Image
  • Two Kernel Monte is a piece of software which
    will load a new Linux kernel replacing one that
    is already running
  • This allows you to use Linux as your boot loader!
  • Using Linux means you can use any network that
    Linux supports.
  • There is no PXE BIOS or Etherboot support for
    Myrinet, Quadrics or Infiniband
  • Pink network boots over Myrinet, which allowed us
    to avoid buying a 1024-port Ethernet network
  • Currently supports x86 (including AMD64) and Alpha

35
BProc Configuration
  • Main configuration file
  • /etc/clustermatic/config
  • Edit with favorite text editor
  • Lines consist of comments (starting with #)
  • The rest are a keyword followed by arguments
  • Specify interface
  • interface eth0 10.0.4.1 255.255.255.0
  • eth0 is interface connected to nodes
  • IP of master node is 10.0.4.1
  • Netmask of master node is 255.255.255.0
  • Interface will be configured when BProc is started

36
BProc Configuration
  • Specify range of IP addresses for nodes
  • iprange 0 10.0.4.10 10.0.4.14
  • Start assigning IP addresses at node 0
  • First address is 10.0.4.10, last is 10.0.4.14
  • The size of this range determines the number of
    nodes in the cluster
  • Next entries are default libraries to be
    installed on nodes
  • Can explicitly specify libraries or extract
    library information from an executable
  • Need to add entry to install extra libraries
  • librariesfrombinary /bin/ls /usr/bin/gdb
  • The bplib command can be used to see libraries
    that will be loaded

37
BProc Configuration
  • Next line specifies the name of the phase 2 image
  • bootfile /var/clustermatic/boot.img
  • Should be no need to change this
  • Need to add a line to specify kernel command line
  • kernelcommandline apm=off console=ttyS0,19200
  • Turn APM support off (since these nodes don't
    have any)
  • Set console to use ttyS0 and speed to 19200
  • This is used by beoboot command when building
    phase 2 image

38
BProc Configuration
  • Final lines specify ethernet addresses of nodes,
    examples given
  • node 0 00:50:56:00:00:00
  • node 00:50:56:00:00:01
  • Needed so node can learn its IP address from
    master
  • The first 0 is optional; it assigns this address to
    node 0
  • Can automatically determine and add ethernet
    addresses using the nodeadd command
  • We will use this command later, so no need to
    change now
  • Save file and exit from editor (a complete example
    config is sketched below)

39
BProc Configuration
  • Other configuration files
  • Should not need to be changed for this
    configuration
  • /etc/clustermatic/config.boot
  • Specifies PCI devices that are going to be used
    by the nodes at boot time
  • Modules are included in phase 1 and phase 2 boot
    images
  • By default the node will try all network
    interfaces it can find
  • /etc/clustermatic/node_up.conf
  • Specifies actions to be taken in order to bring a
    node up
  • Load modules
  • Configure network interfaces
  • Probe for PCI devices
  • Copy files and special devices out to node

40
Bringing Up BProc
  • Check BProc will be started at boot time
  • chkconfig --list clustermatic
  • Restart master daemon and boot server
  • service bjs stop
  • service clustermatic restart
  • service bjs start
  • Load the new configuration
  • BJS uses BProc, so needs to be stopped first
  • Check interface has been configured correctly
  • ifconfig eth0
  • Should have IP address we specified in config file

41
Build a Phase 2 Image
  • Run the beoboot command on the master
  • beoboot -2 -n --plugin mon
  • -2 this is a phase 2 image
  • -n image will boot over network
  • --plugin add plugin to the boot image
  • The following warning messages can be safely
    ignored
  • WARNING: Didn't find a kernel module called
    gmac.o
  • WARNING: Didn't find a kernel module called
    bmac.o
  • Check phase 2 image is available
  • ls -l /var/clustermatic/boot.img

42
Bringing Up The First Node
  • Ensure both nodes are powered off
  • Run the nodeadd command on the master
  • /usr/lib/beoboot/bin/nodeadd -a -e -n 0 eth0
  • -a automatically reload daemon
  • -e write a node number for every node
  • -n 0 start node numbering at 0
  • eth0 interface to listen on for RARP requests
  • Power on the first node
  • Once the node boots, nodeadd will display a
    message
  • New MAC 00:30:48:23:ac:9c
  • Sending SIGHUP to beoserv.

43
Bringing Up The Second Node
  • Power on the second node
  • In a few seconds you should see another message
  • New MAC 00:30:48:23:ad:e1
  • Sending SIGHUP to beoserv.
  • Exit nodeadd when the second node is detected (^C)
  • At this point, the cluster is up and fully
    operational (a quick health check is sketched below)
  • Check cluster status
  • bpstat -U
  • Node(s)  Status  Mode        User  Group
    0-1      up      ---x------  root  root
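A minimal post-boot sanity check, assuming the bpsh options introduced on the next slide; uname and free are ordinary Linux commands started on the master and migrated to the nodes:

  # confirm both nodes report "up"
  bpstat -U
  # run a trivial command on every node that is up, prefixing output with the node number
  bpsh -a -p uname -r
  # check free memory on the nodes (their root filesystem lives in RAM)
  bpsh -a -p free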

44
Using the Cluster
  • bpsh
  • Migrates a process to one or more nodes
  • Process is started on front-end, but is
    immediately migrated onto nodes
  • Effect similar to rsh command, but no login is
    performed and no shell is started
  • I/O forwarding can be controlled
  • Output can be prefixed with node number
  • Run date command on all nodes which are up
  • bpsh -a -p date
  • See other arguments that are available
  • bpsh -h

45
Using the Cluster
  • bpcp
  • Copies files to a node
  • Files can come from master node, or other nodes
  • Note that a node only has a ram disk by default
  • Copy /etc/hosts from master to /tmp/hosts on node
    0
  • bpcp /etc/hosts 0:/tmp/hosts
  • bpsh 0 cat /tmp/hosts

46
Managing the Cluster
  • bpstat
  • Shows status of nodes
  • up: node is up and available
  • down: node is down or can't be contacted by master
  • boot: node is coming up (running node_up)
  • error: an error occurred while the node was
    booting
  • Shows owner and group of node
  • Combined with permissions, determines who can
    start jobs on the node
  • Shows permissions of the node
  • ---x------: execute permission for node owner
  • ------x---: execute permission for users in node
    group
  • ---------x: execute permission for other users

47
Managing the Cluster
  • bpctl
  • Control a node's status
  • Reboot node 1 (takes about a minute)
  • bpctl -S 1 -R
  • Set state of node 0
  • bpctl -S 0 -s groovy
  • Only up, down, boot and error have special
    meaning; everything else just means "not down"
  • Set owner of node 0
  • bpctl -S 0 -u nobody
  • Set permissions of node 0 so anyone can execute a
    job
  • bpctl -S 0 -m 111
  • (An example that restricts a node to a single user
    is sketched below)
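A short sketch combining the owner and permission settings above to give one user exclusive use of a node; the username alice and the mode value 100 (owner-execute only) are illustrative assumptions:

  # make alice the owner of node 0
  bpctl -S 0 -u alice
  # owner-execute only; the permission string becomes ---x------
  bpctl -S 0 -m 100
  # verify owner and permissions
  bpstat -U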

48
Managing the Cluster
  • bplib
  • Manage libraries that are loaded on a node
  • List libraries to be loaded
  • bplib -l
  • Add a library to the list
  • bplib -a /lib/libcrypt.so.1
  • Remove a library from the list
  • bplib -d /lib/libcrypt.so.1

49
Troubleshooting Techniques
  • The tcpdump command can be used to check for node
    activity during and after a node has booted
  • Connect a cable to serial port on node to check
    console output for errors in boot process
  • Once the node reaches node_up processing, messages
    will be logged in /var/log/clustermatic/node.N
    (where N is the node number), as sketched below
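A minimal sketch of these checks, assuming the cluster interface is eth0 and the node of interest is node 0:

  # watch for boot traffic from the nodes on the cluster interface
  tcpdump -i eth0
  # follow the node_up log for node 0
  tail -f /var/log/clustermatic/node.0
  # check the node state reported by the master
  bpstat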

50
Module 3: LinuxBIOS
Presenter: Ron Minnich
  • Objective
  • To introduce LinuxBIOS
  • Build and install LinuxBIOS on a cluster node
  • Contents
  • Overview
  • Obtaining LinuxBIOS
  • Source Tree
  • Building LinuxBIOS
  • Installing LinuxBIOS
  • Booting a Cluster without LinuxBIOS
  • More Information
  • http://www.linuxbios.org

51
LinuxBIOS Overview
  • Replacement for proprietary BIOS
  • Based entirely on open source code
  • Can boot from a variety of devices
  • Supports a wide range of architectures
  • Intel P3 P4
  • AMD K7 K8 (Opteron)
  • PPC
  • Alpha
  • Ports available for many systems

compaq, ibm, lippert, rlx, tyan, advantech, dell, intel, matsonic, sis, via, asus, digitallogic, irobot, motorola, stpc, winfast6300, bcm, elitegroup, lanner, nano, supermicro, bitworks, leadtek, pcchips, supertek, cocom, gigabit, lex, rcn, technoland
52
Why Use LinuxBIOS?
  • Proprietary BIOSes are inherently interactive
  • Major problem when building clusters with 100s
    or 1000s of nodes
  • Proprietary BIOSes misconfigure hardware
  • Impossible to fix
  • Examples that really happen
  • Put in faster memory, but it doesn't run faster
  • Can misconfigure PCI address space - huge problem
  • Proprietary BIOSes can't boot over HPC networks
  • No Myrinet or Quadrics drivers for Phoenix BIOS
  • LinuxBIOS is FAST
  • This is the least important thing about LinuxBIOS

53
Definitions
  • Bus
  • Two or more wires used to connect two or more
    chips
  • Bridge
  • A chip that connects two or more busses of the
    same or different type
  • Mainboard
  • Aka motherboard/platform
  • Carrier for chips that are interconnected via
    buses and bridges
  • Target
  • A particular instance of a mainboard, chips and
    LinuxBIOS configuration
  • Payload
  • Software loaded by LinuxBIOS from non-volatile
    storage into RAM

54
Typical Mainboard
[Diagram: typical mainboard (Rev D) - two CPUs on the front-side bus connect to the Northbridge, which drives DDR RAM and AGP video; the Southbridge sits on the I/O (PCI) buses and connects the BIOS chip, keyboard, floppy and other legacy devices]
55
What Is LinuxBIOS?
  • That question has changed over time
  • In 1999, at the start of the project, LinuxBIOS
    was literal
  • Linux is the BIOS
  • Hence the name
  • The key questions are
  • Can you learn all about the hardware on the
    system by asking the hardware on the system?
  • Does the OS know how to do that?
  • The answer, in 1995 or so on PCs, was NO in
    both cases
  • OS needed the BIOS to do significant work to get
    the machine ready to use

56
What Does The BIOS Do Anyway?
  • Make the processor(s) sane
  • Make the chipsets sane
  • Make the memory work (HARD on newer systems)
  • Set up devices so you can talk to them
  • Set up interrupts so they go to the right place
  • Initialize memory even though you don't want it
    to
  • Totally useless memory test
  • I've never seen a useful BIOS memory test
  • Spin up the disks
  • Load primary bootstrap from the right place
  • Start up the bootstrap

57
Is It Possible With Open-Source Software?
  • 1995: very hard - tightly coded assembly that
    barely fits into 32KB
  • 1999: pretty easy - the Flash is HUGE (256KB at
    least)
  • So the key in 1999 was knowing how to do the
    startup
  • Lots of secret knowledge which took a while to
    work out
  • Vendors continue to make this hard; some help
  • AMD is a good example of a very helpful vendor
  • The LinuxBIOS community wrote the first-ever
    open-source code that could
  • Start up Intel and AMD SMPs
  • Enable L2 cache on the PII
  • Initialize SDRAM and DDRAM

58
Only Really Became Possible In 1999
  • Huge 512K byte Flash parts could hold the huge
    kernel
  • Almost 400KB
  • PCI bus had self-identifying hardware
  • Old ISA, EISA, etc. were DEAD thank goodness!
  • SGI Visual Workstation showed you could build x86
    systems without standard BIOS
  • Linux learned how to do a lot of configuration,
    ignoring the BIOS
  • In summary
  • The hardware could do it (we thought)
  • Linux could do it (we thought)

59
LinuxBIOS Image In The 512KB Flash
60
The Basic Load Sequence ca. 1999
  • Top 16 bytes: jump to top 64K
  • Top 64K:
  • Set up hardware for Linux
  • Copy Linux from FLASH to bottom of memory
  • Jump to 0x100020 (start of Linux)
  • Linux: do all the stuff you normally do
  • 2.2: not much, was a problem
  • 2.4: did almost everything
  • In 1999, Linux did not do all we needed (2.2)
  • In 2000, 2.4 could do almost as much as we want
  • The 64K bootstrap ended up doing more than we
    planned

61
What We Thought Linux Would Do
  • Do ALL the PCI setup
  • Do ALL the odd processor setup
  • In fact, do everything; all the 64K code had to
    do was copy Linux to RAM

62
What We Changed (Due To Hardware)
  • DRAM does not start life operational, as it did in
    the old days
  • Turn-on for DRAM is very complex
  • The single hardest part of LinuxBIOS is DRAM
    support
  • To turn on DRAM, you need to turn on chipsets
  • To turn on chipsets, you need to set up PCI
  • And, on AMD Athlon SMPs, we need to grab hold of
    all the CPUs (save one) and idle them
  • So the 64K chunk ended up doing more

63
Getting To DRAM
[Diagram: the same mainboard as slide 54, highlighting the path from the BIOS chip through the Southbridge, the PCI buses and the Northbridge to the DDR RAM]
64
Another Problem
  • IRQ wiring cannot be determined from hardware!
  • Botch in PCI results in having to put tables in
    the BIOS
  • This is true for all motherboards
  • So, although PCI hardware is self-identifying,
    hardware interrupts are not
  • So Linux can't figure out what interrupt is for
    what card
  • LinuxBIOS has to pick up this additional function

65
The PCI Interrupt Botch
[Diagram: PCI interrupt pins A-D and interrupt lines 1-4]
66
What We Changed (Due To Linux)
  • Linux could not set up a totally empty PCI bus
  • Needed some minimal configuration
  • Linux couldn't find the IRQs
  • Not really its fault, but
  • Linux needed SMP hardware set up as per BIOS
  • Linux needed per-CPU hardware set up as per
    BIOS
  • Linux needed tables (IRQ, ACPI, etc.) set up as
    per BIOS
  • Over time, this is changing
  • Someone has a patent on the SRAT ACPI table
  • SRAT describes hardware
  • So Linux ignores SRAT, talks to hardware directly

67
As Of 2000/2001
  • We could boot Linux from flash (quickly)
  • Linux would find the hardware and the tables
    ready for it
  • Linux would be up and running in 3-12 seconds
  • Problem solved?

68
Problems
  • Looking at trends, in 1999 we counted on
    motherboard flash sizes doubling every 2 years or
    so
  • From 1999 to 2000 the average flash size either
    shrank or stayed the same
  • Linux continued to grow in size though
  • Linux outgrew the existing flash parts, even as
    they were getting smaller
  • Vendors went to a socket that couldn't hold a
    larger replacement
  • Why did vendors do this?
  • Everyone wants cheap mainboards!

69
LinuxBIOS Was Too Big
  • Enter the alternate bootstraps
  • Etherboot
  • FILO
  • Built-in etherboot
  • Built-in USB loader

70
The New Picture
[Diagram: the new layout]
  • Flash (256KB): top 16 bytes, top 64K (LinuxBIOS), next 64K (Etherboot), rest empty
  • Compact Flash (32MB): Linux kernel, loaded over the IDE channel by the bootloader, rest empty
71
LinuxBIOS Now
  • The aggregate of the 64K loader, Etherboot (or
    FILO), and Linux from Compact Flash?
  • Too confusing
  • LinuxBIOS now means only the 64K piece, even
    though it's not Linux any more
  • On older systems, LinuxBIOS loads Etherboot which
    loads Linux from Compact Flash
  • Compact Flash read as raw set of blocks
  • On newer systems, LinuxBIOS loads FILO which
    loads Linux from Compact Flash
  • Compact Flash treated as ext2 filesystem

72
Final Question
  • You're reflashing 1024 nodes on a cluster and the
    power fails
  • You're now the proud owner of 1024 bricks, right?
  • Wrong
  • Linux NetworX developed fallback BIOS technology

73
Fallback BIOS
  • Jump to BIOS jumps to fallback BIOS
  • Fallback BIOS checks conditions
  • Was the last boot successful?
  • Do we want to just use fallback anyway?
  • Does normal BIOS look ok?
  • If things are good, use normal
  • If things are bad, use fallback
  • Note there is also a fallback and normal FILO
  • These load different files from CF
  • So normal kernel, FILO, and BIOS can be hosed and
    you're OK

[Diagram: 256KB flash layout - jump to BIOS, fallback BIOS, normal BIOS, fallback FILO, normal FILO]
74
Rules For Upgrading Flash
  • NEVER replace the fallback BIOS
  • NEVER replace the fallback FILO
  • NEVER replace the fallback kernel
  • Mess up other images at will, because you can
    always fall back

75
A Last Word On Flash Size
  • Flash size decreased to 256KB from 1999-2003
  • Driven by packaging constraints
  • Newer technology uses address-address
    multiplexing to pack lots of address bits onto 3
    address lines - up to 128 MB!
  • Driven by cell phone and MP3 player demand
  • So the same small package can support 1,2,4,8 MB
  • Will need them: kernel + initrd can be 4MB!
  • This will allow us to realize our original vision
  • Linux in flash
  • Etherboot, FILO, etc., are really a hiccup

76
Source Tree
  • COPYING
  • NEWS
  • ChangeLog
  • documentation
  • Not enough!
  • src
  • /arch
  • Architecture specific files, including initial
    startup code
  • /boot
  • Main LinuxBIOS entry code: hardwaremain()
  • /config
  • Configuration for a given platform
  • /console
  • Device independent console support
  • /cpu
  • Implementation specific files
  • /devices
  • Dynamic device allocation routines
  • /include
  • Header files
  • /lib
  • Generic library functions (atoi)

77
Source Tree
  • /mainboard
  • Mainboard specific code
  • /northbridge
  • Memory and bus interface routines
  • /pc80
  • Legacy crap
  • /pmc
  • Processor mezzanine cards
  • /ram
  • Generic RAM support
  • /sdram
  • Synchronous RAM support
  • /southbridge
  • Bridge to interface to legacy crap
  • /stream
  • Source of payload data
  • /superio
  • Chip to emulate legacy crap
  • targets
  • Instances of specific platforms
  • utils
  • Utility programs

78
Building LinuxBIOS
  • For this demonstration, untar source from CDROM
  • mount /mnt/cdrom
  • cd /tmp
  • tar zxvf /mnt/cdrom/LCI/linuxbios/freebios2.tgz
  • cd freebios2
  • Find target that matches your hardware
  • cd targets/via/epia
  • Edit configuration file Config.lb and change any
    settings specific to your board
  • Should not need to make any changes in this case

79
Building LinuxBIOS
  • Build the target configuration files
  • cd ../..
  • ./buildtarget via/epia
  • Now build the ROM image
  • cd via/epia/epia
  • make
  • Should result in a single file
  • linuxbios.rom
  • Copy ROM image onto a node
  • bpcp linuxbios.rom 0:/tmp

80
Installing LinuxBIOS
  • This will overwrite old BIOS with LinuxBIOS
  • Prudent to keep a copy of the old BIOS chip
  • Bad BIOS = useless junk
  • Build flash utility
  • cd /tmp/freebios2/util/flash_and_burn
  • make
  • Now flash the ROM image - please do not do this
    step
  • bpsh 0 ./flash_rom /tmp/linuxbios.rom
  • Reboot node and make sure it comes up
  • bpctl -S 0 -R
  • Use BProc troubleshooting techniques if not!

81
Booting a Cluster Without LinuxBIOS
  • Although an important part of Clustermatic, it's
    not always possible to deploy LinuxBIOS
  • Requires detailed understanding of technical
    details
  • May not be available for a particular mainboard
  • In this situation it is still possible to set up
    and boot a cluster using a combination of DHCP,
    TFTP and PXE
  • Dynamic Host Configuration Protocol (DHCP)
  • Used by node to obtain IP address and bootloader
    image name
  • Trivial File Transfer Protocol (TFTP)
  • Simple protocol to transfer files across an IP
    network
  • Preboot Execution Environment (PXE)
  • BIOS support for network booting

82
Configuring DHCP
  • Copy configuration file
  • cp /mnt/cdrom/LCI/pxe/dhcpd.conf /etc
  • Contains the following entry (one host entry
    for each node)
  • ddns-update-style ad-hoc;
  • subnet 10.0.4.0 netmask 255.255.255.0 { }
  • host node1 {
      hardware ethernet xx:xx:xx:xx:xx:xx;
      fixed-address 10.0.4.14;
      filename "pxelinux.0";
    }
  • Replace xx:xx:xx:xx:xx:xx with the MAC address of
    the node
  • Restart server to load new configuration
  • service dhcpd restart

83
Configuring TFTP
  • Create directory to hold bootloader
  • mkdir -p /tftpboot
  • Edit TFTP config file
  • /etc/xinetd.d/tftp
  • Enable TFTP
  • Change
  • disable = yes
  • To
  • disable = no
  • Restart server
  • service xinetd restart

84
Configuring PXE
  • Depends on BIOS, enabled through menu
  • Create correct directories
  • mkdir -p /tftpboot/pxelinux.cfg
  • Copy bootloader and config file
  • cd /mnt/cdrom/LCI/pxe
  • cp pxelinux.0 /tftpboot/
  • cp default /tftpboot/pxelinux.cfg/
  • Generate a bootable phase 2 image
  • beoboot -2 -i -o /tftpboot/node --plugin mon
  • Creates a kernel and initrd image
  • /tftpboot/node
  • /tftpboot/node.initrd

85
Booting The Cluster
  • Run nodeadd to add node to config file
  • /usr/lib/beoboot/bin/nodeadd -a -e eth0
  • Node can now be powered on
  • BIOS uses DHCP to obtain IP address and filename
  • pxelinux.0 will be loaded
  • pxelinux.0 will in turn load phase 2 image and
    initrd
  • Node should boot
  • Check status using bpstat command
  • Requires monitor to observe behavior of node

86
Module 4: Filesystems
Presenter: Ron Minnich
  • Objective
  • To show the different kinds of filesystems that
    can be used with a BProc cluster and demonstrate
    the advantages and disadvantages of each
  • Contents
  • Overview
  • No Local Disk, No Network Filesystem
  • Local Disks
  • Global Network Filesystems
  • NFS
  • Third Party Filesystems
  • Private Network Filesystems
  • V9FS

87
Filesystems Overview
  • Nodes in a Clustermatic cluster do not require
    any type of local or network filesystem to
    operate
  • Jobs that operate with only local data need no
    other filesystems
  • Clustermatic can provide a range of different
    filesystem options

88
No Local Disk, No Network Filesystem
  • Root filesystem is a tmpfs located in system RAM,
    so size is limited to RAM size of nodes
  • Applications that need an input deck must copy
    necessary files to nodes prior to execution and
    from nodes after execution (a sketch of this
    workflow follows below)
  • A 30K input deck can be copied to 1023 nodes in
    under 2.5 seconds
  • This can be a very fast option for suitable
    applications
  • Removes node dependency on potentially unreliable
    fileserver
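A minimal sketch of the copy-in/run/copy-out pattern using bpcp and bpsh; the file names, node numbers and application path are illustrative:

  # copy the input deck to nodes 0 and 1 (their root filesystem is RAM-backed)
  bpcp input.deck 0:/tmp/input.deck
  bpcp input.deck 1:/tmp/input.deck
  # run the application on both nodes
  bpsh 0-1 /path/to/app /tmp/input.deck
  # collect results back onto the master
  bpcp 0:/tmp/output.dat output.0
  bpcp 1:/tmp/output.dat output.1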

89
Local Disks
  • Nodes can be provided with one or more local
    disks
  • Disks are automatically mounted by creating entry
    in /etc/clustermatic/fstab
  • Solves local space problem, but filesystems are
    still not shared
  • Also reduces reliability of nodes since they are
    now dependent on spinning hardware

90
NFS
  • Simplest solution to providing a shared
    filesystem on nodes
  • Will work in most environments
  • Nodes are now dependent on availability of NFS
    server
  • Master can act as NFS server
  • Adds extra load
  • Master may already be loaded if there are a large
    number of nodes
  • Better option is to provide a dedicated server
  • Configuration can be more complex if server is on
    a different network
  • May require multiple network adapters in master
  • Performance is never going to be high

91
Configuring Master as NFS Server
  • Standard Linux NFS configuration on server
  • Check NFS is enabled at boot time
  • chkconfig --list nfs
  • chkconfig nfs on
  • Start NFS daemons
  • service nfs start
  • Add exported filesystem to /etc/exports
  • /home 10.0.4.0/24(rw,sync,no_root_squash)
  • Export filesystem
  • exportfs -a

92
Configuring Nodes To Use NFS
  • Edit /etc/clustermatic/fstab to mount filesystem
    when node boots
  • MASTER:/home /home nfs nolock 0 0
  • MASTER will be replaced with IP address of front
    end
  • nolock must be used unless portmap is run on each
    node
  • /home will be automatically created on node at
    boot time
  • Reboot nodes
  • bpctl -S allup -R
  • When nodes have rebooted, check NFS mount is
    available
  • bpsh 0-1 df

93
Third Party Filesystems
  • GPFS (http://www.ibm.com)
  • Panasas (http://www.panasas.com)
  • Lustre (http://www.lustre.org)

94
GPFS
  • Supports up to 2.4.21 kernel (latest is 2.4.26 or
    2.6.5)
  • Data striping across multiple disks and multiple
    nodes
  • Client-side data caching
  • Large blocksize option for higher efficiencies
  • Read-ahead and write-behind support
  • Block level locking supports concurrent access to
    files
  • Network Shared Disk Model
  • Subset of nodes are allocated as storage nodes
  • Software layer ships I/O requests from
    application node to storage nodes across cluster
    interconnect
  • Direct Attached Model
  • Each node must have direct connection to all
    disks
  • Requires Fibre Channel Switch and Storage Area
    Network disk configuration

95
Panasas
  • Latest version supports 2.4.26 kernel
  • Object Storage Device (OSD)
  • Intelligent disk drive
  • Can be directly accessed in parallel
  • PanFS Client
  • Object-based installable filesystem
  • Handles all mounting, namespace operations, file
    I/O operations
  • Parallel access to multiple object storage
    devices
  • Metadata Director
  • Separate control path for managing OSDs
  • mapping of directories and files to data
    objects
  • Authentication and secure access
  • Metadata Director and OSD require dedicated
    proprietary hardware
  • PanFS Client is open source

96
Lustre
  • Lustre Lite supports 2.4.24 kernel
  • Full Lustre will support 2.6 kernel
  • Lustre Lite = Lustre without the clustered-metadata
    scalability
  • All open source
  • Meta Data Servers (MDSs)
  • Supports all filesystem namespace operations
  • Lock manager and concurrency support
  • Transaction log of metadata operations
  • Handles failover of metadata servers
  • Object Storage Targets (OSTs)
  • Handles actual file I/O operations
  • Manages storage on Object-Based Disks (OBDs)
  • Object-Based Disk drivers support normal Linux
    filesystems
  • Arbitrary network support through Network
    Abstraction Layer
  • MDSs and OSTs can be standard Linux hosts

97
V9FS
  • Provides a shared private network filesystem
  • Shared
  • All nodes running a parallel process can access
    the filesystem
  • Private
  • Only processes in a single process group can see
    or access files in the filesystem
  • Mounts exist only for duration of process
  • Node cleanup is automatic
  • No hanging mount problems
  • Protocol is lightweight
  • Pluggable authentication services

98
V9FS
  • Experimental
  • Can be mounted across a secure channel (e.g. ssh)
    for additional security
  • 1000 concurrent mounts in 20 seconds
  • Multiple servers will improve this
  • Servers can run on cluster nodes or dedicated
    systems
  • Filesystem can use cluster interconnect or
    dedicated network
  • More information
  • http://v9fs.sourceforge.net

99
Configuring Master as V9FS Server
  • Start server
  • v9fs_server
  • Can be started at boot if desired
  • Create mount point on nodes
  • bpsh 0-1 mkdir /private
  • Can add mkdir command to end of node_up script if
    desired

100
V9FS Server Commands
  • Define filesystems to be mounted on the nodes
  • v9fs_addmount 10.0.4.1:/home /private
  • List filesystems to be mounted
  • v9fs_lsmount

101
V9FS On The Cluster
  • Once filesystem mounts have been defined on the
    server, filesystems will be automatically mounted
    when a process is migrated to the node
  • cp /etc/hosts /home
  • bpsh 0-1 ls -l /private
  • bpsh 0 cat /private/hosts
  • Remove filesystems to be mounted
  • v9fs_rmmount /private
  • bpsh 0-1 ls -l /private

102
One Note
  • Note that we ran the file server as root
  • You can actually run the file server as you
  • If run as you, there is added security
  • The server can't run amok
  • And subtracted security
  • We need a better authentication system
  • Can use ssh, but something tailored to the
    cluster would be better
  • Note that the server can chroot for even more
    safety
  • Or be told to serve from a file, not a file
    system
  • There is tremendous flexibility and capability in
    this approach

103
Also
  • Recall that on 2.4.19 and later there is a /proc
    mounts entry for each process
  • /proc/<pid>/mounts
  • It really is quite private
  • There is a lot of potential capability here we
    have not started to use
  • Still trying to determine need/demand

104
Why Use V9FS?
  • You've got some wacko library you need to use for
    one application
  • You've got a giant file which you want to serve
    as a file system
  • You've got data that you want visible to you only
  • Original motivation: compartmentation in grids
    (1996)
  • You want a mount point but it's not possible for
    some reason
  • You want an encrypted data file system

105
Wacko Library
  • Clustermatic systems (intentionally) limit the
    number of libraries on nodes
  • Current systems have about 2GB worth of libraries
  • Putting all these on nodes would take 2GB of
    memory!
  • Keeping node configuration consistent is a big
    task on 1000 nodes
  • Need to do rsync, or whatever
  • Lots of work, lots of time for libraries you
    don't need
  • What if you want some special library available
    all the time?
  • Painful to ship it out, set up paths, etc., every
    time
  • V9FS allows custom mounts to be served from your
    home directory (a short example is sketched below)
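A hedged sketch of serving a personal library directory with the V9FS commands from earlier in this module; running v9fs_server as an ordinary user, the library path and the LD_LIBRARY_PATH trick are assumptions, not part of the tutorial:

  # serve your own files (run as yourself rather than root)
  v9fs_server
  v9fs_addmount 10.0.4.1:/home/you/mylibs /private
  # the mount appears automatically when a process migrates to a node
  bpsh 0 ls -l /private
  # point an application at the private mount (path is illustrative)
  bpsh 0 env LD_LIBRARY_PATH=/private /path/to/app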

106
Giant File As File System
  • V9FS is a user-level server
  • i.e. an ordinary program
  • On Plan 9, there are all sorts of nifty uses of
    this
  • Servers for making a tar file look like a
    read-only file system
  • Or cpio archive, or whatever
  • So, instead of trying to locate something in the
    middle of a huge tar file
  • Run the server to serve the tar file
  • Save disk blocks and time

107
Data Visible To You Only
  • This usage is still very important
  • Run your own personal server (assuming
    authentication is fixed) or use the global server
  • Files that you see are not visible to anyone else
    at all
  • Even root
  • On Unix, if you can't get to the mount point, you
    can't see the files
  • On Linux with private mounts, other people don't
    even know the mount point exists

108
You Want A Mount Point But Cant Get One
  • Please Mr. Sysadmin, sir, can I have another
    mount point?
  • NO!
  • System administrators have enough to do without
    having to
  • Modify fstab on all nodes
  • Modify permissions on a server
  • And so on
  • Just to make your library available on the nodes?
  • Doubtful
  • V9FS gives a level of flexibility that you can't
    get otherwise

109
Want Encrypted Data File System
  • This one is really interesting
  • Crypto file systems are out there in abundance
  • But they always require lots of root
    involvement to set up
  • Since V9FS is user-level, you can run one
    yourself
  • Set up your own keys, crypto, all your own stuff
  • Serve a file system out of one big encrypted file
  • Copy the file elsewhere, leaving it encrypted
  • Not easily done with existing file systems
  • So you have a personal, portable, encrypted file
    system

110
So Why Use V9FS?
  • Opens up a wealth of new ways to store, access
    and protect your data
  • Don't have to bother System Administrators all
    the time
  • Can extend the file system name space of a node
    to your specification
  • Can create a whole file system in one file, and
    easily move that file system around (cp, scp,
    etc.)
  • Can do special per-user policy on the file system
  • Tar or compressed file format
  • Per-user crypto file system
  • Provides capabilities you can't get any other way

111
Module 5: Supermon
Presenter: Matt Sottile
  • Objectives
  • Present an overview of supermon
  • Demonstrate how to install and use supermon to
    monitor a cluster
  • Contents
  • Overview of Supermon
  • Starting Supermon
  • Monitoring the Cluster
  • More Information
  • http://supermon.sourceforge.net

112
Overview of Supermon
  • Provides monitoring solution for clusters
  • Capable of high sampling rates (Hz)
  • Very small memory and computational footprint
  • Sampling rates are controlled by clients at
    run-time
  • Completely extensible without modification
  • User applications
  • Kernel modules

113
Node View
  • Data sources
  • Kernel module(s)
  • User application
  • Mon daemon
  • IANA-registered port number
  • 2709

114
Cluster View
  • Data sources
  • Node mon daemons
  • Other supermons
  • Supermon daemon
  • Same port number
  • 2709
  • Same protocol at every level
  • Composable, extensible

115
Data Format
  • S-expressions
  • Used in LISP, Scheme, etc.
  • Very mature
  • Extensible, composable, ASCII
  • Very portable
  • Easily changed to support richer data and
    structures
  • Composable
  • (expr 1) o (expr 2) = ((expr 1) (expr 2))
  • Fast to parse, low memory and time overhead

116
Data Protocol
  • # command
  • Provides description of what data is provided and
    how it is structured
  • Shows how the data is organized in terms of rough
    categories containing specific data variables
    (e.g. cpuinfo category, usertime variable)
  • S command
  • Request actual data
  • Structure matches that described in command
  • R command
  • Revive clients that disappeared and were
    restarted
  • N command
  • Add new clients

117
User Defined Data
  • Each node allows user-space programs to push data
    into mon to be sent out on the next sample
  • Only requirement
  • Data is arbitrary text
  • Recommended to be an s-expression
  • Very simple interface
  • Uses UNIX domain socket for security

118
Starting Supermon
  • Start supermon daemon
  • supermon n0 n1 2> /dev/null
  • Check output from kernel
  • bpsh 1 cat /proc/sys/supermon/#
  • bpsh 1 cat /proc/sys/supermon/S
  • Check sensor output from kernel
  • bpsh 1 cat /proc/sys/supermon_sensors_t/#
  • bpsh 1 cat /proc/sys/supermon_sensors_t/S

119
Supermon In Action
  • Check mon output from a node
  • telnet n1 2709
  • S
  • close
  • Check output from supermon daemon
  • telnet localhost 2709
  • S
  • close

120
Supermon In Action
  • Read supermon data and display to console
  • supermon_stats options
  • Create trace file for off-line analysis
  • supermon_tracer options
  • supermon_stats can be used to process trace data
    off-line

121
Module 6: BJS
Presenter: Matt Sottile
  • Objectives
  • Introduce the BJS scheduler
  • Configure and submit jobs using BJS
  • Contents
  • Overview of BJS
  • BJS Configuration
  • Using BJS

122
Overview of BJS
  • Designed to cover the needs of most users
  • Simple, easy to use
  • Extensible interface for adding policies
  • Used in production environments
  • Optimized for use with BProc
  • Traditional schedulers require O(N) processes,
    BJS requires O(1)
  • Schedules and unschedules 1000 processes in 0.1
    seconds

123
BJS Configuration
  • Nodes are divided into pools, each with a policy
  • Standard policies
  • Filler
  • Attempts to backfill unused nodes
  • Shared
  • Allows multiple jobs to run on a single node
  • Simple
  • Very simple FIFO scheduling algorithm

124
Extending BJS
  • BJS was designed to be extensible
  • Policies are plug-ins
  • They require coding to the BJS C API
  • Not hard, but nontrivial
  • Particularly useful for installation-specific
    policies
  • Based on shared-object libraries
  • A fair-share policy is currently in testing at
    LANL for BJS
  • Enforce fairness between groups
  • Enforce fairness between users within a group
  • Optimal scheduling between a user's own jobs

125
BJS Configuration
  • BJS configuration file
  • /etc/clustermatic/bjs.config
  • Global configuration options (usually don't need
    to be changed)
  • Location of spool files
  • spooldir
  • Location of dynamically loaded policy modules
  • policypath
  • Location of UNIX domain socket
  • socketpath
  • Location of user accounting log file
  • acctlog

126
BJS Configuration
  • Per-pool configuration options
  • Defines the default pool
  • pool default
  • Name of policy module for this pool (must exist
    in policydir)
  • policy filler
  • Nodes that are in this pool
  • nodes 0-10000
  • Maximum duration of a job (wall clock time)
  • maxsecs 86400
  • Optional: users permitted to submit to this pool
  • users
  • Optional: groups permitted to submit to this pool
  • groups
  • (An example configuration is sketched below)
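A minimal /etc/clustermatic/bjs.config pool stanza built from the options above; the exact file syntax may differ, and the small node range reflects the two-node tutorial cluster:

  pool default
  policy filler      # backfill unused nodes
  nodes 0-1          # the two tutorial nodes
  maxsecs 86400      # 24-hour wall-clock limit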

127
BJS Configuration
  • Restart BJS daemon to accept changes
  • service bjs restart
  • Check nodes are available
  • bjsstat
  • Pool default   Nodes (total/up/free) 5/2/2
    ID   User   Command   Requirements

128
Using BJS
  • bjssub
  • Submit a request to allocate nodes
  • ONLY runs the command on the front end
  • The command is responsible for executing on nodes
  • -p specify node pool
  • -n number of nodes to allocate
  • -s run time of job (in seconds)
  • -i run in interactive mode
  • -b run in batch mode (default)
  • -D set working directory
  • -O redirect command output to file

129
Using BJS
  • bjsstat
  • Show status of node pools
  • Name of pool
  • Total number of nodes in pool
  • Number of operational nodes in pool
  • Number of free nodes in pool
  • Lists status of jobs in each pool

130
Using BJS
  • bjsctl
  • Terminate a running job
  • -r specify ID number of job to terminate

131
Interactive vs Batch
  • Interactive jobs
  • Schedule a node or set of nodes for use
    interactively
  • bjssub will wait until nodes are available, then
    run the command
  • Good during development
  • Good for single run, short runtime jobs
  • Hands-on interaction with nodes
  • bjssub -p default -n 2 -s 1000 -i bash
  • Waiting for interactive job nodes.
  • (nodes 0 1)
  • Starting interactive job.
  • NODES=0,1
  • JOBID=59
  • > bpsh $NODES date
  • > exit

132
Interactive vs Batch
  • Batch jobs
  • Schedule a job to run as soon as requested nodes
    are available
  • bjssub will queue the command until nodes are
    available
  • Good for long-running jobs that require little or
    no interaction (a batch example is sketched below)
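A short batch-mode sketch using the bjssub options listed earlier in this module; the script name is illustrative, and the use of the NODES variable inside it assumes BJS exports it to batch jobs as it does for interactive jobs:

  # submit a batch job: 2 nodes, 1 hour limit, output captured to a file
  bjssub -p default -n 2 -s 3600 -b -O myjob.out ./myjob.sh

  # myjob.sh runs on the front end and is responsible for
  # placing work on the allocated nodes, e.g.:
  #   #!/bin/sh
  #   bpsh $NODES hostname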