High Performance Linux Clusters
Transcript and Presenter's Notes
1
High Performance Linux Clusters
  • Guru Session, Usenix, Boston
  • June 30, 2004
  • Greg Bruno, SDSC

2
Overview of San Diego Supercomputer Center
  • Founded in 1985
  • Non-military access to supercomputers
  • Over 400 employees
  • Mission: Innovate, develop, and deploy technology
    to advance science
  • Recognized as an international leader in
  • Grid and Cluster Computing
  • Data Management
  • High Performance Computing
  • Networking
  • Visualization
  • Primarily funded by NSF

3
My Background
  • 1984 - 1998: NCR - Helped to build the world's
    largest database computers
  • Saw the transition from proprietary parallel
    systems to clusters
  • 1999 - 2000: HPVM - Helped build Windows clusters
  • 2000 - Now: Rocks - Helping to build Linux-based
    clusters

4
Why Clusters?
5
Moore's Law
6
Cluster Pioneers
  • In the mid-1990s, the Network of Workstations
    project (UC Berkeley) and the Beowulf Project
    (NASA) asked the question:

Can You Build a High Performance Machine
From Commodity Components?
7
The Answer is Yes
Source: Dave Pierce, SIO
8
The Answer is Yes
9
Types of Clusters
  • High Availability
  • Generally small (less than 8 nodes)
  • Visualization
  • High Performance
  • Computational tools for scientific computing
  • Large database machines

10
High Availability Cluster
  • Composed of redundant components and multiple
    communication paths

11
Visualization Cluster
  • Each node in the cluster drives a display

12
High Performance Cluster
  • Constructed with many compute nodes and often a
    high-performance interconnect

13
Cluster Hardware Components
14
Cluster Processors
  • Pentium/Athlon
  • Opteron
  • Itanium

15
Processors: x86
  • Most prevalent processor used in commodity
    clustering
  • Fastest integer processor on the planet
  • 3.4 GHz Pentium 4, SPEC2000int: 1705

16
Processors: x86
  • Capable floating point performance
  • #5 machine on Top500 list built with Pentium 4
    processors

17
Processors: Opteron
  • Newest 64-bit processor
  • Excellent integer performance
  • SPEC2000int: 1655
  • Good floating point performance
  • SPEC2000fp: 1691
  • #10 machine on Top500

18
Processors: Itanium
  • First systems released June 2001
  • Decent integer performance
  • SPEC2000int: 1404
  • Fastest floating-point performance on the planet
  • SPEC2000fp: 2161
  • Impressive Linpack efficiency: 86%

19
Processors Summary
  Processor      GHz    SPECint  SPECfp  Price
  Pentium 4 EE   3.4    1705     1561    $791
  Athlon FX-51   2.2    1447     1423    $728
  Opteron 150    2.4    1655     1644    $615
  Itanium 2      1.5    1404     2161    $4,798
  Itanium 2      1.3    1162     1891    $1,700
  Power4         1.7    1158     1776    ????
20
But What Would You Really Build?
  • Itanium: Dell PowerEdge 3250
  • Two 1.4 GHz CPUs (1.5 MB cache)
  • 11.2 Gflops peak
  • 2 GB memory
  • 36 GB disk
  • $7,700
  • Two 1.5 GHz CPUs (6 MB cache) makes the system
    cost $17,700
  • 1.4 GHz vs. 1.5 GHz
  • 7% slower
  • 130% cheaper

21
Opteron
  • IBM eServer 325
  • Two 2.0 GHz Opteron 246
  • 8 Gflops peak
  • 2 GB memory
  • 36 GB disk
  • $4,539
  • Two 2.4 GHz CPUs: $5,691
  • 2.0 GHz vs. 2.4 GHz
  • 17% slower
  • 25% cheaper

22
Pentium 4 Xeon
  • HP DL140
  • Two 3.06 GHz CPUs
  • 12 Gflops peak
  • 2 GB memory
  • 80 GB disk
  • $2,815
  • Two 3.2 GHz CPUs: $3,368
  • 3.06 GHz vs. 3.2 GHz
  • 4% slower
  • 20% cheaper (the arithmetic behind these three
    comparisons is sketched below)
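
The percentages above are simple ratios of clock speed and list price. A
minimal Python sketch (not part of the presentation) that reproduces the quoted
figures from the numbers on the last three slides:

  # Percent slower comes from the clock-speed ratio; percent cheaper from the
  # price ratio of the faster configuration to the one actually chosen.
  systems = [
      # (system, chosen GHz, faster GHz, chosen price, faster price)
      ("Itanium Dell PowerEdge 3250", 1.4, 1.5, 7700, 17700),
      ("Opteron IBM eServer 325", 2.0, 2.4, 4539, 5691),
      ("Pentium 4 Xeon HP DL140", 3.06, 3.2, 2815, 3368),
  ]
  for name, slow_ghz, fast_ghz, low, high in systems:
      pct_slower = (1 - slow_ghz / fast_ghz) * 100
      pct_cheaper = (float(high) / low - 1) * 100
      print("%-30s %3.0f%% slower, %4.0f%% cheaper" % (name, pct_slower, pct_cheaper))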

23
If You Had $100,000 To Spend On A Compute Farm
  System                # of Boxes   Peak GFlops   Aggregate SPEC2000fp   Aggregate SPEC2000int
  Pentium 4 3 GHz       35           420           89810                  104370
  Opteron 246 2.0 GHz   22           176           56892                  57948
  Itanium 1.4 GHz       12           132           46608                  24528
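
How the table is built: the number of boxes is roughly $100,000 divided by the
per-box prices quoted on the preceding slides, and peak Gflops is that count
times each box's peak. A small sketch of the arithmetic (the SPEC aggregates
follow the same pattern using per-box SPEC numbers):

  budget = 100000
  boxes = [
      # (system, price per box, peak Gflops per box) from the earlier slides
      ("Pentium 4 3 GHz", 2815, 12),
      ("Opteron 246 2.0 GHz", 4539, 8),
      ("Itanium 1.4 GHz", 7700, 11.2),
  ]
  for name, price, peak in boxes:
      n = budget // price
      print("%-22s %2d boxes, %5.1f peak Gflops" % (name, n, n * peak))

The box counts and the Pentium and Opteron aggregates match the table exactly;
the Itanium row comes out near the table's 132 Gflops.
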
24
What People Are Buying
  • Gartner study
  • Servers shipped in 1Q04
  • Itanium: 6,281
  • Opteron: 31,184
  • Opteron shipped 5x more servers than Itanium

25
What Are People Buying
  • Gartner study
  • Servers shipped in 1Q04
  • Itanium: 6,281
  • Opteron: 31,184
  • Pentium: 1,000,000
  • Pentium shipped 30x more than Opteron

26
Interconnects
27
Interconnects
  • Ethernet
  • Most prevalent on clusters
  • Low-latency interconnects
  • Myrinet
  • Infiniband
  • Quadrics
  • Ammasso

28
Why Low-Latency Interconnects?
  • Performance
  • Lower latency
  • Higher bandwidth
  • Accomplished through OS-bypass

29
How Low Latency Interconnects Work
  • Decrease latency for a packet by reducing the
    number of memory copies per packet

30
Bisection Bandwidth
  • Definition: If you split the system in half, what
    is the maximum amount of data that can pass
    between the two halves?
  • Assuming 1 Gb/s links
  • Bisection bandwidth: 1 Gb/s

31
Bisection Bandwidth
  • Assuming 1 Gb/s links
  • Bisection bandwidth: 2 Gb/s

32
Bisection Bandwidth
  • Definition: Full bisection bandwidth is a network
    topology that can support N/2 simultaneous
    communication streams.
  • That is, the nodes on one half of the network can
    communicate with the nodes on the other half at
    full speed.
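
One way to make the definition concrete: given the links in a system and a
split of its endpoints into two halves, the bandwidth across the cut is the sum
of the bandwidths of the links that cross it. A minimal sketch; the four-node,
single-switch topology at the bottom is illustrative, not taken from the slides:

  def cut_bandwidth(links, half_a):
      # links: (endpoint1, endpoint2, Gb/s); half_a: the endpoints in one half.
      # A link crosses the cut when exactly one of its endpoints is in half_a.
      return sum(gbps for a, b, gbps in links if (a in half_a) != (b in half_a))

  # Four nodes hanging off one switch, 1 Gb/s per link.
  links = [("n0", "sw", 1), ("n1", "sw", 1), ("n2", "sw", 1), ("n3", "sw", 1)]
  print(cut_bandwidth(links, {"n0", "n1", "sw"}))   # 2 Gb/s cross this cut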

33
Large Networks
  • When you run out of ports on a single switch, you
    must add another network stage
  • In the example above: assuming 1 Gb/s links,
    uplinks from stage-1 switches to stage-2 switches
    must carry at least 6 Gb/s

34
Large Networks
  • With low port-count switches, you need many
    switches on large systems in order to maintain
    full bisection bandwidth
  • A 128-node system built from 32-port switches
    requires 12 switches and 256 total cables (see
    the sizing sketch below)
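
A rough sizing sketch for the figure quoted above, assuming each leaf switch
splits its ports evenly between node links and uplinks and that spine switches
carry only uplinks (an assumption about the topology, not a statement from the
slides):

  def two_stage_network(nodes, ports_per_switch):
      down = ports_per_switch // 2             # leaf ports facing nodes
      leaves = -(-nodes // down)               # ceiling division
      uplinks = leaves * (ports_per_switch - down)
      spines = -(-uplinks // ports_per_switch)
      switches = leaves + spines
      cables = nodes + uplinks                 # node cables + uplink cables
      return switches, cables

  print(two_stage_network(128, 32))            # -> (12, 256), as on the slide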

35
Myrinet
  • Long-time interconnect vendor
  • Delivering products since 1995
  • Delivers a single 128-port full bisection
    bandwidth switch
  • MPI Performance
  • Latency: 6.7 us
  • Bandwidth: 245 MB/s
  • Cost/port (based on 64-port configuration): $1,000
  • Switch + NIC + cable
  • http://www.myri.com/myrinet/product_list.html

36
Myrinet
  • Recently announced 256-port switch
  • Available August 2004

37
Myrinet
  • #5 system on Top500 list
  • System sustains 64% of peak performance
  • But smaller Myrinet-connected systems hit 70-75%
    of peak

38
Quadrics
  • QsNetII E-series
  • Released at the end of May 2004
  • Delivers 128-port standalone switches
  • MPI Performance
  • Latency: 3 us
  • Bandwidth: 900 MB/s
  • Cost/port (based on 64-port configuration): $1,800
  • Switch + NIC + cable
  • http://doc.quadrics.com/Quadrics/QuadricsHome.nsf/
    DisplayPages/A3EE4AED738B6E2480256DD30057B227

39
Quadrics
  • #2 on Top500 list
  • Sustains 86% of peak
  • Other Quadrics-connected systems on Top500 list
    sustain 70-75% of peak

40
Infiniband
  • Newest cluster interconnect
  • Currently shipping 32-port switches and 192-port
    switches
  • MPI Performance
  • Latency: 6.8 us
  • Bandwidth: 840 MB/s
  • Estimated cost/port (based on 64-port
    configuration): $1,700 - $3,000
  • Switch + NIC + cable
  • http://www.techonline.com/community/related_content/24364

41
Ethernet
  • Latency: 80 us
  • Bandwidth: 100 MB/s
  • Top500 list has Ethernet-based systems sustaining
    between 35-59% of peak (a simple transfer-time
    comparison across these interconnects is sketched
    below)
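
A first-order way to compare these interconnects is the usual transfer-time
model t = latency + size / bandwidth. The sketch below plugs in the MPI latency
and bandwidth figures quoted on the preceding slides; it is a model for
intuition, not a measurement:

  interconnects = {             # (latency in seconds, bandwidth in bytes/sec)
      "Myrinet":    (6.7e-6, 245e6),
      "Quadrics":   (3.0e-6, 900e6),
      "Infiniband": (6.8e-6, 840e6),
      "Ethernet":   (80e-6,  100e6),
  }
  for size in (1024, 1024 * 1024):             # 1 KB and 1 MB messages
      print("message size: %d bytes" % size)
      for name, (lat, bw) in sorted(interconnects.items()):
          t = lat + size / bw                  # seconds to move one message
          print("  %-10s %8.1f us" % (name, t * 1e6))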

42
Ethernet
  • What we did with 128 nodes and a $13,000 Ethernet
    network
  • $101/port
  • $28/port with our latest Gigabit Ethernet switch
  • Sustained 48% of peak
  • With Myrinet, we would have sustained 1 Tflop
  • At a cost of $130,000
  • Roughly 1/3 the cost of the system

43
Rockstar Topology
  • 24-port switches
  • Not a symmetric network
  • Best case - 4:1 bisection bandwidth
  • Worst case - 8:1
  • Average - 5.3:1

44
Low-Latency Ethernet
  • Brings OS-bypass to Ethernet
  • Projected performance
  • Latency: less than 20 us
  • Bandwidth: 100 MB/s
  • Potentially could merge the management and
    high-performance networks
  • Vendor: Ammasso

45
Application Benefits
46
Storage
47
Local Storage
  • Exported to compute nodes via NFS

48
Network Attached Storage
  • A NAS box is an embedded NFS appliance

49
Storage Area Network
  • Provides a disk block interface over a network
    (Fibre Channel or Ethernet)
  • Moves the shared disks out of the servers and
    onto the network
  • Still requires a central service to coordinate
    file system operations

50
Parallel Virtual File System
  • PVFS version 1 has no fault tolerance
  • PVFS version 2 (in beta) has fault tolerance
    mechanisms

51
Lustre
  • Open Source
  • Object-based storage
  • Files become objects, not blocks

52
Cluster Software
53
Cluster Software Stack
  • Linux Kernel/Environment
  • RedHat, SuSE, Debian, etc.

54
Cluster Software Stack
  • HPC Device Drivers
  • Interconnect driver (e.g., Myrinet, Infiniband,
    Quadrics)
  • Storage drivers (e.g., PVFS)

55
Cluster Software Stack
  • Job Scheduling and Launching
  • Sun Grid Engine (SGE)
  • Portable Batch System (PBS)
  • Load Sharing Facility (LSF)

56
Cluster Software Stack
  • Cluster Software Management
  • E.g., Rocks, OSCAR, Scyld

57
Cluster Software Stack
  • Cluster State Management and Monitoring
  • Monitoring: Ganglia, Clumon, Nagios, Tripwire,
    Big Brother
  • Management: Node naming and configuration (e.g.,
    DHCP)

58
Cluster Software Stack
  • Message Passing and Communication Layer
  • E.g., Sockets, MPICH, PVM

59
Cluster Software Stack
  • Parallel Code / Web Farm / Grid / Computer Lab
  • Locally developed code

60
Cluster Software Stack
  • Questions
  • How to deploy this stack across every machine in
    the cluster?
  • How to keep this stack consistent across every
    machine?

61
Software Deployment
  • Known methods
  • Manual Approach
  • Add-on method
  • Bring up a frontend, then add cluster packages
  • OpenMosix, OSCAR, Warewulf
  • Integrated
  • Cluster packages are added at frontend
    installation time
  • Rocks, Scyld

62
Rocks
63
Primary Goal
  • Make clusters easy
  • Target audience: Scientists who want a capable
    computational resource in their own lab

64
Philosophy
  • It's no fun to care for and feed a system
  • All compute nodes are 100% automatically
    installed
  • Critical for scaling
  • Essential to track software updates
  • RHEL 3.0 has issued 232 source RPM updates since
    Oct 21
  • Roughly 1 updated SRPM per day
  • Run on heterogeneous, standard, high-volume
    components
  • Use the components that offer the best
    price/performance!

65
More Philosophy
  • Use installation as common mechanism to manage a
    cluster
  • Everyone installs a system
  • On initial bring up
  • When replacing a dead node
  • Adding new nodes
  • Rocks also uses installation to keep software
    consistent
  • If you catch yourself wondering whether a node's
    software is up-to-date, reinstall!
  • In 10 minutes, all doubt is erased
  • Rocks doesn't attempt to incrementally update
    software

66
Rocks Cluster Distribution
  • Fully-automated cluster-aware distribution
  • Cluster on a CD set
  • Software Packages
  • Full Red Hat Linux distribution
  • Red Hat Linux Enterprise 3.0 rebuilt from source
  • De-facto standard cluster packages
  • Rocks packages
  • Rocks community packages
  • System Configuration
  • Configure the services in packages

67
Rocks Hardware Architecture
68
Minimum Components
  • Local hard drive
  • Power
  • Ethernet
  • OS on all nodes (not SSI)
  • x86, Opteron, or IA64 server
69
Optional Components
  • Myrinet high-performance network
  • Infiniband support in Nov 2004
  • Network-addressable power distribution unit
  • keyboard/video/mouse network not required
  • Non-commodity
  • How do you manage your management network?
  • Crash carts have a lower TCO

70
Storage
  • NFS
  • The frontend exports all home directories
  • Parallel Virtual File System version 1
  • System nodes can be targeted as Compute + PVFS or
    strictly PVFS nodes

71
Minimum Hardware Requirements
  • Frontend
  • 2 ethernet connections
  • 18 GB disk drive
  • 512 MB memory
  • Compute
  • 1 ethernet connection
  • 18 GB disk drive
  • 512 MB memory
  • Power
  • Ethernet switches

72
Cluster Software Stack
73
Rocks Rolls
  • Rolls are containers for software packages and
    the configuration scripts for the packages
  • Rolls dissect a monolithic distribution

74
Rolls User-Customizable Frontends
  • Rolls are added by the Red Hat installer
  • Software is added and configured at initial
    installation time
  • Benefit: security patches are applied during
    initial installation
  • This method is more secure than the add-on method

75
Red Hat Installer Modified to Accept Rolls
76
Approach
  • Install a frontend
  • Insert Rocks Base CD
  • Insert Roll CDs (optional components)
  • Answer 7 screens of configuration data
  • Drink coffee (takes about 30 minutes to install)
  • Install compute nodes
  • Login to frontend
  • Execute insert-ethers
  • Boot compute node with Rocks Base CD (or PXE)
  • Insert-ethers discovers nodes
  • Go to step 3
  • Add user accounts
  • Start computing
  • Optional Rolls
  • Condor
  • Grid (based on NMI R4)
  • Intel (compilers)
  • Java
  • SCE (developed in Thailand)
  • Sun Grid Engine
  • PBS (developed in Norway)
  • Area51 (security monitoring tools)

77
Login to Frontend
  • Create an ssh public/private key pair
  • Asks for a passphrase
  • These keys are used to securely log in to compute
    nodes without having to enter a password each
    time you log in to a compute node
  • Execute insert-ethers
  • This utility listens for new compute nodes

78
Insert-ethers
  • Used to integrate appliances into the cluster

79
Boot a Compute Node in Installation Mode
  • Instruct the node to network boot
  • Network boot forces the compute node to run the
    PXE protocol (Pre-eXecution Environment)
  • Also can use the Rocks Base CD
  • If no CD and no PXE-enabled NIC, you can use a
    boot floppy built from Etherboot
    (http://www.rom-o-matic.net)

80
Insert-ethers Discovers the Node
81
Insert-ethers Status
82
eKV: Ethernet Keyboard and Video
  • Monitor your compute node installation over the
    Ethernet network
  • No KVM required!
  • Execute: ssh compute-0-0

83
Node Info Stored In A MySQL Database
  • If you know SQL, you can execute some powerful
    commands
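
For example, a minimal Python sketch of pulling node information out of the
database. The table and column names ("nodes", "name", "cpus", "rack", "rank"),
the database name, and the credentials are illustrative assumptions, not the
documented Rocks schema; check the real layout with SHOW TABLES and DESCRIBE
first:

  import MySQLdb

  # Hypothetical query: list each node with its CPU count and physical location.
  db = MySQLdb.connect(host="localhost", user="apache", passwd="", db="cluster")
  cursor = db.cursor()
  cursor.execute("SELECT name, cpus, rack, rank FROM nodes ORDER BY rack, rank")
  for name, cpus, rack, rank in cursor.fetchall():
      print("%s: %s CPUs (rack %s, rank %s)" % (name, cpus, rack, rank))
  db.close()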

84
Cluster Database
85
Kickstart
  • Red Hat's Kickstart
  • Monolithic flat ASCII file
  • No macro language
  • Requires forking based on site information and
    node type
  • Rocks XML Kickstart
  • Decompose a kickstart file into nodes and a graph
  • Graph specifies an OO framework
  • Each node specifies a service and its
    configuration
  • Macros and SQL for site configuration
  • Driven from a web CGI script

86
Sample Node File
<?xml version="1.0" standalone="no"?>
<!DOCTYPE kickstart SYSTEM "@KICKSTART_DTD@" [<!ENTITY ssh "openssh">]>
<kickstart>
<description>
Enable SSH
</description>
<package>&ssh;</package>
<package>&ssh;-clients</package>
<package>&ssh;-server</package>
<package>&ssh;-askpass</package>
<post>
<file name="/etc/ssh/ssh_config">
Host *
        CheckHostIP             no
        ForwardX11              yes
        ForwardAgent            yes
        StrictHostKeyChecking   no
        UsePrivilegedPort       no
        FallBackToRsh           no
        Protocol                1,2
</file>
chmod o+rx /root
mkdir /root/.ssh
chmod o+rx /root/.ssh
</post>
</kickstart>
87
Sample Graph File
<?xml version="1.0" standalone="no"?>
<!DOCTYPE kickstart SYSTEM "@GRAPH_DTD@">
<graph>
<description>
Default Graph for NPACI Rocks.
</description>
<edge from="base" to="scripting"/>
<edge from="base" to="ssh"/>
<edge from="base" to="ssl"/>
<edge from="base" to="lilo" arch="i386"/>
<edge from="base" to="elilo" arch="ia64"/>
<edge from="node" to="base" weight="80"/>
<edge from="node" to="accounting"/>
<edge from="slave-node" to="node"/>
<edge from="slave-node" to="nis-client"/>
<edge from="slave-node" to="autofs-client"/>
<edge from="slave-node" to="dhcp-client"/>
<edge from="slave-node" to="snmp-server"/>
<edge from="slave-node" to="node-certs"/>
<edge from="compute" to="slave-node"/>
<edge from="compute" to="usher-server"/>
<edge from="master-node" to="node"/>
<edge from="master-node" to="x11"/>
<edge from="master-node" to="usher-client"/>
</graph>
88
Kickstart framework
89
Appliances
  • Laptop / Desktop
  • Appliances
  • Final classes
  • Node types
  • Desktop IsA
  • standalone
  • Laptop IsA
  • standalone
  • pcmcia
  • Code re-use is good

90
Architecture Differences
  • Conditional inheritance
  • Annotate edges with target architectures
  • If i386: Base IsA grub
  • If ia64: Base IsA elilo
  • One Graph, Many CPUs
  • Heterogeneity is easy
  • Not for SSI or Imaging

91
Installation Timeline
92
Status
93
But Are Rocks Clusters High Performance Systems?
  • Rocks Clusters on June 2004 Top500 list

94
(No Transcript)
95
What We Proposed To Sun
  • Let's build a Top500 machine
  • from the ground up
  • in 2 hours
  • in the Sun booth at Supercomputing '03

96
Rockstar Cluster (SC03)
  • Demonstrate
  • We are now in the age of personal
    supercomputing
  • Highlight abilities of
  • Rocks
  • SGE
  • Top500 list
  • #201 - November 2003
  • #413 - June 2004
  • Hardware
  • 129 Intel Xeon servers
  • 1 Frontend Node
  • 128 Compute Nodes
  • Gigabit Ethernet
  • $13,000 (US)
  • 9 24-port switches
  • 8 4-gigabit trunk uplinks