Introduction to Clusters - PowerPoint PPT Presentation

1
Introduction to Clusters
  • Philip Papadopoulos
  • Greg Bruno
  • Mason Katz

2
Overview
  • 12 Lectures covering various aspects of cluster
    computing
  • History
  • Architecture
  • Construction
  • Programming
  • Management/Monitoring
  • Application
  • Application Optimization
  • First two lectures provide initial high-level
    overview with some details about message
    pipelines
  • Follow-on lectures go into more detail about various aspects of clustering

3
Scoping Rules
  • Focused on computing clusters
  • Large number of nodes that need similar system
    software footprints
  • MPI-style parallelism is the dominant application
    model
  • Not assuming homogeneity of hardware
    configurations
  • Do assume the same OS
  • Even homogeneous systems exhibit hardware
    differences
  • Not high-availability clusters
  • Our techniques can help here, but we don't address the specific software needs of HA

4
Modern Clusters
  • What are the design issues involved in building a commodity cluster?
  • Selecting nodes
  • Networks
  • Operating System
  • Historical perspective important
  • Understand where technologies started
  • Don't repeat mistakes
  • Technical background for classifications

5
High-Performance Clusters
Gigabit Networks: Myrinet, SCI, FC-AL, Giganet, GigE, ATM
  • Killer micros: low-cost gigaflop processors here for a few k$ per processor
  • Killer networks: gigabit network hardware, high-performance software (e.g. Fast Messages), soon at $100s per connection
  • Leverage HW, commodity SW (*nix/Windows NT), build key technologies
  • => high-performance computing in a RICH software environment

6
Cluster Research Groups
  • Many other cluster groups that have had impact
  • Active Messages / Network of Workstations (NOW): UC Berkeley
  • Basic Interface for Parallelism (BIP): Univ. of Lyon
  • Fast Messages (FM) / High Performance Virtual Machines (HPVM): UIUC/UCSD
  • Real World Computing Partnership: Japan
  • Scalable High-performance Really Inexpensive Multi-Processor (SHRIMP): Princeton
  • Most of these groups moved on to other activities at the end of the '90s
  • We're now in the stage of taking these proofs-of-concept to the level of production machines

7
Clusters are Different
  • A pile of PCs is not a large-scale SMP server.
  • Why? Performance and programming model
  • A cluster's closest cousin is an MPP
  • What's the major difference? Clusters run N copies of the OS; MPPs usually run one.

8
A Quick Snapshot of What The Top Supercomputers
Look like
  • Linpack benchmark used in the Top 500

9
(Figure: Linpack performance, Nov 2001. Source: Jack Dongarra)
10
(Figure: Top 500 architectures, Nov 2001. Source: Jack Dongarra)
11
Top 500 Observations
  • Clusters or Clusters of SMPs account for 150/500
    machines
  • Clusters and MPPs account for 80% of the machines
  • Single processor machines dropped off the list in
    1997
  • Earth Simulator will represent about 20% of the aggregate computing speed of the Top 500

12
Some Architectural Background
  • Machine Classification
  • Algorithmic Models
  • Processor Types

13
Machine Classifications
  • Flynn (1966) classified machines by their data and control streams

14
Machine Classification SISD
  • Single Instruction Single Data Stream
  • Your garden-variety single CPU system
  • Even this view isn't so simple because dual-processor PCs are getting cheap
  • Standard Von Neumann Architecture

15
Machine Classification SIMD
  • SIMD
  • All processors execute the same program in
    lockstep
  • Data that each processor sees is different
  • Single control processor
  • Individual processors can be turned on/off at
    each cycle
  • Illiac IV, CM-2, MasPar are some examples
  • Silicon Graphics Reality Graphics engine
  • Also called data parallel

16
Machine Classification MIMD
  • All processors execute their own set of
    instructions
  • Processors operate on separate data streams
  • No centralized clock implied
  • SP-2, T3E, Clusters, Crays, E10000
  • Valid on all memory hierarchies
  • SMP
  • NUMA
  • Distributed
  • Important to realize that memory distribution is independent of the machine classification

17
Distributed Memory vs. Shared
  • Distributed Memory
  • Memory for individual processors is not shared on the same bus (address space)
  • Data is explicitly sent from one address space to
    another when needed
  • Sending data also synchronizes processors
  • Very scalable
  • Parallelization done through explicit algorithms
  • Shared memory
  • Memory is shared among all processors
  • No explicit data movement
  • Synchronization managed through locks/semaphores
  • Not as scalable (Max 106 CPUs on Enterprise 10K)
  • Parallelization either explicit or through the compiler (or both); a minimal shared-memory sketch follows below
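
To make the contrast concrete, here is a minimal shared-memory sketch (my own illustration, not from the slides) in which POSIX threads share one counter and synchronize through a lock; in the distributed-memory style the same coordination would instead be expressed as explicit messages. Compile with -pthread.

```c
/* Shared-memory style: threads share one address space and
 * synchronize through a lock.  Illustrative sketch only. */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4

static long counter = 0;                          /* shared data   */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);                /* explicit synchronization */
        counter++;                                /* no explicit data movement */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t t[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    printf("counter = %ld\n", counter);           /* expect 400000 */
    return 0;
}
```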

18
Programming Models
  • People often refer to the memory system as the
    programming model
  • Shared Memory Programming Model
  • Distributed Memory Programming Model
  • This is only part of the story
  • Synchronization and how the parallel application
    program is expressed is the other half

19
Programming Models SPMD
  • Single/Multiple Program Multiple Data
  • Similar to SIMD, but generalized for common CPUs
  • SPMD: processors run the same program, but are not necessarily run in lock step
  • Very popular and scalable programming style
  • MPI assumes an SPMD model (see the sketch after this list)
  • MPMD is similar, except that different processors can run different programs
  • PVM distribution has some simple examples
  • Grid Computing is really MPMD
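
A minimal SPMD sketch with MPI (an assumed example, not from the slides): every process runs the same program, and behavior is differentiated by rank.

```c
/* SPMD sketch: one program, many processes, behavior keyed off the rank. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* who am I?       */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* how many of us? */

    if (rank == 0)
        printf("coordinator: %d processes total\n", size);
    else
        printf("worker %d: computing my share of the data\n", rank);

    MPI_Finalize();
    return 0;
}
```

Typically built with mpicc and launched with something like mpirun -np 4 ./a.out.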

20
Processor Types
  • Four general types
  • Vector
  • Cache-based, pipelined
  • Custom (e.g. Tera MTA or KSR-1)
  • Bit serial
  • Commodity Clusters use cache-based, pipelined
  • Intel x86 is the most common building block
  • NOW project built on SPARC
  • Some want to play with game platforms
    (Playstation)

21
Bit Serial (Early 90s)
  • Only seen in SIMD machines like Connection
    Machine (CM-2) or MasPar
  • Each clock cycle, one bit of the data is
    loaded/written
  • Simplifies memory system and memory trace count
  • Were popular for very dense (64K) processor arrays
  • Limited efficiencies and problem domains eventually led to their demise

22
Cache-based, Pipelined
  • Garden Variety Microprocessor
  • SPARC, Intel x86, MC68xxx, MIPS, ...
  • Register-based ALUs and FPUs
  • Registers are of scalar type
  • Pipelined execution to improve performance of
    individual chips
  • Splits up components of a basic operation like addition into stages (the P4 has a 20-stage pipeline)
  • The more stages, the greater the potential speedup, but the more problems with branching and data/control hazards
  • Per-processor caches make it challenging to build
    SMPs (coherency issues)
  • Now dominates the high-end market

23
Vector Processors
  • Very specialized machines
  • Registers are true vectors with power of 2
    lengths
  • Designed to efficiently perform matrix-style operations, e.g. Ax = b, i.e. b(I) = Σ_J A(I,J)·x(J) (a scalar reference loop is sketched below)
  • Vector registers V1, V2, V3
  • V1 ← A(I,:), V2 ← x(:)
  • MULV V3, V1, V2
  • Chaining to efficiently handle larger vectors
    than size of vector registers
  • Cray SV-2, Hitachi, NEC (Earth Simulator) are
    examples
  • Highly optimized compilers => high efficiencies
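
For reference, the operation a vector unit (or a vectorizing compiler) targets here is just the dense matrix-vector product; a plain scalar C version of the same loop:

```c
/* b = A x : the j-loop is what a vector machine executes on whole
 * vector registers at a time (illustrative scalar reference only). */
void matvec(int n, double A[n][n], const double x[n], double b[n])
{
    for (int i = 0; i < n; i++) {
        double sum = 0.0;
        for (int j = 0; j < n; j++)      /* maps onto vector multiply/add */
            sum += A[i][j] * x[j];
        b[i] = sum;
    }
}
```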

24
Some Custom Processors
  • Denelcor HEP/Tera MTA
  • Multiple register sets
  • Stack Pointer, Instruction Pointer, Frame
    Pointer, etc.
  • Switch each clock cycle to different register set
  • SMT (Simultaneous Multithreading)
  • Why? Stalls to memory subsystem in one thread can
    be hidden by concurrency
  • Compilers needed to express concurrency
  • KSR-1 (Company lasted only about 4 years)
  • Cache-only memory processor
  • Peak capability was 2 generations behind
    standard micros of the day

25
Going Parallel
  • Late '70s, even vector "monsters" started to go parallel
  • For parallel processing to work, individual processors must eventually synchronize
  • SIMD: hardware synchronizes every clock cycle
  • MIMD: explicit synchronization done in the program
  • Message passing
  • Data and synchronization are in the message itself
  • Can be on shared- or distributed-memory machines
  • Shared memory: semaphores, monitors, fetch-and-increment
  • We'll review some key interconnect properties later in this talk

26
Rough Timelines of Software and Hardware that have led to clusters
27
Some Happenings
(Timeline figure, roughly 1990-2002. Software milestones: PVM, MPI 1, Linux, NOW, Beowulf, HPVM, BIP, SCore (RWCP), Legion, Globus, VIA, MPI 2, Scyld, SCE, OSCAR, Rocks. Hardware milestones: Cray Y-MP 8 vectors 2.5 GF, KSR, CM-5, IBM SP1 1024 CPUs, Intel Paragon 150 GF 1024 CPUs, Hitachi CP-PACS 700 GF, ASCI Red 1 TF, ASCI Blue 3 TF, IBM SP3 1 TF, NCSA Platinum 1 TF x86, ASCI White 12 TF, TeraGrid 13 TF IA-64, Earth Simulator 40 TF.)
28
Network of Workstations (NOWs)
  • David Culler (UC Berkeley) started in the early '90s
  • SunOS on SPARC Microprocessor
  • First-generation Myrinet
  • Active messages for high-performance
  • Glunix (Global Unix) execution environment
  • Split-C programming, PVM and eventually MPI
  • NOW work became the base technology for Hotbot
    (Inktomi, Inc. started in 1997)

29
Impact of NOW Project
  • Brought key issues to the forefront of
    commodity-based computing
  • Global OS
  • Parallel file systems
  • Fault tolerance
  • High-performance messaging
  • System Management

30
Clusters, Beowulfs, and more
  • How do you put a Pile-of-PCs into a room and
    make them do real work?
  • Interconnection technologies
  • Programming them
  • Monitoring
  • Starting and running applications
  • Running at Scale
  • NOW pioneered the vision for clusters of
    commodity processors.
  • Beowulf popularized the notion and made it very
    affordable

31
Beowulf Cluster Definition
  • Current working definition: a collection of commodity PCs running an open-source operating system with a commodity interconnection network
  • Dual Intel PIIIs with fast ethernet, Linux
  • Program with PVM, MPI,
  • Single Alpha PCs running Linux

32
Beowulf Clusters (cont'd)
  • Interconnection network is usually fast ethernet
    running TCP/IP
  • (Relatively) slow network
  • Programming model is message passing
  • Most people now associate the name Beowulf with
    any cluster of PCs
  • Beowulfs are differentiated from
    high-performance clusters by the network
  • www.beowulf.org has lots of information

33
Outcome of these activities
  • Brought most of the key ingredients of MPPs into the commodity space
  • Allowed many more people to really work on
    parallel computing
  • Wider application audience can understand issues
  • Had a large impact on MPPs of the day. NOW
    project analysis improved Paragon messaging
    performance by 2X
  • Almost all software components were made
    available as open source
  • This was key to technology sharing instead of
    reinvention

34
Hardware variations on a basic layout
(Diagram: front-end node(s); power distribution with net-addressable units as an option; public Ethernet; Fast-Ethernet switching complex; gigabit network switching complex)
35
High Performance Commodity Clusters
  • Rocks v2.1
  • 2 Frontends, 4 NFS Servers
  • 100 nodes
  • Compaq
  • 800 and 933 MHz, and IA-64
  • SCSI, IDE
  • IBM
  • 733 MHz and 1 GHz
  • SCSI
  • 50 GB RAM
  • Ethernet
  • For management
  • Myrinet 2000

36
Beowulfs vs. High Performance
  • Beowulfs traditionally have ethernet (Store and
    forward switches)
  • Very inexpensive interconnect
  • High host CPU processing overhead
  • Higher latency
  • Messaging characteristics limit scalability
  • High-performance Clusters
  • Interconnect is a significant cost
  • Better scalability
  • Myrinet brought the technology of Intel Paragon
    to the commodity market.

37
Clusters vs. MPPs
  • MPPs introduced in the late '80s
  • Connection Machine
  • Paragons
  • IBM SP
  • Cray T3E (90s)
  • MPPs have specialized interconnects, proprietary
    OSes. Designed to give the illusion of a uniform
    machine
  • Clusters were designed to replace expensive MPPs.
  • Successful. New large machines are mostly
    clusters
  • PC clusters are now affordable in lab/single PI
    environments

38
Linux
  • Linux started as a student project in 1991
  • Good integrated distributions in 1993 (e.g. Slackware)
  • Becker (Beowulf Project) wrote high-performing Ethernet drivers
  • Fundamental enabler for clusters
  • Major releases of the kernel improved multiprocessing performance, stability, support of devices
  • This became the essential piece to finish the
    commodity puzzle
  • Networks
  • CPUs (Intel)
  • Operating System
  • Message passing software (PVM and MPI)

39
Other OSes
  • Windows (NT, 2000, XP)
  • HPVM Project (Chien)
  • Velocity cluster (Cornell Theory Center)
  • SunOS/Solaris
  • OS of the Berkeley NOW project
  • What about
  • AIX
  • MacOS
  • HPUX
  • Tru64 (Compaq)
  • All can be used as the basic OS, but none has the wide acceptance of Linux for cluster architectures

40
Do Commodity Clusters based on Linux perform
adequately?
41
IA-32 Application Scaling
Source: Dave Pierce, SIO
42
Itanium Cluster Performance
NAMD: Scalable Molecular Dynamics
  • Simulation of large biomolecular systems on parallel computers
  • File compatible with CHARMM and AMBER
  • Message-driven and object-oriented design, implemented with Charm++/Converse (from PPL at UIUC)
  • Ported to PACI systems, clusters, and desktop PCs
  • Available for free, includes source code

(Figure: ApoA1 (PME) benchmark, 92K atoms, on a Pentium III cluster and an Itanium cluster. Source: Rob Pennington, NCSA)
43
Clusters on the Grid
(Figure: grid-attached clusters METEOR II, Deep Impact (32 CPUs total, 150 GB disk total, Myrinet), and Broad Impact)
44
PC Cluster Performance
  • Right on par with more expensive MPPs
  • Sometimes outperforms on particular codes
  • What are some things that are lacking?
  • Natural application development/debugging
    environment
  • High-performance disk I/O
  • Management of clusters can be a challenge without
    scalable techniques

45
Putting a cluster together
  • (16, 32, 64, ..., X) individual nodes
  • E.g. dual-processor Pentium III/1.13 GHz, 1 GB memory, Ethernet
  • Scalable High-speed network
  • Myrinet, Giganet, Gigabit Ethernet
  • Message-passing libraries
  • TCP, MPI, PVM, VIA
  • Multiprocessor job launch
  • Portable Batch System (PBS)
  • Load Sharing Facility
  • PVM spawn, mpirun, ssh,
  • Techniques for system management
  • NPACI Rocks (Rocks) is a good example

46
Providing an abstraction
  • A pile of PCs is not an attractive model from
    the application point of view.
  • Need a coherent view and abstraction
  • Abstractions simplify the hardware so that
    algorithms can be more naturally mapped
  • A cluster is a distributed memory MIMD
  • MPI is the preferred way to express parallelism
    in applications
  • Understanding some of the lower-level details can
    be essential to obtaining good application
    performance.

47
Virtualization of Machines
  • Want the illusion that a collection of machines
    (cluster) is a single machine
  • Start, stop, monitor distributed programs
  • Programming and debugging should work seamlessly
  • PVM (Parallel Virtual Machine) was the first widely-adopted virtualization for parallel computing
  • MPI is a standard API for message passing.
  • This illusion is only partially complete in any
    software system. Some issues
  • Node heterogeneity.
  • Real network topology can lead to contention
  • Unrelated: what is a Java Virtual Machine?

48
High-Performance Communication
High-performance: switched multi-gigabit, user-level access networks vs. switched 100 Mbit, OS-mediated access
  • Level of network interface support, NIC/network router latency
  • Overhead and latency of communication → deliverable bandwidth
  • High-performance communication ≠ programmability!
  • Low-latency, low-overhead, high-bandwidth cluster communication + much more is needed
  • Usability issues, I/O, reliability, availability
  • Remote process debugging/monitoring

49
Communication Networks
  • Understanding Characteristics is important to
    understanding scalability of machines

50
Characterizing Networks
  • Bandwidth
  • Device/switch latency
  • Switching types
  • Circuit switched (e.g. telephone)
  • Packet switched (e.g. Internet)
  • Store and forward
  • Virtual Cut Through
  • Wormhole routed
  • Topology
  • Number of connections
  • Diameter (how many hops through switches)

51
Latency
  • Latency is the amount of time from when a command starts until any effect is seen
  • Push on the gas pedal before the car goes forward
  • The time from when you enter a line until the cashier starts on your job
  • First bit leaves computer A, first bit arrives at
    computer B
  • OR
  • (Message latency) First bit leaves computer A,
    last bit arrives at computer B
  • Startup latency is the amount of time to send a
    zero length message

52
Bandwidth
  • Bits/second that can travel through a connection
  • A really simple model for calculating the time to send a message of N bytes
  • Time = latency + N/bandwidth (worked through in the sketch below)
  • Bisection is the minimum number of wires that
    must be cut to divide a network of machines into
    two equal halves.
  • Bisection bandwidth is the total bandwidth
    through the bisection
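
Plugging numbers into the simple time model makes the latency/bandwidth trade-off visible; the sketch below uses the Fast Ethernet and Myrinet 2000 figures quoted later in the talk (about 80 µs / 11 MB/s vs. 8 µs / 240 MB/s) purely as illustrative parameters.

```c
/* First-order message cost model: time = latency + bytes / bandwidth.
 * Parameter values are illustrative only, not measurements of any
 * particular installation. */
#include <stdio.h>

static double xfer_time(double latency_s, double bw_bytes_per_s, double nbytes)
{
    return latency_s + nbytes / bw_bytes_per_s;
}

int main(void)
{
    double sizes[] = { 8, 1024, 1048576 };        /* message sizes in bytes */
    for (int i = 0; i < 3; i++) {
        double n = sizes[i];
        printf("%8.0f B:  fast-ether %.1f us   myrinet %.1f us\n", n,
               1e6 * xfer_time(80e-6, 11e6,  n),   /* ~80 us, 11 MB/s  */
               1e6 * xfer_time(8e-6,  240e6, n));  /* ~8 us, 240 MB/s  */
    }
    return 0;
}
```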

53
Interconnection Topologies
  • Completely connected
  • Every node has a direct wire connection to every
    other node
  • (N × (N-1))/2 wires; clearly impractical at scale

54
Line/Ring
(Figure: nodes 1-7 connected in a line/ring)
  • Simple interconnection
  • First topology where routing is an issue
  • Needed when no direct connection exists between
    nodes
  • To go from node 2 to node 4, you have to pass through node 3
  • What happens if 2 wants to communicate with 3 at the same time that 1 wants to communicate with 4?
  • What is the bisection of a line/ring?
  • If the links are of bandwidth B, what is the bisection bandwidth?
  • What is the aggregate bandwidth of the network?

55
Mesh/Torus
  • Generalization of line/ring to multiple
    dimensions
  • More routes between nodes
  • What is the bisection of this network?
  • Paragon is an example

(Figure: nodes 1-7 arranged along the rows and columns of a 2-D mesh/torus)
56
Hop Count
  • Networks are measured by diameter
  • This is the minimum number of hops that a message must traverse between the two nodes that are furthest apart
  • Line: diameter = N - 1
  • 2-D (N×M) mesh: diameter = N + M - 2 (e.g. a 4×8 mesh has diameter 10)

57
Tree-based Networks
  • Nodes organized in a tree fashion (important for
    some global algorithms)

Diameter of this network? Bisection, bisection bandwidth? The CM-5 was a "fat tree": links got faster near the top
58
Hypercubes
(Figure: hypercubes of dimension 1, 2, 3, and 4)
59
Hypercubes 2
  • A dimension-N hypercube is constructed by connecting the corresponding corners of two (N-1)-dimensional hypercubes
  • Relatively low wire count to build large networks
  • Multiple routes from any node to any destination
  • Exercise for the reader: what is the diameter of a K-dimensional hypercube?

60
Communication Topologies
  • Interconnect topologies were very important areas of research in the early/mid-'90s
  • MPI-1 spent a great deal of time addressing
    topologies for optimization
  • Hardware Topologies largely unimportant now
    because of wormhole routed networks and crossbar
    networks
  • Logical topologies are very important in
    constructing efficient parallel programs
  • Collective operations (sum, reduce, broadcast); see the MPI sketch below
  • MPI topologies are important from this perspective
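
As an illustration of a logical topology hidden behind a collective, here is a minimal MPI sketch (an assumed example, not from the slides) that sums one value from every rank and then broadcasts the result.

```c
/* Collective operations hide the logical (often tree-shaped) topology. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int mine = rank + 1, total = 0;
    /* Typically implemented internally as a reduction tree,
     * not N independent point-to-point sends. */
    MPI_Reduce(&mine, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
    MPI_Bcast(&total, 1, MPI_INT, 0, MPI_COMM_WORLD);

    printf("rank %d sees total %d\n", rank, total);
    MPI_Finalize();
    return 0;
}
```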

61
Modern Networks are Packet Switched
  • Break message into smaller blocks and send these
    pieces through the network
  • Network intermediate points (routers) can be
    store-and-forward or virtual cut through
  • Store and forward requires buffering at each
    switch if an incoming packet has packets ahead of
    it on an outgoing port (congestion)
  • Virtual cut-through eliminates the buffering for
    store and forward by cutting through the switch
    when the output port is free

62
Switch Types
(Figure: a store-and-forward switch buffers whole packets, e.g. Ethernet; a cut-through switch does not, e.g. Myrinet)
63
Wormhole Routing
  • Wormhole routing is a variation of virtual cut
    through
  • Small headers (flow control digits, or flits) pass through the network.
  • When a flit is allowed to cut through a switch,
    the original sender is guaranteed a clear path
    through that switch.
  • A tail flit closes the connection
  • Going through multiple switches sets up a virtual
    circuit from sender to receiver
  • Wormhole routing was defined by Seitz and is used in Myrinet, a very popular cluster interconnect.

64
Wormhole-Routed Networks
(Figure: the message stream forms a virtual circuit through the switches)
65
Routing and Deadlock
  • If routing algorithms are not carefully constructed, deadlock can occur
  • Head flits block and can never establish a connection
  • Routing algorithms are provably deadlock-free under mild assumptions
  • Streams are of finite duration (packetized)
  • Receiver/sender coordinate so that the tail flit
    is finally processed (hence virtual connection is
    closed and input/output ports on switches are
    freed).

66
Latency of Circuit Switched and Virtual Cut
Through
  • Circuit switch latency = (Lc/B)·l + (L/B)
  • Lc = length of control packet
  • B = bandwidth
  • l = number of links
  • L = length of packet
  • Virtual cut-through latency = (Lh/B)·l + (L/B)
  • Lh = length of header

67
Store-Forward and Wormhole routing Latency
  • Wormhole routing latency = (Lf/B)·l + (L/B)
  • Lf = length of a flit
  • Store-and-forward latency = (L/B)·l
  • Store-and-forward latency can be much worse for many hops (compare the numbers in the sketch below)
  • Virtual cut-through, wormhole, and circuit switching all approach (L/B) as message length increases
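
Plugging illustrative numbers (assumed, not from the talk) into the two formulas shows why store-and-forward hurts as the hop count grows:

```c
/* Compare store-and-forward vs. wormhole latency from the slide formulas:
 *   store-and-forward: (L/B) * l
 *   wormhole:          (Lf/B) * l + L/B
 * All parameter values below are assumed, for illustration only. */
#include <stdio.h>

int main(void)
{
    double B  = 100e6;     /* link bandwidth, bytes/s (assumed) */
    double L  = 1500;      /* packet length, bytes (assumed)    */
    double Lf = 8;         /* flit length, bytes (assumed)      */

    for (int hops = 1; hops <= 8; hops *= 2) {
        double sf = (L / B) * hops;
        double wh = (Lf / B) * hops + L / B;
        printf("%d hops: store-and-forward %.1f us, wormhole %.1f us\n",
               hops, 1e6 * sf, 1e6 * wh);
    }
    return 0;
}
```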

68
Message Passing
  • Details of Networks to achieve High Performance

69
Communication Style is Message Passing
(Figure: a message from machine A is broken into numbered packets that travel through the network and are reassembled at machine B)
  • How do we efficiently get a message from Machine
    A to Machine B?
  • How do we efficiently break a large message into
    packets and reassemble at receiver?
  • How does the receiver differentiate among message fragments (packets) from different senders? (see the MPI sketch below)
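
In MPI terms the receiver tells messages apart by the (source, tag, communicator) triple; a minimal sketch (an assumed example) where rank 0 collects one value from every other rank:

```c
/* Receiver distinguishes incoming messages by source rank and tag. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int TAG_DATA = 42;
    if (rank != 0) {                       /* every worker sends to rank 0 */
        int payload = 100 * rank;
        MPI_Send(&payload, 1, MPI_INT, 0, TAG_DATA, MPI_COMM_WORLD);
    } else {
        for (int i = 1; i < size; i++) {
            int value;
            MPI_Status st;
            /* accept from any sender, then look at who it actually was */
            MPI_Recv(&value, 1, MPI_INT, MPI_ANY_SOURCE, TAG_DATA,
                     MPI_COMM_WORLD, &st);
            printf("got %d from rank %d\n", value, st.MPI_SOURCE);
        }
    }
    MPI_Finalize();
    return 0;
}
```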

70
We will use the details of FM to illustrate some communication engineering
  • Previous slides focused on the switch hardware
  • These slides look at what the endpoints must do to take advantage of high-speed wormhole-routed networks

71
FM on Commodity PCs
(Figure: the FM host library runs on the Pentium III host (about 1000 MIPS, P6 bus), the FM device driver in the OS, and the FM firmware on the NIC (about 133 MIPS), with the NIC attached over PCI to a 2000 Mbps link)
  • Host library: API presentation, flow control, segmentation/reassembly, multithreading
  • Device driver: protection, memory mapping, scheduling monitors
  • NIC firmware: link management, incoming buffer management, routing, multiplexing/demultiplexing

72
Fast Messages 2.x Performance (1998)
  • Latency 8.8 µs, bandwidth 100 MB/s, N½ ≈ 250 bytes
  • Fast in absolute terms (compares to MPPs,
    internal memory BW)
  • Delivers a large fraction of hardware performance
    for short messages
  • Technology transferred into emerging cluster standards: Intel/Compaq/Microsoft's Virtual Interface Architecture.

73
Comments about Performance
  • Latency and bandwidth are the most basic measurements of message passing machines
  • We will discuss performance models in detail because latency and bandwidth do not tell the entire story
  • High-performance clusters exhibit
  • 20X deliverable bandwidth over 100 Mbit Ethernet
  • Myrinet 2000: 240 MB/s vs. 11 MB/s (Fast Ethernet)
  • 10X improvement in latency
  • Myrinet 2000: 8 µs vs. 80 µs (Fast Ethernet)

74
How do FM/GM/PM/AM really get Speed?
  • Protected user-level access to network
    (OS-bypass)
  • Efficient credit-based flow control (a conceptual sketch follows below)
  • Assumes a reliable hardware network, so only OK for system area networks
  • No buffer overruns (stalls the sender if no receive space)
  • Early de-multiplexing of incoming packets
  • multithreading, use of NT user-schedulable
    threads
  • Careful implementation with many tuning cycles
  • Overlapping DMAs (Recv), Programmed I/O send
  • No interrupts! Polling only.
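
A conceptual sketch of credit-based flow control (my simplification, not the actual FM sources): the sender may only inject a packet while it holds a credit for remote receive-buffer space, and credits come back as the receiver drains its buffers.

```c
/* Credit-based flow control, conceptual sketch only (not the FM code).
 * Each credit corresponds to one pinned receive buffer on the remote node. */
#include <stdbool.h>
#include <stdio.h>

#define INITIAL_CREDITS 16     /* receive buffers pre-posted per peer (assumed) */

struct peer {
    int credits;               /* how many packets we may still inject */
};

/* Try to send one packet; refuse (return false) if no credit is available,
 * so the receiver's DMA region can never be overrun. */
static bool try_send(struct peer *p, const void *pkt, int len)
{
    if (p->credits == 0)
        return false;          /* caller must poll for returned credits */
    p->credits--;
    /* ... programmed-I/O or DMA of (pkt, len) to the NIC would go here ... */
    (void)pkt; (void)len;
    return true;
}

/* Called when a credit return, piggybacked on incoming traffic, is seen. */
static void credits_returned(struct peer *p, int n)
{
    p->credits += n;
}

int main(void)
{
    struct peer p = { INITIAL_CREDITS };
    int sent = 0;
    for (int i = 0; i < 40; i++) {
        if (!try_send(&p, "payload", 8)) {
            credits_returned(&p, 8);   /* pretend the receiver drained 8 buffers */
            i--;                       /* retry this packet */
            continue;
        }
        sent++;
    }
    printf("sent %d packets without overrunning the receiver\n", sent);
    return 0;
}
```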

75
OS-Bypass Background
  • Suppose you want to perform a sendto on a
    standard IP socket?
  • Operating System mediates access to the network
    device
  • Must trap into the kernel to ensure authorization on each and every message (very time consuming)
  • Message is copied from user program to kernel
    packet buffers
  • Protocol information about each packet is
    generated by the OS and attached to a packet
    buffer
  • Message is finally sent out onto the physical
    device (ethernet)
  • Receiving does the inverse with a recvfrom
  • Packet to kernel buffer, OS strips the header, reassembles the data, mediates authorization, copies into the user program

76
OS-Bypass
  • A user program is given a protected slice of the
    network interface
  • Authorization is done once (not per message)
  • Outgoing packets get directly copied or DMAed to
    network interface
  • Protocol headers added by user-level library
  • Incoming packets get routed by network interface
    card (NIC) into user-defined receive buffers
  • NIC must know how to differentiate incoming
    packets. This is called early demultiplexing.
  • Outgoing and incoming message copies are
    eliminated.
  • Traps to OS kernel are eliminated (bypass)

77
What's the Catch to OS-Bypass?
  • Because only the user application is involved in
    message transmission, it must actively service
    the network connection
  • Kernel timers can't be used (they are bypassed)
  • Usually a service thread takes the place of the
    kernel-based mechanisms
  • When not handled properly, can cause strange
    results
  • Because applications get a slice of the network,
    only a small number of processes can
    simultaneously access the high-speed links

78
Packet Pathway
(Figure: packets are DMAed from the network into a pinned DMA receive region; user-level handlers then move them into user message buffers, while sends go out via programmed I/O or DMA)
  • Concurrency of I/O busses
  • Sender specifies receiver handler ID
  • Flow control keeps DMA region from being
    overflowed

79
MPI-FM 2.x Layering
(Figure: the MPI header and source buffer are gathered on the sender and scattered directly into the destination buffer on the receiver)
  • Gather-scatter interface and handler multithreading enable efficient layering and data manipulation without copies

80
MPI on FM 2.x
(Figure: bandwidth vs. message size)
  • MPI-FM: 91 MB/s, 13 µs latency, 4 µs overhead
  • Short messages much better than IBM SP2, PCI limited
  • Latency comparable to an SGI O2K

81
MPI-FM 2.x Efficiency
(Figure: MPI-FM transfer efficiency vs. message size)
  • High transfer efficiency, approaches 100% (Lauria, Pakin et al., HPDC-7, 1998)
  • Other systems much lower even at 1 KB (100 Mbit ≈ 40%, 1 Gbit ≈ 5%)

82
Is this detail important?
  • Yes! Detail of a particular high-performance
    interface illustrates some of the complexity for
    these systems
  • Performance and scaling are very important.
    Sometimes the underlying structure needs to be
    understood to reason about applications.
  • Overhead vs. Latency
  • Bandwidth and communication payload
  • Basic understanding of the mechanisms
    de-mystifies what is actually going on.

83
How do we program/run such machines?
  • PVM (Parallel Virtual Machine) provides
  • Simple message passing API
  • Construction of virtual machine with a software
    console
  • Ability to spawn (start), kill (stop), and monitor jobs (see the PVM sketch below)
  • XPVM is a graphical console, performance monitor
  • MPI (Message Passing Interface)
  • Complex and complete message passing API
  • De facto, community-defined standard
  • No defined method for job management
  • mpirun is provided as a tool in the MPICH distribution
  • Commercial and non-commercial tools for monitoring/debugging
  • Jumpshot, VaMPIr,
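
For flavor, here is a sketch of the PVM style of job control (an assumed example, and it presumes a worker binary named "worker" is installed where PVM can find it): the parent enrolls in the virtual machine, spawns workers, and collects a result from each.

```c
/* PVM-style job control, sketch only (classic PVM 3 API). */
#include <pvm3.h>
#include <stdio.h>

#define NWORKERS   4
#define TAG_RESULT 1

int main(void)
{
    int tids[NWORKERS];
    int mytid = pvm_mytid();                       /* enroll in the virtual machine */

    /* "worker" is a hypothetical binary name used for illustration. */
    int started = pvm_spawn("worker", NULL, PvmTaskDefault,
                            "", NWORKERS, tids);   /* anywhere in the VM */
    printf("parent %x started %d workers\n", mytid, started);

    for (int i = 0; i < started; i++) {
        int result;
        pvm_recv(-1, TAG_RESULT);                  /* from any worker */
        pvm_upkint(&result, 1, 1);
        printf("got %d\n", result);
    }
    pvm_exit();                                    /* leave the virtual machine */
    return 0;
}
```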

84
More on MPI
  • Started as a standards effort in 1994
  • Fuse the best ideas from several projects
  • Had a good reference implementation (MPICH), but
    encouraged vendors/researchers to improve and/or
    replace
  • Allows users to write standard parallel
    subroutine libraries
  • Is really a cornerstone software capability for
    parallel machines in general (and clusters in
    particular).

85
Modern HPC clusters should be thought of as
affordable, scalable machines programmed with MPI
86
Cluster Projects have focused on high-performance
messaging
  • BIP (Basic Interface for Parallelism): Linux
  • M-VIA: Berkeley Lab modular VIA project
  • Active Messages: Berkeley NOW/Millennium
  • GM: from Myricom
  • General purpose (what we use on our Linux cluster)
  • Real World Computing Partnership: Japanese consortium
  • U-Net: Cornell
  • High performance over ATM and Fast Ethernet
  • HPVM: Fast Messages and NT

87
Integrating Key Components
  • NPACI Rocks: RedHat Linux-based clustering toolkit
  • Beowulf Project: the ones that "pop-cultured" clusters
  • Scyld Computing: commercialization of Beowulf technologies. Founded by Donald Becker of Linux Ethernet driver fame
  • OSCAR: collection of standard cluster components
  • SCore: RWCP single system image
  • PVM: the original message passing/distributed computing software toolkit
  • MPI: Message Passing Interface standard
  • VIA: Virtual Interface Architecture. A hardware standard for low-latency system area networks
  • Myricom Corporation: low-latency gigabit networks
  • IBM, Compaq are joining the cluster
    vendor/software fray
  • Many Projects, Few Standards

88
Technological Shifts
  • Memory bandwidth of COTS systems
  • 4-8X increase this year (Rambus, DDR)
  • Increased I/O performance
  • 4X improvement today (64-bit/66 MHz)
  • 10X (PCI-X) in some Pentium 4 motherboards
  • Increased network performance / decrease in cost
  • 1X InfiniBand (2.5 Gbit/s) hardware convergence
  • Intel designing motherboards with multiple I/O busses and on-board InfiniBand
  • Gigabit Ethernet now getting very cheap
  • 64-bit integer performance everywhere (e.g. IA-64, Alpha (dead soon), Power4, UltraSparc, AMD Hammer)

89
Where we are now
  • Clusters are Proven Computational Engines (Many
    existence proofs)
  • Upcoming hardware technology dislocations make them very attractive at multiple scales
  • Research Software has not focused on management
  • System (Management) Software is a bazaar
  • Dual Processor, High-performance Network, Large
    Memory
  • Standard Building block
  • ~$5K/node

90
Designing/Building a Cluster
  • Hardware layout
  • Processors, networks, power
  • Logical system design
  • Management Philosophies
  • We'll cover the very high-level view. More details will follow in later lectures

91
Hardware basic layout
(Diagram: front-end node(s); power distribution with net-addressable units as an option; public Ethernet; Fast-Ethernet switching complex; gigabit network switching complex)
92
Current Configuration of the Meteor
  • Rocks v2.2 (RedHat 7.2)
  • 2 Frontends, 4 NFS Servers
  • 100 nodes
  • Compaq
  • 800 and 933 MHz, and IA-64
  • SCSI, IDE
  • IBM
  • 733 MHz and 1 GHz
  • SCSI
  • 50 GB RAM
  • Ethernet
  • For management
  • Myrinet 2000

93
Compute Nodes - Meteor Specs and Choices
  • Dual Processor PIIIs (733 and 800 MHz)
  • 933s and 1.0 GHz as we expand
  • ½ GB/node (1 GB would be better)
  • Hot swap SCSI on these nodes.
  • Choices
  • Uni vs. Dual Processor
  • Processors: Alpha, Intel, SPARC, PowerPC
  • Linux is, in reality, an Intel OS
  • Rack-mount vs. Tower
  • Rackmount essential for large installations
  • SCSI vs. IDE
  • Hot swap is unimportant; IDE removable disks will work.
  • Rackmount servers usually are SCSI
  • User integration versus system integrator

94
Networks
  • Ethernet only → Beowulf-class
  • Nodes are in private IP (10.x.x.x) space, front-end does NAT
  • Gigabit networks
  • Myrinet, Giganet, Gigabit Ethernet
  • Power Network
  • Highly desirable to have network-addressable power controllers (for when a hard reset is needed)
  • We will be experimenting with Baytech
  • Essential to figure out power needs (~300 W/system peak for our current systems)
  • A serial console network is not really necessary
  • A KVM (keyboard/video/mouse) switching system adds too much complexity, cabling, and cost

95
Services
  • Front-end Node
  • Node seen by external world
  • Performs Network Address Translation (NAT)
  • NFS Server(s) for user home areas
  • Beware of scalability issues!
  • Compilers, libraries
  • Configuration for Nodes
  • DHCP Server
  • NIS Domain Controller
  • NTP Server
  • Installation Server for defining system on nodes
  • Method(s) to start jobs on compute nodes
  • Batch System
  • Interactive launching of jobs

96
Installation/Management
  • Need to have a strategy for managing cluster
    nodes
  • Common methods and (pitfalls)
  • Installing each node by hand
  • Difficult to keep software on nodes up to date
  • Management increases as node count increases
  • Disk imaging techniques (e.g. the VA disk imager)
  • Difficult to handle heterogeneous nodes
  • Treats the OS as a single monolithic system
  • Specialized installation programs (e.g. IBM's LUI, or RWCP's multicast installer)
  • RedHat Kickstart
  • Define the packages needed for the OS on nodes; Kickstart gives a reasonable measure of control
  • Need to fully automate to scale out

97
Job Management, Debugging
  • Once a parallel application (usually MPI) has been created, it needs to be run/debugged/scheduled
  • Job Queuing systems (like PBS) exist and help
    with the sharing of resources
  • Debugging across N copies of the OS is quite challenging, with only moderate success in debugging environments (like TotalView)

98
The Dark Side of Clusters
  • Clusters are phenomenal price/performance
    computational engines
  • Can be hard to manage without experience
  • High-performance I/O is still unsolved
  • The effort to find out where something has failed increases at least linearly as cluster size increases
  • Not cost-effective if every cluster burns a
    person just for care and feeding
  • NPACI Rocks helps here
  • Programming environment could be vastly improved
  • Technology is changing very rapidly. Scaling up
    is becoming commonplace (128 nodes)

99
The Top 2 Most Critical Problems
  • The largest problem in clusters is software skew
  • When the software configuration on some nodes is different from that on others
  • Small differences (minor version numbers on libraries) can cripple a parallel program
  • It's taken the community almost 7 years from the original Beowulf book to understand this
  • The Second most important problem is adequate job
    control of the parallel process
  • Signal propagation
  • Cleanup

100
NPACI Rocks Toolkit rocks.npaci.edu
  • Collection of the software components needed to build a cluster
  • Techniques and software for easy installation,
    management and update of clusters
  • Node management philosophy
  • Make it trivial to completely reinstall any (all)
    nodes.
  • Nodes must be 100% automatically configured
  • Use of DHCP, NIS for configuration
  • Use RedHat's Kickstart to define the set of software that defines a node
  • All software is packaged in a RedHat Package (RPM)
  • Encapsulate configuration for a package (e.g. Myrinet)
  • Manage dependencies
  • Never try to figure out if node software is
    consistent
  • Bootable CD to first build front-end installation
    server and then to build nodes.

101
Trends
  • CPU trends
  • Network trends
  • Technology is changing rapidly in the PC
    marketplace
  • Knowing and following these trends (and having
    software to help you through them) is part of the
    commodity cluster game

102
Cluster Compute Node Today
103
Cluster Compute Node Tomorrow (single P4 is here today)
(Diagram: 1.6 GHz CPU; 64-bit @ 400 MHz = 3.2 GB/s; 2 memory channels, 16-bit @ 800 MHz = 3.2 GB/s; PCI-X 64-bit @ 133 MHz = 1.06 GB/s)
  • In the next 9 months, every speed and feed gets
    at least a 2x bump!

104
Commodity CPU Pentium 3
  • 0.8 Gflops (Peak)
  • 1 flop/cycle @ 800 MHz
  • 12.9 GB/s L2 cache feed
  • 800 MHz × ½ × 256-bit (Advanced Transfer Cache)
  • 1.06 GB/s memory-I/O bus
  • 133 MHz × 64-bit

105
Commodity CPU Pentium 4
  • 2.8 Gflops
  • 2 flops/cycle @ 1.4 GHz
  • 128-bit vector registers (Streaming SIMD Extensions)
  • Can apply operations on 2 64-bit floating point values per clock (Streaming SIMD Extensions 2); see the sketch below
  • 44 GB/s L2 cache feed (full speed: 1.4 GHz × 256 bits)
  • 3.2 GB/s Memory-I/O bus
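
The "2 flops per clock" comes from operating on packed pairs of doubles; a tiny illustrative SSE2 kernel (an assumed example, not from the slides) that adds arrays two elements at a time:

```c
/* Add two arrays of doubles two elements at a time with SSE2. */
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stdio.h>

void add2(const double *a, const double *b, double *c, int n)
{
    int i;
    for (i = 0; i + 1 < n; i += 2) {
        __m128d va = _mm_loadu_pd(a + i);          /* load 2 doubles      */
        __m128d vb = _mm_loadu_pd(b + i);
        _mm_storeu_pd(c + i, _mm_add_pd(va, vb));  /* 2 adds in one op    */
    }
    for (; i < n; i++)                             /* scalar tail         */
        c[i] = a[i] + b[i];
}

int main(void)
{
    double a[5] = {1, 2, 3, 4, 5}, b[5] = {10, 20, 30, 40, 50}, c[5];
    add2(a, b, c, 5);
    for (int i = 0; i < 5; i++)
        printf("%g ", c[i]);
    printf("\n");
    return 0;
}
```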

106
Power4
  • 2.5 GB/s / CPU memory bus feed
  • Numbers in the figure are aggregate
  • 10 GB/s / CPU in 8-way configuration
  • 5 GB/s / CPU I/O feed
  • Chip Multiprocessor (CMP)
  • 4.0 GFlop / CPU (Peak)
  • 50 GB/s / CPU L2 cache feed

107
TeraGrid
  • Taking clusters to the next stage for the NSF PACI program (Partnerships for Advanced Computational Infrastructure)
  • 13 TFlops aggregate speed across 4 sites
  • 4 Linux clusters
  • Next-generation processor (IA-64 McKinley)
  • Designing I/O as an integral component of the
    cluster
  • Large Storage Area Network
  • Still the same basic design of lab clusters

108
TeraGrid Partners
  • Strategic partners
  • IBM
  • Cluster integration. GPFS parallel file system
  • Intel
  • McKinley IA-64 software and compilers
  • Oracle
  • data archive management and mining
  • Qwest
  • 40 Gb/s DTF WAN backbone
  • Myricom
  • Cluster interconnect
  • Sun
  • Data Management at SDSC

109
4 TeraGrid Sites Have Focal Points
  • SDSC: Large-scale Data
  • Large-scale and high-performance data analysis/handling
  • Every cluster node is directly attached to the SAN
  • NCSA: High-performance Computing
  • Large-scale, large-flops computation
  • Argonne: Visualization
  • Scalable visualization walls, human-computer interfaces
  • Caltech: Applications
  • Data and flops for applications, especially some of the GriPhyN apps (LIGO, NVO)
  • Specific site configurations reflect these foci
  • Sites are not limited to just their focus area
  • One organization cannot do it all

110
TeraGrid Architecture
(Architecture diagram. Site resources: ANL: 1 TF, 0.25 TB memory, 25 TB disk (574p IA-32 Chiba City, 128p Origin, HR display and VR facilities); Caltech: 0.5 TF, 0.4 TB memory, 86 TB disk (256p HP X-Class, 128p HP V2500, 92p IA-32, HPSS); NCSA: 6+2 TF, 4 TB memory, 240 TB disk (1024p IA-32, 320p IA-64, 15xxp Origin, UniTree); SDSC: 4.1 TF, 2 TB memory, 225 TB SAN (1176p IBM SP Blue Horizon, Sun Starcat, Sun E10K, HPSS); Myrinet interconnects at NCSA and SDSC. DTF core switch/routers in Chicago and LA (Cisco 65xx Catalyst switch with 256 Gb/s crossbar, Juniper M160/M40, Extreme Black Diamond), with OC-48, OC-12, and OC-3 links to NTON, CalREN, ESnet, HSCC, MREN/Abilene, Starlight, and vBNS.)
111
TeraGrid Redux
  • Expect full clusters up and running by Nov 2002
  • Push towards the Grid is founded on our ability
    to manage and control the HPC software stack
  • Logical next step

112
Summary
  • Clusters are real machines
  • Still missing some key components such as
    High-perf I/O
  • Complexity is abstracted by MPI, but still needs
    to be understood by application developers
  • System integration software is the next step beyond the messaging proof-of-principle work of the mid-'90s
  • Clustering Toolkits
  • Simplify installation/management/monitoring
  • Serve as collection points of software
  • We're on the cusp of large step-changes in commodity hardware.
  • TeraGrid is one of the first projects to go to the future generation of Intel architectures