Introduction to Clusters - PowerPoint PPT Presentation

About This Presentation
Title:

Introduction to Clusters

Description:

Follow-on lectures talk more in detail about various aspects of clustering ... (SHRIMP) Scalable High-performance Really Inexpensive Multi-Processor (Princeton) ... – PowerPoint PPT presentation

Slides: 112
Provided by: Phi675
Transcript and Presenter's Notes

1
Introduction to Clusters
  • Philip Papadopoulos
  • Greg Bruno
  • Mason Katz

2
Overview
  • 12 Lectures covering various aspects of cluster
    computing
  • History
  • Architecture
  • Construction
  • Programming
  • Management/Monitoring
  • Application
  • Application Optimization
  • First two lectures provide initial high-level
    overview with some details about message
    pipelines
  • Follow-on lectures talk more in detail about
    various aspects of clustering

3
Scoping Rules
  • Focused on computing clusters
  • Large number of nodes that need similar system
    software footprints
  • MPI-style parallelism is the dominant application
    model
  • Not assuming homogeneity of hardware
    configurations
  • Do assume the same OS
  • Even homogeneous systems exhibit hardware
    differences
  • Not high-availability clusters
  • Our techniques can help here, but we don't
    address the specific software needs of HA

4
Modern Clusters
  • What are the design issues involved in building a
    commodity cluster?
  • Selecting nodes
  • Networks
  • Operating System
  • Historical perspective important
  • Understand where technologies started
  • Don't repeat mistakes
  • Technical background for classifications

5
High-Performance Clusters
Gigabit Networks - Myrinet, SCI, FC-AL,
Giganet, GigE, ATM
  • Killer micros: low-cost Gigaflop processors here
    for a few k$/processor
  • Killer networks: Gigabit network hardware, high
    performance software (e.g. Fast Messages), soon
    at $100s/connection
  • Leverage HW, commodity SW (*nix/Windows NT),
    build key technologies
  • => high performance computing in a RICH software
    environment

6
Cluster Research Groups
  • Many other cluster groups that have had impact
  • Active Messages/Network of workstations (NOW) UCB
  • Basic Interface for Parallelism (BIP) Univ. of
    Lyon
  • Fast Messages(FM)/High Performance Virtual
    Machines(HPVM) (UIUC/UCSD)
  • Real World Computing Partnership (Japan)
  • (SHRIMP) Scalable High-performance Really
    Inexpensive Multi-Processor (Princeton)
  • Most of these groups moved on to other activities
    at the end of the 90s
  • We're now in the stage of taking these
    proofs-of-concept to the level of production
    machines

7
Clusters are Different
  • A pile of PCs is not a large-scale SMP server.
  • Why? Performance and programming model
  • A cluster's closest cousin is an MPP
  • What's the major difference? Clusters run N
    copies of the OS, MPPs usually run one.

8
A Quick Snapshot of What The Top Supercomputers
Look like
  • Linpack benchmark used in the Top 500

9
Linpack Performance (Nov 2001)
Source: Jack Dongarra
10
Top 500 Architectures (Nov 2001)
Source: Jack Dongarra
11
Top 500 Observations
  • Clusters or Clusters of SMPs account for 150/500
    machines
  • Clusters and MPPs account for 80% of the machines
  • Single processor machines dropped off the list in
    1997
  • Earth Simulator will represent about 20% of the
    aggregate computing speed of the Top 500

12
Some Architectural Background
  • Machine Classification
  • Algorithmic Models
  • Processor Types

13
Machine Classifications
  • Flynn (1966): classified machines by data and
    control streams

14
Machine Classification SISD
  • Single Instruction Single Data Stream
  • Your garden-variety single CPU system
  • Even this view isn't so simple because dual
    processor PCs are getting cheap
  • Standard Von Neumann Architecture

15
Machine Classification SIMD
  • SIMD
  • All processors execute the same program in
    lockstep
  • Data that each processor sees is different
  • Single control processor
  • Individual processors can be turned on/off at
    each cycle
  • Illiac IV, CM-2, MasPar are some examples
  • Silicon Graphics Reality Graphics engine
  • Also called data parallel

16
Machine Classification MIMD
  • All processors execute their own set of
    instructions
  • Processors operate on separate data streams
  • No centralized clock implied
  • SP-2, T3E, Clusters, Crays, E10000
  • Valid on all memory hierarchies
  • SMP
  • NUMA
  • Distributed
  • Important to realize that memory distribution is
    independent of the Machine Classification

17
Distributed Memory vs. Shared
  • Distributed Memory
  • Memory for individual processors not shared on
    the same bus (address space)
  • Data is explicitly sent from one address space to
    another when needed
  • Sending data also synchronizes processors
  • Very scalable
  • Parallelization done through explicit algorithms
  • Shared memory
  • Memory is shared among all processors
  • No explicit data movement
  • Synchronization managed through locks/semaphores
  • Not as scalable (Max 106 CPUs on Enterprise 10K)
  • Parallelization either explicit or through
    compiler (or both)

18
Programming Models
  • People often refer to the memory system as the
    programming model
  • Shared Memory Programming Model
  • Distributed Memory Programming Model
  • This is only part of the story
  • Synchronization and how the parallel application
    program is expressed is the other half

19
Programming Models SPMD
  • Single/Multiple Program Multiple Data
  • Similar to SIMD, but generalized for common CPUs
  • SPMD: processors run the same program but
    are not necessarily run in lock step.
  • Very popular and scalable programming style
  • MPI assumes an SPMD model.
  • MPMD is similar except that different
    processors can run different programs
  • PVM distribution has some simple examples
  • Grid Computing is really MPMD
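
As a rough illustration of the SPMD style, the sketch below runs one function on every "rank" and lets the rank number alone decide which data each copy works on. It is only a toy: Python threads and a queue stand in for separate processes and a message-passing library, and the names `spmd_sum` and `run_spmd` are made up for this example.

```python
import queue
import threading

def spmd_sum(rank, size, data, inbox, results):
    """Every rank runs this same function; only `rank` selects its data."""
    # Cyclic (strided) distribution: rank r takes elements r, r+size, ...
    partial = sum(data[rank::size])
    if rank == 0:
        # Rank 0 collects one partial sum from every other rank.
        total = partial
        for _ in range(size - 1):
            total += inbox.get()  # blocking receive also synchronizes
        results["total"] = total
    else:
        inbox.put(partial)  # "send" the partial sum to rank 0

def run_spmd(size, data):
    """Start `size` copies of the same program and return rank 0's result."""
    inbox = queue.Queue()
    results = {}
    workers = [
        threading.Thread(target=spmd_sum, args=(r, size, data, inbox, results))
        for r in range(size)
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return results["total"]

print(run_spmd(4, list(range(100))))
```

The ranks are not in lock step: each proceeds at its own pace and only meets the others at the receive on rank 0.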

20
Processor Types
  • Four general types
  • Vector
  • Cache-based, pipelined
  • Custom (eg. Tera MTA or KSR-1)
  • Bit serial
  • Commodity Clusters use cache-based, pipelined
  • Intel x86 is the most common building block
  • NOW project built on SPARC
  • Some want to play with game platforms
    (Playstation)

21
Bit Serial (Early 90s)
  • Only seen in SIMD machines like Connection
    Machine (CM-2) or MasPar
  • Each clock cycle, one bit of the data is
    loaded/written
  • Simplifies memory system and memory trace count
  • Were popular for very dense (64K) processor
    arrays
  • Limited efficiencies and problem domains
    eventually led to their demise.

22
Cache-based, Pipelined
  • Garden Variety Microprocessor
  • Sparc, Intel x86, MC68xxx, MIPS,
  • Register-based ALUs and FPUs
  • Registers are of scalar type
  • Pipelined execution to improve performance of
    individual chips
  • Splits up components of basic operation like
    addition into stages (P4 has a 20 stage pipeline)
  • The more stages, the greater the speedup, but more
    problems with branching and data/control hazards
  • Per-processor caches make it challenging to build
    SMPs (coherency issues)
  • Now dominates the high-end market

23
Vector Processors
  • Very specialized machines
  • Registers are true vectors with power of 2
    lengths
  • Designed to efficiently perform matrix-style
    operations
  • Ax = b (b(I) = sum over J of A(I,J) x(J))
  • Vector registers v1, v2, v3
  • V1 = A(I,*), V2 = x(*)
  • MULV V3(I), V1, V2
  • Chaining to efficiently handle larger vectors
    than size of vector registers
  • Cray SV-2, Hitachi, NEC (Earth Simulator) are
    examples
  • Highly optimized compilers -> high efficiencies
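
The matrix-style operation on this slide can be written out in scalar form as a minimal sketch. It assumes the operation is the matrix-vector product b(I) = sum over J of A(I,J) x(J), with one row of A and the vector x playing the role of the two vector registers; `matvec` is a name chosen for this example, and a real vector machine would do each row's multiply in hardware rather than in a loop.

```python
def matvec(A, x):
    """Compute b = A x, one row (one vector multiply plus a sum) at a time."""
    b = []
    for row in A:
        # Element-wise multiply of two vectors, the job of a vector multiply.
        products = [a_ij * x_j for a_ij, x_j in zip(row, x)]
        # Reduce the products to the single value b(I).
        b.append(sum(products))
    return b

A = [[1, 2],
     [3, 4]]
x = [5, 6]
print(matvec(A, x))  # [17, 39]
```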

24
Some Custom Processors
  • Denelcor HEP/Tera MTA
  • Multiple register sets
  • Stack Pointer, Instruction Pointer, Frame
    Pointer, etc.
  • Switch each clock cycle to different register set
  • SMT (Simultaneous Multithreading)
  • Why? Stalls to memory subsystem in one thread can
    be hidden by concurrency
  • Compilers needed to express concurrency
  • KSR-1 (Company lasted only about 4 years)
  • Cache-only memory processor
  • Peak capability was 2 generations behind
    standard micros of the day

25
Going Parallel
  • Late 70s, even vector monsters started to
    go parallel
  • For parallel processing to work, individual processors
    must eventually synchronize
  • SIMD Hardware synchronizes every clock cycle
  • MIMD Explicit synchronization done in program
  • Message passing
  • Data and synchronization is in the message itself
  • Can be on shared or distributed memory machines
  • Shared Memory semaphores, monitors,
    fetch-and-increment
  • We'll review some key interconnect properties
    later in this talk

26
Rough Timelines of Software and Hardware that have
led to clusters
27
Some Happenings
[Timeline figure; axis years: 1990, 1993, 1994, 1997, 1999, 2000, 2001, 2002]
Software and projects shown: PVM, MPI 1, MPI 2, Linux, NOW, Beowulf, HPVM, BIP, VIA, SCore (RWCP), Globus, Legion, Scyld, SCE, OSCAR, Rocks
Machines shown: Cray Y-MP (8 Vectors, 2.5 GF), KSR, CM-5, IBM SP1 (1024 CPUs), Intel Paragon (150 GF, 1024 CPUs), Hitachi CP-PACs (700 GF), ASCI Red (1 TF), ASCI Blue (3 TF), IBM SP3 (1 TF), ASCI White (12 TF), NCSA Platinum (1 TF, x86), TeraGrid (13 TF, IA64), Earth Simulator (40 TF)
28
Network of Workstations (NOWs)
  • David Culler (UC Berkeley) started early 90s
  • SunOS on SPARC Microprocessor
  • First-generation Myrinet
  • Active messages for high-performance
  • Glunix (Global Unix) execution environment
  • Split-C programming, PVM and eventually MPI
  • NOW work became the base technology for Hotbot
    (Inktomi, Inc. started in 1997)

29
Impact of NOW Project
  • Brought key issues to the forefront of
    commodity-based computing
  • Global OS
  • Parallel file systems
  • Fault tolerance
  • High-performance messaging
  • System Management

30
Clusters, Beowulfs, and more
  • How do you put a Pile-of-PCs into a room and
    make them do real work?
  • Interconnection technologies
  • Programming them
  • Monitoring
  • Starting and running applications
  • Running at Scale
  • NOW pioneered the vision for clusters of
    commodity processors.
  • Beowulf popularized the notion and made it very
    affordable

31
Beowulf Cluster Definition
  • Current working definition: a collection of
    commodity PCs running an open-source operating
    system with a commodity interconnection network
  • Dual Intel PIIIs with fast ethernet, Linux
  • Program with PVM, MPI,
  • Single Alpha PCs running Linux

32
Beowulf Clusters cont'd
  • Interconnection network is usually fast ethernet
    running TCP/IP
  • (Relatively) slow network
  • Programming model is message passing
  • Most people now associate the name Beowulf with
    any cluster of PCs
  • Beowulfs are differentiated from
    high-performance clusters by the network
  • www.beowulf.org has lots of information

33
Outcome of these activities
  • Brought most of key ingredients of MPPs into the
    commodity space
  • Allowed many more people to really work on
    parallel computing
  • Wider application audience can understand issues
  • Had a large impact on MPPs of the day. NOW
    project analysis improved Paragon messaging
    performance by 2X
  • Almost all software components were made
    available as open source
  • This was key to technology sharing instead of
    reinvention

34
Hardware variations on a basic layout
Front-end Node(s)
Power Distribution (Net addressable units as
option)
Public Ethernet
Fast-Ethernet Switching Complex
Gigabit Network Switching Complex
35
High Performance Commodity Clusters
  • Rocks v2.1
  • 2 Frontends, 4 NFS Servers
  • 100 nodes
  • Compaq
  • 800, 933, IA-64
  • SCSI, IDA
  • IBM
  • 733, 1000
  • SCSI
  • 50 GB RAM
  • Ethernet
  • For management
  • Myrinet 2000

36
Beowulfs vs. High Performance
  • Beowulfs traditionally have ethernet (Store and
    forward switches)
  • Very inexpensive interconnect
  • High host CPU processing overhead
  • Higher latency
  • Messaging characteristics limit scalability
  • High-performance Clusters
  • Interconnect significant cost
  • Better scalability
  • Myrinet brought the technology of Intel Paragon
    to the commodity market.

37
Clusters vs. MPPs
  • MPPs introduced in late 80s
  • Connection Machine
  • Paragons
  • IBM SP
  • Cray T3E (90s)
  • MPPs have specialized interconnects, proprietary
    OSes. Designed to give the illusion of a uniform
    machine
  • Clusters were designed to replace expensive MPPs.
  • Successful. New large machines are mostly
    clusters
  • PC clusters are now affordable in lab/single PI
    environments

38
Linux
  • Linux started as student project in 1991
  • Good integrated distributions in 1993 (e.g.
    Slackware)
  • Becker (Beowulf Project) wrote high-performing
    ethernet drivers
  • Fundamental enabler for clusters
  • Major releases of Kernel improved multiprocessing
    performance, stability, support of devices
  • This became the essential piece to finish the
    commodity puzzle
  • Networks
  • CPUs (Intel)
  • Operating System
  • Message passing software (PVM and MPI)

39
Other OSes
  • Windows (NT, 2000, XP)
  • HPVM Project (Chien)
  • Velocity cluster (Cornell Theory Center)
  • SunOS/Solaris
  • OS of the Berkeley NOW project
  • What about
  • AIX
  • MacOS
  • HPUX
  • Tru64 (Compaq)
  • All can be used as the basic OS, but none has the
    wide acceptance of Linux for cluster architectures

40
Do Commodity Clusters based on Linux perform
adequately?
41
IA-32 Application Scaling
Source: Dave Pierce, SIO
42
Itanium Cluster Performance
NAMD: Scalable Molecular Dynamics
  • Simulation of large biomolecular systems on
    parallel computers
  • File compatible with CHARMM and AMBER
  • Message-driven and object-oriented design
    implemented with Charm++/Converse (from PPL at
    UIUC)
  • Ported to PACI systems, clusters, and
    desktop PCs
  • Available for FREE, includes source code
Pentium III cluster
Itanium cluster
ApoA1 (PME) 92K atoms
Source: Rob Pennington, NCSA
43
Clusters on the Grid
METEOR II
Deep Impact
150 GB disk total
32 cpu's total
Myrinet
Deep Impact
Broad Impact
44
PC Cluster Performance
  • Right on par with more expensive MPPs
  • Sometimes outperforms on particular codes
  • What are some things that are lacking?
  • Natural application development/debugging
    environment
  • High-performance disk I/O
  • Management of clusters can be a challenge without
    scalable techniques

45
Putting a cluster together
  • (16, 32, 64, X) Individual Node
  • Eg. Dual Processor Pentium III/1.13GHz, 1 GB mem,
    ethernet
  • Scalable High-speed network
  • Myrinet, Giganet, Gigabit Ethernet
  • Message-passing libraries
  • TCP, MPI, PVM, VIA
  • Multiprocessor job launch
  • Portable Batch System
  • Load Sharing Facility
  • PVM spawn, mpirun, ssh,
  • Techniques for system management
  • NPACI Rocks (Rocks) is a good example

46
Providing an abstraction
  • A pile of PCs is not an attractive model from
    the application point of view.
  • Need a coherent view and abstraction
  • Abstractions simplify the hardware so that
    algorithms can be more naturally mapped
  • A cluster is a distributed memory MIMD
  • MPI is the preferred way to express parallelism
    in applications
  • Understanding some of the lower-level details can
    be essential to obtaining good application
    performance.

47
Virtualization of Machines
  • Want the illusion that a collection of machines
    (cluster) is a single machine
  • Start, stop, monitor distributed programs
  • Programming and debugging should work seamlessly
  • PVM (Parallel Virtual Machine) was the first,
    widely-adopted virtualization for parallel
    computing
  • MPI is a standard API for message passing.
  • This illusion is only partially complete in any
    software system. Some issues
  • Node heterogeneity.
  • Real network topology can lead to contention
  • Unrelated: What is a Java Virtual Machine?

48
High-Performance Communication
Switched Multigigabit, User-level access Networks
Switched 100 Mbit OS mediated access
  • Level of network interface support + NIC/network
    router latency
  • Overhead and latency of communication =>
    deliverable bandwidth
  • High-performance communication =>
    Programmability!
  • Low-latency, low-overhead, high-bandwidth cluster
    communication
  • much more is needed
  • Usability issues, I/O, Reliability, Availability
  • Remote process debugging/monitoring

49
Communication Networks
  • Understanding Characteristics is important to
    understanding scalability of machines

50
Characterizing Networks
  • Bandwidth
  • Device/switch latency
  • Switching types
  • Circuit switched (eg. Telephone)
  • Packet switched (eg. Internet)
  • Store and forward
  • Virtual Cut Through
  • Wormhole routed
  • Topology
  • Number of connections
  • Diameter (how many hops through switches)

51
Latency
  • Latency is the amount of time taken for a command
    to start before any effect is seen
  • Push on gas pedal before car goes forward
  • Time you enter a line, before cashier starts on
    your job
  • First bit leaves computer A, first bit arrives at
    computer B
  • OR
  • (Message latency) First bit leaves computer A,
    last bit arrives at computer B
  • Startup latency is the amount of time to send a
    zero length message

52
Bandwidth
  • Bits/second that can travel through a connection
  • A really simple model for calculating the time to
    send a message of N bytes
  • Time = latency + N/bandwidth
  • Bisection is the minimum number of wires that
    must be cut to divide a network of machines into
    two equal halves.
  • Bisection bandwidth is the total bandwidth
    through the bisection
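
The simple cost model on this slide (time = latency + N/bandwidth) can be tried out with a short sketch. The function names are invented for this example, and the Fast Ethernet and Myrinet 2000 figures are the approximate ones quoted later in these slides (80 us and 11 MB/s versus 8 us and 240 MB/s); treat the output as illustrative only.

```python
def message_time(n_bytes, latency_s, bandwidth_bytes_per_s):
    """Time to send n_bytes under the model: latency + N / bandwidth."""
    return latency_s + n_bytes / bandwidth_bytes_per_s

def effective_bandwidth(n_bytes, latency_s, bandwidth_bytes_per_s):
    """Bytes per second actually delivered for one message of n_bytes."""
    return n_bytes / message_time(n_bytes, latency_s, bandwidth_bytes_per_s)

# Approximate figures taken from later slides (assumed, not measured here).
networks = {
    "Fast Ethernet": (80e-6, 11e6),
    "Myrinet 2000": (8e-6, 240e6),
}

for name, (latency, bandwidth) in networks.items():
    for size in (64, 1024, 65536, 1048576):
        t = message_time(size, latency, bandwidth)
        bw = effective_bandwidth(size, latency, bandwidth)
        print(f"{name:14s} {size:8d} bytes: {t * 1e6:10.1f} us, "
              f"{bw / 1e6:7.2f} MB/s delivered")
```

For short messages the latency term dominates, so the delivered bandwidth is far below the link's peak; only long messages approach it.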

53
Interconnection Topologies
  • Completely connected
  • Every node has a direct wire connection to every
    other node
  • (N x (N-1))/2 Wires, Clearly impractical at
    scale

54
Line/Ring
[Figure: seven nodes, numbered 1-7, connected in a line]
  • Simple interconnection
  • First topology where routing is an issue
  • Needed when no direct connection exists between
    nodes
  • To go from node 2 to node 4, a message has to
    pass through node 3
  • What happens if 2 wants to communicate with 3 at
    the same time 1 wants to communicate with 4?
  • What is the bisection of a line/ring?
  • If the links are of bandwidth B, what is the
    bisection bandwidth?
  • What is the aggregate bandwidth of the network?

55
Mesh/Torus
  • Generalization of line/ring to multiple
    dimensions
  • More routes between nodes
  • What is the bisection of this network?
  • Paragon is an example

[Figure: mesh of nodes drawn as three rows, each numbered 1-7]
56
Hop Count
  • Networks are measured by diameter
  • This is the minimum number of hops that a message
    must traverse between the two nodes that are
    furthest apart
  • Line: Diameter = N-1
  • 2D (NxM) Mesh: Diameter = N+M-2
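
One way to check diameter figures such as these is to build the topology as a graph and measure it. The sketch below does that with a breadth-first search; the helper names (`line`, `ring`, `mesh`, `diameter`) are made up for this example, and each link is assumed to cost one hop.

```python
from collections import deque

def line(n):
    """Adjacency list for n nodes in a line."""
    return {i: [j for j in (i - 1, i + 1) if 0 <= j < n] for i in range(n)}

def ring(n):
    """Adjacency list for n nodes in a ring (line with the ends joined)."""
    return {i: sorted({(i - 1) % n, (i + 1) % n} - {i}) for i in range(n)}

def mesh(n, m):
    """Adjacency list for an n x m 2D mesh without wrap-around links."""
    adj = {}
    for r in range(n):
        for c in range(m):
            steps = ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
            adj[(r, c)] = [(a, b) for a, b in steps if 0 <= a < n and 0 <= b < m]
    return adj

def hops_from(adj, start):
    """Breadth-first search: fewest hops from start to every node."""
    dist = {start: 0}
    todo = deque([start])
    while todo:
        node = todo.popleft()
        for nxt in adj[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                todo.append(nxt)
    return dist

def diameter(adj):
    """Largest shortest-path hop count over all pairs of nodes."""
    return max(max(hops_from(adj, s).values()) for s in adj)

print(diameter(line(7)), diameter(ring(7)), diameter(mesh(3, 7)))
```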

57
Tree-based Networks
  • Nodes organized in a tree fashion (important for
    some global algorithms)

Diameter of this network? Bisection? Bisection
Bandwidth? CM-5 was a Fat Tree: links got
faster near the top
58
Hypercubes
1D
2D
4D
3D
59
Hypercubes 2
  • Dimension N Hypercube is constructed by
    connecting the corners of two N-1 hypercubes
  • Relatively low wire count to build large networks
  • Multiple routes from any destination to any node.
  • Exercise to the reader: what is the diameter of
    a K-dimensional hypercube?
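
The recursive construction described above can be sketched in a few lines. The example assumes nodes are labelled with K-bit integers so that neighbours differ in exactly one bit; `hypercube_edges` and `route` are names invented here, and the routing shown (correct the differing bits from lowest to highest) is just one simple choice among the many available routes.

```python
def hypercube_edges(k):
    """Links of a k-dimensional hypercube, built from two (k-1)-cubes."""
    if k == 0:
        return set()  # a single node, no links
    lower = hypercube_edges(k - 1)
    half = 1 << (k - 1)
    # Second copy of the (k-1)-cube: same links with the new top bit set.
    upper = {(a | half, b | half) for a, b in lower}
    # Connect each corner of the first copy to its twin in the second copy.
    joins = {(v, v | half) for v in range(half)}
    return lower | upper | joins

def route(src, dst, k):
    """Path from src to dst, fixing one differing address bit per hop."""
    path = [src]
    cur = src
    for bit in range(k):
        mask = 1 << bit
        if (cur ^ dst) & mask:
            cur ^= mask
            path.append(cur)
    return path

print(len(hypercube_edges(4)))  # 32 links for 16 nodes
print(route(0b000, 0b111, 3))   # [0, 1, 3, 7]
```

Each hop changes one address bit, so the hop count between two nodes is the number of bits in which their labels differ.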

60
Communication Topologies
  • Interconnect topologies were very important areas
    of research in the early/mid 90s
  • MPI-1 spent a great deal of time addressing
    topologies for optimization
  • Hardware Topologies largely unimportant now
    because of wormhole routed networks and crossbar
    networks
  • Logical topologies are very important in
    constructing efficient parallel programs
  • Collective operations (Sum, Reduce, Broadcast)
  • MPI topologies important from this aspect
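
To show why the logical topology matters for collectives, here is a minimal sketch of a broadcast arranged as a binomial tree, where the set of ranks holding the data doubles every round. This is a common textbook pattern and not a description of any particular MPI implementation; `binomial_broadcast_schedule` is a hypothetical helper that only computes who sends to whom.

```python
def binomial_broadcast_schedule(size, root=0):
    """Return a list of rounds; each round is a list of (sender, receiver)."""
    rounds = []
    have = 1  # how many ranks (counted relative to root) hold the data
    while have < size:
        step = []
        for rel in range(have):
            peer = rel + have
            if peer < size:
                # Convert root-relative ranks back to real rank numbers.
                step.append(((rel + root) % size, (peer + root) % size))
        rounds.append(step)
        have *= 2
    return rounds

for number, step in enumerate(binomial_broadcast_schedule(8), start=1):
    print(f"round {number}: {step}")
```

With 8 ranks the data reaches everyone in 3 rounds, whereas a root that sends to each rank in turn needs 7 sends one after another.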

61
Modern Networks are Packet Switched
  • Break message into smaller blocks and send these
    pieces through the network
  • Network intermediate points (routers) can be
    store-and-forward or virtual cut through
  • Store and forward requires buffering at each
    switch if an incoming packet has packets ahead of
    it on an outgoing port (congestion)
  • Virtual cut-through eliminates the buffering for
    store and forward by cutting through the switch
    when the output port is free

62
Switch Types
Store and Forward
BUF
e.g. Ethernet
BUF
Cut Through
e.g. Myrinet
63
Wormhole Routing
  • Wormhole routing is a variation of virtual cut
    through
  • Small headers (flow control digits, or flits) pass
    through the network.
  • When a flit is allowed to cut through a switch,
    the original sender is guaranteed a clear path
    through that switch.
  • A tail flit closes the connection
  • Going through multiple switches sets up a virtual
    circuit from sender to receiver
  • Wormhole was defined by Seitz and is used in
    Myrinet, a very popular cluster interconnect.

64
Wormhole-Routed Networks
Message stream is a virtual Circuit
65
Routing and Deadlock
  • If routing algorithms not carefully constructed,
    deadlock can occur
  • Head flits block and can never establish a
    connection
  • Routing algorithms provably deadlock-free under
    mild assumptions
  • Streams are of finite duration (packetized)
  • Receiver/sender coordinate so that the tail flit
    is finally processed (hence virtual connection is
    closed and input/output ports on switches are
    freed).

66
Latency of Circuit Switched and Virtual Cut
Through
  • Circuit Switch Latency
  • (Lc/B) × l + (L/B)
  • Lc = length of control packet
  • B = bandwidth
  • l = number of links
  • L = length of packet
  • Virtual Cut-through latency
  • (Lh/B) × l + (L/B)
  • Lh = length of header packet
67
Store-Forward and Wormhole routing Latency
  • Wormhole Routing Latency
  • (Lf/B) × l + (L/B)
  • Lf = length of flit
  • Store-Forward Latency
  • (L/B) × l
  • Store and forward latency can be much worse for
    many hops.
  • Virtual Cut Through, Wormhole, and Circuit Switch
    approach (L/B) as message length increases
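
The four latency expressions on these two slides can be compared numerically with a small sketch. It reads each pipelined formula as (small unit / B) × l + L/B and store-and-forward as (L/B) × l; the packet, header and flit sizes and the bandwidth below are made-up values chosen only to make the trend visible.

```python
def circuit_switched(L, B, l, Lc):
    """(Lc/B) * l + L/B: control packet sets up l links, then data streams."""
    return (Lc / B) * l + L / B

def virtual_cut_through(L, B, l, Lh):
    """(Lh/B) * l + L/B: only the header is delayed at each link."""
    return (Lh / B) * l + L / B

def wormhole(L, B, l, Lf):
    """(Lf/B) * l + L/B: only one flit is delayed at each link."""
    return (Lf / B) * l + L / B

def store_and_forward(L, B, l):
    """(L/B) * l: the whole packet is received before being sent on."""
    return (L / B) * l

B = 100e6        # assumed link bandwidth, bytes per second
L = 4096         # assumed packet length, bytes
Lc = Lh = 16     # assumed control/header length, bytes
Lf = 2           # assumed flit length, bytes

for links in (1, 4, 16):
    print(f"{links:2d} links: "
          f"S&F {store_and_forward(L, B, links) * 1e6:7.1f} us, "
          f"circuit {circuit_switched(L, B, links, Lc) * 1e6:6.1f} us, "
          f"VCT {virtual_cut_through(L, B, links, Lh) * 1e6:6.1f} us, "
          f"wormhole {wormhole(L, B, links, Lf) * 1e6:6.1f} us")
```

Store-and-forward grows in proportion to the number of links, while the other three stay close to L/B once the packet is much longer than the header or flit.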

68
Message Passing
  • Details of Networks to achieve High Performance

69
Communication Style is Message Passing
[Figure: a packetized message (packets 1-4) travelling from Machine A to Machine B]
  • How do we efficiently get a message from Machine
    A to Machine B?
  • How do we efficiently break a large message into
    packets and reassemble at receiver?
  • How does receiver differentiate among message
    fragments (packets) from different senders?

70
Will use the details of FM to illustrate some
communication engineering
  • Previous slides focused on the switch hardware
  • These look at some of what the endpoints must do to
    take advantage of high-speed wormhole-routed
    networks

71
FM on Commodity PCs
FM Host Library
FM NIC Firmware
FM Device Driver
Pentium III
NIC
2000 Mbps
1000 MIPS
133 MIPS
PCI
P6 bus
  • Host Library: API presentation, flow control,
    segmentation/reassembly, multithreading
  • Device driver: protection, memory mapping,
    scheduling monitors
  • NIC Firmware: link management, incoming buffer
    management, routing, multiplexing/demultiplexing

72
Fast Messages 2.x Performance (1998)
  • Latency 8.8 µs, Bandwidth 100 MB/s, N1/2 250
    bytes
  • Fast in absolute terms (compares to MPPs,
    internal memory BW)
  • Delivers a large fraction of hardware performance
    for short messages
  • Technology transferred into emerging cluster
    standards: Intel/Compaq/Microsoft's Virtual
    Interface Architecture.

73
Comments about Performance
  • Latency and Bandwidth are the most basic
    measurements of message passing machines
  • Will discuss in detail performance models because
  • Latency and bandwidth do not tell the entire
    story
  • High-performance clusters exhibit
  • 20X deliverable bandwidth over 100Mbit ethernet
  • Myrinet 2000 240 MB/sec vs. 11MB/sec (FastEther)
  • 10X improvement in latency
  • Myrinet 2000 8 us vs. 80us (FastEther)

74
How do FM/GM/PM/AM really get Speed?
  • Protected user-level access to network
    (OS-bypass)
  • Efficient credit-based flow control
  • assumes reliable hardware network: only OK for
    System Area Networks
  • No buffer overruns (stalls sender if no receive
    space)
  • Early de-multiplexing of incoming packets
  • multithreading, use of NT user-schedulable
    threads
  • Careful implementation with many tuning cycles
  • Overlapping DMAs (Recv), Programmed I/O send
  • No interrupts! Polling only.
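
Credit-based flow control, listed above, can be sketched with a toy model: the sender starts with one credit per receive slot, spends a credit per packet, and stalls when it has none; the receiver hands a credit back each time it drains a slot by polling. The class and function names are invented for this example and do not mirror the FM, GM, PM or AM interfaces.

```python
from collections import deque

class Receiver:
    """Receive side with a fixed number of buffer slots."""
    def __init__(self, slots):
        self.slots = slots
        self.buffer = deque()

    def poll(self, sender):
        """Polling, no interrupts: drain one packet and return its credit."""
        if not self.buffer:
            return None
        packet = self.buffer.popleft()
        sender.credits += 1
        return packet

class CreditSender:
    """Send side that may only transmit while it holds a credit."""
    def __init__(self, credits):
        self.credits = credits
        self.stalls = 0

    def try_send(self, receiver, packet):
        if self.credits == 0:
            self.stalls += 1  # no receive space is known to be free
            return False
        self.credits -= 1
        receiver.buffer.append(packet)
        return True

def simulate(num_packets, slots):
    """Send num_packets through `slots` buffers (slots must be >= 1)."""
    rx = Receiver(slots)
    tx = CreditSender(credits=slots)
    delivered, max_occupancy, next_packet = [], 0, 0
    while len(delivered) < num_packets:
        if next_packet < num_packets and tx.try_send(rx, next_packet):
            next_packet += 1
            max_occupancy = max(max_occupancy, len(rx.buffer))
        else:
            packet = rx.poll(tx)
            if packet is not None:
                delivered.append(packet)
    return delivered, max_occupancy, tx.stalls

print(simulate(5, 2))  # ([0, 1, 2, 3, 4], 2, 3)
```

The receive buffer never holds more than `slots` packets, which is the "no buffer overruns" property; the price is that the sender stalls whenever the receiver falls behind.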

75
OS-Bypass Background
  • Suppose you want to perform a sendto on a
    standard IP socket?
  • Operating System mediates access to the network
    device
  • Must trap into the kernel to ensure authorization
    on each and every message (Very time consuming)
  • Message is copied from user program to kernel
    packet buffers
  • Protocol information about each packet is
    generated by the OS and attached to a packet
    buffer
  • Message is finally sent out onto the physical
    device (ethernet)
  • Receiving does the inverse with a recvfrom
  • Packet to kernel buffer, OS strips header,
    reassembly of data, OS mediation for
    authorization, copy into user program
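
For reference, the OS-mediated path described on this slide looks like this from a user program. The sketch sends one UDP datagram over the loopback interface with Python's standard `socket` module; every `sendto` and `recvfrom` here is a system call, so the kernel is involved in each message. The function name `udp_roundtrip` and the 2-second timeout are choices made for this example.

```python
import socket

def udp_roundtrip(payload: bytes) -> bytes:
    """Send one datagram to ourselves over loopback and read it back."""
    receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        receiver.bind(("127.0.0.1", 0))   # port 0: let the OS pick a free port
        receiver.settimeout(2.0)          # do not hang forever if it is lost
        address = receiver.getsockname()
        # Kernel entry: the payload is copied from user space into kernel
        # buffers and the protocol headers are added there.
        sender.sendto(payload, address)
        # Kernel entry again: the headers are stripped and the data is
        # copied from a kernel buffer back into user space.
        data, _source = receiver.recvfrom(65535)
        return data
    finally:
        sender.close()
        receiver.close()

print(udp_roundtrip(b"hello cluster"))
```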

76
OS-Bypass
  • A user program is given a protected slice of the
    network interface
  • Authorization is done once (not per message)
  • Outgoing packets get directly copied or DMAed to
    network interface
  • Protocol headers added by user-level library
  • Incoming packets get routed by network interface
    card (NIC) into user-defined receive buffers
  • NIC must know how to differentiate incoming
    packets. This is called early demultiplexing.
  • Outgoing and incoming message copies are
    eliminated.
  • Traps to OS kernel are eliminated (bypass)

77
What's the Catch to OS Bypass?
  • Because only the user application is involved in
    message transmission, it must actively service
    the network connection
  • Kernel timers can't be used (they are bypassed)
  • Usually a service thread takes the place of the
    kernel-based mechanisms
  • When not handled properly, can cause strange
    results
  • Because applications get a slice of the network,
    only a small number of processes can
    simultaneously access the high-speed links

78
Packet Pathway
[Figure: packet pathway. Labels: User Buffer; Programmed I/O/DMA; DMA to/from Network; DMA into a Pinned DMA receive region holding packets (Pkt); User level Handler 1 and User level Handler 2, each filling a User Message Buffer]
  • Concurrency of I/O busses
  • Sender specifies receiver handler ID
  • Flow control keeps DMA region from being
    overflowed
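
The idea that the sender names the receiver's handler can be sketched as a small dispatch table: each packet carries a handler ID, and the receive side uses it to steer the payload into the right user message buffer. This is only an illustration of the demultiplexing idea; `Endpoint` and `demultiplex` are hypothetical names, not the FM API.

```python
class Endpoint:
    """Toy receive side: routes each packet to the handler its sender named."""
    def __init__(self):
        self.handlers = {}

    def register(self, handler_id, handler):
        self.handlers[handler_id] = handler

    def deliver(self, packet):
        handler_id, payload = packet
        self.handlers[handler_id](payload)

def demultiplex(packets, handler_ids):
    """Reassemble interleaved packets into one message buffer per handler."""
    buffers = {h: bytearray() for h in handler_ids}
    endpoint = Endpoint()
    for h in handler_ids:
        # Each handler simply appends the payload to its own message buffer.
        endpoint.register(h, buffers[h].extend)
    for packet in packets:
        endpoint.deliver(packet)
    return {h: bytes(buf) for h, buf in buffers.items()}

# Packets from two messages arrive interleaved in the receive region.
stream = [(1, b"he"), (2, b"wor"), (1, b"llo"), (2, b"ld")]
print(demultiplex(stream, [1, 2]))  # {1: b'hello', 2: b'world'}
```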
79
MPI-FM 2.x Layering
MPI Header
MPI Header
Source buffer
Destination buffer
  • Gather-scatter interface + handler multithreading
    enables efficient layering, data manipulation
    without copies

80
MPI on FM 2.x
Msg Size
  • MPI-FM: 91 MB/s, 13 µs latency, 4 µs overhead
  • Short messages much better than IBM SP2, PCI
    limited
  • Latency ~ SGI O2K

81
MPI-FM 2.x Efficiency
Efficiency
  • High Transfer Efficiency, approaches 100%
    (Lauria, Pakin et al., HPDC7 '98)
  • Other systems much lower even at 1KB (100Mbit:
    40%, 1Gbit: 5%)

82
Is this detail important?
  • Yes! Detail of a particular high-performance
    interface illustrates some of the complexity for
    these systems
  • Performance and scaling are very important.
    Sometimes the underlying structure needs to be
    understood to reason about applications.
  • Overhead vs. Latency
  • Bandwidth and communication payload
  • Basic understanding of the mechanisms
    de-mystifies what is actually going on.

83
How do we program/run such machines?
  • PVM (Parallel Virtual Machine) provides
  • Simple message passing API
  • Construction of virtual machine with a software
    console
  • Ability to spawn (start), kill (stop), monitor
    jobs
  • XPVM is a graphical console, performance monitor
  • MPI (Message Passing Interface)
  • Complex and complete message passing API
  • De facto, community-defined standard
  • No defined method for job management
  • Mpirun provided as a tool for the MPICH
    distribution
  • Commercial and non-commercial tools for
    monitoring and debugging
  • Jumpshot, VaMPIr,

84
More on MPI
  • Started as a standards effort in 1994
  • Fuse the best ideas from several projects
  • Had a good reference implementation (MPICH), but
    encouraged vendors/researchers to improve and/or
    replace
  • Allows users to write standard parallel
    subroutine libraries
  • Is really a cornerstone software capability for
    parallel machines in general (and clusters in
    particular).

85
Modern HPC clusters should be thought of as
affordable, scalable machines programmed with MPI
86
Cluster Projects have focused on high-performance
messaging
  • BIP (Basic Interface for Parallelism) Linux
  • MVIA Berkeley Lab Modular VIA project
  • Active Messages Berkeley NOW/Millennium
  • GM From Myricom
  • General purpose (what we use on our Linux
    Cluster),
  • Real World Computing Partnership Japanese
    consortium
  • UNET Cornell
  • High-performance over ATM and Fast Ethernet
  • HPVM Fast Messaging and NT

87
Integrating Key Components
  • NPACI Rocks RedHat Linux-based Clustering
    Toolkit
  • Beowulf Project The ones that  pop-cultured
    clusters.
  • Scyld Computing Commercialization of Beowulf
    technologies. Founded by Donald Becker of Linux
    Ethernet Driver fame.
  • OSCAR Collection of standard cluster components
  • SCore RWCP single system image
  • PVM The original message passing/distributed
    computing  software toolkit.
  • MPI Message Passing Interface Standard
  • VIA Virtual Interface architecture. A hardware
    standard for low-latency system area networks
  • Myricom Corporation Low-latency gigabit networks
  • IBM, Compaq are joining the cluster
    vendor/software fray
  • Many Projects, Few Standards

88
Technological Shifts
  • Memory bandwidth of COTS systems
  • 4-8X increase this year (Rambus, DDR)
  • Increased I/O performance
  • 4X improvement today (64bit/66MHz)
  • 10X (PCI-X) in some Pentium 4 MBs
  • Increased network performance/decrease in
  • 1X infiniband (2.5 Gbits/sec) hardware
    convergence
  • Intel designing Mboards with multiple I/O busses
    and on-board Infiniband.
  • Gigabit Ethernet now getting very cheap
  • 64 bit integer performance everywhere (eg IA-64,
    Alpha (dead soon), Power4, UltraSparc, AMD
    Hammer)

89
Where we are now
  • Clusters are Proven Computational Engines (Many
    existence proofs)
  • Upcoming hardware technology dislocations make
    them very attractive at multiple scales
  • Research Software has not focused on management
  • System (Management) Software is a bazaar
  • Dual Processor, High-performance Network, Large
    Memory
  • Standard Building block
  • $5K/node

90
Designing/Building a Cluster
  • Hardware layout
  • Processors, networks, power
  • Logical system design
  • Management Philosophies
  • We'll cover the very high-level view. More
    details will follow in later lectures

91
Hardware basic layout
Front-end Node(s)
Power Distribution (Net addressable units as
option)
Public Ethernet
Fast-Ethernet Switching Complex
Gigabit Network Switching Complex
92
Current Configuration of the Meteor
  • Rocks v2.2 (RedHat 7.2)
  • 2 Frontends, 4 NFS Servers
  • 100 nodes
  • Compaq
  • 800, 933, IA-64
  • SCSI, IDA
  • IBM
  • 733, 1000
  • SCSI
  • 50 GB RAM
  • Ethernet
  • For management
  • Myrinet 2000

93
Compute Nodes - Meteor Specs and Choices
  • Dual Processor PIIIs (733 and 800 MHz)
  • 933s and 1.0 GHz as we expand
  • ½ GB node (1 GB would be better)
  • Hot swap SCSI on these nodes.
  • Choices
  • Uni vs. Dual Processor
  • Processors Alpha, Intel, Sparc, PowerPC
  • Linux is, in reality an Intel OS
  • Rack-mount vs. Tower
  • Rackmount essential for large installations
  • SCSI vs. IDE
  • Hot Swap unimportant. IDE Removable disks will
    work.
  • Rackmount servers usually are SCSI
  • User integration versus system integrator

94
Networks
  • Ethernet only → Beowulf-class
  • Nodes are in private IP (10.x.x.x) space,
    front-end does NAT
  • Gigabit networks
  • Myrinet, Giganet, Gigabit Ethernet
  • Power Network
  • Highly desirable to have network addressable
    power controllers (When hard reset needed)
  • We will be experimenting with Baytech
  • Essential to figure power needs (300W/system
    peak for our current systems)
  • A serial console network is not really
    necessary
  • A KVM (keyboard video monitor) switching system
    adds too much complexity, cables, and cost
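
The private addressing described above (nodes in 10.x.x.x space behind a front-end that does NAT) can be sketched with Python's standard `ipaddress` module. This is a minimal illustration only: the node names `compute-N`, the function name, and the choice to reserve the first host address for the front-end's private interface are assumptions, not part of the lectures.

```python
import ipaddress

def assign_private_addresses(node_names, network="10.0.0.0/8", reserved=1):
    """Map each node name to a distinct host address in a private network.

    The first `reserved` host addresses are skipped (assumed here to be
    kept for the front-end's private interface).
    """
    net = ipaddress.ip_network(network)
    if not net.is_private:
        raise ValueError(f"{network} is not a private address range")
    hosts = net.hosts()  # generator, so a /8 is never fully materialized
    for _ in range(reserved):
        next(hosts)
    table = dict(zip(node_names, hosts))
    if len(table) != len(node_names):
        raise ValueError("not enough distinct addresses for the given names")
    return table

# Hypothetical node names for a 100-node cluster.
nodes = [f"compute-{i}" for i in range(100)]
table = assign_private_addresses(nodes)
print(table["compute-0"], table["compute-99"])
```

In practice the resulting table would feed a DHCP server configuration on the front-end; that step is not shown here.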

95
Services
  • Front-end Node
  • Node seen by external world
  • Performs Network Address Translation (NAT)
  • NFS Server(s) for user home areas
  • Beware of scalability issues!
  • Compilers, libraries
  • Configuration for Nodes
  • DHCP Server
  • NIS Domain Controller
  • NTP Server
  • Installation Server for defining system on nodes
  • Method(s) to start jobs on compute nodes
  • Batch System
  • Interactive launching of jobs

96
Installation/Management
  • Need to have a strategy for managing cluster
    nodes
  • Common methods (and pitfalls)
  • Installing each node by hand
  • Difficult to keep software on nodes up to date
  • Management increases as node count increases
  • Disk Imaging techniques (e.g. VA Disk Imager)
  • Difficult to handle heterogeneous nodes
  • Treats OS as a single monolithic system
  • Specialized installation programs (e.g. IBM's
    LUI, or RWCP's Multicast installer)
  • RedHat Kickstart
  • Define packages needed for OS on nodes; Kickstart
    gives a reasonable measure of control.
  • Need to fully automate to scale out

97
Job Management, Debugging
  • Once a parallel application (usually MPI) has
    been created, it needs to be run, debugged, and
    scheduled
  • Job queuing systems (like PBS) exist and help
    with the sharing of resources
  • Debugging across N copies of the OS is quite
    challenging; debugging environments (like
    TotalView) have had only moderate success

98
The Dark Side of Clusters
  • Clusters are phenomenal price/performance
    computational engines
  • Can be hard to manage without experience
  • High-performance I/O is still unsolved
  • The effort of finding out where something has
    failed increases at least linearly as cluster
    size increases
  • Not cost-effective if every cluster burns a
    person just for care and feeding
  • NPACI Rocks helps here
  • Programming environment could be vastly improved
  • Technology is changing very rapidly. Scaling up
    is becoming commonplace (128 nodes)

99
The Top 2 Most Critical Problems
  • The largest problem in clusters is software skew
  • When Software configuration on some nodes is
    different than on others
  • Small differences (minor version numbers on
    libraries) can cripple a parallel program
  • It's taken the community almost 7 years from the
    original Beowulf book to understand this
  • The Second most important problem is adequate job
    control of the parallel process
  • Signal propagation
  • Cleanup
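
The two job-control concerns named above, signal propagation and cleanup, can be sketched on a single machine with Python's standard `subprocess` module. This is only an analogy: local child processes stand in for the per-node processes of a parallel job, and the function names and the 5-second grace period are assumptions. A real cluster launcher has to do the same thing across the network.

```python
import subprocess
import sys

def start_workers(count):
    """Start `count` local stand-in processes that would otherwise sleep for a minute."""
    cmd = [sys.executable, "-c", "import time; time.sleep(60)"]
    return [subprocess.Popen(cmd) for _ in range(count)]

def stop_workers(workers, grace_seconds=5.0):
    """Propagate a termination request to every worker, then clean up stragglers."""
    for w in workers:
        if w.poll() is None:
            w.terminate()  # polite request (SIGTERM on POSIX)
    for w in workers:
        try:
            w.wait(timeout=grace_seconds)
        except subprocess.TimeoutExpired:
            w.kill()  # forced stop for a worker that did not exit in time
            w.wait()  # reap it so no zombie process is left behind
    return [w.returncode for w in workers]

workers = start_workers(4)
exit_codes = stop_workers(workers)
print(exit_codes)
```

The point of the second loop is that every process is accounted for before the launcher returns; a job that leaves stray processes on some nodes is the cleanup failure the slide refers to.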

100
NPACI Rocks Toolkit rocks.npaci.edu
  • Collection of software components needed for
    software
  • Techniques and software for easy installation,
    management and update of clusters
  • Node management philosophy
  • Make it trivial to completely reinstall any (all)
    nodes.
  • Nodes must be 100% automatically configured
  • Use of DHCP, NIS for configuration
  • Use RedHat's Kickstart to define the set of
    software that defines a node.
  • All software is packaged in a RedHat Package
    (RPM)
  • Encapsulate configuration for a package (e.g.
    Myrinet)
  • Manage dependencies
  • Never try to figure out if node software is
    consistent
  • Bootable CD to first build front-end installation
    server and then to build nodes.

101
Trends
  • CPU trends
  • Network trends
  • Technology is changing rapidly in the PC
    marketplace
  • Knowing and following these trends (and having
    software to help you through them) is part of the
    commodity cluster game

102
Cluster Compute Node Today
103
Cluster Compute Node Tomorrow (Single P4 is here
today)
1.6 GHz
64 bit @ 400 MHz 3.2 GB/s
2 channels 16 bit @ 800 MHz 3.2 GB/s
PCI-X 64 bit @ 133 MHz 1.06 GB/s
  • In the next 9 months, every speed and feed gets
    at least a 2x bump!
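
The bandwidth figures on this slide follow from bus width times clock rate. A minimal sketch of that arithmetic, assuming one transfer per clock per channel and 1 GB = 10^9 bytes (the function name and labels are illustrative):

```python
def peak_bandwidth_gb_s(width_bits, clock_mhz, channels=1):
    """Peak transfer rate in GB/s (1 GB = 10**9 bytes)."""
    bytes_per_transfer = width_bits / 8
    return channels * bytes_per_transfer * clock_mhz * 1e6 / 1e9

feeds = {
    "64 bit @ 400 MHz": peak_bandwidth_gb_s(64, 400),
    "2 channels, 16 bit @ 800 MHz": peak_bandwidth_gb_s(16, 800, channels=2),
    "PCI-X, 64 bit @ 133 MHz": peak_bandwidth_gb_s(64, 133),
}
for label, gb_s in feeds.items():
    print(f"{label}: {gb_s:.2f} GB/s")
```

These are peak rates; sustained throughput on real hardware is lower.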

104
Commodity CPU Pentium 3
  • 0.8 Gflops (Peak)
  • 1 Flop / cycle @ 800 MHz
  • 12.9 GB/s L2 cache feed
  • 800 MHz × 1/2 × 256-bit (Advanced Transfer Cache)
  • 1.06 GB/s Memory-I/O bus
  • 133 MHz 64-bit

105
Commodity CPU Pentium 4
  • 2.8 Gflops
  • 2 Flops / cycle @ 1.4 GHz
  • 128-bit vector registers (Streaming SIMD
    Extensions)
  • Can apply operations on 2 64-bit floating point
    values per clock (Streaming SIMD Extensions 2)
  • 44 GB/s L2 cache feed (Full speed 1.4GHz x
    256bits)
  • 3.2 GB/s Memory-I/O bus
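
The peak numbers on the Pentium 3 and Pentium 4 slides can be reproduced from flops per cycle, clock rate, and cache path width. This is a sketch, assuming the "1/2" on the Pentium 3 slide means the 256-bit path delivers data on every other cycle; the function names are illustrative. The straightforward product gives 12.8 and 44.8 GB/s, close to the 12.9 and 44 GB/s quoted on the slides.

```python
def peak_gflops(flops_per_cycle, clock_ghz):
    """Peak floating-point rate: operations per cycle times cycles per second."""
    return flops_per_cycle * clock_ghz

def cache_feed_gb_s(width_bits, clock_ghz, transfers_per_cycle=1.0):
    """Peak cache bandwidth in GB/s for a data path of the given width."""
    return (width_bits / 8) * clock_ghz * transfers_per_cycle

# Pentium 3: 1 flop/cycle at 800 MHz; 256-bit L2 path, assumed to move
# data on every other cycle (the "1/2" on the slide).
p3_gflops = peak_gflops(1, 0.8)
p3_l2 = cache_feed_gb_s(256, 0.8, transfers_per_cycle=0.5)

# Pentium 4: 2 flops/cycle at 1.4 GHz; 256-bit L2 path at full clock speed.
p4_gflops = peak_gflops(2, 1.4)
p4_l2 = cache_feed_gb_s(256, 1.4)

print(f"Pentium 3: {p3_gflops:.1f} Gflops, {p3_l2:.1f} GB/s L2 feed")
print(f"Pentium 4: {p4_gflops:.1f} Gflops, {p4_l2:.1f} GB/s L2 feed")
```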

106
Power4
  • 2.5 GB/s / CPU memory bus feed
  • Numbers in the figure are aggregate
  • 10 GB/s / CPU in 8-way configuration
  • 5 GB/s / CPU I/O feed
  • Chip Multiprocessor (CMP)
  • 4.0 GFlop / CPU (Peak)
  • 50 GB/s / CPU L2 cache feed

107
TeraGrid
  • Taking clusters to the next stage for the NSF
    PACI program (Partnerships for Advanced
    Computational Infrastructure)
  • 13 TFlops aggregate speed across 4 sites
  • 4 Linux clusters
  • Next-generation processor (IA64 McKinley)
  • Designing I/O as an integral component of the
    cluster
  • Large Storage Area Network
  • Still the same basic design of lab clusters

108
TeraGrid Partners
  • Strategic partners
  • IBM
  • Cluster integration. GPFS parallel file system
  • Intel
  • McKinley IA-64 software and compilers
  • Oracle
  • data archive management and mining
  • Qwest
  • 40 Gb/s DTF WAN backbone
  • Myricom
  • Cluster interconnect
  • Sun
  • Data Management at SDSC

109
4 TeraGrid Sites Have Focal Points
  • SDSC Large-scale Data
  • Large-scale and high-performance data
    analysis/handling
  • Every Cluster Node is Directly Attached to SAN
  • NCSA High-performance Computing
  • Large-scale, Large Flops computation
  • Argonne Visualization
  • Scalable Visualization walls, Human-Computer
    Interfaces
  • Caltech Applications
  • Data and flops for applications, especially some
    of the GriPhyN apps (LIGO, NVO)
  • Specific site configurations reflect these foci
  • Sites are not limited to just their focus area
  • One organization cannot do it all

110
TeraGrid Architecture
[Figure: TeraGrid architecture diagram. Four sites connected through DTF core switch/routers in Chicago and LA.]
  • ANL: 1 TF, 0.25 TB memory, 25 TB disk
  • Caltech: 0.5 TF, 0.4 TB memory, 86 TB disk
  • NCSA: 6+2 TF, 4 TB memory, 240 TB disk
  • SDSC: 4.1 TF, 2 TB memory, 225 TB SAN
  • DTF core switch/routers: Cisco 65xx Catalyst
    switch (256 Gb/s crossbar), Juniper M160;
    Juniper M40 routers at site edges
  • Systems labeled in the diagram: 574p IA-32 Chiba
    City, 256p HP X-Class, 128p Origin, 128p HP
    V2500, 92p IA-32, 1024p IA-32, 320p IA-64, 1176p
    IBM SP Blue Horizon, 15xxp Origin, Sun E10K, Sun
    Starcat, HPSS, UniTree, Myrinet, Extreme Blk
    Diamond, HR Display / VR Facilities
  • External networks labeled in the diagram: ESnet,
    HSCC, MREN/Abilene, Starlight, NTON, Calren, vBNS
    (links from OC-3 to OC-48, OC-12 ATM, GbE)
111
TeraGrid Redux
  • Expect full clusters up and running by Nov 2002
  • Push towards the Grid is founded on our ability
    to manage and control the HPC software stack
  • Logical next step

112
Summary
  • Clusters are real machines
  • Still missing some key components such as
    High-perf I/O
  • Complexity is abstracted by MPI, but still needs
    to be understood by application developers
  • System integration software is the next step
    beyond the messaging proof-of-principle work of
    the mid 90s
  • Clustering Toolkits
  • Simplify installation/management/monitoring
  • Serve as collection points of software
  • We're on the cusp of large step-changes
    in commodity hardware.
  • TeraGrid is one of the first projects to go to
    the future generation of Intel architectures