Supercomputing on Windows Clusters: Experience and Future Directions



1
Supercomputing on Windows Clusters: Experience
and Future Directions
  • Andrew A. Chien
  • CTO, Entropia, Inc.
  • SAIC Chair Professor
  • Computer Science and Engineering, UCSD
  • National Computational Science Alliance
  • Invited Talk, USENIX Windows, August 4, 2000

2
Overview
  • Critical Enabling Technologies
  • The Alliance's Windows Supercluster
  • Design and Performance
  • Other Windows Cluster Efforts
  • Future
  • Terascale Clusters
  • Entropia

3
External Technology Factors
4
Microprocessor Performance
[Chart: microprocessor performance vs. year introduced, showing DEC Alpha and x86/Alpha curves]
  • Micros: 10 MF -> 100 MF -> 1 GF -> 3 GF -> 6 GF
    (2001?)
  • => Memory system performance catching up (2.6
    GB/s 21264 memory BW)

Adapted from Baskett, SGI and CSC Vanguard
5
Killer Networks
[Chart: link bandwidths -- Ethernet 1 MB/s, Fast Ethernet 12 MB/s, UW SCSI 40 MB/s, Gigabit SAN/GigE 110 MB/s]
  • LAN: 10 Mb/s -> 100 Mb/s -> ?
  • SAN: 12 MB/s -> 110 MB/s (Gbps) -> 1100 MB/s -> ?
  • Myricom, Compaq, Giganet, Intel, ...
  • Network bandwidths limited by system internal
    memory bandwidths
  • Cheap and very fast communication hardware
    (Gb/s-to-MB/s arithmetic below)
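For reference, the Gb/s-to-MB/s arithmetic behind the ~110 MB/s gigabit figure (the delivered number after framing and protocol overhead is an approximation, not taken from the slide):

```latex
\[
1\ \mathrm{Gb/s} = \frac{1000\ \mathrm{Mb/s}}{8\ \mathrm{bits/byte}} = 125\ \mathrm{MB/s\ (raw)}
\;\Rightarrow\; \approx 110\ \mathrm{MB/s\ delivered\ after\ framing\ and\ protocol\ overhead}.
\]
```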
6
Rich Desktop Operating Systems Environments
[Timeline, 1981-1999: basic device access -> HD storage, networks -> graphical interfaces, audio/graphics -> multiprocess protection, SMP support -> clustering, performance, mass store, HP networking, management, availability, etc.]
  • Desktop (PC) operating systems now provide
  • richest OS functionality
  • best program development tools
  • broadest peripheral/driver support
  • broadest application software/ISV support

7
Critical Enabling Technologies
8
Critical Enabling Technologies
  • Cluster management and resource integration (use
    like one system)
  • Delivered communication performance
  • IP protocols inappropriate
  • Balanced systems
  • Memory bandwidth
  • I/O capability

9
The HPVM System
  • Goals
  • Enable tightly coupled and distributed clusters
    with high efficiency and low effort (integrated
    solution)
  • Provide usable access through convenient standard
    parallel interfaces
  • Deliver highest possible performance and simple
    programming model

10
Delivered Communication Performance
  • Early 1990s, Gigabit testbeds
  • 500 Mb/s (60 MB/s) @ 1 MB packets
  • IP protocols not suited to Gigabit SANs
  • Cluster objective: high-performance communication
    for small and large messages
  • Performance balance shift: networks faster than
    I/O, memory, processor

11
Fast Messages Design Elements
  • User-level network access
  • Lightweight protocols
  • flow control, reliable delivery
  • tightly-coupled link, buffer, and I/O bus
    management
  • Poll-based notification
  • Streaming API for efficient composition
  • Many generations 1994-1999
  • IEEE Concurrency, 6/97
  • Supercomputing 95, 12/95
  • Related efforts UCB AM, Cornell U-Net,RWCP PM,
    Princeton VMMC/Shrimp, Lyon BIP gt VIA standard

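To make these elements concrete, here is a small, self-contained sketch of a streaming, poll-driven messaging interface in the Fast Messages style. The fm_* names, signatures, and the single-slot loopback stubs are illustrative assumptions, not the actual HPVM/FM API; the point is the usage pattern: compose a message from pieces, then poll explicitly to extract arrivals.

```c
/* Illustrative sketch of a streaming, poll-driven messaging interface in the
 * Fast Messages style.  The fm_* names, signatures, and the loopback stubs
 * below are hypothetical; they only show the usage pattern, not the real API. */
#include <stdio.h>
#include <string.h>
#include <stddef.h>

typedef void (*fm_handler_t)(const void *payload, size_t len);

/* A toy "stream": gathers pieces into one buffer, delivered on end_message. */
typedef struct {
    char         buf[256];
    size_t       off;
    fm_handler_t handler;
} fm_stream_t;

static fm_stream_t pending;           /* one outstanding message (toy)       */
static int         pending_ready = 0; /* set when a complete message queued  */

static fm_stream_t *fm_begin_message(int dest, size_t len, fm_handler_t h) {
    (void)dest; (void)len;
    pending.off = 0; pending.handler = h;
    return &pending;
}
static void fm_send_piece(fm_stream_t *s, const void *p, size_t n) {
    memcpy(s->buf + s->off, p, n);    /* streaming: pieces appended in order */
    s->off += n;
}
static void fm_end_message(fm_stream_t *s) { (void)s; pending_ready = 1; }

/* Poll-based notification: the application drains the queue explicitly. */
static void fm_extract(void) {
    if (pending_ready) {
        pending_ready = 0;
        pending.handler(pending.buf, pending.off);
    }
}

static void on_msg(const void *p, size_t n) {
    printf("got %zu bytes: %s\n", n, (const char *)p);
}

int main(void) {
    const char hdr[] = "hdr:", body[] = "hello";
    fm_stream_t *s = fm_begin_message(/*dest=*/0, sizeof hdr + sizeof body, on_msg);
    fm_send_piece(s, hdr, sizeof hdr - 1);   /* compose header and payload   */
    fm_send_piece(s, body, sizeof body);     /* without an intermediate copy */
    fm_end_message(s);
    fm_extract();                            /* poll instead of interrupts   */
    return 0;
}
```

Streaming composition lets an upper layer such as MPI gather a header and payload without an intermediate staging copy, and the explicit poll keeps notification overhead small and predictable.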
12
Improved Bandwidth
  • 20 MB/s -> 200 MB/s (10x)
  • Much of the advance is software structure: APIs
    and implementation
  • Deliver all of the underlying hardware
    performance

13
Improved Latency
  • 100 µs to 2 µs overhead (50x)
  • Careful design to minimize overhead while
    maintaining throughput
  • Efficient event handling, fine-grained resource
    management, and interlayer coordination
  • Deliver all of the underlying hardware
    performance

14
HPVM Cluster Supercomputers
[Architecture: standard APIs (MPI, Put/Get, Global Arrays, BSP) and
scheduling/management (LSF) layered over Fast Messages and performance tools,
running on Myrinet, ServerNet, Giganet VIA, SMP, and WAN transports]
Releases: HPVM 1.0 (8/1997); HPVM 1.2 (2/1999) - multi,
dynamic, install; HPVM 1.9 (8/1999) - Giganet, SMP
  • Turnkey cluster computing with standard APIs (MPI
    example below)
  • Network hardware and APIs increase leverage for
    users, achieve critical mass for system
  • Each involved new research challenges and
    provided deeper insights into the research issues
  • Drove continually better solutions (e.g.,
    multi-transport integration, robust flow control
    and queue management)

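As noted above, applications program against standard APIs; a minimal MPI ping-pong (plain, standard MPI calls only, nothing HPVM-specific) shows the model codes used unchanged on the cluster.

```c
/* Minimal MPI ping-pong between ranks 0 and 1: the standard-API programming
 * model exposed over Fast Messages.  Plain MPI; nothing HPVM-specific. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size, msg = 0;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (size >= 2) {
        if (rank == 0) {
            msg = 42;
            MPI_Send(&msg, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(&msg, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &status);
            printf("rank 0 got %d back\n", msg);
        } else if (rank == 1) {
            MPI_Recv(&msg, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
            msg += 1;                       /* touch the data, send it back */
            MPI_Send(&msg, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
        }
    }
    MPI_Finalize();
    return 0;
}
```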
15
HPVM Communication Performance
  • N1/2 ≈ 400 bytes (cost model below)
  • Delivers underlying performance for small
    messages; endpoints are the limits
  • 100 MB/s at 1 KB vs 60 MB/s at 1000 KB
  • => >1500x improvement

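N1/2 is the message size at which delivered bandwidth reaches half its asymptotic value. Under the usual linear cost model (a standard model assumed here, not given on the slide):

```latex
\[
T(n) = T_0 + \frac{n}{r_\infty}, \qquad
B(n) = \frac{n}{T(n)}, \qquad
B(n) = \tfrac{1}{2}\, r_\infty \;\Longleftrightarrow\; n = N_{1/2} = T_0\, r_\infty .
\]
```

With N1/2 ≈ 400 bytes and r∞ ≈ 100 MB/s, the implied end-to-end per-message overhead is T0 ≈ 400 B / (100 MB/s) ≈ 4 µs, consistent with the few-microsecond overheads cited on the preceding slides.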
16
HPVM/FM on VIA
  • N1/2 ≈ 400 bytes
  • FM protocol/techniques portable to Giganet VIA
  • Slightly lower performance, comparable N1/2
  • Commercial version: WSDI (stay tuned)

17
Unified Transfer and Notification (all transports)
[Diagram: receive queue of fixed-size frames at increasing addresses, shared by
processes and networks; each frame holds variable-size data followed by a
fixed-size trailer carrying length and a flag]
  • Solution: uniform notify and poll (single queue
    representation)
  • Scalability: hash n into k -- arbitrary SMP size
    or number of NIC cards
  • Key: integrate variable-sized messages, achieve a
    single DMA transfer (struct sketch below)
  • no pointer-based memory management, no special
    synchronization primitives, no complex
    computation
  • Memory format provides atomic notification in a
    single contiguous memory transfer (bcopy or DMA)

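A sketch of the frame layout the bullets describe: fixed-size frames, variable-size data, and a fixed-size trailer holding length and a flag at a known offset, so one contiguous ascending-address copy (bcopy or DMA) both delivers the payload and notifies the poller. Names, sizes, and the exact packing are assumptions for illustration, not HPVM's actual format.

```c
/* Illustrative receive-queue slot: a fixed-size frame whose variable-size
 * payload ends right at a fixed-size trailer (length + flag).  One contiguous
 * copy writes payload then trailer, so the flag becoming nonzero announces a
 * complete message.  All names, sizes, and packing are assumptions. */
#include <stdio.h>
#include <stdint.h>
#include <string.h>

#define FRAME_SIZE 2048u

struct trailer { uint32_t length; uint32_t flag; };    /* flag written last */

struct frame {
    uint8_t        space[FRAME_SIZE - sizeof(struct trailer)];
    struct trailer t;                                  /* fixed offset: frame end */
};

/* Sender side: stage payload + trailer contiguously, then copy in one shot so
 * the flag lands last in memory order (real code would use bcopy or DMA and
 * the appropriate memory-ordering guarantees). */
static void deposit(struct frame *f, const void *msg, uint32_t len)
{
    uint8_t staging[FRAME_SIZE];
    struct trailer t = { len, 1u };
    memcpy(staging, msg, len);
    memcpy(staging + len, &t, sizeof t);
    /* Single contiguous transfer ending exactly at the trailer slot. */
    memcpy((uint8_t *)&f->t - len, staging, len + sizeof t);
}

/* Receiver side: poll the flag at its fixed offset; no locks or pointers. */
static int poll_frame(struct frame *f, uint8_t *out)
{
    if (f->t.flag == 0) return -1;                     /* nothing yet        */
    uint32_t len = f->t.length;
    memcpy(out, (uint8_t *)&f->t - len, len);
    f->t.flag = 0;                                     /* recycle the slot   */
    return (int)len;
}

int main(void)
{
    static struct frame q;                             /* one-slot toy queue */
    uint8_t buf[FRAME_SIZE];
    deposit(&q, "hello, frame", 13);
    int n = poll_frame(&q, buf);
    printf("polled %d bytes: %s\n", n, buf);
    return 0;
}
```

Because the flag is the last word written by the single transfer, a poller that sees it set knows the whole payload is already in place, with no locks or pointer chasing.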
18
Integrated Notification Results
                            Single Transport    Integrated
  Myrinet (latency)         8.3 µs              8.4 µs
  Myrinet (BW)              101 MB/s            101 MB/s
  Shared Memory (latency)   3.4 µs              3.5 µs
  Shared Memory (BW)        200 MB/s            200 MB/s
  • No polling or discontiguous access performance
    penalties
  • Uniform high performance which is stable over
    changes of configuration or the addition of new
    transports
  • no custom tuning for configuration required
  • Framework is scalable to large numbers of SMP
    processors and network interfaces

19
Supercomputer Performance Characteristics (11/99)
                            MF/Proc      Flops/Byte   Flops/Network RT
  Cray T3E                  1200         2            2,500
  SGI Origin2000            500          0.5          1,000
  HPVM NT Supercluster      600          8            12,000
  IBM SP2 (4- or 8-way)     2.6-5.2 GF   12-25        150-300K
  Beowulf (100 Mbit)        600          50           200,000
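One plausible reading of the Flops/Byte column (an interpretation; the slide does not define it) is the ratio of per-processor compute rate to delivered network bandwidth per processor, so for the HPVM row:

```latex
\[
\frac{\mathrm{flops}}{\mathrm{byte}} \approx
\frac{\text{per-processor compute rate}}{\text{delivered bandwidth per processor}}
\quad\Rightarrow\quad
\frac{600\ \mathrm{MF/s}}{8\ \mathrm{flops/byte}} \approx 75\ \mathrm{MB/s\ per\ processor}.
\]
```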
20
The NT Supercluster
21
Windows Clusters
  • Early prototypes in CSAG
  • 1/1997, 30P, 6GF
  • 12/1997, 64P, 20GF
  • Alliance's Supercluster
  • 4/1998, 256P, 77GF
  • 6/1999, 256P, 109GF

22
NCSA's Windows Supercluster
Engineering fluid flow problem
#207 in Top 500 Supercomputing Sites
D. Tafti, NCSA
Rob Pennington (NCSA), Andrew Chien (UCSD)
23
Windows Cluster System
[System diagram]
  • Front-end systems (file servers, LSF master),
    reached over the Internet for apps development
    and job submission
  • Fast Ethernet; FTP to mass storage with daily
    backups; 128 GB home, 200 GB scratch
  • LSF batch job scheduler
  • 128 compute nodes (256 CPUs): 128 dual 550 MHz
    systems running Windows NT, Myrinet, and HPVM
  • Infrastructure and development testbeds (Windows
    2000 and NT): 8 4-proc 550 MHz, 32 2-proc 300 MHz,
    8 2-proc 333 MHz
(courtesy Rob Pennington, NCSA)
24
Example Application Results
  • MILC QCD
  • Navier-Stokes Kernel
  • Zeus-MP Astrophysics CFD
  • Large-scale Science and Engineering codes
  • Comparisons to SGI O2K and Linux clusters

25
MILC Performance
Source: D. Toussaint and K. Orginos, Arizona
26
Zeus-MP (Astrophysics CFD)
27
2D Navier Stokes Kernel
Source: Danesh Tafti, NCSA
28
Applications with High Performance on Windows
Supercluster
  • Zeus-MP (256P, Mike Norman)
  • ISIS (192P, Robert Clay)
  • ASPCG (256P, Danesh Tafti)
  • Cactus (256P, Paul Walker/John Shalf/Ed Seidel)
  • MILC QCD (256P, Lubos Mitas)
  • QMC Nanomaterials (128P, Lubos Mitas)
  • Boeing CFD Test Codes, CFD Overflow (128P, David
    Levine)
  • freeHEP (256P, Doug Toussaint)
  • ARPI3D (256P, weather code, Dan Weber)
  • GMIN (L. Munro in K. Jordan)
  • DSMC-MEMS (Ravaioli)
  • FUN3D with PETSc (Kaushik)
  • SPRNG (Srinivasan)
  • MOPAC (McKelvey)
  • Astrophysical N body codes (Bode)
  • => Little code retuning, and quickly running ...
  • Parallel Sorting (Rivera, CSAG)

29
MinuteSort
  • Sort the maximum data disk-to-disk in 1 minute
  • Indy sort
  • fixed-size keys, special sorter, and file format
  • HPVM/Windows Cluster winner for 1999 (10.3 GB) and
    2000 (18.3 GB)
  • Adaptation of Berkeley NOWSort code
    (Arpaci-Dusseau); partitioning sketch below
  • Commodity configuration ($ not a metric)
  • PCs, IDE disks, Windows
  • HPVM and 1 Gbps Myrinet

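A sketch of the one-pass bucket-partitioning step a NOWSort-style sorter performs while reading input: each record's destination node is computed from the high-order bits of its key so records can be shipped as soon as they are read. The 100-byte record layout follows the Indy sort rules; the record count, node count, and the counting stub that stands in for the HPVM/Myrinet send are all assumptions.

```c
/* One-pass bucket partitioning by high-order key bits (NOWSort style).
 * The 100-byte record / leading key bytes follow the Indy sort rules; the
 * node count and the counting stub replacing the network send are illustrative. */
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

#define REC_SIZE 100
#define N_NODES  32

/* Map the first two key bytes to one of N_NODES contiguous key ranges. */
static int bucket_of(const uint8_t *key)
{
    uint32_t prefix = ((uint32_t)key[0] << 8) | key[1];        /* 0..65535 */
    return (int)(((uint64_t)prefix * N_NODES) >> 16);
}

static long sent[N_NODES];                        /* stub: records per node */
static void send_to_node(int node, const uint8_t *rec) { (void)rec; sent[node]++; }

int main(void)
{
    /* Synthetic input: random keys stand in for the disk read. */
    enum { N_RECORDS = 100000 };
    static uint8_t recs[N_RECORDS][REC_SIZE];
    for (size_t i = 0; i < N_RECORDS; i++)
        for (int b = 0; b < REC_SIZE; b++)
            recs[i][b] = (uint8_t)rand();

    /* Partition as we scan: overlap read, bucket, and communicate. */
    for (size_t i = 0; i < N_RECORDS; i++)
        send_to_node(bucket_of(recs[i]), recs[i]);

    for (int n = 0; n < N_NODES; n++)
        printf("node %2d: %ld records\n", n, sent[n]);
    return 0;
}
```

In the real sorter this scan is overlapped with disk reads and network sends, which is the read/bucket-sort/communicate stage the scaling slide identifies as the bottleneck.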
30
MinuteSort Architecture
HPVM, 1 Gbps Myrinet
32 HP Kayaks: 3Ware controllers, 4 x 20 GB IDE disks
32 HP Netservers: 2 x 16 GB SCSI disks
(Luis Rivera UIUC, Xianan Zhang UCSD)
31
Sort Scaling
  • Concurrent read/bucket-sort/communicate is the
    bottleneck; faster I/O infrastructure required
    (buses and memory, not disks)

32
MinuteSort Execution Time
33
Reliability
  • Gossip: Windows platforms are not reliable
  • Larger systems => intolerably low MTBF
  • Our experience: nodes don't crash
  • Application runs of 1000s of hours
  • Node failure means an application failure --
    effectively not a problem
  • Hardware
  • Short term: infant mortality (1-month burn-in)
  • Long term
  • 1 hardware problem / 100 machines / month (rough
    expectation arithmetic below)
  • Disks, network interfaces, memory
  • No processor or motherboard problems.

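At the quoted long-term rate, the expected hardware-incident load on the 128-node Supercluster is (a simple expectation, not a measured figure):

```latex
\[
128\ \text{nodes} \times \frac{1\ \text{problem}}{100\ \text{machine-months}}
\approx 1.3\ \text{hardware problems per month}.
\]
```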
34
Windows Cluster Usage
  • Lots of large jobs
  • Runs up to 14,000 CPU-hours (64p x 9 days;
    arithmetic below)

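The 14,000-hour figure is aggregate CPU time:

```latex
\[
64\ \text{processors} \times 9\ \text{days} \times 24\ \tfrac{\text{h}}{\text{day}}
= 13{,}824 \approx 14{,}000\ \text{CPU-hours}.
\]
```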
35
Other Large Windows Clusters
  • Sandia's Kudzu Cluster (144 procs, 550 disks,
    10/98)
  • Cornell's AC3 Velocity Cluster (256 procs, 8/99)
  • Others (sampled from vendors)
  • GE Research Labs (16, Scientific)
  • Boeing (32, Scientific)
  • PNNL (96, Scientific)
  • Sandia (32, Scientific)
  • NCSA (32, Scientific)
  • Rice University (16, Scientific)
  • U. of Houston (16, Scientific)
  • U. of Minnesota (16, Scientific)
  • Oil & Gas (8, Scientific)
  • Merrill Lynch (16, Ecommerce)
  • UIT (16, ASP/Ecommerce)

36
The AC3 Velocity
  • 64 Dell PowerEdge 6350 Servers
  • Quad Pentium III 500 MHz/2 MB Cache Processors
    (SMP)
  • 4 GB RAM/Node
  • 50 GB Disk (RAID 0)/Node
  • GigaNet Full Interconnect
  • 100 MB/Sec Bandwidth between any 2 Nodes
  • Very Low Latency
  • 2 Terabytes Dell PowerVault 200S Storage
  • 2 Dell PowerEdge 6350 Dual Processor File
    Servers
  • 4 PowerVault 200S Units/File Server
  • 8 x 36 GB Disk Drives/PowerVault 200S
  • Quad Channel SCSI RAID Adapter
  • 180 MB/sec Sustained Throughput/Server
  • 2 Terabyte PowerVault 130T Tape Library
  • 4 DLT 7000 Tape Drives
  • 28-Tape Capacity

#381 in Top 500 Supercomputing Sites
(courtesy David A. Lifka, Cornell TC)
37
Recent AC3 Additions
  • 8 Dell PowerEdge 2450 Servers (Serial Nodes)
  • Pentium III 600 MHz/512 KB Cache
  • 1 GB RAM/Node
  • 50 GB Disk (RAID 0)/Node
  • 7 Dell PowerEdge 2450 Servers (First All NT Based
    AFS Cell)
  • Dual Processor Pentium III 600 MHz/512 KB Cache
  • 1 GB RAM/Node Fileservers, 512 MB RAM/Node
    Database servers
  • 1 TB SCSI based RAID 5 Storage
  • Cross platform filesystem support
  • 64 Dell PowerEdge 2450 Servers (Protein Folding,
    Fracture Analysis)
  • Dual Processor Pentium III 733 MHz/256 KB Cache
  • 2 GB RAM/Node
  • 27 GB Disk (RAID 0)/Node
  • Full Giganet Interconnect
  • 3 Intel ES6000 + 1 ES1000 Gigabit switches
  • Upgrading our Server Backbone network to Gigabit
    Ethernet

(courtesy David A. Lifka, Cornell TC)
38
AC3 Goals
  • Only commercially supported technology
  • Rapid spin-up and spin-out
  • Package technologies for vendors to sell as
    integrated systems
  • => All of the commercial packages were moved from
    SP2 to Windows; all users are back, and more!
  • Users: from "I don't do Windows" to
  • "I'm agnostic about operating systems, and just
    focus on getting my work done."

39
Protein Folding
Reaction path study of ligand diffusion in
leghemoglobin. The ligand is CO (white) and it
is moving from the binding site, the heme pocket,
to the protein exterior. A study by Wieslaw
Nowak and Ron Elber.
The cooperative motion of ion and water through
the gramicidin ion channel. The effective
quasi-particle that permeates through the channel
includes eight water molecules and the ion. Work
of Ron Elber with Bob Eisenberg, Danuta Rojewska
and Duan Pin.
http://www.tc.cornell.edu/reports/NIH/resource/CompBiologyTools/
(courtesy David A. Lifka, Cornell TC)
40

Protein Folding: Per-Processor Performance
[Chart: results on different computers for α protein structures]
[Chart: results on different computers for α/β or β proteins]

(courtesy David A. Lifka, Cornell TC)
41
AC3 Corporate Members
  • Air Products and Chemicals
  • Candle Corporation
  • Compaq Computer Corporation
  • Conceptual Reality Presentations
  • Dell Computer Corporation
  • Etnus, Inc.
  • Fluent, Inc.
  • Giganet, Inc.
  • IBM Corporation
  • ILOG, Inc.
  • Intel Corporation
  • KLA-Tencor Corporation
  • Kuck & Associates, Inc.
  • Lexis-Nexis
  • MathWorks, Inc.
  • Microsoft Corporation
  • MPI Software Technologies, Inc.
  • Numerical Algorithms Group
  • Portland Group, Inc.
  • Reed Elsevier, Inc.
  • Reliable Network Solutions, Inc.
  • SAS Institute, Inc.
  • Seattle Lab, Inc.
  • Visual Numerics, Inc.
  • Wolfram Research, Inc.

(courtesy David A. Lifka, Cornell TC)
42
Windows Cluster Summary
  • Good performance
  • Lots of Applications
  • Good reliability
  • Reasonable Management complexity (TCO)
  • Future is bright: uses are proliferating!

43
Windows Cluster Resources
  • NT Supercluster, NCSA
  • http://www.ncsa.uiuc.edu/General/CC/ntcluster/
  • http://www-csag.ucsd.edu/projects/hpvm.html
  • AC3 Cluster, TC
  • http://www.tc.cornell.edu/UserDoc/Cluster/
  • University of Southampton
  • http://www.windowsclusters.org/
  • => application and hardware/software evaluation
  • => many of these folks will work with you on
    deployment

44
Tools and Technologies for Building Windows
Clusters
  • Communication Hardware
  • Myrinet, http://www.myri.com/
  • Giganet, http://www.giganet.com/
  • ServerNet II, http://www.compaq.com/
  • Cluster Management and Communication Software
  • LSF, http://www.platform.com/
  • Codine, http://www.gridware.net/
  • Cluster CoNTroller, MPI, http://www.mpi-softtech.com/
  • Maui Scheduler, http://www.cs.byu.edu/
  • MPICH, http://www-unix.mcs.anl.gov/mpi/mpich/
  • PVM, http://www.epm.ornl.gov/pvm/
  • Microsoft Cluster Info
  • Win2000, http://www.microsoft.com/windows2000/
  • MSCS, http://www.microsoft.com/ntserver/ntserverenterprise/exec/overview/clustering.asp

45
Future Directions
  • Terascale Clusters
  • Entropia

46
A Terascale Cluster
10 Teraflops in 2000?
  • NSF is currently running a $36M Terascale
    competition
  • Budget could buy (rough arithmetic below)
  • an Itanium cluster (3000 processors)
  • 3 TB of main memory
  • > 1.5 Gbps high-speed network interconnect

=> #1 in Top 500 Supercomputing Sites?
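Rough arithmetic behind the 10 TF target, assuming on the order of 3.3 GF peak per Itanium processor (the per-processor figure is an assumption, not stated on the slide):

```latex
\[
3000\ \text{processors} \times \approx 3.3\ \tfrac{\text{GF}}{\text{processor}} \approx 10\ \text{TF}.
\]
```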
47
Entropia Beyond Clusters
  • COTS, SHV enable larger, cheaper, faster systems
  • Supercomputers (MPPs) to
  • Commodity Clusters (NT Supercluster) to
  • Entropia

48
Internet Computing
  • Idea: assemble large numbers of idle PCs in
    people's homes and offices into a massive
    computational resource
  • Enabled by broadband connections, fast
    microprocessors, huge PC volumes

49
Unprecedented Power
  • Entropia network: 30,000 machines (and growing
    fast!)
  • 100,000 machines at 1 GHz => 100 TeraOp system
  • 1,000,000 machines at 1 GHz => 1,000 TeraOp system
    (1 PetaOp)
  • IBM ASCI White (12 TeraOp, 8K processors, $110
    million system)

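The aggregate-rate arithmetic, assuming roughly one operation retired per cycle per machine:

```latex
\[
10^{5}\ \text{machines} \times 10^{9}\ \tfrac{\text{cycles}}{\text{s}} \times 1\ \tfrac{\text{op}}{\text{cycle}}
= 10^{14}\ \tfrac{\text{ops}}{\text{s}} = 100\ \text{TeraOps},
\]
\[
\text{and ten times as many machines gives } 10^{15}\ \tfrac{\text{ops}}{\text{s}} = 1\ \text{PetaOp}.
\]
```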
50
Why Participate? Cause Computing!
51
People Will Contribute
  • Millions have demonstrated willingness to donate
    their idle cycles
  • Great Cause Computing
  • Current: find ET, large primes, crack DES
  • Next: find cures for cancer, muscular dystrophy,
    air and water pollution, ...
  • understand the human genome, ecology, fundamental
    properties of matter, the economy
  • Participate in science, medical research,
    promoting causes that you care about!

52
Technical Challenges
  • Heterogeneity (machine, configuration, network)
  • Scalability (thousands to millions)
  • Reliability (turn off, disconnect, fail)
  • Security (integrity, confidentiality)
  • Performance
  • Programming
  • . . .
  • Entropia: harnessing the computational power of
    the Internet

53
Entropia is . . .
  • Power: a network with unprecedented power and
    scale
  • Empower: ordinary people to participate in
    solving the great social challenges and mysteries
    of our time
  • Solve: a team solving fascinating technical problems

54
Summary
  • Windows clusters are powerful, successful high
    performance platforms
  • Cost effective and excellent performance
  • Poised for rapid proliferation
  • Beyond clusters are Internet computing systems
  • Radical technical challenges, vast and profound
    opportunities
  • For more information, see
  • HPVM: http://www-csag.ucsd.edu/
  • Entropia: http://www.entropia.com/

55
Credits
  • NT Cluster Team Members
  • CSAG (UIUC and UCSD Computer Science) -- my
    research group
  • NCSA Leading Edge Site -- Robert Pennington's
    team
  • Talk materials
  • NCSA (Rob Pennington, numerous application
    groups)
  • Cornell TC (David Lifka)
  • Boeing (David Levine)
  • MPISoft (Tony Skjellum)
  • Giganet (David Wells)
  • Microsoft (Jim Gray)
