Title: Scalability on Linux Clusters, ASCI/ASAP Scalability Workshop in Santa Fe

1. Scalability on Linux Clusters
ASCI/ASAP Scalability Workshop in Santa Fe
- Rolf Riesen, Ron Brightwell, and the Cplant™ Team
- Sandia National Laboratories
- Scalable Computing Systems Department
- May 11, 2000
2. Machines are not Scalable
- Size is not a measure of scalability
  - Modem-connected i386 PCs run SETI@home just fine, but they won't run MPLINPACK
- Network speed is not a measure of scalability
  - Bus-connected processors (SMPs) run MPLINPACK just fine, but won't grow to 9000 nodes
- Neither are topology, architecture, or CPU type
- The best a machine or architecture can hope for is to not inhibit scalability
3. System Software is not Scalable
- The data transport layer is not a measure of scalability
  - TCP/IP is just fine for cracking RSA challenges, but it won't do for an FFT
- The OS is not a measure of scalability
  - Windows runs the SETI@home screen saver just fine, but it won't run on ASCI Red
- Neither are point-to-point latency or bandwidth
- The best system software can hope for is to not inhibit scalability
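To make the FFT example concrete: a parallel FFT needs a global transpose in which every node exchanges a block with every other node, whereas a SETI@home work unit needs essentially no inter-node traffic. Below is a minimal MPI sketch of that transpose pattern; the block size is illustrative and not from the slides.

    /* Minimal sketch of the communication pattern behind a parallel
     * FFT: every rank exchanges a block with every other rank (a
     * global transpose). BLOCK is a made-up size for illustration. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define BLOCK 1024  /* doubles exchanged with each peer */

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        double *send = malloc((size_t)size * BLOCK * sizeof(double));
        double *recv = malloc((size_t)size * BLOCK * sizeof(double));
        for (int i = 0; i < size * BLOCK; i++)
            send[i] = (double)rank;

        /* All n ranks talk to all n ranks at once: n*(n-1) messages
         * per transpose step. This pattern, not point-to-point
         * latency, is what exposes a non-scalable transport. */
        MPI_Alltoall(send, BLOCK, MPI_DOUBLE,
                     recv, BLOCK, MPI_DOUBLE, MPI_COMM_WORLD);

        if (rank == 0)
            printf("transpose across %d ranks done\n", size);
        free(send); free(recv);
        MPI_Finalize();
        return 0;
    }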
4. What is Scalability?
- Whether a machine or system software is scalable depends on the intended application
- Designing and building scalable systems requires leaving out features that prevent the intended application from running
- Applications range from distributed to tightly coupled parallel
5. Distributed and Parallel Systems
(Tech Report SAND98-2221)

[Figure: spectrum from heterogeneous distributed systems to homogeneous massively parallel systems. Examples along the spectrum: Internet, SETI@home, Legion/Globus, Berkeley NOW, Beowulf, Cplant, ASCI Red Tflops]

Distributed systems (heterogeneous):
- Gather (unused) resources
- Steal cycles
- System SW manages resources
- System SW adds value
- 10-20% overhead is OK
- Resources drive applications
- Time to completion is not critical
- Time-shared

Massively parallel systems (homogeneous):
- Bounded set of resources
- Apps grow to consume all cycles
- Application manages resources
- System SW gets in the way
- 5% overhead is maximum
- Apps drive purchase of equipment
- Real-time constraints
- Space-shared
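One hedged way to read the overhead numbers above (the models are added here, not from the slide): overhead that stays parallel, adding a fraction o to each node's work, only costs efficiency,

    E = \frac{1}{1+o}

about 95% at o = 0.05 and 83-91% at o = 0.10 to 0.20. But any overhead that serializes a fraction s of the run is fatal at scale, since Amdahl's law caps the speedup on p nodes:

    S(p) = \frac{1}{s + (1-s)/p} \le \frac{1}{s}

so 20% serialized overhead limits any machine to a 5x speedup and even 5% to 20x, which is why a parallel machine must hold overhead to a few percent and keep it out of the critical path.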
6. Cplant Approach to Scalability
- Build a scalable system out of COTS parts that runs high-performance scientific applications on machines with up to 8192 nodes
- Distributed applications can run on parallel machines (at a reduced cost efficiency)
- Parallel applications cannot run on distributed machines (no matter how much money is involved)
- Core pieces we are working on:
  - Scalable app load, boot, and maintenance
  - Message passing (Portals 3.0)
  - I/O
7. Cplant Goals
- Scalable :-)
- Production system
- Multiple users
- General purpose for scientific applications (not a Beowulf dedicated to a single user)
- 1st step: Tflops look and feel for users
8. Cplant Strategy
- Hybrid approach combining commodity cluster technology with MPP technology
- Build on the design of the Tflops:
  - Large systems should be built from independent building blocks
  - Large systems should be partitioned to provide specialized functionality
  - Large systems should have significant resources dedicated to system maintenance
9. Cplant Approach
- Emulate the ASCI Red environment:
  - Partition model (functional decomposition)
  - Space sharing (reduce turnaround time)
  - Scalable services (allocator, loader, launcher), as sketched below
  - Ephemeral user environment
  - Complete resource dedication
- Use existing software when possible:
  - Red Hat distribution, Linux/Alpha
  - Software developed for ASCI Red
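A minimal sketch of what "scalable" means for the launcher service, assuming a logarithmic fan-out; the tree shape and node numbering are hypothetical, not Cplant's actual protocol. Each node forwards the launch request to a few children, so 8192 nodes start in O(log n) steps instead of 8192 sequential contacts from one service node.

    /* Hypothetical sketch of logarithmic fan-out for a scalable job
     * launcher. This program just traces the tree; a real launcher
     * would send an rsh/portal message at each edge. */
    #include <stdio.h>

    #define FANOUT 4   /* illustrative k-ary fan-out */

    static void forward_launch(int id, int n)
    {
        for (int c = 1; c <= FANOUT; c++) {
            int child = id * FANOUT + c;
            if (child < n) {
                /* carry the executable and environment here */
                printf("node %d -> launch node %d\n", id, child);
                forward_launch(child, n);
            }
        }
    }

    int main(void)
    {
        forward_launch(0, 32);  /* trace the fan-out over 32 nodes */
        return 0;
    }

With FANOUT = 4, an 8192-node launch completes in about seven levels of forwarding, which is what keeps app load from becoming the serialized overhead slide 5 warns about.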
10. Phase II Production (Alaska)
- 400 Digital PWS 500a (Miata)
  - 500 MHz Alpha 21164 CPU
  - 2 MB L3 cache, 192 MB RAM
- 16-port Myrinet switch
  - 32-bit, 33 MHz LANai-4 NIC
- 6 DEC AS1200, 12 RAID (0.75 TB) file server
- 1 DEC AS4100 compile/user file server
- Integrated by Compaq
- 125.2 GFLOPS on MPLINPACK (350 nodes)
  - Would place 53rd on the June 1999 Top 500
11. Phase III Production (Siberia)
- 624 Compaq XP1000 (Monet)
  - 500 MHz Alpha 21264 CPU
  - 4 MB L3 cache
  - 256 MB ECC SDRAM
- 16-port Myrinet switch
  - 64-bit, 33 MHz LANai-7 NIC
- 1.73 TB disk I/O
- Integrated by Compaq and Abba Technologies
- 247.6 GFLOPS on MPLINPACK (572 nodes)
  - Would place 40th on the Nov 1999 Top 500
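Assuming both Alpha generations peak at two floating-point operations per cycle (one add plus one multiply; an assumption, not stated on the slides), a 500 MHz node peaks at 1.0 GFLOPS, so the MPLINPACK efficiencies work out to:

    \eta_{\mathrm{Alaska}} = \frac{125.2}{350 \times 1.0} \approx 36\%, \qquad
    \eta_{\mathrm{Siberia}} = \frac{247.6}{572 \times 1.0} \approx 43\%

Siberia improves per-node efficiency as well as raw size, presumably helped by the 21264 and the wider 64-bit LANai-7 NIC.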
12. CTH Grind Time
[Figure: CTH grind-time plot; data not recoverable from the extraction]
13. Phase IV (Antarctica, Zermatt?)
- 1350 DS10 Slates (NM/CA)
  - 466 MHz EV6, 256 MB RAM
  - Myrinet, 33 MHz 64-bit LANai 7.x
- Will be combined with Siberia for a 1600-node system
- Red, black, green switchable
14. Myrinet Switch
[Figure: the 16-port switch building block and a 4-node group]
- Based on 64-port Clos switch
- 8x2 16-port switches in a 12U rack-mount case
- 64 LAN cables to nodes
- 32 SAN cables (64 links) to mesh
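The port counts above are self-consistent if "8x2" is read as two stages of eight 16-port switches (this reading is an assumption): each lower-stage switch spends 8 ports on nodes and 8 on the upper stage, and each upper-stage switch spends 8 ports down and 8 out to the mesh:

    \underbrace{8 \times 8}_{\text{lower stage}} = 64 \text{ LAN ports to nodes}, \qquad
    \underbrace{8 \times 8}_{\text{upper stage}} = 64 \text{ mesh links} = 32 \text{ SAN cables (2 links/cable)}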
15. One Switch Rack = One Plane
- 4 Clos switches in one rack
- 256 nodes per plane (8 racks)
- Wrap-around in x and y direction
- 128 links in z direction
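A small sketch of the addressing this topology implies, assuming row-major numbering and illustrative dimensions (not Cplant's actual geometry): x and y neighbors wrap around, z neighbors do not.

    /* Hypothetical neighbor addressing for a mesh that wraps around
     * in x and y (a torus in those dimensions) but not in z. The
     * dimensions are illustrative placeholders. */
    #include <stdio.h>

    #define DX 16  /* illustrative plane width */
    #define DY 16  /* illustrative plane height */
    #define DZ 4   /* illustrative number of planes */

    static int node_id(int x, int y, int z) { return (z * DY + y) * DX + x; }

    int main(void)
    {
        int x = 0, y = 5, z = 2;  /* some switch position */

        /* x and y wrap: (x-1) from column 0 lands in column DX-1. */
        printf("-x neighbor: %d\n", node_id((x + DX - 1) % DX, y, z));
        printf("+x neighbor: %d\n", node_id((x + 1) % DX, y, z));
        printf("-y neighbor: %d\n", node_id(x, (y + DY - 1) % DY, z));
        printf("+y neighbor: %d\n", node_id(x, (y + 1) % DY, z));

        /* z does not wrap: planes at the ends have no neighbor beyond. */
        if (z + 1 < DZ) printf("+z neighbor: %d\n", node_id(x, y, z + 1));
        if (z - 1 >= 0) printf("-z neighbor: %d\n", node_id(x, y, z - 1));
        return 0;
    }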
16. Cplant 2000
[Figure: Cplant 2000 layout, with sections connected to the classified, unclassified, and open networks; compute nodes swing between red, black, or green. Wrap-around and z links and nodes not shown]
17. Cplant 2000 cont.
- 1056 + 256 + 256 nodes ≈ 1600 nodes ≈ 1.5 TFlops
- 320 64-port switches + 144 16-port switches from Siberia
- 40 + 16 system support stations
18. MPP Network: Paragon and Tflops
- Network interface is on the memory bus
[Diagram: two processors, memory, and the network interface all attached to the memory bus; the second processor acts as a message passing or computational co-processor]
19. Commodity Myrinet
- Network is far from the memory
[Diagram: processor and memory on the memory bus; a bridge leads to the PCI bus, where the NIC attaches to the network. An OS-bypass path connects the application directly to the NIC]
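A hedged sketch of what OS bypass buys on this commodity path: after one setup system call maps the NIC's control region into user space, each send is a plain store, so the kernel stays out of the per-message path even though the NIC sits across the bridge on the PCI bus. The device name and descriptor layout below are invented for illustration; they are not Myrinet's real interface.

    /* Hypothetical OS-bypass send path: map the NIC's control region
     * once, then trigger DMA with ordinary stores, no per-message
     * system call or kernel copy. Device and layout are imaginary. */
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    struct send_desc {              /* imaginary NIC send descriptor */
        uint64_t user_buf_addr;     /* source buffer (pinned by driver) */
        uint32_t length;
        uint32_t target_node;
        volatile uint32_t doorbell; /* store here starts the DMA */
    };

    int main(void)
    {
        static char msg[64] = "hello over the bypass path";

        int fd = open("/dev/nic0", O_RDWR);   /* hypothetical device */
        if (fd < 0) { perror("open"); return 1; }

        /* One syscall at setup time; none afterwards. */
        struct send_desc *d = mmap(NULL, sizeof *d,
                                   PROT_READ | PROT_WRITE,
                                   MAP_SHARED, fd, 0);
        if (d == MAP_FAILED) { perror("mmap"); return 1; }

        d->user_buf_addr = (uintptr_t)msg; /* NIC DMAs straight from here */
        d->length = sizeof msg;
        d->target_node = 42;
        d->doorbell = 1;                   /* the store the OS never sees */

        munmap(d, sizeof *d);
        close(fd);
        return 0;
    }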
20. http://www.cs.sandia.gov/cplant