Title: The IEEE CS Task Force on Cluster Computing (TFCC)
1. The IEEE CS Task Force on Cluster Computing (TFCC)
William Gropp
Mathematics and Computer Science
Argonne National Lab
www.mcs.anl.gov/gropp
Thanks to Mark Baker
University of Portsmouth, UK
http://www.dcs.port.ac.uk/mab
2. A Little History
- In 1998 there was obvious, huge interest in clusters, so it seemed natural to set up a focused group in this area.
- A Cluster Computing Task Force was proposed to the IEEE CS.
- The TFCC was approved and started operating in February 1999; it has been going for just over 2 years.
3. Proposed Activities
- Act as an international forum to promote cluster computing research and education, and participate in setting up technical standards in this area.
- Be involved with issues related to the design, analysis and development of cluster systems as well as the applications that use them.
- Sponsor professional meetings, produce publications, set guidelines for educational programs, and help co-ordinate academic, funding agency, and industry activities.
- Organize events and hold a number of workshops that would span the range of activities sponsored by the Task Force.
- Publish a bi-annual newsletter to help the community keep abreast of activities in the field.
4. IEEE CS Task Forces
- A TF is expected to have a finite term of existence, normally a period of 2-3 years; continued existence beyond that point is generally not appropriate.
- A TF is expected either to increase its scope of activities such that establishment of a Technical Committee (TC) is warranted, or to be merged into existing TCs.
- The TFCC will submit an application to the CS to become a TC later this year.
5. Why a separate TFCC?
- It brings together all the activities and technologies used with Cluster Computing into one area, so instead of tracking four or five IEEE TCs there is one.
- Cluster Computing is NOT just Parallel, Distributed, OSs, or the Internet; it is a mix of them all, and consequently different.
- The TFCC is an appropriate body for focusing activities and publications associated with Cluster Computing.
6. http://www.ieeetfcc.org
7. TFCC Mailing Lists
- Currently three email lists have been set up:
  - tfcc-l_at_bucknell.edu: a discussion list open to anyone interested in the TFCC; see the TFCC page for information on how to subscribe.
  - tfcc-exe_at_port.ac.uk: a closed executive committee mailing reflector.
  - tfcc-adv_at_port.ac.uk: a closed advisory committee mailing reflector.
8. Annual Conference ClusterXY
- 1st IEEE International Workshop on Cluster Computing (Cluster 1999), Melbourne, Australia, December 1999; about 105 attendees from 16 countries.
  - http://www.clustercomp.org
- 2nd IEEE International Conference on Cluster Computing (Cluster 2000), Chemnitz, Germany, November 2000; 160 attendees anticipated.
  - http://www.tu-chemnitz.de/cluster2000
- 3rd IEEE International Conference on Cluster Computing (Cluster 2001), Newport Beach, California, October 8-11, 2001; 250-300 attendees expected.
  - http://andy.usc.edu/cluster2001
9. Associated Events - GRIDXY
- 1st IEEE/ACM International Workshop on Grid Computing (Grid2000), Bangalore, India, December 17, 2000 (attendees from 15 countries).
  - http://www.gridcomputing.org
- 2nd IEEE/ACM International Workshop on Grid Computing (Grid2001), at SC2001, November 2001.
10. Supercomputing
- Birds of a Feather sessions at SC99 and SC2000.
- The aims of the meetings are to gather interested parties and bring them up to date, and also to present a set of short talks and start a discussion on a variety of topics.
- There will probably be another at SC01, depending on community interest.
11. Other Activities
- Book donation program
- Cluster Computing Archive
  - www.ieeetfcc.org/ClusterArchive.html
- TopClusters Project
  - www.TopClusters.org
- TFCC Whitepaper
  - www.dcs.port.ac.uk/mab/tfcc/WhitePaper
- TFCC Newsletter
  - www.eg.bucknell.edu/hyde/tfcc
12. TopClusters Project
- http://www.TopClusters.org
- TFCC collaboration with the Top500 project.
- Numeric, I/O, Web, Database, and Application level benchmarking of clusters.
- Joint BOF with Top500 at SC2000 on cluster-based benchmarking.
- Ongoing effort.
13. TFCC Whitepaper
- A Whitepaper on Cluster Computing, submitted to the International Journal of High-Performance Applications and Supercomputing, November 2000.
- Snapshot of the state of the art of Cluster Computing.
- Preprint: www.dcs.port.ac.uk/mab/tfcc/WhitePaper/
14. TFCC Membership
- Over 300 registered members
- Free membership open to all, but a few benefits may be restricted (reduced registration fee for IEEE members)
- Over 450 on the TFCC mailing list <tfcc-l_at_bucknell.edu>
15. Future Plans
- We plan to submit an application to the IEEE CS Technical Activities Board (TAB) to attain full Technical Committee status.
- The TAB sees the TFCC as a success, and we hope that our application will be successful.
- If we achieve TC status, we will need the continuing help of the TFCC's current volunteers, and we will need to encourage a number of new ones.
16. Summary
- A successful conference series has been started, with commercial sponsorship.
- Promoting cluster-based technologies through TFCC sponsorship.
- Helping the community with our book donation program.
- Engendering debate and discussion through the mailing forum.
- Keeping the community informed with our information-rich TFCC Web site.
17. Scalable Clusters
- TopClusters.org list
  - 26 clusters with 128 nodes
  - 8 with 500 nodes
  - 34 with 64-127 nodes
  - Most run Linux
  - Most dedicated to applications
- Where are scalable tools developed and tested?
- Caveats
  - Does not include MPP-like systems (IBM SP, SGI Origin, Compaq, Intel TFLOPs, etc.)
  - Not a complete list
  - Only clusters explicitly contributed to TopClusters.org
18. What is Scalability?
- Most common definition in use
  - Works for n+1 nodes if it works for n, for small n
- Practical definition
  - Operations complete fast enough
    - 0.5 to 3 seconds for interactive use
  - Operations are reliable
    - Approach to scalability must not be fragile
19. Issues in Clusters and Scalability
- Developing and Testing Tools
  - Requires convenient access to a large-scale system
  - Can this co-exist with production computing?
- Too many different tools
  - Why not adopt the Unix philosophy?
- Example solution: Scalable Unix Tools
  - Following slides thanks to Rusty Lusk and Emil Ong
20. What Are the Scalable Unix Tools?
- Parallel versions of common Unix commands like ps, ls, cp, ..., with appropriate semantics
- A few new commands in the same spirit but without a serial counterpart
- Designed for users
- New this spring: release of a high-performance implementation based on MPI
- One of the original official Ptools projects
- Original definition published
  - Proceedings of the Scalable High Performance Computing Conference
  - http://www.mcs.anl.gov/gropp/papers/1994/shpcc-paper.ps
21. Motivation
- Basic Unix commands (ls, grep, find, ...) are quintessential tools.
  - Simple syntax and semantics (except maybe find syntax)
  - Have the same component interface (lines of text, stdin, stdout)
- Unix redirection (<, >, and especially |) allows tools to be easily combined into powerful command lines
- Old-fashioned: no GUI, little interactivity
22. Motivation, continued
- Many parallel machines have Unix and at least partially distinct file systems on each node.
- A user needs simple and familiar ways to
  - Copy a file to local file space on each node
  - Find all processes running on all nodes
  - Test for conditions on all nodes
  - Avoid getting swamped with output
- On large machines these commands are not useful unless they take advantage of parallelism in their execution.
23. Design Goals
- Familiar to Unix users
  - Similar names (we chose pt<Unix-name>)
  - Same arguments, similar semantics
- Interact well with traditional Unix commands, facilitating construction of powerful command lines
- Run at interactive speeds (requires scalability in parallel process manager startup and handling of I/O)
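
The link between interactive speed and scalable startup can be illustrated with a small calculation. The sketch below compares a front end that contacts every node in turn with a tree-style launch in which every process that is already running starts more processes in each round. The fan-out of 2 and the function names are illustrative assumptions; this is not a description of how MPD or the Scalable Unix Tools actually start processes.

```python
def sequential_steps(n_nodes: int) -> int:
    """Steps needed when a single front end starts each node one after another."""
    return n_nodes


def tree_rounds(n_nodes: int, fanout: int = 2) -> int:
    """Rounds needed when every running process starts `fanout` new ones per round.

    One process (the front end) is assumed to be running before round 1.
    """
    rounds = 0
    started = 1
    while started < n_nodes:
        started += started * fanout  # each running process launches `fanout` more
        rounds += 1
    return rounds


if __name__ == "__main__":
    for n in (16, 64, 241, 1024):
        print(n, sequential_steps(n), tree_rounds(n))
```

With a fixed cost per launch step, the sequential count grows linearly with the number of nodes, while the number of tree rounds grows only logarithmically, which is what keeps a command inside an interactive time budget on a large machine.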
24. Part I: Parallel Versions of Traditional Commands
- ptcp
- ptmv
- ptrm
- ptln
- ptmkdir
- ptrmdir
- ptchmod
- ptchgrp
- ptchown
- pttest[ao]
- Select nodes to run on by
  - -all
  - -m <file of hostnames>
  - -M <hostlist>
    - donner dasher blitzen
    - ccn%d@1-32,42,65-96
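
The host-list notation can be read as either a space-separated list of names or a printf-style pattern followed by `@` and comma-separated index ranges. The sketch below expands such a specification under that reading. The grammar is inferred from the two examples on this slide and is an assumption, and `expand_hostlist` is a hypothetical helper, not part of the tools.

```python
def expand_hostlist(spec: str) -> list[str]:
    """Expand a host list such as 'ccn%d@1-32,42,65-96' into host names.

    Assumed grammar: whitespace-separated tokens; a token containing '@' is
    '<printf pattern>@<ranges>', where ranges are 'a-b' or a single index.
    """
    hosts = []
    for token in spec.split():
        if "@" not in token:
            hosts.append(token)  # plain host name, used as-is
            continue
        pattern, ranges = token.split("@", 1)
        for part in ranges.split(","):
            low, _, high = part.partition("-")
            high = high or low  # a single index such as '42'
            for index in range(int(low), int(high) + 1):
                hosts.append(pattern % index)
    return hosts


if __name__ == "__main__":
    print(expand_hostlist("donner dasher blitzen"))
    print(len(expand_hostlist("ccn%d@1-32,42,65-96")))
```

Under this reading the second example names 65 hosts: ccn1 through ccn32, ccn42, and ccn65 through ccn96.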
25. Part II: Traditional Commands Producing Lots of Output
- ptcat, ptls, ptfind
- Have potential to produce lots of output, and the source is also of interest
- With the -h option: ptls -M node%d@1-3 -h
    node1
    myfile1
    node2
    node3
    myfile1
    myfile2
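
The effect of labelling output by source can be sketched as follows: lines arrive tagged with the host that produced them, possibly interleaved, and are regrouped under one header per host. This is only an illustration of the idea; `group_by_host` and `render` are hypothetical helpers and do not reflect how the tools implement the -h option.

```python
def group_by_host(tagged_lines):
    """Collect (host, line) pairs into a dict mapping host -> list of lines."""
    grouped = {}
    for host, line in tagged_lines:
        grouped.setdefault(host, []).append(line)
    return grouped


def render(hosts, grouped):
    """Print-ready text: each host name as a header, followed by its lines."""
    out = []
    for host in hosts:
        out.append(host)
        out.extend(grouped.get(host, []))  # a host with no output keeps its header
    return "\n".join(out)


if __name__ == "__main__":
    arrivals = [("node3", "myfile1"), ("node1", "myfile1"), ("node3", "myfile2")]
    print(render(["node1", "node2", "node3"], group_by_host(arrivals)))
```

A host that produces no lines still gets a header, so an empty directory on one node remains visible in the combined listing.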
26. Performance of ptcp
- Copying a single 10 MB file
  - to 241 nodes in 14 seconds
[Figures: Time to Copy 10MB File; Total Bandwidth]
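
As a worked example of the figures quoted on this slide, the aggregate rate at which data is delivered can be computed from the file size, node count, and elapsed time. Treating "total bandwidth" as bytes delivered to all nodes divided by elapsed time is an assumption about what the plot shows, and the function name is illustrative.

```python
def aggregate_bandwidth_mb_per_s(file_mb: float, nodes: int, seconds: float) -> float:
    """Total data delivered to all nodes, divided by the elapsed time."""
    return file_mb * nodes / seconds


if __name__ == "__main__":
    total = aggregate_bandwidth_mb_per_s(10, 241, 14)
    per_node = 10 / 14
    print(f"aggregate: {total:.1f} MB/s, per node: {per_node:.2f} MB/s")
```

For a 10 MB file, 241 nodes, and 14 seconds this gives roughly 172 MB/s in aggregate, or about 0.71 MB/s seen by each node.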
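27. Watching ptcp
    # Start the copy in the background, then poll every node for progress.
    ptcp -all bigfile $BIGFILE &
    X=1
    while true; do
      # On each node: report the host name and the percentage of blocks copied so far.
      ptexec -all 'echo "$(hostname) $(ls -s $BIGFILE | awk "{print \"percentage\" \$1/98 \" blue red\"}")"' | ptdisp -h
    done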
28. Percentage of Completion
29. Percentage of Completion
30. Availability
- Open source
- Get from http://www.mcs.anl.gov/sut
- All source, man pages
- Configure, make, on Linux, Solaris, Irix, AIX
- Needs MPI implementation with mpirun
- Developed with Linux, MPICH, MPD, on Chiba City at Argonne
31. Chiba City Scalability Testbed
- http://www-unix.mcs.anl.gov/chiba/
32. Some Other Efforts in Scalable Clusters
- Large Programs
  - DOE Scientific Discovery through Advanced Computing (SciDAC)
  - NSF Distributed Terascale Facility (DTF)
- OSCAR
  - Goal is a "cluster in a box" CD
- PVFS (Parallel Virtual File System)
- Many Smaller Efforts
  - www.beowulf.org, etc.
- Commercial Efforts
  - Scyld, etc.