Clusters, Fault Tolerance, and Other Thoughts

About This Presentation

Slides: 8

Provided by: daniel132

Learn more at: https://www.cs.sandia.gov

Tags: clusters | fault | fermi | thoughts | tolerance

Transcript and Presenter's Notes

1
Clusters, Fault Tolerance, and Other Thoughts

2
Cluster 2002 http://www.mcs.anl.gov/cluster2002/

2002 IEEE International Conference on Cluster
Computing, Chicago, 23-26 Sep. 2002
Next 2 meetings are:
December 2003 in Hong Kong
September 2004 in San Diego
Of the 284 attendees at Cluster 2002 and 120 at
SOS7, 23 are common to both meetings
Motivation
The series of conferences and their sponsor, the
Task Force for Cluster Computing (TFCC), were
created to
Bring together the cluster community
Establish best practices
Provide educational material
Cross-fertilize ideas between industry and
academia

3
Cluster 2002 Topics

4
Cluster 2002 Results and Conclusions

Positives
Software tools are getting better - management,
configuration and administration
Interesting and promising work ongoing in
Self-tuning software
Component redundancy
Applications
Clusters are enabling platforms due to low entry
cost
Negatives
Large (possibly heterogeneous) systems are not
easy to build or maintain
Systems administration is normally underestimated
and un(der)funded
Component failure in large systems can be a
problem
Other
Clusters are good for work for which we know they
are good
Minimum cost clusters can handle some jobs well
Should design and build cluster to suit
application needs
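
The point about component failure in large systems can be made concrete with a little arithmetic. The sketch below is illustrative only and is not taken from the slides: it assumes components fail independently, each with the same per-interval failure probability. The value of `P_FAIL` is a hypothetical placeholder, not a measured figure.

```python
def prob_any_failure(n_components: int, p_fail: float) -> float:
    """Probability that at least one of n independent components fails.

    Assumes independent failures with identical probability p_fail,
    which is a simplification of real cluster behavior.
    """
    return 1.0 - (1.0 - p_fail) ** n_components


# Hypothetical per-component failure probability over some fixed interval.
P_FAIL = 0.001

for n in (10, 100, 1000, 5000):
    p = prob_any_failure(n, P_FAIL)
    print(f"{n:>5} components: P(at least one failure) = {p:.3f}")
```

Under these assumptions the chance of seeing at least one failure grows quickly with component count, which is one way to read the slide's warning.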

5
FALSE 2002 http://false2002.vanderbilt.edu/

Workshop on Fault-Adaptive Large-Scale Real-Time
Systems
Held at Vanderbilt, 14-15 Nov. 2002
Sponsored by NSF ITR Project BTeV Real Time
Embedded Systems
Of the 42 attendees at FALSE 2002 and 120
attendees at SOS7, 2 are common to both meetings
(Tony Skjellum and I)
Motivation
High Energy Physics community wants to build
systems to monitor experiments
Others (DARPA, NASA) have an interest in similar
systems
An occasion to share knowledge and plan future
research
Topics
Scaling fault tolerance up to large systems (the
Fermi system will have 2-5K PEs)
Novel approaches to achieving fault tolerance at
low cost (< 10% overhead)
How to make fault responses domain-specific
(tools that enable the user to specify the
response to different failures, and to implement
these responses throughout the system)
Results/Consensus
No results from this initial meeting, just
information sharing (w/ complete consensus)
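
One of the workshop topics was making fault responses domain-specific, with tools that let the user specify the response to each kind of failure. As a rough illustration of that idea, and not a description of any tool presented at FALSE 2002, a user-supplied policy table might look like the sketch below. The class `FaultPolicy`, the failure kinds, and the handler names are all hypothetical.

```python
from typing import Callable, Dict

# A handler takes the name of the affected node and returns a
# description of the action taken. All names here are hypothetical.
Handler = Callable[[str], str]


class FaultPolicy:
    """Maps failure kinds to user-chosen responses, with a fallback."""

    def __init__(self, default: Handler) -> None:
        self._handlers: Dict[str, Handler] = {}
        self._default = default

    def on(self, failure_kind: str, handler: Handler) -> None:
        """Register the response for one kind of failure."""
        self._handlers[failure_kind] = handler

    def respond(self, failure_kind: str, node: str) -> str:
        """Run the registered response, or the default if none exists."""
        handler = self._handlers.get(failure_kind, self._default)
        return handler(node)


def restart_node(node: str) -> str:
    return f"restart {node}"


def reroute_work(node: str) -> str:
    return f"reroute work away from {node}"


def log_only(node: str) -> str:
    return f"log failure on {node}"


policy = FaultPolicy(default=log_only)
policy.on("process_crash", restart_node)
policy.on("link_down", reroute_work)

print(policy.respond("link_down", "pe-0042"))
print(policy.respond("disk_warning", "pe-0007"))  # falls back to the default
```

Keeping the policy in a table separate from the detection code is one possible way to let the same response rules be applied throughout a system; a real design would also need to distribute and enforce the policy across nodes.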