Clusters, Fault Tolerance, and Other Thoughts

1
Clusters, Fault Tolerance, and Other Thoughts
  • Daniel S. Katz
  • JPL/Caltech
  • SOS7 Meeting
  • 4 March 2003

2
Cluster 2002 (http://www.mcs.anl.gov/cluster2002/)
  • 2002 IEEE International Conference on Cluster
    Computing, Chicago, 23-26 Sep. 2002
  • Next 2 meetings are
    • December 2003 in Hong Kong
    • September 2004 in San Diego
  • Of the 284 attendees at Cluster 2002 and 120 at
    SOS7, 23 are common to both meetings
  • Motivation
    • The series of conferences and their sponsor, the
      Task Force for Cluster Computing (TFCC), were
      created to
      • Bring together the cluster community
      • Establish best practices
      • Provide educational material
      • Cross-fertilize ideas between industry and
        academia

3
Cluster 2002 Topics
  • Running a cluster and making it usable
    • Software for management, including configuration
    • Middleware software
  • Building a cluster
    • Software and hardware for networking
    • Choosing node hardware
    • Packaging hardware
  • Making use of a cluster
    • New and innovative applications

4
Cluster 2002 Results and Conclusions
  • Positives
    • Software tools are getting better - management,
      configuration and administration
    • Interesting and promising work ongoing in
      • Self-tuning software
      • Component redundancy
      • Applications
    • Clusters are enabling platforms due to low entry
      cost
  • Negatives
    • Large (possibly heterogeneous) systems are not
      easy to build or maintain
    • Systems administration is normally underestimated
      and un(der)funded
    • Component failure in large systems can be a
      problem
  • Other
    • Clusters are good for work for which we know they
      are good
    • Minimum cost clusters can handle some jobs well
    • Should design and build cluster to suit
      application needs

5
FALSE 2002 (http://false2002.vanderbilt.edu/)
  • Workshop on Fault-Adaptive Large-Scale Real-Time
    Systems
  • Held at Vanderbilt, 14-15 Nov. 2002
  • Sponsored by NSF ITR Project BTeV Real Time
    Embedded Systems
  • Of the 42 attendees at FALSE 2002 and 120
    attendees at SOS7, 2 are common to both meetings
    (Tony Skjellum and I)
  • Motivation
    • High Energy Physics community wants to build
      systems to monitor experiments
    • Others (DARPA, NASA) have an interest in similar
      systems
    • An occasion to share knowledge and plan future
      research
  • Topics
    • Scaling fault tolerance up to large systems (the
      Fermi system will have 2-5K PEs)
    • Novel approaches to achieving fault tolerance at
      low cost (< 10% overhead)
    • How to make fault responses domain-specific
      (tools that enable the user to specify the
      response to different failures, and to implement
      these responses throughout the system)
  • Results/Consensus
    • No results from this initial meeting, just
      information sharing (w/ complete consensus)
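
Illustration (not from the workshop): the "domain-specific fault responses" topic above could look roughly like the sketch below, in which a user registers a response for each kind of failure and the system looks the response up when a fault is reported. All names here (Fault, ResponsePolicy, the fault kinds, the node labels) are hypothetical and chosen only for this example; no tool discussed at FALSE 2002 is being reproduced.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class Fault:
    """A reported failure. Both fields are hypothetical labels."""
    kind: str   # fault category, e.g. "node_down"
    node: str   # processing element that reported or suffered the fault


# A handler takes a fault and returns a description of the action taken.
Handler = Callable[[Fault], str]


class ResponsePolicy:
    """User-specified mapping from fault kind to response."""

    def __init__(self, default: Handler) -> None:
        self._handlers: Dict[str, Handler] = {}
        self._default = default

    def on(self, kind: str, handler: Handler) -> None:
        """Register the response to use for one kind of fault."""
        self._handlers[kind] = handler

    def respond(self, fault: Fault) -> str:
        """Apply the registered response, or the default if none exists."""
        handler = self._handlers.get(fault.kind, self._default)
        return handler(fault)


# Assumed example policy: the responses are placeholders, not real recovery code.
policy = ResponsePolicy(default=lambda f: f"log {f.kind} on {f.node}")
policy.on("node_down", lambda f: f"reassign work from {f.node}")
policy.on("link_error", lambda f: f"reroute traffic around {f.node}")

actions = [
    policy.respond(Fault("node_down", "pe-17")),
    policy.respond(Fault("overheat", "pe-03")),  # no handler: falls to default
]
```

The same policy object could be handed to every node, which is one simple way to read "implement these responses throughout the system"; a real system would also need fault detection and distribution of the policy, which this sketch leaves out.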

6
General Thoughts
  • Fault-Tolerance is becoming important to
    large-scale systems
    • Embedded and non-embedded systems
    • Real-time and non-real-time systems
  • Is there a common solution (or partial solution)
    to this issue?
  • There is no software problem an additional layer
    of abstraction won't solve

7
Thanks
  • Questions?