Testing and Evaluation - PowerPoint PPT Presentation

Provided by: gatew358
1
Testing and Evaluation
  • Greg Lindahl
  • was University of Virginia,
  • now HPTi

2
My Qualifications
  • assisted Wall Street clustering (130 machines)
  • built Centurion I and II (300 nodes) at UVa
  • 276 node Forecast Systems Lab cluster
  • myrinet, fiberchannel, complex software
  • every one is different
  • we (community) would like to do 6,000 nodes

3
Motivations for Test and Eval
  • Roll your own: installation testing
  • Acceptance testing: trust but verify
  • Uh, does it run as fast as it's supposed to?
  • We'll pretend that "It doesn't work, what the
    heck do we do now?" never happens

4
Classes of Machine
  • Boring: 300 nodes
  • occasional failures OK, as long as the user job
    eventually runs
  • Exciting: 6000 nodes
  • Much more hardware to be subtly flaky on you
  • user's job may never finish due to a series of
    errors
  • system software more likely to flake out
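
To put rough numbers on the difference between the two classes, here is a minimal sketch. It assumes node failures are independent and exponentially distributed, and the per-node MTBF of 100,000 hours and the 24-hour job length are illustrative assumptions, not figures from the talk.

```python
import math

def job_survival_probability(nodes, job_hours, node_mtbf_hours):
    """Probability that no node fails while the job runs.

    Assumes independent, exponentially distributed node failures with the
    same (hypothetical) MTBF on every node.
    """
    return math.exp(-nodes * job_hours / node_mtbf_hours)

if __name__ == "__main__":
    # 100,000 h MTBF and a 24 h job are made-up values for illustration.
    for nodes in (300, 6000):
        p = job_survival_probability(nodes, job_hours=24, node_mtbf_hours=100_000)
        print(f"{nodes:5d} nodes: P(24 h job sees no node failure) = {p:.3f}")
```

Under these assumptions a job spanning the whole 300-node machine usually finishes, while a job spanning 6000 nodes usually does not, which is the "may never finish" case above.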

5
Classes of Failure
  • Infant mortality
  • Burn in, you're done (ha)
  • Systematic errors (broken network adapter, 10
    bad cables)
  • capability test can catch
  • Software
  • Weird failures
  • Compaq shipped me 276 bad power supplies at FSL;
    only statistics pointed the finger at them
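
One way statistics can "point the finger" is to compare observed failure counts per component type against an expected baseline. The sketch below is an assumption about how such a check might look, not the method used at FSL: it models counts as Poisson, and the component names, baseline rates, and threshold `alpha` are all hypothetical.

```python
import math
from collections import Counter

def poisson_tail(k, mean):
    """P(X >= k) for X ~ Poisson(mean)."""
    if k <= 0:
        return 1.0
    cdf = sum(math.exp(-mean) * mean**i / math.factorial(i) for i in range(k))
    return max(0.0, 1.0 - cdf)

def suspicious_components(failures, expected_counts, alpha=0.001):
    """Flag component types that failed far more often than expected.

    failures: iterable of (node, component) records from the failure log.
    expected_counts: hypothetical baseline failures per component type
    over the same period.
    """
    counts = Counter(component for _node, component in failures)
    flagged = {}
    for component, count in counts.items():
        mean = expected_counts.get(component)
        if mean is None:
            continue  # no baseline, so nothing to compare against
        p = poisson_tail(count, mean)
        if p < alpha:
            flagged[component] = (count, p)
    return flagged

if __name__ == "__main__":
    # Made-up failure log: 12 power supply failures, 3 disk failures.
    log = [(f"node{i:03d}", "power_supply") for i in range(12)]
    log += [(f"node{i:03d}", "disk") for i in range(3)]
    baseline = {"power_supply": 2.0, "disk": 2.5}
    for component, (count, p) in suspicious_components(log, baseline).items():
        print(f"{component}: {count} failures, tail probability {p:.2e}")
```

No single failure looks unusual on its own; only the aggregate count per component type stands out.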

6
How to test
  • I use the same suite for both installation
    testing and acceptance testing
  • Simulated Use Testing is King
  • apps, I/O, both capability and capacity jobs, job
    mix
  • Add in apps from other disciplines to stress
    machine in unusual ways
  • NAS Parallel Benchmarks (NAS PB)
  • Big Pile of Benchmarks (Susan and I)

7
Terascale is different
  • Occasional failures that are no big deal on small
    machine are fatal to large capability jobs
  • My 300-node reliability daemon has too many false
    positives for a terascale machine!
  • Probability of un-isolated problems increases
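
The false-positive problem is simple arithmetic: the same per-check error rate produces proportionally more alarms as the node count grows. A minimal sketch, assuming one check per node every five minutes and a per-check false-positive rate of 1 in 10,000 (both made-up values):

```python
def expected_false_alarms(nodes, checks_per_node_per_day, false_positive_rate):
    """Expected number of false alarms per day across the whole machine."""
    return nodes * checks_per_node_per_day * false_positive_rate

if __name__ == "__main__":
    checks_per_day = 24 * 60 // 5  # one check every 5 minutes (assumption)
    for nodes in (300, 6000):
        alarms = expected_false_alarms(nodes, checks_per_day, 1e-4)
        print(f"{nodes:5d} nodes: about {alarms:.1f} false alarms per day")
```

A rate an operator can tolerate at 300 nodes becomes twenty times larger at 6000 nodes, so the per-check rate has to drop as the machine grows.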

8
RAS features needed: Small and TeraScale
  • Reliability: monitoring of networks, daemons,
    logfiles
  • Today, not all relevant info comes out (soft
    memory errors, IDE drive retries)
  • Some straightforward development needed (myrinet
    has SNMP now, etc)
  • symptoms can be subtle: bad fibre caused an apparent
    software problem on an SGI, but now we know...
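
Logfile monitoring of the kind listed above can start as a simple pattern scan that tallies symptoms per host. The sketch below is illustrative only: the regular expressions and sample log lines are hypothetical, and real message formats depend on the kernel and drivers in use.

```python
import re
from collections import Counter

# Hypothetical patterns; adjust to the messages your systems actually emit.
PATTERNS = {
    "soft_memory_error": re.compile(r"corrected (ECC|memory) error", re.IGNORECASE),
    "ide_retry": re.compile(r"\bhd[a-z]: .*(retry|error)", re.IGNORECASE),
}

def scan_log(lines):
    """Count symptom matches per (host, symptom).

    Assumes syslog-style lines: 'Mon DD HH:MM:SS host message...'.
    """
    counts = Counter()
    for line in lines:
        fields = line.split(None, 4)
        if len(fields) < 5:
            continue  # not a syslog-style line
        host, message = fields[3], fields[4]
        for symptom, pattern in PATTERNS.items():
            if pattern.search(message):
                counts[(host, symptom)] += 1
    return counts

if __name__ == "__main__":
    # Made-up sample lines for illustration.
    sample = [
        "Jan  5 12:00:01 node017 kernel: hda: dma_intr: error=0x40",
        "Jan  5 12:03:44 node017 kernel: hda: read retry on sector 1234",
        "Jan  5 12:05:10 node042 kernel: corrected ECC error on bank 2",
        "Jan  5 12:06:00 node042 sshd: session opened",
    ]
    for (host, symptom), n in sorted(scan_log(sample).items()):
        print(f"{host} {symptom} {n}")
```

Per-host tallies like these are the raw input for spotting the subtle, slowly accumulating symptoms mentioned above.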

9
RAS Continued
  • Higher Availability
  • Reasonably fault tolerant: a compute node failure only
    takes out 1 job, doesn't require operator
    intervention
  • Serviceability
  • cluster partition (test new software release)
  • rolling upgrade (nodes upgraded as available)
  • checkpoint allows much better access to nodes
    without discomfiting users
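
The rolling upgrade idea ("nodes upgraded as available") can be sketched as a repeated pass over the nodes still waiting for the new release. The `is_idle` and `upgrade` callbacks below are hypothetical stand-ins for whatever the scheduler and installer actually provide.

```python
def rolling_upgrade_pass(pending, is_idle, upgrade):
    """Upgrade every pending node that is currently idle.

    Returns the nodes that were busy and still need the upgrade, so the
    caller can retry them on a later pass.
    """
    still_pending = []
    for node in pending:
        if is_idle(node):
            upgrade(node)
        else:
            still_pending.append(node)
    return still_pending

if __name__ == "__main__":
    # Simulated cluster: two nodes are busy during the first pass.
    busy = {"node002", "node004"}
    upgraded = []
    pending = [f"node{i:03d}" for i in range(1, 6)]

    pending = rolling_upgrade_pass(pending, lambda n: n not in busy, upgraded.append)
    print("after pass 1, still pending:", pending)

    busy.clear()  # the running jobs finish
    pending = rolling_upgrade_pass(pending, lambda n: n not in busy, upgraded.append)
    print("after pass 2, still pending:", pending)
```

No running job is interrupted; busy nodes are simply picked up on a later pass once their jobs finish.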

10
Summary
  • Lots of raw materials exist to attack the testing
    and evaluation problem
  • The methodology exists, too
  • Even if you buy a complete system, you still need
    to know how to write the acceptance test
  • Open source RAS has a long way to go
  • but there is a small system solution today

11
Gong This Slide
[Chart: MM5 Performance (t3a), in Gigaflops]