IBM S/390 Parallel Enterprise Server G5 fault tolerance: A historical perspective

1
IBM S/390 Parallel Enterprise Server G5 fault
tolerance: A historical perspective
  • by L. Spainhower and T. A. Gregg
  • Presented by Mahmut Yilmaz

2
Some Terms
  • Concurrent error detection and repair: The system
    finds errors and repairs itself while still running
  • In-line error checking: EDC, ECC
  • On-line error correction: Correct errors while the
    system can still operate
  • Transient (soft) faults: Temporary faults or bit
    flips, such as Single Event Upsets
  • Hard faults: Persistent faults that remain active
    for a significant period of time (forever?)

3
Background
  • S/390 failure modes
  • Permanent, intermittent and transient faults
  • If an error occurs frequently and reaches a
    threshold → permanent
  • Thermal Conduction Module (TCM)
  • TCM: A liquid cooling method introduced by IBM.
    A series of spring-loaded cylinders conducts the
    heat from the chips to the cooling chamber
  • Circuit growth rates exceed reliability gains
  • Parity check and ECC were used
  • Circuits were encapsulated
  • System repair required all system resources
  • Most repairs were concurrent
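
The threshold rule on this slide (an error that recurs often enough is treated as permanent) can be sketched as a per-component error counter. This is only an illustration: the class name, the component labels and the threshold of 3 are assumptions, and a real system would also take the time between errors into account.

```python
from collections import Counter


class FaultClassifier:
    """Hypothetical helper: counts errors per component and applies a threshold."""

    def __init__(self, threshold=3):
        self.threshold = threshold      # assumed value, not taken from the slides
        self.counts = Counter()

    def record_error(self, component):
        """Register one detected error and return the updated classification."""
        self.counts[component] += 1
        return self.classify(component)

    def classify(self, component):
        n = self.counts[component]
        if n == 0:
            return "healthy"
        if n >= self.threshold:
            return "permanent"          # threshold reached: treat as a hard fault
        return "transient-or-intermittent"


classifier = FaultClassifier(threshold=3)
for _ in range(3):
    status = classifier.record_error("memory-chip-7")
print(status)
```

Below the threshold the component stays in service and the error is handled as transient; once the threshold is reached, repair or sparing would be triggered.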

4
Background (cont.)
  • CMOS
  • G1 (1994) to G5
  • G1: Less reliable than 9020
  • System failures are more probable
  • G2: Dynamic memory sparing
  • G3: More robust ECC, CPU sparing (manual
    replacement)
  • G4: Concurrent CPU sparing, CPU
    instruction-level retry
  • G5: Most reliable
  • Greatly exceeds any TCM
  • Well protected against soft faults (hard faults?)

5
Microprocessor Fault Tolerant Design
  • Duplication is used by several systems
  • Intel, Himalaya systems
  • Duplication requires more than 100% hardware
    overhead
  • Error detection only!
  • Fetch-decode (I-Unit) and execute (E-Unit) are
    generally not protected
  • S/390 protects them
  • Transient fault rates are increasing with
    decreased feature sizes
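
Why plain duplication gives "error detection only" can be seen in a small sketch: two copies compute the same result and a comparator checks that they agree. A mismatch shows that something went wrong, but not which copy is right. The function name and the fault-injection parameter are hypothetical.

```python
import operator


def run_duplicated(op, a, b, faulty_copy=None):
    """Run op(a, b) on two copies and compare the results.

    faulty_copy (0 or 1) is a hypothetical fault-injection hook that flips
    the low bit of one copy's result to mimic a transient fault.
    Returns (results_agree, result_of_copy_0).
    """
    results = [op(a, b), op(a, b)]
    if faulty_copy is not None:
        results[faulty_copy] ^= 1
    return results[0] == results[1], results[0]


print(run_duplicated(operator.add, 2, 3))                  # copies agree
print(run_duplicated(operator.add, 2, 3, faulty_copy=1))   # mismatch detected
```

With only two copies there is no majority to vote with, so recovery needs something more, such as a saved checkpoint to retry from.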

6
Microprocessor Fault Tolerant Design (cont.)
  • G5 Fault Tolerant Design Point
  • 9X2: Main goal is to keep CPI low
  • G5: Main goal is to keep clock period short
  • In-line error protection is not suitable for G5
  • High fan-out/fan-in
  • Increased chip area
  • Longer wires
  • Increased path length
  • Result: Duplicated I-unit and E-unit
  • A checker like the DIVA checker: R-unit
  • Total hardware overhead: 35%
  • No performance penalty (?)

7
Microprocessor Fault Tolerant Design (cont.)
  • G5 Fault Tolerant Design Point (cont.)
  • Recovery and on-line repair → R-unit
  • L1: Store-through cache
  • L2: Shared memory
  • Line sparing
  • Upon error detection: If retry is not successful
    → CPU stopped
  • Dynamic CPU sparing (DCS)
  • Faulty CPU R-unit → Spare CPU R-unit
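
The recovery flow on this slide (detect, retry from the checkpoint, stop the CPU if the retry fails, move the checkpointed state to a spare) can be sketched as a toy model. Everything here is an assumption made for illustration: the class and function names, the single integer standing in for architected state, and the way faults are injected. It is not the G5 mechanism itself.

```python
class Cpu:
    """Toy CPU with duplicated execution and a checkpoint (the 'R-unit' role)."""

    def __init__(self, name, hard_fault=False, soft_faults=0):
        self.name = name
        self.hard_fault = hard_fault    # every execution miscompares
        self.soft_faults = soft_faults  # number of one-off miscompares left
        self.checkpoint = 0             # last known-good state

    def step(self, increment):
        """Run one 'instruction' on both copies; commit only if they agree."""
        copy_a = self.checkpoint + increment
        copy_b = self.checkpoint + increment
        if self.hard_fault:
            copy_b ^= 1
        elif self.soft_faults > 0:
            self.soft_faults -= 1
            copy_b ^= 1
        if copy_a != copy_b:
            return False                # nothing committed, checkpoint intact
        self.checkpoint = copy_a
        return True


def run(program, cpu, spares):
    """Execute program; retry once per error, then switch to a spare CPU."""
    spares = list(spares)
    log = []
    for inc in program:
        while True:
            if cpu.step(inc):
                break
            log.append(f"{cpu.name}: error detected, retrying")
            if cpu.step(inc):
                break
            if not spares:
                raise RuntimeError("no spare CPU left")
            spare = spares.pop(0)
            spare.checkpoint = cpu.checkpoint   # faulty R-unit -> spare R-unit
            log.append(f"{cpu.name}: stopped, checkpoint moved to {spare.name}")
            cpu = spare
    return cpu, log


final, log = run([1, 2, 3], Cpu("cpu0", hard_fault=True), [Cpu("spare0")])
print(final.name, final.checkpoint, log)
```

A transient fault is absorbed by the single retry, while a hard fault fails the retry as well and triggers the switch to the spare.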

8
Memory Fault Tolerance
  • ECC
  • Permanent fault in L1 → Cache line or quarter
    cache delete
  • Permanent fault in L2 → Cache delete
  • Data array or address directory marked as invalid
  • Spare lines
  • L3: Main memory
  • Background scrubbing
  • On-line repair: Built-in spare chips
  • Word line or chip kill → After reaching
    threshold, replace module
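
ECC combined with background scrubbing can be illustrated with a small single-error-correcting code. The sketch below uses a Hamming(7,4) code as a stand-in; the slides do not say which code the G5 memory uses, and the function names are hypothetical. Scrubbing walks through memory and repairs single-bit errors before a second error can land in the same word.

```python
DATA_POSITIONS = (3, 5, 6, 7)   # positions 1, 2 and 4 hold the parity bits


def syndrome(word):
    """XOR of the 1-based positions of all set bits; 0 for a valid codeword."""
    s = 0
    for pos in range(1, 8):
        if word[pos]:
            s ^= pos
    return s


def encode(nibble):
    """Encode a 4-bit value as a list of 8 bits (index 0 is unused)."""
    word = [0] * 8
    for i, pos in enumerate(DATA_POSITIONS):
        word[pos] = (nibble >> i) & 1
    s = syndrome(word)
    for p in (1, 2, 4):
        word[p] = 1 if s & p else 0     # choose parity bits so the syndrome is 0
    return word


def correct(word):
    """Fix a single-bit error in place; return True if a repair was made."""
    s = syndrome(word)
    if s:
        word[s] ^= 1                    # the syndrome is the failing position
    return s != 0


def decode(word):
    return sum(word[pos] << i for i, pos in enumerate(DATA_POSITIONS))


def scrub(memory):
    """Background scrubbing: visit every word and count the repairs."""
    return sum(correct(word) for word in memory)


memory = [encode(n) for n in range(16)]
memory[5][6] ^= 1                       # inject two independent soft errors
memory[9][1] ^= 1
repaired = scrub(memory)
print(repaired, [decode(w) for w in memory] == list(range(16)))
```

A code of this kind corrects only one flipped bit per word, which is why periodic scrubbing matters and why persistent failures are handled by sparing instead.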

9
I/O and Power/Cooling Subsystem Fault Tolerance
  • Multiple paths → Path redundancy
  • Power/Cooling subsystems
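
Path redundancy amounts to trying an alternate route when one fails. A minimal sketch, assuming each path is a callable that raises `OSError` on failure (the function and path names are hypothetical):

```python
def send_with_failover(request, paths):
    """Try each I/O path in order and return the first successful response."""
    failures = []
    for path in paths:
        try:
            return path(request)
        except OSError as exc:
            failures.append(exc)        # remember the failure, try the next path
    raise OSError(f"all {len(failures)} paths failed")


def broken_path(request):
    raise OSError("channel down")


def working_path(request):
    return f"ok: {request}"


print(send_with_failover("read block 42", [broken_path, working_path]))
```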

10
Questions
  • Is duplication the optimal choice? No protection
    against hard faults!
  • How to protect a CPU against intermittent faults
    (delay faults)? Generally, they are the beginning
    phase of a hard fault
  • How to protect an ALU by parity check? An adder?
    (page 868, 1st paragraph)
  • If the retry is unsuccessful, the CPU is stopped.
    Would it not be better to use a counter to
    account for transient faults? What if a transient
    fault occurs while retrying?
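
On the adder question, one well-known approach is parity prediction. Each sum bit is s_i = a_i XOR b_i XOR c_i, where c_i is the carry into bit i, so the parity of the sum equals parity(a) XOR parity(b) XOR parity(carries). The sketch below checks an adder result against that prediction. The function names and the fault-injection parameter are hypothetical, operands are assumed to fit in `width` bits, and the slides do not say that the paper uses this scheme.

```python
def parity(x):
    """Return 1 if x has an odd number of set bits, else 0."""
    return bin(x).count("1") & 1


def carries_into_each_bit(a, b, width):
    """Bit i of the result is the carry into bit position i of a + b."""
    carries, carry = 0, 0
    for i in range(width):
        carries |= carry << i
        ai, bi = (a >> i) & 1, (b >> i) & 1
        carry = (ai & bi) | (carry & (ai ^ bi))
    return carries


def checked_add(a, b, width=8, inject_fault_bit=None):
    """Add modulo 2**width and verify the sum against the predicted parity."""
    total = (a + b) & ((1 << width) - 1)
    if inject_fault_bit is not None:
        total ^= 1 << inject_fault_bit  # mimic a single-bit error in the sum
    predicted = parity(a) ^ parity(b) ^ parity(carries_into_each_bit(a, b, width))
    return total, parity(total) == predicted


print(checked_add(100, 57))
print(checked_add(100, 57, inject_fault_bit=3))
```

A single flipped sum bit always breaks the predicted parity. An error inside the carry chain can corrupt several sum bits at once, so a hardware version of this idea has to derive the carries used for the prediction separately from those used for the sum.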