IBM S/390 Parallel Enterprise Server G5 fault tolerance: A historical perspective


Provided by: maho

Learn more at: http://people.ee.duke.edu

1
IBM S/390 Parallel Enterprise Server G5 fault tolerance:
A historical perspective

by
L. Spainhower and T.A. Gregg
Presented by
Mahmut Yilmaz

2
Some Terms

Concurrent error detection and repair: The system
finds errors and repairs itself while still running
In-line error checking: EDC, ECC
On-line error correction: Correct errors while the
system continues to operate
Transient (soft) faults: Temporary faults or bit
flips, such as Single Event Upsets
Hard faults: Persistent faults that remain active
for a significant period of time (forever?)

3
Background

S/390 failure modes
Permanent, intermittent and transient faults
If an error occurs frequently and reaches a
threshold -> treated as permanent
Thermal Conduction Module (TCM)
TCM: A liquid cooling method introduced by IBM
A series of spring loaded cylinders conduct the
heat from chips to the cooling chamber
Circuit growth rates exceed reliability gains
Parity check and ECC were used
Circuits were encapsulated
System repair required all system resources
Most repairs were concurrent
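
A minimal sketch of the threshold rule on this slide (an error that recurs often enough is treated as permanent), assuming a simple per-component counter. The class name, component label and threshold value are hypothetical and not taken from the paper; a real machine would likely also bound the count by a time window.

```python
from collections import Counter


class FaultClassifier:
    """Counts errors per component and flags one as permanent at a threshold."""

    def __init__(self, threshold=3):
        self.threshold = threshold  # hypothetical value, for illustration only
        self.counts = Counter()

    def report_error(self, component):
        """Record one error and return the current classification."""
        self.counts[component] += 1
        if self.counts[component] >= self.threshold:
            return "permanent"
        return "transient"


clf = FaultClassifier(threshold=3)
results = [clf.report_error("cpu0") for _ in range(3)]
print(results)  # ['transient', 'transient', 'permanent']
```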

4
Background (cont.)

CMOS
G1 (1994) to G5
G1: Less reliable than 9020
System failures are more probable
G2: Dynamic memory sparing
G3: More robust ECC; CPU sparing (manual
replacement)
G4: Concurrent CPU sparing; CPU instruction-level
retry
G5: Most reliable
Greatly exceeds any TCM-based system
Well protected against soft faults (hard faults?)

5
Microprocessor Fault Tolerant Design

Duplication is used by several systems
Intel, Himalaya systems
Duplication requires more than 100% hardware
overhead
Error detection only!
Fetch-decode (I-Unit) and execute (E-Unit) are
generally not protected
S/390 protects them
Transient fault rates are increasing with
decreased feature sizes

6
Microprocessor Fault Tolerant Design (cont.)

G5 Fault Tolerant Design Point
9X2: Main goal is to keep CPI low
G5: Main goal is to keep clock period short
In-line error protection is not suitable for G5
High fan-out/fan-in
Increased chip area
Longer wires
Increased path length
Result: Duplicated I-unit and E-unit
A checker similar to the DIVA checker: the R-unit
Total hardware overhead: 35%
No performance penalty (?)
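
The duplicate-and-compare idea on this slide can be sketched as follows. This is a toy software model, not the G5 hardware: the function names, the 32-bit width and the fault_mask parameter (used to inject a bit flip into one copy) are assumptions made for illustration, and the comparison step only stands in for the checking the slide attributes to the R-unit.

```python
def execute(op, a, b, fault_mask=0):
    """One copy of a toy 32-bit execution unit; fault_mask flips result bits."""
    if op == "add":
        result = (a + b) & 0xFFFFFFFF
    elif op == "xor":
        result = a ^ b
    else:
        raise ValueError(f"unknown op: {op}")
    return result ^ fault_mask


def checked_execute(op, a, b, fault_in_copy_b=0):
    """Run both copies and compare; return None on mismatch (nothing committed)."""
    r_a = execute(op, a, b)
    r_b = execute(op, a, b, fault_mask=fault_in_copy_b)
    if r_a != r_b:
        return None  # error detected: the result must not update machine state
    return r_a


print(checked_execute("add", 2, 3))                          # 5
print(checked_execute("add", 2, 3, fault_in_copy_b=1 << 4))  # None
```

Note that comparison only detects a disagreement; it cannot tell which copy is wrong, which is why the design pairs it with retry from a known-good state.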

7
Microprocessor Fault Tolerant Design (cont.)

G5 Fault Tolerant Design Point (cont.)
Recovery and on-line repair -> R-unit
L1: Store-through cache
L2: Shared memory
Line sparing
Upon error detection: if retry is not successful
-> CPU stopped
Dynamic CPU sparing (DCS)
Faulty CPU R-unit -> Spare CPU R-unit
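
The recovery flow on this slide (retry, stop the CPU if the retry fails, continue on a spare) can be sketched as below. It is a simplified model under stated assumptions: execute_on is a hypothetical callback returning (value, ok), the CPU labels are made up, and the transfer of checkpointed state from the faulty CPU's R-unit to the spare is implied rather than modeled.

```python
def run_with_recovery(execute_on, active, spares, max_retries=1):
    """Return (value, cpu_used). execute_on(cpu) returns (value, ok)."""
    cpu = active
    spares = list(spares)
    while True:
        # First attempt plus up to max_retries retries on the same CPU.
        for _ in range(1 + max_retries):
            value, ok = execute_on(cpu)
            if ok:
                return value, cpu
        # Retry did not succeed: this CPU is stopped and a spare takes over.
        if not spares:
            raise RuntimeError("no spare CPU available")
        cpu = spares.pop(0)


def hard_fault_on_cpu0(cpu):
    # Hypothetical fault model: cpu0 always fails, every other CPU is healthy.
    return 42, cpu != "cpu0"


print(run_with_recovery(hard_fault_on_cpu0, "cpu0", ["cpu1"]))  # (42, 'cpu1')
```

A transient fault would fail the first attempt and pass the retry, so the work stays on the original CPU and no spare is consumed.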

8
Memory Fault Tolerance

ECC
Permanent fault in L1 -> Cache line or quarter
cache delete
Permanent fault in L2 -> Cache delete
Data array or address directory marked as invalid
Spare lines
L3: Main memory
Background scrubbing
On-line repair: Built-in spare chips
Word line or chip kill -> After reaching
threshold, replace module
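
To make ECC and background scrubbing concrete, here is a toy sketch using a Hamming(7,4) single-error-correcting code. This code is an illustrative stand-in, not the code used in S/390 memory, and all function names are hypothetical. Scrubbing walks memory and rewrites corrected words so that soft errors do not accumulate into multi-bit errors the code cannot correct.

```python
def encode(nibble):
    """Encode 4 data bits as a 7-bit Hamming codeword (list index 0 = position 1)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8  # index 0 unused; positions 1..7
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]  # parity over positions 1,3,5,7
    code[2] = code[3] ^ code[6] ^ code[7]  # parity over positions 2,3,6,7
    code[4] = code[5] ^ code[6] ^ code[7]  # parity over positions 4,5,6,7
    return code[1:]


def syndrome(word):
    """XOR of the positions of all set bits; 0 means no detectable error."""
    s = 0
    for pos, bit in enumerate(word, start=1):
        if bit:
            s ^= pos
    return s


def correct(word):
    """Return (corrected_word, syndrome); a nonzero syndrome names the bad bit."""
    s = syndrome(word)
    fixed = list(word)
    if s:
        fixed[s - 1] ^= 1
    return fixed, s


def decode(word):
    """Extract the 4 data bits from positions 3, 5, 6 and 7."""
    return word[2] | (word[4] << 1) | (word[5] << 2) | (word[6] << 3)


def scrub(memory):
    """Walk memory, write back corrected words, return the number of fixes."""
    corrected = 0
    for addr, word in enumerate(memory):
        fixed, s = correct(word)
        if s:
            memory[addr] = fixed
            corrected += 1
    return corrected


memory = [encode(n) for n in range(16)]
memory[5][3] ^= 1  # inject a single-bit soft error
print(scrub(memory))  # 1
```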

9
I/O and Power/Cooling Subsystem Fault Tolerance

Multiple paths -> Path redundancy
Power/Cooling subsystems

10
Questions

Is duplication the optimal choice? No protection
against hard faults!
How to protect a CPU against intermittent faults
(delay faults)? Generally, they are the beginning
phase of a hard fault
How to protect an ALU by parity check? An adder?
(page 868, 1st paragraph)
If the retry is unsuccessful, the CPU is stopped.
Wouldn't it be better to use a counter to
account for transient faults? What if a transient
fault occurs while retrying?
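
On the adder question above, one well-known technique is parity prediction: since each sum bit is a_i XOR b_i XOR c_i (c_i being the carry into bit i), the parity of the sum equals parity(a) XOR parity(b) XOR parity(carries). The sketch below illustrates that identity only; it is not claimed to be what the paper describes, and the function names, width and fault_mask parameter are assumptions.

```python
def parity(x):
    """Parity (0 or 1) of the set bits in x."""
    return bin(x).count("1") & 1


def carry_vector(a, b, width):
    """Bit i of the result is the carry into position i of a + b."""
    carries, c = 0, 0
    for i in range(width):
        carries |= c << i
        ai, bi = (a >> i) & 1, (b >> i) & 1
        c = (ai & bi) | (c & (ai ^ bi))
    return carries


def add_with_parity_check(a, b, width=32, fault_mask=0):
    """Return (sum, ok); fault_mask injects bit flips into the sum output."""
    mask = (1 << width) - 1
    a &= mask
    b &= mask
    total = ((a + b) & mask) ^ fault_mask
    predicted = parity(a) ^ parity(b) ^ parity(carry_vector(a, b, width))
    return total, predicted == parity(total)


print(add_with_parity_check(5, 7))                # (12, True)
print(add_with_parity_check(5, 7, fault_mask=1))  # (13, False)
```

This catches any odd number of flipped sum bits. A fault inside the carry chain can disturb several sum bits at once, so the carry logic needs separate checking in a real design.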