CS184b: Computer Architecture (Abstractions and Optimizations)

1
CS184b: Computer Architecture (Abstractions and Optimizations)

Day 17: May 9, 2005
Defect and Fault Tolerance

2
Today

Defect and Fault Tolerance
Problem
Defect Tolerance
Fault Tolerance

3
Motivation: Probabilities

Given:
N objects
P = yield probability of each object
What's the probability of yield for a composite system of N items?
Assume i.i.d. faults
P(N items good) = P^N

4
Probabilities

P(N items good) = P^N
N = 10^6, P = 0.999999
P(all good) ≈ 0.37
N = 10^7, P = 0.999999
P(all good) ≈ 0.000045
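
The two numbers on this slide can be checked directly. The following is a minimal sketch, assuming independent failures as stated earlier; the function name `system_yield` is hypothetical and not from the lecture.

```python
def system_yield(p, n):
    """Probability that all n independent items are good, each good with probability p."""
    return p ** n

# Reproduce the two cases from the slide.
for n in (10**6, 10**7):
    print(f"N = {n:>8}, P = 0.999999 -> P(all good) = {system_yield(0.999999, n):.6f}")
```

Going from a million to ten million items drops the composite yield from roughly one in three to a few in a hundred thousand, even though the per-item yield is unchanged.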

5
Simple Implications

As N gets large
must either increase reliability
or start tolerating failures
N is the number of:
memory bits
disk sectors
wires
transmitted data bits
processors
transistors
molecules

As devices get smaller, failure rates increase
chemists think P = 0.95 is good

6
Defining Problems
7
Three problems

Manufacturing imperfection
Shorts, breaks
wire/node X shorted to power, ground, another node
Doping/resistance variation too high
Parameters vary over time
Electromigration
Resistance increases
Incorrect operation
node X value flips
crosstalk
alpha particle
bad timing

8
Defects

Shorts: example of a defect
Persistent problem
reliably manifests
Occurs before computation
Can test for at fabrication / boot time and then avoid
(1st half of lecture)

9
Faults

An alpha-particle bit flip is an example of a fault
Fault occurs dynamically during execution
At any point in time, can fail
(produce the wrong result)
(2nd half of lecture)

10
Lifetime Variation

Starts out fine
Over time changes
E.g. resistance increases until out of spec.
Persistent
So can use defect techniques to avoid
But, onset is dynamic
Must use fault detection techniques to recognize?

11
Shekhar Borkar, Intel Fellow, MICRO-37 (Dec. 2004)
12
Defect Rate

Device with 10^11 elements (100BT)
3 year lifetime ≈ 10^8 seconds
Accumulating up to 10% defects
10^10 defects in 10^8 seconds
1 defect every 10 ms
At 10 GHz operation
One new defect every 10^8 cycles
P_newdefect = 10^-19
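
The chain of arithmetic on this slide can be reproduced step by step. This is a sketch under the slide's own round numbers (10^11 elements, 10^8 seconds, 10% accumulated defects, 10 GHz); the variable names are illustrative only.

```python
elements = 10**11          # devices on the chip
lifetime_s = 10**8         # roughly 3 years, in seconds
defect_percent = 10        # share of elements allowed to go bad over the lifetime
clock_hz = 10 * 10**9      # 10 GHz

defects = elements * defect_percent // 100           # total defects over the lifetime
seconds_per_defect = lifetime_s / defects            # time between new defects
cycles_per_defect = seconds_per_defect * clock_hz    # clock cycles between new defects
p_new_defect = 1 / (cycles_per_defect * elements)    # per element, per cycle

print(defects, seconds_per_defect, cycles_per_defect, p_new_defect)
```

The last line is the per-element, per-cycle probability that a given element becomes defective, which is what the slide calls P_newdefect.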

13
First Step to Recover

Admit you have a problem
(observe that there is a failure)

14
Detection

Determine if something wrong?
Some things easy
...won't start
Others tricky
...one AND gate computes False AND True → True
Observability
can see effect of problem
some way of telling if defect/fault present

15
Detection

Coding
space of legal values < space of all values
should only see legal
e.g. parity, ECC (Error Correcting Codes)
Explicit test (defects, recurring faults)
ATPG = Automatic Test Pattern Generation
Signature/BIST = Built-In Self-Test
POST = Power On Self-Test
Direct/special access
test ports, scan paths
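
The coding idea (legal values are a subset of all values, so an illegal value signals an error) can be illustrated with a single even-parity bit. This is a minimal sketch; `add_parity` and `is_legal` are hypothetical names, and real ECC uses stronger codes than one parity bit.

```python
def add_parity(bits):
    """Append an even-parity bit, so every legal codeword has an even number of 1s."""
    return bits + [sum(bits) % 2]

def is_legal(codeword):
    """A codeword is legal only if its count of 1s is even."""
    return sum(codeword) % 2 == 0

word = add_parity([1, 0, 1, 1])

one_flip = word.copy()
one_flip[2] ^= 1            # a single bit flips: the word leaves the legal space

two_flips = word.copy()
two_flips[0] ^= 1
two_flips[1] ^= 1           # two flips land back in the legal space, undetected

print(is_legal(word), is_legal(one_flip), is_legal(two_flips))
```

Only half of all 5-bit values are legal here, which is why any single flip is observable, while a double flip is not.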

16
Coping with defects/faults?

Key idea: redundancy
Detection:
Use redundancy to detect error
Mitigating: use redundant hardware
Use spare elements in place of faulty elements (defects)
Compute multiple times so can discard faulty result (faults)
Exploit Law of Large Numbers

17
Defect Tolerance
18
Two Models

Disk Drives
Memory Chips

19
Disk Drives

Expose defects to software
software model expects faults
Create table of good (bad) sectors
manages by masking out in software
(at the OS level)
yielded capacity varies

20
Memory Chips

Provide model in hardware of perfect chip
Model of perfect memory at capacity X
Use redundancy in hardware to provide perfect model
Yielded capacity fixed
discard part if not achieved

21
Example: Memory

Correct memory
N slots
each slot reliably stores last value written
Millions, billions, etc. of bits
have to get them all right?

22
Memory Defect Tolerance

Idea:
few bits may fail
provide more raw bits
configure so yield what looks like a perfect memory of specified size

23
Memory Techniques

Row Redundancy
Column Redundancy
Block Redundancy

24
Row Redundancy

Provide extra rows
Mask faults by avoiding bad rows
Trick:
have address decoder substitute spare rows in for faulty rows
use fuses to program
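
The decoder substitution can be modeled in software. The following is a rough sketch, assuming the spare rows themselves are good and that the set of bad rows is known from testing; the class `RowRepairedMemory` and its `fuse_map` are hypothetical stand-ins for the programmed fuses, not a description of any real DRAM.

```python
class RowRepairedMemory:
    """Memory with spare rows; a fuse map redirects bad rows to spares."""

    def __init__(self, rows, spares, bad_rows):
        if len(bad_rows) > spares:
            raise ValueError("more bad rows than spares: part must be discarded")
        self.rows = rows
        # Physical storage: the regular rows followed by the spare rows.
        self.storage = [0] * (rows + spares)
        # "Fuses" programmed once at test time: bad logical row -> spare physical row.
        self.fuse_map = {bad: rows + i for i, bad in enumerate(sorted(bad_rows))}

    def _decode(self, row):
        """Address decoder: substitute a spare row if this row was marked bad."""
        if not 0 <= row < self.rows:
            raise IndexError(row)
        return self.fuse_map.get(row, row)

    def write(self, row, value):
        self.storage[self._decode(row)] = value

    def read(self, row):
        return self.storage[self._decode(row)]


mem = RowRepairedMemory(rows=8, spares=2, bad_rows={3, 5})
mem.write(3, 42)            # lands in a spare row, invisible to the user
print(mem.read(3))
```

From the outside the part still looks like a perfect 8-row memory, which is the fixed-capacity model described for memory chips.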

25
Spare Row
26
Column Redundancy

Provide extra columns
Program decoder/mux to use subset of columns

27
Spare Memory Column

Provide extra columns
Program output mux to avoid

28
Block Redundancy

Substitute out entire block
e.g. memory subarray
include 5 blocks
only need 4 to yield perfect
(N+1 sparing more typical for larger N)

29
Spare Block
30
Yield M of N

P(M of N) = P(yield N)
+ (N choose N-1) P(exactly N-1)
+ (N choose N-2) P(exactly N-2)
+ ...
+ (N choose M) P(exactly M)
where P(exactly K) = P^K (1-P)^(N-K) for one particular set of K good items
think binomial coefficients

31
M of 5 example

1·P^5 + 5·P^4(1-P)^1 + 10·P^3(1-P)^2 + 10·P^2(1-P)^3 + 5·P^1(1-P)^4 + 1·(1-P)^5
Consider P = 0.9
1·P^5 = 0.59            M=5  P(sys) = 0.59
5·P^4(1-P)^1 = 0.33     M=4  P(sys) = 0.92
10·P^3(1-P)^2 = 0.07    M=3  P(sys) = 0.99
10·P^2(1-P)^3 = 0.008
5·P^1(1-P)^4 = 0.00045
1·(1-P)^5 = 0.00001

Can achieve higher system yield than individual components!
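
The M-of-5 numbers follow from summing binomial terms. A short sketch, assuming independent items with equal yield; `yield_m_of_n` is a hypothetical helper name.

```python
from math import comb

def yield_m_of_n(m, n, p):
    """P(at least m of n independent items are good), each good with probability p."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(m, n + 1))

# System yield for the slide's example: 5 blocks provided, P = 0.9 each.
for m in (5, 4, 3):
    print(f"need {m} of 5: P(sys) = {yield_m_of_n(m, 5, 0.9):.2f}")
```

Needing only 4 of the 5 blocks already lifts the system yield above the 0.9 yield of a single block.
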
32
Repairable Area

Not all area in a RAM is repairable
memory bits spare-able
io, power, ground, control not redundant

33
Repairable Area

P(yield) = P(non-repair) × P(repair)
P(non-repair) = P^N
N << N_total
Maybe P > P_repair
e.g. use coarser feature size
P(repair) = P(yield M of N)
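
Combining the two factors gives the overall chip yield. This is a sketch only: the element counts and probabilities in the example call are made-up illustration values, and `chip_yield`, `p_fixed`, and `p_repair` are hypothetical names.

```python
from math import comb

def yield_m_of_n(m, n, p):
    """P(at least m of n independent items are good)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(m, n + 1))

def chip_yield(n_fixed, p_fixed, m, n, p_repair):
    """P(yield) = P(non-repair) * P(repair)."""
    p_non_repair = p_fixed ** n_fixed              # every non-redundant element must work
    p_repairable = yield_m_of_n(m, n, p_repair)    # need m good out of n repairable blocks
    return p_non_repair * p_repairable

# Illustrative numbers (assumptions): 100 non-repairable elements built more
# conservatively (P = 0.9999), plus 4-of-5 block sparing with P = 0.9 per block.
example = chip_yield(n_fixed=100, p_fixed=0.9999, m=4, n=5, p_repair=0.9)
print(f"{example:.3f}")
```

Because the non-repairable part enters as a plain product, its per-element yield has to be high, which is the motivation for the coarser feature size mentioned above.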

34
Consider a Crossbar

Allows me to connect any of N things to each other
E.g.
N processors
N memories
N/2 processors
N/2 memories

35
Crossbar Buses and Defects

Two crossbars
Wires may fail
Switches may fail
Provide more wires
Any wire fault avoidable
M choose N

38
Crossbar Buses and Faults

Two crossbars
Wires may fail
Switches may fail
Provide more wires
Any wire fault avoidable
M choose N
Same idea

39
Simple System

P Processors
M Memories
Wires

40
Simple System w/ Spares

P Processors
M Memories
Wires
Provide spare
Processors
Memories
Wires

41
Simple System w/ Defects

P Processors
M Memories
Wires
Provide spare
Processors
Memories
Wires
...and defects

42
Simple System Repaired

P Processors
M Memories
Wires
Provide spare
Processors
Memories
Wires
Use crossbar to switch together good processors
and memories

43
In Practice

Crossbars are inefficient [CS184A]
Use switching networks with
Locality
Segmentation
[CS184A]
...but basic idea for sparing is the same

44
Fault Tolerance
45
Faults

Bits, processors, wires
May fail during operation
Basic idea same
Detect failure using redundancy
Correct
Now
Must identify and correct online with the computation

46
Simple Memory Example

Problem: bits may lose/change value
Alpha particle
Molecule spontaneously switches
Idea:
Store multiple copies
Perform majority vote on result

47
Redundant Memory
48
Redundant Memory

Like M-choose-N
Only fail if > (N-1)/2 faults
P = 0.9
P(2 of 3)
All good: (0.9)^3 = 0.729
Any 2 good: 3(0.9)^2(0.1) = 0.243
Total = 0.972
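
Both halves of this slide, the vote itself and its success probability, fit in a few lines. A minimal sketch, assuming an odd number of independent copies of a single bit; the function names are hypothetical.

```python
from math import comb

def majority_bit(copies):
    """Majority vote over an odd number of stored copies of one bit."""
    return int(2 * sum(copies) > len(copies))

def p_majority_correct(n, p):
    """P(more than half of n independent copies are good), each good with probability p."""
    need = n // 2 + 1
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(need, n + 1))

print(majority_bit([1, 0, 1]))              # one corrupted copy is outvoted
print(f"{p_majority_correct(3, 0.9):.3f}")  # 0.729 + 0.243
```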

49
Better: Less Overhead

Don't have to keep N copies
Block data into groups
Add a small number of bits to detect/correct errors

52
Row/Column Parity

Think of NxN bit block as array
Compute row and column parities
(total of 2N bits)
Any single bit error
By recomputing parity
Know which one it is
Can correct it
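
The locate-and-correct step can be sketched as follows. This assumes the error is in a data bit and that the 2N stored parity bits are themselves intact; `parities` and `correct_single_error` are hypothetical names.

```python
def parities(block):
    """Row and column parities of a square bit array (2N bits for an NxN block)."""
    rows = [sum(r) % 2 for r in block]
    cols = [sum(c) % 2 for c in zip(*block)]
    return rows, cols

def correct_single_error(block, stored_rows, stored_cols):
    """Recompute parity, locate a single flipped data bit, and fix it in place.

    Returns the (row, col) that was corrected, or None if nothing was wrong.
    """
    rows, cols = parities(block)
    bad_rows = [i for i, (a, b) in enumerate(zip(rows, stored_rows)) if a != b]
    bad_cols = [j for j, (a, b) in enumerate(zip(cols, stored_cols)) if a != b]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        i, j = bad_rows[0], bad_cols[0]
        block[i][j] ^= 1        # the mismatching row and column intersect at the bad bit
        return (i, j)
    raise ValueError("not a single data-bit error")


original = [[1, 0, 1],
            [0, 1, 1],
            [1, 1, 0]]
row_par, col_par = parities(original)

damaged = [r[:] for r in original]
damaged[1][2] ^= 1              # inject one bit error
print(correct_single_error(damaged, row_par, col_par), damaged == original)
```

Exactly one row parity and one column parity disagree after a single flip, and their intersection names the bit to invert.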

53
In Use Today

Conventional DRAM Memory systems
Use 72b ECC (Error Correcting Code)
On 64b words
Correct any single bit error
Detect multibit errors
CD blocks are ECC coded
Correct errors in storage/reading
Learn more about ECC in EE127

54
Interconnect

Also uses checksums/ECC
Guard against data transmission errors
Environmental noise, crosstalk, trouble sampling data at high rates
Often just detect error
Recover by requesting retransmission
E.g. TCP/IP (Internet Protocols)

55
Interconnect

Also guards against whole path failure
Sender expects acknowledgement
If no acknowledgement will retransmit
If have multiple paths
and select well among them
Can route around any fault in interconnect

58
Interconnect Fault Example

Send message
Expect Acknowledgement
If Fail
No ack

60
Interconnect Fault Example

If fail → no ack
Retry
Preferably with different resource

Ack signals success
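
The send / expect-ack / retry loop can be sketched as below. This is a toy model: each path is a callable that returns True when an acknowledgement comes back, whereas a real network would detect a missing ack by timeout. `send_with_retry` and the example paths are hypothetical.

```python
def send_with_retry(message, paths, max_attempts=None):
    """Send over one path at a time until an acknowledgement arrives.

    paths: list of callables, path(message) -> bool (True means ack received).
    Returns the index of the path that succeeded.
    """
    attempts = len(paths) if max_attempts is None else max_attempts
    for attempt in range(attempts):
        index = attempt % len(paths)    # each retry uses a different resource
        if paths[index](message):
            return index                # ack signals success
    raise RuntimeError("no acknowledgement on any path")


def faulty_path(message):
    return False    # message lost or corrupted: no ack

def good_path(message):
    return True     # delivered and acknowledged

print(send_with_retry("hello", [faulty_path, good_path]))
```

Rotating through paths rather than retrying the same one matters when the failure is persistent: resending on the faulty path would fail again.
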
61
Transit Multipath

Butterfly (or Fat-Tree) networks with multiple paths
[CS184B Day 4]

62
Multiple Paths

Provide bandwidth
Minimize congestion
Provide redundancy to tolerate faults

63
Routers May Be Faulty (links may be faulty)

Dynamic
Corrupt data
Misroute
Send data nowhere

64
Multibutterfly Performance w/ Faults
65
Compute Elements

Simplest thing we can do
Compute redundantly
Vote on answer
Similar to redundant memory

66
Compute Elements

Unlike Memory
State of computation important
Once a processor makes an error
All subsequent results may be wrong
Response
reset processors which fail vote
Go to spare set to replace failing processor
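
Why state matters, and what resetting a failed voter buys, can be shown with a toy replicated computation. This is a sketch under simplifying assumptions: replicas run in lock step, the computation is a running sum, and a failing replica is repaired by copying the majority state. All names are hypothetical.

```python
from collections import Counter

def vote(states):
    """Return (majority state, indices of replicas that disagree with it)."""
    winner, count = Counter(states).most_common(1)[0]
    if 2 * count <= len(states):
        raise RuntimeError("no majority: too many replicas failed at once")
    return winner, [i for i, s in enumerate(states) if s != winner]

def run_replicated(inputs, n_replicas=3, fault=None):
    """Sum `inputs` on n_replicas lock-step replicas.

    fault is an optional (step, replica) pair: that replica's state is
    corrupted at that step to model a transient error.
    Returns (final voted state, list of (step, replica) resets).
    """
    states = [0] * n_replicas
    resets = []
    for step, x in enumerate(inputs):
        states = [s + x for s in states]     # every replica does the same work
        if fault is not None and fault[0] == step:
            states[fault[1]] += 1000         # inject the error
        winner, losers = vote(states)
        for i in losers:
            states[i] = winner               # reset the replica that failed the vote
            resets.append((step, i))
    return states[0], resets

print(run_replicated([1, 2, 3, 4], fault=(1, 2)))
```

Without the reset, the corrupted replica would carry its bad running sum into every later step and could no longer contribute a useful vote.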

67
In Use

NASA Space Shuttle
Uses set of 4 voting processors
Boeing 777
Uses voting processors
(different architectures, code)

68
Forward Recovery

Can take this voting idea to gate level
von Neumann 1956
Basic gate is a majority gate
Example: 3-input voter
Number of technical details
High level bit:
Requires P_gate > 0.996
Can make whole system as reliable as individual gate

69
Majority Multiplexing
Maybe there's a better way... next time.
[Roy and Beiu, IEEE Nano 2004]
70
Rollback Recovery

Commit state of computation at key points to memory (ECC, RAID protected...)
reduce to previously solved problem
On faults (lifetime defects)
recover state from last checkpoint
like going to last backup (snapshot)
analysis next time
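
Checkpoint and rollback can be sketched as a loop that snapshots state periodically and restarts from the snapshot when a fault is detected. This is an illustrative model only: `detect_fault` stands in for whatever detection mechanism is available, the checkpoint is assumed to live in reliable memory, and all names and parameters are hypothetical.

```python
import copy

def run_with_rollback(steps, state, detect_fault, checkpoint_every=2, max_retries=3):
    """Run `steps` in order, rolling back to the last checkpoint on a detected fault.

    steps: list of functions state -> new state
    detect_fault: function (step index, state) -> bool
    """
    checkpoint = (0, copy.deepcopy(state))       # committed (step index, state)
    i = 0
    retries = 0
    while i < len(steps):
        state = steps[i](state)
        if detect_fault(i, state):
            retries += 1
            if retries > max_retries:
                raise RuntimeError("fault persists after rollback")
            i, state = checkpoint[0], copy.deepcopy(checkpoint[1])   # go to last backup
            continue
        i += 1
        if i % checkpoint_every == 0:
            checkpoint = (i, copy.deepcopy(state))   # commit state at a key point
    return state


fired = []

def transient_fault(i, state):
    """Report a fault the first time step 2 runs, then never again."""
    if i == 2 and not fired:
        fired.append(i)
        return True
    return False

increment = lambda s: s + 1
print(run_with_rollback([increment] * 5, 0, transient_fault))
```

Work done since the last checkpoint is repeated after a rollback, so the checkpoint interval trades snapshot overhead against the amount of recomputation per fault.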

71
Defect vs. Fault Tolerance

Defect
Can tolerate large defect rates (10%)
Use virtually all good components
Small overhead beyond faulty components
Fault
Require lower fault rate (e.g. VN < 0.4%)
Overhead to do so can be quite large

72
Summary

Possible to engineer practical, reliable systems from
Imperfect fabrication processes (defects)
Unreliable elements (faults)
We do it today for large scale systems
Memories (DRAMs, Hard Disks, CDs)
Internet
...and critical systems
Space ships, Airplanes
Engineering Questions
Where to invest area/effort?
Higher yielding components? Tolerating faulty components?
Where do we invoke law of large numbers?
Above/below the device level

73
Big Ideas

Left to itself
reliability of system << reliability of parts
Can design
system reliability >> reliability of parts (defects)
system reliability ≈ reliability of parts (faults)
For large systems
must engineer reliability of system
all systems becoming large

74
Big Ideas

Detect failures
static: directed test
dynamic: use redundancy to guard
Repair with Redundancy
Model
establish and provide model of correctness
perfect model part (memory model)
visible defects in model (disk drive model)
