Redundant Multithreading Techniques for Transient Fault Detection
1
Redundant Multithreading Techniques for Transient Fault Detection
  • Shubu Mukherjee (Intel)
  • Michael Kontz (HP, current)
  • Steve Reinhardt (Intel consultant; U. of Michigan)

Versions of this work were presented at ISCA 2000 and ISCA 2002.
2
Transient Faults from Cosmic Rays & Alpha Particles
  • Decreasing feature size
  • Decreasing voltage (exponential dependence?)
  • Increasing number of transistors (Moore's Law)
  • Increasing system size (number of processors)
  • No practical absorbent for cosmic rays

3
Fault Detection via Lockstepping (HP Himalaya)
Replicated microprocessors with cycle-by-cycle lockstepping
4
Fault Detection via Simultaneous Multithreading
[Diagram: two redundant threads each execute R1 ← (R2); input replication feeds both threads and output comparison checks their results before they leave the processor. Outside the processor, memory is covered by ECC, the RAID array by parity, and ServerNet by CRC.]
5
Simultaneous Multithreading (SMT)
Examples: Alpha 21464, Intel Northwood
6
Redundant Multithreading (RMT)
RMT = Multithreading + Fault Detection

                              Multithreading (MT)                 Redundant Multithreading (RMT)
  Multithreaded uniprocessor  Simultaneous Multithreading (SMT)   Simultaneous Redundant Threading (SRT)
  Chip Multiprocessor (CMP)   Multiple threads running on a CMP   Chip-Level Redundant Threading (CRT)
7
Outline
  • SRT concepts & design
  • Preferential Space Redundancy
  • SRT Performance Analysis
  • Single- & multi-threaded workloads
  • Chip-level Redundant Threading (CRT)
  • Concept
  • Performance analysis
  • Summary
  • Current & Future Work

8
Overview
  • SRT = SMT + Fault Detection
  • Advantages
  • Piggyback on an SMT processor with little extra hardware
  • Better performance than complete replication
  • Lower cost due to market volume of SMT & SRT
  • Challenges
  • Lockstepping very difficult with SRT
  • Must carefully fetch/schedule instructions from redundant threads

9
Sphere of Replication
[Diagram: the leading thread and trailing thread run inside the sphere of replication; input replication and output comparison sit at the sphere boundary; the memory system (incl. L1 caches) lies outside the sphere.]
  • Two copies of each architecturally visible thread
  • Co-scheduled on SMT core
  • Compare results; signal fault if different (see the comparison sketch below)

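The boundary rule can be stated in a few lines. A minimal Python sketch, not part of the original design (function and names are hypothetical), showing that a value leaves the sphere only after the two redundant copies agree:

  # Hypothetical sketch of output comparison at the sphere-of-replication boundary:
  # a value is released to the memory system only if both redundant copies agree.
  def compare_and_release(leading_output, trailing_output):
      if leading_output != trailing_output:
          raise RuntimeError("transient fault detected: redundant outputs differ")
      return leading_output  # copies are identical; either one may leave the sphere

  # Example: both copies of a store (value 5 to address 0x120) match and are released.
  assert compare_and_release(("st", 0x120, 5), ("st", 0x120, 5)) == ("st", 0x120, 5)
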
10
Basic Pipeline
[Pipeline diagram: Fetch → Decode → Dispatch → Execute → Data Cache → Commit]
11
  • Forwards leading-thread load values to the trailing thread (sketched below)
  • Keeps threads on the same path despite I/O or MP writes
  • Out-of-order load issue possible

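A minimal Python sketch of the LVQ idea (names hypothetical, assuming in-order consumption of loads by the trailing thread): the leading thread records each load's address and value, and the trailing thread reuses the queued value instead of re-reading the cache, so both threads see identical data even if memory changes in between.

  from collections import deque

  class LoadValueQueue:
      # Hypothetical model: leading thread pushes (address, value); trailing thread
      # pops the entry and reuses the value rather than accessing the cache again.
      def __init__(self):
          self.q = deque()

      def leading_load(self, address, memory):
          value = memory[address]          # leading thread reads the data cache
          self.q.append((address, value))  # forward the load to the trailing thread
          return value

      def trailing_load(self, address):
          queued_address, value = self.q.popleft()
          if queued_address != address:    # address mismatch is a detected fault
              raise RuntimeError("transient fault detected on load address")
          return value                     # no second cache access

  memory = {0x200: 7}
  lvq = LoadValueQueue()
  assert lvq.leading_load(0x200, memory) == 7
  memory[0x200] = 9                        # e.g. an I/O or MP write in between
  assert lvq.trailing_load(0x200) == 7     # trailing thread still sees the old value
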
12
Store Queue Comparator (STQ)
  • Compares leading- and trailing-thread store outputs before they reach the data cache
  • Catches faults before they propagate to the rest of the system

13
Store Queue Comparator (contd)
[Diagram: leading- and trailing-thread store queues; matching entries (e.g. "st 5 → 0x120") have their address and data compared before being sent to the data cache.]
  • Extends residence time of leading-thread stores
  • Size constrained by cycle-time goal
  • Base CPU statically partitions a single queue among threads
  • Potential solution: per-thread store queues
  • Deadlock if a matching trailing store cannot commit
  • Several small but crucial changes to avoid this (comparison sketched below)

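A minimal Python sketch of the comparison step (names hypothetical; the real hardware matches entries inside the store queue rather than in a separate structure): a leading-thread store waits until the matching trailing-thread store arrives, and only a verified (address, data) pair is written to the data cache.

  from collections import deque

  class StoreQueueComparator:
      # Hypothetical model: leading stores wait for their trailing counterparts;
      # only verified (address, data) pairs are released to the data cache.
      def __init__(self, data_cache):
          self.pending = deque()           # leading-thread stores awaiting a match
          self.data_cache = data_cache

      def leading_store(self, address, data):
          self.pending.append((address, data))   # extends the store's residence time

      def trailing_store(self, address, data):
          lead_addr, lead_data = self.pending.popleft()
          if (lead_addr, lead_data) != (address, data):
              raise RuntimeError("transient fault detected on store")
          self.data_cache[address] = data        # fault-free: send to the data cache

  cache = {}
  stq = StoreQueueComparator(cache)
  stq.leading_store(0x120, 5)
  stq.trailing_store(0x120, 5)
  assert cache[0x120] == 5
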
14
Branch Outcome Queue (BOQ)
  • Forwards leading-thread branch targets to the trailing thread's fetch stage (sketched below)
  • 100% prediction accuracy in the absence of faults

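A minimal Python sketch of the BOQ (names hypothetical): resolved branch outcomes committed by the leading thread are queued and consumed by the trailing thread's fetch stage as its "predictions", which is why the trailing thread never mispredicts in the absence of faults.

  from collections import deque

  class BranchOutcomeQueue:
      # Hypothetical model: leading-thread commit pushes resolved branch outcomes;
      # trailing-thread fetch pops them and uses them in place of a predictor.
      def __init__(self):
          self.q = deque()

      def leading_commit_branch(self, taken, target):
          self.q.append((taken, target))   # a resolved outcome, not a guess

      def trailing_predict_branch(self, fallthrough_pc):
          taken, target = self.q.popleft() # 100% accurate absent faults
          return target if taken else fallthrough_pc

  boq = BranchOutcomeQueue()
  boq.leading_commit_branch(taken=True, target=0x280)
  assert boq.trailing_predict_branch(fallthrough_pc=0x204) == 0x280
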
15
Line Prediction Queue (LPQ)
  • Alpha 21464 fetches chunks using line predictions
  • Chunk = contiguous block of 8 instructions

16
Line Prediction Queue (contd)
  • Generate a stream of chunked line predictions for the trailing thread
  • Every leading-thread instruction carries its I-cache coordinates
  • Commit logic merges them into fetch chunks for the LPQ (see the sketch below)
  • Independent of leading-thread fetch chunks
  • Commit-to-fetch dependence raised deadlock issues

Example committed stream: 1F8 add; 1FC load R1 ← (R2); 200 beq 280; 204 and; 208 bne 200; 200 add
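
A minimal Python sketch of the chunk-merging step (names hypothetical; assumes fixed 4-byte instructions and aligned 8-instruction chunks): consecutive committed PCs that stay sequential within one aligned chunk are merged into a single line prediction, while a taken branch or a chunk boundary starts a new one.

  CHUNK_INSNS = 8                           # a chunk is 8 contiguous instructions
  INSN_BYTES = 4                            # assumed fixed-size instructions
  CHUNK_BYTES = CHUNK_INSNS * INSN_BYTES

  def merge_into_chunks(committed_pcs):
      # Merge the leading thread's committed PCs into aligned fetch-chunk
      # predictions for the LPQ (hypothetical sketch).
      chunks = []                           # list of (chunk base, next expected PC)
      for pc in committed_pcs:
          base = pc - (pc % CHUNK_BYTES)    # aligned chunk containing this PC
          sequential = chunks and pc == chunks[-1][1]
          same_chunk = chunks and base == chunks[-1][0]
          if sequential and same_chunk:
              chunks[-1] = (base, pc + INSN_BYTES)   # extend the current chunk
          else:
              chunks.append((base, pc + INSN_BYTES)) # boundary or branch: new chunk
      return [hex(base) for base, _ in chunks]

  # Committed stream from the example above (beq not taken, bne taken back to 200):
  print(merge_into_chunks([0x1F8, 0x1FC, 0x200, 0x204, 0x208, 0x200]))
  # ['0x1e0', '0x200', '0x200']  -- the taken bne starts a fresh chunk at 0x200
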
17
Line Prediction Queue (contd)
  • Read-out on trailing-thread fetch is also complex
  • Base CPU's thread chooser gets multiple line predictions but ignores all but one
  • Fetches must be retried on an I-cache miss
  • Tricky to keep the queue in sync with thread progress
  • Add a handshake to advance the queue head (sketched below)
  • Roll back the head on an I-cache miss
  • Track both the last attempted & last successful chunks

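A minimal Python sketch of the handshake described above (names hypothetical): the queue head advances only when fetch reports success, and an I-cache miss rolls the head back so the same chunk is attempted again.

  class LinePredictionQueue:
      # Hypothetical model of LPQ read-out: track both the last attempted and the
      # last successfully fetched chunk so a missed fetch can be retried.
      def __init__(self, chunks):
          self.chunks = chunks
          self.attempted = 0                # index of the next chunk to attempt
          self.successful = 0               # index just past the last successful fetch

      def next_prediction(self):
          chunk = self.chunks[self.attempted]
          self.attempted += 1               # tentatively advance the head
          return chunk

      def fetch_done(self, hit):
          if hit:
              self.successful = self.attempted   # handshake: commit the advance
          else:
              self.attempted = self.successful   # I-cache miss: roll back and retry

  lpq = LinePredictionQueue([0x1E0, 0x200, 0x200])
  assert lpq.next_prediction() == 0x1E0
  lpq.fetch_done(hit=False)                 # miss: the same chunk is retried
  assert lpq.next_prediction() == 0x1E0
  lpq.fetch_done(hit=True)
  assert lpq.next_prediction() == 0x200
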
18
Outline
  • SRT concepts & design
  • Preferential Space Redundancy
  • SRT Performance Analysis
  • Single- & multi-threaded workloads
  • Chip-level Redundant Threading (CRT)
  • Concept
  • Performance analysis
  • Summary
  • Current & Future Work

19
Preferential Space Redundancy
  • SRT combines two types of redundancy
  • Time: same physical resource, different time
  • Space: different physical resource
  • Space redundancy preferable
  • Better coverage of permanent/long-duration faults
  • Bias towards space redundancy where possible

20
PSR Example: Clustered Execution
[Diagram: an "add r1,r2,r3" instruction flows from Fetch through Decode and Dispatch into one of two clusters (IQ 0 / Exec 0 or IQ 1 / Exec 1) and on to Commit; the LPQ feeds Fetch.]
  • Base CPU has two execution clusters
  • Separate instruction queues, function units
  • Steered in dispatch stage

21
PSR Example: Clustered Execution
[Diagram: leading-thread "add r1,r2,r3" instructions commit with their cluster bit (0) recorded; the tagged fetch chunks enter the LPQ and flow through Fetch, Decode, and Dispatch for the trailing thread.]
  • Leading thread instructions record their cluster
  • Bit carried with fetch chunk through LPQ
  • Attached to trailing-thread instruction
  • Dispatch sends to opposite cluster if possible

22
PSR Example: Clustered Execution
[Diagram: trailing-thread "add r1,r2,r3" instructions tagged with cluster bit 0 are dispatched to the opposite cluster (IQ 1 / Exec 1).]
  • 99.94% of instruction pairs use different clusters (steering rule sketched below)
  • Full spatial redundancy for execution
  • No performance impact (occasional slight gain)
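
A minimal Python sketch of the steering rule (names hypothetical): each trailing-thread instruction carries the cluster its leading copy used, and dispatch prefers the opposite cluster so the two copies execute on different physical function units, falling back to time redundancy only when the preferred cluster has no room.

  def steer_trailing(leading_cluster, cluster_has_room):
      # Hypothetical PSR dispatch rule for a two-cluster machine.
      preferred = 1 - leading_cluster       # opposite of the leading copy's cluster
      if cluster_has_room[preferred]:
          return preferred                  # space redundancy (preferred)
      return leading_cluster                # fall back to time redundancy

  # Leading "add r1,r2,r3" ran on cluster 0 and both clusters have room:
  assert steer_trailing(leading_cluster=0, cluster_has_room=[True, True]) == 1
  # If cluster 1 is full, the trailing copy falls back to cluster 0:
  assert steer_trailing(leading_cluster=0, cluster_has_room=[True, False]) == 0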

23
Outline
  • SRT concepts & design
  • Preferential Space Redundancy
  • SRT Performance Analysis
  • Single- & multi-threaded workloads
  • Chip-level Redundant Threading (CRT)
  • Concept
  • Performance analysis
  • Summary
  • Current & Future Work

24
SRT Evaluation
  • Used SPEC CPU95, 15M instrs/thread
  • Constrained by simulation environment
  • → 120M instrs for 4 redundant thread pairs
  • Eight-issue, four-context SMT CPU
  • 128-entry instruction queue
  • 64-entry load and store queues
  • Default: statically partitioned among active threads
  • 22-stage pipeline
  • 64KB 2-way assoc. L1 caches
  • 3MB 8-way assoc. L2

25
SRT Performance: One Thread
  • One logical thread → two hardware contexts
  • Performance degradation: 30%
  • Per-thread store queue buys an extra 4%

26
SRT Performance: Two Threads
  • Two logical threads → four hardware contexts
  • Average slowdown increases to 40%
  • Only 32% with per-thread store queues

27
Outline
  • SRT concepts & design
  • Preferential Space Redundancy
  • SRT Performance Analysis
  • Single- & multi-threaded workloads
  • Chip-level Redundant Threading (CRT)
  • Concept
  • Performance analysis
  • Summary
  • Current & Future Work

28
Chip-Level Redundant Threading
  • SRT typically more efficient than splitting one processor into two half-size CPUs
  • What if you already have two CPUs?
  • e.g. IBM Power4, HP PA-8800 (Mako)
  • Conceptually easy to run these in lockstep
  • Benefit: full physical redundancy
  • Costs:
  • Latency through centralized checker logic
  • Overheads (misspeculation etc.) incurred twice
  • CRT combines the best of SRT & lockstepping (see the sketch below)
  • Requires multithreaded CMP cores

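A minimal Python sketch of the cross-core arrangement (names hypothetical): on a two-core CMP, each core runs the leading copy of one logical thread and the trailing (checking) copy of the other, so every comparison involves two physically separate processors.

  def crt_schedule(logical_threads, cores=("CPU A", "CPU B")):
      # Hypothetical CRT mapping: core i runs the leading copy of thread i and the
      # trailing copy of the other thread, giving cross-core redundancy.
      schedule = {core: [] for core in cores}
      for i, thread in enumerate(logical_threads):
          leading_core = cores[i % len(cores)]
          trailing_core = cores[(i + 1) % len(cores)]
          schedule[leading_core].append(("leading", thread))
          schedule[trailing_core].append(("trailing", thread))
      return schedule

  print(crt_schedule(["Thread A", "Thread B"]))
  # {'CPU A': [('leading', 'Thread A'), ('trailing', 'Thread B')],
  #  'CPU B': [('trailing', 'Thread A'), ('leading', 'Thread B')]}
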
29
Chip-Level Redundant Threading
[Diagram: two cores, CPU A and CPU B; each core's leading thread forwards its load values (LVQ), line predictions (LPQ), and stores to the trailing copy of that thread running on the other core, so all checking crosses the two processors.]
30
CRT Performance
  • With per-thread store queues, 13% improvement over lockstepping with 8-cycle checker latency

31
Summary & Conclusions
  • SRT is applicable in a real-world SMT design
  • ~30% slowdown, slightly worse with two threads
  • Store queue capacity can limit performance
  • Preferential space redundancy improves coverage
  • Chip-Level Redundant Threading = SRT for CMPs
  • Looser synchronization than lockstepping
  • Frees up resources for other application threads

32
More Information
  • Publications
  • S.K. Reinhardt and S.S. Mukherjee, "Transient Fault Detection via Simultaneous Multithreading," International Symposium on Computer Architecture (ISCA), 2000
  • S.S. Mukherjee, M. Kontz, and S.K. Reinhardt, "Detailed Design and Evaluation of Redundant Multithreading Alternatives," International Symposium on Computer Architecture (ISCA), 2002
  • Papers available from:
  • http://www.cs.wisc.edu/~shubu
  • http://www.eecs.umich.edu/~stever
  • Patents
  • Compaq/HP filed eight patent applications on SRT