Dynamic Verification of End-to-End Multiprocessor Invariants - PowerPoint PPT Presentation

(C) 2003 Daniel Sorin. Duke Architecture. Dynamic Verification of ... Proposal: Dynamic verification of invariants. Online checking of end-to-end system invariants ... – PowerPoint PPT presentation

Slides: 27
Provided by: daniel83

Transcript and Presenter's Notes

1
Dynamic Verification of End-to-End
Multiprocessor Invariants
  • Daniel J. Sorin (1), Mark D. Hill (2), David A. Wood (2)
  • (1) Department of Electrical and Computer Engineering
  • Duke University
  • (2) Computer Sciences Department
  • University of Wisconsin-Madison

2
My Talk in One Slide
  • Commercial server availability is important
  • System model: Symmetric Multiprocessor (SMP)
  • Fault model: Mostly transient, some permanent
  • Recent work developed efficient
    checkpoint/recovery
  • But we can only recover from hardware errors we
    detect
  • Many hardware errors are hard to detect
  • Proposal: Dynamic verification of invariants
  • Online checking of end-to-end system invariants
  • Checking performed with distributed signature
    analysis
  • Triggers recovery if invariant is violated

3
Outline
  • Background
  • SMPs and availability
  • Existing hardware error detection
  • Invariant checking with distributed signature
    analysis
  • Two invariant checkers
  • Evaluation
  • Conclusions

4
Symmetric Multiprocessor (SMP)
System Model
Cache Coherence Transaction
[Figure: cache block transition from state I to state M: issue request, wait for response, receive response]

5
Symmetric Multiprocessor (SMP)
System Model
Cache Coherence Transaction
[Figure: cache block transition from state I to state M: issue request, wait for response, receive response]
  • Broadcast request not delivered to subset of
    nodes
  • Broadcast requests delivered out of order to
    subset of nodes

6
Symmetric Multiprocessor (SMP)
System Model
Cache Coherence Transaction
[Figure: transition from state I to state M over times t1, t2, t3; labels: issue request, request arrives, response arrives]
  • More chances for incorrect state transitions

7
Backward Error Recovery
  • Can improve availability with backward error
    recovery
  • If error detected, then recover to pre-fault
    state
  • Backward error recovery (BER) requires
  • Checkpoint/recovery mechanism
  • Error detection mechanisms

8
SafetyNet Checkpoint/Recovery
  • SafetyNet: all-hardware scheme [ISCA 2002]
  • Periodically take logical checkpoint of
    multiprocessor
  • MP state: processor registers, caches, memory
  • Incrementally log changes to caches and memory
  • Consistent checkpointing performed in logical
    time
  • E.g., every 3000 broadcast cache coherence
    requests
  • Can tolerate >100,000 cycles of error detection
    latency

[Figure: timeline of checkpoints CP 1 through CP 4 along a time axis; regions labeled validated execution, pending validation (still detecting errors), and active execution]

9
Error Detection
  • Error model: mostly due to transient faults
  • Example error detection mechanisms
  • Parity bit on cache line
  • Checksum on incoming message
  • Timeout on cache coherence transaction
  • But error detection for servers is still weak
  • Why?
  • Error detection is often on critical path and
    must be fast
  • Fast error detection can't incorporate info from
    other nodes

10
Why Local Information Isn't Sufficient
[Figure labels: Shared, Owned]

11
Why Local Information Isn't Sufficient
[Figure labels: Broadcast Request for Exclusive; fault!; Shared, Owned]

12
Why Local Information Isn't Sufficient
[Figure labels: Broadcast Request for Exclusive; fault!; Shared, Owned; Invalid; Data Response]

13
Why Local Information Isn't Sufficient
[Figure labels: Shared, Modified]
Neither P1 nor P2 can detect that an error has occurred!

14
Outline
  • Background
  • End-to-end invariant checking
  • Two invariant checkers
  • Evaluation
  • Conclusions

15
Distributed Signature Analysis
  • Reduces long history of events into small
    signature
  • Signatures map almost-uniquely to event histories

[Figure: P1 and P2 each reduce their own event stream (Event 1, Event 2, ..., Event N) into a local signature; P1's signature and P2's signature are both sent to a Checker]

Check periodically in logical time (every 3000
requests)

16
Designing Signature Analysis Schemes
  • Must devise two functions: Update and Check
  • Signature(Pi) = Update(Signature(Pi), Event)
  • Check(Signature(P1), ..., Signature(PN)) = true if
    error
  • Simple example: check that message inflow = outflow
  • Assume only unicast messages
  • Update: +1 for receive, -1 for send
  • Check: true if sum of all signatures doesn't
    equal 0
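
The inflow/outflow example can be sketched in a few lines of Python. This is an illustrative model only, not the hardware design from the talk: the event names, the three-node scenario, and the use of unbounded Python integers in place of a fixed-width signature register are all assumptions.

```python
from typing import Iterable

# Hypothetical event labels for this sketch.
RECEIVE, SEND = "receive", "send"


def update(signature: int, event: str) -> int:
    """Update: +1 for a received message, -1 for a sent message."""
    if event == RECEIVE:
        return signature + 1
    if event == SEND:
        return signature - 1
    raise ValueError(f"unknown event: {event}")


def check(signatures: Iterable[int]) -> bool:
    """Check: True means an error was detected (signatures do not sum to 0)."""
    return sum(signatures) != 0


# Error-free run on three nodes: P0 sends to P1, then P1 sends to P2.
sigs = [0, 0, 0]
sigs[0] = update(sigs[0], SEND)
sigs[1] = update(sigs[1], RECEIVE)
sigs[1] = update(sigs[1], SEND)
sigs[2] = update(sigs[2], RECEIVE)

# Same run, but the second message is lost before reaching P2.
lost = [0, 0, 0]
lost[0] = update(lost[0], SEND)
lost[1] = update(lost[1], RECEIVE)
lost[1] = update(lost[1], SEND)

print(check(sigs), check(lost))
```

Each node touches only its own signature; the global sum is evaluated once per checking interval, which keeps the check off the critical path.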

17
Implementing Distributed Signature Analysis
  • All components cooperate to perform checking
  • Component: cache controller or memory controller
  • Each component contains
  • Local signature register
  • Logic to compute signature updates
  • System contains
  • System controller that performs check function
  • Use distributed signature analysis for dynamic
    verification
  • Verify end-to-end invariants

18
Outline
  • Background
  • End-to-end invariant checking
  • Two invariant checkers
  • Message invariant
  • Cache coherence invariant
  • Evaluation
  • Conclusions

19
A Message-Level Invariant Checker
  • Context: symmetric multiprocessor (SMP)
  • Cache coherence with broadcast snooping protocol
  • Invariant: all nodes see same total order of
    broadcast cache coherence requests
  • Update: for each incoming broadcast, add
    Address
  • Not quite this simple (e.g., doesn't detect
    reorderings)
  • Check: error if the signatures aren't all equal
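
A minimal sketch of this simplified message-level checker follows. The 32-bit register width, the helper names, and the example addresses are assumptions made for illustration; the scheme in the paper is more elaborate, as the slide itself notes.

```python
from typing import Sequence

MASK = (1 << 32) - 1  # assumed 32-bit signature register


def update(signature: int, address: int) -> int:
    """Update: add the address of each incoming broadcast (wraps at 32 bits)."""
    return (signature + address) & MASK


def signature_of(broadcasts: Sequence[int]) -> int:
    """Fold the broadcasts one node observed into that node's signature."""
    sig = 0
    for address in broadcasts:
        sig = update(sig, address)
    return sig


def check(signatures: Sequence[int]) -> bool:
    """Check: True means an error was detected (signatures are not all equal)."""
    return len(set(signatures)) > 1


requests = [0x1000, 0x2040, 0x30C0]

# Every node observes every broadcast.
all_delivered = [signature_of(requests) for _ in range(3)]

# The last broadcast is not delivered to the third node.
one_missed = [
    signature_of(requests),
    signature_of(requests),
    signature_of(requests[:2]),
]

print(check(all_delivered), check(one_missed))
```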

20
Aliasing
  • Aliasing occurs if two histories have same
    signature
  • 3 possible sources of aliasing
  • Finite resources: b bits can only distinguish 2^b
    histories
  • Fault in signature analysis hardware itself
  • Inherent flaw in scheme
  • Examples of inherent aliasing in previous scheme
  • Arrival of message with Address = 0 doesn't change
    signature
  • Reordering of messages doesn't change signature
  • We solve aliasing issues in paper
  • Tricks: hash more than 1 field of message, use
    LFSRs, etc.
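
The two inherent aliasing cases can be reproduced with a short sketch. The order-sensitive update shown here is a generic multiply-and-add hash picked purely for illustration; it is not the LFSR-based scheme from the paper, and the 16-bit width, multiplier, and seed are arbitrary assumptions.

```python
from typing import Sequence

BITS = 16                 # assumed signature width
MASK = (1 << BITS) - 1
MULTIPLIER = 31           # arbitrary constant chosen for illustration


def additive_signature(addresses: Sequence[int]) -> int:
    """The simple scheme: signature is the wrapped sum of the addresses."""
    sig = 0
    for address in addresses:
        sig = (sig + address) & MASK
    return sig


def ordered_signature(addresses: Sequence[int]) -> int:
    """Illustrative order-sensitive update (not the paper's scheme).

    Multiplying the old signature before adding makes the result depend on
    arrival order; the +1 makes a message with address 0 still count.
    """
    sig = 1
    for address in addresses:
        sig = (sig * MULTIPLIER + address + 1) & MASK
    return sig


# Additive scheme: both histories in each pair alias to the same signature.
zero_alias = additive_signature([5, 0]) == additive_signature([5])
reorder_alias = additive_signature([5, 9]) == additive_signature([9, 5])

# Order-sensitive scheme: the same pairs are distinguished.
zero_seen = ordered_signature([5, 0]) != ordered_signature([5])
reorder_seen = ordered_signature([5, 9]) != ordered_signature([9, 5])

print(zero_alias, reorder_alias, zero_seen, reorder_seen)
```

Aliasing from finite resources remains in any such scheme: with 16 bits, at most 2^16 histories can be told apart.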

21
A Cache Coherence Invariant Checker
  • Invariant: all coherence upgrades cause
    downgrades
  • Upgrade: increase permissions to block (e.g.,
    none → read)
  • Downgrade: decrease permissions (e.g., write →
    read)
  • Update: add Address for upgrade,
    subtract Address for downgrade
  • Check: error if sum of all signatures doesn't
    equal 0
  • Challenges
  • Can be more than one downgrade per upgrade
  • Upgrader doesn't know how many downgraders
    exist
  • See paper for solutions to these challenges
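
As a rough sketch of the basic idea only, the following assumes the easy case in which every upgrade is matched by exactly one downgrade at one other node. It deliberately ignores the multiple-downgrader challenges listed above, and the event encoding, helper names, and block address are hypothetical.

```python
from typing import List, Tuple

UPGRADE, DOWNGRADE = "upgrade", "downgrade"
Event = Tuple[str, int]  # (kind, block address), an assumed encoding


def update(signature: int, event: Event) -> int:
    """Update: add the address for an upgrade, subtract it for a downgrade."""
    kind, address = event
    if kind == UPGRADE:
        return signature + address
    if kind == DOWNGRADE:
        return signature - address
    raise ValueError(f"unknown event kind: {kind}")


def node_signature(events: List[Event]) -> int:
    """Fold one node's coherence events into its local signature."""
    sig = 0
    for event in events:
        sig = update(sig, event)
    return sig


def check(signatures: List[int]) -> bool:
    """Check: True means an error was detected (signatures do not sum to 0)."""
    return sum(signatures) != 0


BLOCK = 0x80  # hypothetical block address

# Correct run: P1 upgrades the block, P2 performs the matching downgrade.
correct = [node_signature([(UPGRADE, BLOCK)]),
           node_signature([(DOWNGRADE, BLOCK)])]

# Faulty run: P2 never sees the request, so it never downgrades.
faulty = [node_signature([(UPGRADE, BLOCK)]),
          node_signature([])]

print(check(correct), check(faulty))
```

The faulty run mirrors the earlier scenario in which neither P1 nor P2 can notice the problem locally; only the global sum exposes it.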

22
Outline
  • Background
  • End-to-end invariant checking
  • Two invariant checkers
  • Evaluation
  • Conclusions

23
Methodology
  • Full-system simulation of 16-processor machine
  • Simics provides functional simulation of
    everything
  • We added timing simulation for memory system
    and SafetyNet
  • Commercial workloads running on Solaris 8
  • Database: IBM's DB2 running online transaction
    processing
  • Static web server: Apache
  • Dynamic web server: Slashdot
  • Java middleware

24
Detection Coverage
  • How do we know if our checkers work?
  • Inject errors periodically
  • Corrupt messages
  • Drop messages
  • Reorder messages
  • Improperly process cache coherence messages
  • Global invariant checkers detected all errors

25
Performance
  • Error bars represent +/- one standard deviation

26
Conclusions
  • Goal: improve multiprocessor availability
  • How? Dynamic verification of end-to-end
    invariants
  • Implemented with distributed signature analysis
  • Results
  • Detects previously undetectable hardware errors
  • Negligible performance overhead for error-free
    execution
  • Duke FaultFinder Project
  • http://www.ee.duke.edu/sorin/faultfinder
  • Wisconsin Multifacet Project
  • http://www.cs.wisc.edu/multifacet/