1
Toward Optimal Network Fault Correction via
End-to-End Inference
Patrick P. C. Lee, Vishal Misra, Dan Rubenstein
Distributed Network Analysis (DNA) Lab, Columbia University
May 9, 2007
2
Outline
  • Motivation
  • Framework for end-to-end inference
  • Inference algorithm
  • Performance evaluation
  • Conclusions

3
Motivation
  • Goal: Correct (diagnose and repair) data-path
    failures in a system where only end-to-end
    information is available and link-level probing
    is unreliable.
  • Example: overlays across externally managed nodes

[Figure: a data stream server and its receivers; one receiver is labeled "OK!" and two are labeled "No data?"]
4
Problem
  • What should an administrator do if some paths
    fail to deliver data?
  • What the administrator knows:
    • some nodes on the faulty paths must have failed
  • What the administrator doesn't know:
    • which nodes on the paths failed
    • how many nodes on the paths failed
    • why the nodes failed
  • Solution: Check, via a series of sanity tests,
    the nodes that potentially failed, and repair
    those that did.

5
Constraints
  • Checking and repairing a node incurs a cost
    • e.g., wages and man-hours of support staff, or
      the cost of test equipment
  • Such costs can vary widely
    • e.g., service providers may charge different
      prices for checking nodes

6
Objective
  • Assume each node i has the following, known a priori:
    • failure probability p_i: the likelihood that node
      i has failed
    • checking cost c_i: the cost needed to perform
      sanity tests on node i
  • Objective: minimize the expected total checking
    cost of correcting (i.e., diagnosing and
    repairing) all faulty nodes

minimize, over all sequences of nodes to be checked:

$$\sum_i c_i \, \Pr(\text{node } i \text{ is actually checked})$$
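
As a small illustration of this objective, the sketch below evaluates the sum in the simplest setting one could assume: a single bad path whose nodes fail independently, checked in a fixed order, where each bad node is repaired when found and checking stops as soon as the path delivers data. Under those assumptions a node is checked exactly when it, or some node later in the order, is bad. The function names and the example numbers are illustrative and do not come from the presentation.

```python
from itertools import permutations
from math import prod

def expected_cost(order, p, c):
    """Expected checking cost of a fixed check order on one bad path."""
    # Probability that the path is bad at all (at least one faulty node).
    pr_bad_path = 1.0 - prod(1.0 - p[i] for i in order)
    total = 0.0
    for k, node in enumerate(order):
        # The node is checked iff it or a later node in the order is bad,
        # conditioned on the path being bad.
        pr_checked = (1.0 - prod(1.0 - p[i] for i in order[k:])) / pr_bad_path
        total += c[node] * pr_checked
    return total

def best_order(p, c):
    """Brute-force search over all check orders (fine for a few nodes)."""
    return min(permutations(p), key=lambda order: expected_cost(order, p, c))
```

For instance, with p = {"a": 0.1, "b": 0.5} and unit costs, `best_order` prefers checking b before a, since a then rarely needs to be checked.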
7
End-to-End Inference
  • End-to-end inference approach for correcting
    data-path failures

[Flowchart: Input (network topology) → Monitor paths → Bad paths exist? No: Done. Yes: Select the nodes to check (how to select nodes to check?) → Check nodes → Repair identified bad nodes → Monitor paths again.]
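
The loop in this flowchart can be sketched as a small simulation. This is a minimal sketch, not the authors' implementation: `correct_faults`, its arguments, and the `select` callback are hypothetical names, the set of faulty nodes is ground truth that the selection policy never sees, and repaired nodes are assumed not to fail again.

```python
def correct_faults(paths, faulty, select):
    """Run the monitor/select/check/repair loop; return nodes in check order."""
    faulty = set(faulty)      # ground truth, hidden from the selection policy
    checked = []
    while True:
        # Monitor paths: a path is bad if it contains at least one faulty node.
        bad_paths = [p for p in paths if any(n in faulty for n in p)]
        if not bad_paths:
            return checked    # no bad paths remain: done
        # Nodes on a good path are known to be good and need no check.
        good_nodes = {n for p in paths if p not in bad_paths for n in p}
        suspects = {n for p in bad_paths for n in p} - good_nodes - set(checked)
        # `select` must return a non-empty collection of suspects.
        for node in select(bad_paths, suspects):
            checked.append(node)      # check the node ...
            faulty.discard(node)      # ... and repair it if it was bad
```

Any selection rule can be plugged in as `select`; for example, `lambda bad_paths, suspects: [min(suspects)]` checks the lowest-numbered suspect each round.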
8
How to Select Nodes to Check?
  • Suppose that we check one node at a time.
  • Most-Likely Fault (MLF) approach:
    • First check the most likely faulty node, i.e.,
      the node with the highest conditional failure
      probability given that some paths failed to
      deliver data.

Does the MLF approach necessarily minimize the
expected total checking cost?
9
Example: Why Is the MLF Scheme Not Optimal?
  • No, the MLF scheme is not optimal in general.
  • Two data paths are given. Both failed to deliver
    data.
  • Nodes have
    • different failure probabilities
    • the same checking cost.
  • The conditional failure probabilities follow from
    the failure probabilities and the observation
    that both paths are bad.

[Figure: two data paths over nodes 1-4, with node failure probabilities 0.45, 0.3, 0.5, and 0.6.]

Node   Conditional failure prob.
1      0.616
2      0.411
3      0.663
4      0.579
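
The conditional probabilities in the table can be reproduced by enumerating every possible set of faulty nodes. The sketch below rests on assumptions that the transcript does not state explicitly, because the figure is not preserved: both paths share nodes 1 and 2, node 3 lies only on one path and node 4 only on the other, nodes fail independently, and the failure probabilities are 0.45, 0.3, 0.6, and 0.5 for nodes 1 to 4. This assignment was chosen because it matches the table; the function name is illustrative.

```python
from itertools import product

# Assumed setup (the original figure is not preserved): both paths share
# nodes 1 and 2; node 3 lies only on the first path, node 4 only on the second.
PATHS = [(1, 2, 3), (1, 2, 4)]
P_FAIL = {1: 0.45, 2: 0.3, 3: 0.6, 4: 0.5}

def conditional_failure_probs(paths, p_fail):
    """Pr(node i is bad | every path is bad), by enumerating all fault sets."""
    nodes = sorted(p_fail)
    pr_all_bad = 0.0
    pr_joint = {n: 0.0 for n in nodes}
    for flags in product([False, True], repeat=len(nodes)):
        bad = {n for n, is_bad in zip(nodes, flags) if is_bad}
        # Probability of this exact fault set under independent failures.
        weight = 1.0
        for n in nodes:
            weight *= p_fail[n] if n in bad else 1.0 - p_fail[n]
        # Keep only fault sets in which every path has a bad node.
        if all(bad.intersection(path) for path in paths):
            pr_all_bad += weight
            for n in bad:
                pr_joint[n] += weight
    return {n: pr_joint[n] / pr_all_bad for n in nodes}

for node, prob in conditional_failure_probs(PATHS, P_FAIL).items():
    print(node, round(prob, 3))
```

Under these assumptions the printed values are 0.616, 0.411, 0.663, and 0.579 for nodes 1 to 4, matching the table.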
10
Example: Why Is the MLF Scheme Not Optimal?
  • Findings
    • Node 3 has the highest conditional failure
      probability.
    • However, a brute-force search shows that
      checking node 1 first is optimal (even though all
      nodes have the same checking cost).
  • Intuition
    • Node 3 affects only one path, but node 1 affects
      both paths.
    • We may repair both paths by checking only node 1.
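
A brute-force search of this kind might look like the sketch below. It is not the authors' code, and it depends on modeling assumptions the slides leave implicit: both paths share nodes 1 and 2, node 3 lies only on one path and node 4 only on the other; the failure probabilities are 0.45, 0.3, 0.6, and 0.5 for nodes 1 to 4 (chosen to match the conditional probabilities above); nodes fail independently; checks are made one at a time; a bad node is repaired when checked; and the paths are monitored again after every check. All names are illustrative.

```python
from collections import defaultdict
from itertools import product

# Assumed setup (not preserved in the transcript): both paths share nodes 1
# and 2; node 3 is only on the first path and node 4 only on the second.
PATHS = [(1, 2, 3), (1, 2, 4)]
P_FAIL = {1: 0.45, 2: 0.3, 3: 0.6, 4: 0.5}
COST = {1: 1.0, 2: 1.0, 3: 1.0, 4: 1.0}

def bad_paths(bad):
    """Indices of the paths that contain at least one bad node."""
    return frozenset(i for i, path in enumerate(PATHS) if bad.intersection(path))

def cost_if_first(node, belief):
    """Expected cost if `node` is checked next and later checks are optimal."""
    total = sum(belief.values())
    outcomes = defaultdict(lambda: defaultdict(float))
    for bad, weight in belief.items():
        repaired = bad - {node}  # a bad node is repaired when it is checked
        # We learn whether the node was bad and which paths are still bad.
        outcomes[(node in bad, bad_paths(repaired))][repaired] += weight
    return COST[node] + sum(
        sum(sub.values()) / total * optimal_cost(sub) for sub in outcomes.values()
    )

def optimal_cost(belief):
    """Minimum expected cost over all adaptive check orders (brute force)."""
    suspects = set().union(*belief)  # nodes that may still be bad
    if not suspects:
        return 0.0
    return min(cost_if_first(node, belief) for node in suspects)

# Initial belief: every fault set consistent with both paths being bad.
belief = {}
for flags in product([False, True], repeat=len(P_FAIL)):
    bad = frozenset(n for n, f in zip(sorted(P_FAIL), flags) if f)
    weight = 1.0
    for n in P_FAIL:
        weight *= P_FAIL[n] if n in bad else 1.0 - P_FAIL[n]
    if len(bad_paths(bad)) == len(PATHS):
        belief[bad] = weight

first_costs = {n: cost_if_first(n, belief) for n in sorted(P_FAIL)}
best_first = min(first_costs, key=first_costs.get)
```

Under these assumptions `best_first` is node 1, with an expected cost of roughly 2.82 against roughly 2.83 for starting with node 3, which is consistent with the finding above. The margin is small, so the result is sensitive to the assumed model.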

11
Our Contributions
  • Propose an end-to-end inference approach for
    correcting all data-path failures.
  • Identify a set of candidate nodes, and prove that
    one of them must be checked first in order to
    minimize the expected total checking cost.
  • Show via simulation that our inference
    approach has a smaller expected cost than the
    prior MLF-based approaches (Katzela and Schwartz,
    1995; Kandula et al., 2005; Steinder and Sethi,
    2004).

12
Topologies
  • Topologies that we consider

[Figure: two example topologies, labeled "Tree" and "Multiple trees"]
  • We prove optimality results for a tree, and
    propose heuristics for multiple trees.

13
Finding Good/Bad Paths
  • For each data path,
    • Good: if the data path has no faulty node and
      can deliver data
    • Bad: if the data path has at least one faulty
      node and cannot deliver data
  • Assumption
    • Each node has the same data-forwarding behavior
      across all paths upon which it lies.
    • This implies that if a node lies on at least one
      good path, it is a non-faulty (good) node.

14
Forming a Bad Tree
  • Monitor data streams from the root node 1 to each
    of the leaf nodes 6, 7, 8, 9.
  • Keep only bad paths, and remove any nodes that
    are known to be good.

[Figure: a tree over nodes 1-9 rooted at node 1; the path to leaf 7 is good, and the paths to leaves 6, 8, and 9 are bad.]
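
The two steps above (keep only bad paths, drop nodes known to be good) can be sketched in a few lines. The exact tree shape is an assumption, since the figure is not preserved: the good path to leaf 7 is taken to be 1-2-4-7, and the bad paths to be 1-3-6, 1-3-5-8, and 1-3-5-9. `form_bad_tree` is a hypothetical helper name.

```python
def form_bad_tree(paths, path_is_good):
    """Keep only bad paths and drop nodes known to be good."""
    # Any node on at least one good path is good (same behavior on all paths).
    good_nodes = {n for path, ok in zip(paths, path_is_good) if ok for n in path}
    return [
        [n for n in path if n not in good_nodes]
        for path, ok in zip(paths, path_is_good)
        if not ok
    ]

# Root-to-leaf paths; the exact tree shape is an assumption.
paths = [[1, 2, 4, 7], [1, 3, 6], [1, 3, 5, 8], [1, 3, 5, 9]]
path_is_good = [True, False, False, False]
bad_tree = form_bad_tree(paths, path_is_good)
print(bad_tree)  # [[3, 6], [3, 5, 8], [3, 5, 9]]
```

Node 1 disappears from every bad path because it also lies on the good path to leaf 7.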
15
Inference Algorithm
  • Our inference algorithm selects which nodes to
    check.
  • Each node i is associated with a potential
    function, whose terms are:
    • p_i: failure probability of node i
    • c_i: checking cost of node i
    • Pr(T | X_i, A_i): conditional probability of
      having a bad tree
      • T: the event that the tree is a bad tree
      • X_i: the event that node i is bad
      • A_i: the event that the ancestors of node i are
        good
  • Intuitively, we should first check a node with a
    high p_i and a small c_i, i.e., the node with the
    highest potential.

16
Inference Algorithm
  • Candidate node
    • On each bad path, one node has the highest
      potential. We call this node a candidate node.
  • Example of identifying candidate nodes

Bad path   Candidate node
3-5-8      5
3-5-9      5
3-6        3

  • Main theorem
    • To minimize the expected total checking cost of
      correcting all faulty nodes for a given bad tree,
      we must check a candidate node first.
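
Once potentials are available, identifying candidate nodes is a per-path maximum. The sketch below takes the potentials as given; the numeric values are hypothetical and were picked only so that the result matches the example table. `candidate_nodes` is an illustrative name.

```python
def candidate_nodes(bad_paths, potential):
    """Map each bad path to its node with the highest potential."""
    return {tuple(path): max(path, key=potential.get) for path in bad_paths}

bad_paths = [(3, 5, 8), (3, 5, 9), (3, 6)]
# Hypothetical potentials, chosen only so that the output matches the table.
potential = {3: 0.4, 5: 0.7, 6: 0.1, 8: 0.2, 9: 0.3}
result = candidate_nodes(bad_paths, potential)
```

With these values, paths 3-5-8 and 3-5-9 both map to node 5 and path 3-6 maps to node 3.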

17
Inference Algorithm
  • For some special cases, we know which candidate
    node should be checked first to minimize the
    expected cost.
  • Examples of the special cases
    • A path
      • Check the node with the highest [expression
        not preserved in the transcript] first
    • A tree in which nodes have a fixed failure
      probability and a fixed checking cost
      • Check the root node first

18
Inference Algorithm
  • For general cases, we don't know which candidate
    node should be checked first to minimize the
    expected cost.
    • e.g., not necessarily the candidate node with the
      highest potential
  • Heuristics
    • Sequential strategy: checks the candidate node
      with the highest potential
    • Parallel strategy: simultaneously checks multiple
      candidate nodes that cover all bad paths
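
The two heuristics could be sketched as follows, given precomputed potentials. This is one simple reading of the slide rather than the authors' definition: the parallel strategy here returns every candidate node, which covers all bad paths by construction because each bad path contributes its own candidate. The potentials are hypothetical values and the function names are illustrative.

```python
def candidates(bad_paths, potential):
    """The highest-potential node of each bad path."""
    return {max(path, key=potential.get) for path in bad_paths}

def sequential_choice(bad_paths, potential):
    """Pick the single candidate with the highest potential."""
    return max(candidates(bad_paths, potential), key=potential.get)

def parallel_choice(bad_paths, potential):
    """Pick every candidate, so each bad path has one of its nodes checked."""
    return candidates(bad_paths, potential)

bad_paths = [(3, 5, 8), (3, 5, 9), (3, 6)]
# Hypothetical potentials for illustration only.
potential = {3: 0.4, 5: 0.7, 6: 0.1, 8: 0.2, 9: 0.3}
```

With these values the sequential strategy checks node 5 alone, while the parallel strategy checks nodes 3 and 5 together.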

19
Highlights of Experiments
  • Setup
    • Use BRITE to create 200 random experimental
      networks, each of which has 200 routers
    • Assign each node a failure probability and a
      checking cost
    • Focus on multi-tree topologies, in which each
      tree is a shortest-path tree rooted at a randomly
      selected router
  • Metric
    • Expected total checking cost to diagnose and
      repair all faulty nodes
  • Heuristics to be compared
    • Candidate-based heuristics: check the candidate
      nodes first
    • MLF-based heuristics: check the most likely
      faulty nodes first

20
Highlights of Experiments
  • Random failure prob., fixed checking cost
    • p_i ~ U(0, 0.2)
    • c_i = 1
  • Result
    • Both heuristics have almost the same expected
      total checking cost.

21
Highlights of Experiments
  • Random failure prob., random checking cost
    • p_i ~ U(0, 0.2)
    • c_i ~ U(0, 1)
  • Result
    • Checking the candidate nodes first decreases the
      expected total checking cost by 10%.

22
Highlights of Experiments
  • Fixed failure prob., random checking cost
    • p_i = 0.1
    • c_i ~ U(0, 1)
  • Result
    • Checking the candidate nodes first decreases the
      expected total checking cost by 20%.

23
Conclusions
  • Presented optimality results for diagnosing and
    repairing all data-path failures, with an
    objective to minimize the expected total checking
    cost.
  • Constructed a potential function to identify
    candidate nodes, one of which must be checked
    first to minimize the expected total checking
    cost.
  • Showed via evaluation that checking candidate
    nodes first can reduce the checking cost by up to
    20% compared to checking the most likely faulty
    nodes first.