Toward Optimal Network Fault Correction via End-to-End Inference

1
Toward Optimal Network Fault Correction via End-to-End Inference
Patrick P. C. Lee, Vishal Misra, Dan Rubenstein
Distributed Network Analysis (DNA) Lab, Columbia University
May 9, 2007
2
Outline

Motivation
Framework for end-to-end inference
Inference algorithm
Performance evaluation
Conclusions

3
Motivation

Goal: correct (diagnose and repair) data-path failures in a system where only end-to-end information is available and link-level probing is unreliable.
Example: overlays across externally managed nodes.

[Figure: a server sends a data stream across an overlay; one receiver reports "OK!" and two report "No data?".]
4
Problem

What should an administrator do if some paths fail to deliver data?
What the administrator knows:
  some nodes on the faulty paths must have failed
What the administrator doesn't know:
  which nodes on the paths failed
  how many nodes on the paths failed
  why the nodes failed
Solution: check, via a series of sanity tests, the nodes that potentially failed, and repair those that did.

5
Constraints

Checking and repairing a node incurs a cost,
  e.g., wages and man-hours of support staff, or the cost of test equipment.
Such costs can vary widely,
  e.g., service providers may charge different prices for checking nodes.

6
Objective

Assume each node i has, known a priori:
  failure probability p_i: the likelihood that node i has failed
  checking cost c_i: the cost needed to perform sanity tests on node i
Objective: minimize the expected total checking cost of correcting (i.e., diagnosing and repairing) all faulty nodes:

  minimize $\sum_i c_i \Pr(\text{node } i \text{ is actually checked})$ over all sequences of nodes to be checked
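
As a concrete reading of this objective, the sketch below evaluates the sum for one fixed checking order on a single bad path. It is only an illustration under assumptions not stated on the slide: node failures are independent, a checked node is repaired if it is faulty, and the path is re-monitored after every check so that checking stops as soon as the path delivers data. The function name `expected_checking_cost` and the example values are hypothetical.

```python
from math import prod

def expected_checking_cost(order, p, c):
    """Expected cost of checking the nodes of one bad path in the given order.

    Assumes independent failures and re-monitoring after every check.
    """
    # Pr(path is bad) = Pr(at least one node on it is faulty)
    pr_bad = 1.0 - prod(1.0 - p[i] for i in order)
    total = 0.0
    for k, node in enumerate(order):
        # The k-th node is checked only if a fault remains among order[k:],
        # conditioned on the path having been observed bad.
        pr_checked = (1.0 - prod(1.0 - p[i] for i in order[k:])) / pr_bad
        total += c[node] * pr_checked
    return total

# Hypothetical two-node path with unit checking costs.
p = {"a": 0.45, "b": 0.3}
c = {"a": 1.0, "b": 1.0}
cost_ab = expected_checking_cost(["a", "b"], p, c)
cost_ba = expected_checking_cost(["b", "a"], p, c)
```

With equal costs, checking the more failure-prone node first gives the lower expected cost here (`cost_ab < cost_ba`), because the second check is needed less often.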
7
End-to-End Inference

End-to-end inference approach for correcting data-path failures:

[Flowchart: Input: network topology -> Monitor paths -> Bad paths exist? If no: Done. If yes: Select the nodes to check -> Check nodes -> Repair identified bad nodes -> Monitor paths again.]

Key question: how to select the nodes to check?
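
The loop on this slide can be sketched in a few lines. This is a minimal simulation, not the authors' implementation: the ground-truth fault set is passed in only so that monitoring can be simulated, and the selection policy `most_shared` (pick the suspect lying on the most bad paths) is a hypothetical placeholder for whatever rule answers the question of how to select nodes.

```python
def correct_faults(paths, faulty, cost, select):
    """Monitor / select / check / repair until no bad path remains.

    Returns (nodes in the order they were checked, total checking cost).
    """
    faulty = set(faulty)            # ground truth, hidden from `select`
    checked, total = [], 0.0
    while True:
        # Monitor: a path is bad if it contains at least one faulty node.
        bad = [set(path) for path in paths if faulty & set(path)]
        if not bad:
            return checked, total
        # Nodes on a good path are known to be good.
        good_nodes = set()
        for path in paths:
            if not faulty & set(path):
                good_nodes |= set(path)
        suspects = set().union(*bad) - good_nodes - set(checked)
        node = select(bad, suspects)
        checked.append(node)
        total += cost[node]
        faulty.discard(node)        # repair the node if it was faulty

def most_shared(bad_paths, suspects):
    """Hypothetical policy: suspect on the most bad paths, lowest id on ties."""
    return max(sorted(suspects), key=lambda n: sum(n in path for path in bad_paths))

paths = [[1, 2, 3], [1, 2, 4]]
cost = {n: 1.0 for n in (1, 2, 3, 4)}
result_shared_fault = correct_faults(paths, {1}, cost, most_shared)
result_leaf_faults = correct_faults(paths, {3, 4}, cost, most_shared)
```

When the shared node 1 is the only fault, one check suffices; when nodes 3 and 4 are the faults, this particular policy ends up checking all four nodes, which is why the choice of selection rule matters.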
8
How to Select Nodes to Check?

Suppose that we check one node at a time.
Most-Likely Fault (MLF) approach:
  First check the most likely faulty node, i.e., the node with the highest conditional failure probability given that some paths failed to deliver data.

Does the MLF approach necessarily minimize the
expected total checking cost?
9
Example: Why Is the MLF Scheme Not Optimal?

No, the MLF scheme is not optimal in general.
Two data paths are given. Both failed to deliver data.
Nodes have
  different failure probabilities
  but the same checking cost.
The conditional failure probabilities can be determined accordingly.

[Figure: two bad data paths over nodes 1 to 4, annotated with failure probabilities 0.45, 0.3, 0.5, and 0.6.]

Node   Conditional failure prob.
1      0.616
2      0.411
3      0.663
4      0.579
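
The table can be reproduced by enumerating all fault combinations. The topology is not recoverable from the transcript, so the sketch below assumes one that is consistent with the numbers: nodes 1 and 2 lie on both paths, node 3 (failure probability 0.6) only on the first, and node 4 (failure probability 0.5) only on the second, with independent failures. The function name is hypothetical.

```python
from itertools import product

# Assumed topology and probability assignment (not shown in the transcript).
p = {1: 0.45, 2: 0.3, 3: 0.6, 4: 0.5}
paths = [{1, 2, 3}, {1, 2, 4}]

def conditional_failure_probs(p, paths):
    """Pr(node i is faulty | every path is bad), by brute-force enumeration."""
    nodes = sorted(p)
    pr_all_bad = 0.0
    joint = {n: 0.0 for n in nodes}
    for state in product([False, True], repeat=len(nodes)):
        faulty = {n for n, bad in zip(nodes, state) if bad}
        weight = 1.0
        for n in nodes:
            weight *= p[n] if n in faulty else 1.0 - p[n]
        # Keep only the outcomes in which every path has a faulty node.
        if all(path & faulty for path in paths):
            pr_all_bad += weight
            for n in faulty:
                joint[n] += weight
    return {n: joint[n] / pr_all_bad for n in nodes}

cond = conditional_failure_probs(p, paths)
rounded = {n: round(v, 3) for n, v in cond.items()}
# rounded -> {1: 0.616, 2: 0.411, 3: 0.663, 4: 0.579}
mlf_choice = max(cond, key=cond.get)   # node the MLF approach would check first
```

Under this assumed topology the MLF approach selects node 3, the node with the highest conditional failure probability.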
10
Example: Why Is the MLF Scheme Not Optimal?

Findings:
  Node 3 has the highest conditional failure probability.
  However, a brute-force search shows that checking node 1 first is optimal, even though all nodes have the same checking cost.
Intuition:
  Node 3 affects only one path, but node 1 affects both paths.
  We may repair both paths by checking only node 1.
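
A brute-force search of this kind can be sketched as a small dynamic program over what is still unknown. The slides do not say how the search was done, so everything here is an assumption: nodes 1 and 2 lie on both paths, node 3 (p = 0.6) and node 4 (p = 0.5) each lie on one, failures are independent, a checked node is repaired if faulty, and all paths are re-monitored after every check. All names are hypothetical.

```python
from functools import lru_cache
from itertools import combinations

# Assumed example: unit checking costs, nodes 1 and 2 shared by both paths.
P = {1: 0.45, 2: 0.3, 3: 0.6, 4: 0.5}
C = {1: 1.0, 2: 1.0, 3: 1.0, 4: 1.0}
PATHS = (frozenset({1, 2, 3}), frozenset({1, 2, 4}))

def suspects(unchecked, bad):
    """Unchecked nodes that lie on some bad path and on no good path."""
    on_bad = frozenset().union(*(PATHS[k] for k in bad))
    on_good = frozenset().union(*(PATHS[k] for k in range(len(PATHS)) if k not in bad))
    return (frozenset(unchecked) & on_bad) - on_good

def worlds(unchecked, bad):
    """Fault sets (with prior weights) that put a fault on every bad path."""
    nodes = sorted(unchecked)
    out = []
    for r in range(len(nodes) + 1):
        for combo in combinations(nodes, r):
            faults = frozenset(combo)
            if all(PATHS[k] & faults for k in bad):
                weight = 1.0
                for n in nodes:
                    weight *= P[n] if n in faults else 1.0 - P[n]
                out.append((faults, weight))
    return out

def first_check_costs(unchecked, bad):
    """Expected total cost per choice of first check, acting optimally afterwards."""
    ws = worlds(unchecked, bad)
    total = sum(w for _, w in ws)
    costs = {}
    for node in unchecked:
        expected = C[node]
        for faults, w in ws:
            left = faults - {node}                      # node is repaired if faulty
            still_bad = tuple(k for k in bad if PATHS[k] & left)
            nxt = suspects(unchecked - {node}, still_bad)
            expected += (w / total) * best_cost(nxt, still_bad)
        costs[node] = expected
    return costs

@lru_cache(maxsize=None)
def best_cost(unchecked, bad):
    """Minimum expected remaining cost when the paths in `bad` are still bad."""
    if not bad:
        return 0.0
    return min(first_check_costs(unchecked, bad).values())

costs = first_check_costs(frozenset(P), (0, 1))
best_first = min(costs, key=costs.get)
```

Under these assumptions the search agrees with the slide: checking node 1 first has the lowest expected cost, narrowly ahead of node 3.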

11
Our Contributions

Propose an end-to-end inference approach for correcting all data-path failures.
Identify a set of candidate nodes, and prove that one of them must be checked first in order to minimize the expected total checking cost.
Show via simulation that our inference approach has a smaller expected cost than the prior MLF-based approaches [Katzela and Schwartz, 1995; Kandula et al., 2005; Steinder and Sethi, 2004].

12
Topologies

Topologies that we consider

Tree
Multiple trees

We prove optimality results for a tree, and
propose heuristics for multiple trees.

13
Finding Good/Bad Paths

Each data path is either
  Good: the data path has no faulty node and can deliver data, or
  Bad: the data path has at least one faulty node and cannot deliver data.
Assumption:
  Each node has the same data-forwarding behavior across all paths upon which it lies.
  This implies that if a node lies on at least one good path, it is a non-faulty (good) node.

14
Forming a Bad Tree

Monitor data streams from the root node 1 to each
of the leaf nodes 6, 7, 8, 9.

Keep only bad paths, and remove any nodes that
are known to be good.

[Figure: tree rooted at node 1 with leaf nodes 6, 7, 8, and 9; one root-to-leaf path is good and three are bad.]
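
The pruning step can be sketched as follows. The full tree is not recoverable from the transcript, so the root-to-leaf paths below are a hypothetical topology chosen only so that the pruned result matches the bad paths used on a later slide (3-5-8, 3-5-9, 3-6); the function name is likewise an assumption.

```python
def form_bad_paths(paths, is_good):
    """Keep only bad paths and drop every node that lies on some good path."""
    known_good = {n for path, ok in zip(paths, is_good) if ok for n in path}
    return [
        [n for n in path if n not in known_good]
        for path, ok in zip(paths, is_good)
        if not ok
    ]

# Hypothetical root-to-leaf paths from root node 1 to leaves 6, 7, 8, 9.
paths = [[1, 3, 6], [1, 2, 4, 7], [1, 3, 5, 8], [1, 3, 5, 9]]
is_good = [False, True, False, False]   # only the path to leaf 7 delivers data

bad_tree_paths = form_bad_paths(paths, is_good)
# bad_tree_paths -> [[3, 6], [3, 5, 8], [3, 5, 9]]
```

Node 1 is removed because it lies on the good path to leaf 7, so the bad tree is rooted at node 3.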
15
Inference Algorithm

Our inference algorithm selects which nodes to check.
Each node i is associated with a potential function, defined in terms of:
  p_i: failure probability of node i
  c_i: checking cost of node i
  Pr(T | X_i, A_i): conditional probability of having a bad tree, where
    T: the event that the tree is a bad tree
    X_i: the event that node i is bad
    A_i: the event that the ancestors of node i are good
Intuitively, we should first check a node with high p_i and small c_i, i.e., the node with the highest potential.

16
Inference Algorithm

Candidate node:
  On each bad path, one node has the highest potential. We call this node a candidate node.
Example of identifying candidate nodes:

Bad path   Candidate node
3-5-8      5
3-5-9      5
3-6        3

Main theorem:
  To minimize the expected total checking cost of correcting all faulty nodes for a given bad tree, we must check a candidate node first.
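
Identifying candidate nodes is a per-path maximum, as the sketch below shows. The potential values are made up for illustration (the transcript does not give them); they are chosen only so that the result matches the example table.

```python
def candidate_nodes(bad_paths, potential):
    """Map each bad path to its node with the highest potential."""
    return {tuple(path): max(path, key=potential.__getitem__) for path in bad_paths}

# Hypothetical potentials; only their relative order matters here.
potential = {3: 0.4, 5: 0.9, 6: 0.1, 8: 0.2, 9: 0.3}
bad_paths = [[3, 5, 8], [3, 5, 9], [3, 6]]

candidates = candidate_nodes(bad_paths, potential)
# candidates -> {(3, 5, 8): 5, (3, 5, 9): 5, (3, 6): 3}
```

The candidate set for this bad tree is therefore {3, 5}, and the main theorem says that one of these two nodes must be checked first.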

17
Inference Algorithm

For some special cases, we know which candidate
node should be checked first to minimize the
expected cost.
Examples of the special cases
A path
Check the node with the highest
first
A tree in which nodes have a fixed failure
probability and a fixed checking cost
Check the root node first

18
Inference Algorithm

For general cases, we don't know which candidate node should be checked first to minimize the expected cost,
  e.g., it is not necessarily the candidate node with the highest potential.
Heuristics:
  Sequential strategy: checks the candidate node with the highest potential.
  Parallel strategy: simultaneously checks multiple candidate nodes that cover all bad paths.
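
The two heuristics might be sketched as below. This is one possible reading, not the authors' code: the parallel variant visits bad paths in decreasing order of their candidate's potential and adds a path's candidate only if no node chosen so far already lies on that path. The potentials are hypothetical values.

```python
def sequential_choice(bad_paths, potential):
    """Pick the single candidate node with the highest potential."""
    candidates = {max(path, key=potential.get) for path in bad_paths}
    return max(candidates, key=potential.get)

def parallel_choice(bad_paths, potential):
    """Pick a set of candidate nodes that together cover every bad path."""
    chosen = set()
    by_strength = sorted(bad_paths, key=lambda path: -max(potential[n] for n in path))
    for path in by_strength:
        if not chosen & set(path):          # path not yet covered
            chosen.add(max(path, key=potential.get))
    return chosen

# Hypothetical potentials on the bad tree with paths 3-5-8, 3-5-9, 3-6.
potential = {3: 0.4, 5: 0.9, 6: 0.1, 8: 0.2, 9: 0.3}
bad_paths = [[3, 5, 8], [3, 5, 9], [3, 6]]

one_node = sequential_choice(bad_paths, potential)    # node 5
node_set = parallel_choice(bad_paths, potential)      # nodes 3 and 5
```

The sequential strategy re-monitors after each check, while the parallel strategy trades possibly unnecessary checks for fewer monitoring rounds.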

19
Highlights of Experiments

Setup:
  Use BRITE to create 200 random experimental networks, each of which has 200 routers.
  Assign each node a failure probability and a checking cost.
  Focus on multi-tree topologies, each of which is a shortest-path tree rooted at a randomly selected router.
Metric:
  Expected total checking cost to diagnose and repair all faulty nodes.
Heuristics to be compared:
  Candidate-based heuristics: check the candidate nodes first.
  MLF-based heuristics: check the most likely faulty nodes first.

20
Highlights of Experiments

Random failure prob., fixed checking cost:
  p_i ~ U(0, 0.2)
  c_i = 1
Result:
  Both heuristics have almost the same expected total checking cost.

21
Highlights of Experiments

Random failure prob., random checking cost:
  p_i ~ U(0, 0.2)
  c_i ~ U(0, 1)
Result:
  Checking the candidate nodes first decreases the expected total checking cost by 10%.

22
Highlights of Experiments

Fixed failure prob., random checking cost:
  p_i = 0.1
  c_i ~ U(0, 1)
Result:
  Checking the candidate nodes first decreases the expected total checking cost by 20%.
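
The three parameter settings used in these experiments can be written down compactly. The sketch below only draws the per-node parameters; it does not reproduce the BRITE topologies or the evaluation itself, and the function name, setting labels, and seed handling are hypothetical.

```python
import random

def assign_parameters(nodes, setting, seed=0):
    """Draw (p, c) per node for one of the three experimental settings.

    setting: "fixed_c"  -> p_i ~ U(0, 0.2), c_i = 1
             "random"   -> p_i ~ U(0, 0.2), c_i ~ U(0, 1)
             "fixed_p"  -> p_i = 0.1,       c_i ~ U(0, 1)
    """
    rng = random.Random(seed)
    p, c = {}, {}
    for n in nodes:
        p[n] = 0.1 if setting == "fixed_p" else rng.uniform(0.0, 0.2)
        c[n] = 1.0 if setting == "fixed_c" else rng.uniform(0.0, 1.0)
    return p, c

nodes = range(200)                       # 200 routers per network
p_a, c_a = assign_parameters(nodes, "fixed_c")
p_b, c_b = assign_parameters(nodes, "random")
p_c, c_c = assign_parameters(nodes, "fixed_p")
```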

23
Conclusions

Presented optimality results for diagnosing and
repairing all data-path failures, with an
objective to minimize the expected total checking
cost.
Constructed a potential function to identify
candidate nodes, one of which must be checked
first to minimize the expected total checking
cost.
Showed via evaluation that checking candidate nodes first can reduce the checking cost by up to 20% compared to checking the most likely faulty nodes first.