Title: Toward Optimal Network Fault Correction via End-to-End Inference
1. Toward Optimal Network Fault Correction via End-to-End Inference
Patrick P. C. Lee, Vishal Misra, Dan Rubenstein
Distributed Network Analysis (DNA) Lab, Columbia University
May 9, 2007
2. Outline
- Motivation
- Framework for end-to-end inference
- Inference algorithm
- Performance evaluation
- Conclusions
3. Motivation
- Goal: correct (diagnose and repair) data-path failures in a system where only end-to-end information is available and link-level probing is unreliable.
- Example: overlays across externally managed nodes
[Figure: a data stream server and its receivers; one receiver is labeled "OK!" and two are labeled "No data?"]
4. Problem
- What should an administrator do if some paths fail to deliver data?
- What the administrator knows:
  - some nodes on the faulty paths must have failed
- What the administrator doesn't know:
  - which nodes on the paths failed
  - how many nodes on the paths failed
  - why the nodes failed
- Solution: check, via a series of sanity tests, the nodes that potentially failed, and repair those that did.
5. Constraints
- Checking and repairing a node incurs a cost
  - e.g., wages and man-hours of support staff, or cost of test equipment
- Such a cost can vary widely from node to node
  - e.g., service providers may charge different prices for checking nodes
6. Objective
- Assume each node i has, known a priori,
  - a failure probability p_i: the likelihood that node i has failed
  - a checking cost c_i: the cost needed to perform sanity tests on node i
- Objective: minimize the expected total checking cost of correcting (i.e., diagnosing and repairing) all faulty nodes, over all sequences of nodes to be checked:

$$\min_{\text{checking sequences}} \; \sum_i c_i \Pr(\text{node } i \text{ is actually checked})$$
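
As an illustration of this objective, here is a minimal Python sketch for the simplest setting only: a single bad path and a fixed (non-adaptive) checking order. The function name `expected_cost` and the two-node parameters are hypothetical. By linearity of expectation, the value it returns equals the sum over nodes of c_i times the probability that node i is checked.

```python
from itertools import permutations, product

def expected_cost(order, p, c):
    """Expected checking cost of a fixed order on one bad path
    (the path is observed bad, so at least one node is faulty)."""
    nodes = list(order)
    weighted_cost = 0.0
    total_weight = 0.0
    for bits in product([False, True], repeat=len(nodes)):
        bad = [n for n, is_bad in zip(nodes, bits) if is_bad]
        if not bad:
            continue  # an all-good path contradicts the observation
        weight = 1.0
        for n, is_bad in zip(nodes, bits):
            weight *= p[n] if is_bad else 1 - p[n]
        # Checking stops once the last faulty node in the order is repaired.
        last = max(nodes.index(n) for n in bad)
        weighted_cost += weight * sum(c[n] for n in nodes[: last + 1])
        total_weight += weight
    return weighted_cost / total_weight

# Hypothetical two-node path.
p = {"a": 0.5, "b": 0.5}
c = {"a": 1.0, "b": 3.0}
best_order = min(permutations(p), key=lambda order: expected_cost(order, p, c))
```

With these made-up numbers, checking the cheaper node first costs less in expectation. On trees, an adaptive policy can also exploit paths that become good after a repair, which this sketch does not model.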
7. End-to-End Inference
- End-to-end inference approach for correcting data-path failures
[Flowchart: Input (network topology) -> Monitor paths -> Bad paths exist? -> No: Done; Yes: Select the nodes to check -> Check nodes -> Repair identified bad nodes -> Monitor paths again]
- How to select nodes to check?
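
The monitor/select/check/repair loop can be sketched as follows. This is only an illustration: `correct_faults`, `first_unknown`, and the toy network are hypothetical names and data, and the hidden truth `is_bad` stands in for real sanity tests. The selection rule is passed in as a function, since choosing it well is the subject of the rest of the talk.

```python
def correct_faults(paths, is_bad, select):
    """Repeat monitor -> select -> check -> repair until no path is bad.

    paths:  list of node lists (one per monitored data path)
    is_bad: dict node -> bool, the hidden truth (mutated when a node is repaired)
    select: function(bad_paths, known_good) -> node to check next
    Returns the nodes in the order they were checked.
    """
    checked = []
    known_good = set()
    while True:
        # Monitor paths: a path is bad if it contains at least one faulty node.
        bad_paths = [path for path in paths if any(is_bad[n] for n in path)]
        if not bad_paths:
            return checked
        # Every node on a good path is known to be good.
        for path in paths:
            if path not in bad_paths:
                known_good.update(path)
        node = select(bad_paths, known_good)
        checked.append(node)
        if is_bad[node]:
            is_bad[node] = False  # repair the identified bad node
        known_good.add(node)

def first_unknown(bad_paths, known_good):
    """Naive placeholder rule: first node on the first bad path not known good."""
    return next(n for n in bad_paths[0] if n not in known_good)

# Toy network (hypothetical): two paths sharing nodes 1 and 2.
truth = {1: False, 2: True, 3: False, 4: True}
order = correct_faults([[1, 2, 3], [1, 2, 4]], truth, first_unknown)
```

In this toy run node 3 is never checked: once node 2 is repaired, the path through node 3 becomes good and node 3 is inferred to be good.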
8. How to Select Nodes to Check?
- Suppose that we check one node at a time.
- Most-Likely Fault (MLF) approach
  - First check the most likely faulty node, i.e., the node with the highest conditional failure probability given that some paths failed to deliver data.
- Does the MLF approach necessarily minimize the expected total checking cost?
9. Example: Why the MLF Scheme Is Not Optimal
- No, the MLF scheme is not optimal in general.
- Two data paths are given. Both failed to deliver data.
- Nodes have
  - different failure probabilities
  - the same checking cost.
- The conditional failure probabilities can be determined accordingly.
[Figure: two data paths over nodes 1, 2, 3 and 4; the node failure probabilities shown are 0.45, 0.3, 0.5 and 0.6]
Node Conditional failure prob.
1 0.616
2 0.411
3 0.663
4 0.579
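
The conditional failure probabilities can be computed by enumerating every combination of node states and keeping those in which both paths are bad. The figure's layout is not preserved here, so the sketch below rests on an assumption: the two paths are 1-2-3 and 1-2-4, with failure probabilities 0.45, 0.3, 0.6 and 0.5 for nodes 1 to 4. This assignment is inferred, not given, but it reproduces the table above to three decimals.

```python
from itertools import product

# ASSUMED topology and probability assignment (inferred to match the table).
p = {1: 0.45, 2: 0.3, 3: 0.6, 4: 0.5}
paths = [(1, 2, 3), (1, 2, 4)]

def conditional_failure_probs(p, paths):
    """Pr(node i is bad | every path is bad), by brute-force enumeration."""
    nodes = sorted(p)
    prob_all_bad = 0.0
    joint = {n: 0.0 for n in nodes}
    for bits in product([False, True], repeat=len(nodes)):
        bad = {n for n, is_bad in zip(nodes, bits) if is_bad}
        # Keep only states in which each path contains a faulty node.
        if not all(any(n in bad for n in path) for path in paths):
            continue
        weight = 1.0
        for n in nodes:
            weight *= p[n] if n in bad else 1 - p[n]
        prob_all_bad += weight
        for n in bad:
            joint[n] += weight
    return {n: joint[n] / prob_all_bad for n in nodes}

cond = conditional_failure_probs(p, paths)
```

Under this assumption, node 3 has the highest conditional failure probability, so it is the node the MLF approach would check first.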
10. Example: Why the MLF Scheme Is Not Optimal
- Findings
  - Node 3 has the highest conditional failure probability.
  - However, a brute-force search shows that checking node 1 first is optimal (even though all nodes have the same checking cost).
- Intuition
  - Node 3 affects only one path, but node 1 affects both paths.
  - We may repair both paths by checking only node 1.
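
A brute-force search of this kind can be sketched as a recursion over beliefs, assuming that after each check (and repair, if needed) all paths are re-monitored. The topology (paths 1-2-3 and 1-2-4) and the probabilities 0.45, 0.3, 0.6 and 0.5 for nodes 1 to 4 are assumptions inferred from the conditional probabilities in the table, and the checking cost is set to 1 for every node. All names are hypothetical.

```python
from itertools import product

# ASSUMED example data (not given explicitly in the slides).
P = {1: 0.45, 2: 0.3, 3: 0.6, 4: 0.5}
C = {1: 1.0, 2: 1.0, 3: 1.0, 4: 1.0}
PATHS = [(1, 2, 3), (1, 2, 4)]

def path_status(bad):
    """Tuple of booleans: True where a path contains a faulty node."""
    return tuple(any(n in bad for n in path) for path in PATHS)

def weight(bad):
    w = 1.0
    for n in P:
        w *= P[n] if n in bad else 1 - P[n]
    return w

def value(belief, checked):
    """Minimal expected remaining cost, unnormalised (weighted by belief mass).
    belief: list of (initially bad node set, prior weight) pairs that agree
    with every observation so far; checked nodes are good (repaired if needed)."""
    if not any(path_status(belief[0][0] - checked)):
        return 0.0  # all paths deliver data again
    return min(cost_if_checked(i, belief, checked) for i in P if i not in checked)

def cost_if_checked(i, belief, checked):
    after = checked | {i}
    outcomes = {}
    for bad, w in belief:
        # Observation: was node i faulty, and which paths are bad after repair?
        key = (i in bad, path_status(bad - after))
        outcomes.setdefault(key, []).append((bad, w))
    mass = sum(w for _, w in belief)
    return C[i] * mass + sum(value(b, after) for b in outcomes.values())

states = [frozenset(n for n, b in zip(P, bits) if b)
          for bits in product([False, True], repeat=len(P))]
belief = [(s, weight(s)) for s in states if all(path_status(s))]
mass = sum(w for _, w in belief)
first_costs = {i: cost_if_checked(i, belief, frozenset()) / mass for i in P}
best_first = min(first_costs, key=first_costs.get)
```

Under these assumptions the sketch selects node 1 as the best first check, with an expected cost of roughly 2.82 against roughly 2.83 for node 3, which agrees with the finding stated on the slide.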
11. Our Contributions
- Propose an end-to-end inference approach for correcting all data-path failures.
- Identify a set of candidate nodes, and prove that one of them must be checked first in order to minimize the expected total checking cost.
- Show via simulation that our inference approach has a smaller expected cost than the prior MLF-based approaches [Katzela and Schwartz, 1995; Kandula et al., 2005; Steinder and Sethi, 2004].
12. Topologies
- Topologies that we consider
[Figures: a tree; multiple trees]
- We prove optimality results for a tree, and propose heuristics for multiple trees.
13. Finding Good/Bad Paths
- Each data path is
  - good if it has no faulty node and can deliver data
  - bad if it has at least one faulty node and cannot deliver data
- Assumption
  - Each node has the same data-forwarding behavior across all paths upon which it lies.
  - This implies that if a node lies on at least one good path, it is a non-faulty (good) node.
14. Forming a Bad Tree
- Monitor data streams from the root node 1 to each of the leaf nodes 6, 7, 8, 9.
- Keep only bad paths, and remove any nodes that are known to be good.
[Figure: a tree with nodes 1 to 9 rooted at node 1; three root-to-leaf paths are marked "Bad path" and one is marked "Good path"]
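
Forming the bad tree amounts to two set operations, sketched below. The exact edges of the figure are not preserved, so the paths used here are an assumption: 1-2-4-7 is the good path and 1-3-6, 1-3-5-8, 1-3-5-9 are the bad ones, a layout chosen to match the bad paths 3-5-8, 3-5-9 and 3-6 that appear in the candidate-node example. The function name `form_bad_tree` is hypothetical.

```python
def form_bad_tree(paths, delivers_data):
    """Keep only bad paths and strip nodes known to be good.

    paths:         list of root-to-leaf node lists
    delivers_data: list of booleans, True if the matching path is good
    """
    # A node on at least one good path is good (same behavior on all paths).
    known_good = set()
    for path, ok in zip(paths, delivers_data):
        if ok:
            known_good.update(path)
    return [
        [n for n in path if n not in known_good]
        for path, ok in zip(paths, delivers_data)
        if not ok
    ]

# ASSUMED topology for the slide's tree.
paths = [[1, 2, 4, 7], [1, 3, 6], [1, 3, 5, 8], [1, 3, 5, 9]]
bad_tree = form_bad_tree(paths, [True, False, False, False])
```

Here the root node 1 is removed because it lies on the good path, so the remaining bad tree is rooted at node 3.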
15. Inference Algorithm
- Our inference algorithm selects which nodes to check.
- Each node i is associated with a potential function that depends on
  - p_i: failure probability of node i
  - c_i: checking cost of node i
  - Pr(T | X_i, A_i): conditional probability of having a bad tree
    - T: the event that the tree is a bad tree
    - X_i: the event that node i is bad
    - A_i: the event that the ancestors of node i are good
- Intuitively, we should first check the node with high p_i and small c_i, i.e., the node with high potential.
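
The formula for the potential is not preserved in this transcript. The sketch below therefore ASSUMES the form Pr(T | X_i, A_i) * p_i / (c_i * (1 - p_i)), which is only one expression consistent with the ingredients and the intuition listed above; treat it as an illustration rather than the authors' definition. The conditional probability is computed by enumeration, and all names and parameter values are hypothetical.

```python
from itertools import product

def prob_bad_tree_given(i, p, paths):
    """Pr(T | X_i, A_i): probability that every path is bad, given that
    node i is bad and all ancestors of node i are good."""
    ancestors = set()
    for path in paths:  # paths are listed from the root towards a leaf
        if i in path:
            ancestors.update(path[: path.index(i)])
    free = [n for n in p if n != i and n not in ancestors]
    total = 0.0
    for bits in product([False, True], repeat=len(free)):
        bad = {i} | {n for n, is_bad in zip(free, bits) if is_bad}
        if all(any(n in bad for n in path) for path in paths):
            w = 1.0
            for n, is_bad in zip(free, bits):
                w *= p[n] if is_bad else 1 - p[n]
            total += w
    return total

def potential(i, p, c, paths):
    # ASSUMED form of the potential function (see the note above).
    return prob_bad_tree_given(i, p, paths) * p[i] / (c[i] * (1 - p[i]))

# Hypothetical parameters on a small bad tree.
paths = [(3, 5, 8), (3, 5, 9), (3, 6)]
p = {3: 0.1, 5: 0.1, 6: 0.2, 8: 0.1, 9: 0.1}
c = {n: 1.0 for n in p}
```

For example, if node 5 is bad and its ancestor 3 is good, the paths through node 5 are bad for certain, and the path 3-6 is bad only if node 6 is bad, so `prob_bad_tree_given(5, p, paths)` equals the failure probability of node 6.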
16. Inference Algorithm
- Candidate node
  - On each bad path, one node has the highest potential. We call this node a candidate node.
- Example of identifying candidate nodes

  Bad path    Candidate node
  3-5-8       5
  3-5-9       5
  3-6         3

- Main theorem
  - To minimize the expected total checking cost of correcting all faulty nodes for a given bad tree, we must check a candidate node first.
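
Given a potential value for every node, identifying candidate nodes is a per-path maximum. In the sketch below the potential values are made up, chosen only so that the result matches the example table; `candidate_nodes` is a hypothetical name.

```python
def candidate_nodes(bad_paths, potential):
    """Map each bad path to its highest-potential node (its candidate)."""
    return {tuple(path): max(path, key=potential.get) for path in bad_paths}

# Hypothetical potentials, picked to reproduce the example table.
potential = {3: 0.8, 5: 1.2, 6: 0.3, 8: 0.5, 9: 0.4}
candidates = candidate_nodes([(3, 5, 8), (3, 5, 9), (3, 6)], potential)
```

Node 3 lies on all three paths but is the candidate only for 3-6, because node 5 has a higher potential on the two paths they share.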
17. Inference Algorithm
- For some special cases, we know which candidate node should be checked first to minimize the expected cost.
- Examples of the special cases
  - A path
    - Check the node with the highest potential first
  - A tree in which nodes have a fixed failure probability and a fixed checking cost
    - Check the root node first
18. Inference Algorithm
- For general cases, we don't know which candidate node should be checked first to minimize the expected cost.
  - e.g., it is not necessarily the candidate node with the highest potential
- Heuristics
  - Sequential strategy: checks the candidate node with the highest potential
  - Parallel strategy: simultaneously checks multiple candidate nodes that cover all bad paths
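
The two heuristics can be sketched as follows, taking the potential of each node as given. The potential values are made up, and the parallel variant shown simply takes every path's candidate. That set covers all bad paths by construction, but it is an assumption about how the covering set is chosen, not something the slides specify.

```python
def path_candidates(bad_paths, potential):
    """Set of candidate nodes: the highest-potential node of each bad path."""
    return {max(path, key=potential.get) for path in bad_paths}

def sequential_choice(bad_paths, potential):
    """Sequential strategy: the single candidate with the highest potential."""
    return max(path_candidates(bad_paths, potential), key=potential.get)

def parallel_choice(bad_paths, potential):
    """Parallel strategy (assumed variant): check all candidates at once."""
    return path_candidates(bad_paths, potential)

# Hypothetical potentials on a small bad tree.
potential = {3: 0.8, 5: 1.2, 6: 0.3, 8: 0.5, 9: 0.4}
bad_paths = [(3, 5, 8), (3, 5, 9), (3, 6)]
one_node = sequential_choice(bad_paths, potential)
many_nodes = parallel_choice(bad_paths, potential)
```

The sequential strategy re-monitors the paths after every check, while the parallel strategy trades some extra checks for fewer monitoring rounds.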
19. Highlights of Experiments
- Setup
  - Use BRITE to create 200 random experimental networks, each of which has 200 routers
  - Assign each node a failure probability and a checking cost
  - Focus on multi-tree topologies, in which each tree is a shortest-path tree rooted at a randomly selected router
- Metric
  - Expected total checking cost to diagnose and repair all faulty nodes
- Heuristics to be compared
  - Candidate-based heuristics: check the candidate nodes first
  - MLF-based heuristics: check the most likely faulty nodes first
20. Highlights of Experiments
- Random failure prob., fixed checking cost
  - p_i ~ U(0, 0.2)
  - c_i = 1
- Result
  - Both heuristics have almost the same expected total checking cost.
21. Highlights of Experiments
- Random failure prob., random checking cost
  - p_i ~ U(0, 0.2)
  - c_i ~ U(0, 1)
- Result
  - Checking the candidate nodes first decreases the expected total checking cost by 10%.
22. Highlights of Experiments
- Fixed failure prob., random checking cost
  - p_i = 0.1
  - c_i ~ U(0, 1)
- Result
  - Checking the candidate nodes first decreases the expected total checking cost by 20%.
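
The three parameter settings used in these experiments can be reproduced with a few lines of Python. This sketch covers only the parameter draw, not the BRITE topologies or the heuristics; the function name and setting labels are hypothetical.

```python
import random

def draw_parameters(num_nodes, setting, seed=0):
    """Return {node: (failure probability, checking cost)} for one setting."""
    rng = random.Random(seed)  # seeded for repeatable draws
    params = {}
    for node in range(num_nodes):
        if setting == "random_p_fixed_c":
            p, c = rng.uniform(0, 0.2), 1.0
        elif setting == "random_p_random_c":
            p, c = rng.uniform(0, 0.2), rng.uniform(0, 1)
        elif setting == "fixed_p_random_c":
            p, c = 0.1, rng.uniform(0, 1)
        else:
            raise ValueError(f"unknown setting: {setting}")
        params[node] = (p, c)
    return params

params = draw_parameters(200, "random_p_random_c")
```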
23. Conclusions
- Presented optimality results for diagnosing and repairing all data-path failures, with the objective of minimizing the expected total checking cost.
- Constructed a potential function to identify candidate nodes, one of which must be checked first to minimize the expected total checking cost.
- Showed via evaluation that checking candidate nodes first can reduce the checking cost by up to 20% compared to checking the most likely faulty nodes first.