Title: Fault Detection, Isolation, and Diagnosis In Multihop Wireless Networks
1Fault Detection, Isolation, and Diagnosis In
Multihop Wireless Networks
- Lili Qiu, Paramvir Bahl, Ananth Rao, and Lidong Zhou
- Microsoft Research
- Presented by Maitreya Natu
2Network Management
(Diagram labels: healthy network, faulty network, faults directory, root cause, corrective measure)
3Tasks involved in Network Management
- Continuously monitoring the functioning of the network
- Collecting information about the nodes and the links
- Removing inconsistencies and noise from the reported information
- Analyzing the information
- Taking appropriate actions to improve network reliability and performance
4Challenges in wireless networks
- Dynamic and unpredictable topology
  - Link errors due to fluctuating environment conditions
  - Node mobility
- Limited capacity
  - Scarcity of resources
- Link attacks
5Proposed framework
- Reproduce inside a simulator the real-world events that took place
- Use online trace-driven simulation to detect faults and analyze the root causes
6Network Management
(Diagram, stage shown: creating a network model. Labels: network model, types of faults, healthy network, faults directory)
7Network Management
(Diagram, stage shown: fault diagnosis. Labels: network model, types of faults, faulty network, faults directory, detected faults)
8Network Management
(Diagram, stage shown: what-if analysis. Labels: network model, types of faults, faults directory, detected faults, corrective measures)
9Key issues
- How to accurately reproduce what happened in the network inside a simulator
- How to build fault diagnosis on top of a simulator to perform root cause analysis
10Accurate modeling
- Use real traces from the diagnosed network
  - Removes dependency on generic theoretical models
  - Captures nuances of the hardware, software and environment of the particular network
- Collect good quality data
  - By developing a technique to effectively rule out erroneous data
11Fault diagnosis
- Performance data emitted by the trace-driven simulation is used as the baseline
- Any significant deviation from the baseline indicates a potential fault
- The simulator selectively injects a set of suspected faults and searches for the set that best reproduces the observed performance
- An efficient algorithm is designed to determine root causes
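The baseline comparison above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `find_deviations`, the per-link loss-rate dictionaries, and the 0.05 threshold are hypothetical and do not come from the paper.

```python
def find_deviations(expected, observed, threshold=0.05):
    """Return links whose observed metric differs from the simulated
    baseline by more than `threshold` (absolute difference).

    `expected` and `observed` map a link to a metric such as loss rate.
    The threshold value is an assumption chosen for illustration.
    """
    flagged = {}
    for link, baseline in expected.items():
        if link not in observed:
            continue  # no report for this link, nothing to compare
        deviation = abs(observed[link] - baseline)
        if deviation > threshold:
            flagged[link] = deviation
    return flagged


# Hypothetical loss rates: simulator baseline vs. what the network reported.
expected_loss = {("1", "2"): 0.02, ("1", "4"): 0.03, ("2", "3"): 0.02}
observed_loss = {("1", "2"): 0.20, ("1", "4"): 0.25, ("2", "3"): 0.021}
suspects = find_deviations(expected_loss, observed_loss)
```

Links in `suspects` are only candidates; the fault search described on the slide still has to explain why they deviate.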
12System Overview
Steps shown in the diagram:
1. Receive cleaned data
2. Drive simulation
3. Compute expected performance
4. Compare expected and average performance
5. Discrepancy found
6. Search for the set of faults that results in the best explanation
7. Report the cause of failure
(Diagram labels: simulator, traffic simulator, interference injection, faults directory, link/node failure, link RSS, link load, topology changes, routing update, error, expected loss rate / throughput / noise, loss rate / throughput / noise)
13Why Simulation Based Diagnosis?
- Much better insights into the network behavior than any heuristic or theoretical technique
- Highly customizable and applies to a large class of networks
- Ability to perform what-if analysis
  - Helps to foresee the consequences of a corrective action
- Recent advances in simulators have made possible their use for real-time analysis
14Accurate modeling
(Diagram labels: network model, types of faults, healthy network, faults directory)
15Current network models
- Bayesian networks to map symptom-fault dependencies
- Context Free Grammars
- Correlation Matrix
16Can on-line simulations be used as core tool?
17Building confidence in simulator accuracy
- Problem
  - Hard to accurately model the physical layer and the RF propagation
  - Traffic demands on the router are hard to predict
18Building confidence in simulator accuracy
- Solution
  - After-the-fact simulation
  - Agents periodically report information about the link conditions and traffic patterns to the link simulators
19Simulations when the RF condition of the link is good
Modeling the contention from flows within the interference and communication ranges.
Modeling the overheads of the protocol stack such as parity bits, MAC-layer back-off, IEEE 802.11 inter-frame spacing and ACK, and headers.
20Simulations with varying received signal strength
Simulator estimate deviates from the measured throughput when signal strength is poor
Throughput matches the simulator's estimate closely when signal quality is good
21Why simulation results deviate in case of poor signal strength?
- Lack of an accurate model of packet loss as a function of packet size, RSS and ambient noise
  - Depends on signal processing hardware and the RF antenna within the wireless cards
- Lack of an accurate model of auto-rate control
  - Adjustment of sending rate done by WLAN cards based on the transmission conditions
22How to model auto-rate control done by WLAN cards?
- Use trace-driven simulation
  - When auto-rate is in use
    - Collect the rate at which the wireless card is operating and provide the reported rate to the simulator
  - Otherwise
    - Data rate is known to the simulator
23How to model accurate packet loss as a function of packet size, RSS and ambient noise?
- Use offline analysis
  - Calibrate the wireless cards and create a database associating environmental factors with expected performance
  - E.g., mapping from signal strength and noise to loss rate
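One way to picture such a calibration database is a small lookup table keyed by signal strength and noise. The sketch below is hypothetical: the sample values, the dBm units, and the nearest-neighbor lookup are assumptions made for illustration, not measurements or methods taken from the paper.

```python
# Hypothetical calibration samples gathered offline for one card model:
# (rss_dbm, noise_dbm) -> measured loss rate. All values are made up.
CALIBRATION = {
    (-60, -95): 0.01,
    (-75, -95): 0.05,
    (-85, -95): 0.30,
    (-85, -90): 0.55,
}


def expected_loss_rate(rss_dbm, noise_dbm, table=CALIBRATION):
    """Return the loss rate of the calibration sample closest to the
    reported (RSS, noise) pair, using squared Euclidean distance."""
    nearest = min(
        table,
        key=lambda k: (k[0] - rss_dbm) ** 2 + (k[1] - noise_dbm) ** 2,
    )
    return table[nearest]


weak_link_loss = expected_loss_rate(-84, -94)
strong_link_loss = expected_loss_rate(-61, -95)
```

A real table would be much denser and would probably also be indexed by packet size, which the slide lists as a factor.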
24Experiment to model the loss rates due to poor signal strength
- Collect another set of traces
  - Slowly send out packets
  - Place packet sniffers near both the sender and the receiver, and derive loss rate from the packet-level trace
- Seed the wireless link in the simulator with a Bernoulli loss rate that matches the loss rate in the real traces
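The two steps of this experiment can be sketched as follows. The helper names, the use of sequence numbers to match sniffer traces, and the fixed random seed are assumptions for illustration; the real simulator's link model is not shown here.

```python
import random


def loss_rate_from_traces(sent_seqs, received_seqs):
    """Derive the loss rate from sender-side and receiver-side sniffer
    traces, each given as an iterable of packet sequence numbers."""
    sent = set(sent_seqs)
    if not sent:
        raise ValueError("sender trace is empty")
    lost = sent - set(received_seqs)
    return len(lost) / len(sent)


def bernoulli_link(num_packets, loss_rate, seed=0):
    """Simulate a link that drops each packet independently with
    probability `loss_rate`. Returns one delivered flag per packet."""
    rng = random.Random(seed)
    return [rng.random() >= loss_rate for _ in range(num_packets)]


# Hypothetical traces: 10 packets sent, packets 8 and 9 never arrive.
measured = loss_rate_from_traces(range(10), range(8))
delivered = bernoulli_link(10000, measured)
simulated_loss = 1 - sum(delivered) / len(delivered)
```

Over many packets `simulated_loss` approaches `measured`, which is the sense in which the simulated link is seeded to match the trace.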
25Estimated and measured throughput when compensating for the loss rate due to poor signal strength
Loss rate and the measured throughput do not monotonically decrease with the signal strength due to the effect of auto-rate
- Even though the match is not perfect, it is not expected to be a problem, because
  - Many routing protocols try to avoid the use of poor quality links
  - Poor quality links are used only when certain parts of the mesh network have poor connectivity to the rest of the network
  - In a well-engineered network, not many nodes depend on such bad links for routing
26Stability of channel conditions
- How rapidly do channel conditions change and how often should a trace be collected?
27Temporal fluctuation in RSS
- Fluctuation magnitude is not significant
- Relative quality of signals across different numbers of walls remains stable
28Stability of channel conditions
- How rapidly do channel conditions change and how often should a trace be collected?
  - When the environment is generally static, nodes may report only the average and standard deviation of the RSS to the manager every few minutes
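A node-side summary of this kind could look like the sketch below. The report layout and field names are hypothetical, and the choice of the population standard deviation is an assumption; the slide only says that the average and standard deviation are reported.

```python
from statistics import mean, pstdev


def summarize_rss(samples_dbm):
    """Condense the RSS samples collected during one reporting interval
    into the two numbers sent to the manager, plus the sample count."""
    if not samples_dbm:
        raise ValueError("no RSS samples in this interval")
    return {
        "mean_dbm": mean(samples_dbm),
        "std_db": pstdev(samples_dbm),  # population standard deviation
        "count": len(samples_dbm),
    }


# Hypothetical samples for one neighbor over a few minutes.
report = summarize_rss([-70, -72, -68, -70])
```

Sending three numbers per neighbor instead of every sample keeps the reporting traffic small, which fits the observation that channels are relatively stable.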
29Dealing with imperfect data
- By neighborhood monitoring
  - Each node reports performance and traffic statistics for its incoming and outgoing links
  - And for other links in its communication range
    - Possible when node is in promiscuous mode
  - Thus multiple reports are sent for each link
  - Redundant reports can be used to detect inconsistency
  - Find the minimum set of nodes that can explain the inconsistency in the reports
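The last step can be read as a covering problem: every pair of nodes whose reports about the same link disagree must contain at least one node in the suspected set. The sketch below solves that exactly by brute force, which is only practical for small networks; the report format, the tolerance, and the function names are assumptions, and the paper's own algorithm may differ.

```python
from itertools import combinations


def find_inconsistent_pairs(reports, tolerance=0):
    """`reports` maps link -> {reporting node: packet count}.
    Return pairs of nodes whose counts for some link differ by more
    than `tolerance`."""
    pairs = set()
    for by_node in reports.values():
        for a, b in combinations(sorted(by_node), 2):
            if abs(by_node[a] - by_node[b]) > tolerance:
                pairs.add((a, b))
    return pairs


def smallest_explaining_set(pairs):
    """Return a smallest set of nodes such that every inconsistent pair
    contains at least one of them (exhaustive search by set size)."""
    nodes = sorted({n for pair in pairs for n in pair})
    for size in range(len(nodes) + 1):
        for candidate in combinations(nodes, size):
            chosen = set(candidate)
            if all(a in chosen or b in chosen for a, b in pairs):
                return chosen
    return set(nodes)


# Hypothetical reports: node C disagrees with everyone else.
reports = {
    ("A", "B"): {"A": 100, "B": 100, "C": 60},
    ("B", "D"): {"B": 80, "C": 40, "D": 80},
}
pairs = find_inconsistent_pairs(reports, tolerance=5)
suspected = smallest_explaining_set(pairs)
```

The exhaustive search grows exponentially with the number of nodes involved, so a larger deployment would need a heuristic such as greedily picking the node that appears in the most inconsistent pairs.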
30Summary
- How to accurately model the real behavior?
  - Solution: Use trace-based simulation
  - Problem: Simulation results are good for strong signals but deviate for bad RF conditions
    - Need to model the auto-rate control
      - Use trace-driven data
    - Need to model the loss rate due to poor signal strength
      - Use offline analysis
- How often should a trace be collected?
  - Very little data (average and standard deviation of RSS), at fairly low time granularity, as channels are relatively stable
- How to deal with imperfect data?
  - By neighborhood monitoring
31Fault diagnosis
(Diagram, stage shown: fault diagnosis. Labels: network model, types of faults, faulty network, faults directory, detected faults)
32Current fault diagnosis approaches
- AI techniques
- Rule based systems
- Neural networks
- Model traversing techniques
- Dependency graphs
- Causality graphs
- Bayesian networks
33Fault Isolation and Diagnosis
- Establish the expected performance in the simulation
- Find the difference between expected and observed performance
- Search over the fault space to detect which set of faults can reproduce performance similar to what has been observed
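The search step can be illustrated with an exhaustive search over small fault sets. Everything here is a toy under stated assumptions: `toy_simulate` stands in for the real simulator and simply adds loss to a link, the candidate faults and their magnitudes are made up, and the deck itself points to a more efficient guided search rather than brute force.

```python
from itertools import combinations


def toy_simulate(baseline, faults):
    """Stand-in for the simulator: each fault is (link, extra_loss) and
    raises the loss rate of that link. Purely illustrative."""
    result = dict(baseline)
    for link, extra in faults:
        result[link] = min(1.0, result[link] + extra)
    return result


def mismatch(predicted, observed):
    """Total absolute difference between predicted and observed loss."""
    return sum(abs(predicted[link] - observed[link]) for link in observed)


def best_fault_set(baseline, observed, candidates, max_faults=2):
    """Try every combination of up to `max_faults` candidate faults and
    keep the one whose simulated performance is closest to observed."""
    best, best_err = (), mismatch(baseline, observed)
    for k in range(1, max_faults + 1):
        for faults in combinations(candidates, k):
            err = mismatch(toy_simulate(baseline, faults), observed)
            if err < best_err - 1e-12:
                best, best_err = faults, err
    return set(best), best_err


baseline = {"1-2": 0.02, "1-4": 0.03, "2-3": 0.02}
observed = {"1-2": 0.22, "1-4": 0.23, "2-3": 0.02}
candidates = [("1-2", 0.2), ("1-4", 0.2), ("2-3", 0.2)]
faults, error = best_fault_set(baseline, observed, candidates)
```

The number of combinations grows quickly with the size of the fault space, which is why the later slides use the differing metrics to guide the search instead of enumerating it.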
34Collecting data from traces
- Trace data collection
  - Network topology
    - Each node reports its neighbor and routing tables
  - Traffic statistics
    - Each node maintains counters of traffic sent to and received from immediate neighbors
  - Physical medium
    - Each node reports signal strength of wireless links to neighbors
  - Network performance
    - Includes both the link and end-to-end performance, which can be measured through loss rate, delay, and throughput
    - Focus is on link-level performance
35Simulating the network performance
- Traffic load simulation
  - Link-based traffic simulation
  - Adjust application sending rate to match the observed link-level traffic counts
- Route simulation
  - Use actual routes taken by packets as input to the simulator
- Wireless signal
  - Use real measurement of signal strength
- Fault injection
  - Random packet dropping
  - External noise sources
  - MAC misbehavior
36Fault diagnosis algorithm
(Diagram: the simulator fed with the network settings produces the expected performance; the simulator fed with the network settings plus a faults set produces the observed performance.)
How to find the faults set?
37How to search the faults efficiently?
- Different types of faults often change one or few metrics
  - E.g., random dropping only affects link loss rate
- Thus use metrics in which observed and expected performance are significantly different to guide the search
38Scenario where faults do not have strong interactions
- Consider large deviation from expected performance as anomaly
- Use decision tree to determine the type of fault
- Fault type determines the metric to quantify performance difference
- Locate faults by finding the set of nodes and links with large difference between expected and observed performance
39Scenario where faults have strong interactions
- Get the initial diagnosis set from the decision tree algorithm
- Iteratively refine the fault set
  - Adjust the magnitudes of faults in the fault set
    - Translate difference in performance into change in fault magnitude
    - It maps the impact of a fault into its magnitude
  - Remove faults whose magnitude is too small
  - Add new faults that can explain large differences between the expected and observed performance
- Iterate until the change in the fault set is negligible
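The refinement loop can be sketched as below. This is a simplified illustration, not the paper's algorithm: `toy_loss_model` is a hypothetical stand-in in which each fault only affects its own link, so it has none of the interactions that make iteration necessary in a real simulator, and all thresholds are assumed values.

```python
def toy_loss_model(baseline, faults):
    """Stand-in for the simulator: a fault of magnitude m on a link
    raises that link's loss rate by m. No interactions are modeled."""
    return {l: min(1.0, baseline[l] + faults.get(l, 0.0)) for l in baseline}


def refine_faults(baseline, observed, initial, min_magnitude=0.01,
                  big_diff=0.05, max_iters=20, tol=1e-6):
    """Iteratively adjust, prune, and extend the fault set until it
    stops changing. Faults map link -> magnitude (extra loss rate)."""
    faults = dict(initial)
    for _ in range(max_iters):
        predicted = toy_loss_model(baseline, faults)
        updated = {}
        for link in baseline:
            diff = observed[link] - predicted[link]
            if link in faults:
                # Translate the performance difference into a change
                # of the fault's magnitude.
                magnitude = faults[link] + diff
                if magnitude >= min_magnitude:  # drop negligible faults
                    updated[link] = magnitude
            elif diff > big_diff:
                # A large unexplained difference suggests a new fault.
                updated[link] = diff
        change = sum(abs(updated.get(l, 0.0) - faults.get(l, 0.0))
                     for l in baseline)
        faults = updated
        if change < tol:
            break
    return faults


baseline = {"1-2": 0.02, "1-4": 0.03, "2-3": 0.02}
observed = {"1-2": 0.22, "1-4": 0.23, "2-3": 0.02}
initial = {"1-2": 0.1, "2-3": 0.005}  # hypothetical decision-tree output
refined = refine_faults(baseline, observed, initial)
```

Starting from an initial guess that underestimates one fault, misses another, and includes a spurious one, the loop ends with faults of about 0.2 on links 1-2 and 1-4 only.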
40Example scenario
(Figure: example topology with nodes 1 to 5)
41Example scenario
- Observed performance
  - Increased loss rate at 1-4 and 1-2
  - No increase in the sending rate of 1-4, 1-2
  - No increase in noise experienced by neighbors
(Figure: example topology with nodes 1 to 5)
Decision tree
  Increased sending rate?
    Y: Too low CW
    N: Increased noise?
      Y: Noise
      N: Increased loss?
        Y: Packet drop
        N: Normal
42Example scenario
- Observed performance
  - Increased loss rate at 1-4 and 1-2
  - No increase in the sending rate of 1-4, 1-2
  - No increase in noise experienced by neighbors
- Inference (decision tree path: no increased sending rate, no increased noise, increased loss)
  - Packet dropping at node 1
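The decision tree used in this example can be written as a short function. The metric names, units, and threshold values below are assumptions for illustration; only the order of the checks and the resulting fault types come from the slide.

```python
def classify_fault(expected, observed, thresholds):
    """Walk the decision tree for one node or link.

    `expected` comes from the simulation, `observed` from the network.
    Each check asks whether a metric increased by more than its
    threshold, in the order shown on the slide.
    """
    if observed["sending_rate"] - expected["sending_rate"] > thresholds["sending_rate"]:
        return "too low CW"
    if observed["noise"] - expected["noise"] > thresholds["noise"]:
        return "external noise"
    if observed["loss_rate"] - expected["loss_rate"] > thresholds["loss_rate"]:
        return "packet dropping"
    return "normal"


# Hypothetical numbers matching the scenario: sending rate and noise are
# essentially unchanged, loss rate is clearly higher than expected.
thresholds = {"sending_rate": 50.0, "noise": 3.0, "loss_rate": 0.05}
expected = {"sending_rate": 400.0, "noise": -95.0, "loss_rate": 0.02}
observed = {"sending_rate": 405.0, "noise": -94.5, "loss_rate": 0.20}
verdict = classify_fault(expected, observed, thresholds)
```

With these inputs the first two checks fail and the third succeeds, so `verdict` is "packet dropping", which agrees with the inference on the slide.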
43Accuracy of fault diagnosis
- Correctness of the model
- Complete information
- Consistent information
- Timely information
- Correctness of the reported symptoms
- Right size of the threshold to report a symptom
- Difference in the behavior of faults
- Timely reporting of symptoms
44System implementation
- Windows XP
  - Agents run on every wireless node and report collected information on demand
  - Managers collect and analyze information
  - Collected information is cast into performance counters supported by Windows
  - Manager is connected to a backend simulator. Collected information is converted to a script to drive the simulation
- Testbed
  - Multihop wireless testbed built using IEEE 802.11a cards
  - Commercially available network sniffer called Airopeek is used for data collection
  - Native 802.11 NICs provide a rich set of networking information
45Evaluation: Data collection overhead
Data collection traffic has little effect
Overhead < 800 bits/s/node
Management traffic overhead
Performance of FTP flow with and without data collection
No data cleaning: Each link is reported only once
With data cleaning: Each link is reported by all observers for consistency check
46Data cleaning effectiveness
Coverage greater than 80% in all cases
Higher accuracy with grid topology
Higher coverage when using history
Higher accuracy with denser networks
Higher accuracy with client-server traffic
47Evaluation: Fault diagnosis
Detecting external noise
- Symptom: Significant difference in noise level in nodes
- Noise sources are correctly identified with at most one or two false positives
- Inference error in magnitudes of noises is within 4%
Detecting random dropping
- Symptom: Significant difference in loss rates in links
- Less than 20% of faulty links are left undetected
- No-effect faults are faulty links sending less than the threshold (250) packets of data
48Evaluation: Fault diagnosis
Detecting combinations of all
Detecting MAC misbehavior
- Symptom: Significant discrepancy in throughput on links
- Coverage is mostly around 80% or higher
- False positives within 2
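The two evaluation metrics can be computed as in the sketch below. The definitions are assumptions, since the slides do not spell them out: coverage is taken as the fraction of injected faults that the diagnosis found, and false positives as the count of diagnosed faults that were never injected. The link names are hypothetical.

```python
def diagnosis_metrics(true_faults, diagnosed):
    """Compare the diagnosed fault set with the injected (true) one.

    Coverage: fraction of true faults that were diagnosed.
    False positives: number of diagnosed faults that are not true faults.
    """
    true_faults, diagnosed = set(true_faults), set(diagnosed)
    if not true_faults:
        raise ValueError("no injected faults to compare against")
    detected = true_faults & diagnosed
    return {
        "coverage": len(detected) / len(true_faults),
        "false_positives": len(diagnosed - true_faults),
    }


# Hypothetical run: five faulty links injected, four found, one wrong.
injected = {"1-2", "1-4", "2-3", "3-5", "4-5"}
reported = {"1-2", "1-4", "2-3", "3-5", "2-5"}
metrics = diagnosis_metrics(injected, reported)
```

In this made-up run the coverage is 0.8 with one false positive.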
49what-if analysis
(Diagram, stage shown: what-if analysis. Labels: network model, types of faults, faults directory, detected faults, corrective measures)
50What-if analysis
(Diagram labels: diagnosis, topology, corrective measures)
51Limitations
- Limited by the accuracy of the simulator
- Time to detect the faults is acceptable for detecting long-term faults but not transient faults
- The choice of traces to drive the simulation has important implications
- Focus has only been on faults resulting in different behavior
52Conclusion
- Used trace data for modeling the network
- Data collection techniques are presented to collect network information and detect a deviation from the expected performance
- A fault diagnosis algorithm is proposed to detect the root causes of failure
- A scheme for what-if analysis is proposed to evaluate alternative network configurations for efficient network operation
53Future work
- Validation on a large test-bed
- Performance analysis in presence of mobility
- Detecting malicious attacks
- Diagnosis in presence of incomplete network information
- More deeply investigating the potential of what-if analysis
54References
- L. Qiu, P. Bahl, A. Rao, L. Zhou, Fault Detection, Isolation, and Diagnosis in Multihop Wireless Networks, Microsoft Technical Report, Microsoft Research-TR-2004-11, Dec. 2003
- M. Steinder, A. Sethi, A survey of fault localization techniques in computer networks, Technical Report 2001, CIS Dept., Univ. of Delaware, Feb. 2001
- M. Steinder, Probabilistic inference for diagnosing service failures in communication systems, PhD thesis, Univ. of Delaware, 2003
55Questions
- What is the proposed solution to model the throughput when the signal strength is poor? In Table 2, the simulated throughput monotonically decreases with the loss rate while the measured throughput does not. Why?
- What could cause false positives in the fault diagnosis results? When can the false positive ratio increase?
- http://www.cis.udel.edu/natu/861/861.html