Fault Detection, Isolation, and Diagnosis In Multihop Wireless Networks

1
Fault Detection, Isolation, and Diagnosis In
Multihop Wireless Networks

Lili Qiu, Paramvir Bahl, Ananth Rao, and Lidong
Zhou
Microsoft Research
Presented by
-Maitreya Natu

2
Network Management
Faults directory

Root cause
Healthy network
Corrective measure
Faulty network
3
Tasks involved in Network Management

Continuously monitoring the functioning
Collecting information about the nodes and the
links
Removing inconsistencies and noise from the
reported information
Analyzing the information
Taking appropriate actions to improve network
reliability and performance

4
Challenges in wireless networks

Dynamic and unpredictable topology
Link errors due to fluctuating environmental
conditions
Node mobility
Limited capacity
Scarcity of resources
Link attacks

5
Proposed framework

Reproduce, inside a simulator, the real-world
events that took place
Use online trace-driven simulation to detect
faults and analyze the root causes

6
Network Management: creating a network model
[Diagram labels: network model, types of faults, healthy network, faults directory]

7
Network Management: fault diagnosis
[Diagram labels: network model, types of faults, faulty network, faults directory, detected faults]

8
Network Management: what-if analysis
[Diagram labels: network model, types of faults, faults directory, corrective measures, detected faults]
9
Key issues

How to accurately reproduce what happened in the
network inside a simulator
How to build fault diagnosis on top of a
simulator to perform root cause analysis

10
Accurate modeling

Use real traces from the diagnosed network
Removes dependency on generic theoretical models
Captures nuances of the hardware, software and
environment of the particular network
Collect good quality data
By developing a technique to effectively rule out
erroneous data

11
Fault diagnosis

Performance data emitted by trace-driven
simulation is used as the baseline
Any significant deviation indicates a potential
fault
Simulator selectively injects a set of suspected
faults and searches for the set that most closely
reproduces the observed performance
An efficient algorithm is designed to determine
root causes
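
The deviation check described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `find_deviations`, the per-link dictionaries, and the threshold value are all assumptions made for the example, and the numbers are made up.

```python
def find_deviations(expected, observed, threshold):
    """Return the links whose observed value differs from the simulated
    baseline by more than `threshold` (hypothetical helper)."""
    return sorted(
        link for link in expected
        if abs(observed.get(link, 0.0) - expected[link]) > threshold
    )

# Illustrative loss rates per link: baseline from simulation vs. reported values.
expected_loss = {("1", "2"): 0.02, ("1", "4"): 0.03, ("2", "3"): 0.01}
observed_loss = {("1", "2"): 0.25, ("1", "4"): 0.30, ("2", "3"): 0.02}

suspects = find_deviations(expected_loss, observed_loss, threshold=0.1)
print(suspects)
```

Only the links flagged here would need to be considered when injecting candidate faults into the simulator.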

12
System Overview
1. Receive cleaned data
2. Drive simulation
3. Compute expected performance
4. Compare expected and average performance
5. Discrepancy found
6. Search for set of faults that result in best
explanation
7. Report the cause of failure
[Diagram labels: faults directory, link/node failure, interference injection, traffic simulator, topology changes, routing update, link RSS, link load, simulator, error +/-, loss rate/throughput/noise (expected and reported)]
13
Why Simulation Based Diagnosis?

Much better insights into the network behavior
than any heuristic or theoretical technique
Highly customizable and applies to a large class
of networks
Ability to perform what-if analysis
Helps to foresee the consequences of a corrective
action
Recent advances in simulators have made possible
their use for real-time analysis

14
Accurate modeling
Network model
Types of faults
Healthy network
Faults directory

15
Current network models

Bayesian networks to map symptom-fault
dependencies
Context Free Grammars
Correlation Matrix

16
Can on-line simulations be used as core tool?
17
Building confidence in simulator accuracy

Problem
Hard to accurately model the physical layer and
the RF propagation
Traffic demands on the router are hard to predict

18
Building confidence in simulator accuracy

Solution
After-the-fact simulation
Agents periodically report information about the
link conditions and traffic patterns to the link
simulators

19
Simulations when the RF condition of the link is
good
Modeling the contention from flows within
the interference and communication ranges.
Modeling the overheads of the protocol stack such
as parity bits, MAC-layer back-off, IEEE 802.11
inter-frame spacing and ACK, and headers.
20
Simulations with varying received signal strength
Simulator estimate deviates from the measured
throughput when signal strength is poor
Measured throughput closely matches the simulator's
estimate when signal quality is good
21
Why simulation results deviate in case of poor
signal strength?

Lack of an accurate model of packet loss as a
function of packet size, RSS, and ambient noise
Loss depends on the signal processing hardware and
the RF antenna within the wireless cards
Lack of an accurate model of auto-rate control
WLAN cards adjust the sending rate
based on the transmission conditions

22
How to model auto-rate control done by WLAN cards?

Use trace-driven simulation
When auto-rate is in use
Collect the rate at which the wireless card is
operating and provide the reported rate to the
simulator
Otherwise
Data rate is known to the simulator

23
How to accurately model packet loss as a function
of packet size, RSS, and ambient noise?

Use offline analysis
Calibrate the wireless cards and create a
database associating environmental factors with
expected performance
E.g., mapping from signal strength and noise to
loss rate
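
A calibration database of this kind could be queried as in the sketch below. Everything here is an assumption for illustration: the table rows are made-up values rather than measurements, the name `expected_loss_rate` is hypothetical, and keying the lookup on signal-to-noise ratio (RSS minus noise) is one possible choice that the slides do not specify.

```python
# Hypothetical calibration rows: (RSS in dBm, noise in dBm, loss rate).
# The numbers are placeholders, not measurements from any real card.
CALIBRATION = [
    (-90, -95, 0.60),
    (-80, -95, 0.20),
    (-70, -95, 0.05),
    (-60, -95, 0.01),
]

def expected_loss_rate(rss_dbm, noise_dbm, table=CALIBRATION):
    """Return the loss rate of the calibration row whose SNR is nearest
    to the SNR of the queried link (assumed lookup rule)."""
    snr = rss_dbm - noise_dbm
    nearest = min(table, key=lambda row: abs((row[0] - row[1]) - snr))
    return nearest[2]

print(expected_loss_rate(-78, -95))
```

A real database would be built per card model, since the slides note that loss depends on the card's hardware and antenna.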

24
Experiment to model the loss rates due to poor
signal strength

Collect another set of traces
Slowly send out packets
Place packet sniffers near both the sender and
the receiver, and derive loss rate from the
packet level trace
Seed the wireless link in the simulator with a
Bernoulli loss rate that matches the loss rate in
the real traces
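
The two steps above (deriving a loss rate from sender-side and receiver-side traces, then seeding a simulated link with a Bernoulli loss process) might look like this. It is a sketch under assumptions: packets are identified by sequence number, and `loss_rate_from_traces` and `BernoulliLink` are hypothetical names, not part of any simulator's API.

```python
import random

def loss_rate_from_traces(sent_seqs, received_seqs):
    """Fraction of sent sequence numbers that never appear at the receiver."""
    sent = set(sent_seqs)
    if not sent:
        raise ValueError("sender trace is empty")
    return len(sent - set(received_seqs)) / len(sent)

class BernoulliLink:
    """Toy link that drops each packet independently with a fixed probability."""

    def __init__(self, loss_rate, seed=None):
        self.loss_rate = loss_rate
        self._rng = random.Random(seed)

    def deliver(self):
        """Return True if the packet survives the link."""
        return self._rng.random() >= self.loss_rate

# Synthetic traces: every fourth packet is missing at the receiver.
sent = range(1000)
received = [seq for seq in sent if seq % 4 != 0]
rate = loss_rate_from_traces(sent, received)

link = BernoulliLink(rate, seed=1)
delivered = sum(link.deliver() for _ in range(10000))
print(rate, delivered / 10000)
```

A Bernoulli process ignores burstiness in real losses, which is one reason the simulated and measured throughput need not match exactly.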

25
Estimated and measured throughput when
compensating for the loss rate due to poor signal
strength
Loss rate and the measured throughput do not
monotonically decrease with the signal strength
due to the effect of auto-rate

Even though the match is not perfect, it is not
expected to be a problem, because
many routing protocols try to avoid the use of
poor quality links
Poor quality links are used only when certain
parts of the mesh network have poor connectivity to
the rest of the network
In a well-engineered network, not many nodes
depend on such bad links for routing

26
Stability of channel conditions

How rapidly do channel conditions change, and how
often should a trace be collected?

27
Temporal fluctuation in RSS

Fluctuation magnitude is not significant
Relative quality of signals across different
numbers of walls remains stable

28
Stability of channel conditions

How rapidly do channel conditions change, and how
often should a trace be collected?
When the environment is generally static, nodes
may report only the average and standard
deviation of the RSS to the manager every few
minutes
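
As a small illustration of the report described above, a node could reduce a window of RSS samples to two numbers. The function name and dictionary keys are hypothetical, and using the population standard deviation (rather than the sample one) is an assumption.

```python
import statistics

def summarize_rss(samples_dbm):
    """Reduce a window of RSS samples (dBm) to the two values a node reports."""
    return {
        "mean_dbm": statistics.mean(samples_dbm),
        "std_db": statistics.pstdev(samples_dbm),  # population std, assumed
    }

report = summarize_rss([-70, -72, -68, -70])
print(report)
```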

29
Dealing with imperfect data

By neighborhood monitoring
Each node reports performance and traffic
statistics for its incoming and outgoing links
And for other links in its communication range
Possible when node is in promiscuous mode
Thus multiple reports are sent for each link
Redundant reports can be used to detect
inconsistency
Find the minimum set of nodes that can explain
the inconsistency in the reports
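
One way to approximate "the minimum set of nodes that can explain the inconsistency" is sketched below. It treats each pair of disagreeing reporters as an edge in a graph and greedily picks the node involved in the most disagreements. This greedy heuristic, the tolerance parameter, and all names are assumptions for illustration; the slides do not say how the minimum set is computed.

```python
from collections import defaultdict

def inconsistent_pairs(reports, tolerance):
    """reports maps link -> {reporting node: reported value}.
    Return the set of reporter pairs that disagree about some link."""
    pairs = set()
    for by_node in reports.values():
        nodes = sorted(by_node)
        for i, a in enumerate(nodes):
            for b in nodes[i + 1:]:
                if abs(by_node[a] - by_node[b]) > tolerance:
                    pairs.add((a, b))
    return pairs

def greedy_suspects(pairs):
    """Repeatedly pick the node that appears in the most unexplained
    disagreements until every disagreement involves a picked node."""
    remaining = set(pairs)
    suspects = []
    while remaining:
        degree = defaultdict(int)
        for a, b in remaining:
            degree[a] += 1
            degree[b] += 1
        # Ties are broken by node name so the result is deterministic.
        worst = max(sorted(degree), key=lambda n: degree[n])
        suspects.append(worst)
        remaining = {p for p in remaining if worst not in p}
    return suspects

# Node C overhears both links and reports packet counts that disagree
# with the endpoints (illustrative numbers).
reports = {
    ("A", "B"): {"A": 100, "B": 100, "C": 40},
    ("B", "D"): {"B": 50, "D": 50, "C": 90},
}
print(greedy_suspects(inconsistent_pairs(reports, tolerance=5)))
```

The greedy choice does not always yield the true minimum set, but it is cheap and gives the intuitive answer when one node is the common party in most disagreements.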

30
Summary

How to accurately model the real behavior?
Solution: Use trace-based simulation
Problem: Simulation results are good for strong
signals but deviate for bad RF conditions
Need to model the auto-rate control
Use trace-driven data
Need to model the loss rate due to poor signal
strength
Use offline analysis
How often should a trace be collected?
Very little data (average and standard deviation
of RSS), at fairly low time granularity, as
channels are relatively stable
How to deal with imperfect data?
By neighborhood monitoring

31
Fault diagnosis
Network model
Types of faults
Faulty network
Faults directory

Detected faults
32
Current fault diagnosis approaches

AI techniques
Rule based systems
Neural networks
Model traversing techniques
Dependency graphs
Causality graphs
Bayesian networks

33
Fault Isolation and Diagnosis

Establish the expected performance in the
simulation
Find difference between expected and observed
performance
Search over the fault space to detect which set
of faults can re-produce performance similar to
what has been observed

34
Collecting data from traces

Trace data collection
Network topology
Each node reports its neighbor and routing tables
Traffic statistics
Each node maintains counters of traffic sent and
received from immediate neighbors
Physical medium
Each node reports signal strength of wireless
links to neighbors
Network performance
Includes both the link and end-to-end
performance, which can be measured through loss
rate, delay, and throughput
Focus is on link level performance

35
Simulating the network performance

Traffic load simulation
Link based traffic simulation
Adjust application sending rate to match the
observed link-level traffic counts
Route simulation
Use actual routes taken by packets as input to
the simulator
Wireless signal
Use real measurement of signal strength
Fault injection
Random packet dropping
External noise sources
MAC misbehavior

36
Fault diagnosis algorithm

General approach

Simulator (network settings) -> expected performance
Simulator (network settings + faults set) -> observed performance
How to find the faults set?
37
How to search the faults efficiently?

Different types of faults often change only one or
a few metrics
E.g., random dropping only affects link loss rate
Thus use metrics in which observed and expected
performance is significantly different, to guide
the search

38
Scenario where faults do not have strong
interactions

Consider large deviation from expected
performance as anomaly
Use decision tree to determine the type of fault
Fault type determines the metric to quantify
performance difference
Locate faults by finding the set of nodes and
links with large difference between expected and
observed performance
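
The decision-tree step can be sketched as a chain of threshold checks, following the branch order shown in the example-scenario slide (sending rate, then noise, then loss). Each argument is the observed value minus the expected value for that metric; the function name, argument names, and thresholds are hypothetical.

```python
def classify_fault(sending_rate_increase, noise_increase, loss_increase,
                   rate_threshold, noise_threshold, loss_threshold):
    """Map metric deviations (observed minus expected) to a fault type."""
    if sending_rate_increase > rate_threshold:
        return "too low CW"
    if noise_increase > noise_threshold:
        return "external noise"
    if loss_increase > loss_threshold:
        return "packet drop"
    return "normal"

# Loss is up on a link, but sending rate and noise are unchanged.
verdict = classify_fault(0.0, 0.0, 0.25,
                         rate_threshold=0.1, noise_threshold=3.0,
                         loss_threshold=0.1)
print(verdict)
```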

39
Scenario where faults have strong interactions

Get the initial diagnosis set from the decision
tree algorithm
Iteratively refine the fault set
Adjust the magnitudes of faults in the fault set
Translate the difference in performance into a
change in fault magnitude
This maps the impact of a fault into its magnitude
Remove faults whose magnitudes are too small
Add new faults that can explain large differences
between the expected and observed performances
Iterate till the change in fault set is negligible
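
The refinement loop above can be illustrated on a toy model. In this sketch the "simulator" is a stand-in that combines a link's baseline loss with an injected drop probability, and the only fault type is random dropping on a link. The functions `simulate` and `refine`, the thresholds, and the update rule (add the remaining loss gap to the magnitude) are all assumptions; a real system would call the network simulator instead.

```python
def simulate(base_loss, faults):
    """Toy stand-in for the simulator: baseline loss and an injected drop
    probability act as independent loss events on each link."""
    return {link: 1 - (1 - b) * (1 - faults.get(link, 0.0))
            for link, b in base_loss.items()}

def refine(base_loss, observed, initial_faults, min_magnitude=0.01,
           add_threshold=0.05, tol=1e-6, max_iters=100):
    """Iteratively adjust, prune, and extend the fault set until it stabilizes."""
    faults = dict(initial_faults)
    for _ in range(max_iters):
        simulated = simulate(base_loss, faults)
        updated = {}
        # Adjust magnitudes by the remaining gap; drop faults that become tiny.
        for link, magnitude in faults.items():
            adjusted = magnitude + (observed[link] - simulated[link])
            adjusted = min(max(adjusted, 0.0), 1.0)
            if adjusted >= min_magnitude:
                updated[link] = adjusted
        # Add new faults where a large gap is still unexplained.
        for link in observed:
            gap = observed[link] - simulated[link]
            if link not in faults and gap > add_threshold:
                updated[link] = gap
        change = max((abs(updated.get(l, 0.0) - faults.get(l, 0.0))
                      for l in set(updated) | set(faults)), default=0.0)
        faults = updated
        if change < tol:
            break
    return faults

base = {"1-2": 0.1, "1-4": 0.1, "2-3": 0.1}
true_faults = {"1-2": 0.3, "1-4": 0.2}
observed = simulate(base, true_faults)

# Start from an imperfect initial diagnosis: one fault underestimated,
# one missing, and one spurious.
result = refine(base, observed, {"1-2": 0.1, "2-3": 0.05})
print(result)
```

In this toy model each pass shrinks the remaining error by a constant factor, so the loop settles after a handful of iterations; with a real simulator, convergence would depend on how strongly the faults interact.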

40
Example scenario
[Figure: example topology with five nodes, numbered 1 to 5]
41
Example scenario

Observed performance
Increased loss rate at 1-4 and 1-2
No increase in the sending rate of 1-4, 1-2
No increase in noise experienced by
neighbors

[Figure: example topology with five nodes, numbered 1 to 5]
Inference (decision tree)
Increased sending rate?
  Y: Too low CW
  N: Increased noise?
    Y: Noise
    N: Increased loss?
      Y: Packet drop
      N: Normal
42
Example scenario

Observed performance and decision tree as on the
previous slide
Decision tree path: sending rate not increased,
noise not increased, loss increased -> Packet drop
Inference
Packet dropping at node 1
43
Accuracy of fault diagnosis

Correctness of the model
Complete information
Consistent information
Timely information
Correctness of the reported symptoms
Right size of the threshold to report a symptom
Difference in the behavior of faults
Timely reporting of symptoms

44
System implementation

Windows XP
Agents run on every wireless node and report
collected information on demand
Managers collect and analyze information
Collected information is cast into performance
counters supported by Windows
Manager is connected to a backend simulator.
Collected information is converted to script to
drive the simulation
Testbed
Multihop wireless testbed built using IEEE
802.11a cards
Commercially available network sniffer called
AiroPeek is used for data collection
Native 802.11 NICs provide a rich set of networking
information

45
Evaluation: Data collection overhead
Data collection traffic has little effect
Overhead < 800 bits/s/node
Management traffic overhead
Performance of FTP flow with and without data
collection
No data cleaning: each link is reported only
once
With data cleaning: each link is reported by
all observers for consistency check
46
Data cleaning effectiveness
Coverage greater than 80% in all cases
Higher accuracy with grid topology
Higher coverage when using history
Higher accuracy with denser networks
Higher accuracy with client-server traffic
47
Evaluation: Fault diagnosis

Detecting external noise
Symptom: Significant difference in noise level in
nodes
Noise sources are correctly identified with
at most one or two false positives
Inference error in magnitudes of noises is
within 4%

Detecting random dropping
Symptom: Significant difference in loss rates in
links
Less than 20% of faulty links are left
undetected
No-effect faults are faulty links sending
fewer than the threshold (250) packets of data

48
Evaluation: Fault diagnosis
Detecting combinations of all fault types
Detecting MAC misbehavior

Symptom: Significant discrepancy in throughput on
links
Coverage is mostly around 80% or higher
False positives within 2
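
The slides report coverage and false positives without defining them. The sketch below uses assumed definitions: coverage is the fraction of truly faulty locations that the diagnosis reports, and false positives are counted as the number of reported locations that are not actually faulty. The function name and the sample sets are hypothetical.

```python
def coverage_and_false_positives(true_faults, reported_faults):
    """Return (coverage, false positive count) under the assumed definitions."""
    true_faults, reported_faults = set(true_faults), set(reported_faults)
    if not true_faults:
        raise ValueError("coverage is undefined without any true faults")
    coverage = len(true_faults & reported_faults) / len(true_faults)
    false_positives = len(reported_faults - true_faults)
    return coverage, false_positives

# Five injected faults; the diagnosis finds four of them plus one spurious link.
metrics = coverage_and_false_positives(
    {"a", "b", "c", "d", "e"}, {"a", "b", "c", "d", "x"})
print(metrics)
```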

49
what-if analysis
Network model
Types of faults
Faults directory
Corrective measures

Detected faults
50
What-if analysis
Diagnosis
Topology
Corrective measures
51
Limitations

Limited by accuracy of the simulator
Time to detect the faults is acceptable for
detecting long term faults but not transient
faults
Choice of traces to drive the simulation has
important implications
Focus has only been on faults resulting in
different behavior

52
Conclusion

Used trace data for modeling the network
Data collection techniques are presented to
collect network information and detect a
deviation from the expected performance
Fault diagnosis algorithm is proposed to detect
the root causes of failure
A scheme for what-if analysis is proposed to
evaluate alternative network configuration for
efficient network operation

53
Future work

Validation on a large test-bed
Performance analysis in presence of mobility
Detecting malicious attacks
Diagnosis in presence of incomplete network
information
More deeply investigating the potential of
what-if analysis

54
References

L. Qiu, P. Bahl, A. Rao, L. Zhou, Fault
Detection, Isolation, and Diagnosis in Multihop
Wireless Networks, Microsoft Technical Report,
Microsoft Research-TR-2004-11, Dec. 2003
M. Steinder, A. Sethi, A survey of fault
localization techniques in computer networks,
Technical Report 2001, CIS Dept., Univ of
Delaware, Feb 2001
M. Steinder, Probabilistic inference for
diagnosing service failures in communication
systems, PhD thesis, Univ. of Delaware, 2003

55
Questions

What is the proposed solution to model the throughput
when the signal strength is poor? In Table 2, the
simulated throughput monotonically decreases with
the loss rate while the measured throughput does
not. Why?
What could cause false positives in the fault
diagnosis results? When can the false positive
ratio increase?
http://www.cis.udel.edu/natu/861/861.html
