Fault Detection, Isolation, and Diagnosis in Multihop Wireless Networks: Transcript and Presenter's Notes

1
Fault Detection, Isolation, and Diagnosis In
Multihop Wireless Networks
  • Lili Qiu, Paramvir Bahl, Ananth Rao, and Lidong
    Zhou
  • Microsoft Research
  • Presented by Maitreya Natu

2
Network Management
(Diagram: the network management loop. A faulty network is analyzed against a faults directory to find the root cause; a corrective measure restores the healthy network.)
3
Tasks involved in Network Management
  • Continuously monitoring network operation
  • Collecting information about the nodes and the
    links
  • Removing inconsistencies and noise from the
    reported information
  • Analyzing the information
  • Taking appropriate actions to improve network
    reliability and performance

4
Challenges in wireless networks
  • Dynamic and unpredictable topology
  • Link errors due to fluctuating environmental
    conditions
  • Node mobility
  • Limited capacity
  • Scarcity of resources
  • Link attacks

5
Proposed framework
  • Reproduce, inside a simulator, the real-world
    events that took place
  • Use online trace-driven simulation to detect
    faults and analyze the root causes

6
Network Management
(Roadmap diagram, stage 1: creating a network model. Components: network model, types of faults, healthy network, faults directory.)
7
Network Management
(Roadmap diagram, stage 2: fault diagnosis. Components: network model, types of faults, faulty network, faults directory; output: detected faults.)
8
Network Management
(Roadmap diagram, stage 3: what-if analysis. Components: network model, types of faults, faults directory, detected faults; output: corrective measures.)
9
Key issues
  • How to accurately reproduce what happened in the
    network inside a simulator
  • How to build fault diagnosis on top of a
    simulator to perform root cause analysis

10
Accurate modeling
  • Use real traces from the diagnosed network
  • Removes dependency on generic theoretical models
  • Captures nuances of the hardware, software and
    environment of the particular network
  • Collect good quality data
  • By developing a technique to effectively rule out
    erroneous data

11
Fault diagnosis
  • Performance data produced by the trace-driven
    simulation is used as the baseline
  • Any significant deviation from the baseline
    indicates a potential fault
  • The simulator selectively injects sets of suspected
    faults and searches for the set that best
    reproduces the observed performance
  • An efficient search algorithm is designed to
    determine root causes

12
System Overview
(System overview diagram. Agents' cleaned reports of link RSS, link load, traffic, topology changes, and routing updates drive the simulator; the faults directory supplies candidate faults for injection, such as link/node failure, interference, packet-dropping error, and noise. Steps: 1. Receive cleaned data. 2. Drive simulation. 3. Compute expected performance (loss rate, throughput, noise). 4. Compare expected and observed average performance. 5. Discrepancy found. 6. Search for the set of faults that results in the best explanation. 7. Report the cause of failure.)
13
Why Simulation Based Diagnosis?
  • Provides much better insight into network behavior
    than heuristic or purely theoretical techniques
  • Highly customizable and applicable to a large class
    of networks
  • Ability to perform what-if analysis
  • Helps to foresee the consequences of a corrective
    action
  • Recent advances in simulators have made their use
    for real-time analysis possible

14
Accurate modeling
(Roadmap diagram, stage 1 revisited: network model, types of faults, healthy network, faults directory.)
15
Current network models
  • Bayesian networks to map symptom-fault
    dependencies
  • Context Free Grammars
  • Correlation Matrix

16
Can on-line simulations be used as the core tool?
17
Building confidence in simulator accuracy
  • Problem
  • Hard to accurately model the physical layer and
    RF propagation
  • Traffic demands on the routers are hard to predict

18
Building confidence in simulator accuracy
  • Problem
  • Hard to accurately model the physical layer and
    RF propagation
  • Traffic demands on the routers are hard to predict
  • Solution
  • After-the-fact simulation
  • Agents periodically report information about link
    conditions and traffic patterns to the simulator

19
Simulations when the RF condition of the link is
good
  • Modeling the contention from flows within the
    interference and communication ranges
  • Modeling the overheads of the protocol stack, such
    as parity bits, MAC-layer back-off, IEEE 802.11
    inter-frame spacing and ACKs, and headers
20
Simulations with varying received signal strength
  • When signal quality is good, measured throughput
    closely matches the simulator's estimate
  • When signal strength is poor, the simulator's
    estimate deviates from the real measurement
21
Why do simulation results deviate when signal
strength is poor?
  • Lack of an accurate packet-loss model as a function
    of packet size, RSS, and ambient noise
  • Loss depends on the signal-processing hardware and
    the RF antenna within the wireless cards
  • Lack of accurate auto-rate control modeling
  • WLAN cards adjust the sending rate based on
    transmission conditions

22
How to model the auto-rate control done by WLAN cards?
  • Use trace-driven simulation (see the sketch below)
  • When auto-rate is in use
  • Collect the rate at which the wireless card is
    operating and provide the reported rate to the
    simulator
  • Otherwise
  • The data rate is known to the simulator

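A minimal sketch of this idea, not the paper's implementation: the (timestamp, rate) sample format and the fixed_rate switch are assumptions made for illustration.

```python
import bisect

class RateTrace:
    """Replays the PHY rate a wireless card reported over time.

    Assumed input: `samples` is a time-sorted list of (timestamp_s, rate_mbps)
    pairs collected from the card while auto-rate was in use.
    """
    def __init__(self, samples, fixed_rate=None):
        self.times = [t for t, _ in samples]
        self.rates = [r for _, r in samples]
        self.fixed_rate = fixed_rate          # set when auto-rate is disabled

    def rate_at(self, t):
        if self.fixed_rate is not None:
            return self.fixed_rate            # data rate is known a priori
        i = bisect.bisect_right(self.times, t) - 1
        return self.rates[max(i, 0)]          # most recent reported rate

# Example: the card dropped from 54 to 24 Mbps at t = 2.5 s.
trace = RateTrace([(0.0, 54.0), (2.5, 24.0)])
assert trace.rate_at(1.0) == 54.0 and trace.rate_at(3.0) == 24.0
```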
23
How to model packet loss accurately as a function
of packet size, RSS, and ambient noise?
  • Use offline analysis
  • Calibrate the wireless cards and create a database
    associating environmental factors with expected
    performance (see the sketch below)
  • E.g., a mapping from signal strength and noise to
    loss rate

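As a toy illustration of the calibration database: a table maps measured (RSS, noise) points to loss rates, and the simulator queries the nearest calibrated point. The table values are made-up placeholders, and packet size is omitted for brevity.

```python
# Offline-built calibration table: (rss_dbm, noise_dbm) -> loss rate.
# The entries below are illustrative placeholders, not measured values.
CALIBRATION = {
    (-60, -95): 0.00,
    (-75, -95): 0.02,
    (-85, -95): 0.15,
    (-85, -88): 0.40,
}

def expected_loss_rate(rss_dbm, noise_dbm):
    """Look up the loss rate of the nearest calibrated (RSS, noise) point."""
    nearest = min(CALIBRATION,
                  key=lambda p: (p[0] - rss_dbm) ** 2 + (p[1] - noise_dbm) ** 2)
    return CALIBRATION[nearest]

assert expected_loss_rate(-74, -94) == 0.02
```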
24
Experiment to model the loss rates due to poor
signal strength
  • Collect another set of traces
  • Send out packets at a low rate
  • Place packet sniffers near both the sender and the
    receiver, and derive the loss rate from the
    packet-level trace
  • Seed the wireless link in the simulator with a
    Bernoulli loss rate that matches the loss rate in
    the real traces (see the sketch below)

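A minimal sketch of seeding the simulated link with the measured loss rate; the class interface is an assumption for illustration.

```python
import random

class BernoulliLossLink:
    """Simulated link that drops each packet independently with the loss
    probability derived from the sniffer traces."""

    def __init__(self, loss_rate, seed=0):
        self.loss_rate = loss_rate
        self._rng = random.Random(seed)   # seeded for reproducible runs

    def delivered(self):
        """True if the packet survives this Bernoulli trial."""
        return self._rng.random() >= self.loss_rate

link = BernoulliLossLink(loss_rate=0.15)
survivors = sum(link.delivered() for _ in range(10000))  # roughly 8500
```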
25
Estimated and measured throughput when compensating
for the loss rate due to poor signal strength
  • The loss rate and the measured throughput do not
    decrease monotonically with signal strength, due
    to the effect of auto-rate
  • Even though the match is not perfect, this is not
    expected to be a problem, because
  • Many routing protocols try to avoid using
    poor-quality links
  • Poor-quality links are used only when certain parts
    of the mesh network have poor connectivity to the
    rest of the network
  • In a well-engineered network, few nodes depend on
    such bad links for routing

26
Stability of channel conditions
  • How rapidly do channel conditions change, and how
    often should a trace be collected?

27
Temporal fluctuation in RSS
  • The magnitude of the fluctuation is not significant
  • The relative quality of signals across different
    numbers of walls remains stable

28
Stability of channel conditions
  • How rapidly do channel conditions change, and how
    often should a trace be collected?
  • When the environment is generally static, nodes
    may report only the average and standard deviation
    of the RSS to the manager every few minutes

29
Dealing with imperfect data
  • By neighborhood monitoring
  • Each node reports performance and traffic
    statistics for its incoming and outgoing links
  • And for other links in its communication range
  • Possible when the node is in promiscuous mode
  • Thus multiple reports are sent for each link
  • The redundant reports can be used to detect
    inconsistencies
  • Find the minimum set of nodes that can explain the
    inconsistency in the reports (see the sketch below)

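One plausible reading of this step is a greedy minimum cover over the reporters involved in disagreeing reports. The sketch below assumes reports arrive as (reporter, link, loss_rate) triples; it stands in for the paper's actual cleaning algorithm.

```python
from collections import defaultdict

def suspect_reporters(reports, tolerance=0.1):
    """Flag links whose redundant reports disagree, then greedily pick the
    smallest set of reporters that covers every inconsistent link."""
    by_link = defaultdict(list)
    for reporter, link, value in reports:
        by_link[link].append((reporter, value))

    inconsistent = {
        link: {r for r, _ in obs}
        for link, obs in by_link.items()
        if max(v for _, v in obs) - min(v for _, v in obs) > tolerance
    }

    suspects, uncovered = set(), set(inconsistent)
    while uncovered:
        counts = defaultdict(int)
        for link in uncovered:                  # count how many unexplained
            for r in inconsistent[link]:        # links each reporter touches
                counts[r] += 1
        best = max(counts, key=counts.get)      # greedy set-cover choice
        suspects.add(best)
        uncovered = {l for l in uncovered if best not in inconsistent[l]}
    return suspects
```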
30
Summary
  • How to accurately model the real behavior?
  • Solution: use trace-driven simulation
  • Problem: simulation results are good for strong
    signals but deviate under bad RF conditions
  • Need to model the auto-rate control
  • Use trace-driven data
  • Need to model the loss rate due to poor signal
    strength
  • Use offline analysis
  • How often should a trace be collected?
  • Very little data (average and standard deviation
    of RSS), at fairly low time granularity, as
    channels are relatively stable
  • How to deal with imperfect data?
  • By neighborhood monitoring

31
Fault diagnosis
(Roadmap diagram, stage 2 revisited: network model, types of faults, faulty network, faults directory; output: detected faults.)
32
Current fault diagnosis approaches
  • AI techniques
  • Rule-based systems
  • Neural networks
  • Model-traversing techniques
  • Dependency graphs
  • Causality graphs
  • Bayesian networks

33
Fault Isolation and Diagnosis
  • Establish the expected performance in the
    simulation
  • Find the difference between the expected and the
    observed performance
  • Search over the fault space to detect which set of
    faults can reproduce performance similar to what
    has been observed

34
Collecting data from traces
  • Trace data collection
  • Network topology
  • Each node reports its neighbor table and routing
    table
  • Traffic statistics
  • Each node maintains counters of traffic sent to and
    received from immediate neighbors
  • Physical medium
  • Each node reports the signal strength of wireless
    links to its neighbors
  • Network performance
  • Includes both link and end-to-end performance,
    which can be measured through loss rate, delay,
    and throughput
  • The focus is on link-level performance (a sketch of
    such a report follows)

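To make the trace contents concrete, here is a hypothetical shape for one node's report; the field names are illustrative, not the system's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class AgentReport:
    """One node's periodic trace report (illustrative schema)."""
    node_id: str
    neighbors: list = field(default_factory=list)       # topology: neighbor table
    routes: dict = field(default_factory=dict)          # destination -> next hop
    sent_pkts: dict = field(default_factory=dict)       # neighbor -> packets sent
    recv_pkts: dict = field(default_factory=dict)       # neighbor -> packets received
    rss_dbm: dict = field(default_factory=dict)         # neighbor -> signal strength
    link_loss_rate: dict = field(default_factory=dict)  # neighbor -> link loss rate
```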
35
Simulating the network performance
  • Traffic load simulation
  • Link based traffic simulation
  • Adjust application sending rate to match the
    observed link-level traffic counts
  • Route simulation
  • Use actual routes taken by packets as input to
    the simulator
  • Wireless signal
  • Use real measurement of signal strength
  • Fault injection
  • Random packet dropping
  • External noise sources
  • MAC misbehavior

36
Fault diagnosis algorithm
  • General approach
(Diagram: the simulator run on the network settings yields the expected performance; the observed performance corresponds to the same settings plus an unknown fault set. How to find that fault set? A scoring sketch follows.)
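The search can be framed as minimizing a mismatch score. A sketch under assumed interfaces: `simulate` returns a dict of performance metrics, as does `observed`; neither is the paper's actual API.

```python
def explanation_error(simulate, settings, fault_set, observed):
    """How far is the performance simulated under the injected faults from
    the performance actually observed? Smaller is a better explanation."""
    expected = simulate(settings, fault_set)
    return sum((expected[m] - observed[m]) ** 2 for m in observed)
```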
37
How to search the faults efficiently?
  • Different types of faults often change only one or
    a few metrics
  • E.g., random dropping only affects the link loss
    rate
  • Thus, use the metrics in which observed and
    expected performance differ significantly to guide
    the search (see the sketch below)

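A sketch of that guidance: keep a map from metrics to the fault types that affect them, and only consider fault types behind the metrics that deviate significantly. The mapping below is illustrative; the paper's exact metric and fault taxonomy may differ.

```python
# Illustrative mapping from metrics to the fault types that affect them.
METRIC_TO_FAULTS = {
    "link_loss_rate": ["random_dropping", "external_noise"],
    "noise_level": ["external_noise"],
    "throughput": ["mac_misbehavior"],
}

def candidate_fault_types(gaps, threshold=0.1):
    """gaps: metric -> (observed - expected). Only significantly deviating
    metrics contribute candidate fault types to the search."""
    candidates = set()
    for metric, gap in gaps.items():
        if abs(gap) > threshold:
            candidates.update(METRIC_TO_FAULTS.get(metric, []))
    return candidates
```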
38
Scenario where faults do not have strong
interactions
  • Consider a large deviation from the expected
    performance an anomaly
  • Use a decision tree to determine the type of fault
    (see the sketch below)
  • The fault type determines the metric used to
    quantify the performance difference
  • Locate faults by finding the set of nodes and links
    with a large difference between expected and
    observed performance

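A direct transcription of the decision tree used in the example scenario that follows, with branch labels taken from the slides.

```python
def classify_anomaly(sending_rate_up, noise_up, loss_up):
    """Decision tree: the metric that deviates determines the fault type."""
    if sending_rate_up:
        return "contention window too low"
    if noise_up:
        return "external noise"
    if loss_up:
        return "packet dropping"
    return "normal"

# Example scenario on the next slides: loss is up, sending rate and noise
# are not, so the tree infers packet dropping.
assert classify_anomaly(False, False, True) == "packet dropping"
```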
39
Scenario where faults have strong interactions
  • Get the initial fault set from the decision-tree
    algorithm
  • Iteratively refine the fault set (see the sketch
    below)
  • Adjust the magnitudes of the faults in the fault
    set
  • Translate the difference in performance into a
    change in fault magnitude
  • This maps the impact of a fault back to its
    magnitude
  • Remove faults whose magnitude is too small
  • Add new faults that can explain large differences
    between the expected and observed performance
  • Iterate until the change in the fault set is
    negligible

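A sketch of the refinement loop under assumed hooks: `gap_to_delta(fault, gaps)` translates a performance difference into a magnitude change for one fault, and `propose_new_faults(gaps)` suggests faults for still-unexplained gaps. Neither hook is from the paper; they stand in for its impact-to-magnitude mapping.

```python
def refine_fault_set(simulate, settings, observed, initial_faults,
                     gap_to_delta, propose_new_faults,
                     min_magnitude=0.01, max_iters=20):
    """Iteratively adjust, prune, and extend the fault set until it stops
    changing. `faults` maps a fault id to its magnitude."""
    faults = dict(initial_faults)
    for _ in range(max_iters):
        expected = simulate(settings, faults)
        gaps = {m: observed[m] - expected[m] for m in observed}

        updated = {}
        for fault, mag in faults.items():
            mag += gap_to_delta(fault, gaps)   # impact mapped back to magnitude
            if mag >= min_magnitude:           # drop negligible faults
                updated[fault] = mag
        for fault, mag in propose_new_faults(gaps).items():
            updated.setdefault(fault, mag)     # explain remaining large gaps

        if updated == faults:                  # fault set has stabilized
            return updated
        faults = updated
    return faults
```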
40
Example scenario
(Topology diagram: a five-node network, nodes 1-5; node 1 has links to nodes 2 and 4.)
41
Example scenario
  • Observed performance
  • Increased loss rate on links 1-4 and 1-2
  • No increase in the sending rate on 1-4 and 1-2
  • No increase in the noise experienced by neighbors

(Inference decision tree: Increased sending rate? Yes: contention window too low. No: Increased noise? Yes: noise fault. No: Increased loss? Yes: packet drop. No: normal.)
42
Example scenario (continued)
  • Same observations and decision tree as the previous
    slide
  • Walking the tree: the sending rate has not
    increased and the noise has not increased, but the
    loss rate has; the inference is packet dropping at
    node 1
43
Accuracy of fault diagnosis
  • Correctness of the model
  • Complete information
  • Consistent information
  • Timely information
  • Correctness of the reported symptoms
  • The right threshold for reporting a symptom
  • Difference in the behavior of faults
  • Timely reporting of symptoms

44
System implementation
  • Windows XP
  • Agents run on every wireless node and report
    collected information on demand
  • Managers collect and analyze the information
  • Collected information is cast into performance
    counters supported by Windows
  • The manager is connected to a backend simulator;
    collected information is converted into a script
    that drives the simulation
  • Testbed
  • Multihop wireless testbed built using IEEE 802.11a
    cards
  • A commercially available network sniffer, AiroPeek,
    is used for data collection
  • Native 802.11 NICs provide a rich set of networking
    information

45
Evaluation: data collection overhead
  • Data collection traffic has little effect on the
    performance of an FTP flow run with and without
    data collection
  • Management traffic overhead < 800 bits/s/node
  • No data cleaning: each link is reported only once
  • With data cleaning: each link is reported by all
    observers for consistency checking
(Charts: management traffic overhead; FTP throughput with and without data collection.)
46
Data cleaning effectiveness
  • Coverage greater than 80% in all cases
  • Higher accuracy with grid topology
  • Higher coverage when using history
  • Higher accuracy with denser networks
  • Higher accuracy with client-server traffic
47
Evaluation: fault diagnosis
Detecting external noise
  • Symptom: significant difference in the noise level
    at nodes
  • Noise sources are correctly identified with at most
    one or two false positives
  • Inference error in the magnitudes of noise is
    within 4%
Detecting random dropping
  • Symptom: significant difference in the loss rates
    on links
  • Less than 20% of faulty links are left undetected
  • No-effect faults are faulty links sending fewer
    than a threshold (250) packets of data

48
Evaluation: fault diagnosis
Detecting MAC misbehavior, and combinations of all
fault types
  • Symptom: significant discrepancy in throughput on
    links
  • Coverage is mostly around 80% or higher
  • False positives are within 2%

49
What-if analysis
(Roadmap diagram, stage 3 revisited: network model, types of faults, faults directory, detected faults; output: corrective measures.)
50
What-if analysis
(Diagram: the diagnosis and the network topology feed the what-if analysis, which evaluates candidate corrective measures.)
51
Limitations
  • Limited by the accuracy of the simulator
  • Time to detect faults is acceptable for long-term
    faults but not for transient faults
  • The choice of traces used to drive the simulation
    has important implications
  • The focus has only been on faults that result in
    changed behavior

52
Conclusion
  • Used trace data to model the network
  • Data collection techniques are presented to
    collect network information and detect deviations
    from the expected performance
  • A fault diagnosis algorithm is proposed to detect
    the root causes of failures
  • A scheme for what-if analysis is proposed to
    evaluate alternative network configurations for
    efficient network operation

53
Future work
  • Validation on a large testbed
  • Performance analysis in presence of mobility
  • Detecting malicious attacks
  • Diagnosis in presence of incomplete network
    information
  • More deeply investigating the potential of
    what-if analysis

54
References
  • L. Qiu, P. Bahl, A. Rao, and L. Zhou, "Fault
    Detection, Isolation, and Diagnosis in Multihop
    Wireless Networks," Microsoft Research Technical
    Report MSR-TR-2004-11, Dec. 2003
  • M. Steinder and A. Sethi, "A survey of fault
    localization techniques in computer networks,"
    Technical Report, CIS Dept., Univ. of Delaware,
    Feb. 2001
  • M. Steinder, "Probabilistic inference for
    diagnosing service failures in communication
    systems," PhD thesis, Univ. of Delaware, 2003

55
Questions
  • What is the proposed solution for modeling the
    throughput when the signal strength is poor? In
    Table 2, the simulated throughput monotonically
    decreases with the loss rate while the measured
    throughput does not. Why?
  • What could cause false positives in the fault
    diagnosis results? When can the false positive
    ratio increase?
  • http://www.cis.udel.edu/natu/861/861.html