Automated Fault diagnosis in VoIP presentation

1
Automated Fault diagnosis in VoIP

31st March, 2006
Vishal Kumar Singh and Henning Schulzrinne

2
VoIP Diagnosis

What is automated VoIP diagnosis?
Determining failures in the network
Automatically finding the root cause of the
failure
Why VoIP diagnosis?
Networks are complex, making it difficult to
troubleshoot problems
Automatic fault diagnosis reduces human
intervention
Issues in VoIP diagnosis
Detecting failures/faults
Finding the cause of failure, determining
dependency relationships among different
components for diagnosis
Solution steps and approaches

3
Issues in Automated VoIP Diagnosis

Increasingly complex and diverse network elements
Complex interactions/relationships between
different network elements
Different run time bindings for each application
usage instance, e.g., different calls may use
different DNS, SIP proxy servers, media path
A problem in one network element may manifest
itself as a user-perceived failure of another
element

4
Fault Identification

Service unavailability reporting
Node/device/UA generates faults (failure events),
e.g., SNMP traps, failure messages
Monitoring application, e.g., an SNMP-based
application, detects service unavailability and
reports the failure event
Affected user reports service unavailability,
e.g., by e-mail, by calling the helpdesk, or
automatically by pressing a button on the phone
while in a call and experiencing echo
Dependent application detects service
unavailability and generates fault (failure
events)

5
Fault Localization: Determining the Source of the
Problem

Fault classification: local vs. global
(Does it affect only me, or does it affect
others also?)
Global failures
Server failure, e.g., SIP proxy, DNS, or DB
failure
Network failures
Local failures
Specific source failure, e.g., node A cannot
call anyone
Specific destination or participant failure, e.g.,
no one can call node B
Locally observed but global failures, e.g., DNS
service failed, but only B observed it
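
The local vs. global question above can be sketched in code. This is a minimal illustration, not the presenters' implementation: the `Observation` record, the `classify_fault` function, and the "partial" and "unknown" labels are assumptions introduced here. The idea is that a failure seen by the reporter is only called local once other peers have tested the same target and found it working.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Observation:
    peer: str    # node that ran the test
    target: str  # component tested, e.g. "dns" or "node-B"
    ok: bool     # True if that peer's test succeeded


def classify_fault(reporter: str, target: str,
                   observations: List[Observation]) -> str:
    """Classify a failure that `reporter` saw on `target`.

    Labels (hypothetical): 'global', 'local', 'partial', 'unknown'.
    """
    others = [o for o in observations
              if o.target == target and o.peer != reporter]
    if not others:
        # Nobody else has tested yet: the fault may still be global
        # even though it was only observed locally.
        return "unknown"
    failures = sum(1 for o in others if not o.ok)
    if failures == 0:
        return "local"      # only the reporter sees the problem
    if failures == len(others):
        return "global"     # every peer that tested sees it too
    return "partial"        # some peers see it, some do not
```

A "partial" result would suggest a topology-dependent cause, for example a fault on a link that only some peers share.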

6
Solution Approach

DYSWIS: Do You See What I See [1]
Peers (nodes) perform diagnostic tests when
another peer reports or detects a failure
Nodes choose the diagnostic test based on the
dependencies encoded as a decision tree
Nodes (at least some) will be initially preloaded
with the dependency relationships in some format
(e.g., XML-based)
Nodes (at least some) may build and update the
dependency relationships based on statistical and
temporal analysis of the failure events they
receive and the diagnostic tests they perform
7
Solution Approach

Store context information of past failures
experienced by each node
E.g., specific server that was acting as the
proxy server (for my call which failed)
Store locality of past failure instances
LAN, domain, subnet
First hop at each layer, e.g., switch (MAC),
default gateway (IP), domain's proxy (application
layer)
Failure count for each network element
(statistical)
Last failure timestamp for each network element
Last successfully seen timestamp for each network
element (why do I need to test the proxy for you,
my call just went through)
Temporal correlation of past failures (proxy
seems to be failing after DNS fails)
Each node has a runtime dependency list based on
past failures and diagnostic tests
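
The per-element bookkeeping listed above (failure count, last failure timestamp, last successfully seen timestamp) could be kept in a structure like the following. This is a sketch under assumptions: the class names, the `recently_ok` helper, and the `max_age` freshness threshold are introduced here for illustration. `recently_ok` captures the "why do I need to test the proxy for you, my call just went through" shortcut.

```python
import time
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class ElementHistory:
    failure_count: int = 0
    last_failure: Optional[float] = None   # timestamp of last failure
    last_success: Optional[float] = None   # last time seen working


class FailureHistory:
    """Per-node record of what happened to each network element."""

    def __init__(self) -> None:
        self.elements: Dict[str, ElementHistory] = {}

    def record(self, element: str, ok: bool,
               now: Optional[float] = None) -> None:
        now = time.time() if now is None else now
        h = self.elements.setdefault(element, ElementHistory())
        if ok:
            h.last_success = now
        else:
            h.failure_count += 1
            h.last_failure = now

    def recently_ok(self, element: str, max_age: float,
                    now: Optional[float] = None) -> bool:
        """True if the element was seen working within `max_age`
        seconds and has not failed since; a test can then be skipped."""
        now = time.time() if now is None else now
        h = self.elements.get(element)
        if h is None or h.last_success is None:
            return False
        if h.last_failure is not None and h.last_failure > h.last_success:
            return False
        return now - h.last_success <= max_age
```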

8
Solution Architecture
Nodes in different domains cooperating to
determine cause of failure
9
Solution Architecture: Logical View
Failures in network
Dependency graph generation: Bayesian-network-
based, inference, other models
Test results
Decision tree updates
Triggers to perform tests (peer selection
and probe selection)
Dependency relationships and tests (XML)
The figure shows the logical entities and the
separation between dependency graph generation
and the distributed diagnostic infrastructure
(enclosed in blue).
10
Solution Requirements

Request-Response protocol between the node which
experiences the failure and the peer nodes
Nodes' capability to perform diagnostic tests
(probes), probe selection based on cost/result
Encoding the dependency relationship into a
decision tree (given as input by an expert,
e.g., as XML)
Peer node discovery, based on
Location (local network, domain)
Capability to perform tests (based on specific
tests)
Dependency graph generation and updating, based
on
Network failure events
Diagnostic test results correlated with failures
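
Peer discovery by location and capability might look like the sketch below. The `Peer` record, the `select_peers` function, and the "closest first" ordering are assumptions made for illustration; the slides only state that discovery is based on location and on the tests a peer can perform. The subnets and domains are placeholder values.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Peer:
    name: str
    subnet: str
    domain: str
    tests: Set[str] = field(default_factory=set)  # tests it can run


def select_peers(peers: List[Peer], needed_test: str,
                 subnet: str, domain: str) -> List[Peer]:
    """Return peers able to run `needed_test`, closest first.

    Closeness (hypothetical ranking): same subnet, then same domain,
    then everything else.
    """
    def distance(p: Peer) -> int:
        if p.subnet == subnet:
            return 0
        if p.domain == domain:
            return 1
        return 2

    capable = [p for p in peers if needed_test in p.tests]
    return sorted(capable, key=distance)  # stable sort keeps input order
```

Asking peers at different distances is what lets the reporting node tell a local fault from a global one, so a real selection policy would likely want a mix rather than only the nearest peers.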

11
Test/Probe Selection

Which diagnostic probe to run (network layer or
application layer), and for what kind of failures
A probe covering a broad range of failures gives
faster but cruder, less accurate results
E.g., PING vs. TCP connect vs. SIP PING tests
Cost of Probe
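
The trade-off between probe cost and coverage can be made concrete with a small sketch. The cost numbers, the layer names, and both functions are illustrative assumptions, not measurements or definitions from the presentation. A SIP PING failure is broad (network, transport, or SIP service could be at fault), while running cheaper, narrower probes first narrows down the failing layer.

```python
# Hypothetical probe catalogue: costs are made-up relative units.
PROBES = {
    "ping":        {"cost": 1, "detects": {"network"}},
    "tcp-connect": {"cost": 2, "detects": {"network", "transport"}},
    "sip-ping":    {"cost": 3,
                    "detects": {"network", "transport", "sip-service"}},
}


def cheapest_probe(suspect):
    """Return the cheapest probe able to detect a failure of `suspect`."""
    candidates = [(p["cost"], name) for name, p in PROBES.items()
                  if suspect in p["detects"]]
    return min(candidates)[1] if candidates else None


def localize(run_probe):
    """Run probes from cheapest to costliest.

    `run_probe` takes a probe name and returns True on success.
    Returns the layers implicated by the first failing probe, minus
    the layers already cleared by probes that passed.
    """
    cleared = set()
    for name, probe in sorted(PROBES.items(),
                              key=lambda kv: kv[1]["cost"]):
        if not run_probe(name):
            return sorted(probe["detects"] - cleared)
        cleared |= probe["detects"]
    return []  # every probe passed
```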

12
Dependency Classifications

Functional dependency
At the generic service level, e.g., SIP proxy
depends on DB service, DNS service
Structural dependency
Configuration time, e.g., the Columbia CS SIP
proxy is configured to use the MySQL database on
metro-north
Operational dependency
Runtime dependencies or run-time bindings, e.g.,
the call that failed was using a failover SIP
server obtained from DNS, which was running on
host a.b.c.d in the IRT lab

13
Dependency Classifications: Layered Approach

Vertical and lateral dependencies: applications
depend on other application-layer services
(e.g., SIP service depends on DB, DNS service) as
well as lower-layer services
OSI layers as service dependency layers
Application-layer service also depends on
transport-layer service, which in turn depends on
network-layer service
MAC layer: access point, switch
Network layer: router
Application layer: DNS, SIP, database
Topology-based dependency
e.g., calls from the CS domain depend on a
specific SIP server, calls from lab phones depend
on specific switches and routers
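
A dependency graph with vertical and lateral edges can be represented as a plain adjacency map. The sketch below is illustrative only: the component names and edges in `DEPENDS_ON` are an assumed example topology, and `all_dependencies` is a hypothetical helper that collects everything a service depends on, directly or indirectly, which gives the candidate set to test when that service fails.

```python
# Assumed example topology, not taken from the presentation's figure.
DEPENDS_ON = {
    "sip-call":   ["sip-proxy", "media-path"],   # application layer
    "sip-proxy":  ["dns", "database", "router"], # lateral + vertical
    "media-path": ["router"],
    "dns":        ["router"],
    "database":   ["router"],
    "router":     ["switch"],                    # network -> MAC layer
    "switch":     [],
}


def all_dependencies(service):
    """Return every component `service` depends on, directly or not."""
    seen = []
    stack = list(DEPENDS_ON.get(service, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.append(node)
            stack.extend(DEPENDS_ON.get(node, []))
    return seen
```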

14
Dependency Graph
15
Dependency Graph Encoded to Decision Tree
16
Diagnostic Tests

SIP proxy
Proxy server availability
SIP PING
Call Routing availability
INVITE tests
Call Path determination
SIP TraceRoute
Media path
Quality related
Speech quality degradation - MOS
Echo
Jitter - MOS, PESQ
QoS - RTCP
NAT/Firewall
Checking binding expiration
Firewall failure to open a port - one-way media
How to determine which firewall in the path?
SIP signaling?
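
For the proxy availability test, one common way to implement a SIP PING is a SIP OPTIONS request; the slides do not say which method the presenters used, so that choice is an assumption here. The sketch below only builds the request text with the header fields RFC 3261 requires in every request; sending it and parsing the response are left out. The function name, default user, and example hosts are placeholders.

```python
import uuid


def build_options(target_host, local_host, local_port=5060, user="probe"):
    """Build a minimal SIP OPTIONS request as a string (not sent)."""
    # RFC 3261 requires Via branch values to start with "z9hG4bK".
    branch = "z9hG4bK" + uuid.uuid4().hex
    call_id = uuid.uuid4().hex + "@" + local_host
    tag = uuid.uuid4().hex[:8]
    lines = [
        f"OPTIONS sip:{target_host} SIP/2.0",
        f"Via: SIP/2.0/UDP {local_host}:{local_port};branch={branch}",
        "Max-Forwards: 70",
        f"From: <sip:{user}@{local_host}>;tag={tag}",
        f"To: <sip:{target_host}>",
        f"Call-ID: {call_id}",
        "CSeq: 1 OPTIONS",
        "Content-Length: 0",
        "",   # blank line ends the headers
        "",
    ]
    return "\r\n".join(lines)
```

A call such as `build_options("proxy.example.com", "192.0.2.10")` yields a request that could be sent over UDP to the proxy; any final response, even an error, indicates that the SIP service is answering.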

17
Diagnostic Tests

DNS tests
DHCP
Switch/Router
ARP/RARP/Multicast
BGP failures
Conference mixers
Gateway
Echo return loss - readings - analysis
DB
XCAP server tests
Presence service availability tests

18
Example

Call failure: possible causes
SIP Proxy server
Database
Authentication
Media path failure
Gateway
Specific call legs: ERL, authentication, etc.
DNS server failure
End station failure
Network failure, e.g., router, switch failure
Different calls will have different run time
dependencies

19
Mapping to a Human Medical System

Doctors perform diagnostic tests to find out the
cause of a disease when symptoms are reported
They may learn new things about the disease as
a part of diagnostic tests
Failures and triggered tests update the
dependency graph
Medical researchers do different types of tests
to learn about new diseases and to determine the
cause of a disease and its relationship with
other physiological systems
Set of tests that can run periodically and can be
used to build dependency graph independent of
failures

20
Solution Evolution

Learning the dependency graph from failure events
and diagnostic tests
Learning using random/periodic testing to
identify failures and determine relationships
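
One simple way to start learning relationships from failure events is to count how often one element's failure follows another's within a short time window, as in the "proxy seems to be failing after DNS fails" observation. This is a naive sketch, not the Bayesian-network approach the slides mention; the function name, the event format of `(timestamp, element)` pairs, and the window parameter are assumptions.

```python
from collections import Counter


def follows_counts(events, window):
    """Count ordered pairs (a, b) where b fails within `window`
    seconds after a fails.

    `events` is a list of (timestamp, element) tuples.
    """
    events = sorted(events)
    counts = Counter()
    for i, (t_a, a) in enumerate(events):
        for t_b, b in events[i + 1:]:
            if t_b - t_a > window:
                break           # later events are even further away
            if a != b:
                counts[(a, b)] += 1
    return counts
```

A high count for `("dns", "proxy")` combined with a low count for `("proxy", "dns")` would hint at a dependency of the proxy on DNS, which could then be confirmed with targeted diagnostic tests before updating the dependency graph.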

21
Future Directions

Self healing
Predicting failures
Protocols for labeling failure events, which
would enable automatically incorporating new
devices/applications into the dependency system
Decision tree (dependency graph) based event
correlation

22
Reference

[1] User-oriented Management of VoIP Applications
(http://www.ibr.cs.tu-bs.de/projects/nmrg/meetings/2005/nancy/dyswis.pdf)
