Failure Spread in Redundant UMTS Core Network

Slides: 18

Provided by: LMFu7

Transcript and Presenter's Notes

1
Failure Spread in Redundant UMTS Core Network

2
Table of Contents

3
Background

Fixed networks have been built to be reliable, but the reliability of mobile networks has been given less attention
Enhanced services (e.g. telemedicine applications) and intensifying competition between the main market players for new subscribers are about to change this
In fixed networks, outages involving a large number of people must be reported. This requirement may be extended to mobile networks in the future

4
Terminology (1/4)

Availability: The probability that the system will be functioning correctly at any given time
Failure: The impact of faults and errors as seen by the user (e.g. a SW program crash)
Fault Tree Analysis (FTA): A top-down method of analyzing system design and performance. It specifies a top event and then identifies all of the associated elements in the system that can cause the top event to occur
Failure Spread: A failure occurs in some part(s) of the system and propagates to other part(s) of the system
Fault Tolerance: The capability of the system to withstand and handle faults
Media Gateway (MGW): A network node in the UMTS core network, which is used to interconnect networks
Redundancy: Availability of unit(s) and mechanisms for taking over from failed unit(s)
Reliability: The capability of an item to carry out certain functionality for a certain period of time in certain conditions, or the probability that it will do so
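
The availability definition above can be made concrete with the standard steady-state formula for a repairable unit, availability = MTBF / (MTBF + MTTR). The slides do not state this formula or give any figures, so the function names and the numbers below are illustrative assumptions only.

```python
def steady_state_availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Long-run fraction of time a repairable unit is up.

    Assumes a constant mean time between failures (MTBF) and a constant
    mean time to repair (MTTR); neither value comes from the slides.
    """
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


def downtime_minutes_per_year(availability: float) -> float:
    """Expected downtime in a 365-day year for a given availability."""
    return (1.0 - availability) * 365 * 24 * 60


# Hypothetical unit: fails on average every 9999 hours, repaired in 1 hour.
availability = steady_state_availability(mtbf_hours=9999.0, mttr_hours=1.0)
print(f"availability = {availability:.4f}")
print(f"downtime = {downtime_minutes_per_year(availability):.2f} min/year")
```

With these assumed figures the unit is available 99.99 % of the time, which corresponds to a little under an hour of downtime per year.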

5
Terminology (2/4)

6
Terminology (3/4)

There are different types of failures
Sudden, when the failure cannot be predicted (e.g. nondeterministic software-based problems)
Gradual, when the failure can be predicted with prior examination (e.g. hardware wear-out increases the probability of a failure over a period of time)
Partial, when the failure affects only some parts of the system (e.g. only one network node)
Complete, when the failure has an impact on the whole system (e.g. the complete network)
Catastrophic, when the failure is both sudden and complete (e.g. a power-system failure)
Degradation, when the failure is both gradual and partial (e.g. HW component wear-out over time)
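
The taxonomy above combines two independent axes: how the failure begins (sudden or gradual) and how far it reaches (partial or complete). A minimal sketch of that classification follows; the enum and function names are hypothetical and chosen only for illustration.

```python
from enum import Enum


class Onset(Enum):
    SUDDEN = "sudden"      # cannot be predicted
    GRADUAL = "gradual"    # predictable with prior examination


class Extent(Enum):
    PARTIAL = "partial"    # only some parts of the system
    COMPLETE = "complete"  # the whole system


def classify_failure(onset: Onset, extent: Extent) -> str:
    """Name the combined failure type as defined on the slide."""
    if onset is Onset.SUDDEN and extent is Extent.COMPLETE:
        return "catastrophic"
    if onset is Onset.GRADUAL and extent is Extent.PARTIAL:
        return "degradation"
    # The slide gives no special name for the two remaining combinations.
    return f"{onset.value}, {extent.value}"


print(classify_failure(Onset.SUDDEN, Extent.COMPLETE))   # power-system failure
print(classify_failure(Onset.GRADUAL, Extent.PARTIAL))   # HW wear-out
```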

7
Terminology (4/4)

8
Problem Setting (1/2)

The thesis studies the effect of failures in the UMTS CN, and how redundancy mechanisms may be used in the network to increase availability and decrease the effect of failures
The area has not been widely studied before, but some work related to node failures exists
Network reliability has been studied, and this thesis compares results from the other studies and proposes solution alternatives

9
Problem Setting (2/2)

Failures occur in various parts of the system
Node failures (e.g. an HLR database failure results in unavailability of permanent user data if no redundant component and mechanism is available)
Protocol failures (wrong implementation or design), which result in overload of network elements or signaling links, or in faulty interaction between network nodes. For instance, wrong use of broadcast messages leads to overload at the receiving side
HW failures (bus or circuitry failures, or memory corruption, which result in HW either malfunctioning or failing)
Recovery triggering mechanism failures (changeover procedure failures, DSP device manager failures or other triggering failures)
Load sharing algorithm failures (ineffective use of resources, exceeding the capacity of the system before taking action, or wrong resource sharing on the right network nodes)
HW/SW update procedure failures, which lead to faulty configurations and faulty interworking of network elements
Wrong network configuration (often because of complex network design)
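
The HLR example above illustrates failure spread: without a redundant unit, a single node failure propagates to everything that depends on it. The sketch below models this as a dependency map in which each unit needs at least one working member from every redundancy group it depends on. The data model, the function name and the unit names ("HLR-1", "subscriber_data", "call_setup") are assumptions made for illustration, not the thesis's method.

```python
def spread_failures(depends_on, initially_failed):
    """Return every unit that ends up failed once failures have propagated.

    depends_on maps a unit to a list of redundancy groups (non-empty sets).
    A unit fails when all members of any one of its groups have failed.
    """
    failed = set(initially_failed)
    changed = True
    while changed:  # repeat until no new unit fails (fixed point)
        changed = False
        for unit, groups in depends_on.items():
            if unit in failed:
                continue
            if any(all(member in failed for member in group) for group in groups):
                failed.add(unit)
                changed = True
    return failed


# Hypothetical network: call setup needs subscriber data, which needs an HLR.
single_hlr = {
    "subscriber_data": [{"HLR-1"}],
    "call_setup": [{"subscriber_data"}],
}
redundant_hlr = {
    "subscriber_data": [{"HLR-1", "HLR-2"}],
    "call_setup": [{"subscriber_data"}],
}

print(sorted(spread_failures(single_hlr, {"HLR-1"})))
print(sorted(spread_failures(redundant_hlr, {"HLR-1"})))
```

In the first case the HLR failure spreads to subscriber data and call setup; in the second, the redundant HLR contains it.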

10
Used Methodology (1/3)

11
Used Methodology (2/3)

A partly redundant example network is studied in the thesis
A tree-format FTA (Fault Tree Analysis) is used for analyzing the causes of failures. FTA is mainly applied in the area of SW reliability
A literature study is performed to find mechanisms for achieving a higher level of system reliability
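
To show what a tree-format FTA looks like in practice, the sketch below evaluates the probability of a top event from basic events joined by AND and OR gates. It assumes the basic events are statistically independent, and the tree, event names and probabilities are invented for illustration; they are not taken from the thesis.

```python
from dataclasses import dataclass
from math import prod


@dataclass(frozen=True)
class BasicEvent:
    name: str
    probability: float


@dataclass(frozen=True)
class Gate:
    kind: str      # "AND" or "OR"
    inputs: tuple  # child BasicEvent or Gate objects


def top_event_probability(node) -> float:
    """Evaluate a fault tree bottom-up, assuming independent basic events."""
    if isinstance(node, BasicEvent):
        return node.probability
    child_probs = [top_event_probability(child) for child in node.inputs]
    if node.kind == "AND":  # all inputs must occur
        return prod(child_probs)
    if node.kind == "OR":   # at least one input occurs
        return 1.0 - prod(1.0 - p for p in child_probs)
    raise ValueError(f"unknown gate kind: {node.kind}")


# Hypothetical top event: media gateway service is lost if both MGWs fail,
# or if the changeover mechanism itself fails.
tree = Gate("OR", (
    Gate("AND", (
        BasicEvent("MGW-A fails", 0.01),
        BasicEvent("MGW-B fails", 0.01),
    )),
    BasicEvent("changeover fails", 0.001),
))

print(f"P(top event) = {top_event_probability(tree):.7f}")
```

Even in this toy tree the changeover mechanism, not the duplicated gateways, dominates the top-event probability.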

12
Used Methodology (3/3)

13
Results of the Study (1/3)

The chain from fault detection through localization and analysis to recovery must be unbroken; otherwise failures cannot be recovered from completely
Redundancy must be applied at different levels of the system to achieve a high level of fault tolerance (a system is only as strong as its weakest component)
SW fault tolerance is increased by building distributed, reliable and scalable SW
The critical network nodes have to be duplicated, and restoration algorithms must be available
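
The weakest-component remark can be illustrated with the usual series and parallel availability formulas: components that are all required multiply their availabilities, while duplicated components fail only if every copy fails. The sketch assumes independent failures and perfect changeover, and the availability figures are hypothetical.

```python
from math import prod


def series(availabilities):
    """All components are required, so the availabilities multiply."""
    return prod(availabilities)


def parallel(availabilities):
    """Redundant components: the group is down only if every copy is down."""
    return 1.0 - prod(1.0 - a for a in availabilities)


# Hypothetical three-level chain in which the middle level is the weak link.
single = series([0.999, 0.99, 0.999])

# Duplicate only the weak middle level.
duplicated = series([0.999, parallel([0.99, 0.99]), 0.999])

print(f"no redundancy:      {single:.6f}")
print(f"weak link doubled:  {duplicated:.6f}")
```

A series chain can never be more available than its weakest member, which is why the slide recommends redundancy at several levels rather than at one.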

14
Results of the Study (2/3)

Emphasis on thorough system testing (especially testing of fault recovery mechanisms, load control and different failure scenarios)
SW-based mechanisms include
Distributed SW architecture (a failure of one component involves a smaller fraction of the system, so the loss of data and resources can be recovered from more easily). A fault can be isolated to a smaller area when the architecture is distributed
Multithreaded protocol stacks, so that a failure of a process involves only part of the module capabilities, and SW modules that use dynamic checkpointing protocols for recovering from failure of peer entities
Optimization of the recovery process time to its minimum (only the necessary part of the system is restarted: processor, board or node restart. This reduces the outage time)
A blackbox for SW failure analysis can be implemented inside SW components for later analysis
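
The idea of restarting only the necessary part of the system can be sketched as choosing the smallest restart scope that covers all faulty units. The three-level hierarchy (processor, board, node) comes from the slide; the function, the topology map and the unit names are hypothetical.

```python
def smallest_restart_scope(faulty_processors, board_of):
    """Pick the narrowest restart that covers every faulty processor.

    board_of maps each processor to the board it sits on; all boards are
    assumed to belong to one node. Returns a (scope, target) tuple.
    """
    if not faulty_processors:
        raise ValueError("nothing to restart")
    if len(faulty_processors) == 1:
        return ("processor", next(iter(faulty_processors)))
    boards = {board_of[processor] for processor in faulty_processors}
    if len(boards) == 1:
        return ("board", next(iter(boards)))
    # Faults span several boards: fall back to a full node restart.
    return ("node", None)


# Hypothetical node with two boards.
board_of = {"cpu-0": "board-A", "cpu-1": "board-A", "cpu-2": "board-B"}

print(smallest_restart_scope({"cpu-0"}, board_of))
print(smallest_restart_scope({"cpu-0", "cpu-1"}, board_of))
print(smallest_restart_scope({"cpu-0", "cpu-2"}, board_of))
```

Escalating only when the narrower scope cannot cover the fault keeps the rest of the node in service, which is what shortens the outage.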

15
Results of the Study (3/3)

Dynamic routing and a meshed network architecture are a recommendable solution (Advantage: high tolerance of network failures and adaptability to different network configurations. Disadvantage: complicated design and maintenance of the network)
Reliability of the network is an optimization problem, but redundant HW invested in now can be used in the future to increase the capacity of the system if needed
Multifunction devices, MSC in Pool, multihoming and special algorithms may be utilized to increase the reliability of the system
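
The failure tolerance of dynamic routing in a meshed network can be demonstrated with a small route search that simply avoids failed nodes. This is a breadth-first search over a hypothetical four-node mesh, used here as a stand-in for a real routing protocol; the node names and the function are assumptions.

```python
from collections import deque


def find_route(links, source, target, failed=frozenset()):
    """Return a shortest hop-count path that avoids failed nodes, or None."""
    if source in failed or target in failed:
        return None
    previous = {source: None}  # also serves as the visited set
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:  # walk back to the source
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbour in links.get(node, ()):
            if neighbour not in failed and neighbour not in previous:
                previous[neighbour] = node
                queue.append(neighbour)
    return None


# Hypothetical mesh: two disjoint paths between A and D.
mesh = {"A": ["B", "C"], "B": ["A", "D"], "C": ["A", "D"], "D": ["B", "C"]}
# Hypothetical chain: B is a single point of failure.
chain = {"A": ["B"], "B": ["A", "D"], "D": ["B"]}

print(find_route(mesh, "A", "D"))
print(find_route(mesh, "A", "D", failed={"B"}))
print(find_route(chain, "A", "D", failed={"B"}))
```

In the mesh the route is recomputed around the failed node, while the chain has no alternative path and the destination becomes unreachable.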

16
Conclusions

Different redundancy mechanisms were discussed in this thesis and existing algorithms were compared
The network design trend seems to be going towards smaller, adaptable network nodes and architectures
The reliability of the system is best achieved by using multiple levels of redundancy
Failure spread depends on the availability and workability of the methods for ensuring the reliability of the system

17
Future Work

OPEX (OPerating EXpenses) and CAPEX (CApital EXpenses) calculations for network architecture solutions
Testing of failure recovery mechanisms and the effect of failures using a real network or a simulated environment
Multiple simultaneous failures have been handled only partly in this thesis. More research needs to be performed on the subject (e.g. on tolerating a geographical catastrophe involving a large number of network nodes)