Failure Spread in Redundant UMTS Core Network

1
Failure Spread in Redundant UMTS Core Network
  • Author: Tuomas Erke, Helsinki University of
    Technology
  • Supervisor: Timo Korhonen, Professor of
    Telecommunication Systems (S72)
  • Tuomas.Erke_at_hut.fi
  • 30.9.2003

2
Table of Contents
  • Background
  • Terminology
  • Problem Setting
  • Used Methodology
  • Results of the Study
  • Conclusions
  • Future Work

3
Background
  • Fixed networks have been built to be reliable,
    but the reliability of mobile networks has been
    given less attention
  • Enhanced services (e.g. telemedicine
    applications) and escalated competition between
    the main market players over new subscribers are
    about to change this in the near future
  • In fixed networks, outages involving a large
    number of people must be reported. This may be
    extended to mobile networks in the future as well

4
Terminology (1/4)
  • Availability: A probability that the system will
    be functioning correctly at any given time
  • Failure: The impact of faults and errors as seen
    by the user (e.g. a SW program crash)
  • Fault Tree Analysis (FTA): A top-down method of
    analyzing system design and performance. It
    specifies a top event and then identifies all of
    the associated elements in the system that can
    cause the top event to occur
  • Failure Spread: A failure occurs in some part(s)
    of the system and propagates to other part(s) of
    the system
  • Fault Tolerance: A capability of the system to
    withstand and handle faults
  • Media Gateway (MGW): A network node in the UMTS
    core network, which is used to interconnect
    networks
  • Redundancy: Availability of unit(s) and
    mechanisms for taking over failed unit(s)
  • Reliability: The capability of an item to carry
    out certain functionality in a certain period of
    time in certain conditions, or a probability that
    it will
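
The availability definition above can be made concrete with a small calculation. The sketch below uses the textbook steady-state relation availability = MTBF / (MTBF + MTTR), which is an assumption added here and does not appear in the slides. The function names and the example figures are hypothetical.

```python
HOURS_PER_YEAR = 24 * 365


def steady_state_availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Long-run fraction of time a repairable unit is up.

    mtbf_hours: mean time between failures
    mttr_hours: mean time to repair
    """
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


def expected_downtime_hours_per_year(availability: float) -> float:
    """Expected yearly outage time implied by an availability figure."""
    return (1.0 - availability) * HOURS_PER_YEAR


if __name__ == "__main__":
    # Hypothetical node: fails on average every 9999 h, repaired in 1 h.
    a = steady_state_availability(mtbf_hours=9999.0, mttr_hours=1.0)
    print(f"availability: {a:.4f}")
    print(f"downtime per year: {expected_downtime_hours_per_year(a):.3f} h")
```

The same availability value can hide very different outage patterns (many short outages or one long one), which is why the slides define reliability separately.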

5
Terminology (2/4)
  • A fault is connected to the physical world
    (electronic components develop faults over
    time), but it also includes mistakes made in the
    design (incomplete system architecture) or
    implementation (programming mistakes) of the
    system
  • An error has an impact on information (error in
    data processing)
  • A failure is the impact of faults and errors as
    seen by the user (e.g. a SW program crash)

6
Terminology (3/4)
  • There are different types of failures:
  • Sudden, when failure cannot be predicted
    (nondeterministic software-based problems)
  • Gradual, when failure can be predicted with prior
    examination (hardware wear-out increases the
    probability of a failure over a period of time)
  • Partial, when failure affects only some parts of
    the system (only one network node)
  • Complete, when failure has an impact on the whole
    system (complete network)
  • Catastrophic, when failure is both sudden and
    complete (power-system failure)
  • Degradation, when failure is both gradual and
    partial (HW component wear-out over time)
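
The failure types above combine two independent properties: how the failure sets in (sudden or gradual) and how far it reaches (partial or complete). A minimal sketch of that classification follows. The enum and function names are hypothetical, and the two combinations the slide does not name are simply described by their properties.

```python
from enum import Enum


class Onset(Enum):
    SUDDEN = "sudden"    # cannot be predicted
    GRADUAL = "gradual"  # predictable with prior examination


class Extent(Enum):
    PARTIAL = "partial"    # only some parts of the system
    COMPLETE = "complete"  # the whole system


def classify(onset: Onset, extent: Extent) -> str:
    """Return the slide's name for an onset/extent combination."""
    if onset is Onset.SUDDEN and extent is Extent.COMPLETE:
        return "catastrophic"
    if onset is Onset.GRADUAL and extent is Extent.PARTIAL:
        return "degradation"
    # The slide gives no special name for the remaining two combinations.
    return f"{onset.value} and {extent.value}"


if __name__ == "__main__":
    print(classify(Onset.SUDDEN, Extent.COMPLETE))  # power-system failure
    print(classify(Onset.GRADUAL, Extent.PARTIAL))  # HW wear-out
```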

7
Terminology (4/4)
  • Standby redundancy (triggered only when the other
    unit fails)
  • Parallel redundancy (frequently used in telecom
    networks)

8
Problem Setting (1/2)
  • The thesis studies the effect of failures in the
    UMTS core network (CN) and how redundancy
    mechanisms may be used in the network to increase
    availability and decrease the effect of failures
  • The area has not been widely studied before, but
    some work related to node failures exists
  • Network reliability has been studied, and this
    thesis compares results from other studies and
    proposes solution alternatives

9
Problem Setting (2/2)
  • Failures occur in various parts of the system:
  • Node failures (an HLR database failure results in
    unavailability of permanent user data if no
    redundant component and mechanism is available)
  • Protocol failures (wrong implementation or
    design) result in overload of network elements
    or signaling links, or in faulty interaction
    between network nodes. For instance, wrong use of
    broadcast messages leads to overload at the
    receiving side
  • HW failures (bus or circuitry failures, memory
    corruption) result in HW either malfunctioning
    or failing
  • Recovery triggering mechanisms (changeover
    procedure failures, DSP device manager failures
    or other triggering failures)
  • Load sharing algorithm (ineffective use of
    resources, exceeding the capacity of the system
    before taking action or wrong resource sharing on
    right network nodes)
  • HW/SW update procedure failures, which lead to
    faulty configurations and interworking of network
    elements
  • Wrong network configuration (often because of
    complex network design)

10
Used Methodology (1/3)
  • Failures are considered to occur on different
    levels

11
Used Methodology (2/3)
  • A partly redundant example network is studied in
    the thesis
  • A tree-format FTA (Fault Tree Analysis) is used
    for analyzing the causes of failures. FTA is
    mainly applied in the SW reliability area
  • A literature study is performed to find
    mechanisms for achieving a higher level of system
    reliability
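
As an illustration of the fault-tree idea, the sketch below evaluates the probability of a top event from basic events joined by AND and OR gates. It is not the thesis's actual tree: the events, their probabilities and all class and function names are hypothetical, and the calculation assumes that basic events are independent.

```python
from dataclasses import dataclass, field
from math import prod
from typing import List, Union


@dataclass
class BasicEvent:
    name: str
    probability: float  # probability that this basic event occurs


@dataclass
class Gate:
    name: str
    kind: str  # "AND": all children must occur, "OR": any child suffices
    children: List[Union["Gate", BasicEvent]] = field(default_factory=list)


def probability(node: Union[Gate, BasicEvent]) -> float:
    """Probability of a node, assuming independent basic events."""
    if isinstance(node, BasicEvent):
        return node.probability
    child_probs = [probability(child) for child in node.children]
    if node.kind == "AND":
        return prod(child_probs)
    if node.kind == "OR":
        return 1.0 - prod(1.0 - p for p in child_probs)
    raise ValueError(f"unknown gate kind: {node.kind}")


# Hypothetical tree: user data is unavailable if both HLR units fail,
# or if the (non-redundant) signaling link fails.
top = Gate("user data unavailable", "OR", [
    Gate("both HLR units fail", "AND", [
        BasicEvent("primary HLR fails", 0.01),
        BasicEvent("standby HLR fails", 0.01),
    ]),
    BasicEvent("signaling link fails", 0.001),
])

if __name__ == "__main__":
    print(f"P(top event) = {probability(top):.7f}")
```

In this made-up tree the non-redundant link dominates the result, which shows how a fault tree points at the branch where added redundancy would help most.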

12
Used Methodology (3/3)
  • Example network configuration

13
Results of the Study (1/3)
  • The chain of fault detection, localization,
    analysis and recovery must be unbroken, otherwise
    failures cannot be recovered from completely
  • Redundancy must be applied at different levels of
    the system to achieve a high level of
    fault tolerance (a system is only as strong as
    its weakest component)
  • SW fault tolerance is increased by building
    distributed, reliable and scalable SW
  • The critical network nodes have to be duplicated,
    and restoration algorithms must be available

14
Results of the Study (2/3)
  • Emphasis should be placed on thorough system
    testing (especially testing of fault recovery
    mechanisms, load control and different failure
    scenarios)
  • SW-based mechanisms include:
  • Distributed SW architecture (a failure of one
    component involves a smaller fraction of the
    system, so the loss of data and resources can be
    recovered from more easily). A fault can be
    isolated to a smaller area when the architecture
    is distributed
  • Multithreaded protocol stacks, so that a failure
    of a process involves only part of the module
    capabilities, and SW modules that use dynamic
    checkpointing protocols for recovering from
    failures of peer entities
  • Minimizing the recovery process time (only the
    necessary part of the system is restarted:
    processor, board or node restart. This reduces
    the outage time)
  • A blackbox for SW failure analysis can be
    implemented inside SW components for later
    analysis
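
Two of the mechanisms above, restarting only the necessary part of the system and keeping a blackbox for later analysis, can be sketched together. The escalation order (processor, then board, then node) is one reading of the slide, and the `try_restart` callback, class names and log messages are hypothetical. The blackbox is modelled as a fixed-size ring buffer that keeps only the most recent events.

```python
from collections import deque
from typing import Callable, List, Optional

# Restart scopes, smallest (shortest outage) first.
RESTART_LEVELS = ["processor", "board", "node"]


class BlackBox:
    """Fixed-size event recorder; the oldest entries are overwritten."""

    def __init__(self, capacity: int = 100) -> None:
        self._events: deque = deque(maxlen=capacity)

    def record(self, message: str) -> None:
        self._events.append(message)

    def dump(self) -> List[str]:
        return list(self._events)


def recover(try_restart: Callable[[str], bool],
            blackbox: BlackBox) -> Optional[str]:
    """Escalate through restart scopes until one clears the fault.

    try_restart(level) is a hypothetical callback that performs the
    restart and returns True if the fault is gone afterwards.
    """
    for level in RESTART_LEVELS:
        blackbox.record(f"restart attempt: {level}")
        if try_restart(level):
            blackbox.record(f"recovered at: {level}")
            return level
    blackbox.record("recovery failed at all levels")
    return None


if __name__ == "__main__":
    box = BlackBox(capacity=10)
    # Simulated fault that only a board restart clears.
    level = recover(lambda lvl: lvl == "board", box)
    print("recovered at:", level)
    for event in box.dump():
        print(" ", event)
```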

15
Results of the Study (3/3)
  • Dynamic routing with a meshed network
    architecture is a recommended solution
    (Advantage: high tolerance for network failures
    and adaptability to different network
    configurations. Disadvantage: complicated design
    and maintenance of the network)
  • Network reliability is an optimization problem,
    but redundant HW invested in now can be used in
    the future to increase the capacity of the system
    if needed
  • Multifunction devices, MSC in Pool, multihoming
    and special algorithms may be utilized to
    increase the reliability of the system
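
The benefit of a meshed architecture can be shown with a small rerouting sketch: when a node fails, traffic takes another path if the mesh offers one. This is a plain breadth-first search over a static topology, not a real dynamic routing protocol, and the four-node mesh and all names are hypothetical.

```python
from collections import deque
from typing import Dict, FrozenSet, List, Optional


def find_route(links: Dict[str, List[str]], source: str, target: str,
               failed: FrozenSet[str] = frozenset()) -> Optional[List[str]]:
    """Shortest hop-count route that avoids failed nodes, or None."""
    if source in failed or target in failed:
        return None
    queue = deque([[source]])
    visited = {source}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == target:
            return path
        for neighbour in links.get(node, ()):
            if neighbour not in visited and neighbour not in failed:
                visited.add(neighbour)
                queue.append(path + [neighbour])
    return None


# Hypothetical mesh: A and D are each connected through both B and C.
MESH = {
    "A": ["B", "C"],
    "B": ["A", "C", "D"],
    "C": ["A", "B", "D"],
    "D": ["B", "C"],
}

if __name__ == "__main__":
    print(find_route(MESH, "A", "D"))                             # via B
    print(find_route(MESH, "A", "D", failed=frozenset({"B"})))    # via C
    print(find_route(MESH, "A", "D", failed=frozenset({"B", "C"})))
```

Losing both intermediate nodes still cuts A off from D, so a mesh tolerates single failures here but not every multiple simultaneous failure.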

16
Conclusions
  • Different redundancy mechanisms were discussed in
    this thesis and existing algorithms were compared
  • The network design trend seems to be moving
    towards smaller, adaptable network nodes and
    architecture
  • The reliability of the system is best achieved by
    using multiple levels of redundancy
  • Failure spread depends on the availability and
    workability of the methods for ensuring the
    reliability of the system

17
Future Work
  • OPEX (OPerating EXpenses) and CAPEX (CApital
    EXpenses) calculations for network architecture
    solutions
  • Testing of failure recovery mechanisms and the
    effect of failures using a real network or a
    simulated environment
  • Multiple simultaneous failures have been handled
    only partly in this thesis. More research needs
    to be performed on the subject (e.g. on
    tolerating a geographical catastrophe involving
    a large number of network nodes)