An introduction to


Slides: 65

Provided by: uninaStid5


Transcript and Presenter's Notes

1

An introduction to
NETWORK RESILIENCY
Giorgio Ventre, Stefano Avallone
COMICS Group
Dipartimento di Informatica e Sistemistica
Università di Napoli Federico II

2
References

Jean-Philippe Vasseur, Mario Pickavet, Piet
Demeester. Network Recovery: Protection and
Restoration of Optical, SONET-SDH, IP, and MPLS.
Morgan Kaufmann.
AA. VV. Building Survivable Networks, Feature
Issue of IEEE Network Magazine, March/April 2004.

3
Communication Networks Relevance

Communication networks are becoming fundamental
infrastructures:
the amount of data carried by communication
networks has grown considerably in recent years
many social and economic activities depend on
communication networks
many safety-critical activities depend on
communication networks.
Reliability is an essential feature of today's
communication networks!

4
Network Reliability: definition [1]

The (a) ability of a network to maintain or
restore an acceptable level of performance during
network failures by applying various restoration
techniques, and (b) mitigation or prevention of
service outages from network failures by applying
preventive techniques.
Synonym: network survivability.
[1] Alliance for Telecommunications Industry
Solutions (ATIS), http://www.atis.org/tg2k/_network_reliability.html

5
Network Reliability: related concepts

There are many concepts that are related to
network reliability, for example:
network element reliability: the probability that a
network element is fully operational during a
certain period of time
network element availability: the probability that
a network element is in an up-state at a given
instant of time t
network element fault: the inability of a network
element to perform a required action
....

6
Which failures may occur?

The ability of a network to provide the required
services may be compromised by different
failures:
planned or unplanned failures
internal or external failures
software or hardware failures
malicious or accidental failures
....

7
Accounted Failures

Providing actions to address every failure that
may occur in a communication network is
infeasible.
Network providers and ISPs normally plan
actions to address the most frequent
failures.
These failures are called accounted failures.
The most common types of accounted failure are
single link failure
single node failure.

8
Failures' Impact

In today's communication networks a single failure
may produce a major disruption in network
availability.
A single cut in an optical cable may drop
thousands of logical network connections.
On July 5, 2002 a submarine cable break affected
the Asia Pacific Cable Network 2 (APCN 2), causing
a considerable slowdown in all the network
connections among Japan, China, South Korea, etc.

9
Failures' Impact: ATC systems

Press release (http://www.natca.org/mediacenter/press-release-detail.aspx?id=394)
MASSIVE POWER, COMMUNICATIONS FAILURE AT MAJOR
AIR TRAFFIC CONTROL CENTER PUTS CONTROLLERS IN
DARK, FLIGHTS IN JEOPARDY
07/19/2006 Bob Marks
PALMDALE, Calif. - A massive
power and communications failure late Tuesday at
the Los Angeles Air Route Traffic Control Center
left scrambling air traffic controllers to deal
with a nightmare scenario: how to keep dozens of
flights away from each other above a large swath
of the Southwestern United States despite the
inability to see them, talk to them or relay
crucial instructions for 15 excruciatingly long
minutes.
Every ounce of skill, heart and determination
that controllers bring into the control room
every day was put to the test during one of the
worst outages to ever hit the facility. It was so
bad, controllers say, that the only thing they
had of use to aid the situation that actually
worked was their cell phones, devices which the
Federal Aviation Administration, inexplicably,
has barred from control rooms, further impeding
the safety of the system.
More details in http://themainbang.typepad.com/blog/2006/07/complete_failur.html

10
Network Reliability Parameters

Some parameters that may be used to characterize
the reliability of a network can be found in ITU
Recommendation G.911,
"Parameters and Calculation Methodologies for
Reliability and Availability of Fibre Optic
Systems".
The following slides introduce some of the parameters
defined in ITU G.911.

11
Failure in Time (FITs) and Maintenance Time

Failure in Time (FIT)
is the number of device failures that occur in a
specified time interval
it is normally expressed as failures per billion
device-hours.
Maintenance Time
the time interval during which a maintenance
action is performed on an item, either manually or
automatically, ...

12
Mean Time Between Failure (MTBF)

The Mean Time Between Failures (MTBF) is the
steady-state expectation of the time between failures.
Mathematically, the MTBF (in years per failure) is
related to the failure rate F (in FITs, i.e. failures per 10^9
hours) as follows
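
The formula itself is not present in this transcript. A reconstruction from the definitions above, assuming a year of 8760 hours (the recommendation may use a slightly different constant), would be:

```latex
\[
\mathrm{MTBF}\ [\text{years}] \;=\; \frac{10^{9}}{8760 \cdot F}
\]
```

For instance, under that assumption a device rated at F = 1000 FITs would have an MTBF of roughly 114 years.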

13
Mean Time To Repair (MTTR)

The Mean Time To Repair (MTTR) is defined as
total corrective maintenance time divided by the
total number of corrective maintenance actions
during a period of time.
Given the definitions of MTBF and MTTR the
availability A of an item may be derived as
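
The expression is not present in this transcript. The usual steady-state relation is A = MTBF / (MTBF + MTTR), and the sketch below assumes that is what the slide showed. The function names and the sample figures are illustrative only.

```python
HOURS_PER_YEAR = 8760  # assumption: 365-day year


def mtbf_hours_from_fits(fits: float) -> float:
    """MTBF in hours for a failure rate given in FITs (failures per 1e9 device-hours)."""
    return 1e9 / fits


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability A = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


if __name__ == "__main__":
    mtbf = mtbf_hours_from_fits(10_000)   # hypothetical device rated at 10,000 FITs
    a = availability(mtbf, mttr_hours=4)  # hypothetical 4-hour mean repair time
    print(f"MTBF = {mtbf / HOURS_PER_YEAR:.2f} years, A = {a:.6f}")
```

Because MTTR appears only in the denominator, shortening repair time raises availability even when the failure rate stays the same.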

14
Users, services and reliability requirements

Network reliability is a relative concept.
The reliability requirements of a communication
network depend on
the user type
the service type.
Different user-service combinations lead to
diverse requirements in terms of MTBF and MTTR.

15
User classification

According to their reliability requirements,
network users may be classified in the following
categories:
Safety-critical users. Users for which service
interruptions are unacceptable.
Business-critical users. Users for which any
service interruption causes a high financial
loss.
Low-cost users. Users for which service
interruptions cause only discomfort.
Basic-level users. Users for which service
reliability is only a side effect.

16
Availability Impact of Outages
Ref: Service Applications for SONET DCS
Distributed Restoration, IEEE J. Selected Areas
in Comm., Jan. 1994

[Figure: service outage impact versus restoration time after failure detection.
Time axis: 0, 50 ms, 200 ms, 2 s, 10 s, 5 min, 15 min, 30 min.
Ranges: protection switching range, 1st to 4th restoration target ranges, undesirable, unacceptable.
Impacts shown, from shortest to longest outage:
- service hit (reframes)
- potential voiceband disconnects (<5%); trigger changeover of CCS7 STP signaling links; affect cell rerouting process
- may drop voiceband calls, depending on channel bank vintage
- drop all circuit-switched connections; PL disconnects; potential packet (X.25) disconnects; potential data session time-outs
- packet (X.25) disconnects; data session time-outs
- network congestion; minor social/business impacts
- potentially FCC reportable; major social/business impacts]
17
Market Drivers for Survivability

Customer Relations
Competitive Advantage
Revenue
Negative - Tariff Rebates
Positive - Premium Services
Business Customers
Medical Institutions
Government Agencies
Impact on Operations
Minimize Liability

18
Network Survivability

Availability of 99.999% (five nines) => about 5
min of downtime per year
Since a network is made up of several components,
the ONLY way to reach five nines is to add
survivability in the face of failures
Survivability: continued services in the
presence of failures
Protection switching or restoration mechanisms are
used to ensure survivability
Add redundant capacity, detect faults, and
automatically re-route traffic around the failure
Restoration: related term, but slower time-scale
Protection: fast time-scale (10s-100s of ms),
implemented in a distributed manner to ensure
fast restoration
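
The claim that redundancy is needed to reach five nines can be illustrated with the standard series/parallel availability rules. The sketch below assumes independent failures, and the ten-element chain is a made-up example, not a figure from the slides.

```python
from math import prod


def series_availability(parts):
    """All components must be up: availabilities multiply."""
    return prod(parts)


def parallel_availability(parts):
    """At least one redundant component must be up (independent failures assumed)."""
    return 1 - prod(1 - a for a in parts)


if __name__ == "__main__":
    chain = [0.9999] * 10                      # ten hypothetical four-nines elements in series
    unprotected = series_availability(chain)   # lower than any single element
    # Protect every element with an identical, independent standby.
    protected = series_availability(
        [parallel_availability([a, a]) for a in chain]
    )
    print(f"unprotected={unprotected:.6f} protected={protected:.8f}")
```

A chain of elements is always less available than its weakest element, while duplicating each element pushes the end-to-end figure well above that of a single element.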

19
Failure Types & Other Motivations

Types of failure
Components: links, nodes, channels in WDM, active
components, software
Human error: backhoe fiber cut
Fiber inside oil/gas pipelines is less likely to be
cut
Systems: entire COs can fail due to catastrophic
events
Protection allows easy maintenance and upgrades
E.g., switch over traffic when servicing a link
Single failure vs multiple concurrent failures
Goal: mean repair time << mean time between
failures
Protection also depends on the kind of application.
Survivability may hence be provided at several
layers

20
Network Survivability Architectures
Linear Protection Architectures
Ring Protection Architectures
Mesh Restoration Architectures
21
Network Availability & Survivability
Availability is the probability that an item will
be able to perform its designed functions at the
stated performance level, within the stated
conditions and in the stated environment when
called upon to do so.
[Diagram: Availability in relation to Reliability and Recovery]
22
Quantification of Availability
Percent availability   N-nines   Downtime (min/year, rounded)
99%                    2-nines   5,000
99.9%                  3-nines   500
99.99%                 4-nines   50
99.999%                5-nines   5
99.9999%               6-nines   0.5
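
The downtime column is rounded to powers of ten. A minimal sketch of the underlying arithmetic, assuming a 365-day year (the function name is illustrative):

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; assumes a 365-day year


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Expected yearly downtime for a given availability percentage."""
    return (1 - availability_percent / 100) * MINUTES_PER_YEAR


if __name__ == "__main__":
    for pct in (99, 99.9, 99.99, 99.999, 99.9999):
        print(f"{pct}% -> {downtime_minutes_per_year(pct):.2f} min/year")
```

The unrounded values are about 5% higher than the table entries; five nines, for example, corresponds to roughly 5.26 minutes per year.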
23
PSTN

Individual elements have an availability of
99.99%
One cut-off call in 8000 calls (3 min for average
call). Five ineffective calls in every 10,000
calls.

[Figure: end-to-end PSTN connection through two NIs, two ANs, two LEs, two facility entrances, and one LD, annotated with per-segment values: 0.01 for each AN, 0.005 on four other segments, and 0.02.]
NI: Network Interface; LE: Local Exchange; LD: Long Distance; AN: Access Network
Source: http://www.packetcable.com/downloads/specs/pkt-tr-voipar-v01-001128.pdf
24
IP Network Expectations
Service Delay Jitter Loss Availability
Real Time Interactive (VOIP, Cell Relay ..) L L L H
Layer 2 Layer 3 VPNs (FR/Ethernet/AAL5) M
Internet Service H H M L
Video Services L M M H
H
L
L
L = Low, M = Medium, H = High
25
Measuring Availability: The Port Method

Based on the port count in the network
Does not take into account the bandwidth of ports
e.g., OC-192 and 64k are both ports
Good for dedicated access service because ports
are tied to customers.

\[
\text{Availability (\%)} = \frac{(\text{total number of ports} \times \text{sample period}) - (\text{number of impacted ports} \times \text{outage duration})}{\text{total number of ports} \times \text{sample period}} \times 100
\]
26
The Port Method Example

Network with 10,000 active access ports
An access router with 100 access ports fails for
30 minutes.
Total available port-hours = 10,000 × 24 = 240,000
Total down port-hours = 100 × 0.5 = 50
Availability for a single day =
(240,000 - 50) / 240,000 × 100 = 99.979166...%
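
The same calculation can be written as a small function. This is a sketch of the port method as described on the slides; the function name and the outage-list format are illustrative choices.

```python
def port_availability(total_ports, sample_hours, outages):
    """Port-method availability in percent.

    outages: iterable of (impacted_ports, outage_hours) pairs.
    """
    available = total_ports * sample_hours
    down = sum(ports * hours for ports, hours in outages)
    return (available - down) / available * 100


if __name__ == "__main__":
    # Figures from the slide: 10,000 ports, one day, 100 ports down for 30 minutes.
    print(round(port_availability(10_000, 24, [(100, 0.5)]), 6))
```

Passing several `(ports, hours)` pairs accumulates the down port-hours of multiple outages within the same sample period.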

27
The Bandwidth Method

Based on the amount of bandwidth available in the
network
Takes into account the bandwidth of ports
Good for core routers

\[
\text{Availability (\%)} = \frac{(\text{total amount of BW} \times \text{sample period}) - (\text{amount of BW impacted} \times \text{outage duration})}{\text{total amount of BW in network} \times \text{sample period}} \times 100
\]
28
The Bandwidth Method Example

Total capacity of network: 100 Gbit/s
An access router with 1 Gbit/s of bandwidth fails for
30 minutes.
Total BW available in network for a day = 100 × 24 = 2,400
Total BW lost in outage = 1 × 0.5 = 0.5
Availability for a single day =
((2,400 - 0.5) / 2,400) × 100 = 99.979166...%
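
The port and bandwidth methods differ only in how each port is weighted. The sketch below makes that explicit with a hypothetical `weight` parameter; the mixed network in the example is invented to show how far the two figures can diverge.

```python
def weighted_availability(links, sample_hours, weight):
    """Availability in percent, weighting each link by weight(link).

    links: list of dicts with 'gbps' (capacity) and 'down_hours' (outage time).
    weight=lambda l: 1 gives the port method;
    weight=lambda l: l['gbps'] gives the bandwidth method.
    """
    total = sum(weight(l) * sample_hours for l in links)
    lost = sum(weight(l) * l["down_hours"] for l in links)
    return (total - lost) / total * 100


if __name__ == "__main__":
    # Hypothetical network: one 10 Gbit/s port down for 1 hour, 99 healthy 64 kbit/s ports.
    links = [{"gbps": 10.0, "down_hours": 1.0}] + [
        {"gbps": 0.000064, "down_hours": 0.0} for _ in range(99)
    ]
    by_port = weighted_availability(links, 24, lambda l: 1)
    by_bw = weighted_availability(links, 24, lambda l: l["gbps"])
    print(f"port method: {by_port:.4f}%  bandwidth method: {by_bw:.4f}%")
```

When the failed port carries most of the capacity, the bandwidth method reports a much lower availability than the port method, which is why the slides recommend it for core routers.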

29
Basic Ideas: Working and Protect Fibers
30
Service classification (1/2)

Communication networks are used to carry many
different services.
Different services may have diverse reliability
requirements.
Reliability requirements of such services are
related to QoS parameters:
Bit Rate
Delay
Jitter
...

31
Service classification (2/2)
[2] A. Lason et al., Network Scenarios and
Requirements, European IST project Layers
Internetworking in Optical Network (LION),
deliverable D6, September 1999.
32
How to increase network reliability ?

Prevent network failures
put network cables deeper in the ground
more testing of hardware and software
.....
Duplicate vulnerable network elements
dual homing.
Regardless of these measures, network
failures still occur.
There is a need for network recovery or resilience
schemes!

33
Network recovery basic idea

Build networks to have alternate paths
Design systems to have alternate entities
Monitor for possible failures
Manage networks proactively

34
Network recovery requirements

Network recovery imposes several requirements.
For example
there should be backup capacity to create a
recovery path
the backup capacity must be enough to ensure QoS
constraints
single point of failure must be avoided
.....

35
Recovery and reversion cycles
Recovery Cycle
Reversion Cycle
36
Recovery mechanisms

A wide variety of recovery mechanisms exists.
Every mechanism has advantages and drawbacks.
The following slides report some criteria that may be
used to evaluate and classify recovery mechanisms
[3, 4].
[3] V. Sharma et al., Framework for MPLS-based
recovery, RFC 3469, IETF, Feb. 2003
[4] K. Owens, V. Sharma, M. Oommen, and F.
Hellstrand, Network Survivability Considerations
for Traffic Engineered IP Networks, Internet
draft draft-owens-te-network-survivability-03,
May 2002. Available at www.ietf.org. Accessed
July 2005

37
Backup Capacity

Dedicated
one-to-one relationship between the backup
resources and the working path
the simplest solution
an inefficient solution.
Shared
the backup resources are shared among different
working paths
a more complex solution
a more efficient solution.
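
The efficiency difference can be made concrete by sizing the backup capacity under each scheme. The sketch below assumes single-link failures only and that all backups share one spare link; the path data and function names are hypothetical.

```python
def dedicated_backup_capacity(working_paths):
    """Each working path gets its own backup resources: capacities add up."""
    return sum(p["demand"] for p in working_paths)


def shared_backup_capacity(working_paths):
    """Backup resources are shared; size them for the worst single-link failure.

    Only paths that cross the failed link need the backup at the same time.
    """
    links = set().union(*(p["links"] for p in working_paths))
    return max(
        sum(p["demand"] for p in working_paths if link in p["links"])
        for link in links
    )


if __name__ == "__main__":
    # Hypothetical working paths whose backups all cross the same spare link.
    paths = [
        {"demand": 10, "links": {"A-B", "B-C"}},
        {"demand": 10, "links": {"D-E"}},
        {"demand": 5, "links": {"B-C", "C-F"}},
    ]
    print(dedicated_backup_capacity(paths), shared_backup_capacity(paths))
```

Sharing needs less capacity because working paths that cannot fail together never claim the backup at the same time; the price is the extra bookkeeping of which paths share which risk.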

38
Recovery Path

Preplanned
recovery paths for all accounted failure scenarios
are calculated in advance
allows fast recovery from failures
lacks flexibility for unaccounted failure
scenarios.
Dynamic
the recovery path is calculated on the fly when
the failure is detected
may also be used to search for recovery paths in
unaccounted failure scenarios.
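
A dynamic recovery path can be found by rerunning a path search on the topology with the failed link removed. The sketch below uses a plain breadth-first search (minimum hop count) as a stand-in for whatever routing algorithm a real network would use; the ring topology is hypothetical.

```python
from collections import deque


def shortest_path(graph, src, dst, failed_links=frozenset()):
    """Breadth-first search for a minimum-hop path that avoids failed links.

    graph: dict mapping node -> iterable of neighbour nodes (undirected).
    failed_links: set of frozenset({u, v}) pairs to exclude.
    Returns the path as a list of nodes, or None if dst is unreachable.
    """
    previous = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbour in graph[node]:
            link = frozenset((node, neighbour))
            if neighbour not in previous and link not in failed_links:
                previous[neighbour] = node
                queue.append(neighbour)
    return None


if __name__ == "__main__":
    # Hypothetical four-node ring A-B-C-D-A.
    ring = {"A": ["B", "D"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C", "A"]}
    print(shortest_path(ring, "A", "C"))                           # working path
    print(shortest_path(ring, "A", "C", {frozenset(("A", "B"))}))  # after link A-B fails
```

A preplanned scheme would run the same search ahead of time for every accounted failure and store the results, trading flexibility for recovery speed.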

39
Recovery Approaches

Protection
the recovery paths are preplanned and fully
signaled before a failure occurs
when a failure occurs no additional signaling is
needed to establish the recovery path
is the faster solution.
Restoration
the recovery paths may be preplanned or dynamically
allocated, but are not signaled in advance
when a failure occurs additional signaling is
needed to establish the recovery path
is a more flexible solution.

40
Protection Variants (1/2)

1+1 Protection (Dedicated Protection)
there is exactly one dedicated recovery path for
each working segment
the traffic is permanently duplicated on both the
working path and the recovery path
is a quite expensive solution.
1:1 Protection (Dedicated Protection with extra
traffic)
there is exactly one dedicated recovery path for
each working segment
the traffic is transmitted over only one path at a
time
it is possible to transport extra traffic along
the recovery path in failure-free conditions.

41
Protection Variants (2/2)

1:N (Shared Recovery with Extra Traffic)
each recovery entity is used to protect N working
entities
it is possible to use the recovery entities to
transport extra traffic in failure-free
conditions.
M:N (M ≤ N)
a set of M recovery entities is used to protect
a set of N working entities
it is possible to use the recovery entities to
transport extra traffic in failure-free
conditions.

42
Recovery Extent (1/2)

Local Recovery
in a failure condition only the affected network
elements are bypassed using the recovery path
the RHE and RTE are closer to the failure, so
they may detect the failure quickly, leading to a
shorter recovery time
after a failure the route followed by the
traffic may not be optimal (e.g., the same traffic
may cross a link twice!)
recovery fails if two successive nodes fail

43
Recovery Extent (2/2)

Global Recovery
in a failure condition the complete working path
between source and destination is bypassed
the recovery time is greater than that of
local recovery
an optimal recovery path is used in case of
failure
if two successive nodes fail, it may
still resolve the problem
may generate more state overhead than the local
approach.
An intermediate solution between the local and global
approaches may be adopted!

44
Control of Recovery Mechanisms (1/2)

Centralized
a central controller determines the action to
take in case of failure
the central controller also determines when and
where a fault has occurred
the central controller is a single point of
failure
is generally an efficient approach
is in principle a simpler approach, but
the central controller may become a very complex
system

45
Control of Recovery Mechanisms (2/2)

Distributed
there is no centralized controller; all the
network elements are capable of reacting
autonomously to failures
with this approach there is no global view of
the network condition
the network elements may have to exchange
information to keep a consistent view of the
network
is a more scalable approach.

46
Protection Topologies - Ring

Two or more nodes connected to each other with a
ring of links

[Figure: ring of nodes with ports labeled E and W (further labels: D, L); legend: Working, Protect]
47
Protection Topologies - Mesh

Three or more nodes connected to each other
Can be sparse or complete meshes
Spans may be individually protected with linear
protection
Overall edge-to-edge connectivity is protected
through multiple paths

[Figure: mesh topology; legend: Working, Protect]
48
Protection Switching Terminology

1+1 architectures - permanent bridge at the
source - select at sink
m:n architectures - m entities provide protection
for n working entities, where m is less than or
equal to n
allows unprotected extra traffic
most common - SONET linear 1+1 and 1:n

49
1+1 vs 1:n
[Figure: working and protect lines in a 1+1 arrangement and in a 1:n arrangement]
50
SONET Linear 1+1 APS
TX: Transmitter; RX: Receiver
BR: Bridge; SW: Switch
[Figure: two nodes connected by working and protection lines in both directions, each end showing TX, RX, BR, and SW blocks]
51
SONET 1:1 Linear APS
TX: Transmitter; RX: Receiver
BR: Bridge; SW: Switch
[Figure: two nodes connected by working and protection lines in both directions, each end showing TX, RX, BR, and SW blocks, plus an APS channel]
52
Protection Switching Terminology

Dedicated vs shared: the working connection is assigned
dedicated or shared protection bandwidth
1+1 is dedicated, 1:n is shared
Revertive vs non-revertive: after the failure is
fixed, traffic is automatically or manually
switched back
Shared protection schemes are usually revertive
Uni-directional or bi-directional protection
Uni-directional: each direction of traffic is handled
independently of the other.
Fiber cut => only one direction is switched over to
protection. Usually done with dedicated
protection; no signaling required.
Bi-directional: transmission on fiber (full
duplex) => requires bi-directional switching;
signaling required

53
Mesh Restoration
[Figure: mesh of DCS nodes showing a working path, a line (link) restoration route, and a path restoration route]

Control: centralized or distributed
Route calculation: preplanned or dynamic
Type of alternate routing: line or path

54
Link vs. Path restoration

Link restoration
Requires the ability to identify the failed link
at both ends.
Cannot protect against node failure.
Link-based mesh (generalized loopback): insensitive to
additions to the network, scalable, backup path can
be pre-computed, fast recovery, dynamic
rerouting
Path restoration
More resilient than link restoration.
Reroutes the traffic from the primary path to a
Shared Risk Group (SRG)-disjoint backup path.
Protects both end-to-end paths and single links.

Preferred: path-based
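
The difference between the two schemes can be seen by computing both routes for the same failure. The sketch below uses a minimum-hop breadth-first search; the six-node ring and the function names are hypothetical, and it assumes the failed link `(u, v)` is given in working-path order and that a detour exists.

```python
from collections import deque


def bfs_path(graph, src, dst, banned=frozenset()):
    """Minimum-hop path from src to dst avoiding the links in `banned`."""
    previous = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for nxt in graph[node]:
            if nxt not in previous and frozenset((node, nxt)) not in banned:
                previous[nxt] = node
                queue.append(nxt)
    return None


def link_restoration(graph, working, failed):
    """Keep the working path and splice a detour between the failed link's ends."""
    u, v = failed
    i = working.index(u)  # assumes working[i + 1] == v
    detour = bfs_path(graph, u, v, {frozenset(failed)})
    return working[:i] + detour + working[i + 2:]


def path_restoration(graph, working, failed):
    """Reroute end to end, from source to destination, around the failed link."""
    return bfs_path(graph, working[0], working[-1], {frozenset(failed)})


if __name__ == "__main__":
    # Hypothetical six-node ring A-B-C-D-F-E-A; working path A-B-C-D, link B-C cut.
    net = {"A": ["B", "E"], "B": ["A", "C"], "C": ["B", "D"],
           "D": ["C", "F"], "E": ["A", "F"], "F": ["E", "D"]}
    print(link_restoration(net, ["A", "B", "C", "D"], ("B", "C")))
    print(path_restoration(net, ["A", "B", "C", "D"], ("B", "C")))
```

In this topology the link-restored route backtracks over links it has already used, while the path-restored route goes straight from source to destination.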

55
Link vs. Path restoration
[Figure: six-node example network (A to F) with a link cut, comparing link (generalized loopback) restoration with path restoration]
56
Pre-compute vs. Real-time

Pre-computed
calculates restoration paths before a failure
happens.
Allows prior availability of reroute information
to the nodes where actions need to be taken after
failure is detected.
Enables fast restoration.
Real-time
calculates restoration paths after a failure
happens.
Restoration is slower.
Enables more efficient capacity utilization.

Preferred: pre-computed

57
Centralized vs. Distributed

Centralized restoration
Computes restoration and primary paths for all
demands with up-to-date information
Routes may then be downloaded into nodal
databases.
Effectiveness?
More capacity-efficient
Possibly slow (but may be executed in the
background)
Scalability in question.
Distributed restoration
Source and destination nodes dynamically search
for the protection wavelengths required to
reestablish the disrupted lightpath
Lacking knowledge of the sharing databases of
other OXCs, a node may not be able to determine
backup sharability for a given primary path

Preferred:
Central path determination
Distributed restoration

58
Protection Topologies - Linear

Two nodes connected to each other with two or
more sets of links

[Figure: working and protect links between two nodes, in a 1+1 arrangement and in a 1:n arrangement]
59
Mesh Restoration vs Ring/Linear Protection
Extracted from T-H. Wu, Emerging Technologies
for Fiber Network Survivability, See References
60
IP layer restoration

IP Layer Restoration (real-time)
Achieved by exchanging control messages between
adjacent routers
Re-determine the affected route
Update routing tables
Propagate changes (OSPF, BGP-4)
Capable of recovery from multiple faults
Slow (tens of seconds to minutes [Fumagalli]);
requires online processing upon failure
Fault discovery
Explicit: ICMP messaging
Implicit: expiry of timers
Guarantees networkwide survivability
Independent of underlying physical network

Application
Presentation
Session
Transport
Network (IP)
Data Link
Physical
61
MPLS layer restoration

MPLS Layer Protection
Real-time or pre-computed
Line or path level protection
Protection path is node and link disjoint from
the primary path.
Protection path may be allocated to low-priority
traffic in the absence of network failure.
Faster than dynamic IP rerouting
Working LSPs have pre-established node/link
disjoint protection paths

Application
Presentation
Session
Transport
Network
MPLS
Data Link
Physical
62
Optical layer restoration

Optical layer restoration
Real-time or pre-computed
Ring protection or mesh restoration
No visibility into higher layer operations.
May be wasteful use of resources.
For ring protection, there is over 100% capacity
redundancy
For mesh restoration, a 60-80% physical redundancy
level is typical.
Not recommended for node (or software) failures
Faster than higher layer restorations (??)

Application
Presentation
Session
Transport
Network (IP)
DWDM (Optical)
Physical
63
Multilayer Recovery (1/2)

In a multilayer network it is possible to imagine
a situation in which each layer has its own
recovery mechanisms.
Not every failure in a particular layer may be
resolved in the same layer.
If a failure may be resolved in several layers,
uncoordinated actions may produce inefficient
results.
Coordination among the layers is needed!

64
Multilayer Recovery (2/2)

Sequential Approach [1]
using a hold-off time, a chronological order
is imposed among the recovery mechanisms adopted in
different layers
alternatively, a token may be used to impose a
sequential order among the different layers.
Integrated Approach [1]
there is a recovery scheme that has a full
overview of all the layers
the recovery scheme may decide when and in which
layer (or layers) the recovery actions must be
taken.
[1] D. Colle et al., Data-centric optical
networks and their survivability, IEEE Journal on
Selected Areas in Communications, vol. 20,
no. 1, Jan. 2002, pp. 6-20
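
The hold-off variant of the sequential approach can be sketched as a simple decision rule for two layers. The layer names, timings, and function name below are illustrative assumptions, not values from the cited paper.

```python
def sequential_recovery(lower_ms, higher_ms, hold_off_ms):
    """Two-layer recovery coordinated by a hold-off timer.

    The lower layer starts recovering immediately. The higher layer waits
    hold_off_ms and acts only if the failure is still present.
    lower_ms is None when the lower layer cannot repair this failure.
    Returns (recovering_layer, total_outage_ms).
    """
    if lower_ms is not None and lower_ms <= hold_off_ms:
        return "lower", lower_ms
    return "higher", hold_off_ms + higher_ms


if __name__ == "__main__":
    # Hypothetical timings: optical protection in 50 ms, IP rerouting in 2000 ms.
    print(sequential_recovery(50, 2000, hold_off_ms=100))    # fiber cut, fixed optically
    print(sequential_recovery(None, 2000, hold_off_ms=100))  # router failure, IP must act
```

The hold-off time prevents both layers from reacting to the same failure, at the cost of delaying higher-layer recovery whenever the lower layer cannot help.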