OAM and QoS

1
OAM and QoS

Presented by
Yaakov (J) Stein
Chief Scientist

Unique Access Solutions
2
Service Guarantees
3
Why do we pay for services ?

Generally good (and frequently much better than
toll quality)
voice service is available free of charge
(Skype, Fring, Nimbuzz, ...)
So why does anyone pay for voice services ?
Similarly, one can get free
(WiFi) Internet access
email boxes
file storage and sharing
web hosting
software services
So why pay ?

4
Paying for QoS

The simple answer is that one doesn't pay for the
service
one pays for Quality of Service guarantees
In our voice model
But what does QoS mean
and why are we willing to pay for it ?
To explain, we need to review some history

5
Father of the telephone

Everyone knows that the father of the telephone
was
Alexander Graham Bell
(along with his assistant Mr. Watson)
But Bell did not invent the telephone network
Bell and Watson sold pairs of phones to customers
The father of the telephone network was
Theodore Vail

6
Theodore Vail -

Theodore Who?
Son of Alfred Vail (Morse's coworker)
Ex-General Superintendent of US Railway Mail
Service
First general manager of Bell Telephone
Father of the PSTN
Why is he so important?
Organized PSTN
Established principle of reinvestment in R&D
Established Bell Telephone's IPR division
Executed merger with Western Union to form AT&T
Solved the main technological problems
use of copper wire
use of twisted pairs
Organized telephony as a service (like the postal
service!)
Vailism is the philosophy that public services
should be run as closed centralized monopolies
for the public good

7
What's the difference ?

In the Bell-Watson model
the customer pays once, but is responsible for
installation
wires
wiring
operations
power
fault repair
performance (distortion and noise)
infrastructure maintenance
while the Bell company is responsible only for
providing functioning telephones
In the Vail model the customer pays a monthly fee
but the provider assumes responsibility for
everything
including fault repair and performance
maintenance
the telephone company owns the telephone sets
and even the wires in the walls !

8
Service Level Agreements

In order to justify recurring payments
the provider agrees to a minimum level of
service in an SLA
SLAs should capture Quality of user Experience
(QoE)
but this is often hard to quantify
So SLAs usually actually detail measurable
network parameters
that influence QoE, such as
availability (e.g., the famous five nines)
time to repair (e.g., the famous 50 ms)
information rate (throughput)
information latency (delay)
allowable defect densities (noise/distortion)
Availability (basic connectivity) always
influences QoE
It is hard to predict the effect of the other
parameters on QoE even when there is only one
application (e.g., voice)
When multiple applications are in use - it may be
impossible
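
As a worked example of the availability figures mentioned above, the following minimal sketch converts an availability fraction into the downtime it allows per year. The function name and the 365-day year are assumptions made for illustration, not part of any SLA definition.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: a 365-day year


def downtime_minutes_per_year(availability: float) -> float:
    """Maximum yearly downtime (in minutes) allowed by an availability fraction."""
    return (1.0 - availability) * MINUTES_PER_YEAR


# Compare three, four and five nines.
for label, a in [("three nines", 0.999), ("four nines", 0.9999), ("five nines", 0.99999)]:
    print(f"{label}: {downtime_minutes_per_year(a):.2f} minutes/year")
```

Under these assumptions, five nines leaves a little over five minutes of downtime per year.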

9
Some Applications

System traffic
routing protocols, DNS, DHCP, time delivery,
system update, OAM,
tunneling and VPN setup
Business processes
database access, backup and data-center, B2B,
ERP
Communications - interactive
voice, video conferencing, telepresence, instant
messaging,
remote desktop, application sharing
Communications - non-interactive
email, broadcast programming, music
video progressive download, live streaming,
interactive
Information gathering
http(s), Web 2.0, file transfer
Recreational
gaming, p2p file transfer
Malicious
DoS, malware injection, illicit information
retrieval

10
What do applications need ?

Some applications only require availability
Some also require minimum available throughput
Some require end-end (or RT) delay less than some
bound
Some require packet loss ratio (PLR) less than
some percentage
and these parameters are not necessarily
independent
For example,
TCP throughput drops with PLR

11
Some rules of thumb

Mission Critical (and life critical) applications
require
high availability
If there are any MC applications
then system traffic requires high availability
too
MC applications do not necessarily require strict
throughput
but always indirectly require
a certain minimal average throughput
bounded delay
If the MC application uses TCP then it requires
low PLR
Real-time applications require
sufficient throughput
but not necessarily low PLR (audio and video
codecs have PLC)
Interactive applications require
low RT delay
It may be more scalable for a SP to measure 1-way
delays

12
OAM
13
Monitoring an SLA

The Service Provider's justification for payment
is the maintenance of an SLA
To ensure SLA compliance, the SP must
monitor the SLA parameters
take action if parameter is dropping below
compliance levels
But how does the SP verify/ensure that the SLA is
being met ?
Monitoring is carried out using
Operations, Administration, Maintenance (OAM)
The customer too may use OAM to see that the SP
is compliant !
Technical note
OAM is a user-plane function
but may influence control and management plane
operations
for example
OAM may trigger protection switching, but doesn't
switch
OAM may detect provisioned links, but doesn't
provision them

14
Operations, Administration, Maintenance

Traditionally, one distinguishes between 2 OAM
functionalities
Fault Monitoring
OAM runs continuously/periodically at required
rate
detection and reporting of anomalies, defects,
and failures
used to trigger mechanisms in the
control plane (e.g. protection switching) and
management plane (alarms)
required for maintenance of basic connectivity
(availability)
Performance Monitoring
OAM run
before enabling a service
on-demand or
per schedule
measurement of performance criteria (delay, PDV,
etc.)
required for maintenance of all other QoE
attributes

15
Early OAM

Analog channels and 64 kbps digital channels
did not have mechanisms to check signal validity
and quality
Thus
major faults could go undetected for long periods
of time
hard to characterize and localize faults when
reported
minor defects might be unnoticed indefinitely
As PDH networks evolved, more and more OAM was
added on
monitoring for valid signal
loopbacks
defect reporting
alarm indication/inhibition
The OAM overhead started to explode in size !
When SONET/SDH was designed
bounded overhead was reserved for OAM functions

16
OAM for Packet Switched Networks

OAM is more complex for Packet Switched Networks
in addition to the previous defects
loss of signal
bit errors
we have new defect types
packets may be lost
packets may be delayed
packets may be delivered to the wrong destination
The first PSN-like network to acquire OAM was ATM
(I.610)
Although technically ATM is cell-based, not
packet-based

17
Some FM OAM mechanisms (1)

How do we perform Continuity Check ?
send OAM packets at a constant known rate
if CC packets are not received for > 3 intervals
then declare a fault
see also LB / echo mode
How do we perform Connectivity Verification ?
send OAM packets to a known destination
if CV packets are received somewhere else then
declare a fault
How do we indicate AIS (FDI) ?
when do not receive forward traffic send AIS OAM
packets
if AIS packets received then declare a fault
How do we indicate RDI (BDI) ?
when do not receive reverse traffic send RDI OAM
packets
if RDI packets received then declare a fault
Note RDI is often a flag set on CC message
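
The Continuity Check rule described at the top of this slide can be sketched as a small receiver-side monitor that declares a fault once no CC packet has arrived for more than three intervals. The class and method names are hypothetical, and times are passed in explicitly so the example needs no real clock.

```python
class ContinuityMonitor:
    """Declares a fault when no CC packet arrives for more than
    `threshold` transmission intervals (names are illustrative)."""

    def __init__(self, interval_s: float, start_time: float, threshold: float = 3.0):
        self.interval_s = interval_s
        self.threshold = threshold
        self.last_rx = start_time

    def on_cc_received(self, now: float) -> None:
        # Any received CC packet restarts the timeout.
        self.last_rx = now

    def fault(self, now: float) -> bool:
        # Fault if the silence exceeds threshold * interval.
        return (now - self.last_rx) > self.threshold * self.interval_s


monitor = ContinuityMonitor(interval_s=1.0, start_time=0.0)
monitor.on_cc_received(1.0)
print(monitor.fault(3.5), monitor.fault(4.5))
```

A real implementation would also raise and clear alarms and could set the RDI flag on its own outgoing CC messages; that is left out here.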

18
Some FM OAM mechanisms (2)

How do we use LoopBack ?
non-intrusive (in-service) (echo mode)
send LB request OAM packet to remote site
remote site replies with LB reply
if LB reply not received then declare a fault
intrusive (out-of-service)
put remote site into LB mode
remote site reflects (and does not forward) all
traffic
(note that it must monitor OAM traffic)
if packets sent are not received then declare a
fault
note need to inform next hops of LB by locking
How do we use LinkTrace ?
send LB request OAM packet to next hop
send LB request to following hop
etc.

19
Some PM OAM mechanisms (1)

How do we measure Packet Loss Ratio ?
Traffic (counter) based
maintain 2 counters
number of packets transmitted to peer Tx
number of packets received from peer Rx
send Tx counter to peer at time 1 Tx(1)
peer notes its Rx counter at time of reception
Rx(2)
and its Tx counter at time of its reply Tx(3)
originator notes its Rx counter when reply is
received Rx(4)
calculate PLR in both directions
Synthetic
do not maintain counters - use OAM packets
Note synthetic loss is only a rough estimate
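
The counter-based loss measurement above can be sketched as follows. The sketch assumes that loss is computed from the differences between two successive exchanges, so that the absolute counter values do not matter, and it ignores counter wraparound; the type and function names are hypothetical.

```python
from typing import NamedTuple, Tuple


class LossSample(NamedTuple):
    tx1: int  # originator Tx counter when the request was sent, Tx(1)
    rx2: int  # peer Rx counter when the request arrived, Rx(2)
    tx3: int  # peer Tx counter when the reply was sent, Tx(3)
    rx4: int  # originator Rx counter when the reply arrived, Rx(4)


def loss_ratios(prev: LossSample, curr: LossSample) -> Tuple[float, float]:
    """Return (forward PLR, reverse PLR) for the period between two exchanges.

    Forward is originator -> peer, reverse is peer -> originator.
    Counter wraparound is not handled in this sketch.
    """
    sent_fwd = curr.tx1 - prev.tx1
    sent_rev = curr.tx3 - prev.tx3
    lost_fwd = sent_fwd - (curr.rx2 - prev.rx2)
    lost_rev = sent_rev - (curr.rx4 - prev.rx4)
    plr_fwd = lost_fwd / sent_fwd if sent_fwd else 0.0
    plr_rev = lost_rev / sent_rev if sent_rev else 0.0
    return plr_fwd, plr_rev
```

For example, if the originator sent 1000 packets between two exchanges and the peer counted 990 of them, the forward PLR is 1%.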
How do we measure Throughput?
Primitive way (RFC 2544)
send packets at maximum rate and observe packet
loss
reduce rate until no loss is observed
Note there are more sophisticated mechanisms !
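
The primitive throughput procedure above (start at the maximum rate, back off until no loss is seen) can be sketched as a simple loop. This follows the slide's description rather than the text of RFC 2544 itself; `measure_loss` is a hypothetical callback that runs a trial at the given rate and returns the observed loss ratio, and the 10% step is an arbitrary choice.

```python
from typing import Callable


def find_lossless_rate(max_rate_bps: float,
                       measure_loss: Callable[[float], float],
                       step: float = 0.9,
                       min_rate_bps: float = 1.0) -> float:
    """Reduce the offered rate geometrically until a trial shows no loss.

    Returns the first lossless rate found, or 0.0 if none is found
    above min_rate_bps.
    """
    rate = max_rate_bps
    while rate >= min_rate_bps:
        if measure_loss(rate) == 0:
            return rate
        rate *= step  # back off and try again
    return 0.0


# Simulated link that drops everything above 60 Mbps (illustrative only).
capacity = 60e6
rate = find_lossless_rate(100e6, lambda r: max(0.0, (r - capacity) / r))
print(f"lossless at about {rate / 1e6:.1f} Mbps")
```

The result is only as fine-grained as the step size, which is one reason more sophisticated search strategies exist.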

20
Some PM OAM mechanisms (2)

How do we measure 1-way Packet Delay (Latency) ?
synchronize clocks at both OAM peers
send timestamp T1 to peer
peer timestamps receipt with T2
calculate time difference T2 - T1
How do we measure 2-way Packet Delay (Latency) ?
send timestamp T1 to peer
peer timestamps receipt with T2
peer replies at T3
originator timestamps receipt of reply at T4
calculate time difference (T4 - T1) - (T3 - T2)
assuming symmetry, 1-way delay is half this
amount
Note do not need to synchronize clocks
How do we measure Packet Delay Variation ?
send timestamps at a constant rate
peer calculates timestamp differences and
statistics thereof
Note do not need to synchronize clocks
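
The three delay measurements above reduce to simple timestamp arithmetic, sketched below. The function names are hypothetical, and the delay variation is computed here as each packet's deviation from the smallest observed timestamp difference, which is one common choice and an assumption of this sketch.

```python
from typing import List


def one_way_delay(t1: float, t2: float) -> float:
    """T2 - T1; only meaningful if both peers' clocks are synchronized."""
    return t2 - t1


def two_way_delay(t1: float, t2: float, t3: float, t4: float) -> float:
    """(T4 - T1) - (T3 - T2): round trip minus the peer's processing time.

    T1 and T4 come from the originator's clock, T2 and T3 from the peer's,
    so any fixed offset between the two clocks cancels out.
    """
    return (t4 - t1) - (t3 - t2)


def delay_variation(tx_times: List[float], rx_times: List[float]) -> List[float]:
    """Per-packet deviation from the minimum (rx - tx) difference.

    A constant clock offset shifts every difference equally and cancels.
    """
    diffs = [rx - tx for tx, rx in zip(tx_times, rx_times)]
    base = min(diffs)
    return [d - base for d in diffs]


# Peer clock is about 100 s ahead of the originator; the offset cancels.
print(two_way_delay(0.0, 100.010, 100.012, 0.022))
print(delay_variation([0.0, 1.0, 2.0], [5.0, 6.5, 7.0]))
```

Halving the two-way result gives the 1-way estimate only under the symmetry assumption stated on the slide.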

21
Ethernet OAM
22
What about Ethernet ?

Carrier Ethernet has replaced ATM as the default
layer-2
Ethernet is by far the most widespread network
interface
Ethernet has some advantages as compared to ATM
it has network-wide unique addresses
it has a source address in every packet
but some aspects make Ethernet OAM more difficult
ConnectionLess (CL)
multipoint to multipoint
overlapping layering - need OAM for operator,
SPs, customer
some specific problematic ETH behaviors
(flooding, multicast, ...)

23
What's the problem with CL ?

OAM makes a lot of sense in Connection Oriented
environments
connections last a relatively long amount of time
there is some SLA at the connection level
For CL networks, the network path is neither
known nor pinned
So it doesn't really make sense to talk about FM
what does continuity mean if, when a link goes
down,
the network automatically reroutes around the
failure ?
The Ethernet CL problem is solved by overlaying
CO functionality
flows or
EVCs

24
Ethernet OAM

For many years there was no OAM for Ethernet
(LANs don't need OAM)
now there are two incompatible ones!
Link layer OAM - 802.3 clause 57 (EFM OAM,
802.3ah)
single link only
slow protocol, limited functionality
some management functions
Service OAM - Y.1731, 802.1ag (CFM)
any network configuration
multilevel OAM functionality
In some cases one may need to run both
while in others only service OAM makes sense
Link layer OAM is only for a single link, which
is necessarily CO
Service OAM is most frequently used for
infrastructure networks,
which are also CO

25
Layer 2 control protocols (L2CPs)

Do not be confused - L2CPs are NOT OAM !
Here are a few well-known L2CPs

protocol             DA (and encapsulation)                                  reference
STP/RSTP/MSTP        01-80-C2-00-00-00, 802.2 LLC                            802.1D 8,9; 802.1D 17; 802.1Q 13
PAUSE                01-80-C2-00-00-01                                       802.3 31B, 802.3x
LACP/LAMP            01-80-C2-00-00-02, EtherType 88-09, Subtype 01 and 02   802.3 43 (ex 802.3ad)
Link OAM             01-80-C2-00-00-02, EtherType 88-09, Subtype 03          802.3 57 (ex 802.3ah)
ESMC                 01-80-C2-00-00-02, EtherType 88-09, Subtype 10          G.8264
Port Authentication  01-80-C2-00-00-03                                       802.1X
E-LMI                01-80-C2-00-00-07                                       MEF-16
Provider MSTP        01-80-C2-00-00-08                                       802.1D, 802.1ad
Provider MMRP        01-80-C2-00-00-0D                                       802.1ak
LLDP                 01-80-C2-00-00-0E, EtherType 88-CC                      802.1AB-2009
GARP (GMRP, GVRP)    block 01-80-C2-00-00-20 through 01-80-C2-00-00-2F       802.1D 10, 11, 12
Note IEEE disallows forwarding of L2CPs, MEF
allows it under certain circumstances
26
Link Layer OAM (AKA EFM OAM)

Ethernet in the First Mile (Last Mile ?)
EFM networks are mostly p2p DSL links or p2mp
PONs
thus a link layer OAM is sufficient for EFM
applications
Since EFM link is between customer and Service
Provider
EFM OAM entities are either active (SP) or
passive (customer)
active entity can place passive one into LB mode
but not the reverse
EFM OAMPDUs are slow protocol frames - never
forwarded
Ethertype 88-09 and subtype 03
messages multicast to slow protocol specific
group address
OAMPDUs must be sent once per second (heartbeat)
messages are TLV-based

27
EFM OAM capabilities

6 codes are defined
Information (autodiscovery, heartbeat, fault
notification)
Event notification (statistics reporting)
Variable request (active entity queries passive's
configuration) (mngt)
Variable response (passive entity responds to
query) (mngt)
Loopback control (active entity enable/disable of
intrusive LB mode)
Organization specific (proprietary extensions)
and there are flags in every OAMPDU to
expedite notification of critical events
link fault (RDI)
dying gasp
unspecified
monitor slow degradations in performance

28
Service OAM (AKA CFM, Y.1731)

Many SPs need to monitor full networks
not just single links
Service layer OAM provides end-to-end integrity
of the Ethernet service over arbitrary server
layers
Because Ethernet is flat
not true client-server layering (except
MAC-in-MAC)
service layer OAM is multilevel
Because SPs want to replace transport networks
with Ethernet
service OAM must support all OAM features
and must enable advanced transport capabilities
(such as linear/ring protection switching)
a transport network is a network with
High availability (Fault Management OAM and
Automatic Protection Switching)
SLA support (Performance Management OAM and QoS
mechanisms)
a Management plane (optionally a control plane)
for configuration and provisioning
Efficiency and Scalability

29
Y.1731 messages

Y.1731 supports many OAM message types
Continuity Check - proactive heartbeat with 7
possible rates
Synthetic Loss Measurement - on demand loss rate
estimation
LoopBack - unicast/multicast pings with optional
patterns
Link Trace - identify path taken to detect
failures and loops
AIS - periodically sent when CC fails
RDI - flag set to indicate reverse defect
Client Signal Fail - sent by MEP when client
doesn't support AIS
LoCK signal - inform peer entity about diagnostic
actions
TeST signal - in-service/out-of-service tests for
loss rate, etc.
Automatic Protection Switching
Maintenance Communications Channel - remote
maintenance
EXPerimental
Vendor SPecific

30
Y.1731 frame format

after DA, SA and EtherType (89-02)
Y.1731/802.1ag PDUs have the following header
(may be VLAN tagged)
if there are sequence numbers/timestamp(s)
they immediately follow
then come TLVs, the end TLV, followed by the
CRC
TLVs have 1B type and 2B length fields
there may or may not be a value field
the end-TLV has type zero and no length or
value fields
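
The TLV layout described above (1-byte type, 2-byte length, optional value, terminated by an End TLV of type zero with no length or value) can be walked with a short parser. This is a minimal sketch: the function name is hypothetical, the length is assumed to be in network byte order, and the input is assumed to start at the first TLV, after the header and any sequence number or timestamps.

```python
import struct
from typing import List, Tuple


def parse_tlvs(buf: bytes) -> List[Tuple[int, bytes]]:
    """Return a list of (type, value) pairs up to the End TLV."""
    tlvs = []
    pos = 0
    while pos < len(buf):
        tlv_type = buf[pos]
        pos += 1
        if tlv_type == 0:
            # End TLV: type zero, no length or value fields.
            return tlvs
        if pos + 2 > len(buf):
            raise ValueError("truncated TLV length")
        (length,) = struct.unpack_from("!H", buf, pos)  # 2-byte length
        pos += 2
        if pos + length > len(buf):
            raise ValueError("truncated TLV value")
        tlvs.append((tlv_type, buf[pos:pos + length]))  # value may be empty
        pos += length
    raise ValueError("missing End TLV")


# Two TLVs with arbitrary example type codes, then the End TLV.
sample = bytes([3, 0, 2, 0xAA, 0xBB, 32, 0, 0, 0])
print(parse_tlvs(sample))
```

Anything after the End TLV (the CRC in a real frame) is simply not consumed by this parser.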

31
Y.1731 PDU types
opcode           OAM type     DA
1                CCM          M1 or U
3                LBM          M1 or U
2                LBR          U
5                LTM          M2
4                LTR          U
6-31             RES IEEE
32-63 (unused)   RES ITU-T
33               AIS          M1 or U
35               LCK          M1 or U
37               TST          M1 or U
39               Linear APS   M1 or U
40               Ring APS     M1 or U
41               MCC          M1 or U
43               LMM          M1 or U
42               LMR          U
45               1DM          M1 or U
47               DMM          M1 or U
46               DMR          U
49               EXM
48               EXR
51               VSM
50               VSR
52               CSF          M1 or U
55               SLM          U
54               SLR          U
64-255           RES IEEE
32
MEPs and MIPs

Maintenance Entity (ME) - entity that requires
maintenance
ME is a relationship between ME end points
because Ethernet is MP2MP, we need to define a ME
Group
MEGs can be nested, but not overlapped
MEG LEVEL takes a value 0 - 7
by default - 0,1,2 operator, 3,4 SP, 5,6,7
customer
MEP = MEG End Point (MEG = ME Group, ME =
Maintenance Entity)
(in IEEE
MEG is called MA - Maintenance Association)
unique MEG IDs specify to which MEG we send the
OAM message
MEPs responsible for OAM messages not leaking out
but transparently transfer OAM messages of higher
level
MIPs - MEG Intermediate Points
never originate OAM messages,
process some OAM messages
transparently transfer others
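
The level-based handling at a MEP described above can be sketched as a three-way decision: OAM frames of a higher MEG level pass through transparently, frames at the MEP's own level are processed, and frames of a lower level are stopped so they do not leak out. The function name and the string results are hypothetical.

```python
def mep_action(mep_level: int, frame_level: int) -> str:
    """Decide what a MEP at `mep_level` does with an OAM frame at `frame_level`."""
    for level in (mep_level, frame_level):
        if not 0 <= level <= 7:
            raise ValueError("MEG level must be in 0..7")
    if frame_level > mep_level:
        return "forward"   # higher-level OAM is transferred transparently
    if frame_level == mep_level:
        return "process"   # this MEP terminates OAM of its own level
    return "drop"          # lower-level OAM must not leak past the MEP


# An SP-level MEP (level 3) seeing customer, SP and operator OAM frames.
print(mep_action(3, 5), mep_action(3, 3), mep_action(3, 1))
```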

33
MEPs and MIPs (cont.)
34
How is OAM used ?

MEF-30 Service OAM FM and MEF-xx Service OAM PM
describe the use of OAM for Carrier Ethernet
networks, such as
which Y.1731/802.1 features/messages should be
used
where to put MEPs, what MA names and MEG levels
should be used
minimum number of EVCs that must be supported
what should be reported and how
Y.1564 (ex Y.156sam) Ethernet Service Activation
Test Methodology
describes commissioning procedures (replaces
RFC2544-like benchmarking)
Tests that desired performance level can be
achieved, including
CIR, EIR (and optionally CBS and EBS for
bursting)
traffic policing
rate, loss, delay, delay variation, availability
(measured simultaneously)
Testing in two steps
Service Configuration Test - each service
separately
Service Performance Test - all services together
Performance testing may be for
15 minutes (new service on operational network)
2 hours (single operator network)
24 hours (multiple operator networks)

35
QoS enforcement
36
QoS approaches

There are two approaches to QoS handling
IntServ (guaranteed QoS)
define traffic flows (CO approach)
guarantee QoS attributes for each flow
reserve resources at each router along the flow
signaling protocol (e.g., RSVP) needed
DiffServ (statistical QoS)
retain CL paradigm
no guaranteed QoS attributes
mark packets (differentiated - e.g., gold,
silver, bronze)
marking can be by VLAN, P-bits, IP-ToS/DSCP, or
general flow
offer special treatment (priority) relative to
other packets
no resource reservation
For Ethernet and IP, DiffServ is the preferred
approach

37
Some fields for marking

Example
For an IPv4 packet inside Q-in-Q Ethernet
we have various choices for marking priority

802.1p user priority field (AKA P-bits)
values 0 - 7
priority tagging (VLAN 0) if no VLAN
P0 means non-expedited traffic
802.1Q recommends mappings

IP ToS
RFC 2474 redefined ToS to contain
6 bit DSCP (see also RFC 4594)
2 bit ECN
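
To make the marking fields concrete, the sketch below extracts the DSCP and ECN bits from the IP ToS byte and the P-bits from the 16-bit tag control field of a VLAN tag (where they occupy the top three bits). The function names and sample values are chosen for illustration only.

```python
from typing import Tuple


def split_tos(tos: int) -> Tuple[int, int]:
    """Split the 8-bit ToS byte into (6-bit DSCP, 2-bit ECN)."""
    return (tos >> 2) & 0x3F, tos & 0x03


def pcp_from_tci(tci: int) -> int:
    """Extract the 3 P-bits from a 16-bit VLAN tag control field."""
    return (tci >> 13) & 0x07


dscp, ecn = split_tos(0xB8)   # sample ToS byte
print(dscp, ecn)
print(pcp_from_tci(0xA064))   # sample tag: priority in the top 3 bits, VLAN 100
```

In a Q-in-Q frame the same extraction applies to both the outer and the inner tag, so the two tags may carry different priorities.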

38
Queuing

Ethernet switches have queues (FIFO buffers)
on each output port
If there were only one queue
then traffic handling would be FIFO
To enable DiffServ prioritization
multiple queues are used
Outgoing frames are inserted into queues
according to priority marking
Many methods for emptying queues
The most popular are
Strict Priority
always take from nonempty queue
of highest priority
Weighted Fair Queuing
take from nonempty queues according
to configured weight
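
The two queue-emptying disciplines above can be sketched as follows. Strict Priority is implemented as described. For the weighted case the sketch uses plain weighted round-robin, a deliberately simplified stand-in for true Weighted Fair Queuing (which also accounts for frame sizes). The function names, and the convention that a larger number means higher priority, are assumptions.

```python
from collections import deque
from typing import Deque, Dict, List, Optional


def strict_priority_dequeue(queues: Dict[int, Deque[str]]) -> Optional[str]:
    """Take one frame from the nonempty queue of highest priority."""
    for prio in sorted(queues, reverse=True):
        if queues[prio]:
            return queues[prio].popleft()
    return None  # all queues are empty


def weighted_round_robin(queues: Dict[int, Deque[str]],
                         weights: Dict[int, int]) -> List[str]:
    """One scheduling round: up to weights[prio] frames from each queue."""
    out: List[str] = []
    for prio in sorted(queues, reverse=True):
        for _ in range(weights.get(prio, 1)):
            if not queues[prio]:
                break
            out.append(queues[prio].popleft())
    return out


queues = {1: deque(["a1", "a2", "a3"]), 0: deque(["b1", "b2"])}
print(weighted_round_robin(queues, {1: 2, 0: 1}))
```

Strict Priority can starve the low-priority queues under sustained high-priority load, while the weighted scheme guarantees every queue some share of the port.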

39
Traffic shaping

One of the most important parts of an SLA is the
Committed Information Rate (bps)
This is the datarate (bandwidth) SP guarantees
will be forwarded
There may also be an
Excess Information Rate (bps)
This is a datarate that the SP will forward if
possible
Packet traffic is often bursty
A customer who did not send data for a while
will expect to be able to send a higher rate
afterwards
This is accomplished via traffic shaping
time integration is accomplished by leaky/token
buckets
the effect of shaping is marking drop eligibility
(marking a packet on the line is only possible
with S-tags!)
There is often also traffic policing
policing simply discards packets to police a
maximum rate !

40
MEF token bucket algorithm

Metro Ethernet Forum 10.x defines a bandwidth
profile
there are two byte buckets, C of size CBS and E
of size EBS (in bytes)
tokens are added to the buckets at rate CIR/8 and
EIR/8
when bucket overflows tokens are lost (use it or
lose it)
if ingress frame length < number of tokens in C
bucket
frame is green and its length in tokens is
debited from C bucket
else
if ingress frame length < number of tokens in E
bucket
frame is yellow and its length in tokens is
debited from E bucket
else frame is red
green frames are delivered
and service objectives apply
yellow frames are delivered
but service objectives don't apply
red frames are discarded

for simplicity we assume
no coupling and
no sharing !
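
The two-bucket algorithm on this slide can be sketched as a small colour marker, with no coupling and no sharing as assumed above. The class name is hypothetical, both buckets are assumed to start full, and the comparison is written as "less than or equal" so that a frame exactly matching the bucket content passes, whereas the slide words it as "less than".

```python
class BandwidthProfile:
    """Two token buckets (C and E) that colour frames green, yellow or red."""

    def __init__(self, cir_bps: float, cbs_bytes: int,
                 eir_bps: float, ebs_bytes: int, start_time: float = 0.0):
        self.cir_bps, self.cbs = cir_bps, cbs_bytes
        self.eir_bps, self.ebs = eir_bps, ebs_bytes
        self.c_tokens = float(cbs_bytes)  # assumption: buckets start full
        self.e_tokens = float(ebs_bytes)
        self.last = start_time

    def color(self, frame_len: int, now: float) -> str:
        # Refill at CIR/8 and EIR/8 bytes per second; overflow is lost.
        elapsed = now - self.last
        self.last = now
        self.c_tokens = min(self.cbs, self.c_tokens + elapsed * self.cir_bps / 8)
        self.e_tokens = min(self.ebs, self.e_tokens + elapsed * self.eir_bps / 8)
        if frame_len <= self.c_tokens:
            self.c_tokens -= frame_len
            return "green"    # delivered, service objectives apply
        if frame_len <= self.e_tokens:
            self.e_tokens -= frame_len
            return "yellow"   # delivered, service objectives do not apply
        return "red"          # discarded


bp = BandwidthProfile(cir_bps=8000, cbs_bytes=1500, eir_bps=8000, ebs_bytes=1500)
print([bp.color(1000, 0.0) for _ in range(3)], bp.color(1000, 1.0))
```

A burst of three back-to-back frames uses up first the C bucket and then the E bucket, and a frame sent after the buckets have had time to refill is green again.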
