Title: COMS/CSEE 4140 Networking Laboratory Lecture 06
1COMS/CSEE 4140 Networking Laboratory Lecture 06
- Salman Abdul Baset
- Spring 2008
2Announcements
- Lab 4 (5-7) due next week before your lab slot
- Prelab 5 due next week.
- There will be Lab 5 next week.
- Midterm (March 10th, duration 1.5 hours)
- Assignment 2 issues
- aslookup compilation?
- ISP name nslookup or whois for IP address
- Lab 4 (count-to-infinity issues)
3Agenda
- Autonomous Systems (AS)
- Policy vs. distance based routing
- Border gateway protocol (BGP)
- Transmission control protocol (TCP)
4Autonomous Systems Terminology
- local traffic: traffic with source or destination in the AS
- transit traffic: traffic that passes through the AS
- Stub AS: has a connection to only one AS; carries only local traffic
- Multihomed AS: has connections to more than one AS, but does not carry transit traffic
- Transit AS: has connections to more than one AS and carries transit traffic
5Stub and Transit Networks
- AS 1, AS 2, and AS 5 are stub networks
- AS 2 is a multi-homed stub network
- AS 3 and AS 4 are transit networks
6Selective Transit
- Example
- Transit AS 3 carries traffic between AS 1 and AS 4 and between AS 2 and AS 4
- But AS 3 does not carry traffic between AS 1 and AS 2
- The example shows a routing policy.
7Customer/Provider
- A stub network typically obtains access to the Internet through a transit network.
- A transit network that is a provider may be a customer of another network
- Customer pays provider for service
8Customer/Provider and Peers
- Transit networks can have a peer relationship
- Peers provide transit between their respective customers
- Peers do not provide transit between peers
- Peers normally do not pay each other for service
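The peering rules above can be read as an export filter keyed on where a route was learned. The sketch below assumes the common convention that customer routes are announced to every neighbor while peer and provider routes are announced only to customers; the function name and the relationship labels are illustrative and do not come from the slides.

```python
def may_export(learned_from: str, export_to: str) -> bool:
    """Decide whether a route learned from one kind of neighbor may be
    announced to another kind ("customer", "peer" or "provider")."""
    if learned_from == "customer":
        # Customer routes bring revenue, so they are announced to everyone.
        return True
    # Routes from peers or providers are passed on only to customers.
    return export_to == "customer"


# A peer's routes are not handed to another peer (no transit between peers).
print(may_export("peer", "peer"))      # False
# A customer's routes are announced to a peer (transit between customers).
print(may_export("customer", "peer"))  # True
```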
9Shortcuts through peering
- Note that peering reduces upstream traffic
- Delays can be reduced through peering
- But Peering may not generate revenue
10ASNs already assigned
Source: http://www.potaroo.net/tools/asn32/
Private ASNs: 64512 to 65535 (RFC 1930)
11ASNs in use
12ASN projections
13Autonomous Routing Domains Don't Always Need BGP or an ASN
ARDs versus ASes
[Figure: Yale University (130.132.0.0/16) connected to Qwest by static routes. Qwest nails up routes for 130.132.0.0/16 pointing to Yale; Yale nails up a default route 0.0.0.0/0 pointing to Qwest.]
Static routing is the most common way of connecting an autonomous routing domain to the Internet. This helps explain why BGP is a mystery to many.
14ASNs Can Be Shared (RFC 2270)
[Figure: AS 701 (UUNet) connected to three customers that all use AS 7046: Crestar Bank, NJIT, and Hood College. One customer prefix shown: 128.235.0.0/16.]
ASN 7046 is assigned to UUNet. It is used by customers that are single-homed to UUNet but need BGP for some reason (load balancing, etc.). See RFC 2270.
15ARDs and ASes Summary
- Most ARDs have no ASN (statically routed at Internet edge)
- Some unrelated ARDs share the same ASN (RFC 2270)
- Some ARDs are implemented with multiple ASNs (example: Worldcom)
ASes are just an implementation detail of inter-domain routing
16Agenda
- Autonomous Systems (AS)
- Policy vs. distance based routing
- Border gateway protocol (BGP)
- Transmission control protocol (TCP)
17Why not minimize AS hop Count?
Shortest path routing is not compatible with
commercial relations
18Customer versus Provider
provider
customer
Customer pays provider for access to the Internet
19The Peering Relationship
20Peering Provides Shortcuts
21Peering Wars
Peer
- Reduces upstream transit costs
- Can increase end-to-end performance
- May be the only way to connect your customers to some part of the Internet (Tier 1)
Don't Peer
- You would rather have customers
- Peers are usually your competition
- Peering relationships may require periodic renegotiation
Peering struggles are by far the most contentious issues in the ISP world! Peering agreements are often confidential.
22Agenda
- Autonomous Systems (AS)
- Policy vs. distance based routing
- Border gateway protocol (BGP)
- Transmission control protocol (TCP)
23The Gang of Four
24BGP Overview
- BGP = Border Gateway Protocol v4. RFC 1771. (about 60 pages)
- Note: In the context of BGP, a gateway is nothing else but an IP router that connects autonomous systems.
- Interdomain routing protocol for routing between autonomous systems.
- Uses TCP to establish a BGP session and to send routing messages over the BGP session.
- Update only new routes.
- BGP is a path vector protocol. Routing messages in BGP contain complete routes.
- Network administrators can specify routing policies.
25BGP Policy-based Routing
- Each node is assigned an AS number (ASN)
- BGP's goal is to find any AS-path (not an optimal one). Since the internals of the AS are never revealed, finding an optimal path is not feasible.
- Network administrator sets BGP's policies to determine the best path to reach a destination network.
26The Border Gateway Protocol (BGP)
BGP: RFC 1771
Optional extensions: RFC 1997 (communities), RFC 2439 (damping), RFC 2796 (reflection), RFC 3065 (confederation)
Routing policy configuration languages (vendor-specific)
Current Best Practices in management of Interdomain Routing
BGP was not DESIGNED. It EVOLVED.
27BGP Route Processing
[Figure: BGP route processing pipeline.]
Receive BGP Updates -> Apply Import Policies (filter routes, tweak attributes) -> Best Route Selection (based on attribute values) -> Best Route Table -> Apply Export Policies (filter routes, tweak attributes) -> Transmit BGP Updates
Forwarding entries for the best routes are installed in the IP Forwarding Table.
Applying policy is open-ended programming, constrained only by the vendor configuration language.
28BGP Attributes
Value  Code                     Reference
-----  -----------------------  ---------
1      ORIGIN                   RFC1771
2      AS_PATH                  RFC1771
3      NEXT_HOP                 RFC1771
4      MULTI_EXIT_DISC          RFC1771
5      LOCAL_PREF               RFC1771
6      ATOMIC_AGGREGATE         RFC1771
7      AGGREGATOR               RFC1771
8      COMMUNITY                RFC1997
9      ORIGINATOR_ID            RFC2796
10     CLUSTER_LIST             RFC2796
11     DPA                      [Chen]
12     ADVERTISER               RFC1863
13     RCID_PATH / CLUSTER_ID   RFC1863
14     MP_REACH_NLRI            RFC2283
15     MP_UNREACH_NLRI          RFC2283
16     EXTENDED COMMUNITIES     [Rosen]
...
255    reserved for development
Most important attributes
Not all attributes need to be present in every announcement
From IANA: http://www.iana.org/assignments/bgp-parameters
29LOCAL_PREF Attribute
Forces outbound traffic to take primary link,
unless link is down.
30NEXT_HOP Attribute
- EGP: IP address used to reach the advertising router
- IGP: next-hop address is carried into local AS
31AS_PATH Attribute
- Used to detect routing loops and find shortest
paths
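Loop detection with AS_PATH can be sketched in a few lines: a router rejects any route whose path already contains its own AS number, and prepends its own number before passing a route on. The function names and the example AS numbers are illustrative assumptions.

```python
from typing import List


def accept_route(local_asn: int, as_path: List[int]) -> bool:
    """Reject a route that has already passed through this AS (a loop)."""
    return local_asn not in as_path


def advertise(local_asn: int, as_path: List[int]) -> List[int]:
    """Prepend the local AS number before announcing the route to a neighbor."""
    return [local_asn] + as_path


path_from_as2 = advertise(2, [])            # AS 2 originates the prefix
path_from_as3 = advertise(3, path_from_as2)
print(path_from_as3)                        # [3, 2]
print(accept_route(2, path_from_as3))       # False: AS 2 sees itself in the path
```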
32Shedding Inbound Traffic with ASPATH Prepending
Prepending will (usually) force inbound traffic from AS 1 to take the primary link
[Figure: customer AS 2 (192.0.2.0/24) has a primary and a backup link to provider AS 1. It announces "192.0.2.0/24 ASPATH 2" on the primary link and "192.0.2.0/24 ASPATH 2 2 2" on the backup link.]
Yes, this is a Glorious Hack
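A minimal sketch of why prepending works, assuming the provider has no other preference between the two announcements and simply picks the shorter AS path. The dictionary layout and the `link` labels are illustrative.

```python
from typing import Dict, List


def shortest_as_path(routes: List[Dict]) -> Dict:
    """Pick the route with the fewest AS numbers in its path."""
    return min(routes, key=lambda route: len(route["as_path"]))


# The two announcements for 192.0.2.0/24 that AS 1 receives from AS 2.
announcements = [
    {"link": "backup", "as_path": [2, 2, 2]},  # padded by prepending
    {"link": "primary", "as_path": [2]},
]
print(shortest_as_path(announcements)["link"])  # primary
```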
33 But Padding Does Not Always Work
[Figure: customer AS 2 (192.0.2.0/24) has a primary link to provider AS 1 and a backup link to provider AS 3. It announces "192.0.2.0/24 ASPATH 2" on the primary link and "192.0.2.0/24 ASPATH 2 2 2 2 2 2 2 2 2 2 2 2 2" on the backup link.]
AS 3 will send traffic on the backup link because it prefers customer routes, and local preference is considered before ASPATH length! Padding in this way is often used as a form of load balancing.
34COMMUNITY Attribute to the Rescue!
AS 3: normal customer local pref is 100, peer local pref is 90
[Figure: customer AS 2 (192.0.2.0/24) has a primary link to provider AS 1 and a backup link to provider AS 3. It announces "192.0.2.0/24 ASPATH 2" on the primary link and "192.0.2.0/24 ASPATH 2 COMMUNITY 3:70" on the backup link.]
Customer import policy at AS 3:
- If 3:90 in COMMUNITY then set local preference to 90
- If 3:80 in COMMUNITY then set local preference to 80
- If 3:70 in COMMUNITY then set local preference to 70
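The import policy at AS 3 can be sketched as a lookup from community value to local preference, followed by a selection that compares local preference before AS path length. The sketch writes community values in AS:value form ("3:70" and so on), which is an assumption about the slide's notation, and it assumes AS 3 learns the alternative route from AS 1 as a peer route with local preference 90. All names are illustrative.

```python
from typing import Dict, List

# Community values in "AS:value" form; this notation is an assumption.
COMMUNITY_TO_LOCAL_PREF = {"3:90": 90, "3:80": 80, "3:70": 70}
CUSTOMER_LOCAL_PREF = 100  # AS 3's normal customer preference (from the slide)
PEER_LOCAL_PREF = 90       # AS 3's peer preference (from the slide)


def import_customer_route(route: Dict) -> Dict:
    """Apply AS 3's customer import policy and return the route with local_pref set."""
    local_pref = CUSTOMER_LOCAL_PREF
    for community in route.get("communities", []):
        if community in COMMUNITY_TO_LOCAL_PREF:
            local_pref = COMMUNITY_TO_LOCAL_PREF[community]
    return {**route, "local_pref": local_pref}


def best_route(routes: List[Dict]) -> Dict:
    """Prefer higher local preference first, then the shorter AS path."""
    return max(routes, key=lambda r: (r["local_pref"], -len(r["as_path"])))


via_peer = {"via": "AS 1", "as_path": [1, 2], "local_pref": PEER_LOCAL_PREF}
padded = import_customer_route({"via": "backup", "as_path": [2] * 13})
tagged = import_customer_route(
    {"via": "backup", "as_path": [2], "communities": ["3:70"]}
)

print(best_route([via_peer, padded])["via"])  # backup: padding alone does not help
print(best_route([via_peer, tagged])["via"])  # AS 1: the community lowers local pref
```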
35BGP Issues - What is a BGP Wedgie?
- BGP policies make sense locally
- Interaction of local policies allows multiple stable routings
- Some routings are consistent with intended policies, and some are not
- If an unintended routing is installed (BGP is wedged), then manual intervention is needed to change to an intended routing
- When an unintended routing is installed, no single group of network operators has enough knowledge to debug the problem
Full wedgie
36YouTube blocking
- Pakistan blocks YouTube
- How? (according to BBC)
- Advertise a shorter route to reach YouTube
- The incorrect short route gets propagated
- Seen by two thirds of the Internet
- Traffic to YouTube goes through Pakistan
- Since Pakistan blocked YouTube, all traffic
reaches a dead end!
37Dynamic Routing Protocols Summary
- Dynamic routing protocols: RIP, OSPF, BGP
- RIP uses a distance vector algorithm, and converges slowly (the count-to-infinity problem)
- OSPF uses a link state algorithm, and converges fast. But it is more complicated than RIP.
- Both RIP and OSPF find the lowest-cost path.
- BGP uses a path vector algorithm, and its path selection algorithm is complicated and is influenced by policies.
- BGP has its own problems: see WIDGI by Tim Griffin
38More Readings (Optional)
- BGP Wedgies: Bad Routing Policy Interactions that Cannot be Debugged
- JI's Intro to interdomain routing.
- "Interdomain Setting of PlanetLab Nodes." PlanetLab Meeting, May 14, 2004.
- Understanding the Border Gateway Protocol (BGP)
- ICNP 2002 Tutorial Session
39Agenda
- Autonomous Systems (AS)
- Policy vs. distance based routing
- Border gateway protocol (BGP)
- Transmission control protocol (TCP)
40Transmission Control Protocol (RFC)
- Reliable and in-order byte-stream service
- TCP format
- Connection establishment
- Flow control
- Reaction to congestion
- Packet corruption
41TCP Format
- TCP segments have a 20-byte header followed by >= 0 bytes of data.
42TCP header fields
- Sequence Number (SeqNo)
- Sequence number is 32 bits long.
- So the range of SeqNo is
- 0 <= SeqNo <= 2^32 - 1 ≈ 4.3 Gbyte
- Each sequence number identifies a byte in the byte stream
- Initial Sequence Number (ISN) of a connection is set during connection establishment
- Q: What are possible requirements for ISN?
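Because the sequence number space is finite, a long transfer wraps around from 2^32 - 1 back to 0, so sequence numbers have to be added and compared modulo 2^32. The sketch below shows one way to do that; the "less than" test, which treats the half of the space ahead of a number as later, is an assumption added here and is not on the slide.

```python
SEQ_SPACE = 2 ** 32  # size of the 32-bit sequence number space


def seq_add(seq_no: int, byte_count: int) -> int:
    """Advance a sequence number by a byte count, wrapping at 2^32."""
    return (seq_no + byte_count) % SEQ_SPACE


def seq_before(a: int, b: int) -> bool:
    """True if a comes before b, taking wraparound into account."""
    return a != b and (b - a) % SEQ_SPACE < SEQ_SPACE // 2


print(seq_add(2 ** 32 - 1, 1))         # 0: the counter wraps
print(seq_before(2 ** 32 - 10, 5))     # True: 5 lies just past the wrap
```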
43TCP header fields
- Acknowledgement Number (AckNo)
- Acknowledgements are piggybacked, i.e.,
- a segment from A -> B can contain an acknowledgement for data sent in the B -> A direction
- Q: Why is piggybacking good?
- A host uses the AckNo field to send acknowledgements. (If a host sends an AckNo in a segment it sets the ACK flag)
- The AckNo contains the next SeqNo that a host wants to receive
- Example: The acknowledgement for a segment with sequence numbers 0-1500 is AckNo=1501
44TCP header fields
- Acknowledgement Number (cont'd)
- TCP uses the sliding window flow protocol (see CS 457) to regulate the flow of traffic from sender to receiver
- TCP uses the following variation of sliding window
- no NACKs (Negative ACKnowledgement)
- only cumulative ACKs
- Example
- Assume the sender sends two segments with 1..1500 and 1501..3000, but the receiver only gets the second segment.
- In this case, the receiver cannot acknowledge the second packet. It can only send AckNo=1
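The cumulative ACK rule can be sketched as follows: the receiver walks through the byte ranges it holds, in order, and stops at the first gap. Segments are given as inclusive (first byte, last byte) pairs to match the example above; the function name is illustrative.

```python
from typing import List, Tuple


def cumulative_ack(next_expected: int, received: List[Tuple[int, int]]) -> int:
    """Return the AckNo: the next byte the receiver is still waiting for.

    received holds inclusive (first_byte, last_byte) ranges in any order.
    """
    ack_no = next_expected
    for first, last in sorted(received):
        if first > ack_no:
            break                      # gap: everything after it cannot be ACKed
        ack_no = max(ack_no, last + 1)
    return ack_no


print(cumulative_ack(1, [(1501, 3000)]))             # 1: first segment is missing
print(cumulative_ack(1, [(1501, 3000), (1, 1500)]))  # 3001: both segments arrived
```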
45TCP header fields
- Header Length (4 bits)
- Length of header in 32-bit words
- Note that TCP header has variable length (with minimum 20 bytes)
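As a small illustration, the header length can be read from the upper four bits of the byte at offset 12 of the TCP header and multiplied by 4 to get bytes. The helper name and the hand-built example header (ports 1121 and 23, SYN flag set) are assumptions made for this sketch.

```python
import struct


def header_length_bytes(tcp_header: bytes) -> int:
    """Decode the 4-bit Header Length field (in 32-bit words) into bytes."""
    words = tcp_header[12] >> 4
    return words * 4


# A minimal 20-byte header: ports, SeqNo, AckNo, header length 5 words,
# flags, window, checksum, urgent pointer.
example_header = struct.pack(
    "!HHIIBBHHH", 1121, 23, 0, 0, 5 << 4, 0x02, 65535, 0, 0
)
print(len(example_header), header_length_bytes(example_header))  # 20 20
```

Since the field is 4 bits wide, the largest value is 15 words, so a header with options can be at most 60 bytes long.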
46TCP header fields
- Flag bits
- URG: Urgent pointer is valid
- If the bit is set, the following bytes contain an urgent message in the range: SeqNo <= urgent message <= SeqNo + urgent pointer
- ACK: Acknowledgement Number is valid
- PSH: PUSH Flag
- Notification from sender to the receiver that the receiver should pass all data that it has to the application.
- Normally set by sender when the sender's buffer is empty
47TCP header fields
- Flag bits
- RST: Reset the connection
- The flag causes the receiver to reset the connection
- Receiver of a RST terminates the connection and informs the higher-layer application about the reset
- SYN: Synchronize sequence numbers
- Sent in the first packet when initiating a connection
- FIN: Sender is finished with sending
- Used for closing a connection
- Both sides of a connection must send a FIN
48TCP header fields
- Window Size
- Each side of the connection advertises the window size
- Window size is the maximum number of bytes that a receiver can accept.
- Maximum window size is 2^16 - 1 = 65535 bytes
- TCP Checksum
- TCP checksum covers both TCP header and TCP data (also covers some parts of the IP header)
- 16-bit one's complement
- Urgent Pointer
- Only valid if URG flag is set
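The 16-bit one's complement checksum can be sketched as below. This is only the arithmetic: it sums whatever bytes it is given and leaves out the step of assembling the TCP header, the data, and the covered parts of the IP header into one buffer. The function names are illustrative; the sample bytes are the worked example from RFC 1071.

```python
def ones_complement_sum16(data: bytes) -> int:
    """Add 16-bit big-endian words, folding any carry back into the sum."""
    if len(data) % 2:
        data += b"\x00"                 # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
        total = (total & 0xFFFF) + (total >> 16)
    return total


def internet_checksum(data: bytes) -> int:
    """The checksum is the one's complement of the one's complement sum."""
    return ~ones_complement_sum16(data) & 0xFFFF


sample = bytes([0x00, 0x01, 0xF2, 0x03, 0xF4, 0xF5, 0xF6, 0xF7])
print(hex(internet_checksum(sample)))   # 0x220d
```

A receiver can verify a segment by running the same computation over the data with the checksum included; the result is 0 when nothing was corrupted.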
49TCP header fields
50TCP header fields
- Options
- NOP is used to pad TCP header to multiples of 4 bytes
- Maximum Segment Size
- Window Scale Options
- Increases the TCP window from 16 to 32 bits, i.e., the window size is interpreted differently
- Q: What is the different interpretation?
- This option can only be used in the SYN segment (first segment) during connection establishment time
- Timestamp Option
- Can be used for roundtrip measurements
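One way to picture the window scale option: the 16-bit window field is shifted left by the shift count negotiated in the SYN segments. The sketch assumes the limit of 14 on the shift count from the TCP window scale specification; the function name is illustrative.

```python
MAX_SHIFT = 14  # largest shift count allowed by the window scale option


def effective_window(window_field: int, shift_count: int) -> int:
    """Interpret the 16-bit window field using the negotiated shift count."""
    if not 0 <= window_field <= 0xFFFF:
        raise ValueError("window field must fit in 16 bits")
    if not 0 <= shift_count <= MAX_SHIFT:
        raise ValueError("shift count must be between 0 and 14")
    return window_field << shift_count


print(effective_window(65535, 0))   # 65535: no scaling
print(effective_window(65535, 14))  # 1073725440: roughly 1 Gbyte
```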
51Three-Way Handshake
52Why is a Two-Way Handshake not enough?
Will be discarded as a duplicate SYN
When aida initiates the data transfer (starting with SeqNo=15322112355), mng will reject all data.
53TCP Connection Termination
54Connection termination with tcpdump
- 1 mng.poly.edu.telnet > aida.poly.edu.1121: F 172488734:172488734(0) ack 1031880221 win 8733
- 2 aida.poly.edu.1121 > mng.poly.edu.telnet: . ack 172488735 win 17484
- 3 aida.poly.edu.1121 > mng.poly.edu.telnet: F 1031880221:1031880221(0) ack 172488735 win 17520
- 4 mng.poly.edu.telnet > aida.poly.edu.1121: . ack 1031880222 win 8733
55TCP States in Normal Connection Lifetime
56TCP State Transition DiagramOpening A Connection
57TCP State Transition DiagramClosing A Connection
Issue close()
582MSL Wait State
- 2MSL Wait State = TIME_WAIT
- When TCP does an active close, and sends the final ACK, the connection must stay in the TIME_WAIT state for twice the maximum segment lifetime.
- 2MSL = 2 * Maximum Segment Lifetime
- Why? TCP is given a chance to resend the final ACK. (Server will timeout after sending the FIN segment and resend the FIN)
- The MSL is set to 2 minutes or 1 minute or 30 seconds.
59Rules for sending Acknowledgments
- TCP has rules that influence the transmission of acknowledgments
- Rule 1: Delayed Acknowledgments
- Goal: Avoid sending ACK segments that do not carry data
- Implementation: Delay the transmission of (some) ACKs
- Rule 2: Nagle's rule
- Goal: Reduce transmission of small segments
- Implementation: A sender cannot send multiple segments with a 1-byte payload (i.e., it must wait for an ACK)
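Nagle's rule can be sketched as a sender that buffers small writes while earlier data is still unacknowledged. This is a simplified model with a single "waiting for ACK" flag and no timers or real sockets; the class and method names are illustrative assumptions.

```python
from typing import List


class NagleSender:
    """Toy sender: full-sized segments go out at once, small ones wait for an ACK."""

    def __init__(self, mss: int) -> None:
        self.mss = mss
        self.buffer = b""
        self.waiting_for_ack = False

    def write(self, data: bytes) -> List[bytes]:
        """Application hands over data; returns the segments sent right now."""
        self.buffer += data
        return self._send_what_is_allowed()

    def on_ack(self) -> List[bytes]:
        """An ACK for all outstanding data arrived; buffered data may go out."""
        self.waiting_for_ack = False
        return self._send_what_is_allowed()

    def _send_what_is_allowed(self) -> List[bytes]:
        sent = []
        while len(self.buffer) >= self.mss:          # full segments are never held back
            sent.append(self.buffer[:self.mss])
            self.buffer = self.buffer[self.mss:]
            self.waiting_for_ack = True
        if self.buffer and not self.waiting_for_ack:  # one small segment in flight at most
            sent.append(self.buffer)
            self.buffer = b""
            self.waiting_for_ack = True
        return sent


sender = NagleSender(mss=1460)
print(sender.write(b"a"))  # [b'a']: nothing outstanding, sent at once
print(sender.write(b"b"))  # []: held back until the ACK arrives
print(sender.write(b"c"))  # []
print(sender.on_ack())     # [b'bc']: buffered characters leave in one segment
```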
60Delayed Acknowledgement
- TCP delays transmission of ACKs for up to 200ms
- Goal: Avoid sending ACK packets that do not carry data.
- The hope is that, within the delay, the receiver will have data ready to be sent to the sender. Then, the ACK can be piggybacked with a data segment
- In Example:
- Delayed ACK explains why the ACK of character and the echo of character are sent in the same segment
- The duration of delayed ACKs can be observed in the example when Argon sends ACKs
- Exceptions:
- ACK should be sent for every second full sized segment
- Delayed ACK is not used when packets arrive out of order
61Observing Delayed Acknowledgements
- Remote terminal applications (e.g., Telnet) send characters to a server. The server interprets the character and sends the output at the server to the client.
- For each character typed, you see three packets
- Client -> Server: Send typed character
- Server -> Client: Echo of character (or user output) and acknowledgement for first packet
- Client -> Server: Acknowledgement for second packet
62Observing Delayed Acknowledgements
- This is the output of typing 3 (three) characters
- Time 44.062449: Argon -> Neon: Push, SeqNo 0:1(1), AckNo 1
- Time 44.063317: Neon -> Argon: Push, SeqNo 1:2(1), AckNo 1
- Time 44.182705: Argon -> Neon: No Data, AckNo 2
- Time 48.946471: Argon -> Neon: Push, SeqNo 1:2(1), AckNo 2
- Time 48.947326: Neon -> Argon: Push, SeqNo 2:3(1), AckNo 2
- Time 48.982786: Argon -> Neon: No Data, AckNo 3
- Time 55.116581: Argon -> Neon: Push, SeqNo 2:3(1), AckNo 3
- Time 55.117497: Neon -> Argon: Push, SeqNo 3:4(1), AckNo 3
- Time 55.183694: Argon -> Neon: No Data, AckNo 4
63Why 3 segments per character?
- We would expect four segments per character
- But we only see three segments per character
- This is due to delayed acknowledgements
64Observing Nagle's Rule
- This is the output of typing 7 characters
- Time 16.401963: Argon -> Tenet: Push, SeqNo 1:2(1), AckNo 2
- Time 16.481929: Tenet -> Argon: Push, SeqNo 2:3(1), AckNo 2
- Time 16.482154: Argon -> Tenet: Push, SeqNo 2:3(1), AckNo 3
- Time 16.559447: Tenet -> Argon: Push, SeqNo 3:4(1), AckNo 3
- Time 16.559684: Argon -> Tenet: Push, SeqNo 3:4(1), AckNo 4
- Time 16.640508: Tenet -> Argon: Push, SeqNo 4:5(1), AckNo 4
- Time 16.640761: Argon -> Tenet: Push, SeqNo 4:8(4), AckNo 5
- Time 16.728402: Tenet -> Argon: Push, SeqNo 5:9(4), AckNo 8
65Observing Nagle's Rule
- Observation: Transmission of segments follows a different pattern, i.e., there are only two segments per character typed
- Delayed acknowledgment does not kick in at Argon
- The reason is that there is always data at Argon ready to be sent when the ACK arrives
- Why is Argon not sending the data (typed character) as soon as it is available?
66Resetting Connections
- Resetting connections is done by setting the RST flag
- When is the RST flag set?
- Connection request arrives and no server process is waiting on the destination port
- Abort (Terminate) a connection: Causes the receiver to throw away buffered data. Receiver does not acknowledge the RST segment
67TCP Congestion Control
- TCP has a mechanism for congestion control. The mechanism is implemented at the sender
- The window size at the sender is set as follows
- Send Window = MIN (flow control window, congestion window)
- where
- flow control window is advertised by the receiver
- congestion window is adjusted based on feedback from the network
68TCP Congestion Control
- TCP congestion control is governed by two parameters
- Congestion Window (cwnd)
- Slow-start threshold Value (ssthresh)
- Initial value is 2^16 - 1
- Congestion control works in two modes
- slow start (cwnd < ssthresh)
- congestion avoidance (cwnd >= ssthresh)
69Slow Start
- Initial value: Set cwnd = 1
- Note: Unit is a segment size. TCP actually is based on bytes and increments by 1 MSS (maximum segment size)
- The receiver sends an acknowledgement (ACK) for each segment
- Note: Generally, a TCP receiver sends an ACK for every other segment.
- Each time an ACK is received by the sender, the congestion window is increased by 1 segment
- cwnd = cwnd + 1
- If an ACK acknowledges two segments, cwnd is still increased by only 1 segment.
- Even if ACK acknowledges a segment that is smaller than MSS bytes long, cwnd is increased by 1.
- Does Slow Start increment slowly? Not really. In fact, the increase of cwnd is exponential
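The exponential growth is easy to see in a small simulation. The sketch assumes one ACK per segment and no losses, counts cwnd in segments, and uses an illustrative function name.

```python
from typing import List


def slow_start_rounds(rounds: int, cwnd: int = 1) -> List[int]:
    """Return cwnd (in segments) at the start of each round trip."""
    history = [cwnd]
    for _ in range(rounds):
        acks_this_round = cwnd      # one ACK per segment sent in this round
        cwnd += acks_this_round     # each ACK adds one segment to cwnd
        history.append(cwnd)
    return history


print(slow_start_rounds(4))  # [1, 2, 4, 8, 16]: cwnd doubles every round trip
```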
70Slow Start Example
- The congestion window size grows very rapidly
- For every ACK, we increase cwnd by 1 irrespective of the number of segments ACKed
- TCP slows down the increase of cwnd when cwnd > ssthresh
71Congestion Avoidance
- Congestion avoidance phase is started if cwnd has reached the slow-start threshold value
- If cwnd >= ssthresh then each time an ACK is received, increment cwnd as follows
- cwnd = cwnd + 1/cwnd
- So cwnd is increased by one only if all cwnd segments have been acknowledged.
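The per-ACK update can be sketched as a single function that switches between the two modes. The sketch counts cwnd in segments, assumes one ACK per segment and no losses, and uses illustrative names; the threshold of 8 segments is chosen only for the example.

```python
def on_ack(cwnd: float, ssthresh: float) -> float:
    """Update cwnd (in segments) for one received ACK."""
    if cwnd < ssthresh:
        return cwnd + 1          # slow start: one segment per ACK
    return cwnd + 1 / cwnd       # congestion avoidance: about one segment per round trip


def run_round(cwnd: float, ssthresh: float) -> float:
    """Process the ACKs for one round trip worth of segments."""
    for _ in range(int(cwnd)):
        cwnd = on_ack(cwnd, ssthresh)
    return cwnd


cwnd = 1.0
for round_no in range(1, 6):
    cwnd = run_round(cwnd, ssthresh=8)
    print(round_no, round(cwnd, 2))
```

With these values cwnd doubles for three rounds (2, 4, 8) and then grows by slightly less than one segment per round, because each 1/cwnd step is computed from the already increased window.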
72Example of Slow Start/Congestion Avoidance
[Figure: cwnd (in segments) plotted against roundtrip times, with ssthresh marked.]
73Responses to Congestion
- So, TCP assumes there is congestion if it detects a packet loss
- A TCP sender can detect lost packets via
- Timeout of a retransmission timer
- Receipt of a duplicate ACK
- TCP interprets a Timeout as a binary congestion signal. When a timeout occurs, the sender performs
- ssthresh is set to half the current size of the congestion window
- ssthresh = cwnd / 2
- cwnd is reset to one
- cwnd = 1
- and slow-start is entered
74Fast Retransmit
- If three or more duplicate ACKs are received in a row, the TCP sender believes that a segment has been lost.
- Then TCP performs a retransmission of what seems to be the missing segment, without waiting for a timeout to happen.
- Enter slow start
- ssthresh = cwnd/2
- cwnd = 1
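A minimal sketch of the duplicate ACK counter behind fast retransmit, with the reaction described on this slide (halve ssthresh, reset cwnd to 1, re-enter slow start). The class name, the attribute names, and the choice to count cwnd in segments are assumptions for illustration.

```python
class FastRetransmitSender:
    """Tracks duplicate ACKs and reacts to the third one."""

    def __init__(self, cwnd: float, ssthresh: float) -> None:
        self.cwnd = cwnd
        self.ssthresh = ssthresh
        self.last_ack_no = None
        self.duplicate_acks = 0

    def on_ack(self, ack_no: int) -> bool:
        """Return True when this ACK triggers a fast retransmit."""
        if ack_no != self.last_ack_no:
            self.last_ack_no = ack_no       # new data acknowledged
            self.duplicate_acks = 0
            return False
        self.duplicate_acks += 1
        if self.duplicate_acks == 3:
            self.ssthresh = self.cwnd / 2   # computed before cwnd is reset
            self.cwnd = 1                   # back to slow start
            return True
        return False


sender = FastRetransmitSender(cwnd=16, ssthresh=64)
print([sender.on_ack(1000) for _ in range(4)])  # [False, False, False, True]
print(sender.cwnd, sender.ssthresh)             # 1 8.0
```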
75Fast Recovery
- Fast recovery avoids slow start after a fast retransmit
- Intuition: Duplicate ACKs indicate that data is getting through
- After three duplicate ACKs set
- Retransmit packet that is presumed lost
- ssthresh = cwnd/2
- cwnd = cwnd + 3
- (note the order of operations)
- Increment cwnd by one for each additional duplicate ACK
- When ACK arrives that acknowledges new data (here AckNo=6148), set
- cwnd = ssthresh
- enter congestion avoidance
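The fast recovery steps can be sketched as three small state updates, following the slide's rule of inflating cwnd by 3. Note that the TCP congestion control standard (RFC 5681) sets cwnd to ssthresh plus 3 segments at this point instead. Function names are illustrative and cwnd is counted in segments.

```python
from typing import Tuple


def enter_fast_recovery(cwnd: float) -> Tuple[float, float]:
    """Third duplicate ACK: return the new (cwnd, ssthresh)."""
    ssthresh = cwnd / 2        # must use the old cwnd, hence the order
    cwnd = cwnd + 3            # slide's rule; RFC 5681 uses ssthresh + 3
    return cwnd, ssthresh


def on_additional_duplicate_ack(cwnd: float) -> float:
    """Each further duplicate ACK means another segment has left the network."""
    return cwnd + 1


def exit_fast_recovery(ssthresh: float) -> float:
    """An ACK for new data arrived: deflate cwnd and continue in congestion avoidance."""
    return ssthresh


cwnd, ssthresh = enter_fast_recovery(16)
cwnd = on_additional_duplicate_ack(cwnd)
print(cwnd, ssthresh)                # 20 8.0
print(exit_fast_recovery(ssthresh))  # 8.0
```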
76Flavors of TCP Congestion Control
- TCP Tahoe (1988, 4.3BSD Tahoe)
- Slow Start
- Congestion Avoidance
- Fast Retransmit
- TCP Reno (1990, 4.3BSD Reno)
- Fast Recovery
- New Reno (1996)
- SACK (1996)
- RED (Floyd and Jacobson 1993)
77SACK
- SACK = Selective acknowledgment
- Issue: Reno and New Reno retransmit at most 1 lost packet per round trip time
- Selective acknowledgments: The receiver can acknowledge non-continuous blocks of data (SACK 0-1023, 1024-2047)
- Multiple blocks can be sent in a single segment.
- TCP SACK
- Enters fast recovery upon 3 duplicate ACKs
- Sender keeps track of SACKs and infers if segments are lost. Sender retransmits the next segment from the list of segments that are deemed lost.
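How a sender might infer missing data from SACK blocks can be sketched as a gap search between the cumulative ACK and the selectively acknowledged ranges. The sketch represents blocks as (start, end) pairs with an exclusive end, which is a choice made here for simplicity; the function name and the byte values are illustrative.

```python
from typing import List, Tuple


def missing_ranges(
    cumulative_ack: int, sack_blocks: List[Tuple[int, int]]
) -> List[Tuple[int, int]]:
    """Return the byte ranges (start, end), end exclusive, not yet acknowledged
    below the highest SACKed byte."""
    gaps = []
    expected = cumulative_ack
    for start, end in sorted(sack_blocks):
        if start > expected:
            gaps.append((expected, start))   # hole in front of this block
        expected = max(expected, end)
    return gaps


# Receiver has everything up to byte 1024 plus two later blocks.
print(missing_ranges(1024, [(4096, 5120), (2048, 3072)]))
# [(1024, 2048), (3072, 4096)]
```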
78TCP in Linux
- Congestion control algorithm is pluggable
- /proc/sys/net/ipv4/tcp_congestion_control
- TCP read and write buffer sizes
- /proc/sys/net/ipv4/tcp_rmem and /proc/sys/net/ipv4/tcp_wmem
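These settings can be read like ordinary files. The sketch below assumes a Linux machine and reads the read and write buffer sizes from the tcp_rmem and tcp_wmem entries; the helper name is illustrative, and it returns None on systems where an entry does not exist.

```python
from pathlib import Path
from typing import Optional


def read_tcp_setting(name: str, base: str = "/proc/sys/net/ipv4") -> Optional[str]:
    """Return the current value of a TCP setting, or None if it cannot be read."""
    try:
        return (Path(base) / name).read_text().strip()
    except OSError:
        return None


for setting in ("tcp_congestion_control", "tcp_rmem", "tcp_wmem"):
    print(setting, "=", read_tcp_setting(setting))
```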
79Midterm questions
- ARP, ICMP, UDP, TCP, RIP, OSPF, BGP
- Compare and contrast design principles in protocols.
- Fragmentation