Transcript and Presenter's Notes

Title: Bandwidth Challenges or "How fast can we really drive a Network"


1
Bandwidth Challenges or "How fast can we really
drive a Network?"
Richard Hughes-Jones, The University of Manchester
www.hep.man.ac.uk/rich/ then "Talks"
2
Collaboration at SC05
  • Caltech Booth
  • The BWC at the SLAC Booth
  • SCINet
  • Storcloud
  • ESLEA, Boston Ltd., Peta-Cache, Sun

3
Bandwidth Challenge wins Hat Trick
  • The maximum aggregate bandwidth was >151 Gbit/s
  • 130 DVD movies in a minute
  • serve 10,000 MPEG2 HDTV movies in real-time
  • 22 × 10 Gigabit Ethernet waves to the Caltech and SLAC/FERMI booths
  • In 2 hours transferred 95.37 TByte
  • 24 hours moved 475 TBytes
  • Showed real-time particle event analysis
  • SLAC / FERMI / UK Booth
  • 1 × 10 Gbit Ethernet to the UK over NLR and UKLight
  • transatlantic HEP disk to disk
  • VLBI streaming
  • 2 × 10 Gbit links to SLAC
  • rootd low-latency file access application for
    clusters
  • Fibre Channel StorCloud
  • 4 × 10 Gbit links to Fermi
  • dCache data transfers

Plots: traffic into and out of the booth
4
ESLEA and UKLight
  • 6 × 1 Gbit transatlantic Ethernet layer-2 paths (UKLight + NLR)
  • Disk-to-disk transfers with bbcp
  • Seattle to UK
  • Set TCP buffer and application to give 850 Mbit/s
  • One stream of data: 840-620 Mbit/s
  • Stream UDP VLBI data
  • UK to Seattle
  • 620 Mbit/s

5
SLAC 10 Gigabit Ethernet
  • 2 Lightpaths
  • Routed over ESnet
  • Layer 2 over Ultra Science Net
  • 6 Sun V20Z systems per λ (lightpath)
  • dCache remote disk data access
  • 100 processes per node
  • Node sends or receives
  • One data stream: 20-30 Mbit/s
  • Used Neterion NICs & Chelsio TOE
  • Data also sent to StorCloud using Fibre Channel links
  • Traffic on the 10 GE link for 2 nodes: 3-4 Gbit/s per node, 8.5-9 Gbit/s on the trunk

6
  • LightPath Topologies

7
Switched LightPaths 1
  • Lightpaths are fixed point-to-point paths or circuits
  • Optical links (with FEC) have a BER of ~10^-16, i.e. a packet loss rate of ~10^-12, or about 1 loss in 160 days (a rough check of these numbers follows below)
  • In SJ5 (SuperJANET5), lightpaths are known as Bandwidth Channels
  • Host-to-host lightpath
  • One application
  • No congestion
  • Advanced TCP stacks for large delay-bandwidth products

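A rough check of those numbers, assuming 1500-byte packets sent back-to-back at the full 1 Gbit/s line rate (the exact "160 days" depends on the packet size and rate assumed; this sketch gives the same order of magnitude):

    # Rough sanity check of the loss figures quoted above.
    # Assumptions: 1500-byte packets, sent back-to-back at 1 Gbit/s.
    BER = 1e-16                              # quoted bit error rate of the optical link (with FEC)
    packet_bits = 1500 * 8                   # bits per packet
    link_rate = 1e9                          # bits per second

    packet_loss_rate = BER * packet_bits                  # ~1.2e-12, i.e. ~10^-12 as quoted
    packets_per_second = link_rate / packet_bits          # ~83,000 packets/s at full rate
    seconds_between_losses = 1 / (packet_loss_rate * packets_per_second)

    print(f"packet loss rate ~ {packet_loss_rate:.1e}")
    print(f"one loss every ~ {seconds_between_losses / 86400:.0f} days")
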
8
Switched LightPaths 2
  • User Controlled Lightpaths
  • Grid scheduling of CPUs and network
  • Many application flows
  • No congestion on each path
  • Lightweight framing possible
  • Some applications suffer when using TCP and may prefer UDP, DCCP, or XCP
  • E.g. with e-VLBI the data wave-front gets distorted and correlation fails

9
  • Network Transport Layer Issues

10
  • Problem 1
  • Packet Loss
  • Is it important ?

11
TCP (Reno) Packet loss and Time
  • TCP takes packet loss as an indication of congestion
  • Time for TCP to recover its throughput from one lost 1500-byte packet, for an rtt of 200 ms @ 1 Gbit/s (the formula is reconstructed after the table below)

UK, rtt 6 ms: 1.6 s
Europe, rtt 25 ms: 26 s
USA, rtt 150 ms: 28 min
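
The formula itself did not survive the transcript; the standard back-of-the-envelope version for Reno (additive increase of one segment per RTT, starting from a halved window) that reproduces numbers of this size is, in LaTeX:

    \tau \;\approx\; \frac{\mathrm{cwnd}/2}{1\ \mathrm{segment/RTT}}\times\mathrm{RTT}
         \;=\; \frac{C\,\mathrm{RTT}^2}{2\,\mathrm{MSS}}

    % e.g. C = 1 Gbit/s, RTT = 200 ms, MSS = 1500 bytes:
    \tau \;\approx\; \frac{10^{9}\times(0.2)^2}{2\times 1500\times 8}\ \mathrm{s}
         \;\approx\; 1.7\times10^{3}\ \mathrm{s} \;\approx\; 28\ \mathrm{min}
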
12
Packet Loss and new TCP Stacks
  • TCP response function
  • Throughput vs loss rate: the further to the right, the faster the recovery (the standard-TCP response function is sketched below)
  • UKLight London-Chicago-London, rtt 177 ms
  • Drop packets in the kernel
  • 2.6.6 kernel
  • Agreement with theory is good
  • Some new stacks are good at high loss rates

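For reference, the standard-TCP response function the new stacks are being compared against is (roughly) the Mathis et al. relation between achievable throughput, MSS, RTT and loss probability p:

    \mathrm{Throughput} \;\lesssim\; \frac{\mathrm{MSS}}{\mathrm{RTT}}\cdot\frac{1.22}{\sqrt{p}}

The new stacks change this response so that a given throughput can be sustained at a higher loss rate, which is what "further to the right" means on the plot.
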
13
High Throughput Demonstrations
Network diagram: Geneva (rtt 128 ms) and Chicago end hosts (dual Xeon 2.2 GHz, 1 GEth) connected via Cisco 7609 and Cisco GSR routers over a 2.5 Gbit SDH MB-NG core
14
TCP Throughput DataTAG
  • Different TCP stacks tested on the DataTAG network
  • rtt 128 ms
  • Drop 1 in 10^6
  • High-Speed: rapid recovery
  • Scalable: very fast recovery
  • Standard: recovery would take ~20 mins (the update rules behind these differences are sketched below)

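For reference, the window-update rules behind those recovery times (not taken from the slide graphics; Reno from the standard, Scalable from Kelly's Scalable TCP proposal with a = 0.01, b = 0.125):

    \text{Reno:}\qquad \text{per RTT: } w \leftarrow w + 1, \qquad\quad \text{on loss: } w \leftarrow 0.5\,w

    \text{Scalable:}\quad \text{per ACK: } w \leftarrow w + 0.01, \qquad \text{on loss: } w \leftarrow 0.875\,w

Because Scalable TCP's increase and decrease are both proportional to the window, its recovery time is roughly independent of the window size, which is why it recovers quickly on long fat paths.
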
15
  • Problem 2
  • Is TCP fair?
  • look at
  • Round Trip Time and Maximum Transmission Unit (MTU)

16
MTU and Fairness
  • Two TCP streams share a 1 Gb/s bottleneck
  • RTT 117 ms
  • MTU 3000 bytes: avg. throughput over a period of 7000 s = 243 Mb/s
  • MTU 9000 bytes: avg. throughput over a period of 7000 s = 464 Mb/s
  • Link utilization: 70.7%

Sylvain Ravot DataTag 2003
17
RTT and Fairness
  • Two TCP streams share a 1 Gb/s bottleneck
  • CERN <-> Sunnyvale, RTT 181 ms: avg. throughput over a period of 7000 s = 202 Mb/s
  • CERN <-> Starlight, RTT 117 ms: avg. throughput over a period of 7000 s = 514 Mb/s
  • MTU 9000 bytes
  • Link utilization: 71.6%

Sylvain Ravot DataTag 2003
18
  • Problem n
  • Do TCP Flows Share the Bandwidth ?

19
Test of TCP Sharing Methodology (1Gbit/s)
  • Chose 3 paths from SLAC (California)
  • Caltech (10ms), Univ Florida (80ms), CERN (180ms)
  • Used iperf/TCP and UDT/UDP to generate traffic (an example iperf invocation is sketched below)
  • Each run was 16 minutes, in 7 regions

Les Cottrell RHJ PFLDnet 2005
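
A minimal sketch of driving one of those iperf TCP runs from a script (the host name is a placeholder; the flags are standard iperf version 2 client options for duration, report interval, and socket buffer size):

    # Run a single-stream iperf TCP test and return its text report.
    # Assumptions: iperf (version 2) is installed; the server name is hypothetical.
    import subprocess

    def run_iperf(server: str, seconds: int = 960, window: str = "16M") -> str:
        cmd = ["iperf", "-c", server,        # connect to the iperf server
               "-t", str(seconds),           # 960 s = the 16-minute runs described above
               "-i", "10",                   # report every 10 s
               "-w", window]                 # requested TCP socket buffer size
        result = subprocess.run(cmd, capture_output=True, text=True, check=True)
        return result.stdout

    if __name__ == "__main__":
        print(run_iperf("remote.example.org", seconds=60))
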
20
TCP Reno single stream
Les Cottrell RHJ PFLDnet 2005
  • Low performance on fast long-distance paths
  • AIMD (add a=1 packet to cwnd per RTT; decrease cwnd by factor b=0.5 on congestion); see the toy model below
  • Net effect: recovers slowly, does not effectively use the available bandwidth, so poor throughput
  • Unequal sharing

SLAC to CERN
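
A toy model of that AIMD behaviour on a path like SLAC-CERN (pure arithmetic, no real traffic), illustrating why standard Reno recovers so slowly:

    # Toy model of Reno congestion avoidance (AIMD) on a long fat path.
    def reno_recovery_time(link_gbit: float, rtt_s: float, mss_bytes: int = 1500) -> float:
        """Seconds needed to grow back to the pre-loss window after a single loss."""
        bdp_pkts = link_gbit * 1e9 * rtt_s / (mss_bytes * 8)   # window that fills the pipe
        cwnd = bdp_pkts / 2                                     # multiplicative decrease: b = 0.5
        rtts = 0
        while cwnd < bdp_pkts:
            cwnd += 1                                           # additive increase: a = 1 pkt per RTT
            rtts += 1
        return rtts * rtt_s

    # A 1 Gbit/s path with rtt ~180 ms -> recovery takes ~20+ minutes.
    print(f"{reno_recovery_time(1.0, 0.18) / 60:.1f} minutes to recover")
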
21
Hamilton TCP
  • One of the best performers
  • Throughput is high
  • Big effects on RTT when it achieves its best throughput
  • Flows share equally

22
  • Problem n+1
  • To SACK or not to SACK ?

23
The SACK Algorithm
  • SACK rationale
  • Non-contiguous blocks of data can be ACKed
  • Sender retransmits just the lost packets
  • Helps when multiple packets are lost in one TCP window
  • But SACK processing is inefficient for large bandwidth-delay products
  • The sender's write queue (a linked list) is walked for each SACK block:
  • to mark lost packets
  • to re-transmit
  • Processing takes so long that the input queue becomes full
  • Result: timeouts (an illustrative sketch of the per-SACK-block walk follows below)

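An illustrative sketch (ordinary Python, not kernel code) of the linked-list walk described above and why it becomes expensive at large bandwidth-delay products:

    # For every SACK block in every ACK, the sender walks its (linked-list)
    # retransmit queue to mark segments as SACKed or lost.  With tens of
    # thousands of packets in flight, this O(window) walk per SACK block can
    # starve the input queue and lead to timeouts.
    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class Segment:
        seq: int
        sacked: bool = False
        next: Optional["Segment"] = None

    def process_sack_block(head: Segment, start: int, end: int) -> int:
        """Walk the whole write queue, marking segments covered by one SACK block."""
        walked = 0
        node = head
        while node is not None:
            walked += 1                       # cost: one list node touched
            if start <= node.seq < end:
                node.sacked = True
            node = node.next
        return walked

    # Build a write queue the size of a 1 Gbit/s x 180 ms pipe (~15,000 packets).
    head = Segment(seq=0)
    tail = head
    for seq in range(1, 15000):
        tail.next = Segment(seq=seq)
        tail = tail.next

    # One ACK can carry several SACK blocks; each triggers a full walk.
    cost = sum(process_sack_block(head, s, s + 10) for s in (100, 5000, 12000))
    print(f"list nodes touched for a single ACK: {cost}")   # ~45,000
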
24
  • What does the User Application make of this?
  • The view from the Application

25
SC2004 Disk-Disk bbftp
  • The bbftp file transfer program uses TCP/IP
  • UKLight path: London-Chicago-London; PCs: Supermicro with 3Ware RAID0
  • MTU 1500 bytes, socket size 22 Mbytes (roughly the path's bandwidth-delay product, see the calculation below), rtt 177 ms, SACK off
  • Move a 2 Gbyte file
  • Web100 plots
  • Standard TCP
  • Average 825 Mbit/s
  • (bbcp: 670 Mbit/s)
  • Scalable TCP
  • Average 875 Mbit/s
  • (bbcp: 701 Mbit/s, ~4.5 s of overhead)
  • Disk-TCP-Disk at 1 Gbit/s is here!

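The 22-Mbyte socket size is essentially the bandwidth-delay product of the path; the arithmetic, using the figures on the slide:

    # Why the ~22-Mbyte socket buffer: the TCP window has to cover the
    # bandwidth-delay product of the London-Chicago-London path.
    link_rate = 1e9        # 1 Gbit/s
    rtt = 0.177            # 177 ms round trip time

    bdp_bytes = link_rate * rtt / 8
    print(f"BDP = {bdp_bytes / 2**20:.1f} MiB")   # ~21 MiB, hence the ~22-Mbyte socket size
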
26
SC05 HEP Moving data with bbcp
  • What is the end-host doing with your network
    protocol?
  • Look at the PCI-X
  • 3Ware 9000 controller RAID0
  • 1 Gbit Ethernet link
  • 2.4 GHz dual Xeon
  • 660 Mbit/s
  • Power needed in the end hosts
  • Careful Application design

27
VLBI TCP Stack CPU Load
  • Real user problem!
  • An end-host TCP flow at 960 Mbit/s with rtt 1 ms falls to 770 Mbit/s when rtt is 15 ms
  • 1.2 GHz PIII
  • TCP iperf, rtt 1 ms: 960 Mbit/s
  • 94.7% kernel mode, 1.5% idle
  • TCP iperf, rtt 15 ms: 777 Mbit/s
  • 96.3% kernel mode, 0.05% idle
  • CPU load with nice priority
  • Throughput falls as priority increases
  • No loss, no timeouts
  • Not enough CPU power
  • 2.8 GHz Xeon, rtt 1 ms
  • TCP iperf: 916 Mbit/s
  • 43% kernel mode, 55% idle
  • CPU load with nice priority
  • Throughput constant as priority increases
  • No loss, no timeouts
  • Kernel mode includes the TCP stack and the Ethernet driver (one way to take such measurements is sketched below)

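One convenient way to reproduce this kind of kernel-mode/idle measurement while a transfer is running (this assumes the third-party psutil package; it is not how the original numbers were taken):

    # Sample user/system/idle CPU percentages while an iperf or bbcp transfer runs.
    import psutil

    def sample_cpu(seconds: int = 10) -> None:
        """Print CPU percentages averaged over a blocking interval."""
        pct = psutil.cpu_times_percent(interval=seconds)
        print(f"user {pct.user:.1f}%  kernel {pct.system:.1f}%  idle {pct.idle:.1f}%")

    if __name__ == "__main__":
        # Start the transfer first, then sample a few times while it runs.
        for _ in range(3):
            sample_cpu(10)
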
28
ATLAS Remote Computing Application Protocol
  • Event Request
  • EFD requests an event from SFI
  • SFI replies with the event (~2 Mbytes)
  • Processing of the event
  • Return of the computation
  • EF asks SFO for buffer space
  • SFO sends OK
  • EF transfers the results of the computation
  • tcpmon: an instrumented TCP request-response program that emulates the Event Filter (EFD) to SFI communication; a minimal sketch of such a client follows below

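A minimal sketch of a tcpmon-style request-response client (host, port and message sizes are placeholders, not the real EFD/SFI protocol):

    # Send a small request, read back a large response, and time each exchange.
    import socket
    import time

    REQUEST_SIZE = 64                 # 64-byte request, as in the plots above
    RESPONSE_SIZE = 2 * 1024 * 1024   # ~2 Mbyte event, as in the ATLAS example

    def request_events(host: str, port: int, n_events: int) -> None:
        with socket.create_connection((host, port)) as sock:
            for i in range(n_events):
                start = time.perf_counter()
                sock.sendall(b"R" * REQUEST_SIZE)
                received = 0
                while received < RESPONSE_SIZE:
                    chunk = sock.recv(65536)
                    if not chunk:
                        raise ConnectionError("server closed the connection")
                    received += len(chunk)
                elapsed = time.perf_counter() - start
                print(f"event {i}: {received} bytes in {elapsed * 1e3:.1f} ms")

    if __name__ == "__main__":
        request_events("sfi.example.org", 5000, n_events=10)
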
29
tcpmon TCP Activity Manc-CERN Req-Resp
  • Web100 hooks for TCP status
  • Round trip time 20 ms
  • 64-byte request (green), 1-Mbyte response (blue)
  • TCP in slow start
  • 1st event takes 19 rtt or 380 ms

30
tcpmon TCP Activity Manc-CERN Req-Resp, no cwnd reduction
  • Round trip time 20 ms
  • 64-byte request (green), 1-Mbyte response (blue)
  • TCP starts in slow start
  • 1st event takes 19 rtt or 380 ms
  • TCP congestion window grows nicely
  • Response takes 2 rtt after ~1.5 s
  • Rate 10/s (with 50 ms wait)
  • Transfer achievable throughput grows to 800 Mbit/s
  • Data transferred WHEN the application requires the data

3 Round Trips
2 Round Trips
31
HEP Service Challenge 4
  • Objective: demonstrate 1 Gbit/s aggregate bandwidth between RAL and 4 Tier 2 sites
  • RAL has SuperJANET4 and UKLight links
  • RAL capped firewall traffic at 800 Mbit/s
  • SuperJANET sites:
  • Glasgow, Manchester, Oxford, QMUL
  • UKLight site:
  • Lancaster
  • Many concurrent transfers from RAL to each of the Tier 2 sites

~700 Mbit/s on UKLight
Peak 680 Mbit/s on SJ4
32
Summary Conclusions
  • Well, you CAN fill the links at 1 and 10 Gbit/s, but it's not THAT simple
  • Packet loss is a killer for TCP
  • Check campus links and equipment, and the access links to the backbones
  • Users need to collaborate with the campus network teams
  • Dante PERT
  • New stacks are stable and give better response performance
  • Still need to set the TCP buffer sizes! (see the buffer-setting sketch below)
  • Check other kernel settings, e.g. the window-scale maximum
  • Watch for TCP stack implementation enhancements
  • TCP tries to be fair
  • Large MTU has an advantage
  • Short distances (small RTT) have an advantage
  • TCP does not share bandwidth well with other streams
  • The end hosts themselves
  • Plenty of CPU power is required for the TCP/IP stack as well as the application

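A minimal sketch of the buffer-size advice for an application that opens its own sockets. The 22-Mbyte figure assumes a ~1 Gbit/s path with ~180 ms RTT (buffer ~ bandwidth x delay); the kernel limits must also allow buffers this large (net.core.rmem_max / net.core.wmem_max, net.ipv4.tcp_rmem / net.ipv4.tcp_wmem, with TCP window scaling enabled):

    # Request large TCP send/receive buffers before connecting.
    import socket

    BUFFER_BYTES = 22 * 1024 * 1024   # ~bandwidth-delay product of the path

    def make_tuned_socket() -> socket.socket:
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, BUFFER_BYTES)
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, BUFFER_BYTES)
        return sock

    if __name__ == "__main__":
        s = make_tuned_socket()
        # The kernel may clamp the request; print what was actually granted.
        print("SO_SNDBUF:", s.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF))
        print("SO_RCVBUF:", s.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF))
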
33
More Information Some URLs
  • UKLight web site: http://www.uklight.ac.uk
  • ESLEA web site: http://www.eslea.uklight.ac.uk
  • MB-NG project web site: http://www.mb-ng.net/
  • DataTAG project web site: http://www.datatag.org/
  • UDPmon / TCPmon kit writeup: http://www.hep.man.ac.uk/rich/net
  • Motherboard and NIC tests:
  • http://www.hep.man.ac.uk/rich/net/nic/GigEth_tests_Boston.ppt and http://datatag.web.cern.ch/datatag/pfldnet2003/
  • "Performance of 1 and 10 Gigabit Ethernet Cards with Server Quality Motherboards", FGCS Special Issue 2004
  • http://www.hep.man.ac.uk/rich/
  • TCP tuning information may be found at http://www.ncne.nlanr.net/documentation/faq/performance.html and http://www.psc.edu/networking/perf_tune.html
  • TCP stack comparisons: "Evaluation of Advanced TCP Stacks on Fast Long-Distance Production Networks", Journal of Grid Computing 2004
  • PFLDnet: http://www.ens-lyon.fr/LIP/RESO/pfldnet2005/
  • Dante PERT: http://www.geant2.net/server/show/nav.00d00h002

34
  • Any Questions?

35
  • Backup Slides

36
More Information Some URLs 2
  • Lectures, tutorials etc. on TCP/IP
  • www.nv.cc.va.us/home/joney/tcp_ip.htm
  • www.cs.pdx.edu/jrb/tcpip.lectures.html
  • www.raleigh.ibm.com/cgi-bin/bookmgr/BOOKS/EZ306200/CCONTENTS
  • www.cisco.com/univercd/cc/td/doc/product/iaabu/centri4/user/scf4ap1.htm
  • www.cis.ohio-state.edu/htbin/rfc/rfc1180.html
  • www.jbmelectronics.com/tcp.htm
  • Encyclopaedia
  • http://www.freesoft.org/CIE/index.htm
  • TCP/IP resources
  • www.private.org.il/tcpip_rl.html
  • Understanding IP addresses
  • http://www.3com.com/solutions/en_US/ncs/501302.html
  • Configuring TCP (RFC 1122)
  • ftp://nic.merit.edu/internet/documents/rfc/rfc1122.txt
  • Assigned protocols, ports etc. (RFC 1010)
  • http://www.es.net/pub/rfcs/rfc1010.txt and /etc/protocols

37
Packet Loss with new TCP Stacks
  • TCP Response Function
  • Throughput vs loss rate: the further to the right, the faster the recovery
  • Drop packets in the kernel

MB-NG rtt 6ms
DataTAG rtt 120 ms
38
High Performance TCP MB-NG
  • Drop 1 in 25,000
  • rtt 6.2 ms
  • Recover in 1.6 s
  • Standard HighSpeed Scalable

39
Fast
  • As well as packet loss, FAST uses RTT to detect congestion (see the window update sketched below)
  • RTT is very stable: σ(RTT) 9 ms vs 370.14 ms for the others

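Roughly, the published FAST TCP window update that makes it delay-sensitive is:

    w \;\leftarrow\; \min\!\Big\{\,2w,\;\;(1-\gamma)\,w \;+\; \gamma\Big(\frac{\mathrm{baseRTT}}{\mathrm{RTT}}\,w + \alpha\Big)\Big\}

When queueing delay appears (RTT > baseRTT) the window stops growing, so FAST reacts before packets are lost, which is why its RTT stays so stable.
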
40
SACK
  • Look into what's happening at the algorithmic level with Web100
  • Strange hiccups in cwnd: the only correlation is with SACK arrivals

Scalable TCP on MB-NG with 200 Mbit/s CBR background (Yee-Ting Li)
41
10 Gigabit Ethernet Tuning PCI-X
  • 16080-byte packets every 200 µs
  • Intel PRO/10GbE LR adapter
  • PCI-X bus occupancy vs mmrbc (maximum memory read byte count)
  • Measured times
  • Times based on PCI-X timing from the logic analyser
  • Expected throughput: 7 Gbit/s
  • Measured: 5.7 Gbit/s