Transcript and Presenter's Notes

Title: Bandwidth Challenges or "How fast can we really drive a Network"


1
Bandwidth Challenges or "How fast can we really
drive a Network?"
Richard Hughes-Jones, The University of Manchester
www.hep.man.ac.uk/rich/ then "Talks"
2
Collaboration at SC05
  • Caltech Booth
  • The BWC at the SLAC Booth
  • SCINet
  • Storcloud
  • ESLEA, Boston Ltd., Peta-Cache, Sun

3
Bandwidth Challenge wins Hat Trick
  • The maximum aggregate bandwidth was >151 Gbit/s
  • 130 DVD movies in a minute
  • serve 10,000 MPEG2 HDTV movies in real-time
  • 22 × 10 Gigabit Ethernet waves to the Caltech and SLAC/FERMI booths
  • In 2 hours transferred 95.37 TByte
  • 24 hours moved 475 TBytes
  • Showed real-time particle event analysis
  • SLAC / FERMI / UK Booth
  • 1 × 10 Gbit Ethernet to the UK over NLR and UKLight
  • transatlantic HEP disk to disk
  • VLBI streaming
  • 2 × 10 Gbit links to SLAC
  • rootd low-latency file access application for
    clusters
  • Fibre Channel StorCloud
  • 4 × 10 Gbit links to Fermi
  • dCache data transfers

Plots: traffic into and out of the booth
4
ESLEA and UKLight
  • 6 × 1 Gbit transatlantic Ethernet layer-2 paths (UKLight + NLR)
  • Disk-to-disk transfers with bbcp
  • Seattle to UK
  • Set TCP buffer and application to give 850 Mbit/s
  • One stream of data: 840-620 Mbit/s
  • Stream UDP VLBI data
  • UK to Seattle
  • 620 Mbit/s

5
SLAC 10 Gigabit Ethernet
  • 2 Lightpaths
  • Routed over ESnet
  • Layer 2 over Ultra Science Net
  • 6 Sun V20Z systems per λ (lightpath)
  • dCache remote disk data access
  • 100 processes per node
  • Node sends or receives
  • One data stream: 20-30 Mbit/s
  • Used Neterion NICs & Chelsio TOE
  • Data also sent to StorCloud using Fibre Channel links
  • Traffic on the 10 GE link for 2 nodes: 3-4 Gbit/s per node, 8.5-9 Gbit/s on the trunk

6
  • LightPath Topologies

7
Switched LightPaths 1
  • Lightpaths are fixed point-to-point paths or circuits
  • Optical links (with FEC) have a BER of ~10^-16, i.e. a packet loss rate of ~10^-12, or about 1 loss in 160 days (a rough check of these numbers follows below)
  • In SJ5 (SuperJANET5), lightpaths are known as Bandwidth Channels
  • Host-to-host lightpath
  • One application
  • No congestion
  • Advanced TCP stacks for large delay-bandwidth products

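A rough check of those numbers, assuming 1500-byte packets sent back-to-back at the full 1 Gbit/s line rate (the exact "160 days" depends on the packet size and rate assumed; this sketch gives the same order of magnitude):

    # Rough sanity check of the loss figures quoted above.
    # Assumptions: 1500-byte packets, sent back-to-back at 1 Gbit/s.
    BER = 1e-16                              # quoted bit error rate of the optical link (with FEC)
    packet_bits = 1500 * 8                   # bits per packet
    link_rate = 1e9                          # bits per second

    packet_loss_rate = BER * packet_bits                  # ~1.2e-12, i.e. ~10^-12 as quoted
    packets_per_second = link_rate / packet_bits          # ~83,000 packets/s at full rate
    seconds_between_losses = 1 / (packet_loss_rate * packets_per_second)

    print(f"packet loss rate ~ {packet_loss_rate:.1e}")
    print(f"one loss every ~ {seconds_between_losses / 86400:.0f} days")
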
8
Switched LightPaths 2
  • User Controlled Lightpaths
  • Grid scheduling of CPUs and network
  • Many application flows
  • No congestion on each path
  • Lightweight framing possible
  • Some applications suffer when using TCP and may prefer UDP, DCCP, or XCP
  • E.g. with e-VLBI the data wave-front gets distorted and correlation fails

9
  • Network Transport Layer Issues

10
  • Problem 1
  • Packet Loss
  • Is it important ?

11
TCP (Reno) Packet loss and Time
  • TCP takes packet loss as an indication of congestion
  • Time for TCP to recover its throughput from one lost 1500-byte packet, for an rtt of 200 ms @ 1 Gbit/s (the formula is reconstructed after the table below)

UK, rtt 6 ms: 1.6 s
Europe, rtt 25 ms: 26 s
USA, rtt 150 ms: 28 min
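
The formula itself did not survive the transcript; the standard back-of-the-envelope version for Reno (additive increase of one segment per RTT, starting from a halved window) that reproduces numbers of this size is, in LaTeX:

    \tau \;\approx\; \frac{\mathrm{cwnd}/2}{1\ \mathrm{segment/RTT}}\times\mathrm{RTT}
         \;=\; \frac{C\,\mathrm{RTT}^2}{2\,\mathrm{MSS}}

    % e.g. C = 1 Gbit/s, RTT = 200 ms, MSS = 1500 bytes:
    \tau \;\approx\; \frac{10^{9}\times(0.2)^2}{2\times 1500\times 8}\ \mathrm{s}
         \;\approx\; 1.7\times10^{3}\ \mathrm{s} \;\approx\; 28\ \mathrm{min}
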
12
Packet Loss and new TCP Stacks
  • TCP response function
  • Throughput vs loss rate: the further to the right, the faster the recovery (the standard-TCP response function is sketched below)
  • UKLight London-Chicago-London, rtt 177 ms
  • Drop packets in the kernel
  • 2.6.6 kernel
  • Agreement with theory is good
  • Some new stacks are good at high loss rates

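For reference, the standard-TCP response function the new stacks are being compared against is (roughly) the Mathis et al. relation between achievable throughput, MSS, RTT and loss probability p:

    \mathrm{Throughput} \;\lesssim\; \frac{\mathrm{MSS}}{\mathrm{RTT}}\cdot\frac{1.22}{\sqrt{p}}

The new stacks change this response so that a given throughput can be sustained at a higher loss rate, which is what "further to the right" means on the plot.
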
13
High Throughput Demonstrations
Network diagram: Geneva (rtt 128 ms) and Chicago end hosts (dual Xeon 2.2 GHz, 1 GEth) connected via Cisco 7609 and Cisco GSR routers over a 2.5 Gbit SDH MB-NG core
14
TCP Throughput DataTAG
  • Different TCP stacks tested on the DataTAG network
  • rtt 128 ms
  • Drop 1 in 10^6
  • High-Speed: rapid recovery
  • Scalable: very fast recovery
  • Standard: recovery would take ~20 mins (the update rules behind these differences are sketched below)

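For reference, the window-update rules behind those recovery times (not taken from the slide graphics; Reno from the standard, Scalable from Kelly's Scalable TCP proposal with a = 0.01, b = 0.125):

    \text{Reno:}\qquad \text{per RTT: } w \leftarrow w + 1, \qquad\quad \text{on loss: } w \leftarrow 0.5\,w

    \text{Scalable:}\quad \text{per ACK: } w \leftarrow w + 0.01, \qquad \text{on loss: } w \leftarrow 0.875\,w

Because Scalable TCP's increase and decrease are both proportional to the window, its recovery time is roughly independent of the window size, which is why it recovers quickly on long fat paths.
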
15
  • Problem 2
  • Is TCP fair?
  • look at
  • Round Trip Time and Maximum Transmission Unit (MTU)

16
MTU and Fairness
  • Two TCP streams share a 1 Gb/s bottleneck
  • RTT 117 ms
  • MTU 3000 bytes: avg. throughput over a period of 7000 s = 243 Mb/s
  • MTU 9000 bytes: avg. throughput over a period of 7000 s = 464 Mb/s
  • Link utilization: 70.7%

Sylvain Ravot DataTag 2003
17
RTT and Fairness
  • Two TCP streams share a 1 Gb/s bottleneck
  • CERN <-> Sunnyvale, RTT 181 ms: avg. throughput over a period of 7000 s = 202 Mb/s
  • CERN <-> Starlight, RTT 117 ms: avg. throughput over a period of 7000 s = 514 Mb/s
  • MTU 9000 bytes
  • Link utilization: 71.6%

Sylvain Ravot DataTag 2003
18
  • Problem n
  • Do TCP Flows Share the Bandwidth ?

19
Test of TCP Sharing Methodology (1Gbit/s)
  • Chose 3 paths from SLAC (California)
  • Caltech (10ms), Univ Florida (80ms), CERN (180ms)
  • Used iperf/TCP and UDT/UDP to generate traffic (an example iperf invocation is sketched below)
  • Each run was 16 minutes, in 7 regions

Les Cottrell RHJ PFLDnet 2005
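
A minimal sketch of driving one of those iperf TCP runs from a script (the host name is a placeholder; the flags are standard iperf version 2 client options for duration, report interval, and socket buffer size):

    # Run a single-stream iperf TCP test and return its text report.
    # Assumptions: iperf (version 2) is installed; the server name is hypothetical.
    import subprocess

    def run_iperf(server: str, seconds: int = 960, window: str = "16M") -> str:
        cmd = ["iperf", "-c", server,        # connect to the iperf server
               "-t", str(seconds),           # 960 s = the 16-minute runs described above
               "-i", "10",                   # report every 10 s
               "-w", window]                 # requested TCP socket buffer size
        result = subprocess.run(cmd, capture_output=True, text=True, check=True)
        return result.stdout

    if __name__ == "__main__":
        print(run_iperf("remote.example.org", seconds=60))
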
20
TCP Reno single stream
Les Cottrell RHJ PFLDnet 2005
  • Low performance on fast long-distance paths
  • AIMD (add a=1 packet to cwnd per RTT; decrease cwnd by factor b=0.5 on congestion); see the toy model below
  • Net effect: recovers slowly, does not effectively use the available bandwidth, so poor throughput
  • Unequal sharing

SLAC to CERN
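
A toy model of that AIMD behaviour on a path like SLAC-CERN (pure arithmetic, no real traffic), illustrating why standard Reno recovers so slowly:

    # Toy model of Reno congestion avoidance (AIMD) on a long fat path.
    def reno_recovery_time(link_gbit: float, rtt_s: float, mss_bytes: int = 1500) -> float:
        """Seconds needed to grow back to the pre-loss window after a single loss."""
        bdp_pkts = link_gbit * 1e9 * rtt_s / (mss_bytes * 8)   # window that fills the pipe
        cwnd = bdp_pkts / 2                                     # multiplicative decrease: b = 0.5
        rtts = 0
        while cwnd < bdp_pkts:
            cwnd += 1                                           # additive increase: a = 1 pkt per RTT
            rtts += 1
        return rtts * rtt_s

    # A 1 Gbit/s path with rtt ~180 ms -> recovery takes ~20+ minutes.
    print(f"{reno_recovery_time(1.0, 0.18) / 60:.1f} minutes to recover")
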
21
Hamilton TCP
  • One of the best performers
  • Throughput is high
  • Big effects on RTT when it achieves its best throughput
  • Flows share equally

22
  • Problem n+1
  • To SACK or not to SACK ?

23
The SACK Algorithm
  • SACK rationale
  • Non-contiguous blocks of data can be ACKed
  • Sender retransmits just the lost packets
  • Helps when multiple packets are lost in one TCP window
  • But SACK processing is inefficient for large bandwidth-delay products
  • The sender's write queue (a linked list) is walked for each SACK block:
  • to mark lost packets
  • to re-transmit
  • Processing takes so long that the input queue becomes full
  • Result: timeouts (an illustrative sketch of the per-SACK-block walk follows below)

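An illustrative sketch (ordinary Python, not kernel code) of the linked-list walk described above and why it becomes expensive at large bandwidth-delay products:

    # For every SACK block in every ACK, the sender walks its (linked-list)
    # retransmit queue to mark segments as SACKed or lost.  With tens of
    # thousands of packets in flight, this O(window) walk per SACK block can
    # starve the input queue and lead to timeouts.
    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class Segment:
        seq: int
        sacked: bool = False
        next: Optional["Segment"] = None

    def process_sack_block(head: Segment, start: int, end: int) -> int:
        """Walk the whole write queue, marking segments covered by one SACK block."""
        walked = 0
        node = head
        while node is not None:
            walked += 1                       # cost: one list node touched
            if start <= node.seq < end:
                node.sacked = True
            node = node.next
        return walked

    # Build a write queue the size of a 1 Gbit/s x 180 ms pipe (~15,000 packets).
    head = Segment(seq=0)
    tail = head
    for seq in range(1, 15000):
        tail.next = Segment(seq=seq)
        tail = tail.next

    # One ACK can carry several SACK blocks; each triggers a full walk.
    cost = sum(process_sack_block(head, s, s + 10) for s in (100, 5000, 12000))
    print(f"list nodes touched for a single ACK: {cost}")   # ~45,000
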
24
  • What does the User Application make of this?
  • The view from the Application

25
SC2004 Disk-Disk bbftp
  • The bbftp file transfer program uses TCP/IP
  • UKLight path: London-Chicago-London; PCs: Supermicro with 3Ware RAID0
  • MTU 1500 bytes, socket size 22 Mbytes (roughly the path's bandwidth-delay product, see the calculation below), rtt 177 ms, SACK off
  • Move a 2 Gbyte file
  • Web100 plots
  • Standard TCP
  • Average 825 Mbit/s
  • (bbcp: 670 Mbit/s)
  • Scalable TCP
  • Average 875 Mbit/s
  • (bbcp: 701 Mbit/s, ~4.5 s of overhead)
  • Disk-TCP-Disk at 1 Gbit/s is here!

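The 22-Mbyte socket size is essentially the bandwidth-delay product of the path; the arithmetic, using the figures on the slide:

    # Why the ~22-Mbyte socket buffer: the TCP window has to cover the
    # bandwidth-delay product of the London-Chicago-London path.
    link_rate = 1e9        # 1 Gbit/s
    rtt = 0.177            # 177 ms round trip time

    bdp_bytes = link_rate * rtt / 8
    print(f"BDP = {bdp_bytes / 2**20:.1f} MiB")   # ~21 MiB, hence the ~22-Mbyte socket size
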
26
SC05 HEP Moving data with bbcp
  • What is the end-host doing with your network
    protocol?
  • Look at the PCI-X
  • 3Ware 9000 controller RAID0
  • 1 Gbit Ethernet link
  • 2.4 GHz dual Xeon
  • 660 Mbit/s
  • Power needed in the end hosts
  • Careful Application design

27
VLBI TCP Stack CPU Load
  • Real user problem!
  • An end-host TCP flow at 960 Mbit/s with rtt 1 ms falls to 770 Mbit/s when rtt is 15 ms
  • 1.2 GHz PIII
  • TCP iperf, rtt 1 ms: 960 Mbit/s
  • 94.7% kernel mode, 1.5% idle
  • TCP iperf, rtt 15 ms: 777 Mbit/s
  • 96.3% kernel mode, 0.05% idle
  • CPU load with nice priority
  • Throughput falls as priority increases
  • No loss, no timeouts
  • Not enough CPU power
  • 2.8 GHz Xeon, rtt 1 ms
  • TCP iperf: 916 Mbit/s
  • 43% kernel mode, 55% idle
  • CPU load with nice priority
  • Throughput constant as priority increases
  • No loss, no timeouts
  • Kernel mode includes the TCP stack and the Ethernet driver (one way to take such measurements is sketched below)

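One convenient way to reproduce this kind of kernel-mode/idle measurement while a transfer is running (this assumes the third-party psutil package; it is not how the original numbers were taken):

    # Sample user/system/idle CPU percentages while an iperf or bbcp transfer runs.
    import psutil

    def sample_cpu(seconds: int = 10) -> None:
        """Print CPU percentages averaged over a blocking interval."""
        pct = psutil.cpu_times_percent(interval=seconds)
        print(f"user {pct.user:.1f}%  kernel {pct.system:.1f}%  idle {pct.idle:.1f}%")

    if __name__ == "__main__":
        # Start the transfer first, then sample a few times while it runs.
        for _ in range(3):
            sample_cpu(10)
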
28
ATLAS Remote Computing Application Protocol
  • Event Request
  • EFD requests an event from SFI
  • SFI replies with the event (~2 Mbytes)
  • Processing of the event
  • Return of the computation
  • EF asks SFO for buffer space
  • SFO sends OK
  • EF transfers the results of the computation
  • tcpmon: an instrumented TCP request-response program that emulates the Event Filter (EFD) to SFI communication; a minimal sketch of such a client follows below

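A minimal sketch of a tcpmon-style request-response client (host, port and message sizes are placeholders, not the real EFD/SFI protocol):

    # Send a small request, read back a large response, and time each exchange.
    import socket
    import time

    REQUEST_SIZE = 64                 # 64-byte request, as in the plots above
    RESPONSE_SIZE = 2 * 1024 * 1024   # ~2 Mbyte event, as in the ATLAS example

    def request_events(host: str, port: int, n_events: int) -> None:
        with socket.create_connection((host, port)) as sock:
            for i in range(n_events):
                start = time.perf_counter()
                sock.sendall(b"R" * REQUEST_SIZE)
                received = 0
                while received < RESPONSE_SIZE:
                    chunk = sock.recv(65536)
                    if not chunk:
                        raise ConnectionError("server closed the connection")
                    received += len(chunk)
                elapsed = time.perf_counter() - start
                print(f"event {i}: {received} bytes in {elapsed * 1e3:.1f} ms")

    if __name__ == "__main__":
        request_events("sfi.example.org", 5000, n_events=10)
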
29
tcpmon TCP Activity Manc-CERN Req-Resp
  • Web100 hooks for TCP status
  • Round trip time 20 ms
  • 64-byte request (green), 1-Mbyte response (blue)
  • TCP in slow start
  • 1st event takes 19 rtt or 380 ms

30
tcpmon TCP Activity Manc-CERN Req-Resp, no cwnd reduction
  • Round trip time 20 ms
  • 64-byte request (green), 1-Mbyte response (blue)
  • TCP starts in slow start
  • 1st event takes 19 rtt or 380 ms
  • TCP congestion window grows nicely
  • Response takes 2 rtt after ~1.5 s
  • Rate 10/s (with 50 ms wait)
  • Transfer achievable throughput grows to 800 Mbit/s
  • Data transferred WHEN the application requires the data

3 Round Trips
2 Round Trips
31
HEP Service Challenge 4
  • Objective: demonstrate 1 Gbit/s aggregate bandwidth between RAL and 4 Tier 2 sites
  • RAL has SuperJANET4 and UKLight links
  • RAL capped firewall traffic at 800 Mbit/s
  • SuperJANET sites:
  • Glasgow, Manchester, Oxford, QMUL
  • UKLight site:
  • Lancaster
  • Many concurrent transfers from RAL to each of the Tier 2 sites

~700 Mbit/s on UKLight
Peak 680 Mbit/s on SJ4
32
Summary Conclusions
  • Well, you CAN fill the links at 1 and 10 Gbit/s, but it's not THAT simple
  • Packet loss is a killer for TCP
  • Check campus links and equipment, and the access links to the backbones
  • Users need to collaborate with the campus network teams
  • Dante PERT
  • New stacks are stable and give better response performance
  • Still need to set the TCP buffer sizes! (see the buffer-setting sketch below)
  • Check other kernel settings, e.g. the window-scale maximum
  • Watch for TCP stack implementation enhancements
  • TCP tries to be fair
  • Large MTU has an advantage
  • Short distances (small RTT) have an advantage
  • TCP does not share bandwidth well with other streams
  • The end hosts themselves
  • Plenty of CPU power is required for the TCP/IP stack as well as the application

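A minimal sketch of the buffer-size advice for an application that opens its own sockets. The 22-Mbyte figure assumes a ~1 Gbit/s path with ~180 ms RTT (buffer ~ bandwidth x delay); the kernel limits must also allow buffers this large (net.core.rmem_max / net.core.wmem_max, net.ipv4.tcp_rmem / net.ipv4.tcp_wmem, with TCP window scaling enabled):

    # Request large TCP send/receive buffers before connecting.
    import socket

    BUFFER_BYTES = 22 * 1024 * 1024   # ~bandwidth-delay product of the path

    def make_tuned_socket() -> socket.socket:
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, BUFFER_BYTES)
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, BUFFER_BYTES)
        return sock

    if __name__ == "__main__":
        s = make_tuned_socket()
        # The kernel may clamp the request; print what was actually granted.
        print("SO_SNDBUF:", s.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF))
        print("SO_RCVBUF:", s.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF))
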
33
More Information Some URLs
  • UKLight web site: http://www.uklight.ac.uk
  • ESLEA web site: http://www.eslea.uklight.ac.uk
  • MB-NG project web site: http://www.mb-ng.net/
  • DataTAG project web site: http://www.datatag.org/
  • UDPmon / TCPmon kit writeup: http://www.hep.man.ac.uk/rich/net
  • Motherboard and NIC tests:
  • http://www.hep.man.ac.uk/rich/net/nic/GigEth_tests_Boston.ppt and http://datatag.web.cern.ch/datatag/pfldnet2003/
  • "Performance of 1 and 10 Gigabit Ethernet Cards with Server Quality Motherboards", FGCS Special Issue 2004
  • http://www.hep.man.ac.uk/rich/
  • TCP tuning information may be found at http://www.ncne.nlanr.net/documentation/faq/performance.html and http://www.psc.edu/networking/perf_tune.html
  • TCP stack comparisons: "Evaluation of Advanced TCP Stacks on Fast Long-Distance Production Networks", Journal of Grid Computing 2004
  • PFLDnet: http://www.ens-lyon.fr/LIP/RESO/pfldnet2005/
  • Dante PERT: http://www.geant2.net/server/show/nav.00d00h002

34
  • Any Questions?

35
  • Backup Slides

36
More Information Some URLs 2
  • Lectures, tutorials etc. on TCP/IP
  • www.nv.cc.va.us/home/joney/tcp_ip.htm
  • www.cs.pdx.edu/jrb/tcpip.lectures.html
  • www.raleigh.ibm.com/cgi-bin/bookmgr/BOOKS/EZ306200/CCONTENTS
  • www.cisco.com/univercd/cc/td/doc/product/iaabu/centri4/user/scf4ap1.htm
  • www.cis.ohio-state.edu/htbin/rfc/rfc1180.html
  • www.jbmelectronics.com/tcp.htm
  • Encyclopaedia
  • http://www.freesoft.org/CIE/index.htm
  • TCP/IP resources
  • www.private.org.il/tcpip_rl.html
  • Understanding IP addresses
  • http://www.3com.com/solutions/en_US/ncs/501302.html
  • Configuring TCP (RFC 1122)
  • ftp://nic.merit.edu/internet/documents/rfc/rfc1122.txt
  • Assigned protocols, ports etc. (RFC 1010)
  • http://www.es.net/pub/rfcs/rfc1010.txt and /etc/protocols

37
Packet Loss with new TCP Stacks
  • TCP Response Function
  • Throughput vs loss rate: the further to the right, the faster the recovery
  • Drop packets in the kernel

MB-NG rtt 6ms
DataTAG rtt 120 ms
38
High Performance TCP MB-NG
  • Drop 1 in 25,000
  • rtt 6.2 ms
  • Recover in 1.6 s
  • Standard HighSpeed Scalable

39
Fast
  • As well as packet loss, FAST uses RTT to detect congestion (see the window update sketched below)
  • RTT is very stable: σ(RTT) 9 ms vs 370.14 ms for the others

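Roughly, the published FAST TCP window update that makes it delay-sensitive is:

    w \;\leftarrow\; \min\!\Big\{\,2w,\;\;(1-\gamma)\,w \;+\; \gamma\Big(\frac{\mathrm{baseRTT}}{\mathrm{RTT}}\,w + \alpha\Big)\Big\}

When queueing delay appears (RTT > baseRTT) the window stops growing, so FAST reacts before packets are lost, which is why its RTT stays so stable.
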
40
SACK
  • Look into what's happening at the algorithmic level with Web100
  • Strange hiccups in cwnd: the only correlation is with SACK arrivals

Scalable TCP on MB-NG with 200 Mbit/s CBR background (Yee-Ting Li)
41
10 Gigabit Ethernet Tuning PCI-X
  • 16080-byte packets every 200 µs
  • Intel PRO/10GbE LR adapter
  • PCI-X bus occupancy vs mmrbc (maximum memory read byte count)
  • Measured times
  • Times based on PCI-X timing from the logic analyser
  • Expected throughput: 7 Gbit/s
  • Measured: 5.7 Gbit/s