Title: The ongoing evolution from Packet based networks to Hybrid Networks in Research
1 The ongoing evolution from Packet based networks to Hybrid Networks in Research & Education Networks
Olivier Martin, CERN Swiss ICT Task Force
(Fribourg)
2 Presentation Outline
- The demise of conventional packet based networks in the R&E community
- The advent of community managed dark fiber networks
- The Grid & its associated Wide Area Networking challenges
- On-Demand Lambda Grids
- Ethernet over SONET & new standards
  - WAN-PHY, GFP, VCAT/LCAS, G.709, OTN
- Disclaimer: The views expressed herein are not necessarily those of CERN. Furthermore, although I am formally a CERN staff member until July 31, 2006, I have not worked for CERN since October 3, being on a pre-retirement program.
4OC-768c
40-GE
5 Some facts
(5 of 12)
- Internet is everywhere
- Ethernet is everywhere
- The advent of next generation G.709 Optical Transport Networks
  - is very unsure!
  - hence one has to learn how to live best with existing network infrastructures,
  - which may well explain all the hype about on-demand lambda Grids!
- For the first time in the history of the Internet, the Commercial and the Research & Education Internet appear to follow different routes
  - Will they ever converge again?
- Dark fiber based, customer owned, long distance networks are booming!
  - users are becoming their own Telecom Operators
  - Is it a good or a bad thing?
6 Internet Backbone Speeds
[Figure: Internet backbone speeds (Mbps). Labels: T1 lines, T3 lines, ATM-VCs, OC3c, OC12c, IP/?]
7 High Speed IP Network Transport Trends
[Figure: protocol layering. Labels: IP, Signalling, ATM, SONET/SDH, Optical, B-ISDN. Captions: "Multiplexing, protection and management at every layer"; "Higher Speed, Lower cost, complexity and overhead"]
10 Network Exponentials
- Network vs. computer performance
  - Computer speed doubles every 18 months
  - Network speed doubles every 9 months
  - Difference: an order of magnitude per 5 years
- 1986 to 2000
  - Computers x 500
  - Networks x 340,000
- 2001 to 2010
  - Computers x 60
  - Networks x 4000
Moore's Law vs. storage improvements vs. optical improvements. Graph from Scientific American (Jan-2001) by Cleo Vilett, source Vinod Khosla, Kleiner, Caufield and Perkins.
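As a rough check of the figures on this slide, the sketch below compounds the two doubling periods quoted above (18 months for computers, 9 months for networks). The function name `growth_factor` is hypothetical, and pure exponential growth is an idealisation, so the results only approximate the rounded numbers on the slide.

```python
def growth_factor(months: float, doubling_months: float) -> float:
    """Multiplicative growth after `months`, with one doubling every `doubling_months`."""
    return 2.0 ** (months / doubling_months)


COMPUTER_DOUBLING = 18.0  # months, as quoted on the slide
NETWORK_DOUBLING = 9.0    # months, as quoted on the slide

# Gap between network and computer growth after 5 years.
five_years = 5 * 12
gap_5y = (growth_factor(five_years, NETWORK_DOUBLING)
          / growth_factor(five_years, COMPUTER_DOUBLING))

# Growth over the 2001 to 2010 period (9 years).
nine_years = 9 * 12
computers_9y = growth_factor(nine_years, COMPUTER_DOUBLING)
networks_9y = growth_factor(nine_years, NETWORK_DOUBLING)

print(f"network/computer gap after 5 years: x{gap_5y:.1f}")
print(f"2001-2010: computers x{computers_9y:.0f}, networks x{networks_9y:.0f}")
```

Under these assumptions the gap is about one order of magnitude per five years, and the 2001 to 2010 factors come out at 64 and 4096, in line with the slide's x60 and x4000.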
11 Know the user
(3 of 12)
[Figure: # of users versus BW requirements, curve F(t), with user classes A, B, C and the ADSL and GigE LAN marks]
A -> Lightweight users, browsing, mailing, home use
B -> Business applications, multicast, streaming, VPNs, mostly LAN
C -> Special scientific applications, computing, data grids, virtual-presence
12 What the user
(4 of 12)
[Figure: Total BW versus BW requirements, with user classes A, B, C and the ADSL and GigE LAN marks]
A -> Need full Internet routing, one to many
B -> Need VPN services on/and full Internet routing, several to several
C -> Need very fat pipes, limited multiple Virtual Organizations, few to few
13 So what are the facts
(5 of 12)
- Costs of fat pipes (fibers) are one third of the equipment to light them up
  - Is what Lambda salesmen told Cees de Laat (University of Amsterdam & Surfnet)
- Costs of optical equipment: 10% of switching, 10% of full routing equipment for same throughput
  - 100 Byte packet @ 10 Gb/s -> 80 ns to look up in 100 Mbyte routing table (light speed from me to you on the back row!)
- Big sciences need fat pipes
- Bottom line: create a hybrid architecture which serves all users in one coherent and cost effective way
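The 80 ns figure above is simply the time a 100 Byte packet occupies a 10 Gb/s link, which bounds the time available per routing lookup at line rate. A minimal sketch of that arithmetic follows; the helper name `packet_time_ns` is hypothetical, and the speed of light is taken in vacuum.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458  # in vacuum


def packet_time_ns(packet_bytes: int, link_bps: float) -> float:
    """Time (in nanoseconds) needed to serialise one packet onto the link."""
    return packet_bytes * 8 / link_bps * 1e9


# Per-packet time budget for 100 Byte packets at 10 Gb/s.
budget_ns = packet_time_ns(100, 10e9)

# Distance light covers in that time, for the "back row" comparison.
light_distance_m = SPEED_OF_LIGHT_M_PER_S * budget_ns * 1e-9

print(f"lookup budget: {budget_ns:.0f} ns, light travels {light_distance_m:.1f} m")
```

In 80 ns light covers roughly 24 m, which is the point of the back-row remark.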
14Utilization trends
Gbps
Network Capacity Limit
Jan 2005
15 Today's hierarchical IP network
Other national networks
National or Pan-National IP Network
NREN A
NREN C
NREN B
NREN D
University
16 Tomorrow's peer to peer IP network
World
World
National DWDM Network
World
Child Lightpaths
NREN B
NREN A
NREN C
NREN D
Child Lightpaths
University
Server
17Creation of application VPNs
Direct connect bypasses campus firewall
University
Dept
High Energy Physics Network
CERN
Commodity Internet
Research Network
University
University
Bio-informatics Network
University
University
eVLBI Network
18 Production vs Research Campus Networks
- Increasingly campuses are deploying parallel networks for high end users
- Reduces costs by providing high end network capability to only those who need it
- Limitations of campus firewall and border router are eliminated
- Many issues in regards to security, back door routing, etc
- Campus networks may follow same evolution as campus computing
- Discipline specific networks being extended into the campus
19UCLP intended for projects like National
LambdaRail
20GEANT2 POP Design
21 LHC Data Grid Hierarchy
CERN/Outside Resource Ratio ~1:2; Tier0/(Σ Tier1)/(Σ Tier2) ~1:1:1
PByte/sec
100-400 MBytes/sec
Online System
Experiment
CERN 700k SI95 1 PB Disk Tape Robot
Tier 0 1
HPSS
10 Gbps
Tier 1
FNAL 200k SI95 600 TB
IN2P3 Center
INFN Center
RAL Center
2.5/10 Gbps
Tier 2
2.5 Gbps
Tier 3
Institute 0.25TIPS
Institute
Institute
Institute
Physicists work on analysis channels. Each institute has 10 physicists working on one or more channels.
0.11 Gbps
Physics data cache
Tier 4
Workstations
22 Main Networking Challenges
- Fulfill the, yet unproven, assertion that the network can be nearly transparent to the Grid
- Deploy suitable Wide Area Network infrastructure (50-100 Gb/s)
- Deploy suitable Local Area Network infrastructure (matching or exceeding that of the WAN)
- Seamless interconnection of LAN & WAN infrastructures
  - firewall?
- End to End issues (transport protocols, PCs (Itanium, Xeon), 10GigE NICs (Intel, S2io)): where are we today?
  - memory to memory: 7.5 Gb/s (PCI bus limit)
  - memory to disk: 1.2 GB/s (Windows 2003 server/NewiSys)
  - disk to disk: 400 MB/s (Linux), 600 MB/s (Windows)
23 Main TCP issues
- Does not scale to some environments
  - High speed, high latency
  - Noisy
- Unfair behaviour with respect to
  - Round Trip Time (RTT)
  - Frame size (MSS)
  - Access Bandwidth
- Widespread use of multiple streams in order to compensate for inherent TCP/IP limitations (e.g. Gridftp, BBftp)
  - Bandage rather than a cure
- New TCP/IP proposals in order to restore performance in single stream environments
  - Not clear if/when it will have a real impact
- In the mean time there is an absolute requirement for backbones with
  - Zero packet losses,
  - And no packet re-ordering
  - Which re-inforces the case for lambda Grids
24 TCP dynamics (10Gbps, 100ms RTT, 1500Bytes packets)
- Window size (W) = Bandwidth x Round Trip Time
  - W_bits = 10 Gbps x 100 ms = 1 Gb
  - W_packets = 1 Gb / (8 x 1500) = 83333 packets
- Standard Additive Increase Multiplicative Decrease (AIMD) mechanisms
  - W = W/2 (halving the congestion window on loss event)
  - W = W + 1 (increasing congestion window by one packet every RTT)
- Time to recover from W/2 to W (congestion avoidance) at 1 packet per RTT
  - RTT x W_packets / 2 = 1.157 hour
  - In practice, 1 packet per 2 RTT because of delayed acks, i.e. 2.31 hour
- Packets per second
  - W_packets / RTT = 833333 packets per second
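The numbers on this slide can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, the window is taken as the bandwidth-delay product expressed in 1500 Byte packets, and recovery is modelled as adding one packet per RTT (or per two RTTs with delayed acks), as the slide describes.

```python
def bdp_window_packets(bandwidth_bps: float, rtt_s: float, packet_bytes: int) -> float:
    """Bandwidth-delay product expressed as a number of packets."""
    return bandwidth_bps * rtt_s / (8 * packet_bytes)


def aimd_recovery_s(window_packets: float, rtt_s: float, rtts_per_increment: int = 1) -> float:
    """Time to grow from W/2 back to W, adding one packet every `rtts_per_increment` RTTs."""
    return (window_packets / 2) * rtt_s * rtts_per_increment


BANDWIDTH_BPS = 10e9   # 10 Gbps
RTT_S = 0.100          # 100 ms
PACKET_BYTES = 1500

w_packets = bdp_window_packets(BANDWIDTH_BPS, RTT_S, PACKET_BYTES)
recovery_h = aimd_recovery_s(w_packets, RTT_S) / 3600
recovery_delayed_ack_h = aimd_recovery_s(w_packets, RTT_S, rtts_per_increment=2) / 3600
packets_per_s = w_packets / RTT_S

print(f"W = {w_packets:.0f} packets")
print(f"recovery: {recovery_h:.3f} h, with delayed acks: {recovery_delayed_ack_h:.2f} h")
print(f"packet rate: {packets_per_s:.0f} packets/s")
```

This gives a window of about 83333 packets, recovery times of about 1.157 and 2.31 hours, and a packet rate of about 833333 packets per second.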
25 Internet2 land speed record history (IPv4 & IPv6), period 2000-2004
26 Layer1/2/3 networking (1)
- Conventional layer 3 technology is no longer fashionable because of
  - High associated costs, e.g. 200/300 KUSD for a 10G router interface
  - Implied use of shared backbones
- The use of layer 1 or layer 2 technology is very attractive because it helps to solve a number of problems, e.g.
  - 1500 bytes Ethernet frame size (layer1)
  - Protocol transparency (layer1 & layer2)
  - Minimum functionality hence, in theory, much lower costs (layer1 & 2)
27 Layer1/2/3 networking (2)
- On-demand Lambda Grids are becoming very popular
- Pros
  - circuit oriented model like the telephone network, hence no need for complex transport protocols
  - Lower equipment costs (i.e. in theory a factor 2 or 3 per layer)
  - the concept of a dedicated end to end light path is very elegant
- Cons
  - End to end still very loosely defined, i.e. site to site, cluster to cluster or really host to host
  - Higher circuit costs, Scalability, Additional middleware to deal with circuit set up/tear down, etc
  - Extending dynamic VLAN functionality is a potential nightmare!
28 Lambda Grids: What does it mean?
- Clearly different things to different people, hence the apparently easy consensus!
- Conservatively, on demand site to site connectivity
  - Where is the innovation?
  - What does it solve in terms of transport protocols?
  - Where are the savings?
    - Fewer interfaces needed (customer) but more standby/idle circuits needed (provider)
    - Economics from the service provider vs the customer perspective?
    - Traditionally, switched services have been very expensive,
    - Usage vs flat charge
    - Break even, switched vs leased, few hours/day
    - Why would this change?
  - In case there are no savings, why bother?
- More advanced, cluster to cluster
  - Implies even more active circuits in parallel
  - Is it realistic?
- Even more advanced, Host to Host or even per flow
  - All optical
  - Is it really realistic?
29 Some Challenges
- Real bandwidth estimates given the chaotic nature of the requirements.
- End-end performance given the whole chain involved
  - (disk-bus-memory-bus-network-bus-memory-bus-disk)
- Provisioning over complex network infrastructures (GEANT, NRENs etc)
- Cost model for options (packet + SLAs, circuit switched etc)
- Consistent Performance (dealing with firewalls)
- Merging leading edge research with production networking
30 Tentative conclusions
- There is a very clear trend towards community managed dark fiber networks
  - As a consequence National Research & Education Networks are evolving into Telecom Operators, is it right?
    - In the short term, almost certainly YES
    - In the longer term, probably NO
  - In many countries, there is NO other way to have affordable access to multi-Gbit/s networks, therefore this is clearly the right move
- The Grid & its associated Wide Area Networking challenges
  - on-demand Lambda Grids are, in my opinion, extremely doubtful!
- Ethernet over SONET & new standards will revolutionize the Internet
  - WAN-PHY (IEEE) has, in my opinion, NO future!
  - However, GFP, VCAT/LCAS, G.709, OTN are very likely to have a very bright future.
31 Single TCP stream performance under periodic losses
- Loss rate 0.01%
  - LAN BW utilization 99%
  - WAN BW utilization 1.2%
Bandwidth available 1 Gbps
- TCP throughput much more sensitive to packet loss in WANs than LANs
  - TCP's congestion control algorithm (AIMD) is not well-suited to gigabit networks
  - The effect of packet loss can be disastrous
- TCP is inefficient in high bandwidth*delay networks
- The future performance-outlook for computational grids looks bad if we continue to rely solely on the widely-deployed TCP RENO
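One common way to reason about utilization figures like these is the steady-state approximation of Mathis et al. for Reno-style AIMD: throughput ≈ (MSS / RTT) · C / sqrt(p), with C = sqrt(3/2) for periodic losses. This model does not come from the slide; it is brought in here purely for illustration. The sketch reads the slide's loss rate as 0.01% (p = 1e-4), and the RTT values (120 ms for the WAN, 1 ms for the LAN) and the 1460 Byte MSS are assumptions, since the slide does not state them.

```python
import math


def mathis_throughput_bps(mss_bytes: int, rtt_s: float, loss_rate: float) -> float:
    """Mathis et al. approximation of steady-state TCP Reno throughput (periodic loss)."""
    c = math.sqrt(3.0 / 2.0)
    return (mss_bytes * 8 / rtt_s) * c / math.sqrt(loss_rate)


def utilization(link_bps: float, mss_bytes: int, rtt_s: float, loss_rate: float) -> float:
    """Fraction of the link a single stream can fill, capped at 1.0 (link-limited)."""
    return min(1.0, mathis_throughput_bps(mss_bytes, rtt_s, loss_rate) / link_bps)


LINK_BPS = 1e9     # 1 Gbps available, as on the slide
LOSS_RATE = 1e-4   # 0.01 %, assumed reading of the slide
MSS_BYTES = 1460   # assumption: 1500 Byte MTU minus 40 Bytes of headers

wan_util = utilization(LINK_BPS, MSS_BYTES, rtt_s=0.120, loss_rate=LOSS_RATE)  # assumed WAN RTT
lan_util = utilization(LINK_BPS, MSS_BYTES, rtt_s=0.001, loss_rate=LOSS_RATE)  # assumed LAN RTT

print(f"WAN utilization: {wan_util:.1%}, LAN utilization: {lan_util:.0%}")
```

With these assumed inputs the WAN case lands near 1.2% of the 1 Gbps link, while in the LAN case the model is no longer loss-limited and the stream can fill the link.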
32 Responsiveness
- Time to recover from a single packet loss: r = C · RTT^2 / (2 · MSS), where C is the capacity of the link

Path                 Bandwidth   RTT (ms)   MTU (Byte)   Time to recover
LAN                  10 Gb/s     1          1500         430 ms
Geneva-Chicago       10 Gb/s     120        1500         1 hr 32 min
Geneva-Los Angeles   1 Gb/s      180        1500         23 min
Geneva-Los Angeles   10 Gb/s     180        1500         3 hr 51 min
Geneva-Los Angeles   10 Gb/s     180        9000         38 min
Geneva-Los Angeles   10 Gb/s     180        64k (TSO)    5 min
Geneva-Tokyo         1 Gb/s      300        1500         1 hr 04 min

- Large MTU accelerates the growth of the window
- Time to recover from a packet loss decreases with large MTU
- Larger MTU reduces overhead per frame (saves CPU cycles, reduces the number of packets)
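The recovery-time expression on this slide, read as r = C · RTT² / (2 · MSS), can be evaluated directly. The sketch below is a hedged illustration: the function name is hypothetical, and the MSS is assumed to be the MTU minus 40 Bytes of IPv4 and TCP headers, which the slide does not state.

```python
def recovery_time_s(capacity_bps: float, rtt_s: float, mtu_bytes: int,
                    header_bytes: int = 40) -> float:
    """Time to recover from one loss: r = C * RTT^2 / (2 * MSS), with MSS in bits."""
    mss_bits = (mtu_bytes - header_bytes) * 8  # assumption: MSS = MTU - 40 Bytes
    return capacity_bps * rtt_s ** 2 / (2 * mss_bits)


lan_s = recovery_time_s(10e9, 0.001, 1500)          # LAN row
la_1g_s = recovery_time_s(1e9, 0.180, 1500)         # Geneva-Los Angeles, 1 Gb/s
la_10g_s = recovery_time_s(10e9, 0.180, 1500)       # Geneva-Los Angeles, 10 Gb/s
la_10g_jumbo_s = recovery_time_s(10e9, 0.180, 9000) # same path, 9000 Byte MTU

print(f"LAN: {lan_s * 1000:.0f} ms")
print(f"Geneva-Los Angeles 1 Gb/s: {la_1g_s / 60:.0f} min")
print(f"Geneva-Los Angeles 10 Gb/s: {la_10g_s / 3600:.2f} h")
print(f"Geneva-Los Angeles 10 Gb/s, 9000 Byte MTU: {la_10g_jumbo_s / 60:.0f} min")
```

Because the recovery time grows with the square of the RTT and shrinks linearly with the MSS, the jump from a 1500 to a 9000 Byte MTU cuts it by roughly a factor of six on the same path.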
33 Single TCP stream between Caltech and CERN
- Available (PCI-X) Bandwidth = 8.5 Gbps
- RTT = 250 ms (16000 km)
- 9000 Byte MTU
- 15 min to increase throughput from 3 to 6 Gbps
- Sending station
  - Tyan S2882 motherboard, 2x Opteron 2.4 GHz, 2 GB DDR.
- Receiving station
  - CERN OpenLab: HP rx4640, 4x 1.5GHz Itanium-2, zx1 chipset, 8GB memory
- Network adapter
  - S2IO 10 GbE
[Figure annotations: CPU load 100%, Single packet loss, Burst of packet losses]
34 High Throughput Disk to Disk Transfers: From 0.1 to 1 GByte/sec
- Server Hardware (Rather than Network) Bottlenecks
  - Write/read and transmit tasks share the same limited resources: CPU, PCI-X bus, memory, IO chipset
  - PCI-X bus bandwidth: 8.5 Gbps (133 MHz x 64 bit)
- Link aggregation (802.3ad): Logical interface with two physical interfaces on two independent PCI-X buses.
  - LAN test: 11.1 Gbps (memory to memory)
Performance in this range (from 100 MByte/sec up to 1 GByte/sec) is required to build a responsive Grid-based Processing and Analysis System for LHC
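The PCI-X figure quoted on this slide is the theoretical peak of a 64 bit bus clocked at 133 MHz. The short sketch below, with a hypothetical helper name, reproduces it and compares it with the 1 GByte/sec target; real buses deliver less than this peak because of protocol overhead.

```python
def bus_peak_gbps(clock_mhz: float, width_bits: int) -> float:
    """Theoretical peak bandwidth of a parallel bus: one transfer per clock cycle."""
    return clock_mhz * 1e6 * width_bits / 1e9


pci_x_gbps = bus_peak_gbps(133, 64)

# 1 GByte/sec expressed in Gbps, for comparison with the bus peak.
target_gbps = 1.0 * 8

print(f"PCI-X peak: {pci_x_gbps:.2f} Gbps, target: {target_gbps:.1f} Gbps")
```

A single bus therefore leaves very little headroom above 8 Gbps, which is consistent with the slide's use of two independent PCI-X buses for link aggregation.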
35 Transferring a TB from Caltech to CERN in 64-bit MS Windows
- Latest disk to disk over 10Gbps WAN: 4.3 Gbits/sec (536 MB/sec), 8 TCP streams from CERN to Caltech, 1TB file
- 3 Supermicro Marvell SATA disk controllers + 24 7200rpm SATA disks
  - Local Disk IO: 9.6 Gbits/sec (1.2 GBytes/sec read/write, with <20% CPU utilization)
- S2io SR 10GE NIC
  - 10 GE NIC: 7.5 Gbits/sec (memory-to-memory, with 52% CPU utilization)
  - 2*10 GE NIC (802.3ad link aggregation): 11.1 Gbits/sec (memory-to-memory)
  - Memory to Memory WAN data flow, and local Memory to Disk read/write flow, are not matched when combining the two operations
- Quad Opteron AMD848 2.2GHz processors with 3 AMD-8131 chipsets: 4 64-bit/133MHz PCI-X slots.
- Interrupt Affinity Filter: allows a user to change the CPU-affinity of the interrupts in a system.
- Overcome packet loss with re-connect logic.
- Proposed Internet2 Terabyte File Transfer Benchmark
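To put the disk to disk rate above in perspective, the sketch below estimates how long a 1 TB file takes at 4.3 Gbits/sec. It assumes a decimal terabyte (10^12 Bytes) and a constant transfer rate, neither of which is stated on the slide; the helper name is hypothetical.

```python
def transfer_time_s(size_bytes: float, rate_bps: float) -> float:
    """Time to move `size_bytes` at a constant rate of `rate_bps` bits per second."""
    return size_bytes * 8 / rate_bps


TERABYTE = 1e12        # assumption: decimal terabyte
RATE_BPS = 4.3e9       # disk to disk WAN rate quoted on the slide

rate_mbytes_per_s = RATE_BPS / 8 / 1e6
duration_s = transfer_time_s(TERABYTE, RATE_BPS)

print(f"{rate_mbytes_per_s:.1f} MB/s, 1 TB in {duration_s / 60:.0f} min")
```

Under these assumptions the rate is about 537.5 MB/s, close to the 536 MB/sec quoted, and the full terabyte moves in roughly 31 minutes.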