Title: A Protocol for High-Speed Transport Over Dedicated Channels
Slide 1: A Protocol for High-Speed Transport Over Dedicated Channels
Qishi Wu, Nagi Rao
Computer Science and Mathematics Division, Oak Ridge National Laboratory
PFLDnet 2005: Third International Workshop on Protocols for Fast Long-Distance Networks
February 3-4, 2005, Lyon, France
Sponsored by: U.S. Department of Energy, National Science Foundation, Defense Advanced Research Projects Agency
Slide 2: Outline of Presentation
- Motivation and Introduction
- Dedicated Channels
- Channel Provisioning
- ORNL-Atlanta-ORNL Connection
- Cray X1 ORNL-NCSU Connection
- Transport Protocols
- Data and File Transfers
- Stable Control Streams
Slide 3: Introduction
- Special networks or policies provide dedicated bandwidth channels
  - NSF CHEETAH, DRAGON, OMNINET, UCLP, DOE UltraScienceNet, and others
- Dedicated channel provisioning
  - By hardware configuration
  - By policy on shared connections
- DOE UltraScienceNet is motivated in part by large-science applications
  - High unimpeded bandwidth for data and file transfers
  - Stable channels for visualization, computational steering, and control
- Transport protocols might be easier to design and deploy (?)
  - Congestion control on the channel is obviated
  - High throughputs may be quickly attained and sustained
  - TCP underflow problems may be avoided
Slide 4: DOE UltraScience Net
- The Need
  - DOE large-scale science applications on supercomputers and experimental facilities require high-performance networking
  - Petabyte data sets, collaborative visualization, and computational steering
  - Application areas span the disciplinary spectrum: high energy physics, climate, astrophysics, fusion energy, genomics, and others
- Promising Solution
  - A high-bandwidth, agile network capable of providing on-demand dedicated channels, from 150 Mbps up to multiple 10s of Gbps
  - Protocols are simpler for high-throughput and control channels
- Challenges: several technologies need to be (fully) developed
  - User-/application-driven agile control plane
  - Dynamic scheduling and provisioning
  - Security: encryption, authentication, authorization
  - Protocols, middleware, and applications optimized for dedicated channels

Contacts: Bill Wing (wrw_at_ornl.gov), Nagi Rao (raons_at_ornl.gov)
Slide 5: DOE UltraScience Net
- Connects ORNL, Chicago, Seattle, and Sunnyvale
- Dynamically provisioned, dedicated dual 10Gbps SONET links
- Proximity to several DOE locations: SNS, NLCF, FNAL, ANL, NERSC
- Peering with ESnet, NSF CHEETAH, and other networks

Data Plane User Connections
- Direct connections to core switches: SONET and 10GigE channels
- MSPP Ethernet channels
- Utilize UltraScience Net hosts

Funded by the U.S. DOE High-Performance Networking Program at Oak Ridge National Laboratory: $4.5M for 3 years
Slide 6: Control-Plane
- Phase I
  - Centralized VPN connectivity
  - TL1-based communication with core switches and MSPPs
  - User access via a centralized web-based scheduler
- Phase II
  - Direct GMPLS enhancements and wrappers for TL1
  - User access via GMPLS and web to the bandwidth scheduler
  - Inter-domain GMPLS-based interface

Bandwidth Scheduler
- Computes a path with the target bandwidth
  - Is bandwidth available now? Extension of Dijkstra's algorithm
  - Provide all available slots: extension of the closed-semiring structure to sequences of reals
- Both are polynomial-time algorithms
- GMPLS does not have this capability
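The "is bandwidth available now?" step can be sketched as a Dijkstra variant that maximizes the bottleneck (available) bandwidth along a path. This is an illustrative reconstruction under assumed names (`widest_path`, a dict-of-dicts graph), not the scheduler's actual code:

```python
import heapq

def widest_path(graph, src, dst):
    """Modified Dijkstra: find the maximum bottleneck bandwidth achievable
    on any path from src to dst. graph[u] = {v: available_bandwidth, ...}."""
    best = {src: float('inf')}          # best bottleneck bandwidth to each node
    heap = [(-best[src], src)]          # max-heap via negated values
    while heap:
        bw, u = heapq.heappop(heap)
        bw = -bw
        if u == dst:
            return bw                   # first pop of dst is optimal
        if bw < best.get(u, 0):
            continue                    # stale heap entry
        for v, link_bw in graph.get(u, {}).items():
            cand = min(bw, link_bw)     # path bottleneck through u
            if cand > best.get(v, 0):
                best[v] = cand
                heapq.heappush(heap, (-cand, v))
    return 0.0                          # no path: no bandwidth available
```

A request for bandwidth B "now" succeeds when `widest_path(...) >= B`; the slot-enumeration variant extends the same relaxation to sequences of time intervals.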
Web-based User Interface and API
- Allows users to log on to a website and request dedicated circuits
- Based on CGI scripts
Slide 7: Motivation for This Work
- Understand end-to-end properties
  - Application-level performance
  - Channel properties
  - Host aspects: NIC, kernel, application
- Experimental protocol performance
  - ORNL-Atlanta-ORNL connection
  - Cray X1 ORNL-NCSU connection
- Limited scope
  - ORNL-Atlanta-ORNL connection: 1Gbps, 500-mile hybrid channel; dedicated hosts
  - Cray X1 ORNL-NCSU connection: 400 Mbps IP connection by policy
- Preliminary experimental results
Slide 8: Protocol Considerations for Dedicated Channels
The primary goal is channel utilization. Particularly absent are the familiar protocol considerations:
- Fairness: no one else is on the link
- Friendliness: there is no other protocol
- Responsiveness: the bandwidth is fixed
Slide 9: ORNL-Atlanta-ORNL 1Gbps Channel

[Diagram: a Dell dual Xeon 3.2GHz host and a dual Opteron 2.2GHz host connect via GigE to GigE and SONET blades on a Juniper M160 router at ORNL; an OC192 ORNL-ATL link runs to a SONET blade on a Juniper M160 router at Atlanta, where an IP loop returns the traffic.]

- Host to router
  - Dedicated 1GigE NIC
- ORNL router
  - Filter-based forwarding to override both the input and middle queues, and to disable other traffic to the GigE interfaces
  - IP packets on both GigE interfaces are forwarded to the outgoing SONET port
- Atlanta SOX router
  - Default IP loopback
  - Only 1Gbps of the OC192 link is used for production traffic (9Gbps spare capacity)
Slide 10: 1Gbps Dedicated IP Channel

[Diagram: the same host/GigE/SONET path as Slide 9, with the IP loopback at the Atlanta Juniper M160 router.]

- Non-uniform physical channel
  - GigE / SONET / GigE
  - 500 network miles
- End-to-end IP path
  - Both GigE links are dedicated to the channel: layer-2 rate control
  - Other host traffic is handled through a second NIC
  - Routers, the OC192, and hosts are lightly loaded
- IP-based applications and protocols are readily executed
Slide 11: Dedicated Hosts
- Hosts
  - Linux 2.4 kernel (Red Hat, SuSE)
  - Two NICs
    - Optical connection to the Juniper M160 router
    - Copper connection to an Ethernet switch/router
- Disks: RAID 0, dual disks (140GB SCSI)
  - XFS file system
  - Peak disk data rate is 1.2Gbps (IOzone measurements)
  - Disk is not a bottleneck for 1Gbps data rates
Slide 12: Channel Throughput Profile
- Plot of receiving rate as a function of sending rate
- Its precise interpretation depends on
  - The sending and receiving mechanisms
  - The definition of the rates
- For protocol optimization, it is important that a protocol use its own sending mechanism to generate the profile
- Window-based sending process for UDP datagrams
  - Send a window of datagrams in one step
  - Wait for a period called the idle-time or wait-time
  - The sending rate is set by the window size and wait-time, at the timer's resolution
- This is an ad hoc mechanism facilitated by 1GigE NIC rate control
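The window-based sending process above can be sketched as follows. This is a minimal illustration; the function name, datagram size, window, and wait-time values are placeholders, not the tuned parameters from the experiments:

```python
import socket
import time

def send_window_based(data_chunks, addr, window=64, wait_time=0.001):
    """Window-based UDP sending: emit `window` datagrams back to back,
    then sleep for `wait_time` (the idle-time). The average sending rate
    is roughly window * datagram_size / wait_time."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    it = iter(data_chunks)
    done = False
    while not done:
        for _ in range(window):         # one step: a window of datagrams
            chunk = next(it, None)
            if chunk is None:
                done = True
                break
            sock.sendto(chunk, addr)
        if not done:
            time.sleep(wait_time)       # idle-time between windows
    sock.close()
```

Varying `window` and `wait_time` over a grid and recording the receiver-side goodput yields the throughput profile described above.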
Slide 13: UDP Goodput and Loss Profile

[Plots: goodput and loss rate surfaces; each measurement corresponds to a point in the horizontal plane of sending parameters.]

- High goodput is achieved at non-trivial loss
- Goodput plateau: 990Mbps
- Non-zero and random loss rate
Slide 14: Unimodality of Goodput with Respect to Loss Rate

[Plots: channel goodput vs. sending rate; protocol throughput vs. loss rate.]

- Measurements
  - Goodput is unimodal with respect to the sending rate and the loss rate
- Assumption: the loss rate is non-decreasing with respect to the sending rate; goodput is then unimodal, with a single maximum
- Important note
  - The connection profile is in general different from the profile of a protocol
  - Higher loss rates typically consume more host resources and result in lower throughput
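The unimodality argument can be stated compactly. A sketch, with notation introduced here (sending rate $r$, loss rate $\ell(r)$, goodput $g$):

```latex
g(r) = r\,\bigl(1 - \ell(r)\bigr)
```

If $\ell(\cdot)$ is non-decreasing in $r$, then $g$ grows essentially linearly while $\ell(r) \approx 0$ and, assuming differentiability, decreases once $r\,\ell'(r) > 1 - \ell(r)$, so $g$ has a single maximum at some sending rate $r^{*}$.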
Slide 15: 1GigE NICs Act as Layer-2 Rate Controllers

[Diagram: on the host, data flows from the application buffer to the kernel buffer to the GigE NIC; the rate into the NIC can exceed 1Gbps, but the NIC and the Juniper M160 port are each rate-limited to 1Gbps.]

- Our window-based method
  - The flow rate from application to NIC is ON/OFF and exceeds 1Gbps at times
  - The flow is regulated to 1Gbps: the NIC rate matches the link rates
- This method does not work well if the NIC rate is higher than the link rate or the router port rate
  - The NIC may send at a higher rate, causing losses at the router port
Slide 16: Throughput Profile for an Internet Connection (ORNL-LSU)

[Plots: throughput profiles on Christmas day and on a typical day.]

- Compared to the dedicated link, this connection has higher and more random losses
Slide 17: Best Performance of Existing Protocols
- Disk-to-disk transfers (unet2 to unet1) after some tuning; memory-to-memory transfers
  - UDT: 958Mbps
- Both Iperf and the throughput profiles indicated 990 Mbps levels
- Potentially such rates are achievable if disk access and protocol parameters are tuned
- These measurements are stable within 1%
  - Not much changes on the link and hosts
Slide 18: Hurricane Protocol
- Composed based on principles of, and experience with, UDT and SABUL
  - It was not easy for us to figure out all the tweaks for pushing peak performance beyond 900 Mbps
  - These are HAPs to start with
  - 900 Mbps leads to HHAPs
  - Beyond 900 Mbps they need to be SHAPs
- UDP window-based flow control
  - Nothing fundamentally new, but needed for fine tuning
  - 990 Mbps on the dedicated 1Gbps connection, disk-to-disk
- No attempt at congestion control

HAP: human-assisted protocol; HHAP: highly human-assisted protocol; SHAP: super-human-assisted protocol
Slide 19: Hurricane Control Structure

[Diagram: the sender reads datagrams from disk and sends them over UDP; the receiver buffers and reorders datagrams before writing them to disk; lost datagrams are reloaded and resent in response to grouped (k) NACKs carried over a TCP control channel.]

Different subtasks are handled by threads, which are woken up on demand. Thread invocations are reduced by clustered NACKs instead of individual ACKs.
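The grouped-NACK idea can be sketched as below. `NackAggregator` and `send_fn` are hypothetical names for illustration, not Hurricane's actual interfaces:

```python
class NackAggregator:
    """Collect missing sequence numbers and flush them as one grouped
    NACK once k losses accumulate, reducing control-channel messages
    and receiver-thread wakeups to one per group of k."""

    def __init__(self, k, send_fn):
        self.k = k
        self.send_fn = send_fn   # e.g. writes the group over the TCP channel
        self.pending = []        # sequence numbers known to be lost
        self.expected = 0        # next in-order sequence number

    def on_datagram(self, seq):
        # A gap between the expected and received sequence numbers
        # marks the skipped datagrams as lost.
        if seq > self.expected:
            self.pending.extend(range(self.expected, seq))
        self.expected = max(self.expected, seq + 1)
        # Flush full groups of k losses as single NACK messages.
        while len(self.pending) >= self.k:
            group, self.pending = self.pending[:self.k], self.pending[self.k:]
            self.send_fn(group)
```

Tuning k trades control-message overhead against retransmission latency, which is why the group size appears as a tunable parameter in the next slide.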
Slide 20: Ad Hoc Optimizations
- Manual tuning of parameters
  - Wait-time parameter
    - Initial value chosen from the throughput profile: 900 Mbps
    - Empirically, goodput is unimodal in pairwise measurements, enabling a binary search: 960 Mbps
  - Group size k for NACKs
    - Empirically, goodput is unimodal in k and is tuned: 993 Mbps
  - Disk-specific details
    - Reads are done in batches; no input buffer
    - NACKs are handled using fseek, attached to the next batch
- This tuning is not likely to be transferable to other configurations and different host loads
- More work is needed: automatic tuning and systematic analysis
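Since goodput is empirically unimodal in each parameter, a bracketing search over a unimodal objective suffices. The sketch below uses a ternary search, which is an assumption: the slide describes a binary search over pairwise measurements, and real goodput measurements are noisy, which this toy ignores:

```python
def ternary_search_max(f, lo, hi, iters=60):
    """Maximize a unimodal function f on [lo, hi] by ternary search;
    here f would be measured goodput as a function of, e.g., wait-time."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if f(m1) < f(m2):
            lo = m1          # maximum lies in [m1, hi]
        else:
            hi = m2          # maximum lies in [lo, m2]
    return (lo + hi) / 2

# Synthetic unimodal "goodput" curve peaking at a wait-time of 2.0 (toy values)
peak = ternary_search_max(lambda t: -(t - 2.0) ** 2, 0.0, 10.0)
```

With noisy measurements, each f(x) would be an average over repeated transfers, which is part of why this tuning remains a manual, human-assisted step.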
Slide 21: Transport Throughput Stabilization at a Target Rate
- Niche application requirement: provide stable throughput at a target rate, typically much below the peak bandwidth
  - Commands for computational steering and visualization
  - Control loops for remote instrumentation
- TCP AIMD is not suited for stable throughput
  - Complicated dynamics
  - Underflows with sustained traffic
Slide 22: Adaptation of the Source Rate
- To reach the target throughput, using a noisy estimate of the achieved throughput:
  - Adjust the window size
  - Adjust the idle-time
- Both are special cases of the classical Robbins-Monro method
- Performance guarantee: convergence to the target throughput is proved under the Robbins-Monro monotonicity assumptions
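The Robbins-Monro iteration underlying both adjustments can be sketched generically. This is a toy illustration with a synthetic noisy channel, not the protocol's actual update rule:

```python
import random

def robbins_monro(measure, target, r0, steps=500):
    """Robbins-Monro stochastic approximation: adjust the source rate r
    so that the (noisy) measured throughput converges to the target.
    Gains a_n = 1/n satisfy the classical conditions sum a_n = inf and
    sum a_n^2 < inf; monotonicity of E[measure(r)] in r gives convergence."""
    r = r0
    for n in range(1, steps + 1):
        g = measure(r)                   # noisy goodput estimate
        r += (1.0 / n) * (target - g)    # step toward the target
    return r

# Synthetic channel: achieved throughput tracks the offered rate, plus noise
rate = robbins_monro(lambda r: r + random.gauss(0, 0.05), target=8.0, r0=1.0)
```

In the protocol, `r` would be the window size or the idle-time, and `measure` the receiver-reported goodput over the control channel.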
Slide 23: Stabilization at Target Goodput Levels

[Plots: goodput traces stabilized at 1 Mbps and 10 Mbps target levels.]
Slide 24: Cray X1 ORNL-NCSU Datapath
- The following constitute the connection
  - The internal data path of the Cray X1
  - The LAN-WAN-LAN path from the Cray X1 to the head node of a cluster at NCSU
- The data path consists of heterogeneous parts
  - The impedance between them must be optimally matched

[Diagram: Cray nodes inside the Cray X1 connect over an internal path to the CNS at the network edge, then over a conventional GigE-SONET-GigE IP wide-area path to the end host.]
Slide 25: IP Datapath Inside the Cray X1

[Diagram: Cray OS nodes connect over an SPC channel to a crossconnect-and-conversion subsystem, then over 1Gbps Fibre Channel to the Cray Network Subsystem (CNS), which provides the 1GigE LAN connection; FC disk storage also hangs off the crossconnect.]

- The data path is complicated
- IP connectivity from a Cray OS node to its Ethernet NIC consists of multiple segments
  - System Port Connect (SPC) channel: OS node to the Crossconnect and Conversion Subsystem (CCS)
  - Fibre Channel: CCS to the Cray Network Subsystem (CNS)
  - 1GigE connection to the external network
Slide 26: Internal Data Paths from Compute Nodes

The Cray X1 has two types of nodes: OS nodes, which implement the IP stack, and application nodes, where computation usually takes place. The OS is UNICOS, not Linux. IP services at application nodes are implemented using thread migration to the OS nodes; this step creates additional dynamics, delays, and loss interference.

[Diagram: compute nodes and service nodes inside the Cray X1 connect over FC links through a cross-connect and FC conversion to the CNS (network path) and to disk (disk path); the CNS is shared by all user connections.]

Side note: cluster-based machines have some designated nodes with NICs attached.
Slide 27: Experimental HAP Results on the Cray X1 ORNL-NCSU Connection

We are currently conducting experiments to obtain error bars; the CNS is shared.
Slide 28: Conclusions
- Preliminary experimental results
  - Protocols for hybrid channels need impedance matching
  - Host issues are significant: disk and process scheduling; host internal data paths
  - Protocols need to be carefully tuned: still HAP
- Topics for further investigation
  - Provisioning: impedance matching
  - Rate controls at the application, host, and NIC
  - Host aspects
    - Dedicated hosts vs. shared hosts: applications are time-shared
    - Disk and file systems: multiple-Gbps striped streams
  - Protocols
    - Automatic parameter tuning
    - Systematic and analytical study and design
Slide 29: Thank You
Slide 30: Hurricane