Transcript and Presenter's Notes

Title: A Protocol for High-Speed Transport Over Dedicated Channels


1
A Protocol for High-Speed Transport Over Dedicated Channels
Qishi Wu, Nagi Rao
Computer Science and Mathematics Division, Oak Ridge National Laboratory
PFLDnet 2005: Third International Workshop on Protocols for Fast Long-Distance Networks
February 3-4, 2005, Lyon, France
Sponsored by U.S. Department of Energy, National Science Foundation, Defense Advanced Research Projects Agency
2
Outline of Presentation
  • Motivation and Introduction
  • Dedicated Channels
  • Channel Provisioning
  • ORNL-Atlanta-ORNL Connection
  • Cray X1 ORNL-NCSU Connection
  • Transport Protocols
  • Data and File Transfers
  • Stable Control Streams

3
Introduction
  • Special Networks or Policies Provide Dedicated
    Bandwidth Channels
  • NSF CHEETAH, DRAGON, OMNINET, UCLP, DOE
    UltraScienceNet, others
  • Dedicated channel provisioning
  • By hardware configuration
  • By policy on shared connections
  • DOE UltraScienceNet: motivated in part by
    large-science applications
  • High unimpeded bandwidth for data and file
    transfers
  • Stable channels for visualization, computational
    steering and control
  • Transport protocols might be easier to design and
    deploy (?)
  • Congestion control on the channel is obviated
  • High throughputs may be quickly attained and
    sustained
  • TCP underflow problems may be avoided

4
DOE UltraScience Net
  • The Need
  • DOE large-scale science applications on
    supercomputers and experimental facilities
    require high-performance networking
  • Petabyte data sets, collaborative visualization
    and computational steering
  • Application areas span the disciplinary spectrum:
    high energy physics, climate, astrophysics,
    fusion energy, genomics, and others
  • Promising Solution
  • High-bandwidth, agile network capable of
    providing on-demand dedicated channels: multiple
    10s of Gbps down to 150 Mbps
  • Protocols are simpler for high throughput and
    control channels
  • Challenges: several technologies need to be
    (fully) developed
  • User-/application-driven agile control plane
  • Dynamic scheduling and provisioning
  • Security: encryption, authentication,
    authorization
  • Protocols, middleware, and applications optimized
    for dedicated channels

Contacts: Bill Wing (wrw@ornl.gov), Nagi Rao (raons@ornl.gov)
5
DOE UltraScience Net
  • Connects ORNL, Chicago, Seattle and Sunnyvale
  • Dynamically provisioned dedicated dual 10Gbps
    SONET links
  • Proximity to several DOE locations: SNS, NLCF,
    FNL, ANL, NERSC
  • Peering with ESnet, NSF CHEETAH and other
    networks

Data Plane User Connections
  • Direct connections to core switches (SONET, 10GigE channels)
  • MSPP Ethernet channels
  • Utilize UltraScience Net hosts
Funded by U.S. DOE High-Performance Networking Program
at Oak Ridge National Laboratory ($4.5M for 3 years)
6
Control-Plane
  • Phase I
  • Centralized VPN connectivity
  • TL1-based communication with core switches and
    MSPPs
  • User access via centralized web-based scheduler
  • Phase II
  • GMPLS direct enhancements and wrappers for TL1
  • User access via GMPLS and web to bandwidth
    scheduler
  • Inter-domain GMPLS-based interface

Bandwidth Scheduler
  • Computes a path with the target bandwidth
  • Is bandwidth available now?
  • Extension of Dijkstra's algorithm (see the sketch
    below)
  • Provide all available slots
  • Extension of a closed-semiring structure to
    sequences of reals
  • Both are polynomial-time algorithms
  • GMPLS does not have this capability
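
A minimal sketch of the "Is bandwidth available now?" check, assuming the scheduler keeps, per link, the currently unreserved bandwidth. The Dijkstra extension shown maximizes the bottleneck (minimum residual) bandwidth along a path; the graph representation, function name, and Gbps units are illustrative assumptions, not the scheduler's actual API, and the all-available-slots variant over sequences of reals is not shown.

```python
import heapq

def max_bottleneck_bandwidth(graph, src, dst):
    """Dijkstra variant: find the largest b such that some src-dst path
    has at least b available on every link.
    graph: {node: [(neighbor, available_bandwidth_gbps), ...]} (assumed)."""
    best = {src: float("inf")}           # widest bottleneck found so far
    heap = [(-best[src], src)]           # max-heap via negated widths
    while heap:
        width, u = heapq.heappop(heap)
        width = -width
        if u == dst:
            return width                 # first pop of dst is optimal
        if width < best.get(u, 0.0):
            continue                     # stale heap entry
        for v, bw in graph.get(u, []):
            w = min(width, bw)           # bottleneck through link (u, v)
            if w > best.get(v, 0.0):
                best[v] = w
                heapq.heappush(heap, (-w, v))
    return 0.0                           # dst unreachable

# A request for b Gbps between src and dst is feasible "now" iff
# max_bottleneck_bandwidth(graph, src, dst) >= b.
```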

Web-based User Interface and API
  • Allows users to logon to website
  • Request dedicated circuits
  • Based on cgi scripts

7
Motivation for this work
  • Understand End-to-End Properties
  • Application-level performance
  • Channel properties
  • Host aspects: NIC, kernel, application
  • Experimental Protocol Performance
  • ORNL-Atlanta-ORNL connection
  • Cray X1 ORNL-NCSU connection
  • Limited Scope
  • ORNL-Atlanta-ORNL Connection
  • 1Gbps, 500-mile hybrid channel
  • Dedicated hosts
  • Cray X1 ORNL-NCSU connection
  • 400 Mbps IP connection by policy
  • Preliminary experimental results

8
Protocol Considerations for Dedicated Channels
Primary goal: channel utilization
Particularly absent are the familiar protocol considerations:
  • Fairness: no one else on the link
  • Friendliness: no other protocol
  • Responsiveness: fixed bandwidth
9
ORNL-Atlanta-ORNL 1Gbps Channel
(Diagram: a Dell dual Xeon 3.2GHz host and a dual Opteron 2.2 GHz host connect over GigE to the GigE blade of the Juniper M160 router at ORNL; its SONET blade carries the OC192 ORNL-ATL link to the SONET blade of the Juniper M160 router at Atlanta, where the traffic is IP-looped back.)
  • Host to Router
  • Dedicated 1GigE NIC
  • ORNL Router
  • Filter-based forwarding to override routing at both
    input and middle queues and to disable other traffic
    to the GigE interfaces
  • IP packets on both GigE interfaces are forwarded
    to the outgoing SONET port
  • Atlanta-SOX router
  • Default IP loopback
  • Only 1Gbps of the OC192 link is used for production
    traffic; 9Gbps of spare capacity

10
1Gbps Dedicated IP Channel
(Diagram: same ORNL-Atlanta-ORNL path as above: host GigE links to the Juniper M160 at ORNL, OC192 SONET to the Juniper M160 at Atlanta, and IP loopback.)
  • Non-Uniform Physical Channel
  • GigE - SONET - GigE
  • 500 network miles
  • End-to-End IP Path
  • Both GigE links are dedicated to the channel
    (layer-2 rate control)
  • Other host traffic is handled through second NIC
  • Routers, OC192 and hosts are lightly loaded
  • IP-based Applications and Protocols are readily
    executed

11
Dedicated Hosts
  • Hosts
  • Linux 2.4 kernel (Redhat, Suse)
  • Two NICs
  • Optical connection to the Juniper M160 router
  • Copper connection to the Ethernet switch/router
  • Disks: RAID 0, dual disks (140GB SCSI)
  • XFS file system
  • Peak disk data rate is 1.2Gbps (IOzone
    measurements)
  • Disk is not a bottleneck for 1Gbps data rates

12
Channel Throughput profile
  • Plot of receiving rate as a function of sending
    rate
  • Its precise interpretation depends on
  • Sending and receiving mechanisms
  • Definition of rates
  • For protocol optimizations, it is important to
    use the protocol's own sending mechanism to generate
    the profile
  • Window-based sending process for UDP datagrams
    (see the sketch below)
  • Send a window of datagrams back-to-back in one step
  • Wait for a time called the idle-time or wait-time
  • The sending rate is set by the window size and the
    idle-time, at that time resolution
  • This is an ad hoc mechanism facilitated by 1GigE
    NIC rate-control
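
A minimal sketch of the window/idle-time sending process, assuming plain UDP sockets; the function name, default parameter values, and the lack of sequence numbers or loss recovery are simplifications for illustration, not the measurement code used for the profiles.

```python
import socket
import time

def send_window_paced(chunks, dest, window_size=50, idle_time=0.001):
    """Blast one window of UDP datagrams back-to-back, sleep for the
    idle-time, and repeat; the sending rate is roughly
    window_size * datagram_bytes * 8 / (burst_duration + idle_time)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    it = iter(chunks)                    # chunks: iterable of datagram payloads
    exhausted = False
    while not exhausted:
        for _ in range(window_size):     # one window, sent in one step
            try:
                sock.sendto(next(it), dest)
            except StopIteration:
                exhausted = True
                break
        time.sleep(idle_time)            # idle-time (wait-time) between windows
    sock.close()

# Example: send_window_paced(payloads, ("10.0.0.2", 5001),
#                            window_size=40, idle_time=0.0005)
```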

13
UDP goodput and loss profile
High goodput is received at non-trivial loss
Goodput plateau: 990Mbps
Non-zero and random loss rate
(3-D plot: each point in the horizontal plane is a sending-rate setting)
14
Unimodality of goodput with respect to loss rate
(Plots: channel goodput vs. sending rate, and protocol throughput vs. loss rate)
  • Measurements
  • Goodput is unimodal with respect to
  • Sending rate
  • Loss rate

Assumption: if the loss rate is non-decreasing with
respect to the sending rate, then goodput is unimodal,
with its maximum at a single sending rate
  • Important Note
  • Connection profile is in general different from
    the profile of a protocol
  • Higher loss rates typically consume more host
    resources and result in lower throughput

15
1GigE NICS Act as Layer-2 Rate Controllers
(Diagram: on the host, data moves from the application buffer to the kernel buffer at rates that could exceed 1Gbps; the GigE NIC rate-limits the flow to 1Gbps toward the Juniper M160, whose port is likewise limited to 1Gbps.)
  • Our window-based method
  • Flow rate from application to NIC is ON/OFF and
    exceeds 1Gbps at times
  • Flow is regulated to 1Gbps: the NIC rate matches the
    link rates
  • This method does not work well if the NIC rate is
    higher than the link rate or router port rate
  • The NIC may send at a higher rate, causing losses at
    the router port

16
Throughput Profile for Internet Connection
ORNL-LSU
(Plots: ORNL-LSU throughput profiles on Christmas day and on a typical day)
Compared to the dedicated link, this connection has
higher and more random losses
17
Best Performance of Existing Protocols
Disk-to-disk transfers (unet2 to unet1), after some
tuning, and memory-to-memory transfers:
  • UDT: 958Mbps
  • Both Iperf and the throughput profiles indicated
    990 Mbps levels
Potentially such rates are achievable if disk access and
protocol parameters are tuned
  - these measurements are stable within 1
  - not much changes on the link and hosts
18
Hurricane Protocol
  • Composed based on principles and experiences with
    UDT and SABUL
  • It was not easy for us to figure out all the tweaks
    for pushing peak performance beyond 900 Mbps
  • These are HAPs to start with
  • 900 Mbps leads to HHAPs
  • Beyond 900 Mbps they need to be SHAPs
  • UDP window-based flow control
  • Nothing fundamentally new, but needed for fine
    tuning
  • 990 Mbps on dedicated 1Gbps connection,
    disk-to-disk
  • No attempt at congestion control

HAP: human-assisted protocol; HHAP: highly
human-assisted protocol; SHAP: super human-assisted
protocol
19
Hurricane Control Structure
(Diagram: the sender reads datagrams from disk and sends them over UDP; the receiver buffers and reorders datagrams and writes them to disk; missing datagrams are reported as groups of k NACKs over a TCP control channel, and the sender reloads and resends the lost datagrams.)
Different subtasks are handled by threads, which are
woken up on demand. Thread invocations are reduced by
clustered NACKs instead of individual ACKs (see the
sketch below).
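
A minimal sketch of the clustered-NACK reporting over the TCP control channel and the seek-based reload of lost datagrams, assuming fixed-size datagrams addressed by sequence number. The group size, message format, and function names are assumptions for illustration, not Hurricane's actual wire format.

```python
import struct

DATAGRAM_SIZE = 1400        # assumed fixed payload size per sequence number

def send_nack_group(tcp_ctrl, missing, k=8):
    """Receiver side: once k sequence numbers are missing, report them in a
    single NACK message so the sender's NACK thread wakes up once per group."""
    while len(missing) >= k:
        group = [missing.pop(0) for _ in range(k)]
        msg = struct.pack("!H", len(group)) + struct.pack("!%dQ" % len(group), *group)
        tcp_ctrl.sendall(msg)

def resend_lost(udp_sock, dest, data_path, nack_group):
    """Sender side: reload each lost datagram from disk with a seek to its
    offset and resend it, attached to the next outgoing batch."""
    with open(data_path, "rb") as f:
        for seq in nack_group:
            f.seek(seq * DATAGRAM_SIZE)             # fseek to the lost block
            payload = f.read(DATAGRAM_SIZE)
            udp_sock.sendto(struct.pack("!Q", seq) + payload, dest)
```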
20
Ad Hoc Optimizations
  • Manual tuning of parameters
  • Wait-time parameter
  • Initial value chosen from the throughput profile:
    900 Mbps
  • Empirically, goodput is unimodal in the wait-time;
    pairwise measurements for a binary search
    (see the sketch below): 960 Mbps
  • Group size k for NACKs
  • Empirically, goodput is unimodal in k and is
    tuned: 993 Mbps
  • Disk-specific details
  • Reads are done in batches; no input buffer
  • NACKs are handled using fseek, attached to the
    next batch
  • This tuning is not likely to be transferable to
    other configurations and different host loads
  • More work is needed: automatic tuning and systematic
    analysis
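
The slide mentions pairwise measurements and a binary-style search over the unimodal goodput; one standard way to realize this, sketched below under the assumption that measurement noise is small relative to the goodput differences, is a ternary search that compares a pair of measurements per iteration. measure_goodput is a hypothetical stand-in for running a timed transfer at a given wait-time.

```python
def tune_wait_time(measure_goodput, lo, hi, iterations=20):
    """Search for the wait-time that maximizes goodput, assuming goodput
    is unimodal in the wait-time; each iteration takes a pair of
    measurements and discards a third of the remaining interval."""
    for _ in range(iterations):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if measure_goodput(m1) < measure_goodput(m2):
            lo = m1                      # the maximum lies to the right of m1
        else:
            hi = m2                      # the maximum lies to the left of m2
    return (lo + hi) / 2.0
```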

21
Transport Throughput Stabilization at Target Rate
  • Niche application requirement: provide stable
    throughput at a target rate - typically much
    below peak bandwidth
  • Commands for computational steering and
    visualization
  • Control loops for remote instrumentation
  • TCP AIMD is not suited for stable throughput
  • Complicated dynamics
  • Underflows with sustained traffic

22
Adaptation of source rate
(Diagram: adaptation loop driving a noisy goodput estimate toward the target throughput)
  • Adjust the window size
  • Adjust the idle-time
  • Both are special cases of the classical
    Robbins-Monro method (see the sketch below)

Performance guarantee: convergence to the target
throughput is proved under the Robbins-Monro
monotonicity assumptions
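
A minimal sketch of a Robbins-Monro style adaptation of the idle-time toward a target goodput (adapting the window size is analogous). The step-size schedule, units, and measure_goodput hook are illustrative assumptions, not the protocol's actual update rule.

```python
def stabilize_at_target(measure_goodput, target, idle_time, steps=200, a0=1e-4):
    """Stochastic approximation: after each noisy goodput measurement,
    move the idle-time against the error with a diminishing step a0/n.
    A positive error (goodput above target) increases the idle-time,
    which lowers the sending rate."""
    for n in range(1, steps + 1):
        error = measure_goodput(idle_time) - target   # noisy estimate of g - g*
        idle_time = max(0.0, idle_time + (a0 / n) * error)
    return idle_time
```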
23
Stabilization at Target Goodput Levels
(Plots: goodput stabilized at target levels of 1 Mbps and 10 Mbps)
24
Cray X1 ORNL-NCSU Datapath
  • The following constitute the connection
  • Internal Data Path of Cray X1
  • LAN-WAN-LAN path from Cray X1 to head node of a
    cluster at NCSU
  • Data path consists of heterogeneous parts
  • The impedance between them must be optimally
    matched

(Diagram: the internal path runs from the Cray nodes through the CNS to the network edge of the Cray X1; the wide-area path is the conventional GigE-SONET-GigE IP path from the network edge to the end host.)
25
IP Datapath Inside Cray X1
(Diagram: Cray OS nodes connect over the SPC channel to the crossconnect and conversion subsystem, which links over 1 Gbps FC, shared with FC disk storage, to the Cray Network Subsystem (CNS) and its 1GigE LAN connection.)
  • Data path is complicated
  • IP connectivity from a Cray OS node to its Ethernet
    NIC consists of multiple segments
  • System Port Connect (SPC) channel: OS node to
    Crossconnect and Conversion Subsystem (CCS)
  • FiberChannel: CCS to Cray Network Subsystem (CNS)
  • 1GigE connection to the external network

26
Internal Data Paths from Compute Nodes
Cray X1 has two types of nodes: OS nodes, which
implement the IP stack, and application nodes, where
computation usually takes place. The OS is UNICOS, not
Linux. IP services at application nodes are implemented
using thread migration to the OS nodes; this step creates
additional dynamics, delays, and losses.
(Diagram: inside the Cray X1, compute nodes and service nodes connect over FC through the cross-connect and FC conversion; the network path leads to the CNS, which is shared by all user connections, and the disk path leads to disk.)
Side note: cluster-based machines have some designated
nodes with NICs attached
27
Experimental HAP Results: Cray X1 ORNL-NCSU Connection
Currently we are conducting experiments to obtain
error bars; the CNS is shared
28
Conclusions
  • Preliminary experimental results
  • Protocols for hybrid channels need impedance
    matching
  • Host issues are significant
  • disk and process scheduling
  • host internal data paths
  • Protocols need to be carefully tuned; still HAPs
  • Topics for further investigation
  • Provisioning: impedance matching
  • Rate controls at application, host and NIC
  • Host aspects
  • Dedicated hosts vs. shared hosts: applications
    are time-shared
  • Disk and file systems: multiple-Gbps striped
    streams
  • Protocols
  • Automatic parameter tuning
  • Systematic and analytical study and design

29
Thank You
30
Hurricane