Storage Systems CSE 598d, Spring 2007
  • Lecture 15: Consistency Semantics, Introduction
    to Network-attached Storage
  • March 27, 2007

Agenda
  • Last class
  • Consistency models: brief overview
  • Next
  • More details on consistency models
  • Network storage introduction
  • NAS vs SAN
  • DAFS
  • Some relevant technology and systems innovations
  • FC, Smart NICs, RDMA,
  • A variety of topics on file systems (and other
    storage-related software)
  • Log-structured file systems
  • Databases and file systems compared
  • Mobile/poorly connected systems, highly
    distributed P2P storage
  • NFS, Google file system
  • Asynchronous I/O
  • Flash-based storage
  • Active disks, object-based storage devices (OSD)
  • Archival and secure storage

Problem Background and Definition
  • Consistency issues were first studied in the
    context of shared-memory multi-processors and we
    will start our discussion in the same context
  • Ideas generalize to any distributed system with
    shared storage
  • Memory consistency model (MCM) of an SMP provides
    a formal specification of how the memory system
    will appear to the programmer
  • Places restrictions on the values that can be
    returned by a read in a shared-memory program
  • An MCM is a contract between the memory and the
    programmer
  • Why different models?
  • Trade-offs involved between strictness of
    consistency guarantees, implementation efforts
    (hardware, compiler, programmer), and system
    performance

Atomic/Strict Consistency
  • Most intuitive, naturally appealing
  • Any read to a memory location x returns the value
    stored by the most recent write operation to x
  • Defined w.r.t. a global clock
  • That is the only way "most recent" can be defined
  • Uni-processors typically observe such consistency
  • A programmer on a uni-processor naturally assumes
    this behavior
  • E.g., as a programmer, one would not expect the
    following code segment to print 1 or any value
    other than 2
  • A = 1; A = 2; print(A)
  • Still possible for compiler and hardware to
    improve throughput by re-ordering instructions
  • Atomic consistency can be achieved as long as
    data and control dependencies are adhered to
  • Often considered a base model (for evaluating
    MCMs that we will see next)

Atomic/Strict Consistency
  • What happens on a multi-processor?
  • Even on the smallest and fastest multi-processor,
    global time cannot be achieved!
  • Achieving atomic consistency is not possible
  • But not a hindrance, since programmers manage
    quite well with something weaker than atomic
  • What behavior do we expect when we program on a
    multi-processor?
  • What we DO NOT expect: a global clock
  • What we expect:
  • Operations from a process will execute
    sequentially
  • Again, A = 1; A = 2; print(A) should not print 1
  • And then we can use critical section/mutual
    exclusion mechanisms to enforce desired order
    among instructions coming from different
    processes
  • So we expect an MCM less strict than atomic
    consistency. What is this consistency model, what
    are its properties, and what does the
    hardware/software (compiler) have to do to
    provide it?

Sequential Consistency
  • What we typically expect from a shared-memory
    multi-processor system is captured by sequential
    consistency
  • Lamport, 1979: a multi-processor is sequentially
    consistent if the result of any execution is the
    same as if
  • The operations of all the processors were
    executed in some sequential order
  • That is, memory accesses occur atomically w.r.t.
    other memory accesses
  • The operations of each individual processor
    appear in this sequence in the order specified by
    its program
  • Equivalently, any valid interleaving is
    acceptable as long as all processes see the same
    ordering of memory references
  • Programmer's view

Example: Sequential Consistency
P1: W(x)1
P2: W(y)2
P3: R(y)2 R(x)0 R(x)1
  • Not atomically consistent because
  • R(y)2 by P3 reads a value that has not been
    written yet
  • W(x)1 and W(y)2 appear commuted at P3
  • But sequentially consistent
  • SC does not have the notion of a global clock

Example: Sequential Consistency (contd)
  • What about?

P1: W(x)1
P2: W(y)2 R(y)2 R(x)0 R(x)1
P3: R(y)2 R(x)0 R(x)1

  • What about?

P1: W(x)1
P2: W(y)2 R(y)2 R(x)0 R(x)1
P3: R(x)1 R(y)0 R(y)2
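
The sequential-consistency examples can be checked mechanically. The sketch below is a brute-force search, not a method from the lecture: it tries every interleaving that respects each process's program order and accepts if every read returns the latest written value (initial value 0 is assumed). The three histories encode one reading of the flattened diagrams, with the extra reads in the "What about?" cases attributed to P2; that attribution is an assumption.

```python
def is_sequentially_consistent(history, initial=0):
    """Return True if some interleaving that keeps each process's
    program order makes every read return the latest write."""
    procs = list(history.values())
    seen = set()  # (positions, memory) states already explored

    def search(pos, mem):
        key = (pos, tuple(sorted(mem.items())))
        if key in seen:
            return False
        seen.add(key)
        if all(p == len(ops) for p, ops in zip(pos, procs)):
            return True
        for i, ops in enumerate(procs):
            if pos[i] == len(ops):
                continue
            kind, var, val = ops[pos[i]]
            nxt = pos[:i] + (pos[i] + 1,) + pos[i + 1:]
            if kind == "W":
                if search(nxt, {**mem, var: val}):
                    return True
            elif mem.get(var, initial) == val:
                # A read may be scheduled only if it sees this value now.
                if search(nxt, mem):
                    return True
        return False

    return search(tuple(0 for _ in procs), {})


reads_a = [("R", "y", 2), ("R", "x", 0), ("R", "x", 1)]
reads_b = [("R", "x", 1), ("R", "y", 0), ("R", "y", 2)]

first = {"P1": [("W", "x", 1)], "P2": [("W", "y", 2)], "P3": reads_a}
second = {"P1": [("W", "x", 1)], "P2": [("W", "y", 2)] + reads_a, "P3": reads_a}
third = {"P1": [("W", "x", 1)], "P2": [("W", "y", 2)] + reads_a, "P3": reads_b}

print(is_sequentially_consistent(first))   # True
print(is_sequentially_consistent(second))  # True
print(is_sequentially_consistent(third))   # False
```

Under this reading the last history fails because P2 places W(y)2 before W(x)1 while P3 places W(x)1 before W(y)2, so no single order satisfies both.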
Causal Consistency
  • Hutto and Ahamad, 1990
  • Each operation is either causally related or
    concurrent with another
  • When a processor performs a read followed later
    by a write, the two operations are said to be
    causally related because the value stored by the
    write may have been dependent upon the result of
    the read
  • A read operation is causally related to an
    earlier write that stored the data retrieved by
    the read
  • Transitivity applies
  • Operations that are not causally related are said
    to be concurrent.
  • A memory is causally consistent if all processors
    agree on the order of causally related writes
  • Weaker than SC, which requires all writes to be
    seen in the same order
  • P1: W(x)1 W(x)3
  • P2: R(x)1 W(x)2
  • P3: R(x)1 R(x)3 R(x)2
  • P4: R(x)1 R(x)2 R(x)3
  • W(x)1 and W(x)2 are causally related
  • W(x)2 and W(x)3 are not causally related!

Summary: Uniform MCMs
  • Atomic consistency
  • Sequential consistency
  • Causal consistency
  • Processor consistency
  • PRAM consistency
  • Cache consistency
  • Slow memory

UNIX and session semantics
  • UNIX file sharing semantics on a uni-processor
  • When a read follows a write, the read returns the
    value just written
  • When two writes happen in quick succession,
    followed by a read, the value read is that stored
    by the last write
  • Problematic for a distributed system
  • Theoretically achievable with a single file
    server and no client caching
  • Session semantics
  • Writes made visible to others only upon the
    closing of a file

Delta Consistency
  • Any write will become visible within at most
    Delta time units
  • Barring network latency
  • Meanwhile all bets are off!
  • Push versus pull
  • Compare with sequential, causal, etc. in terms of
    valid orderings of operations
  • Related: mutual consistency with parameter Delta
  • A given set of objects are within Delta time
    units of each other at all times as seen by a
    client
  • Note that it is OK to be stale with respect to
    the server by more than Delta!
  • Generally, specify two parameters
  • Delta1: freshness w.r.t. server
  • Delta2: mutual consistency of related objects
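
A pull-based client cache is one simple way to bound staleness by Delta. The sketch below is illustrative only: the class name DeltaCache, the fetch callback and the injected clock are all hypothetical, and network latency is ignored, as on the slide.

```python
class DeltaCache:
    """Client cache that re-fetches an object once its copy is
    delta or more time units old (pull model)."""

    def __init__(self, fetch, delta, clock):
        self.fetch = fetch      # function: key -> current server value
        self.delta = delta      # maximum tolerated staleness
        self.clock = clock      # function returning the current time
        self.entries = {}       # key -> (value, time fetched)

    def read(self, key):
        now = self.clock()
        hit = self.entries.get(key)
        if hit is not None and now - hit[1] < self.delta:
            return hit[0]       # possibly stale, but within delta
        value = self.fetch(key)
        self.entries[key] = (value, now)
        return value


server = {"f": "v1"}
now = [0.0]  # fake clock so the example is deterministic
cache = DeltaCache(server.__getitem__, delta=5.0, clock=lambda: now[0])

first_read = cache.read("f")    # fetched from the server
server["f"] = "v2"              # write at the server
now[0] = 3.0
stale_read = cache.read("f")    # still inside the window
now[0] = 5.0
fresh_read = cache.read("f")    # window expired, re-fetched
```

Between the write and the expiry the client sees the old value, which is the "all bets are off" interval. A push design would instead have the server invalidate or update copies.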

File Systems Consistency Semantics
  • What is involved in providing these semantics?
  • UNIX semantics: easy to implement on a
    uni-processor
  • Session semantics: session state at the server
  • Delta consistency: timeouts, leases
  • Meta-data consistency
  • Some techniques we have seen
  • Journaling, LFS, meta-data journaling (ext3)
  • Synchronous writes
  • NVRAM: expensive, unavailable
  • Disk scheduler enforced ordering!
  • File system passes sequencing restrictions to the
    disk scheduler
  • Problem: disk scheduler cannot enforce an
    ordering among requests not yet visible to it
  • Soft updates
  • Dependency information is maintained for
    meta-data blocks in write-back cache on a
    per-field and/or per-pointer granularity
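
The ordering idea behind dependency tracking can be sketched at block granularity. This is a simplification: soft updates track dependencies per field or pointer, whereas the toy below orders whole blocks. The block names and the rule that a pointer's target is written before the pointer are illustrative assumptions.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def flush_order(deps):
    """deps maps a block to the set of blocks that must reach
    disk before it. Returns one safe write order."""
    return list(TopologicalSorter(deps).static_order())


# Hypothetical file creation: the data block is initialized before the
# inode points to it, and the inode before the directory entry names it.
deps = {
    "dir_block": {"inode_block"},
    "inode_block": {"data_block"},
}
order = flush_order(deps)
print(order)  # ['data_block', 'inode_block', 'dir_block']
```

At block granularity two blocks can depend on each other, in which case TopologicalSorter raises CycleError. Finer-grained tracking is what lets such cycles be broken.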

Network-attached Storage
  • Introduction to important ideas and technologies
  • Lots of slides, will cover some in class, post
    all on Angel
  • Subsequent classes will cover some topics in depth

Direct Attached Storage
  • Problems/shortcomings in enterprise/commercial
  • Sharing of data difficult
  • Programming and client access inconvenient
  • Wastage of data
  • More?

Remote Storage
  • Idea: separate storage from the clients and
    application servers and locate it on the other
    side of a scalable networking infrastructure
  • Variants on this idea that we will see soon
  • Advantages
  • Reduction in wasted capacity by pooling devices
    and consolidating unused capacity formerly spread
    over many directly-attached storage devices
  • Reduced time to deploy new storage
  • Client software is designed to tolerate dynamic
    changes in network resources but not the changing
    of local storage configurations while the client
    is operating
  • Backup made more convenient
  • Application server involvement removed
  • Management simplified by centralizing storage
    under a consolidated management interface
  • Availability improved (potentially)
  • All software and hardware is specifically
    developed and tested to run together
  • Disadvantages
  • Complexity, more expertise needed
  • Implies more set-up and management cost

Network Attached Storage (NAS)
  • File interface exported to the rest of the network

Storage Area Network (SAN)
  • Block interface exported to the rest of the network

SAN versus NAS
Source: November 2000/Vol. 43, No. 11

Differences between NAS and SAN
  • NAS
  • TCP/IP or UDP/IP protocols and Ethernet networks
  • High-level requests and responses for files
  • NAS devices translate file requests into
    operations on disk blocks
  • Cheaper
  • SAN
  • Fibre Channel and SCSI
  • More scalable
  • Clients translate file accesses into operations
    on specific disk blocks
  • Data block level
  • Expensive
  • Separation of storage traffic from general
    network traffic
  • Beneficial for security and performance

NAS File Servers
  • Pre-configured file servers
  • Consist of one or more internal servers with
    pre-configured capacity
  • Have a stripped-down OS: any component not
    associated with file services is discarded
  • Connected via Ethernet to LAN
  • OS stripping makes it more efficient than a
    general purpose OS
  • Have plug and play functionality

Source: Storage Networks Explained: Basics and
Application of Fibre Channel SAN, NAS, iSCSI and
InfiniBand, by Ulf Troppens, Rainer Erkens, Wolfgang Muller

NAS Network Performance
  • NAS and traditional network file systems use
    IP-based protocols over NIC devices.
  • A consequence of this deployment is poor network
    performance
  • The main culprits often cited include
  • Protocol processing in network stacks
  • Memory copying
  • Kernel overhead, including system calls and
    context switches

NAS Network Performance
  • Figure depicting sources of TCP/IP overhead

NAS Network Performance
  • Protocol Processing
  • Data transmission involves the OS services for
    memory and process management, the TCP/IP
    protocol stack and the network device and its
    device driver.
  • The network per-packet costs include the overhead
    to execute the TCP/IP protocol code, allocate and
    release memory buffers, and device interrupts for
    packet arrival and transmit completion.
  • The per-byte costs include overheads to move data
    within the end-to-end system and to compute
    checksums to detect data corruption in the
    network

NAS Network Performance
  • Memory Copy
  • Current implementations of data transmission
    require the same data to be copied at several
    stages.

NAS Network Performance
  • An NFS client requesting data stored on a NAS
    server with internal SCSI disk would involve
  • Hard disk to RAM transfer using the SCSI, PCI
    and system buses
  • RAM to NIC transfer using the system and PCI
    buses
  • For a traditional NFS this would further involve
    a transfer from the application memory to the
    kernel buffer cache of the transmitting computer
    before forwarding to the network card.

Accelerating Performance
  • Two starting points to accelerate network file
    system performance are
  • The underlying communication protocol
  • TCP/IP was designed to provide a reliable
    framework for data exchange over an unreliable
    network. The TCP/IP stack is complex and
  • Example alternative: VIA/RDMA
  • The network file system
  • Development of new network file systems that
    require a reliable network connection
  • Network file systems could be modified to use
    thinner communication protocols
  • Example alternative: DAFS

Proposed Solutions
  • TCP/IP Offload Engines (TOEs)
  • An increasing number of network adapters are able
    to compute the Internet checksum
  • Some adapters can now perform TCP or UDP protocol
    processing
  • Copy Avoidance
  • Several buffer management schemes have been
    proposed to either reduce or eliminate data
    copying

Proposed Solutions
  • Fibre Channel
  • Fibre Channel reduces the communication overhead
    by offloading transport processing to the NIC
    instead of using the host processor
  • Zero copying is facilitated by direct
    communication between the host memory and the NIC
  • Direct-Access Transport
  • Requires NIC support for remote DMA
  • User-level networking made possible through
    user-mode process interacting directly with the
    NIC to send or receive messages with minimal
    kernel intervention
  • Reliable message transport network

Proposed Solutions
  • NIC Support Mechanism
  • NIC device exposes an array of connection
    descriptors to the system's physical address
    space
  • During connection setup, the network device
    driver maps a free descriptor into the user
    virtual address space
  • This grants the user process direct and safe
    access to the NIC's buffers and registers
  • This facilitates user-level networking and copy
    avoidance

Proposed Solutions
  • User-Level File System
  • Kernel policies for file system caching and
    prefetching do not favor some applications
  • The migration of OS functions into user-level
    libraries allows user applications more control
    and specialization.
  • Clients would run in user mode as libraries
    linked directly with applications. This reduces
    the overhead due to system calls
  • Clients may evolve independently of the operating
    system
  • Clients could also run on any OS, with no special
    kernel support except the NIC device driver.

Virtual Interface And RDMA
  • The virtual interface architecture facilitates
    fast and efficient data exchange between
    applications running on different machines
  • VIA reduces complexity by allowing applications
    (VI consumers) to communicate directly with the
    network card (VI NIC) via common memory areas,
    bypassing the operating system
  • The VI provider is the NIC and its device driver
  • RDMA is a communication model supported on the
    VIA which allows applications to read and write
    memory areas of processes running on different
    machines

VI Architecture and RDMA
Source: Storage Networks Explained: Basics and
Application of Fibre Channel SAN, NAS, iSCSI and
InfiniBand, by Ulf Troppens, Rainer Erkens, Wolfgang Muller

Remote DMA (RDMA)
  • Figure: VIA model (labels: send doorbell, receive
    doorbell, user address space, send descriptor,
    receive descriptor, send buffer, receive buffer,
    data packets in NIC memory, Myrinet NIC)

InfiniBand
  • Infinite Bandwidth
  • A Switch-based I/O interconnect architecture
  • Low pin count serial architecture
  • InfiniBand Architecture (IBA) defines a System
    Area Network (SAN)
  • IBA SAN is a communications and management
    infrastructure for I/O and IPC
  • IBA defines a switched communications fabric
  • High bandwidth and low latency
  • Backed by top companies in the industry:
    Compaq, Dell, Hewlett-Packard, IBM, Intel,
    Microsoft and Sun

Limits of the PCI Bus
  • Peripheral Component Interconnect (PCI)
  • Introduced in 1992
  • Has become the standard bus architecture for
  • PCI bus
  • 32-bit/33 MHz -> 64-bit/66 MHz
  • PCI-X
  • The latest version: 64 bits at PCI-X 66, PCI-X
    133, PCI-X 266 and PCI-X 533 (4.3 GB/s)
  • Other PCI concerns include
  • Bus sharing
  • Bus speed
  • Scalability
  • Fault Tolerance
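
The peak figures for a parallel bus follow from width times clock. A rough check, using nominal clock values and ignoring arbitration and protocol overhead (the function name is illustrative):

```python
def parallel_bus_mb_per_s(width_bits, clock_mhz):
    """Peak transfer rate of a parallel bus moving one word per clock."""
    return width_bits / 8 * clock_mhz


pci_32_33 = parallel_bus_mb_per_s(32, 33)     # 132.0 MB/s
pci_64_66 = parallel_bus_mb_per_s(64, 66)     # 528.0 MB/s
pcix_533 = parallel_bus_mb_per_s(64, 533)     # 4264.0 MB/s, about 4.3 GB/s
```

Because the bus is shared, every attached device divides this peak, which is the bus-sharing concern listed on the slide.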

PCI Express
  • High-speed point-to-point architecture that is
    essentially a serialized, packetized version of
    PCI
  • General purpose serial I/O bus for chip-to-chip
    communication, USB 2.0 / IEEE 1394b
    interconnects, and high-end graphics
  • Viable AGP replacement
  • Bandwidth: 4 Gigabit/second full duplex per lane
  • Up to 32 separate lanes
  • 128 Gigabit/second
  • Software-compatible with PCI device driver model
  • Expected to coexist with and not displace
    technologies like PCI-X in the foreseeable future

Benefits of IBA
  • Bandwidths
  • An open and industry-inclusive standard
  • Improved connection flexibility and scalability
  • Improved reliability
  • Offload communications processing from the OS and
    CPU
  • Wide access to a variety of storage systems
  • Simultaneous device communication
  • Built-in security, Quality of Service
  • Support for Internet Protocol version 6 (IPv6)
  • Fewer and better managed system interrupts
  • Support for up to 64000 addressable devices
  • Support for copper cable and optic fiber

InfiniBand Components
  • Host Channel Adapter (HCA)
  • An interface to a host and supports all software
  • Target Channel Adapter (TCA)
  • Provides the connection to an I/O device from
  • Switch
  • Fundamental component of an IB fabric
  • Allows many HCAs and TCAs to connect to it and
    handles network traffic.
  • Router
  • Forwards data packets from a local network to
    other external subnets
  • Subnet Manager
  • An application responsible for configuring the
    local subnet and ensuring its continued operation

InfiniBand Layers
  • Physical Layer

Link   Pin Count   Signaling Rate   Data Rate   Full-Duplex Data Rate
1x     4           2.5 Gb/s         2 Gb/s      4 Gb/s (500 MB/s)
4x     16          10 Gb/s          8 Gb/s      16 Gb/s (2 GB/s)
12x    48          30 Gb/s          24 Gb/s     48 Gb/s (6 GB/s)
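
The columns of the table are related by simple arithmetic: each lane signals at 2.5 Gb/s, 8b/10b encoding leaves 8 data bits per 10 line bits, and full duplex doubles the one-way data rate. A small check (the function name is illustrative):

```python
def ib_rates(lanes, lane_signal_gbps=2.5):
    """Return (signaling, data, full-duplex data) rates in Gb/s."""
    signaling = lanes * lane_signal_gbps
    data = signaling * 8 / 10        # 8b/10b encoding
    return signaling, data, 2 * data


for lanes in (1, 4, 12):
    print(lanes, ib_rates(lanes))
# 1 (2.5, 2.0, 4.0)
# 4 (10.0, 8.0, 16.0)
# 12 (30.0, 24.0, 48.0)
```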
InfiniBand Layers
  • Link Layer
  • Is central to the IBA and includes packet layout,
    point to point link instructions, switching
    within a local subnet and data integrity
  • Packets
  • Data and management packets
  • Switching
  • Data forwarding within a local subnet
  • QoS
  • Supported by virtual lanes
  • A virtual lane is a unique logical communication
    link that shares a single physical link
  • Up to 15 virtual lanes per physical link (VL0
  • Packet is assigned a priority
  • Credit Based Flow Control
  • Used to manage data flow between two
    point-to-point links
  • Integrity check using CRC
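
Credit-based flow control can be pictured with a toy model: the sender holds one credit per free receive buffer and may transmit only while it has credits, and each buffer the receiver frees returns a credit. This is a generic sketch with hypothetical names, not the IBA wire protocol, which manages credits per virtual lane.

```python
from collections import deque


class CreditLink:
    """Toy point-to-point link with credit-based flow control."""

    def __init__(self, receiver_buffers):
        self.credits = receiver_buffers   # credits held by the sender
        self.rx_queue = deque()           # packets buffered at the receiver

    def try_send(self, packet):
        if self.credits == 0:
            return False                  # sender must wait, nothing is dropped
        self.credits -= 1
        self.rx_queue.append(packet)
        return True

    def receiver_consume(self):
        packet = self.rx_queue.popleft()
        self.credits += 1                 # freed buffer returns a credit
        return packet


link = CreditLink(receiver_buffers=2)
sent = [link.try_send(p) for p in ("a", "b", "c")]   # third send is refused
consumed = link.receiver_consume()
retry = link.try_send("c")                            # succeeds after the credit returns
```

Since the sender never transmits without a credit, the receiver cannot overflow and no packet is dropped for lack of buffer space.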

InfiniBand Layers
  • Networking Layer
  • Responsible for routing packets from one subnet
    to another
  • The global route header (GRH) located within a
    packet includes an IPv6 address for the source
    and destination of each packet
  • Transport Layer
  • Handles the order of packet delivery as well as
    partitioning, multiplexing and transport services
    that determine reliable connections

InfiniBand Architecture
  • The Queue Pair Abstraction
  • 2 queues of communication meta data (send, recv)
  • Registered buffers to send from and receive into

Source: Architectural Interactions of I/O Networks and
Inter-networks, Philip Buonadonna, Intel
Research, University of California, Berkeley

Direct Access File System
  • A new network file system derived from NFS
    version 4
  • Tailored to use remote DMA (RDMA) which requires
    the virtual interface (VI) framework
  • Introduced to combine the low overhead of SAN
    products with the generality of NAS file servers
  • Communication between a DAFS server and client is
    done through RDMA
  • Client side caching of locks for easier
    subsequent access to same file
  • Clients can be implemented as a shared library in
    user space or in the kernel

DAFS Architecture
Source: Storage Networks Explained: Basics and
Application of Fibre Channel SAN, NAS, iSCSI and
InfiniBand, by Ulf Troppens, Rainer Erkens, Wolfgang Muller

Direct Access File System
  • DAFS Protocol
  • Defined as a set of send and request formats and
    their semantics
  • Defines recommended procedural APIs to access
    DAFS services from a client program
  • Assumes a reliable network transport and offers
    server-directed command flow
  • Each operation is a separate request but also
    supports request chaining
  • Defines features for session recovery and locking

Direct Access File System
  • Direct Access Data Transfer
  • Supports direct variants of data transfer
    operations such as read, write, setattr etc.
  • Direct transfer operations to and from
    client-provided memory using RDMA read and write
  • Client registers each memory region with local
    kernel before requesting direct I/O on region
  • API defines primitives register and unregister
    for memory region management; register returns a
    region descriptor
  • Registration issues a system call to pin buffer
    regions in physical memory, then loads page
    translations for the region into a lookup table
    on the NIC

Direct Access File System
  • RDMA Operations
  • RDMA operations for direct I/O are initiated by
    the server.
  • Client write request to server includes a region
    token for the buffer containing the data
  • Server then issues an RDMA read to fetch data
    from the client and responds with a write request
    response after RDMA completion
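
The server-initiated write path can be mimicked in ordinary Python to make the message order concrete. Everything below is a simulation with hypothetical names: ClientNIC stands in for the NIC's table of registered regions, and rdma_read is a plain method call rather than a real RDMA operation.

```python
class ClientNIC:
    """Stand-in for the client NIC's table of registered regions."""

    def __init__(self):
        self.regions = {}
        self.next_token = 1

    def register(self, buffer):
        token = self.next_token           # region token handed to the server
        self.next_token += 1
        self.regions[token] = buffer
        return token

    def unregister(self, token):
        del self.regions[token]

    def rdma_read(self, token, length):
        # Invoked by the server; client software is not involved.
        return bytes(self.regions[token][:length])


class Server:
    def __init__(self):
        self.files = {}

    def handle_write(self, client_nic, name, token, length):
        data = client_nic.rdma_read(token, length)   # server pulls the data
        self.files[name] = data
        return {"status": "ok", "bytes_written": len(data)}


nic, server = ClientNIC(), Server()
buf = bytearray(b"hello dafs")
token = nic.register(buf)                 # 1. client registers its buffer
response = server.handle_write(nic, "a.txt", token, len(buf))  # 2-4. request, RDMA read, response
nic.unregister(token)
```

The write request carries only the token and length. The payload moves when the server chooses to fetch it, which lets the server schedule transfers around its own buffer availability.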

Direct Access File System
  • Asynchronous I/O and Prefetching
  • Supports a fully asynchronous API which
    enables clients to pipeline I/O operations and
    overlap them with application processing
  • Event notification mechanisms deliver
    asynchronous completions and the client may
    create several completion groups
  • DAFS can be implemented as a user library to be
    linked with applications or within the kernel.

Direct Access File System
Figure depicting DAFS and NFS Client Architectures
Source http//
Direct Access File System
  • Server Design and Implementation
  • The kernel server design is fashioned on an event
    driven state transition diagram
  • The main events triggering state transitions are
    recv_done, send_done and bio_done

Figure 1. An event-driven DAFS server
Source http//
Direct Access File System
  • Event Handlers
  • Each network or disk event is associated with a
    handler routine
  • recv_done - Client initiated transfer is
    complete. This signal is asserted by the NIC and
    initiates the processing of an incoming RPC
  • send_done - Server initiated transfer is
    complete. The handler for this signal releases
    all the locks involved in the RDMA operation and
    returns an RPC response
  • bio_done - Block I/O request from disk is
    complete. This signal is raised by the disk
    controller and wakes up any thread that is
    blocking on a previous disk I/O
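
A dispatch loop over the three events might look like the sketch below. The event names come from the slides; the class, the log messages and the chaining of recv_done to bio_done to send_done for a single read request are assumptions made for illustration, not the server's actual state diagram.

```python
from collections import deque


class EventDrivenServer:
    """Toy event loop: each event name maps to one handler."""

    def __init__(self):
        self.events = deque()
        self.log = []
        self.handlers = {
            "recv_done": self.on_recv_done,
            "bio_done": self.on_bio_done,
            "send_done": self.on_send_done,
        }

    def post(self, event, request):
        self.events.append((event, request))

    def on_recv_done(self, request):
        self.log.append(f"{request}: RPC received, disk read issued")
        self.post("bio_done", request)    # simulated disk completion

    def on_bio_done(self, request):
        self.log.append(f"{request}: disk I/O complete, RDMA issued")
        self.post("send_done", request)   # simulated transfer completion

    def on_send_done(self, request):
        self.log.append(f"{request}: RDMA complete, locks released, response sent")

    def run(self):
        while self.events:
            event, request = self.events.popleft()
            self.handlers[event](request)


srv = EventDrivenServer()
srv.post("recv_done", "read#1")
srv.run()
```

No handler blocks: each one records its work and posts the next event, so a single loop can interleave many outstanding requests.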

Direct Access File System
  • Server Design and Implementation
  • Server performs disk I/O using the zero-copy
    buffer cache interface
  • This interface facilitates the locking of pages
    and their mappings
  • Buffers involved in RDMA need to be locked
    during the entire transfer duration
  • Transfers are initiated using RPC handlers and
    processing is asynchronous
  • Kernel buffer cache manager registers and
    de-registers buffer mappings to the NIC on the
    fly, as physical pages are returned or removed
    from the buffers

Direct Access File System
  • Server Design and Implementation
  • Server creates multiple kernel threads to
    facilitate I/O concurrency
  • A single listener thread monitors for new
    transport connections. Other worker threads
    handle data transfer
  • Arriving messages generate a recv_done interrupt
    which is processed by a single handler for the
    completion group
  • Handler queues up incoming RPC requests and
    invokes a worker thread to start data processing
  • A thread locks all the necessary file pages in
    the buffer cache, creates RDMA descriptors and
    issues RDMA operations
  • After RDMA completion, a send_done signal is sent
    which initiates the clean up and release of all
    resources associated with the completed operation

Communication Alternatives
Source: Storage Networks Explained: Basics and
Application of Fibre Channel SAN, NAS, iSCSI and
InfiniBand, by Ulf Troppens, Rainer Erkens, Wolfgang Muller

Experimental Setup
Source http//
Experimental Setup
  • System Configuration
  • Pentium III 800 MHz clients and servers
  • Server cache 1GB, 133MHz memory bus
  • 9GB Disks, 10K RPM Seagate Cheetah, 64-bit/33MHz
    PCI bus
  • VI over Giganet cLAN 1000 adapter (DAFS)
  • UDP/IP over Gigabit Ethernet, Alteon Tigon-II
    adapters (NFS)

Experimental Setup
  • NFS block I/O transfer size is set at mount time
  • Packets sent in fragmented UDP packets
  • Interrupt coalescing is set to high on Tigon-II
  • Checksum offloading enabled on Tigon-II
  • NFS-nocopy required modifying Tigon-II firmware,
    IP fragmentation code, file cache code, VM system
    and Tigon-II driver, to facilitate header
    splitting and page remapping

Experimental Results
The table below shows the results for one-byte
round-trip latency and bandwidth. The higher
latency on Tigon-II was due to the datapath
crossing the kernel UDP/IP stack.

Experimental Results
  • Bandwidth and Overhead
  • Server pre-warmed with 768MB dataset
  • Designed to stress network data transfer
  • Hence client caching not considered
  • Sequential Configuration
  • DAFS client utilized the asynchronous I/O API
  • NFS had read-ahead enabled
  • Random Configuration
  • NFS tuned for best-case performance at each
    request size by selecting a matching NFS transfer
    size

Experimental Results
  • TPIE Merge
  • The sequential record merge program combines n
    sorted input files of x y-byte records each into
    a single sorted output file
  • Depicts raw sequential I/O performance with
    varying amounts of processing
  • Performance is limited by the client CPU

Experimental Results
  • PostMark
  • A synthetic benchmark used in measuring file
    system performance over workloads composed of
    many short-lived, relatively small files
  • Creates a pool of files with random sizes
    followed by sequence of file operations

Experimental Results
  • Berkeley DB
  • Synthetic workload composed of read-only
    transactions, processing one small record at
    random from a B-tree

Disk Storage Interfaces
  • Parallel ATA (IDE, E-IDE)
  • Serial ATA (SATA)
  • Small Computer System Interface (SCSI)
  • Serial Attached SCSI (SAS)
  • Fibre Channel (FC)

Source: "It's More Than the Interface," by Gordy Lutz of
Seagate, August 2002.

Parallel ATA
  • 16-bit bus
  • Two bytes per bus transaction
  • 40-pin connector
  • Master/slave shared bus
  • Bandwidth
  • 25 MHz strobe
  • x 2 for double data rate clocking
  • x 16 bits per edge
  • / 8 bits per byte
  • = 100 MBytes/sec

Serial ATA (SATA)
  • 7-pin connector
  • Point-to-point connections for dedicated
    bandwidth
  • Bit-by-bit
  • One single signal path for data transmission
  • The other signal path for acknowledgement
  • Bandwidth
  • 1500 MHz embedded clock
  • x 1 bit per clock
  • x 80% for 8b/10b encoding
  • / 8 bits per byte
  • = 150 MBytes/sec
  • 2002 -> 150 MB/sec
  • 2004 -> 300 MB/sec
  • 2007 -> 600 MB/sec
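
Both bandwidth derivations, for Parallel ATA and for Serial ATA, reduce to a few multiplications. The sketch below restates them so the numbers can be reproduced; the function and parameter names are illustrative.

```python
def pata_mb_per_s(strobe_mhz=25, edges_per_cycle=2, bits_per_edge=16):
    """Parallel ATA: 16 bits move on each strobe edge."""
    return strobe_mhz * edges_per_cycle * bits_per_edge / 8


def sata_mb_per_s(line_rate_mhz, data_bits=8, code_bits=10):
    """Serial ATA: one bit per clock, 8 data bits per 10 line bits."""
    return line_rate_mhz * data_bits / code_bits / 8


pata = pata_mb_per_s()                                    # 100.0
sata = [sata_mb_per_s(r) for r in (1500, 3000, 6000)]     # 150.0, 300.0, 600.0
```

The serial link reaches higher rates on one data path because its clock is 60 times faster than the parallel strobe, even after giving up 20% to encoding.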

8b/10b encoding
  • IBM patent
  • Used in SATA, SAS, FC and InfiniBand
  • Converts 8-bit data to 10-bit codes
  • Provides better synchronization than Manchester
    encoding

Small Computer Systems Interface (SCSI)
  • SCSI for high-performance storage market
  • SCSI-1 proposed in 1986
  • Parallel Interface
  • Maximum cabling distance is 12 meters
  • Terminators required
  • Bus width is 8-bit (narrow)
  • 16 devices per bus
  • A device with a high priority has a bus

SCSI (contd)
  • Peer-to-peer connection (channel)
  • 50/68 pins
  • Hot repair not provided
  • Multiple buses needed beyond 16 devices
  • Low bandwidth
  • Distance limitation

SCSI Roadmap
  • Wide SCSI (16-bit bus)
  • Fast SCSI (double data rate)

Serial Attached SCSI (SAS)
  • ANSI standard in 2003
  • Interoperability with SATA
  • Full-duplex
  • Dual-port
  • 128 devices
  • 10 meters

Dual port
  • ATA, SCSI and SATA support a single port
  • Controller is a single point of failure
  • SAS and FC support dual port

SAS Roadmap

Fibre Channel (FC)
  • Developed as a backbone technology for LANs
  • The name is a misnomer
  • Runs on copper also
  • 4 wire cable or fiber optic
  • 10 km or less per link
  • 126 devices per loop
  • No terminators
  • Installed base of Fibre Channel devices
  • 2.45 billion FC HBAs in 2005
  • 5.4 billion FC switches in 2005

Source: Gartner, Dec 13, 2001

FC (contd)
  • Advantage
  • High bandwidth
  • Secure
  • Zero-copy send and receive
  • Low host CPU utilization
  • FCP (Fibre Channel Protocol)
  • Disadvantage
  • Not a wide-area network
  • Separate physical network infrastructure
  • Expensive
  • Different management mechanisms
  • Interoperability among different vendors

Fibre Channel Topologies
Source: Ulf Troppens, Rainer Erkens and Wolfgang Muller,
Storage Networks Explained

Fibre Channel Ports
  • N-Port: Node port
  • F-Port: Fabric port
  • L-Port: Loop port
  • Only connects to an arbitrated loop (AL)
  • E-Port: Expansion port
  • Connects two switches
  • G-Port: Generic port
  • B-Port: Bridge port
  • Bridges to other networks (IP, ATM, etc.)
  • NL-Port: Node_Loop_port
  • Can connect both in a fabric and in an AL
  • FL-Port: Fabric_Loop_port
  • Connects a fabric to a loop

Source: Ulf Troppens, Rainer Erkens and Wolfgang Muller,
Storage Networks Explained

Arbitrated Loop in FC
Source: Ulf Troppens, Rainer Erkens and Wolfgang Muller,
Storage Networks Explained

Routing mechanisms in switch
  • Store-forward routing
  • Cut-through routing
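
The difference between the two routing mechanisms shows up in a zero-load latency estimate. The formulas below are the standard textbook approximations, ignoring contention and wire delay; the parameter names and example numbers are illustrative.

```python
def store_and_forward_latency(hops, router_delay, packet_bits, bandwidth):
    """Each hop receives the whole packet before forwarding it."""
    return hops * (router_delay + packet_bits / bandwidth)


def cut_through_latency(hops, router_delay, packet_bits, bandwidth):
    """Forwarding starts once the header arrives, so the
    serialization time is paid only once."""
    return hops * router_delay + packet_bits / bandwidth


# 4 hops, 1 time unit per router, 1000-bit packet, 10 bits per time unit
sf = store_and_forward_latency(4, 1, 1000, 10)   # 404.0
ct = cut_through_latency(4, 1, 1000, 10)         # 104.0
```

The gap grows with both hop count and packet size, which is why cut-through is attractive for large frames crossing several switches.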

Source: William James Dally and Brian Towles, Principles
and Practices of Interconnection Networks,
chapter 13

Fibre Channel Hub and Switch
  • Switch
  • Thousands of connections
  • Bandwidth per device is nearly constant
  • Aggregate bandwidth increases with increased
    number of devices
  • Deterministic latency
  • Hub
  • 126 Devices
  • Bandwidth per device diminishes with increased
    number of devices
  • Aggregate bandwidth is constant with increased
    number of devices
  • Latency increases as the number of devices
    increases

Fibre Channel Structure

Fibre Channel Bandwidth
  • Clock rate is 1.0625 GHz
  • 1.0625 Gbps x 2048 (payload) / 2168 (payload +
    overhead) x 0.8 (8b/10b) / 8 bits = 100.369 MB/s
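
The Fibre Channel payload figure can be reproduced directly. A minimal check, assuming 1 MB = 10^6 bytes (the function name is illustrative):

```python
def fc_payload_mb_per_s(line_rate_gbps=1.0625, payload=2048, frame=2168):
    """Payload bandwidth after 8b/10b encoding and frame overhead."""
    bytes_per_s = line_rate_gbps * 1e9 * 8 / 10 / 8   # 8b/10b, then bits to bytes
    return bytes_per_s * payload / frame / 1e6


fc = fc_payload_mb_per_s()
print(round(fc, 3))  # 100.369
```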

Cable types in FC

FC Roadmap
Product Naming   Throughput (MB/s)   T11 Spec Completed (Year)   Market Availability (Year)
1GFC             200                 1996                        1997
2GFC             400                 2000                        2001
4GFC             800                 2003                        2005
8GFC             1600                2006                        2008
16GFC            3200                2009                        2011
32GFC            6400                2012                        Market Demand
64GFC            12800               2016                        Market Demand
128GFC           25600               2020                        Market Demand

Interface Comparison

Market Segments
Source: It's More Than the Interface, Seagate, 2003

Interface Trends - Previous
Source: It's More Than the Interface, Seagate, 2003

Interface Trends - Today and Tomorrow
Source: It's More Than the Interface, Seagate, 2003

IP Storage

IP Storage (contd)
  • TCP/IP is used as a storage interconnect to
    transfer block level data.
  • IETF working group, the IP Storage (IPS)
  • iSCSI, iFCP, and FCIP protocols
  • Cheaper
  • Provides one technology for a client to connect
    to servers and storage devices
  • Increases operating distances
  • Improves availability of storage systems
  • Can utilize network management tools

Its more than interface, Seagate, 2003
iSCSI (Internet SCSI)
  • iSCSI is a Transport for SCSI Commands
  • iSCSI is an End to End protocol
  • iSCSI can be implemented on Desktops, Laptops and
  • iSCSI can be implemented with current TCP/IP
  • iSCSI can be implemented completely in an HBA
  • Overcomes the distance limitation
  • Cost-effective
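
To make "a transport for SCSI commands" concrete, the sketch below packs a SCSI READ(10) command descriptor block into a 48-byte iSCSI SCSI Command header. It is a simplified illustration, not a working initiator: the field layout follows the author's reading of the iSCSI basic header segment in RFC 3720, the helper names are hypothetical, the LUN is treated as an opaque 8-byte value, and login, digests, and data segments are ignored.

```python
import struct


def scsi_read10_cdb(lba, blocks):
    """READ(10) CDB: opcode 0x28, flags, 4-byte LBA, group, 2-byte length, control."""
    return struct.pack(">BBIBHB", 0x28, 0, lba, 0, blocks, 0)


def iscsi_scsi_command_header(cdb, lun, task_tag, expected_len, cmd_sn, exp_stat_sn):
    """Pack a 48-byte header for a read command carrying no immediate data."""
    opcode = 0x01                  # SCSI Command
    flags = 0x80 | 0x40            # Final bit + Read bit
    data_segment_length = (0).to_bytes(3, "big")
    return struct.pack(
        ">BB2xB3s8sIIII16s",
        opcode, flags,
        0,                         # TotalAHSLength
        data_segment_length,
        lun,                       # 8-byte LUN field, opaque here
        task_tag,                  # Initiator Task Tag
        expected_len,              # Expected Data Transfer Length in bytes
        cmd_sn, exp_stat_sn,
        cdb.ljust(16, b"\x00"),    # CDB padded to 16 bytes
    )


cdb = scsi_read10_cdb(lba=2048, blocks=8)
header = iscsi_scsi_command_header(cdb, lun=bytes(8), task_tag=1,
                                   expected_len=8 * 512, cmd_sn=1, exp_stat_sn=1)
print(len(header), header.hex())
```

The SCSI command itself is unchanged; iSCSI only wraps it so that it can travel over a TCP connection.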

Protocol Stack - iSCSI
Packet and Bandwidth - iSCSI
  • Per-frame overhead: 78 bytes
  • 14 (Ethernet) + 20 (IP) + 20 (TCP) + 4 (CRC) + 20
    (Interframe Gap)
  • The iSCSI header adds 48 bytes per SCSI command
  • 1.25 Gbps × 1460 (payload) / 1538 (payload + overhead)
    × 0.8 (8b/10b) / 8 bits = 118.66 MB/s (113.16 MiB/s)
  • Bi-directional payload bandwidth: 220.31 MB/s
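
The overhead and bandwidth figures for iSCSI over Gigabit Ethernet can be checked with the short calculation below. It uses only the per-frame numbers from the slide; the variable names are illustrative. Note that the 113.16 figure matches the result expressed in binary megabytes (2^20 bytes), while the same quantity in decimal megabytes is about 118.66.

```python
# Per-frame overhead in bytes, as listed on the slide.
overhead = {"ethernet": 14, "ip": 20, "tcp": 20, "crc": 4, "interframe_gap": 20}

payload_bytes = 1460
wire_bytes = payload_bytes + sum(overhead.values())

# 1.25 Gbaud line rate with 8b/10b encoding carries 1 Gb/s of data bits.
data_bytes_per_s = 1.25e9 * 0.8 / 8

payload_bytes_per_s = data_bytes_per_s * payload_bytes / wire_bytes
mb_per_s = payload_bytes_per_s / 1e6       # decimal megabytes
mib_per_s = payload_bytes_per_s / 2**20    # binary megabytes

print(f"overhead per frame: {sum(overhead.values())} bytes")
print(f"payload bandwidth: {mb_per_s:.2f} MB/s = {mib_per_s:.2f} MiB/s")
```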

Problems with iSCSI
  • Limited Performance because
  • Protocol overhead in TCP/IP
  • Interrupts are generated for each network packet
  • Extra copies when sending and receiving data
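
A rough estimate shows why per-packet interrupts hurt. The sketch assumes a saturated Gigabit Ethernet link carrying full-size frames (1538 bytes on the wire, including the interframe gap) and no interrupt coalescing; both are simplifying assumptions rather than figures from the slides.

```python
data_bytes_per_s = 1.25e9 * 0.8 / 8   # 125 MB/s of data bits on Gigabit Ethernet
wire_bytes_per_frame = 1538           # 1460-byte payload + 78 bytes of overhead

frames_per_s = data_bytes_per_s / wire_bytes_per_frame
interrupts_per_s = frames_per_s       # one interrupt per received packet
us_between_interrupts = 1e6 / interrupts_per_s

print(f"{interrupts_per_s:,.0f} interrupts/s, "
      f"one every {us_between_interrupts:.1f} us")
```

At this rate the host has only about a dozen microseconds per packet for interrupt handling, protocol processing, and data copies combined.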

iSCSI Adapter Implementations
  • Software approach
  • Shows the best performance
  • This approach is very competitive due to fast
    modern CPUs
  • Hardware Approaches
  • The embedded CPU is relatively slow compared to the host CPU
  • Its development speed is also slower than that of
    host CPUs
  • Performance improvement is limited without
    major advances in embedded CPUs
  • Can show performance improvement in highly-loaded

Prasenjit Sarkar, Sandeep Uttamchandani, Kaladhar
Voruganti, Storage over IP: When Does Hardware
Support Help?, FAST 2003
iFCP (Internet Fibre Channel Protocol)
  • iFCP is a gateway-to-gateway protocol for the
    implementation of a fibre channel fabric over a
    TCP/IP transport
  • Allow users to interconnect FC devices over a
    TCP/IP network at any distance
  • Traffic between fibre channel devices is routed
    and switched by TCP/IP network
  • iFCP maps each FC address to an IP address and
    each FC session to a TCP session
  • FC messaging and routing services are terminated
    at the gateways, so the fabrics are not merged
  • Data backup and replication
  • mFCP uses UDP/IP
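
The address and session mapping performed by an iFCP gateway can be pictured as two lookup tables. The class below is a purely hypothetical sketch: the class and method names, the 24-bit FC addresses, and the documentation-range IP addresses are all invented for illustration, and no real iFCP message handling is modeled.

```python
class IfcpGatewaySketch:
    """Toy model: FC address -> IP address, FC session -> TCP session."""

    def __init__(self):
        self.fc_to_ip = {}        # 24-bit FC port address -> IP address
        self.sessions = {}        # (src FC, dst FC) -> (dst IP, TCP session id)
        self._next_session_id = 1

    def register(self, fc_address, ip_address):
        self.fc_to_ip[fc_address] = ip_address

    def session_for(self, src_fc, dst_fc):
        """Return the TCP session for an FC session, creating it on first use."""
        key = (src_fc, dst_fc)
        if key not in self.sessions:
            dst_ip = self.fc_to_ip[dst_fc]   # KeyError if the device is unknown
            self.sessions[key] = (dst_ip, self._next_session_id)
            self._next_session_id += 1
        return self.sessions[key]


gw = IfcpGatewaySketch()
gw.register(0x010200, "192.0.2.10")
gw.register(0x010300, "192.0.2.20")

first = gw.session_for(0x010100, 0x010200)
again = gw.session_for(0x010100, 0x010200)
other = gw.session_for(0x010100, 0x010300)
print(first, again, other)
```

Because each FC session gets its own TCP session and FC addresses are resolved at the gateway, the IP network does the routing between devices.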

How does iFCP work?
Types of iFCP communication
FCIP (Fibre Channel over IP)
  • TCP/IP-based tunneling protocol to encapsulate
    fibre channel packets
  • Allow users to interconnect FC devices over a
    TCP/IP network at any distance (same as iFCP)
  • Merges connected SANs into a single FC fabric
  • Data backup and replication
  • Gateways
  • used to interconnect fibre channel SANs to the IP
  • set up connections between SANs or between fibre
    channel devices and SANs

FCIP (Fibre Channel over IP)
Comparison between FCIP and iFCP
IP Storage Protocols: iSCSI, iFCP and FCIP
(No Transcript)
  • Reliability
  • The basic InfiniBand link connection consists
    of only four signal wires
  • IBA accommodates multiple ports for each I/O unit
  • IBA provides multiple CRCs
  • Availability
  • An IBA fabric is inherently redundant, with
    multiple paths to sources assuring data delivery
  • IBA allows the network to heal itself if a link
    fails or is reporting errors
  • IBA has a many-to-many server-to-I/O relationship
  • Serviceability
  • Hot-pluggable

Feature               | InfiniBand      | Fibre Channel | 1 Gb / 10 Gb Ethernet | PCI-X
Bandwidth             | 2.5, 10, 30 Gb/s | 1, 2.1 Gb/s   | 1, 10 Gb/s            | 8.51 Gb/s
Bandwidth Full-Duplex | 5, 20, 60 Gb/s   | 2.1, 4.2 Gb/s | 2, 20 Gb/s            | N/A
Pin Count             | 4, 16, 48       | 4             | 4 / 8                 | 90
Media                 | Copper/Fiber    | Copper/Fiber  | Copper/Fiber          | PCB
Max Length Copper     | 250 / 125 m     | 13 m          | 100 m                 | inches
Max Length Fiber      | 10 km           | km            | km                    | N/A
Partitioning          | X               | X             | X                     | N/A
Scalable Link Width   | X               | N/A           | N/A                   | N/A
Max Payload           | 4 KB            | 2 KB          | 1.5 KB                | No Packets
A classification of storage systems (warning -
not comprehensive)
  • Isolated
  • E.g., A laptop/PC with a local file system
  • We know how these work
  • File systems were first developed for centralized
    computer systems as an OS facility providing a
    convenient programming interface to (disk) storage
  • Subsequently acquired features like access control and
    file locking that made them useful for sharing of
    data and programs
  • Distributed
  • Why?
  • Sharing, scalability, mobility, fault tolerance,
  • Basic Distributed file system
  • Give the illusion of local storage when the data
    is spread across a network (usually a LAN) to
    clients running on multiple computers
  • Support the sharing of information in the form
    of files and hardware resources in the form of
    persistent storage throughout an intranet
  • Enhancements in various domains for real-time
    performance (multimedia), high failure
    resistance, high scalability (P2P), security,
    longevity (archival systems), mobility/disconnection
  • Remote objects to support distributed
    object-oriented programming

Storage systems and their properties
System                    | Sharing | Persistence | Caching/replication | Consistency maintenance | Example
Main memory               | No      | No          | No                  | Strict one-copy         | RAM
File system               | No      | Yes         | No                  | Strict one-copy         | UNIX FS
Distributed file system   | Yes     | Yes         | Yes                 | Yes (approx.)           | NFS
Web                       | Yes     | Yes         | Yes                 | Very approx./No         | Web server
Distributed shared memory | Yes     | No          | Yes                 | Yes (approx.)           | Ivy
Remote objects (RMI/ORB)  | Yes     | No          | No                  | Strict one-copy         | CORBA
Persistent object store   | Yes     | Yes         | No                  | Strict one-copy         | CORBA persistent state service
P2P storage system        | Yes     | Yes         | Yes                 | Very approx.            | OceanStore