Latest Relevant Techniques and Applications
for Distributed File Systems
  • Ela Sharda

Overview
  • What is a Distributed File System?
  • Features
  • The Google File System (GFS)
  • Ceph
  • Andrew File System (AFS)
  • Coda
  • Hadoop
  • References

What is a Distributed File System?
  • A distributed file system stores files on one
    or more computers called servers, and
    makes them accessible to other computers
    called clients, where they appear as normal
    files.
  • Advantages of using file servers
  • - Files are more widely available, since many
    computers can access the servers, and sharing
    the files from a single location is easier than
    distributing copies of files to individual
    clients.
    - Backups and safety of the information are
    easier to arrange. The servers can
    provide large storage space, which might be
    costly or impractical to supply to every
    client.

  • Since more than one client may access the same
    data simultaneously, the server must have a
    mechanism in place to organize updates so that
    the client always receives the most current
    version of the data and data conflicts do not
    arise.
  • DFSs typically use file or database replication
    (distributing copies of data on multiple servers)
    to protect against data access failures.
  • Files which are distributed across multiple
    servers appear to users as if they reside in one
    place on the network. Users no longer need to
    know and specify the actual physical location of
    files in order to access them.

Features [2]
  • Architecture: Different DFS architectures
    exist.
  • 1. Client-Server Architecture: for example, Sun
    Microsystems' Network File System (NFS), which
    provides a standardized view of its local file
    system.
  • 2. Cluster-Based Distributed File System:
    such as GFS. It consists of a single master
    along with multiple chunk servers, and files
    are divided into multiple chunks.
  • 3. Symmetric Architecture: Based on
    peer-to-peer technology. In this
    file system, the clients also host the metadata
    manager code, resulting in all nodes
    understanding the disk structures.

  • 4. Asymmetric Architecture: There are one or
    more dedicated metadata managers that
    maintain the file system and its associated
    disk structures. Examples include Lustre and
    traditional NFS file systems.
  • 5. Parallel Architecture: Here, data blocks
    are striped, in parallel, across multiple
    storage devices on multiple storage servers.
    This supports concurrent read and write
    operations.
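
The striping idea behind the parallel architecture can be sketched in a few lines of Python. This is a minimal illustration, not how any particular DFS lays out data: the function names `stripe` and `reassemble`, the round-robin policy, and the tiny block size are all assumptions made for the example, and each "server" is just an in-memory list.

```python
def stripe(data, num_servers, block_size):
    """Split data into fixed-size blocks and deal them round-robin to servers."""
    servers = [[] for _ in range(num_servers)]
    for offset in range(0, len(data), block_size):
        block_index = offset // block_size
        # Block k goes to server k mod num_servers (assumed policy).
        servers[block_index % num_servers].append(data[offset:offset + block_size])
    return servers


def reassemble(servers):
    """Read blocks back in round-robin order to rebuild the original bytes."""
    parts = []
    depth = max((len(blocks) for blocks in servers), default=0)
    for row in range(depth):
        for blocks in servers:
            if row < len(blocks):
                parts.append(blocks[row])
    return b"".join(parts)


layout = stripe(b"abcdefghij", num_servers=3, block_size=3)
print(layout)              # which blocks landed on which server
print(reassemble(layout))  # original bytes again
```

Because consecutive blocks sit on different servers, a real client could fetch them concurrently; the sketch only shows the layout and the read-back order.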

  • Communication: DFSs use the Remote Procedure
    Call (RPC) method to communicate, as it makes the
    system independent of the underlying OS, networks
    and transport protocols.
  • - In the RPC approach, there are two communication
    protocols to consider: TCP and UDP.
  • - TCP is used by most DFSs.
  • - UDP is considered for improving performance
    in Hadoop.

  • Naming: Two approaches are currently common -
  • 1. A central metadata server manages the file
    name space. Decoupling metadata from data
    improves file namespace management and relieves
    the synchronization problem.
  • 2. Metadata is distributed across all nodes,
    resulting in all nodes understanding the disk
    structures.

  • Consistency and Replication: Most DFSs employ
    checksums to validate data after it has been sent
    over the communication network.
  • - Caching and replication play an important
    role in DFSs that are designed to operate
    over a wide area network.
  • - They can be done in many ways, such as
    client-side caching and server-side replication.
  • - Two types of data need to be considered for
    replication: metadata replication
    and data object replication.
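
Checksum validation after a network transfer can be sketched as follows. The slides do not say which checksum any given DFS uses, so CRC-32 from Python's standard `zlib` module is an assumption here, and `send` and `verify` are hypothetical helper names.

```python
import zlib


def send(payload):
    """Sender side: return the payload together with its CRC-32 checksum."""
    return payload, zlib.crc32(payload)


def verify(payload, checksum):
    """Receiver side: recompute the checksum and compare with the one received."""
    return zlib.crc32(payload) == checksum


block, crc = send(b"chunk of file data")
print(verify(block, crc))                  # intact transfer
print(verify(b"chunk of file dbta", crc))  # one corrupted byte
```

A receiver that detects a mismatch would typically discard the block and fetch it again, possibly from another replica.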

  • Security: Authentication and access control
    are among the important security issues in
    DFSs that need to be analyzed.
  • - Most DFSs provide security through
    authentication, authorization and privacy.
  • - Some special-purpose DFSs, such as GFS
    and Hadoop, rely on trust between all
    nodes and clients.

  • Fault Tolerance: It is closely related to the
    replication feature, because replication is
    used to provide availability and to make
    failures transparent to users.
  • - There are two approaches to fault tolerance:
    failure as exception and failure as norm.
  • - Failure-as-exception systems isolate the
    failed node or recover the system from the last
    normal running state.
  • - Failure-as-norm systems employ replication of
    all kinds of data.
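
A minimal sketch of the failure-as-norm approach: every write goes to all reachable replicas, and a read falls through to the next replica when one is down. The classes and function names (`Replica`, `replicated_write`, `failover_read`) are hypothetical, and a node failure is simulated with a simple `alive` flag.

```python
class Replica:
    """An in-memory stand-in for one storage node."""

    def __init__(self, name):
        self.name = name
        self.alive = True
        self.store = {}

    def put(self, key, value):
        if not self.alive:
            raise ConnectionError(self.name + " is down")
        self.store[key] = value

    def get(self, key):
        if not self.alive:
            raise ConnectionError(self.name + " is down")
        return self.store[key]


def replicated_write(replicas, key, value):
    """Write to every reachable replica; return how many copies were stored."""
    written = 0
    for replica in replicas:
        try:
            replica.put(key, value)
            written += 1
        except ConnectionError:
            pass  # a failed node is expected, not exceptional
    return written


def failover_read(replicas, key):
    """Return the value from the first replica that can serve it."""
    for replica in replicas:
        try:
            return replica.get(key)
        except (ConnectionError, KeyError):
            continue
    raise OSError("no replica could serve " + key)


nodes = [Replica("n0"), Replica("n1"), Replica("n2")]
replicated_write(nodes, "block-7", b"data")
nodes[0].alive = False                  # simulate a node failure
print(failover_read(nodes, "block-7"))  # still served by n1
```

The user of `failover_read` never sees the failure of `n0`, which is the transparency the slide refers to. Re-replicating lost copies is left out of the sketch.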

The Google File System [1]
  • A scalable distributed file system for large
    data-intensive applications. It provides fault
    tolerance while running on inexpensive commodity
    hardware, and it delivers high aggregate
    performance to a large number of clients.
  • It is widely deployed within Google as the
    storage platform for the generation and
    processing of data as well as research and
    development efforts that require large data sets.
  • GFS is optimized for Google's core data storage
    and usage needs which can generate enormous
    amounts of data that need to be retained.
  • Its architecture is that of a cluster-based
    distributed file system.
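
The single-master, multiple-chunk-server layout can be sketched as below. This is a toy model under stated assumptions, not GFS itself: the chunk size is a few bytes so the output is readable, the replica placement rule is a simple modulo, chunk servers are plain dictionaries, and all names (`Master`, `write_file`, `read_file`) are invented for the example. The point it illustrates is that the master holds only metadata while chunk servers hold the data.

```python
import itertools

CHUNK_SIZE = 4  # bytes; deliberately tiny, for illustration only


class Master:
    """Holds metadata only: file name -> chunk handles -> chunk server ids."""

    def __init__(self, num_servers, replicas=2):
        # replicas should not exceed num_servers, or copies would collide.
        self.num_servers = num_servers
        self.replicas = replicas
        self.files = {}       # file name -> list of chunk handles
        self.locations = {}   # chunk handle -> list of chunk server ids
        self._handles = itertools.count()

    def allocate(self, filename):
        """Create a new chunk for the file and choose servers for its replicas."""
        handle = next(self._handles)
        servers = [(handle + k) % self.num_servers for k in range(self.replicas)]
        self.files.setdefault(filename, []).append(handle)
        self.locations[handle] = servers
        return handle, servers


def write_file(master, chunkservers, filename, data):
    """Split data into chunks and store each chunk on its assigned servers."""
    for offset in range(0, len(data), CHUNK_SIZE):
        handle, servers = master.allocate(filename)
        for server_id in servers:
            chunkservers[server_id][handle] = data[offset:offset + CHUNK_SIZE]


def read_file(master, chunkservers, filename):
    """Ask the master where chunks live, then read each from its first replica."""
    parts = []
    for handle in master.files[filename]:
        first_replica = master.locations[handle][0]
        parts.append(chunkservers[first_replica][handle])
    return b"".join(parts)


master = Master(num_servers=3)
chunkservers = [{}, {}, {}]
write_file(master, chunkservers, "log", b"0123456789")
print(master.files["log"], master.locations)
print(read_file(master, chunkservers, "log"))
```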

GFS Architecture

Ceph [6]
  • A distributed file system that provides excellent
    performance and reliability while promising
    unparalleled scalability. It was developed at
    the University of California, Santa Cruz.
  • Ceph maximizes the separation between data and
    metadata management by replacing allocation
    tables with a pseudo-random data distribution
    function (CRUSH) designed for heterogeneous and
    dynamic clusters of object storage devices.
  • The primary goals of the architecture are
    scalability (to hundreds of petabytes and
    beyond), performance, and reliability.
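
The idea of replacing allocation tables with a placement function can be illustrated with a short sketch. The code below is not CRUSH: it uses rendezvous (highest-random-weight) hashing as a simpler stand-in, ignores device weights and failure domains, and the names `score` and `place` and the device labels are made up for the example. What it shares with the approach described above is that any client can compute where an object lives from its name and the device list alone, with no lookup table.

```python
import hashlib


def score(object_id, device):
    """Deterministic pseudo-random score for an (object, device) pair."""
    digest = hashlib.sha256((object_id + ":" + device).encode()).digest()
    return int.from_bytes(digest[:8], "big")


def place(object_id, devices, replicas=2):
    """Pick the devices with the highest scores; no allocation table needed."""
    ranked = sorted(devices, key=lambda d: score(object_id, d), reverse=True)
    return ranked[:replicas]


devices = ["osd0", "osd1", "osd2", "osd3"]
print(place("obj-42", devices))
```

Removing a device that does not hold a given object leaves that object's placement unchanged, so only a fraction of the data moves when the cluster changes.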

Andrew File System [5]
  • The Andrew File System is a distributed networked
    file system which uses a set of trusted servers
    to present a homogeneous, location-transparent
    file name space to all the client workstations.
    It was developed by CMU as part of the Andrew
    Project.
  • AFS uses Kerberos for authentication, and
    implements access control lists on directories
    for users and groups.
  • Kerberos is a computer network authentication
    protocol developed at MIT, which allows
    individuals communicating over a non-secure
    network to prove their identity to one another in
    a secure manner.
  • It provides mutual authentication: both the user
    and the server verify each other's identity.

Coda [4]
  • Coda is a distributed file system developed at
    CMU with its origin in AFS2. It has many features
    that are very desirable for network filesystems.
  • Disconnected operation for clients
    - reintegration of data from disconnected clients
    - bandwidth adaptation
  • Failure resilience
    - read/write replication servers
    - handling of network failures which partition
      the servers
  • Performance and scalability
    - client-side persistent caching of files,
      directories and attributes for high performance
    - write-back caching

  • Some more features
  • Security
    - Kerberos-like authentication
    - access control lists (ACLs)
  • Well-defined semantics of sharing
  • Freely available source code
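
Disconnected operation and reintegration can be sketched with a client that logs its writes while offline and replays them on reconnection. This is a simplified illustration and not Coda's actual protocol: conflict detection by a per-file version number, keeping only the latest offline write per file, and the names `Server`, `Client` and `reintegrate` are all assumptions made for the example.

```python
class Server:
    def __init__(self):
        self.files = {}  # name -> (version, data)

    def read(self, name):
        return self.files.get(name, (0, b""))

    def write(self, name, base_version, data):
        """Accept the write only if it was based on the current version."""
        version, _ = self.read(name)
        if version != base_version:
            return False  # the file changed since the writer last saw it
        self.files[name] = (version + 1, data)
        return True


class Client:
    def __init__(self, server):
        self.server = server
        self.connected = True
        self.cache = {}  # name -> (version, data), also usable while offline
        self.log = {}    # name -> (base_version, data) written while offline

    def read(self, name):
        if self.connected:
            self.cache[name] = self.server.read(name)
        return self.cache[name][1]  # KeyError if offline and never cached

    def write(self, name, data):
        if self.connected:
            version, _ = self.server.read(name)
            ok = self.server.write(name, version, data)
            if ok:
                self.cache[name] = (version + 1, data)
            return ok
        # Offline: update the local cache and remember the write for later.
        version, _ = self.cache.get(name, (0, b""))
        self.cache[name] = (version, data)
        self.log[name] = (version, data)
        return True

    def reintegrate(self):
        """Reconnect, replay logged writes, and return conflicting file names."""
        self.connected = True
        conflicts = []
        for name, (base_version, data) in self.log.items():
            if self.server.write(name, base_version, data):
                self.cache[name] = (base_version + 1, data)
            else:
                conflicts.append(name)
        self.log = {}
        return conflicts


server = Server()
laptop, desktop = Client(server), Client(server)
laptop.write("notes", b"v1")
laptop.connected = False              # laptop goes offline
laptop.write("notes", b"offline edit")
laptop.write("todo", b"buy milk")
desktop.write("notes", b"edit by desktop")
print(laptop.reintegrate())           # files that need conflict resolution
```

In the sketch a conflicting file is only reported; deciding how to resolve it is left to the caller.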

Hadoop [7]
  • Apache Hadoop is a free Java software
    framework that supports data-intensive
    distributed applications. It enables applications
    to work with thousands of nodes and petabytes of
    data.
  • Hadoop was inspired by Google's MapReduce and
    Google File System papers.
  • It is a top-level Apache project. Yahoo has
    been the largest contributor to the project and
    uses Hadoop extensively in its Web Search and
    Advertising businesses. IBM and Google have
    announced a major initiative to use Hadoop to
    support university courses in distributed
    computer programming.
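
The MapReduce model that inspired Hadoop can be illustrated with a word count in plain Python. This sketch runs in a single process and does not use Hadoop or its Java API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are chosen for the example. In a real cluster the map and reduce calls would run on many nodes, with the input read from the distributed file system.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one input document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())


print(word_count(["the file system", "the distributed file system"]))
```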

References
  • [1]
  • [2] Tran Doan Thanh et al., "A Taxonomy and Survey
    on Distributed File Systems," Fourth International
    Conference on Networked Computing and Advanced
    Information Management
  • [3] http//
  • [4] http//
  • [5] http//
  • [6] Sage A. Weil et al., "Ceph: A Scalable,
    High-Performance Distributed File System"
  • [7] http//

Thank You...