Title: Distributed File System
1. Distributed File System
2. Outline
- Basic Concepts
- Current Project
- Hadoop Distributed File System
- Future Work
- References
3. DFS
- A distributed implementation of the classical
time sharing model of a file system, where
multiple users share files and storage resources.
4. Key Characteristics of DFS
- Dispersion
- Clients and files
- Multiplicity
- Clients and files
5. Primary Issues of DFS
- Naming and Transparency
- Fault Tolerance
6. Naming
- Naming: the mapping between logical and physical objects.
- Multilevel mapping.
- Replicas and physical location are kept transparent to the user.
7. Naming Schemes: Three Main Approaches
- Host name + local name
  - Guarantees a unique system-wide name.
- Mount remote directories onto local directories
  - Once mounted, files can be referenced in a location-transparent manner.
- Total integration of the component file systems
  - A single global name structure.
  - If a server is unavailable, some arbitrary set of directories on different machines also becomes unavailable.
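The mount-based approach can be illustrated with a small sketch (the mount table and server names below are hypothetical, not any real DFS configuration): a table maps local path prefixes to remote (server, directory) pairs, and path resolution rewrites a location-transparent local name into a server-relative one.

```python
# Minimal sketch of mount-style name resolution (hypothetical mount table).
MOUNT_TABLE = {
    "/mnt/projects": ("fileserver1", "/export/projects"),
    "/mnt/home":     ("fileserver2", "/export/home"),
}

def resolve(path):
    """Map a local path to (host, remote path); longest prefix wins."""
    for prefix, (host, remote_dir) in sorted(
            MOUNT_TABLE.items(), key=lambda kv: -len(kv[0])):
        if path == prefix or path.startswith(prefix + "/"):
            return host, remote_dir + path[len(prefix):]
    return None, path  # not under any mount point: a purely local file

print(resolve("/mnt/home/alice/notes.txt"))
# -> ('fileserver2', '/export/home/alice/notes.txt')
```

Clients only ever see the left-hand names, which is what makes access under a mount point location-transparent; the table is the single place that knows where the files physically live.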
8. Transparency (1)
- Login Transparency: a user can log in at any host with a uniform login procedure and perceive a uniform view of the file system.
- Access Transparency: a client process on a host has a uniform mechanism to access all files in the system, regardless of whether they are on a local or a remote host.
- Location Transparency: the names of files do not reveal their physical location.
9. Transparency (2)
- Concurrency Transparency: an update to a file should not affect the correct execution of other processes that are concurrently sharing the file.
- Replication Transparency: files may be replicated to provide redundancy for availability, and also to permit concurrent access for efficiency.
10. Fault Tolerance
- Stateful vs. stateless service
  - Does the server maintain information on its clients?
- File replication
11. Distinctions Between Stateful and Stateless Service
- Failure recovery:
  - A stateful server loses all its volatile state in a crash.
  - With a stateless server, the effects of server failure and recovery are almost unnoticeable.
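The distinction above can be sketched in a few lines of Python (a toy in-memory model, not any real file-service protocol; all class and method names here are hypothetical): the stateful server keeps a volatile per-client cursor that vanishes in a crash, while the stateless server requires every request to be self-describing, so a restart goes unnoticed.

```python
# Toy sketch of stateful vs. stateless file service (hypothetical classes).
class StatefulServer:
    def __init__(self, files):
        self.files = files
        self.cursors = {}                 # volatile per-client state

    def open(self, client, name):
        self.cursors[client] = (name, 0)  # server remembers the position

    def read(self, client, n):
        name, pos = self.cursors[client]  # raises KeyError after a crash
        data = self.files[name][pos:pos + n]
        self.cursors[client] = (name, pos + n)
        return data

    def crash(self):
        self.cursors = {}                 # all volatile state is lost


class StatelessServer:
    def __init__(self, files):
        self.files = files                # no per-client state at all

    def read(self, name, pos, n):         # request carries name and offset
        return self.files[name][pos:pos + n]


files = {"a.txt": b"hello world"}
print(StatelessServer(files).read("a.txt", 6, 5))  # works the same before
                                                   # and after any "crash"
```

After `crash()`, the stateful server cannot serve the client's next `read` without re-opening the file, which is exactly the recovery burden the slide describes.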
12. File Replication
- Several copies of a file's contents at different locations enable multiple servers to share the load of providing the service.
- The naming scheme maps a replicated file name to a particular replica.
- Updates must be propagated to every replica.
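One simple way to map a replicated file name to a particular replica is to hash the client identity over the replica list, so that different clients' reads spread across the servers. This is a minimal illustrative sketch (the table and policy are hypothetical, not taken from any particular DFS):

```python
# Sketch: map a replicated file name to one replica (hypothetical policy).
import hashlib

REPLICAS = {
    "/data/big.log": ["server1", "server2", "server3"],
}

def pick_replica(name, client_id):
    """Deterministically spread clients across a file's replicas."""
    replicas = REPLICAS[name]
    h = int(hashlib.md5(f"{name}:{client_id}".encode()).hexdigest(), 16)
    return replicas[h % len(replicas)]
```

Because the choice is deterministic per client, a client keeps hitting the same replica (good for caching), while the population of clients as a whole shares the load.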
13. Current Project
- HDFS: the Hadoop Distributed File System
- A distributed, parallel, fault-tolerant file system, designed to reliably store very large files across machines in a large cluster.
- Efficient, reliable, and open source.
14. Hadoop is a framework for running applications on large clusters built of commodity hardware. The Hadoop framework transparently provides applications with both reliability and data motion. Hadoop implements a computational paradigm named Map/Reduce, in which the application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster. In addition, it provides a distributed file system (HDFS) that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster. Both Map/Reduce and the distributed file system are designed so that node failures are handled automatically by the framework.
15. HDFS
- Hadoop's Distributed File System is designed to reliably store very large files across machines in a large cluster. It is inspired by the Google File System. HDFS stores each file as a sequence of blocks; all blocks in a file except the last block are the same size. Blocks belonging to a file are replicated for fault tolerance. The block size and replication factor are configurable per file. Files in HDFS are "write once" and have strictly one writer at any time.
- Hadoop Distributed File System goals:
  - Store large data sets
  - Cope with hardware failure
  - Emphasize streaming data access
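The block layout described on this slide can be sketched directly: a file of a given length splits into fixed-size blocks, all equal except (possibly) the last. The numbers below are illustrative only, not HDFS's actual default block size:

```python
# Sketch: split a file of `length` bytes into HDFS-style blocks --
# every block is `block_size` bytes except (possibly) the last.
def block_sizes(length, block_size):
    full, rest = divmod(length, block_size)
    return [block_size] * full + ([rest] if rest else [])

print(block_sizes(350, 128))   # [128, 128, 94]
print(block_sizes(256, 128))   # [128, 128] -- last block happens to be full
```

Since block size is configurable per file, the same function applies whatever value an administrator picks.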
16. Architecture
- Like Hadoop Map/Reduce, HDFS follows a master/slave architecture. An HDFS installation consists of a single Namenode, a master server that manages the filesystem namespace and regulates access to files by clients. In addition, there are a number of Datanodes, one per node in the cluster, which manage storage attached to the nodes they run on. The Namenode makes filesystem namespace operations such as opening, closing, and renaming files and directories available via an RPC interface. It also determines the mapping of blocks to Datanodes. The Datanodes are responsible for serving read and write requests from filesystem clients; they also perform block creation, deletion, and replication upon instruction from the Namenode.
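The division of labor above can be sketched as a toy in-memory model (hypothetical classes, not real HDFS code): the Namenode holds only metadata — the namespace and the block-to-Datanode map — and clients then fetch the actual bytes directly from Datanodes.

```python
# Toy model of the HDFS master/slave split (hypothetical, in-memory).
class Namenode:
    """Holds namespace + block locations: metadata only, no file bytes."""
    def __init__(self):
        self.namespace = {}   # file name -> ordered list of block ids
        self.locations = {}   # block id  -> list of datanode names

    def get_block_locations(self, name):
        return [(b, self.locations[b]) for b in self.namespace[name]]


class Datanode:
    """Stores actual block bytes and serves reads."""
    def __init__(self):
        self.blocks = {}      # block id -> bytes

    def read(self, block_id):
        return self.blocks[block_id]


nn = Namenode()
dns = {"dn1": Datanode(), "dn2": Datanode()}
nn.namespace["/f"] = ["blk_1", "blk_2"]
nn.locations = {"blk_1": ["dn1", "dn2"], "blk_2": ["dn2"]}
dns["dn1"].blocks["blk_1"] = b"hello "
dns["dn2"].blocks.update({"blk_1": b"hello ", "blk_2": b"world"})

# Client read path: ask the Namenode where the blocks are, then
# contact the Datanodes directly for the data.
data = b"".join(dns[locs[0]].read(b)
                for b, locs in nn.get_block_locations("/f"))
print(data)
```

Keeping the Namenode out of the data path is what lets aggregate read bandwidth scale with the number of Datanodes.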
18.
- Naming: central metadata server.
- Synchronization: write-once-read-many; locks on objects are given to clients using leases.
- Consistency and replication: server-side replication, asynchronous replication, checksums.
- Fault tolerance: failure as the norm.
- Security: no dedicated security mechanism.
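The checksum point above can be sketched minimally (illustrative only: HDFS does keep checksums for block data, but the exact scheme differs from this toy example): a checksum computed at write time lets a reader detect a corrupted block and fall back to another replica.

```python
# Sketch: detect block corruption with a checksum (illustrative only).
import zlib

def store(block: bytes):
    """Return the block together with its CRC-32 checksum."""
    return block, zlib.crc32(block)

def verify(block: bytes, checksum: int) -> bool:
    """Recompute the checksum on read and compare."""
    return zlib.crc32(block) == checksum

data, cksum = store(b"block contents")
print(verify(data, cksum))         # True: block is intact
print(verify(b"corrupted!", cksum))  # False: read another replica instead
```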
19. Future Work
- Robustness of the data-sharing model
- The topics of the preceding sections: architecture, naming, synchronization, availability, heterogeneity, and support for databases
- Security
20. References
- [1] Thanh, T.D., Mohan, S., Choi, E., SangBum Kim, Pilsung Kim. 2008. Networked Computing and Advanced Information Management. "A Taxonomy and Survey on Distributed File Systems."
- [2] Randy Chow. 1997. Distributed Operating Systems and Algorithms.
- [3] Eliezer Levy, Abraham Silberschatz. December 1990. ACM Computing Surveys (CSUR), Volume 22, Issue 4. "Distributed file systems: concepts and examples."
- [4] http://hadoop.apache.org/common/docs/current/hdfs_design.html (Introduction)
- [5] http://www.snia.org/events/wintersymp2009/cloud/dhruba_hadoop_snia.pdf
21.
- [6] http://en.wikipedia.org/wiki/List_of_file_systems#Distributed_file_systems
- [7] http://en.wikipedia.org/wiki/Hadoop#Hadoop_Distributed_File_System
- [8] http://www.cs.gsu.edu/~cscyqz/courses/aos/slides08/ch6.1-Fall08.pptx
22. Q&A
23. Thank you!