The Hadoop Distributed File System, by Dhruba Borthakur, and Related Work

1
The Hadoop Distributed File System, by Dhruba Borthakur, and Related Work
  • Presented by Mohit Goenka

2
The Hadoop Distributed File System Architecture
and Design
3
Requirements
  • Need to process multi-petabyte datasets
  • Expensive to build reliability into each application
  • Nodes fail every day
  • Need common infrastructure

4
Introduction
  • HDFS, the Hadoop Distributed File System, is designed to run on commodity hardware
  • Built by engineers and contributors from Yahoo!, Facebook, Cloudera, and other companies
  • Has grown into a very large project at Apache with a significant ecosystem

5
Commodity Hardware
  • Typically a 2-level architecture
  • Nodes are commodity PCs
  • 30-40 nodes/rack
  • Uplink from rack is 3-4 gigabit
  • Rack-internal is 1 gigabit

6
Goals
  • Very Large Distributed File System
  •   - 10K nodes, 100 million files, 10 PB
  • Assumes Commodity Hardware
  •   - Files are replicated to handle hardware failure
  •   - Detects failures and recovers from them
  • Optimized for Batch Processing
  •   - Data locations exposed so that computations can move to where data resides
  •   - Provides very high aggregate bandwidth
  • User space, runs on heterogeneous OS

7
HDFS Basic Architecture
[Architecture diagram: (1) the Client sends a filename to the NameNode, (2) the NameNode returns the block IDs and the DataNodes holding them, and (3) the Client reads the data directly from those DataNodes. A Secondary NameNode runs alongside the NameNode, and DataNodes report cluster membership.]
  • NameNode maps a file to a file-id and a list of DataNodes
  • DataNode maps a block-id to a physical location on disk
  • SecondaryNameNode performs a periodic merge of the Transaction Log
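As a minimal sketch of this read path using the HDFS Java API cited in the Sources slide (the path and configuration below are placeholders, not taken from the deck):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsReadSketch {
        public static void main(String[] args) throws Exception {
            // Picks up core-site.xml / hdfs-site.xml so the client knows the NameNode address
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            // open() asks the NameNode for block locations; the bytes then stream
            // directly from the DataNodes, matching steps 1-3 in the diagram above
            try (FSDataInputStream in = fs.open(new Path("/user/demo/sample.txt"))) {
                byte[] buf = new byte[4096];
                int n;
                while ((n = in.read(buf)) > 0) {
                    System.out.write(buf, 0, n);
                }
                System.out.flush();
            }
        }
    }
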
8
Distributed File System
  • Single Namespace for entire cluster
  • Data Coherency
  •   - Write-once-read-many access model
  •   - Client can only append to existing files
  • Files are broken up into blocks
  •   - Typically 128 MB block size
  •   - Each block replicated on multiple DataNodes
  • Intelligent Client
  •   - Client can find location of blocks
  •   - Client accesses data directly from DataNode (see the sketch after this list)
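A brief illustration of that client intelligence using the public FileSystem API (an assumed usage sketch; the file path is a placeholder):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockLocationSketch {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            FileStatus st = fs.getFileStatus(new Path("/user/demo/sample.txt"));
            // Block size and replication factor are per-file attributes
            System.out.println("block size = " + st.getBlockSize()
                    + ", replication = " + st.getReplication());
            // Each block reports the DataNodes currently holding a replica,
            // which is what lets computation be scheduled close to the data
            for (BlockLocation loc : fs.getFileBlockLocations(st, 0, st.getLen())) {
                System.out.println("offset " + loc.getOffset() + " -> "
                        + String.join(", ", loc.getHosts()));
            }
        }
    }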

9
HDFS Core Architecture
10
NameNode Metadata
  • Meta-data in Memory
  •   - The entire metadata is in main memory
  •   - No demand paging of meta-data
  • Types of Metadata
  •   - List of files
  •   - List of Blocks for each file
  •   - List of DataNodes for each block
  •   - File attributes, e.g. creation time, replication factor
  • A Transaction Log
  •   - Records file creations, file deletions, etc.
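As a purely hypothetical sketch of the kind of in-memory structures this implies (illustrative names only, not the actual HDFS classes):

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Hypothetical shapes for the metadata listed above; not real HDFS types.
    class BlockInfo {
        long blockId;
        List<String> dataNodes;      // DataNodes currently holding a replica of this block
    }

    class FileMeta {
        long creationTime;           // file attributes such as creation time ...
        short replicationFactor;     // ... and replication factor
        List<BlockInfo> blocks;      // ordered list of blocks making up the file
    }

    class NameNodeMetadataSketch {
        // The whole namespace lives in main memory: path -> file metadata
        Map<String, FileMeta> namespace = new HashMap<>();
    }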

11
Data Node
  • A Block Server
  •   - Stores data in the local file system (e.g. ext3)
  •   - Stores meta-data of a block (e.g. CRC)
  •   - Serves data and meta-data to Clients
  • Block Report
  •   - Periodically sends a report of all existing blocks to the NameNode
  • Facilitates Pipelining of Data
  •   - Forwards data to other specified DataNodes

12
Block Placement
  • Current Strategy
  • - One replica on local node
  • - Second replica on a remote rack
  • - Third replica on same remote rack
  • - Additional replicas are randomly placed
  • Clients read from nearest replica
  • Would like to make this policy pluggable
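A minimal sketch of that default strategy (illustrative only; the real logic lives in HDFS's pluggable block placement policy, and the node names below are assumptions):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Random;

    public class PlacementSketch {
        // Returns the DataNodes chosen for one block, following the strategy above.
        static List<String> chooseTargets(String localNode,
                                          List<String> remoteRackNodes,
                                          int replication) {
            List<String> targets = new ArrayList<>();
            targets.add(localNode);                                   // 1st replica: local node
            if (replication > 1) targets.add(remoteRackNodes.get(0)); // 2nd: a node on a remote rack
            if (replication > 2) targets.add(remoteRackNodes.get(1)); // 3rd: another node on that same rack
            Random rnd = new Random();
            for (int i = 3; i < replication; i++) {                   // extras: placed randomly
                targets.add(remoteRackNodes.get(rnd.nextInt(remoteRackNodes.size())));
            }
            return targets;
        }

        public static void main(String[] args) {
            System.out.println(chooseTargets("rackA/dn1",
                    List.of("rackB/dn7", "rackB/dn8", "rackC/dn2"), 3));
        }
    }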

13
Data Correctness
  • Use Checksums to validate data
  •   - Use CRC32
  • File Creation
  •   - Client computes a checksum per 512 bytes
  •   - DataNode stores the checksum
  • File Access
  •   - Client retrieves the data and checksum from DataNode
  •   - If validation fails, Client tries other replicas (see the sketch below)
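A small sketch of that scheme using java.util.zip.CRC32, with one checksum per 512-byte chunk (the chunk size matches the slide; everything else is an illustrative assumption):

    import java.util.zip.CRC32;

    public class ChecksumSketch {
        static final int BYTES_PER_CHECKSUM = 512;

        // Computed by the client at write time and stored by the DataNode.
        static long[] checksums(byte[] data) {
            int chunks = (data.length + BYTES_PER_CHECKSUM - 1) / BYTES_PER_CHECKSUM;
            long[] sums = new long[chunks];
            for (int i = 0; i < chunks; i++) {
                CRC32 crc = new CRC32();
                int off = i * BYTES_PER_CHECKSUM;
                int len = Math.min(BYTES_PER_CHECKSUM, data.length - off);
                crc.update(data, off, len);
                sums[i] = crc.getValue();
            }
            return sums;
        }

        // Re-checked by the client at read time; on a mismatch the client
        // would fall back to another replica of the block.
        static boolean validate(byte[] data, long[] expected) {
            long[] actual = checksums(data);
            for (int i = 0; i < expected.length; i++) {
                if (actual[i] != expected[i]) return false;
            }
            return true;
        }
    }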

14
NameNode Failure
  • A single point of failure
  • Transaction Log stored in multiple directories
  • - A directory on the local file system
  • - A directory on a remote file system (NFS/CIFS)
  • Need to develop a real HA solution

15
Data Pipelining
  • Client retrieves a list of DataNodes on which to
    place replicas of a block
  • Client writes block to the first DataNode
  • The first DataNode forwards the data to the next
    DataNode in the Pipeline
  • When all replicas are written, the Client moves on to write the next block in the file
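A toy sketch of the forwarding idea (the node names and the print statement stand in for real network writes; this is an assumption-laden illustration, not the actual DataNode protocol):

    import java.util.List;

    public class PipelineSketch {
        // Each node stores the packet, then forwards it to the next node in the pipeline.
        static void writePacket(byte[] packet, List<String> pipeline, int index) {
            if (index >= pipeline.size()) return;
            System.out.println("storing " + packet.length + " bytes on " + pipeline.get(index));
            writePacket(packet, pipeline, index + 1);   // forward downstream
        }

        public static void main(String[] args) {
            // The client only talks to the first DataNode in the list it got from the NameNode.
            List<String> pipeline = List.of("dn1:50010", "dn2:50010", "dn3:50010");
            writePacket(new byte[64 * 1024], pipeline, 0);
        }
    }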

16
Rebalancer
  • Goal: % of disk used on DataNodes should be similar
  • Usually run when new DataNodes are added
  • Cluster is online when Rebalancer is active
  • Rebalancer is throttled to avoid network
    congestion
  • Command line tool

17
Hadoop Map / Reduce
  • The Map-Reduce programming model
  •   - Framework for distributed processing of large data sets
  •   - Pluggable user code runs in generic framework
  • Common design pattern in data processing
  •   - cat * | grep | sort | unique -c | cat > file
  •   - input | map | shuffle | reduce | output
  • Natural for
  •   - Log processing
  •   - Web search indexing
  •   - Ad-hoc queries
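As a hedged, minimal example of that map / shuffle / reduce flow with the standard Hadoop Java MapReduce API (the class names and the input/output paths are illustrative assumptions):

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountSketch {
        public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            @Override
            public void map(LongWritable key, Text value, Context ctx)
                    throws IOException, InterruptedException {
                // "map": emit (word, 1) for every word in the input split
                for (String w : value.toString().split("\\s+")) {
                    if (!w.isEmpty()) ctx.write(new Text(w), ONE);
                }
            }
        }

        public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            public void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                    throws IOException, InterruptedException {
                // "reduce": after the shuffle groups by word, add up the counts (like uniq -c)
                int sum = 0;
                for (IntWritable v : values) sum += v.get();
                ctx.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count sketch");
            job.setJarByClass(WordCountSketch.class);
            job.setMapperClass(TokenMapper.class);
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path("input"));    // HDFS paths: placeholders
            FileOutputFormat.setOutputPath(job, new Path("output"));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }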

18
Data Flow
[Data flow diagram: Web Servers -> Scribe Servers -> Network Storage -> Hadoop Cluster, with results feeding Oracle RAC and MySQL.]
19
Basic Operations
  • Listing files
  •   - ./bin/hadoop fs -ls
  • Writing files
  •   - ./bin/hadoop fs -put
  • Running Map Reduce Jobs
  •   - mkdir input
  •   - cp conf/*.xml input
  •   - cat output/*
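The same basic operations can also be done programmatically through the HDFS Java API cited in the Sources slide; a sketch, with placeholder paths:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BasicOpsSketch {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            // Listing files, like "hadoop fs -ls /"
            for (FileStatus st : fs.listStatus(new Path("/"))) {
                System.out.println(st.getPath());
            }
            // Writing a file, like "hadoop fs -put local.txt /user/demo/"
            fs.copyFromLocalFile(new Path("local.txt"), new Path("/user/demo/local.txt"));
        }
    }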

20
Hadoop Ecosystem Projects
  • HBase
  • - Big Table
  • Hive
  •   - Built at Facebook, provides a SQL interface
  • Chukwa
  • - Log Processing
  • Pig
  • - Scientific data analysis language
  • Zookeeper
  • - Distributed Systems management

21
Limitations
  • The system is designed for gigabyte- to terabyte-scale datasets and can only be scaled down to a limited threshold
  • Because that threshold is quite high, the system is constrained in many ways
  • This hampers the system's efficiency during large computations or parallel data exchange

22
JSON Interface to Control HDFS
  • An Open Source Project
  • by Mohit Goenka

23
JSON Interface to Control HDFS
  • An Open Source Project by Mohit Goenka

24
JSON
  • JSON (JavaScript Object Notation) is a
    lightweight data-interchange format
  • Can be easily read and written by humans
  • Can be easily parsed by machines
  • Written in text format
  • Uses conventions similar to those of existing programming languages

25
JSON Data
  • It is based on two structures
  • A collection of name/value pairs
  • An ordered list of values
  • Concept: use the lightweight nature of JSON data to automate command execution through the HDFS interface
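A purely hypothetical illustration of such a file (the field names and commands below are assumptions, since the actual format is not shown in the deck):

    {
      "commands": [
        { "action": "mkdir", "path": "/user/demo/input" },
        { "action": "put", "source": "local.txt", "dest": "/user/demo/input/local.txt" }
      ],
      "data": {
        "owner": "demo",
        "records": [1, 2, 3]
      }
    }

In this sketch, only the data portion would remain stored, while the commands would be executed and then removed from the file, as described on the Outcome slide below.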

26
Goal
  • Designing a JSON interface to control HDFS
  • Development of two modules
  • For writing into the system
  • For reading from the system

27
Outcome
  • User can specify execution commands directly in
    the JSON file along with data
  • Only the data gets stored in the system
  • Commands are deleted from the file after execution

28
Sources and Acknowledgements
29
Sources
  • Dhruba Borthakur, Apache Hadoop Developer,
    Facebook Data Infrastructure
  • Matei Zaharia, Cloudera / Facebook / UC Berkeley
    RAD Lab
  • Devaraj Das, Yahoo! Inc. Bangalore and Apache
    Software Foundation
  • HDFS Java API
  • - http://hadoop.apache.org/core/docs/current/api/
  • HDFS source code
  • - http://hadoop.apache.org/core/version_control.html

30
Acknowledgements
  • Professor Chris Mattmann for guidance as and when required
  • Hossein (Farshad) Tajalli for his continued
    support and help throughout the project
  • All my classmates for providing valuable inputs
    throughout the work, especially through their
    presentations

31
That's All, Folks!