The solution for bigdata - Hadoop - PowerPoint PPT Presentation

Title:

The solution for bigdata - Hadoop

Description:

An introduction to the Hadoop framework and a brief description of its structure and how it works


1
The solution for big data: Hadoop
  • J. Sai Krishna and G. Sravya Lahari
  • 2nd B.Tech (CSE)
  • K.O.R.M College of Engineering
  • Kadapa

2
Contents
  1. Data trends in storing data
  2. Big data problems in the IT industry
  3. Introduction to Hadoop
  4. HDFS (Hadoop Distributed File System)
  5. MapReduce
  6. Prominent users of Hadoop
  7. Conclusion

3
Data trends in storing data
  • What is data? Any real-world symbol (character,
    numeric, special character), or a group of
    them, is said to be data. It may be visual,
    audio, textual, etc.

4
Big data
  • What is big data? In IT, it is a collection of
    data sets so large and complex that it becomes
    difficult to process using on-hand database
    management tools or traditional data processing
    applications.
  • As of 2012, limits on the size of data sets that
    are feasible to process in reasonable time were
    on the order of exabytes of data.

5
Big data and its problems
  • Every day, about 0.5 petabytes of updates are
    made to Facebook, including 40 million photos.
  • Every day, YouTube receives enough uploaded video
    to be watched continuously for one year.
  • Limitations are encountered due to large data
    sets in many areas, including meteorology,
    genomics, complex physics simulations, and
    biological and environmental research.
  • Large data sets also affect Internet search,
    finance, and business informatics.
  • The challenges include capture, retrieval,
    storage, search, sharing, analysis, and
    visualization.

6
HADOOP
  • Then what could be the solution for big data?

7
What is Hadoop?
  • It is open-source software written in Java.
  • Hadoop software library is a framework that
    allows for the distributed processing of large
    data sets across clusters of computers using
    simple programming models.
  • It is designed to scale up from single servers to
    thousands of machines, each offering local
    computation and storage.

8
The project includes these modules
  • Hadoop Common
  • Hadoop Distributed File System (HDFS)
  • Hadoop MapReduce

9
1. Hadoop Common
  • It provides access to the filesystems supported
    by Hadoop.
  • The Hadoop Common package contains the necessary
    JAR files and scripts needed to start Hadoop.
  • The package also provides source code,
    documentation, and a contribution section which
    includes projects from the Hadoop community
    (Avro, Cassandra, Chukwa, HBase, Hive, Mahout,
    Pig, ZooKeeper).

10
2. Hadoop Distributed File System (HDFS)
  • Hadoop uses HDFS, a distributed file system based
    on GFS (Google File System), as its shared
    filesystem.
  • The HDFS architecture divides files into large
    chunks (64 MB; this is configurable) distributed
    across data servers.
  • It has a namenode and datanodes.
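
To make the chunking concrete, here is a minimal sketch of how a file's size maps to a list of chunks. It only does the arithmetic; it is not HDFS code. The function name `split_into_blocks` and the 200 MB example file are assumptions for illustration, and the 64 MB figure is the one quoted above.

```python
BLOCK_SIZE_MB = 64  # chunk size quoted on the slide; configurable in HDFS


def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the sizes (in MB) of the chunks a file would be divided into."""
    if file_size_mb <= 0:
        return []
    full_blocks = int(file_size_mb // block_size_mb)
    remainder = file_size_mb - full_blocks * block_size_mb
    sizes = [block_size_mb] * full_blocks
    if remainder > 0:
        # The last chunk only holds whatever data is left over.
        sizes.append(remainder)
    return sizes


# Hypothetical 200 MB file: three full chunks plus one partial chunk.
blocks = split_into_blocks(200)
print(len(blocks), blocks)
```

Each of these chunks would then be stored on a datanode, while the namenode records which chunks make up the file.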

11
What does HDFS contain?
  • HDFS consists of multiple independent namenodes
    (namespaces), and they are federated.
  • The datanodes are used as common storage for
    blocks by all the namenodes.
  • Each datanode registers with all the namenodes in
    the cluster.
  • Datanodes send periodic heartbeats and block
    reports and handle commands from the namenodes.
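
The registration and heartbeat behaviour described above can be illustrated with a toy model. This is a simplified sketch, not the HDFS protocol: the class names, the method names, the 30-second timeout, and the identifiers `nn1`, `dn-a`, and `blk_1` are all assumptions made for the example, and time is passed in explicitly so the run is deterministic.

```python
class Namenode:
    """Toy namenode that tracks which datanodes have reported recently."""

    def __init__(self, name, timeout=30.0):
        self.name = name
        self.timeout = timeout    # assumed: seconds of silence before a node counts as dead
        self.last_seen = {}       # datanode id -> time of last heartbeat
        self.block_reports = {}   # datanode id -> set of block ids it holds

    def register(self, datanode_id, now):
        self.last_seen[datanode_id] = now
        self.block_reports[datanode_id] = set()

    def heartbeat(self, datanode_id, now, blocks=None):
        self.last_seen[datanode_id] = now
        if blocks is not None:
            self.block_reports[datanode_id] = set(blocks)

    def live_datanodes(self, now):
        return sorted(d for d, t in self.last_seen.items()
                      if now - t <= self.timeout)


class Datanode:
    """Toy datanode that talks to every namenode in the cluster."""

    def __init__(self, datanode_id, namenodes):
        self.datanode_id = datanode_id
        self.namenodes = namenodes
        self.blocks = set()

    def register_all(self, now):
        for nn in self.namenodes:
            nn.register(self.datanode_id, now)

    def send_heartbeat(self, now):
        # Heartbeat and block report are combined here for brevity.
        for nn in self.namenodes:
            nn.heartbeat(self.datanode_id, now, self.blocks)


namenodes = [Namenode("nn1"), Namenode("nn2")]
dn_a = Datanode("dn-a", namenodes)
dn_b = Datanode("dn-b", namenodes)
for dn in (dn_a, dn_b):
    dn.register_all(now=0.0)

dn_a.blocks.add("blk_1")
dn_a.send_heartbeat(now=40.0)  # dn-b stays silent after registering

for nn in namenodes:
    print(nn.name, nn.live_datanodes(now=45.0))
```

In this run both namenodes end up considering only `dn-a` alive, because `dn-b` has not sent a heartbeat within the assumed timeout.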

12
Structure of the Hadoop system

13
Master Node
  • Master node
  • Keeps track of the namespace and metadata about
    items
  • Keeps track of MapReduce jobs in the system
  • Hadoop is currently configured with centurion064
    as the master node.
  • Hadoop is locally installed on each system.
  • The installed location is
    /localtmp/hadoop/hadoop-0.15.3

14
Slave Nodes
  • Slave nodes
  • Manage blocks of data sent from the master node
  • In GFS terms, these are the chunkservers
  • Currently, centurion060 and centurion064 are the
    two slave nodes being used.
  • Slave nodes store their data in
    /localtmp/hadoop/hadoop-dfs (this is
    automatically created by the DFS)
  • Once you use the DFS, relative paths are from
    /usr/your usr id

15
Advantages and Limitations of HDFS
  • It reduces traffic in job scheduling.
  • File access can be achieved through the native
    Java API or in a language of the user's choice
    (C++, Java, Python, PHP, Ruby, Erlang, Perl,
    Haskell, C#, Cocoa, Smalltalk, and OCaml).
  • It cannot be directly mounted by an existing
    operating system.
  • It requires a UNIX or Linux system.

16
3. Hadoop MapReduce system
  • The Hadoop MapReduce framework harnesses a
    cluster of machines and executes user-defined
    MapReduce jobs across the nodes in the cluster.
  • A MapReduce computation has two phases:
  • a map phase and
  • a reduce phase.

17
Usage of the map and reduce methods

18
Word count over a given set of strings
  • Input: "We love India" and "We play tennis"
  • Map output: We 1, love 1, India 1, We 1, play 1,
    tennis 1
  • Reduce output: love 1, India 1, We 2, tennis 1,
    play 1
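
The word count slide can be reproduced with a small in-memory sketch of the two phases. This is plain Python, not the Hadoop API; the function names `map_phase`, `reduce_phase`, and `word_count` are assumptions chosen for the example. The input strings are the two from the slide.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word, 1) for word in line.split()]


def reduce_phase(word, counts):
    """Sum all the counts collected for one word."""
    return word, sum(counts)


def word_count(lines):
    # Group the map output by key before reducing; in a real cluster
    # this grouping is done by the framework between the two phases.
    grouped = defaultdict(list)
    for line in lines:
        for word, one in map_phase(line):
            grouped[word].append(one)
    return dict(reduce_phase(word, counts) for word, counts in grouped.items())


result = word_count(["We love India", "We play tennis"])
print(result)
```

The map step never looks at more than one line, and the reduce step never looks at more than one word, which is what lets a framework run many copies of each in parallel.
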
19
MapReduce with no reduce tasks

20
  • MapReduce with two reduce tasks - Automatic
    Parallel Execution in MapReduce

21
MapReduce - lifecycle
Map function
Map phase
Reduce phase
22
Shuffle and sort in MapReduce with multiple reduce tasks
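
A rough sketch of what shuffle and sort does when there are two reduce tasks: every map output pair is routed to one reducer according to its key, and each reducer sees its keys in sorted order. This is an illustration only. The function names are hypothetical, and `zlib.crc32` is swapped in here as a stable hash so the run is repeatable; it is not claimed to be the hash Hadoop uses.

```python
import zlib
from collections import defaultdict


def partition(key, num_reducers):
    """Pick the reducer for a key using a stable hash (assumed scheme)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle_and_sort(map_outputs, num_reducers):
    """Group values by key per reducer, with keys sorted inside each group."""
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in map_outputs:
        partitions[partition(key, num_reducers)][key].append(value)
    return [sorted(p.items()) for p in partitions]


# Map output from the word count example.
map_outputs = [("We", 1), ("love", 1), ("India", 1),
               ("We", 1), ("play", 1), ("tennis", 1)]

reducer_inputs = shuffle_and_sort(map_outputs, num_reducers=2)
reducer_outputs = [[(key, sum(values)) for key, values in part]
                   for part in reducer_inputs]

for i, part in enumerate(reducer_outputs):
    print("reducer", i, part)
```

Because all pairs with the same key land on the same reducer, the two reducers can run independently and their outputs can simply be concatenated.
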
23
Prominent users of HADOOP
  • Amazon: 100 nodes
  • Facebook: two clusters of 8000 and 3000 nodes
  • Adobe: 80-node system
  • eBay: 532-node cluster
  • Yahoo: cluster of about 4500 nodes
  • IIIT Hyderabad: 30-node cluster

24
Achievements
  • March 2011 - Apache Hadoop takes top prize at
    Media Guardian Innovation Award
  • July 2012 - Hadoop Wins Terabyte Sort Benchmark

25
Conclusion
  • It reduces traffic in capture, storage, search,
    sharing, analysis, and visualization.
  • A huge amount of data can be stored and large
    computations can be performed in a single cluster,
    safely and securely, at low cost.
  • Big data and big data solutions are among the
    burning issues in the present IT industry, so
    working on them will surely make you more useful
    to it.

26
Thank you
  • Any queries?