Introduction to Hadoop: Architecture and Components



1
Introduction to Hadoop: Architecture and Components
2
What is Hadoop?
  • An open-source framework for distributed storage
    and processing of large datasets.
  • Developed by the Apache Software Foundation.
  • Based on the MapReduce programming model.
  • Handles big data across multiple nodes in a
    cluster.
  • Scalable, fault-tolerant, and cost-effective.
3
Why Use Hadoop?
  • Scalability: Handles petabytes of data across
    many machines.
  • Fault Tolerance: Automatically recovers from
    failures.
  • Cost-Effective: Uses commodity hardware.
  • Parallel Processing: Processes data across
    multiple nodes simultaneously.
  • Supports Various Data Types: Structured,
    semi-structured, and unstructured data.
4
Hadoop Architecture Overview
  • Master-Slave Architecture
    • Master Node: Manages and coordinates the
      cluster.
    • Slave Nodes: Store data and perform
      computations.
  • Core Components
    • HDFS (Hadoop Distributed File System): Storage
      layer.
    • MapReduce: Data processing engine.
    • YARN (Yet Another Resource Negotiator): Manages
      resources.
    • Common Utilities: Shared libraries for Hadoop
      modules.
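
To make the master/slave split concrete, here is a minimal Python sketch of how a master node might record where the blocks of a file live. It is an illustration under stated assumptions, not HDFS code: `plan_blocks` and the worker names are hypothetical, and the round-robin placement is a simplification of the rack-aware policy HDFS actually uses. The 128 MB block size and replication factor of 3 are the HDFS defaults in Hadoop 2 and later.

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB: HDFS default block size (Hadoop 2+)
REPLICATION = 3                 # HDFS default replication factor


def plan_blocks(file_size, workers, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Hypothetical helper: map each block index to the workers holding a copy.

    This mimics the metadata a master node keeps. Placement is plain
    round-robin, which is an assumption made for readability.
    """
    if replication > len(workers):
        raise ValueError("need at least as many workers as replicas")
    n_blocks = math.ceil(file_size / block_size)
    placement = {}
    for block in range(n_blocks):
        # Pick `replication` distinct workers, starting at a rotating offset.
        placement[block] = [
            workers[(block + r) % len(workers)] for r in range(replication)
        ]
    return placement


workers = ["worker-1", "worker-2", "worker-3", "worker-4"]
plan = plan_blocks(1024 ** 3, workers)  # a 1 GiB file

print(len(plan))  # number of blocks
print(plan[0])    # workers holding block 0

# If one worker fails, every block still has at least two live replicas.
survivors = {b: [w for w in ws if w != "worker-1"] for b, ws in plan.items()}
print(min(len(ws) for ws in survivors.values()))
```

Because each block is stored on several workers, losing one machine leaves every block readable, which is the basis of the fault tolerance described earlier.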

5
Key Components of Hadoop
  • HDFS (Storage Layer): Stores data in a
    distributed manner using blocks.
  • MapReduce (Processing Layer): Processes data in
    parallel using map and reduce tasks.
  • YARN (Resource Management): Allocates and
    manages resources dynamically.
  • Hadoop Common: Provides utilities for all Hadoop
    modules.
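
The map and reduce tasks mentioned above can be sketched in plain Python with the classic word-count example. This runs in a single process and does not use the Hadoop API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are illustrative assumptions. In a real cluster the map and reduce calls would run in parallel on different nodes, and the framework would perform the shuffle.

```python
from collections import defaultdict


def map_phase(line):
    """Map task: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce task: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    # Each line could be handled by a separate map task.
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(pairs)
    # Each key could be handled by a separate reduce task.
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data big cluster", "data node"])
print(counts)
```

The map step only looks at one line and the reduce step only looks at one key, so neither needs shared state. That independence is what lets the work be spread across many nodes.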
6
Hadoop Ecosystem (Additional Tools)
  • Hive: SQL-like querying for big data.
  • Pig: High-level scripting language for data
    transformation.
  • HBase: NoSQL database on top of HDFS.
  • Spark: Fast in-memory processing engine.
  • Oozie: Workflow scheduling for Hadoop jobs.
  • Flume and Sqoop: Data ingestion from external
    sources.

7
Conclusion
  • Hadoop is a powerful big data framework for
    distributed storage and processing.
  • It is highly scalable, fault-tolerant, and
    cost-effective for large-scale data.
  • Its key components are HDFS (storage),
    MapReduce (processing), and YARN (resource
    management).
  • It has a rich ecosystem with tools like Hive,
    Pig, Spark, and HBase.