Bigdata Hadoop Testing - PowerPoint PPT Presentation


Slides: 19
Provided by: dineshraju


Transcript and Presenter's Notes

Title: Bigdata Hadoop Testing

Big Data Hadoop (Testing)

Compiled By
  • Difference between Traditional DB Testing and
    Big Data Testing
  • Introduction to Big Data
  • Types of data
  • Data flow in Big Data
  • Big data Hadoop (ECOSYSTEM)
  • Types of Testing in Hadoop
  • Big Data Testing Flow
  • Big Data Testing Tools

Big Data vs. DB Testing
Data
  • Traditional database testing: Deals with
    structured data only. Testers can choose a
    "Sampling" strategy done manually or an
    "Exhaustive Verification" strategy using an
    automation tool.
  • Big data testing: Deals with structured as well
    as unstructured data. Sampling can be
    challenging.
Infrastructure
  • Traditional database testing: No special
    environment is required, as the data is not huge.
  • Big data testing: Requires a special test
    environment because of the large data size and
    files (HDFS).
Validation
  • Traditional database testing: Validation is done
    using UI automation tools or Excel-based macros.
    The testing tools can be used with basic
    operating knowledge and little training.
  • Big data testing: There are no defined tools; a
    vast range of tools such as MapReduce and HiveQL
    is needed. Operating the testing tools requires
    a specific set of skills and training. The tools
    are also at a nascent stage and may gain new
    features over time.
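The difference between the "Sampling" and "Exhaustive Verification" strategies can be illustrated with a small sketch. The helper names and the data below are made up for illustration; a real comparison would read from the source system and the target store rather than from in-memory dictionaries.

```python
import random

def exhaustive_mismatches(source, target):
    """Compare every record by key and return the keys that differ."""
    keys = set(source) | set(target)
    return sorted(k for k in keys if source.get(k) != target.get(k))

def sampled_mismatches(source, target, sample_size, seed=0):
    """Compare only a random sample of source keys (hypothetical helper)."""
    rng = random.Random(seed)  # fixed seed so a test run is repeatable
    keys = sorted(source)
    sample = rng.sample(keys, min(sample_size, len(keys)))
    return sorted(k for k in sample if source[k] != target.get(k))

# Made-up data: 1,000 records, two of which are damaged in the target.
source = {i: f"row-{i}" for i in range(1000)}
target = dict(source)
target[10] = "corrupted"   # value changed
del target[999]            # record lost

all_bad = exhaustive_mismatches(source, target)
some_bad = sampled_mismatches(source, target, sample_size=50)
print("exhaustive:", all_bad)
print("sampled:", some_bad)
```

The exhaustive check always reports both damaged records. The sampled check can only report those that happen to fall inside the sample, which is one reason sampling gets harder to rely on as the data grows.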
Introduction to Big Data
  • Extremely large data sets that may be analysed
    computationally to reveal patterns, trends, and
    associations, especially relating to human
    behaviour and interactions.
  • Features of Big Data

  • Velocity - It moves extremely fast through
    various sources such as online systems, sensors,
    social media, web clickstream capture, and other
  • Variety - It's made of many types of data from
    many sources: structured and semi-structured, as
    well as unstructured (think emails, text
    messages, documents and the like)
  • Volume - It may (but not always) involve
    terabytes to petabytes (and beyond) of data
  • Complexity - It must be able to traverse
    multiple data centers, the cloud and geographical

Kinds of data handled by big data
  • Structured - has a predefined data model, for
    example an enterprise database
  • Semi-structured - has no defined model, but
    does have tags associated with the data, as in
    XML files, JSON, etc.
  • Unstructured - has no predefined data model,
    for example images, videos and audio
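As a rough illustration of the three kinds of data, the sketch below reads one tiny made-up sample of each using only the Python standard library. The sample values are invented for the example.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# Structured: every row follows a predefined schema (columns known up front).
table = io.StringIO("id,name\n1,Asha\n2,Ravi\n")
rows = list(csv.DictReader(table))

# Semi-structured: no fixed model, but keys/tags describe each value.
doc = json.loads('{"id": 3, "tags": ["a", "b"]}')
node = ET.fromstring("<user id='4'><name>Mina</name></user>")

# Unstructured: just bytes; any meaning has to be extracted by other tools.
blob = bytes([0xFF, 0xD8, 0xFF, 0xE0])  # made-up bytes standing in for an image

print(rows[0]["name"], doc["tags"], node.find("name").text, len(blob))
```

From a testing point of view, the structured rows can be checked against a schema, the semi-structured records can be checked tag by tag, and the unstructured bytes can usually only be checked for size or checksum.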

House of big data
Big Data Hadoop (Ecosystem)
Components of Hadoop
  • HDFS (Hadoop Distributed File System) - a
    Java-based file system that provides scalable and
    reliable data storage
  • FLUME - Flume is a distributed, reliable, and
    available service for efficiently collecting,
    aggregating, and moving large amounts of log
    data. It has a simple and flexible architecture
    based on streaming data flows. It is robust and
    fault tolerant with tunable reliability
    mechanisms and many failover and recovery
    mechanisms. It uses a simple extensible data
    model that allows for online analytic
  • SQOOP - Apache Sqoop is a tool designed for
    efficiently transferring bulk data between Apache
    Hadoop and structured data stores such as
    relational databases.

  • Zookeeper - a centralized service for
    maintaining configuration information, naming,
    providing distributed synchronization, and
    providing group services. All of these kinds of
    services are used in some form or another by
    distributed applications.
  • Oozie - Oozie Coordinator jobs are recurrent
    Oozie Workflow jobs triggered by time (frequency)
    and data availability. Oozie is integrated with
    the rest of the Hadoop stack, supporting several
    types of Hadoop jobs out of the box (such as Java
    map-reduce, Streaming map-reduce, Pig, Hive,
    Sqoop and Distcp) as well as system-specific
    jobs (such as Java programs and shell scripts).
  • Pig - a platform for analysing large data sets
    that consists of a high-level language for
    expressing data analysis programs, coupled with
    infrastructure for evaluating these programs.
    The salient property of Pig programs is that
    their structure is amenable to substantial
    parallelization, which in turn enables them to
    handle very large data sets.
  • Hive - Hive has three main functions: data
    summarization, query and analysis. It supports
    queries expressed in a language called HiveQL,
    which automatically translates SQL-like queries
    into MapReduce jobs executed on Hadoop. In
    addition, HiveQL supports custom MapReduce
    scripts to be plugged into queries. Hive also
    enables data serialization/deserialization and
    increases flexibility in schema design by
    including a system catalog called Hive-Metastore.
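To give a feel for what "translating a SQL-like query into MapReduce" means, here is a conceptual sketch in plain Python of the map, shuffle and reduce steps behind a query such as SELECT dept, COUNT(*) FROM employees GROUP BY dept. The table, column names and function names are assumptions for illustration; this is not the plan Hive actually generates.

```python
from collections import defaultdict

def map_phase(records):
    """Emit a (key, 1) pair per record, keyed by the GROUP BY column."""
    for rec in records:
        yield rec["dept"], 1

def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Sum the values for each key, giving COUNT(*) per group."""
    return {key: sum(values) for key, values in grouped.items()}

# Made-up input standing in for rows of an "employees" table.
employees = [
    {"name": "a", "dept": "qa"},
    {"name": "b", "dept": "dev"},
    {"name": "c", "dept": "qa"},
]
counts = reduce_phase(shuffle(map_phase(employees)))
print(counts)
```

A tester can use the same idea in reverse: compute the expected result of a query independently on a small input and compare it with what the Hive query returns.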
Types of Testing in Hadoop
Big Data Testing Flow
Understanding the Hadoop Testing Spectrum
  • What to test
  • Core components testing (HDFS, MapReduce)
  • HDFS testing
  • Essential components testing (Hive, HBase, NoSQL)
  • HBase testing
  • Hive testing
  • NoSQL testing
  • Flume and Sqoop testing

HDFS testing. HDFS, a distributed file system
designed to run on commodity hardware, uses a
master-slave architecture. To examine the highly
complex architecture of HDFS, QA teams need to
  • verify that file storage is in accordance with
    the defined block size
  • perform replication checks to ensure
    availability of the NameNode and DataNodes
  • check file load and input file split per the
    defined block size
  • execute HDFS file checks to ensure data
    integrity

Hive testing. Hive enables Hadoop to operate as a
data warehouse. It superimposes structure on data
in HDFS, then permits queries over the data using
a familiar SQL-like syntax. Hive's core
capabilities are extensible by UDFs (user-defined
functions). Since Hive is recommended for
analysis of terabytes of data, the volume and
velocity of big data are extensively covered in
Hive testing. Hive testing can be classified into
Hive functional testing and Hive UDF testing.

From a functional standpoint, Hive testing
incorporates validation of
  • successful setup of the Hive meta-store database
  • data integrity between HDFS vs. Hive and Hive
    vs. MySQL (meta-store)
  • correctness of the query and data
    transformation logic
  • checks related to the number of MapReduce jobs
    triggered for each business logic
  • export/import of data from/to Hive
  • data integrity and redundancy checks when
    MapReduce jobs fail

Hive core capabilities are made extensible by
writing custom UDFs. The UDFs must be tested for
correctness and completeness, which can be
achieved using tools such as JUnit. Testers
should also test the UDF in the Hive CLI
directly, especially if there is uncertainty
about whether a function handles data types such
as struct, map and array correctly.
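The data integrity check between HDFS and Hive mentioned above can be sketched as a comparison of row counts and an order-independent checksum. This is a minimal sketch under assumptions: the function names are hypothetical, and the two row lists stand in for rows parsed from an HDFS file and rows returned by a Hive query.

```python
import hashlib

MODULUS = 1 << 256  # keeps the running checksum at a fixed size

def fingerprint(rows):
    """Return (row_count, checksum) for an iterable of row tuples.

    Per-row SHA-256 digests are added together, so row order does not
    affect the result but duplicated or changed rows do.
    """
    count, total = 0, 0
    for row in rows:
        line = "\x01".join("<NULL>" if v is None else str(v) for v in row)
        digest = hashlib.sha256(line.encode("utf-8")).digest()
        total = (total + int.from_bytes(digest, "big")) % MODULUS
        count += 1
    return count, total

def reconcile(source_rows, target_rows):
    """Compare two row sets by count and checksum (hypothetical helper)."""
    src_count, src_sum = fingerprint(source_rows)
    tgt_count, tgt_sum = fingerprint(target_rows)
    return {
        "count_match": src_count == tgt_count,
        "checksum_match": src_sum == tgt_sum,
    }

# Made-up rows: same content in a different order, then one changed value.
hdfs_rows = [(1, "x"), (2, None)]
hive_rows = [(2, None), (1, "x")]
bad_rows = [(1, "x"), (2, "y")]

print(reconcile(hdfs_rows, hive_rows))
print(reconcile(hdfs_rows, bad_rows))
```

Counts alone would miss the changed value in the second comparison, which is why the checksum is compared as well. On real volumes the same two figures would normally be computed inside the cluster rather than on the tester's machine.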

Flume and Sqoop testing. In a data warehouse, the
database supports the import/export functions of
data. Big data is equipped with data ingestion
tools such as Flume and Sqoop, which can be used
to move data into and out of Hadoop. Instead of
writing a stand-alone application to move data to
HDFS, it's worth considering existing tools for
ingesting data, since they offer most of the
common functions. General QA checkpoints include
successfully generating streaming data from Web
sources using Flume, and checks over data
propagation from conventional data storages into
Hive and HBase, and vice versa.

HBase testing. HBase, a non-relational (NoSQL)
database that runs on top of HDFS, provides
fault-tolerant storage and quick access to large
quantities of sparse data. Its purpose is to host
very large tables with billions of rows and
millions of columns. HBase testing involves
  • validation of RowKey (PK) usage for all access
    attempts to HBase tables
  • verification of version settings in the query
    output
  • verification that the latency of individual
    reads and writes is within the threshold
  • checks on Zookeeper to ensure the coordination
    activities of the region servers
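The latency check on individual reads and writes can be sketched as follows. This is only an outline under assumptions: fake_get is a stand-in for a real HBase read through whatever client the project uses, and the 50 ms threshold is an arbitrary example value, not a recommended limit.

```python
import time

def measure_latencies(operation, keys):
    """Time each call to operation(key); return latencies in milliseconds."""
    latencies = []
    for key in keys:
        start = time.perf_counter()
        operation(key)
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies

def violations(latencies_ms, threshold_ms):
    """Return the indexes of calls that were slower than the threshold."""
    return [i for i, ms in enumerate(latencies_ms) if ms > threshold_ms]

# Stand-in for an HBase table keyed by RowKey; a real test would query
# the cluster here instead of a dictionary.
fake_table = {f"row-{i}": i for i in range(100)}

def fake_get(row_key):
    return fake_table[row_key]

latencies = measure_latencies(fake_get, list(fake_table))
slow = violations(latencies, threshold_ms=50.0)
print(len(latencies), "reads,", len(slow), "over threshold")
```

Checking each call against the threshold, rather than only the average, matches the requirement that individual reads and writes stay within the limit.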
Tools Available in the Market for Testing
  • Thank you
  • Please download Tutorials!