Big Data Hadoop Training (1) - PowerPoint PPT Presentation

About This Presentation
Title:

Big Data Hadoop Training (1)

Description:

Big Data and Hadoop training is designed to provide the knowledge and skills needed to become a successful Hadoop Developer. The course pairs in-depth coverage of core concepts with hands-on implementation of varied industry use cases. SoftwareSkool provides various online training courses that are highly in demand in the present market. We designed our e-learning platform around proven teaching methods so that every individual masters the material by the end of the course. Contact Us: Ph No: 4097912424 – PowerPoint PPT presentation


Transcript and Presenter's Notes

Title: Big Data Hadoop Training (1)


1
(No Transcript)
2
Content
  • What Is Big Data
  • What Is Hadoop
  • Characteristics of Big Data
  • Characteristics of Hadoop
  • Big Data Storage Considerations
  • Understanding Hadoop Technology and storage
  • Big Data Technologies
  • Hadoop HDFS Architecture
  • Why Big Data
  • Why Hadoop
  • Future of Big Data
  • Future of Hadoop

3
What is Big Data
  • Big data is a collection of large datasets that cannot be
    processed using traditional computing techniques. Big data
    is not merely data; it has become a complete subject in its
    own right, involving various tools, techniques and
    frameworks.

4
What is Hadoop
  • Hadoop is a free, Java-based programming
    framework that supports the processing of large
    data sets in a distributed computing environment.
    It is part of the Apache project sponsored by the
    Apache Software Foundation.
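As an illustration of what Hadoop code looks like, below is a minimal sketch of a word-count mapper written against the standard org.apache.hadoop.mapreduce API; the class name is our own, and a matching reducer and job driver are sketched later under Hadoop HDFS Architecture.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Minimal sketch: emits (word, 1) for every word in every input line.
// The framework runs many copies of this mapper in parallel, each close
// to the HDFS block it is assigned.
public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}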

5
Characteristics of Big Data
  • We have all heard about the 3Vs of big data: Volume,
    Variety and Velocity. However, Inderpal Bhandari, Chief
    Data Officer at Express Scripts, noted in his presentation
    at the Big Data Innovation Summit in Boston that there are
    additional Vs that IT, business and data scientists should
    be concerned with, most notably big data Veracity. The
    three core characteristics are
  • Volume
  • Velocity
  • Variety

6
Volume
  • Volume refers to the vast amounts of data generated every
    second. We are no longer talking about terabytes but about
    zettabytes or brontobytes. If we take all the data created
    in the world from the beginning of time until 2008, the
    same amount of data will soon be generated every minute.
    This makes most data sets too large to store and analyze
    using traditional database technology. New big data tools
    use distributed systems so that we can store and analyze
    data across databases dotted around anywhere in the world.

7
Velocity 
  • Velocity refers to the speed at which new data is
    generated and the speed at which data moves around. Just
    think of social media messages going viral in seconds.
    Technology now allows us to analyze the data while it is
    being generated (sometimes referred to as in-memory
    analytics), without ever putting it into databases.
    Velocity is the speed at which data is created, stored,
    analyzed and visualized. In the past, when batch
    processing was common practice, it was normal to receive
    an update from the database every night or even every
    week. Computers and servers required substantial time to
    process the data and update the databases. In the big
    data era, data is created in real time or near real time.
    With the availability of Internet-connected devices,
    wireless or wired, machines and devices can pass on their
    data the moment it is created.

8
Variety
  • Variety refers to the different types of data we can now
    use. In the past we focused only on structured data that
    fit neatly into tables or relational databases, such as
    financial data. In fact, 80% of the world's data is
    unstructured (text, images, video, voice, and so on). With
    big data technology we can now analyze and bring together
    data of different types, such as messages, social media
    conversations, photos, sensor data, and video or voice
    recordings. Previously, all data that was created was
    structured data that fit neatly into columns and rows, but
    those days are over. Nowadays, 90% of the data generated
    by an organization is unstructured. Data today comes in
    many different formats: structured data, semi-structured
    data, unstructured data and even complex structured data.
    This wide variety of data requires a different approach
    and different techniques to store all the raw data.

9
(No Transcript)
10
Characteristics of Hadoop
  • Hadoop provides a reliable shared storage (HDFS)
    and analysis system (MapReduce).
  • Hadoop is highly scalable and unlike the
    relational databases, Hadoop scales linearly. Due
    to linear scale, a Hadoop Cluster can contain
    tens, hundreds, or even thousands of servers.
  • Hadoop is very cost effective as it can work with
    commodity hardware and does not require expensive
    high-end hardware.
  • Hadoop is highly flexible and can process both
    structured as well as unstructured data.
  • Hadoop has built-in fault tolerance. Data is replicated
    across multiple nodes (the replication factor is
    configurable, as shown in the sketch after this list), and
    if a node goes down, the required data can be read from
    another node that holds a copy of it. Hadoop also ensures
    that the replication factor is maintained: if a node goes
    down, the data it held is re-replicated to other available
    nodes.
  • Hadoop works on the principle of write once, read many
    times.
  • Hadoop is optimized for large and very large data sets.
    For instance, a small amount of data such as 10 MB, when
    fed to Hadoop, generally takes more time to process than
    it would on a traditional system.
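The replication factor mentioned above is an ordinary HDFS setting. The following sketch assumes a reachable HDFS cluster and uses a hypothetical file path; it shows the two usual ways to control replication from Java, via the dfs.replication configuration property and per file with FileSystem.setReplication.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal sketch of adjusting the HDFS replication factor.
public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Default replication for files created through this configuration.
        conf.set("dfs.replication", "3");

        FileSystem fs = FileSystem.get(conf);

        // Hypothetical path used for illustration only.
        Path path = new Path("/data/events/2015/part-00000");

        // Ask the NameNode to keep two copies of this particular file.
        fs.setReplication(path, (short) 2);

        System.out.println("Replication of " + path + " is now "
                + fs.getFileStatus(path).getReplication());
    }
}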

11
(No Transcript)
12
Big Data Storage Considerations
  • Our experience building an industry-leading Big Data
    storage platform has taught us a few things about the
    storage challenges organizations face. Customers have
    shared with us some of the general pros and cons of the
    storage options they considered when choosing a storage
    platform.

13
Open Source
  • Pros
  • Free with community support
  • Scalable
  • Runs on inexpensive commercial-off-the-shelf
    (COTS) hardware
  • Cons
  • Community support is not sufficient and there is
    a reliance on outside consultancy
  • Investment to build and maintain in-house
    competency
  • In-house support, testing and tuning
  • No guaranteed SLA
  • Long lead time to get into production

14
(No Transcript)
15
Conventional Storage Systems
  • Pros
  • Enterprise-class support and quality
  • Long term lifecycle/release management
  • Appliance based model
  • Cons
  • Expensive license and support
  • Locked-in/proprietary hardware
  • Scalability and manageability issues such as file
    system, namespace, data protection, disaster
    prevention, etc

16
(No Transcript)
17
Software-defined Storage
  • Pros
  • Enterprise-class support and quality
  • Long term lifecycle/release management
  • Massively scalable: built for today's and emerging
    workloads
  • Easy to manage: self-healing, non-disruptive upgrades
  • Runs on inexpensive COTS hardware
  • Cons
  • Some solutions require additional software with a
    separate license
  • Scalability varies with solutions
  • Data migration is required with some solutions

18
(No Transcript)
19
Understanding Hadoop technology and storage
  • Because Hadoop stores three copies of each piece of data,
    storage in a Hadoop cluster must be able to accommodate a
    large number of files. Traditional storage systems may not
    always be able to support the Hadoop architecture. The
    links below explain how Hadoop clusters and HDFS work with
    various storage systems, including network-attached
    storage (NAS), SANs and object storage.
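Because every block is replicated, the raw capacity a cluster needs is roughly the logical data size multiplied by the replication factor. The small sketch below (the directory name is hypothetical) walks an HDFS directory and estimates that raw footprint.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Rough estimate of raw storage consumed by a directory in HDFS:
// logical bytes of each file multiplied by its replication factor.
public class RawFootprint {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        long logical = 0, raw = 0;

        // Hypothetical directory, for illustration only.
        for (FileStatus status : fs.listStatus(new Path("/data/logs"))) {
            if (status.isFile()) {
                logical += status.getLen();
                raw += status.getLen() * status.getReplication();
            }
        }
        System.out.printf("logical = %d bytes, raw (with replicas) = %d bytes%n",
                logical, raw);
    }
}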

20
  • Software vendors have gotten the message that Hadoop is
    hot -- and many are responding by releasing Hadoop
    connectors that are designed to make it easier for users
    to transfer information between traditional relational
    databases and the open source distributed processing
    system.
  • Oracle, Microsoft and IBM are among the vendors that have
    begun offering Hadoop connector software as part of their
    overall big data management strategies. But it isn't just
    the relational database management system (RDBMS) market
    leaders that are getting in on the act.
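The connectors themselves are vendor products, but the underlying idea is simply moving rows between a relational database and HDFS. Here is a minimal, vendor-neutral sketch of that idea using plain JDBC and the HDFS API; the connection string, table and output path are all hypothetical, and real connectors add parallelism, type mapping and incremental loads.

import java.io.BufferedWriter;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Copies one table from a relational database into a CSV file on HDFS.
// This only sketches what a connector automates for you.
public class JdbcToHdfs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        try (Connection db = DriverManager.getConnection(
                     "jdbc:mysql://dbhost:3306/sales", "user", "password");
             Statement stmt = db.createStatement();
             ResultSet rows = stmt.executeQuery("SELECT id, amount FROM orders");
             FileSystem fs = FileSystem.get(conf);
             BufferedWriter out = new BufferedWriter(new OutputStreamWriter(
                     fs.create(new Path("/staging/orders.csv")),
                     StandardCharsets.UTF_8))) {
            while (rows.next()) {
                out.write(rows.getLong("id") + "," + rows.getBigDecimal("amount"));
                out.newLine();
            }
        }
    }
}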

21
(No Transcript)
22
Big Data Technologies
  • Big data is a broad term for data sets so large or
    complex that traditional data processing applications are
    inadequate.
  • Nine commonly cited big data technologies are
  • Crowdsourcing
  • Data fusion
  • Data integration
  • Genetic algorithm
  • Machine learning
  • Natural language processing
  • Signal processing
  • Time series
  • Simulation

23
Crowdsourcing
  • Crowdsourcing, a modern business term coined in 2005, is
    defined by Merriam-Webster as the process of obtaining
    needed services, ideas, or content by soliciting
    contributions from a large group of people, and especially
    from an online community, rather than from traditional
    employees or suppliers. A portmanteau of "crowd" and
    "outsourcing", its more specific definitions are still
    heavily debated.

24
(No Transcript)
25
Data fusion
  • Data fusion is the process of integrating multiple data
    sources and knowledge representing the same real-world
    object into a consistent, accurate, and useful
    representation. Fusing data from two sources (dimension 1
    and dimension 2) can yield a classifier superior to any
    classifier based on dimension 1 or dimension 2 alone.
    Data fusion processes are often categorized as low,
    intermediate or high, depending on the processing stage at
    which fusion takes place. Low-level data fusion combines
    several sources of raw data to produce new raw data. The
    expectation is that the fused data is more informative and
    synthetic than the original inputs.
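As a toy example of low-level fusion, the sketch below (plain Java, all numbers invented) combines two noisy measurements of the same quantity with an inverse-variance weighted average, so the fused value is more informative than either input alone.

// Toy low-level data fusion: combine two noisy measurements of the same
// real-world quantity using inverse-variance weighting.
public class FusionExample {

    static double fuse(double value1, double variance1,
                       double value2, double variance2) {
        double w1 = 1.0 / variance1;
        double w2 = 1.0 / variance2;
        return (w1 * value1 + w2 * value2) / (w1 + w2);
    }

    public static void main(String[] args) {
        // Two sensors measuring the same temperature (values are invented).
        double fused = fuse(21.4, 0.5,   // sensor A: reading, variance
                            22.1, 2.0);  // sensor B: reading, variance
        System.out.println("Fused estimate: " + fused);
        // The fused estimate leans toward sensor A, the less noisy source.
    }
}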

26
(No Transcript)
27
Data integration
  • Data integration involves combining data residing in
    different sources and providing users with a unified view
    of this data. This process becomes significant in a
    variety of situations, including both commercial (when two
    similar companies need to merge their databases) and
    scientific (combining research results from different
    bioinformatics repositories, for example) domains. Data
    integration appears with increasing frequency as the
    volume of data and the need to share existing data
    explode. It has become the focus of extensive theoretical
    work, and numerous open problems remain unsolved.
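A minimal illustration of the unified-view idea, with both data sources faked as in-memory maps and all field names hypothetical: records describing the same customer in two systems are joined on a shared key.

import java.util.HashMap;
import java.util.Map;

// Toy data integration: join two sources that describe the same entities
// on a shared key to present one unified view.
public class IntegrationExample {
    public static void main(String[] args) {
        // Source 1: CRM system (customer id -> name).
        Map<Integer, String> crm = new HashMap<>();
        crm.put(1, "Alice");
        crm.put(2, "Bob");

        // Source 2: billing system (customer id -> outstanding balance).
        Map<Integer, Double> billing = new HashMap<>();
        billing.put(1, 120.50);
        billing.put(2, 0.0);

        // Unified view keyed on customer id.
        for (Map.Entry<Integer, String> entry : crm.entrySet()) {
            Integer id = entry.getKey();
            System.out.printf("customer=%d name=%s balance=%.2f%n",
                    id, entry.getValue(), billing.getOrDefault(id, 0.0));
        }
    }
}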

28
(No Transcript)
29
Genetic Algorithm
  • In the field of artificial intelligence, a genetic
    algorithm (GA) is a search heuristic that mimics the
    process of natural selection. This heuristic (also
    sometimes called a metaheuristic) is routinely used to
    generate useful solutions to optimization and search
    problems. Genetic algorithms belong to the larger class of
    evolutionary algorithms (EA), which generate solutions to
    optimization problems using techniques inspired by natural
    evolution, such as inheritance, mutation, selection and
    crossover.
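A compact, self-contained sketch of the ingredients listed above (selection, crossover, mutation): it solves the toy OneMax problem of maximizing the number of 1 bits in a string, with all parameter values chosen arbitrarily.

import java.util.Arrays;
import java.util.Random;

// Tiny genetic algorithm for the OneMax toy problem: evolve a bit string
// whose fitness is simply the number of 1 bits it contains.
public class SimpleGeneticAlgorithm {
    static final int GENES = 32, POP = 50, GENERATIONS = 100;
    static final double MUTATION_RATE = 0.01;
    static final Random rnd = new Random(42);

    static int fitness(boolean[] genome) {
        int ones = 0;
        for (boolean g : genome) if (g) ones++;
        return ones;
    }

    // Tournament selection: pick the fitter of two random individuals.
    static boolean[] select(boolean[][] pop) {
        boolean[] a = pop[rnd.nextInt(POP)], b = pop[rnd.nextInt(POP)];
        return fitness(a) >= fitness(b) ? a : b;
    }

    // Single-point crossover of two parents.
    static boolean[] crossover(boolean[] p1, boolean[] p2) {
        int cut = rnd.nextInt(GENES);
        boolean[] child = Arrays.copyOf(p1, GENES);
        System.arraycopy(p2, cut, child, cut, GENES - cut);
        return child;
    }

    // Flip each bit with a small probability.
    static void mutate(boolean[] genome) {
        for (int i = 0; i < GENES; i++)
            if (rnd.nextDouble() < MUTATION_RATE) genome[i] = !genome[i];
    }

    public static void main(String[] args) {
        boolean[][] pop = new boolean[POP][GENES];
        for (boolean[] g : pop)
            for (int i = 0; i < GENES; i++) g[i] = rnd.nextBoolean();

        for (int gen = 0; gen < GENERATIONS; gen++) {
            boolean[][] next = new boolean[POP][];
            for (int i = 0; i < POP; i++) {
                boolean[] child = crossover(select(pop), select(pop));
                mutate(child);
                next[i] = child;
            }
            pop = next;
        }

        int best = 0;
        for (boolean[] g : pop) best = Math.max(best, fitness(g));
        System.out.println("Best fitness after evolution: " + best + "/" + GENES);
    }
}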

30
(No Transcript)
31
Machine learning
  • Machine learning is a subfield of computer science that
    evolved from the study of pattern recognition and
    computational learning theory in artificial intelligence.
    Machine learning explores the study and construction of
    algorithms that can learn from and make predictions on
    data. Such algorithms operate by building a model from
    example inputs in order to make data-driven predictions or
    decisions, rather than following strictly static program
    instructions. Machine learning is closely related to, and
    often overlaps with, computational statistics, a
    discipline that also specializes in prediction making. It
    has strong ties to mathematical optimization, which
    supplies methods, theory and application domains to the
    field. Machine learning is employed in a range of
    computing tasks where designing and programming explicit
    algorithms is infeasible.
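A minimal example of "building a model from example inputs": fitting a one-variable linear model with gradient descent. The data points and learning rate are made up.

// Toy machine learning example: fit y = w * x + b to a handful of points
// with gradient descent, then predict on a new input.
public class TinyLinearRegression {
    public static void main(String[] args) {
        // Invented training data, roughly following y = 2x + 1.
        double[] x = {1, 2, 3, 4, 5};
        double[] y = {3.1, 4.9, 7.2, 9.0, 11.1};

        double w = 0, b = 0, lr = 0.01;

        for (int epoch = 0; epoch < 5000; epoch++) {
            double gradW = 0, gradB = 0;
            for (int i = 0; i < x.length; i++) {
                double err = (w * x[i] + b) - y[i];
                gradW += err * x[i];
                gradB += err;
            }
            // Average gradients of the squared error and take a step.
            w -= lr * gradW / x.length;
            b -= lr * gradB / x.length;
        }

        System.out.printf("learned model: y = %.2f * x + %.2f%n", w, b);
        System.out.printf("prediction for x = 6: %.2f%n", w * 6 + b);
    }
}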

32
(No Transcript)
33
Natural language processing
  • Natural language processing (NLP) is a field of computer
    science, artificial intelligence and computational
    linguistics concerned with the interactions between
    computers and human (natural) languages. As such, NLP is
    related to the area of human-computer interaction. Many
    challenges in NLP involve natural language understanding,
    that is, enabling computers to derive meaning from human
    or natural language input; others involve natural language
    generation.

34
(No Transcript)
35
Signal processing
  • Signal processing is an enabling technology that
    encompasses the fundamental theory, applications,
    algorithms and implementations of processing or
    transferring information contained in many different
    physical, symbolic or abstract formats broadly designated
    as signals. It uses mathematical, statistical,
    computational, heuristic and linguistic representations,
    formalisms and techniques for representation, modelling,
    analysis, synthesis, discovery, recovery, sensing,
    acquisition, extraction, learning, security or forensics.

36
(No Transcript)
37
Time series
  • A time series is a sequence of data points, typically
    consisting of successive measurements made over a time
    interval. Examples of time series are ocean tides, counts
    of sunspots, and the daily closing value of the Dow Jones
    Industrial Average. Time series are frequently plotted via
    line charts. They are used in statistics, signal
    processing, pattern recognition, econometrics,
    mathematical finance, weather forecasting, intelligent
    transport and trajectory forecasting, earthquake
    prediction, electroencephalography, control engineering,
    astronomy, communications engineering, and largely in any
    domain of applied science and engineering that involves
    temporal measurements.
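Time series analysis usually starts with simple smoothing, so here is a short sketch that computes a 3-point simple moving average over a made-up series of daily closing values.

import java.util.Arrays;

// Smooth a toy time series with a simple moving average of window 3.
public class MovingAverage {
    static double[] movingAverage(double[] series, int window) {
        double[] smoothed = new double[series.length - window + 1];
        for (int i = 0; i < smoothed.length; i++) {
            double sum = 0;
            for (int j = i; j < i + window; j++) sum += series[j];
            smoothed[i] = sum / window;
        }
        return smoothed;
    }

    public static void main(String[] args) {
        // Invented daily closing values.
        double[] closes = {101.2, 102.8, 101.9, 103.5, 104.1, 103.2, 105.0};
        System.out.println(Arrays.toString(movingAverage(closes, 3)));
    }
}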

38
(No Transcript)
39
Simulation
  • Simulation is the imitation of the operation of a
    real-world process or system over time. The act of
    simulating something first requires that a model be
    developed; this model represents the key characteristics
    or behaviors/functions of the selected physical or
    abstract system or process. The model represents the
    system itself, whereas the simulation represents the
    operation of the system over time.
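To make the model-versus-simulation distinction concrete, the sketch below models a cooling object with Newton's law of cooling (all constants invented) and then simulates it by stepping the model forward in time.

// The model: Newton's law of cooling, dT/dt = -k * (T - ambient).
// The simulation: stepping that model forward in discrete time steps.
public class CoolingSimulation {
    public static void main(String[] args) {
        double temperature = 90.0;   // initial temperature (invented)
        double ambient = 20.0;       // ambient temperature
        double k = 0.1;              // cooling constant
        double dt = 1.0;             // time step in minutes

        for (int minute = 0; minute <= 10; minute++) {
            System.out.printf("t=%2d min  T=%.2f%n", minute, temperature);
            temperature += -k * (temperature - ambient) * dt;  // Euler step
        }
    }
}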

40
(No Transcript)
41
 Hadoop HDFS Architecture
  • Hadoop provides a distributed file system and a framework
    for the analysis and transformation of very large data
    sets using the MapReduce [DG04] paradigm. While the
    interface to HDFS is patterned after the Unix file system,
    faithfulness to standards was sacrificed in favor of
    improved performance for the applications at hand. An
    important characteristic of Hadoop is the partitioning of
    data and computation across many (thousands of) hosts, and
    the execution of application computations in parallel
    close to their data. A Hadoop cluster scales computation
    capacity, storage capacity and I/O bandwidth simply by
    adding commodity servers. Hadoop clusters at Yahoo! span
    40,000 servers and store 40 petabytes of application data,
    with the largest cluster being 4,000 servers. One hundred
    other organizations worldwide report using Hadoop.
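Continuing the word-count mapper sketched earlier under What is Hadoop, here is the matching reducer and a minimal driver that submits the job; the input and output paths come from the command line, and everything else is standard org.apache.hadoop.mapreduce usage.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Sums the per-word counts emitted by WordCountMapper and wires up the job.
public class WordCountJob {

    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts,
                              Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) sum += c.get();
            context.write(word, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountJob.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}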

42
(No Transcript)
43
Why Big Data
  • Data are now woven into every sector and function
    in the global economy, and, like other essential
    factors of production such as hard assets and
    human capital, much of modern economic activity
    simply could not take place without them. The use of Big
    Data (large pools of data that can be brought together and
    analyzed to discern patterns and make better decisions)
    will become the basis of competition and growth for
    individual firms, enhancing productivity and creating
    significant value for the world economy by reducing waste
    and increasing the quality of products and services.

44
 Why Hadoop
  • Apache Hadoop enables big data applications for
    both operations and analytics and is one of the
    fastest-growing technologies providing
    competitive advantage for businesses across
    industries. Hadoop is a key component of the
    next-generation data architecture, providing a
    massively scalable distributed storage and
    processing platform. Hadoop enables organizations
    to build new data-driven applications while
    freeing up resources from existing systems. MapR
    is a production-ready distribution for Apache
    Hadoop.

45
Future of Big Data
  • Clearly big data is in its infancy, and there is much
    more to be discovered. For most companies it is currently
    just a fashionable keyword: it has great potential, but
    few truly understand what it is all about. A clear sign
    that there is more to big data than is currently visible
    on the market is that the big software companies either do
    not yet have big data solutions or do not showcase them,
    and those that do, like Google, do not use them in a
    commercial way. Organizations need to decide what kind of
    strategy to use when implementing big data. They could
    take a more progressive approach and move all their data
    to the new big data environment, so that all reporting,
    modelling and querying is executed using the new business
    intelligence built on big data. This approach is already
    used by many analytics-driven organizations that put all
    their data in a Hadoop environment and build business
    intelligence solutions on top of it.

46
Future of Hadoop
  • Dynamic caching
  • Multiple network interface support
  • Support NVRAM
  • Hardware Security Modules

47
Dynamic caching
  • Access-pattern-based caching of hot data
  • LRU, LRU2 (a minimal LRU cache is sketched below)
  • Cache partial blocks
  • Dynamic migration of data between storage tiers
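LRU is a general caching policy rather than anything Hadoop-specific; as a reminder of how it behaves, here is a minimal LRU cache built on java.util.LinkedHashMap in access-order mode.

import java.util.LinkedHashMap;
import java.util.Map;

// Minimal LRU cache: LinkedHashMap in access-order mode evicts the entry
// that was used least recently once the capacity is exceeded.
public class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    public LruCache(int capacity) {
        super(16, 0.75f, true);   // true = order entries by access
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity;
    }

    public static void main(String[] args) {
        LruCache<String, String> cache = new LruCache<>(2);
        cache.put("block-1", "hot data");
        cache.put("block-2", "hot data");
        cache.get("block-1");             // touch block-1 so it stays "hot"
        cache.put("block-3", "hot data"); // evicts block-2, the LRU entry
        System.out.println(cache.keySet()); // [block-1, block-3]
    }
}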

48
Multiple network interface support
  • Better aggregate bandwidth utilization
  • Isolation of traffic

49
Support NVRAM
  • Better durability without a write performance cost
  • File system metadata in NVRAM for better throughput

50
Hardware Security Modules
  • Better key management
  • Processing that requires higher security runs only on
    these nodes
  • An important requirement for financial services and
    healthcare

51
(No Transcript)