IS6126 Databases for Management Information Systems Lecture 5: NoSQL and distributed data stores

Title: Introduction Author: Rob Gleasure Last modified by: Gleasure, Rob Created Date: 9/20/2005 10:52:45 AM Document presentation format: On-screen Show (4:3) – PowerPoint PPT presentation

1
IS6126 Databases for Management Information Systems
Lecture 5: NoSQL and distributed data stores

Rob Gleasure
R.Gleasure_at_ucc.ie
robgleasure.com

2
IS6126

Today's session
Technologies for analysis
Technologies for storage
NoSQL
Distributed map reduce architectures, e.g. Hadoop

3
Technologies and tools
Credit to https://www.youtube.com/watch?v=IjpU0dLIRDI for visualisation
[Diagram: data lifecycle (Create, Capture, Store, Analyse, Present), with Business Intelligence Tools marked against its stages]
4
Data warehousing
Credit to https://www.youtube.com/watch?v=IjpU0dLIRDI for visualisation
[Diagram: data lifecycle (Create, Capture, Store, Analyse, Present), with Data Warehousing and Business Intelligence Tools marked against its stages]
5
Data warehousing
[Diagram: OLTP sources (operational databases for HR and payroll, sales and customers, orders, and technical support, plus purchased data) feed an Extract, Transform, Load step into the data warehouse; the OLAP side (business intelligence database) supports data mining, visualisation, and reporting]
6
What is NoSQL?

Basically any database that isn't a relational database
Stands for "Not only SQL"
It's NOT anti-SQL or anti-relational databases

7
What is NoSQL (continued)?

It's not only rows in tables
NoSQL systems store and retrieve data in many formats, e.g. text, csv, xml, graphml
It's not only joins
NoSQL systems mean you can extract data using simple interfaces, rather than necessarily relying on joins
It's not only schemas
NoSQL systems mean you can drag-and-drop data into a folder, without having to organise and query it according to entities, attributes, relationships, etc.

8
What is NoSQL (continued)?

It's not only executed on one processor
NoSQL systems mean you can store databases on multiple processors with high-speed performance
It's not only specialised computers
NoSQL systems mean you can leverage low-cost shared-nothing commodity processors that have separate RAM and disk
It's not only logarithmically scalable
NoSQL systems mean you can achieve linear scalability as you add more processors
It's not only anything, really
NoSQL systems emphasise innovation and inclusivity, meaning there are multiple recognised options for how data is stored, retrieved, and manipulated (including standard SQL solutions)

9
What is NoSQL (continued)?
10
NoSQL
Credit to https://www.youtube.com/watch?v=IjpU0dLIRDI for visualisation
[Diagram: data lifecycle (Create, Capture, Store, Analyse, Present), with NoSQL, Data Warehousing, and Business Intelligence Tools marked against its stages]
11
Four Data Patterns in NoSQL
12
Key-value stores

A simple string (the key) returns a Binary Large
OBject (BLOB) of data (the value)
E.g. the web
The key can take many formats
Logical path names
A hash string artificially generated from the
value
REST web service calls
SQL queries
Three basic functions
Put
Get
Delete
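
The three functions above can be sketched as a toy in-memory store. This is only an illustration, not how any particular product works: the class name KeyValueStore, the example keys, and the choice of SHA-256 for the generated hash key are all assumptions made for the example.

```python
import hashlib
from typing import Dict, Optional


class KeyValueStore:
    """Toy in-memory key-value store: a string key maps to an opaque BLOB."""

    def __init__(self) -> None:
        self._data: Dict[str, bytes] = {}

    def put(self, key: str, value: bytes) -> None:
        # Overwrites any existing value; the store never looks inside the BLOB.
        self._data[key] = value

    def get(self, key: str) -> Optional[bytes]:
        # Returns None when the key is unknown.
        return self._data.get(key)

    def delete(self, key: str) -> bool:
        # Returns True if a value was actually removed.
        return self._data.pop(key, None) is not None


store = KeyValueStore()

# A logical path name used as the key (hypothetical path).
store.put("/reports/2015/q1.pdf", b"%PDF-1.4 ...")

# A hash string generated from the value used as the key.
report = b"quarterly figures"
content_key = hashlib.sha256(report).hexdigest()
store.put(content_key, report)
```

Note that the only way to find a value is to know its key, which is why advanced search is not really an option with this pattern.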

13
Key-value stores (continued)
14
Key-value stores (continued)

Advantages
Scalability, reliability, portability
Low operational costs
Simplicity
Disadvantages
No real options for advanced search
Commercial solutions
Amazon S3
Voldemort

15
Column-family stores

Stores BLOBs of data in one big table, with four
possible basic identifiers used for look-up
Row
Column
Column-family
Time-stamp
More like a spreadsheet than an RDBMS in many
ways (e.g. no indices, triggers, or SQL queries)
Grew from an idea presented in a Google BigTable
paper
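
As a rough sketch of the four identifiers, the toy class below looks up a BLOB by row, column-family, and column, and keeps timestamped versions of each cell. The class and method names are hypothetical and do not correspond to the Cassandra or HBase APIs.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

Cell = Tuple[str, str, str]  # (row, column-family, column)


class ColumnFamilyStore:
    """Toy column-family store: one big table of timestamped cells."""

    def __init__(self) -> None:
        # Each cell keeps a list of (timestamp, value) versions.
        self._cells: Dict[Cell, List[Tuple[int, bytes]]] = defaultdict(list)

    def put(self, row: str, family: str, column: str,
            timestamp: int, value: bytes) -> None:
        self._cells[(row, family, column)].append((timestamp, value))

    def get(self, row: str, family: str, column: str,
            timestamp: Optional[int] = None) -> Optional[bytes]:
        # Without a timestamp, return the newest version;
        # with one, return the newest version at or before it.
        versions = self._cells.get((row, family, column), [])
        if timestamp is not None:
            versions = [v for v in versions if v[0] <= timestamp]
        if not versions:
            return None
        return max(versions, key=lambda v: v[0])[1]


table = ColumnFamilyStore()
table.put("customer-17", "contact", "email", 1, b"old@example.com")
table.put("customer-17", "contact", "email", 5, b"new@example.com")
```

New columns can be added at any time simply by writing to a new (row, column-family, column) combination, which is part of why adding new data is easy in this pattern.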

16
Column-family stores (continued)
17
Column-family stores (continued)

May then be stored and reconstructed in
components, e.g. HBase converts to
column-name/value rows

18
Column-family stores (continued)

Advantages
Scales pretty well
Decent searchability
Easy to add new data
Pretty intuitive
Disadvantages
Can't query BLOB content
Not as efficient to search as some other options
Commercial solutions
Cassandra
HBase

19
Document stores

Stores data in nested hierarchies
Keeps logical chunks of data together in one
place
Treats data as collections of categories and
subcategories
Arguably started with the growth of XML as a
standard format for exchanging data between
applications
Gradual move to just storing data in that format

20
Document stores

Typically requires some flattening before data can be analysed, e.g. ETL functions that process data and produce a .csv

Hierarchical docs, e.g. JSON
Mixed content, e.g. XML
Flat tables, e.g. csv
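
The flattening step can be sketched as follows: a minimal example, assuming each document is a JSON object whose nested objects should become dotted column names. The function names and sample documents are made up for illustration, and lists inside documents are not handled specially (they are written out as plain text).

```python
import csv
import io
import json
from typing import Any, Dict, List


def flatten(doc: Dict[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Collapse nested objects into one level, joining key names with dots."""
    flat: Dict[str, Any] = {}
    for key, value in doc.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat


def documents_to_csv(json_docs: List[str]) -> str:
    """Turn hierarchical JSON documents into one flat CSV table."""
    rows = [flatten(json.loads(doc)) for doc in json_docs]
    # Union of all column names, in first-seen order.
    columns = list(dict.fromkeys(col for row in rows for col in row))
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, restval="",
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


docs = [
    '{"name": "Ada", "address": {"city": "Cork", "country": "IE"}}',
    '{"name": "Bob", "address": {"city": "Dublin"}}',
]
flat_table = documents_to_csv(docs)
```

Documents that lack a field simply get an empty cell, which reflects how document stores allow structure without forcing every record to match.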
21
Document stores (continued)

Example of an XML document from MarkLogic

22
Document stores (continued)

Example of a JSON object from MongoDB

23
Document stores (continued)

Advantages
Lends itself to efficient within-document search
Very suitable for information retrieval
Very suitable where data is fed directly into
websites/applications
Allows for structure without being overly
restrictive
Disadvantages
Complicated to implement
Search process may require opening and closing
files
Analysis requires some flattening
Commercial solutions
MarkLogic
MongoDB

24
Graph stores

Model the interconnectivity of the data by
focusing on nodes (sometimes called vertices),
relationships (sometimes called edges), and
properties
Nodes and edges are stored in separate tables, meaning new types of search become possible
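
A minimal sketch of the idea, assuming a node table, an edge table, and a breadth-first traversal as the kind of network search a graph store makes easy. The sample data, the KNOWS relationship, and the function names are invented for illustration; real products such as Neo4j expose their own query languages instead.

```python
from collections import deque
from typing import Dict, List, Tuple

# Node table: id -> properties.
nodes: Dict[int, Dict[str, str]] = {
    1: {"name": "Alice"},
    2: {"name": "Bob"},
    3: {"name": "Carol"},
    4: {"name": "Dave"},
}

# Edge table: (source id, relationship, target id).
edges: List[Tuple[int, str, int]] = [
    (1, "KNOWS", 2),
    (2, "KNOWS", 3),
    (3, "KNOWS", 4),
]


def neighbours(node_id: int, relationship: str) -> List[int]:
    """Follow outgoing edges of one relationship type from a node."""
    return [dst for src, rel, dst in edges
            if src == node_id and rel == relationship]


def reachable_within(start: int, relationship: str, max_hops: int) -> List[int]:
    """Breadth-first traversal: nodes reachable in at most max_hops steps."""
    seen = {start}
    frontier = deque([(start, 0)])
    found: List[int] = []
    while frontier:
        node, hops = frontier.popleft()
        if hops == max_hops:
            continue  # do not expand beyond the hop limit
        for nxt in neighbours(node, relationship):
            if nxt not in seen:
                seen.add(nxt)
                found.append(nxt)
                frontier.append((nxt, hops + 1))
    return found


friends_of_friends = [nodes[n]["name"] for n in reachable_within(1, "KNOWS", 2)]
```

A "friends of friends" question like this would need repeated self-joins in a relational database, whereas here it is a short traversal over the edge table.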

25
Graph stores (continued)

Example from Gephi

26
Graph stores (continued)

Advantages
Fast network search
Works with many public data sets
Disadvantages
Not very scalable
Hard to query systematically unless you use
specialised languages based on graph traversals
Commercial solutions
Neo4j
AllegroGraph

27
So, where to get the power needed for these giant
data stores?
28
NoSQL
Credit to https://www.youtube.com/watch?v=IjpU0dLIRDI for visualisation
[Diagram: data lifecycle (Create, Capture, Store, Analyse, Present), with MapReduce, NoSQL, Data Warehousing, and Business Intelligence Tools marked against its stages]
29
Traditional (Structured) Approach
[Diagram: Big Data held as a single body of data and processed by one high-power processor]
30
The MapReduce Concept
[Diagram: Big Data is split across many data nodes; a master processor coordinates slave processors (standard processors), each working on its own data nodes]
31
The MapReduce Concept

Two fundamental steps
Map
Master node takes large problem and slices it
into sub problems
Master node distributes these sub problems to
worker nodes.
Worker node may also subdivide and distribute (in
which case, a multi-level tree structure results)
Worker processes sub problems and hands back to
master
Reduce
Master node reassembles solutions to sub problems
in a pre-defined way to answer high-level problem
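
The two steps can be sketched with a word count, the usual teaching example. This is a single-machine illustration of the master/worker split described above, not Hadoop's API: the function names are hypothetical, and a thread pool stands in for the worker nodes.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import List


def split_into_subproblems(lines: List[str], n_workers: int) -> List[List[str]]:
    """Master: slice the large problem into one chunk per worker."""
    return [lines[i::n_workers] for i in range(n_workers)]


def map_word_count(chunk: List[str]) -> Counter:
    """Worker: solve one sub-problem by counting words in its chunk."""
    counts: Counter = Counter()
    for line in chunk:
        counts.update(line.lower().split())
    return counts


def reduce_counts(partials: List[Counter]) -> Counter:
    """Master: reassemble the partial answers in a pre-defined way (summing)."""
    total: Counter = Counter()
    for partial in partials:
        total.update(partial)
    return total


def word_count(lines: List[str], n_workers: int = 3) -> Counter:
    chunks = split_into_subproblems(lines, n_workers)
    # The thread pool stands in for the worker nodes.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(map_word_count, chunks))
    return reduce_counts(partials)


result = word_count(["big data big", "data node", "big node node"])
```

Because each chunk is counted independently, adding workers does not change the answer, only how the work is shared out.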

32
Issues in Distributed Model

How should we decompose one big task into smaller
ones?
How do we figure out an efficient way to assign
tasks to different machines?
How do we exchange results between machines?
How do we synchronize distributed tasks?
What do we do if a task fails?

33
Apache Hadoop

Hadoop was created in 2005 by two Yahoo employees
(Doug Cutting and Mike Cafarella) building on
white papers by Google on their MapReduce
process.
The name refers to a toy elephant belonging to Doug Cutting's son
Yahoo later donated the project to Apache to
maintain in 2006
Hadoop offers a framework of tools for dealing
with big data
Hadoop is open source, distributed under the
Apache licence

34
Hadoop Ecosystem

Image from http://www.neevtech.com/blog/2013/03/18/hadoop-ecosystem-at-a-glance/

35
Hadoop Ecosystem

Image from http://thebigdatablog.weebly.com/blog/the-hadoop-ecosystem-overview/

36
MapReducing in Hadoop
[Diagram: an application sends batches through a queue to the master processor, which runs the Job Tracker and Name Node as well as its own Task Tracker and Data Node; each of three slave processors runs a Task Tracker and a Data Node. Annotations: "This is where MapReduce comes in" (Job Tracker and Task Trackers) and "This is where HDFS comes in" (Name Node and Data Nodes)]
37
Fault Handling in Hadoop

Distributing processing means that sooner or later, part of the distributed processing network will fail
Practical truth of networks: they are unreliable
Hadoop's HDFS has fault tolerance built-in for data nodes
Three copies of each file are maintained by Hadoop by default
If one copy goes down, data is retrieved from another
Faulty node is then updated with new (working) data from backup
Hadoop also tracks failures in task trackers
Master node's job tracker watches for errors in slave nodes
Allocates tasks to new slave if existing slave responsible fails
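
The data node part of this can be sketched as a toy model: three copies per file, reads that fall back to another copy, and repair of a failed node from a surviving copy. Everything here (the ReplicatedStore class, its method names, the placement rule) is a hypothetical simplification and not how HDFS is actually implemented; it also assumes there are at least as many nodes as replicas.

```python
from typing import Dict, List


class ReplicatedStore:
    """Toy model of replication across data nodes."""

    def __init__(self, node_names: List[str], replicas: int = 3) -> None:
        self.disks: Dict[str, Dict[str, bytes]] = {n: {} for n in node_names}
        self.alive: Dict[str, bool] = {n: True for n in node_names}
        self.placement: Dict[str, List[str]] = {}  # file -> nodes holding it
        self.replicas = replicas

    def write(self, filename: str, data: bytes) -> None:
        # Simple placement rule (an assumption): consecutive nodes,
        # starting at a position derived from the file name.
        names = list(self.disks)
        start = sum(filename.encode()) % len(names)
        holders = [names[(start + i) % len(names)] for i in range(self.replicas)]
        for name in holders:
            self.disks[name][filename] = data
        self.placement[filename] = holders

    def read(self, filename: str) -> bytes:
        # If one copy is down, the data is retrieved from another.
        for name in self.placement[filename]:
            if self.alive[name] and filename in self.disks[name]:
                return self.disks[name][filename]
        raise IOError(f"no live copy of {filename}")

    def fail(self, name: str) -> None:
        # Simulate a node going down and losing its local data.
        self.alive[name] = False
        self.disks[name].clear()

    def repair(self, name: str) -> None:
        # Bring the node back and refill it from surviving copies.
        self.alive[name] = True
        for filename, holders in self.placement.items():
            if name in holders:
                self.disks[name][filename] = self.read(filename)


cluster = ReplicatedStore(["node1", "node2", "node3", "node4"])
cluster.write("sales.csv", b"order,amount\n1,9.99\n")
```

With three copies, the file stays readable after any two of its holders fail, which is the point of keeping more than one backup.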

38
Programming in Hadoop

Programmers using Hadoop don't have to worry about
Where files are stored
How to manage failures
How to distribute computation
How to scale up or down activities
A variety of languages can be used, though Java
is the most common and arguably most hassle-free

39
Implementing a Hadoop System

Hadoop can be run in traditional onsite data
centres using multiple dedicated machines
Hadoop can also be run via cloud-hosted services,
including
Microsoft Azure
Amazon EC2/S3
Amazon Elastic MapReduce
Google Compute Engine

40
Implementing a Hadoop System: Yahoo Servers Running Hadoop
41
Applications of Hadoop

Areas of application include
Search engines, e.g. Google, Yahoo
Social media, e.g. Facebook, Twitter
Financial services, e.g. Morgan Stanley, BNY Mellon
eCommerce, e.g. Amazon, American Airlines, eBay, IBM
Government, e.g. Federal Reserve, Homeland Security

42
Users of Hadoop

Just like RDBMS, Hadoop systems have different
levels of users
Administrators handle
Configuring of the system
Updates and installation
General firefighting
Basic users
Run tests and gather data for reporting, market
research, general exploration, etc.
Design applications to use data

43
And another thing: Blockchain

The newest, hottest data storage trend
Based around 4 main principles
Shared data
Distributed consensus
Smart contracts
Native cryptography

44
Bitcoin
45
Want to read more?

Apache Hadoop Documentation
http://hadoop.apache.org/docs/current/
Data-Intensive Text Processing with MapReduce
http://lintool.github.io/MapReduceAlgorithms/
Hadoop: The Definitive Guide
http://www.amazon.com/Hadoop-Definitive-Guide-Tom-White/dp/1449311520

46
Want to read more?

Financial Services using Hadoop
http://hortonworks.com/blog/financial-services-hadoop/
https://www.mapr.com/solutions/industry/big-data-and-apache-hadoop-financial-services
Hadoop at ND
http://ccl.cse.nd.edu/operations/hadoop/
