Title: IS6126 Databases for Management Information Systems Lecture 5: NoSQL and distributed data stores
1IS6126 Databases for Management Information Systems
Lecture 5: NoSQL and distributed data stores
- Rob Gleasure
- R.Gleasure_at_ucc.ie
- robgleasure.com
2IS6126
- Today's session
- Technologies for analysis
- Technologies for storage
- NoSQL
- Distributed map reduce architectures, e.g. Hadoop
3Technologies and tools
Credit to https://www.youtube.com/watch?v=IjpU0dLIRDI for visualisation
Figure labels: Data lifecycle (Create, Capture, Store, Analyse, Present); Business Intelligence Tools
4Data warehousing
Credit to https://www.youtube.com/watch?v=IjpU0dLIRDI for visualisation
Figure labels: Data lifecycle (Create, Capture, Store, Analyse, Present); Data Warehousing; Business Intelligence Tools
5Data warehousing
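Figure: data warehouse architecture
- Operational databases (OLTP): HR and payroll, sales and customers, orders, technical support, plus purchased data
- Extract, Transform, Load (ETL) moves this data into the data warehouse
- Data warehouse, i.e. the business intelligence database (OLAP), supports data mining, visualisation, and reporting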
6What is NoSQL?
- What is NoSQL?
- Basically any database that isn't a relational database
- Stands for "Not only SQL"
- It's NOT anti-SQL or anti-relational databases
7What is NoSQL (continued)?
- It's not only rows in tables
- NoSQL systems store and retrieve data from many formats, e.g. text, csv, xml, graphml
- It's not only joins
- NoSQL systems mean you can extract data using simple interfaces, rather than necessarily relying on joins
- It's not only schemas
- NoSQL systems mean you can drag-and-drop data into a folder, without having to organise and query it according to entities, attributes, relationships, etc.
8What is NoSQL (continued)?
- It's not only executed on one processor
- NoSQL systems mean you can store databases on multiple processors with high-speed performance
- It's not only specialised computers
- NoSQL systems mean you can leverage low-cost shared-nothing commodity processors that have separate RAM and disk
- It's not only logarithmically scalable
- NoSQL systems mean you can achieve linear scalability as you add more processors
- It's not only anything, really
- NoSQL systems emphasise innovation and inclusivity, meaning there are multiple recognised options for how data is stored, retrieved, and manipulated (including standard SQL solutions)
9What is NoSQL (continued)?
10NoSQL
Credit to https://www.youtube.com/watch?v=IjpU0dLIRDI for visualisation
Figure labels: Data lifecycle (Create, Capture, Store, Analyse, Present); NoSQL; Data Warehousing; Business Intelligence Tools
11Four Data Patterns in NoSQL
12Key-value stores
- A simple string (the key) returns a Binary Large OBject (BLOB) of data (the value)
- E.g. the web, where a URL is the key and the page returned is the value
- The key can take many formats
- Logical path names
- A hash string artificially generated from the value
- REST web service calls
- SQL queries
- Three basic functions
- Put
- Get
- Delete
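The three functions above can be sketched as a toy in-memory store. This is only an illustration, not how Amazon S3 or Voldemort are implemented; the class name `KeyValueStore`, the helper `content_key`, and the sample keys and values are all hypothetical.

```python
import hashlib
from typing import Dict, Optional


class KeyValueStore:
    """Toy in-memory key-value store: opaque byte values looked up by string key."""

    def __init__(self) -> None:
        self._data: Dict[str, bytes] = {}

    def put(self, key: str, value: bytes) -> None:
        # Store (or overwrite) the BLOB held under this key.
        self._data[key] = value

    def get(self, key: str) -> Optional[bytes]:
        # Return the BLOB for this key, or None if the key is unknown.
        return self._data.get(key)

    def delete(self, key: str) -> None:
        # Remove the key if present; deleting a missing key is a no-op.
        self._data.pop(key, None)


def content_key(value: bytes) -> str:
    """Build a key as a hash string generated from the value itself."""
    return hashlib.sha256(value).hexdigest()


store = KeyValueStore()

# Key as a logical path name (hypothetical path and bytes).
store.put("/images/logo.png", b"\x89PNG...")

# Key as a hash of the value.
report = b"Q3 sales report"
store.put(content_key(report), report)
```

Note that the store never looks inside the value, which is why advanced search over the content is not available in this pattern.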
13Key-value stores (continued)
14Key-value stores (continued)
- Advantages
- Scalability, reliability, portability
- Low operational costs
- Simplicity
- Disadvantages
- No real options for advanced search
- Commercial solutions
- Amazon S3
- Voldemort
15Column-family stores
- Stores BLOBs of data in one big table, with four possible basic identifiers used for look-up
- Row
- Column
- Column-family
- Time-stamp
- More like a spreadsheet than an RDBMS in many ways (e.g. no indices, triggers, or SQL queries)
- Grew from an idea presented in Google's Bigtable paper
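A minimal sketch of look-up by the four identifiers, assuming a plain Python dictionary as the "one big table". The class `ColumnFamilyTable`, its methods, and the sample row are hypothetical and do not reflect the Cassandra or HBase APIs.

```python
from typing import Dict, Optional, Tuple

# One cell is addressed by (row, column family, column, timestamp).
CellKey = Tuple[str, str, str, int]


class ColumnFamilyTable:
    """Toy column-family table holding opaque byte values."""

    def __init__(self) -> None:
        self._cells: Dict[CellKey, bytes] = {}

    def put(self, row: str, family: str, column: str,
            timestamp: int, value: bytes) -> None:
        # Each timestamp keeps its own version of the cell.
        self._cells[(row, family, column, timestamp)] = value

    def get(self, row: str, family: str, column: str,
            timestamp: Optional[int] = None) -> Optional[bytes]:
        """Return the value at an exact timestamp, or the latest version if none is given."""
        if timestamp is not None:
            return self._cells.get((row, family, column, timestamp))
        versions = [
            (ts, value)
            for (r, f, c, ts), value in self._cells.items()
            if (r, f, c) == (row, family, column)
        ]
        if not versions:
            return None
        return max(versions, key=lambda pair: pair[0])[1]


table = ColumnFamilyTable()
table.put("user42", "contact", "email", 1, b"old@example.com")
table.put("user42", "contact", "email", 2, b"new@example.com")

latest = table.get("user42", "contact", "email")
original = table.get("user42", "contact", "email", timestamp=1)
```

The linear scan in `get` is only for brevity; a real store would organise cells so that versions of the same cell sit together.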
16Column-family stores (continued)
17Column-family stores (continued)
- May then be stored and reconstructed in
components, e.g. HBase converts to
column-name/value rows
18Column-family stores (continued)
- Advantages
- Scales pretty well
- Decent searchability
- Easy to add new data
- Pretty intuitive
- Disadvantages
- Can't query BLOB content
- Not as efficient to search as some other options
- Commercial solutions
- Cassandra
- HBase
19Document stores
- Stores data in nested hierarchies
- Keeps logical chunks of data together in one place
- Treats data as collections of categories and subcategories
- Arguably started with the growth of XML as a standard format for exchanging data between applications
- Gradual move to just storing data in that format
20Document stores
- Typically requires some flattening before data can be analysed, e.g. ETL functions that process data and produce a .csv
Hierarchical docs, e.g. JSON
Mixed content, e.g. XML
Flat tables, e.g. csv
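The flattening step can be sketched as follows, assuming nested JSON objects whose leaves are simple values. The function names `flatten` and `to_csv`, the dotted column names, and the sample documents are hypothetical; lists inside documents are not handled here.

```python
import csv
import io
import json
from typing import Any, Dict, List


def flatten(doc: Dict[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Turn nested objects into one flat dict with dotted column names."""
    flat: Dict[str, Any] = {}
    for key, value in doc.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            # Recurse into sub-objects, extending the column name.
            flat.update(flatten(value, prefix=name + "."))
        else:
            flat[name] = value
    return flat


def to_csv(docs: List[Dict[str, Any]]) -> str:
    """Flatten every document and write them out as one CSV table."""
    rows = [flatten(doc) for doc in docs]
    # Use the union of all columns, since documents need not share a schema.
    fieldnames = sorted({column for row in rows for column in row})
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)  # missing columns are written as empty cells
    return buffer.getvalue()


docs = json.loads(
    '[{"name": "Ann", "address": {"city": "Cork", "country": "IE"}},'
    ' {"name": "Bob", "address": {"city": "Dublin"}}]'
)
csv_text = to_csv(docs)
```

Here the second document has no country, so its row simply has an empty cell in that column.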
21Document stores (continued)
- Example of an XML document from MarkLogic
22Document stores (continued)
- Example of a JSON object from MongoDB
23Document stores (continued)
- Advantages
- Lends itself to efficient within-document search
- Very suitable for information retrieval
- Very suitable where data is fed directly into websites/applications
- Allows for structure without being overly restrictive
- Disadvantages
- Complicated to implement
- Search process may require opening and closing files
- Analysis requires some flattening
- Commercial solutions
- MarkLogic
- MongoDB
24Graph stores
- Model the interconnectivity of the data by focusing on nodes (sometimes called vertices), relationships (sometimes called edges), and properties
- Tables stored for nodes and edges separately, meaning new types of search become possible
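A minimal sketch of the idea, assuming one structure for nodes (with properties) and a separate one for edges. The sample people, the `KNOWS` relationship, and the function names are hypothetical; real graph stores such as Neo4j expose their own query languages rather than this interface.

```python
from collections import deque
from typing import Dict, List, Tuple

# Node table: id -> properties.
nodes: Dict[int, Dict[str, str]] = {
    1: {"name": "Ann"},
    2: {"name": "Bob"},
    3: {"name": "Cara"},
    4: {"name": "Dan"},
}

# Edge table: (source id, relationship, destination id).
edges: List[Tuple[int, str, int]] = [
    (1, "KNOWS", 2),
    (2, "KNOWS", 3),
    (3, "KNOWS", 4),
]


def neighbours(node_id: int, relationship: str) -> List[int]:
    """Follow outgoing edges of one relationship type from a node."""
    return [dst for src, rel, dst in edges
            if src == node_id and rel == relationship]


def within_hops(start: int, relationship: str, max_hops: int) -> List[int]:
    """Breadth-first traversal: ids reachable from start in at most max_hops steps."""
    seen = {start}
    frontier = deque([(start, 0)])
    found: List[int] = []
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue  # do not expand beyond the hop limit
        for nxt in neighbours(node, relationship):
            if nxt not in seen:
                seen.add(nxt)
                found.append(nxt)
                frontier.append((nxt, depth + 1))
    return found


friends_of_friends = [nodes[i]["name"] for i in within_hops(1, "KNOWS", 2)]
```

A "who is within two steps of Ann" search like this is awkward to express with joins, which is the kind of search the separate node and edge tables make practical.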
25Graph stores (continued)
26Graph stores (continued)
- Advantages
- Fast network search
- Works with many public data sets
- Disadvantages
- Not very scalable
- Hard to query systematically unless you use specialised languages based on graph traversals
- Commercial solutions
- Neo4j
- AllegroGraph
27So, where to get the power needed for these giant
data stores?
28NoSQL
Credit to https://www.youtube.com/watch?v=IjpU0dLIRDI for visualisation
Figure labels: Data lifecycle (Create, Capture, Store, Analyse, Present); MapReduce; NoSQL; Data Warehousing; Business Intelligence Tools
29Traditional (Structured) Approach
Figure: Big Data passed as Data to a single High Power Processor
30The MapReduce Concept
Figure: Big Data split across many Data Nodes, handled by Standard Processors acting as one Master Processor and several Slave Processors
31The MapReduce Concept
- Two fundamental steps
- Map
- Master node takes large problem and slices it into sub problems
- Master node distributes these sub problems to worker nodes
- Worker node may also subdivide and distribute (in which case, a multi-level tree structure results)
- Worker processes sub problems and hands back to master
- Reduce
- Master node reassembles solutions to sub problems in a pre-defined way to answer high-level problem
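The two steps can be sketched on a single machine, assuming the high-level problem is "total sales per region". The order records and the function names `split_problem`, `worker`, and `reduce_results` are hypothetical, and the workers run one after another here rather than on separate nodes.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Order = Tuple[str, int]  # (region, sale amount)

orders: List[Order] = [
    ("Cork", 120), ("Dublin", 80), ("Cork", 30), ("Galway", 50), ("Dublin", 20),
]


def split_problem(records: List[Order], n_workers: int) -> List[List[Order]]:
    """Master: slice the large problem into one sub problem per worker."""
    return [records[i::n_workers] for i in range(n_workers)]


def worker(chunk: List[Order]) -> Dict[str, int]:
    """Worker: solve one sub problem (partial totals) and hand it back."""
    partial: Dict[str, int] = defaultdict(int)
    for region, amount in chunk:
        partial[region] += amount
    return dict(partial)


def reduce_results(partials: List[Dict[str, int]]) -> Dict[str, int]:
    """Master: reassemble the partial totals into the final answer."""
    totals: Dict[str, int] = defaultdict(int)
    for partial in partials:
        for region, amount in partial.items():
            totals[region] += amount
    return dict(totals)


chunks = split_problem(orders, n_workers=2)
partials = [worker(chunk) for chunk in chunks]  # would run in parallel on a cluster
totals = reduce_results(partials)
```

Because each worker only needs its own chunk, adding more workers shortens the map step without changing the reduce logic.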
32Issues in Distributed Model
- How should we decompose one big task into smaller ones?
- How do we figure out an efficient way to assign tasks to different machines?
- How do we exchange results between machines?
- How do we synchronize distributed tasks?
- What do we do if a task fails?
33Apache Hadoop
- Hadoop was started in 2005 by Doug Cutting and Mike Cafarella, building on white papers by Google on their MapReduce process (Cutting joined Yahoo in 2006)
- The name refers to a toy elephant belonging to Doug Cutting's son
- Hadoop became an Apache project in 2006, with Yahoo as a major early contributor
- Hadoop offers a framework of tools for dealing with big data
- Hadoop is open source, distributed under the Apache licence
34Hadoop Ecosystem
- Image from http://www.neevtech.com/blog/2013/03/18/hadoop-ecosystem-at-a-glance/
35Hadoop Ecosystem
- Image from http://thebigdatablog.weebly.com/blog/the-hadoop-ecosystem-overview/
36MapReducing in Hadoop
Figure: an Application submits Batches through a Queue to the Master Processor, which runs the Job Tracker and the Name Node alongside its own Task Tracker and Data Node; each of three Slave Processors runs a Task Tracker and a Data Node
- Name Node and Data Nodes: this is where HDFS comes in
- Job Tracker and Task Trackers: this is where MapReduce comes in
37Fault Handling in Hadoop
- Distributing processing means that sooner or later, part of the distributed processing network will fail
- Practical truth of networks: they are unreliable
- Hadoop's HDFS has fault tolerance built in for data nodes
- Three copies of each file maintained by Hadoop (the default replication setting)
- If one copy goes down, data is retrieved from another
- Faulty node is then updated with new (working) data from backup
- Hadoop's MapReduce layer also tracks failures in task trackers
- Master node's job tracker watches for errors in slave nodes
- Allocates tasks to new slave if existing slave responsible fails
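The three-copy idea can be sketched as a small simulation. This is not HDFS code: the `Cluster` class, its methods, and the node names are hypothetical, and the sketch restores the third copy on a different healthy node, which is an assumption of this illustration rather than something stated above.

```python
import random
from typing import Dict, List, Set

REPLICATION_FACTOR = 3  # three copies, as described above


class Cluster:
    """Toy model of data nodes holding replicated blocks."""

    def __init__(self, node_names: List[str]) -> None:
        self.nodes: Dict[str, Dict[str, bytes]] = {n: {} for n in node_names}
        self.alive: Set[str] = set(node_names)
        self.placement: Dict[str, List[str]] = {}  # block id -> nodes with a copy

    def write(self, block_id: str, data: bytes) -> None:
        # Place copies on three distinct live nodes chosen at random.
        targets = random.sample(sorted(self.alive), REPLICATION_FACTOR)
        for name in targets:
            self.nodes[name][block_id] = data
        self.placement[block_id] = targets

    def fail(self, name: str) -> None:
        self.alive.discard(name)

    def read(self, block_id: str) -> bytes:
        # If one copy is down, retrieve the data from another.
        for name in self.placement[block_id]:
            if name in self.alive:
                return self.nodes[name][block_id]
        raise IOError("all replicas lost")

    def re_replicate(self, block_id: str) -> None:
        """Copy the block onto spare healthy nodes until three live copies exist."""
        live = [n for n in self.placement[block_id] if n in self.alive]
        if not live:
            raise IOError("all replicas lost")
        spare = [n for n in sorted(self.alive) if n not in live]
        data = self.nodes[live[0]][block_id]
        while len(live) < REPLICATION_FACTOR and spare:
            name = spare.pop(0)
            self.nodes[name][block_id] = data
            live.append(name)
        self.placement[block_id] = live


cluster = Cluster(["node1", "node2", "node3", "node4", "node5"])
cluster.write("block-0", b"sales data")
cluster.fail(cluster.placement["block-0"][0])  # lose one copy
recovered = cluster.read("block-0")            # still readable
cluster.re_replicate("block-0")                # back to three live copies
```

With three copies, a read only fails if all three holders are down at the same time.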
38Programming in Hadoop
- Programmers using Hadoop don't have to worry about
- Where files are stored
- How to manage failures
- How to distribute computation
- How to scale up or down activities
- A variety of languages can be used, though Java
is the most common and arguably most hassle-free
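As an illustration of using a language other than Java, here is a sketch in the style of Hadoop Streaming, where the mapper and reducer exchange lines of text with a tab between key and value. The functions below take and return lines so they can be tried locally; in a real job they would read standard input and write standard output, and the framework, not `sorted`, would handle sorting between the two phases. The function names and sample lines are hypothetical.

```python
from itertools import groupby
from typing import Iterable, Iterator


def mapper(lines: Iterable[str]) -> Iterator[str]:
    """Emit one 'word<TAB>1' line per word seen."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_pairs: Iterable[str]) -> Iterator[str]:
    """Sum the counts for each word; input must already be sorted by key."""
    parsed = (pair.split("\t") for pair in sorted_pairs)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


# Local stand-in for the sort that happens between the map and reduce phases.
mapped = sorted(mapper(["Big data", "big stores"]))
result = list(reducer(mapped))
```

Note that neither function says where the input files live, how many machines run it, or what happens if one fails, which is the point made above.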
39Implementing a Hadoop System
- Hadoop can be run in traditional onsite data centres using multiple dedicated machines
- Hadoop can also be run via cloud-hosted services, including
- Microsoft Azure
- Amazon EC2/S3
- Amazon Elastic MapReduce
- Google Compute Engine
40Implementing a Hadoop System: Yahoo Servers Running Hadoop
41Applications of Hadoop
- Areas of application include
- Search engines e.g. Google, Yahoo
- Social media e.g. Facebook, Twitter
- Financial services e.g. Morgan Stanley, BNY Mellon
- eCommerce e.g. Amazon, American Airlines, eBay, IBM
- Government e.g. Federal Reserve, Homeland Security
42Users of Hadoop
- Just like RDBMS, Hadoop systems have different levels of users
- Administrators handle
- Configuring of the system
- Updates and installation
- General firefighting
- Basic users
- Run tests and gather data for reporting, market research, general exploration, etc.
- Design applications to use data
43And another thing: Blockchain
- The newest, hottest data storage trend
- Based around 4 main principles
- Shared data
- Distributed consensus
- Smart contracts
- Native cryptography
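Of the four principles, the cryptographic one is the easiest to sketch: each block stores a hash of the previous block, so changing old data breaks the chain. The sketch below is a toy under that assumption and covers neither distributed consensus nor smart contracts; the function names, block fields, and sample transactions are hypothetical.

```python
import hashlib
import json
from typing import Any, Dict, List

GENESIS_PREV_HASH = "0" * 64  # placeholder "previous hash" for the first block


def block_hash(index: int, data: str, prev_hash: str) -> str:
    """Hash the block contents together with the previous block's hash."""
    payload = json.dumps(
        {"index": index, "data": data, "prev_hash": prev_hash}, sort_keys=True
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def add_block(chain: List[Dict[str, Any]], data: str) -> None:
    prev_hash = chain[-1]["hash"] if chain else GENESIS_PREV_HASH
    index = len(chain)
    chain.append({
        "index": index,
        "data": data,
        "prev_hash": prev_hash,
        "hash": block_hash(index, data, prev_hash),
    })


def is_valid(chain: List[Dict[str, Any]]) -> bool:
    """Check every block's own hash and its link to the block before it."""
    prev = GENESIS_PREV_HASH
    for block in chain:
        if block["prev_hash"] != prev:
            return False
        if block["hash"] != block_hash(block["index"], block["data"], block["prev_hash"]):
            return False
        prev = block["hash"]
    return True


chain: List[Dict[str, Any]] = []
add_block(chain, "Ann pays Bob 5")
add_block(chain, "Bob pays Cara 2")
valid_before = is_valid(chain)

chain[0]["data"] = "Ann pays Bob 500"  # tamper with history
valid_after = is_valid(chain)
```

Because every participant can rerun `is_valid` on their own copy of the shared data, tampering with an earlier block is detectable by anyone.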
44Bitcoin
45Want to read more?
- Apache Hadoop Documentation
- http://hadoop.apache.org/docs/current/
- Data-Intensive Text Processing with MapReduce
- http://lintool.github.io/MapReduceAlgorithms/
- Hadoop: The Definitive Guide
- http://www.amazon.com/Hadoop-Definitive-Guide-Tom-White/dp/1449311520
46Want to read more?
- Financial Services using Hadoop
- http://hortonworks.com/blog/financial-services-hadoop/
- https://www.mapr.com/solutions/industry/big-data-and-apache-hadoop-financial-services
- Hadoop at ND
- http://ccl.cse.nd.edu/operations/hadoop/