Title: IS6126 Databases for Management Information Systems Lecture 8: Working with unstructured data
1. IS6126 Databases for Management Information Systems
Lecture 8: Working with unstructured data
- Rob Gleasure
- R.Gleasure@ucc.ie
- robgleasure.com
2. IS6126
- Today's session
- Technologies for analysis
- Technologies for storage
- NoSQL
- Distributed MapReduce architectures, e.g. Hadoop
3. Technologies and tools
Credit to https://www.youtube.com/watch?v=IjpU0dLIRDI for visualisation
Data lifecycle: Create → Capture → Store → Analyse → Present (Business Intelligence Tools)
4. Tools for analysis and presentation
- Massive range of software, depending on needs
- For visualising and sorting data
  - Excel
  - Pentaho
- For data mining, regressions, clustering, graphing, etc.
  - SPSS
  - R
  - Gephi
  - UCINET
- For reporting
  - Excel
  - Pentaho
Let's get our hands data-y!
(I know. Sorry.)
5. Tools for analysis and presentation
6. Data warehousing
Credit to https://www.youtube.com/watch?v=IjpU0dLIRDI for visualisation
Data lifecycle: Create → Capture → Store → Analyse → Present (Data Warehousing, Business Intelligence Tools)
7. Data warehousing
[Diagram: operational databases (OLTP) for HR and payroll, sales and customers, orders, and technical support, plus purchased data, feed an Extract, Transform, Load (ETL) process into the data warehouse, a business intelligence database (OLAP) used for data mining, visualisation, and reporting]
8. OLTP vs. OLAP
- Online transaction processing (OLTP) databases/data stores support ongoing activities in an organisation
- Hence, they need to
  - Manage accurate real-time transactions
  - Handle reads, writes, and updates by large numbers of concurrent users
  - Decompose data into joinable, efficient rows (e.g. normalised to third normal form)
- These requirements are often labelled ACID database transactions
  - Atomic: every part of a transaction works or it's all rolled back
  - Consistent: the database is never left in an inconsistent state
  - Isolated: transactions do not interfere with one another
  - Durable: completed transactions are not lost if the system crashes
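The atomicity property can be demonstrated with a short sketch using Python's standard sqlite3 module (the accounts table and transfer amounts are invented for illustration, not part of the lecture):

```python
import sqlite3

# In-memory database with a hypothetical accounts table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")
conn.commit()

try:
    with conn:  # opens a transaction; commits on success, rolls back on error
        conn.execute("UPDATE accounts SET balance = balance - 80 "
                     "WHERE name = 'alice'")
        raise RuntimeError("simulated crash mid-transaction")
        # never reached: the credit half of the transfer
        conn.execute("UPDATE accounts SET balance = balance + 80 "
                     "WHERE name = 'bob'")
except RuntimeError:
    pass

# Atomic: the partial debit was rolled back, so balances are unchanged
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(balances)  # {'alice': 100, 'bob': 50}
```

Because the failure happened after the debit but before the credit, a non-atomic system would have lost 80; rolling back the whole transaction keeps the data consistent.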
9. OLTP vs. OLAP
- Online analytical processing (OLAP) databases/data stores are used to support predictive analytics
- Hence, they need to
  - Allow vast quantities of historical data to be accessed quickly
  - Be updatable in batches (often daily)
  - Aggregate diverse structures with summary data
- These requirements are often labelled BASE database transactions
  - Basic Availability
  - Soft-state
  - Eventual consistency
10. NoSQL
Credit to https://www.youtube.com/watch?v=IjpU0dLIRDI for visualisation
Data lifecycle: Create → Capture → Store → Analyse → Present (NoSQL, Data Warehousing, Business Intelligence Tools)
11. What is NoSQL?
- Basically, any database that isn't a relational database
- Stands for "Not only SQL"
- It's NOT anti-SQL or anti-relational databases
12. What is NoSQL (continued)?
- It's not only rows in tables
  - NoSQL systems store and retrieve data in many formats, e.g. text, CSV, XML, GraphML
- It's not only joins
  - NoSQL systems mean you can extract data using simple interfaces, rather than necessarily relying on joins
- It's not only schemas
  - NoSQL systems mean you can drag-and-drop data into a folder, without having to organise and query it according to entities, attributes, relationships, etc.
13. What is NoSQL (continued)?
- It's not only executed on one processor
  - NoSQL systems mean you can store databases on multiple processors with high-speed performance
- It's not only specialised computers
  - NoSQL systems mean you can leverage low-cost, shared-nothing commodity processors that have separate RAM and disk
- It's not only logarithmically scalable
  - NoSQL systems mean you can achieve linear scalability as you add more processors
- It's not only anything, really
  - NoSQL systems emphasise innovation and inclusivity, meaning there are multiple recognised options for how data is stored, retrieved, and manipulated (including standard SQL solutions)
14. What is NoSQL (continued)?
15. Four data patterns in NoSQL
16. Key-value stores
- A simple string (the key) returns a Binary Large OBject (BLOB) of data (the value)
  - E.g. the web
- The key can take many formats
  - Logical path names
  - A hash string artificially generated from the value
  - REST web service calls
  - SQL queries
- Three basic functions
  - Put
  - Get
  - Delete
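The three basic functions can be sketched as a minimal in-memory store in Python (the class and key names are illustrative; real systems such as Amazon S3 persist and replicate the values across machines):

```python
# Minimal key-value store sketch: put, get, delete, and nothing else.
# Note there is no search; values can only be looked up by exact key.
class KeyValueStore:
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value  # overwrite is the only write semantics

    def get(self, key):
        return self._data.get(key)  # returns None if the key is absent

    def delete(self, key):
        self._data.pop(key, None)

store = KeyValueStore()
store.put("images/logo.png", b"\x89PNG...")  # key as a logical path name
print(store.get("images/logo.png"))          # the stored BLOB
store.delete("images/logo.png")
print(store.get("images/logo.png"))          # None
```

The simplicity is the point: with no joins, schemas, or query language, the interface is trivially easy to scale and distribute, which is why the "no real options for advanced search" disadvantage on the next slide follows directly.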
17. Key-value stores (continued)
- Advantages
- Scalability, reliability, portability
- Low operational costs
- Simplicity
- Disadvantages
- No real options for advanced search
- Commercial solutions
- Amazon S3
- Voldemort
18. Column-family stores
- Stores BLOBs of data in one big table, with four possible basic identifiers used for look-up
  - Row
  - Column
  - Column-family
  - Time-stamp
- More like a spreadsheet than an RDBMS in many ways (e.g. no indices, triggers, or SQL queries)
- Grew from an idea presented in a Google BigTable paper
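The four identifiers above can be sketched as a nested structure in Python (a simplified, BigTable-style cell addressing scheme; the row and family names are invented for illustration):

```python
# Each cell is addressed by (row key, column family, column qualifier)
# and holds multiple timestamped versions of its value.
from collections import defaultdict

table = defaultdict(dict)  # row key -> {(family, qualifier): [(ts, value), ...]}

def put(row, family, qualifier, value, ts):
    cell = table[row].setdefault((family, qualifier), [])
    cell.append((ts, value))
    cell.sort(reverse=True)  # newest version first

def get(row, family, qualifier):
    versions = table[row].get((family, qualifier), [])
    return versions[0][1] if versions else None  # latest version wins

put("user42", "profile", "name", b"Ada", ts=1)
put("user42", "profile", "name", b"Ada Lovelace", ts=2)  # newer version
print(get("user42", "profile", "name"))  # b'Ada Lovelace'
```

Keeping old versions under timestamps, rather than updating in place, is what makes batch-oriented analytical reads over history cheap in this pattern.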
19. Column-family stores (continued)
- Advantages
- Scales pretty well
- Decent search ability
- Easy to add new data
- Pretty intuitive
- Disadvantages
- Can't query BLOB content
- Not as efficient to search as some other options
- Commercial solutions
- Cassandra
- HBase
20. Document stores
- Stores data in nested hierarchies (typically using XML or JSON)
- Keeps logical chunks of data together in one place
- Hierarchical docs, e.g. JSON; mixed content, e.g. XML; flat tables, e.g. CSV
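A hypothetical order document shows the "logical chunks together" idea: the customer and line items that an RDBMS would normalise into separate joined tables sit inside one JSON document (the field names and values here are invented for illustration):

```python
import json

# One self-contained order document, as a document store might hold it
doc = json.loads("""
{
  "order_id": 1001,
  "customer": {"name": "Ada", "email": "ada@example.com"},
  "items": [
    {"sku": "A1", "qty": 2, "price": 9.99},
    {"sku": "B2", "qty": 1, "price": 24.50}
  ]
}
""")

# Within-document "query": no joins needed, the data is already co-located
total = sum(item["qty"] * item["price"] for item in doc["items"])
print(doc["customer"]["name"], round(total, 2))
```

This co-location is why document stores suit websites and applications that read a whole entity at once, and also why analysis across many documents requires some flattening, as the next slide notes.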
21. Document stores (continued)
- Advantages
  - Lends itself to efficient within-document search
  - Very suitable for information retrieval
  - Very suitable where data is fed directly into websites/applications
  - Allows for structure without being overly restrictive
- Disadvantages
  - Complicated to implement
  - Search process may require opening and closing files
  - Analysis requires some flattening
- Commercial solutions
  - MarkLogic
  - MongoDB
22. Graph stores
- Model the interconnectivity of the data by focusing on nodes (sometimes called vertices), relationships (sometimes called edges), and properties
23. Graph stores (continued)
- Tables are stored for nodes and edges separately, meaning new types of search become possible
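Separate node and edge tables can be sketched in Python, with one of the network searches they enable (the people and "knows" relationships are invented for illustration; languages like Cypher express such traversals declaratively):

```python
# Nodes (with properties) and edges kept in separate tables
from collections import deque

nodes = {1: {"name": "Ada"}, 2: {"name": "Bob"}, 3: {"name": "Cat"}}
edges = [(1, 2, "knows"), (2, 3, "knows")]  # (from, to, relationship)

def reachable(start):
    """Breadth-first traversal: everyone connected to `start`."""
    adjacency = {}
    for src, dst, _rel in edges:
        adjacency.setdefault(src, []).append(dst)
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return {nodes[n]["name"] for n in seen}

print(reachable(1))  # names of everyone reachable from Ada
```

Queries like this ("friends of friends", shortest paths) are awkward as recursive SQL joins but natural as graph traversals, which is exactly the trade-off the next slide lists.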
24. Graph stores (continued)
- Advantages
  - Fast network search
  - Works with many public data sets
- Disadvantages
  - Not very scalable
  - Hard to query systematically unless you use specialised languages based on graph traversals
- Commercial solutions
  - Neo4j
  - AllegroGraph
25. So, where to get the power needed for these giant data stores?
26. NoSQL
Credit to https://www.youtube.com/watch?v=IjpU0dLIRDI for visualisation
Data lifecycle: Create → Capture → Store → Analyse → Present (MapReduce, NoSQL, Data Warehousing, Business Intelligence Tools)
27. Traditional (Structured) Approach
[Diagram: big data flows into a single high-power processor]
28. The MapReduce Concept
[Diagram: big data is split across many data nodes; a master processor coordinates slave processors, each a standard commodity processor holding its own data nodes]
29. The MapReduce Concept
- Two fundamental steps
- Map
  - Master node takes a large problem and slices it into sub-problems
  - Master node distributes these sub-problems to worker nodes
  - A worker node may also subdivide and distribute (in which case, a multi-level tree structure results)
  - Workers process the sub-problems and hand results back to the master
- Reduce
  - Master node reassembles the solutions to the sub-problems in a pre-defined way to answer the high-level problem
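The two steps can be sketched on a single machine with the canonical word-count example (in a real Hadoop cluster the map and reduce calls run in parallel on many worker nodes; the input chunks here are invented for illustration):

```python
from collections import defaultdict

def map_phase(chunk):
    # Map: each sub-problem emits (key, value) pairs independently
    return [(word, 1) for word in chunk.split()]

def reduce_phase(pairs):
    # Reduce: reassemble the partial results by key
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

chunks = ["big data big", "data big deal"]  # the "sliced" problem
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
counts = reduce_phase(mapped)
print(counts)  # {'big': 3, 'data': 2, 'deal': 1}
```

Because each map call touches only its own chunk and the reduce combines results by key, the chunks can be processed on different machines with no coordination until the end, which is the whole point of the pattern.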
30. Issues in the Distributed Model
- How should we decompose one big task into smaller ones?
- How do we figure out an efficient way to assign tasks to different machines?
- How do we exchange results between machines?
- How do we synchronise distributed tasks?
- What do we do if a task fails?
31. Apache Hadoop
- Hadoop was created in 2005 by two Yahoo employees (Doug Cutting and Mike Cafarella), building on white papers published by Google on their MapReduce process
- The name refers to a toy elephant belonging to Doug Cutting's son
- Yahoo later donated the project to Apache to maintain in 2006
- Hadoop offers a framework of tools for dealing with big data
- Hadoop is open source, distributed under the Apache licence
32. Hadoop Ecosystem
- Image from http://www.neevtech.com/blog/2013/03/18/hadoop-ecosystem-at-a-glance/
33. Hadoop Ecosystem
- Image from http://thebigdatablog.weebly.com/blog/the-hadoop-ecosystem-overview/
34. MapReducing in Hadoop
[Diagram: an application submits work to a queue, which the master processor's job tracker splits into batches for the task trackers on each slave processor (this is where MapReduce comes in); the master's name node and the data nodes on each slave manage file storage (this is where HDFS comes in)]
35. Fault Handling in Hadoop
- Distributed processing means that sooner or later, part of the distributed processing network will fail
  - A practical truth of networks: they are unreliable
- Hadoop's HDFS has fault tolerance built in for data nodes
  - Three copies of each file are maintained by Hadoop
  - If one copy goes down, data is retrieved from another
  - The faulty node is then updated with new (working) data from a backup
- Hadoop's HDFS also tracks failures in task trackers
  - The master node's job tracker watches for errors in slave nodes
  - It allocates tasks to a new slave if the slave responsible fails
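The "three copies" behaviour is governed by the standard dfs.replication property in HDFS's hdfs-site.xml configuration file (3 is the HDFS default; shown here explicitly as a sketch of where the setting lives):

```xml
<!-- hdfs-site.xml: the replication factor behind HDFS fault tolerance -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>
```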
36. Programming in Hadoop
- Programmers using Hadoop don't have to worry about
  - Where files are stored
  - How to manage failures
  - How to distribute computation
  - How to scale activities up or down
- A variety of languages can be used, though Java is the most common and arguably the most hassle-free
37. Implementing a Hadoop System
- Hadoop can be run in traditional onsite data centres using multiple dedicated machines
- Hadoop can also be run via cloud-hosted services, including
  - Microsoft Azure
  - Amazon EC2/S3
  - Amazon Elastic MapReduce
  - Google Compute Engine
38. Implementing a Hadoop System: Yahoo Servers Running Hadoop
39. Applications of Hadoop
- Areas of application include
  - Search engines, e.g. Google, Yahoo
  - Social media, e.g. Facebook, Twitter
  - Financial services, e.g. Morgan Stanley, BNY Mellon
  - eCommerce, e.g. Amazon, American Airlines, eBay, IBM
  - Government, e.g. Federal Reserve, Homeland Security
40. Users of Hadoop
- Just like RDBMS, Hadoop systems have different levels of users
- Administrators handle
  - Configuring the system
  - Updates and installation
  - General firefighting
- Basic users
  - Run tests and gather data for reporting, market research, general exploration, etc.
  - Design applications to use data
41. Accessibility of NoSQL databases?
42. Want to read more?
- Apache Hadoop Documentation
  - http://hadoop.apache.org/docs/current/
- Data-Intensive Text Processing with MapReduce
  - http://lintool.github.io/MapReduceAlgorithms/
- Hadoop: The Definitive Guide
  - http://www.amazon.com/Hadoop-Definitive-Guide-Tom-White/dp/1449311520
43. Want to read more?
- Financial services using Hadoop
  - http://hortonworks.com/blog/financial-services-hadoop/
  - https://www.mapr.com/solutions/industry/big-data-and-apache-hadoop-financial-services
- Hadoop at ND
  - http://ccl.cse.nd.edu/operations/hadoop/