Title: IS6126 Databases for Management Information Systems Lecture 8: Working with unstructured data
1. IS6126 Databases for Management Information Systems
Lecture 8: Working with unstructured data
- Rob Gleasure
- R.Gleasure@ucc.ie
- robgleasure.com
2. IS6126
- Today's session
- Technologies for analysis
- Technologies for storage
- NoSQL
- Distributed MapReduce architectures, e.g. Hadoop
3. Technologies and tools
Credit to https://www.youtube.com/watch?v=IjpU0dLIRDI for visualisation
Data lifecycle: Create → Capture → Store → Analyse → Present (Business Intelligence Tools)
4. Tools for analysis and presentation
- Massive range of software, depending on needs
- For visualising and sorting data
  - Excel
  - Pentaho
- For data mining, regressions, clustering, graphing, etc.
  - SPSS
  - R
  - Gephi
  - UCINET
- For reporting
  - Excel
  - Pentaho
Let's get our hands data-y!
(I know. Sorry.)
5. Tools for analysis and presentation
6. Data warehousing
Credit to https://www.youtube.com/watch?v=IjpU0dLIRDI for visualisation
Data lifecycle: Create → Capture → Store → Analyse → Present (Data Warehousing, Business Intelligence Tools)
7. Data warehousing
[Diagram: operational databases (OLTP) for HR and payroll, sales and customers, orders, and technical support, plus purchased data, feed an Extract, Transform, Load (ETL) process into the data warehouse, a business intelligence database (OLAP) used for data mining, visualisation, and reporting]
8. OLTP vs. OLAP
- Online transaction processing (OLTP) databases/data stores support ongoing activities in an organisation
- Hence, they need to
  - Manage accurate real-time transactions
  - Handle reads, writes, and updates by large numbers of concurrent users
  - Decompose data into joinable, efficient rows (e.g. normalised to third normal form)
- These requirements are often labelled ACID database transactions
  - Atomic: every part of a transaction works or it's all rolled back
  - Consistent: the database is never left in an inconsistent state
  - Isolated: transactions do not interfere with one another
  - Durable: completed transactions are not lost if the system crashes
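The atomicity property can be demonstrated with a short sketch using Python's standard sqlite3 module (the accounts table and transfer amounts are invented for illustration, not part of the lecture):

```python
import sqlite3

# In-memory database with a hypothetical accounts table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")
conn.commit()

try:
    with conn:  # opens a transaction; commits on success, rolls back on error
        conn.execute("UPDATE accounts SET balance = balance - 80 "
                     "WHERE name = 'alice'")
        raise RuntimeError("simulated crash mid-transaction")
        # never reached: the credit half of the transfer
        conn.execute("UPDATE accounts SET balance = balance + 80 "
                     "WHERE name = 'bob'")
except RuntimeError:
    pass

# Atomic: the partial debit was rolled back, so balances are unchanged
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(balances)  # {'alice': 100, 'bob': 50}
```

Because the failure happened after the debit but before the credit, a non-atomic system would have lost 80; rolling back the whole transaction keeps the data consistent.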
9. OLTP vs. OLAP
- Online analytical processing (OLAP) databases/data stores are used to support predictive analytics
- Hence, they need to
  - Allow vast quantities of historical data to be accessed quickly
  - Be updatable in batches (often daily)
  - Aggregate diverse structures with summary data
- These requirements are often labelled BASE database transactions
  - Basic Availability
  - Soft-state
  - Eventual consistency
10. NoSQL
Credit to https://www.youtube.com/watch?v=IjpU0dLIRDI for visualisation
Data lifecycle: Create → Capture → Store → Analyse → Present (NoSQL, Data Warehousing, Business Intelligence Tools)
11. What is NoSQL?
- Basically, any database that isn't a relational database
- Stands for "Not only SQL"
- It's NOT anti-SQL or anti-relational databases
12. What is NoSQL (continued)?
- It's not only rows in tables
  - NoSQL systems store and retrieve data in many formats, e.g. text, CSV, XML, GraphML
- It's not only joins
  - NoSQL systems mean you can extract data using simple interfaces, rather than necessarily relying on joins
- It's not only schemas
  - NoSQL systems mean you can drag-and-drop data into a folder, without having to organise and query it according to entities, attributes, relationships, etc.
13. What is NoSQL (continued)?
- It's not only executed on one processor
  - NoSQL systems mean you can store databases on multiple processors with high-speed performance
- It's not only specialised computers
  - NoSQL systems mean you can leverage low-cost, shared-nothing commodity processors that have separate RAM and disk
- It's not only logarithmically scalable
  - NoSQL systems mean you can achieve linear scalability as you add more processors
- It's not only anything, really
  - NoSQL systems emphasise innovation and inclusivity, meaning there are multiple recognised options for how data is stored, retrieved, and manipulated (including standard SQL solutions)
14. What is NoSQL (continued)?
15. Four data patterns in NoSQL
16. Key-value stores
- A simple string (the key) returns a Binary Large OBject (BLOB) of data (the value)
  - E.g. the web
- The key can take many formats
  - Logical path names
  - A hash string artificially generated from the value
  - REST web service calls
  - SQL queries
- Three basic functions
  - Put
  - Get
  - Delete
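The three basic functions can be sketched as a minimal in-memory store in Python (the class and key names are illustrative; real systems such as Amazon S3 persist and replicate the values across machines):

```python
# Minimal key-value store sketch: put, get, delete, and nothing else.
# Note there is no search; values can only be looked up by exact key.
class KeyValueStore:
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value  # overwrite is the only write semantics

    def get(self, key):
        return self._data.get(key)  # returns None if the key is absent

    def delete(self, key):
        self._data.pop(key, None)

store = KeyValueStore()
store.put("images/logo.png", b"\x89PNG...")  # key as a logical path name
print(store.get("images/logo.png"))          # the stored BLOB
store.delete("images/logo.png")
print(store.get("images/logo.png"))          # None
```

The simplicity is the point: with no joins, schemas, or query language, the interface is trivially easy to scale and distribute, which is why the "no real options for advanced search" disadvantage on the next slide follows directly.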
17. Key-value stores (continued)
- Advantages
- Scalability, reliability, portability
- Low operational costs
- Simplicity
- Disadvantages
- No real options for advanced search
- Commercial solutions
- Amazon S3
- Voldemort
18. Column-family stores
- Stores BLOBs of data in one big table, with four possible basic identifiers used for look-up
  - Row
  - Column
  - Column-family
  - Time-stamp
- More like a spreadsheet than an RDBMS in many ways (e.g. no indices, triggers, or SQL queries)
- Grew from an idea presented in a Google BigTable paper
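The four identifiers above can be sketched as a nested structure in Python (a simplified, BigTable-style cell addressing scheme; the row and family names are invented for illustration):

```python
# Each cell is addressed by (row key, column family, column qualifier)
# and holds multiple timestamped versions of its value.
from collections import defaultdict

table = defaultdict(dict)  # row key -> {(family, qualifier): [(ts, value), ...]}

def put(row, family, qualifier, value, ts):
    cell = table[row].setdefault((family, qualifier), [])
    cell.append((ts, value))
    cell.sort(reverse=True)  # newest version first

def get(row, family, qualifier):
    versions = table[row].get((family, qualifier), [])
    return versions[0][1] if versions else None  # latest version wins

put("user42", "profile", "name", b"Ada", ts=1)
put("user42", "profile", "name", b"Ada Lovelace", ts=2)  # newer version
print(get("user42", "profile", "name"))  # b'Ada Lovelace'
```

Keeping old versions under timestamps, rather than updating in place, is what makes batch-oriented analytical reads over history cheap in this pattern.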
19. Column-family stores (continued)
- Advantages
- Scales pretty well
- Decent search ability
- Easy to add new data
- Pretty intuitive
- Disadvantages
- Can't query BLOB content
- Not as efficient to search as some other options
- Commercial solutions
- Cassandra
- HBase
20. Document stores
- Stores data in nested hierarchies (typically using XML or JSON)
- Keeps logical chunks of data together in one place
- Hierarchical docs, e.g. JSON; mixed content, e.g. XML; flat tables, e.g. CSV
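A hypothetical order document shows the "logical chunks together" idea: the customer and line items that an RDBMS would normalise into separate joined tables sit inside one JSON document (the field names and values here are invented for illustration):

```python
import json

# One self-contained order document, as a document store might hold it
doc = json.loads("""
{
  "order_id": 1001,
  "customer": {"name": "Ada", "email": "ada@example.com"},
  "items": [
    {"sku": "A1", "qty": 2, "price": 9.99},
    {"sku": "B2", "qty": 1, "price": 24.50}
  ]
}
""")

# Within-document "query": no joins needed, the data is already co-located
total = sum(item["qty"] * item["price"] for item in doc["items"])
print(doc["customer"]["name"], round(total, 2))
```

This co-location is why document stores suit websites and applications that read a whole entity at once, and also why analysis across many documents requires some flattening, as the next slide notes.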
21. Document stores (continued)
- Advantages
  - Lends itself to efficient within-document search
  - Very suitable for information retrieval
  - Very suitable where data is fed directly into websites/applications
  - Allows for structure without being overly restrictive
- Disadvantages
  - Complicated to implement
  - Search process may require opening and closing files
  - Analysis requires some flattening
- Commercial solutions
  - MarkLogic
  - MongoDB
22. Graph stores
- Model the interconnectivity of the data by focusing on nodes (sometimes called vertices), relationships (sometimes called edges), and properties
23. Graph stores (continued)
- Tables are stored for nodes and edges separately, meaning new types of search become possible
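Separate node and edge tables can be sketched in Python, with one of the network searches they enable (the people and "knows" relationships are invented for illustration; languages like Cypher express such traversals declaratively):

```python
# Nodes (with properties) and edges kept in separate tables
from collections import deque

nodes = {1: {"name": "Ada"}, 2: {"name": "Bob"}, 3: {"name": "Cat"}}
edges = [(1, 2, "knows"), (2, 3, "knows")]  # (from, to, relationship)

def reachable(start):
    """Breadth-first traversal: everyone connected to `start`."""
    adjacency = {}
    for src, dst, _rel in edges:
        adjacency.setdefault(src, []).append(dst)
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return {nodes[n]["name"] for n in seen}

print(reachable(1))  # names of everyone reachable from Ada
```

Queries like this ("friends of friends", shortest paths) are awkward as recursive SQL joins but natural as graph traversals, which is exactly the trade-off the next slide lists.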
24. Graph stores (continued)
- Advantages
  - Fast network search
  - Works with many public data sets
- Disadvantages
  - Not very scalable
  - Hard to query systematically unless you use specialised languages based on graph traversals
- Commercial solutions
  - Neo4j
  - AllegroGraph
25. So, where to get the power needed for these giant data stores?
26. NoSQL
Credit to https://www.youtube.com/watch?v=IjpU0dLIRDI for visualisation
Data lifecycle: Create → Capture → Store → Analyse → Present (MapReduce, NoSQL, Data Warehousing, Business Intelligence Tools)
27. Traditional (Structured) Approach
[Diagram: big data flows into a single high-power processor]
28. The MapReduce Concept
[Diagram: big data is split across many data nodes; a master processor coordinates slave processors, each a standard commodity processor holding its own data nodes]
29. The MapReduce Concept
- Two fundamental steps
- Map
  - Master node takes a large problem and slices it into sub-problems
  - Master node distributes these sub-problems to worker nodes
  - A worker node may also subdivide and distribute (in which case, a multi-level tree structure results)
  - Workers process the sub-problems and hand results back to the master
- Reduce
  - Master node reassembles the solutions to the sub-problems in a pre-defined way to answer the high-level problem
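The two steps can be sketched on a single machine with the canonical word-count example (in a real Hadoop cluster the map and reduce calls run in parallel on many worker nodes; the input chunks here are invented for illustration):

```python
from collections import defaultdict

def map_phase(chunk):
    # Map: each sub-problem emits (key, value) pairs independently
    return [(word, 1) for word in chunk.split()]

def reduce_phase(pairs):
    # Reduce: reassemble the partial results by key
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

chunks = ["big data big", "data big deal"]  # the "sliced" problem
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
counts = reduce_phase(mapped)
print(counts)  # {'big': 3, 'data': 2, 'deal': 1}
```

Because each map call touches only its own chunk and the reduce combines results by key, the chunks can be processed on different machines with no coordination until the end, which is the whole point of the pattern.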
30. Issues in the Distributed Model
- How should we decompose one big task into smaller ones?
- How do we figure out an efficient way to assign tasks to different machines?
- How do we exchange results between machines?
- How do we synchronise distributed tasks?
- What do we do if a task fails?
31. Apache Hadoop
- Hadoop was created in 2005 by two Yahoo employees (Doug Cutting and Mike Cafarella), building on white papers published by Google on their MapReduce process
- The name refers to a toy elephant belonging to Doug Cutting's son
- Yahoo later donated the project to Apache to maintain in 2006
- Hadoop offers a framework of tools for dealing with big data
- Hadoop is open source, distributed under the Apache licence
32. Hadoop Ecosystem
- Image from http://www.neevtech.com/blog/2013/03/18/hadoop-ecosystem-at-a-glance/
33. Hadoop Ecosystem
- Image from http://thebigdatablog.weebly.com/blog/the-hadoop-ecosystem-overview/
34. MapReducing in Hadoop
[Diagram: an application submits work to a queue, which the master processor's job tracker splits into batches for the task trackers on each slave processor (this is where MapReduce comes in); the master's name node and the data nodes on each slave manage file storage (this is where HDFS comes in)]
35. Fault Handling in Hadoop
- Distributed processing means that sooner or later, part of the distributed processing network will fail
  - A practical truth of networks: they are unreliable
- Hadoop's HDFS has fault tolerance built in for data nodes
  - Three copies of each file are maintained by Hadoop
  - If one copy goes down, data is retrieved from another
  - The faulty node is then updated with new (working) data from a backup
- Hadoop's HDFS also tracks failures in task trackers
  - The master node's job tracker watches for errors in slave nodes
  - It allocates tasks to a new slave if the slave responsible fails
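The "three copies" behaviour is governed by the standard dfs.replication property in HDFS's hdfs-site.xml configuration file (3 is the HDFS default; shown here explicitly as a sketch of where the setting lives):

```xml
<!-- hdfs-site.xml: the replication factor behind HDFS fault tolerance -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>
```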
36. Programming in Hadoop
- Programmers using Hadoop don't have to worry about
  - Where files are stored
  - How to manage failures
  - How to distribute computation
  - How to scale activities up or down
- A variety of languages can be used, though Java is the most common and arguably the most hassle-free
37. Implementing a Hadoop System
- Hadoop can be run in traditional onsite data centres using multiple dedicated machines
- Hadoop can also be run via cloud-hosted services, including
  - Microsoft Azure
  - Amazon EC2/S3
  - Amazon Elastic MapReduce
  - Google Compute Engine
38. Implementing a Hadoop System: Yahoo Servers Running Hadoop
39. Applications of Hadoop
- Areas of application include
  - Search engines, e.g. Google, Yahoo
  - Social media, e.g. Facebook, Twitter
  - Financial services, e.g. Morgan Stanley, BNY Mellon
  - eCommerce, e.g. Amazon, American Airlines, eBay, IBM
  - Government, e.g. Federal Reserve, Homeland Security
40. Users of Hadoop
- Just like RDBMS, Hadoop systems have different levels of users
- Administrators handle
  - Configuring the system
  - Updates and installation
  - General firefighting
- Basic users
  - Run tests and gather data for reporting, market research, general exploration, etc.
  - Design applications to use data
41. Accessibility of NoSQL databases?
42. Want to read more?
- Apache Hadoop Documentation
  - http://hadoop.apache.org/docs/current/
- Data-Intensive Text Processing with MapReduce
  - http://lintool.github.io/MapReduceAlgorithms/
- Hadoop: The Definitive Guide
  - http://www.amazon.com/Hadoop-Definitive-Guide-Tom-White/dp/1449311520
43. Want to read more?
- Financial services using Hadoop
  - http://hortonworks.com/blog/financial-services-hadoop/
  - https://www.mapr.com/solutions/industry/big-data-and-apache-hadoop-financial-services
- Hadoop at ND
  - http://ccl.cse.nd.edu/operations/hadoop/