IS6126 Databases for Management Information Systems, Lecture 8: Working with unstructured data



1
IS6126 Databases for Management Information Systems
Lecture 8: Working with unstructured data
  • Rob Gleasure
  • R.Gleasure@ucc.ie
  • robgleasure.com

2
IS6126
  • Today's session
  • Technologies for analysis
  • Technologies for storage
  • NoSQL
  • Distributed MapReduce architectures, e.g. Hadoop

3
Technologies and tools
Credit to https://www.youtube.com/watch?v=IjpU0dLIRDI for the visualisation
[Diagram: data lifecycle (Create, Capture, Store, Analyse, Present) with a Business Intelligence Tools layer]
4
Tools for analysis and presentation
  • Massive range of software, depending on needs
  • For visualising and sorting data
  • Excel
  • Pentaho
  • For data mining, regressions, clustering,
    graphing, etc.
  • SPSS
  • R
  • Gephi
  • UCINET
  • For reporting
  • Excel
  • Pentaho

Let's get our hands data-y!
(I know. Sorry.)
5
Tools for analysis and presentation
6
Data warehousing
Credit to https://www.youtube.com/watch?v=IjpU0dLIRDI for the visualisation
[Diagram: data lifecycle (Create, Capture, Store, Analyse, Present) with Data Warehousing and Business Intelligence Tools layers]
7
Data warehousing
[Diagram: operational databases (HR and payroll, sales and customers, orders, technical support, purchased data), which handle OLTP, feed an Extract-Transform-Load process into the data warehouse; the resulting business intelligence database supports OLAP work such as data mining, visualisation, and reporting]
8
OLTP vs. OLAP
  • Online transaction processing (OLTP)
    databases/data stores support ongoing activities
    in an organisation
  • Hence, they need to
  • Manage accurate real-time transactions
  • Handle reads, writes, and updates by large
    numbers of concurrent users
  • Decompose data into joinable, efficient rows (e.g. normalised to third normal form)
  • These requirements are often labelled ACID database transactions (a small example of atomicity follows after this list)
  • Atomic: Every part of a transaction works or it's all rolled back
  • Consistent: The database is never left in an inconsistent state
  • Isolated: Transactions do not interfere with one another
  • Durable: Completed transactions are not lost if the system crashes
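To make atomicity concrete, here is a minimal sketch using Python's built-in sqlite3 module. The accounts table, its columns, and the simulated failure are invented purely for illustration.

    import sqlite3

    # In-memory database purely for illustration; the accounts table is hypothetical
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
    conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")
    conn.commit()

    try:
        with conn:  # opens a transaction: commits on success, rolls back on error
            conn.execute("UPDATE accounts SET balance = balance - 70 WHERE name = 'alice'")
            raise RuntimeError("simulated crash before crediting bob")
    except RuntimeError:
        pass

    # Atomicity: the debit is not visible, because the whole transaction rolled back
    print(conn.execute("SELECT name, balance FROM accounts ORDER BY name").fetchall())
    # [('alice', 100), ('bob', 50)]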

9
OLTP vs. OLAP
  • Online analytical processing (OLAP)
    databases/data stores are used to support
    predictive analytics
  • Hence, they need to
  • Allow vast quantities of historical data to be
    accessed quickly
  • Be updatable in batches (often daily)
  • Aggregate diverse structures with summary data
  • These requirements are often labelled BASE database transactions
  • Basically Available: the store guarantees availability, though responses may be out of date
  • Soft state: the state of the system may change over time, even without new input
  • Eventual consistency: the system becomes consistent once updates stop arriving

10
NoSQL
Credit to https://www.youtube.com/watch?v=IjpU0dLIRDI for the visualisation
[Diagram: data lifecycle (Create, Capture, Store, Analyse, Present) with NoSQL, Data Warehousing, and Business Intelligence Tools layers]
11
What is NoSQL?
  • What is NoSQL?
  • Basically, any database that isn't a relational database
  • Stands for "Not only SQL"
  • It's NOT anti-SQL or anti-relational databases

12
What is NoSQL (continued)?
  • It's not only rows in tables
  • NoSQL systems store and retrieve data in many formats, e.g. text, CSV, XML, GraphML
  • It's not only joins
  • NoSQL systems mean you can extract data using simple interfaces, rather than necessarily relying on joins
  • It's not only schemas
  • NoSQL systems mean you can drag-and-drop data into a folder, without having to organise and query it according to entities, attributes, relationships, etc.

13
What is NoSQL (continued)?
  • It's not only executed on one processor
  • NoSQL systems mean you can store databases on multiple processors with high-speed performance
  • It's not only specialised computers
  • NoSQL systems mean you can leverage low-cost, shared-nothing commodity processors that have separate RAM and disk
  • It's not only logarithmically scalable
  • NoSQL systems mean you can achieve linear scalability as you add more processors
  • It's not only anything, really
  • NoSQL systems emphasise innovation and inclusivity, meaning there are multiple recognised options for how data is stored, retrieved, and manipulated (including standard SQL solutions)

14
What is NoSQL (continued)?
15
Four Data Patterns in NoSQL
16
Key-value stores
  • A simple string (the key) returns a Binary Large
    OBject (BLOB) of data (the value)
  • E.g. the web
  • The key can take many formats
  • Logical path names
  • A hash string artificially generated from the
    value
  • REST web service calls
  • SQL queries
  • Three basic functions (see the sketch after this list)
  • Put
  • Get
  • Delete
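As a rough illustration of that put/get/delete interface, here is a toy in-memory key-value store in Python. The class, method names, and keys are invented and do not reflect any particular product's API.

    # A toy key-value store: string keys map to opaque BLOB values (bytes)
    class KeyValueStore:
        def __init__(self):
            self._data = {}

        def put(self, key, value):
            self._data[key] = value

        def get(self, key):
            return self._data.get(key)          # None if the key is absent

        def delete(self, key):
            self._data.pop(key, None)

    store = KeyValueStore()
    store.put("users/42/avatar.png", b"...binary image data...")
    print(store.get("users/42/avatar.png"))     # b'...binary image data...'
    store.delete("users/42/avatar.png")
    print(store.get("users/42/avatar.png"))     # None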

17
Key-value stores (continued)
  • Advantages
  • Scalability, reliability, portability
  • Low operational costs
  • Simplicity
  • Disadvantages
  • No real options for advanced search
  • Commercial solutions
  • Amazon S3
  • Voldemort

18
Column-family stores
  • Stores BLOBs of data in one big table, with four possible basic identifiers used for look-up (illustrated in the sketch after this list)
  • Row
  • Column
  • Column-family
  • Time-stamp
  • More like a spreadsheet than an RDBMS in many ways (e.g. no indices, triggers, or SQL queries)
  • Grew from ideas presented in Google's BigTable paper
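The sketch below illustrates that BigTable-style addressing scheme (row, column-family, column, timestamp) with nested Python dictionaries. The table and values are invented; real column-family stores add persistence, distribution, and compaction on top of this idea.

    import time

    # Cell address: (row key, column family, column, timestamp) -> value
    table = {}

    def put(row, family, column, value, ts=None):
        ts = ts if ts is not None else time.time()
        table.setdefault(row, {}).setdefault(family, {}).setdefault(column, {})[ts] = value

    def get_latest(row, family, column):
        versions = table[row][family][column]
        return versions[max(versions)]          # the highest timestamp wins

    put("user42", "profile", "name", "Ada")
    put("user42", "profile", "name", "Ada Lovelace")     # a newer version of the same cell
    put("user42", "activity", "last_login", "2015-11-02")

    print(get_latest("user42", "profile", "name"))       # Ada Lovelace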

19
Column-family stores (continued)
  • Advantages
  • Scales pretty well
  • Decent search ability
  • Easy to add new data
  • Pretty intuitive
  • Disadvantages
  • Can't query BLOB content
  • Not as efficient to search as some other options
  • Commercial solutions
  • Cassandra
  • HBase

20
Document stores
  • Stores data in nested hierarchies (typically using XML or JSON)
  • Keeps logical chunks of data together in one place (see the example below)

[Diagram: comparison of flat tables (e.g. CSV), mixed content (e.g. XML), and hierarchical documents (e.g. JSON)]
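For example, a single order might be stored as one nested JSON document rather than being normalised across several relational tables. The field names below are invented for illustration.

    import json

    # One logically complete document, kept together in one place
    order = {
        "order_id": "A-1001",
        "customer": {"name": "Mary Byrne", "email": "mary@example.com"},
        "items": [
            {"sku": "B-17", "description": "Keyboard", "qty": 1, "price": 25.00},
            {"sku": "C-03", "description": "Monitor", "qty": 2, "price": 140.00},
        ],
        "status": "shipped",
    }

    print(json.dumps(order, indent=2))   # the JSON form a document store would hold
    print(order["customer"]["name"])     # within-document lookup, no joins needed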
21
Document stores (continued)
  • Advantages
  • Lends itself to efficient within-document search
  • Very suitable for information retrieval
  • Very suitable where data is fed directly into
    websites/applications
  • Allows for structure without being overly
    restrictive
  • Disadvantages
  • Complicated to implement
  • Search process may require opening and closing
    files
  • Analysis requires some flattening
  • Commercial solutions
  • MarkLogic
  • MongoDB

22
Graph stores
  • Model the interconnectivity of the data by focusing on nodes (sometimes called vertices), relationships (sometimes called edges), and properties (a minimal sketch follows below)
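A minimal property-graph sketch, with nodes and edges held separately and a simple one-hop traversal. All names and properties here are invented; real graph stores such as Neo4j provide query languages for this kind of search.

    # Nodes and edges (with properties) kept in separate structures
    nodes = {
        "p1": {"label": "Person", "name": "Aoife"},
        "p2": {"label": "Person", "name": "Brian"},
        "c1": {"label": "Company", "name": "UCC"},
    }
    edges = [
        {"from": "p1", "to": "p2", "type": "KNOWS", "since": 2012},
        {"from": "p1", "to": "c1", "type": "WORKS_AT"},
        {"from": "p2", "to": "c1", "type": "WORKS_AT"},
    ]

    def neighbours(node_id, edge_type=None):
        # Follow outgoing edges from node_id, optionally filtered by relationship type
        return [nodes[e["to"]] for e in edges
                if e["from"] == node_id and (edge_type is None or e["type"] == edge_type)]

    print([n["name"] for n in neighbours("p1", "WORKS_AT")])   # ['UCC']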

23
Graph stores (continued)
  • Tables are stored for nodes and edges separately, meaning new types of search become possible

24
Graph stores (continued)
  • Advantages
  • Fast network search
  • Works with many public data sets
  • Disadvantages
  • Not very scalable
  • Hard to query systematically unless you use
    specialised languages based on graph traversals
  • Commercial solutions
  • Neo4j
  • AllegroGraph

25
So, where to get the power needed for these giant
data stores?
26
NoSQL
Credit to https://www.youtube.com/watch?v=IjpU0dLIRDI for the visualisation
[Diagram: data lifecycle (Create, Capture, Store, Analyse, Present) with MapReduce, NoSQL, Data Warehousing, and Business Intelligence Tools layers]
27
Traditional (Structured) Approach
[Diagram: big data funnelled into a single high-power processor]
28
The MapReduce Concept
[Diagram: big data split across many data nodes; a master processor coordinates standard (slave) processors, each working on its own data nodes]
29
The MapReduce Concept
  • Two fundamental steps (see the word-count sketch after this list)
  • Map
  • Master node takes a large problem and slices it into sub-problems
  • Master node distributes these sub-problems to worker nodes
  • Worker nodes may also subdivide and distribute (in which case, a multi-level tree structure results)
  • Workers process sub-problems and hand results back to the master
  • Reduce
  • Master node reassembles solutions to the sub-problems in a pre-defined way to answer the high-level problem
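A minimal, single-machine word-count sketch of the two steps in Python. It only imitates the idea: a real framework also handles partitioning, shuffling across the network, and fault tolerance.

    from collections import defaultdict

    documents = ["big data needs big ideas", "map then reduce the data"]

    # Map: each worker turns its slice of the problem into (key, value) pairs
    def map_phase(doc):
        return [(word, 1) for word in doc.split()]

    # Shuffle: group intermediate pairs by key (normally done by the framework)
    grouped = defaultdict(list)
    for doc in documents:                       # pretend each doc went to a different worker
        for word, count in map_phase(doc):
            grouped[word].append(count)

    # Reduce: combine the values for each key into the final answer
    def reduce_phase(word, counts):
        return word, sum(counts)

    print(dict(reduce_phase(w, c) for w, c in grouped.items()))
    # {'big': 2, 'data': 2, 'needs': 1, 'ideas': 1, 'map': 1, 'then': 1, 'reduce': 1, 'the': 1}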

30
Issues in Distributed Model
  • How should we decompose one big task into smaller
    ones?
  • How do we figure out an efficient way to assign
    tasks to different machines?
  • How do we exchange results between machines?
  • How do we synchronize distributed tasks?
  • What do we do if a task fails?

31
Apache Hadoop
  • Hadoop was created in 2005 by Doug Cutting and Mike Cafarella, building on white papers published by Google about its MapReduce process
  • The name refers to a toy elephant belonging to Doug Cutting's son
  • The project became an Apache project in 2006, with Yahoo (which Cutting joined that year) as a major early contributor
  • Hadoop offers a framework of tools for dealing with big data
  • Hadoop is open source, distributed under the Apache licence

32
Hadoop Ecosystem
  • Image from http://www.neevtech.com/blog/2013/03/18/hadoop-ecosystem-at-a-glance/

33
Hadoop Ecosystem
  • Image from http://thebigdatablog.weebly.com/blog/the-hadoop-ecosystem-overview/

34
MapReducing in Hadoop
[Diagram: applications submit jobs to a queue on the master processor; the Job Tracker (this is where MapReduce comes in) sends batches of tasks to Task Trackers on the slave processors, while the Name Node (this is where HDFS comes in) manages the Data Nodes that hold the data]
35
Fault Handling in Hadoop
  • Distributing processing means that, sooner or later, part of the distributed processing network will fail
  • Practical truth of networks: they are unreliable
  • Hadoop's HDFS has fault tolerance built in for data nodes
  • Three copies of each file are maintained by Hadoop
  • If one copy goes down, data is retrieved from another
  • The faulty node is then updated with new (working) data from a backup copy
  • Hadoop also tracks failures in task trackers
  • The master node's job tracker watches for errors in slave nodes
  • Tasks are allocated to a new slave if the slave responsible fails

36
Programming in Hadoop
  • Programmers using Hadoop don't have to worry about
  • Where files are stored
  • How to manage failures
  • How to distribute computation
  • How to scale activities up or down
  • A variety of languages can be used, though Java is the most common and arguably the most hassle-free (a streaming-style sketch in Python follows below)
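One low-friction alternative is Hadoop Streaming, which lets any executable act as mapper and reducer by reading standard input and writing standard output. The word-count pair below is a sketch in Python; the file names and the exact submission command vary with the installation.

    # mapper.py: emit one "word<TAB>1" line per word on standard input
    import sys

    for line in sys.stdin:
        for word in line.split():
            print(word + "\t1")

    # reducer.py: input arrives sorted by key, so counts for a word are contiguous
    import sys

    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                print(current + "\t" + str(total))
            current, total = word, 0
        total += int(count)
    if current is not None:
        print(current + "\t" + str(total))

A job would be submitted with something like: hadoop jar hadoop-streaming-*.jar -input input/ -output output/ -mapper mapper.py -reducer reducer.py (the exact jar name and any options for shipping the scripts depend on the Hadoop version).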

37
Implementing a Hadoop System
  • Hadoop can be run in traditional onsite data
    centres using multiple dedicated machines
  • Hadoop can also be run via cloud-hosted services,
    including
  • Microsoft Azure
  • Amazon EC2/S3
  • Amazon Elastic MapReduce
  • Google Compute Engine

38
Implementing a Hadoop System: Yahoo Servers Running Hadoop
39
Applications of Hadoop
  • Areas of application include
  • Search engines, e.g. Google, Yahoo
  • Social media, e.g. Facebook, Twitter
  • Financial services, e.g. Morgan Stanley, BNY Mellon
  • eCommerce, e.g. Amazon, American Airlines, eBay, IBM
  • Government, e.g. Federal Reserve, Homeland Security

40
Users of Hadoop
  • Just like RDBMSs, Hadoop systems have different levels of users
  • Administrators handle
  • Configuration of the system
  • Updates and installation
  • General firefighting
  • Basic users
  • Run tests and gather data for reporting, market
    research, general exploration, etc.
  • Design applications to use data

41
Accessibility of NoSQL databases?
42
Want to read more?
  • Apache Hadoop Documentation
  • http://hadoop.apache.org/docs/current/
  • Data-Intensive Text Processing with MapReduce
  • http://lintool.github.io/MapReduceAlgorithms/
  • Hadoop: The Definitive Guide
  • http://www.amazon.com/Hadoop-Definitive-Guide-Tom-White/dp/1449311520

43
Want to read more?
  • Financial Services using Hadoop
  • http://hortonworks.com/blog/financial-services-hadoop/
  • https://www.mapr.com/solutions/industry/big-data-and-apache-hadoop-financial-services
  • Hadoop at ND
  • http://ccl.cse.nd.edu/operations/hadoop/