IS6126 Databases for Management Information Systems Lecture 5: NoSQL and distributed data stores

Title: Introduction Author: Rob Gleasure Last modified by: Gleasure, Rob Created Date: 9/20/2005 10:52:45 AM Document presentation format: On-screen Show (4:3) – PowerPoint PPT presentation

Slides: 47
Provided by: RobG177

1
IS6126 Databases for Management Information Systems
Lecture 5: NoSQL and distributed data stores
  • Rob Gleasure
  • R.Gleasure_at_ucc.ie
  • robgleasure.com

2
IS6126
  • Today's session
  • Technologies for analysis
  • Technologies for storage
  • NoSQL
  • Distributed map reduce architectures, e.g. Hadoop

3
Technologies and tools
[Diagram: data lifecycle (Create, Capture, Store, Analyse, Present), annotated with Business Intelligence Tools]
Credit to https://www.youtube.com/watch?v=IjpU0dLIRDI for visualisation
4
Data warehousing
[Diagram: data lifecycle (Create, Capture, Store, Analyse, Present), annotated with Data Warehousing and Business Intelligence Tools]
Credit to https://www.youtube.com/watch?v=IjpU0dLIRDI for visualisation
5
Data warehousing
[Diagram: operational databases (OLTP: HR and payroll, Sales and customers, Orders, Technical support) and Purchased data pass through Extract, Transform, Load into the data warehouse, a business intelligence database (OLAP) that supports Data mining, Visualisation, and Reporting]
6
What is NoSQL?
  • What is NoSQL?
  • Basically any database that isn't a relational
    database
  • Stands for "Not only SQL"
  • It's NOT anti-SQL or anti-relational databases

7
What is NoSQL (continued)?
  • It's not only rows in tables
  • NoSQL systems store and retrieve data from many
    formats, e.g. text, csv, xml, graphml
  • It's not only joins
  • NoSQL systems mean you can extract data using
    simple interfaces, rather than necessarily
    relying on joins
  • It's not only schemas
  • NoSQL systems mean you can drag-and-drop data
    into a folder, without having to organise and
    query it according to entities, attributes,
    relationships, etc.

8
What is NoSQL (continued)?
  • It's not only executed on one processor
  • NoSQL systems mean you can store databases on
    multiple processors with high-speed performance
  • It's not only specialised computers
  • NoSQL systems mean you can leverage low-cost
    shared-nothing commodity processors that have
    separate RAM and disk.
  • It's not only logarithmically scalable
  • NoSQL systems mean you can achieve linear
    scalability as you add more processors
  • It's not only anything, really
  • NoSQL systems emphasise innovation and
    inclusivity, meaning there are multiple
    recognised options for how data is stored,
    retrieved, and manipulated (including standard
    SQL solutions)

9
What is NoSQL (continued)?
10
NoSQL
[Diagram: data lifecycle (Create, Capture, Store, Analyse, Present), annotated with NoSQL, Data Warehousing, and Business Intelligence Tools]
Credit to https://www.youtube.com/watch?v=IjpU0dLIRDI for visualisation
11
Four Data Patterns in NoSQL
12
Key-value stores
  • A simple string (the key) returns a Binary Large
    OBject (BLOB) of data (the value)
  • E.g. the web
  • The key can take many formats
  • Logical path names
  • A hash string artificially generated from the
    value
  • REST web service calls
  • SQL queries
  • Three basic functions
  • Put
  • Get
  • Delete
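
The three basic functions can be sketched with a toy in-memory store. This is only an illustration: the class name `KeyValueStore`, the example path, and the choice of SHA-256 for a value-derived key are assumptions, not part of any product named in these slides.

```python
import hashlib
from typing import Dict, Optional


class KeyValueStore:
    """Toy in-memory key-value store: string keys map to opaque bytes (BLOBs)."""

    def __init__(self) -> None:
        self._data: Dict[str, bytes] = {}

    def put(self, key: str, value: bytes) -> None:
        # Store (or overwrite) the BLOB under the given key.
        self._data[key] = value

    def get(self, key: str) -> Optional[bytes]:
        # Return the BLOB, or None if the key is unknown.
        return self._data.get(key)

    def delete(self, key: str) -> bool:
        # Remove the key and report whether anything was actually deleted.
        return self._data.pop(key, None) is not None


store = KeyValueStore()

# Key as a logical path name (hypothetical path).
store.put("/reports/2015/q1.csv", b"region,total\nEU,100\n")
store.delete("/reports/2015/q1.csv")

# Key as a hash string generated from the value itself.
blob = b"hello, NoSQL"
content_key = hashlib.sha256(blob).hexdigest()
store.put(content_key, blob)
```

Note that the store never looks inside the value, which is why search beyond exact key look-up is not available.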

13
Key-value stores (continued)
14
Key-value stores (continued)
  • Advantages
  • Scalability, reliability, portability
  • Low operational costs
  • Simplicity
  • Disadvantages
  • No real options for advanced search
  • Commercial solutions
  • Amazon S3
  • Voldemort

15
Column-family stores
  • Stores BLOBs of data in one big table, with four
    possible basic identifiers used for look-up
  • Row
  • Column
  • Column-family
  • Time-stamp
  • More like a spreadsheet than an RDBMS in many
    ways (e.g. no indices, triggers, or SQL queries)
  • Grew from an idea presented in a Google BigTable
    paper
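
A minimal sketch of look-up by the four identifiers, assuming a plain dictionary as the "one big table". The class `ColumnFamilyTable` and the sample row, family, and column names are hypothetical and do not reflect the API of Cassandra or HBase.

```python
from typing import Dict, Optional, Tuple

# (row, column family, column, timestamp) identifies one BLOB.
CellKey = Tuple[str, str, str, int]


class ColumnFamilyTable:
    """Toy column-family table keyed by the four basic identifiers."""

    def __init__(self) -> None:
        self._cells: Dict[CellKey, bytes] = {}

    def put(self, row: str, family: str, column: str,
            timestamp: int, value: bytes) -> None:
        self._cells[(row, family, column, timestamp)] = value

    def get(self, row: str, family: str, column: str,
            timestamp: Optional[int] = None) -> Optional[bytes]:
        # Collect every stored version of this cell.
        versions = {ts: value
                    for (r, f, c, ts), value in self._cells.items()
                    if (r, f, c) == (row, family, column)}
        if not versions:
            return None
        if timestamp is None:
            # No timestamp given: return the most recent version.
            timestamp = max(versions)
        return versions.get(timestamp)


table = ColumnFamilyTable()
table.put("customer42", "contact", "email", 1, b"old@example.com")
table.put("customer42", "contact", "email", 2, b"new@example.com")
latest = table.get("customer42", "contact", "email")
first = table.get("customer42", "contact", "email", timestamp=1)
```

Adding a new column is just another `put` with a new column name, which is one reason new data is easy to add in this pattern.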

16
Column-family stores (continued)
17
Column-family stores (continued)
  • May then be stored and reconstructed in
    components, e.g. HBase converts to
    column-name/value rows
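
The idea of storing a record as column-name/value rows and rebuilding it later can be sketched as follows. This is a simplified illustration of the concept only; the helper names `to_cells` and `from_cells` are made up here, and HBase's real on-disk format carries more information (column family, timestamp) than this shows.

```python
from collections import defaultdict


def to_cells(row_key, record):
    # One (row key, column name, value) triple per populated column.
    return [(row_key, name, value) for name, value in record.items()]


def from_cells(cells):
    # Group the triples by row key to reconstruct the original records.
    rows = defaultdict(dict)
    for row_key, name, value in cells:
        rows[row_key][name] = value
    return dict(rows)


cells = to_cells("customer42", {"name": "Ann", "city": "Cork"})
rebuilt = from_cells(cells)
```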

18
Column-family stores (continued)
  • Advantages
  • Scales pretty well
  • Decent searchability
  • Easy to add new data
  • Pretty intuitive
  • Disadvantages
  • Can't query BLOB content
  • Not as efficient to search as some other options
  • Commercial solutions
  • Cassandra
  • HBase

19
Document stores
  • Stores data in nested hierarchies
  • Keeps logical chunks of data together in one
    place
  • Treats data as collections of categories and
    subcategories
  • Arguably started with the growth of XML as a
    standard format for exchanging data between
    applications
  • Gradual move to just storing data in that format

20
Document stores
  • Typically requires some flattening before data
    can be analysed, e.g. ETL functions that
    process data and produce a .csv

[Diagram: hierarchical docs (e.g. JSON) and mixed content (e.g. XML) are flattened into flat tables (e.g. csv)]
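
A minimal sketch of the flattening step, assuming documents are already parsed into nested Python dictionaries. The function names `flatten` and `to_csv`, the dotted column names, and the sample documents are illustrative assumptions; lists inside documents are not handled.

```python
import csv
import io


def flatten(doc, prefix=""):
    # Turn nested dictionaries into one flat dictionary with dotted names,
    # e.g. {"address": {"city": "Cork"}} becomes {"address.city": "Cork"}.
    flat = {}
    for key, value in doc.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat


def to_csv(docs):
    rows = [flatten(doc) for doc in docs]
    # Use the union of all column names, in first-seen order.
    fieldnames = []
    for row in rows:
        for name in row:
            if name not in fieldnames:
                fieldnames.append(name)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)  # missing columns are written as empty strings
    return buffer.getvalue()


docs = [
    {"name": "Ann", "address": {"city": "Cork"}},
    {"name": "Bob", "address": {"city": "Dublin", "zip": "D01"}},
]
table_text = to_csv(docs)
```

Documents in the same collection need not share a structure, so the flat table ends up with empty cells wherever a document lacks a field.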
21
Document stores (continued)
  • Example of an XML document from MarkLogic

22
Document stores (continued)
  • Example of a JSON object from MongoDB

23
Document stores (continued)
  • Advantages
  • Lends itself to efficient within-document search
  • Very suitable for information retrieval
  • Very suitable where data is fed directly into
    websites/applications
  • Allows for structure without being overly
    restrictive
  • Disadvantages
  • Complicated to implement
  • Search process may require opening and closing
    files
  • Analysis requires some flattening
  • Commercial solutions
  • MarkLogic
  • MongoDB

24
Graph stores
  • Model the interconnectivity of the data by
    focusing on nodes (sometimes called vertices),
    relationships (sometimes called edges), and
    properties
  • Tables stored for nodes and edges separately,
    meaning new types of search become possible
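
A minimal sketch of separate node and edge tables and one kind of search they enable: finding everything within a given number of hops. The sample data and the function name `within_hops` are hypothetical, and real graph stores use indexes rather than scanning the edge list.

```python
from collections import deque

# Node table: id -> properties.
nodes = {
    1: {"name": "Ann"},
    2: {"name": "Bob"},
    3: {"name": "Cat"},
    4: {"name": "Dan"},
}

# Edge table: (source id, target id, relationship).
edges = [
    (1, 2, "knows"),
    (2, 3, "knows"),
    (3, 4, "knows"),
]


def within_hops(start, max_hops, edge_table):
    # Breadth-first traversal that records the hop count to each node.
    distance = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if distance[current] == max_hops:
            continue  # do not expand beyond the hop limit
        for source, target, _relationship in edge_table:
            if source == current and target not in distance:
                distance[target] = distance[current] + 1
                queue.append(target)
    return {node: hops for node, hops in distance.items() if node != start}


friends_of_friends = within_hops(1, 2, edges)
names = [nodes[node]["name"] for node in sorted(friends_of_friends)]
```

The same question asked of relational tables would need one self-join per hop.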

25
Graph stores (continued)
  • Example from Gephi

26
Graph stores (continued)
  • Advantages
  • Fast network search
  • Works with many public data sets
  • Disadvantages
  • Not very scalable
  • Hard to query systematically unless you use
    specialised languages based on graph traversals
  • Commercial solutions
  • Neo4j
  • AllegroGraph

27
So, where to get the power needed for these giant
data stores?
28
NoSQL
[Diagram: data lifecycle (Create, Capture, Store, Analyse, Present), annotated with MapReduce, NoSQL, Data Warehousing, and Business Intelligence Tools]
Credit to https://www.youtube.com/watch?v=IjpU0dLIRDI for visualisation
29
Traditional (Structured) Approach
[Diagram: Big Data is fed as a single body of Data into one High Power Processor]
30
The MapReduce Concept
[Diagram: Big Data is split across many Data Nodes, handled by one Master Processor and several Slave Processors (standard processors)]
31
The MapReduce Concept
  • Two fundamental steps
  • Map
  • Master node takes large problem and slices it
    into sub-problems
  • Master node distributes these sub-problems to
    worker nodes.
  • Worker node may also subdivide and distribute (in
    which case, a multi-level tree structure results)
  • Worker processes sub-problems and hands back to
    master
  • Reduce
  • Master node reassembles solutions to sub-problems
    in a pre-defined way to answer high-level problem
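
The two steps can be sketched with a word count on a single machine. This follows the framing of the slide (slice, distribute, reassemble) and is not Hadoop code: the function names are made up, the "workers" run one after another in an ordinary loop, and the shuffle stage of real MapReduce frameworks is left out.

```python
from collections import Counter
from functools import reduce


def split_into_chunks(lines, n_workers):
    # Master: slice the large problem into one sub-problem per worker.
    return [lines[i::n_workers] for i in range(n_workers)]


def map_chunk(chunk):
    # Worker: count the words in its own slice only.
    counts = Counter()
    for line in chunk:
        counts.update(line.lower().split())
    return counts


def reduce_counts(partials):
    # Master: reassemble the partial answers by adding the counts together.
    return reduce(lambda left, right: left + right, partials, Counter())


def word_count(lines, n_workers=3):
    chunks = split_into_chunks(lines, n_workers)
    # On a cluster each chunk would be processed on a different machine.
    partials = [map_chunk(chunk) for chunk in chunks]
    return reduce_counts(partials)


totals = word_count(["big data big", "data store", "big"])
```

Because each partial count depends only on its own chunk, the map calls can run in any order or in parallel without changing the result.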

32
Issues in Distributed Model
  • How should we decompose one big task into smaller
    ones?
  • How do we figure out an efficient way to assign
    tasks to different machines?
  • How do we exchange results between machines?
  • How do we synchronize distributed tasks?
  • What do we do if a task fails?

33
Apache Hadoop
  • Hadoop was created in 2005 by Doug Cutting and
    Mike Cafarella (Cutting went on to join Yahoo
    in 2006), building on white papers by Google on
    their MapReduce process.
  • The name refers to a toy elephant belonging to
    Doug Cutting's son
  • Yahoo later donated the project to Apache to
    maintain in 2006
  • Hadoop offers a framework of tools for dealing
    with big data
  • Hadoop is open source, distributed under the
    Apache licence

34
Hadoop Ecosystem
  • Image from http://www.neevtech.com/blog/2013/03/18/hadoop-ecosystem-at-a-glance/

35
Hadoop Ecosystem
  • Image from http://thebigdatablog.weebly.com/blog/the-hadoop-ecosystem-overview/

36
MapReducing in Hadoop
[Diagram: an Application submits Batches to a Queue on the Master Processor, which runs the Job Tracker and the Name Node alongside its own Task Tracker and Data Node; each of three Slave Processors runs a Task Tracker and a Data Node. HDFS comes in at the Name Node and Data Nodes; MapReduce comes in at the Job Tracker and Task Trackers]
37
Fault Handling in Hadoop
  • Distributing processing means that sooner or
    later, part of the distributed processing network
    will fail
  • Practical truth of networks: they are unreliable
  • Hadoop's HDFS has fault tolerance built-in for
    data nodes
  • Three copies of each file maintained by Hadoop
    (by default)
  • If one copy goes down, data is retrieved from
    another
  • Faulty node is then updated with new (working)
    data from backup
  • Hadoop's MapReduce layer also tracks failures in
    task trackers
  • Master node's job tracker watches for errors in
    slave nodes
  • Allocates tasks to new slave if existing slave
    responsible fails
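
Both behaviours, reading from a surviving copy and moving a task off a failed node, can be sketched with a toy cluster. Everything here is hypothetical (the class `ToyCluster`, the node names, the naive placement of copies on the first three nodes) and none of it is the Hadoop API; it only illustrates the idea.

```python
REPLICATION_FACTOR = 3  # three copies of each file, as on the slide


class ToyCluster:
    """Hypothetical stand-in for a cluster of data nodes and task trackers."""

    def __init__(self, node_names):
        self.storage = {name: {} for name in node_names}  # node -> {file: data}
        self.alive = set(node_names)

    def write(self, filename, data):
        # Keep a copy on the first REPLICATION_FACTOR nodes (naive placement).
        targets = sorted(self.storage)[:REPLICATION_FACTOR]
        for name in targets:
            self.storage[name][filename] = data
        return targets

    def fail(self, name):
        # Simulate a node going down.
        self.alive.discard(name)

    def read(self, filename):
        # If one copy is down, retrieve the data from another live node.
        for name in sorted(self.alive):
            if filename in self.storage[name]:
                return name, self.storage[name][filename]
        raise IOError(f"no live replica of {filename}")

    def run_task(self, task, preferred):
        # Allocate the task to another node if the preferred one has failed.
        candidates = [preferred] + sorted(self.alive - {preferred})
        for name in candidates:
            if name in self.alive:
                return name, task()
        raise RuntimeError("no live nodes left")


cluster = ToyCluster(["node1", "node2", "node3", "node4"])
cluster.write("sales.csv", "id,total\n1,9.99\n")
cluster.fail("node1")
served_by, contents = cluster.read("sales.csv")
ran_on, result = cluster.run_task(lambda: 2 + 2, preferred="node1")
```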

38
Programming in Hadoop
  • Programmers using Hadoop don't have to worry
    about
  • Where files are stored
  • How to manage failures
  • How to distribute computation
  • How to scale up or down activities
  • A variety of languages can be used, though Java
    is the most common and arguably most hassle-free

39
Implementing a Hadoop System
  • Hadoop can be run in traditional onsite data
    centres using multiple dedicated machines
  • Hadoop can also be run via cloud-hosted services,
    including
  • Microsoft Azure
  • Amazon EC2/S3
  • Amazon Elastic MapReduce
  • Google Compute Engine

40
Implementing a Hadoop System: Yahoo Servers Running Hadoop
41
Applications of Hadoop
  • Areas of application include
  • Search engines, e.g. Google, Yahoo
  • Social media, e.g. Facebook, Twitter
  • Financial services, e.g. Morgan Stanley, BNY Mellon
  • eCommerce, e.g. Amazon, American Airlines, eBay,
    IBM
  • Government, e.g. Federal Reserve, Homeland
    Security

42
Users of Hadoop
  • Just like RDBMS, Hadoop systems have different
    levels of users
  • Administrators handle
  • Configuring of the system
  • Updates and installation
  • General firefighting
  • Basic users
  • Run tests and gather data for reporting, market
    research, general exploration, etc.
  • Design applications to use data

43
And another thing: Blockchain
  • The newest, hottest data storage trend
  • Based around 4 main principles
  • Shared data
  • Distributed consensus
  • Smart contracts
  • Native cryptography

44
Bitcoin
45
Want to read more?
  • Apache Hadoop Documentation
  • http://hadoop.apache.org/docs/current/
  • Data-Intensive Text Processing with MapReduce
  • http://lintool.github.io/MapReduceAlgorithms/
  • Hadoop: The Definitive Guide
  • http://www.amazon.com/Hadoop-Definitive-Guide-Tom-White/dp/1449311520

46
Want to read more?
  • Financial Services using Hadoop
  • http://hortonworks.com/blog/financial-services-hadoop/
  • https://www.mapr.com/solutions/industry/big-data-and-apache-hadoop-financial-services
  • Hadoop at ND
  • http://ccl.cse.nd.edu/operations/hadoop/