1
TDD: Topics in Distributed Databases (Querying
and cleaning big data)
Wenfei Fan, University of Edinburgh
1
2
What is big data?
2
3
Big data: What is it anyway?
Everyone talks about big data. But what is it?
  • Volume: horrendously large
  • PB (10^15 B)
  • EB (10^18 B)
  • Variety: heterogeneous, semi-structured or
    unstructured
  • 9:1 ratio of unstructured data vs. structured
    data
  • collecting 95% of restaurants requires at least 5000
    sources
  • Velocity: dynamic
  • think of the Web and Facebook, ...
  • Veracity: trust in its quality
  • real-life data is typically dirty!

cf. Online ordering of overlapping data sources,
PVLDB 7(3), 2013, Mariam Salloum, Xin Luna Dong,
Divesh Srivastava, Vassilis J. Tsotras
A departure from our familiar data management!
3
4
Why is the data so big?
  • Worldwide information volume is growing annually
    at a minimum rate of 59%
  • A single jet engine produces 20 TB (1 TB = 10^12 B) of data
    per hour
  • Facebook has 1.38 billion users, 140 billion
    links, about 300 PB of data
  • Genome of human: sampling, biochemistry,
    immunology, imaging, genetic, phenotypic data
  • 1 person: 1 PB (10^15 B)
  • 1000 people: 1 EB (10^18 B)
  • 1 billion people: 1 YB (10^24 B)

Gartner 2011
Big data is a relative notion: 1 TB is already too
big for your laptop
4
5
Why do we care about big data?
5
6
Example: Medicare
  • Google Flu Trends
  • advance indication in the 2007-08 flu season
  • the 2009 H1N1 outbreak
  • IBM: Predict Heart Disease Through Big Data
    Analytics
  • traditional: EKGs, heart rate, blood pressure
  • big data analysis: connecting
  • exercise and fitness tests
  • diet
  • fat and muscle composition
  • genetics and environment
  • social media and wellness: share information

Nature, 2009
A new game: large number of data sources of big
volume
6
7
Big data is needed everywhere
  • Social media marketing
  • 78% of consumers trust peer (friend, colleague
    and family member) recommendations; only 14%
    trust ads
  • if three close friends of person X like items P
    and W, and if X also likes P, then the chances
    are that X likes W too
  • Social event monitoring
  • Prevent terrorist attack
  • The Net Project, Shenzhen, China (Audaque)
  • Scientific research
  • A new yet more effective way to develop theory,
    by exploring and discovering correlations of
    seemingly disconnected factors

The world is becoming data-driven, like it or not!
7
8
The big data market is BIG
  • US HEALTH CARE: $300 B
  • Increase industry value per year by $300 B
  • US RETAIL: 60%
  • Increase net margin by 60%
  • MANUFACTURING: 50%
  • Decrease development and assembly costs by 50%
  • GLOBAL PERSONAL LOCATION DATA: $100 B
  • Increase service provider revenue by $100 B
  • EUROPE PUBLIC SECTOR ADMIN: 250 B Euro
  • Increase industry value per year by 250 B Euro

McKinsey Global Institute, May 2011
Big Data: The next frontier for innovation,
competition and productivity
8
9
Why study big data?
  • Want to find a job?
  • Research and development of big data systems
  • ETL, distributed systems (e.g., Hadoop),
    visualization tools, data warehouse, OLAP, data
    integration, data quality control, ...
  • Big data applications
  • social marketing, healthcare, ...
  • Data analysis: to get values out of big data
  • discovering and applying patterns, predictive
    analysis, business intelligence, privacy and
    security, ...
  • Prepare you for
  • graduate study: current research and practical
    issues
  • the job market: skills/knowledge in need

complexity theory, distributed databases, query
answering, algorithms, data quality
Big data Big
10
What challenges are introduced by big data?
10
11
Big data: Through the eyes of computation
  • Computer science is about the computation of a
    function f(x)
  • Big data: the data parameter x is horrendously
    large: PB or EB

What is the challenge introduced to query
answering?
  • Fallacies
  • Big data introduces no fundamental problems
  • Big data = MapReduce (Hadoop)
  • Big data = data quantity (scalability)

Are these true?
11
12
Flashback: Relational queries
  • Questions
  • What is a relational schema? A relation? A
    relational database?
  • What is a query? What is relational algebra?
  • What does relational completeness mean?
  • What is a conjunctive query?

[Figure: a DBMS stores data in a DB, accepts queries and updates, and returns answers]
The bible for database researchers: Foundations
of Databases
13
Traditional database management systems
  • A database is a collection of data, typically
    containing the information about one or more
    related organizations.
  • A database management system (DBMS) is a software
    package designed to store and manage databases.
  • Database: local
  • DBMS: centralized; single processor (CPU)
    managing local databases (single memory, disk)

[Figure: a DBMS stores data in a DB, accepts queries and updates, and returns answers]
14
Facebook: Graph Search
  • Find me restaurants in New York my friends have
    been to in 2013
  • friend(pid1, pid2)
  • person(pid, name, city)
  • dine(pid, rid, dd, mm, yy)
  • SQL query (in fact, a conjunctive query, or an
    SPC query)
  • select rid
  • from friend(pid1, pid2), person(pid, name,
    city),
  • dine(pid, rid, dd, mm, yy)
  • where pid1 = p0 and pid2 = person.pid and
  • pid2 = dine.pid and city = NYC and
    yy = 2013

Facebook: more than 1.38 billion nodes, and over
140 billion links
Is it feasible on big data?
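
To make the conjunctive query on this slide concrete, here is a minimal sketch that runs it on a tiny in-memory SQLite database. The sample rows, the person names, and the helper `restaurants_for` are made up for illustration, and `DISTINCT` is added so each restaurant is listed once; the three relations and the join conditions follow the slide.

```python
import sqlite3

# Toy instance of the three relations from the slide (rows are invented).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE friend(pid1 TEXT, pid2 TEXT);
    CREATE TABLE person(pid TEXT, name TEXT, city TEXT);
    CREATE TABLE dine(pid TEXT, rid TEXT, dd INTEGER, mm INTEGER, yy INTEGER);
""")
conn.executemany("INSERT INTO friend VALUES (?, ?)",
                 [("p0", "p1"), ("p0", "p2"), ("p3", "p1")])
conn.executemany("INSERT INTO person VALUES (?, ?, ?)",
                 [("p1", "Ann", "NYC"), ("p2", "Bob", "PHI"), ("p3", "Cat", "NYC")])
conn.executemany("INSERT INTO dine VALUES (?, ?, ?, ?, ?)",
                 [("p1", "r1", 5, 6, 2013), ("p1", "r2", 1, 2, 2012),
                  ("p2", "r3", 7, 8, 2013), ("p3", "r4", 9, 9, 2013)])

# Selection, projection and Cartesian product only: an SPC (conjunctive) query.
QUERY = """
    SELECT DISTINCT dine.rid
    FROM friend, person, dine
    WHERE friend.pid1 = ?
      AND friend.pid2 = person.pid
      AND friend.pid2 = dine.pid
      AND person.city = 'NYC'
      AND dine.yy = 2013
"""

def restaurants_for(p0):
    """Return the restaurant ids selected by the slide's query for person p0."""
    return sorted(row[0] for row in conn.execute(QUERY, (p0,)))

print(restaurants_for("p0"))
```

On three small tables this is instantaneous; the question raised by the slide is what happens when friend holds over 140 billion tuples.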
14
15
Example queries: Graph pattern matching
  • Input: A pattern graph Q and a graph G
  • Output: All the matches of Q in G, i.e., all
    subgraphs of G that are isomorphic to Q
  • Applications
  • pattern recognition
  • intelligence analysis
  • transportation network analysis
  • Web site classification
  • social position detection
  • user targeted advertising
  • knowledge base disambiguation

a bijective function f on nodes: (u, u') ∈ Q
iff (f(u), f(u')) ∈ G
What other graph queries do you know?
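
As a rough illustration of matching by subgraph isomorphism, the sketch below enumerates every injective mapping f from pattern nodes to graph nodes and keeps those that send each pattern edge to a graph edge. The function name `subgraph_matches` and the tiny example graphs are assumptions made for this example; it is a brute-force sketch, not an efficient matching algorithm.

```python
from itertools import permutations

def subgraph_matches(pattern_nodes, pattern_edges, graph_nodes, graph_edges):
    """Brute force: try every injective node mapping f and keep those
    where each pattern edge (u, v) maps to a graph edge (f(u), f(v))."""
    edge_set = set(graph_edges)
    matches = []
    for image in permutations(graph_nodes, len(pattern_nodes)):
        f = dict(zip(pattern_nodes, image))
        if all((f[u], f[v]) in edge_set for (u, v) in pattern_edges):
            matches.append(f)
    return matches

# Pattern: a directed path a -> b -> c. Graph: a directed path 1 -> 2 -> 3 -> 4.
pattern_nodes = ["a", "b", "c"]
pattern_edges = [("a", "b"), ("b", "c")]
graph_nodes = [1, 2, 3, 4]
graph_edges = [(1, 2), (2, 3), (3, 4)]

for match in subgraph_matches(pattern_nodes, pattern_edges, graph_nodes, graph_edges):
    print(match)
```

The loop visits n!/(n-k)! candidate mappings for a graph with n nodes and a pattern with k nodes, which is why this approach does not scale beyond toy inputs.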
15
16
Graph pattern matching
  • Find all matches of a pattern in a graph

Identify suspects in a drug ring
[Figure: pattern graph of a drug ring; node labels B, A1, Am, A, S, W]
Is this feasible? Facebook: more than 1.38
billion nodes, and over 140 billion links
Understanding the structure of drug trafficking
organizations
16
17
Querying big data: New challenges
Given a query Q and a dataset D, compute Q(D)
[Figure: Q(D) on a traditional database vs. Q(D) on big data (PB or EB)]
What are new challenges introduced by querying
big data?
  • Does querying big data introduce new fundamental
    problems?
  • What new methodology do we need to cope with the
    sheer size of big data D?

Why?
A departure from classical theory and traditional
techniques
17
18
The good, the bad and the ugly
  • Traditional computational complexity theory of
    almost 50 years
  • The good: polynomial time computable (PTIME)
  • The bad: NP-hard (intractable)
  • The ugly: PSPACE-hard, EXPTIME-hard, undecidable

What happens when it comes to big data?
How long does it take?
  • Using SSD of 6 GB/s, a linear scan of a data set D
    would take
  • 1.9 days when D is of 1 PB (10^15 B)
  • 5.28 years when D is of 1 EB (10^18 B)
  • O(n) time is already beyond reach on big data in
    practice!
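
The scan times quoted above can be checked with a few lines of arithmetic. This sketch assumes a sustained read rate of 6 x 10^9 bytes per second and a 365-day year; `scan_seconds` is a helper name chosen for the example.

```python
SECONDS_PER_DAY = 24 * 60 * 60
DAYS_PER_YEAR = 365  # assumption: ignore leap years

def scan_seconds(size_bytes, bytes_per_second=6e9):
    """Time for one sequential pass over size_bytes at a fixed read rate."""
    return size_bytes / bytes_per_second

pb_days = scan_seconds(1e15) / SECONDS_PER_DAY
eb_years = scan_seconds(1e18) / SECONDS_PER_DAY / DAYS_PER_YEAR

print(f"1 PB: {pb_days:.2f} days")
print(f"1 EB: {eb_years:.2f} years")
```

Under these assumptions the results agree with the slide: about 1.9 days for 1 PB and about 5.28 years for 1 EB.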

What query is this?
Polynomial time queries become intractable on big
data!
18
19
Tractability revisited for big data
[Figure: complexity classes NP and beyond, P, and parallel polylog time, with regions marked BD-tractable and not BD-tractable]
Yes, querying big data comes with new and hard
fundamental problems
BD-tractable queries: properly contained in P
unless P = NC
19
20
Challenges: query evaluation is costly
  • Graph pattern matching by subgraph isomorphism
  • NP-complete to decide whether there exists a
    match
  • possibly exponentially many matches
  • Membership problem for relational queries
  • Input: a query Q, a database D, and a tuple t
  • Question: Is t in Q(D)?
  • NP-complete if Q is a conjunctive query (SPC)
  • PSPACE-complete if Q is in relational algebra
    (SQL)

What is the complexity?
intractable even in the traditional complexity
theory
Already beyond reach in practice when the data is
not very big
20
21
Is it still feasible to query big data?
  • Can we do better if we are given more resources?
  • Parallel and distributed query processing: TDD
  • Using 10000 SSDs of 6 GB/s, a linear scan of D might
    take
  • 1.9 days / 10000 ≈ 16 seconds when D is of 1 PB
    (10^15 B)
  • 5.28 years / 10000 ≈ 4.63 hours when D is of 1 EB
    (10^18 B)

Only ideally!
10,000 processors
Yes, parallel query processing. But how?
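
The ideal parallel figures follow from dividing the data evenly over the devices. The sketch below assumes perfect linear speedup with no coordination, skew, or communication cost, which is exactly the "only ideally" caveat; `ideal_parallel_scan_seconds` is a name made up for this example.

```python
def ideal_parallel_scan_seconds(size_bytes, devices, bytes_per_second=6e9):
    """Scan time if the data is split evenly and every device reads at
    full speed with no coordination overhead (an idealized assumption)."""
    return size_bytes / (devices * bytes_per_second)

pb_seconds = ideal_parallel_scan_seconds(1e15, 10_000)
eb_hours = ideal_parallel_scan_seconds(1e18, 10_000) / 3600

print(f"1 PB on 10000 SSDs: {pb_seconds:.1f} seconds")
print(f"1 EB on 10000 SSDs: {eb_hours:.2f} hours")
```

With 10000 devices the ideal scan of 1 PB drops to under 17 seconds and the scan of 1 EB to under 5 hours; real systems rarely achieve this, since partitioning and communication add cost.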
22
The two sides of a coin
Data = quantity + quality
  • When we talk about big data, we typically mean
    its quantity
  • What capacity does a system provide to cope with
    the sheer size of the data?
  • Is a query feasible on big data within our
    available resources?
  • How can we make our queries tractable on big
    data?
  • . . .

Veracity!
Can we trust the answers to our queries?
  • Dirty data routinely lead to misleading financial
    reports and strategic business planning decisions →
    loss of revenue, credibility and customers,
    disastrous consequences

The study of data quality is as important as data
quantity
23
Data consistency
FN      LN      address     AC   city
Mary    Smith   2 Small St  908  NYC
Mary    Dupont  10 Elm St   610  PHI
Mary    Dupont  6 Main St   212  NYC
Bob     Luth    8 Cowan St  215  PHI
Robert  Luth    6 Drum St   212  NYC
  • Q1: how many employees are in the NY office?
  • 3 may not be the correct answer: the AC and city
    in the first tuple are inconsistent!

Error rates: 10% - 75% (telecommunication)
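
One way to catch the inconsistency in this table is a rule that ties each area code to a city. The sketch below is only illustrative: the mapping `EXPECTED_CITY` (including the value "MH" for area code 908) is an assumption invented for the example, not a rule given on the slide.

```python
# Rows from the slide: (FN, LN, address, AC, city).
EMPLOYEES = [
    ("Mary", "Smith", "2 Small St", "908", "NYC"),
    ("Mary", "Dupont", "10 Elm St", "610", "PHI"),
    ("Mary", "Dupont", "6 Main St", "212", "NYC"),
    ("Bob", "Luth", "8 Cowan St", "215", "PHI"),
    ("Robert", "Luth", "6 Drum St", "212", "NYC"),
]

# Assumed rules of the form "AC = x implies city = y" (illustrative values).
EXPECTED_CITY = {"212": "NYC", "215": "PHI", "610": "PHI", "908": "MH"}

def violations(rows, expected_city):
    """Return the rows whose city disagrees with the city implied by their AC."""
    return [row for row in rows
            if row[3] in expected_city and row[4] != expected_city[row[3]]]

dirty = violations(EMPLOYEES, EXPECTED_CITY)
naive_count = sum(1 for row in EMPLOYEES if row[4] == "NYC")

print("naive answer to Q1:", naive_count)
print("tuples violating the rules:", dirty)
```

The naive count is 3, but one of the three NYC tuples violates the assumed rules, so the answer to Q1 cannot be trusted until that tuple is repaired.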
24
Information completeness
FN      LN      address     AC   city
Mary    Smith   2 Small St  908  NYC
Mary    Dupont  10 Elm St   610  PHI
Mary    Dupont  6 Main St   212  NYC
Bob     Luth    8 Cowan St  215  PHI
Robert  Luth    6 Drum St   212  NYC
  • Q2: how many distinct employees have first name
    Mary?
  • 3 may not be the correct answer
  • The first three tuples refer to the same person
  • The information may be incomplete

information perceived as being needed for
clinical decisions was unavailable 13.6%-81% of
the time (2005)
25
Data currency
FN      LN      address     salary  status
Mary    Smith   2 Small St  50k     single
Mary    Dupont  10 Elm St   50k     married
Mary    Dupont  6 Main St   80k     married
Bob     Luth    8 Cowan St  80k     married
Robert  Luth    6 Drum St   55k     married
Entities: Mary, Robert
Consistent, complete, and once correct
  • Q3: what is Mary's current salary?

80k
  • In the real world, salary is monotonically
    increasing

In a customer file, within two years about 50% of
records may become obsolete (2002)
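
The reasoning behind the answer 80k can be sketched in a few lines: if salary never decreases, the largest salary among the tuples of one entity is the most current one. The sketch assumes the first three tuples have already been identified as the same person, and `current_salary_k` is a name chosen for this example.

```python
# The three tuples assumed to refer to the entity Mary (salary in thousands).
MARY = [
    {"FN": "Mary", "LN": "Smith", "salary_k": 50, "status": "single"},
    {"FN": "Mary", "LN": "Dupont", "salary_k": 50, "status": "married"},
    {"FN": "Mary", "LN": "Dupont", "salary_k": 80, "status": "married"},
]

def current_salary_k(tuples):
    """Under the rule 'salary never decreases', the maximum is the latest value."""
    return max(t["salary_k"] for t in tuples)

print(current_salary_k(MARY))
```

No timestamps are needed here; the currency order is deduced from a semantic constraint on the data.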
26
Data fusion
FN      LN      address     salary  status
Mary    Smith   2 Small St  50k     single
Mary    Dupont  10 Elm St   50k     married
Mary    Dupont  6 Main St   80k     married
Bob     Luth    8 Cowan St  80k     married
Robert  Luth    6 Drum St   55k     married
  • Q4: what is Mary's current last name?
  • In real life
  • Marital status only changes from single → married
    → divorced
  • Tuples with the most current marital status also
    have the most current last name

Dupont
Deduce the true values of an entity
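
The two rules on this slide can be combined into a small sketch: rank tuples by marital status, then read the last name off the tuples with the most current status. `STATUS_ORDER` and `current_last_names` are names introduced for this example, and the three tuples are assumed to refer to the same person.

```python
# Status only moves forward along this order (rule from the slide).
STATUS_ORDER = {"single": 0, "married": 1, "divorced": 2}

MARY_TUPLES = [
    {"LN": "Smith", "status": "single"},
    {"LN": "Dupont", "status": "married"},
    {"LN": "Dupont", "status": "married"},
]

def current_last_names(tuples):
    """Last names carried by the tuples with the most current marital status."""
    latest = max(STATUS_ORDER[t["status"]] for t in tuples)
    return {t["LN"] for t in tuples if STATUS_ORDER[t["status"]] == latest}

print(current_last_names(MARY_TUPLES))
```

The function returns a set because the rules may leave several candidate values; here they narrow it down to a single last name.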
27
Data in real-life is often dirty
81 million National Insurance numbers but only 60
million eligible citizens
Pentagon asked 200 dead officers to re-enlist
98000 deaths each year, caused by errors in
medical data
500,000 dead people retain active Medicare cards
Data error rates in industry: 1% - 30% (Redman,
1998)
Dirty data: inconsistent, inaccurate, incomplete,
stale
28
Dirty data are costly
  • Poor data cost US businesses $611 billion
    annually
  • Erroneously priced data in retail databases cost
    US customers $2.5 billion each year
  • 1/3 of system development projects were forced to
    delay or cancel due to poor data quality
  • 30%-80% of the development time and budget for
    data warehousing are for data cleaning
  • CIA: dirty data about WMD in Iraq!

Can we trust answers to our queries in dirty data?
The scale of the data quality problem is far
worse on big data!
29
What does this course cover?
  • Big data = quantity + quality
  • Volume (quantity)
  • Veracity (quality)

29
30
Basic topic 1: Parallel database management
systems
  • Recall: traditional DBMS
  • Database: single memory, disk
  • DBMS: centralized, single processor (CPU)
  • Can we do better provided with multiple
    processors?
  • Parallel DBMS: exploring parallelism
  • Improve performance
  • Reliability and availability

MapReduce
31
Basic topic 2: Distributed databases
  • Data is stored in several sites, each with an
    independent DBMS
  • Local ownership: physically stored across
    different sites
  • Increased availability and reliability
  • Performance

Cloud computing
32
Advanced topic 1: MapReduce
  • A programming model with two primitive functions
  • Map: <k1, v1> → list(k2, v2)
  • Reduce: <k2, list(v2)> → list(k3, v3)
  • Connection between MapReduce and parallel query
    processing
  • Other parallel programming models
  • BSP (Bulk Synchronous Parallel)
  • Vertex-centric
  • Partial evaluation
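
To illustrate the two primitives, here is a single-machine sketch that mimics the Map, group-by-key, and Reduce phases. It is not Hadoop or any real MapReduce framework; `map_reduce`, `word_count_map`, and `word_count_reduce` are names invented for the example, with word counting chosen as the task.

```python
from collections import defaultdict

def map_reduce(records, map_fn, reduce_fn):
    """Sequential simulation: map each <k1, v1>, group values by k2,
    then reduce each <k2, list(v2)> into a list of <k3, v3> pairs."""
    groups = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in map_fn(k1, v1):
            groups[k2].append(v2)
    output = []
    for k2 in sorted(groups):  # sorted only to make the output deterministic
        output.extend(reduce_fn(k2, groups[k2]))
    return output

def word_count_map(doc_id, text):
    # Map: <doc_id, text> -> list(word, 1)
    return [(word, 1) for word in text.split()]

def word_count_reduce(word, counts):
    # Reduce: <word, list(1, 1, ...)> -> list(word, total)
    return [(word, sum(counts))]

docs = [("d1", "big data big"), ("d2", "data quality")]
counts = map_reduce(docs, word_count_map, word_count_reduce)
print(counts)
```

In a real deployment the map calls run in parallel on partitions of the input, and the grouping step is a distributed shuffle; the data flow between the two functions is the same.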

Applications in cloud computing
33
Advanced topic 2: Querying big data
  • Foundations for querying big data
  • Tractability revisited for querying big data
  • Parallel scalability
  • Bounded evaluability of queries
  • Techniques for querying big data
  • Develop parallel algorithms for querying big data
  • Bounded evaluability and access constraints
  • Query preserving compression
  • Query answering using views
  • Bounded incremental query processing

Querying big data: theory and practice
33
34
Advanced topic 3: Data quality management
Big data = quantity + quality!
  • Central issues for data quality
  • Object identification (data fusion): do two
    objects refer to the same real-world entity? What
    is the true value of the entity?
  • Data consistency: do our data values have
    conflicts?
  • Data accuracy: is one value more accurate than
    another for a real-world entity?
  • Data currency: is our data out of date?
  • Information completeness: does D have enough
    information to answer our queries?

TDD: the Veracity of big data
Make our data consistent, accurate, complete and
up to date!
34
35

Advanced topic 4: Dependencies as data quality
rules
  • Data quality rules
  • Conditional (functional and inclusion)
    dependencies: to capture data inconsistencies
  • Matching dependencies: for record matching
  • There are also quality rules for data accuracy,
    data currency and information completeness in
    the textbook

A revision of classical dependencies
  • Fundamental problems for data quality rules
  • consistency: are the data quality rules dirty
    themselves?
  • implication: can we optimize the rules by
    removing redundant ones?

A uniform logic framework for improving data
quality
36
Advanced topic 5: Data cleaning
[Figure: data cleaning stages labeled Discover rules, Reasoning, Detect errors, Repair]
  • Discover data quality rules
  • Validate rules discovered
  • Detect errors with rules
  • Repairing data with rules
  • Certain fixes
  • Deducing the true values of entities

Semi-automated systems for improving data quality
37
Putting together
  • Basic technology
  • Parallel DBMS: architectures, data partition,
    (intra/inter) operator parallelism, parallel
    query processing and optimization
  • Distributed DBMS: architectures, fragmentation,
    replication
  • Advanced topics
  • Big data: the Volume
  • MapReduce and other parallel programming models
  • Querying big data: theory and practice
  • Big data: the Veracity
  • Central issues for data quality
  • Dependencies as data quality rules
  • Cleaning distributed data: rule discovery, rule
    validation, error detection, data repairing,
    certain fixes
  • Volume (quantity)
  • Veracity (quality)
  • Variety (entity resolution, conflict resolution)
  • Velocity (incremental computation)

Prerequisites: relational algebra/SQL, query processing, basic
complexity and algorithmic background (e.g., NP,
undecidability)
38
Course format
38
39
Basic information
  • Web site
  • http://homepages.inf.ed.ac.uk/wenfei/tdd/home.html
  • Syllabus
  • Announcements
  • Lecture notes
  • Deadlines
  • TA: Chao Tian
  • chao.tian_at_ed.ac.uk
  • Office hours
  • Informatics Forum 5.23, 11:00-12:00, Thursday

40
Course format
  • Seminar course: there will be no exam!
  • Lectures: background.
  • http://homepages.inf.ed.ac.uk/wenfei/tdd/lecture/lecture-notes.html
  • Textbook
  • R. Ramakrishnan, J. Gehrke: Database Management
    Systems. WCB/McGraw-Hill 2003 (3rd edition). Chapter
    22
  • Database System Concepts, 4th edition, A.
    Silberschatz, H. Korth, S. Sudarshan, Part 6
    (Parallel and Distributed Database Systems)
  • W. Fan and F. Geerts. Foundations of Data Quality
    Management. Morgan & Claypool, 2012 (Chapters
    1-4; e-copy available upon request)
  • Research papers or chapters related to the topics
    (3-4 each)
  • At the end of ln3-ln8

41
Grading
  • Reviews of research papers (8 in total): 40%
  • Project (report): 45%
  • Project presentation: 15%
  • Homework
  • Four sets of homework, starting from week 4;
    deadlines:
  • 9am, Thursday, February 5, week 4
  • 9am, Thursday, February 19, week 6
  • 9am, Thursday, March 5, week 8
  • 9am, Thursday, March 19, week 10
  • Papers: choose two each time (two reviews);
    not chapters
  • 5% for each paper, and 10% for each homework

down from 12, 2012
42
Review: Evaluation
  • Pick 2 research papers each time from the lecture
    notes to be covered in the next two weeks, starting
    from Week 4.
  • Write a one-page review for each of the papers:
    10 marks
  • Summary: 2 marks
  • A clear problem statement: input, question/output
  • The need for this line of research: motivation
  • A summary of key ideas, techniques and
    contributions
  • Evaluation: 5 marks
  • Criteria for the line of research (e.g.,
    expressive power, complexity, accuracy,
    scalability, etc.)
  • Evaluation based on your criteria; justify your
    evaluation
  • 3 strong points
  • 3 weak points
  • Suggest possible extensions: 3 marks

43
Project: Research and development (recommended)
  • Research and development
  • Topic: pick one from lecture notes (ln3-ln8)
  • Example: A MapReduce algorithm for graph
    simulation
  • Development
  • Pick a research paper from the reading list of
    ln3-ln8
  • Implement its main algorithms
  • Conduct its experimental study

You are encouraged to come up with your own
project; talk to me first
Multiple people may work on the same project
independently
Start early!
44
Grading: design and development
  • Distribution
  • Algorithms: technical depth, performance
    guarantees: 20%
  • Prove the correctness, complexity analysis and
    performance guarantees of your algorithms: 15%
  • Justification (experimental evaluation): 10%
  • Report: in the form of technical report/research
    paper
  • Introduction: problem statement, motivation
  • Related work: survey
  • Techniques: algorithms, illustration via
    intuitive examples
  • Correctness/complexity/property/proofs
  • Experimental evaluation
  • Possible extensions

45
Project: survey
  • Topic: pick one topic from a lecture note
    (ln3-ln8)
  • Example: techniques for conflict resolution
  • Distribution
  • Select 5-6 representative papers, independently:
    10%
  • Develop a set of criteria: the most important
    issues in that line of research, based on your
    own understanding; justify your criteria:
    10%
  • Evaluate each of the papers based on your
    criteria: 15%
  • A table to summarize the assessment, based on
    your criteria; draw and justify your conclusion
    and recommendation for various applications:
    10%
  • Sample survey: A Brief Survey of Automatic
    Methods for Author Name Disambiguation
  • Find and download it from Google

Your understanding of the topic
46
Project report and presentation: 15%
  • A clear problem statement
  • Motivation and challenges
  • Key ideas, techniques/approaches
  • Key results: what you have got, intuitive
    examples
  • Findings/recommendations for different
    applications
  • Demonstration: a must if you do a development
    project
  • Presentation: question handling (show that you
    have developed a good understanding of the line
    of work)

Learn how to present your work
47
Summary and Review
  • What is big data?
  • What is the volume of big data? Variety?
    Velocity? Veracity?
  • Why do we care about big data?
  • Is there any fundamental challenge introduced by
    querying big data?
  • Why study data quality?
  • What is consistency? Information completeness?
    Data currency? Data accuracy? Object
    identification?

48
Reading list
  • For next week, parallel databases, before the
    next lecture
  • Database Management Systems, 2nd edition, R.
    Ramakrishnan and J. Gehrke, Chapter 22.
  • Database System Concepts, 4th edition, A.
    Silberschatz, H. Korth, S. Sudarshan, Part 6
    (Parallel and Distributed Database Systems)
  • About relational databases
  • Foundations of Databases, S. Abiteboul, R. Hull,
    V. Vianu
  • About big data
  • W. Fan and J. Huai. Querying Big Data: Theory and
    Practice, JCST 2014
  • http://homepages.inf.ed.ac.uk/wenfei/papers/JCST14.pdf