Title: TDD: Topics in Distributed Databases (Querying and cleaning big data)
1. TDD: Topics in Distributed Databases (Querying and cleaning big data)
Wenfei Fan, University of Edinburgh
2. What is big data?
3. Big data: what is it anyway?
Everyone talks about big data. But what is it?
- Volume: horrendously large
  - PB (10^15 B)
  - EB (10^18 B)
- Variety: heterogeneous, semi-structured or unstructured
  - 9:1 ratio of unstructured data vs. structured data
  - collecting 95% of restaurants requires at least 5000 sources
- Velocity: dynamic
  - think of the Web and Facebook, ...
- Veracity: trust in its quality
  - real-life data is typically dirty!
cf. Mariam Salloum, Xin Luna Dong, Divesh Srivastava, Vassilis J. Tsotras. Online ordering of overlapping data sources. PVLDB 7(3), 2013.
A departure from our familiar data management!
4. Why is the data so big?
- Worldwide information volume is growing annually at a minimum rate of 59%
- A single jet engine produces 20 TB (1 TB = 10^12 B) of data per hour
- Facebook has 1.38 billion users, 140 billion links, about 300 PB of data
- Genome of human: sampling, biochemistry, immunology, imaging, genetic, phenotypic data
  - 1 person: 1 PB (10^15 B)
  - 1000 people: 1 EB (10^18 B)
  - 1 billion people: 1 YB (10^24 B)
Gartner 2011
Big data is a relative notion: 1 TB is already too big for your laptop
5. Why do we care about big data?
6. Example: Medicare
- Google Flu Trends
  - advance indication in the 2007-08 flu season
  - the 2009 H1N1 outbreak
- IBM: Predict Heart Disease Through Big Data Analytics
  - traditional: EKGs, heart rate, blood pressure
  - big data analysis: connecting
    - exercise and fitness tests
    - diet
    - fat and muscle composition
    - genetics and environment
    - social media and wellness: share information
Nature, 2009
A new game: large number of data sources of big volume
7. Big data is needed everywhere
- Social media marketing
  - 78% of consumers trust peer (friend, colleague and family member) recommendations; only 14% trust ads
  - if three close friends of person X like items P and W, and if X also likes P, then the chances are that X likes W too
- Social event monitoring
  - Prevent terrorist attack
  - The Net Project, Shenzhen, China (Audaque)
- Scientific research
  - A new yet more effective way to develop theory, by exploring and discovering correlations of seemingly disconnected factors
The world is becoming data-driven, like it or not!
8. The big data market is BIG
- US HEALTH CARE: $300 B
  - Increase industry value per year by $300 B
- US RETAIL: 60%
  - Increase net margin by 60%
- MANUFACTURING: 50%
  - Decrease development and assembly costs by 50%
- GLOBAL PERSONAL LOCATION DATA: $100 B
  - Increase service provider revenue by $100 B
- EUROPE PUBLIC SECTOR ADMIN: 250 B Euro
  - Increase industry value per year by 250 B Euro
McKinsey Global Institute, May 2011. Big Data: The next frontier for innovation, competition and productivity
9. Why study big data?
- Want to find a job?
  - Research and development of big data systems
    - ETL, distributed systems (e.g., Hadoop), visualization tools, data warehouse, OLAP, data integration, data quality control, ...
  - Big data applications
    - social marketing, healthcare, ...
  - Data analysis: to get values out of big data
    - discovering and applying patterns, predictive analysis, business intelligence, privacy and security, ...
- Prepare you for
  - graduate study: current research and practical issues
  - the job market: skills/knowledge in need
complexity theory, distributed databases, query answering, algorithms, data quality
Big data Big
10. What challenges are introduced by big data?
11. Big data: through the eyes of computation
- Computer science is the topic about the computation of function f(x)
- Big data: the data parameter x is horrendously large: PB or EB
What is the challenge introduced to query answering?
- Fallacies
  - Big data introduces no fundamental problems
  - Big data = MapReduce (Hadoop)
  - Big data = data quantity (scalability)
Are these true?
12. Flashback: relational queries
- Questions
  - What is a relational schema? A relation? A relational database?
  - What is a query? What is relational algebra?
  - What does relational completeness mean?
  - What is a conjunctive query?
[Figure: a DBMS stores data in a DB; it takes queries and updates and returns answers]
The bible for database researchers: Foundations of Databases
13. Traditional database management systems
- A database is a collection of data, typically containing the information about one or more related organizations.
- A database management system (DBMS) is a software package designed to store and manage databases.
- Database: local
- DBMS: centralized; single processor (CPU); managing local databases (single memory, disk)
[Figure: a DBMS stores data in a DB; it takes queries and updates and returns answers]
14. Facebook: Graph Search
- Find me restaurants in New York my friends have been to in 2013
  - friend(pid1, pid2)
  - person(pid, name, city)
  - dine(pid, rid, dd, mm, yy)
- SQL query (in fact, a conjunctive query, or an SPC query)
  - select rid
  - from friend(pid1, pid2), person(pid, name, city), dine(pid, rid, dd, mm, yy)
  - where pid1 = p0 and pid2 = person.pid and pid2 = dine.pid and city = NYC and yy = 2013
Facebook: more than 1.38 billion nodes, and over 140 billion links
Is it feasible on big data?
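As a minimal sketch of what this conjunctive (select-project-join) query computes, the Python below evaluates it over tiny in-memory relations. The sample tuples and the function name `friends_restaurants` are illustrative assumptions, not part of the lecture; only the three relation schemas and the selection conditions come from the slide.

```python
# Toy relations; every tuple below is made-up sample data.
friend = [("p0", "p1"), ("p0", "p2"), ("p3", "p1")]      # friend(pid1, pid2)
person = [("p1", "Ann", "NYC"), ("p2", "Bob", "PHI")]    # person(pid, name, city)
dine = [("p1", "r1", 3, 5, 2013),                        # dine(pid, rid, dd, mm, yy)
        ("p1", "r2", 1, 1, 2012),
        ("p2", "r3", 9, 9, 2013)]


def friends_restaurants(p0, city, year):
    """Evaluate the query: select on friend, join on pid, project on rid."""
    # Selection pid1 = p0 on friend.
    friend_ids = {pid2 for (pid1, pid2) in friend if pid1 == p0}
    # Join pid2 = person.pid, with the selection on city.
    in_city = {pid for (pid, _name, c) in person
               if pid in friend_ids and c == city}
    # Join pid2 = dine.pid, with the selection on yy, then project on rid.
    return {rid for (pid, rid, _dd, _mm, yy) in dine
            if pid in in_city and yy == year}


result = friends_restaurants("p0", "NYC", 2013)
print(result)
```

Each step here touches a whole relation, which is harmless on a handful of tuples but is exactly what becomes questionable at the scale of billions of nodes and links.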
15. Example queries: graph pattern matching
- Input: A pattern graph Q and a graph G
- Output: All the matches of Q in G, i.e., all subgraphs of G that are isomorphic to Q
- Applications
  - pattern recognition
  - intelligence analysis
  - transportation network analysis
  - Web site classification
  - social position detection
  - user targeted advertising
  - knowledge base disambiguation
Isomorphism: a bijective function f on nodes such that (u, u') ∈ Q iff (f(u), f(u')) ∈ G
What other graph queries do you know?
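The matching condition above can be sketched as a brute-force search in Python. This is an illustration only, not an algorithm from the lecture: the function name `subgraph_matches` and the toy graphs are assumptions. Graphs are directed and given as node lists plus edge sets, and the "iff" is checked on every pair of pattern nodes.

```python
from itertools import permutations


def subgraph_matches(q_nodes, q_edges, g_nodes, g_edges):
    """Enumerate injective maps f with (u, v) in Q iff (f(u), f(v)) in G."""
    q_edges, g_edges = set(q_edges), set(g_edges)
    matches = []
    # Try every way of assigning distinct nodes of G to the nodes of Q.
    for image in permutations(g_nodes, len(q_nodes)):
        f = dict(zip(q_nodes, image))
        if all(((u, v) in q_edges) == ((f[u], f[v]) in g_edges)
               for u in q_nodes for v in q_nodes):
            matches.append(f)
    return matches


# Hypothetical example: a single-edge pattern against a three-node path.
matches = subgraph_matches(["a", "b"], [("a", "b")],
                           [1, 2, 3], [(1, 2), (2, 3)])
print(matches)
```

The loop visits n!/(n-k)! candidate mappings for a pattern with k nodes and a graph with n nodes, which hints at why this kind of matching is expensive even before the graph gets big.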
16. Graph pattern matching
- Find all matches of a pattern in a graph
Identify suspects in a drug ring
[Figure: pattern graph, with node labels B, A1 ... Am, A, S and W]
Is this feasible? Facebook: more than 1.38 billion nodes, and over 140 billion links
Understanding the structure of drug trafficking organizations
17. Querying big data: new challenges
Given a query Q and a dataset D, compute Q(D)
[Figure: Q(D) on a traditional database vs. Q(D) on big data (PB or EB)]
What are new challenges introduced by querying big data?
- Does querying big data introduce new fundamental problems?
- What new methodology do we need to cope with the sheer size of big data D?
Why?
A departure from classical theory and traditional techniques
18. The good, the bad and the ugly
- Traditional computational complexity theory of almost 50 years
  - The good: polynomial time computable (PTIME)
  - The bad: NP-hard (intractable)
  - The ugly: PSPACE-hard, EXPTIME-hard, undecidable
What happens when it comes to big data?
How long does it take?
- Using SSD of 6 GB/s, a linear scan of a data set D would take
  - 1.9 days when D is of 1 PB (10^15 B)
  - 5.28 years when D is of 1 EB (10^18 B)
- O(n) time is already beyond reach on big data in practice!
What query is this?
Polynomial time queries become intractable on big
data!
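The scan times quoted on this slide follow from dividing the data size by the read rate. The short Python check below reproduces them, assuming decimal units (1 PB = 10^15 B, 6 GB/s = 6 x 10^9 B/s) and a 365-day year; the function name `scan_seconds` is an illustrative choice.

```python
SECONDS_PER_DAY = 24 * 60 * 60
DAYS_PER_YEAR = 365  # assumption: leap years ignored


def scan_seconds(data_bytes, bytes_per_second=6e9):
    """Time for one sequential pass over the data at a fixed read rate."""
    return data_bytes / bytes_per_second


pb_days = scan_seconds(1e15) / SECONDS_PER_DAY
eb_years = scan_seconds(1e18) / SECONDS_PER_DAY / DAYS_PER_YEAR
print(f"1 PB: {pb_days:.2f} days")
print(f"1 EB: {eb_years:.2f} years")
```

Both figures ignore seek time, decoding and any computation per tuple, so they are lower bounds for a single device.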
19. Tractability revisited for big data
[Figure: nested classes, from outermost to innermost: NP and beyond; P; parallel polylog time. Region labels: not BD-tractable, BD-tractable]
Yes, querying big data comes with new and hard fundamental problems
BD-tractable queries: properly contained in P unless P = NC
20. Challenges: query evaluation is costly
- Graph pattern matching by subgraph isomorphism
  - NP-complete to decide whether there exists a match
  - possibly exponentially many matches
- Membership problem for relational queries
  - Input: a query Q, a database D, and a tuple t
  - Question: Is t in Q(D)?
  - NP-complete if Q is a conjunctive query (SPC)
  - PSPACE-complete if Q is in relational algebra (SQL)
What is the complexity?
Intractable even in the traditional complexity theory
Already beyond reach in practice when the data is not very big
21. Is it still feasible to query big data?
- Can we do better if we are given more resources?
  - Parallel and distributed query processing: TDD
- Using 10000 SSDs of 6 GB/s, a linear scan of D might take
  - 1.9 days/10000 ≈ 16 seconds when D is of 1 PB (10^15 B)
  - 5.28 years/10000 ≈ 4.63 hours when D is of 1 EB (10^18 B)
Only ideally!
10,000 processors
Yes, parallel query processing. But how?
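One simple answer to "how" is data-partitioned parallelism: split D into fragments, scan each fragment in a separate worker, and combine the partial results. The sketch below is a toy illustration under stated assumptions. The helper names are hypothetical, and Python threads stand in for separate processors (in CPython they do not run CPU-bound work truly in parallel). `ideal_scan_seconds` gives the best-case time with no coordination cost and perfectly even fragments, which is why the slide says "only ideally".

```python
from concurrent.futures import ThreadPoolExecutor


def partition(data, n):
    """Split data into n contiguous fragments of near-equal size."""
    size, extra = divmod(len(data), n)
    fragments, start = [], 0
    for i in range(n):
        end = start + size + (1 if i < extra else 0)
        fragments.append(data[start:end])
        start = end
    return fragments


def parallel_count(data, predicate, workers=4):
    """Scan each fragment in its own worker, then sum the partial counts."""
    fragments = partition(data, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(
            lambda frag: sum(1 for x in frag if predicate(x)), fragments)
        return sum(partials)


def ideal_scan_seconds(data_bytes, devices, bytes_per_second=6e9):
    """Best case: total scan time divided evenly over all devices."""
    return data_bytes / (devices * bytes_per_second)


count = parallel_count(list(range(1000)), lambda x: x % 7 == 0)
pb_seconds = ideal_scan_seconds(1e15, 10_000)
eb_hours = ideal_scan_seconds(1e18, 10_000) / 3600
print(count, round(pb_seconds, 2), round(eb_hours, 2))
```

With 10,000 devices the ideal scan takes about 16.7 seconds for 1 PB and about 4.63 hours for 1 EB. Real systems fall short of this because of skewed fragments, communication and stragglers.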
22. The two sides of a coin
Data = quantity + quality
- When we talk about big data, we typically mean its quantity
  - What capacity does a system provide to cope with the sheer size of the data?
  - Is a query feasible on big data within our available resources?
  - How can we make our queries tractable on big data?
  - . . .
Veracity!
Can we trust the answers to our queries?
- Dirty data routinely lead to misleading financial reports and strategic business planning decisions => loss of revenue, credibility and customers, disastrous consequences
The study of data quality is as important as data quantity
23. Data consistency
FN      LN      address      AC   city
Mary    Smith   2 Small St   908  NYC
Mary    Dupont  10 Elm St    610  PHI
Mary    Dupont  6 Main St    212  NYC
Bob     Luth    8 Cowan St   215  PHI
Robert  Luth    6 Drum St    212  NYC
- Q1: how many employees are in the NY office?
  - 3 may not be the correct answer: the AC and city in the first tuple are inconsistent!
Error rates: 10% - 75% (telecommunication)
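The inconsistency in the first tuple can be detected mechanically once a rule relating area code and city is written down. The sketch below assumes a deliberately simplified rule table, `CITY_AREA_CODES`, that lists the area codes allowed for each city. That table and the function names are illustrative assumptions (real cities have more area codes); the employee tuples are the ones on the slide.

```python
# Assumed, simplified rule: the area codes permitted for each city.
CITY_AREA_CODES = {"NYC": {"212"}, "PHI": {"215", "610"}}

employees = [  # (FN, LN, address, AC, city), copied from the slide
    ("Mary", "Smith", "2 Small St", "908", "NYC"),
    ("Mary", "Dupont", "10 Elm St", "610", "PHI"),
    ("Mary", "Dupont", "6 Main St", "212", "NYC"),
    ("Bob", "Luth", "8 Cowan St", "215", "PHI"),
    ("Robert", "Luth", "6 Drum St", "212", "NYC"),
]


def violations(rows):
    """Return the tuples whose AC is not allowed for their city."""
    return [r for r in rows
            if r[4] in CITY_AREA_CODES and r[3] not in CITY_AREA_CODES[r[4]]]


def count_in_city(rows, city):
    """Naive answer to Q1: count tuples by the city attribute alone."""
    return sum(1 for r in rows if r[4] == city)


bad = violations(employees)
naive = count_in_city(employees, "NYC")
without_flagged = count_in_city([r for r in employees if r not in bad], "NYC")
print(bad, naive, without_flagged)
```

Detection only says that the tuple is suspicious. Whether its AC or its city is the wrong value is a separate repair question, so the count after dropping flagged tuples is not automatically the true answer either.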
24. Information completeness
FN      LN      address      AC   city
Mary    Smith   2 Small St   908  NYC
Mary    Dupont  10 Elm St    610  PHI
Mary    Dupont  6 Main St    212  NYC
Bob     Luth    8 Cowan St   215  PHI
Robert  Luth    6 Drum St    212  NYC
- Q2: how many distinct employees have first name Mary?
  - 3 may not be the correct answer
    - The first three tuples refer to the same person
    - The information may be incomplete
Information perceived as being needed for clinical decisions was unavailable 13.6%-81% of the time (2005)
25. Data currency
FN      LN      address      salary  status
Mary    Smith   2 Small St   50k     single
Mary    Dupont  10 Elm St    50k     married
Mary    Dupont  6 Main St    80k     married
Bob     Luth    8 Cowan St   80k     married
Robert  Luth    6 Drum St    55k     married
Entities: Mary, Robert
Consistent, complete, and once correct
- Q3: what is Mary's current salary?
  - 80k
- In the real world, salary is monotonically increasing
In a customer file, within two years about 50% of records may become obsolete (2002)
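Under the stated assumption that salary only increases over time, the most current salary of an entity is simply the largest one among its tuples. A minimal sketch follows, taking the first three tuples as Mary's as the slide does. The function name `current_salary` and the representation of salaries as integers in thousands are illustrative choices.

```python
# Mary's tuples from the table, with salary in thousands.
mary_records = [
    {"LN": "Smith", "address": "2 Small St", "salary": 50, "status": "single"},
    {"LN": "Dupont", "address": "10 Elm St", "salary": 50, "status": "married"},
    {"LN": "Dupont", "address": "6 Main St", "salary": 80, "status": "married"},
]


def current_salary(entity_records):
    """If salary is monotonically increasing, the maximum is the latest."""
    return max(r["salary"] for r in entity_records)


mary_salary = current_salary(mary_records)
print(mary_salary)
```

The rule is only as good as the assumption behind it: if salaries could also decrease, the maximum would say nothing about currency.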
26. Data fusion
FN      LN      address      salary  status
Mary    Smith   2 Small St   50k     single
Mary    Dupont  10 Elm St    50k     married
Mary    Dupont  6 Main St    80k     married
Bob     Luth    8 Cowan St   80k     married
Robert  Luth    6 Drum St    55k     married
- Q4: what is Mary's current last name?
  - Dupont
- In real life
  - Marital status only changes from single -> married -> divorced
  - Tuples with the most current marital status also have the most current last name
Deduce the true values of an entity
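The two rules on this slide can be combined into a small deduction: order the marital statuses, find the most current status among an entity's tuples, and read the last name off the tuples that carry it. This is a sketch under the slide's assumptions only. `STATUS_ORDER` and `current_last_names` are hypothetical names, and the function returns a set so that any remaining conflict stays visible.

```python
# Currency order on marital status, as stated on the slide.
STATUS_ORDER = {"single": 0, "married": 1, "divorced": 2}

mary_tuples = [  # (LN, status) for the tuples that refer to Mary
    ("Smith", "single"),
    ("Dupont", "married"),
    ("Dupont", "married"),
]


def current_last_names(tuples):
    """Last names carried by the tuples with the most current status."""
    latest = max(STATUS_ORDER[status] for _ln, status in tuples)
    return {ln for ln, status in tuples if STATUS_ORDER[status] == latest}


names = current_last_names(mary_tuples)
print(names)
```

A singleton result means the rules determine the true value. A larger set means more rules or user input would be needed.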
27. Data in real life is often dirty
81 million National Insurance numbers but only 60 million eligible citizens
Pentagon asked 200 dead officers to re-enlist
98000 deaths each year, caused by errors in medical data
500,000 dead people retain active Medicare cards
Data error rates in industry: 1% - 30% (Redman, 1998)
Dirty data: inconsistent, inaccurate, incomplete, stale
28. Dirty data are costly
- Poor data cost US businesses $611 billion annually
- Erroneously priced data in retail databases cost US customers $2.5 billion each year
- 1/3 of system development projects were forced to delay or cancel due to poor data quality
- 30%-80% of the development time and budget for data warehousing are for data cleaning
- CIA: dirty data about WMD in Iraq!
Can we trust answers to our queries in dirty data?
The scale of the data quality problem is far worse on big data!
29. What does this course cover?
- Big data = quantity + quality
  - Volume (quantity)
  - Veracity (quality)
30. Basic topic 1: Parallel database management systems
- Recall traditional DBMS
  - Database: single memory, disk
  - DBMS: centralized; single processor (CPU)
- Can we do better provided with multiple processors?
- Parallel DBMS: exploring parallelism
  - Improve performance
  - Reliability and availability
MapReduce
31. Basic topic 2: Distributed databases
- Data is stored in several sites, each with an independent DBMS
- Local ownership: physically stored across different sites
- Increased availability and reliability
- Performance
Cloud computing
32. Advanced topic 1: MapReduce
- A programming model with two primitive functions
  - Map: <k1, v1> -> list (k2, v2)
  - Reduce: <k2, list(v2)> -> list (k3, v3)
- Connection between MapReduce and parallel query processing
- Other parallel programming models
  - BSP (Bulk Synchronous Parallel)
  - Vertex-centric
  - Partial evaluation
Applications in cloud computing
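The two primitive functions can be illustrated with a single-process simulation in Python. This is a sketch of the programming model only, not of Hadoop or any real framework: `run_mapreduce`, `map_fn` and `reduce_fn` are hypothetical names, the word-count task is a conventional example chosen here, and the grouping step stands in for the distributed shuffle.

```python
from collections import defaultdict


def map_fn(doc_id, text):
    """Map: <k1, v1> -> list of (k2, v2); emits (word, 1) per occurrence."""
    return [(word, 1) for word in text.split()]


def reduce_fn(word, counts):
    """Reduce: <k2, list(v2)> -> list of (k3, v3); sums the counts."""
    return [(word, sum(counts))]


def run_mapreduce(inputs, map_fn, reduce_fn):
    """Apply map to every input pair, group by key, then reduce each group."""
    groups = defaultdict(list)
    for k1, v1 in inputs:
        for k2, v2 in map_fn(k1, v1):
            groups[k2].append(v2)  # stands in for the shuffle phase
    output = []
    for k2 in sorted(groups):      # sorted only to make the output stable
        output.extend(reduce_fn(k2, groups[k2]))
    return output


docs = [("d1", "big data big"), ("d2", "data quality")]
counts = run_mapreduce(docs, map_fn, reduce_fn)
print(counts)
```

Because each map call sees one input pair and each reduce call sees one key, both phases can be spread over many machines without changing the two user-supplied functions.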
33. Advanced topic 2: Querying big data
- Foundations for querying big data
  - Tractability revisited for querying big data
  - Parallel scalability
  - Bounded evaluability of queries
- Techniques for querying big data
  - Develop parallel algorithms for querying big data
  - Bounded evaluability and access constraints
  - Query preserving compression
  - Query answering using views
  - Bounded incremental query processing
Querying big data: theory and practice
34. Advanced topic 3: Data quality management
Big data = quantity + quality!
- Central issues for data quality
  - Object identification (data fusion): do two objects refer to the same real-world entity? What is the true value of the entity?
  - Data consistency: do our data values have conflicts?
  - Data accuracy: is one value more accurate than another for a real-world entity?
  - Data currency: is our data out of date?
  - Information completeness: does D have enough information to answer our queries?
TDD: the Veracity of big data
Make our data consistent, accurate, complete and up to date!
35. Advanced topic 4: Dependencies as data quality rules
- Data quality rules
  - Conditional (functional and inclusion) dependencies to capture data inconsistencies
  - Matching dependencies for record matching
  - There are also quality rules for data accuracy, data currency and information completeness in the textbook
A revision of classical dependencies
- Fundamental problems for data quality rules
  - consistency: are the data quality rules dirty themselves?
  - implication: can we optimize the rules by removing redundant ones?
A uniform logic framework for improving data quality
36. Advanced topic 5: Data cleaning
[Figure: diagram with stages "Discover rules", "Reasoning", "Detect errors" and "Repair"]
- Discover data quality rules
- Validate rules discovered
- Detect errors with rules
- Repairing data with rules
- Certain fixes
- Deducing the true values of entities
Semi-automated systems for improving data quality
37. Putting together
- Basic technology
  - Parallel DBMS: architectures, data partition, (intra/inter) operator parallelism, parallel query processing and optimization
  - Distributed DBMS: architectures, fragmentation, replication
- Advanced topics
  - Big data: the Volume
    - MapReduce and other parallel programming models
    - Querying big data: theory and practice
  - Big data: the Veracity
    - Central issues for data quality
    - Dependencies as data quality rules
    - Cleaning distributed data: rule discovery, rule validation, error detection, data repairing, certain fixes
- Volume (quantity)
- Veracity (quality)
- Variety (entity resolution, conflict resolution)
- Velocity (incremental computation)
Prerequisites: relational algebra/SQL, query processing, basic complexity and algorithmic background (e.g., NP, undecidability)
38. Course format
39. Basic information
- Web site
  - http://homepages.inf.ed.ac.uk/wenfei/tdd/home.html
  - Syllabus
  - Announcements
  - Lecture notes
  - deadlines
- TA: Chao Tian
  - chao.tian_at_ed.ac.uk
- Office hours
  - Informatics Forum 5.23, 11:00-12:00, Thursday
40. Course format
- Seminar course: there will be no exam!
- Lectures: background.
  - http://homepages.inf.ed.ac.uk/wenfei/tdd/lecture/lecture-notes.html
- Textbook
  - R. Ramakrishnan, J. Gehrke. Database Management Systems. WCB/McGraw-Hill 2003 (3rd edition). Chap 22
  - Database System Concepts, 4th edition, A. Silberschatz, H. Korth, S. Sudarshan, Part 6 (Parallel and Distributed Database Systems)
  - W. Fan and F. Geerts. Foundations of Data Quality Management. Morgan & Claypool, 2012 (Chapters 1-4; e-copy available upon request)
- Research papers or chapters related to the topics (3-4 each)
  - At the end of ln3-ln8
41. Grading
- Reviews of research papers (8 in total): 40%
- Project (report): 45%
- Project presentation: 15%
- Homework
  - Four sets of homework, starting from week 4; deadlines:
    - 9am, Thursday, February 5, week 4
    - 9am, Thursday, February 19, week 6
    - 9am, Thursday, March 5, week 8
    - 9am, Thursday, March 19, week 10
  - Papers: choose two each time (two reviews), not chapters
  - 5% for each paper, and 10% for each homework
down from 12, 2012
42. Review: Evaluation
- Pick 2 research papers each time from the lecture notes to be covered in the next two weeks, starting from Week 4.
- Write a one-page review for each of the papers, 10 marks
  - Summary: 2 marks
    - A clear problem statement: input, question/output
    - The need for this line of research: motivation
    - A summary of key ideas, techniques and contributions
  - Evaluation: 5 marks
    - Criteria for the line of research (e.g., expressive power, complexity, accuracy, scalability, etc.)
    - Evaluation based on your criteria; justify your evaluation
    - 3 strong points
    - 3 weak points
  - Suggest possible extensions: 3 marks
43. Project: Research and development (recommended)
- Research and development
  - Topic: pick one from lecture notes (ln3-ln8)
  - Example: A MapReduce algorithm for graph simulation
- Development
  - Pick a research paper from the reading list of ln3-ln8
  - Implement its main algorithms
  - Conduct its experimental study
You are encouraged to come up with your own project: talk to me first
Multiple people may work on the same project independently
Start early!
44. Grading: design and development
- Distribution
  - Algorithms: technical depth, performance guarantees: 20%
  - Prove the correctness, complexity analysis and performance guarantees of your algorithms: 15%
  - Justification (experimental evaluation): 10%
- Report: in the form of technical report/research paper
  - Introduction: problem statement, motivation
  - Related work: survey
  - Techniques: algorithms, illustration via intuitive examples
  - Correctness/complexity/property/proofs
  - Experimental evaluation
  - Possible extensions
45. Project: survey
- Topic: pick one topic from a lecture note (ln3-ln8)
  - Example: techniques for conflict resolution
- Distribution
  - Select 5-6 representative papers, independently: 10%
  - Develop a set of criteria: the most important issues in that line of research, based on your own understanding; justify your criteria: 10%
  - Evaluate each of the papers based on your criteria: 15%
  - A table to summarize the assessment; based on your criteria, draw and justify your conclusion and recommendation for various applications: 10%
- Sample survey: A Brief Survey of Automatic Methods for Author Name Disambiguation
  - Find and download it from Google
Your understanding of the topic
46. Project report and presentation: 15%
- A clear problem statement
- Motivation and challenges
- Key ideas, techniques/approaches
- Key results: what you have got, intuitive examples
- Findings/recommendations for different applications
- Demonstration: a must if you do a development project
- Presentation: question handling (show that you have developed a good understanding of the line of work)
Learn how to present your work
47. Summary and Review
- What is big data?
- What is the volume of big data? Variety? Velocity? Veracity?
- Why do we care about big data?
- Is there any fundamental challenge introduced by querying big data?
- Why study data quality?
- What is consistency? Information completeness? Data currency? Data accuracy? Object identification?
48. Reading list
- For next week, parallel databases, before the next lecture
  - Database Management Systems, 2nd edition, R. Ramakrishnan and J. Gehrke, Chapter 22.
  - Database System Concepts, 4th edition, A. Silberschatz, H. Korth, S. Sudarshan, Part 6 (Parallel and Distributed Database Systems)
- About relational databases
  - Foundations of Databases, S. Abiteboul, R. Hull, V. Vianu
- About big data
  - W. Fan and J. Huai. Querying Big Data: Theory and Practice, JCST 2014
  - http://homepages.inf.ed.ac.uk/wenfei/papers/JCST14.pdf