Data Integration for the Relational Web - PowerPoint PPT Presentation

Title: Data Integration for the Relational Web


1
Data Integration for the Relational Web
  • Michael J. Cafarella, Alon Halevy,
  • Nodira Khoussainova
  • Work done while at Google, Inc.
  • Presenter
  • Michael J. Cafarella, University of Michigan
  • VLDB
  • August 27, 2009

2
Web Challenge
  • Try to create a database of all VLDB program
    committee members

3
Data Integration for Web
  • Can we combine tables to create new data sources?
  • Existing mashup, data integration tools ignore
    realities of Web data
  • A lot of useful data is not in XML
  • User cannot know all sources in advance
  • Transient integrations
  • Data semantics semi-tied to src page

4
Octopus
Crawl Web
Extract Tables
Integrate Tables
Obtain Database
  • Our system uses data from
  • WebTables [WebDB08, "Uncovering...", Cafarella et
    al.], [VLDB08, "WebTables: Exploring...", Cafarella
    et al.]
  • Harvesting Relational Tables from Lists [VLDB09,
    "Harvesting Relational Tables from Lists",
    Elmeleegy et al.]
  • Octopus
  • Our test system has over 200M src tables
  • Lots of table/list-extraction work, e.g.,
  • [VLDB09, "Answering Table Augmentation...", Gupta &
    Sarawagi]
  • [JAIR08, "Creating relational data...", Michelson &
    Knoblock]
  • [WWW07, "Towards domain-independent...",
    Gatterbauer et al.]
  • [WWW02, "A machine learning based...", Wang & Hu]

5
Outline
  • Introduction
  • Data Sources
  • Octopus Operators
  • SEARCH
  • CONTEXT
  • EXTEND
  • Algorithms & Experiments
  • Conclusions

6
Outline
  • Introduction
  • Data Sources
  • Octopus Operators
  • SEARCH
  • CONTEXT
  • EXTEND
  • Algorithms & Experiments
  • Conclusions

7
(No Transcript)
8
List Extraction
9
List Extraction
10
Outline
  • Introduction
  • Data Sources
  • Octopus Operators
  • SEARCH
  • CONTEXT
  • EXTEND
  • Algorithms & Experiments
  • Conclusions

11
Octopus
  • Provides workbench of data integration
    operators to build target database
  • Most operators are not correct/incorrect, but
    high/low quality
  • Some prosaic operators: project, select, ...
  • Three original operators
  • SEARCH
  • CONTEXT
  • EXTEND
  • Under the covers, each operator recovers a different
    aspect of an implicit GLAV source description

12
Operator 1 - SEARCH
  • SEARCH("VLDB program committee members")

13
Operator 2 - CONTEXT
  • Recover relevant data

CONTEXT()
CONTEXT()
14
Operator 2 - CONTEXT
  • Recover relevant data

CONTEXT()
CONTEXT()
15
Prosaic Operator - Union
  • Combine datasets

Union()
16
Operator 3 - EXTEND
  • Add column to data
  • Similar to a join, but the join target is a topic

publications
EXTEND(publications, col0)
17
Straightforward Sequence
  • SEARCH("VLDB program committee members")
  • CONTEXT
  • CONTEXT

18
Straightforward Sequence
union
  • CONTEXT
  • CONTEXT

19
Straightforward Sequence
union
  • EXTEND
  • User integrated data sources with 4 operations
  • No wrappers; data was never intended for reuse
  • User never visited source web pages
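
The four-operation sequence above can be sketched in miniature. This is a toy illustration only, not the Octopus implementation: tables are plain lists of rows, SEARCH is skipped (its results are hard-coded), the function names `context`, `union`, and `extend` are hypothetical, and the `lookup` dictionary stands in for the topic search that EXTEND performs. All names and values in the data are placeholders.

```python
def context(table, value):
    """Append one context value (e.g. a year recovered from the
    source page) to every row of the table."""
    return [row + [value] for row in table]


def union(*tables):
    """Concatenate the rows of several tables."""
    merged = []
    for table in tables:
        merged.extend(table)
    return merged


def extend(table, col, lookup):
    """Append a new column: the value found for row[col] in `lookup`,
    or None when nothing was found (assumed stand-in for topic search)."""
    return [row + [lookup.get(row[col])] for row in table]


# Placeholder tables, as if returned by SEARCH.
pc_2008 = [["Person A"], ["Person B"]]
pc_2009 = [["Person C"]]

merged = union(context(pc_2008, "2008"), context(pc_2009, "2009"))
result = extend(merged, 0, {"Person A": "Paper X"})
```

Rows that find no match keep a `None` in the new column, which mirrors the idea that operator output is high or low quality rather than strictly correct or incorrect.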

20
Outline
  • Introduction
  • Data Sources
  • Octopus Operators
  • SEARCH
  • CONTEXT
  • EXTEND
  • Algorithms Experiments
  • Conclusions

21
Experiments
  • 50 queries, suggested and evaluated by Amazon
    Mechanical Turk

22
SEARCH Algorithms - Ranking
  • SimpleRank - search engine ranking
  • SCPRank - symmetric conditional probability
    between query, table data
  • Similar to Pointwise Mutual Information
  • Lopes, DaSilva, 1999, multiword units
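
To make the SCP idea concrete, here is a minimal sketch. It assumes the usual definition of symmetric conditional probability, SCP(q, c) = p(q, c)^2 / (p(q) * p(c)), with probabilities estimated from document hit counts. The function name, its arguments, and the counts in the usage line are hypothetical; how Octopus aggregates per-cell scores into a table score is not stated on the slide and is not modeled here.

```python
def scp(hits_q: int, hits_c: int, hits_both: int, total_docs: int) -> float:
    """Symmetric conditional probability between a query q and a cell c.

    Equivalent to p(q | c) * p(c | q). Returns 0.0 when either term
    never occurs, to avoid dividing by zero.
    """
    if hits_q == 0 or hits_c == 0 or total_docs == 0:
        return 0.0
    p_q = hits_q / total_docs
    p_c = hits_c / total_docs
    p_qc = hits_both / total_docs
    return (p_qc * p_qc) / (p_q * p_c)


# Hypothetical counts: the cell always co-occurs with the query,
# while the query co-occurs with the cell half of the time.
score = scp(hits_q=100, hits_c=50, hits_both=50, total_docs=1000)
```

Unlike a raw co-occurrence count, the score is high only when the query and the cell predict each other in both directions.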

23
SEARCH Algorithms - Ranking
24
SEARCH Algorithms - Ranking
  • See paper for clustering results

25
CONTEXT Algorithms
  • Input: table and source page
  • Output: data values to add to table
  • SignificantTerms sorts terms in source page by
    importance (tf-idf)
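
A minimal sketch of tf-idf term ranking in the spirit of SignificantTerms follows. The tokenizer, the unsmoothed idf formula, the tie-breaking rule, and the function names are assumptions made for illustration; the slide only states that terms in the source page are sorted by tf-idf importance.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def significant_terms(source_page, corpus, k=3):
    """Return the k terms of source_page with the highest tf-idf,
    using `corpus` (a list of other page texts) for document frequency.
    Ties are broken alphabetically."""
    tf = Counter(tokenize(source_page))
    docs = [set(tokenize(d)) for d in corpus]
    n = len(docs) + 1  # the source page counts as a document too
    scores = {}
    for term, freq in tf.items():
        df = 1 + sum(term in d for d in docs)
        scores[term] = freq * math.log(n / df)
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


# Hypothetical pages: terms frequent here but rare elsewhere rank first.
page = "VLDB 2009 program committee VLDB 2009"
others = ["program committee list", "the committee page", "SIGMOD program"]
top = significant_terms(page, others, k=2)
```

The top-ranked terms are candidates for the data values that CONTEXT adds to the table.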

26
Related View Partners
  • Looks for different views of same data

27
CONTEXT Experiments
28
EXTEND Algorithms
  • Input: src table, src column, dst topic
  • EXTEND(t, col0, publications)
  • JoinTest
  • Tests a single table for join-compatibility
  • City mayors: yes
  • VLDB publications: no
  • Rank all tables by relevance to query topic
  • Select tables that are joinable to query column
  • MultiJoin
  • Finds a join-target tuple for each src tuple
  • City mayors: maybe
  • VLDB publications: yes
  • For each cell in src column, perform topic search
  • Cluster resulting tables, rank by column coverage
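
The two strategies above can be sketched as follows. This is an illustration under stated assumptions, not the algorithm as implemented: the slide does not say how join-compatibility is measured, so the overlap fraction and the 0.5 threshold in `join_test` are invented for the example, as are all function names. `column_coverage` models only the final ranking step of MultiJoin; the per-cell topic search and the clustering are assumed to have already produced `clusters`.

```python
def normalize(value):
    """Case- and whitespace-insensitive form of a cell value."""
    return " ".join(str(value).lower().split())


def join_test(src_column, candidate_column, threshold=0.5):
    """JoinTest-style check: does one candidate column contain enough
    of the source column's values to be joinable? (assumed measure)"""
    src = {normalize(v) for v in src_column}
    cand = {normalize(v) for v in candidate_column}
    if not src:
        return False
    return len(src & cand) / len(src) >= threshold


def column_coverage(src_column, cluster):
    """MultiJoin-style score: fraction of distinct source cells matched
    by at least one table in the cluster (a list of row lists)."""
    src = {normalize(v) for v in src_column}
    if not src:
        return 0.0
    seen = set()
    for table in cluster:
        for row in table:
            for cell in row:
                key = normalize(cell)
                if key in src:
                    seen.add(key)
    return len(seen) / len(src)


def rank_clusters(src_column, clusters):
    """Order clusters by how much of the source column they cover."""
    return sorted(clusters,
                  key=lambda c: column_coverage(src_column, c),
                  reverse=True)


# Placeholder data.
cities = ["City A", "City B", "City C"]
one_table = ["city a", "CITY B", "City Z"]
small = [[["City C", "Mayor Z"]]]
large = [[["City A", "Mayor X"]], [["City B", "Mayor Y"]]]
ranked = rank_clusters(cities, [small, large])
```

The contrast matches the slide: JoinTest needs a single table that covers the column, while MultiJoin can assemble coverage from many small tables, one per source tuple if necessary.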

29
EXTEND Early Experiments
  • JoinTest
  • 3 of 7 source tables
  • 60% of source tuples
  • Single extension for each extended tuple
  • MultiJoin
  • All 7 source tables
  • 33% of source tuples
  • Avg 45.5 extensions for each extended tuple
  • 113 NYC mayors
  • 12 albums by Led Zeppelin

30
Related Work
  • Octopus relies on info extraction work
  • Substantial work in data integration
  • Mashup Tools
  • Yahoo! Pipes
  • Marmite - Wong and Hong, 2007
  • Karma - Tuchinda, et al., 2007
  • CIMPLE - DeRose, et al., 2007
  • Potter's Wheel - Raman and Hellerstein, 2001

31
Octopus Contributions
  • Basic operators that enable Web data integration
    with very small user burden
  • Realistic and useful implementations for all
    three operators
  • Future work
  • Efficient large-scale implementation