Large-Scale Deep Web Integration: Exploring and Querying Structured Data on the Deep Web

1
Large-Scale Deep Web Integration: Exploring and
Querying Structured Data on the Deep Web
Tutorial at SIGMOD'06
  • Kevin C. Chang

2
Still challenges on the Web? Google is only the
start of search (and MSN will not be the end of
it).
3
Structured Data: Prevalent but ignored!
4
Challenges on the Web come in dual: Getting
access to the structured information!
  • Kevin's 4-quadrants

[Figure: 2x2 quadrant chart; labels: Deep Web, Surface Web, Access, Structure]
5
Tutorial Focus: Large-Scale Integration of
structured data over the Deep
Web
  • That is: Search-flavored integration.
  • Disclaimer: What it is not
  • Small-scale, pre-configured, mediated-querying
    settings
  • Many related techniques; some we will relate
    today
  • Text databases (or, meta-search)
  • Several related but text-oriented issues in
    meta-search
  • e.g., Stanford, Columbia, UIC
  • More in the IR community (distributed IR)
  • And, never a complete bibliography!
  • http://metaquerier.cs.uiuc.edu/ : Web
    Integration bibliography
  • Finally, no intention to finish this tutorial.

6
Evidence, in beta: Google Base.
7
When Google speaks up: "What is an Attribute?",
says Google!
8
And things are indeed happening!
11
The Deep Web: Databases on the Web
12
The previous Web: Search used to be "crawl
and index"
13
The current Web: Search must eventually
resort to integration
14
How to enable effective access to the deep Web?
Cars.com
Amazon.com
Biography.com
Apartments.com
411locate.com
401carfinder.com
15
Survey the frontier: BrightPlanet.com, March
2000 [Bergman00]
  • Overlap analysis of search engines.
  • Search sites not clearly defined.
  • Estimated 43,000 to 96,000 deep Web sites.
  • Content size: 500 times that of the surface Web.
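
Overlap analysis is commonly formulated as a capture-recapture estimate: if two engines list n_a and n_b sites and share n_ab of them, the total population is roughly n_a * n_b / n_ab. The sketch below illustrates only that general idea. The function name and the site names are made up, and it is not claimed to be the exact procedure used in the BrightPlanet survey.

```python
def overlap_estimate(found_a: set, found_b: set) -> float:
    """Capture-recapture style estimate of a population from two samples."""
    overlap = len(found_a & found_b)
    if overlap == 0:
        raise ValueError("no overlap: the estimate is undefined")
    return len(found_a) * len(found_b) / overlap


# Hypothetical listings from two search sites (made-up names).
engine_a = {"site1", "site2", "site3", "site4"}
engine_b = {"site3", "site4", "site5"}

estimate = overlap_estimate(engine_a, engine_b)  # 4 * 3 / 2
print(estimate)
```

The estimate assumes the two listings are independent samples; correlated listings would bias it downward.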

16
Survey the frontier: UIUC MetaQuerier, April 2004
[ChangHL04]
  • Macro: Deep Web at large
  • Data: Automatically-sampled 1 million IPs
  • Micro: per-source specific characteristics
  • Data: Manually-collected sources
  • 8 representative domains, 494 sources
  • Airfare (53), Autos (102), Books (69), CarRentals
    (24)
  • Hotels (38), Jobs (55), Movies (78), MusicRecords
    (75)
  • Available at http://metaquerier.cs.uiuc.edu/repository

17
They wanted to observe
  • How many deep-Web sources are out there?
  • "The dot-com bust has brought down DBs on the
    Web."
  • How many structured databases?
  • "There are just (or, much more) text databases."
  • How do search engines cover them?
  • "Google does it all." Or, "InvisibleWeb.com does
    it all."
  • How hidden are they?
  • "It is the hidden Web."
  • How complex are they?
  • "Queries on the Web are much simpler, even
    trivial."
  • "Coping with semantics is hopeless. Let's just
    wait till the Semantic Web."

18
And their results are
  • How many deep-Web sources are out there?
  • 307,000 sites, 450,000 DBs, 1,258,000 interfaces.
  • How many structured databases?
  • 348,000 (structured) vs. 102,000 (text), roughly 3:1.
  • How do search engines cover them?
  • Google covered 5% fresh and 21% stale objects.
  • InvisibleWeb.com covered 7.8% of sources.
  • How hidden are they?
  • CarRental (0%) > Airfares (4%) > ... > MusicRec >
    Books > Movies (80%)
  • How complex are they?
  • "Amazon effects"
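
A quick arithmetic check of the counts on this slide, as a small sketch (the variable names are illustrative): the structured and text databases add up to the reported database total, and the per-site and per-database ratios follow directly.

```python
sites, databases, interfaces = 307_000, 450_000, 1_258_000
structured, text = 348_000, 102_000

# Structured and text databases together account for all databases.
assert structured + text == databases

ratio = structured / text                  # about 3.41, i.e. roughly 3:1
dbs_per_site = databases / sites           # about 1.47 databases per site
interfaces_per_db = interfaces / databases # about 2.80 interfaces per database

print(f"{ratio:.2f} {dbs_per_site:.2f} {interfaces_per_db:.2f}")
```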

19
Reported the Amazon effect
Attributes converge in a domain!
Condition patterns converge even across domains!
20
Google's Recent Survey (courtesy Jayant
Madhavan)
21
Driving Force: The Large Scale
22
Circa 2000: Example System: Information Agents
[MichalowskiAKMTT04, Knoblock03]
23
Circa 2000: Example System: Comparison Shopping
Engines [GuptaHR97]
Virtual Database
24
System: Example Applications
25
Vertical Search Engines: Warehousing approach,
e.g., Libra Academic Search [NieZW05] (courtesy
MSRA)
  • Integrating information from multiple types of
    sources
  • Ranking papers, conferences, and authors for a
    given query
  • Handling structured queries

26
On-the-fly Meta-querying Systems, e.g., WISE
[HeMYW03], MetaQuerier [ChangHZ05]
MetaQuerier@UIUC
FIND sources
Amazon.com
Cars.com
db of dbs
Apartments.com
QUERY sources
411locate.com
unified query interface
27
What needs to be done? Technical Challenges
  • Source Modeling & Selection
  • Schema Matching
  • Source Querying, Crawling, and Object Ranking
  • Data Extraction
  • System Integration

28
The Problems: Technical Challenges
29
  • Technical Challenges
  • 1. Source Modeling & Selection
  • How to describe a source and find the right sources
    for query answering?

30
Source Modeling: Circa 2000
  • Focus:
  • Design of expressive model mechanisms.
  • Techniques:
  • View-based mechanisms: answering queries using
    views, LAV, GAV (see [Halevy01] for a survey).
  • Hierarchical or layered representations for
    modeling in-site navigation ([KnoblockMA98],
    [DavulcuFK99]).

31
Source Modeling & Selection: for Large Scale
Integration
  • Focus: Discovery of sources.
  • Focused crawling to collect query interfaces
    [BarbosaF05, ChangHZ05].
  • Focus: Extraction of source models.
  • Hidden grammar-based parsing [ZhangHC04].
  • Proximity-based extraction [HeMY04].
  • Classification to align with a given taxonomy
    [HessK03, Kushmerick03].
  • Focus: Organization of sources and query routing.
  • Offline clustering [HeTC04, PengMH04].
  • Online search for query routing [KabraLC05].

32
Form Extraction: the Problem
  • Output: all the conditions; for each:
  • Grouping elements (into query conditions)
  • Tagging elements with their semantic roles

[Figure labels: attribute, operator, value]
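
To make the target output concrete, the sketch below shows one possible data structure for an extracted condition with its attribute, operator, and value roles, plus a deliberately naive grouping rule. The class names, the element kinds, and the rule "a label opens a condition, radio buttons are operators, other inputs are values" are all assumptions for illustration; they are not the method of any system cited in this tutorial.

```python
from dataclasses import dataclass, field


@dataclass
class FormElement:
    kind: str  # hypothetical kinds: "label", "radio", "textbox", "select"
    text: str  # visible text or field name


@dataclass
class Condition:
    attribute: str
    operators: list = field(default_factory=list)
    values: list = field(default_factory=list)


def group_elements(elements):
    """Naive grouping: each label starts a new condition."""
    conditions = []
    for element in elements:
        if element.kind == "label":
            conditions.append(Condition(attribute=element.text))
        elif conditions:
            current = conditions[-1]
            if element.kind == "radio":
                current.operators.append(element.text)  # tagged as operator
            else:
                current.values.append(element.text)     # tagged as value
    return conditions


# A made-up book-search form, flattened in reading order.
form = [
    FormElement("label", "Author"),
    FormElement("textbox", "author_name"),
    FormElement("radio", "exact name"),
    FormElement("radio", "start of last name"),
    FormElement("label", "Title"),
    FormElement("textbox", "title_words"),
]
conditions = group_elements(form)
```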
33
Form Extraction: Parsing Approach [ZhangHC04]. A
hidden syntactic model exists?
  • Observation: Interfaces share patterns of
    presentation.
  • Hypothesis:
  • Now, the problem:
  • Given , how to find ?

[Figure label: query capabilities]
34
Best-Effort Visual Language Parsing Framework
Input: HTML query form
2P Grammar: Productions, Preferences
BE-Parser: Ambiguity Resolution, Error Handling
Output: semantic structure
35
Form Extraction: Clustering Approach
[HessK03, Kushmerick03]
  • Concept: A form as a Bayesian network.
  • Training: Estimate the Bayesian probabilities.
  • Classification: Max-likelihood predictions given
    terms.
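
The train-then-classify loop can be illustrated with a naive Bayes classifier, which is the simplest Bayesian network (one class node, independent term nodes). This is a simplification chosen for illustration, not the network structure of the cited work; the function names, the add-one smoothing, and the tiny training set are assumptions.

```python
import math
from collections import Counter, defaultdict


def train(examples):
    """examples: list of (terms, label). Collect the counts used at prediction."""
    label_counts = Counter()
    term_counts = defaultdict(Counter)
    vocabulary = set()
    for terms, label in examples:
        label_counts[label] += 1
        term_counts[label].update(terms)
        vocabulary.update(terms)
    return label_counts, term_counts, vocabulary


def predict(model, terms):
    """Return the label with the highest log-probability score for the terms."""
    label_counts, term_counts, vocabulary = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in label_counts.items():
        score = math.log(count / total)  # class prior
        denominator = sum(term_counts[label].values()) + len(vocabulary)
        for term in terms:
            # Add-one smoothing so unseen terms do not zero out the score.
            score += math.log((term_counts[label][term] + 1) / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up training forms, each described by the terms on its labels.
model = train([
    (["title", "author", "isbn"], "Books"),
    (["author", "publisher", "title"], "Books"),
    (["make", "model", "year"], "Autos"),
    (["make", "price", "mileage"], "Autos"),
])
book_guess = predict(model, ["author", "title"])
auto_guess = predict(model, ["make", "year"])
```

With equal class priors, as here, the highest-scoring label is also the maximum-likelihood one.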

36
  • Technical Challenges
  • 2. Schema Matching
  • How to match the schematic structures between
    sources?

37
Schema Matching: Circa 2000
  • Focus:
  • Generic matching without assuming Web sources
  • Techniques: [RahmB01]

38
Schema Matching: for Large Scale Integration
  • Focus: Matching a large number of interface
    schemas, often in a holistic way.
  • Statistical model discovery [HeC03]; correlation
    mining [HeCH04, HeC05].
  • Query probing [WangWL04].
  • Clustering [HeMY03, WuYD04].
  • Corpus-assisted [MadhavanBD05]; Web-assisted
    [WuDY06].
  • Focus: Constructing unified interfaces.
  • As a global generative model [HeC03].
  • Cluster-merge-select [HeMY03].

39
WISE-Integrator: Cluster-Merge-Represent
[HeMY03]
40
WISE-Integrator: Cluster-Merge-Represent
[HeMY03]
  • Matching attributes
  • Synonymous labels: WordNet, string similarity
  • Compatible value domains (enum values or type)
  • Constructing the integrated interface
  • form := initially empty
  • until all attributes are covered:
  • take one attribute
  • select a representative and merge values
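
The construction loop above can be sketched as follows. This is a minimal illustration under stated assumptions: labels are matched by string similarity only (via `difflib`, with an arbitrary 0.8 threshold), WordNet synonyms and value-domain compatibility are left out, and the representative is simply the most frequent label in a cluster. The function names and sample interfaces are hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher


def similar(a: str, b: str, threshold: float = 0.8) -> bool:
    """Case-insensitive string similarity test (threshold is an assumption)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def integrate(interfaces):
    """interfaces: list of dicts mapping attribute label -> list of values."""
    clusters = []  # each cluster: {"labels": [...], "values": [...]}
    for interface in interfaces:
        for label, values in interface.items():
            # Cluster: find an existing group with a similar label.
            for cluster in clusters:
                if any(similar(label, known) for known in cluster["labels"]):
                    break
            else:
                cluster = {"labels": [], "values": []}
                clusters.append(cluster)
            cluster["labels"].append(label)
            # Merge: union of values, keeping first-seen order.
            for value in values:
                if value not in cluster["values"]:
                    cluster["values"].append(value)
    # Represent: the most frequent label stands for the cluster.
    unified = {}
    for cluster in clusters:
        representative = Counter(cluster["labels"]).most_common(1)[0][0]
        unified[representative] = cluster["values"]
    return unified


unified = integrate([
    {"Author": [], "Format": ["Hardcover", "Paperback"]},
    {"author": [], "Formats": ["Paperback", "Audio"]},
    {"Author": [], "Title": [], "Format": ["Hardcover"]},
])
```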

41
Statistical Schema Matching: MGS. A
hidden statistical model exists? [HeC03, HeCH04,
HeC05]
  • Observation: Schemas share tendencies of
    attribute usage.
  • Hypothesis:
  • Now, the problem:
  • Given , how to find ?

[Figure labels: Statistical Model, Schema Generation, attribute matchings; symbols α, β, δ]
42
Statistical Hypothesis Discovery
  • Statistical formulation:
  • Given: QIs (query interfaces) as observations
  • Find: the underlying hypothesis (Prob)
  • Global approach: Hidden model discovery [HeC03]
  • Find the entire global model at once.
  • Local approach: Correlation mining [HeCH04,
    HeC05]
  • Find local fragments of matchings one at a time.
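
As a rough illustration of the local approach, the sketch below treats each query interface as a set of attribute names and flags pairs of attributes that are each frequent but never appear together, on the intuition that synonyms rarely co-occur in one form. The "never co-occur" test and the support threshold are crude assumptions made here for illustration; the cited papers define their own correlation measures, which this does not reproduce.

```python
from collections import Counter
from itertools import combinations


def candidate_synonyms(schemas, min_support=2):
    """Return attribute pairs that are frequent individually but never co-occur."""
    support = Counter(attr for schema in schemas for attr in set(schema))
    together = Counter(
        frozenset(pair)
        for schema in schemas
        for pair in combinations(sorted(set(schema)), 2)
    )
    frequent = sorted(a for a, n in support.items() if n >= min_support)
    return [
        (a, b)
        for a, b in combinations(frequent, 2)
        if together[frozenset((a, b))] == 0
    ]


# Made-up book-domain schemas.
schemas = [
    {"author", "title", "subject"},
    {"writer", "title", "category"},
    {"author", "title", "category"},
    {"writer", "title", "subject"},
]
matches = candidate_synonyms(schemas)
```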
43
  • Technical Challenges
  • 3. Source Querying, Crawling & Search
  • How to query a source? How to crawl all objects
    and to search them?

44
Source Querying: Circa 2000
  • Focus: Mediation of cross-source, join-able
    queries
  • Query rewriting, planning: extensive study, e.g.,
    [LevyRO96, AmbiteKMP01, Halevy01].
  • Focus: Execution optimization of queries
  • Adaptive, speculative query optimization, e.g.,
    [NaughtonDM01, BarishK03, IvesHW04].

45
Source Querying: for Large Scale Integration
  • Metaquerying model:
  • Focus: On-the-fly querying.
  • MetaQuerier Query Assistant [ZhangHC05].
  • Vertical-search-engine model:
  • Focus: Source crawling to collect objects.
  • Form submission by query generation/selection,
    e.g., [RaghavanG01, WuWLM06].
  • Focus: Object search and ranking [NieZW05].

46
On-the-fly Querying [ZhangHC05]:
Type-locality based Predicate Translation
  • Correspondences occur within localities
  • Translation by type-handler

47
Source Crawling by Query Selection [WuWLM06]
[Figure: example graph with nodes Compiler, System, Theory, Application, Ullman, Automata, Data Mining, Han]
  • Conceptually, the DB as a graph
  • Node: Attributes
  • Edge: Occurrence relationship
  • Crawling is transformed into a graph traversal
    problem
  • Find a set of nodes N in the graph G such that
    for every node i in G, there exists a node j in
    N with j → i, and the total cost of the nodes
    in N is minimal.
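
The selection problem stated above is a weighted dominating-set problem, which is usually attacked with a greedy heuristic. The sketch below is one such heuristic, not the algorithm of the cited paper: it repeatedly picks the node that covers the most still-uncovered nodes per unit cost. It additionally assumes that a chosen node covers itself, and the graph and unit costs are made up around the node names shown on the slide.

```python
def greedy_query_selection(graph, cost):
    """graph: node -> set of nodes it reaches (j -> i); cost: node -> cost > 0."""
    uncovered = set(graph)
    chosen = []
    while uncovered:
        def gain(node):
            # Newly covered nodes per unit cost (a node covers itself: assumption).
            return len((graph[node] | {node}) & uncovered) / cost[node]

        best = max(sorted(graph), key=gain)  # sorted() makes ties deterministic
        chosen.append(best)
        uncovered -= graph[best] | {best}
    return chosen


# Hypothetical occurrence graph over the values shown on the slide.
graph = {
    "Ullman": {"Compiler", "Automata", "Theory"},
    "Han": {"Data Mining", "Application"},
    "Compiler": {"Ullman", "System"},
    "Automata": {"Ullman", "Theory"},
    "Theory": {"Ullman", "Automata"},
    "System": {"Compiler"},
    "Data Mining": {"Han", "Application"},
    "Application": {"Han", "Data Mining"},
}
cost = {node: 1.0 for node in graph}
queries = greedy_query_selection(graph, cost)

covered = set(queries)
for node in queries:
    covered |= graph[node]
```

The greedy choice gives no optimality guarantee in general; it only ensures that every node ends up covered.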

48
Object Ranking: Object Relationship Graph
[NieZW05]
  • Popularity Propagation Factor (PPF) for each type of
    relationship link
  • Popularity of an object is also affected by the
    popularity of the Web pages containing the object
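
The two bullets above suggest a PageRank-style iteration in which each link type has its own propagation factor and Web-page popularity acts as a base score. The sketch below follows that reading. The exact update rule, the damping value, the link types, and the factor values are assumptions for illustration and should not be taken as the formula of the cited paper.

```python
from collections import Counter


def pop_rank(links, ppf, web_popularity, damping=0.85, iterations=50):
    """links: (source, target, link_type) triples; ppf: link_type -> factor.

    web_popularity: object -> popularity inherited from the Web pages
    that contain the object (used as the base score).
    """
    out_count = Counter((src, kind) for src, _, kind in links)
    rank = dict(web_popularity)
    for _ in range(iterations):
        new_rank = {o: (1 - damping) * p for o, p in web_popularity.items()}
        for src, dst, kind in links:
            # Popularity flows along the link, scaled by its type's factor
            # and split evenly among the source's links of that type.
            share = rank[src] / out_count[(src, kind)]
            new_rank[dst] += damping * ppf[kind] * share
        rank = new_rank
    return rank


# Made-up objects and links.
links = [
    ("paper2", "paper1", "cites"),
    ("paper1", "author1", "written_by"),
    ("paper2", "author1", "written_by"),
]
ppf = {"cites": 0.5, "written_by": 0.3}
web_popularity = {"paper1": 1 / 3, "paper2": 1 / 3, "author1": 1 / 3}
rank = pop_rank(links, ppf, web_popularity)
```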

49
Object Ranking: Training Process
[NieZW05]
[Figure: training flowchart. Boxes: Initial Combination of PPFs; Link Graph; PopRank Calculator; Ranking Distance Estimator; Expert Ranking; new combination from neighbors. Decisions: "Better than the best?" (Yes: chosen as the best); "Accept the worse one?" (Yes / No)]
  • Subgraph selection to approximate rank
    calculation for speeding up.

50
  • Technical Challenges
  • 4. Data Extraction
  • How to extract result pages into relations?

51
Data Extraction: Circa 2000. Need for rapid
wrapper construction well recognized.
  • Focus:
  • Semi-automatic wrapper construction.
  • Techniques:
  • Wrapper-mediator architecture [Wiederhold92].
  • Manual construction
  • Semi-automatic: Learning-based
  • HLRT [KushmerickWD97],
  • Stalker [MusleaMK99],
  • Softmealy [HsuD98]

[Figure: one Mediator on top of three Wrappers]
52
Data Extraction: for Large Scale. Even more
automatic approaches.
  • Focus:
  • Even more automatic approaches.
  • Techniques:
  • Semi-automatic: Learning-based
  • [ZhaoMWRY05], [IRMKS06].
  • Automatic: Syntax-based
  • RoadRunner [MeccaCM01],
  • ExAlg [ArasuG03],
  • DEPTA [LiuGZ03, ZhaiL05].

[Figure: one Mediator on top of three Wrappers]
53
HLRT Wrapper: the first Wrapper Induction
[KushmerickWD97]
A manual wrapper:
ExtractCCs(page P)
  skip past first occurrence of <B> in P
  while next <B> is before next <HR> in P
    for each <lk, rk> in {<<B>, </B>>, <<I>, </I>>}
      skip past next occurrence of lk in P
      extract attribute from P to next occurrence of rk
  return extracted tuples
A generalized wrapper:
ExecuteHLRT(<h, t, l1, r1, ..., lK, rK>, page P)
  skip past first occurrence of h in P
  while next l1 is before next t in P
    for each <lk, rk> in {<l1, r1>, ..., <lK, rK>}
      skip past next occurrence of lk in P
      extract attr from P to next occurrence of rk
  return extracted tuples
Wrapper rules (delimiters): h, <l1, r1>, <l2, r2>, ..., <lK, rK>, t
[Figure labels: labeled data, Induction Algorithm]
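
The generalized ExecuteHLRT procedure translates almost line for line into Python string searching. The sketch below is one such translation; the function name, argument order, and the sample page are made up for illustration, and malformed pages are handled simply by stopping early.

```python
def execute_hlrt(page, head, tail, delimiters):
    """Extract tuples from page using head/tail and (left, right) delimiters."""
    tuples = []
    pos = page.find(head)
    if pos == -1:
        return tuples
    pos += len(head)  # skip past first occurrence of h
    first_left = delimiters[0][0]
    while True:
        next_left = page.find(first_left, pos)
        next_tail = page.find(tail, pos)
        # Continue only while the next l1 comes before the next t.
        if next_left == -1 or (next_tail != -1 and next_tail < next_left):
            break
        row = []
        for left, right in delimiters:
            start = page.find(left, pos)
            if start == -1:
                return tuples
            start += len(left)            # skip past next occurrence of lk
            end = page.find(right, start)
            if end == -1:
                return tuples
            row.append(page[start:end])   # extract up to next occurrence of rk
            pos = end + len(right)
        tuples.append(tuple(row))
    return tuples


# Made-up result page listing countries and calling codes.
page = (
    "<HTML><TITLE>Some Country Codes</TITLE><BODY>"
    "<B>Some Country Codes</B><P>"
    "<B>Congo</B> <I>242</I><BR>"
    "<B>Egypt</B> <I>20</I><BR>"
    "<HR><B>End</B></BODY></HTML>"
)
rows = execute_hlrt(page, "<P>", "<HR>", [("<B>", "</B>"), ("<I>", "</I>")])
```

The head delimiter keeps the bold page title from being mistaken for a record, and the tail delimiter does the same for the bold "End" footer.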
54
RoadRunner [MeccaCM01]
  • Basic idea:
  • Page generation: filling (encoding) data into a
    template
  • Data extraction: as the reverse, decoding the
    template
  • Algorithm:
  • Compare two HTML pages at a time,
  • one as wrapper and the other as sample
  • Solving the mismatches:
  • string mismatch: content slot
  • tag mismatch: structure variance
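
The string-mismatch half of this comparison is easy to sketch: tokenize two pages, walk them in parallel, and turn every position where the text differs into a content slot. The sketch below covers only that case, under the assumption that both pages have the same tag structure; tag mismatches (optional or repeated structure) are not handled and raise an error. The slot marker `#PCDATA`, the function names, and the sample pages are illustrative choices.

```python
import re


def tokenize(html):
    """Split a page into tag tokens and non-empty text tokens."""
    parts = re.split(r"(<[^>]+>)", html)
    return [part.strip() for part in parts if part.strip()]


def generalize(wrapper, sample):
    """Refine the wrapper against a sample; string mismatches become slots."""
    if len(wrapper) != len(sample):
        raise ValueError("tag mismatch: not handled by this sketch")
    template = []
    for w, s in zip(wrapper, sample):
        if w == s:
            template.append(w)
        elif not w.startswith("<") and not s.startswith("<"):
            template.append("#PCDATA")  # string mismatch: content slot
        else:
            raise ValueError("tag mismatch: not handled by this sketch")
    return template


page1 = "<html>Books of: <b>John Smith</b><ul><li>DB Primer</li></ul></html>"
page2 = "<html>Books of: <b>Paul Jones</b><ul><li>XML at Work</li></ul></html>"
template = generalize(tokenize(page1), tokenize(page2))
```

Text shared by both pages ("Books of:") stays in the template as fixed content, while differing text is generalized into a slot.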

55
RoadRunner
56
RoadRunner
the template
57
  • Technical Challenges
  • 5. System Integration
  • Putting things together?

58
Our system research often ends up with
components in isolation
[ChangHZ05]
59
System integration: Sample issues
[Figure: AA.com query form and the result of extraction]
  • New challenges:
  • How will errors in automatic form extraction
    impact the subsequent schema matching?
  • New opportunities:
  • Can the result of schema matching help to correct
    such errors?
  • e.g., (adults, children) together form a
    matching, then?

60
Current agenda: Science of system integration
New challenge: error cascading (Cascade)
New opportunity: result feedback (Feedback)
61
Finally, observations: Large scale is not only a
challenge, but also an opportunity!
62
Observation 1: Large scale introduces
New Problems!
  • Several issues arise in this context.
  • Evidence of new problems:
  • Source modeling & selection
  • Source querying, crawling, ranking
  • On-the-fly query translation
  • Object crawling, ranking
  • System integration

63
Observation 2: Large scale introduces New
Semantics!
  • Relaxed metrics are possible, even for the same problems.
  • Evidence of new metrics:
  • Search-flavored integration: large scale but
    simplistic
  • Function: Simple queries
  • Source: Transparency is no longer the fundamental
    doctrine
  • User: In the loop of querying
  • Techniques: Automatic but error-prone
  • Results: Fuzzy, ranked
  • Meta-querying: ranking of matching sources
  • Vertical search engine: ranking of objects

64
Observation 3: Large scale introduces New
Insights!
  • The multitude of sources gives a holistic context
    for study.
  • Evidence of new insights:
  • Schema matching: Many holistic approaches
  • Source modeling: Lego-based extraction
  • System integration: Holistic error
    correction/feedback

65
The Web Trio (My three circles...)
Search
Integration
Mining
66
Looking Forward
Recall the first time I heard about Google Base.
DB People: Buckle Up! Our time has finally come.
67
Thank You!
For more information: http://metaquerier.cs.uiuc.edu
kcchang@cs.uiuc.edu