Title: Large-Scale Deep Web Integration: Exploring and Querying Structured Data on the Deep Web
1. Large-Scale Deep Web Integration: Exploring and Querying Structured Data on the Deep Web
Tutorial at SIGMOD 2006
2. Still challenges on the Web? Google is only the start of search (and MSN will not be the end of it).
3. Structured Data: Prevalent but ignored!
4. Challenges on the Web come in dual: Getting access to the structured information!
[Figure: Surface Web vs. Deep Web along two axes, Access and Structure; open questions marked "?"]
5. Tutorial Focus: Large-Scale Integration of structured data over the Deep Web
- That is: Search-flavored integration.
- Disclaimer: What it is not
  - Small-scale, pre-configured, mediated-querying settings
    - many related techniques; some we will relate today
  - Text databases (or, meta-search)
    - Several related but text-oriented issues in meta-search
    - e.g., Stanford, Columbia, UIC
    - more in the IR community (distributed IR)
- And, never a complete bibliography!!
  - http://metaquerier.cs.uiuc.edu/ : Web Integration bibliography
- Finally, no intention to finish this tutorial.
6. Evidence in beta: Google Base.
7. When Google speaks up: "What is an Attribute," says Google!
8. And things are indeed happening!
11. The Deep Web: Databases on the Web
12. The previous Web: Search used to be "crawl and index"
13. The current Web: Search must eventually resort to integration
14. How to enable effective access to the deep Web?
Cars.com
Amazon.com
Biography.com
Apartments.com
411localte.com
401carfinder.com
15. Survey the frontier: BrightPlanet.com, March 2000 [Bergman00]
- Overlap analysis of search engines.
- "Search sites" not clearly defined.
- Estimated 43,000 to 96,000 deep Web sites.
- Content size 500 times that of surface Web.
16. Survey the frontier: UIUC MetaQuerier, April 2004 [ChangHL04]
- Macro: Deep Web at large
  - Data: Automatically-sampled 1 million IPs
- Micro: per-source specific characteristics
  - Data: Manually-collected sources
  - 8 representative domains, 494 sources
    - Airfare (53), Autos (102), Books (69), CarRentals (24)
    - Hotels (38), Jobs (55), Movies (78), MusicRecords (75)
  - Available at http://metaquerier.cs.uiuc.edu/repository
17. They wanted to observe
- How many deep-Web sources are out there?
  - The dot-com bust has brought down DBs on the Web.
- How many structured databases?
  - There are just (or, much more) text databases.
- How do search engines cover them?
  - Google does it all. Or, InvisibleWeb.com does it all.
- How hidden are they?
  - It is the hidden Web.
- How complex are they?
  - Queries on the Web are much simpler, even trivial.
  - Coping with semantics is hopeless. Let's just wait till the semantic Web.
18. And their results are
- How many deep-Web sources are out there?
  - 307,000 sites, 450,000 DBs, 1,258,000 interfaces.
- How many structured databases?
  - 348,000 (structured) : 102,000 (text), about 3 : 1
- How do search engines cover them?
  - Google covered 5% fresh and 21% stale objects.
  - InvisibleWeb.com covered 7.8% of sources.
- How hidden are they?
  - CarRental (0%) > Airfares (4%) > > MusicRec > Books > Movies (80%)
- How complex are they?
  - "Amazon effects"
19. Reported the "Amazon effect"
Attributes converge in a domain!
Condition patterns converge even across domains!
20. Google's Recent Survey (courtesy Jayant Madhavan)
21. Driving Force: The Large Scale
22. Circa 2000 Example System: Information Agents [MichalowskiAKMTT04, Knoblock03]
23. Circa 2000 Example System: Comparison Shopping Engines [GuptaHR97]
Virtual Database
24. System: Example Applications
25. Vertical Search Engines: "Warehousing" approach, e.g., Libra Academic Search [NieZW05] (courtesy MSRA)
- Integrating information from multiple types of sources
- Ranking papers, conferences, and authors for a given query
- Handling structured queries
26. On-the-fly Meta-querying Systems, e.g., WISE [HeMYW03], MetaQuerier [ChangHZ05]
[Figure: MetaQuerier@UIUC. Labels: FIND sources (Amazon.com, Cars.com, Apartments.com, 411localte.com), db of dbs, QUERY sources, unified query interface]
27. What needs to be done? Technical Challenges
- Source Modeling & Selection
- Schema Matching
- Source Querying, Crawling, and Object Ranking
- Data Extraction
- System Integration
28. The Problems: Technical Challenges
29. Technical Challenges
- 1. Source Modeling & Selection
- How to describe a source and find the right sources for query answering?
30. Source Modeling: Circa 2000
- Focus:
  - Design of expressive model mechanism.
- Techniques:
  - View-based mechanisms: answering queries using views, LAV, GAV (see [Halevy01] for survey).
  - Hierarchical or layered representations for modeling in-site navigations [KnoblockMA98, DavulcuFK99].
31. Source Modeling & Selection for Large Scale Integration
- Focus: Discovery of sources.
  - Focused crawling to collect query interfaces [BarbosaF05, ChangHZ05].
- Focus: Extraction of source models.
  - Hidden grammar-based parsing [ZhangHC04].
  - Proximity-based extraction [HeMY04].
  - Classification to align with given taxonomy [HessK03, Kushmerick03].
- Focus: Organization of sources and query routing.
  - Offline clustering [HeTC04, PengMH04].
  - Online search for query routing [KabraLC05].
32. Form Extraction: the Problem
- Output: all the conditions, for each:
  - Grouping elements (into query conditions)
  - Tagging elements with their semantic roles: attribute, operator, value
33. Form Extraction: Parsing Approach [ZhangHC04]. Does a hidden syntactic model exist?
- Observation: Interfaces share "patterns" of presentation.
- Hypothesis:
- Now, the problem:
  - Given , how to find ?
query capabilities
34. Best-Effort Visual Language Parsing Framework
[Figure: Input (HTML query form); 2P Grammar (Productions, Preferences); BE-Parser (Ambiguity Resolution, Error Handling); Output (semantic structure)]
35. Form Extraction: Clustering Approach [HessK03, Kushmerick03]
- Concept: A form as a Bayesian network.
- Training: Estimate the Bayesian probabilities.
- Classification: Max-likelihood predictions given terms.
36. Technical Challenges
- 2. Schema Matching
- How to match the schematic structures between sources?
37. Schema Matching: Circa 2000
- Focus:
  - Generic matching without assuming Web sources
- Techniques: [RahmB01]
38. Schema Matching for Large Scale Integration
- Focus: Matching large number of interface schemas, often in a holistic way.
  - Statistical model discovery [HeC03]; correlation mining [HeCH04, HeC05].
  - Query probing [WangWL04].
  - Clustering [HeMY03, WuYD04].
  - Corpus-assisted [MadhavanBD05]; Web-assisted [WuDY06].
- Focus: Constructing unified interfaces.
  - As a global generative model [HeC03].
  - Cluster-merge-select [HeMY03].
39. WISE-Integrator: Cluster-Merge-Represent [HeMY03]
40. WISE-Integrator: Cluster-Merge-Represent [HeMY03]
- Matching attributes:
  - Synonymous label: WordNet, string similarity
  - Compatible value domains (enum values or type)
- Constructing integrated interface:
  - form = initially empty
  - until all attributes covered:
    - take one attribute
    - select a representative and merge values
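The cluster-merge-represent loop on this slide can be sketched as follows. This is only a minimal illustration under stated assumptions, not the WISE-Integrator implementation: `difflib` string similarity stands in for the WordNet-based label matcher, value domains count as compatible when they share at least one enumerated value, the most frequent label is picked as representative, and the function names, threshold, and sample forms are all hypothetical.

```python
from difflib import SequenceMatcher


def labels_match(a, b, threshold=0.8):
    """String-similarity stand-in for the label matcher (no WordNet here)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def domains_compatible(vals_a, vals_b):
    """Assumption: enumerated domains are compatible if they share a value."""
    return bool(set(vals_a) & set(vals_b))


def integrate(interfaces, threshold=0.8):
    """interfaces: list of dicts mapping attribute label -> list of values."""
    clusters = []  # each cluster is a list of (label, values) pairs
    # Cluster: put each attribute into the first cluster it matches.
    for form in interfaces:
        for label, values in form.items():
            for cluster in clusters:
                if any(labels_match(label, other, threshold)
                       or domains_compatible(values, other_vals)
                       for other, other_vals in cluster):
                    cluster.append((label, values))
                    break
            else:
                clusters.append([(label, values)])
    # Merge and represent: one unified attribute per cluster.
    unified = {}
    for cluster in clusters:
        labels = [label for label, _ in cluster]
        representative = max(set(labels), key=lambda l: (labels.count(l), l))
        unified[representative] = sorted({v for _, vals in cluster for v in vals})
    return unified


forms = [
    {"Author": [], "Format": ["hardcover", "paperback"]},
    {"Authors": [], "Binding": ["paperback", "audio"]},
    {"Author": [], "Title": []},
]
print(integrate(forms))
```

Here "Authors" joins "Author" through label similarity, while "Binding" joins "Format" only because their value domains overlap, which is the second matching criterion on the slide.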
41. Statistical Schema Matching: MGS. Does a hidden statistical model exist? [HeC03, HeCH04, HeC05]
- Observation: Schemas share "tendencies" of attribute usage.
- Hypothesis:
- Now, the problem:
  - Given , how to find ?
[Figure: Statistical Model with unknown parameters (α, β, δ, each marked "?"); Schema Generation; attribute matchings]
42. Statistical Hypothesis Discovery
- Statistical formulation:
  - Given as observations
  - Find underlying hypothesis
- "Global" approach: Hidden model discovery [HeC03]
  - Find entire global model at once
- "Local" approach: Correlation mining [HeCH04, HeC05]
  - Find local fragments of matchings one at a time.
Prob
QIs
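To make the "local" approach concrete, here is a small sketch of mining one kind of matching fragment from attribute co-occurrence. It rests on an assumption not spelled out on the slide: synonymous attributes (e.g., author and writer) each appear often but rarely on the same interface, so they are negatively correlated. Plain lift is used as the correlation measure purely for illustration; it is not the measure from [HeCH04, HeC05], and the function name, thresholds, and sample schemas are hypothetical.

```python
from itertools import combinations


def synonym_candidates(schemas, min_support=2, max_lift=0.5):
    """Return attribute pairs that are frequent but seldom co-occur.

    schemas: list of attribute lists, one per query interface.
    lift = P(a and b) / (P(a) * P(b)); values well below 1 mean the two
    attributes avoid each other, which is read here as a synonym hint.
    """
    n = len(schemas)
    sets = [set(s) for s in schemas]
    attrs = sorted(set().union(*sets))
    count = {a: sum(a in s for s in sets) for a in attrs}
    found = []
    for a, b in combinations(attrs, 2):
        if count[a] < min_support or count[b] < min_support:
            continue  # too rare to say anything
        both = sum(a in s and b in s for s in sets)
        lift = (both * n) / (count[a] * count[b])
        if lift <= max_lift:
            found.append((a, b, lift))
    return sorted(found, key=lambda t: t[2])


book_schemas = [
    ["author", "title"],
    ["writer", "title", "isbn"],
    ["author", "title", "isbn"],
    ["writer", "title"],
    ["author", "title"],
]
print(synonym_candidates(book_schemas))
```

Each returned pair is one local fragment of the matching; a global approach would instead fit a single model that explains all schemas at once.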
43. Technical Challenges
- 3. Source Querying, Crawling & Search
- How to query a source? How to crawl all objects and to search them?
44. Source Querying: Circa 2000
- Focus: Mediation of cross-source, join-able queries
  - Query rewriting, planning: Extensive study, e.g., [LevyRO96, AmbiteKMP01, Halevy01].
- Focus: Execution optimization of queries
  - Adaptive, speculative query optimization, e.g., [NaughtonDM01, BarishK03, IvesHW04].
45. Source Querying for Large Scale Integration
- Metaquerying model:
  - Focus: On-the-fly Querying.
    - MetaQuerier Query Assistant [ZhangHC05].
- Vertical-search-engine model:
  - Focus: Source crawling to collect objects.
    - Form submission by query generation/selection, e.g., [RaghavanG01, WuWLM06].
  - Focus: Object search and ranking [NieZW05]
46. On-the-fly Querying [ZhangHC05]: Type-locality based Predicate Translation
- Correspondences occur within localities
- Translation by type-handler
47. Source Crawling by Query Selection [WuWLM06]
[Figure: example graph with nodes Compiler, System, Theory, Application, Ullman, Automata, Data Mining, Han]
- Conceptually, the DB as a graph:
  - Node: Attributes
  - Edge: Occurrence relationship
- Crawling is transformed into a graph traversal problem:
  - Find a set of nodes N in the graph G such that for every node i in G, there exists a node j in N with an edge j -> i, and the total cost of the nodes in N is minimum.
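The selection problem stated above is a minimum-cost dominating set, so an exact answer is expensive on large graphs. The sketch below uses a plain greedy heuristic (pick the node that covers the most still-uncovered nodes per unit cost); this is a standard approximation chosen for illustration, not necessarily the strategy of [WuWLM06]. It additionally assumes that a selected node covers itself, and the graph, costs, and function name are hypothetical.

```python
def greedy_query_selection(graph, cost):
    """Greedy approximation of a minimum-cost dominating set.

    graph: dict mapping each node to the set of nodes it links to.
    cost:  dict mapping each node to the positive cost of querying it.
    Assumption: a chosen node also covers itself.
    """
    uncovered = set(graph)
    chosen = []
    while uncovered:
        def ratio(node):
            # Newly covered nodes per unit cost for this candidate.
            return len((graph[node] | {node}) & uncovered) / cost[node]

        best = max(sorted(graph), key=ratio)
        chosen.append(best)
        uncovered -= graph[best] | {best}
    return chosen


# Hypothetical co-occurrence graph in the spirit of the slide's example.
graph = {
    "Ullman": {"Compiler", "Automata"},
    "Han": {"Data Mining"},
    "Compiler": {"Ullman"},
    "Automata": {"Ullman"},
    "Data Mining": {"Han"},
}
cost = {node: 1.0 for node in graph}
print(greedy_query_selection(graph, cost))
```

With unit costs the first query issued is "Ullman", since it reaches three nodes at once; one more query then covers the remaining pair.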
48. Object Ranking: Object Relationship Graph [NieZW05]
- Popularity Propagation Factor (PPF) for each type of relationship link
- Popularity of an object is also affected by the popularity of the Web pages containing the object
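The two bullets above can be illustrated with a PageRank-style iteration in which each link type carries its own propagation factor and each object keeps a share of the popularity of the Web pages that contain it. This is a rough sketch under assumptions, not the exact formulation in [NieZW05]: the damping constant `eps`, the even split of popularity across outgoing links of one type, the fixed iteration count, and all names and numbers are hypothetical.

```python
def pop_rank(objects, links, ppf, web_pop, eps=0.15, iters=50):
    """Iteratively propagate popularity over a typed object graph.

    objects: list of object ids.
    links:   list of (source, target, link_type) triples.
    ppf:     dict link_type -> popularity propagation factor.
    web_pop: dict object id -> popularity inherited from containing pages.
    """
    # Number of outgoing links per (source, link_type), used for splitting.
    out_count = {}
    for src, _dst, ltype in links:
        out_count[(src, ltype)] = out_count.get((src, ltype), 0) + 1

    rank = {o: web_pop[o] for o in objects}
    for _ in range(iters):
        new = {o: eps * web_pop[o] for o in objects}
        for src, dst, ltype in links:
            share = rank[src] / out_count[(src, ltype)]
            new[dst] += (1 - eps) * ppf[ltype] * share
        rank = new
    return rank


objects = ["p1", "p2", "a1"]
links = [
    ("p2", "p1", "cites"),   # paper p2 cites paper p1
    ("a1", "p1", "writes"),  # author a1 wrote both papers
    ("a1", "p2", "writes"),
]
ppf = {"cites": 0.5, "writes": 0.3}
web_pop = {o: 1 / 3 for o in objects}
ranks = pop_rank(objects, links, ppf, web_pop)
print(sorted(ranks, key=ranks.get, reverse=True))
```

Raising or lowering a single entry of `ppf` changes how much one relationship type matters, which is the knob the training process on the next slide searches over.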
49. Object Ranking: Training Process [NieZW05]
[Figure: training loop. Boxes: Initial Combination of PPFs, Link Graph, PopRank Calculator, Ranking Distance Estimator, Expert Ranking, new combination from neighbors. Decisions: "Better than the best?" (Yes: chosen as the best), "Accept the worse one?" (Yes/No)]
- Subgraph selection to approximate rank calculation for speeding up.
50. Technical Challenges
- 4. Data Extraction
- How to extract result pages into relations?
51. Data Extraction: Circa 2000. Need for rapid wrapper construction well recognized.
- Focus:
  - Semi-automatic wrapper construction.
- Techniques:
  - Wrapper-mediator architecture [Wiederhold92].
  - Manual construction
  - Semi-automatic: Learning-based
    - HLRT [KushmerickWD97],
    - Stalker [MusleaMK99],
    - Softmealy [HsuD98]
[Figure: a Mediator on top of three Wrappers]
52. Data Extraction for Large Scale: Even more automatic approaches.
- Focus:
  - Even more automatic approaches.
- Techniques:
  - Semi-automatic: Learning-based
    - [ZhaoMWRY05], [IRMKS06].
  - Automatic: Syntax-based
    - RoadRunner [MeccaCM01],
    - ExAlg [ArasuG03],
    - DEPTA [LiuGZ03, ZhaiL05].
[Figure: a Mediator on top of three Wrappers]
53. HLRT Wrapper: the first "Wrapper Induction" [KushmerickWD97]
A manual wrapper:
ExtractCCs(page P)
  skip past first occurrence of <B> in P
  while next <B> is before next <HR> in P
    for each <lk, rk> in {<<B>, </B>>, <<I>, </I>>}
      skip past next occurrence of lk in P
      extract attribute from P to next occurrence of rk
  return extracted tuples
A generalized wrapper:
ExecuteHLRT(<h, t, l1, r1, ..., lK, rK>, page P)
  skip past first occurrence of h in P
  while next l1 is before next t in P
    for each <lk, rk> in {<l1, r1>, ..., <lK, rK>}
      skip past next occurrence of lk in P
      extract attr from P to next occurrence of rk
  return extracted tuples
labeled data -> Induction Algorithm -> wrapper rules (delimiters): h, l1, r1, l2, r2, ..., lK, rK, t
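For readers who want to run the generalized wrapper, here is a direct Python transcription of the ExecuteHLRT pseudocode. It is a sketch of the execution step only (the induction algorithm that learns the delimiters is not shown); the function name, the early return on a malformed row, and the sample page and delimiters are illustrative assumptions.

```python
def execute_hlrt(head, tail, delimiters, page):
    """Run an HLRT wrapper over a page.

    head, tail: strings marking the start and end of the data region.
    delimiters: list of (left, right) string pairs, one per attribute.
    Returns a list of tuples, one per extracted record.
    """
    tuples = []
    pos = page.find(head)
    if pos < 0:
        return tuples
    pos += len(head)  # skip past first occurrence of the head
    first_left = delimiters[0][0]
    while True:
        next_left = page.find(first_left, pos)
        next_tail = page.find(tail, pos)
        # Stop when no record starts before the tail delimiter.
        if next_left < 0 or 0 <= next_tail < next_left:
            break
        row = []
        for left, right in delimiters:
            start = page.find(left, pos)
            if start < 0:
                return tuples  # malformed row: give up (sketch choice)
            start += len(left)
            end = page.find(right, start)
            if end < 0:
                return tuples
            row.append(page[start:end])
            pos = end + len(right)
        tuples.append(tuple(row))
    return tuples


page = (
    "<HTML><TITLE>Some Country Codes</TITLE><BODY>"
    "<B>Some Country Codes</B><P>"
    "<B>Congo</B> <I>242</I><BR>"
    "<B>Egypt</B> <I>20</I><BR>"
    "<HR><B>End</B></BODY></HTML>"
)
rows = execute_hlrt("<P>", "<HR>", [("<B>", "</B>"), ("<I>", "</I>")], page)
print(rows)
```

The head delimiter keeps the bold page title from being read as a record, and the tail delimiter does the same for the bold "End" footer.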
54. RoadRunner [MeccaCM01]
- Basic idea:
  - Page generation: filling (encoding) data into a template
  - Data extraction: as the reverse, decoding the template
- Algorithm:
  - Compare two HTML pages at one time
    - one as wrapper and the other as sample
  - Solving the mismatches:
    - string mismatch: content slot
    - tag mismatch: structure variance
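The string-mismatch half of this comparison is easy to sketch: tokenize both pages, walk them in parallel, and turn every position where the text differs into a content slot. The code below is a deliberately reduced illustration, not the RoadRunner algorithm: it assumes both pages have the same tag structure and simply reports a tag mismatch instead of resolving it into optionals or repeated sections. The tokenizer, the function names, and the `#PCDATA` slot marker as used here are assumptions.

```python
import re

TOKEN = re.compile(r"<[^>]+>|[^<]+")


def tokenize(html):
    """Split HTML into tag tokens and non-empty text tokens."""
    return [t.strip() for t in TOKEN.findall(html) if t.strip()]


def generalize(wrapper, sample):
    """Align wrapper and sample; string mismatches become #PCDATA slots."""
    w_tokens, s_tokens = tokenize(wrapper), tokenize(sample)
    if len(w_tokens) != len(s_tokens):
        raise ValueError("tag mismatch: structure variance not handled here")
    template = []
    for w, s in zip(w_tokens, s_tokens):
        if w == s:
            template.append(w)          # identical token: part of template
        elif not w.startswith("<") and not s.startswith("<"):
            template.append("#PCDATA")  # differing text: content slot
        else:
            raise ValueError("tag mismatch: structure variance not handled here")
    return template


page_a = "<li><b>Title:</b> Data on the Web</li>"
page_b = "<li><b>Title:</b> Web Integration</li>"
template = generalize(page_a, page_b)
print(template)
```

The label "Title:" stays in the template because it is identical on both pages, while the differing book titles collapse into a single slot.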
55. RoadRunner
56. RoadRunner
the template
57. Technical Challenges
- 5. System Integration
- Putting things together?
58. Our system research often ends up with components in isolation [ChangHZ05]
59. System integration: Sample issues
[Figure: AA.com query form; Result of extraction]
- New challenges:
  - How will errors in automatic form extraction impact the subsequent schema matching?
- New opportunities:
  - Can the result of schema matching help to correct such errors?
  - e.g., (adults, children) together form a matching, then?
60. Current agenda: "Science" of system integration
- new challenge: error cascading
- new opportunity: result feedback
[Figure labels: Cascade, Feedback]
61. Finally, observations: Large scale is not only a challenge, but also an opportunity!
62. Observation 1: Large scale introduces New Problems!
- Several issues arise in the context.
- Evidences of new problems:
  - Source modeling & selection
  - Source querying, crawling, ranking
    - On-the-fly query translation
    - Object crawling, ranking
  - System integration
63. Observation 2: Large scale introduces New Semantics!
- Relaxed metrics are possible, even for the same problems.
- Evidences of new metrics:
  - Search-flavored integration: large scale but simplistic
    - Function: Simple queries
    - Source: Transparency no more the fundamental doctrine
    - User: In the loop of querying
    - Techniques: Automatic but error-likely
    - Results: Fuzzy, ranked
      - meta-querying: ranking of matching sources
      - vertical-search-engine: ranking of objects
64. Observation 3: Large scale introduces New Insights!
- The multitude of sources gives a holistic context for study.
- Evidences of new insights:
  - Schema matching: Many holistic approaches
  - Source modeling: Lego-based extraction
  - System integration: Holistic error correction/feedback
65. The Web Trio (My three circles...)
Search
Integration
Mining
66. Looking Forward
Recall the first time I heard about Google Base.
DB People: Buckle Up! Our time has finally come.
67. Thank You!
For more information: http://metaquerier.cs.uiuc.edu, kcchang@cs.uiuc.edu