Wensheng Wu1, Clement Yu2, AnHai Doan1, Weiyi Meng3

1
An Interactive Clustering-based Approach to Integrating Source Query Interfaces on the Deep Web
  • Wensheng Wu1, Clement Yu2, AnHai Doan1, Weiyi Meng3
  • 1 University of Illinois at Urbana-Champaign
  • 2 University of Illinois at Chicago
  • 3 SUNY at Binghamton
  • June 2004, Paris, France

2
Access Deep Web Sources
united.com
airtravel.com
delta.com
hotwire.com
3
Global Query Interface
united.com
airtravel.com
delta.com
hotwire.com
4
Constructing Global Query Interface
  • A unified query interface with these desired features:
    • Conciseness: combine semantically similar fields across source interfaces
    • Completeness: retain source-specific fields
    • User-friendliness: highly related fields are placed close together
  • Two-phase integration:
    • Interface matching: identify semantically similar fields
    • Interface integration: merge the source query interfaces

5
Interface Matching Challenges
  • Field A in one interface is semantically similar to field B in another interface, but the two have nothing in common.
  • When sim(A,B) = sim(A,C), which field should A match?
[Figure: an example of each case]
6
Interface Matching Challenges (Contd)
  • 1:m mappings
  • Determining the matching threshold
[Figure: an example of a 1:m mapping]
7
Existing Common Limitations
  • Limitation 1: non-hierarchical modeling
  • Limitation 2: do not handle 1:m mappings, or handle them with low accuracy
  • Limitation 3: do not allow limited user interactions
  • Detailed comparisons are given in the paper

8
The IceQ Approach [SIGMOD-04]
  • Hierarchical modeling
    • Let's get out of flat land
  • Greedy is good
    • Always start with the most confident matching
  • Bridging effect
    • a2 and c2 might not look similar themselves, but they might both be similar to b3
  • 1:m mappings
    • Aggregate and is-a types
  • User interaction helps in
    • Interactive learning of the matching threshold
    • Resolution of uncertain mappings

[Figure: two candidate matches with similarities 0.8 and 0.5, annotated "Pick this!" and "X"]
9
Hierarchical Modeling
  • A source query interface is modeled with an ordered tree representation
  • The tree captures the ordering and grouping of fields
10
Field Similarity Function
  • Each field may have a label, a name, and a set of values.
  • The similarity sim(A,B) between two fields A and B is evaluated based on:
    • Linguistic similarity: label similarity, name similarity, and name vs. label similarity, each measured by the cosine function
    • Domain similarity: domain type similarity and domain value similarity
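The field similarity described on this slide can be sketched in a few lines. This is only an illustration under stated assumptions: the `Field` class, the whitespace tokenization, the Jaccard overlap for domain values, the plain averaging of the three linguistic scores, and the weights `w_ling` and `w_dom` are all hypothetical choices, not taken from the slides or the paper.

```python
import math
from collections import Counter
from dataclasses import dataclass


@dataclass
class Field:
    # Hypothetical container for one interface field.
    label: str
    name: str
    dtype: str = "string"
    values: frozenset = frozenset()


def cosine(a: str, b: str) -> float:
    """Cosine similarity between the word-count vectors of two strings."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(v * v for v in va.values()) * sum(v * v for v in vb.values()))
    return dot / norm if norm else 0.0


def linguistic_sim(a: Field, b: Field) -> float:
    label = cosine(a.label, b.label)
    name = cosine(a.name, b.name)
    cross = max(cosine(a.name, b.label), cosine(a.label, b.name))
    return (label + name + cross) / 3  # assumption: unweighted average


def domain_sim(a: Field, b: Field) -> float:
    type_sim = 1.0 if a.dtype == b.dtype else 0.0
    union = a.values | b.values
    # Assumption: Jaccard overlap stands in for "domain value similarity".
    value_sim = len(a.values & b.values) / len(union) if union else 0.0
    return (type_sim + value_sim) / 2


def field_sim(a: Field, b: Field, w_ling: float = 0.6, w_dom: float = 0.4) -> float:
    # Assumption: a weighted sum combines the two components.
    return w_ling * linguistic_sim(a, b) + w_dom * domain_sim(a, b)


dep = Field("Departure city", "from city", "string", frozenset({"Chicago", "Paris"}))
orig = Field("Leaving from city", "origin", "string", frozenset({"Chicago", "Boston"}))
date = Field("Return date", "ret", "date")
print(field_sim(dep, orig), field_sim(dep, date))
```

The two city fields score above zero because they share label words, a type, and one value, while the city and date fields share nothing.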
11
Find 1:1 Mappings via Clustering
[Figure: the interfaces, the initial similarity matrix (threshold .3), the matrix after one merge, ..., and the final clusters a1,b1,c1, b2,c2,a2,b3]
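A greedy clustering of this kind might look like the sketch below. It repeatedly merges the most similar pair of clusters until no pair reaches the threshold. Several details are assumptions made for illustration: cluster similarity is taken as the maximum pairwise field similarity, two fields from the same interface are never placed in one cluster, the interface is read from the first character of the field name, and the similarity values are invented rather than copied from the slide.

```python
from itertools import combinations


def greedy_cluster(fields, sim, threshold=0.3):
    """Greedy agglomerative clustering: always merge the most confident pair first."""

    def pair_sim(x, y):
        return sim.get((x, y), sim.get((y, x), 0.0))

    def cluster_sim(c1, c2):
        # Assumption: a 1:1 cluster holds at most one field per interface.
        if {f[0] for f in c1} & {f[0] for f in c2}:
            return -1.0
        # Assumption: max (single-link) similarity between clusters.
        return max(pair_sim(x, y) for x in c1 for y in c2)

    clusters = [frozenset([f]) for f in fields]
    while len(clusters) > 1:
        best = max(combinations(clusters, 2), key=lambda p: cluster_sim(*p))
        if cluster_sim(*best) < threshold:
            break
        clusters = [c for c in clusters if c not in best] + [best[0] | best[1]]
    return clusters


# Invented similarities; unlisted pairs default to 0.
sim = {
    ("a1", "b1"): 0.9, ("b1", "c1"): 0.8, ("a1", "c1"): 0.7,
    ("a2", "b3"): 0.5, ("b3", "c2"): 0.4, ("a2", "c2"): 0.1,
}
fields = ["a1", "a2", "b1", "b2", "b3", "c1", "c2"]
result = greedy_cluster(fields, sim)
print(sorted(sorted(c) for c in result))
```

With these made-up numbers, a2 and c2 end up together even though their direct similarity (0.1) is below the threshold, because both are similar to b3.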
12
Bridging Effect
[Figure: fields A, B, and C, with an uncertain match between A and B]
Observations:
  • It is difficult to match the vehicle field, A, with the make field, B.
  • But A's instances are similar to C's, and C's label is similar to B's.
  • Thus, C might serve as a bridge to connect A and B!

13
Bridging Effect (Contd)
[Figure: query interfaces from airtravel.com, hotfares.com, and airtickets.com]
Connections might also be made via labels.
14
Field Ordering-based Tie Resolution
[Figure: fields A1 and A2 in one interface, fields B1 and B2 in another, with similarity edges labeled 0.35]
Question: sim(A1, B1) = sim(A1, B2); which one should A1 match?
Observation: the ordering of fields conveys semantics!
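One way to act on this observation is to break a tie in favor of the candidate whose position in its interface is closest to the position of the field being matched. The slides do not spell out the actual rule, so the sketch below is only a plausible heuristic; the function name `resolve_tie` and the use of zero-based positions are assumptions.

```python
def resolve_tie(field_pos, candidates):
    """Pick among equally similar candidates using field order.

    candidates: list of (name, position_in_its_interface, similarity).
    Assumption: the tied candidate closest in position wins.
    """
    best_sim = max(s for _, _, s in candidates)
    tied = [(name, pos) for name, pos, s in candidates if s == best_sim]
    return min(tied, key=lambda c: abs(c[1] - field_pos))[0]


candidates = [("B1", 0, 0.35), ("B2", 1, 0.35)]
print(resolve_tie(0, candidates))  # A1 is the first field of its interface
print(resolve_tie(1, candidates))  # A2 is the second field
```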
15
Complex Mappings
Aggregate type: the contents of the fields on the many side are parts of the content of the field on the one side.
Commonalities: (1) field proximity, (2) parent label similarity, and (3) value characteristics.
16
Complex Mappings (Contd)
Is-a type: the content of the field on the one side is the sum/union of the contents of the fields on the many side.
Commonalities: (1) field proximity, (2) parent label similarity, and (3) value characteristics.
17
Complex Mappings (Contd)
  • The final 1:m phase infers new mappings:
    • Preliminary 1:m phase: a1 ↔ (b1, b2)
    • Clustering phase: b1 ↔ c1, b2 ↔ c2
    • Final 1:m phase: a1 ↔ (c1, c2)
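The inference in this example amounts to composing a 1:m mapping with the 1:1 mappings found by clustering. A minimal sketch follows, assuming mappings are stored in plain dictionaries; the function name and data layout are hypothetical.

```python
def infer_one_to_many(one_to_many, one_to_one):
    """Compose known 1:m mappings with 1:1 mappings to infer new 1:m mappings."""
    inferred = {}
    for one, many in one_to_many.items():
        # Only infer when every field on the many side has a 1:1 counterpart.
        if all(m in one_to_one for m in many):
            inferred[one] = tuple(one_to_one[m] for m in many)
    return inferred


preliminary = {"a1": ("b1", "b2")}      # preliminary 1:m phase
clustering = {"b1": "c1", "b2": "c2"}   # clustering phase
print(infer_one_to_many(preliminary, clustering))
```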
18
Active Learning of Thresholds
  • Observation: in an ideal situation,
    • if field A matches some field X, then sim(A, X) ≥ T, where T is the matching threshold
    • if field A does not match any field, then max sim(A, C) < T over all fields C

Initial bounds B = [0, .4]; drop rule: 50%
List 1: .91 .8 .73 .62 .46 .2 .03
List 2: .62 .53 .5 .48 .46 .32 .1
List 3: .87 .82 .6 .53 .5 .33 .28
  • List 1: (1) question on .2, answer yes, update B = [0, .2], continue on List 1; (2) question on .03, answer no, update B = [.03, .2]
  • List 2: question on .1, answer yes, update B = [.03, .1]
  • List 3: no values within B
The threshold is set to any value between .03 and .1
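The walkthrough above can be reproduced with a short sketch. The reading of the drop rule used here is an assumption: a question is asked about a similarity value only if it lies strictly inside the current bounds B and is at least 50% lower than the preceding value in its sorted list. A "yes" lowers the upper bound to that value; a "no" raises the lower bound and ends the scan of that list. The function name and the `ask` callback, which stands in for the user, are hypothetical.

```python
def learn_threshold_bounds(lists, ask, bounds=(0.0, 0.4), drop=0.5):
    """Narrow the interval that must contain the matching threshold."""
    lo, hi = bounds
    for sims in lists:
        ordered = sorted(sims, reverse=True)
        for prev, cur in zip(ordered, ordered[1:]):
            in_bounds = lo < cur < hi
            big_drop = prev > 0 and (prev - cur) / prev >= drop
            if not (in_bounds and big_drop):
                continue
            if ask(cur):      # user confirms a match at this similarity
                hi = cur
            else:             # user rejects: lower values will not match either
                lo = cur
                break
    return lo, hi


list1 = [.91, .8, .73, .62, .46, .2, .03]
list2 = [.62, .53, .5, .48, .46, .32, .1]
list3 = [.87, .82, .6, .53, .5, .33, .28]

answers = {.2: True, .03: False, .1: True}  # simulated user answers
asked = []


def ask(value):
    asked.append(value)
    return answers[value]


final_bounds = learn_threshold_bounds([list1, list2, list3], ask)
print(final_bounds, asked)
```

Under these assumptions the sketch asks three questions (on .2, .03, and .1) and ends with the bounds .03 and .1, in line with the walkthrough.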
19
Interactive Resolution of Uncertain Mappings
  • Resolve potential homonyms
    • Observation: two fields are possible homonyms if their labels are highly similar while their domains are not.
  • Determine potential synonyms
    • Observation: two fields might still be similar if their domains share common values, even if their label/domain similarities are low.
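The two observations can be turned into a simple check that decides when to put a question to the user. The cut-offs `high` and `low` below are made-up values for illustration, and the function name is hypothetical; the slides do not say how "highly similar" or "low" are quantified.

```python
def classify_uncertain(label_sim, domain_sim, shared_values, high=0.8, low=0.3):
    """Flag a field pair that may deserve a question to the user."""
    if label_sim >= high and domain_sim <= low:
        return "possible homonym"   # labels agree, domains do not
    if label_sim <= low and domain_sim <= low and shared_values:
        return "possible synonym"   # weak scores, but some values in common
    return None


print(classify_uncertain(0.9, 0.1, shared_values=set()))
print(classify_uncertain(0.1, 0.2, shared_values={"Chicago"}))
print(classify_uncertain(0.9, 0.9, shared_values={"Chicago"}))
```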
20
Interactive Resolution of Uncertain Mappings
  • Determine potential 1:m mappings
    • Observation: A might still match both B and C if (a) sim(A,B) is very close to sim(A,C), (b) B and C are adjacent, and (c) A is the only field in its interface that satisfies (a) and (b).
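Conditions (a) to (c) can be checked mechanically, as in the sketch below. The tolerance `eps` for "very close", the floor `min_sim` that keeps two near-zero similarities from counting as close, the field names, and the similarity values are all assumptions introduced for illustration.

```python
from collections import defaultdict
from itertools import combinations


def potential_one_to_many(sims, positions, eps=0.05, min_sim=0.1):
    """Return (A, (B, C)) candidates for a user question.

    sims: {field_in_one_interface: {field_in_other_interface: similarity}}
    positions: {field_in_other_interface: index within that interface}
    """
    by_pair = defaultdict(list)
    for a, row in sims.items():
        for b, c in combinations(sorted(row), 2):
            close = abs(row[b] - row[c]) <= eps and min(row[b], row[c]) >= min_sim  # (a)
            adjacent = abs(positions[b] - positions[c]) == 1                        # (b)
            if close and adjacent:
                by_pair[(b, c)].append(a)
    # (c): keep a pair only if exactly one field satisfies (a) and (b) for it.
    return [(owners[0], pair) for pair, owners in by_pair.items() if len(owners) == 1]


sims = {
    "passengers": {"adults": 0.40, "children": 0.38, "class": 0.05},
    "cabin": {"adults": 0.00, "children": 0.02, "class": 0.60},
}
positions = {"adults": 0, "children": 1, "class": 2}
candidates_1m = potential_one_to_many(sims, positions)
print(candidates_1m)
```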
21
Empirical Evaluations
Accuracy with all user interactions
Accuracy with learned thresholds
Automatic field matching
Distribution of questions
22
Comparison of Component Contributions
[Figure: contribution of each component; values 7.3 and 15.4 are shown]
On average, a 12.6% increase in recall
23
Summary
  • High accuracy of determining matching fields
    across multiple user interfaces
  • Limited use of user interactions

24
Future Research
  • Further improve the accuracy of determining matching fields
  • Decrease the number of user interactions
  • Produce a unified, user-friendly interface
  • Provide such a tool on the Web