Integration of Heterogeneous Databases without Common Domains Using Queries Based on Textual Similarity - PowerPoint PPT Presentation

Transcript and Presenter's Notes

1
Integration of Heterogeneous Databases without
Common Domains Using Queries Based on Textual
Similarity
Embodied Cognition and Knowledge
  • William W. Cohen
  • Machine Learning Dept. and Language Technologies
    Inst.
  • School of Computer Science
  • Carnegie Mellon University

2
What was that paper, and who is this guy talking?
Machine Learning
Human languages: NLP, IR
Representation languages: DBs, KR
WHIRL: Word-Based Heterogeneous Information
Representation Language
3
History
  • 1982/1984: Ehud Shapiro's thesis
  • MIS: learning logic programs as debugging an
    empty Prolog program
  • Thesis contained 17 figures and a 25-page
    appendix that were a full implementation of MIS
    in Prolog
  • Incredibly elegant work
  • "Computer science has a great advantage over
    other experimental sciences: the world we
    investigate is, to a large extent, our own
    creation, and we are the ones to determine if it
    is simple or messy."

(Timeline graphic: 1982 to 2018)
4
History
  • Grad school in AI at Rutgers
  • MTS at AT&T Bell Labs in a group doing KR, DB,
    learning, information retrieval, ...
  • My work: learning logical (description-logic-like,
    Prolog-like, rule-based) representations that
    model large, noisy real-world datasets.

5
History
  • AT&T Bell Labs becomes AT&T Labs Research
  • The web takes off
  • as predicted by Vinge and Gibson
  • IR folks start looking at retrieval and
    question-answering with the Web
  • Alon Halevy starts the Information Manifold
    project to integrate data on the web
  • VLDB 2006 10-year Best Paper Award for 1996 paper
    on IM
  • I started thinking about the same problem in a
    different way.

6
History: WHIRL motivation 1
  • As the world of computer science gets richer and
    more complex, computer science can no longer
    limit itself to studying our own creation.
  • Tension exists between:
  • Elegant theories of representation
  • The not-so-elegant real world that is being
    represented

7
History: WHIRL motivation 1
  • The beauty of the real world is its complexity.

8
History: integration by mediation
  • Mediator translates between the knowledge in
    multiple separate KBs
  • Each KB is a separate symbol system
  • No formal connection between them except via the
    mediator

9
WHIRL idea: exploit linguistic properties of the
HTML veneer of web-accessible DBs
TFIDF similarity
WHIRL Motivation 2: Web KBs are embodied
10
R.a      S.a     S.b     T.b
Strongest links (those agreeable to most users):
Anhai    Anhai   Doan    Doan
Dan      Dan     Weld    Weld
Weaker links (those agreeable to some users):
William  Will    Cohen   Cohn
Steve    Steven  Minton  Mitton
Even weaker links:
William  David   Cohen   Cohn
11
WHIRL approach
Link items as needed by Q
Incrementally produce a ranked list of possible
links, with best matches first. User (or
downstream process) decides how much of the list
to generate and examine.
R.a      S.a     S.b     T.b
Anhai    Anhai   Doan    Doan
Dan      Dan     Weld    Weld
William  Will    Cohen   Cohn
Steve    Steven  Minton  Mitton
William  David   Cohen   Cohn
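
The "ranked list, best matches first, consumer decides when to stop" idea can be sketched as a lazy generator. This is only an illustration: `similarity` uses `difflib.SequenceMatcher` as a stand-in (WHIRL itself uses TFIDF), the function names are hypothetical, and this sketch scores every pair up front, so only the ordering is lazy.

```python
import heapq
from difflib import SequenceMatcher
from itertools import islice, product


def similarity(a: str, b: str) -> float:
    """Stand-in string similarity in [0, 1]; an assumption, not WHIRL's TFIDF."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def ranked_links(left, right):
    """Yield (score, a, b) candidate links lazily, best matches first."""
    # Negated scores turn Python's min-heap into a max-heap.
    heap = [(-similarity(a, b), a, b) for a, b in product(left, right)]
    heapq.heapify(heap)
    while heap:
        neg_score, a, b = heapq.heappop(heap)
        yield -neg_score, a, b


# Column values taken from the R.a / S.a example.
R_a = ["Anhai", "Dan", "William", "Steve"]
S_a = ["Anhai", "Dan", "Will", "Steven", "David"]

# The consumer decides how much of the ranked list to examine.
top_links = list(islice(ranked_links(R_a, S_a), 4))
for score, a, b in top_links:
    print(f"{score:.2f}  {a} ~ {b}")
```

Because the caller pulls items one at a time, a user or downstream process can stop as soon as the links become too weak to be useful.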
12
(No Transcript)
13
WHIRL queries
  • Assume two relations:
  • review(movieTitle, reviewText): archive of reviews
  • listing(theatre, movieTitle, showTimes, ...): now
    showing

review:
movieTitle                                  | reviewText
The Hitchhiker's Guide to the Galaxy, 2005  | This is a faithful re-creation of the original radio series, not surprisingly, as Adams wrote the screenplay ...
Men in Black, 1997                          | Will Smith does an excellent job in this ...
Space Balls, 1987                           | Only a die-hard Mel Brooks fan could claim to enjoy ...

listing:
movieTitle             | theatre              | showTimes
Star Wars Episode III  | The Senator Theater  | 1:00, 4:15, 7:30pm
Cinderella Man         | The Rotunda Cinema   | 1:00, 4:30, 7:30pm

14
WHIRL queries
  • Find reviews of sci-fi comedies [movie domain]
  • FROM review AS r SELECT * WHERE r.text ~ "sci fi comedy"
  • (like standard ranked retrieval of "sci-fi
    comedy")
  • Where is that sci-fi comedy playing?
  • FROM review AS r, listing AS s SELECT *
    WHERE r.title ~ s.title AND r.text ~ "sci fi comedy"
  • (best answers: titles are similar to each other,
    e.g., "Hitchhiker's Guide to the Galaxy" and "The
    Hitchhiker's Guide to the Galaxy, 2005", and the
    review text is similar to "sci-fi comedy")
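
A rough sketch of how the second query could be evaluated is given here. It assumes each `~` condition is a TFIDF cosine similarity and that a conjunction is scored by multiplying the similarities of its conditions. The TFIDF weighting, the function names, the review texts, and the theatre assigned to the Hitchhiker's listing are all illustrative assumptions, not data or code from the talk.

```python
import math
import re
from collections import Counter


def tokens(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Unit-length TFIDF vectors (dicts) for strings sharing one column."""
    tokenized = [tokens(d) for d in docs]
    df = Counter(t for toks in tokenized for t in set(toks))
    n = len(docs)
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        # Smoothed weighting chosen for this sketch only.
        vec = {t: (1 + math.log(c)) * math.log(1 + n / df[t]) for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors


def cosine(u, v):
    return sum(w * v.get(t, 0.0) for t, w in u.items())


def where_is_it_playing(query, reviews, listings):
    """Rank (review, listing) pairs by title similarity times text similarity."""
    titles = tfidf_vectors([r[0] for r in reviews] + [s[0] for s in listings])
    r_titles, s_titles = titles[:len(reviews)], titles[len(reviews):]
    texts = tfidf_vectors([r[1] for r in reviews] + [query])
    r_texts, q_vec = texts[:-1], texts[-1]
    answers = []
    for i, r in enumerate(reviews):
        text_score = cosine(r_texts[i], q_vec)
        for j, s in enumerate(listings):
            score = cosine(r_titles[i], s_titles[j]) * text_score
            if score > 0:
                answers.append((score, r[0], s[0], s[1]))
    return sorted(answers, reverse=True)


reviews = [  # (title, text); the review texts are invented toy data
    ("The Hitchhiker's Guide to the Galaxy, 2005", "A faithful sci fi comedy based on the radio series."),
    ("Men in Black, 1997", "Will Smith does an excellent job in this alien caper."),
    ("Space Balls, 1987", "A Mel Brooks comedy for die-hard fans."),
]
listings = [  # (title, theatre); the Hitchhiker's theatre is hypothetical
    ("Star Wars Episode III", "The Senator Theater"),
    ("Hitchhiker's Guide to the Galaxy", "The Rotunda Cinema"),
    ("Cinderella Man", "The Rotunda Cinema"),
]

answers = where_is_it_playing("sci fi comedy", reviews, listings)
for score, review_title, listing_title, theatre in answers:
    print(f"{score:.3f}  {listing_title} at {theatre}")
```

With this toy data only the Hitchhiker's pair scores above zero: the Space Balls review mentions "comedy", but its title matches no listing, so the product is zero.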

15
WHIRL queries
  • Similarity is based on TFIDF: rare words are most
    important.
  • Search for high-ranking answers uses inverted
    indices.

  • It is easy to find the (few) items that match
    on important terms
  • Search for strong matches can prune unimportant
    terms

Review titles:
The Hitchhiker's Guide to the Galaxy, 2005
Men in Black, 1997
Space Balls, 1987

Listing titles:
Star Wars Episode III
Hitchhiker's Guide to the Galaxy
Cinderella Man

Inverted index entries:
hitchhiker: movie00137
the: movie001, movie003, movie007, movie008, movie013, movie018, movie023, movie0031, ..
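
The role of the inverted index can be illustrated with a small sketch that looks up only the rarest query terms, so a common word such as "the" never needs its long posting list scanned. This is a simplification under stated assumptions, not WHIRL's actual search procedure; the document ids, the fourth title, and the `max_terms` cutoff are hypothetical.

```python
import math
import re
from collections import defaultdict


def tokens(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each term to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokens(text):
            index[term].add(doc_id)
    return index


def candidates(query, index, n_docs, max_terms=2):
    """Score documents using only the query's rarest (highest-IDF) terms."""
    idf = {t: math.log(n_docs / len(index[t])) for t in set(tokens(query)) if t in index}
    rare_first = sorted(idf, key=idf.get, reverse=True)[:max_terms]
    scores = defaultdict(float)
    for term in rare_first:
        for doc_id in index[term]:  # short posting lists for rare terms
            scores[doc_id] += idf[term]
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


titles = {  # ids are hypothetical; movie004 is an invented extra title
    "movie001": "The Hitchhiker's Guide to the Galaxy, 2005",
    "movie002": "Men in Black, 1997",
    "movie003": "Space Balls, 1987",
    "movie004": "The Incredibles, 2004",
}
index = build_index(titles)
query = "Hitchhiker's Guide to the Galaxy"

pruned = candidates(query, index, len(titles))                 # rarest two terms only
full = candidates(query, index, len(titles), max_terms=10)     # all query terms
print(pruned)
print(full)
```

With pruning, only the one title sharing a rare term is touched; using every term also pulls in the title that matches on "the" alone, at a much lower score.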

16
After WHIRL
  • Efficient text joins
  • On-the-fly, best-effort, imprecise integration
  • Interactions between information extraction
    quality and results of queries on extracted data
  • Keyword search on databases
  • Use of statistics on text corpora to build
    intelligent embodied systems
  • Turney: solving SAT analogies with PMI over word
    pairs
  • Mitchell & Just: predicting fMRI brain images
    resulting from reading a common noun (hammer)
    from co-occurrence information between nouns and
    verbs

17
Recent work: non-textual similarity
(Graph figure. String nodes: "Christos Faloutsos, CMU", "William W. Cohen, CMU", "Dr. W. W. Cohen", "George H. W. Bush", "George W. Bush". Term nodes: cohen, cmu, william, w, dr.)
18
Recent Work
  • Personalized PageRank, aka Random Walk with
    Restart
  • Similarity measure for nodes in a graph,
    analogous to TFIDF for text in a WHIRL database
  • Natural extension to PageRank
  • Amenable to learning parameters of the walk
    (gradient search, w/ various optimization
    metrics)
  • Toutanova, Manning & Ng, ICML 2004; Nie et al.,
    WWW 2005; Xi et al., SIGIR 2005
  • Various speedup techniques exist
  • Queries:
  • Given type t and node x, find y such that
    T(y) = t and y ~ x
19
Learning to Search Email
Einat Minkov, CMU; Andrew Ng, Stanford
SIGIR 2006, CEAS 2006, WebKDD/SNA 2007
CALO
(Graph figure. Edge labels: Term In Subject, Sent To. Nodes: William, graph, proposal, CMU, 6/17/07, 6/18/07, einat_at_cs.cmu.edu.)
20
Tasks that are like similarity queries
Person name disambiguation
  Query: term andy, file msgId; answer type: person
Threading
  Query: file msgId; answer type: file
  • What are the adjacent messages in this thread?
  • A proxy for finding more messages like this one
Alias finding
  What are the email-addresses of Jason ?...
  Query: term Jason; answer type: email-address
Meeting attendees finder
  Which email-addresses (persons) should I notify
  about this meeting?
  Query: meeting mtgId; answer type: email-address
21
Results on one task
Mgmt. game
PERSON NAME DISAMBIGUATION
22
Results on several tasks (MAP)
(Chart panels: Name disambiguation, Threading, Alias finding)
23
Set Expansion using the Web
  1. Canon
  2. Nikon
  3. Olympus

Richard Wang, CMU
  1. Pentax
  2. Sony
  3. Kodak
  4. Minolta
  5. Panasonic
  6. Casio
  7. Leica
  8. Fuji
  9. Samsung
  • Fetcher: download web pages from the Web
  • Extractor: learn wrappers from web pages
  • Ranker: rank entities extracted by wrappers

24
The Extractor
  • Learn wrappers from web documents and seeds on
    the fly
  • Utilize semi-structured documents
  • Wrappers defined at character level
  • No tokenization required, thus language-
    independent
  • However, very specific, thus page-dependent
  • A wrapper derived from document d is applied to d
    only
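
A simplified sketch of a character-level wrapper is given here: the wrapper is taken to be the longest left and right character contexts shared by the seeds in one page, and it is applied only to that same page. Looking at just the first occurrence of each seed, the `max_context` limit, the function names, and the sample HTML are assumptions for illustration, not the extractor described in the talk.

```python
import re
from os.path import commonprefix


def learn_wrapper(document, seeds, max_context=30):
    """Return (left, right) contexts shared by every seed, or None."""
    lefts, rights = [], []
    for seed in seeds:
        pos = document.find(seed)  # simplification: first occurrence only
        if pos < 0:
            return None  # every seed must occur in the document
        end = pos + len(seed)
        lefts.append(document[max(0, pos - max_context):pos][::-1])  # reversed
        rights.append(document[end:end + max_context])
    left = commonprefix(lefts)[::-1]  # longest common suffix of left contexts
    right = commonprefix(rights)      # longest common prefix of right contexts
    if not left or not right:
        return None
    return left, right


def apply_wrapper(document, wrapper):
    """Extract every string bracketed by the wrapper's contexts."""
    left, right = wrapper
    pattern = re.escape(left) + r"(.+?)" + re.escape(right)
    return re.findall(pattern, document)


# Invented page; only the seed names come from the slides.
page = (
    '<ul>'
    '<li class="make"><a href="/ford">ford</a></li>'
    '<li class="make"><a href="/nissan">nissan</a></li>'
    '<li class="make"><a href="/toyota">toyota</a></li>'
    '<li class="make"><a href="/honda">honda</a></li>'
    '<li class="make"><a href="/acura">acura</a></li>'
    '</ul>'
)
wrapper = learn_wrapper(page, ["ford", "nissan", "toyota"])
extracted = apply_wrapper(page, wrapper)
print(wrapper)
print(extracted)
```

No tokenizer is involved at any point, which is why the same code would work on pages in any language, and the learned contexts are so specific to this page's markup that reusing them elsewhere would rarely match.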

25
(No Transcript)
26
Ranking Extractions
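(Graph figure. Seeds node: "ford, nissan, toyota". Document nodes: northpointcars.com, curryauto.com. Wrapper nodes: Wrapper 1, Wrapper 2, Wrapper 3, Wrapper 4. Mention nodes with values: acura 34.6, honda 26.1, chevrolet 22.5, volvo chicago 8.4, bmw pittsburgh 8.4. Edge labels: find, derive, extract.)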
  • A graph consists of a fixed set of:
  • Node types: seeds, document, wrapper, mention
  • Labeled directed edges: find, derive, extract
  • Each edge asserts that a binary relation r holds
  • Each edge has an inverse relation r^-1 (graph is
    cyclic)

Minkov et al. Contextual Search and Name
Disambiguation in Email using Graphs. SIGIR 2006
27
Evaluation Method
  • Mean Average Precision
  • Commonly used for evaluating ranked lists in IR
  • Contains recall and precision-oriented aspects
  • Sensitive to the entire ranking
  • Mean of average precisions for each ranked list

Prec(r): precision at rank r
A rank r contributes to the average precision of a list only when
(a) the extracted mention at r matches any true mention, and
(b) no other extracted mention at a rank less than r is of the same
entity as the one at r
where L is the ranked list of extracted mentions and r is a rank
True Entities: total number of true entities in this dataset
  • Evaluation: average over 36 datasets in three
    languages (Chinese, Japanese, English)
  • Average over several 2- or 3-seed queries for
    each dataset.
  • MAP performance: high 80s to mid 90s
  • Google Sets: MAP in 40s, only English
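
One way to compute this metric is sketched below, under stated assumptions: the average precision of a list sums Prec(r) over the ranks that satisfy conditions (a) and (b) and divides by the number of true entities, and Prec(r) counts every mention that matches a true mention as correct. The exact formula did not survive in the transcript, so this reading, the function names, and the camera-brand data are illustrative only.

```python
def average_precision(ranked, entity_of):
    """Average precision of one ranked list of extracted mentions.

    entity_of maps each true mention string to its entity id, so several
    mentions (e.g. spelling variants) can refer to the same entity.
    """
    n_true = len(set(entity_of.values()))
    seen, matches, total = set(), 0, 0.0
    for r, mention in enumerate(ranked, start=1):
        entity = entity_of.get(mention)
        if entity is None:
            continue  # condition (a) fails: not a true mention
        matches += 1
        if entity not in seen:  # condition (b): entity not seen at a better rank
            seen.add(entity)
            total += matches / r  # Prec(r) under this sketch's assumption
    return total / n_true


def mean_average_precision(ranked_lists, entity_of):
    """Mean of the average precisions of several ranked lists."""
    return sum(average_precision(L, entity_of) for L in ranked_lists) / len(ranked_lists)


# Invented ground truth: "fuji" and "fujifilm" name the same entity.
entity_of = {
    "canon": "canon",
    "nikon": "nikon",
    "olympus": "olympus",
    "fuji": "fujifilm",
    "fujifilm": "fujifilm",
}
list_1 = ["canon", "nikon", "ebay", "fuji", "fujifilm", "olympus"]
list_2 = ["ebay", "canon"]

ap_1 = average_precision(list_1, entity_of)
map_score = mean_average_precision([list_1, list_2], entity_of)
print(round(ap_1, 4), round(map_score, 4))
```

The duplicate mention "fujifilm" adds no new term to the sum, and dividing by the number of true entities penalizes lists that miss entities, which is how the measure combines precision-oriented and recall-oriented aspects.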
28
Evaluation Datasets
29
Top three mentions are the seeds
Try it out at http://rcwang.com/seal
30
Relational Set Expansion
Seeds
31
Future?
Machine Learning
?
Human languages: NLP, IR
Representation languages: DBs, KR