Title: A Framework for Learning to Query Heterogeneous Data
1. A Framework for Learning to Query Heterogeneous Data
- William W. Cohen
- Machine Learning Department and Language
Technologies Institute - School of Computer Science
- Carnegie Mellon University
- joint work with
- Einat Minkov, Andrew Ng, Richard Wang, Anthony
Tomasic, Bob Frederking
2. Outline
- Two views on data quality
- Cleaning your data vs living with the mess.
- A lazy/Bayesian view of data cleaning
- A framework for querying dirty data
- Data model
- Query language
- Baseline results (biotext and email)
- How to improve results with learning
- Learning to re-rank query output
- Conclusions
7. A Bayesian Looks at Record Linkage
- Record linkage problem: given two sets of records A = {a1, ..., am} and B = {b1, ..., bn}, determine when referent(ai) = referent(bj)
- Idea: compute Pr(referent(ai) = referent(bj)) for each (ai, bj) pair
- Pick two thresholds:
  - Pr(a = b) > HI → accept pairing
  - Pr(a = b) < LO → reject pairing
  - otherwise, clerical review by a human clerk
- Every optimal decision boundary is defined by a threshold on the ranked list.
- Thresholds depend on the prior probability of a and b matching.

A    B    Pr(A = B)
A17  B22  0.99
A43  B07  0.98
A21  B13  0.85
A37  B44  0.82
A84  B03  0.79
A83  B71  0.63
A24  B52  0.25
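The accept / review / reject rule above can be sketched in a few lines. The pair scores come from the slide's table; the HI and LO values here are illustrative assumptions (in practice they depend on the prior probability of a match).

```python
# Sketch of the Fellegi-Sunter-style triage rule: accept pairs above HI,
# reject pairs below LO, and send the rest to a human clerk for review.
HI, LO = 0.90, 0.50   # illustrative thresholds, not from the talk

pairs = [("A17", "B22", 0.99), ("A43", "B07", 0.98),
         ("A21", "B13", 0.85), ("A37", "B44", 0.82),
         ("A84", "B03", 0.79), ("A83", "B71", 0.63),
         ("A24", "B52", 0.25)]

def triage(pairs, hi, lo):
    """Split scored pairs into accepted links, clerical review, and rejects."""
    accept = [(a, b) for a, b, p in pairs if p > hi]
    review = [(a, b) for a, b, p in pairs if lo <= p <= hi]
    reject = [(a, b) for a, b, p in pairs if p < lo]
    return accept, review, reject

accept, review, reject = triage(pairs, HI, LO)
```

Only the two near-certain pairs are accepted automatically; the middle of the ranked list is exactly the clerical-review zone the slides discuss.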
8. A Bayesian Looks at Record Linkage
- Every optimal decision boundary is defined by a threshold on the ranked list.
- There are 2^nm ways to link.

A    B    Pr(A = B)
A17  B22  0.99
A43  B07  0.98
A16  B23  0.91
A21  B13  0.85
A37  B44  0.82
A84  B03  0.79
A83  B71  0.63
A91  B21  0.46
A24  B52  0.25

- In other words:
  - 2^nm - nm linkages can be discarded as impossible
  - of the remaining nm, all but those scoring between LO and HI can be discarded as improbable
[Figure: the nm candidate pairs, each labeled M (match) or U (unmatch)]
- But wait: why doesn't the human clerk just pick a threshold between LO and HI?
9. A Bayesian Looks at Record Linkage

A    B    Pr(A = B)
A17  B22  0.99
A43  B07  0.98
A32  B72  0.91
A21  B13  0.85
A37  B44  0.82
A84  B03  0.79
A83  B71  0.63
A21  B43  0.46
A24  B52  0.25
[Figure: the true M (match) / U (unmatch) labels for the ranked pairs]
10. Linking multiple relations: database hardening
Database S1 (extracted from paper 1's title page)
Database S2 (extracted from paper 2's bibliography)
11. Using multiple relations: database hardening
So this gives some known matches, which might interact with proposed matches; e.g., here we deduce...
12. Soft database from IE
Hard database, suitable for Oracle, MySQL, etc.
13. Using multiple relations: database hardening
- (McAllester et al., KDD 2000) defined hardening:
  - Find an interpretation (a map variant → name) that produces a compact version of the soft database S.
- Probabilistic interpretation of hardening:
  - The original soft data S is a version of latent hard data H.
  - Hardening finds the maximum-likelihood H.
- Hardening is hard!
  - Optimal hardening is NP-hard.
  - Greedy algorithm:
    - a naive implementation is quadratic in the size of S
    - clever data structures make it O(n log n) in the size of S
- Other related work:
  - Pasula et al., NIPS 2002: a more explicit generative Bayesian formulation and MCMC method, with experimental support
  - Wellner & McCallum 2004; Parag & Domingos 2004; Culotta & McCallum 2005; ...
14. A Bayesian Looks at Record Linkage

A    B    Pr(A = B)
A17  B22  0.99
A43  B07  0.98
A32  B72  0.91
A21  B13  0.85
A37  B44  0.82
A84  B03  0.79
A83  B71  0.63
A21  B43  0.46
A24  B52  0.25

- An alternate view of the process:
  - The Fellegi-Sunter method answers the question directly for the cases that everyone would agree on.
  - Human effort is used to answer the cases that are a little harder.
[Figure: M (match) / U (unmatch) labels for the ranked pairs]
15. A Bayesian Looks at Record Linkage

A    B    Pr(A = B)
A17  B22  0.99
A43  B07  0.98
A32  B72  0.91
A21  B13  0.85
A37  B44  0.82
A84  B03  0.79
A83  B71  0.63
A21  B43  0.46
A24  B52  0.25

- An alternate view of the process:
  - The Fellegi-Sunter method answers the question directly for the cases that everyone would agree on.
  - Human effort is used to answer the cases that are a little harder.
[Figure: M/U labels for the top-ranked pairs]
Q: is A43 in B?  A: yes (p = 0.98)
Q: is A83 in B?  A: not clear
Q: is A21 in B?  A: unlikely
16. Passing linkage decisions along to the user
- Usual goal: link records and create a single, highly accurate database for users to query.
- Equality is often uncertain, given the available information about an entity:
  - name = "T. Kennedy", occupation = "terrorist"
- The interpretation of equality may change from user to user and application to application:
  - Does Boston Market = McDonald's?
- Alternate goal: wait for a query, then answer it, propagating uncertainty about the linkage decisions on that query to the end user.
17. WHIRL project (1997-2000)
- WHIRL initiated while at AT&T Bell Labs
AT&T Research
AT&T Labs - Research
AT&T Research
AT&T Labs
AT&T Research - Shannon Laboratory
AT&T Shannon Labs
18. When are two entities the same?
- Bell Labs
- Bell Telephone Labs
- AT&T Bell Labs
- AT&T Labs
- AT&T Labs - Research
- AT&T Labs Research, Shannon Laboratory
- Shannon Labs
- Bell Labs Innovations
- Lucent Technologies/Bell Labs Innovations

1925
"History of Innovation: From 1925 to today, AT&T has attracted some of the world's greatest scientists, engineers and developers." (www.research.att.com)
"Bell Labs Facts: Bell Laboratories, the research and development arm of Lucent Technologies, has been operating continuously since 1925." (bell-labs.com)
19. When are two entities the same?
Buddhism rejects the key element in folk
psychology the idea of a self (a unified
personal identity that is continuous through
time) King Milinda and Nagasena (the Buddhist
sage) discuss personal identity Milinda
gradually realizes that "Nagasena" (the word)
does not stand for anything he can point to
not the hairs on Nagasena's head, nor the hairs
of the body, nor the "nails, teeth, skin,
muscles, sinews, bones, marrow, kidneys, ..."
etc Milinda concludes that "Nagasena" doesn't
stand for anything If we can't say what a person
is, then how do we know a person is the same
person through time? There's really no you,
and if there's no you, there are no beliefs or
desires for you to have The folk psychology
picture is profoundly misleading and believing it
will make you miserable. -S. LaFave
20. Traditional approach
[Diagram: Linkage first, then Queries]
Uncertainty about what to link must be decided by the integration system, not the end user.
21. WHIRL vision

R.a      S.a     S.b     T.b
Anhai    Anhai   Doan    Doan
Dan      Dan     Weld    Weld
  (Strongest links: those agreeable to most users)
William  Will    Cohen   Cohn
Steve    Steven  Minton  Mitton
  (Weaker links: those agreeable to some users)
William  David   Cohen   Cohn
  (...even weaker links)
22. WHIRL vision
DB1 + DB2 → DB
Link items as needed by the query Q
R.a S.a S.b T.b
Anhai Anhai Doan Doan
Dan Dan Weld Weld
Incrementally produce a ranked list of possible
links, with best matches first. User (or
downstream process) decides how much of the list
to generate and examine.
William Will Cohen Cohn
Steve Steven Minton Mitton
William David Cohen Cohn
23. WHIRL queries
- Assume two relations:
  - review(movieTitle, reviewText): archive of reviews
  - listing(theatre, movieTitle, showTimes, ...): now showing

"The Hitchhiker's Guide to the Galaxy, 2005" | "This is a faithful re-creation of the original radio series; not surprisingly, as Adams wrote the screenplay..."
"Men in Black, 1997" | "Will Smith does an excellent job in this..."
"Space Balls, 1987" | "Only a die-hard Mel Brooks fan could claim to enjoy..."

"Star Wars Episode III" | The Senator Theater | 1:00, 4:15, 7:30pm
"Cinderella Man" | The Rotunda Cinema | 1:00, 4:30, 7:30pm
24. WHIRL queries
- Find reviews of sci-fi comedies (movie domain):
  - FROM review as r SELECT * WHERE r.text ~ 'sci fi comedy'
  - (like standard ranked retrieval for the query "sci-fi comedy")
- Where is that sci-fi comedy playing?
  - FROM review as r, listing as s SELECT * WHERE r.title ~ s.title AND r.text ~ 'sci fi comedy'
  - (best answers: titles that are similar to each other, e.g., "Hitchhiker's Guide to the Galaxy" and "The Hitchhiker's Guide to the Galaxy, 2005", and review text similar to "sci fi comedy")
25. WHIRL queries
- Similarity is based on TF-IDF → rare words are most important.
- The search for high-ranking answers uses inverted indices.

The Hitchhiker's Guide to the Galaxy, 2005
Men in Black, 1997
Space Balls, 1987
Star Wars Episode III
Hitchhiker's Guide to the Galaxy
Cinderella Man
26. WHIRL queries
- Similarity is based on TF-IDF → rare words are most important.
- The search for high-ranking answers uses inverted indices.
  - It is easy to find the (few) items that match on important terms.
  - The search for strong matches can prune unimportant terms.

The Hitchhiker's Guide to the Galaxy, 2005
Men in Black, 1997
Space Balls, 1987
Star Wars Episode III
Hitchhiker's Guide to the Galaxy
Cinderella Man

Inverted index:
hitchhiker → movie001, movie037
the → movie001, movie003, movie007, movie008, movie013, movie018, movie023, movie031, ...
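The TF-IDF matching just described can be sketched as follows. This is a simplified toy, not WHIRL's actual scoring: the titles are from the slide, while the weighting formula and document ids are illustrative assumptions.

```python
import math
from collections import defaultdict

# Toy sketch of TF-IDF title matching with an inverted index: rare terms
# ("hitchhiker") dominate the cosine score, and the index finds candidate
# matches without scanning every title.
titles = ["the hitchhikers guide to the galaxy 2005",
          "men in black 1997",
          "space balls 1987",
          "star wars episode iii",
          "hitchhikers guide to the galaxy",
          "cinderella man"]

docs = [t.split() for t in titles]
N = len(docs)
df = defaultdict(int)                 # document frequency per term
for d in docs:
    for w in set(d):
        df[w] += 1

def tfidf(doc):
    """Unit-normalized TF-IDF vector (simplified weighting)."""
    v = defaultdict(float)
    for w in doc:
        v[w] += math.log(1 + N / df.get(w, 1))
    norm = math.sqrt(sum(x * x for x in v.values()))
    return {w: x / norm for w, x in v.items()}

index = defaultdict(set)              # inverted index: term -> doc ids
for i, d in enumerate(docs):
    for w in set(d):
        index[w].add(i)

def best_match(query_title):
    q = tfidf(query_title.split())
    candidates = set().union(*(index.get(w, set()) for w in q))
    scores = {i: sum(q.get(w, 0.0) * wt for w, wt in tfidf(docs[i]).items())
              for i in candidates}
    return max(scores, key=scores.get)

match = best_match("hitchhikers guide to the galaxy")
```

The inverted index restricts scoring to titles sharing at least one query term, which is why pruning unimportant terms makes the search for strong matches cheap.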
27. WHIRL results
- This sort of worked:
- Interactive speeds (<0.3 s/query) with a few hundred thousand tuples.
- For 2-way joins, average precision (roughly, area under the precision-recall curve) from 85% to 100% on 13 problems in 6 domains.
- Average precision better than 90% on 5-way joins.
28. WHIRL and soft integration
- WHIRL worked for a number of web-based demo applications.
  - e.g., integrating data from 30-50 smallish web DBs with <1 FTE of labor
- WHIRL could link many data types reasonably well, without engineering.
- WHIRL generated numerous papers (SIGMOD 98, KDD 98, Agents 99, AAAI 99, TOIS 2000, AIJ 2000, ICML 2000, JAIR 2001).
- WHIRL was relational.
  - But see ELIXIR (SIGIR 2001).
- WHIRL users need to know the schema of the source DBs.
- WHIRL's query-time linkage worked only for TF-IDF, token-based distance metrics.
  - → text fields with few misspellings
- WHIRL was memory-based.
  - All data must be centrally stored: no federated data.
  - → small datasets only
29. WHIRL vision: very radical, everything was inter-dependent
Link items as needed by Q
To make SQL-like queries, the user must understand the schema of the underlying DB (and hence someone must understand DB1, DB2, DB3, ...)
R.a S.a S.b T.b
Anhai Anhai Doan Doan
Dan Dan Weld Weld
Incrementally produce a ranked list of possible
links, with best matches first. User (or
downstream process) decides how much of the list
to generate and examine.
William Will Cohen Cohn
Steve Steven Minton Mitton
William David Cohen Cohn
30. Outline
- Two views on data quality
- Cleaning your data vs living with the mess.
- A lazy/Bayesian view of data cleaning
- A framework for querying dirty data
- Data model
- Query language
- Baseline results (biotext and email)
- How to improve results with learning
- Learning to re-rank query output
- Conclusions
31. BANKS: Basic Data Model
- The database is modeled as a graph:
  - Nodes = tuples
  - Edges = references between tuples
    - foreign key, inclusion dependencies, ...
  - Edges are directed.
- The user need not know the organization of the database to formulate queries.

[Graph example: paper nodes "MultiQuery Optimization" and "BANKS: Keyword search" connected by "writes" edges to author nodes "S. Sudarshan", "Prasan Roy", "Charuta"]
32. BANKS: Answer to Query
Query: "sudarshan roy". Answer: a subtree from the graph.

[Subtree: paper "MultiQuery Optimization" connected by "writes" edges to author nodes "S. Sudarshan" and "Prasan Roy"]
33. BANKS: Basic Data Model
- The database is modeled as a graph:
  - Nodes = tuples
  - Edges = references between tuples
    - edges are directed
    - foreign key, inclusion dependencies, ...
34. BANKS: Basic Data Model (not quite so basic)
- Database: all information is modeled as a graph.
- Nodes: tuples or documents or strings or words.
- Edges: references between nodes.
  - edges are directed, labeled and weighted
  - foreign key, inclusion dependencies, ...
  - doc/string D to each word contained by D (TF-IDF weighted, perhaps)
  - word W to each doc/string containing W (inverted index)
  - string S to strings similar to S
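A minimal sketch of this extended data model: typed nodes plus labeled, directed edges, with inverse edges added so walks can traverse relations in both directions. The node and relation names below are illustrative, not from a real corpus.

```python
from collections import defaultdict

# Minimal sketch of the extended BANKS-style data model: typed nodes and
# labeled, directed edges, with an inverse edge added for each edge so a
# walk can go both ways (cf. sent_to / st_inv in the email graph later).
class Graph:
    def __init__(self):
        self.node_type = {}                  # node -> type name
        self.edges = defaultdict(list)       # node -> [(label, neighbor)]

    def add_node(self, node, ntype):
        self.node_type[node] = ntype

    def add_edge(self, src, label, dst):
        self.edges[src].append((label, dst))
        self.edges[dst].append((label + "_inv", src))   # inverse edge

g = Graph()
g.add_node("msg1", "file")
g.add_node("alice@example.com", "email-address")
g.add_node("galaxy", "term")
g.add_edge("msg1", "sent_from", "alice@example.com")
g.add_edge("msg1", "has_term", "galaxy")

neighbors = dict(g.edges["msg1"])
```

Edge weights are omitted here; they could be attached per (label, neighbor) pair, e.g. TF-IDF weights on doc→word edges.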
35. Similarity in a BANKS-like system
- Motivation: why I'm interested:
  - structured data that is partly text: similarity!
  - structured data represented as graphs: all sorts of information can be poured into this model
  - measuring the similarity of nodes in graphs
- Coming up next:
  - a simple query language for graphs
  - experiments on natural types of queries
  - techniques for learning to answer queries of a certain type better
36. Yet another schema-free query language
- Assume the data is encoded in a graph with:
  - a node for each object x
  - a type for each object x, T(x)
  - an edge for each binary relation r: x → y
- Queries are of this form:
  - Given type t and node x, find y such that T(y) = t and y ~ x.
- We'd like to construct a general-purpose similarity function x ~ y for objects in the graph.
- We'd also like to learn many such functions for different specific tasks (like "who should attend a meeting").

(Node similarity)
37. Similarity of Nodes in Graphs
- Given type t and node x, find y such that T(y) = t and y ~ x.
- Similarity is defined by a damped version of PageRank.
- Similarity between nodes x and y; random surfer model: from a node z,
  - with probability a, stop and output z
  - otherwise, pick an edge label r using Pr(r | z), e.g. uniform
  - pick y uniformly from {y : z → y with label r}
  - repeat from node y...
- Similarity: x ~ y = Pr(output y | start at x)
- Intuitively, x ~ y is a sum over the weights of all paths from x to y, where the weight of a path decreases exponentially with its length.
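The damped walk above can be computed by exact probability propagation for a fixed number of steps (a finite "lazy" walk). This sketch drops edge labels and assumes a uniform choice over out-edges; the graph and parameters are illustrative.

```python
from collections import defaultdict

# Sketch of the damped-walk similarity x ~ y = Pr(output y | start at x),
# computed by propagating probability mass for K steps. At each node the
# walker stops (and outputs) with probability alpha, else moves to a
# uniformly chosen out-neighbor. Edge labels are omitted for simplicity.
def walk_similarity(edges, start, alpha=0.5, K=10):
    """edges: dict node -> list of out-neighbors (directed graph)."""
    output = defaultdict(float)       # Pr(stop and output y)
    frontier = {start: 1.0}           # Pr(at z after t steps, not stopped)
    for _ in range(K):
        next_frontier = defaultdict(float)
        for z, p in frontier.items():
            output[z] += alpha * p    # stop here with probability alpha
            nbrs = edges.get(z, [])
            for y in nbrs:            # else move to a uniform neighbor
                next_frontier[y] += (1 - alpha) * p / len(nbrs)
        frontier = next_frontier
    return dict(output)

# Tiny graph: x -> {a, b}, a -> {y}, b -> {y}; y is reached via two
# length-2 paths, so it gets as much mass as either intermediate node.
edges = {"x": ["a", "b"], "a": ["y"], "b": ["y"]}
sim = walk_similarity(edges, "x", alpha=0.5, K=4)
```

Note how the two paths x→a→y and x→b→y add up: nodes reachable by many short paths score higher, matching the intuition on the slide.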
38. BANKS: Basic Data Model (not quite so basic)
- Database: all information is modeled as a graph.
- Nodes: tuples or documents or strings or words.
- Edges: references between nodes.
  - edges are directed, labeled and weighted
  - foreign key, inclusion dependencies, ...
  - doc/string D to each word contained by D (TF-IDF weighted, perhaps)
  - word W to each doc/string containing W (inverted index)
  - string S to strings similar to S

[Example: the strings "William W. Cohen, CMU" and "Dr. W. W. Cohen" both link to word nodes "william", "w", "cohen", "cmu", "dr"]
- Optional: strings that are similar in TF-IDF/cosine distance will still be nearby in the graph (connected by many length-2 paths).
39. Similarity of Nodes in Graphs
- Random surfing on graphs:
  - a natural extension of PageRank
  - closely related to Lafferty's heat-diffusion kernel, but generalized to directed graphs
  - somewhat amenable to learning the parameters of the walk (gradient search, with various optimization metrics)
    - Toutanova, Manning & Ng, ICML 2004
    - Nie et al., WWW 2005
    - Xi et al., SIGIR 2005
  - can be sped up and adapted to longer walks by sampling approaches to matrix multiplication (e.g. Lewis & E. Cohen, SODA 1998), similar to particle filtering
  - our current implementation (GHIRL): Lucene + Sleepycat, with extensive use of memory caching (sampling approaches visit many nodes repeatedly)
40. Query: "sudarshan roy". Answer: a subtree from the graph.

[Subtree: paper "MultiQuery Optimization" connected by "writes" edges to author nodes "S. Sudarshan" and "Prasan Roy"]
41. {y : paper(y) and y ~ "roy"} AND {w : paper(w) and w ~ "roy"}
Query: "sudarshan roy". Answer: a subtree from the graph.
42. Evaluation on Personal Information Management Tasks
(Minkov et al., SIGIR 2006)
Many tasks can be expressed as simple, non-conjunctive search queries in this framework, such as:
- Person name disambiguation in email (novel; cf. Diehl, Getoor & Namata, 2006)
- Threading (cf. Lewis & Knowles, 1997)
- Finding email-address aliases given a person's name (novel)
- Finding relevant meeting attendees (novel)
For example: What is the email address of the person named "Halevy" mentioned in this presentation? What files from my home machine will I need for this meeting? What people will attend this meeting? ...
Also consider a generalization x → Vq, where Vq is a distribution over nodes x.
43. Email as a graph
[Figure: files, person names, email addresses, dates, and terms as nodes, connected by labeled edges such as sent_to, sent_from, sent_date, alias, in_file, in_subj, 1_day (between dates), and their inverses (st_inv, sf_inv, sd_inv, a_inv, if_inv, is_inv)]
44. Person Name Disambiguation
- Q: who is "Andy"?
- Given: a term that is known to be a personal name but is not mentioned "as is" in a header (otherwise the problem is easy).
- Output: ranked person nodes.
[Figure: file nodes and person nodes; the term "andy" is linked through files to the person "Andrew Johns"]
This task is complementary to person-name annotation in email (E. Minkov, R. Wang, W. Cohen, "Extracting Personal Names from Emails: Applying Named Entity Recognition to Informal Text", HLT/EMNLP 2005).
45. Corpora and Datasets
(a) Corpora; (b) types of names.
Example nicknames: "Dave" for "David", "Kai" for "Keiko", "Jenny" for "Qing".
46. Person Name Disambiguation
- 1. Baseline: string matching (+ common nicknames)
  - Find persons whose names are similar to the name term (Jaro)
  - Successful in many cases
  - Not successful for some nicknames
  - Cannot handle ambiguity (picks arbitrarily)
- 2. Graph walk: term
  - Vq = {name term} (2 steps)
  - Models co-occurrences
  - Cannot handle ambiguity (picks the dominant person)
- 3. Graph walk: term + file
  - Vq = {name term, file node} (2 steps)
  - The file node is naturally available context
  - Solves the ambiguity problem!
  - But incorporates additional noise.
- 4. Graph walk: term + file, re-ranked using learning
  - Re-rank the output of (3), using:
    - path-describing features
    - source count: do the paths originate from one source node or two?
    - string similarity
47. Results
48. Results [Chart comparing: after learning-to-rank; graph walk from {name, file}; graph walk from {name}; baseline string match + nicknames]
49. Results (Enron execs)
50. Results
51. Learning
- There is no single best measure of similarity.
- How can you learn to better rank graph nodes for a particular task?
- Learning methods for graph walks:
  - The walk parameters can be adjusted using gradient descent methods (Diligenti et al., IJCAI 2005).
  - We explored a node re-ranking approach, which can take advantage of a wider range of features (and is complementary to parameter tuning).
  - The features of a candidate answer y describe the set of paths from the query x to y.
52. Re-ranking overview
- Boosting-based re-ranking, following Collins and Koo (Computational Linguistics, 2005).
- A training example includes:
  - a ranked list of l_i nodes,
  - each node represented by m features,
  - at least one known correct node.
- Scoring function: the original walk score of y given x, plus a linear combination of the feature values; find the weight vector w that minimizes the (boosted version of the) loss.
- Requires binary features, and has a closed-form formula for finding the best feature and update delta in each iteration.
53. Path-describing Features
- The set of paths to a target node at step k is recovered in full.
- Edge unigram features: was edge type l used in reaching x from Vq?
- Edge bigram features: were edge types l1 and l2 used (in that order) in reaching x from Vq?
- Top edge bigram features: were edge types l1 and l2 used (in that order) in reaching x from Vq, among the top two highest-scoring paths?

[Figure: nodes x1-x5 over walk steps k = 0, 1, 2]
- Paths to (x3, k = 2):
  - x2 → x1 → x3
  - x4 → x1 → x3
  - x2 → x2 → x3
  - x2 → x3
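Turning recovered paths into features and a re-ranking score can be sketched as follows. The edge labels, paths, and weights are illustrative, and the linear score here stands in for the learned boosted combination described above.

```python
import math

# Sketch of path-describing features: each candidate answer node is
# represented by binary edge-unigram and edge-bigram indicators over the
# paths that reached it, combined linearly with the log of the original
# walk score. Labels, paths, and weights below are illustrative.
def path_features(paths):
    """paths: iterable of edge-label sequences; returns a feature set."""
    feats = set()
    for path in paths:
        for l in path:                          # edge unigram features
            feats.add(("uni", l))
        for l1, l2 in zip(path, path[1:]):      # edge bigram features
            feats.add(("bi", l1, l2))
    return feats

def rerank_score(walk_prob, feats, weights, w0=1.0):
    """Original walk score plus a linear combination of binary features."""
    return w0 * math.log(walk_prob) + sum(weights.get(f, 0.0) for f in feats)

paths_to_y = [("sent_from", "sent_to_inv"), ("has_term", "has_term_inv")]
feats = path_features(paths_to_y)
weights = {("bi", "sent_from", "sent_to_inv"): 2.0}   # toy "learned" weight
score = rerank_score(0.1, feats, weights)
```

Because the features are binary indicators, a boosting-style learner can pick the single best feature and its weight update in closed form at each iteration, as the previous slide notes.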
54. Results
55. Threading
- Threading is an interesting problem because:
  - There are often irregularities in thread structural information, so thread discourse should be captured using an intelligent approach (D.E. Lewis and K.A. Knowles, "Threading email: a preliminary study", Information Processing and Management, 1997).
  - Threading information can improve message categorization into topical folders (B. Klimt and Y. Yang, "The Enron corpus: a new dataset for email classification research", ECML, 2004).
  - Adjacent messages in a thread can be assumed to be the most similar to each other in the corpus; therefore, threading is related to the general problem of finding similar messages in a corpus.
- The task: given a message, retrieve the adjacent messages in its thread.
56-59. Some intuition
[Figure, built up over four slides: a message (file x) is connected to adjacent messages in its thread through shared content, the social network, and the timeline.]
60. Threading experiments
- 1. Baseline: TF-IDF similarity. Consider all the available information (header + body) as text.
- 2. Graph walk, uniform: start from the file node, 2 steps, uniform edge weights.
- 3. Graph walk, random: start from the file node, 2 steps, random edge weights (best of 10).
- 4. Graph walk, re-ranked: re-rank the output of (3) using the graph-describing features.
61. Results
- Highly-ranked edge bigrams:
  - sent-from → sent-to^-1
  - date-of → date-of^-1
  - has-term → has-term^-1
62. Finding email-aliases given a name
Given a person's name (a term node), retrieve the full set of relevant email addresses (email-address nodes).
63. Finding Meeting Attendees
(Minkov et al., CEAS 2006)
- The extended graph contains 2 months of calendar data.
64. Main Contributions
- Presented an extended similarity measure incorporating non-textual objects.
- Finite lazy random walks to perform typed search.
- A re-ranking paradigm to improve on graph-walk results.
- An instantiation of this framework for email.
- Defined and evaluated novel tasks for email.
65. Another Task that Can Be Formulated as a Graph Query: GeneId Ranking
- Given: a biomedical paper abstract.
- Find: the geneId for every gene mentioned in the abstract.
- Method: from paper x, produce a ranked list of geneIds y, ordered by x ~ y.
- Background resources:
  - a synonym list: geneId → {name1, name2, ...}
  - one or more protein NER systems
  - training/evaluation data: pairs of (paper, {geneId1, ..., geneIdn})
66. Sample abstracts and synonyms
- MGI96273
  - Htr1a
  - 5-hydroxytryptamine (serotonin) receptor 1A
  - 5-HT1A receptor
- MGI104886
  - Gpx5
  - glutathione peroxidase 5
  - Arep
- ...
- 52,000 synonym entries for mouse, 35,000 for fly
[Figure labels: true labels; NER extractor]
67. Graph for the task
[Figure: abstract nodes (e.g. file=doc115) link via hasProtein edges to extracted protein strings (CA1, HT1A, HT1) and via hasTerm edges to term nodes (term=HT, term=1, term=A, term=CA, term=hippocampus, ...); synonym strings ("5-HT1A receptor", "Htr1a", "eIF-1A") share those terms (hasTerm/inFile edges) and link via synonym edges to geneId nodes (MGI95298, MGI46273).]
68. [Figure: the same graph, extended with noisy training abstracts (file=doc214, file=doc523, file=doc6273, ...) connected to their known geneIds.]
69. Experiments
- Data: BioCreative Task 1B
  - mouse: 10,000 training abstracts, 250 dev-test (using the first 150 for now); 50,000 geneIds; the graph has 525,000 nodes
- NER systems:
  - likelyProtein: trained on yapex.train using off-the-shelf NER methods (Minorthird)
  - possibleProtein: same, but tuned (on yapex.test) to optimize F3 rather than F1 (rewards recall over precision)
70. Experiments with NER

System    Test set    Token Prec  Token Rec  Span Prec  Span Rec  Span F1
likely    yapex.test  94.9        64.8       87.2       62.1      72.5
possible  yapex.test  49.0        97.4       47.2       82.5      60.0
likely    mouse       81.6        31.3       66.7       26.8      45.3
possible  mouse       43.9        88.5       30.4       56.6      39.6
dictionary mouse      50.1        46.9       24.5       43.9      31.4
71. Experiments with Graph Search
- Baseline method:
  - extract entities of type x
  - for each string of type x, find the best-matching synonym, and then its geneId
    - consider only synonyms sharing >1 token
    - Soft/TFIDF distance
    - break ties randomly
  - rank geneIds by the number of times they are reached
    - rewards multiple mentions (even via alternate synonyms)
- Evaluation: average, over 50 test documents, of:
  - non-interpolated average precision (plausible for curators)
  - max F1 over all cutoffs
72. Experiments with Graph Search

mouse eval dataset             MAP    max F1
likelyProtein + softTFIDF      45.0   58.1
possibleProtein + softTFIDF    62.6   74.9
graph walk                     51.3   64.3
73. Baseline vs. Graph Walk
- The baseline includes:
  - softTFIDF distances from NER entities to gene synonyms
  - knowledge that the shortcut path doc → entity → synonym → geneId is important
- The graph includes:
  - IDF effects, correlations, training data, etc.
- Proposed graph extension:
  - add softTFIDF and shortcut edges
- Learning and re-ranking:
  - start with local features fi(e) of edges e: u → v
  - for an answer y, compute the expectations E(fi(e) | start = x, end = y)
  - use the expectations as feature values, with the voted perceptron (Collins, 2002) as the learning-to-rank method.
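The expectation features can be sketched by propagating the walker's distribution and accumulating expected edge-label traversal counts. This simplified version conditions only on the start node x (the method above also conditions on the end node y); the graph and labels are illustrative.

```python
from collections import defaultdict

# Sketch of expectation-based features: propagate the walker's
# distribution for K steps and accumulate the expected number of times
# each edge label is traversed. (A simplification: the talk's features
# condition on both the start node x and the answer node y; this version
# conditions only on the start node.)
def expected_label_counts(edges, start, K=3):
    """edges: dict node -> list of (label, neighbor) pairs (directed)."""
    expect = defaultdict(float)       # label -> expected traversal count
    dist = {start: 1.0}               # current distribution over nodes
    for _ in range(K):
        nxt = defaultdict(float)
        for z, p in dist.items():
            out = edges.get(z, [])
            for label, y in out:
                q = p / len(out)      # uniform choice among out-edges
                expect[label] += q    # expected traversals of this label
                nxt[y] += q
        dist = nxt
    return dict(expect)

edges = {"x": [("sent_from", "a"), ("has_term", "t")],
         "a": [("sent_to_inv", "y")]}
exp_counts = expected_label_counts(edges, "x", K=2)
```

These real-valued expectations would then be the feature vector fed to a learned linear ranker such as the voted perceptron.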
74. Experiments with Graph Search

mouse eval dataset               MAP    average max F1
likelyProtein + softTFIDF        45.0   58.1
possibleProtein + softTFIDF      62.6   74.9
graph walk                       51.3   64.3
walk + extra links               73.0   80.7
walk + extra links + learning    79.7   83.9
75. Experiments with Graph Search
76. Hot off the presses
- Ongoing work: learn an NER system from pairs of (document, geneIdList)
  - much easier to obtain such training data than documents in which every occurrence of every gene name is highlighted (the usual NER training data)
  - obtains an F1 of 71.1 on the mouse data (vs. 45.3 when training on the YAPEX data, which is from a different distribution)
77. Experiments with Graph Search

mouse eval dataset               MAP (Yapex-trained)
likelyProtein + softTFIDF        45.0
possibleProtein + softTFIDF      62.6
graph walk                       51.3
walk + extra links               73.0
walk + extra links + learning    79.7
78. Experiments with Graph Search

mouse eval dataset               MAP (Yapex-trained)  MAP (MGI-trained)
likelyProtein + softTFIDF        45.0                 72.7
possibleProtein + softTFIDF      62.6                 65.7
graph walk                       51.3                 54.4
walk + extra links               73.0                 76.7
walk + extra links + learning    79.7                 84.2
79. Experiments on the BioCreative Blind Test Set

mouse blind test data            MAP (Yapex-trained)  Max F1 (Yapex-trained)
likelyProtein + softTFIDF        36.8  (dev: 45.0)    42.1  (dev: 58.1)
possibleProtein + softTFIDF      61.1  (dev: 62.6)    67.2  (dev: 74.9)
graph walk                       64.0  (dev: 51.3)    69.5  (dev: 64.3)
walk + extra links + learning    71.1  (dev: 79.7)    75.5  (dev: 83.9)
80. Experiments with Graph Search

mouse blind test data            MAP (Yapex-trained)  Max F1 (Yapex-trained)
likelyProtein + softTFIDF        36.8                 42.1
possibleProtein + softTFIDF      61.1                 67.2
graph walk                       64.0                 69.5
walk + extra links + learning    71.1                 75.5

mouse blind test data            MAP (MGI-trained)    Max F1 (MGI-trained)
walk + extra links + learning    80.1                 83.7
81.

mouse blind test data            MAP (Yapex-trained)  Average max F1 (Yapex-trained)
walk + extra links + learning    71.1                 75.5

mouse blind test data            MAP (MGI-trained)    Average max F1 (MGI-trained)
walk + extra links + learning    80.1                 83.7
82. Outline
- Two views on data quality
- Cleaning your data vs living with the mess.
- A lazy/Bayesian view of data cleaning
- A framework for querying dirty data
- Data model
- Query language
- Baseline results (biotext and email)
- How to improve results with learning
- Learning to re-rank query output
- Conclusions
83. Conclusions
- Contributions:
  - a very simple query language for graphs, based on a diffusion-kernel (damped PageRank, ...) similarity metric
  - experiments on natural types of queries:
    - finding likely meeting attendees
    - finding related documents (email threading)
    - disambiguating person and gene/protein entity names
  - techniques for learning to answer queries:
    - re-ranking using expectations of simple, local features
    - tuning performance to a particular similarity
84. Conclusions
- Some open problems:
  - scalability and efficiency:
    - a K-step walk on a node-node graph with fan-out b is O(KbN)
    - accurate sampling is O(1 min) for 10-step walks with O(10^6) nodes
  - faster, better learning methods:
    - combine re-ranking with learning the parameters of the graph walk
  - add language modeling, topic modeling:
    - extend the graph to include models as well as data
85. Conclusions
- Don't forget that there are two views on data quality:
  - Cleaning your data vs. living with the mess.
  - A lazy/Bayesian view of data cleaning.
  - SQL/Oracle vs. Google
  - vs. something in between...?