Title: A Framework for Learning to Query Heterogeneous Data
1. A Framework for Learning to Query Heterogeneous Data
- William W. Cohen
- Machine Learning Department and Language
Technologies Institute - School of Computer Science
- Carnegie Mellon University
- joint work with
- Einat Minkov, Andrew Ng, Richard Wang, Anthony
Tomasic, Bob Frederking
2. Outline
- Two views on data quality
- Cleaning your data vs living with the mess.
- A lazy/Bayesian view of data cleaning
- A framework for querying dirty data
- Data model
- Query language
- Baseline results (biotext and email)
- How to improve results with learning
- Learning to re-rank query output
- Conclusions
7. A Bayesian Looks at Record Linkage
- Record linkage problem: given two sets of records A = {a1, ..., am} and B = {b1, ..., bn}, determine when referent(ai) = referent(bj)
- Idea: compute Pr(referent(ai) = referent(bj)) for each (ai, bj) pair
- Pick two thresholds:
  - Pr(a = b) > HI → accept pairing
  - Pr(a = b) < LO → reject pairing
  - otherwise, clerical review by a human clerk
- Every optimal decision boundary is defined by a threshold on the ranked list.
- Thresholds depend on the prior probability of a and b matching.

A    B    Pr(A = B)
A17  B22  0.99
A43  B07  0.98
A21  B13  0.85
A37  B44  0.82
A84  B03  0.79
A83  B71  0.63
A24  B52  0.25
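The accept / review / reject rule above can be sketched in a few lines. The pair scores come from the slide's table; the HI and LO values here are illustrative assumptions (in practice they depend on the prior probability of a match).

```python
# Sketch of the Fellegi-Sunter-style triage rule: accept pairs above HI,
# reject pairs below LO, and send the rest to a human clerk for review.
HI, LO = 0.90, 0.50   # illustrative thresholds, not from the talk

pairs = [("A17", "B22", 0.99), ("A43", "B07", 0.98),
         ("A21", "B13", 0.85), ("A37", "B44", 0.82),
         ("A84", "B03", 0.79), ("A83", "B71", 0.63),
         ("A24", "B52", 0.25)]

def triage(pairs, hi, lo):
    """Split scored pairs into accepted links, clerical review, and rejects."""
    accept = [(a, b) for a, b, p in pairs if p > hi]
    review = [(a, b) for a, b, p in pairs if lo <= p <= hi]
    reject = [(a, b) for a, b, p in pairs if p < lo]
    return accept, review, reject

accept, review, reject = triage(pairs, HI, LO)
```

Only the two near-certain pairs are accepted automatically; the middle of the ranked list is exactly the clerical-review zone the slides discuss.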
8. A Bayesian Looks at Record Linkage
- Every optimal decision boundary is defined by a threshold on the ranked list.
- There are 2^nm ways to link.

A    B    Pr(A = B)
A17  B22  0.99
A43  B07  0.98
A16  B23  0.91
A21  B13  0.85
A37  B44  0.82
A84  B03  0.79
A83  B71  0.63
A91  B21  0.46
A24  B52  0.25

- In other words:
  - 2^nm - nm linkages can be discarded as impossible
  - of the remaining nm, all but those scoring between LO and HI can be discarded as improbable
[Figure: the nm candidate pairs, each labeled M (match) or U (unmatch)]
- But wait: why doesn't the human clerk just pick a threshold between LO and HI?
9. A Bayesian Looks at Record Linkage

A    B    Pr(A = B)
A17  B22  0.99
A43  B07  0.98
A32  B72  0.91
A21  B13  0.85
A37  B44  0.82
A84  B03  0.79
A83  B71  0.63
A21  B43  0.46
A24  B52  0.25
[Figure: the true M (match) / U (unmatch) labels for the ranked pairs]
10. Linking multiple relations: database hardening
Database S1 (extracted from paper 1's title page)
Database S2 (extracted from paper 2's bibliography)
11. Using multiple relations: database hardening
So this gives some known matches, which might interact with proposed matches; e.g., here we deduce...
12. Soft database from IE
Hard database, suitable for Oracle, MySQL, etc.
13. Using multiple relations: database hardening
- (McAllester et al., KDD 2000) defined hardening:
  - Find an interpretation (a map variant → name) that produces a compact version of the soft database S.
- Probabilistic interpretation of hardening:
  - The original soft data S is a version of latent hard data H.
  - Hardening finds the maximum-likelihood H.
- Hardening is hard!
  - Optimal hardening is NP-hard.
  - Greedy algorithm:
    - a naive implementation is quadratic in the size of S
    - clever data structures make it O(n log n) in the size of S
- Other related work:
  - Pasula et al., NIPS 2002: a more explicit generative Bayesian formulation and MCMC method, with experimental support
  - Wellner & McCallum 2004; Parag & Domingos 2004; Culotta & McCallum 2005; ...
14. A Bayesian Looks at Record Linkage

A    B    Pr(A = B)
A17  B22  0.99
A43  B07  0.98
A32  B72  0.91
A21  B13  0.85
A37  B44  0.82
A84  B03  0.79
A83  B71  0.63
A21  B43  0.46
A24  B52  0.25

- An alternate view of the process:
  - The Fellegi-Sunter method answers the question directly for the cases that everyone would agree on.
  - Human effort is used to answer the cases that are a little harder.
[Figure: M (match) / U (unmatch) labels for the ranked pairs]
15. A Bayesian Looks at Record Linkage

A    B    Pr(A = B)
A17  B22  0.99
A43  B07  0.98
A32  B72  0.91
A21  B13  0.85
A37  B44  0.82
A84  B03  0.79
A83  B71  0.63
A21  B43  0.46
A24  B52  0.25

- An alternate view of the process:
  - The Fellegi-Sunter method answers the question directly for the cases that everyone would agree on.
  - Human effort is used to answer the cases that are a little harder.
[Figure: M/U labels for the top-ranked pairs]
Q: is A43 in B?  A: yes (p = 0.98)
Q: is A83 in B?  A: not clear
Q: is A21 in B?  A: unlikely
16. Passing linkage decisions along to the user
- Usual goal: link records and create a single, highly accurate database for users to query.
- Equality is often uncertain, given the available information about an entity:
  - name = "T. Kennedy", occupation = "terrorist"
- The interpretation of equality may change from user to user and application to application:
  - Does Boston Market = McDonald's?
- Alternate goal: wait for a query, then answer it, propagating uncertainty about the linkage decisions on that query to the end user.
17. WHIRL project (1997-2000)
- WHIRL initiated while at AT&T Bell Labs
AT&T Research
AT&T Labs - Research
AT&T Research
AT&T Labs
AT&T Research - Shannon Laboratory
AT&T Shannon Labs
18. When are two entities the same?
- Bell Labs
- Bell Telephone Labs
- AT&T Bell Labs
- AT&T Labs
- AT&T Labs - Research
- AT&T Labs Research, Shannon Laboratory
- Shannon Labs
- Bell Labs Innovations
- Lucent Technologies/Bell Labs Innovations

1925
"History of Innovation: From 1925 to today, AT&T has attracted some of the world's greatest scientists, engineers and developers." (www.research.att.com)
"Bell Labs Facts: Bell Laboratories, the research and development arm of Lucent Technologies, has been operating continuously since 1925." (bell-labs.com)
19. When are two entities the same?
Buddhism rejects the key element in folk
psychology the idea of a self (a unified
personal identity that is continuous through
time) King Milinda and Nagasena (the Buddhist
sage) discuss personal identity Milinda
gradually realizes that "Nagasena" (the word)
does not stand for anything he can point to
not the hairs on Nagasena's head, nor the hairs
of the body, nor the "nails, teeth, skin,
muscles, sinews, bones, marrow, kidneys, ..."
etc Milinda concludes that "Nagasena" doesn't
stand for anything If we can't say what a person
is, then how do we know a person is the same
person through time? There's really no you,
and if there's no you, there are no beliefs or
desires for you to have The folk psychology
picture is profoundly misleading and believing it
will make you miserable. -S. LaFave
20. Traditional approach
[Diagram: Linkage first, then Queries]
Uncertainty about what to link must be decided by the integration system, not the end user.
21. WHIRL vision

R.a      S.a     S.b     T.b
Anhai    Anhai   Doan    Doan
Dan      Dan     Weld    Weld
  (Strongest links: those agreeable to most users)
William  Will    Cohen   Cohn
Steve    Steven  Minton  Mitton
  (Weaker links: those agreeable to some users)
William  David   Cohen   Cohn
  (...even weaker links)
22. WHIRL vision
DB1 + DB2 → DB
Link items as needed by the query Q
R.a S.a S.b T.b
Anhai Anhai Doan Doan
Dan Dan Weld Weld
Incrementally produce a ranked list of possible
links, with best matches first. User (or
downstream process) decides how much of the list
to generate and examine.
William Will Cohen Cohn
Steve Steven Minton Mitton
William David Cohen Cohn
23. WHIRL queries
- Assume two relations:
  - review(movieTitle, reviewText): archive of reviews
  - listing(theatre, movieTitle, showTimes, ...): now showing

"The Hitchhiker's Guide to the Galaxy, 2005" | "This is a faithful re-creation of the original radio series; not surprisingly, as Adams wrote the screenplay..."
"Men in Black, 1997" | "Will Smith does an excellent job in this..."
"Space Balls, 1987" | "Only a die-hard Mel Brooks fan could claim to enjoy..."

"Star Wars Episode III" | The Senator Theater | 1:00, 4:15, 7:30pm
"Cinderella Man" | The Rotunda Cinema | 1:00, 4:30, 7:30pm
24. WHIRL queries
- Find reviews of sci-fi comedies (movie domain):
  - FROM review as r SELECT * WHERE r.text ~ 'sci fi comedy'
  - (like standard ranked retrieval for the query "sci-fi comedy")
- Where is that sci-fi comedy playing?
  - FROM review as r, listing as s SELECT * WHERE r.title ~ s.title AND r.text ~ 'sci fi comedy'
  - (best answers: titles that are similar to each other, e.g., "Hitchhiker's Guide to the Galaxy" and "The Hitchhiker's Guide to the Galaxy, 2005", and review text similar to "sci fi comedy")
25. WHIRL queries
- Similarity is based on TF-IDF → rare words are most important.
- The search for high-ranking answers uses inverted indices.

The Hitchhiker's Guide to the Galaxy, 2005
Men in Black, 1997
Space Balls, 1987
Star Wars Episode III
Hitchhiker's Guide to the Galaxy
Cinderella Man
26. WHIRL queries
- Similarity is based on TF-IDF → rare words are most important.
- The search for high-ranking answers uses inverted indices.
  - It is easy to find the (few) items that match on important terms.
  - The search for strong matches can prune unimportant terms.

The Hitchhiker's Guide to the Galaxy, 2005
Men in Black, 1997
Space Balls, 1987
Star Wars Episode III
Hitchhiker's Guide to the Galaxy
Cinderella Man

Inverted index:
hitchhiker → movie001, movie037
the → movie001, movie003, movie007, movie008, movie013, movie018, movie023, movie031, ...
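The TF-IDF matching just described can be sketched as follows. This is a simplified toy, not WHIRL's actual scoring: the titles are from the slide, while the weighting formula and document ids are illustrative assumptions.

```python
import math
from collections import defaultdict

# Toy sketch of TF-IDF title matching with an inverted index: rare terms
# ("hitchhiker") dominate the cosine score, and the index finds candidate
# matches without scanning every title.
titles = ["the hitchhikers guide to the galaxy 2005",
          "men in black 1997",
          "space balls 1987",
          "star wars episode iii",
          "hitchhikers guide to the galaxy",
          "cinderella man"]

docs = [t.split() for t in titles]
N = len(docs)
df = defaultdict(int)                 # document frequency per term
for d in docs:
    for w in set(d):
        df[w] += 1

def tfidf(doc):
    """Unit-normalized TF-IDF vector (simplified weighting)."""
    v = defaultdict(float)
    for w in doc:
        v[w] += math.log(1 + N / df.get(w, 1))
    norm = math.sqrt(sum(x * x for x in v.values()))
    return {w: x / norm for w, x in v.items()}

index = defaultdict(set)              # inverted index: term -> doc ids
for i, d in enumerate(docs):
    for w in set(d):
        index[w].add(i)

def best_match(query_title):
    q = tfidf(query_title.split())
    candidates = set().union(*(index.get(w, set()) for w in q))
    scores = {i: sum(q.get(w, 0.0) * wt for w, wt in tfidf(docs[i]).items())
              for i in candidates}
    return max(scores, key=scores.get)

match = best_match("hitchhikers guide to the galaxy")
```

The inverted index restricts scoring to titles sharing at least one query term, which is why pruning unimportant terms makes the search for strong matches cheap.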
27. WHIRL results
- This sort of worked:
- Interactive speeds (<0.3 s/query) with a few hundred thousand tuples.
- For 2-way joins, average precision (roughly, area under the precision-recall curve) from 85% to 100% on 13 problems in 6 domains.
- Average precision better than 90% on 5-way joins.
28. WHIRL and soft integration
- WHIRL worked for a number of web-based demo applications.
  - e.g., integrating data from 30-50 smallish web DBs with <1 FTE of labor
- WHIRL could link many data types reasonably well, without engineering.
- WHIRL generated numerous papers (SIGMOD 98, KDD 98, Agents 99, AAAI 99, TOIS 2000, AIJ 2000, ICML 2000, JAIR 2001).
- WHIRL was relational.
  - But see ELIXIR (SIGIR 2001).
- WHIRL users need to know the schema of the source DBs.
- WHIRL's query-time linkage worked only for TF-IDF, token-based distance metrics.
  - → text fields with few misspellings
- WHIRL was memory-based.
  - All data must be centrally stored: no federated data.
  - → small datasets only
29. WHIRL vision: very radical, everything was inter-dependent
Link items as needed by Q
To make SQL-like queries, the user must understand the schema of the underlying DB (and hence someone must understand DB1, DB2, DB3, ...)
R.a S.a S.b T.b
Anhai Anhai Doan Doan
Dan Dan Weld Weld
Incrementally produce a ranked list of possible
links, with best matches first. User (or
downstream process) decides how much of the list
to generate and examine.
William Will Cohen Cohn
Steve Steven Minton Mitton
William David Cohen Cohn
30. Outline
- Two views on data quality
- Cleaning your data vs living with the mess.
- A lazy/Bayesian view of data cleaning
- A framework for querying dirty data
- Data model
- Query language
- Baseline results (biotext and email)
- How to improve results with learning
- Learning to re-rank query output
- Conclusions
31. BANKS: Basic Data Model
- The database is modeled as a graph:
  - Nodes = tuples
  - Edges = references between tuples
    - foreign key, inclusion dependencies, ...
  - Edges are directed.
- The user need not know the organization of the database to formulate queries.

[Graph example: paper nodes "MultiQuery Optimization" and "BANKS: Keyword search" connected by "writes" edges to author nodes "S. Sudarshan", "Prasan Roy", "Charuta"]
32. BANKS: Answer to Query
Query: "sudarshan roy". Answer: a subtree from the graph.

[Subtree: paper "MultiQuery Optimization" connected by "writes" edges to author nodes "S. Sudarshan" and "Prasan Roy"]
33. BANKS: Basic Data Model
- The database is modeled as a graph:
  - Nodes = tuples
  - Edges = references between tuples
    - edges are directed
    - foreign key, inclusion dependencies, ...
34. BANKS: Basic Data Model (not quite so basic)
- Database: all information is modeled as a graph.
- Nodes: tuples or documents or strings or words.
- Edges: references between nodes.
  - edges are directed, labeled and weighted
  - foreign key, inclusion dependencies, ...
  - doc/string D to each word contained by D (TF-IDF weighted, perhaps)
  - word W to each doc/string containing W (inverted index)
  - string S to strings similar to S
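A minimal sketch of this extended data model: typed nodes plus labeled, directed edges, with inverse edges added so walks can traverse relations in both directions. The node and relation names below are illustrative, not from a real corpus.

```python
from collections import defaultdict

# Minimal sketch of the extended BANKS-style data model: typed nodes and
# labeled, directed edges, with an inverse edge added for each edge so a
# walk can go both ways (cf. sent_to / st_inv in the email graph later).
class Graph:
    def __init__(self):
        self.node_type = {}                  # node -> type name
        self.edges = defaultdict(list)       # node -> [(label, neighbor)]

    def add_node(self, node, ntype):
        self.node_type[node] = ntype

    def add_edge(self, src, label, dst):
        self.edges[src].append((label, dst))
        self.edges[dst].append((label + "_inv", src))   # inverse edge

g = Graph()
g.add_node("msg1", "file")
g.add_node("alice@example.com", "email-address")
g.add_node("galaxy", "term")
g.add_edge("msg1", "sent_from", "alice@example.com")
g.add_edge("msg1", "has_term", "galaxy")

neighbors = dict(g.edges["msg1"])
```

Edge weights are omitted here; they could be attached per (label, neighbor) pair, e.g. TF-IDF weights on doc→word edges.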
35. Similarity in a BANKS-like system
- Motivation: why I'm interested:
  - structured data that is partly text: similarity!
  - structured data represented as graphs: all sorts of information can be poured into this model
  - measuring the similarity of nodes in graphs
- Coming up next:
  - a simple query language for graphs
  - experiments on natural types of queries
  - techniques for learning to answer queries of a certain type better
36. Yet another schema-free query language
- Assume the data is encoded in a graph with:
  - a node for each object x
  - a type for each object x, T(x)
  - an edge for each binary relation r: x → y
- Queries are of this form:
  - Given type t and node x, find y such that T(y) = t and y ~ x.
- We'd like to construct a general-purpose similarity function x ~ y for objects in the graph.
- We'd also like to learn many such functions for different specific tasks (like "who should attend a meeting").

(Node similarity)
37. Similarity of Nodes in Graphs
- Given type t and node x, find y such that T(y) = t and y ~ x.
- Similarity is defined by a damped version of PageRank.
- Similarity between nodes x and y; random surfer model: from a node z,
  - with probability a, stop and output z
  - otherwise, pick an edge label r using Pr(r | z), e.g. uniform
  - pick y uniformly from {y : z → y with label r}
  - repeat from node y...
- Similarity: x ~ y = Pr(output y | start at x)
- Intuitively, x ~ y is a sum over the weights of all paths from x to y, where the weight of a path decreases exponentially with its length.
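The damped walk above can be computed by exact probability propagation for a fixed number of steps (a finite "lazy" walk). This sketch drops edge labels and assumes a uniform choice over out-edges; the graph and parameters are illustrative.

```python
from collections import defaultdict

# Sketch of the damped-walk similarity x ~ y = Pr(output y | start at x),
# computed by propagating probability mass for K steps. At each node the
# walker stops (and outputs) with probability alpha, else moves to a
# uniformly chosen out-neighbor. Edge labels are omitted for simplicity.
def walk_similarity(edges, start, alpha=0.5, K=10):
    """edges: dict node -> list of out-neighbors (directed graph)."""
    output = defaultdict(float)       # Pr(stop and output y)
    frontier = {start: 1.0}           # Pr(at z after t steps, not stopped)
    for _ in range(K):
        next_frontier = defaultdict(float)
        for z, p in frontier.items():
            output[z] += alpha * p    # stop here with probability alpha
            nbrs = edges.get(z, [])
            for y in nbrs:            # else move to a uniform neighbor
                next_frontier[y] += (1 - alpha) * p / len(nbrs)
        frontier = next_frontier
    return dict(output)

# Tiny graph: x -> {a, b}, a -> {y}, b -> {y}; y is reached via two
# length-2 paths, so it gets as much mass as either intermediate node.
edges = {"x": ["a", "b"], "a": ["y"], "b": ["y"]}
sim = walk_similarity(edges, "x", alpha=0.5, K=4)
```

Note how the two paths x→a→y and x→b→y add up: nodes reachable by many short paths score higher, matching the intuition on the slide.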
38. BANKS: Basic Data Model (not quite so basic)
- Database: all information is modeled as a graph.
- Nodes: tuples or documents or strings or words.
- Edges: references between nodes.
  - edges are directed, labeled and weighted
  - foreign key, inclusion dependencies, ...
  - doc/string D to each word contained by D (TF-IDF weighted, perhaps)
  - word W to each doc/string containing W (inverted index)
  - string S to strings similar to S

[Example: the strings "William W. Cohen, CMU" and "Dr. W. W. Cohen" both link to word nodes "william", "w", "cohen", "cmu", "dr"]
- Optional: strings that are similar in TF-IDF/cosine distance will still be nearby in the graph (connected by many length-2 paths).
39. Similarity of Nodes in Graphs
- Random surfing on graphs:
  - a natural extension of PageRank
  - closely related to Lafferty's heat-diffusion kernel, but generalized to directed graphs
  - somewhat amenable to learning the parameters of the walk (gradient search, with various optimization metrics)
    - Toutanova, Manning & Ng, ICML 2004
    - Nie et al., WWW 2005
    - Xi et al., SIGIR 2005
  - can be sped up and adapted to longer walks by sampling approaches to matrix multiplication (e.g. Lewis & E. Cohen, SODA 1998), similar to particle filtering
  - our current implementation (GHIRL): Lucene + Sleepycat, with extensive use of memory caching (sampling approaches visit many nodes repeatedly)
40. Query: "sudarshan roy". Answer: a subtree from the graph.

[Subtree: paper "MultiQuery Optimization" connected by "writes" edges to author nodes "S. Sudarshan" and "Prasan Roy"]
41. {y : paper(y) and y ~ "roy"} AND {w : paper(w) and w ~ "roy"}
Query: "sudarshan roy". Answer: a subtree from the graph.
42. Evaluation on Personal Information Management Tasks
(Minkov et al., SIGIR 2006)
Many tasks can be expressed as simple, non-conjunctive search queries in this framework, such as:
- Person name disambiguation in email (novel; cf. Diehl, Getoor & Namata, 2006)
- Threading (cf. Lewis & Knowles, 1997)
- Finding email-address aliases given a person's name (novel)
- Finding relevant meeting attendees (novel)
For example: What is the email address of the person named "Halevy" mentioned in this presentation? What files from my home machine will I need for this meeting? What people will attend this meeting? ...
Also consider a generalization x → Vq, where Vq is a distribution over nodes x.
43. Email as a graph
[Figure: files, person names, email addresses, dates, and terms as nodes, connected by labeled edges such as sent_to, sent_from, sent_date, alias, in_file, in_subj, 1_day (between dates), and their inverses (st_inv, sf_inv, sd_inv, a_inv, if_inv, is_inv)]
44. Person Name Disambiguation
- Q: who is "Andy"?
- Given: a term that is known to be a personal name but is not mentioned "as is" in a header (otherwise the problem is easy).
- Output: ranked person nodes.
[Figure: file nodes and person nodes; the term "andy" is linked through files to the person "Andrew Johns"]
This task is complementary to person-name annotation in email (E. Minkov, R. Wang, W. Cohen, "Extracting Personal Names from Emails: Applying Named Entity Recognition to Informal Text", HLT/EMNLP 2005).
45. Corpora and Datasets
(a) Corpora; (b) types of names.
Example nicknames: "Dave" for "David", "Kai" for "Keiko", "Jenny" for "Qing".
46. Person Name Disambiguation
- 1. Baseline: string matching (+ common nicknames)
  - Find persons whose names are similar to the name term (Jaro)
  - Successful in many cases
  - Not successful for some nicknames
  - Cannot handle ambiguity (picks arbitrarily)
- 2. Graph walk: term
  - Vq = {name term} (2 steps)
  - Models co-occurrences
  - Cannot handle ambiguity (picks the dominant person)
- 3. Graph walk: term + file
  - Vq = {name term, file node} (2 steps)
  - The file node is naturally available context
  - Solves the ambiguity problem!
  - But incorporates additional noise.
- 4. Graph walk: term + file, re-ranked using learning
  - Re-rank the output of (3), using:
    - path-describing features
    - source count: do the paths originate from one source node or two?
    - string similarity
47. Results
48. Results [Chart comparing: after learning-to-rank; graph walk from {name, file}; graph walk from {name}; baseline string match + nicknames]
49. Results (Enron execs)
50. Results
51. Learning
- There is no single best measure of similarity.
- How can you learn to better rank graph nodes for a particular task?
- Learning methods for graph walks:
  - The walk parameters can be adjusted using gradient descent methods (Diligenti et al., IJCAI 2005).
  - We explored a node re-ranking approach, which can take advantage of a wider range of features (and is complementary to parameter tuning).
  - The features of a candidate answer y describe the set of paths from the query x to y.
52. Re-ranking overview
- Boosting-based re-ranking, following Collins and Koo (Computational Linguistics, 2005).
- A training example includes:
  - a ranked list of l_i nodes,
  - each node represented by m features,
  - at least one known correct node.
- Scoring function: the original walk score of y given x, plus a linear combination of the feature values; find the weight vector w that minimizes the (boosted version of the) loss.
- Requires binary features, and has a closed-form formula for finding the best feature and update delta in each iteration.
53. Path-describing Features
- The set of paths to a target node at step k is recovered in full.
- Edge unigram features: was edge type l used in reaching x from Vq?
- Edge bigram features: were edge types l1 and l2 used (in that order) in reaching x from Vq?
- Top edge bigram features: were edge types l1 and l2 used (in that order) in reaching x from Vq, among the top two highest-scoring paths?

[Figure: nodes x1-x5 over walk steps k = 0, 1, 2]
- Paths to (x3, k = 2):
  - x2 → x1 → x3
  - x4 → x1 → x3
  - x2 → x2 → x3
  - x2 → x3
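Turning recovered paths into features and a re-ranking score can be sketched as follows. The edge labels, paths, and weights are illustrative, and the linear score here stands in for the learned boosted combination described above.

```python
import math

# Sketch of path-describing features: each candidate answer node is
# represented by binary edge-unigram and edge-bigram indicators over the
# paths that reached it, combined linearly with the log of the original
# walk score. Labels, paths, and weights below are illustrative.
def path_features(paths):
    """paths: iterable of edge-label sequences; returns a feature set."""
    feats = set()
    for path in paths:
        for l in path:                          # edge unigram features
            feats.add(("uni", l))
        for l1, l2 in zip(path, path[1:]):      # edge bigram features
            feats.add(("bi", l1, l2))
    return feats

def rerank_score(walk_prob, feats, weights, w0=1.0):
    """Original walk score plus a linear combination of binary features."""
    return w0 * math.log(walk_prob) + sum(weights.get(f, 0.0) for f in feats)

paths_to_y = [("sent_from", "sent_to_inv"), ("has_term", "has_term_inv")]
feats = path_features(paths_to_y)
weights = {("bi", "sent_from", "sent_to_inv"): 2.0}   # toy "learned" weight
score = rerank_score(0.1, feats, weights)
```

Because the features are binary indicators, a boosting-style learner can pick the single best feature and its weight update in closed form at each iteration, as the previous slide notes.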
54. Results
55. Threading
- Threading is an interesting problem because:
  - There are often irregularities in thread structural information, so thread discourse should be captured using an intelligent approach (D.E. Lewis and K.A. Knowles, "Threading email: a preliminary study", Information Processing and Management, 1997).
  - Threading information can improve message categorization into topical folders (B. Klimt and Y. Yang, "The Enron corpus: a new dataset for email classification research", ECML, 2004).
  - Adjacent messages in a thread can be assumed to be the most similar to each other in the corpus; therefore, threading is related to the general problem of finding similar messages in a corpus.
- The task: given a message, retrieve the adjacent messages in its thread.
56-59. Some intuition
[Figure, built up over four slides: a message (file x) is connected to adjacent messages in its thread through shared content, the social network, and the timeline.]
60. Threading experiments
- 1. Baseline: TF-IDF similarity. Consider all the available information (header + body) as text.
- 2. Graph walk, uniform: start from the file node, 2 steps, uniform edge weights.
- 3. Graph walk, random: start from the file node, 2 steps, random edge weights (best of 10).
- 4. Graph walk, re-ranked: re-rank the output of (3) using the graph-describing features.
61. Results
- Highly-ranked edge bigrams:
  - sent-from → sent-to^-1
  - date-of → date-of^-1
  - has-term → has-term^-1
62. Finding email-aliases given a name
Given a person's name (a term node), retrieve the full set of relevant email addresses (email-address nodes).
63. Finding Meeting Attendees
(Minkov et al., CEAS 2006)
- The extended graph contains 2 months of calendar data.
64. Main Contributions
- Presented an extended similarity measure incorporating non-textual objects.
- Finite lazy random walks to perform typed search.
- A re-ranking paradigm to improve on graph-walk results.
- An instantiation of this framework for email.
- Defined and evaluated novel tasks for email.
65. Another Task that Can Be Formulated as a Graph Query: GeneId Ranking
- Given: a biomedical paper abstract.
- Find: the geneId for every gene mentioned in the abstract.
- Method: from paper x, produce a ranked list of geneIds y, ordered by x ~ y.
- Background resources:
  - a synonym list: geneId → {name1, name2, ...}
  - one or more protein NER systems
  - training/evaluation data: pairs of (paper, {geneId1, ..., geneIdn})
66. Sample abstracts and synonyms
- MGI96273
  - Htr1a
  - 5-hydroxytryptamine (serotonin) receptor 1A
  - 5-HT1A receptor
- MGI104886
  - Gpx5
  - glutathione peroxidase 5
  - Arep
- ...
- 52,000 synonym entries for mouse, 35,000 for fly
[Figure labels: true labels; NER extractor]
67. Graph for the task
[Figure: abstract nodes (e.g. file=doc115) link via hasProtein edges to extracted protein strings (CA1, HT1A, HT1) and via hasTerm edges to term nodes (term=HT, term=1, term=A, term=CA, term=hippocampus, ...); synonym strings ("5-HT1A receptor", "Htr1a", "eIF-1A") share those terms (hasTerm/inFile edges) and link via synonym edges to geneId nodes (MGI95298, MGI46273).]
68. [Figure: the same graph, extended with noisy training abstracts (file=doc214, file=doc523, file=doc6273, ...) connected to their known geneIds.]
69. Experiments
- Data: BioCreative Task 1B
  - mouse: 10,000 training abstracts, 250 dev-test (using the first 150 for now); 50,000 geneIds; the graph has 525,000 nodes
- NER systems:
  - likelyProtein: trained on yapex.train using off-the-shelf NER methods (Minorthird)
  - possibleProtein: same, but tuned (on yapex.test) to optimize F3 rather than F1 (rewards recall over precision)
70. Experiments with NER

System    Test set    Token Prec  Token Rec  Span Prec  Span Rec  Span F1
likely    yapex.test  94.9        64.8       87.2       62.1      72.5
possible  yapex.test  49.0        97.4       47.2       82.5      60.0
likely    mouse       81.6        31.3       66.7       26.8      45.3
possible  mouse       43.9        88.5       30.4       56.6      39.6
dictionary mouse      50.1        46.9       24.5       43.9      31.4
71. Experiments with Graph Search
- Baseline method:
  - extract entities of type x
  - for each string of type x, find the best-matching synonym, and then its geneId
    - consider only synonyms sharing >1 token
    - Soft/TFIDF distance
    - break ties randomly
  - rank geneIds by the number of times they are reached
    - rewards multiple mentions (even via alternate synonyms)
- Evaluation: average, over 50 test documents, of:
  - non-interpolated average precision (plausible for curators)
  - max F1 over all cutoffs
72. Experiments with Graph Search

mouse eval dataset             MAP    max F1
likelyProtein + softTFIDF      45.0   58.1
possibleProtein + softTFIDF    62.6   74.9
graph walk                     51.3   64.3
73. Baseline vs. Graph Walk
- The baseline includes:
  - softTFIDF distances from NER entities to gene synonyms
  - knowledge that the shortcut path doc → entity → synonym → geneId is important
- The graph includes:
  - IDF effects, correlations, training data, etc.
- Proposed graph extension:
  - add softTFIDF and shortcut edges
- Learning and re-ranking:
  - start with local features fi(e) of edges e: u → v
  - for an answer y, compute the expectations E(fi(e) | start = x, end = y)
  - use the expectations as feature values, with the voted perceptron (Collins, 2002) as the learning-to-rank method.
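The expectation features can be sketched by propagating the walker's distribution and accumulating expected edge-label traversal counts. This simplified version conditions only on the start node x (the method above also conditions on the end node y); the graph and labels are illustrative.

```python
from collections import defaultdict

# Sketch of expectation-based features: propagate the walker's
# distribution for K steps and accumulate the expected number of times
# each edge label is traversed. (A simplification: the talk's features
# condition on both the start node x and the answer node y; this version
# conditions only on the start node.)
def expected_label_counts(edges, start, K=3):
    """edges: dict node -> list of (label, neighbor) pairs (directed)."""
    expect = defaultdict(float)       # label -> expected traversal count
    dist = {start: 1.0}               # current distribution over nodes
    for _ in range(K):
        nxt = defaultdict(float)
        for z, p in dist.items():
            out = edges.get(z, [])
            for label, y in out:
                q = p / len(out)      # uniform choice among out-edges
                expect[label] += q    # expected traversals of this label
                nxt[y] += q
        dist = nxt
    return dict(expect)

edges = {"x": [("sent_from", "a"), ("has_term", "t")],
         "a": [("sent_to_inv", "y")]}
exp_counts = expected_label_counts(edges, "x", K=2)
```

These real-valued expectations would then be the feature vector fed to a learned linear ranker such as the voted perceptron.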
74. Experiments with Graph Search

mouse eval dataset               MAP    average max F1
likelyProtein + softTFIDF        45.0   58.1
possibleProtein + softTFIDF      62.6   74.9
graph walk                       51.3   64.3
walk + extra links               73.0   80.7
walk + extra links + learning    79.7   83.9
75. Experiments with Graph Search
76. Hot off the presses
- Ongoing work: learn an NER system from pairs of (document, geneIdList)
  - much easier to obtain such training data than documents in which every occurrence of every gene name is highlighted (the usual NER training data)
  - obtains an F1 of 71.1 on the mouse data (vs. 45.3 when training on the YAPEX data, which is from a different distribution)
77. Experiments with Graph Search

mouse eval dataset               MAP (Yapex-trained)
likelyProtein + softTFIDF        45.0
possibleProtein + softTFIDF      62.6
graph walk                       51.3
walk + extra links               73.0
walk + extra links + learning    79.7
78. Experiments with Graph Search

mouse eval dataset               MAP (Yapex-trained)  MAP (MGI-trained)
likelyProtein + softTFIDF        45.0                 72.7
possibleProtein + softTFIDF      62.6                 65.7
graph walk                       51.3                 54.4
walk + extra links               73.0                 76.7
walk + extra links + learning    79.7                 84.2
79. Experiments on the BioCreative Blind Test Set

mouse blind test data            MAP (Yapex-trained)  Max F1 (Yapex-trained)
likelyProtein + softTFIDF        36.8  (dev: 45.0)    42.1  (dev: 58.1)
possibleProtein + softTFIDF      61.1  (dev: 62.6)    67.2  (dev: 74.9)
graph walk                       64.0  (dev: 51.3)    69.5  (dev: 64.3)
walk + extra links + learning    71.1  (dev: 79.7)    75.5  (dev: 83.9)
80. Experiments with Graph Search

mouse blind test data            MAP (Yapex-trained)  Max F1 (Yapex-trained)
likelyProtein + softTFIDF        36.8                 42.1
possibleProtein + softTFIDF      61.1                 67.2
graph walk                       64.0                 69.5
walk + extra links + learning    71.1                 75.5

mouse blind test data            MAP (MGI-trained)    Max F1 (MGI-trained)
walk + extra links + learning    80.1                 83.7
81.

mouse blind test data            MAP (Yapex-trained)  Average max F1 (Yapex-trained)
walk + extra links + learning    71.1                 75.5

mouse blind test data            MAP (MGI-trained)    Average max F1 (MGI-trained)
walk + extra links + learning    80.1                 83.7
82. Outline
- Two views on data quality
- Cleaning your data vs living with the mess.
- A lazy/Bayesian view of data cleaning
- A framework for querying dirty data
- Data model
- Query language
- Baseline results (biotext and email)
- How to improve results with learning
- Learning to re-rank query output
- Conclusions
83. Conclusions
- Contributions:
  - a very simple query language for graphs, based on a diffusion-kernel (damped PageRank, ...) similarity metric
  - experiments on natural types of queries:
    - finding likely meeting attendees
    - finding related documents (email threading)
    - disambiguating person and gene/protein entity names
  - techniques for learning to answer queries:
    - re-ranking using expectations of simple, local features
    - tuning performance to a particular similarity
84. Conclusions
- Some open problems:
  - scalability and efficiency:
    - a K-step walk on a node-node graph with fan-out b is O(KbN)
    - accurate sampling is O(1 min) for 10-step walks with O(10^6) nodes
  - faster, better learning methods:
    - combine re-ranking with learning the parameters of the graph walk
  - add language modeling, topic modeling:
    - extend the graph to include models as well as data
85. Conclusions
- Don't forget that there are two views on data quality:
  - Cleaning your data vs. living with the mess.
  - A lazy/Bayesian view of data cleaning.
  - SQL/Oracle vs. Google
  - vs. something in between...?