A Framework for Learning to Query Heterogeneous Data
(Transcript of a PowerPoint presentation)
1
A Framework for Learning to Query Heterogeneous
Data
  • William W. Cohen
  • Machine Learning Department and Language
    Technologies Institute
  • School of Computer Science
  • Carnegie Mellon University
  • joint work with
  • Einat Minkov, Andrew Ng, Richard Wang, Anthony
    Tomasic, Bob Frederking

2
Outline
  • Two views on data quality
  • Cleaning your data vs living with the mess.
  • A lazy/Bayesian view of data cleaning
  • A framework for querying dirty data
  • Data model
  • Query language
  • Baseline results (biotext and email)
  • How to improve results with learning
  • Learning to re-rank query output
  • Conclusions

3-6
(No transcript for slides 3-6)
7
A Bayesian Looks at Record Linkage
  • Record linkage problem: given two sets of records
    A = {a1, ..., am} and B = {b1, ..., bn}, determine when
    referent(ai) = referent(bj)
  • Idea: compute, for each (ai, bj) pair,
    Pr(referent(ai) = referent(bj))
  • Pick two thresholds:
  • Pr(a=b) > HI → accept pairing
  • Pr(a=b) < LO → reject pairing
  • otherwise, clerical review by a human clerk
  • Every optimal decision boundary is defined by a
    threshold on the ranked list.
  • Thresholds depend on the prior probability of a and b
    matching.

A      B      Pr(A=B)
A17    B22    0.99
A43    B07    0.98
...
A21    B13    0.85
A37    B44    0.82
A84    B03    0.79
A83    B71    0.63
...
A24    B52    0.25
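The two-threshold decision rule above can be sketched as follows (an illustrative reconstruction, not the original system's code; the HI/LO values are made-up examples and the pair probabilities echo the table):

```python
# Sketch of the two-threshold (accept / clerical-review / reject) linkage
# rule. Thresholds are illustrative; probabilities echo the table above.

def triage(pairs, hi=0.90, lo=0.30):
    """Split (a, b, prob) pairs into accept / clerical-review / reject."""
    accept = [(a, b) for a, b, p in pairs if p > hi]
    review = [(a, b) for a, b, p in pairs if lo <= p <= hi]
    reject = [(a, b) for a, b, p in pairs if p < lo]
    return accept, review, reject

pairs = [("A17", "B22", 0.99), ("A43", "B07", 0.98),
         ("A21", "B13", 0.85), ("A83", "B71", 0.63),
         ("A24", "B52", 0.25)]
accept, review, reject = triage(pairs)
```

Because the list is ranked by probability, any choice of HI and LO corresponds to two cut points on the list, which is exactly the "optimal decision boundary is a threshold" observation.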
8
A Bayesian Looks at Record Linkage
  • Every optimal decision boundary is defined by a
    threshold on the ranked list.

2^nm ways to link
A      B      Pr(A=B)
A17    B22    0.99
A43    B07    0.98
A16    B23    0.91
A21    B13    0.85
A37    B44    0.82
A84    B03    0.79
A83    B71    0.63
A91    B21    0.46
A24    B52    0.25
  • In other words:
  • 2^nm - nm linkages can be discarded as
    impossible
  • of the remaining nm, all but those between LO and HI can be
    discarded as improbable

(Figure: the nm candidate pairs down the ranked list, each labeled M (match) or U (unmatch))
  • But wait: why doesn't the human clerk pick a
    threshold between LO and HI?

9
A Bayesian Looks at Record Linkage
A      B      Pr(A=B)
A17    B22    0.99
A43    B07    0.98
A32    B72    0.91
A21    B13    0.85
A37    B44    0.82
A84    B03    0.79
A83    B71    0.63
A21    B43    0.46
A24    B52    0.25
(Figure: M (match) / U (unmatch) labels shown alongside the ranked list)
10
Linking multiple relations: database hardening
Database S1 (extracted from paper 1's title page)
Database S2 (extracted from paper 2's
bibliography)
11
Using multiple relations: database hardening
So this gives some known matches, which might
interact with proposed matches; e.g. here we
deduce...
12
Soft database from IE
Hard database suitable for Oracle, MySQL, etc
13
Using multiple relations: database hardening
  • (McAllester et al, KDD 2000) Defined hardening:
  • Find an interpretation (a map variant → name) that
    produces a compact version of the soft database S.
  • Probabilistic interpretation of hardening:
  • The original soft data S is a version of latent
    hard data H.
  • Hardening finds the max-likelihood H.
  • Hardening is hard!
  • Optimal hardening is NP-hard.
  • Greedy algorithm:
  • naive implementation is quadratic in the size of S
  • clever data structures make it O(n log n), where n = |S|d
  • Other related work:
  • Pasula et al, NIPS 2002: a more explicit generative
    Bayesian formulation and MCMC method, with
    experimental support
  • Wellner & McCallum 2004, Parag & Domingos 2004,
    Culotta & McCallum 2005, ...

14
A Bayesian Looks at Record Linkage
A      B      Pr(A=B)
A17    B22    0.99
A43    B07    0.98
A32    B72    0.91
A21    B13    0.85
A37    B44    0.82
A84    B03    0.79
A83    B71    0.63
A21    B43    0.46
A24    B52    0.25
  • An alternate view of the process:
  • F-S's (Fellegi-Sunter's) method answers the question directly for
    the cases that everyone would agree on.
  • Human effort is used to answer the cases that are
    a little harder.

(Figure: M (match) / U (unmatch) labels shown alongside the ranked list)
15
A Bayesian Looks at Record Linkage
A      B      Pr(A=B)
A17    B22    0.99
A43    B07    0.98
A32    B72    0.91
A21    B13    0.85
A37    B44    0.82
A84    B03    0.79
A83    B71    0.63
A21    B43    0.46
A24    B52    0.25
  • An alternate view of the process:
  • F-S's (Fellegi-Sunter's) method answers the question directly for
    the cases that everyone would agree on.
  • Human effort is used to answer the cases that are
    a little harder.

(Figure: M/U labels shown for the clear-cut top and bottom of the ranked list; the middle region is marked "?")
Q: is A43 in B?  A: yes (p=0.98)
Q: is A83 in B?  A: not clear
Q: is A21 in B?  A: unlikely
16
Passing linkage decisions along to the user
  • Usual goal: link records and create a single,
    highly accurate database for the user's query.
  • Equality is often uncertain, given the available
    information about an entity:
  • "name: T. Kennedy  occupation: terrorist"
  • The interpretation of equality may change from
    user to user and application to application:
  • Does Boston Market = McDonald's?
  • Alternate goal: wait for a query, then answer it,
    propagating uncertainty about linkage decisions
    on that query to the end user
17
WHIRL project (1997-2000)
  • WHIRL initiated while at AT&T Bell Labs

AT&T Research
AT&T Labs - Research
AT&T Research
AT&T Labs
AT&T Research Shannon Laboratory
AT&T Shannon Labs
18
When are two entities the same?
  • Bell Labs
  • Bell Telephone Labs
  • AT&T Bell Labs
  • AT&T Labs
  • AT&T Labs-Research
  • AT&T Labs Research, Shannon Laboratory
  • Shannon Labs
  • Bell Labs Innovations
  • Lucent Technologies/Bell Labs Innovations

1925
History of Innovation: From 1925 to today, AT&T
has attracted some of the world's greatest
scientists, engineers and developers.
www.research.att.com
Bell Labs Facts Bell Laboratories, the research
and development arm of Lucent Technologies, has
been operating continuously since 1925
bell-labs.com
19
When are two entities the same?
Buddhism rejects the key element in folk
psychology: the idea of a self (a unified
personal identity that is continuous through
time). King Milinda and Nagasena (the Buddhist
sage) discuss personal identity. Milinda
gradually realizes that "Nagasena" (the word)
does not stand for anything he can point to:
not the hairs on Nagasena's head, nor the hairs
of the body, nor the "nails, teeth, skin,
muscles, sinews, bones, marrow, kidneys, ..."
etc. Milinda concludes that "Nagasena" doesn't
stand for anything. If we can't say what a person
is, then how do we know a person is the same
person through time? There's really no you,
and if there's no you, there are no beliefs or
desires for you to have. The folk psychology
picture is profoundly misleading, and believing it
will make you miserable. -S. LaFave
20
Traditional approach
Linkage
Queries
Uncertainty about what to link must be decided by
the integration system, not the end user
21
WHIRL vision
R.a       S.a       S.b      T.b
Anhai     Anhai     Doan     Doan
Dan       Dan       Weld     Weld
Strongest links: those agreeable to most users
William   Will      Cohen    Cohn
Steve     Steven    Minton   Mitton
Weaker links: those agreeable to some users
(even weaker links)
William   David     Cohen    Cohn
22
WHIRL vision
DB1 + DB2 → DB
Link items as needed by the query Q
R.a       S.a       S.b      T.b
Anhai     Anhai     Doan     Doan
Dan       Dan       Weld     Weld
Incrementally produce a ranked list of possible
links, with best matches first. The user (or a
downstream process) decides how much of the list
to generate and examine.
William   Will      Cohen    Cohn
Steve     Steven    Minton   Mitton
William   David     Cohen    Cohn
23
WHIRL queries
  • Assume two relations:
  • review(movieTitle, reviewText): archive of reviews
  • listing(theatre, movieTitle, showTimes, ...): now
    showing

The Hitchhiker's Guide to the Galaxy, 2005: This is a faithful re-creation of the original radio series; not surprisingly, as Adams wrote the screenplay ...
Men in Black, 1997: Will Smith does an excellent job in this ...
Space Balls, 1987: Only a die-hard Mel Brooks fan could claim to enjoy ...

Star Wars Episode III: The Senator Theater, 1:00, 4:15, 7:30pm.
Cinderella Man: The Rotunda Cinema, 1:00, 4:30, 7:30pm.

24
WHIRL queries
  • Find reviews of sci-fi comedies (movie domain):
  • FROM review SELECT * WHERE r.text ~ 'sci fi comedy'
  • (like standard ranked retrieval for the query 'sci-fi
    comedy')
  • Where is that sci-fi comedy playing?
  • FROM review AS r, listing AS s SELECT *
  • WHERE r.title ~ s.title AND r.text ~ 'sci fi comedy'
  • (best answers: titles are similar to each other,
    e.g., Hitchhiker's Guide to the Galaxy and The
    Hitchhiker's Guide to the Galaxy, 2005, and the
    review text is similar to 'sci fi comedy')

25
WHIRL queries
  • Similarity is based on TF-IDF: rare words are most
    important.
  • Search for high-ranking answers uses inverted
    indices.

The Hitchhikers Guide to the Galaxy, 2005
Men in Black, 1997
Space Balls, 1987

Star Wars Episode III
Hitchhikers Guide to the Galaxy
Cinderella Man

26
WHIRL queries
  • Similarity is based on TF-IDF: rare words are most
    important.
  • Search for high-ranking answers uses inverted
    indices.

- It is easy to find the (few) items that match
on important terms
- Search for strong matches can prune unimportant
terms

The Hitchhiker's Guide to the Galaxy, 2005
Men in Black, 1997
Space Balls, 1987

Star Wars Episode III
Hitchhiker's Guide to the Galaxy
Cinderella Man

hitchhiker → movie137
the → movie001, movie003, movie007, movie008, movie013, movie018, movie023, movie031, ...
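The pruning idea above can be sketched as a toy TF-IDF similarity join (an illustrative reconstruction, not WHIRL's actual implementation; the movie titles are the examples from the slides, and the tokenizer is deliberately crude):

```python
import math
from collections import defaultdict

# Toy TF-IDF similarity join with an inverted index: candidate matches are
# generated only from shared terms, so titles with no words in common are
# never scored at all.

def tokenize(s):
    return s.lower().replace(",", " ").split()

def tfidf_vectors(docs):
    """Unit-normalized TF-IDF vectors for a list of strings."""
    toks = [tokenize(d) for d in docs]
    df = defaultdict(int)
    for t in toks:
        for w in set(t):
            df[w] += 1
    n = len(docs)
    vecs = []
    for t in toks:
        v = {w: t.count(w) * math.log(1 + n / df[w]) for w in set(t)}
        norm = math.sqrt(sum(x * x for x in v.values()))
        vecs.append({w: x / norm for w, x in v.items()})
    return vecs

def best_matches(query, docs):
    vecs = tfidf_vectors(docs + [query])
    qv, dvecs = vecs[-1], vecs[:-1]
    index = defaultdict(list)           # inverted index: term -> doc ids
    for i, v in enumerate(dvecs):
        for w in v:
            index[w].append(i)
    scores = defaultdict(float)
    for w, wt in qv.items():            # only walk postings of query terms
        for i in index[w]:
            scores[i] += wt * dvecs[i][w]
    return sorted(scores.items(), key=lambda kv: -kv[1])

listings = ["Star Wars Episode III", "Hitchhikers Guide to the Galaxy",
            "Cinderella Man"]
ranked = best_matches("The Hitchhikers Guide to the Galaxy, 2005", listings)
```

Only the one listing that shares terms with the query is ever touched, which is the point of the inverted-index pruning on this slide.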

27
WHIRL results
  • This sort of worked:
  • Interactive speeds (<0.3s/query) with a few hundred
    thousand tuples.
  • For 2-way joins, average precision (sort of like
    area under the precision-recall curve) from 85% to
    100% on 13 problems in 6 domains.
  • Average precision better than 90% on 5-way joins
28
WHIRL and soft integration
  • WHIRL worked for a number of web-based demo
    applications.
  • e.g., integrating data from 30-50 smallish web
    DBs with <1 FTE of labor
  • WHIRL could link many data types reasonably well,
    without engineering
  • WHIRL generated numerous papers (SIGMOD98, KDD98,
    Agents99, AAAI99, TOIS2000, AIJ2000, ICML2000,
    JAIR2001)
  • WHIRL was relational
  • But see ELIXIR (SIGIR2001)
  • WHIRL users need to know the schema of the source DBs
  • WHIRL's query-time linkage worked only for TF-IDF,
    token-based distance metrics
  • → text fields with few misspellings
  • WHIRL was memory-based
  • all data must be centrally stored, so no federated
    data
  • → small datasets only

29
WHIRL vision: very radical, everything was
inter-dependent
Link items as needed by Q
To make SQL-like queries, the user must understand
the schema of the underlying DB (and hence
someone must understand DB1, DB2, DB3, ...)
R.a S.a S.b T.b
Anhai Anhai Doan Doan
Dan Dan Weld Weld
Incrementally produce a ranked list of possible
links, with best matches first. User (or
downstream process) decides how much of the list
to generate and examine.
William Will Cohen Cohn
Steve Steven Minton Mitton
William David Cohen Cohn
30
Outline
  • Two views on data quality
  • Cleaning your data vs living with the mess.
  • A lazy/Bayesian view of data cleaning
  • A framework for querying dirty data
  • Data model
  • Query language
  • Baseline results (biotext and email)
  • How to improve results with learning
  • Learning to re-rank query output
  • Conclusions

31
BANKS: Basic Data Model
  • Database is modeled as a graph
  • Nodes = tuples
  • Edges = references between tuples
  • foreign key, inclusion dependencies, ...
  • Edges are directed.

The user need not know the organization of the database to
formulate queries.

(Figure: example graph with paper nodes "MultiQuery Optimization" and "BANKS: Keyword search", "writes" edges, and author nodes S. Sudarshan, Prasan Roy, Charuta)
32
BANKS: Answer to Query
Query: "sudarshan roy"   Answer: subtree from the
graph

(Figure: answer subtree: the paper "MultiQuery Optimization" connected by "writes" edges to the author nodes S. Sudarshan and Prasan Roy)
33
BANKS: Basic Data Model
  • Database is modeled as a graph
  • Nodes = tuples
  • Edges = references between tuples
  • edges are directed
  • foreign key, inclusion dependencies, ...

34
BANKS: Basic Data Model
(not quite so basic)
  • All information (not just the database) is modeled as a graph
  • Nodes = tuples or documents or strings or words
  • Edges = references between nodes (not just tuples)
  • edges are directed, labeled and weighted
  • foreign key, inclusion dependencies, ...
  • doc/string D to word contained by D (TF-IDF
    weighted, perhaps)
  • word W to doc/string containing W (inverted
    index)
  • string S to strings similar to S

35
Similarity in a BANKS-like system
  • Motivation: why I'm interested in
  • structured data that is partly text: similarity!
  • structured data represented as graphs: all sorts
    of information can be poured into this model
  • measuring similarity of nodes in graphs
  • Coming up next:
  • a simple query language for graphs
  • experiments on natural types of queries
  • techniques for learning to answer queries of a
    certain type better

36
Yet another schema-free query language
  • Assume data is encoded in a graph with
  • a node for each object x
  • a type for each object x, T(x)
  • an edge for each binary relation r: x → y
  • Queries are of this form:
  • Given type t and node x, find y : T(y) = t and y ~ x.
  • We'd like to construct a general-purpose
    similarity function x ~ y for objects in the
    graph
  • We'd also like to learn many such functions for
    different specific tasks (like "who should attend
    a meeting")

Node similarity
37
Similarity of Nodes in Graphs
  • Given type t and node x, find y : T(y) = t and y ~ x.
  • Similarity defined by a damped version of
    PageRank
  • Similarity between nodes x and y:
  • Random surfer model: from a node z,
  • with probability α, stop and output z
  • pick an edge label r using Pr(r | z) ... e.g.
    uniform
  • pick y uniformly from {y : z → y with label r}
  • repeat from node y ...
  • Similarity x ~ y = Pr(output y | start at x)
  • Intuitively, x ~ y is a summation of the weights of all
    paths from x to y, where the weight of a path decreases
    exponentially with its length.
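The damped walk above can be sketched as a distribution-propagation loop (a toy reconstruction on a made-up graph; the edge-label choice Pr(r | z) is collapsed into a single uniform out-edge choice, and α = 0.5 is illustrative):

```python
# Sketch of the damped-walk similarity: from node z, with probability alpha
# stop and output z; otherwise follow a uniformly chosen out-edge.
# x ~ y = Pr(output y | start at x).

def walk_similarity(edges, start, alpha=0.5, steps=10):
    out = {}
    for u, v in edges:
        out.setdefault(u, []).append(v)
    dist = {start: 1.0}   # probability of being at each node, still walking
    emitted = {}          # accumulated output (stopping) probability
    for _ in range(steps):
        nxt = {}
        for z, p in dist.items():
            emitted[z] = emitted.get(z, 0.0) + alpha * p
            succ = out.get(z, [])
            if succ:
                for y in succ:
                    nxt[y] = nxt.get(y, 0.0) + (1 - alpha) * p / len(succ)
            else:  # dead end: the walk halts and outputs z
                emitted[z] = emitted.get(z, 0.0) + (1 - alpha) * p
        dist = nxt
    return emitted

edges = [("roy", "paper1"), ("paper1", "sudarshan"), ("paper1", "roy")]
sim = walk_similarity(edges, "roy")
```

Longer paths contribute a factor of (1 - α) per step, which is the "weight decreases exponentially with length" intuition.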

38
BANKS: Basic Data Model
(not quite so basic)
  • All information (not just the database) is modeled as a graph
  • Nodes = tuples or documents or strings or words
  • Edges = references between nodes (not just tuples)
  • edges are directed, labeled and weighted
  • foreign key, inclusion dependencies, ...
  • doc/string D to word contained by D (TF-IDF
    weighted, perhaps)
  • word W to doc/string containing W (inverted
    index)
  • string S to strings similar to S

(Figure: strings "William W. Cohen, CMU" and "Dr. W. W. Cohen" linked through shared word nodes cohen, william, w, cmu, dr. Optionally: strings that are similar in TF-IDF/cosine distance will still be nearby in the graph, connected by many length-2 paths.)
39
Similarity of Nodes in Graphs
  • Random surfer on graphs:
  • natural extension of PageRank
  • closely related to Lafferty's heat diffusion
    kernel
  • but generalized to directed graphs
  • somewhat amenable to learning the parameters of the
    walk (gradient search, with various optimization
    metrics)
  • Toutanova, Manning & Ng, ICML 2004
  • Nie et al, WWW 2005
  • Xi et al, SIGIR 2005
  • can be sped up and adapted to longer walks by
    sampling approaches to matrix multiplication
    (e.g. Lewis & E. Cohen, SODA 1998), similar to
    particle filtering
  • our current implementation (GHIRL): Lucene +
    Sleepycat, with extensive use of memory caching
    (sampling approaches visit many nodes repeatedly)

40
Query: "sudarshan roy"   Answer: subtree from the
graph

(Figure: the paper "MultiQuery Optimization" connected by "writes" edges to the author nodes S. Sudarshan and Prasan Roy)
41
y : paper(y), y ~ roy
w : paper(w), w ~ roy
AND
Query: "sudarshan roy"   Answer: subtree from the
graph
42
Evaluation on Personal Information Management
Tasks
(Minkov et al, SIGIR 2006)
Many tasks can be expressed as simple,
non-conjunctive search queries in this framework.
  • Such as:
  • Person Name Disambiguation in Email (novel; cf. Diehl, Getoor & Namata, 2006)
  • Threading (cf. Lewis & Knowles 97)
  • Finding email-address aliases given a person's
    name (novel)
  • Finding relevant meeting attendees (novel)

What is the email address for the person named
"Halevy" mentioned in this presentation? What
files from my home machine will I need for this
meeting? What people will attend this
meeting? ...

Also consider a generalization: x → Vq, where Vq is a
distribution over nodes x
43
Email as a graph
(Figure: email modeled as a graph with node types: email addresses, person names, files, dates, and terms; edge types include sent_to, sent_from, sent_date, alias, in_file, in_subj, their inverses st_inv, sf_inv, sd_inv, a_inv, if_inv, is_inv, and 1_day links between dates.)
44
Person Name Disambiguation

(Figure: a term node "andy" connected to several file nodes, which connect to person nodes, e.g. "Andrew Johns")
  • Q: who is "Andy"?
  • Given a term that is known to be a
    personal name, but is not mentioned as-is in the
    header (otherwise the problem is easy)
  • Output: ranked person nodes.

This task is complementary to person name
annotation in email (E. Minkov, R. Wang,
W. Cohen, Extracting Personal Names from Emails:
Applying Named Entity Recognition to Informal
Text, HLT/EMNLP 2005)
45
Corpora and Datasets
a. Corpora
Example nicknames: Dave for David, Kai for
Keiko, Jenny for Qing
b. Types of names
46
Person Name Disambiguation
  • 1. Baseline: string matching (+ common nicknames)
  • Find persons that are similar to the name term
    (Jaro)
  • Successful in many cases
  • Not successful for some nicknames
  • Cannot handle ambiguity (picks arbitrarily)
  • 2. Graph walk: term
  • Vq = name term node (2 steps)
  • Models co-occurrences.
  • Cannot handle ambiguity (picks the dominant person)
  • 3. Graph walk: term + file
  • Vq = name term + file nodes (2 steps)
  • The file node is naturally available context
  • Solves the ambiguity problem!
  • But incorporates additional noise.
  • 4. Graph walk: term + file, reranked using learning
  • Re-rank the output of (3), using:
  • path-describing features
  • source count: do the paths originate from a
    single source node or from both?
  • string similarity

47
Results
48
Results
(Plot legend: after learning-to-rank; graph walk from name + file; graph walk from name; baseline string match + nicknames)
49
Results
Enron execs
50
Results
51
Learning
  • There is no single best measure of similarity.
  • How can you learn to better rank graph nodes
    for a particular task?
  • Learning methods for graph walks:
  • The parameters can be adjusted using gradient
    descent methods (Diligenti et al, IJCAI 2005)
  • We explored a node re-ranking approach, which
    can take advantage of a wider range of
    features (and is complementary to parameter
    tuning)
  • Features of a candidate answer y describe the set
    of paths from query x to y

52
Re-ranking overview
  • Boosting-based reranking, following (Collins and
    Koo, Computational Linguistics, 2005)
  • A training example includes:
  • a ranked list of l_i nodes,
  • each node represented through m features,
  • at least one known correct node.
  • Scoring function: a linear combination of the
    original score y ~ x and the features.
  • Find w that minimizes the (boosted version of the)
    loss. Requires binary features, and there is a
    closed-form formula to find the best feature and
    update delta in each iteration.
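A minimal sketch of the reranking score described above (the weights and feature names are made up for illustration; the boosted procedure for learning w, per Collins & Koo, is omitted):

```python
import math

# Illustrative reranking score: a linear combination of the original walk
# score (in log space) and binary path features.

def rerank_score(walk_prob, features, w, w0=1.0):
    return w0 * math.log(walk_prob) + sum(w.get(f, 0.0) for f in features)

# Two hypothetical candidates: (walk probability, binary feature set).
candidates = [
    (0.50, {"bigram:sent_from,sent_to_inv"}),
    (0.40, {"bigram:has_term,has_term_inv", "source_count=2"}),
]
w = {"source_count=2": 2.0}   # learned weight: paths came from both sources
ranked = sorted(candidates, key=lambda c: -rerank_score(c[0], c[1], w))
```

Here the learned feature weight lets the second candidate overtake the first despite a lower walk probability, which is exactly what reranking is for.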
53
Path-describing Features
  • The set of paths to a target node within k steps is
    recovered in full.

(Figure: nodes x1..x5 arranged by walk step, k=0, 1, 2)

Edge unigram features: was edge type l used in
reaching x from Vq?
Edge bigram features: were edge types l1 and l2
used (in that order) in reaching x from Vq?
Top edge bigram features: were edge types l1
and l2 used (in that order) in reaching x from
Vq, among the top two highest-scoring paths?
  • Paths to (x3, k=2):
  • x2 → x1 → x3
  • x4 → x1 → x3
  • x2 → x2 → x3
  • x2 → x3
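The first two feature families can be sketched directly from the recovered label paths (edge labels here are illustrative; the "top edge bigram" variant, which restricts to the two highest-scoring paths, is omitted since it only adds a sort):

```python
# Derive edge-unigram and edge-bigram features from the set of edge-label
# paths that reached a candidate node.

def path_features(paths):
    feats = set()
    for path in paths:
        for l in path:
            feats.add(("unigram", l))             # edge type l was used
        for l1, l2 in zip(path, path[1:]):
            feats.add(("bigram", l1, l2))         # l1 then l2, in order
    return feats

paths = [["sent_from", "sent_to_inv"], ["has_term", "has_term_inv"]]
feats = path_features(paths)
```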

54
Results
55
Threading
  • Threading is an interesting problem, because:
  • There are often irregularities in thread
    structural information, so thread discourse
    should be captured using an intelligent approach
    (D.D. Lewis and K.A. Knowles, Threading email: A
    preliminary study, Information Processing and
    Management, 1997)
  • Threading information can improve message
    categorization into topical folders (B. Klimt
    and Y. Yang, The Enron corpus: A new dataset for
    email classification research, ECML, 2004)
  • Adjacent messages in a thread can be assumed to
    be the most similar to each other in the corpus.
    Therefore, threading is related to the general
    problem of finding similar messages in a corpus.

The task: given a message, retrieve the adjacent
messages in its thread
56
Some intuition
(Figure, built up over slides 56-59: a file node file_x is linked to the adjacent messages in its thread through shared content, the social network, and the timeline)
60
Threading experiments
  1. Baseline: TF-IDF similarity. Consider all the
     available information (header + body) as text.
  2. Graph walk, uniform: start from the file node, 2
     steps, uniform edge weights.
  3. Graph walk, random: start from the file node, 2
     steps, random edge weights (best of 10).
  4. Graph walk, reranked: rerank the output of (3)
     using the graph-describing features.

61
Results
  • Highly-ranked edge bigrams:
  • sent-from → sent-to^-1
  • date-of → date-of^-1
  • has-term → has-term^-1

62
Finding email-aliases given a name
Given a person's name (a term node), retrieve the full
set of relevant email addresses (email-address nodes)
63
Finding Meeting Attendees
Minkov et al, CEAS 2006
  • Extended graph contains 2 months of calendar data

64
Main Contributions
  • Presented an extended similarity measure
    incorporating non-textual objects
  • Finite lazy random walks to perform typed search
  • A re-ranking paradigm to improve on graph walk
    results
  • Instantiation of this framework for email
  • Defined and evaluated novel tasks for email

65
Another Task that Can be Formulated as a Graph
Query: GeneId-Ranking
  • Given:
  • a biomedical paper abstract
  • Find:
  • the geneId for every gene mentioned in the
    abstract
  • Method:
  • from paper x, a ranked list of geneIds y, by x ~ y
  • Background resources:
  • a synonym list: geneId → {name1, name2, ...}
  • one or more protein NER systems
  • training/evaluation data: pairs of (paper,
    {geneId1, ..., geneIdn})

66
Sample abstracts and synonyms
  • MGI:96273
  • Htr1a
  • 5-hydroxytryptamine (serotonin) receptor 1A
  • 5-HT1A receptor
  • MGI:104886
  • Gpx5
  • glutathione peroxidase 5
  • Arep
  • ...
  • 52,000 geneIds for mouse, 35,000 for fly

(Figure callouts: true labels; NER extractor)
67
Graph for the task...
(Figure: abstract file nodes (e.g. file doc115) → hasProtein → extracted protein strings (CA1, HT1A, HT1, ...) → hasTerm → term nodes (HT, 1, A, CA, hippocampus, ...); term nodes link via inFile/hasTerm to synonym strings (5-HT1A receptor, Htr1a, eIF-1A, ...), which link via synonym edges to geneId nodes (MGI:95298, MGI:96273, ...))
68
(Figure: the same graph, extended with noisy training abstracts: file doc214, doc523, doc6273, ... linked to their known geneIds)
69
Experiments
  • Data: BioCreative Task 1B
  • mouse: 10,000 training abstracts, 250 devtest (using
    the first 150 for now); 50,000 geneIds; the graph has
    525,000 nodes
  • NER systems:
  • likelyProtein: trained on yapex.train using
    off-the-shelf NER tools (Minorthird)
  • possibleProtein: same, but modified (on
    yapex.test) to optimize F3, not F1 (rewards
    recall over precision)

70
Experiments with NER

                            Token       Token     Span        Span      Span
                            Precision   Recall    Precision   Recall    F1
yapex.test   likely         94.9        64.8      87.2        62.1      72.5
yapex.test   possible       49.0        97.4      47.2        82.5      60.0
mouse        likely         81.6        31.3      66.7        26.8      45.3
mouse        possible       43.9        88.5      30.4        56.6      39.6
mouse        dictionary     50.1        46.9      24.5        43.9      31.4
71
Experiments with Graph Search
  • Baseline method:
  • extract entities of type x
  • for each string of type x, find the best-matching
    synonym, and then its geneId
  • consider only synonyms sharing >1 token
  • Soft/TFIDF distance
  • break ties randomly
  • rank geneIds by the number of times they are reached
  • rewards multiple mentions (even via alternate
    synonyms)
  • Evaluation:
  • average, over 50 test documents, of
  • non-interpolated average precision (plausible for
    curators)
  • max F1 over all cutoffs
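A sketch of this baseline (hypothetical helper names; plain token-overlap Jaccard stands in for the Soft/TFIDF distance, and the shared-token filter is simplified to "any shared token"; the ids and names echo the earlier slide):

```python
from collections import Counter

# Baseline sketch: match each extracted entity string to its most similar
# synonym, then rank geneIds by how often they are reached.

def jaccard(a, b):
    a, b = set(a.lower().split()), set(b.lower().split())
    return len(a & b) / len(a | b)

def rank_gene_ids(entities, synonyms):   # synonyms: {geneId: [name, ...]}
    counts = Counter()
    for e in entities:
        gid, name = max(((g, s) for g, names in synonyms.items() for s in names),
                        key=lambda gs: jaccard(e, gs[1]))
        if jaccard(e, name) > 0:         # simplified shared-token filter
            counts[gid] += 1             # multiple mentions reward the geneId
    return counts.most_common()

synonyms = {"MGI:96273": ["Htr1a", "5-HT1A receptor"],
            "MGI:104886": ["Gpx5", "glutathione peroxidase 5"]}
ranked = rank_gene_ids(["HT1A receptor", "5-HT1A receptor", "Gpx5"], synonyms)
```

Reaching the same geneId via alternate synonyms increments the same counter, which is the "rewards multiple mentions" point above.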

72
Experiments with Graph Search
mouse eval dataset                  MAP     max F1
likelyProtein + softTFIDF           45.0    58.1
possibleProtein + softTFIDF         62.6    74.9
graph walk                          51.3    64.3
73
Baseline vs Graph walk
  • Baseline includes:
  • softTFIDF distances from NER entity to gene
    synonyms
  • knowledge that the shortcut path doc → entity →
    synonym → geneId is important
  • Graph includes:
  • IDF effects, correlations, training data, etc.
  • Proposed graph extension:
  • add softTFIDF and shortcut edges
  • Learning and reranking:
  • start with local features f_i(e) of edges e: u → v
  • for answer y, compute expectations E(f_i(e) |
    start = x, end = y)
  • use the expectations as feature values and the voted
    perceptron (Collins, 2002) as the learning-to-rank
    method.

74
Experiments with Graph Search
mouse eval dataset                  MAP     average max F1
likelyProtein + softTFIDF           45.0    58.1
possibleProtein + softTFIDF         62.6    74.9
graph walk                          51.3    64.3
walk + extra links                  73.0    80.7
walk + extra links + learning       79.7    83.9
75
Experiments with Graph Search
76
Hot off the presses
  • Ongoing work: learn an NER system from pairs of
    (document, geneIdList)
  • much easier to obtain such training data than
    documents in which every occurrence of every gene
    name is highlighted (the usual NER training data)
  • obtains F1 of 71.1 on mouse data (vs 45.3 when
    training on YAPEX data, which is from a different
    distribution)

77
Experiments with Graph Search
mouse eval dataset                  MAP (Yapex-trained)
likelyProtein + softTFIDF           45.0
possibleProtein + softTFIDF         62.6
graph walk                          51.3
walk + extra links                  73.0
walk + extra links + learning       79.7
78
Experiments with Graph Search
mouse eval dataset                  MAP (Yapex-trained)   MAP (MGI-trained)
likelyProtein + softTFIDF           45.0                  72.7
possibleProtein + softTFIDF         62.6                  65.7
graph walk                          51.3                  54.4
walk + extra links                  73.0                  76.7
walk + extra links + learning       79.7                  84.2
79
Experiments on BioCreative Blind Test Set
mouse blind test data               MAP (Yapex-trained)   Max F1 (Yapex-trained)
likelyProtein + softTFIDF           36.8 (45.0)           42.1 (58.1)
possibleProtein + softTFIDF         61.1 (62.6)           67.2 (74.9)
graph walk                          64.0 (51.3)           69.5 (64.3)
walk + extra links + learning       71.1 (79.7)           75.5 (83.9)
(parenthesized numbers: eval-set results from the preceding slides)
80
Experiments with Graph Search
mouse blind test data               MAP (Yapex-trained)   Max F1 (Yapex-trained)
likelyProtein + softTFIDF           36.8                  42.1
possibleProtein + softTFIDF         61.1                  67.2
graph walk                          64.0                  69.5
walk + extra links + learning       71.1                  75.5

mouse blind test data               MAP (MGI-trained)     Max F1 (MGI-trained)
walk + extra links + learning       80.1                  83.7
81
mouse blind test data               MAP (Yapex-trained)   Average Max F1 (Yapex-trained)
walk + extra links + learning       71.1                  75.5
mouse blind test data               MAP (MGI-trained)     Max F1 (MGI-trained)
walk + extra links + learning       80.1                  83.7
82
Outline
  • Two views on data quality
  • Cleaning your data vs living with the mess.
  • A lazy/Bayesian view of data cleaning
  • A framework for querying dirty data
  • Data model
  • Query language
  • Baseline results (biotext and email)
  • How to improve results with learning
  • Learning to re-rank query output
  • Conclusions

83
Conclusions
  • Contributions:
  • a very simple query language for graphs, based on
    a diffusion-kernel (damped PageRank, ...)
    similarity metric
  • experiments on natural types of queries:
  • finding likely meeting attendees
  • finding related documents (email threading)
  • disambiguating person and gene/protein entity
    names
  • techniques for learning to answer queries:
  • reranking using expectations of simple, local
    features
  • tune performance to a particular task's notion of similarity
84
Conclusions
  • Some open problems:
  • scalability / efficiency
  • a K-step walk on a node-node graph with fan-out b is
    O(KbN)
  • accurate sampling is O(1 min) for 10-step walks with
    O(10^6) nodes.
  • faster, better learning methods
  • combine re-ranking with learning the parameters of the
    graph walk
  • add language modeling, topic modeling
  • extend the graph to include models as well as data

85
Conclusions
  • Don't forget that there are two views on data
    quality:
  • Cleaning your data vs. living with the mess.
  • A lazy/Bayesian view of data cleaning
  • SQL/Oracle vs. Google
  • vs. something in between ... ?