Title: Statistical Relational Learning for Knowledge Extraction from the Web
Slide 1: Statistical Relational Learning for Knowledge Extraction from the Web
- Hoifung Poon
- Dept. of Computer Science & Engineering
- University of Washington
Slide 2: Drowning in Information, Starved for Knowledge
WWW
Slide 3: Great Vision: Knowledge Extraction from the Web
Craven et al., "Learning to Construct Knowledge Bases from the World Wide Web," Artificial Intelligence, 1999.
- Also need:
- Knowledge representation and reasoning
- Close the loop: Apply knowledge to extraction
- Machine reading [Etzioni et al., 2007]
Slide 4: Machine Reading: Text → Knowledge
Slide 5: Rapidly Growing Interest
- AAAI-07 Spring Symposium on Machine Reading
- DARPA Machine Reading Program (2009-2014)
- NAACL-10 Workshop on Learning By Reading
- Etc.
Slide 6: Great Impact
- Scientific inquiry and commercial applications
- Literature-based discovery, robot scientists
- Question answering, semantic search
- Drug design, medical diagnosis
- Breach the knowledge acquisition bottleneck for AI and natural language understanding
- Automatically semantify the Web
- Etc.
Slide 7: This Talk
- Statistical relational learning offers promising solutions to machine reading
- Markov logic is a leading unifying framework
- A success story: USP
- Unsupervised, end-to-end machine reading
- Extracts five times as many correct answers as the state of the art, with the highest accuracy of 91%
Slide 8: USP Question-Answer Example
Interestingly, the DEX-mediated IkappaBalpha induction was completely inhibited by IL-2, but not IL-4, in Th1 cells, while the reverse profile was seen in Th2 cells.
Q: What does IL-2 control?
A: The DEX-mediated IkappaBalpha induction
Slide 9: Overview
- Machine reading: Challenges
- Statistical relational learning
- Markov logic
- USP: Unsupervised Semantic Parsing
- Research directions
Slide 10: Key Challenges
- Complexity
- Uncertainty
- Pipeline accumulates errors
- Supervision is scarce
Slide 11: Languages Are Structural
- Morphology: governments; lmpxtm (Hebrew: "according to their families")
- Syntax: IL-4 induces CD11B
- Semantics: Involvement of p70(S6)-kinase activation in IL-10 up-regulation in human monocytes by gp41 ...
- Discourse: George Walker Bush was the 43rd President of the United States. Bush was the eldest son of President G. H. W. Bush and Barbara Bush. ... In November 1977, he met Laura Welch at a barbecue.
Slide 12: Languages Are Structural
- Morphology: govern-ment-s; l-mpx-t-m (Hebrew: "according to their families")
- Syntax: IL-4 induces CD11B [parse tree: S → NP VP; VP → V NP]
- Semantics: Involvement of p70(S6)-kinase activation in IL-10 up-regulation in human monocytes by gp41 ... [event graph: involvement (Theme: up-regulation, Cause: activation); up-regulation (Theme: IL-10, Cause: gp41, Site: human monocyte); activation (Theme: p70(S6)-kinase)]
- Discourse: George Walker Bush was the 43rd President of the United States. Bush was the eldest son of President G. H. W. Bush and Barbara Bush. ... In November 1977, he met Laura Welch at a barbecue.
Slide 13: Knowledge Is Heterogeneous
- Individuals
- E.g., Socrates is a man
- Types
- E.g., Man is mortal
- Inference rules
- E.g., Syllogism
- Ontological relations
- Etc.
[Ontology diagram with ISA and ISPART links among MAMMAL, HUMAN, FACE, and EYE]
Slide 14: Complexity
- Can handle using first-order logic
- Trees, graphs, dependencies, hierarchies, etc. easily expressed
- Inference algorithms (satisfiability testing, theorem proving, etc.)
- But logic is brittle in the face of uncertainty
Slide 15: Languages Are Ambiguous
- Paraphrases: Microsoft buys Powerset / Microsoft acquires Powerset / Powerset is acquired by Microsoft Corporation / The Redmond software giant buys Powerset / Microsoft's purchase of Powerset, ...
- Attachment ambiguity: I saw the man with the telescope (does "with the telescope" modify the man or the seeing?)
- Coreference: G. W. Bush / Laura Bush / Mrs. Bush: which one?
- Entity ambiguity: "Here in London, Frances Deek is a retired teacher ... In the Israeli town, Karen London says ... Now London says ..." Is London a PERSON or a LOCATION?
Slide 16: Knowledge Has Uncertainty
- We need to model correlations
- Our information is always incomplete
- Our predictions are uncertain
Slide 17: Uncertainty
- Statistics provides the tools to handle this
- Mixture models
- Hidden Markov models
- Bayesian networks
- Markov random fields
- Maximum entropy models
- Conditional random fields
- Etc.
- But statistical models assume i.i.d. (independently and identically distributed) data: objects → feature vectors
Slide 18: The Pipeline Is Suboptimal
- E.g., the NLP pipeline
- Tokenization → Morphology → Chunking → Syntax → ...
- Accumulates and propagates errors
- Wanted: Joint inference
- Across all processing stages
- Among all interdependent objects
Slide 19: Supervision Is Scarce
- Tons of text, but most of it is not annotated
- Labeling is expensive (cf. the Penn Treebank)
- → Need to leverage indirect supervision
Slide 20: Redundancy
- Key source of indirect supervision
- State-of-the-art systems depend on this
- E.g., TextRunner [Banko et al., 2007]
- But the Web is heterogeneous: Long tail
- Redundancy is only present in the head regime
Slide 21: Overview
- Machine reading: Challenges
- Statistical relational learning
- Markov logic
- USP: Unsupervised Semantic Parsing
- Research directions
Slide 22: Statistical Relational Learning
- Burgeoning field in machine learning
- Offers promising solutions for machine reading
- Unifies statistical and logical approaches
- Replaces the pipeline with joint inference
- Principled framework to leverage both direct and indirect supervision
Slide 23: Machine Reading: A Vision
Challenge: Long tail
Slide 24: Machine Reading: A Vision
Slide 25: Challenges in Applying Statistical Relational Learning
- Learning is much harder
- Inference becomes a crucial issue
- Greater complexity for the user
Slide 26: Progress to Date
- Probabilistic logic [Nilsson, 1986]
- Statistics and beliefs [Halpern, 1990]
- Knowledge-based model construction [Wellman et al., 1992]
- Stochastic logic programs [Muggleton, 1996]
- Probabilistic relational models [Friedman et al., 1999]
- Relational Markov networks [Taskar et al., 2002]
- Markov logic [Domingos & Lowd, 2009]
- Etc.
Slide 27: Progress to Date
(Same list as Slide 26)
Markov logic [Domingos & Lowd, 2009]: Leading unifying framework
Slide 28: Overview
- Machine reading
- Statistical relational learning
- Markov logic
- USP: Unsupervised Semantic Parsing
- Research directions
Slide 29: Markov Networks
- Undirected graphical models
[Example graph over Smoking, Cancer, Cough, and Asthma]
- P(x) = (1/Z) exp(Σ_i w_i f_i(x)), where w_i is the weight of feature i and f_i(x) is feature i
Slide 30: First-Order Logic
- Constants, variables, functions, predicates. E.g.: Anna, x, MotherOf(x), Friends(x,y)
- Grounding: Replace all variables by constants. E.g.: Friends(Anna, Bob)
- World (model, interpretation): Assignment of truth values to all ground predicates
Slide 31: Markov Logic
- Intuition: Soften logical constraints
- Syntax: Weighted first-order formulas
- Semantics: Feature templates for Markov networks
- A Markov Logic Network (MLN) is a set of pairs (F_i, w_i) where
- F_i is a formula in first-order logic
- w_i is a real number
- P(x) = (1/Z) exp(Σ_i w_i n_i(x)), where n_i(x) is the number of true groundings of F_i
Slides 32-35: Example: Friends & Smokers
- Probabilistic graphical models and first-order logic are special cases
- Two constants: Anna (A) and Bob (B)
- Ground atoms: Smokes(A), Smokes(B), Cancer(A), Cancer(B), Friends(A,A), Friends(A,B), Friends(B,A), Friends(B,B)
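The weighted-formula semantics can be made concrete with a tiny brute-force sketch (not Alchemy): ground two Friends & Smokers style formulas over Anna and Bob, weight them, and normalize over all 256 worlds. The formulas and weights here are illustrative assumptions, not values from the slides.

```python
import math
from itertools import product

CONSTANTS = ["A", "B"]  # Anna, Bob

# All ground atoms over the two constants.
ATOMS = ([("Smokes", c) for c in CONSTANTS]
         + [("Cancer", c) for c in CONSTANTS]
         + [("Friends", x, y) for x in CONSTANTS for y in CONSTANTS])

# Illustrative weighted formulas (assumed, not from the slides):
#   w1: Smokes(x) => Cancer(x)
#   w2: Friends(x,y) => (Smokes(x) <=> Smokes(y))
W1, W2 = 1.5, 1.1

def n_true_groundings(world):
    """n_i(x): number of true groundings of each formula in this world."""
    n1 = sum((not world[("Smokes", x)]) or world[("Cancer", x)]
             for x in CONSTANTS)
    n2 = sum((not world[("Friends", x, y)])
             or (world[("Smokes", x)] == world[("Smokes", y)])
             for x in CONSTANTS for y in CONSTANTS)
    return n1, n2

def unnormalized(world):
    n1, n2 = n_true_groundings(world)
    return math.exp(W1 * n1 + W2 * n2)

# Enumerate all 2^8 possible worlds.
WORLDS = [dict(zip(ATOMS, vals))
          for vals in product([False, True], repeat=len(ATOMS))]

# Conditional query: P(Cancer(A) | Smokes(A)).
num = sum(unnormalized(w) for w in WORLDS
          if w[("Cancer", "A")] and w[("Smokes", "A")])
den = sum(unnormalized(w) for w in WORLDS if w[("Smokes", "A")])
p = num / den
print(round(p, 3))  # → 0.818, i.e., sigmoid(W1): only formula 1 touches Cancer(A)
```

Real MLN systems avoid this exponential enumeration; that is exactly where MC-SAT, lifted inference, and the other algorithms on the following slides come in.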
Slide 36: MLN Algorithms: The First Three Generations
Problem            | First generation        | Second generation | Third generation
MAP inference      | Weighted satisfiability | Lazy inference    | Cutting planes
Marginal inference | Gibbs sampling          | MC-SAT            | Lifted inference
Weight learning    | Pseudo-likelihood       | Voted perceptron  | Scaled conj. gradient
Structure learning | Inductive logic progr.  | ILP + PL (etc.)   | Clustering + pathfinding
Slide 37: Efficient Inference
- Logical or statistical inference alone is already hard
- But we can do approximate inference
- Suffices to perform well in most cases
- Combine ideas from both camps
- E.g., MC-SAT = MCMC + SAT solver
- Can also leverage sparsity in relational domains
More: Poon & Domingos, "Sound and Efficient Inference with Probabilistic and Deterministic Dependencies," in Proc. AAAI-2006.
More: Poon, Domingos & Sumner, "A General Method for Reducing the Complexity of Relational Inference and its Application to MCMC," in Proc. AAAI-2008.
Slide 38: Weight Learning
- Probability model P(X)
- X: Observable in training data
- Maximize the likelihood of the observed data
- Regularization to prevent overfitting
Slide 39: Weight Learning
- Gradient descent (requires inference)
- ∂/∂w_i log P_w(x) = n_i(x) − E_w[n_i], i.e., (no. of times clause i is true in the data) minus (expected no. of times clause i is true according to the MLN)
- Use MC-SAT for inference
- Can also leverage second-order information [Lowd & Domingos, 2007]
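The gradient above (true counts in the data minus expected counts under the model) can be checked on a toy log-linear model where the expectation is computed by exact enumeration instead of MC-SAT. The two "clauses", the data, and the learning rate are illustrative assumptions.

```python
import math
from itertools import product

# Worlds are truth assignments to two ground atoms (a, b); each feature is
# the number of true groundings of one clause (0 or 1 here).
def features(world):
    a, b = world
    return [1.0 if (not a or b) else 0.0,   # clause 1: a => b
            1.0 if (a or b) else 0.0]       # clause 2: a v b

WORLDS = list(product([False, True], repeat=2))

def expected_counts(w):
    """E_w[n_i] by exact enumeration (stand-in for MC-SAT)."""
    scores = [math.exp(sum(wi * fi for wi, fi in zip(w, features(x))))
              for x in WORLDS]
    Z = sum(scores)
    return [sum(s / Z * features(x)[i] for s, x in zip(scores, WORLDS))
            for i in range(2)]

# Illustrative "training data": a small sample of observed worlds.
DATA = [(True, True), (False, False), (True, False), (True, False)]
data_counts = [sum(features(x)[i] for x in DATA) / len(DATA) for i in range(2)]

w = [0.0, 0.0]
for _ in range(500):  # gradient ascent: grad_i = n_i(data) - E_w[n_i]
    grad = [dc - ec for dc, ec in zip(data_counts, expected_counts(w))]
    w = [wi + 0.5 * gi for wi, gi in zip(w, grad)]

# At the optimum, expected counts match the empirical counts.
print([round(c, 2) for c in expected_counts(w)])  # → [0.5, 0.75]
```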
Slide 40: Unsupervised Learning: How?
- I.I.D. learning: A more sophisticated model requires more labeled data
- Statistical relational learning: A more sophisticated model may require less labeled data
- Ambiguities vary among objects
- Joint inference → Propagate information from unambiguous objects to ambiguous ones
- One formula is worth a thousand labels
- A small amount of domain knowledge → large-scale joint inference
Slide 41: Unsupervised Weight Learning
- Probability model P(X,Z)
- X: Observed in training data
- Z: Hidden variables
- E.g., clustering with mixture models
- Z: Cluster assignment
- X: Observed features
- Maximize the likelihood of the observed data by summing out the hidden variables Z
Slide 42: Unsupervised Weight Learning
- Gradient descent: ∂/∂w_i log P_w(x) = E_{z|x}[n_i] − E_{x,z}[n_i] (the first expectation sums over z conditioned on the observed x; the second sums over both x and z)
- Use MC-SAT to compute both expectations
- May also combine with contrastive estimation
More: Poon, Cherry & Toutanova, "Unsupervised Morphological Segmentation with Log-Linear Models," in Proc. NAACL-2009. (Best Paper Award)
Slide 43: Markov Logic
- Unified inference and learning algorithms
- → Can handle millions of variables, billions of features, tens of thousands of parameters
- Easy-to-use software: Alchemy
- Many successful applications
- E.g., information extraction, coreference resolution, semantic parsing, ontology induction
Slide 44: Pipeline → Joint Inference
- Combine segmentation and entity resolution for information extraction
- Extract complex and nested bio-events from PubMed abstracts
More: Poon & Domingos, "Joint Inference for Information Extraction," in Proc. AAAI-2007.
More: Poon & Vanderwende, "Joint Inference for Knowledge Extraction from Biomedical Literature," in Proc. NAACL-2010.
Slide 45: Unsupervised Learning: Example
- Coreference resolution: Accuracy comparable to the previous supervised state of the art
More: Poon & Domingos, "Joint Unsupervised Coreference Resolution with Markov Logic," in Proc. EMNLP-2008.
Slide 46: Overview
- Machine reading: Challenges
- Statistical relational learning
- Markov logic
- USP: Unsupervised Semantic Parsing
- Research directions
Slide 47: Unsupervised Semantic Parsing
- USP [Poon & Domingos, EMNLP-09] (Best Paper Award)
- First unsupervised approach for semantic parsing
- End-to-end machine reading system
- Read text, answer questions
- OntoUSP = USP + Ontology Induction [Poon & Domingos, ACL-10]
- Encoded in a few Markov logic formulas
Slide 48: Semantic Parsing
Goal: Microsoft buys Powerset → BUY(MICROSOFT, POWERSET)
Challenge: Microsoft buys Powerset / Microsoft acquires semantic search engine Powerset / Powerset is acquired by Microsoft Corporation / The Redmond software giant buys Powerset / Microsoft's purchase of Powerset, ...
Slide 49: Limitations of Existing Approaches
- Manual grammar or supervised learning
- Applicable to restricted domains only
- For general text
- Not clear what predicates and objects to use
- Hard to produce consistent meaning annotation
- Also, often learn both syntax and semantics
- Fail to leverage advanced syntactic parsers
- Make semantic parsing harder
Slide 50: USP Key Idea 1
- Target predicates and objects can be learned
- Viewed as clusters of syntactic or lexical variations of the same meaning
- BUY(-,-) = { buys, acquires, 's purchase of, ... } → Cluster of various expressions for acquisition
- MICROSOFT = { Microsoft, the Redmond software giant, ... } → Cluster of various mentions of Microsoft
Slide 51: USP Key Idea 2
- Relational clustering → Cluster relations with the same objects
- USP → Recursively cluster arbitrary expressions with similar subexpressions
- Microsoft buys Powerset
- Microsoft acquires semantic search engine Powerset
- Powerset is acquired by Microsoft Corporation
- The Redmond software giant buys Powerset
- Microsoft's purchase of Powerset, ...
Slide 52: USP Key Idea 2 (cont.)
- (Same example) Cluster the same forms at the atom level
Slides 53-55: USP Key Idea 2 (cont.)
- (Same example) Cluster forms in composition with the same forms
Slide 56: USP Key Idea 3
- Start directly from syntactic analyses
- Focus on translating them to semantics
- Leverages rapid progress in syntactic parsing
- Much easier than learning both
Slide 57: Joint Inference in USP
- Forms a canonical meaning representation by recursively clustering synonymous expressions
- Text → Logical form in this representation
- Induces an ISA hierarchy among clusters and applies hierarchical smoothing (shrinkage)
Slide 58: USP System Overview
- Input: Dependency trees for sentences
- Converts dependency trees into quasi-logical forms (QLFs)
- Starts with QLF clusters at the atom level
- Recursively builds up clusters of larger forms
- Output:
- Probability distribution over QLF clusters and their compositions
- MAP semantic parses of sentences
Slides 59-62: Generating Quasi-Logical Forms
[Dependency tree: buys --nsubj--> Microsoft, --dobj--> Powerset]
- Convert each node into a unary atom: buys(n1), Microsoft(n2), Powerset(n3), where n1, n2, n3 are Skolem constants
- Convert each edge into a binary atom: nsubj(n1,n2), dobj(n1,n3)
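The node-and-edge conversion is mechanical; a minimal sketch (with an assumed list-based tree encoding) is:

```python
def to_qlf(tokens, edges):
    """tokens: words by index; edges: (head_idx, dep_idx, label) triples.
    Each node becomes a unary atom over a Skolem constant n1, n2, ...;
    each dependency edge becomes a binary atom over the two constants."""
    skolem = {i: f"n{i + 1}" for i in range(len(tokens))}
    unary = [f"{tok}({skolem[i]})" for i, tok in enumerate(tokens)]
    binary = [f"{lab}({skolem[h]},{skolem[d]})" for h, d, lab in edges]
    return unary + binary

# "Microsoft buys Powerset": buys --nsubj--> Microsoft, --dobj--> Powerset
atoms = to_qlf(["buys", "Microsoft", "Powerset"],
               [(0, 1, "nsubj"), (0, 2, "dobj")])
print(atoms)
# → ['buys(n1)', 'Microsoft(n2)', 'Powerset(n3)', 'nsubj(n1,n2)', 'dobj(n1,n3)']
```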
Slides 63-67: A Semantic Parse
- Partition the QLF into subformulas: buys(n1); nsubj(n1,n2); dobj(n1,n3); Microsoft(n2); Powerset(n3)
- Subformula → Lambda form: Replace each Skolem constant not in a unary atom with a unique lambda variable: buys(n1); λx2.nsubj(n1,x2); λx3.dobj(n1,x3); Microsoft(n2); Powerset(n3)
- Core form: No lambda variable. Argument form: One lambda variable. Here buys(n1) is a core form; λx2.nsubj(n1,x2) and λx3.dobj(n1,x3) are argument forms
- Assign each subformula to an object cluster: buys(n1) → BUY; Microsoft(n2) → MICROSOFT; Powerset(n3) → POWERSET
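One way to sketch the lambda-abstraction step in code, assuming each partition part has a designated head Skolem constant (my simplification of the slides' "not in a unary atom" rule):

```python
import re

def lambda_form(subformula, head):
    """Abstract every Skolem constant in the part except its head.
    Returns ('core', ...) if nothing was abstracted, else ('argument', ...)."""
    body = " ^ ".join(subformula)
    others = sorted(set(re.findall(r"n\d+", body)) - {head})
    for c in others:
        var = "x" + c[1:]                  # n2 -> x2
        body = f"lambda {var}." + body.replace(c, var)
    return ("core" if not others else "argument"), body

print(lambda_form(["buys(n1)"], "n1"))      # → ('core', 'buys(n1)')
print(lambda_form(["nsubj(n1,n2)"], "n1"))  # → ('argument', 'lambda x2.nsubj(n1,x2)')
```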
Slide 68: Object Cluster: BUY
- Distribution over core forms, e.g., buys(n1): 0.1; acquires(n1): 0.2
- One formula in the MLN: Learn a weight for each pair of cluster and core form
Slide 69: Object Cluster: BUY
- Property clusters: BUYER, BOUGHT, PRICE
- May contain a variable number of property clusters
Slide 70: Property Cluster: BUYER
- Three MLN formulas: distributions over argument forms, argument clusters, and number
- E.g., argument forms: λx2.nsubj(n1,x2): 0.5, λx2.agent(n1,x2): 0.4; clusters: MICROSOFT: 0.2, GOOGLE: 0.1; number: Zero: 0.1, One: 0.8
Slide 71: Probabilistic Model
- Exponential prior on the number of parameters
- Cluster mixtures
[Diagram: Object Cluster BUY with its core-form distribution; Property Cluster BUYER with its distributions over argument forms, clusters, and number]
Slide 72: Probabilistic Model
- Exponential prior on the number of parameters
- Cluster mixtures with hierarchical smoothing
- E.g., picking MICROSOFT as the BUYER argument depends not only on BUY, but also on its ISA ancestors
Slide 73: Abstract Lambda Form
- buys(n1); λx2.nsubj(n1,x2); λx3.dobj(n1,x3)
- → BUYS(n1); λx2.BUYER(n1,x2); λx3.BOUGHT(n1,x3)
- The final logical form is obtained via lambda reduction
Slide 74: Challenge: State Space Too Large
- Potential number of clusters ∝ exp(number of tokens)
- Also, meaning units and clusters are often small
- → Use combinatorial search
Slide 75: Inference: Find MAP Parse
[Example: "IL-4 induces CD11B"; induces has nsubj and dobj children; "protein" composes with its nn child "IL-4"]
- Initialize with the syntactic parse
- Search operator: Lambda reduction
Slide 76: Learning: Greedily Maximize Posterior
- Initialize: one cluster per form (e.g., enhances: 1.0; induces: 1.0; amino: 1.0; acid: 1.0)
- Search operators:
- MERGE: e.g., merge the induces and enhances clusters into one (induces: 0.2, enhances: 0.8)
- COMPOSE: e.g., compose amino and acid into the multiword form amino acid (1.0)
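A toy sketch of the two operators, using raw counts in place of the learned posterior (the cluster contents and counts are illustrative, chosen to reproduce the 0.2/0.8 split on the slide):

```python
def merge(c1, c2):
    """MERGE: pool two clusters' form counts into one distribution."""
    counts = dict(c1)
    for form, n in c2.items():
        counts[form] = counts.get(form, 0) + n
    total = sum(counts.values())
    return {form: n / total for form, n in counts.items()}

def compose(f1, f2):
    """COMPOSE: treat two frequently co-occurring forms as one multiword form."""
    return f"{f1} {f2}"

merged = merge({"induces": 2}, {"enhances": 8})
print(merged)                    # → {'induces': 0.2, 'enhances': 0.8}
print(compose("amino", "acid"))  # → 'amino acid'
```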
Slide 77: Operator: ABSTRACT
- MERGE with REGULATE?
[Diagram: ABSTRACT creates a parent cluster REGULATE with ISA links to INDUCE (induces, enhances, up-regulates) and INHIBIT (inhibits, suppresses)]
- Captures substantial similarities
Slide 78: Experiments
- Apply to machine reading
- Extract knowledge from text and answer questions
- Evaluation: Number of answers and accuracy
- GENIA dataset: 1999 PubMed abstracts
- Use simple factoid questions, e.g.:
- What does anti-STAT1 inhibit?
- What regulates MIP-1 alpha?
Slide 79: Total and Correct Answers
[Chart comparing KW-SYN, TextRunner, RESOLVER, DIRT, and USP]
USP extracted five times as many correct answers as TextRunner, with the highest precision of 91%
Slide 80: Qualitative Analysis
- Resolves many nontrivial variations
- Argument forms that mean the same, e.g.:
- expression of X ⇔ X expression
- X stimulates Y ⇔ Y is stimulated with X
- Active vs. passive voice
- Synonymous expressions
- Etc.
Slide 81: Clusters and Compositions
- Clusters in core forms
- { investigate, examine, evaluate, analyze, study, assay }
- { diminish, reduce, decrease, attenuate }
- { synthesis, production, secretion, release }
- { dramatically, substantially, significantly }
- Compositions
- amino acid, t cell, immune response, transcription factor, initiation site, binding site
Slide 82: Question-Answer Example
Interestingly, the DEX-mediated IkappaBalpha induction was completely inhibited by IL-2, but not IL-4, in Th1 cells, while the reverse profile was seen in Th2 cells.
Q: What does IL-2 control?
A: The DEX-mediated IkappaBalpha induction
Slide 83: Overview
- Machine reading
- Statistical relational learning
- Markov logic
- USP: Unsupervised Semantic Parsing
- Research directions
Slide 84: Web-Scale Joint Inference
- Challenge: Efficiently identify the relevant
- Key: Induce and leverage an ontology
- Ontology → Capture essential properties; abstract away unimportant variations
- Upper-level nodes → Skip irrelevant branches
- Wanted: Combine the following
- Probabilistic ontology induction (e.g., USP)
- Coarse-to-fine learning and inference [Felzenszwalb & McAllester, 2007; Petrov, Ph.D. Thesis]
Slide 85: Knowledge Reasoning
- Most facts/rules are not explicitly stated
- Dark matter in the natural language universe
- kale contains calcium ∧ calcium prevents osteoporosis → kale prevents osteoporosis
- Keys:
- Induce generic reasoning patterns
- Incorporate reasoning in extraction
- Additional sources of indirect supervision
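The slide's example inference can be sketched as a one-rule forward chainer; the specific rule pattern (contains composed with prevents) is an illustrative assumption:

```python
def forward_chain(facts):
    """Repeatedly apply: contains(X, Y) & prevents(Y, Z) => prevents(X, Z)."""
    derived = set(facts)
    while True:
        new = {("prevents", x, z)
               for (r1, x, y) in derived if r1 == "contains"
               for (r2, y2, z) in derived if r2 == "prevents" and y2 == y}
        if new <= derived:
            return derived
        derived |= new

facts = {("contains", "kale", "calcium"),
         ("prevents", "calcium", "osteoporosis")}
print(("prevents", "kale", "osteoporosis") in forward_chain(facts))  # → True
```

In practice such rules would carry weights and be induced rather than hand-written, which is exactly the Markov logic setting described earlier.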
Slides 86-89: Harness Social Computing
- Bootstrap an online community
- Incorporate human end tasks in the loop (e.g., "Tell me everything about dicer applied to synapse"; "Your extraction from my paper is correct except for blah")
- Form a positive feedback loop
[Diagram: Knowledge Base at the center]
Slide 90: Acknowledgments
- Pedro Domingos, Colin Cherry, Kristina Toutanova, Lucy Vanderwende, Oren Etzioni, Dan Weld, Matt Richardson, Parag Singla, Stanley Kok, Daniel Lowd, Marc Sumner
- ARO, AFRL, ONR, DARPA, NSF
Slide 91: Summary
- Statistical relational learning offers promising solutions for machine reading
- Markov logic provides a language for this
- Syntax: Weighted first-order logical formulas
- Semantics: Feature templates of Markov networks
- Open-source software: Alchemy
- A success story: USP
- Three key research directions
alchemy.cs.washington.edu
alchemy.cs.washington.edu/papers/poon09