
Markov Logic in Natural Language Processing

- Hoifung Poon
- Dept. of Computer Science & Engineering
- University of Washington

Overview

- Motivation
- Foundational areas
- Markov logic
- NLP applications
- Basics
- Supervised learning
- Unsupervised learning

Languages Are Structural

- Morphology: govern-ment-s; l-mpx-t-m ("according to their families")
- Syntax: parse tree for "IL-4 induces CD11B" (S → NP VP; VP → V NP)
- Nested bio-events: "Involvement of p70(S6)-kinase activation in IL-10 up-regulation in human monocytes by gp41 ..." (involvement: Theme = up-regulation, Cause = activation; up-regulation: Theme = IL-10, Cause = gp41, Site = human monocyte; activation: Theme = p70(S6)-kinase)
- Coreference: "George Walker Bush was the 43rd President of the United States. Bush was the eldest son of President G. H. W. Bush and Barbara Bush. In November 1977, he met Laura Welch at a barbecue."


Languages Are Structural

- Objects are not just feature vectors
- They have parts and subparts
- Which have relations with each other
- They can be trees, graphs, etc.
- Objects are seldom i.i.d. (independent and identically distributed)
- They exhibit local and global dependencies
- They form class hierarchies (with multiple inheritance)
- Object properties depend on those of related objects
- Deeply interwoven with knowledge

First-Order Logic

- Main theoretical foundation of computer science
- General language for describing complex structures and knowledge
- Trees, graphs, dependencies, hierarchies, etc. easily expressed
- Inference algorithms (satisfiability testing, theorem proving, etc.)

Languages Are Statistical

- Paraphrase: "Microsoft buys Powerset" / "Microsoft acquires Powerset" / "Powerset is acquired by Microsoft Corporation" / "The Redmond software giant buys Powerset" / "Microsoft's purchase of Powerset, ..."
- Attachment ambiguity: "I saw the man with the telescope" (does the PP modify the man, or the seeing?)
- Coreference: "G. W. Bush ... Laura Bush ... Mrs. Bush" (which one?)
- Entity ambiguity: "Here in London, Frances Deek is a retired teacher." "In the Israeli town ..., Karen London says ..." "Now London says ..." London: PERSON or LOCATION?

Languages Are Statistical

- Languages are ambiguous
- Our information is always incomplete
- We need to model correlations
- Our predictions are uncertain
- Statistics provides the tools to handle this

Probabilistic Graphical Models

- Mixture models
- Hidden Markov models
- Bayesian networks
- Markov random fields
- Maximum entropy models
- Conditional random fields
- Etc.

The Problem

- Logic is deterministic, requires manual coding
- Statistical models assume i.i.d. data, objects = feature vectors
- Historically, statistical and logical NLP have been pursued separately
- We need to unify the two!
- Burgeoning field in machine learning:
- Statistical relational learning

Costs and Benefits of Statistical Relational Learning

- Benefits
- Better predictive accuracy
- Better understanding of domains
- Enable learning with less or no labeled data
- Costs
- Learning is much harder
- Inference becomes a crucial issue
- Greater complexity for user

Progress to Date

- Probabilistic logic [Nilsson, 1986]
- Statistics and beliefs [Halpern, 1990]
- Knowledge-based model construction [Wellman et al., 1992]
- Stochastic logic programs [Muggleton, 1996]
- Probabilistic relational models [Friedman et al., 1999]
- Relational Markov networks [Taskar et al., 2002]
- Etc.
- This talk: Markov logic [Domingos & Lowd, 2009]

Markov Logic: A Unifying Framework

- Probabilistic graphical models and first-order logic are special cases
- Unified inference and learning algorithms
- Easy-to-use software: Alchemy
- Broad applicability
- Goal of this tutorial: Quickly learn how to use Markov logic and Alchemy for a broad spectrum of NLP applications

Overview

- Motivation
- Foundational areas
- Probabilistic inference
- Statistical learning
- Logical inference
- Inductive logic programming
- Markov logic
- NLP applications
- Basics
- Supervised learning
- Unsupervised learning

Markov Networks

- Undirected graphical models
- (Figure: undirected graph over Smoking, Cancer, Asthma, Cough)
- Potential functions defined over cliques

Smoking  Cancer  Φ(S,C)
False    False   4.5
False    True    4.5
True     False   2.7
True     True    4.5

Markov Networks

- Undirected graphical models
- Log-linear model:

$P(x) = \frac{1}{Z} \exp\left(\sum_i w_i f_i(x)\right)$

($w_i$: weight of feature i; $f_i(x)$: feature i)
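
Not from the tutorial: a minimal Python sketch of this log-linear form for the Smoking/Cancer pair above, treating each configuration as a single binary feature whose weight is the log of the table's potential. All names are illustrative.

import itertools
import math

# Weights are log-potentials: exp(w) reproduces the potential table above.
weights = {
    (False, False): math.log(4.5),
    (False, True):  math.log(4.5),
    (True,  False): math.log(2.7),
    (True,  True):  math.log(4.5),
}

def unnormalized(smoking, cancer):
    # exp(sum_i w_i f_i(x)); exactly one indicator feature fires per world.
    return math.exp(weights[(smoking, cancer)])

# Partition function Z sums the unnormalized measure over all four worlds.
Z = sum(unnormalized(s, c) for s, c in itertools.product([False, True], repeat=2))

print(unnormalized(True, False) / Z)  # P(Smoking = True, Cancer = False)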

Markov Nets vs. Bayes Nets

Property         Markov Nets          Bayes Nets
Form             Prod. potentials     Prod. potentials
Potentials       Arbitrary            Cond. probabilities
Cycles           Allowed              Forbidden
Partition func.  Z = ?                Z = 1
Indep. check     Graph separation     D-separation
Indep. props.    Some                 Some
Inference        MCMC, BP, etc.       Convert to Markov

Inference in Markov Networks

- Goal: Compute marginals and conditionals of the distribution
- Exact inference is #P-complete
- Conditioning on Markov blanket is easy
- Gibbs sampling exploits this

MCMC: Gibbs Sampling

state ← random truth assignment
for i ← 1 to num-samples do
    for each variable x
        sample x according to P(x | neighbors(x))
        state ← state with new value of x
P(F) ← fraction of states in which F is true
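
A runnable Python sketch of this loop, under assumptions: Boolean variables, and a hypothetical callback cond_prob(v, state) returning P(v = True | Markov blanket of v).

import random

def gibbs(variables, cond_prob, num_samples, formula_holds):
    # Start from a random truth assignment.
    state = {v: random.random() < 0.5 for v in variables}
    hits = 0
    for _ in range(num_samples):
        for v in variables:
            # Resample each variable from its conditional given its neighbors.
            state[v] = random.random() < cond_prob(v, state)
        hits += formula_holds(state)
    # P(F) is estimated as the fraction of sampled states in which F holds.
    return hits / num_samples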

Other Inference Methods

- Belief propagation (sum-product)
- Mean field / Variational approximations

MAP/MPE Inference

- Goal: Find most likely state of world given evidence

$\arg\max_y P(y \mid x)$  (y: query, x: evidence)

MAP Inference Algorithms

- Iterated conditional modes
- Simulated annealing
- Graph cuts
- Belief propagation (max-product)
- LP relaxation

Overview

- Motivation
- Foundational areas
- Probabilistic inference
- Statistical learning
- Logical inference
- Inductive logic programming
- Markov logic
- NLP applications
- Basics
- Supervised learning
- Unsupervised learning

Generative Weight Learning

- Maximize likelihood
- Use gradient ascent or L-BFGS
- No local maxima
- Requires inference at each step (slow!)

$\frac{\partial}{\partial w_i} \log P_w(x) = n_i(x) - E_w[n_i(x)]$

($n_i(x)$: no. of times feature i is true in data; $E_w[n_i(x)]$: expected no. of times feature i is true according to model)

Pseudo-Likelihood

$PL(x) \equiv \prod_i P(x_i \mid \text{neighbors}(x_i))$

- Likelihood of each variable given its neighbors in the data
- Does not require inference at each step
- Widely used in vision, spatial statistics, etc.
- But PL parameters may not work well for long inference chains
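
A short Python sketch of this objective (not from the tutorial), reusing the same hypothetical cond_prob callback as the Gibbs sketch above: it scores the observed data without any global inference.

import math

def pseudo_log_likelihood(data_state, variables, cond_prob):
    # Sum over variables of log P(x_i = observed value | its neighbors in the data).
    total = 0.0
    for v in variables:
        p_true = cond_prob(v, data_state)
        total += math.log(p_true if data_state[v] else 1 - p_true)
    return total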

Discriminative Weight Learning

- Maximize conditional likelihood of query (y) given evidence (x)
- Approximate expected counts by counts in MAP state of y given x

$\frac{\partial}{\partial w_i} \log P_w(y \mid x) = n_i(x, y) - E_w[n_i(x, y)]$

($n_i(x, y)$: no. of true groundings of clause i in data; $E_w[n_i(x, y)]$: expected no. of true groundings according to model)

Voted Perceptron

- Originally proposed for training HMMs discriminatively
- Assumes network is linear chain
- Can be generalized to arbitrary networks

wi ← 0
for t ← 1 to T do
    yMAP ← Viterbi(x)
    wi ← wi + η [counti(yData) − counti(yMAP)]
return Σt wi / T
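
A hedged Python sketch of this training loop; map_inference (Viterbi here, MaxWalkSAT later) and count (no. of true groundings of feature i) are hypothetical callbacks, and the returned weights are the average over the T iterations.

def voted_perceptron(features, count, map_inference, x, y_data, T, eta=1.0):
    w = {i: 0.0 for i in features}
    w_sum = {i: 0.0 for i in features}
    for _ in range(T):
        # Most likely labeling under the current weights.
        y_map = map_inference(x, w)
        for i in features:
            # Move weights toward the data counts, away from the MAP counts.
            w[i] += eta * (count(i, y_data) - count(i, y_map))
            w_sum[i] += w[i]
    return {i: w_sum[i] / T for i in features}  # averaged ("voted") weights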

Overview

- Motivation
- Foundational areas
- Probabilistic inference
- Statistical learning
- Logical inference
- Inductive logic programming
- Markov logic
- NLP applications
- Basics
- Supervised learning
- Unsupervised learning

First-Order Logic

- Constants, variables, functions, predicates, e.g.: Anna, x, MotherOf(x), Friends(x, y)
- Literal: Predicate or its negation
- Clause: Disjunction of literals
- Grounding: Replace all variables by constants, e.g. Friends(Anna, Bob)
- World (model, interpretation): Assignment of truth values to all ground predicates

Inference in First-Order Logic

- Traditionally done by theorem proving (e.g. Prolog)
- Propositionalization followed by model checking turns out to be faster (often by a lot)
- Propositionalization: Create all ground atoms and clauses
- Model checking: Satisfiability testing
- Two main approaches
- Backtracking (e.g. DPLL)
- Stochastic local search (e.g. WalkSAT)

Satisfiability

- Input: Set of clauses (convert KB to conjunctive normal form (CNF))
- Output: Truth assignment that satisfies all clauses, or failure
- The paradigmatic NP-complete problem
- Solution: Search
- Key point: Most SAT problems are actually easy
- Hard region: Narrow range of #Clauses / #Variables

Stochastic Local Search

- Uses complete assignments instead of partial
- Start with random state
- Flip variables in unsatisfied clauses
- Hill-climbing: Minimize number of unsatisfied clauses
- Avoid local minima: Random flips
- Multiple restarts

The WalkSAT Algorithm

for i ← 1 to max-tries do
    solution ← random truth assignment
    for j ← 1 to max-flips do
        if all clauses satisfied then
            return solution
        c ← random unsatisfied clause
        with probability p
            flip a random variable in c
        else
            flip variable in c that maximizes number of satisfied clauses
return failure
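
A hedged Python sketch of WalkSAT under simple assumptions: a clause is a list of (variable, is_positive) literals, and an assignment maps variables to booleans. All names are illustrative.

import random

def satisfied(clause, a):
    # A clause (disjunction of literals) holds if any literal matches.
    return any(a[v] == pos for v, pos in clause)

def num_sat(clauses, a):
    return sum(satisfied(c, a) for c in clauses)

def walksat(clauses, variables, max_tries, max_flips, p=0.5):
    for _ in range(max_tries):
        a = {v: random.random() < 0.5 for v in variables}  # random assignment
        for _ in range(max_flips):
            unsat = [c for c in clauses if not satisfied(c, a)]
            if not unsat:
                return a  # all clauses satisfied
            c = random.choice(unsat)
            if random.random() < p:
                v = random.choice(c)[0]  # random-walk move
            else:
                # Greedy move: flip the variable that satisfies the most clauses.
                v = max((v for v, _ in c),
                        key=lambda u: num_sat(clauses, {**a, u: not a[u]}))
            a[v] = not a[v]
    return None  # failure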

Overview

- Motivation
- Foundational areas
- Probabilistic inference
- Statistical learning
- Logical inference
- Inductive logic programming
- Markov logic
- NLP applications
- Basics
- Supervised learning
- Unsupervised learning

Rule Induction

- Given: Set of positive and negative examples of some concept
- Example: (x1, x2, …, xn, y)
- y: concept (Boolean)
- x1, x2, …, xn: attributes (assume Boolean)
- Goal: Induce a set of rules that cover all positive examples and no negative ones
- Rule: xa ∧ … ∧ xb ⇒ y (xa: a literal, i.e., xi or its negation)
- Same as Horn clause: Body ⇒ Head
- Rule r covers example x iff x satisfies body of r
- Eval(r): Accuracy, info gain, coverage, support, etc.

Learning a Single Rule

head ← y
body ← Ø
repeat
    for each literal x
        rx ← r with x added to body
        Eval(rx)
    body ← body ∧ best x
until no x improves Eval(r)
return r

Learning a Set of Rules

R ← Ø
S ← examples
repeat
    learn a single rule r
    R ← R ∪ { r }
    S ← S − positive examples covered by r
until S = Ø
return R
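
A hedged Python sketch of this covering loop for the propositional case: examples are (attributes, label) pairs, a rule body is a set of required literal values, and Eval is rule accuracy. Names and the evaluation function are illustrative.

def covers(body, x):
    # A rule covers an example iff the example satisfies every body literal.
    return all(x[attr] == val for attr, val in body.items())

def accuracy(body, examples):
    covered = [(x, y) for x, y in examples if covers(body, x)]
    return sum(y for _, y in covered) / len(covered) if covered else 0.0

def learn_rule(examples, literals):
    body = {}
    while True:
        candidates = [(a, v) for a, v in literals if a not in body]
        if not candidates:
            return body
        best = max(candidates,
                   key=lambda av: accuracy({**body, av[0]: av[1]}, examples))
        if accuracy({**body, best[0]: best[1]}, examples) <= accuracy(body, examples):
            return body  # no literal improves Eval(r)
        body[best[0]] = best[1]

def learn_rule_set(examples, literals):
    rules, remaining = [], list(examples)
    while any(y for _, y in remaining):  # until no positive examples remain
        r = learn_rule(remaining, literals)
        rules.append(r)
        # Remove the positive examples covered by the new rule.
        new_remaining = [(x, y) for x, y in remaining if not (y and covers(r, x))]
        if len(new_remaining) == len(remaining):
            break  # guard: the rule covered no positives
        remaining = new_remaining
    return rules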

First-Order Rule Induction

- y and xi are now predicates with arguments, e.g.: y is Ancestor(x,y), xi is Parent(x,y)
- Literals to add are predicates or their negations
- Literal to add must include at least one variable already appearing in rule
- Adding a literal changes # groundings of rule, e.g.: Ancestor(x,z) ∧ Parent(z,y) ⇒ Ancestor(x,y)
- Eval(r) must take this into account, e.g.: multiply by # positive groundings of rule still covered after adding literal

Overview

- Motivation
- Foundational areas
- Markov logic
- NLP applications
- Basics
- Supervised learning
- Unsupervised learning

Markov Logic

- Syntax: Weighted first-order formulas
- Semantics: Feature templates for Markov networks
- Intuition: Soften logical constraints
- Give each formula a weight (higher weight ⇒ stronger constraint)

Example: Coreference Resolution

"Barack Obama, the 44th President of the United States, is the first African American to hold the office."

Two mention constants: A and B

Ground atoms: Head(A,President), Head(B,President), Head(A,Obama), Head(B,Obama), MentionOf(A,Obama), MentionOf(B,Obama), Apposition(A,B), Apposition(B,A)

Markov Logic Networks

- MLN is template for ground Markov nets
- Probability of a world x:

$P(x) = \frac{1}{Z} \exp\left(\sum_i w_i n_i(x)\right)$

($w_i$: weight of formula i; $n_i(x)$: no. of true groundings of formula i in x)

- Typed variables and constants greatly reduce size of ground Markov net
- Functions, existential quantifiers, etc.
- Can handle infinite domains [Singla & Domingos, 2007] and continuous domains [Wang & Domingos, 2008]
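
A brute-force Python sketch of this definition (not Alchemy's implementation): formulas pairs each weight with a hypothetical counting function n_i returning the number of true groundings of formula i in a world, and Z is computed by explicit enumeration, so this is only feasible for tiny domains.

import math

def log_unnorm(formulas, world):
    # sum_i w_i * n_i(x)
    return sum(w * n(world) for w, n in formulas)

def world_prob(formulas, all_worlds, world):
    # Partition function Z: enumerate every possible world (tiny domains only).
    Z = sum(math.exp(log_unnorm(formulas, w_)) for w_ in all_worlds)
    return math.exp(log_unnorm(formulas, world)) / Z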

Relation to Statistical Models

- Special cases
- Markov networks
- Markov random fields
- Bayesian networks
- Log-linear models
- Exponential models
- Max. entropy models
- Gibbs distributions
- Boltzmann machines
- Logistic regression
- Hidden Markov models
- Conditional random fields

- Obtained by making all predicates zero-arity
- Markov logic allows objects to be interdependent (non-i.i.d.)

Relation to First-Order Logic

- Infinite weights ⇒ First-order logic
- Satisfiable KB, positive weights ⇒ Satisfying assignments = Modes of distribution
- Markov logic allows contradictions between formulas

MLN Algorithms: The First Three Generations

Problem             First generation         Second generation   Third generation
MAP inference       Weighted satisfiability  Lazy inference      Cutting planes
Marginal inference  Gibbs sampling           MC-SAT              Lifted inference
Weight learning     Pseudo-likelihood        Voted perceptron    Scaled conj. gradient
Structure learning  Inductive logic progr.   ILP + PL (etc.)     Clustering + pathfinding

MAP/MPE Inference

- Problem: Find most likely state of world given evidence

$\arg\max_y P(y \mid x) = \arg\max_y \frac{1}{Z_x} \exp\left(\sum_i w_i n_i(x, y)\right) = \arg\max_y \sum_i w_i n_i(x, y)$

- This is just the weighted MaxSAT problem
- Use weighted SAT solver (e.g., MaxWalkSAT [Kautz et al., 1997])

The MaxWalkSAT Algorithm

for i ← 1 to max-tries do
    solution ← random truth assignment
    for j ← 1 to max-flips do
        if Σ weights(sat. clauses) > threshold then
            return solution
        c ← random unsatisfied clause
        with probability p
            flip a random variable in c
        else
            flip variable in c that maximizes Σ weights(sat. clauses)
return failure, best solution found
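
A hedged Python sketch of MaxWalkSAT, mirroring the WalkSAT sketch earlier but scoring moves by the total weight of satisfied clauses; clauses is a list of (weight, [(variable, is_positive), ...]) pairs, and all names are illustrative.

import random

def clause_sat(clause, a):
    return any(a[v] == pos for v, pos in clause)

def sat_weight(clauses, a):
    # Total weight of the clauses satisfied by assignment a.
    return sum(w for w, c in clauses if clause_sat(c, a))

def maxwalksat(clauses, variables, max_tries, max_flips, threshold, p=0.5):
    best = None
    for _ in range(max_tries):
        a = {v: random.random() < 0.5 for v in variables}
        for _ in range(max_flips):
            if sat_weight(clauses, a) > threshold:
                return a
            unsat = [c for w, c in clauses if not clause_sat(c, a)]
            if not unsat:
                break
            c = random.choice(unsat)
            if random.random() < p:
                v = random.choice(c)[0]  # random-walk move
            else:
                # Greedy move: maximize the weight of satisfied clauses.
                v = max((v for v, _ in c),
                        key=lambda u: sat_weight(clauses, {**a, u: not a[u]}))
            a[v] = not a[v]
            if best is None or sat_weight(clauses, a) > sat_weight(clauses, best):
                best = dict(a)
    return best  # failure: return the best solution found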

Computing Probabilities

- P(Formula | MLN, C) = ?
- MCMC: Sample worlds, check formula holds
- P(Formula1 | Formula2, MLN, C) = ?
- If Formula2 = conjunction of ground atoms
- First construct min subset of network necessary to answer query (generalization of KBMC)
- Then apply MCMC

But Insufficient for Logic

- Problem: Deterministic dependencies break MCMC; near-deterministic ones make it very slow
- Solution: Combine MCMC and WalkSAT → MC-SAT algorithm [Poon & Domingos, 2006]

Auxiliary-Variable Methods

- Main ideas
- Use auxiliary variables to capture dependencies
- Turn difficult sampling into uniform sampling
- Given distribution P(x): sample from a joint f(x, u) whose marginal over x is P(x), then discard u

Slice Sampling [Damien et al., 1999]

(Figure: alternate between sampling u(k) uniformly from (0, P(x(k))) and sampling x(k+1) uniformly from the slice { x : P(x) ≥ u(k) })

Slice Sampling

- Identifying the slice may be difficult
- Introduce an auxiliary variable $u_i$ for each potential $\phi_i$
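
A hedged Python sketch of one slice-sampling step for a discrete distribution, where P is an (unnormalized) probability function over a small finite support; the auxiliary variable u is drawn and then discarded.

import random

def slice_step(P, support, x):
    # Auxiliary variable: u uniform in (0, P(x)).
    u = random.uniform(0, P(x))
    # The slice: all states whose probability is at least u.
    slice_states = [s for s in support if P(s) >= u]
    # Difficult sampling becomes uniform sampling over the slice.
    return random.choice(slice_states)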

The MC-SAT Algorithm

- Select random subset M of satisfied clauses
- Clause Cᵢ is selected with probability 1 − exp(−wᵢ)
- Larger wᵢ ⇒ Cᵢ more likely to be selected
- Hard clause (wᵢ → ∞): Always selected
- Slice ≡ States that satisfy clauses in M
- Use SAT solver to sample x | u
- Orders of magnitude faster than Gibbs sampling, etc.
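
A hedged Python sketch of MC-SAT's clause-selection step under the same clause representation as the MaxWalkSAT sketch above; sample_sat stands in for a SAT-solver-based near-uniform sampler and is hypothetical.

import math
import random

def mc_sat_step(weighted_clauses, state, sample_sat):
    # Each clause satisfied by the current state joins M with probability
    # 1 - exp(-w): heavier clauses are kept more often, and hard clauses
    # (w -> infinity) are always kept.
    M = [c for w, c in weighted_clauses
         if any(state[v] == pos for v, pos in c)
         and random.random() < 1 - math.exp(-w)]
    # Next state: a (near-)uniform sample from the slice, i.e. the
    # assignments that satisfy every clause in M.
    return sample_sat(M)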

But It Is Not Scalable

- 1000 researchers
- Coauthor(x,y): 1 million ground atoms
- Coauthor(x,y) ^ Coauthor(y,z) => Coauthor(x,z): 1 billion ground clauses
- Exponential in arity

Sparsity to the Rescue

- 1000 researchers
- Coauthor(x,y): 1 million ground atoms
- But most atoms are false
- Coauthor(x,y) ^ Coauthor(y,z) => Coauthor(x,z)
- 1 billion ground clauses
- Most trivially satisfied if most atoms are false
- No need to explicitly compute most of them

Lazy Inference

- LazySAT [Singla & Domingos, 2006a]
- Lazy version of WalkSAT [Selman et al., 1996]
- Grounds atoms/clauses as needed
- Greatly reduces memory usage
- The idea is much more general [Poon & Domingos, 2008a]

General Method for Lazy Inference

- If most variables assume the default value, it is wasteful to instantiate all variables / functions
- Main idea
- Allocate memory for a small subset of "active" variables / functions
- Activate more if necessary as inference proceeds
- Applicable to a diverse set of algorithms: satisfiability solvers (systematic, local-search), Markov chain Monte Carlo, MPE / MAP algorithms, maximum expected utility algorithms, belief propagation, MC-SAT, etc.
- Reduce memory and time by orders of magnitude

Lifted Inference

- Consider belief propagation (BP)
- Often in large problems, many nodes are interchangeable: they send and receive the same messages throughout BP
- Basic idea: Group them into supernodes, forming lifted network
- Smaller network → Faster inference
- Akin to resolution in first-order logic

Belief Propagation

(Figure: factor graph with feature nodes f and variable nodes x passing messages along edges)

Lifted Belief Propagation

(Figure: interchangeable nodes grouped into supernodes and superfeatures; the resulting messages are functions of edge counts)

Learning

- Data is a relational database
- Closed world assumption (if not: EM)
- Learning parameters (weights)
- Learning structure (formulas)

Parameter Learning

- Parameter tying: Groundings of same clause
- Generative learning: Pseudo-likelihood
- Discriminative learning: Conditional likelihood; use MC-SAT or MaxWalkSAT for inference

$\frac{\partial}{\partial w_i} \log P_w(x) = n_i(x) - E_w[n_i(x)]$

($n_i(x)$: no. of times clause i is true in data; $E_w[n_i(x)]$: expected no. of times clause i is true according to MLN)

Parameter Learning

- Pseudo-likelihood + L-BFGS is fast and robust but can give poor inference results
- Voted perceptron: Gradient descent + MAP inference
- Scaled conjugate gradient

Voted Perceptron for MLNs

- HMMs are special case of MLNs
- Replace Viterbi by MaxWalkSAT
- Network can now be arbitrary graph

wi ← 0
for t ← 1 to T do
    yMAP ← MaxWalkSAT(x)
    wi ← wi + η [counti(yData) − counti(yMAP)]
return Σt wi / T

Problem: Multiple Modes

- Not alleviated by contrastive divergence
- Alleviated by MC-SAT
- Warm start: Start each MC-SAT run at previous end state

Problem: Extreme Ill-Conditioning

- Solvable by quasi-Newton, conjugate gradient, etc.
- But line searches require exact inference
- Solution: Scaled conjugate gradient [Lowd & Domingos, 2007]
- Use Hessian to choose step size
- Compute quadratic form inside MC-SAT
- Use inverse diagonal Hessian as preconditioner

Structure Learning

- Standard inductive logic programming optimizes the wrong thing
- But can be used to overgenerate for L1 pruning
- Our approach: ILP + pseudo-likelihood + structure priors
- For each candidate structure change: Start from current weights + relax convergence
- Use subsampling to compute sufficient statistics

Structure Learning

- Initial state: Unit clauses or prototype KB
- Operators: Add/remove literal, flip sign
- Evaluation function: Pseudo-likelihood + structure prior
- Search: Beam search, shortest-first search

Alchemy

- Open-source software including
- Full first-order logic syntax
- Generative and discriminative weight learning
- Structure learning
- Weighted satisfiability, MCMC, lifted BP
- Programming language features

alchemy.cs.washington.edu

Alchemy vs. Prolog vs. BUGS

                Alchemy                          Prolog           BUGS
Representation  F.O. logic + Markov nets         Horn clauses     Bayes nets
Inference       Model checking, MCMC, lifted BP  Theorem proving  MCMC
Learning        Parameters + structure           No               Parameters
Uncertainty     Yes                              No               Yes
Relational      Yes                              Yes              No

Constrained Conditional Model

- Representation: Integer linear programs
- Local classifiers + global constraints
- Inference: LP solver
- Parameter learning: None for constraints
- Weights of soft constraints set heuristically
- Local weights typically learned independently
- Structure learning: None to date
- But see latest development in NAACL-10

Running Alchemy

- Programs
- infer
- learnwts
- learnstruct
- Options

- MLN file
- Types (optional)
- Predicates
- Formulas
- Database files

Overview

- Motivation
- Foundational areas
- Markov logic
- NLP applications
- Basics
- Supervised learning
- Unsupervised learning

Uniform Distribn.: Empty MLN

- Example: Unbiased coin flips
- Type: flip = { 1, …, 20 }
- Predicate: Heads(flip)

Binomial Distribn.: Unit Clause

- Example: Biased coin flips
- Type: flip = { 1, …, 20 }
- Predicate: Heads(flip)
- Formula: Heads(f)
- Weight: Log odds of heads
- By default, MLN includes unit clauses for all predicates (captures marginal distributions, etc.)
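
A quick numeric check of the "log odds" claim (not from the tutorial): with a single unit clause Heads(f) of weight w, each flip independently has P(Heads) = e^w / (e^w + 1), so w = log(p / (1 − p)).

import math

p = 0.7                    # assumed bias of the coin (illustrative value)
w = math.log(p / (1 - p))  # weight of the unit clause Heads(f)

# Two worlds per flip: Heads (score w) and !Heads (score 0).
print(math.exp(w) / (math.exp(w) + math.exp(0)))  # recovers 0.7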

Multinomial Distribution

- Example: Throwing die
- Types: throw = { 1, …, 20 }, face = { 1, …, 6 }
- Predicate: Outcome(throw,face)
- Formulas: Outcome(t,f) ^ f != f' => !Outcome(t,f').
- Exist f Outcome(t,f).
- Too cumbersome!

Multinomial Distrib.: ! Notation

- Example: Throwing die
- Types: throw = { 1, …, 20 }, face = { 1, …, 6 }
- Predicate: Outcome(throw,face!)
- Formulas: none needed
- Semantics: Arguments without ! determine arguments with !
- Also makes inference more efficient (triggers blocking)

Multinomial Distrib.: + Notation

- Example: Throwing biased die
- Types: throw = { 1, …, 20 }, face = { 1, …, 6 }
- Predicate: Outcome(throw,face!)
- Formulas: Outcome(t,+f)
- Semantics: Learn a weight for each grounding of the args with +

Logistic Regression (MaxEnt)

Logistic regression: $\log \frac{P(C(x)=1 \mid F(x))}{P(C(x)=0 \mid F(x))} = a + \sum_i b_i f_i(x)$

Type: obj = { 1, ..., n }
Query predicate: C(obj)
Evidence predicates: Fi(obj)
Formulas:
  a   C(x)
  bi  Fi(x) ^ C(x)

Resulting distribution: $P(C, F) = \frac{1}{Z} \exp\left(a C + \sum_i b_i F_i C\right)$

Therefore: $\log \frac{P(C=1 \mid F)}{P(C=0 \mid F)} = a + \sum_i b_i F_i$

Alternative form: Fi(x) => C(x)
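
A hedged numeric check (illustrative values, not from the tutorial) that the two formulas above reproduce logistic regression: with the evidence F fixed, P(C = 1 | F) is the sigmoid of a + Σᵢ bᵢ Fᵢ.

import math

a, b = -1.0, [2.0, 0.5]
F = [1, 0]  # observed evidence predicates

score_c1 = a + sum(bi * fi for bi, fi in zip(b, F))  # world with C = 1
score_c0 = 0.0                                       # world with C = 0
p_c1 = math.exp(score_c1) / (math.exp(score_c1) + math.exp(score_c0))

sigmoid = 1 / (1 + math.exp(-(a + sum(bi * fi for bi, fi in zip(b, F)))))
assert abs(p_c1 - sigmoid) < 1e-12
print(p_c1)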

Hidden Markov Models

obs = { Red, Green, Yellow }
state = { Stop, Drive, Slow }
time = { 0, ..., 100 }

State(state!,time)
Obs(obs!,time)

State(s,0)
State(s,t) ^ State(s',t+1)
Obs(o,t) ^ State(s,t)

Sparse HMM: State(s,t) => State(s1,t+1) v State(s2,t+1) v ... .

Bayesian Networks

- Use all binary predicates with same first argument (the object x)
- One predicate for each variable A: A(x,v!)
- One clause for each line in the CPT and value of the variable
- Context-specific independence: One clause for each path in the decision tree
- Logistic regression: As before
- Noisy OR: Deterministic OR + pairwise clauses

Relational Models

- Knowledge-based model construction
- Allow only Horn clauses
- Same as Bayes nets, except arbitrary relations
- Combin. function: Logistic regression, noisy-OR or external
- Stochastic logic programs
- Allow only Horn clauses
- Weight of clause = log(p)
- Add formulas: Head holds ⇒ Exactly one body holds
- Probabilistic relational models
- Allow only binary relations
- Same as Bayes nets, except first argument can vary

Relational Models

- Relational Markov networks
- SQL → Datalog → First-order logic
- One clause for each state of a clique
- + syntax in Alchemy facilitates this
- Bayesian logic
- Object = Cluster of similar/related observations
- Observation constants + object constants
- Predicate InstanceOf(Obs,Obj) and clauses using it
- Unknown relations: Second-order Markov logic
- S. Kok & P. Domingos, "Statistical Predicate Invention", in Proc. ICML-2007.

Overview

- Motivation
- Foundational areas
- Markov logic
- NLP applications
- Basics
- Supervised learning
- Unsupervised learning

Text Classification

"The 56th quadrennial United States presidential election was held on November 4, 2008. Outgoing Republican President George W. Bush's policies and actions and the American public's desire for change were key issues throughout the campaign."

Topic = politics

"The Chicago Bulls are an American professional basketball team based in Chicago, Illinois, playing in the Central Division of the Eastern Conference in the National Basketball Association (NBA)."

Topic = sports

Text Classification

page = { 1, ..., max }
word = { ... }
topic = { ... }

Topic(page,topic)
HasWord(page,word)

Topic(p,t)
HasWord(p,w) => Topic(p,t)

If topics mutually exclusive: Topic(page,topic!)

Text Classification

page = { 1, ..., max }
word = { ... }
topic = { ... }

Topic(page,topic)
HasWord(page,word)
Links(page,page)

Topic(p,t)
HasWord(p,w) => Topic(p,t)
Topic(p,t) ^ Links(p,p') => Topic(p',t)

Cf. S. Chakrabarti, B. Dom & P. Indyk, "Hypertext Classification Using Hyperlinks", in Proc. SIGMOD-1998.

Entity Resolution

AUTHOR: H. POON & P. DOMINGOS
TITLE: UNSUPERVISED SEMANTIC PARSING
VENUE: EMNLP-09

SAME?

AUTHOR: Hoifung Poon and Pedro Domings
TITLE: Unsupervised semantic parsing
VENUE: Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing

AUTHOR: Poon, Hoifung and Domings, Pedro
TITLE: Unsupervised ontology induction from text
VENUE: Proceedings of the Forty-Eighth Annual Meeting of the Association for Computational Linguistics

SAME?

AUTHOR: H. Poon, P. Domings
TITLE: Unsupervised ontology induction
VENUE: ACL-10

Entity Resolution

Problem: Given database, find duplicate records

HasToken(token,field,record)
SameField(field,record,record)
SameRecord(record,record)

HasToken(t,f,r) ^ HasToken(t,f,r') => SameField(f,r,r')
SameField(f,r,r') => SameRecord(r,r')

Entity Resolution

Problem: Given database, find duplicate records

HasToken(token,field,record)
SameField(field,record,record)
SameRecord(record,record)

HasToken(t,f,r) ^ HasToken(t,f,r') => SameField(f,r,r')
SameField(f,r,r') => SameRecord(r,r')
SameRecord(r,r') ^ SameRecord(r',r'') => SameRecord(r,r'')

Cf. A. McCallum & B. Wellner, "Conditional Models of Identity Uncertainty with Application to Noun Coreference", in Adv. NIPS 17, 2005.

Entity Resolution

Can also resolve fields:

HasToken(token,field,record)
SameField(field,record,record)
SameRecord(record,record)

HasToken(t,f,r) ^ HasToken(t,f,r') => SameField(f,r,r')
SameField(f,r,r') <=> SameRecord(r,r')
SameRecord(r,r') ^ SameRecord(r',r'') => SameRecord(r,r'')
SameField(f,r,r') ^ SameField(f,r',r'') => SameField(f,r,r'')

More: P. Singla & P. Domingos, "Entity Resolution with Markov Logic", in Proc. ICDM-2006.

Information Extraction

Unsupervised Semantic Parsing, Hoifung Poon and Pedro Domingos. Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing. Singapore: ACL.

UNSUPERVISED SEMANTIC PARSING. H. POON & P. DOMINGOS. EMNLP-2009.

Information Extraction

(Fields: Author, Title, Venue)

Unsupervised Semantic Parsing, Hoifung Poon and Pedro Domingos. Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing. Singapore: ACL.

SAME?

UNSUPERVISED SEMANTIC PARSING. H. POON & P. DOMINGOS. EMNLP-2009.

Information Extraction

- Problem: Extract database from text or semi-structured sources
- Example: Extract database of publications from citation list(s) (the "CiteSeer problem")
- Two steps
- Segmentation: Use HMM to assign tokens to fields
- Entity resolution: Use logistic regression and transitivity

Information Extraction

Token(token, position, citation)
InField(position, field!, citation)
SameField(field, citation, citation)
SameCit(citation, citation)

Token(t,i,c) => InField(i,f,c)
InField(i,f,c) => InField(i+1,f,c)
Token(t,i,c) ^ InField(i,f,c) ^ Token(t,i',c') ^ InField(i',f,c') => SameField(f,c,c')
SameField(f,c,c') <=> SameCit(c,c')
SameField(f,c,c') ^ SameField(f,c',c'') => SameField(f,c,c'')
SameCit(c,c') ^ SameCit(c',c'') => SameCit(c,c'')

Information Extraction

Token(token, position, citation)
InField(position, field!, citation)
SameField(field, citation, citation)
SameCit(citation, citation)

Token(t,i,c) => InField(i,f,c)
InField(i,f,c) ^ !Token(".",i,c) => InField(i+1,f,c)
Token(t,i,c) ^ InField(i,f,c) ^ Token(t,i',c') ^ InField(i',f,c') => SameField(f,c,c')
SameField(f,c,c') <=> SameCit(c,c')
SameField(f,c,c') ^ SameField(f,c',c'') => SameField(f,c,c'')
SameCit(c,c') ^ SameCit(c',c'') => SameCit(c,c'')

More: H. Poon & P. Domingos, "Joint Inference in Information Extraction", in Proc. AAAI-2007.

Biomedical Text Mining

- Traditionally, named entity recognition or information extraction
- E.g., protein recognition, protein-protein interaction identification
- BioNLP-09 shared task: Nested bio-events
- Much harder than traditional IE
- Top F1 around 50%
- Naturally calls for joint inference

Bio-Event Extraction

"Involvement of p70(S6)-kinase activation in IL-10 up-regulation in human monocytes by gp41 envelope protein of human immunodeficiency virus type 1 ..."

(Event graph: involvement — Theme: up-regulation, Cause: activation; up-regulation — Theme: IL-10, Cause: gp41, Site: human monocyte; activation — Theme: p70(S6)-kinase)

Bio-Event Extraction

Token(position, token)
DepEdge(position, position, dependency)
IsProtein(position)
EvtType(position, evtType)
InArgPath(position, position, argType!)

Token(i,w) => EvtType(i,t)
Token(j,w) ^ DepEdge(i,j,d) => EvtType(i,t)
DepEdge(i,j,d) => InArgPath(i,j,a)
Token(i,w) ^ DepEdge(i,j,d) => InArgPath(i,j,a)

Logistic regression

Bio-Event Extraction

Token(position, token)
DepEdge(position, position, dependency)
IsProtein(position)
EvtType(position, evtType)
InArgPath(position, position, argType!)

Token(i,w) => EvtType(i,t)
Token(j,w) ^ DepEdge(i,j,d) => EvtType(i,t)
DepEdge(i,j,d) => InArgPath(i,j,a)
Token(i,w) ^ DepEdge(i,j,d) => InArgPath(i,j,a)
InArgPath(i,j,Theme) => IsProtein(j) v (Exist k k != i ^ InArgPath(j,k,Theme)).

More: H. Poon and L. Vanderwende, "Joint Inference for Knowledge Extraction from Biomedical Literature", 10:40 am, June 4, Gold Room.

Adding a few joint inference rules doubles the F1
Temporal Information Extraction

- Identify event times and temporal relations (BEFORE, AFTER, OVERLAP)
- E.g., who is the President of U.S.A.?
- Obama: 1/20/2009 – present
- G. W. Bush: 1/20/2001 – 1/19/2009
- Etc.

Temporal Information Extraction

DepEdge(position, position, dependency)
Event(position, event)
After(event, event)

DepEdge(i,j,d) ^ Event(i,p) ^ Event(j,q) => After(p,q)
After(p,q) ^ After(q,r) => After(p,r)

Temporal Information Extraction

DepEdge(position, position, dependency)
Event(position, event)
After(event, event)
Role(position, position, role)

DepEdge(i,j,d) ^ Event(i,p) ^ Event(j,q) => After(p,q)
Role(i,j,ROLE-AFTER) ^ Event(i,p) ^ Event(j,q) => After(p,q)
After(p,q) ^ After(q,r) => After(p,r)

More: K. Yoshikawa, S. Riedel, M. Asahara and Y. Matsumoto, "Jointly Identifying Temporal Relations with Markov Logic", in Proc. ACL-2009; X. Ling & D. Weld, "Temporal Information Extraction", in Proc. AAAI-2010.

Semantic Role Labeling

- Problem: Identify arguments for a predicate
- Two steps
- Argument identification: Determine whether a phrase is an argument
- Role classification: Determine the type of an argument (agent, theme, temporal, adjunct, etc.)

Semantic Role Labeling

Token(position, token)
DepPath(position, position, path)
IsPredicate(position)
Role(position, position, role!)
HasRole(position, position)

Token(i,t) => IsPredicate(i)
DepPath(i,j,p) => Role(i,j,r)
HasRole(i,j) => IsPredicate(i)
IsPredicate(i) => Exist j HasRole(i,j)
HasRole(i,j) => Exist r Role(i,j,r)
Role(i,j,r) => HasRole(i,j)

Cf. K. Toutanova, A. Haghighi, C. Manning, "A global joint model for semantic role labeling", in Computational Linguistics 2008.

Joint Semantic Role Labeling and Word Sense Disambiguation

Token(position, token)
DepPath(position, position, path)
IsPredicate(position)
Role(position, position, role!)
HasRole(position, position)
Sense(position, sense!)

Token(i,t) => IsPredicate(i)
DepPath(i,j,p) => Role(i,j,r)
Sense(i,s) => IsPredicate(i)
HasRole(i,j) => IsPredicate(i)
IsPredicate(i) => Exist j HasRole(i,j)
HasRole(i,j) => Exist r Role(i,j,r)
Role(i,j,r) => HasRole(i,j)
Token(i,t) ^ Role(i,j,r) => Sense(i,s)

More: I. Meza-Ruiz & S. Riedel, "Jointly Identifying Predicates, Arguments and Senses using Markov Logic", in Proc. NAACL-2009.

Practical Tips: Modeling

- Add all unit clauses (the default)
- How to handle uncertain data: R(x,y) ^ R'(x,y) (the "HMM trick")
- Implications vs. conjunctions
- For soft correlation, conjunctions often better
- Implication A => B is equivalent to !(A ^ !B)
- Shares cases with other implications like A => C
- Makes learning unnecessarily harder

Practical Tips: Efficiency

- Open/closed world assumptions
- Low clause arities
- Low numbers of constants
- Short inference chains

Practical Tips: Development

- Start with easy components
- Gradually expand to full task
- Use the simplest MLN that works
- Cycle: Add/delete formulas, learn, and test

Overview

- Motivation
- Foundational areas
- Markov logic
- NLP applications
- Basics
- Supervised learning
- Unsupervised learning

Unsupervised Learning: Why?

- Virtually unlimited supply of unlabeled text
- Labeling is expensive (cf. Penn Treebank)
- Often difficult to label with consistency and high quality (e.g., semantic parses)
- Emerging field: Machine reading
- Extract knowledge from unstructured text with high precision/recall and minimal human effort
- Check out the LBR Workshop (WS9) on Sunday

Unsupervised Learning: How?

- I.i.d. learning: Sophisticated model requires more labeled data
- Statistical relational learning: Sophisticated model may require less labeled data
- Relational dependencies constrain problem space
- One formula is worth a thousand labels
- Small amount of domain knowledge → large-scale joint inference

Unsupervised Learning: How?

- Ambiguities vary among objects
- Joint inference → Propagate information from unambiguous objects to ambiguous ones
- E.g.: "G. W. Bush ... He ... Mrs. Bush ..."
- Are they coreferent? "He" should be coreferent with "G. W. Bush", so it must be singular male; "Mrs. Bush" must be singular female; verdict: "He" and "Mrs. Bush" are not coreferent

Parameter Learning

- Marginalize out hidden variables
- Use MC-SAT to approximate both expectations
- May also combine with contrastive estimation [Poon, Cherry & Toutanova, NAACL-2009]

$\frac{\partial}{\partial w_i} \log P_w(x) = E_{z \mid x}[n_i] - E_{x,z}[n_i]$

(first expectation: sum over z, conditioned on the observed x; second: summed over both x and z)

Unsupervised Coreference Resolution

Head(mention, string)
Type(mention, type)
MentionOf(mention, entity)

Mixture model:
MentionOf(m,e)
Type(m,t)
Head(m,h)

Joint inference formulas (enforce agreement):
MentionOf(a,e) ^ MentionOf(b,e) => (Type(a,t) <=> Type(b,t))
(similarly for Number, Gender, etc.)

Unsupervised Coreference Resolution

Head(mention, string)
Type(mention, type)
MentionOf(mention, entity)
Apposition(mention, mention)

Mixture model:
MentionOf(m,e)
Type(m,t)
Head(m,h)

Joint inference formulas (enforce agreement):
MentionOf(a,e) ^ MentionOf(b,e) => (Type(a,t) <=> Type(b,t))
(similarly for Number, Gender, etc.)

Joint inference formulas (leverage apposition):
Apposition(a,b) => (MentionOf(a,e) <=> MentionOf(b,e))

More: H. Poon and P. Domingos, "Joint Unsupervised Coreference Resolution with Markov Logic", in Proc. EMNLP-2008.

Relational Clustering: Discover Unknown Predicates

- Cluster relations along with objects
- Use second-order Markov logic [Kok & Domingos, 2007, 2008]
- Key idea: Cluster combination determines likelihood of relations: InClust(r,c) ^ InClust(x,a) ^ InClust(y,b) => r(x,y)
- Input: Relational tuples extracted by TextRunner [Banko et al., 2007]
- Output: Semantic network

Recursive Relational Clustering

- Unsupervised semantic parsing [Poon & Domingos, EMNLP-2009]
- Text → Knowledge
- Start directly from text
- Identify meaning units + resolve variations
- Use high-order Markov logic (variables over arbitrary lambda forms and their clusters)
- End-to-end machine reading: Read text, then answer questions

Semantic Parsing

"IL-4 protein induces CD11b"

Target logical form: INDUCE(e1), INDUCER(e1,e2), INDUCED(e1,e3), IL-4(e2), CD11B(e3)

(Figure: structured prediction = partition + assignment. The dependency tree — induces with nsubj → protein, dobj → CD11b, and protein with nn → IL-4 — is partitioned into meaning units, and each part is assigned to a cluster: INDUCE, INDUCER, INDUCED, IL-4, CD11B.)

Challenge: Same Meaning, Many Variations

- IL-4 up-regulates CD11b
- Protein IL-4 enhances the expression of CD11b
- CD11b expression is induced by IL-4 protein
- The cytokine interleukin-4 induces CD11b expression
- IL-4's up-regulation of CD11b, ...

Unsupervised Semantic Parsing

- USP = Recursively cluster arbitrary expressions composed with / by similar expressions
- IL-4 induces CD11b
- Protein IL-4 enhances the expression of CD11b
- CD11b expression is enhanced by IL-4 protein
- The cytokine interleukin-4 induces CD11b expression
- IL-4's up-regulation of CD11b, ...
- First cluster same forms at the atom level, then recursively cluster forms in composition with same forms

Unsupervised Semantic Parsing

- Exponential prior on number of parameters
- Event/object/property cluster mixtures: InClust(e,c) ^ HasValue(e,v)

(Figure: object/event cluster INDUCE containing forms such as "induces" and "enhances", with a property cluster INDUCER holding distributions over forms (IL-4, IL-8), dependencies (nsubj, agent), and argument counts (None, One).)

But State Space Too Large

- Coreference: #clusters × #mentions
- USP: #clusters × exp(#tokens)
- Also, meaning units often small, and many singleton clusters
- → Use combinatorial search

Inference: Hill-Climb Probability

(Figure: initialize with one lambda form per token of "IL-4 protein induces CD11b", connected by nsubj, dobj, and nn edges; then repeatedly apply the search operator — lambda reduction — to compose subforms, e.g. merging "IL-4" and "protein" into a single unit, hill-climbing on probability.)

Learning: Hill-Climb Likelihood

(Figure: initialize with one cluster per form — protein, enhances, IL-4, induces; then apply search operators: MERGE combines clusters, e.g. {induces 0.2, enhances 0.8}, and COMPOSE joins co-occurring forms, e.g. "IL-4 protein"; hill-climbing on likelihood.)

Unsupervised Ontology Induction

- Limitations of USP
- No ISA hierarchy among clusters
- Little smoothing
- Limited capability to generalize
- OntoUSP [Poon & Domingos, ACL-2010]
- Extends USP to also induce ISA hierarchy
- Joint approach for ontology induction, population, and knowledge extraction
- To appear in ACL (see you in Uppsala :-)

OntoUSP

- Modify the cluster mixture formula: InClust(e,c) ^ ISA(c,d) ^ HasValue(e,v)
- Hierarchical smoothing + clustering
- New operator in learning: ABSTRACTION

(Figure: rather than merging the clusters INDUCE (induces, enhances, up-regulates, ...) and INHIBIT (inhibits, suppresses, ...), the ABSTRACTION operator creates a parent cluster REGULATE with ISA links to both.)

End of The Beginning

- Not merely a user guide of MLN and Alchemy
- Statistical relational learning
- Growth area for machine learning and NLP

Future Work: Inference

- Scale up inference
- Cutting-plane methods (e.g., [Riedel, 2008])
- Unify lifted inference with sampling
- Coarse-to-fine inference
- Alternative technology
- E.g., linear programming, Lagrangian relaxation

Future Work: Supervised Learning

- Alternative optimization objectives
- E.g., max-margin learning [Huynh & Mooney, 2009]
- Learning for efficient inference
- E.g., learning arithmetic circuits [Lowd & Domingos, 2008]
- Structure learning: Improve accuracy and scalability
- E.g., [Kok & Domingos, 2009]

Future Work: Unsupervised Learning

- Model: Learning objective, formalism, etc.
- Learning: Local optima, intractability, etc.
- Hyperparameter tuning
- Leverage available resources
- Semi-supervised learning
- Multi-task learning
- Transfer learning (e.g., domain adaptation)
- Human in the loop
- E.g., interactive ML, active learning, crowdsourcing

Future Work: NLP Applications

- Existing application areas
- More joint inference opportunities
- Additional domain knowledge
- Combine multiple pipeline stages
- A killer app: Machine reading
- Many, many more awaiting YOU to discover

Summary

- We need to unify logical and statistical NLP
- Markov logic provides a language for this
- Syntax: Weighted first-order formulas
- Semantics: Feature templates of Markov nets
- Inference: Satisfiability, MCMC, lifted BP, etc.
- Learning: Pseudo-likelihood, VP, PSCG, ILP, etc.
- Growing set of NLP applications
- Open-source software: Alchemy
- Book: Domingos & Lowd, Markov Logic, Morgan & Claypool, 2009.

alchemy.cs.washington.edu

References

- [Banko et al., 2007] Michele Banko, Michael J. Cafarella, Stephen Soderland, Matt Broadhead, Oren Etzioni, "Open Information Extraction from the Web", in Proc. IJCAI-2007.
- [Chakrabarti et al., 1998] Soumen Chakrabarti, Byron Dom, Piotr Indyk, "Hypertext Classification Using Hyperlinks", in Proc. SIGMOD-1998.
- [Damien et al., 1999] Paul Damien, Jon Wakefield, Stephen Walker, "Gibbs sampling for Bayesian non-conjugate and hierarchical models by using auxiliary variables", Journal of the Royal Statistical Society B, 61(2).
- [Domingos & Lowd, 2009] Pedro Domingos and Daniel Lowd, Markov Logic, Morgan & Claypool.
- [Friedman et al., 1999] Nir Friedman, Lise Getoor, Daphne Koller, Avi Pfeffer, "Learning probabilistic relational models", in Proc. IJCAI-1999.

References

- [Halpern, 1990] Joe Halpern, "An analysis of first-order logics of probability", Artificial Intelligence 46.
- [Huynh & Mooney, 2009] Tuyen Huynh and Raymond Mooney, "Max-Margin Weight Learning for Markov Logic Networks", in Proc. ECML-2009.
- [Kautz et al., 1997] Henry Kautz, Bart Selman, Yuejun Jiang, "A general stochastic approach to solving problems with hard and soft constraints", in The Satisfiability Problem: Theory and Applications. AMS.
- [Kok & Domingos, 2007] Stanley Kok and Pedro Domingos, "Statistical Predicate Invention", in Proc. ICML-2007.
- [Kok & Domingos, 2008] Stanley Kok and Pedro Domingos, "Extracting Semantic Networks from Text via Relational Clustering", in Proc. ECML-2008.

References

- [Kok & Domingos, 2009] Stanley Kok and Pedro Domingos, "Learning Markov Logic Network Structure via Hypergraph Lifting", in Proc. ICML-2009.
- [Ling & Weld, 2010] Xiao Ling and Daniel S. Weld, "Temporal Information Extraction", in Proc. AAAI-2010.
- [Lowd & Domingos, 2007] Daniel Lowd and Pedro Domingos, "Efficient Weight Learning for Markov Logic Networks", in Proc. PKDD-2007.
- [Lowd & Domingos, 2008] Daniel Lowd and Pedro Domingos, "Learning Arithmetic Circuits", in Proc. UAI-2008.
- [Meza-Ruiz & Riedel, 2009] Ivan Meza-Ruiz and Sebastian Riedel, "Jointly Identifying Predicates, Arguments and Senses using Markov Logic", in Proc. NAACL-2009.

References

- [Muggleton, 1996] Stephen Muggleton, "Stochastic logic programs", in Proc. ILP-1996.
- [Nilsson, 1986] Nils Nilsson, "Probabilistic logic", Artificial Intelligence 28.
- [Page et al., 1998] Lawrence Page, Sergey Brin, Rajeev Motwani, Terry Winograd, "The PageRank Citation Ranking: Bringing Order to the Web", Tech. Rept., Stanford University, 1998.
- [Poon & Domingos, 2006] Hoifung Poon and Pedro Domingos, "Sound and Efficient Inference with Probabilistic and Deterministic Dependencies", in Proc. AAAI-2006.
- [Poon & Domingos, 2007] Hoifung Poon and Pedro Domingos, "Joint Inference in Information Extraction", in Proc. AAAI-2007.

References

- [Poon & Domingos, 2008a] Hoifung Poon, Pedro Domingos, Marc Sumner, "A General Method for Reducing the Complexity of Relational Inference and its Application to MCMC", in Proc. AAAI-2008.
- [Poon & Domingos, 2008b] Hoifung Poon and Pedro Domingos, "Joint Unsupervised Coreference Resolution with Markov Logic", in Proc. EMNLP-2008.
- [Poon & Domingos, 2009] Hoifung Poon and Pedro Domingos, "Unsupervised Semantic Parsing", in Proc. EMNLP-2009.
- [Poon, Cherry & Toutanova, 2009] Hoifung Poon, Colin Cherry, Kristina Toutanova, "Unsupervised Morphological Segmentation with Log-Linear Models", in Proc. NAACL-2009.

References

- [Poon & Vanderwende, 2010] Hoifung Poon and Lucy Vanderwende, "Joint Inference for Knowledge Extraction from Biomedical Literature", in Proc. NAACL-2010.
- [Poon & Domingos, 2010] Hoifung Poon and Pedro Domingos, "Unsupervised Ontology Induction from Text", in Proc. ACL-2010.
- [Riedel, 2008] Sebastian Riedel, "Improving the Accuracy and Efficiency of MAP Inference for Markov Logic", in Proc. UAI-2008.
- [Riedel et al., 2009] Sebastian Riedel, Hong-Woo Chun, Toshihisa Takagi and Jun'ichi Tsujii, "A Markov Logic Approach to Bio-Molecular Event Extraction", in Proc. BioNLP 2009 Shared Task.
- [Selman et al., 1996] Bart Selman, Henry Kautz, Bram Cohen, "Local search strategies for satisfiability testing", in Cliques, Coloring, and Satisfiability: Second DIMACS Implementation Challenge. AMS.

References

- [Singla & Domingos, 2006a] Parag Singla and Pedro Domingos, "Memory-Efficient Inference in Relational Domains", in Proc. AAAI-2006.
- [Singla & Domingos, 2006b] Parag Singla and Pedro Domingos, "Entity Resolution with Markov Logic", in Proc. ICDM-2006.
- [Singla & Domingos, 2007] Parag Singla and Pedro Domingos, "Markov Logic in Infinite Domains", in Proc. UAI-2007.
- [Singla & Domingos, 2008] Parag Singla and Pedro Domingos, "Lifted First-Order Belief Propagation", in Proc. AAAI-2008.
- [Taskar et al., 2002] Ben Taskar, Pieter Abbeel, Daphne Koller, "Discriminative probabilistic models for relational data", in Proc. UAI-2002.

References

- [Toutanova, Haghighi & Manning, 2008] Kristina Toutanova, Aria Haghighi, Chris Manning, "A global joint model for semantic role labeling", Computational Linguistics.
- [Wang & Domingos, 2008] Jue Wang and Pedro Domingos, "Hybrid Markov Logic Networks", in Proc. AAAI-2008.
- [Wellman et al., 1992] Michael Wellman, John S. Breese, Robert P. Goldman, "From knowledge bases to decision models", Knowledge Engineering Review 7.
- [Yoshikawa et al., 2009] Katsumasa Yoshikawa, Sebastian Riedel, Masayuki Asahara and Yuji Matsumoto, "Jointly Identifying Temporal Relations with Markov Logic", in Proc. ACL-2009.