Unified Models of Information Extraction and Data Mining with Application to Social Network Analysis
1
Unified Models of Information Extraction and Data Mining with Application to Social Network Analysis
  • Andrew McCallum
  • Information Extraction and Synthesis Laboratory
  • Computer Science Department
  • University of Massachusetts Amherst
  • Joint work with David Jensen
  • Knowledge Discovery and Dissemination (KDD)
    Conference
  • September 2004

Intelligence Technology Innovation Center
ITIC
2
Goal
Improve the state-of-the-art in our ability to mine actionable knowledge from unstructured text.
3
Traditional Pipeline
[Diagram: Document collection → Spider → Filter → IE (segment, classify, associate, cluster) → Database → Knowledge Discovery (discover patterns: entity types, links/relations, events) → Actionable knowledge (prediction, outlier detection, decision support).]
4
Extracting Job Openings from the Web
5
Data Mining the Extracted Job Information
6
IE from Research Papers
7
Mining Research Papers
Rosen-Zvi, Griffiths, Steyvers, Smyth, 2004
Giles et al.
8
IE from Chinese Documents regarding Weather
Department of Terrestrial System, Chinese Academy of Sciences
200k documents, several centuries old: Qing Dynasty archives, memos, newspaper articles, diaries
9
Traditional Pipeline
[Diagram: Document collection → Spider → Filter → IE (segment, classify, associate, cluster) → Database → Knowledge Discovery (discover patterns: entity types, links/relations, events) → Actionable knowledge (prediction, outlier detection, decision support).]
10
Problem
  • Combined in serial juxtaposition, IE and KD are unaware of each other's weaknesses and opportunities.
  • KD begins from a populated DB, unaware of where the data came from, or its inherent uncertainties.
  • IE is unaware of emerging patterns and regularities in the DB.
  • The accuracy of both suffers, and significant mining of complex text sources is beyond reach.

11
Solution
[Diagram: the same pipeline, with two new arrows: IE passes uncertainty info forward into Data Mining, and Data Mining passes emerging patterns back into IE. Document collection → Spider → Filter → IE (segment, classify, associate, cluster) → Database → Data Mining (discover patterns: entity types, links/relations, events) → Actionable knowledge (prediction, outlier detection, decision support).]
12
Research Approach
[Diagram: the pipeline with a single unified probabilistic model spanning IE (segment, classify, associate, cluster) and data mining (discover patterns: entity types, links/relations, events), taking the place of the intermediate database. Document collection → Spider → Filter → Unified Model → Actionable knowledge (prediction, outlier detection, decision support).]
13
Accomplishments, Discoveries & Results
  • Extracting answers, and also uncertainty/confidence.
  • Formally justified as marginalization in graphical models.
  • Applications to new word discovery in Chinese word segmentation, and correction propagation in interactive IE.
  • Joint inference, with efficient methods.
  • Multiple, cascaded label sequences (Factorial CRFs)
  • Multiple distant, but related mentions (Skip-chain CRFs)
  • Multiple co-reference decisions (Affinity Matrix CRF)
  • Integrating extraction with co-reference (graphs & chains)
  • Put it into a large-scale, working system.
  • Social network analysis from Email and the Web.
  • A new portal: research, people, connections.

14
Accomplishments, Discoveries & Results
  • Extracting answers, and also uncertainty/confidence.
  • Formally justified as marginalization in graphical models.
  • Applications to new word discovery in Chinese word segmentation, and correction propagation in interactive IE.
  • Joint inference, with efficient methods.
  • Multiple, cascaded label sequences (Factorial CRFs)
  • Multiple distant, but related mentions (Skip-chain CRFs)
  • Multiple co-reference decisions (Affinity Matrix CRF)
  • Integrating extraction with co-reference (graphs & chains)
  • Put it into a large-scale, working system.
  • Social network analysis from Email and the Web.
  • A new portal: research, people, connections.

15
Types of Uncertainty in Knowledge Discovery from Text
  • Confidence that the extractor correctly obtained the statements the author intended.
  • Confidence that what was written is truthful:
  • the author could have had misconceptions,
  • or have been purposefully trying to mislead.
  • Confidence that the emerging, discovered pattern is a reliable fact or generalization.

16
1. Labeling Sequence Data: Linear-chain CRFs
[Lafferty, McCallum, Pereira 2001]
Undirected graphical model, trained to maximize the conditional probability of outputs given inputs.
[Figure: a linear-chain CRF shown both as a finite state model and as a graphical model. FSM states / output sequence y_{t-1}, y_t, y_{t+1}, ... sit above observations x_{t-1}, x_t, x_{t+1}, ...; the example input sequence "said Arden Bement NSF Director" receives the output labels OTHER PERSON PERSON ORG TITLE.]
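For reference, the conditional probability a linear-chain CRF is trained to maximize (the standard formula from the cited Lafferty, McCallum, Pereira 2001 paper):

$$ p(\mathbf{y} \mid \mathbf{x}) \;=\; \frac{1}{Z(\mathbf{x})} \prod_{t=1}^{T} \exp\Big( \sum_k \lambda_k \, f_k(y_{t-1}, y_t, \mathbf{x}, t) \Big) $$

where the f_k are feature functions over a label transition and the input, the λ_k are learned weights, and Z(x) sums the same product over all possible label sequences.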
17
Confidence Estimation in Linear-chain CRFs
[Culotta, McCallum 2004]
Finite State Lattice
[Figure: the lattice of FSM states (PERSON, ORG, TITLE, OTHER) unrolled over the input sequence "said Arden Bement NSF Director"; each output y_t is one column of the lattice.]
18
Confidence Estimation in Linear-chain CRFs
[Culotta, McCallum 2004]
Constrained Forward-Backward
[Figure: the same lattice, with the summed paths constrained to pass through the labels of the extracted field; the field's confidence is the probability mass of those constrained paths.]
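A minimal numpy sketch of the idea (my illustration of the technique, not the authors' code): the confidence of an extracted field is the mass of all label paths that agree with the field's labels, divided by the unconstrained partition function.

```python
import numpy as np

def forward_log_mass(log_phi, log_trans, fixed=None):
    """Forward algorithm over a linear-chain CRF in log space.

    log_phi:   [T, K] per-position label scores.
    log_trans: [K, K] label-transition scores.
    fixed:     optional dict {position: label} clamping those positions.
    Returns the log of the total (possibly constrained) path score.
    """
    T, K = log_phi.shape
    def clamp(t, v):
        if fixed is not None and t in fixed:
            out = np.full(K, -np.inf)
            out[fixed[t]] = v[fixed[t]]   # only the clamped label survives
            return out
        return v
    alpha = clamp(0, log_phi[0].copy())
    for t in range(1, T):
        v = np.logaddexp.reduce(alpha[:, None] + log_trans, axis=0) + log_phi[t]
        alpha = clamp(t, v)
    return np.logaddexp.reduce(alpha)

def field_confidence(log_phi, log_trans, span_labels):
    """P(field) = constrained path mass / Z."""
    log_z = forward_log_mass(log_phi, log_trans)
    log_c = forward_log_mass(log_phi, log_trans, fixed=span_labels)
    return np.exp(log_c - log_z)

# Toy run: 5 tokens ("said Arden Bement NSF Director"), 4 labels; random
# scores stand in for trained CRF potentials.
rng = np.random.default_rng(0)
log_phi, log_trans = rng.normal(size=(5, 4)), rng.normal(size=(4, 4))
PERSON = 1
print(field_confidence(log_phi, log_trans, {1: PERSON, 2: PERSON}))
```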
19
Forward-Backward Confidence Estimation improves accuracy/coverage
[Plot: accuracy vs. coverage. Our forward-backward confidence tracks the optimal curve, dominating both traditional token-wise confidence and no use of confidence.]
20
Confidence Estimation Applied
  • New word discovery in Chinese word segmentation
  • Improves segmentation accuracy by 25%.
  • Highlighting fields for Interactive Information Extraction
  • After fixing the least confident field, constrained Viterbi automatically reduces error by another 23%.

[Peng, Feng, McCallum, COLING 2004]
[Kristjansson, Culotta, Viola, McCallum, AAAI 2004: Honorable Mention Award]
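A sketch of the correction-propagation idea (assumptions mine: a generic linear-chain scorer, not the AAAI paper's system): after the user corrects a field, Viterbi is re-run with those positions clamped, so the correction propagates to neighboring labels.

```python
import numpy as np

def constrained_viterbi(log_phi, log_trans, fixed):
    """Viterbi decoding with user-corrected positions clamped.

    log_phi: [T, K] label scores; log_trans: [K, K] transition scores;
    fixed: dict {position: label} of user corrections.
    """
    T, K = log_phi.shape
    def clamp(t, v):
        if t in fixed:
            out = np.full(K, -np.inf)
            out[fixed[t]] = v[fixed[t]]
            return out
        return v
    delta = clamp(0, log_phi[0].copy())
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_trans        # [from-label, to-label]
        back[t] = scores.argmax(axis=0)
        delta = clamp(t, scores.max(axis=0) + log_phi[t])
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]                              # best labeling given the fix
```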
21
Accomplishments, Discoveries & Results
  • Extracting answers, and also uncertainty/confidence.
  • Formally justified as marginalization in graphical models.
  • Applications to new word discovery in Chinese word segmentation, and correction propagation in interactive IE.
  • Joint inference, with efficient methods.
  • Multiple, cascaded label sequences (Factorial CRFs)
  • Multiple distant, but related mentions (Skip-chain CRFs)
  • Multiple co-reference decisions (Affinity Matrix CRF)
  • Integrating extraction with co-reference (graphs & chains)
  • Put it into a large-scale, working system.
  • Social network analysis from Email and the Web.
  • A new portal: research, people, connections.

22
1. Jointly labeling cascaded sequences: Factorial CRFs
[Sutton, Rohanimanesh, McCallum, ICML 2004]
Named-entity tag
Noun-phrase boundaries
Part-of-speech
English words
23
1. Jointly labeling cascaded sequences: Factorial CRFs
[Sutton, Rohanimanesh, McCallum, ICML 2004]
Named-entity tag
Noun-phrase boundaries
Part-of-speech
English words
24
1. Jointly labeling cascaded sequences: Factorial CRFs
[Sutton, Rohanimanesh, McCallum, ICML 2004]
Named-entity tag
Noun-phrase boundaries
Part-of-speech
English words
But errors cascade: one must be perfect at every stage to do well.
25
1. Jointly labeling cascaded sequences: Factorial CRFs
[Sutton, Rohanimanesh, McCallum, ICML 2004]
Named-entity tag
Noun-phrase boundaries
Part-of-speech
English words
Joint prediction of part-of-speech and noun-phrase boundaries in newswire, matching accuracy with only 50% of the training data.
Inference: tree reparameterization BP [Wainwright et al., 2002]
26
2. Jointly labeling distant mentions: Skip-chain CRFs
[Sutton, McCallum, SRL 2004]

[Figure: "Senator Joe Green said today ..." and "Green ran for ...", with a skip edge joining the two mentions of "Green".]
Dependency among similar, distant mentions is ignored.
27
2. Jointly labeling distant mentions: Skip-chain CRFs
[Sutton, McCallum, SRL 2004]
[Figure: the same sentences, now with the skip edge connecting the two "Green" mentions so their labels are predicted jointly.]
14% reduction in error on the most-repeated field in email seminar announcements.
Inference: tree reparameterization BP [Wainwright et al., 2002]
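The skip edges come from a simple heuristic; the sketch below (my paraphrase of the construction described by Sutton and McCallum) adds a skip edge between repeated capitalized tokens on top of the usual chain.

```python
def skip_chain_edges(tokens):
    """Edge list for a skip-chain CRF: the linear chain, plus skip
    edges connecting later occurrences of the same capitalized word
    back to the earlier mention, so their labels are decided jointly."""
    edges = [(t - 1, t) for t in range(1, len(tokens))]
    last_seen = {}
    for t, tok in enumerate(tokens):
        if tok[:1].isupper():
            if tok in last_seen:
                edges.append((last_seen[tok], t))  # the skip edge
            last_seen[tok] = t
    return edges

print(skip_chain_edges("Senator Joe Green said today that Green ran".split()))
# chain edges plus the skip edge (2, 6) linking the two Greens
```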
28
3. Joint co-reference among all pairs: Affinity Matrix CRF
Entity resolution / object correspondence
[Figure: three mentions, ". . . Mr Powell . . .", ". . . Powell . . .", ". . . she . . .", joined pairwise by Y/N coreference variables with affinity edge weights 45, -99, and 11.]
25% reduction in error on co-reference of proper nouns in newswire.
Inference: correlational clustering / graph partitioning
[McCallum, Wellner, IJCAI WS 2003, NIPS 2004]
[Bansal, Blum, Chawla, 2002]
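A toy sketch of the clustering objective (greedy agglomeration over the affinity matrix; the papers above use stronger graph-partitioning inference, and the weight-to-pair assignment here is my reading of the figure):

```python
import numpy as np

def greedy_correlation_clustering(w):
    """Greedily merge clusters while some pair of clusters has positive
    total cross-affinity. w[i, j] > 0 favors putting mentions i and j
    in the same entity; w[i, j] < 0 favors keeping them apart."""
    clusters = [{i} for i in range(len(w))]
    while True:
        best, pair = 0.0, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                gain = sum(w[i, j] for i in clusters[a] for j in clusters[b])
                if gain > best:
                    best, pair = gain, (a, b)
        if pair is None:
            return clusters
        a, b = pair
        clusters[a] |= clusters.pop(b)   # b > a, so index a is unaffected

# Mentions: 0 = "Mr Powell", 1 = "Powell", 2 = "she" (weights illustrative).
w = np.array([[  0,  45,  11],
              [ 45,   0, -99],
              [ 11, -99,   0]], dtype=float)
print(greedy_correlation_clustering(w))   # -> [{0, 1}, {2}]
```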
29
4. Joint segmentation and co-reference
Joint IE and Coreference from Research Paper Citations
[Figure: textual citation mentions (noisy, with duplicates) are resolved into a paper database with fields, clean, duplicates collapsed. Sample rows:
AUTHORS: Cowell, Dawid ...   TITLE: Probab...   VENUE: Springer...
AUTHORS: Montemerlo, Thrun...   TITLE: FastSLAM...   VENUE: AAAI...
AUTHORS: Kjaerulff...   TITLE: Approxi...   VENUE: Technic...]
30
Citation Segmentation and Coreference
Laurel, B. Interface Agents Metaphors with
Character , in The Art of Human-Computer
Interface Design , T. Smith (ed) ,
Addison-Wesley , 1990 .
Brenda Laurel . Interface Agents Metaphors
with Character , in Smith , The Art of
Human-Computr Interface Design , 355-366 ,
1990 .
31
Citation Segmentation and Coreference
Laurel, B. Interface Agents Metaphors with
Character , in The Art of Human-Computer
Interface Design , T. Smith (ed) ,
Addison-Wesley , 1990 .
Brenda Laurel . Interface Agents Metaphors
with Character , in Smith , The Art of
Human-Computr Interface Design , 355-366 ,
1990 .
  1. Segment citation fields

32
Citation Segmentation and Coreference
Laurel, B. Interface Agents Metaphors with
Character , in The Art of Human-Computer
Interface Design , T. Smith (ed) ,
Addison-Wesley , 1990 .
Y ? N
Brenda Laurel . Interface Agents Metaphors
with Character , in Smith , The Art of
Human-Computr Interface Design , 355-366 ,
1990 .
  1. Segment citation fields
  2. Resolve coreferent citations

33
Citation Segmentation and Coreference
Laurel, B. Interface Agents Metaphors with
Character , in The Art of Human-Computer
Interface Design , T. Smith (ed) ,
Addison-Wesley , 1990 .
Y ? N
Brenda Laurel . Interface Agents Metaphors
with Character , in Smith , The Art of
Human-Computr Interface Design , 355-366 ,
1990 .
Segmentation Quality   | Citation Co-reference (F1)
No Segmentation        | 78
CRF Segmentation       | 91
True Segmentation      | 93
  1. Segment citation fields
  2. Resolve coreferent citations

34
Citation Segmentation and Coreference
Laurel, B. Interface Agents Metaphors with
Character , in The Art of Human-Computer
Interface Design , T. Smith (ed) ,
Addison-Wesley , 1990 .
Y ? N
Brenda Laurel . Interface Agents Metaphors
with Character , in Smith , The Art of
Human-Computr Interface Design , 355-366 ,
1990 .
AUTHOR: Brenda Laurel
TITLE: Interface Agents: Metaphors with Character
PAGES: 355-366
BOOKTITLE: The Art of Human-Computer Interface Design
EDITOR: T. Smith
PUBLISHER: Addison-Wesley
YEAR: 1990
  1. Segment citation fields
  2. Resolve coreferent citations
  3. Form canonical database record

Resolving conflicts
35
Citation Segmentation and Coreference
Laurel, B. Interface Agents Metaphors with
Character , in The Art of Human-Computer
Interface Design , T. Smith (ed) ,
Addison-Wesley , 1990 .
Y ? N
Brenda Laurel . Interface Agents Metaphors
with Character , in Smith , The Art of
Human-Computr Interface Design , 355-366 ,
1990 .
AUTHOR: Brenda Laurel
TITLE: Interface Agents: Metaphors with Character
PAGES: 355-366
BOOKTITLE: The Art of Human-Computer Interface Design
EDITOR: T. Smith
PUBLISHER: Addison-Wesley
YEAR: 1990
  1. Segment citation fields
  2. Resolve coreferent citations
  3. Form canonical database record

Perform jointly.
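One simple way to resolve conflicts when forming the canonical record is per-field voting across the coreferent mentions; a sketch (the talk's model instead makes this decision jointly with segmentation and coreference):

```python
from collections import Counter

def canonical_record(mentions):
    """Merge coreferent citation mentions (dicts of field -> value)
    into one canonical record by majority vote within each field."""
    fields = {f for m in mentions for f in m}
    return {f: Counter(m[f] for m in mentions if f in m).most_common(1)[0][0]
            for f in fields}

print(canonical_record([
    {"AUTHOR": "Laurel, B.", "YEAR": "1990", "EDITOR": "T. Smith (ed)"},
    {"AUTHOR": "Brenda Laurel", "YEAR": "1990", "PAGES": "355-366"},
]))
```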
36
IE Coreference Model
[Figure: an observed citation x ("J Besag 1986 On the ...") with its CRF segmentation s (AUT AUT YR TITL TITL ...).]
37
IE Coreference Model
[Figure: the segmentation s now produces citation mention attributes c (AUTHOR: J Besag, YEAR: 1986, TITLE: On the ...).]
38
IE Coreference Model
[Figure: the x → s → c structure repeated for each citation mention: "J Besag 1986 On the ...", "Smyth , P ... Data mining ...", "Smyth . 2001 Data Mining ...".]
39
IE Coreference Model
[Figure: binary coreference variables added for each pair of citation mentions.]
40
IE Coreference Model
[Figure: the pairwise coreference variables take values y/n (here: y, n, n).]
41
IE Coreference Model
[Figure: research paper entity attribute nodes added (e.g. AUTHOR: P Smyth, YEAR: 2001, TITLE: Data Mining ...).]
42
IE Coreference Model
[Figure: when the coreference variables are all y, the mentions share a single research paper entity attribute node.]
43
IE Coreference Model
[Figure: the complete model, with coreference decisions y, n, n among the three mentions.]
44
  • Such a highly connected graph makes exact inference intractable, so ...

45
Approximate Inference 1
  • Loopy Belief Propagation

[Figure: nodes v1 ... v6 in a loopy graph; messages such as m1(v2), m2(v3), m2(v1), m3(v2) are passed between nodes.]
46
Approximate Inference 1
  • Loopy Belief Propagation
  • Generalized Belief Propagation

[Figure: the same graph; loopy BP passes messages between nodes, generalized BP passes messages between regions.]
Here, a message is a conditional probability table passed among nodes. But message size grows exponentially with the size of the overlap between regions!
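A compact sketch of node-to-node loopy BP on a pairwise model (potentials and graph invented for illustration; generalized BP replaces these single-node message tables with region tables, hence the exponential blow-up):

```python
import numpy as np

def loopy_bp(unary, pairwise, edges, n_iters=50):
    """Sum-product loopy belief propagation on a pairwise model.

    unary: [n, K] node potentials; pairwise: [K, K] symmetric edge
    potential shared by all edges; edges: list of (i, j) pairs.
    Returns approximate marginals.
    """
    n, K = unary.shape
    msgs = {}
    for i, j in edges:
        msgs[(i, j)] = np.ones(K)
        msgs[(j, i)] = np.ones(K)
    for _ in range(n_iters):
        for (i, j) in list(msgs):
            # belief at i from everything except j's message
            b = unary[i].copy()
            for (a, c) in msgs:
                if c == i and a != j:
                    b *= msgs[(a, c)]
            m = pairwise.T @ b           # marginalize out node i
            msgs[(i, j)] = m / m.sum()   # normalize for stability
    beliefs = unary.copy()
    for (i, j) in msgs:
        beliefs[j] *= msgs[(i, j)]
    return beliefs / beliefs.sum(axis=1, keepdims=True)

# Toy run on the slide's six-node loopy graph (potentials invented).
rng = np.random.default_rng(0)
unary = rng.random((6, 2)) + 0.1
pairwise = np.array([[2.0, 1.0], [1.0, 2.0]])   # prefers agreement
edges = [(0, 1), (1, 2), (2, 0), (3, 4), (4, 5), (5, 3), (2, 3)]
print(loopy_bp(unary, pairwise, edges).round(3))
```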
47
Approximate Inference 2
  • Iterated Conditional Modes (ICM) [Besag 1986]

[Figure: the six-node graph; one variable is updated with all others held fixed:]
v6^(i+1) = argmax over v6 of P(v6 | v \ v6)
48
Approximate Inference 2
  • Iterated Conditional Modes (ICM) [Besag 1986]

[Figure: the next single-variable update:]
v5^(j+1) = argmax over v5 of P(v5 | v \ v5)
49
Approximate Inference 2
  • Iterated Conditional Modes (ICM) [Besag 1986]

[Figure: the next single-variable update:]
v4^(k+1) = argmax over v4 of P(v4 | v \ v4)
Structured inference scales well here, but it is greedy, and easily falls into local minima.
50
Approximate Inference 2
  • Iterated Conditional Modes (ICM) [Besag 1986]
  • Iterated Conditional Sampling (ICS) (our name)
  • Instead of selecting only the argmax, keep a sample of high-probability values of P(v4 | v \ v4),
  • e.g. an N-best list (the top N values); see the sketch below.

[Figure: the same update, now propagating an N-best list instead of a single argmax.]
Can use a generalized version of this, doing exact inference on a region of several nodes at once. Here, a message grows only linearly with overlap region size and N!
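A minimal sketch of ICM on a pairwise model (my illustration, not the paper's implementation): each variable is repeatedly set to its argmax given its neighbors.

```python
import numpy as np

def icm(unary, pairwise, edges, n_iters=25, seed=0):
    """Iterated Conditional Modes (Besag 1986): coordinate-ascent MAP.

    unary: [n, K] log-potentials; pairwise: [K, K] symmetric log-potential
    shared by all edges; edges: list of (i, j). Greedy, so it can stop in
    a local maximum. (ICS would keep an N-best list at each step instead
    of only the argmax.)
    """
    n, K = unary.shape
    nbrs = {i: [] for i in range(n)}
    for i, j in edges:
        nbrs[i].append(j)
        nbrs[j].append(i)
    v = np.random.default_rng(seed).integers(K, size=n)  # random start
    for _ in range(n_iters):
        changed = False
        for i in range(n):
            score = unary[i].copy()
            for j in nbrs[i]:
                score += pairwise[:, v[j]]   # neighbors held fixed
            new = int(np.argmax(score))      # v_i <- argmax P(v_i | v \ v_i)
            changed, v[i] = changed or new != v[i], new
        if not changed:
            break                            # local maximum reached
    return v

# Toy run on the slide's six-node graph (potentials invented).
rng = np.random.default_rng(1)
unary = rng.normal(size=(6, 3))
pairwise = np.where(np.eye(3, dtype=bool), 1.0, -1.0)  # prefer agreement
print(icm(unary, pairwise, [(0, 1), (1, 2), (2, 0), (3, 4), (4, 5), (5, 3), (2, 3)]))
```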
51
Features of this Inference Method
  1. Structured or factored representation (à la GBP)
  2. Uses samples to approximate density
  3. Closed-loop message-passing on loopy graph (à la BP)

Related Work
  • Beam search: forward-only inference
  • Particle filtering, e.g. [Doucet 1998]: usually on a tree-shaped graph, or feedforward only
  • MC sampling / Embedded HMMs [Neal, 2003]: sample from a high-dimensional continuous state space, then do forward-backward
  • Sample Propagation [Paskin, 2003]: messages = samples, on a junction tree
  • Fields to Trees [Hamze & de Freitas, UAI, earlier today]: Rao-Blackwellized MCMC, partitioning G into non-overlapping trees
  • Factored Particles for DBNs [Ng, Peshkin, Pfeffer, 2002]: combination of particle filtering and Boyen-Koller for DBNs

52
IE Coreference Model
Exact inference on these linear-chain regions; from each chain, pass an N-best list into coreference.
[Figure: the joint model over the three citation mentions.]
53
IE Coreference Model
Approximate inference by graph partitioning, integrating out uncertainty in samples of extraction.
Scales to 1M citations with canopies [McCallum, Nigam, Ungar 2000].
[Figure: the coreference layer of the joint model over the three citation mentions.]
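Canopies cut the number of pairwise comparisons: a cheap similarity forms overlapping groups, and the expensive coreference model runs only within a group. A sketch (thresholds and similarity function are illustrative, not from the paper):

```python
def canopies(items, cheap_sim, loose=0.3, tight=0.6):
    """Canopy clustering [McCallum, Nigam, Ungar 2000], sketched.

    Items tied to a seed more tightly than `tight` leave the pool;
    anything above `loose` joins the seed's canopy (canopies overlap)."""
    remaining = set(range(len(items)))
    out = []
    while remaining:
        seed = remaining.pop()
        canopy = {seed}
        for i in list(remaining):
            s = cheap_sim(items[seed], items[i])
            if s > loose:
                canopy.add(i)
                if s > tight:
                    remaining.discard(i)
        out.append(canopy)
    return out

# Cheap similarity: word-overlap Jaccard between raw citation strings.
def jaccard(a, b):
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

cites = ["J Besag 1986 On the", "Smyth . 2001 Data Mining", "Smyth , P Data mining"]
print(canopies(cites, jaccard))
```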
54
Inference: Sample N-best List from CRF Segmentation
When calculating similarity with another citation, we have more opportunity to find the correct, matching fields.

Name       | Title                                      | Book Title                                  | Year
Laurel, B. | Interface Agents Metaphors with Character  | The Art of Human Computer Interface Design  | 1990
Laurel, B. | Interface Agents Metaphors with Character  | The Art of Human Computer Interface Design  | 1990
Laurel, B. | Interface Agents Metaphors with Character  | The Art of Human Computer Interface Design  | 1990

Name       | Title
Laurel, B  | Interface Agents Metaphors with Character The
Laurel, B. | Interface Agents Metaphors with Character
Laurel, B. | Interface Agents Metaphors with Character

y ? n
55
IE Coreference Model
Exact (exhaustive) inference over entity attributes
[Figure: the joint model with coreference decisions y, n, n over the three citation mentions.]
56
IE Coreference Model
Revisit exact inference on the IE linear chain, now conditioned on entity attributes
[Figure: the joint model; each chain is re-decoded given the entity attributes.]
57
Parameter Estimation
Separately for different regions:
  • IE linear chain: exact MAP
  • Coref graph edge weights: MAP on individual edges
  • Entity attribute potentials: MAP, pseudo-likelihood
In all cases: climb the MAP gradient with a quasi-Newton method.
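A self-contained toy of the "climb the MAP gradient with a quasi-Newton method" recipe (data, sizes, and features are invented for illustration; the real system trains far larger models per region):

```python
import numpy as np
from scipy.optimize import minimize

# Tiny linear-chain CRF: 2 labels, 2 observation features, one shared
# transition matrix; trained by MAP with a Gaussian prior via L-BFGS.
X = np.array([[1., 0.], [0., 1.], [1., 1.]])   # 3 tokens x 2 features
Y = [0, 1, 1]                                   # gold labels

def neg_map(theta, sigma2=10.0):
    W = theta[:4].reshape(2, 2)   # observation-feature -> label weights
    T = theta[4:].reshape(2, 2)   # label -> label transition weights
    phi = X @ W                   # per-position label scores
    alpha = phi[0]                # forward algorithm for log Z
    for t in range(1, len(X)):
        alpha = np.logaddexp.reduce(alpha[:, None] + T, axis=0) + phi[t]
    log_z = np.logaddexp.reduce(alpha)
    score = phi[0, Y[0]] + sum(T[Y[t-1], Y[t]] + phi[t, Y[t]]
                               for t in range(1, len(X)))
    log_lik = score - log_z
    return -(log_lik - theta @ theta / (2 * sigma2))  # Gaussian prior -> MAP

res = minimize(neg_map, np.zeros(8), method="L-BFGS-B")  # quasi-Newton climb
print(res.fun, res.x.round(2))
```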
58
4. Joint segmentation and co-reference
[Wellner, McCallum, Peng, Hay, UAI 2004]
Extraction from and matching of research paper citations.
[Figure: the model over two citations of the same chapter: observed citations o, segmentations s, citation attributes c, co-reference decisions y, database field values p, and world knowledge. Example citations: "Laurel, B. Interface Agents Metaphors with Character, in The Art of Human-Computer Interface Design, B. Laurel (ed), Addison-Wesley, 1990." and "Brenda Laurel. Interface Agents Metaphors with Character, in Laurel, The Art of Human-Computer Interface Design, 355-366, 1990."]
35% reduction in co-reference error by using segmentation uncertainty.
6-14% reduction in segmentation error by using co-reference.
Inference: variant of Iterated Conditional Modes [Besag, 1986]
59
Experimental Results
  • Set of citations from CiteSeer
  • 1500 citation mentions, to 900 paper entities
  • Hand-labeled for coreference and field-extraction
  • Divided into 4 subsets, each on a different topic: RL, face detection, reasoning, constraint satisfaction
  • Within each subset many citations share authors, publication venues, publishers, etc.
  • 70% of the citation mentions are singletons

60
Coreference Results
Coreference cluster recall:

N            | Reinforce | Face | Reason | Constraint
1 (Baseline) | 0.946     | 0.96 | 0.94   | 0.96
3            | 0.95      | 0.98 | 0.96   | 0.96
7            | 0.95      | 0.98 | 0.95   | 0.97
9            | 0.982     | 0.97 | 0.96   | 0.97
Optimal      | 0.99      | 0.99 | 0.99   | 0.99

  • Average error reduction is 35%.
  • "Optimal" makes the best use of the N-best list by using true labels.
  • This indicates that even more improvement can be obtained.

61
Information Extraction Results
Segmentation F1:

            | Reinforce | Face  | Reason | Constraint
Baseline    | .943      | .908  | .929   | .934
w/ Coref    | .949      | .914  | .935   | .943
Err. Reduc. | .101      | .062  | .090   | .142
P-value     | .0442     | .0014 | .0001  | .0001

  • Error reduction ranges from 6-14%.
  • Small, but significant at the 95% confidence level (p-value < 0.05).

Biggest limiting factor in both sets of results: the data set is small, and does not have large coreferent sets.
62
Accomplishments, Discoveries & Results
  • Extracting answers, and also uncertainty/confidence.
  • Formally justified as marginalization in graphical models.
  • Applications to new word discovery in Chinese word segmentation, and correction propagation in interactive IE.
  • Joint inference, with efficient methods.
  • Multiple, cascaded label sequences (Factorial CRFs)
  • Multiple distant, but related mentions (Skip-chain CRFs)
  • Multiple co-reference decisions (Affinity Matrix CRF)
  • Integrating extraction with co-reference (graphs & chains)
  • Put it into a large-scale, working system.
  • Social network analysis from Email and the Web.
  • A new portal: research, people, connections.

63
One Application Project
Workplace effectiveness: the ability to leverage your network of acquaintances (the power of your "little black book"). But filling a Contacts DB by hand is tedious, and incomplete.
[Diagram: Email Inbox and the WWW automatically populate the Contacts DB.]
64
System Overview
[Diagram: Email → CRF → names → WWW.]
65
An Example
To: Andrew McCallum <mccallum@cs.umass.edu>  Subject: ...

First Name: Andrew
Middle Name: Kachites
Last Name: McCallum
Job Title: Associate Professor
Company: University of Massachusetts
Street Address: 140 Governors Dr.
City: Amherst
State: MA
Zip: 01003
Company Phone: (413) 545-1323
Links: Fernando Pereira, Sam Roweis, ...
Key Words: information extraction, social network, ...

Search for new people
66
Summary of Results

Example keywords extracted:

Person            | Keywords
William Cohen     | Logic programming, text categorization, data integration, rule learning
Daphne Koller     | Bayesian networks, relational models, probabilistic models, hidden variables
Deborah McGuiness | Semantic web, description logics, knowledge representation, ontologies
Tom Mitchell      | Machine learning, cognitive states, learning apprentice, artificial intelligence

Contact info and name extraction performance (25 fields):

    | Token Acc | Field Prec | Field Recall | Field F1
CRF | 94.50     | 85.73      | 76.33        | 80.76

  1. Expert Finding: When solving some task, find friends-of-friends with relevant expertise. Avoid stove-piping in large orgs by automatically suggesting collaborators. Given a task, automatically suggest the right team for the job. (Hiring aid!)
  2. Social Network Analysis: Understand the social structure of your organization. Suggest structural changes for improved efficiency.

67
Main Application Project
68
Main Application Project
[Figure: research papers connected by Cites relations.]
69
Main Application Project
[Figure: the schema grows to include Person, Expertise, Grant, University, Venue, and Groups entities around research papers and their Cites relations.]
70
Main Application Project
  • Status:
  • Spider running. Over 1.5M PDFs in hand.
  • Best-in-world published results in IE from research paper headers and references.
  • First version of multi-entity co-reference running.
  • First version of Web servlet interface up.
  • Well-engineered: Java, servlets, SQL, Lucene, SOAP, etc.
  • Public launch this Fall.

71
Software Infrastructure
MALLET: Machine Learning for Language Toolkit
  • 80k lines of Java
  • Document classification, information extraction, clustering, co-reference, POS tagging, shallow parsing, relational classification, ...
  • New package: graphical models and modern inference methods.
  • Variational, tree reparameterization, stochastic sampling, contrastive divergence, ...
  • New documentation and interfaces.
  • Unlike other toolkits (e.g. Weka), MALLET scales to millions of features and 100ks of training examples, as needed for NLP.

Released as Open Source Software: http://mallet.cs.umass.edu
In use at UMass, MIT, CMU, Stanford, Berkeley, UPenn, UT Austin, Purdue, ...
72
Publications and Contact Info
  • Conditional Models of Identity Uncertainty with Application to Noun Coreference. Andrew McCallum and Ben Wellner. Neural Information Processing Systems (NIPS), 2004.
  • An Integrated, Conditional Model of Information Extraction and Coreference with Application to Citation Matching. Ben Wellner, Andrew McCallum, Fuchun Peng, and Michael Hay. Conference on Uncertainty in Artificial Intelligence (UAI), 2004.
  • Collective Segmentation and Labeling of Distant Entities in Information Extraction. Charles Sutton and Andrew McCallum. ICML Workshop on Statistical Relational Learning, 2004.
  • Extracting Social Networks and Contact Information from Email and the Web. Aron Culotta, Ron Bekkerman, and Andrew McCallum. Conference on Email and Anti-Spam (CEAS), 2004.
  • Dynamic Conditional Random Fields: Factorized Probabilistic Models for Labeling and Segmenting Sequence Data. Charles Sutton, Khashayar Rohanimanesh, and Andrew McCallum. ICML, 2004.
  • Interactive Information Extraction with Constrained Conditional Random Fields. Trausti Kristjansson, Aron Culotta, Paul Viola, and Andrew McCallum. AAAI, 2004. (Winner of Honorable Mention Award.)
  • Accurate Information Extraction from Research Papers using Conditional Random Fields. Fuchun Peng and Andrew McCallum. HLT-NAACL, 2004.
  • Chinese Segmentation and New Word Detection using Conditional Random Fields. Fuchun Peng, Fangfang Feng, and Andrew McCallum. International Conference on Computational Linguistics (COLING), 2004.
  • Confidence Estimation for Information Extraction. Aron Culotta and Andrew McCallum. HLT-NAACL, 2004.

http://www.cs.umass.edu/~mccallum
73
End of Talk