Title: Probabilistic Models of Text and Link Structure for Hypertext Classification
1Probabilistic Models of Text and Link Structure
for Hypertext Classification
- Lise Getoor, Stanford → Univ. MD, College Park
- Eran Segal, Stanford
- Benjamin Taskar, Stanford
- Daphne Koller, Stanford
2Introduction
- Many domains are inherently relational
- Related objects are not IID
- Exploit dependency
- Link structure can be predictive
- Unlabelled data also useful
- Active Research Area
- Chakrabarti et al.; Cohn and Hofmann, et al.; Ghani et al.; Jensen and Neville; Slattery and Mitchell; many others...
3Introduction
[Figure labels: Scientific Paper; Topic; Theory; AI; Agent]
- Attributes of linked objects
- Attributes of heterogeneous linked objects
4Our Approach
- Motivation: relational structure provides useful information for density estimation and prediction
- Provide a unified probabilistic framework
- Construct probabilistic models of relational structure that capture link uncertainty
- Use unlabelled data for improved accuracy
5Probabilistic Relational Models
- Extend Bayes net representation to relational setting
- Specify dependence of each attribute on other attributes
- A template for a Bayes net over a relational database
(Koller & Pfeffer, 98)
6Relational Schema
- A relational schema describes the classes,
attributes and relations in a domain
[Figure: relational schema. Classes: Author (Institution, Research Area); Paper (Topic, Word1, Word2, ..., WordN). Relations: Wrote; Cites (Count, Citing Paper, Cited Paper).]
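As an illustration, the schema on this slide could be written down as plain Python records. This is only a sketch: the class and field names mirror the slide's labels, while the key fields (`author_id`, `paper_id`), the binary encoding of the word attributes, and all sample values are assumptions added for the example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Author:
    author_id: str            # hypothetical key, not shown on the slide
    institution: str
    research_area: str


@dataclass
class Paper:
    paper_id: str             # hypothetical key, not shown on the slide
    topic: str
    word_present: Dict[str, bool]  # Word1 ... WordN as word indicators (assumed encoding)


@dataclass
class Wrote:
    author_id: str
    paper_id: str


@dataclass
class Cites:
    citing_paper: str
    cited_paper: str
    count: int = 1            # the slide's Count attribute; its exact meaning is assumed


# Toy instance of the schema; every value here is a placeholder.
authors = [Author("a1", "inst_A", "AI")]
papers = [
    Paper("p1", "AI", {"learning": True}),
    Paper("p2", "Theory", {"learning": False}),
]
wrote = [Wrote("a1", "p1")]
cites = [Cites("p1", "p2")]


def papers_cited_by(paper_id: str, cites: List[Cites]) -> List[str]:
    """Follow the Cites relation from a citing paper to the papers it cites."""
    return [c.cited_paper for c in cites if c.citing_paper == paper_id]
```

Keeping relations (`Wrote`, `Cites`) as their own record types, rather than as lists inside `Paper`, matches the slide's view of relations as first-class parts of the schema that can carry attributes of their own.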
7Attribute Uncertainty
[Figure: Author (Institution, Research Area), Wrote, Paper (Topic, Word1, ..., WordN)]
8Link Uncertainty
[Figure: two panels, each labelled Document Collection]
9PRM w/ Exists Uncertainty
[Figure: two Paper classes (Topic, Words) joined by Cites, which carries an Exists attribute]
Dependency model for existence of relationship
10Exists Uncertainty Example
[Figure: two Paper classes (Topic, Words) joined by Cites (Exists), with a conditional probability table for Exists (False, True) indexed by Cited.Topic and Citer.Topic]
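A minimal sketch of what such a table for Exists might look like in code, and of how observing a link shifts belief about a topic. The topic values are taken from labels that appear elsewhere in the deck; every probability below is a made-up illustration value, not a number from the talk.

```python
# Hypothetical CPD for the Exists attribute of a potential citation.
# All probabilities are illustration values only.
TOPICS = ["Theory", "AI"]

# P(Exists = True | Citer.Topic, Cited.Topic)
p_exists_true = {
    ("Theory", "Theory"): 0.010,
    ("Theory", "AI"): 0.001,
    ("AI", "Theory"): 0.001,
    ("AI", "AI"): 0.010,
}


def p_exists(exists, citer_topic, cited_topic):
    """Look up P(Exists = exists | Citer.Topic, Cited.Topic)."""
    p_true = p_exists_true[(citer_topic, cited_topic)]
    return p_true if exists else 1.0 - p_true


def citer_topic_given_link(cited_topic, prior):
    """P(Citer.Topic | Exists = True, Cited.Topic) by Bayes' rule."""
    unnorm = {t: prior[t] * p_exists(True, t, cited_topic) for t in TOPICS}
    z = sum(unnorm.values())
    return {t: v / z for t, v in unnorm.items()}


# With a uniform prior, seeing a citation to an AI paper makes "AI" far more likely.
posterior = citer_topic_given_link("AI", {"Theory": 0.5, "AI": 0.5})
```

With these invented numbers, same-topic citations are ten times more probable than cross-topic ones, so an observed link carries evidence about the citing paper's topic.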
11Ground Bayes Net
[Figure: ground Bayes net over Author1 and Author2 (Inst, Area), Paper1, Paper2 and Paper3 (Topic, Word1 ... WordN), and six Exists variables for the ordered paper pairs 1-2, 1-3, 2-1, 2-3, 3-1, 3-2]
- Captures correlations between topics of related papers
- Information flows along active paths in the Bayes net
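The unrolling step can be sketched as follows: given a set of papers, create one Topic node and a few word nodes per paper, plus one Exists node for every ordered pair of papers. This is a toy sketch, not the authors' implementation. It assumes each word depends only on its paper's Topic and that each Exists node has the two papers' topics as parents; authors are left out for brevity, and the helper name `ground_network` is hypothetical.

```python
from itertools import permutations


def ground_network(papers, n_words):
    """Return {node name: list of parent node names} for a toy unrolled network."""
    parents = {}
    for p in papers:
        parents[f"{p}.Topic"] = []
        for i in range(1, n_words + 1):
            # Assumed dependency: each word depends on the paper's topic.
            parents[f"{p}.Word{i}"] = [f"{p}.Topic"]
    # One Exists variable per ordered (citing, cited) pair of distinct papers.
    for citer, cited in permutations(papers, 2):
        parents[f"Exists({citer},{cited})"] = [f"{citer}.Topic", f"{cited}.Topic"]
    return parents


net = ground_network(["Paper1", "Paper2", "Paper3"], n_words=2)
exists_vars = [v for v in net if v.startswith("Exists")]
```

Three papers give six Exists nodes, matching the six pairs in the figure. Because every Exists node shares parents with other nodes, observing links couples the Topic variables of different papers, which is how information flows between them.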
12PRM Learning Algorithm
[Figure: a Database with Author, Paper and Cites (Paper to Paper) tables is used to learn a PRM]
- Learn parameters and qualitative dependency structure
- Extend known techniques for learning Bayesian networks from data
(Friedman et al., IJCAI99; Getoor et al., ICML01)
13Parameter Estimation in PRMs
- Assume known dependency structure S
- Goal: estimate PRM parameters θ
- θ consists of the entries in the local probability models
- θ is good if it is likely to generate the observed data, instance I
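One way to write out "likely to generate the observed data", in the style of the PRM learning papers cited in this talk. The symbols are assumptions, since the slide's own formula did not survive extraction: X ranges over classes, A(X) over the attributes of class X, I(X) over the objects of class X in instance I, and Pa(x.A) over the parents of x.A under structure S.

```latex
\[
L(\theta : \mathcal{I}, S) = P(\mathcal{I} \mid S, \theta)
  = \prod_{X} \prod_{A \in \mathcal{A}(X)} \prod_{x \in \mathcal{I}(X)}
    P\big(\mathcal{I}_{x.A} \mid \mathcal{I}_{\mathrm{Pa}(x.A)}, \theta\big)
\]
\[
\log L(\theta : \mathcal{I}, S)
  = \sum_{X} \sum_{A \in \mathcal{A}(X)} \sum_{x \in \mathcal{I}(X)}
    \log P\big(\mathcal{I}_{x.A} \mid \mathcal{I}_{\mathrm{Pa}(x.A)}, \theta\big)
\]
```

The likelihood and its logarithm have the same maximizer. In log form, each attribute X.A contributes its own sum, so the parameters of different attributes can be estimated separately.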
14Parameter Estimation II
- MLE principle: choose θ so as to maximize the likelihood function l
- Key computational step: computation of sufficient statistics, i.e. the frequency of different instantiations of a node and its parents in the DB
- As in Bayesian network learning, the crucial property is decomposition: separate terms for different X.A
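A minimal sketch of the counting step for a single attribute, assuming fully observed data and tabular local probability models. The function name `mle_cpd`, the column names, and the toy rows are all hypothetical; the estimate is the standard count ratio N(x, u) / N(u).

```python
from collections import Counter, defaultdict


def mle_cpd(rows, child, parents):
    """Maximum-likelihood estimate of P(child | parents) from fully observed rows."""
    joint = Counter()     # sufficient statistics N(u, x): parent values with child value
    marginal = Counter()  # N(u): parent values alone
    for row in rows:
        u = tuple(row[p] for p in parents)
        joint[(u, row[child])] += 1
        marginal[u] += 1
    cpd = defaultdict(dict)
    for (u, x), n in joint.items():
        cpd[u][x] = n / marginal[u]
    return dict(cpd)


# Toy "database": one row per (citing paper, cited paper) pair; values are made up.
pairs = [
    {"citer_topic": "AI", "cited_topic": "AI", "exists": True},
    {"citer_topic": "AI", "cited_topic": "AI", "exists": False},
    {"citer_topic": "AI", "cited_topic": "Theory", "exists": False},
    {"citer_topic": "AI", "cited_topic": "Theory", "exists": False},
]
cpd = mle_cpd(pairs, "exists", ["citer_topic", "cited_topic"])
```

Because of decomposition, the same function can be called once per attribute X.A with that attribute's own parents. In a relational database the counts would typically come from a join and group-by query rather than a Python loop.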
15Experiment I Prediction
[Figure: Paper P506 with words w1 ... wN and unknown Topic (??)]
16Domains
- Cora dataset (McCallum et al.): Paper (Topic, w1 ... wN) as citing paper and cited paper, joined by Cites (Exists)
- WebKB (Craven et al.): Web Page (Category, w1 ... wN) as From Page and To Page, joined by Link (Exists)
17Prediction Accuracy
18Using Unlabelled Data
[Figure: ground Bayes net over Author1 and Author2 (Inst, Area), Paper1, Paper2 and Paper3 (Topic, Word1 ... WordN), and Exists variables for the paper pairs 1-2, 1-3, 2-1, 2-3, 3-1]
19EM
- Use EM to learn with latent variables
- E-step involves inference in unrolled network
- Infeasible for large networks
- Use approximate inference for E-step
- Loopy belief propagation (Pearl, 88; McEliece, 98)
- Scales linearly with size of network
- Guaranteed to converge only for polytrees
- Empirically, often converges in general nets (Murphy, 99)
- Local message passing
- Belief messages transferred between related instances
- Induces a natural influence propagation behavior
- Instances give information about related instances
(Taskar et al., Probabilistic Classification and Clustering in Relational Data, IJCAI01)
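The message-passing idea can be sketched with sum-product updates on a small pairwise model. This is a generic illustration under simplifying assumptions, not the procedure from the cited paper. It uses an undirected pairwise model rather than the unrolled Bayes net, synchronous updates, and a fixed number of iterations; the names and potentials are invented.

```python
def loopy_bp(unary, pairwise, n_iters=50):
    """Sum-product message passing on a pairwise model with discrete variables.

    unary:    {var: [phi(x) for each state x]}
    pairwise: {(i, j): table} with table[xi][xj] = psi(xi, xj), one entry per edge
    Returns approximate marginals {var: [belief(x) for each state x]}.
    """
    def psi(i, j, xi, xj):
        # Look up the edge potential in whichever orientation it was stored.
        if (i, j) in pairwise:
            return pairwise[(i, j)][xi][xj]
        return pairwise[(j, i)][xj][xi]

    neighbors = {v: [] for v in unary}
    for i, j in pairwise:
        neighbors[i].append(j)
        neighbors[j].append(i)

    # messages[(i, j)][xj] is the message from i to j, initialised to uniform.
    messages = {}
    for i in unary:
        for j in neighbors[i]:
            messages[(i, j)] = [1.0] * len(unary[j])

    for _ in range(n_iters):
        new = {}
        for (i, j) in messages:
            out = []
            for xj in range(len(unary[j])):
                total = 0.0
                for xi in range(len(unary[i])):
                    incoming = 1.0
                    for k in neighbors[i]:
                        if k != j:
                            incoming *= messages[(k, i)][xi]
                    total += unary[i][xi] * psi(i, j, xi, xj) * incoming
                out.append(total)
            z = sum(out)
            new[(i, j)] = [m / z for m in out]  # normalise for numerical stability
        messages = new

    beliefs = {}
    for i in unary:
        b = list(unary[i])
        for k in neighbors[i]:
            b = [bx * mx for bx, mx in zip(b, messages[(k, i)])]
        z = sum(b)
        beliefs[i] = [bx / z for bx in b]
    return beliefs


# Toy chain P1 - P2 - P3 with two topic states; linked papers prefer the same topic.
agree = [[2.0, 1.0], [1.0, 2.0]]
unary = {"P1": [0.9, 0.1], "P2": [0.5, 0.5], "P3": [0.5, 0.5]}
beliefs = loopy_bp(unary, {("P1", "P2"): agree, ("P2", "P3"): agree})
```

In the toy chain only P1 has informative local evidence, yet the beliefs for P2 and P3 shift toward P1's likely topic, which is the influence propagation behavior described on the slide. The chain has no cycles, so the result is exact here; on graphs with loops the same updates give an approximation.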
20Webpage Classification
- Webpages from four CS departments (Craven et al.)
- Each webpage has
- Content words
- Links
- Type: Student, Faculty, Course, Project, Other
Trained on 3 schools, tested on the 4th
[Figure: chart with y-axis ticks from 0.52 to 0.68]
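The train-on-three, test-on-one protocol can be sketched as a leave-one-school-out split. This assumes each page record carries a school field and that the held-out school is rotated over all four; the slide does not say whether the rotation was done, and the field names and toy records below are placeholders.

```python
def leave_one_school_out(pages):
    """Yield (held_out_school, train_pages, test_pages), holding out each school in turn."""
    schools = sorted({p["school"] for p in pages})
    for held_out in schools:
        train = [p for p in pages if p["school"] != held_out]
        test = [p for p in pages if p["school"] == held_out]
        yield held_out, train, test


# Toy pages; school names and labels are placeholders.
pages = [
    {"school": "A", "type": "Student"},
    {"school": "B", "type": "Faculty"},
    {"school": "C", "type": "Course"},
    {"school": "D", "type": "Other"},
]
splits = list(leave_one_school_out(pages))
```

Splitting by school rather than by page keeps all pages of the test site, and the links among them, out of training.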
21Webpage Classification
[Figure: chart with y-axis ticks from 0.52 to 0.68; labels: From-, Anchor, NB, Links, LinkHub, LinkAnchor, All]
22Conclusions
- PRMs provide a unified probabilistic framework for prediction and density estimation in relational domains
- We can model dependencies between attributes of linked objects and between attributes of heterogeneous linked objects
- Allows us to make use of unlabelled data in a principled manner
- Future work: more expressive link uncertainty models