Latent Topic Models for Hypertext

1
Latent Topic Models for Hypertext
  • Amit Gruber¹
  • Michal Rosen-Zvi²
  • Yair Weiss¹
  • ¹ The Hebrew University of Jerusalem
  • ² IBM Research Labs, Haifa

2
Introduction
  • In this work we focus on hypertext documents,
    i.e. documents with links

(Slide figure: an example per-document topic mixture: 0.3 sports, 0.4 crime, 0.3 politics)
3
Hypertext Documents
4
Hypertext Documents
5
Introduction
  • Hypertext is everywhere!
  • Web pages, references in scientific publications
  • Connectivity is important
  • PageRank
  • Topology of the WWW is complicated

6
Problem Setting
  • Input documents and links
  • Estimate
  • Document topic mixture
  • Pr(word topic)
  • Document importance
  • Unsupervised
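As a reading aid, the quantities to be estimated, in the notation used in the sketches further down (these symbols are assumed labels, not text from the slides):

    θ_d   topic mixture of document d (a distribution over K topics)
    β_z   Pr(word | topic z), for topics z = 1 … K
    λ_d   importance of document d (one scalar per document)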

7
Previous Work: Topic Models for Hypertext
  • Cohn and Hofmann, 01.
  • Erosheva et al. 04.
  • Links are modeled similarly to words
  • Links are not associated with words
  • Dietz et al. 07.
  • Nallapati and Cohen, 08.
  • Distinguish between citing and cited docs

8
  • The Latent Topic Hypertext Model (LTHM)

9
LTHM Generative Model
  • Words are created (by LDA)

10
LTHM Generative Model
  • Words are created (by LDA)
  • Links are created (our contribution)

11
LTHM Modeling Links
  • Allows for arbitrary topology of the citation
    graph (including self links)
  • A link points from a word to a document

12
LTHM Link Generation
  • Depends on
  • The topic of the anchor word
  • The topic mixture of the target document
  • The importance of the target document

(Slide formula annotations: "importance of d" and "prevalence of z in d", where d is the target document)
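A hedged reconstruction of the formula these annotations belong to (the symbols λ and θ are assumed labels, not taken from the slide): for an anchor word whose topic is z,

    Pr(link points to target document d)  ∝  λ_d × θ_{d,z}

        λ_d      importance of the target document d
        θ_{d,z}  prevalence of topic z in the target document d

Roughly, the probability mass not assigned to any target corresponds to generating no link from that word, which is how non-links become observations (see slide 16).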
13
Generating a single document d
  • d is generated by Latent Dirichlet Allocation
(Plate diagram: hyperparameter α, per-document topic mixture θ_d, a topic z and word w for each of the N_d word positions, and the K topic-word distributions β_z shared across documents)
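For reference, the standard LDA generative steps the slide refers to (generic LDA, not specific to this paper):

    θ_d ~ Dirichlet(α)                      draw the topic mixture of document d
    for each word position i = 1 … N_d:
        z_i ~ Multinomial(θ_d)              pick a topic for position i
        w_i ~ Multinomial(β_{z_i})          draw the word from that topic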
14
Generating two documents d and d′
(Plate diagram: two copies of the LDA plate, one for document d and one for d′, each with its own topic mixture, topic assignments and words, sharing the hyperparameter α and the K topic distributions β_z)
15
Generating Links from d to d′
(Plate diagram: the two document plates as above, plus a Link node; for each word in the source document d, the word's topic z, the target's topic mixture θ_{d′} and the target's importance λ_{d′} together generate a possible link to d′)
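Putting the word step and the link step together, a minimal toy sampler for this generative story (an illustrative sketch only; the function name, the symbol lam for importance and the explicit "no link" outcome are my assumptions, not the authors' code):

    import numpy as np

    def generate_corpus(alpha, beta, lam, n_words, seed=0):
        """Toy sampler for the LTHM generative story (illustrative sketch).

        alpha:   scalar Dirichlet hyperparameter
        beta:    K x V topic-word distributions (rows sum to 1)
        lam:     length-D document importances (assumed symbol: lambda)
        n_words: length-D list with the number of words per document
        """
        rng = np.random.default_rng(seed)
        K, V = beta.shape
        D = len(n_words)
        theta = rng.dirichlet(alpha * np.ones(K), size=D)  # per-document topic mixtures
        docs, links = [], []
        for d in range(D):
            words = []
            for i in range(n_words[d]):
                z = rng.choice(K, p=theta[d])     # LDA step: topic of word i
                w = rng.choice(V, p=beta[z])      # LDA step: word i drawn from topic z
                words.append(w)
                # Link step: word i may point at any document (self-links allowed).
                # Pr(link -> d2) = lam[d2] * theta[d2, z]; remaining mass = no link.
                p_link = lam * theta[:, z]
                probs = np.append(p_link, max(0.0, 1.0 - p_link.sum()))
                target = rng.choice(D + 1, p=probs / probs.sum())
                if target < D:
                    links.append((d, i, target))  # (source doc, word position, target doc)
            docs.append(words)
        return docs, links

With β and λ estimated from data instead of sampled, the same quantity λ_{d′} × θ_{d′,z} drives the link-prediction scores used in the experiments below.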
16
Properties of the Model
  • D additional parameters for links (one importance value λ per document), vs. D×K link parameters in previous models (worked example below)
  • The existence (or non-existence) of a link is an observation
  • A link shares its topic with its anchor word
  • Links affect topic estimation in both the source and the target documents
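For scale, with the corpora used in the experiments that follow: on WebKB (D = 8282 documents) with K = 20 topics, LTHM adds 8,282 link parameters, whereas a D×K parameterization would add 165,640.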

17
Approximate Learning
  • Exact inference is intractable in hierarchical
    models such as LDA
  • Approximate inference in LTHM is even more
    challenging as non-links are also observations
  • Using symmetries, we derived an O(K × corpus size) EM algorithm

18
Experiments
  • WebKB dataset
  • 8282 documents
  • 12911 links
  • Wikipedia
  • A new data set, collected by crawling from the
    NIPS Wikipedia page
  • 105 documents
  • 790 links

19
Experiments: Wikipedia
20
Experiments: Wikipedia
21
Experiments: Wikipedia
22
Experiments: Wikipedia
23
(Slide figure: the "Journal of Machine Learning Research" example, with panels labeled LDA and LTHM)
24
Experiments: link prediction on test set
  • Wikipedia corpus: 105 documents with 790 links
  • 20 hidden aspects
  • Test set: 11 documents whose outgoing links are hidden (scoring sketch below)
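One plausible way to turn the trained model into a link-prediction ranking (my sketch; the slides do not spell out the exact scoring rule):

    import numpy as np

    def rank_link_targets(theta_src, theta, lam):
        """Rank candidate target documents for one source document.

        theta_src: length-K topic mixture of the source (test) document
        theta:     D x K topic mixtures of all candidate targets
        lam:       length-D document importances (assumed symbol: lambda)

        Score(d2) = sum_z theta_src[z] * lam[d2] * theta[d2, z], i.e. the
        expected link probability for an anchor word whose topic is drawn
        from theta_src.
        """
        scores = lam * (theta @ theta_src)   # shape (D,)
        return np.argsort(-scores)           # candidate targets, most likely first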

25
Experiments: link prediction on train set
  • Wikipedia corpus: 105 documents with 790 links
  • 20 hidden aspects

26
Experiments: link prediction
  • WebKB corpus: 8282 documents with 12911 links
  • 20 hidden aspects
  • Test set: 10

27
Summary
  • Explicit modeling of link generation in an LDA-like model
  • Efficient approximate inference algorithm
  • Performs better than previous topic models in
    link recommendation
  • Code and data available online at
    http://www.cs.huji.ac.il/~amitg/lthm.html