Title: The use of unlabeled data to improve supervised learning for text summarization
1. The use of unlabeled data to improve supervised learning for text summarization
- M.-R. Amini, P. Gallinari (SIGIR 2002)
- Slides prepared by Jon Elsas for the Semi-supervised NL Learning Reading Group
2. Presentation Outline
- Overview of document summarization
- Major contribution: semi-supervised logistic Classification EM (CEM) for maximum-likelihood summaries
- Evaluation
- Baseline systems
- Results
3. Document Summarization
- Motivation: text volume >> users' time
- Single-document summarization
  - Used for display of search results, automatic abstracting, browsing, etc.
- Multi-document summarization
  - Describe clusters / document collections, QA, etc.
- Problem: What is the summary used for? Does a generic summary exist?
4. Single-Document Summarization Example
5. Document Summarization
- Generative summaries
  - Synthetic text produced after analysis of high-level linguistic features (discourse, semantics, etc.)
  - Hard.
- Extract summaries
  - Text excerpts (usually sentences) composed together to create a summary
  - Boils down to a passage classification/ranking problem
6. Major Contribution
- Semi-supervised logistic Classifying Expectation Maximization (CEM) for passage classification
- Advantages over other methods
  - Works with a small set of labeled data plus a large set of unlabeled data
  - No modeling assumptions for density estimation
- Cons
  - (Probably) slow; no runtime numbers given
7. Expectation Maximization (EM)
- Finds maximum-likelihood estimates of parameters when the underlying distribution depends on unobserved latent variables
- Maximizes model fit to the data distribution
- Criterion function
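The criterion-function equation on this slide is an image that did not survive extraction. For a K-component mixture with proportions π_k and component densities f_k, the standard EM criterion (a reconstruction from the standard formulation, not the slide's exact rendering) is the incomplete-data log-likelihood:

```latex
L(\theta) = \sum_{i=1}^{n} \log \sum_{k=1}^{K} \pi_k \, f_k(x_i \mid \theta_k)
```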
8. Classifying EM (CEM)
- Like EM, with the addition of an indicator variable for component membership
- Maximizes the quality of the clustering
- Criterion function
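This slide's equation image is also missing. CEM's standard criterion is the Classification Maximum Likelihood over a hard partition P = (P_1, …, P_K) of the data (reconstructed from the standard CEM literature, not the slide itself):

```latex
C(P, \theta) = \sum_{k=1}^{K} \sum_{x_i \in P_k} \log \bigl( \pi_k \, f_k(x_i \mid \theta_k) \bigr)
```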
9. Semi-supervised Generative-CEM
- Fix component membership for the labeled data
- Criterion function: a labeled-data term (memberships fixed) plus an unlabeled-data term (memberships estimated)
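The slide's equation image (annotated "Labeled Data" / "Unlabeled Data") is missing; it can be reconstructed in the same notation as the CEM criterion. With D_l the labeled passages (known classes y_i) and D_u the unlabeled ones (estimated hard memberships t_ik), the criterion splits into the two terms:

```latex
C = \underbrace{\sum_{x_i \in D_l} \log \bigl( \pi_{y_i} \, f_{y_i}(x_i \mid \theta_{y_i}) \bigr)}_{\text{labeled data}}
  \; + \; \underbrace{\sum_{x_i \in D_u} \sum_{k=1}^{K} t_{ik} \log \bigl( \pi_k \, f_k(x_i \mid \theta_k) \bigr)}_{\text{unlabeled data}}
```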
10. Semi-supervised Logistic-CEM
- Use a discriminative classifier (logistic regression) instead of a generative model
- M-step: need to re-do gradient descent to estimate the βs
- Criterion again combines a labeled-data term and an unlabeled-data term
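The loop can be sketched as follows. This is a minimal reading of the algorithm, not the paper's exact procedure: the C-step hard-assigns unlabeled passages with the current logistic model, and the M-step refits the βs by gradient ascent on labeled plus pseudo-labeled data, keeping the labeled assignments fixed.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.5, n_iter=300):
    """Fit logistic regression by gradient ascent on the log-likelihood."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ beta)
        beta += lr * X.T @ (y - p) / len(y)
    return beta

def logistic_cem(X_lab, y_lab, X_unlab, n_rounds=10):
    """Semi-supervised logistic CEM, sketched as hard self-training."""
    beta = fit_logistic(X_lab, y_lab)            # initialize from labeled data only
    for _ in range(n_rounds):
        # C-step: hard component assignment for the unlabeled passages
        y_unlab = (sigmoid(X_unlab @ beta) >= 0.5).astype(float)
        # M-step: re-estimate the betas on all data (labeled memberships fixed)
        X_all = np.vstack([X_lab, X_unlab])
        y_all = np.concatenate([y_lab, y_unlab])
        beta = fit_logistic(X_all, y_all)
    return beta
```

Because the classifier is discriminative, no density f_k needs to be modeled; the M-step is a gradient-based refit rather than a closed-form parameter update, which is where the slide's "(probably) slow" caveat comes from.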
11. Evaluation
- Algorithm evaluated against 3 other single-document summarization algorithms
  - Non-trainable system: passage ranking
  - Trainable system: Naïve Bayes sentence classifier
  - Generative-CEM (using full Gaussians)
- Precision/recall with respect to gold-standard extract summaries
- The fine print
  - All systems used similar representation schemes, but not the same
12. Baseline System: Sentence Ranking
- Rank sentences using a TF-IDF similarity measure with query expansion (Sim2)
  - Blind relevance feedback from the top sentences
  - WordNet similarity thesaurus
- Generic query created from the most frequent words in the training set
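The core of this baseline can be sketched as plain TF-IDF cosine ranking. This omits the query-expansion parts of Sim2 (blind relevance feedback and the WordNet thesaurus), so it is an illustration of the ranking step only, not the paper's full measure:

```python
import math
from collections import Counter

def tfidf_rank(sentences, query_terms):
    """Rank sentence indices by tf-idf cosine similarity to a query."""
    docs = [s.lower().split() for s in sentences]
    n = len(docs)
    # document frequency and idf over the sentence collection
    df = Counter(t for d in docs for t in set(d))
    idf = {t: math.log(n / df[t]) for t in df}

    def vec(terms):
        tf = Counter(terms)
        return {t: tf[t] * idf.get(t, 0.0) for t in tf}

    def cos(u, v):
        dot = sum(w * v.get(t, 0.0) for t, w in u.items())
        nu = math.sqrt(sum(w * w for w in u.values()))
        nv = math.sqrt(sum(w * w for w in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0

    q = vec([t.lower() for t in query_terms])
    scores = [cos(d_vec, q) for d_vec in map(vec, docs)]
    return sorted(range(n), key=lambda i: -scores[i])
```

With the "generic query" of the slide, `query_terms` would be the most frequent content words of the training set rather than a user query.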
13. Naïve Bayes Model: Sentence Classification
- Simple Naïve Bayes classifier trained on 5 features
  - Sentence length < t_length → {0, 1}
  - Sentence contains cue words → {0, 1}
  - Sentence–query similarity (Sim2) > t_sim → {0, 1}
  - Upper-case/acronym features (count?)
  - Sentence/paragraph position in text → {1, 2, 3}
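A classifier of this kind over binary features can be sketched as Bernoulli Naïve Bayes with Laplace smoothing. This is a generic sketch, not the paper's exact model; the thresholds t_length and t_sim that binarize the features come from the slide above:

```python
import numpy as np

def train_bernoulli_nb(X, y, alpha=1.0):
    """Train Bernoulli Naive Bayes on binary features.

    X: (n, d) array of 0/1 features; y: (n,) array of 0/1 labels
    (1 = sentence belongs in the extract summary).
    """
    classes = [0, 1]
    priors = np.array([np.mean(y == c) for c in classes])
    # P(feature_j = 1 | class c), with Laplace smoothing
    cond = np.array([(X[y == c].sum(axis=0) + alpha) / ((y == c).sum() + 2 * alpha)
                     for c in classes])
    return priors, cond

def predict_bernoulli_nb(model, X):
    priors, cond = model
    # log P(c) + sum_j [ x_j log p_jc + (1 - x_j) log(1 - p_jc) ]
    log_post = np.log(priors) + X @ np.log(cond).T + (1 - X) @ np.log(1 - cond).T
    return log_post.argmax(axis=1)
```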
14. Logistic-CEM: Sentence Representation Features
- Features used to train logistic-CEM
  - Normalized sentence length → [0, 1]
  - Normalized cue-word frequency → [0, 1]
  - Sentence–query similarity (Sim2) → [0, ∞)
  - Normalized acronym frequency → [0, 1]
  - Sentence/paragraph position in text → {1, 2, 3}
- (All of the binary features converted to continuous)
15. Results on Reuters dataset
16. Results on Reuters dataset (cont.)