Hierarchical Summaries - PowerPoint PPT Presentation

About This Presentation

Title:

Hierarchical Summaries

Provided by: dawnl2

Learn more at: http://www.cs.loyola.edu

Transcript and Presenter's Notes

1
Hierarchical Summaries
for Search

By Dawn J. Lawrie
University of Massachusetts, Amherst

2
The Problem
3
Possible Solution
4
Possible Solution
5
Solution: Automatic Hierarchies
6
Strengths of Automatic Hierarchies

Word-based summary
Focus on topics of the documents
Allows users to navigate through the results
Easy to understand
Bonus: Useful for summarizing documents

7
Example

Hand-generated hierarchy of 50 documents
Query: Endangered Species (Mammals)

Endangered Animals (2910)
Endangered plants (70)
8
Proposed Framework
Term: word or phrase
9
Challenges

Selecting terms for the hierarchy
Displaying the hierarchy
Showing that it works

10
Outline

Introduction
Description of framework for creating hierarchies
Examples
Methods of evaluation
Future Improvements

11
Methodology

Build probabilistic word model of documents
Find best terms
On topic
Predictive
Recursive definition creates hierarchy

12
Term characteristics

Why topicality?
Distinguish topic terms from the rest of the
vocabulary
The Secretary of Interior listed bald eagles
south of the 40th parallel as endangered under
the Endangered Species Preservation Act of 1966.
Why predictiveness?
Topic words can be strongly related
Represent different facets of the vocabulary
Example:
P(Endangered | Steller sea lions) = 1.00

13
Statistical Model

A_T refers to topicality with respect to topic T
Find if the word w is in set T
B refers to predictiveness
Precondition for other terms to occur
Find if word w is in set P

14
Probabilistic Word Model

Captures statistical information about text
Called a language model in speech recognition
Provides basis for estimation of probabilities

15
Estimating Topicality

Use a term's contribution to relative entropy
Compares two models using K-L divergence
Model of documents in hierarchy
Model of general English
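
The comparison above can be sketched in a few lines. This is a minimal illustration, not the presenter's implementation: it assumes maximum-likelihood unigram models and scores each term by its summand p(w) * log2(p(w) / q(w)) in the K-L divergence between the document model p and the general-English model q. The function names and the `floor` fallback probability are hypothetical.

```python
import math
from collections import Counter

def unigram_model(tokens):
    """Maximum-likelihood unigram probabilities for a list of tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def kl_contributions(doc_model, general_model, floor=1e-6):
    """Per-term summand p(w) * log2(p(w) / q(w)) of KL(doc || general).

    `floor` is an assumed probability for words missing from the general
    model; it is an illustration choice, not a value from the slides.
    """
    return {
        w: p * math.log2(p / general_model.get(w, floor))
        for w, p in doc_model.items()
    }

# Toy data: a retrieved-document sample and a "general English" sample.
doc_model = unigram_model(["endangered", "species", "endangered", "the"])
general_model = unigram_model(["the", "the", "the", "species"])

scores = kl_contributions(doc_model, general_model)
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

Terms that are much more frequent in the retrieved documents than in general English get large positive scores; common function words score at or below zero.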

16
KL Example
endangered
17
Estimating Predictiveness

Relates the vocabulary to a set of candidate
topic terms
Use conditional probability P_x(t | v)
x is the maximum distance between t and v
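
One plausible reading of this windowed conditional probability is sketched below. It assumes the estimate is the fraction of occurrences of v that have t within x word positions on either side; the slides do not spell out the estimator, so the function name and this counting rule are assumptions.

```python
def windowed_conditional(tokens, t, v, x):
    """Estimate the probability of t given v with maximum distance x.

    Assumed estimator: the fraction of occurrences of v that have at
    least one occurrence of t within x positions on either side.
    """
    positions_v = [i for i, w in enumerate(tokens) if w == v]
    if not positions_v:
        return 0.0
    hits = 0
    for i in positions_v:
        window = tokens[max(0, i - x): i + x + 1]
        if t in window:
            hits += 1
    return hits / len(positions_v)

text = "steller sea lions are endangered and sea otters are endangered too".split()
p_endangered = windowed_conditional(text, "endangered", "sea", 3)
p_lions = windowed_conditional(text, "lions", "sea", 3)
p_narrow = windowed_conditional(text, "endangered", "sea", 1)
print(p_endangered, p_lions, p_narrow)
```

A larger x lets a candidate topic term predict more of the vocabulary, at the cost of counting looser associations.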

18
Dominating Set Approximation

Interpret predictive language model as graph
edges weighted by the conditional probability
Finds terms that are connected to lots of terms
with a high weight
Chooses topic terms until vocabulary is dominated
(predicted)
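
A greedy approximation in this spirit might look like the sketch below. The edge threshold, the scoring rule (sum of edge weights to still-undominated words), and the `max_terms` cap are all assumptions made for illustration; the slides state only that well-connected, high-weight terms are chosen until the vocabulary is dominated.

```python
def greedy_dominating_terms(weights, vocabulary, threshold=0.5, max_terms=10):
    """Greedily pick topic terms until the vocabulary is dominated.

    weights[t][v] is the assumed predictive weight of the edge between
    candidate term t and vocabulary word v. Edges below `threshold`
    are ignored (a hypothetical cutoff).
    """
    undominated = set(vocabulary)
    chosen = []
    while undominated and len(chosen) < max_terms:
        best, best_score = None, 0.0
        for t in sorted(weights):  # sorted so ties break deterministically
            if t in chosen:
                continue
            score = sum(
                w for v, w in weights[t].items()
                if v in undominated and w >= threshold
            )
            if score > best_score:
                best, best_score = t, score
        if best is None:  # no remaining candidate predicts anything new
            break
        chosen.append(best)
        undominated -= {v for v, w in weights[best].items() if w >= threshold}
    return chosen, undominated

weights = {
    "species": {"endangered": 0.9, "habitat": 0.8, "act": 0.7},
    "marine": {"mammals": 0.9, "whales": 0.6, "act": 0.2},
    "plan": {"habitat": 0.6},
}
vocabulary = {"endangered", "habitat", "act", "mammals", "whales"}
chosen, left_over = greedy_dominating_terms(weights, vocabulary)
print(chosen, left_over)
```

Minimum dominating set is NP-hard in general, which is why a greedy approximation of this kind is a natural choice.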

19
Term Selection Example
20
Generating a Summary

4-step process
(1) Preprocess document set
(2) Generate a language model
(3) Select the terms
(4) Create a hierarchy (recursive)
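
The recursive step can be illustrated with a small sketch. It assumes documents are already preprocessed into token lists, and it substitutes a simple document-frequency selector for the language-model-based term selection described earlier; that stand-in, the function names, and the `k`, `min_docs`, and `max_depth` parameters are hypothetical.

```python
from collections import Counter

def top_terms_by_doc_frequency(docs, exclude, k):
    """Stand-in selector: the k most widespread words not already on the path."""
    df = Counter(w for d in docs for w in set(d) if w not in exclude)
    ordered = sorted(df.items(), key=lambda kv: (-kv[1], kv[0]))
    return [w for w, _ in ordered[:k]]

def build_hierarchy(docs, select=top_terms_by_doc_frequency,
                    path=(), k=2, min_docs=2, max_depth=2):
    """Return {term: (document_count, sub_hierarchy)}, built recursively.

    Each chosen term becomes a node; the documents containing that term
    are summarized again, one level down, by the same procedure.
    """
    if len(docs) < min_docs or len(path) >= max_depth:
        return {}
    tree = {}
    for term in select(docs, set(path), k):
        subset = [d for d in docs if term in d]
        tree[term] = (
            len(subset),
            build_hierarchy(subset, select, path + (term,), k, min_docs, max_depth),
        )
    return tree

docs = [
    ["species", "marine", "mammals"],
    ["species", "marine", "whales"],
    ["species", "plan"],
    ["plan", "habitat"],
]
tree = build_hierarchy(docs)
print(tree)
```

Swapping `select` for a selector based on topicality and predictiveness would leave the recursion unchanged.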
21
Outline

Introduction
Description of framework for creating hierarchies
Examples
Methods of evaluation
Future Improvements

22
Example Hierarchies

Generated from 50 documents retrieved for the
query Endangered Species - Mammals
Demonstrate the difference between using
different topic models
Web hierarchy using same query

23
Uniform Topic Model Hierarchy
species (439)
marine mammals (187)
plan (192)
marine (187)
24
KL-Topic Model Hierarchy
marine mammals (187)
species (439)
marine (187)
Marine Mammal Protection Act (73)
management plan (51)
25
Query-Topic Model Hierarchy
species (439)
Endangered Species Act (335)
marine mammals (187)
Federal (232)
marine (187)
26
Web Hierarchies

Submit query to a web search engine
Gather titles and snippets of documents
Text considered a document
Documents are about 30 words

27
Example of Web Hierarchy
marine (76)
Endangered Species (440)
endangered (491)
mammals (600)
28
Outline

Introduction
Description of framework for creating hierarchies
Examples
Methods of evaluation
Future Improvements

29
Evaluations

Summary Evaluation
Tests how well the topic terms chosen predict the
vocabulary
Access Evaluation
Compare number of documents a user can find
Relevance Evaluation
Path length to find all relevant documents

30
Automatic Evaluation Test Set

Use 50 standard queries
Document sets
500 documents retrieved from TREC volumes 4 and 5
(have relevance judgments)
200 documents retrieved from a news database
1000 titles and snippets retrieved using Google
Search Engine

31
Evaluating Hypotheses

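Legend: one marker denotes that an evaluation confirmed the hypothesis; a second marker denotes that the evaluation showed no significant difference

[Table: TREC Collection and News Documents. Hypotheses: Use KL-topic model; Use sub-collections. Evaluations: Summary, Access, Relevance. Cell markers not recoverable.]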
32
Web Document Evaluation

Results completely different
Best hierarchy: uniform topic model
Hierarchies do not look as good to human
inspection

33
User Study

Include 12 to 16 users
Compare ranked list and hierarchy to ranked list
alone
Users asked to find all instances that are
relevant to the query
Only have to identify one document about a
particular instance
Study includes 10 queries

34
Future Work

Complete user study
Failure Analysis
Explore the use of topic hierarchies in other
organizational tasks
Personal collections of documents
E-mails

35
Conclusions

Developed a formal framework for topic
hierarchies
Created hierarchies from full text and snippets
of documents
Verified intuition concerning hierarchies
generated from full text

36
Questions?
Demo: http://www-ciir.cs.umass.edu/lawrie/categories/google-qry/
37
Improving Topicality Estimate

Estimate topicality using a query model
Emphasizes query related terms
Improve model of English with sub-collections
Distinguishes between terms that are frequently
used in a genre and topic terms

38
Key Ideas

Language models are created from documents
Topicality and predictiveness are used to choose
terms
Topicality is estimated using Kullback-Leibler
divergence
Predictiveness is estimated by calculating
conditional probabilities

39
Key Ideas

Showed the effect of using different topic models
Observed trade-off between snippets and full text
of documents

40
Summary Evaluation

Expected Mutual Information Measure
Measures how far two sets deviate from stochastic independence
Shows how well the topic terms chosen predict the
vocabulary
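
For reference, the standard form of the expected mutual information measure is the sum over pairs (t, v) of P(t, v) * log(P(t, v) / (P(t) * P(v))). The sketch below computes it from a joint distribution over (topic term, vocabulary word) pairs; how that joint distribution was estimated for the evaluation is not stated in the slides, so the input format here is an assumption.

```python
import math

def emim(joint):
    """Expected mutual information of a joint distribution, in bits.

    joint[(t, v)] = P(t, v). Marginals are obtained by summing the joint.
    The result is 0 when the two variables are independent.
    """
    p_t, p_v = {}, {}
    for (t, v), p in joint.items():
        p_t[t] = p_t.get(t, 0.0) + p
        p_v[v] = p_v.get(v, 0.0) + p
    return sum(
        p * math.log2(p / (p_t[t] * p_v[v]))
        for (t, v), p in joint.items()
        if p > 0
    )

independent = {("a", "x"): 0.25, ("a", "y"): 0.25,
               ("b", "x"): 0.25, ("b", "y"): 0.25}
dependent = {("a", "x"): 0.5, ("b", "y"): 0.5}
score_independent = emim(independent)
score_dependent = emim(dependent)
print(score_independent, score_dependent)
```

A higher score means the topic terms carry more information about the rest of the vocabulary.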

41
Hierarchy Terms vs. Top TF.IDF Terms

TF.IDF: popular term weight
Commonly used as a method of naming clusters
Compare equal number of unique terms in hierarchy
to top TF.IDF Terms
Results
Hierarchies always significantly better at
summarizing documents
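
A TF.IDF baseline of this kind can be sketched as follows. TF.IDF has many variants; this one assumes collection-wide term frequency multiplied by log(N / document frequency), which may differ from the weighting used in the evaluation. The function name is hypothetical.

```python
import math
from collections import Counter

def top_tfidf_terms(docs, k):
    """Return the k terms with the highest TF.IDF weight.

    Assumed variant: tf is the total count across the document set and
    idf is log(N / df), where df is the number of documents containing
    the term.
    """
    n = len(docs)
    df = Counter(w for d in docs for w in set(d))
    tf = Counter(w for d in docs for w in d)
    scores = {w: tf[w] * math.log(n / df[w]) for w in tf}
    ordered = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [w for w, _ in ordered[:k]]

docs = [["whale", "whale", "the"], ["the", "seal"], ["the", "plan"]]
baseline = top_tfidf_terms(docs, 2)
print(baseline)
```

To match the comparison on the slide, k would be set to the number of unique terms in the hierarchy.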

42
Access Evaluation

Compare number of documents that are accessible
Example policy
Examine parts of the hierarchy with 20 or fewer
documents
Look at the top 200 documents in a ranked list
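
The example policy can be written down directly. In this sketch the hierarchy is flattened to a mapping from node label to the set of document ids under that node, which is an assumed representation; the thresholds of 20 and 200 come from the slide.

```python
def accessible_via_hierarchy(node_docs, max_node_size=20):
    """Documents reachable by examining only nodes with few enough documents."""
    reachable = set()
    for docs in node_docs.values():
        if len(docs) <= max_node_size:
            reachable |= docs
    return reachable

def accessible_via_ranked_list(ranking, depth=200):
    """Documents seen by reading the ranked list down to `depth`."""
    return set(ranking[:depth])

# Toy data: document ids are integers; the node contents are made up.
node_docs = {
    "marine mammals": set(range(0, 15)),
    "species": set(range(0, 300)),          # too large to examine
    "management plan": set(range(250, 262)),
}
ranking = list(range(300))

from_hierarchy = accessible_via_hierarchy(node_docs)
from_list = accessible_via_ranked_list(ranking)
print(len(from_hierarchy), len(from_list), len(from_hierarchy | from_list))
```

The union shows what the hierarchy adds on top of the ranked list: here, documents ranked below 200 that sit in a small node.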

43
Hierarchy vs. Ranked List
[Figure: ordering of ranked-list depths (Rank n) and hierarchy sizes (Hier. Topics n) for each level size]
Level Size 5: Rank 400, Rank 350, Rank 300, Rank 250, Rank 200, Hier. Topics 50, Hier. Topics 45, Hier. Topics 40, Rank 150, Hier. Topics 35, Hier. Topics 30, Hier. Topics 25, Hier. Topics 20, Hier. Topics 15, Hier. Topics 10, Rank 100, Hier. Topics 5, Rank 50
Level Size 10: Rank 500, Rank 450, Rank 400, Rank 350, Rank 300, Hier. Topics 50, Hier. Topics 45, Hier. Topics 40, Hier. Topics 35, Rank 250, Hier. Topics 30, Hier. Topics 25, Hier. Topics 20, Hier. Topics 15, Hier. Topics 10, Rank 200, Hier. Topics 5, Rank 150
Level Size 15: Rank 500, Rank 450, Rank 400, Rank 350, Hier. Topics 50, Hier. Topics 45, Hier. Topics 40, Hier. Topics 35, Hier. Topics 30, Hier. Topics 25, Hier. Topics 20, Hier. Topics 15, Rank 300, Hier. Topics 10, Hier. Topics 5, Rank 250, Rank 200, Rank 150
Level Size 20: Rank 500, Rank 450, Hier. Topics 50, Hier. Topics 45, Hier. Topics 40, Rank 400, Hier. Topics 35, Hier. Topics 30, Hier. Topics 25, Hier. Topics 20, Hier. Topics 15, Hier. Topics 10, Rank 350, Hier. Topics 5, Rank 300, Rank 250, Rank 200, Rank 150
44
Relevance Evaluation

Calculate average path to a relevant document
Assumptions
one does not read extraneous menus
one reads all documents at a node
Ignores relevant documents that are not in the
hierarchy
Smaller score denotes a better hierarchy
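
One way to turn these assumptions into a number is sketched below. The cost model is an assumption: reaching a node costs the size of every menu on the path to it (only menus on the path are read), plus the number of documents at that node (all are read). Relevant documents absent from the hierarchy are skipped, as stated above. The nested-dict representation and function names are hypothetical.

```python
def path_costs(menu, cost_so_far=0):
    """Yield (doc_id, cost) for every document attached to every node."""
    menu_cost = cost_so_far + len(menu)  # read every label in this menu
    for node in menu.values():
        docs = node.get("docs", [])
        for doc in docs:
            yield doc, menu_cost + len(docs)  # then read all docs at the node
        yield from path_costs(node.get("children", {}), menu_cost)

def average_path_length(menu, relevant):
    """Mean of the cheapest cost to each relevant document in the hierarchy."""
    best = {}
    for doc, cost in path_costs(menu):
        if doc in relevant and cost < best.get(doc, float("inf")):
            best[doc] = cost
    return sum(best.values()) / len(best) if best else None

hierarchy = {
    "species": {
        "docs": [1, 2, 3],
        "children": {"marine mammals": {"docs": [1, 2]}},
    },
    "plan": {"docs": [4]},
}
score = average_path_length(hierarchy, relevant={1, 4, 9})
print(score)
```

Document 9 is relevant but not in the hierarchy, so it does not affect the score.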

45
Key Ideas

Automatic evaluations have confirmed hypotheses
for hierarchies created from full text of
documents
User study is necessary for determining how well
people can use hierarchies
