Hierarchical Summaries - PowerPoint PPT Presentation


Title: Hierarchical Summaries


1
Hierarchical Summaries
for Search
  • By Dawn J. Lawrie
  • University of Massachusetts, Amherst

2
The Problem
3
Possible Solution
4
Possible Solution
5
Solution: Automatic Hierarchies
6
Strengths of Automatic Hierarchies
  • Word-based summary
  • Focus on topics of the documents
  • Allows users to navigate through the results
  • Easy to understand
  • Bonus: Useful for summarizing documents

7
Example
  • Hand-generated hierarchy of 50 documents; Query:
    Endangered Species (Mammals)

Endangered Animals (2910)
Endangered plants (70)
8
Proposed Framework
Term: word or phrase
9
Challenges
  • Selecting terms for the hierarchy
  • Displaying the hierarchy
  • Showing that it works

10
Outline
  • Introduction
  • Description of framework for creating hierarchies
  • Examples
  • Methods of evaluation
  • Future Improvements

11
Methodology
  • Build probabilistic word model of documents
  • Find best terms
  • On topic
  • Predictive
  • Recursive definition creates hierarchy

12
Term characteristics
  • Why topicality?
  • Distinguish topic terms from the rest of the
    vocabulary
  • The Secretary of Interior listed bald eagles
    south of the 40th parallel as endangered under
    the Endangered Species Preservation Act of 1966.
  • Why predictiveness?
  • Topic words can be strongly related
  • Represent different facets of the vocabulary
  • Example
  • P(Endangered | Steller sea lions) = 1.00

13
Statistical Model
  • AT refers to topicality with respect to topic T
  • Find if the word w is in set T
  • B refers to predictiveness
  • Precondition for other terms to occur
  • Find if word w is in set P

14
Probabilistic Word Model
  • Captures statistical information about text
  • Called a language model in speech recognition
  • Provides basis for estimation of probabilities
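
A minimal sketch of such a word model, assuming a unigram maximum-likelihood estimate over lowercased, whitespace-split text. The function name `unigram_model` and the tokenization are illustrative assumptions, not details given in the talk.

```python
from collections import Counter

def unigram_model(documents):
    """Maximum-likelihood unigram model: P(w) = count(w) / total tokens."""
    counts = Counter()
    for text in documents:
        counts.update(text.lower().split())
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

# Toy input, not data from the talk.
docs = ["endangered species act", "marine mammals are endangered"]
model = unigram_model(docs)
```

Here "endangered" accounts for 2 of the 7 tokens, so its estimated probability is 2/7. A real system would likely add smoothing for unseen words.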

15
Estimating Topicality
  • Use term's contribution to relative entropy
  • Compares two models using K-L divergence
  • Model of documents in hierarchy
  • Model of general English
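
One way to read "contribution to relative entropy" is the per-term summand of the K-L divergence, p(w) log(p(w) / q(w)), where p is the model of the documents and q is the model of general English. The sketch below assumes that reading. The name `kl_contributions`, the `floor` used for words missing from the general model, and the toy probabilities are all assumptions.

```python
import math

def kl_contributions(p_docs, p_general, floor=1e-9):
    """Per-term summand of D(p_docs || p_general): p(w) * log(p(w) / q(w))."""
    return {
        w: p * math.log(p / max(p_general.get(w, 0.0), floor))
        for w, p in p_docs.items()
        if p > 0
    }

# Toy distributions, not values from the talk.
p_docs = {"endangered": 0.5, "the": 0.5}
p_general = {"endangered": 0.01, "the": 0.5, "other": 0.49}

scores = kl_contributions(p_docs, p_general)
ranked = sorted(scores, key=scores.get, reverse=True)
```

In this toy case "the" is equally likely under both models and contributes nothing, while "endangered" is far more likely in the documents than in general English and ranks first.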

16
KL Example
endangered
17
Estimating Predictiveness
  • Relates the vocabulary to a set of candidate
    topic terms
  • Use conditional probability: P_x(t | v)
  • x is the maximum distance between t and v
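
A rough sketch of how such a windowed conditional probability could be estimated from a token sequence: the fraction of occurrences of v that have t within x positions. This counting scheme, the name `windowed_conditional`, and the restriction to single-word terms are assumptions made for illustration. The talk does not spell out the estimator, and terms may also be phrases.

```python
def windowed_conditional(tokens, t, v, x):
    """Estimate P_x(t | v): share of occurrences of v with t within x positions."""
    positions_v = [i for i, w in enumerate(tokens) if w == v]
    if not positions_v:
        return 0.0
    hits = 0
    for i in positions_v:
        # Window covers x tokens on each side of this occurrence of v.
        window = tokens[max(0, i - x): i + x + 1]
        if t in window:
            hits += 1
    return hits / len(positions_v)

# Toy sentence, not text from the talk.
tokens = "steller sea lions are endangered and sea otters are threatened".split()
p_near = windowed_conditional(tokens, "endangered", "sea", 2)
p_wide = windowed_conditional(tokens, "endangered", "sea", 3)
```

With a window of 2, only the second "sea" has "endangered" nearby, so the estimate is 0.5. Widening the window to 3 covers both occurrences.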

18
Dominating Set Approximation
  • Interpret predictive language model as graph
  • edges weighted by the conditional probability
  • Finds terms that are connected to lots of terms
    with a high weight
  • Chooses topic terms until vocabulary is dominated
    (predicted)
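
The selection loop described above can be sketched as a generic greedy heuristic. This is not necessarily the approximation used in the talk: the `threshold` that decides when a word counts as dominated, the `max_terms` cap, and the toy weights are assumptions of this sketch.

```python
def greedy_dominating_terms(weights, vocabulary, threshold=0.5, max_terms=10):
    """Greedily pick terms until the vocabulary is dominated.

    weights[t][v] is the edge weight from candidate term t to word v.
    A word counts as dominated once a chosen term reaches it with
    weight >= threshold (an assumption of this sketch).
    """
    uncovered = set(vocabulary)
    chosen = []
    while uncovered and len(chosen) < max_terms:
        def gain(t):
            # Total weight to words this term would newly dominate.
            return sum(w for v, w in weights[t].items()
                       if v in uncovered and w >= threshold)

        candidates = [t for t in weights if t not in chosen]
        if not candidates:
            break
        best = max(candidates, key=gain)
        if gain(best) == 0:
            break  # no remaining candidate dominates anything new
        chosen.append(best)
        uncovered -= {v for v, w in weights[best].items() if w >= threshold}
    return chosen, uncovered

# Toy graph, not data from the talk.
weights = {
    "species": {"endangered": 0.9, "listed": 0.8, "habitat": 0.6},
    "marine": {"mammals": 0.9, "seals": 0.7, "habitat": 0.2},
    "plan": {"listed": 0.6},
}
vocabulary = ["endangered", "listed", "habitat", "mammals", "seals"]
chosen, uncovered = greedy_dominating_terms(weights, vocabulary)
```

On the toy graph the loop picks "species" first, then "marine", at which point every word is dominated and "plan" is never needed.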

19
Term Selection Example
20
Generating a Summary
  • 4-step process
  • (1) Preprocess document set
  • (2) Generate a language model
  • (3) Select the terms
  • (4) Create a Hierarchy

recursive
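
The recursive step can be illustrated with a small sketch. The term selector here is a stand-in that simply picks the words found in the most documents; it is not the topicality and predictiveness selection described earlier. The names `select_terms` and `build_hierarchy`, the parameters `k` and `depth`, and the choice to label each node with a document count are all assumptions.

```python
from collections import Counter

def select_terms(docs, exclude, k):
    """Stand-in selector: the k words that appear in the most documents."""
    df = Counter(w for d in docs for w in set(d.split()) if w not in exclude)
    # Sort by document frequency, then alphabetically for a stable order.
    ordered = sorted(df.items(), key=lambda kv: (-kv[1], kv[0]))
    return [w for w, _ in ordered[:k]]

def build_hierarchy(docs, exclude=frozenset(), k=2, depth=2):
    """For each selected term, recurse on the documents that contain it."""
    if depth == 0 or not docs:
        return {}
    tree = {}
    for term in select_terms(docs, exclude, k):
        subset = [d for d in docs if term in d.split()]
        tree[(term, len(subset))] = build_hierarchy(
            subset, exclude | {term}, k, depth - 1)
    return tree

# Toy documents, not data from the talk.
docs = ["marine mammals plan", "marine species act", "species plan"]
tree = build_hierarchy(docs, k=2, depth=2)
```

Each key is a (term, document count) pair and each value is the sub-hierarchy built from only the documents containing that term, which is what makes the definition recursive.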
21
Outline
  • Introduction
  • Description of framework for creating hierarchies
  • Examples
  • Methods of evaluation
  • Future Improvements

22
Example Hierarchies
  • Generated from 50 documents retrieved for the
    query Endangered Species - Mammals
  • Demonstrate the difference between using
    different topic models
  • Web hierarchy using same query

23
Uniform Topic Model Hierarchy
species (439)
marine mammals (187)
plan (192)
marine (187)
24
KL-Topic Model Hierarchy
marine mammals (187)
species (439)
marine (187)
Marine Mammal Protection Act (73)
management plan (51)
25
Query-Topic Model Hierarchy
species (439)
Endangered Species Act (335)
marine mammals (187)
Federal (232)
marine (187)
26
Web Hierarchies
  • Submit query to a web search engine
  • Gather titles and snippets of documents
  • Text considered a document
  • Documents are about 30 words

27
Example of Web Hierarchy
marine (76)
Endangered Species (440)
endangered (491)
mammals (600)
28
Outline
  • Introduction
  • Description of framework for creating hierarchies
  • Examples
  • Methods of evaluation
  • Future Improvements

29
Evaluations
  • Summary Evaluation
  • Tests how well the topic terms chosen predict the
    vocabulary
  • Access Evaluation
  • Compare number of documents a user can find
  • Relevance Evaluation
  • Path length to find all relevant documents

30
Automatic Evaluation Test Set
  • Use 50 standard queries
  • Document sets
  • 500 documents retrieved from TREC volumes 4 and 5
    (have relevance judgments)
  • 200 documents retrieved from a news database
  • 1000 titles and snippets retrieved using Google
    Search Engine

31
Evaluating Hypotheses
  • Denotes an evaluation confirmed hypothesis
  • Denotes evaluation showed no significant
    difference

?
[Table for TREC Collection and News Documents. Hypotheses: Use KL-topic model, Use sub-collections. Evaluations: Summary, Access, Relevance. Cell marks not preserved.]
32
Web Document Evaluation
  • Results completely different
  • Best hierarchy: uniform topic model
  • Hierarchies do not look as good to human
    inspection

33
User Study
  • Include 12 to 16 users
  • Compare ranked list and hierarchy to ranked list
    alone
  • Users asked to find all instances that are
    relevant to the query
  • Only have to identify one document about a
    particular instance
  • Study includes 10 queries

34
Future Work
  • Complete user study
  • Failure Analysis
  • Explore the use of topic hierarchies in other
    organizational tasks
  • Personal collections of documents
  • E-mails

35
Conclusions
  • Developed a formal framework for topic
    hierarchies
  • Created hierarchies from full text and snippets
    of documents
  • Verified intuition concerning hierarchies
    generated from full text

36
Questions?
Demo: http://www-ciir.cs.umass.edu/lawrie/categories/google-qry/
37
Improving Topicality Estimate
  • Estimate topicality using a query model
  • Emphasizes query related terms
  • Improve model of English with sub-collections
  • Distinguishes between terms that are frequently
    used in a genre and topic terms

38
Key Ideas
  • Language models are created from documents
  • Topicality and predictiveness are used to choose
    terms
  • Topicality is estimated using Kullback-Leibler
    divergence
  • Predictiveness is estimated by calculating
    conditional probabilities

39
Key Ideas
  • Showed the effect of using topic model
  • Observed trade-off between snippets and full text
    of documents

40
Summary Evaluation
  • Expected Mutual Information Measure
  • Measures how far two sets deviate from stochastic
    independence
  • Shows how well the topic terms chosen predict the
    vocabulary
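
For reference, the expected mutual information measure is commonly written as the sum over pairs (t, v) of P(t, v) log(P(t, v) / (P(t) P(v))). The sketch below computes that sum from given probabilities. How the talk estimates those probabilities is not stated, so the inputs here are toy values and the function name `emim` is an assumption.

```python
import math

def emim(joint, p_t, p_v):
    """Sum of P(t, v) * log(P(t, v) / (P(t) * P(v))) over all pairs."""
    total = 0.0
    for (t, v), p_tv in joint.items():
        if p_tv > 0:
            total += p_tv * math.log(p_tv / (p_t[t] * p_v[v]))
    return total

# Toy distributions, not results from the talk.
marginal_t = {"a": 0.5, "b": 0.5}
marginal_v = {"x": 0.5, "y": 0.5}

independent = {("a", "x"): 0.25, ("a", "y"): 0.25,
               ("b", "x"): 0.25, ("b", "y"): 0.25}
dependent = {("a", "x"): 0.5, ("b", "y"): 0.5}

score_independent = emim(independent, marginal_t, marginal_v)
score_dependent = emim(dependent, marginal_t, marginal_v)
```

The score is zero when the two sets are independent and grows as the topic terms become more informative about the vocabulary. In the fully dependent toy case it equals log 2.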

41
Hierarchy Terms vs. Top TF.IDF Terms
  • TF.IDF: popular term weight
  • Commonly used as a method of naming clusters
  • Compare equal number of unique terms in hierarchy
    to top TF.IDF Terms
  • Results
  • Hierarchies always significantly better at
    summarizing documents
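
A sketch of how a top-TF.IDF baseline could be produced for such a comparison. TF.IDF has many variants. This one sums tf * log(N / df) over the document set, which is an assumption and not necessarily the weighting used in the evaluation. The name `top_tfidf_terms` is illustrative.

```python
import math
from collections import Counter

def top_tfidf_terms(documents, k):
    """Rank terms by summed tf * log(N / df) across the document set."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    df = Counter(w for tokens in tokenized for w in set(tokens))
    scores = Counter()
    for tokens in tokenized:
        for w, tf in Counter(tokens).items():
            scores[w] += tf * math.log(n_docs / df[w])
    # Highest score first; ties broken alphabetically for a stable order.
    ordered = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [w for w, _ in ordered[:k]]

# Toy documents, not data from the talk.
docs = ["the marine mammals", "the marine marine plan", "the species act"]
top_terms = top_tfidf_terms(docs, 2)
```

A word such as "the" that occurs in every document gets an inverse document frequency of zero and falls to the bottom of the ranking.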

42
Access Evaluation
  • Compare number of documents that are accessible
  • Example policy
  • Examine parts of the hierarchy with 20 or fewer
    documents
  • Look at the top 200 documents in a ranked list
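
The example policy can be sketched as two set computations whose sizes are then compared. The data layout (a mapping from hierarchy node to document ids, and a ranked list of ids) and the function names are assumptions for illustration.

```python
def accessible_via_hierarchy(nodes, max_node_size=20):
    """Distinct documents found by reading only nodes of at most max_node_size."""
    reachable = set()
    for doc_ids in nodes.values():
        if len(doc_ids) <= max_node_size:
            reachable.update(doc_ids)
    return reachable

def accessible_via_ranked_list(ranking, cutoff=200):
    """Distinct documents found by reading the top `cutoff` of a ranked list."""
    return set(ranking[:cutoff])

# Toy data, not results from the talk.
nodes = {
    "species": list(range(30)),   # too large under the policy, skipped
    "marine mammals": [1, 2, 3],
    "plan": [3, 40],
}
ranking = [7, 1, 40, 2, 9]

from_hierarchy = accessible_via_hierarchy(nodes)
from_ranking = accessible_via_ranked_list(ranking, cutoff=3)
```

Comparing `len(from_hierarchy)` with `len(from_ranking)` gives the kind of count the slide describes.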

43
Hierarchy vs. Ranked List
Level Size 5: Rank 400, Rank 350, Rank 300, Rank 250, Rank 200, Hier. Topics 50, Hier. Topics 45, Hier. Topics 40, Rank 150, Hier. Topics 35, Hier. Topics 30, Hier. Topics 25, Hier. Topics 20, Hier. Topics 15, Hier. Topics 10, Rank 100, Hier. Topics 5, Rank 50
Level Size 10: Rank 500, Rank 450, Rank 400, Rank 350, Rank 300, Hier. Topics 50, Hier. Topics 45, Hier. Topics 40, Hier. Topics 35, Rank 250, Hier. Topics 30, Hier. Topics 25, Hier. Topics 20, Hier. Topics 15, Hier. Topics 10, Rank 200, Hier. Topics 5, Rank 150
Level Size 20: Rank 500, Rank 450, Hier. Topics 50, Hier. Topics 45, Hier. Topics 40, Rank 400, Hier. Topics 35, Hier. Topics 30, Hier. Topics 25, Hier. Topics 20, Hier. Topics 15, Hier. Topics 10, Rank 350, Hier. Topics 5, Rank 300, Rank 250, Rank 200, Rank 150
Level Size 15: Rank 500, Rank 450, Rank 400, Rank 350, Hier. Topics 50, Hier. Topics 45, Hier. Topics 40, Hier. Topics 35, Hier. Topics 30, Hier. Topics 25, Hier. Topics 20, Hier. Topics 15, Rank 300, Hier. Topics 10, Hier. Topics 5, Rank 250, Rank 200, Rank 150
44
Relevance Evaluation
  • Calculate average path to a relevant document
  • Assumptions
  • one does not read extraneous menus
  • one reads all documents at a node
  • Ignores relevant documents that are not in the
    hierarchy
  • Smaller score denotes a better hierarchy
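
The slide does not give the exact scoring formula. As a rough illustration only, the sketch below assumes the cost of reaching a relevant document is the number of menu items read on the way down plus the number of documents read at the final node, averaged over the relevant documents that appear in the hierarchy. The function name and the input layout are hypothetical.

```python
def average_path_cost(paths):
    """Average cost over relevant documents found in the hierarchy.

    Each entry is (menu_sizes, docs_at_node): the sizes of the menus read
    on the way down, and the number of documents read at the final node.
    This cost model is an assumption, not the formula from the talk.
    """
    if not paths:
        return 0.0
    costs = [sum(menu_sizes) + docs_at_node
             for menu_sizes, docs_at_node in paths]
    return sum(costs) / len(costs)

# Toy paths, not results from the talk.
paths = [
    ([5], 10),     # one menu of 5 items, then 10 documents
    ([5, 4], 3),   # two menus (5 and 4 items), then 3 documents
]
score = average_path_cost(paths)
```

Under this cost model the two toy paths cost 15 and 12, giving an average of 13.5. Lower is better, matching the slide.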

45
Key Ideas
  • Automatic evaluations have confirmed hypotheses
    for hierarchies created from full text of
    documents
  • User study is necessary for determining how well
    people can use hierarchies
Description:

Endangered. Animals (2910) marine mammals (188) ... animals (42) habitat (283) endangered species (204) mammals (126) listed species (110) ... – PowerPoint PPT presentation

Slides: 36
Provided by: dawnl2
Learn more at: http://www.cs.loyola.edu