SIMS 290-2: Applied Natural Language Processing: Marti Hearst

Provided by: cisU
Learn more at: http://www.cis.upenn.edu

1

Measures of association: chi-square test,
mutual information, binomial distribution, and
log-likelihood ratio
  • Lecture 8

2
Experiments in Multidocument summarization
(SNM02)
  • Summarization system based on a range of features
  • Raises issues we have not discussed up to now
  • Non-extractive techniques
  • Ordering of information

3
Lead values feature
  • Lead sentences of news articles can often make
    excellent brief summaries
  • But for multi-document summaries there are
    several first sentences, so it is difficult to
    choose!
  • They are information dense
  • Can we find very informative words based on this
    observation?
  • Used Binomial test to decide if

4
Sample lead words
5
Verb specificity
  • Compare "arrest" with "do" or "be"
  • Often, given subjects are very strongly
    associated with a verb
  • Actors → appear in movies
  • Singers → release an album
  • Compute associations between subject nouns and
    verbs
  • Use mutual association measure
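
One common way to score subject-verb association is pointwise mutual information, PMI(s, v) = log2( Pr(s, v) / (Pr(s) Pr(v)) ). The sketch below assumes that this is the kind of measure meant here; the function name `pmi_table` and the toy (subject, verb) pairs are made up for illustration and are not data from the lecture.

```python
import math
from collections import Counter


def pmi_table(pairs):
    """Return {(subject, verb): PMI in bits} for a list of (subject, verb) pairs."""
    pair_counts = Counter(pairs)
    subj_counts = Counter(s for s, _ in pairs)
    verb_counts = Counter(v for _, v in pairs)
    total = len(pairs)

    scores = {}
    for (s, v), n_sv in pair_counts.items():
        p_sv = n_sv / total            # joint probability estimate
        p_s = subj_counts[s] / total   # marginal for the subject
        p_v = verb_counts[v] / total   # marginal for the verb
        scores[(s, v)] = math.log2(p_sv / (p_s * p_v))
    return scores


# Hypothetical toy data: each tuple is one observed (subject noun, verb) pair.
pairs = [
    ("actor", "appear"), ("actor", "appear"), ("actor", "be"),
    ("singer", "release"), ("singer", "release"), ("singer", "be"),
    ("police", "arrest"), ("police", "be"),
]
scores = pmi_table(pairs)
```

In this toy data a specific verb such as "arrest" gets a higher score with its subject than a generic verb such as "be", which is the contrast the slide points at. PMI is known to overrate rare pairs, so real counts would normally be filtered by a minimum frequency.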

6
Concept sets
  • Word frequencies are not that reliable, even
    when stemming is used
  • Synonyms, hypernyms and hyponyms from WordNet

7
Other features
  • Location: A negative value that penalizes
    sentences that appear late in the document.
  • Publication Date: Additional value to the most
    recent documents, on the assumption that users
    will want the most up-to-date information.
  • Target: Indicates the presence of the central
    personage in the document cluster, if one exists.
  • Length: A penalty for sentences that are below a
    minimum (15 words) or above a maximum (30 words).
    Short sentences often require some introduction
    or reference resolution, or else are a kind of
    interjection. Long sentences can cover multiple
    thoughts that are often found elsewhere in the
    document cluster in single sentences.
  • Others: Indicates the presence of any named
    entity, weighted by the frequency of that entity
    across all documents.
  • Pronoun: A negative value on sentences that have
    pronouns at the beginning of the sentence.
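
Three of these features (Length, Location, Pronoun) are simple enough to sketch directly. The 15 and 30 word limits come from the slide; the penalty magnitudes, the linear location penalty, the pronoun list, and all function names are assumptions chosen only for illustration.

```python
# Hypothetical pronoun list; a real system would use a POS tagger instead.
PRONOUNS = {"he", "she", "it", "they", "we", "i", "you", "this", "these", "those"}


def length_penalty(sentence, min_words=15, max_words=30, penalty=-1.0):
    """Penalize sentences shorter than min_words or longer than max_words."""
    n = len(sentence.split())
    return penalty if n < min_words or n > max_words else 0.0


def location_penalty(position, num_sentences, weight=-1.0):
    """Penalize late sentences; position is 0-based within the document.

    Assumption: the penalty grows linearly from 0 (first sentence)
    to `weight` (last sentence).
    """
    if num_sentences <= 1:
        return 0.0
    return weight * position / (num_sentences - 1)


def pronoun_penalty(sentence, penalty=-1.0):
    """Penalize sentences whose first word is a pronoun."""
    words = sentence.split()
    first = words[0].lower().strip(",.;:\"'") if words else ""
    return penalty if first in PRONOUNS else 0.0


def sentence_score(sentence, position, num_sentences):
    """Sum the three feature values (equal weights are an assumption)."""
    return (length_penalty(sentence)
            + location_penalty(position, num_sentences)
            + pronoun_penalty(sentence))
```

A sentence such as "He said so." in the last position of a five-sentence document would collect all three penalties under these assumed weights.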

8
Other issues
  • Sentence ordering
  • How to present the selected information?
  • Even good choices might be hard to understand if
    they are presented in the wrong order
  • Imagine a newspaper article with all sentences
    randomly permuted
  • Noun phrases
  • Depend on the context

9
Extractive summary
10
Partly modified summary
11
Measures of associations
  • For supervised learning, they can help us
    determine which features are predictive of the
    distinctions we want to make
  • Chi-square test from last lecture
  • Words that are likely to appear in the first
    sentence rather than anywhere else
  • Verbs that are strongly associated with a given
    subject
  • A variety of measures are defined in the
    Chapter 5 reading

12
χ² statistic (CHI)
  • χ² statistic (pronounced "kai square")
  • A commonly used method of comparing
    proportions.
  • Measures the lack of independence between a term
    and a category

13
χ² statistic (CHI)
  • Is "jaguar" a good predictor for the "auto"
    class?
  • We want to compare
  • the observed distribution (table below) and
  • the null hypothesis that jaguar and auto are
    independent

                 Term = jaguar   Term ≠ jaguar
  Class = auto         2              500
  Class ≠ auto         3             9500
14
χ² statistic (CHI)
  • Under the null hypothesis (jaguar and auto
    independent), how many co-occurrences of jaguar
    and auto do we expect?
  • If independent, $\Pr(j,a) = \Pr(j) \times \Pr(a)$
  • So there would be $N \times \Pr(j,a)$, i.e.
    $N \times \Pr(j) \times \Pr(a)$, co-occurrences
    of jaguar and auto
  • $\Pr(j) = (2+3)/N$
  • $\Pr(a) = (2+500)/N$
  • $N = 2 + 3 + 500 + 9500 = 10005$
  • $N \times (5/N) \times (502/N) = 2510/N = 2510/10005 \approx 0.25$

                 Term = jaguar   Term ≠ jaguar
  Class = auto         2              500
  Class ≠ auto         3             9500
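
The same calculation can be done for all four cells at once: each expected count is (row total × column total) / N. This is a minimal sketch; the function name `expected_counts` and the nested-list layout of the table are assumptions made for illustration.

```python
def expected_counts(observed):
    """Expected cell counts under independence for a table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    # Under independence, E[i][j] = row_total[i] * col_total[j] / N.
    return [[r * c / n for c in col_totals] for r in row_totals]


# Rows: class = auto, class != auto. Columns: term = jaguar, term != jaguar.
observed = [[2, 500],
            [3, 9500]]
expected = expected_counts(observed)
```

Here `expected[0][0]` is 2510/10005, the value derived by hand for the (jaguar, auto) cell.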
15
χ² statistic (CHI)
Under the null hypothesis (jaguar and auto
independent), how many co-occurrences of jaguar
and auto do we expect?

                 Term = jaguar   Term ≠ jaguar
  Class = auto       2 (0.25)         500
  Class ≠ auto       3               9500

Cells show the observed count f_o, with the expected count f_e in parentheses.
16
χ² statistic (CHI)
Under the null hypothesis (jaguar and auto
independent), how many co-occurrences of jaguar
and auto do we expect?

                 Term = jaguar   Term ≠ jaguar
  Class = auto       2 (0.25)       500 (502)
  Class ≠ auto       3 (4.75)      9500 (9498)

Cells show the observed count f_o, with the expected count f_e in parentheses.
17
χ² statistic (CHI)
χ² is based on $(f_o - f_e)^2 / f_e$ summed over all
table entries: $\chi^2 = \sum (f_o - f_e)^2 / f_e$.
The null hypothesis is rejected with confidence
.999, since 12.9 > 10.83 (the critical value for
.999 confidence).

                 Term = jaguar   Term ≠ jaguar
  Class = auto       2 (0.25)       500 (502)
  Class ≠ auto       3 (4.75)      9500 (9498)

Cells show the observed count f_o, with the expected count f_e in parentheses.
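
The sum over table entries can be checked with a short script. This is a sketch under the assumption that expected counts are taken from the row and column totals; the names `chi_square` and `CRITICAL_999` are illustrative, and 10.83 is the threshold quoted on the slide (one degree of freedom).

```python
def chi_square(observed):
    """Pearson chi-square statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)

    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, f_o in enumerate(row):
            f_e = row_totals[i] * col_totals[j] / n  # expected under independence
            chi2 += (f_o - f_e) ** 2 / f_e
    return chi2


CRITICAL_999 = 10.83  # threshold for .999 confidence, as given on the slide

chi2 = chi_square([[2, 500],
                   [3, 9500]])
reject_independence = chi2 > CRITICAL_999
```

With unrounded expected counts the statistic comes out slightly below the 12.9 quoted on the slide, which uses expected values rounded to two decimals; it is still well above 10.83, so the conclusion is the same.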
18
χ² statistic (CHI)
There is a simpler formula for χ²
A = #(t, c)     C = #(¬t, c)
B = #(t, ¬c)    D = #(¬t, ¬c)
N = A + B + C + D
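
Assuming the formula meant here is the standard closed form for a 2×2 table, $\chi^2(t,c) = \frac{N (AD - CB)^2}{(A+C)(B+D)(A+B)(C+D)}$, it can be evaluated directly from the four counts. The function name `chi_square_2x2` is made up for this sketch; the counts are those of the jaguar/auto example, with t = jaguar and c = auto.

```python
def chi_square_2x2(a, b, c, d):
    """Closed-form chi-square for a 2x2 table.

    a = #(t, c), b = #(t, not c), c = #(not t, c), d = #(not t, not c)
    """
    n = a + b + c + d
    return n * (a * d - c * b) ** 2 / ((a + c) * (b + d) * (a + b) * (c + d))


# jaguar/auto example: A = 2, B = 3, C = 500, D = 9500
chi2 = chi_square_2x2(2, 3, 500, 9500)
```

This gives the same value as summing $(f_o - f_e)^2 / f_e$ over the four cells with unrounded expected counts, without computing the expected counts explicitly.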
19
Finding translation equivalents
20
Binomial distribution
k = number of successes, n = number of
trials, x = probability of success
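
With these symbols, the standard binomial probability of exactly k successes is $b(k; n, x) = \binom{n}{k} x^k (1-x)^{n-k}$. The sketch below assumes that this is the formula intended; the function names are illustrative, and the upper-tail sum is one way such a probability could be turned into a test (for example, whether a word appears in lead sentences more often than its overall rate predicts).

```python
import math


def binomial_pmf(k, n, x):
    """Probability of exactly k successes in n independent trials with success probability x."""
    return math.comb(n, k) * x ** k * (1 - x) ** (n - k)


def binomial_upper_tail(k, n, x):
    """Probability of at least k successes in n trials."""
    return sum(binomial_pmf(i, n, x) for i in range(k, n + 1))
```

For example, `binomial_pmf(2, 4, 0.5)` is 6 × 0.5⁴ = 0.375. A small upper-tail probability for the observed count suggests the count is higher than chance alone would explain.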
21
Log likelihood ratio test
22
Log likelihood ratio test