SIMS 290-2: Applied Natural Language Processing: Marti Hearst



Slides: 23

Provided by: cisU

Learn more at: http://www.cis.upenn.edu


Transcript and Presenter's Notes


1

Measures of association: chi-square test,
mutual information, binomial distribution and
log-likelihood ratio

Lecture 8

2
Experiments in Multidocument summarization
(SNM02)

Summarization system based on a range of features
Raises issues we have not discussed up to now
Non-extractive techniques
Ordering of information

3
Lead values feature

Lead sentences of news articles can often make
excellent brief summaries
But for multi-document summaries there are
several first sentences, so difficult to choose!
They are information dense
Can we find very informative words based on this
observation?
Used Binomial test to decide if

4
Sample lead words
5
Verb specificity

Compare arrest with do or be
Given subjects are often very strongly associated
with a verb
Actors → appear in movies
Singers → release an album
Compute associations between subject nouns and
verbs
Use a mutual information measure of association
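
One way to compute such subject-verb associations is pointwise mutual information. The sketch below is only an illustration: the (subject, verb) pairs are made-up toy data, and the function name `pmi` is hypothetical. The slides do not say which exact variant of mutual information the system used.

```python
import math
from collections import Counter

# Hypothetical (subject, verb) pairs; a real system would extract
# these from parsed text. The counts are invented for illustration.
pairs = [
    ("actor", "appear"), ("actor", "appear"), ("actor", "be"),
    ("singer", "release"), ("singer", "release"), ("singer", "be"),
    ("police", "arrest"), ("police", "be"),
]

pair_counts = Counter(pairs)
subject_counts = Counter(s for s, _ in pairs)
verb_counts = Counter(v for _, v in pairs)
total = len(pairs)


def pmi(subject, verb):
    """Pointwise mutual information log2(P(s,v) / (P(s) * P(v)))."""
    joint = pair_counts[(subject, verb)] / total
    if joint == 0:
        return float("-inf")  # the pair was never observed
    p_s = subject_counts[subject] / total
    p_v = verb_counts[verb] / total
    return math.log2(joint / (p_s * p_v))


# Print every observed pair with its association score.
for s, v in sorted(pair_counts):
    print(f"{s:8s} {v:8s} {pmi(s, v):+.2f}")
```

On this toy data a specific verb such as "appear" scores higher with "actor" than the generic verb "be" does, which is the contrast the slide draws between arrest and do or be.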

6
Concept sets

Word frequencies are not that reliable, even
when stemming is used
Synonyms, hypernyms and hyponyms from WordNet

7
Other features

Location: A negative value that penalizes
  sentences that appear late in the document.
Publication Date: Additional value for the most
  recent documents, on the assumption that users
  will want the most up-to-date information.
Target: Indicates the presence of the central
  personage in the document cluster, if one exists.
Length: A penalty for sentences that are below a
  minimum (15 words) or above a maximum (30 words).
  Short sentences often require some introduction
  or reference resolution, or else are a kind of
  interjection. Long sentences can cover multiple
  thoughts that are often found elsewhere in the
  document cluster in single sentences.
Others: Indicates the presence of any named
  entity, weighted by the frequency of that entity
  across all documents.
Pronoun: A negative value on sentences that have
  pronouns at the beginning of the sentence.
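
As a rough illustration of how the Location, Length and Pronoun features could be combined into a sentence score, here is a minimal sketch. Only the 15 and 30 word thresholds come from the slide. The weights, the pronoun list, the linear location penalty and the function name `sentence_penalty` are all assumptions made for this example, not the system's actual values.

```python
# Hypothetical pronoun list; the slides do not specify one.
PRONOUNS = {"he", "she", "it", "they", "we", "his", "her", "their"}
MIN_WORDS, MAX_WORDS = 15, 30  # thresholds quoted on the slide


def sentence_penalty(sentence, position, n_sentences,
                     w_location=1.0, w_length=1.0, w_pronoun=1.0):
    """Return a non-positive score; the weights are assumed, not sourced."""
    words = sentence.split()
    penalty = 0.0
    # Location: later sentences get a larger penalty (assumed linear).
    penalty -= w_location * position / max(n_sentences - 1, 1)
    # Length: penalize sentences outside the 15 to 30 word range.
    if len(words) < MIN_WORDS or len(words) > MAX_WORDS:
        penalty -= w_length
    # Pronoun: penalize a sentence-initial pronoun.
    if words and words[0].lower().strip(",.;:") in PRONOUNS:
        penalty -= w_pronoun
    return penalty


print(sentence_penalty("He left.", position=0, n_sentences=10))
```

The short example sentence is penalized twice, once for its length and once for the leading pronoun, while its position at the start of the document adds no location penalty.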

8
Other issues

Sentence ordering
How to present the selected information?
Even good choices might be hard to understand if
they are presented in the wrong order
Imagine a newspaper article with all sentences
randomly permuted
Noun phrases
Depend on the context

9
Extractive summary
10
Partly modified summary
11
Measures of associations

For supervised learning, they can help us
determine which features are predictive of the
distinctions we want to make
Chi square test from last lecture
Words that are likely to appear in the first
sentence rather than anywhere else
Verbs that are strongly associated with a given
subject
A variety of measures are defined in the
Chapter 5 reading

12
χ² statistic (CHI)

χ² statistic (pronounced kai square)
A commonly used method of comparing
proportions.
Measures the lack of independence between a term
and a category

13
χ² statistic (CHI)

Is 'jaguar' a good predictor for the 'auto'
class?
We want to compare
the observed distribution below and
the null hypothesis that 'jaguar' and 'auto' are
independent

               Term = jaguar   Term ≠ jaguar
Class = auto         2              500
Class ≠ auto         3             9500
14
χ² statistic (CHI)

Under the null hypothesis ('jaguar' and 'auto'
independent): how many co-occurrences of 'jaguar'
and 'auto' do we expect?
If independent: Pr(j,a) = Pr(j) × Pr(a)
So, there would be N × Pr(j,a), i.e. N × Pr(j) ×
Pr(a) occurrences of 'jaguar' in the 'auto' class
Pr(j) = (2+3)/N
Pr(a) = (2+500)/N
N = 2+3+500+9500 = 10005
N × (5/N) × (502/N) = 2510/N = 2510/10005 ≈ 0.25

               Term = jaguar   Term ≠ jaguar
Class = auto         2              500
Class ≠ auto         3             9500
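
The same calculation applies to every cell of the table: the expected count is (row total × column total) / N. The sketch below uses the observed counts from the slide. The dictionary keys and variable names are chosen here for illustration only.

```python
# Observed counts from the jaguar/auto table, keyed by (class, term).
# "other" stands for any term that is not 'jaguar'.
observed = {
    ("auto", "jaguar"): 2,
    ("auto", "other"): 500,
    ("not_auto", "jaguar"): 3,
    ("not_auto", "other"): 9500,
}

n = sum(observed.values())

# Marginal totals per class (row) and per term (column).
row_totals, col_totals = {}, {}
for (cls, term), count in observed.items():
    row_totals[cls] = row_totals.get(cls, 0) + count
    col_totals[term] = col_totals.get(term, 0) + count

# Expected count under independence: row total * column total / N.
expected = {
    (cls, term): row_totals[cls] * col_totals[term] / n
    for (cls, term) in observed
}

for cell, value in expected.items():
    print(cell, round(value, 2))
```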
15
χ² statistic (CHI)
Under the null hypothesis ('jaguar' and 'auto'
independent): how many co-occurrences of 'jaguar'
and 'auto' do we expect?
               Term = jaguar   Term ≠ jaguar
Class = auto     2 (0.25)          500
Class ≠ auto     3                9500
Observed fo, with expected fe in parentheses
16
χ² statistic (CHI)
Under the null hypothesis ('jaguar' and 'auto'
independent): how many co-occurrences of 'jaguar'
and 'auto' do we expect?
               Term = jaguar   Term ≠ jaguar
Class = auto     2 (0.25)        500 (502)
Class ≠ auto     3 (4.75)       9500 (9498)
Observed fo, with expected fe in parentheses
17
χ² statistic (CHI)
χ² is interested in (fo − fe)²/fe summed over all
table entries. The null hypothesis is rejected
with confidence .999, since 12.9 > 10.83 (the
value for .999 confidence).
               Term = jaguar   Term ≠ jaguar
Class = auto     2 (0.25)        500 (502)
Class ≠ auto     3 (4.75)       9500 (9498)
Observed fo, with expected fe in parentheses
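
The sum over table entries can be checked with a few lines of code. This is a sketch using the observed counts from the slide; the variable names are illustrative, and the critical value 10.83 is the one quoted above.

```python
# Observed counts: rows are class (auto, not auto),
# columns are term (jaguar, not jaguar).
observed = [[2, 500], [3, 9500]]

n = sum(map(sum, observed))
row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]

chi2 = 0.0
for i, row in enumerate(observed):
    for j, f_o in enumerate(row):
        # Expected count under independence for this cell.
        f_e = row_totals[i] * col_totals[j] / n
        chi2 += (f_o - f_e) ** 2 / f_e

CRITICAL_999 = 10.83  # value quoted on the slide for .999 confidence
print(f"chi2 = {chi2:.2f}, reject independence: {chi2 > CRITICAL_999}")
```

Almost all of the statistic comes from the (auto, jaguar) cell, where 2 co-occurrences are observed against roughly 0.25 expected.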
18
χ² statistic (CHI)
There is a simpler formula for χ²
A = #(t, c)     C = #(¬t, c)
B = #(t, ¬c)    D = #(¬t, ¬c)
N = A + B + C + D
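
The formula itself did not survive in this transcript. A standard closed form for a 2×2 table with this cell labeling is N(AD − CB)² / ((A+C)(B+D)(A+B)(C+D)). The sketch below assumes that this is the formula the slide refers to; the function name `chi2_2x2` is hypothetical.

```python
def chi2_2x2(a, b, c, d):
    """Chi-square for a 2x2 table, with a = #(t, c), b = #(t, not c),
    c = #(not t, c), d = #(not t, not c)."""
    n = a + b + c + d
    numerator = n * (a * d - c * b) ** 2
    denominator = (a + c) * (b + d) * (a + b) * (c + d)
    return numerator / denominator


# jaguar/auto example: A = 2, B = 3, C = 500, D = 9500.
print(round(chi2_2x2(2, 3, 500, 9500), 2))
```

For the jaguar/auto counts this gives the same value as summing (fo − fe)²/fe over the four cells, without computing the expected counts explicitly.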
19
Finding translation equivalents
20
Binomial distribution
k = number of successes, n = number of
trials, x = probability of success
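
With these symbols, the binomial probability of exactly k successes in n trials is C(n, k) · x^k · (1 − x)^(n − k). The sketch below computes it, together with an upper-tail probability of the kind a one-sided binomial test would use. The function names are illustrative, and the slide's own formula is not preserved in the transcript.

```python
import math


def binomial_pmf(k, n, x):
    """Probability of exactly k successes in n independent trials,
    each succeeding with probability x."""
    return math.comb(n, k) * x**k * (1 - x) ** (n - k)


def binomial_upper_tail(k, n, x):
    """Probability of k or more successes in n trials."""
    return sum(binomial_pmf(i, n, x) for i in range(k, n + 1))


# Example with made-up numbers: 2 successes in 4 fair trials.
print(binomial_pmf(2, 4, 0.5))
```

A small upper-tail probability means that seeing k or more successes would be surprising if the true success probability were x.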
21
Log likelihood ratio test
22
Log likelihood ratio test
