Title: SIMS 290-2: Applied Natural Language Processing: Marti Hearst
1 Measures of association: chi-square test, mutual information, binomial distribution and log likelihood ratio
2 Experiments in Multidocument summarization (SNM02)
- Summarization system based on a range of features
- Raises issues we have not discussed up to now
- Non-extractive techniques
- Ordering of information
3 Lead values feature
- Lead sentences of news articles can often make excellent brief summaries
- But for multi-document summaries there are several first sentences, so it is difficult to choose!
- They are information dense
- Can we find very informative words based on this observation?
- Used Binomial test to decide if
4 Sample lead words
5 Verb specificity
- Compare arrest with do or be
- Often given subjects are very strongly associated with a verb
- Actors → appear in movies
- Singers → release an album
- Compute associations between subject nouns and verbs
- Use mutual association measure
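
A minimal sketch of scoring subject-verb associations, assuming the "mutual association measure" is pointwise mutual information (one common choice; the slide does not say which measure was used). The function name pmi_table and the toy pairs are invented for illustration and are not data from the lecture.

```python
import math
from collections import Counter

def pmi_table(pairs):
    """Pointwise mutual information for each observed (subject, verb) pair."""
    pair_counts = Counter(pairs)
    subj_counts = Counter(s for s, _ in pairs)
    verb_counts = Counter(v for _, v in pairs)
    n = len(pairs)
    scores = {}
    for (s, v), c in pair_counts.items():
        # PMI = log2( P(s, v) / (P(s) * P(v)) )
        p_joint = c / n
        p_indep = (subj_counts[s] / n) * (verb_counts[v] / n)
        scores[(s, v)] = math.log2(p_joint / p_indep)
    return scores

# Toy, made-up (subject, verb) observations.
pairs = [
    ("actor", "appear"), ("actor", "appear"), ("actor", "be"),
    ("singer", "release"), ("singer", "release"), ("singer", "be"),
    ("police", "arrest"), ("police", "be"),
]
scores = pmi_table(pairs)
```

On this toy data a specific verb such as arrest scores higher with its subject than a general verb such as be, which is the contrast the slide draws.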
6 Concept sets
- Word frequencies are not that reliable, even when stemming is used
- Synonyms, hypernyms and hyponyms from WordNet
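
A rough sketch of counting concepts instead of surface words. The CONCEPTS mapping below is a hypothetical, hand-made stand-in; a real system would presumably build such sets from WordNet synonym, hypernym and hyponym links, which this example does not do.

```python
from collections import Counter

# Hypothetical hand-made concept sets (assumption, not from the lecture).
CONCEPTS = {
    "vehicle": {"car", "automobile", "auto", "truck", "vehicle"},
    "police": {"police", "officer", "cop"},
}

def concept_counts(tokens, concepts=CONCEPTS):
    """Count tokens per concept; words outside every set count as themselves."""
    word_to_concept = {w: name for name, words in concepts.items() for w in words}
    return Counter(word_to_concept.get(t.lower(), t.lower()) for t in tokens)

text = "The cop stopped the car and the officer searched the automobile"
counts = concept_counts(text.split())
```

Here cop and officer are pooled into one count, as are car and automobile, so a concept mentioned with varied vocabulary is not undercounted.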
7 Other features
- Location: A negative value that penalizes sentences that appear late in the document.
- Publication Date: Additional value to the most recent documents, on the assumption that users will want the most up-to-date information.
- Target: Indicates the presence of the central personage in the document cluster, if one exists.
- Length: A penalty for sentences that are below a minimum (15 words) or above a maximum (30 words). Short sentences often require some introduction or reference resolution, or else are a kind of interjection. Long sentences can cover multiple thoughts that are often found elsewhere in the document cluster in single sentences.
- Others: Indicates the presence of any named entity, weighted by the frequency of that entity across all documents.
- Pronoun: A negative value on sentences that have pronouns at the beginning of the sentence.
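
A small sketch of how three of these features (Location, Length, Pronoun) might be turned into numbers. Only the 15 and 30 word limits come from the slide; the function names, the pronoun list, the penalty sizes and the simple sum are all assumptions made for illustration.

```python
# Illustrative pronoun list (assumption, deliberately incomplete).
PRONOUNS = {"he", "she", "it", "they", "this", "these", "his", "her", "their", "its"}

def feature_scores(sentence, position, n_sentences, min_len=15, max_len=30):
    """Return illustrative per-feature scores for one sentence.

    position is the 0-based index of the sentence in its document.
    All weights are invented for this example.
    """
    words = sentence.split()
    scores = {}
    # Location: later sentences receive a more negative value.
    scores["location"] = -position / max(n_sentences - 1, 1)
    # Length: penalize sentences outside the 15-30 word window.
    scores["length"] = 0.0 if min_len <= len(words) <= max_len else -1.0
    # Pronoun: penalize a pronoun at the start of the sentence.
    first = words[0].lower().strip(",.;:") if words else ""
    scores["pronoun"] = -1.0 if first in PRONOUNS else 0.0
    return scores

def total_score(sentence, position, n_sentences):
    """Sum the feature scores (equal weights, an arbitrary choice)."""
    return sum(feature_scores(sentence, position, n_sentences).values())
```

For example, total_score("He said so.", 3, 4) is penalized on all three features, while a 16-word lead sentence that starts with a noun receives no penalty.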
8 Other issues
- Sentence ordering
- How to present the selected information?
- Even good choices might be hard to understand if they are presented in the wrong order
- Imagine a newspaper article with all sentences randomly permuted
- Noun phrases
- Depend on the context
9 Extractive summary
10 Partly modified summary
11 Measures of associations
- For supervised learning, they can help us determine which features are predictive of the distinctions we want to make
- Chi square test from last lecture
- Words that are likely to appear in the first sentence rather than anywhere else
- Verbs that are strongly associated with a given subject
- A variety of measures are defined in the Chapter 5 reading
12 χ² statistic (CHI)
- χ² statistic (pronounced kai square)
- A commonly used method of comparing proportions.
- Measures the lack of independence between a term and a category
13 χ² statistic (CHI)
- Is jaguar a good predictor for the auto class?
- We want to compare
- the observed distribution (table below) and
- null hypothesis that jaguar and auto are independent

               Term = jaguar   Term ≠ jaguar
Class = auto   2               500
Class ≠ auto   3               9500
14 χ² statistic (CHI)
- Under the null hypothesis (jaguar and auto independent): How many co-occurrences of jaguar and auto do we expect?
- If independent: Pr(j,a) = Pr(j) × Pr(a)
- So, there would be N × Pr(j,a), i.e. N × Pr(j) × Pr(a) co-occurrences of jaguar and auto
- Pr(j) = (2+3)/N
- Pr(a) = (2+500)/N
- N = 2+3+500+9500
- N × (5/N) × (502/N) = 2510/N = 2510/10005 ≈ 0.25

               Term = jaguar   Term ≠ jaguar
Class = auto   2               500
Class ≠ auto   3               9500
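
The expected-count arithmetic can be sketched in a few lines. The counts are the ones in the jaguar/auto table; the function name expected_counts is an illustrative choice, and the same row total × column total / N rule is applied to every cell, not only to the jaguar-and-auto cell worked out on the slide.

```python
# Observed counts from the jaguar/auto table.
observed = [
    [2, 500],    # class is auto:     jaguar, not jaguar
    [3, 9500],   # class is not auto: jaguar, not jaguar
]

def expected_counts(table):
    """Expected cell counts under independence: row_total * col_total / N."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    return [[r * c / n for c in col_totals] for r in row_totals]

expected = expected_counts(observed)
# expected[0][0] is 502 * 5 / 10005, which rounds to 0.25
```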
15 χ² statistic (CHI)
Under the null hypothesis (jaguar and auto independent): How many co-occurrences of jaguar and auto do we expect?

               Term = jaguar   Term ≠ jaguar
Class = auto   2 (0.25)        500
Class ≠ auto   3               9500

Each cell shows the observed count fo, with the expected count fe in parentheses.
16 χ² statistic (CHI)
Under the null hypothesis (jaguar and auto independent): How many co-occurrences of jaguar and auto do we expect?

               Term = jaguar   Term ≠ jaguar
Class = auto   2 (0.25)        500 (502)
Class ≠ auto   3 (4.75)        9500 (9498)

Each cell shows the observed count fo, with the expected count fe in parentheses.
17 χ² statistic (CHI)
χ² is interested in (fo - fe)²/fe summed over all table entries: χ² = Σ (fo - fe)² / fe
The null hypothesis is rejected with confidence .999, since 12.9 > 10.83 (the value for .999 confidence).

               Term = jaguar   Term ≠ jaguar
Class = auto   2 (0.25)        500 (502)
Class ≠ auto   3 (4.75)        9500 (9498)

Each cell shows the observed count fo, with the expected count fe in parentheses.
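
As a check on the 12.9 figure, the sum can be sketched directly from the observed counts and the rounded expected counts shown in the table. The variable names are illustrative; the critical value 10.83 is the one quoted on the slide.

```python
# Observed (fo) and rounded expected (fe) counts, cell by cell.
observed = [2, 500, 3, 9500]
expected = [0.25, 502, 4.75, 9498]

# chi-square = sum over cells of (fo - fe)^2 / fe
chi2 = sum((fo - fe) ** 2 / fe for fo, fe in zip(observed, expected))

CRITICAL_999 = 10.83  # value quoted on the slide for .999 confidence
reject_independence = chi2 > CRITICAL_999
```

Almost all of the total comes from the jaguar-and-auto cell, where 2 was observed against an expectation of 0.25.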
18 χ² statistic (CHI)
There is a simpler formula for χ²
A = #(t,c)     C = #(¬t,c)
B = #(t,¬c)    D = #(¬t,¬c)
N = A + B + C + D
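
A sketch of the shortcut, assuming the slide refers to the standard closed form for a 2×2 table, χ² = N (AD - CB)² / ((A+C)(B+D)(A+B)(C+D)). The function name chi2_2x2 is hypothetical; the counts plugged in are the jaguar/auto counts, with t = jaguar and c = auto.

```python
def chi2_2x2(a, b, c, d):
    """Chi-square for a 2x2 table from raw counts.

    a: term and class together     c: class without term
    b: term without class          d: neither term nor class
    """
    n = a + b + c + d
    return n * (a * d - c * b) ** 2 / ((a + c) * (b + d) * (a + b) * (c + d))

# jaguar/auto counts: A = 2, B = 3, C = 500, D = 9500
value = chi2_2x2(a=2, b=3, c=500, d=9500)
```

Because this form uses unrounded expected counts implicitly, the result is close to, but not exactly, the 12.9 obtained from the rounded table values; it is still above 10.83.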
19 Finding translation equivalents
20 Binomial distribution
k = number of successes, n = number of trials, x = probability of success
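
A minimal sketch of the binomial probability using the slide's symbols, assuming the usual definition b(k; n, x) = C(n, k) x^k (1 - x)^(n - k). The function names are illustrative. The tail sum is included because a one-sided binomial test is typically based on the probability of seeing at least k successes; how the lecture's system applied the test is not stated here.

```python
from math import comb

def binomial_pmf(k, n, x):
    """Probability of exactly k successes in n trials with success probability x."""
    return comb(n, k) * x ** k * (1 - x) ** (n - k)

def binomial_tail(k, n, x):
    """Probability of at least k successes in n trials."""
    return sum(binomial_pmf(i, n, x) for i in range(k, n + 1))

# Example: exactly 2 heads in 4 fair coin flips.
p_two_of_four = binomial_pmf(2, 4, 0.5)
```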
21 Log likelihood ratio test
22 Log likelihood ratio test