
Lecture 5: Collocations (Chapter 5 of Manning and Schutze)

- Wen-Hsiang Lu
- Department of Computer Science and Information Engineering, National Cheng Kung University
- 2008/10/13
- (Slides from Dr. Mary P. Harper, http://min.ecn.purdue.edu/ee669/)

Homework upload site for the NLP course: http://moodle.ncku.edu.tw/course/view.php?id=2960

What is a Word?

- This is harder to pin down than one might imagine. In general we talk about word forms; a word form (in written English) is a particular configuration of letters wherever it occurs. Each individual occurrence of a word form is called a token.
- As an example, take 'water'. There are related word forms: water, waters, watered, watering, watery. This set of word forms is called a lemma.
- Some argue that the best way to find out how water is used is by studying many cases of actual use and abandoning prejudices about generalities such as syntactic categories.

Definition of Collocation (wrt Corpus Literature)

- A collocation is defined as a sequence of two or more consecutive words that has characteristics of a syntactic and semantic unit, and whose exact and unambiguous meaning or connotation cannot be derived directly from the meaning or connotation of its components (Choueka, 1988).

Word Collocations

- It is believed that children learn word forms like water as distinct items. Only in school do they begin to recognize nouns and verbs, after learning formal rules of language.
- Many units of meaning extend over several words, for example kick the bucket, whose meaning cannot be derived from its parts.
- Many of the sentences we utter can be thought of as having a substantial component of frequently used phrases. Again, this can be thought of in terms of collocations, where these groups of words reflect frequently encountered contexts in the language community.

Word Collocations

- Many of the ways that, say, water can be used are subtly linked to very specific situations, and to other words. An example:
  - His mouth watered.
  - His eyes watered.
- But this paradigm doesn't extend to watering.
  - The roast was mouth-watering.
  - The smokey nightclub was eye-watering.
- Special relationships between words which tend to be used together (or not) are called collocational constraints.

Word Collocations

- Collocation
  - Firth: a word is characterized by the company it keeps; collocations of a given word are statements of the habitual or customary places of that word.
- non-compositionality of meaning
  - cannot be derived directly from its parts (heavy rain)
- non-substitutability in context
  - for parts (red light)
- non-modifiability (non-transformability)
  - kick the yellow bucket; take exceptions to

Association and Co-occurrence; Terms

- Considerable overlap between the concepts of collocation and (in technical domains) term, technical term, and terminological phrase. (Term in IR refers to both words and phrases.)
- Terms appear together or in the same (or similar) context:
  - (doctors, nurses)
  - (hardware, software)
  - (gas, fuel)
  - (hammer, nail)
  - (communism, free speech)
- Collocations sometimes reflect attitudes (e.g., towards different types of substances: strong cigarettes, tea, coffee versus powerful drug (e.g., heroin)).

Linguistic Subclasses of Collocations

- Light verbs: verbs with little semantic content, like make, take, do
- Terminological expressions: concepts and objects in technical domains (e.g., hard drive)
- Idioms: fixed phrases
  - kick the bucket, birds-of-a-feather, run for office
- Proper names: difficult to recognize even with lists
  - Tuesday (person's name), May, Winston Churchill, IBM, Inc.
- Numerical expressions
  - containing ordinary words
  - Monday Oct 04 1999, two thousand seven hundred fifty
- Verb particle constructions or phrasal verbs
  - Separable parts
  - look up, take off, tell off

Motivation

- Tasks where words and the company they keep are important:
  - word sense disambiguation (MT, IR, IE)
  - lexical entries: subdivision and definitions (lexicography)
  - language modeling (generalization, smoothing)
  - word/phrase/term translation (MT, multilingual IR)
  - NL generation (natural phrases) (generation, MT)
  - parsing (lexically-based selectional preferences)

Collocations

- Collocations are not necessarily adjacent.
- Collocations cannot be directly translated into other languages.
- It may be better to use the term collocation in the narrower sense of grammatically bound elements that occur in a particular order, and use association or co-occurrence for words that appear together in context.

Other Word Relations

- Synonymy: different form/word, same meaning
  - notebook / laptop
- Antonymy: opposite meaning
  - new/old, black/white, start/stop
- Homonymy: same form/word, different meaning
  - true (random, unrelated): can (aux. verb / can of Coke)
  - related (polysemy): notebook, shift, grade, ...
- Other
  - Hyperonymy/Hyponymy: general vs. specific (vehicle/car)
  - Meronymy/Holonymy: part vs. whole (leg/body)

Collocation Exploration

- Computers can be used to study collocation in large text corpora. Typically this means selecting a word form type which one wants to study, which will serve as the node. Each occurrence of the type is a token. We then select a span within which we want to study co-occurrence of other words. Most of the significant relationships are within a span of plus or minus four.
- Whenever a token of our node word occurs, we tally each of the tokens of other words which occur within its span. Where there are many occurrences of the node word, a statistical profile of that word's collocates starts to emerge.
- This works a lot better for content words than it does for function words.
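
The tallying procedure described above can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the corpus is already a list of lowercase tokens, and the function name `collocate_counts` and the toy sentence are made up for this example, not taken from the lecture.

```python
from collections import Counter

def collocate_counts(tokens, node, span=4):
    """Tally every token occurring within `span` positions of each token of `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        lo = max(0, i - span)                 # left edge of the window
        hi = min(len(tokens), i + span + 1)   # right edge (exclusive)
        for j in range(lo, hi):
            if j != i:                        # skip the node token itself
                counts[tokens[j]] += 1
    return counts

# Toy corpus (hypothetical), node word "watered", span of plus or minus 2
tokens = "his mouth watered and his eyes watered".split()
profile = collocate_counts(tokens, "watered", span=2)
print(profile.most_common())
```

With a real corpus the default span of 4 mirrors the "plus or minus four" rule of thumb; function words will tend to dominate the raw counts, which is why the later slides filter or test the candidates.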

Overview of the Collocation Detection Techniques Surveyed

- Selection of collocations by frequency
- Selection of collocations based on the mean and variance of the distance between focal word and collocating word
- Hypothesis testing
- Pointwise mutual information

Using Frequency to Hunt for Collocations

- The most frequent n-grams are not always collocations: many involve function words or are common names.
- Simple heuristic methods help to improve the collocation yield of the n-grams.
- Use knowledge of stop words: words/forms that cannot alone make up a collocation
  - a, the, and, or, but, not, ...
- Use part-of-speech patterns to filter the n-grams (Justeson and Katz, 1995)
  - Adj Noun (cold feet)
  - Noun Noun (oil prices)
  - Noun Preposition Noun (out of sight)
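
A minimal sketch of the stop-word heuristic, assuming a pre-tokenized lowercase corpus. The function name `candidate_bigrams` and the toy sentence are hypothetical, and the rule used here (drop any bigram that contains a stop word) is one simple reading of the heuristic. The part-of-speech filter is not shown because it needs a tagger.

```python
from collections import Counter

# Stop words listed on the slide; a real list would be longer.
STOP_WORDS = {"a", "the", "and", "or", "but", "not"}

def candidate_bigrams(tokens, stop_words=STOP_WORDS):
    """Count adjacent word pairs, dropping any pair that contains a stop word."""
    bigrams = zip(tokens, tokens[1:])
    return Counter(
        (w1, w2) for w1, w2 in bigrams
        if w1 not in stop_words and w2 not in stop_words
    )

# Toy corpus (hypothetical)
tokens = "the oil prices and the oil prices rose but not the cold feet".split()
counts = candidate_bigrams(tokens)
print(counts.most_common(3))
```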

Mean and Variance (Smadja et al., 1993)

- Frequency-based search works well for fixed phrases. However, many collocations consist of two words in more flexible (although regular) relationships. For example:
  - Knock and door may not occur at a fixed distance from each other.
- One method of detecting these flexible relationships uses the mean and variance of the offset (signed distance) between the two words in the corpus.
- If the offsets are randomly distributed (i.e., no collocation), then the variance will be high (and the means close to zero, as would be the case for a uniform distribution).

Mean, Sample Variance, and Standard Deviation
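
The formulas on this slide did not survive in the transcript. A reconstruction under the usual definitions, writing $d_i$ for the $i$-th observed offset and $n$ for the number of observations (both symbols are assumptions, not taken from the slide), would be:

```latex
\bar{d} = \frac{1}{n}\sum_{i=1}^{n} d_i, \qquad
s^2 = \frac{\sum_{i=1}^{n} (d_i - \bar{d})^2}{n - 1}, \qquad
s = \sqrt{s^2}
```

The $n - 1$ denominator matches the $(4 - 1)$ used in the knock and door example.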

Example: Knock and Door

- She knocked on his door.
- They knocked at the door.
- 100 women knocked on the big red door.
- A man knocked on the metal front door.
- Average offset between knock and door:
  - $(3 + 3 + 5 + 5) / 4 = 4$
- Variance:
  - $\frac{(3-4)^2 + (3-4)^2 + (5-4)^2 + (5-4)^2}{4-1} = \frac{4}{3}$
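
The same numbers can be recomputed from the four sentences. This is a sketch that assumes a simple regular-expression tokenizer and matches the surface forms "knocked" and "door" directly; the helper name `offset` is made up for this example.

```python
import re
import statistics

sentences = [
    "She knocked on his door.",
    "They knocked at the door.",
    "100 women knocked on the big red door.",
    "A man knocked on the metal front door.",
]

def offset(sentence, w1="knocked", w2="door"):
    """Signed distance in tokens from the first w1 to the first w2."""
    tokens = re.findall(r"\w+", sentence.lower())
    return tokens.index(w2) - tokens.index(w1)

offsets = [offset(s) for s in sentences]
mean_offset = statistics.mean(offsets)       # arithmetic mean of the offsets
sample_var = statistics.variance(offsets)    # sample variance, n - 1 denominator
sample_std = statistics.stdev(offsets)       # square root of the sample variance
print(offsets, mean_offset, sample_var, sample_std)
```

A low standard deviation relative to the span suggests the two words tend to sit at a fairly stable distance from each other.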

Hypothesis Testing: Overview

- We want to determine whether the co-occurrence is random or whether it occurs more often than chance. This is a classical problem of statistics called hypothesis testing.
- We formulate a null hypothesis H0 (the association occurs by chance, i.e., there is no association between the words). Assuming H0, we calculate the probability that the collocation would occur. If that probability is very low (e.g., p < 0.05), we reject H0 (thus confirming that interesting things are happening!); otherwise we retain it as possible.
- In this case, we assume that two words are not collocations if they occur independently.

Hypothesis Testing: The t test

- The t test looks at the mean and variance of a sample of measurements, where the null hypothesis is that the sample is drawn from a distribution with mean $\mu$.
- The test looks at the difference between the observed and expected means, scaled by the variance of the data, and tells us how likely one is to get a sample of that mean and variance assuming that the sample is drawn from a normal distribution with mean $\mu$.

The Student's t test

- To determine the probability of getting a certain sample, we compute the t statistic $t = \frac{\bar{x} - \mu}{\sqrt{s^2 / N}}$, where $\bar{x}$ is the sample mean, $s^2$ is the sample variance, and $N$ is the sample size, and look up its significance with respect to the normal distribution.

- N ? 30
- $\sigma$ is unknown
- Normal distribution

The t test

- Significance of difference
  - Compare with normal distribution (mean $\mu$)
  - Using real-world data, compute t
  - Find in tables (see Manning and Schutze, p. 609)
    - d.f.: degrees of freedom (parameters which are not determined by other parameters; sample size)
    - percentile level p = 0.05 (or lower)
- The bigger the t statistic:
  - the better the chance that it is an interesting combination (i.e., we can reject the null hypothesis of no association)
  - t: significance level from the t table

The t test on Collocations

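- Null hypothesis: independence
  - mean $\mu = p(w_1)\, p(w_2)$
- Data estimates:
  - $\bar{x}$ = MLE of the joint probability from the data
  - $s^2$ is $p(1-p)$, i.e., almost $p$ for small $p$; $N$ is the data size
- Example: compute the t value for new companies
  - C(new) = 15,828; C(companies) = 4,675; N = 14,307,668
  - H0: $p(\text{new companies}) = \frac{15{,}828}{14{,}307{,}668} \times \frac{4{,}675}{14{,}307{,}668} \approx 3.615 \times 10^{-7}$
  - $\bar{x} = \hat{p}(\text{new companies}) = \frac{8}{14{,}307{,}668} \approx 5.591 \times 10^{-7}$
  - $s^2 = p(1-p) = p - p^2 \approx 5.591 \times 10^{-7}$
  - $t = \frac{5.591 \times 10^{-7} - 3.615 \times 10^{-7}}{\sqrt{5.591 \times 10^{-7} / 14{,}307{,}668}} \approx 0.999932$
  - For $\alpha = 0.05$, a t value of at least 1.645 is needed, so the null hypothesis is not rejected.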
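
The arithmetic in the new companies example can be checked with a short function. This is a sketch using the counts from the slide; the function name `collocation_t` and its parameter names are assumptions made for this illustration.

```python
import math

def collocation_t(c1, c2, c12, n):
    """t statistic for a bigram w1 w2 under the independence null hypothesis.

    c1, c2: counts of w1 and w2; c12: count of the bigram; n: corpus size.
    """
    x_bar = c12 / n                  # MLE of the joint probability
    mu = (c1 / n) * (c2 / n)         # expected probability under independence
    s2 = x_bar * (1 - x_bar)         # Bernoulli variance, close to x_bar for small p
    return (x_bar - mu) / math.sqrt(s2 / n)

t = collocation_t(c1=15_828, c2=4_675, c12=8, n=14_307_668)
print(round(t, 6))
print("reject H0" if t > 1.645 else "cannot reject H0")
```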

Hypothesis Testing of Differences (Church & Hanks, 1989)

- We may also want to find words whose co-occurrence patterns best distinguish between two words (e.g., strong versus powerful). This application can be useful for lexicography.
- The t test is extended to the comparison of two means, under the assumption that they are normally distributed.
- The null hypothesis is that the average difference is 0.

The t-test for Comparing Two Populations

- This t test compares the means of two normal populations. The variances of the two populations are added, since the variance of the difference of two random variables is the sum of their variances.
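
The formula itself is missing from the transcript. The standard two-sample form consistent with the description above, with $\bar{x}_i$, $s_i^2$, and $n_i$ denoting the mean, variance, and size of sample $i$ (notation assumed here, not taken from the slide), is:

```latex
t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}
```

Under the null hypothesis the expected difference of the means is 0, which is why no $\mu$ term appears in the numerator.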

Collocation Testing

- t values are calculated assuming a Bernoulli distribution, where w is the collocate of interest, v1 and v2 are the words to compare, and we assume that $s^2 \approx p$.
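
Under those assumptions the two-sample statistic $(\bar{x}_1 - \bar{x}_2) / \sqrt{s_1^2/n_1 + s_2^2/n_2}$ simplifies: with $\bar{x}_i \approx s_i^2 \approx C(v_i\, w)/N$ and both samples of size $N$, it reduces to $(C(v_1 w) - C(v_2 w)) / \sqrt{C(v_1 w) + C(v_2 w)}$. A sketch of that simplified form follows; the function name and the counts in the usage line are hypothetical.

```python
import math

def t_difference(count_v1_w, count_v2_w):
    """Simplified t score comparing how strongly w co-occurs with v1 versus v2."""
    total = count_v1_w + count_v2_w
    if total == 0:
        raise ValueError("w never co-occurs with either word")
    return (count_v1_w - count_v2_w) / math.sqrt(total)

# Hypothetical counts: w follows v1 ten times and v2 never
score = t_difference(10, 0)
print(round(score, 4))
```

A large positive score suggests w is characteristic of v1, a large negative score that it is characteristic of v2.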

Pearson's Chi-Square Test

- Use of the t test has been criticized by Church and Mercer (1993) because it assumes that probabilities are approximately normally distributed (not true, generally).
- The Chi-Square test does not make this assumption.
- The essence of the test is to compare observed frequencies with the frequencies expected in the case of independence. If the difference between observed and expected frequencies is large, then we can reject the null hypothesis of independence.
- $\chi^2$ test (general formula): $\chi^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$
  - where $O_{ij}$ and $E_{ij}$ are the observed and expected counts of events $i, j$

Pearson's Chi-Square Test

- Example of a two-outcome event

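Pearson's Chi-Square Test

Under independence, $P(w_1 w_2) = P(w_1)\, P(w_2) = E_{11}/N$, so $E_{11} = P(w_1)\, P(w_2)\, N$.

- The expected frequencies are computed from the marginal probabilities:
  - $E_{11} = \frac{O_{11} + O_{12}}{N} \times \frac{O_{11} + O_{21}}{N} \times N$
  - where $N$ is the number of bigrams
- $\chi^2 = \frac{221097 \times (219243 \times 9 - 75 \times 1770)^2}{1779 \times 84 \times 221013 \times 219318} \approx 103.39 > 7.88$ (the critical value at $\alpha = 0.005$), thus we can reject the independence assumption.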
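
The worked value can be reproduced by applying the general formula to a 2-by-2 table. The contingency table itself is missing from the transcript, so the cell assignment below is an assumption: the slide's numbers imply $O_{11} = 9$ and $O_{22} = 219243$, while 1770 and 75 are placed in $O_{12}$ and $O_{21}$ arbitrarily (the statistic is the same either way). The function name is made up for this sketch.

```python
def chi_square_2x2(o11, o12, o21, o22):
    """Pearson chi-square for a 2x2 table: sum of (O - E)^2 / E over the four cells."""
    n = o11 + o12 + o21 + o22
    row = (o11 + o12, o21 + o22)        # row marginals
    col = (o11 + o21, o12 + o22)        # column marginals
    observed = ((o11, o12), (o21, o22))
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row[i] * col[j] / n   # expected count under independence
            chi2 += (observed[i][j] - expected) ** 2 / expected
    return chi2

# Counts from the slide; the placement of 1770 and 75 is assumed.
chi2 = chi_square_2x2(o11=9, o12=1770, o21=75, o22=219243)
print(round(chi2, 2))
print("reject independence" if chi2 > 7.88 else "cannot reject independence")
```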

Pearson's Chi-Square Applications

- One of the early uses of the Chi-Square test in statistical NLP was the identification of translation pairs in aligned corpora (Church & Gale, 1991).
- A more recent application is to use Chi-Square as a metric for corpus similarity (Kilgarriff and Rose, 1998).
- Note that the Chi-Square test should not be used for small counts.

Likelihood Ratios Within a Single Corpus (Dunning, 1993)

- Likelihood ratios are more appropriate for sparse data than the Chi-Square test. In addition, they are easier to interpret than the Chi-Square statistic.
- In applying the likelihood ratio test to collocation discovery, use the following two alternative explanations for the occurrence frequency of a bigram $w_1 w_2$:
  - H1: The occurrence of $w_2$ is independent of the previous occurrence of $w_1$: $P(w_2 \mid w_1) = P(w_2 \mid \neg w_1) = p$
  - H2: The occurrence of $w_2$ is dependent on the previous occurrence of $w_1$: $p_1 = P(w_2 \mid w_1) \neq P(w_2 \mid \neg w_1) = p_2$

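Likelihood Ratios Within a Single Corpus: Binomial Distribution

- Use the MLE for the probabilities $p$, $p_1$, and $p_2$ and assume the binomial distribution, where $c_1$, $c_2$, and $c_{12}$ are the counts of $w_1$, $w_2$, and the bigram $w_1 w_2$, and $N$ is the number of bigrams:
  - Under H1: $P(w_2 \mid w_1) = c_2/N$, $P(w_2 \mid \neg w_1) = c_2/N$
  - Under H2: $P(w_2 \mid w_1) = c_{12}/c_1 = p_1$, $P(w_2 \mid \neg w_1) = (c_2 - c_{12})/(N - c_1) = p_2$
  - Under H1: $b(c_{12};\, c_1, p)$ gives the probability that $c_{12}$ out of $c_1$ bigrams are $w_1 w_2$, and $b(c_2 - c_{12};\, N - c_1, p)$ gives the probability that $c_2 - c_{12}$ out of $N - c_1$ bigrams are $\neg w_1 w_2$
  - Under H2: $b(c_{12};\, c_1, p_1)$ gives the probability that $c_{12}$ out of $c_1$ bigrams are $w_1 w_2$, and $b(c_2 - c_{12};\, N - c_1, p_2)$ gives the probability that $c_2 - c_{12}$ out of $N - c_1$ bigrams are $\neg w_1 w_2$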

Likelihood Ratios Within a Single Corpus

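- The likelihood of H1 (likelihood of independence):
  - $L(H_1) = b(c_{12};\, c_1, p)\; b(c_2 - c_{12};\, N - c_1, p)$
- The likelihood of H2 (likelihood of dependence):
  - $L(H_2) = b(c_{12};\, c_1, p_1)\; b(c_2 - c_{12};\, N - c_1, p_2)$
- The log of the likelihood ratio:
  - $\log \lambda = \log \frac{L(H_1)}{L(H_2)} = \log b(c_{12};\, c_1, p) + \log b(c_2 - c_{12};\, N - c_1, p) - \log b(c_{12};\, c_1, p_1) - \log b(c_2 - c_{12};\, N - c_1, p_2)$
- The quantity $-2 \log \lambda$ is asymptotically $\chi^2$ distributed, so we can test for significance.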
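
A sketch of the computation, using the MLEs defined on the previous slide. The binomial coefficients are left out because they are identical in numerator and denominator and cancel in the ratio. The function names are made up for this example, and the new companies counts are reused from the t test slide purely as sample input.

```python
import math

def _log_l(k, n, p):
    """log of p^k * (1 - p)^(n - k), treating 0 * log(0) as 0."""
    result = 0.0
    if k > 0:
        result += k * math.log(p)
    if n - k > 0:
        result += (n - k) * math.log(1 - p)
    return result

def neg2_log_lambda(c1, c2, c12, n):
    """-2 log(lambda) for bigram w1 w2; c1, c2, c12 are counts, n the number of bigrams."""
    p = c2 / n                       # H1: one probability for w2
    p1 = c12 / c1                    # H2: P(w2 | w1)
    p2 = (c2 - c12) / (n - c1)       # H2: P(w2 | not w1)
    log_lambda = (
        _log_l(c12, c1, p) + _log_l(c2 - c12, n - c1, p)
        - _log_l(c12, c1, p1) - _log_l(c2 - c12, n - c1, p2)
    )
    return -2 * log_lambda

score = neg2_log_lambda(c1=15_828, c2=4_675, c12=8, n=14_307_668)
print(score)
```

The result can be compared against a $\chi^2$ critical value with one degree of freedom, such as the 7.88 (at 0.005) quoted on the Chi-Square slide.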

Likelihood Ratios II: Between Two or More Corpora (Damerau, 1993)

- Ratios of relative frequencies between two or more different corpora can be used to discover collocations that are characteristic of a corpus when compared to other corpora.
- This approach is most useful for the discovery of subject-specific collocations.
- For example, suppose foo bar occurs 2 times out of 1,232,444 in one corpus and 55 times out of 5,348,212 in another; then the frequency ratio is $(2 / 1{,}232{,}444) / (55 / 5{,}348{,}212) \approx 0.1578$.
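
The ratio in the example is a one-line computation. The sketch below uses a hypothetical function name and the counts from the slide.

```python
def relative_frequency_ratio(count_a, size_a, count_b, size_b):
    """Ratio of a phrase's relative frequency in corpus A to that in corpus B."""
    return (count_a / size_a) / (count_b / size_b)

ratio = relative_frequency_ratio(2, 1_232_444, 55, 5_348_212)
print(round(ratio, 4))
```

A ratio well below 1 indicates the phrase is more characteristic of the second corpus, and a ratio well above 1 that it is more characteristic of the first.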

Pointwise Mutual Information

- An information-theoretic measure for discovering collocations is pointwise mutual information (Church et al., 1989, 1991).
- This is NOT MI as defined in Information Theory
  - (MI is defined over random variables, not values of random variables)
- $I(a, b) = \log_2 \frac{p(a, b)}{p(a)\, p(b)} = \log_2 \frac{p(a \mid b)}{p(a)}$
- Pointwise mutual information is roughly a measure of how much one word tells us about the other.

Pointwise Mutual Information

- Example: $I(\text{true}, \text{species}) = \log_2 \frac{4.1 \times 10^{-5}}{3.8 \times 10^{-4} \times 8.0 \times 10^{-3}} \approx 3.74$
  - measured in bits, but it is difficult to give it an interpretation
  - used for ranking (NOT null hypothesis tests)
- Pointwise mutual information works particularly badly in sparse environments (it favors low-frequency events).
- May not be a good measure of what an interesting correspondence between two events is (Church and Gale, 1995).
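
A sketch of the pointwise mutual information computation, using the three probabilities from the example. The function name `pmi` is an assumption made for this illustration. With the rounded probabilities shown on the slide the result comes out at roughly 3.75 bits, close to the slide's 3.74, which was presumably computed from unrounded estimates.

```python
import math

def pmi(p_ab, p_a, p_b):
    """Pointwise mutual information in bits: log2 of p(a,b) / (p(a) p(b))."""
    return math.log2(p_ab / (p_a * p_b))

score = pmi(p_ab=4.1e-5, p_a=3.8e-4, p_b=8.0e-3)
print(round(score, 2))
```

Because the score grows as the marginal probabilities shrink, rare pairs get inflated values, which is the sparse-data weakness noted above.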
