Search Engine Statistics Beyond the ngram: Application to Noun Compound Bracketing - PowerPoint PPT Presentation

Slides: 48
Provided by: Sar1

Transcript and Presenter's Notes

1
Search Engine Statistics Beyond the n-gram
Application to Noun Compound Bracketing
  • Preslav Nakov and Marti Hearst
  • Computer Science Division and SIMS
  • University of California, Berkeley

Supported by NSF DBI-0317510 and a gift from
Genentech
2
Overview
  • Unsupervised algorithm
  • Applied here to noun compound bracketing, but
    promising for structural ambiguity generally
  • Features
  • n-grams, χ², MI
  • Beyond the n-gram
  • surface features
  • paraphrases
  • State-of-the-art accuracy

3
Noun Compound Bracketing
  • (a) [ [ liver cell ] antibody ] (left bracketing)
  • (b) [ liver [ cell line ] ] (right bracketing)
  • In (a), the antibody targets the liver cell.
  • In (b), the cell line is derived from the liver.

4
Related Work
Pr that w1 precedes w2
  • Marcus (1980), Pustejovsky et al. (1993), Resnik (1993)
  • adjacency model: Pr(w1|w2) vs. Pr(w2|w3)
  • Lauer (1995)
  • dependency model: Pr(w1|w2) vs. Pr(w1|w3)
  • Keller & Lapata (2004)
  • use the Web
  • unigrams and bigrams
  • Girju et al. (2005)
  • supervised model
  • bracketing in context
  • requires WordNet senses to be given
  • This work
  • χ²
  • Web
  • n-grams
  • paraphrases
  • surface features

5
Adjacency & Dependency (1)
  • right bracketing: [w1 [w2 w3]]
  • w2 w3 is a compound (modified by w1)
  • home health care
  • w1 and w2 independently modify w3
  • adult male rat
  • left bracketing: [[w1 w2] w3]
  • only 1 modificational choice possible
  • law enforcement officer

[Figure: two diagrams over w1 w2 w3]
6
Adjacency & Dependency (2)
  • right bracketing: [w1 [w2 w3]]
  • w2 w3 is a compound (modified by w1)
  • w1 and w2 independently modify w3
  • adjacency model
  • Is w2 w3 a compound?
  • (vs. w1 w2 being a compound)
  • dependency model
  • Does w1 modify w3?
  • (vs. w1 modifying w2)

[Figure: three diagrams over w1 w2 w3]
7
Frequencies
  • Adjacency model
  • Compare #(w1,w2) to #(w2,w3)
  • Dependency model
  • Compare #(w1,w2) to #(w1,w3)
  • #(w1,w2) is the frequency of "w1 w2"

[Figure: w1 w2 w3 with the left and right links marked]
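
The two frequency comparisons on this slide can be sketched in a few lines of Python. This is only an illustration: the function name, the dictionary layout, and the counts are all made up for the example and do not come from the talk.

```python
def bracket_by_frequency(pair_counts, w1, w2, w3, model="dependency"):
    """Compare raw pair frequencies and return 'left', 'right', or None on a tie.

    pair_counts maps (word, word) tuples to frequencies (hypothetical layout).
    """
    left = pair_counts.get((w1, w2), 0)           # evidence for [[w1 w2] w3]
    if model == "adjacency":
        right = pair_counts.get((w2, w3), 0)      # is w2 w3 a compound?
    elif model == "dependency":
        right = pair_counts.get((w1, w3), 0)      # does w1 modify w3?
    else:
        raise ValueError(f"unknown model: {model}")
    if left == right:
        return None
    return "left" if left > right else "right"


# Invented counts, for illustration only.
toy_counts = {
    ("law", "enforcement"): 900,
    ("enforcement", "officer"): 400,
    ("law", "officer"): 50,
}
print(bracket_by_frequency(toy_counts, "law", "enforcement", "officer", "adjacency"))
print(bracket_by_frequency(toy_counts, "law", "enforcement", "officer", "dependency"))
```

Both models share the left-hand evidence #(w1,w2); they differ only in which pair stands for the right bracketing.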
8
Probabilities
  • Adjacency model
  • Compare Pr(w1→w2|w2) to Pr(w2→w3|w3)
  • Dependency model
  • Compare Pr(w1→w2|w2) to Pr(w1→w3|w3)
  • Pr(w1→w2|w2) is the probability that w1 modifies w2

[Figure: w1 w2 w3 with the left and right links marked]
9
Probabilities: Dependency
  • Dependency model
  • Pr(left) ∝ Pr(w1→w2|w2) Pr(w2→w3|w3)
  • Pr(right) ∝ Pr(w1→w3|w3) Pr(w2→w3|w3)
  • So we compare Pr(w1→w2|w2) to Pr(w1→w3|w3)
  • BUT! No cancellation in Lauer's model

[Figure: w1 w2 w3 with the left and right links marked]
10
Probabilities: Estimation
  • Using page hits as a proxy for n-gram counts
  • Pr(w1→w2|w2) = #(w1,w2) / #(w2)
  • #(w2): word frequency, query for w2
  • #(w1,w2): bigram frequency, query for "w1 w2"
  • smoothed by 0.5
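
A minimal sketch of this estimate, assuming that "smoothed by 0.5" means adding 0.5 to each count (the slide does not say where the constant is applied). The function names and the hit counts are hypothetical.

```python
SMOOTHING = 0.5  # value from the slide; adding it to both counts is an assumption


def modification_probability(pair_hits, head_hits, smoothing=SMOOTHING):
    """Estimate Pr(wi -> wj | wj) as #(wi,wj) / #(wj) from page hits."""
    return (pair_hits + smoothing) / (head_hits + smoothing)


def bracket_by_probability(pair_hits, word_hits, w1, w2, w3, model="dependency"):
    """Return 'left', 'right', or None on a tie."""
    left = modification_probability(pair_hits.get((w1, w2), 0), word_hits.get(w2, 0))
    if model == "adjacency":
        right = modification_probability(pair_hits.get((w2, w3), 0), word_hits.get(w3, 0))
    elif model == "dependency":
        right = modification_probability(pair_hits.get((w1, w3), 0), word_hits.get(w3, 0))
    else:
        raise ValueError(f"unknown model: {model}")
    if left == right:
        return None
    return "left" if left > right else "right"


# Invented page-hit counts, for illustration only.
toy_pairs = {("home", "health"): 100, ("health", "care"): 5000, ("home", "care"): 3000}
toy_words = {"health": 100_000, "care": 200_000}
print(bracket_by_probability(toy_pairs, toy_words, "home", "health", "care", "dependency"))
```

Because the smoothing constant is added to the denominator as well, a word with zero hits does not cause a division by zero.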

11
Probabilities: Why? (1)
  • Why should we use
  • (a) Pr(w1→w2|w2), rather than
  • (b) Pr(w2→w1|w1)?
  • Keller & Lapata (2004) calculate
  • AltaVista queries
  • (a) 70.49%
  • (b) 68.85%
  • British National Corpus
  • (a) 63.11%
  • (b) 65.57%

12
Probabilities: Why? (2)
  • Why should we use
  • (a) Pr(w1→w2|w2), rather than
  • (b) Pr(w2→w1|w1)?
  • Maybe to introduce a bracketing prior.
  • Just like Lauer (1995) did.
  • But otherwise, no reason to prefer either one.
  • Do we need probabilities? (association is OK)
  • Do we need a directed model? (symmetry is OK)

13
Association Models: χ² (Chi-Squared)
  • A = #(wi,wj)
  • B = #(wi) − #(wi,wj)
  • C = #(wj) − #(wi,wj)
  • D = N − (A+B+C)
  • N = 8 trillion (= A+B+C+D)

8 billion Web pages × 1,000 words
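
The slide defines the four cells of a 2×2 contingency table, but the statistic itself is not in the transcript. The sketch below assumes the standard 2×2 χ² formula, N(AD − BC)² / ((A+C)(B+D)(A+B)(C+D)). The function name is hypothetical; N follows the slide's estimate.

```python
N_WORDS = 8_000_000_000 * 1_000  # 8 billion Web pages x 1,000 words, as on the slide


def chi_squared(pair_hits, wi_hits, wj_hits, total=N_WORDS):
    """Standard 2x2 chi-squared association score for the pair (wi, wj)."""
    a = pair_hits                  # wi followed by wj
    b = wi_hits - pair_hits        # wi without wj
    c = wj_hits - pair_hits        # wj without wi
    d = total - (a + b + c)        # neither
    denominator = (a + c) * (b + d) * (a + b) * (c + d)
    if denominator == 0:
        return 0.0
    return total * (a * d - b * c) ** 2 / denominator


# Small made-up table: A=10, B=20, C=30, D=40.
print(chi_squared(pair_hits=10, wi_hits=30, wj_hits=40, total=100))
```

To bracket w1 w2 w3, the score for (w1,w2) would be compared with the score for (w2,w3) in the adjacency model, or for (w1,w3) in the dependency model.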
14
Web-derived Surface Features
  • Authors often disambiguate noun compounds using
    surface markers, e.g.
  • law-enforcement officer → left
  • brain stem's cell → left
  • brain's stem cell → right
  • The enormous size of the Web makes them frequent
    enough to be useful.

15
Web-derived Surface Features: Dash (hyphen)
  • Left dash
  • cell-cycle analysis → left
  • Right dash
  • donor T-cell → right
  • fiber optics-system → right (but should be left)
  • Double dash
  • T-cell-depletion → unusable

16
Web-derived Surface Features: Possessive Marker
  • Attached to the first word
  • brain's stem cell → right
  • Attached to the second word
  • brain stem's cell → left
  • Combined features
  • brain's stem-cell → right

17
Web-derived Surface Features: Capitalization
  • don't-care, lowercase, uppercase
  • Plasmodium vivax Malaria → left
  • plasmodium vivax Malaria → left
  • lowercase, uppercase, don't-care
  • brain Stem cell → right
  • brain Stem Cell → right
  • Disabled on
  • Roman digits
  • Single-letter words, e.g., vitamin D deficiency

18
Web-derived Surface Features: Embedded Slash
  • Left embedded slash
  • leukemia/lymphoma cell → right

19
Web-derived Surface Features: Parentheses
  • Single-word
  • growth factor (beta) → left
  • (brain) stem cell → right
  • Two-word
  • (growth factor) beta → left
  • brain (stem cell) → right

20
Web-derived Surface Features: Comma, dot, colon, semicolon
  • Following the first word
  • home. health care → right
  • adult, male rat → right
  • Following the second word
  • health care, provider → left
  • lung cancer: patients → left

21
Web-derived Surface Features: Dash to External Word
  • Dash to an external word to the left
  • mouse-brain stem cell → right
  • Dash to an external word to the right
  • tumor necrosis factor-alpha → left

22
Web-derived Surface Features: Problems & Solutions
  • Problem: search engines ignore punctuation
  • "brain-stem cell" does not work
  • Solution
  • query for "brain stem cell"
  • obtain 1,000 document summaries
  • look for the features in these summaries
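
The last step, looking for features in the returned summaries, could be sketched as follows. Only the dash and possessive features are covered. The function name, the regular expressions, and the sample snippets are illustrative assumptions, not the authors' code.

```python
import re


def count_surface_votes(snippets, w1, w2, w3):
    """Count left/right votes from dash and possessive markers in text snippets."""
    w1, w2, w3 = (re.escape(w) for w in (w1, w2, w3))
    patterns = {
        "left": [
            rf"\b{w1}-{w2}\s+{w3}\b",      # left dash: w1-w2 w3
            rf"\b{w1}\s+{w2}'s\s+{w3}\b",  # possessive on the second word
        ],
        "right": [
            rf"\b{w1}\s+{w2}-{w3}\b",      # right dash: w1 w2-w3
            rf"\b{w1}'s\s+{w2}\s+{w3}\b",  # possessive on the first word
        ],
    }
    votes = {"left": 0, "right": 0}
    for text in snippets:
        for side, side_patterns in patterns.items():
            for pattern in side_patterns:
                votes[side] += len(re.findall(pattern, text, flags=re.IGNORECASE))
    return votes


# Made-up snippets standing in for search-engine summaries.
sample_snippets = [
    "The brain-stem cell count rose.",
    "Brain's stem cell research",
    "a brain stem-cell line",
    "brain stem cell",
]
print(count_surface_votes(sample_snippets, "brain", "stem", "cell"))
```

A double-dash form such as "brain-stem-cell" matches neither dash pattern, which agrees with the slides treating it as unusable.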

23
Other Web-derived Features: Possessive Marker
  • We can also query directly for possessives
  • Yes, "brain stem's cell" sort of works.
  • Search engines
  • drop the possessive marker
  • but the "s" is kept
  • Still, we cannot query for "brain stems' cell"

24
Other Web-derived Features: Abbreviation
  • After the second word
  • tumor necrosis (TN) factor → left
  • After the third word
  • tumor necrosis factor (NF) → right
  • We query for e.g., "tumor necrosis tn factor"
  • Problems
  • Roman digits: IV, vii
  • States: CA
  • Short words: me

25
Other Web-derived Features: Concatenation
  • Consider "health care reform"
  • healthcare: 79,500,000
  • carereform: 269
  • healthreform: 812
  • Adjacency model
  • healthcare vs. carereform
  • Dependency model
  • healthcare vs. healthreform
  • Triples
  • "healthcare reform" vs. "health carereform"
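
As a small worked example, the concatenation comparison can be run on the three counts quoted on this slide. The function name and dictionary layout are hypothetical.

```python
def concatenation_vote(hits, w1, w2, w3, model="adjacency"):
    """Vote 'left' or 'right' from hit counts of concatenated word pairs."""
    left = hits.get(w1 + w2, 0)          # e.g. "healthcare"
    if model == "adjacency":
        right = hits.get(w2 + w3, 0)     # e.g. "carereform"
    elif model == "dependency":
        right = hits.get(w1 + w3, 0)     # e.g. "healthreform"
    else:
        raise ValueError(f"unknown model: {model}")
    if left == right:
        return None
    return "left" if left > right else "right"


# Counts as quoted on the slide.
slide_hits = {"healthcare": 79_500_000, "carereform": 269, "healthreform": 812}
print(concatenation_vote(slide_hits, "health", "care", "reform", "adjacency"))
print(concatenation_vote(slide_hits, "health", "care", "reform", "dependency"))
```

With these counts both models vote left, i.e. [[health care] reform].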

26
Other Web-derived Features: Using Google's *
  • Each * allows a one-word wildcard
  • Single star
  • "health care * reform" → left
  • "health * care reform" → right
  • More stars and/or reverse order
  • "care reform * health" → right
  • "reform * health care" → left
  • Adjacency model

27
Other Web-derived Features: Reorder
  • Reorders for "health care reform"
  • "care reform health" → right
  • "reform health care" → left

28
Other Web-derived Features: Internal Inflection Variability
  • First word → right
  • bone mineral density
  • bones mineral density
  • Second word → left
  • bone mineral density
  • bone minerals density

29
Other Web-derived Features: Switch The First Two Words
  • Predict right, if we can reorder
  • "adult male rat" as
  • "male adult rat"

30
Paraphrases (1)
  • The semantics of a noun compound is often made
    overt by a paraphrase (Warren, 1978)
  • Prepositional
  • stem cells in the brain → right
  • stem cells from the brain → right
  • cells from the brain stem → left
  • Verbal
  • virus causing human immunodeficiency → left
  • pain associated with arthritis migraine → left
  • Copula
  • office building that is a skyscraper → right

31
Paraphrases (2)
  • Lauer (1995), Keller & Lapata (2003), Girju et al. (2005)
    try to choose the best prepositional paraphrase
    as a proxy for the semantic interpretation of NCs
  • They use of, for, in, at, on, from, with, about,
    (like)
  • This could be problematic when more than one
    preposition is possible.
  • In contrast
  • we try to predict syntax, not semantics
  • we do not need to disambiguate, just add up all
    counts
  • cells in (the) bone marrow → left (61,700)
  • cells from (the) bone marrow → left (16,500)
  • marrow cells from (the) bone → right (12)
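
The "just add up all counts" step can be illustrated with the three paraphrases and counts listed on this slide. The function name and the data layout are assumptions made for the example.

```python
def bracket_by_paraphrases(paraphrase_hits):
    """Sum hit counts per bracketing and return (decision, totals).

    paraphrase_hits is a list of (paraphrase, side, hits) tuples.
    """
    totals = {"left": 0, "right": 0}
    for _paraphrase, side, hits in paraphrase_hits:
        totals[side] += hits
    if totals["left"] == totals["right"]:
        return None, totals
    decision = "left" if totals["left"] > totals["right"] else "right"
    return decision, totals


# Paraphrases and counts as listed on the slide for "bone marrow cells".
bone_marrow_cells = [
    ("cells in (the) bone marrow", "left", 61_700),
    ("cells from (the) bone marrow", "left", 16_500),
    ("marrow cells from (the) bone", "right", 12),
]
print(bracket_by_paraphrases(bone_marrow_cells))
```

No single preposition has to be chosen as the best one: every paraphrase that supports a bracketing adds to that side's total.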

32
Paraphrases (3)
  • prepositional paraphrases
  • We use 150 prepositions
  • verbal paraphrases
  • We use: associated with, caused by, contained in,
    derived from, focusing on, found in, involved in,
    located at/in, made of, performed by, preventing,
    related to and used by/in/for.
  • copula paraphrases
  • We use that/which/who and is/was
  • optional elements
  • articles: a, an, the
  • quantifiers: some, every, etc.
  • pronouns: this, these, etc.
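
A rough sketch of how prepositional paraphrase queries with an optional article might be generated for a compound w1 w2 w3. The two-preposition default, the function name, and the query templates are illustrative assumptions. The talk uses 150 prepositions and more optional elements, and inflection (e.g. "cell" vs. "cells") is left to the caller here.

```python
from itertools import product


def prepositional_paraphrases(w1, w2, w3, prepositions=("in", "from"), articles=("", "the")):
    """Build exact-phrase queries that support left or right bracketing."""
    queries = {"left": [], "right": []}
    for preposition, article in product(prepositions, articles):
        link = f"{preposition} {article}".strip()           # "in" or "in the"
        queries["left"].append(f"{w3} {link} {w1} {w2}")    # [[w1 w2] w3]
        queries["right"].append(f"{w2} {w3} {link} {w1}")   # [w1 [w2 w3]]
    return queries


paraphrase_queries = prepositional_paraphrases("bone", "marrow", "cells")
for side, phrases in paraphrase_queries.items():
    print(side, phrases)
```

Each generated phrase would be sent as an exact-phrase query, and its hit count added to the total for the corresponding side.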

33
Evaluation: Datasets
  • Lauer Set
  • 244 noun compounds (NCs)
  • from Grolier's encyclopedia
  • inter-annotator agreement: 81.5%
  • Biomedical Set
  • 430 NCs
  • from MEDLINE
  • inter-annotator agreement: 88% (κ = .606)

34
Evaluation: Experiments
  • Exact phrase queries
  • Limited to English
  • Inflections
  • Lauer Set: Carroll's morphological tools
  • Biomedical Set: UMLS Specialist Lexicon

35
Results: Lauer (1)
[Chart; legend: correct, wrong, N/A]
36
Results: Lauer (2)
[Chart; legend: correct, wrong, N/A]
37
Results: Lauer (3)
38
Results: Bio (1)
[Chart; legend: correct, wrong, N/A]
39
Results: Bio (2)
[Chart; legend: correct, wrong, N/A]
40
Individual Surface Features Performance: Bio
41
Paraphrase and Surface Features Performance
  • Lauer Set
  • Biomedical Set

42
Discussion
  • Lauer vs. Bio
  • Adjacency vs. Dependency
  • χ² vs. frequencies vs. probabilities

43
Conclusion
  • Obtained new state-of-the-art results on NC
    bracketing
  • more robust than Lauer (1995)
  • more accurate than Keller & Lapata (2004)
  • Introduced search engine statistics that go
    beyond the n-gram
  • surface features
  • paraphrases
  • Works well for other structural ambiguity
    problems
  • Prepositional phrase attachment
  • Noun phrase coordination

44
Future Work
  • Recognize ambiguous cases
  • 3-way classification
  • Bracket more than 3 nouns
  • Not just bracketing but dependences
  • e.g., growth factor alpha
  • Bracket NPs in general (other POS)
  • augment Penn Treebank with NP-internal dependences

45
The End
  • Thank you!

46
Web Counts: Problems
  • Page hits are inaccurate
  • This may be ok (Keller & Lapata, 2003)
  • The Web lacks linguistic annotation
  • Pr(health→care|care) = #("health care") / #(care)
  • health: noun
  • care: both verb and noun
  • can be adjacent by chance
  • can come from different sentences
  • Cannot find
  • "stem cells VERB PREPOSITION brain"
  • protein synthesis inhibition

47
Inter-annotator Agreement: Lauer Set
  • Lauer 6 judges
  • Average: 81.50%
  • Best pair of annotators: 84.40%
  • Worst pair of annotators: 73.00%
  • Total of 308 examples.
  • 244 used
  • the rest: indeterminate or extraction errors
  • Problem
  • Gold standard: Lauer, in context.
  • Human judges: no context!

48
Search Engines 5/7/2005: any language, inflections
49
MSN over time: any language, inflections
50
Google over time: any language, inflections