Title: Search Engine Statistics Beyond the n-gram: Application to Noun Compound Bracketing
1Search Engine Statistics Beyond the n-gram
Application to Noun Compound Bracketing
- Preslav Nakov and Marti Hearst
- Computer Science Division and SIMS, University of California, Berkeley
- Supported by NSF DBI-0317510 and a gift from Genentech
2Overview
- Unsupervised algorithm
- Applied here to noun compound bracketing, but promising for structural ambiguity generally
- Features
- n-grams, χ², MI
- Beyond the n-gram
- surface features
- paraphrases
- State-of-the-art accuracy
3Noun Compound Bracketing
liver cell line
liver cell antibody
- (a) [ [ liver cell ] antibody ] (left bracketing)
- (b) [ liver [ cell line ] ] (right bracketing)
- In (a), the antibody targets the liver cell.
- In (b), the cell line is derived from the liver.
4Related Work
Pr that w1 precedes w2
- Marcus (1980), Pustejovsky et al. (1993), Resnik (1993)
- adjacency model: Pr(w1|w2) vs. Pr(w2|w3)
- Lauer (1995)
- dependency model: Pr(w1|w2) vs. Pr(w1|w3)
- Keller & Lapata (2004)
- use the Web
- unigrams and bigrams
- Girju et al. (2005)
- supervised model
- bracketing in context
- requires WordNet senses to be given
- This work
- χ²
- Web
- n-grams
- paraphrases
- surface features
5Adjacency & Dependency (1)
- right bracketing: [w1 [w2 w3]]
- w2 w3 is a compound (modified by w1)
- home health care
- w1 and w2 independently modify w3
- adult male rat
- left bracketing: [[w1 w2] w3]
- only 1 modificational choice possible
- law enforcement officer
[Diagrams: w1 w2 w3 with modification arcs]
6Adjacency & Dependency (2)
- right bracketing: [w1 [w2 w3]]
- w2 w3 is a compound (modified by w1)
- w1 and w2 independently modify w3
- adjacency model
- Is w2 w3 a compound?
- (vs. w1 w2 being a compound)
- dependency model
- Does w1 modify w3?
- (vs. w1 modifying w2)
[Diagrams: w1 w2 w3 with modification arcs]
7Frequencies
- Adjacency model
- Compare #(w1,w2) to #(w2,w3)
- Dependency model
- Compare #(w1,w2) to #(w1,w3)
- #(w1,w2): frequency of "w1 w2"
[Diagrams: w1 w2 w3 with the compared word pairs labeled left and right]
8Probabilities
- Adjacency model
- Compare Pr(w1→w2|w2) to Pr(w2→w3|w3)
- Dependency model
- Compare Pr(w1→w2|w2) to Pr(w1→w3|w3)
- Pr(w1→w2|w2): Pr that w1 modifies w2
[Diagrams: w1 w2 w3 with the compared word pairs labeled left and right]
9Probabilities: Dependency
- Dependency model
- Pr(left) ∝ Pr(w1→w2|w2) Pr(w2→w3|w3)
- Pr(right) ∝ Pr(w1→w3|w3) Pr(w2→w3|w3)
- The shared factor Pr(w2→w3|w3) cancels, so we compare Pr(w1→w2|w2) to Pr(w1→w3|w3)
- BUT! No cancellation in Lauer's model
[Diagrams: w1 w2 w3 with modification arcs for the left and right bracketings]
10Probabilities: Estimation
- Using page hits as a proxy for n-gram counts
- Pr(w1→w2|w2) = #(w1,w2) / #(w2)
- #(w2): word frequency; query for "w2"
- #(w1,w2): bigram frequency; query for "w1 w2"
- smoothed by 0.5
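The estimate on this slide (bigram page hits divided by the page hits of the second word) can be sketched in a few lines. This is a minimal sketch, not the authors' implementation: the `counts` dictionary, the function names, and the example numbers are all hypothetical, and adding 0.5 to every count is only one plausible reading of "smoothed by 0.5".

```python
def smoothed(count, k=0.5):
    """Add a small constant so zero counts do not give zero or undefined ratios.
    Assumption: the slide's 0.5 smoothing is applied to every raw count."""
    return count + k


def pr_modifies(counts, wi, wj):
    """Estimate Pr(wi -> wj | wj) as (bigram count of 'wi wj') / (count of wj)."""
    return smoothed(counts.get((wi, wj), 0)) / smoothed(counts.get((wj,), 0))


def bracket(counts, w1, w2, w3, model="dependency"):
    """Return 'left', 'right', or None on a tie."""
    left = pr_modifies(counts, w1, w2)
    if model == "adjacency":
        right = pr_modifies(counts, w2, w3)   # is "w2 w3" the compound?
    else:
        right = pr_modifies(counts, w1, w3)   # does w1 modify w3?
    if left > right:
        return "left"
    if right > left:
        return "right"
    return None


# Made-up counts for illustration only; these are not real page hits.
example_counts = {
    ("law", "enforcement"): 900,
    ("enforcement", "officer"): 400,
    ("law", "officer"): 50,
    ("enforcement",): 10_000,
    ("officer",): 20_000,
}
```

With these illustrative counts, both `bracket(example_counts, "law", "enforcement", "officer")` and the same call with `model="adjacency"` choose the left bracketing.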
11Probabilities: Why? (1)
- Why should we use
- (a) Pr(w1→w2|w2), rather than
- (b) Pr(w2→w1|w1)?
- Keller & Lapata (2004) calculate
- AltaVista queries
- (a) 70.49%
- (b) 68.85%
- British National Corpus
- (a) 63.11%
- (b) 65.57%
12Probabilities: Why? (2)
- Why should we use
- (a) Pr(w1→w2|w2), rather than
- (b) Pr(w2→w1|w1)?
- Maybe to introduce a bracketing prior.
- Just like Lauer (1995) did.
- But otherwise, no reason to prefer either one.
- Do we need probabilities? (association is OK)
- Do we need a directed model? (symmetry is OK)
13Association Models: χ² (Chi Squared)
- A = #(wi,wj)
- B = #(wi) − #(wi,wj)
- C = #(wj) − #(wi,wj)
- D = N − (A+B+C)
- N = 8 trillion (= A+B+C+D)
- 8 billion Web pages × 1,000 words
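The four cells A, B, C, D form a 2×2 contingency table for the word pair. A minimal sketch of scoring it follows, assuming the standard 2×2 χ² formula N(AD − BC)² / ((A+B)(C+D)(A+C)(B+D)); the slide text itself does not spell the formula out, and the function name is hypothetical.

```python
N = 8_000_000_000 * 1_000  # 8 billion Web pages x 1,000 words, as estimated on the slide


def chi_squared(n_ij, n_i, n_j, n=N):
    """Chi-squared association score for a word pair (wi, wj).

    n_ij: count of the bigram "wi wj"; n_i, n_j: counts of wi and wj alone.
    """
    a = n_ij                 # wi followed by wj
    b = n_i - n_ij           # wi without wj
    c = n_j - n_ij           # wj without wi
    d = n - (a + b + c)      # everything else
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0           # a word never occurs: no evidence either way
    return n * (a * d - b * c) ** 2 / denominator
```

For bracketing, the score of (w1, w2) would then be compared with that of (w2, w3) under the adjacency model, or with that of (w1, w3) under the dependency model.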
14Web-derived Surface Features
- Authors often disambiguate noun compounds using surface markers, e.g.
- law-enforcement officer → left
- brain stem's cell → left
- brain's stem cell → right
- The enormous size of the Web makes them frequent enough to be useful.
15Web-derived Surface Features: Dash (hyphen)
- Left dash
- cell-cycle analysis → left
- Right dash
- donor T-cell → right
- fiber optics-system → should be left
- Double dash
- T-cell-depletion → unusable
16Web-derived Surface Features: Possessive Marker
- Attached to the first word
- brain's stem cell → right
- Attached to the second word
- brain stem's cell → left
- Combined features
- brain's stem-cell → right
17Web-derived Surface Features: Capitalization
- don't-care, lowercase, uppercase
- Plasmodium vivax Malaria → left
- plasmodium vivax Malaria → left
- lowercase, uppercase, don't-care
- brain Stem cell → right
- brain Stem Cell → right
- Disabled on
- Roman digits
- Single-letter words, e.g., vitamin D deficiency
18Web-derived Surface Features: Embedded Slash
- Left embedded slash
- leukemia/lymphoma cell → right
19Web-derived Surface Features: Parentheses
- Single-word
- growth factor (beta) → left
- (brain) stem cell → right
- Two-word
- (growth factor) beta → left
- brain (stem cell) → right
20Web-derived Surface Features: Comma, dot, colon, semicolon
- Following the first word
- home. health care → right
- adult, male rat → right
- Following the second word
- health care, provider → left
- lung cancer: patients → left
21Web-derived Surface Features: Dash to External Word
- Dash to an external word to the left
- mouse-brain stem cell → right
- Dash to an external word to the right
- tumor necrosis factor-alpha → left
22Web-derived Surface Features: Problems & Solutions
- Problem: search engines ignore punctuation
- querying for "brain-stem cell" does not work
- Solution
- query for "brain stem cell"
- obtain 1,000 document summaries
- look for the features in these summaries
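The last step, scanning the returned summaries for surface markers, could look roughly like the sketch below. It is an illustration under stated assumptions, not the authors' code: the function name and the regular expressions are hypothetical, only three of the feature types (dash, possessive, two-word parentheses) are covered, and the summaries are assumed to be already downloaded as plain strings.

```python
import re


def surface_votes(snippets, w1, w2, w3):
    """Count left/right bracketing evidence for 'w1 w2 w3' in text snippets."""
    a, b, c = (re.escape(w) for w in (w1, w2, w3))
    patterns = {
        "left": [
            rf"\b{a}-{b}\s+{c}\b",       # left dash, e.g. cell-cycle analysis
            rf"\b{a}\s+{b}'s\s+{c}\b",   # possessive on the second word
            rf"\({a}\s+{b}\)\s*{c}\b",   # (growth factor) beta
        ],
        "right": [
            rf"\b{a}\s+{b}-{c}\b",       # right dash, e.g. donor T-cell
            rf"\b{a}'s\s+{b}\s+{c}\b",   # possessive on the first word
            rf"\b{a}\s*\({b}\s+{c}\)",   # brain (stem cell)
        ],
    }
    votes = {"left": 0, "right": 0}
    for text in snippets:
        for side, side_patterns in patterns.items():
            for pattern in side_patterns:
                votes[side] += len(re.findall(pattern, text, flags=re.IGNORECASE))
    return votes


# Hypothetical snippets standing in for search-engine summaries.
example_snippets = [
    "A brain-stem cell was isolated.",
    "The brain's stem cell niche.",
    "brain (stem cell) research",
    "Brain stem-cell therapy",
]
```

Because the dash patterns require whitespace on the other side, a double-dash form such as T-cell-depletion matches neither pattern and contributes no vote, in line with the slides treating it as unusable.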
23Other Web-derived Features: Possessive Marker
- We can also query directly for possessives
- Yes, "brain stem's cell" sort of works.
- Search engines
- drop the possessive marker
- but the "s" is kept
- Still, we cannot query for "brain stems' cell"
24Other Web-derived Features: Abbreviation
- After the second word
- tumor necrosis (TN) factor → left
- After the third word
- tumor necrosis factor (NF) → right
- We query for, e.g., "tumor necrosis tn factor"
- Problems
- Roman digits: IV, vii
- States: CA
- Short words: me
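A rough sketch of how such abbreviation queries might be built is shown below. The slides list the problem cases but do not say how they are handled, so the filtering step here is purely an assumption, as are the function name and the small word lists.

```python
import re

# Crude, illustrative filters for the problem cases named on the slide.
ROMAN_LIKE = re.compile(r"^[ivxlcdm]+$")   # also catches some non-numerals; a rough guard
STATE_CODES = {"ca", "ny", "tx"}           # hypothetical subset
SHORT_WORDS = {"me", "we", "in", "on"}     # hypothetical subset


def usable(abbreviation):
    """Reject abbreviations that collide with Roman digits, state codes, or short words."""
    return not (
        ROMAN_LIKE.match(abbreviation)
        or abbreviation in STATE_CODES
        or abbreviation in SHORT_WORDS
    )


def abbreviation_queries(w1, w2, w3):
    """Build exact-phrase queries with the abbreviation placed after word 2 or word 3."""
    left_abbr = (w1[0] + w2[0]).lower()    # abbreviates "w1 w2"  -> evidence for left
    right_abbr = (w2[0] + w3[0]).lower()   # abbreviates "w2 w3"  -> evidence for right
    queries = {}
    if usable(left_abbr):
        queries["left"] = f'"{w1} {w2} {left_abbr} {w3}"'
    if usable(right_abbr):
        queries["right"] = f'"{w1} {w2} {w3} {right_abbr}"'
    return queries
```

For "tumor necrosis factor" this yields the query "tumor necrosis tn factor" for the left bracketing and "tumor necrosis factor nf" for the right one.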
25Other Web-derived Features: Concatenation
- Consider "health care reform"
- healthcare: 79,500,000
- carereform: 269
- healthreform: 812
- Adjacency model
- healthcare vs. carereform
- Dependency model
- healthcare vs. healthreform
- Triples
- "healthcare reform" vs. "health carereform"
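The comparison can be written down directly using the three concatenation counts quoted on this slide. The function and variable names are hypothetical; this is a sketch of the comparison, not the authors' code.

```python
# Page hits for the concatenated forms, as quoted on the slide.
concat_counts = {
    "healthcare": 79_500_000,
    "carereform": 269,
    "healthreform": 812,
}


def concat_vote(counts, w1, w2, w3, model="adjacency"):
    """Vote 'left' or 'right' by comparing counts of concatenated word pairs."""
    left = counts.get(w1 + w2, 0)          # w1w2 written as one word
    if model == "adjacency":
        right = counts.get(w2 + w3, 0)     # w2w3 written as one word
    else:
        right = counts.get(w1 + w3, 0)     # w1w3 written as one word (dependency)
    if left > right:
        return "left"
    if right > left:
        return "right"
    return None                            # tie: no evidence
```

Under both models "healthcare" is far more frequent than the alternative, so the vote is left.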
26Other Web-derived Features: Using Google's *
- Each * allows a one-word wildcard
- Single star
- "health care * reform" → left
- "health * care reform" → right
- More stars and/or reverse order
- "care reform * health" → right
- "reform * health care" → left
- Adjacency model
27Other Web-derived Features: Reorder
- Reorders for "health care reform"
- "care reform health" → right
- "reform health care" → left
28Other Web-derived Features: Internal Inflection Variability
- First word
- bone mineral density
- bones mineral density → right
- Second word
- bone mineral density
- bone minerals density → left
29Other Web-derived Features: Switch the First Two Words
- Predict right, if we can reorder
- "adult male rat" as
- "male adult rat"
30Paraphrases (1)
- The semantics of a noun compound is often made overt by a paraphrase (Warren, 1978)
- Prepositional
- stem cells in the brain → right
- stem cells from the brain → right
- cells from the brain stem → left
- Verbal
- virus causing human immunodeficiency → left
- pain associated with arthritis migraine → left
- Copula
- office building that is a skyscraper → right
31Paraphrases (2)
- Lauer (1995), Keller & Lapata (2003), Girju et al. (2005) try to choose the best prepositional paraphrase as a proxy for the semantic interpretation of NCs
- They use of, for, in, at, on, from, with, about, (like)
- This could be problematic when more than one preposition is possible.
- In contrast
- we try to predict syntax, not semantics
- we do not need to disambiguate, just add up all counts
- cells in (the) bone marrow → left (61,700)
- cells from (the) bone marrow → left (16,500)
- marrow cells from (the) bone → right (12)
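"Just add up all counts" amounts to summing the hits per bracketing, whichever preposition produced them. A minimal sketch using the three counts quoted on this slide follows; the data layout and function names are assumptions made for illustration.

```python
# (paraphrase pattern, bracketing it supports, page hits), as quoted on the slide.
paraphrase_hits = [
    ("cells in (the) bone marrow", "left", 61_700),
    ("cells from (the) bone marrow", "left", 16_500),
    ("marrow cells from (the) bone", "right", 12),
]


def tally(hits):
    """Sum page hits per bracketing, ignoring which preposition was matched."""
    totals = {"left": 0, "right": 0}
    for _pattern, side, count in hits:
        totals[side] += count
    return totals


def predict(hits):
    """Pick the bracketing with more total evidence; None on a tie."""
    totals = tally(hits)
    if totals["left"] == totals["right"]:
        return None
    return max(totals, key=totals.get)
```

For "bone marrow cells" the left bracketing collects 78,200 hits against 12 for the right, so no choice among prepositions is needed.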
32Paraphrases (3)
- prepositional paraphrases
- We use 150 prepositions
- verbal paraphrases
- We use associated with, caused by, contained in, derived from, focusing on, found in, involved in, located at/in, made of, performed by, preventing, related to and used by/in/for.
- copula paraphrases
- We use that/which/who and is/was
- optional elements
- articles: a, an, the
- quantifiers: some, every, etc.
- pronouns: this, these, etc.
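One way to picture how these ingredients combine is to enumerate exact-phrase queries for a compound w1 w2 w3. The sketch below covers prepositional paraphrases only and rests on several assumptions: the preposition and optional-element lists are tiny illustrative subsets, the function name is hypothetical, and inflection of the head noun is left to the caller.

```python
from itertools import product

PREPOSITIONS = ["in", "from", "of"]                # small subset of the 150 mentioned
OPTIONAL = ["", "the", "a", "an", "some", "this"]  # "" means no optional element


def prepositional_paraphrases(w1, w2, w3):
    """Generate exact-phrase queries supporting each bracketing of 'w1 w2 w3'."""
    queries = {"left": [], "right": []}
    for preposition, optional in product(PREPOSITIONS, OPTIONAL):
        link = f"{preposition} {optional}".strip()
        # Left: w3 relates to the unit "w1 w2", e.g. cells in the bone marrow.
        queries["left"].append(f'"{w3} {link} {w1} {w2}"')
        # Right: the unit "w2 w3" relates to w1, e.g. marrow cells from the bone.
        queries["right"].append(f'"{w2} {w3} {link} {w1}"')
    return queries
```

Calling `prepositional_paraphrases("bone", "marrow", "cells")` produces, among others, "cells in the bone marrow" for the left bracketing and "marrow cells from the bone" for the right.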
33Evaluation: Datasets
- Lauer Set
- 244 noun compounds (NCs)
- from Grolier's encyclopedia
- inter-annotator agreement: 81.5%
- Biomedical Set
- 430 NCs
- from MEDLINE
- inter-annotator agreement: 88% (κ = .606)
34Evaluation: Experiments
- Exact phrase queries
- Limited to English
- Inflections
- Lauer Set: Carroll's morphological tools
- Biomedical Set: UMLS Specialist Lexicon
35Results: Lauer (1)
[Chart; legend: correct, wrong, N/A]
36Results: Lauer (2)
[Chart; legend: correct, wrong, N/A]
37Results: Lauer (3)
38Results: Bio (1)
[Chart; legend: correct, wrong, N/A]
39Results: Bio (2)
[Chart; legend: correct, wrong, N/A]
40Individual Surface Features Performance: Bio
41Paraphrase and Surface Features Performance
42Discussion
- Lauer vs. Bio
- Adjacency vs. Dependency
- χ² vs. frequencies vs. probabilities
43Conclusion
- Obtained new state-of-the-art results on NC bracketing
- more robust than Lauer (1995)
- more accurate than Keller & Lapata (2004)
- Introduced search engine statistics that go beyond the n-gram
- surface features
- paraphrases
- Works well for other structural ambiguity problems
- Prepositional phrase attachment
- Noun phrase coordination
44Future Work
- Recognize ambiguous cases
- 3-way classification
- Bracket more than 3 nouns
- Not just bracketing but dependences
- e.g., growth factor alpha
- Bracket NPs in general (other POS)
- augment Penn Treebank with NP-internal dependences
45The End
46Web Counts: Problems
- Page hits are inaccurate
- This may be OK (Keller & Lapata, 2003)
- The Web lacks linguistic annotation
- Pr(health→care|care) = #(health care) / #(care)
- health: noun
- care: both verb and noun
- can be adjacent by chance
- can come from different sentences
- Cannot find
- stem cells VERB PREPOSITION brain
- protein synthesis inhibition
47Inter-annotator Agreement: Lauer Set
- Lauer: 6 judges
- Average: 81.50%
- Best pair of annotators: 84.40%
- Worst pair of annotators: 73.00%
- Total of 308 examples.
- 244 used
- the rest: indeterminate or extraction errors
- Problem
- Gold standard: Lauer, in context.
- Human judges: no context!
48Search Engines 5/7/2005: any language, inflections
49MSN over time: any language, inflections
50Google over time: any language, inflections