Search Engine Statistics Beyond the n-gram: Application to Noun Compound Bracketing (CoNLL-2005)

1
Search Engine Statistics Beyond the n-gram:
Application to Noun Compound Bracketing
CoNLL-2005
  • Preslav Nakov
  • EECS, Computer Science Division
  • University of California, Berkeley
  • Marti Hearst
  • SIMS
  • University of California, Berkeley

2
Outline
  • Introduction
  • Related Work
  • Models and Features

3
Introduction
  • Noun compound bracketing -> noun compound
    interpretation
  • liver cell antibody
  • [[liver cell] antibody] (left bracketing)
  • liver cell line
  • [liver [cell line]] (right bracketing)
  • Same POS sequence, different syntactic trees

4
This Paper
  • A highly accurate unsupervised method for making
    bracketing decisions for noun compounds (NCs)
  • Current approaches use bigram estimates to
    compute adjacency and dependency scores
  • Improvements
  • χ² measure
  • A new set of surface features for querying Web
    search engines
  • Evaluated on 2 domains: encyclopedia and
    bioscience

5
Related Work
  • NC syntax and semantics
  • Still active -> Journal of Computer Speech and
    Language, Special Issue on Multiword Expressions
  • Adjacency model
  • Probabilistic dependency model, Lauer (1995)
  • Data sparseness (use categories instead)
  • 244 NCs from an encyclopedia
  • Inter-annotator agreement 81.5%
  • Baseline 66.8% -> 77.5%
  • Adding POS -> state-of-the-art result of 80.7%

6
2003-2005
  • Keller and Lapata (2003)
  • Use Web search engines to obtain frequencies
    for unseen bigrams
  • Lapata and Keller (2004) apply this to six NLP
    tasks, including disambiguation of NCs
  • Simpler version (uses frequency only): 78.68%
  • Girju et al. (2005): supervised (decision tree),
    5 WordNet semantic features
  • 83.1%

7
Models and Features
  • Adjacency and dependency models
  • w1 w2 w3 -> [w1 [w2 w3]] (right bracketing) for
    one of two reasons
  • 1. w2 w3 is a compound (modified by w1)
  • home health care
  • The adjacency model checks 1.
  • 2. w1 and w2 independently modify w3
  • adult male rat
  • The (better) dependency model checks 2.
  • Left bracketing [[w1 w2] w3] -> only 1 choice
  • law enforcement agent
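
The difference between the two models can be sketched in a few lines. This is only an illustration, assuming bigram frequencies are available through some lookup function; the counts in TOY_COUNTS are invented, and the function names are hypothetical rather than taken from the authors' implementation.

```python
# Hypothetical bigram counts, invented purely for illustration.
TOY_COUNTS = {
    ("law", "enforcement"): 900,
    ("enforcement", "agent"): 300,
    ("law", "agent"): 20,
    ("adult", "male"): 400,
    ("male", "rat"): 350,
    ("adult", "rat"): 600,
}


def toy_freq(a: str, b: str) -> int:
    """Return the toy count for the bigram (a, b), or 0 if unseen."""
    return TOY_COUNTS.get((a, b), 0)


def adjacency(w1: str, w2: str, w3: str, freq=toy_freq) -> str:
    """Adjacency model: compare the association of (w1, w2) with (w2, w3)."""
    return "left" if freq(w1, w2) > freq(w2, w3) else "right"


def dependency(w1: str, w2: str, w3: str, freq=toy_freq) -> str:
    """Dependency model: does w1 attach to w2 (left) or to w3 (right)?"""
    return "left" if freq(w1, w2) > freq(w1, w3) else "right"
```

With these toy counts both models bracket "law enforcement agent" to the left, but only the dependency model brackets "adult male rat" to the right, because it asks directly whether "adult" modifies "rat". Ties fall back to right bracketing here, which is a choice made for this sketch.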

8
Computing Probabilities
  • Alternative
  • Calculations

9
χ² measure
  • A = #(wi, wj)
  • B = #(wi) - A
  • C = #(wj) - A
  • D = N - A - B - C
  • N = 8T
  • Google: 8B pages × 1000 words/page
  • (Yang and Pedersen, 1997): χ² better than MI
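
As a sketch of how these quantities combine, the function below uses the standard χ² statistic for a 2x2 contingency table, N(AD - BC)² / ((A+C)(B+D)(A+B)(C+D)). The slide's own formula did not survive in the transcript, so treating it as this standard form is an assumption, and the function name is hypothetical.

```python
def chi_square(count_i: float, count_j: float, count_ij: float,
               n: float = 8e12) -> float:
    """Chi-square association between wi and wj from page-hit counts.

    count_i, count_j: counts for wi and wj alone.
    count_ij: count for wi and wj together (A on the slide).
    n: total number of observations (8T on the slide).
    """
    a = count_ij
    b = count_i - a          # wi without wj
    c = count_j - a          # wj without wi
    d = n - a - b - c        # neither word
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0           # degenerate table: no evidence either way
    return n * (a * d - b * c) ** 2 / denom
```

The resulting scores can replace raw frequencies in the adjacency or dependency comparison, for example comparing chi_square for (w1, w2) against (w1, w3).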

10
???
  • ? 2067593
  • ??2217
  • ? 10207448
  • ??3398
  • ? 1672224
  • χ² ?? 750.34 > ?? 67.32

11
Web-Derived Surface Features (1/2)
  • Authors sometimes (consciously or not)
    disambiguate the words they write by using
    surface-level markers to suggest the correct
    meaning.
  • Dash (hyphen)
  • Left bracketing
  • cell cycle analysis -> cell-cycle analysis
  • Right bracketing (less reliable)
  • donor T-cell
  • fiber optics-system
  • t-cell-depletion
  • Possessive marker
  • brain's stem cells, brain stem's cells, brain's
    stem-cells
  • Internal capitalization
  • Plasmodium vivax Malaria, brain Stem cells
  • Disable this feature on Roman digits and
    single-letter words
  • vitamin D deficiency

12
Web-Derived Surface Features (2/2)
  • Embedded slashes
  • leukemia/lymphoma cell
  • Parentheses
  • growth factor (beta) or (growth factor) beta
  • (brain) stem cells
  • A comma, a dot or a colon
  • "health care, provider" or "lung cancer:
    patients" (weak indicator)
  • mouse-brain stem cells (weak indicator)
  • Unfortunately, Web search engines ignore
    punctuation characters (hyphens, brackets,
    apostrophes, etc.)
  • Collect them indirectly by post-processing the
    returned summaries (up to 1000 results)
  • Some of the above features are clearly more
    reliable than others, but we do not try to
    weight them
  • Verifying the features
  • Counts returned by search engines: page hits as
    a proxy for n-gram frequencies
  • Counts from the 1000 summaries
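
The post-processing step can be sketched as pattern matching over the returned summaries. The example below covers only the hyphen and possessive markers and counts every match as one unweighted vote; the function name and the exact regular expressions are assumptions made for illustration, not the authors' code.

```python
import re
from typing import Iterable, Tuple


def surface_votes(w1: str, w2: str, w3: str,
                  snippets: Iterable[str]) -> Tuple[int, int]:
    """Count (left, right) bracketing votes found in search-result snippets."""
    a, b, c = (re.escape(w) for w in (w1, w2, w3))
    left_patterns = [
        rf"\b{a}-{b}\s+{c}\b",       # hyphen: cell-cycle analysis
        rf"\b{a}\s+{b}'s\s+{c}\b",   # possessive: brain stem's cells
    ]
    right_patterns = [
        rf"\b{a}\s+{b}-{c}\b",       # hyphen: donor T-cell
        rf"\b{a}'s\s+{b}\s+{c}\b",   # possessive: brain's stem cells
    ]
    left = right = 0
    for text in snippets:
        left += sum(len(re.findall(p, text, re.IGNORECASE))
                    for p in left_patterns)
        right += sum(len(re.findall(p, text, re.IGNORECASE))
                     for p in right_patterns)
    return left, right
```

A bracketing decision then follows from comparing the two totals, for example choosing left bracketing when the first count is larger.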

13
Other Web-Derived Features
  • Abbreviations
  • tumor necrosis factor (NF)
  • tumor necrosis (TN) factor
  • Concatenation
  • health care reform -> healthcare, carereform
  • Wildcard (*)
  • health care * reform <-> health * care reform
  • Reorder
  • reform health care <-> care reform health
  • myosin heavy chain, heavy chain myosin
  • Internal inflection variability
  • tyrosine kinase activation, tyrosine kinases
    activation
  • Switching
  • For adult male rat, we would also expect male
    adult rat.
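
Several of these features reduce to issuing rewritten queries and comparing hit counts. The sketch below only generates the query strings; which bracketing each variant supports is this sketch's reading of the examples (variants that keep w1 and w2 together are taken as left evidence, those that keep w2 and w3 together or swap w1 and w2 as right evidence), and the function name is hypothetical.

```python
from typing import Dict, List


def query_variants(w1: str, w2: str, w3: str) -> Dict[str, List[str]]:
    """Build query strings whose hit counts would vote for each bracketing."""
    return {
        "left": [
            f"{w1}{w2} {w3}",      # concatenation: healthcare reform
            f"{w1} {w2} * {w3}",   # wildcard: health care * reform
            f"{w3} {w1} {w2}",     # reorder: reform health care
        ],
        "right": [
            f"{w1} {w2}{w3}",      # concatenation: health carereform
            f"{w1} * {w2} {w3}",   # wildcard: health * care reform
            f"{w2} {w3} {w1}",     # reorder: care reform health
            f"{w2} {w1} {w3}",     # switching: male adult rat
        ],
    }
```

Each string would be sent to a search engine as an exact-phrase query, and the left and right hit counts compared.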

15
Paraphrases
  • Warren (1978) proposes that paraphrases make
    the relations within an NC overt
  • stem cells in the brain (right bracketing)
  • cells from the brain stem (left bracketing)
  • Copula paraphrase
  • office building that/which is a skyscraper
  • pain associated with arthritis migraine
  • Search engines lack linguistic annotations
  • Use a small set of hand-chosen paraphrases:
    associated with, caused by, contained in, derived
    from, focusing on, found in, involved in, located
    at/in, made of, performed by, preventing, related
    to and used by/in/for
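
The paraphrase patterns can be sketched as simple string templates. In this illustration a paraphrase of the form "w3 LINK w1 w2" keeps w1 and w2 together (left bracketing), while "w2 w3 LINK w1" keeps w2 and w3 together (right bracketing). The function name, the optional "the", and the short linker list passed in the usage example are assumptions made for the sketch.

```python
from typing import Iterable, List, Tuple


def paraphrase_queries(w1: str, w2: str, w3: str,
                       linkers: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Return (left_queries, right_queries) for a three-word noun compound."""
    left: List[str] = []
    right: List[str] = []
    for link in linkers:
        for det in ("", "the "):                 # with and without a determiner
            left.append(f"{w3} {link} {det}{w1} {w2}")
            right.append(f"{w2} {w3} {link} {det}{w1}")
    return left, right


# Usage: two prepositions plus one of the hand-chosen verbal linkers.
left_q, right_q = paraphrase_queries("brain", "stem", "cells",
                                     ["in", "from", "associated with"])
```

Here left_q contains "cells from the brain stem" and right_q contains "stem cells in the brain", matching the two readings of "brain stem cells".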

16
Evaluations
  • Lauer's Dataset (1995)
  • 244 unambiguous 3-noun NCs
  • Biomedical Dataset (Nakov et al., 2005, SIG
    BioLink)
  • Open NLP tools
  • Sentence split, tokenized, POS tagged and
    shallow parsed a set of 1.4 million MEDLINE
    abstracts (citations between 1994 and 2003)
  • 500 NCs: 361 left, 69 right, 70 ambiguous

17
Experiments
  • Used MSN Search statistics for the n-grams and
    the paraphrases (unless the pattern contained a
    "*")
  • MSN always returned exact numbers
  • Google for the surface features
  • Google and Yahoo rounded their page hits, which
    generally leads to lower accuracy (Yahoo was
    better than Google for these estimates)

18
Tools Mentioned
  • UMLS Specialist lexicon
  • http://www.nlm.nih.gov/pubs/factsheets/umlslex.html
  • Carroll's morphological tools
  • http://www.cogs.susx.ac.uk/lab/nlp/carroll/morph.html

19
UMLS Lexicon
  • {base=AAA
    entry=E0000049
    cat=noun
    variants=metareg
    variants=uncount
    acronym_of=abdominal aortic aneurysmectomy|E0429482
    acronym_of=acne-associated arthritis|E0429483
    acronym_of=acquired aplastic anemia|E0429484
    acronym_of=acute anxiety attack|E0429485
    acronym_of=androgenic anabolic agent|E0429486
    acronym_of=aneurysm of ascending aorta
    acronym_of=aromatic amino acid|E0356310
    acronym_of=acute apical abscess|E0356309
    abbreviation_of=abdominal aortic aneurysm|E0006446
    }
  • {base=AAMD
    spelling_variant=A.A.M.D.
    entry=E0000050
    cat=noun
    variants=groupuncount
    acronym_of=American Association on Mental Deficiency|E0000277
    }
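
A record like the ones above can be read with a few lines of code, for example to list the expansions of an acronym. The sketch assumes one key=value field per line, with an optional "|" followed by an entry identifier after the value; that layout is an assumption about the lexicon's text format, and the function names are hypothetical.

```python
from typing import Dict, List

SAMPLE = """{base=AAA
entry=E0000049
cat=noun
acronym_of=abdominal aortic aneurysmectomy|E0429482
acronym_of=aneurysm of ascending aorta
abbreviation_of=abdominal aortic aneurysm|E0006446
}"""


def parse_record(text: str) -> Dict[str, List[str]]:
    """Parse one lexicon record into a mapping from field name to values."""
    record: Dict[str, List[str]] = {}
    for raw in text.strip().splitlines():
        line = raw.strip().strip("{}").strip()   # drop the record braces
        if "=" not in line:
            continue
        key, value = line.split("=", 1)
        record.setdefault(key, []).append(value)
    return record


def expansions(record: Dict[str, List[str]]) -> List[str]:
    """Return the long forms listed under acronym_of and abbreviation_of."""
    out: List[str] = []
    for key in ("acronym_of", "abbreviation_of"):
        for value in record.get(key, []):
            out.append(value.split("|", 1)[0])   # strip the entry identifier
    return out
```

Calling expansions(parse_record(SAMPLE)) lists the three long forms of "AAA" contained in the sample.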

23
Conclusions and Future Work
  • Improved upon the state-of-the-art approaches to
    NC bracketing
  • Future work includes
  • Test on NCs of > 3 words
  • Recognize the ambiguous cases
  • Include determiners and modifiers
  • Apply to other NLP problems
  • Refine the parser output
  • Parsers typically assume right bracketing