Title: Research Proposal


1
Research Proposal
  • by Navendu Garg

2
Research Question
  • Can Named Entities improve Query Precision in
    the Biomedicine domain?

3
Observations
  • Phrasing techniques have not been fine-tuned to
    the requirements of the biomedicine domain.
  • Most of the phrasing techniques have been based
    on n-gram models.
  • Research on phrasing and research on
    biomedicine-specific IR systems have been done
    in isolation from each other.
  • Due to the diversity of the vocabulary of the
    biomedicine domain, existing phrasing techniques
    may not improve query precision.
  • Biomedicine is a specialized domain with
    vocabulary rich in named entities.
  • Named Entity Recognition systems have shown
    remarkable precision in the news article domain.

4
Hypotheses
  • Adding named entities as phrases to the index of
    an IR system can increase retrieval accuracy
    because:
      • The biomedicine domain has a rich and diverse
        collection of named entities.
      • Users of such systems invariably search with
        named entities, e.g. N-acetyl-cysteine,
        84 kDa proteins.

5
Prior Work
  • Named Entity Extraction
      • Information Extraction task (MUC 1995)
      • Methods
          • Support Vector Machines
          • Hidden Markov Models
          • Maximum Entropy Models
      • Good performance, primarily on news articles

6
Prior Work
  • Same methods used for biomedicine NE
      • Results are not yet comparable, owing to the
        lack of training data.
      • Evidence of improvement in results over time.
  • Research in phrasing
      • Done in isolation from IR systems for
        biomedicine.
      • Research using collocation has been done for
        NE.
      • No known research on using NE to improve
        query precision.

7
Methods
  • Named Entity Tagging system
      • Identify features to extract named entities.
      • Train the supervised learning algorithm on the
        training corpus.
      • Tag the corpus with the trained classifier to
        extract named entities.
      • Post-process the extracted named entities.
  • Information Retrieval system
      • Extract tokens, including named entities, and
        build the inverted index.
      • Preprocess the query.
      • Run the processed query on the IR engine.
      • Get the ranked documents.
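
The Information Retrieval steps above can be sketched in a few lines. This is only a toy illustration, assuming a fixed entity list stands in for the tagger output and a plain TF-IDF sum stands in for the IR engine's ranking; the names ENTITIES, DOCS, tokenize, build_index and search are hypothetical and do not come from the proposal.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical stand-in for the output of the named entity tagger.
ENTITIES = ["n-acetyl-cysteine", "84 kda proteins"]

# Toy documents, invented for illustration only.
DOCS = {
    "d1": "N-acetyl-cysteine reduces oxidative stress",
    "d2": "Acetyl groups and cysteine residues in N terminal regions",
}


def tokenize(text, entities=()):
    """Lowercase the text; emit each known entity as one phrase token."""
    text = text.lower()
    tokens = []
    # Longest entities first so a shorter entity cannot split a longer one.
    for entity in sorted(entities, key=len, reverse=True):
        hits = text.count(entity)
        if hits:
            tokens.extend([entity] * hits)
            text = text.replace(entity, " ")
    tokens.extend(re.findall(r"[a-z0-9]+", text))
    return tokens


def build_index(docs, entities=()):
    """Inverted index: token -> {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for token, tf in Counter(tokenize(text, entities)).items():
            index[token][doc_id] = tf
    return index


def search(query, index, n_docs, entities=()):
    """Preprocess the query, then rank documents by a simple TF-IDF sum."""
    scores = defaultdict(float)
    for token in tokenize(query, entities):
        postings = index.get(token, {})
        if not postings:
            continue
        idf = math.log(1 + n_docs / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    # Highest score first; ties broken by document id.
    return sorted(scores, key=lambda d: (-scores[d], d))


baseline_index = build_index(DOCS)
entity_index = build_index(DOCS, ENTITIES)
```

On this toy data, the query "N-acetyl-cysteine" matches both documents in the baseline index, because the name is split into "n", "acetyl" and "cysteine", but only d1 in the entity index, where the name is a single token.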

8
Methods
  • Support Vector Machines
      • By far the best results in the biomedicine
        domain.
  • Maximum Entropy Tagger with
      • Boosting algorithm.
      • Voted perceptron.
      • Select the best result from N possible named
        entities.
      • Flexibility in incorporating features.
      • Scores discriminate in favour of correct
        hypotheses.
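
Selecting the best of N candidates with a voted perceptron can be sketched as follows. This is a simplified illustration, not the algorithm as published: it assumes each candidate is already described by a dictionary of numeric features, and the function names and the toy features (starts_lower, has_digit) are hypothetical.

```python
from collections import Counter


def dot(weights, features):
    """Sparse dot product between a weight dict and a feature dict."""
    return sum(weights.get(k, 0.0) * v for k, v in features.items())


def best_index(weights, candidates):
    """Index of the highest-scoring candidate (first one wins ties)."""
    scores = [dot(weights, f) for f in candidates]
    return max(range(len(candidates)), key=scores.__getitem__)


def train_voted_perceptron(examples, epochs=2):
    """examples: list of (candidates, gold_index) pairs.

    Returns a snapshot of the weights taken after every training example;
    each snapshot later casts one vote.
    """
    weights = {}
    snapshots = []
    for _ in range(epochs):
        for candidates, gold in examples:
            guess = best_index(weights, candidates)
            if guess != gold:
                # Move the weights toward the gold candidate and
                # away from the wrongly preferred one.
                for k, v in candidates[gold].items():
                    weights[k] = weights.get(k, 0.0) + v
                for k, v in candidates[guess].items():
                    weights[k] = weights.get(k, 0.0) - v
            snapshots.append(dict(weights))
    return snapshots


def rerank(snapshots, candidates):
    """Each stored weight vector votes for its top candidate."""
    votes = Counter(best_index(w, candidates) for w in snapshots)
    return votes.most_common(1)[0][0]


# Toy training data, invented for illustration only.
TRAIN = [
    ([{"starts_lower": 1.0}, {"has_digit": 1.0}], 1),
    ([{"has_digit": 1.0}, {"starts_lower": 1.0}], 0),
]
SNAPSHOTS = train_voted_perceptron(TRAIN)
```

Because candidates are plain feature dictionaries, new features can be added without changing the learner, which is the flexibility the slide refers to.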

9
Experiment Setup
  • An existing IR system specific to the biomedicine
    domain.
  • The GENIA corpus will be used to train the
    algorithm.
  • 10,000 commonly used biomedicine terms will be
    used for named entity extraction.
  • 10,000 commonly used prefixes and suffixes will
    be used for named entity extraction.
  • MEDLINE abstracts will be used to extract named
    entities and build the index; queries will be
    formed from these abstracts to test the system.
  • To extract named entities, certain features will
    be used, such as:
      • Orthographic features
      • Parts of speech
      • Prefixes and suffixes
      • Contextual words on the left and right side
        of the possible entity.
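
The feature types listed above could be computed per token roughly as follows. This is a minimal sketch: it assumes the part-of-speech tags are supplied by some external tagger, and the function name, feature names, affix length and window size are illustrative choices, not values from the proposal.

```python
def token_features(tokens, pos_tags, i, affix_len=3, window=2):
    """Build a feature dict for the token at position i."""
    word = tokens[i]
    feats = {
        "word.lower": word.lower(),
        # Part of speech, taken from the externally supplied tags.
        "pos": pos_tags[i],
        # Orthographic features.
        "has_digit": any(c.isdigit() for c in word),
        "has_hyphen": "-" in word,
        "init_cap": word[:1].isupper(),
        "all_caps": word.isupper(),
        # Prefix and suffix of fixed length.
        "prefix": word[:affix_len].lower(),
        "suffix": word[-affix_len:].lower(),
    }
    # Contextual words to the left and right of the candidate token.
    for offset in range(-window, window + 1):
        if offset == 0:
            continue
        j = i + offset
        inside = 0 <= j < len(tokens)
        feats[f"ctx[{offset:+d}]"] = tokens[j].lower() if inside else "<pad>"
    return feats


# Toy sentence with hand-assigned tags, for illustration only.
TOKENS = ["The", "84", "kDa", "proteins", "bind"]
POS = ["DT", "CD", "NN", "NNS", "VBP"]
FEATS = token_features(TOKENS, POS, 2)
```

Each dictionary can then be passed to whichever supervised learner is being trained.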

10
Performance Estimation
  • Named Entity Extraction / Query
      • Precision / Recall
      • F-Measure
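
For reference, these measures can be computed from a set of predicted items and a set of gold items, whether the items are extracted entities or retrieved documents. A minimal sketch; the function name and the beta parameter (beta = 1 gives the balanced F-measure) are illustrative.

```python
def precision_recall_f(predicted, gold, beta=1.0):
    """Return (precision, recall, F-measure) for two collections of items."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    b2 = beta * beta
    f_measure = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f_measure
```

For example, if four documents are retrieved and two of them are among three relevant documents, precision is 2/4 and recall is 2/3.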

11
Experimentation
  • Build the index
      • Without named entities, to get baseline
        performance.
      • With named entities, to get experiment
        results.
  • Use the 3 proposed methods to arrive at a result.

12
References
  • NLP in Biomedicine, ACL 2003 Workshop Program
      • Lee, K.J., Hwang, Y.S., Rim, H.C. "Two-Phase
        Biomedical NE Recognition based on SVMs"
      • Takeuchi, K., Collier, N. "Bio-medical Entity
        Extraction Using Support Vector Machines"
      • Shen, D., Zhang, J., Zhou, G., Su, J., Tan,
        C.L. "Effective Adaptation of Hidden Markov
        Model-based Named Entity Recognizer for
        Biomedical Domain"
      • Hou, W.J., Chen, H.H. "Enhancing Performance
        of Protein Name Recognizers Using
        Collocation"
  • Collins, M. (AT&T Labs-Research, Florham Park,
    New Jersey, mcollins_at_research.att.com)
    "Ranking Algorithms for Named-Entity Extraction:
    Boosting and the Voted Perceptron", Association
    for Computational Linguistics (ACL),
    Philadelphia, July 2002