Statistical Models for MeSH Term Generation - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: Statistical Models for MeSH Term Generation


1
Statistical Models for MeSH Term Generation
  • April 18, 2004
  • Jay Urbain
  • Advisor: Ophir Frieder

2
Statistical Models for MeSH Term Generation
  • Identifying correct MeSH terms can improve
    MEDLINE search precision.
  • Work attempting to correlate noun phrase entities
    extracted from queries with only the definitions
    of MeSH terms has shown mixed results.
  • Increases in search precision have been achieved
    using relevance feedback techniques that
    iteratively augment the original query with MeSH
    terms from highly ranked documents in prior
    search results.
  • We believe the MEDLINE database has richer
    contextual information and more diverse language
    than the concise definitions of MeSH terms
    contained in the Metathesaurus.
  • We propose deriving models of the MeSH term
    generation process by building statistical models
    of each MeSH term from language models for
    MEDLINE documents.

3
MEDLINE
  • The MEDLINE database is a collection of more than
    10,000,000 citations from more than 4,200 medical
    journals, indexed using a Metathesaurus containing
    more than 19,000 Medical Subject Headings (MeSH).
  • The UMLS Metathesaurus is a comprehensive
    vocabulary that includes definitions of MeSH
    terms [UMLS 2000].
  • MeSH terms are organized in a hierarchy of
    categories and sub-categories and can contain
    term relations, synonyms, and hierarchical
    relationships.
  • About a dozen MeSH terms are manually assigned to
    each citation by professional indexers following
    the UMLS guide book.

4
Language Models
  • Language models in IR seek to model the query
    generation process [Lafferty 2003].
  • Query generation is seen as a process of randomly
    sampling a document or, more appropriately, that
    document's model of terms and term frequencies.
  • A model is generated for each document [Jackson
    2002].
  • Documents with a relatively high probability of
    generating the query are ranked highly.

5
Language Models
  • Given a document model M_d for document d, we
    seek the probability (maximum likelihood
    estimate) that the model generated the query Q,
    P(Q | M_d).
  • If a term t does not occur in d, then
    P(t | M_d) = 0 and therefore P(Q | M_d) = 0, so
    the estimates for missing terms need to be
    smoothed.
  • For terms that occur in the document, but only
    infrequently, the probability estimate is
    typically strengthened using the probability of
    observing term t in other documents.
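
As a minimal sketch of the query-likelihood idea on this slide, the code below scores documents by log P(Q | M_d) under a unigram model and smooths each term estimate with collection-wide counts. The linear-interpolation (Jelinek-Mercer) smoothing, the weight `lam`, the toy documents, and all function names are assumptions for illustration; the slides do not name a smoothing method.

```python
from collections import Counter
import math


def mle(term, doc_counts):
    """Maximum likelihood estimate of P(term | M_d) from raw term counts."""
    total = sum(doc_counts.values())
    return doc_counts[term] / total if total else 0.0


def query_log_likelihood(query_terms, doc_counts, coll_counts, lam=0.5):
    """log P(Q | M_d), mixing document and collection estimates.

    lam = 1.0 disables smoothing (pure document MLE).
    """
    coll_total = sum(coll_counts.values())
    log_p = 0.0
    for t in query_terms:
        p_doc = mle(t, doc_counts)
        p_coll = coll_counts[t] / coll_total if coll_total else 0.0
        p = lam * p_doc + (1.0 - lam) * p_coll
        if p == 0.0:
            # One zero term probability drives the whole product to zero.
            return float("-inf")
        log_p += math.log(p)
    return log_p


# Hypothetical toy documents standing in for MEDLINE citations.
docs = {
    "d1": "diabetes mellitus risk in adults".split(),
    "d2": "cardiovascular disease prevention in adults".split(),
}
doc_models = {name: Counter(tokens) for name, tokens in docs.items()}
collection = Counter()
for counts in doc_models.values():
    collection.update(counts)

query = ["diabetes", "adults"]
smoothed = {
    name: query_log_likelihood(query, model, collection)
    for name, model in doc_models.items()
}
unsmoothed = {
    name: query_log_likelihood(query, model, collection, lam=1.0)
    for name, model in doc_models.items()
}
```

Without smoothing, d2 gets probability zero because it never mentions "diabetes"; with smoothing it keeps a small non-zero score and still ranks below d1.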

6
Problem: MeSH Term Assignment
  • Assigning MeSH terms to a search query from a
    vocabulary of 19,000 MeSH terms requires expert
    knowledge of the hierarchical MeSH indexing
    scheme [Katcher 1999].
  • Even expert users tend to select MeSH terms
    that are too general. Studies have found that
    novice users using vector space model (VSM)
    search can be as effective as experts using
    Boolean search [Hersh 1994, 2000].
  • Novice and expert users alike need to find
    relevant documents with the facts needed to make
    decisions.

7
Problem: Sparse Data
  • Language models have a significant problem with
    respect to sparse data.
  • The maximum likelihood estimate needs to be
    smoothed for terms not occurring in the document.
  • Even if the term occurs in the document, a
    document-sized sample, particularly one as small
    as a MEDLINE citation, will most likely be too
    small for a reliable estimate.

8
Proposal: MeSH Term Models
  • We propose deriving models of the MeSH term
    generation process by building statistical models
    for each MeSH term from MEDLINE document
    statistics.
  • Given that adding the correct MeSH terms to a
    query increases precision, our approach addresses
    both problems:
  • First, we address the MeSH assignment problem by
    building a statistical model of each MeSH term
    from the MEDLINE citations that have actually
    been tagged with that term.
  • Second, we address the sparse data problem by
    building our models on the entire MEDLINE
    database, which contains more than 10,000,000
    citations.
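
One way to picture the proposal is to pool the text of every citation tagged with a given MeSH term into a single unigram model, then rank MeSH terms by how likely each model is to generate the query. This is only a sketch under stated assumptions: the toy citations, the function names `build_mesh_models` and `rank_mesh_terms`, the interpolation weight `lam`, and the choice of a smoothed unigram model are all hypothetical, not the models the slides eventually select.

```python
from collections import Counter, defaultdict
import math

# Hypothetical toy data: (citation tokens, MeSH terms assigned by indexers).
citations = [
    ("insulin resistance in type 2 diabetes".split(),
     ["Diabetes Mellitus, Type II"]),
    ("glucose control and insulin therapy".split(),
     ["Diabetes Mellitus, Type II"]),
    ("blood pressure and heart disease risk".split(),
     ["Cardiovascular Diseases"]),
    ("heart failure outcomes in adults".split(),
     ["Cardiovascular Diseases", "Adult"]),
]


def build_mesh_models(tagged_citations):
    """Pool term counts per MeSH term, plus a collection-wide background."""
    models = defaultdict(Counter)
    background = Counter()
    for tokens, mesh_terms in tagged_citations:
        background.update(tokens)
        for mesh in mesh_terms:
            models[mesh].update(tokens)
    return models, background


def log_likelihood(query_tokens, model, background, lam=0.7):
    """log P(query | MeSH model), smoothed with the background model."""
    model_total = sum(model.values())
    bg_total = sum(background.values())
    log_p = 0.0
    for t in query_tokens:
        p = lam * model[t] / model_total + (1 - lam) * background[t] / bg_total
        if p == 0.0:
            # Term unseen in the whole collection: skipped for every model,
            # so it does not change the ranking.
            continue
        log_p += math.log(p)
    return log_p


def rank_mesh_terms(query, models, background):
    """Return (MeSH term, score) pairs, best first."""
    tokens = query.lower().split()
    scored = [(m, log_likelihood(tokens, c, background)) for m, c in models.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


models, background = build_mesh_models(citations)
insulin_ranking = rank_mesh_terms("insulin therapy", models, background)
heart_ranking = rank_mesh_terms("heart disease", models, background)
```

Because each MeSH model pools many citations, its counts are far less sparse than those of any single citation, which is the point of the second bullet above.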

9
Protocol: MeSH Term Models
  • Identify a test corpus and complete regular
    expression preprocessing for identification of
    special terms and stop words.
  • Identify a baseline retrieval strategy, along
    with other strategies and utilities to evaluate.
  • Identify the most effective generative
    statistical model for determining the likelihood
    of a document text generating a MeSH term.
  • Compare the generative model with other
    classification models:
  • SVMs
  • Clustering techniques
  • Standard co-occurrence
10
References
  • Lafferty, John; Zhai, ChengXiang. Probabilistic
    Relevance Models Based on Document and Query
    Generation. In Language Modeling for Information
    Retrieval, 1-10. Kluwer, 2003.
  • Jackson, Peter; Moulinier, Isabelle. Natural
    Language Processing for Online Applications: Text
    Retrieval, Extraction, and Categorization. John
    Benjamins Publishing Company, 2002.
  • Durbin, R.; Eddy, S.; Krogh, A.; Mitchison, G.
    Biological Sequence Analysis. Cambridge
    University Press, 1998.
  • Aronson, Alan R.; Rindflesch, Thomas C. Query
    Expansion Using the UMLS Metathesaurus. American
    Medical Informatics Association Fall Symposium,
    1997.
  • Hersh, William; Buckley, Chris; Leone, T.J.;
    Hickam, David. OHSUMED: An Interactive Retrieval
    Evaluation and New Large Test Collection for
    Research. Proceedings of the 17th ACM-SIGIR
    Conference on Research and Development in
    Information Retrieval, pages 192-201, 1994.
  • Katcher, Brian A. MEDLINE: A Guide to Effective
    Searching. Ashbury Press, 1999.
  • Grossman, David A.; Frieder, Ophir. Information
    Retrieval: Algorithms and Heuristics. Kluwer
    Academic Publishers, 1998.
  • UMLS: Unified Medical Language System.
    http://www.nlm.nih.gov/research/umls/, 2003.
  • Hersh, William; Price, Susan; Donohoe, Larry.
    Assessing Thesaurus-Based Query Expansion Using
    the UMLS Metathesaurus. Proceedings of the AMIA
    Symposium, 2000: 344-8.
  • Voorhees, Ellen M.; Harman, Donna. Overview of
    the Ninth Text REtrieval Conference (TREC-9).
    National Institute of Standards and Technology,
    2000.
  • Urbain, Jay. Improving MEDLINE Search Precision
    with Relevance Feedback. IIT Technical Report,
    unpublished, 2004.

11
Sample MeSH
  <MeshHeadingList>
    <MeshHeading>
      <DescriptorName MajorTopicYN="N">Adult</DescriptorName>
    </MeshHeading>
    <MeshHeading>
      <DescriptorName MajorTopicYN="N">Cardiovascular Diseases</DescriptorName>
      <QualifierName MajorTopicYN="Y">epidemiology</QualifierName>
      <QualifierName MajorTopicYN="N">prevention &amp; control</QualifierName>
    </MeshHeading>
    <MeshHeading>
      <DescriptorName MajorTopicYN="N">Diabetes Mellitus, Type II</DescriptorName>
      <QualifierName MajorTopicYN="N">epidemiology</QualifierName>
    </MeshHeading>
  </MeshHeadingList>
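
A minimal sketch of reading a heading list like the sample above with Python's standard `xml.etree.ElementTree`. The embedded XML is a well-formed copy of the sample; the function name `parse_mesh_headings` and the tuple layout it returns are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<MeshHeadingList>
  <MeshHeading>
    <DescriptorName MajorTopicYN="N">Adult</DescriptorName>
  </MeshHeading>
  <MeshHeading>
    <DescriptorName MajorTopicYN="N">Cardiovascular Diseases</DescriptorName>
    <QualifierName MajorTopicYN="Y">epidemiology</QualifierName>
    <QualifierName MajorTopicYN="N">prevention &amp; control</QualifierName>
  </MeshHeading>
  <MeshHeading>
    <DescriptorName MajorTopicYN="N">Diabetes Mellitus, Type II</DescriptorName>
    <QualifierName MajorTopicYN="N">epidemiology</QualifierName>
  </MeshHeading>
</MeshHeadingList>"""


def parse_mesh_headings(xml_text):
    """Return a list of (descriptor, [(qualifier, is_major_topic), ...])."""
    root = ET.fromstring(xml_text)
    headings = []
    for heading in root.iter("MeshHeading"):
        descriptor = heading.find("DescriptorName")
        if descriptor is None:
            continue  # skip malformed headings with no descriptor
        qualifiers = [
            (q.text, q.get("MajorTopicYN") == "Y")
            for q in heading.findall("QualifierName")
        ]
        headings.append((descriptor.text, qualifiers))
    return headings


headings = parse_mesh_headings(SAMPLE)
descriptors = [name for name, _ in headings]
```

The descriptor names extracted this way are the labels a per-term model would be trained against.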