Comparing Frequency of Content-Bearing Words in Abstracts and Texts in Articles from Four Medical Journals: An Exploratory Study September 4, 2001

1
Comparing Frequency of Content-Bearing Words in
Abstracts and Texts in Articles from Four Medical
Journals: An Exploratory Study
September 4, 2001
James E. Ries, Kuichun Su, Gabriel Peterson,
MaryEllen C. Sievert, Timothy B. Patrick, David
E. Moxley, Lawrence D. Ries
CECS, HMI, Statistics, and SISLT
2
Abstract
  • Retrieval tests have assumed that the abstract is
    a true surrogate of the entire text. However,
    the frequency of terms in abstracts has never
    been compared to that of the articles they
    represent. Even though many sources are now
    available in full text, many retrieval systems
    still rely on the abstract.
  • In these four journals, the abstracts are
    lexical, as well as intellectual, surrogates for
    the documents they represent

3
Background
  • Many retrieval systems still use abstracts as
    surrogates for full text.
  • Abstracts are often indexed with respect to word
    occurrence by employing Zipf's Law.
  • Zipf's Law: the product of a word's occurrence
    frequency and the rank of that frequency is
    approximately constant.
  • The most frequent and least frequent words
    contribute little to article content.
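
The rank-frequency relationship above can be illustrated with a minimal Python sketch. The function name and the toy sentence are assumptions made for illustration; they are not from the study, and a sample this small will only roughly follow Zipf's Law.

```python
from collections import Counter

def rank_frequency_products(words):
    """Return (word, rank, frequency, rank * frequency) rows, most frequent first."""
    counts = Counter(words)
    rows = []
    # most_common() orders by descending frequency; ties keep first-seen order.
    for rank, (word, freq) in enumerate(counts.most_common(), start=1):
        rows.append((word, rank, freq, rank * freq))
    return rows

# Toy input, invented for illustration only.
sample = "the trial of the drug and the placebo in the trial".split()
for row in rank_frequency_products(sample):
    print(row)
```

Under Zipf's Law the last column would stay roughly constant over a large corpus; words at the extremes of the ranking are the ones the slide describes as contributing little content.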

4
Background (cont.)
  • Previous studies have shown that abstracts are
    sometimes inconsistent with their corresponding
    articles. However, no study has previously shown
    that abstracts and articles are inconsistent in a
    statistical sense.

5
Methods
  • 4 medical journals (BMJ, JAMA, Lancet, and NEJM)
  • Two different countries
  • Many medical subdisciplines
  • Regarded as top journals
  • Available in electronic format
  • Studied all articles published during 1999 that
    contained an abstract and were 2 pages or longer.
  • 1,138 articles - 35 parsing problems = 1,103
    articles

6
Methods (cont.)
  • Text of articles and abstracts was downloaded
    and stored in HTML.
  • HTML was parsed into separate abstract and
    article files via a custom C parsing program.
  • References and figures were removed.

7
Methods (cont.)
  • Content-bearing words extracted from abstracts
    and articles
  • Numerical values, special characters, and
    captions excluded and used as word delimiters
  • Removed words contained in a home-grown stop
    word list (words with little or no medical
    meaning)
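
The extraction step above might look roughly like the following sketch. The stop word set here is a small hypothetical stand-in (the study's home-grown list is not reproduced in the slides), and caption removal is not handled.

```python
import re

# Hypothetical stop word list; the study's home-grown list is not shown in the slides.
STOP_WORDS = {"the", "of", "and", "in", "a", "was", "were", "with", "to"}

def content_words(text, stop_words=STOP_WORDS):
    """Split on anything that is not a letter, lowercase, and drop stop words."""
    # Digits and special characters act as delimiters, so they never reach the output.
    tokens = re.split(r"[^A-Za-z]+", text.lower())
    return [t for t in tokens if t and t not in stop_words]

print(content_words("Oral contraceptives were studied in 1,200 women (aged 18-45)."))
```

Treating every non-letter as a delimiter is one simple reading of "excluded and used as word delimiters"; the original C parser may have behaved differently.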

8
Methods (cont.)
  • Remaining words conflated using NLM's LVG tools.
  • E.g., reading -> read
  • Frequencies of all conflated words were
    calculated for abstracts and articles.
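
As a rough illustration of conflation followed by frequency counting, the sketch below uses a crude suffix stripper. This is an assumed stand-in only and does not reproduce what NLM's LVG tools actually do; the function names and word list are hypothetical.

```python
from collections import Counter

def toy_conflate(word):
    """Crude suffix stripping; a stand-in only, NOT the behavior of NLM's LVG tools."""
    for suffix in ("ing", "s"):
        # Require a stem of at least three letters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def conflated_frequencies(words):
    """Count occurrences of each conflated word form."""
    return Counter(toy_conflate(w) for w in words)

# Toy input, invented for illustration only.
abstract_words = ["reading", "reads", "read", "contraceptive", "contraceptives"]
print(conflated_frequencies(abstract_words))
```

The same counting would be run once on the abstract and once on the article text, giving the two frequency tables that the analysis compares.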

9
Analysis
  • Used chi-squared test to determine whether
    discrepancies between observed occurrences in
    abstract and occurrences in articles were due to
    sampling or were truly indicative of a difference
    in content.

10
Analysis (cont.)
  • Example: Rosing (Lancet)
  • Abstract contained 140 content-bearing words
  • "contraceptive" appeared 6 times in the abstract
    and 35 times in the text of the article.
  • Since the text contained 1081 content-bearing
    words, expect (140/1081) × 35 occurrences of this
    term in the abstract (given as 3.35).

11
Analysis (cont.)
  • Example: Rosing (Lancet)
  • Since the actual number of occurrences was 6, the
    square of the error divided by the expected count
    was added to the chi-squared statistic for this
    particular word (i.e., (6 - 3.35)^2 / 3.35 ≈ 2.10).
  • Every other content-bearing word in the article
    was compared to the abstract in this way, and the
    sum of all of these terms was the total
    chi-squared statistic for the given article.
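
The per-article computation described on these slides can be sketched as follows. The function names and the toy count tables are assumptions; in particular, skipping words that occur only in the abstract is a choice made for this sketch, since the slides do not say how such words were handled.

```python
def chi_squared_term(observed, expected):
    """Contribution of one word: squared error divided by the expected count."""
    return (observed - expected) ** 2 / expected

def chi_squared_statistic(abstract_counts, text_counts):
    """Sum the per-word terms over every word that occurs in the article text."""
    abstract_total = sum(abstract_counts.values())
    text_total = sum(text_counts.values())
    statistic = 0.0
    for word, text_count in text_counts.items():
        # Expected abstract count if the abstract used the word at the text's rate.
        expected = abstract_total / text_total * text_count
        observed = abstract_counts.get(word, 0)  # 0 if absent from the abstract
        statistic += chi_squared_term(observed, expected)
    return statistic

# The slide gives 3.35 as the expected count; (140 / 1081) * 35 evaluates to
# about 4.53, so the slide's value is used here as given.
print(round(chi_squared_term(6, 3.35), 2))

# Toy count tables, invented for illustration only.
print(chi_squared_statistic({"a": 4}, {"a": 10, "b": 10}))
```

A large total suggests the abstract's word distribution differs from the text's by more than sampling alone would explain; turning the total into a p-value also requires the degrees of freedom, which the slides do not state.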

12
Analysis (cont.)
  • We reran our analysis using the Bonferroni
    inequality to ensure that we would not have
    incorrect results simply by virtue of our large
    sample size.
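
A Bonferroni adjustment is commonly applied by dividing the overall significance level by the number of tests. The sketch below assumes an overall level of 0.05 and one test per article (1,103 tests); neither value is stated in the slides, so both are illustrative assumptions.

```python
def bonferroni_threshold(alpha, num_tests):
    """Per-test significance level that keeps the family-wise error rate at or below alpha."""
    if num_tests < 1:
        raise ValueError("num_tests must be at least 1")
    return alpha / num_tests

# alpha = 0.05 and one test per article are assumptions, not values from the slides.
per_test_alpha = bonferroni_threshold(0.05, 1103)
print(f"{per_test_alpha:.2e}")
```

With the adjustment, an abstract/article pair is flagged as disagreeing only if its p-value falls below the much smaller per-test level, which is why fewer pairs are expected to be flagged than without it.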

13
Cumulative Results w/o Bonferroni
14
Cumulative Results w/o Bonferroni
15
Cumulative Results w/ Bonferroni
16
Cumulative Results w/ Bonferroni
17
Future Work
  • Utilize a smaller, more standard stop word list
    (see Su K, et al., Comparing Frequency of Word
    Occurrences in Abstracts and Texts Using Two Stop
    Word Lists, in Fall 2001 AMIA Proceedings).
  • Explore over-agreement.

18
Future Work (cont.)
  • Compare phrases (terms) rather than words.
  • Utilize the UMLS to compare Concept Unique
    Identifiers (CUIs) via MetaMap rather than words
    or phrases.
  • Changes in agreement/disagreement may indicate
    the use of synonyms which might still negatively
    affect retrieval.

19
Conclusion
  • In these four journals, the abstracts are
    lexical, as well as intellectual, surrogates for
    the documents they represent.
  • Our test was conservative in the sense that we
    can only strongly state that a small number of
    abstract/article pairs do disagree. For the
    remaining pairs, we can say only that they do
    not conclusively disagree.

20
Acknowledgements
  • This research was supported in part by grant
    T15-089 LM0708-09 from the National Library of
    Medicine, United States of America.

21
Questions
  • http://riesj.hmi.missouri.edu
  • JimR@acm.org