Title: Comparing Frequency of Content-Bearing Words in Abstracts and Texts in Articles from Four Medical Journals: An Exploratory Study September 4, 2001
1 Comparing Frequency of Content-Bearing Words in
Abstracts and Texts in Articles from Four Medical
Journals: An Exploratory Study
September 4, 2001
James E. Ries, Kuichun Su, Gabriel Peterson,
MaryEllen C. Sievert, Timothy B. Patrick, David
E. Moxley, Lawrence D. Ries
CECS, HMI, Statistics, and SISLT
2 Abstract
- Retrieval tests have assumed that the abstract is
a true surrogate of the entire text. However,
the frequency of terms in abstracts has never
been compared to that of the articles they
represent. Even though many sources are now
available in full text, many retrieval systems
still rely on the abstract.
- In these four journals, the abstracts are
lexical, as well as intellectual, surrogates for
the documents they represent.
3 Background
- Many retrieval systems still use abstracts as
surrogates for full text.
- Abstracts are often indexed with respect to word
occurrence by employing Zipf's Law.
- The product of a word's occurrence frequency and the
rank of that frequency is approximately constant.
- The most frequent and least frequent words
contribute little to article content.
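The rank-frequency relationship stated above can be illustrated with a minimal sketch. The function name and the toy corpus are hypothetical, and the word counts are constructed so that the product comes out constant; real text only approximates this.

```python
from collections import Counter


def rank_frequency_products(words):
    """Return (word, rank, frequency, rank * frequency) rows, most frequent first."""
    counts = Counter(words)
    # Sort by descending frequency, breaking ties alphabetically.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(word, rank, freq, rank * freq)
            for rank, (word, freq) in enumerate(ranked, start=1)]


# Hypothetical toy corpus built so that frequency equals 12 / rank.
toy_words = ["the"] * 12 + ["of"] * 6 + ["patient"] * 4 + ["trial"] * 3
rows = rank_frequency_products(toy_words)
for word, rank, freq, product in rows:
    print(f"{word:8s} rank={rank} freq={freq:2d} rank*freq={product}")
```

In this toy case every row has a product of 12. In a real corpus the products only cluster around a constant.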
4 Background (cont.)
- Previous studies have shown that abstracts are
sometimes inconsistent with their corresponding
articles. However, no study has previously shown
that abstracts and articles are inconsistent in a
statistical sense.
5 Methods
- Four medical journals (BMJ, JAMA, Lancet, and NEJM)
- Two different countries
- Many medical subdisciplines
- Regarded as top journals
- Available in electronic format
- Studied all articles published during 1999 that
contained an abstract and were 2 pages or longer.
- 1,138 articles, minus 35 with parsing problems,
left 1,103 articles.
6 Methods (cont.)
- Texts of articles and abstracts were downloaded
and stored as HTML.
- HTML was parsed into separate abstract and
article files via a custom C parsing program.
- References and figures were removed.
7 Methods (cont.)
- Content-bearing words extracted from abstracts
and articles
- Numerical values, special characters, and
captions excluded and used as word delimiters
- Removed words contained in a home-grown stop
word list (words with little or no medical
meaning)
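The extraction step above can be sketched roughly as follows. This is an illustration under assumptions, not the study's program: the stop word set is a hypothetical placeholder (the home-grown list is not reproduced in these slides), and caption removal is not handled.

```python
import re

# Hypothetical stop word list; the study's home-grown list is not shown here.
STOP_WORDS = {"the", "of", "and", "in", "a", "was", "were", "with", "to"}


def content_bearing_words(text, stop_words=STOP_WORDS):
    """Treat every run of non-letters as a delimiter, lowercase, drop stop words."""
    tokens = re.split(r"[^A-Za-z]+", text)
    return [t.lower() for t in tokens if t and t.lower() not in stop_words]


sample = "Oral contraceptives were used in 35% of the 140 women."
print(content_bearing_words(sample))
```

Numbers and punctuation never become tokens here because they act only as separators, which mirrors the "excluded and used as word delimiters" rule.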
8 Methods (cont.)
- Remaining words conflated using NLM's LVG tools.
- E.g., reading -> read
- Frequencies of all conflated words were
calculated for abstracts and articles.
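Conflation followed by counting might look like the sketch below. The `conflate` function is a deliberately crude, hypothetical stand-in that strips two suffixes. It does not call or imitate the LVG tools, which perform far richer lexical normalization.

```python
from collections import Counter


def conflate(word):
    """Toy stand-in for a lexical normalizer: strip a trailing 'ing' or 's'."""
    for suffix in ("ing", "s"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def conflated_frequencies(words):
    """Count occurrences of each conflated form."""
    return Counter(conflate(w) for w in words)


abstract_words = ["reading", "reads", "read", "contraceptives"]
freqs = conflated_frequencies(abstract_words)
print(freqs)
```

Counting after conflation means that variants such as "reading" and "reads" contribute to a single frequency.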
9 Analysis
- Used a chi-squared test to determine whether
discrepancies between observed occurrences in
abstract and occurrences in articles were due to
sampling or were truly indicative of a difference
in content.
10 Analysis (cont.)
- Example: Rosing (Lancet)
- Abstract contained 140 content-bearing words
- "contraceptive" appeared 6 times in the abstract
and 35 times in the text of the article.
- Since the text contained 1081 content-bearing words,
expect (140/1081) × 35 = 3.35 occurrences of this
term in the abstract.
11 Analysis (cont.)
- Example: Rosing (Lancet)
- The actual number of occurrences was 6, so the square of
the error divided by the expected count was added to
the chi-squared statistic for this particular
word (i.e., ((6 - 3.35)^2)/3.35 = 2.10).
- Every other content-bearing word in the article
was compared to the abstract in this way, and the sum
of all of these terms was the total chi-squared
statistic for the given article.
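The per-word term and its sum over an article can be sketched as below. Function names and the toy counts are hypothetical, and the loop covers only words that occur in the article text, as the example describes. Note that evaluating 140/1081 × 35 directly gives about 4.53 rather than 3.35, so one of the figures in the example may have been garbled. The sketch therefore takes the expected value of 3.35 as given when reproducing the 2.10 term.

```python
from collections import Counter


def expected_count(text_count, abstract_total, text_total):
    """Expected abstract occurrences if the abstract mirrors the text's proportions."""
    return abstract_total / text_total * text_count


def chi_squared_term(observed, expected):
    """Squared error divided by the expected count, for one word."""
    return (observed - expected) ** 2 / expected


def chi_squared_statistic(abstract_counts, text_counts):
    """Sum the per-word terms over every word that occurs in the article text."""
    abstract_total = sum(abstract_counts.values())
    text_total = sum(text_counts.values())
    total = 0.0
    for word, text_count in text_counts.items():
        expected = expected_count(text_count, abstract_total, text_total)
        total += chi_squared_term(abstract_counts.get(word, 0), expected)
    return total


# Per-word term using the example's observed (6) and expected (3.35) values.
rosing_term = chi_squared_term(6, 3.35)
print(round(rosing_term, 2))

# Hypothetical toy counts, not data from the study.
text_counts = Counter({"contraceptive": 30, "risk": 20, "thrombosis": 10})
abstract_counts = Counter({"contraceptive": 4, "risk": 1, "thrombosis": 1})
toy_statistic = chi_squared_statistic(abstract_counts, text_counts)
print(round(toy_statistic, 3))
```

Turning the statistic into a p-value would additionally need the chi-squared distribution with the appropriate degrees of freedom, which this sketch leaves out.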
12 Analysis (cont.)
- We reran our analysis using the Bonferroni
inequality to ensure that we would not
have incorrect results simply by virtue of our
large sample size.
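A Bonferroni adjustment is commonly applied by dividing the significance level by the number of tests. The sketch below shows that general idea only. The slides do not state the significance level or the family of tests used, so `alpha=0.05` and the p-values are hypothetical.

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Flag tests whose p-value falls below alpha divided by the number of tests."""
    threshold = alpha / len(p_values)
    return threshold, [p < threshold for p in p_values]


# Hypothetical p-values, one per abstract/article pair.
p_values = [0.0001, 0.004, 0.03, 0.2]
threshold, flags = bonferroni_significant(p_values)
print(threshold, flags)
```

With more tests the per-test threshold shrinks, so fewer pairs are declared to disagree. This makes the procedure conservative.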
13 Cumulative Results w/o Bonferroni
14 Cumulative Results w/o Bonferroni
15 Cumulative Results w/ Bonferroni
16 Cumulative Results w/ Bonferroni
17 Future Work
- Utilize a smaller, more standard stop word list
(see Su K, et al., Comparing Frequency of Word
Occurrences in Abstracts and Texts Using Two Stop
Word Lists, in Fall 2001 AMIA Proceedings).
- Explore over-agreement.
18 Future Work (cont.)
- Compare phrases (terms) rather than words.
- Utilize the UMLS to compare Concept Unique
Identifiers (CUIs) via MetaMap rather than words
or phrases.
- Changes in agreement/disagreement may indicate
the use of synonyms which might still negatively
affect retrieval.
19 Conclusion
- In these four journals, the abstracts are
lexical, as well as intellectual, surrogates for
the documents they represent.
- Our test was conservative in the sense that we
can only strongly state that a small number of
abstract/article pairs do disagree. However,
the remaining articles can only be said to not
conclusively disagree.
20 Acknowledgements
- This research was supported in part by grant
T15-089 LM0708-09 from the National Library of
Medicine, United States of America.
21 Questions
- http://riesj.hmi.missouri.edu
- JimR_at_acm.org