Comparing Frequency of Content-Bearing Words in Abstracts and Texts in Articles from Four Medical Journals: An Exploratory Study September 4, 2001

1
Comparing Frequency of Content-Bearing Words in
Abstracts and Texts in Articles from Four Medical
Journals: An Exploratory Study
September 4, 2001
James E. Ries, Kuichun Su, Gabriel Peterson,
MaryEllen C. Sievert, Timothy B. Patrick, David
E. Moxley, Lawrence D. Ries
CECS, HMI, Statistics, and SISLT
2
Abstract

Retrieval tests have assumed that the abstract is
a true surrogate of the entire text. However,
the frequency of terms in abstracts has never
been compared to that of the articles they
represent. Even though many sources are now
available in full-text, many still rely on the
abstract for retrieval.
In these four journals, the abstracts are
lexical, as well as intellectual, surrogates for
the documents they represent.

3
Background

Many retrieval systems still use abstracts as
surrogates for full text.
Abstracts are often indexed with respect to word
occurrence by employing Zipf's Law.
The product of a word's occurrence frequency and the
rank of that frequency is approximately constant.
The most frequent and least frequent words
contribute little to article content.
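
As a rough illustration of the Zipf's Law statement above, the following Python sketch ranks words by frequency and reports the product of rank and frequency. The sample sentence and the function name `rank_frequency_products` are made up for illustration and are not from the study. On a sample this small the product is only loosely constant; the relationship is meant for large bodies of text.

```python
from collections import Counter

def rank_frequency_products(words):
    """Return (word, rank, frequency, rank * frequency) tuples, most frequent first."""
    counts = Counter(words)
    # most_common() sorts by descending frequency; ties keep first-seen order.
    ranked = counts.most_common()
    return [(w, r, f, r * f) for r, (w, f) in enumerate(ranked, start=1)]

# Toy sample, not data from the study.
sample = "the cat and the dog and the bird".split()
for word, rank, freq, product in rank_frequency_products(sample):
    print(f"{word:>5}  rank={rank}  freq={freq}  rank*freq={product}")
```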

4
Background (cont.)

Previous studies have shown that abstracts are
sometimes inconsistent with their corresponding
articles. However, no study has previously shown
that abstracts and articles are inconsistent in a
statistical sense.

5
Methods

4 medical journals (BMJ, JAMA, Lancet, and NEJM)
Two different countries
Many medical subdisciplines
Regarded as top journals
Available in electronic format
Studied all articles published during 1999 that
contained an abstract and were 2 pages or longer.
1,138 articles - 35 parsing problems = 1,103
articles

6
Methods (cont.)

The text of articles and abstracts was downloaded
and stored as HTML.
HTML was parsed into separate abstract and
article files via a custom C parsing program.
References and figures were removed.

7
Methods (cont.)

Content-bearing words extracted from abstracts
and articles
Numerical values, special characters, and
captions excluded and used as word delimiters
Removed words contained in a home-grown stop
word list (words with little or no medical
meaning)
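
The extraction step above might be sketched as follows in Python. This is a minimal sketch under assumptions: the stop word list here is a hypothetical placeholder (the study's home-grown list is not shown in the slides), the name `content_words` is invented, and caption removal is not handled.

```python
import re

# Hypothetical stop word list; the study's home-grown list is not shown in the slides.
STOP_WORDS = {"the", "of", "and", "in", "a", "to", "was", "were", "with"}

def content_words(text, stop_words=STOP_WORDS):
    """Split on anything that is not a letter, lowercase, and drop stop words."""
    # Digits and special characters act as delimiters, so "p<0.05" yields just "p".
    tokens = re.split(r"[^A-Za-z]+", text.lower())
    return [t for t in tokens if t and t not in stop_words]

print(content_words("Use of the contraceptive was reported in 35% of women (n=140)."))
```

Treating every non-letter as a delimiter is the simplest reading of the slide; a real pipeline would probably also need rules for hyphenated terms and abbreviations.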

8
Methods (cont.)

Remaining words were conflated using NLM's LVG tools.
E.g., reading -> read
Frequencies of all conflated words were
calculated for abstracts and articles.
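
The conflation and counting steps could look roughly like this in Python. The lookup table `CONFLATION` is a toy stand-in for a real normalizer such as LVG, whose interface is not shown here; the mappings and function names are assumptions made for illustration.

```python
from collections import Counter

# Toy stand-in for a real normalizer such as NLM's LVG; these mappings are illustrative only.
CONFLATION = {"reading": "read", "reads": "read", "contraceptives": "contraceptive"}

def conflate(word):
    """Map a word to its base form, leaving unknown words unchanged."""
    return CONFLATION.get(word, word)

def conflated_frequencies(words):
    """Count occurrences of each conflated word."""
    return Counter(conflate(w) for w in words)

freqs = conflated_frequencies(["reading", "read", "reads", "contraceptive", "contraceptives"])
print(freqs)
```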

9
Analysis

Used chi-squared test to determine whether
discrepancies between observed occurrences in
abstract and occurrences in articles were due to
sampling or were truly indicative of a difference
in content.

10
Analysis (cont.)

Example: Rosing (Lancet)
The abstract contained 140 content-bearing words.
The word "contraceptive" appeared 6 times in the abstract
and 35 times in the text of the article.
Since the text contained 1081 content-bearing words, we
expect (140/1081) × 35 = 3.35 occurrences of this
term in the abstract.
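
The expected-count calculation can be written as a small Python function. The function and parameter names are assumptions. Evaluated on the figures as transcribed in the slide (140, 1081, 35), this formula gives roughly 4.53 rather than 3.35, so one of the transcribed figures may have been garbled; the example therefore uses round illustrative numbers instead.

```python
def expected_abstract_count(abstract_total, text_total, text_count):
    """Expected occurrences of a word in the abstract if it appeared at the text's rate."""
    # Scale the word's count in the text by the abstract-to-text size ratio.
    return abstract_total * text_count / text_total

# Round numbers chosen for illustration, not taken from the study.
print(expected_abstract_count(abstract_total=100, text_total=1000, text_count=30))
```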

11
Analysis (cont.)

Example: Rosing (Lancet)
The actual number of occurrences was 6, so the square of
the error divided by the expected count was added to
the chi-squared statistic for this particular
word (i.e., (6 - 3.35)^2 / 3.35 = 2.10).
Every other content-bearing word in the article
was compared to the abstract in this way, and the sum
of all of these terms was the total chi-squared
statistic for the given article.
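
A minimal Python sketch of the per-article statistic described above. It assumes word counts are held in dictionaries, that every word in `text_counts` has a count greater than zero, and that words found only in the abstract are ignored; none of these details is confirmed by the slides, and the function name is invented.

```python
def chi_squared_statistic(abstract_counts, text_counts):
    """Sum (observed - expected)^2 / expected over every word in the article text."""
    abstract_total = sum(abstract_counts.values())
    text_total = sum(text_counts.values())
    statistic = 0.0
    for word, text_count in text_counts.items():
        # Expected abstract count if the word appeared at the same rate as in the text.
        expected = abstract_total * text_count / text_total
        observed = abstract_counts.get(word, 0)
        statistic += (observed - expected) ** 2 / expected
    return statistic

# The single-word term from the slide: observed 6, expected 3.35.
print(round((6 - 3.35) ** 2 / 3.35, 2))  # 2.1
```

When the abstract uses every word in exactly the same proportion as the text, each term is zero and so is the statistic; larger values indicate a larger lexical discrepancy.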

12
Analysis (cont.)

We reran our analysis using the Bonferroni
inequality to ensure that we would not
have incorrect results simply by virtue of our
large sample size.
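
One common way to apply a Bonferroni correction is to compare each p-value against the significance level divided by the number of tests. The Python sketch below shows that idea only; the slides do not state the significance level used or exactly how the correction was applied, so the default `alpha`, the function name, and the sample p-values are all assumptions.

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Flag each p-value as significant at the Bonferroni-adjusted level alpha / m."""
    m = len(p_values)
    if m == 0:
        return []
    # Dividing alpha by the number of tests bounds the chance of any false positive.
    threshold = alpha / m
    return [p < threshold for p in p_values]

# Hypothetical p-values for four articles; not results from the study.
print(bonferroni_significant([0.001, 0.02, 0.04, 0.5]))
```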

13
Cumulative Results w/o Bonferroni
14
Cumulative Results w/o Bonferroni
15
Cumulative Results w/ Bonferroni
16
Cumulative Results w/ Bonferroni
17
Future Work

Utilize a smaller, more standard stop word list
(see Su K, et al., Comparing Frequency of Word
Occurrences in Abstracts and Texts Using Two Stop
Word Lists, in Fall 2001 AMIA Proceedings).
Explore over agreement.

18
Future Work (cont.)

Compare phrases (terms) rather than words.
Utilize the UMLS to compare Concept Unique
Identifiers (CUIs) via MetaMap rather than words
or phrases.
Changes in agreement/disagreement may indicate
the use of synonyms which might still negatively
affect retrieval.

19
Conclusion

In these four journals, the abstracts are
lexical, as well as intellectual, surrogates for
the documents they represent.
Our test was conservative in the sense that we
can only strongly state that a small number of
abstract/article pairs do disagree. However,
the remaining articles can only be said to not
conclusively disagree.

20
Acknowledgements