Interactive Thesaurus Assessment for Automatic Document Annotation
1
Interactive Thesaurus Assessment for Automatic
Document Annotation
  • B552 Class Presentation
  • Sribabu Doddapaneni
  • Jaliya Ekanayake

2
Annotating Documents
  • Using a Thesaurus
  • Automatic
  • No human in the annotation loop. Ideal for very
    large document collections
  • Semi-automatic
  • Interactive annotation process. The user can make
    changes and refine the annotation
  • Manual
  • Used in libraries for centuries
  • Folksonomy
  • Brief descriptions added by users
  • Ad hoc representation of web artifacts using
    people's own vocabulary
  • MySpace, YouTube, del.icio.us, etc.
  • Linguistic Knowledge
  • Use of natural language processing and linguistic
    ideas to annotate the gist of documents

3
Why Annotate Documents?
  • To retrieve documents based on their semantics
    instead of keywords
  • To categorize a document set based on
    semantic categories

4
Keyword Based Annotation
  • Keyword: a significant word that can be used in
    an index to best describe the contents of a
    document
  • Properties
  • The choice of keywords depends on the indexing
    engine
  • Might ignore important index words due to their
    high frequency of appearance
  • Retrieved documents can be totally irrelevant
  • Cheaper
  • E.g. Google uses keywords plus PageRank
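Keyword-based indexing as described here is typically realized as an inverted index mapping each term to the documents that contain it. A minimal sketch, with made-up documents and naive whitespace tokenization:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each keyword to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

# Illustrative toy collection, not from the presentation
docs = {1: "coin production costs", 2: "cash flow analysis"}
index = build_inverted_index(docs)
# A query for "money" finds nothing, even though both documents are
# about money -- one reason keyword retrieval can miss relevant documents.
```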

5
Contextual Annotation
  • Documents are indexed based on contextual or
    semantic information
  • Properties
  • Largely unaffected by the indexing mechanism
  • Relevant documents are ranked higher
  • Can be costly to both index and retrieve

6
Keywords vs. Context
  • How does each fare?
  • Cost?
  • Is it always possible?

7
Abstract
  • Analyze Thesauri to identify their effect on
    automatic annotation
  • Use visual tools to identify problems and
    non-problems
  • Demonstrate that refining the thesaurus using
    this method actually performs better than manual
    annotation

8
The Experiment
  • Use two thesauri to annotate and index two
    document collections
  • Thesauri
  • Medical Subject Headings (MeSH)
  • Standard Thesaurus Wirtschaft (STW)
  • Document Collections
  • 800 Medline abstracts
  • 1000 abstracts provided by Elsevier B.V.

9
Automatically Assigned vs. Manually Selected
Concepts
10
Thesaurus Analyses
  • Improve the quality of the automatic indexing
  • Review the thesaurus
  • Detect parts that show unexpected behavior
  • Inspect them in detail
  • Revising the thesaurus
  • Adapting the thesaurus to changes in the
    vocabulary of the domain of interest by adding
    new terms
  • Deleting and/or merging rarely used terms
  • Splitting, extending, or restricting
    extensively used terms
  • Reviewing the thesaurus structure to avoid
    excessive subclassing
  • Identifying concepts that are problematic for
    the automatic indexer, i.e. concepts that are
    erroneously assigned due to misleading
    occurrences in documents where the term is used
    in a different sense

11
Evaluating Thesaurus Suitability
  • Information Content (IC)
  • Associate probabilities with concepts in the
    taxonomy
  • Depends on the corpus used
  • The taxonomy is augmented with a function
    p : C → [0, 1]
  • For any c ∈ C, p(c) is the probability of
    encountering an instance of concept c
  • p is monotonic as one moves up the taxonomy:
    p(coin) ≤ p(cash)
  • From information theory:
    IC(c) = −log p(c)
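The corpus-based IC above can be sketched as follows. The toy taxonomy and corpus are illustrative assumptions in the spirit of Resnik's measure (each occurrence of a concept also counts for all its ancestors, which makes p monotonic going up the taxonomy):

```python
import math
from collections import defaultdict

# Toy taxonomy: child -> parent, rooted at "entity" (hypothetical data)
parent = {"cash": "entity", "coin": "cash", "banknote": "cash"}

def ancestors(concept):
    """Yield the concept and all of its ancestors up to the root."""
    while concept is not None:
        yield concept
        concept = parent.get(concept)

def information_content(corpus_concepts):
    """IC(c) = -log p(c), where p(c) is estimated from corpus counts
    propagated up the taxonomy, so p(coin) <= p(cash) by construction."""
    counts = defaultdict(int)
    for c in corpus_concepts:
        for a in ancestors(c):
            counts[a] += 1
    total = counts["entity"]  # root count equals total occurrences
    return {c: -math.log(counts[c] / total) for c in counts}

ic = information_content(["coin", "coin", "banknote", "cash"])
# The root has p = 1 and therefore IC = 0; rarer concepts get higher IC.
```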

12
Evaluating Thesaurus Suitability Contd.
  • Intrinsic Information Content (IIC)
  • Depends only on the hierarchical structure
  • The more hyponyms a concept has, the less
    information it expresses
  • Leaf nodes are the most specific in terms of
    information content

Difference of Information Content (DIC):
DIC(c) = IC(c) − IIC(c)
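The slides do not spell out an IIC formula; a common formulation (due to Seco et al.) that matches the properties listed above uses only hyponym counts. A hedged sketch, with a hypothetical hierarchy:

```python
import math
from collections import defaultdict

# Toy hierarchy: child -> parent; only the structure matters for IIC
parent = {"cash": "entity", "coin": "cash", "banknote": "cash"}

def intrinsic_ic(parent):
    """IIC(c) = 1 - log(hypo(c) + 1) / log(N), where hypo(c) is the
    number of hyponyms (descendants) of c and N is the total number of
    concepts. Leaves get the maximum value 1; the root gets 0."""
    children = defaultdict(list)
    for child, p in parent.items():
        children[p].append(child)
    concepts = set(parent) | set(parent.values())
    n = len(concepts)

    def hypo(c):
        return sum(1 + hypo(ch) for ch in children[c])

    return {c: 1 - math.log(hypo(c) + 1) / math.log(n) for c in concepts}

def dic(ic, iic):
    """DIC(c) = IC(c) - IIC(c): how much the corpus-based IC deviates
    from what the thesaurus structure alone would predict."""
    return {c: ic[c] - iic[c] for c in ic if c in iic}

iic = intrinsic_ic(parent)
```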
13
Visualizing Thesaurus Suitability
Treemap algorithm (Ben Shneiderman)
Treemap with colors for the MeSH thesaurus
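Shneiderman's original slice-and-dice treemap, which the slide refers to, can be sketched as below. The hierarchy and sizes are hypothetical; in the presentation the rectangles would be thesaurus concepts sized by usage and colored by their IC difference:

```python
def treemap(node, sizes, children, x, y, w, h, depth=0, out=None):
    """Slice-and-dice treemap: partition the rectangle (x, y, w, h)
    among the children of `node` proportionally to their sizes,
    alternating the split axis at each depth."""
    if out is None:
        out = []
    out.append((node, (x, y, w, h)))
    kids = children.get(node, [])
    total = sum(sizes[k] for k in kids)
    offset = 0.0
    for k in kids:
        frac = sizes[k] / total
        if depth % 2 == 0:  # split along x at even depths
            treemap(k, sizes, children, x + offset * w, y, w * frac, h,
                    depth + 1, out)
        else:               # split along y at odd depths
            treemap(k, sizes, children, x, y + offset * h, w, h * frac,
                    depth + 1, out)
        offset += frac
    return out

# Hypothetical two-leaf hierarchy in the unit square
children = {"root": ["a", "b"], "a": [], "b": []}
sizes = {"root": 4, "a": 1, "b": 3}
rects = treemap("root", sizes, children, 0, 0, 1, 1)
```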
14
STW Analysis
  • Demo
  • IC Diff analysis identified the heavily used
    areas of the thesaurus
  • E.g. Business Economics
  • Some documents talk about the production of paper
  • Others merely use phrases like "in a recent
    paper ... suggest"

Higher IC → red; lower IC → blue
15
MeSH Analysis
  • Demo
  • Concepts with many sub-concepts
  • E.g. Angiosperms
  • The term "AS" is the problem
  • Corrected via the language interpreter
  • Does this have anything to do with the thesaurus?

16
Evaluation
  • Generalized precision and recall
  • Similarity measure
  • Results
  • Improvements in precision:
  • Removing the concept "paper" → 4.8
  • Fixing the normalizer → 1.2
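One common way to generalize precision and recall with a similarity measure is to credit each automatically assigned concept with its best similarity to any manually assigned (gold) concept, and vice versa; with exact-match similarity this reduces to ordinary precision and recall. Whether this is exactly the paper's formulation is an assumption:

```python
def generalized_precision_recall(assigned, gold, sim):
    """Generalized P/R: instead of exact matching, each assigned
    concept contributes its best similarity to any gold concept
    (precision), and each gold concept its best similarity to any
    assigned concept (recall). `sim(a, b)` returns a value in [0, 1]."""
    if not assigned or not gold:
        return 0.0, 0.0
    precision = sum(max(sim(a, g) for g in gold)
                    for a in assigned) / len(assigned)
    recall = sum(max(sim(a, g) for a in assigned)
                 for g in gold) / len(gold)
    return precision, recall

# Exact-match similarity recovers classical precision/recall
exact = lambda a, b: 1.0 if a == b else 0.0
p, r = generalized_precision_recall({"coin", "cash"}, {"cash"}, exact)
```

A taxonomy-based similarity (e.g. one derived from IC of the least common ancestor) would reward near-misses such as assigning "coin" when "cash" was expected.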

17
Review Questions
  • How do you find under-represented concepts?
  • Indexing methods that use semantics?
  • Other distance measures?
  • Natural language processing to understand
    "concept"?
  • Thesaurus based annotation vs. other methods
  • Link structure
  • Cosine similarity
  • Multimedia documents
  • Is DIC(c) = IC(c) − IIC(c) a reasonable measure?
  • Are the results adequate?