Interactive Thesaurus Assessment for Automatic Document Annotation
1
Interactive Thesaurus Assessment for Automatic
Document Annotation
  • B552 Class Presentation
  • Sribabu Doddapaneni
  • Jaliya Ekanayake

2
Annotating Documents
  • Using a Thesaurus
  • Automatic
  • No human in the annotation loop. Ideal for very
    large document collections
  • Semi-automatic
  • Interactive annotation process. The user can make
    changes and refine the annotation
  • Manual
  • Used in libraries for centuries
  • Folksonomy
  • Brief descriptions added by users
  • Ad hoc representation of web artifacts using
    people's own vocabulary
  • MySpace, YouTube, del.icio.us, etc.
  • Linguistic Knowledge
  • Use of natural language processing and linguistic
    ideas to annotate the gist of documents

3
Why Annotate Documents?
  • To retrieve documents based on their semantics
    instead of keywords
  • To categorize a document set based on
    semantic categories

4
Keyword Based Annotation
  • Keyword: a significant word that can be used in
    an index to best describe the contents of a
    document
  • Properties
  • The choice of keywords depends on the indexing
    engine
  • Might ignore important index words due to their
    high frequency of appearance
  • Retrieved documents can be totally irrelevant
  • Cheaper
  • E.g. Google uses keywords plus PageRank
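Keyword-based indexing as described here is typically realized as an inverted index mapping each term to the documents that contain it. A minimal sketch, with made-up documents and naive whitespace tokenization:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each keyword to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

# Illustrative toy collection, not from the presentation
docs = {1: "coin production costs", 2: "cash flow analysis"}
index = build_inverted_index(docs)
# A query for "money" finds nothing, even though both documents are
# about money -- one reason keyword retrieval can miss relevant documents.
```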

5
Contextual Annotation
  • Documents are indexed based on contextual or
    semantic information
  • Properties
  • Largely unaffected by the indexing mechanism
  • Relevant documents are ranked higher
  • Can be costly to both index and retrieve

6
Keywords vs. Context
  • How does each fare?
  • Cost?
  • Is it always possible?

7
Abstract
  • Analyze Thesauri to identify their effect on
    automatic annotation
  • Use visual tools to identify problems and
    non-problems
  • Demonstrate that refining the thesaurus using
    this method actually performs better than manual
    annotation

8
The Experiment
  • Use two thesauri to annotate and index two
    document collections
  • Thesauri
  • Medical Subject Headings (MeSH)
  • Standard Thesaurus Wirtschaft (STW)
  • Document Collections
  • 800 Medline abstracts
  • 1000 abstracts provided by Elsevier B.V.

9
Automatically Assigned vs. Manually Selected
Concepts
10
Thesaurus Analyses
  • Improve the quality of the automatic indexing
  • Review the thesaurus
  • Detect parts that show unexpected behavior
  • Inspect them in detail
  • Revising the thesaurus
  • Adapting the thesaurus to changes in the
    vocabulary of the domain of interest by adding
    new terms
  • Deleting and/or merging rarely used terms
  • Splitting, extending, or restricting
    extensively used terms
  • Reviewing the thesaurus structure to avoid
    excessive subclassing
  • Identifying concepts that are problematic for
    the automatic indexer, i.e. concepts that are
    erroneously assigned due to misleading
    occurrences in documents where the term is used
    in a different sense

11
Evaluating Thesaurus Suitability
  • Information Content (IC)
  • Associate probabilities with concepts in the
    taxonomy
  • Depends on the corpus used
  • The taxonomy is augmented with a function
    p : C → [0, 1]
  • For any c ∈ C, p(c) is the probability of
    encountering an instance of concept c
  • p is monotonic as one moves up the taxonomy:
    p(coin) ≤ p(cash)
  • From information theory:
    IC(c) = −log p(c)
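The corpus-based IC above can be sketched as follows. The toy taxonomy and corpus are illustrative assumptions in the spirit of Resnik's measure (each occurrence of a concept also counts for all its ancestors, which makes p monotonic going up the taxonomy):

```python
import math
from collections import defaultdict

# Toy taxonomy: child -> parent, rooted at "entity" (hypothetical data)
parent = {"cash": "entity", "coin": "cash", "banknote": "cash"}

def ancestors(concept):
    """Yield the concept and all of its ancestors up to the root."""
    while concept is not None:
        yield concept
        concept = parent.get(concept)

def information_content(corpus_concepts):
    """IC(c) = -log p(c), where p(c) is estimated from corpus counts
    propagated up the taxonomy, so p(coin) <= p(cash) by construction."""
    counts = defaultdict(int)
    for c in corpus_concepts:
        for a in ancestors(c):
            counts[a] += 1
    total = counts["entity"]  # root count equals total occurrences
    return {c: -math.log(counts[c] / total) for c in counts}

ic = information_content(["coin", "coin", "banknote", "cash"])
# The root has p = 1 and therefore IC = 0; rarer concepts get higher IC.
```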

12
Evaluating Thesaurus Suitability Contd.
  • Intrinsic Information Content (IIC)
  • Depends only on the hierarchical structure
  • The more hyponyms a concept has, the less
    information it expresses
  • Leaf nodes are the most specific in terms of
    information content

Difference of Information Content (DIC):
DIC(c) = IC(c) − IIC(c)
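The slides do not spell out an IIC formula; a common formulation (due to Seco et al.) that matches the properties listed above uses only hyponym counts. A hedged sketch, with a hypothetical hierarchy:

```python
import math
from collections import defaultdict

# Toy hierarchy: child -> parent; only the structure matters for IIC
parent = {"cash": "entity", "coin": "cash", "banknote": "cash"}

def intrinsic_ic(parent):
    """IIC(c) = 1 - log(hypo(c) + 1) / log(N), where hypo(c) is the
    number of hyponyms (descendants) of c and N is the total number of
    concepts. Leaves get the maximum value 1; the root gets 0."""
    children = defaultdict(list)
    for child, p in parent.items():
        children[p].append(child)
    concepts = set(parent) | set(parent.values())
    n = len(concepts)

    def hypo(c):
        return sum(1 + hypo(ch) for ch in children[c])

    return {c: 1 - math.log(hypo(c) + 1) / math.log(n) for c in concepts}

def dic(ic, iic):
    """DIC(c) = IC(c) - IIC(c): how much the corpus-based IC deviates
    from what the thesaurus structure alone would predict."""
    return {c: ic[c] - iic[c] for c in ic if c in iic}

iic = intrinsic_ic(parent)
```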
13
Visualizing Thesaurus Suitability
Treemap algorithm (Ben Shneiderman)
Treemap with colors for the MeSH thesaurus
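Shneiderman's original slice-and-dice treemap, which the slide refers to, can be sketched as below. The hierarchy and sizes are hypothetical; in the presentation the rectangles would be thesaurus concepts sized by usage and colored by their IC difference:

```python
def treemap(node, sizes, children, x, y, w, h, depth=0, out=None):
    """Slice-and-dice treemap: partition the rectangle (x, y, w, h)
    among the children of `node` proportionally to their sizes,
    alternating the split axis at each depth."""
    if out is None:
        out = []
    out.append((node, (x, y, w, h)))
    kids = children.get(node, [])
    total = sum(sizes[k] for k in kids)
    offset = 0.0
    for k in kids:
        frac = sizes[k] / total
        if depth % 2 == 0:  # split along x at even depths
            treemap(k, sizes, children, x + offset * w, y, w * frac, h,
                    depth + 1, out)
        else:               # split along y at odd depths
            treemap(k, sizes, children, x, y + offset * h, w, h * frac,
                    depth + 1, out)
        offset += frac
    return out

# Hypothetical two-leaf hierarchy in the unit square
children = {"root": ["a", "b"], "a": [], "b": []}
sizes = {"root": 4, "a": 1, "b": 3}
rects = treemap("root", sizes, children, 0, 0, 1, 1)
```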
14
STW Analysis
  • Demo
  • IC Diff analysis identified the heavily used
    areas of the thesaurus
  • E.g. Business Economics
  • Some documents talk about the production of paper
  • Others merely use phrases like "in a recent
    paper ... suggest"

Higher IC → red; lower IC → blue
15
MeSH Analysis
  • Demo
  • Concepts with many sub-concepts
  • E.g. Angiosperms
  • The term "AS" is the problem
  • Corrected via the language interpreter
  • Does this have anything to do with the thesaurus?

16
Evaluation
  • Generalized precision and recall
  • Similarity measure
  • Results
  • Improvements in precision:
  • Removing the concept "paper" → 4.8
  • Fixing the normalizer → 1.2
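One common way to generalize precision and recall with a similarity measure is to credit each automatically assigned concept with its best similarity to any manually assigned (gold) concept, and vice versa; with exact-match similarity this reduces to ordinary precision and recall. Whether this is exactly the paper's formulation is an assumption:

```python
def generalized_precision_recall(assigned, gold, sim):
    """Generalized P/R: instead of exact matching, each assigned
    concept contributes its best similarity to any gold concept
    (precision), and each gold concept its best similarity to any
    assigned concept (recall). `sim(a, b)` returns a value in [0, 1]."""
    if not assigned or not gold:
        return 0.0, 0.0
    precision = sum(max(sim(a, g) for g in gold)
                    for a in assigned) / len(assigned)
    recall = sum(max(sim(a, g) for a in assigned)
                 for g in gold) / len(gold)
    return precision, recall

# Exact-match similarity recovers classical precision/recall
exact = lambda a, b: 1.0 if a == b else 0.0
p, r = generalized_precision_recall({"coin", "cash"}, {"cash"}, exact)
```

A taxonomy-based similarity (e.g. one derived from IC of the least common ancestor) would reward near-misses such as assigning "coin" when "cash" was expected.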

17
Review Questions
  • How do you find under-represented concepts?
  • Indexing methods that use semantics?
  • Other distance measures?
  • Natural language processing to understand
    "concept"?
  • Thesaurus based annotation vs. other methods
  • Link structure
  • Cosine similarity
  • Multimedia documents
  • Is DIC(c) = IC(c) − IIC(c) a reasonable measure?
  • Are the results adequate?