Indexing - PowerPoint PPT Presentation

About This Presentation

Title:

Indexing

Slides: 16

Provided by: borameCs

Transcript and Presenter's Notes

1
Indexing

Overview
Approaches to indexing
Automatic indexing
Information extraction

2
Overview

Indexing: the transformation of documents into searchable data structures.
May be manual or automatic.
Creates the basis for direct search, or for search through index files.
Historically performed by professional indexers associated with library organizations.
A critical process: users' ability to find documents on a particular subject is limited by the indexer creating index terms for this subject.
Initial computerization still relied on human indexers, but encouraged using more index terms (index cards no longer being required for each index term).
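
As a minimal sketch of what a "searchable data structure" can look like, the following builds an inverted index that maps each word to the documents containing it. The function name `build_inverted_index`, the sample documents, and the whitespace tokenization are illustrative assumptions, not part of the original slides.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercased word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


# Hypothetical two-document collection.
docs = {
    1: "oil wells in Mexico",
    2: "oil refineries in Peru",
}
index = build_inverted_index(docs)

print(sorted(index["oil"]))   # [1, 2]
print(sorted(index["peru"]))  # [2]
```

A search then becomes a lookup in `index` instead of a scan over every document.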

3
Changes in Objectives of Indexing Due to Full Text Availability

Indexing defines the source and major concepts of documents.
The use of a controlled vocabulary (the domain of the index) helps standardize the choice of terms.
Controlled vocabularies slow the indexing process, but aid users because they know the domain the indexer had to use.
With the availability of full text, the need for manual indexing is diminishing:
Source information (citation data) can easily be extracted.
Every word of a document (after appropriate normalization) may be used as a term.
Thesauri compensate for the lack of controlled vocabularies.
Hence, the importance of manual indexing shifts to its ability to:
Perform abstractions and determine additional related terms.
Judge the value of the information (e.g., more difficult to cheat).

4
Approaches: Scope

Exhaustivity: the extent to which concepts are indexed.
Should we index only the most important concepts, or also more minor concepts?
In a 10-page document, should a 2-sentence discussion of some subject be indexed?
Specificity: the preciseness of the index term used.
Should we use general indexing terms or more specific terms?
Should we use the term "computer", "personal computer", or "IBM Aptiva Model M61"?
Main effects:
Low exhaustivity has an adverse effect on recall.
Low specificity has an adverse effect on precision.
Related issues:
Index title and abstract only, or the entire document?
Should index terms be weighted?

5
Approaches: Pre-coordination

Post-coordination: when a query uses a set of terms linked by AND, it links these terms together.
Pre-coordination: links among terms are specified in the index. Pre-coordination improves retrieval for post-coordinated queries.
Example: a document discusses drilling of oil wells in Mexico by CITGO and introduction of oil refineries in Peru by the U.S.
No pre-coordination of terms:
oil, wells, Mexico, CITGO, refineries, Peru, U.S.
Document is retrieved if the query links oil, Mexico, and Peru.
Simple pre-coordination:
(oil wells, Mexico, CITGO)
(oil refineries, Peru, U.S.)
Document is not retrieved if the query links oil, Mexico, and Peru.
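
The difference between the two indexing choices in this example can be sketched in a few lines. The helper names `matches_flat` and `matches_precoordinated` are hypothetical, and splitting "oil wells" and "oil refineries" into single words is a simplifying assumption made so that the query term "oil" can be compared directly.

```python
def matches_flat(index_terms, query_terms):
    """No pre-coordination: every query term must appear among the document's terms."""
    return set(query_terms) <= set(index_terms)


def matches_precoordinated(index_groups, query_terms):
    """Simple pre-coordination: all query terms must fall inside one linked group."""
    return any(set(query_terms) <= set(group) for group in index_groups)


flat_terms = ["oil", "wells", "Mexico", "CITGO", "refineries", "Peru", "U.S."]
groups = [
    ["oil", "wells", "Mexico", "CITGO"],
    ["oil", "refineries", "Peru", "U.S."],
]
query = ["oil", "Mexico", "Peru"]  # terms linked by AND

print(matches_flat(flat_terms, query))        # True
print(matches_precoordinated(groups, query))  # False
```

The flat index produces a false hit because it cannot tell that Mexico and Peru belong to different statements in the document.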

6
Example (cont.)

Pre-coordination with position indicating role:
(CITGO, drill, oil wells, Mexico)
(U.S., introduce, oil refineries, Peru)
Discriminates which country introduces refineries into the other country.
Pre-coordination with modifier indicating role:
(Subject: CITGO, Action: drill, Object: oil wells, Modifier: in Mexico)
(Subject: U.S., Action: introduce, Object: oil refineries, Modifier: in Peru)
If the document discussed the U.S. introducing refineries in Peru, Bolivia, and Argentina, one entry is used with three Modifier fields.
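
One way to picture role-tagged entries is as small records with named fields. The sketch below is an assumption about representation only: the class `IndexEntry`, its field names, and the lookup helper are hypothetical, and the second entry uses the hypothetical three-modifier case described above.

```python
from dataclasses import dataclass, field


@dataclass
class IndexEntry:
    subject: str
    action: str
    obj: str
    modifiers: list = field(default_factory=list)  # one entry may carry several modifiers


entries = [
    IndexEntry("CITGO", "drill", "oil wells", ["in Mexico"]),
    IndexEntry("U.S.", "introduce", "oil refineries",
               ["in Peru", "in Bolivia", "in Argentina"]),
]


def who_introduces_refineries(entries, place):
    """Return the subjects of entries that introduce oil refineries at the given place."""
    return [
        e.subject
        for e in entries
        if e.action == "introduce" and e.obj == "oil refineries" and place in e.modifiers
    ]


print(who_introduces_refineries(entries, "in Bolivia"))  # ['U.S.']
print(who_introduces_refineries(entries, "in Mexico"))   # []
```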

7
Automatic Indexing

System automatically determines the index terms
assigned to documents.
Relative advantages:
Human indexing:
Ability to determine concept abstractions.
Ability to judge the value of concepts.
Automatic indexing:
Reduced cost: once the initial hardware cost is amortized, operational cost is lower than compensation for human indexers.
Reduced processing time: at most a few seconds vs. at least a few minutes.
Improved consistency: algorithms select index terms much more consistently than humans.

8
Weighted and Unweighted Indexes

Unweighted indexing:
No attempt to determine the value of the different terms assigned to a document. Not possible to distinguish between major topics and casual references.
All retrieved documents are equal in value.
Typical of commercial systems through the 1980s.
Weighted indexing:
An attempt is made to place a value on each term as a description of the document.
This value is related to the frequency of occurrence of the term in the document (higher is better), but also to the number of collection documents that use this term (lower is better).
Query weights and document weights are combined into a value describing the likelihood that a document matches a query, and a threshold value limits the number of documents returned.
Typically used only with automatic indexing.
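
The slide does not name a formula, but a common way to combine the two signals (frequency in the document, rarity in the collection) is a tf-idf style weight. The sketch below assumes weight = tf * log(N / df), whitespace tokenization, and a score that simply sums the weights of the query terms; the function names and sample documents are illustrative.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document, with weight = tf * log(N / df)."""
    n = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return weights


def search(weights, query_terms, threshold):
    """Score each document by summing query-term weights; keep scores above threshold."""
    scores = [(sum(w.get(t, 0.0) for t in query_terms), i) for i, w in enumerate(weights)]
    return [i for score, i in sorted(scores, reverse=True) if score > threshold]


docs = ["oil oil wells mexico", "oil refineries peru", "computer hardware"]
w = tf_idf(docs)

print(search(w, ["oil", "wells"], threshold=0.5))  # [0]
print(search(w, ["oil", "wells"], threshold=0.1))  # [0, 1]
```

In the first document "wells" ends up with a higher weight than "oil" even though "oil" occurs twice, because "oil" also appears in another document.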

9
Automatic Indexing by Term and by Concept

Indexing by Term: The item is represented by terms extracted from the item.
The Vector Model
The Bayesian Model
Natural language processing
Indexing by Concept: The document is represented by concepts not necessarily used in the document.

10
Indexing by Term: the Vector Model

The SMART system, developed by Salton at Cornell University.
Each document is stored as a vector of weights.
Each vector position represents a term in the database domain (the dimension of these vectors is the size of the vocabulary).
The query is represented by a similar vector.
The search involves calculating the vector distance between the query vector and each document vector.
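
The slide speaks only of a "vector distance". The sketch below assumes cosine similarity as the comparison measure, which is a common choice in vector-space retrieval; the vocabulary, the weights, and the document names are made up for illustration.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length weight vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# One vector position per vocabulary term (hypothetical vocabulary and weights).
vocabulary = ["oil", "wells", "mexico", "refineries", "peru"]
doc_vectors = {
    "doc1": [2, 1, 1, 0, 0],
    "doc2": [1, 0, 0, 1, 1],
}
query = [1, 0, 1, 0, 0]  # query "oil mexico" in the same vector space

ranked = sorted(doc_vectors, key=lambda d: cosine(query, doc_vectors[d]), reverse=True)
print(ranked)  # ['doc1', 'doc2']
```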

11
Indexing by Term: the Bayesian Model

Bayes' rule of conditional probability:
$P(A \mid B) = P(A, B) / P(B) = P(A)\,P(B \mid A) / P(B)$
Bayesian methods can be used to determine the processing tokens and their weights.
Principle: calculate the (posterior) probability that a given document contains concept C, given the presence of features (words) F1, ..., Fm in the document.
To calculate this probability we need to know:
The prior probability that the document is relevant to the concept C.
The conditional probability that the features Fi are present in a document, given that the document is relevant to the concept C.
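
A worked example may help make the calculation concrete. The sketch below assumes the features are conditionally independent (the "naive Bayes" simplification) and also assumes we know how likely each feature is in documents that are not relevant to C, which is needed to normalize the result. All probabilities are invented for illustration.

```python
def posterior(prior, p_features_given_c, p_features_given_not_c):
    """P(C | F1..Fm) via Bayes' rule, assuming the features are independent given C."""
    p_rel = prior          # running value of P(C) * product of P(Fi | C)
    p_not = 1.0 - prior    # running value of P(not C) * product of P(Fi | not C)
    for p_f_c, p_f_not_c in zip(p_features_given_c, p_features_given_not_c):
        p_rel *= p_f_c
        p_not *= p_f_not_c
    return p_rel / (p_rel + p_not)


# Hypothetical numbers: concept C = "oil exploration", features = words "oil" and "drill".
prior = 0.1
p = posterior(prior, [0.8, 0.6], [0.1, 0.05])
print(round(p, 3))  # 0.914
```

Seeing both words raises the probability that the document is about the concept from the prior of 0.1 to roughly 0.91.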

12
Indexing by Term: Natural Language Processing

The DR-LINK system.
Enhance indexing by using semantic information (in addition to statistical information).
Process the language, rather than treat each word as an independent entity.
Process documents at different levels: morphological, lexical, semantic, syntactic, and discourse (beyond the sentence).

13
Indexing by Concept

There are many ways to represent the same idea
and increased retrieval performance comes from
using a single representation.
Hence, a single canonical set of concepts is
determined and is used for indexing all
documents.
The MatchPlus system:
A set of n features (concepts) is selected.
For each word stem, a context vector of dimension n is built, describing how strongly the stem reflects each feature.
The context vectors for the word stems are combined with a weighted sum, to create a single context vector for the entire document.
This vector represents the document in terms of the concepts.
Queries go through the same analysis to determine vector representations.
During search, the query vector is compared to document vectors.
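
The weighted-sum step can be sketched as follows. This is not the actual MatchPlus implementation: the two features, the per-stem context vectors, and the stem weights are invented, and the slides do not say how the weights are chosen.

```python
def document_vector(stem_vectors, stem_weights):
    """Weighted sum of per-stem context vectors; stems without a vector are skipped."""
    n = len(next(iter(stem_vectors.values())))
    total = [0.0] * n
    for stem, weight in stem_weights.items():
        vec = stem_vectors.get(stem)
        if vec is None:
            continue
        for i, x in enumerate(vec):
            total[i] += weight * x
    return total


# Hypothetical features (n = 2): ["energy", "geography"].
stem_vectors = {
    "oil": [0.9, 0.1],
    "mexico": [0.1, 0.9],
    "drill": [0.8, 0.0],
}

# Assumed weights, e.g. occurrence counts of each stem in the document.
doc_vec = document_vector(stem_vectors, {"oil": 2, "mexico": 1, "drill": 1})
query_vec = document_vector(stem_vectors, {"oil": 1})

print([round(x, 2) for x in doc_vec])    # [2.7, 1.1]
print([round(x, 2) for x in query_vec])  # [0.9, 0.1]
```

Because documents and queries end up in the same n-dimensional concept space, they can be compared with any vector similarity measure.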

14
Information Extraction

Two processes related to indexing:
Extraction of facts (e.g., when building indexes automatically).
Document summarization.
Extraction of facts into a database:
Extract specific types of information using extraction criteria (indexing attempts to represent the entire document).
Recall now refers to how much information was extracted from a document (vs. how much should have been extracted).
Precision now refers to the proportion of the extracted information which is accurate.
Experiments show that automatic extraction performs much worse than human extraction (55% precision and recall vs. about 80%), but operates about 20 times faster.
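
The two measures as redefined for extraction can be computed as below. The function name `extraction_scores` and the sample facts are hypothetical, and a fact is assumed to be "accurate" only when it exactly matches one of the expected facts.

```python
def extraction_scores(extracted, expected):
    """Return (precision, recall) for a set of extracted facts against the expected facts."""
    extracted, expected = set(extracted), set(expected)
    correct = extracted & expected
    precision = len(correct) / len(extracted) if extracted else 0.0
    recall = len(correct) / len(expected) if expected else 0.0
    return precision, recall


# Hypothetical facts as (subject, action, place) tuples.
expected = {
    ("CITGO", "drill", "Mexico"),
    ("U.S.", "introduce", "Peru"),
    ("U.S.", "introduce", "Bolivia"),
    ("U.S.", "introduce", "Argentina"),
    ("CITGO", "export", "U.S."),
}
extracted = {
    ("CITGO", "drill", "Mexico"),
    ("U.S.", "introduce", "Peru"),
    ("U.S.", "introduce", "Bolivia"),
    ("Peru", "introduce", "U.S."),  # wrong: roles swapped
}

precision, recall = extraction_scores(extracted, expected)
print(precision, recall)  # 0.75 0.6
```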

15
Information Extraction (cont.)

Document summarization:
Extract the most important ideas, while reducing the size significantly.
Example: the abstract of a document.
True summarization is not feasible.
Instead, most summarization techniques extract the most significant subsets (e.g., sentences), and concatenate them.
Each sentence is assigned a score, and the highest scoring sentences are extracted.
No guarantee of a coherent narrative.
Heuristic algorithms, with no overall theory. For example:
Consider sentences over 5 words in length.
Look for cues, e.g., "in conclusion".
Focus on the first 10 and last 5 paragraphs.
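
A toy extractive summarizer along these lines might look like the sketch below. It is only an illustration of sentence scoring: the score values, the cue list, and the use of first/last sentence position as a stand-in for the paragraph-position heuristic are all assumptions, not taken from the slides.

```python
import re

CUES = ("in conclusion", "in summary")  # assumed cue phrases


def summarize(text, max_sentences=2, min_words=5):
    """Score sentences with simple heuristics and return the top ones in original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    scored = []
    for pos, sentence in enumerate(sentences):
        if len(sentence.split()) <= min_words:
            continue  # only consider sentences longer than min_words words
        score = 0.0
        if any(cue in sentence.lower() for cue in CUES):
            score += 2.0  # cue phrase present
        if pos == 0 or pos == len(sentences) - 1:
            score += 1.0  # positional bonus for the opening and closing sentences
        scored.append((score, pos, sentence))
    # Highest score first; ties broken by earlier position.
    top = sorted(scored, key=lambda t: (-t[0], t[1]))[:max_sentences]
    # Concatenate the selected sentences in their original order.
    return " ".join(s for _, _, s in sorted(top, key=lambda t: t[1]))


text = (
    "Indexing transforms documents into searchable data structures. "
    "It is old. "
    "Automatic indexing assigns index terms without a human indexer. "
    "In conclusion, weighted indexes help rank the retrieved documents."
)
summary = summarize(text)
print(summary)
```

As the slide warns, the extracted sentences are simply concatenated, so nothing guarantees that the result reads as a coherent narrative.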