Indexing - PowerPoint PPT Presentation

About This Presentation
Title:

Indexing

Slides: 16
Provided by: borameCs
Tags: citgo | indexing

Transcript and Presenter's Notes

Title: Indexing


1
Indexing
  • Overview
  • Approaches to indexing
  • Automatic indexing
  • Information extraction

2
Overview
  • Indexing: the transformation of documents into
    searchable data structures.
  • May be manual or automatic.
  • Creates the basis for direct search, or for search
    through index files.
  • Historically performed by professional indexers
    associated with library organizations.
  • A critical process: users' ability to find
    documents on a particular subject is limited by
    whether the indexer created index terms for that
    subject.
  • Initial computerization still relied on human
    indexers, but encouraged using more index
    terms (index cards no longer being required for
    each index term).

3
Changes in Objectives of Indexing Due to Full-Text
Availability
  • Indexing defines the source and major concepts of
    documents.
  • The use of a controlled vocabulary (the domain of
    the index) helps standardize the choice of terms.
  • Controlled vocabularies slow the indexing
    process, but aid users because they know the
    domain the indexer had to use.
  • With the availability of full text, the need for
    manual indexing is diminishing.
  • Source information (citation data) can easily be
    extracted.
  • Every word of a document (after appropriate
    normalization) may be used as a term.
  • Thesauri compensate for lack of controlled
    vocabularies.
  • Hence, the importance of manual indexing shifts to
    its ability to:
  • Perform abstractions and determine additional
    related terms.
  • Judge the value of the information (e.g., more
    difficult to cheat).

4
Approaches: Scope
  • Exhaustivity: the extent to which concepts are
    indexed.
  • Should we index only the most important
    concepts, or also more minor concepts?
  • In a 10-page document, should a 2-sentence
    discussion of some subject be indexed?
  • Specificity: the preciseness of the index term
    used.
  • Should we use general indexing terms or more
    specific terms?
  • Should we use the term computer, personal
    computer, or IBM Aptiva Model M61?
  • Main effect:
  • Low exhaustivity has adverse effect on recall.
  • Low specificity has adverse effect on precision.
  • Related issues:
  • Index title and abstract only, or the entire
    document?
  • Should index terms be weighted?

5
Approaches: Pre-coordination
  • Post-coordination: when a query uses a set of
    terms linked by AND, it links these terms
    together at search time.
  • Pre-coordination: links among terms are
    specified in the index. Pre-coordination improves
    retrieval for post-coordinated queries.
  • Example: a document discusses the drilling of oil
    wells in Mexico by CITGO and the introduction of
    oil refineries in Peru by the U.S.
  • No pre-coordination of terms:
  • oil, wells, Mexico, CITGO, refineries, Peru, U.S.
  • Document is retrieved if a query links oil,
    Mexico, and Peru.
  • Simple pre-coordination:
  • (oil wells, Mexico, CITGO)
  • (oil refineries, Peru, U.S.)
  • Document is not retrieved if a query links oil,
    Mexico, and Peru.
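
The contrast between the two indexes above can be sketched in a few lines of Python. This is only an illustration, not how any particular system stores its index: the function names are hypothetical, and "oil wells" and "oil refineries" are split into single words so that the query term "oil" can match.

```python
# Hypothetical index entries for the example document.
# Assumption: multi-word terms are split into single words.
flat_index = {"oil", "wells", "Mexico", "CITGO", "refineries", "Peru", "U.S."}
precoordinated_index = [
    {"oil", "wells", "Mexico", "CITGO"},
    {"oil", "refineries", "Peru", "U.S."},
]


def matches_flat(query_terms, index_terms):
    """AND query against an index with no pre-coordination."""
    return set(query_terms) <= index_terms


def matches_precoordinated(query_terms, groups):
    """AND query that must be satisfied inside one pre-coordinated group."""
    return any(set(query_terms) <= group for group in groups)


query = ["oil", "Mexico", "Peru"]
flat_hit = matches_flat(query, flat_index)
grouped_hit = matches_precoordinated(query, precoordinated_index)
print(flat_hit, grouped_hit)
```

With the flat index the query matches, although the document never relates Mexico to Peru; with the grouped index it does not, because no single group contains all three terms.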

6
Example (cont.)
  • Pre-coordination with position indicating role:
  • (CITGO, drill, oil wells, Mexico)
  • (U.S., introduce, oil refineries, Peru)
  • Discriminates which country introduces refineries
    into the other country.
  • Pre-coordination with modifier indicating role:
  • (Subject: CITGO, Action: drill, Object: oil wells,
    Modifier: in Mexico)
  • (Subject: U.S., Action: introduce, Object: oil
    refineries, Modifier: in Peru)
  • If the document discussed the U.S. introducing
    refineries in Peru, Bolivia, and Argentina, one
    entry is used with three Modifier fields.

7
Automatic Indexing
  • System automatically determines the index terms
    assigned to documents.
  • Relative advantages:
  • Human indexing:
  • Ability to determine concept abstractions.
  • Ability to judge the value of concepts.
  • Automatic indexing:
  • Reduced cost: once the initial hardware cost is
    amortized, the operational cost is lower than the
    compensation for human indexers.
  • Reduced processing time: at most a few seconds vs.
    at least a few minutes.
  • Improved consistency: algorithms select index
    terms much more consistently than humans.

8
Weighted and Unweighted Indexes
  • Unweighted indexing:
  • No attempt to determine the value of the
    different terms assigned to a document. Not
    possible to distinguish between major topics and
    casual references.
  • All retrieved documents are equal in value.
  • Typical of commercial systems through the 1980s.
  • Weighted indexing:
  • Attempt made to place a value on each term as a
    description of the document.
  • This value is related to the frequency of
    occurrence of the term in the document (higher is
    better), but also to the number of collection
    documents that use this term (lower is better).
  • Query weights and document weights are combined
    to a value describing the likelihood that a
    document matches a query, and a threshold value
    limits the number of documents returned.
  • Typically used only with automatic indexing.
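
The slide does not give a weighting formula. One common way to combine the two factors it names (term frequency in the document, number of collection documents using the term) is a tf-idf style product, sketched below. The function name, the logarithmic form, and the toy documents are assumptions made for illustration.

```python
import math
from collections import Counter


def tf_idf_weights(documents):
    """Return one {term: weight} dict per tokenized document.

    Assumed formula: weight = term count * log(N / document frequency).
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each term once per document

    weights = []
    for tokens in documents:
        term_freq = Counter(tokens)
        weights.append({
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in term_freq.items()
        })
    return weights


# Toy collection (hypothetical data).
docs = [
    ["oil", "wells", "oil", "mexico"],
    ["oil", "refineries", "peru"],
    ["library", "index", "cards"],
]
weights = tf_idf_weights(docs)
for doc_weights in weights:
    print(doc_weights)
```

A term that occurred in every document of the collection would get weight zero under this formula, which matches the "lower is better" rule for collection frequency.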

9
Automatic Indexing by Term and by Concept
  • Indexing by term: the item is represented by
    terms extracted from the item.
  • The Vector model
  • The Bayesian Model
  • Natural language processing
  • Indexing by concept: the document is represented
    by concepts not necessarily used in the document.

10
Indexing by Term: the Vector Model
  • The SMART system developed by Salton at Cornell
    University.
  • Each document is stored as a vector of weights.
  • Each vector position represents a term in the
    database domain(the dimension of these vectors is
    the size of the vocabulary).
  • The query is represented by a similar vector.
  • The search involves calculating the vector
    distance between the query vector and each
    document vector.
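
The slide does not say which vector distance is used. A minimal sketch, assuming cosine similarity as the measure and a tiny hypothetical vocabulary and weights:

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an all-zero vector matches nothing
    return dot / (norm_u * norm_v)


# One vector position per vocabulary term (hypothetical weights).
vocabulary = ["oil", "wells", "refineries", "mexico", "peru"]
doc_vectors = {
    "doc_a": [2.0, 1.0, 0.0, 1.0, 0.0],
    "doc_b": [1.0, 0.0, 1.0, 0.0, 1.0],
}
query_vector = [1.0, 0.0, 0.0, 1.0, 0.0]  # query: oil AND mexico

ranking = sorted(
    doc_vectors,
    key=lambda name: cosine_similarity(query_vector, doc_vectors[name]),
    reverse=True,
)
print(ranking)
```

Here doc_a ranks first because it shares both query terms, while doc_b shares only "oil".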

11
Indexing by Term: the Bayesian Model
  • Bayes' rule of conditional probability:
  • P(A|B) = P(A,B) / P(B) = P(A) P(B|A) / P(B)
  • Bayesian methods can be used to determine the
    processing tokens and their weights.
  • Principle: calculate the (posterior) probability
    that a given document contains concept C, given
    the presence of features (words) F1, ..., Fm in
    the document.
  • To calculate this probability we need to know:
  • The prior probability that the document is
    relevant to the concept C.
  • The conditional probability that the features Fi
    are present in a document, given that the
    document is relevant to the concept C.
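
The posterior calculation can be sketched as follows. Two things here are assumptions not stated on the slide: the features are treated as conditionally independent (the "naive Bayes" simplification), and the evidence term is computed from an additional set of probabilities for documents that are not relevant to C. The function name and all numbers are hypothetical.

```python
def posterior_concept_probability(prior, likelihoods_given_c, likelihoods_given_not_c):
    """P(C | F1, ..., Fm) under an assumed conditional-independence model.

    prior                   -- P(C)
    likelihoods_given_c     -- [P(Fi | C)] for each observed feature
    likelihoods_given_not_c -- [P(Fi | not C)], an extra input needed
                               to compute the evidence P(F1, ..., Fm)
    """
    joint_c = prior
    for p in likelihoods_given_c:
        joint_c *= p

    joint_not_c = 1.0 - prior
    for p in likelihoods_given_not_c:
        joint_not_c *= p

    evidence = joint_c + joint_not_c
    return joint_c / evidence if evidence > 0 else 0.0


# Hypothetical numbers: prior 0.1, two observed words.
posterior = posterior_concept_probability(0.1, [0.8, 0.6], [0.1, 0.2])
print(round(posterior, 3))
```

With these numbers the two observed words raise the probability of the concept well above its prior of 0.1.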

12
Indexing by Term: Natural Language Processing
  • The DR-LINK system.
  • Enhance indexing by using semantic information
    (in addition to statistical information).
  • Process the language, rather than treat each word
    as an independent entity.
  • Process documents at different levels:
    morphological, lexical, semantic, syntactic, and
    discourse (beyond the sentence).

13
Indexing by Concept
  • There are many ways to represent the same idea,
    and increased retrieval performance comes from
    using a single representation.
  • Hence, a single canonical set of concepts is
    determined and is used for indexing all
    documents.
  • The MatchPlus system:
  • A set of n features (concepts) is selected.
  • For each word stem a context vector of dimension
    n is built, describing how strongly the stem
    reflects each feature.
  • The context vectors for the word stems are
    combined with a weighted sum, to create a single
    context vector for the entire document.
  • This vector represents the document in terms of
    the concepts.
  • Queries go through the same analysis to determine
    their vector representations.
  • During search, query vector is compared to
    document vectors.
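
The weighted-sum step can be sketched as below. This is not MatchPlus code: the function name, the two features, the stem context vectors, and the stem weights are all hypothetical values chosen for illustration.

```python
def document_context_vector(stem_weights, stem_vectors, n_features):
    """Weighted sum of the context vectors of the stems in a document."""
    doc_vector = [0.0] * n_features
    for stem, weight in stem_weights.items():
        context = stem_vectors.get(stem)
        if context is None:
            continue  # stem has no context vector; skip it
        for i in range(n_features):
            doc_vector[i] += weight * context[i]
    return doc_vector


# Assumed features: [petroleum industry, information science].
stem_vectors = {
    "drill": [0.9, 0.1],
    "refin": [0.8, 0.2],
    "librari": [0.0, 1.0],
}
# Assumed stem weights for one document (e.g. derived from frequency).
doc_vec = document_context_vector({"drill": 2.0, "refin": 1.0}, stem_vectors, 2)
print(doc_vec)
```

A query would be passed through the same function, and the resulting vector compared with each document vector.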

14
Information Extraction
  • Two processes related to indexing:
  • Extraction of facts (e.g., when building indexes
    automatically).
  • Document summarization.
  • Extraction of facts into a database:
  • Extract specific types of information using
    extraction criteria (indexing attempts to
    represent the entire document).
  • Recall now refers to how much information was
    extracted from a document (vs. how much should
    have been extracted).
  • Precision now refers to the proportion of the
    extracted information which is accurate.
  • Experiments show that automatic extraction
    performs much worse than human extraction (about
    55% precision and recall vs. about 80%), but
    operates about 20 times faster.
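
The two measures as redefined for extraction can be computed as in the sketch below, assuming extracted facts are represented as tuples and compared by exact match. The function name and the sample facts are hypothetical.

```python
def extraction_precision_recall(extracted, expected):
    """Precision and recall of extracted facts against a reference set."""
    extracted, expected = set(extracted), set(expected)
    correct = extracted & expected
    precision = len(correct) / len(extracted) if extracted else 0.0
    recall = len(correct) / len(expected) if expected else 0.0
    return precision, recall


# Facts a human would extract from the example document.
expected_facts = {
    ("CITGO", "drill", "oil wells", "Mexico"),
    ("U.S.", "introduce", "oil refineries", "Peru"),
}
# Hypothetical system output: one correct fact, one with the wrong subject.
extracted_facts = {
    ("CITGO", "drill", "oil wells", "Mexico"),
    ("CITGO", "introduce", "oil refineries", "Peru"),
}
precision, recall = extraction_precision_recall(extracted_facts, expected_facts)
print(precision, recall)
```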

15
Information Extraction (cont.)
  • Document summarization:
  • Extract the most important ideas, while reducing
    the size significantly.
  • Example: the abstract of a document.
  • True summarization is not feasible.
  • Instead, most summarization techniques extract
    the most significant subsets (e.g., sentences),
    and concatenate them.
  • Each sentence is assigned a score, and the
    highest scoring sentences are extracted.
  • No guarantee of a coherent narrative.
  • Heuristic algorithms, with no overall theory. For
    example:
  • Consider sentences over 5 words in length.
  • Look for cues, e.g., "in conclusion".
  • Focus on the first 10 and last 5 paragraphs.
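
A minimal sketch of sentence-scoring extraction along these lines follows. The scoring values, the cue list, and the function name are assumptions, and the position heuristic is reduced to favoring the first and last sentence, standing in for the paragraph rule on the slide.

```python
import re

CUE_PHRASES = ("in conclusion", "in summary")  # assumed cue list


def summarize(text, max_sentences=2, min_words=6):
    """Pick the highest-scoring sentences and return them in original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    scored = []
    for position, sentence in enumerate(sentences):
        if len(sentence.split()) < min_words:  # keep sentences over 5 words
            continue
        score = 1.0
        if any(cue in sentence.lower() for cue in CUE_PHRASES):
            score += 2.0  # cue phrase bonus (assumed value)
        if position == 0 or position == len(sentences) - 1:
            score += 1.0  # position bonus (assumed value)
        scored.append((score, position, sentence))
    top = sorted(scored, key=lambda item: (-item[0], item[1]))[:max_sentences]
    return " ".join(sentence for _, _, sentence in sorted(top, key=lambda item: item[1]))


text = (
    "Indexing transforms documents into searchable data structures. "
    "It is old. "
    "Automatic indexing is cheaper and faster than manual indexing by human experts. "
    "In conclusion, automatic indexing trades some abstraction ability for speed and consistency."
)
summary = summarize(text)
print(summary)
```

As the slide warns, the selected sentences are simply concatenated, so nothing guarantees that the result reads as a coherent narrative.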