Special Topics in Computer Science: The Art of Information Retrieval. Chapter 7: Text Operations

1
Special Topics in Computer Science: The Art of
Information Retrieval. Chapter 7: Text Operations
  • Alexander Gelbukh
  • www.Gelbukh.com

2
Previous chapter: Conclusions
  • Modeling of text helps predict the behavior of
    systems
  • Zipf's law, Heaps' law
  • Describing the structure of documents formally
    allows part of their meaning to be treated
    automatically, e.g., in search
  • Languages to describe document syntax
  • SGML, too expensive
  • HTML, too simple
  • XML, good combination

3
Text operations
  • Linguistic operations
  • Document clustering
  • Compression
  • Encryption (not discussed here)

4
Linguistic operations
  • Purpose: convert words to meanings
  • Synonyms or related words
  • Different words, same meaning. Morphology
  • Foot / feet, woman / female
  • Homonyms
  • Same word, different meanings. Word senses
  • River bank / financial bank
  • Stopwords
  • Words with no meaning of their own. Functional words
  • The

5
For good or for bad?
  • More exact matching
  • Less noise, better recall
  • Unexpected behavior
  • Difficult for users to grasp
  • Harmful if it introduces errors
  • More expensive
  • Adds a whole new technology
  • Maintenance is language-dependent
  • Slows down processing
  • Good if done well, harmful if done badly

6
Document preprocessing
  • Lexical analysis (punctuation, case)
  • Simple, but must be done carefully
  • Stopwords. Reduce index size and processing time
  • Stemming: connected, connection, connections, ...
  • Multiword expressions: hot dog, B-52
  • Here, all the power of linguistic analysis can be
    used
  • Selection of index terms
  • Often nouns and noun groups: computer science
  • Construction of a thesaurus
  • synonymy: a network of related concepts (words or
    phrases)
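
The first preprocessing steps above, lexical analysis (case folding, stripping punctuation) followed by stopword removal, can be sketched as follows; the stopword list here is a tiny illustrative sample, not a real production list:

```python
import re

# Tiny illustrative stopword list; real systems use lists of hundreds of words.
STOPWORDS = {"the", "a", "an", "of", "in", "and", "or", "to", "is"}

def preprocess(text):
    """Lexical analysis (lowercase, drop punctuation) + stopword removal."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())   # lexical analysis
    return [t for t in tokens if t not in STOPWORDS]  # stopword removal

print(preprocess("The science of Information Retrieval, in a nutshell."))
# ['science', 'information', 'retrieval', 'nutshell']
```

Stemming and multiword-expression detection would then run over the surviving tokens.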

7
Stemming
  • Methods
  • Linguistic analysis: complex, expensive
    maintenance
  • Table lookup: simple, but needs data
  • Statistical (Avetisyan): needs no data, but imprecise
  • Suffix removal
  • Porter algorithm. Martin Porter. Ready-made code on
    his website
  • Substitution rules: sses → ss, s → ∅
  • stresses → stress
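
The substitution-rule idea can be sketched as a single Porter-style pass; the rule list below is only the first group of suffix rules (step 1a), not the full algorithm:

```python
def stem_step1a(word):
    """Sketch of Porter's step 1a: apply the longest matching suffix rule.
    Rules are ordered longest-first, so 'sses' wins over 's'."""
    rules = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]
    for suffix, replacement in rules:
        if word.endswith(suffix):
            return word[:len(word) - len(suffix)] + replacement
    return word

print(stem_step1a("stresses"))  # 'stress'
print(stem_step1a("ponies"))    # 'poni'
print(stem_step1a("caress"))    # 'caress' (the ss -> ss rule blocks 's' removal)
```

The full Porter algorithm adds several more rule groups with conditions on the stem's measure; Porter's website provides reference implementations.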

8
Better stemming
  • Involves the whole problem area of computational
    linguistics
  • POS disambiguation
  • well: adverb or noun? Oil well.
  • Statistical methods. Brill tagger
  • Syntactic analysis. Syntactic disambiguation
  • Word sense disambiguation
  • bank1 and bank2 should be different stems
  • Statistical methods
  • Dictionary-based methods. Lesk algorithm
  • Semantic analysis

9
Thesaurus
  • Terms (controlled vocabulary) and relationships
  • Terms
  • used for indexing
  • represent a concept. One word or a phrase.
    Usually nouns
  • sense: a definition or note to distinguish senses,
    e.g., key (door)
  • Relationships
  • Paradigmatic
  • Synonymy, hierarchical (is-a, part),
    non-hierarchical
  • Syntagmatic: collocations, co-occurrences
  • WordNet. EuroWordNet
  • synsets

10
Use of a thesaurus
  • To help the user to formulate the query
  • Navigation in the hierarchy of words
  • Yahoo!
  • For the program, to collate related terms
  • woman → female
  • Fuzzy comparison: woman ≈ 0.8 female, based on path
    length
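
One way to turn path length into a fuzzy similarity score is to let similarity decay with distance in the hierarchy. The tiny is-a hierarchy and the 1/(1 + distance) formula below are illustrative assumptions, not the book's definitions; a real thesaurus such as WordNet would supply the relations:

```python
# Hypothetical tiny is-a hierarchy (child -> parent) for illustration only.
PARENT = {"woman": "female", "girl": "female", "female": "person",
          "man": "male", "male": "person"}

def ancestors(term):
    """Path from a term up to the root, including the term itself."""
    path = [term]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path

def path_similarity(a, b):
    """Fuzzy similarity decaying with path length: 1 / (1 + distance),
    where distance is measured through the closest common ancestor."""
    pa, pb = ancestors(a), ancestors(b)
    common = set(pa) & set(pb)
    if not common:
        return 0.0
    dist = min(pa.index(c) + pb.index(c) for c in common)
    return 1 / (1 + dist)

print(path_similarity("woman", "female"))  # 0.5  (one step apart)
print(path_similarity("woman", "girl"))    # sisters under 'female'
```

The 0.8 figure on the slide would come from whatever decay function the system chooses; the point is only that closer terms score higher.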

11
Yahoo! vs. thesaurus
  • The book says Yahoo! is based on a thesaurus.
  • I disagree
  • Thesaurus: the words of a language organized into a
    hierarchy
  • Document hierarchy: documents attached to a
    hierarchy
  • This is word sense disambiguation
  • I claim that Yahoo! is based on (manual) WSD
  • It also uses a thesaurus for navigation

12
Text operations
  • Linguistic operations
  • Document clustering
  • Compression
  • Encryption (not discussed here)

13
Document clustering
  • Operation on the whole collection
  • Global vs. local
  • Global: the whole collection
  • At compile time; a one-time operation
  • Local
  • Cluster the results of a specific query
  • At runtime, with each query
  • More of a query transformation operation
  • Already discussed in Chapter 5

14
Text operations
  • Linguistic operations
  • Document clustering
  • Compression
  • Encryption (not discussed here)

15
Compression
  • Gains: storage, transmission, search
  • Cost: time spent compressing and decompressing
  • IR needs random access
  • Block-based schemes do not work
  • Also: pattern matching on compressed text

16
Compression methods
  • Statistical
  • Huffman: a fixed code per symbol
  • More frequent symbols get shorter codes
  • Allows decompression to start from any symbol
  • Arithmetic: dynamic coding
  • Must decompress from the beginning
  • Not for IR
  • Dictionary
  • Pointers to previous occurrences. Lempel-Ziv
  • Again, not for IR

17
Compression ratio
  • Size compressed / size decompressed
  • Huffman, units = words: up to 2 bits per character
  • Close to the limit (entropy). Only for large
    texts!
  • Other methods: similar ratio, but no random
    access
  • Shannon: the optimal code length for a symbol with
    probability p is -log2 p
  • Entropy: the limit of compression
  • The average code length under optimal coding
  • A property of the model
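
Shannon's -log2 p rule and the entropy limit can be checked numerically. A minimal sketch over a character model (a word model would be built the same way, with words as symbols):

```python
from collections import Counter
from math import log2

def entropy_bits(text):
    """Entropy of the text's character distribution:
    H = -sum(p * log2 p), the lower bound on average bits per symbol."""
    counts = Counter(text)
    total = len(text)
    return -sum((c / total) * log2(c / total) for c in counts.values())

# A symbol with probability p deserves -log2(p) bits under Shannon's rule:
print(-log2(0.5))                           # 1.0 bit for p = 0.5
print(round(entropy_bits("abracadabra"), 2))  # 2.04 bits/char for this model
```

No lossless code over this model can average fewer bits per symbol than the entropy; Huffman coding approaches it (within one bit per symbol).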

18
Modeling
  • Find probability for the next symbol
  • Adaptive, static, semi-static
  • Adaptive: good compression, but must start from the
    beginning
  • Static (per language): poor compression, random
    access
  • Semi-static (two passes over the specific text):
    both OK
  • Word-based vs. character-based
  • Word-based: better compression and search

19
Huffman coding
  • Each symbol is encoded sequentially
  • More frequent symbols have shorter codes
  • No code is a prefix of another one
  • How to build the tree: see the book
  • Byte codes are better
  • They allow for sequential search
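
The standard greedy tree-building procedure (repeatedly merging the two least frequent nodes) can be sketched as follows; this builds bit codes, whereas the byte codes mentioned above would group bits into bytes:

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Build prefix-free codes: frequent symbols get shorter codes.
    Greedy algorithm on a min-heap of (frequency, tie-breaker, codes)."""
    freq = Counter(text)
    heap = [(f, i, {sym: ""}) for i, (sym, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    counter = len(heap)  # tie-breaker so dicts are never compared
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)  # two least frequent subtrees
        f2, _, c2 = heapq.heappop(heap)
        # Prepend a 0 bit to one subtree's codes and a 1 bit to the other's.
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (f1 + f2, counter, merged))
        counter += 1
    return heap[0][2]

codes = huffman_codes("abracadabra")
# 'a' is the most frequent symbol, so its code is the shortest,
# and no code is a prefix of another.
assert all(len(codes["a"]) <= len(c) for c in codes.values())
```

Because the codes are prefix-free and fixed per symbol, decoding can resume at any code boundary, which is what makes the scheme usable for IR.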

20
Dictionary-based methods
  • Static (simple, poor compression), dynamic,
    semi-static.
  • Lempel-Ziv: references to previous occurrences
  • Adaptive
  • Disadvantages for IR
  • Need to decode from the very beginning
  • New statistical methods perform better
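
As an illustration of the dictionary idea, here is a sketch of the LZ78 variant, which builds an explicit phrase dictionary and emits (phrase index, next character) pairs; note that decoding must replay the stream from the very beginning to rebuild the dictionary, which is exactly the problem for IR:

```python
def lz78_compress(text):
    """LZ78 sketch: emit (index of longest known phrase, next char) pairs,
    growing the phrase dictionary as compression proceeds."""
    dictionary = {"": 0}          # phrase -> index
    pairs, current = [], ""
    for ch in text:
        if current + ch in dictionary:
            current += ch          # keep extending the known phrase
        else:
            pairs.append((dictionary[current], ch))
            dictionary[current + ch] = len(dictionary)
            current = ""
    if current:                    # flush a trailing known phrase
        pairs.append((dictionary[current[:-1]], current[-1]))
    return pairs

def lz78_decompress(pairs):
    """Decoding replays the whole stream to rebuild the same dictionary."""
    phrases, out = [""], []
    for idx, ch in pairs:
        phrase = phrases[idx] + ch
        phrases.append(phrase)
        out.append(phrase)
    return "".join(out)

assert lz78_decompress(lz78_compress("abracadabra")) == "abracadabra"
```

Repetitive text yields long dictionary phrases and thus few pairs, but there is no way to start decoding in the middle.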

21
Comparison of methods
22
Compression of inverted files
  • Inverted file: for each word, the list of docs where
    it occurs
  • The lists of docs are ordered, so they can be
    compressed
  • Seen as lists of gaps
  • Short gaps occur more frequently
  • Statistical compression
  • Our work: order the docs for better compression
  • We encode runs of docs
  • Minimize the number of runs
  • Minimize the distance between the lists of different
    words
  • Formulated as a TSP
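
Gap encoding of an ordered posting list is simple to sketch; small gaps dominate, which is what makes a statistical coder (giving short codes to frequent small values) effective:

```python
def to_gaps(doc_ids):
    """Store a sorted posting list as the gaps between consecutive doc ids."""
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

def from_gaps(gaps):
    """Recover the original doc ids by cumulative summation."""
    ids, total = [], 0
    for g in gaps:
        total += g
        ids.append(total)
    return ids

postings = [3, 5, 20, 21, 23, 76]
gaps = to_gaps(postings)        # [3, 2, 15, 1, 2, 53]
assert from_gaps(gaps) == postings
```

Reordering the documents so that each word's postings cluster together shrinks the gaps further, which is the motivation for the ordering problem above.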

23
Research topics
  • All of computational linguistics
  • Improved POS tagging
  • Improved WSD
  • Uses of thesaurus
  • for user navigation
  • for collating similar terms
  • Better compression methods
  • Searchable compression
  • Random access

24
Conclusions
  • Text transformation: meaning instead of strings
  • Lexical analysis
  • Stopwords
  • Stemming
  • POS, WSD, syntax, semantics
  • Ontologies to collate similar stems
  • Text compression
  • Searchable
  • Random access
  • Word-based statistical methods (Huffman)
  • Index compression

25
Thank you! Until the compensation lecture