Project Description 2 Indexing - PowerPoint PPT Presentation

Slides: 11
Provided by: tylin
Transcript and Presenter's Notes

Title: Project Description 2 Indexing


1
Project Description 2: Indexing

2
Indexing
  • Tokenize a text document, and attach to each token a list of the
    locations where that token appears
  • Sort and store these results in databases or files

3
Tokenizer
  • Tokenizer
  • Define the admissible symbols for a token; we will not use
    delimiters to capture the token.
  • Keep a record of the position of each token
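
A minimal sketch of such a tokenizer, assuming a token is either a run of letters and digits or a single punctuation mark. The slides do not spell out the admissible symbols, so the pattern and the names TOKEN_PATTERN and tokenize are illustrative only. Positions count tokens starting from 1, and the sample text is the first example document in this presentation.

```python
import re

# Assumption: the admissible symbols are letters and digits (grouped into
# words) plus any single non-space punctuation character. Tokens are matched
# directly by this pattern instead of being split on delimiters.
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+|[^\sA-Za-z0-9]")


def tokenize(text):
    """Return (token, position) pairs, with positions counted from 1."""
    matches = TOKEN_PATTERN.finditer(text)
    return [(m.group(), i) for i, m in enumerate(matches, start=1)]


doc1 = "He is a dumb teacher Dumb! Dumb! and Dumb!"
pairs = tokenize(doc1)
for token, position in pairs:
    print(token, position)
```

Under this assumption each "!" is a token of its own, which is why punctuation receives a position just like a word does.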

4
Tokenizer
  • Example
  • Document 1: He is a dumb teacher Dumb! Dumb! and Dumb!
  • Document 2: He is a great council. His advices are really great.
    He truly helps.

5
Tokenizer
  • Inverted File for document 1 (continued)
  • dumb 4
  • Dumb 6
  • Dumb 8
  • Dumb 11
  • He 1
  • is 2
  • teacher 5

6
Tokenizer - Example
  • Inverted File for document 1
  • ! 12
  • ! 7
  • ! 9
  • a 3
  • and 10

7
Tokenizer
  • Inverted File for document 1
  • ! 7, 9, 12 (frequency 3/12)
  • a 3
  • and 10
  • Dumb 4, 6, 8, 11
  • He 1
  • is 2
  • teacher 5
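
The grouping step for document 1 can be sketched as follows. This is one possible approach, not necessarily the intended one: it assumes tokens are case-folded so that "dumb" and "Dumb" share one entry (as the merged list 4, 6, 8, 11 suggests), and it reports frequency as occurrences divided by the total token count. The function name build_inverted_file is hypothetical.

```python
import re
from collections import defaultdict


def build_inverted_file(text):
    """Group token positions per token and return (inverted file, token count)."""
    # Assumption: words are runs of letters/digits; punctuation marks are
    # single-character tokens.
    tokens = re.findall(r"[A-Za-z0-9]+|[^\sA-Za-z0-9]", text)
    positions = defaultdict(list)
    for position, token in enumerate(tokens, start=1):
        # Assumption: fold case so "dumb" and "Dumb" land in the same entry.
        positions[token.lower()].append(position)
    # Sort by token so the result reads like the slide's listing.
    return dict(sorted(positions.items())), len(tokens)


inverted, total = build_inverted_file("He is a dumb teacher Dumb! Dumb! and Dumb!")
for token, locations in inverted.items():
    print(token, locations, f"(frequency {len(locations)}/{total})")
```

Because of the case folding, the keys in this sketch are lowercase ("he", "dumb"), whereas the slide keeps a capitalized display form.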

8
Tokenizer
  • Inverted File for document 2
  • . (period) 6, 12, 16
  • a 3
  • advices 8
  • are 9
  • council 5
  • great 4, 11
  • He 1, 13
  • helps 15
  • His 7
  • is 2
  • really 10
  • truly 14

9
Create a Token Database
  • Organize an inverted file for the following
    documents
  • For simple data
  • For complex data

10
Token database
  • Store the tokens in a database
  • The first column holds the sorted tokens
  • The second column holds the document names
  • The rest of each tuple keeps the locations of the token
  • This is the so-called inverted list
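
As a rough illustration of this layout, the sketch below stores both example documents in an in-memory SQLite table. Everything about the schema is an assumption: the table name inverted_list, the column names, and the choice to pack the variable-length list of locations into one comma-separated text column. Tokens are kept case-sensitive here.

```python
import re
import sqlite3

# Assumption: same token definition as a simple word/punctuation tokenizer.
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+|[^\sA-Za-z0-9]")

documents = {
    "Document1": "He is a dumb teacher Dumb! Dumb! and Dumb!",
    "Document2": "He is a great council. His advices are really great. He truly helps.",
}

conn = sqlite3.connect(":memory:")
# Hypothetical schema: token, document name, then the locations as text.
conn.execute(
    "CREATE TABLE inverted_list ("
    "token TEXT, document TEXT, locations TEXT, "
    "PRIMARY KEY (token, document))"
)

for name, text in documents.items():
    positions = {}
    for position, token in enumerate(TOKEN_PATTERN.findall(text), start=1):
        positions.setdefault(token, []).append(position)
    conn.executemany(
        "INSERT INTO inverted_list VALUES (?, ?, ?)",
        [(tok, name, ",".join(map(str, locs))) for tok, locs in positions.items()],
    )

# Look up one token across all documents, sorted by document name.
rows = conn.execute(
    "SELECT token, document, locations FROM inverted_list "
    "WHERE token = ? ORDER BY document",
    ("He",),
).fetchall()
print(rows)
```

Reading the table with ORDER BY token, document would return the tuples in the sorted order the slide describes. A separate row per location is another reasonable design if individual positions need to be queried.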