1
Information Retrieval by means of Vector Space
Model of Document Representation and Cascade
Neural Networks
  • Igor Mokriš, Lenka Skovajsová
  • Institute of Informatics, SAS Bratislava,
    Slovakia
  • mokris_at_aoslm.sk, skovajsova_at_aoslm.sk

2
Summary
  • Development of a neural network model for
    information retrieval from text documents in
    the Slovak language, using the vector space model
    of document representation
  • Key words: Information Retrieval, Queries,
    Keywords, Text Documents, Neural Networks, Slovak
    Language

3
Text Document Analysis
  • The most common approaches:
  • Statistical: analyses words in text documents,
    comparing them with keywords
  • Linguistic: extracts linguistic units from text
    (phoneme, morpheme, lexeme, ...)
  • Knowledge-based: uses domain models of
    documents described by an ontology
  • Porter algorithm (stemming) for English

4
The Slovak language is more complicated
  • Inflection of Slovak grammatical forms:
    nouns, adjectives, pronouns, ...
  • Complicated conjugation and declension,
    prefixes and suffixes, ...
  • Synonyms and homonyms
  • Phrases containing more than one word
  • And so on

5
System for Information Retrieval in STD
(Furdík, K.: Inf. Retrieval in Nat. Language by Hypertext
Structure, 2003)
  • User Indexation
    Document Administrator

6
How to continue
  • Utilization of neural networks
  • A well-trained NN is able to:
  • simplify the analysis of Slovak text,
  • be invariant to the inflection of Slovak
    words,
  • perform linguistic analysis faster
  • Disadvantage:
  • problems with learning and the static structure
    of the NN

7
System for Information Retrieval
  • Can be simplified


8
This means a 3-Layer Information Retrieval
System
  • The most simplified structure of the system
  • Layers: Queries, Keywords, Documents

9
Next solution: representation of the query,
keyword and document layers by neural networks
10
Development of the 1st NN for Keyword Determination
  • 1st NN: feed-forward NN of the back-propagation type
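
As a rough illustration of what a feed-forward pass from a query to keyword activations might look like, here is a minimal sketch. The slides do not give the architecture in detail: the layer sizes (12 inputs, 20 outputs) are taken from the Experiments slide, while the hidden layer size, the sigmoid activation, the random untrained weights, and the placeholder query encoding are all assumptions made for this example. Back-propagation training is omitted.

```python
import math
import random


def sigmoid(x):
    """Logistic activation, mapping any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def layer(inputs, weights, biases):
    """One fully connected layer followed by a sigmoid."""
    return [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def init_params(n_in, n_hidden, n_out, seed=0):
    """Random, untrained weights (assumption: one hidden layer)."""
    rng = random.Random(seed)

    def mat(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

    return {
        "W1": mat(n_hidden, n_in), "b1": [0.0] * n_hidden,
        "W2": mat(n_out, n_hidden), "b2": [0.0] * n_out,
    }


def forward(query_vec, params):
    """Query vector -> hidden layer -> keyword activations."""
    hidden = layer(query_vec, params["W1"], params["b1"])
    return layer(hidden, params["W2"], params["b2"])


# 12 query inputs and 20 keyword outputs as in the Experiments slide;
# the hidden size of 8 is hypothetical.
params = init_params(n_in=12, n_hidden=8, n_out=20)
query_vec = [0.1 * i for i in range(12)]  # placeholder encoding of 12 characters
keyword_activation = forward(query_vec, params)
print(len(keyword_activation))
```

In a trained network the largest output activations would indicate the keywords matched by the query; with the random weights used here the values carry no meaning.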

11
Development of the 2nd NN for Document Determination
  • Vector Space Model
  • $K$ ($m \times n$): vector space matrix
  • $k_{kd}$: frequency of keywords in documents
  • $k$: number of keywords
  • $d$: number of documents

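
The vector space matrix can be illustrated with a small sketch that counts keyword occurrences per document. The function name `build_vs_matrix`, the toy English documents, and the whitespace tokenization are assumptions for this example; in particular, plain token matching ignores Slovak inflection, which is the problem the 1st NN is meant to address.

```python
from collections import Counter


def build_vs_matrix(keywords, documents):
    """Return K as a list of rows: K[k][d] = count of keyword k in document d."""
    # Naive tokenization (assumption): lowercase and split on whitespace.
    counts = [Counter(doc.lower().split()) for doc in documents]
    return [[c[kw] for c in counts] for kw in keywords]


keywords = ["network", "retrieval"]
documents = [
    "neural network model for retrieval",
    "network network training",
    "document retrieval and retrieval evaluation",
]

K = build_vs_matrix(keywords, documents)
for kw, row in zip(keywords, K):
    print(kw, row)
# network [1, 2, 0]
# retrieval [1, 0, 2]
```

Rows correspond to keywords and columns to documents, matching the keywords-by-documents layout described above.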
12
NN with Spreading Activation Function
  • Determination of Documents

13
NN with Spreading Activation Function
  • The SAF NN does not learn
  • Weights are set by the equation
  • $W = K$
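
One plausible reading of this step is that keyword activations are spread to the document layer through the weights, so each document's score is a weighted sum over the active keywords. The sketch below assumes exactly that (a plain linear sum with no thresholding); the function name `spread_activation` and the toy matrix are hypothetical.

```python
def spread_activation(keyword_activation, W):
    """Document activations: a_d = sum over keywords k of a_k * W[k][d]."""
    n_docs = len(W[0])
    return [
        sum(a_k * W[k][d] for k, a_k in enumerate(keyword_activation))
        for d in range(n_docs)
    ]


# Hypothetical matrix with 2 keywords (rows) and 3 documents (columns).
K = [[1, 2, 0],
     [1, 0, 2]]
W = K  # weights are copied from the vector space matrix, so no training occurs

scores = spread_activation([1, 0], W)  # only the first keyword is active
ranking = sorted(range(len(scores)), key=lambda d: scores[d], reverse=True)
print(scores)   # [1, 2, 0]
print(ranking)  # [1, 0, 2]
```

Because the weights are read directly from the matrix, adding a document only requires adding a column rather than retraining.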

14
Experiments
  • Model of the cascade NN in Matlab
  • Query layer: 12 characters
  • Keyword layer: 20 keywords
  • Document layer: 90 documents
  • Each document: approx. 50 words
  • QTrS: 164 queries in the training set
  • KwTrS: 20 keywords in the training set
  • The 2nd NN is not trained

15
Experiments
  • 1st experiment
  • QTsS1: 185 queries; the queries in QTsS1
    correspond to keywords from KwTrS
  • Precision: 0.996
  • 2nd experiment
  • QTsS1: 100 queries; the queries in QTsS1
    correspond to no keywords from KwTrS
  • Precision: 0.97

16
Disadvantage of the VS Model Approach
  • Large dimension of the VS matrix
  • Next approach: dimension reduction of the VS matrix
    by the Latent Semantic Model

17
Latent Semantic Model
  • Singular Value Decomposition of the Vector Space
    Matrix $K$
  • $K = U S V^T$
  • $U$: row-oriented eigenvectors of $K K^T$
  • $V$: column-oriented eigenvectors of $K^T K$
  • $S$: diagonal matrix of the singular values of $K$
    (square roots of the eigenvalues of $K K^T$)
  • $\dim(S) < \dim(K)$
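
For reference, the link between the factors of the decomposition and the two eigenproblems can be written out as below. This is the standard SVD identity rather than anything specific to the slides, and it assumes $U$ and $V$ have orthonormal columns.

```latex
\begin{aligned}
K &= U S V^{T}, \qquad U^{T} U = I, \quad V^{T} V = I, \\
K K^{T} &= U S V^{T} V S^{T} U^{T} = U \,(S S^{T})\, U^{T}, \\
K^{T} K &= V S^{T} U^{T} U S V^{T} = V \,(S^{T} S)\, V^{T}.
\end{aligned}
```

So the columns of $U$ are eigenvectors of $K K^T$, the columns of $V$ are eigenvectors of $K^T K$, and the squared singular values are the shared nonzero eigenvalues.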

18
VS Matrix Dimension Reduction: Truncated SVD
  • $S_r < S$
  • $k$: number of singular values $s_i$
  • $r < k$, where $r$ is the number of $s_i$ after dimension
    reduction
  • The number of elements of the reduced matrices
    is lower than the number of elements in the matrix $K$

19
Solution of VSM Dimension Reduction
  • Document relevance $D$ is defined by
  • $D = Q \times K$,
  • $Q$: set of queries
  • $K$: VS matrix
  • Reduced document relevance $D_r$ is defined by
  • $D_r = Q \times K_r$,
  • $K_r = U S_r V^T$: reduced VS matrix
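
The reduced relevance can be expanded term by term. The sketch below assumes that $S_r$ keeps the $r$ largest singular values $s_i$ and sets the rest to zero, and that $u_i$ and $v_i$ denote the $i$-th columns of $U$ and $V$ (notation introduced here for illustration).

```latex
\begin{aligned}
K_r &= U S_r V^{T} = \sum_{i=1}^{r} s_i \, u_i v_i^{T}, \\
D_r &= Q \times K_r = \sum_{i=1}^{r} s_i \,(Q u_i)\, v_i^{T}.
\end{aligned}
```

Under that assumption each retained singular value contributes one rank-one term, so $D_r$ approaches $D$ as $r$ grows.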

20
Experiments
  • Collection of 90 documents with 20 keywords,
    represented by the vector space matrix
  • Dimension reduction by truncated singular value
    decomposition
  • For each chosen number of singular values, computation
    of the precision, recall, and absolute and relative
    number of elements $k_{il}$

21
Evaluation of Experiments: precision, recall,
number of elements of VSred
  • Recall: $R = n_{retrel} / n_{rel}$
  • $n_{retrel}$: number of retrieved relevant documents
  • $n_{rel}$: number of relevant documents
  • Precision: $P = n_{retrel} / n_{ret}$
  • $n_{ret}$: number of retrieved documents
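
The two measures can be computed directly from sets of document identifiers. The following is a minimal sketch; the function name `precision_recall` and the example identifiers are made up for illustration, and returning 0.0 for an empty set is a convention chosen here, not something stated in the slides.

```python
def precision_recall(retrieved, relevant):
    """Return (P, R) with P = n_retrel / n_ret and R = n_retrel / n_rel."""
    retrieved, relevant = set(retrieved), set(relevant)
    n_retrel = len(retrieved & relevant)  # retrieved AND relevant
    precision = n_retrel / len(retrieved) if retrieved else 0.0
    recall = n_retrel / len(relevant) if relevant else 0.0
    return precision, recall


# 4 documents retrieved, 5 relevant, 2 in common.
p, r = precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 4, 5, 6, 7])
print(p, r)  # 0.5 0.4
```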

22
Results
  s_i   Precision   Recall   No. of elements   No. of elements
                             (absolute)        (relative)
   1    0.7942      0.24     110               0.632
   2    0.95        0.314    121               0.695
   3    0.95        0.405    137               0.787
   5    0.975       0.512    148               0.850
   7    0.977       0.634    161               0.925
  10    1.0         0.754    165               0.948
  15    1.0         0.95     173               0.994
  20    1.0         1.0      174               1.0

23
Conclusion
  • Follows from the table: precision and recall increase
    with the number of singular values retained