Title: Information Retrieval by means of Vector Space Model of Document Representation and Cascade Neural Networks
1Information Retrieval by means of Vector Space
Model of Document Representation and Cascade
Neural Networks
- Igor Mokriš, Lenka Skovajsová
- Institute of Informatics, SAS Bratislava, Slovakia
- mokris_at_aoslm.sk, skovajsova_at_aoslm.sk
2Summary
- Development of a neural network model for information retrieval from text documents in the Slovak language, using the vector space model of document representation
- Key words: Information Retrieval, Queries, Keywords, Text Documents, Neural Networks, Slovak Language
3Text Document Analysis
- The most common approaches:
- Statistical: analyses words in text documents by comparing them with keywords
- Linguistic: extracts linguistic units from text (phoneme, morpheme, lexeme, ...)
- Knowledge-based: uses domain models of documents described by an ontology
- Porter algorithm for English
4Slovak language is more complicated
- Inflection of Slovak grammatical forms: nouns, adjectives, pronouns, ...
- Complicated conjugation and declension of words, prefixes and suffixes, ...
- Synonyms and homonyms
- Phrases containing more than one word
- And so on
5System for Information Retrieval in STD
(Furdík, K.: Inf. Retrieval in Nat. Language by Hypertext Structure, 2003.)
- [Diagram; surviving labels: User, Indexation, Document, Administrator]
6How to continue
- Utilization of Neural Networks
- A well-trained NN is able
- to simplify Slovak text analysis,
- to be invariant to the inflection of Slovak words,
- to perform linguistic analysis faster
- Disadvantage
- problems with learning and the static structure of the NN
7System for Information Retrieval
8It means 3 Layer Information Retrieval System
- Most simplified structure of the system
- [Diagram; layer labels: Queries, Keywords, Documents]
9Next solution
- Representation of the query, keyword and document layers by neural networks
10Development of 1st NN for Keyword Determination
- 1st NN: feed-forward NN of the back-propagation type
11Development of 2nd NN for Document Determination
- Vector Space Model
- $K$ ($m \times n$): vector space matrix
- $k_{kd}$: frequency of keywords in documents
- $k$: number of keywords
- $d$: number of documents
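A vector space matrix of this kind can be sketched in a few lines of Python. This is only an illustration under assumptions that are not on the slides: the function name `build_vs_matrix`, the toy English documents, and the whitespace tokenization are all hypothetical, and no Slovak inflection handling is attempted.

```python
from collections import Counter


def build_vs_matrix(keywords, documents):
    """Return K as a list of rows: K[i][j] = frequency of keywords[i] in documents[j].

    Assumption: documents are split on whitespace and lowercased;
    inflected word forms are NOT merged.
    """
    counts = [Counter(doc.lower().split()) for doc in documents]
    return [[c[kw] for c in counts] for kw in keywords]


# Toy data (hypothetical, not from the experiments in the slides)
keywords = ["network", "query", "document"]
documents = [
    "neural network answers a query",
    "the document matches the query and the query layer",
    "network network document",
]

K = build_vs_matrix(keywords, documents)
for kw, row in zip(keywords, K):
    print(kw, row)
```

Rows correspond to keywords and columns to documents, matching the "keywords in documents" orientation of the matrix described above.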
12NN with Spreading Activation Function
- Determination of Documents
13NN with Spreading Activation Function
- The SAF NN does not learn
- Weights are set by the equation
- $W = K$
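A minimal sketch of how such an untrained layer could map keyword activations to document scores follows. The slides state only that the weights equal the vector space matrix; the plain weighted sum used here, the function name `spread_activation`, and the toy numbers are assumptions for illustration.

```python
def spread_activation(keyword_activations, W):
    """Propagate keyword activations to documents: score_j = sum_i a_i * W[i][j].

    Assumption: a linear sum with no threshold or normalization.
    """
    n_docs = len(W[0])
    return [
        sum(a * row[j] for a, row in zip(keyword_activations, W))
        for j in range(n_docs)
    ]


# Toy vector space matrix: 3 keywords (rows) x 3 documents (columns)
K = [
    [1, 0, 2],
    [1, 2, 0],
    [0, 1, 1],
]
W = K  # weights are copied from K; nothing is trained

a = [1, 0, 1]  # hypothetical output of the 1st NN: keywords 0 and 2 are active
scores = spread_activation(a, W)

# Rank documents by descending score
ranking = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
print(scores, ranking)
```

Because the weights are read directly from the matrix, adding a document only means adding a column to it, with no retraining of this second network.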
14Experiments
- Model of cascade NN in Matlab
- Query layer - 12 characters
- Keyword layer - 20 keywords
- Document layer - 90 documents
- Each document - approx. 50 words
- QTrS: 164 queries in the training set
- KwTrS: 20 keywords in the training set
- The 2nd NN is not trained
15Experiments
- 1st experiment
- QTsS1: 185 queries; the queries in QTsS1 correspond to keywords from KwTrS
- Precision: 0.996
- 2nd experiment
- QTsS1: 100 queries; the queries in QTsS1 correspond to no keywords from KwTrS
- Precision: 0.97
16Disadvantage of VS Model Approach
- Large dimension of the VS matrix
- Next approach: dimension reduction of the VS matrix by the Latent Semantic Model
17Latent Semantic Model
- Singular Value Decomposition of the vector space matrix $K$:
- $K = U S V^T$
- $U$: row-oriented eigenvectors of $K K^T$
- $V$: column-oriented eigenvectors of $K^T K$
- $S$: diagonal matrix of the singular values of $K$ (square roots of the eigenvalues of $K K^T$)
- $\dim(S) < \dim(K)$
18VS Matrix Dimension Reduction
- Truncated SVD: $S_r < S$
- $k$: number of singular values $s_i$
- $r < k$, where $r$ is the number of $s_i$ after dimension reduction
- The number of elements of the reduced matrices is lower than the number of elements in the matrix $K$
19Solution of Dimension VSM Reduction
- Document relevance $D$ is defined by
- $D = Q \times K$,
- $Q$: set of queries
- $K$: VS matrix
- Reduced document relevance $D_r$ is defined by
- $D_r = Q \times K_r$,
- $K_r = U S_r V^T$: reduced VS matrix
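The reduction can be written out as a worked equation. This is a sketch under assumptions not stated on the slides: the singular values are ordered $s_1 \ge s_2 \ge \dots \ge s_k$, $u_i$ and $v_i$ denote the $i$-th columns of $U$ and $V$, and $S_r$ is $S$ with all but the first $r$ singular values set to zero.

```latex
\begin{align}
K   &= U S V^{T}   = \sum_{i=1}^{k} s_i \, u_i v_i^{T}, \\
K_r &= U S_r V^{T} = \sum_{i=1}^{r} s_i \, u_i v_i^{T}, \qquad r < k, \\
D   &= Q \times K, \qquad D_r = Q \times K_r .
\end{align}
```

Under these assumptions, keeping only the first $r$ columns of $U$ and $V$ together with the $r$ singular values takes $r(m + n + 1)$ numbers instead of $m n$ for the full $K$, which is a saving whenever $r < mn / (m + n + 1)$.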
20Experiments
- Collection of 90 documents with 20 keywords: vector space matrix
- Dimension reduction by truncated singular value decomposition
- For each chosen number of singular values, computation of the precision, recall, and the absolute and relative number of elements $k_{il}$
21Evaluation of Experiments: precision, recall, number of elements VSred
- $R$: recall, $R = n_{retrel} / n_{rel}$
- $n_{retrel}$: number of retrieved relevant documents
- $n_{rel}$: number of relevant documents
- $P$: precision, $P = n_{retrel} / n_{ret}$
- $n_{ret}$: number of retrieved documents
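The two measures can be computed directly from sets of document identifiers. The sketch below is illustrative only: the function name `precision_recall`, the toy identifiers, and the convention of returning 0.0 when a denominator is zero are assumptions, not part of the slides.

```python
def precision_recall(retrieved, relevant):
    """Return (P, R) with P = n_retrel / n_ret and R = n_retrel / n_rel.

    Assumption: a measure is reported as 0.0 when its denominator is zero.
    """
    retrieved, relevant = set(retrieved), set(relevant)
    n_retrel = len(retrieved & relevant)  # retrieved AND relevant
    precision = n_retrel / len(retrieved) if retrieved else 0.0
    recall = n_retrel / len(relevant) if relevant else 0.0
    return precision, recall


# Toy example: 4 documents retrieved, 5 relevant, 2 in common
p, r = precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 4, 5, 6, 7])
print(p, r)  # 0.5 0.4
```

Precision penalizes irrelevant documents among those returned, while recall penalizes relevant documents that were missed, so the two are reported together.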
22Results
Number of $s_i$   Precision   Recall   Absolute no. of elements   Relative no. of elements
1                 0.7942      0.24     110                        0.632
2                 0.95        0.314    121                        0.695
3                 0.95        0.405    137                        0.787
5                 0.975       0.512    148                        0.850
7                 0.977       0.634    161                        0.925
10                1.0         0.754    165                        0.948
15                1.0         0.95     173                        0.994
20                1.0         1.0      174                        1.0
23Conclusion