Applying%20Evolutionary%20Computation%20Techniques%20to%20Web%20Information%20Retrieval - PowerPoint PPT Presentation

About This Presentation
Title:

Applying%20Evolutionary%20Computation%20Techniques%20to%20Web%20Information%20Retrieval

Description:

Applying Evolutionary Computation Techniques to Web ... Keyword1. Randomly. generated. Chromosome. W1 W2 select. Keyword1. 38. HGA for Web Search (cont. ... – PowerPoint PPT presentation

Number of Views:64
Avg rating:3.0/5.0
Slides: 55
Provided by: marks196
Category:

less

Transcript and Presenter's Notes

Title: Applying%20Evolutionary%20Computation%20Techniques%20to%20Web%20Information%20Retrieval


1
Applying Evolutionary Computation Techniques to
Web Information Retrieval
  • Chih-Chin Lai, Ph.D.
  • Dept. of Computer Science and Information
    Engineering
  • National University of Tainan, Taiwan
  • E-mail cclai_at_mail.nutn.edu.tw
  • Nov. 28, 2007

2
Outlines
  • Information Retrieval
  • some related topics
  • Evolutionary Computation (EC)
  • Applying EC to Web Information Retrieval
  • Conclusions

3
Introduction
  • Definition of Information Retrieval
  • Salton (1989) Information-retrieval systems
    process files of records and requests for
    information, and identify and retrieve from the
    files certain records in response to the
    information requests. The retrieval of particular
    records depends on the similarity between the
    records and the queries, which in turn is
    measured by comparing the values of certain
    attributes to records and information requests.
  • Kowalski (1997) An Information Retrieval System
    is a system that is capable of storage,
    retrieval, and maintenance of information.
    Information in this context can be composed of
    text (including numeric and date data), images,
    audio, video, and other multi-media objects).

4
Introduction (cont.)
  • Information Retrieval (IR)
  • The indexing and retrieval of textual documents
  • Searching newspaper articles
  • Searching on the Web
  • Concerned firstly with retrieving relevant
    documents to a query
  • Concerned secondly with retrieving large sets of
    documents efficiently

5
Typical IR Task
  • Given
  • User has information need
  • A corpus of textual natural-language documents
  • A user query in the form of a textual string
  • Find
  • A ranked set of documents that are relevant to
    the query

6
Key Qualities
  • Document and query representations
  • Mechanisms for finding relevant documents and
    ranking the results
  • Mechanisms for obtaining user feedback

7
Typical IR System
Documents
Documents
Query
User
Process
Process
Retrieved relevant(?) documents
Store
Retrieval Part
IR System
8
Relevance
  • Relevance is a subject judgment
  • Being on the proper subject
  • Being timely (recent information)
  • Satisfying the goals of the user and his/her
    intended use of the information (information need)

9
IR System Components
  • Text operations forms index words (tokens)
  • Stopword removal
  • Stemming
  • Indexing maps each keyword to a set of documents
    that contains the keyword
  • Searching retrieves documents that contain a
    given query token from the inverted index
  • Ranking scores all retrieved documents according
    to a relevance metric

10
IR System Components (cont.)
  • User interface manages interaction with the user
  • Query input and document output
  • Relevance feedback
  • Visualization of results
  • Query operations transform the query to improve
    retrieval

11
Examples of IR System
  • Conventional (library catalog) Search by
    keyword, title, author, etc.

12
Examples of IR System (cont.)
  • Text-based (Google) Search by keywords. Limited
    search using queries in natural language

13
Examples of IR System (cont.)
  • Multimedia (WebSeek) Search by visual appearance
    (shapes, colors,)

14
Examples of IR System (cont.)
  • Question answering systems (AnswerBus) Search in
    (restricted) natural language

15
Searching the Web
  • Application of IR to HTML documents on the World
    Wide Web
  • Three forms
  • Use search engines that index a portion of the
    Web documents as a full-text database
  • Use Web directories, which classify selected Web
    documents by subject
  • Search the Web exploiting its hyperlink structure

16
Web Search System
Documents
Query
Spider
User
Process
Process
Retrieved relevant(?) documents
Store
World Wide Web
Retrieval Part
IR System
17
Retrieval Models
  • A retrieval model specifies the details of
  • Document and Query representation
  • Matching strategies for assessing the relevance
    of documents to a user query
  • Methods for ranking query output
  • Mechanisms for acquiring user-relevance feedback
  • Notion of relevance can be binary or continuous
    (i.e. ranked retrieval)

18
Types of IR Models
  • Boolean model
  • Simple Boolean queries regarding existence of
    terms within documents
  • Easy to understand, but difficult to rank output
  • Vector space model
  • Documents are represented by n-dimensional
    vectors
  • Typically one dimension per term

19
Types of IR Models (cont.)
  • Probabilistic model
  • Start with some user-supplied relevance
    information about a training set of documents
  • The training set is used to compute term weights
    by estimating
  • Useful for theoretical analysis, but probably not
    in practice (?)

20
Statistical Retrieval
  • Retrieval based on similarity between query and
    documents
  • Output documents are ranked according to
    similarity to query
  • Similarity based on occurrence frequencies of
    keywords in query and document

21
The Vector Space Model
  • A document is typically represented by a bag of
    words (unordered words with frequencies)
  • Assume a vocabulary of t distinct terms
  • Each term, i, in a document or query, j, is given
    a real-valued weight, wij
  • Both documents and queries are expressed as
    t-dimensional vectors
  • dj (w1j, w2j, , wtj)

22
Concept Representation
  • Example
  • Vdoc1 2T1 4T2 5T3
  • Vdoc2 4T1 7T2 T3
  • Vquery 0T1 0T2 2T3

T3
5
Vdoc1 2T1 4T2 5T3
Vquery 0T1 0T2 2T3
4
2
T1
Vdoc2 4T1 7T2 T3
7
  • Is Vdoc1or Vdoc2 more similar to Vquery?
  • How to measure the degree of similarity?

T2
23
Term Weights TF-IDF
  • More frequent terms in a document are more
    indicative to the topic
  • fij frequency of term i in document j
  • tfij fij / maxfij (normalization)
  • Terms that appear in many different documents are
    less indicative of overall topic
  • df i document frequency of term i
  • number of documents containing term
    i
  • idfi inverse document frequency of term i,
  • log(N/ df i) ( where N total
    number of documents)

24
TF-IDF Weighting
  • A typical combined term importance indicator is
    tf-idf weighting
  • wij tfij idfi tfij log (N/ dfi)
  • A term occurring frequently in the document but
    rarely in the rest of the collection is given
    high weight
  • Experimentally, tf-idf has been found to work well

25
Similarity Measure
  • A similarity measure is a function that computes
    the degree of similarity between two vectors
  • Using a similarity measure between the query and
    each document
  • to rank the retrieved documents
  • to control the size of the retrieved set

26
Similarity Measure (cont.)
t3
  • Cosine similarity measures the cosine of the
    angle between two vectors inner product
    normalized by the vector lengths

?1
Vdoc1
Vquery
?2
t1
CosSim(dj, q)
Vdoc2
t2
Vdoc1 2T1 4T2 5T3 CosSim(Vdoc1 ,
Vquery) 10 / ?(41625)(004) 0.75 Vdoc2
4T1 7T2 1T3 CosSim(Vdoc2 , Vquery) 2 /
?(16491)(004) 0.12 Vquery 0T1 0T2 2T3
D1 is 6 times better than D2 using cosine
similarity but only 5 times better using inner
product.
27
Accuracy Measures Precision and Recall
Relevant documents
From all the documents that are retrieved by the
IR system, how many are relevant? From all
the documents that are relevant out there, how
many did the IR system retrieve?
28
Precision and Recall
  • Precision
  • The ability to retrieve top-ranked documents that
    are mostly relevant
  • Recall
  • The ability of the search to find all of the
    relevant items in the corpus

29
Precision and Recall Variations
Narrow query formulation Returns relevant
documents but misses many useful ones
1
Broad query formulation Returns most
relevant documents but includes lots of junk
Precision
1
0
Recall
Figure taken from Raymond J. Mooney
(http//www.cs.utexas.edu/users/mooney/ir-course/)
30
Evolutionary Computation
  • Definition
  • EC (GA, GP, ES) solve computational problems by
    simulating evolution with natural selection
  • They are stochastic search algorithms which
    incrementally preserve and combine desirable
    features of individual potential solutions in a
    population over an extended period of time

Figure taken from www.genetic-programming.org
31
Template of EC
  • procedure EC
  • begin
  • t 0
  • initializePopulation(P(0))
  • evaluate(P(0))
  • repeat
  • t t 1
  • P' selectForVariation((P(t))
  • recombine(P')
  • mutate(P')
  • evaluate(P')
  • until termination true
  • end

32
Applications of EC to IR
  • EC has been applied to the following problems
  • Automatic document indexing
  • Document and term clustering
  • Query definition
  • Matching function learning
  • Image retrieval
  • Design of user profiles for IR on the Internet
  • Web page classification
  • Design of agents for Internet searching

33
MGA for Web Search
  • Genetic algorithm
  • John Holland, 1975
  • David E. Goldberg, 1989
  • Metagenetic algorithm (MGA)
  • Zacharis and Panayiotopoulos proposed (2001)
  • A two-stage GA that controls and optimizes both
    populations simultaneously

34
MGA for Web Search (cont.)
  • Zacharis and Panayiotopoulos, IEEE Internet
    Computing, 2001

35
Hierarchical Genetic Algorithm
  • HGA
  • Tang et al. (1998) proposed
  • It is a variant of conventional genetic
    algorithm with hierarchical genetic structure
  • In HGA, the chromosome consists of two types of
    genes
  • the control genes and
  • the parametric genes
  • The relationship between parametric genes and
    control genes is that the activation of former is
    governed by the value of the latter

36
HGA Representation
37
HGA for Web Search
Chromosome
Randomly generated
W1 gt W2 select Keyword1
Dictionary
Control genes Control genes Control genes Control genes Control genes Parametric genes
1 1 0 0 1 news intelligence mit lab artificial
0 1 1 0 1 mit artificial ai lab intelligence
38
HGA for Web Search (cont.)
Control genes Control genes Control genes Control genes Control genes
1 1 0 0 1
0 1 1 0 1
Cut point
Control genes Control genes Control genes Control genes Control genes Parametric genes
1 1 1 0 1 news intelligence mit lab artificial
0 1 0 0 1 mit artificial ai lab intelligence
39
HGA for Web Search (cont.)
Control genes Control genes Control genes Control genes Control genes Parametric genes
1 1 1 0 1 news intelligence mit lab artificial
Dictionary
0 1 0 0 1 mit artificial ai lab intelligence
40
HGA for Web Search (cont.)
User interface
Interesting
Relevance page
Update
PWIS
Vector DD
Recommendation component
Vector DR
Query
World Wide Web
Results by PageRank
Keywords
41
HGA for Web Search (cont.)
42
HGA for Web Search (cont.)
43
Profile for Web Search
44
Profile for Web Search
45
Profile for Web Search
46
Profile for Web Search (cont.)
47
Profile for Web Search (cont.)
48
Profile for Web Search (cont.)
49
Conclusions
  • The aim of a Web IR system is to estimate the
    relevance of web information items to a user
    information need expressed in a query
  • This is a very hard and complex task
  • It is pervaded with subjectivity, vagueness and
    impression
  • The main characteristic of EC is that it is
    tolerant to impression, vagueness, partial truth,
    and approximation
  • EC techniques have been used satisfactorily to
    improve IR process

50
Conclusions (cont.)
Figure taken from M. Henzinger, The past,
presence, and future of Web Information Retrieval
51
Web Intelligence
  • Today's search engines are designed for human
    consumption
  • (1) A user queries the SE and gets relevant pages
  • (2) The user reads the pages and extracts
    manually the information
  • (3) The information must be integrated to produce
    the desired knowledge

(1)
(2)
(1)
(3)
(3)
Figure taken from Prof. F. Ciravegna, University
of Sheffield, Web Intelligence
52
Web Intelligence (cont.)
  • The future web will have semantics associated to
    pages and SE will be able to provide
    semantically-based services

Figure taken from Prof. F. Ciravegna, University
of Sheffield, Web Intelligence
53
References Journals
  • Information Processing and Management
  • Journal of the American Society of Information
    Science
  • Transactions On Information Science
  • Information Retrieval
  • Journal of Documentation
  • Information Retrieval

54
Good books
  • Van Rijsbergen
  • Information Retrieval, ir.dcs.gla.ac.uk
  • Sparck Jones and Willett
  • Readings in Information Retrieval
  • Baeza-Yates and Ribeiro-Neto
  • Modern Information Retrieval
  • Witten, Moffat and Bell
  • Managing Gigabytes
Write a Comment
User Comments (0)
About PowerShow.com