1
Genetic Algorithms for Information Retrieval
  • Presented to: Dr. Eyyas Qawasmeh
  • Prepared by: Duaa Sawalha (20063173038)

2
Outline
  • 1. Introduction
  • 2. Genetic Algorithms in Information Retrieval
  • 2.1. Chromosome representation
  • 2.2. Fitness evaluation
  • 2.3. Selection
  • 2.4. Crossover
  • 2.5. Mutation
  • 3. Suggested steps of GA in IR
  • 4. A GA Example
  • 5. Conclusion
  • 6. References

3
1. Introduction
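[Figure: the retrieval process. Docs are represented by index terms (doc), the information need is expressed as a query, and matching the two yields a ranking.]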
4
2. Genetic Algorithms in Information Retrieval
  • The GA starts with a limited number of
    individuals from P (initial population).
  • The iterative search process is based
    on the competition of these individuals and
    their descendants during a number of
    generations.
  • The individuals are coded according to the
    chromosome model as a string of length l.
  • The simplest GA constructs a new generation from
    an old one in three steps: reproduction,
    crossover, and mutation.

5
2.1 Chromosome Representation
  • A document vector (Doc) with n keywords and a
    query vector with m query terms can be
    represented as:
  • Doc = (term1, term2, term3, ..., termn)
  • Query = (qterm1, qterm2, qterm3, ..., qtermm)
  • With binary term vectors, each termi (or
    qtermj) is either 0 or 1. termi is set to 0
    when the term is not present in the document and
    to 1 when it is present.

6
2.1 Chromosome Representation (cont.)
  • For example, a user enters a query into our
    system that retrieves 5 documents. These
    documents are:
  • Doc1: Relational Databases, Query, Data
    Retrieval, Computer Networks, DBMS
  • Doc2: Artificial Intelligence, Internet,
    Indexing, Natural Language Processing
  • Doc3: Databases, Expert System, Information
    Retrieval System, Multimedia
  • Doc4: Fuzzy Logic, Neural Network, Computer
    Networks
  • Doc5: Object-Oriented, DBMS, Query, Indexing

7
2.1 Chromosome Representation (cont.)
  • All keywords of these documents can be arranged
    in ascending alphabetical order as:
  • Artificial Intelligence, Computer Networks, Data
    Retrieval, Databases, DBMS, Expert System, Fuzzy
    Logic, Indexing, Information Retrieval System,
    Internet, Multimedia, Natural Language
    Processing, Neural Network, Object-Oriented,
    Query, Relational Databases.
  • Each document is then encoded as a chromosome
    with one bit per keyword, in that order:
  • Doc1: 0 1 1 0 1 0 0 0 0 0 0 0 0 0 1 1
  • Doc2: 1 0 0 0 0 0 0 1 0 1 0 1 0 0 0 0
  • Doc3: 0 0 0 1 0 1 0 0 1 0 1 0 0 0 0 0
  • Doc4: 0 1 0 0 0 0 1 0 0 0 0 0 1 0 0 0
  • Doc5: 0 0 0 0 1 0 0 1 0 0 0 0 0 1 1 0
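
The encoding above can be reproduced with a short Python sketch. The names `DOCS`, `build_vocabulary`, and `encode` are illustrative and not from the slides; the sketch also assumes that "ascending order" means a case-insensitive alphabetical sort, which is what places Databases before DBMS.

```python
# Keyword lists copied from the five example documents.
DOCS = {
    "Doc1": ["Relational Databases", "Query", "Data Retrieval",
             "Computer Networks", "DBMS"],
    "Doc2": ["Artificial Intelligence", "Internet", "Indexing",
             "Natural Language Processing"],
    "Doc3": ["Databases", "Expert System", "Information Retrieval System",
             "Multimedia"],
    "Doc4": ["Fuzzy Logic", "Neural Network", "Computer Networks"],
    "Doc5": ["Object-Oriented", "DBMS", "Query", "Indexing"],
}


def build_vocabulary(docs):
    """Collect every distinct keyword and sort case-insensitively (assumption)."""
    keywords = {kw for kws in docs.values() for kw in kws}
    return sorted(keywords, key=str.lower)


def encode(keywords, vocabulary):
    """Return a binary chromosome: 1 where the vocabulary term is in the document."""
    present = set(keywords)
    return [1 if term in present else 0 for term in vocabulary]


VOCABULARY = build_vocabulary(DOCS)
CHROMOSOMES = {name: encode(kws, VOCABULARY) for name, kws in DOCS.items()}

for name, bits in CHROMOSOMES.items():
    print(name, " ".join(map(str, bits)))
```

A plain case-sensitive sort would order the uppercase "DBMS" before "Data Retrieval", so the `key=str.lower` argument matters for matching the bit positions on the slide.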

8
2.2 Fitness Evaluation
  • From the GA perspective, fitness means that the
    genes that are most fit are most likely to
    survive, while less fit genes die off and are
    replaced by fitter genes.
  • Several functions may be used to determine the
    fitness and efficacy of a grammar, such as:
  • Average search length (ASL).
  • Average maximum parse length (AMPL).
  • The Dice, cosine, and Jaccard coefficients, if
    the vector space model is used.
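
For binary term vectors, the three vector-space coefficients reduce to counts of shared and set bits. The following is a minimal sketch under that assumption; the function names are illustrative, and the two sample vectors are the Doc1 and Doc5 chromosomes from section 2.1.

```python
import math


def jaccard(a, b):
    """Shared 1-bits divided by positions where either vector has a 1."""
    both = sum(x & y for x, y in zip(a, b))
    either = sum(x | y for x, y in zip(a, b))
    return both / either if either else 0.0


def dice(a, b):
    """Twice the shared 1-bits divided by the total number of 1-bits."""
    both = sum(x & y for x, y in zip(a, b))
    total = sum(a) + sum(b)
    return 2 * both / total if total else 0.0


def cosine(a, b):
    """Shared 1-bits divided by the geometric mean of the two 1-bit counts."""
    both = sum(x & y for x, y in zip(a, b))
    norm = math.sqrt(sum(a) * sum(b))
    return both / norm if norm else 0.0


doc1 = [0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
doc5 = [0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0]

print(jaccard(doc1, doc5), dice(doc1, doc5), cosine(doc1, doc5))
```

Doc1 and Doc5 share two keywords (DBMS and Query) out of seven distinct ones, so the Jaccard score is 2/7, the Dice score is 4/9, and the cosine score is 2/sqrt(20).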

9
2.3 Selection
  • In genetic inheritance, selection means that the
    best chromosomes get more copies, the average
    ones stay even, and the worst die off.
  • In genetic algorithms, the new population is
    selected according to a probability distribution
    based on the fitness values.
  • In information retrieval, many researchers have
    used the roulette wheel reproduction process.
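
Roulette wheel selection can be sketched as follows: each individual occupies a slice of the wheel proportional to its fitness, and one spin picks one individual. This is a generic illustration rather than the procedure of any particular paper; `roulette_select` and the toy population are hypothetical.

```python
import random


def roulette_select(population, fitnesses, rng):
    """Pick one individual with probability proportional to its fitness."""
    total = sum(fitnesses)
    if total <= 0:
        raise ValueError("at least one fitness value must be positive")
    spin = rng.random() * total
    cumulative = 0.0
    for individual, fitness in zip(population, fitnesses):
        cumulative += fitness
        if spin < cumulative:
            return individual
    return population[-1]  # guard against floating-point round-off


population = ["A", "B", "C"]
fitnesses = [0.1, 0.3, 0.6]

rng = random.Random(42)  # fixed seed so the run is repeatable
picks = [roulette_select(population, fitnesses, rng) for _ in range(3000)]
counts = {name: picks.count(name) for name in population}
print(counts)
```

Over many spins the fittest individual is chosen most often, but weaker individuals still get an occasional copy, which keeps some variety in the population.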

10
2.4 Crossover
  • The intuition behind crossover in the genetic
    algorithm is to produce new solutions from
    existing ones. Crossover may be single-point or
    multi-point.
  • The suggested crossover for IR is multi-point
    crossover. High-fitness chromosomes are more
    likely to be chosen in the crossover process.
  • For example, two chromosomes are crossed over
    between positions 5 and 11, so the genes in
    positions 6 through 11 are exchanged:
  • 1 0 1 1 1 1 1 1 0 0 1 1 1 0 1
  • 1 0 0 1 1 0 0 1 1 1 1 0 0 0 0
  • The crossover yields two new chromosomes:
  • 1 0 1 1 1 0 0 1 1 1 1 1 1 0 1
  • 1 0 0 1 1 1 1 1 0 0 1 0 0 0 0
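
The crossover example can be checked with a small two-point crossover sketch. The function name and the 0-based slice convention are assumptions for illustration: cutting after position 5 and after position 11 corresponds to the Python slice `[5:11]`.

```python
def two_point_crossover(parent_a, parent_b, start, end):
    """Swap the genes at 0-based indices start..end-1 between two parents."""
    if len(parent_a) != len(parent_b):
        raise ValueError("parents must have the same length")
    child_a = parent_a[:start] + parent_b[start:end] + parent_a[end:]
    child_b = parent_b[:start] + parent_a[start:end] + parent_b[end:]
    return child_a, child_b


# The two parent chromosomes from the example, written as bit strings.
parent_a = "101111110011101"
parent_b = "100110011110000"

child_a, child_b = two_point_crossover(parent_a, parent_b, 5, 11)
print(child_a)
print(child_b)
```

Both children keep the first five and last four genes of one parent and take the middle six genes from the other.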

11
2.5 Mutation
  • Mutation can help the search find solutions that
    crossover alone might not encounter.
  • Mutated chromosomes may be better or poorer than
    the old chromosomes. If they are poorer, they
    are eliminated in the selection step.
  • The objective of mutation is to restore lost
    information and explore the variety of the data.
  • Example (the bit at position 10 is flipped):
  • Original 1 0 1 1 1 1 1 1 0 0 1 1 1 0 1
  • Result   1 0 1 1 1 1 1 1 0 1 1 1 1 0 1
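
Bit-flip mutation can be sketched in two forms: flipping one chosen gene, which reproduces the example, and flipping each gene independently with a small probability, which is how mutation is commonly applied. The function names and the rate of 0.05 are illustrative assumptions.

```python
import random


def flip_bit(chromosome, index):
    """Return a copy of the bit string with the gene at `index` inverted."""
    flipped = "1" if chromosome[index] == "0" else "0"
    return chromosome[:index] + flipped + chromosome[index + 1:]


def mutate(chromosome, rate, rng):
    """Flip each gene independently with probability `rate`."""
    return "".join(
        ("1" if gene == "0" else "0") if rng.random() < rate else gene
        for gene in chromosome
    )


original = "101111110011101"
result = flip_bit(original, 9)  # 0-based index 9 is position 10
print(result)

rng = random.Random(1)  # fixed seed so the run is repeatable
print(mutate(original, 0.05, rng))
```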

12
3. Suggested steps of GA in IR system
  • 1. The user enters a query into the IR system.
  • 2. Match keywords from the user query with the
    list of keywords.
  • 3. Encode the documents retrieved by the user
    query as chromosomes (initial population).
  • 4. Feed the population into the genetic operator
    process: selection, crossover, and mutation.
  • 5. Repeat step 4 until the maximum generation is
    reached. This yields an optimized query
    chromosome for document retrieval.
  • 6. Decode the optimized query chromosome into a
    query and retrieve documents from the database.
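
The encode, evolve, and decode steps of this procedure can be sketched end to end. This is a minimal illustration under stated assumptions, not the presenter's implementation: fitness is taken to be the average Jaccard similarity to the retrieved documents, selection uses `random.choices` as a roulette wheel, the best chromosome seen so far is remembered, and all function names, the mutation rate, and the generation count are hypothetical. The data are the five chromosomes and the keyword list from section 2.1.

```python
import random

# Chromosomes for Doc1..Doc5 and the sorted keyword list from section 2.1.
DOCS = ["0110100000000011", "1000000101010000", "0001010010100000",
        "0100001000001000", "0000100100000110"]
VOCABULARY = ["Artificial Intelligence", "Computer Networks", "Data Retrieval",
              "Databases", "DBMS", "Expert System", "Fuzzy Logic", "Indexing",
              "Information Retrieval System", "Internet", "Multimedia",
              "Natural Language Processing", "Neural Network",
              "Object-Oriented", "Query", "Relational Databases"]


def jaccard(a, b):
    both = sum(x == "1" and y == "1" for x, y in zip(a, b))
    either = sum(x == "1" or y == "1" for x, y in zip(a, b))
    return both / either if either else 0.0


def fitness(chromosome, docs):
    """Assumed fitness: average Jaccard similarity to the retrieved documents."""
    return sum(jaccard(chromosome, d) for d in docs) / len(docs)


def select(population, scores, rng):
    # Roulette wheel; the tiny offset avoids an all-zero weight list.
    weights = [s + 1e-9 for s in scores]
    return rng.choices(population, weights=weights, k=len(population))


def crossover(a, b, rng):
    i, j = sorted(rng.sample(range(1, len(a)), 2))
    return a[:i] + b[i:j] + a[j:], b[:i] + a[i:j] + b[j:]


def mutate(chromosome, rate, rng):
    return "".join(("1" if g == "0" else "0") if rng.random() < rate else g
                   for g in chromosome)


def run_ga(docs, max_generations=50, mutation_rate=0.01, seed=0):
    rng = random.Random(seed)
    population = list(docs)                          # step 3: initial population
    best = max(population, key=lambda c: fitness(c, docs))
    for _ in range(max_generations):                 # step 5: repeat step 4
        scores = [fitness(c, docs) for c in population]
        parents = select(population, scores, rng)    # step 4: selection
        children = []
        for k in range(0, len(parents) - 1, 2):      # step 4: crossover
            children.extend(crossover(parents[k], parents[k + 1], rng))
        if len(parents) % 2:                         # odd one out is copied
            children.append(parents[-1])
        population = [mutate(c, mutation_rate, rng) for c in children]
        candidate = max(population, key=lambda c: fitness(c, docs))
        if fitness(candidate, docs) > fitness(best, docs):
            best = candidate                         # remember the best so far
    return best


def decode(chromosome, vocabulary):
    """Step 6: turn the optimized chromosome back into query keywords."""
    return [term for bit, term in zip(chromosome, vocabulary) if bit == "1"]


best = run_ga(DOCS)
print(best, round(fitness(best, DOCS), 4), decode(best, VOCABULARY))
```

Remembering the best chromosome is a design choice of this sketch: without it, roulette selection and mutation can lose a good solution between generations.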

13
[Flowchart of the steps above: Enter user query -> Query keywords match text keywords? -> (Yes) Encode retrieved documents to chromosomes (generate initial population) -> Feed population into GA process (selection, crossover, and mutation) -> Is max generation reached? (No: repeat the GA process; Yes: continue) -> Decode optimized query chromosome -> Retrieve documents from database]
14
4. A GA Example
  • Input Documents and Keywords
  • DOC0: DATA RETRIEVAL, DATABASE, COMPUTER
    NETWORKS, IMPROVEMENTS, INFORMATION RETRIEVAL,
    METHOD, NETWORK, MULTIPLE, QUERY, RELATION,
    RELATIONAL, RETRIEVAL, QUERIES, RELATIONAL
    DATABASES, RELATIONAL DATABASE, US, CARAT.DAT,
    GQP.DAT, ORUS.DAT, QUERY.OPT
  • DOC1: INFORMATION, INFORMATION RETRIEVAL,
    INFORMATION STORAGE, INDEXING, RETRIEVAL,
    STORAGE, US, KEVIN.HOT
  • DOC2: ARTIFICIAL INTELLIGENCE, INFORMATION
    RETRIEVAL SYSTEMS, INFORMATION RETRIEVAL,
    INDEXING, NATURAL LANGUAGE PROCESSING, US,
    DBMS.AI, GQP.DAT
  • DOC3: FUZZY SET THEORY, INFORMATION RETRIEVAL
    SYSTEMS, INDEXING, PERFORMANCE, RETRIEVAL
    SYSTEMS, RETRIEVAL, QUERIES, US, KEVIN.HOT
  • DOC4: INFORMATION RETRIEVAL SYSTEMS, INDEXING,
    RETRIEVAL, STAIRS, US, KEVIN.HOT

15
4. A GA Example (cont.)
  • Total Set of Concepts
  • DATA RETRIEVAL, DATABASE, COMPUTER NETWORKS,
    IMPROVEMENTS, INFORMATION RETRIEVAL, METHOD,
    NETWORK, MULTIPLE, QUERY, RELATION, RELATIONAL,
    RETRIEVAL, QUERIES, RELATIONAL DATABASES,
    RELATIONAL DATABASE, US, CARAT.DAT, GQP.DAT,
    ORUS.DAT, QUERY.OPT, INFORMATION, INFORMATION
    STORAGE, INDEXING, STORAGE, KEVIN.HOT, ARTIFICIAL
    INTELLIGENCE, INFORMATION RETRIEVAL SYSTEMS,
    NATURAL LANGUAGE PROCESSING, DBMS.AI, FUZZY SET
    THEORY, PERFORMANCE, RETRIEVAL SYSTEMS, STAIRS

16
4. A GA Example (cont.)
  • Initial Genetic Pattern of Chromosome in
    Population
  • chromosome                          fitness
    111111111111111111110000000000000   0.287744
    000010000001000100001111100000000   0.411692
    000010000000000101000010011110000   0.367556
    000000000001100100000010101001110   0.427473
    000000000001000100000010101000001   0.451212
  • Average Fitness 0.3891

17
4. A GA Example (cont.)
  • A document which included more concepts shared by
    other documents had a higher Jaccard's score.
  • Jaccard's Score of DOC0 and DOC0: 1.000000
  • Jaccard's Score of DOC0 and DOC1: 0.120000
  • Jaccard's Score of DOC0 and DOC2: 0.120000
  • Jaccard's Score of DOC0 and DOC3: 0.115384
  • Jaccard's Score of DOC0 and DOC4: 0.083333
  • Average Fitness (Jaccard's Score) of
    Document0: 0.28774
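
These scores can be recomputed from the keyword sets of the five input documents. The sketch below assumes, consistent with the figures on this slide, that a document's fitness is its average Jaccard score against every document in the set, including itself; the names `DOCS`, `jaccard`, and `average_fitness` are illustrative.

```python
DOCS = {
    "DOC0": {"DATA RETRIEVAL", "DATABASE", "COMPUTER NETWORKS", "IMPROVEMENTS",
             "INFORMATION RETRIEVAL", "METHOD", "NETWORK", "MULTIPLE", "QUERY",
             "RELATION", "RELATIONAL", "RETRIEVAL", "QUERIES",
             "RELATIONAL DATABASES", "RELATIONAL DATABASE", "US", "CARAT.DAT",
             "GQP.DAT", "ORUS.DAT", "QUERY.OPT"},
    "DOC1": {"INFORMATION", "INFORMATION RETRIEVAL", "INFORMATION STORAGE",
             "INDEXING", "RETRIEVAL", "STORAGE", "US", "KEVIN.HOT"},
    "DOC2": {"ARTIFICIAL INTELLIGENCE", "INFORMATION RETRIEVAL SYSTEMS",
             "INFORMATION RETRIEVAL", "INDEXING",
             "NATURAL LANGUAGE PROCESSING", "US", "DBMS.AI", "GQP.DAT"},
    "DOC3": {"FUZZY SET THEORY", "INFORMATION RETRIEVAL SYSTEMS", "INDEXING",
             "PERFORMANCE", "RETRIEVAL SYSTEMS", "RETRIEVAL", "QUERIES", "US",
             "KEVIN.HOT"},
    "DOC4": {"INFORMATION RETRIEVAL SYSTEMS", "INDEXING", "RETRIEVAL",
             "STAIRS", "US", "KEVIN.HOT"},
}


def jaccard(a, b):
    """Number of shared concepts divided by the number of distinct concepts."""
    return len(a & b) / len(a | b)


def average_fitness(name, docs):
    """Average Jaccard score of one document against the whole set."""
    return sum(jaccard(docs[name], other) for other in docs.values()) / len(docs)


for other in DOCS:
    print("DOC0 vs", other, round(jaccard(DOCS["DOC0"], DOCS[other]), 6))

fitnesses = {name: average_fitness(name, DOCS) for name in DOCS}
population_average = sum(fitnesses.values()) / len(fitnesses)
print(fitnesses, round(population_average, 4))
```

For instance, DOC0 and DOC1 share 3 concepts out of 25 distinct ones, which gives the score of 0.12.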

18
4. A GA Example (cont.)
  • If a user provided documents which are closely
    related, the average fitness for the complete
    document set was high. If the user-selected
    documents were only loosely related, their
    overall fitness was low.
  • Generally, GAs did a good job of optimizing a
    document set which was initially low in fitness.
    Using the previous example, the overall Jaccard's
    score increased over generations. The optimized
    population contained a single distinct
    chromosome, with an average fitness value of
    0.45121.

19
4. A GA Example (cont.)
  • The optimized chromosome contained six relevant
    keywords which best described the initial set of
    documents.
  • Using these "optimized" keywords, an
    information retrieval system could proceed to
    suggest relevant documents to users. The user-GA
    interaction continued until a search was
    completed or the user decided to stop.

20
4. A GA Example (cont.)
  • Optimized Chromosomes in the Population
  • chromosome                          fitness
    000000000001000100000010101000001   0.45121
    000000000001000100000010101000001   0.45121
    000000000001000100000010101000001   0.45121
    000000000001000100000010101000001   0.45121
    000000000001000100000010101000001   0.45121
  • Average Fitness 0.4512
  • Derived Concepts from Optimized Population
  • RETRIEVAL, US, INDEXING, KEVIN.HOT, INFORMATION
    RETRIEVAL SYSTEMS, STAIRS
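
The derived concepts follow from reading the optimized chromosome against the total set of concepts: each 1-bit selects the concept at the same position. A minimal decoding sketch, with `CONCEPTS` copied from the total concept list and `decode` as an illustrative name:

```python
# The 33 concepts in the order given in the total set of concepts.
CONCEPTS = [
    "DATA RETRIEVAL", "DATABASE", "COMPUTER NETWORKS", "IMPROVEMENTS",
    "INFORMATION RETRIEVAL", "METHOD", "NETWORK", "MULTIPLE", "QUERY",
    "RELATION", "RELATIONAL", "RETRIEVAL", "QUERIES", "RELATIONAL DATABASES",
    "RELATIONAL DATABASE", "US", "CARAT.DAT", "GQP.DAT", "ORUS.DAT",
    "QUERY.OPT", "INFORMATION", "INFORMATION STORAGE", "INDEXING", "STORAGE",
    "KEVIN.HOT", "ARTIFICIAL INTELLIGENCE", "INFORMATION RETRIEVAL SYSTEMS",
    "NATURAL LANGUAGE PROCESSING", "DBMS.AI", "FUZZY SET THEORY",
    "PERFORMANCE", "RETRIEVAL SYSTEMS", "STAIRS",
]


def decode(chromosome, concepts):
    """Return the concepts whose position holds a 1 in the bit string."""
    if len(chromosome) != len(concepts):
        raise ValueError("chromosome and concept list must have equal length")
    return [concept for bit, concept in zip(chromosome, concepts) if bit == "1"]


optimized = "000000000001000100000010101000001"
print(decode(optimized, CONCEPTS))
```

The 1-bits sit at positions 12, 16, 23, 25, 27, and 33, which are exactly the six derived concepts listed above.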

21
5. Conclusion
  • GAs can be successfully applied to information
    retrieval, and many approaches to doing so are
    possible. Continued study is required in this
    field, and in the future the algorithm should be
    tested on a large database.

22
6. References
  • Ricardo Baeza-Yates, Modern Information
    Retrieval.
  • Hsinchun Chen, Machine Learning for Information
    Retrieval: Neural Networks, Symbolic Learning,
    and Genetic Algorithms, Journal of the American
    Society for Information Science, 1994, in press.
    http://ai.arizona.edu/papers/mlir93/mlir93.html
  • Bangorn Klabbankoh, Ouen Pinngern Ph.D., Applied
    Genetic Algorithms in Information Retrieval.
    International Journal of the Computer, the
    Internet and Management.
    http://www.ijcim.th.org/past_editions/1999V07N3/02-drouen.pdf
  • Robert Losee, Learning Syntactic Rules and Tags
    with Genetic Algorithms for Information Retrieval
    and Filtering: An Empirical Basis for Grammatical
    Rules, Information Processing & Management,
    32(2), pp. 185-197, 1996. (published article)
    http://www.ils.unc.edu/losee/gene1.pdf
  • Eric Krevice Prebys, The Genetic Algorithm in
    Computer Science. International Journal of the
    Computer, the Internet and Management.
    http://www-math.mit.edu/phase2/UJM/vol1/PREBYS-F.PDF
  • Matthew.
    http://lancet.mit.edu/mbwall/presentations/IntroToGAs/P002.html
  • D. Vrajitoru (1997), Genetic Algorithms in
    Information Retrieval. AIDRI97: Learning From
    Natural Principles to Artificial Methods,
    Genève, June 1997.
    http://www.cs.iusb.edu/danav/papers/AidriEng.pdf