Lucene-Demo - PowerPoint PPT Presentation

Slides: 17
Provided by: EPLT
Category:
Tags: apache | demo | lucene


Transcript and Presenter's Notes



1
Lucene-Demo
  • Brian Nisonger

2
Intro
  • No details about Implementation/Theory
  • See Treehouse Wiki- Lucene for additional info
  • Set of Java classes
  • Not an end to end solution
  • Designed to allow rapid development of IR tools

3
Index
  • The first step is to take a set of text documents
    and build an Index
  • DemoIndexFiles on Pongo
  • Two major classes
  • Analyzer
  • Used to Tokenize data
  • More on this later
  • IndexWriter
  • IndexWriter writer = new IndexWriter(INDEX_DIR,
    new StandardAnalyzer(), true);

4
Index Writer
  • Index Writer creates an index of documents
  • First argument is the directory in which to
    build/find the index
  • Second argument is the Analyzer used to tokenize
    the documents
  • Third argument determines whether a new index
    should be created

5
Analyzer
  • Standard Analyzer
  • Porter Stemming w/ Stop Words
  • Krovetz Stemmer-Example
  • package org.apache.lucene.analysis;
  • import org.apache.lucene.analysis.Analyzer;
  • import org.apache.lucene.analysis.standard.
  • import org.apache.lucene.analysis.TokenStream;
  • import org.apache.lucene.analysis.StopFilter;
  • import org.apache.lucene.analysis.LowerCaseTokenizer;
  • import org.apache.lucene.analysis.KStemFilter;
  • import java.io.Reader;
  • public class KStemAnalyzer extends Analyzer {
  •   public final TokenStream tokenStream(String
      fieldName, Reader reader) {
  •     return new KStemFilter(new LowerCaseTokenizer(
        reader));
  •   }
  • }
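
A minimal sketch of what an analyzer chain does to text, written against the plain JDK rather than Lucene. The class name ToyAnalyzer and its tiny stop list are hypothetical; it only mimics the idea of a lower-casing tokenizer followed by a stop-word filter, and it does no stemming.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical toy analyzer (plain JDK, NOT Lucene's API).
final class ToyAnalyzer {
    // Illustrative stop list, far smaller than any real one.
    static final Set<String> STOP_WORDS = Set.of("a", "an", "and", "of", "the", "to");

    // Split on non-letters, lower-case each token, drop stop words.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.split("[^\\p{L}]+")) {
            if (raw.isEmpty()) {
                continue; // split() can yield an empty leading piece
            }
            String token = raw.toLowerCase(Locale.ROOT);
            if (!STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(analyze("The Index of the Documents")); // prints [index, documents]
    }
}
```

A stemmer such as KStem or Porter would be one more stage after the stop-word step, mapping each surviving token to its stem.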

6
Analyzer-II
  • Snowball Stemmer
  • A stemmer language created by Porter used to
    build Stemmers
  • Multilingual analyzers/Stemmers
  • Porter2
  • Fully Integrated with Lucene 1.9.1
  • MyAnalyzer(Home Built)
  • Demo

7
Adding Documents
  • The Next step after creating an index is to add
    documents
  • writer.addDocument(FileDocument.Document(file))
  • Remember we already determined how the document
    will be tokenized
  • Fields
  • Can split a document into parts such as title,
    body, date created, paragraphs

8
Adding Documents-II
  • Assigns Token/doc ID
  • For why this is important see Lucene TreeHouse
    Wiki
  • Create some type of loop to add all the documents
  • This is the actual creation of the Index; before
    this we merely set the Index parameters
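
To make the doc-ID and loop points concrete, here is a toy inverted index in plain Java. It is an assumption-laden illustration, not how Lucene stores an index: ToyIndex, addDocument and docsFor are hypothetical names, and doc IDs are simply assigned in insertion order.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical toy inverted index (plain JDK, NOT Lucene's on-disk format).
final class ToyIndex {
    private final List<String> names = new ArrayList<>();
    private final Map<String, SortedSet<Integer>> postings = new TreeMap<>();

    // Assigns the next doc ID and records which terms occur in the document.
    int addDocument(String name, String text) {
        int docId = names.size();
        names.add(name);
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, t -> new TreeSet<>()).add(docId);
            }
        }
        return docId;
    }

    // Doc IDs containing the term (empty set if the term was never indexed).
    SortedSet<Integer> docsFor(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), new TreeSet<>());
    }

    String name(int docId) {
        return names.get(docId);
    }

    public static void main(String[] args) {
        String[][] files = {
            {"a.txt", "Lucene builds an index"},
            {"b.txt", "Search the index"},
        };
        ToyIndex index = new ToyIndex();
        for (String[] file : files) { // the "loop to add all the documents"
            index.addDocument(file[0], file[1]);
        }
        System.out.println(index.docsFor("index")); // prints [0, 1]
    }
}
```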

9
Finalizing Index Creation
  • After that the Index is optimized with
    writer.optimize()
  • Merges etc.
  • The Index is closed with writer.close()

10
Searching an Index
  • Open Index
  • IndexReader reader = IndexReader.open(index);
  • Create Searcher
  • Searcher searcher = new IndexSearcher(reader);
  • Assign Analyzer
  • Use the same Analyzer used to create Index (Why?)
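
One way to see the answer to the "Why?" is a toy example in plain Java (hypothetical names, not Lucene's API): if terms are lower-cased when indexing but the query is left as typed, the lookup misses a term that is actually in the index.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.function.Function;

// Hypothetical illustration of an index-time / query-time analyzer mismatch.
final class AnalyzerMismatch {
    static final Function<String, String> LOWER = s -> s.toLowerCase(Locale.ROOT);
    static final Function<String, String> AS_IS = s -> s;

    // "Index" a text: whitespace-split, then normalize every term.
    static Set<String> indexTerms(String text, Function<String, String> normalizer) {
        Set<String> terms = new HashSet<>();
        for (String t : text.split("\\s+")) {
            if (!t.isEmpty()) {
                terms.add(normalizer.apply(t));
            }
        }
        return terms;
    }

    // "Search": normalize the query term, then look it up.
    static boolean matches(Set<String> indexed, String queryTerm,
                           Function<String, String> normalizer) {
        return indexed.contains(normalizer.apply(queryTerm));
    }

    public static void main(String[] args) {
        Set<String> indexed = indexTerms("Apache Lucene Demo", LOWER);
        System.out.println(matches(indexed, "Lucene", LOWER)); // prints true
        System.out.println(matches(indexed, "Lucene", AS_IS)); // prints false
    }
}
```

The same reasoning applies to stemming and stop words: the query has to be reduced to the same term forms that were written into the index.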

11
Searching an Index-II
  • Parse/Create query
  • Query query = QueryParser.parse(line, field,
    analyzer);
  • Takes a line, looks for a particular field, and
    runs it through an analyzer to create query
  • Determine which documents are matches
  • Hits hits = searcher.search(query);

12
Retrieving Documents
  • Hits creates a collection of documents
  • Using a loop we can reference each doc
  • Document doc = hits.doc(i);
  • This allows us to get info about the document
  • Name of document, date it was created, words in
    document
  • Relevancy Score(TF/IDF)
  • Demo
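
For the TF/IDF bullet, a textbook-style calculation can be sketched in plain Java. This is an assumption for illustration only: it uses raw term count times ln(N/df), which is not the exact formula Lucene applies, and ToyTfIdf is a hypothetical name.

```java
import java.util.List;

// Hypothetical textbook TF-IDF (plain JDK, NOT Lucene's scoring formula).
final class ToyTfIdf {
    // Number of times the term occurs in one tokenized document.
    static long termFrequency(List<String> doc, String term) {
        return doc.stream().filter(term::equals).count();
    }

    // Number of documents that contain the term at least once.
    static long documentFrequency(List<List<String>> docs, String term) {
        return docs.stream().filter(d -> d.contains(term)).count();
    }

    // score = tf * ln(N / df); 0 when the term occurs in no document.
    static double score(List<List<String>> docs, int docId, String term) {
        long df = documentFrequency(docs, term);
        if (df == 0) {
            return 0.0;
        }
        double idf = Math.log((double) docs.size() / df);
        return termFrequency(docs.get(docId), term) * idf;
    }

    public static void main(String[] args) {
        List<List<String>> docs = List.of(
            List.of("lucene", "index", "lucene"),
            List.of("index", "search"),
            List.of("search", "demo"));
        // "lucene" is frequent in doc 0 and rare in the collection, so it scores highest there.
        System.out.println(score(docs, 0, "lucene"));
        System.out.println(score(docs, 0, "index"));
    }
}
```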

13
Finishing Searching
  • Return list of documents
  • Close Reader

14
Other Functions
  • Spans (Example from
    http://lucene.apache.org/java/docs/api/index.html)
  • Useful for Phrasal matching
  • Allows for Passage Retrieval
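
The idea behind span-style matching, that terms must occur near each other and in order, can be sketched on token positions in plain Java. The method near and its slop meaning (maximum number of tokens allowed between the two terms) are assumptions made for this example and are not Lucene's span API.

```java
import java.util.List;

// Hypothetical proximity check on token positions (plain JDK, NOT Lucene's span classes).
final class ToySpanNear {
    // True if `second` occurs after `first` with at most `slop` tokens in between.
    static boolean near(List<String> tokens, String first, String second, int slop) {
        for (int i = 0; i < tokens.size(); i++) {
            if (!tokens.get(i).equals(first)) {
                continue;
            }
            int last = Math.min(tokens.size() - 1, i + 1 + slop);
            for (int j = i + 1; j <= last; j++) {
                if (tokens.get(j).equals(second)) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("apache", "lucene", "is", "a", "search", "library");
        System.out.println(near(tokens, "lucene", "search", 2)); // prints true
        System.out.println(near(tokens, "lucene", "search", 1)); // prints false
    }
}
```

With slop 0 this reduces to an exact two-word phrase match; the matched position range is what would mark a passage for retrieval.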

15
Questions?
  • Any Questions, comments, jokes, opinions??

16
I said Good Day
  • The END