Lucene-Demo

Slides: 17

Provided by: EPLT

Learn more at: http://faculty.washington.edu

Transcript and Presenter's Notes

1
Lucene-Demo

Brian Nisonger

2
Intro

No details about Implementation/Theory
See Treehouse Wiki- Lucene for additional info
Set of Java classes
Not an end to end solution
Designed to allow rapid development of IR tools

3
Index

The first step is to take a set of text documents and build an index
DemoIndexFiles on Pongo
Two major classes
Analyzer
Used to Tokenize data
More on this later
IndexWriter
IndexWriter writer = new IndexWriter(INDEX_DIR, new StandardAnalyzer(), true);

4
Index Writer

Index Writer creates an index of documents
The first argument is the directory in which to build or find the index
The second argument supplies the Analyzer
The third argument determines whether a new index is created (true) or an existing one is opened (false)

5
Analyzer

Standard Analyzer
Porter Stemming w/ Stop Words
Krovetz Stemmer-Example
package org.apache.lucene.analysis;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.*;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.LowerCaseTokenizer;
import org.apache.lucene.analysis.KStemFilter;
import java.io.Reader;

// Analyzer that lowercases the input and then applies the Krovetz stemmer
public class KStemAnalyzer extends Analyzer {
    public final TokenStream tokenStream(String fieldName, Reader reader) {
        return new KStemFilter(new LowerCaseTokenizer(reader));
    }
}
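
To make the analyzer idea concrete without depending on Lucene, here is a minimal plain-Java sketch of the same kind of pipeline: lowercase the text, split it into tokens, and drop stop words. The class name ToyAnalyzer and its tiny stop list are hypothetical and chosen only for illustration; they are not Lucene's classes or Lucene's stop list, and no stemming is performed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration class; not part of Lucene.
class ToyAnalyzer {
    // Small illustrative stop list (an assumption, not Lucene's list).
    static final Set<String> STOP_WORDS =
        new HashSet<>(Arrays.asList("a", "an", "and", "of", "the", "to"));

    // Split on runs of non-letters, lowercase each token, drop stop words.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.split("[^A-Za-z]+")) {
            String token = raw.toLowerCase(Locale.ROOT);
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }
}
```

Calling ToyAnalyzer.analyze("The Index of the Documents") keeps only the tokens "index" and "documents". A stemming filter such as KStemFilter would be one more step applied to each surviving token.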

6
Analyzer-II

Snowball Stemmer
A language created by Porter for writing stemmers
Multilingual analyzers/Stemmers
Porter2
Fully Integrated with Lucene 1.9.1
MyAnalyzer(Home Built)
Demo

7
Adding Documents

The next step after creating an index is to add documents
writer.addDocument(FileDocument.Document(file));
Remember we already determined how the document will be tokenized
Fields
Can split a document into parts such as title, body, date created, and paragraphs

8
Adding Documents-II

Assigns Token/doc ID
For why this is important see Lucene TreeHouse
Wiki
Create some type of loop to add all the documents
This is the actual creation of the index; before this we merely set the index parameters
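
The role of document IDs can be sketched with a toy inverted index in plain Java. This is a simplified illustration under stated assumptions, not Lucene's on-disk format: the class ToyIndex and its methods are hypothetical, each added document receives the next integer ID, and each term maps to the list of IDs of the documents that contain it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical toy inverted index; not Lucene's implementation.
class ToyIndex {
    // term -> ascending list of IDs of documents containing the term
    private final Map<String, List<Integer>> postings = new TreeMap<>();
    private int nextDocId = 0;

    // Tokenizes the text, records each term under a fresh doc ID, returns that ID.
    int addDocument(String text) {
        int docId = nextDocId++;
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (token.isEmpty()) {
                continue;
            }
            List<Integer> docs = postings.computeIfAbsent(token, k -> new ArrayList<>());
            // Avoid listing the same document twice for a repeated term.
            if (docs.isEmpty() || docs.get(docs.size() - 1) != docId) {
                docs.add(docId);
            }
        }
        return docId;
    }

    // Returns a copy of the doc ID list for a term (empty if the term is unknown).
    List<Integer> lookup(String term) {
        return new ArrayList<>(postings.getOrDefault(term, new ArrayList<>()));
    }
}
```

Adding documents in a loop, as the slide suggests, is then just repeated calls to addDocument; looking up a term returns the IDs of the matching documents.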

9
Finalizing Index Creation

After that, the index is optimized with writer.optimize()
Merges segments, etc.
The index is closed with writer.close()
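
As a rough picture of what a merge involves, the sketch below combines the postings of two small index segments into one. It is a simplified assumption-laden illustration, not Lucene's merge code: ToySegmentMerge is a hypothetical class, segments are plain maps from term to doc IDs, and the second segment's IDs are shifted by the number of documents in the first so that IDs stay unique.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of merging two index segments; not Lucene's code.
class ToySegmentMerge {
    // Merges term -> doc ID postings; IDs from the second segment are
    // offset by firstDocCount so they do not collide with the first.
    static Map<String, List<Integer>> merge(Map<String, List<Integer>> first,
                                            int firstDocCount,
                                            Map<String, List<Integer>> second) {
        Map<String, List<Integer>> merged = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : first.entrySet()) {
            merged.put(e.getKey(), new ArrayList<>(e.getValue()));
        }
        for (Map.Entry<String, List<Integer>> e : second.entrySet()) {
            List<Integer> docs = merged.computeIfAbsent(e.getKey(), k -> new ArrayList<>());
            for (int id : e.getValue()) {
                docs.add(id + firstDocCount);
            }
        }
        return merged;
    }
}
```

After the merge, a search only has to consult one postings list per term instead of one per segment.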

10
Searching an Index

Open Index
IndexReader reader = IndexReader.open(index);
Create Searcher
Searcher searcher = new IndexSearcher(reader);
Assign Analyzer
Use the same Analyzer used to create Index (Why?)
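
One way to answer the "Why?" is to look at what happens when the analyzers differ. The plain-Java sketch below is a hypothetical illustration (AnalyzerMismatch is not a Lucene class, and an "analyzer" here is just a function applied to each word): the index stores lowercased terms, so a query that is not lowercased the same way looks up a term that was never stored.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.function.Function;

// Hypothetical illustration of index-time vs. query-time analysis.
class AnalyzerMismatch {
    // Two stand-in "analyzers": one lowercases, one leaves text unchanged.
    static final Function<String, String> LOWERCASE = s -> s.toLowerCase(Locale.ROOT);
    static final Function<String, String> IDENTITY = s -> s;

    // Index time: store the analyzed form of every word.
    static Set<String> index(String text, Function<String, String> analyzer) {
        Set<String> terms = new HashSet<>();
        for (String word : text.split("\\s+")) {
            if (!word.isEmpty()) {
                terms.add(analyzer.apply(word));
            }
        }
        return terms;
    }

    // Query time: analyze the query word, then look it up among the stored terms.
    static boolean matches(Set<String> indexTerms, String queryWord,
                           Function<String, String> analyzer) {
        return indexTerms.contains(analyzer.apply(queryWord));
    }
}
```

With an index built using LOWERCASE, the query "Lucene" matches when it is also passed through LOWERCASE and fails when passed through IDENTITY. The same reasoning applies to stemming and stop-word removal.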

11
Searching an Index-II

Parse/Create query
Query query = QueryParser.parse(line, field, analyzer);
Takes a line, looks for a particular field, and runs it through an analyzer to create the query
Determine which documents are matches
Hits hits = searcher.search(query);

12
Retrieving Documents

Hits creates a collection of documents
Using a loop we can reference each doc
Document doc = hits.doc(i);
This allows us to get info about the document
Name of document, date it was created, words in document
Relevancy Score(TF/IDF)
Demo
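
The TF/IDF idea behind the relevancy score can be sketched in plain Java. This uses the textbook form, term frequency times the natural log of (number of documents / documents containing the term); Lucene's own scoring formula differs in its details, so treat this as an assumption-based illustration only. ToyTfIdf and its method names are hypothetical.

```java
import java.util.List;

// Hypothetical textbook TF-IDF; not Lucene's scoring formula.
class ToyTfIdf {
    // Number of times the term occurs in one tokenized document.
    static int termFrequency(List<String> doc, String term) {
        int count = 0;
        for (String token : doc) {
            if (token.equals(term)) {
                count++;
            }
        }
        return count;
    }

    // Number of documents in the corpus that contain the term.
    static int documentFrequency(List<List<String>> corpus, String term) {
        int count = 0;
        for (List<String> doc : corpus) {
            if (doc.contains(term)) {
                count++;
            }
        }
        return count;
    }

    // score = tf * ln(N / df); 0 when the term occurs nowhere.
    static double score(List<List<String>> corpus, int docIndex, String term) {
        int df = documentFrequency(corpus, term);
        if (df == 0) {
            return 0.0;
        }
        int tf = termFrequency(corpus.get(docIndex), term);
        return tf * Math.log((double) corpus.size() / df);
    }
}
```

A term that is frequent in one document but rare across the collection scores highest, which is why a rare word is a stronger relevance signal than a word found in most documents.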

13
Finishing Searching

Return list of documents
Close Reader

14
Other Functions

Spans (Example from http://lucene.apache.org/java/docs/api/index.html)
Useful for Phrasal matching
Allows for Passage Retrieval
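
Phrasal matching and passage retrieval both depend on knowing where terms occur, not just whether they occur. The plain-Java sketch below illustrates that idea under simplifying assumptions; it does not use Lucene's span classes, and ToyPhraseMatcher and its methods are hypothetical. It finds the token positions where a phrase occurs and returns a window of surrounding tokens as the passage.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical position-based phrase matcher; not Lucene's span API.
class ToyPhraseMatcher {
    // Returns every start position at which the phrase tokens occur consecutively.
    static List<Integer> phrasePositions(List<String> tokens, List<String> phrase) {
        List<Integer> starts = new ArrayList<>();
        if (phrase.isEmpty()) {
            return starts;
        }
        for (int i = 0; i + phrase.size() <= tokens.size(); i++) {
            if (tokens.subList(i, i + phrase.size()).equals(phrase)) {
                starts.add(i);
            }
        }
        return starts;
    }

    // Returns the matched span plus up to `context` tokens on each side.
    static List<String> passage(List<String> tokens, int start, int length, int context) {
        int from = Math.max(0, start - context);
        int to = Math.min(tokens.size(), start + length + context);
        return new ArrayList<>(tokens.subList(from, to));
    }
}
```

For the tokens of "the quick brown fox jumps over the lazy dog", the phrase "brown fox" is found at position 2, and a context of one token yields the passage "quick brown fox jumps".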

15
Questions?

Any Questions, comments, jokes, opinions??

16
I said Good Day

The END
