Lucene and Solr

Provided by: labsRight

Transcript and Presenter's Notes

1
Lucene and Solr
2
Lucene
  • Doug Cutting
  • Created in 1999
  • Donated to Apache in 2001
  • Features
  • Highly scalable
  • Java (1.4)
  • Ports to many other languages
  • No crawler
  • No document parsing
  • No PageRank

3
Lucene
  • Powered by Lucene
  • IBM Omnifind Y! Edition
  • Technorati
  • Wikipedia
  • Internet Archive
  • LinkedIn
  • monster.com

4
Indexing
  • Logical structure
  • Index is a collection of documents
  • Documents are a collection of fields
  • Fields are the content
  • Stored: stored verbatim for retrieval with results
  • Indexed: tokenized and made searchable
  • Indexed terms are stored in an inverted index
  • Physical structure
  • Multiple documents (with all fields) are stored in segments
  • mergeFactor
  • All segments together make up the index
  • IndexWriter is the interface object for the entire index

5
Indexing
  • Documents
  • 0: Little Red Riding Hood
  • 1: Robin Hood
  • 2: Little Women
  • Inverted index (term: document IDs)
  • aardvark
  • hood: 0, 1
  • little: 0, 2
  • red: 0
  • riding: 0
  • robin: 1
  • women: 2
  • zoo
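
The term-to-document mapping on this slide can be sketched in a few lines of Python. This is a minimal illustration, not Lucene's implementation: it assumes whitespace tokenization plus lowercasing, and the function name `build_inverted_index` is made up for the example. The "aardvark" and "zoo" entries on the slide have no postings shown, so they are left out.

```python
from collections import defaultdict

# Document IDs and titles as shown on the slide.
documents = {
    0: "Little Red Riding Hood",
    1: "Robin Hood",
    2: "Little Women",
}


def build_inverted_index(docs):
    """Map each lowercased term to the sorted list of document IDs containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        # Assumption: split on whitespace and lowercase; real analyzers do more.
        for term in text.lower().split():
            postings[term].add(doc_id)
    # Sort terms alphabetically, like the term dictionary on the slide.
    return {term: sorted(ids) for term, ids in sorted(postings.items())}


index = build_inverted_index(documents)
for term, doc_ids in index.items():
    print(term, doc_ids)
```

Looking up a term is then a dictionary access, for example `index["hood"]` for the documents that contain "hood".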
6
Indexing
  • Analysis
  • Extract tokens from text (tokenizer)
  • Whitespace
  • Hyphens
  • Manipulate or modify tokens (token filter)
  • Stemming
  • Removal
  • Tokenizer / Token Filter chains are called
    analyzers

7
Indexing
  • Input: LexCorp BFG-9000
  • WhitespaceTokenizer: LexCorp | BFG-9000
  • WordDelimiterFilter (catenateWords=1): Lex | Corp | LexCorp | BFG | 9000
  • LowercaseFilter: lex | corp | lexcorp | bfg | 9000
8
Searching
  • Query Creation
  • Query parser
  • Manual query construction from terms
  • title:Bell author:Hemmingway^3.0
  • Query terms are analyzed
  • Same analyzer for indexing and searching on each
    field

9
Searching
  • Indexed text: LexCorp BFG-9000
  • WhitespaceTokenizer: LexCorp | BFG-9000
  • WordDelimiterFilter (catenateWords=1): Lex | Corp | LexCorp | BFG | 9000
  • LowercaseFilter: lex | corp | lexcorp | bfg | 9000
  • Query text: Lex corp bfg9000
  • WhitespaceTokenizer: Lex | corp | bfg9000
  • WordDelimiterFilter (catenateWords=0): Lex | corp | bfg | 9000
  • LowercaseFilter: lex | corp | bfg | 9000
  • A Match!
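
The index-time and query-time chains above can be approximated in plain Python. This is a rough sketch under stated assumptions, not Solr's code: the functions only imitate WhitespaceTokenizer, WordDelimiterFilter and LowercaseFilter, the splitting rules are reduced to case changes, letter/digit boundaries and punctuation, and the order in which catenated tokens are emitted is simplified.

```python
import re

# One alternative per kind of sub-word: an all-caps run, a capitalised or
# lowercase word, or a run of digits. Anything else acts as a delimiter.
_PARTS = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+")


def whitespace_tokenizer(text):
    """Split on runs of whitespace only."""
    return text.split()


def word_delimiter_filter(tokens, catenate_words=False):
    """Split each token into sub-words; optionally also emit joined word runs."""
    out = []
    for token in tokens:
        parts = _PARTS.findall(token)
        out.extend(parts)
        if catenate_words:
            run = []
            for part in parts + [""]:  # empty sentinel flushes the last run
                if part.isalpha():
                    run.append(part)
                else:
                    if len(run) > 1:
                        out.append("".join(run))
                    run = []
    return out


def lowercase_filter(tokens):
    return [token.lower() for token in tokens]


def analyze(text, catenate_words):
    tokens = whitespace_tokenizer(text)
    tokens = word_delimiter_filter(tokens, catenate_words)
    return lowercase_filter(tokens)


index_tokens = analyze("LexCorp BFG-9000", catenate_words=True)
query_tokens = analyze("Lex corp bfg9000", catenate_words=False)
# Every query token is also an indexed token, which is the match on the slide.
is_match = set(query_tokens) <= set(index_tokens)
print(index_tokens, query_tokens, is_match)
```

The point of the example is that the differently written index and query strings reduce to compatible token sets once both have passed through the analysis chain.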
10
Searching
  • Many query types
  • Term
  • Phrase
  • "bad wolf"
  • Proximity
  • "quick fox"~4
  • Prefix and wildcard
  • pla?e (plate or place or plane)
  • practic* (practice or practical or practically)
  • Fuzzy (edit distance)
  • planting~0.75 (granting or planning)
  • roam~ (default is 0.5)
  • Range
  • date:[05072007 TO 05232007] (inclusive)
  • author:{king TO mason} (exclusive)
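
To make the fuzzy example concrete, here is a small sketch of edit-distance matching. The Levenshtein routine is the standard dynamic program. The similarity normalization (one minus the distance divided by the shorter term's length) and the `>=` comparison are assumptions chosen to reproduce the slide's example, and may differ from what Lucene actually computes. The vocabulary list is invented for the demonstration.

```python
def edit_distance(a, b):
    """Levenshtein distance using a two-row dynamic program."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete a character from a
                current[j - 1] + 1,      # insert a character into a
                previous[j - 1] + cost,  # substitute (or keep) a character
            ))
        previous = current
    return previous[-1]


def similarity(a, b):
    """Assumed normalization: 1 - distance / length of the shorter term."""
    return 1.0 - edit_distance(a, b) / min(len(a), len(b))


def fuzzy_matches(term, vocabulary, min_similarity=0.5):
    """Return vocabulary terms at least min_similarity close to term."""
    return [word for word in vocabulary if similarity(term, word) >= min_similarity]


# Hypothetical term dictionary for the demonstration.
vocabulary = ["planning", "granting", "zoo"]
matches = fuzzy_matches("planting", vocabulary, min_similarity=0.75)
print(matches)
```

Under these assumptions "planning" is one edit away from "planting" and "granting" is two edits away, so both clear the 0.75 threshold while "zoo" does not.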

11
Searching
  • Multiple searchers at once
  • Thread safe
  • Additions or deletions to the index are not reflected in already open searchers
  • Searchers must be closed and reopened to see changes
  • Use commit or optimize on IndexWriter

12
Lucene Sub-projects
  • Nutch
  • Web crawler with document parsing
  • Hadoop
  • Distributed data processor
  • Implements MapReduce
  • Solr

13
Solr
  • Yonik Seeley
  • Developed at CNET
  • Donated to Apache in 2006
  • Features
  • Servlet
  • Web Administration Interface
  • XML/HTTP, JSON Interfaces
  • Faceting
  • Schema to define types and fields
  • Highlighting
  • Caching
  • Index Replication (Master / Slaves)
  • Pluggable
  • Java 5

14
Solr
  • Powered by Solr
  • Netflix
  • CNET
  • Smithsonian
  • AOL: sports and music
  • RightNow ??
  • Drupal module
  • GameSpot

15
Configuration (solrconfig.xml)
  • <mainIndex>
  • <useCompoundFile>false</useCompoundFile>
  • <mergeFactor>10</mergeFactor>
  • <maxBufferedDocs>1000</maxBufferedDocs>
  • <maxMergeDocs>2147483647</maxMergeDocs>
  • <maxFieldLength>10000</maxFieldLength>
  • </mainIndex>
  • <requestHandler name="standard" class="solr.StandardRequestHandler" />
  • <requestHandler name="custom" class="your.package.CustomRequestHandler" />
  • <autoCommit>
  • <maxDocs>10000</maxDocs>
  • <maxTime>1000</maxTime>
  • </autoCommit>
  • <queryResponseWriter name="xml" class="org.apache.solr.request.XMLResponseWriter" default="true"/>

16
Schema (schema.xml)
  • Fields
  • <uniqueKey>id</uniqueKey>
  • <field name="products" type="text" indexed="true" stored="true"/>
  • <field name="keywords" type="text_ws" indexed="true" stored="true"/>
  • <field name="keywordsSorted" type="text_sorted" indexed="true" stored="false"/>
  • <field name="timestamp" type="date" indexed="true" stored="true" default="NOW"/>
  • <dynamicField name="*_i" type="integer" indexed="true" stored="true"/>
  • <dynamicField name="desc_*" type="string" indexed="true" stored="false"/>
  • <copyField source="keywords" dest="keywordsSorted"/>
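
As a rough illustration of how explicit and dynamic field declarations relate, the sketch below resolves a field name to a type. It is a simplified model rather than Solr's lookup code. It assumes the two dynamicField patterns are `*_i` and `desc_*` (the wildcard positions are an assumption), that explicitly declared fields are checked first, and that the first matching pattern wins. The helper name `resolve_field_type` and the sample field names are made up.

```python
from fnmatch import fnmatchcase

# Explicit <field> declarations from the slide: name -> type.
EXPLICIT_FIELDS = {
    "products": "text",
    "keywords": "text_ws",
    "keywordsSorted": "text_sorted",
    "timestamp": "date",
}

# <dynamicField> declarations; wildcard positions are assumed.
DYNAMIC_FIELDS = [
    ("*_i", "integer"),
    ("desc_*", "string"),
]


def resolve_field_type(name):
    """Return the type for a field name, or None if nothing matches."""
    if name in EXPLICIT_FIELDS:
        return EXPLICIT_FIELDS[name]
    for pattern, field_type in DYNAMIC_FIELDS:
        if fnmatchcase(name, pattern):
            return field_type
    return None


for name in ["keywords", "price_i", "desc_short", "unknown"]:
    print(name, resolve_field_type(name))
```

With these declarations a document can carry fields such as `price_i` or `desc_short` without each one being listed in the schema.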

17
Schema
  • Analyzers
  • <fieldtype name="nametext" class="solr.TextField">
  • <analyzer class="org.apache.lucene.analysis.WhitespaceAnalyzer"/>
  • </fieldtype>
  • <fieldtype name="text" class="solr.TextField">
  • <analyzer>
  • <tokenizer class="solr.StandardTokenizerFactory"/>
  • <filter class="solr.StandardFilterFactory"/>
  • <filter class="solr.LowerCaseFilterFactory"/>
  • <filter class="solr.StopFilterFactory"/>
  • <filter class="solr.PorterStemFilterFactory"/>
  • </analyzer>
  • </fieldtype>
  • <fieldtype name="myfieldtype" class="solr.TextField">
  • <analyzer>
  • <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  • <filter class="solr.SnowballPorterFilterFactory" language="German" />

18
Insertion
  • HTTP POST to http://localhost:8983/solr/update/
  • <add>
  • <doc>
  • <field name="employeeId">05991</field>
  • <field name="office">Bridgewater</field>
  • <field name="skills">Perl</field>
  • <field name="skills">Java</field>
  • </doc>
  • <doc> ... </doc><doc> ... </doc>
  • </add>
  • Documents or fields can have boosts attached
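
The add message above can be generated rather than written by hand. The sketch below builds the XML with Python's standard library and prepares, but does not send, the POST request. The helper names are invented for the example, the Content-Type header is an assumption, and the URL is the local one from the slide.

```python
import urllib.request
import xml.etree.ElementTree as ET


def build_add_message(docs):
    """Build an <add> message; each doc is a list of (field name, value) pairs."""
    add = ET.Element("add")
    for fields in docs:
        doc = ET.SubElement(add, "doc")
        for name, value in fields:
            # Repeating a name, as with "skills", yields a multi-valued field.
            field = ET.SubElement(doc, "field", name=name)
            field.text = value
    return ET.tostring(add, encoding="unicode")


def build_update_request(xml_body, url="http://localhost:8983/solr/update/"):
    """Prepare an HTTP POST carrying the XML body (header value is an assumption)."""
    return urllib.request.Request(
        url,
        data=xml_body.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"},
        method="POST",
    )


message = build_add_message([[
    ("employeeId", "05991"),
    ("office", "Bridgewater"),
    ("skills", "Perl"),
    ("skills", "Java"),
]])
request = build_update_request(message)
print(message)
# urllib.request.urlopen(request) would send it; omitted so the sketch
# runs without a Solr server.
```

Building the message through an XML library also takes care of escaping characters such as `&` and `<` in field values.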

19
Update / Delete
  • Inserting a document with an already present uniqueKey replaces the original
  • Deleting
  • By uniqueKey field
  • <delete><id>05991</id></delete>
  • By query
  • <delete><query>name:Anthony</query></delete>
  • <commit/>
  • <optimize/>

20
Search
  • Core parameters
  • qt: query type (request handler)
  • wt: writer type (response writer)
  • Common parameters
  • q
  • sort
  • start
  • rows
  • fq: filters
  • fl: return fields
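
A request that combines these parameters might be assembled as follows. This is a sketch only: `build_select_url` is a made-up helper, the base URL is assumed to be the local select endpoint, and the field names "price", "id" and "name" are hypothetical.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_select_url(query, base_url="http://localhost:8983/solr/select",
                     handler=None, writer=None, sort=None,
                     start=0, rows=10, filters=(), fields=None):
    """Assemble a select URL from the core and common parameters."""
    params = [("q", query), ("start", start), ("rows", rows)]
    if handler:
        params.append(("qt", handler))   # request handler
    if writer:
        params.append(("wt", writer))    # response writer
    if sort:
        params.append(("sort", sort))
    for filter_query in filters:         # fq may be repeated
        params.append(("fq", filter_query))
    if fields:
        params.append(("fl", ",".join(fields)))
    return base_url + "?" + urlencode(params)


url = build_select_url(
    "ipod",
    handler="standard",
    writer="json",
    sort="price asc",
    filters=["inStock:true"],
    fields=["id", "name"],
)
decoded = parse_qs(urlsplit(url).query)
print(url)
print(decoded)
```

Passing the parameters through `urlencode` keeps characters such as the colon and the space in `fq` and `sort` correctly escaped.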

21
Search
  • Faceting
  • Available in StandardRequestHandler and
    DisMaxRequestHandler

22
Search
  • http://localhost:8983/solr/select?q=ipod&rows=0&facet=true&facet.limit=-1&facet.field=cat&facet.mincount=1&facet.field=inStock
  • <response>
  • <responseHeader>
  • <status>0</status>
  • <QTime>3</QTime>
  • </responseHeader>
  • <result numFound="4" start="0"/>
  • <lst name="facet_counts">
  • <lst name="facet_queries"/>
  • <lst name="facet_fields">
  • <lst name="cat">
  • <int name="music">1</int>
  • <int name="connector">2</int>
  • <int name="electronics">3</int>
  • </lst>
  • <lst name="inStock">
  • <int name="false">3</int>
  • <int name="true">1</int>
  • </lst>
  • </lst>
  • </lst>
  • </response>
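
On the client side, the facet counts in such a response can be pulled out with a standard XML parser. The sketch below uses an abbreviated copy of the response on this slide; the closing tags are filled in on the assumption that no further elements follow, and `parse_facet_fields` is a made-up helper name.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the response on the slide; closing tags are assumed.
SAMPLE_RESPONSE = """
<response>
  <responseHeader><status>0</status><QTime>3</QTime></responseHeader>
  <result numFound="4" start="0"/>
  <lst name="facet_counts">
    <lst name="facet_queries"/>
    <lst name="facet_fields">
      <lst name="cat">
        <int name="music">1</int>
        <int name="connector">2</int>
        <int name="electronics">3</int>
      </lst>
      <lst name="inStock">
        <int name="false">3</int>
        <int name="true">1</int>
      </lst>
    </lst>
  </lst>
</response>
"""


def parse_facet_fields(xml_text):
    """Return {field name: {facet value: count}} from a facet response."""
    root = ET.fromstring(xml_text.strip())
    path = "./lst[@name='facet_counts']/lst[@name='facet_fields']/lst"
    facets = {}
    for field in root.findall(path):
        facets[field.get("name")] = {
            entry.get("name"): int(entry.text) for entry in field.findall("int")
        }
    return facets


facets = parse_facet_fields(SAMPLE_RESPONSE)
print(facets)
```

Because the request sets rows=0, the response carries only the match count and the facet counts, which is all a faceted navigation menu needs.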

23
Many more features
  • Replication
  • Master / Slave architecture for load balancing
    and backups
  • More-like-this
  • Easy to add RequestHandlers and ResponseWriters
  • Responses in many formats
  • Hit highlighting

24
Sources
  • http://lucene.apache.org/
  • http://lucene.apache.org/solr/
  • http://people.apache.org/~yonik/presentations/