Title: Apache Solr
1Apache Solr
Yonik Seeley, yonik_at_apache.org
29 June 2006, Dublin, Ireland
2History
- Search for a replacement search platform
- commercial: high license fees
- open-source: no full solutions
- CNET grants code to Apache, Solr enters Incubator 17 Jan 2006
- Solr is a Lucene sub-project
- Users: CNET Reviews, CNET Channel, shopper.com, news.com, nines.org, krugle.com, oodle.com, booklooker.de
3Lucene Refresher
- Lucene is a full-text search library
- Add documents to an index via IndexWriter
- A document is a collection of fields
- No config files, dynamic field typing
- Flexible text analysis: tokenizers, filters
- Search for documents via IndexSearcher
- Hits = search(Query, Filter, Sort, topN)
- Scoring: tf * idf * lengthNorm
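The scoring bullet can be made concrete with a small sketch. It assumes the formulas of Lucene's classic default similarity (tf as the square root of the term frequency, idf as 1 + ln(numDocs / (docFreq + 1)), lengthNorm as 1 / sqrt(number of terms in the field)); boosts, the coordination factor and query normalization are left out, and the function names are illustrative only.

```python
import math

def tf(freq):
    # Term frequency factor: grows with the square root of the raw count.
    return math.sqrt(freq)

def idf(doc_freq, num_docs):
    # Inverse document frequency: rare terms weigh more.
    return 1.0 + math.log(num_docs / (doc_freq + 1))

def length_norm(num_terms):
    # Shorter fields score higher for the same match.
    return 1.0 / math.sqrt(num_terms)

def term_score(freq, doc_freq, num_docs, num_terms):
    # Simplified per-term score: tf * idf * lengthNorm.
    return tf(freq) * idf(doc_freq, num_docs) * length_norm(num_terms)
```

With these assumptions, a term that appears in few documents scores higher than a common one when the frequency and field length are equal.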
4What Is Solr
- A full text search server based on Lucene
- XML/HTTP Interfaces
- Loose Schema to define types and fields
- Web Administration Interface
- Extensive Caching
- Index Replication
- Extensible Open Architecture
- Written in Java5, deployable as a WAR
5Architecture
[Diagram, component labels from top to bottom]
- Admin Interface, HTTP Request Servlet, Update Servlet
- Standard Request Handler, Disjunction Max Request Handler, Custom Request Handler
- XML Update Interface, XML Response Writer
- Solr Core: Update Handler, Caching, Config, Schema, Analysis, Concurrency
- Replication
- Lucene
6Adding Documents
- HTTP POST to /update
- <add><doc boost="2">
- <field name="article">05991</field>
- <field name="title">Apache Solr</field>
- <field name="subject">An intro...</field>
- <field name="category">search</field>
- <field name="category">lucene</field>
- <field name="body">Solr is a full...</field>
- </doc></add>
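A minimal client-side sketch of the add message, using only the Python standard library. The helper names (build_add_xml, post_update) and the update URL are assumptions for illustration; only the message shape and the /update path come from the slide.

```python
import urllib.request
import xml.etree.ElementTree as ET

def build_add_xml(fields, boost=None):
    """Build an <add><doc> message from (name, value) pairs.

    Pairs are used instead of a dict so a multi-valued field
    (e.g. category) can simply be repeated.
    """
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    if boost is not None:
        doc.set("boost", str(boost))
    for name, value in fields:
        field = ET.SubElement(doc, "field", {"name": name})
        field.text = value  # ElementTree escapes &, < and > on output
    return ET.tostring(add, encoding="unicode")

def post_update(xml_text, url="http://localhost:8983/solr/update"):
    # Assumed URL: adjust host, port and path to your deployment.
    request = urllib.request.Request(
        url,
        data=xml_text.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

Usage would look like `post_update(build_add_xml([("article", "05991"), ("category", "search"), ("category", "lucene")], boost=2))`, followed by a commit to make the document visible.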
7Deleting Documents
- Delete by Id
- <delete><id>05591</id></delete>
- Delete by Query (multiple documents)
- <delete>
- <query>manufacturer:microsoft</query>
- </delete>
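The two delete forms can be generated the same way. This is a small sketch with hypothetical helper names; it only builds the message bodies, which would then be POSTed to /update like any other update message.

```python
import xml.etree.ElementTree as ET

def _delete_message(child_tag, text):
    # Shared builder: <delete><child_tag>text</child_tag></delete>
    delete = ET.Element("delete")
    ET.SubElement(delete, child_tag).text = text
    return ET.tostring(delete, encoding="unicode")

def delete_by_id(doc_id):
    # Removes the single document whose unique key equals doc_id.
    return _delete_message("id", doc_id)

def delete_by_query(query):
    # Removes every document matching the query.
    return _delete_message("query", query)
```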
8Commit
- <commit/> makes changes visible
- closes IndexWriter
- removes duplicates
- opens new IndexSearcher
- newSearcher/firstSearcher events
- cache warming
- register the new IndexSearcher
- <optimize/>: same as commit, merges all index segments.
9Default Query Syntax
- Lucene Query Syntax [; sort specification]
- mission impossible; releaseDate desc
- +mission +impossible -actor:cruise
- "mission impossible" -actor:cruise
- title:spiderman^10 description:spiderman
- description:"spiderman movie"~10
- +HDTV +weight:[0 TO 100]
- Wildcard queries: te?t, te*t, test*
10Default Parameters
- Query Arguments for HTTP GET/POST to /select

param   default    description
q                  The query
start   0          Offset into the list of matches
rows    10         Number of documents to return
fl      *          Stored fields to return
qt      standard   Query type; maps to query handler
df      (schema)   Default field to search
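As a sketch of how these parameters combine into a request, the helper below URL-encodes them onto /select. The function name and the default base URL are assumptions; the parameter names are the ones in the table.

```python
from urllib.parse import urlencode

def select_url(q, base="http://localhost:8983/solr/select", **params):
    """Build a GET URL for /select.

    Parameters left out (start, rows, fl, qt, df) fall back to the
    server-side defaults listed in the table.
    """
    query = {"q": q}
    query.update(params)
    # urlencode escapes characters such as ':', ';' and spaces,
    # which matter for Lucene query syntax and sort specifications.
    return base + "?" + urlencode(query)
```

For example, `select_url("video", start=0, rows=2, fl="name,price")` asks for the first two matches with only the name and price fields.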
11Search Results
- http://localhost:8983/solr/select?q=video&start=0&rows=2&fl=name,price
- <response><responseHeader><status>0</status>
- <QTime>1</QTime></responseHeader>
- <result numFound="16173" start="0">
- <doc>
- <str name="name">Apple 60 GB iPod with Video</str>
- <float name="price">399.0</float>
- </doc>
- <doc>
- <str name="name">ASUS Extreme N7800GTX/2DHTV</str>
- <float name="price">479.95</float>
- </doc>
- </result>
- </response>
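A client could read such a response with a few lines of standard-library code. The sketch below handles only the element types that appear in the response above (str and float); the function name and the SAMPLE constant are illustrative.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<response><responseHeader><status>0</status><QTime>1</QTime></responseHeader>
<result numFound="16173" start="0">
<doc><str name="name">Apple 60 GB iPod with Video</str><float name="price">399.0</float></doc>
<doc><str name="name">ASUS Extreme N7800GTX/2DHTV</str><float name="price">479.95</float></doc>
</result></response>"""

def parse_response(xml_text):
    """Return (numFound, list of {field name: value}) from a response."""
    root = ET.fromstring(xml_text)
    result = root.find("result")
    docs = []
    for doc in result.findall("doc"):
        fields = {}
        for child in doc:
            # The tag carries the type; only <float> is converted here.
            value = float(child.text) if child.tag == "float" else child.text
            fields[child.get("name")] = value
        docs.append(fields)
    return int(result.get("numFound")), docs
```

Note that numFound is the total number of matches, while the number of doc elements is bounded by the rows parameter.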
12Caching
- IndexSearcher's view of an index is fixed
- Aggressive caching possible
- Consistency for multi-query requests
- filterCache: unordered set of document ids matching a query
- resultCache: ordered subset of document ids matching a query
- documentCache: the stored fields of documents
- userCaches: application specific, custom query handlers
13Warming for Speed
- Lucene IndexReader warming
- field norms, FieldCache, tii (the term index)
- Static Cache warming
- Configurable static requests to warm new Searchers
- Smart Cache Warming (autowarming)
- Using MRU items in the current cache to pre-populate the new cache
- Warming in parallel with live requests
14Smart Cache Warming
[Diagram: numbered steps 1 to 3; labels: Regenerator (three times), Autowarming, Field Cache, Field Norms]
Autowarming: warm n MRU cache keys w/ new Searcher
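The autowarming idea can be sketched with a toy LRU cache. This is not Solr's cache implementation: LRUCache, mru_keys, autowarm and the regenerate callback are hypothetical names, and regenerate stands in for re-running a cached query against the new Searcher.

```python
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self._items = OrderedDict()  # least recently used entry first

    def get(self, key):
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict the LRU entry

    def mru_keys(self, n):
        # The n most recently used keys, most recent first.
        return list(self._items)[::-1][:n]

def autowarm(old_cache, new_cache, n, regenerate):
    # Old values are tied to the old index view, so only the keys are
    # reused. Insert the oldest selected key first so the new cache
    # ends up with the same recency order.
    for key in reversed(old_cache.mru_keys(n)):
        new_cache.put(key, regenerate(key))
```

Because warming runs before the new Searcher is registered, live requests keep using the old cache until the new one is populated.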
15Schema
- Lucene has no notion of a schema
- Sorting: string vs. numeric
- Ranges: val:42 included in val:[1 TO 5] ?
- Lucene QueryParser has date-range support, but must guess.
- Defines fields, their types, properties
- Defines unique key field, default search field, Similarity implementation
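The sorting and range bullets come down to comparing terms as strings when no type is known. A short sketch with illustrative function names shows the difference; it models plain lexicographic comparison, not any specific Lucene class.

```python
def in_range_as_strings(value, low, high):
    # Without a schema, terms are compared lexicographically.
    return low <= value <= high

def in_range_as_numbers(value, low, high):
    # With a declared numeric type, the comparison is numeric.
    return int(low) <= int(value) <= int(high)

def sort_as_strings(values):
    return sorted(values)

def sort_as_numbers(values):
    return sorted(values, key=int)
```

As strings, "42" falls inside the range "1" to "5" and "10" sorts before "9"; a typed field avoids both surprises.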
16Field Definitions
- Field Attributes: name, type, indexed, stored, multiValued, omitNorms
- <field name="id" type="string" indexed="true" stored="true"/>
- <field name="sku" type="textTight" indexed="true" stored="true"/>
- <field name="name" type="text" indexed="true" stored="true"/>
- <field name="reviews" type="text" indexed="true" stored="false"/>
- <field name="category" type="text_ws" indexed="true" stored="true" multiValued="true"/>
- Dynamic Fields, in the spirit of Lucene!
- <dynamicField name="*_i" type="sint" indexed="true" stored="true"/>
- <dynamicField name="*_s" type="string" indexed="true" stored="true"/>
- <dynamicField name="*_t" type="text" indexed="true" stored="true"/>
17Search Relevancy
Document Analysis: "PowerShot SD 500"
- WhitespaceTokenizer: PowerShot, SD, 500
- WordDelimiterFilter catenateWords=1: Power, Shot, PowerShot, SD, 500
- LowercaseFilter: power, shot, powershot, sd, 500
Query Analysis: "power-shot sd500"
- WhitespaceTokenizer: power-shot, sd500
- WordDelimiterFilter catenateWords=0: power, shot, sd, 500
- LowercaseFilter: power, shot, sd, 500
A Match!
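The two analysis chains can be imitated in a few lines. This is a simplified stand-in for the real filters, not their implementation: it splits ASCII tokens on punctuation, case changes and letter/digit boundaries, and with catenate_words it joins all alphabetic parts of a token. All function names are illustrative.

```python
import re

# Sub-word parts: digit runs, all-caps runs, or an optionally
# capitalized lowercase run (ASCII only in this sketch).
_PART = re.compile(r"[0-9]+|[A-Z]+(?![a-z])|[A-Z]?[a-z]+")

def whitespace_tokenize(text):
    return text.split()

def word_delimiter(tokens, catenate_words=False):
    out = []
    for token in tokens:
        parts = _PART.findall(token)
        out.extend(parts)
        words = [p for p in parts if p.isalpha()]
        if catenate_words and len(words) > 1:
            out.append("".join(words))  # e.g. Power + Shot -> PowerShot
    return out

def lowercase(tokens):
    return [t.lower() for t in tokens]

def analyze(text, catenate_words):
    return lowercase(word_delimiter(whitespace_tokenize(text), catenate_words))

def matches(query_tokens, index_tokens):
    # Every query token must be present among the indexed tokens.
    return set(query_tokens) <= set(index_tokens)
```

Indexing "PowerShot SD 500" with catenation and querying "power-shot sd500" without it yields overlapping tokens, which is why the two spellings match.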
18Configuring Relevancy
- <fieldtype name="text" class="solr.TextField">
- <analyzer>
- <tokenizer class="solr.WhitespaceTokenizerFactory"/>
- <filter class="solr.LowerCaseFilterFactory"/>
- <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"/>
- <filter class="solr.StopFilterFactory" words="stopwords.txt"/>
- <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
- </analyzer>
- </fieldtype>
19copyField
- Copies one field to another at index time
- Use case: Analyze same field different ways
- copy into a field with a different analyzer
- boost exact-case, exact-punctuation matches
- language translations, thesaurus, soundex
- <field name="title" type="text"/>
- <field name="title_exact" type="text_exact" stored="false"/>
- <copyField source="title" dest="title_exact"/>
- Use case: Index multiple fields into single searchable field
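Conceptually, copyField duplicates raw field values before analysis, so each destination can apply its own analyzer. The sketch below models that step on plain dictionaries; apply_copy_fields and the rule format are assumptions, not Solr's API.

```python
def apply_copy_fields(doc, rules):
    """Copy raw values from source fields into destination fields.

    doc maps field name -> list of values; rules is a list of
    (source, dest) pairs. Copies always read from the original
    document, so one rule never feeds another.
    """
    out = {name: list(values) for name, values in doc.items()}
    for source, dest in rules:
        if source in doc:
            out.setdefault(dest, []).extend(doc[source])
    return out
```

Rules such as `[("title", "title_exact")]` cover the first use case, while several sources pointing at one destination cover the second.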
20High Availability
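[Diagram]
- Appservers (Dynamic HTML Generation) send HTTP search requests through a Load Balancer to the Solr Searchers
- Updates from the DB and the admin terminal go to the Solr Master; admin queries come from the admin terminal
- Index Replication from the Solr Master to the Solr Searchers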
21Replication
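[Diagram: Lucene index segments in solr/data/index on the Master and on the Searcher]
1. hard links: Master solr/data/index to solr/data/snapshot-2006062950000
2. hard links: Searcher solr/data/index to solr/data/snapshot-2006062950000-WIP
3. rsync: Master snapshot to the Searcher's -WIP directory (after rsync: new segment)
4. mv dir: snapshot-2006062950000-WIP to snapshot-2006062950000 (after mv)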
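The four steps can be approximated locally to show why hard links make snapshots cheap. This is a sketch under stated assumptions: index directories are flat, a plain file copy stands in for rsync, and link_tree and pull_snapshot are hypothetical names rather than Solr's replication scripts.

```python
import os
import shutil

def link_tree(src, dst):
    # Steps 1 and 2: snapshot a flat index directory with hard links,
    # so no segment data is copied.
    os.mkdir(dst)
    for name in os.listdir(src):
        os.link(os.path.join(src, name), os.path.join(dst, name))

def _unchanged(a, b):
    # Cheap comparison by size and modification time.
    sa, sb = os.stat(a), os.stat(b)
    return sa.st_size == sb.st_size and int(sa.st_mtime) == int(sb.st_mtime)

def pull_snapshot(master_snapshot, local_index, dest):
    wip = dest + "-WIP"
    link_tree(local_index, wip)                       # step 2
    master_files = set(os.listdir(master_snapshot))
    for name in os.listdir(wip):                      # step 3: drop stale files
        if name not in master_files:
            os.remove(os.path.join(wip, name))
    for name in master_files:                         # step 3: fetch new files
        src = os.path.join(master_snapshot, name)
        target = os.path.join(wip, name)
        if os.path.exists(target):
            if _unchanged(src, target):
                continue
            os.remove(target)  # break the hard link before rewriting
        shutil.copy2(src, target)
    os.rename(wip, dest)                              # step 4: publish
```

Only files missing from the local index are transferred, and the final rename means readers never see a half-copied snapshot directory.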
22Faceted Browsing Example
23Faceted Browsing
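[Diagram]
- Query: computer; sort: price asc; filters: computer_type:PC, memory:[1GB TO *]
- Search(Query, Filter, Sort, offset, n) returns a DocList (section of ordered results) and a DocSet (unordered set of all results)
- Facet counts: intersection Size() of the DocSet with each facet's set
- proc_manu:Intel 594
- proc_manu:AMD 382
- price:[0 TO 500] 247
- price:[500 TO 1000] 689
- manu:Dell 104
- manu:HP 92
- manu:Lenovo 75
- The DocList and the facet counts go into the Query Response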
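The counting step is a set intersection per facet. A minimal sketch with Python sets standing in for DocSets; the function name and the sample ids in the usage note are made up for illustration.

```python
def facet_counts(base_docset, facet_docsets):
    """Count, for each facet, how many matching documents fall into it.

    base_docset: ids of all documents matching the query and filters.
    facet_docsets: facet label -> ids of documents matching that facet
    query (in Solr these would typically come from the filterCache).
    """
    return {
        label: len(base_docset & ids)
        for label, ids in facet_docsets.items()
    }
```

For instance, `facet_counts({1, 2, 3, 4, 5}, {"manu:Dell": {1, 2, 9}, "manu:HP": {3}})` counts only the Dell and HP documents that are also in the result set.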
24Web Admin Interface
- Show Config, Schema, Distribution info
- Query Interface
- Statistics
- Caches: lookups, hits, hitratio, inserts, evictions, size
- RequestHandlers: requests, errors
- UpdateHandler: adds, deletes, commits, optimizes
- IndexReader: open-time, index-version, numDocs, maxDocs
- Analysis Debugger
- Shows tokens after each Analyzer stage
- Shows token matches for query vs index
26Selling Points
- Fast
- Powerful & Configurable
- High Relevancy
- Mature Product
- Same features as costly commercial software
- Leverage Community
- Lucene committers, IR experts
- Free consulting: shared problems & solutions
27Where are we going?
- OOTB Simple Faceted Browsing
- Automatic Database Indexing
- Federated Search
- HA with failover
- Alternate output formats (JSON, Ruby)
- Highlighter integration
- Spellchecker
- Alternate APIs (Google Data, OpenSearch)
28Resources
- WWW
- http://incubator.apache.org/solr
- http://incubator.apache.org/solr/tutorial.html
- http://wiki.apache.org/solr/
- Mailing Lists
- solr-user-subscribe_at_lucene.apache.org
- solr-dev-subscribe_at_lucene.apache.org