

Title: Apache Solr


1
Apache Solr
Yonik Seeley, yonik_at_apache.org
29 June 2006, Dublin, Ireland
2
History
  • Search for a replacement search platform
  • commercial: high license fees
  • open-source: no full solutions
  • CNET grants code to Apache; Solr enters Incubator
    17 Jan 2006
  • Solr is a Lucene sub-project
  • Users: CNET Reviews, CNET Channel, shopper.com,
    news.com, nines.org, krugle.com, oodle.com,
    booklooker.de

3
Lucene Refresher
  • Lucene is a full-text search library
  • Add documents to an index via IndexWriter
  • A document is a collection of fields
  • No config files, dynamic field typing
  • Flexible text analysis: tokenizers, filters
  • Search for documents via IndexSearcher
  • Hits = search(Query, Filter, Sort, topN)
  • Scoring: tf * idf * lengthNorm
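
The scoring bullet can be made concrete with a small sketch. The three factors below follow the formulas of Lucene's classic default Similarity as I understand them (tf = sqrt(freq), idf = 1 + ln(numDocs / (docFreq + 1)), lengthNorm = 1 / sqrt(numTerms)). The function names are illustrative, and boosts, query normalization and coordination factors are deliberately left out, so this is a simplification rather than Lucene's full scoring.

```python
import math


def tf(freq):
    """Term frequency factor: grows with the square root of the raw count."""
    return math.sqrt(freq)


def idf(doc_freq, num_docs):
    """Inverse document frequency: rare terms weigh more."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def length_norm(num_terms):
    """Length normalization: matches in short fields weigh more."""
    return 1.0 / math.sqrt(num_terms)


def score(freq, doc_freq, num_docs, num_terms):
    """Simplified single-term score: tf * idf * lengthNorm."""
    return tf(freq) * idf(doc_freq, num_docs) * length_norm(num_terms)


# Same term statistics, but the second field is four times longer.
short_field = score(freq=4, doc_freq=9, num_docs=100, num_terms=16)
long_field = score(freq=4, doc_freq=9, num_docs=100, num_terms=64)
```

With identical term counts, the shorter field scores higher because only lengthNorm differs.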

4
What Is Solr
  • A full text search server based on Lucene
  • XML/HTTP Interfaces
  • Loose Schema to define types and fields
  • Web Administration Interface
  • Extensive Caching
  • Index Replication
  • Extensible Open Architecture
  • Written in Java5, deployable as a WAR

5
Architecture
  • Diagram (component labels only)
  • Front end: Admin Interface, HTTP Request Servlet, Update Servlet
  • Request handling: Standard Request Handler, Disjunction Max Request
    Handler, Custom Request Handler
  • XML Update Interface, XML Response Writer
  • Solr Core: Update Handler, Caching, Config, Schema, Analysis,
    Concurrency, Replication
  • Lucene
6
Adding Documents
  • HTTP POST to /update
    <add><doc boost="2">
      <field name="article">05991</field>
      <field name="title">Apache Solr</field>
      <field name="subject">An intro...</field>
      <field name="category">search</field>
      <field name="category">lucene</field>
      <field name="body">Solr is a full...</field>
    </doc></add>
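
A minimal client-side sketch of the add message, using only the Python standard library. The helper names are hypothetical, and the base URL is an assumption taken from the localhost example later in the deck. The request object is only constructed here; sending it needs a running Solr server.

```python
import urllib.request
import xml.etree.ElementTree as ET


def build_add_message(fields, boost=None):
    """Build an <add><doc>...</doc></add> message from (name, value) pairs."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    if boost is not None:
        doc.set("boost", str(boost))
    for name, value in fields:
        # Repeating a name yields a multi-valued field, as with "category".
        ET.SubElement(doc, "field", {"name": name}).text = value
    return ET.tostring(add, encoding="unicode")


def build_update_request(xml_body, base_url="http://localhost:8983/solr"):
    """Prepare (but do not send) an HTTP POST to /update."""
    return urllib.request.Request(
        base_url + "/update",
        data=xml_body.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"},
        method="POST",
    )


message = build_add_message(
    [
        ("article", "05991"),
        ("title", "Apache Solr"),
        ("category", "search"),
        ("category", "lucene"),
    ],
    boost=2,
)
request = build_update_request(message)
# urllib.request.urlopen(request) would submit it to a live server.
```

Building the message with an XML library rather than string concatenation means field values containing `<` or `&` are escaped automatically.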

7
Deleting Documents
  • Delete by Id
    <delete><id>05591</id></delete>
  • Delete by Query (multiple documents)
    <delete>
      <query>manufacturer:microsoft</query>
    </delete>
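
The two delete forms can be generated the same way. This is a sketch with hypothetical helper names; the resulting strings would be POSTed to /update like any other update message.

```python
import xml.etree.ElementTree as ET


def delete_by_id(doc_id):
    """Message that deletes the single document with this unique key."""
    delete = ET.Element("delete")
    ET.SubElement(delete, "id").text = doc_id
    return ET.tostring(delete, encoding="unicode")


def delete_by_query(query):
    """Message that deletes every document matching the query."""
    delete = ET.Element("delete")
    ET.SubElement(delete, "query").text = query
    return ET.tostring(delete, encoding="unicode")


by_id = delete_by_id("05591")
by_query = delete_by_query("manufacturer:microsoft")
```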

8
Commit
  • <commit/> makes changes visible
  • closes IndexWriter
  • removes duplicates
  • opens new IndexSearcher
  • newSearcher/firstSearcher events
  • cache warming
  • register the new IndexSearcher
  • <optimize/>: same as commit, but also merges all index
    segments.
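
The commit sequence can be illustrated with a toy in-memory model. This is not how Solr is implemented: the class and attribute names are hypothetical, a plain dict stands in for the registered IndexSearcher, and duplicate removal is modeled as "the last document added with a given unique key wins".

```python
class ToyIndex:
    """Toy model: adds stay invisible until commit() swaps in a new view."""

    def __init__(self):
        self._pending = []    # documents added since the last commit
        self._committed = {}  # unique key -> document
        self.searcher = {}    # the view that queries currently see

    def add(self, doc):
        self._pending.append(doc)

    def commit(self):
        # "removes duplicates": a later add replaces an earlier one
        # that has the same unique key.
        for doc in self._pending:
            self._committed[doc["id"]] = doc
        self._pending = []
        # "opens new IndexSearcher": build a fresh, fixed snapshot.
        new_searcher = dict(self._committed)
        # Event listeners and cache warming would run here, before
        # the new view starts serving requests.
        # "register the new IndexSearcher": make it the live view.
        self.searcher = new_searcher


index = ToyIndex()
index.add({"id": "05991", "title": "Apache Solr (draft)"})
index.add({"id": "05991", "title": "Apache Solr"})
visible_before = len(index.searcher)
old_view = index.searcher
index.commit()
visible_after = len(index.searcher)
```

A request that still holds `old_view` keeps seeing the old contents, which is the property the caching slides rely on.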

9
Default Query Syntax
  • Lucene Query Syntax [; sort specification]
  • mission impossible; releaseDate desc
  • +mission +impossible -actor:cruise
  • "mission impossible" -actor:cruise
  • title:spiderman^10 description:spiderman
  • description:"spiderman movie"~10
  • +HDTV +weight:[0 TO 100]
  • Wildcard queries: te?t, te*t, test*

10
Default Parameters
  • Query Arguments for HTTP GET/POST to /select

param   default    description
q                  The query
start   0          Offset into the list of matches
rows    10         Number of documents to return
fl                 Stored fields to return
qt      standard   Query type; maps to query handler
df      (schema)   Default field to search
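
A short sketch of assembling a /select URL from these parameters. The helper name is hypothetical and the base URL is an assumption taken from the localhost example in this deck. Parameters that are omitted are left for the server to default.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_select_url(q, base_url="http://localhost:8983/solr", **params):
    """Build a GET URL for /select; urlencode escapes special characters."""
    query = {"q": q}
    query.update(params)
    return base_url + "/select?" + urlencode(query)


url = build_select_url("video", start=0, rows=2, fl="name,price")
# Round-trip the query string to check what a server would receive.
decoded = parse_qs(urlsplit(url).query)
```

Query strings often contain characters such as `+`, `:` and spaces, so the URL-encoding step is needed for them to reach the server unchanged.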
11
Search Results
  • http://localhost:8983/solr/select?q=video&start=0&rows=2&fl=name,price
    <response><responseHeader><status>0</status>
      <QTime>1</QTime></responseHeader>
      <result numFound="16173" start="0">
        <doc>
          <str name="name">Apple 60 GB iPod with Video</str>
          <float name="price">399.0</float>
        </doc>
        <doc>
          <str name="name">ASUS Extreme N7800GTX/2DHTV</str>
          <float name="price">479.95</float>
        </doc>
      </result>
    </response>
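
On the client side, a response of this shape can be read with a standard XML parser. The sketch below embeds a copy of the slide's response and handles only the `str` and `float` element types that appear in it. The function name is hypothetical.

```python
import xml.etree.ElementTree as ET

RESPONSE = """<response><responseHeader><status>0</status>
<QTime>1</QTime></responseHeader>
<result numFound="16173" start="0">
  <doc>
    <str name="name">Apple 60 GB iPod with Video</str>
    <float name="price">399.0</float>
  </doc>
  <doc>
    <str name="name">ASUS Extreme N7800GTX/2DHTV</str>
    <float name="price">479.95</float>
  </doc>
</result>
</response>"""


def parse_results(xml_text):
    """Return (numFound, list of {field name: value}) from a response."""
    root = ET.fromstring(xml_text)
    result = root.find("result")
    docs = []
    for doc in result.findall("doc"):
        fields = {}
        for element in doc:
            # The element tag carries the type; only two are handled here.
            value = float(element.text) if element.tag == "float" else element.text
            fields[element.get("name")] = value
        docs.append(fields)
    return int(result.get("numFound")), docs


num_found, docs = parse_results(RESPONSE)
```

Note that numFound (16173) counts all matches, while only the requested rows=2 documents are returned.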

12
Caching
  • IndexSearcher's view of an index is fixed
  • Aggressive caching possible
  • Consistency for multi-query requests
  • filterCache: unordered set of document ids
    matching a query
  • resultCache: ordered subset of document ids
    matching a query
  • documentCache: the stored fields of documents
  • userCaches: application specific, for custom query
    handlers
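
To illustrate the difference between the first two caches, here is a toy least-recently-used cache. The class is a stand-in written for this example, not Solr's implementation, and the cache keys and document ids are made up. A filterCache entry maps a query to an unordered set of ids. A resultCache entry maps a query plus sort to an ordered list.

```python
from collections import OrderedDict


class LRUCache:
    """Tiny LRU cache: the least recently used entry is evicted first."""

    def __init__(self, max_size):
        self.max_size = max_size
        self._items = OrderedDict()

    def get(self, key):
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.max_size:
            self._items.popitem(last=False)  # drop least recently used


filter_cache = LRUCache(max_size=2)
filter_cache.put("manu:Dell", frozenset({3, 7, 9}))  # unordered id set
filter_cache.put("manu:HP", frozenset({2, 4}))
filter_cache.get("manu:Dell")                        # touch manu:Dell
filter_cache.put("manu:Lenovo", frozenset({5}))      # evicts manu:HP

result_cache = LRUCache(max_size=2)
result_cache.put(("computer", "price asc"), [9, 3, 7])  # ordered ids
```

Because a searcher's view of the index never changes, entries like these stay valid for that searcher's whole lifetime and need no invalidation logic.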

13
Warming for Speed
  • Lucene IndexReader warming
  • field norms, FieldCache, tii (the term index)
  • Static Cache warming
  • Configurable static requests to warm new
    Searchers
  • Smart Cache Warming (autowarming)
  • Using MRU (most recently used) items in the current
    cache to pre-populate the new cache
  • Warming in parallel with live requests
14
Smart Cache Warming
  • Diagram (labels only): steps 1, 2, 3; Autowarming with a
    Regenerator; Field Cache with a Regenerator; Field Norms with
    a Regenerator
  • Autowarming: warm n MRU cache keys w/ new Searcher
15
Schema
  • Lucene has no notion of a schema
  • Sorting - string vs. numeric
  • Ranges - is val:42 included in val:[1 TO 5]?
  • Lucene QueryParser has date-range support, but
    must guess.
  • The Solr schema defines fields, their types, properties
  • Defines unique key field, default search field,
    Similarity implementation
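
The sorting and range bullets come down to comparing numbers as text. A few lines of Python show the same effect outside of Lucene. The values and helper names are made up for the illustration.

```python
values = ["5", "42", "1", "100"]

# Without type information, values compare character by character.
sorted_as_strings = sorted(values)
sorted_as_numbers = sorted(values, key=int)


def in_range_as_strings(value, low, high):
    """Range check on the raw text form."""
    return low <= value <= high


def in_range_as_numbers(value, low, high):
    """Range check after interpreting the text as integers."""
    return int(low) <= int(value) <= int(high)


string_answer = in_range_as_strings("42", "1", "5")  # "1" <= "42" <= "5"
numeric_answer = in_range_as_numbers("42", "1", "5")
```

As text, 42 falls between 1 and 5. A schema that declares the field's type lets the server pick the intended comparison.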

16
Field Definitions
  • Field Attributes: name, type, indexed, stored,
    multiValued, omitNorms
    <field name="id" type="string"
      indexed="true" stored="true"/>
    <field name="sku" type="textTight"
      indexed="true" stored="true"/>
    <field name="name" type="text"
      indexed="true" stored="true"/>
    <field name="reviews" type="text"
      indexed="true" stored="false"/>
    <field name="category" type="text_ws"
      indexed="true" stored="true" multiValued="true"/>
  • Dynamic Fields, in the spirit of Lucene!
    <dynamicField name="*_i" type="sint"
      indexed="true" stored="true"/>
    <dynamicField name="*_s" type="string"
      indexed="true" stored="true"/>
    <dynamicField name="*_t" type="text"
      indexed="true" stored="true"/>
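
A sketch of how a field name might be resolved to a type: explicit fields first, then dynamic-field patterns. It assumes the dynamic field names are suffix wildcards (`*_i`, `*_s`, `*_t`), with the asterisk apparently lost in this transcript. The lookup function is hypothetical and ignores all attributes other than the type.

```python
from fnmatch import fnmatchcase

# Explicit fields from the slide: name -> type.
EXPLICIT_FIELDS = {
    "id": "string",
    "sku": "textTight",
    "name": "text",
    "reviews": "text",
    "category": "text_ws",
}

# Dynamic fields, assumed to be suffix wildcards: (pattern, type).
DYNAMIC_FIELDS = [("*_i", "sint"), ("*_s", "string"), ("*_t", "text")]


def resolve_field_type(field_name):
    """Return the type for a field name, or None if nothing matches."""
    if field_name in EXPLICIT_FIELDS:
        return EXPLICIT_FIELDS[field_name]
    for pattern, field_type in DYNAMIC_FIELDS:
        if fnmatchcase(field_name, pattern):
            return field_type
    return None


popularity_type = resolve_field_type("popularity_i")
vendor_type = resolve_field_type("vendor_s")
unknown_type = resolve_field_type("weight")
```

This keeps the Lucene habit of inventing fields at index time, while the suffix still pins down how each value is indexed and sorted.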

17
Search Relevancy
  • Document Analysis: PowerShot SD 500
    WhitespaceTokenizer: PowerShot | SD | 500
    WordDelimiterFilter catenateWords=1: Power | Shot | PowerShot | SD | 500
    LowercaseFilter: power | shot | powershot | sd | 500
  • Query Analysis: power-shot sd500
    WhitespaceTokenizer: power-shot | sd500
    WordDelimiterFilter catenateWords=0: power | shot | sd | 500
    LowercaseFilter: power | shot | sd | 500
  • A Match!
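
The two analysis chains can be imitated in a few lines. This is a rough approximation written for illustration: the word-delimiter step below splits on punctuation, lower-to-upper case changes and letter/digit boundaries using a regular expression, and it does not reproduce every rule or option of Solr's WordDelimiterFilter.

```python
import re
from itertools import groupby


def whitespace_tokenize(text):
    return text.split()


def word_delimiter(tokens, catenate_words):
    """Split tokens into word and number parts; optionally also emit
    the concatenation of adjacent word parts (catenateWords)."""
    out = []
    for token in tokens:
        parts = re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", token)
        out.extend(parts)
        if catenate_words:
            for is_word, run in groupby(parts, key=str.isalpha):
                run = list(run)
                if is_word and len(run) > 1:
                    out.append("".join(run))
    return out


def lowercase(tokens):
    return [token.lower() for token in tokens]


def analyze(text, catenate_words):
    tokens = whitespace_tokenize(text)
    return lowercase(word_delimiter(tokens, catenate_words))


document_terms = analyze("PowerShot SD 500", catenate_words=True)
query_terms = analyze("power-shot sd500", catenate_words=False)
is_match = set(query_terms) <= set(document_terms)
```

The extra catenated term on the document side also lets a query typed as a single word, such as powershot, find the same document.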
18
Configuring Relevancy
    <fieldtype name="text" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.SynonymFilterFactory"
          synonyms="synonyms.txt"/>
        <filter class="solr.StopFilterFactory"
          words="stopwords.txt"/>
        <filter class="solr.EnglishPorterFilterFactory"
          protected="protwords.txt"/>
      </analyzer>
    </fieldtype>

19
copyField
  • Copies one field to another at index time
  • Use case: analyze the same field in different ways
  • copy into a field with a different analyzer
  • boost exact-case, exact-punctuation matches
  • language translations, thesaurus, soundex
    <field name="title" type="text"/>
    <field name="title_exact" type="text_exact"
      stored="false"/>
    <copyField source="title" dest="title_exact"/>
  • Use case: index multiple fields into a single
    searchable field
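
A toy sketch of what copyField does to an incoming document before analysis: the raw source value is duplicated into the destination field, and each field is then analyzed according to its own type. The function is hypothetical, and the `text_all` field in the second rule set is an invented name for the "single searchable field" use case.

```python
def apply_copy_fields(doc, copy_fields):
    """Return a copy of doc ({field: [values]}) with copy rules applied.

    copy_fields is a list of (source, dest) pairs. The raw, unanalyzed
    values are copied.
    """
    out = {name: list(values) for name, values in doc.items()}
    for source, dest in copy_fields:
        if source in doc:
            out.setdefault(dest, []).extend(doc[source])
    return out


incoming = {"title": ["Apache Solr"], "body": ["Solr is a full..."]}

# Use case 1: same text, second field with a different analyzer.
exact = apply_copy_fields(incoming, [("title", "title_exact")])

# Use case 2: several fields funneled into one searchable field.
combined = apply_copy_fields(
    incoming, [("title", "text_all"), ("body", "text_all")]
)
```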

20
High Availability
  • Diagram (labels only)
  • Search path: Appservers (Dynamic HTML Generation), HTTP search
    requests, Load Balancer, Solr Searchers
  • Update path: DB, updates, Solr Master, Index Replication
  • Administration: admin terminal, admin queries, updates
21
Replication
  • Diagram (labels only): Lucene index segments in solr/data/index
    on both Master and Searcher; new segment
  • 1. hard links
  • 2. hard links
  • 3. rsync
  • 4. mv dir
  • Snapshot directories: solr/data/snapshot-2006062950000 and
    solr/data/snapshot-2006062950000-WIP
  • States shown: after rsync, after mv
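
The hard-link and mv steps rely on two file-system properties: a hard link shares storage with the original file, and renaming a directory publishes it in one step. The sketch below demonstrates both in a temporary directory. It is not the actual replication tooling: the function names and file names are made up, the rsync transfer between machines is omitted, and it assumes a file system that supports hard links.

```python
import os
import tempfile


def take_snapshot(index_dir, snapshot_dir):
    """Create snapshot_dir and hard-link every index file into it."""
    os.makedirs(snapshot_dir)
    for name in sorted(os.listdir(index_dir)):
        os.link(os.path.join(index_dir, name), os.path.join(snapshot_dir, name))


def publish(wip_dir, final_dir):
    """Rename the work-in-progress directory to its final name."""
    os.rename(wip_dir, final_dir)


with tempfile.TemporaryDirectory() as data_dir:
    index_dir = os.path.join(data_dir, "index")
    os.makedirs(index_dir)
    segment = os.path.join(index_dir, "segment_1")
    with open(segment, "w") as handle:
        handle.write("toy segment data")

    wip_dir = os.path.join(data_dir, "snapshot-demo-WIP")
    final_dir = os.path.join(data_dir, "snapshot-demo")
    take_snapshot(index_dir, wip_dir)
    publish(wip_dir, final_dir)

    snapshot_segment = os.path.join(final_dir, "segment_1")
    shares_storage = os.path.samefile(segment, snapshot_segment)

    # Deleting the file from the live index does not affect the snapshot.
    os.remove(segment)
    with open(snapshot_segment) as handle:
        survived = handle.read()
```

Since index segment files are written once and not modified afterwards, a snapshot made of links costs almost no disk space and stays consistent while the live index keeps changing.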
22
Faceted Browsing Example
23
Faceted Browsing
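  • Diagram (labels only): Search(Query, Filter, Sort, offset, n)
    with query "computer", sort "price asc", and filters
    computer_type:PC and memory:[1GB TO *]
  • DocList: section of ordered results
  • DocSet: unordered set of all results
  • intersectionSize() of the DocSet with each facet query gives
    the counts in the Query Response:
  • proc_manu:Intel 594, proc_manu:AMD 382
  • price:[0 TO 500] 247, price:[500 TO 1000] 689
  • manu:Dell 104, manu:HP 92, manu:Lenovo 75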
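
The facet counts on this slide are set intersection sizes. A toy version with Python sets is shown below. The document ids and prices are invented, and the function is only an illustration of the idea, not Solr's DocSet API.

```python
def facet_counts(doc_set, facet_doc_sets):
    """Count, per facet, how many matching documents fall into it."""
    return {
        label: len(doc_set & facet_ids)
        for label, facet_ids in facet_doc_sets.items()
    }


# DocSet: unordered ids of every document matching query plus filters.
doc_set = frozenset({1, 2, 3, 5, 8})

# One cached id set per facet query, computed over the whole index.
facets = {
    "manu:Dell": frozenset({1, 2, 9}),
    "manu:HP": frozenset({3}),
    "manu:Lenovo": frozenset({4, 6}),
}
counts = facet_counts(doc_set, facets)

# DocList: the requested slice of the results, here sorted by price.
prices = {1: 450, 2: 300, 3: 700, 5: 999, 8: 120}
doc_list = sorted(doc_set, key=prices.get)[0:2]
```

The per-facet id sets are exactly the kind of entry a filterCache holds, so each count costs one intersection once the cache is warm.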
24
Web Admin Interface
  • Show Config, Schema, Distribution info
  • Query Interface
  • Statistics
  • Caches: lookups, hits, hit ratio, inserts,
    evictions, size
  • RequestHandlers: requests, errors
  • UpdateHandler: adds, deletes, commits, optimizes
  • IndexReader: open-time, index-version, numDocs,
    maxDocs
  • Analysis Debugger
  • Shows tokens after each Analyzer stage
  • Shows token matches for query vs index

25
(No Transcript)
26
Selling Points
  • Fast
  • Powerful & Configurable
  • High Relevancy
  • Mature Product
  • Same features as software costing
  • Leverage Community
  • Lucene committers, IR experts
  • Free consulting: shared problems & solutions

27
Where are we going?
  • OOTB Simple Faceted Browsing
  • Automatic Database Indexing
  • Federated Search
  • HA with failover
  • Alternate output formats (JSON, Ruby)
  • Highlighter integration
  • Spellchecker
  • Alternate APIs (Google Data, OpenSearch)

28
Resources
  • WWW
  • http://incubator.apache.org/solr
  • http://incubator.apache.org/solr/tutorial.html
  • http://wiki.apache.org/solr/
  • Mailing Lists
  • solr-user-subscribe_at_lucene.apache.org
  • solr-dev-subscribe_at_lucene.apache.org