Powerful Full-Text Search with Solr

Slides: 34
Provided by: people4
Learn more at: http://people.apache.org

Transcript and Presenter's Notes

1
Powerful Full-Text Search with Solr
  • Yonik Seeley
  • yonik@apache.org
  • Web 2.0 Expo, Berlin
  • 8 November 2007

Download at http://www.apache.org/~yonik
2
What is Lucene
  • High performance, scalable, full-text search
    library
  • Focus: Indexing + Searching Documents
  • Document is just a list of name/value pairs
  • No crawlers or document parsing
  • Flexible Text Analysis (tokenizers + token
    filters)
  • 100% Java, no dependencies, no config files

3
What is Solr
  • A full text search server based on Lucene
  • XML/HTTP, JSON Interfaces
  • Faceted Search (category counting)
  • Flexible data schema to define types and fields
  • Hit Highlighting
  • Configurable Advanced Caching
  • Index Replication
  • Extensible Open Architecture, Plugins
  • Web Administration Interface
  • Written in Java 5, deployable as a WAR

4
Basic App
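[Diagram: an Indexer posts documents to http://solr/update; a Webapp sends queries to http://solr/select and turns the query response into HTML. Solr runs inside a Servlet Container.]
  • Document: super_name: Mr. Fantastic; name: Reed Richards; category: superhero; powers: elasticity
  • Query: powers:agility
  • Query Response: matching docs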
5
Indexing Data
  • HTTP POST to http://localhost:8983/solr/update

    <add><doc>
      <field name="id">05991</field>
      <field name="name">Peter Parker</field>
      <field name="supername">Spider-Man</field>
      <field name="category">superhero</field>
      <field name="powers">agility</field>
      <field name="powers">spider-sense</field>
    </doc></add>
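As a rough illustration, an add message like the one on this slide can be assembled with Python's standard library rather than by hand. The helper name `build_add_xml` is made up for this sketch; only the add/doc/field layout and the field values come from the slide.

```python
import xml.etree.ElementTree as ET

def build_add_xml(doc):
    """Serialize one document (field name -> value or list of values) as an <add> message.

    Hypothetical helper for illustration; it only builds the XML string.
    """
    add = ET.Element("add")
    doc_el = ET.SubElement(add, "doc")
    for name, value in doc.items():
        # A multi-valued field is sent as repeated <field> elements with the same name.
        values = value if isinstance(value, list) else [value]
        for v in values:
            field = ET.SubElement(doc_el, "field", name=name)
            field.text = str(v)
    return ET.tostring(add, encoding="unicode")

spiderman = {
    "id": "05991",
    "name": "Peter Parker",
    "supername": "Spider-Man",
    "category": "superhero",
    "powers": ["agility", "spider-sense"],
}
add_xml = build_add_xml(spiderman)
print(add_xml)
```

The resulting string would be the body of the HTTP POST to the update URL.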
6
Indexing CSV data
    Iron Man, Tony Stark, superhero, powered armor | flight
    Sandman, William Baker|Flint Marko, supervillain, sand transform
    Wolverine,James Howlett|Logan, superhero, healing|adamantium
    Magneto, Erik Lehnsherr, supervillain, magnetism|electricity

    http://localhost:8983/solr/update/csv?
      fieldnames=supername,name,category,powers
      &separator=,
      &f.name.split=true&f.name.separator=|
      &f.powers.split=true&f.powers.separator=|
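The CSV parameters are easier to read when built programmatically. The sketch below URL-encodes the parameters from this slide and then mimics, locally, what the split options are meant to do to one row. The `split_row` function is a hypothetical stand-in for illustration, not Solr's loader, and trimming whitespace around values is a choice made here for readability.

```python
from urllib.parse import urlencode, parse_qs, urlsplit

params = {
    "fieldnames": "supername,name,category,powers",
    "separator": ",",
    "f.name.split": "true",
    "f.name.separator": "|",
    "f.powers.split": "true",
    "f.powers.separator": "|",
}
# urlencode escapes "," and "|" so they survive inside the query string.
csv_url = "http://localhost:8983/solr/update/csv?" + urlencode(params)

def split_row(line):
    """Local illustration of per-field splitting; assumes no quoted values."""
    names = params["fieldnames"].split(",")
    values = [v.strip() for v in line.split(params["separator"])]
    row = dict(zip(names, values))
    for name in names:
        if params.get(f"f.{name}.split") == "true":
            sep = params[f"f.{name}.separator"]
            row[name] = [part.strip() for part in row[name].split(sep)]
    return row

row = split_row("Sandman, William Baker|Flint Marko, supervillain, sand transform")
print(csv_url)
print(row)
```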
7
Data upload methods
  • URL=http://localhost:8983/solr/update/csv
  • HTTP POST body (curl, HttpClient, etc)
    curl $URL -H 'Content-type:text/plain; charset=utf-8' --data-binary @info.csv
  • Multi-part file upload (browsers)
  • Request parameter
    ?stream.body=Cyclops, Scott Summers,
  • Streaming from URL (must enable)
    ?stream.url=file://data/info.csv
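For the "HTTP POST body" option, a request equivalent to the curl line can be prepared with `urllib.request`. This sketch only constructs the request object and does not send it; the row used as the body is borrowed from the CSV slide, and a running server at that URL is assumed if you do send it.

```python
import urllib.request

URL = "http://localhost:8983/solr/update/csv"
body = "Iron Man, Tony Stark, superhero, powered armor | flight\n".encode("utf-8")

# Same content type as the curl example: plain text, UTF-8 encoded.
req = urllib.request.Request(
    URL,
    data=body,
    headers={"Content-type": "text/plain; charset=utf-8"},
    method="POST",
)

# To actually send it (requires a server listening at URL):
# with urllib.request.urlopen(req) as resp:
#     print(resp.status)
print(req.get_method(), req.full_url)
```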

8
Indexing with SolrJ
    // Solr's Java client API: remote or embedded/local!
    SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("supername", "Daredevil");
    doc.addField("name", "Matt Murdock");
    doc.addField("category", "superhero");
    server.add(doc);
    server.commit();

9
Deleting Documents
  • Delete by Id, most efficient
    <delete>
      <id>05591</id>
      <id>32552</id>
    </delete>
  • Delete by Query
    <delete>
      <query>category:supervillain</query>
    </delete>
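Both delete messages are small enough to generate on the fly. A minimal sketch, with the function names `delete_by_ids` and `delete_by_query` invented for this example:

```python
import xml.etree.ElementTree as ET

def delete_by_ids(ids):
    """Build a delete message with one <id> element per document id."""
    root = ET.Element("delete")
    for doc_id in ids:
        ET.SubElement(root, "id").text = doc_id
    return ET.tostring(root, encoding="unicode")

def delete_by_query(query):
    """Build a delete message that removes every document matching the query."""
    root = ET.Element("delete")
    ET.SubElement(root, "query").text = query
    return ET.tostring(root, encoding="unicode")

print(delete_by_ids(["05591", "32552"]))
print(delete_by_query("category:supervillain"))
```

Using an XML library instead of string concatenation also takes care of escaping characters such as `<` or `&` inside a query.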

10
Commit
  • <commit/> makes changes visible
  • Triggers static cache warming in solrconfig.xml
  • Triggers autowarming from existing caches
  • <optimize/> same as commit, merges all index
    segments for faster searching

[Diagram: Lucene Index Segments]
11
Searching
    http://localhost:8983/solr/select?q=powers:agility
      &start=0&rows=2&fl=supername,category

    <response>
      <result numFound="427" start="0">
        <doc>
          <str name="supername">Spider-Man</str>
          <str name="category">superhero</str>
        </doc>
        <doc>
          <str name="supername">Mystique</str>
          <str name="category">supervillain</str>
        </doc>
      </result>
    </response>
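On the client side, the query URL can be built with `urlencode` and the XML response read with `ElementTree`. The sample text below is a copy of the response from this slide, and `parse_docs` is a hypothetical helper that handles only `str` fields like the ones shown.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

query_url = "http://localhost:8983/solr/select?" + urlencode(
    {"q": "powers:agility", "start": 0, "rows": 2, "fl": "supername,category"}
)

sample = """<response>
  <result numFound="427" start="0">
    <doc>
      <str name="supername">Spider-Man</str>
      <str name="category">superhero</str>
    </doc>
    <doc>
      <str name="supername">Mystique</str>
      <str name="category">supervillain</str>
    </doc>
  </result>
</response>"""

def parse_docs(xml_text):
    """Return (numFound, list of {field name: text}) from a response like the sample."""
    result = ET.fromstring(xml_text).find("result")
    docs = [{field.get("name"): field.text for field in doc} for doc in result.findall("doc")]
    return int(result.get("numFound")), docs

num_found, docs = parse_docs(sample)
print(query_url)
print(num_found, docs)
```

Note that `numFound` is the total number of matches, while `rows` limits how many documents come back.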

12
Response Format
  • Add &wt=json for JSON formatted response
    {"result": {"numFound":427, "start":0,
      "docs": [
        {"supername":"Spider-Man", "category":"superhero"},
        {"supername":"Mystique", "category":"supervillain"}
      ]}}
  • Also Python, Ruby, PHP, SerializedPHP, XSLT

13
Scoring
  • Query results are sorted by score descending
  • VSM: Vector Space Model
  • tf: term frequency, number of matching terms in
    field
  • lengthNorm: number of tokens in field
  • idf: inverse document frequency
  • coord: coordination factor, number of matching
    terms
  • document boost
  • query clause boost
  • http://lucene.apache.org/java/docs/scoring.html
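To make the factors concrete, here is a sketch of how Lucene's default (classic TF-IDF) similarity is usually described as computing them. The slide names the factors only; the exact formulas below are stated from general knowledge of that default similarity and should be read as an assumption, since a custom Similarity can change any of them.

```python
import math

def tf(freq):
    """Term frequency factor: grows with the square root of the raw count."""
    return math.sqrt(freq)

def idf(doc_freq, num_docs):
    """Inverse document frequency: rarer terms get a larger weight."""
    return math.log(num_docs / (doc_freq + 1)) + 1.0

def length_norm(num_tokens):
    """Shorter fields score higher for the same match."""
    return 1.0 / math.sqrt(num_tokens)

def coord(matching_terms, query_terms):
    """Fraction of the query terms that the document matched."""
    return matching_terms / query_terms

print(tf(2), idf(5, 100), length_norm(16), coord(1, 2))
```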

14
Explain
    http://solr/select?q=super fast&indent=on&debugQuery=on

    <lst name="debug">
     <lst name="explain">
      <str name="id=Flash,internal_docid=6">
    0.16389132 = (MATCH) product of:
      0.32778263 = (MATCH) sum of:
        0.32778263 = (MATCH) weight(text:fast in 6), product of:
          0.5012072 = queryWeight(text:fast), product of:
            2.466337 = idf(docFreq=5)
            0.20321926 = queryNorm
          0.65398633 = (MATCH) fieldWeight(text:fast in 6), product of:
            1.4142135 = tf(termFreq(text:fast)=2)
            2.466337 = idf(docFreq=5)
            0.1875 = fieldNorm(field=fast, doc=6)
      0.5 = coord(1/2)
      </str>
      <str name="id=Superman,internal_docid=7">
    0.1365761 = (MATCH) product of:
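The explain tree is just nested arithmetic, so the first document's score can be re-derived from the leaf values printed on this slide. The variable names are chosen for this sketch; the numbers are copied from the explain output.

```python
import math

# Leaf values copied from the explain output for the document with internal_docid 6.
idf = 2.466337
query_norm = 0.20321926
term_freq = 2
field_norm = 0.1875
coord = 1 / 2          # one of the two query terms matched

query_weight = idf * query_norm                       # shown as 0.5012072
field_weight = math.sqrt(term_freq) * idf * field_norm  # shown as 0.65398633
term_score = query_weight * field_weight              # shown as 0.32778263
score = term_score * coord                            # shown as 0.16389132

print(query_weight, field_weight, term_score, score)
```

Only "fast" matched, so the sum has a single clause and the coord factor halves the result.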

15
Lucene Query Syntax
  • justice league
  • Equiv: justice OR league
  • QueryParser default operator is "OR"/optional
  • +justice +league -name:aquaman
  • Equiv: justice AND league NOT name:aquaman
  • "justice league" -name:aquaman
  • title:spiderman^10 description:spiderman
  • description:"spiderman movie"~100
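One practical wrinkle when sending this syntax over HTTP: `+` means "required" in the query language but decodes to a space in a URL query string, so the query has to be URL-encoded before it goes into the `q` parameter. A small sketch using the standard library, with the example query assumed to use the `+`/`-`/`field:term` operators described on this slide:

```python
from urllib.parse import quote_plus, unquote_plus

lucene_query = "+justice +league -name:aquaman"

# "+" becomes %2B, spaces become "+", ":" becomes %3A.
encoded = quote_plus(lucene_query)
select_url = "http://solr/select?q=" + encoded

print(select_url)
print(unquote_plus(encoded))  # round-trips back to the original query
```

Sent unencoded, the server would see the `+` characters as spaces and both required markers would be lost.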

16
Lucene Query Examples 2
  • releaseDate:[2000 TO 2007]
  • Wildcard searches: sup?r, su*r, super*
  • spider~
  • Fuzzy search: Levenshtein distance
  • Optional minimum similarity: spider~0.7
  • (Superman AND "Lex Luthor") OR (+Batman +Joker)

17
DisMax Query Syntax
  • Good for handling raw user queries
  • Balanced quotes for phrase query
  • + for required, - for prohibited
  • Separates query terms from query structure
    http://solr/select?qt=dismax
      &q=super man              // the user query
      &qf=title^3 subject^2 body  // fields to query
      &pf=title^2,body          // fields to do phrase queries
      &ps=100                   // slop for those phrase q's
      &tie=.1                   // multi-field match reward
      &mm=2                     // # of terms that should match
      &bf=popularity            // boost function

18
DisMax Query Form
  • The expanded Lucene Query
    ( DisjunctionMaxQuery( title:super^3 |
        subject:super^2 | body:super)
      DisjunctionMaxQuery( title:man^3 |
        subject:man^2 | body:man)
    )
    DisjunctionMaxQuery(title:"super man"~100^2 |
        body:"super man"~100)
    FunctionQuery(popularity)
  • Tip: set up your own request handler with default
    parameters to avoid clients having to specify them
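The reason for the DisjunctionMaxQuery wrapper is its scoring rule: for each user term, the best-matching field wins, and the other matching fields contribute only a fraction controlled by the tie parameter. The sketch below shows that rule on made-up per-field scores; the function name and the sample numbers are illustrative assumptions.

```python
def dismax_score(field_scores, tie=0.0):
    """Score of one disjunction: best field, plus tie times the sum of the rest."""
    best = max(field_scores)
    return best + tie * (sum(field_scores) - best)

# Hypothetical scores for the term "super" in title, subject and body.
scores_for_super = [0.9, 0.3, 0.1]

print(dismax_score(scores_for_super, tie=0.0))  # pure max: only the best field counts
print(dismax_score(scores_for_super, tie=0.1))  # small reward for matching more fields
print(dismax_score(scores_for_super, tie=1.0))  # behaves like a plain sum
```

With a low tie value, a term that appears in many fields of one document does not outweigh a document that matches more of the user's terms.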

19
Function Query
  • Allows adding function of field value to score
  • Boost recently added or popular documents
  • Current parser only supports function notation
  • Example: log(sum(popularity,1))
  • sum, product, div, log, sqrt, abs, pow
  • scale(x, target_min, target_max)
  • calculates min & max of x across all docs
  • map(x, min, max, target)
  • useful for dealing with defaults
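The semantics of these functions can be mimicked in a few lines. This is a local sketch, not Solr code: `log` is assumed to be base 10 here, `map` is read as "values inside [min, max] become target, everything else is left alone", and the function names are chosen to avoid clashing with Python built-ins.

```python
import math

def log_popularity(popularity):
    """The slide's example, log(sum(popularity,1)); adding 1 avoids log(0)."""
    return math.log10(popularity + 1)

def scale(values, target_min, target_max):
    """Rescale values linearly so the smallest maps to target_min and the largest to target_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [float(target_min) for _ in values]
    return [(v - lo) / (hi - lo) * (target_max - target_min) + target_min for v in values]

def map_range(x, range_min, range_max, target):
    """Replace x by target when it falls inside [range_min, range_max]."""
    return target if range_min <= x <= range_max else x

print(log_popularity(9))
print(scale([0, 5, 10], 1, 2))
print(map_range(0, 0, 0, 1), map_range(7, 0, 0, 1))
```

The last line illustrates the "defaults" use: a missing value stored as 0 is mapped to 1, while real values pass through unchanged.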

20
Boosted Query
  • Score is multiplied instead of added
  • New local params <!...> syntax added
    &q=<!boost b=sqrt(popularity)>super man
  • Parameter dereferencing in local params
    &q=<!boost b=$boost v=$userq>
    &boost=sqrt(popularity)
    &userq=super man
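Two ideas from this slide can be shown in miniature: multiplying versus adding the boost, and resolving `$name` references against the other request parameters. Both functions below are illustrative stand-ins with made-up names, not Solr's implementation.

```python
import math

def added(text_score, popularity):
    """Function query style: the function value is added to the text score."""
    return text_score + math.sqrt(popularity)

def boosted(text_score, popularity):
    """Boosted query style: the function value multiplies the text score."""
    return text_score * math.sqrt(popularity)

def dereference(local_params, request_params):
    """Replace each value of the form $name with the request parameter called name."""
    return {
        key: request_params[value[1:]] if value.startswith("$") else value
        for key, value in local_params.items()
    }

request = {"boost": "sqrt(popularity)", "userq": "super man"}
resolved = dereference({"b": "$boost", "v": "$userq"}, request)

print(added(0.5, 16), boosted(0.5, 16))
print(resolved)
```

With addition, a very popular document can outrank a much better text match; with multiplication, the boost only rescales a score that the text match has already earned.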

21
Analysis & Search Relevancy
[Diagram: the same text passes through an analysis chain at index time and at query time]
  • Document Indexing Analysis: LexCorp BFG-9000
  • WhitespaceTokenizer: LexCorp, BFG-9000
  • WordDelimiterFilter catenateWords=1: Lex, Corp, LexCorp, BFG, 9000
  • LowercaseFilter: lex, corp, lexcorp, bfg, 9000
  • Query Analysis: Lex corp bfg9000
  • WhitespaceTokenizer: Lex, corp, bfg9000
  • WordDelimiterFilter catenateWords=0: Lex, corp, bfg, 9000
  • LowercaseFilter: lex, corp, bfg, 9000
  • A Match!
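The two analysis chains can be imitated with a few string operations. This is a deliberately simplified sketch: the real WordDelimiterFilter has many more options, and the regular expression here only covers case changes, letter/digit boundaries and punctuation, which is enough for the example strings on this slide.

```python
import re

def whitespace_tokenize(text):
    return text.split()

def word_delimiter(tokens, catenate_words=False):
    """Split tokens into sub-words; optionally also emit the joined alphabetic parts."""
    out = []
    for token in tokens:
        # Capitalized or lowercase words, runs of capitals, or runs of digits.
        parts = re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", token)
        out.extend(parts)
        words = [p for p in parts if p.isalpha()]
        if catenate_words and len(words) > 1:
            out.append("".join(words))
    return out

def lowercase(tokens):
    return [t.lower() for t in tokens]

index_terms = lowercase(word_delimiter(whitespace_tokenize("LexCorp BFG-9000"), catenate_words=True))
query_terms = lowercase(word_delimiter(whitespace_tokenize("Lex corp bfg9000"), catenate_words=False))

print(index_terms)
print(query_terms)
print(set(query_terms) <= set(index_terms))  # every query term is in the index
```

The match happens because both sides end up with the same small terms, even though the original strings differ in case, spacing and punctuation.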
22
Configuring Relevancy
    <fieldType name="text" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.SynonymFilterFactory"
                synonyms="synonyms.txt"/>
        <filter class="solr.StopFilterFactory"
                words="stopwords.txt"/>
        <filter class="solr.EnglishPorterFilterFactory"
                protected="protwords.txt"/>
      </analyzer>
    </fieldType>

23
Field Definitions
  • Field Attributes: name, type, indexed, stored,
    multiValued, omitNorms, termVectors
    <field name="id" type="string" indexed="true" stored="true"/>
    <field name="sku" type="textTight" indexed="true" stored="true"/>
    <field name="name" type="text" indexed="true" stored="true"/>
    <field name="inStock" type="boolean" indexed="true" stored="false"/>
    <field name="price" type="sfloat" indexed="true" stored="false"/>
    <field name="category" type="text_ws" indexed="true" stored="true" multiValued="true"/>
  • Dynamic Fields
    <dynamicField name="*_i" type="sint" indexed="true" stored="true"/>
    <dynamicField name="*_s" type="string" indexed="true" stored="true"/>
    <dynamicField name="*_t" type="text" indexed="true" stored="true"/>
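Dynamic fields act as a fallback: a field name that is not declared explicitly is matched against the wildcard patterns. The lookup can be sketched with `fnmatch`, assuming the dynamic field names are the glob-style patterns `*_i`, `*_s` and `*_t`. The function name and the sample field names `weight_i` and `note_t` are made up for the illustration.

```python
from fnmatch import fnmatchcase

EXPLICIT_FIELDS = {
    "id": "string",
    "sku": "textTight",
    "name": "text",
    "inStock": "boolean",
    "price": "sfloat",
    "category": "text_ws",
}
DYNAMIC_FIELDS = [("*_i", "sint"), ("*_s", "string"), ("*_t", "text")]

def field_type(field_name):
    """Return the type for a field: explicit declarations first, then dynamic patterns."""
    if field_name in EXPLICIT_FIELDS:
        return EXPLICIT_FIELDS[field_name]
    for pattern, type_name in DYNAMIC_FIELDS:
        if fnmatchcase(field_name, pattern):
            return type_name
    return None  # no declaration matches this field

print(field_type("price"), field_type("weight_i"), field_type("note_t"), field_type("unknown"))
```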

24
copyField
  • Copies one field to another at index time
  • Usecase #1: Analyze same field different ways
  • copy into a field with a different analyzer
  • boost exact-case, exact-punctuation matches
  • language translations, thesaurus, soundex
    <field name="title" type="text"/>
    <field name="title_exact" type="text_exact"
           stored="false"/>
    <copyField source="title" dest="title_exact"/>
  • Usecase #2: Index multiple fields into single
    searchable field

25
(No Transcript)
26
(No Transcript)
27
(No Transcript)
28
Facet Query
    http://solr/select?q=foo&wt=json&indent=on
      &facet=true&facet.field=cat
      &facet.query=price:[0 TO 100]
      &facet.query=manu:IBM

    {"response":{"numFound":26,"start":0,"docs":[...]},
     "facet_counts":{
       "facet_queries":{
         "price:[0 TO 100]":6,
         "manu:IBM":2},
       "facet_fields":{
         "cat":[ "electronics",14, "memory",3,
                 "card",2, "connector",2]}}}
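In the JSON output, field facet counts come back as a flat list that alternates value and count. A short sketch of turning that list into a dictionary follows; the sample JSON repeats the facet numbers from this slide, and the nesting (objects for `facet_counts` and `facet_queries`, a flat array for `cat`) is an assumption about the punctuation lost from the transcript.

```python
import json

raw = """
{"facet_counts": {
   "facet_queries": {"price:[0 TO 100]": 6, "manu:IBM": 2},
   "facet_fields": {"cat": ["electronics", 14, "memory", 3, "card", 2, "connector", 2]}}}
"""

data = json.loads(raw)
flat = data["facet_counts"]["facet_fields"]["cat"]

# Even positions hold the facet values, odd positions hold their counts.
cat_counts = dict(zip(flat[0::2], flat[1::2]))

print(cat_counts)
print(data["facet_counts"]["facet_queries"]["manu:IBM"])
```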

29
Filters
  • Filters are restrictions in addition to the query
  • Use in faceting to narrow the results
  • Filters are cached separately for speed
  • 1. User queries for memory, query sent to solr is
    &q=memory&fq=inStock:true&facet=true
  • 2. User selects 1GB memory size
    &q=memory&fq=inStock:true&fq=size:1GB
  • 3. User selects DDR2 memory type
    &q=memory&fq=inStock:true&fq=size:1GB
    &fq=type:DDR2
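The drill-down steps amount to keeping the main query fixed and appending one `fq` parameter per user selection. A sketch of that URL construction, where the function name `select_url` and the always-on `facet=true` are choices made for this example:

```python
from urllib.parse import urlencode, parse_qs

def select_url(query, filters):
    """Build a select URL with one fq parameter per active filter."""
    params = [("q", query)]
    params += [("fq", f) for f in filters]   # repeated keys are allowed in a query string
    params.append(("facet", "true"))
    return "http://solr/select?" + urlencode(params)

filters = ["inStock:true"]
step1 = select_url("memory", filters)

filters.append("size:1GB")     # user selects 1GB memory size
step2 = select_url("memory", filters)

filters.append("type:DDR2")    # user selects DDR2 memory type
step3 = select_url("memory", filters)

for url in (step1, step2, step3):
    print(url)
```

Keeping each restriction in its own `fq` is what lets the filters be cached and reused independently of the main query.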

30
Highlighting
    http://solr/select?q=lcd&wt=json&indent=on
      &hl=true&hl.fl=features

    {"response":{"numFound":5,"start":0,"docs":[
       {"id":"3007WFP", "price":899.95,
     "highlighting":{
       "3007WFP":{ "features":"30\" TFT active matrix
         <em>LCD</em>, 2560 x 1600
       "VA902B":{ "features":"19\" TFT active matrix
         <em>LCD</em>, 8ms response time, 1280 x 1024
         native resolution"

31
MoreLikeThis
  • Selects documents that are similar to the
    documents matching the main query.
    &q=id:6H500F0&mlt=true&mlt.fl=name,cat,features

    "moreLikeThis":{ "6H500F0":{"numFound":5,"start":0,
      "docs":[
        {"name":"Apple 60 GB iPod with Video
           Playback Black", "price":399.0,
         "inStock":true, "popularity":10,

32
High Availability
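[Diagram: HTTP search requests reach the Appservers (Dynamic HTML Generation), which query the Solr Searchers through a Load Balancer. The DB and an admin terminal send updates and admin queries to the Solr Master, and Index Replication carries the index from the Solr Master to the Solr Searchers.]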
33
Resources
  • WWW
  • http://lucene.apache.org/solr
  • http://lucene.apache.org/solr/tutorial.html
  • http://wiki.apache.org/solr/
  • Mailing Lists
  • solr-user-subscribe@lucene.apache.org
  • solr-dev-subscribe@lucene.apache.org