Advanced Indexing Techniques with Apache Lucene - Payloads

Provided by: ping153
Learn more at: http://people.apache.org
1
Advanced Indexing Techniques with Apache Lucene - Payloads
  • Michael Busch
  • (buschmi_at_apache.org)

http://people.apache.org/buschmi/apachecon/
2
Agenda
  • Part 1: Inverted Index 101
      • Posting Lists
      • Stored Fields vs. Payloads
  • Part 2: Use cases for Payloads
      • BoostingTermQuery
      • Simple facet counting

3
Lucene's data structures
[Diagram: a search runs against the Inverted Index and yields Hits; stored fields are retrieved from the Store to produce Results]
4
Query: not
c:\docs\einstein.txt: The important thing is not to stop questioning.
c:\docs\shakespeare.txt: To be or not to be.
String comparison slow!
Solution: Inverted index
5
Query: not
Inverted index
Documents:
  0  c:\docs\einstein.txt: The important thing is not to stop questioning.
  1  c:\docs\shakespeare.txt: To be or not to be.
Term          Document IDs
be            1
important     0
is            0
not           0 1
or            1
questioning   0
stop          0
to            0 1
the           0
thing         0
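The term-to-document-ID table on this slide can be sketched in a few lines of plain Java. This is only an illustration of the idea, not Lucene code: the class `InvertedIndexSketch`, its methods, and the simple letter-based tokenization are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration class; it is not part of Lucene.
class InvertedIndexSketch {
    // term -> ascending list of IDs of the documents that contain the term
    private final Map<String, List<Integer>> postings = new TreeMap<>();

    // Splits the text into lowercase terms and records docId once per term.
    // Documents are expected to be added in ascending docId order.
    void add(int docId, String text) {
        for (String term : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (term.isEmpty()) continue;
            List<Integer> docs = postings.computeIfAbsent(term, t -> new ArrayList<>());
            if (docs.isEmpty() || docs.get(docs.size() - 1) != docId) {
                docs.add(docId);
            }
        }
    }

    // A term query is a single dictionary lookup; no document text is scanned.
    List<Integer> search(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), Collections.emptyList());
    }

    // The two documents from the slides, with document IDs 0 and 1.
    static InvertedIndexSketch example() {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.add(0, "The important thing is not to stop questioning.");
        index.add(1, "To be or not to be.");
        return index;
    }
}
```

With the two example documents, `InvertedIndexSketch.example().search("not")` returns `[0, 1]`, matching the posting list for "not" in the table.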
6
Inverted index
Query: not to
Documents:
  0  c:\docs\einstein.txt: The important thing is not to stop questioning.
  1  c:\docs\shakespeare.txt: To be or not to be.
Term          Document IDs
be            1
important     0
is            0
not           0 1
or            1
questioning   0
stop          0
to            0 1
the           0
thing         0
7
Query: not to
Inverted index
Documents (with term positions):
  0  c:\docs\einstein.txt: The important thing is not to stop questioning.  (positions 0 1 2 3 4 5 6 7)
  1  c:\docs\shakespeare.txt: To be or not to be.  (positions 0 1 2 3 4 5)
Term          Document ID: Positions
be            1: 1 5
important     0: 1
is            0: 3
not           0: 4    1: 3
or            1: 2
questioning   0: 7
stop          0: 6
to            0: 5    1: 0 4
the           0: 0
thing         0: 2
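Storing positions is what makes the phrase query "not to" answerable: a document matches only if "to" occurs at the position directly after "not". The following plain-Java sketch illustrates that check. The class `PositionalIndexSketch` and its two-term `phrase` method are hypothetical and far simpler than Lucene's actual phrase matching.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration class; it is not part of Lucene.
class PositionalIndexSketch {
    // term -> (document ID -> positions of the term in that document)
    private final Map<String, Map<Integer, List<Integer>>> postings = new TreeMap<>();

    void add(int docId, String text) {
        int position = 0;
        for (String term : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (term.isEmpty()) continue;
            postings.computeIfAbsent(term, t -> new TreeMap<>())
                    .computeIfAbsent(docId, d -> new ArrayList<>())
                    .add(position++);
        }
    }

    // IDs of the documents in which `second` occurs directly after `first`.
    List<Integer> phrase(String first, String second) {
        List<Integer> hits = new ArrayList<>();
        Map<Integer, List<Integer>> a = postings.getOrDefault(first, new TreeMap<>());
        Map<Integer, List<Integer>> b = postings.getOrDefault(second, new TreeMap<>());
        for (Map.Entry<Integer, List<Integer>> entry : a.entrySet()) {
            List<Integer> next = b.get(entry.getKey());
            if (next == null) continue; // second term is not in this document
            for (int p : entry.getValue()) {
                if (next.contains(p + 1)) {
                    hits.add(entry.getKey());
                    break;
                }
            }
        }
        return hits;
    }

    // The two documents from the slides, with document IDs 0 and 1.
    static PositionalIndexSketch example() {
        PositionalIndexSketch index = new PositionalIndexSketch();
        index.add(0, "The important thing is not to stop questioning.");
        index.add(1, "To be or not to be.");
        return index;
    }
}
```

For the example documents, `phrase("not", "to")` returns `[0, 1]` (positions 4 and 5 in document 0, positions 3 and 4 in document 1), while `phrase("to", "be")` returns only `[1]`.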
8
Inverted index with Payloads
Documents (with term positions):
  0  c:\docs\einstein.txt: The important thing is not to stop questioning.  (positions 0 1 2 3 4 5 6 7)
  1  c:\docs\shakespeare.txt: To be or not to be.  (positions 0 1 2 3 4 5)
[Figure: the positional inverted index for the two documents, with posting lists holding Document IDs, Positions and Payloads; one position is shown carrying a payload "B"]
9
So far
  • String comparison slow
  • Inverted index used to accelerate search
  • Store positions in posting lists to allow phrase
    searches
  • Store payloads in posting lists to store
    arbitrary data with each position
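As a rough picture of what "arbitrary data with each position" means, a posting list can be thought of as a list of occurrences, each holding a document ID, a position and an optional byte array. The sketch below uses hypothetical names (`PayloadPostingsSketch`, `Occurrence`, `payloadAt`) and an in-memory list; it says nothing about how Lucene encodes payloads on disk.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration class; it is not part of Lucene.
class PayloadPostingsSketch {
    // One occurrence of a term: where it is, plus optional bytes attached to it.
    static final class Occurrence {
        final int docId;
        final int position;
        final byte[] payload; // null when the occurrence carries no payload

        Occurrence(int docId, int position, byte[] payload) {
            this.docId = docId;
            this.position = position;
            this.payload = payload;
        }
    }

    private final Map<String, List<Occurrence>> postings = new TreeMap<>();

    void add(String term, int docId, int position, byte[] payload) {
        postings.computeIfAbsent(term, t -> new ArrayList<>())
                .add(new Occurrence(docId, position, payload));
    }

    // Payload stored with the given occurrence, or null if there is none.
    byte[] payloadAt(String term, int docId, int position) {
        for (Occurrence o : postings.getOrDefault(term, new ArrayList<>())) {
            if (o.docId == docId && o.position == position) return o.payload;
        }
        return null;
    }
}
```

Which occurrence carries a payload is up to the application; for example, `add("be", 1, 5, new byte[] {'B'})` would attach one byte to the second "be" in document 1 (an arbitrary choice made for this example).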

10
Lucene's data structures
[Diagram: a search runs against the Inverted Index and yields Hits; stored fields are retrieved from the Store to produce Results]
11
Store
12
Store
[Figure: layout of documents D0, D1 and D2 with their fields F1, F2 and F3]
  • Optimized for random access
  • Document-locality

13
Store
[Figure: layout of documents D0, D1 and D2 with their fields F1, F2 and F3]
  • Optimized for scanning and skipping
  • Space-efficient encoding

14
Agenda
  • Part 1: Inverted Index 101
      • Posting Lists
      • Stored Fields vs. Payloads
  • Part 2: Use cases for Payloads
      • BoostingTermQuery
      • Simple facet counting

15
Payloads - API
org.apache.lucene.analysis.Token
  void setPayload(Payload payload)
org.apache.lucene.index.Payload
  Payload(byte[] data)
  Payload(byte[] data, int offset, int length)
16
Payloads - API
org.apache.lucene.index.TermPositions
  boolean next()
  int doc()
  int freq()
  int nextPosition()
  int getPayloadLength()
  byte[] getPayload(byte[] data, int offset)
17
Use case
Example: BoostingTermQuery
  • Score certain occurrences of a term higher than
    others
  • E.g. Query: warning
  • doc1: HURRICANE WARNING
  • doc2: The Warning Label Generator is a fun way to
    generate your own warning labels!
    (www.warninglabelgenerator.com)

18
Analyzer
Example: BoostingTermQuery
final byte BoldBoost = 5;
Token token = new Token();
if (isBold)
  token.setPayload(new Payload(new byte[] {BoldBoost}));  // one-byte payload holding the boost
return token;
19
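Similarity
Example: BoostingTermQuery
Similarity boostingSimilarity = new DefaultSimilarity() {
  // @Override
  public float scorePayload(byte[] payload, int offset, int length) {
    if (length == 1)
      return payload[offset];  // the one-byte boost written by the analyzer
    return super.scorePayload(payload, offset, length);  // fallback added so the method compiles
  }
};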
20
Example: BoostingTermQuery
BoostingTermQuery
Query btq = new BoostingTermQuery(new Term(field, searchterm));
Searching
Searcher searcher = new IndexSearcher();  // constructor arguments omitted, as on the slide
searcher.setSimilarity(boostingSimilarity);
Hits hits = searcher.search(btq);
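To see how a one-byte payload could change the ranking of doc1 and doc2, here is a small plain-Java sketch. It assumes that the payload scores of a term's occurrences are averaged and multiplied into a base score; that combination rule, the class `PayloadBoostSketch` and the fixed base score are assumptions for illustration, not a statement of what BoostingTermQuery does internally.

```java
// Hypothetical illustration class; it is not part of Lucene.
class PayloadBoostSketch {
    static final byte BOLD_BOOST = 5;

    // Same rule as the slide's scorePayload: a one-byte payload is the boost,
    // anything else is treated as neutral (1.0).
    static float scorePayload(byte[] payload, int offset, int length) {
        if (length == 1) return payload[offset];
        return 1.0f;
    }

    // Assumption: the base score is multiplied by the average payload score
    // over all occurrences of the term in the document.
    static float score(float baseScore, byte[][] payloadsPerOccurrence) {
        if (payloadsPerOccurrence.length == 0) return baseScore;
        float sum = 0;
        for (byte[] payload : payloadsPerOccurrence) {
            sum += scorePayload(payload, 0, payload.length);
        }
        return baseScore * (sum / payloadsPerOccurrence.length);
    }
}
```

Under these assumptions, a document with one bold occurrence (`score(1.0f, new byte[][] {{BOLD_BOOST}})`) scores 5.0, while a document with two plain occurrences (`score(1.0f, new byte[][] {{}, {}})`) stays at 1.0.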
21
Use case
Example from java-user: Unique Doc Ids
  • Store a unique document id (UID) that maps to a
    row in a database table
  • Retrieve UID at search time to influence
    matching/scoring
  • FieldCache takes too long to load

22
Solution
Example from java-user: Unique Doc Ids
  • Index one special term for each document, e.g.
    ID:UID
  • Index one occurrence for each document
  • Store UID in the Payload of the occurrence

23
For indexing: TokenStream
Example from java-user: Unique Doc Ids
class SinglePayloadTokenStream extends TokenStream {
  boolean done = false;
  public void setUID(int uid) { ... }
  public Token next() throws IOException {
    if (done) return null;
    Token token = new Token("UID", 0, 0);  // term text "UID"; the offsets 0, 0 are an assumption
    token.setPayload(new Payload(uid));    // assumes setUID() kept the UID as a byte[] named uid
    done = true;
    return token;
  }
}
24
For retrieving: TermPositions
Example from java-user: Unique Doc Ids
public int[] getCachedUIDs(IndexReader reader) throws IOException {
  int[] cache = new int[reader.maxDoc()];
  TermPositions tp = reader.termPositions(new Term("ID", "UID"));
  byte[] buffer = new byte[4];
  while (tp.next()) {            // iterate over docs
    tp.nextPosition();           // only one pos per doc
    tp.getPayload(buffer, 0);
    cache[tp.doc()] = bytesToInt(buffer);
  }
  return cache;
}
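The slide calls `bytesToInt(buffer)` without showing it. A minimal sketch of such a helper and its counterpart follows, assuming the UID is stored as four bytes in big-endian order; the byte order and the class name `UidPayloadCodec` are assumptions, since the slides do not specify the encoding.

```java
// Hypothetical helper class; the slides only name bytesToInt and do not show it.
class UidPayloadCodec {
    // Encodes an int as 4 bytes, most significant byte first (assumed order).
    static byte[] intToBytes(int value) {
        return new byte[] {
            (byte) (value >>> 24),
            (byte) (value >>> 16),
            (byte) (value >>> 8),
            (byte) value
        };
    }

    // Decodes the first 4 bytes of the buffer back into an int.
    static int bytesToInt(byte[] buffer) {
        return ((buffer[0] & 0xFF) << 24)
             | ((buffer[1] & 0xFF) << 16)
             | ((buffer[2] & 0xFF) << 8)
             | (buffer[3] & 0xFF);
    }
}
```

The `& 0xFF` masks matter because Java bytes are signed; without them, any byte above 127 would sign-extend and corrupt the decoded value.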
25
Performance
Example from java-user: Unique Doc Ids
  • Load UIDs for 2M docs into memory
  • FieldCache: 16.5 s
  • Payloads: 430 ms

26
Use case
Example: (Very) Simple facet counting
  • Collection with docs from different sources
  • Show top-n results from each source instead of
    top-n results from entire collection

27
Analyzer
Example: (Very) Simple facet counting
public TokenStream tokenStream(String fieldName, Reader reader) {
  if (fieldName.equals("_facet")) {
    return new TokenStream() {
      boolean done = false;
      public Token next() {
        if (done) return null;
        Token token = new Token();
        token.setPayload(new Payload(computeHash(url)));  // site hash as payload
        done = true;
        return token;
      }
    };
  }
  ...  // other fields are not shown on the slide
}
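The slide does not show `computeHash(url)`. One possible reading, sketched here, is a fixed-size hash of the URL's host so that all documents from the same site share a payload value. Using the host name, `String.hashCode()` and a 4-byte encoding are all assumptions of this sketch, and the class name `SiteHashSketch` is hypothetical.

```java
import java.net.URI;
import java.nio.ByteBuffer;
import java.util.Locale;

// Hypothetical helper class; the slides only name computeHash and do not show it.
class SiteHashSketch {
    // Returns a 4-byte hash of the URL's host, so that all pages of one site
    // get the same payload value. Falls back to the whole URL if no host is found.
    static byte[] computeHash(String url) {
        String host = URI.create(url).getHost();
        String key = (host == null ? url : host).toLowerCase(Locale.ROOT);
        return ByteBuffer.allocate(4).putInt(key.hashCode()).array();
    }
}
```

Different hosts can collide on a 32-bit hash, so a real implementation might prefer an explicit site-to-ID mapping; the hash is used here only because the slide's method name suggests it.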
28
HitCollector
Example: (Very) Simple facet counting
  • Use different PriorityQueues for different sites
  • Instead of returning top-n results of the whole
    data set, return top-n results per site
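The per-site collection described above can be sketched with one bounded priority queue per site. The class below is a plain-Java illustration with hypothetical names (`PerSiteTopN`, `collect`, `top`); it mirrors the shape of a collector callback but does not extend Lucene's HitCollector, and it takes the site value as a parameter instead of reading it from a payload.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Hypothetical illustration class; it is not part of Lucene.
class PerSiteTopN {
    static final class Hit {
        final int doc;
        final float score;

        Hit(int doc, float score) {
            this.doc = doc;
            this.score = score;
        }
    }

    private static final Comparator<Hit> BY_SCORE = Comparator.comparingDouble(h -> h.score);

    private final int n;
    // site value -> min-heap with the n best hits seen so far for that site
    private final Map<Integer, PriorityQueue<Hit>> queues = new HashMap<>();

    PerSiteTopN(int n) {
        this.n = n;
    }

    // Called once per matching document, in the manner of a collector callback.
    void collect(int doc, float score, int site) {
        PriorityQueue<Hit> queue =
                queues.computeIfAbsent(site, s -> new PriorityQueue<Hit>(BY_SCORE));
        queue.add(new Hit(doc, score));
        if (queue.size() > n) queue.poll(); // drop the weakest hit of this site
    }

    // Document IDs of the site's best hits, highest score first.
    List<Integer> top(int site) {
        List<Hit> hits = new ArrayList<>();
        PriorityQueue<Hit> queue = queues.get(site);
        if (queue != null) hits.addAll(queue);
        hits.sort(BY_SCORE.reversed());
        List<Integer> docs = new ArrayList<>();
        for (Hit hit : hits) docs.add(hit.doc);
        return docs;
    }
}
```

Each queue is a min-heap, so the weakest of a site's current hits is always at the head and can be evicted cheaply when a better hit arrives.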

29
Summary
Example: (Very) Simple facet counting
  • In this example the facet (site) is used for
    scoring, but the approach can be extended to
    facet counting
  • Good performance due to locality of facet values

30
Use case
Example: Efficient Numeric Search
  • Find documents that have a numeric value in a
    specific range, e.g. all docs with a date > 2006
    and < 2007

Currently in Lucene
  • RangeQuery
      • Store all values in the dictionary
      • Query expansion

31
Example: Efficient Numeric Search
Dictionary / Posting lists
[Figure: a dictionary with one term per date (01/01/2006, 01/02/2006, 01/04/2006, . . . 12/30/2006), each pointing to its own posting list]
Query: 01/05/2006 TO 11/25/2006
Problem: A large number of posting lists have to
be processed
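The cost of query expansion can be made concrete with a small model: if every distinct date is its own dictionary term, a range query has to visit one posting list per date in the range. The sketch below uses a sorted map as the dictionary; `RangeExpansionSketch`, its counter field and the one-document-per-day test data are assumptions made for the illustration.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical illustration class; it is not part of Lucene.
class RangeExpansionSketch {
    // dictionary: one term per distinct date -> IDs of the documents with that date
    final NavigableMap<LocalDate, List<Integer>> dictionary = new TreeMap<>();
    int postingListsProcessed = 0;

    void add(int docId, LocalDate date) {
        dictionary.computeIfAbsent(date, d -> new ArrayList<>()).add(docId);
    }

    // Inclusive range query: expands to every dictionary term within the range.
    List<Integer> range(LocalDate from, LocalDate to) {
        List<Integer> hits = new ArrayList<>();
        postingListsProcessed = 0;
        for (List<Integer> docs : dictionary.subMap(from, true, to, true).values()) {
            postingListsProcessed++;
            hits.addAll(docs);
        }
        return hits;
    }

    // Test data: one document for every day of the given year, IDs from 0.
    static RangeExpansionSketch oneDocPerDay(int year) {
        RangeExpansionSketch index = new RangeExpansionSketch();
        int docId = 0;
        for (LocalDate d = LocalDate.of(year, 1, 1); d.getYear() == year; d = d.plusDays(1)) {
            index.add(docId++, d);
        }
        return index;
    }
}
```

With one document per day of 2006, the slide's query from 01/05/2006 to 11/25/2006 touches 325 posting lists in this model.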
32
Idea
Example: Efficient Numeric Search
  • Index special term, e.g. numeric:date, and
    store actual value in a Payload for each doc
  • Problem: Posting list can become very big ->
    entire list has to be processed
  • Solution: Hybrid approach

33
Example: Efficient Numeric Search
Dictionary / Posting lists
[Figure: a dictionary with one term per month (date:01/2006, date:02/2006, . . . date:12/2006); the posting lists hold Document IDs, Positions and Payloads]
  • Store day in payload
  • Store position where date occurred
34
Example: Efficient Numeric Search
  • Tradeoff between number of posting lists to
    process and size of posting lists
  • Significant speedup possible with good choice of
    chunk size
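The tradeoff can be illustrated by chunking dates by month: the dictionary then holds one term per month, and the day is kept with each posting as a one-byte payload that is checked while scanning. This is a plain-Java model under those assumptions; `HybridDateIndexSketch`, the month-sized chunks and the in-memory `Posting` type are hypothetical and only mirror the slide's date:MM/YYYY example.

```java
import java.time.LocalDate;
import java.time.YearMonth;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical illustration class; it is not part of Lucene.
class HybridDateIndexSketch {
    static final class Posting {
        final int docId;
        final byte day; // payload: day of month

        Posting(int docId, byte day) {
            this.docId = docId;
            this.day = day;
        }
    }

    // dictionary: one term per month -> postings carrying the day as payload
    final NavigableMap<YearMonth, List<Posting>> dictionary = new TreeMap<>();
    int postingListsProcessed = 0;

    void add(int docId, LocalDate date) {
        dictionary.computeIfAbsent(YearMonth.from(date), m -> new ArrayList<>())
                  .add(new Posting(docId, (byte) date.getDayOfMonth()));
    }

    // Inclusive range query: visits one posting list per month in the range
    // and filters individual postings by the day stored in the payload.
    List<Integer> range(LocalDate from, LocalDate to) {
        List<Integer> hits = new ArrayList<>();
        postingListsProcessed = 0;
        for (Map.Entry<YearMonth, List<Posting>> entry :
                dictionary.subMap(YearMonth.from(from), true, YearMonth.from(to), true).entrySet()) {
            postingListsProcessed++;
            for (Posting p : entry.getValue()) {
                LocalDate date = entry.getKey().atDay(p.day);
                if (!date.isBefore(from) && !date.isAfter(to)) hits.add(p.docId);
            }
        }
        return hits;
    }
}
```

With one document per day of 2006, the query from 01/05/2006 to 11/25/2006 visits 11 posting lists instead of 325, at the price of scanning longer lists and testing each payload.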

35
Conclusion
  • Payloads offer great flexibility
  • Payloads are stored very space-efficiently
  • Sophisticated data structures enable efficient
    skipping over payloads
  • Payloads should be used whenever special data is
    required for finding hits and scoring

36
Outlook
  • Finalize API (currently Beta)
  • Add more out-of-the-box query types
  • Per-document Payloads (updateable)
  • FieldCache implementation that uses Payloads

37
Advanced Indexing Techniques with Apache Lucene - Payloads
  • Questions?

http://people.apache.org/buschmi/apachecon/