N-gram Search Engine on Wikipedia

Transcript and Presenter's Notes

1
N-gram Search Engine on Wikipedia
  • Satoshi Sekine (NYU)
  • Kapil Dalwani (JHU)

2
Hammer: Fast and multi-functional n-gram search engine
  • Search n-grams FAST
  • INPUT: token, POS, chunk, NE
  • OUTPUT: frequency to text
  • ngrams
3
Characteristics
  • Search up to 7-grams with wildcards
  • Multi-level input
    • Token, POS, chunk, NE, and combinations
    • NOT, OR for POS, chunk, NE
  • Multi-level output
    • Token, POS, chunk, NE
    • Document information
    • Original sentences, KWIC, n-gram
  • Display
    • Results are shown in order of frequency
  • Running environment
    • Single CPU, PC-Linux, 400 MB process, 500 GB disk

4
Demo
  • http://linserv1.cims.nyu.edu:23232/ngram_wikipedia2

5
Available for you
  • Web system
    • At NYU
      • http://nlp.cs.nyu.edu/nsearch
    • At JHU?
  • USB hard drive

6
Implementation: Overview
Suffix array for text
N-gram data
Inverted index for n-gram data
Search request
Wikipedia text
POS, chunk, NE for N-gram data
Wikipedia POS, chunk, NE
8
From n-gram to Inverted Index
  • Example 3-grams

    Ngram ID   Position1   Position2   Position3
    1          A           B           C
    2          A           B           B
    3          B           A           C

  • Posting lists

    Token, position   Ngram IDs
    A pos1            1 2
    A pos2            3
    B pos1            3
    B pos2            1 2
    B pos3            2
    C pos3            1 3
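
The mapping from the example 3-grams to position-specific posting lists can be sketched in a few lines of Python. This is a minimal illustration only, not the engine's implementation; the function name `build_inverted_index` and the in-memory dictionary layout are assumptions made for the example.

```python
from collections import defaultdict


def build_inverted_index(ngrams):
    """Map (token, position) -> sorted list of n-gram IDs.

    `ngrams` is a dict of n-gram ID -> tuple of tokens. Positions are
    1-based to match the slide ("A pos1", "B pos2", ...).
    """
    index = defaultdict(list)
    for ngram_id, tokens in sorted(ngrams.items()):
        for position, token in enumerate(tokens, start=1):
            index[(token, position)].append(ngram_id)
    return dict(index)


# The three example 3-grams from the slide.
ngrams = {1: ("A", "B", "C"), 2: ("A", "B", "B"), 3: ("B", "A", "C")}
index = build_inverted_index(ngrams)

for key in sorted(index):
    print(key, index[key])
```

Because n-gram IDs are visited in increasing order, each posting list comes out sorted, which is what the intersection step of the search relies on.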
9
Posting list
  • Wide variation in posting list size (7-grams: 1.27B in total)
    • EOS (100,906,888), "," (55,644,989), "the" (33,762,672)
    • conscipcuous, consiety, Mizuk (1 each)
  • 3 types for faster speed and smaller index size
    • Bitmap (freq > 1): for EOS, 1.27B bits (bitmap) <-> 3.2B bits (list)
    • List of ngramIDs
    • Encoded into the pointer (freq = 1)

Bitmap example: 1 0 0 0 1 1 0 1 0 0 0 0 1 0 0 1
List example: C pos3 -> 1 3
Pointer example: C pos3 -> 5
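
The size comparison behind the three posting-list types can be checked with a short calculation. The sketch below assumes 32 bits per n-gram ID in a list (consistent with the 3.2B-bit figure for EOS) and assumes a simple selection rule of "use a bitmap when it is smaller than the list". The slide does not spell out the exact threshold, so `choose_representation` is a hypothetical helper, not the engine's actual rule.

```python
ID_BITS = 32  # assumption: each n-gram ID stored in a list costs 32 bits


def list_bits(freq, id_bits=ID_BITS):
    """Size in bits of a posting list holding `freq` n-gram IDs."""
    return freq * id_bits


def bitmap_bits(total_ngrams):
    """Size in bits of a bitmap with one bit per n-gram."""
    return total_ngrams


def choose_representation(freq, total_ngrams, id_bits=ID_BITS):
    """Pick a posting-list type (hypothetical rule, see lead-in)."""
    if freq == 1:
        return "pointer"  # the single n-gram ID is stored in the pointer itself
    if bitmap_bits(total_ngrams) < list_bits(freq, id_bits):
        return "bitmap"
    return "list"


TOTAL_7GRAMS = 1_270_000_000  # 1.27B, rounded as on the slide

print(list_bits(100_906_888))  # bits needed for EOS as a list
print(choose_representation(100_906_888, TOTAL_7GRAMS))
print(choose_representation(33_762_672, TOTAL_7GRAMS))
print(choose_representation(1, TOTAL_7GRAMS))
```

Under these assumptions EOS needs about 3.2B bits as a list but only 1.27B bits as a bitmap, while a token that occurs once needs no list at all.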
10
Search
  • Given an n-gram request (A B C)
    • Get the posting lists for A, B and C
    • Search for the intersection of the posting lists
  • Use look-ahead to speed up the search
    • Look-ahead size: sqrt(size of posting list)
    • Moffat and Zobel (1996)

Example posting lists (SKIP marks a look-ahead jump in the longer list):
4 33 34 55 76 80 89 92 99
4 12 15 19 22 33 37 46 59 60 62 76 82 89 94 98
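
A minimal sketch of intersecting sorted posting lists with a look-ahead of sqrt(list size) follows. It illustrates the general skipping idea only and is not the engine's code or the exact procedure of Moffat and Zobel; the function names are chosen for this example.

```python
import math


def intersect_with_skips(short, long):
    """Intersect two sorted ID lists, skipping ahead in the longer one."""
    skip = max(1, math.isqrt(len(long)))  # look-ahead size
    result = []
    j = 0
    for target in short:
        # Jump a whole block while the element one block ahead is not past the target.
        while j + skip < len(long) and long[j + skip] <= target:
            j += skip
        # Finish with a linear scan inside the current block.
        while j < len(long) and long[j] < target:
            j += 1
        if j == len(long):
            break
        if long[j] == target:
            result.append(target)
    return result


def intersect_all(posting_lists):
    """Intersect several posting lists, starting from the shortest."""
    ordered = sorted(posting_lists, key=len)
    result = ordered[0]
    for other in ordered[1:]:
        result = intersect_with_skips(result, other)
    return result


# The two example lists from the slide.
a = [4, 33, 34, 55, 76, 80, 89, 92, 99]
b = [4, 12, 15, 19, 22, 33, 37, 46, 59, 60, 62, 76, 82, 89, 94, 98]
print(intersect_with_skips(a, b))

# Request "A B C" against the example index: A pos1, B pos2, C pos3.
print(intersect_all([[1, 2], [1, 2], [1, 3]]))
```

Starting from the shortest list keeps the running result small, so later intersections mostly skip through the long lists rather than scan them.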
11
Implementation: Overview
1. Search candidates
Suffix array for text
N-gram data
Inverted index for n-gram data
Search request
Wikipedia text
POS, chunk, NE for N-gram data
Wikipedia POS, chunk, NE
12
Filtering
  • Not all candidate ngramIDs match the request
  • We need frequency and sentence information for the matched n-grams
  • POS, chunk and NE information is represented as IDs
    • This reduces the index by more than 200 GB

[Figure: n-gram "A B" (Freq = 123) with annotation labels NN, PERSON, VB, LOC and frequencies Freq = 10 and Freq = 5]
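
The filtering step can be sketched as follows: each candidate n-gram carries its annotation variants (tag IDs per position, with a frequency), and the request supplies one constraint per position with OR and NOT semantics. The data layout, the `Constraint` class, the `POS_ID` table and the example frequencies are all hypothetical, chosen only to illustrate the idea of filtering on tag IDs.

```python
from dataclasses import dataclass

# Hypothetical ID table: tags are stored as small integers, not strings.
POS_ID = {"NN": 0, "VB": 1, "DT": 2}


@dataclass(frozen=True)
class Constraint:
    allowed: frozenset = frozenset()  # OR over these tag IDs; empty = wildcard
    negate: bool = False              # NOT: match anything outside `allowed`

    def matches(self, tag_id):
        if not self.allowed:
            return True
        return (tag_id in self.allowed) != self.negate


def filter_candidates(candidates, variants, constraints):
    """Return {ngram_id: matching frequency} for candidates that pass."""
    matched = {}
    for ngram_id in candidates:
        total = 0
        for tags, freq in variants.get(ngram_id, []):
            if all(c.matches(t) for c, t in zip(constraints, tags)):
                total += freq
        if total > 0:
            matched[ngram_id] = total
    return matched


NN, VB, DT = POS_ID["NN"], POS_ID["VB"], POS_ID["DT"]
# Made-up annotation variants for two candidate 2-grams.
variants = {
    1: [((NN, VB), 10), ((VB, VB), 5)],
    2: [((DT, NN), 7)],
}

first_is_noun = [Constraint(frozenset({NN})), Constraint()]
first_not_noun = [Constraint(frozenset({NN}), negate=True), Constraint()]
print(filter_candidates([1, 2], variants, first_is_noun))
print(filter_candidates([1, 2], variants, first_not_noun))
```

Comparing integer IDs rather than tag strings keeps both the stored annotations and the per-candidate check small.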
13
Implementation: Overview
2. Filtering
Suffix array for text
N-gram data
Inverted index for n-gram data
Search request
Wikipedia text
POS, chunk, NE for N-gram data
Wikipedia POS, chunk, NE
14
Display
  • N-grams are displayed in descending order of frequency
    • N-gram IDs are ordered by frequency
  • Sentences are searched using the suffix array
  • POS, chunk and NE are displayed with the sentence, KWIC and n-gram
  • Doc ID and Wikipedia title (and possibly other features of the
    doc) are displayed with sentences and KWIC
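
Looking up occurrences through a suffix array and printing them as KWIC lines can be sketched on a toy token sequence. This is a naive in-memory illustration; the construction by plain sorting, the function names and the sample text are assumptions for the example and would not scale to the full Wikipedia text.

```python
def build_suffix_array(tokens):
    """Sort all suffix start positions lexicographically (naive construction)."""
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])


def find_occurrences(tokens, suffix_array, pattern):
    """Binary-search the suffix array for all positions where `pattern` starts."""
    n = len(pattern)

    def prefix(k):
        start = suffix_array[k]
        return tokens[start:start + n]

    lo, hi = 0, len(suffix_array)
    while lo < hi:  # first suffix whose prefix is >= pattern
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    hi = len(suffix_array)
    while lo < hi:  # first suffix whose prefix is > pattern
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(suffix_array[first:lo])


def kwic(tokens, position, length, window=2):
    """Format one hit as a keyword-in-context line."""
    left = tokens[max(0, position - window):position]
    hit = tokens[position:position + length]
    right = tokens[position + length:position + length + window]
    return " ".join(left) + " [" + " ".join(hit) + "] " + " ".join(right)


tokens = "the cat sat on the mat because the cat was tired".split()
suffix_array = build_suffix_array(tokens)
positions = find_occurrences(tokens, suffix_array, ["the", "cat"])
print(positions)
print(kwic(tokens, positions[-1], 2))
```

All suffixes that begin with the pattern sit next to each other in the sorted order, so two binary searches are enough to find every occurrence.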

15
Size of data
Text: 1.7G words, 200M sentences, 2.4M articles
N-grams: 1-gram 8M, 2-gram 93M, 3-gram 377M, 4-gram 733M, 5-gram 1.00B, 6-gram 1.17B, 7-gram 1.27B

    Component                          Size
    Suffix array for text              8 GB
    N-gram data                        260 GB
    Inverted index for n-gram data     108 GB
    Wikipedia text                     8 GB
    POS, chunk, NE for N-gram data     100 GB
    Wikipedia POS, chunk, NE           6 GB
    Others                             40 GB
    Total                              530 GB
16
Future Work
  • Other information (e.g., parse, coref, relation,
    genre, discourse)
  • Longer n-grams
  • Compress the index and dictionary
  • Ease the indexing load
    • Currently we need a big-memory machine
    • Distributed indexing
  • Union operation for tokens

17
Available for you
  • Web demo
    • At NYU
      • http://nlp.cs.nyu.edu/nsearch
    • At JHU?
  • USB hard drive