Building a Distributed Full-Text Index for the Web

About This Presentation

Slides: 47

Provided by: ilan46


Transcript and Presenter's Notes

1
Building a Distributed Full-Text Index for the Web
S. Melnik, S. Raghavan, B. Yang, H. Garcia-Molina
2

Introduction.
Testbed architecture.
Design of the indexer.
Distributed indexing.

4
Page 1: Pig Cat Fish Cat
Page 2: Fly Dog Pig
Page 3: Dog Cat Fish Dog
Inverted index (one inverted list per term; each entry is a location)
Cat -> (1,2), (1,4), (3,2)
Dog -> (2,2), (3,1), (3,4)
Fish -> (1,3), (3,3)
Pig -> (1,1), (2,3)
5
An inverted index consists of one inverted list per index term, with the terms kept in sorted order.
An inverted list consists of locations in sorted order.
A location consists of (page identifier, position in the page).
A posting consists of (index term, location).
6
Building an inverted index over a collection of web pages involves:
1. Processing each page to extract postings.
2. Building an inverted list for each term.
3. Writing the inverted lists out to disk.
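The three steps above can be sketched in a few lines of Python, using the three sample pages from the earlier slide. This is only an illustration: the function names, the lowercasing of terms, and the tab-separated output format are assumptions, not details from the presentation.

```python
import io
from collections import defaultdict

# Sample pages from the example slide (page id -> text).
PAGES = {
    1: "Pig Cat Fish Cat",
    2: "Fly Dog Pig",
    3: "Dog Cat Fish Dog",
}

def extract_postings(page_id, text):
    """Step 1: yield one (term, (page_id, position)) posting per word (1-based)."""
    for position, word in enumerate(text.split(), start=1):
        yield (word.lower(), (page_id, position))  # lowercasing is an assumption

def build_inverted_lists(pages):
    """Step 2: group postings by term, keeping terms and locations sorted."""
    lists = defaultdict(list)
    for page_id, text in pages.items():
        for term, location in extract_postings(page_id, text):
            lists[term].append(location)
    return {term: sorted(locs) for term, locs in sorted(lists.items())}

def write_index(index, out):
    """Step 3: write one line per term (hypothetical text format)."""
    for term, locs in index.items():
        out.write(term + "\t" + " ".join(f"{p},{q}" for p, q in locs) + "\n")

index = build_inverted_lists(PAGES)
buffer = io.StringIO()  # stands in for a file on disk
write_index(index, buffer)
```

Here `index["cat"]` holds the locations (1,2), (1,4), (3,2), matching the Cat list in the example.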
7
Important problems when building a web-scale inverted index:
1. Scale and growth rate.
2. Rate of change.
8

Introduction.
Testbed architecture.
Design of the indexer.
Distributed indexing.

9
(No Transcript)
10

Distributors.
Indexers.
Query servers.

Distributed inverted index organization
1. Local inverted files.
2. Global inverted files.

12
Global inverted files
Page 1: Pig Cat Fish Cat
Page 2: Fly Dog Pig
Page 3: Dog Cat Fish Dog
Query server 1 (terms a-e):
Cat -> (1,2), (1,4), (3,2)
Dog -> (2,2), (3,1), (3,4)
Query server 2 (terms f-z):
Fish -> (1,3), (3,3)
Pig -> (1,1), (2,3)
13
Local inverted files
Page 1: Pig Cat Fish Cat
Page 2: Fly Dog Pig
Page 3: Dog Cat Fish Dog
Query server 1 (pages 1 and 2):
Cat -> (1,2), (1,4)
Dog -> (2,2)
Fish -> (1,3)
Fly -> (2,1)
Pig -> (1,1), (2,3)
Query server 2 (page 3):
Cat -> (3,2)
Dog -> (3,1), (3,4)
Fish -> (3,3)
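A small sketch may help contrast the two organizations. It partitions the same three sample pages once by term range (global) and once by page subset (local). The function names and the exact split of pages between servers are assumptions chosen to mirror the example.

```python
from collections import defaultdict

PAGES = {1: "Pig Cat Fish Cat", 2: "Fly Dog Pig", 3: "Dog Cat Fish Dog"}

def postings(pages):
    """Yield (term, (page_id, position)) for every word, positions 1-based."""
    for page_id, text in pages.items():
        for pos, word in enumerate(text.split(), start=1):
            yield word, (page_id, pos)

def global_partition(pages, ranges):
    """Global inverted files: each server owns a range of first letters."""
    servers = [defaultdict(list) for _ in ranges]
    for term, loc in postings(pages):
        for server, (lo, hi) in zip(servers, ranges):
            if lo <= term[0].lower() <= hi:
                server[term].append(loc)
    return [dict(s) for s in servers]

def local_partition(pages, page_groups):
    """Local inverted files: each server indexes only its own subset of pages."""
    servers = []
    for group in page_groups:
        subset = {pid: pages[pid] for pid in group}
        lists = defaultdict(list)
        for term, loc in postings(subset):
            lists[term].append(loc)
        servers.append(dict(lists))
    return servers

global_servers = global_partition(PAGES, [("a", "e"), ("f", "z")])
local_servers = local_partition(PAGES, [(1, 2), (3,)])
```

With global files, a lookup for Cat touches only the first server. With local files, the same lookup has to be sent to every server, because each one holds part of the Cat list.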
14
Local vs. Global

Resilience to failures.
Network load.

15
Testbed environment
The indexers and the query servers are single-processor PCs with 350-500 MHz processors, 300-500 MB of main memory, and multiple disks.
All the machines are interconnected by a 100 Mbps Ethernet LAN.
16
The WebBase collection
To study some properties of web pages that are relevant to text indexing, we analyzed 5 samples of 100,000 pages each, from different portions of the WebBase repository.
17
Property                                              Value
Average number of words per page                      438
Average number of distinct words per page             171
Average size of each page (as HTML)                   8650
Average size of each page after removing HTML tags    2815
Average size of a word in the vocabulary              8
Table 1: Properties of the WebBase collection
18
(No Transcript)
19

Introduction.
Testbed architecture.
Design of the indexer.
Distributed indexing.

20
(No Transcript)
21

Design of the Indexer
Software pipeline.
The storage of the inverted files generated by
the process.

Software pipeline
The process can logically be split into 3 phases:
Processing -> CPU intensive.
Flushing -> disk.
Loading -> network.

23
(No Transcript)
24
The goal of our pipelining technique is to design an execution schedule for the different indexing phases that will result in minimal overall running time.
Examples: execution of the pipeline (L = loading, P = processing, F = flushing)
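To see why overlapping the phases shortens the run, here is a rough timing model. It assumes each phase uses a different resource (network, CPU, disk), that each phase handles one batch at a time in order, and that buffering between phases is unlimited. The per-batch durations are made-up numbers, and this is an idealized sketch, not the schedule analyzed in the talk.

```python
def sequential_time(batches):
    """Total time when every batch is loaded, processed, then flushed in turn."""
    return sum(l + p + f for l, p, f in batches)

def pipelined_time(batches):
    """Finish time when loading, processing, and flushing overlap across batches.

    Each phase starts a batch as soon as the previous phase has delivered it
    and the phase itself has finished the preceding batch.
    """
    load_done = proc_done = flush_done = 0.0
    for l, p, f in batches:
        load_done = load_done + l                    # network is never idle
        proc_done = max(proc_done, load_done) + p    # CPU waits for the data
        flush_done = max(flush_done, proc_done) + f  # disk waits for the CPU
    return flush_done

# Four identical batches: (loading, processing, flushing) durations, hypothetical.
batches = [(3.0, 5.0, 2.0)] * 4
print(sequential_time(batches), pipelined_time(batches))
```

In this model the pipelined run is bounded by the slowest phase (processing here), so balancing the phase durations is what minimizes overall running time.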
25
(No Transcript)
26
Pipeline time t
27
Theoretical analysis vs. experimental results
28
(No Transcript)
29
(No Transcript)
30

Design of the Indexer
Software pipeline.
The storage of the inverted files generated by
the process.

31
Storage schemes
We considered three storage schemes for storing inverted files as sets of (key, value) pairs in a B-tree:
1. Full list.
2. Single payload.
3. Mixed list.
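The transcript does not spell out how each scheme lays out its (key, value) pairs, so the sketch below shows only one plausible reading of the three names. The layouts are assumptions. Plain Python dicts stand in for the B-tree, and `chunk` is a hypothetical parameter.

```python
# Sorted postings for the sample pages: (term, (page, position)).
PAGES = {1: "Pig Cat Fish Cat", 2: "Fly Dog Pig", 3: "Dog Cat Fish Dog"}
postings = sorted(
    (word.lower(), (page_id, pos))
    for page_id, text in PAGES.items()
    for pos, word in enumerate(text.split(), start=1)
)

def full_list(postings):
    """Assumed layout: key = term, value = the term's whole inverted list."""
    out = {}
    for term, loc in postings:
        out.setdefault(term, []).append(loc)
    return out

def single_payload(postings):
    """Assumed layout: one (key, value) pair per posting, with an empty payload."""
    return {(term, loc): None for term, loc in postings}

def mixed_list(postings, chunk=4):
    """Assumed layout: key = first posting of a run, value = the postings after it,
    even when the run crosses from one term to the next."""
    out = {}
    for i in range(0, len(postings), chunk):
        block = postings[i:i + chunk]
        out[block[0]] = block[1:]
    return out

print(len(full_list(postings)), len(single_payload(postings)), len(mixed_list(postings)))
```

Under these assumptions the schemes differ in the number of keys they store: one per term, one per posting, or one per run of postings. That difference is what drives the index size, join, and update trade-offs.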
32
(No Transcript)
33

A qualitative comparison of these storage
schemes
Index size
Zig-zag joins
Hot updates

34
Zig-zag join using ordered indexes
(Figure: two ordered indexes being joined; key values as extracted: 1 2 3 4 7 9 18 1 7 9 11 17 12 19)
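A zig-zag join intersects two sorted key sequences by letting each side seek forward to the other side's current key, instead of scanning every entry. In the sketch below a binary search over a Python list stands in for a seek on an ordered index, and the two sample lists are hypothetical.

```python
from bisect import bisect_left

def zigzag_join(a, b):
    """Intersect two sorted lists, leaping past runs that cannot match."""
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            # Seek in a to the first key >= b[j]; skipped keys cannot match.
            i = bisect_left(a, b[j], i + 1)
        else:
            # Seek in b to the first key >= a[i].
            j = bisect_left(b, a[i], j + 1)
    return result

left = [1, 2, 3, 4, 7, 9, 18]      # hypothetical sorted keys
right = [1, 7, 9, 11, 12, 17, 19]  # hypothetical sorted keys
print(zigzag_join(left, right))    # [1, 7, 9]
```

The seek is only cheap when the storage scheme exposes the postings in key order, which is why zig-zag joins appear as a comparison criterion for the storage schemes.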
35
Experimental results (using mixed list)
36
Number of pages (million)   Input size (GB)   Index size (GB)   Index size (%age)
0.1                         0.81              0.05              6.17
0.5                         4.03              0.27              6.70
2.0                         16.11             1.13              7.01
5.0                         40.28             2.78              6.90
Table 5: Mixed-list scheme index sizes
Only one posting was generated for all the occurrences of a word in a page.
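The percentage column can be rechecked as index size divided by input size. The quick calculation below uses the rounded values from the table, so small differences in the last digit would not be surprising.

```python
# (pages in millions, input size GB, index size GB, reported percentage)
rows = [
    (0.1, 0.81, 0.05, 6.17),
    (0.5, 4.03, 0.27, 6.70),
    (2.0, 16.11, 1.13, 7.01),
    (5.0, 40.28, 2.78, 6.90),
]

# Recompute index size as a percentage of input size for each row.
computed = [100.0 * index_gb / input_gb for _, input_gb, index_gb, _ in rows]

for (pages, _, _, reported), value in zip(rows, computed):
    print(f"{pages} million pages: computed {value:.2f}%, reported {reported:.2f}%")
```

The ratio stays close to 7% as the collection grows from 0.1 to 5 million pages.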
37
(No Transcript)
38

Introduction.
Testbed architecture.
Design of the indexer.
Distributed indexing.

Two problems that must be addressed when building an inverted index on a distributed architecture:
Page distribution: the question of when and how to distribute pages to the indexing nodes.
Collecting global statistics: the question of where, when, and how to compute and distribute global statistics.

Two strategies for page distribution
A priori distribution.
Runtime distribution.

Three advantages of runtime distribution
Space.
Load balancing.
Effective pipelining.

Collecting global statistics
A dedicated server known as the statistician.
Parallel computation.
Minimize the number of conversations among servers.
Avoid extra disk I/O.
Reduce network overhead.
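The statistician idea can be sketched as follows: each indexer summarizes its own pages and sends a single message, and a dedicated server merges the summaries. The class and function names are hypothetical, and per-term document frequency is used only as an example statistic, since the transcript does not say which statistics are collected.

```python
from collections import Counter

class Statistician:
    """Hypothetical dedicated server that merges per-indexer summaries."""

    def __init__(self):
        self.doc_freq = Counter()
        self.pages_seen = 0

    def receive(self, local_doc_freq, page_count):
        """Merge one indexer's summary; one message per indexer."""
        self.doc_freq.update(local_doc_freq)
        self.pages_seen += page_count

    def global_statistics(self):
        return dict(self.doc_freq), self.pages_seen

def local_doc_freq(pages):
    """Per-indexer summary: in how many local pages each term occurs."""
    counts = Counter()
    for text in pages:
        counts.update(set(text.lower().split()))
    return counts

# Two indexers work in parallel on disjoint page sets (sample pages from the example).
indexer_pages = [["Pig Cat Fish Cat", "Fly Dog Pig"], ["Dog Cat Fish Dog"]]

statistician = Statistician()
for pages in indexer_pages:
    statistician.receive(local_doc_freq(pages), len(pages))

doc_freq, total_pages = statistician.global_statistics()
```

Sending one compact summary per indexer keeps the number of conversations low. Computing the summary from postings that are already in memory is one way to avoid extra disk I/O.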
