Building a Distributed Full-Text Index for the Web

Slides: 47
Provided by: ilan46

Transcript and Presenter's Notes


1
Building a Distributed Full-Text Index for the Web
S. Melnik, S. Raghavan, B. Yang, H. Garcia-Molina
2
  • Introduction.
  • Testbed architecture.
  • Design of the indexer.
  • Distributed indexing.

3
  • Introduction.
  • Testbed architecture.
  • Design of the indexer.
  • Distributed indexing.

4
Example pages
Page 1: Pig Cat Fish Cat
Page 2: Fly Dog Pig
Page 3: Dog Cat Fish Dog
Inverted index (one inverted list per term; each entry is a location)
Cat -> (1,2), (1,4), (3,2)
Dog -> (2,2), (3,1), (3,4)
Fish -> (1,3), (3,3)
Pig -> (1,1), (2,3)
5
An inverted index consists of one inverted list for
each term, with the terms kept in sorted order. An
inverted list consists of locations in sorted order.
A location consists of (page identifier, position in
the page). A posting consists of (index term, location).
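The definitions above can be sketched in a few lines of Python. This is only an illustration built on the three example pages from the earlier slide; the function names `extract_postings` and `build_inverted_index` are hypothetical, and whitespace splitting stands in for real tokenization.

```python
from collections import defaultdict

# The three example pages, keyed by page identifier.
PAGES = {
    1: "Pig Cat Fish Cat",
    2: "Fly Dog Pig",
    3: "Dog Cat Fish Dog",
}

def extract_postings(page_id, text):
    """Yield (index term, location) postings; a location is (page id, position)."""
    for position, term in enumerate(text.split(), start=1):
        yield term, (page_id, position)

def build_inverted_index(pages):
    """Return {term: sorted list of locations}, with the terms in sorted order."""
    lists = defaultdict(list)
    for page_id, text in pages.items():
        for term, location in extract_postings(page_id, text):
            lists[term].append(location)
    return {term: sorted(lists[term]) for term in sorted(lists)}

index = build_inverted_index(PAGES)
for term, locations in index.items():
    print(term, "->", locations)
```

The sketch indexes every word, so it also produces a list for Fly, which the example slide does not show.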
6
Building an inverted index over a collection of
web pages involves:
1. Processing each page to extract postings.
2. Building an inverted list for each term.
3. Writing the inverted lists out to disk.
7
Important problems when building a web-scale
inverted index:
1. Scale and growth rate.
2. Rate of change.
8
  • Introduction.
  • Testbed architecture.
  • Design of the indexer.
  • Distributed indexing.

9
(No Transcript)
10
  • Distributors.
  • Indexers.
  • Query servers.

11
  • Distributed inverted index organization
  • 1. Local inverted files.
  • 2. Global inverted files.

12
Global inverted files
Query server 1 (terms a-e)
Cat -> (1,2), (1,4), (3,2)
Dog -> (2,2), (3,1), (3,4)
Query server 2 (terms f-z)
Fish -> (1,3), (3,3)
Pig -> (1,1), (2,3)
Pages
Page 1: Pig Cat Fish Cat
Page 2: Fly Dog Pig
Page 3: Dog Cat Fish Dog
13
Local inverted files
Query server 1 (pages 1 and 2)
Cat -> (1,2), (1,4)
Dog -> (2,2)
Fish -> (1,3)
Fly -> (2,1)
Pig -> (1,1), (2,3)
Query server 2 (page 3)
Cat -> (3,2)
Dog -> (3,1), (3,4)
Fish -> (3,3)
Pages
Page 1: Pig Cat Fish Cat
Page 2: Fly Dog Pig
Page 3: Dog Cat Fish Dog
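The two organizations can be contrasted with a short Python sketch. It is a minimal illustration, not the authors' implementation: the function names are hypothetical, and the assignments (pages 1 and 2 on query server 1, page 3 on query server 2; terms a-e on server 1, f-z on server 2) are assumptions chosen to match the slides.

```python
from collections import defaultdict

PAGES = {1: "Pig Cat Fish Cat", 2: "Fly Dog Pig", 3: "Dog Cat Fish Dog"}

def postings(pages):
    """Yield (term, (page id, position)) for every word occurrence."""
    for page_id, text in sorted(pages.items()):
        for position, term in enumerate(text.split(), start=1):
            yield term, (page_id, position)

def local_partition(pages, page_to_server):
    """Local inverted files: each server indexes only its own pages."""
    servers = defaultdict(lambda: defaultdict(list))
    for term, (page_id, position) in postings(pages):
        servers[page_to_server[page_id]][term].append((page_id, position))
    return servers

def global_partition(pages, term_to_server):
    """Global inverted files: each server holds the full lists for its terms."""
    servers = defaultdict(lambda: defaultdict(list))
    for term, location in postings(pages):
        servers[term_to_server(term)][term].append(location)
    return servers

# Assumed assignments (see the lead-in); both mappings are illustrative.
local = local_partition(PAGES, {1: 1, 2: 1, 3: 2})
glob = global_partition(PAGES, lambda t: 1 if t[0].lower() <= "e" else 2)

print(dict(local[2]))  # lists restricted to page 3
print(dict(glob[1]))   # complete lists for terms starting with a-e
```

With local files, a query for Cat must consult both servers; with global files, it goes only to the server that owns the a-e range.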
14
Local vs. Global
  • Resilience to failures.
  • Network load.

15
Testbed environment
The indexers and the query servers are
single-processor PCs with 350-500 MHz processors
and 300-500 MB of main memory, each equipped with
multiple disks. All the machines are
interconnected by a 100 Mbps Ethernet LAN.
16
The WebBase collection
To study some properties of web pages that are
relevant to text indexing, we analyzed 5 samples
of 100,000 pages each, from different portions of
the WebBase repository.
17
Property                                              Value
Average number of words per page                      438
Average number of distinct words per page             171
Average size of each page (as HTML)                   8650
Average size of each page after removing HTML tags    2815
Average size of a word in the vocabulary              8
Table 1: Properties of the WebBase collection
18
(No Transcript)
19
  • Introduction.
  • Testbed architecture.
  • Design of the indexer.
  • Distributed indexing.

20
(No Transcript)
21
  • Design of the Indexer
  • Software pipeline.
  • The storage of the inverted files generated by
    the process.

22
  • Software pipeline
  • The process can logically be split into 3 phases
  • Processing -> CPU intensive.
  • Flushing -> disk.
  • Loading -> network.

23
(No Transcript)
24
The goal of our pipelining technique is to design
an execution schedule for the different indexing
phases that will result in minimal overall
running time.
(Figure: Examples of execution of the pipeline, with phases labeled L, P, and F)
(No Transcript)
26
(Figure: Pipeline time; axis label: t)
27
Theoretical analysis vs. experimental results
28
(No Transcript)
29
(No Transcript)
30
  • Design of the Indexer
  • Software pipeline.
  • The storage of the inverted files generated by
    the process.

31
Storage schemes
We considered three storage schemes for storing
inverted files as sets of (key, value) pairs in a
B-tree:
1. Full list.
2. Single payload.
3. Mixed list.
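The slides name the three schemes without defining them, so the Python sketch below rests on assumed definitions: full list stores one pair per term with the whole inverted list as the value; single payload stores one pair per posting; mixed list keys each pair by a posting and packs a run of the following postings into the value. Sorted Python lists stand in for the B-tree, and `chunk_size` is a hypothetical parameter.

```python
# Postings for the three example pages, in sorted (term, location) order.
POSTINGS = sorted([
    ("Cat", (1, 2)), ("Cat", (1, 4)), ("Cat", (3, 2)),
    ("Dog", (2, 2)), ("Dog", (3, 1)), ("Dog", (3, 4)),
    ("Fish", (1, 3)), ("Fish", (3, 3)),
    ("Fly", (2, 1)),
    ("Pig", (1, 1)), ("Pig", (2, 3)),
])

def full_list(postings):
    """One pair per term: key = term, value = the whole inverted list."""
    pairs = {}
    for term, location in postings:
        pairs.setdefault(term, []).append(location)
    return sorted(pairs.items())

def single_payload(postings):
    """One pair per posting: key = (term, location), value left empty."""
    return [((term, location), None) for term, location in postings]

def mixed_list(postings, chunk_size):
    """One pair per chunk: key = first posting, value = the postings after it."""
    pairs = []
    for start in range(0, len(postings), chunk_size):
        chunk = postings[start:start + chunk_size]
        pairs.append((chunk[0], chunk[1:]))
    return pairs

print(len(full_list(POSTINGS)))       # 5 pairs, one per term
print(len(single_payload(POSTINGS)))  # 11 pairs, one per posting
print(len(mixed_list(POSTINGS, 4)))   # 3 pairs, up to 4 postings each
```

Under these assumed definitions, the number of keys differs by scheme, which is one way to see why index size and update cost would differ between them.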
32
(No Transcript)
33
  • A qualitative comparison of these storage
    schemes
  • Index size
  • Zig-zag joins
  • Hot updates

34
Zig-zag join using ordered indexes
(Figure: two ordered lists, 1 2 3 4 7 9 18 and 1 7 9 11 12 17 19)
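A zig-zag join intersects two ordered lists by alternately seeking in each list to the other list's current key, so runs of non-matching keys are skipped instead of scanned. The Python sketch below is a minimal version under the assumption that both inputs are sorted lists of distinct keys; binary search via `bisect_left` stands in for an index seek, and the two sample lists are illustrative values.

```python
from bisect import bisect_left

def zigzag_join(a, b):
    """Intersect two sorted lists, seeking past runs of non-matching keys."""
    i, j, out = 0, 0, []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            # Seek in a to the first key >= b[j].
            i = bisect_left(a, b[j], i + 1)
        else:
            # Seek in b to the first key >= a[i].
            j = bisect_left(b, a[i], j + 1)
    return out

a = [1, 2, 3, 4, 7, 9, 18]
b = [1, 7, 9, 11, 12, 17, 19]
print(zigzag_join(a, b))  # [1, 7, 9]
```

After matching 1, the join seeks directly from 2 to 7 in the first list rather than visiting 3 and 4, which is the saving an ordered index makes possible.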
35
Experimental results (using mixed list)
36
Number of pages (million)   Input size (GB)   Index size (GB)   Index size (%age)
0.1                         0.81              0.05              6.17
0.5                         4.03              0.27              6.70
2.0                         16.11             1.13              7.01
5.0                         40.28             2.78              6.90
Table 5: Mixed-list scheme index sizes
Only one posting was generated for all the
occurrences of a word in a page.
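The figures in Table 5 can be cross-checked with a few lines of Python, assuming the percentage column is the index size expressed as a percentage of the input size. The rows are transcribed from the table; the function name is hypothetical.

```python
# Rows from Table 5: (pages in millions, input GB, index GB, reported percentage).
ROWS = [
    (0.1, 0.81, 0.05, 6.17),
    (0.5, 4.03, 0.27, 6.70),
    (2.0, 16.11, 1.13, 7.01),
    (5.0, 40.28, 2.78, 6.90),
]

def index_percentage(input_gb, index_gb):
    """Index size as a percentage of input size."""
    return 100.0 * index_gb / input_gb

for pages, input_gb, index_gb, reported in ROWS:
    computed = index_percentage(input_gb, index_gb)
    print(f"{pages} M pages: computed {computed:.2f}%, reported {reported:.2f}%")
```

The computed and reported percentages agree to two decimal places for every row, which supports reading that column as index size relative to input size.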
37
(No Transcript)
38
  • Introduction.
  • Testbed architecture.
  • Design of the indexer.
  • Distributed indexing.

39
  • Two problems that must be addressed when building
    an inverted index on a distributed architecture
  • Page distribution: the question of when and how
    to distribute pages to the indexing nodes.
  • Collecting global statistics: the question of
    where, when, and how to compute and distribute
    global statistics.

40
  • Two strategies for page distribution
  • A priori distribution.
  • Runtime distribution.

41
  • Three advantages of runtime distribution
  • Space.
  • Load balancing.
  • Effective pipelining.

42
  • Collecting global statistics
  • A dedicated server known as the statistician.
  • Parallel computation.
  • Minimize the number of conversations among
    servers.
  • Avoid extra disk I/O.
  • Reduce network overhead.

43
  • Two strategies for sending information to the
    statistician
  • ME strategy: sending local information during
    merging.
  • FL strategy: sending local information during
    flushing.
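
The role of the statistician can be sketched in Python as a server that sums per-term counts sent by the indexers. This is a hypothetical illustration only: the class and function names are invented, the statistic shown (number of pages containing each term) is an assumed example of a global statistic, and the split of pages across two indexers is made up.

```python
from collections import Counter

class Statistician:
    """Hypothetical stand-in for the dedicated statistics server."""

    def __init__(self):
        self.document_frequency = Counter()
        self.messages = 0

    def receive(self, local_counts):
        """Merge one batch of local per-term page counts into the global totals."""
        self.document_frequency.update(local_counts)
        self.messages += 1

def local_document_frequency(pages):
    """Count, for each term, how many of the given pages contain it."""
    counts = Counter()
    for text in pages.values():
        counts.update(set(text.split()))
    return counts

# Assumed split of the example pages over two indexers.
indexer_pages = [
    {1: "Pig Cat Fish Cat", 2: "Fly Dog Pig"},
    {3: "Dog Cat Fish Dog"},
]

statistician = Statistician()
for pages in indexer_pages:
    # Under FL this send would happen at flush time; under ME, during merging.
    statistician.receive(local_document_frequency(pages))

print(statistician.document_frequency["Cat"])  # 2
print(statistician.messages)                   # 2
```

The two strategies differ in when `receive` would be called, not in what is summed: the sketch sends one message per indexer, whereas sending at every flush would mean more, smaller messages.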

44
(No Transcript)
45
(No Transcript)
46
comparison