WORMS: A High-Performance Text Retrieval Prototype

Slides: 24

Provided by: NiranAngka1

Transcript and Presenter's Notes

1
WORMS: A High-Performance Text Retrieval Prototype

Rungsawang A., Uthayopas P., Lertprasertkune M.,
Ingongngam P., Laohakanniyom A.
{fenganr, pu, b4105102, g4365021, b4105118}@ku.ac.th
Massive Information Knowledge Engineering
Department of Computer Engineering
Faculty of Engineering
Kasetsart University, Bangkok, Thailand.

2
Outline

Introduction
Vector Space Model
Inverted File Structure
Text Retrieval Prototypes
Experiments
Conclusion

3
Introduction

Nature of web documents:
heterogeneous data types, platforms, and network environments,
dynamically changing data content,
exponentially growing volume.
Current web searching technology:
hard to modify or adapt,
runs on expensive HPC systems.

4
Introduction (2)

This work proposes
a cost-effective parallel web document retrieval system using
a PC cluster,
the Linux operating system,
the PVM library,
the vector space model and an inverted file structure.

5
WORMS Design

Sequential WORMS

6
Vector Space Model

Translate each document into a vector of N dimensions,
where N is the number of selected index terms.

[Figure: document vectors on axes X, Y, Z; list of <di, tj, wk> tuples sorted by document]

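
The translation of documents into N-dimensional vectors can be sketched as follows. This is a minimal illustration, not the WORMS implementation: the function names are hypothetical, whitespace tokenization is a simplification, and raw term frequency is assumed as the weight because the slides do not state the weighting scheme.

```python
from collections import Counter


def build_vocabulary(documents):
    """Map each distinct term to a dimension index in 0..N-1."""
    terms = sorted({term for doc in documents for term in doc.split()})
    return {term: idx for idx, term in enumerate(terms)}


def to_vector(document, vocabulary):
    """Return an N-dimensional vector for one document.

    Assumption: the weight is the raw term frequency.
    """
    counts = Counter(document.split())
    vector = [0] * len(vocabulary)
    for term, count in counts.items():
        if term in vocabulary:
            vector[vocabulary[term]] = count
    return vector


def to_triples(documents, vocabulary):
    """Flatten the collection into (doc_id, term_id, weight) triples.

    Triples come out sorted by document because documents are visited in order.
    """
    triples = []
    for doc_id, document in enumerate(documents):
        for term_id, weight in enumerate(to_vector(document, vocabulary)):
            if weight:
                triples.append((doc_id, term_id, weight))
    return triples


docs = ["web search web", "parallel search"]
vocab = build_vocabulary(docs)
vectors = [to_vector(d, vocab) for d in docs]
triples = to_triples(docs, vocab)
```

Storing only the non-zero triples keeps the representation sparse, which matters when N is large and most documents contain few of the index terms.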
7
Vector Space Model (2)

Rank documents using the dot product or cosine
similarity.
Sort and return only the top-ranked documents as
the retrieval result.

$d_i = (w_{i1}, w_{i2}, \ldots, w_{iN})$, $q_j = (w_{j1}, w_{j2}, \ldots, w_{jN})$

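
Ranking by dot product or cosine similarity and returning only the top-ranked documents can be sketched as below. The function names and the `top_k` parameter are illustrative assumptions; vectors are plain Python lists of equal length N.

```python
import math


def dot(u, v):
    """Dot product of two equal-length weight vectors."""
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    norm = math.sqrt(dot(u, u)) * math.sqrt(dot(v, v))
    return dot(u, v) / norm if norm else 0.0


def rank(query, documents, top_k=2, similarity=cosine):
    """Score every document against the query and keep the top_k.

    Ties are broken by document id so the result is deterministic.
    """
    scored = [(similarity(query, d), doc_id) for doc_id, d in enumerate(documents)]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(doc_id, score) for score, doc_id in scored[:top_k]]


documents = [[1, 0, 0], [0, 1, 0], [1, 1, 0]]
query = [1, 0, 0]
results = rank(query, documents, top_k=2)
```

Passing `similarity=dot` switches to the unnormalized dot product. Note that this version scores every document in the collection.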
8
Speeding up using Inverted File Concept

For a large document collection, computing the dot product
against every document is time consuming.
Use an inverted file structure to reduce retrieval
time.

9
Inverted File

Consists of a list of
terms,
links to the documents in which each term appears.
Steps
Extract the terms from the query.
Build the list of documents in which those terms appear
from the inverted file.
Compute the vector dot product between the query and
only the document vectors in that list.
Processing is thus reduced to the related
documents only.
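
The retrieval steps above can be sketched with a dictionary-based inverted file. This is an assumed in-memory layout for illustration only; the slides do not describe the on-disk format used by WORMS, and the function names are hypothetical.

```python
from collections import defaultdict


def build_inverted_file(doc_vectors):
    """Turn {doc_id: {term: weight}} into {term: [(doc_id, weight), ...]}."""
    inverted = defaultdict(list)
    for doc_id in sorted(doc_vectors):
        for term, weight in doc_vectors[doc_id].items():
            inverted[term].append((doc_id, weight))
    return dict(inverted)


def retrieve(query_weights, inverted):
    """Accumulate dot-product scores only for documents sharing a query term."""
    scores = defaultdict(float)
    for term, q_weight in query_weights.items():
        for doc_id, d_weight in inverted.get(term, []):
            scores[doc_id] += q_weight * d_weight
    # Highest score first; ties broken by document id.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


doc_vectors = {
    0: {"web": 2, "search": 1},
    1: {"parallel": 1, "search": 1},
    2: {"cluster": 3},
}
inverted = build_inverted_file(doc_vectors)
hits = retrieve({"web": 1, "search": 1}, inverted)
```

Document 2 shares no term with the query, so it is never visited or scored.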

10
WORMS Design

Parallel WORMS

[Figure: a Master connected to worker nodes, each labeled Input, LDF, GDF, INVF]
11
Processing Steps

Indexing
The master process spawns a worker task on each node.
Each worker reads its local input documents.
Pass each document through the stop-word filter and stemmer to clean it.
Build the Local Document Frequency (LDF) database.
Send the LDF to the master and wait for the other workers to
send their LDFs.
Run the Merger to merge the LDFs into the GDF (Global
Document Frequency database).
Run the Inverter to generate the inverted file.
At the end of this stage, every worker node has
the same GDF.
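
The LDF and GDF steps above can be sketched in a single process. This is only an illustration under stated assumptions: each list in `partitions` stands in for one worker's local input, the stop-word set is hypothetical, stemming is omitted, and the PVM message passing between master and workers is replaced by ordinary function calls.

```python
from collections import Counter

# Hypothetical stop-word list; the one used by WORMS is not given in the slides.
STOP_WORDS = {"a", "the", "of"}


def clean(document):
    """Lowercase, split on whitespace, and drop stop words (no stemming here)."""
    return [t for t in document.lower().split() if t not in STOP_WORDS]


def build_ldf(local_documents):
    """Local document frequency: number of local documents containing each term."""
    ldf = Counter()
    for document in local_documents:
        ldf.update(set(clean(document)))  # count each term once per document
    return ldf


def merge_ldfs(ldfs):
    """Merger step: sum the per-worker LDFs into one global document frequency."""
    gdf = Counter()
    for ldf in ldfs:
        gdf.update(ldf)
    return gdf


partitions = [
    ["The web of data", "web search"],  # documents local to worker 0
    ["parallel search"],                # documents local to worker 1
]
ldfs = [build_ldf(p) for p in partitions]
gdf = merge_ldfs(ldfs)
```

Because document frequencies are additive across disjoint partitions, the merged result does not depend on the order in which the LDFs arrive.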

12
Improvement

Incremental indexing
Run global indexing the first time only.
Afterwards, index only the newly added documents and add the new
index entries to the GDF.
Parallel incremental indexing uses a Master GDF
Server.

[Figure: Master GDF Server and worker term lists over the terms x, y, z, a]
13
Processing Steps

New documents are distributed among the worker
nodes.
Each worker indexes its new documents.
If a new term is found,
the worker sends it to the Master GDF Server.
The master replies with the new term index and broadcasts
it to all workers.
Each worker adds it to its new LDF.
At the end, the Merger merges the new (and existing) LDF into
the local GDF.
Every node has the same GDF at the end.
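
The incremental indexing steps above can be simulated sequentially as follows. The class and method names are hypothetical, the broadcast is modeled as a loop over worker objects rather than PVM messages, and the workers run one after another instead of concurrently, so this is a sketch of the protocol and not of its parallel behavior.

```python
class MasterGDFServer:
    """Assigns a unique index to every term it has not seen before."""

    def __init__(self, known_terms=()):
        self.term_index = {t: i for i, t in enumerate(known_terms)}

    def register(self, term):
        if term not in self.term_index:
            self.term_index[term] = len(self.term_index)
        return self.term_index[term]


class Worker:
    """Holds a copy of the term index and the LDF of its new documents."""

    def __init__(self, term_index):
        self.term_index = dict(term_index)
        self.new_ldf = {}

    def receive(self, term, index):
        self.term_index[term] = index


def index_incrementally(master, workers, assignments):
    """Index new documents; unseen terms go through the master."""
    for worker, documents in zip(workers, assignments):
        for document in documents:
            # sorted() only makes index assignment deterministic in this sketch.
            for term in sorted(set(document.split())):
                if term not in worker.term_index:
                    index = master.register(term)
                    for peer in workers:  # stands in for the broadcast
                        peer.receive(term, index)
                worker.new_ldf[term] = worker.new_ldf.get(term, 0) + 1


master = MasterGDFServer(known_terms=["x", "y"])
workers = [Worker(master.term_index), Worker(master.term_index)]
index_incrementally(master, workers, [["x z"], ["y a z"]])
```

After the run, every worker holds the same term index as the master, mirroring the requirement that each node ends with the same GDF.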

14
Experiments

A cluster of 24 x86 PC machines:
Athlon 700 MHz CPU,
256 megabytes of SDRAM,
10-gigabyte IDE hard disk,
a simple Fast Ethernet network.
Software specification:
Linux operating system 2.2.4,
PVM library.

15
Experiments (2)

Data preparation
Using the TREC-9 small web track collection,
comprising 1.6 million web pages (10 gigabytes).

16
Experiments (3)

Experimental method

[Figure: indexing process (Indexing Master) and retrieval process (Retrieval Master)]
17
Experiments (4)

Average total time needed by the WORMS indexing
process

For comparison, the well-known IR system SMART takes around
1 hour for 2 GB and 3 hours for 6 GB.
18
Experiments (5)

WORMS speedup curves derived from the 6 GB
collection

19
Experiments (6)

WORMS average query response time

20
Conclusion

Contribution
Proposed a parallel text retrieval prototype
called WORMS.
WORMS was implemented using a low-cost Linux
PC cluster and the PVM library.
Two techniques are employed to increase speed:
an inverted file to reduce the required calculations,
reuse of the GDF data through incremental indexing.

21
Conclusion (2)

Results
WORMS can index up to 450,000 web pages per hour
per machine.
WORMS provides a 5.4-second query response time when
searching 1.5 million web pages using 2
machines.

22
Future work

Test the system with the 100-gigabyte TREC
collection.
Design distributed and peer-to-peer versions of WORMS.

23
Questions and Answers

For more information: http://mike.cpe.ku.ac.th
