WORMS: A High-Performance Text Retrieval Prototype

1
WORMS: A High-Performance Text Retrieval Prototype
  • Rungsawang A., Uthayopas P., Lertprasertkune M.,
  • Ingongngam P., Laohakanniyom A.
  • {fenganr, pu, b4105102, g4365021, b4105118}@ku.ac.th
  • Massive Information Knowledge Engineering
  • Department of Computer Engineering
  • Faculty of Engineering
  • Kasetsart University, Bangkok, Thailand.

2
Outline
  • Introduction
  • Vector Space Model
  • Inverted File Structure
  • Text Retrieval Prototypes
  • Experiments
  • Conclusion

3
Introduction
  • Nature of web documents
  • heterogeneous data types, platforms, and network
    environments,
  • dynamically changing data content,
  • exponential growth in volume.
  • Current web searching technology
  • is hard to modify or adapt,
  • runs on expensive HPC systems.

4
Introduction (2)
  • This work proposes
  • A cost-effective parallel web document retrieval
    system using
  • PC-Cluster,
  • Linux Operating System,
  • PVM library,
  • Vector space model and inverted file structure.

5
WORM Design
  • Sequential WORM

6
Vector Space Model
  • Translate a document into a vector of N dimensions,
    where N is the number of selected index terms.

[Figure: axes X, Y, Z; list of <di, tj, wk> sorted by document]
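
A minimal sketch of the document-to-vector step described on this slide. The slides do not say which term weighting WORMS uses, so raw term frequency is an assumption here, and `to_vector` and the three index terms are hypothetical names chosen for illustration.

```python
from collections import Counter

def to_vector(text, index_terms):
    """Map a document to an N-dimensional vector, one slot per index term.

    Weights are raw term counts (an assumption; the slides do not
    specify the weighting scheme).
    """
    counts = Counter(text.lower().split())
    return [counts[term] for term in index_terms]

# Hypothetical selected index terms, so N = 3.
index_terms = ["cluster", "index", "query"]
doc = "The cluster builds an index and the index answers a query"
vector = to_vector(doc, index_terms)
print(vector)
```

Words that are not selected index terms ("the", "builds", ...) simply do not contribute a dimension.
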
7
Vector Space Model (2)
  • Rank documents using dot product or cosine
    similarity operations,
  • Sort and return only the top-ranked documents as
    the retrieval result.

di = (wi1, wi2, ..., wiN), qj = (wj1, wj2, ..., wjN)
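
The ranking step can be sketched as follows, assuming documents and queries are already N-dimensional weight vectors as defined above. The function names, the sample vectors, and the `top_k` parameter are illustrative assumptions, not part of WORMS.

```python
import math

def dot(d, q):
    """Dot product of two equal-length weight vectors."""
    return sum(a * b for a, b in zip(d, q))

def cosine(d, q):
    """Cosine similarity; defined as 0.0 if either vector is all zeros."""
    norm = math.sqrt(dot(d, d)) * math.sqrt(dot(q, q))
    return dot(d, q) / norm if norm else 0.0

def rank(docs, query, top_k=2):
    """Score every document, sort by descending score, keep the top_k hits."""
    scored = [(cosine(vec, query), doc_id) for doc_id, vec in docs.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for score, doc_id in scored[:top_k] if score > 0]

# Hypothetical 3-term vectors.
docs = {"d1": [1, 2, 0], "d2": [0, 1, 1], "d3": [3, 0, 0]}
query = [0, 1, 1]
top = rank(docs, query)
print(top)
```

Note that this version scores every document in the collection, which is the cost the inverted file is meant to avoid.
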
8
Speeding up using Inverted File Concept
  • For a large document collection, computing the dot product
    with every document is a time-consuming process.
  • Use an inverted file structure to reduce retrieval
    time.

9
Inverted File
  • Consists of a list of
  • terms
  • links to the documents in which each term appears
  • Steps
  • Extract terms from the query
  • Construct the list of documents in which those terms appear
    from the inverted file
  • Calculate the vector dot product of the query and
    document vectors for the documents in that list only
  • So, processing is reduced to only the related
    documents
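
The steps above can be sketched as follows. This is a single-process illustration under assumed data structures (a dict of postings lists); the actual WORMS on-disk inverted file format is not described in the slides, and all names and weights here are hypothetical.

```python
from collections import defaultdict

def build_inverted_file(docs):
    """docs: {doc_id: {term: weight}} -> {term: [(doc_id, weight), ...]}"""
    inverted = defaultdict(list)
    for doc_id, weights in docs.items():
        for term, weight in weights.items():
            inverted[term].append((doc_id, weight))
    return inverted

def retrieve(inverted, query):
    """query: {term: weight}. Accumulate dot products only for documents
    that share at least one term with the query."""
    scores = defaultdict(float)
    for term, q_weight in query.items():
        for doc_id, d_weight in inverted.get(term, []):
            scores[doc_id] += q_weight * d_weight
    return sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))

docs = {
    "d1": {"cluster": 1.0, "index": 2.0},
    "d2": {"index": 1.0, "query": 1.0},
    "d3": {"network": 3.0},
}
inverted = build_inverted_file(docs)
results = retrieve(inverted, {"index": 1.0, "query": 2.0})
print(results)
```

Document d3 shares no term with the query, so it is never touched during scoring.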

10
WORMS Design
  • Parallel WORMS

[Figure: Master and worker nodes, each worker labeled Input, LDF, GDF, INVF]
11
Processing Steps
  • Indexing
  • The master process spawns a worker task on each node
  • Each worker reads its local input documents
  • Pass them through the stop-word filter and stemmer to clean the documents
  • Build the Local Document Frequency (LDF) database
  • Send the LDF to the master and wait for the other workers to
    send their LDFs
  • Run Merger to merge the LDFs into the GDF (Global
    Document Frequency database)
  • Run Inverter to generate the inverted file
  • At the end of this stage, each worker node has
    the same GDF
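
A single-process sketch of the LDF-to-GDF merge described above. It assumes an LDF maps each term to the number of local documents containing it, which the slides do not spell out. Message passing via PVM is not shown: each list in `worker_inputs` simply stands in for one worker's local documents, and the function names are hypothetical.

```python
from collections import Counter

def local_document_frequency(docs):
    """Count, per term, how many of this worker's documents contain it."""
    ldf = Counter()
    for text in docs:
        ldf.update(set(text.lower().split()))  # each document counts once per term
    return ldf

def merge(ldfs):
    """Stand-in for the Merger step: sum per-worker LDFs into one GDF."""
    gdf = Counter()
    for ldf in ldfs:
        gdf.update(ldf)
    return gdf

# Two hypothetical workers with their local documents.
worker_inputs = [
    ["parallel index", "index merge"],
    ["parallel query"],
]
ldfs = [local_document_frequency(docs) for docs in worker_inputs]
gdf = merge(ldfs)
print(dict(gdf))
```

Because every worker receives the same merged result, each node ends the stage with an identical GDF.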

12
Improvement
  • Incremental Indexing
  • Global indexing the first time
  • Afterwards, index only the incremental documents and add the new
    index entries to the GDF
  • Parallel Incremental Indexing using a Master GDF
    Server

[Figure: Master GDF Server and worker nodes holding term lists drawn from x, y, z, a]
13
Processing Steps
  • New documents are distributed among the worker
    nodes
  • Each worker indexes its new documents
  • If a new term is found
  • Send it to the Master GDF Server
  • The master replies with the new term index and broadcasts
    it to all workers
  • Each worker adds this to its new LDF
  • At the end, Merger merges the new (and same) LDF into the
    local GDF
  • Each node has the same GDF at the end
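
The new-term handshake above can be sketched in a single process as follows. `MasterGDFServer`, `register`, and `index_new_document` are hypothetical names; real WORMS exchanges these messages over PVM, which is not modeled here. Each worker's term table is represented as a plain dict.

```python
class MasterGDFServer:
    """Hypothetical stand-in for the Master GDF Server: assigns term indices."""

    def __init__(self, known_terms):
        self.term_index = {term: i for i, term in enumerate(known_terms)}

    def register(self, term):
        # Assign the next free index the first time a term is seen.
        if term not in self.term_index:
            self.term_index[term] = len(self.term_index)
        return self.term_index[term]

def index_new_document(text, local_table, master, all_tables):
    """Index one new document on one worker."""
    for term in text.lower().split():
        if term not in local_table:          # new term found
            idx = master.register(term)      # send to master, receive its index
            for table in all_tables:         # master broadcasts to all workers
                table[term] = idx

master = MasterGDFServer(["x", "y", "z"])
workers = [dict(master.term_index) for _ in range(3)]

# Worker 0 receives a new document containing the unseen term "a".
index_new_document("x y a", workers[0], master, workers)
print(workers)
```

Routing every new term through one server keeps index assignment consistent, so all workers agree on the term numbering before the final merge.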

14
Experiments
  • A cluster of 24 x86 PC machines
  • Athlon 700 MHz CPU,
  • 256 megabytes of SDRAM,
  • 10-gigabyte IDE hard disk,
  • a simple Fast Ethernet network.
  • Software specification
  • Linux Operating System 2.2.4,
  • PVM Library.

15
Experiments (2)
  • Data preparation
  • using the small web track collection of TREC-9,
  • comprising 1.6 million web pages (10 gigabytes).

16
Experiments (3)
  • Experimental method
  • Indexing Process
  • Retrieval Process

[Figure: Indexing Master and Retrieval Master]
17
Experiments (4)
  • Average total time needed by the WORMS indexing
    process

For comparison, the well-known IR system SMART takes around
1 hour for 2 GB, or 3 hours for 6 GB.
18
Experiments (5)
  • WORMS speedup curves derived from the 6G
    collection
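
For readers without the plotted curves, speedup is conventionally defined as the one-machine time divided by the p-machine time, and efficiency as speedup divided by p. The sketch below uses made-up timings purely to show the arithmetic; they are NOT measurements from WORMS.

```python
def speedup(t_one, t_p):
    """Conventional speedup: time on one machine / time on p machines."""
    return t_one / t_p

def efficiency(t_one, t_p, p):
    """Fraction of ideal linear speedup achieved on p machines."""
    return speedup(t_one, t_p) / p

# Made-up indexing times in minutes, keyed by machine count (illustration only).
made_up_minutes = {1: 120.0, 2: 64.0, 4: 40.0}
curve = {p: speedup(made_up_minutes[1], t) for p, t in made_up_minutes.items()}
print(curve)
```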

19
Experiments (6)
  • WORMS average query response time

20
Conclusion
  • Contribution
  • Proposed a parallel text retrieval prototype
    called WORMS.
  • WORMS was implemented using a low-cost Linux
    PC cluster and the PVM library.
  • It employs two techniques to increase speed:
  • An inverted file to reduce the required calculations
  • Reuse of the GDF data through incremental indexing

21
Conclusion (2)
  • Results
  • WORMS can index up to 450,000 web pages per hour
    per machine.
  • WORMS provides a 5.4-second query response time when
    searching 1.5 million web pages using 2
    machines.

22
Future Work
  • Testing the system with the 100-gigabyte TREC
    collection.
  • Designing distributed and peer-to-peer versions of WORMS.

23
Questions and Answers
  • For more information: http://mike.cpe.ku.ac.th