A Scalable Semantic Indexing Framework for Peer-to-Peer Information Retrieval

1
A Scalable Semantic Indexing Framework for
Peer-to-Peer Information Retrieval
Zhichen Xu
Yan Chen
Chengxiang Zhai
Northwestern University

University of Illinois
at Urbana-Champaign

Yahoo! Inc.
ACM SIGIR HDIR 2005
2
Motivation

Rapid information growth requires scalable and
robust retrieval architecture
Problems with centralized retrieval architecture
Hard to maintain freshness of information
Single-point-of-failure
Peer-to-Peer IR may be a possible solution
No need for centralized indexing
Easy to maintain freshness of information
Resistant to single-point-of-failure
Challenge: P2P IR architecture?

3
Term Index vs. Document Index

Term index
Fast query execution
Insufficient for supporting sophisticated
algorithms such as feedback
Hard to update (e.g., adding a doc) in a
distributed environment
Document index
Easy to update
Support advanced retrieval algorithms
Slow query matching

4
What is the Right Indexing Architecture for P2P
IR?
5
Previous Work: pSearch [Tang et al. 03]

Based on document indexing
Address the problem of slow query execution by
Dimension reduction (using LSI)
Exploiting distributed hash tables (DHT)
Problems
Lack of semantic locality (semantically similar
documents may be stored in quite different nodes)
Slow index generation
Hard to add a new concept

6
Proposed Solution: P2PIR, a Scalable Semantic
Indexing Framework for IR

Semantic locality
Achieved through a novel two-phase distributed
semantic indexing
Documents with similar semantics will have
indices stored on nearby nodes
Flexible tradeoff between search accuracy and
efficiency
Support of sophisticated retrieval methods
E.g., feedback and personalized search
Adaptation to document dynamics
Incrementally incorporate new documents/concepts

7
Background on a Sample DHT: Content-Addressable
Network

Two key operations
put(key, object)
object = get(key)
Partition Cartesian space into zones
Each zone is assigned to a computer
Neighboring zones are routing neighbors
Object lookup is done through routing
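
The put/get interface and zone partitioning above can be sketched as follows. This is a toy stand-in, not a real CAN: the class name `ToyCAN`, the fixed grid of zones, and the SHA-256 key-to-point mapping are all assumptions made for illustration, and lookup goes straight to the owning zone instead of routing hop by hop through neighboring zones.

```python
import hashlib


class ToyCAN:
    """Toy model of a CAN: the unit square is cut into a fixed grid of zones."""

    def __init__(self, zones_per_side=4):
        self.side = zones_per_side
        # One dict per zone plays the role of the computer that owns that zone.
        self.zones = {
            (i, j): {} for i in range(self.side) for j in range(self.side)
        }

    def _point(self, key):
        # Hash the key to a point in [0, 1) x [0, 1) (hypothetical mapping).
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        x = int.from_bytes(digest[:4], "big") / 2**32
        y = int.from_bytes(digest[4:8], "big") / 2**32
        return x, y

    def _zone(self, key):
        # The zone that contains the key's point owns the key.
        x, y = self._point(key)
        return int(x * self.side), int(y * self.side)

    def put(self, key, obj):
        self.zones[self._zone(key)][key] = obj

    def get(self, key):
        return self.zones[self._zone(key)].get(key)


can = ToyCAN()
can.put("doc-1", {"title": "example"})
print(can.get("doc-1"))
```

A real CAN splits and merges zones as nodes join and leave, which this sketch leaves out.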

8
Routing and Location Properties on DHT

O(log N) hops needed to route a key, where N is the
number of nodes in the overlay
O(log N) maintenance overhead for routing
Guaranteed success
Fault-tolerant and robust
DoS attack resilient
Becoming increasingly practical for serious use

9
P2PIR Architecture
[Figure: P2PIR architecture. Diagram labels: Applications; text doc, term vector; XML doc, structure-aware feature vector; feature extraction; concept vector construction; index locator construction; index placement on DHT; query; index locator generation; search on DHT; relevance ranking; query refinement; results to user; P2P DHT system; Internet.]

Two-stage document indexing
Concept vector generation
Index locator generation and placement
Open to plugging in feature extraction,
relevance ranking, and query refinement

10
Assumptions about the Retrieval Models

Documents and queries are both represented as
vectors
Naturally occurring in the vector-space model
Probabilistic models can be computed as vector
matching as well
Euclidean distances are reasonably accurate in
capturing document topic similarity
Euclidean distances are only used to prune
non-promising documents
Final relevance ranking can be based on more
accurate retrieval functions
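
The prune-then-rank assumption above can be illustrated with a small sketch. The function name `retrieve`, the `prune_radius` parameter, and the choice of cosine similarity as the "more accurate" final ranking function are assumptions for illustration only; the slides do not fix any of them.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieve(query, docs, prune_radius, top_k):
    """docs maps doc_id -> vector. Prune by Euclidean distance, then rank."""
    # Step 1: cheap pruning of non-promising documents.
    candidates = {
        doc_id: vec
        for doc_id, vec in docs.items()
        if math.dist(query, vec) <= prune_radius
    }
    # Step 2: final ranking with a different function (cosine, as a stand-in).
    ranked = sorted(
        candidates, key=lambda doc_id: cosine(query, candidates[doc_id]), reverse=True
    )
    return ranked[:top_k]


docs = {"a": [0.9, 0.1], "b": [0.5, 0.5], "c": [-1.0, 0.0]}
print(retrieve([1.0, 0.0], docs, prune_radius=1.0, top_k=2))
```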

11
Concept Vector Construction

Group documents into k clusters based on their
feature vectors
The centroid of each cluster corresponds to a
concept
Given a document d, the similarity between its
feature vector and a concept c (e.g., cosine
value between them) defines the weight of d on
concept c
The concept vector of d is composed of its
weights on all the concepts
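
A minimal sketch of this construction, assuming the cluster centroids are already available and that cosine similarity is the chosen weight (the slide gives it as one example). The function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def concept_vector(feature_vector, centroids):
    """One weight per concept: similarity of the document to each centroid."""
    return [cosine(feature_vector, c) for c in centroids]


centroids = [[1.0, 0.0], [0.0, 1.0]]  # two toy concepts
print(concept_vector([2.0, 0.0], centroids))
```

The length of the result equals the number of concepts, so the document is now described in concept space instead of term space.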

12
Two-Stage Semantic Indexing

Stage 1: Fast dimension reduction
Document clustering to identify n·d clusters (d =
DHT dimension)
Represent each document with a vector in this
n·d-dimensional space
Stage 2: Semantic index locator construction
Further partition the n·d clusters into n
equal-size, semantically coherent groups, each
of size d
Each group forms an index locator (key for
searching the DHT)
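
Stage 2 does not say how the clusters are grouped. One possible approach, shown purely as an assumption, is a greedy pass that seeds a group with an unassigned centroid and fills it with the nearest remaining centroids. The function name `group_concepts` is hypothetical.

```python
import math


def group_concepts(centroids, d):
    """Greedily split centroid indices into groups of size d (last may be smaller)."""
    unassigned = list(range(len(centroids)))
    groups = []
    while unassigned:
        seed = unassigned.pop(0)
        # Order the remaining centroids by distance to the seed.
        unassigned.sort(key=lambda i: math.dist(centroids[seed], centroids[i]))
        groups.append([seed] + unassigned[: d - 1])
        unassigned = unassigned[d - 1 :]
    return groups


centroids = [[0, 0], [10, 10], [0, 1], [10, 11]]
print(group_concepts(centroids, d=2))
```

Each returned group lists the concept indices that would share one index locator.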

13
Fast Dimension Reduction

Regular k-means clustering
Randomly start with k centroids
Iteratively re-assign documents to each cluster
and re-compute the centroids
Can stop at anytime to obtain rough clusters
Modification
Start with k relatively different centroids
Complexity at each iteration: O(kN), where
N >> k is the number of documents
Can be run on a sample of documents
Vector(D) = (sim(D, C1), ..., sim(D, Ck))
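
A rough sketch of the modified k-means described above. The slide does not say how the "relatively different" starting centroids are chosen; farthest-first seeding is used here as an assumption. The `max_iters` cap reflects the point that the loop can stop at any time and still yield rough clusters.

```python
import math


def spread_out_seeds(docs, k):
    """Farthest-first seeding (an assumed way to get relatively different centroids)."""
    seeds = [docs[0]]
    while len(seeds) < k:
        # Pick the document farthest from its nearest already-chosen seed.
        seeds.append(max(docs, key=lambda doc: min(math.dist(doc, s) for s in seeds)))
    return [list(s) for s in seeds]


def kmeans(docs, k, max_iters=5):
    """Each iteration costs O(kN) distance computations for N documents."""
    centroids = spread_out_seeds(docs, k)
    for _ in range(max_iters):
        clusters = [[] for _ in range(k)]
        for doc in docs:
            nearest = min(range(k), key=lambda i: math.dist(doc, centroids[i]))
            clusters[nearest].append(doc)
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


docs = [[0, 0], [0, 1], [10, 10], [10, 11]]
print(kmeans(docs, k=2))
```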

14
Index Locator Construction

Motivation: the dimensionality of concept vectors
(e.g., a few hundred) may be much larger than
that of the DHT, so it is hard to place an index
directly using the concept vector
Basic idea: break the concept vector into
multiple chunks with the same dimensionality as
that of the DHT, where each chunk contains related
concepts
With such division, each document only has a
small number of chunks with non-negligible
weights for indexing
Such chunks are called index locators
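
The chunking step can be sketched as below, assuming the grouping of related concepts is already known and passed in as lists of concept indices. The function name `index_locators` and the `groups` parameter are hypothetical.

```python
def index_locators(concept_vector, groups):
    """Build one locator per concept group by picking out that group's weights."""
    return [[concept_vector[i] for i in group] for group in groups]


# Four concepts, DHT dimensionality 2: concepts 0 and 2 are related, as are 1 and 3.
groups = [[0, 2], [1, 3]]
print(index_locators([0.9, 0.0, 0.8, 0.1], groups))
```

In this toy case only the first locator carries non-negligible weights, which matches the observation that a document typically needs only a few chunks for indexing.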

15
Index Placement on DHT

For each index locator of a document d
If its norm (i.e., the length of the vector) exceeds
a certain threshold, we put the index locator of d,
along with its feature vector, on the peer node
whose DHT address vector best matches the
index locator.
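
A minimal sketch of the placement rule, under stated assumptions: each peer is described by a single address vector, "matches best" is taken to mean smallest Euclidean distance, and `node_store` is a plain dict standing in for per-node storage. In a real DHT the locator would be routed as a key instead of being compared against every node.

```python
import math


def place_index(doc_id, feature_vector, locators, node_addresses, node_store, threshold):
    """Store (doc_id, locator, feature_vector) on the best-matching node per locator."""
    placed_on = []
    for locator in locators:
        length = math.sqrt(sum(x * x for x in locator))
        if length <= threshold:
            continue  # negligible weight on this chunk: no index entry
        node = min(node_addresses, key=lambda n: math.dist(node_addresses[n], locator))
        node_store.setdefault(node, []).append((doc_id, locator, feature_vector))
        placed_on.append(node)
    return placed_on


node_addresses = {"A": [0.9, 0.9], "B": [0.1, 0.1]}  # hypothetical peers
node_store = {}
locators = [[0.9, 0.8], [0.0, 0.1], [0.7, 0.0]]
print(place_index("d1", [1, 2, 3], locators, node_addresses, node_store, threshold=0.5))
```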

16
Illustration of Two-Stage Indexing
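[Figure: two-stage indexing. Documents D1, D2, ..., DN are represented over concepts C1, C2, ..., CM, which are grouped into Semantic Chunk 1 through Semantic Chunk k. Doc Di = (x1, x2, ..., xd, xd+1, ..., x2d, ..., xM); Locator 1 = (x1, x2, ..., xd), Locator 2 = (xd+1, ..., x2d), and so on. In the DHT, each locator (x1, x2, ..., xd) points to the original vector(D).]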
17
Querying

Contact any node on the DHT
Project the query vector to find related
concepts, and form the index locators
Use index locators to route to DHT nodes with the
indices and feature vectors of related documents
Use original query vector and document vectors to
perform relevance ranking
This local retrieval process can expand to
neighboring DHT nodes until enough relevant
results have been identified
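
The expansion to neighboring nodes can be sketched as a breadth-first search. Assumptions: the starting node has already been found by routing an index locator, `node_docs` and `neighbors` are plain dicts standing in for per-node storage and routing tables, and cosine similarity with a `min_score` cutoff stands in for the relevance test. All names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def search(query_vector, start_node, node_docs, neighbors, wanted, min_score=0.5):
    """Rank documents on start_node, widening to neighbors until enough are found."""
    results, visited, frontier = {}, set(), [start_node]
    while frontier and len(results) < wanted:
        next_frontier = []
        for node in frontier:
            if node in visited:
                continue
            visited.add(node)
            for doc_id, vec in node_docs.get(node, []):
                score = cosine(query_vector, vec)
                if score >= min_score:
                    results[doc_id] = score
            next_frontier.extend(neighbors.get(node, []))
        frontier = next_frontier
    return sorted(results, key=results.get, reverse=True)[:wanted]


node_docs = {"A": [("d1", [1.0, 0.0])], "B": [("d2", [0.9, 0.1])], "C": [("d3", [0.0, 1.0])]}
neighbors = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}
print(search([1.0, 0.0], "A", node_docs, neighbors, wanted=2))
```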

18
Adaptation to Corpus Dynamics

Basic idea: incrementally add new
documents/concepts without affecting existing
indices, and periodically (very infrequently)
rebuild index locators for all documents
When a set of new documents emerges, we check
whether they contain new frequently used terms or
new heavily weighted terms, and
whether their concept vectors belong to any
existing cluster in the existing semantic space

19
Adaptation to Corpus Dynamics (II)

To add and index a new concept c
If c belongs to an existing concept chunk whose
size is less than the dimensionality of the
underlying DHT, we can add c to that chunk by
using the next available entry of the index locator.
Otherwise, we generate a new concept group and a
new set of index locators to represent c
Generate the index locators for the new
documents, and deploy their indices on DHT
Finally, multicast the addition of the new
concept c, and the addition of new concept group
to all DHT nodes, so that they can route queries
about c

20
Example for Corpus Dynamics

When new documents on Bin Laden appear, we
detect it as a new concept related to the
concept group terrorism.
If the dimensionality of the DHT is 20 and the size
of the terrorism concept group is 17,
just add Bin Laden to that group as dimension
18 of the index locator.
The corresponding index locators of existing
documents have weight zero by default on
dimension 18, and thus remain the same.
Otherwise, if the terrorism concept group is already
full, we generate a new concept group for Bin
Laden (i.e., a new set of index locators).
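
The rule for adding a concept can be sketched as follows, using the numbers from this example (DHT dimensionality 20, existing group of size 17). The function name `add_concept`, the dict-of-lists representation of concept groups, and the placeholder concept names are assumptions; multicasting the change to all DHT nodes is left out.

```python
def add_concept(groups, concept, related_group, d):
    """Add concept to related_group if it has a free dimension, else open a new group.

    Returns (group_name, zero_based_dimension).
    """
    members = groups.get(related_group)
    if members is not None and len(members) < d:
        members.append(concept)  # next available entry of the index locator
        return related_group, len(members) - 1
    groups[concept] = [concept]  # new group, i.e. a new set of index locators
    return concept, 0


groups = {"terrorism": ["concept-%d" % i for i in range(17)]}  # placeholder names
print(add_concept(groups, "new-concept", "terrorism", d=20))
```

The zero-based index 17 printed here is dimension 18 in the slide's one-based counting. Existing documents implicitly have weight zero on that dimension, so their locators do not change.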

21
Summary

Propose a scalable semantic indexing framework
for peer-to-peer information retrieval: P2PIR
Index placement with good semantic locality,
leading to good retrieval accuracy and efficiency
Tunable framework and flexibility
Incremental adaptation to document/concept
dynamics
Prototype and evaluation of P2PIR in progress
