Collection Fusion in Carrot2

Slides: 22

Provided by: Rajesh86

Learn more at: https://redirect.cs.umbc.edu

Transcript and Presenter's Notes

1
Collection Fusion in Carrot2

Mithun Sheshagiri

2
Acknowledgements

Prof. Scott Cost
Srikanth Kallurkar
Hemali Majithia

3
Overview

Collection Fusion Problem in IR
Possible solutions
Equal Distribution Assumption
Comparable similarities
Modeling Relevant Document Distribution
Query Clustering
Carrot2 System
Query Routing in Carrot2

4
Overview

Collection Fusion in Carrot2
Future Work
Conclusions
References

5
The Collection Fusion Problem

Centralized Indexing and Retrieval.
Distributed IR Systems
The Collection Fusion Problem
Determining the number of documents that need to
be retrieved from each sub-collection
Interleaving the documents returned by each
sub-collection

6
Possible Solutions

Equal Distribution Assumption
Assumes that relevant documents are distributed
equally across all sub-collections
Comparable similarities
Documents in the final result are listed as
though the similarities are normalized across
sub-collections.
Similarity values are dependent on
sub-collections
A rare but not very relevant document can receive
a higher ranking

8
Possible Solutions

Modeling Relevant Document Distribution
The document distribution model is built using
training queries.
The document distribution for a query q is
obtained by averaging the number of relevant
documents retrieved by the k nearest queries.
This is done for all sub-collections.
These document distributions, along with the total
number of documents to be retrieved, are passed to
a maximization procedure.
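
The averaging step above can be sketched as follows. This is a minimal illustration, not the original implementation: it assumes queries are sparse term-weight dictionaries compared by cosine similarity, and the names `cosine` and `estimate_distribution` are hypothetical. The maximization procedure itself is not shown.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse term -> weight dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def estimate_distribution(query, training, k):
    """Average, per sub-collection, the relevant-document counts of the
    k training queries most similar to `query`.

    training: list of (query_vector, {collection: relevant_count}) pairs
    (an assumed layout for the training data).
    """
    nearest = sorted(training, key=lambda item: cosine(query, item[0]),
                     reverse=True)[:k]
    if not nearest:
        return {}
    collections = {c for _, counts in nearest for c in counts}
    return {c: sum(counts.get(c, 0) for _, counts in nearest) / len(nearest)
            for c in collections}
```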

9
Possible Solutions

Modeling Relevant Document Distribution
This maximization procedure calculates a cut-off
value for each sub-collection.

10
Possible Solutions

Query Clustering
Query clusters are formed by grouping training
queries which return some identical documents.
A weight is assigned to each cluster.
Weight is computed based on the number of
relevant documents returned by the queries
belonging to the cluster.
The centroid of the query cluster is calculated
by averaging the query vectors belonging to that
query cluster.
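
As a rough sketch of the centroid step, assuming each query vector is stored as a sparse term-weight dictionary (the function name `centroid` is illustrative only):

```python
def centroid(query_vectors):
    """Average a non-empty list of sparse query vectors (term -> weight)."""
    if not query_vectors:
        raise ValueError("a cluster needs at least one query vector")
    total = {}
    for vec in query_vectors:
        for term, weight in vec.items():
            total[term] = total.get(term, 0.0) + weight
    n = len(query_vectors)
    # A term missing from a vector contributes zero to the average.
    return {term: weight / n for term, weight in total.items()}
```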

11
Possible Solutions

Query Clustering
The cluster whose centroid is most similar to the
user query is selected and its weight is
returned.
The set of weights returned by all the
sub-collections is used to apportion the
retrieved set.
N_i = \frac{w_i}{\sum_j w_j} \cdot N
w_i: Weight returned by the cluster
N: Number of documents in the final result
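
A small sketch of this apportionment is given below. The slides do not say how fractional shares are rounded; the largest-remainder rounding used here is an assumption, and `apportion` is a hypothetical name.

```python
def apportion(weights, n):
    """Split n result slots across sub-collections in proportion to
    their weights (dict: collection -> non-negative weight)."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    exact = {c: w * n / total for c, w in weights.items()}
    quota = {c: int(x) for c, x in exact.items()}  # round down first
    leftover = n - sum(quota.values())
    # Assumption: slots lost to rounding go to the largest fractional parts.
    by_remainder = sorted(exact, key=lambda c: exact[c] - quota[c],
                          reverse=True)
    for c in by_remainder[:leftover]:
        quota[c] += 1
    return quota
```

With weights 3 and 1 and N = 8, the two sub-collections would contribute 6 and 2 documents respectively.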

12
Carrot2 System

Carrot2 is an agent-based distributed IR system.
Uses Jackal Communication Infrastructure
KQML is used by agents for communication
Agents interface with IR engine through a
wrapper
Wrapper provides functionality to index documents
as well as metadata

13
Carrot2 System

Metadata is a reduced representation of the
sub-collection. (8-10)
Metadata is a vector consisting of N-grams
(terms), each paired with the number of documents
that contain it.
On start-up an agent is allotted a
sub-collection.
Every agent has an associated metadata object.
An agent also has access to a metadata pool.

14
Query Routing in Carrot2

Query is submitted to a Query Manager.
Query manager picks an agent from a list of
agents returned by the Collection Manager.
Every agent queries its metadata pool and decides
to do one of the following:
Query its local collection.
Forward the query.
A combination of both.

15
Query Routing in Carrot2

The process ends when
There are no more agents that have not already
received the query.
The number of times the query has been forwarded
has reached a threshold value.

16
Collection Fusion in Carrot2

An approach similar to query clustering.
A query cluster corresponds to a metadata object
Both are representations of sub-collections
Both have a weight/similarity which is an
indication of the relevance of the documents in
the sub-collection to the given query.
The similarity values of the metadata objects can
be used to apportion the total number of
documents that need to be returned.
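
To make the idea concrete, the sketch below scores a query against each metadata object and turns the scores into shares of the result list. The slides do not name the similarity measure; cosine similarity over character trigram counts is an assumption, and all function names are hypothetical.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Count the character n-grams of a lower-cased string."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def similarity(query, metadata, n=3):
    """Cosine similarity between the query's n-gram counts and a
    metadata object (dict: n-gram -> document count)."""
    q = char_ngrams(query, n)
    dot = sum(count * metadata.get(gram, 0) for gram, count in q.items())
    norm_q = math.sqrt(sum(c * c for c in q.values()))
    norm_m = math.sqrt(sum(v * v for v in metadata.values()))
    return dot / (norm_q * norm_m) if norm_q and norm_m else 0.0


def shares(query, pool, n=3):
    """Fraction of the final result each agent should supply."""
    sims = {agent: similarity(query, md, n) for agent, md in pool.items()}
    total = sum(sims.values())
    return {a: (s / total if total else 0.0) for a, s in sims.items()}
```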

17
Collection Fusion in Carrot2

Requirement for implementation
Access to the metadata object of all
participating sub-collections (C2 agents).
Using the metadata pool of one agent when the
metadata objects are distributed in broadcast
mode. (Flooding strategy)
A new agent which accesses the metadata objects
of all participating agents.

18
Collection Fusion in Carrot2

Similarity value is appended to the result
returned by each agent.
The interleaving can be done by rolling a C-faced
die, where C is the number of sub-collections.
The die is biased by the number of documents
that are still to be picked from each
sub-collection's result set.
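
The biased-die interleaving could look roughly like this. It is a sketch under the assumption that each agent's list has already been cut to its quota; `interleave` and its parameters are illustrative names.

```python
import random


def interleave(result_lists, rng=None):
    """Merge ranked lists by repeatedly rolling a die whose faces are
    the sub-collections, weighted by how many documents each still has
    to contribute. Within a sub-collection, rank order is preserved.

    result_lists: dict collection -> ranked list of documents.
    """
    if rng is None:
        rng = random.Random()
    remaining = {c: list(docs) for c, docs in result_lists.items() if docs}
    merged = []
    while remaining:
        names = list(remaining)
        weights = [len(remaining[c]) for c in names]
        chosen = rng.choices(names, weights=weights, k=1)[0]
        merged.append(remaining[chosen].pop(0))  # take its next-best doc
        if not remaining[chosen]:
            del remaining[chosen]                # this face is exhausted
    return merged
```

Because the weights shrink as documents are drawn, a sub-collection with a larger quota tends to appear more often near the top, without any cross-collection score comparison.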

19
Future Work

The suitability of the proposed technique to the
C2 system should be experimentally verified.
This technique makes use of existing entities and
information, so implementation can be done with
minimal changes to the existing architecture.

20
Conclusion

A combination of a query-clustering-like approach
with probabilistic interleaving is a good
candidate for collection fusion in C2.
Decentralized nature
Use of existing entities
Easy to implement
Less prone to scalability issues.

21
References

E. M. Voorhees, N. K. Gupta, and B.
Johnson-Laird. Learning collection fusion
strategies.
J. P. Callan, Z. Lu, and W. B. Croft.
Searching distributed collections with inference
networks.
E. M. Voorhees, N. K. Gupta, and B.
Johnson-Laird. The collection fusion problem.
