Collection Fusion in Carrot2

Slides: 22
Provided by: Rajesh86
1
Collection Fusion in Carrot2
  • Mithun Sheshagiri

2
Acknowledgements
  • Prof. Scott Cost
  • Srikanth Kallurkar
  • Hemali Majithia

3
Overview
  • Collection Fusion Problem in IR
  • Possible solutions
  • Equal Distribution Assumption
  • Comparable similarities
  • Modeling Relevant Document Distribution
  • Query Clustering
  • Carrot2 System
  • Query Routing in Carrot2

4
Overview
  • Collection Fusion in Carrot2
  • Future Work
  • Conclusions
  • References

5
The Collection Fusion Problem
  • Centralized Indexing and Retrieval.
  • Distributed IR Systems
  • The Collection Fusion Problem has two parts:
  • Determining the number of documents that need to
    be retrieved from each sub-collection
  • Interleaving the documents returned by each
    sub-collection

6
Possible Solutions
  • Equal Distribution Assumption
  • Assumes that relevant documents are distributed
    equally across all sub-collections
  • Comparable similarities
  • Documents in the final result are listed as
    though the similarities are normalized across
    sub-collections.
  • Similarity values are dependent on the
    sub-collection
  • A rare but not very relevant document can
    receive a higher ranking

8
Possible Solutions
  • Modeling Relevant Document Distribution
  • The document distribution model is built using
    training queries.
  • The document distribution for a query q is
    obtained by averaging the number of relevant
    documents retrieved by the k nearest queries.
  • This is done for all sub-collections.
  • These document distributions along with the total
    number of documents to be retrieved is passed to
    a maximization procedure.
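The averaging step above can be sketched as follows. This is a minimal illustration, not the procedure from the cited work: the sparse-dict query vectors, the cosine similarity measure, and the names `cosine` and `estimate_distribution` are assumptions made for the example, and the training data is invented.

```python
import math

def cosine(a, b):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def estimate_distribution(query, training, k):
    """Average the relevant-document counts of the k training queries
    most similar to `query`, separately for each sub-collection.

    `training` is a list of (query_vector, counts) pairs, where
    counts[c][r] is the number of relevant documents sub-collection c
    returned in its top r+1 results for that training query.
    (This data layout is an assumption of the sketch.)
    """
    nearest = sorted(training, key=lambda t: cosine(query, t[0]),
                     reverse=True)[:k]
    collections = nearest[0][1].keys()
    return {
        c: [sum(col) / len(nearest)
            for col in zip(*(counts[c] for _, counts in nearest))]
        for c in collections
    }

# Invented training data for two sub-collections, A and B.
training = [
    ({"fusion": 1.0, "collection": 1.0}, {"A": [1, 2, 2], "B": [0, 0, 1]}),
    ({"fusion": 1.0, "ranking": 1.0},    {"A": [1, 1, 2], "B": [0, 1, 1]}),
    ({"agent": 1.0, "kqml": 1.0},        {"A": [0, 0, 0], "B": [1, 2, 3]}),
]
model = estimate_distribution({"fusion": 1.0}, training, k=2)
```

Here the two training queries that mention "fusion" are the nearest neighbours, so `model` holds their averaged counts for A and B.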

9
Possible Solutions
  • Modeling Relevant Document Distribution
  • This maximization procedure calculates a cut-off
    value for each sub-collection.
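The slides do not spell out the maximization procedure. One possible reading, sketched below under that assumption, is to choose one cut-off per sub-collection so that the cut-offs sum to the requested total and the expected number of relevant documents is as large as possible. The dynamic-programming search and the name `choose_cutoffs` are illustrative choices, not the authors' method.

```python
def choose_cutoffs(distributions, total):
    """Pick one cut-off per sub-collection so the cut-offs sum to `total`
    and the expected number of relevant documents is maximal.

    distributions[c][r] is the expected number of relevant documents in
    the top r+1 results of sub-collection c; a cut-off of 0 contributes
    nothing. Returns (expected_relevant, cutoffs), or None if `total`
    cannot be reached. (Assumed formulation, see the text above.)
    """
    # best[n] = (score, cut-offs) for n documents over the collections seen
    best = {0: (0.0, {})}
    for c in distributions:
        gains = [0.0] + list(distributions[c])
        nxt = {}
        for used, (score, cuts) in best.items():
            for cut, gain in enumerate(gains):
                n = used + cut
                if n > total:
                    break
                cand = (score + gain, {**cuts, c: cut})
                if n not in nxt or cand[0] > nxt[n][0]:
                    nxt[n] = cand
        best = nxt
    return best.get(total)

# Invented distributions: A looks more promising than B for this query.
distributions = {"A": [1.0, 1.5, 2.0], "B": [0.0, 0.5, 1.0]}
score, cutoffs = choose_cutoffs(distributions, total=3)
```

With these numbers all three documents are taken from A, because every split that includes B yields fewer expected relevant documents.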

10
Possible Solutions
  • Query Clustering
  • Query clusters are formed by grouping training
    queries which return some identical documents.
  • A weight is assigned to each cluster.
  • Weight is computed based on the number of
    relevant documents returned by the queries
    belonging to the cluster.
  • The centroid of the query cluster is calculated
    by averaging the query vectors belonging to that
    query cluster.
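The centroid computation can be illustrated with a short sketch. Representing query vectors as sparse term-to-weight dicts is an assumption of the example, and `centroid` is a hypothetical helper name.

```python
def centroid(query_vectors):
    """Average a list of sparse query vectors (term -> weight dicts).

    A term missing from a query counts as weight 0 for that query.
    """
    terms = set().union(*query_vectors)
    n = len(query_vectors)
    return {t: sum(q.get(t, 0.0) for q in query_vectors) / n for t in terms}

# Two invented training queries that landed in the same cluster.
cluster = [
    {"collection": 1.0, "fusion": 1.0},
    {"fusion": 1.0, "ranking": 1.0},
]
center = centroid(cluster)
```

Terms shared by both queries keep their full weight in `center`, while terms that occur in only one query are halved.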

11
Possible Solutions
  • Query Clustering
  • The cluster whose centroid is most similar to the
    user query is selected and its weight is
    returned.
  • The set of weights returned by all the
    sub-collections are used to apportion the
    retrieved set.
  • Documents taken from sub-collection $i$:
    $n_i = \frac{w_i}{\sum_j w_j} \cdot N$
  • $w_i$: weight returned by the cluster for
    sub-collection $i$
  • $N$: number of documents in the final result
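A minimal sketch of the selection and apportioning steps follows. The cosine similarity measure, the cluster data, and the largest-remainder rounding used to turn proportional shares into whole document counts are assumptions of this example; the slides do not say how fractional shares are rounded.

```python
import math

def cosine(a, b):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def cluster_weight(query, clusters):
    """Weight of the cluster whose centroid is most similar to the query."""
    best = max(clusters, key=lambda c: cosine(query, c["centroid"]))
    return best["weight"]

def apportion(weights, total):
    """Split `total` documents in proportion to the weights.

    Rounding uses the largest-remainder rule (an assumption of this
    sketch) so that the shares always add up to `total`.
    """
    s = sum(weights.values())
    if s == 0:
        return {c: 0 for c in weights}
    exact = {c: w * total / s for c, w in weights.items()}
    shares = {c: math.floor(x) for c, x in exact.items()}
    leftover = total - sum(shares.values())
    by_remainder = sorted(exact, key=lambda c: exact[c] - shares[c],
                          reverse=True)
    for c in by_remainder[:leftover]:
        shares[c] += 1
    return shares

# Invented clusters held by two sub-collections.
clusters = {
    "A": [{"centroid": {"fusion": 1.0, "collection": 0.5}, "weight": 2.0},
          {"centroid": {"kqml": 1.0}, "weight": 5.0}],
    "B": [{"centroid": {"fusion": 0.5, "ranking": 1.0}, "weight": 1.0}],
}
query = {"fusion": 1.0}
weights = {c: cluster_weight(query, cl) for c, cl in clusters.items()}
shares = apportion(weights, total=10)
```

Sub-collection A answers with the weight of its "fusion" cluster rather than its heavier but unrelated "kqml" cluster, and the ten result slots are then divided in the ratio of the returned weights.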

12
Carrot2 System
  • Carrot2 is an agent-based distributed IR system.
  • Uses Jackal Communication Infrastructure
  • KQML is used by agents for communication
  • Agents interface with IR engine through a
    wrapper
  • Wrapper provides functionality to index documents
    as well as metadata

13
Carrot2 System
  • Metadata is a reduced representation of the
    sub-collection. (8-10)
  • Metadata is a vector of N-grams (terms), each
    paired with the number of documents that contain
    it.
  • On start-up an agent is allotted a
    sub-collection.
  • Every agent has an associated metadata object.
  • An agent also has access to a metadata pool.

14
Query Routing in Carrot2
  • Query is submitted to a Query Manager.
  • Query manager picks an agent from a list of
    agents returned by the Collection Manager.
  • Every agent queries its metadata pool and makes a
    decision.
  • Query its local collection.
  • Forward the query.
  • Combination of both.

15
Query Routing in Carrot2
  • The process ends when
  • There are no more agents that have not already
    received the query.
  • The number of times the query has been forwarded
    has reached a threshold value.
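The routing loop and its two stopping conditions can be sketched as below. This is a single-path simplification for illustration only: the `decide` callback, which stands in for an agent consulting its metadata pool, and all other names are hypothetical and do not reflect the Jackal or KQML interfaces.

```python
def route(query, agents, start, max_hops, decide):
    """Pass `query` from agent to agent and return the agents that
    searched their local collection.

    `decide(agent, query, remaining)` is a hypothetical callback that
    returns (search_local, next_agent); next_agent is None when the
    agent does not forward. Routing stops when no unvisited agent is
    chosen or the number of forwards reaches `max_hops`.
    """
    visited, searched = {start}, []
    current, hops = start, 0
    while True:
        remaining = [a for a in agents if a not in visited]
        search_local, next_agent = decide(current, query, remaining)
        if search_local:
            searched.append(current)
        if next_agent is None or hops >= max_hops:
            break
        visited.add(next_agent)
        current, hops = next_agent, hops + 1
    return searched

# Toy metadata: the terms each agent's sub-collection is known to contain.
pool = {"a1": {"fusion"}, "a2": {"kqml"}, "a3": {"fusion", "agent"}}

def decide(agent, query, remaining):
    """Search locally on a term match; always forward if anyone is left."""
    search_local = query in pool[agent]
    next_agent = remaining[0] if remaining else None
    return search_local, next_agent

searched = route("fusion", ["a1", "a2", "a3"], "a1", max_hops=5, decide=decide)
limited = route("fusion", ["a1", "a2", "a3"], "a1", max_hops=1, decide=decide)
```

With a generous hop threshold the query reaches every agent and is answered by a1 and a3; with a threshold of one forward it stops at a2, so only a1 contributes results.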

16
Collection Fusion in Carrot2
  • An approach similar to query clustering.
  • A query cluster corresponds to a metadata object
  • Both are representations of sub-collections
  • Both have a weight/similarity which is an
    indication of the relevance of the documents in
    the sub-collection to the given query.
  • The similarity values of the metadata objects can
    be used to apportion the total number of
    documents that need to be returned.
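To make the idea concrete, the sketch below scores a query against each agent's metadata object and converts the similarities into fractions of the result list. The slides do not state which similarity measure C2 uses; cosine similarity over document counts, the toy metadata pool, and the function names are assumptions of this example.

```python
import math

def similarity(query_terms, metadata):
    """Cosine similarity between a set of query terms (binary weights)
    and a metadata vector mapping term -> document count."""
    dot = sum(metadata.get(t, 0) for t in query_terms)
    norm = (math.sqrt(len(query_terms))
            * math.sqrt(sum(v * v for v in metadata.values())))
    return dot / norm if norm else 0.0

def result_fractions(query_terms, pool):
    """Fraction of the final result each agent should contribute."""
    sims = {agent: similarity(query_terms, md) for agent, md in pool.items()}
    total = sum(sims.values())
    return {a: (s / total if total else 0.0) for a, s in sims.items()}

# Invented metadata pool: agent -> {term: number of documents containing it}.
pool = {
    "a1": {"fusion": 3, "agent": 4},
    "a2": {"kqml": 2},
    "a3": {"fusion": 4, "kqml": 3},
}
fractions = result_fractions({"fusion"}, pool)
```

Agent a2 has no documents containing the query term and receives no share; a1 and a3 split the result in proportion to their similarity values, and multiplying each fraction by the requested number of documents gives the per-agent count.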

17
Collection Fusion in Carrot2
  • Requirement for implementation: access to the
    metadata objects of all participating
    sub-collections (C2 agents).
  • Option 1: use the metadata pool of one agent when
    the metadata objects are distributed in
    broadcast mode (flooding strategy).
  • Option 2: add a new agent that accesses the
    metadata objects of all participating agents.

18
Collection Fusion in Carrot2
  • Similarity value is appended to the result
    returned by each agent.
  • The interleaving can be done by rolling a C-faced
    die which is biased by the number of documents
    that are still to be picked from the original
    result set.
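The biased die can be simulated in a few lines. The sketch assumes that C is the number of sub-collections and that each agent's result list is already ranked and trimmed to its allotted size; `interleave` is a hypothetical name, and the optional `rng` parameter exists only to make runs repeatable.

```python
import random

def interleave(results, rng=None):
    """Merge ranked result lists by repeatedly rolling a die with one
    face per sub-collection, each face weighted by the number of
    documents that sub-collection still has to contribute."""
    rng = rng or random.Random()
    queues = {c: list(docs) for c, docs in results.items()}
    merged = []
    while any(queues.values()):
        names = [c for c, q in queues.items() if q]
        weights = [len(queues[c]) for c in names]
        pick = rng.choices(names, weights=weights, k=1)[0]
        # take the highest-ranked document not yet picked from that list
        merged.append(queues[pick].pop(0))
    return merged

# Invented ranked results from three agents.
results = {"A": ["a1", "a2", "a3"], "B": ["b1"], "C": ["c1", "c2"]}
merged = interleave(results, random.Random(0))
```

The merged order varies from run to run, but every document appears exactly once and the documents of each sub-collection keep their original relative order.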

19
Future Work
  • The suitability of the proposed technique to the
    C2 system should be experimentally verified.
  • Because this technique makes use of existing
    entities and information, it can be implemented
    with minimal changes to the existing
    architecture.

20
Conclusion
  • A query-clustering-like approach combined with
    probabilistic interleaving is a good candidate
    for collection fusion in C2:
  • Decentralized nature
  • Use of existing entities
  • Easy to implement
  • Less prone to scalability issues.

21
References
  • Ellen M. Voorhees, Narendra Gupta, and Ben
    Johnson-Laird. Learning collection fusion
    strategies.
  • James P. Callan, Zhihong Lu, and Bruce Croft.
    Searching distributed collections with inference
    networks.
  • Ellen M. Voorhees, Narendra Gupta, and Ben
    Johnson-Laird. The collection fusion problem.