Distributed Content-based Search on Structured Peer-to-Peer Overlay Networks

Slides: 39
Provided by: hpl5
Learn more at: http://www.hpl.hp.com

1
Distributed Content-based Search on Structured
Peer-to-Peer Overlay Networks
  • Chunqiang Tang, Zhichen Xu
  • Sandhya Dwarkadas, Mallik Mahalingam
  • HP Labs, Hewlett-Packard Company
  • Univ. of Rochester

2
Motivation
  • 93% of information produced worldwide is in
    digital form
  • Unique data added yearly exceeds one exabyte
    (10^18 bytes)
  • The volume of digital content is estimated to
    double annually
  • The contents are becoming richer
  • Efforts are under way to make these contents
    easier to access (e.g., QBIC, MPEG-7)
  • This calls for scalable infrastructures capable
    of indexing and searching rich content such as
    HTML, plain text, music, and image files
  • This particular work focuses on content-based
    search

3
Motivation (contd)
  • P2P systems are scalable, fault-tolerant, and
    self-organizing
  • Progress has been made in storage, DNS, media
    streaming, and web caching
  • This raises hope for a self-organizing distributed
    search engine
  • Content-based search in P2P is not yet solved
  • Most systems use simple keyword matches and
    ignore developments in information retrieval (IR)
  • Hard to, e.g., search for a song by whistling
    a tune, or search for an image by submitting
    sample patches
  • They also have efficiency and accuracy problems:
  • Centralized indexing, index/query flooding
  • Inaccuracy and high maintenance cost of
    heuristic-based approaches

4
The goals of our project
  • Build a self-organizing search engine out of P2P
    nodes
  • Extend centralized IR algorithms:
  • The vector space model (VSM) and latent semantic
    indexing (LSI)
  • Representing documents and queries as vectors is
    not specific to text [P. Raghavan]
  • Differences from centralized systems such as
    Google:
  • Google is designed for Web search and harnesses
    explicit cross-reference information
  • Explicit cross-references do not exist in all
    digital content
  • On the other hand, there are richer
    inter-relationships that the search engine can
    make use of (see our HotOS'03 paper)
  • P2P systems are self-organizing, low cost, easy
    to deploy, and highly scalable

5
Our approach: pSearch
  • A fundamental problem of existing approaches:
  • Documents are randomly distributed; a query
    either has to search a large number of nodes or
    suffers a high probability of missing important
    documents
  • Our approach: controlled placement of document
    indices in an overlay such that distance reflects
    dissimilarity in content
  • Two algorithms:
  • V-hash (whole-vector hashing) requires the overlay
    to have a Cartesian-space abstraction (historically
    pLSI)
  • E-hash (hashing on individual elements)
    (historically pVSM)

6
Benefits of controlled placement for search
[Figure: a query and documents A, B, C plotted as points in the semantic space]
  • With VSM or LSI, documents and queries are
    vectors in a Cartesian (semantic) space
  • Similarity is measured as distance in the
    semantic space

7
V-hash maps the semantic space to CAN
8
Highlight of results
  • Achieves accuracy comparable to a centralized
    information retrieval system while visiting only
    a small number of nodes
  • E.g., with a proper configuration,
  • in a system with 128,000 nodes and 528,543
    documents (from news, magazines, etc.),
  • pSearch searches only 19 nodes and transmits
    only 95.5 KB of data during the search, and
  • the top 15 documents returned by v-hash and LSI
    have a 91.7% intersection

9
Overview
  • Background
  • A basic parallel LSI (v-hash) algorithm to
    highlight challenges
  • Solutions to the challenges
  • Experimental results
  • Discussions
  • Conclusions

10
Background --- VSM and LSI
  • Documents and queries are vectors in a Cartesian
    space
  • Similarity between a query and a document is
    measured as the cosine of the angle between their
    vector representations
  • Precision of LSI ranges from comparable to up to
    30% better than that of VSM
  • LSI can bring together documents that are
    semantically related even if they do not share
    terms
  • e.g., a search for "car" may return relevant
    documents that use "automobile" in the text

11
Background --- The vector space model (VSM)
  • If a term t appears often in a document, then a
    query containing t should retrieve that document
  • A term's scarcity across the collection is a
    measure of its importance
  • Documents and queries are both vectors:
  • D_i = (w_{i,1}, w_{i,2}, ..., w_{i,t})
  • w_{d,t} = tf_{d,t} x idf_t
  • tf_{d,t}: the frequency of t in document d
  • idf_t: inverse document frequency
  • There are many variations.
  • similarity(d, q) = (d . q) / (|d| |q|)
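
The weighting and similarity formulas above can be sketched in a few lines of Python. This is a minimal illustration only; the helper names `tfidf_vector` and `cosine` are mine, not from the paper:

```python
import math

def tfidf_vector(term_freqs, doc_freqs, n_docs):
    # w_{d,t} = tf_{d,t} * idf_t, with idf_t = log(N / df_t)
    return {t: tf * math.log(n_docs / doc_freqs[t])
            for t, tf in term_freqs.items()}

def cosine(d, q):
    # similarity(d, q) = (d . q) / (|d| |q|), vectors as sparse dicts
    dot = sum(w * q.get(t, 0.0) for t, w in d.items())
    norm = lambda v: math.sqrt(sum(w * w for w in v.values()))
    denom = norm(d) * norm(q)
    return dot / denom if denom else 0.0
```

A vector compared against itself yields a cosine of 1.0, and vectors with no shared terms yield 0.0, matching the angle interpretation above.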

12
Background --- Latent Semantic Indexing (LSI)
  • Maps the term space to a lower-dimensional
    concept space
  • LSI uses the Singular Value Decomposition (SVD)
  • Let A be an n x m matrix of rank r, and let
    σ_1 ≥ σ_2 ≥ ... ≥ σ_r be the singular values of A
  • A = U D V^T, where D = diag(σ_1, σ_2, ..., σ_r) is an
    r x r matrix, U = (u_1, ..., u_r) is an n x r matrix,
    and V = (v_1, ..., v_r) is an m x r matrix
  • LSI omits all but the k largest singular values
    of A, i.e.,
  • A_k = U_k D_k V_k^T, where D_k = diag(σ_1, σ_2, ..., σ_k),
    U_k = (u_1, ..., u_k), and V_k = (v_1, ..., v_k)
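
The truncation above can be sketched with NumPy's SVD. This is an illustrative helper, not the paper's implementation (the experiments use LAS2 from SVDPACK):

```python
import numpy as np

def lsi(A, k):
    # A: n (terms) x m (documents) matrix; keep the k largest singular values
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    Uk, Dk, Vkt = U[:, :k], np.diag(s[:k]), Vt[:k, :]
    Ak = Uk @ Dk @ Vkt       # rank-k approximation A_k = U_k D_k V_k^T
    doc_vectors = Dk @ Vkt   # k-dim concept-space coordinates of each document
    return Ak, doc_vectors
```

For a matrix whose rank is already k, the rank-k approximation reproduces it exactly; for higher-rank matrices it is the best rank-k approximation in the least-squares sense.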

13
Background --- CAN [Ratnasamy01]
[Figure: a 2-d Cartesian space partitioned into zones, one node per zone]
  • The Cartesian space is partitioned into zones
  • A node serves as the owner of a zone
  • A key is a point in the Cartesian space
  • An object is stored on the node that owns the zone
    containing the point (key)

14
Low maintenance cost / self-organizing
[Figure: a new node joins by splitting an existing zone into a new zone]
  • A node only needs to know the owners of its
    neighboring zones
  • Node join: pick a point and split the zone with
    the node that currently owns the point
  • Node departure: a neighboring node takes over the
    state of the departing node
  • Dynamism is shielded from users and applications!

15
Object lookup translates to logical routing
[Figure: a lookup routed through neighboring zones 1, 2, 3 to the destination]
  • Find the node that owns the zone containing
    the point
  • Routing traverses a series of neighboring zones
    from source to destination
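
The greedy hop-by-hop routing can be sketched on a toy topology. This assumes a simplified CAN where each node is identified by its zone centre and knows its neighbors; it is not the paper's code:

```python
import math

def route(centres, neighbors, start, key):
    """Greedily route toward the point `key`, hopping to whichever
    neighboring zone's centre is closest to the target."""
    path, cur = [start], start
    while True:
        nxt = min(neighbors[cur], key=lambda n: math.dist(centres[n], key))
        if math.dist(centres[nxt], key) >= math.dist(centres[cur], key):
            return path  # no neighbor is closer: cur is the local owner
        cur = nxt
        path.append(cur)
```

On a 2x2 grid of zones, a lookup for a point in the far corner traverses a chain of neighboring zones, ending at the zone that contains the key.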

16
A basic parallel LSI algorithm (naïve v-hash)
[Figure: CAN zones holding a query and documents A, B, C; arrows mark steps 1-3]
  • 1. query routing
  • 2. local query: localized flooding
  • 3. results routing

17
It is more complicated
  • The dimensionality of the semantic space is
    typically very high
  • 50-350 for IR corpora, and expected to increase
    with corpus size
  • Nearest-neighbor search in high dimensions is
    very difficult
  • The dimensionality of CAN is much lower
  • When k ≈ log(n) and zones are partitioned evenly,
    each node has only log(n) neighbors
  • CAN can only partition a small number of
    dimensions
  • Uneven distribution of semantic vectors in the
    semantic space
  • Global information is needed
  • Solutions: hierarchical clustering,
    rolling-index, and content-directed search

18
Problems due to dimension mismatch: an example
  • Semantic space of 4 dimensions
  • Vd = (-0.1, 0.55, 0.57, -0.6), Vq = (0.55, -0.1,
    0.6, -0.57)
  • Vd and Vq are similar on elements 2 and 3 (in red)
  • If CAN only partitions the first two dimensions,
    the similarity on elements 2 and 3 is invisible
[Figure: Vd and Vq plotted on the first two dimensions, far apart]
19
Intuitions behind our solutions
  • The number of dimensions relevant to a particular
    document is typically much smaller
  • Queries submitted to search engines can be very
    short, averaging less than 2.4 terms per query
    [Lempel & Moran]

20
Our solutions
  • Use clustering algorithms to identify the
    clusters of semantic vectors that correspond to,
    e.g., chemistry, computer science, etc. (not yet
    evaluated)
  • Rotate the semantic space and map each of the
    rotated spaces to the same CAN
  • Use the contents stored on neighboring nodes
    and queries received in the recent past to guide
    the search

21
Hierarchical clustering: high-level idea
[Figure: a CAN recursively partitioned into clusters; each cluster is
summarized by a digest (digest 1, digest 2, digest 1.1, ..., digest 2.4)
at successive levels of the hierarchy]
  • Digests are typically made of the most important
    concepts (terms) in a domain
  • Challenge: efficiently/effectively decide which
    cluster a document/query falls into

22
Rolling-index
  • Original vector for a document (or query):
    e0, e1, e2, e3, e4, e5, e6, e7, e8, e9, e10, e11  →  key (e0, e1)
  • Vector rotated by 2 elements:
    e2, e3, e4, e5, e6, e7, e8, e9, e10, e11, e0, e1  →  key (e2, e3)
  • ...
  • e9, e10, e11, e0, e1, e2, e3, e4, e5, e6, e7, e8  →  key (e9, e10)
23
An example of rolling-index
  • Semantic space of 4 dimensions
  • Vd = (-0.1, 0.55, 0.57, -0.6), Vq = (0.55, -0.1,
    0.6, -0.57)
  • Vd and Vq are similar on elements 2 and 3 (in red)
[Figure: in the rotated space, Vd and Vq map to nearby points]
  • Precision at the cost of replication
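
The rotation scheme above can be sketched as follows. This is a toy illustration assuming a 2-d key per rotation and a fixed shift; the actual number of partitioned dimensions and the shift size are system parameters, and `rotated_keys` is my name, not the paper's:

```python
def rotated_keys(vec, shift, dims=2):
    # Rolling-index sketch: generate one CAN key per rotation of the
    # semantic vector; each rotation shifts the vector by `shift` more
    # elements, and the first `dims` elements become the key.
    keys = []
    for r in range(0, len(vec), shift):
        rotated = vec[r:] + vec[:r]  # vector rotated by r elements
        keys.append(tuple(rotated[:dims]))
    return keys
```

Each key places a copy of the index into the same CAN, so elements that would otherwise be ignored by the partitioned dimensions get their turn at the front, at the cost of replication.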
24
Properties of SVD
  • Sorts the elements of semantic vectors by
    decreasing importance.
  • A large number of documents discussing popular
    concepts are likely to be correctly classified by
    a relatively small number of low-dimension elements

25
Query accuracy distribution
  • A total of 100 queries
  • 4 rotated spaces, each rotating the previous space
    by 25%
  • Accuracy: percentage overlap with a centralized
    baseline

26
Content-directed search
  • Curse of dimensionality:
  • High-dimensional data spaces are sparsely
    populated; even a very large hyper-cube in a
    high-dimensional space is not likely to contain
    a point
  • The distance between a query and its nearest
    neighbor (NN) grows steadily with the
    dimensionality of the space
  • Our approach: use the contents stored on nodes and
    recently processed queries as hints to guide the
    search to the right places
  • Uses samples from other nodes to determine
    content similarity between a query and the content
    stored on those nodes
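
The sample-driven node selection can be sketched as a simple ranking. This assumes samples are dense semantic vectors and uses cosine similarity as the content-similarity measure; the function names are illustrative, not from the paper:

```python
import math

def cos(a, b):
    # Cosine similarity of two dense vectors
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def rank_candidates(query, samples):
    # samples: node -> list of sample semantic vectors gathered from
    # that node. Score each node by its best-matching sample and visit
    # the most promising nodes first.
    score = {n: max(cos(query, v) for v in vecs)
             for n, vecs in samples.items()}
    return sorted(score, key=score.get, reverse=True)
```

Nodes whose samples point in roughly the same direction as the query are searched first, which is how the search is steered toward the right region of the space instead of flooding blindly.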

27
Content-directed search
[Figure: a 4x4 grid of nodes 1-16 with the query q and target documents a, b]
  • Search for two documents
  • N: list of nodes to search
  • Step 1: N = {6, 14, 11, 9}
  • Step 2: a is identified and N = {7, 14, 11, ...}
  • The closest document may not be on a direct
    routing neighbor
28
Content-directed replication / caching
[Figure: the same 4x4 grid of nodes 1-16 with the query q and documents a, b]
  • Selectively replicate contents stored on
    surrounding nodes
  • The threshold is set according to the node's
    storage capacity, computing power, and network
    connectivity
29
Experimental Results
  • Software packages:
  • SMART (Cornell), LAS2 from SVDPACK (netlib), and a
    CAN simulator
  • Validated correctness using the MEDLINE corpus
    [Buckley]
  • Experiment with TREC-7 and TREC-8; Topics 351-450
    as queries
  • Built the term-by-document matrix by sampling 15%
    of documents
  • 79,316 sampled docs and 83,098 indexed terms
  • Projected all 528,543 docs onto 300 dimensions
    after SVD
  • Metrics:
  • Number of visited nodes
  • Accuracy = (|A ∩ B| / |A|) x 100%, where A is the
    set of documents returned by LSI and B is the set
    of documents returned by v-hash
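
The accuracy metric above is a straightforward set-overlap computation; a minimal sketch (the helper name is mine):

```python
def accuracy(lsi_results, vhash_results):
    # Accuracy = (|A ∩ B| / |A|) * 100%, where A is the result set of
    # the centralized LSI baseline and B is the set returned by v-hash.
    a, b = set(lsi_results), set(vhash_results)
    return 100.0 * len(a & b) / len(a)
```

For example, if v-hash recovers 3 of the baseline's 4 top documents, the accuracy is 75%.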

30
Scalability with respect to the system size
  • As the system size increases exponentially, the
    number of visited nodes increases only moderately
  • For a 32k-node system, v-hash can achieve an
    accuracy of 90% by visiting 139 nodes

31
Effect of the number of returned documents
  • 10,000 nodes in total
  • The number of visited nodes grows quickly, but
    the average number of nodes that need to be
    searched per returned document decreases
    drastically

32
Using actual contents and past queries to direct
searches
  • When queries have locality, learning from query
    history can increase accuracy while reducing
    the number of visited nodes

33
Replication improves search efficiency and
accuracy
  • Visit 24 nodes in a 10,000-node system to achieve
    accuracy higher than 96.8%
  • Replicating direct neighbors' content: storage
    scalability declines from O(n) to O(n/log(n))

34
An example of a large system of 128K nodes
  • The repl-query series uses both the content and
    past queries to guide the sampling
  • Combining replication and the query heuristics,
    it achieves an accuracy of 91.7% by visiting
    19 nodes, or an accuracy of 98% by visiting 45
    nodes

35
Discussions
  • V-hash requires the overlay to have a Cartesian
    space abstraction
  • For an individual doc or query, the most
    significant elements may not be contiguous
  • Clustering is needed for larger corpora
  • The element-hashing (e-hash) algorithm eliminates
    this constraint
  • We expect content-directed search to improve
    as the corpus size increases
  • Selective content replication and query-result
    caching have the potential to substantially
    improve performance while keeping storage
    scalability high
  • Future work: study how other IR algorithms such as
    PageRank can complement our approach
  • Integrate attribute-based, content-based, and
    context-based search

36
E-hash
[Figure: documents as weighted term lists ("Computer w1, Network w2";
"sports w1, Network w2"), with each term hashed individually onto the overlay]
  • Overlay: e.g., Chord, Pastry
  • Query: global_rank = Σ(local_ranking)
  • Intelligent storage management based on query
    patterns
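
The merge step of the query formula above can be sketched as follows. This assumes each term's node returns a local ranking as (document, score) pairs; the helper name and data shapes are illustrative, not from the paper:

```python
from collections import defaultdict

def global_rank(local_rankings):
    # e-hash merge step: each query term is hashed to a node, which
    # returns a local ranking of (doc, score) pairs. The global rank of
    # a document is the sum of its local scores across the query terms.
    totals = defaultdict(float)
    for ranking in local_rankings:
        for doc, score in ranking:
            totals[doc] += score
    return sorted(totals, key=totals.get, reverse=True)
```

A document that scores moderately on every query term can thus outrank one that scores highly on a single term.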

37
Conclusion
  • pSearch is the first system that organizes
    contents around their semantics in a P2P network.
  • This makes it possible to achieve accuracy
    comparable to state-of-the-art centralized IR
    systems while visiting only a small number of
    nodes.
  • We propose the use of hierarchical clustering,
    rolling-index, and content-directed search to
    reduce the dimensionality of the search space and
    to resolve the dimensionality mismatch between
    the semantic space and CAN
  • We employ content-aware node bootstrapping to
    balance the load
