Transcript: Exploiting locality for scalable information retrieval in peer-to-peer networks

1
  • Exploiting locality for scalable information
    retrieval in peer-to-peer networks
  • D. Zeinalipour-Yazti, Vana Kalogeraki, Dimitrios
    Gunopulos
  • Manos Moschous
  • November, 2004

2
Issues in p2p networks
  • Content-based vs. file-identifier information retrieval.
  • Dynamic (ad-hoc) networks.
  • Scalability (no global knowledge).
  • Query messages (flooding causes network congestion).
  • Recall rate.
  • Efficiency (recall rate / query messages).
  • Query Response Time (QRT).

3
IR in pure p2p networks
  • BFS technique
  • Each peer forwards the query to all of its neighbors.
  • Simple.
  • Poor performance and heavy network utilization.
  • Uses a TTL to limit the flood.
  • RBFS technique
  • Each peer forwards the query to a random subset of its neighbors (both rules are sketched below).
  • Reduces query messages.
  • Probabilistic algorithm, so it may miss parts of the network.
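As a rough Python illustration (not taken from the slides), the two forwarding rules could look as follows; the fraction parameter of RBFS is an assumption, since the slide does not say how large the random subset is.

  import random

  def bfs_forward(node, query, ttl, neighbors):
      # BFS: forward the query to every neighbor while the TTL allows it.
      if ttl <= 0:
          return []
      return [(peer, query, ttl - 1) for peer in neighbors[node]]

  def rbfs_forward(node, query, ttl, neighbors, fraction=0.5):
      # RBFS: forward the query to a random subset of the neighbors only.
      if ttl <= 0 or not neighbors[node]:
          return []
      subset = random.sample(neighbors[node],
                             max(1, int(len(neighbors[node]) * fraction)))
      return [(peer, query, ttl - 1) for peer in subset]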

4
IR in pure p2p networks
  • >RES technique
  • Each peer forwards the query to some of its peers based on aggregated statistics.
  • Heuristic: the most results in the past (over the last 10 queries).
  • Explores:
  • The larger network segments.
  • The most stable neighbors.
  • Not necessarily the nodes whose content is related to the query.
  • >RES is a quantitative rather than a qualitative approach.

5
The intelligent search mechanism (ISM)
  • Main idea: each peer estimates, for each query, which of its peers are more likely to answer it, and propagates the query message to those peers only.
  • Exploits the locality of past queries.
  • Some characteristics:
  • Entirely distributed (requires only local knowledge).
  • Scales well with the size of the network.
  • Scales well to large data sets.
  • Works well in dynamic environments.
  • High recall rates.
  • Minimizes communication costs.

6
Architecture (ISM) (1/4)
  • Profiling structure
  • A single queries table.
  • An LRU policy keeps the most recent queries.
  • The table size is limited → good performance (a minimal sketch follows).
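A minimal sketch of such an LRU-bounded queries table, assuming it maps each recent query to the number of results each neighbor returned for it; the exact layout is not given on the slide, and max_size and record are illustrative names.

  from collections import OrderedDict

  class QueryProfile:
      def __init__(self, max_size=100):
          self.max_size = max_size
          self.table = OrderedDict()          # query -> {peer: num_results}

      def record(self, query, peer, num_results):
          # query is assumed hashable, e.g. a frozenset of its keywords
          entry = self.table.pop(query, {})   # re-inserting marks the query most recent
          entry[peer] = num_results
          self.table[query] = entry
          if len(self.table) > self.max_size:
              self.table.popitem(last=False)  # evict the least recently used query

Bounding the table keeps memory and lookup cost per peer constant, which is why the limited table size translates into good performance.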

7
Architecture (ISM) (2/4)
  • Query similarity function Qsim (cosine similarity).
  • Assumption: a peer that has a document relevant to a given query is also likely to have other documents that are relevant to similar queries.
  • Qsim: Q² → [0, 1].
  • L: the set of all words that have appeared in queries; in the example below |L| = 4, i.e. L corresponds to the vector (1, 1, 1, 1).
  • Example query vectors: q = (1, 1, 0, 0) and q_i = (1, 0, 1, 0).
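Assuming the standard cosine similarity named on the slide, the worked value for the two example vectors is:

  Qsim(q, q_i) = (q · q_i) / (||q|| · ||q_i||) = 1 / (√2 · √2) = 0.5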

8
Architecture (ISM) (3/4)
  • Peer ranking: the Relevance Rank (RR) function (a reconstruction is given below).
  • P_i: a candidate peer.
  • P_l: the decision-maker node.
  • α: allows us to give more weight to the most similar queries.
  • S(P_i, q_j): the number of results returned by P_i for query q_j.
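The RR formula itself did not survive the transcript; a plausible reconstruction from the definitions above, which weights each past result count by the similarity of its query to the current query raised to the power α, is:

  RR_Pl(P_i, q) = Σ_j Qsim(q_j, q)^α · S(P_i, q_j)

where the sum runs over the past queries q_j that P_l has recorded for peer P_i.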

9
Architecture (ISM) (4/4)
  • Search mechanism
  • Invoke the RR function to rank the neighboring peers.
  • Forward the query to the top k (threshold) peers only (see the sketch below).
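A minimal Python sketch of this search step, assuming the profile maps each neighbor to a list of (past query terms, number of results) pairs; the helper names, the profile layout, and the default values of α and k are illustrative, only Qsim, RR, α and k themselves come from the slides.

  import math

  def qsim(q1, q2):
      # Cosine similarity between two queries given as sets of terms.
      if not q1 or not q2:
          return 0.0
      return len(q1 & q2) / math.sqrt(len(q1) * len(q2))

  def relevance_rank(past_queries, query, alpha=2.0):
      # RR: similarity-weighted sum of the results a peer returned in the past.
      return sum(qsim(past_q, query) ** alpha * results
                 for past_q, results in past_queries)

  def select_peers(profile, query, k=3, alpha=2.0):
      # Rank all neighbors with RR and forward the query to the top k only.
      ranked = sorted(profile,
                      key=lambda peer: relevance_rank(profile[peer], query, alpha),
                      reverse=True)
      return ranked[:k]

For example, a neighbor that previously returned 5 results for the query {austria, dollar} is ranked above a neighbor with an empty profile when the new query is {dollar, intervene}.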

10
Experiments
  • Peerware: a distributed middleware infrastructure.
  • GraphGen: generates network topologies.
  • dataPeer: a p2p client that answers boolean queries from its local XML repository (XQL).
  • SearchPeer: a p2p client that performs queries and harvests the answers back from a Peerware network (it connects to a dataPeer and performs the queries).

11
Experiments - DMP
  • If node P_k receives the same query q with some TTL2, where TTL2 > TTL1, we allow the TTL2 message to proceed.
  • This may allow q to reach more peers than its predecessor.
  • Without this fix the BFS behaviour is not predictable, and it may fail to reach the nodes it is supposed to reach.
  • Our experiments revealed that almost 30% of the forwarded queries were discarded because of DMP.
  • The experimental results presented in this work do not suffer from DMP.
  • This is the reason why the number of messages is slightly higher (by about 30%) than the expected number of messages.
  • The expected total number of messages is roughly the sum of the node degrees, Σ_{i=1..n} d_i, for n nodes each with degree d_i (worked out below for the Reuters topology).
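For the Reuters Peerware used in Set1 (104 nodes with average degree 8), this reconstructed bound works out to roughly 104 × 8 ≈ 832 messages per query; this is only an estimate implied by the formula above, not a figure reported in the experiments.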

12
Experiments - DMP
  • Query examples
  • Each query is a set of 4 keywords.
  • At least 1 keyword of more than 4 characters.
  • Random topology
  • Each vertex selects its d neighbors randomly.
  • Simple.
  • Leads to connected topologies if the degree d > log2(n) (a generator sketch follows the examples).

Example queries:
1 AUSTRIA INTERVENE DOES DOLLAR
2 APPROVES MEDITERRANEAN FINANCIAL PACKAGES
3 AGREES PEACE NEW MOVES
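A sketch of the topology generator implied by these bullets; the parameter names are illustrative, and since the slide does not say whether d counts the links a vertex initiates or its final degree, the reciprocal links here can push the realized degree above d.

  import random

  def random_topology(n, d):
      # Every vertex picks d neighbors uniformly at random; links are made
      # bidirectional, as p2p connections are.
      neighbors = {v: set() for v in range(n)}
      for v in range(n):
          candidates = [u for u in range(n) if u != v]
          for u in random.sample(candidates, d):
              neighbors[v].add(u)
              neighbors[u].add(v)
      return neighbors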
13
Experiments (Set1)
  • Reuters-21578 Peerware
  • Static random topology of 104 nodes with average degree 8 (running on a network of 75 workstations).
  • The documents are categorized by their country attribute (104 country files, one per node); each country file has at least 5 articles.
  • Data sets
  • Reuters 10X10: a set of 10 random queries repeated 10 consecutive times (high locality of similar queries), which suits ISM better.
  • Reuters 400: a set of 400 random queries uniformly sampled from the initial 104 country files (lower repetition).

14
Results (Set1) Reuters 10X10 (1/4)
  • Reducing query messages
  • ISM finds the most documents compared to RBFS and >RES.
  • ISM achieves an almost 90% recall rate while using only 38% of BFS's messages.
  • ISM and >RES start out with, and initially suffer from, a low recall rate.

15
Results (Set1) Reuters 10X10 (2/4)
  • Digging deeper by increasing the TTL
  • Reaches more nodes, deeper in the network.
  • ISM achieves a 100% recall rate while using only 57% of BFS's messages with TTL=4.

16
Results (Set1) Reuters 10X10 (3/4)
  • Reducing query response time (QRT)
  • ISM needs 30-60% of BFS's QRT for TTL=4 and 60-80% for TTL=5.
  • ISM requires more time than >RES because its decision involves some computation over the past queries.

17
Results (Set1) Reuters 400 (4/4)
  • Improving the recall rate over time
  • ISM achieves a 95% recall rate while using 38% of BFS's messages.
  • During queries 150-200, major outbreaks occur in BFS.
  • ISM requires a learning period of about 100 queries before it starts matching the performance of >RES.

18
Experiments (Set2)
  • TREC-LATimes Peerware (static random topology of 1,000 nodes).
  • It contains approximately 132,000 articles.
  • These articles were horizontally partitioned into 1,000 documents (each document contains 132 articles).
  • Each peer shares one or more of the 1,000 documents (articles are replicated).

19
Experiments (Set2)
  • Data sets
  • TREC 100: a set of 100 queries out of the initial 150 topics.
  • TREC 10X10: a list of 10 randomly sampled queries, out of the initial 150 topics, repeated 10 consecutive times.
  • TREC 50X2: we first generated a set a of 50 queries randomly sampled out of the initial 150 topics, and then merged it with a second list of 50 queries randomly sampled out of a.

20
Results (Set2) TREC100 (1/3)
  • Searching in a large-scale network topology
  • For TTL=5 we reach 859 of the 1,000 nodes (BFS).
  • For TTL=6 we reach 998 of the 1,000 nodes at a cost of 8,500 messages per query.
  • For TTL=7 we reach all nodes at a cost of 10,500 messages per query.
  • ISM will not exhibit any learning behavior if the frequency of the query terms is very low.

21
Results (Set2) TREC 10X10 (2/3)
  • The effect of high term frequency
  • The recall rate improves dramatically if the frequency of the query terms is high.
  • ISM achieves a higher recall rate than BFS (at BFS's TTL=5).
  • After a learning phase of 20-30 queries it scores 120% of BFS's recall rate while using 4 times fewer messages.

22
Results (Set2) TREC 50X2 (3/3)
  • The effect of high term frequency
  • A more realistic set: a few terms occur many times in the queries and most terms occur less frequently.
  • ISM monotonically improves its recall rate, and by the 90th query it again exceeds BFS's performance.
  • >RES's recall rate fluctuates and behaves as badly as RBFS's if the queries do not follow any constant pattern.

23
Experiments (Set3)
  • Searching in dynamic network topologies
  • Why network failures?
  • Misuse at the application layer (shutting the PC down without disconnecting).
  • Overwhelming amounts of generated network traffic.
  • Poorly written p2p clients.
  • Simulating a dynamic environment (sketched after this list):
  • The total number of suspended nodes is at most a drop_rate fraction of the network.
  • drop_rate is evaluated every k seconds against a random number r.
  • If r < drop_rate, the node breaks all of its incoming and outgoing connections (for l seconds).
  • In our experiments:
  • k = 60,000 ms and l = 60,000 ms.
  • TREC-LATimes Peerware with the TREC 10X10 query set.
  • drop_rate ∈ {0.0, 0.05, 0.1, 0.2}.
  • r is a random number generated uniformly in [0.0, 1.0).
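A sketch of this failure model; the node attributes and connection methods (suspended, resume_at, disconnect_all, reconnect_all) are hypothetical, and the global cap on the number of suspended nodes is not enforced here.

  import random

  K_MS = 60_000   # evaluation interval from the slide; the caller runs step() every K_MS
  L_MS = 60_000   # suspension duration from the slide

  def step(nodes, drop_rate, now_ms):
      for node in nodes:
          if node.suspended and now_ms >= node.resume_at:
              node.reconnect_all()            # the l-millisecond suspension is over
              node.suspended = False
          elif not node.suspended:
              if random.random() < drop_rate: # r is uniform in [0.0, 1.0)
                  node.disconnect_all()       # break all incoming and outgoing links
                  node.suspended = True
                  node.resume_at = now_ms + L_MS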

24
Results (Set3) (1/3)
  • BFS mechanism
  • Increasing the drop_rate decreases the number of messages.
  • BFS does not exhibit any learning behavior at any level of drop_rate.
  • BFS is tolerant of small drop_rates (5%) because it is highly redundant.

25
Results (Set3) (2/3)
  • >RES mechanism
  • Increasing the drop_rate decreases the number of messages.
  • >RES does not exhibit any learning behavior at any level of drop_rate.

26
Results (Set3) (3/3)
  • ISM mechanism
  • Increasing the drop_rate decreases the number of messages.
  • ISM copes quite well with low levels of drop_rate.
  • It is not expected to be tolerant of large drop_rates (the information gathered by the profiling structure becomes obsolete before it gets the chance to be utilized).

27
Extend ISM to different environments
  • The ISM mechanism could easily become the query routing protocol for some hybrid p2p environments (KaZaA, Gnutella).
  • Super peers form a backbone infrastructure (long-lived network connectivity).
  • Regular peers are unstable and less powerful.
  • How could it work?
  • A regular peer obtains a list of active super peers.
  • It connects to one or more super peers and posts its queries.
  • The super peer utilizes the ISM mechanism and forwards the query to a selected subset of its super-peer neighbors.

28
  • Thank you