Transcript: Exploiting locality for scalable information retrieval in peer-to-peer networks

1
  • Exploiting locality for scalable information
    retrieval in peer-to-peer networks
  • D. Zeinalipour-Yazti, Vana Kalogeraki, Dimitrios
    Gunopulos
  • Manos Moschous
  • November, 2004

2
Issues in p2p networks
  • Content-based vs. file-identifier information retrieval.
  • Dynamic (ad-hoc) networks.
  • Scalability (no global knowledge).
  • Query messages (flooding causes network congestion).
  • Recall rate.
  • Efficiency (recall rate / query messages).
  • Query Response Time (QRT).

3
IR in pure p2p networks
  • BFS technique
  • Each peer forwards the query to all of its neighbors.
  • Simple.
  • Poor performance and heavy network utilization.
  • Uses a TTL to limit the flood.
  • RBFS technique
  • Each peer forwards the query to a random subset of its neighbors (both rules are sketched below).
  • Reduces query messages.
  • Probabilistic algorithm, so it may miss parts of the network.
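As a rough Python illustration (not taken from the slides), the two forwarding rules could look as follows; the fraction parameter of RBFS is an assumption, since the slide does not say how large the random subset is.

  import random

  def bfs_forward(node, query, ttl, neighbors):
      # BFS: forward the query to every neighbor while the TTL allows it.
      if ttl <= 0:
          return []
      return [(peer, query, ttl - 1) for peer in neighbors[node]]

  def rbfs_forward(node, query, ttl, neighbors, fraction=0.5):
      # RBFS: forward the query to a random subset of the neighbors only.
      if ttl <= 0 or not neighbors[node]:
          return []
      subset = random.sample(neighbors[node],
                             max(1, int(len(neighbors[node]) * fraction)))
      return [(peer, query, ttl - 1) for peer in subset]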

4
IR in pure p2p networks
  • >RES technique
  • Each peer forwards the query to some of its peers based on aggregated statistics.
  • Heuristic: the most results in the past (over the last 10 queries).
  • Explores:
  • The larger network segments.
  • The most stable neighbors.
  • Not necessarily the nodes whose content is related to the query.
  • >RES is a quantitative rather than a qualitative approach.

5
The intelligent search mechanism (ISM)
  • Main idea: each peer estimates, for each query, which of its peers are more likely to answer it, and propagates the query message to those peers only.
  • Exploits the locality of past queries.
  • Some characteristics:
  • Entirely distributed (requires only local knowledge).
  • Scales well with the size of the network.
  • Scales well to large data sets.
  • Works well in dynamic environments.
  • High recall rates.
  • Minimizes communication costs.

6
Architecture (ISM) (1/4)
  • Profiling structure
  • A single queries table.
  • An LRU policy keeps the most recent queries.
  • The table size is limited → good performance (a minimal sketch follows).
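A minimal sketch of such an LRU-bounded queries table, assuming it maps each recent query to the number of results each neighbor returned for it; the exact layout is not given on the slide, and max_size and record are illustrative names.

  from collections import OrderedDict

  class QueryProfile:
      def __init__(self, max_size=100):
          self.max_size = max_size
          self.table = OrderedDict()          # query -> {peer: num_results}

      def record(self, query, peer, num_results):
          # query is assumed hashable, e.g. a frozenset of its keywords
          entry = self.table.pop(query, {})   # re-inserting marks the query most recent
          entry[peer] = num_results
          self.table[query] = entry
          if len(self.table) > self.max_size:
              self.table.popitem(last=False)  # evict the least recently used query

Bounding the table keeps memory and lookup cost per peer constant, which is why the limited table size translates into good performance.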

7
Architecture (ISM) (2/4)
  • Query similarity function Qsim (cosine similarity).
  • Assumption: a peer that has a document relevant to a given query is also likely to have other documents that are relevant to similar queries.
  • Qsim: Q² → [0, 1].
  • L: the set of all words that have appeared in queries; in the example below |L| = 4, i.e. L corresponds to the vector (1, 1, 1, 1).
  • Example query vectors: q = (1, 1, 0, 0) and q_i = (1, 0, 1, 0).
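Assuming the standard cosine similarity named on the slide, the worked value for the two example vectors is:

  Qsim(q, q_i) = (q · q_i) / (||q|| · ||q_i||) = 1 / (√2 · √2) = 0.5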

8
Architecture (ISM) (3/4)
  • Peer ranking: the Relevance Rank (RR) function (a reconstruction is given below).
  • P_i: a candidate peer.
  • P_l: the decision-maker node.
  • α: allows us to give more weight to the most similar queries.
  • S(P_i, q_j): the number of results returned by P_i for query q_j.
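The RR formula itself did not survive the transcript; a plausible reconstruction from the definitions above, which weights each past result count by the similarity of its query to the current query raised to the power α, is:

  RR_Pl(P_i, q) = Σ_j Qsim(q_j, q)^α · S(P_i, q_j)

where the sum runs over the past queries q_j that P_l has recorded for peer P_i.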

9
Architecture (ISM) (4/4)
  • Search mechanism
  • Invoke the RR function to rank the neighboring peers.
  • Forward the query to the top k (threshold) peers only (see the sketch below).
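A minimal Python sketch of this search step, assuming the profile maps each neighbor to a list of (past query terms, number of results) pairs; the helper names, the profile layout, and the default values of α and k are illustrative, only Qsim, RR, α and k themselves come from the slides.

  import math

  def qsim(q1, q2):
      # Cosine similarity between two queries given as sets of terms.
      if not q1 or not q2:
          return 0.0
      return len(q1 & q2) / math.sqrt(len(q1) * len(q2))

  def relevance_rank(past_queries, query, alpha=2.0):
      # RR: similarity-weighted sum of the results a peer returned in the past.
      return sum(qsim(past_q, query) ** alpha * results
                 for past_q, results in past_queries)

  def select_peers(profile, query, k=3, alpha=2.0):
      # Rank all neighbors with RR and forward the query to the top k only.
      ranked = sorted(profile,
                      key=lambda peer: relevance_rank(profile[peer], query, alpha),
                      reverse=True)
      return ranked[:k]

For example, a neighbor that previously returned 5 results for the query {austria, dollar} is ranked above a neighbor with an empty profile when the new query is {dollar, intervene}.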

10
Experiments
  • Peerware: a distributed middleware infrastructure.
  • GraphGen: generates network topologies.
  • dataPeer: a p2p client that answers boolean queries from its local XML repository (XQL).
  • SearchPeer: a p2p client that performs queries and harvests the answers back from a Peerware network (it connects to a dataPeer and performs the queries).

11
Experiments - DMP
  • If node P_k receives the same query q with some TTL2, where TTL2 > TTL1, we allow the TTL2 message to proceed.
  • This may allow q to reach more peers than its predecessor.
  • Without this fix the BFS behaviour is not predictable, and it may fail to reach the nodes it is supposed to reach.
  • Our experiments revealed that almost 30% of the forwarded queries were discarded because of DMP.
  • The experimental results presented in this work do not suffer from DMP.
  • This is the reason why the number of messages is slightly higher (by about 30%) than the expected number of messages.
  • The expected total number of messages is roughly the sum of the node degrees, Σ_{i=1..n} d_i, for n nodes each with degree d_i (worked out below for the Reuters topology).
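For the Reuters Peerware used in Set1 (104 nodes with average degree 8), this reconstructed bound works out to roughly 104 × 8 ≈ 832 messages per query; this is only an estimate implied by the formula above, not a figure reported in the experiments.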

12
Experiments - DMP
  • Query examples
  • Each query is a set of 4 keywords.
  • At least 1 keyword of more than 4 characters.
  • Random topology
  • Each vertex selects its d neighbors randomly.
  • Simple.
  • Leads to connected topologies if the degree d > log2(n) (a generator sketch follows the examples).

Example queries:
1 AUSTRIA INTERVENE DOES DOLLAR
2 APPROVES MEDITERRANEAN FINANCIAL PACKAGES
3 AGREES PEACE NEW MOVES
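A sketch of the topology generator implied by these bullets; the parameter names are illustrative, and since the slide does not say whether d counts the links a vertex initiates or its final degree, the reciprocal links here can push the realized degree above d.

  import random

  def random_topology(n, d):
      # Every vertex picks d neighbors uniformly at random; links are made
      # bidirectional, as p2p connections are.
      neighbors = {v: set() for v in range(n)}
      for v in range(n):
          candidates = [u for u in range(n) if u != v]
          for u in random.sample(candidates, d):
              neighbors[v].add(u)
              neighbors[u].add(v)
      return neighbors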
13
Experiments (Set1)
  • Reuters-21578 Peerware
  • Static random topology of 104 nodes with average degree 8 (running on a network of 75 workstations).
  • The documents are categorized by their country attribute (104 country files, one per node); each country file has at least 5 articles.
  • Data sets
  • Reuters 10X10: a set of 10 random queries repeated 10 consecutive times (high locality of similar queries), which suits ISM better.
  • Reuters 400: a set of 400 random queries uniformly sampled from the initial 104 country files (lower repetition).

14
Results (Set1) Reuters 10X10 (1/4)
  • Reducing query messages
  • ISM finds the most documents compared to RBFS and >RES.
  • ISM achieves an almost 90% recall rate while using only 38% of BFS's messages.
  • ISM and >RES start out with, and initially suffer from, a low recall rate.

15
Results (Set1) Reuters 10X10 (2/4)
  • Digging deeper by increasing the TTL
  • Reaches more nodes, deeper in the network.
  • ISM achieves a 100% recall rate while using only 57% of BFS's messages with TTL=4.

16
Results (Set1) Reuters 10X10 (3/4)
  • Reducing query response time (QRT)
  • ISM needs 30-60% of BFS's QRT for TTL=4 and 60-80% for TTL=5.
  • ISM requires more time than >RES because its decision involves some computation over the past queries.

17
Results (Set1) Reuters 400 (4/4)
  • Improving the recall rate over time
  • ISM achieves a 95% recall rate while using 38% of BFS's messages.
  • During queries 150-200, major outbreaks occur in BFS.
  • ISM requires a learning period of about 100 queries before it starts matching the performance of >RES.

18
Experiments (Set2)
  • TREC-LATimes Peerware (static random topology of 1,000 nodes).
  • It contains approximately 132,000 articles.
  • These articles were horizontally partitioned into 1,000 documents (each document contains 132 articles).
  • Each peer shares one or more of the 1,000 documents (articles are replicated).

19
Experiments (Set2)
  • Data sets
  • TREC 100: a set of 100 queries out of the initial 150 topics.
  • TREC 10X10: a list of 10 randomly sampled queries, out of the initial 150 topics, repeated 10 consecutive times.
  • TREC 50X2: we first generated a set a of 50 queries randomly sampled out of the initial 150 topics, and then merged it with a second list of 50 queries randomly sampled out of a.

20
Results (Set2) TREC100 (1/3)
  • Searching in a large-scale network topology
  • For TTL=5 we reach 859 of the 1,000 nodes (BFS).
  • For TTL=6 we reach 998 of the 1,000 nodes at a cost of 8,500 messages per query.
  • For TTL=7 we reach all nodes at a cost of 10,500 messages per query.
  • ISM will not exhibit any learning behavior if the frequency of the query terms is very low.

21
Results (Set2) TREC 10X10 (2/3)
  • The effect of high term frequency
  • The recall rate improves dramatically if the frequency of the query terms is high.
  • ISM achieves a higher recall rate than BFS (at BFS's TTL=5).
  • After a learning phase of 20-30 queries it scores 120% of BFS's recall rate while using 4 times fewer messages.

22
Results (Set2) TREC 50X2 (3/3)
  • The effect of high term frequency
  • A more realistic set: a few terms occur many times in the queries and most terms occur less frequently.
  • ISM monotonically improves its recall rate, and by the 90th query it again exceeds BFS's performance.
  • >RES's recall rate fluctuates and behaves as badly as RBFS's if the queries do not follow any constant pattern.

23
Experiments (Set3)
  • Searching in dynamic network topologies
  • Why network failures?
  • Misuse at the application layer (shutting the PC down without disconnecting).
  • Overwhelming amounts of generated network traffic.
  • Poorly written p2p clients.
  • Simulating a dynamic environment (sketched after this list):
  • The total number of suspended nodes is at most a drop_rate fraction of the network.
  • drop_rate is evaluated every k seconds against a random number r.
  • If r < drop_rate, the node breaks all of its incoming and outgoing connections (for l seconds).
  • In our experiments:
  • k = 60,000 ms and l = 60,000 ms.
  • TREC-LATimes Peerware with the TREC 10X10 query set.
  • drop_rate ∈ {0.0, 0.05, 0.1, 0.2}.
  • r is a random number generated uniformly in [0.0, 1.0).
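A sketch of this failure model; the node attributes and connection methods (suspended, resume_at, disconnect_all, reconnect_all) are hypothetical, and the global cap on the number of suspended nodes is not enforced here.

  import random

  K_MS = 60_000   # evaluation interval from the slide; the caller runs step() every K_MS
  L_MS = 60_000   # suspension duration from the slide

  def step(nodes, drop_rate, now_ms):
      for node in nodes:
          if node.suspended and now_ms >= node.resume_at:
              node.reconnect_all()            # the l-millisecond suspension is over
              node.suspended = False
          elif not node.suspended:
              if random.random() < drop_rate: # r is uniform in [0.0, 1.0)
                  node.disconnect_all()       # break all incoming and outgoing links
                  node.suspended = True
                  node.resume_at = now_ms + L_MS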

24
Results (Set3) (1/3)
  • BFS mechanism
  • Increasing the drop_rate decreases the number of messages.
  • BFS does not exhibit any learning behavior at any level of drop_rate.
  • BFS is tolerant of small drop_rates (5%) because it is highly redundant.

25
Results (Set3) (2/3)
  • >RES mechanism
  • Increasing the drop_rate decreases the number of messages.
  • >RES does not exhibit any learning behavior at any level of drop_rate.

26
Results (Set3) (3/3)
  • ISM mechanism
  • Increasing the drop_rate decreases the number of messages.
  • ISM copes quite well with low levels of drop_rate.
  • It is not expected to be tolerant of large drop_rates (the information gathered by the profiling structure becomes obsolete before it gets the chance to be utilized).

27
Extend ISM to different environments
  • The ISM mechanism could easily become the query routing protocol for some hybrid p2p environments (KaZaA, Gnutella).
  • Super peers form a backbone infrastructure (long-lived network connectivity).
  • Regular peers are unstable and less powerful.
  • How could it work?
  • A regular peer obtains a list of active super peers.
  • It connects to one or more super peers and posts its queries.
  • The super peer utilizes the ISM mechanism and forwards the query to a selected subset of its super-peer neighbors.

28
  • Thank you