Peer-to-Peer Information Search - PowerPoint PPT Presentation


Transcript and Presenter's Notes



1
Peer-to-Peer Information Search
  • Sebastian Michel
  • Ecole Polytechnique Fédérale Lausanne
  • Lausanne - Switzerland

  • Josiane Xavier Parreira
  • Max-Planck Institute for Informatics
  • Saarbrücken - Germany
2
Outline of Part 1
  • Introduction to P2P Systems
  • Distributed Hash Tables & Range Queries
  • Peer-to-Peer IR (Query Routing, Result Merging)
  • Overlapping Sources / Multi-key Statistics
  • Top-k Query Processing
  • Probabilistic Pruning
  • Distributed Top-k

3
P2P Systems
Peer: "one that is of equal standing with
another" (source: Merriam-Webster Online
Dictionary)
  • Known from Napster and others
  • Sharing of mostly illegal content (mp3, movies)
  • P2P = Pirate-to-Pirate??
  • New kind of network organization: no
    client/server anymore
  • Basic Ideas
  • Each peer connects to a few other peers
  • All peers together form powerful networks
  • Potential Benefits
  • No single point of failure
  • Load is spread across multiple peers
  • (Resilient to failures and dynamics)

4
Napster
  • Developed in 1998.
  • First P2P file-sharing system

(Figure: peers publish file statistics to the central
index server; file downloads happen directly between
peers)
  • Central server (index)
  • Client software sends information
    about users' contents to the server.
  • Users send queries to the server.
  • Server responds with the IPs of users
    that store matching files.
  • ⇒ Peer-to-peer file sharing!
5
Gnutella
  • Protocol for distributed file sharing
  • Started in 2000
  • In 2005: 1.81 million computers connected
  • Unstructured network
  • Truly decentralized
  • Uses message flooding during query execution.
  • Later versions with super nodes and query routing

http://www.slyck.com/news.php?story=814
6
Gnutella Style
(Figure: a query "Paris Hilton?" floods the network; the
TTL is decremented at each hop until it reaches 0)
7
Gnutella Style
  • Pros
  • no complex statistical bookkeeping
  • Cons
  • a lot of network traffic
  • some peers might not be reachable (TTL)
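The TTL-limited flooding above can be sketched in a few lines; the topology, peer names, and TTL values below are made up for illustration:

```python
from collections import deque

def flood(graph, start, ttl):
    """Forward a query to all neighbours, decrementing the TTL at
    each hop; peers beyond the TTL horizon are never reached."""
    reached = {start}
    frontier = deque([(start, ttl)])
    while frontier:
        peer, t = frontier.popleft()
        if t == 0:
            continue  # TTL exhausted: stop forwarding
        for neighbour in graph.get(peer, []):
            if neighbour not in reached:
                reached.add(neighbour)
                frontier.append((neighbour, t - 1))
    return reached

# A chain of peers: with TTL 2, peer D is not reachable from A.
topology = {"A": ["B"], "B": ["C"], "C": ["D"], "D": []}
print(sorted(flood(topology, "A", ttl=2)))  # ['A', 'B', 'C']
```

Raising the TTL reaches more peers at the cost of more messages, which is exactly the traffic/reachability trade-off listed above.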

8
BitTorrent
  • Idea: load sharing through file splitting
  • A lot of (legal) software distributors offer
    software through BitTorrent
  • Download information in a small .torrent file
  • One tracker node per file (specified in the
    torrent file)

(Figure: the client requests a random peer list from the
tracker node, then requests file segments 1-5 from
different peers in parallel)

Incentives: tit-for-tat. Each peer remembers
collaborative peers ⇒ different priorities
9
Literature
  • Book: "Peer-to-Peer: Harnessing the Power of
    Disruptive Technologies" by Andy Oram. O'Reilly
    Media, Inc.

10
Overlay Networks
  • On top of existing networks
  • Different ways to build an overlay network
  • structured
  • unstructured
  • hybrid

11
Self Properties (Promises)
  • Self-Organizing
  • evolves, grows..... without being guided/managed
  • Self-Optimizing
  • Self-Configuring
  • Self-Healing
  • Self-Restoration
  • Self-Diagnostics
  • Self-Protecting

12
Outline
  • Introduction to P2P Systems
  • Distributed Hash Tables & Range Queries
  • Peer-to-Peer IR (Query Routing, Result Merging)
  • Overlapping Sources / Multi-key Statistics
  • Top-k Query Processing
  • Probabilistic Pruning
  • Distributed Top-k

13
Distributed Hash Tables
  • Hash table: given a key, return the bucket id.
    Based on a hash function (like SHA-1)
  • Now distributed: for a given key, return the id
    of the peer currently responsible for the key.
  • Challenge: purely distributed protocols that cope
    with node failures, departures, arrivals.
  • No central manager.
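The key-to-peer mapping can be sketched as follows; the 16-bit ring size and the peer names are arbitrary choices for illustration:

```python
import hashlib

def ring_id(key, m=16):
    """Map a string onto an m-bit identifier ring via SHA-1."""
    digest = hashlib.sha1(key.encode()).digest()
    return int.from_bytes(digest, "big") % (2 ** m)

def responsible_peer(key, peer_ids):
    """Consistent hashing: the peer responsible for a key is the
    first peer at or after the key's position on the ring."""
    k = ring_id(key)
    successors = [p for p in sorted(peer_ids) if p >= k]
    return successors[0] if successors else min(peer_ids)  # wrap around

peers = [ring_id(f"peer-{i}") for i in range(8)]
owner = responsible_peer("myfile.mp3", peers)
print(owner in peers)  # True: exactly one peer is responsible
```

When a peer joins or leaves, only the keys between it and its ring neighbour change ownership, which is what makes this scheme attractive under churn.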

14
Chord
  • uses an m-bit identifier space ordered in a
    mod-2m circle, the Chord ring
  • maps peers and objects to identifiers in the
    Chord ring, using the hash function SHA-1
  • uses consistent hashing
  • an object with identifier id is placed on the
    successor peer, succ(id), which is the first node
    whose identifier is equal to, or follows id on
    the Chord ring
  • Key k (e.g., hash(file name))
  • is assigned to the node with
  • key p (e.g., hash(IP address))
  • such that k ≤ p and there is
  • no node p' with k ≤ p' and p' < p

Ion Stoica, Robert Morris, David R. Karger, M.
Frans Kaashoek, Hari Balakrishnan: Chord: A
scalable peer-to-peer lookup service for internet
applications. SIGCOMM 2001: 149-160
15
Chord
peer n maintains routing information about peers
that lie on the Chord ring at logarithmically
increasing distances ⇒ finger tables
(Figure: a Chord ring with peers p1, p8, p14, p21, p32,
p38, p42, p48, p51, p56; finger tables shown for p8,
p42, and p51)
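The finger table of a peer can be computed directly from the ring: a sketch, assuming entry i of peer n points to succ((n + 2^i) mod 2^m), with the peer identifiers from the example ring above:

```python
def finger_table(n, peer_ids, m=6):
    """Chord finger table for peer n on a 2^m ring: entry i points
    to succ((n + 2^i) mod 2^m), i = 0..m-1."""
    ring = 2 ** m
    peers = sorted(peer_ids)
    def succ(k):
        for p in peers:
            if p >= k:
                return p
        return peers[0]  # wrap around the ring

    return [succ((n + 2 ** i) % ring) for i in range(m)]

# Peers from the example ring (m = 6, identifiers mod 64).
peers = [1, 8, 14, 21, 32, 38, 42, 48, 51, 56]
print(finger_table(8, peers))  # [14, 14, 14, 21, 32, 42]
```

Because the targets are spaced at powers of two, each routing hop at least halves the remaining ring distance, giving the O(log n) lookup claimed above.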
16
Node Joins in Chord
(Figure: peer p42 joins between p38 and p48. p42 issues
lookup(42) to set its successor pointer to p48; p38
updates its successor pointer to p42; the keys in p42's
range, e.g. k39 and k40, move from p48 to p42)

Join pointer updates (sketch):
  init_finger_table(n'):
    successor = n'.find_successor(n)
    predecessor = successor.predecessor
    predecessor.successor = n
17
And others ...
  • P-Grid: Karl Aberer: P-Grid: A Self-Organizing
    Access Structure for P2P Information Systems.
    CoopIS 2001: 179-194
  • CAN: Sylvia Ratnasamy, Paul Francis, Mark
    Handley, Richard M. Karp, Scott Shenker: A
    scalable content-addressable network. SIGCOMM
    2001: 161-172
  • Pastry: Antony I. T. Rowstron, Peter Druschel:
    Pastry: Scalable, Decentralized Object Location,
    and Routing for Large-Scale Peer-to-Peer Systems.
    Middleware 2001: 329-350
  • Bamboo: Sean Rhea, Dennis Geels, Timothy Roscoe,
    and John Kubiatowicz: Handling Churn in a DHT.
    Proceedings of the USENIX Annual Technical
    Conference, June 2004.

18
Range queries
  • Range queries
  • A range query [v1, v2] searches for those peers
    which store data with key value k ∈ [v1, v2]
  • DHTs only support exact-match queries efficiently
  • The naïve approach to process range queries in
    DHTs is to query each value of the range
    individually
  • It is HIGHLY EXPENSIVE!

19
DHTs and Range Queries
An order-preserving hash function
usually leads to skewed distributions
  • There are two main solutions to cope with load
    imbalances, i.e., to perform load balancing
  • transferring load, or
  • replicating data

20
DHT and Range Queries (2)
  • Existing approaches to deal with range queries
  • Locality-preserving hashing
  • OP-Chord: Triantafillou et al. (2003), Skip
    Graphs: Aspnes et al. (2004)
  • Hashing ranges of values instead of each value
    individually
  • CAN-based: Andrzejak et al. (2002), Sahin et al.
    (2004)
  • Another problem in that context: access load
    imbalances
  • One possible solution: transferring hot data to
    deal with those load imbalances
  • However, data transfer does not solve access load
    imbalances under skewed access (query) distributions

21
HotRod: replicating hot arcs
Theoni Pitoura et al. EDBT 2006.
A peer is hot (or overloaded) when its load λ > λ_max,
where λ_max is the upper limit of its resource
capacity. An arc of peers is hot when at least one of
its peers is hot ⇒ replicate ranges of values
22
Efficient Load Balancing
23
Outline
  • Introduction to P2P Systems
  • Distributed Hash Tables & Range Queries
  • Peer-to-Peer IR (Query Routing, Result Merging)
  • Overlapping Sources / Multi-key Statistics
  • Top-k Query Processing
  • Probabilistic Pruning
  • Distributed Top-k

24
Building a P2P Search Engine (Peer-to-Peer
Information Retrieval)
  • Distributed Google
  • A P2P approach is well suited
  • large number of peers
  • exploit mostly idle resources
  • intellectual input of the user community
  • scalable and self-organizing

25
Information Retrieval Basics
(Figure: a document is mapped to the terms it contains)
26
Information Retrieval Basics (2)
Top-k Query Processing: find the k documents with the
highest total score

Query Execution: usually using some kind of
threshold algorithm
  - sequential scans over the index lists (round-robin)
  - (random accesses to fetch missing scores)
  - aggregate scores
  - stop when the threshold is reached

B-tree on terms,
index lists with (DocId, tf*idf) entries sorted by score

e.g., Fagin's algorithm TA or a variant without
random accesses
27
Going Distributed: Index Organization
  • document index
    (Figure: global index lists partitioned across peers)
  • peer index
  • every peer has its own collection (full
    documents)
  • distributed index = index of peer descriptions

28
(Full) Document Index
  • Straightforward adaptation of the centralized
    document index
  • Each peer is responsible for storing the index
    lists for a subset of terms.

Query Routing: DHT lookups
Query Execution: distributed top-k (TPUT 04, KLEE 05)
29
Peer Index
  • Each peer has its own local index (e.g., created
    by web crawls)
  • Peers publish compact per-term descriptions about
    their index

Query Routing:
  1. DHT lookups
  2. Retrieve metadata
  3. Find the most promising peers
Query Execution:
  - Send the complete query to those peers and
    merge the incoming results
30
P2P Search with Minerva
based on a scalable, churn-resilient DHT with O(log
n) key lookup

Query routing aims to optimize benefit/cost,
driven by distributed statistics on peers'
content quality, content overlap, freshness,
authority, trust, etc.
Maintain a semantic/social/statistical overlay
network (SON)
Exploit community behavior (bookmarks, links,
tags, clicks, etc.)
31
Two major Problems
  • Task of merging the obtained results into a final
    ranking: Result Merging
  • Task of finding high-quality peers: Query
    Routing
  • aka database/collection/peer selection
  • Overview articles
  • J. Callan. (2000). "Distributed information
    retrieval." In W. B. Croft, editor, Advances in
    Information Retrieval. Kluwer Academic
    Publishers. (pp. 127-150).
  • Weiyi Meng, Clement T. Yu, King-Lup Liu Building
    efficient and effective metasearch engines. ACM
    Comput. Surv. 34(1) 48-89 (2002)

32
Query Routing
  • Given a query Q = (term1, term2, ..., termN),
    select the most promising peers
  • Based on
  • per-term, per-peer statistics
  • document frequency
  • vocabulary size
  • normalization issues like
  • collection frequency
  • avg vocabulary size
  • Most popular
  • CORI, GlOSS, Decision-Theoretic Framework (DTF)

33
CORI
Apply document ranking to resource ranking

(Figure: a query q is matched against resources
p1, ..., pj over terms t1, ..., tk)

C: number of peers; df: document frequency;
cf: collection frequency; cw: distinct words
per peer
34
Literature
  • J. Callan. (2000). "Distributed information
    retrieval." In W. B. Croft, editor, Advances in
    Information Retrieval. Kluwer Academic
    Publishers. (pp. 127-150).
  • Weiyi Meng, Clement T. Yu, King-Lup Liu Building
    efficient and effective metasearch engines. ACM
    Comput. Surv. 34(1) 48-89 (2002)
  • CORI: James P. Callan, Zhihong Lu, W. Bruce
    Croft: Searching Distributed Collections with
    Inference Networks. SIGIR 1995: 21-28
  • GlOSS: Luis Gravano, Hector Garcia-Molina,
    Anthony Tomasic: GlOSS: Text-Source Discovery
    over the Internet. ACM Trans. Database Syst.
    24(2): 229-264 (1999)
  • Decision-Theoretic Framework: Norbert Fuhr: A
    Decision-Theoretic Approach to Database Selection
    in Networked IR. ACM Trans. Inf. Syst. 17(3):
    229-249 (1999)

35
Result Merging
  • Problem: incomparable scores
  • Different corpus statistics
  • df component used in tf*idf scoring functions is
    not globally known
  • user with lots of high-quality documents for term
    a ⇒ high df
  • non-expert user with some bad documents for term
    a ⇒ low df
  • Different scoring functions
  • completely different functions
  • different parameters in the same function

36
Result Merging Approaches
  • Score normalization by
  • using global statistics
  • computation of global statistics is difficult (not
    obvious)
  • solution: using gossip
  • score re-computation with the query initiator's
    local statistics
  • requires re-ranking and knowledge about document
    contents
  • score re-computation using query routing scores
  • routing score available anyway

37
Global DF Estimation
gdf (global doc. freq.) of a term is an interesting
key measure, but overlap among peers makes simple
distributed counting infeasible
  • hash sketches (Flajolet/Martin 1985)
  • duplicate-insensitive cardinality estimator for
    multisets
  • hash each multiset element x onto an m-bit
    bitvector and remember the least-significant 1-bit
  • rough intuition: the least-significant bit is set
    by half of the documents, the second bit by 1/4 of
    the documents, ...
  • theory says: the position of the most significant
    set bit is an estimator of log(n), n = #documents
  • higher accuracy: average over multiple iid sketches

38
Global DF Estimation
Hash sketches of different peers collected at the
directory peer: distributivity comes for free!
∪_i {ρ(h(x)) | x ∈ S_i} = {ρ(h(x)) | x ∈ ∪_i S_i}
  • gdf estimation algorithm
  • each peer p posts a hash sketch for each
    (discriminative) term t to the directory
  • the directory peer for term t forms the union of
    the incoming hash sketches
  • when a peer needs to know gdf(t), it simply asks
    the directory peer for t
  • sliding-window techniques for dynamic adjustment

Matthias Bender, Sebastian Michel, Peter
Triantafillou, Gerhard Weikum: Global Document
Frequency Estimation in Peer-to-Peer Web Search.
WebDB 2006
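A minimal hash-sketch sketch in Python; the 32-bit hash width is an arbitrary choice, and 0.77351 is the standard Flajolet-Martin correction constant:

```python
import hashlib

def rho(x):
    """Position of the least-significant 1-bit of x (0-based)."""
    return (x & -x).bit_length() - 1

def sketch(items):
    """Flajolet-Martin hash sketch: OR together one bit per item, at
    the position of the least-significant 1-bit of its hash."""
    bits = 0
    for item in items:
        h = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:4], "big")
        bits |= 1 << rho(h | (1 << 31))  # guard bit: rho is defined even for h = 0
    return bits

def estimate(bits):
    """Estimate the number of distinct items from a single sketch."""
    r = 0
    while bits & (1 << r):
        r += 1
    return 2 ** r / 0.77351

# Distributivity for free: the sketch of the union of peer collections
# is the bitwise OR of the per-peer sketches, duplicates and all.
a, b = sketch(range(0, 800)), sketch(range(400, 1200))
assert (a | b) == sketch(range(0, 1200))
print(estimate(a | b))  # rough estimate of the 1200 distinct documents
```

A single sketch is very coarse; as the slide notes, averaging multiple independent sketches is what gives usable accuracy.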
39
Outline
  • Introduction to P2P Systems
  • Distributed Hash Tables & Range Queries
  • Peer-to-Peer IR (Query Routing, Result Merging)
  • Overlapping Sources / Multi-key Statistics
  • Top-k Query Processing
  • Probabilistic Pruning
  • Distributed Top-k

40
Autonomous Peers ⇒ Overlapping Sources

(Figure: a querying peer selects among peers 1-4 whose
collections A, ..., E overlap; recall over the number of
selected peers grows slowly when the collections
overlap)
41
How?
  • Enrich published statistics with overlap
    estimators.
  • Interested in NOVELTY and QUALITY
  • Iterative greedy selection process
  • select the first peer based on quality
  • select the next peer by quality and novelty
  • Suitable synopses for overlap estimation
  • Bloom filters (Bloom 1970)
  • hash sketches (Flajolet/Martin 1985)
  • min-wise independent permutations (Broder 1997)

42
Min-Wise Independent Permutations (Broder 97)

Apply N random permutations to a set of ids and
remember the minimum under each permutation, e.g.:
  h1(x) = (7x + 3) mod 51
  h2(x) = (5x + 6) mod 51
  ...
  hN(x) = (3x + 9) mod 51

MIPs are an unbiased estimator of the overlap:
P[min{h(x) | x ∈ A} = min{h(y) | y ∈ B}]
  = |A ∩ B| / |A ∪ B|
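The estimator can be sketched directly from the formula above; the sets, the prime modulus, and the choice of 200 permutations are illustrative assumptions:

```python
import random

def min_hashes(ids, perms):
    """Minimum of each random linear permutation h(x) = (a*x + b) mod p."""
    return [min((a * x + b) % p for x in ids) for (a, b, p) in perms]

def resemblance(mins_a, mins_b):
    """Fraction of permutations whose minima agree: an unbiased
    estimator of |A ∩ B| / |A ∪ B|."""
    matches = sum(1 for ma, mb in zip(mins_a, mins_b) if ma == mb)
    return matches / len(mins_a)

random.seed(7)
p = 2_147_483_647  # a large prime, bigger than any id
perms = [(random.randrange(1, p), random.randrange(p), p) for _ in range(200)]
A = set(range(0, 600))
B = set(range(300, 900))       # true overlap: 300 / 900 = 1/3
est = resemblance(min_hashes(A, perms), min_hashes(B, perms))
print(round(est, 2))           # close to 0.33
```

Only the N minima per peer need to be published, so the synopsis size is independent of the collection size.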
43
Bloom Filter (Bloom 1970)
  • bit array of size m
  • k hash functions h_i: docId_space → {1,..,m}
  • insert n docs by hashing their ids and setting the
    corresponding bits
  • a document is in the Bloom filter if the
    corresponding bits are set
  • probability of false positives (pfp)
  • tradeoff: accuracy vs. efficiency

Andrei Broder and Michael Mitzenmacher: Network
Applications of Bloom Filters: A Survey. Internet
Mathematics 1(4). 2005.
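A compact Bloom filter sketch; the sizes m = 256 and k = 4 and the document ids are arbitrary, and deriving the k positions from one SHA-1 digest is just one convenient choice:

```python
import hashlib

class BloomFilter:
    def __init__(self, m=256, k=4):
        self.m, self.k, self.bits = m, k, 0

    def _positions(self, doc_id):
        # Derive k hash positions from slices of one SHA-1 digest.
        digest = hashlib.sha1(str(doc_id).encode()).digest()
        return [int.from_bytes(digest[4 * i:4 * i + 4], "big") % self.m
                for i in range(self.k)]

    def add(self, doc_id):
        for pos in self._positions(doc_id):
            self.bits |= 1 << pos

    def __contains__(self, doc_id):
        # May report a false positive, but never a false negative.
        return all(self.bits >> pos & 1 for pos in self._positions(doc_id))

bf = BloomFilter()
for d in ["doc1", "doc7", "doc42"]:
    bf.add(d)
print("doc7" in bf, "doc99" in bf)  # True, (almost certainly) False
```

Larger m lowers the false-positive probability but costs more bandwidth when the filter is shipped around: the accuracy vs. efficiency tradeoff from the slide.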
44
Multi-Key Statistics
  • solves an interesting problem
  • a peer with lots of docs on "american football" and
    lots of documents about "pop music" may not have a
    single document about "american music"
  • cannot be predicted using per-term statistics
45
Multi-Key Statistics in P2P
  • Motivation
  • estimated_quality(a AND b) from quality(a) and
    quality(b)? df_a and df_b alone cannot predict
    df_(a AND b)
  • Impossible (infeasible) to consider all
    term pairs, triplets, quadruples, ...
  • Query-driven: analyze query logs @ directory
    peers.
  • Data-driven verification
  • P(AnnaKournikova), P(AndyRodick),
    P(BerlinMarathon), ...
  • No additional messages, shorter lists, highly
    accurate

The whole process can be easily integrated
into peer-level P2P IR;
additional statistics are often not needed
Sebastian Michel, Matthias Bender, Nikos Ntarmos,
Peter Triantafillou, Gerhard Weikum, Christian
Zimmer Discovering and exploiting keyword and
attribute-value co-occurrences to improve P2P
routing indices. CIKM 2006 172-181
46
Single-term vs. multi-term P2P document indexing
(Figure: single-term indexing spreads long posting lists
for terms 1..M over peers 1..N; multi-term keys yield
short posting lists for keys 11..Nj)

Single-term indexing: long posting lists.
Multi-term indexing:
  - make use of highly discriminative keys
  - limit the influence of overly long index lists
  - consider term pairs (triplets, ...) for shorter
    lists
  ⇒ efficient query processing

Gleb Skobeltsyn, Toan Luu, Ivana Podnar Zarko,
Martin Rajman, Karl Aberer: Web text retrieval
with a P2P query-driven index. SIGIR 2007: 679-686
47
Literature
  • Overlap Awareness
  • Ronak Desai, Qi Yang, Zonghuan Wu, Weiyi Meng,
    Clement T. Yu: Identifying redundant search
    engines in a very large scale metasearch engine
    context. WIDM 2006: 51-58
  • Matthias Bender, Sebastian Michel, Peter
    Triantafillou, Gerhard Weikum, Christian Zimmer:
    Improving collection selection with overlap
    awareness in P2P search engines. SIGIR 2005:
    67-74
  • Thomas Hernandez, Subbarao Kambhampati: Improving
    text collection selection with coverage and
    overlap statistics. WWW (Special interest tracks
    and posters) 2005: 1128-1129
  • Sketches
  • Andrei Z. Broder, Moses Charikar, Alan M. Frieze,
    Michael Mitzenmacher: Min-Wise Independent
    Permutations. J. Comput. Syst. Sci. 60(3):
    630-659 (2000)
  • Philippe Flajolet, G. Nigel Martin: Probabilistic
    Counting Algorithms for Data Base Applications.
    J. Comput. Syst. Sci. 31(2): 182-209 (1985)
  • Andrei Broder and Michael Mitzenmacher: Network
    Applications of Bloom Filters: A Survey. Internet
    Mathematics 1(4). 2005.

48
Literature
  • Multi-key statistics
  • Ivana Podnar, Martin Rajman, Toan Luu, Fabius
    Klemm, Karl Aberer: Scalable Peer-to-Peer Web
    Retrieval with Highly Discriminative Keys. ICDE
    2007: 1096-1105
  • Gleb Skobeltsyn, Toan Luu, Ivana Podnar Zarko,
    Martin Rajman, Karl Aberer: Web text retrieval
    with a P2P query-driven index. SIGIR 2007:
    679-686
  • Sebastian Michel, Matthias Bender, Nikos Ntarmos,
    Peter Triantafillou, Gerhard Weikum, Christian
    Zimmer: Discovering and exploiting keyword and
    attribute-value co-occurrences to improve P2P
    routing indices. CIKM 2006: 172-181

49
Outline
  • Introduction to P2P Systems
  • Distributed Hash Tables & Range Queries
  • Peer-to-Peer IR (Query Routing, Result Merging)
  • Overlapping Sources / Multi-key Statistics
  • Top-k Query Processing
  • Probabilistic Pruning
  • Distributed Top-k

50
For the IR people ....
  • Why top-k?
  • Cannot take a look at all matching documents
  • E.g., Google provides millions of documents about
    Britney Spears

Requires ranking (scoring):
in text retrieval, for instance, a tf*idf-style score;
of course PageRank if you wish.
Remember part one: local query execution at each
peer (peer-index model) AND truly distributed
top-k processing in the full document index.
51
For the DB guys ...
  • Table with schema (id, attribute, value)

SELECT id, aggr(value) FROM table GROUP BY
id ORDER BY aggr(value) DESC LIMIT k
52
For the networking guys ...
Network Monitoring
Find clients that cause high network traffic.
IP Bytes in kB
192.168.1.7 31kB
192.168.1.3 23kB
192.168.1.4 12kB
IP Bytes in kB
192.168.1.8 81kB
192.168.1.3 33kB
192.168.1.1 12kB
IP Bytes in kB
192.168.1.4 53kB
192.168.1.3 21kB
192.168.1.1 9kB
IP Bytes in kB
192.168.1.1 29kB
192.168.1.4 28kB
192.168.1.5 12kB
53
Computational Model
  • m lists with (itemId, score) pairs sorted by
    score descending.
  • One list per attribute (e.g., term)
  • Aggregation function aggr()
  • Monotonicity is important:
  • for all items a, b: if s_i(a) ≤ s_i(b) for all
    i, then aggr(a) ≤ aggr(b),
    with s_i(x) denoting the score of item x in list i
  • Goal: return the top-k items w.r.t. their
    aggregated (overall) scores
54
How to process this?
  • Most popular: family of threshold algorithms
  • Fagin, 1999
  • Nepal/Ramakrishna, 1999
  • Güntzer/Balke/Kießling, 2001
  • Basic ideas
  • keep upper and lower score bounds for each
    document
  • lower bound (or worstscore): sum of the scores
    seen so far, assuming 0 for unseen dimensions
  • upper bound (or bestscore): lower bound + highest
    possible scores for the unseen dimensions
  • know what we've got, know what to expect
  • stop if no further step can improve the current
    (i.e., final) ranking

55
Fagin's NRA

NRA(q, L):
  top-k := ∅; candidates := ∅; min-k := 0
  scan all lists L_i (i = 1..m) in parallel:
    consider item d at position pos_i in L_i
    E(d) := E(d) ∪ {i}
    high_i := s_i(q_i, d)
    worstscore(d) := aggr{s_ν(q_ν, d) | ν ∈ E(d)}
    bestscore(d) := aggr{worstscore(d),
                         aggr{high_ν | ν ∉ E(d)}}
    if worstscore(d) > min-k then
      remove argmin_d'{worstscore(d') | d' ∈ top-k}
        from top-k
      add d to top-k
      min-k := min{worstscore(d') | d' ∈ top-k}
    else if bestscore(d) > min-k then
      candidates := candidates ∪ {d}
    threshold := max{bestscore(d') | d' ∈ candidates}
    if threshold ≤ min-k then exit
56
Top-k Search
(Figure: NRA execution for query q = (t1, t2, t3) over
three index lists on data items d1, ..., dn, k = 1.
At scan depth 1 the leader d78 has worstscore 0.9 and
bestscore 2.4; at depth 2, d78 has worstscore 1.4 and
bestscore 2.0; at depth 3, d10 has worstscore =
bestscore = 2.1 and no other item's bestscore exceeds
2.1 ⇒ STOP)
57
Outline
  • Introduction to P2P Systems
  • Distributed Hash Tables & Range Queries
  • Peer-to-Peer IR (Query Routing, Result Merging)
  • Overlapping Sources / Multi-key Statistics
  • Top-k Query Processing
  • Probabilistic Pruning
  • Distributed Top-k

58
Evolution of a Candidate's Score
Observation: pruning is often overly conservative
(deep scans, high memory for the priority queue)

(Figure: bestscore(d) and worstscore(d) converge towards
each other with increasing scan depth; d can safely be
dropped from the candidate queue once bestscore(d)
drops below min-k)
  • Approximate top-k
  • What is the probability that d qualifies for the
    top-k?
59
Safe Thresholding vs. Probabilistic Guarantees
  • NRA is based on the invariant: a candidate d may
    only be dropped if bestscore(d) ≤ min-k
  • Relaxed into a probabilistic threshold test:
    drop d if P[score(d) > min-k] ≤ ε,
    i.e., if d is unlikely to reach the top-k given
    its already-seen scores
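The probabilistic test can be illustrated with a toy score model. The uniform assumption and all numbers below are made up; the actual predictors in Theobald et al. (VLDB 2004) use histograms and convolutions:

```python
import random

def prob_qualifies(worst, unseen_highs, mink, samples=10000, seed=3):
    """Monte-Carlo estimate of P[final score of d > min-k], assuming
    each still-unseen score is uniform in [0, high_i] (toy model)."""
    random.seed(seed)
    hits = 0
    for _ in range(samples):
        s = worst + sum(random.uniform(0, h) for h in unseen_highs)
        if s > mink:
            hits += 1
    return hits / samples

# d has worstscore 1.1, two unseen lists with fronts 0.4 and 0.3,
# and the current min-k is 1.6.
p = prob_qualifies(1.1, [0.4, 0.3], 1.6)
print(round(p, 2))   # drop d if p is below the tolerated error eps
```

A deterministic NRA would have to keep this candidate (bestscore 1.8 > 1.6); the probabilistic test may drop it early and save queue memory and scan depth.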
60
Expected Result Quality
  • Missing relevant items
  • The probability p_miss of missing a true top-k
    object equals the probability of erroneously
    dropping a candidate from the queue
  • For each candidate: p_miss ≤ ε
  • With r true top-k objects returned: recall =
    precision = r/k (the answer always has size k)
  • ⇒ E[precision] = E[recall]

61
Outline
  • Introduction to P2P Systems
  • Distributed Hash Tables & Range Queries
  • Peer-to-Peer IR (Query Routing, Result Merging)
  • Overlapping Sources / Multi-key Statistics
  • Top-k Query Processing
  • Probabilistic Pruning
  • Distributed Top-k

62
Going distributed
  • Key observations
  • Network traffic is crucial
  • Number of round trips is crucial
  • Straightforward application of TA/NRA?
  • expensive: huge number of round trips
  • even with batching: unpredictable performance

63
Where is the data?
(Figure: query initiator P0 and cohort peers P1, P2, P3)
  • Consider
  • network consumption
  • per-peer load
  • latency (query response time)
  • network
  • I/O
  • processing

64
Three Phase Uniform Threshold Algorithm
Cao and Wang, PODC 2004
First distributed top-k algorithm with fixed
number of phases!
  • Exactly 3 phases
  • fetch the k best entries (d, s_j) from each of
    P1 ... Pm and aggregate (Σ_{j=1..m} s_j(d)) at the
    query initiator
  • ask each of P1 ... Pm for all entries with s_j >
    min-k / m and aggregate the results at the query
    initiator. min-k is the score of the item
    currently at rank k.
  • fetch the missing scores for all candidates by
    random lookups at P1 ... Pm
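The three phases can be sketched as follows; the per-peer score tables are toy data, and each peer's list is represented as a dict for brevity:

```python
def tput_topk(peer_lists, k):
    """Three-Phase Uniform Threshold (TPUT) sketch, sum aggregation.
    peer_lists: per peer, a dict doc -> score (conceptually sorted)."""
    m = len(peer_lists)
    # Phase 1: fetch each peer's local top-k, aggregate partial sums.
    partial = {}
    for lst in peer_lists:
        for doc, s in sorted(lst.items(), key=lambda x: -x[1])[:k]:
            partial[doc] = partial.get(doc, 0) + s
    mink = sorted(partial.values(), reverse=True)[k - 1]
    # Phase 2: fetch every entry with score > min-k / m.
    threshold = mink / m
    candidates = set()
    for lst in peer_lists:
        for doc, s in lst.items():
            if s > threshold:
                candidates.add(doc)
    # Phase 3: random lookups for the candidates' missing scores.
    final = {d: sum(lst.get(d, 0) for lst in peer_lists)
             for d in candidates}
    return sorted(final, key=final.get, reverse=True)[:k]

peers = [{"a": 0.9, "b": 0.5, "c": 0.1},
         {"b": 0.8, "a": 0.2, "c": 0.7},
         {"c": 0.9, "b": 0.6, "a": 0.1}]
print(tput_topk(peers, 2))  # ['b', 'c']  (totals: b=1.9, c=1.7, a=1.2)
```

Any true top-k item must exceed min-k/m in at least one list, so phase 2 cannot miss it; that is the exactness argument of the next slide.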

65
(Figure: coordinator peer P0 queries cohort peers Pi and
Pj, each scanning its score-sorted index list)
66
Analysis of TPUT
  • Theorem: TPUT is an exact algorithm, i.e., it
    identifies the true top-k items
  • Proof (sketch): TPUT cannot miss a true top-k
    item.
  • Assume it misses one, i.e., the item scores below
    min-k/m in all m lists.
  • ⇒ overall score < m · (min-k/m) = min-k
  • ⇒ not a true top-k item!

(Figure: state after phase 2 in lists 1-3; an item below
the min-k/m threshold in every list has a total score
below min-k)
67
Analysis of TPUT
  • if min-k / m is small, TPUT retrieves a lot of
    data in Phase 2
  • ⇒ high network traffic
  • random accesses
  • ⇒ high per-peer load
  • KLEE (VLDB 05)
  • Different philosophy: approximate answers
  • Efficiency
  • Reduces (docId, score)-pair transfers
  • no random accesses at each peer
  • Two pillars
  • The HistogramBlooms structure
  • The Candidate List Filter structure

68
Additional Data Structures
Goal: increase the min-k / m threshold
  • Equi-width histogram per index list; for each
    cell: a Bloom filter, the average score, and
    upper/lower score bounds
  • Usage: during Phase 1, fetch the top-k from
    each list plus the top-c cells
69
KLEE
(Figure: coordinator P0 and cohort peers Pi, Pj with
their score-sorted index lists, now augmented with the
per-list histogram/Bloom filter metadata)
70
KLEE Candidate Set Reduction
(Figure: the cohorts Pi, Pj send bit vectors of their
candidate documents above min-k / m; the coordinator P0
combines them with the current top-k into a candidate
set filter)
71
KLEE Candidate Retrieval
(Figure: the coordinator P0 sends the candidate set
filter to the cohorts Pi, Pj, which return the matching
score entries from their index lists for final
aggregation)
72
Literature
  • Ronald Fagin: Combining Fuzzy Information from
    Multiple Systems. J. Comput. Syst. Sci. 58(1):
    83-99 (1999)
  • Ronald Fagin, Amnon Lotem, Moni Naor: Optimal
    aggregation algorithms for middleware. J. Comput.
    Syst. Sci. 66(4): 614-656 (2003)
  • Surya Nepal, M. V. Ramakrishna: Query Processing
    Issues in Image (Multimedia) Databases. ICDE
    1999: 22-29
  • Ulrich Güntzer, Wolf-Tilo Balke, Werner Kießling:
    Towards Efficient Multi-Feature Queries in
    Heterogeneous Environments. ITCC 2001: 622-628
  • Martin Theobald, Gerhard Weikum, Ralf Schenkel:
    Top-k Query Evaluation with Probabilistic
    Guarantees. VLDB 2004: 648-659
  • Holger Bast, Debapriyo Majumdar, Ralf Schenkel,
    Martin Theobald, Gerhard Weikum: IO-Top-k:
    Index-access Optimized Top-k Query Processing.
    VLDB 2006: 475-486
  • Amélie Marian, Nicolas Bruno, Luis Gravano:
    Evaluating top-k queries over web-accessible
    databases. ACM Trans. Database Syst. 29(2):
    319-362 (2004)
  • Pei Cao, Zhe Wang: Efficient top-K query
    calculation in distributed networks. PODC 2004:
    206-215
  • Sebastian Michel, Peter Triantafillou, Gerhard
    Weikum: KLEE: A Framework for Distributed Top-k
    Query Algorithms. VLDB 2005: 637-648

73
Part II: Social Search
75
Motivation
  • People connected through a network
  • People create links to other people
  • Links can express friendship, recommendations,
    etc.
  • Different graph structures appear
  • Sharing interests
  • Enables users to find others who share common
    interests
  • Similar users can provide relevant content
  • Users and content are spread across different sites
  • Distributed nature and continuously increasing
    size call for peer-to-peer approaches

76
Outline of the Second Part
  • Link Analysis: The Web as a Graph
  • PageRank
  • Distributed Approaches
  • BlockRank
  • Local PageRank + ServerRank
  • Adaptive OPIC
  • JXP
  • Identifying common interests: Semantic Overlay
    Networks
  • Crespo and Garcia-Molina
  • pSearch
  • p2pDating
  • Social Networks: A new paradigm
  • What people share
  • Social graphs
  • Links, tags, users analysis

77
Links are everywhere
  • connecting Web pages

78
Links are everywhere
  • connecting people

Example of a Flickr friends network
79
Links are everywhere
  • connecting products

80
Link Analysis
  • The set of nodes/pages (e.g., web pages, people,
    products, etc.) and the links connecting them
    define a graph

81
Link Analysis
  • At the end we have something like this
  • Lots of useful information can be obtained from
    the analysis of such graphs

82
Adjacency Matrix
  • Matrix representation of graphs
  • Given a graph G with n nodes, its adjacency
    matrix A is n×n with
  • a_ij = 1, if there is a link from node i to node j
  • a_ij = 0, otherwise
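The definition translates directly into code; the four-edge graph below is a made-up example:

```python
def adjacency_matrix(n, edges):
    """a[i][j] = 1 iff there is a link from node i to node j."""
    a = [[0] * n for _ in range(n)]
    for i, j in edges:
        a[i][j] = 1
    return a

for row in adjacency_matrix(3, [(0, 1), (1, 2), (2, 0), (0, 2)]):
    print(row)
# [0, 1, 1]
# [0, 0, 1]
# [1, 0, 0]
```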

83
PageRank: Exploring the Wisdom of Crowds
  • Measures the relative importance of pages in the
    graph
  • Importance of a page depends on the importance of
    the pages that point to it
  • Random Surfer Model: once on a page, the surfer
    chooses to follow one of the outlinks with prob.
    α, or to jump to a random page with prob. (1 - α)
  • PR: probability of being at a certain
    page after a large enough number of jumps

S. Brin and L. Page: The anatomy of a large-scale
hypertextual web search engine. In WWW Conf., 1998.
84
PageRank Formal Definition
  • N → total number of pages
  • PR(p) → PageRank of page p
  • out(p) → outdegree of p
  • ε → random jump probability
  • PR(p) = ε/N + (1 - ε) · Σ_{q → p} PR(q)/out(q)
  • Can be computed using the power iteration method
  • In practice, more efficient versions can be used
  • Google is believed to use it on the Web graph,
    combined with other metrics, to rank their search
    results
85
PageRank Matrix Notation
  • A → matrix containing the transition
    probabilities: A = (1 - ε) · P + ε · E,
    where P_ij = 1/out(i) if there is a link from
    i to j, 0 otherwise, and E is the random jumps
    matrix
  • Probability distribution vector at time k:
    x^(k) = A^T x^(k-1), with x^(0) the starting
    vector
  • PageRank → stationary distribution of the Markov
    chain described by A, i.e., the principal
    eigenvector of A
86
Going Distributed
  • PageRank in principle needs the whole graph at
    one place
  • Shortcomings
  • Not scalable for huge graphs, like the Web
  • Slow update: PageRank on such a huge graph can
    take weeks
  • Not suitable for different network architectures
    (e.g., P2P)
  • Distributed approaches, where the graph is
    partitioned, are clearly needed
  • Some distributed approaches (more details on the
    next slides)
  • Local PageRank + ServerRank (Wang et al.)
  • BlockRank (Kamvar et al.)
  • JXP (Parreira et al.)

87
The Block Structure
  • Most links are among web pages inside the same
    host

(Figure: the adjacency matrix has a block structure,
with dense blocks for pages from Host A and Host B and
only sparse inter-host entries)

The block structure can be exploited for speeding up
and/or distributing the PR computation
88
BlockRank
  • PageRank in three steps
  • 1. Compute local PageRanks of pages for each
    host, considering only intra-host links
  • 2. Compute the importance of each host, using the
    local PR values and the inter-host links
  • 3. Combine the previous values to create the
    starting vector for the standard PR algorithm
  • Speeds up computation
  • Step 1 can be parallelized
  • Still needs the whole matrix for step 3

S. Kamvar, T. Haveliwala, C. Manning, G. Golub:
Exploiting the block structure of the web for
computing PageRank. Technical report, Stanford
University, 2003.
89
Going Distributed
  • Local PR + ServerRank
  • Similar to BlockRank
  • Local PR: PR computed inside each server using
    intra-server links
  • ServerRank: PR computed on the server graph using
    inter-server links
  • The server graph does not need to be
    materialized. Computation is done by exchanging
    messages among servers
  • Local PR and ServerRank are combined to
    approximate the true PR of a page
  • Values can be further refined by using local PR
    info in the ServerRank computation and vice versa.
  • Server partition can be a limitation

Y. Wang, D. J. DeWitt: Computing PageRank in a
distributed internet search system. In VLDB, 2004.
90
Partition at peer level
  • In P2P networks, server partition is not suitable

91
Partition at peer level
  • Every peer crawls Web fragments at its discretion
  • Peers have only local (incomplete) information
  • Pages might link to or be linked by pages at
    other peers
  • Overlaps between peers' graphs may occur
  • Peers are a priori unaware of other peers'
    contents

92
Adaptive OPIC
  • OPIC: Online Page Importance Computation
  • Computes the importance of a page on-line, with
    few resources
  • Algorithm
  • Pages initially receive some cash
  • Pages are randomly visited
  • When a page is visited, its cash is distributed
    between the pages it points to
  • The page importance for a given page is computed
    using the history of cash of that page

Serge Abiteboul, Mihai Preda, and Gregory Cobena.
Adaptive on-line page importance computation. In
WWW, 2003.
93
Adaptive OPIC
  • Example
  • Small Web of 3 pages
  • Alice has all the cash to start (importance is
    independent of the initial state)

Alice
George
Bob
Cash-Game History: Alice received 600 (200+400) = 40%;
Bob received 600 (200+100+300) = 40%; George
received 300 (200+100) = 20%
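The cash game can be simulated in a few lines. The slide does not give the exact link structure of the 3-page example, so the links below are an assumption for illustration, and pages are visited uniformly at random (OPIC only requires a fair visiting policy):

```python
import random

# Illustrative 3-page web; the slide's exact link structure is not given
links = {'Alice': ['Bob', 'George'], 'Bob': ['Alice'], 'George': ['Bob']}

cash = {p: 0.0 for p in links}
cash['Alice'] = 1.0                  # Alice starts with all the cash
history = {p: 0.0 for p in links}    # total cash each page has ever held

random.seed(0)
for _ in range(10000):
    page = random.choice(list(links))        # visit a page (any fair policy)
    history[page] += cash[page]              # record its current cash
    share = cash[page] / len(links[page])    # distribute cash to successors
    cash[page] = 0.0
    for succ in links[page]:
        cash[succ] += share

total = sum(history.values())
importance = {p: history[p] / total for p in links}
```

The importance of a page is its share of the accumulated cash history; no link matrix is ever stored.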
94
Adaptive OPIC
  • No particular graph partition
  • No need to store the link matrix
  • Adapts to changes in the web graph by
    considering only the recent part of the cash
    history for each page
  • Time window [now - T, now]
  • High number of messages exchanged
  • Does not handle the case where the same page is
    stored at more than one place

95
The JXP Algorithm
  • Decentralized algorithm for computing global
    authority scores of pages in a P2P Network
  • Runs locally at every peer
  • No coordinator, asynchronous
  • Combines local PageRank computations with meetings
    between peers
  • JXP scores converge to the true global PageRank
    scores

Josiane Xavier Parreira, Carlos Castillo,
Debora Donato, Sebastian Michel and Gerhard
Weikum The JXP Method for Robust PageRank
Approximation in a Peer-to-Peer Web Search
Network. The VLDB Journal, 2007.
96
The JXP Algorithm
  • World Node
  • Special node attached to the local graph at every
    peer
  • Compact representation of all other pages in the
    network
  • Special features
  • All links from local pages to external pages
    point to the World Node
  • Links from external pages that point to local
    pages (discovered during meetings) are
    represented at the World Node
  • Score and outdegree of these external pages are
    stored
  • World Node outgoing links are weighted to
    reflect the score mass given by the original link
  • Self-loop link to represent transitions among
    external pages

W
97
The JXP Algorithm
  • Initialization step
  • Local graph is extended by adding the world node
  • PageRank is computed in the extended graph → JXP
    scores
  • Main algorithm (for every Pi in the network)
  • Select Pj to meet
  • Update world node
  • Add edges for pages in Pj that point to pages in
    Pi
  • If an edge already exists at the world node, the
    score of the source page is updated by taking the
    highest of both scores
  • Compute PageRank → JXP scores

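The local computation at one peer can be sketched as PageRank on the local graph extended with a world node W. The weighting scheme is simplified here (the self-loop weight is an assumed constant) and the example peer data are made up, so treat this as a sketch of the idea, not the exact JXP update:

```python
import numpy as np

def jxp_scores(pages, links, ext_out, w_to_local, d=0.85, iters=200):
    """PageRank on the local graph extended with a world node W (sketch).

    pages      -- local page ids
    links      -- page -> list of local successors
    ext_out    -- page -> number of outlinks to external pages (routed to W)
    w_to_local -- local page -> weight of W's link to it (score mass of
                  external predecessors discovered during meetings)
    """
    nodes = list(pages) + ['W']
    idx = {p: i for i, p in enumerate(nodes)}
    n, w = len(nodes), idx['W']
    M = np.zeros((n, n))
    for p in pages:
        deg = len(links[p]) + ext_out[p]
        for q in links[p]:
            M[idx[q], idx[p]] = 1 / deg
        M[w, idx[p]] = ext_out[p] / deg     # external outlinks collapse into W
    total = sum(w_to_local.values()) + 1.0  # 1.0 = assumed self-loop weight
    for p, wt in w_to_local.items():
        M[idx[p], w] = wt / total
    M[w, w] = 1.0 / total                   # transitions among external pages
    v = np.ones(n) / n
    for _ in range(iters):
        v = d * (M @ v) + (1 - d) / n
    return dict(zip(nodes, v / v.sum()))

scores = jxp_scores(
    pages=['a', 'b'],
    links={'a': ['b'], 'b': []},
    ext_out={'a': 1, 'b': 2},
    w_to_local={'a': 0.3},   # one known external predecessor of page 'a'
)
```

After each meeting, the peer would add newly discovered external predecessors to `w_to_local` and recompute its scores.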
98
The JXP Algorithm
Theorem: In a fair series of JXP meetings, the
JXP scores of all nodes converge to the true
global PR scores.
99
Locating parts of the Graph
  • Finding peers that share common interests
  • Many applications can benefit from it
  • Distributed PR
  • In principle, peers need to send content only to
    the peers that contain their successors
  • Random messages guarantee that those peers will
    eventually be reached, but some messages will
    be wasted

100
WASTED MEETING! We want to avoid it!
101
Locating parts of the Graph
  • Query answering
  • Ideal: forward a query only to the peers that are
    more likely to provide good answers to it
  • Query flooding is very expensive
  • Hash-based lookups are not suitable for
    approximate queries

102
Locating parts of the Graph
  • Locating relevant peers
  • Increase performance
  • Reduce traffic load
  • Idea: group peers according to the semantics of
    their content and place them into different
    overlay networks

103
Outline of the Second Part
  • Link Analysis: The Web as a Graph
  • PageRank
  • Distributed Approaches
  • BlockRank
  • Local PageRank + ServerRank
  • Adaptive OPIC
  • JXP
  • Identifying common interests: Semantic Overlay
    Networks
  • Crespo and Garcia-Molina
  • pSearch
  • p2pDating
  • Social Networks: A new paradigm
  • What people share
  • Social graphs
  • Link, tag, and user analysis

104
Semantic Overlay Networks
  • Partition the P2P network into several thematic
    networks
  • Peers with similar or beneficial/complementary
    content are clustered together
  • Queries for some content are forwarded only to
    peers with such content
  • Flooding in smaller networks with smaller TTL (or
    more results with the same TTL)

105
Overlay Networks: Random vs. Semantic
  • Random
  • Peers connect to a small set of random peers
  • Queries are flooded through the network
  • Peers with unrelated content receive the query
  • Low performance: high number of messages
  • Low recall if only a few peers are contacted
  • Semantic
  • Peers connect to peers with related content →
    clusters of peers
  • Peers identify the query's topic and forward it only
    to the set of peers on that topic
  • Messages to peers with unrelated content are
    avoided
  • Better performance: smaller number of messages
  • High recall by asking only a few peers

106
When creating SONs
  • Two main things to consider
  • Node partitioning
  • Clustering criteria
  • Node partitioning - When does a peer belong to
    SON A?
  • When it contains a doc of type A
  • When it contains more than x docs of type A
  • Fewer peers per SON → more results sooner
  • Fewer SONs per peer → fewer connections
  • Clustering criteria - Clustering must provide
  • Load balance
  • Each category has a similar number of nodes
  • Each node belongs to a small number of categories
  • An easy and accurate way to classify a document

107
Crespo and Garcia-Molina
  • Uses a classification hierarchy to form the
    overlay networks
  • Documents and queries are classified into one or
    more concepts
  • Queries are forwarded to peers in the super/sub
    concepts

A. Crespo and H. Garcia-Molina. Semantic Overlay
Networks for P2P Systems. Technical report,
Stanford University, January 2003.
108
Crespo and Garcia-Molina
  • Reported results show a significant improvement
    in the number of messages
  • Music file-sharing scenario: to get half the
    documents that match a query
  • SONs: 461 msgs
  • Gnutella: 1731 msgs
  • SON links are logical: two peers that are
    connected in a SON can actually be many hops
    away from each other
  • The requirement that the hierarchy and the
    classification algorithm are shared among all
    nodes might be a problem

109
pSearch
  • Semantic Overlay on top of Content Addressable
    Networks (CANs)
  • Latent Semantic Indexing (LSI) is used to
    generate a semantic vector for each document
  • Semantic vectors are used as keys to store document
    indices in the CAN
  • Indices close in semantics are stored close in
    the overlay
  • Two types of operations
  • Publish document indices
  • Process queries

Chunqiang Tang, Zhichen Xu, and Sandhya
Dwarkadas. Peer-to-peer Information Retrieval
Using Self-Organizing Semantic Overlay Networks.
In SIGCOMM, 2003.
110
pSearch Key Idea
(Figure: a document mapped to a point in the semantic space)
111
pSearch Key Idea
(Figure: a document and a query mapped to nearby points in the semantic space)
112
Background: Content-Addressable Network
  • Partition Cartesian space into zones
  • Each zone is assigned to a computer
  • Neighboring zones are routing neighbors
  • An object key is a point in the space
  • Object lookup is done through routing

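A toy illustration of CAN-style lookup: three hardcoded zones partition the unit square, and a key is routed greedily to the neighbor whose zone center is closest to it. A real CAN maintains zones and neighbor sets dynamically as nodes join and leave; the fixed layout here is purely for illustration:

```python
# Three hardcoded zones partitioning the unit square; node -> zone bounds
zones = {
    'n1': (0.0, 0.5, 0.0, 1.0),   # (x_min, x_max, y_min, y_max)
    'n2': (0.5, 1.0, 0.0, 0.5),
    'n3': (0.5, 1.0, 0.5, 1.0),
}
neighbors = {'n1': ['n2', 'n3'], 'n2': ['n1', 'n3'], 'n3': ['n1', 'n2']}

def contains(node, point):
    x0, x1, y0, y1 = zones[node]
    return x0 <= point[0] < x1 and y0 <= point[1] < y1

def center(node):
    x0, x1, y0, y1 = zones[node]
    return ((x0 + x1) / 2, (y0 + y1) / 2)

def dist2(a, b):
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

def route(start, key):
    """Forward greedily to the neighbor whose zone center is closest to key."""
    node, path = start, [start]
    while not contains(node, key):
        node = min(neighbors[node], key=lambda nb: dist2(center(nb), key))
        path.append(node)
    return path

path = route('n1', (0.8, 0.2))   # this key falls inside n2's zone
```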
113
Background Vector Space Model
  • Term vectors represent documents and queries
  • Elements correspond to the importance of a term in
    the document or query
  • Statistical computation of vector elements
  • Term frequency × inverse document frequency
    (tf-idf)
  • Ranking of retrieved documents
  • Similarity between document vector and query
    vector

114
Background Vector Space Model
A: "books on computer networks"; B: "network
routing in P2P networks"; Q: "P2P network"
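The slide's example can be reproduced with a small tf-idf/cosine sketch. The idf formula and the crude plural stemming below are illustrative choices, not the slide's exact weighting:

```python
import math
from collections import Counter

docs = {'A': 'books on computer networks',
        'B': 'network routing in P2P networks'}
query = 'P2P network'

def terms(text):
    # crude normalization: lowercase and strip a plural 's'
    return [w.lower().rstrip('s') for w in text.split()]

vocab = sorted({t for d in docs.values() for t in terms(d)})
N = len(docs)
df = {t: sum(t in terms(d) for d in docs.values()) for t in vocab}

def tfidf(text):
    tf = Counter(terms(text))
    return [tf[t] * math.log(1 + N / df[t]) for t in vocab]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

vq = tfidf(query)
ranked = sorted(docs, key=lambda d: cosine(tfidf(docs[d]), vq), reverse=True)
```

B mentions both "P2P" and "network" (the latter twice), so it ranks above A for this query.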
115
Background Latent Semantic Indexing
  • The document vectors' dimension has to match the
    dimension of the CAN network
  • Latent Semantic Indexing uses Singular Value
    Decomposition (SVD)
  • high-dimensional term vector to low-dimensional
    semantic vector
  • elements correspond to the importance of an abstract
    concept in the document/query
  • Also helps to overcome the synonym problem (e.g., a
    user looks for "car" and doesn't find a document
    about "automobile")

116
Background Latent Semantic Indexing
(Figure: term-document matrix, with document vectors Va and Vb as columns
over the terms)
  • SVD: singular value decomposition
  • Reduces dimensionality
  • Suppresses noise
  • Discovers word semantics
  • Car <-> Automobile

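A small SVD sketch of the idea, on a made-up term-document matrix: "car" and "automobile" never co-occur, but both co-occur with "engine", so their rank-k semantic vectors end up aligned while "flower" stays unrelated. The matrix and the choice k = 2 are assumptions for the example:

```python
import numpy as np

# Toy term-document matrix: 'car' and 'automobile' never co-occur,
# but both co-occur with 'engine'; 'flower' is unrelated.
A = np.array([
    [1, 0, 1, 0, 0],   # car
    [0, 1, 0, 0, 0],   # automobile
    [1, 1, 1, 0, 0],   # engine
    [0, 0, 0, 1, 1],   # flower
], dtype=float)

U, S, Vt = np.linalg.svd(A, full_matrices=False)
k = 2                              # number of latent concepts to keep
fold = U[:, :k].T                  # folds a term vector into semantic space

def semantic(term_vec):
    v = fold @ term_vec
    return v / np.linalg.norm(v)

q_car  = semantic(np.array([1., 0., 0., 0.]))   # query "car"
d_auto = semantic(np.array([0., 1., 0., 0.]))   # doc about "automobile"
d_flow = semantic(np.array([0., 0., 0., 1.]))   # doc about "flower"
# q_car lies close to d_auto in semantic space, but not to d_flow
```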
117
pSearch Basic Algorithm Steps
  • Receive a new document A: generate a semantic
    vector Va, store the key in the index
  • Receive a new query Q: generate a semantic vector
    Vq, route the query in the overlay
  • The query is flooded to nodes within a radius r
  • r is determined by a similarity threshold or the
    number of wanted documents
  • All receiving nodes do a local search and report
    references to the best matching documents

118
pSearch Illustration
119
p2pDating
  • Start with a randomly connected network
  • Peers meet other peers they do not know (blind
    dates)
  • If a peer "likes" another, it will remember it as
    a "friend"
  • A remembers B → abstract link A → B
  • Directed links → preserve peers' autonomy
  • SONs dynamically evolve from the meeting process

J. X. Parreira et al. p2pDating: Real Life
Inspired Semantic Overlay Networks for Web
Search. Information Processing & Management 43,
643-664
120
p2pDating
  • Finding new friends
  • Random meetings (Blind dates)
  • Meet friends of friends

(Figure: peer A meets peer B, then meets B's friends)
If A and B are friends,
it is very likely that B's friends are friends
of A as well.
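The dating process can be simulated in a few lines. The peer profiles, the Jaccard similarity criterion, and the friend-list size below are assumptions chosen for illustration (the paper combines several criteria):

```python
import random

# Toy peers with interest profiles; similarity criterion = Jaccard overlap
profiles = {
    'p1': {'music', 'movies'}, 'p2': {'music', 'sports'},
    'p3': {'movies', 'music'}, 'p4': {'cooking'}, 'p5': {'cooking', 'garden'},
}
friends = {p: set() for p in profiles}   # directed: each peer's own list
MAX_FRIENDS = 2

def jaccard(a, b):
    return len(profiles[a] & profiles[b]) / len(profiles[a] | profiles[b])

def meet(a, b):
    """After a date, peer a keeps b only if b ranks among its best friends."""
    if a == b:
        return
    cand = friends[a] | {b}
    ranked = sorted(cand, key=lambda p: jaccard(a, p), reverse=True)
    friends[a] = set(ranked[:MAX_FRIENDS])

random.seed(1)
peers = list(profiles)
for _ in range(500):
    a, b = random.choice(peers), random.choice(peers)
    meet(a, b)                        # blind date
    for fof in list(friends[b]):
        meet(a, fof)                  # meet friends of friends
```

Links are directed, so each peer keeps its autonomy: A remembering B says nothing about B's own friend list.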
121
Defining Good Friends
  • Criteria for defining a good friend → a combination
    of different measures
  • History: credits for good behavior in the past
  • Response time, query result precision, etc.
  • Collection similarity
  • Collection overlap
  • Different ways of estimating the overlap between
    two collections
  • Number of links between peers
  • Etc.
  • Peers might have more than one list of friends
  • E.g., according to different criteria

122
Going Social
  • Before
  • Only a few content producers (e.g., companies,
    universities)
  • Analysis was done using the content itself plus a
    few implicit recommendations (links)
  • Very little information about the content
    consumers (mainly through query logs)
  • Nowadays
  • New technologies facilitate content sharing
  • Content consumers are now also content producers
    and content describers (e.g., explicit
    recommendations, tags, etc.)
  • More and more crowd wisdom can be harvested

123
Outline of the Second Part
  • Link Analysis: The Web as a Graph
  • PageRank
  • Distributed Approaches
  • BlockRank
  • Local PageRank + ServerRank
  • Adaptive OPIC
  • JXP
  • Identifying common interests: Semantic Overlay
    Networks
  • Crespo and Garcia-Molina
  • pSearch
  • p2pDating
  • Social Networks: A new paradigm
  • What people share
  • Social graphs
  • Link, tag, and user analysis

124
125
Social Networks
  • A social structure made of nodes (which are
    generally individuals or organizations) that are
    tied by one or more specific types of relations,
    such as
  • values
  • visions
  • ideas
  • friends
  • conflict
  • web links
  • etc.
  • Social networks have been studied for over a
    century

126
Social Network Services
  • Enable the creation of online social networks for
    communities of people who share interests and
    activities, or who are interested in exploring
    the interests and activities of others
  • Online communities offer an easy way for users to
    publish and share their content.

127
Social Networking Growth
  • Several social networking sites have experienced
    dramatic growth during the past year.

Worldwide Growth of Selected Social Networking
Sites. June 2007 vs. June 2006, Users Age 15+.
Source: comScore
Social Networking Site   Total Unique Visitors (Mio.)
                         Jun-06    Jun-07    Change
MySpace                   66.41    114.15      +72%
Facebook                  14.08     52.17     +270%
Hi5                       18.10     28.17      +56%
Friendster                14.92     24.68      +65%
Orkut                     13.59     24.12      +78%
Bebo                       6.69     18.20     +172%
Tagged                     1.51     13.17     +774%
128
What people share
129
Social Networks
  • Besides sharing content, a user can
  • describe documents using tags
  • maintain a list of friends
  • make comments on other users' content, exchange
    opinions, discover users with a similar profile
  • In contrast to the Web graph, in social graphs
    users are part of the model

130
Social Content Graph
Sihem Amer-Yahia, Michael Benedikt, Philip
Bohannon Challenges in Searching Online
Communities. IEEE Data Eng. Bull. 30(2) 23-31
(2007)
131
Social Graphs
  • Other models also possible
  • Directed vs. Undirected edges
  • Etc.

Standard IR techniques for Web retrieval need to
be adapted to work on social networks - a lot of
current research is dedicated to this area
132
Social Networks
  • The Wisdom of Crowds: Beyond PR
  • Spectral analysis of various graphs
  • E.g., SocialPageRank, FolkRank
  • Tag semantic analysis
  • Discovering semantics from tag co-occurrence
  • E.g., SocialSimRank
  • Distributed view
  • Exploiting social relations to enhance search
  • E.g., PeerSpective

133
Link Analysis in Social Networks
  • SocialPageRank
  • High-quality web pages are usually popularly
    annotated, and popular web pages, up-to-date web
    users, and hot social annotations can mutually
    enhance each other
  • Let MUT, MTD, MDU be the matrices corresponding
    to the relations Users-Tags, Tags-Docs, Docs-Users
  • Compute iteratively


S. Bao, X. Wu, B. Fei, G. Xue, Z. Su, Y. Yu
Optimizing Web Search Using Social Annotation.
WWW 2007
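The iteration can be sketched as a power iteration that cycles scores through the three association matrices. The toy matrices and the normalization scheme below are assumptions for illustration; see the cited paper for the exact update:

```python
import numpy as np

# Toy association-count matrices (assumed data):
M_UT = np.array([[1, 0, 2], [0, 1, 1], [2, 1, 0]], dtype=float)            # users x tags
M_TD = np.array([[1, 2, 0, 1], [0, 1, 1, 0], [2, 0, 1, 1]], dtype=float)   # tags x docs
M_DU = np.array([[1, 0, 1], [0, 2, 0], [1, 1, 0], [0, 0, 2]], dtype=float) # docs x users

u = np.ones(3)   # user scores
t = np.ones(3)   # tag scores
d = np.ones(4)   # document scores

for _ in range(100):
    t = M_UT.T @ u           # up-to-date users make annotations hot
    d = M_TD.T @ t           # hot annotations make pages popular
    u = M_DU.T @ d           # popular pages make users up-to-date
    u /= np.linalg.norm(u)   # normalize so the scores stay bounded
    t /= np.linalg.norm(t)
    d /= np.linalg.norm(d)

social_pagerank = d          # final document popularity scores
```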
134
Link Analysis in Social Networks
  • FolkRank
  • Define the graph G as the union of the graphs
    Users-Tags, Tags-Docs, Docs-Users
  • Assume each user has a personal preference vector
  • Compute iteratively
  • The FolkRank vector of docs is the difference
    between the rank vectors computed with and
    without the preference vector

Andreas Hotho, Robert Jäschke, Christoph Schmitz,
Gerd Stumme Information Retrieval in
Folksonomies Search and Ranking. ESWC 2006
411-426
135
Tag Similarity
  • SocialSimRank
  • Idea: similar annotations (tags) are usually
    assigned to similar web pages by users with
    common interests
  • sim(t1, t2) = aggr { sim(d1, d2) : (t1, d1),
    (t2, d2) ∈ Tagging }
  • sim(d1, d2) = aggr { sim(t1, t2) : (t1, d1),
    (t2, d2) ∈ Tagging }

S. Bao, X. Wu, B. Fei, G. Xue, Z. Su, Y. Yu
Optimizing Web Search Using Social Annotation.
WWW 2007
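The mutual recursion can be sketched SimRank-style on a made-up tagging relation, instantiating "aggr" as a damped average (the damping constant C and the toy data are assumptions):

```python
# Toy tagging relation: (tag, doc) pairs
tagging = {('jaguar', 'd1'), ('car', 'd1'), ('car', 'd2'), ('auto', 'd2'),
           ('cat', 'd3')}
tags = sorted({t for t, _ in tagging})
docs = sorted({d for _, d in tagging})
D = {t: {d for t2, d in tagging if t2 == t} for t in tags}   # docs per tag
T = {d: {t for t, d2 in tagging if d2 == d} for d in docs}   # tags per doc

C = 0.8   # damping constant, as in SimRank
sim_t = {(a, b): 1.0 if a == b else 0.0 for a in tags for b in tags}
sim_d = {(a, b): 1.0 if a == b else 0.0 for a in docs for b in docs}

for _ in range(10):
    new_t, new_d = {}, {}
    for a in tags:
        for b in tags:
            if a == b:
                new_t[(a, b)] = 1.0
            else:  # damped average of doc similarities
                s = sum(sim_d[(x, y)] for x in D[a] for y in D[b])
                new_t[(a, b)] = C * s / (len(D[a]) * len(D[b]))
    for a in docs:
        for b in docs:
            if a == b:
                new_d[(a, b)] = 1.0
            else:  # damped average of tag similarities
                s = sum(sim_t[(x, y)] for x in T[a] for y in T[b])
                new_d[(a, b)] = C * s / (len(T[a]) * len(T[b]))
    sim_t, sim_d = new_t, new_d
```

Here "jaguar" and "auto" never tag the same page, but both co-occur with "car", so they end up more similar than the unrelated "cat".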
136
Exploring friendship connections
  • PeerSpective: users can query their friends'
    viewed pages
  • HTTP proxies on users' computers index all browsed
    content
  • When a Google search is performed, the query is
    also sent to the other proxies in parallel

Alan Mislove, Krishna P. Gummadi, and Peter
Druschel. Exploiting Social Networks for Internet
Search. HotNets, 2006.
137
Social Networks
  • New paradigm of publishing and searching content
  • Rich data
  • Different link structures
  • User input for free!
  • Relatively recent topic: lots of research
    opportunities
  • The works mentioned are by no means complete;
    still a lot to do

Since we are talking about Web 2.0:
http://p2pinformationsearch.blogspot.com/