Scaling link-based similarity search

1
Scaling link-based similarity search
Dániel Fogaras, Balázs Rácz
Computer and Automation Research Institute of the
Hungarian Academy of Sciences
Budapest University of Technology and Economics
2
Outline
  • Introduction
  • Scaling link-based similarity search
  • First scalable algorithm for SimRank
  • New similarity functions
  • Experiments

3
Introduction / Motivation
  • Similarity search on the Web

4
Approaches / Related Results
  • Text-based
    • Classic IR
    • Min-hash fingerprinting (Broder 98)
  • Pure link-based
    • Single-step: cocitation, bibliographic coupling
    • Multi-step
      • Companion (Dean, Henzinger, 98)
      • SimRank (Jeh, Widom, 02)
  • Hybrid
    • Anchor text based (Haveliwala et al. 02)

[Slide annotations: random access; quadratic]
5
Scalability requirements
  • Architecture
  • Webgraph
    • V = 8,000,000,000 web pages
    • E = 80,000,000,000 hyperlinks
  • Indexing
    • Limited main memory, stream access to graph
    • Within sorting time
    • Parallelizable
  • Distributed index database
  • Query
    • Limited number of DB accesses
    • Parallelizable

[Architecture diagram: Webgraph, Indexing, three Database boxes, Query (Sim(u,v)? Top(u)?)]
6
SimRank
  • Jeh and Widom, 2002
  • The similarity of two pages is the average
    similarity of their referring pages
  • Formalized as a PageRank-like equation
  • Power iteration: quadratic storage and time
  • Goal: quadratic → linear
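
For reference, the PageRank-like equation is commonly written as follows. The notation is an assumption made here and does not appear on the slide: I(u) is the set of pages linking to u, and c in (0,1) is the decay factor.

```latex
\mathrm{sim}(u,u) = 1, \qquad
\mathrm{sim}(u,v) = \frac{c}{|I(u)|\,|I(v)|}
  \sum_{u' \in I(u)} \sum_{v' \in I(v)} \mathrm{sim}(u',v') \quad (u \neq v),
\qquad
\mathrm{sim}(u,v) = 0 \ \text{ if } I(u) = \emptyset \text{ or } I(v) = \emptyset .
```

Power iteration keeps one score for every pair of pages, which is where the quadratic storage and time come from.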

7
Randomization
  • For pages u and v, start two random walks from
    them, following the links backwards.
  • Let t be the first meeting time
  • Jeh, Widom: sim(u,v) = expected value of c^t
  • Our algorithm
    • Monte Carlo method
    • Simulate N independent pairs of random walks
    • Approximate sim with the average of c^t
  • Index DB: N random walks for each page
  • Query: calculate meeting times
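
The estimator on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation: the graph is assumed to be a dict `in_links` mapping each page to the list of pages linking to it, walks are truncated at a hypothetical `length`, and a pair of walks that never meets contributes 0. All function and parameter names are made up for the example.

```python
import random


def reverse_walk(in_links, start, length, rng):
    """Walk up to `length` steps backwards along links; stop at a page with no in-links."""
    path = [start]
    for _ in range(length):
        preds = in_links.get(path[-1], [])
        if not preds:
            break
        path.append(rng.choice(preds))
    return path


def first_meeting_time(path_u, path_v):
    """Smallest step t >= 1 at which both walks are on the same page, else None."""
    for t in range(1, min(len(path_u), len(path_v))):
        if path_u[t] == path_v[t]:
            return t
    return None


def estimate_simrank(in_links, u, v, c=0.6, n_sim=100, length=10, seed=0):
    """Average c**t over n_sim independent pairs of truncated backward walks."""
    if u == v:
        return 1.0
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_sim):
        t = first_meeting_time(
            reverse_walk(in_links, u, length, rng),
            reverse_walk(in_links, v, length, rng),
        )
        if t is not None:
            total += c ** t
    return total / n_sim
```

In an index/query split, the walks would be generated once at indexing time and only `first_meeting_time` would run per query.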

8
Derandomization (partially)
  • trick 1: pair-wise independence is enough
  • trick 2: anything after the first meeting is
    irrelevant
  • → coalescing (sticky) walks
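
One way to picture coalescing walks is sketched below, under the assumption that in each step every page draws a single random in-neighbor and every walk currently on that page follows it. Walks on different pages still use independent draws, and walks that meet stay together. The dict `in_links` is assumed to contain every page as a key (with an empty list for pages without in-links); the names are hypothetical.

```python
import random


def coalescing_walks(in_links, length, seed=0):
    """One simulation: paths[p][t] is the page reached from p after t backward steps.

    None marks a walk that ran into a page without in-links.
    """
    rng = random.Random(seed)
    pages = sorted(in_links)
    paths = {p: [p] for p in pages}
    for _ in range(length):
        # One shared random choice per page for this step.
        step = {p: (rng.choice(in_links[p]) if in_links[p] else None) for p in pages}
        for p in pages:
            here = paths[p][-1]
            paths[p].append(None if here is None else step[here])
    return paths
```

Because all walks on the same page take the same edge, only one choice per page and step has to be drawn, regardless of how many walks pass through it.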

9
Compact storage
  • trick 3: We only need the time of the first
    meeting, not the path itself

For the path of u4, the first smaller-indexed
path it meets is the path of u3; the meeting time
is 3.
Storage: L integers/path → 2 integers/path
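
The two integers per path can be illustrated with a small sketch. It assumes coalescing walks (paths that meet stay together), stores for each path the first smaller-indexed path it meets plus the meeting time, and answers a pairwise query by taking the largest stored time on the tree path between the two entries. The quadratic build loop is for illustration only and is not the stream-based indexing described in the talk; all names are hypothetical.

```python
def compact(paths):
    """For path i, store (j, t): the smaller-indexed path j it meets first, and when."""
    tree = []
    for i, path in enumerate(paths):
        best = None
        for j in range(i):
            for t in range(1, min(len(path), len(paths[j]))):
                if path[t] is not None and path[t] == paths[j][t]:
                    if best is None or t < best[1]:
                        best = (j, t)
                    break
        tree.append(best)
    return tree


def meeting_time(tree, u, v):
    """First meeting time of paths u and v, or None if they never meet."""
    latest = 0
    while u != v:
        if u < v:
            u, v = v, u  # always climb from the larger index
        if tree[u] is None:
            return None
        u, t = tree[u]
        latest = max(latest, t)
    return latest
```

A possible usage, with made-up paths in which the fourth path first meets the third at step 3:

```python
paths = [
    ["u1", "e", "f", "g", "k"],
    ["u2", "e", "f", "g", "k"],
    ["u3", "p", "q", "r", "k"],
    ["u4", "x", "y", "r", "k"],
]
tree = compact(paths)
print(tree)                      # [None, (0, 1), (0, 4), (2, 3)]
print(meeting_time(tree, 3, 2))  # 3
print(meeting_time(tree, 3, 0))  # 4
```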
10
Gains
  • V = no. of pages (10^9)
  • N = no. of indep. simulations (100)
  • Indexing: stream access to the graph, V cells of
    memory (or external memory)
  • Index database size: N·V (500 GB)
  • Query: 2N disk seeks, time proportional to the
    number of results
  • Parallelizable to N machines (5 GB storage, 2
    disk seeks/query each)

11
Parallelization
  • Each back-end server: one simulation
  • Query: ask N servers, merge the results
  • Fault tolerance
    • When a server fails, merge N-1 result sets
  • Load balance: ask any N servers
  • Adapt to workload: under heavy load, ask fewer
    servers → slight loss of precision
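
The merge step could look like the sketch below. It assumes each server answers a query for page u with a dict mapping related pages to the first meeting time in that server's simulation, and that the front end averages c^t over however many servers responded. The response format and the name `merge_results` are assumptions for illustration.

```python
from collections import defaultdict


def merge_results(responses, c=0.6):
    """Combine per-server {page: first meeting time} dicts into ranked similarity scores."""
    if not responses:
        raise ValueError("no server answered")
    scores = defaultdict(float)
    for response in responses:
        for page, t in response.items():
            scores[page] += c ** t
    n = len(responses)
    # A page missing from a response contributes 0 for that simulation.
    ranked = [(page, total / n) for page, total in scores.items()]
    return sorted(ranked, key=lambda item: (-item[1], item[0]))
```

Dividing by the number of responses actually received means that a failed or skipped server only widens the error of the estimate; it does not bias the scale of the scores.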

12
New similarity functions
  • Problem with SimRank: nodes with high in-degree
    are dissimilar to all other nodes

When SimRank fails: pages u and v have k
witnesses for similarity, yet sim(u,v) is only about 1/k.
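
A short derivation of this effect, under two assumptions not stated on the slide: u and v have exactly the same k in-neighbors w_1, ..., w_k, and distinct witnesses have similarity 0 to each other. It uses the SimRank recursion in its usual form, with decay factor c and I(·) denoting the set of in-neighbors.

```latex
\mathrm{sim}(u,v)
  = \frac{c}{|I(u)|\,|I(v)|} \sum_{i=1}^{k} \sum_{j=1}^{k} \mathrm{sim}(w_i, w_j)
  = \frac{c}{k^2} \sum_{i=1}^{k} \mathrm{sim}(w_i, w_i)
  = \frac{c}{k^2} \cdot k
  = \frac{c}{k}
```

In random-walk terms, the two backward walks pick the same witness in their first step with probability only 1/k, so more shared witnesses lower the score.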
13
PSimRank
  • Coupling → walks attract each other, as if they
    were walking towards the same goal
  • Still, PSimRank can be computed within the same
    Monte Carlo similarity search framework (all
    scalability properties still hold!)
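
The slide does not spell out how the coupling is realized. One possible mechanism, assumed here purely for illustration, is a shared random ranking of the pages per step: every walk moves to its lowest-ranked in-neighbor, so two walks step to the same page with probability equal to the Jaccard coefficient of their in-neighbor sets. The function name and data layout are hypothetical.

```python
import random


def coupled_step(in_links, positions, rng):
    """Advance all walks one step using one shared random ranking of the pages.

    positions maps a walk id to its current page (None for a finished walk).
    """
    pages = sorted(set(in_links) | {q for preds in in_links.values() for q in preds})
    rng.shuffle(pages)
    rank = {page: i for i, page in enumerate(pages)}
    next_positions = {}
    for walk, here in positions.items():
        preds = in_links.get(here, []) if here is not None else []
        # Every walk picks its in-neighbor with the smallest rank in this step.
        next_positions[walk] = min(preds, key=rank.__getitem__) if preds else None
    return next_positions
```

Under this assumed coupling, two pages with identical in-neighbor sets always step to the same page, whatever their in-degree.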

14
Extended Jaccard-coefficient
  • Take the k-step in-neighborhood of pages u, v
  • Calculate their similarity using the Jaccard
    coefficient
  • Take the exponentially weighted sum in k
  • Storage trick does not work: increase in disk
    requirement (both storage and seeks/query);
    indexing is the same as before
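
The three steps can be sketched as an exact (non-Monte Carlo) computation on a small graph. Two details are assumptions made for the example and are not given on the slide: the k-step in-neighborhood is taken as the set of pages reachable by exactly k backward steps, and the weight of step k is c^k (1 - c). The cutoff `max_k` is also hypothetical.

```python
def in_neighborhood(in_links, page, k):
    """Pages reachable from `page` by exactly k backward steps (assumed definition)."""
    frontier = {page}
    for _ in range(k):
        frontier = {q for p in frontier for q in in_links.get(p, [])}
    return frontier


def jaccard(a, b):
    """Jaccard coefficient of two sets; 0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def extended_jaccard(in_links, u, v, c=0.6, max_k=5):
    """Exponentially weighted sum of Jaccard coefficients of k-step in-neighborhoods."""
    return sum(
        (c ** k) * (1 - c)
        * jaccard(in_neighborhood(in_links, u, k), in_neighborhood(in_links, v, k))
        for k in range(1, max_k + 1)
    )
```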

15
Experimental evaluation
  • Evaluation methodology: Haveliwala et al. 02
    • Uses Open Directory Project (dmoz.org)
    • Ground truth similarity in directory
    • Familial distance: documents in the same class
      are more similar than those in different classes
    • Compare orderings of familial distance and
      calculated similarity
  • Stanford WebBase
    • 80M pages including 200K ODP pages

16
Experiments 1: path length
Multi-step similarity does make sense! ...
17
Experiments 2: decay factor c
... but mostly when downweighted.
18
Experiments 3: number of simulations N
Note: recall (number of results) grows linearly.
19
Further results
  • In the paper
    • Formal analysis of error of approximation vs.
      number of independent simulations: exponential
      decay of error
  • WAW2004
    • Application of a similar random walk based Monte
      Carlo method to compute Personalized PageRank
    • Lower bound: in the worst case (i.e., on arbitrary
      graphs), a quadratic DB is required unless
      approximation is used

20
Conclusion
  • Approximation algorithm for multi-step/recursive
    similarity functions
    • Uses simulated random walks
    • Monte Carlo method
    • Scalable
  • New similarity functions
  • First sight of these on real(ly big) web data
    • Yes, they do make sense!

21
Open problems
  • Theoretical
    • Analysis of storage trick efficiency
    • Expected query time/result set size
  • Practical
    • Comparison with text/anchor text methods
  • Both
    • Further methods
      • Our methods are general
      • Methods that use special features of the web
        graph?
      • Goal: better approximation, recall
    • Combined methods