Web searching and graph similarity Vincent Blondel and Paul Van Dooren* CESAME, Universite Catholique de Louvain http://www.inma.ucl.ac.be/ - PowerPoint PPT Presentation

2
Web searching and graph similarity
Vincent Blondel and Paul Van Dooren
CESAME, Universite Catholique de Louvain
http://www.inma.ucl.ac.be/
  • Thanks to P. Senellart

    GAMM, 2003

4
The web graph
  • Nodes: web pages; edges: hyperlinks between pages
  • 3 billion pages (Google searched 3,083,324,625 web pages in 2002)
  • Average of 7 outgoing links per page
  • Growth of a few percent every month

5
Outline
  • 1. Structure of the web
  • 2. Methods for searching the web
  • (Google PageRank and Kleinberg Hits)
  • 3. Similarity in graphs
  • 4. Application to synonym extraction
    (Blondel-Sennelart)

6
Structure of the web
  • Experiments: two crawls over 200 million pages in 1999 found a giant strongly connected component (core)
  • It contains the most prominent sites
  • It contains 30% of all pages
  • Average distance between nodes is 16
  • Small world
  • Ref: Broder et al., Graph structure in the web, WWW9, 2000

7
The web is a bowtie
  • Ref: The web is a bowtie, Nature, May 11, 2000

8
In- and out-degree distributions
  • Power law distribution: the number of pages of in-degree $n$ is proportional to $1/n^{2.1}$ (Zipf law)

10
A score for every page
  • The score of a page is high if the page has many incoming links coming from pages with a high page score
  • One browses from page to page by following outgoing links with equal probability. Score = frequency with which a page is visited.
  • some pages may have no outgoing links
  • many pages have zero frequency

11
PageRank teleporting random score
  • The surfer follows a path by choosing an outgoing link with probability $p/d_{out}(i)$, or teleports to a random web page with probability $1-p$, where $0 < 1-p < 1$.
  • Put the transition probability of $i$ to $j$ in a matrix $M$ ($b_{ij} = 1$ if $i \to j$):
    $m_{ij} = p \, b_{ij}/d_{out}(i) + (1-p)/n$
  • Then the vector $x$ of the probability distribution on the nodes of the graph is the steady state vector of the iteration $x_{k+1} = M x_k$, i.e. the dominant eigenvector of the matrix $M$ (unique because of Perron-Frobenius)
  • PageRank of node $i$ is the (relative) size of element $i$ of this vector
12
Matlab News and Notes, October 2002
13
and my own page rank ?
  • use Google toolbar
  • some top pages

         Page                       PageRank   In-degree
    1    http://www.yahoo.com       10         654,000
    2    http://www.adobe.com       10         646,000
    5    http://www.google.com      10         252,000
    8    http://www.microsoft.com   10         129,000
    12   http://www.nasa.gov        10         93,900
    20   http://mit.edu             10         47,600
    23   http://www.nsf.gov         10         39,400
    26   http://www.inria.fr        10         17,400
    72   http://www.stanford.edu    9          36,300

15
Kleinberg's structure graph
  • The score of a page is high if the page has many incoming links
  • The score is high if the incoming links are from pages that have high scores
  • This inspired Kleinberg's structure graph: hub → authority

16
Good authorities for University Belgium
17
A good hub for University Belgium
18
Hub and authority scores
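  • Web pages have a hub score $h_j$ and an authority score $a_j$ which are mutually reinforcing
  • pages with large $h_j$ point to pages with high $a_j$
  • pages with large $a_j$ are pointed to by pages with high $h_j$
  • $h_j \leftarrow \sum_{i:(j \to i)} a_i$
  • $a_j \leftarrow \sum_{i:(i \to j)} h_i$
  • or, using the adjacency matrix $B$ of the graph ($b_{ij} = 1$ if $i \to j$ is an edge):
    $$\begin{pmatrix} h \\ a \end{pmatrix}_{k+1} = \begin{pmatrix} 0 & B \\ B^T & 0 \end{pmatrix} \begin{pmatrix} h \\ a \end{pmatrix}_k, \qquad \begin{pmatrix} h \\ a \end{pmatrix}_0 = \mathbf{1}$$
  • Use the limiting vector $a$ (dominant eigenvector of $B^T B$) to rank pages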
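
The mutual reinforcement of hub and authority scores can be tried on a small graph. The sketch below is a minimal illustration, not the authors' implementation: the function name `hits`, the toy edges, and the per-step rescaling of each vector to unit length are assumptions (rescaling changes only the scale of the vectors, not their direction).

```python
import math


def hits(n, edges, iters=50):
    """Hub and authority scores by mutual reinforcement.

    edges is a list of (i, j) pairs meaning page i links to page j.
    """
    h = [1.0] * n
    a = [1.0] * n
    for _ in range(iters):
        new_h = [0.0] * n
        new_a = [0.0] * n
        for i, j in edges:
            new_h[i] += a[j]   # h_i sums the authority of the pages i points to
            new_a[j] += h[i]   # a_j sums the hub score of the pages pointing to j
        # Rescale each vector to unit length; this only fixes the scale.
        nh = math.sqrt(sum(v * v for v in new_h)) or 1.0
        na = math.sqrt(sum(v * v for v in new_a)) or 1.0
        h = [v / nh for v in new_h]
        a = [v / na for v in new_a]
    return h, a


# Toy graph (hypothetical): 0 -> 2, 1 -> 2, 0 -> 3
h, a = hits(4, [(0, 2), (1, 2), (0, 3)])
print("hubs:", h)
print("authorities:", a)
```

Page 2 is pointed to by both hubs and gets the larger authority score; page 0 points to both authorities and gets the larger hub score.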



20
Extension to another structure graph
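  • Give three scores to each web page: begin $b$, center $c$, end $e$
  • Structure graph: b → c → e
  • Use again mutual reinforcement to define the iteration
  • $b_j \leftarrow \sum_{i:(j \to i)} c_i$
  • $c_j \leftarrow \sum_{i:(i \to j)} b_i + \sum_{i:(j \to i)} e_i$
  • $e_j \leftarrow \sum_{i:(i \to j)} c_i$
  • Defines a limiting vector for the iteration
    $$x_{k+1} = M x_k, \quad x_0 = \mathbf{1}, \quad \text{where } x = \begin{pmatrix} b \\ c \\ e \end{pmatrix}, \quad M = \begin{pmatrix} 0 & B & 0 \\ B^T & 0 & B \\ 0 & B^T & 0 \end{pmatrix}$$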
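
The three update rules can be applied directly to an edge list. The sketch below assumes a hypothetical function name `begin_center_end`, a toy path graph, and a normalization of the stacked vector at every step; it runs an even number of steps because the even and odd iterates of such an iteration need not share a limit.

```python
import math


def begin_center_end(n, edges, iters=100):
    """Begin, center and end scores; edges are (i, j) pairs meaning i -> j."""
    b = [1.0] * n
    c = [1.0] * n
    e = [1.0] * n
    for _ in range(iters):            # iters is even by default
        nb = [0.0] * n
        nc = [0.0] * n
        ne = [0.0] * n
        for i, j in edges:
            nb[i] += c[j]   # b_i sums c over the pages i points to
            nc[j] += b[i]   # c_j sums b over the pages pointing to j
            nc[i] += e[j]   # c_i sums e over the pages i points to
            ne[j] += c[i]   # e_j sums c over the pages pointing to j
        # Normalize the stacked vector (b, c, e) to unit Euclidean length.
        norm = math.sqrt(sum(v * v for v in nb + nc + ne)) or 1.0
        b = [v / norm for v in nb]
        c = [v / norm for v in nc]
        e = [v / norm for v in ne]
    return b, c, e


# Toy graph (hypothetical): the path 0 -> 1 -> 2
b, c, e = begin_center_end(3, [(0, 1), (1, 2)])
print(b, c, e)
```

On this path the first vertex gets the highest begin score, the middle vertex the highest center score, and the last vertex the highest end score.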

21
Towards arbitrary graphs
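  • For the graph 1 → 2: $A = \begin{pmatrix} 0 & 1 \\ 0 & 0 \end{pmatrix}$ and $M = \begin{pmatrix} 0 & B \\ B^T & 0 \end{pmatrix}$
  • For the graph 1 → 2 → 3: $A = \begin{pmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ 0 & 0 & 0 \end{pmatrix}$ and $M = \begin{pmatrix} 0 & B & 0 \\ B^T & 0 & B \\ 0 & B^T & 0 \end{pmatrix}$
  • Formula for $M$ for two arbitrary graphs $G_A$ and $G_B$:
    $M = A \otimes B + A^T \otimes B^T$
  • With $x_k = \mathrm{vec}(X_k)$, the iteration $x_{k+1} = M x_k$ is equivalent to $X_{k+1} = B X_k A^T + B^T X_k A$
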
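
The equivalence between the Kronecker form and the matrix form can be checked numerically on a small case. The sketch below uses plain Python lists; the helper names (`matmul`, `transpose`, `add`, `kron`, `vec`) and the matrices `B` and `X` are illustrative assumptions, and `vec` stacks columns.

```python
def matmul(P, Q):
    return [[sum(P[i][k] * Q[k][j] for k in range(len(Q)))
             for j in range(len(Q[0]))] for i in range(len(P))]


def transpose(P):
    return [list(col) for col in zip(*P)]


def add(P, Q):
    return [[p + q for p, q in zip(rp, rq)] for rp, rq in zip(P, Q)]


def kron(P, Q):
    """Kronecker product: block (i, j) of the result is P[i][j] * Q."""
    return [[P[i][j] * Q[k][l]
             for j in range(len(P[0])) for l in range(len(Q[0]))]
            for i in range(len(P)) for k in range(len(Q))]


def vec(X):
    """Stack the columns of X into one long vector."""
    return [X[i][j] for j in range(len(X[0])) for i in range(len(X))]


A = [[0, 1], [0, 0]]                      # structure graph 1 -> 2
B = [[0, 1, 1], [0, 0, 1], [0, 0, 0]]     # hypothetical 3-node graph
X = [[1, 2], [3, 4], [5, 6]]              # arbitrary n_B x n_A matrix

M = add(kron(A, B), kron(transpose(A), transpose(B)))
# Left side: M applied to vec(X)
lhs = [sum(m * x for m, x in zip(row, vec(X))) for row in M]
# Right side: vec(B X A^T + B^T X A)
rhs = vec(add(matmul(matmul(B, X), transpose(A)),
              matmul(matmul(transpose(B), X), A)))
print(lhs == rhs)
```

For this choice of `A` the matrix `M` is the block matrix with `B` in the upper right and its transpose in the lower left, and the two vectors agree entry by entry.
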
22
Convergence ?
  • The (normalized) sequence
    $Z_{k+1} = (B Z_k A^T + B^T Z_k A) / \| B Z_k A^T + B^T Z_k A \|_2$
    has two fixed points $Z_{even}$ and $Z_{odd}$ for every $Z_0 > 0$
  • Similarity matrix $S = \lim_{k \to \infty} Z_{2k}$, with $Z_0 = \mathbf{1}$
  • $S_{i,j}$ is the similarity score between $V_j(A)$ and $V_i(B)$
  • Properties
  • $\rho S = B S A^T + B^T S A$, with $\rho = \| B S A^T + B^T S A \|_2$
  • Fixed point of largest 1-norm
  • Robust fixed point for $M + \varepsilon \mathbf{1}$
  • Linear convergence (power method for sparse $M$)
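
The normalized iteration can be sketched on edge lists without forming any matrix product. The following is an illustration under assumptions: the function name `similarity`, the toy graphs, and the use of the Frobenius norm of the matrix (the Euclidean norm of its vectorized form) for the normalization are choices made here, and a fixed even number of steps stands in for the limit.

```python
import math


def similarity(n_a, edges_a, n_b, edges_b, iters=100):
    """Even normalized iterate of X <- B X A^T + B^T X A, started from all ones.

    edges_a and edges_b are lists of (i, j) pairs meaning i -> j.
    Returns an n_b x n_a matrix; entry [i][j] compares vertex i of G_B
    with vertex j of G_A.
    """
    if iters % 2:
        iters += 1                         # keep an even number of steps
    X = [[1.0] * n_a for _ in range(n_b)]
    for _ in range(iters):
        new = [[0.0] * n_a for _ in range(n_b)]
        for ib, kb in edges_b:
            for ia, ka in edges_a:
                new[ib][ia] += X[kb][ka]   # term B X A^T (scores of children)
                new[kb][ka] += X[ib][ia]   # term B^T X A (scores of parents)
        norm = math.sqrt(sum(v * v for row in new for v in row)) or 1.0
        X = [[v / norm for v in row] for row in new]
    return X


# Structure graph 1 -> 2 against a hypothetical 4-node graph:
# column 0 plays the role of hub scores, column 1 of authority scores.
S = similarity(2, [(0, 1)], 4, [(0, 2), (1, 2), (0, 3)])
for row in S:
    print([round(v, 3) for v in row])
```

With the two-vertex structure graph the columns of the result are proportional to hub and authority scores, which gives a quick consistency check of the general iteration against the hub and authority case.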

23
Bow tie example
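  • Graph A: 1 → 2; graph B: a bow tie on vertices 1, 2, ..., n+1, ..., n+m+1
  • $S$ takes one form if $m > n$ and another if $n > m$
  • not satisfactory
[Figure: graphs A and B with the two resulting similarity matrices $S$; the matrix entries are not recoverable from this transcript]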
24
Bow tie example
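  • Graph A: 1 → 2 → 3; graph B: a bow tie on vertices 1, 2, ..., n+1, ..., n+m+1
  • central score is good
[Figure: graphs A and B with the resulting similarity matrix $S$; the matrix entries are not recoverable from this transcript]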
25
Other properties
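  • Central score is a dominant eigenvector of $B B^T + B^T B$
  • (cf. hub score of $B B^T$ and authority score of $B^T B$)
  • Similarity matrix of a graph with itself is square and semi-definite.
  • Path graph 1 → 2 → 3: $S = \begin{pmatrix} .4 & 0 & 0 \\ 0 & .8 & 0 \\ 0 & 0 & .4 \end{pmatrix}$
  • Cycle graph: $S = \begin{pmatrix} 1 & 1 & 1 \\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{pmatrix}$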
26
The dictionary graph
  • OPTED, based on Webster's unabridged dictionary
  • http://msowww.anu.edu.au/ralph/OPTED
  • Nodes: words present in the dictionary (112,169 nodes)
  • Edge (u,v) if v appears in the definition of u (1,398,424 edges)
  • Average of 12 edges per node

27
In- and out-degree distribution
  • Very similar to web (power law)
  • Words with highest in-degree: of, a, the, or, to, in
  • Words with null out-degree: 14159, Fe3O4, Aaron, and some undefined or misspelled words

28
Neighborhood graph
  • The neighborhood graph of a word is the subset of vertices used for finding its synonyms
  • It contains all parents and children of the node
  • [Figure: neighborhood graph of "likely"]
  • Central uses this sub-graph to rank synonyms automatically
  • Comparison with Vectors and ArcRank (automatic), and with Wordnet and Microsoft Word (manual)
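
Extracting a neighborhood graph from the dictionary edges can be sketched as follows. The function name `neighborhood_graph` and the toy edges are hypothetical, and two choices are assumptions rather than statements from the slides: the word itself is kept as a vertex, and every dictionary edge between two selected vertices is kept.

```python
def neighborhood_graph(edges, word):
    """Vertices and edges of the neighborhood graph of `word`.

    edges is a list of (u, v) pairs meaning v appears in the definition of u.
    """
    children = {v for u, v in edges if u == word}   # words used to define `word`
    parents = {u for u, v in edges if v == word}    # words whose definition uses `word`
    nodes = {word} | parents | children
    # Keep only the dictionary edges that join two selected vertices.
    sub_edges = [(u, v) for u, v in edges if u in nodes and v in nodes]
    return sorted(nodes), sub_edges


# Toy dictionary edges (hypothetical, not taken from OPTED)
toy_edges = [
    ("likely", "probable"),
    ("probable", "likely"),
    ("likely", "verisimilar"),
    ("adapted", "likely"),
    ("giving", "gift"),
    ("probable", "truth"),
]
nodes, sub_edges = neighborhood_graph(toy_edges, "likely")
print(nodes)
print(sub_edges)
```

The resulting sub-graph is small compared with the whole dictionary graph, so a ranking such as the central score only needs to be computed on a few vertices per query word.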

29
Disappear
         Vectors   Central     ArcRank        Wordnet     Microsoft
1        vanish    vanish      epidemic       vanish      vanish
2        wear      pass        disappearing   go away     cease to exist
3        die       die         port           end         fade away
4        sail      wear        dissipate      finish      die out
5        faint     faint       cease          terminate   go
6        light     fade        eat            cease       evaporate
7        port      sail        gradually      -           wane
8        absorb    light       instrumental   -           expire
9        appear    dissipate   darkness       -           withdraw
10       cease     cease       efface         -           pass away
Mark     3.6       6.3         1.2            7.5         8.6
Std Dev  1.8       1.7         1.2            1.4         1.3
30
Parallelogram
         Vectors          Central         ArcRank          Wordnet         Microsoft
1        square           square          quadrilateral    quadrilateral   diamond
2        parallel         rhomb           gnomon           quadrangle      lozenge
3        rhomb            parallel        right-lined      tetragon        rhomb
4        prism            figure          rectangle        -               -
5        figure           prism           consequently     -               -
6        equal            equal           parallelopiped   -               -
7        quadrilateral    opposite        parallel         -               -
8        opposite         angles          cylinder         -               -
9        altitude         quadrilateral   popular          -               -
10       parallelopiped   rectangle       prism            -               -
Mark     4.6              4.8             3.3              6.3             5.3
Std Dev  2.7              2.5             2.2              2.5             2.6
31
Science
         Vectors       Central     ArcRank         Wordnet            Microsoft
1        art           art         formulate       knowledge domain   discipline
2        branch        branch      arithmetic      knowledge base     knowledge
3        nature        law         systematize     discipline         skill
4        law           study       scientific      subject            art
5        knowledge     practice    knowledge       subject area       -
6        principle     natural     geometry        subject field      -
7        life          knowledge   philosophical   field              -
8        natural       learning    learning        field of study     -
9        electricity   theory      expertness      ability            -
10       biology       principle   mathematics     power              -
Mark     3.6           4.4         3.2             7.1                6.5
Std Dev  2.0           2.5         2.9             2.6                2.4
32
Sugar
         Vectors       Central    ArcRank       Wordnet            Microsoft
1        juice         cane       granulation   sweetening         darling
2        starch        starch     shrub         sweetener          baby
3        cane          sucrose    sucrose       carbohydrate       honey
4        milk          milk       preserve      saccharide         dear
5        molasses      sweet      honeyed       organic compound   love
6        sucrose       dextrose   property      saccharify         dearest
7        wax           molasses   sorghum       sweeten            beloved
8        root          juice      grocer        dulcify            precious
9        crystalline   glucose    acetate       edulcorate         pet
10       confection    lactose    saccharine    dulcorate          babe
Mark     3.9           6.3        4.3           6.2                4.7
Std Dev  2.0           2.4        2.3           2.9                2.7
33
Conclusion
  • New notion of similarity between vertices of a graph
  • Easy to compute: start from $X_0 = \mathbf{1}$ and take even normalized iterates of $X_{k+1} = B X_k A^T + B^T X_k A$
  • Potential use for data-mining, classification, clustering
  • Successful implementation for the French dictionary Le Petit Robert
  • Applications in texts, internet, reference lists, telephone networks, bipartite graphs (Melnik, Widom, ...)
  • Different from sub-graph problems!