Vincent Blondel and Paul Van Dooren, CESAME, Université Catholique de Louvain



1
(No Transcript)
2
The web graph
  • Nodes: web pages; edges: hyperlinks between pages
  • Over 3 billion web pages searched by Google
  • Average of 7 outgoing links per page
  • Growth of a few percent every month

3
Outline
  • 1. Structure of the web
  • 2. Methods for searching the web
    (Google PageRank and Kleinberg HITS)
  • 3. Similarity in graphs
  • 4. Application to synonym extraction
  • Ref: Web searching and graph similarity, V. Blondel, A. Gajardo,
    M. Heymans, P. Senellart and P. Van Dooren, SIAM Review,
    http://epubs.siam.org/sam-bin/dbq/article/41596

4
Structure of the web
  • In 1999 a giant strongly connected component (core) was discovered
  • It contains the most prominent sites
  • It contains 30% of all pages
  • Average distance between nodes is 16
  • Small world
  • Ref: Broder et al., Graph structure in the web, WWW9, 2000,
    http://www.almaden.ibm.com/cs/k53/www9.final/

5
The web is a bowtie
  • Ref: The web is a bowtie, Nature, Vol. 405, May 11, 2000

6
In- and out-degree distributions
  • Power law distribution: the number of pages of in-degree $n$ is
    proportional to $1/n^{2.1}$ (Zipf law)
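
As a quick worked illustration of what this exponent implies (the count function $N(n)$ and the constant $C$ are notation introduced here, not taken from the slides):

```latex
N(n) = C\, n^{-2.1}
\quad\Longrightarrow\quad
\frac{N(2n)}{N(n)} = 2^{-2.1} \approx 0.23,
\qquad
\frac{N(10n)}{N(n)} = 10^{-2.1} \approx 0.0079 .
```

Under this law, doubling the in-degree cuts the number of pages to roughly a quarter, and a tenfold increase in in-degree cuts it to under one percent.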

7
A score for every page
  • The score of a page is high if the page has many incoming links
    coming from pages with a high page score
  • One browses from page to page by following outgoing links with
    equal probability. Score = frequency with which a page is visited.

8
A score for every page
  • The score of a page is high if the page has many incoming links
    coming from pages with a high page score
  • One browses from page to page by following outgoing links with
    equal probability. Score = frequency with which a page is visited.
  • Some pages may have no outgoing links
  • Many pages have zero in-degree

9
(No Transcript)
10
PageRank: teleporting random score
  • The surfer follows a path by choosing an outgoing link with
    probability $p/d_{\text{out}}(i)$, or teleports to a random web page
    with probability $0 < 1-p < 1$
  • Put the transition probability from $i$ to $j$ in a matrix $M$
    ($b_{ij} = 1$ if $i \to j$):
    $m_{ij} = p\, b_{ij}/d_{\text{out}}(i) + (1-p)/n$
  • Then the vector $x$ of the probability distribution on the nodes of
    the graph is the steady-state vector of the iteration
    $x_{k+1} = M^T x_k$, i.e. the dominant eigenvector of the matrix
    $M^T$ (unique because of Perron-Frobenius)
  • The PageRank of node $i$ is the (relative) size of element $i$ of
    this vector
  • Ref: P. Van Dooren, Théorie des matrices (Section 3.4 and
    Chapter 6), http://www.auto.ucl.ac.be/vdooren/cours-inma2380.pdf
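
The iteration on this slide can be sketched in a few lines of plain Python. This is a minimal illustration, not the implementation used by Google: the damping value `p = 0.85`, the four-page toy graph, and the rule that a page without outgoing links always teleports are all assumptions made here.

```python
def pagerank(out_links, p=0.85, iterations=100):
    """Power iteration x_{k+1} = M^T x_k with
    m_ij = p * b_ij / d_out(i) + (1 - p) / n.

    out_links maps every node to the list of nodes it links to;
    every link target is assumed to be a key of out_links.
    """
    nodes = sorted(out_links)
    n = len(nodes)
    x = {v: 1.0 / n for v in nodes}          # start from the uniform distribution
    for _ in range(iterations):
        new_x = {v: 0.0 for v in nodes}
        teleport_mass = 0.0
        for i in nodes:
            targets = set(out_links[i])
            if targets:
                # follow each outgoing link with probability p / d_out(i)
                for j in targets:
                    new_x[j] += p * x[i] / len(targets)
                teleport_mass += (1.0 - p) * x[i]
            else:
                # assumption: a page with no outgoing links always teleports
                teleport_mass += x[i]
        # teleporting lands on every page with equal probability 1/n
        for j in nodes:
            new_x[j] += teleport_mass / n
        x = new_x
    return x


# Hypothetical toy graph: page "d" has zero in-degree.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
scores = pagerank(graph)
```

In this toy graph page "c" collects links from all three other pages and ends up with the largest score, while "d", which nobody links to, keeps only the teleportation share (1 - p)/n.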

11
And my own PageRank?
  • Use the Google toolbar
  • Some top pages:

      #    URL                         PageRank   In-degree
      1    http://www.yahoo.com        10         654,000
      2    http://www.adobe.com        10         646,000
      5    http://www.google.com       10         252,000
      8    http://www.microsoft.com    10         129,000
      12   http://www.nasa.gov         10          93,900
      20   http://mit.edu              10          47,600
      23   http://www.nsf.gov          10          39,400
      26   http://www.inria.fr         10          17,400
      72   http://www.stanford.edu      9          36,300

  • Ref: S. Brin, L. Page, The Anatomy of a Large-Scale Hypertextual
    Web Search Engine, http://dbpubs.stanford.edu:8090/pub/1998-8

12
Kleinberg's structure graph
  • The score of a page is high if the page has many incoming links
  • The score is high if the incoming links are from pages that have
    high scores
  • This inspired Kleinberg's structure graph: hub → authority

13
Good authorities for University Belgium
14
A good hub for University Belgium
15
Hub and authority scores
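  • Web pages have a hub score $h_j$ and an authority score $a_j$ which
    are mutually reinforcing:
  • pages with large $h_j$ point to pages with high $a_j$
  • pages with large $a_j$ are pointed to by pages with high $h_j$
  • $h_j \leftarrow \sum_{i:(j \to i)} a_i$
  • $a_j \leftarrow \sum_{i:(i \to j)} h_i$
  • or, using the adjacency matrix $B$ of the graph ($b_{ji} = 1$ if
    $j \to i$ is an edge):
    $$\begin{pmatrix} h \\ a \end{pmatrix}_{k+1} =
      \begin{pmatrix} 0 & B \\ B^T & 0 \end{pmatrix}
      \begin{pmatrix} h \\ a \end{pmatrix}_k, \qquad
      \begin{pmatrix} h \\ a \end{pmatrix}_0 = \mathbf{1}$$
  • Use the limiting subvector $a$ of $x_{k+1} = M x_k / \|M x_k\|$ to
    rank pages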
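
The mutual reinforcement above can be sketched as follows. This is an illustrative version under stated assumptions: the four-page edge list is made up, the Euclidean norm is used for the normalisation, and an even number of iterations is chosen because even and odd iterates of this kind of iteration need not agree in general.

```python
import math


def hub_authority(nodes, edges, iterations=50):
    """Iterate h_j <- sum of a_i over edges j -> i and
    a_j <- sum of h_i over edges i -> j, then normalise the
    stacked vector (h, a), starting from all ones."""
    h = {v: 1.0 for v in nodes}
    a = {v: 1.0 for v in nodes}
    for _ in range(iterations):
        new_h = {v: 0.0 for v in nodes}
        new_a = {v: 0.0 for v in nodes}
        for j, i in edges:                   # each pair is an edge j -> i
            new_h[j] += a[i]                 # j is a good hub if i is a good authority
            new_a[i] += h[j]                 # i is a good authority if j is a good hub
        norm = math.sqrt(sum(s * s for s in new_h.values())
                         + sum(s * s for s in new_a.values()))
        if norm == 0.0:                      # graph without edges: nothing to rank
            return new_h, new_a
        h = {v: s / norm for v, s in new_h.items()}
        a = {v: s / norm for v, s in new_a.items()}
    return h, a


# Hypothetical example: p1 and p2 act as hubs, p3 and p4 as authorities.
pages = ["p1", "p2", "p3", "p4"]
links = [("p1", "p3"), ("p1", "p4"), ("p2", "p3")]
hub, authority = hub_authority(pages, links)
```

Here p3 is pointed to by both hubs and receives the highest authority score, and p1, which points to both authorities, receives the highest hub score.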



16
Extension to another structure graph
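  • Give three scores to each web page: begin $b$, center $c$, end $e$
  • Structure graph: b → c → e
  • Use mutual reinforcement again to define the iteration:
  • $b_j \leftarrow \sum_{i:(j \to i)} c_i$
  • $c_j \leftarrow \sum_{i:(i \to j)} b_i + \sum_{i:(j \to i)} e_i$
  • $e_j \leftarrow \sum_{i:(i \to j)} c_i$
  • This defines a limiting vector for the iteration
    $$x_{k+1} = M x_k, \quad x_0 = \mathbf{1}, \quad \text{where } x =
      \begin{pmatrix} b \\ c \\ e \end{pmatrix}, \quad M =
      \begin{pmatrix} 0 & B & 0 \\ B^T & 0 & B \\ 0 & B^T & 0 \end{pmatrix}$$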
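
A small sketch of the three-score iteration, written directly from the update rules above. Two choices here are assumptions rather than part of the slide: the iterate is rescaled to unit Euclidean norm at every step (as on the hub and authority slide) so the numbers stay bounded, and an even number of steps is used. The bow-tie graph with two nodes on the left and three on the right is a made-up example.

```python
import math


def begin_center_end(nodes, edges, iterations=50):
    """Iterate the b, c, e updates for the structure graph b -> c -> e,
    starting from all ones and normalising the stacked vector each step."""
    b = {v: 1.0 for v in nodes}
    c = {v: 1.0 for v in nodes}
    e = {v: 1.0 for v in nodes}
    for _ in range(iterations):
        nb = {v: 0.0 for v in nodes}
        nc = {v: 0.0 for v in nodes}
        ne = {v: 0.0 for v in nodes}
        for j, i in edges:                   # each pair is an edge j -> i
            nb[j] += c[i]                    # b_j <- sum of c_i over j -> i
            nc[i] += b[j]                    # c_i <- sum of b_j over j -> i ...
            nc[j] += e[i]                    # ... plus, for c_j, sum of e_i over j -> i
            ne[i] += c[j]                    # e_i <- sum of c_j over j -> i
        norm = math.sqrt(sum(s * s for d in (nb, nc, ne) for s in d.values()))
        if norm == 0.0:
            return nb, nc, ne
        b = {v: s / norm for v, s in nb.items()}
        c = {v: s / norm for v, s in nc.items()}
        e = {v: s / norm for v, s in ne.items()}
    return b, c, e


# Hypothetical bow tie: two pages point to "C", which points to three pages.
bow_nodes = ["L1", "L2", "C", "R1", "R2", "R3"]
bow_edges = [("L1", "C"), ("L2", "C"), ("C", "R1"), ("C", "R2"), ("C", "R3")]
begin, center, end = begin_center_end(bow_nodes, bow_edges)
```

In this example the central score concentrates on node "C", while the central scores of the left and right nodes decay towards zero.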

17
Bow tie example
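  • Graph A (structure graph): h → a
  • Graph B (bow tie): nodes labelled 1, 2, ..., n+1, ..., n+m+1
  • The hub and authority scores take one form if m > n and another
    if n > m
  • Not satisfactory
  • [Figure: graphs A and B with the resulting h and a score vectors;
    the vector entries are not recoverable from the transcript]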
18
Bow tie example
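  • Graph A (structure graph): b → c → e
  • Graph B (bow tie): nodes labelled 1, 2, ..., n+1, ..., n+m+1
  • The central score is good
  • [Figure: graphs A and B with the resulting b, c and e score
    vectors; the vector entries are not recoverable from the transcript]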
19
The dictionary graph
  • OPTED, based on Webster's unabridged dictionary,
    http://msowww.anu.edu.au/ralph/OPTED
  • Nodes: words present in the dictionary (112,169 nodes)
  • Edge (u,v) if v appears in the definition of u (1,398,424 edges)
  • Average of 12 edges per node
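
The construction rule "edge (u,v) if v appears in the definition of u" can be sketched as below. The five-entry dictionary is invented for illustration and is not taken from OPTED; the lower-casing and the simple letter-run tokenisation are also assumptions, since the slides do not say how definitions were parsed.

```python
import re


def build_dictionary_graph(definitions):
    """Return the set of edges (u, v) such that headword v occurs in the
    definition text of headword u. Words that are not headwords are ignored."""
    headwords = set(definitions)
    edges = set()
    for u, text in definitions.items():
        for token in re.findall(r"[a-z]+", text.lower()):
            if token in headwords:
                edges.add((u, token))
    return edges


# Hypothetical toy dictionary; "a" and "of" are left undefined on purpose.
toy_dictionary = {
    "sugar": "a sweet substance",
    "sweet": "having the taste of sugar",
    "candy": "a sweet made of sugar",
    "a": "",
    "of": "",
}
dictionary_edges = build_dictionary_graph(toy_dictionary)
null_out_degree = {w for w in toy_dictionary
                   if not any(u == w for u, _ in dictionary_edges)}
average_edges = len(dictionary_edges) / len(toy_dictionary)
```

The same kind of ratio, edges divided by nodes, gives the slide's figure: 1,398,424 / 112,169 is about 12.5 edges per node.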

20
In and out degree distribution
  • Very similar to the web (power law)
  • Words with highest in-degree: of, a, the, or, to, in
  • Words with null out-degree: 14159, Fe3O4, Aaron, and some undefined
    or misspelled words

21
Neighborhood graph
  • The neighborhood graph of a word is the subset of vertices used for
    finding its synonyms
  • It contains all parents and children of the node
  • [Figure: neighborhood graph of "likely"]
  • "Central" uses this subgraph to rank synonyms automatically
  • Comparison with Vectors, ArcRank (automatic) and WordNet,
    Microsoft Word (manual)
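
Extracting a neighborhood graph from an edge set might look like the sketch below. Two details are assumptions made here: the word itself is kept as a vertex, and every edge of the full graph between retained vertices is kept. The edge set is a made-up example.

```python
def neighborhood_graph(edges, word):
    """Return (vertices, sub_edges) for the neighborhood of `word`:
    its parents (words whose definition uses it), its children (words
    used in its definition), and, by assumption, the word itself."""
    parents = {u for u, v in edges if v == word}
    children = {v for u, v in edges if u == word}
    vertices = parents | children | {word}
    # keep every edge of the full graph that joins two retained vertices
    sub_edges = {(u, v) for u, v in edges if u in vertices and v in vertices}
    return vertices, sub_edges


# Hypothetical edge set, where (u, v) means v appears in the definition of u.
example_edges = {
    ("candy", "sugar"),
    ("sugar", "sweet"),
    ("candy", "sweet"),
    ("sweet", "taste"),
    ("dog", "animal"),
}
sugar_vertices, sugar_edges = neighborhood_graph(example_edges, "sugar")
```

A ranking such as the central score would then be computed on this small subgraph rather than on the whole dictionary graph.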

22
Disappear
23
Sugar
24
Conclusion
  • Potential use for data mining, classification, clustering
  • Applications in internet, graphs, telephone networks, ...

25
(No Transcript)
26
Distribution of calls received
  • [Figure: plot of the number of customers against the number of
    calls received]
  • Example: 2000 people have received 100 calls
27
http://www.inma.ucl.ac.be/dekerchove