CS 277: Data Mining Lecture 14: Page Rank and HITS - PowerPoint PPT Presentation

1
CS 277: Data Mining
Lecture 14: Page Rank and HITS
  • David Newman
  • Department of Computer Science
  • University of California, Irvine

2
Notices
  • Homework 3 due in class Nov 20
  • Progress Report 2 today

3
Progress Report 2
4
Link Analysis Objectives
  • To review common approaches to link analysis
  • To calculate the popularity of a site based on
    link analysis
  • To model human judgments indirectly

5
Outline
  • Page Rank
  • Hubs and Authorities (HITS)
  • Stability
  • Probabilistic Link Analysis
  • Limitation of Link Analysis

6
Web Mining
  • The Web: a potentially enormous data set for data mining
  • 3 primary aspects of Web mining:
    • Web page content
      • classifying/clustering Web pages based on their text
    • Web connectivity
      • characterizing distributions on path lengths between pages
      • determining importance of pages from graph structure
    • Web usage
      • understanding user behavior from Web logs
  • All 3 are interconnected/interdependent
    • Google (and most search engines) use both content and connectivity
  • This lecture: Web connectivity

7
The Web Graph
  • G = (V, E)
  • V = set of all Web pages
  • E = set of all hyperlinks
  • Number of nodes?
    • Difficult to estimate
    • Crawling the Web is highly non-trivial
    • At least 30 billion pages out there
  • Number of edges?
    • $|E| = O(|V|)$
    • i.e., mean number of outlinks per page is a small constant

8
The Web Graph
  • The Web graph is inherently dynamic
  • nodes and edges are continually appearing and
    disappearing
  • Interested in general properties of the Web graph
  • What is the distribution of the number of
    in-links and out-links?
  • What is the distribution of number of pages per
    site?
  • Typically power-laws for many of these
    distributions
  • How far apart are 2 randomly selected pages on
    the Web?
  • What is the average distance between 2 random
    pages?
  • And so on

9
Power-law degree distribution: $P(k) \sim k^{-\gamma}$
[Figure: Albert, Jeong, and Barabási, 1999]
10
Social Networks
  • Social networks = graphs
  • V = set of actors (e.g., students in a class)
  • E = set of interactions (e.g., collaborations)
  • Typically small graphs, e.g., |V| = 10 or 50
  • Long history of social network analysis (e.g. at
    UCI)
  • Quantitative data analysis techniques that can
    automatically extract structure or information
    from graphs
  • who is the most important actor in a network?
  • are there clusters in the network?
  • Comprehensive reference
  • S. Wasserman and K. Faust, Social Network
    Analysis, Cambridge University Press, 1994.

11
Node Importance in Social Networks
  • General idea is that some nodes are more
    important than others in terms of the structure
    of the graph
  • In a directed graph, in-degree may be a useful indicator of importance
    • e.g., for a citation network among authors (or papers), in-degree is the number of citations => importance
  • However:
    • in-degree is only a first-order measure in that it implicitly assumes that all edges are of equal importance

12
Recursive Notions of Node Importance
  • $w_{ij}$ = weight of link from node i to node j
    • assume $\sum_j w_{ij} = 1$ and weights are non-negative
    • default choice: $w_{ij} = 1/\mathrm{outdegree}(i)$
    • more outlinks => less importance attached to each
  • Define $r_j$ = importance of node j in a directed graph:
    $r_j = \sum_i w_{ij}\, r_i$, for $i, j = 1, \ldots, n$
  • Importance of a node is a weighted sum of the
    importance of nodes that point to it
  • Makes intuitive sense
  • Leads to a set of recursive linear equations
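The default weighting can be sketched in a few lines of Python. The four-node graph below is an assumption: the figure for the worked example did not survive in this transcript, so the links were chosen to be consistent with the $r_2$ equation and the solution quoted on the later slides. The helper name `weight_matrix` is also hypothetical.

```python
from fractions import Fraction

# Hypothetical 4-node graph (an assumption, see the note above):
# each node maps to the list of nodes it links to.
outlinks = {1: [2], 2: [1, 4], 3: [2, 4], 4: [2, 3]}

def weight_matrix(outlinks):
    """Return W as nested dicts with W[i][j] = 1/outdegree(i) for each link i -> j."""
    nodes = sorted(outlinks)
    W = {i: {j: Fraction(0) for j in nodes} for i in nodes}
    for i, targets in outlinks.items():
        for j in targets:
            W[i][j] = Fraction(1, len(targets))
    return W

W = weight_matrix(outlinks)
row_sums = {i: sum(W[i].values()) for i in W}   # each row should sum to 1
column_2 = [W[i][2] for i in sorted(W)]         # weights of the links into node 2
```

Exact fractions are used only so that the row sums can be compared with 1 without rounding; floating point would work equally well.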

13
Simple Example
[Figure: directed graph on nodes 1, 2, 3, 4]
14
Simple Example
[Figure: the same graph on nodes 1, 2, 3, 4 with edge weights: one edge of weight 1 and six edges of weight 0.5]
15
Simple Example
[Figure: the weighted graph on nodes 1, 2, 3, 4 (one edge of weight 1, six edges of weight 0.5) shown next to its weight matrix W]
16
Matrix-Vector form
  • Recall $r_j$ = importance of node j
  • $r_j = \sum_i w_{ij}\, r_i$, for $i, j = 1, \ldots, n$
  • e.g., $r_2 = 1\, r_1 + 0\, r_2 + 0.5\, r_3 + 0.5\, r_4$
    = dot product of the r vector with column 2 of W
  • Let r = n x 1 vector of importance values for the n nodes
  • Let W = n x n matrix of link weights
  • => we can rewrite the importance equations as
  • $r = W^T r$

17
Eigenvector Formulation
  • Need to solve the importance equations for unknown r, with known W
  • $r = W^T r$
  • This is a standard eigenvalue problem, i.e.,
  • $A r = \lambda r$ (where $A = W^T$)
  • with $\lambda$ = an eigenvalue = 1
  • and r = the eigenvector corresponding to $\lambda = 1$
  • Results from linear algebra tell us that:
    • W and $W^T$ have the same eigenvalues (though, in general, different eigenvectors)
    • Since W is a stochastic matrix, the largest of these eigenvalues is always 1
  • So the importance vector r is the eigenvector of $W^T$ corresponding to the largest eigenvalue, $\lambda = 1$
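As a quick check of the eigenvector formulation, the sketch below verifies exactly that a candidate r satisfies $r = W^T r$. The matrix W is an assumption: it is one set of link weights consistent with the $r_2$ equation and with the solution r = (0.2, 0.4, 0.13, 0.27) quoted for the simple example, since the original figure is not in the transcript.

```python
from fractions import Fraction as F

# Assumed weight matrix (rows sum to 1); W[i][j] is the weight of link i+1 -> j+1.
W = [
    [F(0),    F(1),    F(0),    F(0)],     # node 1 -> 2
    [F(1, 2), F(0),    F(0),    F(1, 2)],  # node 2 -> 1, 4
    [F(0),    F(1, 2), F(0),    F(1, 2)],  # node 3 -> 2, 4
    [F(0),    F(1, 2), F(1, 2), F(0)],     # node 4 -> 2, 3
]

def apply_transpose(W, r):
    """Compute W^T r, i.e. out[j] = sum_i W[i][j] * r[i]."""
    n = len(r)
    return [sum(W[i][j] * r[i] for i in range(n)) for j in range(n)]

r = [F(3, 15), F(6, 15), F(2, 15), F(4, 15)]   # candidate eigenvector, sums to 1
is_fixed_point = apply_transpose(W, r) == r    # True means r = W^T r holds exactly
rounded = [round(float(x), 2) for x in r]      # two-decimal view of r
```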

18
Solution for the Simple Example
Solving for the eigenvector of $W^T$ with $\lambda = 1$, we get r = [0.2, 0.4, 0.13, 0.27]. Results are quite intuitive.
[Figure: example graph with edge weights and weight matrix W]
19
Solution for the Simple Example
[Figure: example graph with each node labeled by its importance: node 1 = 0.2, node 2 = 0.4, node 3 = 0.13, node 4 = 0.27]
20
How can we apply this to the Web?
  • Given a set of Web pages and hyperlinks
  • Weights from each page = 1/(# of outlinks)
  • Solve for the eigenvector ($\lambda = 1$) of the weight matrix
  • Problem:
    • Solving an eigenvector equation scales as $O(n^3)$
    • For the entire Web graph n > 10 billion (!!)
    • So direct solution is not feasible
  • Can use the power method (iterative):
    $r^{(k+1)} = W^T r^{(k)}$


21
Power Method for solving for r
  • $r^{(k+1)} = W^T r^{(k)}$
  • Define a suitable starting vector $r^{(1)}$
    • e.g., all entries 1/n, or all entries indegree(node)/|E|, etc.
  • Each iteration is a matrix-vector multiplication => $O(n^2)$
    • problematic?
    • no: since W is highly sparse (Web pages have limited outdegree), each iteration is effectively O(n)
  • For sparse W, the iterations typically converge quite quickly
    • rate of convergence depends on the "spectral gap"
      • how quickly does error(k) ~ $|\lambda_2/\lambda_1|^k$ go to 0 as a function of k?
      • if $|\lambda_2|$ is close to 1 (= $\lambda_1$) then convergence is slow
    • empirically: Web graph with 300 million pages
      • 50 iterations to convergence (Brin and Page, 1998)
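A minimal sketch of the power method, written so that each iteration touches only the existing links (the sparse case described above). The function name, the tolerance, and the four-node test graph are assumptions for illustration; the sketch also assumes every node has at least one outlink.

```python
def power_method(outlinks, tol=1e-10, max_iter=1000):
    """Iterate r <- W^T r using only the nonzero weights (the outlink lists)."""
    nodes = sorted(outlinks)
    r = {v: 1.0 / len(nodes) for v in nodes}       # starting vector: all entries 1/n
    for k in range(1, max_iter + 1):
        new_r = {v: 0.0 for v in nodes}
        for i, targets in outlinks.items():
            share = r[i] / len(targets)            # w_ij * r_i with w_ij = 1/outdegree(i)
            for j in targets:
                new_r[j] += share
        change = sum(abs(new_r[v] - r[v]) for v in nodes)
        r = new_r
        if change < tol:                           # stop once successive vectors agree
            return r, k
    return r, max_iter

# Hypothetical graph consistent with the solution quoted for the simple example.
outlinks = {1: [2], 2: [1, 4], 3: [2, 4], 4: [2, 3]}
r, iterations = power_method(outlinks)
```

The work per iteration is proportional to the number of links, which is why the cost is effectively O(n) when the mean outdegree is a small constant.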

23
Markov Chain Interpretation
  • W is a stochastic matrix (rows sum to 1) by definition
    • we can interpret W as defining the transition probabilities in a Markov chain
    • $w_{ij}$ = probability of transitioning from node i to node j
  • Markov chain interpretation: $r = W^T r$
    • these are the solutions of the steady-state probabilities for a Markov chain
    • page importance <=> steady-state Markov probabilities <=> eigenvector

24
The Random Surfer Interpretation
  • Recall that for the Web model, we set $w_{ij} = 1/\mathrm{outdegree}(i)$
  • Thus, in using W for computing importance of Web
    pages, this is equivalent to a model where
  • We have a random surfer who surfs the Web for an
    infinitely long time
  • At each page the surfer randomly selects an
    outlink to the next page
  • Importance of a page = fraction of visits the surfer makes to that page
  • This is intuitive: pages that have better connectivity will be visited more often
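The random surfer model can be simulated directly. The sketch below is illustrative only: the graph is the assumed four-node example (chosen to match the quoted solution r = 0.2, 0.4, 0.13, 0.27), and the seed and step count are arbitrary choices.

```python
import random

# Hypothetical example graph: node -> list of outlink targets.
outlinks = {1: [2], 2: [1, 4], 3: [2, 4], 4: [2, 3]}

def surf(outlinks, steps, seed=0):
    """Follow uniformly random outlinks for `steps` steps and return visit fractions."""
    rng = random.Random(seed)
    visits = {v: 0 for v in outlinks}
    page = rng.choice(sorted(outlinks))        # arbitrary starting page
    for _ in range(steps):
        visits[page] += 1
        page = rng.choice(outlinks[page])      # pick one outlink uniformly at random
    return {v: count / steps for v, count in visits.items()}

visit_share = surf(outlinks, steps=200_000)
```

For a long enough walk the visit fractions should approach the importance values, up to sampling noise.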

25
Potential Problems
[Figure: example graph on nodes 1, 2, 3, 4]
Page 1 is a sink (no outlink). Pages 3 and 4 also form a sink (no outlink from the system). Markov chain theory tells us that no unique steady-state solution exists: depending on where you start, you will end up at 1 or at 3, 4. The Markov chain is reducible.
26
Making the Web Graph Irreducible
  • One simple solution to our problem is to modify the Markov chain:
    • With probability $\alpha$ the random surfer jumps to any random page in the system (with probability of 1/n, conditioned on such a jump)
    • With probability $1 - \alpha$ the random surfer selects an outlink (randomly from the set of available outlinks)
  • The resulting transition graph is fully connected => Markov system is irreducible => steady-state solutions exist
  • Typically $\alpha$ is chosen to be between 0.1 and 0.2 in practice
  • New power iterations can be written as:
    $r^{(k+1)} = (1 - \alpha)\, W^T r^{(k)} + (\alpha/n)\, \mathbf{1}$
    where $\mathbf{1}$ is the n x 1 vector of ones
  • Complexity is still O(n) per iteration for sparse W
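The modified iteration can be sketched as follows. One detail goes beyond the slide and is an assumption: pages with no outlinks are treated as if they linked to every page uniformly, so that the entries of r keep summing to 1. The function name, default alpha, and test graphs are likewise illustrative.

```python
def pagerank(outlinks, alpha=0.15, tol=1e-12, max_iter=1000):
    """Power iteration r <- (1 - alpha) W^T r + (alpha / n) 1 on a sparse graph."""
    nodes = sorted(outlinks)
    n = len(nodes)
    r = {v: 1.0 / n for v in nodes}
    for _ in range(max_iter):
        # Assumption: rank held by sink pages is spread uniformly over all pages.
        dangling = sum(r[i] for i in nodes if not outlinks[i])
        base = alpha / n + (1 - alpha) * dangling / n
        new_r = {v: base for v in nodes}
        for i, targets in outlinks.items():
            if targets:
                share = (1 - alpha) * r[i] / len(targets)
                for j in targets:
                    new_r[j] += share
        change = sum(abs(new_r[v] - r[v]) for v in nodes)
        r = new_r
        if change < tol:
            break
    return r

# Hypothetical graph in which page 1 is a sink and pages 3, 4 link only to each other.
ranks = pagerank({1: [], 2: [1, 3], 3: [4], 4: [3]})
cycle = pagerank({1: [2], 2: [3], 3: [1]})   # symmetric 3-cycle
```

Accumulating the sink mass once per iteration keeps the cost proportional to the number of pages plus links.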

27
The PageRank Algorithm
  • S. Brin and L. Page, The anatomy of a large-scale
    hypertextual search engine, in Proceedings of the
    7th WWW Conference, 1998.
  • PageRank = the method on the previous slide, applied to the entire Web graph
    • Crawl the Web (highly non-trivial!)
    • Store both connectivity and content
    • Calculate (off-line) the pagerank r for each Web page using the power iteration method
  • How can this be used to answer Web queries?
    • Terms in the search query are used to limit the set of pages of possible interest
    • Pages are then ordered for the user via precomputed pageranks
    • The Google search engine combines r with text-based measures
  • This was the first demonstration that link
    information could be used for content-based
    search on the Web

28
Link Structure helps in Web Search
[Figure from Singhal and Kaszkiel, 2001] SE1, etc., indicate different (anonymized) commercial search engines, all using link structure (e.g., PageRank) in their rankings
29
PageRank architecture at Google
  • Ranking of pages more important than exact values of $p_i$
  • Pre-compute and store the PageRank of each page.
  • PageRank independent of any query or textual
    content.
  • Ranking scheme combines PageRank with textual
    match
  • Unpublished
  • Many empirical parameters, human effort and
    regression testing.
  • Criticism: ad hoc coupling and decoupling between query relevance and graph importance
  • Massive engineering effort
  • Continually crawling the Web and updating page
    ranks

31
PageRank Limitations
  • Rich get richer syndrome
  • not as democratic as originally (nobly) claimed
  • certainly not 1 vote per WWW citizen
  • also crawling frequency tends to be based on
    pagerank
  • for detailed grumblings, see www.google-watch.org,
    etc.
  • Not query-sensitive
  • random walk same regardless of query topic
  • whereas real random surfer has some topic
    interests
  • non-uniform jumping vector needed
  • would enable personalization (but requires faster
    eigenvector convergence)
  • Topic of ongoing research
  • Ad hoc mix of PageRank + keyword match score
  • done in two steps for efficiency, not quality
    motivations

33
HITS: Kleinberg's Algorithm
  • HITS = Hypertext Induced Topic Selection
  • For each vertex v ∈ V in a subgraph of interest:
    • a(v) = the authority of v
    • h(v) = the hubness of v
  • A site is very authoritative if it receives many citations. Citations from important sites weigh more than citations from less-important sites
  • Hubness shows the importance of a site as a hub. A good hub is a site that links to many authoritative sites

34
Authority and Hubness
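[Figure: node 1 drawn twice, once with in-links from nodes 2, 3, 4 and once with out-links to nodes 5, 6, 7]
h(1) = a(5) + a(6) + a(7)
a(1) = h(2) + h(3) + h(4)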
35
Authority and Hubness Convergence
  • Recursive dependency:
  • $a(v) \leftarrow \sum_{w \in pa[v]} h(w)$
  • $h(v) \leftarrow \sum_{w \in ch[v]} a(w)$
    (pa[v] = parents of v, ch[v] = children of v)
  • Using linear algebra, we can prove that a(v) and h(v) converge
36
HITS Example
Find a base subgraph:
  • Start with a root set R = {1, 2, 3, 4}
  • {1, 2, 3, 4}: nodes relevant to the topic
  • Expand the root set R to include all the children and a fixed number of parents of nodes in R

=> A new set S (base subgraph)
37
HITS Example
    BaseSubgraph(R, d)
      S ← R
      for each v in R
        do S ← S ∪ ch[v]
           P ← pa[v]
           if |P| > d
             then P ← arbitrary subset of P having size d
           S ← S ∪ P
      return S
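The BaseSubgraph procedure translates almost line for line into Python. In this sketch the `children` and `parents` lookups are assumed to be plain dictionaries from a node to a list of nodes, and the "arbitrary subset" is drawn with a seeded random sample; both choices are illustrative.

```python
import random

def base_subgraph(root, children, parents, d, rng=None):
    """Expand root set R with all children and at most d parents of each node in R."""
    rng = rng or random.Random(0)            # fixed seed so the sketch is repeatable
    S = set(root)
    for v in root:
        S |= set(children.get(v, ()))        # S <- S U ch[v]
        P = list(parents.get(v, ()))         # P <- pa[v]
        if len(P) > d:
            P = rng.sample(P, d)             # arbitrary subset of P having size d
        S |= set(P)                          # S <- S U P
    return S

# Hypothetical link data for a two-node root set.
children = {1: [5, 6], 2: [6]}
parents = {1: [7, 8, 9], 2: [7]}
S_capped = base_subgraph([1, 2], children, parents, d=2)
S_full = base_subgraph([1, 2], children, parents, d=3)
```

Capping the number of parents keeps very popular pages, which can have a huge number of in-links, from inflating the base set.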

38
HITS Example
Hubs and authorities: two n-dimensional vectors a and h
    HubsAuthorities(G)
      1 ← [1, ..., 1] ∈ R^|V|
      a_0 ← h_0 ← 1
      t ← 1
      repeat
        for each v in V
          do a_t(v) ← Σ_{w ∈ pa[v]} h_{t-1}(w)
             h_t(v) ← Σ_{w ∈ ch[v]} a_{t-1}(w)
        a_t ← a_t / ||a_t||
        h_t ← h_t / ||h_t||
        t ← t + 1
      until ||a_t - a_{t-1}|| + ||h_t - h_{t-1}|| < ε
      return (a_t, h_t)
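A runnable sketch of the hubs-and-authorities iteration. Several details are assumptions rather than part of the slide: the graph is given as a list of (source, target) edges, the norm is Euclidean, the stopping test uses the sum of absolute changes, and the graph has at least one edge.

```python
import math

def hubs_authorities(edges, eps=1e-10, max_iter=1000):
    """Iterate a(v) <- sum of h over parents, h(v) <- sum of a over children."""
    nodes = sorted({u for u, _ in edges} | {v for _, v in edges})
    a = {v: 1.0 for v in nodes}
    h = {v: 1.0 for v in nodes}
    for _ in range(max_iter):
        new_a = {v: 0.0 for v in nodes}
        new_h = {v: 0.0 for v in nodes}
        for u, v in edges:                     # edge u -> v
            new_a[v] += h[u]                   # u is a parent of v
            new_h[u] += a[v]                   # v is a child of u
        norm_a = math.sqrt(sum(x * x for x in new_a.values()))
        norm_h = math.sqrt(sum(x * x for x in new_h.values()))
        new_a = {v: x / norm_a for v, x in new_a.items()}
        new_h = {v: x / norm_h for v, x in new_h.items()}
        delta = sum(abs(new_a[v] - a[v]) + abs(new_h[v] - h[v]) for v in nodes)
        a, h = new_a, new_h
        if delta < eps:
            break
    return a, h

# Hypothetical graph: hubs 1 and 2, authorities 3 and 4.
a, h = hubs_authorities([(1, 3), (2, 3), (2, 4)])
```

In this small graph page 3 is cited by both hubs and ends up with the larger authority weight, while page 2 links to both authorities and ends up with the larger hub weight.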
39
HITS Example Results
[Figure: authority and hubness weights for nodes 1 through 15]
40
HITS Improvements
Bharat and Henzinger (1998)
  • HITS problems
    • A document can contain many identical links to the same document on another host
    • Links may be generated automatically (e.g., messages posted on newsgroups)
  • Solutions
    • Assign weights to identical multiple edges that are inversely proportional to their multiplicity
    • Prune irrelevant nodes, or regulate the influence of a node with a relevance weight

41
PageRank
  • Introduced by Page et al. (1998)
    • The weight is assigned by the rank of parents
  • Difference from HITS
    • HITS uses hubness and authority weights
    • The rank of a page is proportional to its parents' rank, but inversely proportional to its parents' outdegree

42
Matrix Notation
Adjacency matrix A
[Figure: adjacency matrix A; source: http//www.kusatro.kyoto-u.com]
43
Matrix Notation
  • Matrix notation:
  • $r = \alpha B r = M r$
  • $\alpha$: eigenvalue
  • r: eigenvector of B
  • $A x = \lambda x$
  • $(A - \lambda I)\, x = 0$

[Figure: matrix B]
Finding PageRank <=> finding the eigenvector of B with an associated eigenvalue $\alpha$
44
Matrix Notation
PageRank = the eigenvector (column of P) associated with the maximum eigenvalue
$B = P D P^{-1}$
D: diagonal matrix of eigenvalues $\{\lambda_1, \ldots, \lambda_n\}$
P: regular (invertible) matrix that consists of the eigenvectors
PageRank = $r_1$, normalized
45
Matrix Notation
  • Confirm the result
    • # of inlinks from high-ranked pages
    • hard to explain about 52, 67
  • Interesting topic
    • How do you get your homepage highly ranked?

46
Markov Chain Notation
  • Random surfer model
  • Description of a random walk through the Web
    graph
  • The rank of a page is interpreted as the asymptotic probability that a surfer is currently browsing that page

$r_t = M r_{t-1}$, where M is the transition matrix for a first-order Markov chain (stochastic)
Does it converge to some sensible solution (as $t \to \infty$) regardless of the initial ranks?
47
Problem
  • Rank sink problem
    • In general, many Web pages have no inlinks/outlinks
    • This results in dangling edges in the graph
  • E.g.,
    • no parent => rank 0
      • $M^t$ converges to a matrix whose last column is all zero
    • no children => no solution
      • $M^t$ converges to the zero matrix
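The "no children" case can be seen on a tiny example. The two-page Web below is hypothetical, and the orientation of M (entry M[j][i] holds the weight of the link from page i to page j, so that $r_t = M r_{t-1}$) is an assumption about the slide's convention.

```python
def matmul(A, B):
    """Multiply two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def matvec(A, x):
    """Multiply a square matrix by a vector."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]

# Hypothetical Web: page 1 links to page 2, page 2 has no children.
M = [[0.0, 0.0],
     [1.0, 0.0]]

M2 = matmul(M, M)        # already the zero matrix after two steps
r0 = [0.5, 0.5]
r1 = matvec(M, r0)       # the rank held by page 2 has nowhere to go
r2 = matvec(M, r1)       # all rank has leaked out of the system
```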

48
Modification
  • Surfer will restart browsing by picking a new Web
    page at random
  • M = (B + E)
  • E: escape matrix
  • M: stochastic matrix
  • Still problem?
  • It is not guaranteed that M is primitive
  • If M is stochastic and primitive, PageRank
    converges to corresponding stationary
    distribution of M

49
PageRank Algorithm
  • Page et al, 1998

50
Distribution of the Mixture Model
  • The probability distribution that results from combining the Markovian random walk distribution and the static rank source distribution
  • $r = \varepsilon e + (1 - \varepsilon) x$
  • $\varepsilon$: probability of selecting a non-linked page

PageRank
Now the transition matrix $\varepsilon H + (1 - \varepsilon) M$ is primitive and stochastic, and $r_t$ converges to the dominant eigenvector
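A short derivation of why the mixture is stochastic and primitive, under assumptions that the slide does not spell out: M is taken to be column-stochastic, and H is taken to be the rank-one matrix whose columns all equal the rank source distribution e, with every $e_i > 0$.

```latex
% Assumptions: \mathbf{1}^T M = \mathbf{1}^T (M column-stochastic),
% H = e\,\mathbf{1}^T with e_i > 0 and \mathbf{1}^T e = 1, and 0 < \varepsilon \le 1.
\begin{align*}
\mathbf{1}^T \bigl(\varepsilon H + (1-\varepsilon) M\bigr)
  &= \varepsilon\,(\mathbf{1}^T e)\,\mathbf{1}^T + (1-\varepsilon)\,\mathbf{1}^T M
   = \varepsilon\,\mathbf{1}^T + (1-\varepsilon)\,\mathbf{1}^T
   = \mathbf{1}^T, \\
\bigl(\varepsilon H + (1-\varepsilon) M\bigr)_{ij}
  &\ge \varepsilon\, e_i > 0, \\
r_t = \bigl(\varepsilon H + (1-\varepsilon) M\bigr)\, r_{t-1}
  &= \varepsilon\, e + (1-\varepsilon)\, M r_{t-1}
  \quad \text{whenever } \mathbf{1}^T r_{t-1} = 1 .
\end{align*}
```

The first line shows the columns still sum to 1, the second that every entry is strictly positive (which implies primitivity), and the third recovers the mixture form $r = \varepsilon e + (1 - \varepsilon) x$ with $x = M r_{t-1}$.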
51
Stability
  • Are link analysis algorithms based on eigenvectors stable, in the sense that results don't change significantly?
  • The connectivity of a portion of the graph is changed arbitrarily
  • How will this affect the results of the algorithms?

52
Stability of HITS
  • Ng et al (2001)
  • A bound on the number of hyperlinks k that can be added or deleted from one page without affecting the authority or hubness weights
  • It is possible to perturb a symmetric matrix by a quantity that grows as $\delta$ that produces a constant perturbation of the dominant eigenvector

$\delta$: eigengap $\lambda_1 - \lambda_2$; d: maximum outdegree of G
53
Stability of PageRank
Ng et al (2001)
V: the set of vertices touched by the perturbation
  • The parameter $\varepsilon$ of the mixture model has a stabilization role
  • If the set of pages affected by the perturbation
    have a small rank, the overall change will also
    be small

Tighter bound by Bianchini et al. (2001)
d(j) gt 2 depends on the edges incident on j
54
SALSA
  • SALSA (Lempel, Moran 2001)
  • Probabilistic extension of the HITS algorithm
  • Random walk is carried out by following
    hyperlinks both in the forward and in the
    backward direction
  • Two separate random walks
  • Hub walk
  • Authority walk

55
Forming a Bipartite Graph in SALSA
56
Random Walks
  • Hub walk
    • Follow a Web link from a page $u_h$ to a page $w_a$ (a forward link) and then
    • Immediately traverse a backlink going from $w_a$ to $v_h$, where (u,w) ∈ E and (v,w) ∈ E
  • Authority walk
    • Follow a backlink from a page $w_a$ to a page $u_h$ (a backward link) and then
    • Immediately traverse a forward link going from $u_h$ to a page $z_a$, where (u,w) ∈ E and (u,z) ∈ E

57
Computing Weights
  • Hub weight computed from the sum of the product
    of the inverse degree of the in-links and the
    out-links
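The authority walk can be written out as a transition matrix on a small example. The sketch below uses a hypothetical three-edge graph and exact fractions; it checks the known property that, within a connected component, weights proportional to in-degree are stationary for this walk. The function name and data layout are assumptions.

```python
from fractions import Fraction

edges = [(1, 3), (2, 3), (2, 4)]   # hypothetical hub -> authority links

def authority_chain(edges):
    """Transition probabilities of the authority walk: one backward step, then one forward step."""
    indeg, outdeg = {}, {}
    for u, w in edges:
        outdeg[u] = outdeg.get(u, 0) + 1
        indeg[w] = indeg.get(w, 0) + 1
    P = {i: {j: Fraction(0) for j in indeg} for i in indeg}
    for u, i in edges:                 # backward link from authority i to hub u
        for u2, j in edges:            # forward link from hub u to authority j
            if u2 == u:
                P[i][j] += Fraction(1, indeg[i]) * Fraction(1, outdeg[u])
    return P, indeg

P, indeg = authority_chain(edges)
total = sum(indeg.values())
pi = {i: Fraction(indeg[i], total) for i in indeg}          # candidate: proportional to in-degree
next_pi = {j: sum(pi[i] * P[i][j] for i in P) for j in P}   # one step of the walk
```

Each term multiplies the inverse in-degree of the current authority by the inverse out-degree of the intermediate hub, which is the product of inverse degrees mentioned in the slide.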

58
Why We Care
  • Lempel and Moran (2001) showed theoretically that SALSA weights are more robust than HITS weights in the presence of the Tightly Knit Community (TKC) effect.
  • This effect occurs when a small collection of
    pages (related to a given topic) is connected so
    that every hub links to every authority and
    includes as a special case the mutual
    reinforcement effect
  • The pages in a community connected in this way
    can be ranked highly by HITS, higher than pages
    in a much larger collection where only some hubs
    link to some authorities
  • TKC could be exploited by spammers hoping to
    increase their page weight (e.g. link farms)

59
A Similar Approach
  • Rafiei and Mendelzon (2000) and Ng et al. (2001)
    propose similar approaches using reset as in
    PageRank
  • Unlike PageRank, in this model the surfer will
    follow a forward link on odd steps but a backward
    link on even steps
  • The stability properties of these ranking
    distributions are similar to those of PageRank
    (Ng et al. 2001)

60
Overcoming TKC
  • Similarity downweight sequencing and sequential
    clustering (Roberts and Rosenthal 2003)
  • Consider the underlying structure of clusters
  • Suggest downweight sequencing to avoid the Tightly Knit Community problem
  • Results indicate the approach is effective for the few tested queries, but it is still untested on a large scale

61
PHITS and More
  • PHITS: Cohn and Chang (2000)
  • Only the principal eigenvector is extracted using
    SALSA, so the authority along the remaining
    eigenvectors is completely neglected
  • Account for more eigenvectors of the co-citation
    matrix
  • See also Lempel, Moran (2003)

62
Limits of Link Analysis
  • META tags/ invisible text
  • Search engines relying on meta tags in documents
    are often misled (intentionally) by web
    developers
  • Pay-for-place
  • Search engine bias: organizations pay search engines and page rank
  • Advertisements: organizations pay high-ranking pages for advertising space
  • With a primary effect of increased visibility to
    end users and a secondary effect of increased
    respectability due to relevance to high ranking
    page

63
Limits of Link Analysis
  • Stability
  • Adding even a small number of nodes/edges to the
    graph has a significant impact
  • Topic drift: similar to TKC
  • A top authority may be a hub of pages on a
    different topic resulting in increased rank of
    the authority page
  • Content evolution
  • Adding/removing links/content can affect the
    intuitive authority rank of a page requiring
    recalculation of page ranks

64
Further Reading
  • R. Lempel and S. Moran, Rank Stability and Rank
    Similarity of Link-Based Web Ranking Algorithms
    in Authority Connected Graphs, Submitted to
    Information Retrieval, special issue on Advances
    in Mathematics/Formal Methods in Information
    Retrieval, 2003.
  • M. Henzinger, Link Analysis in Web Information Retrieval, Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, 2000.
  • L. Getoor, N. Friedman, D. Koller, and A.
    Pfeffer. Relational Data Mining, S. Dzeroski and
    N. Lavrac, Eds., Springer-Verlag, 2001