CS 277: Data Mining Lecture 14: Page Rank and HITS


1
CS 277: Data Mining
Lecture 14: Page Rank and HITS

David Newman
Department of Computer Science
University of California, Irvine

2
Notices

Homework 3 due in class Nov 20
Progress Report 2 today

3
Progress Report 2
4
Link Analysis Objectives

To review common approaches to link analysis
To calculate the popularity of a site based on
link analysis
To model human judgments indirectly

5
Outline

Page Rank
Hubs and Authorities (HITS)
Stability
Probabilistic Link Analysis
Limitation of Link Analysis

6
Web Mining

The Web: a potentially enormous data set for data mining
3 primary aspects of Web mining
Web page content
classifying/clustering Web pages based on their
text
Web connectivity
characterizing distributions on path lengths
between pages
determining importance of pages from graph
structure
Web usage
understanding user behavior from Web logs
All 3 are interconnected/interdependent
Google (and most search engines) use both content
and connectivity
This lecture: Web connectivity

7
The Web Graph

$G = (V, E)$
$V$ = set of all Web pages
$E$ = set of all hyperlinks
Number of nodes?
Difficult to estimate
Crawling the Web is highly non-trivial
At least 30 billion pages out there
Number of edges?
$|E| = O(|V|)$
i.e., mean number of outlinks per page is a small constant

8
The Web Graph

The Web graph is inherently dynamic
nodes and edges are continually appearing and
disappearing
Interested in general properties of the Web graph
What is the distribution of the number of
in-links and out-links?
What is the distribution of number of pages per
site?
Typically power-laws for many of these
distributions
How far apart are 2 randomly selected pages on
the Web?
What is the average distance between 2 random
pages?
And so on

9
Power-law degree distribution: $P(k) \sim k^{-\gamma}$
Albert, Jeong, Barabasi, 1999
10
Social Networks

Social networks = graphs
$V$ = set of actors (e.g., students in a class)
$E$ = set of interactions (e.g., collaborations)
Typically small graphs, e.g., $|V|$ = 10 or 50
Long history of social network analysis (e.g. at
UCI)
Quantitative data analysis techniques that can
automatically extract structure or information
from graphs
who is the most important actor in a network?
are there clusters in the network?
Comprehensive reference
S. Wasserman and K. Faust, Social Network
Analysis, Cambridge University Press, 1994.

11
Node Importance in Social Networks

General idea is that some nodes are more
important than others in terms of the structure
of the graph
In a directed graph, in-degree may be a useful
indicator of importance
for a citation network among authors (or papers),
in-degree is the number of citations => importance
However
in-degree is only a first-order measure in that
it implicitly assumes that all edges are of equal
importance

12
Recursive Notions of Node Importance

$w_{ij}$ = weight of link from node $i$ to node $j$
assume $\sum_j w_{ij} = 1$ and weights are non-negative
default choice: $w_{ij} = 1/\text{outdegree}(i)$
more outlinks => less importance attached to each
Define $r_j$ = importance of node $j$ in a directed graph
$r_j = \sum_i w_{ij} r_i$, for $i, j = 1, \ldots, n$
Importance of a node is a weighted sum of the
importance of nodes that point to it
Makes intuitive sense
Leads to a set of recursive linear equations
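
The default weighting and one application of the importance equation can be sketched in Python. This is only an illustration: the link structure in `out_links` is a hypothetical four-node graph, and the helper names `weight_matrix` and `importance_update` are chosen here, not taken from the lecture.

```python
# Hypothetical directed graph: node -> list of nodes it links to.
out_links = {1: [2], 2: [1, 4], 3: [2, 4], 4: [2, 3]}

def weight_matrix(out_links):
    """Return W as a nested dict with W[i][j] = 1 / outdegree(i) for each link i -> j."""
    return {i: {j: 1.0 / len(targets) for j in targets}
            for i, targets in out_links.items()}

def importance_update(W, r):
    """Apply r_j = sum_i w_ij * r_i once and return the new importance values."""
    new_r = {j: 0.0 for j in W}
    for i, row in W.items():
        for j, w in row.items():
            new_r[j] += w * r[i]
    return new_r

W = weight_matrix(out_links)
row_sums = {i: sum(row.values()) for i, row in W.items()}   # each should be 1

r0 = {i: 0.25 for i in W}          # start with equal importance
r1 = importance_update(W, r0)      # one pass of the recursive equations
print(row_sums)
print(r1)
```

Because every row of `W` sums to 1, the update only redistributes importance among the nodes; the total stays the same from one pass to the next.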

13
Simple Example
[Figure: directed graph on four nodes, 1 to 4]
14
Simple Example
[Figure: the same four-node graph with link weights (one link of weight 1, six links of weight 0.5)]
15
Simple Example
[Figure: the weighted four-node graph alongside its weight matrix $W$]
16
Matrix-Vector form

Recall $r_j$ = importance of node $j$
$r_j = \sum_i w_{ij} r_i$, for $i, j = 1, \ldots, n$
e.g., $r_2 = 1 \cdot r_1 + 0 \cdot r_2 + 0.5\, r_3 + 0.5\, r_4$
= dot product of the $r$ vector with column 2 of $W$
Let $r$ = $n \times 1$ vector of importance values for the $n$ nodes
Let $W$ = $n \times n$ matrix of link weights
=> we can rewrite the importance equations as
$r = W^T r$

17
Eigenvector Formulation

Need to solve the importance equations for unknown $r$, with known $W$
$r = W^T r$
This is a standard eigenvalue problem, i.e.,
$A r = \lambda r$ (where $A = W^T$)
with $\lambda$ = an eigenvalue = 1
and $r$ = the eigenvector corresponding to $\lambda = 1$
Results from linear algebra tell us that
$W$ and $W^T$ have the same eigenvalues (though generally different eigenvectors)
Since $W$ is a stochastic matrix, the largest of these eigenvalues is always 1
So the importance vector $r$ is the eigenvector of $W^T$ corresponding to the largest eigenvalue
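
A direct solution for a small graph can be sketched with NumPy. The matrix `W` below is an assumption: the slide figures are not in this transcript, and this is one weight matrix that is consistent with the $r_2$ equation on the previous slide. The code picks the eigenvector of $W^T$ whose eigenvalue is closest to 1 and rescales it to sum to 1.

```python
import numpy as np

# Assumed 4-node weight matrix (rows sum to 1); W[i, j] = weight of link i+1 -> j+1.
W = np.array([
    [0.0, 1.0, 0.0, 0.0],
    [0.5, 0.0, 0.0, 0.5],
    [0.0, 0.5, 0.0, 0.5],
    [0.0, 0.5, 0.5, 0.0],
])

eigenvalues, eigenvectors = np.linalg.eig(W.T)   # columns are eigenvectors of W^T
k = np.argmin(np.abs(eigenvalues - 1.0))         # index of the eigenvalue closest to 1
r = np.real(eigenvectors[:, k])
r = r / r.sum()                                  # rescale so the importances sum to 1

print(np.round(r, 2))
```

For this assumed matrix the importance vector comes out at approximately (0.2, 0.4, 0.13, 0.27). A dense eigendecomposition like this is only practical for small $n$.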

18
Solution for the Simple Example
Solving for the eigenvector of $W^T$ with eigenvalue 1, we get
$r = [0.2, 0.4, 0.13, 0.27]^T$
Results are quite intuitive
[Figure: the weighted four-node graph and its weight matrix $W$]
19
Solution for the Simple Example
[Figure: the weighted four-node graph with each node labeled by its importance: node 1 = 0.2, node 2 = 0.4, node 3 = 0.13, node 4 = 0.27]
20
How can we apply this to the Web?

Given a set of Web pages and hyperlinks
Weights from each page = 1/(# of outlinks)
Solve for the eigenvector ($\lambda = 1$) of the weight matrix
Problem
Solving an eigenvector equation scales as $O(n^3)$
For the entire Web graph, $n$ > 10 billion (!!)
So direct solution is not feasible
Can use the power method (iterative)
$r^{(k+1)} = W^T r^{(k)}$

21
Power Method for solving for r

$r^{(k+1)} = W^T r^{(k)}$
Define a suitable starting vector $r^{(1)}$
e.g., all entries 1/n, or all entries indegree(node)/|E|, etc.
Each iteration is a matrix-vector multiplication => $O(n^2)$
- problematic?
no: since $W$ is highly sparse (Web pages have limited outdegree), each iteration is effectively $O(n)$
For sparse $W$, the iterations typically converge quite quickly
- rate of convergence depends on the spectral gap
  - how quickly does error(k) ~ $(\lambda_2/\lambda_1)^k$ go to 0 as a function of $k$?
  - if $\lambda_2$ is close to 1 (= $\lambda_1$) then convergence is slow
- empirically: Web graph with 300 million pages
  - 50 iterations to convergence (Brin and Page, 1998)
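
The power method with a sparse representation can be sketched as follows. This is a minimal illustration under stated assumptions: the graph is stored as adjacency lists, every node has at least one out-link, and `power_method` and the example graph are hypothetical names and data chosen here.

```python
def power_method(out_links, tol=1e-10, max_iter=1000):
    """Iterate r <- W^T r with w_ij = 1/outdegree(i), using adjacency lists.

    Each pass touches every link once, so the cost per iteration is
    proportional to the number of links rather than n^2.
    Assumes every node has at least one out-link.
    """
    nodes = list(out_links)
    r = {v: 1.0 / len(nodes) for v in nodes}      # starting vector: all entries 1/n
    for iteration in range(1, max_iter + 1):
        new_r = {v: 0.0 for v in nodes}
        for i, targets in out_links.items():
            share = r[i] / len(targets)           # w_ij * r_i for each out-link of i
            for j in targets:
                new_r[j] += share
        change = sum(abs(new_r[v] - r[v]) for v in nodes)
        r = new_r
        if change < tol:                          # stop once r has settled
            break
    return r, iteration

# Hypothetical 4-node graph used only for illustration.
out_links = {1: [2], 2: [1, 4], 3: [2, 4], 4: [2, 3]}
r, iterations = power_method(out_links)
print(r, iterations)
```

The number of iterations needed depends on how close the second-largest eigenvalue magnitude is to 1, which is the spectral-gap effect described above.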

23
Markov Chain Interpretation

$W$ is a stochastic matrix (rows sum to 1) by definition
we can interpret $W$ as defining the transition probabilities in a Markov chain
$w_{ij}$ = probability of transitioning from node $i$ to node $j$
Markov chain interpretation
$r = W^T r$
=> these are the solutions of the steady-state probabilities for a Markov chain
page importance <=> steady-state Markov probabilities <=> eigenvector

24
The Random Surfer Interpretation

Recall that for the Web model, we set $w_{ij} = 1/\text{outdegree}(i)$
Thus, in using W for computing importance of Web
pages, this is equivalent to a model where
We have a random surfer who surfs the Web for an
infinitely long time
At each page the surfer randomly selects an
outlink to the next page
Importance of a page = fraction of visits the
surfer makes to that page
This is intuitive: pages that have better
connectivity will be visited more often
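
The random surfer model can be simulated directly. The sketch below is illustrative only: the graph is hypothetical, the walk is finite rather than infinitely long, and `simulate_surfer` is a name chosen here. It counts how often each page is visited and reports the visit fractions.

```python
import random

def simulate_surfer(out_links, steps, seed=0):
    """Follow uniformly random out-links for `steps` moves; return visit fractions."""
    rng = random.Random(seed)                 # fixed seed so the run is repeatable
    visits = {v: 0 for v in out_links}
    page = next(iter(out_links))              # arbitrary starting page
    for _ in range(steps):
        page = rng.choice(out_links[page])    # each out-link has probability 1/outdegree
        visits[page] += 1
    return {v: count / steps for v, count in visits.items()}

# Hypothetical 4-node graph in which every page has at least one out-link.
out_links = {1: [2], 2: [1, 4], 3: [2, 4], 4: [2, 3]}
fractions = simulate_surfer(out_links, steps=200_000)
print(fractions)
```

With enough steps the visit fractions approach the importance values obtained from the eigenvector equation for the same graph, here roughly 0.2, 0.4, 0.13 and 0.27.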

25
Potential Problems
[Figure: example graph on pages 1 to 4]
Page 1 is a sink (no outlink)
Pages 3 and 4 are also sinks (no outlink from the system)
Markov chain theory tells us that no unique steady-state solution exists: depending on where you start, you will end up at 1 or at 3, 4
The Markov chain is reducible
26
Making the Web Graph Irreducible

One simple solution to our problem is to modify
the Markov chain
With probability $\alpha$ the random surfer jumps to any random page in the system (with probability of 1/n, conditioned on such a jump)
With probability $1 - \alpha$ the random surfer selects an outlink (randomly from the set of available outlinks)
The resulting transition graph is fully connected
=> Markov system is irreducible => steady-state solutions exist
Typically $\alpha$ is chosen to be between 0.1 and 0.2 in practice
New power iterations can be written as
$r^{(k+1)} = (1 - \alpha)\, W^T r^{(k)} + (\alpha/n)\, \mathbf{1}^T$, where $\mathbf{1}$ is the $1 \times n$ row vector of ones
Complexity is still O(n) per iteration for sparse
W
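
The modified iteration can be sketched in Python. Two things here are assumptions rather than part of the slides: the example graph is hypothetical, and a sink page with no out-links is treated as spreading its importance uniformly over all pages, which is one common way to keep $r$ summing to 1. The function name `pagerank` is also chosen here.

```python
def pagerank(out_links, alpha=0.15, tol=1e-12, max_iter=1000):
    """Power iteration for r <- (1 - alpha) W^T r + (alpha / n) 1.

    Assumption (not from the slides): a sink page with no out-links spreads
    its importance uniformly over all n pages, so that r keeps summing to 1.
    """
    nodes = list(out_links)
    n = len(nodes)
    r = {v: 1.0 / n for v in nodes}
    for _ in range(max_iter):
        sink_mass = sum(r[v] for v in nodes if not out_links[v])
        base = alpha / n + (1.0 - alpha) * sink_mass / n   # jump term + sink share
        new_r = {v: base for v in nodes}
        for i, targets in out_links.items():
            if targets:
                share = (1.0 - alpha) * r[i] / len(targets)
                for j in targets:
                    new_r[j] += share
        change = sum(abs(new_r[v] - r[v]) for v in nodes)
        r = new_r
        if change < tol:
            break
    return r

# Hypothetical graph with a sink (page 1) and a closed pair (pages 3 and 4).
out_links = {1: [], 2: [1, 3], 3: [4], 4: [3]}
r = pagerank(out_links, alpha=0.15)
print(r)
```

Each pass still visits every link once, so the per-iteration cost stays proportional to the number of links.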

27
The PageRank Algorithm

S. Brin and L. Page, The anatomy of a large-scale
hypertextual search engine, in Proceedings of the
7th WWW Conference, 1998.
PageRank = the method on the previous slide,
applied to the entire Web graph
Crawl the Web (highly non-trivial!)
Store both connectivity and content
Calculate (off-line) the pagerank r for each
Web page using the power iteration method
How can this be used to answer Web queries?
Terms in the search query are used to limit the
set of pages of possible interest
Pages are then ordered for the user via
precomputed pageranks
The Google search engine combines r with
text-based measures
This was the first demonstration that link
information could be used for content-based
search on the Web

28
Link Structure helps in Web Search
[Figure from Singhal and Kaszkiel, 2001]
SE1, etc., indicate different (anonymized) commercial search engines, all using link structure (e.g., PageRank) in their rankings
29
PageRank architecture at Google

Ranking of pages is more important than exact values of $p_i$
Pre-compute and store the PageRank of each page.
PageRank independent of any query or textual
content.
Ranking scheme combines PageRank with textual
match
Unpublished
Many empirical parameters, human effort and
regression testing.
Criticism: ad hoc coupling and decoupling
between query relevance and graph importance
Massive engineering effort
Continually crawling the Web and updating page
ranks

31
PageRank Limitations

Rich get richer syndrome
not as democratic as originally (nobly) claimed
certainly not 1 vote per WWW citizen
also crawling frequency tends to be based on
pagerank
for detailed grumblings, see www.google-watch.org,
etc.
Not query-sensitive
random walk same regardless of query topic
whereas real random surfer has some topic
interests
non-uniform jumping vector needed
would enable personalization (but requires faster
eigenvector convergence)
Topic of ongoing research
Ad hoc mix of PageRank + keyword match score
done in two steps for efficiency, not for quality reasons

33
HITS - Kleinberg's Algorithm

HITS: Hypertext Induced Topic Selection

For each vertex $v \in V$ in a subgraph of interest:

$a(v)$ - the authority of $v$
$h(v)$ - the hubness of $v$

A site is very authoritative if it receives many citations. Citations from important sites weigh more than citations from less important sites.

Hubness shows the importance of a site. A good hub is a site that links to many authoritative sites.

34
Authority and Hubness
[Figure: node 1 links to nodes 5, 6, 7 and is linked from nodes 2, 3, 4]
$h(1) = a(5) + a(6) + a(7)$
$a(1) = h(2) + h(3) + h(4)$
35
Authority and Hubness Convergence

Recursive dependency
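$a(v) \leftarrow \sum_{w \in pa[v]} h(w)$
$h(v) \leftarrow \sum_{w \in ch[v]} a(w)$

where $pa[v]$ is the set of parents of $v$ and $ch[v]$ is the set of children of $v$

Using linear algebra, we can prove that $a(v)$ and $h(v)$ converge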
36
HITS Example
Find a base subgraph

Start with a root set $R = \{1, 2, 3, 4\}$

$\{1, 2, 3, 4\}$ - nodes relevant to the topic

Expand the root set $R$ to include all the children and a fixed number of parents of nodes in $R$

=> a new set $S$ (base subgraph)
37
HITS Example

BaseSubgraph(R, d)
  S ← R
  for each v in R
    do S ← S ∪ ch[v]
       P ← pa[v]
       if |P| > d
         then P ← arbitrary subset of P having size d
       S ← S ∪ P
  return S
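
The base-subgraph construction can be sketched in Python. The link data below is hypothetical, and the "arbitrary subset" of parents is taken to be simply the first `d` entries, which is one possible choice among many.

```python
def base_subgraph(root_set, children, parents, d):
    """Expand a root set R with all children and at most d parents of each node.

    `children[v]` and `parents[v]` are assumed to be lists of the pages that
    v links to and the pages that link to v, respectively.
    """
    S = set(root_set)
    for v in root_set:
        S |= set(children.get(v, []))
        P = list(parents.get(v, []))
        if len(P) > d:
            P = P[:d]          # "arbitrary subset": here simply the first d parents
        S |= set(P)
    return S

# Hypothetical link data for illustration only.
children = {1: [5, 6], 2: [6], 3: [], 4: [7]}
parents = {1: [8, 9, 10], 2: [8], 3: [11], 4: []}
S = base_subgraph([1, 2, 3, 4], children, parents, d=2)
print(sorted(S))
```

Capping the number of parents keeps a very popular page in the root set from pulling an enormous number of pages into the base subgraph.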

38
HITS Example
Hubs and authorities: two n-dimensional vectors a and h

HubsAuthorities(G)
  1 ← [1, ..., 1] ∈ R^|V|
  a_0 ← h_0 ← 1
  t ← 1
  repeat
    for each v in V
      do a_t(v) ← Σ_{w ∈ pa[v]} h_{t-1}(w)
         h_t(v) ← Σ_{w ∈ ch[v]} a_{t-1}(w)
    a_t ← a_t / ||a_t||
    h_t ← h_t / ||h_t||
    t ← t + 1
  until ||a_t - a_{t-1}|| + ||h_t - h_{t-1}|| < ε
  return (a_t, h_t)
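
The hubs-and-authorities iteration can be sketched in plain Python. This is an illustration under assumptions: the graph is a small hypothetical one given as a list of (source, target) links, the norm is Euclidean, and both updates at a step read the weights from the previous step.

```python
import math

def hits(edges, eps=1e-10, max_iter=1000):
    """Iterate authority and hub weights until they stop changing.

    Both updates at step t read the weights from step t - 1, then each
    vector is scaled to unit Euclidean length.
    """
    nodes = sorted({u for u, _ in edges} | {v for _, v in edges})
    a = {v: 1.0 for v in nodes}
    h = {v: 1.0 for v in nodes}
    for _ in range(max_iter):
        new_a = {v: 0.0 for v in nodes}
        new_h = {v: 0.0 for v in nodes}
        for u, v in edges:            # link u -> v: u is a parent of v
            new_a[v] += h[u]          # a(v) sums h over the parents of v
            new_h[u] += a[v]          # h(u) sums a over the children of u
        norm_a = math.sqrt(sum(x * x for x in new_a.values()))
        norm_h = math.sqrt(sum(x * x for x in new_h.values()))
        new_a = {v: x / norm_a for v, x in new_a.items()}
        new_h = {v: x / norm_h for v, x in new_h.items()}
        change = (math.sqrt(sum((new_a[v] - a[v]) ** 2 for v in nodes))
                  + math.sqrt(sum((new_h[v] - h[v]) ** 2 for v in nodes)))
        a, h = new_a, new_h
        if change < eps:
            break
    return a, h

# Hypothetical graph: pages 1, 2, 3 act as hubs; pages 4, 5, 6 as authorities.
edges = [(1, 4), (1, 5), (2, 4), (2, 5), (2, 6), (3, 5)]
a, h = hits(edges)
print(a)
print(h)
```

In this example page 5 receives the largest authority weight and page 2 the largest hub weight. If the graph splits into components whose dominant eigenvalues tie, an iteration of this form need not settle on a single answer.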
39
HITS Example Results
[Figure: authority and hubness weights for nodes 1 to 15]
40
HITS Improvements
Bharat and Henzinger (1998)

HITS problems
The document can contain many identical links to
the same document in another host
Links are generated automatically (e.g. messages
posted on newsgroups)
Solutions
Assign weights to identical multiple edges that are inversely proportional to their multiplicity
Prune irrelevant nodes, or regulate the influence of a node with a relevance weight
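
The first solution can be sketched as follows. This is a simplified reading of the slide, not the published method in full: links are assumed to be given as (source, target) pairs, repeated pairs are treated as the "identical multiple edges", and all names and data are hypothetical.

```python
from collections import Counter

def multiplicity_weights(links):
    """Give each of k identical links weight 1/k, so together they count once."""
    counts = Counter(links)
    return {edge: 1.0 / k for edge, k in counts.items()}

def weighted_authority_update(links, hub):
    """One authority update: a(v) = sum over incoming links of weight * h(source)."""
    weights = multiplicity_weights(links)
    authority = Counter()
    for source, target in links:
        authority[target] += weights[(source, target)] * hub[source]
    return dict(authority)

# Hypothetical data: page "x" links to "y" three times and to "z" once.
links = [("x", "y"), ("x", "y"), ("x", "y"), ("x", "z"), ("w", "z")]
hub = {"x": 1.0, "w": 1.0}
authority = weighted_authority_update(links, hub)
print(authority)
```

With these weights the three repeated links from "x" to "y" contribute as much as a single link would, so repetition alone no longer inflates the authority of "y".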

41
PageRank

Introduced by Page et al (1998)
The weight is assigned by the rank of parents
Difference with HITS
HITS takes Hubness & Authority weights
The page rank is proportional to its parents' rank, but inversely proportional to its parents' outdegree

42
Matrix Notation
Adjacency Matrix
A
http//www.kusatro.kyoto-u.com
43
Matrix Notation

$r = \alpha B r = M r$
$\alpha$: eigenvalue
$r$: eigenvector of $B$
$A x = \lambda x$
$(A - \lambda I)\, x = 0$

[Figure: the matrix $B$]
Finding PageRank => finding the eigenvector of $B$ with an associated eigenvalue $\alpha$
44
Matrix Notation
PageRank: eigenvector of $P$ relative to the max eigenvalue
$B = P D P^{-1}$
$D$: diagonal matrix of eigenvalues $\lambda_1, \ldots, \lambda_n$
$P$: regular matrix that consists of eigenvectors
PageRank = $r_1$, normalized
45
Matrix Notation

Confirm the result
# of inlinks from high-ranked pages
hard to explain about 52, 67

Interesting Topic
How do you get your homepage highly ranked?

46
Markov Chain Notation

Random surfer model
Description of a random walk through the Web
graph
Interpreted as a transition matrix with
asymptotic probability that a surfer is currently
browsing that page

$r_t = M r_{t-1}$
$M$: transition matrix for a first-order Markov chain (stochastic)
Does it converge to some sensible solution (as $t \to \infty$) regardless of the initial ranks?
47
Problem

Rank Sink Problem
In general, many Web pages have no
inlinks/outlinks
It results in dangling edges in the graph
E.g.
no parent => rank 0
$M^t$ converges to a matrix whose last column is all zero
no children => no solution
$M^t$ converges to the zero matrix

48
Modification

Surfer will restart browsing by picking a new Web
page at random
$M = (B + E)$
$E$: escape matrix
$M$: stochastic matrix
Still problem?
It is not guaranteed that M is primitive
If M is stochastic and primitive, PageRank
converges to corresponding stationary
distribution of M

49
PageRank Algorithm

Page et al, 1998

50
Distribution of the Mixture Model

The probability distribution that results from combining the Markovian random walk distribution & the static rank source distribution
$r = \epsilon e + (1 - \epsilon) x$
$\epsilon$: probability of selecting a non-linked page

PageRank
Now, the transition matrix $\epsilon H + (1 - \epsilon) M$ is primitive and stochastic
$r_t$ converges to the dominant eigenvector
51
Stability

Are the link analysis algorithms based on eigenvectors stable, in the sense that results don't change significantly?
Suppose the connectivity of a portion of the graph is changed arbitrarily
How will it affect the results of the algorithms?

52
Stability of HITS

Ng et al (2001)
A bound on the number of hyperlinks $k$ that can be added or deleted from one page without affecting the authority or hubness weights
It is possible to perturb a symmetric matrix by a quantity that grows as $\delta$ that produces a constant perturbation of the dominant eigenvector

$\delta$: eigengap $\lambda_1 - \lambda_2$
$d$: maximum outdegree of $G$
53
Stability of PageRank
Ng et al (2001)
$V$: the set of vertices touched by the perturbation

The parameter $\epsilon$ of the mixture model has a
stabilization role
If the set of pages affected by the perturbation
have a small rank, the overall change will also
be small

Tighter bound by Bianchini et al. (2001)
d(j) > 2 depends on the edges incident on j
54
SALSA

SALSA (Lempel, Moran 2001)
Probabilistic extension of the HITS algorithm
Random walk is carried out by following
hyperlinks both in the forward and in the
backward direction
Two separate random walks
Hub walk
Authority walk

55
Forming a Bipartite Graph in SALSA
56
Random Walks

Hub walk
Follow a Web link from a page $u_h$ to a page $w_a$ (a forward link) and then
immediately traverse a backlink going from $w_a$ to $v_h$, where $(u,w) \in E$ and $(v,w) \in E$
Authority walk
Follow a Web link from a page $w_a$ to a page $u_h$ (a backward link) and then
immediately traverse a forward link going back from $v_h$ to $w_a$, where $(u,w) \in E$ and $(v,w) \in E$

57
Computing Weights

Hub weight computed from the sum of the product
of the inverse degree of the in-links and the
out-links
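
The hub walk can be written out as a transition matrix. The sketch below is illustrative and uses a hypothetical link set: `salsa_hub_chain` builds the probability of moving from hub `u` to hub `v` by one forward link followed by one back link, and the last lines check numerically that weights proportional to out-degree are left unchanged by one step of the walk on this example.

```python
from collections import defaultdict

def salsa_hub_chain(edges):
    """Transition probabilities of the hub walk: forward link, then back link.

    P[u][v] = sum over pages w with (u, w) and (v, w) in E of
              1/outdegree(u) * 1/indegree(w).
    """
    out_nbrs, in_nbrs = defaultdict(list), defaultdict(list)
    for u, w in edges:
        out_nbrs[u].append(w)
        in_nbrs[w].append(u)
    P = {u: defaultdict(float) for u in out_nbrs}
    for u, targets in out_nbrs.items():
        for w in targets:
            for v in in_nbrs[w]:
                P[u][v] += 1.0 / (len(targets) * len(in_nbrs[w]))
    return P, out_nbrs

# Hypothetical link set for illustration.
edges = [(1, 4), (1, 5), (2, 4), (2, 5), (2, 6), (3, 5)]
P, out_nbrs = salsa_hub_chain(edges)

# Candidate hub weights: out-degree divided by the number of links.
pi = {u: len(ws) / len(edges) for u, ws in out_nbrs.items()}
# One step of the walk applied to pi; stationarity means this returns pi.
pi_next = {v: sum(pi[u] * P[u][v] for u in P) for v in P}
print(pi)
print(pi_next)
```

The authority walk can be built the same way with the roles of in-links and out-links exchanged.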

58
Why We Care

Lempel and Moran (2001) showed theoretically that
SALSA weights are more robust than HITS weights
in the presence of the Tightly Knit Community
(TKC) Effect.
This effect occurs when a small collection of
pages (related to a given topic) is connected so
that every hub links to every authority and
includes as a special case the mutual
reinforcement effect
The pages in a community connected in this way
can be ranked highly by HITS, higher than pages
in a much larger collection where only some hubs
link to some authorities
TKC could be exploited by spammers hoping to
increase their page weight (e.g. link farms)
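
The TKC effect can be reproduced on a toy graph. Everything in this sketch is hypothetical: a tightly knit community of three hubs that each link to all three authorities `t1`..`t3`, plus a separate page `p` that receives single links from six unrelated hubs. The iteration used is the common alternating form of HITS (authorities from hubs, then hubs from the new authorities), run for a fixed number of passes.

```python
import math
from collections import Counter

def hits_authorities(edges, iterations=50):
    """Alternating HITS updates with Euclidean normalization; returns authority weights."""
    nodes = {u for u, _ in edges} | {v for _, v in edges}
    h = {v: 1.0 for v in nodes}
    a = {v: 0.0 for v in nodes}
    for _ in range(iterations):
        a = {v: 0.0 for v in nodes}
        for u, v in edges:
            a[v] += h[u]                      # authority from hubs pointing at v
        h = {v: 0.0 for v in nodes}
        for u, v in edges:
            h[u] += a[v]                      # hubness from authorities u points to
        a_norm = math.sqrt(sum(x * x for x in a.values()))
        h_norm = math.sqrt(sum(x * x for x in h.values()))
        a = {v: x / a_norm for v, x in a.items()}
        h = {v: x / h_norm for v, x in h.items()}
    return a

# Tightly knit community: every hub h1..h3 links to every authority t1..t3.
tkc = [(f"h{i}", f"t{j}") for i in range(1, 4) for j in range(1, 4)]
# Loosely connected pages: six hubs g1..g6 each link only to page p.
loose = [(f"g{i}", "p") for i in range(1, 7)]
edges = tkc + loose

in_degree = Counter(v for _, v in edges)
a = hits_authorities(edges)
print(in_degree["p"], in_degree["t1"])
print(a["p"], a["t1"])
```

Page `p` has twice as many in-links as `t1`, yet its HITS authority weight shrinks toward zero while the community's authorities take essentially all of the weight.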

59
A Similar Approach

Rafiei and Mendelzon (2000) and Ng et al. (2001)
propose similar approaches using reset as in
PageRank
Unlike PageRank, in this model the surfer will
follow a forward link on odd steps but a backward
link on even steps
The stability properties of these ranking
distributions are similar to those of PageRank
(Ng et al. 2001)

60
Overcoming TKC

Similarity downweight sequencing and sequential
clustering (Roberts and Rosenthal 2003)
Consider the underlying structure of clusters
Suggest downweight sequencing to avoid the Tightly Knit Community problem
Results indicate the approach is effective for the few tested queries, but it is still untested on a large scale

61
PHITS and More

PHITS: Cohn and Chang (2000)
Only the principal eigenvector is extracted using
SALSA, so the authority along the remaining
eigenvectors is completely neglected
Account for more eigenvectors of the co-citation
matrix
See also Lempel, Moran (2003)

62
Limits of Link Analysis

META tags/ invisible text
Search engines relying on meta tags in documents
are often misled (intentionally) by web
developers
Pay-for-place
Search engine bias: organizations pay search
engines and page rank
Advertisements: organizations pay high-ranking
pages for advertising space
With a primary effect of increased visibility to
end users and a secondary effect of increased
respectability due to relevance to high ranking
page

63
Limits of Link Analysis

Stability
Adding even a small number of nodes/edges to the
graph has a significant impact
Topic drift: similar to TKC
A top authority may be a hub of pages on a
different topic resulting in increased rank of
the authority page
Content evolution
Adding/removing links/content can affect the
intuitive authority rank of a page requiring
recalculation of page ranks

64
Further Reading

R. Lempel and S. Moran, Rank Stability and Rank
Similarity of Link-Based Web Ranking Algorithms
in Authority Connected Graphs, Submitted to
Information Retrieval, special issue on Advances
in Mathematics/Formal Methods in Information
Retrieval, 2003.
M. Henzinger, Link Analysis in Web Information
Retrieval, Bulletin of the IEEE Computer Society
Technical Committee on Data Engineering, 2000.
L. Getoor, N. Friedman, D. Koller, and A.
Pfeffer. Relational Data Mining, S. Dzeroski and
N. Lavrac, Eds., Springer-Verlag, 2001