Title: CS 277: Data Mining Lecture 14: Page Rank and HITS
1CS 277 Data Mining, Lecture 14: Page Rank and HITS
- David Newman
- Department of Computer Science
- University of California, Irvine
2Notices
- Homework 3 due in class Nov 20
- Progress Report 2 today
3Progress Report 2
4Link Analysis Objectives
- To review common approaches to link analysis
- To calculate the popularity of a site based on link analysis
- To model human judgments indirectly
5Outline
- Page Rank
- Hubs and Authorities HITS
- Stability
- Probabilistic Link Analysis
- Limitation of Link Analysis
6Web Mining
- The Web: a potentially enormous data set for data mining
- 3 primary aspects of Web mining
  - Web page content
    - classifying/clustering Web pages based on their text
  - Web connectivity
    - characterizing distributions on path lengths between pages
    - determining importance of pages from graph structure
  - Web usage
    - understanding user behavior from Web logs
- All 3 are interconnected/interdependent
  - Google (and most search engines) use both content and connectivity
- This lecture: Web connectivity
7The Web Graph
- $G = (V, E)$
  - V = set of all Web pages
  - E = set of all hyperlinks
- Number of nodes?
  - Difficult to estimate
  - Crawling the Web is highly non-trivial
  - At least 30 billion pages out there
- Number of edges?
  - $|E| = O(|V|)$
  - i.e., mean number of outlinks per page is a small constant
8The Web Graph
- The Web graph is inherently dynamic
  - nodes and edges are continually appearing and disappearing
- Interested in general properties of the Web graph
  - What is the distribution of the number of in-links and out-links?
  - What is the distribution of number of pages per site?
  - Typically power-laws for many of these distributions
  - How far apart are 2 randomly selected pages on the Web?
    - What is the average distance between 2 random pages?
  - And so on
9Power law degree distribution: $P(k) \sim k^{-\gamma}$
[Figure from Albert, Jeong, Barabasi, 1999]
10Social Networks
- Social networks = graphs
  - V = set of "actors" (e.g., students in a class)
  - E = set of interactions (e.g., collaborations)
  - Typically small graphs, e.g., $|V|$ = 10 or 50
  - Long history of social network analysis (e.g. at UCI)
- Quantitative data analysis techniques that can automatically extract structure or information from graphs
  - who is the most important actor in a network?
  - are there clusters in the network?
- Comprehensive reference
  - S. Wasserman and K. Faust, Social Network Analysis, Cambridge University Press, 1994.
11Node Importance in Social Networks
- General idea is that some nodes are more important than others in terms of the structure of the graph
- In a directed graph, in-degree may be a useful indicator of importance
  - e.g., for a citation network among authors (or papers)
    - in-degree is the number of citations => "importance"
- However:
  - in-degree is only a first-order measure in that it implicitly assumes that all edges are of equal importance
12Recursive Notions of Node Importance
- $w_{ij}$ = weight of link from node i to node j
  - assume $\sum_j w_{ij} = 1$ and weights are non-negative
  - default choice: $w_{ij} = 1/\text{outdegree}(i)$
    - more outlinks => less importance attached to each
- Define $r_j$ = importance of node j in a directed graph
  - $r_j = \sum_i w_{ij}\, r_i$, for $i, j = 1, \ldots, n$
- Importance of a node is a weighted sum of the importance of nodes that point to it
  - Makes intuitive sense
  - Leads to a set of recursive linear equations
13Simple Example
[Figure: directed graph on four nodes, labeled 1 to 4]
14Simple Example
[Figure: the same four-node graph with edge weights: one edge of weight 1 and six edges of weight 0.5]
15Simple Example
[Figure: the weighted four-node graph shown alongside its weight matrix W]
16Matrix-Vector form
- Recall $r_j$ = importance of node j
  - $r_j = \sum_i w_{ij}\, r_i$, for $i, j = 1, \ldots, n$
  - e.g., $r_2 = 1 \cdot r_1 + 0 \cdot r_2 + 0.5\, r_3 + 0.5\, r_4$ = dot product of the r vector with column 2 of W
- Let r = n x 1 vector of importance values for the n nodes
- Let W = n x n matrix of link weights
- => we can rewrite the importance equations as
  - $r = W^T r$
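A minimal Python sketch of one application of $r = W^T r$ for the four-node example. Only column 2 of W (1, 0, 0.5, 0.5) is given in the text; the other entries below are an assumption, chosen so that every row sums to 1 using one weight of 1 and six weights of 0.5. The function name `importance_update` is hypothetical.

```python
# Weight matrix for the 4-node example. ASSUMPTION: only column 2 comes from
# the slides; the remaining entries are reconstructed so that rows sum to 1.
W = [
    [0.0, 1.0, 0.0, 0.0],  # node 1 -> node 2
    [0.5, 0.0, 0.0, 0.5],  # node 2 -> nodes 1, 4
    [0.0, 0.5, 0.0, 0.5],  # node 3 -> nodes 2, 4
    [0.0, 0.5, 0.5, 0.0],  # node 4 -> nodes 2, 3
]


def importance_update(W, r):
    """Return W^T r, i.e. new_r[j] = sum_i W[i][j] * r[i]."""
    n = len(W)
    return [sum(W[i][j] * r[i] for i in range(n)) for j in range(n)]


r = [0.25, 0.25, 0.25, 0.25]      # start from equal importance
new_r = importance_update(W, r)
print(new_r)                      # [0.125, 0.5, 0.125, 0.25]
```

Because each row of W sums to 1, the update redistributes importance without changing its total.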
17Eigenvector Formulation
- Need to solve the importance equations for unknown r, with known W
  - $r = W^T r$
- This is a standard eigenvalue problem, i.e.,
  - $A r = \lambda r$ (where $A = W^T$)
  - with eigenvalue $\lambda = 1$
  - and r the eigenvector corresponding to $\lambda = 1$
- Results from linear algebra tell us that:
  - W and $W^T$ have the same eigenvalues (their eigenvectors generally differ)
  - Since W is a stochastic matrix, the largest of these eigenvalues is always 1
- So the importance vector r is the eigenvector of $W^T$ corresponding to its largest eigenvalue, $\lambda = 1$
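A short derivation, added as a sketch, of why a row-stochastic W has largest eigenvalue 1. The symbols $\mathbf{1}$ (the n x 1 vector of ones) and x (an arbitrary eigenvector of W) are introduced here for illustration only.

```latex
% 1 is an eigenvalue: the rows of W sum to 1.
W\mathbf{1} = \mathbf{1}
\quad\text{because}\quad \sum_j w_{ij} = 1 \ \text{for every } i,
\ \text{so } \lambda = 1 \text{ is an eigenvalue of } W \text{ and hence of } W^T.

% No eigenvalue is larger in magnitude: take Wx = \lambda x and an index i
% with |x_i| = \max_j |x_j| > 0.
|\lambda|\,|x_i| = \Bigl|\sum_j w_{ij}\, x_j\Bigr|
\le \sum_j w_{ij}\,|x_j| \le |x_i|
\quad\Longrightarrow\quad |\lambda| \le 1 .
```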
18Solution for the Simple Example
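Solving for the eigenvector of $W^T$ with $\lambda = 1$ we get r = [0.2 0.4 0.13 0.27]
Results are quite intuitive: node 2, which nodes 1, 3 and 4 all link to, scores highest
[Figure: the weighted four-node graph and its weight matrix W]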
19Solution for the Simple Example
[Figure: the weighted four-node graph with each node labeled by its importance: node 1 = 0.2, node 2 = 0.4, node 3 = 0.13, node 4 = 0.27]
20How can we apply this to the Web?
- Given a set of Web pages and hyperlinks
  - Weights from each page = 1/(# of outlinks)
  - Solve for the eigenvector ($\lambda = 1$) of the weight matrix
- Problem
  - Solving an eigenvector equation scales as $O(n^3)$
  - For the entire Web graph n > 10 billion (!!)
  - So direct solution is not feasible
- Can use the power method (iterative)
  - $r^{(k+1)} = W^T r^{(k)}$
21Power Method for solving for r
- $r^{(k+1)} = W^T r^{(k)}$
- Define a suitable starting vector $r^{(1)}$
  - e.g., all entries 1/n, or all entries = indegree(node)/|E|, etc.
- Each iteration is matrix-vector multiplication => $O(n^2)$
  - problematic?
  - no: since W is highly sparse (Web pages have limited outdegree), each iteration is effectively O(n)
- For sparse W, the iterations typically converge quite quickly:
  - rate of convergence depends on the "spectral gap"
    - how quickly does error(k) ~ $(\lambda_2/\lambda_1)^k$ go to 0 as a function of k?
    - if $\lambda_2$ is close to 1 ($= \lambda_1$) then convergence is slow
  - empirically: Web graph with 300 million pages
    - about 50 iterations to convergence (Brin and Page, 1998)
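A minimal sketch of the power method in Python, using adjacency lists so that each iteration costs time proportional to the number of links. The graph `out_links` is an assumption: a four-node example reconstructed to be consistent with the importance values quoted earlier in the lecture. The function name, tolerance and iteration cap are illustrative, and the sketch assumes every node has at least one outlink.

```python
def power_method(out_links, tol=1e-10, max_iter=1000):
    """Iterate r <- W^T r with w_ij = 1/outdegree(i) until r stops changing."""
    nodes = sorted(out_links)
    r = {v: 1.0 / len(nodes) for v in nodes}       # start with all entries 1/n
    k = 0
    for k in range(1, max_iter + 1):
        new_r = {v: 0.0 for v in nodes}
        for i, targets in out_links.items():
            share = r[i] / len(targets)            # r_i * w_ij for each outlink
            for j in targets:
                new_r[j] += share
        change = sum(abs(new_r[v] - r[v]) for v in nodes)
        r = new_r
        if change < tol:
            break
    return r, k


# ASSUMPTION: hypothetical 4-node graph (node -> list of pages it links to).
out_links = {1: [2], 2: [1, 4], 3: [2, 4], 4: [2, 3]}
r, iterations = power_method(out_links)
print({v: round(x, 2) for v, x in r.items()})   # {1: 0.2, 2: 0.4, 3: 0.13, 4: 0.27}
```

Only nonzero weights are touched, which is what makes the per-iteration cost linear for sparse graphs.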
23Markov Chain Interpretation
- W is a stochastic matrix (rows sum to 1) by definition
  - we can interpret W as defining the transition probabilities in a Markov chain
  - $w_{ij}$ = probability of transitioning from node i to node j
- Markov chain interpretation: $r = W^T r$
  - these are the equations for the steady-state probabilities of a Markov chain
  - page importance <=> steady-state Markov probabilities <=> eigenvector
24The Random Surfer Interpretation
- Recall that for the Web model, we set $w_{ij} = 1/\text{outdegree}(i)$
- Thus, in using W for computing importance of Web pages, this is equivalent to a model where:
  - We have a random surfer who surfs the Web for an infinitely long time
  - At each page the surfer randomly selects an outlink to the next page
  - "Importance" of a page = fraction of visits the surfer makes to that page
  - This is intuitive: pages that have better connectivity will be visited more often
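The random surfer can be simulated directly. The sketch below is illustrative only: the graph `out_links` is a hypothetical four-node example, the surfer runs for a finite number of steps rather than forever, and the function name and seed are assumptions.

```python
import random


def random_surfer(out_links, steps, seed=0):
    """Follow uniformly random outlinks and return the fraction of visits per page."""
    rng = random.Random(seed)                  # fixed seed for repeatability
    visits = {v: 0 for v in out_links}
    page = next(iter(out_links))               # arbitrary starting page
    for _ in range(steps):
        visits[page] += 1
        page = rng.choice(out_links[page])     # pick one outlink uniformly
    return {v: count / steps for v, count in visits.items()}


# ASSUMPTION: hypothetical 4-node graph (node -> list of pages it links to).
out_links = {1: [2], 2: [1, 4], 3: [2, 4], 4: [2, 3]}
fractions = random_surfer(out_links, steps=200_000)
print(fractions)
```

For this graph the visit fractions should approach the steady-state importance values (roughly 0.2, 0.4, 0.13, 0.27) as the number of steps grows.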
25Potential Problems
[Figure: example graph on nodes 1 to 4]
- Page 1 is a sink (no outlink)
- Pages 3 and 4 are also sinks (no outlink from the system)
- Markov chain theory tells us that no unique steady-state solution exists
  - depending on where you start you will end up at 1 or at {3, 4}
- Markov chain is "reducible"
26Making the Web Graph Irreducible
- One simple solution to our problem is to modify the Markov chain:
  - With probability $\alpha$ the random surfer jumps to any random page in the system (with probability 1/n, conditioned on such a jump)
  - With probability $1 - \alpha$ the random surfer selects an outlink (randomly from the set of available outlinks)
- The resulting transition graph is fully connected
  => Markov system is irreducible => steady-state solutions exist
- Typically $\alpha$ is chosen to be between 0.1 and 0.2 in practice
- New power iterations can be written as:
  $r^{(k+1)} = (1 - \alpha)\, W^T r^{(k)} + (\alpha/n)\, \mathbf{1}$, where $\mathbf{1}$ is the n x 1 vector of ones
- Complexity is still O(n) per iteration for sparse W
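A minimal sketch of the modified power iteration in Python. The graph is a hypothetical reducible example (pages 1 and 2 only link to each other, so a surfer who enters that pair never leaves), the jump probability 0.15 is one value inside the 0.1 to 0.2 range mentioned above, and the function name is an assumption. The sketch also assumes every page has at least one outlink; pages with no outlinks would need separate handling.

```python
def damped_power_method(out_links, alpha=0.15, tol=1e-12, max_iter=1000):
    """Iterate r <- (1 - alpha) W^T r + (alpha / n) 1 with w_ij = 1/outdegree(i)."""
    nodes = sorted(out_links)
    n = len(nodes)
    r = {v: 1.0 / n for v in nodes}
    for _ in range(max_iter):
        new_r = {v: alpha / n for v in nodes}            # random-jump term
        for i, targets in out_links.items():
            share = (1.0 - alpha) * r[i] / len(targets)  # link-following term
            for j in targets:
                new_r[j] += share
        change = sum(abs(new_r[v] - r[v]) for v in nodes)
        r = new_r
        if change < tol:
            break
    return r


# ASSUMPTION: hypothetical reducible graph; {1, 2} is a closed pair of pages.
out_links = {1: [2], 2: [1], 3: [1, 4], 4: [3]}
r = damped_power_method(out_links)
print({v: round(x, 3) for v, x in r.items()})
```

With the jump term, pages 3 and 4 keep a nonzero rank even though the undamped chain would eventually leave them for good.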
27The PageRank Algorithm
- S. Brin and L. Page, The anatomy of a large-scale hypertextual search engine, in Proceedings of the 7th WWW Conference, 1998.
- PageRank = the method on the previous slide, applied to the entire Web graph
  - Crawl the Web (highly non-trivial!)
  - Store both connectivity and content
  - Calculate (off-line) the "pagerank" r for each Web page using the power iteration method
- How can this be used to answer Web queries?
  - Terms in the search query are used to limit the set of pages of possible interest
  - Pages are then ordered for the user via precomputed pageranks
  - The Google search engine combines r with text-based measures
- This was the first demonstration that link information could be used for content-based search on the Web
28Link Structure helps in Web Search
[Figure from Singhal and Kaszkiel, 2001]
SE1, etc., indicate different (anonymized) commercial search engines, all using link structure (e.g., PageRank) in their rankings
29PageRank architecture at Google
- Ranking of pages more important than exact values of $p_i$
- Pre-compute and store the PageRank of each page.
  - PageRank independent of any query or textual content.
- Ranking scheme combines PageRank with textual match
  - Unpublished
  - Many empirical parameters, human effort and regression testing.
  - Criticism: ad hoc coupling and decoupling between query relevance and graph importance
- Massive engineering effort
  - Continually crawling the Web and updating page ranks
31PageRank Limitations
- "Rich get richer" syndrome
  - not as "democratic" as originally (nobly) claimed
    - certainly not 1 vote per WWW citizen
  - also: crawling frequency tends to be based on pagerank
  - for detailed grumblings, see www.google-watch.org, etc.
- Not query-sensitive
  - random walk same regardless of query topic
    - whereas real random surfer has some topic interests
    - non-uniform jumping vector needed
    - would enable personalization (but requires faster eigenvector convergence)
    - Topic of ongoing research
- Ad hoc mix of PageRank + keyword match score
  - done in two steps for efficiency, not quality motivations
33HITS - Kleinberg's Algorithm
- HITS: Hypertext Induced Topic Selection
- For each vertex $v \in V$ in a subgraph of interest:
  - a(v): the authority of v
  - h(v): the hubness of v
- A site is very authoritative if it receives many citations. Citations from important sites weigh more than citations from less important sites
- Hubness shows the importance of a site. A good hub is a site that links to many authoritative sites
34Authority and Hubness
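[Figure: two diagrams of node 1, one with out-links to nodes 5, 6, 7 and one with in-links from nodes 2, 3, 4]
- h(1) = a(5) + a(6) + a(7)
- a(1) = h(2) + h(3) + h(4)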
35Authority and Hubness Convergence
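- Recursive dependency:
  - $a(v) \leftarrow \sum_{w \in pa[v]} h(w)$
  - $h(v) \leftarrow \sum_{w \in ch[v]} a(w)$
  - pa[v] = parents of v (pages that link to v); ch[v] = children of v (pages that v links to)
- Using linear algebra, we can prove that a(v) and h(v) converge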
36HITS Example
Find a base subgraph
- Start with a root set R = {1, 2, 3, 4}
  - {1, 2, 3, 4}: nodes relevant to the topic
- Expand the root set R to include all the children and a fixed number of parents of nodes in R
  => a new set S (base subgraph)
37HITS Example
- BaseSubgraph(R, d)
  - S ← R
  - for each v in R
    - do S ← S ∪ ch[v]
    - P ← pa[v]
    - if |P| > d
      - then P ← arbitrary subset of P having size d
    - S ← S ∪ P
  - return S
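The BaseSubgraph pseudocode can be sketched in Python as follows. The `children` and `parents` dictionaries, the example node numbers and the function name `base_subgraph` are hypothetical. The "arbitrary subset" is taken here as the d smallest node ids, purely so the result is deterministic.

```python
def base_subgraph(root, children, parents, d):
    """Expand root set R with all children and at most d parents of each node."""
    S = set(root)
    for v in root:
        S |= children.get(v, set())        # add every child of v
        P = parents.get(v, set())
        if len(P) > d:
            P = set(sorted(P)[:d])         # arbitrary subset of size d (ASSUMPTION: smallest ids)
        S |= P
    return S


# ASSUMPTION: hypothetical link structure around a two-page root set.
children = {1: {5}, 2: {5, 6}}
parents = {1: {7, 8, 9}, 2: {10}}
S = base_subgraph([1, 2], children, parents, d=2)
print(sorted(S))   # [1, 2, 5, 6, 7, 8, 10]
```

Capping the number of parents keeps one very popular root page from flooding the base subgraph with its in-links.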
38HITS Example
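Hubs and authorities: two n-dimensional vectors a and h
- HubsAuthorities(G)
  - $\mathbf{1} \leftarrow [1, \ldots, 1] \in \mathbb{R}^{|V|}$
  - $a_0 \leftarrow h_0 \leftarrow \mathbf{1}$
  - $t \leftarrow 1$
  - repeat
    - for each v in V
      - do $a_t(v) \leftarrow \sum_{w \in pa[v]} h_{t-1}(w)$
      - $h_t(v) \leftarrow \sum_{w \in ch[v]} a_{t-1}(w)$
    - $a_t \leftarrow a_t / \|a_t\|$
    - $h_t \leftarrow h_t / \|h_t\|$
    - $t \leftarrow t + 1$
  - until $\|a_t - a_{t-1}\| + \|h_t - h_{t-1}\| < \varepsilon$
  - return $(a_t, h_t)$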
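A minimal Python sketch of the hubs-and-authorities iteration. Authority scores sum hub scores over parents and hub scores sum authority scores over children, both taken from the previous iteration. The example graph, the function names, the Euclidean norm used for normalization, and the summed absolute difference used as the stopping test are all assumptions made for illustration.

```python
import math


def _normalize(x):
    """Scale a score dictionary to unit Euclidean length (all zeros stay zero)."""
    norm = math.sqrt(sum(val * val for val in x.values()))
    return {k: (val / norm if norm else 0.0) for k, val in x.items()}


def hubs_authorities(children, eps=1e-10, max_iter=1000):
    """children maps each page to the list of pages it links to."""
    nodes = sorted(set(children) | {w for ws in children.values() for w in ws})
    parents = {v: [] for v in nodes}
    for v, ws in children.items():
        for w in ws:
            parents[w].append(v)
    a = {v: 1.0 for v in nodes}
    h = {v: 1.0 for v in nodes}
    for _ in range(max_iter):
        # Both updates read the previous iteration's scores.
        new_a = _normalize({v: sum(h[w] for w in parents[v]) for v in nodes})
        new_h = _normalize({v: sum(a[w] for w in children.get(v, ())) for v in nodes})
        delta = (sum(abs(new_a[v] - a[v]) for v in nodes)
                 + sum(abs(new_h[v] - h[v]) for v in nodes))
        a, h = new_a, new_h
        if delta < eps:
            break
    return a, h


# ASSUMPTION: hypothetical graph; page 1 links to 3 and 4, page 2 links to 3.
children = {1: [3, 4], 2: [3]}
a, h = hubs_authorities(children)
print({v: round(x, 3) for v, x in a.items()})
print({v: round(x, 3) for v, x in h.items()})
```

In this example page 3 ends up with the highest authority (two hubs point to it) and page 1 with the highest hubness (it points to both authorities).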
39HITS Example Results
[Figure: authority and hubness weights for nodes 1 to 15]
40HITS Improvements
Bharat and Henzinger (1998)
- HITS problems
  - A document can contain many identical links to the same document on another host
  - Links are generated automatically (e.g. messages posted on newsgroups)
- Solutions
  - Assign weights to identical multiple edges that are inversely proportional to their multiplicity
  - Prune irrelevant nodes, or regulate the influence of a node with a relevance weight
41PageRank
- Introduced by Page et al (1998)
  - The weight is assigned by the rank of parents
- Difference with HITS
  - HITS takes hubness and authority weights
  - The page rank is proportional to its parents' rank, but inversely proportional to its parents' outdegree
42Matrix Notation
Adjacency matrix A
http//www.kusatro.kyoto-u.com
43Matrix Notation
- Matrix notation
  - $r = \alpha B r = M r$
  - $\alpha$: eigenvalue
  - r: eigenvector of B
  - $A x = \lambda x$
  - $(A - \lambda I)\, x = 0$
[Figure: matrix B]
Finding PageRank <=> finding the eigenvector of B with an associated eigenvalue $\alpha$
44Matrix Notation
- PageRank = eigenvector of P relative to the max eigenvalue
- $B = P D P^{-1}$
  - D: diagonal matrix of eigenvalues $\{\lambda_1, \ldots, \lambda_n\}$
  - P: regular matrix that consists of eigenvectors
- PageRank = $r_1$, normalized
45Matrix Notation
- Confirm the result
  - # of inlinks from high-ranked pages
  - hard to explain about 52, 67
- Interesting topic
  - How do you get your homepage highly ranked?
46Markov Chain Notation
- Random surfer model
  - Description of a random walk through the Web graph
  - Interpreted as a transition matrix with asymptotic probability that a surfer is currently browsing that page
- $r_t = M r_{t-1}$
  - M: transition matrix for a first-order Markov chain (stochastic)
- Does it converge to some sensible solution (as $t \to \infty$) regardless of the initial ranks?
47Problem
- Rank sink problem
  - In general, many Web pages have no inlinks/outlinks
  - It results in dangling edges in the graph
- E.g.
  - no parent => rank 0
    - $M^t$ converges (as t grows) to a matrix whose last column is all zero
  - no children => no solution
    - $M^t$ converges (as t grows) to the zero matrix
48Modification
- Surfer will restart browsing by picking a new Web page at random
  - $M = (B + E)$
  - E: escape matrix
  - M: stochastic matrix
- Still a problem?
  - It is not guaranteed that M is primitive
  - If M is stochastic and primitive, PageRank converges to the corresponding stationary distribution of M
49PageRank Algorithm
50Distribution of the Mixture Model
- The probability distribution that results from combining the Markovian random walk distribution and the static rank source distribution
  - PageRank: $r = \varepsilon e + (1 - \varepsilon) x$
  - $\varepsilon$: probability of selecting a non-linked page
- Now the transition matrix $\varepsilon H + (1 - \varepsilon) M$ is primitive and stochastic
  - $r_t$ converges to the dominant eigenvector
51Stability
- Are the link analysis algorithms based on eigenvectors stable, in the sense that the results don't change significantly?
- Suppose the connectivity of a portion of the graph is changed arbitrarily
  - How will it affect the results of the algorithms?
52Stability of HITS
- Ng et al (2001)
  - A bound on the number of hyperlinks k that can be added or deleted from one page without affecting the authority or hubness weights
  - It is possible to perturb a symmetric matrix by a quantity that grows as $\delta$ and that produces a constant perturbation of the dominant eigenvector
- $\delta$: eigengap $\lambda_1 - \lambda_2$
- d: maximum outdegree of G
53Stability of PageRank
- Ng et al (2001)
  - V: the set of vertices touched by the perturbation
  - The parameter $\varepsilon$ of the mixture model has a stabilization role
  - If the set of pages affected by the perturbation have a small rank, the overall change will also be small
- Tighter bound by Bianchini et al (2001)
  - d(j) > 2 depends on the edges incident on j
54SALSA
- SALSA (Lempel, Moran 2001)
  - Probabilistic extension of the HITS algorithm
  - Random walk is carried out by following hyperlinks both in the forward and in the backward direction
- Two separate random walks
  - Hub walk
  - Authority walk
55Forming a Bipartite Graph in SALSA
56Random Walks
- Hub walk
  - Follow a Web link from a page $u_h$ to a page $w_a$ (a forward link) and then
  - Immediately traverse a backlink going from $w_a$ to $v_h$, where $(u, w) \in E$ and $(v, w) \in E$
- Authority walk
  - Follow a Web link from a page $w_a$ to a page $u_h$ (a backward link) and then
  - Immediately traverse a forward link going from $u_h$ to a page $z_a$, where $(u, w) \in E$ and $(u, z) \in E$
57Computing Weights
- Hub weight computed from the sum of the product
of the inverse degree of the in-links and the
out-links
58Why We Care
- Lempel and Moran (2001) showed theoretically that SALSA weights are more robust than HITS weights in the presence of the Tightly Knit Community (TKC) Effect.
  - This effect occurs when a small collection of pages (related to a given topic) is connected so that every hub links to every authority, and includes as a special case the mutual reinforcement effect
- The pages in a community connected in this way can be ranked highly by HITS, higher than pages in a much larger collection where only some hubs link to some authorities
- TKC could be exploited by spammers hoping to increase their page weight (e.g. link farms)
59A Similar Approach
- Rafiei and Mendelzon (2000) and Ng et al. (2001) propose similar approaches using reset as in PageRank
  - Unlike PageRank, in this model the surfer will follow a forward link on odd steps but a backward link on even steps
  - The stability properties of these ranking distributions are similar to those of PageRank (Ng et al. 2001)
60Overcoming TKC
- Similarity downweight sequencing and sequential clustering (Roberts and Rosenthal 2003)
  - Consider the underlying structure of clusters
  - Suggest downweight sequencing to avoid the Tightly Knit Community problem
  - Results indicate the approach is effective for the few tested queries, but it is still untested on a large scale
61PHITS and More
- PHITS: Cohn and Chang (2000)
  - Only the principal eigenvector is extracted using SALSA, so the authority along the remaining eigenvectors is completely neglected
    - Account for more eigenvectors of the co-citation matrix
- See also Lempel, Moran (2003)
62Limits of Link Analysis
- META tags / invisible text
  - Search engines relying on meta tags in documents are often misled (intentionally) by web developers
- Pay-for-place
  - Search engine bias: organizations pay search engines for page rank
  - Advertisements: organizations pay high-ranking pages for advertising space
    - With a primary effect of increased visibility to end users and a secondary effect of increased respectability due to relevance to a high-ranking page
63Limits of Link Analysis
- Stability
  - Adding even a small number of nodes/edges to the graph has a significant impact
- Topic drift (similar to TKC)
  - A top authority may be a hub of pages on a different topic, resulting in increased rank of the authority page
- Content evolution
  - Adding/removing links/content can affect the intuitive authority rank of a page, requiring recalculation of page ranks
64Further Reading
- R. Lempel and S. Moran, Rank Stability and Rank Similarity of Link-Based Web Ranking Algorithms in Authority Connected Graphs, Submitted to Information Retrieval, special issue on Advances in Mathematics/Formal Methods in Information Retrieval, 2003.
- M. Henzinger, Link Analysis in Web Information Retrieval, Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, 2000.
- L. Getoor, N. Friedman, D. Koller, and A. Pfeffer. Relational Data Mining, S. Dzeroski and N. Lavrac, Eds., Springer-Verlag, 2001.