Title: CS 277: Data Mining, Lecture 15: PageRank and HITS (cont.)
1CS 277 Data Mining, Lecture 15: PageRank and HITS (cont.)
- David Newman
- Department of Computer Science
- University of California, Irvine
2Notices
- Homework 3 due in class Nov 20
- NOT ALLOWED: prob_z_given_w_d = zeros(K,W,D)
3Simple Example
[Figure: directed graph with four nodes labeled 1 to 4]
4Simple Example
[Figure: the four-node graph with edge weights (one edge of weight 1, six of weight 0.5), summarized by the weight matrix W]
5Solution for the Simple Example
r = importance = steady-state Markov probability
[Figure: the four-node weighted graph annotated with the importance values 0.2, 0.4, 0.13 and 0.27]
$r = W r$
r is the eigenvector of W corresponding to the unit eigenvalue
6Code for power method
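#include <stdio.h>
#include <string.h>

#define N 10000000   /* number of nodes */
#define M 500000000  /* number of edges */

int main(int argc, char *argv[])
{
    /* ... declarations, allocation and loading of the edge list
       (ii, jj, aa) are elided on the slide ... */

    for (iter = 0; iter < 100; iter++) {
        /* compute y = A x, one sparse entry A[ii[m]][jj[m]] = aa[m] at a time */
        memset(y, 0, N * sizeof(double));
        for (m = 0; m < M; m++)
            y[ii[m]] += aa[m] * x[jj[m]];

        /* rescale y to unit norm; norm() is the slide's vector-norm helper */
        mynorm = norm(N, y);
        for (n = 0; n < N; n++) y[n] /= mynorm;

        /* z = change since the previous iterate, taken before x is overwritten */
        for (n = 0; n < N; n++) z[n] = x[n] - y[n];
        memcpy(x, y, N * sizeof(double));
        printf("iter %d diff %.3g\n", iter, norm(N, z));
    }
    return 0;
}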
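The loop above can be tried on a small graph. What follows is a minimal sketch, not the lecture's code: the function name power_method, the uniform starting vector, the L1 norm and the stopping tolerance are all assumptions made for illustration. It keeps the slide's ii, jj, aa triple layout, where entry k states A[ii[k]][jj[k]] = aa[k].

```c
#include <stdlib.h>

/* Sum of absolute values (L1 norm); an assumed choice of norm. */
static double l1_norm(int n, const double *v)
{
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += (v[i] < 0.0) ? -v[i] : v[i];
    return s;
}

/*
 * Power iteration for a sparse n x n matrix stored as m triples:
 * entry k states A[ii[k]][jj[k]] = aa[k].
 * x is overwritten with the estimated dominant eigenvector, scaled so
 * that the absolute values of its entries sum to 1.
 * Returns the number of iterations performed, or -1 if allocation fails.
 */
int power_method(int n, int m, const int *ii, const int *jj,
                 const double *aa, double *x, int max_iter, double tol)
{
    double *y = malloc((size_t)n * sizeof *y);
    if (y == NULL)
        return -1;

    for (int i = 0; i < n; i++)
        x[i] = 1.0 / n;                 /* uniform starting vector */

    int iter = 0;
    while (iter < max_iter) {
        /* y = A x */
        for (int i = 0; i < n; i++)
            y[i] = 0.0;
        for (int k = 0; k < m; k++)
            y[ii[k]] += aa[k] * x[jj[k]];

        double norm = l1_norm(n, y);
        if (norm == 0.0)
            break;                      /* A x vanished; nothing to normalize */

        /* normalize, measure the L1 change, and copy y into x */
        double diff = 0.0;
        for (int i = 0; i < n; i++) {
            y[i] /= norm;
            double d = x[i] - y[i];
            diff += (d < 0.0) ? -d : d;
            x[i] = y[i];
        }
        iter++;
        if (diff < tol)
            break;
    }

    free(y);
    return iter;
}
```

As a usage example, take a hypothetical wiring of the four-node graph (the actual edges on the slide are not recoverable from the transcript): node 1 links to 2 with weight 1; node 2 links to 1 and 4, node 3 to 2 and 4, and node 4 to 2 and 3, each with weight 0.5. With 0-based indices this is ii = {1,0,3,1,3,1,2}, jj = {0,1,1,2,2,3,3}, aa = {1,0.5,0.5,0.5,0.5,0.5,0.5}, and the iteration settles at r = (0.2, 0.4, 0.133, 0.267), which agrees with the steady-state values quoted for the simple example.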
7HITS - Kleinberg's Algorithm
- HITS: Hypertext Induced Topic Selection
- For each vertex $v \in V$ in a subgraph of interest:
  - a(v): the authority of v
  - h(v): the hubness of v
- A site is very authoritative if it receives many citations. Citations from important sites weigh more than citations from less important sites
- Hubness shows the importance of a site as a hub. A good hub is a site that links to many authoritative sites
8Authority and Hubness
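[Figure: node 1 drawn twice, once as a hub pointing to nodes 5, 6, 7 and once as an authority pointed to by nodes 2, 3, 4]
$h(1) = a(5) + a(6) + a(7)$
$a(1) = h(2) + h(3) + h(4)$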
9Authority and Hubness Convergence
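- Recursive dependency:
  $a(v) \leftarrow \sum_{w \in pa[v]} h(w)$
  $h(v) \leftarrow \sum_{w \in ch[v]} a(w)$
  where pa[v] are the parents of v (pages linking to v) and ch[v] are its children (pages v links to)
- Using linear algebra, we can prove that a(v) and h(v) converge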
10Authority and Hubness Convergence
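$a = E^\top h$, $h = E a$ (up to normalization), where E is the adjacency matrix with $E_{vw} = 1$ when v links to w
$\Rightarrow$ a is the principal eigenvector of $E^\top E$
$\Rightarrow$ h is the principal eigenvector of $E E^\top$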
11Connection to HITS and SVD of E
$E = U S V^\top$ (derivation on the whiteboard): U(1), the first column of U, is h; V(1), the first column of V, is a
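The slide leaves the derivation to the whiteboard. One way to see the connection, offered here as a sketch rather than as the lecture's own derivation, assumes the standard SVD conventions: U and V have orthonormal columns and S is diagonal with singular values sorted in decreasing order.

```latex
\begin{aligned}
E &= U S V^{\top} \\
E^{\top} E &= V S U^{\top} U S V^{\top} = V S^{2} V^{\top} \\
E E^{\top} &= U S V^{\top} V S U^{\top} = U S^{2} U^{\top}
\end{aligned}
```

Under these assumptions the columns of V are eigenvectors of $E^\top E$ and the columns of U are eigenvectors of $E E^\top$, with eigenvalues given by the squared singular values. The first columns therefore belong to the largest eigenvalue, which is why V(1) plays the role of a and U(1) the role of h.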
12HITS Example
Find a base subgraph
- Start with a root set R = {1, 2, 3, 4}
- {1, 2, 3, 4}: nodes relevant to the topic
- Expand the root set R to include all the children and a fixed number of parents of the nodes in R
$\Rightarrow$ a new set S (the base subgraph)
13HITS Example
BaseSubgraph(R, d)
  $S \leftarrow R$
  for each v in R
    do $S \leftarrow S \cup ch[v]$
       $P \leftarrow pa[v]$
       if $|P| > d$
         then $P \leftarrow$ arbitrary subset of P having size d
       $S \leftarrow S \cup P$
  return S
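The procedure above can be sketched in C as follows. This is an illustrative sketch only: the function name base_subgraph, the dense 0/1 adjacency matrix, and the flag arrays for R and S are assumptions, and the "arbitrary subset of size d" is taken to be the d parents with the lowest index.

```c
#include <string.h>

/*
 * adj is an n x n row-major 0/1 matrix: adj[u * n + v] != 0 means u -> v.
 * in_root[v] != 0 marks the root set R.
 * On return, in_base[v] != 0 marks the base set S.
 * Returns |S|.
 */
int base_subgraph(int n, const unsigned char *adj,
                  const unsigned char *in_root, int d,
                  unsigned char *in_base)
{
    memcpy(in_base, in_root, (size_t)n);      /* S <- R */

    for (int v = 0; v < n; v++) {
        if (!in_root[v])
            continue;

        /* S <- S U ch[v]: add every child of v */
        for (int w = 0; w < n; w++)
            if (adj[v * n + w])
                in_base[w] = 1;

        /* S <- S U P: add at most d parents of v (lowest indices first) */
        int taken = 0;
        for (int w = 0; w < n && taken < d; w++) {
            if (adj[w * n + v]) {
                in_base[w] = 1;
                taken++;
            }
        }
    }

    int size = 0;
    for (int v = 0; v < n; v++)
        if (in_base[v])
            size++;
    return size;
}
```

For example, with five nodes, edges 0 -> 1, 2 -> 0, 3 -> 0 and 4 -> 0, root set {0} and d = 2, the sketch returns S = {0, 1, 2, 3}: the single child is always included, and only two of the three parents are kept.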
14HITS Example
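Hubs and authorities: two n-dimensional vectors a and h
HubsAuthorities(G)
  $\mathbf{1} \leftarrow [1, \ldots, 1] \in \mathbb{R}^{|V|}$
  $a_0 \leftarrow h_0 \leftarrow \mathbf{1}$
  $t \leftarrow 1$
  repeat
    for each v in V
      do $a_t(v) \leftarrow \sum_{w \in pa[v]} h_{t-1}(w)$
         $h_t(v) \leftarrow \sum_{w \in ch[v]} a_{t-1}(w)$
    $a_t \leftarrow a_t / \|a_t\|$
    $h_t \leftarrow h_t / \|h_t\|$
    $t \leftarrow t + 1$
  until $\|a_t - a_{t-1}\| + \|h_t - h_{t-1}\| < \varepsilon$
  return $(a_t, h_t)$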
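A compact C sketch of this iteration is given below. It is an illustration under stated assumptions, not the lecture's implementation: the function name hits and the edge-list layout (src[k] -> dst[k]) are hypothetical, and the unspecified norm is taken to be the L1 norm, so each returned vector sums to 1.

```c
#include <stdlib.h>
#include <string.h>

/* Scale x so that the absolute values of its entries sum to 1 (L1 norm). */
static void l1_normalize(int n, double *x)
{
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += (x[i] < 0.0) ? -x[i] : x[i];
    if (s > 0.0)
        for (int i = 0; i < n; i++)
            x[i] /= s;
}

/*
 * HITS on a directed graph with vertices 0..n-1 and m edges src[k] -> dst[k].
 * On return a[] holds authority weights and h[] hub weights.
 * Returns the number of iterations used, or -1 if allocation fails.
 */
int hits(int n, int m, const int *src, const int *dst,
         double *a, double *h, int max_iter, double eps)
{
    double *a_new = malloc((size_t)n * sizeof *a_new);
    double *h_new = malloc((size_t)n * sizeof *h_new);
    if (a_new == NULL || h_new == NULL) {
        free(a_new);
        free(h_new);
        return -1;
    }

    for (int v = 0; v < n; v++)
        a[v] = h[v] = 1.0;                     /* a_0 = h_0 = 1 */

    int iters = 0;
    while (iters < max_iter) {
        for (int v = 0; v < n; v++)
            a_new[v] = h_new[v] = 0.0;

        for (int k = 0; k < m; k++) {
            a_new[dst[k]] += h[src[k]];  /* a_t(v): sum of h_{t-1} over parents  */
            h_new[src[k]] += a[dst[k]];  /* h_t(v): sum of a_{t-1} over children */
        }
        l1_normalize(n, a_new);
        l1_normalize(n, h_new);

        /* ||a_t - a_{t-1}|| + ||h_t - h_{t-1}|| in the L1 norm */
        double change = 0.0;
        for (int v = 0; v < n; v++) {
            double da = a_new[v] - a[v];
            double dh = h_new[v] - h[v];
            change += (da < 0.0) ? -da : da;
            change += (dh < 0.0) ? -dh : dh;
        }

        memcpy(a, a_new, (size_t)n * sizeof *a);
        memcpy(h, h_new, (size_t)n * sizeof *h);
        iters++;
        if (change < eps)
            break;
    }

    free(a_new);
    free(h_new);
    return iters;
}
```

On the three-node graph with edges 0 -> 1, 0 -> 2 and 1 -> 2, the authority vector approaches (0, 0.382, 0.618) and the hub vector (0.618, 0.382, 0): node 2 is the strongest authority and node 0 the strongest hub. Note that convergence relies on the largest eigenvalue of $E^\top E$ being simple; on graphs where it is repeated, the two interleaved update sequences need not agree.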
15HITS Example Results
[Figure: authority and hubness weights for nodes 1 to 15]
16HITS Example Results
[Figure: authority and hubness weights for nodes 1 to 15]
17HITS Improvements
Bharat and Henzinger (1998)
- HITS problems
  - A document can contain many identical links to the same document on another host
  - Links are generated automatically (e.g. messages posted on newsgroups)
- Solutions
  - Assign weights to identical multiple edges that are inversely proportional to their multiplicity
  - Prune irrelevant nodes, or regulate the influence of a node with a relevance weight
18Stability
- Are link analysis algorithms based on eigenvectors stable, in the sense that their results don't change significantly?
- The connectivity of a portion of the graph is changed arbitrarily
- How will this affect the results of the algorithms?
19HITS: 30% of papers randomly deleted
[Figure: plot with axes Rank and Perturbed Rank]
Ref: "Link Analysis, Eigenvectors and Stability", Ng, Zheng, Jordan
20PageRank: 30% of papers randomly deleted
[Figure: plot with axes Rank and Perturbed Rank]
21Stability of HITS
- Ng et al (2001)
- A bound on the number of hyperlinks k that can be added to or deleted from one page without affecting the authority or hubness weights
- It is possible to perturb a symmetric matrix by a quantity that grows as $\delta$ and that produces a constant perturbation of the dominant eigenvector
- $\delta$: eigengap $\lambda_1 - \lambda_2$; d: maximum outdegree of G
22Stability of PageRank
Ng et al (2001)
V: the set of vertices touched by the perturbation
- The parameter $\varepsilon$ of the mixture model has a stabilizing role
- If the set of pages affected by the perturbation has a small rank, the overall change will also be small
A tighter bound by Bianchini et al (2001)
d(j) > 2 depends on the edges incident on j
23Netflix Prize
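- Netflix Problem: Predict Missing Ratings
- M = 17,000 movies
- U = 500,000 users
[Figure: the U x M ratings matrix X, with a few known ratings (1, 4, 2, 2) and a missing entry marked ?]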
24- $X \approx W H$
- X is $U \times M$
- W is $U \times T$ (each user has multiple topics)
- H is $T \times M$ (topics are groups of movies)
- choose T small
- Goal
- $\min \|E\|^2$, $E = X - WH$
- Predict missing rating(u,m) as the (u,m) entry of WH
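25- Gradient descent
- $\min \|E\|^2$, $E = X - WH$
- $H = H + \alpha W^\top E$
- $W = W + \alpha E H^\top$
- Regularization
- $\min \|E\|^2 + \beta (\|W\|^2 + \|H\|^2)$
- $H = H + \alpha W^\top E - \beta H$
- $W = W + \alpha E H^\top - \beta W$
(constant factors are absorbed into $\alpha$ and $\beta$)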
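The update rules above can be sketched in C as one batch gradient step. This is a minimal illustration, not the method used by any Netflix Prize entry: the names mf_step and mf_predict are hypothetical, the ratings are assumed to arrive as (user, movie, value) triples, and the error E is assumed to be taken over observed ratings only, since the missing entries of X have no value to compare against.

```c
#include <stdlib.h>

/* Predicted rating: the (u, m) entry of W H. W is U x T, H is T x M, row-major. */
double mf_predict(int M, int T, const double *W, const double *H, int u, int m)
{
    double p = 0.0;
    for (int t = 0; t < T; t++)
        p += W[u * T + t] * H[t * M + m];
    return p;
}

/*
 * One batch gradient step over nobs observed ratings x[k] at (user[k], movie[k]):
 *   W = W + alpha * E H^T - beta * W
 *   H = H + alpha * W^T E - beta * H
 * Returns the sum of squared errors measured before the update,
 * or -1.0 if allocation fails.
 */
double mf_step(int U, int M, int T, int nobs,
               const int *user, const int *movie, const double *x,
               double *W, double *H, double alpha, double beta)
{
    double *dW = calloc((size_t)U * T, sizeof *dW);
    double *dH = calloc((size_t)T * M, sizeof *dH);
    if (dW == NULL || dH == NULL) {
        free(dW);
        free(dH);
        return -1.0;
    }

    double sse = 0.0;
    for (int k = 0; k < nobs; k++) {
        int u = user[k], m = movie[k];
        double e = x[k] - mf_predict(M, T, W, H, u, m);
        sse += e * e;
        for (int t = 0; t < T; t++) {
            dW[u * T + t] += e * H[t * M + m];   /* entry (u, t) of E H^T */
            dH[t * M + m] += e * W[u * T + t];   /* entry (t, m) of W^T E */
        }
    }

    /* Both gradients were computed from the old W and H, so update last. */
    for (int i = 0; i < U * T; i++)
        W[i] += alpha * dW[i] - beta * W[i];
    for (int i = 0; i < T * M; i++)
        H[i] += alpha * dH[i] - beta * H[i];

    free(dW);
    free(dH);
    return sse;
}
```

A small check of the idea: with two users, two movies and T = 1, observe ratings (0,0) = 1, (0,1) = 2 and (1,0) = 2, start from W = H = (0.5, 0.5), and repeat mf_step with alpha = 0.05 and beta = 0. The fit to the observed entries becomes essentially exact, and mf_predict for the unobserved entry (1,1) approaches 4, the value a rank-one model implies.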
27- BellKor solution
- RMSE = 0.8712
- Consists of blending 107 different results
- Neighborhood-based model (k-NN)
- Factorization model
- Restricted Boltzmann Machines
- Asymmetric factor models
- Regression models
28- BellKor solution
- Combining multiple results
- 107 results were blended to deliver RMSE = 0.8712
- Success of the ensemble approach depends on the ability of different predictors to expose different, complementing aspects of the data
- Don't want to optimize the accuracy of each individual predictor
- Often, more accurate predictors are less useful within the full blend