CS 277: Data Mining, Lecture 15: PageRank and HITS (cont.)

1
CS 277 Data Mining, Lecture 15: PageRank and HITS (cont.)

David Newman
Department of Computer Science
University of California, Irvine

2
Notices

Homework 3 due in class Nov 20
NOT ALLOWED: prob_z_given_w_d = zeros(K,W,D)

3
Simple Example
[Figure: directed graph on four nodes, labeled 1 to 4]
4
Simple Example
[Figure: the four-node graph with transition weights (0.5 and 1) on its edges, alongside the weight matrix W]
5
Solution for the Simple Example
r = importance = steady-state Markov probability
[Figure: the four-node graph with its edge weights; the nodes are annotated with importances 0.2, 0.4, 0.13, and 0.27]
$r = Wr$
r is the eigenvector of W corresponding to the unit
eigenvalue
6
Code for power method

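#include <stdio.h>
#include <string.h>

#define N 10000000   /* number of nodes */
#define M 500000000  /* number of edges */

/* Declarations and data loading are not shown on the slide: x, y, z are
   length-N double arrays; edge m has row ii[m], column jj[m], and value
   aa[m]; norm(N, v) returns the norm of a length-N vector. */
int main(int argc, char *argv[])
{
  for (iter = 0; iter < 100; iter++) {
    // assumed step, not on the slide: keep the old x in z
    memcpy(z, x, N * sizeof(double));
    // compute y = Ax
    memset(y, 0, N * sizeof(double));
    for (m = 0; m < M; m++)
      y[ii[m]] += aa[m] * x[jj[m]];
    // normalize y and make it the new x
    mynorm = norm(N, y);
    for (n = 0; n < N; n++) y[n] /= mynorm;
    memcpy(x, y, N * sizeof(double));
    // z = old x - new x; its norm measures convergence
    for (n = 0; n < N; n++) z[n] -= y[n];
    printf("iter %d diff = %.3g\n", iter, norm(N, z));
  }
}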
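
The loop above can be tried on a tiny graph. The following is a minimal sketch, not the lecture's code: the function name power_method, the example_* arrays, and the choice of the L1 norm (so that x sums to 1) are assumptions. The seven-edge graph is one that is consistent with the weights (0.5 and 1) and importances (0.2, 0.4, 0.13, 0.27) of the simple example; the node numbering is a guess.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical 4-node example in triplet form: entry m says that
   W[example_ii[m]][example_jj[m]] = example_aa[m], i.e. node jj links
   to node ii with that weight. Nodes are 0-indexed here. */
static const int example_ii[7] = {0, 1, 1, 1, 2, 3, 3};
static const int example_jj[7] = {1, 0, 2, 3, 3, 1, 2};
static const double example_aa[7] = {0.5, 1.0, 0.5, 0.5, 0.5, 0.5, 0.5};

/* Absolute value helper (avoids depending on the math library). */
static double abs_d(double v) { return v < 0.0 ? -v : v; }

/* L1 norm: sum of absolute values (an assumed choice of norm). */
static double norm1(size_t n, const double *v)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++) s += abs_d(v[i]);
    return s;
}

/*
 * Power method for a sparse matrix A given as triplets (ii, jj, aa).
 * x holds the start vector on entry and the estimate on exit;
 * y is scratch space of length n.
 * Returns the L1 change in x during the final iteration.
 */
double power_method(size_t n, size_t n_entries,
                    const int *ii, const int *jj, const double *aa,
                    int iters, double *x, double *y)
{
    double diff = 0.0;
    assert(n > 0 && iters > 0);
    for (int iter = 0; iter < iters; iter++) {
        /* y = A x */
        for (size_t i = 0; i < n; i++) y[i] = 0.0;
        for (size_t m = 0; m < n_entries; m++)
            y[ii[m]] += aa[m] * x[jj[m]];

        /* normalize y, measure the change, and make y the new x */
        double s = norm1(n, y);
        assert(s > 0.0);
        diff = 0.0;
        for (size_t i = 0; i < n; i++) {
            y[i] /= s;
            diff += abs_d(x[i] - y[i]);
        }
        memcpy(x, y, n * sizeof(double));
    }
    return diff;
}
```

Calling power_method(4, 7, example_ii, example_jj, example_aa, 200, x, y) with x started at 0.25 in every entry should settle near (0.2, 0.4, 0.133, 0.267), which satisfies r = Wr for this assumed graph.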

7
HITS - Kleinberg's Algorithm

HITS: Hypertext Induced Topic Selection

For each vertex v ∈ V in a subgraph of
interest:

a(v) - the authority of v; h(v) - the hubness of v

A site is very authoritative if it receives many
citations. Citations from important sites weigh
more than citations from less important sites.

Hubness shows the importance of a site as a hub. A good
hub is a site that links to many authoritative
sites.

8
Authority and Hubness
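[Figure: node 1 with incoming links from nodes 2, 3, and 4, and outgoing links to nodes 5, 6, and 7]
$h(1) = a(5) + a(6) + a(7)$
$a(1) = h(2) + h(3) + h(4)$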
9
Authority and Hubness Convergence

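Recursive dependency:
$a(v) \leftarrow \sum_{w \in pa[v]} h(w)$
$h(v) \leftarrow \sum_{w \in ch[v]} a(w)$

Here pa[v] is the set of parents of v (pages that link to v) and ch[v] is the set of children of v (pages that v links to).

Using linear algebra, we can prove that
a(v) and h(v) converge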
10
Authority and Hubness Convergence
$a = E^{T} h$, $h = E a$
⇒ a is the principal eigenvector of $E^{T}E$; h is
the principal eigenvector of $EE^{T}$
11
Connection to HITS and SVD of E
$E = U \Sigma V^{T}$ → whiteboard: U(1) is h, V(1) is a
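
The whiteboard step is not in the transcript. One way to fill it in is sketched below, assuming the convention that $E_{vw} = 1$ when page v links to page w, that the largest singular value of E is strictly larger than the next one, and that $\propto$ absorbs the normalization applied in each iteration.

```latex
\begin{align*}
a &\propto E^{T} h, \qquad h \propto E a \\
\Rightarrow\quad a &\propto E^{T} E \, a, \qquad h \propto E E^{T} h \\
E &= U \Sigma V^{T} \\
\Rightarrow\quad E^{T} E &= V \, \Sigma^{T} \Sigma \, V^{T},
  \qquad E E^{T} = U \, \Sigma \Sigma^{T} \, U^{T}
\end{align*}
```

Under these assumptions the principal eigenvector of $E^{T}E$ is the first column of V and the principal eigenvector of $EE^{T}$ is the first column of U, which matches "U(1) is h, V(1) is a".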
12
HITS Example
Find a base subgraph

Start with a root set R = {1, 2, 3, 4}

{1, 2, 3, 4} - nodes relevant to
the topic

Expand the root set R to include all the
children and a fixed number of parents of nodes
in R

→ A new set S (base subgraph)
13
HITS Example

BaseSubgraph(R, d)
  S ← R
  for each v in R
    do S ← S ∪ ch[v]
       P ← pa[v]
       if |P| > d
         then P ← arbitrary subset of P having size d
       S ← S ∪ P
  return S
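
The pseudocode above can be sketched in C as follows. This is an illustration under assumptions, not the lecture's implementation: the graph is an edge list, set membership is a boolean array, the function name base_subgraph is hypothetical, and the "arbitrary subset" of parents is simply the first d parents found in edge order (which assumes no duplicate edges).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Expand a root set R into a base subgraph S.
 * Edge e points from src[e] to dst[e].
 * in_root[v] marks membership in R; in_base[v] receives membership in S.
 * Returns |S|.
 */
size_t base_subgraph(size_t n_nodes, size_t n_edges,
                     const int *src, const int *dst,
                     const bool *in_root, size_t d, bool *in_base)
{
    /* S <- R */
    for (size_t v = 0; v < n_nodes; v++) in_base[v] = in_root[v];

    for (size_t v = 0; v < n_nodes; v++) {
        if (!in_root[v]) continue;
        size_t parents_taken = 0;
        for (size_t e = 0; e < n_edges; e++) {
            assert(src[e] >= 0 && (size_t)src[e] < n_nodes);
            assert(dst[e] >= 0 && (size_t)dst[e] < n_nodes);
            if ((size_t)src[e] == v) {
                in_base[dst[e]] = true;          /* S <- S U ch[v] */
            }
            if ((size_t)dst[e] == v && parents_taken < d) {
                in_base[src[e]] = true;          /* at most d parents of v */
                parents_taken++;
            }
        }
    }

    /* count the members of S */
    size_t count = 0;
    for (size_t v = 0; v < n_nodes; v++)
        if (in_base[v]) count++;
    return count;
}
```

Scanning every edge for every root node costs O(|R| |E|) time; an adjacency-list representation would avoid that, but the edge list keeps the sketch short.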

14
HITS Example
Hubs and authorities: two n-dimensional vectors a and h

HubsAuthorities(G)
  1 ← (1, …, 1) ∈ R^|V|   (the all-ones vector)
  a_0 ← h_0 ← 1
  t ← 1
  repeat
    for each v in V
      do a_t(v) ← Σ_{w ∈ pa[v]} h_{t-1}(w)
         h_t(v) ← Σ_{w ∈ ch[v]} a_{t-1}(w)
    a_t ← a_t / ||a_t||
    h_t ← h_t / ||h_t||
    t ← t + 1
  until ||a_t − a_{t-1}|| + ||h_t − h_{t-1}|| < ε
  return (a_t, h_t)

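
A minimal C sketch of this iteration follows. It is not the lecture's code: the name hubs_authorities, the edge-list representation, the iteration cap max_iter, and the use of the L1 norm for both normalization and the stopping test are assumptions (the slide does not say which norm is meant).

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Absolute value helper (avoids depending on the math library). */
static double abs_d(double v) { return v < 0.0 ? -v : v; }

/* Scale v to unit L1 norm; an all-zero vector is left unchanged. */
static void normalize1(size_t n, double *v)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++) s += abs_d(v[i]);
    if (s > 0.0)
        for (size_t i = 0; i < n; i++) v[i] /= s;
}

/*
 * HITS iteration on an edge list (edge e points from src[e] to dst[e]).
 * a and h (length n) receive the authority and hub weights.
 * Returns the number of iterations run, or -1 if allocation fails.
 */
int hubs_authorities(size_t n, size_t n_edges,
                     const int *src, const int *dst,
                     double eps, int max_iter, double *a, double *h)
{
    assert(n > 0 && max_iter > 0);
    double *a_new = malloc(n * sizeof *a_new);
    double *h_new = malloc(n * sizeof *h_new);
    int t = 0;
    if (!a_new || !h_new) { free(a_new); free(h_new); return -1; }

    for (size_t v = 0; v < n; v++) a[v] = h[v] = 1.0;  /* a_0 = h_0 = 1 */

    while (t < max_iter) {
        t++;
        for (size_t v = 0; v < n; v++) a_new[v] = h_new[v] = 0.0;
        for (size_t e = 0; e < n_edges; e++) {
            a_new[dst[e]] += h[src[e]];  /* a_t(v): sum of h_{t-1} over parents */
            h_new[src[e]] += a[dst[e]];  /* h_t(v): sum of a_{t-1} over children */
        }
        normalize1(n, a_new);
        normalize1(n, h_new);

        /* change = |a_t - a_{t-1}| + |h_t - h_{t-1}| in the L1 norm */
        double change = 0.0;
        for (size_t v = 0; v < n; v++) {
            change += abs_d(a_new[v] - a[v]) + abs_d(h_new[v] - h[v]);
            a[v] = a_new[v];
            h[v] = h_new[v];
        }
        if (change < eps) break;
    }
    free(a_new);
    free(h_new);
    return t;
}
```

On the hypothetical three-edge graph 0→2, 0→3, 1→2, nodes 0 and 1 end up as pure hubs and nodes 2 and 3 as pure authorities, with node 2 the stronger authority because both hubs point to it.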
15
HITS Example Results
[Figure: authority and hubness weights for nodes 1 to 15]
16
HITS Example Results
[Figure: authority and hubness weights for nodes 1 to 15]
17
HITS Improvements
Bharat and Henzinger (1998)

HITS problems
A document can contain many identical links to
the same document on another host
Links are generated automatically (e.g. messages
posted on newsgroups)
Solutions
Assign weights to identical multiple edges that
are inversely proportional to their multiplicity
Prune irrelevant nodes, or regulate the
influence of a node with a relevance weight
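
The first solution can be illustrated with a short C sketch. It reads "identical multiple edges" as edges with the same (source, target) pair, which is an assumption about the slide's meaning, and the function name multiplicity_weights is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/*
 * For each edge e, set weight[e] = 1/k, where k is the number of edges
 * with the same (src, dst) pair as e, counting e itself.
 * A group of k identical edges then carries a total weight of 1.
 */
void multiplicity_weights(size_t n_edges, const int *src, const int *dst,
                          double *weight)
{
    assert(n_edges == 0 || (src && dst && weight));
    for (size_t e = 0; e < n_edges; e++) {
        size_t k = 0;
        for (size_t f = 0; f < n_edges; f++)
            if (src[f] == src[e] && dst[f] == dst[e]) k++;
        weight[e] = 1.0 / (double)k;   /* k >= 1 because f == e matches */
    }
}
```

The weights would then multiply the terms of the authority and hub sums, so that three copies of one link count as much as a single link. The quadratic scan is only for brevity; sorting the edges first would be the usual choice on a large graph.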

18
Stability

Are link analysis algorithms based on
eigenvectors stable, in the sense that the results
don't change significantly?
Suppose the connectivity of a portion of the graph is
changed arbitrarily
How will this affect the results of the algorithms?

19
HITS: 30% of papers randomly deleted
[Figure: perturbed rank vs. rank]
Ref: Link Analysis, Eigenvectors and Stability;
Ng, Zheng, Jordan
20
PageRank: 30% of papers randomly deleted
[Figure: perturbed rank vs. rank]
21
Stability of HITS

Ng et al. (2001)
A bound on the number of hyperlinks k that can be
added or deleted from one page without affecting
the authority or hubness weights
It is possible to perturb a symmetric matrix by
a quantity that grows as δ and that produces a
constant perturbation of the dominant eigenvector

δ: eigengap $\lambda_1 - \lambda_2$; d: maximum outdegree of G
22
Stability of PageRank
Ng et al. (2001)
V: the set of vertices touched by the perturbation

The parameter ε of the mixture model has a
stabilizing role
If the set of pages affected by the perturbation
has a small rank, the overall change will also
be small

Tighter bound by Bianchini et al. (2001)
d(j) > 2 depends on the edges incident on j
23
Netflix Prize

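Netflix Problem: Predict Missing Ratings

M = 17,000 movies
U = 500,000 users
[Figure: the U × M ratings matrix X, with a few known ratings (1, 4, 2, 2) and missing entries marked ?]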
24

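$X \approx W H$
X is U × M
W is U × T (user has multiple topics)
H is T × M (topics are groups of movies)
choose T small
Goal:
min $\|E\|$, $E = X - WH$
Predict missing rating(u,m):
(u,m) entry of WH

Gradient descent
min $\|E\|$, $E = X - WH$
$H \leftarrow H + \alpha W^{T} E$
$W \leftarrow W + \alpha E H^{T}$
Regularization
min $(\|E\| + \beta\|W\| + \beta\|H\|)$
$H \leftarrow H + \alpha W^{T} E - \beta H$
$W \leftarrow W + \alpha E H^{T} - \beta W$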
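
The updates above can be sketched in C on a toy problem. Several things here are assumptions rather than part of the slides: the toy sizes, all function names, treating E as zero on missing ratings, reading the objective as a sum of squared residuals (with constant factors absorbed into α and β), and applying both updates from the same W, H, and E within a step.

```c
#include <assert.h>
#include <stddef.h>

enum { U = 4, M = 3, T = 2 };  /* users, movies, topics (toy sizes) */

/* Predicted rating: the (u, m) entry of WH. */
double predict(double W[U][T], double H[T][M], int u, int m)
{
    double p = 0.0;
    for (int t = 0; t < T; t++) p += W[u][t] * H[t][m];
    return p;
}

/* Fill E = X - WH on observed entries (0 on missing ones) and
   return the sum of squared residuals. */
double residual(double X[U][M], int observed[U][M],
                double W[U][T], double H[T][M], double E[U][M])
{
    double sse = 0.0;
    for (int u = 0; u < U; u++)
        for (int m = 0; m < M; m++) {
            E[u][m] = observed[u][m] ? X[u][m] - predict(W, H, u, m) : 0.0;
            sse += E[u][m] * E[u][m];
        }
    return sse;
}

/* One step: H <- H + alpha W^T E - beta H,  W <- W + alpha E H^T - beta W. */
void gradient_step(double W[U][T], double H[T][M], double E[U][M],
                   double alpha, double beta)
{
    double dH[T][M], dW[U][T];
    for (int t = 0; t < T; t++)
        for (int m = 0; m < M; m++) {
            dH[t][m] = 0.0;                       /* (W^T E)[t][m] */
            for (int u = 0; u < U; u++) dH[t][m] += W[u][t] * E[u][m];
        }
    for (int u = 0; u < U; u++)
        for (int t = 0; t < T; t++) {
            dW[u][t] = 0.0;                       /* (E H^T)[u][t] */
            for (int m = 0; m < M; m++) dW[u][t] += E[u][m] * H[t][m];
        }
    for (int t = 0; t < T; t++)
        for (int m = 0; m < M; m++)
            H[t][m] += alpha * dH[t][m] - beta * H[t][m];
    for (int u = 0; u < U; u++)
        for (int t = 0; t < T; t++)
            W[u][t] += alpha * dW[u][t] - beta * W[u][t];
}

/* Run iters steps; returns the final sum of squared residuals. */
double factorize(double X[U][M], int observed[U][M],
                 double W[U][T], double H[T][M],
                 double alpha, double beta, int iters)
{
    double E[U][M];
    assert(iters >= 0 && alpha > 0.0 && beta >= 0.0);
    for (int i = 0; i < iters; i++) {
        residual(X, observed, W, H, E);
        gradient_step(W, H, E, alpha, beta);
    }
    return residual(X, observed, W, H, E);
}
```

After training, predict(W, H, u, m) gives the estimate for a missing rating. At Netflix scale the dense U × M loops would be replaced by loops over the observed ratings only.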

27

BellKor solution
RMSE = 0.8712
Consists of blending 107 different results
Neighborhood-based model (k-NN)
Factorization model
Restricted Boltzmann Machines
Asymmetric factor models
Regression models

BellKor solution
Combining multiple results
107 results were blended to deliver RMSE = 0.8712
Success of the ensemble approach depends on the ability
of different predictors to expose different,
complementary aspects of the data
Don't want to optimize the accuracy of each
individual predictor
Often, more accurate predictors are less useful
within the full blend
