CS 277: Data Mining, Lecture 15: PageRank and HITS (cont.)




1
CS 277 Data Mining
Lecture 15: PageRank and HITS (cont.)
  • David Newman
  • Department of Computer Science
  • University of California, Irvine

2
Notices
  • Homework 3 due in class Nov 20
  • NOT ALLOWED: prob_z_given_w_d = zeros(K,W,D)

3
Simple Example
[Figure: example graph on nodes 1, 2, 3, 4]
4
Simple Example
[Figure: the graph on nodes 1, 2, 3, 4 with edge weights of 1 and 0.5, shown with its weight matrix W]
5
Solution for the Simple Example
r = importance = steady-state Markov probability
[Figure: the weighted graph on nodes 1, 2, 3, 4 annotated with the steady-state values 0.2, 0.4, 0.13, and 0.27]
r = Wr
r is the eigenvector corresponding to the unit
eigenvalue
6
Code for power method
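    #include <stdio.h>
    #include <string.h>
    #define N 10000000   /* number of nodes */
    #define M 500000000  /* number of edges */
    /* Assumed to be allocated and filled elsewhere: entry m is A[ii[m]][jj[m]] = aa[m]. */
    extern int ii[], jj[];
    extern double aa[], x[], y[], z[];
    double norm(int n, const double *v);  /* vector norm, assumed defined elsewhere */
    int main(int argc, char *argv[])
    {
        int iter, m, n;
        double mynorm;
        for (iter = 0; iter < 100; iter++) {
            /* compute y = Ax from the sparse edge list */
            memset(y, 0, N * sizeof(double));
            for (m = 0; m < M; m++)
                y[ii[m]] += aa[m] * x[jj[m]];
            mynorm = norm(N, y);
            for (n = 0; n < N; n++) y[n] /= mynorm;
            /* z = previous x minus new y, the change made by this iteration */
            for (n = 0; n < N; n++) z[n] = x[n] - y[n];
            memcpy(x, y, N * sizeof(double));
            printf("iter %d diff = %.3g\n", iter, norm(N, z));
        }
        return 0;
    }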
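
The loop above can be tried on a tiny graph. The following is a minimal sketch, not the lecture's code: the names `power_method`, `norm1`, and `MAX_NODES` are hypothetical, the L1 norm is an assumption (the slide does not say which norm `norm` computes), and a convergence tolerance replaces the fixed 100 iterations.

```c
#include <assert.h>
#include <math.h>
#include <string.h>

#define MAX_NODES 16  /* hypothetical cap so the sketch needs no malloc */

/* Sum of absolute values (L1 norm); keeps x a probability vector. */
static double norm1(int n, const double *v)
{
    double s = 0.0;
    for (int i = 0; i < n; i++) s += fabs(v[i]);
    return s;
}

/*
 * Power method on a sparse matrix stored as an edge list:
 * entry m adds aa[m] * x[jj[m]] to y[ii[m]], so y = A x.
 * The result is left in x; returns the number of iterations used.
 */
int power_method(int n, int n_edges, const int *ii, const int *jj,
                 const double *aa, double *x, double tol, int max_iter)
{
    double y[MAX_NODES];
    assert(n > 0 && n <= MAX_NODES);
    for (int i = 0; i < n; i++) x[i] = 1.0 / n;  /* uniform start */

    for (int iter = 1; iter <= max_iter; iter++) {
        memset(y, 0, (size_t)n * sizeof(double));
        for (int m = 0; m < n_edges; m++)
            y[ii[m]] += aa[m] * x[jj[m]];
        double s = norm1(n, y);
        assert(s > 0.0);
        double diff = 0.0;  /* L1 change between successive iterates */
        for (int i = 0; i < n; i++) {
            y[i] /= s;
            diff += fabs(y[i] - x[i]);
        }
        memcpy(x, y, (size_t)n * sizeof(double));
        if (diff < tol) return iter;
    }
    return max_iter;
}
```

As a usage example on a hypothetical 3-node graph (not the graph from the slides) with edges 0→1 and 0→2 of weight 0.5, 1→2 of weight 1, and 2→0 of weight 1, pass `ii = {1, 2, 2, 0}`, `jj = {0, 0, 1, 2}`, `aa = {0.5, 0.5, 1.0, 1.0}`. Solving r = Wr by hand for this graph gives r = (0.4, 0.2, 0.4), which the iteration approaches.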

7
HITS - Kleinberg's Algorithm
  • HITS: Hypertext Induced Topic Selection
  • For each vertex $v \in V$ in a subgraph of
    interest:
    a(v) is the authority of v, and h(v) is the hubness of v
  • A site is very authoritative if it receives many
    citations. Citations from important sites weigh
    more than citations from less important sites
  • Hubness shows the importance of a site as a hub. A good
    hub is a site that links to many authoritative
    sites

8
Authority and Hubness
[Figure: node 1 with incoming links from nodes 2, 3, 4 and outgoing links to nodes 5, 6, 7]
h(1) = a(5) + a(6) + a(7)
a(1) = h(2) + h(3) + h(4)
9
Authority and Hubness Convergence
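  • Recursive dependency
  • $a(v) \leftarrow \sum_{w \in pa[v]} h(w)$
  • $h(v) \leftarrow \sum_{w \in ch[v]} a(w)$
    (pa[v] is the set of parents of v, the pages that link to v; ch[v] is the set of children of v, the pages that v links to)
  • Using linear algebra, we can prove that a(v) and h(v) converge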
10
Authority and Hubness Convergence
$a = E^T h$, $h = E a$
⇒ a is the principal eigenvector of $E^T E$, and h is
the principal eigenvector of $E E^T$
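
One way to see where these eigenvectors come from is sketched below. It assumes E is the adjacency matrix with $E_{vw} = 1$ when v links to w, that a and h are rescaled after every update (written as $\propto$), and that each update uses the previous step's values. The step index t is notation introduced here for illustration.

```latex
\begin{aligned}
a_t &\propto E^{T} h_{t-1}, & h_t &\propto E\, a_{t-1} \\
\Rightarrow\quad a_t &\propto E^{T} E\, a_{t-2}, & h_t &\propto E E^{T} h_{t-2}
\end{aligned}
```

Every two steps multiply a by $E^T E$ and h by $E E^T$, which is the power method applied to those two symmetric matrices. The iterates therefore approach the principal eigenvectors, provided the largest eigenvalue is not repeated and the starting vector is not orthogonal to the principal eigenvector.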
11
Connection to HITS and SVD of E
$E = U S V^T$ (see whiteboard): U(1) is h, V(1) is a
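
The whiteboard step is not in the transcript. A short sketch of the standard argument, assuming the thin SVD with orthonormal columns in U and V ($U^T U = I$, $V^T V = I$) and a diagonal S holding the singular values in decreasing order:

```latex
\begin{aligned}
E^{T} E &= V S U^{T} U S V^{T} = V S^{2} V^{T} \\
E E^{T} &= U S V^{T} V S U^{T} = U S^{2} U^{T}
\end{aligned}
```

The columns of V are thus eigenvectors of $E^T E$ and the columns of U are eigenvectors of $E E^T$, with eigenvalues equal to the squared singular values. The first column of V, written V(1) on the slide, is the principal eigenvector of $E^T E$ and gives a; the first column of U gives h.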
12
HITS Example
Find a base subgraph
  • Start with a root set R = {1, 2, 3, 4}
  • {1, 2, 3, 4}: nodes relevant to
    the topic
  • Expand the root set R to include all the
    children and a fixed number of parents of nodes
    in R

⇒ A new set S (base subgraph)
13
HITS Example
  • BaseSubgraph(R, d)
  •   $S \leftarrow R$
  •   for each v in R
  •     do $S \leftarrow S \cup ch[v]$
  •        $P \leftarrow pa[v]$
  •        if $|P| > d$
  •          then $P \leftarrow$ arbitrary subset of P having size d
  •        $S \leftarrow S \cup P$
  •   return S
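
The pseudocode above can be sketched in C as follows. This is an illustration under assumptions, not the lecture's implementation: the function name `base_subgraph`, the flat row-major adjacency array, and the membership-flag representation of sets are all hypothetical, and the "arbitrary subset" of parents is taken to be the d lowest-numbered ones.

```c
#include <assert.h>
#include <string.h>

/*
 * Expand a root set R into a base set S.
 * adj is row-major: adj[v * n + w] != 0 means page v links to page w.
 * in_root[v] and in_base[v] are membership flags (1 = in the set).
 * For each root node, all children and at most d parents are added.
 * Returns the number of nodes in the base set.
 */
int base_subgraph(int n, const unsigned char *adj,
                  const unsigned char *in_root, int d,
                  unsigned char *in_base)
{
    assert(n > 0 && d >= 0);
    memcpy(in_base, in_root, (size_t)n);             /* S <- R */

    for (int v = 0; v < n; v++) {
        if (!in_root[v]) continue;
        int parents_added = 0;
        for (int w = 0; w < n; w++) {
            if (adj[v * n + w]) in_base[w] = 1;      /* S <- S U ch[v] */
            if (adj[w * n + v] && parents_added < d) {
                in_base[w] = 1;                      /* S <- S U P, |P| <= d */
                parents_added++;
            }
        }
    }

    int size = 0;
    for (int v = 0; v < n; v++) size += in_base[v];
    return size;
}
```

Capping the parents matters because a popular page can have a very large number of in-links, while its out-links are usually few.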

14
HITS Example
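Hubs and authorities: two n-dimensional vectors a and h
  • HubsAuthorities(G)
  •   $\mathbf{1} \leftarrow (1, \ldots, 1) \in \mathbb{R}^{|V|}$
  •   $a_0 \leftarrow h_0 \leftarrow \mathbf{1}$
  •   $t \leftarrow 0$
  •   repeat
  •     $t \leftarrow t + 1$
  •     for each v in V
  •       do $a_t(v) \leftarrow \sum_{w \in pa[v]} h_{t-1}(w)$
  •          $h_t(v) \leftarrow \sum_{w \in ch[v]} a_{t-1}(w)$
  •     $a_t \leftarrow a_t / \|a_t\|$
  •     $h_t \leftarrow h_t / \|h_t\|$
  •   until $\|a_t - a_{t-1}\| + \|h_t - h_{t-1}\| < \epsilon$
  •   return $(a_t, h_t)$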
15
HITS Example Results
[Figure: authority and hubness weights for nodes 1 to 15]
16
HITS Example Results
[Figure: authority and hubness weights for nodes 1 to 15]
17
HITS Improvements
Bharat and Henzinger (1998)
  • HITS problems
  • A document can contain many identical links to
    the same document on another host
  • Links may be generated automatically (e.g., messages
    posted on newsgroups)
  • Solutions
  • Assign weights to identical multiple edges that
    are inversely proportional to their multiplicity
  • Prune irrelevant nodes, or regulate the
    influence of a node with a relevance weight
18
Stability
  • Are the link analysis algorithms based on
    eigenvectors stable, in the sense that the results
    don't change significantly?
  • Suppose the connectivity of a portion of the graph is
    changed arbitrarily
  • How will it affect the results of the algorithms?

19
HITS: 30% of papers randomly deleted
[Figure: Rank vs. Perturbed Rank]
Ref: "Link Analysis, Eigenvectors and Stability,"
Ng, Zheng, Jordan
20
PageRank: 30% of papers randomly deleted
[Figure: Rank vs. Perturbed Rank]
21
Stability of HITS
  • Ng et al. (2001)
  • A bound on the number of hyperlinks k that can be
    added to or deleted from one page without affecting
    the authority or hubness weights
  • It is possible to perturb a symmetric matrix by
    a quantity that grows as $\delta$ and that produces a
    constant perturbation of the dominant eigenvector

$\delta$: eigengap $\lambda_1 - \lambda_2$; d: maximum out-degree of G
22
Stability of PageRank
Ng et al. (2001)
V: the set of vertices touched by the perturbation
  • The parameter $\epsilon$ of the mixture model has a
    stabilizing role
  • If the set of pages affected by the perturbation
    has a small rank, the overall change will also
    be small

Tighter bound by Bianchini et al. (2001)
d(j) > 2 depends on the edges incident on j
23
Netflix Prize
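  • Netflix Problem: Predict Missing Ratings

[Figure: ratings matrix X with U = 500,000 users (rows) and M = 17,000 movies (columns); sample entries 1, 4, 2, ?, 2, where ? marks a missing rating]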
24
  • $X \approx W H$
  • X is $U \times M$
  • W is $U \times T$ (user has multiple topics)
  • H is $T \times M$ (topics are groups of movies)
  • choose T small
  • Goal
  • $\min \|E\|^2$, $E = X - WH$
  • Predict missing rating(u,m)
  • = (u,m) entry of WH
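
The prediction step is a single inner product of length T. A minimal sketch, assuming W and H are stored row-major in flat arrays (the function name `predict_rating` is hypothetical):

```c
#include <assert.h>

/*
 * Predicted rating for user u and movie m: the (u, m) entry of W H.
 * W is U x T and H is T x M, both row-major.
 */
double predict_rating(int T, int M, const double *W, const double *H,
                      int u, int m)
{
    assert(T > 0 && M > 0 && u >= 0 && m >= 0 && m < M);
    double r = 0.0;
    for (int t = 0; t < T; t++)
        r += W[u * T + t] * H[t * M + m];  /* sum over topics */
    return r;
}
```

Only the T entries of row u of W and the T entries of column m of H are touched, so the full U × M product never has to be formed. With T small, each prediction is cheap.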

25
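  • Gradient descent
  • $\min \|E\|^2$, $E = X - WH$
  • $H \leftarrow H + \alpha W^T E$
  • $W \leftarrow W + \alpha E H^T$
  • Regularization
  • $\min \|E\|^2 + \beta(\|W\|^2 + \|H\|^2)$
  • $H \leftarrow H + \alpha W^T E - \beta H$
  • $W \leftarrow W + \alpha E H^T - \beta W$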
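
One update of W and H can be sketched in C as below. Several things here are assumptions rather than the lecture's method: the function name `gradient_step`, the row-major flat arrays, the `mask` array that zeroes E at missing ratings, and the choice to compute both updates from the same E and the pre-update W. The `beta` argument is the shrinkage factor that multiplies W and H in the update.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * One step of  W <- W + alpha E H^T - beta W,  H <- H + alpha W^T E - beta H,
 * with E = X - W H on observed entries (mask[u * M + m] != 0), 0 elsewhere.
 * W is U x T, H is T x M, X and mask are U x M, all row-major.
 * Returns the squared error over observed entries before the step,
 * or -1.0 if memory allocation fails.
 */
double gradient_step(int U, int T, int M, const double *X,
                     const unsigned char *mask, double *W, double *H,
                     double alpha, double beta)
{
    assert(U > 0 && T > 0 && M > 0);
    double *E = calloc((size_t)U * M, sizeof(double));
    double *W_old = malloc((size_t)U * T * sizeof(double));
    if (E == NULL || W_old == NULL) { free(E); free(W_old); return -1.0; }
    memcpy(W_old, W, (size_t)U * T * sizeof(double));

    double sq_err = 0.0;
    for (int u = 0; u < U; u++) {
        for (int m = 0; m < M; m++) {
            if (!mask[u * M + m]) continue;  /* missing rating: E stays 0 */
            double pred = 0.0;
            for (int t = 0; t < T; t++)
                pred += W[u * T + t] * H[t * M + m];
            E[u * M + m] = X[u * M + m] - pred;
            sq_err += E[u * M + m] * E[u * M + m];
        }
    }

    for (int u = 0; u < U; u++)              /* update W using current H */
        for (int t = 0; t < T; t++) {
            double g = 0.0;
            for (int m = 0; m < M; m++) g += E[u * M + m] * H[t * M + m];
            W[u * T + t] += alpha * g - beta * W[u * T + t];
        }

    for (int t = 0; t < T; t++)              /* update H using pre-update W */
        for (int m = 0; m < M; m++) {
            double g = 0.0;
            for (int u = 0; u < U; u++) g += W_old[u * T + t] * E[u * M + m];
            H[t * M + m] += alpha * g - beta * H[t * M + m];
        }

    free(E);
    free(W_old);
    return sq_err;
}
```

Calling this repeatedly with a small `alpha` should drive the returned error down. The dense E is only practical for toy sizes; at Netflix scale one would loop over the observed ratings only.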

27
  • BellKor solution
  • RMSE = 0.8712
  • Consists of blending 107 different results
  • Neighborhood-based model (k-NN)
  • Factorization model
  • Restricted Boltzmann Machines
  • Asymmetric factor models
  • Regression models

28
  • BellKor solution
  • Combining multiple results
  • 107 results were blended to deliver RMSE = 0.8712
  • Success of the ensemble approach depends on the ability
    of different predictors to expose different,
    complementing aspects of the data
  • Don't want to optimize the accuracy of each
    individual predictor
  • Often, more accurate predictors are less useful
    within the full blend