Hubs and Authorities on the world wide web PowerPoint PPT Presentation


1
Hubs and Authorities on the world wide web
  • (mostly from Rao's lecture slides)
  • Presenter: Lei Tang

2
Desiderata for link-based ranking
  • A page that is referenced by a lot of important
    pages (has more back links) is more important
    (Authority)
  • A page referenced by a single important page may
    be more important than one referenced by five
    unimportant pages
  • No links between competing authorities (like
    Ford and Honda)
  • A page that references a lot of important pages
    is also important (Hub)
  • Good authoritative pages (authorities) and good
    hub pages (hubs) reinforce each other.
  • Importance can be propagated
  • Your importance is the weighted sum of the
    importance conferred on you by the pages that
    refer to you
  • The importance you confer on a page may depend
    on how many other pages you refer to (cite)
  • (Also what you say about them when you cite
    them!)

Different Notions of importance
3
Authority and Hub Pages (1)
  • The basic idea
  • A page is a good authoritative page with respect
    to a given query if it is referenced (i.e.,
    pointed to) by many (good hub) pages that are
    related to the query.
  • A page is a good hub page with respect to a given
    query if it points to many good authoritative
    pages with respect to the query.
  • Good authoritative pages (authorities) and good
    hub pages (hubs) reinforce each other.

4
Authority and Hub Pages (2)
  • Authorities and hubs related to the same query
    tend to form a bipartite subgraph of the web
    graph.
  • A web page can be a good authority and a good hub.

[Figure: bipartite graph of hubs and authorities]
5
Authority and Hub Pages (7)
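  • Operation I: for each page p,
    $a(p) = \sum_{q : (q, p) \in E} h(q)$
  • Operation O: for each page p,
    $h(p) = \sum_{q : (p, q) \in E} a(q)$

[Figure: Operation I, with q1, q2, q3 pointing to p; Operation O, with p pointing to q1, q2, q3]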
6
Authority and Hub Pages (8)
  • Matrix representation of operations I and O.
  • Let A be the adjacency matrix of SG: entry (p, q)
    is 1 if p has a link to q; otherwise the entry is 0.
  • Let $A^T$ be the transpose of A.
  • Let $h_i$ be the vector of hub scores after i
    iterations.
  • Let $a_i$ be the vector of authority scores after i
    iterations.
  • Operation I: $a_i = A^T h_{i-1}$
  • Operation O: $h_i = A a_i$

Normalize after every multiplication
7
Authority and Hub Pages (11)
  • Example: Initialize all scores to 1.
  • 1st iteration
  • I operation:
    a(q1) = 1, a(q2) = a(q3) = 0,
    a(p1) = 3, a(p2) = 2
  • O operation:
    h(q1) = 5, h(q2) = 3, h(q3) = 5, h(p1) = 1, h(p2) = 0
  • Normalization:
    a(q1) = 0.267, a(q2) = a(q3) = 0,
    a(p1) = 0.802, a(p2) = 0.535,
    h(q1) = 0.645, h(q2) = 0.387, h(q3) = 0.645,
    h(p1) = 0.129, h(p2) = 0

[Figure: example graph on nodes q1, q2, q3, p1, p2]
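The numbers in this example can be checked with a short script. This is only a sketch: the transcript does not list the edges of the example graph, so the edge list below is an assumption inferred from the slide's scores (q1, q2, q3 point to p1; q1 and q3 point to p2; p1 points to q1). The name `hits_step` is hypothetical. As on the slide, Operation O uses the freshly updated authority scores, and both vectors are then scaled to unit Euclidean length.

```python
import math

# Assumed edge list, inferred from the slide's numbers (not given in the source).
EDGES = [("q1", "p1"), ("q1", "p2"), ("q2", "p1"),
         ("q3", "p1"), ("q3", "p2"), ("p1", "q1")]
NODES = ["q1", "q2", "q3", "p1", "p2"]


def normalize(scores):
    """Scale a score dict to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in scores.values()))
    return {k: v / norm for k, v in scores.items()}


def hits_step(h):
    """One iteration: Operation I, then Operation O, then normalization."""
    a = {p: 0.0 for p in NODES}
    for src, dst in EDGES:      # Operation I: a(p) = sum of h(q) over (q, p) in E
        a[dst] += h[src]
    new_h = {p: 0.0 for p in NODES}
    for src, dst in EDGES:      # Operation O: h(p) = sum of a(q) over (p, q) in E
        new_h[src] += a[dst]
    return normalize(a), normalize(new_h)


h0 = {p: 1.0 for p in NODES}    # all scores start at 1
a1, h1 = hits_step(h0)          # scores after the 1st iteration
a2, h2 = hits_step(h1)          # scores after the 2nd iteration
print({k: round(v, 3) for k, v in a1.items()})
print({k: round(v, 3) for k, v in h1.items()})
```

Under the assumed edges, the rounded values agree with the normalized scores reported on the slides for the first and second iterations.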
8
Authority and Hub Pages (12)
  • After 2 iterations:
    a(q1) = 0.061, a(q2) = a(q3) = 0,
    a(p1) = 0.791, a(p2) = 0.609,
    h(q1) = 0.656, h(q2) = 0.371, h(q3) = 0.656,
    h(p1) = 0.029, h(p2) = 0
  • After 5 iterations:
    a(q1) = a(q2) = a(q3) = 0,
    a(p1) = 0.788, a(p2) = 0.615,
    h(q1) = 0.657, h(q2) = 0.369, h(q3) = 0.657,
    h(p1) = h(p2) = 0

[Figure: the same example graph on nodes q1, q2, q3, p1, p2]
9
(Why) does the procedure converge?
As we multiply repeatedly by M, the component
of x in the direction of the principal eigenvector
gets stretched relative to the other directions, so
we finally converge to the direction of the
principal eigenvector.
Necessary condition: x must have a component in
the direction of the principal eigenvector
(c1 must be non-zero).
The rate of convergence depends on the eigengap.
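
The stretching argument can be written out as a short derivation. This is a sketch under stated assumptions: M is symmetric (as $A^T A$ and $A A^T$ are), with eigenvalues $\lambda_1 > \lambda_2 \ge \dots \ge \lambda_n \ge 0$ and orthonormal eigenvectors $e_1, \dots, e_n$. These symbols are introduced here for illustration; $c_1$ is the coefficient the slide refers to.

```latex
x = \sum_{i=1}^{n} c_i e_i
\quad\Longrightarrow\quad
M^k x = \sum_{i=1}^{n} c_i \lambda_i^k e_i
      = \lambda_1^k \left( c_1 e_1
        + \sum_{i=2}^{n} c_i \left( \frac{\lambda_i}{\lambda_1} \right)^{k} e_i \right)
```

Since $\lambda_i / \lambda_1 < 1$ for $i \ge 2$, every term in the remaining sum shrinks geometrically, so after normalization $M^k x$ tends to $e_1$, provided $c_1 \neq 0$. The slowest term decays like $(\lambda_2 / \lambda_1)^k$, which is why the rate depends on the gap between the two largest eigenvalues.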
10
Authority and Hub Pages (3)
  • Main steps of the algorithm for finding good
    authorities and hubs related to a query q:
  • 1. Submit q to a regular similarity-based search
    engine. Let S be the set of top n pages returned
    by the search engine. (S is called the root set
    and n is often in the low hundreds.)
  • 2. Expand S into a large set T (the base set):
    • Add pages that are pointed to by any page in S.
    • Add pages that point to any page in S.
    • If a page has too many parent pages, only the
      first k parent pages will be used for some k.

11
Authority and Hub Pages (4)
  • 3. Find the subgraph SG of the web graph that
    is induced by T.

12
(No Transcript)
13
Authority and Hub Pages (5)
  • Steps 2 and 3 can be made easy by storing the
    link structure of the Web in advance, in a link
    structure table (built during crawling).
  • Most search engines serve this information now
    (e.g., Google's link search).
  • Link structure table:

    parent_url   child_url
    url1         url2
    url1         url3
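
With such a table in memory, steps 2 and 3 reduce to a few lookups. The following is a minimal sketch, assuming the table is a list of `(parent_url, child_url)` pairs; the function name `build_base_set` and the example URLs are hypothetical, and "first k parents" is taken to mean the first k rows encountered in the table.

```python
from collections import defaultdict


def build_base_set(link_table, root_set, k):
    """Expand the root set S into the base set T and return (T, induced edges).

    link_table: list of (parent_url, child_url) rows.
    k: maximum number of parent pages added per root page.
    """
    children = defaultdict(list)
    parents = defaultdict(list)
    for parent, child in link_table:
        children[parent].append(child)
        parents[child].append(parent)

    base = set(root_set)
    for page in root_set:
        base.update(children[page])       # pages pointed to by a page in S
        base.update(parents[page][:k])    # at most the first k parents of the page
    # Induced subgraph: keep only links whose endpoints are both in T.
    edges = [(p, c) for p, c in link_table if p in base and c in base]
    return base, edges


table = [("url1", "url2"), ("url1", "url3"), ("url4", "url1"),
         ("url5", "url1"), ("url3", "url6")]
base, edges = build_base_set(table, {"url1"}, k=1)
print(sorted(base))
print(edges)
```

Building both a parent index and a child index up front keeps each expansion step a dictionary lookup rather than a scan of the whole table.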

14
[Figure: three-node graph with nodes A, B, and C]

USER(41) aaa                           ;; an adjacency matrix
2A((0 0 1) (0 0 1) (1 0 0))
USER(42) x                             ;; an initial vector
2A((1) (2) (3))
USER(43) (apower-iteration aaa x 2)    ;; authority computation: two iterations
1 USER(44) (apower-iterate aaa x 3)    ;; after three iterations
2A((0.041630544) (0.0) (0.99913305))
1 USER(45) (apower-iterate aaa x 15)   ;; after 15 iterations
2A((1.0172524e-5) (0.0) (1.0))
1 USER(46) (power-iterate aaa x 5)     ;; hub computation: 5 iterations
2A((0.70641726) (0.70641726) (0.04415108))
1 USER(47) (power-iterate aaa x 15)    ;; 15 iterations
2A((0.7071068) (0.7071068) (4.3158376e-5))
1 USER(48) Y                           ;; a new initial vector
2A((89) (25) (2))
1 USER(49) (power-iterate aaa Y 15)    ;; Magic: same answer after 15 iterations
2A((0.7071068) (0.7071068) (7.571644e-7))
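
The Lisp session above can be mirrored in plain Python. This is a sketch, not the presenter's code: it assumes that `power-iterate` multiplies by $A A^T$ (hubs) and `apower-iterate` by $A^T A$ (authorities), normalizing to unit Euclidean length after each multiplication. The Python function names are hypothetical.

```python
import math


def transpose(m):
    return [list(col) for col in zip(*m)]


def mat_mul(a, b):
    """Product of two matrices given as lists of rows."""
    b_cols = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols] for row in a]


def mat_vec(m, v):
    return [sum(x * y for x, y in zip(row, v)) for row in m]


def power_iterate(m, x, iterations):
    """Repeat x <- m x / ||m x|| the given number of times."""
    for _ in range(iterations):
        x = mat_vec(m, x)
        norm = math.sqrt(sum(c * c for c in x))
        x = [c / norm for c in x]
    return x


A = [[0, 0, 1], [0, 0, 1], [1, 0, 0]]      # the adjacency matrix aaa
HUB_MATRIX = mat_mul(A, transpose(A))      # A A^T, assumed for power-iterate
AUTH_MATRIX = mat_mul(transpose(A), A)     # A^T A, assumed for apower-iterate

x = [1.0, 2.0, 3.0]
y = [89.0, 25.0, 2.0]
auth_3 = power_iterate(AUTH_MATRIX, x, 3)
hubs_5 = power_iterate(HUB_MATRIX, x, 5)
hubs_15_x = power_iterate(HUB_MATRIX, x, 15)
hubs_15_y = power_iterate(HUB_MATRIX, y, 15)
print(auth_3, hubs_5, hubs_15_x, hubs_15_y)
```

Both starting vectors end up at essentially the same hub vector, which is the point of the "Magic" remark: the limit depends on the matrix, not on the initial vector, as long as that vector has a component along the principal eigenvector.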
15
Authority and Hub Pages (6)
  • 4. Compute the authority score and hub score of
    each web page in T based on the subgraph SG(V, E).
  • Given a page p, let
    • a(p) be the authority score of p
    • h(p) be the hub score of p
    • (p, q) be a directed edge in E from p to q.
  • Two basic operations:
    • Operation I: Update each a(p) as the sum of all
      the hub scores of web pages that point to p.
    • Operation O: Update each h(p) as the sum of all
      the authority scores of web pages pointed to by p.

16
Authority and Hub Pages (9)
  • After each iteration of applying Operations I
    and O, normalize all authority and hub scores.
  • Repeat until the scores for each page
    converge (the convergence is guaranteed).
  • 5. Sort pages in descending order of authority score.
  • 6. Display the top authority pages.

17
Authority and Hub Pages (10)
  • Algorithm (summary)
  • submit q to a search engine to obtain the
    root set S
  • expand S into the base set T
  • obtain the induced subgraph SG(V, E) using T
  • initialize a(p) = h(p) = 1 for all p in V
  • repeat until the scores converge:
    • for each p in V, apply Operation I
    • for each p in V, apply Operation O
    • normalize a(p) and h(p)
  • return pages with top authority scores
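
The scoring loop of this summary can be sketched as one function. The slides do not specify a stopping rule, so the tolerance test below (largest per-page change under `tol`) is an assumption, as are the names `hits`, `top_authorities`, `tol`, and `max_iter` and the toy edge list. The sketch expects a non-empty list of `(p, q)` edges of the induced subgraph.

```python
import math


def _normalize(scores):
    """Scale scores to unit Euclidean length (left unchanged if all are zero)."""
    norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
    return {n: v / norm for n, v in scores.items()}


def hits(edges, tol=1e-9, max_iter=1000):
    """Return (authority, hub) score dicts for a non-empty list of (p, q) edges."""
    nodes = sorted({n for edge in edges for n in edge})
    a = {n: 1.0 for n in nodes}
    h = {n: 1.0 for n in nodes}
    for _ in range(max_iter):
        new_a = {n: 0.0 for n in nodes}
        for p, q in edges:                 # Operation I
            new_a[q] += h[p]
        new_h = {n: 0.0 for n in nodes}
        for p, q in edges:                 # Operation O
            new_h[p] += new_a[q]
        new_a, new_h = _normalize(new_a), _normalize(new_h)
        # Assumed stopping rule: largest per-page change is below tol.
        change = max(abs(new_a[n] - a[n]) + abs(new_h[n] - h[n]) for n in nodes)
        a, h = new_a, new_h
        if change < tol:
            break
    return a, h


def top_authorities(a, count):
    """Pages sorted by descending authority score, truncated to count."""
    return sorted(a, key=a.get, reverse=True)[:count]


toy_edges = [("hub1", "x"), ("hub1", "y"), ("hub2", "x"), ("hub3", "x")]
auth, hub = hits(toy_edges)
print(top_authorities(auth, 2))
```

In a full pipeline the edges would come from the induced subgraph of the base set T rather than from a hand-written list.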

19
Handling spam links
  • Should all links be equally treated?
  • Two considerations
  • Some links may be more meaningful/important than
    other links.
  • Web site creators may trick the system to make
    their pages more authoritative by adding dummy
    pages pointing to their cover pages (spamming).

20
Handling Spam Links (contd)
  • Transverse link: a link between pages with
    different domain names.
  • Domain name: the first level of the URL of a
    page.
  • Intrinsic link: a link between pages with the
    same domain name.
  • Transverse links are more important than
    intrinsic links.
  • Two ways to incorporate this:
    • Use only transverse links and discard intrinsic
      links.
    • Give lower weights to intrinsic links.

21
Handling Spam Links (contd)
  • How to give lower weights to intrinsic links?
  • In adjacency matrix A, entry (p, q) should be
    assigned as follows:
    • If p has a transverse link to q, the entry is 1.
    • If p has an intrinsic link to q, the entry is c,
      where 0 < c < 1.
    • If p has no link to q, the entry is 0.
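
A weighted adjacency matrix along these lines could be built as follows. This is a sketch under assumptions: the host part of the URL (as returned by `urllib.parse.urlparse`) stands in for the slide's "domain name", the value c = 0.5 is an arbitrary illustration, and the function names and example URLs are hypothetical.

```python
from urllib.parse import urlparse

INTRINSIC_WEIGHT = 0.5  # hypothetical choice of c, with 0 < c < 1


def domain(url):
    """Host part of the URL, used here as the page's domain name (an assumption)."""
    return urlparse(url).netloc.lower()


def link_weight(p, q, c=INTRINSIC_WEIGHT):
    """1 for a transverse link (different domains), c for an intrinsic link."""
    return 1.0 if domain(p) != domain(q) else c


def weighted_adjacency(pages, links, c=INTRINSIC_WEIGHT):
    """Matrix with entry (p, q) = link weight if p links to q, else 0."""
    index = {url: i for i, url in enumerate(pages)}
    matrix = [[0.0] * len(pages) for _ in pages]
    for p, q in links:
        matrix[index[p]][index[q]] = link_weight(p, q, c)
    return matrix


pages = ["http://a.example/index.html",
         "http://a.example/about.html",
         "http://b.example/"]
links = [(pages[0], pages[1]), (pages[0], pages[2])]
matrix = weighted_adjacency(pages, links)
print(matrix)
```

Setting c to 0 in this sketch corresponds to the other option on the slide, discarding intrinsic links altogether.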

22
Considering link context
  • For a given link (p, q), let V(p, q) be the
    vicinity (e.g., ± 50 characters) of the link.
  • If V(p, q) contains terms in the user query
    (topic), then the link should be more useful for
    identifying authoritative pages.
  • To incorporate this: in adjacency matrix A, set
    the weight associated with link (p, q) to
    1 + n(p, q),
    where n(p, q) is the number of terms in V(p, q)
    that appear in the query.
  • Alternatively, consider the vector similarity
    between V(p, q) and the query Q.
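
The context weight can be sketched in a few lines. The assumptions here are loud ones: n(p, q) is taken to be the number of distinct query terms found in the vicinity, terms are lowercase word tokens, and the link's position in the page text is given by character offsets. The function names `vicinity` and `context_weight` are hypothetical.

```python
import re


def vicinity(text, start, end, window=50):
    """V(p, q): the link text at text[start:end] plus `window` characters on each side."""
    return text[max(0, start - window): end + window]


def context_weight(vicinity_text, query):
    """Weight 1 + n(p, q), counting distinct query terms present in the vicinity."""
    terms = set(re.findall(r"\w+", vicinity_text.lower()))
    query_terms = set(re.findall(r"\w+", query.lower()))
    return 1 + len(terms & query_terms)


print(context_weight("Download free email clients here", "free email"))
print(context_weight("click here", "free email"))
```

A link with no query terms nearby keeps weight 1, so this reduces to the unweighted adjacency matrix when the context carries no signal.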

23
(No Transcript)
24
Evaluation
  • Sample experiments
  • Rank based on large in-degree (or backlinks)
  • Query: game

    Rank  In-degree  URL
    1     13         http://www.gotm.org
    2     12         http://www.gamezero.com/team-0/
    3     12         http://ngp.ngpc.state.ne.us/gp.html
    4     12         http://www.ben2.ucla.edu/permadi/gamelink/gamelink.html
    5     11         http://igolfto.net/
    6     11         http://www.eduplace.com/geo/indexhi.html

  • Only pages 1, 2 and 4 are authoritative game
    pages.

25
Evaluation
  • Sample experiments (continued)
  • Rank based on large authority score.
  • Query: game

    Rank  Authority  URL
    1     0.613      http://www.gotm.org
    2     0.390      http://ad/doubleclick/net/jump/gamefan-network.com/
    3     0.342      http://www.d2realm.com/
    4     0.324      http://www.counter-strike.net
    5     0.324      http://tech-base.com/
    6     0.306      http://www.e3zone.com

  • All pages are authoritative game pages.

26
Authority and Hub Pages (19)
  • Sample experiments (continued)
  • Rank based on large authority score.
  • Query: free email

    Rank  Authority  URL
    1     0.525      http://mail.chek.com/
    2     0.345      http://www.hotmail/com/
    3     0.309      http://www.naplesnews.net/
    4     0.261      http://www.11mail.com/
    5     0.254      http://www.dwp.net/
    6     0.246      http://www.wptamail.com/

  • All pages are authoritative free email pages.

27
Tyranny of Majority
Which do you think are authoritative pages? Which
are good hubs? Intuitively, we would say that 4, 8,
and 5 will be authoritative pages and 1, 2, 3, 6,
and 7 will be hub pages.
[Figure: graph on nodes 1 to 8, in two disconnected components]
BUT the power iteration will show that only 4 and
5 have non-zero authorities [.923 .382], and only
1, 2, and 3 have non-zero hubs [.5 .7 .5].
The authority and hub mass will concentrate
completely in the first component as the
iterations increase. (See next slide.)
28
Tyranny of Majority (explained)
Suppose h0 and a0 are all initialized to 1.
[Figure: two components, one with m pages p1, p2, ..., pm and page p, the other with n pages q1, ..., qn and page q; m > n]
30
Impact of Bridges
[Figure: the graph on nodes 1 to 8, before and after adding a bridging page 9]
When the graph is disconnected, only 4 and 5 have
non-zero authorities [.923 .382], and only 1, 2,
and 3 have non-zero hubs [.5 .7 .5].
When the components are bridged by adding one
page (9), the authorities change: only 4, 5, and 8
have non-zero authorities [.853 .224 .47], and 1,
2, 3, 6, 7, and 9 will have non-zero hubs [.39 .49
.39 .21 .21 .6].
Bad news from a stability point of view.
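
Both sets of numbers can be reproduced with a small experiment. The edges of the figure are not in the transcript, so the edge lists below are assumptions inferred from the reported scores: 1, 2, 3 point to 4; 2 also points to 5; 6 and 7 point to 8; the bridging page 9 points to 4 and 8. The function name `hits` and the fixed iteration count are likewise illustrative.

```python
import math


def hits(edges, iterations=100):
    """Run Operations I and O with normalization for a fixed number of rounds."""
    nodes = sorted({n for edge in edges for n in edge})
    h = dict.fromkeys(nodes, 1.0)
    a = dict.fromkeys(nodes, 1.0)
    for _ in range(iterations):
        a = dict.fromkeys(nodes, 0.0)
        for src, dst in edges:          # Operation I
            a[dst] += h[src]
        h = dict.fromkeys(nodes, 0.0)
        for src, dst in edges:          # Operation O
            h[src] += a[dst]
        for scores in (a, h):           # normalize to unit Euclidean length
            norm = math.sqrt(sum(v * v for v in scores.values()))
            for n in scores:
                scores[n] /= norm
    return a, h


# Assumed edges, inferred from the scores on the slides (not given in the source).
DISCONNECTED = [(1, 4), (2, 4), (2, 5), (3, 4), (6, 8), (7, 8)]
BRIDGED = DISCONNECTED + [(9, 4), (9, 8)]

a_d, h_d = hits(DISCONNECTED)
a_b, h_b = hits(BRIDGED)
print({n: round(v, 3) for n, v in a_d.items() if v > 1e-6})
print({n: round(v, 3) for n, v in a_b.items() if v > 1e-6})
```

Under these assumed edges, page 8 loses all authority in the disconnected graph even though it has two in-links, and regains a substantial score once a single page links into both components.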
31
Authority and Hub Pages (24)
  • Multiple Communities (continued)
  • How to retrieve pages from smaller communities?
  • A method for finding pages in nth largest
    community
  • Identify the next largest community using the
    existing algorithm.
  • Destroy this community by removing links
    associated with pages having large authorities.
  • Reset all authority and hub values back to 1 and
    calculate all authority and hub values again.
  • Repeat the above n - 1 times and the next largest
    community will be the nth largest community.

32
Multiple Clusters on House
Query House (first community)
33
Authority and Hub Pages (26)
Query House (second community)
34
Authority and Hub Pages (20)
  • For a given query, the induced subgraph may have
    multiple dense bipartite communities due to
  • multiple meanings of query terms
  • multiple web communities related to the query

35
Authority and Hub Pages (21)
  • Multiple Communities (continued)
  • If a page is not in a community, then it is
    unlikely to have a high authority score even when
    it has many backlinks.
  • Example: Suppose initially all hub and
    authority scores are 1.
  • [Figure: G1 (q's pointing to p) and G2 (q's pointing to p's)]
  • 1st iteration for G1: a(q) = 0, a(p) = 5,
    h(q) = 5, h(p) = 0
  • 1st iteration for G2: a(q) = 0, a(p) = 3,
    h(q) = 9, h(p) = 0

36
Authority and Hub Pages (22)
  • Example (continued)
  • 1st normalization (suppose the normalization
    factors are H1 for hubs and A1 for authorities)
  • for pages in G1: a(q) = 0, a(p) = 5/A1,
    h(q) = 5/H1, h(p) = 0
  • for pages in G2: a(q) = 0, a(p) = 3/A1,
    h(q) = 9/H1, h(p) = 0
  • After the nth iteration (suppose Hn and An are
    the normalization factors, respectively)
  • for pages in G1: $a(p) = 5^n / (H_1 \cdots H_{n-1} A_n)$
    ---- a
  • for pages in G2: $a(p) = 3 \cdot 9^{n-1} / (H_1 \cdots H_{n-1} A_n)$
    ---- b
  • Note that a/b approaches 0 when n is
    sufficiently large; that is, a is much, much
    smaller than b.
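
The limit of a/b can be made explicit. In the sketch below, N is shorthand introduced here for the normalization product that a and b share; it cancels in the ratio.

```latex
\frac{a}{b}
  = \frac{5^{n} / N}{3 \cdot 9^{\,n-1} / N}
  = \frac{5^{n}}{3 \cdot 9^{\,n-1}}
  = \frac{5}{3} \left( \frac{5}{9} \right)^{n-1}
  \longrightarrow 0 \quad \text{as } n \to \infty
```

The ratio shrinks by a factor of 5/9 per iteration, so after even a few dozen iterations the authority score in G1 is negligible next to the one in G2.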

37
Authority and Hub Pages (23)
  • Multiple Communities (continued)
  • If a page is not in the largest community, then
    it is unlikely to have a high authority score.
  • The reason is similar to that regarding pages not
    in a community.
  • [Figure: a larger community and a smaller community]

38
(No Transcript)
39
More stable, because the random surfer model allows
low-probability edges to every place.
Can be made stable with subspace-based A/H values;
see Ng et al. 2001.
40
Novel uses of Link Analysis
  • Link analysis algorithms (HITS and PageRank) are
    not limited to hyperlinks.
  • Citeseer/Cora use them for analyzing citations
    (the link is through citation).
  • See the irony here: link analysis ideas originated
    from citation analysis, and are now being applied
    to citation analysis.
  • Some new work on keyword search on databases
    uses foreign-key links and link analysis to
    decide which of the tuples matching the keyword
    query are most important (the link is through
    foreign keys).
  • Sudarshan et al., ICDE 2002
  • Keyword search on databases is useful to make
    structured databases accessible to naïve users
    who don't know structured languages (such as
    SQL).