Finding Related Pages in the World Wide Web

Provided by: xiangyanga

Transcript and Presenter's Notes

1
Finding Related Pages in the World Wide Web
  • Authors
  • Jeffrey Dean
  • Monika R. Henzinger

Presented By Amal Banerjee
Xiang-Yang Alexander Liu
2
Outline
  • Introduction
  • Companion Algorithm
  • Cocitation Algorithm
  • Performance Comparison with Netscape
  • Conclusion

3
Introduction
  • Another kind of user input: a URL address
  • Example
  • Input: www.nytimes.com
  • Output: www.usatoday.com, www.washingtonpost.com, ...
  • Related web pages: pages on the same topic

4
Introduction (contd)
  • Input: (1) User: the URL of the user's interest
  • (2) Connectivity Server: the linkage information about this URL
  • Output: a set of related web pages
  • Method: linkage analysis
  • Objective: (1) high precision (2) high speed
  • Solution: (1) Companion Algorithm
  • (2) Cocitation Algorithm

5
Companion Algorithm
  • Step 1: Build the vicinity graph based on the user input and the
    linkage information.
  • Step 2: Near-duplicate elimination
  • Step 3: Compute hub and authority scores
  • Step 4: Sort and output

6
Companion Algorithm (contd): Step 1, building the vicinity graph (example)
  • Example

[Figure: vicinity graph for u, with parents p1, p2, p3 and children c1, c2, c3]
7
Companion Algorithm (contd): Step 1, building the vicinity graph (example)
  • Example

[Figure: vicinity graph for u with nodes p1, p2, p3, B11, B12, B21, B22, B31, B32, b11, b12, b21, b22, b31, b32, c1, c2, c3]
8
Companion Algorithm (cont.): Step 1, building the vicinity graph
  • Number of parents of u: up to 2000
  • Number of children of every parent: up to 8
  • Limiting the children per parent reduces the likelihood of the
    computation being dominated by a single parent

9
Companion Algorithm (cont.): Step 1, building the vicinity graph (link order)
  • Problem: if a parent p of u has more than 8 children, which ones
    should be selected?
  • Observation: links to pages on a similar topic tend to cluster
    together on a page
  • Solution: take the 4 links immediately above and the 4 links
    immediately below the link from p to u.
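
The link-order selection above can be sketched in a few lines of Python. This is only an illustration: the function name `select_children_near_link` and the sample link list are hypothetical, and the sketch assumes that a parent's children are available as a list in the order the links appear on the page.

```python
def select_children_near_link(children, u, per_side=4):
    """Return up to `per_side` links above and below the link to `u`.

    `children` is assumed to be the parent's outgoing links in page order.
    """
    i = children.index(u)                      # position of the link to u
    above = children[max(0, i - per_side):i]   # links just before u
    below = children[i + 1:i + 1 + per_side]   # links just after u
    return above + below


# Hypothetical parent page with 12 links, one of which points to u.
links = ["a", "b", "c", "d", "e", "f", "u", "g", "h", "i", "j", "k"]
selected = select_children_near_link(links, "u")
```

Near the top or bottom of a page fewer than 4 links exist on one side, so the sketch simply returns the ones that are there.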

10
Companion Algorithm (cont.): Step 1, building the vicinity graph
  • Stoplist: URLs that (1) are unrelated to most queries
  • (2) have very high in-degree
  • 21 URLs, chosen by experimentation
  • Most of them are popular search engines and portals

11
Companion Algorithm (cont.): Step 1, building the vicinity graph (pseudocode)
  • Build-Vicinity-Graph(URL u, Connectivity Server)
  • S = {u}; stoplist = Original-Stoplist, which includes 21 URLs
  • if u is in stoplist: stoplist = empty set
  • S = S ∪ {up to P parents of u from the Connectivity Server; a
    parent of u must not be in the stoplist}
  • for every p // p is a parent of u
  • if number of children of p < Pc: S = S ∪ {all children of p}
  • else: S = S ∪ {Pc/2 children above and Pc/2 children below the
    link to u}
  • S = S ∪ {up to C children of u from the Connectivity Server}
  • for every c // c is a child of u
  • S = S ∪ {up to Cp parents of c from the Connectivity Server}
  • return S
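
A minimal Python sketch of the pseudocode above may help. `ToyConnectivityServer` is a hypothetical in-memory stand-in for the real Connectivity Server, not its actual interface. The defaults P=2000 and Pc=8 follow the earlier slide, while the defaults for C and Cp are assumptions chosen only so the sketch runs.

```python
from collections import defaultdict


class ToyConnectivityServer:
    """Hypothetical stand-in: maps each URL to its ordered outgoing links."""

    def __init__(self, out_links):
        self.out_links = out_links
        self.in_links = defaultdict(list)
        for src, dsts in out_links.items():
            for dst in dsts:
                self.in_links[dst].append(src)

    def children(self, url):
        return list(self.out_links.get(url, []))

    def parents(self, url):
        return list(self.in_links.get(url, []))


def build_vicinity_graph(u, server, stoplist=frozenset(),
                         P=2000, Pc=8, C=8, Cp=2000):
    """Return the node set S of the vicinity graph of u."""
    if u in stoplist:          # querying a stoplisted URL disables the stoplist
        stoplist = frozenset()
    nodes = {u}
    parents = [p for p in server.parents(u) if p not in stoplist][:P]
    nodes.update(parents)
    for p in parents:
        kids = server.children(p)
        if len(kids) < Pc:
            nodes.update(kids)                      # few links: take them all
        else:
            i, half = kids.index(u), Pc // 2
            nodes.update(kids[max(0, i - half):i])  # Pc/2 links above u
            nodes.update(kids[i + 1:i + 1 + half])  # Pc/2 links below u
    children = server.children(u)[:C]
    nodes.update(children)
    for c in children:
        nodes.update(server.parents(c)[:Cp])
    return nodes


# Hypothetical toy web: "portal" plays the role of a stoplisted page.
server = ToyConnectivityServer({
    "p1": ["x", "u", "y"], "p2": ["u"], "portal": ["u", "z"],
    "u": ["c1", "c2"], "q": ["c1"],
})
S = build_vicinity_graph("u", server, stoplist={"portal"})
```

The sketch returns only the node set, as the pseudocode does; the edges among those nodes would still have to be fetched before computing scores.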

12
Companion Algorithm (cont.): Step 2, near-duplicate elimination
  • Many pages are duplicated across hosts.
  • Example: mirror sites, different aliases for the same page
  • Near-duplicate-elimination(S)
  • for every two nodes a and b in S
  • if (a and b each have more than 10 links) and
    (a and b have at least 95% of their links in common)
  • c = a new node whose links are (links of a) ∪ (links of b)
  • S = (S − {a, b}) ∪ {c}
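
The merging rule can be sketched as follows. The function name and the toy data are hypothetical, and the sketch makes one interpretive assumption that the slide does not settle: "95% of their links in common" is read as the shared links covering at least 95% of each node's own links. Only outgoing links are modelled here.

```python
def eliminate_near_duplicates(links, min_links=10, min_overlap=0.95):
    """Merge near-duplicate nodes.

    `links` maps a node name (str) to the set of its link targets.
    Returns a new dict in which each near-duplicate pair is replaced by
    one node whose links are the union of the pair's links.
    """
    merged = {node: set(targets) for node, targets in links.items()}
    changed = True
    while changed:                       # repeat until no pair qualifies
        changed = False
        names = sorted(merged)
        for i, a in enumerate(names):
            for b in names[i + 1:]:
                la, lb = merged[a], merged[b]
                if len(la) <= min_links or len(lb) <= min_links:
                    continue             # rule applies only above 10 links
                common = len(la & lb)
                if (common >= min_overlap * len(la)
                        and common >= min_overlap * len(lb)):
                    merged[a + "+" + b] = la | lb   # combined node c
                    del merged[a], merged[b]
                    changed = True
                    break
            if changed:
                break
    return merged


# Hypothetical data: "mirror" duplicates "a"; "other" overlaps only by half.
pages = {
    "a": set(range(20)),
    "mirror": set(range(20)),
    "other": set(range(10, 30)),
    "small": {1, 2, 3},
}
result = eliminate_near_duplicates(pages)
```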

13
Companion Algorithm (cont.): Step 3, compute hub and authority scores
  • Use the edge-weighting scheme of Bharat and Henzinger
  • Compute-hub-and-authority-scores(S)
  • Initialize all elements of the hub vector H to 1.0
  • Initialize all elements of the authority vector A to 1.0
  • While the vectors H and A have not converged
  • For all nodes n in the vicinity graph N
  • $A[n] = \sum_{(n', n) \in edges(N)} H[n'] \times authority\_weight(n', n)$
  • For all n in N
  • $H[n] = \sum_{(n, n') \in edges(N)} A[n'] \times hub\_weight(n, n')$
  • Normalize the H and A vectors
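
The iteration above can be sketched in Python as shown below. This is an illustration under stated assumptions rather than the authors' implementation: the per-edge weights are passed in as optional dictionaries and default to 1.0 (the Bharat and Henzinger weighting itself is not reproduced), the vectors are normalized to unit Euclidean length, and `max_iter` and `tol` are hypothetical convergence controls.

```python
import math


def _normalize(vec):
    """Scale a score dict to unit Euclidean length (all zeros stay zero)."""
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {k: (v / norm if norm else 0.0) for k, v in vec.items()}


def hub_authority_scores(edges, authority_weight=None, hub_weight=None,
                         max_iter=100, tol=1e-9):
    """Return (H, A) for a directed graph given as (source, target) pairs."""
    authority_weight = authority_weight or {}   # assumed default weight: 1.0
    hub_weight = hub_weight or {}               # assumed default weight: 1.0
    nodes = sorted({n for edge in edges for n in edge})
    if not nodes:
        return {}, {}
    H = {n: 1.0 for n in nodes}
    A = {n: 1.0 for n in nodes}
    for _ in range(max_iter):
        new_A = {n: 0.0 for n in nodes}
        for src, dst in edges:      # authority: sum over incoming hubs
            new_A[dst] += H[src] * authority_weight.get((src, dst), 1.0)
        new_H = {n: 0.0 for n in nodes}
        for src, dst in edges:      # hub: sum over outgoing authorities
            new_H[src] += new_A[dst] * hub_weight.get((src, dst), 1.0)
        new_A, new_H = _normalize(new_A), _normalize(new_H)
        delta = max(abs(new_A[n] - A[n]) + abs(new_H[n] - H[n])
                    for n in nodes)
        A, H = new_A, new_H
        if delta < tol:             # vectors have stopped changing
            break
    return H, A


# Hypothetical graph: u has three in-links, c has two.
edges = [("p1", "u"), ("p1", "c"), ("p2", "u"), ("p3", "u"), ("p3", "c")]
H, A = hub_authority_scores(edges)
```

Step 4 of the algorithm would then sort the nodes by their authority score in A and output the highest-ranked ones.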

14
Cocitation Algorithm
  • Observation: related pages are often linked to by the same web
    pages.
  • Two nodes are co-cited if they have at least one common parent.

[Figure: co-citation example with nodes p1, p2, p3, u, S]
15
Cocitation Algorithm
  • Degree of co-citation: the number of common parents of two nodes
  • Idea: look for sibling nodes of u with a high degree of
    co-citation

16
Cocitation Algorithm (cont.)
  • Cocitation(URL u, Connectivity Server)
  • ParentSet = empty; SiblingSet = empty
  • ParentSet = ParentSet ∪ {up to P parents of u}
  • for every node p in ParentSet do
  • SiblingSet = SiblingSet ∪ {up to C children of p}
  • for every node s in SiblingSet: calculate the degree of
    co-citation of (s, u)
  • Sort the nodes in SiblingSet according to degree of co-citation
  • Output
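
A short Python sketch of the Cocitation pseudocode follows. The two dictionaries stand in for Connectivity Server lookups and are hypothetical, as are the default limits and the `top_k` cut-off. The degree of co-citation is counted only over the sampled parents and their sampled children, so it is an approximation of the true number of common parents.

```python
from collections import Counter


def cocitation(u, parents, children, P=2000, C=8, top_k=10):
    """Rank siblings of u by how many sampled parents they share with u.

    `parents` and `children` are dicts mapping a URL to a list of URLs.
    """
    parent_set = parents.get(u, [])[:P]
    degree = Counter()
    for p in parent_set:
        # set() avoids counting a parent twice if it repeats a link
        for s in set(children.get(p, [])[:C]):
            if s != u:
                degree[s] += 1
    # Highest degree first; ties broken by URL for a stable order
    ranked = sorted(degree.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_k]


# Hypothetical link data around u.
children = {
    "p1": ["u", "a", "b"],
    "p2": ["u", "a"],
    "p3": ["a", "u", "c"],
}
parents = {"u": ["p1", "p2", "p3"]}
related = cocitation("u", parents, children)
```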

17
Algorithm Implementation
  • Connectivity Server: 180 million URLs (nodes)
  • AlphaServer with 8 GB RAM, to prevent page faults
  • Connection to the Connectivity Server: via the server code, using
    mmap

18
Experimental Setup
  • 18 people, at least 2 URLs each
  • 59 URLs: get the top 10 answers for each and have them rated
  • Each page is rated as
  • 0: page not valuable/useful
  • 1: page valuable/useful
  • -: page inaccessible

19
Algorithm Performance Metrics
  • Intersection group: URLs for which all algorithms return at least
    one answer (37)
  • Non-Netscape group: URLs for which Netscape did not return any
    answers (19)

20
Algorithm Performance Metrics (contd)
21
Algorithm Performance Metrics (contd)
22
Algorithm Performance Metrics (contd)
  • Overlap between answers returned by algorithms

23
Algorithm Performance Metrics (contd): Sign Test Example
  • Sample data set
  • 97.5, 95.2, 97.3, 96.0, 96.8, 100.0, 97.4,
    95.3, 93.2, 99.1, 96.1, 97.6, 98.2, 98.5, 94.9
  • Null hypothesis: median = 98.5
  • Alternative hypothesis: median < 98.5
  • X = 2, the number of values larger than 98.5
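
The sign test on this sample can be worked through in Python. The sketch counts the values above the hypothesized median and computes a one-sided p-value from the Binomial(n, 1/2) distribution. Dropping values equal to the hypothesized median is a common convention and an assumption here, since the slide does not say how ties are handled; the function name is hypothetical.

```python
from math import comb


def sign_test_lower(sample, hypothesized_median):
    """One-sided sign test for the alternative 'median < hypothesized'.

    Returns (x, n, p_value): x is the number of values above the
    hypothesized median, n the number of non-tied values, and p_value
    the probability of seeing x or fewer successes under Binomial(n, 0.5).
    """
    above = sum(1 for v in sample if v > hypothesized_median)
    below = sum(1 for v in sample if v < hypothesized_median)
    n = above + below                      # ties are dropped (assumption)
    p_value = sum(comb(n, k) for k in range(above + 1)) / 2 ** n
    return above, n, p_value


data = [97.5, 95.2, 97.3, 96.0, 96.8, 100.0, 97.4,
        95.3, 93.2, 99.1, 96.1, 97.6, 98.2, 98.5, 94.9]
x, n, p = sign_test_lower(data, 98.5)
print(f"X = {x}, n = {n}, one-sided p-value = {p:.4f}")
```

A small p-value means that so few values above 98.5 would be unlikely if the median really were 98.5, which favours the alternative hypothesis.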

24
Algorithm Performance Metrics (contd)
  • Statistical significance of results

25
Algorithm Performance Metrics (contd): Timing Characteristics
  • Average running times
  • Companion: 109 ms for 50 URLs
  • Cocitation: 195 ms for 58 URLs

26
Related Works
  • Order of links: Chakrabarti et al., "Enhanced Hypertext
    Categorization Using Hyperlinks".
  • Cocitation and other forms of connectivity:
    Spertus (if A points to B and C, then B and C are related);
    Pitkow and Pirolli (cocitation)

27
Conclusion and Future Works
  • Future work: extend these two algorithms to handle more than one
    input URL.
  • Conclusion: the two algorithms significantly outperform Netscape's
    service for finding related web pages.

28
Questions?
  • This presentation is available at
  • http://www.cs.utexas.edu/alex/datamining-0205.ppt