1
Estimating the Global PageRank of Web Communities
  • Paper by Jason V. Davis and Inderjit S. Dhillon
  • Dept. of Computer Sciences
  • University of Texas at Austin
  • Presentation given by Scott J. McCallen
  • Dept. of Computer Science
  • Kent State University
  • December 4th 2006

2
Localized Search Engines
  • What are they?
  • Focus on a particular community
  • Examples: www.cs.kent.edu (site specific) or all
    computer-science-related websites (topic specific)
  • Advantages
  • Better handling of search terms that have several
    meanings, since the community fixes the context
  • Relatively inexpensive to build and use
  • Use less bandwidth, space and time
  • Local domains are orders of magnitude smaller
    than global domain

3
Localized Search Engines (cont)
  • Disadvantages
  • Lack of Global information
  • i.e. only local PageRanks are available
  • Why is this a problem?
  • Only pages that are highly regarded within the
    community will have high PageRanks
  • We need global PageRank values for pages that lie
    within a local domain
  • Traditionally, these can only be obtained by
    crawling the entire global domain

4
Some Global Facts
  • 2003 Study by Lyman on the Global Domain
  • 8.9 billion pages on the internet (static pages)
  • Approximately 18.7 kilobytes each
  • 167 terabytes needed to download and crawl the
    entire web
  • These resources are only available to major
    corporations
  • Local Domains
  • May only contain a couple hundred thousand pages
  • May already be contained on a local web server
    (www.cs.kent.edu)
  • Access to the entire dataset is much less
    restricted
  • The advantages of localized search engines become
    clear

5
Global (N) vs. Local (n)
Each local domain isn't aware of the rest of the
global domain.
Some parts overlap, but others don't. Overlap
represents links to other domains.
How is it possible to extract global information
when only the local domain is available?
Excluding overlap from other domains gives a very
poor estimate of global rank.
6
Proposed Solution
  • Find a good approximation to the global PageRank
    value without crawling entire global domain
  • Find a superdomain of the local domain that
    approximates the global PageRank well
  • Find this superdomain by crawling as few as n or
    2n additional pages given a local domain of n
    pages
  • Essentially, add as few pages to the local
    domain as possible until we find a very good
    approximation of the PageRanks in the local
    domain

7
PageRank - Description
  • Defines importance of pages based on the
    hyperlinks from one page to another (the web
    graph)
  • Computes the stationary distribution of a Markov
    chain created from the web graph
  • Uses the random surfer model to create a
    random walk over the chain

8
PageRank Matrix
  • Given the m x m adjacency matrix U for the web
    graph, define the PageRank matrix as
    P_U = α U^T D_U^{-1} + (1 − α) v e^T
  • D_U is the diagonal matrix of out-degrees, chosen
    so that U^T D_U^{-1} is column stochastic
  • 0 < α < 1 is the random surfer probability
  • e is the vector of all 1s
  • v is the random surfer vector (see the sketch
    below)
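A minimal NumPy sketch of this construction, assuming U[i, j] = 1 when page i links to page j; the dangling-page handling (uniform out-links) is a common convention, not something stated on the slide:

```python
import numpy as np

def pagerank_matrix(U, alpha=0.85, v=None):
    """Build P_U = alpha * U^T D_U^{-1} + (1 - alpha) * v e^T."""
    m = U.shape[0]
    v = np.full(m, 1.0 / m) if v is None else v
    out_deg = U.sum(axis=1).astype(float)
    U = U.astype(float)
    # Dangling pages (no out-links): treat as linking everywhere uniformly.
    U[out_deg == 0] = 1.0 / m
    out_deg[out_deg == 0] = 1.0
    # Column j of U.T is page j's out-links; dividing by out_deg[j]
    # makes the transition matrix column stochastic.
    P_transition = U.T / out_deg
    return alpha * P_transition + (1 - alpha) * np.outer(v, np.ones(m))
```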

9
PageRank Vector
  • The PageRank vector r represents the page rank of
    every node in the webgraph
  • It is defined as the dominant eigenvector of the
    PageRank matrix
  • Computed with the power method from a random
    starting vector
  • Computation can take as much as O(m^2) time for a
    dense graph, but in practice is normally O(km), k
    being the average number of links per page

10
Algorithm 1
  • Computing the PageRank vector based on the
    adjacency matrix U of the given web graph

11
Algorithm 1 (Explanation)
  • Input: adjacency matrix U
  • Output: PageRank vector r
  • Method:
  • Choose a random initial value for r(0)
  • Iterate, applying the random surfer probability
    and vector, until the convergence threshold is
    reached
  • Return the last iterate as the dominant
    eigenvector of the PageRank matrix built from U
    (a sketch follows below)
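A minimal sketch of Algorithm 1 as the slide describes it, reusing the pagerank_matrix helper from the earlier sketch; a real implementation would iterate with the sparse link structure rather than a dense matrix:

```python
import numpy as np

def pagerank(U, alpha=0.85, tol=1e-8, max_iter=1000):
    """Algorithm 1 sketch: power method on the PageRank matrix of U."""
    m = U.shape[0]
    P = pagerank_matrix(U, alpha)       # from the earlier sketch
    r = np.random.rand(m)
    r /= r.sum()                        # random stochastic starting vector r(0)
    for _ in range(max_iter):
        r_next = P @ r                  # one power-method iteration
        if np.abs(r_next - r).sum() < tol:  # L1 convergence threshold
            return r_next
        r = r_next
    return r
```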

12
Defining the Problem ( G vs. L)
  • For a local domain L, we have G as the entire
    global domain, with an N x N adjacency matrix
  • Define G in partitioned form as
    G = [ L        L_out ]
        [ G_within G_out ]
  • i.e. we partition G into separate sections so
    that L is contained in it
  • Assume that L has already been crawled, so L_out
    is known

13
Defining the Problem (p in g)
  • With G partitioned this way, the actual PageRank
    vector of L with respect to g (the global
    PageRank vector) is
    p* = E_L^T g / ||E_L^T g||_1
  • Note: E_L selects from g only the nodes that
    correspond to L

14
Defining the Problem (n << N)
  • We define p as the PageRank vector computed by
    crawling only the local domain L
  • Note that p will differ substantially from p*
  • Crawling more nodes of the global domain would
    shrink the difference, but crawling everything is
    not feasible
  • Instead, find the supergraph F of L that
    minimizes the difference between p* and the
    estimate

15
Defining the Problem (finding F)
  • We need to find the F that gives the best
    approximation of p*
  • i.e. minimize the difference between the actual
    global PageRank of the pages in L and the
    estimate obtained from F (the objective
    GlobalDiff)
  • F is found with a greedy strategy, using
    Algorithm 2
  • Essentially: start with L, add the nodes in F_out
    that most reduce the objective, and repeat for a
    total of T iterations

16
Algorithm 2
17
Algorithm 2 (Explanation)
  • Input: L (local domain), L_out (outlinks from L),
    T (number of iterations), k (pages to crawl per
    iteration)
  • Output: p (an improved estimate of the PageRank
    of the local pages)
  • Method:
  • First set F (supergraph) and F_out equal to L and
    L_out
  • Compute the PageRank vector f of F
  • While T has not been exceeded:
  • Select k new nodes to crawl, based on F, F_out, f
  • Expand F to include those new nodes and update
    F_out
  • Recompute the PageRank vector f for F
  • Select the elements of f that correspond to L and
    return p (a sketch follows below)
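A structural sketch of this loop, reusing the pagerank sketch from Algorithm 1 and assuming that crawl and select_nodes are caller-supplied stand-ins (the paper's actual selection rule is Algorithm 3, explained later) and that the pages of L occupy the first rows of F:

```python
def estimate_local_pagerank(L, L_out, T, k, crawl, select_nodes):
    """Algorithm 2 sketch: greedily grow a supergraph F around L,
    then read off the PageRank of the original local pages.

    crawl(F, F_out, pages) and select_nodes(F, F_out, f, k) are
    hypothetical stand-ins supplied by the caller.
    """
    F, F_out = L.copy(), L_out.copy()
    local_n = L.shape[0]
    f = pagerank(F)                           # PageRank of the current supergraph
    for _ in range(T):
        pages = select_nodes(F, F_out, f, k)  # k most promising frontier pages
        F, F_out = crawl(F, F_out, pages)     # expand F, update its frontier
        f = pagerank(F)                       # recompute on the larger graph
    p = f[:local_n]                           # entries for the original local pages
    return p / p.sum()                        # renormalize to a distribution
```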

18
Global (N) vs. Local (n) (Again)
We know how to create the PageRank vector using
the power method.
Using it on only the local domain gives very
inaccurate estimates of the global PageRank.
How can we select nodes from other domains (i.e.
expand the current domain) to improve accuracy?
How many additional nodes can we select without
crawling the entire global domain?
19
Selecting Nodes
  • Select nodes to expand L to F
  • Selected nodes must bring us closer to the actual
    PageRank vector
  • Some nodes will greatly influence the current
    PageRank
  • Only want to select at most O(n) more pages than
    those already in L

20
Finding the Best Nodes
  • For a page j in the global domain on the frontier
    of F (F_out), the addition of page j to F gives
    the extended graph
    F_j = [ F    u_j ]
          [ s^T   0  ]
  • u_j is the vector of outlinks from F to j
  • s is the estimated vector of inlinks from j into
    F (j has not yet been crawled)
  • s is estimated from the expected inlink counts of
    the pages already crawled (see the sketch below)
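A small sketch of this construction; estimating s as the mean row of F is one plausible reading of the expectation described above, not necessarily the paper's exact formula:

```python
import numpy as np

def extend_with_page(F, u_j):
    """Append uncrawled page j to F with an estimated out-link row s.

    Assumption: s is estimated as the average row of F (the mean
    link pattern of the pages already crawled).
    """
    s = F.mean(axis=0)                        # estimated links from j back into F
    top = np.hstack([F, u_j.reshape(-1, 1)])  # [ F   u_j ]
    bottom = np.append(s, 0.0)                # [ s^T  0  ]
    return np.vstack([top, bottom])
```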

21
Finding the Best Nodes (cont)
  • We defined the PageRank vector of F to be f
  • The PageRank vector of F_j is f_j
  • x_j is the PageRank of node j (appended to the
    current PageRank vector)
  • Directly optimizing the objective requires
    knowing the global PageRank p*
  • How can we minimize the objective without knowing
    p*?

22
Node Influence
  • Find the nodes in F_out that will have the
    greatest influence on the local domain L
  • Done by attaching an influence score to each node
    j:
    influence(j) = sum over pages k in L of |f_j[k] − f[k]|
  • i.e. the total change that adding page j makes to
    the PageRank of the pages in L (see the snippet
    below)
  • The influence score correlates strongly with the
    minimization of the GlobalDiff(f_j) objective (as
    compared to a baseline, for instance the total
    outlink count from F to node j)
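A minimal sketch of the influence score under these definitions, assuming f_j and f are aligned NumPy vectors over the pages of F (with j's entry dropped from f_j) and local_idx holds the positions of L's pages:

```python
import numpy as np

def influence(f, f_j, local_idx):
    """Influence of adding page j: total PageRank change over L.

    f: PageRank of F; f_j: PageRank of F extended with j,
    restricted to the original pages of F;
    local_idx: indices of the pages of L inside F.
    """
    return np.abs(f_j[local_idx] - f[local_idx]).sum()
```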

23
Node Influence Results
  • Node Influence vs. Outlink Count on a crawl of
    conservative web sites

24
Finding the Influence
  • The influence must be calculated for each node j
    in the frontier of F that is considered
  • We are considering O(n) pages and each calculation
    is O(n), so we are left with an O(n^2) computation
  • To reduce this complexity, approximating the
    influence of j may be acceptable, but how?
  • The power method used for computing PageRank can
    lead us to a good approximation
  • However, using Algorithm 1 cheaply requires a
    good starting vector

25
PageRank Vector (again)
  • The power method converges at a rate equal to the
    random surfer probability α
  • With a starting vector x(0), the number of
    iterations needed depends on how far x(0) is from
    the solution and on the accuracy demanded (see
    the worked bound below)
  • That is, the more accurate we need the result to
    be, the more expensive the process is
  • Saving grace: find a very good starting vector
    for x(0), in which case we only need to perform
    one iteration of Algorithm 1
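A worked form of this bound, assuming the standard L1 convergence of the power method at rate α (the rate the slide itself states):

```latex
\|x^{(k)} - f\|_1 \;\le\; \alpha^{k}\,\|x^{(0)} - f\|_1
\quad\Longrightarrow\quad
k \;\ge\; \frac{\log\!\bigl(\epsilon / \|x^{(0)} - f\|_1\bigr)}{\log \alpha}
```

So the closer x(0) is to f, the fewer iterations are required; a sufficiently good x(0) needs only one.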

26
Finding the Best x(0)
  • Partition the PageRank matrix of F_j into blocks,
    with the leading l x l block corresponding to the
    pages of F and the final row and column to the
    new node j

27
Finding the Best x(0)
  • Simple approach
  • Use the current PageRank vector f (padded with a
    zero entry for the new node) as the starting
    vector
  • Perform one PageRank iteration
  • Remove the element that corresponds to the added
    node
  • Issues
  • The estimate of f_j will have an error of at
    least 2αx_j
  • So if the PageRank x_j of node j is very high,
    the estimate is very bad

28
Stochastic Complement
  • Writing the PageRank equation for f_j in expanded
    block form and solving for its first l entries
    yields the l x l matrix
    S = P11 + P12 (I − P22)^{-1} P21
  • Observation
  • This S is the stochastic complement of the
    PageRank matrix of F_j with respect to the block
    for F

29
Stochastic Complement (Observations)
  • The stochastic complement of an irreducible
    matrix is unique
  • The stochastic complement is also irreducible and
    therefore has a unique stationary distribution
  • With regard to the matrix S, its subdominant
    eigenvalue is bounded above by a quantity that
    approaches α as l grows, which means that for
    large l it is very close to α

30
The New PageRank Approximation
  • Estimate the vector f_j of length l by performing
    one PageRank iteration over S, starting at f
  • Advantages
  • Starts and ends with a vector of length l
  • The error has a lower bound of zero
  • Example: consider adding a node k to F that has
    no influence over the PageRank of F; using the
    stochastic complement then yields the exact
    solution (a sketch follows below)
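A minimal sketch of this one-iteration estimate, assuming the PageRank matrix of F_j has been partitioned into an l x l block P11, coupling vectors P12 and P21 for the single new node, and the scalar self-term P22:

```python
import numpy as np

def one_iteration_estimate(P11, P12, P21, P22, f):
    """Approximate f_j with one power-method step over the
    stochastic complement S = P11 + P12 (I - P22)^{-1} P21.

    For a single added node, P12 and P21 are length-l vectors and
    P22 is a scalar, so (I - P22)^{-1} is just 1 / (1 - P22).
    """
    S = P11 + np.outer(P12, P21) / (1.0 - P22)
    return S @ f                      # one iteration, starting from f
```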

31
The Details
  • Begin by expanding the difference between the two
    PageRank vectors, together with the accompanying
    definitions

32
The Details
  • Substitute P_F into the equation
  • Summarize the result into vectors

33
Algorithm 3
34
Algorithm 3 (Explanation)
  • Input: F (the current local subgraph), F_out
    (outlinks of F), f (current PageRank of F), k
    (number of pages to return)
  • Output: k new pages to crawl
  • Method:
  • Compute the outlink sums for each page in F
  • Compute a scalar for every known global page j
    (how many pages link to j)
  • Compute y and z as formulated
  • For each of the pages in F_out:
  • Compute x as formulated
  • Compute the score of each page from x, y and z
  • Return the k pages with the highest scores (a
    sketch follows below)
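A structural sketch of this selection step; compute_y_z, compute_x, and score are hypothetical stand-ins for the vector formulas on the untranscribed slide:

```python
import heapq

def select_nodes(F, F_out, f, k):
    """Algorithm 3 sketch: score every frontier page, return the top k.

    compute_y_z, compute_x, and score are hypothetical stand-ins
    for the formulas on the original (untranscribed) slide.
    """
    y, z = compute_y_z(F, f)          # shared vectors, computed once
    scores = {}
    for j in F_out:                   # every known-but-uncrawled page
        x = compute_x(F, f, j)        # per-page vector
        scores[j] = score(x, y, z)    # estimated influence of adding j
    return heapq.nlargest(k, scores, key=scores.get)
```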

35
PageRank Leaks and Flows
  • The change in PageRank from adding a node j to F
    can be described in terms of leaks and flows
  • A flow is an increase in the local PageRanks
  • It is represented by a scalar times a vector
  • The scalar is the total amount of PageRank j has
    to distribute
  • The vector determines how it will be distributed
  • A leak is a decrease in the local PageRanks
  • Leaks come from the non-positive vectors x and y
  • x is proportional to the weighted sum of sibling
    PageRanks
  • y is an artifact of the random surfer vector

36
Leaks and Flows
[Diagram: node j sends flows to the local pages,
while PageRank leaks to j's siblings and to the
random surfer]
37
Experiments
  • Methodology
  • Resources are limited, so the global graph is
    approximated
  • Baseline algorithms
  • Random
  • Nodes chosen uniformly at random from the known
    global nodes
  • Outlink count
  • The nodes chosen have the highest outlink counts
    from the current local domain

38
Results (Data Sets)
  • Data sets
  • Restricted to http pages whose URLs do not
    contain characters such as '?' or '@'
  • EDU data set
  • Crawl of the top 100 computer science
    universities
  • Yielded 4.7 million pages and 22.9 million links
  • Politics data set
  • Crawl of the pages under the Politics category of
    the dmoz directory
  • Yielded 4.4 million pages and 17.2 million links

39
Results (EDU Data Set)
  • The norm-based measures report difference (lower
    is better); Kendall's tau reports similarity
    (higher is better)

40
Results (Politics Data Set)
41
Result Summary
  • The stochastic complement method outperformed the
    other methods in nearly every trial
  • Its results are significantly better than the
    random selection baseline, with minimal extra
    computation

42
Conclusion
  • Accurate estimates of the PageRank can be
    obtained by using local results
  • Expand the local graph based on influence
  • Crawl at most O(n) more pages
  • Use stochastic complement to accurately estimate
    the new PageRank vector
  • Not computationally or storage intensive

43
Estimating the Global PageRank of Web Communities
  • The End
  • Thank You