Adaptive Online Page Importance, Experiments and Applications

Transcript and Presenter's Notes

1
Adaptive On-line Page Importance, Experiments and
Applications
  • Serge Abiteboul (INRIA & Xyleme)
  • with
  • Grégory Cobéna (INRIA, now Xyleme)
  • and Mihai Preda (Xyleme)

2
Motivations
  • Page importance
  • Notion introduced by Kleinberg
  • Popularized on the Web by Google
  • Applications of page importance
  • Rank the results of a search engine
  • Guide the frequency of page visits
  • For maintaining the index of a search engine
  • For monitoring the Web
  • For archiving the Web
  • For building a topic specific warehouse of Web
    data

3
Organization
  • What is page importance?
  • Intuition
  • Formal model
  • The Adaptive OPIC algorithm
  • An on-line algorithm
  • Adaptive algorithm
  • Experiments
  • Crawling in Xyleme
  • Applications

4
What is page importance?
5
Intuition
  • Not all pages on the Web have the same importance
  • Le Louvre's homepage is more important than Mme Michu's homepage
  • Page importance information is valuable
  • To rank query results (e.g. Google) for display
  • Less and less so?
  • To decide which pages should be crawled first or
    which pages should be refreshed next

6
Model
  • The Web as a graph/matrix
  • We view the Web as a directed graph G
  • Web pages are vertices, HTML links are edges
  • Let n be the number of pages
  • G is represented as a non-negative link matrix L[1..n, 1..n]
  • There are many ways to encode the Web as a matrix
  • Kleinberg sets L[i,j] = 1 if there is a link from i to j
  • Brin & Page set L[i,j] = 1/out(i) if there is a link from i to j, where out(i) is the out-degree of i
  • Both set L[i,j] = 0 if there is no edge from i to j
  • The importance is represented as a vector x[1..n]

7
Importance (modulo details :-)
  • Importance is defined as the unique fixpoint of the equation
  • x = L x
  • Page importance can be computed iteratively
  • x_{k+1} = L x_k
  • If normalized, this corresponds to the limit of the probability of being at a given page during a random walk on the Web
  • Start randomly on some page; follow randomly some link of that page
  • Keep walking
  • This corresponds to an intuitive notion of importance (a minimal power-iteration sketch follows below)
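
To make the iteration concrete, here is a minimal power-iteration sketch in Python. It is not from the slides: the 3-page graph is an illustrative choice that mirrors the Alice/Bob/Georges example later in the deck, using the Brin & Page encoding.

  import numpy as np

  # L[i,j] = 1/out(i) if page i links to page j (Brin & Page encoding).
  # Page 0 links to 1 and 2, page 1 links to 0, page 2 links to 1.
  L = np.array([
      [0.0, 0.5, 0.5],
      [1.0, 0.0, 0.0],
      [0.0, 1.0, 0.0],
  ])

  x = np.ones(3) / 3      # start from the uniform vector
  for _ in range(100):
      x = L.T @ x         # push importance along links (transpose because
                          # L is indexed by source row, destination column)
      x = x / x.sum()     # normalize after each step (see next slide)

  print(x)                # converges to roughly (0.4, 0.4, 0.2)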

8
Some of the details in brief
  • For any non-negative matrix L
  • Such a fixpoint always exists, but it may not be unique
  • Iterating over k diverges or converges to zero in most cases
  • A normalization after each step is therefore necessary
  • Theorem
  • If the graph is strongly connected, there exists a unique fixpoint (up to normalization)
  • If the graph is aperiodic, the iteration converges

9
Strong connectivity: disjoint components
[Diagram: two disjoint components, one containing A and B, the other containing C and D]
  • The relative importance of A,B as compared to C,D depends on the initial value of x
  • One solution for (A,B) and one for (C,D) give many solutions for the whole system
10
Strong connectivity: sinks
[Diagram: A and B link toward C and D, with no links back]
  • In the random walk model, the importance of A and B is zero
  • Only C and D accumulate some importance
11
A-periodicity
[Diagram: a cycle through A, B and C]
  • The iteration oscillates between several values
12
Situation for the Web
  • The Web is not strongly connected
  • Consider the bow-tie model of the Web graph
  • Google adds a small edge for every pair (i,j)
  • We add small edges to and from a virtual page (see the sketch below)
  • Intuition: consider the possibility for users to navigate on the Web without using links (e.g. bookmarks, typed URLs)
  • The Web is reasonably aperiodic
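
A minimal sketch of both fixes, assuming L is row-stochastic (Brin & Page encoding); the damping weight eps = 0.15 is an illustrative assumption, not a value from the slides:

  import numpy as np

  def google_fix(L, eps=0.15):
      """Mix L with a uniform jump: a small edge between every pair (i, j)."""
      n = L.shape[0]
      return (1 - eps) * L + eps * np.ones((n, n)) / n

  def virtual_page_fix(L, eps=0.15):
      """Add a virtual page that every page links to and that links back
      to every page; the result is an (n+1) x (n+1) matrix."""
      n = L.shape[0]
      M = np.zeros((n + 1, n + 1))
      M[:n, :n] = (1 - eps) * L   # original links, slightly downweighted
      M[:n, n] = eps              # every real page sends eps to the virtual page
      M[n, :n] = 1.0 / n          # the virtual page links uniformly to all pages
      return M

Either fix makes the graph strongly connected and aperiodic, so the theorem of slide 8 applies.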

13
Adaptive OPIC: Adaptive Online Page Importance Computation
14
Online Computation Motivations
  • Off-line algorithm
  • Crawls the Web and builds a link matrix
  • Stores the link matrix and updates it: very expensive
  • Starts an off-line computation on a frozen link matrix
  • On-line Page Importance Computation
  • Does not require storing the link matrix
  • Works continuously, together with crawling
  • Works independently of any crawling strategy
  • Provides an early estimate of page importance to guide crawling
  • Keeps improving and updating the estimate

15
Static graphs: OPIC
  • We assign to each page a small amount of cash
  • When a page is read, its cash is distributed among its children
  • The total amount of cash over all pages does not change
  • The importance of a given page is computed from the history of cash of that page (a minimal sketch follows below)
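
A minimal sketch of OPIC in Python, under the slides' assumptions: equal initial cash, cash split equally among children, and every page having at least one child (the virtual page of slide 12 guarantees this). The pluggable pick function anticipates the crawling strategies discussed later.

  def opic(graph, steps, pick=None):
      """graph: dict mapping each page to its non-empty list of children.
      Returns the importance estimate History(page) / Sum(Histories)."""
      pages = list(graph)
      cash = {p: 1.0 / len(pages) for p in pages}   # total cash stays constant
      hist = {p: 0.0 for p in pages}
      for _ in range(steps):
          if pick is None:                          # default strategy: Greedy
              page = max(pages, key=lambda p: cash[p])
          else:
              page = pick(pages, cash)
          amount = cash[page]
          hist[page] += amount                      # bank the cash in the history
          cash[page] = 0.0
          for child in graph[page]:                 # split it among the children
              cash[child] += amount / len(graph[page])
      total = sum(hist.values())
      return {p: hist[p] / total for p in pages}

For instance, opic({"A": ["B", "G"], "B": ["A"], "G": ["B"]}, 1000) approaches the (0.4, 0.4, 0.2) fixpoint of the Alice/Bob/Georges example on the next slides.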

16
Example
  • Small Web of 3 pages: Alice, Bob, Georges
  • Alice links to Bob and Georges, Bob links to Alice, Georges links to Bob
  • Alice has all the cash to start
  • Importance is independent of the original distribution of cash

[Diagram: the 3-page graph Alice → Bob, Alice → Georges, Bob → Alice, Georges → Bob]
17
What happened?
  • Cash-game history
  • Alice received 600 (200 + 400)
  • Bob received 600 (200 + 100 + 300)
  • Georges received 300 (200 + 100)
  • Solution
  • I(Alice) = 40%
  • I(Bob) = 40%
  • I(Georges) = 20%
  • It is the fixpoint

I(page) = History(page) / Sum(Histories)
18
Cash-History
  • Evolution of cash and history for (Alice, Bob, Georges), starting from uniform cash; a replay in code follows below
  • Start: cash 0.33, 0.33, 0.33 ; history 0, 0, 0
  • Read Alice: cash 0, 0.50, 0.50 ; history 0.33, 0, 0
  • Read Bob: cash 0.5, 0, 0.5 ; history 0.33, 0.50, 0
  • Read Georges: cash 0.5, 0.5, 0 ; history 0.33, 0.50, 0.50
  • Read Bob: cash 1, 0, 0 ; history 0.33, 1.0, 0.50
  • Read Alice: cash 0, 0.5, 0.5 ; history 1.33, 1.0, 0.50
19
Computing Page Importance
  • C_t[i] is the cash of page i at time t
  • H_t[i] is the history (sum of previous cash) of page i
  • The total amount of cash is constant
  • For each page i, H_t[i] goes to infinity
  • For each page j, at each step:
  • H_t[j] + C_t[j] = C_0[j] + sum(i ancestor of j, H_t[i] / out(i))
  • Theorem: the limit of H_t[j] / sum_k(H_t[k]) is the importance of page j (a numeric check follows below)
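
As a sanity check (not on the slides), the invariant holds for Bob after the five reads of slide 18: H + C = 1.0 + 0.5 = 1.5, while C_0 + H(Alice)/out(Alice) + H(Georges)/out(Georges) = 0.33 + 1.33/2 + 0.5/1 = 1.5, since Alice has two children and Georges one.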

20
The Web is a changing graph: the Adaptive Algorithm
  • The Web changes continuously, and so does the importance of pages
  • Our adaptive algorithm works by considering only the recent part of the cash history of each page
  • The time window corresponding to the recent history may be defined as:
  • A fixed number of measures for each page
  • A fixed period of time for each page
  • A single value that interpolates the history over a specific period of time (see the sketch below)
  • Note that the definition of page importance considers a fixed number of nodes
  • For instance, the page importance of previously existing pages decreases automatically when new pages are added.
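
One plausible reading of the interpolation policy, sketched in Python; the formula assumes cash accumulates linearly in time and may differ in details from the authors' exact definition:

  def interpolate_history(H, C, dt, T):
      """H:  interpolated history kept for the page (covers the window T)
      C:  cash collected by the page since its previous visit
      dt: time elapsed since the previous visit
      T:  length of the time window"""
      if dt >= T:
          # The window lies entirely within the latest measurement:
          # scale the collected cash down to the window.
          return C * T / dt
      # Otherwise keep the fraction of the old history still inside
      # the window and add the newly collected cash.
      return H * (T - dt) / T + C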

21
Experiments
22
Crawling Strategies
  • Our algorithm works with any crawling strategy, provided each page is visited infinitely often
  • It does not impose any constraint on the order in which pages are visited
  • Simple crawling strategies (written out as selectors below):
  • Random: all pages have equal probability of being chosen
  • Greedy: choose the page with the largest amount of cash
  • Cycle: a systematic strategy that cycles through the set of pages
  • Convergence is faster with Greedy, since pages have more cash on average to distribute
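
The three strategies written as interchangeable selector functions for the opic sketch above (illustrative, not from the slides):

  import itertools
  import random

  def pick_random(pages, cash):
      return random.choice(pages)               # uniform choice

  def pick_greedy(pages, cash):
      return max(pages, key=lambda p: cash[p])  # largest amount of cash first

  def make_pick_cycle(pages):
      order = itertools.cycle(pages)            # fixed round-robin order
      def pick_cycle(_pages, _cash):
          return next(order)
      return pick_cycle

For example, opic(graph, 1000, pick=pick_random) runs the same computation under the Random strategy.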

23
Experiment settings for synthetic data
  • Synthetic models of the Web
  • More flexibility for studying variants of the algorithm
  • We build a graph that simulates the Web
  • We compute the reference page importance on this graph by running the off-line algorithm until the fixpoint is reached
  • We simulate Web crawling and compute page importance on-line
  • We compare our estimate of page importance with the reference

24
Experiments on synthetic data
  • Impact of the page selection strategy: Greedy is best

25
Experiments on synthetic data
  • Convergence on important pages: Greedy does brilliantly on important pages

26
Experiments on synthetic data
  • Impact of the size of the window: difficult to fix, as it depends on the change rate

27
Experiments on synthetic data
  • Impact of the window policy: interpolated history is best

28
Xyleme Crawlers
29
Implementation
  • INRIA-Xyleme crawlers
  • Run on a cluster of Linux PCs (8 PCs at some point)
  • Code is in C; communications use CORBA
  • Each crawler is in charge of 100 million pages and crawls about 4 million pages per day
  • A difficulty is to assign a unique integer to each page and to provide an efficient translation from integer to URL (a toy sketch follows below)
  • Continuously read pages on the Web (HTML, XML)
  • Use HTML/XML links to discover new pages
  • Monitor the Web; archive XML pages
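
A toy in-memory sketch of the URL/integer mapping; the real system must keep most of this on disk, since each crawler handles 100 million pages:

  class UrlStore:
      """Dense integer ids for URLs, with reverse lookup."""
      def __init__(self):
          self.id_of = {}    # URL -> integer id
          self.url_of = []   # integer id -> URL
      def get_id(self, url):
          if url not in self.id_of:     # first time we see this URL
              self.id_of[url] = len(self.url_of)
              self.url_of.append(url)
          return self.id_of[url]
      def get_url(self, page_id):
          return self.url_of[page_id]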

30
Implementation - continued
  • We implemented a distributed version of Adaptive OPIC
  • Crawling strategy
  • The crawling strategy in Xyleme was designed to optimize the knowledge of the Web
  • Intuition: refresh frequency proportional to importance
  • It turned out to be, on average, very close to Greedy

31
Overview of Crawler
[Diagram: the WWW at the top; a pool of robots feeding the crawler; a scheduler driving the crawler]
  • Pages are grouped by domain name and crawled by robots
  • The scheduler decides which pages will be read next, depending on their importance, change rate, client interests, etc.
  • New pages are discovered using links found in HTML pages; the scheduler also manages metadata on known pages
32
Some numbers
  • Fetcher
  • Up to 100 robots running simultaneously on a single PC
  • Average of 50 pages/second on an (old) PC (4 million/day)
  • The limiting factor is the number of random disk accesses
  • Performance and politeness
  • Pages are grouped by domain to minimize the cost of DNS (Domain Name System) resolution (the next 10 million pages to be read)
  • To avoid rapid firing, we maintain a large number of accessible sites in memory (1 million domains); a sketch follows below
  • Knowledge about visited pages: 100 million pages in main memory
  • For each page: the exact disk location of its info structure (4 bytes), plus a counter used for page rank and for the crawling strategy
  • One disk access per page read
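
A minimal sketch of the rapid-firing guard, keeping per-domain queues with a minimum delay between hits on the same domain; the 30-second delay is an illustrative assumption:

  import heapq
  import time
  from urllib.parse import urlparse

  class PoliteQueue:
      """Serve URLs domain by domain, with a minimum delay per domain."""
      def __init__(self, delay=30.0):
          self.delay = delay
          self.ready = []     # heap of (next allowed time, domain)
          self.pending = {}   # domain -> list of URLs still to fetch
      def add(self, url):
          domain = urlparse(url).netloc
          if domain not in self.pending:
              self.pending[domain] = []
              heapq.heappush(self.ready, (0.0, domain))
          self.pending[domain].append(url)
      def next_url(self):
          t, domain = heapq.heappop(self.ready)
          if t > time.time():
              time.sleep(t - time.time())   # respect the per-domain delay
          url = self.pending[domain].pop(0)
          if self.pending[domain]:          # re-schedule remaining work
              heapq.heappush(self.ready, (time.time() + self.delay, domain))
          else:
              del self.pending[domain]
          return url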

33
Experiments on Web data
  • Experiments were conducted using the crawlers of Xyleme
  • 8 PCs with 1.5 GB of memory each
  • Crawling strategy close to Greedy (with a focus on XML)
  • History managed using the interpolation policy
  • Experiments lasted for several months
  • We discovered close to one billion URLs
  • After several trials, we set the window size to 3 months
  • The importance of pages seems correct
  • Same as Google's, validated by its success
  • Experiments with BnF librarians over a few thousand Web pages: as good as an average librarian
  • Also gives estimates for pages that were never read (60%), hence guidelines to discover the Web

34
Issues
  • Page importance is OK but not sufficient
  • Content level
  • Classification of pages by topic
  • Refined notion of authority (per domain)
  • Logical site vs. page or physical site
  • What is a logical site?

35
Open problems on Adaptive OPIC
  • More refined analysis of the convergence speed
  • More refined analysis of the adaptive algorithm
  • Selection of the best window
  • More refined study of importance in a changing graph
  • Discovery of new portions of the Web
  • Algorithms for personalized importance
  • Other graph properties computable in a similar
    way
  • Hub
  • Find other applications

36
Thank you