Adaptive On-Line Page Importance Computation

1
Adaptive On-Line Page Importance Computation
  • Serge Abiteboul
  • INRIA
  • Domaine de Voluceau

2
Content
  • Brief Introduction
  • Main Idea
  • Problem Presentation
  • Static Graphs: OPIC
  • Adaptive OPIC
  • Implementation and Experiments
  • Conclusion

3
How do web search engines work?
  • Crawling the web pages
  • Parsing the web pages
  • Indexing the web pages
  • Search page
  • Search word parsing
  • Sending the result

4
Crawling web pages
  • How to get those web pages
  • Changing web pages
  • How to update web pages
  • How often to update them
  • Different pages, different rank

5
Content
  • Brief Introduction
  • Main Idea
  • Problem Presentation
  • Static Graphs: OPIC
  • Adaptive OPIC
  • Implementation and Experiments
  • Conclusion

6
The web as a graph
7
A graph as a matrix
8
Importance
Importance of page i
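
As a minimal sketch, assuming the usual definition in which the importance vector x is the fixpoint of x = Lᵀx for the link matrix L (with L[i][j] = 1/out[i] when page i links to page j), importance can be approximated by power iteration. The function name `importance` and the three-page graph `links` are illustrative assumptions, not taken from the slides.

```python
def importance(links, iterations=100):
    """Approximate page importance by power iteration.

    links maps each page to the list of pages it links to.
    Assumption: every page has at least one outgoing link and the
    graph is strongly connected and aperiodic, so the iteration converges.
    """
    pages = sorted(links)
    x = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        nxt = {p: 0.0 for p in pages}
        for i in pages:
            # Page i splits its current importance evenly among its children.
            share = x[i] / len(links[i])
            for j in links[i]:
                nxt[j] += share
        x = nxt
    return x


# Hypothetical graph: a -> b, a -> c, b -> c, c -> a
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
scores = importance(links)
```

On this small graph the fixpoint can be checked by hand: a and c each hold 0.4 of the importance and b holds 0.2.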
9
Content
  • Brief Introduction
  • Main Idea
  • Problem Presentation
  • Static Graphs: OPIC
  • Adaptive OPIC
  • Implementation and Experiments
  • Conclusion

10
Cost of computing page rank
  • Huge history crawling web pages
  • Huge Cash Vector
  • Huge History Vector
  • Temp Vector
  • Variable length of

STORAGE
11
Cost of computing page rank
  • CPU
  • Memory
  • Disk access
  • Crawling web page
  • Communication

SYSTEM
12
Content
  • Brief Introduction
  • Main Idea
  • Problem Presentation
  • Static Graphs: OPIC
  • Adaptive OPIC
  • Implementation and Experiments
  • Conclusion

13
Inductive Equation
Diverge
Converge to zero
14
Inductive Equation
Several solution
Converge problem
15
Inductive Equation
16
Static Graphs OPIC
First Cash
Credit of history page
Temp Vector
17
Static Graphs OPIC
  for each i let C[i] := 1/n
  for each i let H[i] := 0
  let G := 0
  do forever
  begin
    choose some node i                  // each node is selected infinitely often
    H[i] += C[i]                        // single disk access per page
    for each child j of i, do
      C[j] += C[i]/out[i]               // distribution of cash depends on L
    G += C[i]
    C[i] := 0
  end
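
The OPIC loop on this slide can be sketched in Python as follows. This is a hedged illustration, not the authors' implementation: the function name `opic`, the uniform random page selection, the finite `steps` bound standing in for "do forever", and the final estimate (H[i] + C[i]) / (G + 1) are assumptions made for the example.

```python
import random


def opic(links, steps, seed=0):
    """Run OPIC on a static graph and return importance estimates.

    links maps each page to its children. Assumption: every page has
    at least one outgoing link and the graph is strongly connected.
    """
    rng = random.Random(seed)
    pages = sorted(links)
    cash = {p: 1.0 / len(pages) for p in pages}   # C[i] := 1/n
    history = {p: 0.0 for p in pages}             # H[i] := 0
    total = 0.0                                   # G := 0
    for _ in range(steps):
        i = rng.choice(pages)      # assumed strategy: uniform random choice
        c = cash[i]
        history[i] += c            # H[i] += C[i]
        total += c                 # G += C[i]
        cash[i] = 0.0              # C[i] := 0, done first so self-links stay correct
        share = c / len(links[i])
        for j in links[i]:
            cash[j] += share       # C[j] += C[i]/out[i]
    # The histories sum to G and the cash sums to 1, so these estimates sum to 1.
    return {p: (history[p] + cash[p]) / (total + 1.0) for p in pages}


# Hypothetical graph: a -> b, a -> c, b -> c, c -> a
estimates = opic({"a": ["b", "c"], "b": ["c"], "c": ["a"]}, steps=20000)
```

Only the cash and history vectors are kept in memory, and each step touches one page and its children, which is the property the slides emphasize.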
18
Static Graphs OPIC
19
Lemma 2.2
20
Lemma 2.3
21
Lemma 2.4
If all pages are read infinitely often, the total history (the sum of all H[i])
goes to infinity.
22
Lemma 2.5
23
Content
  • Brief Introduction
  • Main Idea
  • Problem Presentation
  • Static Graphs: OPIC
  • Adaptive OPIC
  • Implementation and Experiments
  • Conclusion

24
Advantages of Adaptive OPIC
  • Less storage resources than standard algorithms
  • Less CPU, memory, and disk access
  • Easy to implement

25
Page crawling strategies
Error factor
  • Random average 1/n
  • Greedy 2/n
  • Cycle
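
The three strategies named above can be sketched as page selectors. This is a minimal illustration under stated assumptions: Random is read as a uniform choice, Greedy as picking the page that currently holds the most cash, and Cycle as a fixed repeating order. The function names are hypothetical.

```python
import itertools
import random


def pick_random(cash, rng):
    """Random: every page is equally likely to be read next."""
    return rng.choice(sorted(cash))


def pick_greedy(cash):
    """Greedy: read the page currently holding the most cash."""
    return max(sorted(cash), key=cash.get)


def make_cycle(pages):
    """Cycle: read the pages in a fixed order, over and over."""
    order = itertools.cycle(sorted(pages))
    return lambda: next(order)


cash = {"a": 0.1, "b": 0.7, "c": 0.2}
next_in_cycle = make_cycle(cash)
first_four = [next_in_cycle() for _ in range(4)]   # a, b, c, then a again
```

All three read every page infinitely often on a connected graph, which is the only requirement the algorithm places on the selection order.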

26
Window selection
Adaptive OPIC uses a fixed window T
27
A changing graph
  • The Web changes continuously, so does the
    importance of pages.
  • Considering only the recent part of the cash
    history for each page
  • The time window corresponding to the recent
    history may be defined as:
  • A fixed number of measures for each page
  • A fixed period of time for each page
  • A single value that interpolates the history for
    a specific period of time
  • When the number of nodes changes, there are some
    difficulties.
  • More precisely, the page importance of previously
    existing pages decreases automatically.

28
Interpolation
  • If (G - G[i]) < T
  • Otherwise
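
The two cases above can be sketched as a single history update. The slide gives only the condition, so the two formulas below are an assumption: a linear interpolation in the style of the OPIC paper, in which the stored history always represents one window of length T. The function and parameter names are hypothetical.

```python
def interpolate_history(h_i, c_i, g_now, g_last, window):
    """Return the updated history of a page visited at time g_now.

    h_i:     stored history, taken to cover one window of length `window`
    c_i:     cash the page collected since its previous visit
    g_now:   current value of the global counter G
    g_last:  value of G at the page's previous visit (G[i])
    window:  the fixed window size T, in the same units as G
    """
    elapsed = g_now - g_last
    if elapsed < window:
        # Keep the share of the old history still inside the window,
        # then add the newly collected cash.
        return h_i * (window - elapsed) / window + c_i
    # The gap is at least one window long: scale the cash down to one window.
    return c_i * window / elapsed


# Hypothetical values: window T = 4, previous visit at G = 8
recent = interpolate_history(1.0, 0.5, g_now=10.0, g_last=8.0, window=4.0)
stale = interpolate_history(1.0, 0.6, g_now=20.0, g_last=8.0, window=4.0)
```

Storing one interpolated value and one timestamp per page keeps the memory cost fixed, unlike a window that stores every recent measure.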

29
Content
  • Brief Introduction
  • Main Idea
  • Problem Presentation
  • Static Graphs: OPIC
  • Adaptive OPIC
  • Implementation and Experiments
  • Conclusion

30
Adaptive OPIC implementation
  • It does not impose any constraints on the order
    of pages to visit
  • The crawling strategy in Xyleme is close to
    Greedy since it is tailored to optimize our
    knowledge of the Web
  • Considering only the recent part of the cash
    history for each page

31
Experiments on synthetic data
  • Convergence on important pages

32
Experiments on synthetic data
  • Impact of the window policy

33
Experiments on Web data
  • Experiments were conducted using the crawlers of
    Xyleme (e.g. 8 PCs with 1.5 GB of memory)
  • Crawling strategy is close to Greedy
  • History is managed using the Interpolation policy
  • Experiments lasted several months; we
    discovered close to one billion URLs and read 400
    million of them
  • Importance of read pages seems correct (with
    limited human checking).
  • We could also give importance estimates for pages
    that were never read
  • The size of the window was first too small, then
    we set it to 3 months

34
Content
  • Brief Introduction
  • Main Idea
  • Problem Presentation
  • Static Graphs: OPIC
  • Adaptive OPIC
  • Implementation and Experiments
  • Conclusion

35
New directions: sites vs. pages
  • Limitation of page importance
  • Google page importance works well when links have
    strong semantics
  • More and more web pages are automatically
    generated and most links have little semantics
  • Further limitations
  • Refresh at the page level presents drawbacks
  • So we also use link topology between sites and
    not only between pages