What's New on the Web? The Evolution of the Web from a Search Engine Perspective


Provided by: walters70



Title: What's New on the Web? The Evolution of the Web from a Search Engine Perspective

What's New on the Web? The Evolution of the Web
from a Search Engine Perspective
  • Authors: Alexandros Ntoulas,
  • Junghoo Cho,
  • Christopher Olston
  • Presented by Walter J. Scheirer

The Problem
  • Not only are new websites constantly appearing,
    but content within existing pages is changing
  • Link-structure
  • Textual content
  • Link turnover
  • When pages die, their associated links do too
  • If we can cope with web evolution, we can provide
    the most up-to-date results for our users

Previous Work
  • A number of existing studies have already
    investigated the evolution of the web
  • "A large-scale study of the evolution of web
    pages," D. Fetterly, M. Manasse, M. Najork, and
    J. L. Wiener
  • "How Dynamic is the web?" B.E. Brewington and G.
  • "Sizing the Internet," B.H. Murray and A. Moore
  • "The evolution of the web and implications for an
    incremental crawler," J. Cho and H. Garcia-Molina
  • "Rate of change and other metrics: a live study
    of the world wide web," F. Douglis, A. Feldmann,
    and B. Krishnamurthy

Previous Work
  • What did these authors have to say?
  • Concern about the rate of page creation
  • Web crawling techniques
  • Rate of change
  • LAST-MODIFIED header, as well as textual analysis
  • Establishing the actual size of the Internet
  • In mid-2000, analysis determined that there were
    in excess of two billion unique, publicly
    accessible pages on the Web, with an average of
    between 10 and 15 KB per page, according to Murray
    and Moore

How does this paper's approach differ?
  • Link-structure evolution
  • Link analysis is important (think PageRank), thus
    the evolution of the link structure is an
    important thing that a search engine should know
  • New pages on the web
  • How many new pages are created over time, new
    content being introduced, and the
    characteristics of the newly-created pages are
    all studied

How does this paper's approach differ?
  • Search-centric change metric
  • Both the TF-IDF distance metric and the number of
    new words introduced in each update are utilized

  • 1. What's new on the Web?
  • New pages are created at a rate of about 8% per
    week
  • About 320 million new pages every week, if we
    assume that the web has 4 billion pages
  • About 20% of the pages available today will still
    be accessible after a year
  • New pages seem to borrow a significant portion
    of their content from existing pages
  • Only 62% of the newly created pages actually have
    new content
  • Link structure of the web is significantly more
    dynamic than the content of the Web
  • Every week, about 25% new links appear. After a
    year, about 80% of links are replaced by new ones
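
The weekly page-creation figure follows directly from the two numbers on the slide. A minimal arithmetic sketch, assuming the slide's own 4-billion-page web size and an 8% weekly birth rate (the constant and function names are illustrative only):

```python
# Figures quoted on the slide; the 4-billion total is the slide's stated assumption.
WEB_SIZE_PAGES = 4_000_000_000
WEEKLY_BIRTH_RATE = 0.08  # about 8% new pages per week


def new_pages_per_week(web_size: int, birth_rate: float) -> int:
    """Number of newly created pages per week for a given web size and birth rate."""
    return round(web_size * birth_rate)


weekly_new = new_pages_per_week(WEB_SIZE_PAGES, WEEKLY_BIRTH_RATE)
print(f"{weekly_new:,} new pages per week")
```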

  • 2. How much change?
  • Results indicate that once a page is created, the
    page is likely to go through either minor
    changes, or none at all
  • Of all the pages that are still available after
    one year, about half have not changed at all
    during that time span

  • 3. Can we predict future changes?
  • Frequency of change: results indicate this is not
    a good predictor of the degree of change
  • There is no observed correlation between
    historical change rates and future change rates
  • Degree of change: past degree of change exhibits
    a strong correlation with the future degree of
    change
  • Example: if a page has changed by 30% this week,
    then it is likely to change by about 30% next week
  • This may not be true of some sites

Experimental Setup
  • How exactly did the authors come to their
    conclusions?
  • 154 popular web sites (e.g. acm.org, hp.com,
    oreilly.com, etc.) were downloaded every week for
    an entire year
  • Sites were selected based on their ranking in
    Google's Directory

Experimental Setup
  • Download of pages
  • Maximum limit of 200,000 pages per site
  • Only four sites contained more than this limit
  • Total number of pages downloaded each week
    averaged 4.4 million
  • 65 GB compressed
  • 3.3 TB of web history was saved, as well as an
    additional 4 TB of derived data
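
As a rough consistency check, the weekly snapshot size and the total archive size line up. The sketch below assumes 52 weekly crawls for "an entire year" and 1 TB = 1024 GB; both are assumptions, not figures from the slides.

```python
WEEKLY_SNAPSHOT_GB = 65   # compressed size of one weekly crawl, from the slide
WEEKS_CRAWLED = 52        # assumption: one crawl per week for a full year
GB_PER_TB = 1024          # assumption: binary terabytes

total_gb = WEEKLY_SNAPSHOT_GB * WEEKS_CRAWLED
total_tb = total_gb / GB_PER_TB
print(f"{total_gb} GB is roughly {total_tb:.1f} TB of web history")
```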

What's New on the Web?
  • Weekly birth rate of pages
  • Average is about 8%
  • Once per month, the number of new pages being
    introduced is significantly higher, indicating
    that sites use the calendar month for updates
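
One way to measure a weekly birth rate is to compare the URL sets of two consecutive crawls. This is a minimal sketch under an assumed definition (new URLs divided by the current week's page count); the slides do not spell out the exact denominator, and the function name and example URLs are hypothetical.

```python
from typing import Set


def weekly_birth_rate(previous_urls: Set[str], current_urls: Set[str]) -> float:
    """Fraction of this week's pages whose URL was absent from last week's crawl."""
    if not current_urls:
        return 0.0
    new_urls = current_urls - previous_urls
    return len(new_urls) / len(current_urls)


# Toy snapshots: 25 pages each week, 2 of which are new in the second week.
last_week = {f"http://example.com/page{i}" for i in range(25)}
this_week = {f"http://example.com/page{i}" for i in range(2, 27)}

rate = weekly_birth_rate(last_week, this_week)
print(f"weekly birth rate: {rate:.0%}")
```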

Birth, Death, and Replacement
  • Observations
  • The total number of pages available from the 154
    sites remained more or less the same throughout
    the duration of the study
  • However, not all are the same pages
  • After one month of crawling, only 75% of the
    first-week pages were still available; after six
    months of crawling, only 52% were available
  • Data did not fit any popular trend
  • Linear, exponential, and inverse-polynomial
    functions were all used with little success
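
A quick sanity check shows why the simple models struggle. The sketch below takes the two data points from the slide (75% of first-week pages left after one month, 52% after six) and extrapolates the one-month figure with a constant-rate exponential model and a constant-loss linear model. Both models and their function names are illustrative assumptions, not the fitting procedure the authors used.

```python
SURVIVAL_AFTER_1_MONTH = 0.75   # from the slide
OBSERVED_AFTER_6_MONTHS = 0.52  # from the slide


def exponential_survival(monthly_rate: float, months: int) -> float:
    """Surviving fraction if the same share of pages survives every month."""
    return monthly_rate ** months


def linear_survival(monthly_loss: float, months: int) -> float:
    """Surviving fraction if the same absolute share of pages is lost every month."""
    return max(0.0, 1.0 - monthly_loss * months)


exp_prediction = exponential_survival(SURVIVAL_AFTER_1_MONTH, 6)
lin_prediction = linear_survival(1.0 - SURVIVAL_AFTER_1_MONTH, 6)

print(f"exponential: {exp_prediction:.1%}, linear: {lin_prediction:.1%}, "
      f"observed: {OBSERVED_AFTER_6_MONTHS:.0%}")
```

Both extrapolations fall far below the observed six-month figure, which suggests that pages surviving the first month disappear much more slowly than the first-month loss would imply.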

Degree of Change
  • So far, we've talked about presence of change,
    but what about degree of change?
  • Two metrics are presented
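  • TF.IDF Cosine Distance:
    $D_{\cos}(p_1, p_2) = 1 - \frac{v_1 \cdot v_2}{\|v_1\|_2 \, \|v_2\|_2}$
  • $v_1 \cdot v_2$ is the inner product of $v_1$ and
    $v_2$, and $\|v_i\|_2$ is the second norm, or
    length, of vector $v_i$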
  • Word Distance
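
A minimal sketch of a TF.IDF cosine distance between two versions of a page, assuming the common convention of one minus the cosine similarity of the weighted term vectors. The weighting scheme (raw term count times log inverse document frequency), the document-frequency table, and all function names are assumptions for illustration; the authors' exact weighting may differ.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf_vector(words: List[str], doc_freq: Dict[str, int], num_docs: int) -> Dict[str, float]:
    """Weight each word by term count * log(N / document frequency)."""
    counts = Counter(words)
    # Words missing from doc_freq are treated as appearing in a single document.
    return {w: c * math.log(num_docs / doc_freq.get(w, 1)) for w, c in counts.items()}


def cosine_distance(v1: Dict[str, float], v2: Dict[str, float]) -> float:
    """1 - (v1 . v2) / (||v1|| * ||v2||); 0 means identical direction, 1 means no overlap."""
    dot = sum(weight * v2.get(word, 0.0) for word, weight in v1.items())
    norm1 = math.sqrt(sum(x * x for x in v1.values()))
    norm2 = math.sqrt(sum(x * x for x in v2.values()))
    if norm1 == 0.0 or norm2 == 0.0:
        # Two all-zero vectors are treated as identical; otherwise as fully different.
        return 0.0 if norm1 == norm2 else 1.0
    # Clamp to guard against tiny floating-point overshoot.
    return min(1.0, max(0.0, 1.0 - dot / (norm1 * norm2)))


# Hypothetical corpus statistics: 100 documents and made-up document frequencies.
doc_freq = {"search": 40, "engines": 30, "crawl": 5, "index": 8, "the": 100, "web": 60}
old_version = "search engines crawl the web".split()
new_version = "search engines index the web".split()

distance = cosine_distance(
    tfidf_vector(old_version, doc_freq, 100),
    tfidf_vector(new_version, doc_freq, 100),
)
print(f"cosine distance between versions: {distance:.3f}")
```

Swapping one rare word for another moves the distance noticeably, while a change to a very common word such as "the" would contribute nothing, which is the property that makes the metric search-centric.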

Degree of Change
  • Experimental results

Predictability of Degree of Change
  • Most pages captured in the study change in a
    highly predictable manner, in terms of cosine
    distance
  • For 90% of the pages, their future degree of
    change can be predicted with 8% error
  • The degree of predictability decreases as time
    intervals increase
  • This conclusion is only valid for the sites used
    in this study (which are quite popular)
  • Predictability for an individual site may be
    dependent on that site's contents, and not
    general trends throughout the web
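
The idea that past degree of change predicts future degree of change can be illustrated with a naive "persistence" predictor: guess that next week's change equals this week's, then measure the error. This is only a sketch with made-up numbers and hypothetical names; it is not the prediction method evaluated in the paper.

```python
from typing import List


def persistence_errors(weekly_change: List[float]) -> List[float]:
    """Absolute error when each week's degree of change is predicted
    to equal the previous week's degree of change."""
    return [abs(actual - predicted)
            for predicted, actual in zip(weekly_change, weekly_change[1:])]


# Hypothetical weekly cosine distances for one page (not data from the study).
history = [0.30, 0.28, 0.33, 0.31]

errors = persistence_errors(history)
share_within_8_points = sum(e <= 0.08 for e in errors) / len(errors)
print(f"share of weeks predicted within 0.08: {share_within_8_points:.0%}")
```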

  • Existing pages are being removed and replaced by
    new ones at rapid rates
  • New pages tend to borrow their content heavily
    from existing pages
  • Pages that persist over time exhibit very little
    substantive change
  • Past frequency of change does not appear to be a
    good all-around predictor of degree of change
  • Link structure is evolving faster than page
    content
  • Most links persist for less than six months
  • All of these findings may be incorporated into
    search engine crawlers
