Improving Front Pages with Dynamic Content using Cache Policies

1
Improving Front Pages with Dynamic Content using
Cache Policies
  • Justin Brickell and Albert Chen
  • Dr. Dhillon
  • 2004/12/10

2
Introduction
  • Motivation
  • Front pages need a small set of quality links
  • Now: this is done by hand
  • Future: we want to automate it
  • Quality links
  • Accessed frequently by web users
  • Provide useful information
  • Change over time
  • Our approach: using access logs

3
Access Logs
  • Automatically generated by the web server
  • Associates content with time of access
  • Allows us to determine which pages are the most
    popular
  • Better than PageRank?
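
A minimal sketch of how popularity could be read off an access log. It assumes the server writes Common Log Format, which the slides do not state; the regular expression, the function name count_hits, and the sample lines are all illustrative.

```python
import re
from collections import Counter

# Assumed Common Log Format; the slides do not say which format was used.
LOG_LINE = re.compile(
    r'(?P<client>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3})'
)

def count_hits(lines):
    """Count successful GET requests per path."""
    hits = Counter()
    for line in lines:
        m = LOG_LINE.match(line)
        if m and m.group("method") == "GET" and m.group("status") == "200":
            hits[m.group("path")] += 1
    return hits

# Made-up sample lines, not taken from the UTCS logs.
sample = [
    '10.0.0.1 - - [02/Nov/2004:10:00:00 -0600] "GET /index.html HTTP/1.0" 200 1043',
    '10.0.0.2 - - [02/Nov/2004:10:00:05 -0600] "GET /courses/ HTTP/1.0" 200 2210',
    '10.0.0.1 - - [02/Nov/2004:10:01:00 -0600] "GET /index.html HTTP/1.0" 200 1043',
    '10.0.0.3 - - [02/Nov/2004:10:02:00 -0600] "GET /missing HTTP/1.0" 404 210',
]
hits = count_hits(sample)
print(hits.most_common(2))  # [('/index.html', 2), ('/courses/', 1)]
```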

4
The Dynamic Box
  • Small section of Front Page with dynamic content
  • Content chosen by access log analysis
  • Naïve solution: popularity contest (hit count)
  • Better solution: caching algorithms
  • Example:
  • UTCS front page with dynamic box

5
Cache Analogy
  • Users ↔ Processes
  • Web pages ↔ Memory locations
  • Dynamic box ↔ Cache
  • Dynamic box quality ↔ Hit ratio
  • Hope: Replacement policies with good hit ratios
    will produce good dynamic boxes!
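
Under this analogy, the hit ratio of a dynamic box can be read as the fraction of accesses that go to pages currently shown in the box. A minimal sketch of that reading follows; the function name hit_ratio and the sample paths are illustrative, not from the slides.

```python
def hit_ratio(accesses, box):
    """Fraction of accesses whose page is one of the links in the box."""
    box = set(box)
    if not accesses:
        return 0.0
    return sum(1 for page in accesses if page in box) / len(accesses)

# Made-up access sequence and box contents.
accesses = ["/a", "/b", "/a", "/c"]
print(hit_ratio(accesses, ["/a"]))  # 0.5
```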

6
Practical Challenge: Robots
  • Robots, Spiders, and Crawlers automatically
    process or skim web pages
  • Properties
  • Methodically access the web pages
  • Problem
  • Non-human behavior influences a cache designed for
    human consumption
  • Our solution
  • Manually determine which clients are robots
  • Another solution
  • Automated robot detection
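
The manual approach amounts to keeping a hand-maintained list of robot clients and dropping their accesses before any analysis. A minimal sketch, assuming accesses are (client, path) pairs; the client names below are placeholders, not the clients actually identified in the study.

```python
# Hypothetical hand-maintained list of clients judged to be robots.
ROBOT_CLIENTS = {"crawler.example.com", "192.0.2.7"}

def remove_robots(records, robots=ROBOT_CLIENTS):
    """Keep only the (client, path) records whose client is not a listed robot."""
    return [(client, path) for client, path in records if client not in robots]

# Made-up records for illustration.
records = [
    ("10.0.0.1", "/index.html"),
    ("crawler.example.com", "/index.html"),
    ("192.0.2.7", "/courses/"),
    ("10.0.0.2", "/courses/"),
]
human = remove_robots(records)
print(len(human))  # 2
```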

7
Practical Challenge: Polyonymy
  • Polyonyms are pages with more than one name
  • Example /users/ /
  • Properties
  • Depends on rules in the web server
  • Problem
  • Page hits are split among different names
  • Our Solution
  • Write rules by hand to map polyonyms to a single
    name
  • Another Solution
  • Use content hashes instead of URL names
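
Hand-written rules of this kind can be sketched as a list of rewrite patterns applied to each path before hits are counted. The two rules below are hypothetical examples only; the rules actually used depend on the web server's configuration and are not given in the slides.

```python
import re
from collections import Counter

# Hypothetical rewrite rules, applied in order.
RULES = [
    (re.compile(r"/index\.html?$"), "/"),  # /dir/index.html -> /dir/
    (re.compile(r"/{2,}"), "/"),           # collapse repeated slashes
]

def canonical(path):
    """Map a path to a single canonical name using the rules above."""
    for pattern, replacement in RULES:
        path = pattern.sub(replacement, path)
    return path

def merge_hits(hits):
    """Add up hit counts for all names that share a canonical name."""
    merged = Counter()
    for path, count in hits.items():
        merged[canonical(path)] += count
    return merged

# Made-up hit counts: both names refer to the same page.
merged = merge_hits({"/courses/": 3, "/courses/index.html": 2})
print(merged["/courses/"])  # 5
```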

8
Practical Challenge: Questionable Content
  • Web page uninteresting to the UTCS community, linked
    by a popular third party
  • Example
  • /tbone/warningsigns/: a Warning Signs gallery
    linked by fark.com.
  • http://www.cs.utexas.edu/tbone/warningsigns/photos/child%20vs.%20Tractor%202.html
  • Problem
  • These pages tend to appear in the dynamic box
  • Our solution
  • Nothing. Having these pages in the dynamic box does
    increase hit ratio
  • Other Solutions
  • Ignore clients who do not visit the front page
  • Filter by referrer/client IP
  • Label users (academic vs. fark reader) with
    cookies
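
The referrer filter listed among the other solutions was not implemented in this work. As a rough sketch of what it could look like, assuming the log records a referrer URL for each access; the blocked host list and function name are illustrative.

```python
from urllib.parse import urlparse

# Hypothetical list of third-party hosts whose referred traffic is ignored.
BLOCKED_REFERRER_HOSTS = {"fark.com", "www.fark.com"}

def is_blocked_referrer(referrer):
    """True if the referrer URL points at one of the blocked hosts."""
    host = urlparse(referrer).hostname or ""
    return host in BLOCKED_REFERRER_HOSTS

print(is_blocked_referrer("http://www.fark.com/"))             # True
print(is_blocked_referrer("http://www.cs.utexas.edu/"))        # False
print(is_blocked_referrer("-"))                                # False (no referrer logged)
```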

9
Batched Caching
  • Caching algorithms update the cache on every miss
  • This is too frequent for the dynamic box
  • Users would be confused or frustrated
  • Solution: introduce a time span
  • Virtual cache is updated normally
  • Real cache is copied from Virtual cache
    periodically
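
The virtual/real split can be sketched as follows, using LRU as the underlying policy purely for illustration. The class name, the use of access timestamps in seconds to drive the periodic copy, and the choice to snapshot before applying the triggering access are all assumptions, not details from the slides.

```python
from collections import OrderedDict

class BatchedLRUBox:
    """Virtual LRU cache updated on every access; real box refreshed once per period."""

    def __init__(self, size, period):
        self.size = size              # number of links in the box
        self.period = period          # time span between copies, in seconds
        self.virtual = OrderedDict()  # least recently used page first
        self.real = []                # what users actually see
        self.last_copy = None

    def access(self, page, now):
        # Copy virtual -> real once the time span has elapsed.
        if self.last_copy is None:
            self.last_copy = now
        elif now - self.last_copy >= self.period:
            self.real = list(self.virtual)
            self.last_copy = now
        # Ordinary LRU update of the virtual cache.
        if page in self.virtual:
            self.virtual.move_to_end(page)
        else:
            self.virtual[page] = True
            if len(self.virtual) > self.size:
                self.virtual.popitem(last=False)  # evict least recently used

box = BatchedLRUBox(size=2, period=2 * 3600)
for page, t in [("/a", 0), ("/b", 100), ("/c", 200)]:
    box.access(page, t)
print(box.real)           # [] (no copy yet, although the virtual cache has changed)
box.access("/a", 7200)
print(box.real)           # ['/b', '/c']
print(list(box.virtual))  # ['/c', '/a']
```

Between copies the visible box stays fixed, so users are not shown a different set of links after every miss.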

10
Cache Replacement Policies
  • LRU: Least Recently Used
  • LFU: Least Frequently Used
  • MPP: Most Popular Policy
  • Cache contains the 10 most hit links from the previous 2
    hours
  • ARC: Adaptive Replacement Cache
  • Maintains two caches to balance between
    frequently used and recently used pages
  • GDF: Greedy Dual Frequency
  • Like LFU, but with some recency information
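
Of these, MPP is the simplest to state in code. The sketch below is one reading of the description above (the 10 most hit links from the previous 2 hours), assuming accesses are (timestamp in seconds, page) pairs; the function name and sample data are illustrative.

```python
from collections import Counter

def most_popular(accesses, now, window=2 * 3600, k=10):
    """Return up to k pages with the most hits in the window ending at `now`."""
    counts = Counter(
        page for timestamp, page in accesses if now - window <= timestamp < now
    )
    return [page for page, _ in counts.most_common(k)]

# Made-up accesses; "/old" falls outside the 2 hour window ending at t=7300.
accesses = [(0, "/old"), (4000, "/a"), (5000, "/b"), (6000, "/a")]
print(most_popular(accesses, now=7300))  # ['/a', '/b']
```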

11
Experiment
  • Corpus
  • Web logs of the UTCS web
  • From Nov 2nd to Nov 30th
  • 2.3 million accesses (73,000 accesses per day)
  • Removed the accesses of robots (0.8 million robot
    accesses; 30,000 accesses per day)
  • 120,000 distinct URLs
  • 40,000 URLs from crawl; 80,000 URLs in logs
  • Removed non-content URLs (jpg, gif, css, etc.)

12
Results
  • 13% of accesses are for the front page
  • Only 3% of accesses are for the 22 pages
    statically linked from the front page
  • Hit ratio presented in graphs includes this 16%
  • Adding just 10 dynamic links can increase hit
    ratio from 3% to 28%

13
Experimental Results (2 hour update period)

14
Effect of Time Span on ARC

15
PageRank vs. LogRank
  • PageRank claims to predict time spent at pages
  • Access log gives us this information directly:
    LogRank
  • We wanted to compare these. It didn't work
  • Set of crawled pages and set of logged pages are
    too different
  • Still working on an improved crawl

16
DEMO
  • Dynamic boxes of the UTCS web from Nov 2nd to Nov
    30th.
  • ARC dynamic boxes
  • MPP dynamic boxes
  • GDF dynamic boxes
  • LRU dynamic boxes
  • LFU dynamic boxes