Monitoring the dynamic Web to respond to Continuous Queries

1
Monitoring the dynamic Web to respond to
Continuous Queries
  • Sandeep Pandey, Krithi Ramamritham, Soumen
    Chakrabarti
  • IIT Bombay
  • www.cse.iitb.ac.in/laiir/

2
Motivation
  • Web pages change rapidly
  • 40% of commercial pages
  • 23% of all pages
  • change per day (Sethuraman et al.)
  • Current search engine users
  • Need to repeat queries (how often?) and
  • Diff results with recent versions
  • Or poll frequently updated collections (e.g.,
    Google News)

3
Continuous Queries (CQ)
  • Users register long-lived queries of interest
  • Pages of interest may be added, modified, and
    deleted
  • System continually updates responses
  • Example applications
  • Commuter updates: traffic and weather conditions
  • Alerts on cricket scores, stock portfolios

4
Discrete vs. continuous queries
  • Query lives for an instant, one-shot answer
  • Optimize corpus freshness at all times
  • Objective penalizes delay from update to refresh
  • Usually handled by bulk crawls with diverse
    periods
  • Queries have positive lifetime, many updates over
    time
  • Updates must track changes closely
  • Objective penalizes number or importance of
    missed updates
  • Dynamic monitoring with more restrictive network
    resources

5
Talk outline
  • Introduction and motivation
  • Previous approaches
  • Our contributions
  • Continuous Adaptive Monitoring (CAM)
  • How to allocate limited polling resources among
    pages
  • How to schedule poll instants
  • Experiments
  • Conclusion

6
Related work
  • CONQUER and WebCQ (Liu, Pu and Tang)
  • Query language and architecture for CQ
  • Do not address monitoring for freshness
    optimization
  • NIAGARA (DeWitt and Naughton)
  • Query evaluation and optimization techniques
  • Database query optimization setting
  • ChangeDetector (Boyapati et al.)
  • Fixed-priority polling for given set of pages
  • Freshness for discrete queries
  • Poisson updates (Cho and Garcia-Molina)
  • Quasi-deterministic and other distributions
    (Sethuraman, Wolf, Squillante, Yu)

7
Our contributions
  • New statistical recency objective for CQs
  • New monitoring framework to fit statistical
    models of page change behavior
  • Recency optimization problem constrained by
    network resources
  • Two-phase solution to optimization tailored to CQ
    search systems
  • Resource allocation (knapsack)
  • Poll scheduling (flow-shop)

8
Continuous Adaptive Monitoring
  • Planning horizon or epoch
  • Time proceeds in discrete steps j over epoch
  • At each time step j, each page i has probability
    ρi,j of an update
  • Can capture predictable bursts, periodicity
  • Σj ρi,j = λi, the expected updates to page i
    (change rate)
  • Decision variables yij
  • Is page i polled at time step j?
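
As a rough illustration of the model on this slide, the sketch below stores per-step update probabilities for two pages over one epoch and sums them to get each page's expected number of updates. The probability values, the epoch length, and the names `update_prob`, `expected_updates`, and `poll` are assumptions made for this example and do not come from the talk.

```python
# Illustrative sketch only: all numbers below are made-up values.
T = 6  # hypothetical epoch length, in time steps

# update_prob[i][j]: probability that page i is updated at time step j
update_prob = [
    [0.05, 0.05, 0.60, 0.05, 0.05, 0.60],  # bursty, periodic page
    [0.10, 0.10, 0.10, 0.10, 0.10, 0.10],  # steadily changing page
]

def expected_updates(probs):
    """Expected number of updates in the epoch: sum of per-step probabilities."""
    return sum(probs)

# Decision variables: poll[i][j] is 1 if page i is polled at time step j
poll = [[0] * T for _ in update_prob]
poll[0][2] = poll[0][5] = 1  # poll the bursty page right at its peaks

change_rate = [expected_updates(p) for p in update_prob]
```

Keeping one probability per time step, rather than a single rate per page, is what lets the model express predictable bursts and periodicity.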

9
Profit, relevance and importance
  • Each registered query q has a profit μq
  • Relevance riq of page i w.r.t. query q
  • We use cosine in TFIDF space as in IR
  • Other measures (e.g. PageRank) may be integrated
  • Page i has importance Wi, a function of
  • Currently resident queries and their profits
  • Relevance of page i to each resident query
  • Importance
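
The slide's importance formula did not survive in this transcript. The sketch below therefore assumes one natural form, a profit-weighted sum of relevances over resident queries, with relevance computed as cosine similarity between term-weight vectors. The function names, the dictionary representation, and the example weights are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def importance(page_vec, queries):
    """Assumed form: sum over resident queries of profit * relevance."""
    return sum(profit * cosine(page_vec, q_vec) for profit, q_vec in queries)

# Made-up TF-IDF-style weights for one page and two resident queries
page = {"cricket": 0.8, "score": 0.6}
queries = [(2.0, {"cricket": 1.0}), (1.0, {"weather": 1.0})]
w = importance(page, queries)
```

Under this assumed form, a page that is irrelevant to every resident query gets zero importance and would attract no polling resources.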

10
Returned Information Ratio
  • Update information reported for page i is denoted Ri
  • Goal is to maximize importance-weighted updates
    reported, Σi Wi Ri, subject to polling resource
    constraint
  • Returned info ratio (RIR) is

    RIR = (importance-weighted updates captured by system)
          / (total importance-weighted expected updates)

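
To make the ratio concrete, the following sketch computes RIR from per-page importances, captured updates, and expected updates. How the captured updates are measured is not specified in this transcript, so they are simply taken as inputs here. The function name and all numbers are illustrative assumptions.

```python
def returned_info_ratio(importance, captured, expected):
    """Importance-weighted captured updates over importance-weighted expected updates."""
    numerator = sum(w * c for w, c in zip(importance, captured))
    denominator = sum(w * e for w, e in zip(importance, expected))
    return numerator / denominator if denominator else 0.0

# Two hypothetical pages: importance, updates captured, updates expected
rir = returned_info_ratio([1.6, 0.5], [1.2, 0.1], [1.4, 0.6])
```

Because each page is weighted by its importance, missing updates on a highly relevant page lowers RIR more than missing the same number on a marginal page.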
11
CAM system overview
  • Time proceeds in epochs
  • At the end of every epoch we re-evaluate
  • Relevance
  • Update probabilities
  • For the next epoch
  • We select instants at which to poll each page
    (resource allocation)
  • Schedule these instants subject to resource
    constraint

[Figure: CAM pipeline: determining relevant pages, parameter tracking, monitoring, resource allocation, scheduling]
12
CAM overview: Tracking phase
  • Relevance riq changes with time, polled
    periodically
  • Modeling relevance change nontrivial, e.g.,
    snippet-level changes
  • Collect instants when page change was detected
    during current epoch
  • Revise estimates of the update probabilities ρi,j for
    use in the next epoch's poll optimization

13
Resource allocation
  • Existing policies
  • Uniform: resources (polls) are distributed uniformly
    among all pages irrespective of their change
    frequency
  • Proportional: the number of polls allocated to a page
    is proportional to the frequency with which it
    changes
  • For discrete queries, uniform is better than
    proportional for any inter-update distribution
  • CAM: solve a knapsack problem
  • Better than uniform and proportional
  • Proportional better than uniform
  • Evidence that CQ objective ≠ discrete objective
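
The three policies on this slide can be contrasted with a small sketch. The uniform and proportional functions follow the slide's definitions directly. The third function is only a simplified stand-in for CAM's knapsack step: it assumes every poll costs one unit and that a poll of page i at step j is worth importance times update probability, so the problem reduces to picking the highest-valued (page, step) pairs. All names and numbers are hypothetical.

```python
def uniform_allocation(change_rates, budget):
    """Same number of polls for every page, regardless of change rate."""
    n = len(change_rates)
    return [budget / n] * n

def proportional_allocation(change_rates, budget):
    """Polls proportional to each page's change rate."""
    total = sum(change_rates)
    return [budget * r / total for r in change_rates]

def greedy_poll_selection(importance, update_prob, budget):
    """Simplified stand-in for the knapsack step (unit-cost polls assumed)."""
    candidates = [
        (importance[i] * p, i, j)
        for i, probs in enumerate(update_prob)
        for j, p in enumerate(probs)
    ]
    candidates.sort(reverse=True)  # highest-valued polls first
    return sorted((i, j) for _, i, j in candidates[:budget])

rates = [8.0, 1.0, 1.0]
uniform = uniform_allocation(rates, 30)
proportional = proportional_allocation(rates, 30)
chosen = greedy_poll_selection([1.0, 1.0], [[0.1, 0.9], [0.5, 0.2]], 2)
```

The first two functions return fractional poll counts; a real allocator would need to round them while staying within the budget.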

14
Scheduling
  • Suppose our crawler can fetch M pages
    concurrently, and
  • An epoch is T time steps long
  • Then we can fetch a total of C = M·T pages during
    an epoch
  • Ensured by resource allocation phase
  • But at each instant we cannot schedule more than
    M fetches
  • Want small planned-to-actual poll delays
  • May fail to schedule all poll jobs in an epoch

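[Figure: CAM pipeline (determining relevant pages, parameter tracking, monitoring, resource allocation, scheduling), with "tentative yij's" labeled between resource allocation and scheduling]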
15
A flow-shop problem
  • M machines available at any time
  • Each yij which is equal to 1 is a job
  • Job k is released at time step rk (= j)
  • Processing time = crawl time tk
  • Completion time of job k is Ck
  • Want to minimize total flow
  • NP-hard problem
  • We use earliest deadline heuristic
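
A minimal sketch of an earliest-deadline heuristic on M machines follows. The slides do not say how deadlines are assigned or how long a crawl takes, so this example assumes each job carries a given deadline and takes exactly one time step. The function name, the job tuples, and the example values are all hypothetical.

```python
import heapq

def edf_schedule(jobs, machines, horizon):
    """Schedule unit-time jobs, earliest deadline first.

    jobs: list of (release, deadline, name) tuples.
    Returns (schedule, unscheduled), where schedule maps name -> time step.
    """
    pending = sorted(jobs)  # ordered by release time
    ready = []              # min-heap of (deadline, release, name)
    schedule = {}
    idx = 0
    for t in range(horizon):
        # Move every job released by time t into the ready heap
        while idx < len(pending) and pending[idx][0] <= t:
            release, deadline, name = pending[idx]
            heapq.heappush(ready, (deadline, release, name))
            idx += 1
        # Run at most `machines` jobs in this time step
        for _ in range(machines):
            if not ready:
                break
            _, _, name = heapq.heappop(ready)
            schedule[name] = t
    unscheduled = [name for _, _, name in ready] + [p[2] for p in pending[idx:]]
    return schedule, unscheduled

jobs = [(0, 2, "a"), (0, 1, "b"), (0, 3, "c"), (1, 1, "d")]
schedule, unscheduled = edf_schedule(jobs, machines=1, horizon=3)
```

In this example job "c" is left unscheduled, matching the earlier remark that some poll jobs may not fit within an epoch.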

[Figure: schedule chart with axes Time and Job]
16
Experiments
  • Synthetic data
  • Change frequency distribution: a few pages change
    very often (Zipfian)
  • Update probability distribution: a few ρi,j values
    are large, most are small (Zipfian again)
  • Page importance distribution: also Zipfian
    (Wolman, 1999)
  • Real data
  • Eight cricket score sites
  • High update rate

17
CAM > Proportional > Uniform
  • Uniform update and importance distributions
  • Plot RIR against ratio of resources to expected
    changes
  • RIR for CAM is more than 3 times better
  • Proportional is better than uniform in the CQ
    setting
  • Intuition from minimum total stale duration
    does not apply to CQ

18
Resource allocation
  • Sort pages by increasing change rate
  • Place in ten equally populated bins (bin 10 = fastest)
  • Uniform spends same resource for each bin
  • Proportional wastes fewer resources on
    slow-changing bins, but is not aggressive enough
  • CAM invests more aggressively in fast-changing
    bins, achieving the greatest RIR
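
The binning step described above can be sketched as follows: pages are sorted by increasing change rate and split into equally populated bins, so the last bin holds the fastest-changing pages. The function name and the example rates are assumptions for illustration.

```python
def bin_by_change_rate(change_rates, n_bins=10):
    """Return n_bins lists of page indices, slowest-changing bin first."""
    order = sorted(range(len(change_rates)), key=lambda i: change_rates[i])
    n = len(order)
    bins = [[] for _ in range(n_bins)]
    for rank, page in enumerate(order):
        # Map the page's rank to a bin so that bins are equally populated
        bins[rank * n_bins // n].append(page)
    return bins

# Hypothetical rates: page 0 changes fastest, page 19 slowest
rates = [20 - i for i in range(20)]
bins = bin_by_change_rate(rates)
```

Summing each policy's polls within a bin then shows how much of the budget it spends on slow-changing versus fast-changing pages.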

19
Skew-handling and adaptation
  • Fixed monitoring/change ratio
  • Vary skew in update probability distribution
  • CAM's gains increase with skew
  • CAM improves over initial epochs
  • Change distribution estimates stabilize within a
    few epochs

20
Experiments on real pages
  • Eight sites with dynamic cricket match
    information
  • In fact, Zipfian updates
  • Adversarial setup: monitor/change < 1
  • CAM close to best possible
  • For M/C = 2, CAM updates on 80% of the information
    changed

21
Conclusion
  • Continual queries are inherently different from
    discrete queries
  • Approach used in CAM
  • Identify relevant pages
  • Track the pages as they change
  • Characterize page change behavior
  • Decide when to monitor the pages in future
  • CAM approach performs better than other naïve
    approaches

22
References
  • J. Cho, H. Garcia-Molina. Synchronizing a
    database to improve freshness. ACM SIGMOD, 2000.
  • J. Cho, H. Garcia-Molina. Estimating frequency of
    change. Technical Report, 2000.
  • J. Sethuraman, J. L. Wolf, M. S. Squillante, P. S.
    Yu. Optimal crawling strategies for Web
    search engines. World Wide Web, 2002.

23
References
  • S. Pandey, K. Ramamritham, S. Chakrabarti.
    Monitoring the dynamic Web to respond to
    Continuous Queries. World Wide Web, 2003.
  • S. Pandey, K. Ramamritham, S. Chakrabarti, S.
    Garg, A. Vyas. Web-CAM: Monitoring the dynamic
    Web to respond to Continual Queries. Submitted to
    29th VLDB conference, 2003.

24
Future Research Possibilities
  • Keeping the inverted index current
  • Account for new entries to the Web
  • Identifying the changes relevant to the query
  • Measuring query-specific change behaviour of a
    page
  • Reusing the page change statistics for other
    related queries

25
Skewed update probability distribution
  • CAM still performs much better than others
  • In fact CAM exploits the skewed nature of the
    distribution and performs even better than in
    the uniform setting

26
Adaptive nature of CAM
  • No difference in allocation of resources in the
    Uniform and Proportional strategies
  • CAM considers the probability distribution while
    allocating resources
  • Lower-frequency bins also get resources now, due
    to some update instants of high probability