Title: Monitoring the dynamic Web to respond to Continuous Queries
1 Monitoring the dynamic Web to respond to Continuous Queries
- Sandeep Pandey, Krithi Ramamritham, Soumen Chakrabarti
- IIT Bombay
- www.cse.iitb.ac.in/laiir/
2 Motivation
- Web pages change rapidly
- 40% of commercial pages and 23% of all pages change per day (Sethuraman et al.)
- Current search engine users
- Need to repeat queries (how often?) and
- Diff results with recent versions
- Or poll frequently updated collections (e.g., Google News)
3 Continuous Queries (CQ)
- Users register long-lived queries of interest
- Pages of interest may be added, modified, and deleted
- System continually updates responses
- Example applications
- Commuter updates: traffic and weather conditions
- Alerts on cricket scores, stock portfolios
4 Discrete vs. continuous queries
- Discrete queries
- Query lives for an instant; one-shot answer
- Optimize corpus freshness at all times
- Objective penalizes delay from update to refresh
- Usually handled by bulk crawls with diverse periods
- Continuous queries
- Queries have positive lifetime, with many updates over time
- Updates must track changes closely
- Objective penalizes number or importance of missed updates
- Dynamic monitoring with more restrictive network resources
5 Talk outline
- Introduction and motivation
- Previous approaches
- Our contributions
- Continuous Adaptive Monitoring (CAM)
- How to allocate limited polling resources among pages
- How to schedule poll instants
- Experiments
- Conclusion
6 Related work
- CONQUER and WebCQ (Liu, Pu and Tang)
- Query language and architecture for CQ
- Do not address monitoring for freshness optimization
- NIAGARA (DeWitt and Naughton)
- Query evaluation and optimization techniques
- Database query optimization setting
- ChangeDetector (Boyapati et al.)
- Fixed-priority polling for given set of pages
- Freshness for discrete queries
- Poisson updates (Cho and Garcia-Molina)
- Quasi-deterministic and other distributions (Sethuraman, Wolf, Squillante, Yu)
7 Our contributions
- New statistical recency objective for CQs
- New monitoring framework to fit statistical models of page change behavior
- Recency optimization problem constrained by network resources
- Two-phase solution to optimization tailored to CQ search systems
- Resource allocation (knapsack)
- Poll scheduling (flow-shop)
8 Continuous Adaptive Monitoring
- Planning horizon or epoch
- Time proceeds in discrete steps j over epoch
- At each time step j, each page i has probability $\rho_{i,j}$ of an update
- Can capture predictable bursts, periodicity
- $\sum_j \rho_{i,j} = \lambda_i$, the expected updates to page i (change rate)
- Decision variables $y_{i,j}$
- Is page i polled at time step j?
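The model on this slide can be made concrete with a small sketch. Everything below is illustrative: the two pages, the probability values, and the names `rho`, `y`, and `change_rate` are assumptions chosen for the example, not values from the talk.

```python
# Toy epoch with T = 4 time steps and 2 pages (all numbers are made up).
# rho[i][j]: probability that page i is updated at time step j.
rho = [
    [0.9, 0.1, 0.8, 0.2],      # page 0: bursty, updates likely at steps 0 and 2
    [0.05, 0.05, 0.05, 0.05],  # page 1: rarely changes
]

def change_rate(rho_i):
    """Expected number of updates to one page over the epoch (sum over steps)."""
    return sum(rho_i)

# y[i][j]: decision variable, 1 if page i is polled at time step j, else 0.
y = [
    [1, 0, 1, 0],  # poll page 0 at its two likely update instants
    [0, 0, 0, 0],  # do not poll page 1 in this epoch
]

rates = [change_rate(row) for row in rho]
polls_used = sum(sum(row) for row in y)
```

Per-step probabilities let the model express a predictable burst (page 0 at steps 0 and 2) that a single change rate per page would hide.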
9 Profit, relevance and importance
- Each registered query q has a profit $\mu_q$
- Relevance $r_{iq}$ of page i w.r.t. query q
- We use cosine in TF-IDF space as in IR
- Other measures (e.g. PageRank) may be integrated
- Page i has importance $W_i$, a function of
- Currently resident queries and their profits
- Relevance of page i to each resident query
- Importance
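One plausible reading of the importance definition is a profit-weighted sum of relevances over the resident queries. The sketch below assumes that form; the function names and the toy TF-IDF vectors are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two sparse term-weight vectors given as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def importance(page_vec, queries):
    """Assumed form: sum over resident queries of profit * relevance.

    queries is a list of (profit, query_vector) pairs.
    """
    return sum(profit * cosine(page_vec, q_vec) for profit, q_vec in queries)

# Made-up TF-IDF weights, for illustration only.
page = {"cricket": 1.0, "score": 1.0}
resident_queries = [
    (2.0, {"cricket": 1.0}),  # profit 2.0, relevant to the page
    (1.0, {"stock": 1.0}),    # profit 1.0, irrelevant to the page
]
W_page = importance(page, resident_queries)
```

Under this assumed form, a page that is irrelevant to every resident query gets importance zero and attracts no polling resources.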
10 Returned Information Ratio
- Update information reported for page i is $R_i$
- Goal is to maximize importance-weighted updates reported, $\sum_i W_i R_i$, subject to polling resource constraint
- Returned info ratio (RIR) is (importance-weighted updates captured by system) / (total importance-weighted expected updates)
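The ratio can be computed directly once a model for captured updates is fixed. This sketch makes a simplifying assumption that the slide does not state: a poll of page i at step j captures exactly the update that may occur at step j, so the expected captured updates for a page are the sum of its update probabilities at the polled steps. The function name and the numbers are hypothetical.

```python
def returned_information_ratio(W, rho, y):
    """RIR under the simplifying assumption described above.

    W[i]      importance of page i
    rho[i][j] probability that page i is updated at step j
    y[i][j]   1 if page i is polled at step j, else 0
    """
    captured = sum(
        W[i] * sum(p * polled for p, polled in zip(rho[i], y[i]))
        for i in range(len(W))
    )
    expected = sum(W[i] * sum(rho[i]) for i in range(len(W)))
    return captured / expected if expected > 0 else 0.0

# Made-up example with two pages and two time steps.
W = [2.0, 1.0]
rho = [[0.5, 0.5], [0.2, 0.8]]
y = [[1, 0], [0, 1]]
rir = returned_information_ratio(W, rho, y)  # captured 1.8 of 3.0 expected
```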
11 CAM system overview
- Time proceeds in epochs
- At the end of every epoch we re-evaluate
- Relevance
- Update probabilities
- For the next epoch
- We select instants at which to poll each page (resource allocation)
- Schedule these instants subject to resource constraint
[Figure: CAM stages: determining relevant pages, parameter tracking, monitoring, resource allocation, scheduling]
12 CAM overview: Tracking phase
- Relevance $r_{iq}$ changes with time, polled periodically
- Modeling relevance change nontrivial, e.g., snippet-level changes
- Collect instants when page change was detected during current epoch
- Revise estimates of $\rho_{i,j}$ for use in the next epoch's poll optimization
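The slide does not say how the estimates are revised. As one simple possibility, the sketch below blends the old per-step estimate with the changes observed in the epoch just finished, using exponential smoothing. The estimator, the function name, and the `alpha` parameter are all assumptions for illustration.

```python
def revise_estimates(old_rho, change_steps, alpha=0.3):
    """Blend old per-step update probabilities with this epoch's observations.

    old_rho       previous estimates, one per time step of the epoch
    change_steps  set of steps at which a change was detected this epoch
    alpha         weight given to the newest observation (hypothetical knob)
    """
    observed = [1.0 if j in change_steps else 0.0 for j in range(len(old_rho))]
    return [(1 - alpha) * p + alpha * o for p, o in zip(old_rho, observed)]

# A page with a flat prior that was seen to change only at step 1.
new_rho = revise_estimates([0.5, 0.5, 0.5], {1}, alpha=0.5)
```

A larger `alpha` adapts faster to a change in behavior but makes the estimates noisier from one epoch to the next.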
13 Resource allocation
- Existing policies
- Uniform: resources (polls) are distributed uniformly among all pages, irrespective of their change frequency
- Proportional: the number of polls allocated to a page is proportional to the frequency with which it changes
- For discrete queries, uniform is better than proportional for any inter-update distribution
- CAM: solve a knapsack problem
- Better than uniform and proportional
- Proportional better than uniform
- Evidence that CQ objective ≠ discrete objective
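The three policies can be contrasted in a few lines. This is a sketch under stated assumptions, not the authors' implementation: every poll costs one unit, and the value of polling page i at step j is its importance times its update probability at that step. With unit costs the knapsack reduces to taking the highest-valued (page, step) pairs. All function names are hypothetical.

```python
def uniform_allocation(n_pages, budget):
    """Same number of polls per page (earlier pages absorb any remainder)."""
    base, extra = divmod(budget, n_pages)
    return [base + (1 if i < extra else 0) for i in range(n_pages)]

def proportional_allocation(rates, budget):
    """Polls per page proportional to change rate, largest-remainder rounding."""
    total = sum(rates)
    shares = [budget * r / total for r in rates]
    counts = [int(s) for s in shares]
    by_remainder = sorted(range(len(rates)),
                          key=lambda i: shares[i] - counts[i], reverse=True)
    for i in by_remainder[: budget - sum(counts)]:
        counts[i] += 1
    return counts

def knapsack_allocation(W, rho, budget):
    """Unit-cost knapsack: pick the `budget` best (page, step) pairs."""
    candidates = [(W[i] * p, i, j)
                  for i, row in enumerate(rho) for j, p in enumerate(row)]
    candidates.sort(key=lambda c: (-c[0], c[1], c[2]))
    y = [[0] * len(row) for row in rho]
    for _, i, j in candidates[:budget]:
        y[i][j] = 1
    return y

# Made-up example: one bursty page, one quiet page, budget of 2 polls.
rho = [[0.9, 0.1, 0.8, 0.2], [0.05, 0.05, 0.05, 0.05]]
W = [1.0, 1.0]
uniform = uniform_allocation(2, 2)
proportional = proportional_allocation([2.0, 0.2], 2)
y = knapsack_allocation(W, rho, 2)
```

Uniform and proportional decide only how many polls each page gets, whereas the knapsack variant also picks the instants, which is where per-step probabilities pay off.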
14 Scheduling
- Suppose our crawler can fetch M pages concurrently, and
- An epoch is T time steps long
- Then we can fetch a total of C = MT pages during an epoch
- Ensured by resource allocation phase
- But at each instant we cannot schedule more than M fetches
- Want small planned-to-actual poll delays
- May fail to schedule all poll jobs in an epoch
[Figure: CAM stages: determining relevant pages, parameter tracking, monitoring, resource allocation, tentative $y_{i,j}$ values, scheduling]
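The gap between the epoch-level budget and the per-step limit is easy to see on a toy plan. In this sketch (names and numbers are hypothetical) the tentative plan respects the total capacity M times T but still asks for more than M fetches at one instant, which is what the scheduling phase has to resolve.

```python
def per_step_load(y):
    """Number of polls the tentative plan requests at each time step."""
    n_steps = len(y[0])
    return [sum(row[j] for row in y) for j in range(n_steps)]

def overloaded_steps(y, M):
    """Steps at which the plan requests more than M concurrent fetches."""
    return [j for j, load in enumerate(per_step_load(y)) if load > M]

# Made-up tentative plan: 3 pages, T = 3 steps, M = 2 concurrent fetches.
M, T = 2, 3
tentative_y = [
    [1, 0, 1],
    [1, 0, 0],
    [1, 1, 0],
]
capacity = M * T                                 # total fetches per epoch
total_polls = sum(sum(row) for row in tentative_y)
loads = per_step_load(tentative_y)
clashes = overloaded_steps(tentative_y, M)
```

Here the plan fits the epoch budget (5 polls against a capacity of 6), yet step 0 requests three fetches, so one of them must be delayed.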
15 A flow-shop problem
- M machines available at any time
- Each $y_{i,j}$ which is equal to 1 is a job
- Job k is released at time step $r_k$ (= j)
- Processing time = crawl time $t_k$
- Completion time of job k is $C_k$
- Want to minimize total flow, $\sum_k (C_k - r_k)$
- NP-hard problem
- We use earliest deadline heuristic
[Figure: jobs plotted against time]
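An earliest-deadline heuristic for this setting might look like the following. The talk names the heuristic but not how deadlines are set or how long a crawl takes, so this sketch assumes each job carries an explicit deadline step and takes exactly one time step. Those assumptions, and all names, are hypothetical.

```python
import heapq

def schedule_edf(jobs, M, T):
    """Schedule unit-time poll jobs on M parallel fetch slots over T steps.

    jobs[k] = (release_step, deadline_step); a job may start at any step
    from its release to its deadline, both inclusive.
    Returns (start, dropped): start[k] is the step at which job k runs,
    dropped lists the jobs that could not be run in time.
    """
    by_release = sorted(range(len(jobs)), key=lambda k: jobs[k][0])
    ready = []                 # min-heap of (deadline, job index)
    start, dropped = {}, []
    nxt = 0
    for step in range(T):
        # Release every job whose planned poll instant has arrived.
        while nxt < len(by_release) and jobs[by_release[nxt]][0] <= step:
            k = by_release[nxt]
            heapq.heappush(ready, (jobs[k][1], k))
            nxt += 1
        used = 0
        while ready and used < M:
            deadline, k = heapq.heappop(ready)
            if deadline < step:
                dropped.append(k)          # deadline already passed
            else:
                start[k] = step
                used += 1
    dropped.extend(k for _, k in ready)    # released but never run
    dropped.extend(by_release[nxt:])       # released after the epoch ended
    return start, sorted(dropped)

def total_flow(jobs, start):
    """Sum over scheduled jobs of completion time minus release time."""
    return sum(start[k] + 1 - jobs[k][0] for k in start)

jobs = [(0, 1), (0, 1), (0, 0), (1, 2)]
start, dropped = schedule_edf(jobs, M=2, T=3)
flow = total_flow(jobs, start)
```

In the example, three jobs compete for two slots at step 0. The job with the tightest deadline runs first, and one of the others slips to step 1.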
16 Experiments
- Synthetic data
- Change frequency distribution: a few pages change very often (Zipfian)
- Update probability distribution: a few $\rho_{i,j}$ values are large, most are small (Zipfian again)
- Page importance distribution: also Zipfian (Wolman, 1999)
- Real data
- Eight cricket score sites
- High update rate
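Zipfian synthetic inputs of the kind described here could be generated as follows. The exponent, the page count, and the total number of changes are not given in the talk, so the values and function names below are placeholders.

```python
def zipf_weights(n, s=1.0):
    """Normalized Zipfian weights: the weight at rank r is proportional to 1 / r**s."""
    raw = [1.0 / (rank ** s) for rank in range(1, n + 1)]
    total = sum(raw)
    return [w / total for w in raw]

def synthetic_change_rates(n_pages, total_changes, s=1.0):
    """Split an epoch's expected changes across pages with a Zipfian skew."""
    return [total_changes * w for w in zipf_weights(n_pages, s)]

weights = zipf_weights(4)                  # hypothetical: 4 pages, exponent 1
rates = synthetic_change_rates(4, 100.0)   # hypothetical: 100 expected changes
```

Raising `s` concentrates more of the changes in the top-ranked pages, and `s = 0` gives the uniform case.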
17 CAM > Proportional > Uniform
- Uniform update and importance distributions
- Plot RIR against ratio of resources to expected changes
- RIR for CAM is more than 3 times better
- Proportional is better than uniform in the CQ setting
- Intuition from minimum total stale duration does not apply to CQ
18 Resource allocation
- Sort pages by increasing change rate
- Place in ten equally populated bins (10 = fastest)
- Uniform spends same resource for each bin
- Proportional wastes fewer resources on slow-changing bins, but is not aggressive enough
- CAM invests more aggressively in fast-changing bins, achieving the greatest RIR
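The binning analysis on this slide can be reproduced with a short helper. The sketch sorts pages by change rate, splits them into equally populated bins, and counts how many polls a given plan spends on each bin. The names and the tiny example (two bins rather than ten) are illustrative assumptions.

```python
def bin_by_change_rate(rates, n_bins=10):
    """Group page indices into equally populated bins, slowest pages first."""
    order = sorted(range(len(rates)), key=lambda i: rates[i])
    bins = [[] for _ in range(n_bins)]
    for pos, i in enumerate(order):
        bins[pos * n_bins // len(order)].append(i)
    return bins

def polls_per_bin(bins, y):
    """Total polls that plan y spends on the pages of each bin."""
    return [sum(sum(y[i]) for i in members) for members in bins]

# Made-up example: 4 pages, 2 bins, a plan that favors fast-changing pages.
rates = [0.5, 0.1, 0.9, 0.3]
y = [[1, 1], [0, 0], [1, 1], [0, 1]]
bins = bin_by_change_rate(rates, n_bins=2)
spent = polls_per_bin(bins, y)
```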
19 Skew-handling and adaptation
- Fixed monitoring/change ratio
- Vary skew in update probability distribution
- CAM's gains increase with skew
- CAM improves over initial epochs
- Change distribution estimates stabilize within a few epochs
[Figure: plot of RIR]
20 Experiments on real pages
- Eight sites with dynamic cricket match information
- In fact, Zipfian updates
- Adversarial setup: monitor/change < 1
- CAM close to best possible
- For M/C = 2, CAM updates on 80% of the information changed
21 Conclusion
- Continual queries are inherently different from discrete queries
- Approach used in CAM
- Identify relevant pages
- Track the pages as they change
- Characterize page change behavior
- Decide when to monitor the pages in future
- CAM approach performs better than naïve approaches
22 References
- J. Cho, H. Garcia-Molina. Synchronizing a database to improve freshness. ACM SIGMOD, 2000.
- J. Cho, H. Garcia-Molina. Estimating frequency of change. Technical Report, 2000.
- J. Sethuraman, J. L. Wolf, M. S. Squillante, P. S. Yu. Optimal crawling strategies for Web search engines. World Wide Web, 2002.
23 References
- S. Pandey, K. Ramamritham, S. Chakrabarti. Monitoring the dynamic Web to respond to Continual Queries. World Wide Web, 2003.
- S. Pandey, K. Ramamritham, S. Chakrabarti, S. Garg, A. Vyas. Web-CAM: Monitoring the dynamic Web to respond to Continual Queries. Submitted to 29th VLDB conference, 2003.
24 Future Research Possibilities
- Keeping the inverted index current
- Account for new entries to the Web
- Identifying the changes relevant to the query
- Measuring query-specific change behaviour of a page
- Reusing the page change statistics for other related queries
25 Skewed update probability distribution
- CAM still performs much better than the others
- In fact, CAM exploits the skewed nature of the distribution and performs even better than in the uniform setting
26 Adaptive nature of CAM
- No difference in allocation of resources in the Uniform and Proportional strategies
- CAM considers the probability distribution while allocating resources
- Lower-frequency bins also get resources now, due to some updating moments of high probability