Transcript and Presenter's Notes

Title: Mining di Dati Web


1
Mining di Dati Web (Web Data Mining)
  • Web Search Engines: Query Log Mining
  • Academic Year (A.A.) 2006/2007

2
What's Recorded in a WSE Query Log?
  • Each component of a WSE records information about
    its operations.
  • We are mainly concerned with frontend logs.
  • They record each query submitted to the WSE.

3
Data Recorded
  • Among other information, WSEs record:
  • The query topic.
  • The first result wanted.
  • The number of results wanted.
  • Some examples (a parsing sketch follows below):
  • q(fabrizio silvestri)f(1)n(10)
  • q(information retrieval)f(5)n(15)
  • Some other information:
  • The language.
  • Results folded? (Y/N).
  • Etc.

Commonly referred to as the query
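As a concrete illustration, here is a minimal sketch of how such schematic records could be parsed. The q(...)f(...)n(...) layout follows the examples above; the regular expression and function names are illustrative and do not reflect the log format of any real engine.

```python
import re

# Illustrative pattern for the schematic record format shown above:
# q(...) = query topic, f(...) = first result wanted, n(...) = number of results wanted.
RECORD = re.compile(r"q\((?P<query>[^)]*)\)f\((?P<first>\d+)\)n\((?P<count>\d+)\)")

def parse_record(line):
    """Parse one schematic log entry into (query, first, count), or None."""
    m = RECORD.match(line.strip())
    if m is None:
        return None
    return m.group("query"), int(m.group("first")), int(m.group("count"))

print(parse_record("q(fabrizio silvestri)f(1)n(10)"))    # ('fabrizio silvestri', 1, 10)
print(parse_record("q(information retrieval)f(5)n(15)")) # ('information retrieval', 5, 15)
```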
4
What Can We Look For?
  • The most popular queries.
  • How queries are distributed.
  • How queries are related.
  • How subsequent queries are related.
  • How topics are distributed.
  • How topics change throughout the 24 hours.
  • Can we exploit this information?

5
Let's Start Looking at Some Interesting Items
  • What are the most popular queries?
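A minimal sketch of how the most popular queries could be extracted from a frontend log; it assumes one already-extracted query string per line, and the file name is hypothetical.

```python
from collections import Counter

def most_popular_queries(log_lines, k=10):
    """Count normalized query strings and return the k most frequent ones."""
    counts = Counter(line.strip().lower() for line in log_lines if line.strip())
    return counts.most_common(k)

# Hypothetical usage, assuming one query string per line in queries.txt:
# with open("queries.txt") as f:
#     for query, freq in most_popular_queries(f):
#         print(freq, query)
```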

6
Most Popular Topics
7
Most Popular Terms
8
What Are Users Doing?
  • Not typing many words!
  • The average query was 2.6 words long in 2001, up from 2.4 words in 1997.
  • Moving toward e-commerce:
  • Less sex (down from 17% to 9%), more business (up from 13% to 25%).
  • Spink, A., et al. "From E-Sex to E-Commerce: Web Search Changes", Computer, March 2002.

9
Why Are Queries so Short?
  • Users minimize effort.
  • Users don't realize that more information is better.
  • Users learn that more words lead to fewer results (since query terms are implicitly ANDed).
  • Query Boxes are Small.
  • Belkin, N.J., et al. "Rutgers' TREC 2001 Interactive Track Experience", in Voorhees and Harmon (eds.), The Tenth Text REtrieval Conference (TREC 2001).

10
Different Kinds of Queries
11
Distribution of Query Types
12
Hourly Analysis of a Query Log
  • Steven M. Beitzel, Eric C. Jensen, Abdur
    Chowdhury, David Grossman, Ophir Frieder, "Hourly
    Analysis of a Very Large Topically Categorized
    Web Query Log", Proceedings of the 2004 ACM
    Conference on Research and Development in
    Information Retrieval (ACM-SIGIR), Sheffield, UK,
    July 2004.

13
Frequency Time Distribution
14
Query Repetition
15
Query Categories
16
Categories over Time
17
Analysis of Three Query Logs
  • Tiziano Fagni, Salvatore Orlando, Raffaele Perego, Fabrizio Silvestri. "Boosting the Performance of Web Search Engines: Caching and Prefetching Query Results by Exploiting Historical Usage Data". ACM Transactions on Information Systems, 24(1), January 2006.

18
Temporal Locality
≈ 0.66
19
Query Submission Distance
20
Page Requested
21
Subsequent Page Requests
22
Query Caching
[Diagram: the query ("Francesca", page 1) is submitted to the WSE, which queries the index and returns the results.]
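A minimal sketch of the lookup path the diagram describes, assuming a results cache sits in front of the index; the backend callable stands in for the actual query processor and is not a real API.

```python
class ResultCache:
    """Toy query-result cache placed in front of a WSE backend (a stand-in callable)."""

    def __init__(self, backend, capacity=1000):
        self.backend = backend      # callable: (query, page) -> results
        self.capacity = capacity
        self.store = {}             # (query, page) -> cached results

    def lookup(self, query, page=1):
        key = (query, page)
        if key in self.store:                     # hit: serve from the cache
            return self.store[key]
        results = self.backend(query, page)       # miss: ask the index
        if len(self.store) < self.capacity:       # no eviction in this toy version
            self.store[key] = results
        return results

# Example: cache = ResultCache(lambda q, p: "result page %d for %s" % (p, q))
# cache.lookup("Francesca", 1)   # miss, answered by the backend and cached
# cache.lookup("Francesca", 1)   # hit, answered by the cache
```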
23
Caching: Who Cares?!
  • Successful caching of query results can:
  • Lower the number/cost of query executions.
  • Shorten the engine's response time.
  • Increase the engine's throughput.

24
Caching: How-To?
  • Caching can exploit the locality of reference in the query streams that search engines are faced with.
  • Query popularity follows a power law: frequencies vary widely, from the extremely popular to the very rare.

25
Caching: What to Measure?
  • Hit Ratio:
  • Let N be the number of requests to the WSE.
  • Let H be the number of hits, i.e. the number of queries that can be answered by the cache.
  • The Hit Ratio is defined as HR = H/N. It is usually expressed as a percentage.
  • E.g., HR = 30% means that thirty percent of the queries are satisfied using the cache.
  • Alternatively, we can define the Miss Ratio MR = 1 - HR = M/N, where M is the number of misses, i.e. the number of queries that cannot be answered by the cache. (A measurement sketch follows below.)
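A minimal sketch of how HR and MR can be measured by replaying a query stream against a cache; the contains/insert interface is assumed for illustration and works with any replacement policy.

```python
def replay(queries, cache):
    """Replay a query stream and return (hit ratio, miss ratio).

    `cache` is assumed to expose contains(q) and insert(q); names are illustrative.
    """
    hits, n = 0, 0
    for q in queries:
        n += 1
        if cache.contains(q):
            hits += 1          # H: queries answered by the cache
        else:
            cache.insert(q)    # M = N - H: queries that miss
    hr = hits / n if n else 0.0
    return hr, 1.0 - hr        # HR = H/N, MR = 1 - HR = M/N

# E.g. an HR of 0.30 means 30% of the queries are satisfied using the cache.
```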

26
What About the Throughput?
  • The throughput is defined as the number of queries answered per second.
  • Caching, in general, increases the throughput.
  • The lower the hit ratio, the lower the throughput.
  • The lower the cache response time, the higher the throughput.

27
Caching Complexity
  • The cache response time depends on the complexity of the replacement policy.
  • The complexity usually depends on the cache size K.
  • There exist policies that are:
  • O(1), i.e. constant: they don't depend on the size of the cache.
  • O(log K).
  • O(N).

28
Is There Only Caching?
  • No!!!!
  • There's also PREFETCHING!
  • What's Prefetching?
  • Anticipating users' queries by exploiting query stream properties.
  • Uhuuuu! Sounds like a kind of Usage Mining!
  • For instance, let's have a look at the probability of subsequent page requests.
  • The prefetching factor p is the number of pages prefetched (see the sketch below).
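A minimal sketch of prefetching with factor p: on a miss for page i of a query, pages i through i+p-1 are computed together and all inserted into the cache. The dict-like cache and the backend call are assumptions made for illustration.

```python
def fetch_with_prefetch(cache, backend, query, page, p=2):
    """On a miss, compute p consecutive result pages at once and cache them all.

    `cache` is any dict-like object keyed by (query, page);
    `backend(query, first_page, n_pages)` is an assumed call that returns
    a list of n_pages result pages.
    """
    key = (query, page)
    if key in cache:
        return cache[key]                    # hit: no backend work at all
    pages = backend(query, page, p)          # miss: p pages cost roughly one
    for i, results in enumerate(pages):
        cache[(query, page + i)] = results   # prefetched pages may never be used
    return cache[key]
```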

29
Prefetching: PROS and CONS
  • Prefetching enhances the hit ratio.
  • Prefetching reduces the query load on the query server.
  • The cost of computing p pages of results is approximately the same as computing only one page.
  • Prefetching is very likely to load pages that will never be requested in the future.

30
Adaptive Prefetching
31
Theoretical Bounds
32
Some Classical Caching Policies
  • LRU:
  • Least Recently Used.
  • Evict from the cache the query results that have been accessed farthest in the past.
  • SLRU (Segmented LRU):
  • Two segments:
  • Probationary.
  • Protected.
  • Lines in each segment are ordered from the most to the least recently accessed. Data from misses is added to the cache at the most recently accessed end of the probationary segment. Hits are removed from wherever they currently reside and added to the most recently accessed end of the protected segment. Lines in the protected segment have thus been accessed at least twice. The protected segment is finite, so migration of a line from the probationary segment to the protected segment may force the migration of the LRU line in the protected segment to the most recently used (MRU) end of the probationary segment, giving this line another chance to be accessed before being replaced. (A sketch of this behaviour follows below.)
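A minimal sketch of the SLRU behaviour described above, using two OrderedDicts for the probationary and protected segments; segment sizes and method names are illustrative.

```python
from collections import OrderedDict

class SLRUCache:
    """Segmented LRU: misses enter the probationary segment; hits are promoted
    to the protected segment; protected overflow returns to probationary MRU."""

    def __init__(self, probationary_size, protected_size):
        self.prob = OrderedDict()   # ordered from least to most recently accessed
        self.prot = OrderedDict()
        self.prob_size = probationary_size
        self.prot_size = protected_size

    def get(self, key):
        if key in self.prot:                 # hit in protected: refresh recency
            self.prot.move_to_end(key)
            return self.prot[key]
        if key in self.prob:                 # hit in probationary: promote
            value = self.prob.pop(key)
            self._insert_protected(key, value)
            return value
        return None                          # miss

    def put(self, key, value):
        if self.get(key) is not None:        # already cached: get() handled recency
            return
        if len(self.prob) >= self.prob_size:
            self.prob.popitem(last=False)    # evict the probationary LRU line
        self.prob[key] = value               # misses enter probationary at MRU end

    def _insert_protected(self, key, value):
        if len(self.prot) >= self.prot_size:
            old_key, old_value = self.prot.popitem(last=False)  # protected LRU line
            if len(self.prob) >= self.prob_size:
                self.prob.popitem(last=False)
            self.prob[old_key] = old_value   # demoted to probationary MRU end
        self.prot[key] = value
```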

33
Problems
  • Classical Replacement Policies do not care about
    stream characteristics.
  • They are not designed using usage mining
    investigation techniques.
  • They offer good performance, though!
  • Uhmmm. Are you sure?!? Stay tuned!

34
Caching May Be Attacked from Two Directions
  • Architecture of the caching system:
  • Two-level caching
  • Three-level caching
  • SDC
  • Replacement policy:
  • PDC
  • SDC
  • Both:
  • SDC

35
Two-level Caching
  • Cache of Query Results
  • Cache of Inverted Lists
  • Both

36
Throughput
37
Three-level Caching
  • Long, X. and Suel, T. 2005. Three-level caching
    for efficient query processing in large Web
    search engines. In Proceedings of the 14th
    international Conference on World Wide Web
    (Chiba, Japan, May 10 - 14, 2005). WWW '05. ACM
    Press, New York, NY, 257-266.

38
Probability Driven Caching
  • Lempel, R. and Moran, S. 2003. Predictive caching
    and prefetching of query results in search
    engines. In Proceedings of the 12th international
    Conference on World Wide Web (Budapest, Hungary,
    May 20 - 24, 2003). WWW '03. ACM Press, New York,
    NY, 19-28.
  • Thanks to Ronny for his original slides!

39
Static-Dynamic Caching
  • Tiziano Fagni, Salvatore Orlando, Raffaele Perego, Fabrizio Silvestri. "Boosting the Performance of Web Search Engines: Caching and Prefetching Query Results by Exploiting Historical Usage Data". ACM Transactions on Information Systems, 24(1), January 2006.
  • Idea:
  • Divide the cache into two sets:
  • Static Set.
  • Dynamic Set.
  • Fill the Static Set using the queries most frequently submitted in the past.
  • The Static Set is read-only: good in multithreaded architectures.

40
Inside SDC
  • Static-Dynamic Caching.
  • The cache is divided into two sets:
  • Static Set: contains the results of the queries most frequently submitted so far.
  • Dynamic Set: implemented using a classical cache replacement policy such as LRU, SLRU, or PDC.
  • The Static Set size is given by f_static · N, where 0 < f_static < 1 is the fraction of the total entries (N) of the cache devoted to the Static Set.
  • Adaptive Prefetching is adopted. (A sketch of the SDC idea, without prefetching, follows below.)
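A minimal sketch of the SDC idea under the assumptions above: the Static Set is a read-only dictionary filled offline with the historically most frequent queries, and the Dynamic Set is managed by plain LRU. Adaptive prefetching is omitted, and the backend callable is a stand-in for the query processor.

```python
from collections import Counter, OrderedDict

class SDCCache:
    """Static-Dynamic Cache: a read-only Static Set plus an LRU Dynamic Set."""

    def __init__(self, n_entries, f_static, past_queries, backend):
        static_entries = int(f_static * n_entries)      # |Static Set| = f_static * N
        self.dynamic_size = n_entries - static_entries
        self.backend = backend                          # callable: query -> results
        # Fill the Static Set offline with the most frequent past queries.
        top = Counter(past_queries).most_common(static_entries)
        self.static = {q: backend(q) for q, _ in top}   # never updated at run time
        self.dynamic = OrderedDict()                    # classical LRU

    def get(self, query):
        if query in self.static:                        # read-only: no locking needed
            return self.static[query]
        if query in self.dynamic:                       # dynamic hit: refresh recency
            self.dynamic.move_to_end(query)
            return self.dynamic[query]
        results = self.backend(query)                   # miss: compute the results
        if self.dynamic_size > 0:
            if len(self.dynamic) >= self.dynamic_size:
                self.dynamic.popitem(last=False)        # evict the LRU dynamic entry
            self.dynamic[query] = results
        return results
```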

41
Benefits in Real-World Caches
[Diagram: several WSE threads access a shared SDC cache; the read-only Static Set needs no locking, while accesses to the Dynamic Set are serialized through a mutex.]
42
SDC Performance
  • Linux PC, 2 GHz Pentium Xeon, 1 GB RAM.
  • Single process.
  • f_static = 0.5. No prefetching.

43
SDC Hit-Ratio
44
SDC Hit-Ratio
45
SDC Hit-Ratio
46
SDC Hit-Ratio
47
SDC Hit-Ratio
48
SDC Hit-Ratio
49
SDC Hit-Ratio
50
SDC Hit-Ratio
51
Why Does the Static Set Help?
52
Concurrent Caching
53
Freshness of the Training Data
  • How frequently should we perform mining again on
    the usage data?
  • Does the performance of usage-mining-based caching degrade gracefully as time goes by?
  • Do time-of-day patterns exist in the query stream?

54
Daily Patterns