Title: Using Database Technology to Improve Performance of Web Proxy Servers
1Using Database Technology to Improve Performance
of Web Proxy Servers
- K. Cheng¹, Y. Kambayashi¹, M. Mohania²
- ¹Kyoto University, Japan
- ²Western Michigan University, USA
2Caching on web proxy servers
[Figure: clients and web servers, with the proxy cache between them]
- Improve throughput of proxy servers
- Improve response times for end users
- Bridge bandwidth gap between WAN and LAN
- Offload work from web servers
3Characteristics of proxy caching
                        Traditional caching     Proxy caching
Storage                 Memory-based            Disk-based
Cache size              Small                   Huge
Object survival time    Short                   Long
Algorithm               Simple                  Can be complex
Used by                 Programmed processes    People with specific interests
4Limitations of current caching schemes: case 1
- Tom found a very good page, P1, about car models.
- John was looking for the same kind of page, but found only P2.
- Both P1 and P2 were cached, but Tom did not know about P2 and John did not know about P1.
- After several days, both pages were replaced because neither was visited again.
- As a result, Tom missed P2, John missed P1, and the cache missed 2 hits.
State-of-the-art caching schemes cannot deal with this case.
5Limitations of current caching schemes: case 2
- Suppose the users of a proxy server are mostly interested in XML and rarely in Fuzzy.
- Suppose some clients retrieved pages P1 and P2.
- After checking the content of P1 and P2, we know that P1 is an XML page and P2 is a Fuzzy page.
Should we prefer to cache P1 or P2?
6Why can't current schemes deal with these cases?
- Physical object based cache management
- Content transparency → low utilization rate (Case 1)
- Approximately 60% of data in cache is never used
- Approximately 90% of data in cache is rarely used
- Usage-based object replacement → needlessly long stay time for irrelevant content (Case 2)
7Our solution
- We propose a hierarchical data model for management of web data (physical pages, logical pages and topics).
- Object replacement based on
- Link structure (logical pages)
- Semantic similarity with other objects (topics)
- Facilitate active access to cache contents
8A hierarchical model for web data
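[Figure: three-level hierarchy. Topics (T1, T2), handled by the topic manager and reached by navigating, are mapped to logical pages (L1, L2, L3), handled by the logical page manager and reached by searching. Logical pages are mapped to physical pages (p1 to p6), handled by the physical page manager and reached by browsing.]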
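The three levels and the two mappings can be sketched as plain data classes. This is only an illustration: the class names, the fields and the example URLs are assumptions, not the authors' schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PhysicalPage:
    url: str
    size: int  # bytes


@dataclass
class LogicalPage:
    name: str
    physical: List[PhysicalPage] = field(default_factory=list)


@dataclass
class Topic:
    name: str
    logical: List[LogicalPage] = field(default_factory=list)


def physical_pages_of(topic: Topic) -> List[PhysicalPage]:
    """Follow both mappings and return each physical page once."""
    seen = {}
    for lpage in topic.logical:
        for ppage in lpage.physical:
            seen.setdefault(ppage.url, ppage)  # a page shared by two logical pages is kept once
    return list(seen.values())


# Hypothetical example data.
p1 = PhysicalPage("http://example.org/index.html", 4096)
p2 = PhysicalPage("http://example.org/logo.gif", 2048)
l1 = LogicalPage("L1", [p1, p2])
l2 = LogicalPage("L2", [p2])
t1 = Topic("T1", [l1, l2])
pages = physical_pages_of(t1)
```

Here `p2` belongs to both `L1` and `L2`, so `pages` lists it only once.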
9Physical pages
[Figure: example physical pages A and B, with the URLs http://www.difa.unibas.it/webdb2001, ../icons/webdblogo.gif and /instructionsPage/index.html]
10Logical page
[Figure: physical pages A and B shown as one logical page]
11Managing physical pages
- Physical page
- HTML/plain text file (.html, .txt)
- Embedded media file (.gif, .png, .wav, .mp3)
- Application-generated file (.pdf, .ps, .doc)
- Managing physical pages based on
- URL (protocol, IP, port, path)
- Physical properties (e.g. size, cost)
- Usage (frequency, recency)
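As a rough sketch of this bookkeeping, a physical page can be keyed by its URL components and carry its size and usage counters. The names `url_key`, `PageStats` and `DEFAULT_PORTS` are hypothetical, and the host name is used where the slide says IP (an assumption, to avoid a DNS lookup in the example).

```python
from dataclasses import dataclass
from urllib.parse import urlsplit

# Assumed default ports, used when the URL does not name one.
DEFAULT_PORTS = {"http": 80, "https": 443, "ftp": 21}


def url_key(url):
    """Split a URL into a normalized (protocol, host, port, path) key."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port or DEFAULT_PORTS.get(scheme)
    path = parts.path or "/"
    return (scheme, host, port, path)


@dataclass
class PageStats:
    size: int                 # physical property, in bytes
    frequency: int = 0        # number of accesses so far
    last_access: float = 0.0  # timestamp of the latest access (recency)

    def touch(self, now):
        self.frequency += 1
        self.last_access = now


cache = {}
key = url_key("http://www.difa.unibas.it/webdb2001")
cache[key] = PageStats(size=12000)
cache[key].touch(now=100.0)
```

Normalizing the key means that `HTTP://Example.ORG:80/a.html` and `http://example.org/a.html` refer to the same cache entry.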
12Constructing logical pages
- Basic logical pages
- Single multimedia document
- One HTML file (1) plus embedded media files (1..)
- Extended logical pages
- Several closely related, directly linked pages
- E.g. an HTML paper with sections in different multimedia documents
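A basic logical page could be assembled roughly as follows: parse the HTML file, collect the embedded media it references, and ignore ordinary hyperlinks. The tag list and the names `EmbeddedMediaParser` and `basic_logical_page` are assumptions made for this sketch, and the example URL is invented.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class EmbeddedMediaParser(HTMLParser):
    # Tags treated as "embedded media" in this sketch (an assumption).
    EMBED_TAGS = {"img", "embed", "audio", "video", "source"}

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.media = []

    def handle_starttag(self, tag, attrs):
        if tag not in self.EMBED_TAGS:
            return  # hyperlinks such as <a href=...> are not embedded objects
        for name, value in attrs:
            if name == "src" and value:
                url = urljoin(self.base_url, value)  # resolve relative paths
                if url not in self.media:
                    self.media.append(url)


def basic_logical_page(html_url, html_text):
    """Group one HTML file with the media files it embeds."""
    parser = EmbeddedMediaParser(html_url)
    parser.feed(html_text)
    return {"html": html_url, "media": parser.media}


html = '<html><body><img src="../icons/webdblogo.gif"><a href="other.html">next</a></body></html>'
page = basic_logical_page("http://example.org/webdb2001/index.html", html)
```

An extended logical page would additionally follow selected hyperlinks. How to decide which linked pages are "closely related" is not specified on the slide, so it is left out here.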
13Managing topics
- Defining a topic
- Topic = <id, name, criteria, popularity, date, ...>
- Popularity = f(F, R, P, U)
- F: access frequency of the topic
- R: time interval between last access time and current time
- P: number of logical pages belonging to the topic
- U: number of users accessing the topic
- Deciding membership of a logical page to a topic
- IR approaches (k-NN, ...)
- ML approaches (e.g. Support Vector Machine, SVM)
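To make the k-NN option concrete, here is a minimal sketch that assigns a page to the topic of its nearest labeled neighbors, using bag-of-words vectors and cosine similarity. The tokenization, the value of k and the tiny training set are assumptions for illustration; the slides do not say how the authors configured their classifier.

```python
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words term counts (whitespace tokenization, lowercased)."""
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_topic(text, labeled, k=3):
    """Majority topic among the k labeled pages most similar to text."""
    query = vectorize(text)
    ranked = sorted(labeled, key=lambda item: cosine(query, vectorize(item[0])), reverse=True)
    votes = Counter(topic for _, topic in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical labeled pages: (text, topic).
labeled = [
    ("xml schema dtd parser", "XML"),
    ("xml xpath xslt transform", "XML"),
    ("fuzzy logic membership set", "Fuzzy"),
    ("fuzzy control inference rule", "Fuzzy"),
]
topic_a = knn_topic("xslt and xpath for xml documents", labeled)
topic_b = knn_topic("fuzzy set membership", labeled)
```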
14Definitions
- We use the term Priority for object replacement. It is a function of several parameters, e.g. access frequency (F), time interval (R), size of object (S), retrieval cost (C) and significance (G).
- Significance: importance of the topic
15Caching policy: LRU-SP
- Topic management
- Priority = f(F, R, G)
- Logical page management
- Basic logical pages only
- Priority = g(F, R)
- Physical page management
- LRU-SP: size-adjusted, popularity-aware LRU (K. Cheng et al., COMPSAC '00)
- Priority = h(F, R, S)
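The slides name the arguments of f, g and h but not their form. The sketch below uses one simple, assumed shape: priority grows with frequency (F) and significance (G), and shrinks with the time since last access (R) and with object size (S). These formulas are illustrative only and are not the LRU-SP definition from the cited paper.

```python
def topic_priority(F, R, G):
    """Assumed form of f(F, R, G): frequent, recent, significant topics rank high."""
    return G * F / (1.0 + R)


def logical_priority(F, R):
    """Assumed form of g(F, R): frequent, recently used logical pages rank high."""
    return F / (1.0 + R)


def physical_priority(F, R, S):
    """Assumed form of h(F, R, S): large objects are penalized."""
    return F / ((1.0 + R) * S)


small = physical_priority(F=5, R=10.0, S=1000)
large = physical_priority(F=5, R=10.0, S=2000)
```

With equal frequency and recency, `small` is twice `large`, so the larger object would be replaced first under this assumed form.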
16Evaluate and add new objects
[Figure: the cache hierarchy ordered by priority, from higher to lower. Topics: T1, T2. Logical pages: L1, L2, L3. Physical pages: P10, P11, P12, P20, P21, P22, P30, P31, P40, P41, P42. A new object D arrives; caption: "D is of higher priority".]
17Replace an object
- Choose a candidate topic (T1)
- T1 has 1 logical page (L1), so choose L1
- L1 has 3 physical pages (P10, P11, P12), of which P12 is shared with L2
- Choose a victim P from P10 and P11, since the shared page P12 is still needed by L2
- Replace P with the new page
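The top-down choice described above can be sketched as follows: pick the lowest-priority topic, then its lowest-priority logical page, then the lowest-priority physical page that no other logical page shares. The data layout, the priority values and the function name `choose_victim` are assumptions for this example.

```python
from collections import Counter


def choose_victim(topics, logical, physical):
    """topics: name -> (priority, [logical page names])
    logical: name -> (priority, [physical page names])
    physical: name -> priority
    Returns the physical page to evict, or None if every candidate is shared."""
    topic = min(topics, key=lambda t: topics[t][0])
    lpage = min(topics[topic][1], key=lambda l: logical[l][0])
    # Count how many logical pages reference each physical page.
    owners = Counter(p for _, pages in logical.values() for p in set(pages))
    candidates = [p for p in logical[lpage][1] if owners[p] == 1]
    if not candidates:
        return None
    return min(candidates, key=lambda p: physical[p])


# Hypothetical priorities; P12 is shared by L1 and L2 as in the slide.
topics = {"T1": (0.2, ["L1"]), "T2": (0.9, ["L2", "L3"])}
logical = {
    "L1": (0.4, ["P10", "P11", "P12"]),
    "L2": (0.7, ["P12", "P20", "P21"]),
    "L3": (0.8, ["P30", "P31"]),
}
physical = {"P10": 0.30, "P11": 0.10, "P12": 0.05, "P20": 0.5,
            "P21": 0.6, "P30": 0.7, "P31": 0.8}
victim = choose_victim(topics, logical, physical)
```

`P12` has the lowest priority but is skipped because `L2` still needs it, so the victim comes from `P10` and `P11`.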
18Preliminary experiments
- Replay access logs of our proxy server (Squid)
- 30 clients, 30 days
- 873,824 requests, 21.30 GB of data
- 7 topics, priority in 1..5
- Significance factor (0, 2)
- Measures the significance of each topic
- Hit Rate (HR)
- Percentage of requests satisfied by the cache
- Profit Rate (PR)
- Weighted by the significance of each topic (formula not recoverable from the slide)
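Hit rate can be computed directly from a replayed log. The exact profit rate formula did not survive in the slide text, so the second function below is only one plausible reading, in which each request is weighted by the significance of its topic. The function names, log format and significance values are assumptions.

```python
def hit_rate(log):
    """log: list of (topic, was_hit). Fraction of requests served from cache."""
    if not log:
        return 0.0
    return sum(1 for _, hit in log if hit) / len(log)


def significance_weighted_hit_rate(log, significance):
    """Assumed reading of a 'profit rate': hits weighted by topic significance."""
    total = sum(significance[topic] for topic, _ in log)
    if total == 0:
        return 0.0
    gained = sum(significance[topic] for topic, hit in log if hit)
    return gained / total


# Hypothetical replayed requests and significance factors in (0, 2).
log = [("XML", True), ("XML", False), ("Fuzzy", True), ("XML", True)]
significance = {"XML": 1.5, "Fuzzy": 0.5}
hr = hit_rate(log)
pr = significance_weighted_hit_rate(log, significance)
```

Under this reading, a miss on a significant topic costs more than a miss on a minor one, which is why the two measures can differ on the same log.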
19Baseline algorithm: LRV (Rizzo et al., 1998)
- A physical-page-based algorithm
- Uses size (S) to predict further accesses to incoming objects
- Parameters considered
- Access frequency (F)
- Time interval (R)
- Size of objects (S)
20Results: hit rates up 20%
[Figure: hit rate vs. cache space, in % of total unique data]
21Results: profit rates up 30%
[Figure: profit rate vs. cache space, in % of total unique data]
22Conclusion and future work
- The performance of caching proxies can be remarkably improved if cache contents are well organized and managed
- We proposed a hierarchical model and a cache management scheme based on that model
- Future work
- Tuning various parameters to achieve better performance (logical page clustering, priority balancing between significance and popularity, etc.)
- More experiments