Mining Web Traces: Workload Characterization, Performance Diagnosis, and Applications

1
Mining Web Traces: Workload Characterization,
Performance Diagnosis, and Applications
  • Lili Qiu
  • Microsoft Research
  • Performance2002, Rome, Italy
  • September 2002

2
Motivation
  • Why do we care about Web traces?
  • Content providers
  • How do users come to visit the Web site?
  • Why do users leave the Web site? Is poor
    performance the cause for this?
  • Where are the performance bottlenecks?
  • What content are users interested in?
  • How does user interest vary over time?
  • How does user interest vary across different
    geographical regions?

3
Motivation (Cont.)
  • Web hosting companies
  • Accounting & billing
  • Server selection
  • Provisioning server farms: where to place servers
  • ISPs
  • How to save bandwidth by deploying proxy caches?
  • Traffic engineering & provisioning
  • Researchers
  • Where are the performance bottlenecks?
  • How to improve Web performance?
  • Example: traffic measurements have influenced
    the design of HTTP (e.g., persistent connections
    and pipelining) and TCP (e.g., initial congestion
    window)

4
Tutorial Outline
  • Background
  • Web workload characterization
  • Performance diagnosis
  • Applications of traces
  • Bibliography

5
Part I: Background
  • Web software components
  • Web semantic components
  • Web protocols
  • Types of Web traces

6
Web Software Components
  • Web clients
  • An application that establishes connections to
    send Web requests
  • E.g., Mosaic, Netscape Navigator, Microsoft IE
  • Web servers
  • An application that accepts connections to
    service requests by sending back responses
  • E.g., Apache, Microsoft IIS
  • Web proxies (optional)
  • Web replicas (optional)

[Figure: Web clients reach Web servers across the Internet, optionally through proxies and replicas]
7
Web Semantic Components
  • Uniform Resource Identifier (URI)
  • An identifier for a Web resource
  • Name of the protocol: http, https, ftp, ...
  • Name of the server
  • Name of the resource on the server
  • e.g., http://www.foobar.com/info.html
  • Hypertext Markup Language (HTML)
  • Platform-independent styles (indicated by markup
    tags) that define the various components of a Web
    document
  • Hypertext Transfer Protocol (HTTP)
  • Defines the syntax and semantics of messages
    exchanged between Web software components

8
Example of a Web Transaction
[Figure: the browser (1) sends a DNS query to the DNS server, (2) sets up a TCP connection to the Web server, (3) sends an HTTP request, and (4) receives an HTTP response]
9
Internet Protocol Stack
  • Application layer: application programs (HTTP,
    Telnet, FTP, DNS)
  • Transport layer: error control & flow control
    (TCP, UDP)
  • Network layer: routing (IP)
  • Datalink layer: handles hardware
    details (Ethernet, ATM)
  • Physical layer: moving bits (coaxial cable,
    optical fiber)
10
Web Protocols
[Figure: HTTP messages are carried in TCP segments, which are carried in IP packets forwarded hop by hop over Ethernet and SONET links. Picture taken from [KR01].]
11
Web Protocols (Cont.)
  • DNS [AL01]
  • An application-layer protocol responsible for
    translating hostnames to IP addresses and vice versa (e.g.,
    perf2002.uniroma2.it → 160.80.2.140)
  • TCP [JK88]
  • A transport-layer protocol that does error
    control and flow control
  • Hypertext Transfer Protocol (HTTP)
  • HTTP 1.0 [BLFF96]
  • The most widely used HTTP version
  • A stop-and-wait protocol
  • HTTP 1.1 [GMF99]
  • Adds persistent connections, pipelining, caching,
    content negotiation, ...

12
HTTP 1.0
  • HTTP request
  • Request = Simple-Request | Full-Request
    Simple-Request = "GET" SP Request-URI CRLF
    Full-Request = Request-Line
                   *( General/Request/Entity Header )
                   CRLF
                   [ Entity-Body ]
    Request-Line = Method SP Request-URI SP HTTP-Version CRLF
    Method = "GET" | "HEAD" | "POST" | extension-method
  • Example: GET /info.html HTTP/1.0
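
The request grammar above can be sketched as a small parser. This is only an illustrative sketch, not part of the original slides: the names `RequestLine` and `parse_request_line` are hypothetical, and reporting a Simple-Request as "HTTP/0.9" is an assumption based on RFC 1945 describing that form as the HTTP/0.9 request.

```python
from typing import NamedTuple


class RequestLine(NamedTuple):
    method: str
    uri: str
    version: str  # "HTTP/0.9" is reported for a Simple-Request (assumption)


def parse_request_line(line: str) -> RequestLine:
    """Parse the first line of an HTTP/1.0 Simple-Request or Full-Request."""
    parts = line.rstrip("\r\n").split(" ")
    if len(parts) == 2 and parts[0] == "GET":
        # Simple-Request = "GET" SP Request-URI CRLF (no version token)
        return RequestLine("GET", parts[1], "HTTP/0.9")
    if len(parts) == 3 and parts[2].startswith("HTTP/"):
        # Request-Line = Method SP Request-URI SP HTTP-Version CRLF
        return RequestLine(parts[0], parts[1], parts[2])
    raise ValueError(f"malformed request line: {line!r}")
```

For the slide's example, `parse_request_line("GET /info.html HTTP/1.0\r\n")` yields the method `GET`, the URI `/info.html`, and the version `HTTP/1.0`.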

13
HTTP 1.0 (Cont.)
  • HTTP response
  • Response = Simple-Response | Full-Response
    Simple-Response = [ Entity-Body ]
    Full-Response = Status-Line
                    *( General/Response/Entity Header )
                    CRLF
                    [ Entity-Body ]
  • Example:
    HTTP/1.0 200 OK
    Date: Mon, 09 Sep 2002 06:07:53 GMT
    Server: Apache/1.3.20 (Unix) (Red-Hat/Linux) PHP/4.0.6
    Last-Modified: Mon, 29 Jul 2002 10:58:59 GMT
    Content-Length: 21748
    Content-Type: text/html

    <21748 bytes of the current version of info.html>
    (this is the document content)

14
HTTP 1.1
  • Connection management
  • Persistent connections [Mogul95]
  • Use one TCP connection for multiple HTTP requests
  • Pros
  • Reduce the overhead of connection setup and
    teardown
  • Avoid TCP slow start
  • Cons
  • Head-of-line blocking
  • Increases server state
  • Pipelining [Pad95]
  • Send multiple requests without waiting for a
    response between requests
  • Pros: avoids the round-trip delay of waiting for
    each response
  • Cons: connection aborts are harder to deal with

15
HTTP 1.1 (Cont.)
  • Caching
  • Continues to support the notion of expiration
    used in HTTP 1.0
  • Adds a cache-control header to handle the issues
    of cacheability and semantic transparency [KR01]
  • E.g., no-cache, only-if-cached, no-store, max-age,
    max-stale, min-fresh, ...
  • Others
  • Range request
  • Content negotiation
  • Security

16
Types of Web Traces
  • Application level traces
  • Server logs: CLF and ECLF formats
  • CLF format: <remoteIP, remoteID, usrName, Time, request,
    responseCode, contentLength>, e.g., 192.1.1.1, -,
    -, 8/1/2000, 10:00:00, GET /news/index.asp
    HTTP/1.1, 200, 3410
  • Proxy logs: CLF and ECLF formats
  • Client logs: no standard logging formats
  • Packet level traces
  • Collection method: monitor a network link
  • Available tools: tcpdump, libpcap, netmon
  • Concerns: packet dropping, timestamp accuracy
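
As a rough illustration of processing such a server log, the sketch below parses one record laid out like the slide's comma-separated example, assuming the time field is written as 10:00:00. The names `LogRecord` and `parse_record` are hypothetical, and real CLF logs written by servers such as Apache use a different, space-delimited layout, so this is a sketch of the slide's example only.

```python
from typing import NamedTuple


class LogRecord(NamedTuple):
    remote_ip: str
    remote_id: str
    user_name: str
    date: str
    time: str
    method: str
    uri: str
    protocol: str
    status: int
    content_length: int


def parse_record(line: str) -> LogRecord:
    """Parse one comma-separated record in the layout of the slide's example."""
    fields = [f.strip() for f in line.split(",")]
    if len(fields) != 8:
        # Assumption: no field (e.g., the URI) contains a comma.
        raise ValueError(f"expected 8 fields, got {len(fields)}")
    ip, rid, user, date, time, request, status, length = fields
    method, uri, protocol = request.split(" ")
    return LogRecord(ip, rid, user, date, time, method, uri, protocol,
                     int(status), int(length))
```

Splitting the request field into method, URI, and protocol up front makes later per-URI counting (popularity, sizes, response codes) straightforward.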

17
Tutorial Outline
  • Background
  • Web workload characterization
  • Performance diagnosis
  • Applications of traces
  • Bibliography

18
Part II: Web Workload Characterization
  • Overview of workload characterization
  • Content dynamics
  • Access dynamics
  • Common pitfalls
  • Case studies

19
Overview of Workload Characterization
  • Process of trace analyses
  • Common analysis techniques
  • Common analysis tools
  • Challenges in workload characterization

20
Process of Trace Analyses
  • Collect traces
  • where to monitor, how to collect (e.g.,
    efficiency, privacy, accuracy)
  • Determine key metrics to characterize
  • Process traces
  • Draw inferences from the data
  • Apply the traces or insights gained from the
    trace analyses to design better protocols &
    systems

21
Common Analysis Techniques - Statistics
  • Mean
  • Median
  • Geometric mean: less sensitive to outliers
  • Variance and standard deviation
  • Confidence interval
  • A range of values that has a specified
    probability of containing the parameter being
    estimated
  • Example: 95% confidence interval $10 \le x \le 20$
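
The statistics listed above can be computed with a few lines of standard-library Python. This is a minimal sketch: the function name `summarize` is hypothetical, the samples are assumed to be positive (needed for the geometric mean), and the 95% interval uses the normal approximation with z = 1.96, which is an assumption that only holds reasonably for larger sample sizes.

```python
import math
import statistics


def summarize(samples):
    """Return basic summary statistics for a list of positive numbers."""
    n = len(samples)
    mean = statistics.mean(samples)
    stdev = statistics.stdev(samples)  # sample standard deviation
    # Geometric mean: exponential of the mean of the logarithms.
    geo_mean = math.exp(sum(math.log(x) for x in samples) / n)
    # Normal-approximation 95% confidence interval for the mean (assumption).
    half_width = 1.96 * stdev / math.sqrt(n)
    return {
        "mean": mean,
        "median": statistics.median(samples),
        "geometric_mean": geo_mean,
        "variance": stdev ** 2,
        "stdev": stdev,
        "ci95": (mean - half_width, mean + half_width),
    }
```

With samples such as `[1, 2, 4, 8, 1000]`, the single outlier pulls the mean far above the median, while the geometric mean stays much closer to the bulk of the data.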

22
Common Analysis Techniques - Statistics (Cont.)
  • Cumulative distribution function (CDF)
  • Points $(x, P(X \le x))$
  • Probability density function (PDF)
  • Derivative of the CDF: $f(x) = dF(x)/dx$
  • Check for heavy-tailed distributions
  • Plot the complementary CDF on log-log axes, and check its tail
  • Example: Pareto distribution. If $\alpha \le 2$, the
    distribution has infinite variance (a heavy
    tail). If $\alpha \le 1$, the distribution has infinite mean.
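
The heavy-tail check can be sketched as follows: compute the empirical complementary CDF and fit a straight line to it on log-log axes, whose negated slope estimates the tail index. The function names are hypothetical, the fit uses every point rather than only the tail, and a plain least-squares fit is a simplification, so treat this as an illustration rather than a rigorous estimator.

```python
import math


def ccdf_points(samples):
    """Return (x, P[X > x]) pairs of the empirical distribution, sorted by x.

    Assumes distinct values; ties get slightly different probabilities.
    """
    xs = sorted(samples)
    n = len(xs)
    return [(x, (n - 1 - i) / n) for i, x in enumerate(xs)]


def tail_index(samples):
    """Estimate the tail index as the negated log-log slope of the CCDF."""
    pts = [(math.log(x), math.log(p))
           for x, p in ccdf_points(samples) if x > 0 and p > 0]
    n = len(pts)
    mean_u = sum(u for u, _ in pts) / n
    mean_v = sum(v for _, v in pts) / n
    sxx = sum((u - mean_u) ** 2 for u, _ in pts)
    sxy = sum((u - mean_u) * (v - mean_v) for u, v in pts)
    return -sxy / sxx
```

A roughly straight log-log CCDF with an estimated index of at most 2 is consistent with a heavy tail; in practice the fit is usually restricted to the upper tail of the data.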

23
Common Analysis Techniques - Data Fitting
  • Visually compare two distributions
  • Chi-squared tests [AS86, Jain91]
  • Divide the data points into k bins
  • Compute $\chi^2 = \sum_{i=1}^{k} (O_i - E_i)^2 / E_i$, where $O_i$ and
    $E_i$ are the observed and expected counts in bin $i$
  • If $\chi^2 \le \chi^2_{(\alpha, k-c)}$, then the two distributions are
    close, where $\alpha$ is the significance level and $c$ is the
    number of estimated parameters for the
    distribution + 1
  • Need enough samples
  • Kolmogorov-Smirnov tests [AS86, Jain91]
  • Compare two distributions by finding the maximum
    difference between the two variables' cumulative
    distribution functions
  • Need to fully specify the distribution
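
Both test statistics are short to compute. The sketch below is illustrative only: the function names are hypothetical, the chi-squared statistic is the standard sum of (observed - expected)^2 / expected over bins, and comparing either statistic against a critical value from a table is left to the reader.

```python
def chi_squared(observed, expected):
    """Chi-squared statistic over k bins of observed and expected counts."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def ks_statistic(samples, cdf):
    """Kolmogorov-Smirnov statistic: the largest gap between the empirical
    CDF of the samples and a fully specified model CDF."""
    xs = sorted(samples)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d
```

Note that `ks_statistic` takes the model CDF as a callable, which mirrors the requirement that the distribution be fully specified.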


24
Common Analysis Techniques - Data Fitting (Cont.)
  • Anderson-Darling test [Ste74]
  • Modification of the Kolmogorov-Smirnov test,
    giving more weight to the tails
  • If $A^2 \le$ critical value, the two distributions are
    similar; otherwise they are not, where
    $A^2 = -N - \frac{1}{N} \sum_{i=1}^{N} (2i-1) [\ln F(Y_i) + \ln(1 - F(Y_{N+1-i}))]$
    ($F$ is the CDF, and $Y_i$ are the ordered data)
  • Quantile-quantile plots [AS86, Jain91]
  • Compare two distributions by plotting the inverse
    of the cumulative distribution function $F^{-1}(x)$
    for the two variables, and find the best-fitting line
  • If the slope of the line is close to 1, and the
    y-intercept is close to 0, the two data sets are
    almost identically distributed

25
Common Analysis Tools
  • Scripting languages
  • Perl, awk, UNIX shell scripts, VB, ...
  • Databases
  • SQL, DB2, ...
  • Statistics packages
  • Matlab, S, R, SAS, ...
  • Write our own low-level programs
  • C, C++, C#, Java, ...

26
Challenges in Workload Characterization
  • Each of the Web components provides a limited
    perspective on the functioning of the Web
  • Workload characteristics vary both in space and
    in time

[Figure: clients, proxies, replicas, and servers connected through the Internet]
27
Views from Clients
  • Capture clients' requests to all servers
  • Pros
  • Know details of client activities, such as
    requests satisfied by browser caches, client
    aborts
  • The ability to record detailed information, as
    this does not impose significant load on a client
    browser
  • Cons
  • Need to modify browser software
  • Hard to deploy for a large number of clients

28
Views from Web Servers
  • Capture most clients' requests (excluding those
    satisfied by caches) to a single server
  • Pros
  • Relatively easy to deploy/change logging software
  • Cons
  • Requests satisfied by browser & proxy caches will
    not appear in the logs
  • May not log detailed information to ensure fast
    processing of client requests

29
Views from Web Proxies
  • Depending on the proxy's location
  • A proxy close to clients sees requests from a
    small client group to a large number of servers
    [KR00]
  • A proxy close to the servers sees requests from a
    large client group to a small number of servers
    [KR00]
  • Pros
  • See requests from a diverse set of clients to a
    diverse set of servers, and determine the
    popularity ranking of different Web sites
  • Useful for studying caching policies
  • Ease of collection
  • Cons
  • Requests satisfied by browser caches will not
    appear in the logs
  • May not log detailed information to ensure fast
    processing of requests

30
Workload Variation
  • Vary with measurement points
  • Vary with sites being measured
  • Information servers (news sites), e-commerce
    servers, query servers, streaming servers,
    upload servers
  • US vs. Italy, ...
  • Vary with the clients being measured
  • Internet clients vs. wireless clients
  • University clients vs. home users
  • US vs. Italy, ...
  • Vary in time
  • Day vs. night
  • Weekday vs. weekend
  • Changes with new applications, recent events
  • Evolve over time, ...

31
Part II: Web Workload
  • Overview
  • Content dynamics
  • Access dynamics
  • Common pitfalls
  • Case studies

32
Content Dynamics
  • File types
  • File size distribution
  • File update patterns
  • How often files are updated
  • How much files are updated

33
File Types
  • Text files
  • HTML, plain text, ...
  • Images
  • JPEG, GIF, bitmap, ...
  • Applications
  • JavaScript, CGI, ASP, PDF, PS, gzip, PPT, ...
  • Multimedia files
  • Audio, video

34
File Size Distribution
  • Two definitions
  • D1: size of all files on a Web server
  • D2: size of all files transferred by a Web server
  • D1 ≠ D2, because some files are transferred
    multiple times or not to completion, and other
    files are not transferred at all
  • Studies show that the distributions of file sizes
    under both definitions exhibit heavy tails (i.e.,
    $P[F > x] \sim x^{-\alpha}$, $0 < \alpha \le 2$)

35
File Update Interval
  • Varies in time
  • Hot events and fast-changing events require more
    frequent updates, e.g., the World Cup
  • Varies across sites
  • Depending on server update policies & update
    tools
  • Depending on the nature of content (e.g.,
    university sites have slower update rates than
    news sites)
  • Recent studies
  • A study of the proxy traces collected at DEC and
    AT&T in 1996 showed that the rate of change depended
    on content type, top-level domain, etc. [DFK97]
  • A study of 1999 MSNBC logs shows that modification
    history yields a rough predictor of the future
    modification interval [PQ00]

36
Extent of Change upon Modifications
  • Varies in time
  • Different events trigger different amounts of
    updates
  • Varies across sites
  • Depending on servers' update policies and update
    tools
  • Depending on the nature of the content
  • Recent studies
  • Studies of 1996 DEC and AT&T proxy traces [MDF97] and
    1999 MSNBC logs [PQ00] show that most file
    modifications are small → delta encoding can be
    very useful

37
Part II: Web Workload
  • Motivation
  • Limitations of workload measurements
  • Content dynamics
  • Access Dynamics
  • Common pitfalls
  • Case studies

38
Access Dynamics
  • File popularity distribution
  • Temporal stability
  • Spatial locality
  • User request arrivals & durations

39
Document Popularity
  • Web requests follow a Zipf-like distribution
  • Request frequency $\propto 1/i^{\alpha}$, where $i$ is a
    document's rank
  • The value of $\alpha$ depends on the point of
    measurement
  • Between 0.6 and 1 for client traces and proxy
    traces
  • Close to or larger than 1 for server traces
    [ABC96, PQ00]
  • The value of $\alpha$ varies over time (e.g., larger $\alpha$
    during hot events)
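
One common way to estimate the Zipf exponent from a trace is to count requests per document, rank documents by count, and fit a line to log(frequency) against log(rank). The sketch below does exactly that; the function name is hypothetical, and a plain least-squares fit over all ranks is an assumption rather than the method used in the cited studies.

```python
import math
from collections import Counter


def zipf_exponent(requests):
    """Estimate the Zipf exponent from a sequence of requested document ids.

    Requires at least two distinct documents.
    """
    counts = sorted(Counter(requests).values(), reverse=True)
    pts = [(math.log(rank), math.log(count))
           for rank, count in enumerate(counts, start=1)]
    n = len(pts)
    mean_u = sum(u for u, _ in pts) / n
    mean_v = sum(v for _, v in pts) / n
    sxx = sum((u - mean_u) ** 2 for u, _ in pts)
    sxy = sum((u - mean_u) * (v - mean_v) for u, v in pts)
    # Frequency is proportional to 1 / rank^exponent, so negate the slope.
    return -sxy / sxx
```

Running it separately on client, proxy, and server traces is one way to observe how the exponent depends on the measurement point.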

40
Impact of the Value of $\alpha$
  • Larger $\alpha$ means more concentrated accesses on
    popular documents → caching is more beneficial
  • 90% of the accesses are accounted for by
  • Top 36% of files in proxy traces [BCF99, PQ00]
  • Top 10% of files in small departmental server logs
    reported in [AW96]
  • Top 2-4% of files in MSNBC traces
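
The concentration figures above answer the question "what fraction of the files accounts for 90% of the accesses?". A minimal sketch of that computation follows; the function name `top_fraction_for_share` is hypothetical and the input is assumed to be a non-empty sequence of requested file identifiers.

```python
from collections import Counter


def top_fraction_for_share(requests, share=0.9):
    """Smallest fraction of distinct files whose requests cover `share`
    of all accesses, taking files in decreasing order of popularity."""
    counts = sorted(Counter(requests).values(), reverse=True)
    total = sum(counts)
    covered = 0
    for k, count in enumerate(counts, start=1):
        covered += count
        if covered >= share * total:
            return k / len(counts)
    return 1.0
```

A small result means accesses are concentrated on a few files, which is the situation in which caching or replicating only the popular files pays off most.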

41
Temporal Stability
  • Metrics
  • Coarse-grained: likely duration that a currently
    popular file remains popular
  • e.g., overlap between the set of popular
    documents on day 1 and day 2
  • Fine-grained: how soon a requested file will be
    requested again
  • e.g., LRU stack distance [ABC96]

[Figure: LRU stack of Files 1-5 before and after a repeated request to File 5, which is found at stack distance 4]
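
LRU stack distance can be computed by replaying the request sequence against a move-to-front list. This is a minimal sketch with a hypothetical function name; it uses a 1-based depth (a file re-requested immediately has distance 1) and reports `None` for first references, both of which are conventions chosen here rather than taken from the slides.

```python
def stack_distances(requests):
    """Return the LRU stack distance of each request in order.

    The stack keeps the most recently requested item at the front.
    """
    stack = []
    distances = []
    for item in requests:
        if item in stack:
            depth = stack.index(item) + 1  # 1-based position from the top
            stack.remove(item)
        else:
            depth = None  # first reference: no previous position
        stack.insert(0, item)
        distances.append(depth)
    return distances
```

Small distances dominate when temporal locality is strong, and a request with distance d would hit in an LRU cache holding at least d equally sized objects. The list operations make this quadratic in the worst case, which is acceptable for a sketch but not for very large traces.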
42
Spatial Locality
  • Refers to whether users in the same geographical
    location or at the same organization tend to
    request similar sets of documents
  • E.g., compare the degree to which requests are
    shared locally

43
Spatial Locality (Cont.)
Domain membership is significant except when
there is a hot event of global interest
44
User Request Arrivals & Duration
  • User workload at three levels
  • Session: a consecutive series of requests from a
    user to a Web site
  • Click: a user action to request a page, submit a
    form, etc.
  • Request: each click generates one or more HTTP
    requests
  • Exponential distribution [LNJV99, KR01]
  • Session duration
  • Heavy-tail distribution [KR01]
  • # clicks in a session, most in the range of 4-6
    [Mah97]
  • # embedded references in a Web page
  • Think time: time between clicks
  • Active time: time to download a Web page and its
    embedded images
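
Logs rarely mark session boundaries, so sessions are usually inferred. The sketch below groups one user's click timestamps into sessions using an inactivity timeout and derives think times. The function names and the 30-minute default timeout are assumptions made for illustration; they are not taken from the slides or the cited studies.

```python
def sessionize(timestamps, timeout=1800.0):
    """Split one user's click times (in seconds) into sessions.

    A gap longer than `timeout` starts a new session.
    """
    sessions = []
    for t in sorted(timestamps):
        if sessions and t - sessions[-1][-1] <= timeout:
            sessions[-1].append(t)
        else:
            sessions.append([t])
    return sessions


def think_times(session):
    """Time between consecutive clicks within one session."""
    return [b - a for a, b in zip(session, session[1:])]
```

The number of clicks per session is then `len(session)`, and session duration is the last timestamp minus the first, which gives the raw samples for the distributions listed above.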

45
Common Pitfalls
  • Trace analyses are all about writing scripts &
    plotting nice graphs
  • Challenges
  • Trace collection: where to monitor, how to
    collect (e.g., efficiency, privacy, accuracy)
  • Identify important metrics, and understand why
    they are important
  • Sound measurements require discipline [Pax97]
  • Dealing with errors and outliers
  • Draw implications from data analyses
  • Understanding the limitations of the traces
  • No representative traces: workload changes in
    time and in space
  • Try to diversify data sets (e.g., collect traces
    at different places and different sites) before
    jumping to conclusions
  • Drawing inferences beyond what the data show

46
Part II: Web Workload
  • Motivation
  • Limitations of workload measurements
  • Content dynamics
  • Access dynamics
  • Common pitfalls
  • Case studies
  • Boston University client log study
  • UW proxy log study
  • MSNBC server log study
  • Mobile server log study

47
Case Study I: BU Client Log Study
  • Overview
  • One of the few client log studies
  • Analyze clients' browsing patterns and their
    impact on network traffic [CBC95]
  • Approaches
  • Trace collection
  • Modify Mosaic and distribute it to machines in the CS
    Dept. at Boston Univ. to collect client traces in
    1995
  • Log format: <client machine, request time, user
    id, URI, document size, retrieval time>
  • Data analyses
  • Distribution of document size, document
    popularity
  • Relationship between retrieval latency and
    response size
  • Implications on caching strategies

48
Major Findings
  • Power law distributions
  • Distribution of document sizes
  • Distribution of user requests for documents
  • # requests to documents as a function of their
    popularity
  • Caching strategies should take document size
    into account (i.e., give preference to smaller
    documents)

49
Case Study II: UW Proxy Log Study
  • Overview
  • Proxy traces collected at the University of
    Washington
  • Approaches [WVS99a, WVS99b]
  • Trace collection: deploy a passive network
    sniffer between the Univ. of Washington and the
    rest of the Internet in May 1999
  • Set well-defined objectives
  • Understand the extent of document sharing within
    an organization and across different
    organizations
  • Understand the performance benefit of cooperative
    proxy caching

50
Major Findings
  • Members of an organization are more likely to
    request the same documents than a random set of
    clients
  • Most popular documents are globally popular
  • Cooperative caching is most beneficial for small
    organizations
  • Cooperative caching among large organizations
    yields minor improvement, if any

51
Case Study III: MSNBC Server Log Study
  • Overview of MSNBC server site
  • A large news site
  • Server cluster with 40 nodes
  • 25 million accesses a day (HTML content alone)
  • Period studied: Aug. - Oct. 99 & Dec. 17, 98
    flash crowd

52
Approaches
  • Trace collection
  • HTTP access logs
  • Content Replication System (CRS) logs
  • HTML content logs
  • Data analyses
  • Content dynamics
  • How often files are modified?
  • How to predict modification interval?
  • How much does a file change upon modification?
  • Access dynamics
  • Document popularity
  • Temporal stability
  • Spatial locality
  • Correlation between document age and popularity

53
Major Findings
  • Content dynamics
  • Modification history is a rough predictor → guide
    for setting TTL, but need an alternative mechanism
    (e.g., callback-based invalidation) as backup
  • Frequent but minimal file modifications → delta
    encoding
  • Access dynamics
  • Set of popular files remains stable for days →
    pushing/prefetching previously hot data that have
    undergone modifications
  • Domain membership has a significant bearing on
    client accesses except during a flash crowd of
    global interest → makes sense to have a proxy
    cache for an organization
  • Zipf-like distribution of file popularity but
    with a much larger $\alpha$ than at proxies → potential
    of reverse caching and replication

54
Case Study IV: Mobile Server Log Study
  • Overview of a popular commercial Web site for
    mobile clients
  • Content
  • News, weather, stock quotes, email, yellow pages,
    travel reservations, entertainment, etc.
  • Services
  • Notification
  • Browse
  • Period studied
  • 3.25 million notifications in Aug. 20 - 26, 2000
  • 33 million browse requests in Aug. 15 - 26, 2000

55
Approaches
  • Analyze by user categories
  • Cellular users
  • Browse the Web in real time using cellular
    technologies
  • Offline users
  • Download content onto their PDAs for later
    (offline) browsing, e.g. AvantGo
  • Desktop users
  • Sign up for services and specify preferences
  • Analyze by Web services
  • Browse
  • Notifications
  • Use SQL database to manage data

56
Major Findings
  • Notification Services
  • Popularity of notification messages follows a
    Zipf-like distribution, with the top 1% of notification
    objects responsible for 54-64% of total messages
    → multicast notifications
  • Exhibits geographical locality → useful to
    provide localized notification services
  • Browse Services
  • 0.1 - 0.5% of URLs account for 90% of requests → cache
    the results of popular queries
  • The set of popular URLs remains stable → cache a
    stable set of queries or optimize queries based on
    a stable workload
  • Correlation between the two services
  • Correlation is limited → influences the design of
    pricing plans

57
Tutorial Outline
  • Background
  • Web Workload
  • Performance Diagnosis
  • Applications of traces

58
Part III: Performance Diagnosis
  • Overview of performance diagnosis
  • Infer the causes of high end-to-end delay in Web
    transfers [BC00]
  • Infer the causes of high end-to-end loss rate in
    Web transfers [CDH99, DPP01, NC01, PQ02, PQW02]

59
Overview of Performance Diagnosis
  • Goal: determine internal network characteristics
  • Metrics of interest
  • Delay
  • Loss rate
  • Raw bandwidth
  • Available bandwidth
  • Traffic rate
  • Why interesting
  • Resolve the trouble spots
  • Server selection
  • Placement of mirror servers

[Figure: a Web server reached through several networks (AT&T, Sprint, MCI, UUNET, Earthlink, AOL, Qwest), with a client asking "Why so slow?"]
60
Finding the Sources of Delays
  • Goal
  • Why is my Web transfer slow? Is it because of the
    server or the network or the client?
  • Sources of delay in Web transfer
  • DNS lookup
  • Server delays
  • Client delays
  • Network delays
  • Propagation delays
  • Queuing delays
  • Delays introduced by packet losses (e.g.,
    signaled by the fast retransmit mechanism or TCP
    timeouts)

61
TCPEval Tool
  • Inputs: tcpdump packet traces taken at the
    communicating Web server and client
  • Generates a variety of statistics for file
    transactions
  • File and packet transfer latencies
  • Packet drop characteristics
  • Packet and byte counts per unit time
  • Generates both timeline and sequence plots for
    transactions
  • Generates critical path profiles and statistics
    for transactions

62
Critical Path Analysis Tool
[Figure: data flow between server and client, with the critical path marked and each segment labeled as server delay, client delay, network delay, or network delay due to packet loss]
63
Finding Sources of Packet Losses
  • Goal
  • Determine link loss rate or identify lossy links

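[Figure: tree rooted at the server, with links $l_1, \dots, l_8$ leading to clients whose end-to-end loss rates are $p_1, \dots, p_5$]
$(1-l_1)(1-l_2)(1-l_4) = (1-p_1)$
$(1-l_1)(1-l_2)(1-l_5) = (1-p_2)$
...
$(1-l_1)(1-l_3)(1-l_8) = (1-p_5)$
Under-constrained system of equations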
64
Approaches
  • Active probing
  • Probing
  • Multicast probes
  • Striped unicast probes
  • Inference technique: EM, a numerical algorithm
    to compute $\theta$ that maximizes $P(D \mid \theta)$, where $D$ are the
    observations and $\theta$ is the ensemble of link loss rates

[Figure: source S sending probes to receivers A and B]
65
Approaches (Cont.)
  • Passive monitoring
  • Random sampling
  • Randomly sample the solution space, and draw
    conclusions based on the samples
  • Akin to Monte Carlo sampling
  • Linear optimization
  • Determine a unique solution by optimizing an
    objective function
  • Gibbs sampling
  • Determine $P(\theta \mid D)$ by drawing samples, where $\theta$ is the
    ensemble of loss rates of links in the network,
    and $D$ is the observed packet transmissions and losses
    at the clients
  • EM
  • A numerical algorithm to compute $\theta$ that maximizes
    $P(D \mid \theta)$

66
Other Performance Studies using Web traces
  • Characterize Internet performance (e.g., spatial &
    temporal locality) [BSS97]
  • Study the behavior of TCP during Web transfers
    [BPS98]
  • Reconstruct different client page accesses and
    measure performance characteristics for the
    accesses [FCT02]

67
Tutorial Outline
  • Background
  • Web Workload
  • Performance Diagnosis
  • Applications of traces
  • Bibliography

68
Part IV: Applications of Traces
  • Synthetic workload generation
  • Cache design
  • Cache replacement policies [CI97, BCF99]
  • Cache consistency algorithms [LC97, YBS99, YAD01]
  • Cooperative cache or not [WVS99]
  • Cache infrastructure
  • Pre-fetching algorithms [CB98, FJC99]
  • Placement of Web proxies/replicas [QPV01]
  • Other optimizations
  • Improving TCP for Web transfers
    [Mah97, PK98, ZQK00]
  • Concurrent downloads, pipelining, compression, ...
69
Synthetic Workload Generation
  • Generate user requests
  • Generate user sessions using a Poisson arrival
    process
  • For each user session, determine # clicks using a
    Pareto distribution
  • Assign a click to a request for a Web page, while
    making sure
  • The popularity distribution of files follows a
    Zipf-like distribution [BC98]
  • Capture the temporal locality of successive
    requests for the same resource
  • Generate a next click from the same user with
    think time following a Pareto distribution
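
The request-generation steps above can be sketched as follows. This is a simplified illustration rather than how SURGE or any other cited generator is implemented: the function name and all parameter defaults are hypothetical, pages are drawn independently from a Zipf-like distribution, and temporal locality between successive requests is not modeled.

```python
import bisect
import itertools
import random


def generate_requests(n_sessions, n_pages, rate=1.0, zipf_alpha=1.0,
                      clicks_alpha=1.5, think_alpha=1.5, seed=0):
    """Return a time-sorted list of (time, user, page) click events."""
    rng = random.Random(seed)
    # Zipf-like popularity: weight of the page with rank r is 1 / r^alpha.
    weights = [1.0 / rank ** zipf_alpha for rank in range(1, n_pages + 1)]
    cumulative = list(itertools.accumulate(weights))
    events = []
    start = 0.0
    for user in range(n_sessions):
        # Poisson session arrivals: exponential inter-arrival times.
        start += rng.expovariate(rate)
        # Pareto-distributed number of clicks (at least one).
        clicks = max(1, int(rng.paretovariate(clicks_alpha)))
        t = start
        for _ in range(clicks):
            # Inverse-CDF sampling of a page index (0 is the most popular).
            page = bisect.bisect_left(cumulative, rng.random() * cumulative[-1])
            events.append((t, user, page))
            # Pareto-distributed think time before the next click.
            t += rng.paretovariate(think_alpha)
    events.sort()
    return events
```

Fixing the seed makes a generated workload reproducible, which helps when comparing system variants under the same synthetic load.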

70
Synthetic Workload Generation (Cont.)
  • Generate Web pages
  • Determine the number of Web pages
  • Generate the size of each Web page using a
    log-normal distribution
  • Associate a page with some number of embedded
    pages using an empirical distribution
    (heavy-tail)
  • Generate file modification events
  • Examples of generators
  • WebBench [Wbe], WebStone [TS95], SURGE [BC98],
    SPECweb99 [SP99], Web Polygraph [WP], ...

71
Cache Replacement Policies
  • Problem formulation
  • Given a fixed size cache, how to evict pages to
    maximize the hit ratio once the cache is full?
  • Hit ratio
  • Fraction of requests satisfied by the cache
  • Fraction of the total size of requested data
    satisfied by the cache
  • Factors to consider
  • Request frequency
  • Modification frequency
  • Benefit of caching: reduction in latency & BW
  • Cost of caching: storage
  • Caveat: NOT all hits are equal. Hit ratios do NOT
    map directly to performance improvement.

72
Cache Replacement Policies (Cont.)
  • Approaches
  • Least recently used (LRU)
  • Least frequently used (LFU)
  • Perfect: maintain counters for all pages seen
  • In-cache: maintain counters only for pages that
    are in cache
  • GreedyDual-Size [CI97]
  • Assign a utility value to each object, and
    replace the one with the lowest utility
  • Use of traces
  • Evaluate the algorithms using trace-driven
    simulations or synthetic workload
  • Analytically derive the hit ratios for different
    replacement policies based on a workload model
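
A trace-driven simulation of the simplest policy above, LRU, can be sketched in a few lines. The function name is hypothetical, the trace is assumed to be a non-empty sequence of (url, size) pairs with a fixed size per URL, and objects larger than the cache are simply never cached; these are simplifying assumptions for illustration.

```python
from collections import OrderedDict


def simulate_lru(trace, capacity):
    """Replay (url, size) requests against an LRU cache of `capacity` bytes.

    Returns (hit ratio, byte hit ratio).
    """
    cache = OrderedDict()  # url -> size, least recently used first
    used = hits = hit_bytes = total = total_bytes = 0
    for url, size in trace:
        total += 1
        total_bytes += size
        if url in cache:
            hits += 1
            hit_bytes += size
            cache.move_to_end(url)  # mark as most recently used
            continue
        if size > capacity:
            continue  # object cannot fit in the cache at all
        while used + size > capacity:
            _, evicted_size = cache.popitem(last=False)  # evict LRU object
            used -= evicted_size
        cache[url] = size
        used += size
    return hits / total, hit_bytes / total_bytes
```

Reporting both ratios matters because a policy that favors small objects can raise the request hit ratio while lowering the byte hit ratio.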

73
Placement of Web Proxies/Replicas
  • Problem formulation [JJK01, QPV01]
  • How to place a fixed number of proxies/replicas
    to minimize users' request latency
  • Factors to consider
  • Spatial distribution of requests
  • Temporal stability of requests
  • Stability in popularity of objects
  • Stability in spatial distribution of requests

74
Placement of Web Proxies/Replicas (Cont.)
  • Approaches
  • Greedy placement
  • Hot-spot placement
  • Random placement
  • Use of traces
  • Trace-driven simulations
  • High concentration of requests to a small number
    of objects → focus on replicating only popular
    objects
  • Temporal stability in requests → no need to
    frequently change the locations of
    proxies/replicas
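
Greedy placement can be sketched as repeatedly adding the candidate site that most reduces total demand-weighted cost. The function name, the nested-dictionary cost input, and the use of per-client demand as weights are assumptions made for illustration; the slides name the approach but do not specify these details.

```python
def greedy_placement(candidates, clients, cost, demand, k):
    """Choose up to k sites, one at a time, each minimizing total cost.

    cost[client][site] is the latency from client to site, and
    demand[client] is that client's request volume (both hypothetical inputs).
    """
    chosen = []
    for _ in range(k):
        best, best_total = None, None
        for site in candidates:
            if site in chosen:
                continue
            sites = chosen + [site]
            # Each client is served by its closest chosen site.
            total = sum(demand[c] * min(cost[c][s] for s in sites)
                        for c in clients)
            if best_total is None or total < best_total:
                best, best_total = site, total
        if best is None:
            break  # fewer candidates than k
        chosen.append(best)
    return chosen
```

In a trace-driven evaluation, `demand` would come from per-client request counts in the trace, which is where the spatial distribution and temporal stability of requests enter.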

75
References
  • [AS86] R. B. D'Agostino and M. A. Stephens.
    Goodness-of-Fit Techniques. Marcel Dekker, New
    York, NY, 1986.
  • [ABC96] Virgilio Almeida, Azer Bestavros, Mark
    Crovella, and Adriana de Oliveira. Characterizing
    reference locality in the WWW. In Proceedings of
    1996 International Conference on Parallel and
    Distributed Information Systems (PDIS'96),
    December 1996.
  • [ABQ01] A. Adya, P. Bahl, and L. Qiu. Analyzing
    Browse Patterns of Mobile Clients. In Proc. of
    SIGCOMM Measurement Workshop, Nov. 2001.
  • [ABQ02] A. Adya, P. Bahl, and L. Qiu.
    Characterizing Alert and Browse Services for
    Mobile Clients. In Proc. of USENIX, Jun. 2002.
  • [AL01] P. Albitz and C. Liu. DNS and BIND (4th
    Edition), O'Reilly & Associates, Apr. 2001.
  • [AW97] M. Arlitt and C. Williamson. Internet Web
    Servers: Workload Characterization and
    Performance Implications. IEEE/ACM Transactions
    on Networking, Vol. 5, No. 5, pp. 631-645,
    October 1997.
  • [BC98] P. Barford and M. Crovella. Generating
    representative workloads for network and server
    performance evaluation. In Proc. of SIGMETRICS,
    1998.

76
References (Cont.)
  • [BBC98] P. Barford, A. Bestavros, M. Crovella,
    and A. Bradley. Changes in Web Client Access
    Patterns: Characteristics and Caching
    Implications. Special Issue on World Wide Web
    Characterization and Performance Evaluation,
    World Wide Web Journal, December 1998.
  • [BCF99] L. Breslau, P. Cao, L. Fan, G. Phillips,
    and S. Shenker. Web Caching and Zipf-like
    Distributions: Evidence and Implications. In
    Proc. of INFOCOM, Mar. 1999.
  • [BC00] P. Barford and M. Crovella. Critical Path
    Analysis of TCP Transactions. In Proc. of ACM
    SIGCOMM, Aug. 2000.
  • [BLFF96] T. Berners-Lee, R. Fielding, and H.
    Frystyk. Hypertext Transfer Protocol -- HTTP/1.0.
    RFC 1945, May 1996.
  • [BPS98] H. Balakrishnan, V. N. Padmanabhan, S.
    Seshan, M. Stemm, and R. H. Katz. TCP Behavior of
    a Busy Internet Server: Analysis and
    Improvements. In Proc. of INFOCOM, San
    Francisco, CA, USA, March 1998.
  • [BSS97] H. Balakrishnan, S. Seshan, M. Stemm,
    and R. H. Katz. Analyzing Stability in Wide-Area
    Network Performance. In Proc. of SIGMETRICS, Jun.
    1997.

77
References (Cont.)
  • [CDH99] R. Caceres, N. G. Duffield, J. Horowitz,
    D. Towsley, and T. Bu. Multicast-Based Inference of
    Network Internal Loss Characteristics. In Proc.
    of INFOCOM, Mar. 1999.
  • [CB98] M. Crovella and P. Barford. The network
    effects of prefetching. In Proc. of INFOCOM,
    1998.
  • [CBC95] C. R. Cunha, A. Bestavros, and M. E.
    Crovella. Characteristics of WWW client-based
    traces. Technical Report BU-CS-95-010, CS Dept.,
    Boston University, 1995.
  • [CI97] P. Cao and S. Irani. Cost-Aware WWW proxy
    caching algorithms. In Proc. of USITS, Dec. 1997.
  • [DFK97] F. Douglis, A. Feldmann, B.
    Krishnamurthy, and J. Mogul. Rate of change and
    other metrics: a live study of the World Wide
    Web. In Proc. of USITS, 1997.
  • [DPP01] N. G. Duffield, F. Lo Presti, V. Paxson,
    and D. Towsley. In Proc. of INFOCOM, Apr. 2001.
  • [FCD99] A. Feldmann, R. Caceres, F. Douglis, and
    M. Rabinovich. Performance of Web Proxy Caching
    in heterogeneous bandwidth environments. In Proc.
    of INFOCOM, March 1999.

78
References (Cont.)
  • [FJC99] L. Fan, Q. Jacobson, P. Cao, and W. Lin.
    Web Prefetching Between Low-Bandwidth Clients and
    Proxies: Potential and Performance. In Proc. of
    SIGMETRICS, 1999.
  • [FCT02] Y. Fu, L. Cherkasova, W. Tang, and A.
    Vahdat. EtE: Passive End-to-End Internet Service
    Performance Monitoring. In Proc. of USENIX, Jun.
    2002.
  • [GMF99] J. Gettys, J. Mogul, H. Frystyk, L.
    Masinter, P. Leach, and T. Berners-Lee. Hypertext
    Transfer Protocol -- HTTP/1.1. RFC 2616, Jun.
    1999.
  • [JK88] V. Jacobson and M. J. Karels. Congestion
    Avoidance and Control. In Proc. of SIGCOMM, Aug.
    1988.
  • [JJK01] S. Jamin, C. Jin, A. R. Kurc, D. Raz,
    and Y. Shavitt. Constrained Mirror Placement on
    the Internet. In Proc. of INFOCOM, Apr. 2001.
  • [Jain91] R. Jain. The Art of Computer Systems
    Performance Analysis. John Wiley and Sons, 1991.
  • [Kel02] T. Kelly. Thin-Client Web Access
    Patterns: Measurements from a Cache-Busting
    Proxy. Computer Communications, Vol. 25, No. 4
    (March 2002), pp. 357-366.
  • [KR01] B. Krishnamurthy and J. Rexford. Web
    Protocols and Practice: HTTP/1.1, Networking
    Protocols, Caching, and Traffic Measurement.
    Addison-Wesley, May 2001.

79
References (Cont.)
  • [LC97] C. Liu and P. Cao. Maintaining Strong
    Cache Consistency in the World-Wide Web. In Proc.
    of ICDCS'97, pp. 12-21, May 1997.
  • [LNJV99] Z. Liu, N. Niclausse, and C.
    Jalpa-Villanueva. Web Traffic Modeling and
    Performance Comparison Between HTTP 1.0 and HTTP
    1.1. In Erol Gelenbe, editor, System Performance
    Evaluation: Methodologies and Applications. CRC
    Press, Aug. 1999.
  • [Mah97] Bruce Mah. An empirical model of HTTP
    network traffic. In Proc. of INFOCOM, April 1997.
  • [Mogul95] Jeffrey C. Mogul. The Case for
    Persistent-Connection HTTP. In Proc. of SIGCOMM '95,
    pp. 299-313, Cambridge, MA, August 1995.
  • [MDF97] J. C. Mogul, F. Douglis, A. Feldmann,
    and B. Krishnamurthy. Potential benefits of
    delta-encoding and data compression for HTTP. In
    Proc. of SIGCOMM, September 1997.
  • [NC01] R. Nowak and M. Coates. Unicast Network
    Tomography using the EM algorithm. Submitted to
    IEEE Transactions on Information Theory, Dec.
    2001.
  • [Pad95] V. N. Padmanabhan. Improving World Wide
    Web Latency. Technical Report UCB/CSD-95-875,
    University of California, Berkeley, May 1995.

80
References (Cont.)
  • [PQ00] V. N. Padmanabhan and L. Qiu. The Content
    and Access Dynamics of a Busy Web Server. In
    Proc. of SIGCOMM, Aug. 2000.
  • [PQ02] V. N. Padmanabhan and L. Qiu. Network
    Tomography using Passive End-to-End Measurements.
    DIMACS on Internet and WWW Measurement, Mapping
    and Modeling, Feb. 2002.
  • [PQW02] V. N. Padmanabhan, L. Qiu, and H. J.
    Wang. Passive Network Tomography using Bayesian
    Inference. Internet Measurement Workshop, Nov.
    2002.
  • [QPV01] L. Qiu, V. N. Padmanabhan, and G. M.
    Voelker. On the Placement of Web Server Replicas.
    In Proc. of INFOCOM, Apr. 2001.
  • [SP99] SPECweb99 Benchmark.
    http://www.spec.org/osg/web99/.
  • [Pax98] V. Paxson. An Introduction to Internet
    Measurement and Modeling. SIGCOMM'98 tutorial,
    August 1998.
  • [Ste74] M. A. Stephens. EDF Statistics for
    Goodness of Fit and Some Comparisons. Journal of
    the American Statistical Association, Vol. 69,
    pp. 730-737.
  • [TS95] G. Trent and M. Sake. WebStone: The First
    Generation in HTTP Server Benchmarking, Feb.
    1995. http://www.mindcraft.com/webstone/paper.html.

81
References (Cont.)
  • [Wbe] WebBench.
    http://www.zdnet.com/etestinglabs/stories/benchmarks/0,8829,2326243,00.html.
  • [WP] Web Polygraph: Proxy performance benchmark.
    http://polygraph.ircache.net/.
  • [WVS99a] A. Wolman, G. Voelker, N. Sharma, N.
    Cardwell, M. Brown, T. Landray, D. Pinnel, A.
    Karlin, and H. Levy. Organization-Based Analysis
    of Web-Object Sharing and Caching. In Proc. of
    the Second USENIX Symposium on Internet
    Technologies and Systems, Boulder, CO, October
    1999.
  • [WVS99b] A. Wolman, G. M. Voelker, N. Sharma, N.
    Cardwell, A. Karlin, and H. M. Levy. On the scale
    and performance of cooperative Web proxy caching.
    In Proc. of the 17th ACM Symposium on Operating
    Systems Principles, Kiawah Island, SC, Dec. 1999.
  • [YAD01] J. Yin, L. Alvisi, M. Dahlin, and A. Iyengar.
    Engineering server-driven consistency for large
    scale dynamic services.
  • [YBS99] H. Yu, L. Breslau, and S. Shenker. A
    Scalable Web Cache Consistency Architecture. In
    Proc. of SIGCOMM, August 1999.

82
Acknowledgement
  • Thanks to Alec Wolman and Yin Zhang
  • for their helpful comments.

83
Thank you!
http://www.research.microsoft.com/liliq/talks/tutorial-perf2002.ppt