1
MEMORY PERFORMANCE EVALUATION OF HIGH THROUGHPUT SERVERS
Garba Ya'u Isa
Master's Thesis Oral Defense
Computer Engineering, King Fahd University of Petroleum & Minerals
Saturday, 7th June 2003
2
Outline
  • Introduction
  • Problem Statement
  • Analysis of Memory Accesses
  • Measurement Based Performance Evaluation
  • Design and Implementation of Prototype
  • Contributions
  • Conclusions
  • Future Work

3
Introduction
  • Processor and memory performance discrepancy
  • Growing network bandwidth
  • Data rates in Terabits per second possible
  • Gigabit per second LANs already deployed
  • High throughput servers in network infrastructure
  • Streaming media servers
  • Web servers
  • Software Routers

4
Dealing with Performance Gap
  • Hierarchical memory architecture
  • temporal locality
  • spatial locality
  • Constraints
  • Characteristics of network payload data
  • Large → won't fit into cache
  • Hardly reusable → poor temporal locality

5
Problem Statement
  • Network servers should
  • Deliver high throughput
  • Respond to requests with low latency
  • Serve a large number of clients
  • Our goal
  • Identify the specific conditions at which server memory becomes a bottleneck
  • Includes
  • cache,
  • main memory, and
  • virtual memory
  • Benefits
  • Better server designs that alleviate memory bottlenecks
  • Optimal performance can be achieved
  • Constraints
  • Large amounts of data flow through the CPU and memory
  • Writing code to optimize memory utilization is a challenge

6
Analysis of Memory Accesses Data Flow Analysis
  • Four data transfer paths
  • Memory-CPU
  • Memory-memory
  • Memory-I/O
  • Memory-network

7
Latency Model and Memory Overhead
  • Each transaction involves
  • CPU cycles
  • Data transfers: one or more of the four identified types
  • Transaction latency
  • Ttrans = Tcpu + n1·Tm-c + n2·Tm-m + n3·Tm-disk + n4·Tm-net
  • Tcpu → total CPU time needed for the transaction
  • Tm-c → time to transfer the entire PDU from memory to CPU for processing
  • Tm-m → latency of a memory-memory copy of a PDU
  • Tm-disk → latency of a memory-I/O read/write of a block of data
  • Tm-net → latency of a memory-network read/write of a PDU
  • ni → number of data movement operations of each type
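As a concrete illustration of this model, a minimal C sketch follows (this is not code from the thesis; the PDU size, bandwidths, and movement counts in main() are illustrative assumptions):

    #include <stdio.h>

    /* Latency model: Ttrans = Tcpu + n1*Tmc + n2*Tmm + n3*Tmdisk + n4*Tmnet */
    typedef struct {
        double t_cpu;    /* total CPU time for the transaction (usec) */
        double t_mc;     /* memory-CPU transfer of one PDU (usec) */
        double t_mm;     /* memory-memory copy of one PDU (usec) */
        double t_mdisk;  /* memory-I/O read/write of one block (usec) */
        double t_mnet;   /* memory-network read/write of one PDU (usec) */
    } latency_params;

    /* n[0..3]: counts of the four data movement types */
    double transaction_latency(const latency_params *p, const int n[4])
    {
        return p->t_cpu + n[0] * p->t_mc + n[1] * p->t_mm
                        + n[2] * p->t_mdisk + n[3] * p->t_mnet;
    }

    int main(void)
    {
        double S  = 1500.0;  /* assumed PDU size in bytes */
        double Bi = 3200.0;  /* assumed internal bus bandwidth, MB/s */
        double Be = 528.0;   /* assumed I/O bus bandwidth, MB/s */
        latency_params p = {
            .t_cpu   = 0.0,               /* negligible, per the assumptions below */
            .t_mc    = S / (32.0 * Bi),   /* Tm-c = S/(32·Bi) */
            .t_mm    = 2.0 * S / Bi,      /* Tm-m = 2S/Bi */
            .t_mdisk = S / Be,            /* bounded by the I/O bus */
            .t_mnet  = S / Be             /* Tm-net = S/Be */
        };
        int n[4] = {1, 1, 0, 2};          /* e.g. one CPU pass, one copy, two net transfers */
        printf("Ttrans = %.3f usec\n", transaction_latency(&p, n));
        return 0;
    }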

8
Memory-CPU Transfers
  • PDU Processing
  • checksum computation and header updating
  • Typically, one-way data flow (memory to CPU via cache)
  • Memory stall cycles
  • Number of memory stall cycles = (IC)(AR)(MR)(MP)
  • Cache miss rate
  • Worst case: MR = 1 (not as bad as it sounds!)
  • Best case: MR = 0 (trivial)

9
Memory-CPU Transfers cont.
  • Cache overhead in various cases
  • Worst case: MR = 1, MP = 10, and (MR)(MP) = 10
  • Best case: MR = 0 → trivial
  • Average case: MR = 0.1, MP = 10, and (MR)(MP) ≈ 1
  • Memory-CPU latency depends on internal bus bandwidth
  • Tm-c = S/(32·Bi) µs, where S is the PDU size and Bi is the internal bus bandwidth in MB/s

10
Memory-Memory Transfers
  • Memory-memory transfer
  • Due to memory copy of PDU between protocol layers
  • Transfers through caches and CPU
  • Stride = 1 (contiguous)
  • Transfer involves memory→cache→CPU→cache→memory data movement
  • Latency
  • Dependent on internal (system) bus bandwidth
  • Tm-m = 2S/Bi µs
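For a sense of scale (illustrative numbers, not thesis measurements): with a 1500-byte PDU and Bi = 3200 MB/s, Tm-m = 2 × 1500 / 3200 ≈ 0.94 µs, so every protocol-layer copy avoided saves roughly one microsecond per PDU.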

11
Memory-I/O and Memory-Network Transfers
  • Memory-network transfers
  • Pass over the I/O bus
  • DMA can be used
  • Again, stride = 1 (contiguous)
  • Latency
  • Limiting factor is the I/O bus bandwidth
  • Tm-net = S/Be µs, where Be is the external (I/O) bus bandwidth in MB/s

12
Latency of Reference Applications
  • RTP Transaction Latency (Equation 1)
  • HTTP Transaction Latency (Equation 2)
  • IP Transaction Latency (Equation 3)
13
Peak Throughputs
  • Assumptions
  • CPU processing latency is negligible compared to data transfer latency and can be ignored
  • Bus contention among multiple simultaneously executing transactions adds no extra overhead
  • Server throughput = S/T
  • S = size of transaction data
  • T = latency of a transaction, given by Equations 1, 2, and 3

14
Peak Throughputs cont.

  Processor                      Internal bus BW (MB/s)   IP forwarding (Mbit/s)   HTTP (Mbit/s)   RTP streaming (Mbit/s)
  Intel Pentium IV 3.06 GHz      3200                     4264                     3640            3640
  AMD Athlon XP 3000+            2700                     4264                     3291            3291
  MIPS R16000 700 MHz            3200                     4264                     3640            3640
  Sun UltraSPARC III 900 MHz     1200                     4264                     1862            1862
15
Measurement-Based Performance Evaluation
  • Experimental Testbed
  • Dual-boot server (Pentium IV 2.0 GHz)
  • 256 MB RAM
  • 1.0 Gbps NIC
  • Closed LAN (Cisco Catalyst 3550 1.0 Gbps switch)
  • Tools
  • Intel VTune
  • Windows Performance Monitor
  • Netstat
  • Linux tools: vmstat, sar, iostat

16
Platforms and Applications
  • Platforms
  • Linux (kernel 2.4.7-10)
  • Windows 2000
  • Applications
  • Streaming media servers
  • Darwin Streaming Server
  • Windows Media Server
  • Web servers
  • Apache web server
  • Microsoft Internet Information Server (IIS)
  • Software router
  • Linux kernel IP forwarding

17
Analysis of Operating System Role
  • Memory Throughput Test
  • ECT (extended copy transfer) / memperf (sketched below)
  • Locality of reference
  • temporal locality: varying working set size (block size)
  • spatial locality: varying access pattern (strides)
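A minimal C sketch of such a throughput test, in the spirit of memperf (this is not the actual ECT/memperf code; buffer sizes and repetition counts are illustrative):

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    static volatile long sink;  /* keeps the read loop from being optimized away */

    /* Achieved read bandwidth (MB/s) over `size` bytes with a given stride.
       Varying size probes temporal locality (working set vs. cache size);
       varying stride probes spatial locality. */
    static double read_bandwidth(size_t size, size_t stride, int reps)
    {
        char *buf = malloc(size);
        long sum = 0;
        struct timespec t0, t1;

        for (size_t i = 0; i < size; i++)
            buf[i] = (char)i;                    /* warm the buffer */

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int r = 0; r < reps; r++)
            for (size_t i = 0; i < size; i += stride)
                sum += buf[i];
        clock_gettime(CLOCK_MONOTONIC, &t1);
        sink = sum;

        double sec   = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        double bytes = (double)reps * (double)(size / stride);
        free(buf);
        return bytes / sec / 1e6;
    }

    int main(void)
    {
        /* Sweep the working set from 4 KB to 32 MB at stride 1 */
        for (size_t size = 4096; size <= (32u << 20); size <<= 1)
            printf("%8zu KB: %.1f MB/s\n", size / 1024,
                   read_bandwidth(size, 1, 8));
        return 0;
    }

The reported bandwidth drops sharply once the working set exceeds each cache level, which is exactly the temporal-locality effect the test above is meant to expose.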

18
Analysis of Operating System Role cont.
  • Context switching overhead

19
Streaming Media Servers
  • Experimental Design
  • Factors
  • Number of streams (streaming clients)
  • Media encoding rate (56 kbps and 300 kbps)
  • Stream distribution (unique vs. multiple media objects)
  • Metrics
  • Cache misses (L1 and L2 cache)
  • Page fault rate
  • Throughput
  • Benchmarking Tools
  • DSS: streaming load tool
  • WMS: media load simulator

20
Cache Performance
  • L1 cache misses (56 kbps)

21
Cache Performance cont.
  • L1 cache misses (300 kbps)

22
Memory Performance
  • Page fault (300 kbps)

23
Throughput
  • Throughput (300 kbps)

24
Summary: Streaming Media Server Memory Performance
  • Cache performance (both L1 and L2) degrades most when the number of clients is large and the encoding rate is 300 kbps with multiple media objects.
  • When clients demand unique media objects, the page fault rate is constant. However, when requests are for multiple objects, the page fault rate increases with the number of clients.
  • Throughput increases with the number of clients. The higher encoding rate (300 kbps) also yields higher throughput. Darwin Streaming Server has lower throughput than Windows Media Server.

25
Web Servers
  • Experimental Design
  • Factors
  • Number of web clients
  • Document size
  • Metrics
  • Cache misses (L1 and L2 cache)
  • Page fault rate
  • Throughput
  • Transactions/sec (connection rate)
  • Average latency
  • Benchmarking Tool
  • WebStone

26
Transactions
27
L1 Cache Miss
28
Page Fault
29
Throughput
30
Summary: Web Server Memory Performance Evaluation
Comparing Apache and IIS for an average file size of 10K:

  Attribute                           Apache    IIS
  Max. transaction rate (conn/sec)    2586      4178 (58% more than Apache)
  Max. throughput (Mbps)              217       349 (62% more than Apache)
  CPU utilization (%)                 71        63
  L1 misses (millions)                424       200
  L2 misses (millions)                1673      117
  Page fault rate (pfs/sec)           < 10      < 10
31
Software Router
  • Experimental Design
  • Factors
  • Routing configurations
  • TCP message size (64 bytes, 10 Kbytes, and 64 Kbytes)
  • Metrics
  • Throughput
  • Number of context switches
  • Number of active pages
  • Benchmarking Tool
  • Netperf

32
Software Router Throughput
33
CPU Utilization
34
Context Switching
35
Active Page
36
Summary: Software Router Performance Evaluation
  • Maximum throughput of 449 Mbps for configuration number 2 (full-duplex one-to-one communication).
  • Highest CPU utilization was 84%.
  • Highest context-switching rate was 5378/sec.
  • Number of active pages was fairly uniformly distributed, indicating low memory activity.

37
Design, Implementation and Evaluation of
Prototype DB-RTP Server
Architecture
  • Implementation
  • Linux platform (C)
  • Our implementation of RTSP/RTP (why?)

38
Double Buffering and Synchronization
Buffer read
Buffer write
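The slide's diagram (buffer read vs. buffer write) is not reproduced here. Below is a minimal pthread sketch of the double-buffering idea: one thread fills a buffer half while the other drains the previously filled half. The buffer size and the helpers read_media_block()/send_rtp_packets() are hypothetical stand-ins, not the thesis code:

    #include <pthread.h>

    #define BUF_SIZE 65536

    /* Hypothetical helpers standing in for disk and network I/O */
    int  read_media_block(char *buf, int max);
    void send_rtp_packets(const char *buf, int len);

    static char buf[2][BUF_SIZE];   /* the double buffer */
    static int  len[2];             /* valid bytes in each half */
    static int  ready[2];           /* 1 = filled and awaiting send */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;

    /* "Buffer write" side: fills the two halves alternately from disk */
    void *reader(void *arg)
    {
        for (int i = 0; ; i ^= 1) {
            pthread_mutex_lock(&lock);
            while (ready[i])                 /* wait until sender drained it */
                pthread_cond_wait(&cond, &lock);
            pthread_mutex_unlock(&lock);

            len[i] = read_media_block(buf[i], BUF_SIZE);

            pthread_mutex_lock(&lock);
            ready[i] = 1;                    /* hand the half to the sender */
            pthread_cond_signal(&cond);
            pthread_mutex_unlock(&lock);
        }
        return arg;
    }

    /* "Buffer read" side: streams filled halves as RTP packets */
    void *sender(void *arg)
    {
        for (int i = 0; ; i ^= 1) {
            pthread_mutex_lock(&lock);
            while (!ready[i])                /* wait until reader filled it */
                pthread_cond_wait(&cond, &lock);
            pthread_mutex_unlock(&lock);

            send_rtp_packets(buf[i], len[i]);

            pthread_mutex_lock(&lock);
            ready[i] = 0;                    /* return the half to the reader */
            pthread_cond_signal(&cond);
            pthread_mutex_unlock(&lock);
        }
        return arg;
    }

While one half is being filled from disk, the other is being transmitted, hiding disk latency behind network sends; this is the latency-hiding effect behind the DB-RTP server's throughput and jitter results summarized below.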
39
RTP Server Throughput
40
Jitter
41
Summary: DB-RTP Server Performance Evaluation
  • Throughput
  • DB-RTP server: 63.85 Mbps
  • RTP server: 59 Mbps
  • Both servers exhibit steady jitter, but the DB-RTP server has lower jitter than the RTP server.

42
Contributions
  • Cache overhead analysis
  • Memory latency and bandwidth analysis
  • Measurement-based performance evaluation
  • Design, implementation, and evaluation of a prototype streaming server: the Double Buffer RTP (DB-RTP) server

43
Conclusions
  • High throughput is possible with server design enhancements.
  • Server throughput is significantly degraded by excessive cache misses and page faults.
  • Latency hiding with pre-fetching and buffering can improve throughput and jitter performance.

44
Future Work
  • Server Development
  • hybrid multiplexing and multithreading
  • Special architectures (network processors, ASICs)
  • resource scheduling
  • investigation of the role of I/O
  • use of IRAM (intelligent RAM) architectures
  • integrated network infrastructure server

45
Thank you
46
Array restructuring
Array Padding
Loop nest transformation
47
Testbeds
Software router testbed
Streaming media/web server testbed
48
Communication Configurations
49
Backup slides
50
Memory Performance
Page fault (300 kbps and 56 kbps)
51
Streaming Server CPU Utilization
52
Cache Performance cont.
  • L2 cache misses (56 kbps)

53
Cache Performance cont.
  • L2 cache misses (300 kbps)

54
Web Servers
Transaction
Cache performance
L2 cache misses
L1 cache misses
55
Web Servers
Latency
CPU Utilization
56
DB-RTP Server
L2 cache misses
L1 cache misses
CPU Utilization
57
Memory Performance Evaluation Methodologies
  • Analytical
  • Requires just paper and pencil
  • Accuracy?
  • Simulation
  • Requires programming
  • Time and cost?
  • Measurement
  • Real system or a prototype required
  • Using on-chip counters
  • Benchmarking tools
  • More accurate

58
Server Performance Tuning
  • Memory performance tuning (array padding and loop interchange are sketched below)
  • Array padding
  • Array restructuring
  • Loop nest transformation
  • Latency hiding and multithreading
  • EPIC (IA-64)
  • VIRAM
  • Impulse
  • Multiprocessing and clustering
  • Task parallelization
  • E.g. Panama cluster router
  • Special architectures
  • Network processors
  • ASICs and data flow architectures
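To make the first group concrete, a small C sketch (illustrative, not from the thesis) of array padding and a loop nest transformation (loop interchange):

    #define N 1024

    /* Array padding: with an exact power-of-two row size, corresponding
       rows of a and b can map to the same cache sets and repeatedly evict
       each other; a small pad (8 doubles = one 64-byte line) shifts the
       mapping and breaks the conflict pattern. */
    double a[N][N + 8];
    double b[N][N + 8];

    /* Loop nest transformation (interchange): with j innermost, each row
       is walked contiguously (stride 1), exploiting spatial locality. */
    void copy_good(void)
    {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                b[i][j] = a[i][j];    /* stride-1 inner loop */
    }

    /* The i-innermost variant strides a whole padded row (N + 8 doubles)
       per access and misses in cache far more often. */
    void copy_bad(void)
    {
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                b[i][j] = a[i][j];    /* large-stride inner loop */
    }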

59
  • Temporal vs. spatial locality
  • A PDU lacks temporal locality
  • Observation: PDU processing exhibits excellent spatial locality
  • Suppose a data cache line is 32 bytes (or 16 words) long
  • Sequential accesses with stride = 1
  • Accessing one word brings in the other 15 words as well
  • Thus, effective MR = 1/16 ≈ 6.2% → better than even scientific apps
  • Thus, generally MR = W/L
  • W = width of each memory access (in bytes)
  • L = length of each cache line (in bytes)
  • Validation of the above observation
  • Similar spatial locality characteristics reported via measurements
  • S. Sohoni et al., "A Study of Memory System Performance of Multimedia Applications," in Proc. of ACM SIGMETRICS 2001
  • MR for a streaming media player is better than for SPEC benchmark apps!
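A checksum pass over a PDU is the canonical stride-1 access pattern. A minimal C sketch of the standard Internet checksum (illustrative, not thesis code) shows why MR ≈ W/L: with W = 2-byte word accesses and L = 32-byte lines, only the first access per line misses, giving the 1/16 figure above.

    #include <stdint.h>
    #include <stddef.h>

    /* 16-bit one's-complement Internet checksum over a PDU: a pure
       stride-1 pass. Each 32-byte cache line miss is followed by hits
       on the remaining 15 two-byte words, so MR = W/L = 2/32 = 1/16. */
    uint16_t internet_checksum(const uint8_t *pdu, size_t len)
    {
        uint32_t sum = 0;
        while (len > 1) {
            sum += (uint32_t)((pdu[0] << 8) | pdu[1]);  /* one 16-bit word */
            pdu += 2;
            len -= 2;
        }
        if (len)                       /* odd trailing byte */
            sum += (uint32_t)pdu[0] << 8;
        while (sum >> 16)              /* fold carries back in */
            sum = (sum & 0xFFFF) + (sum >> 16);
        return (uint16_t)~sum;
    }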

60
Memory-CPU Transfers
  • PDU Processing
  • checksum computation and header updating
  • Typically, one-way data flow (memory to CPU via cache)
  • Memory stall cycles
  • Number of memory stall cycles = (IC)(AR)(MR)(MP)
  • IC = instruction count per transaction
  • AR = number of memory accesses per instruction (AR = 1)
  • MR = ratio of cache misses to memory accesses
  • MP = miss penalty in terms of clock cycles
  • Cache miss rate
  • Worst case: MR = 1, while typically MP = 10
  • Stall cycles = 10 × IC

61
Memory-CPU Transfers cont.
  • Determine cache overhead w.r.t. execution time
  • (Execution time)no-cache = (IC)(CPI)(CC)
  • (Execution time)with-cache = (IC)(CPI)(CC)[1 + (MR)(MP)]
  • Cache overhead = 1 + (MR)(MP)
  • Cache overhead in various cases
  • Worst case: MR = 1 and MP = 10
  • Cache misses result in 11 times higher latency for each transaction!
  • Best case: MR = 0 → trivial
  • Average case: MR = 0.1, MP = 10, and (MR)(MP) ≈ 1
  • Latency due to stalls ≈ ideal execution time without stalls
  • Memory-CPU latency depends on internal bus bandwidth
  • Tm-c = S/(32·Bi) µs, where S is the PDU size and Bi is the internal bus bandwidth in MB/s

62
Open Questions
  • Role of special-purpose architectures (e.g. network processors) on the performance of high throughput servers
  • Role of memory compression
  • Role of scheduling