Cache Performance, Interfacing, Multiprocessors CPSC 321 - PowerPoint PPT Presentation

Slides: 47
Provided by: faculty

1
Cache Performance, Interfacing, Multiprocessors
CPSC 321
  • Andreas Klappenecker

2
Today's Menu
  • Cache Performance
  • Review of Virtual Memory
  • Processor and Peripherals
  • Multiprocessors

3
Cache Performance
4
Caching Basics
  • What are the different cache placement schemes?
  • direct mapped
  • set associative
  • fully associative
  • Explain how a 2-way cache with 4 sets works
  • If we want to read a memory block whose address
    is addr, then we search the set addr mod 4
  • The memory block could be in either element of
    the set
  • Compare tags with upper n-2 bits of addr
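The lookup described on this slide can be sketched in Python. This is only an illustration under stated assumptions: the class name `TwoWayCache` is hypothetical, addresses are block addresses, and LRU replacement within a set is assumed (the slide does not name a replacement policy).

```python
class TwoWayCache:
    """2-way set-associative cache with 4 sets, addressed by block number."""

    NUM_SETS = 4
    WAYS = 2

    def __init__(self):
        # Each set holds up to WAYS tags, ordered from least to most recently used.
        self.sets = [[] for _ in range(self.NUM_SETS)]

    def access(self, addr):
        """Return True on a hit, False on a miss (the block is then loaded)."""
        index = addr % self.NUM_SETS   # the set to search: addr mod 4
        tag = addr // self.NUM_SETS    # the upper bits of the block address
        ways = self.sets[index]
        if tag in ways:                # the block may be in either element of the set
            ways.remove(tag)
            ways.append(tag)           # mark as most recently used
            return True
        if len(ways) == self.WAYS:
            ways.pop(0)                # evict the least recently used tag (assumed policy)
        ways.append(tag)
        return False
```

Blocks 0, 4, and 8 all map to set 0, so accessing a third one of them evicts whichever of the other two was used least recently.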

5
Implementation of a Cache
  • Sketch an implementation of a 4-way associative
    cache

6
Measuring Cache Performance
  • CPU cycle time
  • CPU execution clock cycles (including cache hits)
  • Memory-stall clock cycles (cache misses)
  • CPU time = (CPU execution clock cycles + memory
    stall clock cycles) x clock cycle time
  • Memory stall clock cycles
  • read stall cycles (rsc)
  • write stall clock cycles (wsc)
  • Memory stall clock cycles = rsc + wsc

7
Measuring Cache Performance
  • Write-stall cycles (WSCs) in a write-through scheme
  • two sources of stalls
  • write misses (usually require fetching the block)
  • write buffer stalls (write buffer is full when
    write occurs)
  • WSCs are the sum of the two
  • WSCs = (writes/prg x write miss rate x write miss
    penalty) + write buffer stalls
  • Memory stall clock cycles: similar

8
Cache Performance Example
  • Instruction cache miss rate: 2%
  • Data miss rate: 4%
  • Assume a CPI of 2 without any memory stalls
  • Miss penalty: 40 cycles for all misses
  • Instruction count: I
  • Instruction miss cycles = I x 2% x 40 = 0.80 x I
  • gcc has 36% loads and stores
  • Data miss cycles = I x 36% x 4% x 40 = 0.58 x I
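The arithmetic of this example can be checked with a short Python sketch. The function name `memory_stall_cpi` and the variable names are illustrative; the rates are the ones on the slide, read as percentages (2%, 4%, 36%). The final CPI and slowdown are not stated on the slide; they follow from adding the stall cycles to the base CPI.

```python
def memory_stall_cpi(instr_miss_rate, data_miss_rate, mem_ops_per_instr, miss_penalty):
    """Return (instruction, data) memory-stall cycles per instruction."""
    instr_stalls = instr_miss_rate * miss_penalty
    data_stalls = mem_ops_per_instr * data_miss_rate * miss_penalty
    return instr_stalls, data_stalls

base_cpi = 2.0
instr_stalls, data_stalls = memory_stall_cpi(
    instr_miss_rate=0.02, data_miss_rate=0.04,
    mem_ops_per_instr=0.36, miss_penalty=40)

cpi_with_stalls = base_cpi + instr_stalls + data_stalls  # about 3.38
slowdown = cpi_with_stalls / base_cpi                     # about 1.69
print(instr_stalls, data_stalls, cpi_with_stalls, slowdown)
```

Under these numbers, memory stalls account for roughly 1.38 of the 3.38 cycles per instruction.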

9
Review of Virtual Memory
10
Virtual Memory
  • Processor generates virtual addresses
  • Memory is accessed using physical addresses
  • Virtual and physical memory is broken into blocks
    of memory, called pages
  • A virtual page may be absent from main memory,
    residing on the disk or may be mapped to a
    physical page

11
Page Tables
The page table maps each page to either a page in
main memory or to a page stored on disk
12
Pages: virtual memory blocks
  • Page faults: if data is not in memory, retrieve
    it from disk
  • huge miss penalty, thus pages should be fairly
    large (e.g., 4KB)
  • reducing page faults is important (LRU is worth
    the price)
  • using write-through takes too long, so we use
    write-back
  • Example: page size 2^12 = 4KB; 2^18 physical pages
  • main memory <= 1GB; virtual memory <= 4GB
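With 4KB pages, a virtual address splits into a page number and a 12-bit offset. The sketch below illustrates that split; the function name `translate` and the use of a plain dictionary as the page table are assumptions made for illustration, and a missing entry stands in for a page that currently resides on disk.

```python
PAGE_BITS = 12                 # 4KB pages, so 12 offset bits
PAGE_SIZE = 1 << PAGE_BITS

def translate(vaddr, page_table):
    """Map a virtual address to a physical address via a page table dict."""
    vpn = vaddr >> PAGE_BITS             # virtual page number
    offset = vaddr & (PAGE_SIZE - 1)     # offset is unchanged by translation
    ppn = page_table.get(vpn)
    if ppn is None:
        # Page is not in main memory: the OS would fetch it from disk.
        raise LookupError(f"page fault on virtual page {vpn}")
    return (ppn << PAGE_BITS) | offset

# Hypothetical mapping: virtual page 1 lives in physical page 5.
print(hex(translate(0x1234, {1: 5})))
```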

13
Page Faults
  • Incredibly high penalty for a page fault
  • Reduce number of page faults by optimizing page
    placement
  • Use fully associative placement
  • full search of pages is impractical
  • pages are located by a full table that indexes
    the memory, called the page table
  • the page table resides within the memory

14
Making Memory Access Fast
  • Page tables slow us down
  • Memory access will take at least twice as long
  • access page table in memory
  • access page
  • What can we do?

Memory access is local => use a cache that keeps
track of recently used address translations,
called the translation lookaside buffer (TLB)
15
Making Address Translation Fast
  • A cache for address translations: the translation
    lookaside buffer
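A translation lookaside buffer can be modeled as a small cache in front of the page table. The sketch below is a software illustration only: the class name `TLB`, the capacity of 4 entries, and LRU replacement are assumptions, and a real TLB is a hardware structure.

```python
from collections import OrderedDict

class TLB:
    """Small cache of recent virtual-to-physical page translations."""

    def __init__(self, page_table, capacity=4):
        self.page_table = page_table   # dict: virtual page -> physical page
        self.capacity = capacity
        self.entries = OrderedDict()   # ordered from least to most recently used
        self.hits = 0
        self.misses = 0

    def lookup(self, vpn):
        if vpn in self.entries:
            self.hits += 1
            self.entries.move_to_end(vpn)      # refresh recency
            return self.entries[vpn]
        self.misses += 1
        ppn = self.page_table[vpn]             # the extra memory access we want to avoid
        if len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)   # evict the least recently used entry
        self.entries[vpn] = ppn
        return ppn
```

Because memory accesses are local, most lookups hit in the TLB and skip the page table access.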

16
Processors and Peripherals
17
Collection of I/O Devices
  • Communication between I/O devices, processor, and
    memory uses protocols on the bus and interrupts

18
Impact of I/O on Performance
  • A benchmark executes in 100 seconds
  • 90 seconds CPU time
  • 10 seconds I/O time
  • If CPU performance improves 50% per year for the
    next 5 years, how much faster does the benchmark
    run in 5 years?

CPU time after 5 years: 90/(1.5)^5 = 90/7.59 = 11.851 seconds
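The slide gives only the new CPU time. A short Python sketch can complete the calculation, assuming the I/O time stays fixed at 10 seconds; the function name `elapsed_after` is illustrative.

```python
def elapsed_after(years, cpu_time=90.0, io_time=10.0, cpu_growth=1.5):
    """Elapsed time if only the CPU part speeds up by cpu_growth each year."""
    return cpu_time / cpu_growth ** years + io_time

new_time = elapsed_after(5)            # about 21.85 seconds
speedup = 100.0 / new_time             # about 4.6x overall
io_share = 10.0 / new_time             # I/O is now about 46% of elapsed time
print(new_time, speedup, io_share)
```

Although the CPU becomes about 7.6 times faster, the benchmark runs only about 4.6 times faster, because the unchanged I/O time grows from 10% to nearly half of the total.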
19
I/O Devices
  • Very diverse devices, characterized by
  • behavior (i.e., input vs. output)
  • partner (who is at the other end?)
  • data rate

20
Communicating with Processor
  • Polling
  • simple
  • I/O device puts information in a status register
  • processor retrieves information
  • check the status periodically
  • Interrupt driven I/O
  • device notifies processor that it has completed
    some operation by causing an interrupt
  • similar to exception, except that it is
    asynchronous
  • processor must be notified of the device causing
    the interrupt
  • interrupts must be prioritized

21
I/O Example Disk Drives
  • To access data
  • seek: position head over the proper track (8 to
    20 ms avg.)
  • rotational latency: wait for desired sector
    (0.5 / RPM)
  • transfer: grab the data (one or more sectors) at
    2 to 15 MB/sec
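The three components add up to the time for one disk access. The sketch below shows the calculation; the function name `disk_access_ms` and the sample parameters (12 ms seek, 5400 RPM, one 512-byte sector, 5 MB/sec) are assumptions chosen for illustration, with seek time and transfer rate inside the ranges quoted on the slide.

```python
def disk_access_ms(seek_ms, rpm, bytes_to_read, transfer_mb_per_s):
    """Average time in milliseconds for one disk access."""
    # 0.5 / RPM is half a revolution in minutes; convert to milliseconds.
    rotational_ms = 0.5 / rpm * 60_000
    # 1 MB is taken as 10^6 bytes here.
    transfer_ms = bytes_to_read / (transfer_mb_per_s * 1e6) * 1e3
    return seek_ms + rotational_ms + transfer_ms

example_ms = disk_access_ms(seek_ms=12, rpm=5400,
                            bytes_to_read=512, transfer_mb_per_s=5)
print(round(example_ms, 2))   # seek and rotation dominate; transfer is about 0.1 ms
```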

22
I/O Example Buses
  • Shared communication link (one or more wires)
  • Difficult design
  • may be bottleneck
  • tradeoffs (buffers for higher bandwidth increase
    latency)
  • support for many different devices
  • cost
  • Types of buses
  • processor-memory (short, high speed, custom
    design)
  • backplane (high speed, often standardized, e.g.,
    PCI)
  • I/O (lengthy, different devices, standardized,
    e.g., SCSI)
  • Synchronous vs. asynchronous
  • synchronous: use a clock and a synchronous
    protocol; fast and small, but every device must
    operate at the same rate, and clock skew requires
    the bus to be short
  • asynchronous: don't use a clock, use handshaking
    instead

23
Asynchronous Handshake Protocol
  • ReadReq: indicates a request to read from memory
  • DataRdy: indicates that the data word is now ready
    on the data lines
  • Ack: used to acknowledge the ReadReq or DataRdy
    signal of the other party

24
Asynchronous Handshake Protocol
  1. Memory sees ReadReq, reads address from data bus,
     raises Ack
  2. I/O device sees Ack high, releases ReadReq and
     data lines
  3. Memory sees ReadReq low, drops Ack to acknowledge
     ReadReq
  4. When memory has data ready, it places data from
     the read request on the data lines and raises
     DataRdy
  5. I/O device sees DataRdy, reads data from the
     bus, signals that it has the data by raising Ack
  6. Memory sees the Ack signal, drops DataRdy,
     releases data lines
  7. When DataRdy goes low, the I/O device drops Ack to
     indicate that transmission is over
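The seven steps of the handshake can be traced as a sequence of signal changes. The sketch below is a simplified model, not a bus implementation: the function name `handshake_trace` is hypothetical, only the three control lines are modeled, and it assumes the I/O device has already raised ReadReq and put the address on the data lines before step 1.

```python
def handshake_trace():
    """Return (step, actor, signal levels) for one asynchronous read."""
    # Assumed starting point: the device has raised ReadReq.
    signals = {"ReadReq": 1, "Ack": 0, "DataRdy": 0}
    trace = []

    def step(number, actor, **changes):
        signals.update(changes)
        trace.append((number, actor, dict(signals)))

    step(1, "memory", Ack=1)       # memory has read the address
    step(2, "device", ReadReq=0)   # device releases ReadReq and data lines
    step(3, "memory", Ack=0)       # memory acknowledges ReadReq going low
    step(4, "memory", DataRdy=1)   # data is on the data lines
    step(5, "device", Ack=1)       # device has read the data
    step(6, "memory", DataRdy=0)   # memory releases the data lines
    step(7, "device", Ack=0)       # transmission is over
    return trace

for number, actor, levels in handshake_trace():
    print(number, actor, levels)
```

Each step changes exactly one control line, and every line is back at 0 after step 7, so the bus is ready for the next request.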

25
Synchronous vs. Asynchronous Buses
  • Compare max. bandwidth for a synchronous bus and
    an asynchronous bus
  • Synchronous bus
  • has clock cycle time of 50 ns
  • each transmission takes 1 clock cycle
  • Asynchronous bus
  • requires 40 ns per handshake
  • Find bandwidth for each bus when performing
    one-word reads from a 200ns memory

26
Synchronous Bus
  1. Send address to memory: 50 ns
  2. Read memory: 200 ns
  3. Send data to device: 50 ns
  • Total: 300 ns
  • Max. bandwidth: 4 bytes/300 ns = 13.3 MB/second

27
Asynchronous Bus
  • Apparently much slower, because each step of the
    protocol takes 40 ns and the memory access takes
    200 ns
  • Notice that several steps overlap with the
    memory access time
  • Memory receives the address at step 1
  • It does not need to put data on the bus until
    step 5
  • Steps 2, 3, 4 can overlap with the memory access
  • Step 1: 40 ns
  • Steps 2, 3, 4: max(3 x 40 ns = 120 ns, 200 ns) = 200 ns
  • Steps 5, 6, 7: 3 x 40 ns = 120 ns
  • Total time: 360 ns
  • Max. bandwidth: 4 bytes/360 ns = 11.1 MB/second
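Both bandwidth figures can be reproduced with a few lines of Python. The variable and function names are illustrative, one word is taken as 4 bytes as in the slides, and 1 MB is taken as 10^6 bytes.

```python
WORD_BYTES = 4
MEMORY_NS = 200

def bandwidth_mb_per_s(total_ns):
    """Bandwidth for one word transferred every total_ns nanoseconds."""
    return WORD_BYTES / total_ns * 1e3   # bytes/ns equals 10^3 MB/s

# Synchronous bus: address cycle + memory read + data cycle.
SYNC_CYCLE_NS = 50
sync_ns = SYNC_CYCLE_NS + MEMORY_NS + SYNC_CYCLE_NS

# Asynchronous bus: steps 2-4 overlap with the memory access.
HANDSHAKE_NS = 40
async_ns = HANDSHAKE_NS + max(3 * HANDSHAKE_NS, MEMORY_NS) + 3 * HANDSHAKE_NS

print(sync_ns, round(bandwidth_mb_per_s(sync_ns), 1))    # 300 ns, 13.3 MB/s
print(async_ns, round(bandwidth_mb_per_s(async_ns), 1))  # 360 ns, 11.1 MB/s
```

Without the overlap, the asynchronous read would take 7 x 40 + 200 = 480 ns, so the overlap is what keeps the asynchronous bus within about 20% of the synchronous one.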

28
Other important issues
  • Bus arbitration
  • daisy chain arbitration (not very fair)
  • centralized arbitration (requires an arbiter),
    e.g., PCI
  • self selection, e.g., NuBus used in Macintosh
  • collision detection, e.g., Ethernet
  • Operating system: polling, interrupts, DMA
  • Performance analysis techniques
  • queuing theory
  • simulation
  • analysis, i.e., find the weakest link (see I/O
    System Design)

29
Overhead of Polling
30
Overhead of Polling
31
Ways to Transfer Data between Memory and Device
32
Multiprocessors
33
Idea
  • Build powerful computers by connecting many
    smaller ones.

34
Multiprocessors
  + good for timesharing
  + easy to realize
  - difficult to write good concurrent programs
  - hard to parallelize tasks
  - mapping to architecture can be difficult
35
Questions
  • How do parallel processors share data?
  • single address space
  • message passing
  • How do parallel processors coordinate?
  • synchronization (locks, semaphores)
  • built into send / receive primitives
  • operating system protocols
  • How are they implemented?
  • connected by a single bus
  • connected by a network

36
Shared Memory Multiprocessors
Problems???
Symmetric multiprocessor (SMP)
37
Distributed Memory Multiprocessors
  • Distributed shared-memory multiprocessor
  • Message passing multiprocessor

38
Multiprocessors
                            Global Memory              Distributed Memory
Common Address Space        Symmetric multiprocessor   Distributed shared-memory multiprocessor
Distributed Address Space   does not exist             Message passing multiprocessor
39
Connection Network
  • Static Network
  • fixed connections between nodes
  • Dynamic Network
  • packet switching (packets routed from sender to
    recipient)
  • circuit switching (connection between nodes can
    be established by crossbar or switching network)

40
Static Connection Networks
41
Circuit Switching Delta Networks
  • Route from any input x to output y by selecting
    links determined by successive d-ary digits of
    y's label.
  • This process is reversible: we can route from
    output y back to x by following the links
    determined by successive digits of x's label.
  • This self-routing property allows for simple
    hardware-based routing of cells.

[Figure: delta network routing example with node labels 0101 and 1101; x = x_{k-1} . . . x_0, y = y_{k-1} . . . y_0]
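Self-routing can be illustrated with an omega network, one common kind of delta network, for d = 2. The sketch below is a model under assumptions: the function name `omega_route` is hypothetical, each stage is a perfect shuffle followed by a 2x2 switch, and the digits of y are consumed most significant first.

```python
def omega_route(x, y, k):
    """Route from input x to output y through k stages of 2x2 switches.

    Returns the list of (switch output chosen, line reached) per stage.
    """
    mask = (1 << k) - 1
    position = x
    path = []
    for i in range(k - 1, -1, -1):
        digit = (y >> i) & 1   # next binary digit of y's label
        # Perfect shuffle moves the line label left by one bit; the switch
        # output (0 = upper, 1 = lower) then fixes the lowest bit.
        position = ((position << 1) & mask) | digit
        path.append((digit, position))
    return path

path = omega_route(0b0101, 0b1101, k=4)
print([digit for digit, _ in path])   # switch settings: the bits of y
print(format(path[-1][1], "04b"))     # final line: y's label
```

The switch settings depend only on y, never on x, which is why each switch can route a cell by looking at a single digit of the destination label.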
42
Network versus Bus
43
Performance / Unit Cost
44
Programming
  • lock variables
  • semaphores
  • monitor
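A lock variable protects a shared update so that only one thread performs it at a time. The sketch below uses Python's standard `threading.Lock`; the function name `count_with_lock` and the thread and iteration counts are illustrative choices.

```python
import threading

def count_with_lock(num_threads=4, increments=10_000):
    """Increment a shared counter from several threads, guarded by a lock."""
    total = 0
    lock = threading.Lock()

    def worker():
        nonlocal total
        for _ in range(increments):
            with lock:          # acquire the lock; released on leaving the block
                total += 1      # critical section: read, add, write back

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total

print(count_with_lock())
```

Without the lock, two threads could read the same old value of the counter and one of the increments would be lost.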

45
Cache Coherency
46
Outlook
  • Distributed Algorithms
  • Distributed Systems
  • Parallel Programming
  • Parallelizing Compilers