1
COMP 206: Computer Architecture and Implementation
  • Montek Singh
  • Wed., Nov. 6, 2002
  • Topics: 1. Virtual Memory; 2. Cache Coherence

2
Virtual Memory Access Time
  • Assume existence of TLB, physical cache, MM, disk
  • Processor issues VA
  • TLB hit
  • Send RA (real address) to cache
  • TLB miss
  • Exception: access page tables, update TLB, retry
  • Memory reference may involve accesses to
  • TLB
  • Page table in MM
  • Cache
  • Page in MM
  • Each of these can be a hit or a miss
  • 16 possible combinations
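
A minimal C sketch of this access flow, assuming a one-entry TLB and a
single-level page table (the sizes, names, and mapping below are
illustrative assumptions, not from the slides; the cache and page-fault
paths are elided):

    #include <stdio.h>
    #include <stdint.h>
    #include <stdbool.h>

    #define PAGE_BITS 12                      /* 4 KB pages, illustrative value */
    enum { NPAGES = 16 };

    static uint32_t page_table[NPAGES];       /* VPN -> PFN; lives in MM        */
    static struct { bool valid; uint32_t vpn, pfn; } tlb;   /* 1-entry TLB      */

    static uint32_t translate(uint32_t va)    /* VA -> RA, as on the slide      */
    {
        uint32_t vpn = va >> PAGE_BITS;
        uint32_t off = va & ((1u << PAGE_BITS) - 1);
        if (!(tlb.valid && tlb.vpn == vpn)) { /* TLB miss: take exception,      */
            tlb.valid = true;                 /* walk the page table in MM,     */
            tlb.vpn   = vpn;                  /* update the TLB, then retry     */
            tlb.pfn   = page_table[vpn];
        }
        return (tlb.pfn << PAGE_BITS) | off;  /* RA is sent to the physical cache */
    }

    int main(void)
    {
        for (uint32_t i = 0; i < NPAGES; i++)
            page_table[i] = NPAGES - 1 - i;   /* arbitrary VPN -> PFN mapping   */
        printf("VA 0x%05x -> RA 0x%05x\n",
               0x3ABCu, (unsigned)translate(0x3ABCu));
        return 0;
    }
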

3
Virtual Memory Access Time (2)
  • Constraints among these accesses
  • Hit in TLB ⇒ hit in page table in MM
  • Hit in cache ⇒ hit in page in MM
  • Hit in page in MM ⇒ hit in page table in MM
  • These constraints eliminate eleven combinations

4
Virtual Memory Access Time (3)
  • Number of MM accesses depends on page table
    organization
  • MIPS R2000/R4000 accomplishes table walking with
    CPU instructions (eight instructions per page
    table level)
  • Several CISC machines implement this in
    microcode, with MC88200 having dedicated hardware
    for this
  • RS/6000 implements this completely in hardware
  • TLB miss penalty dominated by having to go to
    main memory
  • Page tables may not be in cache
  • Further increase in miss penalty if page table
    organization is complex
  • TLB misses can have very damaging effect on
    physical caches
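
As a rough illustration of why the TLB miss penalty is dominated by
main-memory latency, here is a hedged sketch of a two-level table walk;
the layout, field widths, and names are assumptions for illustration,
not the MIPS, MC88200, or RS/6000 mechanism:

    #include <stdio.h>
    #include <stdint.h>

    typedef uint32_t pte_t;                        /* low 20 bits = PFN (illustrative) */

    static pte_t  level2[1024];                    /* one second-level table           */
    static pte_t *level1[1024] = { [0] = level2 }; /* first-level table                */

    static uint32_t walk(uint32_t va)              /* two MM accesses per TLB miss     */
    {
        uint32_t vpn1 = (va >> 22) & 0x3FF;        /* 10-bit VPN1                      */
        uint32_t vpn2 = (va >> 12) & 0x3FF;        /* 10-bit VPN2                      */
        pte_t   *second = level1[vpn1];            /* MM access #1                     */
        uint32_t pfn    = second[vpn2] & 0xFFFFF;  /* MM access #2                     */
        return (pfn << 12) | (va & 0xFFF);
    }

    int main(void)
    {
        level2[3] = 0x00042;                       /* map VPN (0, 3) to PFN 0x42       */
        printf("PA = 0x%08x\n", (unsigned)walk(0x00003ABCu));
        return 0;
    }
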

5
Page Size
  • Choices
  • Fixed at design time (most early VM systems)
  • Statically configurable
  • At any moment, only pages of the same size exist in
    the system
  • MC68030 allowed page sizes between 256 B and 32 KB
    in this way
  • Dynamically configurable
  • Pages of different sizes coexist in system
  • Alpha 21164, UltraSPARC: 8 KB, 64 KB, 512 KB, 4 MB
  • MIPS R10000, PA-8000: 4 KB, 16 KB, 64 KB, 256 KB,
    1 MB, 4 MB, 16 MB
  • All pages are aligned
  • Dynamic configuration is a sophisticated way to
    decrease the TLB miss rate
  • Increasing the number of TLB entries increases
    processor cycle time
  • Increasing size of VM page increases internal
    memory fragmentation
  • Needs fully associative TLBs
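
Back-of-the-envelope arithmetic behind this trade-off (the 64-entry TLB
and the page sizes chosen are assumptions for illustration): TLB reach
grows with page size, but so does the memory lost to internal
fragmentation, roughly half a page per mapped region:

    #include <stdio.h>

    int main(void)
    {
        const unsigned entries = 64;                        /* assumed TLB size     */
        const unsigned long sizes[] = { 4ul << 10, 64ul << 10, 4ul << 20 };
        for (int i = 0; i < 3; i++) {
            unsigned long reach = entries * sizes[i];       /* memory mapped by TLB */
            printf("page %8lu B: TLB reach %9lu KB, avg internal frag %8lu B\n",
                   sizes[i], reach >> 10, sizes[i] / 2);    /* ~half a page wasted  */
        }
        return 0;
    }
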

6
Segmentation and Paging
  • Paged segments: segments are made up of pages
  • Paging system has flat, linear address space
  • 32-bit VA (10-bit VPN1, 10-bit VPN2, 12-bit
    offset)
  • If, for a given VPN1, we reach the max value of
    VPN2 and add 1, we reach the next page, at address
    (VPN1+1, 0)
  • Segmented version has two-dimensional address
    space
  • 32-bit VA (10-bit segment number, 10-bit page
    number, 12-bit offset)
  • If, for a given segment, we reach the max page
    number and add 1, we get an undefined value
  • Segments are not contiguous
  • Segments do not need to have the same size
  • Size can even vary dynamically
  • Implemented by storing upper bound for each
    segment and checking every reference against it
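
A minimal sketch of the segmented bounds check described above; the
10/10/12 field split comes from the slide, while the per-segment limit
table and the function name are illustrative assumptions:

    #include <stdio.h>
    #include <stdint.h>

    static uint32_t seg_limit[1024];            /* max valid page number per segment */

    static int segmented_va_ok(uint32_t va)
    {
        uint32_t seg  = (va >> 22) & 0x3FF;     /* 10-bit segment number             */
        uint32_t page = (va >> 12) & 0x3FF;     /* 10-bit page number                */
        /* Unlike the flat paged space, running past the last page of a segment
           does not land in the "next" segment -- the reference is simply invalid. */
        return page <= seg_limit[seg];
    }

    int main(void)
    {
        seg_limit[1] = 3;                       /* segment 1 holds pages 0..3        */
        printf("%d %d\n", segmented_va_ok(0x00403000u),   /* seg 1, page 3: valid    */
                          segmented_va_ok(0x00404000u));  /* seg 1, page 4: fault    */
        return 0;
    }
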

7
Example 1: Alpha 21264 TLB
  • Figure 5.36

8
Example 2: Hypothetical Virtual Memory
  • Figure 5.37

9
Cache Coherence
  • Section 6.3 and Appendix I of HP3

10
Cache Coherence
  • Common problem with multiple copies of mutable
    information (in both hardware and software)
  • "If a datum is copied and the copy is to match
    the original at all times, then all changes to
    the original must cause the copy to be
    immediately updated or invalidated." (Richard L.
    Sites, co-architect of DEC Alpha)

[Slide diagram: timeline in which the copy first becomes stale and the
two copies then diverge, which is hard to recover from]
11
Example of Cache Coherence
  • I/O in uniprocessor with primary unified cache
  • MM copy and cache copy of memory block not always
    coherent
  • WT cache: MM copy stale while a write update to MM
    is in transit
  • WB cache: MM copy stale while the cache copy is
    Dirty
  • Inconsistency of no concern if no one
    reads/writes MM copy
  • If I/O is directed to main memory, coherence must
    be maintained (see the sketch below)
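
Seen from a device driver, the WB-cache case above requires explicit
reconciliation around DMA. The following sketch uses made-up placeholder
routines (cache_writeback_range, cache_invalidate_range, and the dma_*
calls are illustrative stand-ins, not a real API):

    #include <stdio.h>
    #include <stddef.h>

    /* Hypothetical stand-ins for cache-maintenance and DMA operations */
    static void cache_writeback_range(const void *p, size_t n)
    { (void)p; printf("write back %zu possibly-Dirty bytes to MM\n", n); }
    static void cache_invalidate_range(const void *p, size_t n)
    { (void)p; printf("invalidate %zu possibly-stale cached bytes\n", n); }
    static void dma_to_device(const void *p, size_t n)
    { (void)p; printf("device reads %zu bytes directly from MM\n", n); }
    static void dma_from_device(void *p, size_t n)
    { (void)p; printf("device writes %zu bytes directly into MM\n", n); }

    int main(void)
    {
        char buf[256];
        /* Outbound: with a WB cache the MM copy may be stale while the cache
           copy is Dirty, so push it to MM before the device reads MM.        */
        cache_writeback_range(buf, sizeof buf);
        dma_to_device(buf, sizeof buf);
        /* Inbound: after the device writes MM, any cached copy is stale, so
           drop it before the CPU reads the buffer.                           */
        dma_from_device(buf, sizeof buf);
        cache_invalidate_range(buf, sizeof buf);
        return 0;
    }
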

12
Example of Cache Coherence (contd)
  • Uniprocessor with a split primary cache
  • I-cache contains instructions
  • D-cache contains data
  • Often contents are disjoint
  • If self-modifying code is allowed, then same
    cache block may appear in both caches, and
    consistency must be enforced
  • MS-DOS allows self-modifying code
  • Strong motivation for unified caches in Intel
    i386 and i486
  • Pentium has a split primary cache, and supports
    self-modifying code by enforcing coherence between
    the I- and D-caches
  • Coordinating primary and secondary caches in
    uniprocessor
  • Shared memory multiprocessors

13
Two Snoopy Protocols
  • We will discuss two protocols
  • A simple three-state protocol
  • Section 6.3 and Appendix I of HP3
  • The MESI protocol
  • IEEE standard
  • Used by many machines, including Pentium and
    PowerPC 601
  • Snooping
  • individual caches monitor memory bus activity
  • and take actions based on this activity
  • introduces a fourth category of miss to the 3C
    model: coherence misses
  • First, we need some notation to discuss the
    protocols

14
Notation: Write-Through Cache
15
Notation: Write-Back Cache
16
Three-State Write-Invalidate Protocol
  • Minor modification of WB cache
  • Assumptions
  • Single bus and MM
  • Two or more CPUs, each with WB cache
  • Every cache block is in one of three states:
    Invalid, Clean, Dirty (called Invalid, Shared,
    Exclusive in Figure 6.10 of HP3)
  • MM copies of blocks have no state
  • At any moment, a single cache owns the bus (is bus
    master)
  • Bus master does not obey bus commands
  • All misses (reads or writes) are serviced by
  • MM, if all cache copies are Clean
  • otherwise, the only Dirty cache copy (which then
    ceases to be Dirty); in this case the MM copy is
    written instead of being read
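
A compact sketch of the per-block next-state logic these assumptions
imply, using the Invalid/Clean/Dirty names; bus transactions,
write-backs, and data movement are elided, so this is an illustration
rather than a transcription of Figure 6.10:

    #include <stdio.h>

    typedef enum { INVALID, CLEAN, DIRTY } state_t;
    typedef enum { CPU_READ, CPU_WRITE,               /* from local processor  */
                   BUS_READ_MISS, BUS_WRITE_MISS } event_t;  /* snooped on bus */

    static state_t next_state(state_t s, event_t e)
    {
        switch (e) {
        case CPU_READ:       return s == INVALID ? CLEAN : s; /* miss -> Clean */
        case CPU_WRITE:      return DIRTY;   /* hit or miss: invalidate others */
        case BUS_READ_MISS:  return s == DIRTY ? CLEAN : s;   /* supply block  */
        case BUS_WRITE_MISS: return INVALID; /* another cache will write it    */
        }
        return s;
    }

    int main(void)
    {
        static const char *name[] = { "Invalid", "Clean", "Dirty" };
        event_t trace[] = { CPU_READ, CPU_WRITE, BUS_READ_MISS, BUS_WRITE_MISS };
        state_t s = INVALID;
        for (int i = 0; i < 4; i++) {
            s = next_state(s, trace[i]);
            printf("event %d -> %s\n", i, name[s]);
        }
        return 0;
    }

Note that a CPU write to a Clean block still has to broadcast a Bus Write
Miss to invalidate other copies, which a later slide points out triggers
an unnecessary block transfer.
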

17
Understanding the Protocol
  • Only two global states
  • Most up-to-date copy is the MM copy, and all cache
    copies are Clean
  • Most up-to-date copy is a single unique cache copy
    in state Dirty
  • Bus owner Clean, another Clean copy exists
  • Can read without notifying other caches
  • Bus owner Dirty, no other cache copies
  • Can read or write without notifying other caches
  • Bus owner Clean, no other cache copies
  • Can read without notifying other caches

18
State Diagram of Cache Block (Part 1)
19
State Diagram of Cache Block (Part 2)
20
Comparison with Single WB Cache
  • Similarities
  • Read hit invisible on bus
  • All misses visible on bus
  • Differences
  • In a single WB cache, all misses are serviced by
    MM; in the three-state protocol, misses are
    serviced either by MM or by the unique cache
    holding the only Dirty copy
  • In a single WB cache, a write hit is invisible on
    the bus; in the three-state protocol, a write hit
    on a Clean block invalidates all other Clean copies
    by a Bus Write Miss (a necessary action)
  • But the Bus Write Miss also causes a completely
    unnecessary block transfer from MM to the cache
    (the block is then overwritten by the CPU)

21
Correctness of Three-State Protocol
  • Problem: state transitions of the FSM are supposed
    to be atomic, but they are not in this protocol,
    because of the bus
  • Example: CPU read miss in Dirty state
  • 1. CPU access to cache detects a miss
  • 2. Request bus
  • 3. Acquire bus, and change state of cache block
  • 4. Evict dirty block to MM
  • 5. Put Bus Read Miss on bus
  • 6. Receive requested block from MM or another cache
  • 7. Release bus, and read from cache block just
    received
  • Bus arbitration may cause a gap between steps 2 and
    3
  • The whole sequence of operations is no longer
    atomic
  • App. I.1 argues that the protocol will work
    correctly if steps 3-7 are atomic, i.e., if the bus
    is not a split-transaction bus (see the sketch
    below)
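
The requirement can be pictured by bracketing steps 3-7 with bus
ownership; modeling the bus as a lock is purely illustrative (a real
arbiter is hardware, not a mutex), but it shows why a gap between steps
2 and 3 is harmless while one inside steps 3-7 is not:

    #include <pthread.h>
    #include <stdio.h>

    static pthread_mutex_t bus = PTHREAD_MUTEX_INITIALIZER;  /* models bus ownership */

    static void cpu_read_miss_dirty(void)
    {
        /* steps 1-2: detect miss, request bus (an arbitration gap is allowed here) */
        pthread_mutex_lock(&bus);     /* step 3: acquire bus, change block state    */
        puts("step 4: evict dirty block to MM");
        puts("step 5: put Bus Read Miss on bus");
        puts("step 6: receive block from MM or another cache");
        pthread_mutex_unlock(&bus);   /* step 7: release bus, then read the block   */
    }

    int main(void) { cpu_read_miss_dirty(); return 0; }
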

22
Adding More Bits to Protocols
  • Add a third bit, called Shared, to the Valid and
    Dirty bits
  • Get five states (M, O, E, S, I)
  • Developed in the context of Futurebus, with the
    intention of explaining all snoopy protocols, all
    of which use 3, 4, or 5 states

23
MESI Protocol
  • Four-state, write-invalidate
  • Improved version of three-state protocol
  • Clean state split into Exclusive and Shared
    states
  • Dirty state equivalent to Modified state
  • Several slightly different versions of the MESI
    protocol exist
  • We will describe the version implemented by
    Futurebus
  • The PowerPC 601 MESI protocol does not support
    cache-to-cache transfer of blocks
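
For comparison with the earlier three-state sketch, here is a hedged
sketch of one common MESI formulation; the state names come from the
slide, while the 'shared' signal and the simplified transitions are
assumptions, and cache-to-cache transfer (which distinguishes the
Futurebus and PowerPC 601 versions) is elided:

    #include <stdio.h>
    #include <stdbool.h>

    typedef enum { I, S, E, M } mesi_t;
    typedef enum { CPU_READ, CPU_WRITE, SNOOP_READ, SNOOP_WRITE } ev_t;

    /* 'shared' models the shared line: asserted if another cache held a copy
       when our read miss appeared on the bus.                                */
    static mesi_t mesi_next(mesi_t s, ev_t e, bool shared)
    {
        switch (e) {
        case CPU_READ:    return s == I ? (shared ? S : E) : s;
        case CPU_WRITE:   return M;   /* E->M silent; S->M sends only an invalidate */
        case SNOOP_READ:  return (s == M || s == E) ? S : s;  /* downgrade          */
        case SNOOP_WRITE: return I;   /* another cache is writing: invalidate       */
        }
        return s;
    }

    int main(void)
    {
        static const char *n[] = { "I", "S", "E", "M" };
        mesi_t s = I;
        s = mesi_next(s, CPU_READ,   false); printf("read miss, no sharers -> %s\n", n[s]);
        s = mesi_next(s, CPU_WRITE,  false); printf("write hit             -> %s\n", n[s]);
        s = mesi_next(s, SNOOP_READ, false); printf("snooped read          -> %s\n", n[s]);
        return 0;
    }

The improvement over the three-state protocol shows up in CPU_WRITE:
from Exclusive the transition to Modified is invisible on the bus, and
from Shared it needs only an invalidation signal, not a block transfer.
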

24
State Diag. of MESI Cache Block (Part 1)
25
State Diag. of MESI Cache Block (Part 2)
26
Comparison with Three-State Protocol
  • Similarities
  • Read hit invisible on bus
  • All misses handled the same way
  • Differences
  • Big improvement in handling write hits
  • Write hit in Exclusive state invisible on bus
  • Write hit in Shared state involves no block
    transfer, only a control signal
  • Exclusive state
  • Can be read or written
  • Shared state
  • Can be read only
  • Modified state
  • Can be read and written

27
Comments on Write-Invalidate Protocols
  • Performance
  • Processor can lose cache block through
    invalidation by another processor
  • Average memory access time goes up, since writes
    to shared blocks take more time (other copies
    have to be invalidated)
  • Implementation
  • Bus and CPU want to simultaneously access same
    cache
  • Either same block or different blocks, but
    conflict nonetheless
  • Three possible solutions
  • Use a single tag array, and accept structural
    hazards
  • Use two separate tag arrays for bus and CPU,
    which must now be kept coherent at all times
  • Use a multiported tag array (both Intel Pentium
    and PowerPC 601 use this solution)