Title: Improving the performance of a Multi-core architecture with different Cache coherence protocols and its related work
1- Improving the performance of a Multi-core
architecture with different Cache coherence
protocols and its related work
2 What is a Multi-Core Architecture?
A multi-core architecture is a method of
embedding a number of cores on a single chip. A
multi-core architecture improves the performance
of a system by computing a number of tasks at the
same time.
5 (Diagram: Core 1, Core 2, and Core 3 on a
single chip)
6 What is cache?
- A cache is simply a high-speed static RAM
(SRAM). Every processor has a cache, which is
useful for retrieving data that is used
frequently. The cache is very fast at retrieving
frequently used data when compared to dynamic RAM
(main memory).
7 Cache coherence?
- Cache coherence is a protocol that makes the
caches of different processors work correctly and
efficiently. Without it, the caches may contain
inconsistent data: one cache may hold one value
for a memory location while another cache holds a
different value, which causes the data
inconsistency problem.
8 Cache Coherence Problem
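The inconsistency described above can be sketched in a few lines. This is a minimal illustration with no coherence protocol in place; the class and variable names are mine, not from the slides.

```python
# Sketch of the cache coherence problem: two cores each keep a private
# copy of address X, and no coherence messages are exchanged.
memory = {"X": 0}

class Cache:
    def __init__(self):
        self.lines = {}                     # addr -> locally cached value

    def read(self, addr):
        if addr not in self.lines:          # miss: fetch from main memory
            self.lines[addr] = memory[addr]
        return self.lines[addr]

    def write(self, addr, value):           # write-back, no invalidation sent
        self.lines[addr] = value

c0, c1 = Cache(), Cache()
c0.read("X"); c1.read("X")   # both caches now hold X = 0
c0.write("X", 42)            # core 0 updates only its private copy
print(c1.read("X"))          # core 1 still sees the stale value 0
```

Core 1 keeps returning the stale value until something invalidates or updates its copy, which is exactly what the protocols on the following slides provide.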
9 Cache Coherence Schemes
- Software cache coherence schemes
- Software-based coherence is the straightforward
approach: shared data is simply not cached.
- Hardware cache coherence schemes
- Hardware-based coherence is imposed by snoop
devices attached to the cores and their caches.
In this scheme shared data can be cached because
the caches are guaranteed to be coherent, but the
programmer must still deal with synchronization
of shared data.
10 - Write-invalidate: a processor gains
exclusive access to a block before writing by
invalidating all other copies.
- Write-update: when a processor writes, it
updates the other shared copies of that block.
11 - Directory-based: a single location (the
directory) keeps track of the sharing status of a
block of memory.
- Snooping: every cache block is accompanied by
the sharing status of that block. All cache
controllers monitor the shared bus so they can
update the sharing status of the block, if
necessary.
12 Snoopy-based cache coherence
- A snoopy-based cache coherence protocol employs
a bus connected to all L1 caches. In this
mechanism, on every L1 cache miss a coherence
message is placed on the bus; the global state is
kept in the L2 cache, which is also connected to
the bus. All other L1 caches maintain their cache
states and respond to the message if it concerns
them. The messages generally used are request
messages, invalidation messages, intervention
messages, data block transfers, etc.
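A minimal sketch of how caches react to snooped bus messages, using the conventional MSI states (Modified/Shared/Invalid) and the usual BusRd/BusRdX message names; these follow textbook convention rather than any one slide.

```python
# Minimal MSI snooping sketch: every cache observes bus messages and
# updates its line state accordingly.
MODIFIED, SHARED, INVALID = "M", "S", "I"

class SnoopyCache:
    def __init__(self):
        self.state = {}                      # addr -> M / S / I

    def snoop(self, msg, addr):
        st = self.state.get(addr, INVALID)
        if msg == "BusRdX":                  # another core wants exclusive access
            self.state[addr] = INVALID
        elif msg == "BusRd" and st == MODIFIED:
            self.state[addr] = SHARED        # supply the data, downgrade to shared

def broadcast(caches, sender, msg, addr):
    for c in caches:
        if c is not sender:
            c.snoop(msg, addr)               # every other cache snoops the bus

c0, c1 = SnoopyCache(), SnoopyCache()
c0.state["X"] = MODIFIED
broadcast([c0, c1], c1, "BusRd", "X")   # c1's read miss goes on the bus
print(c0.state["X"])                    # 'S': c0 downgraded its modified copy
```

The key property is that `broadcast` reaches every cache: the bus is the serialization point, which is also why bus bandwidth becomes the bottleneck on the next slides.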
13 (Diagram: processors P0 ... Pn with caches on
a shared memory bus, each cache controller
snooping the bus)
- The memory bus is a broadcast medium.
- Caches contain information on which addresses
they store.
- The cache controller snoops all transactions on
the bus.
- A transaction is a relevant transaction if it
involves a cache block currently contained in
this cache.
- The controller takes action to ensure
coherence: invalidate, update, or supply the
value.
14 Limits of Snoopy Coherence
- Assume a 4 GHz processor
  => 16 GB/s instruction BW per processor (32-bit)
  => 9.6 GB/s data BW at 30% load-store of 8-byte
elements
- Suppose a 98% instruction hit rate and a 90%
data hit rate
  => 320 MB/s instruction BW per processor
  => 960 MB/s data BW per processor
  => 1.28 GB/s combined BW per processor
- Assuming 10 GB/s bus bandwidth, 8 processors
will saturate the bus.
(Diagram: each PROC feeds its cache at 25.6 GB/s;
each cache puts 1.28 GB/s of miss traffic on the
shared bus to MEM)
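The arithmetic on this slide can be reproduced directly; all the figures below come from the slide itself.

```python
# Reproducing the snoopy-bus saturation arithmetic above.
inst_bw   = 16e9                   # 4 GHz * 4-byte (32-bit) instructions
data_bw   = 9.6e9                  # 4e9 * 30% load-store * 8-byte elements
inst_miss = inst_bw * (1 - 0.98)   # 2% instruction misses -> 320 MB/s
data_miss = data_bw * (1 - 0.90)   # 10% data misses       -> 960 MB/s
per_proc  = inst_miss + data_miss  # 1.28 GB/s of bus traffic per processor
bus_bw    = 10e9                   # assumed shared-bus bandwidth

print(round(per_proc / 1e9, 2))    # 1.28
print(int(bus_bw // per_proc))     # 7: the 8th processor saturates the bus
```

Only miss traffic reaches the bus, which is why a 10 GB/s bus can serve seven processors' worth of 1.28 GB/s streams before the eighth pushes it past capacity.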
15 - In cache coherence protocols such as
snoopy-based protocols, different messages have
different latency and bandwidth needs. To exploit
this, interconnects are designed from wires with
different latency, bandwidth, and energy
properties. Using some techniques, cache
coherence protocols can exploit these
interconnect wires to improve processor
performance and also reduce power consumption.
16 Techniques employed to improve snoopy-based
cache coherence protocols
- Three wired-OR signals: in this technique the
first signal is asserted when any cache besides
the requester has a copy of the block, and the
second signal is asserted when any cache has an
exclusive copy of the block. The third signal is
asserted when all snoop actions on the bus are
complete [9]. When the third signal is asserted,
the requesting L1 and the L2 can safely examine
the other two signals. Since all of these signals
are on the critical path, implementing them on
low-latency L-Wires can improve performance.
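The three wired-OR signals can be modeled as simple ORs/ANDs over the snoopers' states. This is a behavioral sketch; the cache representation and flag names are my assumptions.

```python
# Sketch of the three wired-OR snoop-response signals described above.
class C:
    def __init__(self, copies, done=True):
        self.copies = copies          # addr -> "shared" or "exclusive"
        self.snoop_done = done        # has this cache finished snooping?

def wired_or_signals(caches, addr, requester):
    others = [c for c in caches if c is not requester]
    s1 = any(addr in c.copies for c in others)                   # some other cache has a copy
    s2 = any(c.copies.get(addr) == "exclusive" for c in others)  # an exclusive copy exists
    s3 = all(c.snoop_done for c in others)                       # all snoop actions complete
    return s1, s2, s3

req = C({})
print(wired_or_signals([req, C({"X": "shared"}), C({})], "X", req))
# (True, False, True): a shared copy exists, no exclusive copy, snooping done
```

Only once `s3` is high are `s1` and `s2` meaningful, which matches the slide: the requester and the L2 wait for the completion signal before examining the other two.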
17 - Another technique used to improve the
snoopy-based protocol with low latencies is
voting wires. Generally, cache-to-cache transfers
supply data that is in the modified state, in
which case there is a single supplier [10]. On
the other hand, in the MESI protocol a block can
be retrieved from another cache rather than from
memory; when multiple caches share a copy, a
voting mechanism is generally employed to decide
which cache supplies the data. The voting
mechanism works with low latencies and improves
processor performance.
18 Directory-based Protocol
- In a directory-based protocol, memory is
distributed among the processors and a directory
is maintained for each such memory. L1 cache
misses are sent to the L2 caches, and a directory
maintained at each L2 cache stores the status of
each block. When a request comes from a requester
node for data held in another cache, the request
goes to the home node where the original data is
stored, to check whether it has the block. If it
is not available there, the home node forwards
the request to the remote node; the data is
fetched from the remote node and then sent to the
requester node. In chip multiprocessors,
especially in the latest technology we are using,
the Core 2 Duo, a write-invalidate
directory-based protocol is employed.
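The requester/home/remote flow above can be sketched as a lookup through a directory. The node names, directory layout, and data values here are illustrative assumptions.

```python
# Sketch of the directory-based read flow: requester -> home node ->
# remote owner -> requester.
directory = {"X": {"home": "N0", "owner": "N2"}}   # per-block sharing status
node_data = {"N2": {"X": 42}}                      # only the remote node holds X

def read_request(requester, addr):
    entry = directory[addr]
    home = entry["home"]
    if home in node_data and addr in node_data[home]:
        return node_data[home][addr]      # home node has the data itself
    remote = entry["owner"]               # otherwise forward to the owner
    value = node_data[remote][addr]       # fetch the block from the remote node
    return value                          # ...and deliver it to the requester

print(read_request("N1", "X"))   # 42, supplied via the remote node N2
```

Unlike the snoopy bus, no broadcast is needed: only the home node and (at most) one remote node see the request, which is what lets directory protocols scale past the bus-saturation limit computed earlier.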
20 Techniques used to improve directory-based
cache coherence
- Exclusive read request for a block in a shared
state
- Read request for a block in an exclusive state
- Proximity-aware coherence protocols
21 - Exclusive read request for a block in a
shared state
- In this approach both the acknowledgment and
reply messages are sent simultaneously through
the corresponding low-latency L-Wires and
low-power PW-Wires. This approach improves
performance and decreases power consumption.
22 - Read request for a block in an exclusive
state
- This approach improves performance by sending
the prioritized data through the L-Wires and the
least prioritized data through the PW-Wires.
23 ACCELERATING COHERENCE VIA PROXIMITY AWARENESS
- In this approach the requester sends a read
request to the home node when the block is not
found in its own L2 cache. The home node sends
the data itself if it holds the block; otherwise,
if the data is shared elsewhere, it forwards the
request to the nearest node that contains the
data. That nearest remote node sends the data to
the requester and an ACK to the home node. If the
home node does not get the ACK, it retries a few
more times; if it still gets no ACK from the
remote node, it sends the data to the requester
directly from memory.
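The "nearest sharer" selection is the heart of this technique and can be sketched in a few lines. The sharer list, node ids, and hop-count distance function are illustrative assumptions.

```python
# Sketch of proximity-aware forwarding: the home node picks the sharer
# closest to the requester instead of an arbitrary one.
sharers = {"X": ["N3", "N7"]}     # nodes currently holding a shared copy of X

def distance(a, b):
    # Illustrative distance metric: hop count from numeric node ids.
    return abs(int(a[1:]) - int(b[1:]))

def forward_target(requester, addr):
    # Home node forwards the request to the sharer nearest the requester.
    return min(sharers[addr], key=lambda n: distance(requester, n))

print(forward_target("N2", "X"))   # 'N3': the nearest sharer serves the request
```

Serving the request from the closest sharer shortens the data's path across the interconnect, which is where the latency saving over a fixed-owner reply comes from.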
24 Conclusion
- In conclusion, this paper argues that
multi-core processors are a better choice than
multiprocessors because chip complexity is
reduced, higher frequencies can be employed, and
better performance is achieved with lower power
consumption. However, the cache coherence problem
is an issue in multi-core processors. Using
protocols such as snoopy-based and
directory-based protocols, the cache coherence
problem is eliminated, but it comes at the cost
of a trade-off between latency and bandwidth. In
snoopy-based protocols, wire implementation
techniques such as the three wired-OR signals and
voting wires improve latency at the expense of
high bandwidth. Directory-based protocols are an
alternative to snoopy-based protocols that
achieves low latency and high bandwidth, and this
approach is implemented in present-day
technologies such as Core 2 Duo processors.