Title: CSCE 432/832 High Performance -- An Introduction to the Multicore Memory Hierarchy
1. CSCE 432/832 High Performance -- An Introduction to the Multicore Memory Hierarchy
2. What We Learnt from the Video
- The Motivation for Multi-core Processors
  - Better utilization of on-chip transistor resources as technology scales
  - Use thread-level parallelism to increase throughput
- Two Models of Multi-core Processors
  - Homogeneous vs. heterogeneous CMPs
- Communication and Synchronization among Cores
  - Cores communicate with each other via the shared cache/memory
  - Reads/writes are synchronized via locks, mutexes, or transactional memory
- How to Program Multi-core Processors
  - Using OpenMP to write parallel programs (a minimal sketch follows this list)
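To make the OpenMP point concrete, here is a minimal sketch (not from the video) of a parallel loop in C. The loop iterations are divided among threads, typically one per core, and the reduction clause handles the synchronization on the shared sum:

    #include <stdio.h>
    #include <omp.h>

    /* Minimal OpenMP sketch: iterations are split among threads, and
       reduction(+:sum) combines the per-thread partial sums safely.
       Compile with: gcc -fopenmp sum.c */
    int main(void) {
        const int N = 1000000;
        double sum = 0.0;

        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < N; i++)
            sum += 1.0 / (i + 1);   /* each thread adds up its own chunk */

        printf("sum = %f (up to %d threads)\n", sum, omp_get_max_threads());
        return 0;
    }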
3. From Teraflop Multiprocessor to Teraflop Multicore
ASCI Red (1997-2005)
4. Intel Teraflop Multicore Prototype
5. From Teraflop Multiprocessor to Teraflop Multicore
- Pictured here is ASCI Red, the first computer to reach a teraflops of processing, equal to trillions of calculations per second
  - Used about 10,000 Pentium processors running at 200 MHz
  - Consumed 500 kW of power for computation and another 500 kW for cooling
  - Occupied a very large room
- Just over 10 years later, Intel announced that it had developed the world's first processor to deliver the same teraflops performance all on a single chip
  - 80 cores on a single chip running at 5 GHz
  - Consuming only 62 watts of power
  - Small enough to rest on the tip of your finger
6. A Commodity Many-core Processor
- Tile64 Multicore Processor (2007-present)
7. The Schematic Design of Tile64
[Figure: Tile64 block diagram; the tile array is surrounded by I/O and memory blocks, including four DDR2 memory controllers, two PCIe MAC/PHY blocks with SerDes, two Gigabit Ethernet (GbE) ports, Flexible I/O, and UART, HPI, JTAG, I2C, and SPI interfaces]
- 4 essential components
  - Processor cores
  - On-chip caches
  - Network-on-Chip (NoC)
  - I/O controllers
8. Agenda Today
- An Introduction to the Multi-core Memory Hierarchy
  - Why do we need a memory hierarchy for any processor?
    - It is a tradeoff between capacity and latency
    - Make common cases fast by exploiting program locality (a general principle in computer architecture)
  - What is the difference between the memory hierarchies of single-core and multi-core CPUs?
    - They are quite distinct from each other in their on-chip caches
    - Managing the CMP caches is of paramount importance to performance
    - Again, we still have the capacity and latency issues for CMP caches
    - How to keep CMP caches coherent
    - Hardware & software management schemes
9. The Motivation for the Memory Hierarchy
- Trading off between capacity and latency: upper levels are smaller, faster, and costlier per byte; lower levels are larger and slower

    Level                    Capacity        Access Time    Cost       Managed by         Transfer Unit
    Registers (on-chip)      100s of bytes   0.3-0.5 ns     -          program/compiler   4-8 bytes (instruction operands)
    L1/L2 caches (on-chip)   10s-100s KB     1 ns - 10 ns   -          cache controller   32- or 64-byte blocks (L1); 64- or 128-byte blocks (L2)
    Main memory (off-chip)   GBs             200-300 ns     $15/GB     OS                 4 KB - 64 KB pages
    Disk (off-chip)          1s-10s TB       10 ms          $0.15/GB   -                  -
10. Program Locality
- Two Kinds of Basic Locality
  - Temporal: if a memory location is referenced, then it is likely that the same memory location will be referenced again in the near future, e.g., the loop variables i and j in:

        int i; register int j;
        for (i = 0; i < 20000; i++)
          for (j = 0; j < 300; j++)
            ...

  - Spatial: if a memory location is referenced, then it is likely that nearby memory locations will be referenced in the near future (the sketch after this list contrasts the two)
- Locality + "smaller HW is faster" => make common cases fast => memory hierarchy
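To contrast the two kinds of locality in code, here is a small C sketch (the array size is an illustrative assumption). Both functions compute the same sum, but the row-major walk touches consecutive addresses and thus enjoys spatial locality, while the column-major walk strides by an entire row on every access:

    #include <stdio.h>

    #define ROWS 1000
    #define COLS 1000
    static int a[ROWS][COLS];

    /* C stores rows contiguously, so walking along a row hits the same
       cache block repeatedly (spatial locality). */
    long sum_row_major(void) {
        long s = 0;
        for (int i = 0; i < ROWS; i++)
            for (int j = 0; j < COLS; j++)
                s += a[i][j];
        return s;
    }

    /* Walking down a column jumps COLS*sizeof(int) bytes per access,
       so each access typically lands in a different cache block. */
    long sum_col_major(void) {
        long s = 0;
        for (int j = 0; j < COLS; j++)
            for (int i = 0; i < ROWS; i++)
                s += a[i][j];
        return s;
    }

    int main(void) {
        printf("%ld %ld\n", sum_row_major(), sum_col_major());
        return 0;
    }

The accumulator s and the loop counters, reused on every iteration, are temporal locality at work in both versions.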
11. The Challenges of the Memory Wall
- The Truths
  - In many applications, 30-40% of the total instructions are memory operations
  - CPU speed scales much faster than DRAM speed
    - In 1980, CPUs and DRAMs operated at almost the same speed, about 4 MHz to 8 MHz
    - CPU clock frequency has doubled every 2 years
    - DRAM speed has only been doubling about every 6 years (the gap is worked out below)
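Under the stated doubling rates, the size of the CPU-DRAM speed gap after t years is

\[ \mathrm{gap}(t) = \frac{2^{t/2}}{2^{t/6}} = 2^{t/3}, \]

i.e., the gap itself doubles roughly every 3 years.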
12. Memory Wall
- DRAM bandwidth is quite limited: two DDR2-800 modules can reach a bandwidth of 12.8 GB/s (about 6.4 bytes per CPU cycle if the CPU runs at 2 GHz, as worked out below). So, in a multicore processor, when multiple 64-bit cores need to access memory at the same time, they exacerbate contention for the DRAM bandwidth.
- Memory Wall: the CPU needs to spend a lot of time on off-chip memory accesses. E.g., the Intel XScale spends on average 35% of its total execution time on memory accesses. The high latency and low bandwidth of the DRAM system become a bottleneck for CPUs.
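The per-cycle figure follows directly from the stated numbers:

\[ \frac{12.8\ \text{GB/s}}{2 \times 10^{9}\ \text{cycles/s}} = 6.4\ \text{B/cycle}. \]

With n cores sharing the two modules, each core's fair share shrinks to 6.4/n bytes per cycle; with four cores (an illustrative count), that is 1.6 B/cycle, well below the 8 bytes a single 64-bit load consumes.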
13. Solutions
- How to alleviate the memory wall problem
  - Hide the memory access latency: prefetching (a sketch follows this list)
  - Reduce the latency by making memory closer to the CPU: 3D-stacked on-chip DRAM
  - Increase the bandwidth: optical I/O
  - Reduce the number of memory accesses: keep as much reusable data in the cache as possible
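As one concrete instance of latency hiding, here is a sketch of software prefetching in C using GCC/Clang's __builtin_prefetch; the prefetch distance of 16 elements and the array itself are illustrative assumptions that would need tuning in practice:

    #include <stdio.h>

    #define N (1 << 20)
    static double x[N];

    /* Issue a prefetch for data needed 16 iterations from now, so the
       memory access overlaps with computation on earlier elements. */
    double sum_with_prefetch(void) {
        double s = 0.0;
        for (int i = 0; i < N; i++) {
            if (i + 16 < N)
                __builtin_prefetch(&x[i + 16], 0, 1); /* 0 = read, 1 = low reuse */
            s += x[i];
        }
        return s;
    }

    int main(void) {
        for (int i = 0; i < N; i++) x[i] = 1.0;
        printf("%f\n", sum_with_prefetch());
        return 0;
    }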
14. CMP Cache Organizations (Shared L2 Cache)
15. CMP Cache Organizations (Private L2 Cache)
16. How to Address Blocks in a CMP
- How to address blocks in a single-core processor
  - L1 caches are typically virtually indexed but physically tagged, while L2 caches are mostly physically indexed and tagged (related to virtual memory)
- How to address blocks in a CMP
  - L1 caches are accessed in the same way as in a single-core processor
  - If the L2 caches are private, the addressing of a block is still the same
  - If the L2 caches are shared among all of the cores, then the address must also pick out which tile's L2 bank holds the block, as illustrated on the next two slides and sketched after them
17. How to Address Blocks in a CMP
18. How to Address Blocks in a CMP
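Slides 17 and 18 illustrate the shared-L2 case. As a hedged sketch of one common scheme (the 64-byte block size and 16 banks are assumptions, not from the slides), consecutive block addresses are interleaved across the per-tile L2 banks, so the home bank comes straight from low-order bits of the block address:

    #include <stdio.h>
    #include <stdint.h>

    #define BLOCK_OFFSET_BITS 6   /* 64-byte blocks */
    #define NUM_BANKS         16  /* one L2 bank per tile */

    /* Home bank of a physical address under block-interleaved mapping. */
    int home_bank(uint64_t paddr) {
        uint64_t block = paddr >> BLOCK_OFFSET_BITS;  /* block address */
        return (int)(block % NUM_BANKS);              /* bank-select bits */
    }

    int main(void) {
        printf("%d\n", home_bank(0x0000));  /* block 0 -> bank 0 */
        printf("%d\n", home_bank(0x0040));  /* next block -> bank 1 */
        return 0;
    }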
19. CMP Cache Coherence
- Snoop-based
  - All caches on the bus snoop the bus to determine if they have a copy of the block of data that is requested on the bus. Multiple copies of a data block can be read without any coherence problems; however, a processor must have exclusive access to the bus (and either invalidate or update the other copies) in order to write.
  - Sufficient for small-scale CMPs with a bus interconnect (a simplified sketch follows this list)
- Directory-based
  - The data being shared is tracked in a common directory that maintains the coherence between caches. When a cache line is changed, the directory either updates or invalidates the other caches holding that cache line.
  - Necessary for many-core CMPs with an interconnect such as a mesh
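As a concrete illustration of the snoop-based idea, here is a deliberately simplified MSI (Modified/Shared/Invalid) state machine for a single cache line; the names and structure are illustrative, not taken from the slides:

    #include <stdio.h>

    typedef enum { INVALID, SHARED, MODIFIED } MsiState;
    typedef struct { MsiState state; } Line;

    /* Local read: a miss in INVALID fetches the block as SHARED. */
    void cpu_read(Line *l) {
        if (l->state == INVALID) l->state = SHARED;   /* BusRd: others may keep copies */
    }

    /* Local write: must gain exclusive access first. */
    void cpu_write(Line *l) {
        if (l->state != MODIFIED) l->state = MODIFIED; /* BusRdX: invalidates other copies */
    }

    /* Snooped remote read: a MODIFIED line supplies data and downgrades. */
    void snoop_read(Line *l) {
        if (l->state == MODIFIED) l->state = SHARED;
    }

    /* Snooped remote write: any local copy must be invalidated. */
    void snoop_write(Line *l) {
        l->state = INVALID;
    }

    int main(void) {
        Line a = {INVALID}, b = {INVALID};   /* same block in two caches */
        cpu_read(&a); cpu_read(&b);          /* multiple SHARED copies are fine */
        cpu_write(&a); snoop_write(&b);      /* a writes: b's copy is invalidated */
        printf("a=%d b=%d\n", a.state, b.state); /* a=2 (MODIFIED), b=0 (INVALID) */
        return 0;
    }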
20. Non-Uniform Cache Access Time in Shared L2 Caches
21. Non-Uniform Cache Access Time in Shared L2 Caches
- Let's assume that Core0 needs to access a data block stored in Tile15
- Assume that accessing an L2 cache bank takes 10 cycles
- Assume that transferring a data block from one router to an adjacent one takes 2 cycles
- In the pictured mesh, Tile15 is 6 router hops away from Core0, so a remote access to the block in Tile15 takes 10 + 2 × (2 × 6) = 34 cycles (the bank access plus the request and reply traversals), much more than a local L2 access; the small calculator below reproduces this
- Non-Uniform Cache Access (NUCA) time means that the latency of accessing a cache is a function of the physical locations of both the requesting core and the cache
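A small round-trip latency calculator under the slide's assumptions (10-cycle bank access, 2 cycles per hop, XY routing); the 4x4 row-major tile numbering is an assumption for illustration:

    #include <stdio.h>
    #include <stdlib.h>

    /* Manhattan hop count in a 4x4 mesh, then bank access plus request
       and reply traversals of the network. */
    int nuca_latency(int src_tile, int dst_tile) {
        int hops = abs(src_tile % 4 - dst_tile % 4)   /* X distance */
                 + abs(src_tile / 4 - dst_tile / 4);  /* Y distance */
        return 10 + 2 * (2 * hops);
    }

    int main(void) {
        printf("%d\n", nuca_latency(0, 15)); /* 6 hops each way: 10 + 2*(2*6) = 34 */
        printf("%d\n", nuca_latency(0, 0));  /* local bank: 10 cycles */
        return 0;
    }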
22. How to Reduce the Latency of Remote Cache Accesses
- At least two solutions
  - Place the data close enough to the requesting core
    - Victim replication [1]: placing L1 victim blocks in the local L2 cache
    - Change the layout of the data (I will talk about one approach pretty soon)
  - Use faster transmission
    - Use a special on-chip interconnect that transmits data via radio-wave or light-wave signals
23. The RF-Interconnect [2]
24. Interference in Caching in Shared L2 Caches
- The Problem: because the shared L2 caches are accessible to all cores, one core can interfere with another in placing blocks in the L2 caches
  - For example, in a dual-core CMP, if a streaming application like a video player is co-scheduled with a scientific computation application that has good locality, then the aggressive streaming application will continuously place new blocks in the L2 cache and replace the computation application's cached blocks, thus hurting the computation application's performance
- Solution
  - Regulate the cores' usage of the L2 cache based on the utility each core derives from the cache [3] (a simplified partitioning sketch follows this list)
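One simple way to regulate cache usage is static way partitioning, sketched below. This is a simplification for illustration, not the mechanism of reference [3] (which adaptively changes the insertion policy based on measured utility); the 8-way split is an assumed configuration:

    #include <stdio.h>

    #define NUM_WAYS 8

    /* Each way of every set is owned by one core, so a streaming core
       cannot evict another core's blocks. */
    typedef struct { int owner[NUM_WAYS]; } SetPartition;

    /* Grant `ways` consecutive ways, starting at `first`, to `core`. */
    void grant_ways(SetPartition *p, int core, int first, int ways) {
        for (int w = first; w < first + ways && w < NUM_WAYS; w++)
            p->owner[w] = core;
    }

    /* A victim may only be chosen among the requesting core's own ways. */
    int may_evict(const SetPartition *p, int core, int way) {
        return p->owner[way] == core;
    }

    int main(void) {
        SetPartition p;
        grant_ways(&p, 0, 0, 6);  /* computation app: 6 ways */
        grant_ways(&p, 1, 6, 2);  /* streaming app: 2 ways */
        printf("core1 may evict way 0? %d\n", may_evict(&p, 1, 0)); /* 0: no */
        return 0;
    }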
25. The Capacity Problems in Private L2 Caches
- The Problems
  - The L2 capacity accessible to each core is fixed, regardless of the core's real cache capacity demand. E.g., if two applications are co-scheduled on a dual-core CMP with two 1 MB private L2 caches, and one application has a cache demand of 0.5 MB while the other asks for 1.5 MB, then one private L2 cache is underutilized while the other is overwhelmed.
  - If a parallel program is running on the CMP, different cores will have a lot of data in common. However, the private L2 cache organization requires each core to maintain a copy of the common data in its local cache, leading to a lot of data redundancy and degrading the effective cache capacity.
- A Solution: Cooperative Caching [4]
26. A Comparison Between Shared and Private L2 Caches
27. Using the OS to Manage CMP Caches [5]
- Two kinds of address space
  - Virtual (or logical) and physical
- Page coloring: there is a correspondence between a physical page and its location in the cache
- In CMPs with a shared L2 cache, by changing the mapping scheme, the OS can determine where a virtual page required by a core is located in the L2 cache:

    Tile (where a page is cached) = physical page number mod #Tiles

  (a sketch of this mapping follows)
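A sketch of the slide's mapping in C; the 4 KB page size and 16-tile count are assumptions for illustration:

    #include <stdio.h>
    #include <stdint.h>

    #define PAGE_SHIFT 12   /* 4 KB pages */
    #define NUM_TILES  16   /* e.g., a 4x4 tiled CMP */

    /* The tile caching a physical page is its page number mod #Tiles. */
    int home_tile(uint64_t paddr) {
        uint64_t ppn = paddr >> PAGE_SHIFT;  /* physical page number */
        return (int)(ppn % NUM_TILES);
    }

    int main(void) {
        /* By choosing which physical page backs a virtual page, the OS
           chooses the tile where that page will be cached. */
        printf("%d\n", home_tile(0x0000ULL));   /* page 0  -> tile 0 */
        printf("%d\n", home_tile(0x5000ULL));   /* page 5  -> tile 5 */
        printf("%d\n", home_tile(0x11000ULL));  /* page 17 -> tile 1 */
        return 0;
    }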
28. Using the OS to Manage CMP Caches
29. Using the OS to Manage CMP Caches
- The Benefits
- Improved Data Proximity
- Capacity Sharing
- Data Sharing (to be introduced next time)
30. Summary
- What we have covered this class
  - The memory wall problem for CMPs
  - The two basic cache organizations for CMPs
  - HW & SW approaches to managing the last-level cache
31. References
- [1] M. Zhang, et al. Victim Replication: Maximizing Capacity while Hiding Wire Delay in Tiled Chip Multiprocessors. ISCA'05.
- [2] F. Chang, et al. CMP Network-on-Chip Overlaid With Multi-Band RF-Interconnect. HPCA'08.
- [3] A. Jaleel, et al. Adaptive Insertion Policies for Managing Shared Caches. PACT'08.
- [4] J. Chang, et al. Cooperative Caching for Chip Multiprocessors. ISCA'06.
- [5] S. Cho, et al. Managing Distributed, Shared L2 Caches through OS-Level Page Allocation. MICRO'06.