CSCE 930 Advanced Computer Architecture ---- A Brief Introduction to CMP Memory Hierarchy

Provided by: cseUnlEd

Learn more at: http://cse.unl.edu


1
CSCE 930 Advanced Computer Architecture ---- A Brief Introduction to CMP Memory Hierarchy & Simulators

Dongyuan Zhan

2
From Teraflop Multiprocessor to Teraflop Multicore
ASCI Red (1997-2005)
3
Intel Teraflop Multicore Prototype
4
From Teraflop Multiprocessor to Teraflop Multicore

Pictured here is ASCI Red, the first computer to reach one teraflops, i.e., one trillion floating-point operations per second.
It used about 10,000 Pentium processors running at 200 MHz.
It consumed 500 kW of power for computation and another 500 kW for cooling.
It occupied a very large room.
Just over 10 years later, Intel announced the world's first processor to deliver the same teraflops performance on a single chip.
80 cores on a single chip running at 5 GHz
Consuming only 62 watts of power
Small enough to rest on the tip of your finger

5
A Commodity Many-core Processor

Tile64 Multicore Processor (2007-now)

6
The Schematic Design of Tile64
[Figure: Tile64 block diagram. I/O blocks labeled around the tile array: DDR2 Memory Controller 0, DDR2 Memory Controller 3, PCIe 0 and PCIe 1 (MAC/PHY), SerDes, GbE 0, GbE 1, Flexible I/O, and UART, HPI, JTAG, I2C, SPI.]

4 essential components
Processor core
On-chip cache
Network-on-Chip (NoC)
I/O controllers
7
Outline

An Introduction to the Multi-core Memory Hierarchy
Why do we need a memory hierarchy for any processor?
A tradeoff between capacity and latency
Make the common case fast by exploiting program locality (a general principle in computer architecture)
What is the difference between the memory hierarchies of single-core and multi-core CPUs?
They differ mainly in the on-chip caches
Managing the CMP caches is of paramount importance to performance
CMP caches still face the capacity and latency issues
How to keep CMP caches coherent
Hardware and software management schemes

8
The Motivation for Mem Hierarchy
Trading off between capacity and latency

Level (fastest first)        Capacity           Access time    Cost (per GByte)
CPU registers (on chip)      100s of bytes      0.3-0.5 ns
L1 and L2 cache (on chip)    10s-100s KBytes    1-10 ns
Main memory (off chip)       GBytes             200-300 ns     15
Disk (off chip)              1s-10s TBytes      10 ms          0.15

Transfer between levels      Unit               Size               Managed by
Registers <-> L1 cache       instr. operands    4-8 bytes          program/compiler
L1 cache <-> L2 cache        blocks             32 or 64 bytes     cache controller
L2 cache <-> main memory     blocks             64 or 128 bytes    cache controller
Main memory <-> disk         pages              4K-64K bytes       OS

Upper levels are faster; lower levels are larger.
9
Program Locality

Two Kinds of Basic Locality
Temporal
If a memory location is referenced, then it is likely that the same memory location will be referenced again in the near future.
int i; register int j;
for (i = 0; i < 20000; i++)
    for (j = 0; j < 300; j++)
        ;  /* i and j are referenced again on every iteration */
Spatial
If a memory location is referenced, then it is likely that nearby memory locations will be referenced in the near future.
Locality, plus the fact that smaller hardware is faster, motivates the memory hierarchy: make the common case fast.

10
The Challenge of the Memory Wall

The Truths
In many applications, 30-40% of all instructions are memory operations.
CPU speed scales much faster than DRAM speed.
In 1980, CPUs and DRAMs operated at almost the same speed, about 4 MHz-8 MHz.
CPU clock frequency has doubled every 2 years.
DRAM speed has only been doubling about every 6 years.
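
As a rough illustration of the doubling rates quoted on this slide, the sketch below compounds both trends from a common starting point. The function name `speedup` and the 24-year horizon are assumptions chosen for illustration; real scaling has not been this regular.

```python
def speedup(years: float, doubling_period_years: float) -> float:
    """Growth factor after `years` if speed doubles every `doubling_period_years`."""
    return 2.0 ** (years / doubling_period_years)

years = 24                  # hypothetical horizon, not a figure from the slides
cpu = speedup(years, 2)     # CPU clock doubles every 2 years
dram = speedup(years, 6)    # DRAM speed doubles every 6 years
gap = cpu / dram            # how far the CPU has pulled ahead of DRAM

print(f"CPU: {cpu:.0f}x, DRAM: {dram:.0f}x, gap: {gap:.0f}x")
```

Under these idealized rates the gap itself doubles every 3 years, which is why the memory wall keeps getting worse even though DRAM is also improving.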

11
Memory Wall

DRAM bandwidth is quite limited: two DDR2-800 modules can reach a bandwidth of 12.8 GB/s (about 6.4 bytes per CPU cycle if the CPU runs at 2 GHz). So, in a multicore processor, when multiple 64-bit cores need to access the memory at the same time, they exacerbate contention for the DRAM bandwidth.
Memory Wall: the CPU needs to spend a lot of time on off-chip memory accesses. E.g., Intel XScale spends on average 35% of the total execution time on memory accesses. The high latency and low bandwidth of the DRAM system become a bottleneck for CPUs.
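
The bandwidth arithmetic above can be checked with a short sketch. The even split of bandwidth among cores and the core counts are assumptions for illustration only; real memory controllers do not divide bandwidth this evenly.

```python
def bytes_per_cycle(bandwidth_bytes_per_s: float, cpu_hz: float) -> float:
    """DRAM bytes deliverable per CPU clock cycle."""
    return bandwidth_bytes_per_s / cpu_hz

DRAM_BW = 12.8e9   # two DDR2-800 modules, bytes per second (from the slide)
CPU_HZ = 2.0e9     # 2 GHz CPU clock (from the slide)

total = bytes_per_cycle(DRAM_BW, CPU_HZ)
share = {}
for cores in (1, 2, 4, 8):      # hypothetical core counts
    share[cores] = total / cores  # assumed even split among cores
    print(f"{cores} core(s): {share[cores]:.2f} bytes/cycle each")
```

With 8 cores sharing the same modules, each core gets less than one byte per cycle, well under the 8 bytes a single 64-bit load would need.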

12
Solutions

How to alleviate the memory wall problem
Hiding the memory access latency: prefetching
Reducing the latency: moving memory closer to the CPU, e.g., 3D-stacked on-chip DRAM
Increasing the bandwidth: optical I/O
Reducing the number of memory accesses: keeping as much reusable data in the cache as possible

13
CMP Cache Organizations (Shared L2 Cache)
14
CMP Cache Organizations (Private L2 Cache)
15
How to Address Blocks in a CMP

How to address blocks in a single-core processor
L1 caches are typically virtually indexed but physically tagged, while L2 caches are mostly physically indexed and physically tagged (this is related to virtual memory).
How to address blocks in a CMP
L1 caches are accessed in the same way as in a single-core processor.
If the L2 caches are private, the addressing of a block is still the same.
If the L2 caches are shared among all of the cores, then the tile that holds the block is located first and the set is then indexed within that tile.

16
How to Address Blocks in a CMP
17
How to Address Blocks in a CMP
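
One way to picture shared-L2 addressing in a tiled CMP is to split the physical address into a block offset, a home-tile number, a set index, and a tag. The sketch below is only an illustration: the block size, tile count, sets per slice, and the choice of which address bits select the tile are all assumptions, not values taken from the slides.

```python
from typing import Tuple

BLOCK_BYTES = 64        # assumed cache block size
NUM_TILES = 16          # assumed 4x4 tiled CMP
SETS_PER_SLICE = 1024   # assumed number of sets in each tile's L2 slice

def decode(paddr: int) -> Tuple[int, int, int]:
    """Split a physical address into (home tile, set index, tag)."""
    block = paddr // BLOCK_BYTES          # drop the block offset
    tile = block % NUM_TILES              # step 1: locate the home tile
    rest = block // NUM_TILES
    set_index = rest % SETS_PER_SLICE     # step 2: index the set inside that tile
    tag = rest // SETS_PER_SLICE          # remaining bits identify the block
    return tile, set_index, tag

print(decode(0x12345678))
```

With this assumed layout, consecutive blocks land on consecutive tiles, which spreads accesses across the chip but also means most blocks live in a non-local slice.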
18
CMP Cache Coherence

Snoop based
All caches on the bus snoop the bus to determine whether they have a copy of the block of data that is requested on the bus. Multiple copies of a data block can be read without any coherence problems; however, a processor must gain exclusive access over the bus (by either invalidating or updating the other copies) in order to write.
Sufficient for small-scale CMPs with a bus interconnect
Directory based
The data being shared is tracked in a common directory that maintains coherence between caches. When a cache line is changed, the directory either updates or invalidates the other caches holding that cache line.
Necessary for many-core CMPs with interconnects such as a mesh
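
The directory idea can be sketched in a few lines. This is a deliberately minimal, invalidate-only model: the class name `Directory` and its methods are hypothetical, and it ignores protocol states, data transfer, and message latency that a real protocol must handle.

```python
class Directory:
    """Tracks, per block address, which caches hold a copy (invalidate on write)."""

    def __init__(self):
        self.sharers = {}  # block address -> set of cache ids holding a copy

    def read(self, cache_id, addr):
        # Any number of caches may hold read-only copies of the same block.
        self.sharers.setdefault(addr, set()).add(cache_id)

    def write(self, cache_id, addr):
        # The writer becomes the only holder; everyone else must invalidate.
        others = self.sharers.get(addr, set()) - {cache_id}
        self.sharers[addr] = {cache_id}
        return sorted(others)  # caches that receive an invalidation

d = Directory()
for cache in (0, 1, 2):
    d.read(cache, 0x40)
invalidated = d.write(1, 0x40)
print(invalidated)
```

Only the caches listed in the directory entry are contacted on a write, which is what makes the scheme scale better than broadcasting on a bus.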

19
Interference in Caching in Shared L2 Caches

The Problem: because the shared L2 caches are accessible to all cores, one core can interfere with another in placing blocks in the L2 caches.
For example, in a dual-core CMP, if a streaming application such as a video player is co-scheduled with a scientific computation application that has good locality, then the aggressive streaming application will continuously place new blocks in the L2 cache and replace the computation application's cached blocks, thus hurting the computation application's performance.
Solution
Regulate each core's usage of the L2 cache based on the utility it gets from the cache [3].
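
The interference effect can be reproduced with a toy simulation. The sketch below assumes a fully associative LRU cache with 8 blocks, a "compute" application that loops over 4 blocks, and a "stream" application that never reuses a block; all of these parameters are made up for illustration.

```python
from collections import OrderedDict

def run(trace, capacity):
    """Simulate a fully associative LRU cache; return hit counts per application."""
    cache = OrderedDict()
    hits = {}
    for app, block in trace:
        key = (app, block)
        hits.setdefault(app, 0)
        if key in cache:
            hits[app] += 1
            cache.move_to_end(key)          # mark as most recently used
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)   # evict the least recently used block
            cache[key] = True
    return hits

CAPACITY = 8  # assumed shared-cache size in blocks

# The compute application alone: 40 accesses looping over 4 blocks.
compute_alone = [("compute", i % 4) for i in range(40)]

# Co-scheduled: after each compute access, the stream touches 3 new blocks.
mixed, next_stream_block = [], 0
for i in range(40):
    mixed.append(("compute", i % 4))
    for _ in range(3):
        mixed.append(("stream", next_stream_block))
        next_stream_block += 1

alone_hits = run(compute_alone, CAPACITY)["compute"]
shared_hits = run(mixed, CAPACITY)["compute"]
print(alone_hits, shared_hits)
```

Running alone, the compute application misses only on its first 4 accesses; when co-scheduled, the stream pushes every compute block out before it is reused, even though the stream itself gains nothing from the cache.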

20
The Capacity Problems in Private L2 Caches

The Problems
The L2 capacity accessible to each core is fixed, regardless of the core's real cache capacity demand. E.g., if two applications are co-scheduled on a dual-core CMP with two 1 MB private L2 caches, and one application has a cache demand of 0.5 MB while the other asks for 1.5 MB, then one private L2 cache is underutilized while the other is overwhelmed.
If a parallel program is running on the CMP, different cores will have a lot of data in common. However, the private L2 cache organization requires each core to maintain a copy of the common data in its local cache, leading to a lot of data redundancy and degrading the effective cache capacity.
A Solution: Cooperative Caching [4]

21
Non-Uniform Cache Access Time in Shared L2 Caches
22
Non-Uniform Cache Access Time in Shared L2 Caches

Let's assume that Core0 needs to access a data block stored in Tile15.
Assume that accessing an L2 cache bank takes 10 cycles.
Assume that transferring a data block from one router to an adjacent one takes 2 cycles.
Then a remote access to the block in Tile15 takes 10 + 2 × (2 × 6) = 34 cycles (6 hops each way), far more than a local L2 access.
Non-Uniform Cache Access (NUCA) means that the latency of accessing a cache is a function of the physical locations of both the requesting core and the cache.
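
The latency calculation on this slide can be written as a small function. The sketch assumes a 4x4 mesh with tiles numbered row by row and minimal (Manhattan-distance) routing; the mesh size and numbering are assumptions inferred from the 6-hop distance between Tile0 and Tile15, and the function names are hypothetical.

```python
MESH_WIDTH = 4     # assumed 4x4 mesh, tiles numbered row by row
BANK_CYCLES = 10   # L2 bank access time (from the slide)
HOP_CYCLES = 2     # router-to-router transfer time (from the slide)

def hops(src: int, dst: int, width: int = MESH_WIDTH) -> int:
    """Manhattan distance between two tiles in the mesh."""
    sx, sy = src % width, src // width
    dx, dy = dst % width, dst // width
    return abs(sx - dx) + abs(sy - dy)

def l2_latency(core_tile: int, bank_tile: int) -> int:
    """Bank access plus a round trip (request and reply) over the mesh."""
    return BANK_CYCLES + HOP_CYCLES * 2 * hops(core_tile, bank_tile)

print(l2_latency(0, 0), l2_latency(0, 15))
```

A local access costs only the 10-cycle bank access, while the corner-to-corner access costs 34 cycles, matching the figure above.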

23
How to Reduce the Latency of Remote Cache Access

At least two solutions
Place the data close enough to the requesting core
Victim replication [1]: placing L1 victim blocks in the local L2 cache
Change the layout of the data (one approach is covered shortly)
Use faster transmission
Use a special on-chip interconnect to transmit data via radio-wave or light-wave signals

24
A Comparison Between Shared and Private L2 Caches
Aspect | L2S (shared) | L2P (private)
Set mapping | First locate the tile and then index the set | The same as in a single-core CPU
Coherence directory | Each L2 entry has its own directory bits; no separate directory caches | Independent shared directory caches employing the same mapping scheme as in Fig.
Capacity | High aggregate capacity for any core | Relatively low capacity for each core
Latency | Due to the distributed mapping, a lot of requested data (on-chip) will be in a non-local L2 | The requested data (on-chip) is in the private (closest) L2
Sharing | Capacity, data | None
Performance isolation | Severe contention in L2 capacity allocation among cores | No interference among cores
Commodity CMPs | Intel Core 2 Duo E6600, Sun Sparc Niagara2, Tilera Tile64 (64 cores) | AMD Athlon64 6400, Intel Pentium D 840
25
The RF-Interconnect [2]
26
Using OS to Manage CMP Caches [5]

Two kinds of address space
Virtual (or logical) and physical
Page coloring: there is a correspondence between a physical page and its location in the cache
In CMPs with a shared L2 cache, by changing the mapping scheme, we can use the OS to determine where a virtual page required by a core is located in the L2 cache
Tile (where a page is cached) = physical page number % #Tiles
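
The mapping above lets the OS steer a page to a chosen tile simply by choosing which physical page backs it. The sketch below illustrates the idea with a hypothetical free-page list and allocator; the tile count, page numbers, and function names are assumptions, not part of the scheme in [5].

```python
NUM_TILES = 16  # assumed number of tiles

def home_tile(ppn: int) -> int:
    """Tile whose L2 slice caches the given physical page."""
    return ppn % NUM_TILES

def allocate_page(free_ppns: list, wanted_tile: int):
    """Pick a free physical page that maps to wanted_tile, else fall back to any page."""
    for ppn in free_ppns:
        if home_tile(ppn) == wanted_tile:
            free_ppns.remove(ppn)   # safe: we return immediately after removing
            return ppn
    return free_ppns.pop(0) if free_ppns else None

free = list(range(100, 140))        # hypothetical free physical page numbers
ppn = allocate_page(free, wanted_tile=5)
print(ppn, home_tile(ppn))
```

Here the OS wants the page cached next to the core on tile 5, so it hands out the first free page whose number is congruent to 5 modulo 16.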

27
Using OS to Manage CMP Caches
28
Summary

What we have covered in this class
The Memory Wall problem for CMPs
The two basic cache organizations for CMPs
HW and SW approaches to managing the last-level cache

29
References

[1] M. Zhang, et al. Victim Replication: Maximizing Capacity while Hiding Wire Delay in Tiled Chip Multiprocessors. ISCA'05.
[2] F. Chang, et al. CMP Network-on-Chip Overlaid With Multi-Band RF-Interconnect. HPCA'08.
[3] A. Jaleel, et al. Adaptive Insertion Policies for Managing Shared Caches. PACT'08.
[4] J. Chang, et al. Cooperative Caching for Chip Multiprocessors. ISCA'06.
[5] S. Cho, et al. Managing Distributed, Shared L2 Caches through OS-Level Page Allocation. MICRO'06.

30
Outline

An Overview of CMP Research Tools
A Detailed Introduction to SIMICS
A Detailed Introduction to GEMS
Other Online Resources

31
An Overview of CMP Research Tools

CMP Simulators
SESC (http://users.soe.ucsc.edu/renau/rtools.html)
M5 (http://www.m5sim.org/wiki/index.php/Main_Page)
Simics (https://www.simics.net/)
Benchmark Suites
Single-threaded Applications
SPEC2000 (www.spec.org)
SPEC2006
Multi-threaded Applications
SPECOMP2001
SPECWeb2009
SPLASH2 (http://www-flash.stanford.edu/apps/SPLASH/)
Parsec (http://parsec.cs.princeton.edu/)

32
An Overview of CMP Research Tools

A Taxonomy of Simulation
Functional vs. Timing
Functional simulation: simulates the functionality of a system
Timing simulation: simulates the timing behavior of a system
Full System (FS) vs. Non-FS
Full-system simulation: like a VM that can boot OSes
Syscall emulation: no OS, but syscalls are emulated by the simulator
Simulation Stages
Configuration stage: connect cores, caches, DRAMs, interconnects and I/Os to build up a system
Fast-forward stage: skip the initialization phase of a benchmark program without timing simulation
Warm-up stage: fill the pipelines, branch predictors and caches by executing a certain number of instructions, without counting them in the performance statistics
Simulation stage: detailed simulation to obtain performance statistics

33
An Overview of CMP Research Tools

The Commonly Used CMP Simulators
SESC
Only supports timing simulation with syscall emulation
Only supports the MIPS ISA
Able to cooperate seamlessly with Cacti (power), Hotspot (temperature) and Hotleakage (static power)
Especially useful in power/thermal research
Cacti: http://www.cs.utah.edu/rajeev/cacti6/
Hotspot: http://lava.cs.virginia.edu/HotSpot/
Hotleakage: http://lava.cs.virginia.edu/HotLeakage/index.htm

34
An Overview of CMP Research Tools

The Commonly Used CMP Simulators
SIMICS (commercial, but free to use for academia)
Only supports functional full-system simulation
Supports multiple ISAs
SPARC V9 (well supported by public-domain add-on modules)
x86, Alpha, MIPS, ARM (seldom supported by 3rd-party modules)
Needs add-on models to do performance and power simulation
GEMS (http://www.cs.wisc.edu/gems/)
It has two components for performance simulation
OPAL: an out-of-order processing core model
RUBY: a detailed CMP memory hierarchy model
Simflex (http://parsa.epfl.ch/simflex/)
It is similar to GEMS in functionality
It supports statistical sampling for simulation
Garnet (http://www.princeton.edu/niketa/garnet.html)
It supports performance and power simulation for NoCs

35
An Overview of CMP Research Tools

The Commonly Used CMP Simulators
M5
Supports both functional and timing simulation
Has two simulation modes: full-system (FS) and syscall emulation (SE)
Supports multiple ISAs
ALPHA: well developed, supporting both FS and SE modes
It models processor cores, the memory hierarchy and I/O systems
Written in C++, Python and SWIG, and fully open source
More about M5: http://www.m5sim.org/wiki/index.php/Main_Page
The most important document: http://www.m5sim.org/wiki/index.php/Tutorials

36
A Detailed Introduction to M5

M5's Source Tree Structure

37
A Detailed Introduction to M5

CPU Modeled by M5
SimpleCPU
TimingCPU
O3CPU

38
A Detailed Introduction to M5

Memory Hierarchy Modeled by M5

39
A Detailed Introduction to Simics

Directory Tree Organization
Under the root directory of Simics
licenses: licenses for functional Simics
doc: detailed documents about all aspects
targets: Simics scripts that describe specific computer systems
src: Simics header files for user programming
amd64-linux: dynamic modules (.so files) that are invoked by Simics to build up modeled computer systems

40
A Detailed Introduction to Simics

Key Features of Simics
Simics can be regarded as a command interpreter
The Command Line Interface (CLI) lets users control Simics
Simics is quite modular
It uses Simics scripts to connect different FUNCTIONAL modules (e.g., ISA, DRAM, disk, Ethernet), which are compiled as .so files under lib/, to build up a system.
Information on all pre-compiled modules can be found in doc/simics-reference-manual-public-all.pdf.
Modules can be written in C/C++, Python, and DML.
Simics already implements several specific target systems (defined in scripts) for booting an operating system
E.g., Sun's Serengeti system with UltraSPARC-III processors, which is scripted in the directory targets/serengeti

41
A Detailed Introduction to Simics

Key Features of Simics
DML, MAI, APIs and CMDs
DML: the Simics-specific Device Modeling Language, a C-like programming language for writing device models for Simics using Transaction Level Modeling. DML is simpler than C/C++ and Python for device modeling.
MAI: the Simics-specific Micro-Architectural Interface, which lets users define when things happen while letting Simics handle how things happen.
The add-on GEMS uses this feature to implement timing simulation.
APIs: a set of functions that provide access to Simics functionality from script languages in the frontend and from extensions, usually written in C/C++.
CMDs: the Simics-specific commands used in the CLI to let users control Simics, such as loading modules or running Python scripts.

42
A Detailed Introduction to Simics

Using Simics
Installing Simics
See simics-installation-guide-unix.pdf
Creating a Workspace
See Chapter 4 of doc/simics-user-guide-unix.pdf
Installing a Solaris OS
Change the disk capacity by modifying the cylinder-head-sector parameters in targets/serengeti/abisko-sol-cd-install1.simics.
E.g., a 32 GB (40980 × 20 × 80 × 512 B) disk is created by the command
(scsi_disk.get-component-object sd).create-sun-vtoc-header -quiet 40980 20 80
Enter the workspace just created
See Chapter 6 of doc/simics-target-guide-serengeti.pdf
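
The disk size in the example follows directly from the cylinder-head-sector parameters. The short sketch below redoes the arithmetic; the variable names are hypothetical, and only the three geometry values and the 512-byte sector size come from the slide.

```python
SECTOR_BYTES = 512
cylinders, heads, sectors_per_track = 40980, 20, 80   # values passed to the command

capacity = cylinders * heads * sectors_per_track * SECTOR_BYTES
gb = capacity / 10**9     # decimal gigabytes
gib = capacity / 2**30    # binary gigabytes

print(f"{capacity} bytes = {gb:.2f} GB = {gib:.2f} GiB")
```

The result is about 33.6 × 10^9 bytes (about 31.3 GiB), i.e., roughly the 32 GB quoted. To size a different disk, scale the cylinder count and keep the other two parameters fixed.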

43
A Detailed Introduction to Simics

Using Simics
Modify the Simics script (which describes the Serengeti system) to enable multiple cores
Change num_cpus in targets/serengeti/serengeti-6800-system.include
Booting the Solaris OS in Simics
Under the workspace directory just created, enter the subdirectory home/serengeti
Type ./simics abisko-common.simics
Type continue
Install SimicsFS (used to communicate with your host system)
See Section 7.3 of doc/simics-user-guide-unix.pdf
Save a checkpoint, exit and restart from the saved checkpoint
Type write-configuration try.conf
Type exit
Type ./simics -c try.conf

44
A Detailed Introduction to GEMS

An Overview of GEMS

[Figure: overview of GEMS. Labels: Simics, Random Tester, Microbenchmarks (Deterministic, Contended locks), Trace file.]
45
A Detailed Introduction to GEMS
[Figure: SIMICS with four in-order processor models (P0-P3), each controlled through stall()/unstall(). Labels: instructions, Simics in-order processor model, Simics time queue.]
46
A Detailed Introduction to GEMS

Essential Components in Ruby
Caches and Memory
Coherence Protocols
CMP protocols
MOESI_CMP_token: M-CMP token coherence
MSI_MOSI_CMP_directory: 2-level directory
MOESI_CMP_directory: higher-performing 2-level directory
SMP protocols
MOSI_SMP_bcast: snooping on an ordered interconnect
MOSI_SMP_directory
MOSI_SMP_hammer: based on AMD Hammer
User-defined protocols written in GEMS SLICC

47
A Detailed Introduction to GEMS

Essential Components in Ruby
Interconnection Networks
Either automatically generated by default
Intra-chip network: single on-chip switch
Inter-chip network: 4 included (next slide)
Or customized by users
Defined in _FILE_SPECIFIED.txt under the directory GEMS_ROOT_DIR/ruby/network/simple/Network_Files

48
Auto-generated Inter-chip Network Topologies
TopologyType_HIERARCHICAL_SWITCH
TopologyType_TORUS_2D
TopologyType_CROSSBAR
TopologyType_PT_TO_PT
49
Topology Parameters