Transcript and Presenter's Notes

Title: CSCE 930 Advanced Computer Architecture ---- A Brief Introduction to CMP Memory Hierarchy


1
CSCE 930 Advanced Computer Architecture ---- A Brief Introduction to CMP Memory Hierarchy & Simulators
  • Dongyuan Zhan

2
From Teraflop Multiprocessor to Teraflop Multicore
ASCI RED (1997-2005)
3
Intel Teraflop Multicore Prototype
4
From Teraflop Multiprocessor to Teraflop Multicore
  • Pictured here is ASCI Red, the first computer to reach a teraflops of
    processing power, i.e., a trillion floating-point calculations per second.
  • Used about 10,000 Pentium processors running at 200 MHz
  • Consumed 500 kW of power for computation and another 500 kW for cooling
  • Occupied a very large room
  • Just over 10 years later, Intel announced that it had developed the
    world's first processor to deliver the same teraflops performance all on
    a single chip:
  • 80 cores on a single chip running at 5 GHz
  • Consuming only 62 watts of power
  • Small enough to rest on the tip of your finger.

5
A Commodity Many-core Processor
  • Tile64 Multicore Processor (2007-now)

6
The Schematic Design of Tile64
[Figure: Tile64 block diagram -- an 8x8 grid of identical tiles surrounded by
I/O blocks: DDR2 memory controllers 0-3, PCIe 0/1 MAC/PHY with SerDes links,
GbE 0/1, Flexible I/O, and UART/HPI/JTAG/I2C/SPI controllers.]
  • 4 essential components
  • Processor core
  • On-chip cache
  • Network-on-Chip (NoC)
  • I/O controllers
7
Outline
  • An Introduction to the Multi-core Memory Hierarchy
  • Why do we need a memory hierarchy for any processor?
  • A tradeoff between capacity and latency
  • Make common cases fast as a result of programs' locality (a general
    principle in computer architecture)
  • What is the difference between the memory hierarchies of single-core and
    multi-core CPUs?
  • Quite distinct from each other in the on-chip caches
  • Managing the CMP caches is of paramount importance to performance
  • Again, we still have the capacity and latency issues for CMP caches
  • How to keep CMP caches coherent
  • Hardware & software management schemes

8
The Motivation for a Memory Hierarchy
Trading off between capacity and latency
  • Registers: 100s of bytes, 0.3-0.5 ns; managed by the program/compiler,
    transferred as 4-8 byte instruction operands
  • L1 and L2 caches (on chip): 10s-100s of KBytes, 1-10 ns; managed by the
    cache controller, transferred as 32/64-byte (L1) or 64/128-byte (L2) blocks
  • Main memory (off chip): GBytes, 200-300 ns, ~$15/GByte; managed by the OS,
    transferred as 4K-64K byte pages
  • Disk: 1s-10s of TBytes, ~10 ms, ~$0.15/GByte
  • Upper levels are smaller and faster; lower levels are larger and slower.
9
Programs' Locality
  • Two Kinds of Basic Locality
  • Temporal
  • if a memory location is referenced, then it is
    likely that the same memory location will be
    referenced again in the near future.
  • int i; register int j;
  • for (i = 0; i < 20000; i++)
  •   for (j = 0; j < 300; j++) { ... }
  • Spatial
  • if a memory location is referenced, then it is
    likely that nearby memory locations will be
    referenced in the near future.
  • Locality + the fact that smaller HW is faster → make common cases fast →
    memory hierarchy (a short C sketch of both kinds of locality follows below)
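
To make the two kinds of locality concrete, here is a minimal, self-contained C sketch (not from the original slides): the scalar sum is re-referenced on every iteration (temporal locality), while the array a is traversed sequentially, so neighboring elements share cache blocks (spatial locality).

    #include <stdio.h>

    #define N 20000

    int main(void)
    {
        static int a[N];
        long sum = 0;

        for (int i = 0; i < N; i++)
            a[i] = i;            /* sequential writes: spatial locality      */

        for (int i = 0; i < N; i++)
            sum += a[i];         /* sum is reused each iteration (temporal); */
                                 /* a[i] again walks the array (spatial)     */
        printf("%ld\n", sum);
        return 0;
    }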

10
The Challenges of the Memory Wall
  • The Facts
  • In many applications, 30-40% of the total instructions are memory
    operations
  • CPU speed scales much faster than DRAM speed
  • In 1980, CPUs and DRAMs operated at almost the same speed, about 4-8 MHz
  • CPU clock frequency has doubled every 2 years
  • DRAM speed has only been doubling about every 6 years

11
Memory Wall
  • DRAM bandwidth is quite limited: two DDR2-800 modules reach a bandwidth of
    12.8 GB/s (about 6.4 bytes per CPU cycle if the CPU runs at 2 GHz). So in
    a multicore processor, when multiple 64-bit cores need to access memory at
    the same time, they exacerbate contention on the DRAM bandwidth (see the
    sketch below).
  • Memory Wall: the CPU needs to spend a lot of time on off-chip memory
    accesses. E.g., the Intel XScale spends on average 35% of its total
    execution time on memory accesses. The high latency and low bandwidth of
    the DRAM system become a bottleneck for CPUs.
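
A back-of-the-envelope sketch of the contention argument, using the numbers assumed on this slide (12.8 GB/s of peak DRAM bandwidth, a 2 GHz clock): as the core count grows, the bytes per cycle available to each core shrink accordingly.

    #include <stdio.h>

    int main(void)
    {
        const double peak_bw   = 12.8e9;   /* bytes/s from two DDR2-800 modules */
        const double cpu_clock = 2.0e9;    /* 2 GHz core clock                  */

        for (int cores = 1; cores <= 8; cores *= 2)
            printf("%d core(s): %.2f bytes/cycle per core\n",
                   cores, peak_bw / cpu_clock / cores);
        return 0;
    }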

12
Solutions
  • How to alleviate the memory wall problem?
  • Hiding the memory access latency: prefetching (a sketch follows below)
  • Reducing the latency: moving memory closer to the CPU, e.g., 3D-stacked
    on-chip DRAM
  • Increasing the bandwidth: optical I/O
  • Reducing the number of memory accesses: keeping as much reusable data in
    the cache as possible
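
As one concrete illustration of hiding memory latency, here is a hedged sketch of software prefetching; it assumes a GCC-compatible compiler that provides __builtin_prefetch, and the prefetch distance is an illustrative value, not a tuned one.

    #define DIST 16   /* prefetch distance in elements; machine-dependent */

    long sum_with_prefetch(const long *a, long n)
    {
        long sum = 0;
        for (long i = 0; i < n; i++) {
            if (i + DIST < n)                      /* fetch a[i+DIST] early so  */
                __builtin_prefetch(&a[i + DIST]);  /* DRAM latency overlaps the */
            sum += a[i];                           /* work on a[i]              */
        }
        return sum;
    }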

13
CMP Cache Organizations (Shared L2 Cache)
14
CMP Cache Organizations (Private L2 Cache)
15
How to Address Blocks in a CMP
  • How to address blocks in a single-core processor
  • L1 caches are typically virtually indexed but physically tagged, while L2
    caches are mostly physically indexed and tagged (related to virtual
    memory).
  • How to address blocks in a CMP
  • L1 caches are accessed in the same way as in a single-core processor
  • If the L2 caches are private, the addressing of a block is still the same
  • If the L2 caches are shared among all of the cores, then the block address
    must first locate the home tile and then index the set within that tile's
    L2 bank (a sketch of this decomposition follows below)
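
The sketch below illustrates that last case under assumed parameters (64-byte blocks, 1024 sets per bank, 16 tiles, block-grained interleaving); the field widths are illustrative, not those of any particular CMP.

    #include <stdint.h>

    enum { BLOCK_BITS = 6,         /* 64-byte blocks            */
           SET_BITS   = 10,        /* 1024 sets per L2 bank     */
           TILES      = 16 };      /* number of tiles/banks     */

    struct l2_loc { unsigned tile, set; uint64_t tag; };

    static struct l2_loc locate_shared_l2(uint64_t paddr)
    {
        struct l2_loc loc;
        uint64_t block = paddr >> BLOCK_BITS;        /* drop the block offset  */
        loc.tile = (unsigned)(block % TILES);        /* home tile (interleave) */
        loc.set  = (unsigned)((block / TILES) % (1u << SET_BITS));
        loc.tag  = (block / TILES) >> SET_BITS;      /* remaining bits = tag   */
        return loc;
    }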

16
How to Address Blocks in a CMP
17
How to Address Blocks in a CMP
18
CMP Cache Coherence
  • Snooping-based
  • All caches on the bus snoop the bus to determine whether they have a copy
    of the block of data that is requested on the bus. Multiple copies of a
    data block can be read without any coherence problems; however, a
    processor must have exclusive access to the bus (and either invalidate or
    update the other copies) in order to write.
  • Sufficient for small-scale CMPs with bus interconnection
  • Directory-based
  • The data being shared is tracked in a common directory that maintains
    coherence between caches. When a cache line is changed, the directory
    either updates or invalidates the other caches holding that cache line.
  • Necessary for many-core CMPs with interconnects such as a mesh (a toy
    directory sketch follows below)
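
The toy code below sketches the invalidate-on-write idea common to both schemes, in directory form: the directory records which cores hold a copy and invalidates the others before granting write (exclusive) access. It is deliberately simplified and is not the protocol of any real CMP.

    #include <stdbool.h>

    #define CORES 4

    typedef enum { INVALID, SHARED, MODIFIED } state_t;

    typedef struct {
        state_t state;
        bool    sharer[CORES];          /* which cores cache this block       */
    } dir_entry;

    static void dir_read(dir_entry *e, int core)
    {
        if (e->state == MODIFIED)       /* owner must supply/write back data  */
            e->state = SHARED;
        if (e->state == INVALID)
            e->state = SHARED;          /* fetched from memory                */
        e->sharer[core] = true;         /* multiple readers are fine          */
    }

    static void dir_write(dir_entry *e, int core)
    {
        for (int c = 0; c < CORES; c++) /* invalidate every other copy        */
            if (c != core)
                e->sharer[c] = false;
        e->sharer[core] = true;
        e->state = MODIFIED;            /* writer now has exclusive access    */
    }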

19
Interference in Caching in Shared L2 Caches
  • The Problem: because the shared L2 caches are accessible to all cores, one
    core can interfere with another in placing blocks in the L2 caches
  • For example, in a dual-core CMP, if a streaming application like a video
    player is co-scheduled with a scientific computation application that has
    good locality, then the aggressive streaming application will continuously
    place new blocks in the L2 cache and replace the computation application's
    cached blocks, thus hurting the computation application's performance.
  • Solution
  • Regulate the cores' usage of the L2 cache based on their utility of using
    the cache [3] (an illustrative partitioning sketch follows below)
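
One way to picture such regulation is utility-driven way partitioning, sketched below under assumed inputs (per-core hit curves obtained from shadow tags or set sampling). This is an illustrative greedy allocator in the spirit of utility-based cache management, not the exact policy of reference [3].

    #define CORES 2
    #define WAYS  16

    /* hits[c][w]: estimated hits for core c if it were given w ways */
    void partition_ways(const long hits[CORES][WAYS + 1], int alloc[CORES])
    {
        for (int c = 0; c < CORES; c++)
            alloc[c] = 0;

        for (int w = 0; w < WAYS; w++) {            /* hand out ways one by one */
            int best = 0;
            long best_gain = -1;
            for (int c = 0; c < CORES; c++) {
                long gain = hits[c][alloc[c] + 1] - hits[c][alloc[c]];
                if (gain > best_gain) { best_gain = gain; best = c; }
            }
            alloc[best]++;     /* a streaming app gains little from extra ways,
                                  so it ends up with a small share             */
        }
    }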

20
The Capacity Problems in Private L2 Caches
  • The Problems
  • The L2 capacity accessible to each core is fixed, regardless of the core's
    real cache capacity demand. E.g., if two applications are co-scheduled on
    a dual-core CMP with two 1 MB private L2 caches, and one application has a
    cache demand of 0.5 MB while the other asks for 1.5 MB, then one private
    L2 cache is underutilized while the other is overwhelmed.
  • If a parallel program is running on the CMP, different cores will have a
    lot of data in common. However, the private L2 cache organization requires
    each core to maintain a copy of the common data in its local cache,
    leading to a lot of data redundancy and degrading the effective capacity.
  • A Solution: Cooperative Caching [4]

21
Non-Uniform Cache Access Time in Shared L2 Caches
22
Non-Uniform Cache Access Time in Shared L2 Caches
  • Let's assume that Core 0 needs to access a data block stored in Tile 15
  • Assume that accessing an L2 cache bank needs 10 cycles
  • Assume that transferring a data block from one router to an adjacent one
    needs 2 cycles
  • Then a remote access to the block in Tile 15 needs 10 + 2 × (2 × 6) = 34
    cycles, much greater than a local L2 access (see the sketch below).
  • Non-Uniform Cache Access Time (NUCA) means that the latency of accessing a
    cache is a function of the physical locations of both the requesting core
    and the cache.
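
A small sketch of the latency model used in this example, assuming row-major tile numbering in a 4x4 mesh, a 10-cycle bank access, and 2 cycles per hop for both the request and the reply:

    #include <stdlib.h>

    enum { MESH_DIM = 4, BANK_CYCLES = 10, HOP_CYCLES = 2 };

    static int nuca_latency(int src_tile, int dst_tile)
    {
        int hops = abs(src_tile % MESH_DIM - dst_tile % MESH_DIM)    /* X hops */
                 + abs(src_tile / MESH_DIM - dst_tile / MESH_DIM);   /* Y hops */
        return BANK_CYCLES + 2 * hops * HOP_CYCLES;   /* request + reply paths */
    }

    /* nuca_latency(0, 15) == 10 + 2 * (2 * 6) == 34 cycles, as on the slide;
       nuca_latency(0, 0)  == 10 cycles for a local bank access.              */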

23
How to Reduce the Latency of Remote Cache Accesses
  • At least two solutions
  • Place the data close enough to the requesting core
  • Victim replication [1]: placing L1 victim blocks in the local L2 cache
  • Change the layout of the data (I will talk about one approach pretty soon)
  • Use faster transmission
  • Use a special on-chip interconnect to transmit data via radio-wave or
    light-wave signals

24
A Comparison Between Shared and Private L2 Caches
  • Set mapping: L2S first locates the tile and then indexes the set; L2P is
    the same as in a single-core CPU
  • Coherence directory: in L2S, each L2 entry has its own directory bits and
    there are no separate directory caches; L2P uses independent shared
    directory caches employing the same mapping scheme as in the figure
  • Capacity: L2S offers high aggregate capacity to every core; L2P offers
    relatively low capacity to each core
  • Latency: in L2S, due to the distributed mapping, much of the requested
    (on-chip) data resides in a non-local L2 bank; in L2P, the requested
    (on-chip) data is in the private (closest) L2
  • Sharing: L2S shares capacity and data; L2P shares none
  • Performance isolation: L2S suffers severe contention in L2 capacity
    allocation among cores; L2P has no interference among cores
  • Commodity CMPs: L2S - Intel Core 2 Duo E6600, Sun Sparc Niagara2, Tilera
    Tile64 (64 cores); L2P - AMD Athlon64 6400, Intel Pentium D 840
25
The RF-Interconnect [2]
26
Using the OS to Manage CMP Caches [5]
  • Two kinds of address space
  • Virtual (or logical) & physical
  • Page coloring: there is a correspondence between a physical page and its
    location in the cache
  • In CMPs with a shared L2 cache, by changing the mapping scheme, we can use
    the OS to determine where a virtual page required by a core is located in
    the L2 cache
  • Tile (where a page is cached) = physical page number mod #Tiles (see the
    sketch below)
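
A minimal sketch of that mapping, assuming 4 KB pages and 16 tiles (both illustrative): the OS steers a page to a particular L2 slice simply by choosing which physical page frame backs the virtual page.

    #include <stdint.h>

    enum { PAGE_SHIFT = 12, TILES = 16 };   /* 4 KB pages, 16 tiles (assumed) */

    static unsigned home_tile(uint64_t paddr)
    {
        uint64_t ppn = paddr >> PAGE_SHIFT;      /* physical page number       */
        return (unsigned)(ppn % TILES);          /* page "color" = home tile   */
    }

    /* To keep a core's data local, the OS allocates free page frames whose
       page number modulo TILES equals that core's tile id.                    */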

27
Using the OS to Manage CMP Caches
28
Summary
  • What we have covered in this class
  • The Memory Wall problem for CMPs
  • The two basic cache organizations for CMPs
  • HW & SW approaches to managing the last-level cache

29
References
  • [1] M. Zhang, et al. Victim Replication: Maximizing Capacity while Hiding
    Wire Delay in Tiled Chip Multiprocessors. ISCA'05.
  • [2] F. Chang, et al. CMP Network-on-Chip Overlaid With Multi-Band
    RF-Interconnect. HPCA'08.
  • [3] A. Jaleel, et al. Adaptive Insertion Policies for Managing Shared
    Caches. PACT'08.
  • [4] J. Chang, et al. Cooperative Caching for Chip Multiprocessors. ISCA'06.
  • [5] S. Cho, et al. Managing Distributed, Shared L2 Caches through OS-Level
    Page Allocation. MICRO'06.

30
Outline
  • An Overview of CMP Research Tools
  • A Detailed Introduction to SIMICS
  • A Detailed Introduction to GEMS
  • Other Online Resources

31
An Overview of CMP Research Tools
  • CMP Simulators
  • SESC (http://users.soe.ucsc.edu/~renau/rtools.html)
  • M5 (http://www.m5sim.org/wiki/index.php/Main_Page)
  • Simics (https://www.simics.net/)
  • Benchmark Suites
  • Single-threaded Applications
  • SPEC2000 (www.spec.org)
  • SPEC2006
  • Multi-threaded Applications
  • SPEC OMP2001
  • SPECweb2009
  • SPLASH-2 (http://www-flash.stanford.edu/apps/SPLASH/)
  • PARSEC (http://parsec.cs.princeton.edu/)

32
An Overview of CMP Research Tools
  • A Taxonomy of Simulation
  • Functional vs. Timing
  • Functional simulation: simulates the functionality of a system
  • Timing simulation: simulates the timing behavior of a system
  • Full System vs. Non-FS
  • Full-system simulation: like a VM that can boot up OSs
  • Syscall emulation: no OS, but syscalls are emulated by the simulator
  • Simulation Stages
  • Configuration stage: connect cores, caches, DRAMs, interconnects and I/Os
    to build up a system
  • Fast-forward stage: bypass the initialization phase of a benchmark program
    without timing simulation
  • Warm-up stage: fill the pipelines, branch predictors and caches by
    executing a certain number of instructions, but do not count them in the
    performance statistics
  • Simulation stage: detailed simulation to obtain performance statistics
    (a generic driver sketch of these stages follows below)
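
The stages above can be pictured as a generic driver loop; the sketch below uses illustrative stub functions, not the real API of Simics, GEMS or M5.

    #include <stdio.h>

    static void configure_system(void)        { puts("connect cores/caches/DRAM/NoC"); }
    static void fast_forward(long insns)      { printf("skip %ld insns, no timing\n", insns); }
    static void warm_up(long insns)           { printf("warm caches/predictors for %ld insns\n", insns); }
    static void reset_statistics(void)        { puts("clear performance counters"); }
    static void simulate_detailed(long insns) { printf("detailed timing for %ld insns\n", insns); }
    static void dump_statistics(void)         { puts("report IPC, miss rates, ..."); }

    int main(void)
    {
        configure_system();               /* configuration stage  */
        fast_forward(1000000000L);        /* fast-forward stage   */
        warm_up(100000000L);              /* warm-up stage        */
        reset_statistics();
        simulate_detailed(200000000L);    /* simulation stage     */
        dump_statistics();
        return 0;
    }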

33
An Overview of CMP Research Tools
  • The Commonly Used CMP Simulators
  • SESC
  • Only supports timing and syscall-emulation simulation
  • Only supports the MIPS ISA
  • Able to seamlessly cooperate with CACTI (power), HotSpot (temperature) and
    HotLeakage (static power)
  • Especially useful in power/thermal research
  • CACTI is available at http://www.cs.utah.edu/~rajeev/cacti6/
  • HotSpot: http://lava.cs.virginia.edu/HotSpot/
  • HotLeakage: http://lava.cs.virginia.edu/HotLeakage/index.htm

34
An Overview of CMP Research Tools
  • The Commonly Used CMP Simulators
  • SIMICS (commercial, but free for academic use)
  • Only supports functional full-system simulation
  • Supports multiple ISAs
  • SparcV9 (well supported by public-domain add-on modules)
  • x86, Alpha, MIPS, ARM (seldom supported by 3rd-party modules)
  • Needs add-on modules to do performance & power simulation
  • GEMS (http://www.cs.wisc.edu/gems/)
  • It has two components for performance simulation:
  • OPAL: an out-of-order processing core model
  • RUBY: a detailed CMP memory hierarchy model
  • Simflex (http://parsa.epfl.ch/simflex/)
  • It is similar to GEMS in functionality
  • It supports statistical sampling for simulation
  • Garnet (http://www.princeton.edu/~niketa/garnet.html)
  • It supports performance and power simulation for NoCs

35
An Overview of CMP Research Tools
  • The Commonly Used CMP Simulators
  • M5
  • Supports both functional and timing simulation
  • Has two simulation modes: full-system (FS) and syscall emulation (SE)
  • Supports multiple ISAs
  • ALPHA: well developed, supporting both FS and SE modes
  • It models
  • Processor cores, the memory hierarchy, and I/O systems
  • Written in C++, Python & SWIG, and totally open source
  • More about M5
  • http://www.m5sim.org/wiki/index.php/Main_Page
  • The most important document: http://www.m5sim.org/wiki/index.php/Tutorials

36
A Detailed Introduction to M5
  • M5's Source Tree Structure

37
A Detailed Introduction to M5
  • CPU Models in M5
  • SimpleCPU
  • TimingCPU
  • O3CPU

38
A Detailed Introduction to M5
  • Memory Hierarchy Modeled by M5

39
A Detailed Introduction to Simics
  • Directory Tree Organization
  • Under the root directory of Simics:
  • licenses: licenses for the functional Simics
  • doc: detailed documents about all aspects
  • targets: Simics scripts that describe specific computer systems
  • src: Simics header files for user programming
  • amd64-linux: dynamic modules (.so) that are invoked by Simics to build up
    the modeled computer systems

40
A Detailed Introduction to Simics
  • Key Features of Simics
  • Simics can be regarded as a command interpreter
  • The Command Line Interface (CLI) lets users control Simics
  • Simics is quite modular
  • It uses Simics scripts to connect different FUNCTIONAL modules (e.g., ISA,
    DRAM, disk, Ethernet), which are compiled as lib/*.so files, to build up a
    system.
  • The information on all pre-compiled modules can be found in
    doc/simics-reference-manual-public-all.pdf.
  • Modules can be designed in C/C++, Python, and DML.
  • Simics has already implemented several specific target systems (defined in
    scripts) for booting up an operating system
  • E.g., Sun's Serengeti system with UltraSPARC-III processors, which is
    scripted in the directory targets/serengeti

41
A Detailed Introduction to Simics
  • Key Features of Simics
  • DML, MAI, APIs and CMDs
  • DML: the Simics-specific Device Modeling Language, a C-like programming
    language for writing device models for Simics using transaction-level
    modeling. DML is simpler than C/C++ and Python for device modeling.
  • MAI
  • The Simics-specific Micro-Architectural Interface, which enables users to
    define when things happen while letting Simics handle how things happen.
  • The add-on GEMS uses this feature to implement timing simulation.
  • APIs: a set of functions that provide access to Simics functionality from
    script languages in the frontend and from extensions, usually written in
    C/C++.
  • CMDs: the Simics-specific commands used in the CLI to let users control
    Simics, such as loading modules or running Python scripts.

42
A Detailed Introduction to Simics
  • Using Simics
  • Installing Simics
  • See simics-installation-guide-unix.pdf
  • Creating a workspace
  • See Chapter 4 of doc/simics-user-guide-unix.pdf
  • Installing a Solaris OS
  • Change the disk capacity by modifying the cylinder-head-sector parameters
    in targets/serengeti/abisko-sol-cd-install1.simics.
  • E.g., a 32 GB (40980 × 20 × 80 × 512 B) disk is created by the command
  • (scsi_disk.get-component-object sd).create-sun-vtoc-header -quiet 40980 20 80
  • Enter the workspace just created
  • See Chapter 6 of doc/simics-target-guide-serengeti.pdf

43
A Detailed Introduction to Simics
  • Using Simics
  • Modify the Simics script (describing the Serengeti system) to enable
    multiple cores
  • Change num_cpus in targets/serengeti/serengeti-6800-system.include
  • Booting the Solaris OS in Simics
  • Under the workspace directory just created, enter the subdirectory
    home/serengeti
  • Type ./simics abisko-common.simics
  • Type continue
  • Install SimicsFS (used to communicate with your host system)
  • See Section 7.3 of doc/simics-user-guide-unix.pdf
  • Save a checkpoint, exit, and restart from the previous checkpoint
  • Type write-configuration try.conf
  • Type exit
  • Type ./simics -c try.conf

44
A Detailed Introduction to GEMS
  • An Overview of GEMS

[Figure: GEMS overview. Drivers that feed the timing models include Simics, a
random tester, trace files, and deterministic microbenchmarks such as
contended locks.]
45
A Detailed Introduction to GEMS
[Figure: Simics' in-order processor model issues instructions from cores
P0-P3; GEMS controls timing by calling stall()/unstall() on each processor
through the Simics time queue.]
46
A Detailed Introduction to GEMS
  • Essential Components in Ruby
  • Caches & Memory
  • Coherence Protocols
  • CMP protocols
  • MOESI_CMP_token: M-CMP token coherence
  • MSI_MOSI_CMP_directory: 2-level directory
  • MOESI_CMP_directory: higher-performing 2-level directory
  • SMP protocols
  • MOSI_SMP_bcast: snooping on an ordered interconnect
  • MOSI_SMP_directory
  • MOSI_SMP_hammer: based on AMD's Hammer protocol
  • User-defined protocols written in GEMS' SLICC

47
A Detailed Introduction to GEMS
  • Essential Components in Ruby
  • Interconnection Networks
  • Either automatically generated by default
  • Intra-chip network: a single on-chip switch
  • Inter-chip network: 4 topologies included (next slide)
  • Or customized by users
  • Defined in a _FILE_SPECIFIED.txt file under the directory
    GEMS_ROOT_DIR/ruby/network/simple/Network_Files

48
Auto-generated Inter-chip Network Topologies
  • TopologyType_HIERARCHICAL_SWITCH
  • TopologyType_TORUS_2D
  • TopologyType_CROSSBAR
  • TopologyType_PT_TO_PT
49
Topology Parameters
  • Link latency
  • Auto-generated
  • ON_CHIP_LINK_LATENCY
  • NETWORK_LINK_LATENCY
  • Customized
  • link_latency
  • Link bandwidth
  • Auto-generated
  • On-chip: 10 x g_endpoint_bandwidth
  • Off-chip: g_endpoint_bandwidth
  • Customized
  • Individual link bandwidth: bw_multiplier x g_endpoint_bandwidth
  • Buffer size
  • Infinite by default
  • A customized network supports finite buffering
  • Prevent 2D-mesh network deadlock through e-cube restrictive routing
  • link_weight
  • Perfect switch bandwidth
50
A Detailed Introduction to GEMS
  • Steps of Using GEMS
  • Choosing a Ruby protocol
  • Building Ruby and Opal
  • Starting and configuring Simics
  • Loading and configuring Ruby
  • Loading and configuring Opal
  • Running simulation
  • Getting results

51
Other Online Resources
  • Simics Online Forum
  • https://www.simics.net/
  • GEMS Mailing List Archive
  • http://lists.cs.wisc.edu/mailman/listinfo/gems-users
  • A student wrote some articles about installing and using Simics at
  • http://fisherduyu.blogspot.com/