A Performance Comparison of DRAM Memory System Optimizations for SMT Processors

Transcript and Presenter's Notes

1
A Performance Comparison of DRAM Memory System
Optimizations for SMT Processors
  • Zhichun Zhu, ECE Department, Univ. of Illinois at Chicago
  • Zhao Zhang, ECE Department, Iowa State Univ.

2
DRAM Memory Optimizations
  • Optimizations on the DRAM side can make a big
    difference for single-threaded processors
    • Enhancements of the chip interface/interconnect
    • Access scheduling [Hong et al. HPCA'99, Mathew
      et al. HPCA'00, Rixner et al. ISCA'00]
    • DRAM-side locality [Cuppu et al. ISCA'99 and
      ISCA'01, Zhang et al. MICRO'00, Lin et al.
      HPCA'01]

3
How Does SMT Impact the Memory Hierarchy?
  • Less performance loss per cache miss to DRAM
    memory, so a lower benefit from DRAM-side
    optimizations?
  • But more cache misses due to cache contention,
    so much more pressure on main memory
  • Is DRAM memory design more important, or not?

4
Outline
  • Motivation
  • Memory optimization techniques
  • Thread-aware memory access scheduling
    • Outstanding request-based
    • Resource occupancy-based
  • Methodology
  • Memory performance analysis on SMT systems
    • Effectiveness of single-thread techniques
    • Effectiveness of thread-aware schemes
  • Conclusion

5
Memory Optimization Techniques
  • Page modes
    • Open page: good for programs with good locality
    • Close page: good for programs with poor locality
  • Mapping schemes (sketched in the code below)
    • Exploitation of concurrency (multiple channels,
      chips, banks)
    • Impact on row buffer conflicts
  • Memory access scheduling
    • Reordering of concurrent accesses
    • Reduces average latency and improves bandwidth
      utilization
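
As a concrete illustration of the mapping-scheme point above, here is a minimal C++ sketch of a cacheline-interleaved mapping that spreads consecutive cache lines across channels and banks to expose concurrency. The field widths (64B lines, 2 channels, 4 banks, 2KB rows) are assumptions for the sketch, not the configurations evaluated in the talk.

    #include <cstdint>
    #include <cstdio>

    // Hypothetical cacheline-interleaved mapping: consecutive 64B lines rotate
    // across 2 channels and 4 banks so independent misses can proceed in
    // parallel. All field widths here are illustrative assumptions.
    struct DramAddr {
        uint32_t channel;  // which memory channel
        uint32_t bank;     // which bank within the channel
        uint32_t row;      // row to open in the bank's row buffer
        uint32_t column;   // cache-line index within the 2KB row
    };

    DramAddr map_address(uint64_t paddr) {
        DramAddr a;
        a.channel = (paddr >> 6) & 0x1;    // bit 6: 2 channels
        a.bank    = (paddr >> 7) & 0x3;    // bits 7-8: 4 banks
        a.column  = (paddr >> 9) & 0x1F;   // bits 9-13: 32 lines per 2KB row
        a.row     = static_cast<uint32_t>(paddr >> 14);
        return a;
    }

    int main() {
        // Two consecutive cache lines map to different channels.
        const uint64_t addrs[] = {0x10000ULL, 0x10040ULL};
        for (uint64_t p : addrs) {
            DramAddr a = map_address(p);
            std::printf("0x%llx -> ch %u, bank %u, row %u, col %u\n",
                        static_cast<unsigned long long>(p),
                        a.channel, a.bank, a.row, a.column);
        }
    }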

6
Memory Access Scheduling for Single-Threaded
Systems
  • Hit-first
    • A row buffer hit has a higher priority than a row
      buffer miss
  • Read-first
    • A read has a higher priority than a write
  • Age-based
    • An older request has a higher priority than a
      newer one
  • Criticality-based
    • A critical request has a higher priority than a
      non-critical one (the rules are combined in the
      sketch below)
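
One possible way to combine these rules in a memory controller's request queue is sketched below in C++; the Request fields, the precedence order among the four rules, and the way criticality is flagged are assumptions for illustration, not the exact policies evaluated in the talk.

    #include <cstdint>
    #include <tuple>
    #include <vector>

    // Illustrative pending request at the memory controller. Field names and
    // the way criticality is marked (by the processor core) are assumptions.
    struct Request {
        uint64_t arrival_cycle;   // older requests have smaller values
        uint32_t bank;
        uint32_t row;
        bool     is_read;         // reads stall the core; writes usually do not
        bool     is_critical;     // e.g. flagged by the core as blocking commit
    };

    // Pick the next request to issue: hit-first, then read-first, then
    // criticality, then age. The precedence order is one plausible choice.
    const Request* pick_next(const std::vector<Request>& queue,
                             const std::vector<int64_t>& open_row /* per bank, -1 = closed */) {
        const Request* best = nullptr;
        auto key = [&](const Request& r) {
            bool hit = open_row[r.bank] == static_cast<int64_t>(r.row);
            // Larger tuples win; negate age so older requests rank higher.
            return std::make_tuple(hit, r.is_read, r.is_critical,
                                   -static_cast<int64_t>(r.arrival_cycle));
        };
        for (const Request& r : queue)
            if (best == nullptr || key(r) > key(*best)) best = &r;
        return best;  // nullptr when the queue is empty
    }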

7
Memory Access Concurrency with Multithreaded
Processors
  (Diagram: processor and memory activity timelines contrasting
  single-threaded and multi-threaded execution)
8
Thread-Aware Memory Scheduling
  • New dimension in memory scheduling for SMT
    systems: consider the current state of each
    thread
  • States related to memory accesses (tracked per
    thread, as sketched below)
    • Number of outstanding requests
    • Number of processor resources occupied
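
A minimal sketch of the per-thread state such a thread-aware controller might consult; the counter names, and the idea that the core exports occupancy counts to the memory controller, are assumptions for illustration.

    #include <cstdint>
    #include <vector>

    // Per-thread state a thread-aware memory controller could consult when
    // ordering requests. Names and update points are illustrative assumptions.
    struct ThreadState {
        uint32_t outstanding_requests;  // misses from this thread pending in the controller
        uint32_t rob_entries_held;      // reorder-buffer entries occupied by this thread
        uint32_t iq_entries_held;       // issue-queue entries occupied by this thread
    };

    // One entry per hardware thread context, e.g. 8 for an 8-thread SMT core.
    using ThreadTable = std::vector<ThreadState>;

    // Called when a request from thread `tid` enters / leaves the controller.
    inline void on_request_arrive(ThreadTable& t, uint32_t tid) { ++t[tid].outstanding_requests; }
    inline void on_request_done(ThreadTable& t, uint32_t tid)   { --t[tid].outstanding_requests; }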

9
Outstanding Request-Based Scheme
  • Request-based
    • A request generated by a thread with fewer
      pending requests has a higher priority

10
Outstanding Request-Based Scheme
  • Request-based
    • Hit-first and read-first are applied on top
    • For SMT processors, sustained memory bandwidth is
      more important than the latency of an individual
      access (see the sketch below)
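
A hedged C++ sketch of the outstanding-request-based rule described above: among row-buffer hits (and then reads), requests from the thread with the fewest pending requests are favoured, so no single memory-intensive thread monopolizes the channels. The exact precedence order and field names are assumptions, not the paper's implementation.

    #include <cstdint>
    #include <tuple>
    #include <vector>

    // Illustrative request entry; the row-hit flag is assumed to be
    // pre-computed by the controller from the current row-buffer contents.
    struct Request {
        uint64_t arrival_cycle;
        uint32_t thread_id;
        bool     is_read;
        bool     row_buffer_hit;
    };

    // outstanding[t] = number of requests thread t currently has pending.
    const Request* pick_next(const std::vector<Request>& queue,
                             const std::vector<uint32_t>& outstanding) {
        const Request* best = nullptr;
        auto key = [&](const Request& r) {
            // Hit-first and read-first on top; then fewer outstanding
            // requests; then oldest. Larger tuples win, so counts and age
            // are negated.
            return std::make_tuple(r.row_buffer_hit, r.is_read,
                                   -static_cast<int64_t>(outstanding[r.thread_id]),
                                   -static_cast<int64_t>(r.arrival_cycle));
        };
        for (const Request& r : queue)
            if (best == nullptr || key(r) > key(*best)) best = &r;
        return best;
    }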

11
Resource Occupancy-Based Scheme
  • ROB-based
    • Higher priority to requests from threads holding
      more reorder buffer (ROB) entries
  • IQ-based
    • Higher priority to requests from threads holding
      more issue queue (IQ) entries
  • Hit-first and read-first are applied on top (see
    the sketch below)
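
The occupancy-based variants can be sketched the same way; the only change from the request-based sketch is the tie-break key, which now prefers the thread holding the most ROB (or IQ) entries so that the thread clogging shared pipeline resources is drained first. Again, the precedence order and names are assumptions.

    #include <cstdint>
    #include <tuple>
    #include <vector>

    struct Request {
        uint64_t arrival_cycle;
        uint32_t thread_id;
        bool     is_read;
        bool     row_buffer_hit;
    };

    // occupancy[t] = ROB entries (ROB-based) or IQ entries (IQ-based) held
    // by thread t.
    const Request* pick_next(const std::vector<Request>& queue,
                             const std::vector<uint32_t>& occupancy) {
        const Request* best = nullptr;
        auto key = [&](const Request& r) {
            // Hit-first and read-first on top; then MORE occupied entries
            // wins; then the oldest request.
            return std::make_tuple(r.row_buffer_hit, r.is_read,
                                   static_cast<int64_t>(occupancy[r.thread_id]),
                                   -static_cast<int64_t>(r.arrival_cycle));
        };
        for (const Request& r : queue)
            if (best == nullptr || key(r) > key(*best)) best = &r;
        return best;
    }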

12
Methodology
  • Simulator
    • SMT extension of sim-Alpha
    • Event-driven memory simulator (DDR SDRAM and
      Direct Rambus DRAM)
  • Workloads
    • Mixtures of SPEC CPU2000 applications
    • 2-, 4-, and 8-thread workloads
    • ILP, MIX, and MEM workload mixes

13
Simulation Parameters
Processor speed         3 GHz
Fetch width             8 inst.
Baseline fetch policy   DWarn.2.8
Pipeline depth          11 stages
Issue queue size        64 Int., 32 FP
Reorder buffer size     256 entries/thread
Physical registers      384 Int., 384 FP
Load/store queue size   64 LQ, 64 SQ
L1 caches               64KB I/D, 2-way, 1-cycle latency
L2 cache                512KB, 2-way, 10-cycle latency
L3 cache                4MB, 4-way, 20-cycle latency
MSHR entries            16 (4 prefetch) per cache
Memory channels         2/4/8
Memory BW per channel   200 MHz, DDR, 16B width
Memory banks            4 banks/chip
DRAM access latency     15ns row, 15ns column, 15ns precharge
14
Workload Mixes
2-thread  ILP  bzip2, gzip
          MIX  gzip, mcf
          MEM  mcf, ammp
4-thread  ILP  bzip2, gzip, sixtrack, eon
          MIX  gzip, mcf, bzip2, ammp
          MEM  mcf, ammp, swim, lucas
8-thread  ILP  gzip, bzip2, sixtrack, eon, mesa, galgel, crafty, wupwise
          MIX  gzip, mcf, bzip2, ammp, sixtrack, swim, eon, lucas
          MEM  mcf, ammp, swim, lucas, equake, applu, vpr, facerec
15
Performance Loss Due to Memory Access
16
Memory Access Concurrency
17
Memory Channel Configurations
18
Memory Channel Configurations
19
Mapping Schemes
20
Memory Access Concurrency
21
Thread-Aware Schemes
22
Conclusion
  • DRAM optimizations have a significant impact on
    the performance of SMT (and likely CMP)
    processors
  • They are most effective when a workload mix
    includes some memory-intensive programs
  • Performance is sensitive to the memory channel
    organization
  • DRAM-side locality is harder to exploit due to
    contention
  • Thread-aware access scheduling schemes do bring
    good performance improvements