Title: A Performance Comparison of DRAM Memory System Optimizations for SMT Processors
1. A Performance Comparison of DRAM Memory System Optimizations for SMT Processors
- Zhichun Zhu, ECE Department, Univ. of Illinois at Chicago
- Zhao Zhang, ECE Department, Iowa State Univ.
2. DRAM Memory Optimizations
- Optimizations at the DRAM side can make a big difference on single-threaded processors:
  - Enhancement of chip interface/interconnect
  - Access scheduling [Hong et al., HPCA'99; Mathew et al., HPCA'00; Rixner et al., ISCA'00]
  - DRAM-side locality [Cuppu et al., ISCA'99 and ISCA'01; Zhang et al., MICRO'00; Lin et al., HPCA'01]
3. How Does SMT Impact the Memory Hierarchy?
- Less performance loss per cache miss to DRAM memory, so lower benefit from DRAM-side optimizations?
- But more cache misses due to cache contention, so much more pressure on main memory
- Is DRAM memory design more important, or not?
4. Outline
- Motivation
- Memory optimization techniques
- Thread-aware memory access scheduling
- Outstanding request-based
- Resource occupancy-based
- Methodology
- Memory performance analysis on SMT systems
- Effectiveness of single-thread techniques
- Effectiveness of thread-aware schemes
- Conclusion
5. Memory Optimization Techniques
- Page modes
  - Open page: good for programs with good locality
  - Close page: good for programs with poor locality
- Mapping schemes
  - Exploitation of concurrency (multiple channels, chips, banks)
  - Row buffer conflicts
- Memory access scheduling
  - Reordering of concurrent accesses
  - Reduces average latency and improves bandwidth utilization
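The mapping-scheme idea can be illustrated with a minimal sketch. The field widths and the field order below are hypothetical (they are not the configuration from these slides); the point is only that the bits chosen for channel and bank determine how consecutive addresses spread across DRAM resources.

```python
# Hypothetical address decomposition for a page-interleaved mapping.
# Field widths are illustrative, not the paper's configuration.
LINE_BITS    = 6   # 64B cache line
CHANNEL_BITS = 1   # 2 channels
BANK_BITS    = 2   # 4 banks per chip
COL_BITS     = 11  # columns within one row (the row-buffer "page")

def map_address(addr: int):
    """Split a physical address into (channel, bank, row, column)."""
    addr >>= LINE_BITS                               # drop line offset
    channel = addr & ((1 << CHANNEL_BITS) - 1)       # low bits -> channel
    addr >>= CHANNEL_BITS
    col = addr & ((1 << COL_BITS) - 1)               # next bits -> column
    addr >>= COL_BITS
    bank = addr & ((1 << BANK_BITS) - 1)             # then bank
    addr >>= BANK_BITS
    row = addr                                       # high bits -> row
    return channel, bank, row, col
```

Under this mapping, `map_address(0)` and `map_address(128)` land in the same channel, bank, and row and differ only in column, so the second access would be a row-buffer hit under open-page mode.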
6. Memory Access Scheduling for Single-Threaded Systems
- Hit-first
  - A row buffer hit has a higher priority than a row buffer miss
- Read-first
  - A read has a higher priority than a write
- Age-based
  - An older request has a higher priority than a newer one
- Criticality-based
  - A critical request has a higher priority than a non-critical one
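The first three rules compose naturally as a lexicographic priority. The sketch below is an illustration of that composition, not the paper's implementation; the `Request` fields are assumptions, and criticality-based scheduling is omitted since it needs processor-side information.

```python
from dataclasses import dataclass

@dataclass
class Request:
    is_read: bool
    row: int      # DRAM row this request targets
    arrival: int  # cycle at which the request entered the queue

def pick_next(queue, open_row):
    # Lexicographic priority: row-buffer hits first, then reads before
    # writes, then the oldest request (smallest arrival cycle).
    return min(queue, key=lambda r: (r.row != open_row,
                                     not r.is_read,
                                     r.arrival))
```

For example, with row 5 open, a queue holding a write to row 5, an older read to row 9, and a newer read to row 5, `pick_next` selects the read to row 5: it is a row-buffer hit, and among hits reads beat writes.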
7. Memory Access Concurrency with Multithreaded Processors
[Figure: requests in flight between processor and memory, single-threaded vs. multi-threaded]
8. Thread-Aware Memory Scheduling
- A new dimension in memory scheduling for SMT systems: consider the current state of each thread
- Thread states related to memory accesses:
  - Number of outstanding requests
  - Number of processor resources occupied
9. Outstanding Request-Based Scheme
- Request-based
  - A request generated by a thread with fewer pending requests has a higher priority
10. Outstanding Request-Based Scheme (cont.)
- Request-based
  - Hit-first and read-first are applied on top
  - For SMT processors, sustained memory bandwidth is more important than the latency of an individual access
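One way to sketch the request-based scheme is to extend the lexicographic priority with the per-thread outstanding-request count. The slide does not pin down the precedence among the rules; this sketch assumes hit-first and read-first apply before the thread criterion, and all names are illustrative.

```python
from dataclasses import dataclass

@dataclass
class Request:
    thread: int
    is_read: bool
    row: int
    arrival: int

def pick_next_request_based(queue, open_row, pending):
    # pending[t] = number of outstanding DRAM requests from thread t.
    # After hit-first and read-first, threads with FEWER outstanding
    # requests win, so lightly loaded threads keep making progress;
    # age breaks any remaining ties.
    return min(queue, key=lambda r: (r.row != open_row,
                                     not r.is_read,
                                     pending[r.thread],
                                     r.arrival))
```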
11. Resource Occupancy-Based Scheme
- ROB-based
  - Higher priority to requests from threads holding more ROB entries
- IQ-based
  - Higher priority to requests from threads holding more IQ entries
- Hit-first and read-first are applied on top
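The occupancy-based variant flips the direction of the thread criterion: it favors the thread holding the most resources, since servicing its miss frees the most processor capacity. As before, the precedence among the rules and the names are assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass
class Request:
    thread: int
    is_read: bool
    row: int
    arrival: int

def pick_next_rob_based(queue, open_row, rob_occupancy):
    # rob_occupancy[t] = ROB entries currently held by thread t.
    # Negating the occupancy makes min() favor the thread holding MORE
    # entries. The IQ-based variant is identical with issue-queue
    # occupancy substituted for ROB occupancy.
    return min(queue, key=lambda r: (r.row != open_row,
                                     not r.is_read,
                                     -rob_occupancy[r.thread],
                                     r.arrival))
```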
12. Methodology
- Simulator
  - SMT extension of sim-alpha
  - Event-driven memory simulator (DDR SDRAM and Direct Rambus DRAM)
- Workloads
  - Mixtures of SPEC CPU2000 applications
  - 2-, 4-, and 8-thread workloads
  - ILP, MIX, and MEM workload mixes
13Simulation Parameters
Processor speed 3 GHz L1 caches 64KB I/D, 2-way, 1-cycle latency
Fetch width 8 inst. L2 cache 512KB, 2-way, 10-cycle latency
Baseline fetch policy DWarn.2.8 L3 cache 4MB, 4-way, 20-cycle latency
Pipeline depth 11 MSHR entries (164 prefetch)/cache
Issue queue size 64 Int., 32 FP Memory channels 2/4/8
Reorder buffer size 256/thread Memory BW/channel 200 MHz, DDR, 16B width
Physical register num 384 Int., 384 FP Memory banks 4 banks/chip
Load/store queue size 64 LQ, 64 SQ DRAM access latency 15ns row, 15ns column, 15ns precharge
14. Workload Mixes
- 2-thread
  - ILP: bzip2, gzip
  - MIX: gzip, mcf
  - MEM: mcf, ammp
- 4-thread
  - ILP: bzip2, gzip, sixtrack, eon
  - MIX: gzip, mcf, bzip2, ammp
  - MEM: mcf, ammp, swim, lucas
- 8-thread
  - ILP: gzip, bzip2, sixtrack, eon, mesa, galgel, crafty, wupwise
  - MIX: gzip, mcf, bzip2, ammp, sixtrack, swim, eon, lucas
  - MEM: mcf, ammp, swim, lucas, equake, applu, vpr, facerec
15. Performance Loss Due to Memory Access [figure]
16. Memory Access Concurrency [figure]
17. Memory Channel Configurations [figure]
18. Memory Channel Configurations (cont.) [figure]
19. Mapping Schemes [figure]
20. Memory Access Concurrency (cont.) [figure]
21. Thread-Aware Schemes [figure]
22. Conclusion
- DRAM optimizations have a significant impact on the performance of SMT (and likely CMP) processors
- They are most effective when a workload mix includes some memory-intensive programs
- Performance is sensitive to the memory channel organization
- DRAM-side locality is harder to exploit due to inter-thread contention
- Thread-aware access scheduling schemes do bring good performance