A Performance Comparison of DRAM Memory System Optimizations for SMT Processors

Transcript and Presenter's Notes

1
A Performance Comparison of DRAM Memory System
Optimizations for SMT Processors
  • Zhichun Zhu, ECE Department, Univ. of Illinois at Chicago
  • Zhao Zhang, ECE Department, Iowa State Univ.

2
DRAM Memory Optimizations
  • Optimizations on the DRAM side can make a big
    difference for single-threaded processors
    • Enhancements of the chip interface/interconnect
    • Access scheduling [Hong et al. HPCA'99, Mathew
      et al. HPCA'00, Rixner et al. ISCA'00]
    • DRAM-side locality [Cuppu et al. ISCA'99 and
      ISCA'01, Zhang et al. MICRO'00, Lin et al.
      HPCA'01]

3
How Does SMT Impact the Memory Hierarchy?
  • Less performance loss per cache miss to DRAM
    memory, so a lower benefit from DRAM-side
    optimizations?
  • But more cache misses due to cache contention,
    so much more pressure on main memory
  • Is DRAM memory design more important, or not?

4
Outline
  • Motivation
  • Memory optimization techniques
  • Thread-aware memory access scheduling
    • Outstanding request-based
    • Resource occupancy-based
  • Methodology
  • Memory performance analysis on SMT systems
    • Effectiveness of single-thread techniques
    • Effectiveness of thread-aware schemes
  • Conclusion

5
Memory Optimization Techniques
  • Page modes
    • Open page: good for programs with good locality
    • Close page: good for programs with poor locality
  • Mapping schemes (sketched in the code below)
    • Exploitation of concurrency (multiple channels,
      chips, banks)
    • Impact on row buffer conflicts
  • Memory access scheduling
    • Reordering of concurrent accesses
    • Reduces average latency and improves bandwidth
      utilization
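
As a concrete illustration of the mapping-scheme point above, here is a minimal C++ sketch of a cacheline-interleaved mapping that spreads consecutive cache lines across channels and banks to expose concurrency. The field widths (64B lines, 2 channels, 4 banks, 2KB rows) are assumptions for the sketch, not the configurations evaluated in the talk.

    #include <cstdint>
    #include <cstdio>

    // Hypothetical cacheline-interleaved mapping: consecutive 64B lines rotate
    // across 2 channels and 4 banks so independent misses can proceed in
    // parallel. All field widths here are illustrative assumptions.
    struct DramAddr {
        uint32_t channel;  // which memory channel
        uint32_t bank;     // which bank within the channel
        uint32_t row;      // row to open in the bank's row buffer
        uint32_t column;   // cache-line index within the 2KB row
    };

    DramAddr map_address(uint64_t paddr) {
        DramAddr a;
        a.channel = (paddr >> 6) & 0x1;    // bit 6: 2 channels
        a.bank    = (paddr >> 7) & 0x3;    // bits 7-8: 4 banks
        a.column  = (paddr >> 9) & 0x1F;   // bits 9-13: 32 lines per 2KB row
        a.row     = static_cast<uint32_t>(paddr >> 14);
        return a;
    }

    int main() {
        // Two consecutive cache lines map to different channels.
        const uint64_t addrs[] = {0x10000ULL, 0x10040ULL};
        for (uint64_t p : addrs) {
            DramAddr a = map_address(p);
            std::printf("0x%llx -> ch %u, bank %u, row %u, col %u\n",
                        static_cast<unsigned long long>(p),
                        a.channel, a.bank, a.row, a.column);
        }
    }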

6
Memory Access Scheduling for Single-Threaded
Systems
  • Hit-first
    • A row buffer hit has a higher priority than a row
      buffer miss
  • Read-first
    • A read has a higher priority than a write
  • Age-based
    • An older request has a higher priority than a
      newer one
  • Criticality-based
    • A critical request has a higher priority than a
      non-critical one (the rules are combined in the
      sketch below)
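
One possible way to combine these rules in a memory controller's request queue is sketched below in C++; the Request fields, the precedence order among the four rules, and the way criticality is flagged are assumptions for illustration, not the exact policies evaluated in the talk.

    #include <cstdint>
    #include <tuple>
    #include <vector>

    // Illustrative pending request at the memory controller. Field names and
    // the way criticality is marked (by the processor core) are assumptions.
    struct Request {
        uint64_t arrival_cycle;   // older requests have smaller values
        uint32_t bank;
        uint32_t row;
        bool     is_read;         // reads stall the core; writes usually do not
        bool     is_critical;     // e.g. flagged by the core as blocking commit
    };

    // Pick the next request to issue: hit-first, then read-first, then
    // criticality, then age. The precedence order is one plausible choice.
    const Request* pick_next(const std::vector<Request>& queue,
                             const std::vector<int64_t>& open_row /* per bank, -1 = closed */) {
        const Request* best = nullptr;
        auto key = [&](const Request& r) {
            bool hit = open_row[r.bank] == static_cast<int64_t>(r.row);
            // Larger tuples win; negate age so older requests rank higher.
            return std::make_tuple(hit, r.is_read, r.is_critical,
                                   -static_cast<int64_t>(r.arrival_cycle));
        };
        for (const Request& r : queue)
            if (best == nullptr || key(r) > key(*best)) best = &r;
        return best;  // nullptr when the queue is empty
    }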

7
Memory Access Concurrency with Multithreaded
Processors
  (Diagram: processor and memory activity timelines contrasting
  single-threaded and multi-threaded execution)
8
Thread-Aware Memory Scheduling
  • New dimension in memory scheduling for SMT
    systems: consider the current state of each
    thread
  • States related to memory accesses (tracked per
    thread, as sketched below)
    • Number of outstanding requests
    • Number of processor resources occupied
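
A minimal sketch of the per-thread state such a thread-aware controller might consult; the counter names, and the idea that the core exports occupancy counts to the memory controller, are assumptions for illustration.

    #include <cstdint>
    #include <vector>

    // Per-thread state a thread-aware memory controller could consult when
    // ordering requests. Names and update points are illustrative assumptions.
    struct ThreadState {
        uint32_t outstanding_requests;  // misses from this thread pending in the controller
        uint32_t rob_entries_held;      // reorder-buffer entries occupied by this thread
        uint32_t iq_entries_held;       // issue-queue entries occupied by this thread
    };

    // One entry per hardware thread context, e.g. 8 for an 8-thread SMT core.
    using ThreadTable = std::vector<ThreadState>;

    // Called when a request from thread `tid` enters / leaves the controller.
    inline void on_request_arrive(ThreadTable& t, uint32_t tid) { ++t[tid].outstanding_requests; }
    inline void on_request_done(ThreadTable& t, uint32_t tid)   { --t[tid].outstanding_requests; }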

9
Outstanding Request-Based Scheme
  • Request-based
    • A request generated by a thread with fewer
      pending requests has a higher priority

10
Outstanding Request-Based Scheme
  • Request-based
    • Hit-first and read-first are applied on top
    • For SMT processors, sustained memory bandwidth is
      more important than the latency of an individual
      access (see the sketch below)
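
A hedged C++ sketch of the outstanding-request-based rule described above: among row-buffer hits (and then reads), requests from the thread with the fewest pending requests are favoured, so no single memory-intensive thread monopolizes the channels. The exact precedence order and field names are assumptions, not the paper's implementation.

    #include <cstdint>
    #include <tuple>
    #include <vector>

    // Illustrative request entry; the row-hit flag is assumed to be
    // pre-computed by the controller from the current row-buffer contents.
    struct Request {
        uint64_t arrival_cycle;
        uint32_t thread_id;
        bool     is_read;
        bool     row_buffer_hit;
    };

    // outstanding[t] = number of requests thread t currently has pending.
    const Request* pick_next(const std::vector<Request>& queue,
                             const std::vector<uint32_t>& outstanding) {
        const Request* best = nullptr;
        auto key = [&](const Request& r) {
            // Hit-first and read-first on top; then fewer outstanding
            // requests; then oldest. Larger tuples win, so counts and age
            // are negated.
            return std::make_tuple(r.row_buffer_hit, r.is_read,
                                   -static_cast<int64_t>(outstanding[r.thread_id]),
                                   -static_cast<int64_t>(r.arrival_cycle));
        };
        for (const Request& r : queue)
            if (best == nullptr || key(r) > key(*best)) best = &r;
        return best;
    }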

11
Resource Occupancy-Based Scheme
  • ROB-based
    • Higher priority to requests from threads holding
      more reorder buffer (ROB) entries
  • IQ-based
    • Higher priority to requests from threads holding
      more issue queue (IQ) entries
  • Hit-first and read-first are applied on top (see
    the sketch below)
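
The occupancy-based variants can be sketched the same way; the only change from the request-based sketch is the tie-break key, which now prefers the thread holding the most ROB (or IQ) entries so that the thread clogging shared pipeline resources is drained first. Again, the precedence order and names are assumptions.

    #include <cstdint>
    #include <tuple>
    #include <vector>

    struct Request {
        uint64_t arrival_cycle;
        uint32_t thread_id;
        bool     is_read;
        bool     row_buffer_hit;
    };

    // occupancy[t] = ROB entries (ROB-based) or IQ entries (IQ-based) held
    // by thread t.
    const Request* pick_next(const std::vector<Request>& queue,
                             const std::vector<uint32_t>& occupancy) {
        const Request* best = nullptr;
        auto key = [&](const Request& r) {
            // Hit-first and read-first on top; then MORE occupied entries
            // wins; then the oldest request.
            return std::make_tuple(r.row_buffer_hit, r.is_read,
                                   static_cast<int64_t>(occupancy[r.thread_id]),
                                   -static_cast<int64_t>(r.arrival_cycle));
        };
        for (const Request& r : queue)
            if (best == nullptr || key(r) > key(*best)) best = &r;
        return best;
    }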

12
Methodology
  • Simulator
    • SMT extension of sim-Alpha
    • Event-driven memory simulator (DDR SDRAM and
      Direct Rambus DRAM)
  • Workloads
    • Mixtures of SPEC CPU2000 applications
    • 2-, 4-, and 8-thread workloads
    • ILP, MIX, and MEM workload mixes

13
Simulation Parameters
Processor speed         3 GHz
Fetch width             8 inst.
Baseline fetch policy   DWarn.2.8
Pipeline depth          11 stages
Issue queue size        64 Int., 32 FP
Reorder buffer size     256 entries/thread
Physical registers      384 Int., 384 FP
Load/store queue size   64 LQ, 64 SQ
L1 caches               64KB I/D, 2-way, 1-cycle latency
L2 cache                512KB, 2-way, 10-cycle latency
L3 cache                4MB, 4-way, 20-cycle latency
MSHR entries            16 (4 prefetch) per cache
Memory channels         2/4/8
Memory BW per channel   200 MHz, DDR, 16B width
Memory banks            4 banks/chip
DRAM access latency     15ns row, 15ns column, 15ns precharge
14
Workload Mixes
2-thread  ILP  bzip2, gzip
          MIX  gzip, mcf
          MEM  mcf, ammp
4-thread  ILP  bzip2, gzip, sixtrack, eon
          MIX  gzip, mcf, bzip2, ammp
          MEM  mcf, ammp, swim, lucas
8-thread  ILP  gzip, bzip2, sixtrack, eon, mesa, galgel, crafty, wupwise
          MIX  gzip, mcf, bzip2, ammp, sixtrack, swim, eon, lucas
          MEM  mcf, ammp, swim, lucas, equake, applu, vpr, facerec
15
Performance Loss Due to Memory Access
16
Memory Access Concurrency
17
Memory Channel Configurations
18
Memory Channel Configurations
19
Mapping Schemes
20
Memory Access Concurrency
21
Thread-Aware Schemes
22
Conclusion
  • DRAM optimizations have a significant impact on
    the performance of SMT (and likely CMP)
    processors
  • They are most effective when a workload mix
    includes some memory-intensive programs
  • Performance is sensitive to the memory channel
    organization
  • DRAM-side locality is harder to exploit due to
    contention
  • Thread-aware access scheduling schemes do bring
    good performance improvements