Reducing DRAM Latencies with an Integrated Memory Hierarchy Design - PowerPoint PPT Presentation

1
  • Reducing DRAM Latencies with an Integrated
    Memory Hierarchy Design
  • Authors
  • Wei-fen Lin and Steven K. Reinhardt, University
    of Michigan
  • Doug Burger, University of Texas at Austin
  • Presentation
  • by
  • Pravin Dalale

2
OUTLINE
  • Motivation
  • Main idea in the paper
  • - Analysis
  • - Main idea
  • Prefetch engine
  • - Insertion policy
  • - Prefetch scheduling
  • Results
  • Conclusion

3
Motivation
Memory density and capacity have grown along with
CPU power and complexity, but memory speed has
not kept pace.
4
Solutions
  • Multithreading
  • Multiple levels of caches
  • Prefetching

5
OUTLINE
  • Motivation
  • Main idea in the paper
  • - Analysis
  • - Main idea
  • Prefetch engine
  • - Insertion policy
  • - Prefetch scheduling
  • Results
  • Conclusion

6
Analysis (1)
  • IPCReal - Instructions per cycle with the real
    memory system
  • IPCPerfectL2 - Instructions per cycle with a
    real L1 cache but a perfect L2 cache
  • IPCPerfectMem - Instructions per cycle with
    perfect L1 and L2 caches

7
Analysis (2)
  • Fraction of performance lost due to imperfect L1
    and L2
  • (IPCPerfectMem - IPCReal) / IPCPerfectMem
  • Fraction of performance lost due to imperfect L2
  • (IPCPerfectL2 - IPCReal) / IPCPerfectL2

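The two stall fractions above can be computed directly from the three IPC values; a minimal sketch with made-up IPC numbers (the paper's actual values are per-benchmark simulation results):

```python
# Hypothetical IPC values for one benchmark (illustrative only; the
# paper obtains these from simulating three memory configurations).
ipc_real = 0.8         # real L1 + real L2 + real DRAM
ipc_perfect_l2 = 1.5   # real L1, L2 that always hits
ipc_perfect_mem = 2.0  # L1 that always hits

# Fraction of performance lost to the imperfect L1 and L2 together
total_stall_fraction = (ipc_perfect_mem - ipc_real) / ipc_perfect_mem

# Fraction of performance lost to the imperfect L2 alone
l2_stall_fraction = (ipc_perfect_l2 - ipc_real) / ipc_perfect_l2

print(f"total: {total_stall_fraction:.0%}, L2: {l2_stall_fraction:.0%}")
```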
8
Analysis (3)
  • Simulated 1.6GHz, out-of-order core
  • 64KB L1
  • 1MB L2
  • Direct Rambus Memory System with four 1.6GB/s
    channels
  • The 26 SPEC benchmarks were tested on this system
    to obtain IPCReal, IPCPerfectL2, IPCPerfectMem.

9
Analysis (4)
The L2 stall fraction is 80% for the mcf
benchmark. The average stall fraction caused by
L2 misses is 57% across the 26 SPEC CPU2000
benchmarks.
10
Main idea
  • The paper describes a technique to reduce L2
    miss latencies
  • It introduces a prefetch engine that prefetches
    data into the L2 cache upon an L2 demand miss

11
OUTLINE
  • Motivation
  • Main idea in the paper
  • - Analysis
  • - Main idea
  • Prefetch Engine
  • - Insertion policy
  • - Prefetch scheduling
  • Results
  • Conclusion

12
Prefetch Engine
(Figure: prefetch engine block diagram with three
numbered components, described on the next slide)
13
Prefetch Engine
1. Prefetch queue: maintains the list of n region
entries not in the L2 cache. 2. Prefetch
prioritizer: uses the bank state and the region
age to determine which prefetch to issue next.
3. Access prioritizer: selects a prefetch only
when there are no demand misses.
14
Insertion policy (1)
  • The prefetched block may be loaded into L2 with
    one of four priorities
  • 1. most-recently-used (MRU)
  • 2. second-most-recently-used (SMRU)
  • 3. second-least-recently-used (SLRU)
  • 4. least-recently-used (LRU)

15
Insertion policy (2)
  • Benchmarks were divided into two classes
  • High (above 20%) prefetch-accuracy benchmarks
  • Low (below 20%) prefetch-accuracy benchmarks
  • All benchmarks were tested with the four
    possible insertion policies.

The LRU insertion policy gives the best results
in both categories.
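LRU insertion can be sketched as one set of a set-associative cache in which demand fills enter at the MRU end but prefetched blocks enter at the LRU end, so an inaccurate prefetch is the first victim; a minimal sketch under those assumptions (class and method names are illustrative):

```python
class LRUInsertSet:
    """One set of a set-associative cache. Demand fills go to the MRU
    end; prefetched blocks enter at the LRU end so a useless prefetch
    is evicted first. Illustrative sketch, not the paper's hardware."""

    def __init__(self, ways):
        self.ways = ways
        self.stack = []  # index 0 = MRU, last index = LRU

    def _evict_if_full(self):
        if len(self.stack) >= self.ways:
            self.stack.pop()  # victim is the LRU block

    def demand_fill(self, tag):
        self._evict_if_full()
        self.stack.insert(0, tag)  # MRU insertion

    def prefetch_fill(self, tag):
        self._evict_if_full()
        self.stack.append(tag)     # LRU insertion

    def access(self, tag):
        # A hit promotes the block to MRU, rescuing useful prefetches.
        if tag in self.stack:
            self.stack.remove(tag)
            self.stack.insert(0, tag)
            return True
        return False
```

In a 4-way set, after demand fills of A and B, a prefetch of C, and demand fills of D and E, the prefetched block C is the one evicted even though it is newer than A and B.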
16
Prefetch Scheduling
  • Simple aggressive prefetching can consume a
    large amount of bandwidth and cause channel
    contention
  • This contention can be avoided by scheduling
    prefetch accesses only when the Rambus channels
    are idle

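The idle-cycle scheduling rule above amounts to a simple per-cycle arbitration; a minimal sketch, with the function name and queue representation as illustrative assumptions:

```python
def schedule_cycle(demand_queue, prefetch_queue, channel_busy):
    """Pick at most one memory access to issue this cycle.
    Demand misses may use the channel whenever it is free; prefetches
    are issued only on otherwise-idle channel cycles, so they never
    delay a demand access. Illustrative sketch, not the paper's logic."""
    if channel_busy:
        return None                                 # channel occupied
    if demand_queue:
        return ("demand", demand_queue.pop(0))      # demand wins
    if prefetch_queue:
        return ("prefetch", prefetch_queue.pop(0))  # idle cycle: prefetch
    return None                                     # channel stays idle
```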
17
OUTLINE
  • Motivation
  • Main idea in the paper
  • - Analysis
  • - Main idea
  • Prefetch engine
  • - Insertion policy
  • - Prefetch scheduling
  • Results
  • Conclusion

18
Results (1/3) - Overall performance improvement
The performance with prefetching is very close to
that of a perfect L2.
19
Results (2/3) - Sensitivity of prefetch scheme to
DRAM latencies
  • The base DRDRAM had a 40ns latency and an
    800MHz data transfer rate
  • If the latency is increased to 50ns, the mean
    performance of the prefetch scheme drops by
    less than 1% compared to the base system
  • If the latency is reduced to 34ns, the mean
    performance of the prefetch scheme again
    changes by less than 2%

20
Results (3/3) - Interaction with software
prefetching
  • When the prefetch scheme is coupled with
    software prefetching, none of the benchmarks
    improved significantly (at most 2%)
  • Thus the proposed prefetch scheme overshadows
    the benefits of software prefetching

21
OUTLINE
  • Motivation
  • Main idea in the paper
  • - Analysis
  • - Main idea
  • Prefetch engine
  • - Insertion policy
  • - Prefetch scheduling
  • Results
  • Conclusion

22
Conclusions
  • The authors proposed and evaluated a prefetch
    architecture integrated with the on-chip L2
    cache
  • The architecture aggressively prefetches large
    regions of data into L2 on demand misses
  • By scheduling these prefetches only during idle
    cycles and inserting them into the cache with
    low replacement priority, a significant
    improvement is obtained in 10 of the 26 SPEC
    benchmarks

23
QUESTIONS