Tag Correlating Prefetcher Analysis
1
Tag Correlating Prefetcher Analysis
  • Tung Nguyen

2
Problem
  • Memory access is a bottleneck
  • Alleviated with a fast on-chip cache hierarchy
  • A cache miss can cause the CPU to stall
  • Increasing the number of sets increases access latency
  • Address decoder propagation delay increases
  • Bit-line loading increases
  • Increasing associativity increases access latency
  • Way-matching (tag comparison) time increases
  • Increasing block size increases the miss penalty
  • Cache size can only grow by a factor of 2

3
Problem
[Diagram: cache address decode — INDEX (20 bits) and OFFSET (10 bits) feeding the row decoders]
4
Solution
  • Software prefetch
  • Adding prefetch instructions to data-intensive loops can yield large performance gains
  • How far ahead to prefetch is architecture dependent
  • The compiler is not effective at this
  • Each prefetch instruction has to be processed by the CPU
  • Hardware prefetch
  • Analyze program behavior and dynamically issue prefetches

5
Hardware prefetching
  • Prefetch to L1 cache
  • Prefetches have low accuracy and will pollute the L1 cache
  • Degrades performance
  • Prefetch to L2 cache
  • Creates the least disturbance to the overall data flow
  • A reference to prefetched data still causes an L1 miss, but the miss latency is much less than fetching the data from memory

6
TCP Algorithm
  • Tag History Table (THT): the sequence of tags associated with each block — tag1, …, tagk
  • Pattern History Table (PHT): maps the current tag to the next tag to prefetch into L2
  • Each prefetcher access consists of an Update and a Lookup operation

[Diagram: the miss address splits into tag | index | offset; the index selects a THT entry holding the sequence tag1, tag2, …, tagk, which feeds an indexing function into the 8-way Pattern History Table]
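As a rough model, the two tables can be sketched in Python (a minimal sketch: the table sizes come from the setup slide later in the deck, and using a plain dict for each 8-way PHT set is an assumption that elides replacement):

```python
from collections import deque

K = 3            # tags kept per THT sequence (tag1 .. tagk); k is a design parameter
THT_SETS = 1024  # one THT entry per L1 data-cache set (per the setup slide)
PHT_SETS = 256   # 256 sets, 8-way set-associative PHT

# THT: for each L1 set, the last K tags that missed there
tht = [deque([0] * K, maxlen=K) for _ in range(THT_SETS)]

# PHT: each set maps a current tag to the predicted next tag
# (a dict stands in for the 8 ways; real hardware would also evict)
pht = [dict() for _ in range(PHT_SETS)]

def pht_index(tag_seq):
    """Indexing function: truncated addition, keeping the low 8 bits of the sum."""
    return sum(tag_seq) & (PHT_SETS - 1)
```

The deque with `maxlen=K` naturally models the sliding tag window: appending the newest miss tag drops the oldest one.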
7
TCP Update
  • The old sequence tag1, tag2, …, tagk is used to index the PHT, and the entry whose tag matches tagk is selected
  • That entry's prefetch-tag field is set to missTag
  • The THT sequence is updated from tag1, tag2, …, tagk to tag2, …, tagk, missTag

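The update step above can be sketched as follows (Python; modeling `pht` as a list of dicts, one per set, mapping a tag to its predicted successor — the dict standing in for the 8 ways is an assumption):

```python
def tcp_update(tht_seq, pht, miss_tag, pht_sets=256):
    """Train the PHT with the new miss and slide the THT window.

    tht_seq : list [tag1, ..., tagk] of recent miss tags for this L1 set
    pht     : list of dicts, one per PHT set, mapping tag -> next tag
    """
    index = sum(tht_seq) & (pht_sets - 1)   # truncated addition of the old sequence
    pht[index][tht_seq[-1]] = miss_tag      # entry tagged tagk now predicts miss_tag
    tht_seq.pop(0)                          # tag1, ..., tagk
    tht_seq.append(miss_tag)                #   -> tag2, ..., tagk, miss_tag

# Numbers from the example slides: THT holds 1111, 2222, 3333; miss on tag 4444
pht = [dict() for _ in range(256)]
seq = [1111, 2222, 3333]
tcp_update(seq, pht, 4444)
print(seq)       # [2222, 3333, 4444]
print(pht[10])   # {3333: 4444} -- written into PHT set 10
```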
8
TCP Lookup
  • The updated sequence tag2, …, tagk, missTag is used to index the PHT, and the entry whose tag matches missTag is selected
  • Prefetch the data block whose tag comes from that entry's prefetch-tag field

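A matching lookup sketch (Python, same list-of-dicts PHT model as before; returning `None` for "no prediction" is an assumption):

```python
def tcp_lookup(tht_seq, pht, pht_sets=256):
    """Return the predicted next tag for the just-updated sequence, or None."""
    index = sum(tht_seq) & (pht_sets - 1)   # index with tag2, ..., tagk, missTag
    return pht[index].get(tht_seq[-1])      # entry tagged missTag holds the prediction

# Continuing the example: a previous update stored 1234 after 4444 in PHT set 15
pht = [dict() for _ in range(256)]
pht[15][4444] = 1234
predicted = tcp_lookup([2222, 3333, 4444], pht)
print(predicted)   # 1234 -- combined with the current index to form the prefetch address
```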
9
TCP Example
  • Update
  • PHT index = (1111 + 2222 + 3333) & 255
  • PHT index = 10
  • Set 10 is located and the entry whose tag matches 3333 is selected
  • Its prefetch tag is set to 4444

[Miss address: tag 4444, index 12, offset 3. THT sequence at set 12: 1111, 2222, 3333. PHT set 10 is updated]
10
TCP Example
  • Lookup
  • Update the THT sequence
  • PHT index = (2222 + 3333 + 4444) & 255
  • PHT index = 15
  • Set 15 is located and the entry whose tag matches 4444 is selected
  • Its prefetch tag 1234 and the current index 12 will be used to form the prefetch address

[Miss address: tag 4444, index 12, offset 3. Updated THT sequence at set 12: 2222, 3333, 4444. PHT set 15 supplies prefetch tag 1234]
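The two index computations in the example can be checked directly; `& 255` implements the truncated-addition indexing function described on the setup slide:

```python
def pht_index(tags):
    # Truncated addition: keep the low 8 bits of the tag sum
    return sum(tags) & 255

print(pht_index([1111, 2222, 3333]))  # 10 (update: old THT sequence)
print(pht_index([2222, 3333, 4444]))  # 15 (lookup: updated sequence)
```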
11
SimpleScalar Setup
12
TCP setup
  • TCP is placed between the L1 and L2 data caches to observe the L1 miss stream
  • THT has 1024 entries
  • Same as the number of sets in the L1 data cache
  • 8K PHT
  • 256 sets with associativity of 8
  • PHT indexing function
  • Truncated addition of the tags in the THT sequence
  • The lower log2(256) = 8 bits of the sum are used to index the PHT

13
Results
  • NB: prefetches the next sequential block in memory
  • TCP: THT sequence has 1 entry
  • The prefetch request queue has 16 entries
  • dl1 ver 2: second version of the L1 data cache, with half as many sets and twice the block size

14
Results
  • Increasing the THT sequence size does not increase prediction accuracy
  • Accuracy depends on the indexing function
  • Some apps, such as gzip and bzip2, only exhibit two-tag sequence correlation

15
Results
  • CPU-driven memory accesses have priority over TCP-driven accesses
  • The TCP request queue is serviced only when the bus is free
  • Queuing shows little effect on performance

16
Conclusion
  • Performance gain is maximized with 1 entry in each THT sequence
  • The prefetch request buffer has little effect on performance
  • If the bus is busy, discard the request
  • Effective for applications that have a high number of capacity misses
  • Separate L2 caches for instructions and data are required to prevent pollution of the instruction footprint