Tag Correlating Prefetcher Analysis
1
Tag Correlating Prefetcher Analysis
  • Tung Nguyen

2
Problem
  • Memory access is a bottleneck
  • Alleviated with a fast on-chip cache hierarchy
  • A cache miss can cause the CPU to stall
  • Increasing the number of sets increases access latency
  • Address decoder propagation delay increases
  • Bit-line loading increases
  • Increasing associativity increases access latency
  • Way-matching (tag comparison) time increases
  • Increasing block size increases the miss penalty
  • Cache size can only grow by a factor of 2

3
Problem
[Diagram: cache address decode — INDEX (20 bits) and OFFSET (10 bits) feeding the row decoders]
4
Solution
  • Software prefetch
  • Adding prefetch instructions to data-intensive loops can yield large performance gains
  • How far ahead to prefetch is architecture dependent
  • The compiler is not effective at this
  • Each prefetch instruction has to be processed by the CPU
  • Hardware prefetch
  • Analyze program behavior and dynamically issue prefetches

5
Hardware prefetching
  • Prefetch to L1 cache
  • Prefetches have low accuracy and will pollute the L1 cache
  • Degrades performance
  • Prefetch to L2 cache
  • Creates the least disturbance to the overall data flow
  • A reference to prefetched data still causes an L1 miss, but the miss latency is much less than fetching the data from memory

6
TCP Algorithm
  • Tag History Table (THT): the sequence of tags associated with each block — tag1, …, tagk
  • Pattern History Table (PHT): maps the current tag to the next tag to prefetch into L2
  • Each prefetcher access consists of an Update and a Lookup operation

[Diagram: the miss address splits into tag | index | offset; the index selects a THT entry holding the sequence tag1, tag2, …, tagk, which feeds an indexing function into the 8-way Pattern History Table]
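As a rough model, the two tables can be sketched in Python (a minimal sketch: the table sizes come from the setup slide later in the deck, and using a plain dict for each 8-way PHT set is an assumption that elides replacement):

```python
from collections import deque

K = 3            # tags kept per THT sequence (tag1 .. tagk); k is a design parameter
THT_SETS = 1024  # one THT entry per L1 data-cache set (per the setup slide)
PHT_SETS = 256   # 256 sets, 8-way set-associative PHT

# THT: for each L1 set, the last K tags that missed there
tht = [deque([0] * K, maxlen=K) for _ in range(THT_SETS)]

# PHT: each set maps a current tag to the predicted next tag
# (a dict stands in for the 8 ways; real hardware would also evict)
pht = [dict() for _ in range(PHT_SETS)]

def pht_index(tag_seq):
    """Indexing function: truncated addition, keeping the low 8 bits of the sum."""
    return sum(tag_seq) & (PHT_SETS - 1)
```

The deque with `maxlen=K` naturally models the sliding tag window: appending the newest miss tag drops the oldest one.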
7
TCP Update
  • The old sequence tag1, tag2, …, tagk is used to index the PHT, and the entry whose tag matches tagk is selected
  • That entry's prefetch-tag field is set to missTag
  • The THT sequence is updated from tag1, tag2, …, tagk to tag2, …, tagk, missTag

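The update step above can be sketched as follows (Python; modeling `pht` as a list of dicts, one per set, mapping a tag to its predicted successor — the dict standing in for the 8 ways is an assumption):

```python
def tcp_update(tht_seq, pht, miss_tag, pht_sets=256):
    """Train the PHT with the new miss and slide the THT window.

    tht_seq : list [tag1, ..., tagk] of recent miss tags for this L1 set
    pht     : list of dicts, one per PHT set, mapping tag -> next tag
    """
    index = sum(tht_seq) & (pht_sets - 1)   # truncated addition of the old sequence
    pht[index][tht_seq[-1]] = miss_tag      # entry tagged tagk now predicts miss_tag
    tht_seq.pop(0)                          # tag1, ..., tagk
    tht_seq.append(miss_tag)                #   -> tag2, ..., tagk, miss_tag

# Numbers from the example slides: THT holds 1111, 2222, 3333; miss on tag 4444
pht = [dict() for _ in range(256)]
seq = [1111, 2222, 3333]
tcp_update(seq, pht, 4444)
print(seq)       # [2222, 3333, 4444]
print(pht[10])   # {3333: 4444} -- written into PHT set 10
```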
8
TCP Lookup
  • The updated sequence tag2, …, tagk, missTag is used to index the PHT, and the entry whose tag matches missTag is selected
  • Prefetch the data block whose tag comes from that entry's prefetch-tag field

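A matching lookup sketch (Python, same list-of-dicts PHT model as before; returning `None` for "no prediction" is an assumption):

```python
def tcp_lookup(tht_seq, pht, pht_sets=256):
    """Return the predicted next tag for the just-updated sequence, or None."""
    index = sum(tht_seq) & (pht_sets - 1)   # index with tag2, ..., tagk, missTag
    return pht[index].get(tht_seq[-1])      # entry tagged missTag holds the prediction

# Continuing the example: a previous update stored 1234 after 4444 in PHT set 15
pht = [dict() for _ in range(256)]
pht[15][4444] = 1234
predicted = tcp_lookup([2222, 3333, 4444], pht)
print(predicted)   # 1234 -- combined with the current index to form the prefetch address
```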
9
TCP Example
  • Update
  • PHT index = (1111 + 2222 + 3333) & 255
  • PHT index = 10
  • Set 10 is located and the entry whose tag matches 3333 is selected
  • Its prefetch tag is set to 4444

[Miss address: tag 4444, index 12, offset 3. THT sequence at set 12: 1111, 2222, 3333. PHT set 10 is updated]
10
TCP Example
  • Lookup
  • Update the THT sequence
  • PHT index = (2222 + 3333 + 4444) & 255
  • PHT index = 15
  • Set 15 is located and the entry whose tag matches 4444 is selected
  • Its prefetch tag 1234 and the current index 12 will be used to form the prefetch address

[Miss address: tag 4444, index 12, offset 3. Updated THT sequence at set 12: 2222, 3333, 4444. PHT set 15 supplies prefetch tag 1234]
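The two index computations in the example can be checked directly; `& 255` implements the truncated-addition indexing function described on the setup slide:

```python
def pht_index(tags):
    # Truncated addition: keep the low 8 bits of the tag sum
    return sum(tags) & 255

print(pht_index([1111, 2222, 3333]))  # 10 (update: old THT sequence)
print(pht_index([2222, 3333, 4444]))  # 15 (lookup: updated sequence)
```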
11
SimpleScalar Setup
12
TCP setup
  • TCP is placed between the L1 and L2 data caches to observe the L1 miss stream
  • THT has 1024 entries
  • Same as the number of sets in the L1 data cache
  • 8K PHT
  • 256 sets with associativity of 8
  • PHT indexing function
  • Truncated addition of the tags in the THT sequence
  • The lower log2(256) = 8 bits of the sum are used to index the PHT

13
Results
  • NB: prefetches the next sequential block in memory
  • TCP: THT sequence has 1 entry
  • The prefetch request queue has 16 entries
  • dl1 ver 2: second version of the L1 data cache, with half as many sets and twice the block size

14
Results
  • Increasing the THT sequence size does not increase prediction accuracy
  • Accuracy depends on the indexing function
  • Some apps, such as gzip and bzip2, only exhibit two-tag sequence correlation

15
Results
  • CPU-driven memory accesses have priority over TCP-driven accesses
  • The TCP request queue is serviced only when the bus is free
  • Queuing shows little effect on performance

16
Conclusion
  • Performance gain is maximized with 1 entry in each THT sequence
  • The prefetch request buffer has little effect on performance
  • If the bus is busy, discard the request
  • Effective for applications that have a high number of capacity misses
  • Separate L2 caches for instructions and data are required to prevent pollution of the instruction footprint