1
Caching Considerations for Generational Garbage
Collection
  • Presented By
  • Felix Gartsman 306054172
  • http://www.cs.tau.ac.il/~gartsma/seminar.ppt
  • gartsma@post.tau.ac.il

2
Introduction
  • Main theme: the effect of memory caches on GC performance
  • What is a memory cache?
  • How do caches work?
  • How and why do caches and GC interact?
  • Can we boost GC performance by knowing more about
    caches?

3
Motivation
  • CPU and memory performance don't advance at the same speed
  • When the CPU waits for memory, it is idle
  • Solutions: pipelining, speculative execution, and caches
  • Caches provide fast access for commonly accessed
    memory

4
Caches and GC
  • Two-way relationship
  • Improving GC performance by cache awareness: minimizing cache misses
  • GC improving the mutator's memory access locality, minimizing cache misses by the mutator (not dealt with by the article)

5
Previous Work (Outdated!)
  • Deals mainly with interaction with virtual memory systems
  • No special attention to generational GC
  • Assumed best/worst cases and special hardware
  • Investigated only direct-mapped caches

6
Article Contribution
  • Surveys GGC (generational GC) performance on various caches
  • Checks techniques for improving performance
  • Main advice: try to keep the youngest generation fully in cache. If that is impossible, prefer associative caches

7
Roadmap
  • Cache in-depth
  • GC memory reuse cycle
  • GGC as better GC
  • Comparing cache size requirements
  • Comparing misses for different cache types
  • Conclusions

8
Cache in-depth
[Diagram: the memory hierarchy, from registers through the L1, L2 (and L3?) caches to main memory and virtual memory (disk)]
  • Higher level means higher speed and smaller
    capacity
  • A miss at one level relays the handling to the next, lower level

9
Motivation contd.
  • When a memory word is not in cache, a cache
    miss occurs
  • A cache miss stalls the CPU and forces an access to main memory
  • Cache misses are expensive
  • Cache misses become more expensive with each new
    generation of CPUs
  • Memory access penalty on a P4: L1: 2 cycles, L2: 7 cycles; a miss: dozens of cycles, depending on memory type

10
Cache properties
  • Size (8-64 KB in L1, 128 KB-3 MB in L2, 6-8 MB in L3?)
  • Layout (block size and sub-blocks)
  • Placement (N→M hash function)
  • Associativity
  • Write strategy
  • Write-through or Write-back
  • Fetch-on-write or write-around

11
Cache Size
  • Size: the bigger the better. A too-small cache can render a fast CPU sluggish (the Intel Celeron, for example)
  • Bigger cache reduces cache misses
  • Constraints
  • Physical feasibility (proximity, size, heat)
  • Money (cost vs. performance ratio)

12
Cache Layout
  • Cache memory is divided into blocks called cache lines
  • Each line contains a validity bit, a dirty bit, replacement-policy bits, an address tag, and of course the data (sketched below)
  • Bigger blocks reduce misses under good spatial locality, but hurt performance when working on multiple memory regions; they also take longer to fill
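As a rough illustration of the per-line metadata just listed, here is a minimal C sketch of one cache line for a 16-byte-block cache; the field widths and names are illustrative assumptions, not taken from any particular CPU:

```c
#include <stdint.h>

#define BLOCK_SIZE 16  /* bytes of data per cache line */

/* Hypothetical layout of one line's state; real hardware packs these
   fields into dedicated SRAM bits, not a C struct. */
struct cache_line {
    uint8_t  valid;             /* validity bit: line holds real data      */
    uint8_t  dirty;             /* dirty bit: modified since it was filled */
    uint8_t  lru_bits;          /* replacement-policy (e.g. LRU) state     */
    uint32_t tag;               /* high address bits identifying the block */
    uint8_t  data[BLOCK_SIZE];  /* the cached memory block itself          */
};
```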

13
Cache Layout contd.
  • Can be solved by dividing lines into sub-blocks and managing them separately

14
Cache Placement
  • Maps a memory address to a block number
  • Examples:
  • Address modulo the number of blocks
  • Select the middle bits of the address
  • Select a set of bits
  • Must be fast and hardware friendly
  • Should be a uniform mapping (see the sketch below)
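A minimal sketch of the first two example mappings, assuming power-of-two sizes (the constants BLOCK_SIZE and NUM_BLOCKS are illustrative); with such sizes the modulo reduces to cheap bit selection, which is exactly what makes it hardware friendly:

```c
#include <stdint.h>

#define BLOCK_SIZE 16    /* bytes per line: 4 offset bits     */
#define NUM_BLOCKS 1024  /* lines in the cache: 10 index bits */

/* "Address modulo blocks": the memory block's number, reduced modulo
   the number of cache lines. */
static uint32_t placement_modulo(uint32_t addr) {
    return (addr / BLOCK_SIZE) % NUM_BLOCKS;
}

/* "Select middle bits": drop the offset bits, then keep the next
   log2(NUM_BLOCKS) bits. For power-of-two sizes this computes the
   same mapping as the modulo above using only shifts and masks. */
static uint32_t placement_middle_bits(uint32_t addr) {
    return (addr >> 4) & (NUM_BLOCKS - 1);
}
```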

15
Cache Associativity
  • Fully associative: each address can be in any block. Need to check all tags: slow or expensive. LRU replacement
  • Direct mapped: each address can be in only one block. Fast lookup, but no usage history
  • Set associative: each address can be in a set (2, 4, 8) of blocks. A compromise: fast access and limited usage history (lookup sketched below)
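A hedged sketch of a set-associative lookup (all names and sizes are illustrative): only the tags within one set are compared, with ASSOC = 1 degenerating to direct mapped and ASSOC = number of blocks to fully associative:

```c
#include <stdint.h>

#define BLOCK_SIZE 16   /* bytes per line                     */
#define NUM_SETS   512  /* sets in the cache                  */
#define ASSOC      2    /* blocks per set: 2-way associative  */

struct line { uint8_t valid; uint32_t tag; };
static struct line cache[NUM_SETS][ASSOC];

/* Returns 1 on a hit. Only ASSOC tags are compared (in parallel in
   real hardware); fully associative would compare them all, direct
   mapped exactly one. A real cache would also update LRU state. */
static int lookup(uint32_t addr) {
    uint32_t set = (addr / BLOCK_SIZE) % NUM_SETS;
    uint32_t tag = addr / (BLOCK_SIZE * NUM_SETS);
    for (int way = 0; way < ASSOC; way++)
        if (cache[set][way].valid && cache[set][way].tag == tag)
            return 1;  /* hit */
    return 0;          /* miss: relayed to the next hierarchy level */
}
```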

16
Cache Write Strategy
  • Write-through: write directly to memory and of course update the cache (slow, but can use write buffers)
  • Write-back: write to the cache and mark the line dirty; flush to memory later. Very useful for multiple writes to nearby addresses (object initialization). Can also enjoy write buffers (less useful here)

17
Cache Write Strategy contd.
  • What to do on write cache miss?
  • Fetch-on-write / write-allocate: on a miss, fetch the corresponding cache line and treat the access as a write hit
  • Write-around / write-no-allocate: write directly to memory
  • Usually: write-back pairs with write-allocate, write-through with write-no-allocate (sketched below)
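A schematic C sketch of the two write-miss policies; the helper functions are hypothetical stand-ins for the cache hardware, named here only for illustration:

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stubs standing in for the cache hardware. */
static int  in_cache(uint32_t a)               { (void)a; return 0; }
static void fetch_line(uint32_t a)             { printf("fetch line %#x\n", (unsigned)a); }
static void write_to_line(uint32_t a, int v)   { printf("cache %#x = %d (dirty)\n", (unsigned)a, v); }
static void write_to_memory(uint32_t a, int v) { printf("memory %#x = %d\n", (unsigned)a, v); }

/* On a write miss: write-allocate fetches the line and then treats
   the access as a write hit; write-around stores straight to memory. */
static void store(uint32_t addr, int value, int write_allocate) {
    if (in_cache(addr)) { write_to_line(addr, value); return; }  /* write hit */
    if (write_allocate) {
        fetch_line(addr);             /* fetch-on-write ...            */
        write_to_line(addr, value);   /* ... then treat as a write hit */
    } else {
        write_to_memory(addr, value); /* write-around */
    }
}

int main(void) {
    store(0x1000, 42, 1);  /* write-allocate path    */
    store(0x2000,  7, 0);  /* write-no-allocate path */
    return 0;
}
```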

18
Modern memory usage
  • Object-oriented languages tend to create many small objects with short lifetimes. For example, STL uses value semantics, which copies objects on every operation!
  • Functional languages (Lisp, Scheme) constantly
    create new objects which replace old ones (cons
    and friends)

19
Modern memory usage contd.
  • Creation is expensive: allocation with a probable write miss (a new address is used). The article cites sources claiming functional languages write in up to 25% of their instructions (other languages: 10%)

20
Memory Recycling Pattern
  • GC systems tend to violate locality assumptions
  • Cyclic reuse of memory defeats any caching policy. The reuse cycle is too long to be captured
  • GC systems become bandwidth limited

21
Allocation is to blame, not GC
  • Locality of the GC process itself is not the
    weakest link
  • The problem is fast allocation of memory, which
    will be reclaimed much later
  • Main memory fills very fast. What to do?
  • GC: too frequent, but avoids paging
  • Use VM: touches many pages and causes paging

22
Pattern Results
  • Allocation touches new memory and forces a page-in/page fetch (slow)
  • Why fetch? The memory allocated was used previously. The OS doesn't know it's garbage, and the allocation will overwrite it anyway
  • Informing the OS that no fetch is required speeds execution (see the sketch below)
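On a modern Unix, one way to inform the OS is madvise; a minimal sketch of the idea (my illustration, not necessarily the mechanism the article had in mind):

```c
#include <sys/mman.h>
#include <stddef.h>

/* Tell the OS that a region the allocator is about to reuse holds only
   garbage, so its old contents need not be paged back in. On Linux,
   MADV_DONTNEED on anonymous memory makes the next touch deliver
   zero-filled pages instead of fetching stale data from swap.
   start must be page-aligned. */
static void forget_region(void *start, size_t len) {
    madvise(start, len, MADV_DONTNEED);
}
```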

23
Pattern Results contd.
  • When main memory is exhausted (or the process isn't allowed more pages), old pages must be evicted
  • Those pages are probably dirty: they must be written to disk
  • Even worse: the evicted page is the LRU one, so it is probably garbage!
  • Worst case: disk bandwidth = 2 × allocation rate (each allocated page is both fetched from and written back to disk)

24
Another view
  • View GC allocator as a co-process to the mutator
  • Each one has its own locality of reference
  • The mutator probably has good spatial locality
  • Allocation is cyclic (remember LRU)

25
Compaction and Semi-Spaces
  • Compaction helps the mutator, but makes little difference to the allocator
  • It still marches through large memory areas
  • Trouble with semi-spaces: the tospace was probably evicted. All addresses are replaced: a cache flush. The entire heap is marched through every second cycle

26
Solution?
  • So LRU is bad, can we replace it?
  • We can, but it won't help much
  • Too much memory touched too frequently
  • Allocator page faults dominate program execution!
  • Only holding entire reuse cycle in memory will
    stop paging

27
Generational GC
  • Solution: touch less memory, less frequently
  • Divide the heap into generations
  • GC the young generation(s), touching less memory
  • This eliminates vast memory marching: the memory reuse cycle is minimized
  • Eliminates paging; what about the cache? (allocation sketched below)
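A minimal sketch of why touching less memory also helps the allocator: bump-pointer allocation inside one small, constantly reused young generation (all names and sizes are illustrative assumptions):

```c
#include <stddef.h>
#include <stdint.h>

#define NURSERY_SIZE (128 * 1024)  /* small enough to stay in cache */

static uint8_t nursery[NURSERY_SIZE];
static size_t  bump;               /* next free byte */

static void minor_gc(void) { /* copy survivors to an older generation (elided) */ }

/* Allocation just bumps a pointer inside one small, constantly reused
   region, so allocation writes keep hitting the same cached lines
   instead of marching across the whole heap. */
static void *gc_alloc(size_t n) {
    n = (n + 7) & ~(size_t)7;       /* 8-byte alignment */
    if (bump + n > NURSERY_SIZE) {  /* nursery full: collect, then reuse it */
        minor_gc();
        bump = 0;
    }
    void *p = &nursery[bump];
    bump += n;
    return p;
}
```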

28
Generational GC variations
  • Can use a single space: immediate promotion
  • Can use semi-spaces: promote at will, at the expense of more memory

29
Better Generational GC
  • Ungar: use a pair of semi-spaces and a separate, dedicated creation space
  • The creation space is emptied and reused every cycle, while the semi-spaces alternate roles as destination
  • The result: only a small part of the semi-spaces is touched, and new objects are created in a hot space in main memory (and maybe in cache) (sketched below)
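A hedged sketch of the resulting minor-collection cycle (helpers and layout are illustrative, not Ungar's actual code): survivors are copied out, the creation space is emptied for reuse, and the semi-spaces swap roles:

```c
#include <stddef.h>
#include <stdint.h>

struct space { uint8_t *base; size_t size, used; };

static struct space creation;  /* refilled by allocation every cycle  */
static struct space semi[2];   /* the pair of alternating semi-spaces */
static int to_idx;             /* which semi-space is "to" this cycle */

/* Hypothetical helper: copy the live objects of src into dst. */
static void copy_survivors(struct space *src, struct space *dst) {
    (void)src; (void)dst;  /* Cheney-style copying elided */
}

static void minor_gc(void) {
    struct space *to   = &semi[to_idx];
    struct space *from = &semi[1 - to_idx];
    copy_survivors(&creation, to);  /* new objects that survived          */
    copy_survivors(from, to);       /* older survivors not yet promoted   */
    creation.used = 0;              /* creation space reused every cycle, */
    from->used    = 0;              /* so it stays "hot"                  */
    to_idx = 1 - to_idx;            /* semi-spaces swap destination roles */
}
```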

30
Cache Revised
  • Cache misses can be categorized into:
  • Capacity misses: the miss will occur no matter what cache organization is used
  • Conflict misses: a miss occurs because two (or more) addresses map to the same cache line (set)
  • Direct-mapped caches suffer from conflict misses the most: every miss evicts the block with the same mapping

31
Conflict Misses in-depth
  • The conflict miss rate is roughly a min function: about 2 × min(f1, f2) for two conflicting addresses accessed at frequencies f1 and f2
  • Example: both addresses map to the same line. The first is accessed every ms, the second every µs. The (double) miss occurs every ms
  • The rate depends on the usage frequency of the addresses not in cache

32
Minimizing the Conflict Miss Rate
  • Most non-GC systems are skewed: a few objects are used frequently, the rest little. If placed well, the cache is efficient
  • If many blocks are accessed on an intermediate time scale: more misses, and more chances they will interfere with each other
  • (Over-simplified to help understanding)

33
Example
  • A program marches through memory while doing normal activity. We use a 16 KB cache
  • 2-way associative: the most frequently used blocks are not touched
  • Direct mapped: a total flush every cycle
  • Conclusion: in DM the march takes twice as long to wrap around onto a given block, but the result is painful (a flush)
  • DM can't handle multiple access patterns (toy sketch below)
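A toy C sketch of this access pattern (sizes hypothetical): a linear march interleaved with accesses to one hot block. In a direct-mapped cache the march periodically lands on the hot block's line and flushes it; a 2-way cache can keep the hot block in its second way:

```c
#include <stdint.h>

#define MARCH_SIZE (1 << 20)    /* 1 MB region being marched through */

static uint8_t big[MARCH_SIZE]; /* the marched memory          */
static uint8_t hot[64];         /* frequently used working set */

volatile uint8_t sink;          /* keep reads from being optimized away */

void run(void) {
    for (uint32_t i = 0; i < MARCH_SIZE; i++) {
        sink = big[i];          /* linear march: touches every line once */
        sink = hot[i % 64];     /* "normal activity" on the hot block    */
        /* Direct mapped: whenever big[i] maps to hot's cache line, hot
           is evicted and the next hot access misses, on every pass.
           2-way associative: hot can live in the second way, untouched. */
    }
}
```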

34
Experiments
  • Instrumented Scheme compiler with integrated
    cache simulator
  • Executes millions of instructions, allocates MBs
  • We'll present 2 programs:
  • Scheme compiler
  • Boyer benchmark: objects live long and tend to be promoted

35
Experiments contd.
  • Cache lines are 16 bytes wide
  • 3 collectors:
  • 1: GGC with 2 MB spaces per generation; no promotion is ever done
  • 2: GGC with 141 KB spaces per generation
  • 3: collector 2 plus a 141 KB creation space (Ungar)

36
Results (Capacity)
37
Interpretation
  • LRU queue distance distribution
  • What does it mean?
  • The probability of a block being touched at different points in the LRU queue
  • The probability of a block being touched given how long since it was last touched
  • The probability of a block being touched given how many other blocks have been touched more recently
38
Interpretation contd.
  • Fourth queue position: 128 KB
  • Eighth queue position: 256 KB
  • For any given position: the area under the curve to its left represents cache hits, to its right, misses
  • The curve's height at a point: the marginal increase in hits from enlarging the cache at that point

39
Experiment Meaning
  • The first entries absorb most hits
  • Collector 1:
  • A dramatic drop
  • Beyond about the tenth position (320 KB): no more cache is needed
  • Collectors 2-3:
  • A hump peaking when memory starts recycling

40
Experiment Meaning contd.
  • Collector 2: recycling after 141×2 KB; a cache of 300-400 KB should suffice
  • Collector 3: the creation space is constantly recycled and only a small part of the other spaces is touched; a cache of 200-300 KB should suffice

41
Experiment Meaning contd. 2
  • Boyer behaves differently
  • Collector 3 is better than collector 2 by 30%
  • Capacity misses disappear if the cache is larger than the youngest generation

42
Results (Collision)
43
Interpretation
  • The graph plots cache size vs. miss rate
  • Shows results only for collector 3

44
Experiment Meaning
  • Associative caches show a dramatic, almost linear drop down to 256 KB (which contains the youngest generation). From then on, nothing interesting
  • Direct mapped: the same on the 16-90 KB interval, better on 90-135 KB, much worse later on

45
Experiment Meaning contd.
  • Why is DM better in that interval?
  • The cache is big enough to hold the creation area, yet suffers interference from other blocks
  • The associative cache evicts blocks before they are used, due to collisions
  • Later, the associative cache suffers only re-fill misses, while DM also suffers collisions

46
More Performance Notes
  • When the cache is too small, most evicted blocks are dirty and require expensive writebacks
  • Interference may also cause writebacks

47
Conclusions
  • Caches are an important part of modern computers
  • Garbage collectors reuse memory in cycles, often marching through memory
  • LRU evicts dirty pages/cache lines; needless fetches are costly
  • GGC reuses a smaller area and reduces paging

48
Conclusions contd.
  • A similar idea for caches: hold the youngest generation entirely
  • Ungar's 3-space proposition reduces the required footprint by 30%
  • Excluding a small interval, associative caches perform better than direct-mapped ones, which suffer collision misses

49
Questions?
50
The End