1
Caching Considerations for Generational Garbage
Collection
  • Presented By
  • Felix Gartsman 306054172
  • http://www.cs.tau.ac.il/~gartsma/seminar.ppt
  • gartsma@post.tau.ac.il

2
Introduction
  • Main theme: the effect of memory caches on GC performance
  • What is a memory cache?
  • How do caches work?
  • How and why do caches and GC interact?
  • Can we boost GC performance by knowing more about
    caches?

3
Motivation
  • CPU and memory performance don't advance at the same speed
  • When the CPU waits for memory, it is idle
  • Solutions: pipelining, speculative execution, and caches
  • Caches provide fast access for commonly accessed
    memory

4
Caches and GC
  • Two-way relationship
  • Improving GC performance by cache awareness: minimizing cache misses
  • GC improving the mutator's memory access locality, minimizing cache misses by the mutator (not dealt with by the article)

5
Previous Work (Outdated!)
  • Deals mainly with interaction with virtual memory systems
  • No special attention to generational GC
  • Assumed best/worst cases and special hardware
  • Investigated only direct-mapped caches

6
Article Contribution
  • Surveys GGC (generational GC) performance on various caches
  • Checks techniques for improving performance
  • Main advice: try to keep the youngest generation fully in cache. If that is impossible, prefer associative caches

7
Roadmap
  • Cache in-depth
  • GC memory reuse cycle
  • GGC as better GC
  • Comparing cache size requirements
  • Comparing misses for different cache types
  • Conclusions

8
Cache in-depth
[Diagram: the memory hierarchy, from registers through the L1, L2 (and L3?) caches to main memory and virtual memory (disk)]
  • Higher level means higher speed and smaller
    capacity
  • A miss at one level relays the handling to the next, lower level

9
Motivation contd.
  • When a memory word is not in cache, a cache
    miss occurs
  • A cache miss stalls the CPU and forces an access to main memory
  • Cache misses are expensive
  • Cache misses become more expensive with each new
    generation of CPUs
  • Memory access penalty on a P4: L1: 2 cycles, L2: 7 cycles; a miss: dozens of cycles, depending on memory type

10
Cache properties
  • Size (8-64 KB in L1, 128 KB-3 MB in L2, 6-8 MB in L3?)
  • Layout (block size and sub-blocks)
  • Placement (N→M hash function)
  • Associativity
  • Write strategy
  • Write-through or Write-back
  • Fetch-on-write or write-around

11
Cache Size
  • Size: the bigger the better. A too-small cache can render a fast CPU sluggish (the Intel Celeron, for example)
  • Bigger cache reduces cache misses
  • Constraints
  • Physical feasibility (proximity, size, heat)
  • Money (cost vs. performance ratio)

12
Cache Layout
  • Cache memory is divided into blocks called cache lines
  • Each line contains a validity bit, a dirty bit, replacement-policy bits, an address tag, and of course the data (sketched below)
  • Bigger blocks reduce misses under good spatial locality, but hurt performance when working on multiple memory regions; they also take longer to fill
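As a rough illustration of the per-line metadata just listed, here is a minimal C sketch of one cache line for a 16-byte-block cache; the field widths and names are illustrative assumptions, not taken from any particular CPU:

```c
#include <stdint.h>

#define BLOCK_SIZE 16  /* bytes of data per cache line */

/* Hypothetical layout of one line's state; real hardware packs these
   fields into dedicated SRAM bits, not a C struct. */
struct cache_line {
    uint8_t  valid;             /* validity bit: line holds real data      */
    uint8_t  dirty;             /* dirty bit: modified since it was filled */
    uint8_t  lru_bits;          /* replacement-policy (e.g. LRU) state     */
    uint32_t tag;               /* high address bits identifying the block */
    uint8_t  data[BLOCK_SIZE];  /* the cached memory block itself          */
};
```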

13
Cache Layout contd.
  • Can be solved by dividing lines into sub-blocks and managing them separately

14
Cache Placement
  • Maps a memory address to a block number
  • Examples:
  • Address modulo the number of blocks
  • Select the middle bits of the address
  • Select a set of bits
  • Must be fast and hardware friendly
  • Should be a uniform mapping (see the sketch below)
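A minimal sketch of the first two example mappings, assuming power-of-two sizes (the constants BLOCK_SIZE and NUM_BLOCKS are illustrative); with such sizes the modulo reduces to cheap bit selection, which is exactly what makes it hardware friendly:

```c
#include <stdint.h>

#define BLOCK_SIZE 16    /* bytes per line: 4 offset bits     */
#define NUM_BLOCKS 1024  /* lines in the cache: 10 index bits */

/* "Address modulo blocks": the memory block's number, reduced modulo
   the number of cache lines. */
static uint32_t placement_modulo(uint32_t addr) {
    return (addr / BLOCK_SIZE) % NUM_BLOCKS;
}

/* "Select middle bits": drop the offset bits, then keep the next
   log2(NUM_BLOCKS) bits. For power-of-two sizes this computes the
   same mapping as the modulo above using only shifts and masks. */
static uint32_t placement_middle_bits(uint32_t addr) {
    return (addr >> 4) & (NUM_BLOCKS - 1);
}
```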

15
Cache Associativity
  • Fully associative: each address can be in any block. Need to check all tags: slow or expensive. LRU replacement
  • Direct mapped: each address can be in only one block. Fast lookup, but no usage history
  • Set associative: each address can be in a set (2, 4, 8) of blocks. A compromise: fast access and limited usage history (lookup sketched below)
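A hedged sketch of a set-associative lookup (all names and sizes are illustrative): only the tags within one set are compared, with ASSOC = 1 degenerating to direct mapped and ASSOC = number of blocks to fully associative:

```c
#include <stdint.h>

#define BLOCK_SIZE 16   /* bytes per line                     */
#define NUM_SETS   512  /* sets in the cache                  */
#define ASSOC      2    /* blocks per set: 2-way associative  */

struct line { uint8_t valid; uint32_t tag; };
static struct line cache[NUM_SETS][ASSOC];

/* Returns 1 on a hit. Only ASSOC tags are compared (in parallel in
   real hardware); fully associative would compare them all, direct
   mapped exactly one. A real cache would also update LRU state. */
static int lookup(uint32_t addr) {
    uint32_t set = (addr / BLOCK_SIZE) % NUM_SETS;
    uint32_t tag = addr / (BLOCK_SIZE * NUM_SETS);
    for (int way = 0; way < ASSOC; way++)
        if (cache[set][way].valid && cache[set][way].tag == tag)
            return 1;  /* hit */
    return 0;          /* miss: relayed to the next hierarchy level */
}
```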

16
Cache Write Strategy
  • Write-through: write directly to memory and of course update the cache (slow, but can use write buffers)
  • Write-back: write to the cache and mark the line dirty; flush to memory later. Very useful for multiple writes to nearby addresses (object initialization). Can also enjoy write buffers (less useful here)

17
Cache Write Strategy contd.
  • What to do on write cache miss?
  • Fetch-on-write / write-allocate: on a miss, fetch the corresponding cache line and treat the access as a write hit
  • Write-around / write-no-allocate: write directly to memory
  • Usually: write-back pairs with write-allocate, write-through with write-no-allocate (sketched below)
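A schematic C sketch of the two write-miss policies; the helper functions are hypothetical stand-ins for the cache hardware, named here only for illustration:

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stubs standing in for the cache hardware. */
static int  in_cache(uint32_t a)               { (void)a; return 0; }
static void fetch_line(uint32_t a)             { printf("fetch line %#x\n", (unsigned)a); }
static void write_to_line(uint32_t a, int v)   { printf("cache %#x = %d (dirty)\n", (unsigned)a, v); }
static void write_to_memory(uint32_t a, int v) { printf("memory %#x = %d\n", (unsigned)a, v); }

/* On a write miss: write-allocate fetches the line and then treats
   the access as a write hit; write-around stores straight to memory. */
static void store(uint32_t addr, int value, int write_allocate) {
    if (in_cache(addr)) { write_to_line(addr, value); return; }  /* write hit */
    if (write_allocate) {
        fetch_line(addr);             /* fetch-on-write ...            */
        write_to_line(addr, value);   /* ... then treat as a write hit */
    } else {
        write_to_memory(addr, value); /* write-around */
    }
}

int main(void) {
    store(0x1000, 42, 1);  /* write-allocate path    */
    store(0x2000,  7, 0);  /* write-no-allocate path */
    return 0;
}
```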

18
Modern memory usage
  • Object-oriented languages tend to create many small objects with short lifetimes. For example, STL uses value semantics, which copies objects on every operation!
  • Functional languages (Lisp, Scheme) constantly
    create new objects which replace old ones (cons
    and friends)

19
Modern memory usage contd.
  • Creation is expensive: allocation with a probable write miss (a new address is used). The article cites sources claiming functional languages write in up to 25% of their instructions (other languages: 10%)

20
Memory Recycling Pattern
  • GC systems tend to violate locality assumptions
  • Cyclic reuse of memory defeats any caching policy. The reuse cycle is too long to be captured
  • GC systems become bandwidth limited

21
Allocation is to blame, not GC
  • Locality of the GC process itself is not the
    weakest link
  • The problem is fast allocation of memory, which
    will be reclaimed much later
  • Main memory fills very fast. What to do?
  • GC: too frequent, but avoids paging
  • Use VM: touches many pages and causes paging

22
Pattern Results
  • Allocation touches new memory and forces a page-in/page fetch (slow)
  • Why fetch? The memory allocated was used previously. The OS doesn't know it's garbage, and the allocation will overwrite it anyway
  • Informing the OS that no fetch is required speeds execution (see the sketch below)
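On a modern Unix, one way to inform the OS is madvise; a minimal sketch of the idea (my illustration, not necessarily the mechanism the article had in mind):

```c
#include <sys/mman.h>
#include <stddef.h>

/* Tell the OS that a region the allocator is about to reuse holds only
   garbage, so its old contents need not be paged back in. On Linux,
   MADV_DONTNEED on anonymous memory makes the next touch deliver
   zero-filled pages instead of fetching stale data from swap.
   start must be page-aligned. */
static void forget_region(void *start, size_t len) {
    madvise(start, len, MADV_DONTNEED);
}
```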

23
Pattern Results contd.
  • When main memory is exhausted (or the process isn't allowed more pages), old pages must be evicted
  • Those pages are probably dirty: they must be written to disk
  • Even worse: the evicted page is the LRU one, so it is probably garbage!
  • Worst case: disk bandwidth = 2 × allocation rate (each allocated page is both fetched from and written back to disk)

24
Another view
  • View GC allocator as a co-process to the mutator
  • Each one has its own locality of reference
  • The mutator probably has good spatial locality
  • Allocation is cyclic (remember LRU)

25
Compaction and Semi-Spaces
  • Compaction helps the mutator, but makes little difference to the allocator
  • It still marches through large memory areas
  • Trouble with semi-spaces: the tospace was probably evicted. All addresses are replaced: a cache flush. The entire heap is marched through every second cycle

26
Solution?
  • So LRU is bad, can we replace it?
  • We can, but it won't help much
  • Too much memory touched too frequently
  • Allocator page faults dominate program execution!
  • Only holding entire reuse cycle in memory will
    stop paging

27
Generational GC
  • Solution: touch less memory, less frequently
  • Divide the heap into generations
  • GC the young generation(s), touching less memory
  • This eliminates vast memory marching: the memory reuse cycle is minimized
  • Eliminates paging; what about the cache? (allocation sketched below)
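A minimal sketch of why touching less memory also helps the allocator: bump-pointer allocation inside one small, constantly reused young generation (all names and sizes are illustrative assumptions):

```c
#include <stddef.h>
#include <stdint.h>

#define NURSERY_SIZE (128 * 1024)  /* small enough to stay in cache */

static uint8_t nursery[NURSERY_SIZE];
static size_t  bump;               /* next free byte */

static void minor_gc(void) { /* copy survivors to an older generation (elided) */ }

/* Allocation just bumps a pointer inside one small, constantly reused
   region, so allocation writes keep hitting the same cached lines
   instead of marching across the whole heap. */
static void *gc_alloc(size_t n) {
    n = (n + 7) & ~(size_t)7;       /* 8-byte alignment */
    if (bump + n > NURSERY_SIZE) {  /* nursery full: collect, then reuse it */
        minor_gc();
        bump = 0;
    }
    void *p = &nursery[bump];
    bump += n;
    return p;
}
```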

28
Generational GC variations
  • Can use a single space: immediate promotion
  • Can use semi-spaces: promote at will, at the expense of more memory

29
Better Generational GC
  • Ungar: use a pair of semi-spaces and a separate, dedicated creation space
  • The creation space is emptied and reused every cycle, while the semi-spaces alternate roles as destination
  • The result: only a small part of the semi-spaces is touched, and new objects are created in a hot space in main memory (and maybe in cache) (sketched below)
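A hedged sketch of the resulting minor-collection cycle (helpers and layout are illustrative, not Ungar's actual code): survivors are copied out, the creation space is emptied for reuse, and the semi-spaces swap roles:

```c
#include <stddef.h>
#include <stdint.h>

struct space { uint8_t *base; size_t size, used; };

static struct space creation;  /* refilled by allocation every cycle  */
static struct space semi[2];   /* the pair of alternating semi-spaces */
static int to_idx;             /* which semi-space is "to" this cycle */

/* Hypothetical helper: copy the live objects of src into dst. */
static void copy_survivors(struct space *src, struct space *dst) {
    (void)src; (void)dst;  /* Cheney-style copying elided */
}

static void minor_gc(void) {
    struct space *to   = &semi[to_idx];
    struct space *from = &semi[1 - to_idx];
    copy_survivors(&creation, to);  /* new objects that survived          */
    copy_survivors(from, to);       /* older survivors not yet promoted   */
    creation.used = 0;              /* creation space reused every cycle, */
    from->used    = 0;              /* so it stays "hot"                  */
    to_idx = 1 - to_idx;            /* semi-spaces swap destination roles */
}
```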

30
Cache Revised
  • Cache misses can be categorized into:
  • Capacity misses: the miss will occur no matter what cache organization is used
  • Conflict misses: a miss occurs because two (or more) addresses map to the same cache line (set)
  • Direct-mapped caches suffer from conflict misses the most: every miss evicts the block with the same mapping

31
Conflict Misses in-depth
  • The conflict miss rate is roughly a min function: about 2 × min(f1, f2) for two conflicting addresses accessed at frequencies f1 and f2
  • Example: both addresses map to the same line. The first is accessed every ms, the second every µs. The (double) miss occurs every ms
  • The rate depends on the usage frequency of the addresses not in cache

32
Minimizing the Conflict Miss Rate
  • Most non-GC systems are skewed: a few objects are used frequently, the rest little. If placed well, the cache is efficient
  • If many blocks are accessed on an intermediate time scale: more misses, and more chances they will interfere with each other
  • (Over-simplified to help understanding)

33
Example
  • A program marches through memory while doing normal activity. We use a 16 KB cache
  • 2-way associative: the most frequently used blocks are not touched
  • Direct mapped: a total flush every cycle
  • Conclusion: in DM the march takes twice as long to wrap around onto a given block, but the result is painful (a flush)
  • DM can't handle multiple access patterns (toy sketch below)
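A toy C sketch of this access pattern (sizes hypothetical): a linear march interleaved with accesses to one hot block. In a direct-mapped cache the march periodically lands on the hot block's line and flushes it; a 2-way cache can keep the hot block in its second way:

```c
#include <stdint.h>

#define MARCH_SIZE (1 << 20)    /* 1 MB region being marched through */

static uint8_t big[MARCH_SIZE]; /* the marched memory          */
static uint8_t hot[64];         /* frequently used working set */

volatile uint8_t sink;          /* keep reads from being optimized away */

void run(void) {
    for (uint32_t i = 0; i < MARCH_SIZE; i++) {
        sink = big[i];          /* linear march: touches every line once */
        sink = hot[i % 64];     /* "normal activity" on the hot block    */
        /* Direct mapped: whenever big[i] maps to hot's cache line, hot
           is evicted and the next hot access misses, on every pass.
           2-way associative: hot can live in the second way, untouched. */
    }
}
```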

34
Experiments
  • Instrumented Scheme compiler with integrated
    cache simulator
  • Executes millions of instructions, allocates MBs
  • We'll present 2 programs:
  • Scheme compiler
  • Boyer benchmark: objects live long and tend to be promoted

35
Experiments contd.
  • Cache lines are 16 bytes wide
  • 3 collectors:
  • 1: GGC with 2 MB spaces per generation; no promotion is ever done
  • 2: GGC with 141 KB spaces per generation
  • 3: collector 2 plus a 141 KB creation space (Ungar)

36
Results (Capacity)
37
Interpretation
  • LRU queue distance distribution
  • What does it mean?
  • The probability of a block being touched at different points in the LRU queue
  • The probability of a block being touched given how long since it was last touched
  • The probability of a block being touched given how many other blocks have been touched more recently
38
Interpretation contd.
  • Fourth queue position: 128 KB
  • Eighth queue position: 256 KB
  • For any given position: the area under the curve to its left represents cache hits, to its right, misses
  • The curve's height at a point: the marginal increase in hits from enlarging the cache at that point

39
Experiment Meaning
  • The first entries absorb most hits
  • Collector 1:
  • A dramatic drop
  • Beyond about the tenth position (320 KB): no more cache is needed
  • Collectors 2-3:
  • A hump peaking when memory starts recycling

40
Experiment Meaning contd.
  • Collector 2: recycling after 141×2 KB; a cache of 300-400 KB should suffice
  • Collector 3: the creation space is constantly recycled and only a small part of the other spaces is touched; a cache of 200-300 KB should suffice

41
Experiment Meaning contd. 2
  • Boyer behaves differently
  • Collector 3 is better than collector 2 by 30%
  • Capacity misses disappear if the cache is larger than the youngest generation

42
Results (Collision)
43
Interpretation
  • The graph plots cache size vs. miss rate
  • Shows results only for collector 3

44
Experiment Meaning
  • Associative caches show a dramatic, almost linear drop down to 256 KB (which contains the youngest generation). From then on, nothing interesting
  • Direct mapped: the same on the 16-90 KB interval, better on 90-135 KB, much worse later on

45
Experiment Meaning contd.
  • Why is DM better in that interval?
  • The cache is big enough to hold the creation area, yet suffers interference from other blocks
  • The associative cache evicts blocks before they are used, due to collisions
  • Later, the associative cache suffers only re-fill misses, while DM also suffers collisions

46
More Performance Notes
  • When the cache is too small, most evicted blocks are dirty and require expensive writebacks
  • Interference may also cause writebacks

47
Conclusions
  • Caches are an important part of modern computers
  • Garbage collectors reuse memory in cycles, often marching through memory
  • LRU evicts dirty pages/cache lines; needless fetches are costly
  • GGC reuses a smaller area and reduces paging

48
Conclusions contd.
  • A similar idea for caches: hold the youngest generation entirely
  • Ungar's 3-space proposition reduces the required footprint by 30%
  • Excluding a small interval, associative caches perform better than direct-mapped ones, which suffer collision misses

49
Questions?
50
The End