Hoard: A Scalable Memory Allocator for Multithreaded Applications

Transcript and Presenter's Notes


1
Hoard: A Scalable Memory Allocator for Multithreaded Applications
Emery Berger, Kathryn McKinley, Robert Blumofe, Paul Wilson
2
Motivation
  • Parallel multithreaded programs becoming
    prevalent
  • web servers, search engines, database managers,
    etc.
  • run on SMPs for high performance
  • often embarrassingly parallel
  • Memory allocation is a bottleneck
  • prevents scaling with number of processors

3
Assessment Criteria for Multiprocessor Allocators
  • Speed
  • competitive with uniprocessor allocators on one
    processor
  • Scalability
  • performance linear with the number of processors
  • Fragmentation (= max allocated / max in use)
  • competitive with uniprocessor allocators
  • worst-case and average-case

4
Uniprocessor Allocators on Multiprocessors
  • Fragmentation: Excellent
  • Very low for most programs [Wilson & Johnstone]
  • Speed & Scalability: Poor
  • Heap contention
  • a single lock protects the heap
  • Can exacerbate false sharing
  • different processors can share cache lines

5
Allocator-Induced False Sharing
  • Allocators cause false sharing!
  • Cache lines can end up spread across a number of
    processors
  • Practically all allocators do this

[Diagram: processor 1 calls x1 = malloc(s) and processor 2 calls x2 = malloc(s); both objects land on the same cache line, so the two processors thrash that line. A code sketch follows.]
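The effect is easy to reproduce. Below is a minimal sketch (not from the slides): each thread writes only to its own small malloc'd block, yet if the allocator packs the two blocks into one cache line, every write contends.

// Minimal sketch (not from the slides). Each thread writes only to its own
// block, but if malloc packs x1 and x2 into one cache line, the two
// processors invalidate each other's copy of that line on every write.
#include <cstdlib>
#include <thread>

static void worker (char * p) {
    for (long i = 0; i < 100000000L; ++i)
        *p = (char) i;                     // private write, shared cache line
}

int main () {
    const size_t s = 8;                    // much smaller than a cache line
    char * x1 = (char *) std::malloc (s);
    char * x2 = (char *) std::malloc (s);  // often adjacent to x1
    std::thread t1 (worker, x1);
    std::thread t2 (worker, x2);
    t1.join (); t2.join ();
    std::free (x1); std::free (x2);
    return 0;
}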
6
Existing Multiprocessor Allocators
  • Speed
  • One concurrent heap (e.g., concurrent B-tree)
    too expensive
  • too many locks/atomic updates
  • O(log n) cost per memory operation
  • ⇒ Fast allocators use multiple heaps
  • Scalability
  • Allocator-induced false sharing and other
    bottlenecks
  • Fragmentation: P-fold increase or even unbounded

7
Multiprocessor Allocator I: Pure Private Heaps
  • Pure private heaps: one heap per processor.
  • malloc gets memory from the processor's heap or
    the system
  • free puts memory on the processor's heap
  • Avoids heap contention
  • Examples: STL, ad hoc (e.g., Cilk 4.1)

[Diagram: each processor mallocs and frees from its own heap (x1 through x4); a block allocated by heap 1 can end up free on heap 2. The policy is sketched in code below.]
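A rough sketch of this policy (illustrative only; MAX_PROCESSORS, myProcessor(), and getFromSystem() are stand-ins invented for the example): malloc reuses blocks from the caller's own heap, and free pushes blocks onto the caller's own heap no matter where they were allocated.

// Illustrative sketch of pure private heaps, assuming a single small object
// size (at least sizeof(Block)). myProcessor() and getFromSystem() are
// placeholders, not real APIs; no locking is needed because each heap is
// touched only by its own processor.
#include <cstddef>
#include <cstdlib>

const int MAX_PROCESSORS = 64;

struct Block       { Block * next; };
struct PrivateHeap { Block * freeList = nullptr; };

static PrivateHeap heaps[MAX_PROCESSORS];

static int    myProcessor ()           { return 0; }               // stand-in
static void * getFromSystem (size_t n) { return std::malloc (n); } // stand-in

void * pp_malloc (size_t sz) {
    PrivateHeap & h = heaps[myProcessor ()];
    if (h.freeList) {                       // reuse a block from MY heap
        Block * b = h.freeList;
        h.freeList = b->next;
        return b;
    }
    return getFromSystem (sz);              // otherwise grow from the system
}

void pp_free (void * p) {
    PrivateHeap & h = heaps[myProcessor ()];
    Block * b = (Block *) p;                // the block lands on the FREEING
    b->next = h.freeList;                   // processor's heap, no matter
    h.freeList = b;                         // which heap allocated it
}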
8
How to Break Pure Private Heaps: Fragmentation
  • Pure private heaps
  • memory consumption can grow without bound!
  • Producer-consumer
  • processor 1 allocates
  • processor 2 frees

[Diagram: processor 1 repeatedly calls x1, x2, x3 = malloc(s); processor 2 calls the matching free(x1), free(x2), free(x3); the freed memory accumulates on heap 2 while heap 1 keeps growing. Continued in the sketch below.]
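Continuing the pp_malloc / pp_free sketch above, the producer-consumer pattern looks roughly like this (the one-slot handoff stands in for a real queue):

// Processor 1 only allocates; processor 2 only frees. Freed blocks join
// heap 2's free list, which pp_malloc on processor 1 never consults, so
// processor 1 keeps calling getFromSystem() and memory grows without bound.
#include <atomic>

static std::atomic<void *> slot { nullptr };

void producerLoop () {                       // imagine this pinned to processor 1
    for (;;) {
        void * x = pp_malloc (64);           // heap 1's free list stays empty
        while (slot.load () != nullptr) { }  // wait for the consumer
        slot.store (x);
    }
}

void consumerLoop () {                       // imagine this pinned to processor 2
    for (;;) {
        void * x = slot.exchange (nullptr);
        if (x) pp_free (x);                  // blocks pile up on heap 2, unused
    }
}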
9
Multiprocessor Allocator II: Private Heaps with Ownership
  • Private heaps with ownership: free puts memory
    back on the originating processor's heap.
  • Avoids unbounded memory consumption
  • Examples: ptmalloc [Gloger], LKmalloc [Larson &
    Krishnan]

[Diagram: x1 and x2 are malloc'd and freed across processors 1 and 2; each free returns the block to the heap that allocated it (sketched in code below).]
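A sketch of the ownership variant, continuing the earlier example (illustrative only; ptmalloc and LKmalloc keep this information differently): each block records the heap it came from, and free returns it there.

// Continues the sketch above (same stand-ins, single object size): blocks
// carry a small header naming their owner heap, and free pushes them back
// onto that heap. A real allocator would take the owner heap's lock here.
struct OwnedBlock { int owner; OwnedBlock * next; };
struct OwnerHeap  { OwnedBlock * freeList = nullptr; };

static OwnerHeap ownedHeaps[MAX_PROCESSORS];

void * po_malloc (size_t sz) {
    int me = myProcessor ();
    OwnedBlock * b = ownedHeaps[me].freeList;
    if (b) ownedHeaps[me].freeList = b->next;     // reuse from my own heap
    else   b = (OwnedBlock *) getFromSystem (sizeof (OwnedBlock) + sz);
    b->owner = me;
    return b + 1;                                 // usable memory after the header
}

void po_free (void * p) {
    OwnedBlock * b = (OwnedBlock *) p - 1;
    OwnerHeap & h = ownedHeaps[b->owner];         // the ORIGINATING heap
    b->next = h.freeList;
    h.freeList = b;
}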
10
How to Break Private Heaps with Ownership: Fragmentation
  • Private heaps with ownership: memory consumption
    can blow up by a factor of P.
  • Round-robin producer-consumer
  • processor i allocates
  • processor i+1 frees
  • This really happens (NDS).

[Diagram: processor 1 runs x1 = malloc(s), processor 2 runs free(x1) and x2 = malloc(s), processor 3 runs free(x2) and x3 = malloc(s), and so on round-robin.]
11
So What Do We Do Now?
12
The Hoard Multiprocessor Memory Allocator
  • Manages memory in page-sized superblocks of
    same-sized objects
  • - Avoids false sharing by not carving up cache
    lines
  • - Avoids heap contention - local heaps allocate &
    free small blocks from their set of superblocks
  • Adds a global heap that is a repository of
    superblocks
  • When the fraction of free memory exceeds the
    empty fraction, moves superblocks to the global
    heap
  • - Avoids blowup in memory consumption
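A rough sketch of the bookkeeping these bullets imply follows (the field names are invented for illustration; this is not Hoard's source):

// Illustrative structures only; real Hoard also tracks size-class bins,
// fullness groups, and per-heap statistics omitted here.
#include <cstddef>

struct Superblock {                  // one page-sized chunk of equal objects
    size_t       objectSize;         // the size class this superblock serves
    int          inUse;              // objects currently allocated from it
    int          capacity;           // objects it can hold in total
    void *       freeObjects;        // free objects inside the superblock
    Superblock * next;
};

struct LocalHeap {                   // one per processor
    Superblock * superblocks;        // superblocks this heap currently owns
    size_t       allocated;          // bytes handed out to the program
    size_t       held;               // bytes held in its superblocks
};

struct GlobalHeap {                  // repository of reusable superblocks
    Superblock * superblocks;
};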

13
Hoard Example
  • Hoard: one heap per processor + a global heap
  • malloc gets memory from a superblock on its heap.
  • free returns memory to its superblock. If the
    heap is too empty, it moves a superblock to the
    global heap.

[Diagram: processor 1's heap and the global heap; x1 = malloc(s), some mallocs, some frees, free(x7); empty fraction = 1/3. The policy is sketched in code below.]
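Using the structures sketched under slide 12, the free-side policy described above might look roughly like this (a sketch, not Hoard's code; for simplicity it only moves a completely empty superblock):

const double EMPTY_FRACTION = 1.0 / 3.0;       // the example above uses 1/3

void hoardFreeSketch (LocalHeap & h, GlobalHeap & g,
                      Superblock * sb, void * p) {
    *(void **) p = sb->freeObjects;            // return the object to its
    sb->freeObjects = p;                       // superblock's free list
    sb->inUse   -= 1;
    h.allocated -= sb->objectSize;

    // Heap "too empty": more than the empty fraction of its memory is idle.
    if (h.allocated < (1.0 - EMPTY_FRACTION) * h.held && sb->inUse == 0) {
        Superblock ** link = &h.superblocks;   // unlink sb from the local heap
        while (*link && *link != sb) link = &(*link)->next;
        if (*link) *link = sb->next;
        h.held -= (size_t) sb->capacity * sb->objectSize;

        sb->next = g.superblocks;              // hand it to the global heap so
        g.superblocks = sb;                    // other processors can reuse it
    }
}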
14
Summary of Analytical Results
  • Worst-case memory consumption
  • O(n log M/m + P) instead of O(P n log M/m)
  • n = memory required
  • M = biggest object size
  • m = smallest object size
  • P = number of processors
  • Best possible: O(n log M/m) [Robson]
  • Provably low synchronization in most cases

15
Performance: threadtest
speedup(x, P) = runtime(Solaris allocator, one
processor) / runtime(x on P processors)
16
Performance: Larson
Server-style benchmark with sharing
17
Performance: false sharing
Each thread reads & writes heap data
18
Hoard Conclusions
  • Speed: Excellent
  • As fast as a uniprocessor allocator on one
    processor
  • amortized O(1) cost
  • 1 lock for malloc, 2 for free
  • Scalability: Excellent
  • Scales linearly with the number of processors
  • Avoids false sharing
  • Fragmentation: Very good
  • Worst-case is provably close to ideal
  • Actual observed fragmentation is low

19
Even Faster Allocation
  • Custom allocators can be very fast
  • Linked lists of objects for highly-used classes
  • Region (arena, zone) allocators
  • Best practices: [Meyers 1995], [Bulka 2001]
  • Used in 3 SPEC2000 benchmarks (parser, gcc, vpr),
    Apache, PGP, SQLServer, etc.
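The "linked lists of objects" pattern is roughly the following; this is a generic sketch of the idiom described by Meyers and Bulka, not code from any of the programs listed:

// Class-specific allocator: freed Nodes go on a class-wide free list and
// are handed back on the next new, bypassing malloc/free entirely.
#include <cstddef>
#include <new>

class Node {
public:
    static void * operator new (std::size_t sz) {
        if (sz == sizeof (Node) && freeList) {   // reuse a recycled Node
            void * p = freeList;
            freeList = freeList->next;
            return p;
        }
        return ::operator new (sz);              // first use, or derived class
    }
    static void operator delete (void * p, std::size_t sz) {
        if (sz != sizeof (Node)) { ::operator delete (p); return; }
        Node * n = static_cast<Node *> (p);      // recycle the storage by
        n->next = freeList;                      // threading it onto the list
        freeList = n;
    }
private:
    Node * next = nullptr;                       // doubles as the list link
    static Node * freeList;
};
Node * Node::freeList = nullptr;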

20
Custom Allocators Work
  • Using a custom allocator reduces runtime by 60%

21
Problems with Current Practice
  • Brittle code
  • written from scratch
  • macros/monolithic functions to avoid overhead
  • hard to write, reuse or maintain
  • Excessive fragmentation
  • good memory allocators: complicated, not reusable

22
Allocator Conceptual Design
  • People think & talk about heaps as if they were
    modular:

[Diagram: malloc and free enter at "Select heap based on size", which dispatches to "Manage small objects" or "Manage large objects", both backed by the system memory manager.]
23
Infrastructure Requirements
  • Flexible
  • can add functionality
  • Reusable
  • in other contexts in same program
  • Fast
  • very low or no overhead
  • High-level
  • as component-like as possible

24
Ordinary Classes vs. Mixins
  • Ordinary classes
  • fixed inheritance DAG
  • can't rearrange hierarchy
  • can't use a class multiple times

  • Mixins
  • no fixed inheritance DAG
  • multiple hierarchies possible
  • can reuse classes multiple times
  • fast static dispatch
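A small, self-contained illustration of the mixin style (Counting, Logging, and BaseWidget are names made up for this example): the parent class is a template parameter, so the same class can be layered into different hierarchies, or used twice in one, with every call resolved statically.

// The parent is a template parameter ("mixin"), so the same class can be
// reused in different hierarchies, and all calls dispatch statically.
#include <cstdio>

struct BaseWidget {
    void poke () { std::printf ("poked\n"); }
};

template <class Super>
struct Counting : public Super {        // mixin: adds a counter to any parent
    int count = 0;
    void poke () { ++count; Super::poke (); }
};

template <class Super>
struct Logging : public Super {         // mixin: adds a log line to any parent
    void poke () { std::printf ("poke\n"); Super::poke (); }
};

// Two different hierarchies built from the same pieces:
using A = Counting< Logging<BaseWidget> >;
using B = Logging< Counting<BaseWidget> >;

int main () {
    A a; a.poke ();     // counts, then logs, then pokes
    B b; b.poke ();     // logs, then counts, then pokes
    return 0;
}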
25
A Heap Layer
Provides malloc and free methods. Top heaps get
memory from the system; e.g., mallocHeap uses the C
library's malloc and free.
template <class SuperHeap>
class HeapLayer : public SuperHeap {
public:
  void * malloc (size_t sz) {
    // do something
    void * p = SuperHeap::malloc (sz);
    // do something else
    return p;
  }
};
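A plausible minimal "top" heap in this style, consistent with the slide's description (a sketch, not necessarily the real mallocHeap source):

// The top of a layer stack gets memory from the system; this version just
// forwards to the C library's malloc and free.
#include <cstddef>
#include <cstdlib>

class mallocHeap {
public:
    void * malloc (std::size_t sz) { return std::malloc (sz); }
    void   free   (void * p)       { std::free (p); }
};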
26
Example: Thread-safety
  • LockedHeap
  • protects the parent heap with a single lock

class LockedMallocHeap :
  public LockedHeap<mallocHeap> {};

// Inside LockedHeap:
void * malloc (size_t sz) {
  // acquire lock
  void * p = SuperHeap::malloc (sz);
  // release lock
  return p;
}
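A possible implementation of the LockedHeap layer itself, as an assumed sketch: std::mutex stands in for whatever lock type the original library uses.

// Sketch of a LockedHeap layer: every malloc and free on the parent heap
// happens under one lock.
#include <cstddef>
#include <mutex>

template <class SuperHeap>
class LockedHeap : public SuperHeap {
public:
    void * malloc (std::size_t sz) {
        std::lock_guard<std::mutex> guard (lock_);   // acquire lock
        return SuperHeap::malloc (sz);               // lock released on return
    }
    void free (void * p) {
        std::lock_guard<std::mutex> guard (lock_);
        SuperHeap::free (p);
    }
private:
    std::mutex lock_;
};

// Composed exactly as on the slide:
// class LockedMallocHeap : public LockedHeap<mallocHeap> {};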
27
Example: Debugging
  • DebugHeap
  • Protects against invalid & multiple frees.

class LockedDebugMallocHeap :
  public LockedHeap< DebugHeap<mallocHeap> > {};

// Inside DebugHeap:
void free (void * p) {
  // check that p is valid
  // check that p hasn't been freed before
  SuperHeap::free (p);
}
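A possible DebugHeap layer, again an assumed sketch rather than the library's own code: it records live allocations so free can reject pointers that were never allocated or were already freed.

// Assumed sketch of a DebugHeap layer: remember live allocations in a set
// so free can detect invalid and double frees.
#include <cassert>
#include <cstddef>
#include <set>

template <class SuperHeap>
class DebugHeap : public SuperHeap {
public:
    void * malloc (std::size_t sz) {
        void * p = SuperHeap::malloc (sz);
        if (p) live_.insert (p);                 // remember the allocation
        return p;
    }
    void free (void * p) {
        // check that p is valid and hasn't been freed before
        assert (live_.count (p) == 1 && "invalid or double free");
        live_.erase (p);
        SuperHeap::free (p);
    }
private:
    std::set<void *> live_;
};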
28
Implementation in Heap Layers
  • Modular design and implementation

[Diagram: a layered allocator built from FreelistHeap (manage objects on a freelist), SizeHeap (add size info to objects), and SegHeap (select heap based on size), with malloc and free entering at the top. A composition sketch follows.]
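Layers like these compose by template nesting, just as in the earlier examples. The sketch below wires a simplified freelist layer over the locking and malloc layers sketched earlier; SimpleFreelistHeap is a made-up stand-in, and the real FreelistHeap, SizeHeap, and SegHeap take different template parameters.

// A simplified fixed-size freelist layer, stacked over LockedHeap<mallocHeap>.
#include <cstddef>

template <class SuperHeap, std::size_t BlockSize>
class SimpleFreelistHeap : public SuperHeap {
public:
    void * malloc (std::size_t sz) {
        if (sz > BlockSize) return nullptr;    // this layer serves small blocks only
        if (head_) {                           // reuse a recycled block
            void * p = head_;
            head_ = *(void **) head_;
            return p;
        }
        return SuperHeap::malloc (BlockSize);  // every block is BlockSize bytes
    }
    void free (void * p) {
        *(void **) p = head_;                  // keep the block in this layer
        head_ = p;                             // instead of freeing it downstream
    }
private:
    void * head_ = nullptr;
};

// Freelist recycling on top of locking on top of the C library:
typedef SimpleFreelistHeap< LockedHeap<mallocHeap>, 64 > SmallObjectHeap;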
29
Experimental Results: Custom Allocation (gcc)
30
Experimental Results: General-Purpose Allocators
31
Conclusion
  • Heap Layers: infrastructure for composing
    allocators
  • Useful experimental infrastructure
  • Allows rapid implementation of high-quality
    allocators
  • custom allocators as fast as originals
  • general-purpose allocators comparable to
    state-of-the-art in speed and efficiency