Title: Hoard: A Scalable Memory Allocator for Multithreaded Applications
1. Hoard: A Scalable Memory Allocator for Multithreaded Applications
Emery Berger, Kathryn McKinley
Robert Blumofe, Paul Wilson
2. Motivation
- Parallel multithreaded programs are becoming prevalent
  - web servers, search engines, database managers, etc.
  - run on SMPs for high performance
  - often embarrassingly parallel
- Memory allocation is a bottleneck
  - prevents scaling with the number of processors
3. Assessment Criteria for Multiprocessor Allocators
- Speed
  - competitive with uniprocessor allocators on one processor
- Scalability
  - performance linear with the number of processors
- Fragmentation (= max allocated / max in use)
  - competitive with uniprocessor allocators
  - worst-case and average-case
4. Uniprocessor Allocators on Multiprocessors
- Fragmentation: Excellent
  - Very low for most programs [Wilson & Johnstone]
- Speed & Scalability: Poor
  - Heap contention
    - a single lock protects the heap
  - Can exacerbate false sharing
    - different processors can share cache lines
5. Allocator-Induced False Sharing
- Allocators cause false sharing!
  - Cache lines can end up spread across a number of processors
  - Practically all allocators do this
[Figure: processor 1 calls x1 = malloc(s) and processor 2 calls x2 = malloc(s); both objects fall in a single cache line, which thrashes between the two processors.]
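The layout problem on this slide can be sketched in a few lines. This is a minimal illustration, not Hoard's code: the 64-byte line size (`kCacheLine`) and the type names `Packed`, `PaddedInt`, and `Separate` are assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed cache-line size; real hardware varies (64 bytes is common).
constexpr std::size_t kCacheLine = 64;

// Index of the cache line that holds address p.
inline std::uintptr_t cache_line_of(const void* p) {
  return reinterpret_cast<std::uintptr_t>(p) / kCacheLine;
}

inline bool share_line(const void* x, const void* y) {
  return cache_line_of(x) == cache_line_of(y);
}

// Two small objects carved out of one cache line, as a general-purpose
// allocator might hand them to two different threads.
struct alignas(kCacheLine) Packed {
  int a;  // written by thread 1
  int b;  // written by thread 2
};

// One object per cache line: writes by different threads no longer collide.
struct alignas(kCacheLine) PaddedInt {
  int value;
};

struct Separate {
  PaddedInt a;
  PaddedInt b;
};

// Hypothetical check of both layouts.
inline void check_layouts() {
  Packed packed{};
  Separate separate{};
  assert(share_line(&packed.a, &packed.b));                   // false sharing possible
  assert(!share_line(&separate.a.value, &separate.b.value));  // lines are disjoint
}
```

Padding every object wastes space, which is one reason an allocator would rather avoid splitting a cache line between processors in the first place.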
6. Existing Multiprocessor Allocators
- Speed
  - One concurrent heap (e.g., concurrent B-tree): too expensive
    - too many locks/atomic updates
    - O(log n) cost per memory operation
  - Therefore, fast allocators use multiple heaps
- Scalability
  - Allocator-induced false sharing and other bottlenecks
- Fragmentation: P-fold increase or even unbounded
7. Multiprocessor Allocator I: Pure Private Heaps
- Pure private heaps: one heap per processor.
  - malloc gets memory from the processor's heap or the system
  - free puts memory on the processor's heap
- Avoids heap contention
  - Examples: STL, ad hoc (e.g., Cilk 4.1)
[Figure: timeline of operations on processors 1 and 2: x1 = malloc(s), x2 = malloc(s), free(x1), free(x2), x4 = malloc(s), x3 = malloc(s). Legend: allocated by heap 1; free, on heap 2.]
8. How to Break Pure Private Heaps: Fragmentation
- Pure private heaps:
  - memory consumption can grow without bound!
- Producer-consumer:
  - processor 1 allocates
  - processor 2 frees
[Figure: processor 1 runs x1 = malloc(s), x2 = malloc(s), x3 = malloc(s); processor 2 runs free(x1), free(x2), free(x3).]
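The unbounded growth can be reproduced with a toy model that only counts equal-sized blocks. The type `PurePrivateHeaps` and the function `producer_consumer` are hypothetical names for this sketch, and processors are numbered from 0 here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy model: count equal-sized blocks instead of managing real memory.
struct PurePrivateHeaps {
  std::vector<std::size_t> free_blocks;  // free blocks held by each heap
  std::size_t from_system = 0;           // blocks ever obtained from the system

  explicit PurePrivateHeaps(std::size_t processors)
      : free_blocks(processors, 0) {}

  // malloc on processor p: reuse a local free block, else ask the system.
  void malloc_on(std::size_t p) {
    assert(p < free_blocks.size());
    if (free_blocks[p] > 0) {
      --free_blocks[p];
    } else {
      ++from_system;
    }
  }

  // free on processor p: the block lands on p's heap, whoever allocated it.
  void free_on(std::size_t p) {
    assert(p < free_blocks.size());
    ++free_blocks[p];
  }
};

// Producer-consumer: processor 0 allocates one block, processor 1 frees it.
// Returns how many blocks were requested from the system.
inline std::size_t producer_consumer(std::size_t rounds) {
  PurePrivateHeaps heaps(2);
  for (std::size_t i = 0; i < rounds; ++i) {
    heaps.malloc_on(0);
    heaps.free_on(1);
  }
  return heaps.from_system;
}
```

In this model at most one block is live at any time, yet every round takes a fresh block from the system because the freed blocks pile up on the consumer's heap.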
9. Multiprocessor Allocator II: Private Heaps with Ownership
- Private heaps with ownership: free puts memory back on the originating processor's heap.
- Avoids unbounded memory consumption
  - Examples: ptmalloc [Gloger], LKmalloc [Larson & Krishnan]
[Figure: timeline of operations on processors 1 and 2: x1 = malloc(s), free(x1), x2 = malloc(s), free(x2).]
10. How to Break Private Heaps with Ownership: Fragmentation
- Private heaps with ownership: memory consumption can blow up by a factor of P.
- Round-robin producer-consumer:
  - processor i allocates
  - processor i+1 frees
- This really happens (NDS).
[Figure: timeline of operations on processors 1, 2, and 3: x1 = malloc(s), free(x1), x2 = malloc(s), free(x2), x3 = malloc(s), free(x3).]
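A toy model shows where the factor of P comes from. It assumes each producer allocates a batch of k equal-sized blocks before the next processor frees them; `OwnershipHeaps`, `round_robin`, and the batch size are hypothetical details of this sketch, and processors are numbered from 0.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy model: count equal-sized blocks instead of managing real memory.
struct OwnershipHeaps {
  std::vector<std::size_t> free_blocks;  // free blocks held by each heap
  std::size_t from_system = 0;           // blocks ever obtained from the system

  explicit OwnershipHeaps(std::size_t processors)
      : free_blocks(processors, 0) {}

  // malloc on processor p; returns the owning heap, which is always p.
  std::size_t malloc_on(std::size_t p) {
    assert(p < free_blocks.size());
    if (free_blocks[p] > 0) {
      --free_blocks[p];
    } else {
      ++from_system;
    }
    return p;
  }

  // free by any processor: the block returns to the heap that owns it.
  void free_block(std::size_t owner) {
    assert(owner < free_blocks.size());
    ++free_blocks[owner];
  }
};

// Round-robin producer-consumer: in each round one processor allocates k
// blocks and the next processor frees them, so at most k blocks are live.
inline std::size_t round_robin(std::size_t processors, std::size_t k,
                               std::size_t rounds) {
  OwnershipHeaps heaps(processors);
  for (std::size_t r = 0; r < rounds; ++r) {
    const std::size_t producer = r % processors;
    std::vector<std::size_t> owners;
    for (std::size_t i = 0; i < k; ++i) {
      owners.push_back(heaps.malloc_on(producer));
    }
    // The consumer, processor (producer + 1) % processors, frees the batch.
    for (std::size_t owner : owners) {
      heaps.free_block(owner);
    }
  }
  return heaps.from_system;
}
```

The program never needs more than k blocks, but each of the P heaps ends up holding its own k, and none of them can serve another processor's requests. Consumption is bounded, unlike pure private heaps, but the bound is P times the requirement.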
11. So What Do We Do Now?
12. The Hoard Multiprocessor Memory Allocator
- Manages memory in page-sized superblocks of same-sized objects
  - Avoids false sharing by not carving up cache lines
  - Avoids heap contention: local heaps allocate and free small blocks from their set of superblocks
- Adds a global heap that is a repository of superblocks
- When the fraction of free memory exceeds the empty fraction, moves superblocks to the global heap
  - Avoids blowup in memory consumption
13. Hoard Example
- Hoard: one heap per processor + a global heap
  - malloc gets memory from a superblock on its heap.
  - free returns memory to its superblock. If the heap is too empty, it moves a superblock to the global heap.
[Figure: the heap of processor 1 and the global heap, with empty fraction = 1/3. Steps shown: x1 = malloc(s), some mallocs, some frees, free(x7).]
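The emptiness rule can be sketched as follows. This models only the rule as stated on the slides, for a single processor heap. The names `Superblock` and `ToyHoardHeap`, the choice to move the emptiest superblock, and the integer representation of the empty fraction are assumptions of the sketch, not details taken from Hoard's implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct Superblock {
  std::size_t capacity;  // objects the superblock can hold
  std::size_t in_use;    // objects currently allocated from it
};

struct ToyHoardHeap {
  // Empty fraction as a ratio (1/3 here), kept in integers to avoid rounding.
  std::size_t f_num = 1;
  std::size_t f_den = 3;
  std::vector<Superblock> local;   // one processor's superblocks
  std::vector<Superblock> global;  // superblocks parked on the global heap

  // Free one object from local superblock i, then apply the emptiness rule.
  void free_object(std::size_t i) {
    assert(i < local.size() && local[i].in_use > 0);
    --local[i].in_use;

    std::size_t allocated = 0;
    std::size_t used = 0;
    for (const Superblock& s : local) {
      allocated += s.capacity;
      used += s.in_use;
    }

    // (allocated - used) / allocated > f_num / f_den, cross-multiplied.
    if ((allocated - used) * f_den > f_num * allocated) {
      auto emptiest = std::min_element(
          local.begin(), local.end(),
          [](const Superblock& a, const Superblock& b) {
            return a.in_use < b.in_use;
          });
      global.push_back(*emptiest);  // hand the superblock to the global heap
      local.erase(emptiest);
    }
  }
};
```

With three superblocks of four objects each and nine objects in use, one free leaves exactly one third of the heap empty and nothing moves; the next free pushes the free fraction above one third, and the emptiest superblock goes to the global heap where another processor could reuse it.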
14. Summary of Analytical Results
- Worst-case memory consumption:
  - O(n log(M/m) + P) instead of O(P n log(M/m))
    - n = memory required
    - M = biggest object size
    - m = smallest object size
    - P = number of processors
  - Best possible: O(n log(M/m)) [Robson]
- Provably low synchronization in most cases
15. Performance: threadtest
speedup(x, P) = runtime(Solaris allocator, one processor) / runtime(x on P processors)
16. Performance: Larson
Server-style benchmark with sharing
17. Performance: false sharing
Each thread reads & writes heap data
18. Hoard Conclusions
- Speed: Excellent
  - As fast as a uniprocessor allocator on one processor
    - amortized O(1) cost
    - 1 lock for malloc, 2 for free
- Scalability: Excellent
  - Scales linearly with the number of processors
  - Avoids false sharing
- Fragmentation: Very good
  - Worst-case is provably close to ideal
  - Actual observed fragmentation is low
19. Even Faster Allocation
- Custom allocators can be very fast
  - Linked lists of objects for highly-used classes
  - Region (arena, zone) allocators
- Best practices [Meyers 1995, Bulka 2001]
  - Used in 3 SPEC2000 benchmarks (parser, gcc, vpr), Apache, PGP, SQLServer, etc.
20. Custom Allocators Work
- Using a custom allocator reduces runtime by 60%
21. Problems with Current Practice
- Brittle code
  - written from scratch
  - macros/monolithic functions to avoid overhead
  - hard to write, reuse or maintain
- Excessive fragmentation
  - good memory allocators: complicated, not reusable
22. Allocator Conceptual Design
- People think & talk about heaps as if they were modular
[Figure: layered heap diagram with boxes labeled: malloc / free; Select heap based on size; Manage small objects; Manage large objects; System memory manager.]
23. Infrastructure Requirements
- Flexible
  - can add functionality
- Reusable
  - in other contexts and in the same program
- Fast
  - very low or no overhead
- High-level
  - as component-like as possible
24. Ordinary Classes vs. Mixins
- Ordinary classes
  - fixed inheritance DAG
  - can't rearrange hierarchy
  - can't use class multiple times
- Mixins
  - no fixed inheritance DAG
  - multiple hierarchies possible
  - can reuse classes multiple times
  - fast: static dispatch
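25. A Heap Layer
- Provides malloc and free methods
- Top heaps get memory from the system
  - e.g., mallocHeap uses the C library's malloc and free

    #include <cstddef>  // std::size_t

    // A heap layer: a mixin that wraps whatever heap it inherits from.
    template <class SuperHeap>
    class HeapLayer : public SuperHeap {
    public:
      void * malloc (std::size_t sz) {
        // do something
        void * p = SuperHeap::malloc (sz);
        // do something else
        return p;
      }
    };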
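To make the pattern concrete, here is a small runnable sketch of one top heap and one layer. `mallocHeap` is written here as a stand-in for the top heap named on the slide, and `CountingHeap` is a hypothetical layer invented for the example; neither is taken from the Heap Layers source.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Top heap (stand-in): gets memory straight from the C library.
class mallocHeap {
public:
  void* malloc(std::size_t sz) { return std::malloc(sz); }
  void free(void* p) { std::free(p); }
};

// Hypothetical layer: counts live objects, then defers to its parent heap.
template <class SuperHeap>
class CountingHeap : public SuperHeap {
public:
  void* malloc(std::size_t sz) {
    void* p = SuperHeap::malloc(sz);  // statically dispatched, no virtual call
    if (p != nullptr) {
      ++live_;
    }
    return p;
  }

  void free(void* p) {
    if (p != nullptr) {
      assert(live_ > 0);
      --live_;
    }
    SuperHeap::free(p);
  }

  std::size_t live() const { return live_; }

private:
  std::size_t live_ = 0;
};

// Composition is just template instantiation.
using CountingMallocHeap = CountingHeap<mallocHeap>;
```

Because the parent is a template parameter, the same `CountingHeap` could sit on top of any other heap that provides `malloc` and `free`, and the compiler can inline the calls through the whole stack.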
26. Example: Thread-safety
- LockedHeap
  - protects the parent heap with a single lock

    class LockedMallocHeap : public LockedHeap<mallocHeap> {};

    // LockedHeap's malloc, in outline:
    void * malloc (std::size_t sz) {
      // acquire lock
      void * p = SuperHeap::malloc (sz);
      // release lock
      return p;
    }
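A possible rendering of this layer in standard C++ is sketched below. It assumes `std::mutex` as the lock and a stand-in `mallocHeap`; the slides do not say which lock type the original layer uses.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <mutex>

// Top heap (stand-in): gets memory straight from the C library.
class mallocHeap {
public:
  void* malloc(std::size_t sz) { return std::malloc(sz); }
  void free(void* p) { std::free(p); }
};

// Serializes every call into the parent heap with one lock.
template <class SuperHeap>
class LockedHeap : public SuperHeap {
public:
  void* malloc(std::size_t sz) {
    std::lock_guard<std::mutex> guard(lock_);  // acquire lock
    return SuperHeap::malloc(sz);              // released when guard is destroyed
  }

  void free(void* p) {
    std::lock_guard<std::mutex> guard(lock_);
    SuperHeap::free(p);
  }

private:
  std::mutex lock_;
};

class LockedMallocHeap : public LockedHeap<mallocHeap> {};
```

Using a scoped guard instead of explicit acquire and release calls keeps the lock balanced on every return path. The single lock is exactly the heap-contention bottleneck discussed earlier, so this layer buys thread safety rather than scalability.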
27. Example: Debugging
- DebugHeap
  - Protects against invalid & multiple frees.

    class LockedDebugMallocHeap : public LockedHeap< DebugHeap<mallocHeap> > {};

    // DebugHeap's free, in outline:
    void free (void * p) {
      // check that p is valid
      // check that p hasn't been freed before
      SuperHeap::free (p);
    }

[Figure: LockedHeap layered on top of DebugHeap.]
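One way to implement the two checks is to remember every live pointer. The sketch below does that with a `std::unordered_set`, and its `free` returns a `bool` so that a rejected call is observable. Both choices are assumptions made for illustration, as is the stand-in `mallocHeap`; the slides do not describe how the original DebugHeap tracks objects or reports errors.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <unordered_set>

// Top heap (stand-in): gets memory straight from the C library.
class mallocHeap {
public:
  void* malloc(std::size_t sz) { return std::malloc(sz); }
  void free(void* p) { std::free(p); }
};

// Tracks live pointers so invalid and repeated frees can be rejected.
template <class SuperHeap>
class DebugHeap : public SuperHeap {
public:
  void* malloc(std::size_t sz) {
    void* p = SuperHeap::malloc(sz);
    if (p != nullptr) {
      live_.insert(p);
    }
    return p;
  }

  // Returns false, and frees nothing, if p did not come from this heap
  // or has already been freed.
  bool free(void* p) {
    if (live_.erase(p) == 0) {
      return false;
    }
    SuperHeap::free(p);
    return true;
  }

private:
  std::unordered_set<void*> live_;
};
```

A pointer is accepted only while it is in the live set, so a second `free` of the same pointer and a `free` of an address that was never allocated are both refused without touching the parent heap.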
28. Implementation in Heap Layers
- Modular design and implementation
[Figure: heap-layer diagram. FreelistHeap: manage objects on freelist. SizeHeap: add size info to objects. SegHeap: select heap based on size. Entry points: malloc, free.]
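As an illustration of the first box, a freelist layer might look like the sketch below. It assumes every request made through one instance has the same size, which is the situation when a size-selecting layer sits above it; the code is a hypothetical reconstruction with a stand-in `mallocHeap`, not the Heap Layers implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>

// Top heap (stand-in): gets memory straight from the C library.
class mallocHeap {
public:
  void* malloc(std::size_t sz) { return std::malloc(sz); }
  void free(void* p) { std::free(p); }
};

// Keeps freed objects on a singly linked list and reuses them first.
// Assumes all requests through one instance are the same size.
template <class SuperHeap>
class FreelistHeap : public SuperHeap {
public:
  FreelistHeap() = default;
  FreelistHeap(const FreelistHeap&) = delete;
  FreelistHeap& operator=(const FreelistHeap&) = delete;

  ~FreelistHeap() {
    // Hand every cached object back to the parent heap.
    while (head_ != nullptr) {
      Node* n = head_;
      head_ = n->next;
      SuperHeap::free(n);
    }
  }

  void* malloc(std::size_t sz) {
    if (head_ != nullptr) {  // fast path: pop the freelist
      Node* n = head_;
      head_ = n->next;
      return n;
    }
    // Objects must be big enough to hold the link while they are free.
    return SuperHeap::malloc(sz < sizeof(Node) ? sizeof(Node) : sz);
  }

  void free(void* p) {
    if (p == nullptr) {
      return;
    }
    head_ = new (p) Node{head_};  // push: the object itself stores the link
  }

private:
  struct Node {
    Node* next;
  };
  Node* head_ = nullptr;
};
```

Both operations on the fast path are a couple of pointer moves, which is why linked lists of same-sized objects are a common custom-allocator technique. Wrapping this layer in a size-selecting layer is what turns it into a general-purpose allocator.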
29. Experimental Results: Custom Allocation (gcc)
30. Experimental Results: General-Purpose Allocators
31. Conclusion
- Heap layers infrastructure for composing allocators
- Useful experimental infrastructure
- Allows rapid implementation of high-quality allocators
  - custom allocators as fast as originals
  - general-purpose allocators comparable to state-of-the-art in speed and efficiency