Title: Hoard: A Scalable Memory Allocator for Multithreaded Applications
1. Hoard: A Scalable Memory Allocator for Multithreaded Applications
Emery Berger, Kathryn McKinley, Robert Blumofe,
Paul Wilson
Department of Computer Sciences
Department of Computer Science
2. Motivation
- Parallel multithreaded programs are becoming prevalent
- web servers, search engines, database managers, etc.
- run on SMPs for high performance
- often embarrassingly parallel
- Memory allocation is a bottleneck
- prevents scaling with number of processors
3. Assessment Criteria for Multiprocessor Allocators
- Speed
- competitive with uniprocessor allocators on one processor
- Scalability
- performance linear with the number of processors
- Fragmentation (= max allocated / max in use)
- competitive with uniprocessor allocators
- worst-case and average-case
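The fragmentation criterion above can be computed from a trace of heap snapshots. The following is a minimal sketch, assuming a hypothetical trace of (bytes held by the allocator, bytes in use by the program) pairs; the function name and the sample numbers are illustrative and do not come from the talk.

```python
def fragmentation(trace):
    """Return max allocated / max in use over a trace.

    `trace` is an iterable of (allocated_bytes, in_use_bytes) snapshots:
    allocated_bytes is what the allocator holds from the system,
    in_use_bytes is what the program currently has malloc'ed.
    """
    trace = list(trace)
    max_allocated = max(allocated for allocated, _ in trace)
    max_in_use = max(in_use for _, in_use in trace)
    return max_allocated / max_in_use


# Hypothetical snapshots; the two maxima need not occur at the same time.
samples = [(4096, 1000), (8192, 6000), (12288, 5000), (12288, 2000)]
print(fragmentation(samples))  # 12288 / 6000 = 2.048
```

A value of 1.0 would mean the allocator never held more memory than the program's peak demand.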
4. Uniprocessor Allocators on Multiprocessors
- Fragmentation: Excellent
- Very low for most programs [Wilson & Johnstone]
- Speed & Scalability: Poor
- Heap contention
- a single lock protects the heap
- Can exacerbate false sharing
- different processors can share cache lines
5. Allocator-Induced False Sharing
- Allocators cause false sharing!
- Cache lines can end up spread across a number of processors
- Practically all allocators do this
[Figure: a cache line. Processor 1 runs x1 = malloc(s) and processor 2 runs x2 = malloc(s); both processors then thrash.]
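The effect described on this slide can be illustrated by checking which cache lines hold objects belonging to more than one thread. This is a rough sketch under stated assumptions: a 64-byte cache line, 8-byte objects, and hypothetical address layouts; none of these numbers come from the talk.

```python
CACHE_LINE = 64  # bytes; an assumed line size, for illustration only


def shared_lines(allocations):
    """Return cache-line indices that hold objects of more than one thread.

    `allocations` is a list of (thread_id, address, size) tuples.
    """
    owners = {}
    for thread_id, address, size in allocations:
        first = address // CACHE_LINE
        last = (address + size - 1) // CACHE_LINE
        for line in range(first, last + 1):
            owners.setdefault(line, set()).add(thread_id)
    return {line for line, threads in owners.items() if len(threads) > 1}


# One shared heap hands out consecutive 8-byte objects to alternating threads.
interleaved = [(i % 2, i * 8, 8) for i in range(16)]
# Each thread instead allocates from its own region (assumed 4096 bytes apart).
separated = [(i % 2, (i % 2) * 4096 + (i // 2) * 8, 8) for i in range(16)]

print(sorted(shared_lines(interleaved)))  # [0, 1]
print(sorted(shared_lines(separated)))    # []
```

In the interleaved layout every line is written by both threads, so each write by one thread invalidates the other's copy of the line.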
6. Existing Multiprocessor Allocators
- Speed
- One concurrent heap (e.g., concurrent B-tree): too expensive
- too many locks/atomic updates
- O(log n) cost per memory operation
- Consequently, fast allocators use multiple heaps
- Scalability
- Allocator-induced false sharing and other bottlenecks
- Fragmentation: P-fold increase or even unbounded
7. Multiprocessor Allocator I: Pure Private Heaps
- Pure private heaps: one heap per processor.
- malloc gets memory from the processor's heap or the system
- free puts memory on the processor's heap
- Avoids heap contention
- Examples: STL, ad hoc (e.g., Cilk 4.1)
[Figure: timeline for processor 1 and processor 2 with the operations x1 = malloc(s), x2 = malloc(s), free(x1), free(x2), x3 = malloc(s), x4 = malloc(s). Legend: allocated by heap 1; free, on heap 2.]
8. How to Break Pure Private Heaps: Fragmentation
- Pure private heaps
- memory consumption can grow without bound!
- Producer-consumer
- processor 1 allocates
- processor 2 frees
[Figure: processor 1 repeatedly allocates (x1 = malloc(s), x2 = malloc(s), x3 = malloc(s)) while processor 2 frees each object (free(x1), free(x2), free(x3)).]
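The producer-consumer pattern above can be reproduced with a toy model. This is a minimal sketch, assuming equal-sized blocks and a per-processor free-block counter; the class and function names are hypothetical and not taken from any real allocator.

```python
class PurePrivateHeaps:
    """Toy model: one free list per processor, no ownership tracking."""

    def __init__(self, processors):
        self.free_blocks = [0] * processors  # free blocks on each heap
        self.from_system = 0                 # blocks requested from the system

    def malloc(self, proc):
        if self.free_blocks[proc] > 0:
            self.free_blocks[proc] -= 1      # reuse a block from this heap
        else:
            self.from_system += 1            # heap is empty: ask the system

    def free(self, proc):
        # Pure private heaps: the block lands on the *freeing* processor's heap.
        self.free_blocks[proc] += 1


def producer_consumer(rounds):
    heaps = PurePrivateHeaps(processors=2)
    for _ in range(rounds):
        heaps.malloc(0)  # processor 1 allocates
        heaps.free(1)    # processor 2 frees
    return heaps.from_system


print(producer_consumer(1000))  # 1000 blocks from the system, at most 1 in use
```

Because freed blocks pile up on the consumer's heap and the producer never sees them, memory taken from the system grows with the number of rounds even though only one block is live at a time.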
9. Multiprocessor Allocator II: Private Heaps with Ownership
- Private heaps with ownership: free puts memory back on the originating processor's heap.
- Avoids unbounded memory consumption
- Examples: ptmalloc [Gloger], LKmalloc [Larson & Krishnan]
[Figure: timeline for processor 1 and processor 2 with the operations x1 = malloc(s), free(x1), x2 = malloc(s), free(x2).]
10. How to Break Private Heaps with Ownership: Fragmentation
- Private heaps with ownership: memory consumption can blow up by a factor of P.
- Round-robin producer-consumer
- processor i allocates
- processor i+1 frees
- This really happens (NDS).
[Figure: timeline for processor 1, processor 2, and processor 3 with the operations x1 = malloc(s), free(x1), x2 = malloc(s), free(x2), x3 = malloc(s), free(x3).]
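The P-fold blowup can be seen in a toy model of the round-robin pattern. This is a sketch under assumptions: equal-sized blocks, each processor allocating k blocks per turn, and hypothetical class and function names that do not correspond to ptmalloc or LKmalloc internals.

```python
class OwnershipHeaps:
    """Toy model: a freed block always returns to the heap that allocated it."""

    def __init__(self, processors):
        self.free_blocks = [0] * processors
        self.from_system = 0

    def malloc(self, proc):
        """Allocate on `proc`'s heap and return the owning heap index."""
        if self.free_blocks[proc] > 0:
            self.free_blocks[proc] -= 1
        else:
            self.from_system += 1
        return proc

    def free(self, freeing_proc, owner):
        # Ownership: the freeing processor is irrelevant; the block goes home.
        self.free_blocks[owner] += 1


def round_robin(processors, k, rounds):
    heaps = OwnershipHeaps(processors)
    for _ in range(rounds):
        for i in range(processors):
            owners = [heaps.malloc(i) for _ in range(k)]  # processor i allocates
            for owner in owners:                          # processor i+1 frees
                heaps.free((i + 1) % processors, owner)
    return heaps.from_system


# At most k = 100 blocks are live at once, yet every heap ends up holding 100.
print(round_robin(processors=4, k=100, rounds=10))  # 400
```

Consumption stays bounded, unlike pure private heaps, but each of the P heaps retains its own copy of the peak working set.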
11. So What Do We Do Now?
12. The Hoard Multiprocessor Memory Allocator
- Manages memory in page-sized superblocks of same-sized objects
- Avoids false sharing by not carving up cache lines
- Avoids heap contention: local heaps allocate & free small blocks from their set of superblocks
- Adds a global heap that is a repository of superblocks
- When the fraction of free memory exceeds the empty fraction, moves superblocks to the global heap
- Avoids blowup in memory consumption
13. Hoard Example
- Hoard: one heap per processor + a global heap
- malloc gets memory from a superblock on its heap.
- free returns memory to its superblock. If the heap is too empty, it moves a superblock to the global heap.
[Figure: processor 1's heap and the global heap, stepping through x1 = malloc(s), some mallocs, some frees, and free(x7). Empty fraction = 1/3.]
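The malloc and free rules on this slide can be sketched as a toy simulation. This is a simplified model under several assumptions, not Hoard's implementation: a single size class, 8 objects per superblock, no locking, at most one superblock moved per free, and hypothetical names throughout (`HoardSketch`, `GLOBAL`, `round_robin`).

```python
GLOBAL = -1  # hypothetical index used here for the global heap


class Superblock:
    def __init__(self, capacity):
        self.capacity = capacity  # objects per superblock (assumed value)
        self.used = 0
        self.owner = None         # index of the heap currently holding it


class HoardSketch:
    def __init__(self, processors, capacity=8, empty_fraction=1 / 3):
        self.capacity = capacity
        self.empty_fraction = empty_fraction
        self.heaps = {p: [] for p in range(processors)}
        self.heaps[GLOBAL] = []
        self.superblocks_from_system = 0

    def malloc(self, proc):
        """Allocate one object; the superblock doubles as the object handle."""
        candidates = [sb for sb in self.heaps[proc] if sb.used < sb.capacity]
        if candidates:
            sb = max(candidates, key=lambda s: s.used)  # fullest first
        else:
            sb = self._acquire_superblock(proc)
        sb.used += 1
        return sb

    def _acquire_superblock(self, proc):
        available = [sb for sb in self.heaps[GLOBAL] if sb.used < sb.capacity]
        if available:
            sb = available[0]                 # reuse one from the global heap
            self.heaps[GLOBAL].remove(sb)
        else:
            sb = Superblock(self.capacity)    # otherwise go to the system
            self.superblocks_from_system += 1
        sb.owner = proc
        self.heaps[proc].append(sb)
        return sb

    def free(self, sb):
        sb.used -= 1                          # memory returns to its superblock
        if sb.owner == GLOBAL:
            return
        heap = self.heaps[sb.owner]
        total = sum(s.capacity for s in heap)
        in_use = sum(s.used for s in heap)
        if (total - in_use) / total > self.empty_fraction:
            emptiest = min(heap, key=lambda s: s.used)
            heap.remove(emptiest)             # heap too empty: give one back
            emptiest.owner = GLOBAL
            self.heaps[GLOBAL].append(emptiest)


def round_robin(processors=4, k=100, rounds=10):
    hoard = HoardSketch(processors)
    for _ in range(rounds):
        for i in range(processors):
            objects = [hoard.malloc(i) for _ in range(k)]
            for sb in objects:  # freed by the next processor in the ring
                hoard.free(sb)
    return hoard.superblocks_from_system


print(round_robin() * 8, "object slots obtained from the system")
```

In this model, superblocks released by one processor are picked up from the global heap by the next, so the round-robin workload does not multiply memory consumption by the number of processors.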
14. Summary of Analytical Results
- Worst-case memory consumption
- O(n log(M/m) + P) instead of O(P n log(M/m))
- n = memory required
- M = biggest object size
- m = smallest object size
- P = number of processors
- Best possible: O(n log(M/m)) [Robson]
- Provably low synchronization in most cases
15. Experiments
- Run on a dedicated 14-processor Sun Enterprise
- 300 MHz UltraSparc, 1 GB of RAM
- Solaris 2.7
- All programs compiled with g++ version 2.95.1
- Allocators
- Hoard version 2.0.2
- Solaris (system allocator)
- ptmalloc (GNU libc, private heaps with ownership)
- mtmalloc (Sun's "MT-hot" allocator)
16. Performance: threadtest
speedup(x, P) = runtime(Solaris allocator, one processor) / runtime(x on P processors)
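The speedup definition above is a plain ratio against a fixed baseline. A small sketch follows; the timings are hypothetical placeholders, not measurements from the talk.

```python
def speedup(baseline_one_processor, runtime_x_on_p):
    """speedup(x, P): baseline is the Solaris allocator on one processor;
    runtime_x_on_p is allocator x measured on P processors (same units)."""
    return baseline_one_processor / runtime_x_on_p


# Hypothetical timings in seconds, for illustration only.
print(speedup(28.0, 2.5))  # 11.2
```

Because every allocator is normalized to the same single-processor Solaris run, an allocator that is slower than Solaris on one processor starts below 1.0.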
17. Performance: Larson
Server-style benchmark with sharing
18. Performance: false sharing
Each thread reads & writes heap data
19. Fragmentation Results
- On most standard uniprocessor benchmarks, Hoard's fragmentation was low
- p2c (Pascal-to-C): 1.20; espresso: 1.47
- LRUsim: 1.05; Ghostscript: 1.15
- Within 20% of Lea's allocator
- On the multiprocessor benchmarks and other codes
- Fragmentation was between 1.02 and 1.24 for all but one anomalous benchmark (shbench: 3.17).
20. Hoard Conclusions
- Speed: Excellent
- As fast as a uniprocessor allocator on one processor
- amortized O(1) cost
- 1 lock for malloc, 2 for free
- Scalability: Excellent
- Scales linearly with the number of processors
- Avoids false sharing
- Fragmentation: Very good
- Worst-case is provably close to ideal
- Actual observed fragmentation is low
21. Hoard Heap Details
- Segregated size class allocator
- Size classes are logarithmically-spaced
- Superblocks hold objects of one size class
- empty superblocks are recycled
- Approximately radix-sorted
- Allocate from mostly-full superblocks
- Fast removal of mostly-empty superblocks
[Figure: size-class bins (8, 16, 24, 32, 40, 48), each holding radix-sorted superblock lists ordered from emptiest to fullest.]
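The heap organization above can be sketched in a few lines. This is an illustration under assumptions rather than Hoard's code: the growth factor of 1.2, the minimum size of 8 bytes, the four fullness groups, and all function names are hypothetical choices made for the example.

```python
def size_class(size, base=1.2, min_size=8):
    """Index of the smallest logarithmically spaced class that fits `size`.

    Class c holds objects of up to min_size * base**c bytes (assumed spacing).
    """
    index, limit = 0, float(min_size)
    while limit < size:
        limit *= base
        index += 1
    return index


def fullness_group(used, capacity, groups=4):
    """Coarse bucket for a superblock: 0 = emptiest, groups - 1 = fullest."""
    return min(used * groups // capacity, groups - 1)


def pick_superblock(superblocks, groups=4):
    """Return the index of a mostly-full superblock that still has room.

    `superblocks` is a list of (used, capacity) pairs for one size class.
    """
    bins = [[] for _ in range(groups)]
    for index, (used, capacity) in enumerate(superblocks):
        if used < capacity:
            bins[fullness_group(used, capacity, groups)].append(index)
    for group in reversed(range(groups)):  # fullest group first
        if bins[group]:
            return bins[group][0]
    return None                            # every superblock is full


print(size_class(8), size_class(9), size_class(10))  # 0 1 2
print(pick_superblock([(1, 8), (7, 8), (8, 8)]))      # 1
```

Bucketing by fullness instead of fully sorting keeps both operations cheap: allocation scans a handful of groups, and a mostly-empty superblock can be found in the lowest group without a search.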