Hoard: A Scalable Memory Allocator for Multithreaded Applications

1
Hoard: A Scalable Memory Allocator for Multithreaded Applications
Emery Berger, Kathryn McKinley
Robert Blumofe, Paul Wilson
2
Motivation

Parallel multithreaded programs becoming prevalent
web servers, search engines, database managers, etc.
run on SMPs for high performance
often embarrassingly parallel
Memory allocation is a bottleneck
prevents scaling with number of processors

3
Assessment Criteria for Multiprocessor Allocators

Speed
competitive with uniprocessor allocators on one processor
Scalability
performance linear with the number of processors
Fragmentation (max allocated / max in use)
competitive with uniprocessor allocators
worst-case and average-case

4
Uniprocessor Allocators on Multiprocessors

Fragmentation: Excellent
Very low for most programs [Wilson & Johnstone]
Speed & Scalability: Poor
Heap contention
a single lock protects the heap
Can exacerbate false sharing
different processors can share cache lines

5
Allocator-Induced False Sharing

Allocators cause false sharing!
Cache lines can end up spread across a number of processors
Practically all allocators do this

[Diagram: processor 1 runs x1 = malloc(s) and processor 2 runs x2 = malloc(s); x1 and x2 sit in one cache line, and both processors thrash.]
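
As an illustration of this slide (not part of the original deck), the sketch below checks whether two addresses fall on the same cache line. The 64-byte line size and the names `kCacheLine`, `same_cache_line`, and `Line` are assumptions made for the example; actual line sizes depend on the processor.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed cache line size; 64 bytes is common but not universal.
constexpr std::size_t kCacheLine = 64;

// Hypothetical helper: true if both addresses fall on the same cache line.
inline bool same_cache_line(const void* a, const void* b) {
  assert(a != nullptr && b != nullptr);
  const auto ua = reinterpret_cast<std::uintptr_t>(a);
  const auto ub = reinterpret_cast<std::uintptr_t>(b);
  return ua / kCacheLine == ub / kCacheLine;
}

// A line-aligned buffer standing in for memory an allocator carves up.
struct alignas(kCacheLine) Line {
  char bytes[kCacheLine];
};
```

If an allocator hands `&line.bytes[0]` to one thread and `&line.bytes[8]` to another, the two objects share a line and writes from either thread invalidate the other's cached copy.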
6
Existing Multiprocessor Allocators

Speed
One concurrent heap (e.g., concurrent B-tree)
too expensive
too many locks/atomic updates
O(log n) cost per memory operation
=> Fast allocators use multiple heaps
Scalability
Allocator-induced false sharing and other bottlenecks
Fragmentation: P-fold increase or even unbounded

7
Multiprocessor Allocator I: Pure Private Heaps

Pure private heaps: one heap per processor.
malloc gets memory from the processor's heap or the system
free puts memory on the processor's heap
Avoids heap contention
Examples: STL, ad hoc (e.g., Cilk 4.1)

[Diagram: processors 1 and 2; operations x1 = malloc(s), x2 = malloc(s), free(x1), free(x2), x3 = malloc(s), x4 = malloc(s). Legend: allocated by heap 1; free, on heap 2.]
8
How to Break Pure Private Heaps: Fragmentation

Pure private heaps
memory consumption can grow without bound!
Producer-consumer
processor 1 allocates
processor 2 frees

[Diagram: processor 1 runs x1 = malloc(s), x2 = malloc(s), x3 = malloc(s); processor 2 runs free(x1), free(x2), free(x3).]
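
The unbounded growth can be reproduced with a toy model. The sketch below is an illustration under simplifying assumptions, not code from any of the allocators named in the slides: each heap is reduced to a count of free blocks, and the names `PurePrivateHeaps` and `producer_consumer` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy model of pure private heaps: each heap is just a count of free
// blocks, and from_system counts blocks ever requested from the system.
struct PurePrivateHeaps {
  std::vector<std::size_t> free_blocks;  // free blocks held by each heap
  std::size_t from_system = 0;           // blocks obtained from the system

  explicit PurePrivateHeaps(std::size_t processors)
      : free_blocks(processors, 0) {}

  // malloc on processor p: reuse a block from p's heap, else ask the system.
  void malloc_on(std::size_t p) {
    assert(p < free_blocks.size());
    if (free_blocks[p] > 0) {
      --free_blocks[p];
    } else {
      ++from_system;
    }
  }

  // free on processor p: the block lands on p's heap, whoever allocated it.
  void free_on(std::size_t p) {
    assert(p < free_blocks.size());
    ++free_blocks[p];
  }
};

// Producer-consumer: processor 0 allocates, processor 1 frees, `rounds` times.
// Returns how many blocks had to come from the system.
inline std::size_t producer_consumer(std::size_t rounds) {
  PurePrivateHeaps heaps(2);
  for (std::size_t i = 0; i < rounds; ++i) {
    heaps.malloc_on(0);
    heaps.free_on(1);
  }
  return heaps.from_system;
}
```

At most one block is live at any time, yet the number of blocks taken from the system equals the number of rounds, because every freed block piles up on the consumer's heap where the producer never looks.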
9
Multiprocessor Allocator II: Private Heaps with Ownership

Private heaps with ownership: free puts memory back on the originating processor's heap.
Avoids unbounded memory consumption
Examples: ptmalloc [Gloger], LKmalloc [Larson & Krishnan]

[Diagram: processors 1 and 2; operations x1 = malloc(s), free(x1), x2 = malloc(s), free(x2).]
10
How to Break Private Heaps with Ownership: Fragmentation

Private heaps with ownership: memory consumption can blow up by a factor of P.
Round-robin producer-consumer
processor i allocates
processor i+1 frees
This really happens (NDS).

[Diagram: processors 1, 2, and 3; operations x1 = malloc(s), free(x1), x2 = malloc(s), free(x2), x3 = malloc(s), free(x3).]
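
The P-fold blowup can be seen in a similar toy model. This is a sketch under stated assumptions, not ptmalloc or LKmalloc code: heaps are counts of free blocks, each round allocates and frees k blocks, and the names `OwnershipHeaps` and `round_robin` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy model of private heaps with ownership.
struct OwnershipHeaps {
  std::vector<std::size_t> free_blocks;  // free blocks held by each heap
  std::size_t from_system = 0;           // blocks obtained from the system

  explicit OwnershipHeaps(std::size_t processors)
      : free_blocks(processors, 0) {}

  // malloc on processor p: reuse a block from p's heap, else ask the system.
  void malloc_on(std::size_t p) {
    assert(p < free_blocks.size());
    if (free_blocks[p] > 0) {
      --free_blocks[p];
    } else {
      ++from_system;
    }
  }

  // free of a block owned by heap `owner`: it returns to that heap,
  // no matter which processor calls free.
  void free_to_owner(std::size_t owner) {
    assert(owner < free_blocks.size());
    ++free_blocks[owner];
  }
};

// Round-robin producer-consumer: in round r, processor r % P allocates k
// blocks and the next processor frees them (they return to the owner).
inline std::size_t round_robin(std::size_t processors, std::size_t k,
                               std::size_t rounds) {
  assert(processors > 0);
  OwnershipHeaps heaps(processors);
  for (std::size_t r = 0; r < rounds; ++r) {
    const std::size_t producer = r % processors;
    for (std::size_t i = 0; i < k; ++i) heaps.malloc_on(producer);
    for (std::size_t i = 0; i < k; ++i) heaps.free_to_owner(producer);
  }
  return heaps.from_system;
}
```

The program never has more than k blocks live, but each of the P heaps ends up holding its own k free blocks, so the system supplies P * k blocks. Unlike pure private heaps, the total stops growing after the first P rounds.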
11
So What Do We Do Now?
12
The Hoard Multiprocessor Memory Allocator

Manages memory in page-sized superblocks of same-sized objects
- Avoids false sharing by not carving up cache lines
- Avoids heap contention - local heaps allocate & free small blocks from their set of superblocks
Adds a global heap that is a repository of superblocks
When the fraction of free memory exceeds the empty fraction, moves superblocks to the global heap
- Avoids blowup in memory consumption

13
Hoard Example

Hoard: one heap per processor + a global heap
malloc gets memory from a superblock on its heap.
free returns memory to its superblock. If the heap is too empty, it moves a superblock to the global heap.

[Diagram: processor 1's heap and the global heap; steps x1 = malloc(s), some mallocs, some frees, free(x7). Empty fraction = 1/3.]
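
The "too empty" test on this slide can be sketched as a single predicate. This is a simplified reading of the slide's rule only; the actual allocator may apply further conditions before moving a superblock. The names `HeapStats` and `too_empty` are hypothetical, and the empty fraction is passed as a ratio `num/den` to avoid floating point.

```cpp
#include <cassert>
#include <cstddef>

// Per-processor heap statistics, in bytes.
struct HeapStats {
  std::size_t in_use;     // bytes currently handed out to the program
  std::size_t allocated;  // bytes held in this heap's superblocks
};

// True if the free fraction of the heap exceeds the empty fraction num/den.
inline bool too_empty(const HeapStats& h, std::size_t num, std::size_t den) {
  assert(den != 0 && h.in_use <= h.allocated);
  const std::size_t free_bytes = h.allocated - h.in_use;
  // free_bytes / allocated > num / den, rewritten without division.
  return free_bytes * den > num * h.allocated;
}
```

With an empty fraction of 1/3 as in the diagram, a heap holding 12288 bytes with 4096 in use is two-thirds free and would give a superblock back to the global heap; the same heap with 8192 bytes in use is exactly one-third free and would not.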
14
Summary of Analytical Results

Worst-case memory consumption
O(n log(M/m) + P) instead of O(P n log(M/m))
n = memory required
M = biggest object size
m = smallest object size
P = number of processors
Best possible: O(n log(M/m)) [Robson]
Provably low synchronization in most cases

15
Performance: threadtest
speedup(x, P) = runtime(Solaris allocator, one processor) / runtime(x on P processors)
16
Performance: Larson
Server-style benchmark with sharing
17
Performance: false sharing
Each thread reads & writes heap data
18
Hoard Conclusions

Speed: Excellent
As fast as a uniprocessor allocator on one processor
amortized O(1) cost
1 lock for malloc, 2 for free
Scalability: Excellent
Scales linearly with the number of processors
Avoids false sharing
Fragmentation: Very good
Worst-case is provably close to ideal
Actual observed fragmentation is low

19
Even Faster Allocation

Custom allocators can be very fast
Linked lists of objects for highly-used classes
Region (arena, zone) allocators
Best practices [Meyers 1995, Bulka 2001]
Used in 3 SPEC2000 benchmarks (parser, gcc, vpr), Apache, PGP, SQLServer, etc.

20
Custom Allocators Work

Using a custom allocator reduces runtime by 60%

21
Problems with Current Practice

Brittle code
written from scratch
macros/monolithic functions to avoid overhead
hard to write, reuse or maintain
Excessive fragmentation
good memory allocators: complicated, not reusable

22
Allocator Conceptual Design

People think & talk about heaps as if they were modular

[Diagram: malloc and free entry points; boxes labeled "Select heap based on size", "Manage small objects", "Manage large objects", and "System memory manager".]
23
Infrastructure Requirements

Flexible
can add functionality
Reusable
in other contexts in same program
Fast
very low or no overhead
High-level
as component-like as possible

24
Ordinary Classes vs. Mixins

Ordinary classes
fixed inheritance dag
can't rearrange hierarchy
can't use class multiple times

Mixins
no fixed inheritance dag
multiple hierarchies possible
can reuse classes multiple times
fast static dispatch
25
A Heap Layer
Provides malloc and free methods
"Top heaps" get memory from the system
e.g., mallocHeap uses the C library's malloc and free

template <class SuperHeap>
class HeapLayer : public SuperHeap {
public:
  void * malloc (size_t sz) {
    // do something
    void * p = SuperHeap::malloc (sz);
    // do something else
    return p;
  }
};
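
A compilable instance of this pattern might look like the sketch below. It is an illustration, not the authors' library code: `MallocHeap` stands in for the slide's mallocHeap, and `CountingHeap` is a hypothetical layer whose "do something" is to count live objects.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Top heap: gets memory from the C library's malloc and free.
class MallocHeap {
public:
  void* malloc(std::size_t sz) { return std::malloc(sz); }
  void free(void* p) { std::free(p); }
};

// Hypothetical layer: counts objects that are currently allocated.
template <class SuperHeap>
class CountingHeap : public SuperHeap {
public:
  void* malloc(std::size_t sz) {
    void* p = SuperHeap::malloc(sz);  // delegate to the parent heap
    if (p != nullptr) ++live_;        // then do this layer's extra work
    return p;
  }

  void free(void* p) {
    if (p != nullptr) {
      assert(live_ > 0);
      --live_;
    }
    SuperHeap::free(p);
  }

  std::size_t live() const { return live_; }

private:
  std::size_t live_ = 0;
};
```

Because the parent is a template parameter, `CountingHeap<MallocHeap>` resolves `SuperHeap::malloc` at compile time, which is the static dispatch the mixin slide refers to.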
26
Example Thread-safety

LockedHeap
protects the parent heap with a single lock

class LockedMallocHeap : public LockedHeap<mallocHeap> {};

// LockedHeap's malloc
void * malloc (size_t sz) {
  // acquire lock
  void * p = SuperHeap::malloc (sz);
  // release lock
  return p;
}
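
One way to fill in the acquire and release steps is with `std::mutex` and `std::lock_guard`, as sketched here. The slides do not say which lock the authors used, so the mutex is an assumption, as are the names `MallocHeap` (standing in for mallocHeap) and `locked_heap_demo`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <mutex>

// Top heap: gets memory from the C library's malloc and free.
class MallocHeap {
public:
  void* malloc(std::size_t sz) { return std::malloc(sz); }
  void free(void* p) { std::free(p); }
};

// Protects the parent heap with a single lock.
template <class SuperHeap>
class LockedHeap : public SuperHeap {
public:
  void* malloc(std::size_t sz) {
    std::lock_guard<std::mutex> guard(lock_);  // acquire; released on return
    return SuperHeap::malloc(sz);
  }

  void free(void* p) {
    std::lock_guard<std::mutex> guard(lock_);
    SuperHeap::free(p);
  }

private:
  std::mutex lock_;
};

class LockedMallocHeap : public LockedHeap<MallocHeap> {};

// Usage sketch: one allocation and one free through the locked heap.
inline void locked_heap_demo() {
  LockedMallocHeap heap;
  void* p = heap.malloc(32);
  assert(p != nullptr);
  heap.free(p);
}
```

The lock guard releases the mutex on every return path, so the layer stays correct even if a parent heap's malloc were to throw.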
27
Example Debugging

DebugHeap
Protects against invalid & multiple frees.

class LockedDebugMallocHeap : public LockedHeap< DebugHeap<mallocHeap> > {};

// DebugHeap's free
void free (void * p) {
  // check that p is valid
  // check that p hasn't been freed before
  SuperHeap::free (p);
}
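
The two checks could be implemented by remembering every pointer the layer has handed out. The sketch below takes that approach as an assumption; the slides do not describe how the authors' DebugHeap tracks validity. Instead of aborting, this version counts rejected frees so the behavior is easy to observe; `bad_frees` is a hypothetical accessor and `MallocHeap` stands in for mallocHeap.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <unordered_set>

// Top heap: gets memory from the C library's malloc and free.
class MallocHeap {
public:
  void* malloc(std::size_t sz) { return std::malloc(sz); }
  void free(void* p) { std::free(p); }
};

// Rejects frees of pointers this heap did not hand out or already freed.
template <class SuperHeap>
class DebugHeap : public SuperHeap {
public:
  void* malloc(std::size_t sz) {
    void* p = SuperHeap::malloc(sz);
    if (p != nullptr) {
      // The parent must never return a pointer that is still live.
      assert(live_.count(p) == 0);
      live_.insert(p);
    }
    return p;
  }

  void free(void* p) {
    // erase returns 0 when p is unknown or was already freed
    // (this sketch also treats a null pointer as a bad free).
    if (live_.erase(p) == 0) {
      ++bad_frees_;
      return;
    }
    SuperHeap::free(p);
  }

  std::size_t bad_frees() const { return bad_frees_; }

private:
  std::unordered_set<void*> live_;
  std::size_t bad_frees_ = 0;
};
```

A bad free never reaches the parent heap, so the C library's free is only ever called on pointers that are valid and still allocated.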
28
Implementation in Heap Layers

Modular design and implementation

FreelistHeap: manage objects on freelist
SizeHeap: add size info to objects
SegHeap: select heap based on size
malloc
free
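
As one example of these building blocks, a freelist layer can be sketched as follows. This is an illustration written for this transcript, not the authors' FreelistHeap: it assumes every request made to one instance has the same size (for example because a size-selecting layer sits above it), and it never returns cached objects to the parent heap.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>

// Top heap: gets memory from the C library's malloc and free.
class MallocHeap {
public:
  void* malloc(std::size_t sz) { return std::malloc(sz); }
  void free(void* p) { std::free(p); }
};

// Keeps freed objects on a singly linked list and reuses them first.
// Assumes all requests to one instance have the same size.
template <class SuperHeap>
class FreelistHeap : public SuperHeap {
public:
  void* malloc(std::size_t sz) {
    assert(sz >= sizeof(Node));  // an object must be able to hold the link
    if (head_ != nullptr) {
      Node* n = head_;           // pop the most recently freed object
      head_ = n->next;
      return n;
    }
    return SuperHeap::malloc(sz);
  }

  void free(void* p) {
    if (p == nullptr) return;
    head_ = new (p) Node{head_};  // reuse the freed object as a list node
  }

private:
  struct Node {
    Node* next;
  };
  Node* head_ = nullptr;
};
```

Storing the link inside the freed object itself means the freelist needs no extra memory, which is why the object size must be at least `sizeof(Node)`.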
29
Experimental Results: Custom Allocation (gcc)
30
Experimental Results: General-Purpose Allocators
31
Conclusion

Heap layers: infrastructure for composing allocators
Useful experimental infrastructure
Allows rapid implementation of high-quality allocators
custom allocators as fast as originals
general-purpose allocators comparable to state-of-the-art in speed and efficiency
