Hoard: A Scalable Memory Allocator for Multithreaded Applications

Learn more at: https://people.cs.umass.edu


1
Hoard: A Scalable Memory Allocator for Multithreaded Applications
Emery Berger, Kathryn McKinley, Robert Blumofe,
Paul Wilson
Department of Computer Sciences
Department of Computer Science
2
Motivation

Parallel multithreaded programs are becoming prevalent
- web servers, search engines, database managers, etc.
- run on SMPs for high performance
- often embarrassingly parallel
Memory allocation is a bottleneck
- prevents scaling with number of processors

3
Assessment Criteria for Multiprocessor Allocators

Speed
- competitive with uniprocessor allocators on one processor
Scalability
- performance linear with the number of processors
Fragmentation (= max allocated / max in use)
- competitive with uniprocessor allocators
- worst-case and average-case
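
The fragmentation metric above can be illustrated with a small sketch. The trace format and the function name `fragmentation` are assumptions made for this example; they are not taken from Hoard or from the talk.

```python
def fragmentation(trace):
    """Return max(allocated) / max(in use) over a trace of heap samples.

    Each entry is a hypothetical (allocated_bytes, in_use_bytes) pair:
    allocated = memory the allocator currently holds from the system,
    in_use    = memory the program currently has malloc'ed.
    """
    max_allocated = max(allocated for allocated, _ in trace)
    max_in_use = max(in_use for _, in_use in trace)
    return max_allocated / max_in_use


# Made-up samples, for illustration only.
samples = [(4096, 1000), (8192, 6000), (8192, 2000), (12288, 10000)]
print(fragmentation(samples))
```

A value of 1.0 would mean the allocator never held more memory than the program needed at its peak; larger values mean more overhead.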

4
Uniprocessor Allocators on Multiprocessors

Fragmentation: Excellent
- Very low for most programs [Wilson & Johnstone]
Speed & Scalability: Poor
- Heap contention: a single lock protects the heap
- Can exacerbate false sharing: different processors can share cache lines

5
Allocator-Induced False Sharing

Allocators cause false sharing!
- Cache lines can end up spread across a number of processors
- Practically all allocators do this

[Figure: processor 1 calls x1 = malloc(s) and processor 2 calls x2 = malloc(s); x1 and x2 sit in a single cache line, so both processors thrash.]
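
A minimal sketch of how this can happen, assuming a 64-byte cache line and a toy allocator that carves objects consecutively out of one region. `BumpAllocator` and `cache_line_of` are hypothetical names used only for illustration.

```python
CACHE_LINE = 64  # bytes; an assumed line size, which varies by machine


def cache_line_of(address):
    """Index of the cache line that contains `address`."""
    return address // CACHE_LINE


class BumpAllocator:
    """Toy allocator that hands out consecutive addresses from one region."""

    def __init__(self, base=0x1000):
        self.next_free = base

    def malloc(self, size):
        address = self.next_free
        self.next_free += size
        return address


heap = BumpAllocator()
x1 = heap.malloc(8)  # requested by processor 1
x2 = heap.malloc(8)  # requested by processor 2

# Both objects fall in the same line, so writes by either processor
# invalidate the other processor's cached copy of that line.
shares_line = cache_line_of(x1) == cache_line_of(x2)
print(shares_line)
```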
6
Existing Multiprocessor Allocators

Speed
- One concurrent heap (e.g., concurrent B-tree) is too expensive
  - too many locks/atomic updates
  - O(log n) cost per memory operation
- => Fast allocators use multiple heaps
Scalability
- Allocator-induced false sharing and other bottlenecks
Fragmentation: P-fold increase or even unbounded

7
Multiprocessor Allocator I: Pure Private Heaps

Pure private heaps: one heap per processor.
- malloc gets memory from the processor's heap or the system
- free puts memory on the processor's heap
Avoids heap contention
- Examples: STL, ad hoc (e.g., Cilk 4.1)

[Figure: timeline for processor 1 and processor 2 showing x1 = malloc(s), x2 = malloc(s), free(x1), free(x2), x3 = malloc(s), x4 = malloc(s). Legend: allocated by heap 1; free, on heap 2.]
8
How to Break Pure Private Heaps: Fragmentation

Pure private heaps: memory consumption can grow without bound!
- Producer-consumer: processor 1 allocates, processor 2 frees

[Figure: processor 1 calls x1 = malloc(s), x2 = malloc(s), x3 = malloc(s); processor 2 calls free(x1), free(x2), free(x3).]
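
The producer-consumer pattern can be replayed on a toy model of pure private heaps. This is a sketch under simplifying assumptions (all blocks have one size and heaps only count free blocks); `PurePrivateHeaps` and `producer_consumer` are hypothetical names.

```python
class PurePrivateHeaps:
    """Toy model: one free list per processor, all blocks the same size."""

    def __init__(self, processors):
        self.free_blocks = [0] * processors  # free blocks held by each heap
        self.from_system = 0                 # blocks ever taken from the system

    def malloc(self, processor):
        if self.free_blocks[processor] > 0:
            self.free_blocks[processor] -= 1
        else:
            self.from_system += 1

    def free(self, processor):
        # The block lands on the heap of the processor that frees it.
        self.free_blocks[processor] += 1


def producer_consumer(rounds):
    heaps = PurePrivateHeaps(processors=2)
    for _ in range(rounds):
        heaps.malloc(0)  # processor 1 allocates
        heaps.free(1)    # processor 2 frees
    return heaps.from_system


print(producer_consumer(1000))
```

At most one block is live at any time, yet every round takes a fresh block from the system, because the freed blocks pile up on a heap that never allocates.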
9
Multiprocessor Allocator II: Private Heaps with Ownership

Private heaps with ownership: free puts memory back on the originating processor's heap.
- Avoids unbounded memory consumption
- Examples: ptmalloc [Gloger], LKmalloc [Larson & Krishnan]

[Figure: timeline for processor 1 and processor 2 showing x1 = malloc(s), free(x1), x2 = malloc(s), free(x2).]
10
How to Break Private Heaps with Ownership: Fragmentation

Private heaps with ownership: memory consumption can blow up by a factor of P.
- Round-robin producer-consumer: processor i allocates, processor i+1 frees
This really happens (NDS).

[Figure: timeline for processors 1, 2 and 3 showing x1 = malloc(s), free(x1), x2 = malloc(s), free(x2), x3 = malloc(s), free(x3).]
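
The factor-of-P blowup can be reproduced on a toy model of private heaps with ownership. This is a sketch under simplifying assumptions (equal-sized blocks, each processor allocating a batch of blocks in turn); `OwnershipHeaps` and `round_robin` are hypothetical names.

```python
class OwnershipHeaps:
    """Toy model: freed blocks return to the heap that allocated them."""

    def __init__(self, processors):
        self.free_blocks = [0] * processors
        self.from_system = 0

    def malloc(self, processor):
        if self.free_blocks[processor] > 0:
            self.free_blocks[processor] -= 1
        else:
            self.from_system += 1
        return processor  # the owner tag carried by the block

    def free(self, owner):
        self.free_blocks[owner] += 1


def round_robin(processors, blocks, laps):
    heaps = OwnershipHeaps(processors)
    for _ in range(laps):
        for i in range(processors):
            # Processor i allocates a batch of blocks ...
            live = [heaps.malloc(i) for _ in range(blocks)]
            # ... and processor i+1 frees them; they go back to heap i.
            for owner in live:
                heaps.free(owner)
    return heaps.from_system


print(round_robin(processors=4, blocks=10, laps=3))
```

Only 10 blocks are ever live at once, but each of the 4 heaps ends up holding its own 10 blocks, so 40 are taken from the system. Consumption is bounded, unlike pure private heaps, but it is P times what the program needs.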
11
So What Do We Do Now?
12
The Hoard Multiprocessor Memory Allocator

Manages memory in page-sized superblocks of same-sized objects
- Avoids false sharing by not carving up cache lines
- Avoids heap contention: local heaps allocate & free small blocks from their set of superblocks
Adds a global heap that is a repository of superblocks
When the fraction of free memory exceeds the empty fraction, moves superblocks to the global heap
- Avoids blowup in memory consumption

13
Hoard: Example

Hoard: one heap per processor + a global heap
- malloc gets memory from a superblock on its heap.
- free returns memory to its superblock. If the heap is too empty, it moves a superblock to the global heap.

[Figure: processor 1's heap and the global heap, with empty fraction = 1/3. Steps shown: x1 = malloc(s), some mallocs, some frees, free(x7).]
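
The malloc/free behavior described on this slide can be sketched with a toy model. It is a simplification, not Hoard's actual implementation: there is a single processor heap, a superblock is reduced to a count of in-use blocks, the superblock capacity of 4 is assumed, and the only rule applied is "free fraction above the empty fraction". `ToyHoard` is a hypothetical name.

```python
from fractions import Fraction

EMPTY_FRACTION = Fraction(1, 3)  # value shown on the slide
BLOCKS_PER_SUPERBLOCK = 4        # assumed superblock capacity


class ToyHoard:
    """One processor heap plus a global heap (hypothetical model).

    Each heap is a list of superblocks; a superblock is represented
    only by the number of its blocks that are in use.
    """

    def __init__(self):
        self.local = []
        self.global_heap = []

    def malloc(self):
        """Allocate one block; return the index of its superblock."""
        for index, used in enumerate(self.local):
            if used < BLOCKS_PER_SUPERBLOCK:
                self.local[index] += 1
                return index
        # No room locally: reuse a superblock from the global heap,
        # or start a fresh one (standing in for memory from the system).
        used = self.global_heap.pop() if self.global_heap else 0
        self.local.append(used + 1)
        return len(self.local) - 1

    def free(self, index):
        """Free one block of local superblock `index`."""
        self.local[index] -= 1
        capacity = len(self.local) * BLOCKS_PER_SUPERBLOCK
        free_blocks = capacity - sum(self.local)
        if Fraction(free_blocks, capacity) > EMPTY_FRACTION:
            # Heap is too empty: hand its emptiest superblock to the global heap.
            emptiest = min(range(len(self.local)), key=lambda i: self.local[i])
            self.global_heap.append(self.local.pop(emptiest))


hoard = ToyHoard()
for _ in range(8):
    hoard.malloc()   # fills two superblocks: [4, 4]
for _ in range(3):
    hoard.free(1)    # the third free leaves 3/8 of the heap free
print(hoard.local, hoard.global_heap)
```

After the third free, more than 1/3 of the local heap is free, so the emptiest superblock moves to the global heap, where a later malloc can pick it up again.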
14
Summary of Analytical Results

Worst-case memory consumption
- O(n log(M/m) + P) instead of O(P n log(M/m))
  - n = memory required
  - M = biggest object size
  - m = smallest object size
  - P = number of processors
- Best possible: O(n log(M/m)) [Robson]
Provably low synchronization in most cases
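
To get a feel for the gap between the two bounds, the sketch below evaluates them with all hidden constants set to 1, reading the slide's bounds as O(n log(M/m) + P) and O(P n log(M/m)). The input values and function names are made up for illustration; real memory consumption depends on the constants that big-O notation hides.

```python
import math


def bound_with_global_heap(n, M, m, P):
    """n * log2(M/m) + P, with constants ignored."""
    return n * math.log2(M / m) + P


def bound_with_ownership(n, M, m, P):
    """P * n * log2(M/m), with constants ignored."""
    return P * n * math.log2(M / m)


# Hypothetical workload: 1,000,000 bytes required, objects of 8 to 1024 bytes,
# 14 processors.
n, M, m, P = 1_000_000, 1024, 8, 14
ratio = bound_with_ownership(n, M, m, P) / bound_with_global_heap(n, M, m, P)
print(round(ratio, 3))
```

As n grows, the ratio approaches P: the additive P term becomes negligible, while the multiplicative P in the other bound does not.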

15
Experiments

Run on a dedicated 14-processor Sun Enterprise
- 300 MHz UltraSparc, 1 GB of RAM
- Solaris 2.7
- All programs compiled with g++ version 2.95.1
Allocators
- Hoard version 2.0.2
- Solaris (system allocator)
- Ptmalloc (GNU libc, private heaps with ownership)
- mtmalloc (Sun's "MT-hot" allocator)

16
Performance: threadtest
speedup(x, P) = runtime(Solaris allocator, one processor) / runtime(x on P processors)
17
Performance: Larson
Server-style benchmark with sharing
18
Performance: false sharing
Each thread reads & writes heap data
19
Fragmentation Results

On most standard uniprocessor benchmarks, Hoard's fragmentation was low
- p2c (Pascal-to-C): 1.20
- espresso: 1.47
- LRUsim: 1.05
- Ghostscript: 1.15
- Within 20% of Lea's allocator
On the multiprocessor benchmarks and other codes
- Fragmentation was between 1.02 and 1.24 for all but one anomalous benchmark (shbench: 3.17).

20
Hoard: Conclusions