Title: Automatic Pool Allocation: Improving Performance by Controlling Data Structure Layout in the Heap
1 Automatic Pool Allocation: Improving Performance
by Controlling Data Structure Layout in the Heap
- Paper by
  - Chris Lattner and Vikram Adve
  - University of Illinois at Urbana-Champaign
  - Best Paper Award at PLDI 2005
Presented by Jeff Da Silva, CARG - Aug 2nd 2005
2 Motivation
- Computer architecture and compiler research has primarily focused on analyzing and optimizing memory access patterns for dense arrays rather than for pointer-based data structures.
  - e.g. caches, prefetching, loop transformations, etc.
- Why?
  - Compilers have precise knowledge of the runtime layout and traversal patterns associated with arrays.
  - The layout of heap-allocated data structures and their traversal patterns can be difficult to predict statically (and generally it's not worth the effort).
3 Intuition
- Improving a dynamic data structure's spatial locality through program analysis (like shape analysis) is probably too difficult and might not be the best approach.
- What if you could somehow influence the layout dynamically, so that the data structure is allocated intelligently and possibly appears like a dense array?
- A new approach: develop a technique that operates at the macroscopic level
  - i.e. at the level of the entire data structure rather than individual pointers or objects
4 What is the problem?
[Diagram: heap containing List 1 Nodes, List 2 Nodes, and Tree Nodes]
7 Their Approach: Segregate the Heap
- Step 1: Memory Usage Analysis
  - Build context-sensitive points-to graphs for the program
  - We use a fast unification-based algorithm
- Step 2: Automatic Pool Allocation
  - Segregate memory based on points-to graph nodes
  - Find lifetime bounds for memory with escape analysis
  - Preserve the points-to-graph-to-pool mapping
- Step 3: Follow-on pool-specific optimizations
  - Use segregation and points-to graph for later optimizations
8 Why Segregate Data Structures?
- Primary Goal: Better compiler information & control
  - Compiler knows where each data structure lives in memory
  - Compiler knows the order of data in memory (in some cases)
  - Compiler knows type info for heap objects (from points-to info)
  - Compiler knows which pools point to which other pools
- Secondary Goal: Better performance
  - Smaller working sets
  - Improved spatial locality
    - Especially if allocation order matches traversal order
  - Sometimes convert irregular strides to regular strides
9 Contributions of this Paper
- First region inference technique for C/C++
  - Previous work required type-safe programs (ML, Java)
  - Previous work focused on memory management
- Region inference driven by pointer analysis
  - Enables handling non-type-safe programs
  - Simplifies handling imperative programs
  - Simplifies further pool + pointer transformations
- New pool-based optimizations
  - Exploit per-pool and pool-specific properties
- Evaluation of impact on memory hierarchy
  - We show that pool allocation reduces working sets
10 Outline
- Introduction & Motivation
- Automatic Pool Allocation Transformation
- Pool Allocation-Based Optimizations
- Pool Allocation & Optimization Performance Impact
- Conclusion
11 Example

struct list { list *Next; int Data; };

list *createnode(int Data) {
  list *New = malloc(sizeof(list));
  New->Data = Data;
  return New;
}

void splitclone(list *L, list **R1, list **R2) {
  if (L == 0) { *R1 = *R2 = 0; return; }
  if (some_predicate(L->Data)) {
    *R1 = createnode(L->Data);
    splitclone(L->Next, &(*R1)->Next, R2);
  } else {
    *R2 = createnode(L->Data);
    splitclone(L->Next, R1, &(*R2)->Next);
  }
}
12 Example

void processlist(list *L) {
  list *A, *B, *tmp;
  // Clone L, splitting nodes into lists A and B.
  splitclone(L, &A, &B);
  processPortion(A);  // Process first list
  processPortion(B);  // Process second list
  // free A list
  while (A) { tmp = A->Next; free(A); A = tmp; }
  // free B list
  while (B) { tmp = B->Next; free(B); B = tmp; }
}

Note that lists A and B use distinct heap memory; it would therefore be beneficial if they were allocated using separate pools of memory.
13 Pool Alloc Runtime Library Interface
- void poolcreate(Pool *PD, uint Size, uint Align)
  - Initializes a pool descriptor (obtains one or more pages of memory using malloc).
- void pooldestroy(Pool *PD)
  - Releases pool memory and destroys the pool descriptor.
- void *poolalloc(Pool *PD, uint numBytes)
- void poolfree(Pool *PD, void *ptr)
- void *poolrealloc(Pool *PD, void *ptr, uint numBytes)
- Interface also includes poolinit_bp(..), poolalloc_bp(..), pooldestroy_bp(..)
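To make the interface concrete, here is a minimal sketch of how such a pool could be implemented. This is not the paper's runtime: for brevity it grabs one malloc'd slab per fresh allocation and keeps a single free list, whereas the real library carves objects out of page-sized slabs that grow exponentially.

```c
#include <stdlib.h>

/* Sketch: a pool is a chain of malloc'd slabs plus a free list of
 * recycled nodes. Struct layout and field names are illustrative. */
typedef struct Slab { struct Slab *Next; } Slab;
typedef struct Pool {
    Slab *Slabs;      /* chain of slabs obtained from malloc */
    void *FreeList;   /* singly linked list of freed nodes */
    unsigned Size;    /* node size this pool was created for */
    unsigned Align;
} Pool;

void poolcreate(Pool *PD, unsigned Size, unsigned Align) {
    PD->Slabs = 0;
    PD->FreeList = 0;
    /* a freed node must be able to hold a next pointer */
    PD->Size = Size < sizeof(void *) ? (unsigned)sizeof(void *) : Size;
    PD->Align = Align;
}

void *poolalloc(Pool *PD, unsigned numBytes) {
    if (numBytes <= PD->Size && PD->FreeList) {  /* reuse a freed node */
        void *Node = PD->FreeList;
        PD->FreeList = *(void **)Node;
        return Node;
    }
    /* fresh memory; the slab header lets pooldestroy find it later */
    unsigned Bytes = numBytes > PD->Size ? numBytes : PD->Size;
    Slab *S = malloc(sizeof(Slab) + Bytes);
    S->Next = PD->Slabs;
    PD->Slabs = S;
    return S + 1;  /* user data starts right after the header */
}

void poolfree(Pool *PD, void *ptr) {  /* push node onto the free list */
    *(void **)ptr = PD->FreeList;
    PD->FreeList = ptr;
}

void pooldestroy(Pool *PD) {  /* release all pool memory at once */
    while (PD->Slabs) {
        Slab *Next = PD->Slabs->Next;
        free(PD->Slabs);
        PD->Slabs = Next;
    }
    PD->FreeList = 0;
}
```

Note how pooldestroy frees every object in the pool in one sweep; this is what makes the later poolfree-elimination optimization safe.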
14 Algorithm Steps
- Generate a DS Graph (points-to graph) for each function
- Insert code to create and destroy pool descriptors for DS nodes whose lifetime does not escape a function
- Add pool descriptor arguments for every DS node that escapes a function
- Replace calls to malloc and free with calls to poolalloc and poolfree
- Further refinements and optimizations
15 Points-To Graph: DS Graph
- Builds a points-to graph for each function in Bottom-Up (BU) order
- Context-sensitive naming of heap objects
  - More precise than the traditional allocation-callsite naming
- A unification-based approach
  - Allows a fast and scalable analysis
  - Ensures every pointer points to one unique node
- Field sensitive
  - Added accuracy
- Also used to compute escape info
16 Example DS Graph

list *createnode(int Data) {
  list *New = malloc(sizeof(list));
  New->Data = Data;
  return New;
}
17 Example DS Graph

void splitclone(list *L, list **R1, list **R2) {
  if (L == 0) { *R1 = *R2 = 0; return; }
  if (some_predicate(L->Data)) {
    *R1 = createnode(L->Data);
    splitclone(L->Next, &(*R1)->Next, R2);
  } else {
    *R2 = createnode(L->Data);
    splitclone(L->Next, R1, &(*R2)->Next);
  }
}
18 Example DS Graph

void processlist(list *L) {
  list *A, *B, *tmp;
  // Clone L, splitting nodes into lists A and B.
  splitclone(L, &A, &B);
  processPortion(A);  // Process first list
  processPortion(B);  // Process second list
  // free A list
  while (A) { tmp = A->Next; free(A); A = tmp; }
  // free B list
  while (B) { tmp = B->Next; free(B); B = tmp; }
}
19 Example Transformation

list *createnode(Pool *PD, int Data) {
  list *New = poolalloc(PD, sizeof(list));
  New->Data = Data;
  return New;
}
20 Example Transformation

void splitclone(Pool *PD1, Pool *PD2,
                list *L, list **R1, list **R2) {
  if (L == 0) { *R1 = *R2 = 0; return; }
  if (some_predicate(L->Data)) {
    *R1 = createnode(PD1, L->Data);
    splitclone(PD1, PD2, L->Next, &(*R1)->Next, R2);
  } else {
    *R2 = createnode(PD2, L->Data);
    splitclone(PD1, PD2, L->Next, R1, &(*R2)->Next);
  }
}
21 Example Transformation

void processlist(list *L) {
  list *A, *B, *tmp;
  Pool PD1, PD2;
  poolcreate(&PD1, sizeof(list), 8);
  poolcreate(&PD2, sizeof(list), 8);
  splitclone(&PD1, &PD2, L, &A, &B);
  processPortion(A);  // Process first list
  processPortion(B);  // Process second list
  // free A list
  while (A) { tmp = A->Next; poolfree(&PD1, A); A = tmp; }
  // free B list
  while (B) { tmp = B->Next; poolfree(&PD2, B); B = tmp; }
  pooldestroy(&PD1);
  pooldestroy(&PD2);
}
22 More Algorithm Details
- Indirect function call handling
  - Partition functions into equivalence classes: if F1 and F2 share a common call site → same class
  - Merge points-to graphs for each equivalence class
  - Apply the previous transformation unchanged
- Global variables pointing to memory nodes
  - Use a global pool variable rather than passing pool descriptors around through function arguments
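The call-site partitioning step above is a classic union-find problem. The sketch below is not the paper's code; the function IDs and call sites are invented for illustration — merging any two functions that can be targets of the same indirect call site into one equivalence class.

```c
/* Hypothetical sketch: functions are numbered 0..NFUNCS-1, and each
 * indirect call site unions all of its possible targets into one
 * equivalence class, whose points-to graphs are then merged. */
#define NFUNCS 5
static int parent[NFUNCS];

static void ec_init(void) {  /* every function starts in its own class */
    for (int i = 0; i < NFUNCS; i++) parent[i] = i;
}

static int ec_find(int x) {  /* find representative, with path halving */
    while (parent[x] != x) {
        parent[x] = parent[parent[x]];
        x = parent[x];
    }
    return x;
}

/* Two possible targets of one call site => same equivalence class. */
static void ec_union(int a, int b) {
    parent[ec_find(a)] = ec_find(b);
}
```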
23 More Algorithm Details
- poolcreate / pooldestroy placement
  - Move calls earlier/later by analyzing the pool's lifetime
  - Reduces memory usage
  - Enables poolfree elimination
- poolfree elimination
  - Eliminate unnecessary poolfree calls
  - i.e. when there are no allocations between a poolfree and the pooldestroy
  - Behaves like static garbage collection
24 Example: poolcreate/pooldestroy placement

Before:

void processlist(list *L) {
  list *A, *B, *tmp;
  Pool PD1, PD2;
  poolcreate(&PD1, sizeof(list), 8);
  poolcreate(&PD2, sizeof(list), 8);
  splitclone(&PD1, &PD2, L, &A, &B);
  processPortion(A);  // Process first list
  processPortion(B);  // Process second list
  // free A list
  while (A) { tmp = A->Next; poolfree(&PD1, A); A = tmp; }
  // free B list
  while (B) { tmp = B->Next; poolfree(&PD2, B); B = tmp; }
  pooldestroy(&PD1);
  pooldestroy(&PD2);
}

After (each pooldestroy moved up to the last use of its pool):

void processlist(list *L) {
  list *A, *B, *tmp;
  Pool PD1, PD2;
  poolcreate(&PD1, sizeof(list), 8);
  poolcreate(&PD2, sizeof(list), 8);
  splitclone(&PD1, &PD2, L, &A, &B);
  processPortion(A);  // Process first list
  processPortion(B);  // Process second list
  // free A list
  while (A) { tmp = A->Next; poolfree(&PD1, A); A = tmp; }
  pooldestroy(&PD1);
  // free B list
  while (B) { tmp = B->Next; poolfree(&PD2, B); B = tmp; }
  pooldestroy(&PD2);
}
25 Example: poolfree Elimination

After pooldestroy placement, each poolfree is followed only by its pool's pooldestroy, with no intervening allocation, so the poolfree calls can be eliminated:

void processlist(list *L) {
  list *A, *B, *tmp;
  Pool PD1, PD2;
  poolcreate(&PD1, sizeof(list), 8);
  poolcreate(&PD2, sizeof(list), 8);
  splitclone(&PD1, &PD2, L, &A, &B);
  processPortion(A);  // Process first list
  processPortion(B);  // Process second list
  // free A list
  while (A) { tmp = A->Next; A = tmp; }
  pooldestroy(&PD1);
  // free B list
  while (B) { tmp = B->Next; B = tmp; }
  pooldestroy(&PD2);
}

The now-empty traversal loops are dead code and can be removed as well:

void processlist(list *L) {
  list *A, *B, *tmp;
  Pool PD1, PD2;
  poolcreate(&PD1, sizeof(list), 8);
  poolcreate(&PD2, sizeof(list), 8);
  splitclone(&PD1, &PD2, L, &A, &B);
  processPortion(A);  // Process first list
  processPortion(B);  // Process second list
  pooldestroy(&PD1);
  pooldestroy(&PD2);
}
26 Outline
- Introduction & Motivation
- Automatic Pool Allocation Transformation
- Pool Allocation-Based Optimizations
- Pool Allocation & Optimization Performance Impact
- Conclusion
27 PAOpts (1/4) and (2/4)
- Selective Pool Allocation
  - Don't pool allocate when not profitable
  - Avoids creating and destroying a pool descriptor (minor), and avoids significant wasted space when the object is much smaller than the smallest internal page
- PoolFree Elimination
  - Remove explicit deallocations that are not needed
28 Looking closely: Anatomy of a heap
- Fully general malloc-compatible allocator
  - Supports malloc/free/realloc/memalign, etc.
  - Standard malloc overheads: object header, alignment padding
  - Allocates slabs of memory with exponential growth
  - By default, all returned pointers are 8-byte aligned
- In memory, things look like this (for 16-byte allocations):

[Figure: 4-byte padding for user-data alignment + 4-byte object header + 16-byte user data, within one 32-byte cache line]
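The slide's picture can be checked with some back-of-the-envelope arithmetic. The 4-byte header and 8-byte alignment below are the slide's illustrative constants, not measurements of any particular allocator:

```c
/* Round n up to the next multiple of align (align is a power of two
 * or any positive divisor; integer arithmetic version). */
static unsigned round_up(unsigned n, unsigned align) {
    return (n + align - 1) / align * align;
}

/* Bytes one malloc'd object really consumes: user data plus the
 * 4-byte header, padded out to 8-byte alignment. */
static unsigned malloc_footprint(unsigned userBytes) {
    const unsigned header = 4, align = 8;
    return round_up(userBytes + header, align);
}
```

So a 16-byte object occupies 24 bytes, and a 32-byte cache line cannot hold two complete objects — which is exactly what the next two optimizations attack.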
29 PAOpts (3/4): Bump Pointer Optzn
- If a pool has no poolfrees:
  - Eliminate the per-object header
  - Eliminate free-list overhead (faster object allocation)
  - Eliminates 4 bytes of inter-object padding
    - Pack objects more densely in the cache
- Interacts with poolfree elimination (PAOpt 2/4)!
  - If poolfree elimination deletes all frees, the bump pointer can apply

[Figure: 16-byte user-data objects packed back to back, two per 32-byte cache line]
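A minimal sketch of the bump-pointer fast path (in the spirit of the _bp interface named on slide 13, though this is an assumption, not the library's actual code): with no poolfree, allocation just advances a cursor, so there is no per-object header and no free list. A single fixed-size slab is assumed for brevity; a real runtime would chain slabs.

```c
#include <stdlib.h>

/* Sketch of a bump-pointer pool: one slab, one moving cursor. */
typedef struct {
    char *Base, *Cur, *End;
} BPPool;

void poolinit_bp(BPPool *PD, unsigned slabBytes) {
    PD->Base = PD->Cur = malloc(slabBytes);
    PD->End = PD->Base + slabBytes;
}

void *poolalloc_bp(BPPool *PD, unsigned numBytes) {
    if (PD->End - PD->Cur < (long)numBytes)
        return 0;               /* slab exhausted; real runtime grows */
    void *Obj = PD->Cur;
    PD->Cur += numBytes;        /* objects pack back to back */
    return Obj;
}

void pooldestroy_bp(BPPool *PD) {
    free(PD->Base);             /* one call releases every object */
    PD->Base = PD->Cur = PD->End = 0;
}
```

Because there is no header and no padding, consecutive 16-byte objects land exactly 16 bytes apart — two per 32-byte cache line, as in the figure.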
30 PAOpts (4/4): Alignment Analysis
- Malloc must return 8-byte aligned memory
  - It has no idea what types will be used in the memory
  - Some machines bus-error on unaligned memory; others suffer performance problems
- Type-safe pools infer a type for the pool
  - Use 4-byte alignment for pools we know don't need 8-byte alignment
  - Reduces inter-object padding

[Figure: 4-byte object headers with 16-byte user-data objects packed at 4-byte alignment across 32-byte cache lines]
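The padding saved by weaker alignment is easy to quantify. Using the slides' illustrative 4-byte header (an assumption carried over from slide 28, not a measured constant):

```c
/* Distance between consecutive objects in a pool: user data plus a
 * 4-byte header, rounded up to the pool's alignment. */
static unsigned obj_stride(unsigned userBytes, unsigned align) {
    const unsigned header = 4;  /* illustrative per-object header */
    unsigned n = userBytes + header;
    return (n + align - 1) / align * align;
}
```

With 8-byte alignment a 16-byte object repeats every 24 bytes; dropping to 4-byte alignment packs objects every 20 bytes, recovering the 4 bytes of inter-object padding the slide mentions.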
31 Outline
- Introduction & Motivation
- Automatic Pool Allocation Transformation
- Pool Allocation-Based Optimizations
- Pool Allocation & Optimization Performance Impact
- Conclusion
32 Implementation Infrastructure
- Link-time transformation using the LLVM Compiler Infrastructure
- Uses the LLVM-to-C back-end; the resulting code is compiled with GCC 3.4.2 -O3
- Evaluated on an AMD Athlon MP 2100
  - 64KB L1, 256KB L2
33 Simple Pool Allocation Statistics
Table 1
35 Compile Time
Table 3
36 Pool Allocation Speedup
- Several programs unaffected by pool allocation (see paper)
- Sizable speedup across many pointer-intensive programs
- Some programs (ft, chomp) an order of magnitude faster
38 Pool Optimization Speedup (FullPA)
- Baseline 1.0 = run time with pool allocation
- Optimizations help all of these programs
- Despite being very simple, they make a big impact
Figure 9 (with a different baseline)
42 Cache/TLB Miss Reduction
- Miss rate measured with perfctr on an AMD Athlon 2100
- Sources of improvement:
  - Defragmented heap
  - Reduced inter-object padding
  - Segregating the heap!
Figure 10
43 Pool Optimization Statistics
Table 2
44 Optimization Contribution
Figure 11
45 Pool Allocation Conclusions
- Segregate heap based on points-to graph
  - Improved memory hierarchy performance
  - Gives the compiler some control over layout
  - Gives the compiler information about locality
- Optimize pools based on per-pool properties
  - Very simple (but useful) optimizations proposed here
  - Optimizations could be applied to other systems
46 The End
47 Backup Slides
48 Table 4
49 Table 5
50 Pool Allocation Example

list *makeList(int Num) {
  list *New = malloc(sizeof(list));
  New->Next = Num ? makeList(Num - 1) : 0;
  New->Data = Num;
  return New;
}

void twoLists() {
  list *X = makeList(10);
  list *Y = makeList(100);
  GL = Y;  // GL is a global pointer
  processList(X);
  processList(Y);
  freeList(X);
  freeList(Y);
}

Change calls to free into calls to poolfree → retain explicit deallocation
51 Pool-Specific Optimizations
- Different data structures have different properties
- Pool allocation segregates the heap
  - Roughly into logical data structures
- Optimize using pool-specific properties
- Examples of properties we look for:
  - Pool is type-homogeneous
  - Pool contains data that only requires 4-byte alignment
  - Opportunities to reduce allocation overhead
52 Benchmarks
- Pointer-intensive programs from the SPECINT 2000, Ptrdist, Olden, and FreeBench suites
- Also: povray, espresso, fpgrowth, llu-bench, chomp
- Benchmarks with custom allocators are not evaluated, except for parser, which they hand-modified