Title: Automatic Pool Allocation: Improving Performance by Controlling Data Structure Layout in the Heap
1 Automatic Pool Allocation: Improving Performance
by Controlling Data Structure Layout in the Heap
- Paper by
  - Chris Lattner and Vikram Adve
  - University of Illinois at Urbana-Champaign
  - Best Paper Award at PLDI 2005
Presented by Jeff Da Silva, CARG - Aug 2nd 2005
2 Motivation
- Computer architecture and compiler research has primarily focused on analyzing and optimizing memory access patterns for dense arrays rather than for pointer-based data structures.
  - e.g. caches, prefetching, loop transformations, etc.
- Why?
  - Compilers have precise knowledge of the runtime layout and traversal patterns associated with arrays.
  - The layout of heap-allocated data structures and their traversal patterns can be difficult to predict statically (and generally it's not worth the effort).
3 Intuition
- Improving a dynamic data structure's spatial locality through program analysis (like shape analysis) is probably too difficult and might not be the best approach.
- What if you could somehow influence the layout dynamically, so that the data structure is allocated intelligently and possibly appears like a dense array?
- A new approach: develop a technique that operates at the macroscopic level
  - i.e. at the level of the entire data structure rather than individual pointers or objects
4 What is the problem?
[Diagram: heap containing List 1 Nodes, List 2 Nodes, and Tree Nodes]
7 Their Approach: Segregate the Heap
- Step 1: Memory Usage Analysis
  - Build context-sensitive points-to graphs for the program
  - We use a fast unification-based algorithm
- Step 2: Automatic Pool Allocation
  - Segregate memory based on points-to graph nodes
  - Find lifetime bounds for memory with escape analysis
  - Preserve the points-to-graph-to-pool mapping
- Step 3: Follow-on pool-specific optimizations
  - Use segregation and points-to graph for later optimizations
8 Why Segregate Data Structures?
- Primary Goal: Better compiler information & control
  - Compiler knows where each data structure lives in memory
  - Compiler knows the order of data in memory (in some cases)
  - Compiler knows type info for heap objects (from points-to info)
  - Compiler knows which pools point to which other pools
- Secondary Goal: Better performance
  - Smaller working sets
  - Improved spatial locality
    - Especially if allocation order matches traversal order
  - Sometimes convert irregular strides to regular strides
9 Contributions of this Paper
- First region inference technique for C/C++
  - Previous work required type-safe programs (ML, Java)
  - Previous work focused on memory management
- Region inference driven by pointer analysis
  - Enables handling non-type-safe programs
  - Simplifies handling imperative programs
  - Simplifies further pool + pointer transformations
- New pool-based optimizations
  - Exploit per-pool and pool-specific properties
- Evaluation of impact on memory hierarchy
  - We show that pool allocation reduces working sets
10 Outline
- Introduction & Motivation
- Automatic Pool Allocation Transformation
- Pool Allocation-Based Optimizations
- Pool Allocation & Optimization Performance Impact
- Conclusion
11 Example

struct list { list *Next; int Data; };

list *createnode(int Data) {
  list *New = malloc(sizeof(list));
  New->Data = Data;
  return New;
}

void splitclone(list *L, list **R1, list **R2) {
  if (L == 0) { *R1 = *R2 = 0; return; }
  if (some_predicate(L->Data)) {
    *R1 = createnode(L->Data);
    splitclone(L->Next, &(*R1)->Next, R2);
  } else {
    *R2 = createnode(L->Data);
    splitclone(L->Next, R1, &(*R2)->Next);
  }
}
12 Example

void processlist(list *L) {
  list *A, *B, *tmp;
  // Clone L, splitting nodes into lists A and B.
  splitclone(L, &A, &B);
  processPortion(A);  // Process first list
  processPortion(B);  // Process second list
  // free A list
  while (A) { tmp = A->Next; free(A); A = tmp; }
  // free B list
  while (B) { tmp = B->Next; free(B); B = tmp; }
}

Note that lists A and B use distinct heap memory; it would therefore be beneficial if they were allocated using separate pools of memory.
13 Pool Alloc Runtime Library Interface
- void poolcreate(Pool *PD, uint Size, uint Align)
  - Initializes a pool descriptor (obtains one or more pages of memory using malloc).
- void pooldestroy(Pool *PD)
  - Releases pool memory and destroys the pool descriptor.
- void *poolalloc(Pool *PD, uint numBytes)
- void poolfree(Pool *PD, void *ptr)
- void *poolrealloc(Pool *PD, void *ptr, uint numBytes)
- Interface also includes poolinit_bp(..), poolalloc_bp(..), pooldestroy_bp(..)
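To make the interface concrete, here is a minimal sketch of how such a pool could be implemented. This is not the paper's runtime: for brevity it grabs one malloc'd slab per fresh allocation and keeps a single free list, whereas the real library carves objects out of page-sized slabs that grow exponentially.

```c
#include <stdlib.h>

/* Sketch: a pool is a chain of malloc'd slabs plus a free list of
 * recycled nodes. Struct layout and field names are illustrative. */
typedef struct Slab { struct Slab *Next; } Slab;
typedef struct Pool {
    Slab *Slabs;      /* chain of slabs obtained from malloc */
    void *FreeList;   /* singly linked list of freed nodes */
    unsigned Size;    /* node size this pool was created for */
    unsigned Align;
} Pool;

void poolcreate(Pool *PD, unsigned Size, unsigned Align) {
    PD->Slabs = 0;
    PD->FreeList = 0;
    /* a freed node must be able to hold a next pointer */
    PD->Size = Size < sizeof(void *) ? (unsigned)sizeof(void *) : Size;
    PD->Align = Align;
}

void *poolalloc(Pool *PD, unsigned numBytes) {
    if (numBytes <= PD->Size && PD->FreeList) {  /* reuse a freed node */
        void *Node = PD->FreeList;
        PD->FreeList = *(void **)Node;
        return Node;
    }
    /* fresh memory; the slab header lets pooldestroy find it later */
    unsigned Bytes = numBytes > PD->Size ? numBytes : PD->Size;
    Slab *S = malloc(sizeof(Slab) + Bytes);
    S->Next = PD->Slabs;
    PD->Slabs = S;
    return S + 1;  /* user data starts right after the header */
}

void poolfree(Pool *PD, void *ptr) {  /* push node onto the free list */
    *(void **)ptr = PD->FreeList;
    PD->FreeList = ptr;
}

void pooldestroy(Pool *PD) {  /* release all pool memory at once */
    while (PD->Slabs) {
        Slab *Next = PD->Slabs->Next;
        free(PD->Slabs);
        PD->Slabs = Next;
    }
    PD->FreeList = 0;
}
```

Note how pooldestroy frees every object in the pool in one sweep; this is what makes the later poolfree-elimination optimization safe.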
14 Algorithm Steps
- Generate a DS Graph (points-to graph) for each function
- Insert code to create and destroy pool descriptors for DS nodes whose lifetime does not escape a function
- Add pool descriptor arguments for every DS node that escapes a function
- Replace calls to malloc and free with calls to poolalloc and poolfree
- Further refinements and optimizations
15 Points-To Graph: DS Graph
- Builds a points-to graph for each function in Bottom-Up (BU) order
- Context-sensitive naming of heap objects
  - More precise than the traditional allocation-callsite naming
- A unification-based approach
  - Allows a fast and scalable analysis
  - Ensures every pointer points to one unique node
- Field sensitive
  - Added accuracy
- Also used to compute escape info
16 Example DS Graph

list *createnode(int Data) {
  list *New = malloc(sizeof(list));
  New->Data = Data;
  return New;
}
17 Example DS Graph

void splitclone(list *L, list **R1, list **R2) {
  if (L == 0) { *R1 = *R2 = 0; return; }
  if (some_predicate(L->Data)) {
    *R1 = createnode(L->Data);
    splitclone(L->Next, &(*R1)->Next, R2);
  } else {
    *R2 = createnode(L->Data);
    splitclone(L->Next, R1, &(*R2)->Next);
  }
}
18 Example DS Graph

void processlist(list *L) {
  list *A, *B, *tmp;
  // Clone L, splitting nodes into lists A and B.
  splitclone(L, &A, &B);
  processPortion(A);  // Process first list
  processPortion(B);  // Process second list
  // free A list
  while (A) { tmp = A->Next; free(A); A = tmp; }
  // free B list
  while (B) { tmp = B->Next; free(B); B = tmp; }
}
19 Example Transformation

list *createnode(Pool *PD, int Data) {
  list *New = poolalloc(PD, sizeof(list));
  New->Data = Data;
  return New;
}
20 Example Transformation

void splitclone(Pool *PD1, Pool *PD2,
                list *L, list **R1, list **R2) {
  if (L == 0) { *R1 = *R2 = 0; return; }
  if (some_predicate(L->Data)) {
    *R1 = createnode(PD1, L->Data);
    splitclone(PD1, PD2, L->Next, &(*R1)->Next, R2);
  } else {
    *R2 = createnode(PD2, L->Data);
    splitclone(PD1, PD2, L->Next, R1, &(*R2)->Next);
  }
}
21 Example Transformation

void processlist(list *L) {
  list *A, *B, *tmp;
  Pool PD1, PD2;
  poolcreate(&PD1, sizeof(list), 8);
  poolcreate(&PD2, sizeof(list), 8);
  splitclone(&PD1, &PD2, L, &A, &B);
  processPortion(A);  // Process first list
  processPortion(B);  // Process second list
  // free A list
  while (A) { tmp = A->Next; poolfree(&PD1, A); A = tmp; }
  // free B list
  while (B) { tmp = B->Next; poolfree(&PD2, B); B = tmp; }
  pooldestroy(&PD1);
  pooldestroy(&PD2);
}
22 More Algorithm Details
- Indirect function call handling
  - Partition functions into equivalence classes: if F1 and F2 share a common call site → same class
  - Merge points-to graphs for each equivalence class
  - Apply the previous transformation unchanged
- Global variables pointing to memory nodes
  - Use a global pool variable rather than passing pool descriptors around through function arguments
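The call-site partitioning step above is a classic union-find problem. The sketch below is not the paper's code; the function IDs and call sites are invented for illustration — merging any two functions that can be targets of the same indirect call site into one equivalence class.

```c
/* Hypothetical sketch: functions are numbered 0..NFUNCS-1, and each
 * indirect call site unions all of its possible targets into one
 * equivalence class, whose points-to graphs are then merged. */
#define NFUNCS 5
static int parent[NFUNCS];

static void ec_init(void) {  /* every function starts in its own class */
    for (int i = 0; i < NFUNCS; i++) parent[i] = i;
}

static int ec_find(int x) {  /* find representative, with path halving */
    while (parent[x] != x) {
        parent[x] = parent[parent[x]];
        x = parent[x];
    }
    return x;
}

/* Two possible targets of one call site => same equivalence class. */
static void ec_union(int a, int b) {
    parent[ec_find(a)] = ec_find(b);
}
```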
23 More Algorithm Details
- poolcreate / pooldestroy placement
  - Move calls earlier/later by analyzing the pool's lifetime
  - Reduces memory usage
  - Enables poolfree elimination
- poolfree elimination
  - Eliminate unnecessary poolfree calls
  - i.e. when there are no allocations between a poolfree and the pooldestroy
  - Behaves like static garbage collection
24 Example: poolcreate/pooldestroy placement

Before:

void processlist(list *L) {
  list *A, *B, *tmp;
  Pool PD1, PD2;
  poolcreate(&PD1, sizeof(list), 8);
  poolcreate(&PD2, sizeof(list), 8);
  splitclone(&PD1, &PD2, L, &A, &B);
  processPortion(A);  // Process first list
  processPortion(B);  // Process second list
  // free A list
  while (A) { tmp = A->Next; poolfree(&PD1, A); A = tmp; }
  // free B list
  while (B) { tmp = B->Next; poolfree(&PD2, B); B = tmp; }
  pooldestroy(&PD1);
  pooldestroy(&PD2);
}

After (each pooldestroy moved up to the last use of its pool):

void processlist(list *L) {
  list *A, *B, *tmp;
  Pool PD1, PD2;
  poolcreate(&PD1, sizeof(list), 8);
  poolcreate(&PD2, sizeof(list), 8);
  splitclone(&PD1, &PD2, L, &A, &B);
  processPortion(A);  // Process first list
  processPortion(B);  // Process second list
  // free A list
  while (A) { tmp = A->Next; poolfree(&PD1, A); A = tmp; }
  pooldestroy(&PD1);
  // free B list
  while (B) { tmp = B->Next; poolfree(&PD2, B); B = tmp; }
  pooldestroy(&PD2);
}
25 Example: poolfree Elimination

After pooldestroy placement, each poolfree is followed only by its pool's pooldestroy, with no intervening allocation, so the poolfree calls can be eliminated:

void processlist(list *L) {
  list *A, *B, *tmp;
  Pool PD1, PD2;
  poolcreate(&PD1, sizeof(list), 8);
  poolcreate(&PD2, sizeof(list), 8);
  splitclone(&PD1, &PD2, L, &A, &B);
  processPortion(A);  // Process first list
  processPortion(B);  // Process second list
  // free A list
  while (A) { tmp = A->Next; A = tmp; }
  pooldestroy(&PD1);
  // free B list
  while (B) { tmp = B->Next; B = tmp; }
  pooldestroy(&PD2);
}

The now-empty traversal loops are dead code and can be removed as well:

void processlist(list *L) {
  list *A, *B, *tmp;
  Pool PD1, PD2;
  poolcreate(&PD1, sizeof(list), 8);
  poolcreate(&PD2, sizeof(list), 8);
  splitclone(&PD1, &PD2, L, &A, &B);
  processPortion(A);  // Process first list
  processPortion(B);  // Process second list
  pooldestroy(&PD1);
  pooldestroy(&PD2);
}
26 Outline
- Introduction & Motivation
- Automatic Pool Allocation Transformation
- Pool Allocation-Based Optimizations
- Pool Allocation & Optimization Performance Impact
- Conclusion
27 PAOpts (1/4) and (2/4)
- Selective Pool Allocation
  - Don't pool allocate when not profitable
  - Avoids creating and destroying a pool descriptor (minor), and avoids significant wasted space when the object is much smaller than the smallest internal page
- PoolFree Elimination
  - Remove explicit deallocations that are not needed
28 Looking closely: Anatomy of a heap
- Fully general malloc-compatible allocator
  - Supports malloc/free/realloc/memalign, etc.
  - Standard malloc overheads: object header, alignment padding
  - Allocates slabs of memory with exponential growth
  - By default, all returned pointers are 8-byte aligned
- In memory, things look like this (for 16-byte allocations):

[Figure: 4-byte padding for user-data alignment + 4-byte object header + 16-byte user data, within one 32-byte cache line]
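The slide's picture can be checked with some back-of-the-envelope arithmetic. The 4-byte header and 8-byte alignment below are the slide's illustrative constants, not measurements of any particular allocator:

```c
/* Round n up to the next multiple of align (align is a power of two
 * or any positive divisor; integer arithmetic version). */
static unsigned round_up(unsigned n, unsigned align) {
    return (n + align - 1) / align * align;
}

/* Bytes one malloc'd object really consumes: user data plus the
 * 4-byte header, padded out to 8-byte alignment. */
static unsigned malloc_footprint(unsigned userBytes) {
    const unsigned header = 4, align = 8;
    return round_up(userBytes + header, align);
}
```

So a 16-byte object occupies 24 bytes, and a 32-byte cache line cannot hold two complete objects — which is exactly what the next two optimizations attack.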
29 PAOpts (3/4): Bump Pointer Optzn
- If a pool has no poolfrees:
  - Eliminate the per-object header
  - Eliminate free-list overhead (faster object allocation)
  - Eliminates 4 bytes of inter-object padding
    - Pack objects more densely in the cache
- Interacts with poolfree elimination (PAOpt 2/4)!
  - If poolfree elimination deletes all frees, the bump pointer can apply

[Figure: 16-byte user-data objects packed back to back, two per 32-byte cache line]
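A minimal sketch of the bump-pointer fast path (in the spirit of the _bp interface named on slide 13, though this is an assumption, not the library's actual code): with no poolfree, allocation just advances a cursor, so there is no per-object header and no free list. A single fixed-size slab is assumed for brevity; a real runtime would chain slabs.

```c
#include <stdlib.h>

/* Sketch of a bump-pointer pool: one slab, one moving cursor. */
typedef struct {
    char *Base, *Cur, *End;
} BPPool;

void poolinit_bp(BPPool *PD, unsigned slabBytes) {
    PD->Base = PD->Cur = malloc(slabBytes);
    PD->End = PD->Base + slabBytes;
}

void *poolalloc_bp(BPPool *PD, unsigned numBytes) {
    if (PD->End - PD->Cur < (long)numBytes)
        return 0;               /* slab exhausted; real runtime grows */
    void *Obj = PD->Cur;
    PD->Cur += numBytes;        /* objects pack back to back */
    return Obj;
}

void pooldestroy_bp(BPPool *PD) {
    free(PD->Base);             /* one call releases every object */
    PD->Base = PD->Cur = PD->End = 0;
}
```

Because there is no header and no padding, consecutive 16-byte objects land exactly 16 bytes apart — two per 32-byte cache line, as in the figure.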
30 PAOpts (4/4): Alignment Analysis
- Malloc must return 8-byte aligned memory
  - It has no idea what types will be used in the memory
  - Some machines bus-error on unaligned memory; others suffer performance problems
- Type-safe pools infer a type for the pool
  - Use 4-byte alignment for pools we know don't need 8-byte alignment
  - Reduces inter-object padding

[Figure: 4-byte object headers with 16-byte user-data objects packed at 4-byte alignment across 32-byte cache lines]
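The padding saved by weaker alignment is easy to quantify. Using the slides' illustrative 4-byte header (an assumption carried over from slide 28, not a measured constant):

```c
/* Distance between consecutive objects in a pool: user data plus a
 * 4-byte header, rounded up to the pool's alignment. */
static unsigned obj_stride(unsigned userBytes, unsigned align) {
    const unsigned header = 4;  /* illustrative per-object header */
    unsigned n = userBytes + header;
    return (n + align - 1) / align * align;
}
```

With 8-byte alignment a 16-byte object repeats every 24 bytes; dropping to 4-byte alignment packs objects every 20 bytes, recovering the 4 bytes of inter-object padding the slide mentions.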
31 Outline
- Introduction & Motivation
- Automatic Pool Allocation Transformation
- Pool Allocation-Based Optimizations
- Pool Allocation & Optimization Performance Impact
- Conclusion
32 Implementation Infrastructure
- Link-time transformation using the LLVM Compiler Infrastructure
- Uses the LLVM-to-C back-end; the resulting code is compiled with GCC 3.4.2 -O3
- Evaluated on an AMD Athlon MP 2100
  - 64KB L1, 256KB L2
33 Simple Pool Allocation Statistics
Table 1
35 Compile Time
Table 3
36 Pool Allocation Speedup
- Several programs unaffected by pool allocation (see paper)
- Sizable speedup across many pointer-intensive programs
- Some programs (ft, chomp) an order of magnitude faster
38 Pool Optimization Speedup (FullPA)
- Baseline 1.0 = run time with pool allocation
- Optimizations help all of these programs
- Despite being very simple, they make a big impact
Figure 9 (with a different baseline)
42 Cache/TLB Miss Reduction
- Miss rate measured with perfctr on an AMD Athlon 2100
- Sources of improvement:
  - Defragmented heap
  - Reduced inter-object padding
  - Segregating the heap!
Figure 10
43 Pool Optimization Statistics
Table 2
44 Optimization Contribution
Figure 11
45 Pool Allocation Conclusions
- Segregate heap based on points-to graph
  - Improved memory hierarchy performance
  - Gives the compiler some control over layout
  - Gives the compiler information about locality
- Optimize pools based on per-pool properties
  - Very simple (but useful) optimizations proposed here
  - Optimizations could be applied to other systems
46 The End
47 Backup Slides
48 Table 4
49 Table 5
50 Pool Allocation Example

list *makeList(int Num) {
  list *New = malloc(sizeof(list));
  New->Next = Num ? makeList(Num - 1) : 0;
  New->Data = Num;
  return New;
}

void twoLists() {
  list *X = makeList(10);
  list *Y = makeList(100);
  GL = Y;  // GL is a global pointer
  processList(X);
  processList(Y);
  freeList(X);
  freeList(Y);
}

Change calls to free into calls to poolfree → retain explicit deallocation
51 Pool-Specific Optimizations
- Different data structures have different properties
- Pool allocation segregates the heap
  - Roughly into logical data structures
- Optimize using pool-specific properties
- Examples of properties we look for:
  - Pool is type-homogeneous
  - Pool contains data that only requires 4-byte alignment
  - Opportunities to reduce allocation overhead
52 Benchmarks
- Pointer-intensive programs from the SPECINT 2000, Ptrdist, Olden, and FreeBench suites
- Also: povray, espresso, fpgrowth, llu-bench, chomp
- Benchmarks with custom allocators are not evaluated, except for parser, which they hand-modified