Automatic Pool Allocation: Improving Performance by Controlling Data Structure Layout in the Heap

1
Automatic Pool Allocation: Improving Performance
by Controlling Data Structure Layout in the Heap
  • Paper by
  • Chris Lattner and Vikram Adve
  • University of Illinois at Urbana-Champaign
  • Best Paper Award at PLDI 2005

Presented by Jeff Da Silva CARG - Aug 2nd 2005
2
Motivation
  • Computer architecture and compiler research has
    primarily focused on analyzing and optimizing
    memory access patterns for dense arrays rather
    than for pointer-based data structures.
  • e.g. caches, prefetching, loop transformations,
    etc.
  • Why?
  • Compilers have precise knowledge of the runtime
    layout and traversal patterns associated with
    arrays.
  • The layout of heap-allocated data structures and
    their traversal patterns can be difficult to
    statically predict (generally it's not worth the
    effort).

3
Intuition
  • Improving a dynamic data structure's spatial
    locality through program analysis (like shape
    analysis) is probably too difficult and might not
    be the best approach.
  • What if you can somehow influence the layout
    dynamically so that the data structure is
    allocated intelligently and possibly appears like
    a dense array?
  • A new approach: develop a technique that
    operates at the macroscopic level
  • i.e. at the level of the entire data structure
    rather than individual pointers or objects

4
What is the problem?
(figure: a heap in which List 1 nodes, List 2 nodes,
and Tree nodes are interleaved in memory)
7
Their Approach: Segregate the Heap
  • Step 1: Memory Usage Analysis
  • Build context-sensitive points-to graphs for the
    program
  • We use a fast unification-based algorithm
  • Step 2: Automatic Pool Allocation
  • Segregate memory based on points-to graph nodes
  • Find lifetime bounds for memory with escape
    analysis
  • Preserve the points-to-graph-to-pool mapping
  • Step 3: Follow-on pool-specific optimizations
  • Use segregation and points-to graph for later
    optimizations

8
Why Segregate Data Structures?
  • Primary Goal: Better compiler information and
    control
  • Compiler knows where each data structure lives in
    memory
  • Compiler knows order of data in memory (in some
    cases)
  • Compiler knows type info for heap objects (from
    points-to info)
  • Compiler knows which pools point to which other
    pools
  • Second Goal: Better performance
  • Smaller working sets
  • Improved spatial locality
  • Especially if allocation order matches traversal
    order
  • Sometimes convert irregular strides to regular
    strides

9
Contributions of this Paper
  • First region inference technique for C/C++
  • Previous work required type-safe programs (ML,
    Java)
  • Previous work focused on memory management
  • Region inference driven by pointer analysis
  • Enables handling non-type-safe programs
  • Simplifies handling imperative programs
  • Simplifies further pool and pointer
    transformations
  • New pool-based optimizations
  • Exploit per-pool and pool-specific properties
  • Evaluation of impact on memory hierarchy
  • We show that pool allocation reduces working sets

10
Outline
  • Introduction & Motivation
  • Automatic Pool Allocation Transformation
  • Pool Allocation-Based Optimizations
  • Pool Allocation & Optimization Performance Impact
  • Conclusion

11
Example
  • struct list { list *Next; int Data; };
  • list *createnode(int Data) {
  •   list *New = malloc(sizeof(list));
  •   New->Data = Data;
  •   return New;
  • }
  • void splitclone(list *L, list **R1, list **R2) {
  •   if (L == 0) { *R1 = *R2 = 0; return; }
  •   if (some_predicate(L->Data)) {
  •     *R1 = createnode(L->Data);
  •     splitclone(L->Next, &(*R1)->Next, R2);
  •   } else {
  •     *R2 = createnode(L->Data);
  •     splitclone(L->Next, R1, &(*R2)->Next);
  •   }
  • }

12
Example
  • void processlist(list *L) {
  •   list *A, *B, *tmp;
  •   // Clone L, splitting nodes into lists A and B.
  •   splitclone(L, &A, &B);
  •   processPortion(A);  // Process first list
  •   processPortion(B);  // Process second list
  •   // free A list
  •   while (A) { tmp = A->Next; free(A); A = tmp; }
  •   // free B list
  •   while (B) { tmp = B->Next; free(B); B = tmp; }
  • }

Note that lists A and B use distinct heap memory;
it would therefore be beneficial to allocate them
from separate pools of memory.
13
Pool Alloc Runtime Library Interface
  • void poolcreate(Pool *PD, uint Size, uint Align)
  • Initializes a pool descriptor (obtains one or
    more pages of memory using malloc).
  • void pooldestroy(Pool *PD)
  • Releases pool memory and destroys the pool
    descriptor.
  • void *poolalloc(Pool *PD, uint numBytes)
  • void poolfree(Pool *PD, void *ptr)
  • void *poolrealloc(Pool *PD, void *ptr, uint
    numBytes)
  • Interface also includes
  • poolinit_bp(..), poolalloc_bp(..),
    pooldestroy_bp(..).

14
Algorithm Steps
  1. Generate a DS Graph (Points-to Graphs) for each
    function
  2. Insert code to create and destroy pool
    descriptors for DS nodes whose lifetime does not
    escape a function.
  3. Add pool descriptor arguments for every DS node
    that escapes a function.
  4. Replace malloc and free with calls to poolalloc
    and poolfree.
  5. Further refinements and optimizations

15
Points-To Graph: the DS Graph
  • Builds a points-to graph for each function in
    bottom-up (BU) order
  • Context-sensitive naming of heap objects
  • More precise than naming by allocation call site
    alone
  • A unification-based approach
  • Allows for a fast and scalable analysis
  • Ensures every pointer points to one unique node
  • Field sensitive
  • Added accuracy
  • Also used to compute escape info

16
Example DS Graph
  • list *createnode(int Data) {
  •   list *New = malloc(sizeof(list));
  •   New->Data = Data;
  •   return New;
  • }

17
Example DS Graph
  • void splitclone(list *L, list **R1, list **R2) {
  •   if (L == 0) { *R1 = *R2 = 0; return; }
  •   if (some_predicate(L->Data)) {
  •     *R1 = createnode(L->Data);
  •     splitclone(L->Next, &(*R1)->Next, R2);
  •   } else {
  •     *R2 = createnode(L->Data);
  •     splitclone(L->Next, R1, &(*R2)->Next);
  •   }
  • }

18
Example DS Graph
  • void processlist(list *L) {
  •   list *A, *B, *tmp;
  •   // Clone L, splitting nodes into lists A and B.
  •   splitclone(L, &A, &B);
  •   processPortion(A);  // Process first list
  •   processPortion(B);  // Process second list
  •   // free A list
  •   while (A) { tmp = A->Next; free(A); A = tmp; }
  •   // free B list
  •   while (B) { tmp = B->Next; free(B); B = tmp; }
  • }

19
Example Transformation
  • list *createnode(Pool *PD, int Data) {
  •   list *New = poolalloc(PD, sizeof(list));
  •   New->Data = Data;
  •   return New;
  • }

20
Example Transformation
  • void splitclone(Pool *PD1, Pool *PD2,
  •                 list *L, list **R1, list **R2) {
  •   if (L == 0) { *R1 = *R2 = 0; return; }
  •   if (some_predicate(L->Data)) {
  •     *R1 = createnode(PD1, L->Data);
  •     splitclone(PD1, PD2, L->Next, &(*R1)->Next, R2);
  •   } else {
  •     *R2 = createnode(PD2, L->Data);
  •     splitclone(PD1, PD2, L->Next, R1, &(*R2)->Next);
  •   }
  • }

21
Example Transformation
  • void processlist(list *L) {
  •   list *A, *B, *tmp;
  •   Pool PD1, PD2;
  •   poolcreate(&PD1, sizeof(list), 8);
  •   poolcreate(&PD2, sizeof(list), 8);
  •   splitclone(&PD1, &PD2, L, &A, &B);
  •   processPortion(A);  // Process first list
  •   processPortion(B);  // Process second list
  •   // free A list
  •   while (A) { tmp = A->Next; poolfree(&PD1, A); A = tmp; }
  •   // free B list
  •   while (B) { tmp = B->Next; poolfree(&PD2, B); B = tmp; }
  •   pooldestroy(&PD1);
  •   pooldestroy(&PD2);
  • }

22
More Algorithm Details
  • Indirect Function Call Handling
  • Partition functions into equivalence classes
  • If F1, F2 have a common call site → same class
  • Merge points-to graphs for each equivalence class
  • Apply previous transformation unchanged
  • Global variables pointing to memory nodes
  • Use a global pool variable rather than passing
    pool descriptors around through function args.

23
More Algorithm Details
  • poolcreate / pooldestroy placement
  • Move calls earlier/later by analyzing the pool's
    lifetime
  • Reduces memory usage
  • Enables poolfree elimination
  • poolfree elimination
  • Eliminate unnecessary poolfree calls
  • No allocations between poolfree and pooldestroy
  • Behaves like static garbage collection

24
Example: poolcreate/pooldestroy placement
  • void processlist(list *L) {
  •   list *A, *B, *tmp;
  •   Pool PD1, PD2;
  •   poolcreate(&PD1, sizeof(list), 8);
  •   poolcreate(&PD2, sizeof(list), 8);
  •   splitclone(&PD1, &PD2, L, &A, &B);
  •   processPortion(A);  // Process first list
  •   processPortion(B);  // Process second list
  •   // free A list
  •   while (A) { tmp = A->Next; poolfree(&PD1, A); A = tmp; }
  •   // free B list
  •   while (B) { tmp = B->Next; poolfree(&PD2, B); B = tmp; }
  •   pooldestroy(&PD1);
  •   pooldestroy(&PD2);
  • }

After placement, each pooldestroy moves up to the end
of its pool's lifetime:

void processlist(list *L) {
  list *A, *B, *tmp;
  Pool PD1, PD2;
  poolcreate(&PD1, sizeof(list), 8);
  poolcreate(&PD2, sizeof(list), 8);
  splitclone(&PD1, &PD2, L, &A, &B);
  processPortion(A);  // Process first list
  processPortion(B);  // Process second list
  // free A list
  while (A) { tmp = A->Next; poolfree(&PD1, A); A = tmp; }
  pooldestroy(&PD1);
  // free B list
  while (B) { tmp = B->Next; poolfree(&PD2, B); B = tmp; }
  pooldestroy(&PD2);
}

25
Example: poolfree Elimination
Before (after pooldestroy placement):

void processlist(list *L) {
  list *A, *B, *tmp;
  Pool PD1, PD2;
  poolcreate(&PD1, sizeof(list), 8);
  poolcreate(&PD2, sizeof(list), 8);
  splitclone(&PD1, &PD2, L, &A, &B);
  processPortion(A);  // Process first list
  processPortion(B);  // Process second list
  // free A list
  while (A) { tmp = A->Next; poolfree(&PD1, A); A = tmp; }
  pooldestroy(&PD1);
  // free B list
  while (B) { tmp = B->Next; poolfree(&PD2, B); B = tmp; }
  pooldestroy(&PD2);
}

After poolfree elimination (each poolfree is
immediately followed by pooldestroy, so it is
redundant):

void processlist(list *L) {
  list *A, *B, *tmp;
  Pool PD1, PD2;
  poolcreate(&PD1, sizeof(list), 8);
  poolcreate(&PD2, sizeof(list), 8);
  splitclone(&PD1, &PD2, L, &A, &B);
  processPortion(A);  // Process first list
  processPortion(B);  // Process second list
  // free A list
  while (A) { tmp = A->Next; A = tmp; }
  pooldestroy(&PD1);
  // free B list
  while (B) { tmp = B->Next; B = tmp; }
  pooldestroy(&PD2);
}

After removing the now-empty traversal loops:

void processlist(list *L) {
  list *A, *B, *tmp;
  Pool PD1, PD2;
  poolcreate(&PD1, sizeof(list), 8);
  poolcreate(&PD2, sizeof(list), 8);
  splitclone(&PD1, &PD2, L, &A, &B);
  processPortion(A);  // Process first list
  processPortion(B);  // Process second list
  pooldestroy(&PD1);
  pooldestroy(&PD2);
}
26
Outline
  • Introduction & Motivation
  • Automatic Pool Allocation Transformation
  • Pool Allocation-Based Optimizations
  • Pool Allocation & Optimization Performance Impact
  • Conclusion

27
PAOpts (1/4) and (2/4)
  • Selective Pool Allocation
  • Don't pool allocate when not profitable
  • Avoids creating and destroying a pool descriptor
    (minor) and avoids significant wasted space when
    the object is much smaller than the smallest
    internal page.
  • PoolFree Elimination
  • Remove explicit deallocations that are not
    needed

28
Looking closely: Anatomy of a heap
  • Fully general malloc-compatible allocator
  • Supports malloc/free/realloc/memalign etc.
  • Standard malloc overheads: object header,
    alignment padding
  • Allocates slabs of memory with exponential growth
  • By default, all returned pointers are 8-byte
    aligned
  • In memory, things look like this (16-byte allocs):

(diagram: 4-byte padding for user-data alignment,
4-byte object header, and 16-byte user data
together filling one 32-byte cache line)
29
PAOpts (3/4): Bump Pointer Optzn
  • If a pool has no poolfrees
  • Eliminate per-object header
  • Eliminate freelist overhead (faster object
    allocation)
  • Eliminates 4 bytes of inter-object padding
  • Pack objects more densely in the cache
  • Interacts with poolfree elimination (PAOpt 2/4)!
  • If poolfree elim deletes all frees, BumpPtr can
    apply

(diagram: four 16-byte user-data objects packed
back-to-back, two per 32-byte cache line, with no
headers or padding)
30
PAOpts (4/4): Alignment Analysis
  • Malloc must return 8-byte aligned memory
  • It has no idea what types will be used in the
    memory
  • Some machines raise bus errors; others suffer
    performance penalties for unaligned accesses
  • Type-safe pools: infer a type for the pool
  • Use 4-byte alignment for pools we know don't
    need 8-byte alignment
  • Reduces inter-object padding

(diagram: a 4-byte object header followed by four
16-byte user-data objects packed across 32-byte
cache lines)
31
Outline
  • Introduction & Motivation
  • Automatic Pool Allocation Transformation
  • Pool Allocation-Based Optimizations
  • Pool Allocation & Optimization Performance Impact
  • Conclusion

32
Implementation Infrastructure
  • Link-time transformation using the LLVM Compiler
    Infrastructure
  • Uses the LLVM-to-C back-end; the resulting code
    is compiled with GCC 3.4.2 -O3
  • Evaluated on an AMD Athlon MP 2100
  • 64KB L1, 256KB L2

33
Simple Pool Allocation Statistics
Table 1
35
Compile Time
Table 3
36
Pool Allocation Speedup
  • Several programs unaffected by pool allocation
    (see paper)
  • Sizable speedup across many pointer-intensive
    programs
  • Some programs (ft, chomp) are an order of
    magnitude faster

38
Pool Optimization Speedup (FullPA)
  • Baseline (1.0) = run time with pool allocation
  • Optimizations help all of these programs
  • Despite being very simple, they make a big impact

Figure 9 (with different baseline)
42
Cache/TLB miss reduction
Miss rate measured with perfctr on an AMD Athlon
2100
  • Sources of improvement:
  • Defragmented heap
  • Reduced inter-object padding
  • Segregating the heap!

Figure 10
43
Pool Optimization Statistics
Table 2
44
Optimization Contribution
Figure 11
45
Pool Allocation Conclusions
  • Segregate heap based on points-to graph
  • Improved Memory Hierarchy Performance
  • Give compiler some control over layout
  • Give compiler information about locality
  • Optimize pools based on per-pool properties
  • Very simple (but useful) optimizations proposed
    here
  • Optimizations could be applied to other systems

46
The End
47
Backup Slides
48
Table 4
49
Table 5
50
Pool Allocation Example
  • list *makeList(int Num) {
  •   list *New = malloc(sizeof(list));
  •   New->Next = Num ? makeList(Num-1) : 0;
  •   New->Data = Num;
  •   return New;
  • }
  • int twoLists( ) {
  •   list *X = makeList(10);
  •   list *Y = makeList(100);
  •   GL = Y;
  •   processList(X);
  •   processList(Y);
  •   freeList(X);
  •   freeList(Y);
  • }

Change calls to free into calls to poolfree →
retain explicit deallocation
51
Pool Specific Optimizations
  • Different Data Structures Have Different
    Properties
  • Pool allocation segregates heap
  • Roughly into logical data structures
  • Optimize using pool-specific properties
  • Examples of properties we look for:
  • Pool is type-homogeneous
  • Pool contains data that only requires 4-byte
    alignment
  • Opportunities to reduce allocation overhead

52
Benchmarks
  • Pointer-intensive SPECINT 2000, Ptrdist, Olden,
    FreeBench suites
  • povray, espresso, fpgrowth, llu-bench, chomp
  • Benchmarks with custom allocators are not
    evaluated, except for parser, which they
    hand-modified.