1 Compiler Optimizations for Memory Hierarchy
Chapter 20
http://research.microsoft.com/~trishulc/
http://www.cs.umd.edu/~tseng/
High Performance Compilers for Parallel Computing (Wolfe)
2 Outline
- Motivation
- Instruction Cache Optimizations
- Scalar Replacement of Aggregates
- Data Cache Optimizations
- Where does it fit in a compiler?
- Complementary Techniques
- Preliminary Conclusion
3 Motivation
- Every year
- CPU speed improves by 50-60%
- Main memory speed improves by only about 10%
- So what?
- What can we do?
- Programmers
- Compiler writers
- Operating system designers
- Hardware architects
4 A Typical Machine
Figure: CPU and cache connected by a memory bus to main memory; a bus adaptor links to an I/O bus with I/O controllers for disks, graphics output, and the network.
5 Types of Locality in Programs
- Temporal Locality
- The same data is accessed many times in successive instructions
- Example: while (...) x = x + a;
- Spatial Locality
- Nearby memory locations are accessed many times in successive instructions
- Example: for (i = 1; i < n; i++) x[i] = x[i] + a;
6 Compiler Optimizations for Memory Hierarchy
- Register allocation (Chapter 16)
- Improve locality
- Improve branch prediction
- Software prefetching
- Improve memory allocation
7 A Reasonable Assumption
- The machine has two separate caches
- Instruction cache
- Data cache
- Employ different compiler optimizations
- Instruction cache optimizations
- Data Cache optimizations
8 Instruction-Cache Optimizations
- Instruction Prefetching
- Procedure Sorting
- Procedure and Block Placement
- Intraprocedural Code Positioning (Pettis & Hansen 1990)
- Procedure Splitting
- Tailored for the specific cache policy
9 Instruction Prefetching
- Many machines prefetch the instructions of blocks predicted to be executed
- Some RISC architectures support software prefetch
- iprefetch address (Sparc-V9)
- Criteria for inserting prefetches
- Tprefetch - the latency of a prefetch
- t - the time at which the address becomes known
10 Procedure Sorting
- Interprocedural optimization
- Place the caller and the callee close to each other
- Applies to statically linked procedures
- Create an undirected call graph
- Label arcs with execution frequencies
- Use a greedy approach to place neighboring procedures together
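The greedy step can be sketched as follows. This is a hypothetical, simplified rendering of the Pettis & Hansen idea: every procedure starts as its own chain, arcs are visited in decreasing frequency, and the two endpoint chains are concatenated; a real implementation also decides which ends of the chains to join and keeps the program entry first.

```c
#include <assert.h>
#include <stdlib.h>

#define NPROC 8                      /* hypothetical number of procedures */

struct Arc { int a, b, w; };         /* undirected call-graph arc, frequency w */

static int by_weight_desc(const void *x, const void *y) {
    return ((const struct Arc *)y)->w - ((const struct Arc *)x)->w;
}

/* Greedy procedure sorting (simplified): start with one chain per
   procedure, visit arcs in decreasing frequency, and append one
   endpoint's chain after the other's whenever they differ.  out[]
   receives the final layout; returns the number of procedures placed. */
int sort_procedures(struct Arc *arcs, int narcs, int out[NPROC]) {
    int next[NPROC], chain[NPROC], head[NPROC], tail[NPROC];
    for (int i = 0; i < NPROC; i++) {
        next[i] = -1;                /* -1 marks a chain tail */
        chain[i] = head[i] = tail[i] = i;
    }
    qsort(arcs, narcs, sizeof *arcs, by_weight_desc);
    for (int e = 0; e < narcs; e++) {
        int ca = chain[arcs[e].a], cb = chain[arcs[e].b];
        if (ca == cb) continue;      /* endpoints already placed together */
        next[tail[ca]] = head[cb];   /* splice chain cb after chain ca */
        tail[ca] = tail[cb];
        for (int i = 0; i < NPROC; i++)
            if (chain[i] == cb) chain[i] = ca;
    }
    int n = 0;
    for (int i = 0; i < NPROC; i++)  /* emit each surviving chain in turn */
        if (chain[i] == i)
            for (int p = head[i]; p != -1; p = next[p]) out[n++] = p;
    return n;
}
```

With arcs (P0,P1):100 and (P1,P2):90, the two hottest caller/callee pairs end up adjacent in the layout, which is exactly the property procedure sorting is after.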
11 Example: undirected call graph over procedures P1-P8, with arcs labeled by execution frequency (100, 90, 50, 40, 32, 20, 5, 3, ...)
12 Intraprocedural Code Positioning
- Move infrequently executed code out of the main body
- Straighten the code
- A higher fraction of fetched instructions is actually executed
- Operates on a control-flow graph
- Edges are annotated with execution frequencies
- Cover the graph with traces
13 Intraprocedural Code Positioning
- Input
- Control-flow graph
- Edges are annotated with execution frequencies
- Bottom-up trace selection
- Initially each basic block is a trace
- Combine traces along the maximal-frequency edge from a trace tail to a trace head
- Place traces starting from the entry
- Traces with many outgoing edges appear earlier
- Successive traces are close
- Fix up the code by inserting and deleting branches
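The bottom-up trace selection can be sketched as follows (hypothetical C; a real implementation would also order the resulting traces for placement and do the branch fix-up):

```c
#include <assert.h>
#include <stdlib.h>

#define NBLK 8                       /* hypothetical number of basic blocks */

struct Edge { int src, dst, w; };    /* directed CFG edge with frequency w */

static int by_freq_desc(const void *x, const void *y) {
    return ((const struct Edge *)y)->w - ((const struct Edge *)x)->w;
}

/* Bottom-up trace selection: every block starts as its own trace; edges
   are visited in decreasing frequency, and trace(dst) is appended after
   trace(src) only when src is currently a trace tail and dst a trace
   head.  next[b] gives b's successor within its trace (-1 = tail);
   trace[b] identifies the trace b belongs to. */
void select_traces(struct Edge *edges, int nedges,
                   int next[NBLK], int trace[NBLK]) {
    int head[NBLK], tail[NBLK];
    for (int b = 0; b < NBLK; b++) {
        next[b] = -1;
        trace[b] = head[b] = tail[b] = b;
    }
    qsort(edges, nedges, sizeof *edges, by_freq_desc);
    for (int e = 0; e < nedges; e++) {
        int s = edges[e].src, d = edges[e].dst;
        int ts = trace[s], td = trace[d];
        if (ts == td) continue;              /* joining would form a cycle */
        if (tail[ts] != s || head[td] != d)  /* only tail -> head joins */
            continue;
        next[s] = d;                         /* extend the trace */
        tail[ts] = tail[td];
        for (int b = 0; b < NBLK; b++)
            if (trace[b] == td) trace[b] = ts;
    }
}
```

On a diamond CFG where the edges B0->B1 and B1->B3 carry frequency 90 while B0->B2 and B2->B3 carry 10, the hot path B0, B1, B3 becomes one straight-line trace and the cold block B2 is left as its own trace.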
14 Example: control-flow graph from entry through blocks B1-B9 to exit, with edges annotated by execution frequencies (45, 40, 30, 20, 15, 14, 10, 5, ...)
15 Procedure Splitting
- Enhances the effectiveness of
- Procedure sorting
- Code positioning
- Divides procedures into hot and cold parts
- Place hot code in a separate section
16 Scalar Replacement of Array Elements
- Reduce the number of memory accesses
- Improve the effectiveness of register allocation

do i = 1..N
  do j = 1..N
    do k = 1..N
      C(i, j) = C(i, j) + A(i, k) * B(k, j)
    enddo
  enddo
enddo
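In this loop nest, C(i, j) is invariant with respect to k, so it can be replaced by a scalar that the register allocator can keep in a register: one load and one store per (i, j) instead of a load and a store on every k iteration. A sketch in C (the 3x3 size is hypothetical):

```c
#include <assert.h>

#define SR_N 3                    /* hypothetical matrix size */

/* Matrix multiply with C(i,j) scalar-replaced: the running sum lives in
   the scalar s, so the inner loop touches memory only for A and B, and
   C is read once and written once per (i, j). */
void matmul(double C[SR_N][SR_N], double A[SR_N][SR_N],
            double B[SR_N][SR_N]) {
    for (int i = 0; i < SR_N; i++)
        for (int j = 0; j < SR_N; j++) {
            double s = C[i][j];          /* scalar-replaced element */
            for (int k = 0; k < SR_N; k++)
                s += A[i][k] * B[k][j];
            C[i][j] = s;                 /* single store per element */
        }
}
```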
17 Data-Cache Optimizations
- Loop transformations
- Re-arrange loops in scientific code
- Allow parallel/pipelined/vector execution
- Improve locality
- Data placement of dynamic storage
- Software prefetching
18 Loop Transformations
- Loop interchange
- Loop permutation
- Loop skewing
- Loop fusion
- Loop distribution
- Loop tiling
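As one example from the list, loop interchange alone can turn a strided traversal into a stride-1 traversal. In C's row-major layout the j loop belongs innermost; a sketch (the 4x4 size and the initialization body are hypothetical, and the interchange is legal here because the iterations carry no dependences):

```c
#include <assert.h>

#define LI_R 4
#define LI_C 4

/* Before interchange: consecutive accesses are LI_C elements apart,
   so each access may land on a different cache line. */
void init_before(double x[LI_R][LI_C]) {
    for (int j = 0; j < LI_C; j++)
        for (int i = 0; i < LI_R; i++)
            x[i][j] = i * LI_C + j;
}

/* After loop interchange: the inner loop walks x[i][0..LI_C-1] with
   stride 1, exploiting spatial locality within each cache line. */
void init_after(double x[LI_R][LI_C]) {
    for (int i = 0; i < LI_R; i++)
        for (int j = 0; j < LI_C; j++)
            x[i][j] = i * LI_C + j;
}
```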
19 Tiling
- Perform array operations in small blocks
- Rearrange the loops so that the innermost loops fit in cache (due to fewer iterations)
- Allows reuse in all tiled dimensions
- Padding may be required to avoid cache conflicts
20 do i = 1..N, T
  do j = 1..N, T
    do k = 1..N, T
      do ii = i, min(i+T-1, N)
        do jj = j, min(j+T-1, N)
          do kk = k, min(k+T-1, N)
            C(ii, jj) = C(ii, jj) + A(ii, kk) * B(kk, jj)
          enddo
        enddo
      enddo
    enddo
  enddo
enddo
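The same tiled loop nest in C; the result is unchanged from the untiled version, only the access order differs. The problem size N = 8 and tile size T = 4 here are hypothetical; good tile sizes depend on the cache geometry.

```c
#include <assert.h>

#define TL_N 8                        /* hypothetical problem size */
#define TL_T 4                        /* hypothetical tile size */

static int imin(int a, int b) { return a < b ? a : b; }

/* Tiled matrix multiply: the three outer loops step over T x T tiles,
   the three inner loops work within a tile, so each tile of A, B and C
   is reused while it is still cache-resident. */
void matmul_tiled(double C[TL_N][TL_N], double A[TL_N][TL_N],
                  double B[TL_N][TL_N]) {
    for (int i = 0; i < TL_N; i += TL_T)
        for (int j = 0; j < TL_N; j += TL_T)
            for (int k = 0; k < TL_N; k += TL_T)
                for (int ii = i; ii < imin(i + TL_T, TL_N); ii++)
                    for (int jj = j; jj < imin(j + TL_T, TL_N); jj++)
                        for (int kk = k; kk < imin(k + TL_T, TL_N); kk++)
                            C[ii][jj] += A[ii][kk] * B[kk][jj];
}
```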
21 Dynamic Storage
- Improve spatial locality at allocation time
- Examples
- Use the type of the data structure at malloc time
- Reorganize the heap
- Allocate a tree node close to its parent
- Useful information
- Types
- Traversal patterns
- Research frontier
22 void addList(struct List *list, struct Patient *patient)
{
  struct List *b;
  while (list != NULL) {
    b = list;
    list = list->forward;
  }
  list = (struct List *) ccmalloc(sizeof(struct List), b);
  list->patient = patient;
  list->back = b;
  list->forward = NULL;
  b->forward = list;
}
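The second argument to ccmalloc is evidently a placement hint: allocate the new node near an existing one. A toy sketch of the underlying idea, a bump allocator that hands out successive nodes at adjacent addresses (the interface and sizes here are hypothetical, not ccmalloc's real API):

```c
#include <assert.h>
#include <stddef.h>

/* Toy cache-conscious allocation: a bump allocator returns successive
   objects contiguously, so a list built with it gets the spatial
   locality that a placement hint like ccmalloc's is after. */
static unsigned char arena[4096];
static size_t arena_used;

void *cc_alloc(size_t size) {
    size = (size + 15) & ~(size_t)15;        /* round up to 16 bytes */
    if (arena_used + size > sizeof arena)
        return NULL;                         /* toy: no fallback path */
    void *p = arena + arena_used;
    arena_used += size;
    return p;
}
```

Nodes allocated back-to-back land on the same or adjacent cache lines, so traversing the list in allocation order behaves like a sequential scan.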
23 Software Prefetching
- Requires special hardware support (Alpha, PowerPC, Sparc-V9)
- Reduces the cost of subsequent accesses in loops
- Not limited to scientific code
- More effective when memory bandwidth is large
24 struct node { int val; struct node *next; struct node *jump; } *ptr;
ptr = the_list->head;
while (ptr->next) {
  prefetch(ptr->jump);
  ptr = ptr->next;
}

struct node { int val; struct node *next; } *ptr;
ptr = the_list->head;
while (ptr->next) {
  ptr = ptr->next;
}
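A runnable version of the jump-pointer loop, using GCC/Clang's __builtin_prefetch as the prefetch primitive; the sum loop body and the three-node list are illustrative additions.

```c
#include <assert.h>
#include <stddef.h>

struct node {
    int val;
    struct node *next;
    struct node *jump;    /* points a few links ahead, for prefetching */
};

/* Traverse the list, prefetching through the jump pointer so that the
   miss on a node several links ahead overlaps with current work.  The
   prefetch is only a hint: correctness does not depend on it. */
int sum_list(struct node *ptr) {
    int s = 0;
    while (ptr != NULL) {
        if (ptr->jump != NULL)
            __builtin_prefetch(ptr->jump);
        s += ptr->val;
        ptr = ptr->next;
    }
    return s;
}
```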
25 Textbook Order
constant-folding simplifications
26 LIR (D)
Inline expansion
Leaf-routine optimizations
Shrink wrapping
Machine idioms
Tail merging
Branch optimization and conditional moves
Dead-code elimination
Software pipelining, instruction scheduling 1
Register allocation
Instruction scheduling 2
Intraprocedural I-cache optimizations
Instruction prefetching
Data prefetching
Branch prediction
constant-folding simplifications

27 Link-time optimizations (E)
Interprocedural register allocation
Aggregation of global references
Interprocedural I-cache optimizations
28 Complementary Techniques
- Cache-aware data structures
- Smart hardware
- Cache-aware garbage collection
29 Preliminary Conclusion
- For imperative programs, current I-cache optimizations suffice to get good speed-ups (10%)
- For D-cache optimizations
- Locality optimizations are effective for regular scientific code (46%)
- Software prefetching is effective with large memory bandwidth
- For pointer-chasing programs, more research is needed
- Memory optimization is a profitable area