1 Compiler Optimizations for Memory Hierarchy
Chapter 20
http://research.microsoft.com/~trishulc/
http://www.cs.umd.edu/~tseng/
High Performance Compilers for Parallel Computing (Wolfe)
2 Outline
- Motivation
- Instruction Cache Optimizations
- Scalar Replacement of Aggregates
- Data Cache Optimizations
- Where does it fit in a compiler?
- Complementary Techniques
- Preliminary Conclusion
3 Motivation
- Every year
- CPU speed improves by 50-60%
- Main memory speed improves by only about 10%
- So what?
- What can we do?
- Programmers
- Compiler writers
- Operating system designers
- Hardware architects
4 A Typical Machine
Figure: CPU and cache connected by a memory bus to main memory; a bus adaptor links to an I/O bus with I/O controllers for disks, graphics output, and the network.
5 Types of Locality in Programs
- Temporal Locality
- The same data is accessed many times in successive instructions
- Example: while (...) x = x + a;
- Spatial Locality
- Nearby memory locations are accessed many times in successive instructions
- Example: for (i = 1; i < n; i++) x[i] = x[i] + a;
6 Compiler Optimizations for Memory Hierarchy
- Register allocation (Chapter 16)
- Improve locality
- Improve branch prediction
- Software prefetching
- Improve memory allocation
7 A Reasonable Assumption
- The machine has two separate caches
- Instruction cache
- Data cache
- Employ different compiler optimizations
- Instruction cache optimizations
- Data Cache optimizations
8 Instruction-Cache Optimizations
- Instruction Prefetching
- Procedure Sorting
- Procedure and Block Placement
- Intraprocedural Code Positioning (Pettis & Hansen 1990)
- Procedure Splitting
- Tailored for the specific cache policy
9 Instruction Prefetching
- Many machines prefetch the instructions of blocks predicted to be executed
- Some RISC architectures support software prefetch
- iprefetch address (Sparc-V9)
- Criteria for inserting prefetches
- Tprefetch - the latency of a prefetch
- t - the time at which the address becomes known
10 Procedure Sorting
- Interprocedural optimization
- Place the caller and the callee close to each other
- Applies to statically linked procedures
- Create an undirected call graph
- Label arcs with execution frequencies
- Use a greedy approach to place neighboring procedures together
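The greedy step can be sketched as follows. This is a hypothetical, simplified rendering of the Pettis & Hansen idea: every procedure starts as its own chain, arcs are visited in decreasing frequency, and the two endpoint chains are concatenated; a real implementation also decides which ends of the chains to join and keeps the program entry first.

```c
#include <assert.h>
#include <stdlib.h>

#define NPROC 8                      /* hypothetical number of procedures */

struct Arc { int a, b, w; };         /* undirected call-graph arc, frequency w */

static int by_weight_desc(const void *x, const void *y) {
    return ((const struct Arc *)y)->w - ((const struct Arc *)x)->w;
}

/* Greedy procedure sorting (simplified): start with one chain per
   procedure, visit arcs in decreasing frequency, and append one
   endpoint's chain after the other's whenever they differ.  out[]
   receives the final layout; returns the number of procedures placed. */
int sort_procedures(struct Arc *arcs, int narcs, int out[NPROC]) {
    int next[NPROC], chain[NPROC], head[NPROC], tail[NPROC];
    for (int i = 0; i < NPROC; i++) {
        next[i] = -1;                /* -1 marks a chain tail */
        chain[i] = head[i] = tail[i] = i;
    }
    qsort(arcs, narcs, sizeof *arcs, by_weight_desc);
    for (int e = 0; e < narcs; e++) {
        int ca = chain[arcs[e].a], cb = chain[arcs[e].b];
        if (ca == cb) continue;      /* endpoints already placed together */
        next[tail[ca]] = head[cb];   /* splice chain cb after chain ca */
        tail[ca] = tail[cb];
        for (int i = 0; i < NPROC; i++)
            if (chain[i] == cb) chain[i] = ca;
    }
    int n = 0;
    for (int i = 0; i < NPROC; i++)  /* emit each surviving chain in turn */
        if (chain[i] == i)
            for (int p = head[i]; p != -1; p = next[p]) out[n++] = p;
    return n;
}
```

With arcs (P0,P1):100 and (P1,P2):90, the two hottest caller/callee pairs end up adjacent in the layout, which is exactly the property procedure sorting is after.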
11 Example: undirected call graph over procedures P1-P8, with arcs labeled by execution frequency (100, 90, 50, 40, 32, 20, 5, 3, ...)
12 Intraprocedural Code Positioning
- Move infrequently executed code out of the main body
- Straighten the code
- A higher fraction of fetched instructions is actually executed
- Operates on a control-flow graph
- Edges are annotated with execution frequencies
- Cover the graph with traces
13 Intraprocedural Code Positioning
- Input
- Control-flow graph
- Edges are annotated with execution frequencies
- Bottom-up trace selection
- Initially each basic block is a trace
- Combine traces along the maximal-frequency edge from a trace tail to a trace head
- Place traces starting from the entry
- Traces with many outgoing edges appear earlier
- Successive traces are close
- Fix up the code by inserting and deleting branches
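The bottom-up trace selection can be sketched as follows (hypothetical C; a real implementation would also order the resulting traces for placement and do the branch fix-up):

```c
#include <assert.h>
#include <stdlib.h>

#define NBLK 8                       /* hypothetical number of basic blocks */

struct Edge { int src, dst, w; };    /* directed CFG edge with frequency w */

static int by_freq_desc(const void *x, const void *y) {
    return ((const struct Edge *)y)->w - ((const struct Edge *)x)->w;
}

/* Bottom-up trace selection: every block starts as its own trace; edges
   are visited in decreasing frequency, and trace(dst) is appended after
   trace(src) only when src is currently a trace tail and dst a trace
   head.  next[b] gives b's successor within its trace (-1 = tail);
   trace[b] identifies the trace b belongs to. */
void select_traces(struct Edge *edges, int nedges,
                   int next[NBLK], int trace[NBLK]) {
    int head[NBLK], tail[NBLK];
    for (int b = 0; b < NBLK; b++) {
        next[b] = -1;
        trace[b] = head[b] = tail[b] = b;
    }
    qsort(edges, nedges, sizeof *edges, by_freq_desc);
    for (int e = 0; e < nedges; e++) {
        int s = edges[e].src, d = edges[e].dst;
        int ts = trace[s], td = trace[d];
        if (ts == td) continue;              /* joining would form a cycle */
        if (tail[ts] != s || head[td] != d)  /* only tail -> head joins */
            continue;
        next[s] = d;                         /* extend the trace */
        tail[ts] = tail[td];
        for (int b = 0; b < NBLK; b++)
            if (trace[b] == td) trace[b] = ts;
    }
}
```

On a diamond CFG where the edges B0->B1 and B1->B3 carry frequency 90 while B0->B2 and B2->B3 carry 10, the hot path B0, B1, B3 becomes one straight-line trace and the cold block B2 is left as its own trace.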
14 Example: control-flow graph from entry through blocks B1-B9 to exit, with edges annotated by execution frequencies (45, 40, 30, 20, 15, 14, 10, 5, ...)
15 Procedure Splitting
- Enhances the effectiveness of
- Procedure sorting
- Code positioning
- Divides procedures into hot and cold parts
- Place hot code in a separate section
16 Scalar Replacement of Array Elements
- Reduce the number of memory accesses
- Improve the effectiveness of register allocation

do i = 1..N
  do j = 1..N
    do k = 1..N
      C(i, j) = C(i, j) + A(i, k) * B(k, j)
    enddo
  enddo
enddo
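In this loop nest, C(i, j) is invariant with respect to k, so it can be replaced by a scalar that the register allocator can keep in a register: one load and one store per (i, j) instead of a load and a store on every k iteration. A sketch in C (the 3x3 size is hypothetical):

```c
#include <assert.h>

#define SR_N 3                    /* hypothetical matrix size */

/* Matrix multiply with C(i,j) scalar-replaced: the running sum lives in
   the scalar s, so the inner loop touches memory only for A and B, and
   C is read once and written once per (i, j). */
void matmul(double C[SR_N][SR_N], double A[SR_N][SR_N],
            double B[SR_N][SR_N]) {
    for (int i = 0; i < SR_N; i++)
        for (int j = 0; j < SR_N; j++) {
            double s = C[i][j];          /* scalar-replaced element */
            for (int k = 0; k < SR_N; k++)
                s += A[i][k] * B[k][j];
            C[i][j] = s;                 /* single store per element */
        }
}
```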
17 Data-Cache Optimizations
- Loop transformations
- Re-arrange loops in scientific code
- Allow parallel/pipelined/vector execution
- Improve locality
- Data placement of dynamic storage
- Software prefetching
18 Loop Transformations
- Loop interchange
- Loop permutation
- Loop skewing
- Loop fusion
- Loop distribution
- Loop tiling
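As one example from the list, loop interchange alone can turn a strided traversal into a stride-1 traversal. In C's row-major layout the j loop belongs innermost; a sketch (the 4x4 size and the initialization body are hypothetical, and the interchange is legal here because the iterations carry no dependences):

```c
#include <assert.h>

#define LI_R 4
#define LI_C 4

/* Before interchange: consecutive accesses are LI_C elements apart,
   so each access may land on a different cache line. */
void init_before(double x[LI_R][LI_C]) {
    for (int j = 0; j < LI_C; j++)
        for (int i = 0; i < LI_R; i++)
            x[i][j] = i * LI_C + j;
}

/* After loop interchange: the inner loop walks x[i][0..LI_C-1] with
   stride 1, exploiting spatial locality within each cache line. */
void init_after(double x[LI_R][LI_C]) {
    for (int i = 0; i < LI_R; i++)
        for (int j = 0; j < LI_C; j++)
            x[i][j] = i * LI_C + j;
}
```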
19 Tiling
- Perform array operations in small blocks
- Rearrange the loops so that the innermost loops fit in cache (due to fewer iterations)
- Allows reuse in all tiled dimensions
- Padding may be required to avoid cache conflicts
20 do i = 1..N, T
  do j = 1..N, T
    do k = 1..N, T
      do ii = i, min(i+T-1, N)
        do jj = j, min(j+T-1, N)
          do kk = k, min(k+T-1, N)
            C(ii, jj) = C(ii, jj) + A(ii, kk) * B(kk, jj)
          enddo
        enddo
      enddo
    enddo
  enddo
enddo
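The same tiled loop nest in C; the result is unchanged from the untiled version, only the access order differs. The problem size N = 8 and tile size T = 4 here are hypothetical; good tile sizes depend on the cache geometry.

```c
#include <assert.h>

#define TL_N 8                        /* hypothetical problem size */
#define TL_T 4                        /* hypothetical tile size */

static int imin(int a, int b) { return a < b ? a : b; }

/* Tiled matrix multiply: the three outer loops step over T x T tiles,
   the three inner loops work within a tile, so each tile of A, B and C
   is reused while it is still cache-resident. */
void matmul_tiled(double C[TL_N][TL_N], double A[TL_N][TL_N],
                  double B[TL_N][TL_N]) {
    for (int i = 0; i < TL_N; i += TL_T)
        for (int j = 0; j < TL_N; j += TL_T)
            for (int k = 0; k < TL_N; k += TL_T)
                for (int ii = i; ii < imin(i + TL_T, TL_N); ii++)
                    for (int jj = j; jj < imin(j + TL_T, TL_N); jj++)
                        for (int kk = k; kk < imin(k + TL_T, TL_N); kk++)
                            C[ii][jj] += A[ii][kk] * B[kk][jj];
}
```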
21 Dynamic Storage
- Improve spatial locality at allocation time
- Examples
- Use the type of the data structure at malloc time
- Reorganize the heap
- Allocate a tree node close to its parent
- Useful information
- Types
- Traversal patterns
- Research frontier
22 void addList(struct List *list, struct Patient *patient)
{
  struct List *b;
  while (list != NULL) {
    b = list;
    list = list->forward;
  }
  list = (struct List *) ccmalloc(sizeof(struct List), b);
  list->patient = patient;
  list->back = b;
  list->forward = NULL;
  b->forward = list;
}
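The second argument to ccmalloc is evidently a placement hint: allocate the new node near an existing one. A toy sketch of the underlying idea, a bump allocator that hands out successive nodes at adjacent addresses (the interface and sizes here are hypothetical, not ccmalloc's real API):

```c
#include <assert.h>
#include <stddef.h>

/* Toy cache-conscious allocation: a bump allocator returns successive
   objects contiguously, so a list built with it gets the spatial
   locality that a placement hint like ccmalloc's is after. */
static unsigned char arena[4096];
static size_t arena_used;

void *cc_alloc(size_t size) {
    size = (size + 15) & ~(size_t)15;        /* round up to 16 bytes */
    if (arena_used + size > sizeof arena)
        return NULL;                         /* toy: no fallback path */
    void *p = arena + arena_used;
    arena_used += size;
    return p;
}
```

Nodes allocated back-to-back land on the same or adjacent cache lines, so traversing the list in allocation order behaves like a sequential scan.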
23 Software Prefetching
- Requires special hardware support (Alpha, PowerPC, Sparc-V9)
- Reduces the cost of subsequent accesses in loops
- Not limited to scientific code
- More effective when memory bandwidth is large
24 struct node { int val; struct node *next; struct node *jump; } *ptr;
ptr = the_list->head;
while (ptr->next) {
  prefetch(ptr->jump);
  ptr = ptr->next;
}

struct node { int val; struct node *next; } *ptr;
ptr = the_list->head;
while (ptr->next) {
  ptr = ptr->next;
}
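A runnable version of the jump-pointer loop, using GCC/Clang's __builtin_prefetch as the prefetch primitive; the sum loop body and the three-node list are illustrative additions.

```c
#include <assert.h>
#include <stddef.h>

struct node {
    int val;
    struct node *next;
    struct node *jump;    /* points a few links ahead, for prefetching */
};

/* Traverse the list, prefetching through the jump pointer so that the
   miss on a node several links ahead overlaps with current work.  The
   prefetch is only a hint: correctness does not depend on it. */
int sum_list(struct node *ptr) {
    int s = 0;
    while (ptr != NULL) {
        if (ptr->jump != NULL)
            __builtin_prefetch(ptr->jump);
        s += ptr->val;
        ptr = ptr->next;
    }
    return s;
}
```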
25 Textbook Order
constant-folding simplifications
26 LIR (D)
Inline expansion
Leaf-routine optimizations
Shrink wrapping
Machine idioms
Tail merging
Branch optimization and conditional moves
Dead-code elimination
Software pipelining, instruction scheduling 1
Register allocation
Instruction scheduling 2
Intraprocedural I-cache optimizations
Instruction prefetching
Data prefetching
Branch prediction
constant-folding simplifications

27 Link-time optimizations (E)
Interprocedural register allocation
Aggregation of global references
Interprocedural I-cache optimizations
28 Complementary Techniques
- Cache-aware data structures
- Smart hardware
- Cache-aware garbage collection
29 Preliminary Conclusion
- For imperative programs, current I-cache optimizations suffice to get good speed-ups (10%)
- For D-cache optimizations
- Locality optimizations are effective for regular scientific code (46%)
- Software prefetching is effective with large memory bandwidth
- For pointer-chasing programs, more research is needed
- Memory optimization is a profitable area