Compiler Optimizations for Memory Hierarchy, Chapter 20. http://research.microsoft.com/~trishulc/ http://www.cs.umd.edu/~tseng/ High Performance Compilers for Parallel Computing (Wolfe) - PowerPoint PPT Presentation

Transcript and Presenter's Notes



1
Compiler Optimizations for Memory Hierarchy
Chapter 20
http://research.microsoft.com/~trishulc/
http://www.cs.umd.edu/~tseng/
High Performance Compilers for Parallel Computing (Wolfe)
  • Mooly Sagiv

2
Outline
  • Motivation
  • Instruction Cache Optimizations
  • Scalar Replacement of Aggregates
  • Data Cache Optimizations
  • Where does it fit in a compiler?
  • Complementary Techniques
  • Preliminary Conclusion

3
Motivation
  • Every year
  • CPU speeds improve by 50-60%
  • Main memory speed improves by only about 10%
  • So what?
  • What can we do?
  • Programmers
  • Compiler writers
  • Operating system designers
  • Hardware architectures

4
A Typical Machine
[Diagram: a typical machine. The CPU and its cache sit on the CPU-memory bus together with main memory; a bus adaptor connects to the I/O bus, which hosts I/O controllers for disks, graphics output, and the network.]
5
Types of Locality in Programs
  • Temporal Locality
  • The same data is accessed many times in
    successive instructions
  • Example
  • while (...)
  •   x = x + a;
  • Spatial Locality
  • Nearby memory locations are accessed many times
    in successive instructions
  • Example: for (i = 1; i < n; i++)
      x[i] = x[i] + a;

6
Compiler Optimizations forMemory Hierarchy
  • Register allocation (Chapter 16)
  • Improve locality
  • Improve branch prediction
  • Software prefetching
  • Improve memory allocation

7
A Reasonable Assumption
  • The machine has two separate caches
  • Instruction cache
  • Data cache
  • Employ different compiler optimizations
  • Instruction cache optimizations
  • Data Cache optimizations

8
Instruction-Cache Optimizations
  • Instruction Prefetching
  • Procedure Sorting
  • Procedure and Block Placement
  • Intraprocedural Code Positioning (Pettis & Hansen
    1990)
  • Procedure Splitting
  • Tailored for a specific cache policy

9
Instruction Prefetching
  • Many machines prefetch the instructions of blocks
    predicted to be executed
  • Some RISC architectures support software
    prefetch
  • iprefetch address (Sparc-V9)
  • Criteria for inserting prefetches
  • Tprefetch - the latency of a prefetch
  • t - the time at which the address is known

10
Procedure Sorting
  • Interprocedural Optimization
  • Place the caller and the callee close to each
    other
  • Applies to statically linked procedures
  • Create undirected call graph
  • Label arcs with execution frequencies
  • Use a greedy approach to select neighboring
    procedures

11
[Figure: example undirected call graph over procedures P1-P8, arcs labeled with execution frequencies (100, 90, 50, 50, 40, 40, 32, 20, 5, 3), used to drive the greedy procedure sorting.]
12
Intraprocedural Code Positioning
  • Move infrequently executed code out of the main body
  • Straighten the code
  • Higher fraction of fetched instructions are
    actually executed
  • Operates on a control flow graph
  • Edges are annotated with execution frequencies
  • Cover the graph with traces

13
Intraprocedural Code Positioning
  • Input
  • Control flow graph
  • Edges are annotated with execution frequencies
  • Bottom-up trace selection
  • Initially each basic block is a trace
  • Combine two traces when the heaviest remaining edge
    runs from the tail of one to the head of the other
  • Place traces from entry
  • Traces with many outgoing edges appear earlier
  • Successive traces are close
  • Fix up the code by inserting and deleting branches

14
[Figure: example control-flow graph from entry to exit over basic blocks B1-B9, edges annotated with execution frequencies (45, 40, 30, 20, 15, 14, 14, 10, 10, 5), illustrating bottom-up trace selection.]
15
Procedure Splitting
  • Enhances the effectiveness of
  • Procedure sorting
  • Code positioning
  • Divides procedures into hot and cold parts
  • Place hot code in a separate section

16
Scalar Replacement of Array Elements
  • Reduce the number of memory accesses
  • Improve the effectiveness of register allocation

do i = 1, N
  do j = 1, N
    do k = 1, N
      C(i, j) = C(i, j) + A(i, k) * B(k, j)
    enddo
  enddo
enddo
17
Data-Cache Optimizations
  • Loop transformations
  • Re-arrange loops in scientific code
  • Allow parallel/pipelined/vector execution
  • Improve locality
  • Data placement of dynamic storage
  • Software prefetching

18
Loop Transformations
  • Loop interchange
  • Loop permutation
  • Loop skewing
  • Loop fusion
  • Loop distribution
  • Loop tiling

19
Tiling
  • Perform array operations in small blocks
  • Rearrange the loops so that the data touched by the
    innermost loops fits in cache (due to fewer iterations)
  • Allow reuse in all tiled dimensions
  • Padding may be required to avoid cache conflicts

20
do i = 1, N, T
  do j = 1, N, T
    do k = 1, N, T
      do ii = i, min(i+T-1, N)
        do jj = j, min(j+T-1, N)
          do kk = k, min(k+T-1, N)
            C(ii, jj) = C(ii, jj) + A(ii, kk) * B(kk, jj)
          enddo
        enddo
      enddo
    enddo
  enddo
enddo
21
Dynamic storage
  • Improve spatial locality at allocation time
  • Examples
  • Use the type of the data structure at malloc time
  • Reorganize the heap
  • Allocate a tree node close to its parent
  • Useful information
  • Types
  • Traversal patterns
  • Research frontier

22
void addList(struct List *list, struct Patient *patient) {
  struct List *b;
  while (list != NULL) {
    b = list;
    list = list->forward;
  }
  list = (struct List *) ccmalloc(sizeof(struct List), b);
  list->patient = patient;
  list->back = b;
  list->forward = NULL;
  b->forward = list;
}
23
Software Prefetching
  • Requires special hardware (Alpha, PowerPC,
    Sparc-V9)
  • Reduces the cost of subsequent accesses in loops
  • Not limited to scientific code
  • More effective when memory bandwidth is large

24
struct node { int val; struct node *next;
              struct node *jump; };

ptr = the_list->head;
while (ptr->next) {
  prefetch(ptr->jump);
  ptr = ptr->next;
}

struct node { int val; struct node *next; };

ptr = the_list->head;
while (ptr->next) {
  ptr = ptr->next;
}
25
Textbook Order
[Figure: the textbook's overall order of optimization phases; constant-folding and algebraic simplifications apply throughout.]
26
LIR (D)
  • Inline expansion
  • Leaf-routine optimizations
  • Shrink wrapping
  • Machine idioms
  • Tail merging
  • Branch optimization and conditional moves
  • Dead-code elimination
  • Software pipelining, Instruction Scheduling 1
  • Register allocation
  • Instruction Scheduling 2
  • Intraprocedural I-cache optimizations
  • Instruction prefetching
  • Data prefetching
  • Branch prediction
(constant-folding simplifications apply throughout)
27
Link-time optimizations (E)
  • Interprocedural register allocation
  • Aggregation of global references
  • Interprocedural I-cache optimizations
28
Complementary Techniques
  • Cache aware data structures
  • Smart hardware
  • Cache aware garbage collection

29
Preliminary Conclusion
  • For imperative programs, current I-cache
    optimizations suffice to get good speed-ups (10%)
  • For D-cache optimizations
  • Locality optimizations are effective for regular
    scientific code (46%)
  • Software prefetching is effective given large
    memory bandwidth
  • For pointer-chasing programs, more research is
    needed
  • Memory optimization is a profitable area