Cache-Aware Partitioning of Multi-Dimensional Iteration Spaces

1
Cache-Aware Partitioning of Multi-Dimensional Iteration Spaces

Arun Kejariwal (Yahoo! Inc., Santa Clara, CA)
Alexandru Nicolau, Utpal Banerjee, Alexander V. Veidenbaum (UC Irvine, CA)
Constantine D. Polychronopoulos (UI Urbana, IL)
Presenter: Olga Golovanevsky

2
Outline

Motivation
Motivating examples
The Techniques
Results
Conclusion

3
Motivation

Multi-cores becoming ubiquitous
Examples: Intel's Sandy Bridge, IBM's Cell and POWER, Sun's UltraSPARC T family
Number of cores is expected to increase
Large-scale hardware parallelism available
Software challenges
Thread-level application parallelization
How to map threads on different cores
Load balancing
Data affinity

4
Application Parallelization

Loops account for most of application run-time
Loop classification
DOALL: No loop-carried dependence
Amenable to auto-parallelization
Execute iterations in parallel on different threads
Non-DOALL
Thread synchronization support needed for parallelization
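
As an illustrative sketch (not taken from the slides), the two loop classes can be contrasted in Python. The function names `square_all` and `prefix_sum` are hypothetical. In the first loop each iteration touches only its own element, so iterations can be handed to different threads in any order; in the second, each iteration reads a value produced by the previous one, which is a loop-carried dependence. The thread pool only demonstrates that the iterations are independent; it is not meant to show a speedup.

```python
from concurrent.futures import ThreadPoolExecutor

def square_all(a, num_threads=4):
    """DOALL loop: iteration i depends only on a[i], so order does not matter."""
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # pool.map returns results in input order even though the
        # iterations may run concurrently on different threads.
        return list(pool.map(lambda x: x * x, a))

def prefix_sum(a):
    """Non-DOALL loop: iteration i needs the result of iteration i-1."""
    out = []
    running = 0
    for x in a:
        running += x  # loop-carried dependence on `running`
        out.append(running)
    return out
```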

5
Parallel Execution of DOALL Loops

Auto-parallelized
Directive-driven parallelization
Example: OpenMP pragmas
Issue with parallel execution
Load balancing
How to partition the iteration space for best performance?
Naïve way: Partition the iteration space uniformly amongst the different threads
Doesn't yield the best performance!

6
Iteration Space Partitioning

Why is it non-trivial?
Non-rectangular geometry of iteration space
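
A small sketch of why geometry matters, assuming a hypothetical triangular nest (`for i in range(n): for j in range(i + 1)`) that does not appear in the slides. Splitting the outer loop into equal-sized chunks gives each thread the same number of outer iterations but very different amounts of inner work. The helper names `uniform_partition` and `triangular_work` are illustrative only.

```python
def uniform_partition(n_outer, n_threads):
    """Split outer indices 0..n_outer-1 into contiguous chunks of equal size.
    Any remainder goes to the last chunk."""
    chunk = n_outer // n_threads
    bounds = [t * chunk for t in range(n_threads)] + [n_outer]
    return [(bounds[t], bounds[t + 1]) for t in range(n_threads)]

def triangular_work(lo, hi):
    """Inner iterations executed for outer indices lo..hi-1 in the nest
    `for i in range(n): for j in range(i + 1)`."""
    return sum(i + 1 for i in range(lo, hi))

chunks = uniform_partition(8, 2)                       # [(0, 4), (4, 8)]
work = [triangular_work(lo, hi) for lo, hi in chunks]  # [10, 26]
```

With 8 outer iterations and 2 threads, the second thread executes 26 inner iterations against 10 for the first, even though both own 4 outer iterations.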

7
Iteration Space Partitioning (contd.)

Why is it non-trivial?
Use of indirect referencing
Non-uniform cache miss profile
Variation in L1 cache misses (462.libquantum)

8
Iteration Space Partitioning (contd.)

Why is it non-trivial?
Non-perfect multi-way loops
Outermost loop may have multiple loops at the same nesting level
Conditional execution of inner loops

do k = 2, nk-1
  do j = 1, ny            ! first loop
    do i = 1, nx
      read A(k,j,i)
    end do
  end do
  do j = 2, ny-1          ! second loop
    do i = 2, nx-1
      write A(k,j,i)
    end do
  end do
end do
T1 k1
T2 k4
9
Iteration Space Partitioning (contd.)

Why is it non-trivial?
Presence of conditionals in the loop body
Non-uniform workload distribution
Variation in instructions retired (403.gcc)

10
How to partition?

Guiding factors
Partition the outermost loop
Minimizes scheduling overhead
Geometry-aware
Model the iteration space as a convex polytope
Loop indices are affine functions of outer loop indices
Cache-Aware
Account for non-uniform cache miss profile across the iteration space
Account for non-uniform workload distribution across the iteration space
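
To make the geometric model concrete, here is a minimal sketch that assumes a two-deep nest whose inner bounds are affine in the outer index. The nest, the helper `slice_sizes`, and the bound functions are hypothetical, not from the slides. The size of each slice along the outermost axis is the number of inner iterations at that outer index, and the sum over all slices is the discrete volume of the iteration space.

```python
def slice_sizes(n_outer, lower, upper):
    """Inner-iteration count for each outer index i, where the inner
    loop runs over lower(i) <= j < upper(i)."""
    return [max(0, upper(i) - lower(i)) for i in range(n_outer)]

# Hypothetical nest: for i in range(6): for j in range(i, 2*i + 3)
# Both inner bounds are affine functions of the outer index i.
sizes = slice_sizes(6, lambda i: i, lambda i: 2 * i + 3)
volume = sum(sizes)
```

Here the slices grow linearly with `i`, so an equal split of the outer index range would not be an equal split of the volume.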

11
Algorithm

High-level steps
Obtain the cache miss profile
Obtain the workload distribution
Compute the total volume of iteration space
Weighted by cache misses and instructions retired
Given n threads
Compute n-1 breakpoints along the axis corresponding to the outermost loop, wherein
Each breakpoint delimits a set
Each set has equal weighted volume
Map each set on to a different thread
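
The high-level steps above can be sketched in a discrete form. This is an approximation under stated assumptions, not the authors' implementation. The slides do not say how cache misses and instructions retired are combined into one weight, so `combined_weight` and its `miss_penalty` parameter are hypothetical. The sketch also works on per-outer-iteration weights rather than on a continuous polytope volume, and it assumes the weights are non-negative.

```python
from bisect import bisect_left
from itertools import accumulate

def combined_weight(misses, instructions, miss_penalty=10.0):
    """Assumed cost model: instructions retired plus a fixed penalty per
    cache miss, for each outer iteration."""
    return [instr + miss_penalty * m for m, instr in zip(misses, instructions)]

def weighted_breakpoints(weights, n_threads):
    """Return n_threads-1 breakpoints along the outermost-loop axis.

    weights[i] is the cost of outer iteration i. Thread t receives outer
    iterations b[t-1] .. b[t]-1 (with b[-1] = 0 and b[n-1] = len(weights)),
    chosen so each set carries roughly total/n_threads of the weight.
    """
    prefix = list(accumulate(weights))  # cumulative weighted volume
    total = prefix[-1]
    breakpoints = []
    for t in range(1, n_threads):
        target = total * t / n_threads
        # First outer index whose cumulative weight reaches the target;
        # the set ends just after that index.
        breakpoints.append(bisect_left(prefix, target) + 1)
    return breakpoints
```

For triangular weights `[1, 2, ..., 8]` and two threads, `weighted_breakpoints` returns `[6]`, so the sets carry weights 21 and 15. A uniform split at index 4 would give 10 and 26.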

12
Experimental Setup