Cache-Aware Partitioning of Multi-Dimensional Iteration Spaces

1
Cache-Aware Partitioning of Multi-Dimensional
Iteration Spaces
  • Arun Kejariwal (Yahoo! Inc., Santa Clara, CA)
  • Alexandru Nicolau, Utpal Banerjee, Alexander V.
    Veidenbaum (UC Irvine, CA)
  • Constantine D. Polychronopoulos (UI Urbana, IL)
  • Presenter: Olga Golovanevsky

2
Outline
  • Motivation
  • Motivating examples
  • The Techniques
  • Results
  • Conclusion

3
Motivation
  • Multi-cores becoming ubiquitous
  • Examples: Intel's Sandy Bridge, IBM's Cell and
    POWER, Sun's UltraSPARC T family
  • Number of cores is expected to increase
  • Large-scale hardware parallelism available
  • Software challenges
  • Thread-level application parallelization
  • How to map threads on different cores
  • Load balancing
  • Data affinity

4
Application Parallelization
  • Loops account for most of application run-time
  • Loop classification
  • DOALL: No loop-carried dependence
  • Amenable to auto-parallelization
  • Execute iterations in parallel on different
    threads
  • Non-DOALL
  • Thread synchronization support needed for
    parallelization
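
The DOALL / non-DOALL distinction can be illustrated with a minimal Python sketch. The function names and data below are illustrative assumptions, not taken from the presentation. A DOALL body gives the same result whatever order its iterations run in; a body with a loop-carried dependence does not.

```python
def doall_body(a, out, i):
    # Touches only element i: no loop-carried dependence.
    out[i] = 2 * a[i]

def carried_body(a, out, i):
    # Reads the value written by iteration i-1: loop-carried dependence.
    out[i] = a[i] if i == 0 else out[i - 1] + a[i]

def run(body, a, order):
    """Execute the loop body once per index, in the given order."""
    out = [0] * len(a)
    for i in order:
        body(a, out, i)
    return out

a = [1, 2, 3, 4]
forward = list(range(len(a)))
backward = forward[::-1]

print(run(doall_body, a, forward))     # [2, 4, 6, 8]
print(run(doall_body, a, backward))    # [2, 4, 6, 8]
print(run(carried_body, a, forward))   # [1, 3, 6, 10]
print(run(carried_body, a, backward))  # [1, 2, 3, 4]
```

Because the DOALL result is order-independent, its iterations can be handed to different threads without synchronization.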

5
Parallel Execution of DOALL Loops
  • Auto-parallelized
  • Directive-driven parallelization
  • Example: OpenMP pragmas
  • Issue with parallel execution
  • Load balancing
  • How to partition the iteration space for best
    performance?
  • Naïve way: partition the iteration space
    uniformly amongst the different threads
  • Doesn't yield the best performance!

6
Iteration Space Partitioning
  • Why is it non-trivial?
  • Non-rectangular geometry of iteration space
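
To see why a non-rectangular iteration space defeats uniform partitioning, consider a rough sketch with a hypothetical triangular nest, where the inner loop runs `i + 1` times for outer index `i`. The nest shape, sizes, and helper names are assumptions chosen for illustration.

```python
def inner_trip_count(i):
    # Hypothetical triangular nest: inner loop runs j = 0..i.
    return i + 1

def uniform_chunks(n_outer, n_threads):
    """Split outer iterations 0..n_outer-1 into contiguous near-equal chunks."""
    base, extra = divmod(n_outer, n_threads)
    chunks, start = [], 0
    for t in range(n_threads):
        size = base + (1 if t < extra else 0)
        chunks.append(range(start, start + size))
        start += size
    return chunks

def work_per_thread(chunks):
    """Total inner-loop iterations executed by each thread."""
    return [sum(inner_trip_count(i) for i in chunk) for chunk in chunks]

chunks = uniform_chunks(12, 3)
work = work_per_thread(chunks)
print(work)  # [10, 26, 42]
```

Each thread receives four outer iterations, yet the last thread executes more than four times the work of the first.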

7
Iteration Space Partitioning (contd.)
  • Why is it non-trivial?
  • Use of indirect referencing
  • Non-uniform cache miss profile
  • Variation in L1 cache misses (462.libquantum)

8
Iteration Space Partitioning (contd.)
  • Why is it non-trivial?
  • Non-perfect multi-way loops
  • Outermost loop may have multiple loops at the
    same nesting level
  • Conditional execution of inner loops

    do k = 2, nk-1
      do j = 1, ny          ! first loop
        do i = 1, nx
          read A(k,j,i)
        end do
      end do
      do j = 2, ny-1        ! second loop
        do i = 2, nx-1
          write A(k,j,i)
        end do
      end do
    end do

    Figure labels: T1 (k1), T2 (k4)

9
Iteration Space Partitioning (contd.)
  • Why is it non-trivial?
  • Presence of conditionals in the loop body
  • Non-uniform workload distribution
  • Variation in instructions retired (403.gcc)

10
How to partition?
  • Guiding factors
  • Partition the outermost loop
  • Minimizes scheduling overhead
  • Geometry-aware
  • Model the iteration space as a convex polytope
  • Loop indices are affine functions of outer loop
    indices
  • Cache-Aware
  • Account for non-uniform cache miss profile across
    the iteration space
  • Account for non-uniform workload distribution
    across the iteration space
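
As a small sketch of the geometry-aware part, the following counts iteration points when the inner-loop bounds are affine functions of the outer index. The two-level nest, the coefficient encoding `(constant, slope)`, and the function names are assumptions made for illustration; the presentation does not specify how the volume is computed.

```python
def slice_volume(i, lo_coef, hi_coef):
    """Inner-loop iterations for outer index i, where the inclusive bounds
    are affine in i: lo = lo_coef[0] + lo_coef[1]*i, likewise for hi."""
    lo = lo_coef[0] + lo_coef[1] * i
    hi = hi_coef[0] + hi_coef[1] * i
    return max(0, hi - lo + 1)

def total_volume(i_lo, i_hi, lo_coef, hi_coef):
    """Number of iteration points in the whole two-level nest."""
    return sum(slice_volume(i, lo_coef, hi_coef) for i in range(i_lo, i_hi + 1))

# Hypothetical nest: do i = 1, 6 ; do j = i, 2*i
volumes = [slice_volume(i, (0, 1), (0, 2)) for i in range(1, 7)]
print(volumes)                              # [2, 3, 4, 5, 6, 7]
print(total_volume(1, 6, (0, 1), (0, 2)))   # 27
```

The per-slice volumes along the outermost axis are the quantities a partitioner would then weight and balance.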

11
Algorithm
  • High-level steps
  • Obtain the cache miss profile
  • Obtain the workload distribution
  • Compute the total volume of iteration space
  • Weighted by cache misses and instructions retired
  • Given n threads
  • Compute n-1 breakpoints along the axis
    corresponding to the outermost loop, such that
  • Each breakpoint delimits a set
  • Each set has equal weighted volume
  • Map each set on to a different thread
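
The high-level steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the profiles are given per outermost-loop iteration, and the two profiles are combined as `instructions + miss_penalty * misses`, where `miss_penalty` is a hypothetical tuning constant (the slides do not say how the weights are combined).

```python
from itertools import accumulate

def combined_weights(instr_retired, l1_misses, miss_penalty):
    # Assumed weighting: one miss costs `miss_penalty` instruction-equivalents.
    return [ins + miss_penalty * m for ins, m in zip(instr_retired, l1_misses)]

def weighted_breakpoints(weights, n_threads):
    """Return n_threads-1 breakpoints along the outermost loop.
    Set t covers outer iterations [edges[t], edges[t+1])."""
    prefix = list(accumulate(weights))
    total = prefix[-1]
    breakpoints, i = [], 0
    for t in range(1, n_threads):
        # Advance until the cumulative weight reaches t/n of the total
        # (integer comparison avoids floating-point rounding).
        while prefix[i] * n_threads < total * t:
            i += 1
        breakpoints.append(i + 1)
    return breakpoints

def partition(n_outer, breakpoints):
    edges = [0] + breakpoints + [n_outer]
    return [range(edges[t], edges[t + 1]) for t in range(len(edges) - 1)]

# Hypothetical profile: the second half of the outer loop misses in L1.
instr = [10] * 8
misses = [0, 0, 0, 0, 5, 5, 5, 5]
weights = combined_weights(instr, misses, miss_penalty=2)

bps = weighted_breakpoints(weights, 3)
sets = partition(len(weights), bps)
loads = [sum(weights[i] for i in s) for s in sets]
print(bps)    # [4, 6]
print(loads)  # [40, 40, 40]
```

With real profiles the sets will rarely balance exactly, and a single very heavy iteration can leave a set empty; the sketch ignores both cases.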

12
Experimental Setup
  • Use in-built hardware performance counters
  • MEM_LOAD_RETIRED.L1D_MISS
  • Obtain cache miss profile
  • INST_RETIRED.ANY
  • Obtain instructions retired profile

13
Kernel Set
14
Results (contd.)
  • Compute two metrics
  • Speedup (%) = $\frac{t_{co} - t_{ca}}{t_{ca}} \times 100$
  • $t_{co}$: cache-oblivious
  • $t_{ca}$: cache-aware
  • Deviation: difference between the proposed
    technique and the worst case
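
As a quick worked example of the speedup metric, assuming it is read as the percentage improvement of the cache-aware time over the cache-oblivious time, (tco - tca) / tca x 100. The timings below are made-up values, not measurements from the presentation.

```python
def speedup_percent(t_co, t_ca):
    """Percentage speedup of cache-aware (t_ca) over cache-oblivious (t_co)."""
    return 100 * (t_co - t_ca) / t_ca

# Hypothetical run times in seconds.
print(speedup_percent(12.0, 10.0))  # 20.0
```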

15
Thank You!
16
Results (contd.)
  • Performance variation with different partitioning
    planes
  • 3 threads

17
Results
  • Performance variation with different partitioning
    planes
  • A kernel from 178.galgel
  • Nested non-perfect multiway DOALL loop
  • 2 threads