# Performance Analysis of Divide and Conquer Algorithms for the WHT


1
Performance Analysis of Divide and Conquer Algorithms for the WHT
• Jeremy Johnson
• Mihai Furis, Pawel Hitczenko, Hung-Jen Huang
• Dept. of Computer Science
• Drexel University

www.spiral.net
2
Motivation
• On modern machines, operation count is not always the most important performance metric.
• Effective utilization of the memory hierarchy, pipelining, and instruction-level parallelism is important, and such utilization is not easy to determine from source code.
• Automatic Performance Tuning and Architecture
• Generate and Test
• FFT, Matrix Multiplication,
• Explain performance distribution

3
Outline
• Space of WHT Algorithms
• WHT Package and Performance Distribution
• Performance Model
• Instruction Count
• Cache

4
• $y = \mathrm{WHT}_N \, x$, $N = 2^n$

5
Factoring the WHT Matrix
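• $AC \otimes BD = (A \otimes B)(C \otimes D)$
• $A \otimes B = (A \otimes I)(I \otimes B)$
• $A \otimes (B \otimes C) = (A \otimes B) \otimes C$
• $I_m \otimes I_n = I_{mn}$

$\mathrm{WHT}_2 \otimes \mathrm{WHT}_2 = (\mathrm{WHT}_2 \otimes I_2)(I_2 \otimes \mathrm{WHT}_2)$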
6
Recursive Algorithm
$\mathrm{WHT}_8 = (\mathrm{WHT}_2 \otimes I_4)\,(I_2 \otimes ((\mathrm{WHT}_2 \otimes I_2)(I_2 \otimes \mathrm{WHT}_2)))$
7
Iterative Algorithm
$\mathrm{WHT}_8 = (\mathrm{WHT}_2 \otimes I_4)\,(I_2 \otimes \mathrm{WHT}_2 \otimes I_2)\,(I_4 \otimes \mathrm{WHT}_2)$
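The recursive and iterative factorizations of the size-8 transform can be checked numerically. The following is a minimal sketch in plain Python (the helper names `kron`, `matmul`, and `identity` are illustrative and not part of the WHT package); it builds both products and compares them with the three-fold Kronecker power of $\mathrm{WHT}_2$.

```python
def kron(a, b):
    """Kronecker product of two matrices given as lists of rows."""
    return [[x * y for x in row_a for y in row_b]
            for row_a in a for row_b in b]

def matmul(a, b):
    """Ordinary matrix product of two matrices given as lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b]
            for row in a]

def identity(n):
    """n-by-n identity matrix."""
    return [[1 if i == j else 0 for j in range(n)] for i in range(n)]

WHT2 = [[1, 1], [1, -1]]
I2, I4 = identity(2), identity(4)

# Definition: WHT_8 = WHT_2 (x) WHT_2 (x) WHT_2
wht8 = kron(WHT2, kron(WHT2, WHT2))

# Recursive: (WHT_2 (x) I_4)(I_2 (x) ((WHT_2 (x) I_2)(I_2 (x) WHT_2)))
wht4 = matmul(kron(WHT2, I2), kron(I2, WHT2))
recursive = matmul(kron(WHT2, I4), kron(I2, wht4))

# Iterative: (WHT_2 (x) I_4)(I_2 (x) WHT_2 (x) I_2)(I_4 (x) WHT_2)
iterative = matmul(matmul(kron(WHT2, I4), kron(I2, kron(WHT2, I2))),
                   kron(I4, WHT2))

print(recursive == wht8, iterative == wht8)  # True True
```

Both products use the same three sparse factors up to grouping, which is why the arithmetic cost is identical while the data access order differs.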
8
WHT Algorithms
• Recursive
• Iterative
• General

9
WHT Implementation
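• $N = N_1 N_2 \cdots N_t$, $N_i = 2^{n_i}$
• $\mathrm{WHT}_{2^n} = \prod_{i=1}^{t} \left( I_{2^{n_1 + \cdots + n_{i-1}}} \otimes \mathrm{WHT}_{2^{n_i}} \otimes I_{2^{n_{i+1} + \cdots + n_t}} \right)$
• $x = \mathrm{WHT}_N \cdot x$, where $x^{M}_{b,s} = (x(b), x(b+s), \ldots, x(b+(M-1)s))$
• Implementation (nested loop)
• $R = N$; $S = 1$
• for $i = t, \ldots, 1$
  • $R = R / N_i$
  • for $j = 0, \ldots, R-1$
    • for $k = 0, \ldots, S-1$
      • $x^{N_i}_{jN_iS+k,\,S} = \mathrm{WHT}_{N_i} \cdot x^{N_i}_{jN_iS+k,\,S}$
  • $S = S \cdot N_i$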
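The nested loop can be sketched in Python as follows. This is an illustration under stated assumptions, not the package's C code: the function names are hypothetical, the split is given as a list of exponents $n_1, \ldots, n_t$, and each sub-transform is computed directly from the definition $\mathrm{WHT}_N[i][j] = (-1)^{\mathrm{popcount}(i \wedge j)}$ rather than by an unrolled base case.

```python
def wht_direct(x):
    """Reference WHT from the definition: y[i] = sum_j (-1)^popcount(i & j) * x[j]."""
    return [sum(-v if bin(i & j).count("1") % 2 else v
                for j, v in enumerate(x))
            for i in range(len(x))]

def wht_split(x, exponents):
    """In-place WHT of x (length 2^n) for one split n = n_1 + ... + n_t."""
    N = len(x)
    assert N == 2 ** sum(exponents)
    R, S = N, 1
    for n_i in reversed(exponents):            # i = t, ..., 1
        N_i = 2 ** n_i
        R //= N_i
        for j in range(R):
            for k in range(S):
                base = j * N_i * S + k         # start of the strided sub-vector
                idx = [base + m * S for m in range(N_i)]
                sub = wht_direct([x[p] for p in idx])
                for p, v in zip(idx, sub):     # write the result back in place
                    x[p] = v
        S *= N_i
    return x

# The transform of the first unit vector is the all-ones vector.
print(wht_split([1, 0, 0, 0, 0, 0, 0, 0], [1, 2]))  # [1, 1, 1, 1, 1, 1, 1, 1]
```

Only the stride `S` and the block count `R` change between stages, so every split of the same size touches the same data but in a different order.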

10
Partition Trees
Right Recursive
Left Recursive
Balanced
Iterative
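The four named tree shapes can be written down programmatically. In this sketch a tree is either an integer `n` (a leaf of size $2^n$) or a tuple of subtrees (a split); that encoding, the function names, and the floor/ceiling choice for the balanced split are assumptions made for illustration. Right recursive is taken to mean that only the rightmost child is split further, matching the recursive factorization of $\mathrm{WHT}_8$.

```python
def right_recursive(n):
    """n = 1 + (n-1), recursing on the right part."""
    return n if n == 1 else (1, right_recursive(n - 1))

def left_recursive(n):
    """n = (n-1) + 1, recursing on the left part."""
    return n if n == 1 else (left_recursive(n - 1), 1)

def balanced(n):
    """n split into two halves (floor and ceiling), both recursed."""
    return n if n == 1 else (balanced(n // 2), balanced(n - n // 2))

def iterative(n):
    """n = 1 + 1 + ... + 1 in a single split."""
    return n if n == 1 else (1,) * n

def size(tree):
    """Exponent n of the transform size 2^n computed by the tree."""
    return tree if isinstance(tree, int) else sum(size(c) for c in tree)

def show(tree):
    """Render a tree in a small[...] / split[...] notation."""
    if isinstance(tree, int):
        return f"small[{tree}]"
    return "split[" + ",".join(show(c) for c in tree) + "]"

for build in (right_recursive, left_recursive, balanced, iterative):
    print(build.__name__, show(build(4)))
```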
11
Number of Algorithms
12
Outline
• WHT Algorithms
• WHT Package and Performance Distribution
• Performance Model
• Instruction Count
• Cache

13
WHT Package: Püschel & Johnson (ICASSP '00)
• Allows easy implementation of any of the possible WHT algorithms
• Partition tree representation
• W(n) = small[n] | split[W(n1), ..., W(nt)]
• Tools
• Measure runtime of any algorithm
• Measure hardware events (coupled with PCL/PAPI)
• Search for good implementation
• Dynamic programming
• Evolutionary algorithm
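To illustrate how a partition tree determines an algorithm, here is a hedged Python sketch of a tree-driven evaluator. It is not the package's implementation: trees are encoded as an integer `n` for small[n] or a tuple of subtrees for split[...], the function names are hypothetical, and the base case uses in-place butterflies where the package uses unrolled code.

```python
def size(tree):
    """Exponent n of the transform size 2^n computed by the tree."""
    return tree if isinstance(tree, int) else sum(size(c) for c in tree)

def apply_tree(tree, x, base=0, stride=1):
    """Apply the WHT in place to x[base], x[base+stride], ... as directed by tree."""
    if isinstance(tree, int):                   # small[n]: base case
        n_pts = 2 ** tree
        idx = [base + m * stride for m in range(n_pts)]
        h = 1
        while h < n_pts:                        # in-place butterfly stages
            for start in range(0, n_pts, 2 * h):
                for m in range(start, start + h):
                    a, b = x[idx[m]], x[idx[m + h]]
                    x[idx[m]], x[idx[m + h]] = a + b, a - b
            h *= 2
        return
    R, S = 2 ** size(tree), 1                   # split[...]: nested loop
    for child in reversed(tree):
        N_i = 2 ** size(child)
        R //= N_i
        for j in range(R):
            for k in range(S):
                apply_tree(child, x, base + (j * N_i * S + k) * stride, S * stride)
        S *= N_i

def wht_direct(x):
    """Reference WHT from the definition, used only for checking."""
    return [sum(-v if bin(i & j).count("1") % 2 else v
                for j, v in enumerate(x))
            for i in range(len(x))]

x = [1, 2, 3, 4, 5, 6, 7, 8]
y = list(x)
apply_tree((1, (1, 1)), y)                      # split[small[1],split[small[1],small[1]]]
print(y == wht_direct(x))                       # True
```

A search procedure such as dynamic programming would call an evaluator like this on candidate trees and keep the fastest one for each size.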

14
Algorithm Comparison
15
Cache Miss Data
16
Histogram (n = 16, 10,000 samples)
• Wide range in performance despite equal number of arithmetic operations ($n 2^n$ flops)
• Pentium III vs. UltraSPARC II

17
Outline
• WHT Algorithms
• WHT Package and Performance Distribution
• Performance Model
• Instruction Count
• Cache

18
WHT Implementation
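• $N = N_1 N_2 \cdots N_t$, $N_i = 2^{n_i}$
• $\mathrm{WHT}_{2^n} = \prod_{i=1}^{t} \left( I_{2^{n_1 + \cdots + n_{i-1}}} \otimes \mathrm{WHT}_{2^{n_i}} \otimes I_{2^{n_{i+1} + \cdots + n_t}} \right)$
• $x = \mathrm{WHT}_N \cdot x$, where $x^{M}_{b,s} = (x(b), x(b+s), \ldots, x(b+(M-1)s))$
• Implementation (nested loop)
• $R = N$; $S = 1$
• for $i = t, \ldots, 1$
  • $R = R / N_i$
  • for $j = 0, \ldots, R-1$
    • for $k = 0, \ldots, S-1$
      • $x^{N_i}_{jN_iS+k,\,S} = \mathrm{WHT}_{N_i} \cdot x^{N_i}_{jN_iS+k,\,S}$
  • $S = S \cdot N_i$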

19
Instruction Count Model
• $A(n)$ = number of calls to the WHT procedure
• $\alpha$ = number of instructions outside loops
• $A_l(n)$ = number of calls to the base case of size $l$
• $\alpha_l$ = number of instructions in the base case of size $l$
• $L_i$ = number of iterations of the outer ($i = 1$), middle ($i = 2$), and inner ($i = 3$) loop
• $\beta_i$ = number of instructions in the outer ($i = 1$), middle ($i = 2$), and inner ($i = 3$) loop body

20
Small1
.file "s_1.c"
.version "01.01"
gcc2_compiled.:
.text
.align 4
.globl apply_small1
.type apply_small1,@function
apply_small1:
movl 8(%esp),%edx // load stride S to EDX
movl ...,%eax // ... EAX
fldl (%eax) // st(0)=R7=x0
fldl (%eax,%edx,8) // st(0)=R6=xS
fld %st(1) // ... R5=x0+xS
fxch %st(2) // st(0)=R5=x0, st(2)=R7=x0+xS
fsubp %st,%st(1) // st(0)=R6=xS-x0 ?????
fxch %st(1) // st(0)=R6=x0+xS, st(1)=R7=xS-x0
fstpl (%eax) // store x0=x0+xS
fstpl (%eax,%edx,8) // store xS=x0-xS
ret
21
Recurrences
22
Histogram using Instruction Model (P3)
• $\alpha_1 = 12$, $\alpha_2 = 34$, and $\alpha_3 = 106$
• $\alpha = 27$
• $\beta_1 = 18$, $\beta_2 = 18$, and $\beta_3 = 20$
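The instruction count model can be sketched as a traversal of the partition tree. Everything below is an illustration under assumptions: the total is taken to be the linear combination (instructions outside loops) x (procedure calls) + (base-case instructions) x (base-case calls) + (loop-body instructions) x (loop iterations), trees are encoded as an integer for a leaf or a tuple for a split, and the constants are the Pentium III values listed on this slide with the base-case sizes read as 1, 2, and 3.

```python
ALPHA = 27                           # instructions outside loops, per procedure call
ALPHA_LEAF = {1: 12, 2: 34, 3: 106}  # instructions per base case of size l
BETA = (18, 18, 20)                  # per iteration: outer, middle, inner loop body

def size(tree):
    """Exponent n of the transform size 2^n computed by the tree."""
    return tree if isinstance(tree, int) else sum(size(c) for c in tree)

def _count(tree, times, c):
    """Accumulate event counts for `times` invocations of `tree`."""
    if isinstance(tree, int):
        c["leaf"][tree] = c["leaf"].get(tree, 0) + times
        return
    c["A"] += times
    R, S = 2 ** size(tree), 1
    for child in reversed(tree):
        N_i = 2 ** size(child)
        R //= N_i
        c["L"][0] += times           # one outer-loop iteration per child
        c["L"][1] += times * R       # middle-loop iterations
        c["L"][2] += times * R * S   # inner-loop iterations (one child call each)
        _count(child, times * R * S, c)
        S *= N_i

def counts(tree):
    """Procedure calls A, loop iterations L, and base-case calls for one run."""
    c = {"A": 0, "L": [0, 0, 0], "leaf": {}}
    _count(tree, 1, c)
    return c

def instruction_count(tree):
    """Modelled instruction count for one transform described by tree."""
    c = counts(tree)
    total = ALPHA * c["A"]
    total += sum(ALPHA_LEAF[l] * k for l, k in c["leaf"].items())
    total += sum(b * l for b, l in zip(BETA, c["L"]))
    return total

for name, tree in [("iterative", (1, 1, 1, 1)), ("right recursive", (1, (1, (1, 1))))]:
    print(name, counts(tree), instruction_count(tree))
```

Both example trees make the same number of base-case calls, so any difference in the modelled count comes from recursion and loop overhead alone.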

23
Cache Model
• Different WHT algorithms access data in different patterns
• All algorithms with the same set of leaf nodes have the same number of memory accesses
• Count misses for accesses to data array
• Parameterized by cache size, associativity, and block size
• Simulate using program traces (restrict to data vector accesses)
• Analytic formula?

24
Blocked Access

[Figure: partition tree with node labels 4, 1, 3, 1, 2]
25
Interleaved Access
[Figure: partition tree with node labels 4, 3, 1, 2, 1]
26
Cache Simulator
• 144 memory accesses
• C = 4, A = 1, B = 1 (80, 112)
• C = 4, A = 4, B = 1 (48, 48)
• C = 4, A = 1, B = 2 (72, 88)
• Iterative vs. Recursive (192 memory accesses)
• C = 4, A = 1, B = 1 (128, 112)
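A trace-driven simulator of this kind can be sketched in a few lines of Python. The sketch makes several assumptions that the slides do not state: LRU replacement, sizes C and B measured in vector elements, and a trace in which every base case reads each of its elements once and then writes each once. Because the access convention differs, the totals it produces are not expected to reproduce the access and miss counts listed above.

```python
from collections import OrderedDict

def size(tree):
    """Exponent n of the transform size 2^n computed by the tree."""
    return tree if isinstance(tree, int) else sum(size(c) for c in tree)

def trace_tree(tree, base=0, stride=1, out=None):
    """Data-vector indices touched by the algorithm, in program order."""
    out = [] if out is None else out
    if isinstance(tree, int):
        idx = [base + m * stride for m in range(2 ** tree)]
        out.extend(idx)              # assumed: one read per element
        out.extend(idx)              # assumed: one write per element
        return out
    R, S = 2 ** size(tree), 1
    for child in reversed(tree):
        N_i = 2 ** size(child)
        R //= N_i
        for j in range(R):
            for k in range(S):
                trace_tree(child, base + (j * N_i * S + k) * stride, S * stride, out)
        S *= N_i
    return out

def count_misses(trace, C, A, B):
    """Misses of an LRU cache with size C, associativity A, block size B."""
    num_sets = C // (A * B)
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for addr in trace:
        block = addr // B
        lines = sets[block % num_sets]
        if block in lines:
            lines.move_to_end(block)         # mark as most recently used
        else:
            misses += 1
            if len(lines) == A:
                lines.popitem(last=False)    # evict least recently used
            lines[block] = True
    return misses

iterative = trace_tree((1, 1, 1, 1))
recursive = trace_tree((1, (1, (1, 1))))
print(len(iterative), len(recursive))
print(count_misses(iterative, 4, 1, 1), count_misses(recursive, 4, 1, 1))
```

Setting A = 1 gives a direct-mapped cache, and setting A = C / B gives a fully associative one.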

27
Cache Misses as a Function of Cache Size
[Plots for cache sizes $C = 2^2$, $C = 2^3$, $C = 2^4$, $C = 2^5$]
28
Formula for Cache Misses
• $M(L, W_N, R)$ = number of misses for $(I_L \otimes \mathrm{WHT}_N \otimes I_R)$

29
Closed Form
• $M(L, W_N, R)$ = number of misses for $(I_L \otimes \mathrm{WHT}_N \otimes I_R)$
• $M(0, W_n, 0) = 3(n-c)\,2^n + k\,2^n$
• $C = 2^c$, $k$ = number of parts in the rightmost $c$ positions
• $c = 3$, $n = 4$

Iterative: $k = 3$
Right Recursive: $k = 1$
Balanced: $k = 2$
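As a worked check of the closed form, reading it with a plus sign between its two terms (an assumption, since operators are missing from the transcript), the parameters $c = 3$, $n = 4$ give:

```latex
M(0, W_4, 0) = 3(4-3)\,2^4 + k\,2^4 = 48 + 16k
\quad\Longrightarrow\quad
M = 64,\ 80,\ 96 \quad \text{for } k = 1,\ 2,\ 3 .
```

With $c = 2$ (that is, $C = 4$) and $n = 4$, the same reading gives $96 + 16k$, or 112 and 128 for $k = 1$ and $k = 2$, which coincides with the pair (128, 112) listed on the Cache Simulator slide for the direct-mapped case C = 4, A = 1, B = 1.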
30
Summary of Results and Future Work
• Instruction Count Model
• min, max, expected value, variance, limiting distribution
• Cache Model
• Direct mapped (closed form solution, distribution, expected value, and variance)
• Combine models
• Extend cache formula to include A and B
• Use as heuristic to limit search and predict performance
