Performance Analysis of Divide and Conquer Algorithms for the WHT

1
Performance Analysis of Divide and Conquer
Algorithms for the WHT
  • Jeremy Johnson
  • Mihai Furis, Pawel Hitczenko, Hung-Jen Huang
  • Dept. of Computer Science
  • Drexel University

www.spiral.net
2
Motivation
  • On modern machines operation count is not always
    the most important performance metric.
  • Effective utilization of the memory hierarchy,
    pipelining, and Instruction Level Parallelism is
    important, and it is not easy to determine such
    utilization from source code.
  • Automatic Performance Tuning and Architecture
    Adaptation
  • Generate and Test
  • FFT, Matrix Multiplication,
  • Explain performance distribution

3
Outline
  • Space of WHT Algorithms
  • WHT Package and Performance Distribution
  • Performance Model
  • Instruction Count
  • Cache

4
Walsh-Hadamard Transform
  • $y = \mathrm{WHT}_N \cdot x$, $N = 2^n$
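For reference, the transform matrix is usually defined as an n-fold tensor (Kronecker) power of a 2 x 2 matrix. This is the standard textbook definition, assumed here to be the one the talk uses:

```latex
\mathrm{WHT}_{N} \;=\; \bigotimes_{i=1}^{n} \mathrm{WHT}_2,
\qquad
\mathrm{WHT}_2 \;=\; \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix},
\qquad
N = 2^n .
```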

5
Factoring the WHT Matrix
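  • $AC \otimes BD = (A \otimes B)(C \otimes D)$
  • $A \otimes B = (A \otimes I)(I \otimes B)$
  • $A \otimes (B \otimes C) = (A \otimes B) \otimes C$
  • $I_m \otimes I_n = I_{mn}$

$\mathrm{WHT}_2 \otimes \mathrm{WHT}_2 = (\mathrm{WHT}_2 \otimes I_2)(I_2 \otimes \mathrm{WHT}_2)$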
6
Recursive Algorithm
$\mathrm{WHT}_8 = (\mathrm{WHT}_2 \otimes I_4)\,(I_2 \otimes ((\mathrm{WHT}_2 \otimes I_2)(I_2 \otimes \mathrm{WHT}_2)))$
7
Iterative Algorithm
$\mathrm{WHT}_8 = (\mathrm{WHT}_2 \otimes I_4)(I_2 \otimes \mathrm{WHT}_2 \otimes I_2)(I_4 \otimes \mathrm{WHT}_2)$
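The factorizations on the last three slides can be checked numerically. The following is a minimal sketch in plain Python; the helper names (`kron`, `matmul`, `identity`, `wht`) are illustrative and not taken from the talk or from the WHT package.

```python
def kron(a, b):
    """Kronecker (tensor) product of two matrices given as lists of rows."""
    return [[x * y for x in row_a for y in row_b] for row_a in a for row_b in b]

def matmul(a, b):
    """Plain matrix product of two matrices given as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def identity(n):
    """n x n identity matrix."""
    return [[1 if i == j else 0 for j in range(n)] for i in range(n)]

WHT2 = [[1, 1], [1, -1]]

def wht(n):
    """WHT matrix of size 2**n, built as an n-fold Kronecker power of WHT2."""
    result = [[1]]
    for _ in range(n):
        result = kron(result, WHT2)
    return result

I2, I4 = identity(2), identity(4)

# Size 4 example: (WHT2 (x) I2)(I2 (x) WHT2)
wht4_factored = matmul(kron(WHT2, I2), kron(I2, WHT2))

# Size 8, recursive form: (WHT2 (x) I4)(I2 (x) WHT4)
recursive8 = matmul(kron(WHT2, I4), kron(I2, wht4_factored))

# Size 8, iterative form: (WHT2 (x) I4)(I2 (x) WHT2 (x) I2)(I4 (x) WHT2)
iterative8 = matmul(
    matmul(kron(WHT2, I4), kron(kron(I2, WHT2), I2)),
    kron(I4, WHT2),
)

print(wht4_factored == wht(2), recursive8 == wht(3), iterative8 == wht(3))
```

Both size 8 products yield the same matrix; they differ only in the order and grouping of the butterfly stages, which is what the rest of the talk measures.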
8
WHT Algorithms
  • Recursive
  • Iterative
  • General
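For reference, the recursive and iterative families are commonly written as follows for size $2^n$. These are the standard forms, consistent with the size 8 examples on the two preceding slides, and are given here as background rather than as a reproduction of the slide:

```latex
\text{Recursive:}\quad
\mathrm{WHT}_{2^n} = (\mathrm{WHT}_2 \otimes I_{2^{n-1}})\,(I_2 \otimes \mathrm{WHT}_{2^{n-1}})

\text{Iterative:}\quad
\mathrm{WHT}_{2^n} = \prod_{i=1}^{n} \left( I_{2^{i-1}} \otimes \mathrm{WHT}_2 \otimes I_{2^{n-i}} \right)
```

The general case splits $n = n_1 + \cdots + n_t$ into $t$ parts and applies the same idea recursively to each part.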

9
WHT Implementation
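  • $N = N_1 N_2 \cdots N_t$, $N_i = 2^{n_i}$
  • $x = \mathrm{WHT}_N \cdot x$, with $x^{M}_{b,s} = (x(b), x(b+s), \ldots, x(b+(M-1)s))$
  • $\mathrm{WHT}_{2^n} = \prod_{i=1}^{t} \left( I_{2^{n_1+\cdots+n_{i-1}}} \otimes \mathrm{WHT}_{2^{n_i}} \otimes I_{2^{n_{i+1}+\cdots+n_t}} \right)$
  • Implementation (nested loop)
        R = N; S = 1
        for i = t, ..., 1
            R = R / N_i
            for j = 0, ..., R-1
                for k = 0, ..., S-1
                    $x^{N_i}_{jN_iS+k,\,S} = \mathrm{WHT}_{N_i} \cdot x^{N_i}_{jN_iS+k,\,S}$
            S = S * N_i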
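The nested-loop scheme on this slide can be sketched in Python as below. Several things are assumptions made for illustration: a partition tree is encoded as either an integer `n` (a base case of size 2**n) or a list of subtrees (a split), the base case is a simple loop rather than unrolled code, and the innermost statement applies the child transform to the strided subvector starting at `j*Ni*S + k` with stride `S`. None of the function names come from the WHT package.

```python
def tree_size(tree):
    """Exponent n of a partition tree: an int leaf n, or a list of subtrees."""
    return tree if isinstance(tree, int) else sum(tree_size(c) for c in tree)

def small_wht(x, n, base, stride):
    """Base case: in-place WHT of size 2**n on x[base], x[base+stride], ..."""
    size = 1 << n
    half = 1
    while half < size:
        for start in range(0, size, 2 * half):
            for j in range(start, start + half):
                lo = base + j * stride
                hi = base + (j + half) * stride
                x[lo], x[hi] = x[lo] + x[hi], x[lo] - x[hi]
        half *= 2

def apply_wht(tree, x, base=0, stride=1):
    """In-place WHT on a strided subvector of x, following the partition tree."""
    if isinstance(tree, int):
        small_wht(x, tree, base, stride)
        return
    R, S = 1 << tree_size(tree), 1
    for child in reversed(tree):          # i = t, ..., 1
        Ni = 1 << tree_size(child)
        R //= Ni
        for j in range(R):
            for k in range(S):
                # Subvector of length Ni starting at j*Ni*S + k with stride S
                apply_wht(child, x, base + (j * Ni * S + k) * stride, S * stride)
        S *= Ni

def wht_by_definition(x):
    """Reference result: entry (i, j) of the WHT matrix is (-1)**popcount(i & j)."""
    return [
        sum(-v if bin(i & j).count("1") % 2 else v for j, v in enumerate(x))
        for i in range(len(x))
    ]

data = [3, -1, 4, 1, -5, 9, 2, -6, 5, 3, -5, 8, 9, -7, 9, 3]
iterative = [1, 1, 1, 1]
right_recursive = [1, [1, [1, 1]]]
balanced = [[1, 1], [1, 1]]
results = []
for tree in (iterative, right_recursive, balanced):
    v = list(data)
    apply_wht(tree, v)
    results.append(v == wht_by_definition(data))
print(results)
```

All three trees compute the same transform with the same arithmetic; only the order of memory accesses and the loop and call overhead differ.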

10
Partition Trees
[Figure: partition trees for the right recursive, left recursive, balanced, and iterative algorithms]
11
Number of Algorithms
12
Outline
  • WHT Algorithms
  • WHT Package and Performance Distribution
  • Performance Model
  • Instruction Count
  • Cache

13
WHT Package: Püschel and Johnson (ICASSP '00)
  • Allows easy implementation of any of the possible
    WHT algorithms
  • Partition tree representation
      • W(n) = small[n] | split[W(n_1), ..., W(n_t)]
  • Tools
      • Measure runtime of any algorithm
      • Measure hardware events (coupled with PCL/PAPI)
      • Search for good implementation
          • Dynamic programming
          • Evolutionary algorithm

14
Algorithm Comparison
15
Cache Miss Data
16
Histogram (n = 16, 10,000 samples)
  • Wide range in performance despite equal number
    of arithmetic operations ($n 2^n$ flops)
  • Pentium III vs. UltraSPARC II

17
Outline
  • WHT Algorithms
  • WHT Package and Performance Distribution
  • Performance Model
  • Instruction Count
  • Cache

18
WHT Implementation
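  • $N = N_1 N_2 \cdots N_t$, $N_i = 2^{n_i}$
  • $x = \mathrm{WHT}_N \cdot x$, with $x^{M}_{b,s} = (x(b), x(b+s), \ldots, x(b+(M-1)s))$
  • $\mathrm{WHT}_{2^n} = \prod_{i=1}^{t} \left( I_{2^{n_1+\cdots+n_{i-1}}} \otimes \mathrm{WHT}_{2^{n_i}} \otimes I_{2^{n_{i+1}+\cdots+n_t}} \right)$
  • Implementation (nested loop)
        R = N; S = 1
        for i = t, ..., 1
            R = R / N_i
            for j = 0, ..., R-1
                for k = 0, ..., S-1
                    $x^{N_i}_{jN_iS+k,\,S} = \mathrm{WHT}_{N_i} \cdot x^{N_i}_{jN_iS+k,\,S}$
            S = S * N_i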

19
Instruction Count Model
  • $A(n)$: number of calls to the WHT procedure
  • $\alpha$: number of instructions outside loops
  • $A_l(n)$: number of calls to the base case of size $l$
  • $\alpha_l$: number of instructions in the base case of size $l$
  • $L_i$: number of iterations of the outer ($i = 1$), middle
    ($i = 2$), and inner ($i = 3$) loop
  • $\beta_i$: number of instructions in the outer ($i = 1$),
    middle ($i = 2$), and inner ($i = 3$) loop body
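Taken together, these quantities suggest a linear model for the total instruction count: each event count is weighted by the cost of one occurrence. A sketch of that sum, writing $\alpha$ for the per-call cost outside the loops, $\alpha_l$ for the cost of a base case of size $l$, and $\beta_i$ for the cost of one iteration of loop $i$ (the Greek symbol names and the name $IC$ are assumptions made here):

```latex
IC(n) \;=\; \alpha\, A(n) \;+\; \sum_{l} \alpha_l\, A_l(n) \;+\; \sum_{i=1}^{3} \beta_i\, L_i(n)
```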

20
Small1
        .file   "s_1.c"
        .version        "01.01"
gcc2_compiled.:
.text
        .align 4
.globl apply_small1
        .type   apply_small1,@function
apply_small1:
        movl 8(%esp),%edx       // load stride S to EDX
        movl 12(%esp),%eax      // load x array's base address to EAX
        fldl (%eax)             // st(0)=R7=x0
        fldl (%eax,%edx,8)      // st(0)=R6=xS
        fld %st(1)              // st(0)=R5=x0
        fadd %st(1),%st         // R5=x0+xS
        fxch %st(2)             // st(0)=R5=x0, st(2)=R7=x0+xS
        fsubp %st,%st(1)        // st(0)=R6=xS-x0 ?????
        fxch %st(1)             // st(0)=R6=x0+xS, st(1)=R7=xS-x0
        fstpl (%eax)            // store x0=x0+xS
        fstpl (%eax,%edx,8)     // store xS=x0-xS
        ret
21
Recurrences
22
Histogram using Instruction Model (P3)
  • $\alpha_1 = 12$, $\alpha_2 = 34$, and $\alpha_3 = 106$
  • $\alpha = 27$
  • $\beta_1 = 18$, $\beta_2 = 18$, and $\beta_3 = 20$
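To make the model concrete, the sketch below counts events for a given partition tree and combines them with the coefficient values on this slide (27 for code outside the loops; 12, 34, 106 for base cases of size 1, 2, 3; 18, 18, 20 for the three loop bodies). The counting conventions are assumptions: a leaf counts only as a base-case call, each split node executes the nested loop shown on the implementation slide, and a tree is encoded as an integer leaf or a list of subtrees. The names `count_events` and `instruction_count` are illustrative.

```python
def tree_size(tree):
    """Exponent n of a partition tree: an int leaf n, or a list of subtrees."""
    return tree if isinstance(tree, int) else sum(tree_size(c) for c in tree)

def count_events(tree, counts=None, times=1):
    """Count procedure calls, base-case calls, and loop iterations for a tree.

    `times` is how often this node is invoked by its ancestors.
    """
    if counts is None:
        counts = {"calls": 0, "leaf_calls": {}, "loops": [0, 0, 0]}
    if isinstance(tree, int):
        counts["leaf_calls"][tree] = counts["leaf_calls"].get(tree, 0) + times
        return counts
    counts["calls"] += times
    R, S = 1 << tree_size(tree), 1
    for child in reversed(tree):               # i = t, ..., 1
        Ni = 1 << tree_size(child)
        R //= Ni
        counts["loops"][0] += times            # one outer iteration per child
        counts["loops"][1] += times * R        # middle loop runs R times
        counts["loops"][2] += times * R * S    # inner loop runs R*S times
        count_events(child, counts, times * R * S)
        S *= Ni
    return counts

# Coefficient values read off the slide (Pentium III); symbol roles assumed.
ALPHA = 27
ALPHA_LEAF = {1: 12, 2: 34, 3: 106}
BETA = (18, 18, 20)

def instruction_count(tree):
    """Linear model: weighted sum of the event counts."""
    c = count_events(tree)
    total = ALPHA * c["calls"]
    total += sum(ALPHA_LEAF[size] * k for size, k in c["leaf_calls"].items())
    total += sum(b * iters for b, iters in zip(BETA, c["loops"]))
    return total

for tree in ([1, 1, 1, 1], [1, [1, [1, 1]]], [[1, 1], [1, 1]], [2, 2]):
    print(tree, instruction_count(tree))
```

Under these conventions, trees with the same leaves still differ in the number of procedure calls and loop iterations, which is the source of the spread the model is meant to explain.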

23
Cache Model
  • Different WHT algorithms access data in different
    patterns
  • All algorithms with the same set of leaf nodes
    have the same number of memory accesses
  • Count misses for accesses to data array
  • Parameterized by cache size, associativity, and
    block size
  • simulate using program traces (restrict to data
    vector accesses)
  • Analytic formula?

24
Blocked Access
 
[Figure: partition tree with nodes labeled 4, 1, 3, 1, 2]
25
Interleaved Access
[Figure: partition tree with nodes labeled 4, 3, 1, 2, 1]
26
Cache Simulator
  • 144 memory accesses
  • C = 4, A = 1, B = 1 (80, 112)
  • C = 4, A = 4, B = 1 (48, 48)
  • C = 4, A = 1, B = 2 (72, 88)
  • Iterative vs. Recursive (192 memory accesses)
  • C = 4, A = 1, B = 1 (128, 112)

27
Cache Misses as a Function of Cache Size
[Figure panels: $C = 2^2$, $C = 2^3$, $C = 2^4$, $C = 2^5$]
28
Formula for Cache Misses
  • $M(L, W_N, R)$: number of misses for $(I_L \otimes \mathrm{WHT}_N \otimes I_R)$

29
Closed Form
  • $M(L, W_N, R)$: number of misses for $(I_L \otimes \mathrm{WHT}_N \otimes I_R)$
  • $M(0, W_n, 0) = 3(n-c)2^n + k2^n$
  • $C = 2^c$, $k$ = number of parts in the rightmost $c$
    positions
  • $c = 3$, $n = 4$

[Figure: iterative, right recursive, and balanced partition trees; values shown: k = 1, k = 3, k = 2]
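As a worked example of the closed form, the sketch below evaluates it for the slide's parameters c = 3 and n = 4 and for each of the listed values of k. It assumes the two terms of the formula are added, and the function name is illustrative.

```python
def closed_form_misses(n, c, k):
    """Evaluate 3*(n - c)*2**n + k*2**n for a direct-mapped cache of size 2**c."""
    return 3 * (n - c) * 2**n + k * 2**n

# Slide parameters: c = 3, n = 4, so the first term is 3 * 1 * 16 = 48
for k in (1, 2, 3):
    print(k, closed_form_misses(n=4, c=3, k=k))
```

With these parameters each additional part in the rightmost c positions adds 2^n = 16 misses, giving 64, 80, and 96 misses for k = 1, 2, and 3.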
30
Summary of Results and Future Work
  • Instruction Count Model
  • min, max, expected value, variance, limiting
    distribution
  • Cache Model
  • Direct mapped (closed form solution,
    distribution, expected value, and variance)
  • Combine models
  • Extend cache formula to include A and B
  • Use as heuristic to limit search and predict
    performance

31
Sponsors
Work supported by DARPA (DSO), Applied
Computational Mathematics Program, OPAL, through
research grant DABT63-98-1-0004
administered by the Army Directorate of
Contracting, DESA Intelligent HW-SW Compilers
for Signal Processing Applications, and NSF
ITR/NGS 0325687 Intelligent HW/SW Compilers for
DSP.