Title: CS4230 Parallel Programming, Lecture 19: SIMD and Multimedia Extensions, Mary Hall, November 13, 2012
1. CS4230 Parallel Programming, Lecture 19: SIMD and Multimedia Extensions
Mary Hall, November 13, 2012
2. Today's Lecture
- Practical understanding of SIMD in the context of multimedia extensions
- Slide sources:
  - Sam Larsen, PLDI 2000, http://people.csail.mit.edu/slarsen/
  - Jaewook Shin, my former PhD student
3. SIMD and Multimedia Extensions
- Common feature of most microprocessors
- Very different architecture from GPUs
- Mostly, you use this feature without realizing it, if you compile with an optimization flag (typically -O2 or above)
- You can improve the result substantially with some awareness of the architecture's challenges
- Wide SIMD is becoming common in high-end architectures:
  - Intel AVX
  - Intel MIC
  - IBM BG/Q
4. Multimedia Extension Architectures
- At the core of multimedia extensions:
  - SIMD parallelism
  - Variable-sized data fields
  - Vector length = register width / type size (e.g., a 128-bit register holds four 32-bit floats or eight 16-bit integers)
  - Data in contiguous memory

Slide source: Jaewook Shin
5. Characteristics of Multimedia Applications
- Regular data access pattern
  - Data items are contiguous in memory
- Short data types
  - 8, 16, 32 bits
- Data streaming through a series of processing stages
  - Some temporal reuse for such data streams
- Sometimes:
  - Many constants
  - Short iteration counts
  - Requires saturation arithmetic
6. Programming Multimedia Extensions
- Use compiler or low-level API
- Programming interface similar to function call
  - C built-in functions, Fortran intrinsics
- Most native compilers support their own multimedia extensions
  - GCC: -faltivec, -msse2
  - AltiVec: dst = vec_add(src1, src2);
  - SSE2: dst = _mm_add_ps(src1, src2);
  - BG/L: dst = __fpadd(src1, src2);
- No standard!
7. 1. Independent ALU Ops

R = R + XR * 1.08327
G = G + XG * 1.89234
B = B + XB * 1.29835

Slide source: Sam Larsen
8. 2. Adjacent Memory References

R = R + X[i+0]
G = G + X[i+1]
B = B + X[i+2]

Slide source: Sam Larsen
9. 3. Vectorizable Loops

for (i=0; i<100; i+=1)
  A[i+0] = A[i+0] + B[i+0];

Slide source: Sam Larsen
10. 3. Vectorizable Loops

for (i=0; i<100; i+=4) {
  A[i+0] = A[i+0] + B[i+0];
  A[i+1] = A[i+1] + B[i+1];
  A[i+2] = A[i+2] + B[i+2];
  A[i+3] = A[i+3] + B[i+3];
}

Slide source: Sam Larsen
11. 4. Partially Vectorizable Loops

for (i=0; i<16; i+=1) {
  L = A[i+0] - B[i+0];
  D = D + abs(L);
}

Slide source: Sam Larsen
12. 4. Partially Vectorizable Loops

for (i=0; i<16; i+=2) {
  L = A[i+0] - B[i+0];
  D = D + abs(L);
  L = A[i+1] - B[i+1];
  D = D + abs(L);
}

Slide source: Sam Larsen
13. Rest of Lecture
- Understanding multimedia execution
- Understanding the overheads
- Understanding how to write code to deal with the overheads
- What are the overheads?
  - Packing and unpacking:
    - Rearranging data so that it is contiguous
  - Alignment overhead:
    - Accessing data from the memory system so that it is aligned to a superword boundary
  - Control flow:
    - Control flow may require executing all paths
14. Packing/Unpacking Costs

C = A + 2
D = B + 3

Slide source: Sam Larsen
15. Packing/Unpacking Costs

- Packing source operands
  - Copying into contiguous memory

A = f()
B = g()
C = A + 2
D = B + 3
16. Packing/Unpacking Costs

- Packing source operands
  - Copying into contiguous memory
- Unpacking destination operands
  - Copying back to location

A = f()
B = g()
C = A + 2
D = B + 3
E = C / 5
F = D * 7

Slide source: Sam Larsen
17. Alignment Code Generation

- Aligned memory access
  - The address is always a multiple of the vector length (16 bytes in this example)
  - Just one superword load or store instruction

float a[64];
for (i=0; i<64; i+=4)
  Va = a[i:i+3];
18. Alignment Code Generation (cont.)

- Misaligned memory access
  - The address is always a non-zero constant offset away from the 16-byte boundaries
  - Static alignment: for a misaligned load, issue two adjacent aligned loads followed by a merge
  - Sometimes the hardware does this for you, but it still results in multiple loads

float a[64];
for (i=0; i<60; i+=4)
  Va = a[i+2:i+5];
19. Alignment Code Generation (cont.)

- Statically align loop iterations

float a[64];
for (i=0; i<60; i+=4)
  Va = a[i+2:i+5];

becomes

float a[64];
Sa2 = a[2]; Sa3 = a[3];
for (i=4; i<64; i+=4)
  Va = a[i:i+3];
20. Alignment Code Generation (cont.)

- Unaligned memory access
  - The offset from 16-byte boundaries is varying, or not enough information is available
  - Dynamic alignment: the merging point is computed at run time

float a[64];
start = read();
for (i=start; i<60; i++)
  Va = a[i:i+3];
21. Summary of Dealing with Alignment Issues

- The worst case is dynamic alignment based on address calculation (previous slide)
- The compiler (or programmer) can use analysis to prove data is already aligned
  - We know that data is initially aligned at its starting address by convention
  - If we step through a loop with a constant starting point and access the data sequentially, alignment is preserved across the loop
- We can adjust the computation to make it aligned: a sequential portion until aligned, followed by a SIMD portion, possibly followed by a sequential cleanup
- Sometimes alignment overhead is so significant that there is no performance gain from SIMD execution
22. Last SIMD Issue: Control Flow

- What if I have control flow?
- Both control flow paths must be executed!

for (i=0; i<16; i++)
  if (a[i] != 0)
    b[i] = b[i] / a[i];

What happens:
- Compute a[i] != 0 for all fields
- Compute b[i] for all fields in temporary t1
- Copy b[i] into another register t2
- Merge t1 and t2 according to the value of a[i] != 0

for (i=0; i<16; i++)
  if (a[i] != 0)
    b[i] = b[i] / a[i];
  else
    b[i] = ...

What happens:
- Compute a[i] != 0 for all fields
- Compute b[i] = b[i] / a[i] in register t1
- Compute the else value of b[i] for all fields in t2
- Merge t1 and t2 according to the value of a[i] != 0
23. Reasons Why Compilers Fail to Parallelize Code

- A dependence interferes with parallelization
  - Sometimes an assertion of no aliasing will solve this
  - Sometimes an assertion of no dependence will solve this
- The compiler anticipates too much overhead, which would eliminate the gain
  - Costly alignment, or alignment that cannot be proven
  - Number of loop iterations too small