CS4230 Parallel Programming
Lecture 19: SIMD and Multimedia Extensions
Mary Hall, November 13, 2012

Transcript and Presenter's Notes
1
CS4230 Parallel Programming
Lecture 19: SIMD and Multimedia Extensions
Mary Hall
November 13, 2012
2
Today's Lecture
  • Practical understanding of SIMD in the context of
    multimedia extensions
  • Slide sources
  • Sam Larsen, PLDI 2000, http://people.csail.mit.edu/slarsen/
  • Jaewook Shin, my former PhD student

3
SIMD and Multimedia Extensions
  • Common feature of most microprocessors
  • Very different architecture from GPUs
  • Mostly, you use this feature without realizing it
    if you use an optimization flag (typically -O2 or
    above)
  • You can improve the result substantially with
    some awareness of challenges for architecture
  • Wide SIMD is becoming common in high-end
    architectures
  • Intel AVX
  • Intel MIC
  • IBM BG/Q

4
Multimedia Extension Architectures
  • At the core of multimedia extensions
  • SIMD parallelism
  • Variable-sized data fields
  • Vector length = register width / type size
  • Data in contiguous memory

Slide source Jaewook Shin
5
Characteristics of Multimedia Applications
  • Regular data access pattern
  • Data items are contiguous in memory
  • Short data types
  • 8, 16, 32 bits
  • Data streaming through a series of processing
    stages
  • Some temporal reuse for such data streams, sometimes
  • Many constants
  • Short iteration counts
  • Requires saturation arithmetic

6
Programming Multimedia Extensions
  • Use compiler or low-level API
  • Programming interface similar to function call
  • C built-in functions, Fortran intrinsics
  • Most native compilers support their own
    multimedia extensions
  • GCC: -faltivec, -msse2
  • AltiVec: dst = vec_add(src1, src2)
  • SSE2: dst = _mm_add_ps(src1, src2)
  • BG/L: dst = __fpadd(src1, src2)
  • No standard!
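To make the flavor of these APIs concrete, here is a minimal SSE sketch (assuming an x86 compiler with SSE support, e.g. GCC/Clang with -msse2; `add4` is an illustrative wrapper name, not part of any of these APIs):

```c
#include <xmmintrin.h>  /* SSE intrinsics: __m128, _mm_add_ps, ... */

/* One SIMD add over four packed floats, in the SSE spelling shown
   on the slide. add4 is a hypothetical wrapper, not an intrinsic. */
void add4(const float *src1, const float *src2, float *dst) {
    __m128 a = _mm_loadu_ps(src1);          /* load 4 floats */
    __m128 b = _mm_loadu_ps(src2);
    _mm_storeu_ps(dst, _mm_add_ps(a, b));   /* dst = _mm_add_ps(src1, src2) */
}
```

AltiVec (vec_add) and BG/L (__fpadd) spell the same operation differently, which is exactly the "no standard" complaint.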

7
1. Independent ALU Ops
R = R + XR * 1.08327
G = G + XG * 1.89234
B = B + XB * 1.29835
Slide source Sam Larsen
8
2. Adjacent Memory References
R = R + X[i+0]
G = G + X[i+1]
B = B + X[i+2]
Slide source Sam Larsen
9
3. Vectorizable Loops
for (i=0; i<100; i+=1)
  A[i+0] = A[i+0] + B[i+0];
Slide source Sam Larsen
10
3. Vectorizable Loops
for (i=0; i<100; i+=4) {
  A[i+0] = A[i+0] + B[i+0];
  A[i+1] = A[i+1] + B[i+1];
  A[i+2] = A[i+2] + B[i+2];
  A[i+3] = A[i+3] + B[i+3];
}
Slide source Sam Larsen
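The unrolled form above is exactly what maps onto one 4-wide SIMD add per iteration. A sketch with SSE intrinsics (x86 and -msse2 assumed; `vec_add` is an illustrative name, and `n` must be a multiple of 4):

```c
#include <xmmintrin.h>

/* A[i] = A[i] + B[i], four elements per iteration, matching the
   unrolled loop. vec_add is a hypothetical name; n must be a
   multiple of 4. */
void vec_add(float *A, const float *B, int n) {
    for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(&A[i]);
        __m128 vb = _mm_loadu_ps(&B[i]);
        _mm_storeu_ps(&A[i], _mm_add_ps(va, vb));
    }
}
```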
11
4. Partially Vectorizable Loops
for (i=0; i<16; i+=1) {
  L = A[i+0] - B[i+0];
  D = D + abs(L);
}
Slide source Sam Larsen
12
4. Partially Vectorizable Loops
for (i=0; i<16; i+=2) {
  L = A[i+0] - B[i+0];
  D = D + abs(L);
  L = A[i+1] - B[i+1];
  D = D + abs(L);
}
Slide source Sam Larsen
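A hedged sketch of how the partially vectorizable loop might compile: the subtraction runs 4-wide, while the abs and the accumulation into D stay scalar (x86 SSE assumed; `sum_abs_diff` is an illustrative name, and `fabsf` stands in for the slide's `abs`):

```c
#include <math.h>       /* fabsf for the scalar abs */
#include <xmmintrin.h>

/* D = sum of |A[i] - B[i]|: the subtract is 4-wide SIMD, the abs and
   the accumulation stay scalar. sum_abs_diff is a hypothetical name;
   n must be a multiple of 4. */
float sum_abs_diff(const float *A, const float *B, int n) {
    float D = 0.0f, L[4];
    for (int i = 0; i < n; i += 4) {
        __m128 diff = _mm_sub_ps(_mm_loadu_ps(&A[i]), _mm_loadu_ps(&B[i]));
        _mm_storeu_ps(L, diff);        /* unpack to the scalar side */
        for (int j = 0; j < 4; j++)
            D += fabsf(L[j]);          /* scalar tail per group */
    }
    return D;
}
```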
13
Rest of Lecture
  • Understanding multimedia execution
  • Understanding the overheads
  • Understanding how to write code to deal with the
    overheads
  • What are the overheads?
  • Packing and unpacking
  • Rearrange data so that it is contiguous
  • Alignment overhead
  • Accessing data from the memory system so that it
    is aligned to a superword boundary
  • Control flow
  • Control flow may require executing all paths

14
Packing/Unpacking Costs
C = A + 2
D = B + 3
Slide source Sam Larsen
15
Packing/Unpacking Costs
  • Packing source operands
  • Copying into contiguous memory

A = f()
B = g()
C = A + 2
D = B + 3

16
Packing/Unpacking Costs
  • Packing source operands
  • Copying into contiguous memory
  • Unpacking destination operands
  • Copying back to location

A = f()
B = g()
C = A + 2
D = B + 3

E = C / 5
F = D * 7
Slide source Sam Larsen
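The packing and unpacking show up directly in intrinsics code: building a vector from scalars costs extra register moves, and getting the results back out costs a store. A sketch (SSE assumed; `pack_add` is an illustrative name):

```c
#include <xmmintrin.h>

/* C = A + 2 and D = B + 3 in one SIMD add: _mm_set_ps is the packing
   cost, the store back to scalars is the unpacking cost.
   pack_add is a hypothetical name. */
void pack_add(float A, float B, float *C, float *D) {
    __m128 src = _mm_set_ps(0.0f, 0.0f, B, A);   /* pack A, B into lanes 0, 1 */
    __m128 res = _mm_add_ps(src, _mm_set_ps(0.0f, 0.0f, 3.0f, 2.0f));
    float out[4];
    _mm_storeu_ps(out, res);                     /* unpack back to memory */
    *C = out[0];
    *D = out[1];
}
```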
17
Alignment Code Generation
  • Aligned memory access
  • The address is always a multiple of vector length
    (16 bytes in example)
  • Just one superword load or store instruction

float a[64];
for (i=0; i<64; i+=4)
  Va = a[i:i+3];
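In SSE terms, the aligned case is a single _mm_load_ps per group of four floats; that intrinsic requires a 16-byte-aligned address. A sketch assuming the caller passes an aligned array (`sum_aligned` is an illustrative name):

```c
#include <xmmintrin.h>

/* Sum n floats (n a multiple of 4) from a 16-byte-aligned array:
   one aligned superword load per group of four.
   sum_aligned is a hypothetical name. */
float sum_aligned(const float *a, int n) {
    __m128 vsum = _mm_setzero_ps();
    for (int i = 0; i < n; i += 4)
        vsum = _mm_add_ps(vsum, _mm_load_ps(&a[i]));  /* aligned load */
    float lanes[4];
    _mm_storeu_ps(lanes, vsum);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}
```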

18
Alignment Code Generation (cont.)
  • Misaligned memory access
  • The address is always a non-zero constant offset
    away from the 16 byte boundaries.
  • Static alignment For a misaligned load, issue
    two adjacent aligned loads followed by a merge.
  • Sometimes the hardware does this for you, but
    still results in multiple loads

float a[64];
for (i=0; i<60; i+=4)
  Va = a[i+2:i+5];
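The two-aligned-loads-plus-merge scheme for a constant offset of two floats can be written out with a shuffle as the merge step. A sketch (SSE assumed; `load_offset2` is an illustrative name, and `a` must be 16-byte aligned):

```c
#include <xmmintrin.h>

/* Load a[2..5] from a 16-byte-aligned array using two aligned loads
   plus a merge, as described on the slide.
   load_offset2 is a hypothetical name. */
void load_offset2(const float *a, float *out) {
    __m128 lo = _mm_load_ps(&a[0]);   /* aligned load: a[0..3] */
    __m128 hi = _mm_load_ps(&a[4]);   /* aligned load: a[4..7] */
    /* merge: lanes 2, 3 of lo followed by lanes 0, 1 of hi */
    __m128 v = _mm_shuffle_ps(lo, hi, _MM_SHUFFLE(1, 0, 3, 2));
    _mm_storeu_ps(out, v);
}
```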

19
Alignment Code Generation (cont.)
  • Statically align loop iterations

float a[64];
for (i=0; i<60; i+=4)
  Va = a[i+2:i+5];

float a[64];
Sa2 = a[2]; Sa3 = a[3];
for (i=4; i<64; i+=4)
  Va = a[i:i+3];

20
Alignment Code Generation (cont.)
  • Unaligned memory access
  • The offset from 16 byte boundaries is varying or
    not enough information is available.
  • Dynamic alignment The merging point is computed
    during run time.

float a[64];
start = read();
for (i=start; i<60; i++)
  Va = a[i:i+3];
21
Summary of dealing with alignment issues
  • Worst case is dynamic alignment based on address
    calculation (previous slide)
  • Compiler (or programmer) can use analysis to
    prove data is already aligned
  • We know that data is initially aligned at its
    starting address by convention
  • If we are stepping through a loop with a constant
    starting point and accessing the data
    sequentially, then it preserves the alignment
    across the loop
  • We can adjust computation to make it aligned by
    having a sequential portion until aligned,
    followed by a SIMD portion, possibly followed by
    a sequential cleanup
  • Sometimes alignment overhead is so significant
    that there is no performance gain from SIMD
    execution
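The prologue/SIMD/cleanup structure described above might be sketched as follows (x86 SSE and a GCC-style pointer alignment test are assumptions; `add_one` is an illustrative name):

```c
#include <stdint.h>     /* uintptr_t for the alignment test */
#include <xmmintrin.h>

/* a[i] += 1 for n floats: scalar until the address reaches a 16-byte
   boundary, then 4-wide SIMD, then a scalar cleanup.
   add_one is a hypothetical name. */
void add_one(float *a, int n) {
    int i = 0;
    /* sequential prologue: step until &a[i] is 16-byte aligned */
    while (i < n && ((uintptr_t)&a[i] % 16) != 0)
        a[i++] += 1.0f;
    /* SIMD portion: aligned loads and stores */
    __m128 one = _mm_set1_ps(1.0f);
    for (; i + 4 <= n; i += 4)
        _mm_store_ps(&a[i], _mm_add_ps(_mm_load_ps(&a[i]), one));
    /* sequential cleanup for the leftover elements */
    for (; i < n; i++)
        a[i] += 1.0f;
}
```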

22
Last SIMD Issue: Control Flow
  • What if I have control flow?
  • Both control flow paths must be executed!

for (i=0; i<16; i++)
  if (a[i] != 0)
    b[i]++;

What happens:
  Compute a[i] != 0 for all fields
  Compute b[i]++ for all fields in temporary t1
  Copy b[i] into another register t2
  Merge t1 and t2 according to the value of a[i] != 0

for (i=0; i<16; i++)
  if (a[i] != 0)
    b[i] = b[i] / a[i];
  else
    b[i]++;

What happens:
  Compute a[i] != 0 for all fields
  Compute b[i] = b[i] / a[i] in register t1
  Compute b[i]++ for all fields in t2
  Merge t1 and t2 according to the value of a[i] != 0
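The compute-both-paths-and-merge sequence can be written directly with compare and bitwise-select intrinsics. A sketch for the first loop (SSE assumed; `guarded_incr` is an illustrative name, and `n` must be a multiple of 4):

```c
#include <xmmintrin.h>

/* if (a[i] != 0) b[i]++; with the taken path executed for all fields
   and the results merged by mask, as the slide describes.
   guarded_incr is a hypothetical name; n must be a multiple of 4. */
void guarded_incr(const float *a, float *b, int n) {
    for (int i = 0; i < n; i += 4) {
        __m128 va   = _mm_loadu_ps(&a[i]);
        __m128 vb   = _mm_loadu_ps(&b[i]);
        __m128 mask = _mm_cmpneq_ps(va, _mm_setzero_ps()); /* a[i] != 0, all fields */
        __m128 t1   = _mm_add_ps(vb, _mm_set1_ps(1.0f));   /* b[i]++ in temporary t1 */
        /* merge t1 (taken path) and vb (old value) by the mask */
        __m128 m    = _mm_or_ps(_mm_and_ps(mask, t1),
                                _mm_andnot_ps(mask, vb));
        _mm_storeu_ps(&b[i], m);
    }
}
```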
23
Reasons Why Compilers Fail to Parallelize Code
  • Dependence interferes with parallelization
  • Sometimes an assertion of no alias will solve this
  • Sometimes an assertion of no dependence will solve this
  • Anticipates that too much overhead will eliminate the gain
  • Costly alignment, or can't prove alignment
  • Number of loop iterations too small
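One concrete way to make the no-alias assertion is C99 `restrict`; a minimal sketch (`vadd` is an illustrative name):

```c
/* The no-alias assertion in C99: restrict promises the compiler that
   a and b never overlap, removing the assumed dependence that blocks
   vectorization. vadd is a hypothetical name. */
void vadd(float *restrict a, const float *restrict b, int n) {
    for (int i = 0; i < n; i++)
        a[i] = a[i] + b[i];            /* now freely vectorizable */
}
```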