Class Overview - PowerPoint PPT Presentation

About This Presentation
Title:

Class Overview

Description:

Advanced Idiom Recognition. The Pentium 4 processor provides ... The Intel compiler aggressively detects such idioms during intra-register vectorization. ... – PowerPoint PPT presentation

Slides: 112
Provided by: ericw77
Tags: class | idiom | overview

Transcript and Presenter's Notes

Title: Class Overview


1
Building the Multi-core Future: Enabling Software
and Applications
Feilong Huang (feilong.huang_at_intel.com), Technical
Consulting Engineer, SSG/SPD, Intel China Ltd
Intel and the Intel logo are trademarks or
registered trademarks of Intel Corporation or its
subsidiaries in the United States or other
countries.
2
Agenda
  • Why is Multi-core Important?
  • Intel Software Development Products overview
  • Intel Compiler Optimization
  • Vectorization
  • OpenMP
  • Summary

3
Agenda
  • Why is Multi-core Important?
  • Intel Software Development Products overview
  • Intel Compiler Optimization
  • Vectorization
  • OpenMP
  • Summary

4
No More Free Lunch
  • Moore's Law continues to hold
  • Transistor count on die continues to increase
  • Faster clock => increased performance
  • Minimal work by ISVs
  • As clock speed increases, heat increases
  • Intel is addressing power and heat issues

Use extra chip real estate for multiple cores to
increase performance
5
How to Take Advantage of Multi-Core
  • Thread programs for performance
  • Execute independent tasks in parallel

How can this improve performance?
6
Multitasking Advantages
  • Serial applications can still benefit
  • Multiple applications can run simultaneously
  • Listen to music and compress data file
  • Background jobs use separate processor
  • Don't interfere with foreground tasks
  • Virus checking, system backup

7
Intel Architecture
  • Instruction Level Parallelism (1993)
  • Out-of-order instruction pipeline
  • Multiple execution units
  • Data Level Parallelism (1997)
  • MMX
  • Streaming SIMD Extensions (SSE)
  • Streaming SIMD Extensions 2 (SSE2)
  • Streaming SIMD Extensions 3 (SSE3)
  • Thread Level Parallelism (2002)
  • Hyper-Threading Technology (HT Technology)
  • Multi-core Intel processor architecture

8
Increasing Degrees of Parallelism
Long History in Providing Parallelism
9
Hyper-Threading Technology (HT Technology)
Multi-core Intel processor architecture
Execution Core
Cache
Intel Xeon/Pentium 4 processor supporting
Hyper-Threading Technology
Multiple execution cores ramping across Intel
platforms
10
Cores and Logical Thread Roadmap
Current Platforms
2005 2006 Future
Intel Itanium 2-(6M/9M) (1 core, 1 thread)
Intel E8870/OEM
Tukwila (2 cores, 4 threads) Common Platform
Montecito / Montvale (2 cores, 4 threads)
Intel E8870/OEM
MP Servers
Paxville /Tulsa (2 cores, 2 threads) Intel
Twin Castle/OEM
Cranford/Potomac (1 core, 2 Threads) Intel Twin
Castle/OEM
Intel Xeon Processor MP (1 core, 2 Threads)
OEM chip sets
Whitefield (2 cores, 2 threads) Common Platform
Intel Itanium 2 ( LV) (1 core, 1 thread)
Intel E8870/OEM
Millington ( LV) / DP Montvale (2 cores, 4
threads) Intel E8870/OEM
Dimona ( LV) (2 cores, 4 threads) Common
Platform
DP Servers and workstations
Future (2 cores, 2 threads) Next generation
chip set
Dempsey (2 cores, 2 threads) Next generation
chip set
Intel Xeon Processor Intel E7520 and Intel
E7320
Irwindale (1 core, 2 Threads) Lindenhurst/Tumwater
Dual/Multi-core
Pentium 4 processor (1 core, 2 Threads)
Future 65nm (2 cores, 2 threads)
Smithfield (2 cores, 2 threads)
Future (2 cores, 2 threads)
Desktop Client
Single core
Next Generation 65nm (1 core, 2 Threads)
Pentium 4 processor (1 core, 2 Threads)
Yonah (2 cores, 2 threads)
Mobile Client
Pentium M processor
Future (2 cores, 2 threads)
X indicates X or more
11
Agenda
  • Why is Multi-core Important?
  • Intel Software Development Products overview
  • Intel Compiler Optimization
  • Vectorization
  • OpenMP
  • Summary

12
Cross Platform Support From Servers to Cell
Phones
From Servers to Mobile/Wireless Computing, Intel
Software Development Products Enable Application
Development Across Intel Platforms
13
Intel Software Development Products
  • Intel Compilers
  • Deliver great application performance and improve
    developer productivity
  • Intel VTune Performance Analyzers
  • Quickly identify performance bottlenecks
  • Intel Performance Libraries
  • Highly optimized, threaded, ready to use
    building-block functions
  • Intel Threading Tools
  • Find threading errors and optimize threaded
    applications for maximum performance
  • Intel Cluster Tools
  • Create, analyze, optimize and deploy
    cluster-based applications

14
Intel Development Tools Help Unlock Multi-Core
Potential
  • Intel Thread Checker and Thread Profiler
  • Unique product locates hard-to-find threading
    errors before they happen
  • Helps developers optimize threaded applications
  • Intel C++ and Fortran Compilers
  • Built-in threading support with
    Auto-Parallelization and OpenMP support
  • Intel MKL and IPP Performance libraries
  • Highly optimized threaded libraries that help
    realize multi-core performance gains even if the
    application isn't threaded!
  • Intel VTune analyzer identifies performance
    bottlenecks in code.

All Intel S/W Development Products help to get
the most performance out of threaded apps
15
Multithreading Development Cycle
  • Analysis
  • VTune Performance Analyzer
  • Design (Introduce Threads)
  • OpenMP (Intel Compiler)
  • Intel Performance libraries IPP and MKL
  • Explicit threading (Win32, Pthreads)
  • Debug for Correctness
  • Intel Thread Checker
  • Intel Debugger
  • Tune for Performance
  • Intel Compiler
  • Intel Thread Profiler
  • VTune Performance Analyzer

16
Agenda
  • Why is Multi-core Important?
  • Intel Software Development Products overview
  • Intel Compiler Optimization
  • General and processor-specific optimization
  • Advanced optimizations
  • Multi-pass optimizations
  • Vectorization
  • OpenMP
  • Summary

17
Intel Compilers 9.0
  • C++ and Fortran
  • IA-32, Intel Itanium 2, Intel EM64T, and Intel XScale
    processor-based systems
  • Windows specific
  • Integrates into MS Visual Studio .NET IDE
  • Compatible with Microsoft Visual C++
  • Security feature that reduces exposure to buffer
    overruns
  • Intel Code-Coverage and Intel Test-Prioritization
    tools
  • Threaded application (support of multi-core
    processors)
  • OpenMP 2.0 standard support
  • Auto-Parallel feature that automatically
    generates threaded code

18
General Optimizations
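  • -O0 (Linux), /Od (Windows): Disables optimizations
  • -g (Linux), /Zi (Windows): Creates symbols
  • -O1 (Linux), /O1 (Windows): Optimizes for speed without
    increasing code size; that is, disables library function inlining
  • -O2 (Linux), /O2 (Windows): Optimizes for speed (default)
  • -O3 (Linux), /O3 (Windows): High-level optimizations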

19
Processor Tuning
  • Schedule instructions to be optimal for specific
    processor instruction latencies and cache sizes.

Note: Default may change in future compilers.
20
Example of Processor Tuning
  • Intel compiler uses a variety of optimizations
  • Changes multiply by constant into adds
  • Changes shift left by constant into adds
  • Does not use shifts for sign extension
  • Eliminates shifts aggressively

Compiler accounts for these differences for you!
21
Example of Scheduling
    for (int i = 0; i < length; i++) p[i] = q[i] * 32;
Intel Xeon or Pentium 4 Processor
    B14:                                  ; /G7
      mov esi, DWORD PTR [ebx+edx*4]
      add esi, esi
      add esi, esi
      add esi, esi
      add esi, esi
      add esi, esi
      mov DWORD PTR [ebx+edx*4], esi
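
The five add instructions in the listing are a strength-reduced multiply by 32 (32 = 2^5). A minimal sketch of the same idea in C follows; the function names are hypothetical and the code only models what the generated instructions compute, it is not compiler output.

```c
#include <assert.h>
#include <stdint.h>

/* Multiply by 32 the way the listing does: five doublings (32 = 2^5).
   Unsigned arithmetic is used so that wrap-around is well defined. */
static uint32_t times32_by_adds(uint32_t v)
{
    for (int k = 0; k < 5; k++)
        v += v;                 /* same effect as "add esi, esi" */
    return v;
}

/* Apply p[i] = q[i] * 32 over an array using only additions. */
static void scale_by_32(uint32_t *p, const uint32_t *q, int length)
{
    assert(length >= 0);
    for (int i = 0; i < length; i++)
        p[i] = times32_by_adds(q[i]);
}
```

Whether adds or a shift is faster depends on the target processor, which is the point of letting the compiler choose.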

22
Which Processor ax?
23
Which Processor ax?
24
Automatic Processor Dispatch
  • Single executable
  • Optimized for target processors and generic code
    that runs on all x86 processors.
  • For each target processor it uses
  • Processor-specific instructions
  • Vectorization
  • Low overhead
  • Some increase in code size

25
Manual Processor Dispatch
    // Declaration here
    __declspec(cpu_dispatch(generic, pentium_4_sse3))
    void array_sum(int num) {
      /* do not add any code here; code added by compiler */
    }
    // All definitions below
    __declspec(cpu_specific(generic))
    void array_sum(int num) {
      /* add your code for scalar x87 floating-point here */
    }
    __declspec(cpu_specific(pentium_4_sse3))
    void array_sum(int num) {
      /* add your code for SIMD FP with intrinsics or
         vector class library here */
    }
Multiple optimized functions, one binary
26
Agenda
  • Why is Multi-core Important?
  • Intel Software Development Products overview
  • Intel Compiler Optimization
  • General and processor-specific optimization
  • Advanced optimizations
  • Vectorization
  • High level optimization
  • OpenMP
  • Auto parallelization
  • Multi-pass optimizations
  • Vectorization
  • OpenMP
  • Summary

27
Vectorization Converts Loops
  • Example
      for (I = 0; I < MAX; I++)
        c[I] = a[I] + b[I];
  • Usage
  • (Linux) -axN, -axB, -axP
  • (Windows) -QaxN, -QaxB, -QaxP

[Figure: 128-bit registers; lanes A3..A0 + B3..B0 = C3..C0]
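
As a rough model of what vectorization does to this loop, the sketch below compares the scalar form with a hand-blocked form that handles four floats per step, the number that fits in one 128-bit register. The function names and the LANES constant are illustrative assumptions; a real compiler emits SIMD instructions rather than four scalar statements.

```c
#include <assert.h>
#include <stddef.h>

#define LANES 4  /* four 32-bit floats fill one 128-bit register */

/* Scalar form of the loop: one element per iteration. */
static void add_scalar(float *c, const float *a, const float *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

/* Hand-written model of the vectorized loop: LANES elements per step,
   then a scalar loop for the leftover elements. */
static void add_blocked(float *c, const float *a, const float *b, size_t n)
{
    size_t i = 0;
    assert(c != NULL && a != NULL && b != NULL);
    for (; i + LANES <= n; i += LANES) {
        c[i + 0] = a[i + 0] + b[i + 0];
        c[i + 1] = a[i + 1] + b[i + 1];
        c[i + 2] = a[i + 2] + b[i + 2];
        c[i + 3] = a[i + 3] + b[i + 3];
    }
    for (; i < n; i++)
        c[i] = a[i] + b[i];
}
```

Both functions produce the same result; the blocked form only makes the four-at-a-time grouping visible.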
28
Advanced Idiom Recognition
  • The Pentium 4 processor provides SIMD
    instructions for a wide variety of idioms
  • Type conversions and type conversions with
    saturation
  • Saturation arithmetic
  • Clipping
  • AVG, ABS computations
  • The Intel compiler aggressively detects such
    idioms during intra-register vectorization.
  • Clean methodology, no ad-hoc pattern matching

29
Saturation and Clipping Idioms
    unsigned char a[256], b[256]; ..
    for (i = 0; i < 256; i++) {
      int x = (a[i] < 200) ? a[i] + 55 : 255;
      if (x > b[i]) b[i] = x;
    }

    .B1.11:                                ; xmm1 is preloaded with 55,..,55
      movdqa  xmm0, XMMWORD PTR [eax+ecx]
      paddusb xmm0, xmm1                   ; (saturate)
      pmaxub  xmm0, XMMWORD PTR [ecx+ebp]  ; (clipping)
      movdqa  XMMWORD PTR [ecx+ebp], xmm0
      add     ecx, 16
      cmp     ecx, esi
      jl      .B1.11

The Intel Compiler does the work for you!
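
To see why the compiler may replace the conditional code with two instructions, the sketch below writes out per byte what paddusb (unsigned saturating add) and pmaxub (unsigned maximum) compute and pairs it with the source idiom. The helper names are hypothetical; this is a scalar model, not the vectorizer's implementation.

```c
#include <assert.h>

/* The source idiom from the slide, one element at a time. */
static void idiom(const unsigned char *a, unsigned char *b, int n)
{
    for (int i = 0; i < n; i++) {
        int x = (a[i] < 200) ? a[i] + 55 : 255;
        if (x > b[i]) b[i] = (unsigned char)x;
    }
}

/* Per-byte model of paddusb: add, then clamp at 255. */
static unsigned char add_sat_u8(unsigned char x, unsigned char y)
{
    int s = x + y;
    return (unsigned char)(s > 255 ? 255 : s);
}

/* Per-byte model of pmaxub: unsigned maximum. */
static unsigned char max_u8(unsigned char x, unsigned char y)
{
    return x > y ? x : y;
}

/* The same loop expressed with the two modeled instructions. */
static void simd_model(const unsigned char *a, unsigned char *b, int n)
{
    assert(n >= 0);
    for (int i = 0; i < n; i++)
        b[i] = max_u8(add_sat_u8(a[i], 55), b[i]);
}
```

The two loops agree for every input byte because a[i] + 55 exceeds 255 exactly when a[i] is above 200, and a[i] = 200 gives 255 either way.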
30
High Level Optimizer
  • Turn on via -O3
  • xN, xB, xP family of switches and so on
  • Additional loop optimizations
  • More aggressive dependency analysis
  • Scalar replacement
  • Loops must meet certain criteria

31
Shared Memory Parallel (SMP) Programming
  • OpenMP
  • Easy multithreading using directives
  • Use Intel tools to optimize for IA in tandem
    with OpenMP
  • Auto-parallelization
  • Simple loops threaded by compiler alone
  • Loops must meet certain criteria

32
OpenMP Support
  • OpenMP 2.0 for Fortran and C/C++
  • VTune analyzer works well for multithreaded code
    generated by the 8.1 compiler.
  • Also supports the OpenMP extension workqueuing
    model from Intel to exploit task level
    parallelism.
  • OpenMP switches

33
Auto-parallelization
  • Auto-parallelization: automatic threading of
    loops without having to manually insert OpenMP
    directives.
  • It is better to use OpenMP directives.
  • The compiler can identify easy candidates for
    parallelization, but large applications are
    difficult to analyze.

34
Agenda
  • Why is Multi-core Important?
  • Intel Software Development Products overview
  • Intel Compiler Optimization
  • General and processor-specific optimization
  • Advanced optimizations
  • Multi-pass optimizations
  • Interprocedural optimization (IPO)
  • Profile-guided optimization (PGO)
  • Vectorization
  • OpenMP
  • Summary

35
Interprocedural Optimizations (IPO)
  • -ip: Enables interprocedural optimizations for
    single file compilation
  • -ipo: Enables interprocedural optimizations
    across files
  • Enhances optimization when used in combination
    with other compiler features

36
Interprocedural Optimizations (IPO)
  • More benefits than just inlining
  • Partial inlining
  • Interprocedural constant propagation
  • Passing arguments in registers
  • Loop-invariant code motion
  • Dead code elimination
  • Helps vectorization, memory disambiguation

37
Usage: Two-Step Process
38
Agenda
  • Why is Multi-core Important?
  • Intel Software Development Products overview
  • Intel Compiler Optimization
  • General and processor-specific optimization
  • Advanced optimizations
  • Multi-pass optimizations
  • Interprocedural optimization (IPO)
  • Profile-guided optimization (PGO)
  • Vectorization
  • OpenMP
  • Summary

39
Profile Guided Optimizations (PGO)
  • Use execution-time feedback to guide many other
    compiler optimizations
  • Helps I-cache, paging, branch-prediction
  • Enabled optimizations
  • Basic block ordering
  • Better register allocation
  • Better decision of functions to inline
  • Function ordering
  • Switch-statement optimization
  • Better vectorization decisions

40
Usage: Three-Step Process
Step 1: Instrumented Compilation
    (Linux)   icc -prof_gen prog.c
    (Windows) icl -Qprof_gen prog.c
  Output: instrumented executable
Step 2: Instrumented Execution
    Run program on a typical dataset
  Output: DYN file containing dynamic info (.dyn)
Step 3: Feedback Compilation
    (Linux)   icc -prof_use prog.c
    (Windows) icl -Qprof_use prog.c
  Input: merged DYN summary file (.dpi). Delete old .dyn
  files if you do not want their info included.
41
When to Use
  • Applications with lots of functions, calls, or
    branches
  • Examples: databases, decision-support
    (enterprise), MCAD
  • Applications with computation spread throughout
  • Knowing trip counts for loops can enable more
    aggressive optimization
  • Considerations
  • Different paradigm for builds - three steps
  • Schedule time in final stages of development when
    code is more stable
  • Use representative data sets (not for corner
    cases)

42
Programs That Benefit
  • Consistent hot paths
  • Many if statements or switches
  • Nested if statements or switches

[Figure: code shapes with little benefit versus significant benefit]
43
Indirect Branches
  • Indirect branches not as predictable
  • Compared with conditional branches
  • Usually generated for switch statements
  • Have much larger relative latency than direct
    branches
  • Intel compiler does
  • Optimize likely cases to use conditional branches

44
PGO Example
    for (i = 0; i < NUM_BLOCKS; i++) {
      switch (check3(i)) {
        case 3:                  /* 25% */
          x[i] = 3; break;
        case 10:                 /* 75% */
          x[i] = 10; break;
        default:                 /* 0% */
          x[i] = 99; break;
      }
    }
  • PGO can eliminate jumps for the common case.
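
The effect of the profile can be sketched by hand: once it is known that the value 10 occurs about 75% of the time, the test for 10 goes first so the hot case falls straight through. In the sketch below, check3() is a hypothetical stand-in with that distribution and fill_hot_first() is a hand-written analogue of the reordered code, not compiler output.

```c
#include <assert.h>

#define NUM_BLOCKS 100

/* Hypothetical stand-in for the slide's check3(): returns 10 for 75%
   of the inputs and 3 for the remaining 25%. */
static int check3(int i)
{
    return (i % 4 == 0) ? 3 : 10;
}

/* Source order from the slide: case 3 is tested first. */
static void fill_switch(int *x)
{
    assert(x != 0);
    for (int i = 0; i < NUM_BLOCKS; i++) {
        switch (check3(i)) {
        case 3:  x[i] = 3;  break;   /* 25% */
        case 10: x[i] = 10; break;   /* 75% */
        default: x[i] = 99; break;   /*  0% */
        }
    }
}

/* Hand-written analogue of the profile-guided layout: the 75% case is
   tested first, so the common path takes the fewest branches. */
static void fill_hot_first(int *x)
{
    assert(x != 0);
    for (int i = 0; i < NUM_BLOCKS; i++) {
        int v = check3(i);
        if (v == 10)
            x[i] = 10;
        else if (v == 3)
            x[i] = 3;
        else
            x[i] = 99;
    }
}
```

Both functions fill the array identically; only the order of the tests differs.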

45
PGO Example
Before
    cmp edx, 3
    jne B16                            //jumps 75% of time
    mov DWORD PTR [esp+ebx*4-4], 3
    jmp B19                            //goto loop termination
  B16:
    cmp edx, 10
    jne B18
    mov DWORD PTR [esp+ebx*4-4], 10
    jmp B19                            //goto loop termination
  B18:
    mov DWORD PTR [esp+ebx*4-4], 99
  B19:                                 //loop termination
    cmp ebx, 100
    jl B14
After
    cmp edx, 3
    je B19                             //jumps 25% of time
    cmp edx, 10
    jne B110
    mov DWORD PTR [esp+ebx*4-4], 10
    jmp B17                            //goto loop termination
  B19:
    mov DWORD PTR [esp+ebx*4-4], 3
    jmp B17                            //goto loop termination
  B110:
    mov DWORD PTR [esp+ebx*4-4], 99
    jmp B17                            //goto loop termination
  B17:                                 //loop termination
    cmp ebx, 100
    jl B14

46
Agenda
  • Why is Multi-core Important?
  • Intel Software Development Products overview
  • Intel Compiler Optimization
  • Vectorization
  • OpenMP
  • Summary

47
Your Task!
  • Convert to this
      for (I = 0; I < MAX; I++)
        c[I] = a[I] + b[I];

[Figure: 128-bit registers; lanes A3..A0 + B3..B0 = C3..C0]
48
Conditions Affecting Vectorization
  • Loop Iterations generally must be Independent
  • Variables within the loop must be disambiguated
  • Also
  • Most function calls cannot be vectorized.
  • Some conditional branches prevent vectorization.
  • Loops must be countable.
  • Outer loop of nest cannot be vectorized.
  • Mixed data types cannot be vectorized.
  • Others

49
Why didn't my loop Vectorize?
  • Linux: -vec_reportn    Windows: -Qvec_reportn
  • Set diagnostic level dumped to stdout
  • n=0: No diagnostic information
  • n=1: (Default) Loops successfully vectorized
  • n=2: Adds loops not vectorized
  • n=3: Adds information about why it couldn't
    vectorize the loop

50
Vectorization Report
  • Loop was not vectorized because
  • Mixed Data Types
  • Nonunit stride used
  • Condition too Complex
  • Condition may protect exception
  • "vectorization possible but seems inefficient"
  • Low trip count
  • Operator unsuited for vectorization
  • Subscript too complex
  • Unsupported Loop Structure
  • Existence of vector dependence
  • Complex subscript expression
  • Contains unvectorizable statement at line XX
  • Not Inner Loop

51
Loop did not Vectorize Because: Nonunit stride
used
[Figure: memory layout]
    for (I = 0; I < MAX; I++)
      for (J = 0; J < MAX; J++) {
        c[I][J] = 1;                    //Unit stride
        c[J][I] = 1;                    //Non-unit
        A[J*J] = 1;                     //Non-unit
        A[B[I]] = 1;                    //Non-unit
        //can't use single ld instruction to populate vector
        if (A[MAX-J] == 1) last1 = J;   //Non-unit
        //non-unit stride used to preserve order
      }
  • End result: loading the vector may take more
    cycles than executing the operation sequentially.
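
C stores two-dimensional arrays row by row, so which index varies in the inner loop decides the stride. The sketch below contrasts the two traversal orders on a small array; the names and the size N are illustrative assumptions.

```c
#include <assert.h>

#define N 8

/* Unit stride: the inner index J walks adjacent ints within one row. */
static long sum_unit_stride(int c[N][N])
{
    long s = 0;
    assert(c != 0);
    for (int I = 0; I < N; I++)
        for (int J = 0; J < N; J++)
            s += c[I][J];
    return s;
}

/* Non-unit stride: the inner index J now selects the row, so successive
   accesses are N ints apart in memory. */
static long sum_non_unit_stride(int c[N][N])
{
    long s = 0;
    assert(c != 0);
    for (int I = 0; I < N; I++)
        for (int J = 0; J < N; J++)
            s += c[J][I];
    return s;
}
```

The two functions return the same sum; only the first reads memory contiguously, which is the access pattern a single vector load needs.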

52
Loop did not Vectorize Because: Condition may
protect exception
  • Conditions get turned into bit masks that execute
    both sides
  • If the compiler thinks there is a possibility
    that one side could throw an exception, it won't
    vectorize the loop
      int A[MAX-50];
      for (I = 0; I < MAX; I++)
        if (I < MAX-50) A[I] = 0;
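
One possible way around this diagnostic, offered here as an assumption rather than something the slides prescribe, is to fold the guard into the loop bound so the body becomes an unconditional store. The function names and the value of MAX are illustrative.

```c
#include <assert.h>

#define MAX 100

/* The slide's form: the store is guarded by a condition on the index,
   because A only has MAX-50 elements. */
static void zero_guarded(int *A)
{
    assert(A != 0);
    for (int I = 0; I < MAX; I++)
        if (I < MAX - 50) A[I] = 0;
}

/* Equivalent loop with the condition folded into the trip count, so no
   iteration can touch memory past the end of A. */
static void zero_bounded(int *A)
{
    assert(A != 0);
    for (int I = 0; I < MAX - 50; I++)
        A[I] = 0;
}
```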

53
Loop did not Vectorize Because: Low trip count
  • Setting up loops for SSE instructions incurs an
    overhead. Loops with a small number of iterations
    probably shouldn't be vectorized
  • Tell the compiler how large the loop is.
      #pragma loop count(5)
      for (i = 0; i < count1; i++)
        a1[i] = b1[i] * c1[i] + d1[i];
      #pragma loop count(1000)
      for (i = 0; i < count2; i++)
        a2[i] = b2[i] * c2[i] + d2[i];
  • Note: Aligning the data elements will allow the
    vectorizer to vectorize more low-trip-count loops.
    See the Alignment section.

54
Loop did not Vectorize Because: Vectorization
possible but seems inefficient
  • Could be the loop is so large it needs too many
    registers
  • Possible solution: loop fission
      #pragma distribute point
      for (I = 0; I < N; I++) {
        A[I] = 0;
      #pragma distribute point
        B[I] = 0;
      }
  • Hints to the compiler that creating two loops is
    possible/practical.
  • Other reasons could cause the creation of this
    warning!
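
Loop fission can also be done by hand. The sketch below shows the fused loop and the two loops it can be split into; the function names are hypothetical, and the split is only valid because the two statements do not depend on each other.

```c
#include <assert.h>

/* Fused form: one loop body touches both arrays. */
static void clear_fused(float *A, float *B, int N)
{
    assert(N >= 0);
    for (int I = 0; I < N; I++) {
        A[I] = 0;
        B[I] = 0;
    }
}

/* Fissioned form: each array gets its own loop, so each loop needs
   fewer registers and can be vectorized on its own. */
static void clear_split(float *A, float *B, int N)
{
    assert(N >= 0);
    for (int I = 0; I < N; I++)
        A[I] = 0;
    for (int I = 0; I < N; I++)
        B[I] = 0;
}
```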

55
Directives
  • Control vectorization of the subsequent loop
    (CDIR$ or !DIR$ in Fortran).
  • #pragma novector: Specifies that the loop should
    never be vectorized, even if it is legal to do
    so.
  • #pragma vector always: Overrides heuristic
    decisions about profitability and exception
    detection

    #pragma vector always
    for (i = 0; i < MAX; i++)
      a[i] = b[i] * c[i] + d[i];
56
Loop did not Vectorize Because: Mixed data types
    int howmany_close(float *x, float *y)
    {
      int withinborder = 0;
      float dist;
      for (int I = 0; I < MAX; I++) {
        dist = sqrtf(x[I]*x[I] + y[I]*y[I]);
        if (dist < 5) withinborder++;
      }
      return withinborder;
    }

57
Loop did not Vectorize Because: Existence of
Vector Dependency
  • Real dependency between iterations of the loop?
  • Or
  • Compiler can't assume that 2 pointers don't
    alias (overlap in memory)?

    void scale(float *z, float *x)
    {
      for (i = 0; i < 100; i++) z[i] = A * x[i];
    }
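
The compiler's caution is justified: if z and x overlap, a loop that loads several elements before storing gives a different answer from the sequential loop. The sketch below demonstrates this with a scalar loop and a hand-written four-wide model. Passing A and the length as parameters, and the name scale_4wide, are assumptions made for illustration.

```c
#include <assert.h>

/* The slide's loop, executed strictly one element at a time. */
static void scale(float *z, const float *x, float A, int n)
{
    assert(n >= 0);
    for (int i = 0; i < n; i++)
        z[i] = A * x[i];
}

/* Model of a vectorized loop: four elements are loaded first, then four
   results are stored. This is only safe when z and x do not overlap. */
static void scale_4wide(float *z, const float *x, float A, int n)
{
    int i = 0;
    assert(n >= 0);
    for (; i + 4 <= n; i += 4) {
        float t0 = x[i], t1 = x[i + 1], t2 = x[i + 2], t3 = x[i + 3];
        z[i] = A * t0;
        z[i + 1] = A * t1;
        z[i + 2] = A * t2;
        z[i + 3] = A * t3;
    }
    for (; i < n; i++)
        z[i] = A * x[i];
}
```

Called with z = buf + 1 and x = buf, the sequential loop forms a recurrence (each store feeds the next load) while the four-wide model does not, so the results differ. With non-overlapping arrays both agree.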
58
Disambiguation Switches
  • -fno-alias (Linux), /Oa (Windows)
  • All pointers in file.c are assumed not to alias.
  • -fno-fnalias (Linux), /Ow (Windows)
  • Assume no aliasing within functions (that is,
    pointer arguments are unique).
  • -ipo (Linux), /Qipo (Windows)
  • Global static analysis by IPO can disambiguate
    pointers.
  • -restrict (Linux), /Qrestrict (Windows)
  • Restrict qualifier enables pointer
    disambiguation.

59
Disambiguation - Restrict Keyword
  • Part of ISO C99 standard
  • Need command line switch
  • Windows: /Qrestrict    Linux: -restrict

    void foo(int *x, int *y, int *restrict z)
    {
      int i;
      for (i = 0; i < 100; i++) z[i] = A * x[i] + y[i];
    }
    // two-dimension example
    void mult(int a[][NUM], int b[restrict][NUM]);
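
A self-contained variant of the restrict example is sketched below. Passing A and the length n as parameters is an assumption made so the function stands alone; the slide leaves A undeclared. The restrict qualifier is a promise made by the caller, so the function must never be called with z overlapping x or y.

```c
#include <assert.h>

/* C99 restrict: within this function, the elements of z are reached only
   through z. The compiler may then vectorize without runtime overlap
   checks. The caller is responsible for keeping that promise. */
static void foo(const int *x, const int *y, int *restrict z, int A, int n)
{
    assert(n >= 0);
    for (int i = 0; i < n; i++)
        z[i] = A * x[i] + y[i];
}
```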
60
Disambiguation - Pragma
  • Tell the compiler to ignore assumed vector
    dependencies
      #pragma ivdep
      for (i = 0; i < 100; i++)
        z[i] = A * x[i] + y[i];

61
Loop did not Vectorize Because: Contains
unvectorizable statement at line XX
  • Loops with function calls normally don't
    vectorize
  • IPO compiles will inline more user-defined
    functions, which in turn allows vectorization
  • The question: which functions got inlined?

62
IPO Report
  • Compiler options
  • Linux
  • -opt_report -opt_report_phase ipo
  • Windows
  • -Qopt_report -Qopt_report_phase ipo
  • Information provided
  • Function inlining

63
Square_charge: A C Inlining Example
Calculate electrostatic potential due to a
uniform two-dimensional charge distribution.
    float trap_int(float y, float x0, float xn, int nx,
                   float xp, float yp)
    {
      float h, x, sumx;   /* declarations not shown on the slide */
      int i;
      h = (xn - x0) / nx;
      sumx = 0.5 * (func(x0, y, xp, yp) + func(xn, y, xp, yp));
      for (i = 1; i < nx; i++) {
        x = x0 + i * h;
        sumx = sumx + func(x, y, xp, yp);
      }
      sumx = sumx * h;
      return sumx;
    }
    float func(float x, float y, float xp, float yp)
    {
      float denom;
      denom = (x - xp) * (x - xp) + (y - yp) * (y - yp);
      denom = 1. / sqrt(denom);
      return denom;
    }

64
Square_charge Example
Run on a 1.7 GHz dual Pentium 4 processor
system.
Performance tests and ratings are measured using
specific computer systems and/or components and
reflect the approximate performance of Intel
products as measured by those tests.  Any
difference in system hardware or software design
or configuration may affect actual performance. 
Buyers should consult other sources of
information to evaluate the performance of
systems or components they are considering
purchasing.   For more information on performance
tests and on the performance of Intel products,
reference www.intel.com or call (U.S.)
1-800-628-8686 or 1-916-356-3104.
65
SVML Example
  • Should this vectorize?
      #include <math.h>
      void svml(int length, float *a, float *b,
                float *restrict x)
      {
        for (int i = 0; i < length; i++)
          x[i] = exp(b[i]);
      }
  • Answer: Yes
  • A special set of functions inside the inner loop
    will vectorize

icc -c -xN -vec_report3 svml-exp.cpp
66
Math Functions Using SVML Library
  • Short vector math library (SVML) provides
    efficient software implementations
  • sin/cos/tan
  • asin/acos/atan
  • sinh/cosh/tanh
  • asinh/acosh/atanh
  • log10/ln
  • exp/pow
  • Note: VML provided by MKL may be even faster

67
Agenda
  • Why is Multi-core Important?
  • Intel Software Development Products overview
  • Intel Compiler Optimization
  • Vectorization
  • OpenMP
  • Runtime functions/environment variables
  • Parallel regions
  • Work-sharing
  • Data environment
  • Summary

68
Introduction to OpenMP: Three Major Parallel
Technologies
  • Thread libraries
  • Win32 API
  • POSIX threads
  • Message passing libraries
  • Message passing interface (MPI)
  • Compiler directives
  • OpenMP - portable shared memory parallelism

69
Introduction to OpenMP: What Is It?
www.openmp.org
  • Portable, shared memory multi-processing
    application program interface (API)
  • Fortran 77, Fortran 90, C, and C++
  • Multi-vendor support, for both Unix and Windows
  • Standardizes loop-level parallelism
  • Supports coarse-grained parallelism
  • Combines serial and parallel code in single
    source
  • Standardizes the last 15 years of symmetric
    multiprocessing (SMP) experience

70
Introduction to OpenMP: Programming Model
  • Fork-Join Parallelism
  • Master thread spawns a team of threads as needed
  • Parallelism is added incrementally; that is, the
    sequential program evolves into a parallel program

Master Thread
Parallel Regions
71
Introduction to OpenMP: Parallelize Loops
  • Find your most time consuming loops.
  • Split them up between threads.

Split up this loop between multiple threads
Sequential Program
    void main() {
      double Res[1000];
      for (int i = 0; i < 1000; i++) {
        do_huge_comp(Res[i]);
      }
    }
Parallel Program
    void main() {
      double Res[1000];
    #pragma omp parallel for
      for (int i = 0; i < 1000; i++) {
        do_huge_comp(Res[i]);
      }
    }
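
A compilable sketch of the parallel version follows. Here do_huge_comp() is a hypothetical stand-in that writes the square of the index through a pointer; the slide does not show its body. Each iteration writes a different element, so the iterations are independent. When the code is built without OpenMP support the pragma is ignored and the loop simply runs serially.

```c
#include <assert.h>

#define N 1000

/* Hypothetical stand-in for the slide's do_huge_comp(): stores i*i. */
static void do_huge_comp(double *r, int i)
{
    *r = (double)i * i;
}

/* Each iteration writes only Res[i], so the iterations can be divided
   among threads without any synchronization. */
static void compute_all(double *Res, int n)
{
    assert(n >= 0);
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        do_huge_comp(&Res[i], i);
}
```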
72
Agenda
  • Why is Multi-core Important?
  • Intel Software Development Products overview
  • Intel Compiler Optimization
  • Vectorization
  • OpenMP
  • Runtime functions/environment variables
  • Parallel regions
  • Work-sharing
  • Data environment
  • Summary

73
Introduction to OpenMP: Library Routines
  • Runtime environment routines
  • Modify/check the number of threads
  • omp_set_num_threads()
  • omp_get_num_threads()
  • omp_get_thread_num()
  • omp_get_max_threads()
  • Are we in a parallel region?
  • omp_in_parallel()
  • How many processors in the system?
  • omp_get_num_procs()

74
Introduction to OpenMP: Library Routines
  • To fix the number of threads used in a program
  • Set the number of threads
  • Then save the number returned

Request as many threads as you have processors.
    #include <omp.h>
    void main() {
      int num_threads;
      omp_set_num_threads(omp_get_num_procs());
    #pragma omp parallel
      {
        int id = omp_get_thread_num();
    #pragma omp single
        num_threads = omp_get_num_threads();
        do_lots_of_stuff(id);
      }
    }
Protect this operation because memory stores are
not atomic
75
Introduction to OpenMP: Environment Variables
  • Set the default number of threads to use
  • OMP_NUM_THREADS int_literal
  • Control how "omp for schedule(RUNTIME)" loop
    iterations are scheduled
  • OMP_SCHEDULE "schedule[, chunk_size]"
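A short sketch of a loop whose schedule is chosen by the environment rather than by the source code. The function name double_all is hypothetical. With schedule(runtime), the loop reads its schedule from OMP_SCHEDULE when the program runs, for example by setting OMP_NUM_THREADS=4 and OMP_SCHEDULE="dynamic,4" in the shell before launching it.

```c
/* Write out[i] = 2 * in[i] for i in [0, n).
   schedule(runtime) defers the choice of schedule (static, dynamic,
   guided, and an optional chunk size) to the OMP_SCHEDULE variable. */
void double_all(const int *in, int *out, int n)
{
    #pragma omp parallel for schedule(runtime)
    for (int i = 0; i < n; i++)
        out[i] = 2 * in[i];
}
```

The result is the same for every schedule; only the assignment of iterations to threads changes, which makes this a convenient way to compare schedules without recompiling.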

76
OpenMP Lab: Hello Worlds
  • When run on four threads, this program
    may print this:
  • hello(2) world(2)
  • hello(3) world(3)
  • hello(0) world(0)
  • hello(1) world(1)

    #include <stdio.h>
    #include "omp.h"
    int main() {
        #pragma omp parallel
        {
            int ID = omp_get_thread_num();
            printf(" hello(%d) ", ID);
            printf(" world(%d) \n", ID);
        }
        return 0;
    }
77
Agenda
  • Why is Multi-core Important?
  • Intel Software Development Products overview
  • Intel Compiler Optimization
  • Vectorization
  • OpenMP
  • Runtime functions/environment variables
  • Parallel regions
  • Work-sharing
  • Data environment
  • Summary

78
Introduction to OpenMP: Structured Blocks (C/C++)
  • Most OpenMP constructs apply to structured
    blocks
  • Structured block: a block with one point of entry
    at the top and one point of exit at the bottom
  • The only branches allowed are STOP statements
    in Fortran and exit() in C/C++

Not a structured block
    if (go_now()) goto more;
    #pragma omp parallel
    {
        int id = omp_get_thread_num();
    more:
        res[id] = do_big_job(id);
        if (conv(res[id])) goto done;
        goto more;
    }
    done:
    if (!really_done()) goto more;
A structured block
    #pragma omp parallel
    {
        int id = omp_get_thread_num();
    more:
        res[id] = do_big_job(id);
        if (conv(res[id])) goto more;
    }
    printf(" All done \n");
79
Introduction to OpenMP: Structured Block Boundaries
  • In C/C++ a block is a single statement or a
    group of statements between braces { }

    #pragma omp parallel
    {
        id = omp_get_thread_num();
        res[id] = lots_of_work(id);
    }

    #pragma omp for
    for (I = 0; I < N; I++) {
        res[I] = big_calc(I);
        A[I] = B[I] + res[I];
    }
  • In Fortran a block is a single statement or a
    group of statements between directive/end-directive
    pairs.

    C$OMP PARALLEL
    10    wrk(id) = garbage(id)
          res(id) = wrk(id)**2
          if (conv(res(id))) goto 10
    C$OMP END PARALLEL

    C$OMP PARALLEL DO
          do I=1,N
            res(I) = bigComp(I)
          end do
    C$OMP END PARALLEL DO
80
Agenda
  • Why is Multi-core Important?
  • Intel Software Development Products overview
  • Intel Compiler Optimization
  • Vectorization
  • OpenMP
  • Runtime functions/environment variables
  • Parallel regions
  • Work-sharing
  • Data environment
  • Summary

81
Introduction to OpenMP: Work-Sharing Constructs
  • The "for" work-sharing construct splits up loop
    iterations among the threads in a team

    #pragma omp parallel
    #pragma omp for
    for (I = 0; I < N; I++) {
        NEAT_STUFF(I);
    }
By default, there is a barrier at the end of the
"omp for". Use the "nowait" clause to turn off
the barrier.
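A minimal sketch of where nowait can be useful, assuming two loops that work on unrelated arrays. The function name scale_two_arrays is hypothetical. Removing the barrier is only safe here because the second loop never reads or writes the array updated by the first.

```c
/* Two independent loops inside one parallel region. */
void scale_two_arrays(double *a, double *b, int n)
{
    #pragma omp parallel
    {
        /* nowait: a thread that finishes its share of this loop moves
           straight on instead of waiting at the loop's barrier. */
        #pragma omp for nowait
        for (int i = 0; i < n; i++)
            a[i] = 2.0 * a[i];

        /* Default behavior: barrier at the end of this loop. */
        #pragma omp for
        for (int i = 0; i < n; i++)
            b[i] = b[i] + 1.0;
    }
}
```

If the second loop depended on a[], the nowait clause would introduce a race and would have to be removed.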
82
Introduction to OpenMP: Work-Sharing Constructs
A Motivating Example
Sequential code
    for (i = 0; i < N; i++) { a[i] = a[i] + b[i]; }
OpenMP parallel region
    #pragma omp parallel
    {
        int id, i, Nthrds, istart, iend;
        id = omp_get_thread_num();
        Nthrds = omp_get_num_threads();
        istart = id * N / Nthrds;
        iend = (id + 1) * N / Nthrds;
        for (i = istart; i < iend; i++) { a[i] = a[i] + b[i]; }
    }
OpenMP parallel region and a work-sharing
for-construct
    #pragma omp parallel
    #pragma omp for schedule(static)
    for (i = 0; i < N; i++) { a[i] = a[i] + b[i]; }
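The two parallel variants can be put side by side as compilable functions to check that they compute the same thing. This is a sketch under assumptions: the names add_manual and add_worksharing are hypothetical, the _OPENMP guard is added so a serial build still works, and the index arithmetic is widened to long long to avoid overflow for large n.

```c
#ifdef _OPENMP
#include <omp.h>
#endif

/* Variant 1: each thread computes its own [istart, iend) slice. */
void add_manual(double *a, const double *b, int n)
{
    #pragma omp parallel
    {
        int id = 0, nthrds = 1;          /* serial fallback values */
#ifdef _OPENMP
        id = omp_get_thread_num();
        nthrds = omp_get_num_threads();
#endif
        int istart = (int)((long long)id * n / nthrds);
        int iend   = (int)((long long)(id + 1) * n / nthrds);
        for (int i = istart; i < iend; i++)
            a[i] = a[i] + b[i];
    }
}

/* Variant 2: the work-sharing construct does the partitioning. */
void add_worksharing(double *a, const double *b, int n)
{
    #pragma omp parallel
    #pragma omp for schedule(static)
    for (int i = 0; i < n; i++)
        a[i] = a[i] + b[i];
}
```

The slices in the first variant cover every index exactly once because iend for thread id equals istart for thread id + 1.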
83
Introduction to OpenMP: The Schedule Clause
84
Introduction to OpenMP: Parallel Sections (Task
Parallelism)
  • Independent sections of code can execute
    concurrently

    #pragma omp parallel sections
    {
        #pragma omp section
        phase1();
        #pragma omp section
        phase2();
        #pragma omp section
        phase3();
    }
Serial
Parallel
By default, there is a barrier at the end of the
"omp sections". Use the "nowait" clause to turn
off the barrier.
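A compilable sketch of the sections construct. The slide's phase1(), phase2() and phase3() are not defined in the deck, so a hypothetical helper (phase) that writes one value into its own slot stands in for them; run_phases is also an assumed name.

```c
/* Hypothetical stand-in for phase1/phase2/phase3: each call touches
   only its own output slot, so the sections are independent. */
static void phase(int *slot, int value)
{
    *slot = value;
}

/* Run three independent phases; each section may execute on a
   different thread, in any order. */
void run_phases(int out[3])
{
    #pragma omp parallel sections
    {
        #pragma omp section
        phase(&out[0], 10);
        #pragma omp section
        phase(&out[1], 20);
        #pragma omp section
        phase(&out[2], 30);
    }
    /* The barrier at the end of the construct guarantees that all
       three slots are written by the time we get here. */
}
```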
85
Agenda
  • Why is Multi-core Important?
  • Intel Software Development Products overview
  • Intel Compiler Optimization
  • Vectorization
  • OpenMP
  • Runtime functions/environment variables
  • Parallel regions
  • Work-sharing
  • Data environment
  • Summary

86
Introduction to OpenMP: Reduction
  • Another clause that affects the way variables are
    shared
  • reduction (op : list)
  • The variables in list must be shared in the
    enclosing parallel region
  • Inside a parallel or a work-sharing construct
  • A local copy of each list variable is made and
    initialized depending on the op (for example, 0
    for "+").
  • Compiler finds standard reduction expressions
    containing op and uses them to update the local
    copy
  • Local copies are reduced to a single value and
    combined with the original global value
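The steps in the list above can be sketched by writing the same sum twice: once with the reduction clause, and once with the local copy, update and final combine spelled out by hand. The function names are hypothetical, and the hand-written version only illustrates the semantics; a real OpenMP implementation is free to combine the partial results differently.

```c
/* Sum 1..n with the reduction clause. */
long sum_reduction(int n)
{
    long sum = 0;
    #pragma omp parallel for reduction(+:sum)
    for (int j = 1; j <= n; j++)
        sum += j;
    return sum;
}

/* The same computation with each step made explicit. */
long sum_by_hand(int n)
{
    long sum = 0;                  /* original shared value           */
    #pragma omp parallel
    {
        long local = 0;            /* local copy, initialized to 0
                                      because the operator is "+"     */
        #pragma omp for
        for (int j = 1; j <= n; j++)
            local += j;            /* updates go to the local copy    */

        #pragma omp critical
        sum += local;              /* combine with the shared value,
                                      one thread at a time            */
    }
    return sum;
}
```

For n = 1000 both functions return 500500, the value of 1000 * 1001 / 2.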

87
Introduction to OpenMP: Reduction Example
          program closer
          IS = 0
          DO 1000 J=1,1000
             IS = IS + J
    1000  CONTINUE
          print *, IS
  • Remember the code we used to demonstrate private,
    firstprivate and lastprivate?

          program correct
          IS = 0
    C$OMP PARALLEL DO REDUCTION(+:IS)
          DO 1000 J=1,1000
             IS = IS + J
    1000  CONTINUE
          print *, IS
Here is the correct way to parallelize this code
88
Introduction to OpenMP: Reduction Operands/Initial-Values
  • A range of associative operators can be used with
    reduction
  • Initial values are the ones that make sense
    mathematically.
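Two short examples of operators other than "+", as a sketch. For "*" the local copies start at 1 and for "&&" they also start at 1 (true), which are the identity values for those operations. The function names are hypothetical.

```c
/* Product of v[0..n-1]: local copies start at 1 for "*". */
double product(const double *v, int n)
{
    double prod = 1.0;
    #pragma omp parallel for reduction(*:prod)
    for (int i = 0; i < n; i++)
        prod *= v[i];
    return prod;
}

/* Returns 1 if every element is positive, else 0:
   local copies start at 1 (true) for "&&". */
int all_positive(const double *v, int n)
{
    int ok = 1;
    #pragma omp parallel for reduction(&&:ok)
    for (int i = 0; i < n; i++)
        ok = ok && (v[i] > 0.0);
    return ok;
}
```

Starting each local copy at the identity value means a thread that receives no iterations contributes nothing to the final result.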

89
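OpenMP Lab: The PI Program: Numerical Integration
Mathematically, we know that
$$\int_0^1 \frac{4.0}{1+x^2}\,dx = \pi$$
We can approximate the integral as a sum of
rectangles
$$\sum_{i=0}^{N-1} F(x_i)\,\Delta x \approx \pi, \qquad F(x) = \frac{4.0}{1+x^2}$$
Where each rectangle has width Δx and height
F(x_i) at the middle of interval i.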
90
OpenMP Lab: PI Program: The Sequential Program
    static int num_steps = 100000;
    double step;
    int main() {
        int i;
        double x, pi, sum = 0.0;
        step = 1.0 / (double) num_steps;
        for (i = 0; i < num_steps; i++) {
            x = (i + 0.5) * step;
            sum = sum + 4.0 / (1.0 + x * x);
        }
        pi = step * sum;
        return 0;
    }
91
OpenMP Lab
    #include <omp.h>
    #define NUM_THREADS 2
    static int num_steps = 100000;
    double step;
    int main() {
        int i;
        double x, pi, sum[NUM_THREADS] = {0};
        step = 1.0 / (double) num_steps;
        omp_set_num_threads(NUM_THREADS);
        #pragma omp parallel
        {
            double x;
            int id, i, nthreads;
            id = omp_get_thread_num();
            nthreads = omp_get_num_threads();
            for (i = id; i < num_steps; i = i + nthreads) {
                x = (i + 0.5) * step;
                sum[id] += 4.0 / (1.0 + x * x);
            }
        }
        for (i = 0, pi = 0.0; i < NUM_THREADS; i++)
            pi += sum[i] * step;
        return 0;
    }
SPMD Programs: Each thread runs the same code with
the thread ID selecting any thread-specific
behavior.
92
OpenMP Lab
    #include <omp.h>
    #define NUM_THREADS 2
    static int num_steps = 100000;
    double step;
    int main() {
        int i;
        double x, pi, sum[NUM_THREADS] = {0.0};
        step = 1.0 / (double) num_steps;
        omp_set_num_threads(NUM_THREADS);
        #pragma omp parallel
        {
            double x;
            int i, id;
            id = omp_get_thread_num();
            #pragma omp for
            for (i = 0; i < num_steps; i++) {
                x = (i + 0.5) * step;
                sum[id] += 4.0 / (1.0 + x * x);
            }
        }
        for (i = 0, pi = 0.0; i < NUM_THREADS; i++)
            pi += sum[i] * step;
        return 0;
    }
Work Sharing Programs: Each thread runs the same
code with the system selecting the proper
iteration count for each thread.
93
OpenMP Lab: PI Program: Parallel For With a
Reduction
    #include <omp.h>
    static int num_steps = 100000;
    double step;
    #define NUM_THREADS 2
    int main() {
        int i;
        double x, pi, sum = 0.0;
        step = 1.0 / (double) num_steps;
        omp_set_num_threads(NUM_THREADS);
        #pragma omp parallel for reduction(+:sum) private(x)
        for (i = 0; i < num_steps; i++) {
            x = (i + 0.5) * step;
            sum = sum + 4.0 / (1.0 + x * x);
        }
        pi = step * sum;
        return 0;
    }
OpenMP adds 2 to 4 lines of code
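For checking the result, the reduction version can be wrapped in a function that returns the estimate. This is a sketch, not the lab's reference solution: the name compute_pi is hypothetical, and x is declared inside the loop body so that it is private without needing a private(x) clause.

```c
/* Midpoint-rule estimate of the integral of 4/(1+x^2) over [0, 1]. */
double compute_pi(int num_steps)
{
    double sum = 0.0;
    double step = 1.0 / (double)num_steps;

    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < num_steps; i++) {
        double x = (i + 0.5) * step;      /* midpoint of interval i */
        sum += 4.0 / (1.0 + x * x);
    }
    return step * sum;
}
```

With 100000 steps the estimate should agree with pi to well within 1e-8. The last digits can vary slightly between runs because the order in which partial sums are combined is not fixed.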
94
Agenda
  • Why is Multi-core Important?
  • Intel Software Development Products overview
  • Intel Compiler Optimization
  • Vectorization
  • OpenMP
  • Summary

95
Multithreading Development Cycle
  • Analysis
  • VTune Performance Analyzer
  • Design (Introduce Threads)
  • OpenMP (Intel Compiler)
  • Intel Performance libraries IPP and MKL
  • Explicit threading (Win32, Pthreads)
  • Debug for Correctness
  • Intel Thread Checker
  • Intel Debugger
  • Tune for Performance
  • Intel Compiler
  • Intel Thread Profiler
  • VTune Performance Analyzer

96
Reference: For Further Information: Training and
Support
  • Web-based and classroom training
  • www.intel.com/software/college
  • White papers and technical notes
  • www.intel.com/ids
  • www.intel.com/software/products
  • Product support resources
  • www.intel.com/software/products/support

97
Questions?
Thank you!
98
Backup
99
Intel C++ and Fortran Compilers
"The Intel C++ Compiler for Linux provided to
Fluent's Computational Fluid Dynamics (CFD)
software an impressive 9% to 37% performance
improvement over the GNU C compiler, when we ran
our standard benchmarks. The Intel C++ Compiler
for Linux integrated smoothly into our
development environment, with no technical
issues." Dr. Dipankar Choudhury, CTO, Fluent
Inc.
  • Helps your software to run at top speeds!
  • Windows support
  • Plug-in compatibility with Microsoft Visual
    Studio
  • Source compatibility with Compaq Visual Fortran
  • Native source and object code compatibility with
    Microsoft Visual C++
  • Linux support
  • Improved command line compatibility with GCC (C++
    Linux)
  • Source and binary compatibility with GCC 4.0
  • Integration with Eclipse 3.0/CDT 2.1.1 (IA-32
    only)
  • Intel processor support
  • 32-bit processors, Intel EM64T, and Itanium 2
    processor families
  • Support for Streaming SIMD Extensions (SSE2,
    SSE3)
  • Support Intel multi-core processors
  • Continue to support various AMD processors such
    as AMD Opteron and Athlon
  • Intel Code Coverage and Intel Test Prioritization
    Tools

More
Compiler Product Page
100
Intel VTune Performance Analyzer
"The Intel VTune Performance Analyzer took a
multi-day task and turned it into a sub-day
task." Randy Camp, V.P. Software Research and
Development, MUSICMATCH, Inc.
  • Quickly identify application bottlenecks
  • Increase application performance and save time in
    the development cycle
  • Use multiple methods to gather data with minimal
    intrusion
  • Call Graph, Sampling
  • Support for Java and .NET
  • Windows capabilities
  • Powerful graphical analysis
  • Remote agents for profiling Linux and XScale
    processor platforms
  • Linux capabilities
  • Native Linux - No Windows required
  • Supports many Red Hat and SuSE Linux releases
  • Powerful Eclipse based GUI or flexible command
    line interface

VTune Analyzer Product Page
101
VTune Analyzer Features and Usage Models
Sampling Collects System-wide Performance Data
Intel, VTune, and the Intel logo are trademarks
or registered trademarks of Intel Corporation or
its subsidiaries in the United States or other
countries.
102
VTune Analyzer Features and Usage Models
Sampling Source View Displays Source Code
Annotated with Performance Data
103
VTune Analyzer Features and Usage Models
Call Graph Collects and Displays Information
About the Program Flow of the Application
104
Intel Thread Checker / Intel Thread Profiler
"Intel's Thread Checker helped identify potential
threading issues very quickly, in days compared
to weeks if done otherwise." Dana Batali, Director
of RenderMan Development, Pixar
  • Intel Thread Checker
  • Identifies nearly impossible-to-find bugs using
    an advanced error detection engine.
  • Bugs do not need to occur to be detected.
  • Graphically group, sort, and filter errors by
    severity and applicability
  • Automates detection of most threading errors like
    race conditions, deadlocks and stalls
  • Supports commonly used threading models
  • Win32 (Windows), OpenMP (Windows or Linux),
    and POSIX (Linux) threads
  • Intel Thread Profiler
  • Identifies areas that are/are not optimally
    threaded on multi-core computers, computers with
    Hyper-Threading Technology (HT), and
    multi-processor computers
  • Profiles critical path for threaded applications
  • Highlights thread workload imbalances
  • Supports commonly used threading models
  • Win32 (Windows), OpenMP (Windows or Linux),
    and POSIX (Linux) threads

Threading Tools Product Page
105
Intel Thread Checker Diagnostics
The Four Step Process - Correctness
106
Intel Thread Checker Diagnostics
The Four Step Process - Correctness
107
Intel Thread Profiler Results
The Four Step Process - Performance
108
Intel Thread Profiler Timeline View
The Four Step Process - Performance
  • Available when analyzing explicit threads

109
Intel Performance Libraries
  • Intel Integrated Performance Primitives
  • A library of highly optimized functio