Title: Tight Analysis of the Performance Potential of Thread Speculation using SPEC CPU2006
1. Tight Analysis of the Performance Potential of Thread Speculation using SPEC CPU2006
- Arun Kejariwal, Xinmin Tian, Milind Girkar, Wei Li, Sergey Kozhukhov, Hideki Saito, Utpal Banerjee, Alexandru Nicolau, Alexander V. Veidenbaum, Constantine D. Polychronopoulos
- Software and Solutions Group, Intel Corporation
- Center for Embedded Computer Systems, University of California, Irvine
- Center for Supercomputing Research and Development, University of Illinois at Urbana-Champaign
- March 16, 2007
2. Parallelism Becoming Ubiquitous
- Emergence of multi-cores, hyper-threaded processors
  - Intel Core 2 Duo processor
  - Intel Kentsfield (quad-core) processor
- Program parallelization
  - Auto-parallelization
  - Hardware-assisted (speculative) parallelization
    - Loop-level
- Main contributions
  - Evaluating the performance potential of thread-level parallelism using SPEC CPU2006
  - Trade-off between ILP and TLP (thread-level parallelism)
  - Analysis w.r.t. threading overhead, transformations and conflict probability
Other brands and names may be claimed as the
property of others.
3. Clarification of Chosen Approach
- Tight upper bounds on the performance potential of TLS by filtering out
  - Inherently parallel program regions
  - Non-profitable candidates for TLS
- Focus on parallelism that cannot be exploited with state-of-the-art compiler technology, but can uniquely be exploited via TLS
4. Global View
5. Differentiating TLP and TLS
- Loop-level parallelism
  - Parallel (DOALL) loops correspond to true TLP
  - Non-parallel (non-DOALL) loops correspond to speculative TLP (sTLP)
- Performance achievable by each technique standalone
- Performance achievable by each technique in conjunction with others
6. Taxonomy
- CS: Control Speculation
- DDS: Data Dependence Speculation
- DVS: Data Value Speculation
7. Methodology Details
- Analysis will refer only to innermost loops (however)
- Two-step approach: filter out loops
  - that can be parallelized using state-of-the-art compiler techniques (dependence analysis, pointer analysis, IPA, etc.)
  - for which it is more profitable to exploit ILP instead of sTLP
- Filtering was done using the Intel Compiler and manual analysis (for <10% of the loops).
- The remaining loops are considered for evaluating the performance potential of TLS at the innermost loop level.
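The two-step filter described on this slide can be sketched in a few lines. This is only an illustration: the `Loop` record and its boolean fields are hypothetical stand-ins for the verdicts that the compiler and the manual analysis would supply, and the loop names are made up.

```python
from dataclasses import dataclass


@dataclass
class Loop:
    # All fields are hypothetical; they stand in for compiler/manual verdicts.
    name: str
    innermost: bool
    compiler_parallelizable: bool  # step 1: already parallelizable without speculation
    ilp_preferred: bool            # step 2: ILP is more profitable than sTLP


def tls_candidates(loops):
    """Return the innermost loops that survive both filtering steps."""
    survivors = []
    for loop in loops:
        if not loop.innermost:
            continue  # the analysis only looks at innermost loops
        if loop.compiler_parallelizable:
            continue  # step 1: no speculation needed
        if loop.ilp_preferred:
            continue  # step 2: better served by ILP techniques
        survivors.append(loop)
    return survivors


loops = [
    Loop("outer", False, False, False),
    Loop("doall", True, True, False),
    Loop("tiny", True, False, True),
    Loop("candidate", True, False, False),
]
candidate_names = [loop.name for loop in tls_candidates(loops)]
print(candidate_names)
```

Only the loops that remain after both steps would then be counted towards the TLS potential.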
8. The Baseline
9. ILP/sTLP Trade-off
- Determine what is achievable beyond existing ILP techniques
- Example: a candidate loop for DVS
  - Also a candidate for software pipelining
  - Too small for TLS to be profitable (too little computation vs. threading overhead)
10. ILP/sTLP Trade-off (contd.)
- CS+DVS vs. Perfect Pipelining (PP)
  - Profitability of CS+DVS: high coverage per iteration
  - PP can be applied in an unrestricted fashion
- Example loop on the right
  - Less than 1% coverage
  - Very small coverage per iteration
  - Not suitable for TLS
11. ILP/sTLP Trade-off (contd.)
- Symbolic analysis
  - Converts a non-DOALL loop into a DOALL loop
- Example loop on the right
  - No need for TLS
  - Too small to exploit non-speculative TLP in a profitable fashion
[Figure: non-DOALL loop and DOALL loop]
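The example loop from this slide is not preserved in the transcript. As a generic illustration of the idea (not the authors' loop), the sketch below assumes a simple induction variable `k` that creates a loop-carried dependence; replacing it with a closed-form expression in `i` makes every iteration independent. The function names are hypothetical.

```python
def fill_non_doall(n, step):
    """Non-DOALL form: each iteration needs the value of k from the previous one."""
    a = [0] * n
    k = 0
    for i in range(n):
        k += step      # loop-carried dependence on k
        a[i] = k
    return a


def fill_doall(n, step):
    """DOALL form: k is replaced by its closed form, so iterations are independent."""
    a = [0] * n
    for i in range(n):
        a[i] = (i + 1) * step  # depends only on i
    return a


print(fill_non_doall(4, 3), fill_doall(4, 3))
```

Once the dependence is removed symbolically, the loop no longer needs speculation at all, which is why such loops are excluded from the TLS estimate.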
12. Procedural-level TLS
- Subject to
  - Limitations of the inlining heuristic of the compiler
  - The strength of dependence analysis supported
- Example loop on the right
  - Procedure calls phi0, phi1 and phi2 are loop invariant
  - Hoist the functions
  - Resulting loop: a DOALL loop
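The loop shown on the slide is not preserved, so the following is a minimal sketch of hoisting under stated assumptions: the names `phi0`, `phi1` and `phi2` come from the slide, but their bodies, the parameter `p`, and the loop body are invented for illustration. The call counter exists only to make the effect of hoisting visible.

```python
from collections import Counter

calls = Counter()  # demonstration only: counts how often each function runs


def phi0(p):
    calls["phi0"] += 1
    return p + 1


def phi1(p):
    calls["phi1"] += 1
    return 2 * p


def phi2(p):
    calls["phi2"] += 1
    return p * p


def loop_with_calls(data, p):
    """Original shape: the invariant calls are re-evaluated in every iteration."""
    out = []
    for v in data:
        out.append(v * phi0(p) + phi1(p) - phi2(p))
    return out


def loop_hoisted(data, p):
    """Hoisted shape: the calls run once, and the loop body has no calls left."""
    c0, c1, c2 = phi0(p), phi1(p), phi2(p)
    out = []
    for v in data:
        out.append(v * c0 + c1 - c2)
    return out


print(loop_with_calls([1, 2, 3], 3), loop_hoisted([1, 2, 3], 3))
```

With the calls out of the body, a compiler no longer has to reason about the callees to prove the iterations independent.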
13. Evaluation of TLS
- SPEC INT and SPEC FP 2006
- Intel auto-parallelizing compiler
- System configuration
- Evaluate the performance potential of various speculation techniques at the innermost loop level, subject to practical constraints such as threading overhead
Performance tests and ratings are measured using
specific computer systems and/or components and
reflect the approximate performance of Intel
products as measured by those tests. Any
difference in system hardware or software design
or configuration may affect actual performance.
Buyers should consult other sources of
information to evaluate the performance of
systems or components they are considering
purchasing. For more information on performance
tests and on the performance of Intel products,
reference http://www.intel.com/performance/resources/benchmark_limitations.htm or call (U.S.)
1-800-628-8686 or 1-916-356-3104.
14. Notes on Results
- Important issue 1: previous studies of the potential of TLS used unrealistic (small) thread overheads.
  - Authors' study: thread overhead 1000 cycles
  - Linux NPTL: 10k cycles
- Thread overhead
  - thread creation
  - thread management
  - thread synchronization
15. Notes on Results (contd.)
- Important issue 2: all results presented heretofore correspond to dynamic scheduling with one iteration scheduled at a time.
- Loops whose coverage per iteration is smaller than the threading overhead are filtered out.
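A rough sketch of this filter follows. It reads "coverage per iteration" as the average number of cycles spent in one iteration, which is an interpretation rather than a definition taken from the slides, and the profile numbers are invented. With one iteration scheduled per thread, an iteration shorter than the threading overhead cannot pay for its own thread.

```python
def cycles_per_iteration(total_cycles, iterations):
    """Average work per iteration, in cycles."""
    return total_cycles / iterations


def survives_filter(total_cycles, iterations, t_ovhd):
    """Keep a loop only if one iteration is at least as long as the overhead."""
    return cycles_per_iteration(total_cycles, iterations) >= t_ovhd


# Hypothetical profile: loop name -> (total cycles in loop, iteration count).
profile = {
    "loop_a": (5_000_000, 1_000),   # 5000 cycles per iteration
    "loop_b": (200_000, 10_000),    # 20 cycles per iteration
}

kept_at_1k = [n for n, (c, i) in profile.items() if survives_filter(c, i, 1000)]
kept_at_10 = [n for n, (c, i) in profile.items() if survives_filter(c, i, 10)]
print(kept_at_1k, kept_at_10)
```

The second list shows how a much smaller assumed overhead lets short loops back into the candidate set.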
16. Variation of TLS Performance Potential w.r.t. Threading Overhead
- Using a state-of-the-art threading library, overhead is min. 1K cycles
- Ideal case: 40%
- Practical case: <3%
17. Variation of TLS Performance Potential w.r.t. Threading Overhead
- Ideal case: 40%
- Practical case: 0%
18. Mitigating the Threading Overhead
- Loop unrolling
  - Higher inter-thread destructive interference in the D-cache (me but, but!)
  - Increases the number of potential dependences between any two iterations of the loop
  - Results in a higher misspeculation rate
[Figure: unroll twice]
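To make the trade-off concrete, the sketch below groups iterations into per-thread chunks and computes the share of each thread's time that goes to overhead. Both helper names and the cost model (overhead paid once per thread, work proportional to the unroll factor) are simplifying assumptions, and the model deliberately ignores the cache interference and extra dependences listed above.

```python
def chunk_iterations(n_iter, unroll):
    """Group iteration indices so that each thread executes `unroll` iterations."""
    return [
        list(range(start, min(start + unroll, n_iter)))
        for start in range(0, n_iter, unroll)
    ]


def overhead_fraction(t_iter, t_ovhd, unroll):
    """Share of a thread's time spent on threading overhead (simplified model)."""
    return t_ovhd / (t_ovhd + unroll * t_iter)


print(chunk_iterations(5, 2))
print(overhead_fraction(500, 1000, 1), overhead_fraction(500, 1000, 2))
```

Unrolling twice halves the number of threads and lowers the overhead share, but each speculative thread now covers two iterations' worth of reads and writes, which is where the higher misspeculation rate comes from.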
19. Mitigating the Threading Overhead (contd.)
- Using a large number of processors
  - $T_{ovhd} < (N_p - 1) \times T_{iter}$
  - $T_{ovhd} < (N_{iter} - 1) \times T_{iter}$ for an unbounded number of processors
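The two conditions on this slide can be checked numerically. The sketch below assumes the second condition reads (N_iter - 1) x T_iter, with the minus sign lost in extraction; the function names and the sample numbers are illustrative only.

```python
def overhead_bound(t_iter, n_iter, n_procs=None):
    """Upper bound on the threading overhead, in cycles.

    With n_procs given, the bound is (N_p - 1) * T_iter; with n_procs=None
    (unbounded processors) it is (N_iter - 1) * T_iter.
    """
    workers = n_iter if n_procs is None else n_procs
    return (workers - 1) * t_iter


def overhead_within_bound(t_ovhd, t_iter, n_iter, n_procs=None):
    """True when the threading overhead satisfies the slide's inequality."""
    return t_ovhd < overhead_bound(t_iter, n_iter, n_procs)


# Hypothetical loop: 100 iterations of 400 cycles each, 1000-cycle overhead.
print(overhead_within_bound(1000, 400, 100, n_procs=2))  # bound is 400 cycles
print(overhead_within_bound(1000, 400, 100, n_procs=4))  # bound is 1200 cycles
print(overhead_within_bound(1000, 400, 100))             # bound is 39600 cycles
```

For a fixed iteration length, adding processors raises the bound, so an overhead that fails the condition on two processors can satisfy it on four.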
20. Performance Potential of Different Types of TLS
- SPEC CINT2006 (refer to the paper for CFP2006)
- Gray columns: assuming $T_{ovhd} = 1000$ cycles
- White columns: assuming $T_{ovhd} = 10$ cycles
21. Bounds on Conflict Probability
- Model the impact of misspeculation penalty
- For m points of speculation in an iteration
22. Bounds on Conflict Probability (contd.)
- Example: 401.perlbench
  - Applying TLS on Loop 8 is profitable only if the misspeculation probability is < 0.28
  - Applying TLS on other loops is not beneficial
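The model behind the 0.28 threshold is not preserved in this transcript. To show how such a break-even probability can arise, here is a deliberately simple stand-in, not the authors' model: it assumes a single point of speculation, a fixed penalty on misspeculation, and invented cycle counts.

```python
def expected_speculative_time(t_spec, t_penalty, p_conflict):
    """Expected time when a conflict occurs with probability p_conflict (assumed model)."""
    return t_spec + p_conflict * t_penalty


def break_even_probability(t_seq, t_spec, t_penalty):
    """Conflict probability at which speculation stops beating sequential execution."""
    return (t_seq - t_spec) / t_penalty


# Invented numbers: 1000 cycles sequential, 600 cycles speculative, 800-cycle penalty.
threshold = break_even_probability(1000, 600, 800)
print(threshold)
print(expected_speculative_time(600, 800, 0.25) < 1000)
```

Under this toy model, speculation pays off only while the conflict probability stays below the threshold, which is the same shape of statement the slide makes about Loop 8.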
23. Conclusions
- Performance potential of innermost loop-level TLS (excluding TLP)
  - Assuming threading overhead of 1K cycles: geometric mean 1%
- Performance potential of standalone CS, DDS and DVS is rather limited
- Future work: evaluate the trade-off between TLS and run-time parallelization