Title: Tight Analysis of the Performance Potential of Thread Speculation using SPEC CPU2006


1
Tight Analysis of the Performance Potential of
Thread Speculation using SPEC CPU2006
  • Arun Kejariwal, Xinmin Tian
  • Milind Girkar, Wei Li, Sergey Kozhukhov,
    Hideki Saito
  • Utpal Banerjee, Alexandru Nicolau, Alexander V.
    Veidenbaum
  • Constantine D. Polychronopoulos
  • Software and Solutions Group, Intel Corporation
  • Center for Embedded Computer Systems, University
    of California, Irvine
  • Center for Supercomputing Research and
    Development, University of Illinois at
    Urbana-Champaign
  • March 16, 2007

2
Parallelism Becoming Ubiquitous
  • Emergence of multi-cores, hyper-threaded
    processors
  • Intel Core 2 Duo processor
  • Intel Kentsfield (quad-core) processor
  • Program parallelization
  • Auto-parallelization
  • Hardware-assisted (speculative) parallelization

Loop-level
  • Main contributions
  • Evaluating the performance potential of
    thread-level parallelism using SPEC CPU2006
  • Trade-off between ILP and TLP (thread-level
    parallelism)
  • Analysis w.r.t. threading overhead,
    transformations and conflict probability

Other brands and names may be claimed as the
property of others.
3
Clarification of Chosen Approach
  • Tight upper bounds on the performance potential
    of TLS, obtained by filtering out:
  • Inherently parallel program regions
  • Non-profitable candidates for TLS
  • Focus on parallelism that cannot be exploited
    with state-of-the-art compiler technology, but
    can uniquely be exploited via TLS

4
Global View
5
Differentiating TLP and TLS
  • Loop-level parallelism
  • Parallel (DOALL) loops correspond to true TLP
  • Non-parallel (non-DOALL) loops correspond to
    speculative TLP (sTLP)
  • Performance achievable by each technique
    standalone
  • Performance achievable by each technique in
    conjunction with others
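
As a minimal illustration of the distinction (function names and data are hypothetical, not taken from the slides): the first loop below has no loop-carried dependence, so its iterations can run in any order (DOALL), while the last one carries a value from each iteration to the next (non-DOALL).

```python
def scale(a, factor):
    """DOALL: iteration i reads and writes only element i."""
    out = [0] * len(a)
    for i in range(len(a)):
        out[i] = a[i] * factor
    return out


def scale_any_order(a, factor, order):
    """Same loop body, executed in an arbitrary iteration order."""
    out = [0] * len(a)
    for i in order:
        out[i] = a[i] * factor
    return out


def running_sum(a):
    """Non-DOALL: iteration i needs the value produced by iteration i - 1."""
    out = [0] * len(a)
    acc = 0
    for i in range(len(a)):
        acc += a[i]  # loop-carried dependence on acc
        out[i] = acc
    return out
```

Because `scale_any_order` gives the same result as `scale` for any permutation of the indices, the loop can be split across threads without speculation. `running_sum` offers no such freedom as written.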

6
Taxonomy
  • CS: Control Speculation
  • DDS: Data Dependence Speculation
  • DVS: Data Value Speculation
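
The three kinds of speculation can be pictured with small loops. These are illustrative sketches only (hypothetical names, with sequential Python standing in for compiled loops). Each comment names what a speculative thread would have to guess in order to start the next iteration early.

```python
def first_negative(a):
    """Control speculation (CS): whether iteration i + 1 runs at all
    depends on the branch taken in iteration i (early exit)."""
    for i, x in enumerate(a):
        if x < 0:
            return i
    return -1


def histogram(keys, nbins):
    """Data dependence speculation (DDS): two iterations conflict only
    if they update the same bin, which is not known at compile time."""
    bins = [0] * nbins
    for k in keys:
        bins[k % nbins] += 1
    return bins


def list_length(head):
    """Data value speculation (DVS): the next iteration needs node["next"],
    a value that a speculative thread would have to predict."""
    n = 0
    node = head
    while node is not None:
        n += 1
        node = node["next"]
    return n
```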
7
Methodology Details
  • Analysis will refer only to innermost loops
    (however)
  • Two-step approach: filter out loops
    - that can be parallelized using
      state-of-the-art compiler techniques (dependence
      analysis, pointer analysis, IPA, etc.)
    - for which it is more profitable to exploit ILP
      instead of sTLP
  • Filtering was done using the Intel Compiler
    and manual analysis (for <10% of the loops).
  • The remaining loops are considered for evaluating
    the performance potential of TLS at the innermost
    loop level.

8
The Baseline
  • Auto-parallelization

9
ILP/sTLP Trade-off
  • Determine what is achievable beyond existing
    ILP techniques
  • Example: a candidate loop for DVS
  • Also a candidate for software pipelining
  • Too small for TLS to be profitable (too little
    computation vs. threading overhead)

10
ILP/sTLP Trade-off (contd.)
  • CSDVS vs. Perfect Pipelining (PP)
  • Profitability of CSDVS: high coverage per
    iteration
  • PP can be applied in an unrestricted fashion
  • Example loop on the right
  • Less than 1% coverage
  • Very small coverage per iteration
  • Not suitable for TLS

11
ILP/sTLP Trade-off (contd.)
  • Symbolic Analysis
  • Convert a non-DOALL loop into a DOALL loop
  • Example loop on the right
  • No need for TLS
  • Too small to exploit non-speculative TLP in a
    profitable fashion

non-DOALL loop
DOALL loop
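
As a hedged illustration (this is not the loop from the slide), the sketch below uses induction-variable substitution, one common symbolic-analysis transformation. The running index `k` is replaced by a closed-form expression in `i`, which removes the loop-carried dependence. All names are hypothetical.

```python
def fill_non_doall(n, k0, step):
    """k is updated every iteration, so iteration i depends on i - 1."""
    out = {}
    k = k0
    for i in range(n):
        k = k + step  # loop-carried dependence on k
        out[k] = i
    return out


def fill_doall(n, k0, step):
    """k is computed from i alone, so the iterations are independent."""
    out = {}
    for i in range(n):
        k = k0 + step * (i + 1)  # closed form of the recurrence above
        out[k] = i
    return out
```

Both versions produce the same result, but only the second can be run as a DOALL loop without speculation.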
12
Procedural-level TLS
  • Subject to
  • Limitations of the inlining heuristic of the
    compiler
  • The strength of the dependence analysis
    supported
  • Example loop on the right
  • Procedure calls phi0, phi1 and phi2 are loop
    invariant
  • Hoist the function calls out of the loop
  • Resulting loop: a DOALL loop
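
A minimal sketch of the hoisting step follows. Only the names phi0, phi1 and phi2 come from the slide. Their bodies and the surrounding loop are placeholders, and the functions are assumed to be pure (no side effects), which is what makes hoisting legal.

```python
def phi0(p):
    return p + 1  # placeholder body


def phi1(p):
    return p * 2  # placeholder body


def phi2(p):
    return p - 3  # placeholder body


def loop_with_calls(a, p):
    """Calls inside the loop: the compiler must prove they do not
    interfere across iterations before it can parallelize."""
    out = []
    for v in a:
        out.append(v * phi0(p) + phi1(p) - phi2(p))
    return out


def loop_hoisted(a, p):
    """Loop-invariant calls evaluated once, before the loop."""
    c0, c1, c2 = phi0(p), phi1(p), phi2(p)
    return [v * c0 + c1 - c2 for v in a]
```

After hoisting, each iteration touches only its own element and three constants, so the loop is DOALL.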

13
Evaluation of TLS
  • SPEC INT and SPEC FP 2006
  • Intel auto-parallelizing compiler
  • System configuration
  • Evaluate the performance potential of various
    speculation techniques at the innermost loop
    level, subject to practical constraints such as
    threading overhead

Performance tests and ratings are measured using
specific computer systems and/or components and
reflect the approximate performance of Intel
products as measured by those tests. Any
difference in system hardware or software design
or configuration may affect actual performance.
Buyers should consult other sources of
information to evaluate the performance of
systems or components they are considering
purchasing. For more information on performance
tests and on the performance of Intel products,
reference http://www.intel.com/performance/resources/benchmark_limitations.htm
or call (U.S.) 1-800-628-8686 or 1-916-356-3104.
14
Notes on Results
  • Important issue 1: previous studies of the
    potential of TLS used unrealistically small
    thread overheads.
  • Authors' study: thread overhead of 1000 cycles
  • Linux NPTL: 10k cycles
  • Thread overhead comprises
    - thread creation
    - thread management
    - thread synchronization

15
Notes on Results (contd.)
  • Important issue 2: all results presented
    heretofore correspond to dynamic scheduling with
    one iteration scheduled at a time.
  • Loops whose coverage per iteration is smaller
    than the threading overhead are filtered out.
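
The filtering rule can be sketched as follows. This assumes that "coverage per iteration" is expressed in cycles per iteration, so it can be compared directly with the threading overhead. The `LoopProfile` type and the sample figures in the usage note are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class LoopProfile:
    name: str
    cycles_per_iteration: float  # assumed unit: cycles


def tls_candidates(loops, t_ovhd):
    """Keep only loops whose per-iteration work is not smaller than
    the threading overhead t_ovhd (in cycles)."""
    return [loop for loop in loops if loop.cycles_per_iteration >= t_ovhd]
```

With made-up profiles of 200, 1000 and 5000 cycles per iteration, an overhead of 1000 cycles keeps the last two loops, while an overhead of 10 cycles keeps all three.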

16
Variation of TLS Performance Potential w.r.t.
Threading Overhead
  • Using a state-of-the-art threading library, the
    overhead is at least 1K cycles
  • Ideal case: 40%
  • Practical case: <3%

17
Variation of TLS Performance Potential w.r.t.
Threading Overhead
  • Ideal case: 40%
  • Practical case: 0%

18
Mitigating the Threading Overhead
  • Loop Unrolling
  • Higher inter-thread destructive interference in
    the D-Cache (me but, but!)
  • Increases the number of potential dependences
    between any two iterations of the loop
  • Results in a higher misspeculation rate

Unroll twice
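
A minimal sketch of unrolling by two, with a hypothetical loop body: each speculative thread would then execute two original iterations, so one unit of threading overhead is amortized over twice the work. The `overhead_fraction` helper is an illustrative cost model that ignores misspeculation.

```python
def body(a, i):
    a[i] = a[i] * 2 + 1  # placeholder loop body


def original(a):
    for i in range(len(a)):
        body(a, i)  # one iteration per thread


def unrolled_twice(a):
    n = len(a)
    for i in range(0, n - 1, 2):
        body(a, i)  # two original iterations per thread
        body(a, i + 1)
    if n % 2:
        body(a, n - 1)  # remainder iteration for odd n


def overhead_fraction(t_ovhd, t_iter, unroll):
    """Share of a thread's time spent on threading overhead."""
    return t_ovhd / (t_ovhd + unroll * t_iter)
```

The larger body also touches more data per thread, which is where the extra cache interference and the additional potential dependences listed on the slide come from.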
19
Mitigating the Threading Overhead (contd.)
  • Using a large number of processors
  • $T_{ovhd} < (N_p - 1) \times T_{iter}$
  • $T_{ovhd} < (N_{iter} - 1) \times T_{iter}$ for an
    unbounded number of processors
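
One way to read the first bound, under a simplifying assumption that is not stated on the slide: if $N_p$ iterations run concurrently and each pays the threading overhead once, they finish in $T_{iter} + T_{ovhd}$, whereas sequential execution needs $N_p \times T_{iter}$. Threading wins when $T_{iter} + T_{ovhd} < N_p \times T_{iter}$, which rearranges to $T_{ovhd} < (N_p - 1) \times T_{iter}$. The function names below are hypothetical.

```python
def tls_profitable(t_ovhd, t_iter, n_p):
    """True if the overhead satisfies t_ovhd < (n_p - 1) * t_iter."""
    return t_ovhd < (n_p - 1) * t_iter


def min_processors(t_ovhd, t_iter):
    """Smallest processor count that satisfies the strict bound."""
    return int(t_ovhd // t_iter) + 2
```

For example, with an overhead of 1000 cycles and 100-cycle iterations, 11 processors are not enough and 12 are, which shows how quickly small loop bodies push the required processor count up.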

20
Performance Potential of Different Types of TLS
  • SPEC CINT2006 (refer to the paper for CFP2006)
  • Gray columns: assuming $T_{ovhd}$ = 1000 cycles
  • White columns: assuming $T_{ovhd}$ = 10 cycles

21
Bounds on Conflict Probability
  • Model the impact of misspeculation penalty

For m points-of-speculation on an iteration
22
Bounds on Conflict Probability (contd.)
  • Example: 401.perlbench
  • Applying TLS on Loop 8 is profitable only if the
    misspeculation probability is < 0.28
  • Applying TLS on the other loops is not beneficial
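
A simple way to see where such a threshold comes from, offered as an illustrative model rather than the authors' formula: suppose a speculative run costs `t_spec` cycles when it succeeds, and misspeculation, occurring with probability `p`, adds `t_penalty` cycles. The expected time is `t_spec + p * t_penalty`, and TLS pays off only while this stays below the sequential time `t_seq`. All parameter names and numbers here are hypothetical.

```python
def expected_tls_time(t_spec, t_penalty, p):
    """Expected cycles for a speculative run with misspeculation
    probability p (illustrative model, see assumptions above)."""
    return t_spec + p * t_penalty


def break_even_probability(t_seq, t_spec, t_penalty):
    """Largest p at which speculation is no slower than sequential
    execution. Returns 0.0 if speculation never wins."""
    if t_spec >= t_seq:
        return 0.0
    return min(1.0, (t_seq - t_spec) / t_penalty)
```

With made-up values `t_seq = 1000`, `t_spec = 500` and `t_penalty = 2000`, the break-even probability is 0.25. Above that rate, the expected speculative time exceeds the sequential time.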

23
Conclusions
  • Performance potential of innermost loop-level
    TLS (excluding TLP)
  • Assuming a threading overhead of 1K cycles:
    geometric mean 1%
  • Performance potential of standalone CS, DDS and
    DVS is rather limited
  • Future work: evaluate the trade-off between TLS
    and run-time parallelization
