Tight Analysis of the Performance Potential of Thread Speculation using SPEC CPU2006

1
Tight Analysis of the Performance Potential of
Thread Speculation using SPEC CPU2006

Arun Kejariwal, Xinmin Tian, Milind Girkar, Wei Li,
Sergey Kozhukhov, Hideki Saito, Utpal Banerjee,
Alexandru Nicolau, Alexander V. Veidenbaum,
Constantine D. Polychronopoulos
Software and Solutions Group, Intel Corporation
Center for Embedded Computer Systems, University of California, Irvine
Center for Supercomputing Research and Development, University of Illinois at Urbana-Champaign
March 16, 2007

2
Parallelism Becoming Ubiquitous

Emergence of multi-core and hyper-threaded
processors
Intel Core 2 Duo processor
Intel Kentsfield (quad-core) processor
Program parallelization
Auto-parallelization
Hardware-assisted (speculative) parallelization

Loop-level

Main contributions
Evaluating the performance potential of
thread-level parallelism using SPEC CPU2006
Trade-off between ILP and TLP (thread-level
parallelism)
Analysis w.r.t. threading overhead,
transformations and conflict probability

Other brands and names may be claimed as the
property of others.
3
Clarification of Chosen Approach

Tight upper bounds on the performance potential
of TLS by filtering out
Inherently parallel program regions
Non-profitable candidates for TLS
Focus on parallelism that cannot be exploited
with state-of-the-art compiler technology, but
can uniquely be exploited via TLS

4
Global View
5
Differentiating TLP and TLS

Loop-level parallelism
Parallel (DOALL) loops correspond to true TLP
Non-parallel (non-DOALL) loops correspond to
speculative TLP (sTLP)
Performance achievable by each technique
standalone
Performance achievable by each technique in
conjunction with the others

6
Taxonomy
CS: Control Speculation
DDS: Data Dependence Speculation
DVS: Data Value Speculation
7
Methodology Details

Analysis will refer only to innermost loops
(however)
Two-step approach: filter out loops
- that can be parallelized using
state-of-the-art compiler techniques (dependence
analysis, pointer analysis, IPA, etc.)
- for which it is more profitable to exploit ILP
instead of sTLP
Filtering was done using the Intel Compiler
and manual analysis (for <10% of the loops).
The remaining loops are considered for evaluating
the performance potential of TLS at the innermost
loop level.
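The two-step filter above can be sketched as follows. This is a minimal illustration and not the authors' tooling: the `Loop` record and its boolean fields are hypothetical stand-ins for what the Intel Compiler and the manual analysis would determine for each loop.

```python
from dataclasses import dataclass


@dataclass
class Loop:
    name: str
    is_innermost: bool
    # Hypothetical flag: the compiler can prove the loop parallel
    # (dependence analysis, pointer analysis, IPA, etc.).
    compiler_parallelizable: bool
    # Hypothetical flag: ILP techniques pay off more than sTLP here.
    ilp_more_profitable: bool


def tls_candidates(loops):
    """Return the innermost loops that survive both filtering steps."""
    return [
        loop for loop in loops
        if loop.is_innermost
        and not loop.compiler_parallelizable   # step 1
        and not loop.ilp_more_profitable       # step 2
    ]


loops = [
    Loop("doall_copy", True, True, False),
    Loop("tiny_reduction", True, False, True),
    Loop("pointer_chase", True, False, False),
    Loop("outer_driver", False, False, False),
]
print([loop.name for loop in tls_candidates(loops)])
```

Only the loops left after both steps count toward the TLS potential, which is what keeps the resulting bound tight.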

8
The Baseline

Auto-parallelization

9
ILP/sTLP Trade-off

Determine what is achievable beyond existing
ILP techniques
Example: a candidate loop for DVS
Also a candidate for software pipelining
Too small for TLS to be profitable (too little
computation vs. threading overhead)

10
ILP/sTLP Trade-off (contd.)

CS+DVS vs. Perfect Pipelining (PP)
Profitability of CS+DVS requires high coverage
per iteration
PP can be applied in an unrestricted fashion
Example loop on the right
Less than 1% coverage
Very small coverage per iteration
Not suitable for TLS

11
ILP/sTLP Trade-off (contd.)

Symbolic Analysis
Convert a non-DOALL loop into a DOALL loop
Example loop on the right
No need for TLS
Too small to exploit non-speculative TLP in
profitable fashion

[Figure: the non-DOALL loop and the resulting DOALL loop]
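The slide's own example loop is not preserved in this transcript. As a hypothetical stand-in, the sketch below uses induction-variable substitution, one common way symbolic analysis removes a loop-carried dependence. The function names and the loop body are assumptions made for illustration.

```python
def fill_non_doall(n, k0=0, step=2):
    """Non-DOALL form: k is carried from one iteration to the next."""
    out = [0] * (k0 + step * n + 1)
    k = k0
    for i in range(n):
        k = k + step              # loop-carried dependence on k
        out[k] = i * i
    return out


def fill_doall(n, k0=0, step=2):
    """DOALL form: k is computed in closed form from i alone."""
    out = [0] * (k0 + step * n + 1)
    for i in range(n):            # iterations no longer depend on each other
        k = k0 + step * (i + 1)
        out[k] = i * i
    return out


print(fill_non_doall(4) == fill_doall(4))
```

Once `k` depends only on `i`, the loop can run in parallel without speculation, which is why such loops are excluded from the TLS candidates.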
12
Procedural-level TLS

Subject to
Limitations of the compiler's inlining
heuristic
The strength of the dependence analysis
supported
Example loop on the right
Procedure calls phi0, phi1 and phi2
are loop invariant
Hoist the function calls out of the loop
Resulting loop: a DOALL loop
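The hoisting step can be illustrated with a small sketch. The original loop is not preserved in this transcript, so the bodies of `phi0`, `phi1` and `phi2` and the loop itself are hypothetical. The only property carried over from the slide is that the three calls are loop invariant.

```python
# Hypothetical pure functions standing in for the slide's phi0, phi1, phi2.
def phi0(x):
    return x + 1


def phi1(x):
    return 2 * x


def phi2(x):
    return x * x


def before_hoisting(a, x):
    """Calls inside the loop: unless it inlines or analyzes the callees,
    a compiler must assume they may carry dependences."""
    out = [0] * len(a)
    for i in range(len(a)):
        out[i] = a[i] * phi0(x) + phi1(x) - phi2(x)
    return out


def after_hoisting(a, x):
    """Invariant calls evaluated once; the remaining loop is DOALL."""
    c0, c1, c2 = phi0(x), phi1(x), phi2(x)
    out = [0] * len(a)
    for i in range(len(a)):       # iterations are independent
        out[i] = a[i] * c0 + c1 - c2
    return out


print(before_hoisting([1, 2, 3], 3) == after_hoisting([1, 2, 3], 3))
```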

13
Evaluation of TLS

SPEC INT and SPEC FP 2006
Intel auto-parallelizing compiler
System configuration

Evaluate the performance potential of various
speculation techniques,
at the innermost loop level, subject to
practical constraints such as
threading overhead

Performance tests and ratings are measured using
specific computer systems and/or components and
reflect the approximate performance of Intel
products as measured by those tests. Any
difference in system hardware or software design
or configuration may affect actual performance.
Buyers should consult other sources of
information to evaluate the performance of
systems or components they are considering
purchasing. For more information on performance
tests and on the performance of Intel products,
reference http://www.intel.com/performance/resources/benchmark_limitations.htm or call (U.S.)
1-800-628-8686 or 1-916-356-3104.
14
Notes on Results

Important issue 1: previous studies of the
potential of TLS used unrealistically small
thread overheads.
Authors' study: thread overhead of 1000 cycles
Linux NPTL: 10k cycles
Thread overhead
- thread creation
- thread management
- thread synchronization

15
Notes on Results (contd.)

Important issue 2: all results presented
so far correspond to dynamic scheduling with
one iteration scheduled at a time.
Loops whose coverage per iteration is smaller
than the threading overhead are filtered out.
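The overhead filter can be sketched as below. This is a minimal illustration under one assumption: coverage per iteration is read as the loop's total cycles divided by its trip count. The function and parameter names are hypothetical. The default of 1000 cycles is the overhead figure quoted in these slides.

```python
def coverage_per_iteration(loop_cycles, trip_count):
    """Average cycles spent in one iteration (assumed definition)."""
    return loop_cycles / trip_count


def survives_overhead_filter(loop_cycles, trip_count, t_ovhd=1000):
    """Keep a loop only if one iteration does at least as much work
    as it costs to create, manage and synchronize a thread."""
    return coverage_per_iteration(loop_cycles, trip_count) >= t_ovhd


# 5000 cycles per iteration: kept at 1000-cycle overhead.
print(survives_overhead_filter(5_000_000, 1000))
# 200 cycles per iteration: filtered out at 1000 cycles, kept at 10 cycles.
print(survives_overhead_filter(200_000, 1000))
print(survives_overhead_filter(200_000, 1000, t_ovhd=10))
```

With one iteration scheduled per thread, a loop that fails this test would spend more cycles on threading than on useful work.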

16
Variation of TLS Performance Potential w.r.t.
Threading Overhead

With a state-of-the-art threading library, the
overhead is at least 1K cycles
Ideal case: 40%
Practical case: <3%

17
Variation of TLS Performance Potential w.r.t.
Threading Overhead

Ideal case: 40%
Practical case: 0%

18
Mitigating the Threading Overhead

Loop Unrolling
Higher inter-thread destructive interference in
the D-cache (me: but, but!)
Increases the number of potential dependences
between any two iterations of the loop
Results in a higher misspeculation rate

Unroll twice
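Unrolling twice places two original iterations in each loop trip, so each thread does more work per spawn. The loop below is a hypothetical example, since the slide's own loop is not preserved in this transcript.

```python
def scale(a, factor):
    """Original loop: one element per trip."""
    out = [0] * len(a)
    for i in range(len(a)):
        out[i] = a[i] * factor
    return out


def scale_unrolled_twice(a, factor):
    """Same loop unrolled by two: each trip covers two original iterations."""
    n = len(a)
    out = [0] * n
    i = 0
    while i + 1 < n:
        out[i] = a[i] * factor
        out[i + 1] = a[i + 1] * factor
        i += 2
    if i < n:                     # remainder iteration when n is odd
        out[i] = a[i] * factor
    return out


print(scale([1, 2, 3, 4, 5], 3) == scale_unrolled_twice([1, 2, 3, 4, 5], 3))
```

The larger body is also what drives the costs listed above: each speculative thread touches more data, so two threads are more likely to conflict.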
19
Mitigating the Threading Overhead (contd.)

Using a large number of processors
$T_{ovhd} < (N_p - 1) \times T_{iter}$
$T_{ovhd} < (N_{iter} - 1) \times T_{iter}$ for an unbounded number
of processors
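The processor-count condition can be checked numerically. The sketch below evaluates the inequality as stated on the slide and nothing more. The function names and the sample cycle counts are assumptions. For the unbounded-processor case, pass the iteration count in place of the processor count.

```python
def overhead_condition_holds(t_ovhd, t_iter, n_p):
    """True when t_ovhd < (n_p - 1) * t_iter."""
    return t_ovhd < (n_p - 1) * t_iter


def min_processors(t_ovhd, t_iter):
    """Smallest n_p that satisfies the strict inequality.
    Assumes integer cycle counts and t_iter > 0."""
    return t_ovhd // t_iter + 2


# Hypothetical loop: 300 cycles per iteration, 1000-cycle overhead.
print(min_processors(1000, 300))
print(overhead_condition_holds(1000, 300, 4))
print(overhead_condition_holds(1000, 300, 5))
```

Because the inequality is strict, an overhead that is an exact multiple of the iteration time needs one more processor than the quotient alone suggests.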

SPEC CINT2006 (refer to the paper for CFP2006)
Gray columns: assuming $T_{ovhd} = 1000$ cycles
White columns: assuming $T_{ovhd} = 10$ cycles

Model the impact of misspeculation penalty

For m points-of-speculation on an iteration
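The slide's formula is not preserved in this transcript. The sketch below is a purely illustrative stand-in and not the authors' model. It assumes that each of the m points of speculation conflicts independently with the same probability, and that a misspeculated iteration pays one fixed penalty. All names and parameters are hypothetical.

```python
def commit_probability(p_conflict, m):
    """Chance that none of the m speculation points conflicts
    (independence is an assumption of this sketch)."""
    return (1.0 - p_conflict) ** m


def expected_iteration_cycles(t_iter, t_penalty, p_conflict, m):
    """Iteration time plus the penalty weighted by the chance of
    at least one misspeculation (single-penalty assumption)."""
    p_fail = 1.0 - commit_probability(p_conflict, m)
    return t_iter + p_fail * t_penalty


print(commit_probability(0.5, 2))
print(expected_iteration_cycles(100, 50, 0.5, 2))
```

Under these assumptions the commit probability falls geometrically with m, so even a small per-point conflict probability adds up when an iteration has many points of speculation.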
22
Bounds on Conflict Probability (contd.)