Title: Debunking the 100X GPU vs. CPU Myth: An Evaluation of Throughput Computing on CPU and GPU
Slide 1: Debunking the 100X GPU vs. CPU Myth: An Evaluation of Throughput Computing on CPU and GPU
- Presented by Ahmad Lashgar
- ECE Department, University of Tehran
- Seminar of Parallel Processing. Instructor: Dr. Fakhraie
- 29 Dec 11
- ISCA 2010
- Original authors: Victor W. Lee et al.
- Intel Corporation
Some slides are included from the original paper for educational purposes only.
Slide 2: Abstract
- Is the GPU the silver bullet of parallel computing?
- How large is the gap between peak and achievable performance?
Slide 3: Overview
- Abstract
- Architecture
  - CPU: Intel Core i7
  - GPU: Nvidia GTX280
- Implications for throughput computing applications
- Methodology
- Results
- Analyzing the results
- Platform optimization guides
- Conclusion
Slide 4: Architecture (1)
- Intel Core i7-960
  - 4 cores, 3.2 GHz
  - 2-way multi-threading
  - 4-wide SIMD
  - L1 32KB, L2 256KB, L3 8MB
  - 32 GB/s memory bandwidth
[DIXON2010]
Slide 5: Architecture (2)
- Nvidia GTX280
  - 30 cores, 1.3 GHz
  - 1024-way multi-threading
  - 8-way SIMD
  - 16KB software-managed cache (shared memory)
  - 141 GB/s memory bandwidth
[LINDHOLM2008]
Slide 6: Architecture (3)
                           Core i7-960      GTX280
Cores                      4                30
Frequency (GHz)            3.2              1.3
Transistors (die area)     0.7B (263 mm²)   1.4B (576 mm²)
Memory bandwidth (GB/s)    32               141
SP SIMD width              4                8
DP SIMD width              2                1
Peak SP scalar GFLOPS      25.6             116.6
Peak SP SIMD GFLOPS        102.4            311.1 (933.1)
Peak DP SIMD GFLOPS        51.2             77.8
Values shown in red on the original slide are not the authors' numbers.
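The Core i7-960 peak figures in the table can be reproduced from the core count, clock frequency and SIMD width. The sketch below is only an illustration: it assumes each SIMD lane completes one multiply and one add per cycle (2 flops per lane per cycle), and the function name `peak_gflops` is hypothetical. The GTX280 column is not recomputed because the table does not state its per-cycle flop count.

```python
def peak_gflops(cores, freq_ghz, simd_width, flops_per_lane_per_cycle=2):
    """Peak GFLOPS = cores * GHz * SIMD lanes * flops per lane per cycle.

    The default of 2 flops per lane per cycle (one multiply plus one add)
    is an assumption made for this sketch.
    """
    return cores * freq_ghz * simd_width * flops_per_lane_per_cycle


# Core i7-960 values taken from the table: 4 cores at 3.2 GHz.
cpu_sp_scalar = peak_gflops(4, 3.2, 1)  # scalar code uses a single lane
cpu_sp_simd = peak_gflops(4, 3.2, 4)    # 4-wide single-precision SIMD
cpu_dp_simd = peak_gflops(4, 3.2, 2)    # 2-wide double-precision SIMD

print(f"Peak SP scalar: {cpu_sp_scalar:.1f} GFLOPS")  # 25.6
print(f"Peak SP SIMD:   {cpu_sp_simd:.1f} GFLOPS")    # 102.4
print(f"Peak DP SIMD:   {cpu_dp_simd:.1f} GFLOPS")    # 51.2
```

Under that assumption the three results match the Core i7-960 column of the table.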
Slide 7: Implications for throughput computing applications
- Number of cores difference
- Cache size/multi-threading
- Bandwidth difference
Slide 8: 1. Number of cores difference
- It is all about core complexity
- The common goal: improving pipeline efficiency
- CPU goal: single-thread performance
  - Exploiting ILP
  - Sophisticated branch predictor
  - Multiple-issue logic
- GPU goal: throughput
  - Interleaving hundreds of threads
Slide 9: 2. Cache size/multi-threading
- CPU goal: reducing memory latency
  - Programmer-transparent data caching
  - Increasing the cache size to capture the working set
  - Prefetching (HW/SW)
- GPU goal: hiding memory latency
  - Interleaving the execution of hundreds of threads so that they hide each other's latency
- Notice
  - The CPU also uses multi-threading for latency hiding
  - The GPU also uses software-controlled caching (shared memory) for reducing memory latency
Slide 10: 3. Bandwidth difference
- Bandwidth versus latency
- CPU goal: single-thread performance
  - Workloads do not demand many memory accesses
  - Bring the data as soon as possible
- GPU goal: throughput
  - There are many memory accesses, so provide high bandwidth
  - The latency does not matter; the cores will hide it!
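As a rough illustration of why sustaining high bandwidth requires many memory requests in flight, Little's law relates the data in flight to bandwidth times latency. The sketch below uses the 141 GB/s and 32 GB/s figures from the architecture slides; the 400 ns and 50 ns latencies and the 64-byte request size are hypothetical placeholders chosen for the example, not measurements from the paper.

```python
def bytes_in_flight(bandwidth_gb_s, latency_ns):
    """Little's law: data in flight = bandwidth * latency.

    GB/s times ns gives bytes, since 1e9 bytes/s * 1e-9 s = 1 byte.
    """
    return bandwidth_gb_s * latency_ns


def outstanding_requests(bandwidth_gb_s, latency_ns, request_bytes=64):
    """Concurrent requests needed to keep the memory system saturated."""
    return bytes_in_flight(bandwidth_gb_s, latency_ns) / request_bytes


# Bandwidths come from the slides; latencies are hypothetical.
gpu_requests = outstanding_requests(141, 400)
cpu_requests = outstanding_requests(32, 50)

print(f"GPU: about {gpu_requests:.0f} requests in flight")
print(f"CPU: about {cpu_requests:.0f} requests in flight")
```

With these placeholder latencies the GPU needs far more concurrent requests than the CPU, which is the kind of concurrency that interleaving hundreds of threads can supply.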
Slide 11: Methodology (1)
- Hardware
  - Intel Core i7-960, 6GB DRAM, GTX280 1GB
- Software
  - SUSE Enterprise 11
  - CUDA Toolkit 2.3
Slide 12: Methodology (2)
- Optimizations
  - On CPU
    - SGEMM, SpMV and FFT from Intel MKL 10
    - Always 2 threads per core
  - On GPU
    - Best possible algorithm for SpMV, FFT and MC
    - Often 128 to 256 threads per core (to leverage shared memory and register-file usage)
    - Interleaving GPU execution with host-to-device/device-to-host (HD/DH) memory transfers where possible
Slide 13: Results
- The HD/DH data transfer time is not considered
- The GPU is only 2.5X faster than the CPU on average
- Far from the 100X reported by previous research
Slide 14: Where is the speedup of previous research?!
- Which CPU and GPU are compared?
- How much optimization is performed on the CPU and on the GPU?
- Where previous studies optimized both platforms, they reported much lower speedups (like this paper)
Slide 15: Analyzing the results (1)
- Bandwidth
- Compute flops (single precision)
- Compute flops (double precision)
- Reduction and synchronization
- Fixed function
Slide 16: Analyzing the results (2)
- Bandwidth
  - Peak GTX280/Core i7-960: 4.7X
  - Feature: large working set; performance is bounded by the bandwidth
  - Examples
    - SAXPY (5.3X)
    - LBM (5X)
    - SpMV (1.9X)
      - CPU benefits from caching
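To make "bounded by the bandwidth" concrete, the sketch below estimates an upper bound for SAXPY (y = a*x + y) from memory bandwidth alone. It assumes single-precision data, no cache reuse, and that each element costs 2 flops and 12 bytes of traffic (read x, read y, write y); these modelling choices and the function name are assumptions of this example, not figures from the paper.

```python
def bandwidth_bound_gflops(bandwidth_gb_s, flops_per_elem, bytes_per_elem):
    """Upper bound on GFLOPS when every element streams from memory."""
    return bandwidth_gb_s * flops_per_elem / bytes_per_elem


SAXPY_FLOPS = 2   # one multiply and one add per element
SAXPY_BYTES = 12  # read x, read y, write y: 3 * 4 bytes (assumed SP data)

# Bandwidths are taken from the architecture table.
cpu_bound = bandwidth_bound_gflops(32, SAXPY_FLOPS, SAXPY_BYTES)
gpu_bound = bandwidth_bound_gflops(141, SAXPY_FLOPS, SAXPY_BYTES)

print(f"Core i7-960 bound: {cpu_bound:.2f} GFLOPS")
print(f"GTX280 bound:      {gpu_bound:.2f} GFLOPS")
print(f"Bound ratio:       {gpu_bound / cpu_bound:.2f}X")
```

Both bounds sit far below the peak compute figures in the table, so in this simple model the attainable speedup follows the bandwidth ratio rather than the flops ratio.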
Slide 17: Analyzing the results (3)
- Compute flops (single precision)
  - Peak GTX280/Core i7-960: 3X
  - Feature: bounded by computation; benefits from more cores
  - Examples
    - SGEMM, Conv and FFT (2.8-4X)
Slide 18: Analyzing the results (4)
- Compute flops (double precision)
  - Peak GTX280/Core i7-960: 1.5X
  - Feature: bounded by computation; benefits from more cores
  - Examples
    - MC (1.8X)
    - Blitz (5X)
      - Uses transcendental operations
    - Sort (1.25X slower)
      - Due to decreased SIMD width usage
      - Depends on scalar performance
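The peak compute ratios quoted on the analysis slides can be checked against the peak SIMD GFLOPS rows of the architecture table. This is a small sanity check rather than anything from the paper; the dictionary layout and names are illustrative.

```python
# Peak SIMD GFLOPS copied from the architecture table.
PEAK_GFLOPS = {
    "sp": {"core_i7_960": 102.4, "gtx280": 311.1},
    "dp": {"core_i7_960": 51.2, "gtx280": 77.8},
}


def peak_ratio(precision):
    """GTX280 peak divided by Core i7-960 peak for one precision."""
    row = PEAK_GFLOPS[precision]
    return row["gtx280"] / row["core_i7_960"]


print(f"SP peak ratio: {peak_ratio('sp'):.1f}X")  # 3.0X
print(f"DP peak ratio: {peak_ratio('dp'):.1f}X")  # 1.5X
```

Measured speedups close to these ratios indicate compute-bound kernels; larger or smaller values point to other factors such as transcendental hardware or reduced SIMD utilization.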
Slide 19: Analyzing the results (5)
- Reduction and synchronization
  - Feature: the more threads, the higher the synchronization overhead
  - Examples
    - Hist (1.8X)
      - On CPU, 28% of the time is spent on atomic operations
      - On GPU, the atomic operations are much slower
    - Solv (1.9X slower)
      - Multiple kernel launches to preserve cache coherency on GPU
Slide 20: Analyzing the results (6)
- Fixed function
  - Feature: interpolation, texturing and transcendental operations are a bonus on the GPU
  - Examples
    - Bilat (5.7X)
      - On CPU, 66% of the time is spent on transcendental operations
    - GJK (14.9X)
      - Uses texture lookup
Slide 21: Platform optimization guides
- CPU programmers have relied heavily on increasing clock frequency
- Their applications do not benefit from TLP and DLP
- Today's CPUs use wider SIMD, which stays idle if not exploited by the programmer (or compiler)
- This paper showed that careful multi-threading can reduce the gap substantially
  - For LBM, from 114X down to 5X
- Let's learn some optimization tips from the authors
Slide 22: CPU optimization
- Scalability (4X)
  - Scale the kernel with the number of threads
- Blocking (5X)
  - Be aware of the cache hierarchy and use it efficiently
- Regularizing (1.5X)
  - Align the data regularly to take advantage of SIMD
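The blocking tip can be sketched as a tiled matrix multiply: the loops are split so that each step works on small tiles that could stay resident in cache. This is a generic illustration of loop blocking, not the authors' code; the function names and the default tile size are assumptions, and pure Python will not show the cache benefit, only the loop structure.

```python
def matmul_naive(a, b):
    """Reference triple loop: streams over all of b for every row of a."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            for k in range(m):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_blocked(a, b, block=2):
    """Same product, computed tile by tile (block x block sub-matrices)."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for ii in range(0, n, block):          # tile row of c
        for kk in range(0, m, block):      # tile along the shared dimension
            for jj in range(0, p, block):  # tile column of c
                for i in range(ii, min(ii + block, n)):
                    for k in range(kk, min(kk + block, m)):
                        a_ik = a[i][k]
                        for j in range(jj, min(jj + block, p)):
                            c[i][j] += a_ik * b[k][j]
    return c


a = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
b = [[9, 8, 7], [6, 5, 4], [3, 2, 1]]
print(matmul_blocked(a, b) == matmul_naive(a, b))
```

In a compiled language the tile size would be tuned so that the tiles of a, b and c fit together in the targeted cache level.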
Slide 23: GPU optimization
- Global synchronization
  - Reduce the atomic operations
- Shared memory
  - Use shared memory to reduce off-chip demand
  - Shared memory is multi-banked and is efficient for gather/scatter operations
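One common way to reduce atomic operations in a histogram is privatization: each worker accumulates into its own private copy, and the copies are merged once at the end. The sketch below only models the idea in sequential Python; it is not the authors' implementation, the names are hypothetical, and on a GPU the private copies would typically live in shared memory with one copy per thread block.

```python
def histogram_privatized(data, num_bins, num_workers=4):
    """Histogram built from per-worker private copies plus one merge.

    Workers are simulated one after another here; the point is that the
    inner loop touches only private data and needs no atomic update.
    """
    chunk = (len(data) + num_workers - 1) // num_workers  # ceiling division
    private = []
    for w in range(num_workers):
        local = [0] * num_bins  # private copy owned by worker w
        for value in data[w * chunk:(w + 1) * chunk]:
            local[value % num_bins] += 1
        private.append(local)
    # One reduction per bin replaces one shared update per input element.
    return [sum(hist[b] for hist in private) for b in range(num_bins)]


samples = [0, 1, 1, 2, 2, 2, 3, 3, 3, 3]
print(histogram_privatized(samples, num_bins=4))  # [1, 2, 3, 4]
```

The number of contended updates drops from one per input element to one per bin per worker, at the cost of extra storage for the private copies.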
Slide 24: Conclusion
- This work analyzed the performance of important throughput computing kernels on CPU and GPU
  - The gap is much lower than previous reports (2.5X)
- Recommendations for a throughput computing architecture
  - High compute
  - High bandwidth
  - Large cache
  - Gather/scatter support
  - Efficient synchronization
  - Fixed function units
Slide 25
- Thank you for your attention.
- Any questions?
Slide 26: References
- [LEE2010] V. W. Lee et al., "Debunking the 100X GPU vs. CPU Myth: An Evaluation of Throughput Computing on CPU and GPU," ISCA 2010
- [DIXON2010] M. Dixon et al., "The Next-Generation Intel Core Microarchitecture," Intel Technology Journal, Volume 14, Issue 3, 2010
- [LINDHOLM2008] E. Lindholm et al., "NVIDIA Tesla: A Unified Graphics and Computing Architecture," IEEE Micro, 2008