Debunking the 100X GPU vs. CPU Myth: An Evaluation of Throughput Computing on CPU and GPU

1
Debunking the 100X GPU vs. CPU Myth: An Evaluation of Throughput Computing on CPU and GPU

Presented by: Ahmad Lashgar
ECE Department, University of Tehran
Seminar of Parallel Processing. Instructor: Dr. Fakhraie
29 Dec 11
Original paper: ISCA 2010
Original authors: Victor W. Lee et al., Intel Corporation

Some slides are included from the original paper, for educational purposes only.
2
Abstract

Is the GPU the silver bullet of parallel computing?
How large is the gap between peak and achievable performance?

3
Overview

Abstract
Architecture
  CPU: Intel Core i7
  GPU: Nvidia GTX280
Implications for throughput computing applications
Methodology
Results
Analyzing the results
Platform optimization guides
Conclusion

4
Architecture (1)

Intel Core i7-960
4 cores, 3.2 GHz
2-way multi-threading per core
4-wide SIMD
Caches: L1 32KB, L2 256KB, L3 8MB
Memory bandwidth: 32 GB/s

[DIXON2010]
5
Architecture (2)

Nvidia GTX280
30 cores, 1.3 GHz
1024-way multi-threading per core
8-wide SIMD
16KB software-managed cache (shared memory) per core
Memory bandwidth: 141 GB/s

[LINDHOLM2008]
6
Architecture (3)
                          Core i7-960        GTX280
Cores                     4                  30
Frequency (GHz)           3.2                1.3
Transistors               0.7B (263 mm2)     1.4B (576 mm2)
Memory bandwidth (GB/s)   32                 141
SP SIMD width             4                  8
DP SIMD width             2                  1
Peak SP scalar GFLOPS     25.6               116.6
Peak SP SIMD GFLOPS       102.4              311.1 (933.1)
Peak DP SIMD GFLOPS       51.2               77.8

Values shown in red on the original slide are not the authors' numbers.
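
The peak figures in this table can be approximately reproduced from the core count, clock and SIMD width. The sketch below is one way to do that arithmetic. Two things in it are assumptions rather than values from the slide: the flops-per-lane-per-cycle factors (chosen because they make the numbers come out), and the 1.296 GHz GTX280 shader clock, which the slide rounds to 1.3. The function and dictionary names are illustrative only.

```python
# Peak GFLOPS = cores x clock (GHz) x SIMD lanes x flops per lane per cycle.
# The flops-per-cycle factors and the 1.296 GHz GPU clock are assumptions
# chosen to match the table; they are not stated on the slide.

def peak_gflops(cores, clock_ghz, simd_lanes, flops_per_lane_per_cycle):
    """Theoretical peak in GFLOPS for a simple multi-core SIMD machine."""
    return cores * clock_ghz * simd_lanes * flops_per_lane_per_cycle

CPU = {"cores": 4, "clock_ghz": 3.2}
GPU = {"cores": 30, "clock_ghz": 1.296}  # the slide rounds this to 1.3

peaks = {
    # Core i7-960: assumed one multiply plus one add per lane per cycle.
    "cpu_sp_scalar": peak_gflops(**CPU, simd_lanes=1, flops_per_lane_per_cycle=2),
    "cpu_sp_simd": peak_gflops(**CPU, simd_lanes=4, flops_per_lane_per_cycle=2),
    "cpu_dp_simd": peak_gflops(**CPU, simd_lanes=2, flops_per_lane_per_cycle=2),
    # GTX280: the scalar and parenthesised figures appear to count three
    # flops per cycle (multiply-add plus multiply); 311.1 counts one.
    "gpu_sp_scalar": peak_gflops(**GPU, simd_lanes=1, flops_per_lane_per_cycle=3),
    "gpu_sp_simd": peak_gflops(**GPU, simd_lanes=8, flops_per_lane_per_cycle=1),
    "gpu_sp_simd_mad_mul": peak_gflops(**GPU, simd_lanes=8, flops_per_lane_per_cycle=3),
    "gpu_dp_simd": peak_gflops(**GPU, simd_lanes=1, flops_per_lane_per_cycle=2),
}

for name, value in peaks.items():
    print(f"{name:22s} {value:7.1f} GFLOPS")
```

Under these assumptions every row matches the table to within rounding, which suggests the three GPU single-precision figures use different flop-counting conventions.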
7
Implications for throughput computing applications

Difference in number of cores
Cache size/multi-threading
Bandwidth difference

8
1. Difference in number of cores

It is all about core complexity.
The common goal: improving pipeline efficiency
CPU goal: single-thread performance
  Exploiting ILP
  Sophisticated branch predictor
  Multiple-issue logic
GPU goal: throughput
  Interleaving hundreds of threads

9
2. Cache size/multi-threading

CPU goal: reducing memory latency
  Programmer-transparent data caching
  Increasing the cache size to capture the working set
  Prefetching (HW/SW)
GPU goal: hiding memory latency
  Interleaving the execution of hundreds of threads so that each hides the latency of the others
Note that each platform also uses the other's technique:
  The CPU uses multi-threading for latency hiding
  The GPU uses software-controlled caching (shared memory) for reducing memory latency

10
3. Bandwidth difference

Bandwidth versus latency
CPU goal: single-thread performance
  Workloads do not demand many memory accesses
  Bring the data in as soon as possible
GPU goal: throughput
  There are many memory accesses, so provide high bandwidth
  Whatever the latency, the cores will hide it!
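
Why high bandwidth goes hand in hand with many threads can be sketched with Little's law: the number of memory requests that must be in flight equals throughput times latency. The example below is a rough illustration only. The bandwidth figures come from the architecture slides, but the 400 ns latency and 64-byte request size are hypothetical values picked for the example, not measurements of either chip.

```python
import math

# Hypothetical values for illustration only; not taken from the slides.
ASSUMED_LATENCY_NS = 400
ASSUMED_REQUEST_BYTES = 64

def requests_in_flight(bandwidth_gb_s, latency_ns, bytes_per_request):
    """Little's law: concurrency = throughput x latency.

    GB/s times ns gives bytes, so the product is the number of bytes
    that must be outstanding to keep the memory system saturated.
    """
    bytes_in_flight = bandwidth_gb_s * latency_ns
    return math.ceil(bytes_in_flight / bytes_per_request)

cpu_requests = requests_in_flight(32, ASSUMED_LATENCY_NS, ASSUMED_REQUEST_BYTES)
gpu_requests = requests_in_flight(141, ASSUMED_LATENCY_NS, ASSUMED_REQUEST_BYTES)

print("Outstanding requests needed, CPU:", cpu_requests)
print("Outstanding requests needed, GPU:", gpu_requests)
```

For equal latency, the required concurrency scales with bandwidth, which is one way to see why a throughput-oriented design keeps so many threads resident.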

11
Methodology (1)

Hardware:
  Intel Core i7-960, 6GB DRAM, GTX280 1GB
Software:
  SUSE Enterprise 11
  CUDA Toolkit 2.3

12
Methodology (2)

Optimizations
On CPU:
  SGEMM, SpMV and FFT from Intel MKL 10
  Always 2 threads per core
On GPU:
  Best possible algorithm for SpMV, FFT and MC
  Often 128 to 256 threads per core (to balance shared memory and register-file usage)
  Interleaving GPU execution with host-to-device and device-to-host (HD/DH) memory transfers where possible

13
Results

The HD/DH data transfer time is not considered.
Only 2.5X speedup on average
Far from the 100X reported by previous research

14
Where did the speedup in previous research come from?

Which CPU and GPU are compared?
How much optimization is performed on the CPU and on the GPU?
Where both platforms were optimized, much lower speedups were reported (as in this paper)

15
Analyzing the results (1)

Bandwidth
Compute flops (single precision)
Compute flops (double precision)
Reduction and synchronization
Fixed function

16
Analyzing the results (2)

Bandwidth
Peak GTX280/Core i7-960: 4.7X
Feature: large working set; performance is bounded by bandwidth
Examples:
  SAXPY (5.3X)
  LBM (5X)
  SpMV (1.9X)
    The CPU benefits from caching

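
What "bounded by bandwidth" means for a kernel like SAXPY can be sketched with a simple roofline-style estimate: attainable performance is the smaller of the compute peak and bandwidth times arithmetic intensity. This is a back-of-the-envelope model, not the authors' methodology. It assumes single-precision SAXPY moves 12 bytes per element (read x, read y, write y) for 2 flops, with no cache reuse; the peak and bandwidth figures are the ones from the architecture comparison.

```python
def attainable_gflops(peak_gflops, bandwidth_gb_s, flops_per_byte):
    """Roofline-style bound: limited by compute or by memory traffic."""
    return min(peak_gflops, bandwidth_gb_s * flops_per_byte)

# SAXPY (y = a*x + y), single precision: 2 flops per element and
# 12 bytes moved per element. Assumes every operand comes from DRAM.
SAXPY_FLOPS_PER_BYTE = 2 / 12

cpu_bound = attainable_gflops(102.4, 32, SAXPY_FLOPS_PER_BYTE)
gpu_bound = attainable_gflops(311.1, 141, SAXPY_FLOPS_PER_BYTE)

print(f"Core i7-960 bound: {cpu_bound:.1f} GFLOPS")
print(f"GTX280 bound:      {gpu_bound:.1f} GFLOPS")
print(f"Ratio:             {gpu_bound / cpu_bound:.1f}X")
```

Both bounds fall far below the compute peaks, so under this model the expected speedup is simply the bandwidth ratio of the two chips. With the 141 and 32 GB/s figures that ratio is about 4.4X, close to, though not identical to, the 4.7X peak ratio quoted on the slide.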
17
Analyzing the results (3)

Compute Flops (Single Precision)
Peak GTX280/Core i7-960: 3X
Feature: bounded by computation; benefits from more cores
Examples:
  SGEMM, Conv and FFT (2.8-4X)

18
Analyzing the results (4)

Compute Flops (Double Precision)
Peak GTX280/Core i7-960: 1.5X
Feature: bounded by computation; benefits from more cores
Examples:
  MC (1.8X)
  Blitz (5X)
    Uses transcendental operations
  Sort (1.25X slower)
    Due to reduced use of the SIMD width
    Depends on scalar performance

19
Analyzing the results (5)

Reduction and Synchronization
Feature: the more threads, the higher the synchronization overhead
Examples:
  Hist (1.8X)
    On CPU, 28% of the time is spent on atomic operations
    On GPU, the atomic operations are much slower
  Solv (1.9X slower)
    Multiple kernel launches are needed to preserve cache coherency on GPU
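
One common way to reduce the atomic-update cost in a histogram is privatization: each worker fills its own private histogram with no synchronization, and the private copies are merged once at the end. The sketch below shows only the structure of that idea. It is a general technique, not something the slides attribute to the authors, and because of Python's global interpreter lock it should not be expected to run faster than a serial loop. The function names are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor

def private_histogram(chunk, num_bins):
    """Count one chunk into a histogram owned by a single worker."""
    local = [0] * num_bins
    for value in chunk:
        local[value] += 1  # no atomics needed: nobody else touches 'local'
    return local

def histogram(data, num_bins, num_workers=4):
    """Privatized histogram: per-worker counts, then one merge step."""
    # Ceiling division so that every element lands in exactly one chunk.
    chunk_size = max(1, -(-len(data) // num_workers))
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = list(pool.map(lambda c: private_histogram(c, num_bins), chunks))

    # The merge is the only place where results from different workers meet.
    total = [0] * num_bins
    for partial in partials:
        for bin_index, count in enumerate(partial):
            total[bin_index] += count
    return total

print(histogram([0, 1, 1, 3, 3, 3], num_bins=4))
```

The trade-off is extra memory for the private copies and a merge whose cost grows with the number of workers times the number of bins.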

20
Analyzing the results (6)

Fixed function
Feature: interpolation, texturing and transcendental operations come as a bonus on GPU
Examples:
  Bilat (5.7X)
    On CPU, 66% of the time is spent on transcendental operations
  GJK (14.9X)
    Uses texture lookup

21
Platform optimization guides

CPU programmers have relied heavily on increasing clock frequency.
Their applications do not benefit from TLP and DLP.
Today's CPUs use wider SIMD units, which stay idle if not exploited by the programmer (or compiler).
This paper showed that careful multi-threading can narrow the gap considerably.
  For LBM, from 114X down to 5X
Let's learn some optimization tips from the authors.

22
CPU optimization

Scalability (4X)
  Scale the kernel with the number of threads
Blocking (5X)
  Be aware of the cache hierarchy and use it efficiently
Regularizing (1.5X)
  Align the data regularly to take advantage of SIMD
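
Blocking can be illustrated with a tiled matrix multiply: the loops are restructured so that a small tile of each matrix is reused many times before moving on, which is what lets the working set stay in cache. The sketch below shows only the loop structure. It is not the authors' code, the function name and the default tile size are arbitrary choices for the example, and pure Python will not show the cache effect that a compiled implementation would.

```python
def matmul_blocked(a, b, n, block=2):
    """Multiply two n x n matrices (lists of lists) tile by tile."""
    c = [[0.0] * n for _ in range(n)]
    # The outer three loops walk over tiles; 'block' is the tile edge.
    for ii in range(0, n, block):
        for kk in range(0, n, block):
            for jj in range(0, n, block):
                # The inner three loops stay inside one tile of a, b and c,
                # so the same small set of elements is reused repeatedly.
                for i in range(ii, min(ii + block, n)):
                    for k in range(kk, min(kk + block, n)):
                        a_ik = a[i][k]
                        for j in range(jj, min(jj + block, n)):
                            c[i][j] += a_ik * b[k][j]
    return c

a = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
print(matmul_blocked(a, identity, n=3))
```

In a real kernel the tile size would be tuned so that the tiles in use fit in a given cache level; the result is the same as the untiled triple loop, only the order of the work changes.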

23
GPU optimization

Global synchronization
  Reduce the number of atomic operations
Shared memory
  Use shared memory to reduce off-chip demand
  Shared memory is multi-banked and is efficient for gather/scatter operations

24
Conclusion

This work analyzed the performance of important throughput computing kernels on CPU and GPU.
  The gap is much smaller than in previous reports (2.5X on average).
Recommendations for a throughput computing architecture:
  High compute
  High bandwidth
  Large cache
  Gather/scatter support
  Efficient synchronization
  Fixed function units

25
Thank you for your attention.
Any questions?

26
References

[LEE2010] V. W. Lee et al., "Debunking the 100X GPU vs. CPU Myth: An Evaluation of Throughput Computing on CPU and GPU," ISCA 2010.
[DIXON2010] M. Dixon et al., "The Next-Generation Intel Core Microarchitecture," Intel Technology Journal, Volume 14, Issue 3, 2010.
[LINDHOLM2008] E. Lindholm et al., "NVIDIA Tesla: A Unified Graphics and Computing Architecture," IEEE Micro, 2008.
