Debunking the 100X GPU vs. CPU Myth: An Evaluation of Throughput Computing on CPU and GPU - PowerPoint PPT Presentation

Transcript and Presenter's Notes


1
Debunking the 100X GPU vs. CPU Myth: An Evaluation of Throughput Computing on CPU and GPU
  • Presented by: Ahmad Lashgar
  • ECE Department, University of Tehran
  • Seminar of Parallel Processing. Instructor: Dr. Fakhraie
  • 29 Dec 11
  • Original paper: ISCA 2010
  • Original authors: Victor W. Lee et al.
  • Intel Corporation

Some slides are taken from the original paper, for educational purposes only.
2
Abstract
  • Is the GPU the silver bullet of parallel computing?
  • How large is the gap between peak and achievable performance?

3
Overview
  • Abstract
  • Architecture
    • CPU: Intel Core i7
    • GPU: Nvidia GTX280
  • Implications for throughput computing applications
  • Methodology
  • Results
  • Analyzing the results
  • Platform optimization guides
  • Conclusion

4
Architecture (1)
  • Intel Core i7-960
  • 4 cores, 3.2 GHz
  • 2-way multi-threading per core
  • 4-wide SIMD
  • Caches: L1 32KB, L2 256KB, L3 8MB
  • 32 GB/s memory bandwidth

[DIXON2010]
5
Architecture (2)
  • Nvidia GTX280
  • 30 cores, 1.3 GHz
  • 1024-way multi-threading per core
  • 8-wide SIMD
  • 16KB software-managed cache (shared memory) per core
  • 141 GB/s memory bandwidth

[LINDHOLM2008]
6
Architecture (3)
                            Core i7-960       GTX280
Cores                       4                 30
Frequency (GHz)             3.2               1.3
Transistors                 0.7B (263 mm²)    1.4B (576 mm²)
Memory bandwidth (GB/s)     32                141
SP SIMD width               4                 8
DP SIMD width               2                 1
Peak SP scalar GFLOPS       25.6              116.6
Peak SP SIMD GFLOPS         102.4             311.1 (933.1)
Peak DP SIMD GFLOPS         51.2              77.8
Entries shown in red on the original slide are not the authors' numbers.
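
The peak figures in the table can be cross-checked with a simple product: cores × clock × SIMD lanes × flops per lane per cycle. The sketch below is an illustration, not the authors' calculation. It assumes the Core i7 can issue one multiply and one add per lane per cycle (2 flops), that the GTX280 shader clock is 1.296 GHz (the table rounds it to 1.3), and that the bracketed 933.1 counts a multiply-add plus an extra multiply per lane per cycle (3 flops). The function name `peak_gflops` is made up for this example.

```python
def peak_gflops(cores, clock_ghz, simd_lanes, flops_per_lane_per_cycle):
    """Peak throughput in GFLOPS: cores x GHz x SIMD lanes x flops/lane/cycle."""
    return cores * clock_ghz * simd_lanes * flops_per_lane_per_cycle


# Core i7-960 (core count and clock from the table).
# Assumption: one multiply and one add per lane per cycle, i.e. 2 flops.
I7 = dict(cores=4, clock_ghz=3.2)
i7_sp_scalar = peak_gflops(**I7, simd_lanes=1, flops_per_lane_per_cycle=2)
i7_sp_simd = peak_gflops(**I7, simd_lanes=4, flops_per_lane_per_cycle=2)
i7_dp_simd = peak_gflops(**I7, simd_lanes=2, flops_per_lane_per_cycle=2)

# GTX280 (core count from the table).
# Assumption: 1.296 GHz shader clock, which the table rounds to 1.3.
GTX = dict(cores=30, clock_ghz=1.296)
# Assumption: 3 flops per lane per cycle (multiply-add plus a multiply).
gtx_sp_simd_full = peak_gflops(**GTX, simd_lanes=8, flops_per_lane_per_cycle=3)
# Counting only 1 flop per lane per cycle gives the unbracketed figure.
gtx_sp_simd_base = peak_gflops(**GTX, simd_lanes=8, flops_per_lane_per_cycle=1)
# Assumption: one double-precision multiply-add (2 flops) per core per cycle.
gtx_dp_simd = peak_gflops(**GTX, simd_lanes=1, flops_per_lane_per_cycle=2)

for name, value in [
    ("i7 SP scalar", i7_sp_scalar),
    ("i7 SP SIMD", i7_sp_simd),
    ("i7 DP SIMD", i7_dp_simd),
    ("GTX280 SP SIMD (1 flop)", gtx_sp_simd_base),
    ("GTX280 SP SIMD (3 flops)", gtx_sp_simd_full),
    ("GTX280 DP SIMD", gtx_dp_simd),
]:
    print(f"{name}: {value:.2f} GFLOPS")
```

Under these assumptions the Core i7 rows come out exactly as listed, and the GTX280 rows agree with the table to within about 0.1 GFLOPS.
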
7
Implications for throughput computing applications
  1. Difference in the number of cores
  2. Cache size/multi-threading
  3. Bandwidth difference

8
1. Difference in the number of cores
  • It is all about core complexity
  • The common goal: improving pipeline efficiency
  • CPU goal: single-thread performance
    • Exploiting ILP
    • Sophisticated branch predictor
    • Multiple-issue logic
  • GPU goal: throughput
    • Interleaving hundreds of threads

9
2. Cache size/multi-threading
  • CPU goal: reducing memory latency
    • Programmer-transparent data caching
    • Increasing the cache size to capture the working set
    • Prefetching (HW/SW)
  • GPU goal: hiding memory latency
    • Interleave the execution of hundreds of threads, so that while one
      thread waits on memory the others keep the core busy
  • Note that each platform also uses the other's technique
    • The CPU uses multi-threading for latency hiding
    • The GPU uses software-controlled caching (shared memory) for
      reducing memory latency

10
3. Bandwidth difference
  • Bandwidth versus latency
  • CPU goal: single-thread performance
    • Workloads do not demand many memory accesses
    • Bring the data in as soon as possible
  • GPU goal: throughput
    • There are lots of memory accesses, so provide high bandwidth
    • Whatever the latency, the cores will hide it!
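
The bandwidth-versus-latency trade-off can be made concrete with Little's law: the amount of data that must be in flight equals bandwidth times latency. The sketch below is only an illustration. The bandwidth figures are the ones from the architecture slides, but the 400 ns latency and the 64-byte request size are hypothetical placeholders, not measurements of either chip, and the same latency is used for both platforms purely to isolate the effect of bandwidth.

```python
import math


def bytes_in_flight(bandwidth_gb_s, latency_ns):
    """Little's law: data in flight = bandwidth x latency.

    GB/s times ns is (1e9 bytes/s) x (1e-9 s), so the result is in bytes.
    """
    return bandwidth_gb_s * latency_ns


def requests_in_flight(bandwidth_gb_s, latency_ns, bytes_per_request):
    """Outstanding memory requests needed to keep the memory system saturated."""
    return math.ceil(bytes_in_flight(bandwidth_gb_s, latency_ns) / bytes_per_request)


HYPOTHETICAL_LATENCY_NS = 400  # assumption, not a measured value
HYPOTHETICAL_REQUEST_BYTES = 64  # assumption: one cache-line-sized request

cpu_requests = requests_in_flight(32, HYPOTHETICAL_LATENCY_NS, HYPOTHETICAL_REQUEST_BYTES)
gpu_requests = requests_in_flight(141, HYPOTHETICAL_LATENCY_NS, HYPOTHETICAL_REQUEST_BYTES)

print("requests in flight, 32 GB/s :", cpu_requests)
print("requests in flight, 141 GB/s:", gpu_requests)
```

With these placeholder numbers, saturating the higher-bandwidth memory system takes several times more outstanding requests, which is one way to see why a throughput-oriented design keeps so many threads in flight.
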

11
Methodology (1)
  • Hardware
    • Intel Core i7-960, 6GB DRAM, GTX280 1GB
  • Software
    • SUSE Enterprise 11
    • CUDA Toolkit 2.3

12
Methodology (2)
  • Optimizations
  • On CPU
    • SGEMM, SpMV and FFT from Intel MKL 10
    • Always 2 threads per core
  • On GPU
    • Best possible algorithm for SpMV, FFT and MC
    • Often 128 to 256 threads per core (to leverage shared memory and
      register-file usage)
    • Interleaving GPU execution with host-to-device/device-to-host
      (HD/DH) memory transfers where possible

13
Results
  • HD/DH data transfer time is not included
  • The GPU is only 2.5X faster on average
  • Far below what previous research reported (100X)

14
Where did the speedups in previous research come from?!
  • Which CPU and GPU were compared?
  • How much optimization was performed on the CPU and on the GPU?
  • Studies that optimized both platforms reported much lower speedups
    (like this paper)

15
Analyzing the results (1)
  1. Bandwidth
  2. Compute flops (single precision)
  3. Compute flops (double precision)
  4. Reduction and synchronization
  5. Fixed function

16
Analyzing the results (2)
  • Bandwidth
  • Peak GTX280/Core i7-960 ratio: 4.7X
  • Feature: large working set; performance is bounded by bandwidth
  • Examples
    • SAXPY (5.3X)
    • LBM (5X)
    • SpMV (1.9X)
      • CPU benefits from caching
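
For a bandwidth-bound kernel, a roofline-style estimate shows why the speedup tracks the bandwidth ratio rather than the flops ratio. The roofline model is brought in here for illustration and is not part of the slides. The sketch assumes single-precision SAXPY (y[i] = a*x[i] + y[i]) moves 12 bytes per element (read x, read y, write y) for 2 flops, ignores caches and write-allocate traffic, and takes its peak and bandwidth values from the architecture table. Those table values give a bandwidth ratio of about 4.4X, slightly below the 4.7X quoted above; the sketch does not try to explain that difference or the measured 5.3X.

```python
def attainable_gflops(peak_gflops, bandwidth_gb_s, flops_per_byte):
    """Roofline bound: the lower of the compute peak and bandwidth x intensity."""
    return min(peak_gflops, bandwidth_gb_s * flops_per_byte)


# Assumption: SAXPY in single precision does 2 flops per element and moves
# 12 bytes per element (4-byte read of x, 4-byte read of y, 4-byte write of y).
SAXPY_FLOPS_PER_BYTE = 2 / 12

cpu_bound = attainable_gflops(102.4, 32, SAXPY_FLOPS_PER_BYTE)
gpu_bound = attainable_gflops(311.1, 141, SAXPY_FLOPS_PER_BYTE)
bound_ratio = gpu_bound / cpu_bound

print(f"Core i7-960 bound: {cpu_bound:.2f} GFLOPS")
print(f"GTX280 bound:      {gpu_bound:.2f} GFLOPS")
print(f"ratio:             {bound_ratio:.2f}X")
```

Both bounds sit far below the compute peaks, so under these assumptions the extra GPU flops go unused and only the bandwidth advantage shows up.
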

17
Analyzing the results (3)
  • Compute Flops (Single Precision)
  • Peak GTX280/Core i7-960 ratio: 3X
  • Feature: bounded by computation; benefits from more cores
  • Examples
    • SGEMM, Conv and FFT (2.8-4X)

18
Analyzing the results (4)
  • Compute Flops (Double Precision)
  • Peak GTX280/Core i7-960 ratio: 1.5X
  • Feature: bounded by computation; benefits from more cores
  • Examples
    • MC (1.8X)
    • Blitz (5X)
      • Uses transcendental operations
    • Sort (1.25X slower)
      • Due to a decrease in SIMD width usage
      • Depends on scalar performance

19
Analyzing the results (5)
  • Reduction and Synchronization
  • Feature: the more threads, the higher the synchronization overhead
  • Examples
    • Hist (1.8X)
      • On CPU, 28% of the time is spent on atomic operations
      • On GPU, the atomic operations are much slower
    • Solv (1.9X slower)
      • Multiple kernel launches to preserve cache coherency on GPU

20
Analyzing the results (6)
  • Fixed function
  • Feature: interpolation, texturing and transcendental operations are
    a bonus on GPU
  • Examples
    • Bilat (5.7X)
      • On CPU, 66% of the time is spent on transcendental operations
    • GJK (14.9X)
      • Uses texture lookup

21
Platform optimization guides
  • CPU programmers have relied heavily on increasing clock frequency
  • Their applications do not benefit from TLP and DLP
  • Today's CPUs use wider SIMD, which stays idle if not exploited by the
    programmer (or compiler)
  • This paper showed that careful multi-threading can narrow the gap
    substantially
    • For LBM, from 114X down to 5X
  • Let's learn some optimization tips from the authors

22
CPU optimization
  • Scalability (4X)
    • Scale the kernel with the number of threads
  • Blocking (5X)
    • Be aware of the cache hierarchy and use it efficiently
  • Regularizing (1.5X)
    • Align the data regularly to take advantage of SIMD
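
The blocking tip can be illustrated with a tiled matrix multiply, a standard textbook instance of cache blocking and not code from the paper. In this sketch the function names and the `tile` parameter are made up for the example. The loops are split so that each small tile of the inputs is reused many times before the computation moves on. Python will not show the cache benefit, so only the loop structure matters here; in a compiled language the tile size would be chosen so that the working tiles fit in cache.

```python
def matmul_naive(a, b):
    """Reference triple loop: for large matrices, b is streamed with poor reuse."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            for p in range(k):
                c[i][j] += a[i][p] * b[p][j]
    return c


def matmul_blocked(a, b, tile=2):
    """Same arithmetic, reordered so each tile x tile block is reused while hot."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, tile):          # tile of rows of a and c
        for j0 in range(0, m, tile):      # tile of columns of b and c
            for p0 in range(0, k, tile):  # tile along the shared dimension
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, m)):
                        acc = c[i][j]
                        for p in range(p0, min(p0 + tile, k)):
                            acc += a[i][p] * b[p][j]
                        c[i][j] = acc
    return c


# Small example with dimensions that are not multiples of the tile size.
A = [[float(i + j) for j in range(4)] for i in range(3)]
B = [[float(i * j + 1) for j in range(5)] for i in range(4)]
print(matmul_blocked(A, B, tile=2) == matmul_naive(A, B))
```

The blocked version visits the shared dimension in the same order as the naive one for every output element, so the two produce identical results; only the memory access pattern changes.
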

23
GPU optimization
  • Global synchronization
    • Reduce the atomic operations
  • Shared memory
    • Use shared memory to reduce off-chip demand
    • Shared memory is multi-banked and is efficient for gather/scatter
      operations
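
One common way to cut down on atomic operations in a histogram is privatization: each worker updates its own private copy, and the copies are merged once at the end. This is a general technique and not necessarily what the authors did. The sketch runs the workers one after another in plain Python to show the data flow only; on a GPU the private copy would typically live in shared memory, one per thread block. The function name and parameters are made up for the example.

```python
def histogram_privatized(values, num_bins, num_workers):
    """Histogram built from per-worker private copies, merged at the end.

    No two workers ever write to the same counter, so the counting phase
    needs no atomic updates; only the final merge combines the copies.
    """
    private = [[0] * num_bins for _ in range(num_workers)]
    for w in range(num_workers):
        # Worker w takes every num_workers-th element starting at index w.
        # The workers run sequentially here; in a real implementation each
        # would be a thread or a thread block.
        for v in values[w::num_workers]:
            private[w][v] += 1
    # Reduction step: sum the private copies bin by bin.
    return [sum(copy[b] for copy in private) for b in range(num_bins)]


data = [0, 1, 1, 3, 3, 3, 2, 0]
print(histogram_privatized(data, num_bins=4, num_workers=3))
```

The trade-off is extra memory for the private copies and a merge pass whose cost grows with the number of bins and workers.
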

24
Conclusion
  • This work analyzed the performance of important throughput computing
    kernels on CPU and GPU
    • The gap is much lower than in previous reports (2.5X)
  • Recommendations for a throughput computing architecture
    • High compute
    • High bandwidth
    • Large cache
    • Gather/scatter support
    • Efficient synchronization
    • Fixed function units

25
  • Thank you for your attention.
  • Any questions?

26
References
  • [LEE2010] V. W. Lee et al., "Debunking the 100X GPU vs. CPU Myth: An
    Evaluation of Throughput Computing on CPU and GPU," ISCA 2010
  • [DIXON2010] M. Dixon et al., "The Next-Generation Intel Core
    Microarchitecture," Intel Technology Journal, Volume 14, Issue 3, 2010
  • [LINDHOLM2008] E. Lindholm et al., "NVIDIA Tesla: A Unified Graphics
    and Computing Architecture," IEEE Micro, 2008