Evaluation of Multicore Architectures for Image Processing Algorithms - PowerPoint PPT Presentation

Learn more at: http://cecas.clemson.edu

Transcript and Presenter's Notes

1
Evaluation of Multi-core Architectures for Image
Processing Algorithms
  • Master's Thesis Presentation by
  • Trupti Patil
  • July 22, 2009

2
Overview
  • Motivation
  • Contribution scope
  • Background
  • Platforms
  • Algorithms
  • Experimental Results
  • Conclusion

3
Motivation
  • Fast processing response is a major requirement
    in many image processing applications.
  • Image processing algorithms can be
    computationally expensive.
  • Data needs to be processed in parallel and
    optimized for real-time execution.
  • Recently introduced massively parallel computer
    architectures promise significant acceleration.
  • Some architectures haven't been actively explored
    yet.

4
Overview
  • Motivation
  • Contribution scope
  • Background
  • Platforms
  • Algorithms
  • Experimental Results
  • Conclusion

5
Contribution scope of the thesis
  • This thesis adapts and optimizes three image
    processing and computer vision algorithms for
    four multi-core architectures.
  • Execution timings are measured.
  • Obtained timings are compared against available
    corresponding previous work (intra-class) and
    across architecture types (inter-class).
  • Appropriate deductions are made based on the
    results.

6
Overview
  • Motivation
  • Contribution scope
  • Background
  • Platforms
  • Algorithms
  • Implementation
  • Conclusion

7
Background
  • Need for Parallelization
  • SIMD Optimization
  • The need for faster execution time
  • Related work
  • Canny edge detection on the Cell BE (Gupta et
    al.) and on the GPU (Luo et al.)
  • KLT tracking implementations on the GPU (Sinha et
    al., Zach et al.)

8
Overview
  • Motivation
  • Contribution scope
  • Background
  • Platforms
  • Algorithms
  • Implementation
  • Experimental Results
  • Conclusion

9
Hardware and Software Platforms
10
Intel NetBurst and Core Microarchitectures
  • NetBurst
  • Can execute legacy IA-32 and SIMD applications at
    a higher clock rate.
  • HT allows simultaneous multithreading, with two
    logical processors on each physical processor.
  • Support for up to SSE3
  • Core
  • Improved performance/watt factor.
  • SSSE3 support for effective XMM register
    utilization.
  • Supports SSE4
  • Scales up to quad-core

11
Cell Broadband Engine (CBE)
  • Structural diagram of the Cell Broadband Engine

12
Cell processor overview
  • One Power-based PPE, with VMX
  • 32/32kB I/D L1, and 512kB L2
  • dual issue, in order PPU, 2 HW threads
  • Eight SPEs, with up to 16x SIMD
  • dual issue, in order SPU
  • 128 registers (128b wide)
  • 256 kB local store (LS)
  • 2x 16B/cycle DMA, 16 outstanding req.
  • Element Interconnect Bus (EIB)
  • 4 rings, 16B wide (at 1/2 clock)
  • 96B/cycle peak, 16B/cycle to memory
  • 2x 16B/cycle BIF and I/O
  • External communication
  • Dual XDR memory controller (MIC)
  • Two configurable bus interfaces (BIC)
  • Classical I/O interface
  • SMP coherent interface

13
Graphics Processing Unit (GPU)

  • Data flow in the GPU: Application -> Vertex
    Processor -> Assemble/Rasterize -> Fragment
    Processor (reads Textures) -> Frame buffer
    Operations -> Frame buffer

14
Nvidia GeForce 8 Series GPU
  • Graphics pipeline in NVIDIA GeForce 8 Series GPU

15
Compute Unified Device Architecture (CUDA)
  • Computing engine in Nvidia GPUs
  • Turns the GPU into a highly multithreaded
    coprocessor (compute device).
  • Provides both a low-level and a higher-level API
  • Has several advantages over programming GPUs
    through graphics APIs (e.g. OpenGL)

16
Overview
  • Motivation
  • Contribution scope
  • Background
  • Platforms
  • Algorithms
  • Experimental Results
  • Conclusion

17
Algorithm 1 Gaussian Smoothing
  • Gaussian smoothing is a filtering kernel
  • Removes small-scale texture and noise for a given
    spatial extent
  • 1-D Gaussian kernel written as
    G(x) = 1/(sqrt(2 pi) sigma) exp(-x^2 / (2 sigma^2))
  • 2-D Gaussian kernel
    G(x, y) = 1/(2 pi sigma^2) exp(-(x^2 + y^2) / (2 sigma^2))
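The separable filtering these formulas describe can be sketched in NumPy (an illustrative sketch, not the thesis implementation; the 3-sigma kernel radius is a common default assumed here):

```python
import numpy as np

def gaussian_kernel_1d(sigma, radius=None):
    # Sample G(x) = exp(-x^2 / (2 sigma^2)) and normalize so taps sum to 1.
    if radius is None:
        radius = int(3 * sigma)  # 3-sigma support (assumed cutoff)
    x = np.arange(-radius, radius + 1, dtype=float)
    k = np.exp(-x**2 / (2 * sigma**2))
    return k / k.sum()

def gaussian_smooth(image, sigma):
    # The 2-D Gaussian is separable: convolve rows, then columns,
    # which costs O(2r) per pixel instead of O(r^2).
    k = gaussian_kernel_1d(sigma)
    rows = np.apply_along_axis(np.convolve, 1, image, k, mode="same")
    return np.apply_along_axis(np.convolve, 0, rows, k, mode="same")
```

Separability is what makes the row/column split valid: the 2-D kernel factors as G(x, y) = G(x) G(y).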

18
Gaussian Smoothing (example)
19
Algorithm 2 Canny Edge Detection
  • Edge detection is a common operation in image
    processing
  • Edges are discontinuities in image gray levels
    and have strong intensity contrast.
  • Canny Edge Detection is an optimal edge-detection
    algorithm.
  • Illustrated ahead with an example.
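The detector's gradient and double-threshold stages can be sketched in NumPy. This is a simplified illustration, assuming Sobel filters for the gradient and a single-pass hysteresis; the full algorithm also smooths the image first and applies non-maximum suppression:

```python
import numpy as np

def sobel_gradients(img):
    # 3x3 Sobel derivative filters, evaluated on the valid interior only.
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
    ky = kx.T
    h, w = img.shape
    gx = np.zeros((h - 2, w - 2))
    gy = np.zeros((h - 2, w - 2))
    for i in range(h - 2):
        for j in range(w - 2):
            patch = img[i:i + 3, j:j + 3]
            gx[i, j] = (patch * kx).sum()
            gy[i, j] = (patch * ky).sum()
    return gx, gy

def canny_threshold(mag, low, high):
    # Double thresholding: strong edges (>= high) are kept; weak edges
    # (>= low) are kept only if 8-connected to a strong edge.
    strong = mag >= high
    weak = (mag >= low) & ~strong
    out = strong.copy()
    # One pass of hysteresis: promote weak pixels adjacent to strong ones.
    for i in range(1, mag.shape[0] - 1):
        for j in range(1, mag.shape[1] - 1):
            if weak[i, j] and strong[i-1:i+2, j-1:j+2].any():
                out[i, j] = True
    return out
```

A production hysteresis stage would iterate (or flood-fill) until no more weak pixels are promoted.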

20
Canny Edge Detection (example)
21
Algorithm 3 KLT Tracking
  • First proposed by Lucas and Kanade. Extended by
    Tomasi and Kanade, and by Shi and Tomasi.
  • First, determine which feature(s) to track
    through feature selection
  • Second, track the selected feature(s) across the
    image sequence.
  • Rests on three assumptions: temporal persistence,
    spatial coherence, and brightness constancy
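Under these three assumptions, tracking one feature reduces to solving a 2x2 linear system G d = e for the window displacement d. A minimal translation-only sketch (the window half-size `win` and the central-difference gradients are choices made for this illustration, not details from the thesis):

```python
import numpy as np

def lucas_kanade_step(I, J, x, y, win=2):
    # One Lucas-Kanade iteration for the window centred at (x, y):
    # central-difference spatial gradients of frame I ...
    Ix = (I[y-win:y+win+1, x-win+1:x+win+2] -
          I[y-win:y+win+1, x-win-1:x+win]) / 2
    Iy = (I[y-win+1:y+win+2, x-win:x+win+1] -
          I[y-win-1:y+win, x-win:x+win+1]) / 2
    # ... temporal difference between the frames (brightness constancy) ...
    It = J[y-win:y+win+1, x-win:x+win+1] - I[y-win:y+win+1, x-win:x+win+1]
    # ... and the 2x2 normal equations G d = e (spatial coherence).
    G = np.array([[(Ix*Ix).sum(), (Ix*Iy).sum()],
                  [(Ix*Iy).sum(), (Iy*Iy).sum()]])
    e = -np.array([(Ix*It).sum(), (Iy*It).sum()])
    return np.linalg.solve(G, e)  # displacement (dx, dy)
```

Feature selection picks windows where G is well conditioned (both eigenvalues large), which is exactly the Shi-Tomasi criterion.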

22
Algorithm 3 KLT Tracking
23
Overview
  • Motivation
  • Contribution scope
  • Background
  • Platforms
  • Algorithms
  • Results
  • Conclusion

24
Gaussian Smoothing Results
  • Lenna
  • Mandrill

25
Results Gaussian Smoothing
26
Canny edge detection Results
  • Lenna
  • Mandrill

27
Results Canny edge detection
28
Results Canny Edge Detection
  • Comparison with other implementations on Cell
  • Comparison with other implementations on GPU

29
Results KLT Tracking
30
Results KLT Tracking
  • Comparison with other implementations on GPU
  • Comparison with other implementations on Cell
  • No known implementations yet.

31
Overview
  • Motivation
  • Contribution scope
  • Background
  • Platforms
  • Algorithms
  • Results
  • Conclusion and Extension

32
Conclusion and Future Work
  • The GPU remains ahead of the other architectures
    and is most suited for image processing
    applications.
  • Further optimizing the PS3 implementation could
    improve timings and narrow the gap with the GPU.
  • Future work could provide
  • Support for a faster color Canny.
  • Support for kernel widths larger than 5
  • Better management of thread alignment on the GPU
    when the thread count is not a multiple of 16
  • Inclusion of Intel Xeon and Larrabee as potential
    architectures.

33
Questions?
34
Additional Slides
35
CBE Architecture
  • Contains a traditional microprocessor, the
    PowerPC Processor Element (PPE), which controls
    tasks
  • 64-bit PPC with 32 KB L1 instruction cache, 32 KB
    L1 data cache, and 512 KB L2 cache.
  • The PPE controls 8 synergistic processor elements
    (SPEs) operating as SIMD units
  • Each SPE has an SPU and a memory flow controller
    (MFC) for data-intensive tasks
  • SPU (RISC) with 128 128-bit SIMD registers and a
    256 KB local store (LS).
  • PPE, SPEs, MIC, and BIC are connected by the
    Element Interconnect Bus (EIB) for data movement:
    a ring bus of four 16-byte channels providing a
    sustained bandwidth of 204.8 GB/s. The MFC
    connection to Rambus XDR memory and the BIC
    interface to I/O devices connected via RapidIO
    provide 25.6 GB/s of data bandwidth.

36
CBE What makes it fast?
  • Huge inter-SPE bandwidth
  • 205 GB/s sustained output
  • Fast main memory
  • 25.6 GB/s bandwidth for Rambus XDR memory
  • Predictable DMA latency and throughput
  • DMA traffic has negligible impact on SPE local
    store bandwidth
  • Easy to overlap data movement with computation
  • High performance, low-power SPE cores

37
Nvidia GeForce (Continued)
  • GPU has K multiprocessors (MP)
  • Each MP has L scalar processors (SP)
  • Each MP performs block processing in batches
  • A block is processed by only one MP
  • Each block is split into SIMD groups of threads
    (warps)
  • A warp is executed physically in parallel
  • A scheduler switches between warps
  • A warp contains threads of increasing,
    consecutive thread IDs
  • Currently the warp size is 32 threads

38
CUDA Programming model
  • A grid consists of thread blocks
  • Each thread executes the kernel
  • Grid and block dimensions are specified by the
    application (maximum limited by the GPU)
  • 1-/2-/3-D grid layout
  • Thread and block IDs are unique
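How unique IDs arise from the grid/block hierarchy can be modeled in a few lines of Python (illustrative pseudo-CUDA; the helper names are hypothetical, but the index formula `blockIdx.x * blockDim.x + threadIdx.x` is the standard CUDA one for the 1-D case):

```python
def global_thread_id(block_idx, block_dim, thread_idx):
    # CUDA's per-thread unique global index for a 1-D launch:
    # blockIdx.x * blockDim.x + threadIdx.x
    return block_idx * block_dim + thread_idx

def launch_grid(grid_dim, block_dim):
    # Model of a 1-D grid launch: every (block, thread) pair yields
    # a distinct global ID covering 0 .. grid_dim * block_dim - 1.
    return [global_thread_id(b, block_dim, t)
            for b in range(grid_dim)
            for t in range(block_dim)]
```

Each thread typically uses this global ID to select which data element it processes, so a grid of N total threads covers an N-element array.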

39
CUDA Memory model
  • Shared memory (R/W) for sharing data within a
    block
  • Texture memory spatially cached
  • Constant memory 64 KB, cached
  • Global memory not cached; accesses should be
    coalesced
  • Explicit GPU memory allocation/de-allocation
  • Slow copying between CPU and GPU memory