Title: Evaluation of Multicore Architectures for Image Processing Algorithms
1 Evaluation of Multi-core Architectures for Image Processing Algorithms
- Master's Thesis Presentation by
- Trupti Patil
- July 22, 2009
2 Overview
- Motivation
- Contribution scope
- Background
- Platforms
- Algorithms
- Experimental Results
- Conclusion
3 Motivation
- Fast processing response is a major requirement in many image processing applications.
- Image processing algorithms can be computationally expensive.
- Data needs to be processed in parallel and optimized for real-time execution.
- The recent introduction of massively-parallel computer architectures promises significant acceleration.
- Some architectures haven't been actively explored yet.
4 Overview
- Motivation
- Contribution scope
- Background
- Platforms
- Algorithms
- Experimental Results
- Conclusion
5 Contribution scope of the thesis
- This thesis adapts and optimizes three image processing and computer vision algorithms for four multi-core architectures.
- Execution timings are measured.
- The obtained timings are compared against available previous work on the same architecture (intra-class) and across architecture types (inter-class).
- Appropriate deductions are made based on the results.
6 Overview
- Motivation
- Contribution scope
- Background
- Platforms
- Algorithms
- Implementation
- Conclusion
7 Background
- Need for parallelization
- SIMD optimization
- The need for faster execution time
- Related work
- Canny edge detection on the Cell BE (Gupta et al.) and on the GPU (Luo et al.)
- KLT tracking implementations on the GPU (Sinha et al., Zach et al.)
8 Overview
- Motivation
- Contribution scope
- Background
- Platforms
- Algorithms
- Implementation
- Experimental Results
- Conclusion
9 Hardware and Software Platforms
10 Intel NetBurst and Core Microarchitectures
- Can execute legacy IA-32 and SIMD applications at a higher clock rate.
- HT (Hyper-Threading) allows simultaneous multithreading.
- Has two logical processors on each physical processor.
- Supports up to SSE3.
- Improved performance/watt.
- SSSE3 support for effective utilization of the XMM registers.
- Supports SSE4.
- Scales up to quad-core.
11 Cell Broadband Engine (CBE)
- Structural diagram of the Cell Broadband Engine
12 Cell processor overview
- One Power-based PPE, with VMX
- 32/32kB I/D L1, and 512kB L2
- dual issue, in order PPU, 2 HW threads
- Eight SPEs, with up to 16x SIMD
- dual issue, in order SPU
- 128 registers (128b wide)
- 256 kB local store (LS)
- 2x 16B/cycle DMA, 16 outstanding req.
- Element Interconnect Bus (EIB)
- 4 rings, 16B wide (at 1/2 clock)
- 96B/cycle peak, 16B/cycle to memory
- 2x 16B/cycle BIF and I/O
- External communication
- Dual XDR memory controller (MIC)
- Two configurable bus interfaces (BIC)
- Classical I/O interface
- SMP coherent interface
13 Graphics Processing Unit (GPU)
- Diagram: graphics pipeline (Application, Vertex Processor, Assemble/Rasterize, Fragment Processor, Frame buffer Operations, Framebuffer; Textures feed the fragment stage)
14 Nvidia GeForce 8 Series GPU
- Graphics pipeline in NVIDIA GeForce 8 Series GPU
15 Compute Unified Device Architecture (CUDA)
- Computing engine in Nvidia GPUs.
- Turns the GPU into a highly multithreaded coprocessor (compute device).
- Provides both a low-level and a higher-level API.
- Has several advantages over programming GPUs through graphics APIs (e.g. OpenGL).
16 Overview
- Motivation
- Contribution scope
- Background
- Platforms
- Algorithms
- Experimental Results
- Conclusion
17 Algorithm 1: Gaussian Smoothing
- Gaussian smoothing is a filtering kernel.
- Removes small-scale texture and noise over a given spatial extent.
- 1-D Gaussian kernel: G(x) = (1 / (sqrt(2*pi) * sigma)) * exp(-x^2 / (2*sigma^2))
- 2-D Gaussian kernel: G(x, y) = (1 / (2*pi*sigma^2)) * exp(-(x^2 + y^2) / (2*sigma^2))
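Since the 2-D Gaussian kernel factors into two identical 1-D passes (it is separable), the smoothing step can be sketched in a few lines of pure Python. This is an illustrative sketch only, not the thesis implementation; the clamped-border handling is an assumption.

```python
import math

def gaussian_kernel_1d(sigma, radius):
    """Sampled, normalized 1-D Gaussian kernel of width 2*radius + 1."""
    k = [math.exp(-(x * x) / (2.0 * sigma * sigma))
         for x in range(-radius, radius + 1)]
    s = sum(k)
    return [v / s for v in k]

def smooth_rows(image, kernel):
    """Convolve each row with the 1-D kernel (borders clamped).
    The 2-D Gaussian is separable, so a second pass over the
    columns with the same kernel completes the smoothing."""
    r = len(kernel) // 2
    out = []
    for row in image:
        n = len(row)
        out.append([sum(kernel[r + j] * row[min(max(i + j, 0), n - 1)]
                        for j in range(-r, r + 1))
                    for i in range(n)])
    return out
```

Because the kernel is normalized, smoothing a constant image leaves it unchanged, which makes a quick sanity check.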
18 Gaussian Smoothing (example)
19 Algorithm 2: Canny Edge Detection
- Edge detection is a common operation in image processing.
- Edges are discontinuities in image gray levels and have strong intensity contrast.
- Canny edge detection is an optimal edge-detector algorithm.
- Illustrated ahead with an example.
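The Canny detector proceeds in stages: Gaussian smoothing, gradient computation, non-maximum suppression, and double thresholding with hysteresis. Here is a minimal pure-Python sketch of two of those stages, gradient magnitude and double thresholding; non-maximum suppression and hysteresis linking are omitted, and this is an illustration, not the thesis code.

```python
def gradient_magnitude(img):
    """Gradient magnitude of a grayscale image (list of lists)
    using central differences; the one-pixel border is left at 0."""
    h, w = len(img), len(img[0])
    mag = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = (img[y][x + 1] - img[y][x - 1]) / 2.0
            gy = (img[y + 1][x] - img[y - 1][x]) / 2.0
            mag[y][x] = (gx * gx + gy * gy) ** 0.5
    return mag

def double_threshold(mag, low, high):
    """Label pixels: 2 = strong edge, 1 = weak (kept only if later
    linked to a strong edge by hysteresis), 0 = suppressed."""
    return [[2 if v >= high else (1 if v >= low else 0) for v in row]
            for row in mag]
```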
20 Canny Edge Detection (example)
21 Algorithm 3: KLT Tracking
- First proposed by Lucas and Kanade; extended by Tomasi and Kanade, and by Shi and Tomasi.
- First, determine which feature(s) to track through feature selection.
- Second, track the selected feature(s) across the image sequence.
- Rests on three assumptions: temporal persistence, spatial coherence, and brightness constancy.
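The brightness-constancy assumption is what makes the tracking update tractable. A hedged 1-D sketch of a single Lucas-Kanade step (illustrative only; the central-difference gradient is an assumption) shows the core least-squares estimate:

```python
def lk_translation_1d(I, J):
    """Estimate the small translation d with J(x) ~ I(x + d):
    brightness constancy gives J - I ~ d * Ix, so the
    least-squares solution is d = sum(It * Ix) / sum(Ix * Ix)."""
    n = len(I)
    num = den = 0.0
    for x in range(1, n - 1):
        ix = (I[x + 1] - I[x - 1]) / 2.0  # spatial gradient of I
        it = J[x] - I[x]                  # temporal difference
        num += it * ix
        den += ix * ix
    return num / den if den else 0.0
```

In the full 2-D tracker the analogous normal equations are solved per feature window and iterated; feature selection picks windows where the gradient matrix is well conditioned.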
22 Algorithm 3: KLT Tracking
23 Overview
- Motivation
- Contribution scope
- Background
- Platforms
- Algorithms
- Results
- Conclusion
24 Gaussian Smoothing Results
25 Results: Gaussian Smoothing
26 Canny Edge Detection Results
27 Results: Canny Edge Detection
28 Results: Canny Edge Detection
- Comparison with other implementations on Cell
- Comparison with other implementations on GPU
29 Results: KLT Tracking
30 Results: KLT Tracking
- Comparison with other implementations on GPU
- Comparison with other implementations on Cell
- No known implementations yet.
31 Overview
- Motivation
- Contribution scope
- Background
- Platforms
- Algorithms
- Results
- Conclusion and Extension
32 Conclusion and Future Work
- The GPU is still ahead of the other architectures and is the most suited for image processing applications.
- Optimizing the PS3 implementation could improve its timings and narrow the gap with the GPU.
- We could provide:
- Support for a faster color Canny.
- Support for kernel widths larger than 5.
- Better management of thread alignment on the GPU when the thread count is not a multiple of 16.
- Include Intel Larrabee as a potential architecture.
33 Questions?
34 Additional Slides
35 CBE Architecture
- Contains a traditional microprocessor, the PowerPC Processor Element (PPE), which controls tasks.
- 64-bit PPC with 32 KB L1 instruction cache, 32 KB L1 data cache, and 512 KB L2 cache.
- The PPE controls 8 synergistic processor elements (SPEs) operating as SIMD units.
- Each SPE has an SPU and a memory flow controller (MFC) and handles data-intensive tasks.
- SPU (RISC) with 128 128-bit SIMD registers and a 256 KB local store (LS).
- PPE, SPEs, MIC, and BIC are connected by the Element Interconnect Bus (EIB) for data movement: a ring bus of four 16-byte channels providing a sustained bandwidth of 204.8 GB/s.
- The MFC connection to Rambus XDR memory and the BIC interface to I/O devices (connected via RapidIO) provide 25.6 GB/s of data bandwidth.
36 CBE: What makes it fast?
- Huge inter-SPE bandwidth
- 205 GB/s sustained
- Fast main memory
- 25.6 GB/s bandwidth for Rambus XDR memory
- Predictable DMA latency and throughput
- DMA traffic has negligible impact on SPE local store bandwidth
- Easy to overlap data movement with computation
- High-performance, low-power SPE cores
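The overlap of data movement with computation is typically achieved with double buffering on the SPEs. The following Python sketch only simulates the pattern; the "transfers" are plain assignments standing in for async DMA gets, and the chunking scheme is an assumption.

```python
def process_chunks(chunks, compute):
    """Double-buffering pattern: while the current buffer is being
    processed, the next chunk is already being 'transferred' into
    the other buffer. On a real SPE the prefetch below would be an
    async DMA get, with a wait on its tag before the buffer swap."""
    results = []
    buffers = [None, None]
    if chunks:
        buffers[0] = chunks[0]                    # initial transfer in
    for i in range(len(chunks)):
        if i + 1 < len(chunks):
            buffers[(i + 1) % 2] = chunks[i + 1]  # prefetch next chunk
        results.append(compute(buffers[i % 2]))   # compute on current
    return results
```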
37 Nvidia GeForce (continued)
- The GPU has K multiprocessors (MPs)
- Each MP has L scalar processors (SPs)
- Each MP performs block processing in batches
- A block is processed by only one MP
- Each block is split into SIMD groups of threads (warps)
- A warp is executed physically in parallel
- A scheduler switches between warps
- A warp contains threads with increasing, consecutive thread IDs
- Currently the warp size is 32 threads
38 CUDA Programming Model
- A grid consists of thread blocks
- Each thread executes the kernel
- Grid and block dimensions are specified by the application, up to GPU-imposed limits
- 1-/2-/3-D grid layout
- Thread and block IDs are unique
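The indexing rules above can be sketched in Python for an assumed 1-D launch; the arithmetic mirrors the kernel-side expression blockIdx.x * blockDim.x + threadIdx.x, and the warp grouping described on slide 37.

```python
WARP_SIZE = 32  # current warp size, per slide 37

def global_thread_id(block_id, block_dim, thread_id):
    """Unique global index of a thread in a 1-D grid of 1-D blocks,
    the Python analogue of blockIdx.x * blockDim.x + threadIdx.x."""
    return block_id * block_dim + thread_id

def warp_id(thread_id_in_block):
    """Threads with consecutive in-block IDs fall into the same warp."""
    return thread_id_in_block // WARP_SIZE
```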
39 CUDA Memory Model
- Shared memory (R/W): for sharing data within a block
- Texture memory: spatially cached
- Constant memory: about 20 KB, cached
- Global memory: not cached; accesses should be coalesced
- Explicit GPU memory allocation/de-allocation
- Slow copying between CPU and GPU memory