Title: Skeletons and Asynchronous RPC for Embedded Data- and Task Parallel Image Processing
1 Skeletons and Asynchronous RPC for Embedded Data- and Task Parallel Image Processing
IAPR Conference on Machine Vision Applications
Wouter Caarls, Pieter Jonker, Henk Corporaal
Quantitative Imaging Group, Department of Imaging Science & Technology
2 Overview
- Introduction & Motivation
- Approach
  - Algorithmic skeletons
  - Asynchronous RPC
- Implementation
  - Run-time system
  - Prototype architecture
- Results
- Conclusions & Future work
3 Introduction
- SmartCam: Integrating efficient, user-programmable image processing hardware within the camera or sensor itself.
- Efficient hardware implies parallelism and heterogeneity
- Efficient programmability implies custom, application-dependent hardware
- Finding the best hardware configuration for an application requires hardware-independent software
- For wide acceptance, programming should be easy
Philips CFT IV Inca: 320 x 10-bit SIMD, 5-issue VLIW
4 Our approach
5 Algorithmic Skeletons
- Separating structure from computation
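The separation can be illustrated with a minimal C sketch. The names `pixel_fn` and `pixel_op` and their signatures are assumptions made for illustration, not the library interface from the talk: the skeleton owns the iteration structure, while the programmer supplies only the per-pixel computation.

```c
#include <assert.h>
#include <stddef.h>

/* Per-pixel computation supplied by the programmer (hypothetical signature). */
typedef unsigned char (*pixel_fn)(unsigned char in, int param);

/* Skeleton: owns the iteration structure. A parallel implementation could
   split the index range across processors without touching the kernel. */
void pixel_op(pixel_fn kernel, const unsigned char *in, unsigned char *out,
              size_t n_pixels, int param)
{
    assert(kernel != NULL && in != NULL && out != NULL);
    for (size_t i = 0; i < n_pixels; i++)
        out[i] = kernel(in[i], param);
}

/* Computation: a thresholding kernel that knows nothing about image layout. */
unsigned char binarize(unsigned char in, int threshold)
{
    return in > threshold ? 1 : 0;
}
```

Calling `pixel_op(binarize, in, out, n, 50)` thresholds an image of `n` pixels; swapping the sequential loop for a distributed one would not require any change to `binarize`.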
6 Algorithmic Skeletons
- Implicit parallel programming
- Choice of skeleton implies set of constraints (dependencies)
- System is free as long as constraints are not violated
  - Distribution
  - Scanning order
- Consistent library interface facilitates between-skeleton dependency analysis
  - No side effects
  - Well-defined inputs and outputs
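As a rough illustration of this freedom, the sketch below assumes a pixel-type skeleton whose kernel has no side effects. Because each output depends only on the matching input, the system may pick any distribution (number of blocks) and any scanning order. All names here (`pixel_skeleton`, `scan_order`, `square`) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

typedef int (*kernel_fn)(int in);

/* Hypothetical scheduling choices left to the system. */
typedef enum { SCAN_FORWARD, SCAN_BACKWARD } scan_order;

/* Apply the kernel to [begin, end) in the requested order. */
static void run_range(kernel_fn k, const int *in, int *out,
                      size_t begin, size_t end, scan_order order)
{
    for (size_t j = 0; j < end - begin; j++) {
        size_t i = (order == SCAN_FORWARD) ? begin + j : end - 1 - j;
        out[i] = k(in[i]);
    }
}

/* Distribute n pixels over `parts` contiguous blocks, as a data-parallel
   run-time might do; here the blocks simply run one after another. */
void pixel_skeleton(kernel_fn k, const int *in, int *out, size_t n,
                    size_t parts, scan_order order)
{
    assert(k != NULL && parts > 0);
    for (size_t p = 0; p < parts; p++) {
        size_t begin = n * p / parts;
        size_t end = n * (p + 1) / parts;
        run_range(k, in, out, begin, end, order);
    }
}

/* Side-effect-free kernel: the result depends only on its input. */
int square(int v) { return v * v; }
```

Running the skeleton with one block scanned forward or with three blocks scanned backward yields the same output. A kernel that read previously written outputs would break this, which is the kind of constraint a different skeleton would have to declare.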
7 Algorithmic Skeletons
Disadvantages
- Inability to parallelize algorithms that cannot be expressed using one of the skeletons in the library
- Inability to specify certain algorithmic optimizations
- Inability to specify architecture-dependent optimizations
- Solution: Allow the programmer to add their own (application-specific or architecture-specific) skeletons to the library
8 Remote procedure call
- Just like a function call
- Computes the function on a different processor
- All data goes through the calling processor
- Synchronous: stub returns when remote function is done; data is available immediately
- Asynchronous: stub returns immediately; data is available later
9 Futures
- Function returns reference to future result
- Reference can be used in other RPC calls
- Using the reference outside an RPC call requires an (implicit) block
Diagram labels: Control processor, Coprocessor, Function1, Function1(a), Function2, Function2(a)
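The future semantics can be sketched in plain C. This is a single-threaded simulation under stated assumptions, not the run-time system from the talk: the coprocessor is modelled as a small fixed-size work queue that is never recycled, and the names `future`, `rpc_async`, `block`, `function1` and `function2` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_TASKS 16

/* A future: a slot whose value becomes valid once the remote call has run. */
typedef struct { int value; int ready; } future;

typedef int (*remote_fn)(int arg);

/* One pending call; its argument is an immediate value or another future. */
typedef struct {
    remote_fn fn;
    const future *arg_future;   /* NULL if the argument is immediate */
    int arg_value;
    future *result;
} task;

/* Stand-in for the coprocessor's work queue (simulation only). */
static task queue[MAX_TASKS];
static size_t head = 0, tail = 0;

/* Asynchronous stub: record the call and return immediately. */
void rpc_async(remote_fn fn, const future *arg_future, int arg_value,
               future *result)
{
    assert(tail < MAX_TASKS);
    result->ready = 0;
    queue[tail++] = (task){ fn, arg_future, arg_value, result };
}

/* Block: let the simulated coprocessor run queued calls, in issue order,
   until the requested future holds a value. */
int block(future *f)
{
    while (!f->ready) {
        assert(head < tail);    /* f must have been issued */
        task *t = &queue[head++];
        assert(t->arg_future == NULL || t->arg_future->ready);
        int arg = t->arg_future ? t->arg_future->value : t->arg_value;
        t->result->value = t->fn(arg);
        t->result->ready = 1;
    }
    return f->value;
}

int function1(int a) { return a + 1; }
int function2(int a) { return a * 2; }
```

Issuing `rpc_async(function1, NULL, 3, &r1)` followed by `rpc_async(function2, &r1, 0, &r2)` passes the first result to the second call without the caller ever reading it; only `block(&r2)` forces the control processor to wait.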
10 Parallelism through RPC
- RPC is not intrinsically parallel
- Synchronous RPC calling a parallel function gives data parallelism
- Asynchronous RPC calling a (parallel) function gives task parallelism
11 Optimizing communications
- Real-time image processing requires vast amounts of bandwidth
- Scatter-gather creates a bottleneck at the control processor
- Allow peer-to-peer communications between remote functions
12 Optimizing memory usage
- In embedded applications, memory is scarce
- Normal task parallelism requires a frame store per concurrent operation
- Pipelining
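A minimal sketch of the memory argument, assuming two hypothetical line-based stages and a tiny fixed image size: with a frame store the intermediate result occupies WIDTH * HEIGHT bytes, whereas a pipelined schedule only needs one line of WIDTH bytes between the stages.

```c
#include <assert.h>
#include <stddef.h>

enum { WIDTH = 4, HEIGHT = 2 };

/* Stage 1 (hypothetical): halve every pixel of one line. */
static void halve_line(const unsigned char *in, unsigned char *out)
{
    for (size_t x = 0; x < WIDTH; x++)
        out[x] = in[x] / 2;
}

/* Stage 2 (hypothetical): threshold one line at 50. */
static void binarize_line(const unsigned char *in, unsigned char *out)
{
    for (size_t x = 0; x < WIDTH; x++)
        out[x] = in[x] > 50;
}

/* Frame-store version: stage 1 finishes the whole image before stage 2
   starts, so the intermediate needs WIDTH * HEIGHT bytes. */
void process_with_frame_store(const unsigned char *in, unsigned char *out)
{
    unsigned char frame[WIDTH * HEIGHT];
    assert(in != NULL && out != NULL);
    for (size_t y = 0; y < HEIGHT; y++)
        halve_line(in + y * WIDTH, frame + y * WIDTH);
    for (size_t y = 0; y < HEIGHT; y++)
        binarize_line(frame + y * WIDTH, out + y * WIDTH);
}

/* Pipelined version: each line passes through both stages before the next
   line is touched, so the intermediate needs only WIDTH bytes. */
void process_pipelined(const unsigned char *in, unsigned char *out)
{
    unsigned char line[WIDTH];
    assert(in != NULL && out != NULL);
    for (size_t y = 0; y < HEIGHT; y++) {
        halve_line(in + y * WIDTH, line);
        binarize_line(line, out + y * WIDTH);
    }
}
```

Both functions produce identical output; only the size of the intermediate buffer differs. Stages with a vertical neighbourhood would need a few lines rather than one, but still far less than a full frame.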
13 Example
Object following
/* Object following */
while (1)
{
  GetImage(in);                                   /* acquire a frame */
  IsoWindowOp(WDW(5), gauss_5x5, in, filtered);   /* 5x5 window operation */
  IsoPixelOp(binarize, filtered, segmented, 50);  /* per-pixel segmentation */
  x = 0; y = 0; n = 0;
  AnisoPixelReductionOp(gravity, add, segmented, x, y, n,
                        3*sizeof(int));           /* reduce to x, y, n */
  block(x, y, n);                                 /* wait for the results */
  x = x/n; y = y/n;                               /* centre of gravity */
  SetMotorSpeed(WIDTH/2-x, HEIGHT/2-y);
}
Data-flow diagram nodes: GetImage; x, y, n = 0, 0, 0; gauss_5x5; binarize; gravity; x = x/n, y = y/n; SetMotorSpeed
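The reduction step of the example can be written out as ordinary sequential C. This is only a sketch of what the gravity reduction computes, assuming a row-major binary image; the struct `gravity_acc` and the functions `gravity_reduce` and `motor_offset` are hypothetical names, and the guard against an empty image is an addition for safety.

```c
#include <assert.h>
#include <stddef.h>

/* Accumulator for the reduction: coordinate sums and pixel count. */
typedef struct { int x, y, n; } gravity_acc;

/* Sum the coordinates of all set pixels in a row-major binary image. */
gravity_acc gravity_reduce(const unsigned char *segmented, int width, int height)
{
    gravity_acc acc = {0, 0, 0};
    assert(segmented != NULL);
    for (int row = 0; row < height; row++) {
        for (int col = 0; col < width; col++) {
            if (segmented[row * width + col]) {
                acc.x += col;
                acc.y += row;
                acc.n += 1;
            }
        }
    }
    return acc;
}

/* Offset of the centre of gravity from the image centre, matching the
   arguments passed to SetMotorSpeed. Returns -1 if no pixel is set. */
int motor_offset(gravity_acc acc, int width, int height, int *dx, int *dy)
{
    if (acc.n == 0)
        return -1;
    *dx = width / 2 - acc.x / acc.n;
    *dy = height / 2 - acc.y / acc.n;
    return 0;
}
```

Because addition is associative, partial accumulators computed on different image parts can be combined with a simple element-wise add, which is presumably the role of the `add` argument in the slide code.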
14 Run-time system implementation
Function calls: Read(a), Process(a, b)
15 Prototype architecture
16 Results
Double thresholding edge detection
Operation      Single   Split (TM only)   Split (TM+XTL)
Time (ms)      115      124               67
Relative (%)   100      108               58
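The relative row follows from dividing each time by the Single time of 115 ms and rounding to the nearest percent. A small helper (a hypothetical name, shown only to make the arithmetic explicit) reproduces it:

```c
#include <assert.h>

/* Execution time relative to a baseline, in percent, rounded to nearest. */
int relative_percent(int time_ms, int baseline_ms)
{
    assert(baseline_ms > 0 && time_ms >= 0);
    return (100 * time_ms + baseline_ms / 2) / baseline_ms;
}
```

With the measured times, 124 ms gives 108 and 67 ms gives 58, so the last column corresponds to a speedup of roughly 115 / 67, or about 1.7 times.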
17 Conclusion
- The proposed programming model
  - Is easy to use
    - Skeletons hide data parallel bookkeeping
    - RPC hides task parallel implementation
  - Is architecture independent
    - A skeleton can be implemented for different architectures
    - RPC can map to heterogeneous system
  - Is optimized for embedded usage
    - Peer-to-peer communication: no scatter/gather bottleneck
    - Pipelined: no frame stores
18 Future work
- Skeletons
- Skeleton Definition Language
- Skeleton merging
- Mapping
- Memory
- Scalar dependencies
- Evaluation
- New prototype architecture
- Dynamic, complex application