Transcript and Presenter's Notes

Title: The Perception Processor


1
The Perception Processor
Binu Mathew. Advisor: Al Davis
2
What is Perception Processing ?
  • Ubiquitous computing needs natural human interfaces
  • Processor support for perceptual applications
    • Gesture recognition
    • Object detection, recognition, tracking
    • Speech recognition
    • Speaker identification
  • Applications
    • Multi-modal, human-friendly interfaces
    • Intelligent digital assistants
    • Robotics, unmanned vehicles
    • Perception prosthetics

3
The Problem with Perception Processing
4
The Problem with Perception Processing
  • Too slow, too much power for the embedded space!
    • 2.4 GHz Pentium 4: 60 Watts
    • 400 MHz XScale: 800 mW
    • 10x or more difference in performance
  • Inadequate memory bandwidth
    • Sphinx requires 1.2 GB/s of memory bandwidth
    • XScale delivers 64 MB/s (1/19th of what Sphinx needs)
  • Characterize the application to find the problem
  • Derive an acceleration architecture
  • The history of FPUs is an analogy

5
High Level Architecture
(Block diagram: Processor, Coprocessor Interface, Memory Controller, Input SRAMs, Output SRAM, Custom Accelerator, Scratch SRAMs, DRAM Interface)
6
Thesis Statement
It is possible to design programmable processors
that can handle sophisticated perception
workloads in real-time at power budgets suitable
for embedded devices.
7
The FaceRec Application
8
FaceRec In Action
Rob Evans
9
Application Structure
(Flow diagram: Image → flesh toning → image segmentation → Viola-Jones and Rowley face detectors → neural net eye locator → Eigenfaces face recognizer → identity, coordinates)
  • Flesh toning: Soriano et al., Bertran et al.
  • Segmentation: textbook approach
  • Rowley detector, voter: Henry Rowley, CMU
  • Viola-Jones detector: published algorithm; Carbonetto, UBC
  • Eigenfaces: re-implementation by Colorado State University

10
FaceRec Characterization
  • ML-RSIM: out-of-order processor simulator
  • SPARC V8 ISA, unmodified SunOS binaries

Out-of-order processor, similar to a 2 GHz Intel Pentium 4:
  • 1-4 ALUs, 1-4 FPUs
  • Max 4 issue, max 4 graduations/cycle
  • 16 KB 2-way L1 I-cache
  • 16-64 KB 2-way L1 D-cache
  • 256 KB-2 MB 2-way L2 cache
  • 600 MHz, 64-bit DDR memory interface
In-order processor, similar to a 400 MHz Intel XScale:
  • 1 ALU, 1 FPU
  • Max 1 issue, max 1 graduation/cycle
  • 32 KB 32-way L1 I-cache
  • 32 KB 32-way L1 D-cache
  • No L2 cache
  • 100 MHz, 32-bit SDR memory interface
11
Application Profile
12
Memory System Characteristics L1 D Cache
13
Memory System Characteristics L2 Cache
14
IPC
15
Why is IPC low ?
Neural network evaluation: Sum = Σ (i = 0 to n) Weight[i] * Image[Input[i]]; Result = Tanh(Sum)
  • Dependences, e.g. no single-cycle floating point accumulate (see the sketch below)
  • Indirect accesses
  • Several array accesses per operator
  • Load/store ports saturate
  • Need architectures that can move data efficiently
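
A minimal Python sketch of the neuron evaluation above (illustrative names; the indirect image access follows the Input[i] indexing in the formula): every term needs two loads, and the floating point accumulation forms a serial dependence chain.

import math

def evaluate_neuron(weights, image, inputs):
    total = 0.0
    for i in range(len(weights)):
        # two array reads per term, one of them indirect through inputs[i]
        total += weights[i] * image[inputs[i]]
    # the += chain serializes without a single-cycle FP accumulate
    return math.tanh(total)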

16
Real Time Performance
17
Example App CMU Sphinx 3.2
  • Speech recognition engine
  • Speaker and language independent
  • Acoustic model: triphone-based, continuous
  • Hidden Markov Model (HMM) based (see the sketch below)
  • Grammar: trigram with back-off
  • Open-source HUB4 speech model
  • Broadcast news model (ABC News, NPR, etc.)
  • 64,000-word vocabulary
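
A hedged Python sketch of the per-frame work implied by "HMM based": a generic left-to-right Viterbi update over the 5-state HMMs used later as a benchmark. The names and the self-loop/advance topology are illustrative, not Sphinx's actual code.

def hmm_step(prev_scores, trans, obs_scores):
    # prev_scores[s]: best log score ending in state s at the previous frame
    # trans[s][s2]:   log transition score from state s to state s2
    # obs_scores[s]:  log observation score of the current frame in state s
    n = len(prev_scores)                         # n = 5 for a 5-state HMM
    new_scores = [float("-inf")] * n
    for s in range(n):
        for s2 in (s, min(s + 1, n - 1)):        # self-loop or advance one state
            cand = prev_scores[s] + trans[s][s2] + obs_scores[s2]
            if cand > new_scores[s2]:
                new_scores[s2] = cand
    return new_scores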

18
CMU Sphinx 3.2 Profile
19
L1 D-cache Miss Rate
20
L2 Cache Miss Rate
21
DRAM Bandwidth
22
IPC
23
High Level Architecture
(Block diagram: Processor, Coprocessor Interface, Memory Controller, Input SRAMs, Output SRAM, Custom Accelerator, Scratch SRAMs, DRAM Interface)
24
ASIC Accelerator Design Matrix Multiply
def matrix_multiply(A, B, C):            # C is the result matrix
    for i in range(0, 16):
        for j in range(0, 16):
            C[i][j] = inner_product(A, B, i, j)

def inner_product(A, B, row, col):
    sum = 0.0
    for i in range(0, 16):
        sum = sum + A[row][i] * B[i][col]
    return sum
25
ASIC Accelerator Design Matrix Multiply
Control Pattern
(same matrix_multiply / inner_product listing as slide 24, highlighting the control pattern)
26
ASIC Accelerator Design Matrix Multiply
Access Pattern
(same matrix_multiply / inner_product listing as slide 24, highlighting the access pattern)
27
ASIC Accelerator Design Matrix Multiply
Compute Pattern
(same matrix_multiply / inner_product listing as slide 24, highlighting the compute pattern)
28
ASIC Accelerator Design Matrix Multiply
(same matrix_multiply / inner_product listing as slide 24)
29
ASIC Accelerator Design Matrix Multiply
7-cycle latency
(same matrix_multiply / inner_product listing as slide 24)
30
ASIC Accelerator Design Matrix Multiply
Interleave > 7 inner products. Complicates address generation.
(same matrix_multiply / inner_product listing as slide 24)
31
How can we generalize ?
  • Decompose the loop into:
    • Control pattern
    • Access pattern
    • Compute pattern
  • Programmable h/w acceleration for each pattern (see the sketch below)
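
A minimal Python sketch of that decomposition for the matrix multiply above (the structure and names are illustrative, not the thesis toolflow): the control pattern is the loop nest, the access pattern is the operand address sequence, and the compute pattern is the multiply-accumulate that consumes the delivered operands.

# Control pattern: the loop nest, handled by the loop unit.
def control_pattern():
    for i in range(16):
        for j in range(16):
            for k in range(16):
                yield i, j, k

# Access pattern: operand addresses for each iteration, handled by the
# address generators (A, B, C assumed flat row-major at bases 0, 256, 512).
def access_pattern(i, j, k):
    return 16 * i + k, 256 + 16 * k + j, 512 + 16 * i + j

# Compute pattern: the multiply-accumulate done by the function units.
def compute_pattern(memory):
    C = {}
    for i, j, k in control_pattern():
        addrA, addrB, addrC = access_pattern(i, j, k)
        C[addrC] = C.get(addrC, 0.0) + memory[addrA] * memory[addrB]
    return C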

32
The Perception Processor Architecture Family
33
Perception Processor Pipeline
34
Function Unit Organization
35
Interconnect
36
Loop Unit
37
Address Generator
A[((i + k1) << k2) + k3][((j + k4) << k5) + k6] and indirect accesses of the form A[B[i]]
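A minimal Python model of accesses in that form (k1..k6 follow the slide; treating the combining operators as additions and the layout as flat row-major are assumptions):

# Strided access: two loop indices, each offset, scaled by a shift, then offset again.
def strided_address(base, i, j, k1, k2, k3, k4, k5, k6):
    return base + (((i + k1) << k2) + k3) + (((j + k4) << k5) + k6)

# Indirect access of the form A[B[i]].
def indirect_address(baseA, B, i):
    return baseA + B[i]

# Example: walking a row-major 16x16 array A[i][j] only needs k2 = 4 (i.e. *16).
assert strided_address(0, 3, 5, 0, 4, 0, 0, 0, 0) == 3 * 16 + 5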
38
Inner Product Micro-code
i_loop = LoopContext(start_count=0, end_count=15, increment=1, II=7)
A_ri = AddressContext(port=inq.a_port, loop0=row_loop, rowsize=16,
                      loop1=i_loop, base=0)
B_ic = AddressContext(port=inq.b_port, loop0=i_loop, rowsize=16,
                      loop1=Constant, base=256)
for i in LOOP(i_loop):
    t0 = LOAD(fpu0.a_reg, A_ri)
    for k in range(0, 7):                    # Will be unrolled 7x
        AT(t0 + k): t1 = LOAD(fpu0.b_reg, B_ic, loop1_constant=k)
        AT(t1):     t2 = fpu0.mult(fpu0.a_reg, fpu0.b_reg)
        AT(t2):     t3 = TRANSFER(fpu1.b_reg, fpu0)
        AT(t3):     fpu1.add(fpu1, fpu1.b_reg)
39
Loop Scheduling
40
Unroll and Software Pipeline
41
Modulo Scheduling
42
Modulo Scheduling - Problem
(Figure: overlapped iterations (i, j), (i+1, j), (i+2, j), (i+3, j) in flight at once)
43
Traditional Solution
  • Generate multiple copies of address calculation
    instructions
  • Use register rotation to fix dependences

44
Traditional Solution
  • Generate multiple copies of address calculation
    instructions
  • Use register rotation to fix dependences
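
A minimal Python model of register rotation in general (an IA-64-style rotating register file as an illustration, not the thesis hardware): the same architectural register name maps to a different physical register in each overlapped iteration, so in-flight iterations do not clobber each other's values.

class RotatingRegFile:
    # physical register = (rotation base + architectural name) mod size
    def __init__(self, size):
        self.phys = [None] * size
        self.base = 0

    def rotate(self):
        # advancing the loop one iteration shifts every architectural name by one
        self.base = (self.base - 1) % len(self.phys)

    def write(self, name, value):
        self.phys[(self.base + name) % len(self.phys)] = value

    def read(self, name):
        return self.phys[(self.base + name) % len(self.phys)]

# The value iteration i writes to r0 is found by the next overlapped
# iteration under the name r1 after one rotation.
rf = RotatingRegFile(8)
rf.write(0, "address computed in iteration i")
rf.rotate()
assert rf.read(1) == "address computed in iteration i"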

45
Array Variable Renaming
(Figure: array references renamed with tags tag0, tag1, tag2, tag3)
46
Array Variable Renaming
47
Array Variable Renaming
48
Experimental Method
  • Measure processor power on
  • 2.4 GHz Pentium 4, 0.13u process
  • 400 MHz XScale, 0.18u process
  • Perception Processor
  • 1 GHz, 0.13u process (Berkeley Predictive Tech
    Model)
  • Verilog, MCL HDLs
  • Synthesized using Synopsys Design Compiler
  • Fanout based heuristic wire loads
  • Spice (Nanosim) simulation yields current
    waveform
  • Numerical integration to calculate energy
  • ASICs in 0.25u process
  • Normalize 0.18u, 0.25u energy and delay numbers

49
Benchmarks
  • Visual feature recognition
    • Erode, Dilate: image segmentation operators
    • Fleshtone: NCC flesh tone detector
    • Viola, Rowley: face detectors
  • Speech recognition
    • HMM: 5-state Hidden Markov Model
    • GAU: 39-element, 8-mixture Gaussian (see the sketch after this list)
  • DSP
    • FFT: 128-point, complex-to-complex, floating point
    • FIR: 32-tap, integer
  • Encryption
    • Rijndael: 128-bit key, 576-byte packets
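
A hedged Python sketch of the likely shape of the GAU kernel (assuming Sphinx-style diagonal-covariance mixtures scored in the log domain; the function and variable names are illustrative):

def gaussian_mixture_score(x, means, inv_vars, log_weights):
    # x: 39-element feature vector; 8 diagonal-covariance mixture components
    best = float("-inf")
    for m in range(8):
        score = log_weights[m]
        for d in range(39):
            diff = x[d] - means[m][d]
            score -= diff * diff * inv_vars[m][d]
        if score > best:            # Sphinx log-adds the components;
            best = score            # taking the max keeps the sketch short
    return best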

50
Results IPC
Mean IPC: 3.3x that of the R14K
51
Results Throughput
Mean throughput: 1.75x the Pentium 4, 0.41x the ASIC
52
Results Energy
Mean energy/packet: 7.4% of the XScale, 5x that of the ASIC
53
Results Clock Gating Synergy
Mean power savings: 39.5%
54
Results Energy Delay Product
Mean EDP: 159x better than the XScale, 1/12 as good as the ASIC
55
The Cost of Generality PP
(Block diagram: Intel XScale Processor, Coprocessor Interface, Memory Controller, Input SRAMs, Output SRAM, Custom Accelerator, Scratch SRAMs, DRAM Interface)
56
Results Energy of PP
Mean energy/packet: 18.2% of the XScale, 12.4x that of the ASIC
57
Results Energy Delay Product of PP
Mean EDP: 64x better than the XScale, 1/30 as good as the ASIC
58
Results Summary
  • 41% of the ASIC's performance
  • But programmable!
  • 1.75x the Pentium 4's throughput
  • But 7.4% of the energy of an XScale!

59
Related Work
  • Johnny Pihl's PDF coprocessor
  • Anantharaman and Bisiani
  • Beam search optimization for the CMU recognizer
  • SPERT, MultiSPERT, UC Berkeley
  • Corporaal et al.'s MOVE processor
  • Transport triggered architecture
  • Vector chaining (Cray-1)
  • MIT RAW machine (Agarwal)
  • Stanford Imagine (Dally)
  • Bit-reversed addressing modes in DSPs

60
Contributions
  • Programmable architecture for perception and
    stream computations
  • Energy efficient, custom flows w/o register files
  • Drastic reductions in power while simultaneously
    improving ILP
  • Pattern oriented loop accelerators for improved
    data delivery and throughput
  • Array variable renaming generalizes register
    rotation
  • Compiler directed data-flow generalizes vector
    chaining
  • Rapid semi-automated generation of application
    specific processors
  • Makes real-time low-power perception possible!

61
Future Work
  • Loop pattern accelerators for more scheduling regimes and data structures
  • Programming language support
  • Automated architecture exploration
  • Generic stream processors
    • Architectures for list comprehension, map(), reduce(), filter() in h/w?
    • e.g. B = [(K1*i, K2*i*i) for i in A if i % 2 != 0] (see the sketch below)
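
For concreteness, here is the comprehension from the slide next to the equivalent map()/filter() form that such a generic stream processor would also have to cover (K1, K2 and A are illustrative values):

K1, K2 = 3, 5                    # illustrative constants
A = range(10)                    # illustrative input stream
B = [(K1 * i, K2 * i * i) for i in A if i % 2 != 0]

B_alt = list(map(lambda i: (K1 * i, K2 * i * i),
                 filter(lambda i: i % 2 != 0, A)))
assert B == B_alt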

62
Thanks! Questions ?
63
Future Work
64
Flesh Toning
65
Image Segmentation
  • Erosion operator (see the sketch after this list)
    • 3 x 3 matrix
    • Remove a pixel unless all of its neighbors are set
    • Removes false connections between objects
  • Dilation operator
    • 5 x 5 matrix
    • Set a pixel if any neighbor is set
    • Smooths out and fills holes in objects
  • Connected components
    • Cut the image into rectangles
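
A minimal Python sketch of the two morphological operators described above, on a 0/1 image stored as a list of lists (illustrative, not the application's implementation; border pixels are left unset for brevity):

def erode3x3(img):
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            # keep a pixel only if every pixel in its 3x3 neighborhood is set
            out[y][x] = 1 if all(img[y + dy][x + dx]
                                 for dy in (-1, 0, 1)
                                 for dx in (-1, 0, 1)) else 0
    return out

def dilate5x5(img):
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(2, h - 2):
        for x in range(2, w - 2):
            # set a pixel if any pixel in its 5x5 neighborhood is set
            out[y][x] = 1 if any(img[y + dy][x + dx]
                                 for dy in range(-2, 3)
                                 for dx in range(-2, 3)) else 0
    return out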

66
Rowley Detector
30 x 30 window
  • Neural network based
  • Specialized neurons for horizontal and vertical
    strips
  • Multiple independent networks for accuracy
  • Typically 100 neurons, 100-150 inputs each
  • Henry Rowley's implementation, provided by CMU


Face or not face?
67
Viola and Jones Detector
30 x 30 window
Feature
  • Feature/wavelet based
  • AdaBoost boosting algorithm combines weak heuristics into stronger ones
  • Feature: sum/difference of rectangles
  • 100 features
  • Integral image representation (see the sketch below)
  • Our implementation is based on the published algorithm
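
A minimal Python sketch of the integral image representation: after one pass over the window, the sum of any rectangle, and hence any sum/difference-of-rectangles feature, costs four table lookups (illustrative code, not the thesis implementation):

def integral_image(img):                      # img: list of rows of pixel values
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii

def rect_sum(ii, x0, y0, x1, y1):             # top-left inclusive, bottom-right exclusive
    return ii[y1][x1] - ii[y0][x1] - ii[y1][x0] + ii[y0][x0]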

68
Face Detection Example
69
Eigenfaces Face Recognizer
  • Known faces are stored in a face space representation
  • The test image is projected into face space and its distance from each known face is computed (see the sketch below)
  • The closest distance gives the identity of the person
  • Matrix multiply and transpose operations, eigenvalues
  • Eye coordinates are provided by the neural net
  • Original algorithm by Pentland, MIT
  • Re-implemented by researchers at Colorado State University
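
A hedged sketch of the matching step described above, written as generic eigenfaces linear algebra with numpy (illustrative names, not the Colorado State code):

import numpy as np

def identify(test_image, mean_face, eigenfaces, known_projections, names):
    # project the eye-aligned, flattened test image into face space
    coords = eigenfaces @ (test_image - mean_face)
    # the closest stored projection gives the identity
    distances = [np.linalg.norm(coords - p) for p in known_projections]
    return names[int(np.argmin(distances))]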

70
L1 Cache Hit Rate - Explanation
  • 320 x 200 color image: approximately 180 KB
  • Grayscale version: 64 KB
  • Only flesh toning touches the color image
  • One pixel at a time
  • Detectors work at the 30 x 30 scale
  • Viola: 5.2 KB of tables and image rows
  • Rowley: approx. 80 KB per neural net, but accessed in streaming mode

71
A Brief Introduction to Speech Recognition
72
Sphinx 3 Profile