
The Perception Processor

Binu Mathew (Advisor: Al Davis)

What is Perception Processing?

- Ubiquitous computing needs natural human interfaces
- Processor support for perceptual applications
  - Gesture recognition
  - Object detection, recognition, tracking
  - Speech recognition
  - Speaker identification
- Applications
  - Multi-modal human friendly interfaces
  - Intelligent digital assistants
  - Robotics, unmanned vehicles
  - Perception prosthetics

The Problem with Perception Processing

- Too slow, too much power for the embedded space!
  - 2.4 GHz Pentium 4: 60 Watts
  - 400 MHz XScale: 800 mW
  - 10x or more difference in performance
- Inadequate memory bandwidth
  - Sphinx requires 1.2 GB/s of memory bandwidth
  - XScale delivers 64 MB/s (1/19th of what's needed)
- Characterize the applications to find the problem
- Derive an acceleration architecture
- The history of FPUs is an analogy

High Level Architecture

(Block diagram: Processor, Coprocessor Interface, Memory Controller, Input SRAMs, Output SRAM, Custom Accelerator, Scratch SRAMs, DRAM Interface)

Thesis Statement

It is possible to design programmable processors that can handle sophisticated perception workloads in real-time at power budgets suitable for embedded devices.

The FaceRec Application

FaceRec In Action

Rob Evans

Application Structure

(Pipeline: Image -> Flesh tone -> Segment Image -> Viola Jones Face Detector / Rowley Face Detector -> Neural Net Eye Locator -> Eigenfaces Face Recognizer -> Identity, Coordinates)

- Flesh toning: Soriano et al., Bertran et al.
- Segmentation: textbook approach
- Rowley detector, voter: Henry Rowley, CMU
- Viola Jones detector: published algorithm, Carbonetto, UBC
- Eigenfaces: re-implementation by Colorado State University

FaceRec Characterization

- ML-RSIM out-of-order processor simulator
- SPARC V8 ISA, unmodified SunOS binaries

Out-of-order processor similar to 2 GHz Intel Pentium 4:
- 1-4 ALUs, 1-4 FPUs
- Max 4 issue, max 4 graduations/cycle
- 16 KB 2-way L1 I-Cache
- 16-64 KB 2-way L1 D-Cache
- 256 KB-2 MB 2-way L2 Cache
- 600 MHz, 64-bit DDR memory interface

In-order processor similar to 400 MHz Intel XScale:
- 1 ALU, 1 FPU
- Max 1 issue, max 1 graduation/cycle
- 32 KB 32-way L1 I-Cache
- 32 KB 32-way L1 D-Cache
- No L2 Cache
- 100 MHz, 32-bit SDR memory interface

Application Profile

Memory System Characteristics: L1 D-Cache

Memory System Characteristics: L2 Cache

IPC

Why is IPC low?

Neural Network Evaluation

Sum = Σ (i = 0..n) Weight_i × ImageInput_i;  Result = Tanh(Sum)

- Dependences, e.g. no single-cycle floating point accumulate
- Indirect accesses
- Several array accesses per operator
- Load/store ports saturate
- Need architectures that can move data efficiently
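The bullets above can be made concrete with a minimal sketch of one neuron's inner loop. The function name, data layout, and index table are illustrative assumptions, not the thesis's actual code; the point is the serial accumulate and the indirect load per term.

```python
import math

def evaluate_neuron(weights, image, input_idx):
    # Hypothetical sketch of one neuron's evaluation: each term needs an
    # indirect load (image[input_idx[i]]), a multiply, and an accumulate
    # that depends on the previous iteration's sum.
    total = 0.0
    for i in range(len(weights)):
        pixel = image[input_idx[i]]    # indirect access: extra load per operator
        total += weights[i] * pixel    # serial chain: no single-cycle FP accumulate
    return math.tanh(total)
```

On a conventional pipeline, the `total` dependence chain and the two loads per multiply are what keep IPC low.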

Real Time Performance

Example App: CMU Sphinx 3.2

- Speech recognition engine
- Speaker and language independent
- Acoustic model: triphone based, continuous
- Hidden Markov Model (HMM) based
- Grammar: trigram with back-off
- Open source HUB4 speech model
  - Broadcast news model (ABC News, NPR, etc.)
  - 64,000 word vocabulary

CMU Sphinx 3.2 Profile

L1 D-cache Miss Rate

L2 Cache Miss Rate

DRAM Bandwidth

IPC

High Level Architecture

(Block diagram: Processor, Coprocessor Interface, Memory Controller, Input SRAMs, Output SRAM, Custom Accelerator, Scratch SRAMs, DRAM Interface)

ASIC Accelerator Design Matrix Multiply

def matrix_multiply(A, B, C):   # C is the result matrix
    for i in range(0, 16):
        for j in range(0, 16):
            C[i][j] = inner_product(A, B, i, j)

def inner_product(A, B, row, col):
    sum = 0.0
    for i in range(0, 16):
        sum = sum + A[row][i] * B[i][col]
    return sum

ASIC Accelerator Design Matrix Multiply

Control Pattern


ASIC Accelerator Design Matrix Multiply

Access Pattern


ASIC Accelerator Design Matrix Multiply

Compute Pattern



ASIC Accelerator Design Matrix Multiply

7 cycle latency


ASIC Accelerator Design Matrix Multiply

Interleave > 7 inner products; complicates address generation

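In software terms, interleaving means keeping at least seven independent accumulations in flight so a new multiply-accumulate can issue every cycle despite the 7-cycle latency. A minimal sketch, assuming a simple multiple-accumulator view (the function name and round-robin schedule are illustrative, not the thesis's hardware schedule):

```python
def interleaved_inner_products(A, B, row, cols):
    # One independent accumulator per inner product; issuing the
    # multiply-accumulates round-robin across the sums hides the
    # multi-cycle FP add latency, since no sum is reused for 7 steps.
    sums = [0.0] * len(cols)
    for i in range(16):
        for s, col in enumerate(cols):   # round-robin over >= 7 independent sums
            sums[s] += A[row][i] * B[i][col]
    return sums
```

The cost, as the slide notes, is more complex address generation: each "cycle" now touches a different column of B.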

How can we generalize?

- Decompose the loop into:
  - Control pattern
  - Access pattern
  - Compute pattern
- Programmable h/w acceleration for each pattern
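To make the decomposition concrete, here is a hedged sketch that separates the matrix-multiply loop into the three patterns. The function names and the flat row-major layout are illustrative assumptions, not the thesis's notation:

```python
def control_pattern():
    # Control pattern: the pure loop structure (nesting and trip counts).
    for i in range(16):
        for j in range(16):
            yield i, j

def access_pattern(i, j, k):
    # Access pattern: which addresses each iteration touches, assuming a
    # flat row-major layout (offsets of A[i][k] and B[k][j]).
    return 16 * i + k, 16 * k + j

def compute_pattern(acc, a, b):
    # Compute pattern: the multiply-accumulate applied to delivered operands.
    return acc + a * b
```

In the perception processor each pattern maps to dedicated hardware: the loop unit runs the control pattern, the address generators run the access pattern, and the function units run the compute pattern.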

The Perception Processor Architecture Family

Perception Processor Pipeline

Function Unit Organization

Interconnect

Loop Unit

Address Generator

A[((i + k1) << k2) + k3 + ((j + k4) << k5) + k6]    A[B[i]]
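The address generator slide's expression can be read as a strided form plus an indirect form. A minimal sketch, with the constants k1..k6 treated as per-access-pattern parameters; this is a reconstruction from the slide, not verbatim hardware behavior:

```python
def gen_address(i, j, base=0, k1=0, k2=0, k3=0, k4=0, k5=0, k6=0):
    # Strided address: base + ((i + k1) << k2) + k3 + ((j + k4) << k5) + k6.
    # The k* constants are per-pattern parameters (a reconstruction).
    return base + (((i + k1) << k2) + k3) + (((j + k4) << k5) + k6)

def gen_indirect(base, B, i):
    # Indirect form A[B[i]]: one level of table lookup feeds the address.
    return base + B[i]
```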

Inner Product Micro-code

i_loop = LoopContext(start_count=0, end_count=15, increment=1, II=7)
A_ri = AddressContext(port=inq.a_port, loop0=row_loop, rowsize=16,
                      loop1=i_loop, base=0)
B_ic = AddressContext(port=inq.b_port, loop0=i_loop, rowsize=16,
                      loop1=Constant, base=256)
for i in LOOP(i_loop):
    t0 = LOAD(fpu0.a_reg, A_ri)
    for k in range(0, 7):   # Will be unrolled 7x
        AT(t0 + k): t1 = LOAD(fpu0.b_reg, B_ic, loop1_constant=k)
        AT(t1):     t2 = fpu0.mult(fpu0.a_reg, fpu0.b_reg)
        AT(t2):     t3 = TRANSFER(fpu1.b_reg, fpu0)
        AT(t3):     fpu1.add(fpu1, fpu1.b_reg)

Loop Scheduling

Unroll and Software Pipeline

Modulo Scheduling

Modulo Scheduling - Problem

(Overlapped iterations: (i, j), (i+1, j), (i+2, j), (i+3, j))

Traditional Solution

- Generate multiple copies of address calculation instructions
- Use register rotation to fix dependences


Array Variable Renaming

(Diagram: in-flight values tagged tag0, tag1, tag2, tag3)


Experimental Method

- Measure processor power on:
  - 2.4 GHz Pentium 4, 0.13u process
  - 400 MHz XScale, 0.18u process
- Perception Processor
  - 1 GHz, 0.13u process (Berkeley Predictive Tech Model)
  - Verilog, MCL HDLs
  - Synthesized using Synopsys Design Compiler
  - Fanout based heuristic wire loads
  - Spice (Nanosim) simulation yields current waveform
  - Numerical integration to calculate energy
- ASICs in 0.25u process
- Normalize 0.18u, 0.25u energy and delay numbers

Benchmarks

- Visual feature recognition
  - Erode, Dilate: image segmentation operators
  - Fleshtone: NCC flesh tone detector
  - Viola, Rowley: face detectors
- Speech recognition
  - HMM: 5 state Hidden Markov Model
  - GAU: 39 element, 8 mixture Gaussian
- DSP
  - FFT: 128 point, complex to complex, floating point
  - FIR: 32 tap, integer
- Encryption
  - Rijndael: 128 bit key, 576 byte packets

Results: IPC

Mean IPC: 3.3x that of the R14K

Results: Throughput

Mean throughput: 1.75x Pentium 4, 0.41x ASIC

Results: Energy

Mean energy/packet: 7.4% of XScale's, 5x the ASIC's

Results: Clock Gating Synergy

Mean power savings: 39.5%

Results: Energy Delay Product

Mean EDP: 159x better than XScale's, 1/12th of the ASIC's improvement

The Cost of Generality: PP

(Block diagram: Intel XScale Processor, Coprocessor Interface, Memory Controller, Input SRAMs, Output SRAM, Custom Accelerator, Scratch SRAMs, DRAM Interface)

Results: Energy of PP

Mean energy/packet: 18.2% of XScale's, 12.4x the ASIC's

Results: Energy Delay Product of PP

Mean EDP: 64x better than XScale's, 1/30th of the ASIC's improvement

Results Summary

- 41% of the ASIC's performance
- But programmable!
- 1.75 times the Pentium 4's throughput
- But 7.4% of the energy of an XScale!

Related Work

- Johnny Pihl's PDF coprocessor
- Anantharaman and Bisiani
  - Beam search optimization for the CMU recognizer
- SPERT, MultiSPERT, UC Berkeley
- Corporaal et al.'s MOVE processor
  - Transport triggered architecture
- Vector chaining (Cray-1)
- MIT RAW machine (Agarwal)
- Stanford Imagine (Dally)
- Bit-reversed addressing modes in DSPs

Contributions

- Programmable architecture for perception and stream computations
- Energy efficient, custom flows w/o register files
- Drastic reductions in power while simultaneously improving ILP
- Pattern oriented loop accelerators for improved data delivery and throughput
- Array variable renaming generalizes register rotation
- Compiler directed data-flow generalizes vector chaining
- Rapid semi-automated generation of application specific processors
- Makes real-time low-power perception possible!

Future Work

- Loop pattern accelerators for more scheduling regimes and data structures
- Programming language support
- Automated architecture exploration
- Generic stream processors
- Architectures for list comprehension, map(), reduce(), filter() in h/w?
  - e.g. B = [(K1*i, K2*i*i) for i in A if i % 2 != 0]

Thanks! Questions?


Flesh Toning

Image Segmentation

- Erosion operator
  - 3 x 3 matrix
  - Remove a pixel unless all its neighbors are set
  - Removes false connections between objects
- Dilation operator
  - 5 x 5 matrix
  - Set a pixel if any neighbor is set
  - Smooths out, fills holes in objects
- Connected components
  - Cut the image into rectangles
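The two morphological operators above can be sketched on a set-of-pixels representation. The set encoding and the lack of image-boundary handling are simplifying assumptions for illustration:

```python
def erode3x3(img):
    # Erosion: keep a pixel only if its full 3x3 neighborhood is set.
    return {(x, y) for (x, y) in img
            if all((x + dx, y + dy) in img
                   for dx in (-1, 0, 1) for dy in (-1, 0, 1))}

def dilate5x5(img):
    # Dilation: set every pixel within the 5x5 neighborhood of a set pixel.
    return {(x + dx, y + dy) for (x, y) in img
            for dx in range(-2, 3) for dy in range(-2, 3)}
```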

Rowley Detector

30 x 30 window

- Neural network based
- Specialized neurons for horizontal and vertical strips
- Multiple independent networks for accuracy
- Typically 100 neurons, 100-150 inputs each
- Henry Rowley's implementation provided by CMU

Face or not face?

Viola and Jones Detector

30 x 30 window

- Feature/wavelet based
- AdaBoost boosting algorithm combines weak heuristics to make stronger ones
- Feature: sum/difference of rectangles
- 100 features
- Integral image representation
- Our implementation based on the published algorithm
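The integral image representation mentioned above is what lets each rectangle feature be summed in constant time. A minimal sketch of the standard construction (the exact layout in the thesis's implementation may differ):

```python
def integral_image(img):
    # ii[y][x] = sum of img over the rectangle [0..x] x [0..y].
    h, w = len(img), len(img[0])
    ii = [[0] * w for _ in range(h)]
    for y in range(h):
        row = 0
        for x in range(w):
            row += img[y][x]
            ii[y][x] = row + (ii[y - 1][x] if y > 0 else 0)
    return ii

def rect_sum(ii, x0, y0, x1, y1):
    # Sum of img over the inclusive rectangle (x0, y0)-(x1, y1): 4 lookups.
    total = ii[y1][x1]
    if x0 > 0: total -= ii[y1][x0 - 1]
    if y0 > 0: total -= ii[y0 - 1][x1]
    if x0 > 0 and y0 > 0: total += ii[y0 - 1][x0 - 1]
    return total
```

With this table, a feature defined as a sum/difference of rectangles costs a handful of loads and adds regardless of rectangle size.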

Face Detection Example

Eigenfaces Face Recognizer

- Known faces are stored in a face space representation
- The test image is projected into face space and its distance from the known faces computed
- The closest distance gives the identity of the person
- Matrix multiply and transpose operations, eigenvalues
- Eye co-ordinates provided by the neural net
- Original algorithm by Pentland, MIT
- Re-implemented by researchers at Colorado State University
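The projection-and-nearest-neighbor steps above can be sketched as follows. Mean subtraction, eigenface dot products, and Euclidean distance are standard eigenfaces machinery; the names and flat vector layout here are illustrative assumptions:

```python
import math

def project(face, mean, eigenfaces):
    # Subtract the mean face, then take dot products with each eigenface
    # to get the image's coordinates in face space.
    centered = [p - m for p, m in zip(face, mean)]
    return [sum(c * e for c, e in zip(centered, ef)) for ef in eigenfaces]

def identify(face, mean, eigenfaces, known):
    # Nearest known face (by Euclidean distance in face space) gives identity.
    coords = project(face, mean, eigenfaces)
    def dist(entry):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(coords, entry[1])))
    return min(known, key=dist)[0]
```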

L1 Cache Hit Rate - Explanation

- 320 x 200 color image: approximately 180 KB
- Gray scale version: 64 KB
- Only flesh toning touches the color image
  - One pixel at a time
- Detectors work at 30 x 30 scale
  - Viola: 5.2 KB of tables and image rows
  - Rowley: approx. 80 KB per neural net, but operates in streaming mode

A Brief Introduction to Speech Recognition

Sphinx 3 Profile