Predicting Performance Potential of Modern DSPs
Transcript and Presenter's Notes


1
Predicting Performance Potential of Modern DSPs
  • Naji S. Ghazal,
  • Richard Newton, Jan Rabaey
  • Department of Electrical Engineering and Computer
    Sciences
  • University of California at Berkeley
  • http://www-cad.eecs.berkeley.edu/~naji/Research/

2
Current Challenges in DSP Design
  • Too Many Choices in DSP Architecture
  • Diverging architecture styles [Bier98]
  • New User-Configurable Instruction Sets (e.g. CARMEL DSP)
  • Insufficient High-level Development Support
  • DSP compilers still cannot exploit architectures' crucial optimizations automatically
  • Challenges:
  • Statically unknown control flow
  • Identifying supported arithmetic contexts
  • Identifying memory addressing sequences (e.g. streams)
  • Growing need for tools to explore and predict the true potential of DSP architectural choices for a particular application

Ever more crucial, and yet getting harder
3
Key Opportunity for Improvement
  • Widely used practices and syntactic styles can be leveraged
  • Most code is in well-behaved Loops
  • Stream Accesses are in (or convertible to) the format
    Array[constant-coefficient * Loop-Index + constant-offset]
  • Auxiliary Fixed-point Arithmetic Operations are usually identifiable, e.g.
    x = (((a << SCALE) * b) + (1 << RND)) >> TRUNC
    x += a * b (MAC)
  • Prediction of run-time behavior with aggressive exposure of potential
    optimization opportunities is possible (illustrated below)
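A minimal C sketch of a kernel exhibiting both patterns above (the function, array names, and the SCALE/RND/TRUNC values are illustrative, not taken from the presentation):

    #define SCALE 1
    #define RND   14
    #define TRUNC 15

    /* Stream access: y[2*i + 1] matches Array[const-coeff * Loop-Index + const-offset].
       Fixed-point arithmetic: scale, multiply, round, truncate; recognizable as a MAC
       when the target supports the corresponding arithmetic modes.                    */
    void kernel(const short *a, const short *b, short *y, int N)
    {
        for (int i = 0; i < N; i++) {
            int x = (((a[i] << SCALE) * b[i]) + (1 << RND)) >> TRUNC;
            y[2*i + 1] = (short)x;
        }
    }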

4
Approach: Retargetable Estimation

Estimation flow:
  Application (Generic C Code) -> SUIF Compiler Front-end -> Intermediate Format (SUIF)
  -> Translation & Annotation -> Architecture-specific SUIF -> Cycle-level Estimation
  -> Estimate and Profile, annotated with chosen optimizations and ranked bottlenecks
A Profiler and a Parameterized Architecture Model feed the flow.

Translation & Annotation uses:
  • Translation of SUIF instructions -> Architecture-Compatible instrs.
  • Optimized Computation Patterns
  • High-level Optimization Features

Cycle-level Estimation uses:
  • Address Generation Conditions
  • Functional Unit Usage/Ordering Rules
  • Instruction Set Attributes
5
Parameterized Architecture Model
  • Special Optimization Features
  • Optimized Special Operations
  • Supported Arithmetic Modes (Scaling, Rounding, Truncating, Saturation)
  • Multi-operation Hyper-Patterns
  • Arithmetic (e.g. dual-MAC, complex-MUL)
  • Packed (stream-related) Operations
  • Memory Pack/Unpack Support
  • Memory Addressing Support
  • Auto-update Mode Ranges
  • Hardware Circular Addressing (number and size of circular buffers)
  • Control Flow Optimization Support
  • Simple If-Conversion
  • Looping Support
  • Loop Vectorization/Packing
  • Functional Unit Composition
  • Functional Unit (FU) Types
  • FU Usage Limits
  • FU Latencies, Throughputs
  • FU-to-FU Constraints
  • Instruction Set
  • List of Instruction Types (ITypes); default types: Add/Sub, ALU (gen.), Mul, MAC, Load, Store, Branch
  • Max # of Instructions per Cycle
  • Instructions' FU Usage
  • Operand Handling Rules
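One way to picture this parameterized model is as a plain data structure the estimator reads in. The C sketch below is hypothetical (field names, array sizes, and the flag set are illustrative, not the tool's actual format), covering the main parameter groups listed above:

    typedef struct {
        char name[16];      /* FU type, e.g. "ALU", "MAC"           */
        int  usage_limit;   /* max simultaneous uses per cycle      */
        int  latency;       /* cycles until the result is available */
        int  throughput;    /* minimum issue interval in cycles     */
    } FuncUnit;

    typedef struct {
        char name[16];      /* IType, e.g. "Add/Sub", "MAC", "Load" */
        int  fu_index;      /* which functional unit it occupies    */
        int  max_operands;  /* operand handling rule                */
    } InstrType;

    typedef struct {
        FuncUnit  fu[8];        /* functional unit composition        */
        InstrType itype[16];    /* instruction set                    */
        int max_issue;          /* max # of instructions per cycle    */
        int dual_mac;           /* multi-operation hyper-pattern flag */
        int circ_buffers;       /* # of hardware circular buffers     */
        int vector_degree;      /* loop vectorization/packing degree  */
    } ArchModel;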

6
Targeting the Model -- Example
  • Target Processor: LSI401, a 4-way Superscalar DSP
  • Salient Features: Double-Loads, Dual-MAC, Packed-Add/Sub
  • Estimation Example:
  • for (i = 0; i < N; i++) {
  •     x += (*ptr * b[i]) >> 16;
  •     y[i] = *ptr++;
  • }
  • Multi-Operation Patterns
  • Dual-MAC: x = x +/- a*b +/- c*d
  • Arithmetic Support
  • Truncation ON
  • Loop Vectorization
  • Degree 2
  • Vectorization ITypes
  • Add/Sub, Mul, MAC, Ld, Str

-> Loop is 2x Vectorizable: Ld2, Ld2, Dual-MAC, Str2 (loop count N -> N/2)
-> Per scalar iteration: Ld, Ld, MAC, Str
-> Truncation (>>16) recognized as a supported arithmetic mode
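To make the mapping concrete, the 2x-vectorized schedule credited to this loop corresponds roughly to the hand-unrolled C below (a sketch under the reconstruction of the example loop above; declarations are illustrative and no vendor intrinsics are used):

    /* Each pass covers two iterations and maps onto roughly:
       Ld2 (ptr[0], ptr[1]), Ld2 (b[i], b[i+1]), Dual-MAC, Str2.   */
    void kernel_vec2(const short *ptr, const short *b, short *y, int N)
    {
        int x = 0;
        for (int i = 0; i < N; i += 2) {
            x += (ptr[0] * b[i])   >> 16;   /* first half of the dual-MAC  */
            x += (ptr[1] * b[i+1]) >> 16;   /* second half of the dual-MAC */
            y[i]   = ptr[0];                /* packed store (Str2)         */
            y[i+1] = ptr[1];
            ptr += 2;
        }
    }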
7
Results: Comparison with Optimizing Compiler

[Charts: distribution of percent error in predicting hand-optimized performance of DSP kernels]
  • DSP 1 (TI C6201): Ave. error for Estimator 4.2%, Ave. error for Compiler 220%
  • DSP 2 (LSI401): Ave. error for Estimator 3.5%, Ave. error for Compiler 310%

  • Latest compilers fall short of hand-optimized performance substantially, even for DSP kernels

8
Why Does Retargetable Estimation Work Well?
  • Machine Description Method has sufficient detail
  • Captures main Instructions, Functional Unit constraints, pipelining effects
  • Uses well-tested abstractions of features in today's DSPs
  • Estimation encompasses optimization of the features with the heaviest impact on performance
  • Includes crucial optimizing compiler technology
  • Targets common DSP styles/semantics and characteristics
  • Reaches DSP-oriented, loop-level and computation-context-level optimizations, for their effect on performance
9
Conclusions
With Predictive Analysis and an Architecture Model capturing and profiling the differentiating features of DSPs:
  • Quick quantitative evaluation of architectural tradeoffs
  • Guidance to shorten the design development cycle
  • Quick comparison of different versions of an algorithm on a given architecture
  • What differentiates this approach:
  • No need for generating assembly code and simulating
  • Rapid Retargetability
  • Can be applied to other metrics, e.g. power, memory usage

10
Architecture Selection: Ever More Challenging
Today's tools for this task face many challenges:
  • Difficult to retarget
  • Still lagging optimization technology
  • Low reliability
  • No development support
  • Limited to supported DSPs
  • Expensive!

11
Background The DSP Domain
  • DSP Applications have attributes different from GP applications, beyond generic Instruction-Level Parallelism:
  • High Computation Regularity
  • Predictable Data Access Patterns (e.g. sequential, circular access)
  • High, Well-structured Data Parallelism
  • DSP Processors (old and new) leverage these attributes, using:
  • Special Complex/Parallel Instructions
  • Variable Arithmetic Mode Support
  • Specialized Memory addressing and control-flow support (an example appears below)
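A generic C illustration (not from the presentation) of the access patterns these features target: a circularly addressed delay line combined with a sequential coefficient stream in a multiply-accumulate loop. On a DSP with hardware circular addressing and auto-update modes, the modulo index arithmetic and pointer updates below come essentially for free.

    /* Hypothetical FIR-style kernel: circular delay line + sequential coefficients. */
    int fir_circular(const short *coef, short *delay, int taps, int *pos, short in)
    {
        int acc = 0;
        delay[*pos] = in;                        /* insert newest sample      */
        for (int k = 0; k < taps; k++) {
            int idx = (*pos + k) % taps;         /* circular addressing       */
            acc += coef[k] * delay[idx];         /* MAC                       */
        }
        *pos = (*pos + taps - 1) % taps;         /* step write index backward */
        return acc >> 15;                        /* fixed-point scaling       */
    }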

12
Example Model for LSI401 (formerly ZSP16401)
  • LSI401: 4-way Superscalar RISC-based DSP with DSP-oriented features
  • Optimized Operations
  • - Multi-operation Patterns (SUIF Expression trees)
  • MAC: x += ? * ?  (? = any operand)
  • Load2: pair of x[i], x[i+/-1] in the same basic block
  • MAC2: x += x[i]*y[i] +/- x[i+/-1]*y[i+/-1]
  • Add2/Sub2
  • - Special Arithmetic modes: Saturation, Rounding
  • Memory Addressing Features
  • - Address Generation Cost: 1 cycle if not a register, or if not an array access with offset in -8..7
  • - Hardware Circular Addressing: 2 circular buffers
  • Control Flow Optimization Features
  • - Looping Support: FOR-loop cost 0 (2 counters)
  • - Loop Vectorization/Packing (list of instructions eligible for 2x parallelization: Add, Sub, Ld, Str, MAC)

FU Types (6)    Limit   Latency   Throughput
ALU               2        1          1
MAC               1        1          1
Prefetch-Lds      2        0          1
Load              1        1          1
Store             1        1          1
Branch            1        1          1

FU-to-FU Latency: Str->Ld 1 cycle min.
Instruction Types (12): Ld, Pref-Ld, Str, Branch, Add, Sub, ALU (gen.), Mul, MAC, Add2, Sub2, MAC2
Max Instructions issued in Parallel: 4
Instrs' FU Usage: 16- and 32-bit 12x6 tables
Operand Handling Rules: Padd/MACs allow 3 operands per instr. (the rest allow 2, default)
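Continuing the hypothetical ArchModel sketch from the earlier slide, the LSI401 functional-unit table and a few of the flags above could be written down as an instance like this (field names remain illustrative; the values come from the table):

    static const ArchModel lsi401 = {
        .fu = {
            { "ALU",         2, 1, 1 },
            { "MAC",         1, 1, 1 },
            { "Prefetch-Ld", 2, 0, 1 },
            { "Load",        1, 1, 1 },
            { "Store",       1, 1, 1 },
            { "Branch",      1, 1, 1 },
        },
        .max_issue     = 4,   /* 4-way superscalar issue       */
        .dual_mac      = 1,   /* MAC2 hyper-pattern supported  */
        .circ_buffers  = 2,   /* hardware circular addressing  */
        .vector_degree = 2,   /* 2x loop vectorization/packing */
    };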
13
Assumptions in the Architecture Model
Assumptions that are not restrictive in the DSP domain can be used to simplify the model:
  • No Register Allocation Conflicts (though register casting is performed)
  • Variable lifetimes usually short
  • No Cache Misses/Extended Memory Latencies
  • High execution-time locality and data access predictability
  • Perfect Branch Prediction
  • Predictable Control Flow
  • Auto-update Address Modes (post-incr., post-decr.) available

14
Results: Estimated vs. Hand-Optimized Cycle Count
15
Determining Level of Confidence of Estimation
  • Empirical Approach: Correlation to Benchmark Results
  • Establish Benchmark sets with high confidence (targeting different application types)
  • Correlate the Application to a Benchmark Set -> determine how similar it is to the benchmarks [Potkonjak98]
  • Application Estimate's Confidence Level (Accuracy) is a function of its similarity to the Benchmarks and their Confidence Level

16
Determining Distance from a Benchmark Set
Quantitative method based on [Potkonjak98]
  • Characterize a Benchmark set numerically by measuring some run-time properties
  • He used CPI, cache hit rate, bus utilization, ALU issue rate, ...
  • Obtain averages for each property over all benchmarks
  • Characterize the new application similarly
  • Add the application to the Benchmark Set, obtain NEW averages
  • The distance of the new application from the Benchmark Set is a function of the differences between the old and new averages (see the sketch below)
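The presentation does not spell out the exact distance function, so the sketch below uses a simple normalized L1 distance over the property averages as one plausible instantiation (the function name and the normalization are assumptions):

    #include <math.h>

    /* Distance of a new application from a benchmark set: compare the set's
       property averages (CPI, cache hit rate, ...) before and after adding
       the application, and sum the relative changes.                       */
    double benchmark_distance(const double *old_avg, const double *new_avg, int nprops)
    {
        double dist = 0.0;
        for (int p = 0; p < nprops; p++) {
            double denom = fabs(old_avg[p]) > 1e-12 ? fabs(old_avg[p]) : 1.0;
            dist += fabs(new_avg[p] - old_avg[p]) / denom;   /* relative change */
        }
        return dist / nprops;
    }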

17
Architecture Selection Criteria--System Viewpoint
  • Performance and Power constraints MUST be met
  • Using Benchmark-result lookup is increasingly unreliable
  • Other Criteria:
  • Chip Cost
  • Presence of Peripherals
  • Vendor/Development Support
  • Data Formats Supported
  • These can be encapsulated in the Framework directly as known data, used to rank final candidate architectures
  • Memory Cost (can be estimated using the Retargetable Estimation method)
  • Composite Metrics should be used: Cost/Energy-Efficiency

18
Retargetable Compiler Architecture Model
  • Example: ISDL (Instruction Set Description Language) [Devadas98]
  • The language consists of:
  • Instruction Word Format
  • Storage Resources (names and sizes of Memory and Register files)
  • Instruction Set (with RTL description of each operation)
  • Constraints (grouping rules for parallel instructions)
  • Optimization Hints (e.g. Branch Prediction Hints, Delay Slot Use)
  • Emphasis on Code Generation for VLIW Embedded Processors

19
Characteristics of Conventional DSP Processor Architectures
  • Highly Non-orthogonal Data Paths
  • Restricted/specialized Parallelism
  • Specialized Support for
  • Control Flow and Addressing
  • Special, small Register Files
  • Complex Instruction Set
  • (emphasis on High Code Density)
  • High Memory Bandwidth
  • (usually at least two words/cycle)
  • Difficult to program and compile for

[Block diagram: a conventional DSP data path showing the Data Bus and Address Bus, G.P. Registers, Address Registers, MULT, ALU, Accumulator, and AGU]
20
New Trends in DSP Processor Architectures
  • Diversification
  • Enhanced Conventional DSPs (more specialized
    parallelism)
  • VLIW (deeply pipelined) / Superscalar
    (dynamically scheduled)
  • DSP-enhanced General-Purpose Processors /
    Embedded Processors
  • More Parallelism, but harder to track instruction
    behavior
  • Exploitation of Computation Locality (e.g. data
    pre-fetching)
  • Data Paths with User-configurable Extensions
    (e.g. Siemens CLIW ISA)
  • Still as much difficulty, if not more, in programming them optimally
  • Software Development Tools becoming more crucial

21
Capturing Multi-operation Hyper-Patterns
  • Patterns described as expression trees with nodes
    assigned not one value, but lists of possible
    values

[Pattern-tree diagrams: the Dual-MAC (MAC2) pattern and the SIMD pattern (PADD, PSUB) for the ZSP16401. Node value lists visible in the trees include: Operand (can be reached through a variable); Array or double Var; Array or CONST; operators +, -; array references X[0], Y[0], X[i], Y[i]; and index offsets 1, -1.]
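A minimal C sketch of how such "list-valued" pattern nodes might be represented (types and field names are hypothetical, not the tool's actual data structures). Matching would walk the SUIF expression tree and the pattern tree together, succeeding when each expression node's kind, operator, or index offset appears in the corresponding allowed list.

    typedef enum { ND_ARRAY, ND_VAR, ND_CONST, ND_OP } NodeKind;
    typedef enum { OP_ADD, OP_SUB, OP_MUL } OpKind;

    /* One node of a hyper-pattern: each slot holds a LIST of acceptable
       values rather than a single required value.                       */
    typedef struct PatNode {
        NodeKind allowed_kinds[4];   int n_kinds;    /* e.g. Array or double Var */
        OpKind   allowed_ops[4];     int n_ops;      /* e.g. + or -              */
        int      allowed_offsets[4]; int n_offsets;  /* e.g. index offsets 1, -1 */
        struct PatNode *child[2];                    /* operand sub-trees        */
    } PatNode;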