Predicting Performance Potential of Modern DSPs
Transcript and Presenter's Notes


1
Predicting Performance Potential of Modern DSPs
  • Naji S. Ghazal,
  • Richard Newton, Jan Rabaey
  • Department of Electrical Engineering and Computer
    Sciences
  • University of California at Berkeley
  • http://www-cad.eecs.berkeley.edu/~naji/Research/

2
Current Challenges in DSP Design
  • Too Many Choices in DSP Architecture
  • Diverging architecture styles [Bier98]
  • New User-Configurable Instruction Sets (e.g. CARMEL DSP)
  • Insufficient High-level Development Support
  • DSP compilers still cannot exploit architectures' crucial optimizations automatically
  • Challenges:
  • Statically unknown control flow
  • Identifying supported arithmetic contexts
  • Identifying memory addressing sequences (e.g. streams)
  • Growing need for tools to explore and predict the true potential of DSP architectural choices for a particular application

Ever more crucial, and yet getting harder
3
Key Opportunity for Improvement
  • Widely used practices and syntactic styles can be leveraged
  • Most code is in well-behaved Loops
  • Stream Accesses are in (or convertible to) the format
    Array[constant-coefficient * Loop-Index + constant-offset]
  • Auxiliary Fixed-point Arithmetic Operations are usually identifiable, e.g.
    x = (((a << SCALE) * b) + (1 << RND)) >> TRUNC
    x += a * b (MAC)
  • Prediction of run-time behavior with aggressive exposure of potential
    optimization opportunities is possible (illustrated below)
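A minimal C sketch of a kernel exhibiting both patterns above (the function, array names, and the SCALE/RND/TRUNC values are illustrative, not taken from the presentation):

    #define SCALE 1
    #define RND   14
    #define TRUNC 15

    /* Stream access: y[2*i + 1] matches Array[const-coeff * Loop-Index + const-offset].
       Fixed-point arithmetic: scale, multiply, round, truncate; recognizable as a MAC
       when the target supports the corresponding arithmetic modes.                    */
    void kernel(const short *a, const short *b, short *y, int N)
    {
        for (int i = 0; i < N; i++) {
            int x = (((a[i] << SCALE) * b[i]) + (1 << RND)) >> TRUNC;
            y[2*i + 1] = (short)x;
        }
    }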

4
Approach: Retargetable Estimation

Estimation flow:
  Application (Generic C Code) -> SUIF Compiler Front-end -> Intermediate Format (SUIF)
  -> Translation & Annotation -> Architecture-specific SUIF -> Cycle-level Estimation
  -> Estimate and Profile, annotated with chosen optimizations and ranked bottlenecks
A Profiler and a Parameterized Architecture Model feed the flow.

Translation & Annotation uses:
  • Translation of SUIF instructions -> Architecture-Compatible instrs.
  • Optimized Computation Patterns
  • High-level Optimization Features

Cycle-level Estimation uses:
  • Address Generation Conditions
  • Functional Unit Usage/Ordering Rules
  • Instruction Set Attributes
5
Parameterized Architecture Model
  • Special Optimization Features
  • Optimized Special Operations
  • Supported Arithmetic Modes (Scaling, Rounding, Truncating, Saturation)
  • Multi-operation Hyper-Patterns
  • Arithmetic (e.g. dual-MAC, complex-MUL)
  • Packed (stream-related) Operations
  • Memory Pack/Unpack Support
  • Memory Addressing Support
  • Auto-update Mode Ranges
  • Hardware Circular Addressing (number and size of circular buffers)
  • Control Flow Optimization Support
  • Simple If-Conversion
  • Looping Support
  • Loop Vectorization/Packing
  • Functional Unit Composition
  • Functional Unit (FU) Types
  • FU Usage Limits
  • FU Latencies, Throughputs
  • FU-to-FU Constraints
  • Instruction Set
  • List of Instruction Types (ITypes); default types: Add/Sub, ALU (gen.), Mul, MAC, Load, Store, Branch
  • Max # of Instructions per Cycle
  • Instructions' FU Usage
  • Operand Handling Rules
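One way to picture this parameterized model is as a plain data structure the estimator reads in. The C sketch below is hypothetical (field names, array sizes, and the flag set are illustrative, not the tool's actual format), covering the main parameter groups listed above:

    typedef struct {
        char name[16];      /* FU type, e.g. "ALU", "MAC"           */
        int  usage_limit;   /* max simultaneous uses per cycle      */
        int  latency;       /* cycles until the result is available */
        int  throughput;    /* minimum issue interval in cycles     */
    } FuncUnit;

    typedef struct {
        char name[16];      /* IType, e.g. "Add/Sub", "MAC", "Load" */
        int  fu_index;      /* which functional unit it occupies    */
        int  max_operands;  /* operand handling rule                */
    } InstrType;

    typedef struct {
        FuncUnit  fu[8];        /* functional unit composition        */
        InstrType itype[16];    /* instruction set                    */
        int max_issue;          /* max # of instructions per cycle    */
        int dual_mac;           /* multi-operation hyper-pattern flag */
        int circ_buffers;       /* # of hardware circular buffers     */
        int vector_degree;      /* loop vectorization/packing degree  */
    } ArchModel;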

6
Targeting the Model -- Example
  • Target Processor: LSI401, a 4-way Superscalar DSP
  • Salient Features: Double-Loads, Dual-MAC, Packed-Add/Sub
  • Estimation Example:
  • for (i = 0; i < N; i++) {
  •     x += (*ptr * b[i]) >> 16;
  •     y[i] = *ptr++;
  • }
  • Multi-Operation Patterns
  • Dual-MAC: x = x +/- a*b +/- c*d
  • Arithmetic Support
  • Truncation ON
  • Loop Vectorization
  • Degree 2
  • Vectorization ITypes
  • Add/Sub, Mul, MAC, Ld, Str

-> Loop is 2x Vectorizable: Ld2, Ld2, Dual-MAC, Str2 (loop count N -> N/2)
-> Per scalar iteration: Ld, Ld, MAC, Str
-> Truncation (>>16) recognized as a supported arithmetic mode
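To make the mapping concrete, the 2x-vectorized schedule credited to this loop corresponds roughly to the hand-unrolled C below (a sketch under the reconstruction of the example loop above; declarations are illustrative and no vendor intrinsics are used):

    /* Each pass covers two iterations and maps onto roughly:
       Ld2 (ptr[0], ptr[1]), Ld2 (b[i], b[i+1]), Dual-MAC, Str2.   */
    void kernel_vec2(const short *ptr, const short *b, short *y, int N)
    {
        int x = 0;
        for (int i = 0; i < N; i += 2) {
            x += (ptr[0] * b[i])   >> 16;   /* first half of the dual-MAC  */
            x += (ptr[1] * b[i+1]) >> 16;   /* second half of the dual-MAC */
            y[i]   = ptr[0];                /* packed store (Str2)         */
            y[i+1] = ptr[1];
            ptr += 2;
        }
    }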
7
Results: Comparison with Optimizing Compiler

[Charts: distribution of percent error in predicting hand-optimized performance of DSP kernels]
  • DSP 1 (TI C6201): Ave. error for Estimator 4.2%, Ave. error for Compiler 220%
  • DSP 2 (LSI401): Ave. error for Estimator 3.5%, Ave. error for Compiler 310%

  • Latest compilers fall short of hand-optimized performance substantially, even for DSP kernels

8
Why Does Retargetable Estimation Work Well?
  • Machine Description Method has sufficient detail
  • Captures main Instructions, Functional Unit constraints, pipelining effects
  • Uses well-tested abstractions of features in today's DSPs
  • Estimation encompasses optimization of the features with the heaviest impact on performance
  • Includes crucial optimizing compiler technology
  • Targets common DSP styles/semantics and characteristics
  • Reaches DSP-oriented, loop-level and computation-context-level optimizations, for their effect on performance
9
Conclusions
With Predictive Analysis and an Architecture Model capturing and profiling the differentiating features of DSPs:
  • Quick quantitative evaluation of architectural tradeoffs
  • Guidance to shorten the design development cycle
  • Quick comparison of different versions of an algorithm on a given architecture
  • What differentiates this approach:
  • No need for generating assembly code and simulating
  • Rapid Retargetability
  • Can be applied to other metrics, e.g. power, memory usage

10
Architecture Selection: Ever More Challenging
Today's tools for this task face many challenges:
  • Difficult to retarget
  • Still lagging optimization technology
  • Low reliability
  • No development support
  • Limited to supported DSPs
  • Expensive!

11
Background The DSP Domain
  • DSP Applications have attributes different from GP applications, beyond generic Instruction-Level Parallelism:
  • High Computation Regularity
  • Predictable Data Access Patterns (e.g. sequential, circular access)
  • High, Well-structured Data Parallelism
  • DSP Processors (old and new) leverage these attributes, using:
  • Special Complex/Parallel Instructions
  • Variable Arithmetic Mode Support
  • Specialized Memory addressing and control-flow support (an example appears below)
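A generic C illustration (not from the presentation) of the access patterns these features target: a circularly addressed delay line combined with a sequential coefficient stream in a multiply-accumulate loop. On a DSP with hardware circular addressing and auto-update modes, the modulo index arithmetic and pointer updates below come essentially for free.

    /* Hypothetical FIR-style kernel: circular delay line + sequential coefficients. */
    int fir_circular(const short *coef, short *delay, int taps, int *pos, short in)
    {
        int acc = 0;
        delay[*pos] = in;                        /* insert newest sample      */
        for (int k = 0; k < taps; k++) {
            int idx = (*pos + k) % taps;         /* circular addressing       */
            acc += coef[k] * delay[idx];         /* MAC                       */
        }
        *pos = (*pos + taps - 1) % taps;         /* step write index backward */
        return acc >> 15;                        /* fixed-point scaling       */
    }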

12
Example Model for LSI401 (formerly ZSP16401)
  • LSI401: 4-way Superscalar RISC-based DSP with DSP-oriented features
  • Optimized Operations
  • - Multi-operation Patterns (SUIF Expression trees)
  • MAC: x += ? * ?  (? = any operand)
  • Load2: pair of x[i], x[i+/-1] in the same basic block
  • MAC2: x += x[i]*y[i] +/- x[i+/-1]*y[i+/-1]
  • Add2/Sub2
  • - Special Arithmetic modes: Saturation, Rounding
  • Memory Addressing Features
  • - Address Generation Cost: 1 cycle if not a register, or if not an array access with offset in -8..7
  • - Hardware Circular Addressing: 2 circular buffers
  • Control Flow Optimization Features
  • - Looping Support: FOR-loop cost 0 (2 counters)
  • - Loop Vectorization/Packing (list of instructions eligible for 2x parallelization: Add, Sub, Ld, Str, MAC)

FU Types (6)    Limit   Latency   Throughput
ALU               2        1          1
MAC               1        1          1
Prefetch-Lds      2        0          1
Load              1        1          1
Store             1        1          1
Branch            1        1          1

FU-to-FU Latency: Str->Ld 1 cycle min.
Instruction Types (12): Ld, Pref-Ld, Str, Branch, Add, Sub, ALU (gen.), Mul, MAC, Add2, Sub2, MAC2
Max Instructions issued in Parallel: 4
Instrs' FU Usage: 16- and 32-bit 12x6 tables
Operand Handling Rules: Padd/MACs allow 3 operands per instr. (the rest allow 2, default)
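Continuing the hypothetical ArchModel sketch from the earlier slide, the LSI401 functional-unit table and a few of the flags above could be written down as an instance like this (field names remain illustrative; the values come from the table):

    static const ArchModel lsi401 = {
        .fu = {
            { "ALU",         2, 1, 1 },
            { "MAC",         1, 1, 1 },
            { "Prefetch-Ld", 2, 0, 1 },
            { "Load",        1, 1, 1 },
            { "Store",       1, 1, 1 },
            { "Branch",      1, 1, 1 },
        },
        .max_issue     = 4,   /* 4-way superscalar issue       */
        .dual_mac      = 1,   /* MAC2 hyper-pattern supported  */
        .circ_buffers  = 2,   /* hardware circular addressing  */
        .vector_degree = 2,   /* 2x loop vectorization/packing */
    };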
13
Assumptions in the Architecture Model
Assumptions that are not restrictive in the DSP domain can be used to simplify the model:
  • No Register Allocation Conflicts (though register casting is performed)
  • Variable lifetimes usually short
  • No Cache Misses/Extended Memory Latencies
  • High execution-time locality and data access predictability
  • Perfect Branch Prediction
  • Predictable Control Flow
  • Auto-update Address Modes (post-incr., post-decr.) available

14
Results: Estimated vs. Hand-Optimized Cycle Count
15
Determining Level of Confidence of Estimation
  • Empirical Approach: Correlation to Benchmark Results
  • Establish Benchmark sets with high confidence (targeting different application types)
  • Correlate the Application to a Benchmark Set -> determine how similar it is to the benchmarks [Potkonjak98]
  • Application Estimate's Confidence Level (Accuracy) is a function of its similarity to the Benchmarks and their Confidence Level

16
Determining Distance from a Benchmark Set
Quantitative method based on [Potkonjak98]
  • Characterize a Benchmark set numerically by measuring some run-time properties
  • He used CPI, cache hit rate, bus utilization, ALU issue rate, ...
  • Obtain averages for each property over all benchmarks
  • Characterize the new application similarly
  • Add the application to the Benchmark Set, obtain NEW averages
  • The distance of the new application from the Benchmark Set is a function of the differences between the old and new averages (see the sketch below)
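The presentation does not spell out the exact distance function, so the sketch below uses a simple normalized L1 distance over the property averages as one plausible instantiation (the function name and the normalization are assumptions):

    #include <math.h>

    /* Distance of a new application from a benchmark set: compare the set's
       property averages (CPI, cache hit rate, ...) before and after adding
       the application, and sum the relative changes.                       */
    double benchmark_distance(const double *old_avg, const double *new_avg, int nprops)
    {
        double dist = 0.0;
        for (int p = 0; p < nprops; p++) {
            double denom = fabs(old_avg[p]) > 1e-12 ? fabs(old_avg[p]) : 1.0;
            dist += fabs(new_avg[p] - old_avg[p]) / denom;   /* relative change */
        }
        return dist / nprops;
    }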

17
Architecture Selection Criteria--System Viewpoint
  • Performance and Power constraints MUST be met
  • Using Benchmark-result lookup is increasingly unreliable
  • Other Criteria:
  • Chip Cost
  • Presence of Peripherals
  • Vendor/Development Support
  • Data Formats Supported
  • These can be encapsulated in the Framework directly as known data, used to rank final candidate architectures
  • Memory Cost (can be estimated using the Retargetable Estimation method)
  • Composite Metrics should be used: Cost/Energy-Efficiency

18
Retargetable Compiler Architecture Model
  • Example: ISDL (Instruction Set Description Language) [Devadas98]
  • The language consists of:
  • Instruction Word Format
  • Storage Resources (names and sizes of Memory and Register files)
  • Instruction Set (with RTL description of each operation)
  • Constraints (grouping rules for parallel instructions)
  • Optimization Hints (e.g. Branch Prediction Hints, Delay Slot Use)
  • Emphasis on Code Generation for VLIW Embedded Processors

19
Characteristics of Conventional DSP Processor Architectures
  • Highly Non-orthogonal Data Paths
  • Restricted/specialized Parallelism
  • Specialized Support for
  • Control Flow and Addressing
  • Special, small Register Files
  • Complex Instruction Set
  • (emphasis on High Code Density)
  • High Memory Bandwidth
  • (usually at least two words/cycle)
  • Difficult to program and compile for

[Block diagram: a conventional DSP data path showing the Data Bus and Address Bus, G.P. Registers, Address Registers, MULT, ALU, Accumulator, and AGU]
20
New Trends in DSP Processor Architectures
  • Diversification
  • Enhanced Conventional DSPs (more specialized
    parallelism)
  • VLIW (deeply pipelined) / Superscalar
    (dynamically scheduled)
  • DSP-enhanced General-Purpose Processors /
    Embedded Processors
  • More Parallelism, but harder to track instruction
    behavior
  • Exploitation of Computation Locality (e.g. data
    pre-fetching)
  • Data Paths with User-configurable Extensions
    (e.g. Siemens CLIW ISA)
  • Still as much difficulty, if not more, in programming them optimally
  • Software Development Tools becoming more crucial

21
Capturing Multi-operation Hyper-Patterns
  • Patterns described as expression trees with nodes
    assigned not one value, but lists of possible
    values

[Pattern-tree diagrams: the Dual-MAC (MAC2) pattern and the SIMD pattern (PADD, PSUB) for the ZSP16401. Node value lists visible in the trees include: Operand (can be reached through a variable); Array or double Var; Array or CONST; operators +, -; array references X[0], Y[0], X[i], Y[i]; and index offsets 1, -1.]
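A minimal C sketch of how such "list-valued" pattern nodes might be represented (types and field names are hypothetical, not the tool's actual data structures). Matching would walk the SUIF expression tree and the pattern tree together, succeeding when each expression node's kind, operator, or index offset appears in the corresponding allowed list.

    typedef enum { ND_ARRAY, ND_VAR, ND_CONST, ND_OP } NodeKind;
    typedef enum { OP_ADD, OP_SUB, OP_MUL } OpKind;

    /* One node of a hyper-pattern: each slot holds a LIST of acceptable
       values rather than a single required value.                       */
    typedef struct PatNode {
        NodeKind allowed_kinds[4];   int n_kinds;    /* e.g. Array or double Var */
        OpKind   allowed_ops[4];     int n_ops;      /* e.g. + or -              */
        int      allowed_offsets[4]; int n_offsets;  /* e.g. index offsets 1, -1 */
        struct PatNode *child[2];                    /* operand sub-trees        */
    } PatNode;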