
Computing Engine Choices

- General Purpose Processors (GPPs): Intended for general purpose computing (desktops, servers, clusters ...)
- Application-Specific Processors (ASPs): Processors with ISAs and architectural features tailored towards specific application domains
  - E.g. Digital Signal Processors (DSPs), Network Processors (NPs), Media Processors, Graphics Processing Units (GPUs), Vector Processors??? ...
- Co-Processors: A hardware (hardwired) implementation of specific algorithms with limited programming interface (augment GPPs or ASPs)
- Configurable Hardware:
  - Field Programmable Gate Arrays (FPGAs)
  - Configurable array of simple processing elements
- Application Specific Integrated Circuits (ASICs): A custom VLSI hardware solution for a specific computational task
- The choice of one or more depends on a number of factors including:
  - Type and complexity of computational algorithm (general purpose vs. specialized)
  - Desired level of flexibility and programmability
  - Performance requirements
  - Desired level of computational efficiency (e.g. computations per watt or computations per chip area)
  - Power requirements
  - Real-time constraints
  - Development time and cost
  - System cost

Slide annotations:

- General Purpose ISAs (RISC or CISC) for GPPs; Special Purpose ISAs for ASPs
- The ISA forms an abstraction layer that sets the requirements for both compiler and CPU designers
- A further selection factor: expected useful lifecycle of the computing element or system

(Repeated here from lecture 1)

Computing Engine Choices

[Figure: spectrum of computing engine choices, from software (processors) to hardware. In order: General Purpose Processors (GPPs), Application-Specific Processors (ASPs), Configurable Hardware, Co-Processors, Application Specific Integrated Circuits (ASICs). Programmability/flexibility is highest at the GPP end; specialization, development cost/time, and performance/chip area/watt (computational efficiency) are highest at the ASIC end.]

Examples of ASPs: Digital Signal Processors (DSPs), Network Processors (NPs), Media Processors, Graphics Processing Units (GPUs), Physics Processor ...

Processor: Programmable computing element that runs programs written using a pre-defined set of instructions (ISA).

Selection Factors

- Type and complexity of computational algorithm (general purpose vs. specialized)
- Desired level of flexibility and programmability
- Performance requirements
- Desired level of computational efficiency
- Power requirements
- Real-time constraints
- Development time and cost
- System cost

(Repeated here from lecture 1)

Computing Element Choices Observation

Why Application-Specific Processors (ASPs)?

- Generality and efficiency are in some sense inversely related to one another:
  - The more general-purpose a computing element is, and thus the greater the number of tasks it can perform, the less efficient (e.g. computations per chip area/watt) it will be in performing any of those specific tasks.
  - Design decisions are therefore almost always compromises: designers identify key features or requirements of applications that must be met and make compromises on other, less important features.
- To counter the problem of computationally intense and specialized problems for which general purpose processors/machines cannot achieve the necessary performance or other requirements:
  - Special-purpose processors (or Application-Specific Processors, ASPs), attached processors, and coprocessors have been designed and built for many years for specific application domains, such as image or digital signal processing (for which many of the computational tasks are specialized and can be very well defined).

[Figure: ASPs trade generality (flexibility, programmability) against efficiency, i.e. computational efficiency (computations per watt or chip area).]

Digital Signal Processor (DSP) Architecture

- Classification of Main Processor Types/Applications
- Requirements of Embedded Processors
- DSP vs. General Purpose CPUs
- DSP Cores vs. Chips
- Classification of DSP Applications
- DSP Algorithm Format
- DSP Benchmarks
- Basic Architectural Features of DSPs
- DSP Software Development Considerations
- Classification of Current DSP Architectures and example DSPs:
  - Conventional DSPs: TI TMS320C54xx
  - Enhanced Conventional DSPs: TI TMS320C55xx
  - Multiple-Issue DSPs:
    - VLIW DSPs: TI TMS320C62xx, TMS320C64xx
    - Superscalar DSPs: LSI Logic ZSP400/500 DSP core

[Slide annotations: DSPs are often embedded; the architecture classes are labeled with DSP generations 1-2, 3, and 4.]

Main Processor Types/Applications

- General Purpose Computing: General Purpose Processors (GPPs)
  - High performance: in general, faster is always better.
  - RISC or CISC: Intel P4, IBM Power4, SPARC, PowerPC, MIPS ...
  - Used for general purpose software
  - End-user programmable
  - Real-time performance may not be fully predictable (due to dynamic architectural features)
  - Heavyweight, multi-tasking OS: Windows, UNIX
  - Normally, low cost and power are not a requirement (changing)
  - Servers, workstations, desktops (PCs), notebooks, clusters
- Embedded Processing: embedded processors and processor cores
  - Cost, power, code-size and real-time requirements and constraints
  - Once real-time constraints are met, a faster processor may not be better
  - E.g. Intel XScale, ARM, 486SX, Hitachi SH7000, NEC V800 ...
  - Often require digital signal processing (DSP) support or other application-specific support (e.g. network, media processing)
  - Single or few specialized programs known at system design time
  - Not end-user programmable
  - Real-time performance must be fully predictable (avoid dynamic architectural features)
  - Lightweight, often real-time OS or no OS

[Slide annotations: word sizes of 64 bit, 16-32 bit, and 8-16 bit; cost/complexity increases toward 64 bit and volume increases toward 8-16 bit. One label reads "Examples of Application-Specific Processors (ASPs)".]

The Processor Design Space

(Main Types)

[Figure: the processor design space, plotted as performance against processor cost (chip area, power, complexity).
- Microprocessors (GPPs): performance is everything; software rules.
- Embedded processors (examples of ASPs): application specific architectures for performance; real-time constraints, specialized applications, low power/cost constraints.
- Microcontrollers: cost is everything.]

Requirements of Embedded Processors

- Usually must meet strict real-time constraints:
  - Real-time performance must be fully predictable:
    - Avoid dynamic processor architectural features that make real-time performance harder to predict (e.g. cache, dynamic scheduling, hardware speculation)
  - Once real-time constraints are met, a faster processor is not desirable (overkill) due to increased cost/power requirements.
- Optimized for a single (or few) program(s): code often in on-chip ROM or on/off-chip EPROM/flash memory.
- Minimum code size (one of the motivations initially for Java)
- Performance obtained by optimizing datapath
- Low cost:
  - Lowest possible area
    - High computational efficiency: computation per unit area
  - VLSI implementation technology usually behind the leading edge
  - High level of integration of peripherals (the System-on-Chip, SoC, approach reduces system cost/power)
- Fast time to market:
  - Compatible architectures (e.g. ARM family) allow reusable code
  - Customizable cores (System-on-Chip, SoC)
- Low power if application requires portability

Embedded Processors: How Fast?

[Figure not recoverable. Slide annotation: Good or bad?]

Embedded Processors: Area of Processor Cores = Cost (and Power Requirements)

[Figure: chip area of processor cores. Labeled examples: embedded version of a GPP, Nintendo processor, cellular phones. Thus need to minimize chip area.]

Embedded Processors: Another Figure of Merit, Computation per Unit Chip Area (Computational Efficiency)

[Figure: computation per unit chip area. Labeled examples: embedded version of a GPP, Nintendo processor, cellular phones.]

Embedded Processors: Code Size

Smaller is better.

- If a majority of the chip is the program stored in ROM, then minimizing code size is a critical issue. How?
- Common embedded processor ISA features to minimize code size (CISC-like?):
  1. Variable length instruction encoding is common
     - e.g. the Piranha has 3 instruction sizes: basic 2 byte, and 2 byte plus 16 or 32 bit immediate
  2. Complex/specialized instructions
  3. Complex addressing modes

Embedded Systems vs. General Purpose Computing

General Purpose Computing Systems (and processors: GPPs)

- Used for general purpose software: intended to run a fully general set of applications that may not be known at design time
- No application-specific capability required
- End-user programmable
- Minimizing code size is not an issue
- Heavyweight, multi-tasking OS: Windows, UNIX
- Higher power and cost constraints/requirements
- In general, no real-time constraints
- Thus, real-time performance may not be fully predictable (due to dynamic processor architectural features): superscalar dynamic scheduling, hardware speculation, branch prediction, cache
- Faster (higher-performance) is always better (usually)

Embedded Systems (and embedded processors)

- Run a single or few specialized applications, often known at system design time
- May require application-specific capability (e.g. DSP)
- Not end-user programmable
- Minimum code size is highly desirable
- Lightweight, often real-time OS or no OS
- Low power and cost constraints/requirements
- Usually must meet strict real-time constraints (e.g. real-time sampling rate)
- Thus, real-time performance must be fully predictable: avoid dynamic processor architectural features that make real-time performance harder to predict
- Once real-time constraints are met, a faster processor is not desirable (overkill) due to increased cost/power requirements

Evolution of GPPs and DSPs

- General Purpose Processors (GPPs) trace their roots back to Eckert, Mauchly, and Von Neumann (ENIAC).
- Digital Signal Processors (DSPs) are microprocessors designed for efficient mathematical manipulation of digital signals utilizing digital signal processing algorithms.
  - DSPs usually process infinite continuous sampled (digitized) data streams (physical signals) while meeting real-time and power constraints.
- DSPs evolved from Analog Signal Processors (ASPs) that utilize analog hardware to transform physical signals (classical electrical engineering).
- ASP to DSP because:
  - DSP is insensitive to environment (e.g., same response in snow or desert, if it works at all)
  - DSP performance is identical even with variations in components; the behavior of 2 analog systems varies even if they are built with the same components with 1% variation
- Different history and different application requirements led to different ISA design considerations, terms, metrics, and architectures, and to some new inventions.

[Slide annotations: EDSAC; i.e. first generation processors.]

DSP vs. General Purpose CPUs

- DSPs tend to run one (or few) program(s), not many programs.
  - Hence OSes (if any) are much simpler; there is no virtual memory or protection, ...
- DSPs usually run applications with hard real-time constraints:
  - The DSP must meet the application's signal sampling rate computational requirements.
  - Once the above real-time constraints are met, a faster DSP is overkill (higher DSP cost, power ...) without additional benefit.
  - You must account for anything that could happen in a time slot (DSP algorithm inner loop, data sampling rate).
  - All possible interrupts or exceptions must be accounted for, and their collective time must be subtracted from the time interval.
  - Therefore, exceptions are BAD.
- DSPs usually process infinite continuous data streams:
  - This requires high memory bandwidth (with predictable latency, e.g. no data cache) for streaming real-time data samples, and predictable processing time on the data samples.
- The design of DSP ISAs and processor architectures is driven by the requirements of DSP algorithms.
  - Thus DSPs are application-specific processors.

[Slide annotation: DSP performance requirements are similar to those of other embedded processors.]

DSP vs. GPP

- The "MIPS/MFLOPS" of DSPs is the speed of Multiply-Accumulate (MAC), i.e. the main performance measure of DSPs is MAC speed.
  - MAC is common in DSP algorithms that involve computing a vector dot product, such as digital filters, correlation, and Fourier transforms.
  - DSPs are judged by whether they can keep the multipliers busy 100% of the time and by how many MACs are performed in each cycle.
- The "SPEC" of DSPs is 4 algorithms:
  - Infinite Impulse Response (IIR) filters
  - Finite Impulse Response (FIR) filters
  - FFT, and
  - convolvers
- In DSPs, target algorithms are important (why? since DSPs are application domain specific processors):
  - Binary compatibility is not a major issue (unlike general purpose processors)
- High-level software is not as important in DSPs as in GPPs.
  - People still write in assembly language for a product to minimize the die area for ROM in the DSP chip (code size) and to improve performance.

Note: While this is still mostly true, programming for DSPs in high level languages (HLLs) has been gaining more acceptance due to the development of more efficient HLL DSP compilers in recent years.

Types of DSP Processors

According to type of arithmetic/operand size supported:

- 32-bit floating point (5% of DSP market). Examples:
  - TI TMS320C3X, TMS320C67xx (VLIW)
  - AT&T DSP32C
  - Analog Devices ADSP21xxx
  - Hitachi SH-4
- 16-bit (or 24-bit) fixed point (95% of DSP market). Examples:
  - TI TMS320C2X, TMS320C62xx (VLIW)
  - Infineon TC1xxx (TriCore1) (VLIW)
  - Motorola DSP568xx, MSC810x (VLIW)
  - Analog Devices ADSP21xx
  - Agere Systems DSP16xxx, Starpro2000
  - LSI Logic LSI140x (ZSP400), superscalar
  - Hitachi SH3-DSP
  - StarCore SC110, SC140 (VLIW)

DSP Cores vs. Chips

- DSPs are usually available as synthesizable cores or off-the-shelf packaged chips.
- Synthesizable cores (IP):
  - Map into chosen fabrication process
  - Speed, power, and size vary
  - Choice of peripherals, etc. (SoC: System on Chip)
  - Require extensive hardware development effort, resulting in more development time and cost (very high volume needed to justify development cost)
- Off-the-shelf packaged chips:
  - Highly optimized for speed, energy efficiency, and/or cost
  - Lower development time/cost/effort
  - Tools and 3rd-party support often more mature
  - Faster time to market
  - Limited performance and integration options

DSP Architecture: Enabling Technologies

[Figure: enabling technologies across generations 1 to 4 of single-chip (microprocessor) DSPs. First microprocessor DSP: TI TMS 32010.]

Texas Instruments TMS320 Family: Multiple DSP µP Generations

[Figure: TMS320 family across generations 1 to 4 of single-chip (microprocessor) DSPs, with a VLIW label.]

DSP Applications

- Digital audio applications
- MPEG Audio
- Portable audio
- Digital cameras
- Cellular telephones
- Wearable medical appliances
- Storage products
- disk drive servo control
- Military applications
- radar
- sonar

- Industrial control
- Seismic exploration
- Networking
- (Telecom infrastructure)
- Wireless
- Base station
- Cable modems
- ADSL
- VDSL
- ...

Current DSP killer applications: cell phones and telecom infrastructure. (HDTV? Other?)

DSP Algorithms Applications

Another Look at DSP Applications

- High-end
  - Military applications (e.g. radar/sonar)
  - Wireless base station: TMS320C6000
  - Cable modem
  - Gateways
  - HDTV
- Mid-range
  - Industrial control
  - Cellular phone: TMS320C540
  - Fax/voice server
- Low end
  - Storage products: TMS320C27 (hard drive controllers)
  - Digital camera: TMS320C5000
  - Portable phones
  - Wireless headsets
  - Consumer audio
  - Automobiles, thermostats, ...

[Slide annotations: across this DSP range of applications, cost increases toward the high end and volume increases toward the low end; the part numbers are possible target DSPs.]

Cellular Phone System

[Figure: block diagram of a cellular phone system (example DSP application). Labeled blocks: keypad digits 1 to 0 with the number 415-555-1212, controller, RF modem, physical layer processing, baseband converter, A/D, speech decode, speech encode, DAC.]

Cellular Phone HW/SW/IC Partitioning

[Figure: the same block diagram (example DSP application), partitioned among a microcontroller, an ASIC, a DSP, and an analog IC.]

Mapping Onto System-on-Chip (SoC) (Cellular Phone)

[Figure: example DSP application mapped onto an SoC. Labeled blocks: S/P, phone book, keypad interface, micro-controller or embedded processor (µC), protocol, DMA, control, RAM, speech quality enhancement, voice recognition, ASIC logic, RPE-LTP speech decoder, de-intl decoder, Viterbi equalizer, demodulator and synchronizer, DSP core.]

Example Cellular Phone Organization

[Figure: example DSP application with a C540 (DSP) and an ARM7 (µC).]

Multimedia System-on-Chip (SoC)

E.g. multimedia terminal electronics.

[Figure: example DSP application; labels include "ASIC Co-processor Or ASP" and "(ASIC)".]

- Future chips will be a mix of processors, memory, and dedicated hardware for specific algorithms and I/O

DSP Algorithm Format

- DSP culture has a graphical format to represent formulas (i.e. DSP algorithms).
- Like a flowchart for formulas and inner loops, not programs.
- Some symbols seem natural: Σ is add, X is multiply
- Others are obtuse: z^-1 means take variable from earlier iteration (delay).
- These graphs are trivial to decode

DSP Algorithm Notation

- Uses flowchart notation instead of equations
- Multiply is drawn as [symbol lost in extraction] or X
- Add is drawn as [symbols lost in extraction] or Σ
- Delay/Storage is drawn as [symbols lost in extraction], or labeled Delay, z^-1, or D

Typical DSP Algorithm: Finite-Impulse Response (FIR) Filter

- Filters reduce signal noise and enhance image or signal quality by removing unwanted frequencies.
- Finite Impulse Response (FIR) filters compute:

$$y(n) = \sum_{k=0}^{N-1} h(k)\, x(n-k)$$

  where
  - x is the input sequence (signal samples)
  - y is the output sequence
  - h is the impulse response (i.e. filter coefficients)
  - N is the number of taps (coefficients) in the filter
- Output sequence depends only on input sequence and impulse response.
- The sum is a vector dot product of the N filter coefficients and N signal samples: Multiply Accumulate (MAC) operations.
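As a minimal sketch of the FIR definition, the following Python function computes y(n) as the sum over k of h(k) x(n-k), treating samples before the start of the input as zero. The function name `fir_filter` and the zero-padding choice are assumptions made for illustration; they are not from the slides.

```python
def fir_filter(x, h):
    """Direct-form FIR: y[n] = sum over k of h[k] * x[n - k].

    Samples before the start of x are assumed to be zero (illustrative choice).
    """
    n_taps = len(h)
    y = []
    for n in range(len(x)):
        acc = 0  # plays the role of the MAC accumulator
        for k in range(n_taps):
            if n - k >= 0:
                acc += h[k] * x[n - k]  # one multiply-accumulate per tap
        y.append(acc)
    return y


print(fir_filter([1, 2, 3, 4], [1, 1, 1]))  # [1, 3, 6, 9]
```

Each output sample costs N multiplies and N adds, which is why MAC throughput dominates FIR performance.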

Typical DSP Algorithms: Finite-Impulse Response (FIR) Filter

- N most recent samples in the delay line (Xi)
- New sample moves data down delay line
- A filter tap is a multiply-add (Multiply And Accumulate, MAC)
- Each tap (N taps total) nominally requires:
  - Two data fetches
  - Multiply
  - Accumulate
  - Memory write-back to update delay line
- Special addressing modes (e.g. modulo)
- Performance goal: at least 1 FIR tap / DSP instruction cycle

Resulting requirements:

- Real-time data sample streaming requires:
  - Predictable data bandwidth/latency
  - Special addressing modes
  - Separate memory banks/buses?
- Repetitive computations, multiply and accumulate (MAC), require:
  - Efficient MAC support

Finite-Impulse Response (FIR) Filter

[Figure: FIR filter data flow, i.e. a vector dot product. Signal samples arrive from the A/D; one filter tap combines a delayed sample with a filter coefficient in a MAC; labels include "Delay (accumulator register)"; the output goes to the D/A.]

- Performance goal: at least 1 FIR tap / DSP instruction cycle
- The DSP must meet the application's signal sampling rate computational requirements; a faster DSP is overkill (more cost/power than really needed)
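The delay line and modulo addressing described above can be sketched in software as a circular buffer. This is only an illustration in Python: the class name `CircularFir` and its fields are hypothetical, and a real DSP would do the index wrap-around in its address generation hardware rather than with an explicit `%` operation.

```python
class CircularFir:
    """FIR filter whose delay line is a circular buffer (illustrative sketch)."""

    def __init__(self, h):
        self.h = list(h)
        self.buf = [0] * len(h)  # delay line holding the N most recent samples
        self.pos = 0             # slot that the next sample will overwrite

    def step(self, sample):
        n_taps = len(self.h)
        self.buf[self.pos] = sample  # overwrite the oldest sample in place
        acc = 0
        idx = self.pos
        for k in range(n_taps):
            acc += self.h[k] * self.buf[idx]  # h[k] * x[n - k]: one MAC per tap
            idx = (idx - 1) % n_taps          # modulo (circular) address update
        self.pos = (self.pos + 1) % n_taps
        return acc


f = CircularFir([1, 1, 1])
print([f.step(s) for s in [1, 2, 3, 4]])  # [1, 3, 6, 9]
```

Because only one slot is overwritten per sample, no data has to be physically shifted down the delay line.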

Sample Computational Rates for FIR Filtering

[Table not recoverable from extraction. Surviving entries: FIR types 1-D, 1-D, 2-D, 2-D (4.37 GOPs), and 2-D (23.3 GOPs).]

OPs = operations per second.

A 1-D FIR has n_op = 2N and a 2-D FIR has n_op = 2N^2.

- The DSP must meet the application's signal sampling rate computational requirements
- A faster DSP is overkill (higher DSP cost, power ...)
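A rough sketch of how these operation counts turn into a required computational rate: multiply the operations per output sample (2N for a 1-D FIR, 2N^2 for a 2-D FIR) by the sampling rate. The function names and the 64-tap, 48 kHz example are hypothetical values chosen for illustration; they are not entries from the lost table.

```python
def fir_1d_ops_per_sample(n):
    # N multiplies plus N adds per output sample
    return 2 * n


def fir_2d_ops_per_sample(n):
    # N*N multiplies plus N*N adds per output sample
    return 2 * n * n


def required_ops_per_second(ops_per_sample, sample_rate_hz):
    return ops_per_sample * sample_rate_hz


# Hypothetical example: 64-tap 1-D FIR on a 48 kHz signal
print(required_ops_per_second(fir_1d_ops_per_sample(64), 48_000))  # 6144000
```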

DSP Performance Requirements

FIR Filter on (Simple) General Purpose Processor (GPP)

    loop:
        lw   x0, 0(r0)
        lw   y0, 0(r1)
        mul  a, x0, y0
        add  y0, a, b
        sw   y0, (r2)
        inc  r0
        inc  r1
        inc  r2
        dec  ctr
        tst  ctr
        jnz  loop

- Problems:
  - Bus/memory bandwidth bottleneck
  - Control/loop code overhead
  - No suitable addressing modes or instructions, e.g. a multiply and accumulate (MAC) instruction
- GPP real-time performance (to meet the signal sampling rate) may not be fully predictable, due to dynamic processor architectural features:
  - Superscalar dynamic scheduling, hardware speculation, branch prediction, cache.

Typical DSP Algorithms: Infinite-Impulse Response (IIR) Filter

- Infinite Impulse Response (IIR) filters compute:

$$y(n) = \sum_{k=1}^{M-1} a(k)\, y(n-k) + \sum_{k=0}^{N-1} b(k)\, x(n-k)$$

  i.e. two MAC sums with filter coefficients a(k), b(k).
- Output sequence depends on input sequence, previous outputs, and impulse response.
- Both FIR and IIR filters:
  - Require vector dot product (multiply-accumulate, MAC) operations
  - Normally use fixed coefficients
- Adaptive filters update their coefficients to minimize the distance between the filter output and the desired signal.
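The dependence of an IIR output on previous outputs can be sketched as follows. This assumes the common difference-equation form in which feedback coefficients multiply past outputs and feedforward coefficients multiply current and past inputs; the function name `iir_filter`, the argument layout (`a[j]` multiplies y(n-1-j)), and the sign convention are assumptions for illustration, since the slide's equation did not survive extraction.

```python
def iir_filter(x, b, a):
    """IIR sketch: y[n] = sum_k b[k]*x[n-k] + sum_j a[j]*y[n-1-j].

    b: feedforward coefficients; a: feedback coefficients (assumed layout).
    Samples and outputs before n = 0 are taken as zero.
    """
    y = []
    for n in range(len(x)):
        acc = 0.0
        for k, bk in enumerate(b):
            if n - k >= 0:
                acc += bk * x[n - k]      # MAC over input samples
        for j, aj in enumerate(a):
            if n - 1 - j >= 0:
                acc += aj * y[n - 1 - j]  # MAC over previous outputs
        y.append(acc)
    return y


# The impulse response of a one-pole filter decays but never ends
print(iir_filter([1, 0, 0, 0], [1.0], [0.5]))  # [1.0, 0.5, 0.25, 0.125]
```

With an empty feedback list the function reduces to an FIR filter, which shows that both filter types share the same MAC inner loop.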

Typical DSP Algorithms: Discrete Fourier Transform (DFT)

- The Discrete Fourier Transform (DFT) allows for spectral analysis in the frequency domain.
- It is computed as (a MAC sum):

$$y(k) = \sum_{n=0}^{N-1} W_N^{nk}\, x(n), \qquad W_N = e^{-j 2\pi / N}$$

  for k = 0, 1, ..., N-1, where
  - x is the input sequence in the time domain
  - y is an output sequence in the frequency domain
- The Inverse Discrete Fourier Transform is computed as (a MAC sum):

$$x(n) = \frac{1}{N} \sum_{k=0}^{N-1} W_N^{-nk}\, y(k), \qquad n = 0, 1, \ldots, N-1$$

- The Fast Fourier Transform (FFT) provides an efficient method for computing the DFT.
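For reference, a direct O(N^2) evaluation of the DFT and its inverse can be sketched in a few lines of Python. This uses the standard definition with the kernel exp(-j 2 pi n k / N) and a 1/N factor on the inverse; the function names `dft` and `idft` are chosen here for illustration, and an FFT would produce the same values with far fewer MAC operations.

```python
import cmath


def dft(x):
    """Direct DFT: y[k] = sum over n of x[n] * exp(-2j*pi*n*k/N)."""
    n_pts = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / n_pts) for n in range(n_pts))
        for k in range(n_pts)
    ]


def idft(y):
    """Inverse DFT: x[n] = (1/N) * sum over k of y[k] * exp(+2j*pi*n*k/N)."""
    n_pts = len(y)
    return [
        sum(y[k] * cmath.exp(2j * cmath.pi * n * k / n_pts) for k in range(n_pts)) / n_pts
        for n in range(n_pts)
    ]


spectrum = dft([1, 1, 1, 1])  # a constant signal has energy only at k = 0
recovered = idft(dft([1, 2, 3, 4]))
```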

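Typical DSP Algorithms: Discrete Cosine Transform (DCT)

- The Discrete Cosine Transform (DCT) is frequently used in image and video compression (e.g. JPEG, MPEG-2).
- The DCT and Inverse DCT (IDCT) are computed as (MAC sums):

$$y(k) = e(k) \sum_{n=0}^{N-1} \cos\!\left[\frac{(2n+1)\,\pi k}{2N}\right] x(n), \qquad k = 0, 1, \ldots, N-1$$

$$x(n) = \frac{2}{N} \sum_{k=0}^{N-1} e(k) \cos\!\left[\frac{(2n+1)\,\pi k}{2N}\right] y(k), \qquad n = 0, 1, \ldots, N-1$$

  where e(k) = 1/sqrt(2) if k = 0; otherwise e(k) = 1.
- An N-point 1-D DCT requires N^2 MAC operations.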
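A direct N^2-MAC evaluation of a DCT/IDCT pair can be sketched as below. The scaling is an assumption: the forward transform is taken as y(k) = e(k) times the sum of cos((2n+1) pi k / 2N) x(n), and the inverse carries a 2/N factor, with e(0) = 1/sqrt(2) and e(k) = 1 otherwise. Other normalizations are in common use, and the helper names `e`, `dct`, and `idct` are illustrative.

```python
import math


def e(k):
    # Scale factor: 1/sqrt(2) for k = 0, otherwise 1
    return 1 / math.sqrt(2) if k == 0 else 1.0


def dct(x):
    """Forward DCT with the assumed scaling; N*N multiply-accumulates."""
    n_pts = len(x)
    return [
        e(k) * sum(math.cos((2 * n + 1) * math.pi * k / (2 * n_pts)) * x[n]
                   for n in range(n_pts))
        for k in range(n_pts)
    ]


def idct(y):
    """Inverse DCT matching dct() above (2/N factor on the inverse)."""
    n_pts = len(y)
    return [
        (2 / n_pts) * sum(e(k) * math.cos((2 * n + 1) * math.pi * k / (2 * n_pts)) * y[k]
                          for k in range(n_pts))
        for n in range(n_pts)
    ]


samples = [1.0, 2.0, 3.0, 4.0]
recovered = idct(dct(samples))  # matches samples up to rounding error
```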

DSP Benchmarks

- DSPstone: University of Aachen, application benchmarks
  - ADPCM TRANSCODER (CCITT G.721), REAL_UPDATE, COMPLEX_UPDATES
  - DOT_PRODUCT, MATRIX_1X3, CONVOLUTION
  - FIR, FIR2DIM, HR_ONE_BIQUAD
  - LMS, FFT_INPUT_SCALED
- BDTImark2000: Berkeley Design Technology Inc (BDTI)
  - 12 DSP kernels in hand-optimized assembly language:
    - FIR, IIR, vector dot product, vector add, vector maximum, FFT ...
  - Returns a single number (higher means faster) per processor
  - Uses only on-chip memory (memory bandwidth is the major bottleneck in performance of embedded applications)
- EEMBC (pronounced "embassy"): EDN Embedded Microprocessor Benchmark Consortium
  - 30 companies, formed by Electrical Design News (EDN)
  - Benchmark evaluates compiled C code on a variety of embedded processors (microcontrollers, DSPs, etc.)
  - Application domains: automotive-industrial, consumer, office automation, networking, and telecommunications

[Figure: BDTI benchmark scores for 1st through 4th generation DSPs; annotation: > 800x faster than first generation.]

DSPs from generations 2, 3 and 4 are in use today. Why?

Basic DSP ISA/Architectural Features

- Data path configured for DSP algorithms:
  - Fixed-point arithmetic (most DSPs)
  - Saturating rather than modulo (wrap-around) arithmetic to handle overflow
  - MAC: multiply-accumulate unit(s)
  - Hardware rounding support
- Multiple memory banks and buses, usually with no data cache, for predictable fast data sample streaming:
  - Harvard architecture
  - Multiple data memories/buses
- Specialized addressing modes (dedicated address generation units are usually used):
  - Bit-reversed addressing
  - Circular buffers
- Specialized instruction set and execution control, to meet real-time signal sampling/processing constraints:
  - Zero-overhead loops
  - Support for fast MAC
  - Fast interrupt handling
- Specialized peripherals for DSP (System on Chip, SoC, style)

[Slide annotations label each group as either a DSP ISA feature or a DSP architectural feature.]

DSP Data Path: Arithmetic (DSP ISA Features)

Most common: fixed point (16-bit or 24-bit) and integer arithmetic.

- DSPs dealing with numbers representing real world signals => want reals/fractions
- DSPs dealing with numbers for addresses => want integers
- Thus the DSP ISA (and DSP) must support fixed point as well as integers

Fixed-point (fraction) format, usually 16-bit: sign bit S, then the radix point, then the fraction bits:

    -1 ≤ x < 1

Integer format: sign bit S, then the integer bits, then the radix point:

    -2^(N-1) ≤ x < 2^(N-1)

In DSP ISAs, fixed-point arithmetic must be supported; floating point support is optional and is much less common.

Much less common: single precision floating-point support.
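To make the fractional fixed-point range -1 ≤ x < 1 concrete, here is a small sketch of 16-bit fractional arithmetic in Python, using the widely used Q15 convention (1 sign bit, 15 fraction bits). The helper names `to_q15`, `from_q15`, `q15_mul`, and `saturate16` are hypothetical, and the multiply simply truncates the extra fraction bits; hardware would typically offer several rounding modes.

```python
Q15_ONE = 1 << 15  # scale factor: 15 fraction bits


def saturate16(v):
    # Clamp to the 16-bit two's complement range
    return max(-32768, min(32767, v))


def to_q15(x):
    # Scale a real value in [-1, 1) to a 16-bit integer; +1.0 saturates to 32767
    return saturate16(round(x * Q15_ONE))


def from_q15(q):
    return q / Q15_ONE


def q15_mul(a, b):
    # 16b x 16b gives a 30-fraction-bit product; shift back to 15 fraction bits
    return saturate16((a * b) >> 15)


half = to_q15(0.5)                    # 16384
print(from_q15(q15_mul(half, half)))  # 0.25
```

The same integer multiplier serves both formats; only the position of the radix point, and therefore the shift after the multiply, differs.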

DSP Data Path: Precision (DSP ISA Features)

16-bit fixed-point is most common.

- Word size affects precision of fixed point numbers
- DSPs have 16-bit, 20-bit, or 24-bit data words (16-bit most common)
- Floating point (single precision) DSPs cost 2X - 4X vs. fixed point, and are slower than fixed point
- DSP programmers will scale values inside code:
  - SW libraries
  - Separate explicit exponent
- Blocked floating point: single exponent for a group of fractions
- Floating point support simplifies development for high-end DSP applications.

In DSP ISAs, fixed-point arithmetic must be supported; floating point (single precision) support is optional and is much less common.

DSP Data Path: Overflow Handling (DSP ISA Feature)

Saturation. Why support it? Due to the physical nature of signals.

- DSPs are descended from analog signal processors
- Modulo (wrap-around) arithmetic does not match that behavior
- Instead, set the result to the most positive (2^(N-1) - 1) or most negative (-2^(N-1)) value: saturation
- Many DSP algorithms were developed in this model.

[Figure: saturation at 2^(N-1) - 1 and at -2^(N-1).]
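The difference between modulo (wrap-around) and saturating overflow handling can be sketched for N = 16. The function names `wrap_add16` and `sat_add16` are illustrative; Python integers do not overflow, so the 16-bit behavior is emulated explicitly.

```python
INT16_MAX = 2**15 - 1   # most positive value, 2^(N-1) - 1
INT16_MIN = -(2**15)    # most negative value, -2^(N-1)


def wrap_add16(a, b):
    # Modulo arithmetic: keep the low 16 bits and reinterpret as two's complement
    s = (a + b) & 0xFFFF
    return s - 0x10000 if s >= 0x8000 else s


def sat_add16(a, b):
    # Saturating arithmetic: clamp to the representable range
    return max(INT16_MIN, min(INT16_MAX, a + b))


print(wrap_add16(30000, 10000))  # -25536: a loud signal flips sign
print(sat_add16(30000, 10000))   # 32767: the signal clips, like an analog circuit
```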

DSP Data Path: Specialized Hardware (DSP Architectural Features)

- Fast specialized hardware functional units perform all key arithmetic operations in 1 cycle, to help meet real-time constraints for commonly needed operations (i.e. must optimize common operations), including:
  - Shifters
  - Saturation
  - Guard bits
  - Rounding modes
  - Multiplication/addition (MAC)
- 50% of instructions can involve the multiplier => single cycle latency multiplier
  - Need to perform multiply-accumulate (MAC) fast
  - n-bit multiplier => 2n-bit product

DSP Data Path: Multiply Accumulate (MAC) Unit

One or more MAC units.

- Don't want overflow, or to have to scale the accumulator
- Option 1: accumulator wider than product (guard bits)
  - Motorola DSP: 24b x 24b => 48b product, 56b accumulator
- Option 2: shift right and round product before adder

[Figure: MAC unit for each option; the adder is labeled "add".]
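A quick arithmetic check of Option 1, using the 48-bit product and 56-bit accumulator sizes quoted above: the 8 extra guard bits allow 2^8 = 256 worst-case products to be accumulated before overflow becomes possible. The helper `fits_signed` and the constant names are hypothetical, introduced only for this sketch.

```python
def fits_signed(value, bits):
    # True if value is representable in a two's complement word of the given width
    return -(1 << (bits - 1)) <= value <= (1 << (bits - 1)) - 1


PRODUCT_BITS = 48
ACC_BITS = 56
GUARD_BITS = ACC_BITS - PRODUCT_BITS         # 8 guard bits

max_product = (1 << (PRODUCT_BITS - 1)) - 1  # largest 48-bit signed value
safe_terms = 1 << GUARD_BITS                 # 256 accumulations

print(fits_signed(2 * max_product, PRODUCT_BITS))          # False: no guard bits
print(fits_signed(safe_terms * max_product, ACC_BITS))     # True
print(fits_signed((safe_terms + 1) * max_product, ACC_BITS))  # False
```

Without guard bits even two large products can overflow, which is why the alternative is to shift and round each product before the adder.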

DSP Data Path: Rounding Modes

- Even with guard bits, will need to round when storing accumulator into memory
- 3 DSP standard options (supported in hardware):
  - Truncation: chop results => biases results up
  - Round to nearest: < 1/2 round down,