Digital Signal Processor: Architectures and Applications


1
Digital Signal Processor: Architectures and
Applications
  • ECE734: VLSI Array Structure for Digital Signal
    Processing
  • Spring 1998
  • By Surin Kittitornkun
  • Apr. 28, 1998

2
Contents
  • Programmable DSP: Why?
  • TMS320C8x: C80, C82
  • MPACT-R3600 -2
  • TMS320C6x: C62x, C67x
  • Target Applications
  • Application: H.324 on TMS320C82
  • Current Multimedia Processors
  • References

3
Programmable DSP: Why?
  • More flexibility: changes can be made in software
  • Less complexity: shorter time to market
  • More cost-efficient than ASIC design
  • Requires software development
  • May consume more power
  • PC and consumer products

4
TMS320C8x Overview
  • RISC master processor @ 50 and 60 MHz
  • Parallel processors: 2 (4 for C80)
  • Transfer controller: DMA and memory controller
  • Video controller (C80 only)

5
TMS320C8x Master Processor
  • 32-bit RISC instruction/64-bit data
  • Scoreboarded 31 GP registers and a zero register
  • IEEE 754 floating point
  • Supports vector FP operations
  • Performs a single-precision FP MAC in 1 cycle: 100
    MFLOPS (@ 50 MHz)
  • Suitable for control protocols and FP-intensive
    algorithms
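
The 100 MFLOPS figure follows from counting a multiply-accumulate as two floating-point operations per cycle. A minimal sketch of that arithmetic (the function name and the two-operations-per-MAC convention are assumptions for illustration, not taken from the data sheet):

```python
def peak_mflops(clock_mhz: float, flops_per_cycle: int = 2) -> float:
    """Peak MFLOPS when one MAC (one multiply plus one add) completes per cycle."""
    return clock_mhz * flops_per_cycle


# 50 MHz master processor, one single-precision MAC per cycle
mflops_at_50 = peak_mflops(50)
# The same reasoning applied to the 60 MHz part
mflops_at_60 = peak_mflops(60)
```

This is a peak rate; sustained throughput depends on keeping the MAC unit fed every cycle.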

6
TMS320C8x Processor communication
  • Shared memory multiprocessor
  • MP sends commands through command buffers located
    in shared memory

7
TMS320C8x Parallel Processor
  • Data unit: 32-bit datapath, ALU, multiplier,
    etc.
  • 2 independent address units: global and local
  • 1 cycle on-chip memory access (no conflict)
  • 1 cycle load/store of byte, halfword, and word
  • Internal adder can offload data unit computation
  • Program flow control unit

8
TMS320C8x Parallel Processor Program Flow
Control Unit
  • 3-stage pipeline: instruction fetch, address
    generation, and operation execution
  • Conditional execution of data unit operations,
    moves, loads from memory, and branches
  • PC is mapped into the register file
  • Loop controller supports 3 levels of nested
    loops to minimize overhead

9
TMS320C8x Parallel Processor Data Unit
  • Split 32-bit 3-input ALU: Boolean and arithmetic
    operations
  • Split and rounded multiplier: dual 8x8=16,
    16x16=32
  • Flexible datapath: barrel rotator, mask
    generator
  • Supports signed, unsigned, and saturating arithmetic
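
A "split" ALU treats one 32-bit word as several independent narrow lanes. The sketch below models a four-lane unsigned 8-bit add, with and without saturation. It is a software illustration only: the function name and lane layout are assumptions, not the PP's actual instruction semantics.

```python
def split_add_u8(a: int, b: int, saturate: bool = True) -> int:
    """Add four unsigned 8-bit lanes packed into the 32-bit words a and b."""
    result = 0
    for lane in range(4):
        shift = 8 * lane
        total = ((a >> shift) & 0xFF) + ((b >> shift) & 0xFF)
        if saturate:
            total = min(total, 0xFF)  # clamp at the lane maximum
        else:
            total &= 0xFF             # wrap; the carry never reaches the next lane
        result |= total << shift
    return result


saturated = split_add_u8(0x10F0FF01, 0x01200102, saturate=True)
wrapped = split_add_u8(0x10F0FF01, 0x01200102, saturate=False)
```

Saturation matters for pixel and audio data, where wrapping from 255 to 0 shows up as a visible or audible artifact.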

10
TMS320C8x Parallel Processor Data Unit
3-input ALU
  • Supports 512 operations in total: 256 Boolean,
    256 arithmetic
  • Boolean F0 (ABC) F1 (ABC) F2
    (ABC) F3 (ABC) F4 (ABC)
    F5 (ABC) F6 (ABC) F7 (ABC)
  • Arithmetic A f1(B,C) f2(B,C) 1
  • Examples
  • A + B + 1
  • (A & C) + (B & C): mask A and B by C, then add
  • A((BC) (-BC)) Multiple-byte AB
  • A-((BC) (-BC)) Multiple-byte A-B
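
The count of 256 Boolean operations matches the number of distinct functions of three inputs: an 8-bit code selects which of the eight input combinations produce a 1. A minimal sketch of such a table-driven operation follows. The bit ordering of the code (index = 4*A + 2*B + C) and the function name are assumptions for illustration, not the PP's actual encoding.

```python
def bool3(code: int, a: int, b: int, c: int, width: int = 32) -> int:
    """Apply one of the 256 three-input Boolean functions bitwise.

    Bit i of `code` is the output for the input combination with
    index i = 4*A + 2*B + C (assumed ordering).
    """
    mask = (1 << width) - 1
    out = 0
    for i in range(8):
        if (code >> i) & 1:
            term_a = a if i & 4 else ~a
            term_b = b if i & 2 else ~b
            term_c = c if i & 1 else ~c
            out |= term_a & term_b & term_c  # OR in the enabled minterm
    return out & mask


A, B, C = 0x0F0F3355, 0x00FF5A5A, 0xFFFF0000
and_ab = bool3(0xC0, A, B, C)   # A AND B, ignoring C
xor_abc = bool3(0x96, A, B, C)  # A XOR B XOR C
mux = bool3(0xE4, A, B, C)      # A where C is 1, B where C is 0
```

The last case is the kind of masked merge that pixel operations use: C acts as a per-bit select between two sources.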

11
TMS320C8x Parallel Processor Instruction Set
  • 64-bit opcode contains multiple subinstructions
    for the data unit, the global address unit, and
    the local address unit
  • Ex: d4d5d6>>d0 a8d7 d0(a0x1)

12
TMS320C8x Transfer Controller
  • Prioritizes, schedules, and transfers data cache
    between on- and off-chip memories
  • Handles data cache (on-chip RAM) and instruction
    cache misses
  • Supports multidimensional data transfers, from a
    simple contiguous linear sequence up to a 3D region
  • Memory interface supports a wide range of memory
    systems: DRAM, SDRAM, Video RAM, and SRAM
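
A multidimensional transfer can be pictured as three nested counters with a pitch per dimension. The sketch below generates the byte addresses for such a region. The parameter names and layout are hypothetical and do not reflect the transfer controller's actual packet format.

```python
from typing import Iterator, Tuple


def region_addresses(base: int,
                     counts: Tuple[int, int, int],
                     pitches: Tuple[int, int]) -> Iterator[int]:
    """Yield the byte addresses of a 3D region.

    counts  = (bytes per line, lines per plane, planes)
    pitches = (line pitch, plane pitch), both in bytes
    """
    n_bytes, n_lines, n_planes = counts
    line_pitch, plane_pitch = pitches
    for plane in range(n_planes):
        for line in range(n_lines):
            start = base + plane * plane_pitch + line * line_pitch
            yield from range(start, start + n_bytes)


# A contiguous linear sequence is the degenerate case: one line, one plane.
linear = list(region_addresses(0x1000, (4, 1, 1), (0, 0)))
# A 2-byte x 2-line x 2-plane block inside a larger buffer.
block = list(region_addresses(0, (2, 2, 2), (10, 100)))
```

A 2D case such as copying an image sub-rectangle falls out by setting the plane count to 1.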

13
TMS320C8x Video Controller (C80 only)
  • Provides simultaneous control over two
    independent capture or display systems and frame
    grabber or frame buffer image storage
  • Dual-frame timers
  • Programmable timing and control registers
  • Programmable line interrupt to MP

14
TMS320C8x Development Tools
  • C-like compilers and assemblers for both master
    and parallel processors
  • Register allocator: identifies live and free
    registers, allows using variable names in assembly
    code, and assigns a specific register to each
    variable
  • Code compactor: converts straight-line assembly
    code into parallel code
  • Optimization can be done by hand for
    time-critical parallel code

15
TMS320C8x Execution Time for 256-Point FFT
  • "-C" indicates performance with the cache
    pre-loaded
  • Benchmark results for the TMS320C80 are for one of
    the on-chip DSP processors

16
MPACT-R3600 -2 Overview
  • VLIW CPU
  • Multimedia ISA
  • Hardware/Software relationship
  • Variety of high speed I/O interfaces

17
MPACT-R3600 -2 CPU Datapath
  • Data size: multiple of 9 bits
  • 512-entry, 72-bit register file with 4 read and 4
    write ports
  • ALU1 - shift and align
  • ALU2 - add and logic
  • ALU3 - arithmetic and logic
  • ALU4 - stage 1 of multiplication
  • ALU5 - motion estimation
  • Full crossbar between ALU outputs, inputs,
    register read and write ports

18
MPACT-R3600 Datapath
  • SRAM register file (512 entries) with 4 read ports
    and 4 write ports
  • ALU group 1: shift and align
  • ALU group 2: add and logic
  • ALU group 3: multiply and add
  • ALU group 4: stage 1 of multiply
  • ALU group 5: motion estimation

19
MPACT-R3600 -2 Multimedia ISA
  • Issues two instruction pack of 72 bits every
    cycle
  • Data forwarding from one ALU to another
  • Vector instructions (length up to 255)
  • Multimedia data bytes of 9, 18, 27, and 36 bits
  • Supports signed, unsigned, and saturating
    arithmetic
  • MPACT 2 includes single-precision FP for 3D
    graphics
  • Flow control: branch, jump, and call
  • Special-purpose instructions: motion estimation,
    IDCT, FFT butterfly, etc.

20
MPACT-R3600 -2 Hardware/Software Relationship
  • Requires a host x86 CPU
  • Mediaware: uses standard APIs
  • RM: Resource Manager running under Windows
  • MRK: MPACT real-time kernel
  • Nearest-deadline scheduling algorithm
  • Interrupt-driven kernel with a worst-case context
    switch time of 4 µs
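
Nearest-deadline scheduling always runs the ready task whose deadline comes first. The sketch below is a simplified, non-preemptive model with all tasks ready at time zero. The task names and timings are invented for illustration, and nothing here reflects how the MPACT kernel is actually implemented.

```python
import heapq
from typing import List, Tuple


def run_nearest_deadline(tasks: List[Tuple[str, int, int]]) -> List[Tuple[str, int, bool]]:
    """Run tasks in order of nearest deadline.

    tasks: (name, deadline, run_time), all ready at time 0.
    Returns (name, finish_time, met_deadline) in execution order.
    """
    heap = [(deadline, name, run_time) for name, deadline, run_time in tasks]
    heapq.heapify(heap)  # the smallest deadline is always popped first
    clock = 0
    schedule = []
    while heap:
        deadline, name, run_time = heapq.heappop(heap)
        clock += run_time
        schedule.append((name, clock, clock <= deadline))
    return schedule


# Hypothetical media tasks with deadlines and run times in milliseconds
demo_schedule = run_nearest_deadline(
    [("video", 33, 10), ("audio", 10, 3), ("modem", 20, 5)]
)
```

A real-time kernel would also preempt the running task when a task with a nearer deadline becomes ready; that is omitted here to keep the sketch short.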

21
MPACT-R3600 -2 Hardware/Software Relationship

22
MPACT-R3600 -2 High-speed I/O interface
  • PCI bus or AGP (Accelerated Graphics Port): x86
    host CPU bus, 66 MHz => 264 Mbytes/s
  • Rambus memory interface: 300 MHz bus (9 bits
    wide), both edges = 600 Mbytes/s; requires 2-4
    Mbytes
  • Display controller: 24-bit RAMDAC; high
    resolution up to 1280x1024 at 24-bit or 1600x1200
    at 16-bit
  • Video interface: accepts NTSC and PAL format
    video, or DVD input through PCI or AGP
  • Programmable peripheral I/O interface: supports
    connection to several devices
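
Both bandwidth figures on this slide are clock rate times transfer width times transfers per clock. A minimal sketch of that arithmetic, assuming a 4-byte-wide host bus transfer (the width that yields 264 Mbytes/s at 66 MHz) and one byte per Rambus transfer on each clock edge:

```python
def peak_mbytes_per_s(clock_mhz: int, bytes_per_transfer: int,
                      transfers_per_clock: int = 1) -> int:
    """Peak bandwidth in Mbytes/s for a simple synchronous bus."""
    return clock_mhz * bytes_per_transfer * transfers_per_clock


# Host bus: 66 MHz, assumed 4 bytes per transfer
host_bus = peak_mbytes_per_s(66, 4)
# Rambus: 300 MHz, one byte wide, data on both clock edges
rambus = peak_mbytes_per_s(300, 1, transfers_per_clock=2)
```

These are peak numbers; protocol overhead and bus turnaround reduce what an application actually sees.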

23
MPACT-R3600 -2 Architecture trade-offs
  • High-speed I/O to move data in and out
  • No data cache but a large register file:
    multimedia data has poor locality
  • Based on standard Microsoft Windows APIs
    (Application Program Interfaces): no proprietary
    API
  • Pin count vs. high memory bandwidth/low latency:
    RDRAM is chosen
  • PC and consumer market

24
TMS320C6x VelociTI Overview
  • VLIW DSPs
  • TMS320C62x: fixed-point DSPs
  • TMS320C67x: floating-point DSPs

25
TMS320C6x VelociTI Key features
  • Issues and executes up to 8 instructions every
    cycle
  • Load/store architecture
  • 32-bit RISC instructions / 32-bit data
  • Conditional instructions: reduce costly branching
    and increase parallelism for higher sustained
    performance
  • Instruction packing: reduces code size, program
    fetches, and power consumption

26
TMS320C6x VelociTI Datapath

27
TMS320C6x VelociTI Datapath
  • Two register files: 16 x 32 bits each
  • Each supports 10 simultaneous reads and 6 writes
  • Two identical sets of functional units: 8 units
    in total
  • L: logic functions, bit counting, and add/sub
  • S: shifting, bit manipulation, branch/control,
    and add/sub
  • D: addressing and add/sub
  • M: multiplication
  • Grouping of functional units reduces the number
    of register ports

28
TMS320C6x VelociTI Instruction set
  • 32-bit RISC-like opcode format
  • creg: conditional register
  • z: zero or nonzero
  • dst: destination
  • src1/2: source 1 and 2
  • cst: constant
  • x: use cross path for src2
  • s: side A or B for destination
  • op: operation
  • Instructions can be conditioned on the value of A1,
    A2, B0, B1, or B2
  • Each instruction takes 1 cycle to execute, except
    double-precision operations in C67x

29
TMS320C6x VelociTI Instruction packing
  • Fetch packet: eight 32-bit instructions are fetched
    simultaneously

30
TMS320C6x VelociTI Instruction packing
  • Execute packet: indicated by the p-bit (parallel
    bit)
  • 1: in parallel; 0: not in parallel
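
The p-bits chain the instructions of a fetch packet into execute packets: a p-bit of 1 means the next instruction issues in the same cycle, and a 0 ends the current execute packet. A minimal sketch of that grouping, taking the p-bit to be bit 0 of each 32-bit instruction word (the helper name and the sample words are made up for illustration):

```python
from typing import List


def split_execute_packets(fetch_packet: List[int]) -> List[List[int]]:
    """Group the instruction words of a fetch packet into execute packets."""
    packets: List[List[int]] = []
    current: List[int] = []
    for word in fetch_packet:
        current.append(word)
        if word & 1 == 0:        # p-bit 0: this instruction ends the packet
            packets.append(current)
            current = []
    if current:                  # trailing chain left open by a final p-bit of 1
        packets.append(current)
    return packets


# Eight dummy instruction words whose low bits carry the p-bits 1,1,0,1,0,0,1,0
p_bits = [1, 1, 0, 1, 0, 0, 1, 0]
words = [(index << 1) | p for index, p in enumerate(p_bits)]
packet_sizes = [len(packet) for packet in split_execute_packets(words)]
```

With all p-bits set except the last, the whole fetch packet issues in a single cycle; with all p-bits clear, it takes eight cycles.
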
31
TMS320C6x VelociTI Pipeline
  • Deep pipeline: 3 stages, 16 phases
  • Fetch: 4 phases (PG, PS, PW, PR)
  • Decode: 2 phases (DP, DC)
  • Execute: up to 10 phases (E1 to E10)
  • No stalls except on a cache miss or external access
  • Performs load after store to the same memory
    location
  • Each branch takes 5 cycles, whether taken or not
    taken

32
TMS320C6x VelociTI Memory Hierarchy
  • Internal program memory is configurable as mapped
    memory or direct-mapped cache: 16 K 32-bit
    instructions or 2 K 256-bit fetch packets
  • Internal data memory: 2 blocks of 4 interleaved
    8-Kbyte banks
  • DMA controller: 800 Mbytes/s peak; transfers
    between on-chip memories, peripherals, and
    external memory
  • EMIF (External Memory Interface): 800 Mbytes/s
    peak; supports SBSRAM, SDRAM, etc.
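
The two descriptions of the program memory are the same capacity counted in different units, and the data memory works out to the same size. A short check of the arithmetic (constant names are chosen here for illustration):

```python
K = 1024
INSTRUCTION_BITS = 32
INSTRUCTIONS_PER_FETCH_PACKET = 8

# One fetch packet is eight 32-bit instructions.
fetch_packet_bits = INSTRUCTIONS_PER_FETCH_PACKET * INSTRUCTION_BITS

# Program memory counted as instructions and as fetch packets, in bytes.
program_bytes_by_instruction = 16 * K * INSTRUCTION_BITS // 8
program_bytes_by_packet = 2 * K * fetch_packet_bits // 8

# Data memory: 2 blocks x 4 banks x 8 Kbytes.
data_bytes = 2 * 4 * 8 * K
```

All three quantities come to 64 Kbytes.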

33
TMS320C6x VelociTI Peripherals
  • McBSP (Multichannel Buffered Serial Port): two
    independent 100 Mbits/s full-duplex serial ports;
    supports standards such as ST-BUS and the AC97
    audio codec
  • Timers: two programmable 32-bit timers
  • Host port interface: 100 Mbytes/s 16-bit
    bidirectional port to standard processors
  • Power-down modes 1, 2, 3: reduce power consumption

34
TMS320C6x VelociTI Programming
  • Includes C compiler, assembler, optimizer, and
    debuggers in a software simulator
  • 72-82% efficiency compared to handwritten
    assembly code
  • Optimization techniques: intrinsic functions in
    the C compiler, software pipelining, and If..Else
    and Case conversion to conditional instructions
  • Data types (by compiler): long 40 bits, int 32
    bits, short 16 bits, char 8 bits
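
Converting an If..Else into conditional instructions means issuing both arms, each guarded by a predicate, so that no branch is needed. The sketch below models this in software. The `guarded` helper is a hypothetical stand-in for a predicated instruction and is not how the compiler represents it.

```python
def guarded(predicate: bool, new_value: int, old_value: int) -> int:
    """Model a conditional instruction: the write happens only if the predicate holds."""
    return new_value if predicate else old_value


def abs_diff_branching(a: int, b: int) -> int:
    """Original form: control flow picks one arm."""
    if a > b:
        return a - b
    return b - a


def abs_diff_predicated(a: int, b: int) -> int:
    """If-converted form: both arms are issued, one of them takes effect."""
    cond = a > b
    result = 0
    result = guarded(cond, a - b, result)      # [ cond] SUB a, b -> result
    result = guarded(not cond, b - a, result)  # [!cond] SUB b, a -> result
    return result
```

On a VLIW machine the two guarded subtracts can sit in the same execute packet, which removes the branch and its delay from the loop body.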

35
Target Applications
  • Video - DVD, MPEG 1 & 2 decoding
  • Audio - Dolby AC-3, 3D Audio, MPEG Decode,
    Wavetable Synthesis
  • Graphics - 2D & 3D acceleration
  • Communication - vocoder; ADSL, fax/modem V.34,
    56k; echo canceller
  • Desktop videoconferencing - H.320 on ISDN, H.324
    on POTS (Plain Old Telephone Service)

36
H.324 on TMS320C82 Overview
  • ITU-T H.324: low-bit-rate multimedia
    teleconferencing on circuit-switched networks;
    includes
  • G.723: audio coding at 5.3-6.4 kbps; requires 18-20
    fixed-point MIPS
  • H.263: video coding based on H.261 with some
    enhancements
  • H.223: MUX/DEMUX control
  • H.245: control protocol
  • V.34: modem up to 33.6 kbps
  • Other related standards: H.320 (ISDN), H.323
    (LAN), and H.310 (ATM/B-ISDN)

37
H.324 on TMS320C82 Overview

38
H.324 on TMS320C82 Task Partitioning
  • Video processing (H.263)
  • Encoding: pre-processing (MP), motion estimation
    (PP0), DCT (PP0)
  • Decoding: Huffman or arithmetic decode, IDCT, etc.
    (PP0); post-processing (PP0)
  • Audio processing and AEC (Acoustic Echo
    Cancellation) on PP1
  • G.723: encoding 22 MIPS, decoding 3 MIPS
  • AEC: LMS algorithm, up to 64-ms echo, 10 MIPS
  • Modem V.34 on PP1: 20 MIPS
39
H.324 on TMS320C82 Task Partitioning

40
Current Multimedia Processors
  • Digital Signal Processor => Multimedia Processor
  • Employ RISC instruction sets and pipelining to
    reach higher clock frequencies
  • Perform operations on single and multiple bytes
    of data
  • Exploit static instruction-level parallelism
    (ILP) rather than dynamic ILP
  • Are increasingly concerned with data movement and
    I/O interfaces
  • Pay more attention to low-power design
  • PC/consumer market is one of their primary targets

41
Current Multimedia Processors

42
References
  • TMS320C8x
  • J. Golston, Single-chip H.324 video
    conferencing, IEEE Micro, August 1996, pp. 42-50
  • Texas Instruments, TMS320C80 Data Sheet, 1997,
    at http://www.ti.com/../sprs023b.pdf
  • P. Lapsley and G. Blalock, How to estimate DSP
    processor performance, IEEE Spectrum, July 1996,
    pp. 74-78
  • HTML file: http://www.bdti.com/../wpeval.html
  • MPACT
  • P. Kalapathy, Hardware-software interfacing on
    Mpact, IEEE Micro, March 1997, pp. 20-26
  • Presentation file:
    http://infopad.eecs.berkeley.edu/HotChips8/
  • Chromatic Research, MPACT2 Preliminary Data
    Sheet, Feb. 1998
  • http://www.mpact.com/../mpact2.pdf
  • Toshiba, TOSHIBA ANNOUNCES ITS NEXT-GENERATION
    MPACT MEDIA PROCESSOR, September 22, 1997
  • http://www.toshiba.com/taec/../to-628.htm

43
References
  • TMS320C6x
  • N. Seshan, High VelociTI Processing, IEEE Signal
    Processing Magazine, March 1998, pp. 86-101
  • TMS320C6x data sheet
  • Trimedia
  • G. A. Slavenburg, The Trimedia TM-1 PCI VLIW
    Mediaprocessor, IEEE Hot Chips 8 Symposium on
    High-Performance Chips, Aug. 1996
  • http://infopad.eecs.berkeley.edu/HotChips8/
  • MSP
  • L. T. Nguyen, M. Mohamed, H. Park, Y. Pal, R.
    Wong, A. Qureshi, P. Psong, F. Valesco, H. D.
    Truong, C. Reader, Multi-media Signal Processor
    (MSP) Summary, IEEE Hot Chips 8 Symposium on
    High-Performance Chips, Aug. 1996
  • http://infopad.eecs.berkeley.edu/HotChips8/
  • H.324
  • D. Lindbergh, The H.324 multimedia communication
    standard, IEEE Communications Magazine, December
    1996, pp. 46-51
  • K. Rijkse, H.263 Video coding for low-bit-rate
    communication, IEEE Communications Magazine,
    December 1996, pp. 42-45

44
Useful links
  • CPU Information Center
  • http://infopad.eecs.berkeley.edu/CIC/
  • Microprocessor Report
  • http://www.chipanalyst.com/q/
  • Berkeley Design Technology Inc.
  • http://www.bdti.com/
  • Peter Pirsch's research group
  • http://www.mst.uni-hannover.de/