Software Optimization of MPEG Audio Layer-III for a 32-Bit RISC Processor

1
Software Optimization of MPEG Audio Layer-III for
a 32-Bit RISC Processor
Wonchul Lee, Kisun You and Wonyong Sung
  • School of Electrical Engineering
  • Seoul National University
  • Wonyong Sung
  • Dec. 17th, 2002

2
Contents
  • Introduction
  • Architecture features
  • IMDCT and Subband synthesis optimization
  • Assembly language optimization
  • Using block data transfer
  • ARM7/ARM9 based implementation
  • Results
  • Conclusion

3
Introduction(1)
  • MPEG1/2 Layer-III (MP3): a compressed audio
    standard and a computation-intensive DSP
    algorithm
  • Custom VLSI-based implementation: high speed and
    low power, but limited flexibility
  • Software implementation: format and application
    flexibility
  • Applicable to multi-standard portable players:
    not only MP3, but also AC3, AAC and WMA

4
Introduction(2)
  • Implementation for an ARM RISC processor
  • A RISC processor is at a disadvantage when
    implementing DSP algorithms
  • Needs software optimization methods that use
    architecture features
  • Reducing the number of cycles and the number of
    memory operands
  • Converting the floating-point version to a
    fixed-point version
  • Automatic scaling method, AUTOSCALER (Kum, Sung)
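
The floating-point to fixed-point conversion mentioned above can be illustrated with a minimal C sketch. It assumes a hand-picked Q15 format (15 fractional bits); the slides do not state which formats AUTOSCALER selected, so the format and the helper names `q15_from_double` and `q15_mul` are hypothetical.

```c
#include <stdint.h>

/* Assumed format for illustration: Q15 (1 sign bit, 15 fractional bits). */
#define Q15_SHIFT 15

/* Convert a real value in [-1, 1) to Q15, rounding to nearest. */
static int16_t q15_from_double(double x)
{
    double scaled = x * (double)(1 << Q15_SHIFT);
    return (int16_t)(scaled >= 0.0 ? scaled + 0.5 : scaled - 0.5);
}

/* Multiply two Q15 values. The 32-bit product has 30 fractional bits,
 * so it is rounded and shifted right by 15 to return to Q15.
 * Note: right-shifting a negative value is implementation-defined in C;
 * common ARM compilers perform an arithmetic shift. */
static int16_t q15_mul(int16_t a, int16_t b)
{
    int32_t product = (int32_t)a * (int32_t)b;
    return (int16_t)((product + (1 << (Q15_SHIFT - 1))) >> Q15_SHIFT);
}
```

For example, `q15_mul(q15_from_double(0.5), q15_from_double(0.5))` yields 8192, which is 0.25 in Q15. An automatic scaling tool would choose the shift per signal from its observed range rather than fixing it by hand as done here.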

5
Architecture
  • ARM architecture
  • No floating point unit
  • 32×8-bit multiplier accuracy

6
Architecture Features
  • 32×8-bit multiplier accuracy
  • Not good for executing multiplication-intensive
    DSP programs
  • Compensate for this weakness by using ARM features
  • ARM architecture features
  • Conditional execution: reduces the control
    overhead in MP3 decoders significantly
  • 32-bit barrel shifter: executes a shift or
    rotate together with an ALU operation (scaling,
    multiplication by 2)
  • Multiple load/store instructions: reduce total
    memory access time
  • Software optimization methods: loop unrolling,
    loop termination, circular addressing, arranging
    data
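
Of the software methods listed, circular addressing is the easiest to sketch in C. The example below is an illustration only: the buffer size, the type `ring_t` and the function names are assumptions, not the authors' code. It shows how a power-of-two buffer lets a block of new samples be "shifted in" by moving a head index instead of copying the whole buffer.

```c
#include <stdint.h>

#define RING_SIZE 1024u              /* assumed size; must be a power of two */
#define RING_MASK (RING_SIZE - 1u)

typedef struct {
    int32_t  data[RING_SIZE];
    uint32_t head;                   /* index of the newest block */
} ring_t;

/* Insert n new samples in front of the existing ones. Only the head
 * index moves; the older samples stay where they are. */
static void ring_push_block(ring_t *r, const int32_t *src, uint32_t n)
{
    r->head = (r->head - n) & RING_MASK;   /* unsigned wrap-around is defined */
    for (uint32_t i = 0; i < n; i++)
        r->data[(r->head + i) & RING_MASK] = src[i];
}

/* Read the sample that is `offset` positions behind the newest one. */
static int32_t ring_at(const ring_t *r, uint32_t offset)
{
    return r->data[(r->head + offset) & RING_MASK];
}
```

The masking step is a single AND, so the wrap-around costs one ALU operation per access instead of a compare and branch.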

7
MP3 decoding algorithms
  • Processing intensive: IMDCT, subband synthesis
  • Control intensive: dequantization, Huffman
    decoding

Major processing parts: about 84%, hence the focus of optimization
8
IMDCT and Subband Synthesis Optimization
  • Employed the Britanak and Rao IMDCT algorithm
  • It needs only a small number of multiplications
  • Good for the ARM CPU, since it doesn't have a
    full-precision multiplier
  • MPEG1/2 Audio standard: N = 36 for long blocks,
    N = 12 for short blocks
  • <ISO reference vs. Britanak and Rao's
    algorithm>
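
For context, the direct-form IMDCT defined in the MPEG audio standard, which a fast algorithm replaces, can be written as follows. The multiplication count is for naive evaluation of this sum with precomputed cosines; the slides do not give the count for the Britanak and Rao algorithm, so none is stated here.

```latex
x_i = \sum_{k=0}^{N/2-1} X_k \cos\!\left( \frac{\pi}{2N}
      \left( 2i + 1 + \frac{N}{2} \right) (2k + 1) \right),
\qquad i = 0, 1, \dots, N-1

% Naive cost: N outputs, each a sum of N/2 products
\text{multiplications} = N \cdot \frac{N}{2} =
\begin{cases}
  648 & N = 36 \ \text{(long block)} \\
  72  & N = 12 \ \text{(short block)}
\end{cases}
```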

9
Assembly Language Optimization
  • Using block transfer instructions, LDM, STM
  • Rarely found in compiler-generated code!
  • Accessing the internal memory and cache:
    N, S, I = 1 cycle
  • Accessing external DRAM: N >> 1 cycle

S: Sequential cycle, N: Non-sequential cycle, I:
Internal cycle
Block transfer of 15 registers: 14S + 2N cycles
(Store), 15S + 1N + 1I cycles (Load)
15 single transfers: 2N × 15 cycles (Store),
(1S + 1N + 1I) × 15 cycles (Load)
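
The cycle counts on this slide can be turned into a small calculator. This is a sketch: the slide gives the figures for 15 registers, and generalizing them to an arbitrary register count (one fewer S cycle than registers for a block store, one S cycle per register for a block load) is an assumption that follows the same pattern. The type `bus_cycles_t` and the function names are hypothetical.

```c
/* Cost, in clock cycles, of each bus cycle type. */
typedef struct {
    unsigned s;   /* sequential cycle */
    unsigned n;   /* non-sequential cycle */
    unsigned i;   /* internal cycle */
} bus_cycles_t;

/* Block store of `regs` registers: (regs - 1)S + 2N. */
static unsigned stm_cycles(unsigned regs, bus_cycles_t c)
{
    return (regs - 1u) * c.s + 2u * c.n;
}

/* Block load of `regs` registers: regs*S + 1N + 1I. */
static unsigned ldm_cycles(unsigned regs, bus_cycles_t c)
{
    return regs * c.s + c.n + c.i;
}

/* `count` single-word stores: 2N each. */
static unsigned str_cycles(unsigned count, bus_cycles_t c)
{
    return count * 2u * c.n;
}

/* `count` single-word loads: 1S + 1N + 1I each. */
static unsigned ldr_cycles(unsigned count, bus_cycles_t c)
{
    return count * (c.s + c.n + c.i);
}
```

With internal memory (S = N = I = 1 cycle) and 15 registers, the block store takes 16 cycles against 30 for single stores, and the block load takes 17 cycles against 45. The gap widens for external DRAM, where an N cycle costs much more than one clock.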
10
Instructions vs. Clock cycles
  • Optimization?
  • Using block transfer instructions, LDM, STM
  • IMDCT: instructions decreased by 28%, clock
    cycles decreased by 21%
  • Subband: instructions decreased by 34%, clock
    cycles decreased by 35%

See Instruction types
11
Instruction types
  • The subband part is more efficient when using
    block transfer instructions
  • Load/store reduction!
  • Reducing memory access operands!

12
Number of Memory Accesses
  • Used to estimate external memory accesses,
    cache performance, power consumption, etc.
  • ARM7 architecture-based implementation: unified
    cache
  • ARM9 architecture-based implementation:
    separate caches
  • Cache simulator: Dinero IV

13
ARM7 Architecture-based Implementation
  • ARM7: 8 KB unified cache
  • Performance degradation due to cache misses can
    be significant
  • No support for write allocation
  • Block transfer instructions improve the spatial
    locality of data, which reduces the miss ratio
  • => reduces the number of accesses to/from the
    external DRAM
[Figure labels: 1.7, 13.5]
14
ARM9 Architecture-based Implementation
  • ARM9: separate 16 KB instruction and 16 KB data
    caches
  • Supports write allocation

[Figure: ARM7 (unified 8 KB): instruction cache miss 13.5, data cache miss 1.7]
15
Miss penalty
  • Miss penalty between the main processor and the
    external SDRAM
  • SDRAM model
  • 8 clock cycles of latency for the first word read
  • 5 clock cycles for the first word write
  • 1 clock cycle for each successive memory read or
    write
  • Assumption: SDRAM clock freq. = CPU clock freq.
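
The SDRAM model above translates directly into a burst-cost formula. The sketch below uses only the three latencies from the slide; the function names are hypothetical, and the burst lengths in the usage note are assumptions, since the slides do not state the cache line size.

```c
/* Cycles to read `words` consecutive words: 8 for the first word,
 * 1 for each successive word. */
static unsigned sdram_read_cycles(unsigned words)
{
    return words ? 8u + (words - 1u) : 0u;
}

/* Cycles to write `words` consecutive words: 5 for the first word,
 * 1 for each successive word. */
static unsigned sdram_write_cycles(unsigned words)
{
    return words ? 5u + (words - 1u) : 0u;
}
```

For an assumed 4-word burst, a read costs 11 cycles and a write 8 cycles, whereas four isolated single-word reads would cost 4 × 8 = 32 cycles. This is why grouping accesses into bursts matters once data has to come from external memory.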

[Figure: number of clock cycles for different SDRAM bandwidths (16/32-bit)]
[Figure: number of clock cycles for different cache sizes (ARM9)]
[Figure labels: 72, 23]
16
Performance
  • Tested using the ISO reference standard, and
    verified using 10 popular pop songs
  • With an ARM7TDMI at 60 MHz
  • Mono, stereo, joint stereo
  • Sampling freq.: 44.1 kHz, 22.05 kHz
  • Bit rate: 32 kbps to 192 kbps
  • Average SNR of 94.24 dB
  • Average of 16.5 MIPS

17
Future Work
  • We need more accurate methods for modeling the
    interaction between DRAM and CPU, including the
    internal cache memory.
  • Apply published memory optimization techniques
    to reduce the read/write memory of the MP3
    decoder and the data cache miss ratio, which is
    much higher than the instruction cache miss
    ratio
  • Code optimized in assembly language is difficult
    to reuse on other systems.
  • Need higher-level algorithm optimization
  • With a compiler?
  • With an algorithm?
  • With a transformation program?

18
Conclusion
  • Implementation of the MPEG1/2 Layer-III decoding
    algorithm on ARM7 and ARM9 based systems
  • Modifying the code to increase the locality of
    memory references, then applying block transfer
    instructions, reduced instruction demand by 26%
    and data demand by 8%.
  • The overhead of data transfer should be
    considered very seriously for real-time and
    low-power implementation