Title: Software Optimization Of MPEG Audio Layer-III For A 32-Bit RISC Processor Wonchul Lee, Kisun You and W
1 Software Optimization Of MPEG Audio Layer-III For A 32-Bit RISC Processor
Wonchul Lee, Kisun You and Wonyong Sung
- School of Electrical Engineering
- Seoul National University
- Wonyong Sung
- Dec. 17th, 2002
2 Contents
- Introduction
- Architecture features
- IMDCT and Subband synthesis optimization
- Assembly language optimization
- Using block data transfer
- ARM7/ARM9 based implementation
- Results
- Conclusion
3 Introduction (1)
- MPEG1/2 Layer-III (MP3)
- Compressed audio standard
- Computation-intensive DSP algorithm
- Custom VLSI-based implementation
- High speed, low power
- Limited flexibility
- Software implementation
- Format and application flexibility
- Applicable to multi-standard portable players
- Not only MP3, but also AC3, AAC and WMA
4 Introduction (2)
- Implementation for an ARM RISC processor
- A RISC processor is at a disadvantage when implementing DSP algorithms
- Need software optimization methods that use architecture features
- Reducing the number of cycles and the number of memory operands
- Converting the floating-point version to a fixed-point version
- Automatic scaling method, AUTOSCALER (Kum, Sung)
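As a rough illustration of what the floating-point to fixed-point conversion involves, the sketch below represents coefficients in a Q15 format and multiplies them with integer arithmetic only. The Q15 choice and the names `q15_from_double` and `q15_mul` are assumptions made for this example; it is not the AUTOSCALER tool and does not show how the decoder's actual word lengths were chosen.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical Q15 format: value = raw / 2^15, held in a 32-bit integer. */
#define Q15_SHIFT 15
#define Q15_ONE   (1 << Q15_SHIFT)

/* Convert a coefficient in (-1, 1) to Q15 with rounding.
 * On a CPU without an FPU this would be done offline to build tables. */
static int32_t q15_from_double(double x)
{
    assert(x > -1.0 && x < 1.0);
    return (int32_t)(x * Q15_ONE + (x >= 0 ? 0.5 : -0.5));
}

/* Multiply two Q15 values. The 64-bit intermediate avoids overflow and
 * adding half an LSB before the shift rounds the result.
 * Assumes >> on a negative value is an arithmetic shift, which holds on
 * common compilers but is implementation-defined in C. */
static int32_t q15_mul(int32_t a, int32_t b)
{
    int64_t p = (int64_t)a * (int64_t)b;
    return (int32_t)((p + (1 << (Q15_SHIFT - 1))) >> Q15_SHIFT);
}
```

For instance, `q15_mul(q15_from_double(0.5), q15_from_double(0.5))` yields the Q15 encoding of 0.25. The scaling shift is the kind of operation the barrel shifter can fold into an ALU instruction.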
5 Architecture
- ARM architecture
- No floating point unit
- 32x8-bit multiplier accuracy
6 Architecture Features
- 32x8-bit multiplier accuracy
- Not good for executing multiplication-intensive DSP programs
- Compensate for this drawback by using ARM features
- ARM architecture features
- Conditional execution
- Reduces the control overhead in MP3 decoders significantly
- 32-bit barrel shifter
- Executes shifts and rotations simultaneously with ALU operations
- Scaling, multiplication by 2
- Multiple load/store instructions
- Reduce total memory access time
- Software optimization methods
- Loop unrolling, loop termination, circular addressing, arranging data
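Two of the software methods listed above, circular addressing and loop unrolling, can be sketched in C as follows. The buffer size of 1024 entries and the block size of 64 are assumed here from the synthesis filterbank of the MPEG audio standard, not from these slides, and the names `vbuf_t`, `vbuf_push` and `vbuf_at` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define VBUF_SIZE 1024u              /* power of two, so wrap-around is a mask */
#define VBUF_MASK (VBUF_SIZE - 1u)
#define BLOCK     64u                /* samples inserted per step */

typedef struct {
    int32_t  data[VBUF_SIZE];
    uint32_t head;                   /* physical index of the newest sample */
} vbuf_t;

/* Insert a new block without moving the 960 older samples:
 * only the head index steps backwards and wraps through the mask. */
static void vbuf_push(vbuf_t *v, const int32_t block[BLOCK])
{
    assert(v != 0 && block != 0);
    v->head = (v->head - BLOCK) & VBUF_MASK;
    /* Unrolled by 4 to cut loop-control overhead. head is always a
     * multiple of BLOCK, so the block never straddles the buffer end. */
    for (uint32_t i = 0; i < BLOCK; i += 4) {
        v->data[v->head + i + 0] = block[i + 0];
        v->data[v->head + i + 1] = block[i + 1];
        v->data[v->head + i + 2] = block[i + 2];
        v->data[v->head + i + 3] = block[i + 3];
    }
}

/* Read logical element i (0 = newest), as a physically shifted buffer would hold it. */
static int32_t vbuf_at(const vbuf_t *v, uint32_t i)
{
    return v->data[(v->head + i) & VBUF_MASK];
}
```

Compared with physically shifting the buffer on every step, this replaces 960 loads and 960 stores with one index update, which is the kind of memory-operand reduction the slides aim for.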
7 MP3 decoding algorithms
- Processing intensive
- IMDCT, Subband synthesis
- Control intensive
- Dequantization, Huffman decoding
Major processing parts: about 84% => optimization
8 IMDCT and Subband Synthesis Optimization
- Employed the Britanak and Rao IMDCT algorithm
- An algorithm with a small number of multiplications
- Good for the ARM CPU since it doesn't have a full-precision multiplier
- MPEG1/2 Audio standard
- N = 36 for long blocks, N = 12 for short blocks
- < ISO reference vs. Britanak and Rao's algorithm >
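For context, the IMDCT in the MPEG audio standard has the form below (quoted from the standard's definition, not from these slides). Evaluating it directly is the baseline that a fast algorithm improves on.

```latex
x_i = \sum_{k=0}^{N/2-1} X_k \,
      \cos\!\left( \frac{\pi}{2N}\,\bigl(2i + 1 + \tfrac{N}{2}\bigr)\,(2k + 1) \right),
\qquad i = 0, 1, \dots, N-1
```

Direct evaluation takes $N \cdot N/2$ multiplications per block: 648 for $N = 36$ and 72 for $N = 12$. The multiplication count of the Britanak and Rao algorithm is not stated in this text, so no figure is given for it here.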
9 Assembly Language Optimization
- Using block transfer instructions, LDM, STM
- Rarely found in compiler-generated code!
- Accessing the internal memory and cache
- N, S, I = 1 cycle
- Accessing external DRAM
- N >> 1 cycle
S: Sequential cycle, N: Non-sequential cycle, I: Internal cycle
Block transfer of 15 words: 14S + 2N cycles (Store), 15S + 1N + 1I cycles (Load)
15 single transfers: 2N x 15 cycles (Store), (1S + 1N + 1I) x 15 cycles (Load)
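The cycle expressions on this slide can be turned into a small cost model. The sketch below generalizes the 15-word case to an arbitrary register count, which is an assumption based on the pattern of the slide's formulas. The type `cycle_cost_t` and the function names are hypothetical, and the per-cycle costs are inputs to be chosen by the user.

```c
#include <assert.h>

/* Cost in clock cycles of each bus cycle type (model inputs). */
typedef struct {
    unsigned s;   /* sequential cycle */
    unsigned n;   /* non-sequential cycle */
    unsigned i;   /* internal cycle */
} cycle_cost_t;

/* One STM of `regs` registers: (regs - 1) S + 2 N. */
static unsigned stm_cycles(unsigned regs, cycle_cost_t c)
{
    assert(regs >= 1);
    return (regs - 1) * c.s + 2 * c.n;
}

/* One LDM of `regs` registers: regs S + 1 N + 1 I. */
static unsigned ldm_cycles(unsigned regs, cycle_cost_t c)
{
    assert(regs >= 1);
    return regs * c.s + c.n + c.i;
}

/* `count` separate single-word stores: 2 N each. */
static unsigned str_cycles(unsigned count, cycle_cost_t c)
{
    return count * (2 * c.n);
}

/* `count` separate single-word loads: 1 S + 1 N + 1 I each. */
static unsigned ldr_cycles(unsigned count, cycle_cost_t c)
{
    return count * (c.s + c.n + c.i);
}
```

With every cycle type costing 1 clock (internal memory or cache), 15 words take 16 cycles with one STM versus 30 with single stores, and 17 cycles with one LDM versus 45 with single loads. The gap widens when N is much larger than 1, as with external DRAM.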
10 Instructions vs. Clock cycles
- Optimization?
- Using block transfer instructions, LDM, STM
- IMDCT
- Instructions: decreased by 28%
- Clock cycles: decreased by 21%
- Subband
- Instructions: decreased by 34%
- Clock cycles: decreased by 35%
See Instruction types
11 Instruction types
- The subband part benefits more from block transfer instructions
- Load/store reduction!
- Reducing memory access operands!
12 Number of Memory Accesses
- To determine external memory accesses, cache performance, power consumption, etc.
- ARM7 architecture-based implementation
- Unified cache
- ARM9 architecture-based implementation
- Separate caches
- Cache simulator
- DineroIV
13 ARM7 Architecture-based Implementation
- ARM7
- 8 KB unified cache
- Performance degradation due to cache misses can be significant
- No support for write allocation
- Block transfer instructions improve the spatial locality of data, which reduces the miss ratio
- => Reduces the number of accesses to/from the external DRAM
(Figure labels: 1.7, 13.5)
14 ARM9 Architecture-based Implementation
- ARM9
- Separate 16 KB instruction and 16 KB data caches
- Supports write allocation
ARM7 (unified 8 KB): instruction cache miss 13.5, data cache miss 1.7
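The effect of write allocation on miss counts can be explored with a toy cache model like the one below. It is a minimal sketch and not a substitute for DineroIV: it assumes a direct-mapped cache, and the 16-byte line size used in the usage note is a hypothetical value. The real ARM7 and ARM9 caches are organized differently, and all names here are illustrative.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define MAX_LINES 1024u

typedef struct {
    uint32_t tags[MAX_LINES];
    bool     valid[MAX_LINES];
    uint32_t num_lines;        /* number of cache lines, <= MAX_LINES */
    uint32_t line_bytes;       /* bytes per line */
    bool     write_allocate;   /* fill a line on a write miss? */
    unsigned long accesses;
    unsigned long misses;
} cache_t;

static void cache_init(cache_t *c, uint32_t size_bytes,
                       uint32_t line_bytes, bool write_allocate)
{
    assert(c != 0 && line_bytes > 0);
    memset(c, 0, sizeof *c);
    c->line_bytes = line_bytes;
    c->num_lines = size_bytes / line_bytes;
    assert(c->num_lines >= 1 && c->num_lines <= MAX_LINES);
    c->write_allocate = write_allocate;
}

/* Simulate one access; returns true on a hit. */
static bool cache_access(cache_t *c, uint32_t addr, bool is_write)
{
    uint32_t line = addr / c->line_bytes;
    uint32_t idx  = line % c->num_lines;
    uint32_t tag  = line / c->num_lines;
    bool hit = c->valid[idx] && c->tags[idx] == tag;

    c->accesses++;
    if (!hit) {
        c->misses++;
        /* Read misses always fill the line; write misses fill it
         * only when the cache supports write allocation. */
        if (!is_write || c->write_allocate) {
            c->valid[idx] = true;
            c->tags[idx] = tag;
        }
    }
    return hit;
}
```

Writing a word and then reading it back misses twice with `cache_init(&c, 8192, 16, false)` but only once with `cache_init(&c, 16384, 16, true)`, since in the second case the write miss brings the line into the cache.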
15 Miss penalty
- Miss penalty between the main processor and the external SDRAM
- SDRAM model
- 8 clock cycles of latency for the first word read
- 5 clock cycles for the first word write
- 1 clock cycle for each successive memory read and write
- Assume
- SDRAM clock freq. = CPU clock freq.
(Figure labels: 72, 23)
Figure: The number of clock cycles according to different SDRAM bandwidth (16/32 bit)
Figure: The number of clock cycles according to different cache size (ARM9)
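The SDRAM model above translates into a simple burst-cost calculation. The sketch below assumes the first-word figure already includes transferring that word, so a burst of w words costs the first-word latency plus (w - 1) cycles. That reading, the function names, and the 16-byte line size in the usage note are assumptions for illustration.

```c
#include <assert.h>

/* Parameters taken from the SDRAM model on the slide. */
enum {
    SDRAM_FIRST_READ  = 8,   /* cycles for the first word read */
    SDRAM_FIRST_WRITE = 5,   /* cycles for the first word write */
    SDRAM_NEXT_WORD   = 1    /* cycles per successive word */
};

/* Cycles to read `words` consecutive bus words in one burst. */
static unsigned sdram_read_cycles(unsigned words)
{
    assert(words >= 1);
    return SDRAM_FIRST_READ + (words - 1) * SDRAM_NEXT_WORD;
}

/* Cycles to write `words` consecutive bus words in one burst. */
static unsigned sdram_write_cycles(unsigned words)
{
    assert(words >= 1);
    return SDRAM_FIRST_WRITE + (words - 1) * SDRAM_NEXT_WORD;
}

/* Read-miss penalty for a cache line of `line_bytes` fetched over a
 * bus that is `bus_bits` wide (16 or 32 in the slide's comparison). */
static unsigned line_fill_cycles(unsigned line_bytes, unsigned bus_bits)
{
    unsigned bus_bytes = bus_bits / 8;
    assert(bus_bytes >= 1 && line_bytes % bus_bytes == 0);
    return sdram_read_cycles(line_bytes / bus_bytes);
}
```

Under these assumptions a 16-byte line fill costs 11 cycles over a 32-bit bus and 15 cycles over a 16-bit bus, which shows why the bus width matters less than the first-word latency for short lines.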
16 Performance
- Tested using the ISO reference standard, and verified using 10 popular pop songs
- With ARM7TDMI @ 60 MHz
- Mono, stereo, joint stereo
- Sampling freq.: 44.1 kHz, 22.05 kHz
- Bit rate: 32 kbps to 192 kbps
- Average SNR of 94.24 dB
- Average of 16.5 MIPS
17 Future Work
- We need more accurate methods for modeling the path between the DRAM and the CPU, including the internal cache memory.
- Using published memory optimization techniques, reduce the R/W memory of the MP3 decoder and the data cache miss ratio, which is much higher than the instruction cache miss ratio
- With assembly language optimization, the code is difficult to reuse on other systems.
- Need higher-level algorithm optimization
- With compiler?
- With algorithm?
- With transformation program?
18 Conclusion
- Implementation of the MPEG1/2 Layer-III decoding algorithm on ARM7- and ARM9-based systems
- By modifying the code to increase the locality of memory references and then applying block transfer instructions, instruction demand was reduced by 26% and data demand by 8%.
- The overhead of data transfer should be considered very seriously for real-time and low-power implementations