Parallelization of Turbo Codec and Performance Analysis


1
Parallelization of Turbo Codec and Performance
Analysis
Jung Soo Oh, Sang Moon Lee and Beongjo Kim
Samsung Advanced Institute of Technology and Samsung Electronics
2
Abstract
  • Turbo Codec
      • Simulation methodology for the standard forward
        error correction scheme in the IMT2000 system (UMTS
        and CDMA2000)
  • Parallelized Turbo Codec
      • Master and slave parallelization mechanism
      • Uneven frame distribution
  • Contributed to
      • obtaining reliable simulation results within a
        short period of time
      • participating in the IMT2000 standardization forum
        thanks to the tremendously high throughput

3
Turbo Codes
  • Forward error correction scheme for data service
    over
      • Forward/Reverse Supplemental Channels (F/R-SCH)
        in CDMA2000 (IMT2000)
      • Uplink/Downlink Transport Channels (TrCH) in UMTS
  • A huge amount of simulation time is needed to develop
    and implement an efficient and cost-effective
    Turbo codec
  • → We tried to parallelize the Turbo Codec program.

4
Concept
[Figure: forward link concept. Labels: multimedia terminal, hand portable telephone, BS, MSC, network; data rates 2 Mbps, 64/128/384 Kbps and 8/64 Kbps; blocks Searcher (Sync) and Turbo Codec.]
5
Key Factors
  • Evaluation factors for Turbo Codec performance
      • BER (Bit Error Rate)
      • FER (Frame Error Rate)
  • The Turbo Codec operates in frame mode, hence it
    generates output codewords frame by frame.
  • In the computational simulation, a random number
    generator produces an arbitrary number of frames.
  • The Turbo Codec simulator adjusts parameters to make
    them more appropriate by analyzing the BER/FER from
    the independent frame transmissions.

6
Main computational algorithm
Set environmental variables and memory allocation
for (frame_number = 0; ; frame_number++) {
    make_turbo_frame()        /* frame generation with a random number */
    turbo_encoder()
    channel()                 /* a random number is used */
    turbo_decoder()
    extract_turbo_frame()
    if (summation of total bit errors > given maximum number of bit errors) then break
}
get global BER and FER
Note: the maximum number of bit errors is assigned before the program runs, so
the number of transmitted frames is undefined in advance.
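The loop above can be sketched in Python as follows. This is only a toy illustration under stated assumptions: the encoder, channel and decoder are replaced by a stand-in that flips each bit independently with probability `flip_prob`, and the names `simulate_serial`, `frame_bits` and `flip_prob` are hypothetical, not taken from the slides.

```python
import random

def simulate_serial(max_bit_errors, frame_bits=100, flip_prob=0.01, seed=0):
    """Toy stand-in for the serial loop: send frames until the
    accumulated bit errors exceed max_bit_errors."""
    rng = random.Random(seed)
    bit_errors = frame_errors = frames = 0
    while True:
        # make_turbo_frame(): frame generation with a random number
        frame = [rng.getrandbits(1) for _ in range(frame_bits)]
        # Assumed stand-in for turbo_encoder() -> channel() -> turbo_decoder():
        # every bit is flipped independently with probability flip_prob.
        received = [b ^ (rng.random() < flip_prob) for b in frame]
        # extract_turbo_frame(): count the bit errors of this frame
        errors = sum(a != b for a, b in zip(frame, received))
        frames += 1
        bit_errors += errors
        frame_errors += errors > 0
        if bit_errors > max_bit_errors:
            break
    ber = bit_errors / (frames * frame_bits)   # global bit error rate
    fer = frame_errors / frames                # global frame error rate
    return frames, ber, fer

frames, ber, fer = simulate_serial(max_bit_errors=50)
print(frames, ber, fer)
```

The number of frames is only known once the loop has stopped, which is the property that makes the workload hard to split in advance.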
7
First simple approach
  • In each processor
      • Even allocation of the maximum number of bit
        errors
      • Different seeds for random number generation
  • For example
      • Maximum number of bit errors = 1000
      • Number of processors = 4
      • Maximum number of bit errors / Number of
        processors = 250
      • seed = rank of each processor
  • Unbalanced loading problem
      • The number of frames produced, transmitted and
        analyzed in each processor cannot be predicted.
      • More frame transmissions can be assigned to
        certain processors.
  • It is simple but not useful.
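A minimal sketch of this even allocation, assuming a toy bit-flip channel in place of the real codec (the names `frames_until`, `frame_bits` and `flip_prob` are hypothetical). Each simulated processor gets the limit 1000 / 4 = 250 and uses its rank as the seed; the frame counts it needs to reach the limit generally differ, and the difference is idle time.

```python
import random

def frames_until(max_bit_errors, seed, frame_bits=100, flip_prob=0.01):
    """Frames one processor sends before its local bit errors exceed the limit."""
    rng = random.Random(seed)
    bit_errors = frames = 0
    while bit_errors <= max_bit_errors:
        # Assumed toy channel: each bit is in error with probability flip_prob.
        bit_errors += sum(rng.random() < flip_prob for _ in range(frame_bits))
        frames += 1
    return frames

def first_simple_approach(max_bit_errors=1000, n_procs=4):
    local_limit = max_bit_errors // n_procs            # 1000 / 4 = 250
    frames = [frames_until(local_limit, seed=rank) for rank in range(n_procs)]
    # Idle time of each processor, measured in frames, while it waits
    # for the processor that needs the most frames.
    idle = [max(frames) - f for f in frames]
    return local_limit, frames, idle

local_limit, frames, idle = first_simple_approach()
print(local_limit, frames, idle)
```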

8
First Simple Approach( example )
Serial program: if (summation of total bit errors > 1000) then terminate
    The number of transmitted frames is unpredictable before the simulation starts.
Perfect parallelization with 4 processors
Parallel program, processor 0: if (local bit errors > 250) then terminate
Parallel program, processor 1: if (local bit errors > 250) then terminate
Parallel program, processor 2: if (local bit errors > 250) then terminate
Parallel program, processor 3: if (local bit errors > 250) then terminate
[Figure: timeline of the number of transmitted frames per processor. Processors 0, 1 and 2 idle after reaching their local limit; the program terminating point is set by the last processor to finish.]
9
Master-Slave1
Set environmental variables and memory allocation
for (frame_number = 0; ; frame_number++) {
    random number generation
    make_turbo_frame()        /* frame generation with a random number */
    turbo_encoder()
    channel()                 /* a random number is used */
    turbo_decoder()
    extract_turbo_frame()
    if (summation of total bit errors > given maximum number of bit errors) then break
}
get global BER and FER
[Diagram labels: Master, Slave]
10
Master - Slave1
  • Slave nodes
      • establish the transmitting channels and transmit
        frames.
  • Master node
      • generates the random numbers for all slave
        processors.
      • sums all of the bit errors from the slave
        nodes.
      • determines the termination of frame transmission.
  • Analysis of the master's random number generation
      • Advantage: gets rid of the overlapping problems in
        each processor
      • Disadvantage: increases the message communication
        between master and slave nodes
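The division of work described above can be sketched as a sequential Python simulation. It does not use real message passing; the round-robin dispatch, the toy bit-flip channel and the names `slave_transmit` and `master_slave_1` are assumptions made for illustration. The point is the message count: every frame costs one random number message and one bit-error message.

```python
import random

def slave_transmit(frame_seed, frame_bits=100, flip_prob=0.01):
    """Slave: transmit one frame using the random number sent by the master
    and return the number of bit errors (toy bit-flip channel, assumed)."""
    rng = random.Random(frame_seed)
    return sum(rng.random() < flip_prob for _ in range(frame_bits))

def master_slave_1(max_bit_errors=200, n_slaves=4, master_seed=0):
    master_rng = random.Random(master_seed)
    total_errors = frames = messages = 0
    frames_per_slave = [0] * n_slaves
    # The master alone decides when frame transmission terminates.
    while total_errors <= max_bit_errors:
        slave = frames % n_slaves                 # round-robin dispatch (assumed)
        frame_seed = master_rng.getrandbits(32)   # master -> slave: random number message
        errors = slave_transmit(frame_seed)       # slave -> master: bit-error message
        messages += 2
        total_errors += errors                    # master sums the bit errors
        frames_per_slave[slave] += 1
        frames += 1
    return frames, total_errors, messages, frames_per_slave

frames, total_errors, messages, per_slave = master_slave_1()
print(frames, total_errors, messages, per_slave)
```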

11
Master-slave 1( message communication )
[Figure: message communication between the master and the slave nodes, with BER messages and random number messages.]
12
Master-Slave1
  • Intel Paragon (75 Mflops/node, 256 nodes total),
    rate 1/4

13
Master-slave 1
  • Performance analysis
      • Better performance than the first simple
        approach
      • But ...
      • Performance is not linear
  • Uneven frame distribution to each slave
      • No better remedy!
  • Idling time of slave nodes
      • Waiting for random numbers from the master node
      • It degrades the performance.
      • Try to cut down the idling time!

14
Master-slave2
  • Goal: minimize the idle time in slave nodes
  • Modification
      • Slave nodes generate random numbers by themselves
  • Analysis of the slaves' random number generation
      • Advantage
          • Decreases the message communication between
            master and slave nodes
          • Minimizes the idle time in slave nodes
      • Disadvantage
          • Random number redundancy problem among slave
            nodes
  • Solving the redundancy problem
      • a random number generator with a sufficiently long
        period
      • period = 2^256
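A minimal sequential sketch of this modification, under the same toy assumptions as before (bit-flip channel, round-robin scheduling, hypothetical names `make_slave_rng` and `master_slave_2`). Each slave owns its generator, so the only message per frame is the bit-error report. Giving every slave a distinct seed is one simple way to avoid identical streams and is an assumption here; the slides instead rely on a generator with a sufficiently long period. Python's `random.Random` is a Mersenne Twister with period 2**19937 - 1, not the generator used by the authors.

```python
import random

def make_slave_rng(rank, base_seed=12345):
    """Each slave owns a generator with a distinct seed (an assumption)."""
    return random.Random(base_seed * 1000 + rank)

def master_slave_2(max_bit_errors=200, n_slaves=4, frame_bits=100, flip_prob=0.01):
    rngs = [make_slave_rng(rank) for rank in range(n_slaves)]
    total_errors = frames = messages = 0
    while total_errors <= max_bit_errors:
        rng = rngs[frames % n_slaves]     # the slave draws its own random numbers
        errors = sum(rng.random() < flip_prob for _ in range(frame_bits))
        messages += 1                     # slave -> master: bit-error message only
        total_errors += errors            # master still sums errors and decides termination
        frames += 1
    return frames, total_errors, messages

frames, total_errors, messages = master_slave_2()
print(frames, total_errors, messages)

# Redundancy problem in miniature: two slaves that start from the same
# generator state transmit identical frames, which adds no new statistics.
assert make_slave_rng(3).random() == make_slave_rng(3).random()
```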

15
Master-slave 2
[Figure: master node and slave nodes]
16
Master-Slave 2
  • The idling time of slave nodes can be ignored.
  • This modified master-slave method is applied to
    the parallelized Turbo Codec to build a core chip.
  • Intel Paragon (75 Mflops/node, 256 nodes total),
    rate 1/3

17
Experimental Result (Condition)
  • Parallelized Turbo Codec
      • the improved master-slave method
      • 1.5 dB
      • maximum number of bit errors = 200
  • Platforms at Samsung Advanced Institute of
    Technology (SAIT)
      • Intel Paragon
      • HP Exemplar
      • Linux clusters
          • Intel CPU
          • Alpha CPU

18
Intel Paragon (1995.10 to 1999.11)
  • 19.2 Gflops peak performance, 256 nodes (MPP)
  • Main platform used to develop the parallelized Turbo Codec
  • Low CPU clock speed, but parallel processing nodes
    on the order of 10^2.

[Chart legend: Master-Slave 2, Master-Slave 1]
19
HP Exemplar (1998.5 to 2001.6)
  • 51.2 Gflops peak performance, 64 nodes (CC-NUMA)
  • Parallel performance
      • only up to 8 nodes
      • the lowest parallel performance results

[Chart annotation: 2.5 times faster]
20
Alpha Linux Cluster 1
  • LX board (21164) / 533 MHz CPU
  • 4 or 8 nodes
  • CPML (Compaq Portable Math Library)
    for efficient performance.

21
Alpha Linux Cluster 2
  • UP - Board (21264) / 667MHz CPU
  • 8 nodes
  • the fastest system

22
Intel Linux Cluster
  • 4 CPUs, Intel Pentium II 450 MHz
  • GNU gcc compiler, suited to Intel CPUs
  • General PC cluster

23
[Bar chart: execution time with 1, 4 and 8 CPUs at 1.5 dB, rate 1/3, error_max = 200. Compiler labels: Proprietary, HP Developed, EGCS, CPML + EGCS, CPML + Compaq C. Platforms: HP Exemplar, Alpha cluster (533 MHz LX), Intel cluster.]
24
Turbo Codec
1.5 dB, rate 1/3, error_max = 200
25
Performance Analysis
  • The parallel scalability is not linear.
      • Uneven and unpredictable number of frames in slave
        nodes
      • It does not affect the parallel simulation.
  • Throughput
      • the biggest advantage for parameter studies using
        several nodes with shortened computation time.
  • The parallelized Turbo Codec has an invisible linear
    scalability.

26
Invisible Scalability
  • Analysis: a more reliable measure of parallel efficiency
  • Analysis: computing time per frame
      • Global execution time of every processor
          = Σ over processors (execution time of each processor)
          = (execution time of the parallelized program)
            x (number of CPUs)
      • Time consumed to transmit a frame
          = (global execution time of every processor)
            / (total number of transmitted frames)
  • By analyzing a number of parallelized
    simulations,
      • → the computing time per frame is equal in every
        simulation.
      • → the time ratio in total transmitted frames is
        scalable.
27
UP vs LX (computing time per frame)
1.5 dB, rate 1/3, error_max = 200
<computing time per frame>
533 MHz (LX):   0.287 sec   0.063 sec   0.027 sec
667 MHz (UP):   0.212 sec   0.05 sec    0.025 sec
28
Conclusion
  • Parallelized Turbo Codec simulation algorithm
      • remarkable parallel efficiency and higher
        throughput.
  • Most outstanding contributions of the parallelized
    Turbo Codec
      • reducing the computational simulation time in the
        design process of the core chip.
      • allowing optimization of parameters in a shorter
        time
      • allowing analysis in the range over 2 dB.
  • Samsung Electronics participates in the IMT2000
    standardization forum with a large amount of
    simulation data based on this parallel Turbo
    Codec simulation algorithm.