Parallelization of Turbo Codec and Performance Analysis

Transcript and Presenter's Notes

Title: Parallelization of Turbo Codec and Performance Analysis

1
Parallelization of Turbo Codec and Performance Analysis
Jung Soo Oh, Sang Moon Lee and Beongjo Kim
Samsung Advanced Institute of Technology and Samsung Electronics
2
Abstract

Turbo Codec
  Simulation methodology for the standard forward error correction scheme in the IMT2000 system (UMTS and CDMA2000)
Parallelized Turbo Codec
  Master and slave parallelization mechanism
  Uneven frame distribution
  Contributed to
    obtaining reliable simulation results within a short period of time
    participating in the IMT2000 standardization forum, thanks to the very high throughput

3
Turbo Codes

Forward error correction scheme for data service over
  Forward/Reverse Supplemental Channels (F/R-SCH) in CDMA2000 (IMT2000)
  Uplink/Downlink Transport Channels (TrCH) in UMTS
Developing and implementing an efficient, cost-effective Turbo codec requires a huge amount of simulation time.
=> We therefore parallelized the Turbo Codec program.

4
Concept
[Figure: system concept diagram. Labels: Forward Link; Multimedia terminal; Hand Portable Telephone; BS; MSC; Network; data rates 2 Mbps, 64/128/384 Kbps and 8/64 Kbps; Searcher (Sync); Turbo Codec.]
5
Key Factors

Evaluation factors for Turbo Codec performance
  BER (Bit Error Rate)
  FER (Frame Error Rate)
The Turbo Codec operates in frame mode, so it generates its output codewords frame by frame.
In the computational simulation, a random number generator produces an arbitrary number of frames.
The Turbo Codec simulator adjusts its parameters toward better values by analyzing the BER/FER of the independent frame transmissions.
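
The two evaluation factors can be illustrated with a minimal Python sketch. The function name `ber_fer` and the tiny hand-written frames are assumptions made for illustration only; they are not part of the original simulator.

```python
def ber_fer(sent_frames, received_frames):
    """Return (BER, FER) for two equal-length lists of equal-length bit frames."""
    total_bits = 0
    bit_errors = 0
    frame_errors = 0
    for sent, received in zip(sent_frames, received_frames):
        # Count the positions where the received bit differs from the sent bit.
        errors = sum(s != r for s, r in zip(sent, received))
        total_bits += len(sent)
        bit_errors += errors
        # A frame counts as erroneous if it contains at least one bit error.
        frame_errors += 1 if errors else 0
    return bit_errors / total_bits, frame_errors / len(sent_frames)


# Illustrative data: one wrong bit in the first frame, none in the second.
sent = [[0, 1, 1, 0], [1, 1, 0, 0]]
received = [[0, 1, 0, 0], [1, 1, 0, 0]]
ber, fer = ber_fer(sent, received)
print(ber, fer)  # 0.125 0.5
```

Here one of eight bits is wrong (BER = 0.125) and one of two frames contains an error (FER = 0.5).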

6
Main computational algorithm
Set environmental variables and allocate memory
for (frame_number = 0; ; frame_number++)
    make_turbo_frame()        frame generation with a random number
    turbo_encoder()
    channel()                 a random number is used
    turbo_decoder()
    extract_turbo_frame()
    if (summation of total bit errors > given maximum number of bit errors) then break
get global BER and FER
The maximum number of bit errors is assigned before the program runs, so the number of transmitted frames is undefined in advance.
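
The serial loop can be sketched in runnable form as follows. This is only an illustration under stated assumptions: the encoder and decoder are omitted (treated as identity), the channel is a simple bit-flipping channel, and `frame_len`, `flip_prob` and `seed` are hypothetical parameters that do not appear on the slides.

```python
import random


def simulate_serial(max_bit_errors, frame_len=100, flip_prob=0.05, seed=0):
    """Transmit frames until the total bit errors exceed max_bit_errors."""
    rng = random.Random(seed)
    total_bit_errors = 0
    frame_errors = 0
    frames = 0
    while True:
        # make_turbo_frame(): frame generation with random numbers
        frame = [rng.getrandbits(1) for _ in range(frame_len)]
        # turbo_encoder() is omitted in this sketch (identity mapping).
        # channel(): flip each bit with probability flip_prob (assumed model)
        received = [b ^ (rng.random() < flip_prob) for b in frame]
        # turbo_decoder() / extract_turbo_frame() are omitted as well.
        errors = sum(a != b for a, b in zip(frame, received))
        frames += 1
        total_bit_errors += errors
        frame_errors += errors > 0
        if total_bit_errors > max_bit_errors:
            break
    ber = total_bit_errors / (frames * frame_len)
    fer = frame_errors / frames
    return frames, ber, fer


frames, ber, fer = simulate_serial(max_bit_errors=200)
print(frames, ber, fer)
```

The loop has no fixed frame count: it ends only when the error budget is exceeded, which is why the number of transmitted frames cannot be known before the run.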
7
First simple approach

In each processor
  Even allocation of the maximum number of bit errors
  Different seeds for random number generation
For example
  Maximum number of bit errors = 1000
  Number of processors = 4
  Maximum number of bit errors / Number of processors = 250
  seed = rank of each processor
Unbalanced loading problem
  The number of frames that each processor produces, transmits and analyzes cannot be predicted.
  Some processors can end up with many more frame transmissions than others.
It is simple but not useful.
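
The example above can be tried out with a small Python sketch. The bit-flipping channel, `frame_len` and `flip_prob` are assumptions made for illustration; only the error budget (1000), the processor count (4) and the rule seed = rank come from the slide. The processors are emulated one after another rather than run in parallel.

```python
import random


def frames_until(limit, rng, frame_len=100, flip_prob=0.002):
    """Count the frames sent until the local bit-error total exceeds limit."""
    errors = 0
    frames = 0
    while errors <= limit:
        # Bit errors in one frame under the assumed bit-flipping channel.
        errors += sum(rng.random() < flip_prob for _ in range(frame_len))
        frames += 1
    return frames


max_bit_errors = 1000
num_processors = 4
local_limit = max_bit_errors // num_processors  # 250 bit errors per processor

# seed = rank of each processor, as on the slide
frames_per_rank = [
    frames_until(local_limit, random.Random(rank))
    for rank in range(num_processors)
]
print(local_limit, frames_per_rank)
```

Each rank needs a different number of frames to use up its 250-error budget, so ranks that finish early would sit idle until the slowest one terminates.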

8
First Simple Approach (example)
Serial program: if (summation of total bit errors > 1000) then terminate
  The number of transmitted frames is unpredictable before the simulation starts.
Perfect parallelization with 4 processors
  Parallel program, processors 0 to 3: if (local bit errors > 250) then terminate
[Figure: one timeline per processor showing the number of transmitted frames. Idling intervals are marked on processors 0, 1 and 2 up to the program terminating point.]
9
Master-Slave 1
Set environmental variables and allocate memory
for (frame_number = 0; ; frame_number++)
    random number generation
    make_turbo_frame()        frame generation with a random number
    turbo_encoder()
    channel()                 a random number is used
    turbo_decoder()
    extract_turbo_frame()
    if (summation of total bit errors > given maximum number of bit errors) then break
get global BER and FER
[Figure labels: Master and Slave mark which node runs each step.]
10
Master-Slave 1

Slave nodes
  establish the transmission channels and transmit frames.
Master node
  generates the random numbers for all slave processors.
  sums all of the bit errors reported by the slave nodes.
  determines when frame transmission terminates.
Analysis of the master's random number generation
  Advantage
    Removes the problem of overlapping random numbers across processors
  Disadvantage
    Increases message communication between master and slave nodes
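
The communication cost of this scheme can be made visible with a single-process emulation. This is a sketch under assumptions, not the original implementation: no real message passing is used, slaves are served round-robin, the master sends one random seed per frame, and the channel model and all parameter names are hypothetical.

```python
import random


def slave_transmit(frame_seed, frame_len=100, flip_prob=0.05):
    """Slave side: transmit one frame using the number received from the master."""
    rng = random.Random(frame_seed)
    return sum(rng.random() < flip_prob for _ in range(frame_len))


def master_slave_1(max_bit_errors, num_slaves, master_seed=0):
    """Master side: hand out random numbers, sum bit errors, decide termination."""
    master_rng = random.Random(master_seed)
    total_errors = 0
    messages = 0
    frames_per_slave = [0] * num_slaves
    slave = 0
    while total_errors <= max_bit_errors:
        frame_seed = master_rng.getrandbits(32)
        messages += 1                                # random number message, master to slave
        total_errors += slave_transmit(frame_seed)
        messages += 1                                # BER message, slave to master
        frames_per_slave[slave] += 1
        slave = (slave + 1) % num_slaves
    return total_errors, frames_per_slave, messages


total, per_slave, messages = master_slave_1(max_bit_errors=200, num_slaves=4)
print(total, per_slave, messages)
```

Because every random number comes from one generator, no two slaves can draw overlapping sequences, but each frame costs two messages in this emulation.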

11
Master-Slave 1 (message communication)
[Figure: message flow around the master node, with BER messages and random number messages.]
12
Master-Slave 1

Intel Paragon (75 Mflops/node, 256 nodes in total), rate 1/4

13
Master-Slave 1

Performance Analysis
  Better performance than the first simple approach
  But ...
  Performance does not scale linearly
    Uneven frame distribution to each slave
      No better remedy!
    Idling time in slave nodes
      Slaves wait for random numbers from the master node.
      This degrades the performance.
      Try to cut down the idling time!

14
Master-Slave 2

Goal: minimize the idle time in slave nodes
Modification
  Slave nodes generate random numbers by themselves
Analysis of the slaves' random number generation
  Advantage
    Reduces message communication between master and slave nodes
    Minimizes the idle time in slave nodes
  Disadvantage
    Random number redundancy problem among slave nodes
Solving the redundancy problem
  Use a random number generator with a sufficiently long period
  period = 2^256
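
The modified scheme can be emulated in the same single-process style. This sketch rests on assumptions: Python's built-in Mersenne Twister (period 2^19937 - 1) stands in for the long-period generator named on the slide, each slave is seeded with its rank, and the channel model and parameter names are hypothetical.

```python
import random


def master_slave_2(max_bit_errors, num_slaves, frame_len=100, flip_prob=0.05):
    """Emulate slaves that own their generators and only report bit errors."""
    # One independent generator per slave, seeded with the slave's rank.
    slave_rngs = [random.Random(rank) for rank in range(num_slaves)]
    total_errors = 0
    messages = 0
    frames = 0
    while total_errors <= max_bit_errors:
        for rng in slave_rngs:
            # The slave draws its own random numbers; nothing comes from the master.
            errors = sum(rng.random() < flip_prob for _ in range(frame_len))
            total_errors += errors   # BER message, slave to master
            messages += 1
            frames += 1
    return total_errors, frames, messages


total, frames, messages = master_slave_2(max_bit_errors=200, num_slaves=4)
print(total, frames, messages)
```

Only one message per frame remains in this emulation. Distinct seeds do not strictly rule out overlapping streams; a very long period only makes an overlap within one simulation run extremely unlikely.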

15
Master-Slave 2
[Figure: diagram around the master node.]
16
Master-Slave 2

The idling time of slave nodes becomes negligible.
This modified master-slave method is applied to the parallelized Turbo Codec used to build a core chip.
Intel Paragon (75 Mflops/node, 256 nodes in total), rate 1/3

17
Experimental Result (Condition)

Parallelized Turbo Codec
  the improved master-slave method
  1.5 dB
  maximum number of bit errors = 200
Platforms at Samsung Advanced Institute of Technology (SAIT)
  Intel Paragon
  HP Exemplar
  Linux Clusters
    Intel CPU
    Alpha CPU

18
Intel Paragon (1995.10 to 1999.11)

19.2 Gflops peak performance, 256 nodes (MPP)
Main platform for developing the parallelized Turbo Codec
Low CPU clock speed, but on the order of 10^2 parallel processing nodes

[Figure: performance plot comparing Master-Slave 2 and Master-Slave 1.]
19
HP Exemplar (1998.5 to 2001.6)

51.2 Gflops peak performance, 64 nodes (CC-NUMA)
Parallel performance
  only up to 8 nodes
  the lowest parallel performance results

[Figure: performance plot annotated "2.5 times faster".]
20
Alpha Linux Cluster 1

LX - Board (21164) / 533 MHz CPU
4 or 8 nodes
CPML (Compaq Portable Math Library) for efficient performance

21
Alpha Linux Cluster 2

UP - Board (21264) / 667MHz CPU
8 nodes
the fastest system

22
Intel Linux Cluster

4 CPUs, Intel Pentium II 450 MHz
GNU gcc compiler, suited to Intel CPUs
General PC cluster

23
Proprietary
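[Figure: bar chart of simulation time for 1, 4 and 8 CPUs at 1.5 dB, rate 1/3, error_max = 200. Vertical axis: time, ticks 0 to 7. Compiler labels: HP Developed, EGCS, CPML+EGCS, CPML+Compaq C, EGCS. Platform labels: HP Exemplar, Alpha cluster (533 MHz LX), Intel cluster. Bar labels as extracted, not mapped to bars: 645, 335, 357, 240, 254, 258, 220, 212, 123, 126, 143, 047, 043.]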
24
Turbo Codec
1.5 dB, rate 1/3, error_max = 200
25
Performance Analysis

Parallel scalability is not linear.
  The number of frames in each slave node is uneven and unpredictable.
  This does not affect the parallel simulation.
Throughput
  The biggest advantage for parameter studies: several nodes shorten the computation time.
The parallelized Turbo Codec has an invisible linear scalability.

26
Invisible Scalability

Analysis: a more reliable measure of parallel efficiency
Analysis: computing time per frame
  global execution time of every processor
    = sum over processors of (the execution time of each processor)
    = (the execution time of the parallelized program) x (the number of CPUs)
  time consumed to transmit a frame
    = (the global execution time of every processor) / (the total number of transmitted frames)
Analyzing a number of parallelized simulations shows that
  => the computing time per frame is equal in every simulation.
  => the time ratio over the total transmitted frames is scalable.
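
The per-frame metric can be written as a short Python function. The function name and the input values below are hypothetical and chosen only to show the arithmetic; they are not measurements from the slides.

```python
def time_per_frame(parallel_time, num_cpus, total_frames):
    """Computing time per frame from the wall-clock time of a parallel run."""
    # Global execution time: every CPU is busy for the whole parallel run.
    global_time = parallel_time * num_cpus
    return global_time / total_frames


# Hypothetical runs that transmit the same total number of frames.
print(time_per_frame(100.0, 1, 400))  # 0.25
print(time_per_frame(25.0, 4, 400))   # 0.25
```

If the per-frame time stays constant as CPUs are added, as in these made-up numbers, the run scales linearly in terms of frames processed even when wall-clock speedup looks irregular because the frame count differs between runs.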

27
UP vs LX (computing time per frame)
1.5 dB, rate 1/3, error_max = 200
Computing time per frame
  533 MHz   0.287 sec   0.063 sec   0.027 sec
  667 MHz   0.212 sec   0.05 sec    0.025 sec
28
Conclusion

Parallelized Turbo Codec simulating algorithm
  Remarkable parallel efficiency and higher throughput
Most outstanding contributions of the parallelized Turbo Codec
  Reducing the computational simulation time in the design process of the core chip
  Allowing optimization of parameters in a shorter time
  Allowing analysis in the range over 2 dB
Samsung Electronics participates in the IMT2000 standardization forum with a large amount of simulation data based on this parallel Turbo Codec simulating algorithm.