Title: Streaming Multimedia Applications Development in MPSoC Embedded Platforms: Design Flow Issues, Communication Middleware and Dataflow Support Library
1. Streaming Multimedia Applications Development in MPSoC Embedded Platforms: Design Flow Issues, Communication Middleware and Dataflow Support Library
- Alessandro Dalla Torre,
- Martino Ruggiero,
- Luca Benini
- Andrea Acquaviva
- acquaviva_at_.univr.it
2. Dataflow/Streaming
- Peculiar Model of Computation (MoC)
- Regular and repeating computation; applications based on streaming data.
- Easily fits multimedia applications like audio/video CODECs, DSP, networking.
- Independent processes (filters) with explicit communication.
- Independent address spaces and multiple program counters.
- Composable patterns.
- Natural expression of parallelism
- Task, data and pipeline parallelism.
- Producer/consumer dependencies are exposed.
[Figure: FM radio dataflow graph: AtoD -> FMDemod -> Scatter -> (LPF1, LPF2, LPF3, HPF1, HPF2, HPF3) -> Gather -> Adder -> Speaker]
3. Types of parallelism (recall)
4. Streaming application issues on embedded MPSoCs
- Which paradigm of communication?
- Message passing, producer/consumer.
- Efficiency of communication is a major requirement.
- Portability of the code has to be kept in mind as well:
- low-level optimization may compromise it;
- compiler behaviour may interfere.
- Need for support for the programmer:
- communication middleware.
5. MP-Queue library
- Distributed queue middleware
- Support library for streaming channels in an MPSoC environment.
- Producer/consumer paradigm.
- FIFOs are circular buffers.
- Three main contributions
- Configurability: the MP-Queue library matches several heterogeneous architectural templates.
- An architecture-independent efficiency metric is given.
- It achieves both high efficiency and portability:
- synch operations optimized for minimal interconnect utilization;
- data transfer optimized for performance through analysis of disassembled code;
- portable C is the final implementation language.
6. Message delivery
[Figure: producers P1, P2 and consumers C1, C2 attached to network interfaces (NI), exchanging messages through queues tracked by a write counter and a read counter]
7. Communication library primitives
- Autoinit_system()
- Every core has to call it at the very beginning.
- Allocates data structures and prepares the semaphore arrays.
- Autoinit_producer()
- To be called by a producer core only.
- Requires a queue id.
- Creates the queue buffers and signals their position to the n consumers.
- Autoinit_consumer()
- To be called by a consumer core only.
- Requires a queue id.
- Waits for the n producers to be bound to the consumer structures.
- Read()
- Gets a message from the circular buffer (consumer only).
- Write()
- Puts a message into the circular buffer (producer only).
8. Architectural flexibility
- Multi-core architectures with distributed memories (scratchpads).
- Purely shared-memory-based architectures.
- Hybrid platforms (MPARM cycle-accurate simulator, different settings).
9. Transaction chart
- Shared bus accesses are minimized as much as possible:
- local polling on scratchpad memories.
- Insertion and extraction indexes are stored in shared memory and protected by a mutex.
- The data transfer section involves the shared bus:
- critical for performance.
10. Sequence diagrams
- Parallel activity of 1 producer and 1 consumer, delivering 20 messages.
- Synch time vs. pure data transfer:
- local polling on the scratchpad semaphore;
- signaling to the remote core's scratchpad;
- pure data transfer to and from the FIFO buffer in shared memory.
[Figure: sequence diagrams for message sizes of 64 words and 8 words]
11. Communication efficiency
- Comparison against ideal point-to-point communication:
- 1-1 queue,
- 1-N queue,
- interrupt-based queue.
- For small messages the library synchronization overhead prevails, while for a 256-word size we reach good performance.
- Monotonicity, difference in asymptotic behaviour.
- Abrupt variation of the curve slope.
- Interrupt-based notification adds overhead (15%).
- The metric is the ratio between:
- the real number of bus cycles needed to complete a 1-word transfer through MP-Queue based message passing,
- and the number of bus cycles needed to perform 1 write + 1 read operation (ideal 1-word transfer, no overhead).
- These are not absolute timings, which would depend on the CPU clock frequency, but a normalized, architecture-independent metric.
12. Low-level optimizations are critical!
[Figure: efficiency curves for 16 and 32 words per token]
- The gcc compiler avoids inserting the multiple-load/multiple-store loop from 32 words per token on: code size would rise exponentially, so the sequence is not produced any more by the compiler.
- Up to 256 words per token, the code size growth can be tolerated.
13. Compiler-aware optimization benefits
- The compiler may be forced to unroll the data transfer loops.
- New efficiency curves are obtained.
- About 15% improvement with 32-word messages.
- A typical JPEG 8x8 block is encoded in a 32-word struct (8x8x16-bit data).
- Above 256 words, we don't get any improvement from loop unrolling.
14. Shared cacheable memory management with flush
- Flushing is needed to avoid thrashing if no cache coherency support is given.
- With small messages, flushing compromises performance.
- 64 words is the break-even size for cacheable-shared based communication.
- Efficiency asymptotically rises to 98%!
15. JPEG decoding parallelized through MP-Queue
- Starting from a sequential JPEG decoder:
- Huffman extraction,
- inverse DCT,
- dequantization,
- frame reconstruction.
- Split-join parallelization:
- after step 1, the reconstructed 240 x 160 image is split by the master core into slices, which are delivered to the N worker cores through MP-Queue messages.
- 2 different architectural templates are explored.
16. Experimental part: metrics
- Overall exec time = Sequential part exec time + Parallel part exec time
- Ideal parallel time = Parallelizable part exec time / no. of parallel cores
- Extra parallel overhead = Parallel part exec time - Ideal parallel time - Communication overhead
[Figure: breakdown of overall execution time into sequential and parallel parts as the number of cores grows]
17. JPEG split-join with one 1-N buffer in shared memory (cacheable)
- We explored configurations with 1, 2, 4 and 8 workers (parallel cores).
- Execution time (cost in terms of bus transactions) scales down.
- The shared bus becomes quite busy with 8 parallel cores: the destination bottleneck negates the speed-up.
18. JPEG split-join with locally mapped (scratchpad) 1-1 queues
- This version performs significantly better when N increases beyond 2.
- It eliminates the waiting time due to contention:
- on the bus,
- on the shared memory.
19. Comparison of the two templates
[Figure: overhead breakdown for the cacheable-shared and the scratchpad templates]
- In the second experiment (using scratchpad-located 1-1 queues) the communication overhead becomes negligible.
- Extra parallel overhead remains:
- it is mostly due to a synchronization mismatch;
- it would be removed by a pipelined execution of the JPEG decoding.
20. Synchronous Dataflow
- Each process has a firing rule:
- it consumes and produces a fixed number of tokens every time it fires.
- Predictable communication, easy scheduling.
- Well-suited for multi-rate signal processing.
- A subset of Kahn process networks: deterministic.
[Figure: SDF graph with port rates 1, 2, 3 on its channels and an initial token (delay) on one channel]
21. MP-Queue middleware evolution: API extension towards the Dataflow MoC
- Add_input_to_actor()
- Transforms the FIFO into an input channel for the specified actor.
- Add_output_to_actor()
- Transforms the FIFO into an output channel for the specified actor.
- Set_compute_function_for_actor()
- By means of a function pointer, allows associating a portion of code with an actor.
- That code will be executed whenever the firing conditions are met, consuming the available input tokens.
- Try_and_fire()
- Cyclic checking of the firing conditions.
- Fire()
- Invoked by try_and_fire().
22. SDF3 tool
- Developed at Eindhoven University of Technology (TU/e).
- Sander Stuijk, Marc Geilen and Twan Basten, "SDF³: SDF For Free", in Proc. ACSD '06 (2006).
- Extensive Synchronous DataFlow library:
- SDFG analysis,
- transformation algorithms,
- visualization functionality.
23. Analysis techniques, example 1
- Compute the repetition vector:
- how often each actor must fire with respect to the other actors without a net change in the token distribution.
- In this example, the repetition vector indicates that:
- actor A should fire 3 times,
- actor B 2 times,
- actor C 1 time.
24. Analysis techniques, example 2
- Checking consistency:
- a non-consistent SDFG requires unbounded memory to execute, or it deadlocks during its execution;
- an SDFG is called consistent if its repetition vector is not equal to the null vector.
- This example shows an SDFG that is inconsistent:
- actor A produces too few tokens on the channel between A and B to allow an infinite execution of the graph.
25. Analysis techniques, example 3
- SDFG to HSDFG conversion (Homogeneous SDFG):
- any SDFG can be converted to a corresponding HSDFG in which all rates are equal to one;
- the number of actors in the resulting HSDFG depends on the entries of the actors in the repetition vector;
- the conversion can lead to an exponential increase in the number of actors in the HSDFG.
26. Analysis techniques, example 4
- MCM (Maximum Cycle Mean) analysis:
- the most commonly used means to compute the throughput of an SDFG;
- each cycle mean is computed as the total execution time of the actors on the cycle over the number of tokens in that cycle;
- the critical cycle (the cycle limiting the throughput) is the upper cycle in the HSDFG;
- in the example, the execution time of the actors on that cycle equals two time units and only one token is present on the cycle.
27. Design flow
- Give an XML representation of the dataflow.
- Invoke the SDF3 tool to produce the abstract internal representation.
- Produce a graphical output.
- Compute statistics and run some analysis techniques.
- Generate C code.
- Cross-compile.
- Simulate on the virtual platform.
28. Mapping and scheduling issues
- How to map several actors onto a single processor tile in an optimal way?
- Scheduling issues; the need for:
- either an embedded OS with preemption and timeslicing,
- or some static analysis techniques to compute a feasible schedule that will not provoke any deadlocks.