Title: Streaming Multimedia Applications Development in MPSoC Embedded Platforms: Design Flow Issues, Communication Middleware and Dataflow Support Library
1. Streaming Multimedia Applications Development in MPSoC Embedded Platforms: Design Flow Issues, Communication Middleware and Dataflow Support Library
- Alessandro Dalla Torre,
- Martino Ruggiero,
- Luca Benini
- Andrea Acquaviva
- acquaviva_at_.univr.it
2. Dataflow/Streaming
- Peculiar Model of Computation (MoC)
- Regular and repeating computation; applications based on streaming data.
- Easily fits multimedia applications like audio/video CODECs, DSP, networking.
- Independent processes (filters) with explicit communication.
- Independent address spaces and multiple program counters.
- Composable patterns.
- Natural expression of parallelism
- Task, data and pipeline parallelism.
- Producer/consumer dependencies are exposed.
[Figure: FM radio dataflow graph: AtoD -> FMDemod -> Scatter -> (LPF1, LPF2, LPF3, HPF1, HPF2, HPF3) -> Gather -> Adder -> Speaker]
3. Types of parallelism (recall)
4. Streaming application issues on embedded MPSoCs
- Which paradigm of communication?
- Message passing, producer/consumer.
- Efficiency of communication is a major requirement.
- Portability of the code has to be kept in mind as well:
- low-level optimization may compromise it;
- compiler behaviour may interfere.
- Need for support for the programmer:
- communication middleware.
5. MP-Queue library
- Distributed queue middleware
- Support library for streaming channels in an MPSoC environment.
- Producer/consumer paradigm.
- FIFOs are circular buffers.
- Three main contributions
- Configurability: the MP-Queue library matches several heterogeneous architectural templates.
- An architecture-independent efficiency metric is given.
- It achieves both high efficiency and portability:
- synch operations optimized for minimal interconnect utilization;
- data transfer optimized for performance through analysis of disassembled code;
- portable C is the final implementation language.
6. Message delivery
[Figure: producers P1, P2 and consumers C1, C2 attached to network interfaces (NI), exchanging messages through queues tracked by a write counter and a read counter]
7. Communication library primitives
- Autoinit_system()
- Every core has to call it at the very beginning.
- Allocates data structures and prepares the semaphore arrays.
- Autoinit_producer()
- To be called by a producer core only.
- Requires a queue id.
- Creates the queue buffers and signals their position to the n consumers.
- Autoinit_consumer()
- To be called by a consumer core only.
- Requires a queue id.
- Waits for the n producers to be bound to the consumer structures.
- Read()
- Gets a message from the circular buffer (consumer only).
- Write()
- Puts a message into the circular buffer (producer only).
8. Architectural flexibility
- Multi-core architectures with distributed memories (scratchpads).
- Purely shared-memory-based architectures.
- Hybrid platforms (MPARM cycle-accurate simulator, different settings).
9. Transaction chart
- Shared bus accesses are minimized as much as possible:
- local polling on scratchpad memories.
- Insertion and extraction indexes are stored in shared memory and protected by a mutex.
- The data transfer section involves the shared bus:
- critical for performance.
10. Sequence diagrams
- Parallel activity of 1 producer and 1 consumer, delivering 20 messages.
- Synch time vs. pure data transfer:
- local polling on the scratchpad semaphore;
- signaling to the remote core's scratchpad;
- pure data transfer to and from the FIFO buffer in shared memory.
[Figure: sequence diagrams for message sizes of 64 words and 8 words]
11. Communication efficiency
- Comparison against ideal point-to-point communication:
- 1-1 queue,
- 1-N queue,
- interrupt-based queue.
- For small messages the library synchronization overhead prevails, while for a 256-word size we reach good performance.
- Monotonicity, difference in asymptotic behaviour.
- Abrupt variation of the curve slope.
- Interrupt-based notification adds overhead (15%).
- The metric is the ratio between:
- the real number of bus cycles needed to complete a 1-word transfer through MP-Queue based message passing,
- and the number of bus cycles needed to perform 1 write + 1 read operation (ideal 1-word transfer, no overhead).
- These are not absolute timings, which would depend on the CPU clock frequency, but a normalized, architecture-independent metric.
12. Low-level optimizations are critical!
[Figure: efficiency curves for 16 and 32 words per token]
- The gcc compiler avoids inserting the multiple-load/multiple-store loop from 32 words per token on: code size would rise exponentially, so the sequence is not produced any more by the compiler.
- Up to 256 words per token, the code size growth can be tolerated.
13. Compiler-aware optimization benefits
- The compiler may be forced to unroll the data transfer loops.
- New efficiency curves are obtained.
- About 15% improvement with 32-word messages.
- A typical JPEG 8x8 block is encoded in a 32-word struct (8x8x16-bit data).
- Above 256 words, we don't get any improvement from loop unrolling.
14. Shared cacheable memory management with flush
- Flushing is needed to avoid thrashing if no cache coherency support is given.
- With small messages, flushing compromises performance.
- 64 words is the break-even size for cacheable-shared based communication.
- Efficiency asymptotically rises to 98%!
15. JPEG decoding parallelized through MP-Queue
- Starting from a sequential JPEG decoder:
- Huffman extraction,
- inverse DCT,
- dequantization,
- frame reconstruction.
- Split-join parallelization:
- after step 1, the reconstructed 240 x 160 image is split by the master core into slices, which are delivered to the N worker cores through MP-Queue messages.
- 2 different architectural templates are explored.
16. Experimental part: metrics
- Overall exec time = Sequential part exec time + Parallel part exec time
- Ideal parallel time = Parallelizable part exec time / no. of parallel cores
- Extra parallel overhead = Parallel part exec time - Ideal parallel time - Communication overhead
[Figure: breakdown of overall execution time into sequential and parallel parts as the number of cores grows]
17. JPEG split-join with one 1-N buffer in shared memory (cacheable)
- We explored configurations with 1, 2, 4 and 8 workers (parallel cores).
- Execution time (cost in terms of bus transactions) scales down.
- The shared bus becomes quite busy with 8 parallel cores: the destination bottleneck negates the speed-up.
18. JPEG split-join with locally mapped (scratchpad) 1-1 queues
- This version performs significantly better when N increases beyond 2.
- It eliminates the waiting time due to contention:
- on the bus,
- on the shared memory.
19. Comparison of the two templates
[Figure: overhead breakdown for the cacheable-shared and the scratchpad templates]
- In the second experiment (using scratchpad-located 1-1 queues) the communication overhead becomes negligible.
- Extra parallel overhead remains:
- it is mostly due to a synchronization mismatch;
- it would be removed by a pipelined execution of the JPEG decoding.
20. Synchronous Dataflow
- Each process has a firing rule:
- it consumes and produces a fixed number of tokens every time it fires.
- Predictable communication, easy scheduling.
- Well-suited for multi-rate signal processing.
- A subset of Kahn process networks: deterministic.
[Figure: SDF graph with port rates 1, 2, 3 on its channels and an initial token (delay) on one channel]
21. MP-Queue middleware evolution: API extension towards the Dataflow MoC
- Add_input_to_actor()
- Transforms the FIFO into an input channel for the specified actor.
- Add_output_to_actor()
- Transforms the FIFO into an output channel for the specified actor.
- Set_compute_function_for_actor()
- By means of a function pointer, allows associating a portion of code with an actor.
- That code will be executed whenever the firing conditions are met, consuming the available input tokens.
- Try_and_fire()
- Cyclic checking of the firing conditions.
- Fire()
- Invoked by try_and_fire().
22. SDF3 tool
- Developed at Eindhoven University of Technology (TU/e).
- Sander Stuijk, Marc Geilen and Twan Basten, "SDF³: SDF For Free", in Proc. ACSD '06 (2006).
- Extensive Synchronous DataFlow library:
- SDFG analysis,
- transformation algorithms,
- visualization functionality.
23. Analysis techniques, example 1
- Compute the repetition vector:
- how often each actor must fire with respect to the other actors without a net change in the token distribution.
- In this example, the repetition vector indicates that:
- actor A should fire 3 times,
- actor B 2 times,
- actor C 1 time.
24. Analysis techniques, example 2
- Checking consistency:
- a non-consistent SDFG requires unbounded memory to execute, or it deadlocks during its execution;
- an SDFG is called consistent if its repetition vector is not equal to the null vector.
- This example shows an SDFG that is inconsistent:
- actor A produces too few tokens on the channel between A and B to allow an infinite execution of the graph.
25. Analysis techniques, example 3
- SDFG to HSDFG conversion (Homogeneous SDFG):
- any SDFG can be converted to a corresponding HSDFG in which all rates are equal to one;
- the number of actors in the resulting HSDFG depends on the entries of the actors in the repetition vector;
- the conversion can lead to an exponential increase in the number of actors in the HSDFG.
26. Analysis techniques, example 4
- MCM (Maximum Cycle Mean) analysis:
- the most commonly used means to compute the throughput of an SDFG;
- each cycle mean is computed as the total execution time of the actors on the cycle over the number of tokens in that cycle;
- the critical cycle (the cycle limiting the throughput) is the upper cycle in the HSDFG;
- in the example, the execution time of the actors on that cycle equals two time units and only one token is present on the cycle.
27. Design flow
- Give an XML representation of the dataflow.
- Invoke the SDF3 tool to produce the abstract internal representation.
- Produce a graphical output.
- Compute statistics and run some analysis techniques.
- Generate C code.
- Cross-compile.
- Simulate on the virtual platform.
28. Mapping and scheduling issues
- How to map several actors onto a single processor tile in an optimal way?
- Scheduling issues; the need for:
- either an embedded OS with preemption and timeslicing,
- or some static analysis techniques to compute a feasible schedule that will not provoke any deadlocks.