1
Streaming Multimedia Applications Development in MPSoC Embedded Platforms: Design Flow Issues, Communication Middleware and Dataflow Support Library
  • Alessandro Dalla Torre,
  • Martino Ruggiero,
  • Luca Benini
  • Andrea Acquaviva
  • acquaviva_at_.univr.it

2
Dataflow/Streaming
  • Peculiar Model of Computation (MoC)
  • Regular and repeating computation, applications
    based on streaming data.
  • Fits naturally to multimedia applications like Audio/Video CODECs, DSP, networking.
  • Independent processes (filters) with explicit
    communication.
  • Independent address spaces and multiple program counters.
  • Composable patterns.
  • Natural expression of Parallelism
  • Task, data and pipeline parallelism.
  • Producer / Consumer dependencies are exposed.

[Figure: FM radio dataflow graph: AtoD -> FMDemod -> Scatter -> LPF1/LPF2/LPF3 and HPF1/HPF2/HPF3 -> Gather -> Adder -> Speaker]
3
Types of parallelism (recall)
4
Streaming applications issues on embedded MPSoC
  • Which paradigm of communication?
  • Message passing, producer/consumer
  • Efficiency of communication is a major
    requirement.
  • Portability of code has to be kept in mind as
    well.
  • Low level optimization may compromise that.
  • Compiler behaviour may interfere.
  • Need for programmer support
  • Communication middleware.

5
MP-Queue library
  • Distributed queue middleware
  • Support library for streaming channels in a MPSoC
    environment.
  • Producer / consumer paradigm.
  • FIFOs are circular buffers.
  • Three main contributions
  1. Configurability: the MP-Queue library matches several heterogeneous architectural templates.
  2. An architecture-independent efficiency metric is given.
  3. It achieves both high efficiency and portability:
  • synch operations optimized for minimal interconnect utilization,
  • data transfer optimized for performance through analyses of disassembled code,
  • portable C as the final implementation language.
6
Message delivery
[Figure: message delivery between producers P1, P2 and consumers C1, C2, each attached through a network interface (NI); every queue keeps a write counter and a read counter.]
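A sketch of what such a circular-buffer FIFO with separate read/write counters can look like in C. The type and field names are illustrative, not the actual MP-Queue internals.

  #include <stdint.h>

  #define QUEUE_SLOTS 8                 /* message slots (power of two) */
  #define MSG_WORDS   64                /* words per message */

  typedef struct {
      volatile uint32_t write_count;    /* advanced by the producer */
      volatile uint32_t read_count;     /* advanced by the consumer */
      uint32_t slots[QUEUE_SLOTS][MSG_WORDS];  /* circular message storage */
  } mpq_fifo_t;

  /* With free-running unsigned counters, the queue is empty when the
   * counters match and full when they differ by QUEUE_SLOTS. */
  static inline int fifo_empty(const mpq_fifo_t *q) { return q->write_count == q->read_count; }
  static inline int fifo_full(const mpq_fifo_t *q)  { return q->write_count - q->read_count == QUEUE_SLOTS; }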
7
Communication library primitives (a usage sketch follows the list)
  1. Autoinit_system()
  • Every core has to call it at the very beginning.
  • Allocates data structures and prepares the semaphore arrays.
  2. Autoinit_producer()
  • To be called by a producer core only.
  • Requires a queue id.
  • Creates the queue buffers and signals their position to the n consumers.
  3. Autoinit_consumer()
  • To be called by a consumer core only.
  • Requires a queue id.
  • Waits for n producers to be bound to the consumer structures.
  4. Read()
  • Gets a message from the circular buffer (consumer only).
  5. Write()
  • Puts a message into the circular buffer (producer only).
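A minimal usage sketch of these primitives in C. The primitive names come from the slide; the argument lists are assumptions, since the deck does not show the actual signatures.

  #include <stdint.h>

  /* Assumed prototypes: the deck names the primitives but not their signatures. */
  extern void Autoinit_system(void);
  extern void Autoinit_producer(int queue_id);
  extern void Autoinit_consumer(int queue_id);
  extern void Read(int queue_id, void *msg, unsigned size);
  extern void Write(int queue_id, const void *msg, unsigned size);

  #define QUEUE_ID 1

  void producer_task(void)
  {
      uint32_t msg[64];
      Autoinit_system();                     /* every core calls this first */
      Autoinit_producer(QUEUE_ID);           /* create buffers, signal consumers */
      for (;;) {
          /* ... fill msg ... */
          Write(QUEUE_ID, msg, sizeof msg);  /* put message into the circular buffer */
      }
  }

  void consumer_task(void)
  {
      uint32_t msg[64];
      Autoinit_system();
      Autoinit_consumer(QUEUE_ID);           /* wait for producers to bind */
      for (;;) {
          Read(QUEUE_ID, msg, sizeof msg);   /* get message from the circular buffer */
          /* ... process msg ... */
      }
  }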

8
Architectural Flexibility
  1. Multi core architectures with distributed
    memories (scratchpad).
  2. Purely shared memory based architectures.
  3. Hybrid platforms (MPARM cycle accurate simulator,
    different settings).

9
Transaction Chart
  • Shared bus accesses are minimized as much as possible:
  • local polling on scratchpad memories (see the sketch below),
  • insertion and extraction indexes are stored in shared memory and protected by a mutex.
  • The data transfer section involves the shared bus:
  • critical for performance.
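To make the polling scheme concrete, a small illustrative sketch (variable placement and names are assumptions, not the library's actual code): the waiting core spins on a flag in its own scratchpad, so the wait itself never touches the shared interconnect; the remote core performs a single remote store to signal.

  #include <stdint.h>

  /* Assume the linker script places this flag in the local scratchpad. */
  volatile uint32_t data_ready;

  static void wait_for_message(void)
  {
      while (!data_ready)    /* local polling: reads hit the scratchpad only */
          ;
      data_ready = 0;        /* consume the notification */
  }

  /* Executed by the remote core after the data transfer: a single
   * remote store into the waiting core's scratchpad. */
  static void notify(volatile uint32_t *remote_flag)
  {
      *remote_flag = 1;
  }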

10
Sequence diagrams
  • Parallel activity of 1 producer and 1 consumer, delivering 20 messages.
  • Synch time vs. pure data transfer:
  • local polling on the local scratchpad semaphore,
  • signaling to the remote core's scratchpad,
  • pure data transfer to and from the FIFO buffer in shared memory.

[Sequence diagrams shown for message sizes of 8 and 64 words]
11
Communication efficiency
  • Comparison against ideal point-to-point
    communication
  • 1-1 queue,
  • 1-N queue,
  • Interrupt based queue.
  • For small messages the library synchronization overhead prevails,
  • while for 256-word messages we reach good performance.
  • Monotonicity, difference in asymptotic behaviour.
  • Abrupt variation of the curve slope.
  • Interrupt-based notification adds overhead (15%).
  • The metric is the ratio between
  • the real number of bus cycles needed to complete a 1-word transfer through the MP-QUEUE based message passing,
  • and
  • the number of bus cycles needed to perform 1 write + 1 read operation (ideal 1-word transfer, no overhead).
  • Not absolute timings, which would depend on the CPU clock frequency, but a normalized metric:
  • an architecture-independent metric.
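In formula form (a reconstruction from the wording above; since later slides report efficiency rising toward 98%, the plotted quantity is presumably the ideal-to-real ratio):

  E \;=\; \frac{C_{\mathrm{ideal}}}{C_{\mathrm{real}}},
  \qquad C_{\mathrm{ideal}} \;=\; \mathrm{cycles}(1\ \mathrm{write}) + \mathrm{cycles}(1\ \mathrm{read})

where C_real is the measured number of bus cycles per word transferred through MP-Queue message passing.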

12
Low-level optimizations are critical!
[Plots: transfer cost at 16 and 32 words per token; annotation: the multiple load/store loop is not produced any more by the compiler]
  • The GCC compiler stops inserting the multiple-load/multiple-store loop from 32 words per token on,
  • as code size would otherwise be rising exponentially.
  • Up to 256 words per token we can tolerate the code size growth.
13
Compiler-aware optimization benefits
  • The compiler may be forced to unroll the data transfer loops (see the sketch below).
  • New efficiency curves are obtained:
  • about 15% improvement with 32-word messages.
  • A typical JPEG 8x8 block is encoded in a 32-word struct
  • (8x8 x 16-bit data).
  • Above 256 words, we don't get any improvement by using loop unrolling.
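A sketch of the kind of manually unrolled word copy this amounts to (illustrative, not the library's actual code). With a small fixed message size the unrolled body becomes straight-line loads and stores.

  #include <stdint.h>

  static void copy_words_unrolled(uint32_t *dst, const uint32_t *src, int nwords)
  {
      int i;
      for (i = 0; i + 4 <= nwords; i += 4) {   /* 4-way unrolled body */
          dst[i + 0] = src[i + 0];
          dst[i + 1] = src[i + 1];
          dst[i + 2] = src[i + 2];
          dst[i + 3] = src[i + 3];
      }
      for (; i < nwords; i++)                  /* leftover words */
          dst[i] = src[i];
  }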

14
Shared cacheable memory management with flush
  • Flushing is needed to avoid thrashing if no cache coherency support is given (see the sketch below).
  • With small messages, flushing compromises performance.
  • 64 words is the break-even size for cacheable-shared based communication.
  • Efficiency asymptotically rises to 98%!
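A sketch of the flush discipline on the producer and consumer sides. cache_flush_range() and cache_invalidate_range() are hypothetical platform calls, not part of the MP-Queue API shown in the deck.

  #include <stdint.h>

  extern void cache_flush_range(void *addr, unsigned bytes);       /* write dirty lines back */
  extern void cache_invalidate_range(void *addr, unsigned bytes);  /* drop stale lines */

  void producer_send(uint32_t *shared_buf, const uint32_t *msg, unsigned words)
  {
      for (unsigned i = 0; i < words; i++)
          shared_buf[i] = msg[i];
      cache_flush_range(shared_buf, words * 4);       /* make the data visible in memory */
      /* ... then signal the consumer ... */
  }

  void consumer_receive(uint32_t *shared_buf, uint32_t *msg, unsigned words)
  {
      cache_invalidate_range(shared_buf, words * 4);  /* avoid reading stale cached data */
      for (unsigned i = 0; i < words; i++)
          msg[i] = shared_buf[i];
  }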

15
JPEG Decoding parallelized through MP-Queue
  • Starting from a sequential JPEG decoding flow:
  1. Huffman extraction.
  2. Inverse DCT.
  3. Dequantization.
  4. Frame reconstruction.
  • Split/Join parallelization:
  • after step 1, the reconstructed 240x160 image is
  • split by the master core into slices,
  • delivered to N worker cores through MP-Queue messages (see the sketch below).
  • 2 different architectural templates are explored.
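A sketch of the master-side scatter loop in C. The slice geometry, queue ids and the Write() signature are assumptions; the 16-bit pixel type follows the 8x8 x 16-bit block format mentioned earlier.

  #include <stdint.h>

  #define WIDTH     240
  #define HEIGHT    160
  #define N_WORKERS 4

  extern void Write(int queue_id, const void *msg, unsigned size);

  /* Cut the reconstructed image into N horizontal slices and send one
   * slice to each worker over its own queue. */
  void master_scatter(const uint16_t image[HEIGHT][WIDTH])
  {
      const int rows_per_slice = HEIGHT / N_WORKERS;
      for (int w = 0; w < N_WORKERS; w++) {
          const uint16_t *slice = image[w * rows_per_slice];  /* first row of slice */
          Write(/* queue_id = */ w,
                slice,
                rows_per_slice * WIDTH * sizeof(uint16_t));
      }
  }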

16
Experimental part: metrics

Parallel part exec time = (Parallelizable part execution time / no. of parallel cores) + Communication overhead + Extra parallel overhead

Overall exec time = Sequential part exec time + Parallel part exec time

Ideal parallel time = Parallelizable part execution time / no. of parallel cores

[Figure: execution-time breakdown (sequential part, parallel part, ideal parallel time) versus number of cores]
17
JPEG Split/Join with one 1-N buffer in shared memory (cacheable)
  • We explored configurations with 1, 2, 4 and 8 workers (parallel cores).
  • Execution time (cost in terms of bus transactions) scales down.
  • The shared bus becomes quite busy when using 8 parallel cores: the destination bottleneck negates the speed-up.

18
JPEG Split/Join with locally mapped (scratchpad) 1-1 queues
  • This version performs significantly better when N increases beyond 2.
  • It eliminates the waiting time due to contention
  • on the bus,
  • on the shared memory.

19
Comparison of the two templates
[Charts: execution-time breakdown for the cacheable-shared and scratchpad templates]
  • In the second experiment (using scratchpad-located 1-1 queues) the communication overhead becomes negligible.
  • Extra parallel overhead remains:
  • this is mostly due to a synchronization mismatch;
  • it would be removed through a pipelined execution of the JPEG decoding.

20
Synchronous Dataflow
  • Each process has a firing rule:
  • it consumes and produces a fixed number of tokens every time it fires.
  • Predictable communication, hence easy scheduling.
  • Well-suited for multi-rate signal processing.
  • A subset of Kahn Networks: deterministic.

[Figure: SDF graph with port rates on each channel and an initial token (delay) on one channel]
21
MP-Queue middleware evolution: API extension towards the Dataflow MoC
  • Add_input_to_actor()
  • Transforms the FIFO into an input channel for the specified actor.
  • Add_output_to_actor()
  • Transforms the FIFO into an output channel for the specified actor.
  • Set_compute_function_for_actor()
  • By means of a function pointer, associates a portion of code with an actor.
  • That code is executed whenever the firing conditions are met, consuming the available input tokens.
  • Try_and_fire()
  • Cyclic checking of the firing conditions (see the sketch below).
  • Fire()
  • Invoked by try_and_fire().
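A usage sketch of the dataflow extension in C. The function names come from the slide; the argument lists, the actor id and the adder behaviour are assumptions.

  #include <stdint.h>

  /* Assumed prototypes for the API named above. */
  extern void Add_input_to_actor(int actor_id, int queue_id);
  extern void Add_output_to_actor(int actor_id, int queue_id);
  extern void Set_compute_function_for_actor(int actor_id, void (*fn)(void));
  extern int  Try_and_fire(int actor_id);   /* fires the actor if its firing conditions hold */

  #define ADDER 0

  static void adder_compute(void)
  {
      /* consume one token from each input, produce their sum (illustrative) */
  }

  void setup_and_run(void)
  {
      Add_input_to_actor(ADDER, /* queue_id = */ 1);   /* FIFO 1 becomes an input channel */
      Add_input_to_actor(ADDER, /* queue_id = */ 2);
      Add_output_to_actor(ADDER, /* queue_id = */ 3);  /* FIFO 3 becomes an output channel */
      Set_compute_function_for_actor(ADDER, adder_compute);

      for (;;)
          Try_and_fire(ADDER);   /* cyclically check the firing conditions */
  }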

22
SDF3 tool
  • Developed by Eindhoven University of Technology (TU/e):
  • Sander Stuijk, Marc Geilen and Twan Basten, SDF3: SDF For Free. In Proc. ACSD '06 (2006).
  • An extensive Synchronous DataFlow library:
  • SDFG analysis,
  • transformation algorithms,
  • visualization functionality.

23
Analysis techniques example 1
  • Compute the repetition vector:
  • how often each actor must fire, with respect to the other actors, without a net change in the token distribution.
  • In this example, the repetition vector indicates that
  • actor A should fire 3 times,
  • actor B 2 times,
  • actor C 1 time.
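The repetition vector is the smallest positive solution of the balance equations: for every channel from actor a to actor b producing p and consuming q tokens per firing, p r(a) = q r(b). The figure's port rates are not preserved in this transcript; one assignment consistent with the stated vector is A->B with rates 2:3 and B->C with rates 1:2:

  2\,r(A) = 3\,r(B), \qquad 1\,r(B) = 2\,r(C)
  \quad\Rightarrow\quad (r(A),\, r(B),\, r(C)) = (3,\, 2,\, 1)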

24
Analysis techniques example 2
  • Checking consistency:
  • a non-consistent SDFG requires unbounded memory to execute, or it deadlocks during its execution;
  • an SDFG is called consistent if its repetition vector is not equal to the null vector.
  • This example shows an SDFG that is inconsistent:
  • actor A produces too few tokens on the channel between A and B to allow an infinite execution of the graph.

25
Analysis techniques example 3
  • SDFG to HSDFG (Homogeneous SDFG) conversion:
  • any SDFG can be converted to a corresponding HSDFG in which all rates are equal to one;
  • the number of actors in the resulting HSDFG depends on the actors' entries in the repetition vector;
  • the conversion can lead to an exponential increase in the number of actors in the HSDFG.

26
Analysis techniques example 4
  • MCM (Maximum Cycle Mean) analysis:
  • the most commonly used means to compute the throughput of an SDFG.
  • Each cycle mean is computed as the total execution time on the cycle over the number of tokens in that cycle.
  • The critical cycle (the cycle limiting the throughput) is the upper cycle in the HSDFG shown here:
  • the execution time of the actors on the cycle is equal to two time units and only one token is present on the cycle.
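In symbols, with T(c) the total actor execution time on cycle c and N(c) the number of tokens circulating on c:

  \mathrm{MCM} \;=\; \max_{c \,\in\, \mathrm{cycles}} \frac{T(c)}{N(c)},
  \qquad \mathrm{throughput} \;=\; \frac{1}{\mathrm{MCM}}

For the critical cycle described above, T(c) = 2 time units and N(c) = 1 token, giving MCM = 2: one graph iteration every two time units.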

27
Design flow
  1. Give an XML representation of the dataflow.
  2. Invoke the SDF3 tool to produce the abstract internal representation.
  3. Produce a graphical output.
  4. Compute statistics and execute some analysis techniques.
  5. Generate C code.
  6. Cross-compile.
  7. Simulate on the virtual platform.

28
Mapping and scheduling issues
  • How to map several actors onto a single processor tile in an optimal way?
  • Scheduling issues: the need for
  • either an embedded OS with preemption and timeslicing,
  • or static analysis techniques to compute a feasible schedule that is not going to provoke any deadlocks.

29
Streaming Multimedia Applications Development in MPSoC Embedded Platforms: Design Flow Issues, Communication Middleware and Dataflow Support Library
  • Alessandro Dalla Torre,
  • Martino Ruggiero,
  • Luca Benini
  • Andrea Acquaviva
  • acquaviva_at_.univr.it