Optimization techniques for the development of embedded applications on single-chip multiprocessor platforms

1
Optimization techniques for the development of
embedded applications on single-chip
multiprocessor platforms

Luca Benini
Lbenini_at_deis.unibo.it
DEIS Università di Bologna

2
Embedded Systems
General purpose systems
Embedded systems
Microprocessor market shares
3
Example Area: Automotive Electronics

What is automotive electronics?
Vehicle functions implemented with electronics
Body electronics
System electronics: chassis, engine
Information/entertainment

4
Automotive Electronics Market Size
[Chart: cost of electronics per car, vertical axis from 0 to 1400, and market size by year]

Year    Market (billions)
1998    8.9
1999    10.5
2000    13.1
2001    14.1
2002    15.8
2003    17.4
2004    19.3
2005    21.0

2006: 25% of the total cost of a car will be electronics
90% of future innovations in vehicles will be based on electronic embedded systems
5
Automotive Electronics Platform Example
Source: "Expanding automotive electronic systems", IEEE Computer, Jan. 2002
6
Digital Convergence: Mobile Example

One device, multiple functions
Center of ubiquitous media network
Smart mobile device: next driver for the semiconductor industry

7
4th Gen and Next-Gen Networks
Includes 802.20, WiMAX (802.16), HSDPA, TDD
UMTS, UMTS and future versions of UMTS
8
SoC: Enabler for Digital Convergence
Future: 4G/5G, DMB, WiBro, etc.
Performance, low power, complexity, storage: > 100X
Today: SoC
9
Application pull
[Figure (IMEC): applications by year of introduction, 2005 to 2015, plotted against efficiency levels of 5 GOPS/W, 100 GOPS/W and 1 TOPS/W]
Applications shown: 3D gaming, 3D TV, 3D ambient interaction, structured decoding, ubiquitous navigation, 3D projected display, autonomous driving, HMI by motion, gesture detection, structured encoding, expression recognition, Gbit radio, collision avoidance, adaptive route, H264 encoding, language, dictation, emotion recognition, gesture recognition, UWB, sign recognition, A/V streaming, image recognition, 802.11n, Si Xray, mobile base-band, H264 decoding, auto personalization, fully recognition (security)
10
MPSoC Platform Evolution
[Figure: tile-based MPSoC platform in 45 nm technology]
Software layers: applications; middleware, RTOS, API, run-time controller
Software optimization and mapping: V, Vt, Fclk, IL
Hardware labels: bus-based multiprocessor, local memory hierarchy, network interface, router, power and test management, I/O peripherals, 3D stacked main memory
Tile figures: < 4 mm², 30 Mtr, < 1 GHz

Today's SoCs could fit in 1 tile!
Tile-based design


11
Multicores Are Here!
[Amarasinghe06]
[Figure: number of cores per chip (powers of two from 1 to 512) vs. year (1970 to 20??). Processors shown: 4004, 8008, 8080, 8086, 286, 386, 486, Pentium, P2, P3, P4, Itanium, Itanium 2, Athlon]
12
MPSoC: 2005 ITRS roadmap
[Martin06]
13
SoC: Solution-on-a-Chip
[Figure: layered view of a target system application (mobile terminal). Labels: System / Service, Application S/W, Middleware, e-SW, RTOS, HAL, S/W IP; System, Module, Chip, Process]

Requires design of hardware AND software

14
Design as optimization

Design space
The set of all possible design choices
Constraints
Solutions that we are not willing to accept
Cost function
A property we are interested in (execution time,
power, reliability)
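
The three ingredients above can be sketched in a few lines. This is only an illustration: the design space (core count and clock), the workload, the 2-second constraint and the power model are all hypothetical values chosen for the example.

```python
from itertools import product

WORK_MCYCLES = 800  # assumed workload, in millions of cycles

# Design space: every (cores, clock in MHz) combination. Values are hypothetical.
design_space = list(product([1, 2, 4], [100, 200, 400]))

def exec_time_s(cores, mhz):
    # Assumes the work parallelizes perfectly across cores.
    return WORK_MCYCLES / (cores * mhz)

def power_cost(cores, mhz):
    # Assumed cost model: power grows with core count and with clock squared.
    return cores * (mhz // 100) ** 2

def feasible(design):
    # Constraint: designs slower than 2 seconds are not acceptable.
    return exec_time_s(*design) <= 2.0

# Optimization: the feasible design with the lowest cost.
best = min((d for d in design_space if feasible(d)), key=lambda d: power_cost(*d))
print(best)  # (4, 100)
```

Exhaustive enumeration only works for tiny spaces like this one; the rest of the deck is about what to do when the space is too large to enumerate.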

15
Hardware synthesis
16
Behavioral synthesis
17
Allocation, Assignment, and Scheduling
Techniques Well Understood and Mature
18
Resource constraints
Control Step
19
Scheduling under resource constraints

Intractable problem
Algorithms
Exact
Integer linear program
Hu (restrictive assumptions)
Approximate
List scheduling
Force-directed scheduling
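
As an illustration of the approximate family, here is a minimal list-scheduling sketch for unit-delay operations. The graph, the priority order and the resource counts are hypothetical; real list schedulers usually derive priorities from the longest path to the sink, which is not done here.

```python
def list_schedule(ops, preds, op_type, units):
    """Unit-delay list scheduling.

    ops: operation ids in priority order (highest priority first)
    preds: op -> list of predecessor ops
    op_type: op -> resource type
    units: resource type -> number of units available per control step
    """
    start = {}
    step = 1
    while len(start) < len(ops):
        free = dict(units)  # units still free in this control step
        for v in ops:
            if v in start:
                continue
            # Ready only if every predecessor finished in an earlier step.
            ready = all(p in start and start[p] < step for p in preds.get(v, []))
            if ready and free[op_type[v]] > 0:
                start[v] = step
                free[op_type[v]] -= 1
        step += 1
    return start

# Hypothetical data-flow graph: three multiplications feeding two ALU operations.
preds = {"d": ["a", "b"], "e": ["c", "d"]}
op_type = {"a": "mul", "b": "mul", "c": "mul", "d": "alu", "e": "alu"}
schedule = list_schedule(["a", "b", "c", "d", "e"], preds, op_type, {"mul": 2, "alu": 1})
print(schedule)  # {'a': 1, 'b': 1, 'c': 2, 'd': 2, 'e': 3}
```

The heuristic runs in polynomial time but gives no optimality guarantee, which is the trade-off against the exact ILP approach.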

20
ILP formulation

Binary decision variables
$X = \{x_{il};\ i = 1, 2, \ldots, n;\ l = 1, 2, \ldots, \lambda + 1\}$
$x_{il}$ is TRUE only when operation $v_i$ starts in step $l$ of the schedule (i.e. $l = t_i$)
$\lambda$ is an upper bound on latency
Start time of operation $v_i$
$t_i = \sum_l l \cdot x_{il}$

21
ILP formulation constraints

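Operations start only once
$\sum_l x_{il} = 1, \quad i = 1, 2, \ldots, n$
Sequencing relations must be satisfied
$t_i \ge t_j + d_j \quad \forall (v_j, v_i) \in E$
$\sum_l l \cdot x_{il} - \sum_l l \cdot x_{jl} - d_j \ge 0 \quad \forall (v_j, v_i) \in E$
Resource bounds must be satisfied
Simple case (unit delay)
$\sum_{i: T(v_i) = k} x_{il} \le a_k, \quad k = 1, 2, \ldots, n_{res}, \quad \forall l$
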
22
ILP Formulation
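$\min \sum_l l \cdot x_{nl}$ such that
$\sum_l x_{il} = 1, \quad i = 1, 2, \ldots, n$
$\sum_l l \cdot x_{il} - \sum_l l \cdot x_{jl} - d_j \ge 0, \quad i, j = 1, 2, \ldots, n, \quad (v_j, v_i) \in E$
$\sum_{i: T(v_i) = k} \sum_{m = l - d_i + 1}^{l} x_{im} \le a_k, \quad k = 1, 2, \ldots, n_{res}, \quad l = 1, 2, \ldots, \lambda + 1$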
23
Example

Resource constraints
2 ALUs, 2 multipliers
$a_1 = 2, \quad a_2 = 2$
Single-cycle operation
$d_i = 1 \quad \forall i$

24
Example

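Operations start only once
$x_{1,1} = 1$
$x_{6,1} + x_{6,2} = 1$
Sequencing relations must be satisfied
$x_{6,1} + 2x_{6,2} - 2x_{7,2} - 3x_{7,3} + 1 \le 0$
$2x_{9,2} + 3x_{9,3} + 4x_{9,4} - 5x_{n,5} + 1 \le 0$
Resource bounds must be satisfied
$x_{1,1} + x_{2,1} + x_{6,1} + x_{8,1} \le 2$
$x_{3,2} + x_{6,2} + x_{7,2} + x_{8,2} \le 2$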
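
To make the three constraint families concrete, the sketch below checks a candidate 0/1 assignment for the unit-delay case. It is a feasibility checker, not a solver, and the four-operation graph is a hypothetical instance, not the one on the slide. The function name `check_ilp` and its parameters are made up for this illustration.

```python
def check_ilp(x, n_ops, n_steps, edges, op_type, bounds, delay=1):
    """Check a 0/1 assignment x[(i, l)] against the three constraint families."""
    ops = range(1, n_ops + 1)
    steps = range(1, n_steps + 1)

    def val(i, l):
        return x.get((i, l), 0)

    # Start time of operation i: sum over l of l * x_il.
    t = {i: sum(l * val(i, l) for l in steps) for i in ops}
    # Each operation starts exactly once.
    starts_once = all(sum(val(i, l) for l in steps) == 1 for i in ops)
    # Edge (j, i) means operation i consumes the result of operation j.
    sequencing = all(t[i] - t[j] - delay >= 0 for (j, i) in edges)
    # In every step, operations of type k use at most a_k units.
    resources = all(
        sum(val(i, l) for i in ops if op_type[i] == k) <= a_k
        for k, a_k in bounds.items()
        for l in steps
    )
    return starts_once and sequencing and resources

op_type = {1: "mul", 2: "mul", 3: "mul", 4: "alu"}
edges = [(1, 4), (2, 4)]
bounds = {"mul": 2, "alu": 2}
good = {(1, 1): 1, (2, 1): 1, (3, 2): 1, (4, 2): 1}
bad = {(1, 1): 1, (2, 1): 1, (3, 1): 1, (4, 2): 1}  # three multiplications in step 1
print(check_ilp(good, 4, 2, edges, op_type, bounds))  # True
print(check_ilp(bad, 4, 2, edges, op_type, bounds))   # False
```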

25
Example
26
Resource-Efficient Application Mapping for MPSoCs
MULTIMEDIA APPLICATIONS

Given a platform
Achieve a specified throughput
Minimize usage of shared resources

27
Application design flow
Abstraction gap
Platform Modelling
Starting Implementation
Optimization Analysis
Final Implementation
Optimal Solution
Platform Execution

The abstraction gap between high-level optimization tools and standard application programming models can introduce unpredictable and undesired behaviours.
Programmers must be aware of the simplifying assumptions made by optimization tools.
New methodology for multi-task application development on MPSoCs.

28
Resource assignment and scheduling
[Figure: THE SYSTEM. Nodes 1 to N are attached to a SHARED SYSTEM BUS together with an on-chip memory. Each node contains a processor, a limited-size tightly-coupled memory and a bus interface, and runs tasks A, B, ..., N with WCETs Ta, Tb, ..., Tn]
29
The application
Signal Processing Pipeline
Throughput Constraint

Each task is characterized by
WCET
Memory requirements

Queues for inter-processor communication
in TCM for efficiency reasons
Program data
in TCM (if space) or on-chip memory
Internal state
in TCM (if space) or on-chip memory

30
Communication-aware Allocation and Scheduling for
Stream-Oriented MPSoCs
[Figure: signal processing pipeline of tasks T0, T1, T2, ..., T7]

Simplifying assumptions vs predictability
Efficient solutions in reasonable time
Pure ILP formulations suitable for small task sets
Widespread use of heuristics

[Figure: message-oriented MPSoC architecture. ARM7 cores, each with a private memory and a local scratchpad memory, connected by a bus]
31
Master Problem model

Assignment of tasks and memory slots (master problem)
$T_{ij} = 1$ if task $i$ executes on processor $j$, 0 otherwise
$Y_{ij} = 1$ if task $i$ allocates program data on processor $j$ memory, 0 otherwise
$Z_{ij} = 1$ if task $i$ allocates the internal state on processor $j$ memory, 0 otherwise
$X_{ij} = 1$ if task $i$ executes on processor $j$ and task $i+1$ does not, 0 otherwise
Each task must be allocated to exactly one processor: $\sum_j T_{ij} = 1$ for all $i$
Link between variables X and T: $X_{ij} = |T_{ij} - T_{i+1,j}|$ for all $i$ and $j$ (can be linearized)
If a task is NOT allocated to a processor, neither are its required memories: $T_{ij} = 0 \Rightarrow Y_{ij} = 0$ and $Z_{ij} = 0$
Objective function: $\min \sum_i \sum_j \left[ mem_i (T_{ij} - Y_{ij}) + state_i (T_{ij} - Z_{ij}) + data_i X_{ij} / 2 \right]$

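
A minimal sketch of how this objective could be evaluated for a given allocation. It assumes the objective reads as the sum over tasks i and processors j of mem_i (T_ij - Y_ij) + state_i (T_ij - Z_ij) + data_i X_ij / 2, with X_ij = |T_ij - T_(i+1)j|. That reading, the function name and all numbers are assumptions made for illustration.

```python
def master_objective(T, Y, Z, mem, state, data):
    """Communication cost of an allocation under the assumed objective."""
    n_tasks, n_procs = len(T), len(T[0])
    total = 0.0
    for i in range(n_tasks):
        for j in range(n_procs):
            # The last task has no successor, so it contributes no X term.
            nxt = T[i + 1][j] if i + 1 < n_tasks else T[i][j]
            x_ij = abs(T[i][j] - nxt)
            total += (mem[i] * (T[i][j] - Y[i][j])
                      + state[i] * (T[i][j] - Z[i][j])
                      + data[i] * x_ij / 2)
    return total

mem, state, data = [10, 20], [4, 6], [8, 0]

# Two tasks on different processors; task 1 keeps its program data remote.
split = master_objective([[1, 0], [0, 1]], [[1, 0], [0, 0]], [[1, 0], [0, 1]],
                         mem, state, data)
# Both tasks and all their memory packed on processor 0.
packed = master_objective([[1, 0], [1, 0]], [[1, 0], [1, 0]], [[1, 0], [1, 0]],
                          mem, state, data)
print(split, packed)  # 28.0 0.0
```

The packed allocation costs zero, which is why a solver driven by this objective alone is drawn to it.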
32
Improvement of the model

With the proposed model, the allocation problem solver tends to pack all tasks on a single processor and all required memory in the local memory, so as to have ZERO communication cost: the TRIVIAL SOLUTION
To improve the model, we add a relaxation of the subproblem to the master problem model
For each set $S$ of consecutive tasks whose sum of durations exceeds the real-time requirement, we impose that they cannot all be on the same processor
$\sum_{i \in S} WCET_i > RT \Rightarrow \sum_{i \in S} T_{ij} \le |S| - 1$ for every processor $j$

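
The sets of consecutive tasks that trigger this relaxation can be enumerated directly. The sketch below keeps only the shortest run starting at each task, since the constraint for any longer run containing it is implied. Function names and the example WCETs are hypothetical.

```python
def consecutive_cuts(wcet, rt):
    """Return (first, last) index pairs of minimal runs whose total WCET exceeds rt."""
    cuts = []
    n = len(wcet)
    for first in range(n):
        total = 0
        for last in range(first, n):
            total += wcet[last]
            if total > rt:
                cuts.append((first, last))
                break  # longer runs from `first` are implied by this one
    return cuts

def violates(cut, proc_of):
    """True if every task in the run sits on the same processor."""
    first, last = cut
    return len({proc_of[i] for i in range(first, last + 1)}) == 1

cuts = consecutive_cuts([4, 3, 5, 2], rt=8)
print(cuts)  # [(0, 2), (1, 3)]
print(violates(cuts[0], [0, 0, 0, 1]))  # True
print(violates(cuts[0], [0, 1, 0, 1]))  # False
```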
34
Sub-Problem model

Task scheduling with static resource assignment
(subproblem)
We have to schedule tasks, so we have to decide when they start
Activity starting time: $Start_i \in [0..Deadline_i]$
Precedence constraints: $Start_i + Dur_i \le Start_j$
Real-time constraints for all activities running on the same processor
$\sum_i (Start_i + Dur_i) \le RT$
Cumulative constraints on resources
processors are unary resources
cumulative(Start, Dur, 1, 1)
memories are additive resources
cumulative(Start, Dur, MR, C)
What about the bus?

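
What a cumulative constraint enforces can be shown with a small checker: at every instant, the summed requirement of the running activities must not exceed the capacity. This is a sketch of the constraint's meaning, not of how a CP solver propagates it, and the example values are made up.

```python
def cumulative_ok(start, dur, req, cap):
    """True if resource usage never exceeds cap.

    Activity k occupies req[k] units during [start[k], start[k] + dur[k]).
    Usage can only rise at a start time, so checking start times is enough.
    """
    for t in sorted(set(start)):
        load = sum(r for s, d, r in zip(start, dur, req) if s <= t < s + d)
        if load > cap:
            return False
    return True

# Processor as a unary resource: each activity needs 1 unit, capacity 1.
print(cumulative_ok([0, 3, 5], [3, 2, 4], [1, 1, 1], 1))  # True
print(cumulative_ok([0, 2], [3, 2], [1, 1], 1))           # False (overlap at t = 2)
# Memory as an additive resource: requirements add up to at most the capacity.
print(cumulative_ok([0, 0], [4, 4], [16, 8], 32))         # True
```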
35
Bus model
Unary resource model, granularity: clock cycle
[Figure: bus bandwidth (bit/sec) over time, up to the max bus bandwidth, during the execution of task_i and task_j. State reads and state writes of task_i and task_j occupy the bus one at a time]
An arbitration mechanism decides the bus allocation
36
Bus model
Additive bus model
[Figure: bus bandwidth (bit/sec) over time, up to the max bus bandwidth. State reads and state writes of task_i and task_j overlap, and their bandwidth requirements add up]
Size of program data / TaskExecTime
Task0 accesses input data: BW = MaxBW / NoProc
The model does not hold under heavy bus congestion (more than 65% of total bandwidth): bus traffic has to be minimized
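
A rough sketch of the additive check, assuming each task's bandwidth demand is its data size divided by its execution time and that the model is trusted only while the summed demand stays within 65% of the maximum bandwidth. The function names, bus capacity and task sizes are hypothetical.

```python
def required_bw(data_bits, exec_time_s):
    """Average bandwidth a task asks of the bus (bit/s)."""
    return data_bits / exec_time_s

def additive_model_holds(requests, max_bw, limit=0.65):
    """requests: bandwidth demands (bit/s) of tasks active at the same time."""
    return sum(requests) <= limit * max_bw

MAX_BW = 1e9  # assumed bus capacity, bit/s
bw_a = required_bw(2_000_000, 0.01)  # about 2e8 bit/s
bw_b = required_bw(3_000_000, 0.01)  # about 3e8 bit/s
bw_c = required_bw(2_000_000, 0.01)

print(additive_model_holds([bw_a, bw_b], MAX_BW))        # True  (about 50% of the bus)
print(additive_model_holds([bw_a, bw_b, bw_c], MAX_BW))  # False (about 70% of the bus)
```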
37
No good generation

Assignment of tasks and memory slots (master
problem)
Task scheduling with static resource assignment
(subproblem)
If no feasible schedule exists for the allocation provided by the master, a no-good is generated.
We use a simple BUT EFFECTIVE one: identify the CONFLICTING RESOURCES CR. For each $R \in CR$, let $ST_R$ be the set of tasks allocated on $R$
$\sum_{i \in ST_R} T_{iR} \le |ST_R| - 1$
Other cuts are also possible [Hooker, Constraints 2005], but these are enough for our case and easy to extract

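
The master/subproblem loop with this kind of no-good can be sketched on a toy instance. Brute-force enumeration stands in for the IP master and a simple load test stands in for the CP scheduler, so this shows only the structure of the iteration, not the actual solvers. All names and numbers are hypothetical.

```python
from itertools import product

def solve_lbbd(dur, cost, rt, n_procs):
    """Toy decomposition loop: cheapest allocation whose per-processor load fits in rt."""
    n = len(dur)
    nogoods = []  # (processor, tasks): these tasks may not all share that processor
    while True:
        # Master: cheapest allocation that satisfies every no-good so far.
        best = None
        for alloc in product(range(n_procs), repeat=n):
            if any(all(alloc[t] == p for t in tasks) for p, tasks in nogoods):
                continue
            c = sum(cost[t][alloc[t]] for t in range(n))
            if best is None or c < best[0]:
                best = (c, alloc)
        if best is None:
            return None  # no allocation left: the problem is infeasible
        c, alloc = best
        # Subproblem stand-in: a processor conflicts if its total load exceeds rt.
        conflicts = [p for p in range(n_procs)
                     if sum(dur[t] for t in range(n) if alloc[t] == p) > rt]
        if not conflicts:
            return c, alloc
        # No-good: the tasks currently on a conflicting processor cannot all stay there.
        for p in conflicts:
            nogoods.append((p, frozenset(t for t in range(n) if alloc[t] == p)))

# Processor 0 is cheap, processor 1 is expensive; rt = 8 forbids three tasks on one core.
result = solve_lbbd(dur=[4, 4, 4], cost=[[1, 3]] * 3, rt=8, n_procs=2)
print(result)  # (5, (0, 0, 1))
```

The first master solution packs everything on the cheap processor; one no-good is enough to push the third task onto the other core.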
38
Computational efficiency

CP and IP formulations simplified
Hybrid approach clearly outperforms pure CP and IP techniques
Search time bounded to 15 minutes
Pure CP and IP found a solution in at most 50% of the instances
Hybrid approach always found a solution

39
Validation of bus model

Requesting more than 65% of the theoretical maximum bandwidth causes the additive model to fail.
Lower threshold (50%) in the presence of communication hotspots

Benefits of the additive model
task execution time almost independent of bus utilization
Performance predictability greatly enhanced

40
Validation of optimizer solutions

MAX error lower than 10%
AVG error equal to 4.7%, with standard deviation of 0.08
Optimizer turns out to be conservative in predicting infeasibility
The flow was successfully applied to a GSM benchmark

41
Energy-Efficient Application Mapping for MPSoCs
MULTIMEDIA APPLICATIONS

Given a platform
Achieve a specified throughput
Minimize power consumption

42
Application Mapping
Allocation
Schedule & freq. selection

The problem of allocation, scheduling and frequency selection for task graphs on multiprocessors in a distributed real-time system is NP-hard.
New tool flows for efficient mapping of multi-task applications onto hardware platforms

43
Exploiting Voltage Supply

Supply voltage impacts power and performance
Circuit slowdown: $T = 1/f = K / (V_{dd} - V_t)^a$
Cubic power savings: $P = C_{eff} V_{dd}^2 f$
Just-in-time computation
Stretch execution time up to the max tolerable

[Figure: power vs. available time, comparing fixed voltage plus shutdown with variable voltage]
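
A small numeric sketch of just-in-time computation, assuming dynamic power P = C_eff * Vdd^2 * f, so that the energy for a fixed number of cycles is C_eff * Vdd^2 * cycles. The two (Vdd, f) operating points, the capacitance and the deadline are hypothetical.

```python
def power_w(c_eff, vdd, f_hz):
    """Dynamic power: C_eff * Vdd^2 * f."""
    return c_eff * vdd ** 2 * f_hz

def energy_j(c_eff, vdd, f_hz, cycles):
    """Energy to run `cycles` clock cycles at the given operating point."""
    return power_w(c_eff, vdd, f_hz) * (cycles / f_hz)

def pick_point(points, cycles, deadline_s, c_eff):
    """Lowest-energy (Vdd, f) pair that still finishes before the deadline."""
    ok = [(v, f) for v, f in points if cycles / f <= deadline_s]
    return min(ok, key=lambda p: energy_j(c_eff, p[0], p[1], cycles))

C_EFF = 1e-9                              # assumed effective capacitance (F)
POINTS = [(1.2, 200e6), (0.9, 100e6)]     # assumed discrete (Vdd, f) pairs
CYCLES = 1_000_000

fast = energy_j(C_EFF, 1.2, 200e6, CYCLES)   # finishes in 5 ms, then idles
slow = energy_j(C_EFF, 0.9, 100e6, CYCLES)   # stretches execution to 10 ms
print(pick_point(POINTS, CYCLES, 0.012, C_EFF))  # (0.9, 100000000.0)
print(round(slow / fast, 4))                     # 0.5625
```

With a tighter deadline, for example 6 ms, only the fast point remains feasible and the saving disappears.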
44
Scheduling & Voltage Scaling
Different voltages: different frequencies
[Figure: a CPU schedule whose tasks run at frequencies f1, f2 and f3]
45
Target architecture - 2

Homogeneous computation tiles
ARM cores (including instruction and data
caches)
Tightly coupled software-controlled scratch-pad
memories (SPM)
AMBA AHB
DMA engine
RTEMS OS
Homogeneous technology (0.13 µm), industrial power models (ST)

Variable Voltage/Frequency cores with discrete
(Vdd,f) pairs
Frequency dividers scale down the baseline 200
MHz system clock
Cores use non-cacheable shared memory to
communicate
Semaphore and interrupt facilities are used for
synchronization
Private on-chip memory to store data.

46
Application model

A task graph represents
A group of tasks T
Task dependencies
Execution times expressed in clock cycles: WCN(Ti)
Communication times (writes & reads) expressed as WCN(WTiTj) and WCN(RTiTj)
These values can be back-annotated from functional simulation

[Figure: task graph with six tasks. Each task Ti is annotated with WCN(Ti) and each edge Ti → Tj with WCN(WTiTj) and WCN(RTiTj). Edges: Task1 → Task2, Task1 → Task3, Task2 → Task4, Task3 → Task5, Task4 → Task6, Task5 → Task6]
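
Because costs are expressed in clock cycles, a task's duration depends on the frequency chosen for it. The sketch below converts cycle counts into time for a core clocked at the 200 MHz baseline divided by an integer divider. The cycle counts are hypothetical, and treating communication cycles as scaling with the core clock is an assumption.

```python
BASE_CLOCK_HZ = 200_000_000  # baseline system clock of the target platform

def duration_s(wcn_cycles, divider):
    """Time to run wcn_cycles when the core clock is BASE_CLOCK_HZ / divider."""
    return wcn_cycles * divider / BASE_CLOCK_HZ

def task_duration_s(wcn_exec, wcn_reads, wcn_writes, divider):
    # Total cycles: reads of all inputs, execution, writes of all outputs.
    total = sum(wcn_reads) + wcn_exec + sum(wcn_writes)
    return duration_s(total, divider)

# A task with 900k execution cycles, one input read and one output write of 50k each.
at_200mhz = task_duration_s(900_000, [50_000], [50_000], divider=1)
at_100mhz = task_duration_s(900_000, [50_000], [50_000], divider=2)
print(at_200mhz, at_100mhz)  # 0.005 0.01
```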
47
Efficient Application Development Support

Optimization tools generally rely on many simplifying assumptions
Neglecting these assumptions in the software implementation can
generate unpredictable and undesired system-level interactions
make the overall system error-prone.
We propose a complete framework to help programmers with the software implementation
a generic customizable application template: OFFLINE SUPPORT
a set of high-level APIs: ONLINE SUPPORT.
The main goals of our development framework are
exact and reliable execution of applications after the optimization step
guarantees of high performance and constraint satisfaction.

48
Customizable Application Template

Starting from a high-level task and data flow graph, software developers can easily and quickly build their application infrastructure.
Programmers can intuitively translate the high-level representation into C code using our facilities and library.
Users can specify
the number of tasks included in the target application
their nature (e.g. branch, fork, or-node, and-node)
their precedence constraints (e.g. due to data communication)
...thus quickly drawing the application's CTG schema.
Programmers can focus on the functionality of the tasks
the main effort goes into the more specific and critical sections of the application.

49
OS-level and Task-level APIs

Users can easily reproduce optimizer solutions, thus
Without having to deal directly with the optimizer's abstractions
Task model
Communication model
OS overheads.
Obtaining the required satisfaction of application constraints.
Programmers can allocate to the right hardware resources
Tasks
Program data
Queues.
Scheduling support APIs
Frequency and voltage selection
Communication issues
Shared queues
Semaphores
Interrupts.

50
Example
[Figure: conditional task graph with 12 nodes (T1 to T12) connected by arcs a1 to a14. Node labels: N1, B2, B3, C4 to C7, N8 to N11; node behaviours shown: fork, branch, or, and]

Number of nodes: 12
Graph of activities
Node type
Normal, Branch, Conditional, Terminator
Node behaviour
Or, And, Fork, Branch
Number of CPUs: 2
Task allocation
Task scheduling
Arc priorities
Freq. & voltage

#define TASK_NUMBER 12
#define N_CPU 2

// Node Type: 0 NORMAL, 1 BRANCH, 2 STOCHASTIC
uint node_type[TASK_NUMBER] = {1,2,2,1,..};

// Node Behaviour: 0 AND, 1 OR, 2 FORK, 3 BRANCH
uint node_behaviour[TASK_NUMBER] = {2,3,3,..};

uint task_on_core[TASK_NUMBER] = {1,1,2,1};
int schedule_on_core[N_CPU][TASK_NUMBER] = {{1,2,4,8..}};

uint queue_consumer[..][..] = {
    {0,1,1,0,..},
    {0,0,0,1,1,.},
    {0,0,0,0,0,1,1..},
    {0,0,0,0,....}
};

[Figure: resulting schedule on the two processors (resources vs. time, with the deadline marked), showing activities N1, B2, B3, C4, C7, N8, N10, N11 and T12]
51
Queue ordering optimization
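[Figure: tasks T1 to T6 on CPU1 and CPU2 exchanging messages C1 to C5 through queues. Depending on the order of the communications, a consumer task either has to wait or can run immediately]

Communication ordering affects system performance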

53
Synchronization among tasks
[Figure: tasks T1 to T4 on Proc. 1 and Proc. 2 synchronizing on communications C1, C2 and C3. T4 is suspended, then re-activated]
Non-blocking semaphores
54
Logic Based Benders Decomposition
[Figure: iteration loop between the two solvers]
Allocation & freq. assignment: INTEGER PROGRAMMING
Memory constraints
Obj. function: communication cost & energy consumption
Passes a valid allocation to the scheduler
Scheduling: CONSTRAINT PROGRAMMING
Real-time constraint
Returns a no-good (linear constraint) to the allocator

Decomposes the problem into 2 sub-problems
Allocation & assignment of freq. settings → IP
Objective function: minimizing energy consumption during execution and communication of tasks
Scheduling → CP
Objective function: minimizing energy consumption during frequency switching

55
Solver Performance

Hundreds of decision variables
Far beyond the capability of a pure ILP or CP solver

56
Allocation problem model
The objective function minimizes the energy consumption associated with task execution and communication
$X_{tfp} = 1$ if task $t$ executes on processor $p$ at frequency $f$
$W_{ijfp} = 1$ if tasks $i$ and $j$ run on different cores, and task $i$ on core $p$ writes data to $j$ at frequency $f$
$R_{ijfp} = 1$ if tasks $i$ and $j$ run on different cores, and task $j$ on core $p$ reads data from $i$ at frequency $f$
57
Allocation problem model
58
Scheduling problem model
Duration of task i is now fixed since mode is fixed
Reading phase
Writing phase

Five phases behaviour
INPUT: input data reading
EXEC: computation activity
OUTPUT: output data writing.
Atomic activities

[Figure: fork/join task graph in which each task is split into input, exec and output activities]

Processors are modelled as unary resources
Bus is modelled as additive resource

The objective function minimizes the energy consumption associated with frequency switching
59
Application Development Methodology
Simulator
Optimizer
Application Profiles
Optimization Phase
Characterization Phase
Allocation Scheduling
Application Development Support
Optimal SW Application Implementation
Platform Execution
60
Validation of optimizer solutions: Throughput
[Figure: optimizer validation on 250 instances. Axes: probability (%) vs. throughput difference (%)]

MAX error lower than 10%
AVG error equal to 4.51%, with standard deviation of 1.94
All the deadline constraints are satisfied.

61
Validation of optimizer solutions: Power
[Figure: optimizer validation on 250 instances. Axes: probability (%) vs. energy consumption difference (%)]

MAX error lower than 10%
AVG error equal to 4.80%, with standard deviation of 1.71

62
GSM Encoder

Task graph
10 computational tasks
15 communication tasks.

Throughput required: 1 frame/10 ms.
With 2 processors and 4 possible frequency/voltage settings

Without optimizations: 50.9 µJ
With optimizations: 17.1 µJ
Energy saving: -66.4%
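
The reported saving follows directly from the two energy figures; a quick check:

```python
baseline_uj = 50.9   # energy without optimizations (µJ)
optimized_uj = 17.1  # energy with optimizations (µJ)

saving_pct = (baseline_uj - optimized_uj) / baseline_uj * 100
print(f"{saving_pct:.1f}%")  # 66.4%
```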
63
Summary & future work

Energy-optimal task mapping
Strong optimization engine (complete)
Programmer support (design & exec time)
Validation: accuracy & optimality
Future work
Conditional task graphs
Dealing with multiple use cases
Variable execution times
Aggressive communication scheduling
