Title: Optimization techniques for developing embedded applications on single-chip multiprocessor platforms
1Optimization techniques for developing embedded applications on single-chip multiprocessor platforms
- Luca Benini
- Lbenini_at_deis.unibo.it
- DEIS Università di Bologna
2Embedded Systems
General purpose systems
Embedded systems
Microprocessor market shares
3Example Area Automotive Electronics
- What is automotive electronics?
- Vehicle functions implemented with electronics
- Body electronics
- System electronics chassis, engine
- Information/entertainment
4Automotive Electronics Market Size
Cost of Electronics / Car () [chart, vertical axis 0 to 1400, years 1998 to 2005]
Year               1998  1999  2000  2001  2002  2003  2004  2005
Market (billions)   8.9  10.5  13.1  14.1  15.8  17.4  19.3  21.0
- 2006: 25% of the total cost of a car will be electronics
- 90% of future innovations in vehicles are based on electronic embedded systems
5Automotive Electronics Platform Example
Source: "Expanding automotive electronic systems", IEEE Computer, Jan. 2002
6Digital Convergence Mobile Example
- One device, multiple functions
- Center of ubiquitous media network
- Smart mobile device: next driver for the semiconductor industry
74th Gen and Next-Gen Networks
Includes 802.20, WiMAX (802.16), HSDPA, TDD
UMTS, UMTS and future versions of UMTS
8SoC Enabler for Digital Convergence
4G/5G, DMB, WiBro, etc.
Future
Performance Low Power Complexity Storage
> 100X
Today
SoC
9Application pull
[Figure, source IMEC: applications plotted by year of introduction (2005 to 2015) against required power efficiency (5 GOPS/W, 100 GOPS/W, 1 TOPS/W). Applications shown: mobile base-band, H264 decoding, H264 encoding, A/V streaming, 802.11n, UWB, Gbit radio, Si Xray, image recognition, sign recognition, gesture recognition, emotion recognition, expression recognition, fully recognition (security), auto personalization, language dictation, adaptive route, collision avoidance, autonomous driving, ubiquitous navigation, structured encoding, structured decoding, HMI by motion / gesture detection, 3D gaming, 3D TV, 3D ambient interaction, 3D projected display.]
10MPSoC Platform Evolution
[Figure: tile-based MPSoC platform. Software layers: applications; middleware, RTOS, API, run-time controller; software opt.; mapping (V, Vt, Fclk, IL). Hardware labels: I/O peripherals, 45 nm, router, bus-based multiprocessor, < 4 mm², network interface, 3D stacked main memory, 30 Mtr, local memory hierarchy, power and test management, < 1 GHz.]
- Today's SoCs could fit in 1 tile!
- Tile-based design
11Multicores Are Here!
[Figure, Amarasinghe06: number of cores (1 to 512, doubling scale) versus year (1970 to 20??). Processors shown: 4004, 8008, 8080, 8086, 286, 386, 486, Pentium, P2, P3, P4, Itanium, Itanium 2, Athlon.]
12MPSoC 2005 ITRS roadmap
Martin06
13SoC → Solution-on-a-Chip
Target System Application
System / Service
Application S/W
Mobile Terminal
Middleware
System
e-SW
Module
RTOS
Chip
HAL
Chip
Process
S/W IP
- Requires design of Hardware AND software
14Design as optimization
- Design space
- The set of all possible design choices
- Constraints
- Solutions that we are not willing to accept
- Cost function
- A property we are interested in (execution time,
power, reliability)
15Hardware synthesis
16Behavioral synthesis
17Allocation, Assignment, and Scheduling
Techniques Well Understood and Mature
18Resource constraints
Control Step
19Scheduling under resource constraints
- Intractable problem
- Algorithms
- Exact
- Integer linear program
- Hu (restrictive assumptions)
- Approximate
- List scheduling
- Force-directed scheduling
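As an illustration of the approximate family listed above, here is a minimal list-scheduling sketch. It assumes unit-delay operations, a single resource type, an acyclic dependency graph, and a caller-supplied priority per operation; the names `list_schedule`, `pred`, `priority` and `MAX_OPS` are hypothetical and do not come from the slides.

```c
#include <assert.h>

#define MAX_OPS 32

/* List scheduling for unit-delay operations and one resource type.
 * pred[i][j] != 0 means operation j must finish before operation i starts.
 * priority[i]: among ready operations, the larger value is scheduled first.
 * Writes the 1-based start step of each operation into start[] and returns
 * the number of control steps used. The graph is assumed to be acyclic. */
int list_schedule(int n, int pred[MAX_OPS][MAX_OPS],
                  const int priority[], int units, int start[])
{
    assert(n > 0 && n <= MAX_OPS && units > 0);
    int scheduled = 0, step = 0;
    for (int i = 0; i < n; i++) start[i] = 0;   /* 0 = not scheduled yet */

    while (scheduled < n) {
        step++;
        int used = 0;
        while (used < units) {
            int best = -1;
            for (int i = 0; i < n; i++) {
                if (start[i] != 0) continue;
                int ready = 1;
                for (int j = 0; j < n; j++)
                    /* a predecessor must have run in an earlier step */
                    if (pred[i][j] && (start[j] == 0 || start[j] >= step))
                        ready = 0;
                if (ready && (best < 0 || priority[i] > priority[best]))
                    best = i;
            }
            if (best < 0) break;    /* nothing else is ready in this step */
            start[best] = step;
            used++;
            scheduled++;
        }
    }
    return step;
}
```

The result depends on the priority function, which is why this is a heuristic: it runs quickly but gives no optimality guarantee, unlike the exact ILP approach.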
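20ILP formulation
- Binary decision variables
- $X = \{x_{il};\ i = 1, 2, \dots, n;\ l = 1, 2, \dots, \lambda + 1\}$
- $x_{il}$ is TRUE only when operation $v_i$ starts in step $l$ of the schedule (i.e. $l = t_i$)
- $\lambda$ is an upper bound on latency
- Start time of operation $v_i$
- $t_i = \sum_l l \cdot x_{il}$
21ILP formulation constraints
- Operations start only once
- $\sum_l x_{il} = 1, \quad i = 1, 2, \dots, n$
- Sequencing relations must be satisfied
- $t_i \ge t_j + d_j \quad \forall (v_j, v_i) \in E$
- $\sum_l l \cdot x_{il} - \sum_l l \cdot x_{jl} - d_j \ge 0 \quad \forall (v_j, v_i) \in E$
- Resource bounds must be satisfied
- Simple case (unit delay)
- $\sum_{i: T(v_i) = k} x_{il} \le a_k, \quad k = 1, 2, \dots, n_{res}, \quad \forall l$
22ILP Formulation
$\min \sum_l l \cdot x_{nl}$ such that
$\sum_l x_{il} = 1, \quad i = 1, 2, \dots, n$
$\sum_l l \cdot x_{il} - \sum_l l \cdot x_{jl} - d_j \ge 0, \quad i, j = 1, 2, \dots, n,\ (v_j, v_i) \in E$
$\sum_{i: T(v_i) = k} \sum_{m = l - d_i + 1}^{l} x_{im} \le a_k, \quad k = 1, 2, \dots, n_{res}, \quad l = 0, 1, \dots, \lambda$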
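The constraints of this formulation can be checked directly on a candidate schedule. The sketch below is one possible way to do that, not a solver: it takes start times $t_i$ (so "starts only once" holds by construction) and tests the sequencing and resource-bound constraints. The function name, the `Edge` type and the 0-based resource type indices are assumptions made for the example.

```c
#include <assert.h>

#define MAX_RES 8

typedef struct { int from, to; } Edge;   /* edge (v_from, v_to) in E */

/* Returns 1 if start times t[] (1-based steps, at most lambda + 1) satisfy
 * the sequencing and resource constraints of the ILP model, 0 otherwise.
 * d[i] is the delay of operation i, type[i] its resource type (0-based),
 * a[k] the number of available units of type k. */
int schedule_is_feasible(int n, const int t[], const int d[], const int type[],
                         int n_edges, const Edge e[],
                         int n_res, const int a[], int lambda)
{
    assert(n > 0 && n_res > 0 && n_res <= MAX_RES);
    int last = 0;
    for (int i = 0; i < n; i++) {
        if (t[i] < 1 || t[i] > lambda + 1) return 0;
        if (t[i] + d[i] - 1 > last) last = t[i] + d[i] - 1;
    }
    /* Sequencing: t_i >= t_j + d_j for every edge (v_j, v_i). */
    for (int k = 0; k < n_edges; k++)
        if (t[e[k].to] < t[e[k].from] + d[e[k].from]) return 0;
    /* Resource bounds: operation i is busy in steps t_i .. t_i + d_i - 1,
     * which matches the inner sum over m = l - d_i + 1 .. l. */
    for (int l = 1; l <= last; l++) {
        int busy[MAX_RES] = {0};
        for (int i = 0; i < n; i++)
            if (t[i] <= l && l < t[i] + d[i]) busy[type[i]]++;
        for (int k = 0; k < n_res; k++)
            if (busy[k] > a[k]) return 0;
    }
    return 1;
}
```

An ILP solver searches over all assignments of the $x_{il}$ variables; this check only evaluates one of them.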
23Example
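- Resource constraints
- 2 ALUs, 2 multipliers
- $a_1 = 2,\ a_2 = 2$
- Single-cycle operation
- $d_i = 1 \quad \forall i$
24Example
- Operations start only once
- $x_{1,1} = 1$
- $x_{6,1} + x_{6,2} = 1$
- ...
- Sequencing relations must be satisfied
- $x_{6,1} + 2 x_{6,2} - 2 x_{7,2} - 3 x_{7,3} + 1 \le 0$
- $2 x_{9,2} + 3 x_{9,3} + 4 x_{9,4} - 5 x_{N,5} + 1 \le 0$
- ...
- Resource bounds must be satisfied
- $x_{1,1} + x_{2,1} + x_{6,1} + x_{8,1} \le 2$
- $x_{3,2} + x_{6,2} + x_{7,2} + x_{8,2} \le 2$
- ...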
25Example
26Resource-Efficient Application mapping for MPSoCs
MULTIMEDIA APPLICATIONS
- Given a platform
- Achieve a specified throughput
- Minimize usage of shared resources
27Application design flow
Abstraction gap
Platform Modelling
Starting Implementation
Optimization Analysis
Final Implementation
Optimal Solution
Platform Execution
- The abstraction gap between high-level optimization tools and standard application programming models can introduce unpredictable and undesired behaviours.
- Programmers must be aware of the simplifying assumptions made by optimization tools.
- New methodology for multi-task application development on MPSoCs.
28Resource assignment and scheduling
THE SYSTEM
Task. A (WCET Ta)
Processor
Task. B (WCET Tb)
. . . . .
Limited Size Mem
Tightly-Coupled Memory
Task. N (WCET Tn)
Node 1
Node N
Bus Interface
SHARED SYSTEM BUS
On-chip Memory
29The application
Signal Processing Pipeline
Throughput Constraint
- Each task is characterized by
- WCET
- Memory requirements
- Queues for inter-processor communication
- in TCM for efficiency reasons
- Program data
- in TCM (if space) or on-chip memory
- Internal state
- in TCM (if space) or on-chip memory
30Communication-aware Allocation and Scheduling for
Stream-Oriented MPSoCs
T7
T1
T2
T0
Signal Processing Pipeline
..
- Simplifying assumptions vs. predictability
- Efficient solutions in reasonable time
- Pure ILP formulations suitable for small task sets
- Widespread use of heuristics
[Figure: message-oriented MPSoC architecture. ARM7 cores, each with private memory and local scratchpad memory, connected by a bus.]
31Master Problem model
- Assignment of tasks and memory slots (master problem)
- $T_{ij} = 1$ if task $i$ executes on processor $j$, 0 otherwise
- $Y_{ij} = 1$ if task $i$ allocates its program data in the memory of processor $j$, 0 otherwise
- $Z_{ij} = 1$ if task $i$ allocates its internal state in the memory of processor $j$, 0 otherwise
- $X_{ij} = 1$ if task $i$ executes on processor $j$ and task $i+1$ does not, 0 otherwise
- Each process should be allocated to one processor: $\sum_j T_{ij} = 1$ for all $i$
- Link between variables X and T: $X_{ij} = |T_{ij} - T_{i+1,j}|$ for all $i$ and $j$ (can be linearized)
- If a task is NOT allocated to a processor, neither are its required memories: $T_{ij} = 0 \Rightarrow Y_{ij} = 0$ and $Z_{ij} = 0$
- Objective function: $\sum_j \sum_i \left[ mem_i (T_{ij} - Y_{ij}) + state_i (T_{ij} - Z_{ij}) + data_i X_{ij} / 2 \right]$
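To make the objective concrete, the sketch below evaluates it for a given assignment. It is an illustration under stated assumptions: the link between X and T is taken to be the absolute difference $|T_{ij} - T_{i+1,j}|$, the last task contributes no X term, and the name `allocation_cost` and the array layout are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

#define MAX_PROCS 4

/* Evaluates
 *   sum_j sum_i mem_i (T_ij - Y_ij) + state_i (T_ij - Z_ij) + data_i X_ij / 2
 * with X_ij = |T_ij - T_(i+1)j| (assumed form of the X/T link).
 * n = number of tasks, p = number of processors. */
double allocation_cost(int n, int p,
                       int T[][MAX_PROCS], int Y[][MAX_PROCS], int Z[][MAX_PROCS],
                       const double mem[], const double state[],
                       const double data[])
{
    assert(n > 0 && p > 0 && p <= MAX_PROCS);
    double cost = 0.0;
    for (int j = 0; j < p; j++) {
        for (int i = 0; i < n; i++) {
            /* program data or state kept off the local memory costs bus traffic */
            cost += mem[i] * (T[i][j] - Y[i][j]);
            cost += state[i] * (T[i][j] - Z[i][j]);
            /* a processor change between i and i+1 is seen on two processors,
             * hence the division by 2 */
            if (i + 1 < n)
                cost += data[i] * abs(T[i][j] - T[i + 1][j]) / 2.0;
        }
    }
    return cost;
}
```

With every task on one processor and all data local, each term vanishes and the cost is zero.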
32Improvement of the model
- With the proposed model, the allocation solver tends to pack all tasks onto a single processor and all required memory into its local memory, so as to have ZERO communication cost: the TRIVIAL SOLUTION
- To improve the model, we add a relaxation of the subproblem to the master problem model
- For each set S of consecutive tasks whose total duration exceeds the real-time requirement, we impose that they cannot all run on the same processor
- $\sum_{i \in S} WCET_i > RT \Rightarrow \sum_{i \in S} T_{ij} \le |S| - 1 \quad \forall j$
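The relaxation can be read as a test on an allocation: no window of consecutive tasks whose WCETs add up to more than RT may sit entirely on one processor. A minimal sketch of that test follows; it stores the allocation as one processor index per task instead of the $T_{ij}$ matrix, and the function name is hypothetical.

```c
#include <assert.h>

/* Returns 1 if some set S of consecutive tasks has sum of WCETs > rt and all
 * of its tasks are on the same processor, i.e. the allocation violates
 * sum_{i in S} T_ij <= |S| - 1 for that processor j. Returns 0 otherwise.
 * proc[i] is the processor assigned to task i. */
int violates_window_cut(int n, const int wcet[], const int proc[], int rt)
{
    assert(n > 0 && rt >= 0);
    for (int s = 0; s < n; s++) {
        int sum = 0, same = 1;
        for (int e = s; e < n; e++) {
            sum += wcet[e];
            if (proc[e] != proc[s]) same = 0;
            if (sum > rt && same) return 1;
        }
    }
    return 0;
}
```

In the master problem these conditions would be added as linear constraints up front, so the trivial all-on-one-processor allocation is never proposed when it cannot meet the real-time requirement.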
33Sub-Problem model
- Task scheduling with static resource assignment
(subproblem)
34Sub-Problem model
- Task scheduling with static resource assignment (subproblem)
- We have to schedule tasks, so we have to decide when they start
- Activity starting time: $Start_i \in [0..Deadline_i]$
- Precedence constraints: $Start_i + Dur_i \le Start_j$
- Real-time constraints for all activities running on the same processor
- $\sum_i (Start_i + Dur_i) \le RT$
- Cumulative constraints on resources
- Processors are unary resources: cumulative(Start, Dur, [1], 1)
- Memories are additive resources: cumulative(Start, Dur, MR, C)
- What about the bus?
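The cumulative constraint used above states that, at every instant, the total requirement of the activities in execution must not exceed the resource capacity. The sketch below checks that property for fixed start times with integer time steps; a CP solver would instead propagate it while searching. The function name and the explicit `horizon` parameter are assumptions for the example.

```c
#include <assert.h>

/* Checks cumulative(Start, Dur, Req, Capacity) for fixed start times:
 * at every time t in [0, horizon) the requirements of the running
 * activities must sum to at most capacity. Activity i runs in
 * [start[i], start[i] + dur[i]). */
int cumulative_ok(int n, const int start[], const int dur[],
                  const int req[], int capacity, int horizon)
{
    assert(n > 0 && capacity >= 0 && horizon >= 0);
    for (int t = 0; t < horizon; t++) {
        int used = 0;
        for (int i = 0; i < n; i++)
            if (start[i] <= t && t < start[i] + dur[i])
                used += req[i];
        if (used > capacity) return 0;
    }
    return 1;
}
```

A processor corresponds to requirement 1 and capacity 1 (unary resource); a memory corresponds to per-task requirements MR and capacity C (additive resource).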
35Bus model
Unary resource granularity clock cycle
BANDWIDTH BIT/SEC
Execution time taski and task j
Max bus bandwidth
TIME
Taski state write
Taskj State write
Taski state read
Taskj state read
Arbitration mechanism that decides the bus
allocation
36Bus model
BANDWIDTH BIT/SEC
Additive bus model
Max bus bandwidth
Size of program data / TaskExecTime
Task0 accesses input data: BW = MaxBW / NoProc
taski
taskj
TIME
Taski state write
Taski state write
Taski state read
Taskj state read
The model does not hold under heavy bus congestion (more than 65% of total bandwidth): bus traffic has to be minimized
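A rough way to apply the additive model is to give each task a bandwidth equal to the data it moves divided by its execution time, sum the bandwidths of the tasks that overlap in time, and compare the total with the fraction of the maximum bandwidth below which the model is reported to hold. The sketch below assumes exactly that; the function names are hypothetical and the threshold is a parameter so the 65% figure from the slide can be passed in.

```c
#include <assert.h>

/* Average bandwidth requested by one task: data moved over execution time. */
double task_bandwidth(double data_bits, double exec_time_s)
{
    assert(exec_time_s > 0.0);
    return data_bits / exec_time_s;
}

/* Returns 1 if the summed bandwidth of concurrently running tasks stays
 * within threshold * max_bw (for example threshold = 0.65), 0 otherwise. */
int additive_model_holds(int n, const double bw[], double max_bw,
                         double threshold)
{
    assert(n > 0 && max_bw > 0.0);
    double total = 0.0;
    for (int i = 0; i < n; i++)
        total += bw[i];
    return total <= threshold * max_bw;
}
```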
37No good generation
- Assignment of tasks and memory slots (master problem)
- Task scheduling with static resource assignment (subproblem)
- If no feasible schedule exists for the allocation provided by the master, a no-good is generated.
- We use a simple BUT EFFECTIVE one: identify the CONFLICTING RESOURCES CR. For each $R \in CR$, let $ST_R$ be the set of tasks allocated on R
- $\sum_{i \in ST_R} T_{iR} \le |ST_R| - 1$
- Other cuts are also possible (Hooker, Constraints 2005), but these are enough for our case and easy to extract
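The interaction between master, subproblem and no-goods can be illustrated on a toy instance. The sketch below is not the IP/CP implementation described in the slides: the master is a brute-force search over allocations on two processors, the cost is simply the number of consecutive task pairs split across processors, and the subproblem only checks that the load on each processor fits within RT. All names are hypothetical.

```c
#include <assert.h>

#define MAX_CUTS 64

/* No-good: the tasks in bitmask `tasks` must not all be on processor `proc`. */
typedef struct { unsigned tasks; int proc; } NoGood;

/* 1 if alloc (bit i = processor of task i) puts every task of g on g->proc. */
static int violates(const NoGood *g, unsigned alloc, int n)
{
    for (int i = 0; i < n; i++)
        if (((g->tasks >> i) & 1u) && (int)((alloc >> i) & 1u) != g->proc)
            return 0;
    return 1;
}

/* Toy communication cost: consecutive tasks placed on different processors. */
static int comm_cost(unsigned alloc, int n)
{
    int cost = 0;
    for (int i = 0; i + 1 < n; i++)
        if (((alloc >> i) & 1u) != ((alloc >> (i + 1)) & 1u)) cost++;
    return cost;
}

/* Returns the cheapest allocation accepted by the subproblem, or -1. */
int solve_lbbd(int n, const int wcet[], int rt, int *iterations)
{
    assert(n >= 1 && n <= 5);
    NoGood cuts[MAX_CUTS];
    int n_cuts = 0;
    *iterations = 0;
    for (;;) {
        /* Master: cheapest allocation that violates no stored no-good. */
        int best = -1, best_cost = 0;
        for (unsigned a = 0; a < (1u << n); a++) {
            int ok = 1;
            for (int c = 0; c < n_cuts && ok; c++)
                if (violates(&cuts[c], a, n)) ok = 0;
            if (!ok) continue;
            int cost = comm_cost(a, n);
            if (best < 0 || cost < best_cost) { best = (int)a; best_cost = cost; }
        }
        if (best < 0) return -1;            /* no allocation left */
        (*iterations)++;
        /* Subproblem: the load on each processor must fit within rt. */
        int feasible = 1;
        for (int p = 0; p < 2; p++) {
            int load = 0;
            unsigned set = 0;
            for (int i = 0; i < n; i++)
                if ((int)(((unsigned)best >> i) & 1u) == p) {
                    load += wcet[i];
                    set |= 1u << i;
                }
            if (load > rt) {                /* p is a conflicting resource */
                feasible = 0;
                assert(n_cuts < MAX_CUTS);
                cuts[n_cuts].tasks = set;
                cuts[n_cuts].proc = p;
                n_cuts++;
            }
        }
        if (feasible) return best;
    }
}
```

Each no-good also rules out every allocation that places a superset of the conflicting tasks on the same processor, which is what lets the loop converge in few iterations.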
38Computational efficiency
- CP and IP formulations simplified
- Hybrid approach clearly outperforms pure CP and IP techniques
- Search time bounded to 15 minutes
- CP and IP find a solution in only 50% or fewer of the instances
- Hybrid approach always found a solution
39Validation of bus model
- Requesting more than 65% of the theoretical maximum bandwidth causes the additive model to fail.
- Lower threshold in the presence of communication hotspots (50%)
- Benefits of the additive model
- Task execution time almost independent of bus utilization
- Performance predictability greatly enhanced
40Validation of optimizer solutions
- MAX error lower than 10%
- AVG error equal to 4.7%, with standard deviation of 0.08
- Optimizer turns out to be conservative in predicting infeasibility
- The flow was successfully applied to the GSM benchmark
41Energy-Efficient Application mapping for MPSoCs
MULTIMEDIA APPLICATIONS
- Given a platform
- Achieve a specified throughput
- Minimize power consumption
42Application Mapping
Allocation
Schedule Freq.sel.
- The problem of allocation, scheduling and frequency selection for task graphs on multiprocessors in a distributed real-time system is NP-hard.
- New tool flows for efficient mapping of multi-task applications onto hardware platforms
43Exploiting Voltage Supply
- Supply voltage impacts power and performance
- Circuit slowdown: $T = 1/f = K / (V_{dd} - V_t)^{a}$
- Cubic power savings: $P = C_{eff} V_{dd}^2 f$
- Just-in-time computation
- Stretch execution time up to the max tolerable
[Figure: power versus available time, comparing fixed voltage with shutdown against variable voltage.]
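Just-in-time computation can be sketched as picking, among a set of discrete (Vdd, f) operating points, the lowest-voltage one that still finishes the task before its deadline. Since $P = C_{eff} V_{dd}^2 f$ and the run time is cycles divided by $f$, the energy of a task is $C_{eff} V_{dd}^2$ times the number of cycles. The type and function names below are hypothetical.

```c
#include <assert.h>

typedef struct { double vdd; double freq_hz; } OpPoint;

/* Energy = P * t = (Ceff * Vdd^2 * f) * (cycles / f) = Ceff * Vdd^2 * cycles. */
double task_energy(double ceff, OpPoint op, double cycles)
{
    return ceff * op.vdd * op.vdd * cycles;
}

/* Index of the lowest-voltage operating point that meets the deadline,
 * or -1 if even the fastest point is too slow. */
int select_point(const OpPoint pts[], int n, double cycles, double deadline_s)
{
    assert(n > 0 && cycles >= 0.0);
    int best = -1;
    for (int i = 0; i < n; i++) {
        double run_time = cycles / pts[i].freq_hz;
        if (run_time <= deadline_s &&
            (best < 0 || pts[i].vdd < pts[best].vdd))
            best = i;
    }
    return best;
}
```

Stretching execution up to the deadline at a lower voltage reduces energy quadratically in Vdd, whereas running at full speed and then shutting down leaves the energy per cycle unchanged.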
44Scheduling Voltage Scaling
Different voltagesdifferent frequencies
CPU
f1
f2
f3
45Target architecture - 2
- Homogeneous computation tiles
- ARM cores (including instruction and data caches)
- Tightly coupled software-controlled scratch-pad memories (SPM)
- AMBA AHB
- DMA engine
- RTEMS OS
- Technology homogeneous (0.13 µm) industrial power models (ST)
- Variable voltage/frequency cores with discrete (Vdd, f) pairs
- Frequency dividers scale down the baseline 200 MHz system clock
- Cores use non-cacheable shared memory to communicate
- Semaphore and interrupt facilities are used for synchronization
- Private on-chip memory to store data.
46Application model
- A task graph represents
- A group of tasks T
- Task dependencies
- Execution times expressed in clock cycles: WCN(Ti)
- Communication times (writes and reads) expressed as WCN(WTiTj) and WCN(RTiTj)
- These values can be back-annotated from functional simulation
[Figure: task graph of six tasks. Each task Ti is annotated with WCN(Ti), and each arc from Ti to Tj with WCN(WTiTj) and WCN(RTiTj). Arcs: Task1 to Task2, Task1 to Task3, Task2 to Task4, Task3 to Task5, Task4 to Task6, Task5 to Task6.]
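One possible in-memory form of this application model is an array of per-task cycle counts plus a list of arcs carrying write and read cycle counts. The sketch below uses that form to compute the longest path through the graph in clock cycles, which is a lower bound on execution time when processor and bus contention are ignored. The types, names and the assumption that tasks are numbered in topological order are all choices made for the example.

```c
#include <assert.h>

#define MAX_TASKS 16

/* Arc Ti -> Tj with its WCN(W TiTj) and WCN(R TiTj) annotations. */
typedef struct { int from, to; long wcn_write, wcn_read; } Arc;

/* Longest path through the task graph in clock cycles.
 * wcn[t] is WCN(Tt). Tasks are assumed to be indexed in topological
 * order, so every arc goes from a lower to a higher index. */
long critical_path_cycles(int n, const long wcn[], int n_arcs, const Arc arcs[])
{
    assert(n > 0 && n <= MAX_TASKS);
    long finish[MAX_TASKS];
    long longest = 0;
    for (int t = 0; t < n; t++) {
        long ready = 0;     /* earliest time all inputs have been read */
        for (int a = 0; a < n_arcs; a++) {
            if (arcs[a].to != t) continue;
            assert(arcs[a].from < t);
            long arrive = finish[arcs[a].from]
                        + arcs[a].wcn_write + arcs[a].wcn_read;
            if (arrive > ready) ready = arrive;
        }
        finish[t] = ready + wcn[t];
        if (finish[t] > longest) longest = finish[t];
    }
    return longest;
}
```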
47Efficient Application Development Support
- Optimization tools generally rely on many simplifying assumptions
- Neglecting these assumptions in the software implementation can
- generate unpredictable and undesired system-level interactions
- make the overall system error-prone.
- We propose a complete framework to help programmers with software implementation
- a generic customizable application template → OFFLINE SUPPORT
- a set of high-level APIs → ONLINE SUPPORT.
- The main goals of our development framework are
- exact and reliable execution of applications after the optimization step
- guarantees of high performance and constraint satisfaction.
48Customizable Application Template
- Starting from a high-level task and data flow graph, software developers can easily and quickly build their application infrastructure.
- Programmers can intuitively translate the high-level representation into C code using our facilities and library.
- Users can specify
- the number of tasks included in the target application
- their nature (e.g. branch, fork, or-node, and-node)
- their precedence constraints (e.g. due to data communication)
- ... thus quickly drawing the application's CTG schema.
- Programmers can focus on the functionality of the tasks
- the main effort goes to the more specific and critical sections of the application.
49OS-level and Task-level APIs
- Users can easily reproduce optimizer solutions, thus
- Indirectly neglecting the optimizer's abstractions
- Task model
- Communication model
- OS overheads.
- Obtaining the needed application constraint satisfaction.
- Programmers can allocate to the right hardware resources
- Tasks
- Program data
- Queues.
- Scheduling support APIs
- Frequency and voltage selection
- Communication issues
- Shared queues
- Semaphores
- Interrupts.
50Example
- Number of nodes: 12
- Graph of activities
- Node type
- Normal, Branch, Conditional, Terminator
- Node behaviour
- Or, And, Fork, Branch
- Number of CPUs: 2
- Task allocation
- Task scheduling
- Arc priorities
- Freq. and voltage
#define TASK_NUMBER 12
#define N_CPU 2
//Node Type: 0 NORMAL, 1 BRANCH, 2 STOCHASTIC
uint node_type[TASK_NUMBER] = {1,2,2,1,..};
//Node Behaviour: 0 AND, 1 OR, 2 FORK, 3 BRANCH
uint node_behaviour[TASK_NUMBER] = {2,3,3,..};
uint queue_consumer[..][..] = {
  {0,1,1,0,..},
  {0,0,0,1,1,.},
  {0,0,0,0,0,1,1..},
  {0,0,0,0,....} };
uint task_on_core[TASK_NUMBER] = {1,1,2,1};
int schedule_on_core[N_CPU][TASK_NUMBER] = {{1,2,4,8..}};
[Figure: conditional task graph with 12 nodes (tasks T1 to T12, arcs a1 to a14) containing fork, branch, or-nodes and and-nodes, mapped onto processors P1 and P2, followed by a Gantt chart of resources versus time up to the deadline.]
51Queue ordering optimization
[Figure: tasks T1 to T6 and communications C1 to C5 scheduled on CPU1 and CPU2, with Wait! and RUN! markers.]
- Communication ordering affects system performance
53Synchronization among tasks
[Figure: tasks T1 to T4 and communications C1 to C3 on Proc. 1 and Proc. 2; T4 is suspended and later re-activated.]
Non-blocking semaphores
54Logic Based Benders Decomposition
[Figure: the allocation and frequency assignment stage (INTEGER PROGRAMMING; memory constraints; objective function: communication cost and energy consumption) passes a valid allocation to the scheduling stage (CONSTRAINT PROGRAMMING; real-time constraint), which returns a no-good linear constraint.]
- Decomposes the problem into 2 sub-problems
- Allocation and assignment of freq. settings → IP
- Objective function: minimizing energy consumption during execution and communication of tasks
- Scheduling → CP
- Objective function: minimizing energy consumption during frequency switching
55Solver Performance
- Hundreds of of decision variables
- Much beyond ILP solver or CP solver capability
56Allocation problem model
The objective function minimizes the energy consumption associated with task execution and communication
- $X_{tfp} = 1$ if task $t$ executes on processor $p$ at frequency $f$
- $W_{ijfp} = 1$ if tasks $i$ and $j$ run on different cores and task $i$, on core $p$, writes data to $j$ at frequency $f$
- $R_{ijfp} = 1$ if tasks $i$ and $j$ run on different cores and task $j$, on core $p$, reads data from $i$ at frequency $f$
57Allocation problem model
58Scheduling problem model
Duration of task i is now fixed since the mode is fixed
- Five phases behaviour
- INPUT: input data reading
- EXEC: computation activity
- OUTPUT: output data writing.
- Atomic activities
[Figure: task graph with fork and join nodes; each task is split into input (reading phase), exec and output (writing phase) activities.]
- Processors are modelled as unary resources
- Bus is modelled as an additive resource
The objective function minimizes the energy consumption associated with frequency switching
59Application Development Methodology
Simulator
Optimizer
Application Profiles
Optimization Phase
Characterization Phase
Allocation Scheduling
Application Development Support
Optimal SW Application Implementation
Platform Execution
60Validation of optimizer solutions Throughput
[Figure: optimizer validation on 250 instances, probability (%) versus throughput difference (%).]
- MAX error lower than 10%
- AVG error equal to 4.51%, with standard deviation of 1.94
- All the deadline constraints are satisfied.
61Validation of optimizer solutions Power
[Figure: optimizer validation on 250 instances, probability (%) versus energy consumption difference (%).]
- MAX error lower than 10%
- AVG error equal to 4.80%, with standard deviation of 1.71
62GSM Encoder
- Task Graph
- 10 computational tasks
- 15 communication tasks.
- Throughput required: 1 frame / 10 ms.
- With 2 processors and 4 possible freq. and voltage settings
Without optimizations: 50.9 µJ
With optimizations: 17.1 µJ
-66.4%
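As a quick check, the relative saving follows directly from the two energy values reported on this slide:

```latex
\frac{50.9\,\mu\mathrm{J} - 17.1\,\mu\mathrm{J}}{50.9\,\mu\mathrm{J}}
  = \frac{33.8}{50.9} \approx 0.664 = 66.4\%
```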
63Summary and future work
- Energy-optimal task mapping
- Strong optimization engine (complete)
- Programmer support (design and exec time)
- Validation: accuracy and optimality
- Future work
- Conditional task graphs
- Dealing with multiple use cases
- Variable execution times
- Aggressive communication scheduling