Optimization techniques for developing embedded applications on single-chip multiprocessor platforms

1
Optimization techniques for developing
embedded applications on single-chip
multiprocessor platforms
  • Luca Benini
  • Lbenini_at_deis.unibo.it
  • DEIS Università di Bologna

2
Embedded Systems
General purpose systems
Embedded systems
Microprocessor market shares
3
Example Area: Automotive Electronics
  • What is automotive electronics?
  • Vehicle functions implemented with electronics
  • Body electronics
  • System electronics: chassis, engine
  • Information/entertainment

4
Automotive Electronics Market Size
[Chart: cost of electronics per car (axis from 0 to 1400) and market size, by year]
Year    Market (billions)
1998    8.9
1999    10.5
2000    13.1
2001    14.1
2002    15.8
2003    17.4
2004    19.3
2005    21.0
2006: 25% of the total cost of a car will be electronics
90% of future innovations in vehicles are based on electronic embedded systems
5
Automotive Electronics Platform Example
Source: "Expanding automotive electronic systems", IEEE Computer, Jan. 2002
6
Digital Convergence: Mobile Example
  • One device, multiple functions
  • Center of ubiquitous media network
  • Smart mobile devices: the next driver for the semiconductor industry

7
4th Gen and Next-Gen Networks
Includes 802.20, WiMAX (802.16), HSDPA, TDD
UMTS, UMTS and future versions of UMTS
8
SoC: Enabler for Digital Convergence
[Figure: from today's SoC to the future (4G/5G, DMB, WiBro, etc.): > 100X in performance, low power, complexity and storage]
9
Application pull
[Figure (IMEC): applications by year of introduction (2005 to 2015) against required efficiency, with levels at 5 GOPS/W, 100 GOPS/W and 1 TOPS/W]
Applications shown: mobile base-band, H264 decoding, H264 encoding, 802.11n, UWB, Gbit radio, A/V streaming, image recognition, sign recognition, gesture recognition, emotion recognition, expression recognition, language, dictation, auto personalization, fully recognition (security), Si Xray, adaptive route, collision avoidance, autonomous driving, ubiquitous navigation, HMI by motion, gesture detection, structured encoding, structured decoding, 3D projected display, 3D ambient interaction, 3D TV, 3D gaming
10
MPSoC Platform Evolution
[Figure labels: Applications; Middleware, RTOS, API, Run-Time Controller; Software opt.; Mapping (V, Vt, Fclk, IL); I/O peripherals; 45 nm; router; bus-based multiprocessor; < 4 mm²; network interface; 3D stacked main memory; 30 Mtr; local memory hierarchy; power and test management; < 1 GHz]
  • Today's SoCs could fit in 1 tile!
  • Tile-based design

11
Multicores Are Here!
[Amarasinghe06]
[Figure: number of cores (1 to 512, doubling at each tick) versus year (1970 to 20??) for the 4004, 8008, 8080, 8086, 286, 386, 486, Pentium, P2, P3, P4, Itanium, Itanium 2 and Athlon]
12
MPSoC 2005 ITRS roadmap
[Martin06]
13
SoC → Solution-on-a-Chip
[Figure labels: Target System Application; System / Service; Application S/W; Mobile Terminal; Middleware; System; e-SW; Module; RTOS; Chip; HAL; Process; S/W IP]
  • Requires design of hardware AND software

14
Design as optimization
  • Design space
    • The set of all possible design choices
  • Constraints
    • Solutions that we are not willing to accept
  • Cost function
    • A property we are interested in (execution time, power, reliability)

15
Hardware synthesis
16
Behavioral synthesis
17
Allocation, Assignment, and Scheduling
Techniques Well Understood and Mature
18
Resource constraints
Control Step
19
Scheduling under resource constraints
  • Intractable problem
  • Algorithms
    • Exact
      • Integer linear program
      • Hu (restrictive assumptions)
    • Approximate
      • List scheduling
      • Force-directed scheduling
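
As an illustration of the approximate family, here is a minimal sketch of list scheduling under resource constraints. It is not taken from the slides: the function name, the priority rule (longest path to a sink) and the small operation graph are assumptions chosen for the example, and the graph is assumed to be acyclic with at least one unit of every resource type.

```python
from collections import defaultdict

def list_schedule(op_type, preds, capacity, delay=None):
    """Greedy resource-constrained list scheduling.

    op_type:  dict operation -> resource type (e.g. "mul", "alu")
    preds:    dict operation -> set of predecessor operations
    capacity: dict resource type -> number of available units
    Returns a dict operation -> start step (steps are numbered from 1).
    """
    delay = delay or {op: 1 for op in op_type}
    succs = defaultdict(set)
    for op, ps in preds.items():
        for p in ps:
            succs[p].add(op)

    memo = {}
    def priority(op):
        # Longest path (in steps) from op to a sink: a common urgency measure.
        if op not in memo:
            memo[op] = delay[op] + max((priority(s) for s in succs[op]), default=0)
        return memo[op]

    start, step = {}, 1
    while len(start) < len(op_type):
        # Units still occupied by operations started in earlier steps.
        busy = defaultdict(int)
        for op, s in start.items():
            if s <= step < s + delay[op]:
                busy[op_type[op]] += 1
        ready = [op for op in op_type
                 if op not in start
                 and all(p in start and start[p] + delay[p] <= step
                         for p in preds.get(op, ()))]
        # Most urgent first; names break ties so the result is deterministic.
        for op in sorted(ready, key=lambda o: (-priority(o), o)):
            if busy[op_type[op]] < capacity[op_type[op]]:
                start[op] = step
                busy[op_type[op]] += 1
        step += 1
    return start

# Hypothetical graph: three multiplications and two ALU operations.
OPS = {"m1": "mul", "m2": "mul", "m3": "mul", "a1": "alu", "a2": "alu"}
PREDS = {"a1": {"m1", "m2"}, "a2": {"a1", "m3"}}
schedule = list_schedule(OPS, PREDS, {"mul": 1, "alu": 1})
```

With a single multiplier the three multiplications are serialized and the last operation starts in step 4; with two multipliers it starts in step 3. The heuristic runs in polynomial time but, unlike the exact methods, gives no optimality guarantee in general.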

20
ILP formulation
  • Binary decision variables
    • $X = \{x_{il};\ i = 1, 2, \ldots, n;\ l = 1, 2, \ldots, \lambda + 1\}$
    • $x_{il}$ is TRUE only when operation $v_i$ starts in step $l$ of the schedule (i.e. $l = t_i$)
    • $\lambda$ is an upper bound on latency
  • Start time of operation $v_i$
    • $t_i = \sum_l l \cdot x_{il}$

21
ILP formulation constraints
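  • Operations start only once
    • $\sum_l x_{il} = 1, \quad i = 1, 2, \ldots, n$
  • Sequencing relations must be satisfied
    • $t_i \ge t_j + d_j \quad \forall (v_j, v_i) \in E$
    • $\sum_l l \cdot x_{il} - \sum_l l \cdot x_{jl} - d_j \ge 0 \quad \forall (v_j, v_i) \in E$
  • Resource bounds must be satisfied
    • Simple case (unit delay)
    • $\sum_{i: T(v_i) = k} x_{il} \le a_k, \quad k = 1, 2, \ldots, n_{res}, \quad \forall l$
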
22
ILP Formulation
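$\min \sum_l l \cdot x_{nl}$ such that
  • $\sum_l x_{il} = 1, \quad i = 1, 2, \ldots, n$
  • $\sum_l l \cdot x_{il} - \sum_l l \cdot x_{jl} - d_j \ge 0, \quad i, j = 1, 2, \ldots, n, \ (v_j, v_i) \in E$
  • $\sum_{i: T(v_i) = k} \sum_{m = l - d_i + 1}^{l} x_{im} \le a_k, \quad k = 1, 2, \ldots, n_{res}, \ l = 0, 1, \ldots, \lambda$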
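
The three constraint families and the objective can be checked on a toy instance. The sketch below is only an illustration under stated assumptions: it replaces a real ILP solver with exhaustive enumeration, steps are numbered from 1, and the function names and the five-operation graph are hypothetical.

```python
from itertools import product

def satisfies_ilp(x, steps, delay, edges, op_type, capacity):
    """x[i][l] is 1 when operation i starts in step l, else 0."""
    ops = list(x)
    # Operations start only once: sum_l x_il = 1.
    if any(sum(x[i][l] for l in steps) != 1 for i in ops):
        return False
    # Start times: t_i = sum_l l * x_il.
    t = {i: sum(l * x[i][l] for l in steps) for i in ops}
    # Sequencing: t_i - t_j - d_j >= 0 for every edge (v_j, v_i).
    if any(t[i] - t[j] - delay[j] < 0 for (j, i) in edges):
        return False
    # Resource bounds: operations of type k active in step l must fit in a_k.
    for l in steps:
        for k, a_k in capacity.items():
            used = sum(x[i][m] for i in ops if op_type[i] == k
                       for m in steps if l - delay[i] + 1 <= m <= l)
            if used > a_k:
                return False
    return True

def minimise_latency(op_type, delay, edges, capacity, sink, n_steps):
    """Enumerate all start-step assignments; minimise sum_l l * x_{sink,l}."""
    ops = list(op_type)
    steps = range(1, n_steps + 1)
    best = None
    for starts in product(steps, repeat=len(ops)):
        x = {i: {l: int(l == s) for l in steps} for i, s in zip(ops, starts)}
        if satisfies_ilp(x, steps, delay, edges, op_type, capacity):
            objective = sum(l * x[sink][l] for l in steps)
            if best is None or objective < best[0]:
                best = (objective, dict(zip(ops, starts)))
    return best

# Hypothetical instance: unit delays, edges written as (predecessor, successor).
OP_TYPE = {"m1": "mul", "m2": "mul", "m3": "mul", "a1": "alu", "a2": "alu"}
DELAY = {op: 1 for op in OP_TYPE}
EDGES = [("m1", "a1"), ("m2", "a1"), ("a1", "a2"), ("m3", "a2")]
best = minimise_latency(OP_TYPE, DELAY, EDGES, {"mul": 1, "alu": 1}, "a2", 5)
```

Enumeration is exponential in the number of operations, which is exactly why the problem is handed to an ILP solver or a heuristic in practice; the sketch is only meant to make the constraints concrete.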
23
Example
  • Resource constraints
    • 2 ALUs, 2 multipliers
    • $a_1 = 2, \ a_2 = 2$
  • Single-cycle operation
    • $d_i = 1 \quad \forall i$

24
Example
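  • Operations start only once
    • $x_{11} = 1$
    • $x_{61} + x_{62} = 1$
  • Sequencing relations must be satisfied
    • $x_{61} + 2x_{62} - 2x_{72} - 3x_{73} + 1 \le 0$
    • $2x_{92} + 3x_{93} + 4x_{94} - 5x_{N5} + 1 \le 0$
  • Resource bounds must be satisfied
    • $x_{11} + x_{21} + x_{61} + x_{81} \le 2$
    • $x_{32} + x_{62} + x_{72} + x_{82} \le 2$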

25
Example
26
Resource-Efficient Application Mapping for MPSoCs
MULTIMEDIA APPLICATIONS
  • Given a platform
  • Achieve a specified throughput
  • Minimize usage of shared resources

27
Application design flow
[Figure labels: Abstraction gap; Platform Modelling; Starting Implementation; Optimization Analysis; Final Implementation; Optimal Solution; Platform Execution]
  • The abstraction gap between high-level optimization tools and standard application programming models can introduce unpredictable and undesired behaviours.
  • Programmers must be aware of the simplifying assumptions made in optimization tools.
  • New methodology for multi-task application development on MPSoCs.

28
Resource assignment and scheduling
[Figure: the system. Tasks A, B, ..., N (with WCETs Ta, Tb, ..., Tn) run on nodes 1 to N. Each node has a processor, a limited-size tightly-coupled memory and a bus interface to the shared system bus, which also connects the on-chip memory.]
29
The application
Signal Processing Pipeline
Throughput Constraint
  • Each task is characterized by
    • WCET
    • Memory requirements
      • Queues for inter-processor communication: in TCM for efficiency reasons
      • Program data: in TCM (if space) or on-chip memory
      • Internal state: in TCM (if space) or on-chip memory

30
Communication-aware Allocation and Scheduling for
Stream-Oriented MPSoCs
[Figure: signal processing pipeline of tasks T0, T1, T2, ..., T7 and a message-oriented MPSoC architecture: ARM7 cores, each with a private memory and a local scratchpad memory, connected by a bus]
  • Simplifying assumptions vs predictability
  • Efficient solutions in reasonable time
  • Pure ILP formulations are suitable only for small task sets
  • Widespread use of heuristics
31
Master Problem model
  • Assignment of tasks and memory slots (master problem)
    • $T_{ij} = 1$ if task $i$ executes on processor $j$, 0 otherwise
    • $Y_{ij} = 1$ if task $i$ allocates program data on processor $j$'s memory, 0 otherwise
    • $Z_{ij} = 1$ if task $i$ allocates the internal state on processor $j$'s memory, 0 otherwise
    • $X_{ij} = 1$ if task $i$ executes on processor $j$ and task $i+1$ does not, 0 otherwise
  • Each process should be allocated to exactly one processor: $\sum_j T_{ij} = 1$ for all $i$
  • Link between variables X and T: $X_{ij} = |T_{ij} - T_{i+1,j}|$ for all $i$ and $j$ (can be linearized)
  • If a task is NOT allocated to a processor, neither are its required memories: $T_{ij} = 0 \Rightarrow Y_{ij} = 0$ and $Z_{ij} = 0$
  • Objective function: $\min \sum_j \sum_i \left[ mem_i (T_{ij} - Y_{ij}) + state_i (T_{ij} - Z_{ij}) + data_i X_{ij} / 2 \right]$
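
To make the objective concrete, the sketch below evaluates it for a given 0/1 assignment. It assumes that X is the absolute difference between the T entries of consecutive tasks and that the state term is weighted by (T - Z); the function name and the three-task instance are hypothetical.

```python
def master_cost(T, Y, Z, mem, state, data):
    """Communication cost of an allocation in the master problem.

    T[i][j], Y[i][j], Z[i][j] are 0/1 entries for task i and processor j.
    mem[i], state[i], data[i] are the per-task traffic weights.
    """
    n_tasks, n_procs = len(T), len(T[0])
    cost = 0.0
    for i in range(n_tasks):
        if sum(T[i]) != 1:
            raise ValueError(f"task {i} must run on exactly one processor")
        for j in range(n_procs):
            if Y[i][j] > T[i][j] or Z[i][j] > T[i][j]:
                raise ValueError("memory can be local only where the task runs")
            # X_ij: tasks i and i+1 are split across processors (assumed |T_ij - T_i+1,j|).
            x_ij = abs(T[i][j] - T[i + 1][j]) if i + 1 < n_tasks else 0
            cost += (mem[i] * (T[i][j] - Y[i][j])
                     + state[i] * (T[i][j] - Z[i][j])
                     + data[i] * x_ij / 2)
    return cost

# Hypothetical pipeline of 3 tasks on 2 processors.
T = [[1, 0], [1, 0], [0, 1]]
Y = [[1, 0], [0, 0], [0, 1]]   # task 1 keeps its program data in on-chip memory
Z = [[1, 0], [1, 0], [0, 0]]   # task 2 keeps its internal state in on-chip memory
cost = master_cost(T, Y, Z, mem=[10, 20, 30], state=[1, 2, 3], data=[5, 6, 7])
```

Here the cost is 20 (remote program data of task 1) plus 6 (data crossing between tasks 1 and 2, counted as two halves) plus 3 (remote state of task 2). Placing every task and all its memory on one processor evaluates to zero.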
32
Improvement of the model
  • With the proposed model, the allocation problem solver tends to pack all tasks onto a single processor and all required memory into the local memory, so that the communication cost is ZERO: a TRIVIAL SOLUTION
  • To improve the model, we add a relaxation of the subproblem to the master problem model
  • For each set S of consecutive tasks whose durations sum to more than the real-time requirement, we impose that they must not all run on the same processor
    • $\sum_{i \in S} WCET_i > RT \Rightarrow \sum_{i \in S} T_{ij} \le |S| - 1$ for every processor $j$

34
Sub-Problem model
  • Task scheduling with static resource assignment (subproblem)
  • We have to schedule tasks, so we have to decide when they start
  • Activity starting time: $Start_i \in [0..Deadline_i]$
  • Precedence constraints: $Start_i + Dur_i \le Start_j$
  • Real-time constraints for all activities running on the same processor
    • $\sum_i (Start_i + Dur_i) \le RT$
  • Cumulative constraints on resources
    • processors are unary resources: cumulative(Start, Dur, 1, 1)
    • memories are additive resources: cumulative(Start, Dur, MR, C)
  • What about the bus?

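
The meaning of the cumulative constraint can be sketched as a plain feasibility check: at every instant, the requirements of the running activities must not exceed the capacity. This is an illustrative checker, not a CP propagator, and the sample values are made up.

```python
def cumulative(start, dur, req, cap):
    """True if, at every instant, the total requirement of running tasks <= cap.

    start, dur, req are parallel lists (start time, duration, resource need).
    Usage can only increase at a start time, so checking those instants suffices.
    """
    for t in sorted(set(start)):
        used = sum(r for s, d, r in zip(start, dur, req) if s <= t < s + d)
        if used > cap:
            return False
    return True

# A processor is a unary resource: every task needs 1 unit, capacity is 1.
back_to_back = cumulative(start=[0, 2], dur=[2, 3], req=[1, 1], cap=1)
overlapping = cumulative(start=[0, 1], dur=[2, 3], req=[1, 1], cap=1)

# A memory is an additive resource: requirements add up to the capacity C.
memory_fits = cumulative(start=[0, 0, 4], dur=[4, 4, 2], req=[3, 2, 4], cap=5)
```

The first call succeeds because the second task starts exactly when the first ends; the second call fails because the two tasks overlap on a unary resource.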
35
Bus model
Unary resource model, with clock-cycle granularity
[Figure: bandwidth (bit/sec) versus time during the execution of task i and task j, up to the max bus bandwidth, showing the state reads and state writes of task i and task j]
An arbitration mechanism decides the bus allocation
36
Bus model
Additive bus model
[Figure: bandwidth (bit/sec) versus time, bounded by the max bus bandwidth. Each task (task i, task j) occupies a bandwidth of (size of program data) / TaskExecTime between its state read and its state write. Task 0 accesses input data with BW = MaxBW / NoProc.]
The model does not hold under heavy bus congestion (more than 65% of total bandwidth). Bus traffic has to be minimized.
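
A minimal sketch of the additive bus model follows, assuming each task's bus demand is spread evenly over its execution interval (size divided by execution time) and that the model is trusted only up to a fraction of the maximum bandwidth. The function name, the default threshold argument and the numbers are illustrative assumptions.

```python
def bus_load_ok(transfers, max_bw, threshold=0.65):
    """Additive bus model: concurrent average bandwidths simply add up.

    transfers: list of (start, end, size) with size in bits and times in seconds.
    Returns True if the summed bandwidth never exceeds threshold * max_bw.
    """
    # The total load can only rise when a transfer starts.
    for t in sorted({s for s, _, _ in transfers}):
        load = sum(size / (e - s) for s, e, size in transfers if s <= t < e)
        if load > threshold * max_bw:
            return False
    return True

# Hypothetical numbers: a 100 bit/s bus and two overlapping tasks at 30 bit/s each.
two_tasks = [(0, 10, 300), (5, 15, 300)]
within_limit = bus_load_ok(two_tasks, max_bw=100)
# A third task adding 10 bit/s pushes the peak to 70% of the bandwidth.
over_limit = bus_load_ok(two_tasks + [(5, 10, 50)], max_bw=100)
```

Structurally this is the same check as a cumulative constraint, with bandwidth as the requirement and the usable share of the bus as the capacity.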
37
No-good generation
  • Assignment of tasks and memory slots (master problem)
  • Task scheduling with static resource assignment (subproblem)
  • If no feasible schedule exists for the allocation provided by the master, a no-good is generated.
  • We use a simple BUT EFFECTIVE one: identify the CONFLICTING RESOURCES CR. For each $R \in CR$, let STR be the set of tasks allocated on R; then
    • $\sum_{i \in STR} T_{iR} \le |STR| - 1$
  • Other cuts are also possible [Hooker, Constraints 2005], but these are enough for our case and easy to extract

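
The master/subproblem loop with no-good cuts can be sketched on a toy pipeline. Everything below is a simplified stand-in rather than the authors' solver: the master enumerates allocations instead of solving an IP, the subproblem only checks that the load on each processor fits in the real-time requirement instead of running a CP scheduler, and all names and numbers are hypothetical.

```python
from itertools import product

def master(data, n_tasks, n_procs, nogoods):
    """Cheapest allocation (by inter-processor traffic) not excluded by a no-good."""
    best = None
    for alloc in product(range(n_procs), repeat=n_tasks):
        # No-good (R, STR): the tasks in STR must not all sit on R again,
        # i.e. sum_{i in STR} T_iR <= |STR| - 1.
        if any(all(alloc[i] == r for i in tasks) for r, tasks in nogoods):
            continue
        cost = sum(data[i] for i in range(n_tasks - 1) if alloc[i] != alloc[i + 1])
        if best is None or (cost, alloc) < best:
            best = (cost, alloc)
    return best

def subproblem(alloc, wcet, n_procs, rt):
    """Stand-in feasibility check: returns the conflicting resources and task sets."""
    conflicts = []
    for r in range(n_procs):
        tasks = frozenset(i for i, p in enumerate(alloc) if p == r)
        if sum(wcet[i] for i in tasks) > rt:
            conflicts.append((r, tasks))
    return conflicts

def benders(wcet, data, n_procs, rt):
    """Iterate master and subproblem until a feasible allocation is found."""
    nogoods, iterations = [], 0
    while True:
        iterations += 1
        found = master(data, len(wcet), n_procs, nogoods)
        if found is None:
            return None, iterations          # every allocation was cut off
        conflicts = subproblem(found[1], wcet, n_procs, rt)
        if not conflicts:
            return found, iterations         # (cost, allocation)
        nogoods.extend(conflicts)

result, iterations = benders(wcet=[4, 4, 4], data=[5, 1], n_procs=2, rt=8)
```

In this instance the master first proposes packing all three tasks on processor 0, then on processor 1; both are rejected with a no-good, and the third iteration returns the split that cuts the cheapest pipeline edge. Because each cut removes only infeasible allocations, the first feasible allocation returned is also the cheapest one.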
38
Computational efficiency
  • CP and IP formulations simplified
  • The hybrid approach clearly outperforms pure CP and IP techniques
  • Search time bounded to 15 minutes
    • CP and IP find a solution in at most 50% of the instances
    • The hybrid approach always finds a solution

39
Validation of bus model
  • Requesting more than 65% of the theoretical maximum bandwidth causes the additive model to fail.
  • Lower threshold in the presence of communication hotspots (50%)
  • Benefits of the additive model
    • task execution time almost independent of bus utilization
    • Performance predictability greatly enhanced

40
Validation of optimizer solutions
  • MAX error lower than 10%
  • AVG error equal to 4.7%, with standard deviation of 0.08
  • The optimizer turns out to be conservative in predicting infeasibility
  • The flow was successfully applied to a GSM benchmark

41
Energy-Efficient Application Mapping for MPSoCs
MULTIMEDIA APPLICATIONS
  • Given a platform
  • Achieve a specified throughput
  • Minimize power consumption

42
Application Mapping
[Figure labels: Allocation; Schedule & Freq. sel.]
  • Allocating, scheduling and selecting frequencies for task graphs on multiprocessors in a distributed real-time system is an NP-hard problem.
  • New tool flows for efficient mapping of multi-task applications onto hardware platforms

43
Exploiting Voltage Supply
  • Supply voltage impacts power and performance
    • Circuit slowdown: $T = 1/f = K / (V_{dd} - V_t)^a$
    • Cubic power savings: $P = C_{eff} V_{dd}^2 f$
  • Just-in-time computation
    • Stretch execution time up to the max tolerable

[Figure: power versus available time, comparing fixed voltage with shutdown against variable voltage]
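
The two formulas can be combined into a small just-in-time selection sketch: among a set of discrete supply voltages, pick the one that still meets the deadline with the least energy. The constants (threshold voltage, K, exponent a, effective capacitance) and the voltage list are invented for illustration and do not come from the slides.

```python
def frequency(vdd, vt=0.4, k=1.0e-9, a=2.0):
    """Clock frequency from T = 1/f = K / (Vdd - Vt)^a (hypothetical constants)."""
    return (vdd - vt) ** a / k

def power(vdd, f, c_eff=1.0e-9):
    """Dynamic power P = Ceff * Vdd^2 * f (hypothetical Ceff)."""
    return c_eff * vdd ** 2 * f

def energy(cycles, vdd):
    """Energy to execute a fixed number of cycles at supply voltage vdd."""
    f = frequency(vdd)
    return power(vdd, f) * (cycles / f)

def pick_voltage(cycles, deadline, voltages):
    """Lowest-energy voltage whose execution time still meets the deadline."""
    feasible = [v for v in voltages if cycles / frequency(v) <= deadline]
    return min(feasible, key=lambda v: energy(cycles, v)) if feasible else None

# One million cycles, 5 ms available, three assumed voltage levels.
chosen = pick_voltage(1_000_000, 5e-3, [0.8, 1.0, 1.2])
saving = 1 - energy(1_000_000, chosen) / energy(1_000_000, 1.2)
```

With these assumed constants, 0.8 V is too slow for the deadline, so 1.0 V is chosen; since the energy for a fixed cycle count scales with the square of the supply voltage, this uses about 31% less energy than running at 1.2 V and idling afterwards.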
44
Scheduling & Voltage Scaling
[Figure: CPU schedule in which different voltages give different frequencies f1, f2, f3]
45
Target architecture - 2
  • Homogeneous computation tiles
  • ARM cores (including instruction and data
    caches)
  • Tightly coupled software-controlled scratch-pad
    memories (SPM)
  • AMBA AHB
  • DMA engine
  • RTEMS OS
  • Homogeneous technology (0.13 µm), industrial power models (ST)
  • Variable Voltage/Frequency cores with discrete
    (Vdd,f) pairs
  • Frequency dividers scale down the baseline 200
    MHz system clock
  • Cores use non-cacheable shared memory to
    communicate
  • Semaphore and interrupt facilities are used for
    synchronization
  • Private on-chip memory to store data.

46
Application model
  • A task graph represents
    • A group of tasks T
    • Task dependencies
    • Execution times expressed in clock cycles: WCN(Ti)
    • Communication times (writes and reads) expressed as WCN(WTiTj) and WCN(RTiTj)
    • These values can be back-annotated from functional simulation

[Figure: task graph with six tasks. Each task Ti is annotated with WCN(Ti) and each edge Ti → Tj with WCN(WTiTj) and WCN(RTiTj). Edges: Task1 → Task2, Task1 → Task3, Task2 → Task4, Task3 → Task5, Task4 → Task6, Task5 → Task6.]
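
As a sketch of how such an annotated graph can be used, the code below computes the longest path through the six-task graph in clock cycles, charging a write and a read on every edge (that is, assuming every pair of communicating tasks sits on different cores). The cycle counts are invented; only the graph shape comes from the figure.

```python
def critical_path_cycles(wcn_task, wcn_comm, topo_order):
    """Longest path through the task graph, in clock cycles.

    wcn_task:   dict task -> WCN(Ti)
    wcn_comm:   dict (Ti, Tj) -> (WCN(WTiTj), WCN(RTiTj))
    topo_order: the tasks listed so that predecessors come first
    """
    finish = {}
    for task in topo_order:
        # A task can start once every predecessor has finished, written and been read.
        ready = max((finish[src] + w + r
                     for (src, dst), (w, r) in wcn_comm.items() if dst == task),
                    default=0)
        finish[task] = ready + wcn_task[task]
    return max(finish.values())

# Hypothetical cycle counts for the six tasks and six edges.
WCN_TASK = {"T1": 100, "T2": 200, "T3": 150, "T4": 120, "T5": 300, "T6": 80}
WCN_COMM = {edge: (10, 10) for edge in [("T1", "T2"), ("T1", "T3"), ("T2", "T4"),
                                        ("T3", "T5"), ("T4", "T6"), ("T5", "T6")]}
cycles = critical_path_cycles(WCN_TASK, WCN_COMM, ["T1", "T2", "T3", "T4", "T5", "T6"])
```

With these numbers the path through T3 and T5 dominates at 690 cycles. Expressing everything in cycles keeps the annotation independent of the clock: dividing by the frequency selected for each task turns it into time.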
47
Efficient Application Development Support
  • Optimization tools generally rely on many simplifying assumptions
  • Neglecting these assumptions in the software implementation can
    • generate unpredictable and undesired system-level interactions
    • make the overall system error-prone.
  • We propose a complete framework to help programmers with software implementation
    • a generic customizable application template → OFFLINE SUPPORT
    • a set of high-level APIs → ONLINE SUPPORT.
  • The main goals of our development framework are
    • exact and reliable execution of applications after the optimization step
    • guarantees of high performance and constraint satisfaction.

48
Customizable Application Template
  • Starting from a high-level task and data flow graph, software developers can easily and quickly build their application infrastructure.
  • Programmers can intuitively translate the high-level representation into C code using our facilities and library.
  • Users can specify
    • the number of tasks included in the target application
    • their nature (e.g. branch, fork, or-node, and-node)
    • their precedence constraints (e.g. due to data communication)
  • ...thus quickly drawing its CTG schema.
  • Programmers can focus on the functionality of the tasks
    • the main effort goes into the more specific and critical sections of the application.

49
OS-level and Task-level APIs
  • Users can easily reproduce optimizer solutions, thus
    • Indirectly neglecting the optimizer's abstractions
      • Task model
      • Communication model
      • OS overheads.
    • Obtaining the required satisfaction of application constraints.
  • Programmers can allocate to the right hardware resources
    • Tasks
    • Program data
    • Queues.
  • Scheduling support APIs
  • Frequency and voltage selection
  • Communication issues
    • Shared queues
    • Semaphores
    • Interrupts.

50
Example
  • Number of nodes: 12
  • Graph of activities
  • Node type
    • Normal, Branch, Conditional, Terminator
  • Node behaviour
    • Or, And, Fork, Branch
  • Number of CPUs: 2
  • Task allocation
  • Task scheduling
  • Arc priorities
  • Freq. & voltage

[Figure: conditional task graph of the example, with nodes N1, B2, B3, C4 to C7, N8 to N11 and T12 (tasks T1 to T12), arcs a1 to a14, and fork, branch, or and and behaviours]

    #define TASK_NUMBER 12
    #define N_CPU 2

    //Node Type: 0 NORMAL, 1 BRANCH, 2 STOCHASTIC
    uint node_type[TASK_NUMBER] = {1,2,2,1,..};

    //Node Behaviour: 0 AND, 1 OR, 2 FORK, 3 BRANCH
    uint node_behaviour[TASK_NUMBER] = {2,3,3,..};

    uint task_on_core[TASK_NUMBER] = {1,1,2,1,..};
    int schedule_on_core[N_CPU][TASK_NUMBER] = {1,2,4,8,..};

    uint queue_consumer[..][..] = {
        {0,1,1,0,..},
        {0,0,0,1,1,..},
        {0,0,0,0,0,1,1,..},
        {0,0,0,0,..}
    };

[Figure: resulting schedule, resources versus time up to the deadline, showing N1, B2, B3, C4, C7, N8, N10, N11 and T12]
51
Queue ordering optimization
[Figure: tasks T1 to T6 on CPU1 and CPU2 exchanging messages C1 to C5; depending on the queue order, a task either waits ("Wait!") or runs ("RUN!")]
  • Communication ordering affects system performance

53
Synchronization among tasks
[Figure: timeline of tasks T1 to T4 on Proc. 1 and Proc. 2 exchanging messages C1 to C3; T4 is suspended and later re-activated]
Non-blocking semaphores
54
Logic Based Benders Decomposition
[Figure: decomposition loop. Allocation & frequency assignment (INTEGER PROGRAMMING; memory constraints; objective function: communication cost & energy consumption) passes a valid allocation to scheduling (CONSTRAINT PROGRAMMING; real-time constraint), which returns a no-good as a linear constraint.]
  • Decomposes the problem into 2 sub-problems
    • Allocation & assignment of frequency settings → IP
      • Objective function: minimizing energy consumption during execution and communication of tasks
    • Scheduling → CP
      • Objective function: minimizing energy consumption during frequency switching

55
Solver Performance
  • Hundreds of decision variables
  • Far beyond the capability of a pure ILP or CP solver

56
Allocation problem model
The objective function minimizes the energy consumption associated with task execution and communication
  • $X_{tfp} = 1$ if task t executes on processor p at frequency f
  • $W_{ijfp} = 1$ if tasks i and j run on different cores and task i, on core p, writes data to j at frequency f
  • $R_{ijfp} = 1$ if tasks i and j run on different cores and task j, on core p, reads data from i at frequency f
57
Allocation problem model
58
Scheduling problem model
Duration of task i is now fixed, since the mode is fixed
  • Five phases behaviour
    • INPUT: input data reading
    • EXEC: computation activity
    • OUTPUT: output data writing.
  • Atomic activities

[Figure: task graph from fork to join; each task is split into input (reading phase), exec and output (writing phase) activities]
  • Processors are modelled as unary resources
  • The bus is modelled as an additive resource

The objective function minimizes the energy consumption associated with frequency switching
59
Application Development Methodology
Simulator
Optimizer
Application Profiles
Optimization Phase
Characterization Phase
Allocation Scheduling
Application Development Support
Optimal SW Application Implementation
Platform Execution
60
Validation of optimizer solutions: Throughput
[Figure: optimizer validation over 250 instances; probability (%) versus throughput difference (%)]
  • MAX error lower than 10%
  • AVG error equal to 4.51%, with standard deviation of 1.94
  • All the deadline constraints are satisfied.

61
Validation of optimizer solutions: Power
[Figure: optimizer validation over 250 instances; probability (%) versus energy consumption difference (%)]
  • MAX error lower than 10%
  • AVG error equal to 4.80%, with standard deviation of 1.71

62
GSM Encoder
  • Task Graph
  • 10 computational tasks
  • 15 communication tasks.
  • Throughput required: 1 frame / 10 ms.
  • With 2 processors and 4 possible frequency and voltage settings:

Without optimizations: 50.9 µJ
With optimizations: 17.1 µJ (-66.4%)
63
Summary & future work
  • Energy-optimal task mapping
    • Strong optimization engine (complete)
    • Programmer support (design & exec time)
    • Validation: accuracy & optimality
  • Future work
    • Conditional task graphs
    • Dealing with multiple use cases
    • Variable execution times
    • Aggressive communication scheduling