Sesame: Opening new doors to Multi-level Design Space Exploration of Embedded Systems Architectures

About This Presentation

Description:

Opening new doors to Multi-level Design Space Exploration of Embedded Systems Architectures. Andy D. Pimentel, Computer Systems Architecture group, University of Amsterdam

Slides: 54

Provided by: AndyPi1

Learn more at: http://www.artist-embedded.org

Transcript and Presenter's Notes

1
Sesame: Opening new doors to Multi-level Design
Space Exploration of Embedded Systems
Architectures
Andy D. Pimentel
Computer Systems Architecture group
University of Amsterdam
Informatics Institute
2
Thank you. Questions?
3
Outline

Background and problem statement
General overview of modeling methodology
Sesame environment
Application modeling layer
Architecture modeling layer
Mapping layer
Gradual refinement of architecture models
Event refinement using dataflow graphs
Both computational and communication refinement
Current status and future work

4
Sketching the context

Let's play a little quiz
What is the most popular microprocessor around?
You may have answered something like Intel
Pentium
If so, thanks for playing!
Intel Pentium has almost 0% market share. Zip.
Zilch.
Pentium is a statistically insignificant chip
with tiny sales!
The answer should (of course?) be embedded
processors (no particular brand)

5
Sketching the context (contd)
Relating microprocessors to life on earth: are
Pentiums the viruses of the microprocessor
market? ;-)
6
Sketching the context (contd)
7
Sketching the context (contd)

Estimate: 5 times as much embedded software
as normal software
Embedded systems are everywhere
On average, a human touches about 50 to 100
embedded processors per day
The average car has 15 processors, a luxury one
60!
The domain of embedded multimedia and signal
processing applications plays an important role
Camcorders, PDAs, set-top boxes, (Digital) TVs,
cell phones, etc.

8
Embedded media systems

Modern embedded systems for media and signal
processing must
support multiple applications and various
standards
often provide real-time performance
These systems increasingly have heterogeneous
system architectures, integrating
Dedicated hardware
High performance and low power/cost
Embedded processor cores
High flexibility
Reconfigurable components (e.g. FPGAs)
Good performance/power/flexibility

9
Trends in system design (contd)

Silicon budgets are increasing (Moore's Law)
Integration of functions: Systems-on-Chip
(Massively) Parallel Systems on a single chip!

Life cycle of systems decreasing (e.g., look at
cellphones)
Short time to market

10
Design crisis
[Figure: design-crisis plot, log scale, against technology node (micron): 0.35µ, 0.25µ, 0.18µ, 0.15µ, 0.12µ, 0.1µ]
11
The system design problem

Design better products faster
Design productivity
Design technology: architectures, methods, tools,
libraries
Design quality
Low cost, low power, flexible, no bugs
Multi-dimensional design space with many
tradeoffs
Cost (silicon area, design time)
Performance
Power consumption
Flexibility
Time-to-market
etc.

12
Design tradeoffs: computational efficiency
13
From Applications to Silicon
[Figure: design stack from Application(s) through Software and HW/SW Architecture down to Silicon; architecture components labeled TM, CP1 and MIPS]
14
Rethinking system design

Design complexity forces us to reconsider current
design practice
Classical design methods
often depart from a single application
specification which is gradually synthesized into
HW/SW implementation
lack generalizability to cope with highly
programmable architectures targeting multiple
applications
also hamper extensibility to efficiently support
future applications

15
Rethinking system design (contd)

Traditionally, designers only rely on detailed
simulators for design space exploration
HW/SW co-simulation
This approach becomes infeasible for the early
design stages
Effort to build these simulators is too high as
systems become too complex
The low speeds of these simulators seriously
hamper the architectural exploration
HW/SW co-simulation requires a HW/SW partitioning
A new system model is needed for assessment of
each HW/SW partitioning

16
Jumping down the design pyramid
[Figure: design pyramid with axes labeled Abstraction and Effort (low/high) and alternative realizations]
17
Design by stepwise refinement
[Figure: design pyramid with axes labeled Abstraction and Effort (low/high) and alternative realizations]
18
Sesame: Simulation of Embedded Systems
Architectures for Multi-level Exploration

Part of Artemisia project
Design methods for NoC-based embedded systems
Co-operation of
Leiden Embedded Research Center, Leiden
University (prof. E.F. Deprettere)
Computer Engineering group, Delft University of
Technology (prof. S. Vassiliadis)
Computer Systems Architecture group, University
of Amsterdam (prof. C. Jesshope)
Philips Research Labs in Eindhoven

19
Sesame: Simulation of Embedded Systems
Architectures for Multi-level Exploration

Provides methods and tools to efficiently
evaluate the performance of heterogeneous
embedded systems and explore their design space
Different architectures, applications, and
mappings
Different HW/SW partitionings
Smooth transition between abstraction levels
Mixed-level simulations
Promotes reuse of models (re-use of IP)
Targets the multimedia application domain
Techniques and tools also applicable to other
application domains

20
Y-chart Design Methodology [Kienhuis]
[Figure: Y-chart; surviving label: Architecture]
21
Modeling and simulation using the Y-Chart
methodology

Application model
Description of functional behavior of an
application
Independent from architecture, HW/SW partitioning
and timing characteristics
Generates application events representing the
workload imposed on the architecture

Architecture model
Parameterized timing behavior of architecture
components
Models timing consequences of application events

Explicit mapping of application and architecture
models
Trace-driven co-simulation [Lieverse]
Easy reuse of both application and architecture
models!

22
Application modeling

Using Kahn Process Networks (KPNs)
Parallel (C/C++) processes communicating with
each other via unbounded FIFO channels
expresses parallelism in an application and makes
communication explicit
blocking reads, non-blocking writes
Generation of application events
Code is instrumented with annotations describing
computational actions
Reading from/writing to Kahn channels represent
communication behavior
Application events can be very coarse-grained, like
"compute a DCT" or "read/write a pixel block"
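
To make the idea of instrumented Kahn processes concrete, here is a minimal Python sketch (not Sesame's C/C++ API). The names Channel, execute, producer, dct_stage and consumer are hypothetical. Each process appends R (read), E (execute) and W (write) events to its own trace; reads block on an empty channel and writes never block.

```python
import queue
import threading

class Channel:
    """Unbounded FIFO: writes never block, reads block until a token arrives."""
    def __init__(self, name):
        self.name = name
        self._fifo = queue.Queue()  # maxsize 0 means unbounded

    def write(self, trace, token):
        trace.append(("W", self.name))  # communication event
        self._fifo.put(token)

    def read(self, trace):
        trace.append(("R", self.name))  # communication event
        return self._fifo.get()         # blocks while the channel is empty

def execute(trace, action):
    # Annotation: records a computational event; no timing is attached here.
    trace.append(("E", action))

def producer(out, trace, n):
    for block in range(n):
        execute(trace, "make_block")
        out.write(trace, block)

def dct_stage(inp, out, trace, n):
    for _ in range(n):
        block = inp.read(trace)
        execute(trace, "dct")
        out.write(trace, block * 2)  # stand-in for a real DCT

def consumer(inp, trace, n, results):
    for _ in range(n):
        results.append(inp.read(trace))

def run_network(n=3):
    a, b = Channel("a"), Channel("b")
    traces = {"producer": [], "dct": [], "consumer": []}
    results = []
    threads = [
        threading.Thread(target=producer, args=(a, traces["producer"], n)),
        threading.Thread(target=dct_stage, args=(a, b, traces["dct"], n)),
        threading.Thread(target=consumer, args=(b, traces["consumer"], n, results)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return traces, results

traces, results = run_network(3)
print(traces["dct"][:3])
```

Because the network is a Kahn network, the trace of each process is the same on every run, whatever the thread interleaving.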

23
Application modeling (contd)

Why Kahn process networks (KPNs)?
Fit very well to multimedia application domain
KPNs are deterministic
automatically guarantees validity of event traces
when application and architecture simulators are
executed independently
Application model can also be analyzed in
isolation from any architecture model
Investigation of upper performance bounds and
early recognition of bottlenecks within
application

24
Architecture modeling

Architecture models react to application trace
events to simulate the timing behavior
Accounting for functional behavior is not
necessary!
Architecture modeling at varying abstraction
levels
Starting at black box level
Processing cores can model timing behavior of SW,
HW or reconfigurable execution
parameterizable latencies for the application
events
SW execution: high latency; HW execution: low
latency
Allows for rapid evaluation of different HW/SW
partitionings!
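
A black-box processing core of this kind can be sketched as a latency table that is looked up for each incoming event. The function name simulate_core and all cycle counts below are made-up illustrations, not Sesame parameters; switching the DCT from SW to HW only changes one table entry.

```python
def simulate_core(trace, latencies):
    """Return the virtual time one core needs to consume a trace of events."""
    time = 0
    for event in trace:
        time += latencies[event]  # only timing is modeled, no functional behavior
    return time

# Hypothetical latencies in cycles.
sw_latencies = {"read_block": 20, "dct": 4000, "write_block": 20}
hw_latencies = dict(sw_latencies, dct=300)  # same core model, DCT moved to HW

trace = ["read_block", "dct", "write_block"] * 2
print(simulate_core(trace, sw_latencies))  # 8080
print(simulate_core(trace, hw_latencies))  # 680
```

The application trace is identical in both runs; only the architecture parameters differ.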

25
Architecture modeling (contd)

Models implemented in Pearl
Object-based discrete event simulation language
Keeps track of virtual time
Provides simulation primitives
Inter-object communication via message-passing
Keeps track of simulation statistics
RISC-like language: keep it simple and make the
common case fast
Lacks features not needed for architectural
modeling (e.g., no dynamic datastructures,
dynamic object creation, etc.)
Result: high-performance modeling and simulation
High simulation speed and low modeling effort
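
The ingredients listed above (virtual time, message passing between objects, statistics) can be illustrated with a small discrete-event kernel. This is a Python sketch, not Pearl syntax; Simulator, Memory, Processor and all latencies are hypothetical.

```python
import heapq

class Simulator:
    def __init__(self):
        self.now = 0      # virtual time
        self._queue = []  # pending messages ordered by delivery time
        self._seq = 0     # tie-breaker: equal-time messages stay in FIFO order

    def send(self, delay, target, message):
        heapq.heappush(self._queue, (self.now + delay, self._seq, target, message))
        self._seq += 1

    def run(self):
        while self._queue:
            self.now, _, target, message = heapq.heappop(self._queue)
            target.receive(self, message)

class Memory:
    def __init__(self, latency):
        self.latency = latency
        self.accesses = 0  # simulation statistic

    def receive(self, sim, message):
        _, reply_to = message
        self.accesses += 1
        sim.send(self.latency, reply_to, ("done", None))

class Processor:
    def __init__(self, memory, loads):
        self.memory = memory
        self.remaining = loads
        self.finished_at = None

    def receive(self, sim, message):
        if self.remaining > 0:
            self.remaining -= 1
            sim.send(1, self.memory, ("load", self))  # 1 cycle to issue a load
        else:
            self.finished_at = sim.now

sim = Simulator()
mem = Memory(latency=10)
cpu = Processor(mem, loads=3)
sim.send(0, cpu, ("start", None))
sim.run()
print(cpu.finished_at, mem.accesses)  # 33 3
```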

26
Pearl: an example
[Figure: Pearl code example showing a processor object and a message]
27
Architecture modeling (contd)

Models implemented in SystemC
We added a layer on top of SystemC 2.0, called
SCPEx (SystemC Pearl Extension)
Provides SystemC with Pearl's message-passing
semantics
Raises abstraction level of SystemC (e.g., no
ports, transparent incorporation of
synchronization)
Improves transaction-level modeling
SCPEx enables reuse of Pearl models in SystemC
context
Makes Pearl → SystemC translation trivial
Provides link towards possible implementation
Facilitates importing SystemC IP models in Sesame

28
Sesame in layers
Application model
Event trace
Mapping layer
Architecture model
29
Sesame's mapping layer

Maps application tasks (event traces) to
architecture model components
Guarantees deadlock-free scheduling of
application events

30
Scheduling of communication events
Because Read events are blocking (Kahn), some
schedules may yield deadlock
[Figure: application model (nodes A, B, C) with events Write(A), Read(C), Read(B), Write(C), mapped onto an architecture model with two processor cores connected by a bus]
Sesame's mapping layer

Accounts for synchronization behavior
Mapping layer executes in same time domain as
architecture model
Transforms application-level events into
primitives (events) for architecture model
More on this later on...
Tool for auto-generation of mapping layer

Maps application tasks (event traces) to
architecture model components
Guarantees deadlock-free scheduling of
application events

32
Sesame from a software perspective
(SCPEx)
33
Y-chart Modeling Language (YML)

Flexible and persistent description (XML) of
The structure of application and architecture
models (connecting library components)
SCPEx also supports YML!
The mapping of appl. models onto arch.
models (i.e., the mapping layer)
YML embeds a scripting language within XML
Simplifies descriptions of complicated structures
Increases expressive power of components
E.g., a parameterized complex interconnect
component modeling a network of arbitrary size
Increases reusability
Re-use of components and structures

34
An illustrative case study: M-JPEG

Lossy, Motion-JPEG encoder
Accepts both RGB and YUV formats
Includes dynamic quality control by on-the-fly
adaptation of quantization and Huffman tables

35
The platform architecture

Bus-based shared memory multiprocessor
architecture

36
M-JPEG case study (contd)
Exploration
mapping
37
M-JPEG case study (contd)

Kahn Process Network
Functional behavior

Library approach
Timing behavior

38
Screenshot model editor
39
M-JPEG design space exploration

Experimented with different
HW/SW partitionings
Application-architecture mappings
Processor speeds
Interconnect structures (bus, crossbar and Omega
networks)

This took about 1 person-month (all modeling
included)
Simulation performance: for 128x128 frames, a 270
MHz Sun Ultra 5 Sparcstation simulated 2.3
frames/second (about 0.43 secs/frame)

40
M-JPEG design space exploration
41
M-JPEG design space exploration
42
Mapping problem: implementation gap
Application behavioral model (what?)
Primitive operations
Implementation
Primitive operations
Architecture model (how?)
43
Mapping problem

Application events: Read, Write and Execute
Typical mismatch between application events and
architecture primitives, for example:
Architecture primitives operating on different
data granularities
Architecture primitives more refined than
application events
Trace events from the application layer need to
be refined
How?
Refine the application model
A transformation mechanism between the
application and architecture models

44
Communication refinement

Let's take the mismatch of communication
primitives as an example
Assume the following architecture communication
primitives:
Check-Data (CD)
Load-Data (Ld)
Signal-Room (SR)
Check-Room (CR)
Store-Data (St)
Signal-Data (SD)

45
Communication refinement (contd)

Transformation rules for refining
application-level communication events [Lieverse]
R → CD → Ld → SR (1)
W → CR → St → SD (2)
E → E (3)
How to transform traces of application events
using (1), (2) and (3)?

Generates R → E → W event sequences
46
Communication refinement (contd)
Processor 1
Processor 2
Processor 3
bus
Mem

Assumption 1: processor 2 has local (block)
memory
Transforming R → E → W event sequences from process
B
R → E → W ⇒ CD → Ld → SR → E → CR → St → SD

Assumption 2: processor 2 has NO local (block)
memory
Transforming R → E → W event sequences from process
B
R → E → W ⇒ CD → CR → Ld → E → St → SR → SD
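
The two transformations can be written down as a short Python sketch. The function names are hypothetical. The first applies rules (1) to (3) event by event, which yields the sequence for Assumption 1; the second rewrites each R, E, W triple as a unit, giving the reordered sequence for Assumption 2.

```python
RULES = {
    "R": ["CD", "Ld", "SR"],  # rule (1)
    "W": ["CR", "St", "SD"],  # rule (2)
    "E": ["E"],               # rule (3)
}

def refine_with_local_memory(trace):
    """Apply rules (1)-(3) to each application event independently."""
    return [primitive for event in trace for primitive in RULES[event]]

def refine_without_local_memory(trace):
    """Rewrite each R, E, W triple as a whole.

    Room for the output is checked (CR) before the data is loaded, and the
    input buffer is released (SR) only after the result has been stored.
    """
    refined = []
    for i in range(0, len(trace), 3):
        if trace[i:i + 3] != ["R", "E", "W"]:
            raise ValueError("expected a sequence of R, E, W triples")
        refined += ["CD", "CR", "Ld", "E", "St", "SR", "SD"]
    return refined

print(" -> ".join(refine_with_local_memory(["R", "E", "W"])))
print(" -> ".join(refine_without_local_memory(["R", "E", "W"])))
```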

47
IDF-based trace transformation

Virtual processors in mapping layer are refined
to accomplish trace refinement
Integer-controlled DataFlow (IDF) model describes
internal behavior of a virtual processor
Application events specify
what a virtual processor executes
with whom it communicates
Internal IDF model specifies
how the computations and communications take
place at the architecture layer

48
A short Dataflow intermezzo

Synchronous DataFlow (SDF) [Lee, Messerschmitt]
Static model of computation allowing compile-time
scheduling
Basic idea: each actor consumes and produces a
fixed number of tokens each time it fires
Integer-controlled DataFlow (IDF) [Buck]
Extends SDF with dynamic integer-controlled
switch and select actors to allow data dependent
execution
Generalization makes it more powerful (Turing
complete) but generally needs dynamic scheduling
Hard to analyze statically
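
What "compile-time scheduling" means for SDF can be illustrated by solving the balance equations: for every edge, firings of the producer times tokens produced must equal firings of the consumer times tokens consumed. The sketch below is a generic textbook computation, not code from Sesame; the function name and the example graph are made up.

```python
from fractions import Fraction
from math import gcd

def repetition_vector(actors, edges):
    """Smallest positive integer firing counts q for a connected SDF graph.

    edges: (src, produced, dst, consumed) tuples; every edge must satisfy
    q[src] * produced == q[dst] * consumed.
    """
    rate = {actors[0]: Fraction(1)}
    changed = True
    while changed:  # propagate relative firing rates along the edges
        changed = False
        for src, prod, dst, cons in edges:
            if src in rate and dst not in rate:
                rate[dst] = rate[src] * prod / cons
                changed = True
            elif dst in rate and src not in rate:
                rate[src] = rate[dst] * cons / prod
                changed = True
    if len(rate) != len(actors):
        raise ValueError("graph is not connected")
    for src, prod, dst, cons in edges:
        if rate[src] * prod != rate[dst] * cons:
            raise ValueError("inconsistent rates: no static schedule exists")
    scale = 1
    for r in rate.values():  # least common multiple of all denominators
        scale = scale * r.denominator // gcd(scale, r.denominator)
    return {a: int(rate[a] * scale) for a in actors}

# A produces 2 tokens per firing, B consumes 3; B produces 1, C consumes 2.
q = repetition_vector(["A", "B", "C"], [("A", 2, "B", 3), ("B", 1, "C", 2)])
print(q)  # {'A': 3, 'B': 2, 'C': 1}
```

With IDF's data-dependent switch and select actors, the token rates are no longer fixed, which is why this kind of static solution is generally unavailable.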

49
[Figure: the three Sesame layers
Application model (process network): Process A, Process B, Process C
Mapping layer (dataflow): Virtual proc. X, Virtual proc. Y, Virtual proc. Z
Architecture model (discrete event): components connected by a bus]
50
IDF-based trace transformation (contd)

IDF models transform application events into
architecture events at run-time
IDF models execute in the same simulation
time-domain as the architecture model
timed IDF models
We distinguish three IDF token-channel types
Intra-event dependency channels specify
dependencies within the refinement of an
application event
Inter-event dependency channels specify
dependencies between refinements of different
application events
Token-exchange channels connected to architecture
model (accomplish timed execution)

51
Communication refinement revisited
Process B
Process A
Process C
Processor 1
Processor 2
Processor 3
bus
Mem

Assumption: processor 2 has NO local (block)
memory
Transforming R → E → W event sequences from process
B
R → E → W ⇒ CD → CR → Ld → E → St → SR → SD

52
Communication refinement revisited (2)
[Figure: event trace of process B (R, E, W) feeding a switch actor in virtual processor Y, next to virtual processors X and Z; actors CD, CR, Ld, E, St, SR and SD (some shown twice, with edges labeled b) connect to processor 2 and the bus in the architecture model]
53
Communication refinement revisited (3)
Process B
Process A
Process C
Virtual proc. X
Virtual proc. Z
Virtual proc. Y
Processor 1
Processor 2
Processor 3
bus
Mem
R → E → W → R → E → W ⇒ CD → CR →
Ld(line) → E(line) → St(line) →
Ld(line) → E(line) → St(line) →
Ld(line) → E(line) → St(line) →
SR → SD

Now assume that
processor 2 operates on lines (3 lines = 2
blocks)
processor 2 has a single-entry local line buffer
processors 1 and 3 still operate at block
granularity

54
Communication refinement revisited (4)
[Figure: IDF graph for the line-based refinement. The event trace from process B (R, E, W) passes through switch and select actors driven by alternating control streams (...,1,0,1,0); actors CD, CR, Ld, Ld(line), E(line), St(line), SR and SD exchange tokens with virtual processors X and Z and with processor 2; edges carry annotations 2→3, 1→2, 1→3, 2→1 and 3→1]
55
A case of computational refinement

The application models a synthetic 2D-IDCT by
computing two consecutive IDCT operations at
block level
High level, so execute(block) = 1D-IDCT on a data
block

while(1) { read(block); execute(block); write(block); }
while(1) { read(block); execute(block); write(block); }
while(1) { write(block); execute(block); }
while(1) { read(block); execute(block); write(block); }
while(1) { read(block); execute(block); }
56
Computational refinement (contd)

Two target architectures are explored

[Figure: two architectures, each built from Proc A, Proc B, Proc C and Proc D; a Mem component also appears]

And two scenarios...

Scenario 1: All processing elements (PEs) are
modeled at block level

Scenario 2: The PE models onto which the IDCT
tasks are mapped operate at line level and are
pipelined

57
Computational refinement (contd)

Trace transformation rules
R(block) → R(line) → . . . → R(line) (1)
W(block) → W(line) → . . . → W(line) (2)
E(block) → E(line) → . . . → E(line) (3)
E(line) → e1 → . . . → en (4)
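
Rules (1) to (4) can be sketched as two small Python functions. The names, the default of 8 lines per block and the stage names e1, e2 are assumptions for illustration only.

```python
def refine_to_lines(trace, lines_per_block=8):
    """Rules (1)-(3): expand each block-level event into line-level events."""
    refined = []
    for kind in trace:  # kind is "R", "E" or "W" at block level
        refined += [f"{kind}(line)"] * lines_per_block
    return refined

def refine_line_execute(trace, stages):
    """Rule (4): replace each E(line) by finer-grained operations e1..en."""
    refined = []
    for event in trace:
        refined += list(stages) if event == "E(line)" else [event]
    return refined

lines = refine_to_lines(["R", "E", "W"], lines_per_block=2)
print(lines)
print(refine_line_execute(lines, ["e1", "e2"]))
```

This sketch only expands events in order. It does not interleave or pipeline the resulting line events; expressing those dependencies is the role of the dataflow model in the mapping layer.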

58
Computational refinement
Process B
Process A
Process C
Virtual proc. X
Virtual proc. Z
bus
59
Computational refinement (contd)
60
(No Transcript)
61
(No Transcript)
62
Putting Sesame to use: An example design flow
Compaan/Laura (Leiden University) Molen (Delft
University)
Motion-JPEG encoder
Architecture simulation environment
Reconfigurable architecture framework
DCT
Experimentation
System-level architecture exploration
Applications
Code suitable for FPGA execution
63
A real implementation using Compaan/Laura/Molen
Mapping M-JPEG on the Molen platform architecture
The DCT kernel
for k = 1:1:4,
  for j = 1:1:64,
    Pixel(k,j) = In(inBlock)
  end
end
for k = 1:1:4,
  if k < 2,
    for j = 1:1:64,
      Pixel(k,j) = PreShift(Pixel(k,j))
    end
  end
  Block = 2D_dct(Pixel)
end
for k = 1:1:4,
  for j = 1:1:64,
    outBlock = Out(Pixel(k,j))
  end
end
C Compiler
Laura
64
System-level simulation experiment

Modeling Molen with DCT mapped onto CCU
Validation against real implementation
Information from Compaan/Laura/Molen used for
calibration of architecture model
Apply architecture model refinement
Keep M-JPEG application model untouched
DCT component in architecture model is refined
Operates at pixel level
Abstract pipeline model, deeply pipelined
Other architecture components operate at
(pixel-)block level

65
Sesame's IDF-based model refinement
Process B
Process A
Process C
Application model
M-JPEG
Virtual proc. X
Virtual proc. Z
Mapping layer
Map DCT on CCU and refine
Architecture model
Molen
bus
66
DCT virtual processor
Event trace
scheduler
Control trace
63
P2
P1
Block out
Type in
2d-dct
Block in
To/from architecture model
67
Simulation results

Full software implementation
Simulation: 85024000 cycles
Real Molen: 84581250 cycles
Error: 0.5%
DCT mapped onto CCU
Simulation: 40107869 cycles
Real Molen: 39369970 cycles
Error: 1.9%
No tuning was done!
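
The reported errors can be reproduced from the cycle counts, assuming the error is taken relative to the cycle count measured on the real Molen platform (the slide does not state the formula):

```python
def relative_error_percent(simulated, measured):
    """Relative error in percent, taking the measured value as the reference."""
    return abs(simulated - measured) / measured * 100

full_sw = relative_error_percent(85024000, 84581250)
dct_on_ccu = relative_error_percent(40107869, 39369970)
print(f"{full_sw:.1f}% {dct_on_ccu:.1f}%")  # 0.5% 1.9%
```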

68
Where are we going?

Some ongoing and future work

69
NoC modeling

So far, we mainly modeled bus-based systems
Networks-on-Chip (NoC) will be our (near) future
Standardized interfaces
Scalable (point-to-point) networks
Much more complex protocols (protocol stack?)
QoS aspects
Modeling NoCs
Topologies, switching and routing methods,
flow-control, protocols, QoS, etc.
Communication mapping
Modeling at multiple abstraction levels
Gradual refinement
Role of IDF models

70
Communication mapping
With more complex Networks-on-Chip, routing
information is needed
71
Architecture model calibration
Initial derivation of latency parameters

documentation
educated guess
performance budgeting (what is the required
parameter range?)

Next step: calibration with lower-level, external
simulation models or prototypes, e.g.

Instruction set simulators (ISSs)
Compaan/Laura framework

72
Calibration using an ISS
[Figure: process C executed on an ISS (e.g. SimpleScalar), connected through an API to processes 1 and 2; code labels: read(1,..), API_write(C,..), API_read(C,..), computation, write(2,..)]
ISS measures cycle times of annotated code
fragments
73
Mixed-level system simulation

Zoom in on interesting system components in
architecture model
Simulate these components at a lower level
Retain high abstraction level for other
components
Saves modeling effort
May save simulation overhead
Integration of external simulation models
ISSs, SystemC models, etc.
Also allows calibration of higher-level models
BUT
Mixed-level simulation can be complex!
multiple time domains and time grain sizes
(synchronization)
differences in protocol and data granularity of
components

74
Mixed-level system simulation (contd)
Embedding external models
IDF-based refinement
75
Does mixed-level need to be hard? NO!
C
ISS (e.g. Simplescalar)
API
Virtual processor
Virtual processor
Virtual processor
Read E(N cycles) Write
buffer
buffer
Trace calibration!
76
Towards real design space exploration

Sesame supplies basic methods and tools for
evaluating application, architecture, and mapping
combinations
Simulating entire design space is not an option
More is needed to explore large design spaces
What will be the initial design(s) to evaluate?
How to react when the evaluated architecture does
not suffice?
We need steering before and during simulation
Design decisions using analytical modeling
Finding Pareto-optimal candidates using
multi-objective optimization
Design evaluation using simulation
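
As an illustration of what "Pareto-optimal candidates" means, the sketch below filters a set of design points so that only the non-dominated ones remain. The function name and the (execution time, cost) numbers are made up; both objectives are assumed to be minimized.

```python
def pareto_front(points):
    """Return the points that no other point dominates (all objectives minimized)."""
    def dominates(a, b):
        # a dominates b if it is no worse everywhere and strictly better somewhere
        return (all(x <= y for x, y in zip(a, b))
                and any(x < y for x, y in zip(a, b)))
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Hypothetical (execution time, cost) pairs for five candidate designs.
designs = [(10, 5), (8, 7), (12, 4), (9, 9), (8, 6)]
print(pareto_front(designs))  # [(10, 5), (12, 4), (8, 6)]
```

This exhaustive filter is quadratic in the number of points, so it only suits small candidate sets; for large design spaces a search method has to propose the candidates first.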

77
Real design space exploration (contd)
Heuristic methods like evolutionary algorithms
78
Credits
This work would not have been possible without
the (ground-laying work of the) following people