Title: Sesame: Opening new doors to Multi-level Design Space Exploration of Embedded Systems Architectures
1Sesame: Opening new doors to Multi-level Design Space Exploration of Embedded Systems Architectures
Andy D. Pimentel
Computer Systems Architecture group
University of Amsterdam
Informatics Institute
2Thank you. Questions?
3Outline
- Background and problem statement
- General overview of modeling methodology
- Sesame environment
- Application modeling layer
- Architecture modeling layer
- Mapping layer
- Gradual refinement of architecture models
- Event refinement using dataflow graphs
- Both computational and communication refinement
- Current status and future work
4Sketching the context
- Let's play a little quiz
- What is the most popular microprocessor around?
- You may have answered something like Intel Pentium
- If so, thanks for playing!
- Intel Pentium has almost 0% market share. Zip. Zilch.
- Pentium is a statistically insignificant chip with tiny sales!
- The answer should (of course?) be embedded processors (no particular brand)
5Sketching the context (contd)
Relating microprocessors to life on earth: are Pentiums the viruses of the microprocessor market? ;-)
6Sketching the context (contd)
7Sketching the context (contd)
- Estimate: 5 times as much embedded software as normal software
- Embedded systems are everywhere
- On average, a human touches about 50 to 100 embedded processors per day
- An average car has 15 processors, a luxury one 60!
- The domain of embedded multimedia and signal processing applications plays an important role
- Camcorders, PDAs, set-top boxes, (digital) TVs, cell phones, etc.
8Embedded media systems
- Modern embedded systems for media and signal processing must
- support multiple applications and various standards
- often provide real-time performance
- These systems increasingly have heterogeneous system architectures, integrating
- Dedicated hardware
- High performance and low power/cost
- Embedded processor cores
- High flexibility
- Reconfigurable components (e.g. FPGAs)
- Good performance/power/flexibility
9Trends in system design (contd)
- Silicon budgets are increasing (Moore's Law)
- Integration of functions: Systems-on-Chip
- (Massively) parallel systems on a single chip!
- Life cycle of systems decreasing (e.g., look at cell phones)
- Short time to market
10Design crisis
[Figure: design crisis chart; log-scale vertical axis plotted against technology (micron), from 0.35µ down to 0.1µ]
11The system design problem
- Design better products faster
- Design productivity
- Design technology: architectures, methods, tools, libraries
- Design quality
- Low cost, low power, flexible, no bugs
- Multi-dimensional design space with many tradeoffs
- Cost (silicon area, design time)
- Performance
- Power consumption
- Flexibility
- Time-to-market
- etc.
12Design tradeoffs: computational efficiency
13From Applications to Silicon Software
[Figure: path from application(s) through software and the HW/SW architecture down to silicon; architecture components labeled TM, CP1 and MIPS]
14Rethinking system design
- Design complexity forces us to reconsider current design practice
- Classical design methods
- often depart from a single application specification which is gradually synthesized into a HW/SW implementation
- lack generalizability to cope with highly programmable architectures targeting multiple applications
- also hamper extensibility to efficiently support future applications
15Rethinking system design (contd)
- Traditionally, designers rely only on detailed simulators for design space exploration
- HW/SW co-simulation
- This approach becomes infeasible for the early design stages
- Effort to build these simulators is too high as systems become too complex
- The low speeds of these simulators seriously hamper the architectural exploration
- HW/SW co-simulation requires a HW/SW partitioning
- A new system model is needed for assessment of each HW/SW partitioning
16Jumping down the design pyramid
[Figure: design pyramid with axes Effort and Abstraction (each ranging between low and high), spanning the alternative realizations]
17Design by stepwise refinement
[Figure: design pyramid with axes Effort and Abstraction (each ranging between low and high), spanning the alternative realizations]
18Sesame: Simulation of Embedded Systems Architectures for Multi-level Exploration
- Part of Artemisia project
- Design methods for NoC-based embedded systems
- Co-operation of
- Leiden Embedded Research Center, Leiden University (prof. E.F. Deprettere)
- Computer Engineering group, Delft University of Technology (prof. S. Vassiliadis)
- Computer Systems Architecture group, University of Amsterdam (prof. C. Jesshope)
- Philips Research Labs in Eindhoven
19Sesame: Simulation of Embedded Systems Architectures for Multi-level Exploration
- Provides methods and tools to efficiently evaluate the performance of heterogeneous embedded systems and explore their design space
- Different architectures, applications, and mappings
- Different HW/SW partitionings
- Smooth transition between abstraction levels
- Mixed-level simulations
- Promotes reuse of models (re-use of IP)
- Targets the multimedia application domain
- Techniques and tools also applicable to other application domains
20Y-chart Design Methodology [Kienhuis]
[Figure: Y-chart diagram; surviving label: Architecture]
21Modeling and simulation using the Y-Chart methodology
- Application model
- Description of functional behavior of an application
- Independent from architecture, HW/SW partitioning and timing characteristics
- Generates application events representing the workload imposed on the architecture
- Architecture model
- Parameterized timing behavior of architecture components
- Models timing consequences of application events
- Explicit mapping of application and architecture models
- Trace-driven co-simulation [Lieverse]
- Easy reuse of both application and architecture models!
22Application modeling
- Using Kahn Process Networks (KPNs)
- Parallel (C/C++) processes communicating with each other via unbounded FIFO channels
- expresses parallelism in an application and makes communication explicit
- blocking reads, non-blocking writes
- Generation of application events
- Code is instrumented with annotations describing computational actions
- Reading from/writing to Kahn channels represents communication behavior
- Application events can be very coarse grain, like "compute a DCT" or "read/write a pixel block"
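The Kahn-style application model described above can be sketched in a few lines. This is only an illustration in Python rather than Sesame's C/C++ process code: the process names, channel names, the `END` marker and the `run_network` helper are all assumptions made for this sketch. Each process records its own trace of Read (R), Execute (E) and Write (W) events.

```python
import queue
import threading

END = None  # end-of-stream marker; an assumption of this sketch, not part of Kahn semantics


def source(out_ch, trace, n_blocks):
    """Process A: writes n_blocks dummy pixel blocks to its output channel."""
    for i in range(n_blocks):
        out_ch.put([i, i + 1, i + 2, i + 3])  # write never blocks: the FIFO is unbounded
        trace.append(("W", "a_to_b"))
    out_ch.put(END)


def transform(in_ch, out_ch, trace):
    """Process B: reads a block, computes on it, writes the result."""
    while True:
        block = in_ch.get()  # blocking read
        if block is END:
            out_ch.put(END)
            return
        trace.append(("R", "a_to_b"))
        result = [2 * x for x in block]  # stand-in for a coarse-grain action such as a DCT
        trace.append(("E", "dct"))
        out_ch.put(result)
        trace.append(("W", "b_to_c"))


def sink(in_ch, trace, results):
    """Process C: reads result blocks until the stream ends."""
    while True:
        block = in_ch.get()  # blocking read
        if block is END:
            return
        trace.append(("R", "b_to_c"))
        results.append(block)


def run_network(n_blocks):
    """Runs the three processes as threads and returns (per-process traces, results)."""
    a_to_b, b_to_c = queue.Queue(), queue.Queue()  # maxsize=0 means unbounded
    traces = {"A": [], "B": [], "C": []}
    results = []
    threads = [
        threading.Thread(target=source, args=(a_to_b, traces["A"], n_blocks)),
        threading.Thread(target=transform, args=(a_to_b, b_to_c, traces["B"])),
        threading.Thread(target=sink, args=(b_to_c, traces["C"], results)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return traces, results
```

Because each process only communicates through blocking reads on FIFO channels, the per-process traces and the results do not depend on how the threads happen to be scheduled.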
23Application modeling (contd)
- Why Kahn process networks (KPNs)?
- Fit very well to multimedia application domain
- KPNs are deterministic
- automatically guarantees validity of event traces when application and architecture simulators are executed independently
- Application model can also be analyzed in isolation from any architecture model
- Investigation of upper performance bounds and early recognition of bottlenecks within application
24Architecture modeling
- Architecture models react to application trace events to simulate the timing behavior
- Accounting for functional behavior is not necessary!
- Architecture modeling at varying abstraction levels
- Starting at black box level
- Processing cores can model timing behavior of SW, HW or reconfigurable execution
- parameterizable latencies for the application events
- SW execution: high latency, HW execution: low latency
- Allows for rapid evaluation of different HW/SW partitionings!
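A black-box processing core of this kind can be sketched as a latency table indexed by application event. The sketch below is a hypothetical illustration in Python: the latency values, the event names and the `simulate` function are invented for the example and are not taken from Sesame.

```python
# Hypothetical latency tables (in cycles); the numbers are made up for illustration.
SW_LATENCY = {"dct": 1000, "quantize": 300}
HW_LATENCY = {"dct": 60, "quantize": 20}


def simulate(trace, partitioning, comm_latency=10):
    """Returns the total time a single core needs to process an event trace.

    trace        -- list of (kind, name) events, kind in {"R", "E", "W"}
    partitioning -- maps each Execute event name to "SW" or "HW"
    comm_latency -- fixed cost charged for every Read or Write event
    """
    time = 0
    for kind, name in trace:
        if kind == "E":
            table = HW_LATENCY if partitioning[name] == "HW" else SW_LATENCY
            time += table[name]
        else:
            time += comm_latency
    return time


trace = [("R", "in"), ("E", "dct"), ("E", "quantize"), ("W", "out")]
all_sw = simulate(trace, {"dct": "SW", "quantize": "SW"})
dct_hw = simulate(trace, {"dct": "HW", "quantize": "SW"})
print(all_sw, dct_hw)  # prints "1320 380"
```

Switching a task between SW and HW only changes a table lookup, which is why different partitionings can be evaluated without touching the application trace.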
25Architecture modeling (contd)
- Models implemented in Pearl
- Object-based discrete event simulation language
- Keeps track of virtual time
- Provides simulation primitives
- Inter-object communication via message-passing
- Keeps track of simulation statistics
- RISC-like language: keep it simple and make the common case fast
- Lacks features not needed for architectural modeling (e.g., no dynamic data structures, dynamic object creation, etc.)
- Result: high-performance modeling and simulation
- High simulation speed and low modeling effort
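To make the ideas of virtual time and message passing concrete, here is a minimal discrete-event sketch. It is written in Python, not Pearl, and does not reproduce Pearl syntax; the `Simulator`, `Memory` and `Processor` classes and the latency value are assumptions made for this illustration.

```python
import heapq


class Simulator:
    """Keeps track of virtual time and dispatches scheduled messages in time order."""

    def __init__(self):
        self.now = 0
        self._queue = []
        self._seq = 0  # tie-breaker so handlers are never compared

    def schedule(self, delay, handler, *args):
        heapq.heappush(self._queue, (self.now + delay, self._seq, handler, args))
        self._seq += 1

    def run(self):
        while self._queue:
            self.now, _, handler, args = heapq.heappop(self._queue)
            handler(*args)


class Memory:
    """Replies to each load message after a fixed latency and counts accesses."""

    def __init__(self, sim, latency):
        self.sim = sim
        self.latency = latency
        self.accesses = 0  # simple simulation statistic

    def load(self, reply_to):
        self.accesses += 1
        self.sim.schedule(self.latency, reply_to)


class Processor:
    """Issues loads one after another, waiting for each reply."""

    def __init__(self, sim, mem):
        self.sim = sim
        self.mem = mem
        self.remaining = 0
        self.finished_at = None

    def start(self, n_loads):
        self.remaining = n_loads
        self._issue()

    def _issue(self):
        if self.remaining == 0:
            self.finished_at = self.sim.now
            return
        self.remaining -= 1
        self.mem.load(self._issue)  # message to the memory object; reply resumes us


sim = Simulator()
mem = Memory(sim, latency=5)
cpu = Processor(sim, mem)
cpu.start(3)
sim.run()
print(cpu.finished_at, mem.accesses)  # prints "15 3"
```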
26Pearl: an example
[Figure: Pearl code example; labels: processor object, message]
27Architecture modeling (contd)
- Models implemented in SystemC
- We added a layer on top of SystemC 2.0, called SCPEx (SystemC Pearl Extension)
- Provides SystemC with Pearl's message-passing semantics
- Raises abstraction level of SystemC (e.g., no ports, transparent incorporation of synchronization)
- Improves transaction-level modeling
- SCPEx enables reuse of Pearl models in SystemC context
- Makes Pearl → SystemC translation trivial
- Provides link towards possible implementation
- Facilitates importing SystemC IP models in Sesame
28Sesame in layers
Application model
Event trace
Mapping layer
Architecture model
29Sesame's mapping layer
- Maps application tasks (event traces) to architecture model components
- Guarantees deadlock-free scheduling of application events
30Scheduling of communication events
Because Read events are blocking (Kahn), some
schedules may yield deadlock
[Figure: application model (labels A, B, C) issuing the events Write(A), Read(C), Read(B) and Write(C), mapped onto an architecture model with two processor cores and a bus]
31Sesame's mapping layer
- Accounts for synchronization behavior
- Mapping layer executes in same time domain as architecture model
- Transforms application-level events into primitives (events) for architecture model
- More on this later on...
- Tool for auto-generation of mapping layer
- Maps application tasks (event traces) to architecture model components
- Guarantees deadlock-free scheduling of application events
32Sesame from a software perspective
(SCPEx)
33Y-chart Modeling Language (YML)
- Flexible and persistent description (XML) of
- The structure of application and architecture models (connecting library components)
- SCPEx also supports YML!
- The mapping of appl. models onto arch. models (i.e., the mapping layer)
- YML combines scripting language within XML
- Simplifies descriptions of complicated structures
- Increases expressive power of components
- E.g., a parameterized complex interconnect component modeling a network of arbitrary size
- Increases reusability
- Re-use of components and structures
34An illustrative case study: M-JPEG
- Lossy, Motion-JPEG encoder
- Accepts both RGB and YUV formats
- Includes dynamic quality control by on-the-fly adaptation of quantization and Huffman tables
35The platform architecture
- Bus-based shared memory multiprocessor architecture
36M-JPEG case study (contd)
Exploration
mapping
37M-JPEG case study (contd)
- Kahn Process Network
- Functional behavior
- Library approach
- Timing behavior
38Screenshot model editor
39M-JPEG design space exploration
- Experimented with different
- HW/SW partitionings
- Application-architecture mappings
- Processor speeds
- Interconnect structures (bus, crossbar and Ω networks)
- This took about 1 person-month (all modeling included)
- Simulation performance: for 128x128 frames, a 270 MHz Sun Ultra 5 Sparcstation simulated 2.3 frames/second (about 0.43 secs/frame)
40M-JPEG design space exploration
41M-JPEG design space exploration
42Mapping problem: implementation gap
Application behavioral model (what?)
Primitive operations
Implementation
Primitive operations
Architecture model (how?)
43Mapping problem
- Application events: Read, Write and Execute
- Typical mismatch between application events and architecture primitives, examples:
- Architecture primitives operating on different data granularities
- Architecture primitives more refined than application events
- Trace events from the application layer need to be refined
- How?
- Refine the application model
- A transformation mechanism between the application and architecture models
44Communication refinement
- Let's take the mismatch of communication primitives as an example
- Assume the following architecture communication primitives
- Check-Data (CD)
- Load-Data (Ld)
- Signal-Room (SR)
- Check-Room (CR)
- Store-Data (St)
- Signal-Data (SD)
45Communication refinement (contd)
- Transformation rules for refining application-level communication events [Lieverse]
- R ⇒ CD → Ld → SR (1)
- W ⇒ CR → St → SD (2)
- E ⇒ E (3)
- How to transform traces of application events using (1), (2) and (3)?
Generates R → E → W event sequences
46Communication refinement (contd)
[Figure: Processors 1, 2 and 3 connected to a memory (Mem) by a bus]
- Assumption 1: processor 2 has local (block) memory
- Transforming R → E → W event sequences from process B
- R → E → W ⇒ CD → Ld → SR → E → CR → St → SD
- Assumption 2: processor 2 has NO local (block) memory
- Transforming R → E → W event sequences from process B
- R → E → W ⇒ CD → CR → Ld → E → St → SR → SD
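The two refinements above can be written down as plain trace rewriting. The sketch below is a static Python illustration under the two stated assumptions; it is not how Sesame performs the transformation (the slides describe a run-time dataflow mechanism), and the function names are hypothetical.

```python
# Per-event rules for a processor with local (block) memory.
RULES = {
    "R": ["CD", "Ld", "SR"],  # check data, load data, signal room
    "W": ["CR", "St", "SD"],  # check room, store data, signal data
    "E": ["E"],
}

# Refinement of one R -> E -> W group without local (block) memory, as listed on the slide.
NO_LOCAL_MEMORY = ["CD", "CR", "Ld", "E", "St", "SR", "SD"]


def refine_with_local_memory(trace):
    """Applies the per-event rules event by event."""
    refined = []
    for event in trace:
        refined.extend(RULES[event])
    return refined


def refine_without_local_memory(trace):
    """Rewrites each consecutive R, E, W group as a whole."""
    if len(trace) % 3 != 0:
        raise ValueError("trace must consist of complete R, E, W groups")
    refined = []
    for i in range(0, len(trace), 3):
        if trace[i:i + 3] != ["R", "E", "W"]:
            raise ValueError("expected an R, E, W group at position %d" % i)
        refined.extend(NO_LOCAL_MEMORY)
    return refined


print(refine_with_local_memory(["R", "E", "W"]))
print(refine_without_local_memory(["R", "E", "W"]))
```

Without local memory the room for the result is checked before the data is loaded, and the input buffer is only released after the store, which is why the second rewrite cannot be expressed event by event.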
47IDF-based trace transformation
- Virtual processors in mapping layer are refined to accomplish trace refinement
- Integer-controlled DataFlow (IDF) model describes internal behavior of a virtual processor
- Application events specify
- what a virtual processor executes
- with whom it communicates
- Internal IDF model specifies
- how the computations and communications take place at the architecture layer
48A short Dataflow intermezzo
- Synchronous DataFlow (SDF) [Lee, Messerschmitt]
- Static model of computation allowing compile-time scheduling
- Basic idea: each actor consumes and produces a fixed number of tokens each time it fires
- Integer-controlled DataFlow (IDF) [Buck]
- Extends SDF with dynamic integer-controlled switch and select actors to allow data dependent execution
- Generalization makes it more powerful (Turing complete) but generally needs dynamic scheduling
- Hard to analyze statically
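Why fixed token rates permit compile-time scheduling can be seen from the SDF balance equations: for every channel, the number of tokens produced per schedule period must equal the number consumed. The Python sketch below solves these equations for a connected graph; the function name and the edge format are assumptions made for this example.

```python
from fractions import Fraction
from math import gcd


def repetition_vector(actors, edges):
    """Returns the smallest positive firing counts that balance every channel.

    actors -- list of actor names (the graph is assumed to be connected)
    edges  -- list of (src, produced, dst, consumed) tuples with fixed token rates
    """
    rate = {actors[0]: Fraction(1)}
    changed = True
    while changed:
        changed = False
        for src, produced, dst, consumed in edges:
            if src in rate and dst not in rate:
                rate[dst] = rate[src] * produced / consumed
                changed = True
            elif dst in rate and src not in rate:
                rate[src] = rate[dst] * consumed / produced
                changed = True
            elif src in rate and dst in rate:
                if rate[src] * produced != rate[dst] * consumed:
                    raise ValueError("inconsistent rates: no periodic schedule exists")
    if len(rate) != len(actors):
        raise ValueError("graph is not connected")
    scale = 1
    for r in rate.values():  # least common multiple of all denominators
        scale = scale * r.denominator // gcd(scale, r.denominator)
    return {a: int(rate[a] * scale) for a in actors}


# A produces 2 tokens per firing, B consumes 3 per firing.
print(repetition_vector(["A", "B"], [("A", 2, "B", 3)]))  # prints "{'A': 3, 'B': 2}"
```

With data-dependent switch and select actors, as in IDF, the rates are no longer fixed, so such a static solution does not exist in general.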
49[Figure: three layers. Application model (process network): processes A, B and C. Mapping layer (dataflow): virtual processors X, Y and Z. Architecture model (discrete event): components connected by a bus]
50IDF-based trace transformation (contd)
- IDF models transform application events into architecture events at run-time
- IDF models execute in the same simulation time-domain as the architecture model
- timed IDF models
- We distinguish three IDF token-channel types
- Intra-event dependency channels: specify dependencies within the refinement of an application event
- Inter-event dependency channels: specify dependencies between refinements of different application events
- Token-exchange channels: connected to architecture model (accomplish timed execution)
51Communication refinement revisited
[Figure: processes A, B and C; Processors 1, 2 and 3 connected to a memory (Mem) by a bus]
- Assumption: processor 2 has NO local (block) memory
- Transforming R → E → W event sequences from process B
- R → E → W ⇒ CD → CR → Ld → E → St → SR → SD
52Communication refinement revisited (2)
[Figure: IDF graph inside virtual processor Y. A switch actor routes the event trace of process B (R, E, W) to the actors CD, CR, Ld, E, St, SR and SD; channels connect to virtual processors X and Z and to processor 2 and the bus in the architecture model]
53Communication refinement revisited (3)
[Figure: processes A, B and C; virtual processors X, Y and Z; Processors 1, 2 and 3 connected to a memory (Mem) by a bus]
- Now assume that
- processor 2 operates on lines (3 lines = 2 blocks)
- processor 2 has a single-entry local line buffer
- processors 1 and 3 still operate at block granularity
R → E → W → R → E → W ⇒ CD → CR → Ld(line) → E(line) → St(line) → Ld(line) → E(line) → St(line) → Ld(line) → E(line) → St(line) → SR → SD
54Communication refinement revisited (4)
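[Figure: IDF graph of the virtual processor for process B with line-level refinement. Switch and select actors, driven by alternating control-token streams (..., 1, 0, 1, 0), route the event trace from process B (R, E, W) to the actors CD, CR, Ld, Ld(line), E(line), St(line), SR and SD; rate-changing channels are labeled 2→3, 1→2, 1→3, 2→1 and 3→1; channels connect from and to virtual processor X, to virtual processor Z and to processor 2]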
55A case of computational refinement
- The application models a synthetic 2D-IDCT by computing two consecutive IDCT operations at block level
- High level, so execute(block) = 1D-IDCT on a data block
Process code fragments shown in the figure:
while(1) { read(block); execute(block); write(block); }
while(1) { read(block); execute(block); write(block); }
while(1) { write(block); execute(block); }
while(1) { read(block); execute(block); write(block); }
while(1) { read(block); execute(block); }
56Computational refinement (contd)
- Two target architectures are explored
[Figure: two architectures, each with Proc A, Proc B, Proc C and Proc D; label: Mem]
- Scenario 1: All processing elements (PEs) are modeled at block level
- Scenario 2: The PE models onto which the IDCT tasks are mapped operate at line level and are pipelined
57Computational refinement (contd)
- Trace transformation rules
- R(block) ⇒ R(line) → ... → R(line) (1)
- W(block) ⇒ W(line) → ... → W(line) (2)
- E(block) ⇒ E(line) → ... → E(line) (3)
- E(line) ⇒ e1 → ... → en (4)
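Rules (1) to (4) amount to expanding each block-level event into a sequence of finer-grained events. A small Python sketch of this expansion follows; the number of lines per block and the names of the sub-operations e1 ... en are not given on the slide, so they are parameters chosen by the caller here.

```python
def refine_event(event, lines_per_block, line_ops=None):
    """Expands one (kind, grain) event according to rules (1) to (4).

    lines_per_block -- how many line events replace one block event (assumed parameter)
    line_ops        -- optional list of sub-operations e1..en replacing each E(line)
    """
    kind, grain = event
    if grain == "block":
        lines = [(kind, "line")] * lines_per_block  # rules (1), (2), (3)
    else:
        lines = [event]
    if line_ops is None:
        return lines
    refined = []
    for line_event in lines:
        if line_event == ("E", "line"):
            refined.extend(("E", op) for op in line_ops)  # rule (4)
        else:
            refined.append(line_event)
    return refined


def refine_trace(trace, lines_per_block, line_ops=None):
    """Refines a whole trace of block-level events."""
    refined = []
    for event in trace:
        refined.extend(refine_event(event, lines_per_block, line_ops))
    return refined


print(refine_trace([("R", "block"), ("E", "block")], lines_per_block=2))
```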
58Computational refinement
[Figure: processes A, B and C; virtual processors X and Z; bus]
59Computational refinement (contd)
62Putting Sesame to use: An example design flow
[Figure: design flow linking Sesame with Compaan/Laura (Leiden University) and Molen (Delft University). Labels: Motion-JPEG encoder; architecture simulation environment; reconfigurable architecture framework; DCT; experimentation; system-level architecture exploration; applications; code suitable for FPGA execution]
63A real implementation using Compaan/Laura/Molen
Mapping M-JPEG on the Molen platform architecture
The DCT kernel:
for k = 1:1:4,
  for j = 1:1:64,
    Pixel(k,j) = In(inBlock)
  end
end
for k = 1:1:4,
  if k < 2,
    for j = 1:1:64,
      Pixel(k,j) = PreShift(Pixel(k,j))
    end
  end
  Block = 2D_dct(Pixel)
end
for k = 1:1:4,
  for j = 1:1:64,
    outBlock = Out(Pixel(k,j))
  end
end
[Figure labels: C Compiler; Laura]
64System-level simulation experiment
- Modeling Molen with DCT mapped onto CCU
- Validation against real implementation
- Information from Compaan/Laura/Molen used for calibration of architecture model
- Apply architecture model refinement
- Keep M-JPEG application model untouched
- DCT component in architecture model is refined
- Operates at pixel level
- Abstract pipeline model, deeply pipelined
- Other architecture components operate at (pixel-)block level
65Sesame's IDF-based model refinement
[Figure: three layers. Application model (M-JPEG): processes A, B and C. Mapping layer: virtual processors X and Z; map DCT on CCU and refine. Architecture model (Molen): components connected by a bus]
66DCT virtual processor
[Figure: internals of the DCT virtual processor. Labels: event trace; scheduler; control trace; 63; P1; P2; Type in; Block in; 2d-dct; Block out; to/from architecture model]
67Simulation results
- Full software implementation
- Simulation: 85024000 cycles
- Real Molen: 84581250 cycles
- Error: 0.5%
- DCT mapped onto CCU
- Simulation: 40107869 cycles
- Real Molen: 39369970 cycles
- Error: 1.9%
- No tuning was done!
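The reported errors can be reproduced from the cycle counts, assuming the error is the difference between simulated and measured cycles relative to the measured (real Molen) count. That definition is an assumption of this check; the slide does not state it.

```python
def relative_error_percent(simulated, measured):
    """Relative deviation of the simulated from the measured cycle count, in percent."""
    return abs(simulated - measured) / measured * 100.0


full_sw = relative_error_percent(85024000, 84581250)
dct_on_ccu = relative_error_percent(40107869, 39369970)
print(f"{full_sw:.1f}% {dct_on_ccu:.1f}%")  # prints "0.5% 1.9%"
```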
68Where are we going?
- Some ongoing and future work
69NoC modeling
- So far, we mainly modeled bus-based systems
- Networks-on-Chip (NoC) will be our (near) future
- Standardized interfaces
- Scalable (point-to-point) networks
- Much more complex protocols (protocol stack?)
- QoS aspects
- Modeling NoCs
- Topologies, switching and routing methods, flow-control, protocols, QoS, etc.
- Communication mapping
- Modeling at multiple abstraction levels
- Gradual refinement
- Role of IDF models
70Communication mapping
With more complex Networks-on-Chip, routing information is needed
71Architecture model calibration
Initial derivation of latency parameters
- documentation
- educated guess
- performance budgeting (what is the required parameter range?)
Next step: calibration with lower-level, external simulation models or prototypes, e.g.
- Instruction set simulators (ISSs)
- Compaan/Laura framework
72Calibration using an ISS
[Figure: processes 1, C and 2, with C executed on an ISS (e.g. SimpleScalar) through an API. Labels: read(1,..), API_write(C,..), computation, API_read(C,..), write(2,..)]
ISS measures cycle times of annotated code fragments
73Mixed-level system simulation
- Zoom in on interesting system components in architecture model
- Simulate these components at a lower level
- Retain high abstraction level for other components
- Saves modeling effort
- May save simulation overhead
- Integration of external simulation models
- ISSs, SystemC models, etc.
- Also allows calibration of higher-level models
- BUT
- Mixed-level simulation can be complex!
- multiple time domains and time grain sizes (synchronization)
- differences in protocol and data granularity of components
74Mixed-level system simulation (contd)
Embedding external models
IDF-based refinement
75Does mixed-level need to be hard? NO!
[Figure: process C executed on an ISS (e.g. SimpleScalar) through an API, connected via buffers to virtual processors; the resulting Read, E(N cycles), Write events are used for trace calibration]
76Towards real design space exploration
- Sesame supplies basic methods and tools for evaluating application, architecture, and mapping combinations
- Simulating entire design space is not an option
- More is needed to explore large design spaces
- What will be the initial design(s) to evaluate?
- How to react when the evaluated architecture does not suffice?
- We need steering before and during simulation
- Design decisions using analytical modeling
- Finding Pareto-optimal candidates using multi-objective optimization
- Design evaluation using simulation
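What "Pareto-optimal candidates" means can be shown with a brute-force filter over a handful of design points. This Python sketch only illustrates the dominance criterion; it is not the optimization method used with Sesame, and the objective tuples (for example execution time and cost, both to be minimized) are made up.

```python
def dominates(a, b):
    """True if design a is at least as good as b in every objective and better in one.

    Objectives are tuples of values that are all to be minimized.
    """
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(points):
    """Keeps the points that are not dominated by any other point."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical (execution time, cost) pairs for five candidate designs.
designs = [(10, 5), (8, 7), (12, 4), (9, 9), (10, 6)]
print(pareto_front(designs))  # prints "[(10, 5), (8, 7), (12, 4)]"
```

Only the non-dominated candidates would then need to be evaluated in detail by simulation.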
77Real design space exploration (contd)
Heuristic methods like evolutionary algorithms
78Credits
This work would not have been possible without
the (ground-laying work of the) following people
- Cagkan Erbas
- Simon Polstra
- Berry van Halderen
- Joseph Coffland
- Frank Terpstra
- Mark Thompson
- Paul Lieverse
- Bart Kienhuis
- Ed Deprettere
- Pieter van der Wolf
- Kees Vissers
- Vladimir Zivkovic
- Todor Stefanov
79For more information
- URL www.science.uva.nl/andy/publications.html
- or
- email andy_at_science.uva.nl
Sesame software can be found at sesamesim.sourceforge.net