Timed Compiled-Code Simulation of Embedded Software for Performance Analysis of SOC Design

1
Timed Compiled-Code Simulation of Embedded
Software for Performance Analysis of SOC Design

Jong-Yeol Lee and In-Cheol Park
ICS Laboratory, EECS, KAIST

2
Presentation Outline

Introduction
Timed Compiled-Code Simulation
Time-Delta Generation from Intermediate
Representation (IR) Code
Operation Table
Difference of the Number of Registers
Cache Simulation
Interrupt Processing
Experimental Results
Conclusion

3
Introduction

A system-on-chip (SOC) is a complex integrated
circuit (IC).
An SOC includes both H/W and S/W components.
A processor core and S/W executed on the core
Special H/W blocks and peripherals interconnected
with a system bus

[Figure: Example of an SOC, a Bluetooth baseband
module. Blocks shown: baseband unit, microprocessor
(ARM7), RF module, USB, system bus, host computer,
memory management unit (MMU), flash memory, UART,
audio, audio CODEC, 4KB SRAM]
4
Introduction
In SOC Design

Fast and accurate performance analysis of a
target system at a high level is required.
To reduce design-space exploration time
To obtain guidelines for proceeding to
lower-level design
Bus contention is one of the major factors that
determine the target system performance.
In most SOCs, components are connected with one
or more buses.
The analysis of the timing at which the system
components access buses is required.
Including both S/W and H/W components
A simulation method that enables timing analysis
of S/W and H/W is needed.

5
Introduction
Timed compiled-code functional simulation
for performance analysis of SOCs

Fast simulation using compiled-code execution
Target processor simulators are usually
time-consuming.
An application is compiled and executed on a host
machine.
Timing estimation of embedded S/W described at
functional (behavioral) level
Timing generation for embedded software by using
a portable compiler
Machine-independent properties of portable
compilers provide retargetability
No need for source code modification

6
Introduction
Previous Works

Use of instruction-set simulators
Long execution time
Annotating C code with timing estimates by
guessing compiler behavior
Effective only for code that has the same
structure
Generating timing-annotated C code from
functional C code or from object code produced by
a target compiler
A separate simulator is needed.

7
Introduction
Previous Works

Seamless
Compiled-code execution
No way to make up for the discrepancy between a
host and a target
Virtual-CPU
Compiled-code execution
No timing information is available from the
compiled-code execution

8
Timed Compiled-Code Simulation

The application is executed on a host machine.
BFM generates the bus cycles of I/O operations in
a target processor for simulation with hardware
models.
Wrapper invokes BFM when host-machine I/O
instructions are executed to access I/O variables.

[Figure: Target processor model and hardware
models. Labels: host executable of embedded
application with added codes for timing
estimation, wrapper, cache, bus functional model
(BFM), MEM, P0, P1]
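
The wrapper idea on this slide can be sketched in C as follows. This is a minimal illustration, not the authors' implementation: the names io_read, io_write, bfm_access, and sim_now are hypothetical, the peripheral is reduced to a single register, and the fixed 4-cycle bus cost is an assumption.

```c
#include <stdint.h>

/* Simulated target time in cycles (hypothetical global clock). */
static uint64_t sim_cycles = 0;

/* Stand-in for a hardware model: one register of peripheral P0. */
static uint32_t p0_reg = 0;

/* Hypothetical BFM entry point: performs the bus access and returns
   the number of bus cycles it took. The fixed cost of 4 cycles is an
   assumption made for this sketch only. */
static unsigned bfm_access(int is_write, uint32_t *value)
{
    if (is_write)
        p0_reg = *value;
    else
        *value = p0_reg;
    return 4u;
}

/* Wrapper calls that stand in for direct accesses to an I/O variable. */
uint32_t io_read(void)
{
    uint32_t v = 0;
    sim_cycles += bfm_access(0, &v);
    return v;
}

void io_write(uint32_t v)
{
    sim_cycles += bfm_access(1, &v);
}

/* Current simulated time in cycles. */
uint64_t sim_now(void)
{
    return sim_cycles;
}
```

In a real setup the BFM would drive hardware models rather than a local variable; the point here is only that every I/O access goes through the wrapper, which lets the simulated clock advance by the bus time.
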
9
Timed Compiled-Code Simulation

Timing estimation
I/O operation timing
I/O operations (OP1, OP2, and OP3) access I/O
variables.
Timing of I/O operations is determined by a
simulation which uses the BFM and hardware
models.
Software timing (time-delta, Δ)
Time consumed on core between I/O operations
Estimated by using time-delta generator and
functional C code

[Figure: Timing in S/W execution. Along the time
axis, time consumed on core (S/W timing: Δ1, Δ2,
Δ3) alternates with I/O operations (I/O operation
timing: OP1, OP2, OP3)]
10
Timed Compiled-Code Simulation

Timing in a simulation
OP1 accesses a peripheral P0.
OP2 accesses a variable that is not in cache.
OP3 accesses a variable in cache.

[Figure: Timing in a simulation. The S/W execution
timeline (Δ1, Δ2, Δ3 and OP1, OP2, OP3) is shown
above the corresponding simulation timeline, with
time points T1 to T4 marked]
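
How simulated time advances from time-deltas and I/O timing can be sketched as below. This is an illustrative model under stated assumptions: the segment_t type and the timeline function are hypothetical names, and the cycle values used with them would come from the time-delta generator and from the BFM and hardware models respectively.

```c
#include <stddef.h>

/* One stretch of software execution followed by one I/O operation. */
typedef struct {
    unsigned delta;      /* software time-delta before the I/O operation */
    unsigned io_cycles;  /* I/O time reported by the BFM and hardware models */
} segment_t;

/* Stores in finish[i] the simulated time at which the i-th I/O
   operation completes (if finish is not NULL) and returns the end time. */
unsigned long timeline(const segment_t *seg, size_t n, unsigned long *finish)
{
    unsigned long t = 0;
    for (size_t i = 0; i < n; i++) {
        t += seg[i].delta;      /* core time between I/O operations */
        t += seg[i].io_cycles;  /* bus, cache, or peripheral time */
        if (finish != NULL)
            finish[i] = t;
    }
    return t;
}
```

With three segments, a slow peripheral access, a cache miss, and a cache hit would differ only in their io_cycles values, while the deltas stay whatever the software timing estimate says.
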
11
Time-Delta Generation from IR Code

Source code without modification
Functional description is used without any
modification.

[Figure: Time-delta generation using the IR of a
compiler. Inside the modified compiler for
time-delta generation, source code passes through
the front end to IR code, then machine-independent
optimization, then IR modification for time-delta
generation (with machine description and variable
list as inputs), giving modified IR code, then
machine-dependent optimization, giving the
executable. Executing it yields the time-deltas.]
12
Time-Delta Generation from IR Code

The proposed method gives good results when a host
compiler and a target compiler have similar IR.
Different compilers may produce quite different
results.
The effect of machine-independent optimization
can be taken into account because IR code is
modified after machine-independent optimizations.
Machine-dependent optimizations based on
machine-specific features may degrade the
accuracy of the time-delta.
Granularity difference between target
instructions and IR operations
Differences in the number of registers and in the
number and kinds of functional blocks
To compensate for this degradation, a machine
description is used.

13
Time-Delta Generation from IR Code

With the knowledge of target instructions
generated from each IR operation, the timing
estimation is possible.
Cycle counts of target instructions generated
from each IR operation are accumulated.

in and out are I/O variables.

Source code
    /* get input */         j = in;
    /* processing input */  k = Foo(j);
    /* other codes */       .......
    /* store result */      out = k;

IR code
    1  MOV R1, in     read input
    2  CALL Foo       argument in R1
    3  II1            IR instruction
    4  II2            IR instruction
    5  MOV out, R2    out = result

Cycle counts of IR operations
    MOV   2 cycles
    CALL  3 cycles
    II1   1 cycle
    II2   2 cycles
14
Time-Delta Generation from IR Code

IR is modified to contain information needed to
generate time-deltas.
Indicators and counters are inserted as optional
fields.
Optional fields do not disturb the optimization
results.

Example of an IR operation with counter and
indicator fields:
OP ID
Pointer to previous OP
Pattern (operation body)
TSGEN_COUNTER N
TSGEN_INDICATOR FOO
Pointer to next OP
15
Time-Delta Generation from IR Code
Example

Two I/O variables, X and Y, are accessed.

[Figure: IR code]
Time-delta is the difference between the counter
values of two IR operations whose indicator
fields are not null.
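
The rule stated above can be sketched in C. This is a minimal illustration that assumes each counter holds the cumulative cycle count up to and including its operation; the ir_op_t type and the time_deltas function are hypothetical and only mirror the TSGEN_COUNTER and TSGEN_INDICATOR fields by analogy.

```c
#include <stddef.h>

typedef struct {
    unsigned long counter;  /* cumulative cycles (cf. TSGEN_COUNTER) */
    const char *indicator;  /* I/O variable name or NULL (cf. TSGEN_INDICATOR) */
} ir_op_t;

/* Writes the time-delta between each pair of consecutive operations
   whose indicator is not NULL into out (at most cap entries) and
   returns the number of deltas written. */
size_t time_deltas(const ir_op_t *ops, size_t n,
                   unsigned long *out, size_t cap)
{
    size_t found = 0;
    int have_prev = 0;
    unsigned long prev = 0;

    for (size_t i = 0; i < n; i++) {
        if (ops[i].indicator == NULL)
            continue;  /* ordinary operation, no I/O variable involved */
        if (have_prev && found < cap)
            out[found++] = ops[i].counter - prev;
        prev = ops[i].counter;
        have_prev = 1;
    }
    return found;
}
```

For a sequence that touches X at counter value 2 and Y at counter value 10, the single time-delta is 8 cycles.
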
16
Operation Table

Contains the cycle counts of IR operations
The cycle counts are set in the counter fields of
IR operations.
Granularity difference between an IR operation
and target machine instructions can be
compensated.
In the case of ARM7, a LOAD IR operation takes 6
cycles (3 cycles for the data load and 3 cycles
for the address load).
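
An operation table lookup of this kind might look like the sketch below. The enum, the table layout, and set_counters are hypothetical; the LOAD cost of 6 cycles is the ARM7 figure quoted above, while the MOV and CALL costs are borrowed from the earlier generic example and are not claimed to be ARM7 values.

```c
#include <stddef.h>

typedef enum { IR_MOV, IR_CALL, IR_LOAD, IR_KIND_COUNT } ir_kind_t;

/* Hypothetical operation table: cycles charged per IR operation kind. */
static const unsigned op_table[IR_KIND_COUNT] = {
    [IR_MOV]  = 2,  /* illustrative value */
    [IR_CALL] = 3,  /* illustrative value */
    [IR_LOAD] = 6,  /* 3 cycles data load + 3 cycles address load */
};

/* Fills counters[i] with the cumulative cycle count after the i-th
   IR operation in the sequence. */
void set_counters(const ir_kind_t *kinds, size_t n, unsigned long *counters)
{
    unsigned long total = 0;
    for (size_t i = 0; i < n; i++) {
        total += op_table[kinds[i]];
        counters[i] = total;
    }
}
```

Because one IR operation can expand into several target instructions, charging the whole cost at the table level is what absorbs the granularity difference.
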

17
Difference of the Number of Registers

Related to the number of register-saving/restoring
instructions
Restricting the number of registers used in
simulation
Effective when the target machine has fewer
registers than the host machine
MCORE (16 registers) is an example when the host
is SPARC (32 registers).
The register allocation routine is modified to
accept a parameter specifying the number of usable
registers.

18
Cache Simulation

An embedded cache simulator is used to simulate
the specified cache configuration.

[Figure: Sequence of IR operations containing a
memory operation]
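
As an illustration of what an embedded cache simulator does, here is a minimal direct-mapped cache model in C. The slides do not describe the simulator's organization, so the direct-mapped layout, the 64-line by 16-byte geometry, and all names are assumptions chosen for brevity.

```c
#include <stdint.h>
#include <string.h>

#define NUM_LINES  64u  /* assumed number of cache lines */
#define LINE_BYTES 16u  /* assumed line size in bytes */

typedef struct {
    uint32_t tag[NUM_LINES];
    uint8_t  valid[NUM_LINES];
    unsigned long hits;
    unsigned long misses;
} cache_t;

/* Clears all lines and statistics. */
void cache_init(cache_t *c)
{
    memset(c, 0, sizeof *c);
}

/* Looks up addr; returns 1 on a hit and 0 on a miss.
   On a miss the line is filled, evicting whatever was there. */
int cache_access(cache_t *c, uint32_t addr)
{
    uint32_t block = addr / LINE_BYTES;
    uint32_t index = block % NUM_LINES;
    uint32_t tag   = block / NUM_LINES;

    if (c->valid[index] && c->tag[index] == tag) {
        c->hits++;
        return 1;
    }
    c->valid[index] = 1;
    c->tag[index] = tag;
    c->misses++;
    return 0;
}
```

A miss would then be forwarded to the memory model for timing, while a hit costs only core time.
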
19
Interrupt Processing

Host-machine interrupts are used.
Interrupts cannot be checked after each target
instruction in the compiled-code execution.
Host trap is used to service target-machine
interrupts as soon as possible.

Time-delta is also generated for the target ISR.

[Figure: Interrupt flow. Labels: host executable
(execution proceeds), BFM, host-machine ISR (find
appropriate target ISR), target ISR]
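
One way to picture the host-trap mechanism is with a host signal handler that dispatches to a target ISR. The sketch below assumes a POSIX host and uses SIGUSR1 as the host trap; the ISR table, the IRQ numbering, and the 12-cycle ISR cost are all hypothetical. The signal is delivered synchronously with raise so that the handler may safely call ordinary functions.

```c
#include <signal.h>
#include <stddef.h>

#define NUM_IRQS 4  /* assumed number of target interrupt lines */

typedef void (*target_isr_t)(void);

static target_isr_t isr_table[NUM_IRQS];
static volatile sig_atomic_t pending_irq = -1;
static volatile sig_atomic_t isr_cycles = 0;

/* Example target ISR; its 12-cycle time-delta is a made-up value. */
static void timer_isr(void)
{
    isr_cycles += 12;
}

/* Host-machine ISR: finds the appropriate target ISR and runs it. */
static void host_isr(int signo)
{
    int irq = (int)pending_irq;
    signal(signo, host_isr);  /* re-arm in case the handler was reset */
    if (irq >= 0 && irq < NUM_IRQS && isr_table[irq] != NULL)
        isr_table[irq]();
    pending_irq = -1;
}

/* Registers the example ISR on IRQ 1 and installs the host handler.
   Returns 0 on success and -1 on failure. */
int install_host_isr(void)
{
    isr_table[1] = timer_isr;
    return signal(SIGUSR1, host_isr) == SIG_ERR ? -1 : 0;
}

/* Called by the hardware side (for example via the BFM) to post an IRQ. */
void raise_target_irq(int irq)
{
    pending_irq = irq;
    raise(SIGUSR1);
}

/* Total cycles charged to target ISRs so far. */
int isr_cycles_spent(void)
{
    return (int)isr_cycles;
}
```

An IRQ with no registered target ISR is simply dropped in this sketch.
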
20
Experimental Results

To show the accuracy of the proposed time-delta
generator
Experiments target ARM7, MCORE, and PowerPC using
a time-delta generator based on GCC.
Results of the time-delta generator (tsgcc) are
compared to those of instruction-set
simulators (ISSs).
TEE and AVE are calculated and compared.

N: number of time-deltas in a program
Ctarget and Ctsgcc: total number of execution
cycles measured on ISSs and estimated by tsgcc
Target Δi and Tsgcc Δi: time-delta measured on
ISSs and estimated by tsgcc
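
The formulas for TEE and AVE did not survive in this transcript. The sketch below is therefore an assumption, not the authors' definition: it treats TEE as the signed relative error of the total cycle count and AVE as the mean signed relative error over the N time-deltas, both in percent. The function names are hypothetical.

```c
#include <stddef.h>

/* Assumed TEE: signed relative error of the total execution cycles,
   in percent. c_target is measured on an ISS, c_tsgcc is estimated. */
double tee_percent(double c_target, double c_tsgcc)
{
    return 100.0 * (c_tsgcc - c_target) / c_target;
}

/* Assumed AVE: mean of the signed relative errors of the individual
   time-deltas, in percent. Returns 0 when n is 0. */
double ave_percent(const double *target, const double *tsgcc, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += (tsgcc[i] - target[i]) / target[i];
    return n ? 100.0 * sum / (double)n : 0.0;
}
```

Signed errors are used because the reported values include negative numbers, which suggests over- and underestimation are distinguished.
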
21
Experimental Results

TEE and AVE using unoptimized executables

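[Figure: Bar charts of TEE (%) and AVE (%) per
benchmark: ADPCM, ARRAY, FIR, G721, IS95, MPEG,
RS, LMS, and the average (AVE.). Data labels
surviving near the TEE chart: 1.22, 0.64, -0.51;
near the AVE chart: -1.18, -0.45, -0.28]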
22
Experimental Results

TEE and AVE using optimized executables

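[Figure: Bar charts of TEE (%) and AVE (%) per
benchmark: ADPCM, ARRAY, FIR, G721, IS95, MPEG,
RS, LMS, and the average (AVE.). Data labels
surviving near the TEE chart: 1.32, 0.87, -0.95;
near the AVE chart: -0.58, -0.65, -2.17]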
23
Experimental Results

Experiment with restricting the number of
registers
MCORE has only 16 registers, whereas SPARC has 32.

24
Experimental Results

Speed-up of compiled-code execution over ISSs
Compiled-code execution is about 300 times faster
than ISS.

25
Experimental Results

Simulation with BFM and memory model in Verilog
HDL

[Figure: Simulation configuration. Labels:
host-code executable, wrapper, inter-process
communication (IPC), BFM (Verilog), bus
interaction, MEMORY (Verilog), chip boundary]
[Figure: Waveform in simulation]
26
Conclusion

A new software timing generation method for
performance analysis and architecture exploration
of embedded systems is proposed.
Easily retargetable to other processors
The proposed method uses the intermediate
representation (IR) of a portable compiler.
A machine description is used that contains
information about the target processor, such as
the cycle count of each IR operation and the
number of registers.
Experimental results show that the average error
is about 2% and the maximum speed-up over
instruction-set simulators is about 300 times.
The method is also verified in a timed functional
simulation environment.