Title: Timed Compiled-Code Simulation of Embedded Software for Performance Analysis of SOC Design
1. Timed Compiled-Code Simulation of Embedded Software for Performance Analysis of SOC Design
- Jong-Yeol Lee and In-Cheol Park
- ICS Laboratory, EECS, KAIST
2. Presentation Outline
- Introduction
- Timed Compiled-Code Simulation
- Time-Delta Generation from Intermediate Representation (IR) Code
- Operation Table
- Difference of the Number of Registers
- Cache Simulation
- Interrupt Processing
- Experimental Results
- Conclusion
3. Introduction
- A system-on-chip (SOC) is a complex integrated circuit (IC).
- An SOC includes both H/W and S/W components.
  - A processor core and S/W executed on the core
  - Special H/W blocks and peripherals interconnected with a system bus
[Figure: Example of an SOC, a Bluetooth baseband module. Blocks shown: baseband unit, microprocessor (ARM7), RF module, USB, system bus, host computer, memory management unit (MMU), flash memory, UART, audio, audio CODEC, 4KB SRAM.]
4. Introduction
In SOC Design
- Fast and accurate performance analysis of a target system at a high level is required.
  - To reduce design-space exploration time
  - To obtain guidelines for proceeding to lower-level design
- Bus contention is one of the major factors that determine target system performance.
  - In most SOCs, components are connected by one or more buses.
- Analysis of the timing at which the system components access buses is required.
  - Including both S/W and H/W components
- A simulation method that enables timing analysis of both S/W and H/W is needed.
5. Introduction
Timed compiled-code functional simulation for performance analysis of SOCs
- Fast simulation using compiled-code execution
  - Target processor simulators are usually time-consuming.
  - An application is compiled and executed on a host machine.
- Timing estimation of embedded S/W described at the functional (behavioral) level
- Timing generation for embedded software by using a portable compiler
  - Machine-independent properties of portable compilers for retargetability
  - No need for source code modification
6. Introduction
Previous Works
- Use of instruction-set simulators
  - Long execution time
- Annotating C code with timing estimates by guessing compiler behavior
  - Effective only for code having the same structure
- Timing-annotated C code generation from functional C code or from object code generated by a target compiler
  - A separate simulator is needed.
7. Introduction
Previous Works
- Seamless
  - Compiled-code execution
  - No way to make up for the discrepancy between a host and a target
- Virtual-CPU
  - Compiled-code execution
  - No timing information is available from the compiled-code execution
8. Timed Compiled-Code Simulation
- The host executable of the embedded application, with added code for timing estimation, is executed on a host machine.
- The wrapper invokes the BFM when host-machine I/O instructions are executed to access I/O variables.
- The bus functional model (BFM) generates the bus cycles of I/O operations in a target processor for simulation with hardware models.
[Figure: Simulation structure. Labels: target processor model, host executable of embedded application, wrapper, bus functional model (BFM), hardware models, cache, MEM, P0, P1.]
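The wrapper and BFM roles described on this slide can be illustrated with a minimal sketch. All class names, method names, and addresses here (`BusFunctionalModel`, `Wrapper`, `0x1000`, `0x1004`) are hypothetical and not taken from the presentation; a plain dictionary stands in for the hardware models, and each logged tuple stands in for one bus cycle.

```python
class BusFunctionalModel:
    """Hypothetical BFM: turns each I/O access into a logged bus transaction."""

    def __init__(self, memory):
        self.memory = memory      # dictionary standing in for the hardware models
        self.bus_log = []         # one (kind, address, value) entry per access

    def read(self, address):
        value = self.memory.get(address, 0)
        self.bus_log.append(("read", address, value))
        return value

    def write(self, address, value):
        self.memory[address] = value
        self.bus_log.append(("write", address, value))


class Wrapper:
    """Hypothetical wrapper: routes accesses to named I/O variables to the BFM."""

    def __init__(self, bfm, io_map):
        self.bfm = bfm
        self.io_map = io_map      # I/O variable name -> bus address (assumed)

    def load(self, name):
        return self.bfm.read(self.io_map[name])

    def store(self, name, value):
        self.bfm.write(self.io_map[name], value)


def run_demo():
    bfm = BusFunctionalModel(memory={0x1000: 7})
    io = Wrapper(bfm, {"in": 0x1000, "out": 0x1004})
    j = io.load("in")             # I/O variable access: handled by the BFM
    k = j * 2                     # ordinary computation: runs natively on the host
    io.store("out", k)            # I/O variable access: handled by the BFM
    return bfm.bus_log
```

Only accesses to I/O variables reach the BFM in this sketch; everything else executes as plain host code, which is where the speed of compiled-code execution comes from.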
9. Timed Compiled-Code Simulation
- Timing estimation
  - I/O operation timing
    - I/O operations (OP1, OP2, and OP3) access I/O variables.
    - Timing of I/O operations is determined by a simulation that uses the BFM and hardware models.
  - Software timing (time-delta, Δ)
    - Time consumed on the core between I/O operations
    - Estimated by using the time-delta generator and functional C code
[Figure: Timing in S/W execution. A time axis showing the time consumed on the core (S/W timing: Δ1, Δ2, Δ3) and the I/O operations (I/O operation timing: OP1, OP2, OP3).]
10. Timed Compiled-Code Simulation
- Timing in a simulation
  - OP1 accesses a peripheral P0.
  - OP2 accesses a variable that is not in cache.
  - OP3 accesses a variable in cache.
[Figure: Timing in a simulation. Time axes showing the time consumed on the core (S/W timing: Δ1, Δ2, Δ3), the I/O operations (OP1, OP2, OP3), and the time points T1 to T4.]
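As a rough illustration of how software time-deltas and I/O operation timing combine on one time axis, the sketch below interleaves the two. The cycle numbers, the `LATENCY` table, and the assumption that each time-delta precedes its I/O operation are all hypothetical; in the presented method the I/O latencies would come from simulation with the BFM and hardware models rather than from a fixed table.

```python
def build_timeline(time_deltas, io_latencies):
    """Interleave S/W time-deltas with I/O operation latencies.

    Returns a list of (label, start, end) tuples on a single time axis.
    Assumes (for illustration) that delta i runs just before I/O operation i.
    """
    timeline = []
    now = 0
    pairs = zip(time_deltas, io_latencies)
    for index, (delta, latency) in enumerate(pairs, start=1):
        timeline.append((f"delta{index}", now, now + delta))
        now += delta
        timeline.append((f"OP{index}", now, now + latency))
        now += latency
    return timeline


# Hypothetical latencies for the three access kinds named on the slide.
LATENCY = {"peripheral": 10, "cache_miss": 8, "cache_hit": 1}

deltas = [5, 12, 4]                       # hypothetical S/W time-deltas
ops = [LATENCY["peripheral"],             # OP1 accesses peripheral P0
       LATENCY["cache_miss"],             # OP2 misses in the cache
       LATENCY["cache_hit"]]              # OP3 hits in the cache

timeline = build_timeline(deltas, ops)
total_cycles = timeline[-1][2]
```

The same time-deltas produce a different total when an access changes from a cache hit to a miss, which is why the I/O timing has to be resolved during simulation.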
11. Time-Delta Generation from IR Code
- Source code without modification
  - The functional description is used without any modification.
[Figure: Time-delta generation using the IR of a compiler. In the modified compiler for time-delta generation, source code passes through the front end to IR code, then machine-independent optimization, then IR modification for time-delta generation (with the machine description and variable list as inputs), giving modified IR code, then machine-dependent optimization, giving an executable; executing it yields the time-deltas.]
12. Time-Delta Generation from IR Code
- The proposed method shows good results when a host compiler and a target compiler have similar IRs.
  - Different compilers may produce quite different results.
- The effect of machine-independent optimization can be taken into account because the IR code is modified after machine-independent optimizations.
- Machine-dependent optimizations based on machine-specific features may degrade the accuracy of the time-delta.
  - Granularity difference between target instructions and IR operations
  - The difference in the number of registers and in the number and kinds of functional blocks
  - To compensate for this degradation, a machine description is used.
13. Time-Delta Generation from IR Code
- With knowledge of the target instructions generated from each IR operation, timing estimation is possible.
  - Cycle counts of the target instructions generated from each IR operation are accumulated.
in and out are I/O variables.
Source code:
    /* get input */
    j = in;
    /* processing input */
    k = Foo(j);
    /* other codes */
    .......
    /* store result */
    out = k;
IR code:
    MOV R1, in     read input
    CALL Foo       argument in R1
    II1            IR instruction
    II2            IR instruction
    MOV out, R2    out = result
Cycle counts of IR operations: MOV 2 cycles, CALL 3 cycles, II1 1 cycle, II2 2 cycles
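The accumulation of cycle counts can be worked through with the numbers on this slide (MOV 2, CALL 3, II1 1, II2 2). The sketch below is only an illustration: the names `CYCLES`, `IR_CODE`, and `running_cycles` are hypothetical, and the IR sequence is the five-operation example shown above.

```python
from itertools import accumulate

# Cycle counts of IR operations, as listed on the slide.
CYCLES = {"MOV": 2, "CALL": 3, "II1": 1, "II2": 2}

# The example IR sequence: read `in`, call Foo, two IR instructions, write `out`.
IR_CODE = ["MOV", "CALL", "II1", "II2", "MOV"]


def running_cycles(ir_code, cycles):
    """Return the accumulated cycle count after each IR operation."""
    return list(accumulate(cycles[op] for op in ir_code))


counters = running_cycles(IR_CODE, CYCLES)
total = counters[-1]
```

With these counts the running totals are 2, 5, 6, 8, and 10 cycles, so the whole sequence from reading `in` to writing `out` is estimated at 10 cycles.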
14. Time-Delta Generation from IR Code
- The IR is modified to contain the information needed to generate time-deltas.
  - Indicators and counters are inserted as optional fields.
  - Optional fields do not disturb the optimization results.
[Figure: Example of an IR operation with counter and indicator fields. Fields: OP ID, pointer to previous OP, pattern (operation body), TSGEN_COUNTER N, TSGEN_INDICATOR FOO, pointer to next OP.]
15. Time-Delta Generation from IR Code
Example
- Two I/O variables, X and Y, are accessed.
- A time-delta is the difference between the counter values of two IR operations whose indicator fields are not null.
[Figure: IR code of the example]
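The rule stated on this slide can be sketched as follows. The `IROp` record, its field names, and the example counter values are hypothetical; in particular, the sketch assumes that each counter holds the cycles accumulated up to and including that operation, which the slides do not state explicitly.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class IROp:
    op_id: int
    pattern: str                      # operation body (illustrative text only)
    counter: int                      # assumed: cumulative cycles up to this op
    indicator: Optional[str] = None   # accessed I/O variable, or None (null)


def time_deltas(ops):
    """Return (from_var, to_var, delta) for consecutive ops with an indicator."""
    deltas = []
    previous = None
    for op in ops:
        if op.indicator is None:
            continue                  # ordinary operation: no time-delta emitted
        if previous is not None:
            delta = op.counter - previous.counter
            deltas.append((previous.indicator, op.indicator, delta))
        previous = op
    return deltas


# Hypothetical IR sequence accessing the two I/O variables X and Y.
ops = [
    IROp(1, "load X", 3, "X"),
    IROp(2, "add", 4),
    IROp(3, "mul", 7),
    IROp(4, "store Y", 10, "Y"),
]
result = time_deltas(ops)
```

Here the only pair of operations with non-null indicators is the access to X and the access to Y, so a single time-delta of 10 - 3 = 7 cycles is produced.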
16. Operation Table
- Contains the cycle counts of IR operations
  - The cycle counts are set in the counter fields of IR operations.
- The granularity difference between an IR operation and target machine instructions can be compensated for.
  - In the case of ARM7, a LOAD IR operation takes 6 cycles (3 cycles for the data load and 3 cycles for the address load).
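One possible shape for such an operation table is sketched below. Only the ARM7 LOAD entry (3 cycles for the address load plus 3 cycles for the data load) comes from the slide; the `ADD` and `STORE` entries, the table layout, and the function names are hypothetical placeholders.

```python
# IR operation -> cycle counts of the target instructions it expands to.
OPERATION_TABLE = {
    "ARM7": {
        "LOAD": [3, 3],   # from the slide: address load + data load
        "ADD": [1],       # hypothetical value, for illustration only
        "STORE": [2],     # hypothetical value, for illustration only
    },
}


def cycles_for(target, ir_op):
    """Cycle count of one IR operation on the given target."""
    return sum(OPERATION_TABLE[target][ir_op])


def annotate(target, ir_ops):
    """Pair each IR operation with the value to place in its counter field."""
    return [(op, cycles_for(target, op)) for op in ir_ops]


annotated = annotate("ARM7", ["LOAD", "ADD"])
```

Listing the per-instruction cycles separately makes the granularity difference visible: one IR operation may correspond to several target instructions, and the table entry is their sum.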
17. Difference of the Number of Registers
- Related to the number of register-saving/restoring instructions
- Restricting the number of registers used in simulation
  - Effective when the number of registers of a target machine is smaller than that of a host machine
  - MCORE (16 registers) is an example when the host is SPARC (32 registers).
  - The register allocation routine is modified to accept a parameter specifying the number of usable registers.
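The effect of limiting the number of usable registers can be illustrated with a small stand-in allocator. This is not the modified GCC routine from the presentation; it is a simplified linear-scan style count, with hypothetical live intervals, that shows how a `num_registers` parameter changes the number of values that must be saved and restored.

```python
def count_spills(live_intervals, num_registers):
    """Count values that do not fit in registers (linear-scan style sketch).

    live_intervals: list of (start, end) pairs, with end exclusive.
    num_registers: number of usable registers (the restricting parameter).
    """
    active = []   # end points of intervals currently holding a register
    spills = 0
    for start, end in sorted(live_intervals):
        # Release registers whose values are dead at this point.
        active = [e for e in active if e > start]
        if len(active) < num_registers:
            active.append(end)
            continue
        spills += 1
        # Spill whichever value lives longest; keep the shorter one in a register.
        longest = max(active)
        if end < longest:
            active.remove(longest)
            active.append(end)
    return spills


# Hypothetical live intervals of four values.
intervals = [(0, 10), (1, 5), (2, 8), (6, 9)]
spills_by_limit = {k: count_spills(intervals, k) for k in (1, 2, 3)}
```

With these intervals, allowing three registers needs no spills, two registers need one, and one register needs two, which mirrors why a 16-register target incurs more save/restore instructions than a 32-register host.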
18. Cache Simulation
- An embedded cache simulator is used to simulate the specified cache configuration.
[Figure labels: memory operation; sequence of IR operations.]
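A cache simulator of the kind mentioned here can be very small. The sketch below assumes a direct-mapped cache with a configurable number of lines and line size; the slides do not say which cache organizations the embedded simulator supports, so the class name, parameters, and address trace are all hypothetical.

```python
class DirectMappedCache:
    """Minimal direct-mapped cache model that only tracks hits and misses."""

    def __init__(self, num_lines, line_size):
        self.num_lines = num_lines
        self.line_size = line_size
        self.tags = [None] * num_lines   # one stored tag per cache line
        self.hits = 0
        self.misses = 0

    def access(self, address):
        """Simulate one memory operation; return True on a hit."""
        block = address // self.line_size
        index = block % self.num_lines
        tag = block // self.num_lines
        if self.tags[index] == tag:
            self.hits += 1
            return True
        self.tags[index] = tag           # fill the line on a miss
        self.misses += 1
        return False


cache = DirectMappedCache(num_lines=4, line_size=16)
# Addresses 0 and 64 map to the same line, so the second access to 0 misses.
results = [cache.access(address) for address in (0, 4, 64, 0)]
```

Each memory operation in the IR sequence would call something like `access`, and the hit or miss outcome decides which latency the I/O operation timing uses.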
19. Interrupt Processing
- Host-machine interrupts are used.
  - Interrupts cannot be checked after each target instruction in compiled-code execution.
  - A host trap is used to service target-machine interrupts as soon as possible.
- A time-delta is also generated for the target ISR.
[Figure: Interrupt flow. Labels: BFM, host executable, host-machine ISR (find appropriate target ISR), target ISR, execution proceeds.]
20. Experimental Results
- To show the accuracy of the proposed time-delta generator
  - Experiments were performed for ARM7, MCORE, and PowerPC using a time-delta generator based on GCC.
  - Results of the time-delta generator (tsgcc) are compared to those of instruction-set simulators (ISSs).
  - TEE and AVE are calculated and compared.
- N: number of time-deltas in a program
- C_target and C_tsgcc: total number of execution cycles measured on ISSs and estimated by tsgcc, respectively
- TargetΔ_i and TsgccΔ_i: time-deltas measured on ISSs and estimated by tsgcc, respectively
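21. Experimental Results
- TEE and AVE using unoptimized executables
[Figure: TEE (%) and AVE (%) charts. Benchmarks on the horizontal axis: ADPCM, ARRAY, FIR, G721, IS95, MPEG, RS, LMS, followed by (AVE.). Data labels visible on the TEE chart: 1.22, 0.64, -0.51; on the AVE chart: -1.18, -0.45, -0.28.]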
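22. Experimental Results
- TEE and AVE using optimized executables
[Figure: TEE (%) and AVE (%) charts. Benchmarks on the horizontal axis: ADPCM, ARRAY, FIR, G721, IS95, MPEG, RS, LMS, followed by (AVE.). Data labels visible on the TEE chart: 1.32, 0.87, -0.95; on the AVE chart: -0.58, -0.65, -2.17.]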
23. Experimental Results
- Experiment with restricting the number of registers
  - MCORE has only 16 registers. (SPARC has 32 registers.)
24. Experimental Results
- Speed-up of compiled-code execution over ISSs
  - Compiled-code execution is up to about 300 times faster than execution on an ISS.
25. Experimental Results
- Simulation with a BFM and a memory model in Verilog HDL
[Figure: Simulation configuration. Labels: host-code executable, wrapper, inter-process communication (IPC), BFM (Verilog), bus interaction, MEMORY (Verilog), chip boundary.]
[Figure: Waveform in simulation]
26. Conclusion
- A new software timing generation method for performance analysis and architecture exploration of embedded systems is proposed.
  - Easily retargetable to other processors
- The proposed method uses the intermediate representation (IR) of a portable compiler.
  - A machine description containing information about a target processor, such as the number of cycles of each IR operation and the number of registers, is used.
- Experimental results show that the average error is about 2% and the maximum speed-up over instruction-set simulators is about 300 times.
- The method is also verified in a timed functional simulation environment.