Timed Compiled-Code Simulation of Embedded Software for Performance Analysis of SOC Design

1
Timed Compiled-Code Simulation of Embedded
Software for Performance Analysis of SOC Design
  • Jong-Yeol Lee and In-Cheol Park
  • ICS Laboratory, EECS, KAIST

2
Presentation Outline
  • Introduction
  • Timed Compiled-Code Simulation
  • Time-Delta Generation from Intermediate
    Representation (IR) Code
  • Operation Table
  • Difference of the Number of Registers
  • Cache Simulation
  • Interrupt Processing
  • Experimental Results
  • Conclusion

3
Introduction
  • A system-on-chip (SOC) is a complex integrated
    circuit (IC).
  • An SOC includes both H/W and S/W components.
  • A processor core and S/W executed on the core
  • Special H/W blocks and peripherals interconnected
    with a system bus

[Figure: Example of an SOC: Bluetooth baseband module. Labeled blocks: baseband unit, microprocessor (ARM7), RF module, USB, system bus, host computer, memory management unit (MMU), flash memory, UART, audio, audio CODEC, 4KB SRAM.]
4
Introduction
In SOC Design
  • Fast and accurate performance analysis of a
    target system at a high level is required.
  • To reduce design-space exploration time
  • To obtain guidelines for proceeding to
    lower-level design
  • Bus contention is one of the major factors that
    determine the target system performance.
  • In most SOCs, components are connected with one
    or more buses.
  • The analysis of the timing at which the system
    components access buses is required.
  • Including both S/W and H/W components
  • A simulation method that enables timing analysis
    of S/W and H/W is needed.

5
Introduction
Timed compiled-code functional simulation
for performance analysis of SOCs
  • Fast simulation using compiled-code execution
  • Target processor simulators are usually
    time-consuming.
  • An application is compiled and executed on a host
    machine.
  • Timing estimation of embedded S/W described at
    functional (behavioral) level
  • Timing generation for embedded software by using
    a portable compiler
  • Machine-independent properties of portable
    compilers provide retargetability.
  • No need for source code modification

6
Introduction
Previous Works
  • Use of instruction-set simulators
  • Long execution time
  • Annotating C code with timing estimates by
    guessing compiler behavior
  • Effective only for code that has the same
    structure
  • Timing-annotated C code generation from
    functional C code or from object code generated
    by a target compiler
  • A separate simulator is needed.

7
Introduction
Previous Works
  • Seamless
  • Compiled-code execution
  • No way to make up for the discrepancy between a
    host and a target
  • Virtual-CPU
  • Compiled-code execution
  • No timing information is available from the
    compiled-code execution

8
Timed Compiled-Code Simulation
BFM generates the bus cycles of I/O operations in
a target processor for simulation with hardware
models.
Wrapper invokes BFM when host-machine I/O
instructions are executed to access I/O variables.
[Figure: Execution of the application on a host machine. Labeled blocks: host executable of the embedded application (with added code for timing estimation), wrapper, bus functional model (BFM), target processor model, hardware models, cache, MEM, P0, P1.]
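
The hand-off from wrapper to BFM described on this slide can be sketched as follows. This is a minimal Python sketch under stated assumptions, not the authors' implementation: the class names `BusFunctionalModel` and `Wrapper`, the address map, and the fixed 4-cycle bus latency are all hypothetical.

```python
class BusFunctionalModel:
    """Toy BFM: turns an I/O access into a bus transaction and counts cycles."""

    def __init__(self, cycles_per_access=4):  # hypothetical fixed latency
        self.cycles_per_access = cycles_per_access
        self.memory = {}   # stands in for the hardware models
        self.log = []      # one (kind, address, cycles) entry per transaction

    def read(self, address):
        self.log.append(("read", address, self.cycles_per_access))
        return self.memory.get(address, 0)

    def write(self, address, value):
        self.log.append(("write", address, self.cycles_per_access))
        self.memory[address] = value


class Wrapper:
    """Routes accesses to I/O variables through the BFM; other data stays local."""

    def __init__(self, bfm, io_map):
        self.bfm = bfm
        self.io_map = io_map   # I/O variable name -> bus address (assumed)
        self.local = {}

    def load(self, name):
        if name in self.io_map:
            return self.bfm.read(self.io_map[name])
        return self.local[name]

    def store(self, name, value):
        if name in self.io_map:
            self.bfm.write(self.io_map[name], value)
        else:
            self.local[name] = value


bfm = BusFunctionalModel()
bfm.memory[0x1000] = 7                     # value presented by a peripheral
cpu = Wrapper(bfm, {"in": 0x1000, "out": 0x1004})

cpu.store("j", cpu.load("in"))             # I/O read goes through the BFM
cpu.store("k", cpu.load("j") * 2)          # purely local, no bus traffic
cpu.store("out", cpu.load("k"))            # I/O write goes through the BFM
bus_cycles = sum(cycles for _, _, cycles in bfm.log)
```

Only the two accesses to I/O variables reach the bus model; the computation in between runs at host speed.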
9
Timed Compiled-Code Simulation
  • Timing estimation
  • I/O operation timing
  • I/O operations (OP1, OP2, and OP3) access I/O
    variables.
  • Timing of I/O operations is determined by a
    simulation which uses the BFM and hardware
    models.
  • Software timing (time-delta, Δ)
  • Time consumed on core between I/O operations
  • Estimated by using time-delta generator and
    functional C code

[Figure: Timing in S/W execution. Along the time axis, the I/O operations OP1, OP2, and OP3 (I/O operation timing) are separated by the time consumed on the core (S/W timing): Δ1, Δ2, and Δ3.]
10
Timed Compiled-Code Simulation
  • Timing in a simulation
  • OP1 accesses a peripheral P0.
  • OP2 accesses a variable that is not in cache.
  • OP3 accesses a variable in cache.

[Figure: Timing in a simulation. Top: the S/W timing (Δ1, Δ2, Δ3) and the I/O operations (OP1, OP2, OP3) along the time axis. Bottom: the same operations placed in simulation time, with time points T1 to T4.]
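
How software time-deltas and I/O operation timings combine into simulated time can be sketched as below. This is an illustrative Python sketch: the delta values and the `LATENCY` table are made-up numbers, and in the flow described on the slides the I/O latencies would come from simulation with the BFM and hardware models rather than from a fixed table.

```python
# Hypothetical latencies, in cycles, for the three kinds of access on the slide.
LATENCY = {"peripheral": 10, "cache_miss": 6, "cache_hit": 1}


def build_timeline(segments):
    """segments: list of (time_delta, op_name, access_kind).

    Returns a list of (op_name, start, end). Each I/O operation starts after
    the software time-delta that precedes it and lasts for its access latency.
    """
    now = 0
    timeline = []
    for delta, op, kind in segments:
        now += delta              # time consumed on the core
        start = now
        now += LATENCY[kind]      # time consumed by the I/O operation
        timeline.append((op, start, now))
    return timeline


segments = [
    (5, "OP1", "peripheral"),   # OP1 accesses peripheral P0
    (3, "OP2", "cache_miss"),   # OP2 accesses a variable that is not in cache
    (4, "OP3", "cache_hit"),    # OP3 accesses a variable in cache
]
timeline = build_timeline(segments)
for op, start, end in timeline:
    print(f"{op}: starts at cycle {start}, ends at cycle {end}")
```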
11
Time-Delta Generation from IR Code
  • Source code without modification
  • Functional description is used without any
    modification.

[Figure: Time-delta generation using the IR of a compiler. Modified compiler for time-delta generation: source code → front end → IR code → machine-independent optimization → IR modification for time-delta generation (inputs: machine description, variable list) → modified IR code → machine-dependent optimization → executable → execution → time delta.]
12
Time-Delta Generation from IR Code
  • The proposed method gives good results when the
    host compiler and the target compiler have
    similar IR.
  • A different compiler may produce quite different
    results.
  • The effect of machine-independent optimization
    can be taken into account because the IR code is
    modified after machine-independent optimizations.
  • Machine-dependent optimizations based on
    machine-specific features may degrade the
    accuracy of the time-delta.
  • Granularity difference between target
    instructions and IR operations
  • Difference in the number of registers and in the
    number and kinds of functional blocks
  • To compensate for this degradation, a machine
    description is used.

13
Time-Delta Generation from IR Code
  • With the knowledge of target instructions
    generated from each IR operation, the timing
    estimation is possible.
  • Cycle counts of target instructions generated
    from each IR operation are accumulated.

in and out are I/O variables.

Source code:
    /* get input */         j = in;
    /* processing input */  k = Foo(j);
    /* other codes */       .......
    /* store result */      out = k;

IR code:
    1  MOV  R1, in     read input
    2  CALL Foo        argument in R1
    3  II1             IR instruction
    4  II2             IR instruction
    5  MOV  out, R2    out = result

Cycle counts of IR operations:
    MOV   2 cycles
    CALL  3 cycles
    II1   1 cycle
    II2   2 cycles
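
The accumulation of cycle counts for this example can be worked through in a few lines. The cycle counts and the five-operation sequence are taken from the slide; the function name `running_counters` and the convention of taking the time-delta as the difference of two accumulated counters are assumptions made for illustration.

```python
# Cycle counts taken from the slide.
CYCLES = {"MOV": 2, "CALL": 3, "II1": 1, "II2": 2}

# The five IR operations of the example, in program order.
ir_code = ["MOV", "CALL", "II1", "II2", "MOV"]


def running_counters(ops, cycles):
    """Return the accumulated cycle count after each IR operation."""
    total = 0
    counters = []
    for op in ops:
        total += cycles[op]
        counters.append(total)
    return counters


counters = running_counters(ir_code, CYCLES)
# Assumed convention: the time-delta between the read of `in` (operation 1)
# and the write of `out` (operation 5) is the difference of their counters.
delta = counters[4] - counters[0]
print(counters, delta)
```

Under that convention the software time between the two I/O accesses is 3 + 1 + 2 + 2 = 8 cycles.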
14
Time-Delta Generation from IR Code
  • IR is modified to contain information needed to
    generate time-deltas.
  • Indicators and counters are inserted as optional
    fields.
  • Optional fields do not disturb the optimization
    results.

[Figure: Example of an IR operation with counter and indicator fields. Fields, in order: OP ID; pointer to previous OP; pattern (operation body); TSGEN_COUNTER N; TSGEN_INDICATOR FOO; pointer to next OP.]
15
Time-Delta Generation from IR Code
Example
  • Two I/O variables, X and Y, are accessed.

IR code
Time-delta is the difference between the counter
values of two IR operations whose indicator
fields are not null.
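
The rule stated here, a time-delta being the counter difference between two operations whose indicator fields are set, can be sketched as follows. The `IROp` record, its field names, and all counter values are hypothetical; only the idea of counter and indicator fields comes from the slides.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class IROp:
    pattern: str                      # operation body
    counter: int                      # accumulated cycles (counter field)
    indicator: Optional[str] = None   # I/O variable name, or None if not I/O


def time_deltas(ops: List[IROp]) -> List[Tuple[str, str, int]]:
    """Return (from_var, to_var, delta) for consecutive marked operations."""
    marked = [op for op in ops if op.indicator is not None]
    return [
        (a.indicator, b.indicator, b.counter - a.counter)
        for a, b in zip(marked, marked[1:])
    ]


# Hypothetical IR sequence that accesses the I/O variables X and Y.
ops = [
    IROp("load X", 6, indicator="X"),
    IROp("add", 7),
    IROp("mul", 10),
    IROp("store Y", 13, indicator="Y"),
    IROp("sub", 14),
    IROp("load X", 21, indicator="X"),
]
deltas = time_deltas(ops)
```

Operations with a null indicator contribute to the counters but never start or end a time-delta.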
16
Operation Table
  • Contains the cycle counts of IR operations
  • The cycle counts are set in the counter fields of
    IR operations.
  • Granularity difference between an IR operation
    and target machine instructions can be
    compensated.
  • For ARM7, a LOAD IR operation takes 6 cycles (3
    cycles for the data load and 3 cycles for the
    address load).
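
An operation table of this kind can be pictured as a map from each IR operation to the target instructions it expands to. In the sketch below only the ARM7 LOAD entry (3 + 3 cycles) comes from the slide; the table layout, the other entries, and the function name `op_cycles` are assumptions.

```python
# Hypothetical operation table: IR operation -> cycle counts of the target
# instructions it expands to. Only the LOAD entry is taken from the slide.
ARM7_OP_TABLE = {
    "LOAD": [3, 3],   # address load + data load
    "ADD": [1],       # assumed value
    "STORE": [2],     # assumed value
}


def op_cycles(table, ir_op):
    """Cycle count of one IR operation: the sum over its target instructions."""
    return sum(table[ir_op])


load_cycles = op_cycles(ARM7_OP_TABLE, "LOAD")
print(load_cycles)
```

Keeping one list per IR operation is one way to express the granularity difference: a single IR operation may map to several target instructions.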

17
Difference of the Number of Registers
  • Related to the number of register-saving/restoring
    instructions
  • Restricting the number of registers used in
    simulation
  • Effective when the number of registers of a target
    machine is smaller than that of a host machine
  • MCORE (16 registers) is an example when the host
    is SPARC (32 registers).
  • Register allocation routine is modified to accept
    a parameter specifying the number of usable
    registers.
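
The effect of a register-count parameter can be illustrated with a small allocator. The sketch below uses a simplified linear-scan allocator, which is a stand-in chosen for brevity and not the register allocation routine of GCC or of the modified compiler; the function name `linear_scan` and the live ranges are hypothetical.

```python
def linear_scan(intervals, num_regs):
    """Simplified linear-scan allocation.

    intervals: dict mapping a variable name to its (start, end) live range.
    num_regs:  number of usable registers (the restricting parameter).
    Returns the set of variables spilled to memory.
    """
    active = []       # (end, name) of variables currently holding a register
    spilled = set()
    for name, (start, end) in sorted(intervals.items(), key=lambda kv: kv[1][0]):
        # Free the registers of variables that are dead before this one starts.
        active = [(e, n) for e, n in active if e >= start]
        if len(active) < num_regs:
            active.append((end, name))
            continue
        # No free register: spill whichever live range ends last.
        last_end, last_name = max(active)
        if last_end > end:
            active.remove((last_end, last_name))
            active.append((end, name))
            spilled.add(last_name)
        else:
            spilled.add(name)
    return spilled


# Twenty variables that are all live at the same time.
intervals = {f"v{i}": (i, 100) for i in range(20)}
spills_32 = linear_scan(intervals, num_regs=32)   # host-like register count
spills_16 = linear_scan(intervals, num_regs=16)   # target-like register count
print(len(spills_32), len(spills_16))
```

Each spilled variable stands for register-saving and register-restoring instructions that the target would execute but a host with more registers would not.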

18
Cache Simulation
  • An embedded cache simulator is used to simulate
    the specified cache configuration.

[Figure: a sequence of IR operations in which the memory operations are marked.]
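
A cache simulator embedded in the host executable could look roughly like the following. This is a minimal sketch of a direct-mapped cache; the slides do not say which cache organizations are supported, so the class name `DirectMappedCache`, its parameters, and the address trace are assumptions.

```python
class DirectMappedCache:
    """Minimal direct-mapped cache model that only tracks hits and misses."""

    def __init__(self, num_lines=64, line_size=16):
        self.num_lines = num_lines
        self.line_size = line_size
        self.tags = [None] * num_lines   # one tag per cache line
        self.hits = 0
        self.misses = 0

    def access(self, address):
        """Return True on a hit; on a miss, fill the line and return False."""
        block = address // self.line_size
        index = block % self.num_lines
        tag = block // self.num_lines
        if self.tags[index] == tag:
            self.hits += 1
            return True
        self.tags[index] = tag
        self.misses += 1
        return False


cache = DirectMappedCache(num_lines=4, line_size=16)
trace = [0x00, 0x04, 0x40, 0x00, 0x10]   # addresses of memory operations
results = [cache.access(address) for address in trace]
print(results, cache.hits, cache.misses)
```

In a timed simulation, a miss is the case that would turn into a bus access, while a hit costs only core time.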
19
Interrupt Processing
  • Host-machine interrupts are used.
  • Interrupts cannot be checked after each target
    instruction in the compiled-code execution.
  • Host trap is used to service target-machine
    interrupts as soon as possible.

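[Figure: Interrupt processing. While execution of the host executable proceeds, the host-machine ISR finds the appropriate target ISR; a time-delta is also generated for the target ISR. Labeled blocks: BFM, host executable, host-machine ISR, target ISR.]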
20
Experimental Results
  • To show the accuracy of the proposed time-delta
    generator
  • Experiments were run for ARM7, MCORE, and PowerPC
    using a time-delta generator based on GCC.
  • Results of the time-delta generator (tsgcc) are
    compared to those of instruction-set
    simulators (ISSs).
  • TEE and AVE are calculated and compared.

N: number of time-deltas in a program
Ctarget and Ctsgcc: total number of execution cycles measured on ISSs and estimated by tsgcc, respectively
TargetΔi and TsgccΔi: time-delta measured on ISSs and estimated by tsgcc, respectively
21
Experimental Results
  • TEE and AVE using unoptimized executables

[Figure: bar charts of TEE (%) and AVE (%) per benchmark: ADPCM, ARRAY, FIR, G721, IS95, MPEG, RS, LMS, and the average (AVE.). Surviving data labels: TEE 1.22, 0.64, -0.51; AVE -1.18, -0.45, -0.28.]
22
Experimental Results
  • TEE and AVE using optimized executables

[Figure: bar charts of TEE (%) and AVE (%) per benchmark: ADPCM, ARRAY, FIR, G721, IS95, MPEG, RS, LMS, and the average (AVE.). Surviving data labels: TEE 1.32, 0.87, -0.95; AVE -0.58, -0.65, -2.17.]
23
Experimental Results
  • Experiment with restricting the number of
    registers
  • MCORE has only 16 registers (SPARC has 32
    registers).

24
Experimental Results
  • Speed-up of compiled-code execution over ISSs
  • Compiled-code execution is about 300 times faster
    than ISS.

25
Experimental Results
  • Simulation with BFM and memory model in Verilog
    HDL

[Figure: Simulation configuration. The host-code executable and wrapper communicate through inter-process communication (IPC) with the BFM (Verilog), which interacts over the bus with MEMORY (Verilog); the chip boundary is marked.]
[Figure: Waveform in simulation.]
26
Conclusion
  • A new software timing generation method for
    performance analysis and architecture exploration
    of embedded systems is proposed.
  • Easily retargetable to other processors
  • The proposed method uses the intermediate
    representation (IR) of a portable compiler.
  • A machine description is used that contains
    information about the target processor, such as
    the cycle count of each IR operation and the
    number of registers.
  • Experimental results show that the average error
    is about 2% and the maximum speed-up over
    instruction-set simulators is about 300 times.
  • The method is also verified in a timed functional
    simulation environment.