Title: Exploring VLIW ASIP Design Space using Trimaran Based Framework
1Exploring VLIW ASIP Design Space using Trimaran Based Framework
- Under the guidance of
- Prof. Anshul Kumar
- Department of Computer Science Engineering
- IIT Delhi.
Diviya Jain 2001MCS022
2Motivation
- Application Specific Instruction Set Processors (ASIPs) combine the advantages of ASICs and GPPs.
- Desired performance, cost and power consumption are achieved for a targeted set of applications through
- Processor Specialisation.
- Instruction Set Customization.
3Processor Specialisation Techniques
- Instruction Set Specialisation.
- Function Unit and Data Path Adaptation.
- Memory Specialisation.
- Cache Configuration.
- Interconnect Specialisation.
- Current Approach:
- Instruction Set Specialisation.
- Function Unit and Data Path Specialisation.
4Objectives
- Design of a largely automated framework for ASIP design and evaluation.
- Enable automatic identification, evaluation and selection of critical parts of the application to be mapped onto special FUs.
- Validate the performance of the ASIP designed through the suitably modified Trimaran framework.
5Architectural Choices
- Instruction Level Parallelism (ILP) for high performance.
- ILP choices: Superscalar and VLIW.
- Our Target Architecture
- VLIW architecture
- Core with fine-grain FUs.
- Application Specific coarse-grain FUs.
- Core may be fixed or parameterized.
6Coarse-Grain FUs ?
- Chaining a sequence of operations reduces computation time.
- Limited-resolution operations lead to faster hardware.
- Concurrent operations within a group are more easily parallelized.
- Intermediate results are stored locally, which reduces register pressure.
- Collapsing a series of operations into one instruction shortens code and reduces pressure on the I-cache.
7AFU Design Space
- Type of code mapped:
- Clusters of elementary arithmetic or logic operations.
- Complex, control-oriented sections, e.g. complete loops.
- The number of inputs and outputs of the AFU.
- The number of AFUs, or the number of operations per single AFU.
- Accessibility to memory.
8Spectrum of Custom FUs
[Figure: spectrum of custom FUs attached to the register file]
9Automatic Processor Specialization Process
Application Code
Automatic Identification Of Coarse Grained FU
Identification Algorithm
Evaluation Of FUs identified
Selection Criteria
Selection Of FUs
Modification of Application Code
Customization Of Machine Architecture
Instruction Set Specialisation
Code Generation
Simulation
Performance Statistics
Synthesis
10ASIP Architecture Exploration Requirements
- Execution and evaluation of application code on simulated architectures.
- Tools needed
- Language for Architecture Description.
- Re-targetable Compiler or Compiler Generator.
- Reconfigurable Simulator.
11Trimaran Framework
- The infrastructure comprises the following components:
- A machine description language, HMDES, for describing ILP architectures.
- A parameterized ILP architecture called HPL-PD.
- A compiler front-end (IMPACT) for C, performing parsing, type checking, and a large suite of high-level (i.e. machine-independent) optimizations.
12Trimaran Framework
- A compiler back-end (ELCOR), parameterized by a machine description, performing instruction scheduling, register allocation, and machine-dependent optimizations.
- An extensible IR called Rebel.
- A cycle-level simulator of the HPL-PD architecture, configurable by a machine description, which provides run-time information on execution time, branch frequencies, and resource utilization.
- An integrated graphical user interface (GUI) for configuring and running the Trimaran system.
13Trimaran Compiler Infrastructure
14Limitations of Trimaran
- Restricted to the HPL-PD (HPL PlayDoh) architecture.
- Introducing a new instruction requires modification of the Trimaran source code.
- Since it is a VLIW-based compiler, instructions cannot have variable latency (this excludes exploration of conditionals and loops).
15Earlier Framework
Original C Program
PERL SCRIPT (for automation)
Patch Files
Instrumented C Program
IMPACT
ELCOR
SIMULATOR GENERATOR
Generated Simulator
Results and Statistics
MDES(Description of FUs)
16Limitations Of Earlier Framework
- Manual selection of the application code to be mapped onto the special FUs designed.
- Failure to identify potentially good candidates.
- Allowed validation of the framework only on small benchmarks.
- Instrumented code generated was not completely functional.
- Erroneous code profiling.
- Frequently the code failed to pass through the Trimaran front-end compiler.
- Sub-optimally scheduled code.
17Limitations Of Earlier Framework
- Poor design of the evaluation framework.
- Instruction set specialisation delayed to the Elcor stage of the Trimaran framework.
- Introduction of a large number of data movement instructions by the front-end compiler.
- Execution statistics obtained did not conform to the estimated performance gain.
18Identification of Special FUs
Application Code
Identification Algorithm
Automatic Identification Of Coarse Grained FU
Evaluation Of FUs identified
Selection Criteria
Selection Of FUs
19MachSUIF
- Features of MachSUIF:
- IR has a DFG/CFG representation, but is architecture independent.
- Operations resemble generic assembly operations.
- Provides control and data flow libraries.
- Suited to the identification process, which depends upon the topological characteristics of the DAG constructed.
20Inefficient MachSUIF IR
- Original Source Code
    for (i = 0; i < NUM; i++)
        a[i] = i;
- Trimaran failed to recognize the for loop.
- Loop optimizations, e.g. loop unrolling and software pipelining, are not done.
[Figure: IR block order: loop condition check, loop body, exit code]
21Generation of Efficient IR
- The IR generated is modified to add the loop condition check along with the loop body.
- Trimaran then recognized the for loop, and loop optimizations were performed.
[Figure: modified IR block order: loop condition check, loop body + condition check, exit code]
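The IR change on this slide resembles what compilers usually call loop inversion (or loop rotation): the condition is tested once as a guard and then again at the bottom of the body. The C sketch below only illustrates the two shapes at source level; the function names are hypothetical, and the actual transformation in the framework is applied to the MachSUIF IR, not to C source.

```c
#include <assert.h>

/* Top-tested shape: the condition check is a separate block ahead of the
 * body, as in the IR that Trimaran failed to recognize as a for loop. */
void fill_top_tested(int *a, int num)
{
    int i = 0;
    assert(num >= 0);
    while (i < num) {          /* loop condition check */
        a[i] = i;              /* loop body */
        i++;
    }
}

/* Rotated shape: one guard up front, then the condition check travels
 * with the loop body, so body and back-edge test form a single block. */
void fill_bottom_tested(int *a, int num)
{
    int i = 0;
    assert(num >= 0);
    if (i < num) {             /* loop condition check (guard) */
        do {
            a[i] = i;          /* loop body */
            i++;
        } while (i < num);     /* condition check inside the body block */
    }
}
```

Both functions write the same values for any non-negative `num`; only the control-flow shape differs.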
22MISO Identification Algorithm
    for all Node ∈ Nodes_to_be_analysed
    {
        Generate_MaxMISO(Node)
        Nodes_to_be_analysed = Nodes_to_be_analysed - Nodes_in_MaxMISO
    }

    Generate_MaxMISO(Node)
    {
        for all Parents_of_Node(Node)
        {
            if (fanout_of_Parent_of_Node(Node) == 1)
            {
                include(Parent_of_Node)
                Generate_MaxMISO(Parent_of_Node)
            }
            else
                fanout_of_Parent_of_Node--
        }
    }
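The recursion on this slide can be sketched in C as follows. This is one possible reading, not the implementation used in the framework. The assumptions are mine: nodes are stored in topological order (every parent has a smaller index than its consumers), each operation has at most two parents, and the "fanout decrement" is tracked per MISO as a count of consumers already absorbed, so that a node feeding two different MISOs is never pulled into either. The `Node` type and function names are hypothetical.

```c
#include <assert.h>
#include <string.h>

#define MAX_NODES   32
#define MAX_PARENTS 2

typedef struct {
    int num_parents;
    int parents[MAX_PARENTS];  /* indices of producer nodes */
    int miso_id;               /* filled in by find_max_misos() */
} Node;

/* Walk upwards from `node`; a parent joins MISO `id` once every one of
 * its consumers has been absorbed into that same MISO. */
static void grow_miso(Node *g, const int *fanout, int *absorbed,
                      int node, int id)
{
    for (int k = 0; k < g[node].num_parents; k++) {
        int p = g[node].parents[k];
        if (g[p].miso_id != -1)
            continue;                      /* defensive: already placed */
        absorbed[p]++;
        if (absorbed[p] == fanout[p]) {    /* no consumer left outside */
            g[p].miso_id = id;
            grow_miso(g, fanout, absorbed, p, id);
        }
    }
}

/* Assigns a MISO id to every node and returns the number of MISOs. */
int find_max_misos(Node *g, int n)
{
    int fanout[MAX_NODES] = {0};
    int absorbed[MAX_NODES];
    int count = 0;

    assert(n >= 0 && n <= MAX_NODES);
    for (int i = 0; i < n; i++) {
        g[i].miso_id = -1;
        for (int k = 0; k < g[i].num_parents; k++)
            fanout[g[i].parents[k]]++;
    }
    /* Visit sinks first so each MISO is grown from its single output. */
    for (int root = n - 1; root >= 0; root--) {
        if (g[root].miso_id != -1)
            continue;
        memset(absorbed, 0, sizeof absorbed);
        g[root].miso_id = count;
        grow_miso(g, fanout, absorbed, root, count);
        count++;
    }
    return count;
}
```

For a diamond-shaped DAG (one producer, two intermediate operations, one final operation) the sketch places all four nodes in a single MISO, because the shared producer is absorbed once both of its consumers have been.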
23Implementation Of Identification Algorithm
- Identified all the MISOs in the application code.
- Capable of including or excluding memory operations as part of a MISO.
- Annotated each instruction with the identification information.
- Annotated the inputs and output of each MISO.
- Generated a graphical representation of each MISO identified.
24Evaluation and Selection Technique
- Let $t_{sw}$ represent the execution time of an instruction on the processor.
- Let $t_{hw}$ represent the relative delay of an operation when executed on a dedicated hardware unit.
- $t_{hw}$ is expressed as a fraction or a multiple of the 32-bit multiply-accumulate delay.
- Let $CP_{hw}$ represent the critical path delay of the identified MISO in hardware.
- Let $f$ represent the number of times a basic block is executed.
25Evaluation and Selection Technique
- Execution time of a MISO on the processor is calculated as
    $T_{sw} = \sum_{\text{all instr}} t_{sw}$
- Execution time of a MISO on a special FU is calculated as
    $T_{hw} = \lceil CP_{hw} \rceil$
- Thus the speed-up potential of a MISO is calculated as
    $SpeedUp = (T_{sw} - T_{hw}) \times f$
- Finally, the best N candidates expected to provide the highest speed-up are selected.
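A minimal C sketch of this ranking step follows, assuming the gain of a candidate is the cycles saved per execution (software time minus rounded-up hardware critical path) multiplied by the execution count of its basic block. The `MisoCandidate` type, its field names and the helper functions are hypothetical and not part of the framework; delays are assumed to be non-negative.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct {
    int    id;
    double t_sw_total;  /* sum of software execution times of the MISO's instructions */
    double cp_hw;       /* critical path delay of the MISO on dedicated hardware */
    long   exec_count;  /* number of times the enclosing basic block executes */
} MisoCandidate;

/* Round a non-negative delay up to a whole number of cycles. */
static double ceil_cycles(double delay)
{
    long whole = (long)delay;
    return (delay > (double)whole) ? (double)(whole + 1) : (double)whole;
}

/* Cycles saved over the whole run if this MISO moves to a special FU. */
double miso_gain(const MisoCandidate *m)
{
    return (m->t_sw_total - ceil_cycles(m->cp_hw)) * (double)m->exec_count;
}

/* qsort comparator: larger gain first. */
static int by_gain_desc(const void *a, const void *b)
{
    double ga = miso_gain((const MisoCandidate *)a);
    double gb = miso_gain((const MisoCandidate *)b);
    return (ga < gb) - (ga > gb);
}

/* Sorts candidates by decreasing gain; the first `returned` entries
 * are the selected ones (at most n). */
size_t select_best(MisoCandidate *c, size_t count, size_t n)
{
    assert(c != NULL || count == 0);
    qsort(c, count, sizeof *c, by_gain_desc);
    return n < count ? n : count;
}
```

Sorting the whole array is the simplest choice for the small candidate counts involved; a partial selection would do the same job for larger sets.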
26Modeling Of MISOs
- Original Source Code
    main()
    {
        int a, b, c;
        while (a < 1000)
            /* identified MISO */
            a = a + b*c + b*c;
    }
- Instrumented Source Code (Earlier Framework)
    int FU_miso(int a, int b, int c)
    {
        return 1;
    }
    main()
    {
        int a, b, c;
        while (a < 1000)
            /* identified MISO */
            a = FU_miso(a, b, c);
    }
- Instrumented Source Code (Current Framework)
    int FU_miso(int a, int b, int c)
    {
        return a + b*c + b*c;
    }
    main()
    {
        int a, b, c;
        while (a < 1000)
            /* identified MISO */
            a = FU_miso(a, b, c);
    }
27Advantages of New Approach
- Completely functional instrumented code.
- Eliminates erroneous profiling.
- Generation of optimally scheduled code.
- Elimination of Semantic Analysis.
- Elimination of illegal memory access and hence
segmentation faults.
28Modeling Of MIMOs
- Original Source Code
    main()
    {
        int a, b, c, d;
        /* identified MIMO */
        c = a + b;
        d = a - b;
    }
- Instrumented Source Code (Earlier Framework)
    void FU_mimo(int a, int b)
    {
    }
    main()
    {
        int a, b, c, d;
        int j1, j2, r1, r2;
        scanf("%d", &j1);
        scanf("%d", &j2);
        r1 = j1;
        r2 = j2;
        /* identified MIMO */
        FU_mimo(a, b);
        c = r1;
        d = r2;
    }
- Instrumented Source Code (Current Framework)
    int FU_mimo_one(int a, int b)
    {
        return a + b;
    }
    int FU_mimo_two(int a, int b)
    {
        return a - b;
    }
    void FU_mimo(int a, int b)
    {
    }
    main()
    {
        int a, b, c, d;
        /* identified MIMO */
        c = FU_mimo_one(a, b);
        d = FU_mimo_two(a, b);
        FU_mimo(a, b);
    }
29Advantages Of the Approach
- Completely functional instrumented code.
- No need to explicitly reserve registers through the introduction of scanf instructions.
- No erroneous profiling.
- Generation of optimally scheduled code.
30MISO/MIMO with load/store units
- Modeled in exactly the same manner as the MISO/MIMO.
- During resource definition, the load/store unit is reserved for a few cycles before and after computation for memory access.
- Original Code
    main()
    {
        int a[10];
        for (int i = 0; i < 10; i++)
            a[i] = a[i] + i*2;
    }
- Modified Code
    int FU_miso_ld(int a, int i)
    {
        return a + i*2;
    }
    main()
    {
        int a[10];
        for (int i = 0; i < 10; i++)
            FU_miso_ld(a[i], i);
    }
31Instruction Set Specialisation in Earlier Framework
- The function call representing the new instruction to be introduced passed through Impact without any modification.
- Impact requires the function call arguments to be present either in macro registers or on the stack.
- The introduction of a new machine instruction was delayed to the Elcor stage.
32Overheads Introduced
- Source Code
    d = FU_main_mimofun(a, b, c, d, e)
- Impact Generated IR
    (op 113 st_f2 (mac OP i)(i -24)(r 36 f2) <(tm (i 300))>
    (op 115 st_i (mac OP i)(i -28)(r 111 i) <(tm (i 301))>
    (op 116 st_f2 (mac OP i)(i -36)(r 41 f2) <(tm (i 302))>
    (op 117 mov_f2 mac P5 f2) r 26 f2)
    (op 118 mov_f2 mac P7 f2) r 31 f2)
    (op 119 jsr (l_g_abs fn_FU_main_mimofun) <(tr (mac P5 f2)(mac P7 f2)) (tm (i 300)(i 301)(i 302)) (tmo (i -24)(i -28)(i -36)) (ret (mac P4 f2)) (param size (I 36))> (call info (s_l_abs doubledoubledoubledoubleintdouble))
33Impact Compilation Phases
C Language Source
Pcode
Hcode
Machine Independent Lcode
Mcode
Target Architecture Code
34Selection Of Hcode Phase
- Pcode generated needs to be reverse-translated into C for execution and subsequent collection of profiling information.
- At Lcode level, data movement instructions are already introduced; eliminating them requires complex handling of data dependencies.
- Hcode forms a natural choice:
- No extra overhead is yet introduced.
- Instruction set extension is easy to accomplish.
35Customization of Machine Architecture
- A new functional unit is introduced using HMDES, the machine description language.
- Operation format, latency, resource usage etc. are all specified.
- Semantics of the special machine instruction are provided to the new simulator.
- The Trimaran back-end compiler is modified so that it recognizes the new machine instruction and optimally schedules its execution on the special FU.
36Enhanced Performance Evaluation Framework
Original C Program
IMPACT
Identification and Selection Of FUs
Pcode
Hcode
Lcode
Instrumented C Program
ELCOR
SIMULATOR GENERATOR
Generated Simulator
Results and Statistics
MDES(Description of FUs)
37Case Studies (Kalman Filter)
- Modeled 5 MISOs with load/store units.
- Latency of the FUs conformed to the amount of computation involved.
38Discussion Of Results (Kalman Filter)
- Kalman_Update: the better performance evaluation can be attributed to
- Removal of the extra data movement instructions that were required in the earlier framework.
- Reuse of register values containing memory addresses for successive MISOs.
- Predict_State: the evaluated performance efficiency remains the same
- Completely different memory addresses are required for successive MISOs.
- Addresses generated are stored in GPRs instead of macro registers; thus, elimination of data movement instructions is of no use.
39Case Studies (Fast Fourier Transform)
- Modeled a MIMO performing butterfly operation.
- Latency of the MIMO is assumed to be 8.
- MIMO has 6 sources and 4 destinations.
[Figure: butterfly with inputs a, b and twiddle factor w, and outputs a + bw and a - bw, where a = ar + i(ac), b = br + i(bc), w = wr + i(wc), i = √-1]
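The butterfly MIMO can be illustrated with a small C sketch that takes the six scalar sources (real and imaginary parts of a, b and w) and produces the four scalar destinations (real and imaginary parts of a + bw and a - bw). The struct, the function name and the use of `double` are assumptions made for illustration; the framework itself models a MIMO with one C function per output, as on the MIMO modeling slide, rather than with a struct.

```c
/* Four destinations of the butterfly: (a + b*w) and (a - b*w). */
typedef struct {
    double sum_re, sum_im;    /* a + b*w */
    double diff_re, diff_im;  /* a - b*w */
} BflyOut;

/* Six sources: a = ar + i*ac, b = br + i*bc, w = wr + i*wc. */
BflyOut fu_bfly(double ar, double ac,
                double br, double bc,
                double wr, double wc)
{
    /* Complex product b*w, shared by both outputs. */
    double p_re = br * wr - bc * wc;
    double p_im = br * wc + bc * wr;

    BflyOut out;
    out.sum_re  = ar + p_re;
    out.sum_im  = ac + p_im;
    out.diff_re = ar - p_re;
    out.diff_im = ac - p_im;
    return out;
}
```

The shared product b*w is what makes this a natural coarse-grain unit: four multiplies and six additions collapse into a single operation with one issue slot.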
40Results (Fast Fourier Transform)
- Overheads due to additional scanfs are removed.
- Quality of the code generated by the new framework is much better.
- Loop optimizations like software pipelining could be applied, unlike in the previous framework.
- Though optimizations were performed, performance efficiency was lowered.
41Explanation of the Anomaly
- The scheduled code generated shows no evidence of software pipelining.
- Extra code is added to support these optimization features.
- Generation of statistics is not done accurately.
[Figure: generated schedule]
    Cycle   Operations
    0       st
    1       ld, ld, ld
    2       ld
    4       bfly
    12      st
    13      st, br
    14      st
42Case Studies (FFT)
- For a fair comparison, the modulo scheduling algorithm is explicitly switched off.
43Case Studies (AdpcmDecode)
- 2 MISOs were introduced:
- dest1 = src2 + (src1 >> 1)
- dest1 = src2 + (src1 >> 2)
- Latency of the FUs was taken to be 1, assuming chaining of operations.
44Case Studies (AdpcmDecode)
- The lack of performance gain is attributed to
- Presence of only small MISOs.
- The reduction of execution time on hardware is matched by the VLIW compiler executing the MISO's instructions in parallel with other instructions.
- A poor estimation technique, which assumes sequential execution of instructions on the processor.
45Case Studies (Predicated AdpcmDecode)
- Predication done to identify larger MISOs.
- 3 MISOs identified.
- Critical path lengths of the MISOs are 14, 3 and 4 respectively.
- Their latencies are assumed to be 7, 2 and 2 respectively.
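Why predication exposes larger MISOs can be seen in a small C sketch. The branching version spreads the computation over several basic blocks, so a dataflow-based identifier only sees short fragments; the if-converted version turns each condition into data, leaving one straight-line dataflow graph with a single output. The code is modelled on the kind of step-difference computation found in ADPCM decoders and is an illustration only, not the benchmark code or the MISOs actually identified; the function names are hypothetical, and `step` is assumed non-negative.

```c
#include <assert.h>

/* Branching form: every `if` ends a basic block, so the additions
 * cannot be grouped into one dataflow cluster. */
int vpdiff_branchy(int delta, int step)
{
    int vpdiff;
    assert(step >= 0);
    vpdiff = step >> 3;
    if (delta & 4) vpdiff += step;
    if (delta & 2) vpdiff += step >> 1;
    if (delta & 1) vpdiff += step >> 2;
    return vpdiff;
}

/* If-converted form: each predicate becomes an all-ones or all-zeros
 * mask, so the whole computation is one multiple-input single-output
 * expression over (delta, step). */
int vpdiff_predicated(int delta, int step)
{
    int p4, p2, p1;
    assert(step >= 0);
    p4 = -((delta >> 2) & 1);   /* mask for bit 2 of delta */
    p2 = -((delta >> 1) & 1);   /* mask for bit 1 of delta */
    p1 = -(delta & 1);          /* mask for bit 0 of delta */
    return (step >> 3)
         + (step & p4)
         + ((step >> 1) & p2)
         + ((step >> 2) & p1);
}
```

The mask trick stands in for hardware predication or select operations; the point is only that the control dependences have become data dependences.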
46Results and Discussion
- Though the latency of the computations is dramatically reduced, the gain is not as expected:
- The VLIW compiler is able to schedule the component instructions in parallel with the remaining application code.
- Gain is achieved by shortening the critical path of the application.
- There is a tradeoff between the ILP and the granularity of the MISOs considered.
- This raises the question: given a core with enough resources to handle the maximum parallelism available, will special FUs enhance performance further?
47Need for VLIW Special FU ?
- Modeled 1 MISO having a critical path length of 14.
- Latency assumed to be 7.
- Varied the number of integer ALUs.
- Once the VLIW compiler extracts maximum parallelism, MISOs which shorten the critical path length of the application will enhance efficiency.
- Graph shows that the modified code has a lower level of ILP.
48Conclusions and Contributions
- A largely automated framework for the design and evaluation of ASIPs is achieved.
- Pluggable modules for identification, evaluation and selection of critical parts of the application are implemented.
- The Trimaran evaluation framework is enhanced to give better insight into the gain achieved.
- The performance gain evaluated is significantly improved, and depends upon the base architecture and the nature of the application.
- A VLIW architecture augmented with special FUs will perform better provided the FU is capable of reducing the critical path of the application as dictated by the base architecture.
49Future Work
- Explore the complexity/performance tradeoff of ASIPs with control flow mapped onto the FUs.
- Better evaluation and selection criteria which depend on architectural constraints.
- Multi-objective selection including area, power, I/O constraints etc.
- Design of a better memory model to evaluate the gain.
52Acknowledgements
- Prof. M.Balakrishnan
- Dr. P.R. Panda
- Anup Gangwar
- Basant K. Dwivedi