Title: Cycle Accurate Parameterized Simulator for Clustered VLIW ASIP in SystemC
1Cycle Accurate Parameterized Simulator for
Clustered VLIW ASIP in SystemC
March 31, 2003
2Motivation
- VLIW ASIPs include an increasing number of
functional units to meet the high throughput
requirements of applications exhibiting high ILP.
- The degree of parallelism is limited by the register
file's ability to supply operands to the functional
units.
- Clustered VLIW overcomes this problem by
clustering the functional units such that each
cluster can read from and write to only a subset of
the registers.
3Need for a retargetable simulator
- Role of architecture customization
  - Higher performance
  - Smaller area
  - Lower power
- Need for a tool to verify the performance metrics
of a given application on a given architecture.
4Objective
- To define a retargetable clustered VLIW
architecture and implement a cycle-accurate
simulator for it in SystemC.
- Validate the simulator by modeling
  - Texas Instruments TI-C6400 DSP
  - Trimedia TM-1000 DSP
5Clustered VLIW Architecture
6Customization Opportunities
- System Level
- Processor Level
7Customization Opportunities -> System Level
- Compute units
  - Number and types: ASICs, VLIW or RISC ASIPs, etc.
- Interconnection network
  - Custom interconnection based on the communication
    pattern of the application
- Memory architecture
  - Types: synchronous/asynchronous,
    pipelined/non-pipelined, etc.
  - Various transfer modes
  - Custom memories: FIFOs, LIFOs, frame buffers, etc.
8Customization Opportunities -> System Level
9Customization Opportunities -> Processor Level
- Functional Units
  - MISO, MIMO, MIMO with LD/ST
  - Rigid or flexible I/O timeshapes
- Register File Clustering
  - If many FUs are connected to the same register file,
    the delay and cost of the register file become the
    bottleneck.
  - Each FU can read from and write to only a subset
    of registers
- Interconnects
  - Between different clusters and between clusters
    and memory
- Instruction encoding and decoding
  - Reduce or remove explicit NOPs in code
  - Affects code size, object code compatibility,
    branch misprediction penalty, hardware cost, and
    address specification in code
10Customization Opportunities -> Processor Level
[Figure: register files (RF1, RF2) linked through an interconnect to functional units (FU1, FU2, FU3, FU4, AFU1, AFU2). Callouts: "No. and Type of Regfile Customization", "Interconnect Customization", "No. and Type of FU Customization".]
11Simulator Description
- Simulator is being written in SystemC.
- Simulator runs in two phases.
  - In the setup phase, all the functional units,
    register files, memories and the interconnection
    network are initialized by reading the HMDES
    description of the architecture.
  - In the execution phase, the simulator executes the
    given program, and performance statistics can be
    extracted, such as the number of cycles taken, memory
    bandwidth utilized, and number of cache misses.
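
The two phases above can be sketched in plain C++ (not SystemC, so the example stays self-contained). This is only an illustration under assumptions: `ArchConfig`, `InstrWord`, `Stats`, and `Simulator` are hypothetical names, the struct stands in for the parsed HMDES description, and charging one cycle per instruction word plus a fixed miss penalty is a placeholder cost model, not the simulator's actual timing.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the parsed HMDES architecture description.
struct ArchConfig {
    int num_clusters;
    int fus_per_cluster;
    int cache_miss_penalty;  // extra cycles charged per miss (assumed model)
};

// Hypothetical instruction word: only what the statistics below need.
struct InstrWord {
    bool causes_cache_miss;
};

struct Stats {
    std::uint64_t cycles = 0;
    std::uint64_t cache_misses = 0;
};

class Simulator {
public:
    // Setup phase: build the machine model from the description.
    void setup(const ArchConfig& cfg) {
        cfg_ = cfg;
        fu_count_ = cfg.num_clusters * cfg.fus_per_cluster;
    }

    // Execution phase: step through the program and gather statistics.
    Stats run(const std::vector<InstrWord>& program) const {
        Stats s;
        for (const InstrWord& w : program) {
            s.cycles += 1;  // one issue cycle per instruction word
            if (w.causes_cache_miss) {
                s.cache_misses += 1;
                s.cycles += static_cast<std::uint64_t>(cfg_.cache_miss_penalty);
            }
        }
        return s;
    }

    int fu_count() const { return fu_count_; }

private:
    ArchConfig cfg_{};
    int fu_count_ = 0;
};
```

Keeping setup separate from execution means the same `run` loop can be reused for any architecture description, which is the point of retargetability.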
12Simulator Description (cont)
- Predicated instruction execution
- Compiler-visible interconnection network
- Multiple FUs may share the same set of ports
to the decode unit and register files
- A pipeline stall occurs on a cache miss or a cross-path
register access.
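
A minimal sketch of predicated execution and the two stall sources listed above, in plain C++. All names (`PredicatedAdd`, `ClusterState`, `execute`, `stall_cycles`) are hypothetical, and adding the two penalties together when both events occur is an assumption made for illustration only.

```cpp
#include <array>
#include <cstdint>

// Hypothetical predicated operation: dst = src1 + src2 if preds[pred] is set.
struct PredicatedAdd {
    int pred;
    int dst;
    int src1;
    int src2;
};

struct ClusterState {
    std::array<std::int32_t, 8> regs{};
    std::array<bool, 4> preds{};
};

// Executes the operation; returns true if the result was committed.
bool execute(ClusterState& st, const PredicatedAdd& op) {
    if (!st.preds[op.pred]) {
        return false;  // predicate false: the operation behaves like a NOP
    }
    st.regs[op.dst] = st.regs[op.src1] + st.regs[op.src2];
    return true;
}

// Stall cycles for one instruction (assumed additive model).
int stall_cycles(bool cache_miss, bool cross_path_access,
                 int miss_penalty, int cross_path_delay) {
    return (cache_miss ? miss_penalty : 0) +
           (cross_path_access ? cross_path_delay : 0);
}
```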
13Simulator Description (cont)
- Simulator
  - Uses HMDES to describe the architecture
  - Uses the Dinero IV cache simulator to simulate the
    cache hierarchy
  - Actually runs the input program
  - Generates statistics and an execution trace
- Dinero IV
  - Cache hierarchy is modelled using the Dinero IV cache
    simulator
  - Developed by Mark D. Hill, Univ. of Wisconsin
    Computer Sciences
  - Provides a subroutine interface for a flexible
    simulator of multilevel cache memories.
  - Not a timing or functional simulator, so these
    details are handled by the memory module.
14Retargetability
- Parameterized parts
  - Memory hierarchy
  - Number and types of register files
  - Interconnects between register files and FUs
  - Organization of FUs in various clusters
- Changes that require code modification
  - Adding custom FUs
  - Changing the number of pipeline stages
15The Typical VLIW Pipeline
[Figure: pipeline stages. Instruction Fetch; Instruction Decode (Align, Decode, Decompress); DF/AG; Execute; Store Results.]
16Pipeline stages in simulator
[Figure: pipeline stages. Instruction Fetch; Instruction Decode (Align, Decode, Decompress); DF/AG; Execute; Store Results.]
17Class Hierarchy
18Overview of Simulator
19Register Files
- Retargetable parameters
  - Type of registers in the register file
  - Latency (in cycles)
  - Number of read ports for
    - Normal registers
    - Predicate registers
  - Number of write ports
- Predicates can be held either in
  - Separate predicate registers, or
  - The least significant bit of normal registers
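
The register-file parameters and the two predicate options might be captured as follows. This is a sketch in plain C++ with hypothetical names (`RegFileConfig`, `RegisterFile`); port counts and latency are stored but not enforced here, and 32-bit registers are an assumption.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical configuration mirroring the retargetable parameters.
struct RegFileConfig {
    int num_regs;
    int latency_cycles;
    int read_ports;        // read ports for normal registers
    int pred_read_ports;   // read ports for predicate registers
    int write_ports;
    bool separate_predicates;  // false: use bit 0 of normal registers
    int num_pred_regs;         // only used when separate_predicates is true
};

class RegisterFile {
public:
    explicit RegisterFile(const RegFileConfig& cfg)
        : cfg_(cfg),
          regs_(static_cast<std::size_t>(cfg.num_regs), 0u),
          preds_(cfg.separate_predicates
                     ? static_cast<std::size_t>(cfg.num_pred_regs)
                     : 0u,
                 false) {}

    void write(std::size_t idx, std::uint32_t v) { regs_.at(idx) = v; }
    std::uint32_t read(std::size_t idx) const { return regs_.at(idx); }

    // Only valid when separate predicate registers are configured.
    void write_pred(std::size_t idx, bool v) { preds_.at(idx) = v; }

    // Reads a predicate from whichever storage the configuration selects.
    bool read_pred(std::size_t idx) const {
        if (cfg_.separate_predicates) {
            return preds_.at(idx);
        }
        return (regs_.at(idx) & 1u) != 0;  // least significant bit
    }

private:
    RegFileConfig cfg_;
    std::vector<std::uint32_t> regs_;
    std::vector<bool> preds_;
};
```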
20L1 Cache (Data and Instruction)
- Use of the Dinero IV cache simulator
- Retargetable parameters
  - Size
    - Total size
    - Line size
  - Associativity
  - Read/write policies
  - Latency
  - Bus width to memory/functional units
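
To show how the size parameters relate, the sketch below derives the number of sets and the set index for an address using the standard set-associative relation (sets = total size / (line size x associativity)). It is plain C++ with hypothetical names (`CacheConfig`, `num_sets`, `set_index`) and does not use the Dinero IV interface.

```cpp
#include <cstdint>
#include <stdexcept>

// Hypothetical configuration mirroring the cache parameters above.
struct CacheConfig {
    std::uint32_t total_size_bytes;
    std::uint32_t line_size_bytes;
    std::uint32_t associativity;
    std::uint32_t latency_cycles;
};

// Number of sets implied by the three geometry parameters.
std::uint32_t num_sets(const CacheConfig& c) {
    if (c.line_size_bytes == 0 || c.associativity == 0) {
        throw std::invalid_argument("line size and associativity must be non-zero");
    }
    const std::uint32_t set_bytes = c.line_size_bytes * c.associativity;
    if (c.total_size_bytes == 0 || c.total_size_bytes % set_bytes != 0) {
        throw std::invalid_argument("total size must be a multiple of line size * associativity");
    }
    return c.total_size_bytes / set_bytes;
}

// Set that a byte address maps to.
std::uint32_t set_index(const CacheConfig& c, std::uint32_t addr) {
    return (addr / c.line_size_bytes) % num_sets(c);
}
```

For example, a 16 KiB, 2-way cache with 32-byte lines has 256 sets under this relation.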
21Main Memory
- Retargetable parameters
  - Latency
  - Delay
  - Size of burst mode for memory access
  - Start address of data (so that multiple memory
    modules may be added).
22Bus between Cache/Memory
- Retargetable parameters
  - Bus width
- Pluggable arbiter
  - Using static priorities for now.
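
One way to make the arbiter pluggable is to pass it to the bus as a callable. The sketch below is plain C++ with hypothetical names (`Arbiter`, `static_priority_arbiter`, `Bus`); treating a lower requester index as higher priority, and one bus-width transfer per cycle, are assumptions.

```cpp
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// An arbiter maps the request vector to the granted requester, or -1 if none.
using Arbiter = std::function<int(const std::vector<bool>&)>;

// Static priorities: the lowest-numbered requester wins (assumed ordering).
int static_priority_arbiter(const std::vector<bool>& requests) {
    for (std::size_t i = 0; i < requests.size(); ++i) {
        if (requests[i]) {
            return static_cast<int>(i);
        }
    }
    return -1;
}

class Bus {
public:
    Bus(int width_bytes, Arbiter arbiter)
        : width_bytes_(width_bytes), arbiter_(std::move(arbiter)) {}

    // Delegates the grant decision to the plugged-in arbiter.
    int grant(const std::vector<bool>& requests) const {
        return arbiter_(requests);
    }

    // Cycles needed to move `bytes` over the bus, rounded up.
    int transfer_cycles(int bytes) const {
        return (bytes + width_bytes_ - 1) / width_bytes_;
    }

private:
    int width_bytes_;
    Arbiter arbiter_;
};
```

Swapping in a different policy (for example round-robin) would then only require passing another callable with the same signature.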
23Functional Units
- Retargetable parameters
  - Input/output data type
  - Instructions executed
  - Latency and initiation interval
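
The difference between latency (cycles until a result is available) and the interval between successive issues to the same unit (commonly called the initiation interval) can be sketched as below. `FunctionalUnit` and its members are hypothetical names in plain C++.

```cpp
// Hypothetical functional-unit timing model.
class FunctionalUnit {
public:
    FunctionalUnit(int latency, int initiation_interval)
        : latency_(latency), initiation_interval_(initiation_interval) {}

    // A new operation may issue once the interval since the last issue has elapsed.
    bool can_issue(long cycle) const { return cycle >= next_free_cycle_; }

    // Issues an operation at `cycle`; returns the cycle its result is available.
    long issue(long cycle) {
        next_free_cycle_ = cycle + initiation_interval_;
        return cycle + latency_;
    }

private:
    int latency_;
    int initiation_interval_;
    long next_free_cycle_ = 0;
};
```

With a latency of 4 and an interval of 2, a pipelined unit accepts a new operation every second cycle while each result still takes four cycles.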
24Fetch Unit
- Retargetable parameters
  - Number of instructions fetched per cycle
  - Latency
- Supports instruction prefetch. Prefetch requests
are not queued.
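
"Not queued" can be read as: a prefetch request arriving while another is in flight is dropped. The sketch below models that reading in plain C++; `FetchUnit`, `request_prefetch`, and `tick` are hypothetical names, and the drop-on-busy behaviour is an assumption about what the slide means.

```cpp
#include <cstdint>

// Hypothetical fetch unit with a single outstanding prefetch.
class FetchUnit {
public:
    FetchUnit(int instrs_per_cycle, int latency)
        : instrs_per_cycle_(instrs_per_cycle), latency_(latency) {}

    // Returns true if the request was accepted; a request made while
    // another is pending is dropped rather than queued.
    bool request_prefetch(std::uint32_t addr) {
        if (busy_) {
            return false;
        }
        busy_ = true;
        pending_addr_ = addr;
        remaining_ = latency_;
        return true;
    }

    // Advances one cycle; returns true when the pending prefetch completes.
    bool tick() {
        if (!busy_) {
            return false;
        }
        if (--remaining_ > 0) {
            return false;
        }
        busy_ = false;
        return true;
    }

    int instrs_per_cycle() const { return instrs_per_cycle_; }
    std::uint32_t pending_addr() const { return pending_addr_; }

private:
    int instrs_per_cycle_;
    int latency_;
    bool busy_ = false;
    std::uint32_t pending_addr_ = 0;
    int remaining_ = 0;
};
```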
25Decode Unit
- Retargetable parameters
  - Cross-path delay
  - Latency
- Uses a pluggable function for instruction decode.
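
A pluggable decode function can be modelled as a callable handed to the decode unit. In this plain C++ sketch, `DecodedOp`, `DecodeFn`, `toy_decode`, and `DecodeUnit` are hypothetical, the 8-bit field layout is an invented toy encoding (not any real ISA), and adding the cross-path delay to the base latency is an assumption.

```cpp
#include <cstdint>
#include <functional>
#include <utility>

struct DecodedOp {
    std::uint8_t opcode;
    std::uint8_t dst;
    std::uint8_t src;
};

// Signature every plugged-in decoder must satisfy.
using DecodeFn = std::function<DecodedOp(std::uint32_t)>;

// Toy encoding: bits 31..24 opcode, 23..16 dst, 15..8 src (hypothetical).
DecodedOp toy_decode(std::uint32_t raw) {
    return DecodedOp{static_cast<std::uint8_t>(raw >> 24),
                     static_cast<std::uint8_t>((raw >> 16) & 0xFFu),
                     static_cast<std::uint8_t>((raw >> 8) & 0xFFu)};
}

class DecodeUnit {
public:
    DecodeUnit(int latency, int cross_path_delay, DecodeFn fn)
        : latency_(latency),
          cross_path_delay_(cross_path_delay),
          fn_(std::move(fn)) {}

    DecodedOp decode(std::uint32_t raw) const { return fn_(raw); }

    // Cycles spent in decode; cross-path operands add the configured delay.
    int cycles(bool uses_cross_path) const {
        return latency_ + (uses_cross_path ? cross_path_delay_ : 0);
    }

private:
    int latency_;
    int cross_path_delay_;
    DecodeFn fn_;
};
```

Retargeting to a different instruction encoding then means supplying a different `DecodeFn` without touching `DecodeUnit`.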
26Interconnects
- Two types
  - One-to-one: connect directly using a signal.
  - Many-to-one: instantiate a multiplexer.
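
The two interconnect types can be illustrated in plain C++ as follows. In SystemC a one-to-one link would typically be a signal bound to two ports; here a `Signal` struct stands in for it. `Signal` and `Multiplexer` are hypothetical names for this sketch.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// One-to-one: source and destination share a single signal.
struct Signal {
    std::uint32_t value = 0;
};

// Many-to-one: several source signals feed one destination through a mux.
class Multiplexer {
public:
    explicit Multiplexer(std::vector<const Signal*> inputs)
        : inputs_(std::move(inputs)) {}

    // Chooses which input drives the output.
    void select(std::size_t index) { selected_ = index; }

    // Value seen by the destination; throws std::out_of_range on a bad select.
    std::uint32_t output() const { return inputs_.at(selected_)->value; }

private:
    std::vector<const Signal*> inputs_;
    std::size_t selected_ = 0;
};
```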
27Example of using HMDES architecture description
// beh is 0 read, 1 write, 2 read/write
CREATE SECTION Port
  REQUIRED beh(INT)
  REQUIRED type(LINK(DataType))

CREATE SECTION DirectConnect
  REQUIRED end1(LINK(Port))
  REQUIRED end2(LINK(Port))
  REQUIRED type(INT)
28Example of using HMDES architecture description (cont)
SECTION Port
  PortA( beh(write_port) type(Type1) )
  PortB( beh(read_port) type(Type1) )
SECTION DirectConnect
  IC_1( input(PortA) output(PortB) type(Type1) )
[Figure: Unit 1 and Unit 2 joined by a direct connection between Port A and Port B.]
29A simple instance of simulated architecture
32REFERENCES
- Introduction to VLIW by Philips
- Architectural Design and Analysis of a VLIW
Processor (Arthur Abnous and Nader Bagherzadeh)
- TriMedia Technologies
- Anup's Research Plan
- Dinero IV Cache Simulator manual
- TI - C62x data book
- TI - C64x data book
- TI - C6x instruction set
- HMDES 2.0 Specification
33Thanks