Cycle Accurate Parameterized Simulator for Clustered VLIW ASIP in SystemC

1
Cycle Accurate Parameterized Simulator for
Clustered VLIW ASIP in SystemC
  • Vipul Jain

March 31, 2003
2
Motivation
  • VLIW ASIPs include an increasing number of
    functional units to meet the high throughput
    requirements of applications exhibiting high ILP.
  • The degree of parallelism is limited by the register
    file's ability to supply operands to the functional
    units.
  • Clustered VLIW overcomes this problem by
    clustering the functional units such that each
    cluster can read from and write to only a subset
    of the registers.
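
The clustering restriction can be sketched in a few lines of C++. This is only an illustration: the even split of registers across clusters and the names `ClusteredRegisterMap`, `home_cluster` and `is_local` are assumptions made here, not part of the presented simulator.

```cpp
#include <cassert>

// Hypothetical model: registers are split evenly across clusters, so
// register r lives in cluster r / regs_per_cluster.
struct ClusteredRegisterMap {
    int num_clusters;
    int regs_per_cluster;

    // Cluster whose register file holds `reg`.
    int home_cluster(int reg) const {
        assert(reg >= 0 && reg < num_clusters * regs_per_cluster);
        return reg / regs_per_cluster;
    }

    // A functional unit in `cluster` can access `reg` directly only if the
    // register lives in that cluster's file; otherwise a cross path is needed.
    bool is_local(int cluster, int reg) const {
        return home_cluster(reg) == cluster;
    }
};
```

With two clusters of 32 registers each, register 5 is local to cluster 0, while register 40 is reachable from cluster 0 only through a cross path.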

3
Need for a retargetable simulator
  • Role of architecture customization
      • Higher performance
      • Smaller area
      • Lower power
  • A tool is needed to verify the performance metrics
    of a given application on a given architecture.

4
Objective
  • To define a retargetable clustered VLIW
    architecture and implement a cycle-accurate
    simulator for it in SystemC.
  • To validate the simulator by modeling
      • Texas Instruments TI-C6400 DSP
      • Trimedia TM-1000 DSP

5
Clustered VLIW Architecture
6
Customization Opportunities
  • System Level
  • Processor Level

7
Customization Opportunities -> System Level
  • Compute units
      • Number and types: ASICs, VLIW or RISC ASIPs, etc.
  • Interconnection network
      • Custom interconnection based on the communication
        pattern of the application
  • Memory architecture
      • Types: synchronous/asynchronous,
        pipelined/non-pipelined, etc.
      • Various transfer modes
      • Custom memories: FIFOs, LIFOs, frame buffers, etc.

8
Customization Opportunities -> System Level
9
Customization Opportunities -> Processor Level
  • Functional units
      • MISO, MIMO, MIMO with LD/ST
      • Rigid or flexible I/O timeshapes
  • Register file clustering
      • If many FUs are connected to the same register file,
        the delay and cost of the register file become the
        bottleneck.
      • Each FU can read from and write to only a subset
        of registers
  • Interconnects
      • Between different clusters and between clusters
        and memory
  • Instruction encoding and decoding
      • Reduce or remove explicit NOPs in code
      • Affects code size, object code compatibility,
        branch misprediction penalty, hardware cost,
        address specification in code size

10
Customization Opportunities -> System Level
[Figure: register files (RF1, RF2) and functional units (FU1, FU2, FU3, FU4, AFU1, AFU2) with an interconnect. Labeled customization points: number and type of register files, interconnect, number and type of FUs.]
11
Simulator Description
  • The simulator is being written in SystemC.
  • The simulator runs in two phases.
      • In the setup phase, all the functional units,
        register files, memories and the interconnection
        network are initialized by reading the HMDES
        description of the architecture.
      • In the execution phase, the simulator executes the
        given program, and performance statistics can be
        extracted, such as the number of cycles taken, memory
        bandwidth utilized and number of cache misses.
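
The two-phase structure can be sketched in plain C++ as follows. This is a minimal illustration under stated assumptions: `ArchConfig` stands in for a parsed HMDES description, its fields are hypothetical, and the execution loop models neither stalls nor dependencies. It only shows setup happening once and statistics being gathered during execution.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a parsed architecture description.
struct ArchConfig {
    int num_fus;      // functional units available
    int issue_width;  // operations the decode unit can issue per cycle
};

struct Stats {
    std::uint64_t cycles = 0;
    std::uint64_t ops_executed = 0;
};

class Simulator {
public:
    // Setup phase: derive internal parameters from the description.
    explicit Simulator(const ArchConfig& cfg)
        : ops_per_cycle_(static_cast<std::uint64_t>(
              std::min(cfg.num_fus, cfg.issue_width))) {
        assert(cfg.num_fus > 0 && cfg.issue_width > 0);
    }

    // Execution phase: issue operations cycle by cycle and count cycles.
    Stats run(std::uint64_t num_ops) const {
        Stats s;
        while (s.ops_executed < num_ops) {
            s.ops_executed += std::min(ops_per_cycle_, num_ops - s.ops_executed);
            ++s.cycles;
        }
        return s;
    }

private:
    std::uint64_t ops_per_cycle_;
};
```

For a configuration with four functional units and an issue width of four, running ten independent operations takes three cycles in this simplified model.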

12
Simulator Description (cont.)
  • Predicated instruction execution
  • Compiler-visible interconnection network
  • Multiple FUs may be connected to the decode unit
    and register files through the same set of ports
  • A pipeline stall occurs on a cache miss or a cross-path
    register access.
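
Predicated execution can be illustrated with a short C++ sketch. The register counts, the `PredicatedAdd` operation and the `execute` helper are assumptions made for this example and are not taken from the simulator's code.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical architectural state: 16 data registers, 4 predicate registers.
struct MachineState {
    std::array<std::int32_t, 16> regs{};
    std::array<bool, 4> preds{};
};

// An add guarded by predicate register `pred`.
struct PredicatedAdd {
    std::size_t pred, dst, src1, src2;
};

// Commits the result only if the guard predicate is true.
// Returns whether the operation took effect.
inline bool execute(const PredicatedAdd& op, MachineState& st) {
    assert(op.pred < st.preds.size());
    if (!st.preds[op.pred]) return false;  // nullified: behaves like a NOP
    st.regs.at(op.dst) = st.regs.at(op.src1) + st.regs.at(op.src2);
    return true;
}
```

An operation whose predicate is false still occupies its issue slot but leaves the destination register unchanged.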

13
Simulator Description (cont.)
  • Simulator
      • Uses HMDES to describe the architecture
      • Uses the Dinero IV cache simulator to simulate the
        cache hierarchy
      • Actually runs the input program
      • Generates statistics and an execution trace
  • Dinero IV
      • The cache hierarchy is modelled using the Dinero IV
        cache simulator
      • Developed by Mark D. Hill, Univ. of Wisconsin
        Computer Sciences
      • Provides a subroutine interface for a flexible
        simulator of multilevel cache memories.
      • Not a timing or functional simulator, so these
        details are handled by the memory module.

14
Retargetability
  • Parameterized parts
      • Memory hierarchy
      • Number and types of register files
      • Interconnects between register files and FUs
      • Organization of FUs in various clusters
  • Changes that require code modification
      • Adding custom FUs
      • Changing the number of pipeline stages

15
The Typical VLIW Pipeline
[Figure: pipeline diagram. Labels: Instruction Fetch, Instruction Decode, Align, Decode, Decompress, DF/AG, Execute, Store Results]
16
Pipeline stages in simulator
[Figure: pipeline diagram. Labels: Instruction Fetch, Instruction Decode, Align, Decode, Decompress, DF/AG, Execute, Store Results]
17
Class Hierarchy
18
Overview of Simulator
19
Register Files
  • Retargetable parameters
      • Type of registers in the register file
      • Latency (in cycles)
      • Number of read ports for
          • Normal registers
          • Predicate registers
      • Number of write ports
  • Can have either
      • Separate predicate registers, or
      • The least significant bit of normal registers used
        as the predicate value
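
A register file with a limited number of ports per cycle might look like the sketch below. The class name, the per-cycle port accounting and the choice to read the predicate from the least significant bit without consuming a port are assumptions made for illustration. Latency is not modeled.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

class RegisterFile {
public:
    RegisterFile(int num_regs, int read_ports, int write_ports)
        : regs_(static_cast<std::size_t>(num_regs), 0u),
          read_ports_(read_ports),
          write_ports_(write_ports) {
        assert(num_regs > 0 && read_ports > 0 && write_ports > 0);
    }

    // Call at the start of every simulated cycle to free all ports.
    void begin_cycle() { reads_used_ = 0; writes_used_ = 0; }

    // Returns false if every read port is already in use this cycle.
    bool read(std::size_t r, std::uint32_t& out) {
        if (reads_used_ >= read_ports_) return false;
        ++reads_used_;
        out = regs_.at(r);
        return true;
    }

    // Returns false if every write port is already in use this cycle.
    bool write(std::size_t r, std::uint32_t value) {
        if (writes_used_ >= write_ports_) return false;
        ++writes_used_;
        regs_.at(r) = value;
        return true;
    }

    // Predicate taken from the least significant bit of a normal register.
    bool predicate(std::size_t r) const { return (regs_.at(r) & 1u) != 0; }

private:
    std::vector<std::uint32_t> regs_;
    int read_ports_;
    int write_ports_;
    int reads_used_ = 0;
    int writes_used_ = 0;
};
```

A caller that receives `false` from `read` or `write` would typically retry in a later cycle, which is one way a port conflict could turn into a stall.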

20
L1 Cache (Data and Instruction)
  • Uses the Dinero IV cache simulator
  • Retargetable parameters
      • Size
          • Total size
          • Line size
      • Associativity
      • Read/write policies
      • Latency
      • Bus width to memory/functional units
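
To show how the size parameters relate to each other, the sketch below derives the number of sets, the set index and the tag from total size, line size and associativity. It is independent of Dinero IV and does not use its interface. The struct and member names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical container for the cache size parameters listed above.
struct CacheGeometry {
    std::uint32_t total_size;     // bytes
    std::uint32_t line_size;      // bytes per cache line
    std::uint32_t associativity;  // lines per set

    std::uint32_t num_sets() const {
        assert(line_size > 0 && associativity > 0);
        return total_size / (line_size * associativity);
    }

    // Set selected by an address: line number modulo the number of sets.
    std::uint32_t set_index(std::uint32_t addr) const {
        return (addr / line_size) % num_sets();
    }

    // Remaining high-order part of the line number.
    std::uint32_t tag(std::uint32_t addr) const {
        return (addr / line_size) / num_sets();
    }
};
```

For a 16 KB, 2-way cache with 64-byte lines there are 128 sets, and address 0x12345 maps to set 13 with tag 9.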

21
Main Memory
  • Retargetable parameters
      • Latency
      • Delay
      • Size of burst mode for memory access
      • Start address of data (so that multiple memory
        modules may be added).

22
Bus between Cache/Memory
  • Retargetable parameters
      • Bus width
      • Pluggable arbiter
          • Uses static priorities for now.
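
One way to make the arbitration policy replaceable is to pass it to the bus as a function object. The sketch below assumes that a lower requester index means higher priority. The `Bus` class, the `Arbiter` alias and the function name are invented for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// An arbiter maps the current request lines to the index of the granted
// requester, or -1 when nobody is requesting.
using Arbiter = std::function<int(const std::vector<bool>&)>;

// Static priority: the lowest-numbered active requester wins (assumption).
inline int arbitrate_static_priority(const std::vector<bool>& requests) {
    for (std::size_t i = 0; i < requests.size(); ++i) {
        if (requests[i]) return static_cast<int>(i);
    }
    return -1;
}

class Bus {
public:
    Bus(int width_bytes, Arbiter arbiter)
        : width_bytes_(width_bytes), arbiter_(std::move(arbiter)) {
        assert(width_bytes > 0);
    }

    int width_bytes() const { return width_bytes_; }

    // Delegates the grant decision to the plugged-in policy.
    int grant(const std::vector<bool>& requests) const {
        return arbiter_(requests);
    }

private:
    int width_bytes_;
    Arbiter arbiter_;
};
```

Swapping in a different policy, such as round-robin, would only require passing another callable with the same signature.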

23
Functional Units
  • Retargetable parameters
      • Input/output data type
      • Instructions executed
      • Latency and initiation interval
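
The difference between latency and the interval between successive issues (commonly called the initiation interval) can be shown with a small sketch. The class and its interface are hypothetical. A real functional unit model would also carry operands and results.

```cpp
#include <cassert>

class FunctionalUnit {
public:
    FunctionalUnit(int latency, int initiation_interval)
        : latency_(latency), interval_(initiation_interval) {
        assert(latency > 0 && initiation_interval > 0);
    }

    // Tries to issue an operation at `cycle`. Returns the cycle in which the
    // result becomes available, or -1 if the unit cannot accept a new
    // operation yet.
    long long issue(long long cycle) {
        if (cycle < next_issue_cycle_) return -1;
        next_issue_cycle_ = cycle + interval_;
        return cycle + latency_;
    }

private:
    int latency_;
    int interval_;
    long long next_issue_cycle_ = 0;
};
```

A multiplier with latency 4 and an interval of 2 accepts a new operation every second cycle, so operations issued at cycles 0 and 2 complete at cycles 4 and 6, and an attempt at cycle 1 is rejected.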

24
Fetch Unit
  • Retargetable parameters
      • Number of instructions fetched per cycle
      • Latency
  • Supports instruction prefetch. Prefetch requests
    are not queued.

25
Decode Unit
  • Retargetable parameters
      • Cross-path delay
      • Latency
  • Uses a pluggable function for instruction decode.

26
Interconnects
  • Two types
      • One-to-one: connect directly using a signal.
      • Many-to-one: instantiate a multiplexer.
27
Example of using HMDES architecture description
  // beh is 0 read, 1 write, 2 read/write
  CREATE SECTION Port
    REQUIRED beh(INT)
    REQUIRED type(LINK(DataType))
  CREATE SECTION DirectConnect
    REQUIRED end1(LINK(Port))
    REQUIRED end2(LINK(Port))
    REQUIRED type(INT)

28
Example of using HMDES architecture description (cont.)
  SECTION Port
    PortA( beh(write_port) type(Type1) )
    PortB( beh(read_port) type(Type1) )
  SECTION DirectConnect
    IC_1( input(PortA) output(PortB) type(Type1) )
[Figure: Unit 1 and Unit 2, with ports labeled Port A and Port B]
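
After parsing, sections like these could be held in simple C++ records. The sketch below is an assumption about how that might look. The struct names mirror the section names, the behavior encoding follows the comment on the slide (0 read, 1 write, 2 read/write), and the validity rule in `is_valid` is invented here for illustration.

```cpp
#include <cassert>
#include <string>

// Encoding taken from the slide comment: 0 read, 1 write, 2 read/write.
enum class PortBehavior { Read = 0, Write = 1, ReadWrite = 2 };

struct Port {
    std::string name;
    PortBehavior beh;
    std::string type;  // name of the linked data type
};

struct DirectConnect {
    const Port* end1;
    const Port* end2;
};

inline bool can_drive(PortBehavior b) { return b != PortBehavior::Read; }
inline bool can_sample(PortBehavior b) { return b != PortBehavior::Write; }

// Hypothetical check: both ends carry the same data type, and one end can
// write while the other can read.
inline bool is_valid(const DirectConnect& c) {
    assert(c.end1 != nullptr && c.end2 != nullptr);
    if (c.end1->type != c.end2->type) return false;
    return (can_drive(c.end1->beh) && can_sample(c.end2->beh)) ||
           (can_drive(c.end2->beh) && can_sample(c.end1->beh));
}
```

Under this rule, connecting a write port to a read port of the same type is accepted, while joining two read-only ports or ports of different types is rejected.
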
29
A simple instance of simulated architecture
32
REFERENCES
  •  Introduction to VLIW by Philips
  • Architectural Design and Analysis of a VLIW
    Processor (Arthur Abnous and Nader Bagherzadeh)
  •  TriMedia Technologies
  • Anup's Research Plan
  • Dinero IV Cache Simulator manual
  • TI - C62x data book
  • TI - C64x data book
  • TI - C6x instruction set
  • HMDES 2.0 Specification

33
Thanks