A Comparison of the VIRAM-1 and Embedded VLIW architectures for use on SVD

1
A Comparison of the VIRAM-1 and Embedded VLIW
architectures for use on SVD
  • CS 252
  • Spring 2000
  • Jeff Herman
  • John Loo
  • Xiaoyi Tang

2
Motivation
  • SVD Applications
      • Smart antennas
      • Image processing
      • Medical imaging
  • VLIW
      • Trend in high-performance embedded computing
  • Vector
      • Out of favor
      • Flynn bottleneck is a limiting factor in
        parallelism
      • Known for linear algebra performance

3
C67 Architecture (mapped)
  • Instruction RAM (cache optional)
  • Decode Logic (8-way)
  • A Register File
  • B Register File
  • Functional units: L1, S1, M1, D1, D2, M2, S2, L2
  • Data RAM (>4 banks)

4
C67 Architecture
  • Split Register Files
      • 16 registers per register file
      • One cross path per register file
  • Instruction Latencies
      • Branches - 6 cycles
      • Load - 5 cycles
      • FP add/multiply - 4 cycles

5
TM 1100 VLIW Processor Core Architecture
  • 5-issue VLIW
  • 2 FP adders/multipliers
  • 2 Load/Store Units
  • 128 general-purpose 32-bit registers
  • 16 KB data cache, 32 KB instruction cache
  • Instruction Latencies
      • 3 cycles for Branches, Load, FP add/multiply
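
The FP add/multiply latencies quoted for the two VLIW cores (4 cycles on the C67, 3 cycles on the TM 1100) matter most in dependent chains such as the accumulations inside SVD kernels. The sketch below is a deliberately simplified, hypothetical cycle model, not a measurement from the presentation: it assumes one fully pipelined FP unit that issues one add per cycle, and the function name `reduction_cycles` is made up for illustration. It shows why a compiler or assembly programmer would split a sum over several independent accumulators.

```python
import math

# FP add/multiply latencies in cycles, as quoted on the slides.
FP_LATENCY = {"C67": 4, "TM 1100": 3}


def reduction_cycles(n_adds: int, latency: int, accumulators: int) -> int:
    """Estimate cycles for n_adds additions spread over independent accumulators.

    Hypothetical model: a single fully pipelined FP unit issues one add per
    cycle, and an add can issue only after the previous add into the same
    accumulator has completed. The final adds that combine the partial sums
    are ignored.
    """
    rounds = math.ceil(n_adds / accumulators)
    # Each round issues one add per accumulator and must also cover the latency.
    return rounds * max(accumulators, latency)


for name, latency in FP_LATENCY.items():
    serial = reduction_cycles(1200, latency, accumulators=1)
    unrolled = reduction_cycles(1200, latency, accumulators=latency)
    print(f"{name}: {serial} cycles with one accumulator, {unrolled} with {latency}")
```

Under these assumptions a single accumulator pays the full latency on every add, while using as many accumulators as the latency keeps the unit busy every cycle. Real schedules also depend on load latency and register pressure, which this model leaves out.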

6
VIRAM-1 Microarchitecture
  • 2-way-issue superscalar MIPS IV core
  • Asynchronous vector unit
  • Communication to scalar core through queue
  • 32 general purpose vector and flag registers
  • 32 scalar and control registers
  • 2 VAFU, 2 FFU, 1 VMFU
  • 4-lane standard configuration

7
VIRAM-1 Microarchitecture

8
Testing Conditions
  • SVD routine from CLAPACK
  • Random test matrices with a rank of 10
  • Matrix dimension ratio of 10
  • Sizes range from 100x10 to 300x30
  • Suboptimal parameters used
  • Trends should still hold
  • Assumed 200 MHz clock rate
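
The test setup above can be approximated on a workstation as follows. This is only a sketch under stated assumptions: the presentation used the SVD routine from CLAPACK, whereas this example swaps in NumPy's `numpy.linalg.svd`; the 100-row step between sizes, the Gaussian entries, and the helper names `rank_k_matrix` and `matrix_sizes` are all assumptions made for illustration, since the slides give only the end points (100x10 and 300x30), the dimension ratio, and the rank.

```python
import numpy as np


def rank_k_matrix(rows: int, cols: int, rank: int, rng: np.random.Generator) -> np.ndarray:
    """Return a random rows x cols matrix whose rank is (almost surely) `rank`."""
    left = rng.standard_normal((rows, rank))
    right = rng.standard_normal((rank, cols))
    return left @ right  # product of two thin factors has rank <= `rank`


def matrix_sizes(first_rows: int = 100, last_rows: int = 300, step: int = 100, ratio: int = 10):
    """Yield (rows, cols) with rows / cols == ratio. The step is an assumption."""
    for rows in range(first_rows, last_rows + 1, step):
        yield rows, rows // ratio


rng = np.random.default_rng(seed=0)
for rows, cols in matrix_sizes():
    a = rank_k_matrix(rows, cols, rank=10, rng=rng)
    # Singular values only; NumPy stands in here for the CLAPACK routine.
    singular_values = np.linalg.svd(a, compute_uv=False)
    print(a.shape, np.linalg.matrix_rank(a), singular_values[:3])
```

For the 100x10 case a rank of 10 simply means full column rank; for the larger sizes all singular values past the tenth are zero up to rounding error.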

9
(No Transcript)

10
Ideal C67 and TM 1100 Performance Gap
  • Same memory bottlenecks in both processors
  • Programming model
      • C67: assembly-coded kernels, 1700 lines
      • TM 1100: only C-level optimizations

11

12
(No Transcript)

13
VIRAM Performance Summary
  • Gains from vector unit limited by Amdahl's law.
  • Vector instructions comprise only 15% of total
    code.
  • Not much else of SVD can be vectorized.
  • Gains limited by what cannot be vectorized.
  • Perhaps streamline LAPACK or hand-code assembly?
  • Sub-linear scalability.
  • Scaling IRAM is cheap but gains diminish.
  • Efficiency and scalability increase with size of
    data set.
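
The Amdahl's law argument on this slide can be made concrete with a small calculation. It is an illustration only: it reads the slide's figure as 15% and treats it as the fraction of run time that is vectorized, which is an assumption (the slide speaks of a share of the code, not of time), and it assumes the vectorized part speeds up linearly with the number of lanes. The function name `amdahl_speedup` is made up for this sketch.

```python
def amdahl_speedup(vector_fraction: float, vector_speedup: float) -> float:
    """Overall speedup when only `vector_fraction` of the run time is accelerated."""
    return 1.0 / ((1.0 - vector_fraction) + vector_fraction / vector_speedup)


VECTOR_FRACTION = 0.15  # assumption: the slide's figure read as 15% of run time

for lanes in (1, 2, 4, 8):
    # Assumption: the vectorized part speeds up linearly with the lane count.
    print(f"{lanes} lanes: {amdahl_speedup(VECTOR_FRACTION, lanes):.3f}x")

# Upper bound with an infinitely fast vector unit.
print(f"limit: {1.0 / (1.0 - VECTOR_FRACTION):.3f}x")
```

Under these assumptions each doubling of lanes buys less than the previous one, and the overall speedup can never exceed 1 / 0.85, roughly 1.18x. That matches the slide's points about sub-linear scalability and gains being limited by what cannot be vectorized.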

14
Concluding Remarks
  • The two architectures have different limitations
      • VIRAM: scalar core
      • VLIW: memory bandwidth
  • VLIW cannot match the performance of VIRAM when
    computing SVD.
  • VLIW with vector coprocessor?