IMAGINE: Signal and Image Processing Using Streams

1
IMAGINE: Signal and Image Processing Using Streams
Brucek Khailany
  • William J. Dally, Scott Rixner, Ujval J. Kapasi,
    Peter Mattson,
  • Jinyung Namkoong, John D. Owens, Brian Towles
  • Concurrent VLSI Architecture Group
  • Computer Systems Laboratory
  • Stanford University

http://cva.stanford.edu/imagine
2
Imagine: A Programmable Signal and Image Processor
  • Motivation
  • Applications poorly matched to conventional
    architectures
  • Key stream architecture features
  • High computational bandwidth (Imagine: 48 on-chip
    ALUs)
  • Stream register organization
  • Data bandwidth hierarchy
  • Performance density of a special purpose
    processor
  • 0.59 cm² CMOS chip, 0.13 µm standard cell, 500
    MHz
  • 20 GFLOPS peak performance (40 GOPS fixed point)
  • 10 GFLOPS sustained on several apps
  • > 2 GFLOPS/W, > 5 GOPS/W

3
Representative Applications
  • Stereo Depth Extraction
  • Polygon Rendering
  • MPEG Encoding/Decoding

[Figure labels: 2D Video Stream, Encoded 2D Data (101100 010110 001001)]
4
Stream Processing
  • Little data reuse (pixels never revisited)
  • Highly data parallel (output pixels not dependent
    on other output pixels)
  • Compute intensive (60 arithmetic operations per
    memory reference)
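
The "60 arithmetic operations per memory reference" figure can be turned into a rough memory-bandwidth requirement. The sketch below is only a back-of-the-envelope estimate: it assumes each memory reference moves one 32-bit (4-byte) word, which the slides do not state, and it uses the 20 GFLOPS peak rate quoted on the earlier slide. The function name is hypothetical.

```python
def required_memory_bandwidth(ops_per_second: float,
                              ops_per_mem_ref: float,
                              bytes_per_ref: int = 4) -> float:
    """Bytes/second of memory traffic needed to sustain a given operation rate.

    bytes_per_ref=4 is an assumption (one 32-bit word per memory reference).
    """
    refs_per_second = ops_per_second / ops_per_mem_ref
    return refs_per_second * bytes_per_ref


peak_flops = 20e9  # 20 GFLOPS peak, from the slides
bw = required_memory_bandwidth(peak_flops, 60)
print(f"{bw / 1e9:.2f} GB/s")  # 1.33 GB/s
```

Under these assumptions the estimate comes out below the 2 GB/s lowest-level peak bandwidth quoted later in the deck.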

5
Characteristics of Media Applications
  • Poorly matched to conventional architectures
  • Caches
  • Instruction-Level Parallelism
  • Few arithmetic units
  • Well-matched to modern VLSI technology
  • Lots (100s - 1000s) of ALUs fit on a single
    chip
  • Communication bandwidth is the scarce resource

6
Communication Bandwidth: Care and Feeding of ALUs
  • Special-Purpose Processors: ALUs fed by dedicated
    wires/memories
  • General-Purpose Processors: Feeding structure
    dwarfs ALUs
[Figure labels: Instr. Cache, IP, IR, Regs]
7
Stream Architecture Provides Data Bandwidth Hierarchy
[Figure: four SDRAMs, the Stream Register File, and eight ALU Clusters under SIMD/VLIW Control. Peak BW: 2 GB/s, 32 GB/s, 544 GB/s]
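
To make the hierarchy concrete, the three peak-bandwidth figures can be compared directly. This sketch assumes the figures are listed from the outermost level to the innermost, following the order of the components in the diagram; the variable names are illustrative.

```python
# Peak bandwidths in GB/s, in the order they appear on the slide.
# Assumption: ordered from outermost (memory) to innermost (ALU clusters).
peak_bw_gb_s = [2, 32, 544]

# Bandwidth of each level relative to the first level.
ratios = [b / peak_bw_gb_s[0] for b in peak_bw_gb_s]

# Step-up factor between adjacent levels.
steps = [hi / lo for lo, hi in zip(peak_bw_gb_s, peak_bw_gb_s[1:])]

print(ratios)  # [1.0, 16.0, 272.0]
print(steps)   # [16.0, 17.0]
```

Each level supplies roughly 16 to 17 times the bandwidth of the one before it, for an overall factor of 272.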
8
Application Data Bandwidth Usage
[Figure: SDRAMs, Stream Register File, and ALU Clusters, with levels labeled 2 GB/s, 32 GB/s, 544 GB/s]
9
Stream Register File: Details
  • SRF: Single-ported 128 KB SRAM (1024 x 32W)
[Figure labels: Arbiter, Stream buffers, To/From Arithmetic Clusters, 32W/cycle]
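
The SRF dimensions can be checked with a little arithmetic. The sketch assumes a word is 32 bits (4 bytes), which matches the 32-bit operations mentioned on the arithmetic cluster slide but is not stated here; the constant names are illustrative.

```python
ROWS = 1024
WORDS_PER_ROW = 32        # "1024 x 32W"
BYTES_PER_WORD = 4        # assumption: 32-bit words
WORDS_PER_CYCLE = 32      # "32W/cycle"

total_words = ROWS * WORDS_PER_ROW
total_bytes = total_words * BYTES_PER_WORD

# Cycles needed to read or write every word once at the full port width.
cycles_to_sweep = total_words // WORDS_PER_CYCLE

print(total_words // 1024, "K words")  # 32 K words
print(total_bytes // 1024, "KB")       # 128 KB
print(cycles_to_sweep, "cycles")       # 1024 cycles
```

The 32 K-word result agrees with the "32kW SRAM" label on the later block diagram slide.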
10
Arithmetic Cluster: Details
[Figure labels: Intercluster Network, Local Register File, /, CU, Cross Point, To SRF, From SRF]
  • Units support floating-point / 32-bit / dual
    16-bit / quad 8-bit instructions
  • 4-cycle pipelined FMUL, FADD, FSUB, FTOI, ITOF, FFRAC
  • 17-cycle FDIV (pipelined for 1 FDIV every 7
    cycles)
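
The "dual 16-bit / quad 8-bit" modes treat one 32-bit word as several independent lanes. The following is a minimal software sketch of a lane-wise add, not a description of the Imagine hardware. It assumes wrap-around (modular) arithmetic in each lane, since the slides do not say whether results wrap or saturate, and `packed_add` is a hypothetical name.

```python
def packed_add(a: int, b: int, lane_bits: int) -> int:
    """Add two 32-bit words lane by lane, wrapping within each lane.

    Assumption: modular (wrap-around) arithmetic per lane.
    """
    assert 32 % lane_bits == 0
    mask = (1 << lane_bits) - 1
    result = 0
    for shift in range(0, 32, lane_bits):
        lane_sum = ((a >> shift) & mask) + ((b >> shift) & mask)
        # Mask the sum so a carry never spills into the neighboring lane.
        result |= (lane_sum & mask) << shift
    return result


# Same operands, different lane widths.
print(hex(packed_add(0x00FF00FF, 0x00010001, 16)))  # 0x1000100
print(hex(packed_add(0x00FF00FF, 0x00010001, 8)))   # 0x0
```

With 16-bit lanes each half computes 0x00FF + 0x0001 = 0x0100. With 8-bit lanes the 0xFF + 0x01 lanes wrap to zero and no carry reaches the adjacent lane.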

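The latency and issue-rate figures above determine how long a run of back-to-back operations takes. This sketch uses the standard pipeline timing formula. The FDIV numbers (17-cycle latency, one issue every 7 cycles) come from the slide. The one-per-cycle issue rate used for the 4-cycle units is an assumption, as the slide only says "pipelined".

```python
def pipelined_cycles(n_ops: int, latency: int, issue_interval: int) -> int:
    """Cycles until the last of n independent back-to-back operations completes."""
    if n_ops <= 0:
        return 0
    # The last operation issues after (n_ops - 1) intervals, then needs `latency` cycles.
    return (n_ops - 1) * issue_interval + latency


print(pipelined_cycles(1, 17, 7))    # 17  (a single FDIV)
print(pipelined_cycles(10, 17, 7))   # 80  (ten FDIVs)
print(pipelined_cycles(10, 4, 1))    # 13  (ten FMULs, assuming one issue per cycle)
```
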
11
Imagine Programming Environment
  • StereoDepthExtraction()
  • // Load Input Images
  • ...
  • // Run Kernels
  • convolve7x7 (RawImage,ConvImage)
  • convolve3x3 (ConvImage,Conv2Image)
  • ...
  • // Store Output
  • Convolve7x7()
  • ...
  • while(!In.empty())
  • ...
  • p0 = k0 * in10
  • p12 = k21 * in32
  • p34 = k43 * in54
  • p56 = k65 * in76
  • sum = (p0 + p12) + (p34 + p56)
  • ...
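
The slide shows two levels: a stream-level program that chains kernels, and a kernel that loops while its input stream is non-empty. The sketch below imitates that structure with Python generators. It is a deliberately simplified 1-D stand-in for the 2-D convolutions named on the slide, not the Imagine toolchain. The function names and the all-ones taps are illustrative assumptions, and the taps are applied to the window without flipping.

```python
from collections import deque
from typing import Iterable, Iterator, List, Sequence


def convolve(in_stream: Iterable[float], taps: Sequence[float]) -> Iterator[float]:
    """Kernel: consume one element at a time, emit one weighted sum per full window."""
    window: deque = deque(maxlen=len(taps))
    for x in in_stream:                      # plays the role of `while(!In.empty())`
        window.append(x)
        if len(window) == len(taps):
            yield sum(k * v for k, v in zip(taps, window))


def stream_program(raw: Iterable[float]) -> List[float]:
    """Stream-level program: chain two kernels, passing streams between them."""
    conv = convolve(raw, [1] * 7)            # stands in for convolve7x7
    conv2 = convolve(conv, [1] * 3)          # stands in for convolve3x3
    return list(conv2)                       # "store output"


print(stream_program(range(10)))  # [84, 105]
```

The intermediate stream `conv` is never materialized as a whole: each element is produced by the first kernel and consumed by the second, which is the producer-consumer pattern the stream model relies on.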

12
Performance
[Figure: performance chart. Groups: floating-point application, 16-bit applications, 16-bit kernels, floating-point kernel]
13
Sustained Application Performance
  • Stereo Depth Extraction
  • 320x240 8-bit grayscale
  • 200 frames/second
  • Polygon Rendering
  • 4.5 Million Vertices/sec
  • 5.1 Million Pixels/sec
  • MPEG Encoding
  • 720x486 24-bit color
  • 120 frames/second

Polygon rendering measured on the SPECviewperf ADVS benchmark (unlit).
[Figure labels: 2D Video Stream, Encoded 2D Data (101100 010110 001001)]
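
The frame sizes and frame rates above translate into raw pixel and byte rates as follows. This is plain arithmetic on the quoted figures. It counts one image per frame and uncompressed input only, which is an assumption about how the rates were measured.

```python
def pixels_per_second(width: int, height: int, fps: int) -> int:
    """Raw pixel rate for a given frame size and frame rate."""
    return width * height * fps


stereo = pixels_per_second(320, 240, 200)   # 8-bit grayscale: 1 byte per pixel
mpeg = pixels_per_second(720, 486, 120)     # 24-bit color: 3 bytes per pixel

print(stereo)        # 15360000 pixels/s
print(mpeg)          # 41990400 pixels/s
print(stereo * 1)    # 15360000 bytes/s
print(mpeg * 3)      # 125971200 bytes/s
```
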
14
Power Estimates
[Figure: power efficiency chart. GOPS/W values: 4.6, 10.7, 4.1, 10.2, 9.6, 2.4, 6.9]
15
The Imagine Stream Processor
[Figure: Imagine Stream Processor block diagram. Four SDRAMs; Streaming Memory System; Network Interface; Host Processor; Stream Register File (32 kW SRAM); Microcontroller (2K VLIW instrs); ALU Clusters 0 to 7]
16
Imagine Floorplan
  • 22 million transistors
  • 500 MHz
  • TI GS30KA
  • 0.15 µm Ldrawn
  • 0.13 µm Leff
  • CMOS process

17
VLSI Implementation: 22M Transistors with 7 grad
students
  • Stream architecture reduces VLSI design
    complexity
  • Modularity / Replication
  • Long wire delays converted to explicit
    communications
  • Exposed to microarchitecture, software
  • Design methodology
  • Standard ASIC flow with forced placement of
    datapaths
  • Bitslice Verilog
  • Improved area, delay
  • Pre-placement wire length estimates
  • Reduce design iterations

18
Status
  • Imagine team accomplishments
  • Cycle-accurate simulator
  • Software tools
  • Completed synthesizable Verilog
  • Arithmetic units implemented in standard cells
  • Industrial partners
  • Texas Instruments Fab
  • Intel
  • Future work
  • Circuits/Logic expected completion 9/15/00
  • Tapeout expected Q4/2000

19
Summary
  • Key stream architecture features
  • Stream register organization
  • Data bandwidth hierarchy
  • Performance density of a special purpose
    processor
  • 10 GFLOPS sustained on several apps
  • > 2 GFLOPS/W, > 5 GOPS/W
  • VLSI Implementation
  • Validate architectural concepts
  • Develop experimental prototype