1
The Future of Vector Processors
  • M. Valero, R. Espasa and J. Corbal
  • UPC, Barcelona

Kyoto, May 28th, 1999
2
TOP-500 and Vector Processors
(Chart: number of vector systems in the TOP-500 over time. November 98:
Fujitsu 27, NEC 18, SGI 15, Hitachi 5.)
3
The Future of Vector ISAs
  • Cross-Pollination of Vector/Superscalar/VLIW
  • MMX, Embedded...
  • Very-high Performance Architectures
  • ILP techniques, IRAM, SDRAM
  • Vector Microprocessors
  • Numerical Accelerators
  • Multimedia Applications

4
Talk Outline
  • The Past
  • Initial Motivation for Vector ISA
  • Evolution of Vector Processors
  • The Present
  • Recent Announcements
  • The Decline of Vector Processors
  • Cross-Pollination of Vector/Superscalars/VLIW
  • The Future
  • Very-high Performance Architectures
  • Vector Microprocessors
  • Numerical Accelerators
  • Multimedia Applications
  • Conclusions

5
Characteristics of Numerical Applications
  • Examples: Weather prediction, mechanical
    engineering
  • Data structures: Huge matrices (dense, sparse)
  • Data types: 64-bit floating point
  • Highly repetitive loops
  • Compute-intensive
  • Data-Level Parallel

6
Initial Motivations for Vector Processors
Dependence Graph
      subroutine loop
      real*8 x(9992), y(9992), u(9984)
      integer I
      real*8 q
      do I = 1, 9984
        q    = u(I) * y(I)
        y(I) = x(I) + q
        x(I) = q - u(I) * x(I)
      enddo
      end

(Figure: dependence graph over y(I), u(I) and x(I), for I = 1 to 9984.)
7
Execution of scalar code
Loop: ld   R1,0(R10)
      ld   R2,0(R11)
      ld   R3,0(R12)
      mulf R4,R1,R2
      mulf R5,R2,R3
      addf R6,R4,R3
      subf R7,R4,R5
      st   0(R11),R6
      st   0(R12),R7
      add  R10,R10,8
      add  R11,R11,8
      add  R12,R12,8
      sub  R13,R13,1
      bne  Loop

14 cycles / Iteration
Perfect Memory !!!
8
Generation of Vector Code
A vector iteration is equivalent to 128 scalar iterations

      ld.w  9984,s2
      ld.w  0,a2
      ld.w  8,vs
      ...
Loop: mov   s2,vl        ; vl <- min(s2,128)
      ld.l  -y(a2),v0    ; v0 <- y(I..I+127)
      ld.l  -u(a2),v1    ; v1 <- u(I..I+127)
      mul.d v1,v0,v2     ; q(I..I+127) <- u(I..I+127) * y(I..I+127)
      ld.l  -x(a2),v3    ; v3 <- x(I..I+127)
      add.d v3,v2,v0     ; v0 <- x(I..I+127) + q(I..I+127)
      st.l  v0,-y(a2)    ; y(I..I+127) <- x(I..I+127) + q(...)
      mul.d v1,v3,v1     ; v1 <- u(I..I+127) * x(I..I+127)
      sub.d v2,v1,v0     ; v0 <- q(...) - u(...) * x(...)
      st.l  v0,-x(a2)    ; x(I..I+127) <- q(...) - u(...) * x(...)
      add.w 1024,a2      ; increment index (128 * 8)
      add.w -128,s2      ; 128 iterations less to process
      lt.w  0,s2
      jbrs.t Loop

(Figure: a vector register holding elements 0, 1, 2, ..., 127.)

DLP !!!
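The strip mining in the listing above — vl gets min(s2,128) each trip — can be sketched in C. This is a hedged illustration (names are mine, not compiler output): the outer loop models one vector iteration, and the inner loop stands for the work one sequence of 128-element vector instructions performs.

```c
#include <stddef.h>

#define VL 128   /* hardware vector length, as on the slide */

/* Strip-mined form of the slide's loop: each outer iteration models one
 * vector iteration covering up to VL scalar iterations. */
void strip_mined(double *x, double *y, const double *u, size_t n) {
    for (size_t i = 0; i < n; i += VL) {
        size_t vl = (n - i < VL) ? n - i : VL;  /* vl <- min(remaining, 128) */
        for (size_t j = i; j < i + vl; j++) {   /* one vector iteration's work */
            double q = u[j] * y[j];             /* mul.d  */
            y[j] = x[j] + q;                    /* add.d + st.l */
            x[j] = q - u[j] * x[j];             /* mul.d, sub.d, st.l */
        }
    }
}
```
Note that 9984 = 78 x 128, so on the slide every vector iteration runs at full length; the min() handles trip counts that are not multiples of 128.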
9
Execution of vector code
One L/S Port, One Adder, One Multiplier

Loop: mov   s2,vl
      ld.l  -y(a2),v0
      ld.l  -u(a2),v1
      mul.d v1,v0,v2
      ld.l  -x(a2),v3
      add.d v3,v2,v0
      st.l  v0,-y(a2)
      mul.d v1,v3,v1
      sub.d v2,v1,v0
      st.l  v0,-x(a2)
      add.w 1024,a2
      add.w -128,s2
      lt.w  0,s2
      jbrs.t Loop

A vector iteration is equivalent to 128 scalar iterations.
5.1 cycles / Iteration. Memory Latency 24 cycles !!!
14 vector instructions vs. 1792 scalar instructions.
10
Vector Processor
11
Why Vector ISA ?
  • Natural way to express Data-Level Parallelism
  • Fewer instructions ( 3 )
  • Easy way to convey this information to the
    hardware
  • Good hardware implementation
  • Affordable/incremental parallelism ( 2 )
  • Simple control/faster clock ( 1 )
  • Mechanism to deal with memory latency
  • Problem: Memory bandwidth...

12
Vector versus Scalar Architectures
Number of instructions (in millions)
Vector instruction semantics encode many
different scalar instructions
- Loop counters
- Branch computations
- Address generation

Ratio from 1:40 to 1:2
F. Quintana, R. Espasa and M. Valero A case for
merging the ILP.. PDP-98
13
Easy to convey information to the hardware
  • Data path
  • No pressure at fetch, decode and issue
  • Decentralized control
  • Faster cycle times
  • Vector memory instructions
  • Spatial locality can be made clearly visible to
    the hardware through strides
  • No overhead and good prefetching
  • Reduction of memory latency overhead
  • Memory uses facts, not guesses

14
Key parameters for vector processors
  • Cycle time
  • Scalar processor
  • # of registers and FUs
  • Cache
  • Vector processor
  • # of vector registers
  • # of FUs and # of pipes/FU
  • Connection to memory
  • # of busses and width
  • Number of processors

15
Cray Y-MP Architecture
(Figure: Cray Y-MP architecture. Processors P0..P7 connected to 256
memory modules through a crossbar, with a synchronization network.
ta = 30 ns, tc = 6 ns. 333 Mflops / processor.)
16
Vector Processors (1 of 2)
17
Vector Processors (2 of 2 )
18
Evolution of Cray Machines
Tc x6, ILP x2, # of proc. x32, Total x400
Courtesy from SGI/CRAY
19
Vector Innovations (1 of 2 )
  • Star-100/Cyber-200 had many of them
  • Gather/scatter
  • Masked operations for conditionals
  • Cray-1 introduced vector registers
  • BSP had instructions for recurrences and
    multioperand operations
  • Instructions to optimize masked vector operations
  • Instructions to handle Index and Bit sequence on
    mask register
  • Flexible addressing of subvector registers (C4)
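Gather/scatter, the first innovation listed, is what lets sparse-matrix code vectorize. A hedged C model (function names are mine, not Star-100 mnemonics): each routine stands for a single indexed vector load or store instruction.

```c
#include <stddef.h>

/* Gather: load v[i] from mem at positions named by an index vector. */
void gather(const double *mem, const size_t *idx, double *v, size_t n) {
    for (size_t i = 0; i < n; i++)
        v[i] = mem[idx[i]];       /* one vector instruction in hardware */
}

/* Scatter: store v[i] back to mem at the indexed positions. */
void scatter(double *mem, const size_t *idx, const double *v, size_t n) {
    for (size_t i = 0; i < n; i++)
        mem[idx[i]] = v[i];
}
```
With the nonzero positions of a sparse row in idx, a gather pulls just those elements into a dense vector register for arithmetic, and a scatter writes the results back.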

20
Vector Innovations ( 2 of 2 )
  • Multi-pipes (Star/Cyber)
  • Vector with Virtual Memory
  • Flexible chaining (multi-ported register-file)
  • Multilevel register-file (NEC)
  • Scalar units sharing vector FUs (Fujitsu)
  • Combined vector and scalar instructions (Titan)
  • Short vectors (CS-2 and CM-5)
  • Scalar processor: LIW (Fujitsu), SS (NEC)

21
Automatic vectorization
  • Compiler technology for vectorization: over 25
    years of development
  • Dependence analysis
  • Elimination of false dependences
  • Strip mining
  • Loop interchange
  • Partial vectorization
  • Idiom recognition
  • IF conversion
  • Vector parallelization
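Of these transformations, IF conversion is easy to show concretely. A hedged C sketch (hand-written, not actual compiler output): the data-dependent branch in the loop body becomes an unconditional, mask-controlled merge, so every iteration runs the same straight-line code that a masked vector unit can execute.

```c
/* Original loop with a data-dependent branch: not vectorizable as-is. */
void with_branch(const int *a, int *b, int n) {
    for (int i = 0; i < n; i++) {
        if (a[i] > 0) b[i] = 2 * a[i];
        else          b[i] = -a[i];
    }
}

/* After IF conversion: both arms are computed, and a mask selects the
 * result, mirroring execution under a vector mask register. */
void if_converted(const int *a, int *b, int n) {
    for (int i = 0; i < n; i++) {
        int mask = (a[i] > 0);                        /* compare -> mask bit */
        b[i] = mask * (2 * a[i]) + (1 - mask) * (-a[i]); /* masked merge */
    }
}
```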

22
Vector Architectures Present
  • New announcements (NEC, Cray, Fujitsu)
  • The decline of vector processors
  • Cross-pollination of Vector/ Superscalar/ VLIW
    processors

23
NEC SX-5
  • Announced on June 5th, 1998
  • 8 Gflops, CMOS, tc = 4 ns
  • Superscalar processor at 500 Mflops
  • 32 results/cycle (2 FPUs, 16 pipes)
  • 32 data memory accesses/cycle (2 ports, 16
    data/port). Memory bandwidth of 64 GB/s
  • System composed of 32 nodes of 128 Gflops,
    providing 4 Tflop/s

24
Cray SV1
  • Announced on June 16th, 1998
  • CMOS, 250 MHz and 4 Gflop/proc.
  • Vector cache memory
  • 2 FUs of 8 operations/cycle
  • Multi-Streaming Processor
  • Scalable vector architecture (32 nodes of 32
    processors = 4 Teraflops)
  • Future processor enhancements !!!

25
Fujitsu VP5000
  • Announced on April 20th, 1999
  • 9.2 Gflop/s, CMOS, 0.22 micron, 33 Mtransistors/chip
  • Linpack 1000x1000 gives 8758 Mflop/s
  • Crossbar provides 21.6 GB/s per processor
  • System composed of 512 PEs, or 4.9 Teraflops
  • Maximum of 16 GB/PE, or 8 TB/512 PEs

26
The decline of vector processors
  • Why have vector machines declined so fast in
    popularity?
  • Cost (Scalar parallel machines use commodity
    parts)
  • Too restricted in applications (lack of
    vectorization in many programs)
  • Massive use of computers to run so-called
    non-numerical applications

27
Characteristics of non-numerical Applications
  • Examples: OLTP, DSS, simulators, games
  • General data structures: Lists, trees, tables
  • Data types: Scalar integers of 8 to 64 bits
  • Frequent control flow changes... Speculation
  • Short distance data dependencies... Forwarding
  • Instruction/data locality... Caches
  • Fine-grain ILP... Out-of-order

28
Micro Killers ???
(Chart: peak performance vs. Tc and ILP.)
29
Bandwidth and Performance
30
Peak performance and Bandwidth
(Chart: efficiency (%), 0 to 100, vs. vector length, 0 to 4000, for the
polynomial code
Z(I) = C0 + A(I)*(C1 + B(I)*(C2 + C(I)*(C3 + D(I)*(C4 + E(I)*(C5 +
       F(I)*(C6 + G(I)*(C7 + H(I)*(C8 + K(I)*(C9 + L(I))))))))))
comparing the VPP500 and the IBM RS6000.
Measurement condition: RS6000-590 (66.6 MHz),
FORTRAN77 -O3 -qarch=pwr2 -qtune=pwr2)
Courtesy from Fujitsu
31
Vector ideas used in SSs/VLIW processors
  • Address prediction and Prefetching
  • Exploitation of data locality (the stride value is
    used for locality detection and exploitation)
  • Predicated execution (VLIW)
  • Multiply and add, chaining
  • Multi-size operands
  • Data reuse and vectorization
  • Addressing modes (auto-increment)
  • Multithreading ( 2 scalar processors in Fujitsu
    machines)
  • Dynamic load/store elimination
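Address prediction and prefetching, the first item above, borrows the vector notion of a stride. A hedged sketch (the table entry and names are illustrative, not any specific machine's prefetcher): per load, remember the last address and the last observed delta, and predict last + stride for the next access.

```c
/* Per-load stride predictor: once a load settles into a constant
 * stride, every later prediction is exact, so the prefetch can be
 * issued ahead of the access. */
typedef struct {
    long last;    /* last address seen by this load */
    long stride;  /* last observed address delta    */
} stride_pred;

long predict_and_update(stride_pred *p, long addr) {
    long prediction = p->last + p->stride;  /* guess made before the access */
    p->stride = addr - p->last;             /* learn the new stride */
    p->last = addr;
    return prediction;
}
```
After two warm-up accesses the predictor locks onto a unit-stride or constant-stride stream, exactly the pattern a vector load encodes explicitly.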

32
Predictions for ALL instructions
Y. Sazeides and J.E. Smith "The predictability of
data values", MICRO-30, 1997
33
Characterization of Vector Programs
R. Espasa "Advanced Vector Architectures", PhD
Thesis, Feb. 97
34
SSs ideas usable in vector processors
  • Decoupled Vector Architectures
  • Multithreaded Vector Architectures
  • Out-of-order Vector Architectures
  • Simultaneous Multithreaded Vector Architecture
  • Victim Register File

R. Espasa, M. Valero and J.E. Smith HPCA96,
HPCA97, MICRO97, ICS97...
35
ILP + DLP: Out-of-order Vector
(Figure: out-of-order vector pipeline. Fetch, Decode/Rename, Reorder
Buffer; S, A and V register files; LD/ST unit connected to Memory.)
R. Espasa, M. Valero and J.E. Smith "Out-of-order
Vector Architecture", MICRO-30, 1997
36
OOO Vector Performance
R. Espasa, M. Valero and J.E. Smith "Out-of-order
Vector Architecture", MICRO-30, 1997
37
Vector Processors The Future
  • Very high-performance architectures
  • Vector Microprocessors
  • Numerical Accelerators
  • Multimedia Applications

38
Architectures for a Billion Transistors
  • Advanced/Superspeculative Architectures
  • Trace Processors
  • Simultaneous Multithreading
  • Multiprocessor on a chip
  • RAW processors
  • IRAM

"Billion-Transistor Architectures", IEEE Computer,
Sept. 1997
39
SMV
  • Simultaneous Multithreaded Vector Arch.
  • Mixes three paradigms
  • DLP: vector unit
  • ILP: out-of-order execution
  • TLP: multithreaded fetch unit
  • Requires a memory system with
  • high performance at low cost
  • low pin-count

R. Espasa and M. Valero "Exploiting Instruction
and Data-Level Parallelism", IEEE Micro, Sep. 1997
40
Billion Trans. Vector Architecture
(Figure: billion-transistor vector architecture; vector unit connected
to Memory.)
R. Espasa and M. Valero "Exploiting Instruction
and Data-Level Parallelism", IEEE Micro, Sep. 1997
41
SMV Performance
R. Espasa and M. Valero "Exploiting Instruction
and Data-Level Parallelism", IEEE Micro, Sep. 1997
42
V-IRAM1
0.18 µm, 200 MHz, 1.6 GFLOPS (64b) / 6.4 GOPS (16b) / 32 MB
Serial I/O
D.A. Patterson "New Directions in Computer
Architecture", Berkeley, June 1998
43
Conflict-free access to vectors
Idea: Out-of-order access

(Figure: processors P1..Pn connected to memory-module sections through
an interconnection network; accesses are reordered to avoid conflicts.)
M. Valero et al. ISCA 92, ISCA 95, IEEE-TC 95,
ICS 92, ICS 94,...
44
Command Memory System
Command = <@, Length, Stride, size>. Break commands
into bursts at the section controller.
J. Corbal, R. Espasa and M. Valero
"Command-Vector Memory System", PACT-98
45
System configuration in 2009
T. Watanabe SC98, Orlando.
46
Vector Microprocessors
  • Ways of reducing the design impact
  • Short vectors (64 x 16 words = 8 Kbytes)
  • Vector functional units shared with INT/FP
    units
  • Vector register renaming to allow precise
    exceptions
  • Cache hierarchy tuned to vector execution
  • Vector data locality allows large data
    transactions
  • Very large bandwidth between cache and vector
    registers
  • High performance for numerical and multimedia
    applications

47
General Architecture
(Figure: general architecture. I-Cache feeding Fetch and Decode; VRF
connected to the Vector Cache over a 1024-bit path; Rambus controller
with 8 channels to memory.)
48
Vector PC vs. Superscalar
49
Cache Hierarchy
  • Where should the Vector Cache be allocated?

(Figure: two options, each backed by Direct Rambus. Left: the vector
cache sits at the L1 level next to the CPU. Right: the CPU keeps its
L1 and the vector cache sits at the L2 level.)
50
Performance of the cache hierarchies
(Charts: FLOPS/cycle for BDNA, FLO52 and HYDRO2D, comparing VECTOR
CACHE on L1, VECTOR CACHE on L2 and a PERFECT CACHE.)
51
Importance of media Applications
"Over the next five years (1998-2002), we believe
that media processing will become the dominant
force in computer architecture" (K. Diefendorff
and P. K. Dubey in IEEE Computer, Sep. 97,
pp. 43-45). "90% of desktop cycles will be spent
on media applications by 2000" (Scott
Kirkpatrick of IBM)
52
Characteristics of media Applications
  • Examples: Image/speech processing,
    communications, virtual reality, graphics
  • Data structures: matrices and vectors
  • Data types: Integer (8-32 bits), FP (32-64)
  • Demand for high memory bandwidth
  • Low data locality and latency problem
  • No critical data dependences
  • Real-time requirements
  • Fine/coarse grain parallelism

53
Multimedia Applications and Architectures
Scientific Applications Multimedia
Superscalar MMX
Vector Architectures
VLIW
Re-discover the parallelism at run-time using a
lot of hardware
54
MMX-like processors
  • Multimedia extensions are designed to exploit
    the parallelism inherent in multimedia
    applications
  • Targeted to leverage full compatibility with
    existing operating systems and applications, plus
    minimum chip area investment
  • The highlights of multimedia extensions are
  • Single Instruction, Multiple Data (SIMD)
    techniques
  • New data types (Multimedia vectors, 32/64 bits)
  • Multimedia registers
  • SIMD-like instructions over small integer data
    types

55
MMX instruction example
  • PADDW: Parallel ADD of 4x16-bit data types with
    Wrap-Around (No Saturation)

(Figure: two 64-bit registers split at bits 15, 31, 47 and 63 into
four 16-bit lanes, added lane by lane.)
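A plain-C model of PADDW's wrap-around semantics (this models the instruction's behavior; it is not the MMX intrinsic itself): four 16-bit adds share one 64-bit register image, and a carry out of one lane never reaches the next.

```c
#include <stdint.h>

/* PADDW with wrap-around: four independent 16-bit adds packed into a
 * 64-bit word; each lane wraps mod 2^16 on overflow, no saturation. */
uint64_t paddw_model(uint64_t a, uint64_t b) {
    uint64_t r = 0;
    for (int lane = 0; lane < 4; lane++) {
        uint16_t x = (uint16_t)(a >> (16 * lane));
        uint16_t y = (uint16_t)(b >> (16 * lane));
        uint16_t s = (uint16_t)(x + y);    /* wrap-around add */
        r |= (uint64_t)s << (16 * lane);   /* carry stays inside the lane */
    }
    return r;
}
```
The saturating variants (PADDSW/PADDUSW) differ only in clamping s to the lane's range instead of wrapping.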
56
Superscalar Multimedia Processors
Microprocessor Report, Vol. 12, No. 6, May 11, 1998
57
Multimedia Applications and Architectures
Scientific Applications Multimedia
Superscalar MMX
Vector Architectures
VLIW
Re-discover the parallelism at run-time using a
lot of hardware
58
Multimedia Embedded Systems
  • NEC V830R/AV includes MIX2, a multimedia
    instruction extension (SIMD, MMX-like approach)
  • Hitachi SH4 includes FP 4-length vector
    instructions, targeted at geometry transformation
    in 3D rendering applications
  • ARM10 Thumb Family processors will include a
    Vector FP unit capable of delivering 600 MFLOPS

59
Widen is better(?)
  • Most multimedia algorithms exhibit vectors no
    longer than 8/16 elements => widening the
    multimedia registers could provide diminishing
    returns.

(Figure: register widths of SS, Altivec and MMX compared.)
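The diminishing-returns argument can be made concrete with a toy utilization formula (mine, not from the talk): for an application vector of vl elements on a register of width lanes, the fraction of lanes doing useful work collapses to vl/width as soon as width exceeds vl.

```c
/* Fraction of register lanes doing useful work for a vector of length
 * vl on a register that is width lanes wide. */
double lane_utilization(int vl, int width) {
    return (vl >= width) ? 1.0 : (double)vl / (double)width;
}
```
With vl = 8 and a register widened from 8 to 16 lanes, utilization drops from 1.0 to 0.5: the extra width is silicon spent on idle lanes.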
60
VLIW Widening vs Replication
Bus configurations
D. López et al. "Increasing Memory Bandwidth
with Wide Busses", ICS-97
61
Widening and Replication Performance
D. López et al. "Widening versus
replicating...", ICS-98, MICRO-98
62
Multimedia Applications and Architectures
Scientific Applications Multimedia
Superscalar MMX
Vector Architectures
VLIW
Re-discover the parallelism at run-time using a
lot of hardware
63
Torrent T0 Microprocessor
  • The first single-chip vector microprocessor.
  • Can sustain over 24 operations per cycle while
    having an issue rate of only one 32-bit
    instruction per cycle
  • Features
  • 16 vector registers (32 32-bit elements each)
  • 2 vector arithmetic units (8 pipes each)
  • Reconfigurable composite operation pipelines
  • 128-bit wide external memory interface
  • MIPS-II 32-bit instruction set, scalar unit

K. Asanovic et al. "The T0 vector microprocessor",
Hot Chips VII, 1995
64
Torrent T0 Microprocessor
K. Asanovic et al. "The T0 vector microprocessor",
Hot Chips VII, 1995
65
Vector versus Superscalar Processors
  • Comparison of Die Area
  • Processor die area (in mm2, scaled to 0.25 µm)

(Chart: die areas of the compared processors, ranging from 14.73 to
250.0 mm2; intermediate values 21.86, 37.77, 66.92, 67.77 and 69.81.)
C. G. Lee and D. J. DeVries "Initial Results on
...", MICRO-30, 1997
66
Vector versus Superscalar Processors
  • Component Percentages

C. G. Lee and D. J. DeVries "Initial Results on
...", MICRO-30, 1997
67
Imagine project
  • Focused on developing a programmable architecture
    that achieves performance similar to
    special-purpose hardware on graphics and image
    processing
  • Matches media applications' demands to current
    VLSI capabilities by using a stream-based
    programming model
  • Most multimedia kernels exhibit a streaming
    nature
  • Individual stream elements can be operated on in
    parallel, thus exploiting data parallelism

Bill Dally "Tomorrow's Computing Engines", keynote,
HPCA-98
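A hedged sketch of the stream programming model (names are illustrative, not Imagine's actual kernel language): a kernel is applied to every element of an input stream to produce an output stream, and because no element depends on another, the arithmetic clusters can consume elements in parallel.

```c
/* Stream abstraction: a kernel maps each input element to an output
 * element; element independence is what exposes the data parallelism. */
typedef float (*kernel_fn)(float);

void run_kernel(kernel_fn k, const float *in, float *out, int n) {
    for (int i = 0; i < n; i++)
        out[i] = k(in[i]);   /* hardware would spread these over clusters */
}

/* Example kernel: scale-and-bias, typical of image processing. */
static float scale_bias(float x) { return 2.0f * x + 1.0f; }
```
Whole-stream memory operations then amortize memory latency over many elements, the same bandwidth-first philosophy as vector loads.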
68
Imagine architecture
  • Organized around a large stream register file
    (64Kb)
  • Memory operations move entire streams of data
  • Data streams pass through a set of arithmetic
    clusters (8)
  • Each cluster unit operates a single element under
    VLIW control

Bill Dally "Tomorrow's Computing Engines", keynote,
HPCA-98
69
Matrix extensions for Multimedia
  • By combining the conventional vector approach
    with SIMD MMX-like instructions, we can exploit
    additional levels of DLP with matrix-oriented
    multimedia extensions.

(Figure: a superscalar (SS) register holds one element; an MMX register
packs four 16-bit elements (A1..A4) at bit positions 0, 15, 31, 47 and
63; a MOM matrix register holds four such packed rows (A1..A16), and
one packed operation over two matrix registers produces all the result
rows C1..C16.)
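A hedged C model of the matrix-oriented idea (the register layout is simplified and this is not the actual MOM ISA): where one MMX-style operation adds a single packed row, a matrix instruction repeats the packed operation across every row held in a matrix register.

```c
#include <stdint.h>

#define ROWS 4   /* rows per matrix register; illustrative choice */

/* One MMX-style packed add: four 16-bit lanes in one 64-bit word. */
static uint64_t packed_add16(uint64_t a, uint64_t b) {
    uint64_t r = 0;
    for (int lane = 0; lane < 4; lane++) {
        uint16_t s = (uint16_t)((uint16_t)(a >> (16 * lane)) +
                                (uint16_t)(b >> (16 * lane)));
        r |= (uint64_t)s << (16 * lane);   /* lane-local wrap-around */
    }
    return r;
}

/* Matrix-oriented add: the same packed operation applied to every row
 * of a matrix register, as a single MOM-style instruction would. */
void mom_add16(const uint64_t a[ROWS], const uint64_t b[ROWS],
               uint64_t c[ROWS]) {
    for (int row = 0; row < ROWS; row++)
        c[row] = packed_add16(a[row], b[row]);
}
```
The inner loop is the MMX-style SIMD level; the outer loop is the vector level — the two levels of DLP the bullet above combines.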
70
Relative Performance
INVERSE DCT TRANSFORM
MPEG-2 MOTION ESTIMATION
RGB-YCC Color CONVERSION
71
Applications and Architectures
Numerical Applications
(Figure: evolution of numerical performance. Integer units only: very
slow, FP done in subroutines. Adding an FPU: very big improvement !!!
Faster FPUs: additional speed.)
72
Future Applications
  • Integer SPEC-like
  • Commercial (OLTP,DSS)
  • Numerical
  • Multimedia

(Figure: application mix broadening from Integer toward Commercial,
Numerical and Multimedia.)
73
Acknowledgments
  • Roger Espasa
  • James E. Smith
  • Luis A. Villa
  • Francisca Quintana
  • Jesús Corbal
  • David López
  • Josep Llosa
  • Eduard Ayguade
  • Krste Asanovic
  • William Dally
  • Christoforos E. Kozyrakis
  • Corinna G. Lee
  • David A. Patterson
  • Steve Wallace

74
The End