CS718: Data Parallel Processors

1
CS718: Data Parallel Processors
  • 27th April, 2006

2
Data Parallel Architectures
  • SIMD Processors
      • Multiple processing elements driven by a single
        instruction stream
  • Associative Processors
      • SIMD-like processors with associative memory
  • Vector Processors
      • Uniprocessors with vector instructions
  • Systolic Arrays
      • Application-specific VLSI structures

3
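SIMD
[Figure: SIMD block diagram. Labels: C (control), IS (instruction stream), P (processor), DS (data stream), M (memory)]
One of the earliest models of parallel computers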
4
ILLIAC IV SIMD Model
[Figure: ILLIAC IV block diagram with I/O, CU, bus, PE1, PE2, ..., PEn, and an interconnection network]
Planned for 64 x 4 PEs; only 64 were built
5
Burroughs Scientific Processor (BSP) Model
[Figure: BSP block diagram with I/O, CU, bus, processors P1, P2, ..., Pn, an interconnection network, and memory modules M1, M2, ..., Mk]
6
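SIMD algorithms: sum of vector elements
Inputs: a0, a1, a2, a3, a4, a5, a6, a7
step 1: a0+a1, a2+a3, a4+a5, a6+a7
step 2: (a0+a1)+(a2+a3), (a4+a5)+(a6+a7)
step 3: (a0+a1+a2+a3)+(a4+a5+a6+a7)
  • S_i = a_i + a_{i+1}, i = 0, 2, 4, 6
  • S_i = S_i + S_{i+2}, i = 0, 4
  • S_i = S_i + S_{i+4}, i = 0
OR
  • S_i = a_i + a_{i+4}, i = 0, 1, 2, 3
  • S_i = S_i + S_{i+2}, i = 0, 1
  • S_i = S_i + S_{i+1}, i = 0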
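The two index schemes for summing eight elements can be sketched in Python. This is a sequential simulation, not SIMD code: each pass of the inner loop stands for one parallel step in which all listed indices i are updated at once. The function names are illustrative, and the length is assumed to be a power of two.

```python
def tree_sum_adjacent(a):
    """Scheme 1: add neighbours first; the stride doubles each step."""
    s = list(a)
    n = len(s)
    stride, steps = 1, 0
    while stride < n:
        # One SIMD step: every i in this range is handled by its own PE.
        for i in range(0, n - stride, 2 * stride):
            s[i] = s[i] + s[i + stride]
        stride *= 2
        steps += 1
    return s[0], steps


def tree_sum_halving(a):
    """Scheme 2: fold the upper half onto the lower half; the stride halves."""
    s = list(a)
    stride, steps = len(s) // 2, 0
    while stride >= 1:
        # One SIMD step: PEs 0 .. stride-1 each add one element.
        for i in range(stride):
            s[i] = s[i] + s[i + stride]
        stride //= 2
        steps += 1
    return s[0], steps


if __name__ == "__main__":
    data = [3, 1, 4, 1, 5, 9, 2, 6]
    print(tree_sum_adjacent(data))
    print(tree_sum_halving(data))
```

Both variants finish in log2(n) steps. They differ in which elements each PE must fetch from its neighbours, which is where the interconnection network matters.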
7
No. of processors vs time
  • Adding vector elements
      • n processors: log n steps
      • n/log n processors: log n steps
  • Matrix multiplication
      • n processors: n^2 steps
      • n^2 processors: n steps
      • n^3 processors: log n steps
      • n^3/log n processors: log n steps
  • Important factors: data distribution, network
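Why n/log n processors still need only on the order of log n steps for vector addition can be sketched as follows. The assumption here (not stated on the slide) is the usual two-phase schedule: each processor first sums its own block of about log n elements sequentially, then the partial sums are combined in a tree. The function name and the step accounting are illustrative.

```python
import math


def sum_with_fewer_processors(a):
    """Return (total, processors used, parallel steps) for a non-empty list."""
    n = len(a)
    block = max(1, int(math.log2(n)))   # elements per processor, about log n
    p = math.ceil(n / block)            # about n / log n processors

    # Phase 1: every processor sums its own block; all blocks run in parallel,
    # so the phase costs block - 1 addition steps.
    partial = [sum(a[k * block:(k + 1) * block]) for k in range(p)]
    steps = block - 1

    # Phase 2: tree reduction over the p partial sums, one step per level.
    stride = 1
    while stride < p:
        for i in range(0, p - stride, 2 * stride):
            partial[i] += partial[i + stride]
        stride *= 2
        steps += 1
    return partial[0], p, steps


if __name__ == "__main__":
    print(sum_with_fewer_processors(list(range(16))))
```

For n = 16 this uses 4 processors and 5 steps, compared with 16 processors and 4 steps for the full tree: the step count stays within a small constant factor of log n.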

8
Rise and fall of SIMDs
  • Introduced in the 1960s (e.g. ILLIAC IV, BSP)
  • Problems
      • not cost-effective
      • serial fraction and Amdahl's law
      • I/O bottleneck
  • Overshadowed by vector processors
  • Resurrected in the 1980s (MPP from Goodyear,
    Connection Machine from Thinking Machines Corp.,
    MP-1 from MasPar)
  • Did not survive because of high cost
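The serial-fraction problem can be made concrete with Amdahl's law, speedup = 1 / (f + (1 - f)/N), where f is the fraction of work that cannot be parallelized and N is the number of processing elements. A minimal sketch follows; the function name and the sample values are illustrative, not taken from the slides.

```python
def amdahl_speedup(serial_fraction, n_processors):
    """Speedup predicted by Amdahl's law for a given serial fraction."""
    parallel_fraction = 1.0 - serial_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_processors)


if __name__ == "__main__":
    # With 10% serial work, adding PEs gives rapidly diminishing returns.
    for n in (1, 8, 64, 1024):
        print(n, round(amdahl_speedup(0.10, n), 2))
```

However many PEs are added, the speedup stays below 1/f, so a large SIMD array pays off only when the serial fraction is very small.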

9
Related ideas
  • Coarse-grain SIMD with off-the-shelf processors
    (synchronized MIMD), e.g. the CM-5 from Thinking
    Machines
  • This gave rise to SPMD (single program, multiple
    data)
  • MMX and other SIMD instructions in the Pentium

10
Vector Processors
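[Figure: vector processor block diagram with I-cache, I-unit and control, D-cache, memory, vector registers (V-reg), GPRs, address unit, memory control, buses, two vector functional units (VFU), and a scalar functional unit (FU)]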
11
Four Generations of CRAY systems (vector processors)

System   CPUs  Clock (MHz)  Flops/clock/CPU  Words moved/clock/CPU  Mflops  Gates/chip
CRAY-1      1           80                2                      1      80           2
X-MP        4          105                2                      3     840          16
Y-MP        8          166                2                      3    2667        2500
C90        16          240                4                      6   15360       10000
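The Mflops column can be checked against the other columns, assuming peak Mflops = CPUs x clock (MHz) x flops per clock per CPU. That relation is an assumption about how the table was built; the numbers below are copied from the table as given.

```python
# (CPUs, clock in MHz, flops per clock per CPU, Mflops as listed in the table)
SYSTEMS = {
    "CRAY-1": (1, 80, 2, 80),
    "X-MP": (4, 105, 2, 840),
    "Y-MP": (8, 166, 2, 2667),
    "C90": (16, 240, 4, 15360),
}


def peak_mflops(cpus, clock_mhz, flops_per_clock):
    """Peak rate if every CPU completes flops_per_clock results each cycle."""
    return cpus * clock_mhz * flops_per_clock


if __name__ == "__main__":
    for name, (cpus, mhz, fpc, listed) in SYSTEMS.items():
        print(name, peak_mflops(cpus, mhz, fpc), "listed:", listed)
```

Under this assumption the product reproduces the X-MP and C90 entries exactly and the Y-MP entry to within rounding of the clock rate (2656 vs. 2667). For the CRAY-1 row it gives 160, whereas the table lists 80.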

12
Cray History
  • http://www.cray.com/company/history.html

13
CRAY C90
  • 8 GB central memory shared by 16 CPUs
  • 128 CPU-memory paths
  • Word
      • 64 bits + 16 ECC
  • Dual vector pipes
  • 128-element segments
  • Memory
      • 8 sections
      • 8x8 subsections
      • 8x8x2 bank groups
      • 8x8x2x8 banks
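The memory hierarchy multiplies out to 1024 banks. The sketch below computes the count at each level and splits a flat bank number into (section, subsection, bank group, bank) coordinates. The level fan-outs come from the slide; the ordering of the coordinates in the flat number is a hypothetical choice for illustration and says nothing about how the real C90 interleaves addresses.

```python
# Fan-out at each level of the hierarchy, outermost first.
LEVELS = [("section", 8), ("subsection", 8), ("bank group", 2), ("bank", 8)]


def cumulative_counts(levels=LEVELS):
    """Total number of units at each level: 8, 8x8, 8x8x2, 8x8x2x8."""
    counts, total = [], 1
    for name, fanout in levels:
        total *= fanout
        counts.append((name, total))
    return counts


def locate_bank(flat_index, levels=LEVELS):
    """Split a flat bank number into per-level coordinates (mixed radix).

    Hypothetical ordering: the innermost level varies fastest.
    """
    coords = []
    for _, fanout in reversed(levels):
        flat_index, digit = divmod(flat_index, fanout)
        coords.append(digit)
    return tuple(reversed(coords))


if __name__ == "__main__":
    print(cumulative_counts())
    print(locate_bank(1023))
```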

14
Convex C4/XA system
  • CPU: 7.5 ns clock, 1620 MFLOPS
  • Memory: 32 MB x 32 banks, 64-bit word, 50 ns access
    time
  • 3 FP pipes, 2 results each
  • Crossbar between vector registers and FPUs
  • 1.1 GB/s per I/O port

[Figure: 5 x 5 crossbar connecting CPUs, memories, I/O, and utilities]
15
Other examples
  • Fujitsu VP2000
      • 1 - 2 CPUs
  • Fujitsu VP5000
      • 7 - 222 CPUs
      • 2 load/store pipes
      • 3 functional pipes
      • 2 mask pipes
  • NEC SX-X
      • 4 CPUs
      • 4 x 2 pipes each

16
Systolic Arrays (H.T. Kung 1978)
Simplicity, Regularity, Concurrency, Communication
Example: band matrix multiplication
17
T0
[Figure: systolic array snapshot at T0. Labels: A11, A12, A21, A22, A23, A31; B11, B12, B21, B31]
18
T1
[Figure: systolic array snapshot at T1. Labels: A11, A12, A21, A22, A23, A31, A32; B11, B12, B21, B22, B31]
19
T2
[Figure: systolic array snapshot at T2. Labels: A11, A12, A21, A22, A23, A31, A32, A33; B11, B12, B21, B22, B31, B32]
20
T3
[Figure: systolic array snapshot at T3. Labels: A12, A21, A22, A23, A31, A32, A33, A34, A42; B12, B21, B22, B23, B31, B32, B42. Product in a cell: A11 B11]
21
T4
[Figure: systolic array snapshot at T4. Labels: A22, A23, A31, A32, A33, A34, A42, A43; B22, B23, B31, B32, B33, B42. Partial sums in cells: A11 B11 + A12 B21; A21 B11; A11 B12]
22
T5
[Figure: systolic array snapshot at T5. Labels: A23, A32, A33, A34, A42, A43; B23, B31, B32, B33, B42. Output: C11. Partial sums in cells: A21 B11 + A22 B21; A11 B12 + A12 B22; A21 B12; A31 B11]
23
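T6
[Figure: systolic array snapshot at T6. Labels: A33, A34, A42, A43, A44, A53; B32, B33, B42, B43. Outputs: C11, C12. Partial sums in cells: A21 B11 + A22 B21 + A23 B31; A31 B11 + A32 B21; A21 B12 + A22 B22; A12 B23; A31 B12]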
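The snapshots T0 to T6 show operands marching through the array while products accumulate in the cells. The same principle can be sketched with a simpler layout than the one in the slides: a square n x n mesh in which A values flow left to right, B values flow top to bottom, and each cell keeps its own C entry. This is not Kung's band-matrix array, only an illustration of skewed inputs and local multiply-accumulate; all names are illustrative.

```python
def systolic_matmul(A, B):
    """Simulate an n x n mesh computing C = A * B, one clock tick per loop."""
    n = len(A)
    C = [[0] * n for _ in range(n)]
    a_reg = [[0] * n for _ in range(n)]   # A value currently held by each cell
    b_reg = [[0] * n for _ in range(n)]   # B value currently held by each cell

    for t in range(3 * n - 2):            # last product lands at tick 3n - 3
        # Pass A values one cell to the right and B values one cell down.
        for i in range(n):
            for j in range(n - 1, 0, -1):
                a_reg[i][j] = a_reg[i][j - 1]
        for j in range(n):
            for i in range(n - 1, 0, -1):
                b_reg[i][j] = b_reg[i - 1][j]

        # Feed the edges; row i and column j are delayed by i and j ticks.
        for i in range(n):
            k = t - i
            a_reg[i][0] = A[i][k] if 0 <= k < n else 0
        for j in range(n):
            k = t - j
            b_reg[0][j] = B[k][j] if 0 <= k < n else 0

        # Every cell does one multiply-accumulate per tick.
        for i in range(n):
            for j in range(n):
                C[i][j] += a_reg[i][j] * b_reg[i][j]
    return C


if __name__ == "__main__":
    print(systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

Each cell talks only to its immediate neighbours and repeats the same operation every tick, which is the simplicity, regularity, and local communication that the systolic approach aims for.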
24
WARP Programmable Systolic Processor
  • Kung, CMU, 1987
  • Complete contrast to the original idea
      • not application-specific
      • not a single VLSI chip
      • complex cell (pipelined FP adder, multiplier, FIFOs,
        RAM, crossbar)
      • linear
      • asynchronous

25
References
  • D. Sima, T. Fountain, P. Kacsuk, "Advanced
    Computer Architectures: A Design Space
    Approach", Addison Wesley, 1997.
  • K. Hwang, "Advanced Computer Architecture:
    Parallelism, Scalability, Programmability",
    McGraw Hill, 1993.