The LINPACK Benchmark on a Multi-Core Multi-FPGA System

About This Presentation

Description:

DGEFA: LU factorization with partial pivoting: A=LUP. Ax=LUx=b ... (MPI) Broadcast scaled column and pivot. Perform loop that contains DAXPY ... – PowerPoint PPT presentation

Slides: 44

Provided by: eecgTo

Transcript and Presenter's Notes

1
The LINPACK Benchmark on a Multi-Core Multi-FPGA System
University of Toronto
Electrical and Computer Engineering Department
by Emanuel Ramalho
Supervisor: Prof. Paul Chow
October 1st, 2008
2
Outline

Motivation
LINPACK Algorithm
Parallelizing LINPACK
Results
Conclusions
Future Work

3
Motivation

The LINPACK Benchmark is used to rank the Top500
computers in the world
Can FPGAs compete?

4
Objective

To see how well a multi-core multi-FPGA system
performs when compared to a processor

FPGA

Advantage: total implementation may be done in hardware

Disadvantage: much lower clock rate

5
LINPACK Algorithm

Solves a system of linear equations by calling
two routines: DGEFA and DGESL
Ax = b
DGEFA: LU factorization with partial pivoting
A = LUP
Ax = LUx = b
DGESL: solves the system using the LU factorization
Ly = b
Ux = y
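
The two routines can be illustrated with a minimal pure-Python sketch. The names lu_factor and lu_solve are hypothetical stand-ins for DGEFA and DGESL, and the row-swapping storage scheme is an assumption chosen for readability; the slides do not show the actual routine code.

```python
def lu_factor(a):
    """Factor a square matrix (list of rows) in place with partial pivoting.

    Multipliers of L end up below the diagonal, U on and above it.
    Returns the pivot row chosen at each elimination step.
    """
    n = len(a)
    pivots = []
    for k in range(n - 1):
        # Partial pivoting: choose the row with the largest |a[i][k]|, i >= k.
        p = max(range(k, n), key=lambda i: abs(a[i][k]))
        if a[p][k] == 0.0:
            raise ZeroDivisionError("matrix is singular")
        pivots.append(p)
        a[k], a[p] = a[p], a[k]
        for i in range(k + 1, n):
            a[i][k] /= a[k][k]                 # multiplier l_ik
            for j in range(k + 1, n):
                a[i][j] -= a[i][k] * a[k][j]   # update trailing submatrix
    return pivots


def lu_solve(a, pivots, b):
    """Solve Ax = b from the factors produced by lu_factor."""
    n = len(a)
    x = list(b)
    # Apply the recorded row swaps to the right-hand side.
    for k, p in enumerate(pivots):
        x[k], x[p] = x[p], x[k]
    # Forward substitution: Ly = b (L has an implicit unit diagonal).
    for k in range(n - 1):
        for i in range(k + 1, n):
            x[i] -= a[i][k] * x[k]
    # Back substitution: Ux = y.
    for i in range(n - 1, -1, -1):
        for j in range(i + 1, n):
            x[i] -= a[i][j] * x[j]
        x[i] /= a[i][i]
    return x


A = [[2.0, 1.0, 1.0], [4.0, -6.0, 0.0], [-2.0, 7.0, 2.0]]
x = lu_solve(A, lu_factor(A), [5.0, -2.0, 9.0])
```

The factorization costs on the order of n^3 operations while the solve costs on the order of n^2, which is why the rest of the deck concentrates on DGEFA.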

6
LINPACK1 vs. HPL

LINPACK1
Single processor
Uses Level 1 BLAS
Slower
Low Complexity

HPL
Multiple processors
Uses Level 3 BLAS
Faster
High Complexity

FPGA Implementation
BLAS3 performs faster on processors (due to
locality of reference)
FPGAs do not take advantage of BLAS3, so LINPACK1 is
chosen

7
LINPACK Pseudo-Code

1. Random generation of matrix A and vector b
2. Execute DGEFA routine (A = LU)
   IDAMAX, DSCAL and DAXPY are executed here
3. Execute DGESL routine (LUx = b)
4. Verify the result using residual calculation

Performance is measured from step 2 to step 3 (inclusive)
How is this going to be parallelized?
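
As a rough illustration of how a timing over the DGEFA and DGESL steps becomes a MFLOPS figure, the sketch below uses the conventional LINPACK operation count of 2n^3/3 + 2n^2. The function names are hypothetical, and whether this deck's figures use exactly this count (some slides report DGEFA alone) is an assumption.

```python
def linpack_flops(n):
    """Conventional LINPACK operation count for factoring and solving an n x n system."""
    return (2.0 * n ** 3) / 3.0 + 2.0 * n ** 2


def mflops(n, seconds):
    """Convert a measured factor-plus-solve time into MFLOPS."""
    return linpack_flops(n) / (seconds * 1.0e6)


# Example: a hypothetical 2 ms run at n = 100
rate = mflops(100, 0.002)
```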

8
Parallelizing LINPACK

Find the focus of parallelization: DGEFA

9
DGEFA Analysis

Inside DGEFA: IDAMAX, DSCAL and DAXPY
DAXPY is the main computation
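
A quick operation count suggests why DAXPY dominates. The sketch below assumes the usual column-oriented DGEFA structure (one DSCAL and one DAXPY per remaining column at each elimination step) and counts only floating-point multiplies and adds; the comparisons done by IDAMAX are left out. The function name is hypothetical.

```python
def dgefa_flop_breakdown(n):
    """Return (DSCAL flops, DAXPY flops) for factoring an n x n matrix."""
    dscal = 0
    daxpy = 0
    for k in range(n - 1):
        m = n - k - 1        # elements below the diagonal in column k
        dscal += m           # one multiply per scaled element
        daxpy += 2 * m * m   # m trailing columns, m multiply-add pairs each
    return dscal, daxpy


dscal_flops, daxpy_flops = dgefa_flop_breakdown(100)
daxpy_share = 100.0 * daxpy_flops / (dscal_flops + daxpy_flops)
```

Under these assumptions, at n = 100 DAXPY accounts for more than 99% of the floating-point work in DGEFA.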

10
TMD-MPI

TMD-MPI is a lightweight implementation of the
MPI protocol (message passing interface)
TMD-MPE is a hardware implementation of TMD-MPI's
main functionality (SEND and RECV)

[Figure: MPI network]
11
DGEFA Parallelization

1. Generate matrix A and vector b (main rank)
2. (MPI) Matrix distribution
3. Perform DGEFA (main loop)
   a. Perform IDAMAX and DSCAL
   b. (MPI) Broadcast scaled column and pivot
   c. Perform loop that contains DAXPY
4. (MPI) Matrix gather (main rank)
5. Perform DGESL
6. Calculate residual
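
The distribution, broadcast and gather steps can be sketched without real MPI by simulating the ranks in one process. Everything below is illustrative: parallel_dgefa is a hypothetical name, the cyclic column-to-rank mapping is an assumption (the slides do not state the distribution scheme), and the "broadcast" is just a shared Python list.

```python
def parallel_dgefa(a_cols, n_ranks):
    """Simulate the DGEFA main loop over n_ranks ranks.

    a_cols is the matrix as a list of columns. Returns the factored
    columns and the pivot row chosen at each step.
    """
    n = len(a_cols)
    # (MPI) Matrix distribution: column j goes to rank j % n_ranks (assumed).
    local = [{j: list(a_cols[j]) for j in range(r, n, n_ranks)}
             for r in range(n_ranks)]
    pivots = []
    for k in range(n - 1):
        col = local[k % n_ranks][k]          # rank that owns column k
        # IDAMAX: index of the largest |entry| at or below the diagonal.
        p = max(range(k, n), key=lambda i: abs(col[i]))
        if col[p] == 0.0:
            raise ZeroDivisionError("matrix is singular")
        pivots.append(p)
        col[k], col[p] = col[p], col[k]
        # DSCAL: scale the sub-column by -1/pivot to form the multipliers.
        t = -1.0 / col[k]
        for i in range(k + 1, n):
            col[i] *= t
        # (MPI) Broadcast: every rank receives the scaled column and pivot.
        multipliers = col[k + 1:]
        for r in range(n_ranks):
            for j, c in local[r].items():
                if j <= k:
                    continue                 # already factored
                c[k], c[p] = c[p], c[k]
                t = c[k]
                # DAXPY: c[k+1:] = t * multipliers + c[k+1:]
                for i in range(k + 1, n):
                    c[i] += t * multipliers[i - k - 1]
    # (MPI) Matrix gather on the main rank.
    out = [None] * n
    for r in range(n_ranks):
        for j, c in local[r].items():
            out[j] = c
    return out, pivots


cols = [[2.0, 4.0, -2.0], [1.0, -6.0, 7.0], [1.0, 0.0, 2.0]]
factored, pivots = parallel_dgefa(cols, 3)
```

Each rank touches only its own columns inside the DAXPY loop, so the only per-step communication in this sketch is the broadcast of one scaled column and one pivot index.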

12
LINPACK Engine
[Figure: LINPACK Engine block diagram. Recoverable labels (grouping approximate): Main FSM, BLAS1 Engine, Data RAM, MPE Header FSM, Command FSLs, TMD-MPE, To On-Chip Network]
13
BLAS1 Engine

Performs IDAMAX, DSCAL and DAXPY

14
IDAMAX

Finds Max(v1) and returns its index
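
A minimal software sketch of the operation follows. It takes the maximum by absolute value, as the reference BLAS IDAMAX does, and returns a 0-based index (the Fortran routine is 1-based); whether the hardware engine follows the same conventions is an assumption.

```python
def idamax(v1):
    """Return the index of the first element of v1 with the largest absolute value."""
    best = 0
    for i in range(1, len(v1)):
        if abs(v1[i]) > abs(v1[best]):
            best = i
    return best


pivot_index = idamax([1.0, -7.5, 3.0, 7.5])
```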

15
DSCAL

Performs v2 = a·v1
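
As a software sketch of the same operation (returning a new list rather than scaling in place, which is a simplification):

```python
def dscal(a, v1):
    """Return v2 = a * v1, element by element."""
    return [a * x for x in v1]


# In the reference LINPACK DGEFA the scale factor is -1/pivot, which turns
# the sub-column below the pivot into the (negated) elimination multipliers.
pivot = 4.0
multipliers = dscal(-1.0 / pivot, [2.0, -2.0])
```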

16
DAXPY

Calculates
v3 = a·v1 + v2
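
A minimal software sketch of the operation, returning v3 as a new list (the BLAS routine overwrites v2 in place; how the engine stores the result is not stated in the slides):

```python
def daxpy(a, v1, v2):
    """Return v3 = a * v1 + v2, element by element."""
    if len(v1) != len(v2):
        raise ValueError("vectors must have the same length")
    return [a * x + y for x, y in zip(v1, v2)]


v3 = daxpy(2.0, [1.0, 2.0], [10.0, 20.0])
```

Each element costs one multiply and one add, so a DAXPY over m elements contributes 2m floating-point operations.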

17
Hardware - BEE2 Board
18
Device Utilization (XC2VP70)
Core             4-Input LUTs   Occurrences   Total 4-Input LUTs   Total (%)
LINPACK Engine           4360             6                26160          40
Network cores:
TMD-MPE                   896             6                 5376           8
NetIf                     579            11                 6369          10
PLB-MPE                  2685             1                 2685           4
FSLs                       44           154                 6776          10
FSL2IC                    349             4                 1396           2

About 34% of the device is dedicated to the network
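
The network figure can be re-derived from the per-core counts in the table. The sketch below assumes that the network cores are TMD-MPE, NetIf, PLB-MPE, FSLs and FSL2IC, and that the XC2VP70 provides 66,176 four-input LUTs (33,088 slices with two LUTs each); that device total is taken from the part's published specifications, not from the slides.

```python
DEVICE_LUTS = 66176  # assumption: XC2VP70 total 4-input LUTs

# (LUTs per instance, number of occurrences), copied from the table
network_cores = {
    "TMD-MPE": (896, 6),
    "NetIf": (579, 11),
    "PLB-MPE": (2685, 1),
    "FSLs": (44, 154),
    "FSL2IC": (349, 4),
}


def total_luts(cores):
    """Sum LUTs over all instances of every core in the mapping."""
    return sum(luts * count for luts, count in cores.values())


network_total = total_luts(network_cores)
network_share = 100.0 * network_total / DEVICE_LUTS
engine_share = 100.0 * (4360 * 6) / DEVICE_LUTS
```

With these inputs the network share rounds to 34% and the six LINPACK Engines to 40%, matching the table.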

19
Methods of Analysis

Method 1: Simulation
ModelSim waveform
Method 2: PPC Timer
Timing measured from the C code running on the PPC
Method 3: TMD-Profiler
Using an external profiler to analyze the engines

20
Processor vs FPGA

The most important portion is DGEFA
DGEFA benchmark with n = 100
Processor performance: 315 MFLOPS

FPGA performance (6 engines): 379 MFLOPS

Performance of 1 engine: 123 MFLOPS
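
These three figures imply the speedups below. The helper names are hypothetical, and the efficiency value assumes ideal scaling would be six times the single-engine rate.

```python
def speedup(mflops_new, mflops_base):
    """Ratio of two measured throughputs."""
    return mflops_new / mflops_base


def parallel_efficiency(mflops_parallel, mflops_single, n_engines):
    """Fraction of ideal linear scaling actually achieved."""
    return speedup(mflops_parallel, mflops_single) / n_engines


over_one_engine = speedup(379.0, 123.0)             # 6 engines vs 1 engine
over_processor = speedup(379.0, 315.0)              # 6 engines vs processor
efficiency = parallel_efficiency(379.0, 123.0, 6)
```

At n = 100 the six engines deliver roughly 3.1x the single-engine rate, about half of ideal scaling, and about 1.2x the processor.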

21
Engines Speedup
[Figure: engine speedup plot, with labels FPGA 1 and FPGA 2]
22
Problem

The engines' computation time is being surpassed by
either communication or idle time
TMD-Profiler can be used to track down the problem

For 8 Engines
23
TMD-Profiler
[Figure: TMD-Profiler timeline; legend: SEND, RECV, COMP]
24
Scaled Problem Size
[Figure: speedup plot for the scaled problem size, with labels FPGA 1 and FPGA 2]
25
Why super speedup?

As the matrix grows, the size of each column also
increases
Since each engine holds exactly the same amount of
data, the number of columns per engine decreases

4 x Latency 20
2 x Latency 20
26
New Speedup

With a matrix size of 195 x 195
Performance of 6 engines (one FPGA): 628 MFLOPS
Performance of one processor: 324 MFLOPS
Speedup of FPGA over processor is 1.94x

27
Newer Technology

Max theoretical peak performance of an engine on the
V2Pro is 200 MFLOPS
Newer FPGAs are larger and faster
Estimated peak performance for an engine network
(20) on a Virtex 5 LX330: 4000 MFLOPS
Theoretical speedup, compared to a processor, is
11.4x
Compared to HPL, estimated speedup is 4.4x

28
Scaling to Larger Systems

LINPACK is meant to run on large multi-processor
systems
Computer networks suffer from high latency
The tighter coupling and lighter protocol used in
this FPGA system have potential to scale

29
Conclusions

TMD-MPE was used to parallelize the LINPACK Hardware
Engine
Disadvantage: expensive in terms of device
utilization
Advantage: higher flexibility
Max speedup of the engines over a processor is 1.9x
Newer FPGAs have a better chance of outperforming
processors (est. 4000 MFLOPS for Virtex 5 LX330)
Multi-FPGA systems have good scalability
potential due to low latencies

30
Future Work