The LINPACK Benchmark on a Multi-Core Multi-FPGA System

1
The LINPACK Benchmark on a Multi-Core Multi-FPGA System
University of Toronto, Electrical and Computer Engineering Department
by Emanuel Ramalho
Supervisor: Prof. Paul Chow
October 1st, 2008
2
Outline
  • Motivation
  • LINPACK Algorithm
  • Parallelizing LINPACK
  • Results
  • Conclusions
  • Future Work

3
Motivation
  • The LINPACK Benchmark is used to rank the Top500
    computers in the world
  • Can FPGAs compete?

4
Objective
  • To see how well a multi-core multi-FPGA system
    performs when compared to a processor

FPGA
  • Advantage: the total implementation may be done in hardware
  • Disadvantage: much lower clock rate

5
LINPACK Algorithm
  • Solves a system of linear equations, Ax = b, by calling
    two routines: DGEFA and DGESL
  • DGEFA: LU factorization with partial pivoting
  • A = LUP
  • Ax = LUx = b
  • DGESL: solves the system using the LU factorization
  • Ly = b
  • Ux = y
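
The two routines above can be sketched in a few lines. This is a minimal illustration in plain Python, not the LINPACK source: the names lu_factor and lu_solve are made up for this sketch, the matrix is stored as a list of rows, and pivoting is applied as whole-row swaps recorded in piv.

```python
def lu_factor(a):
    """Factor a square matrix with partial pivoting (role of DGEFA).

    Returns (lu, piv): lu holds U on and above the diagonal and the
    multipliers of L below it; piv[k] is the row swapped with row k.
    """
    n = len(a)
    lu = [row[:] for row in a]          # work on a copy
    piv = []
    for k in range(n - 1):
        # Partial pivoting: pick the largest |entry| in column k.
        p = max(range(k, n), key=lambda i: abs(lu[i][k]))
        piv.append(p)
        if lu[p][k] == 0.0:
            raise ValueError("matrix is singular")
        lu[k], lu[p] = lu[p], lu[k]
        for i in range(k + 1, n):
            lu[i][k] /= lu[k][k]        # multiplier l_ik
            for j in range(k + 1, n):
                lu[i][j] -= lu[i][k] * lu[k][j]
    return lu, piv


def lu_solve(lu, piv, b):
    """Solve Ax = b from the factors (role of DGESL): Ly = Pb, then Ux = y."""
    n = len(lu)
    x = b[:]
    for k, p in enumerate(piv):         # apply the row swaps to b
        x[k], x[p] = x[p], x[k]
    for i in range(n):                  # forward substitution (unit-diagonal L)
        for j in range(i):
            x[i] -= lu[i][j] * x[j]
    for i in reversed(range(n)):        # back substitution with U
        for j in range(i + 1, n):
            x[i] -= lu[i][j] * x[j]
        x[i] /= lu[i][i]
    return x


A = [[2.0, 1.0, 1.0], [4.0, 3.0, 3.0], [8.0, 7.0, 9.0]]
b = [7.0, 19.0, 49.0]                   # equals A times [1, 2, 3]
lu, piv = lu_factor(A)
x = lu_solve(lu, piv, b)
```

The factorization costs O(n^3) operations while the two triangular solves cost O(n^2), which is why the later slides concentrate on DGEFA.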

6
LINPACK1 vs. HPL
  • LINPACK1: single processor, uses Level 1 BLAS, slower,
    low complexity
  • HPL: multiple processors, uses Level 3 BLAS, faster,
    high complexity
  • FPGA implementation
  • BLAS3 performs faster on processors (due to
    locality of reference)
  • FPGAs do not take advantage of BLAS3, so LINPACK1 is
    chosen

7
LINPACK Pseudo-Code
  1. Random generation of matrix A and vector b
  2. Execute DGEFA routine (A = LU)
     • IDAMAX, DSCAL and DAXPY are executed here
  3. Execute DGESL routine (LUx = b)
  4. Verify the result using residual calculation
  • Performance is measured from 2. to 3. (inclusive)
  • How is this going to be parallelized?
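
A rough sketch of this benchmark flow follows. It assumes the conventional LINPACK operation count of 2/3 n^3 + 2 n^2 for converting time into MFLOPS, and it uses a compact Gaussian elimination as a stand-in for DGEFA and DGESL. The function names and the random number range are choices made for this sketch, not taken from the slides.

```python
import random
import time


def solve(a, b):
    """Gaussian elimination with partial pivoting (stand-in for DGEFA + DGESL)."""
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]   # augmented copy [A | b]
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(m[i][k]))
        m[k], m[p] = m[p], m[k]
        for i in range(k + 1, n):
            f = m[i][k] / m[k][k]
            for j in range(k, n + 1):
                m[i][j] -= f * m[k][j]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def linpack_flops(n):
    """Conventional LINPACK operation count: 2/3 n^3 + 2 n^2."""
    return 2.0 * n**3 / 3.0 + 2.0 * n**2


def run_benchmark(n, seed=0):
    rng = random.Random(seed)
    # Step 1: random A and b (not timed).
    a = [[rng.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(n)]
    b = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    # Steps 2 and 3: factor and solve (timed).
    t0 = time.perf_counter()
    x = solve(a, b)
    elapsed = time.perf_counter() - t0
    # Step 4: residual check, max |Ax - b| (not timed).
    resid = max(abs(sum(a[i][j] * x[j] for j in range(n)) - b[i])
                for i in range(n))
    mflops = linpack_flops(n) / elapsed / 1e6
    return mflops, resid


mflops, resid = run_benchmark(50)
```

Interpreted Python will report a far lower MFLOPS figure than the processor and FPGA numbers on the later slides; the sketch only shows what is timed and how the rate is derived.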

8
Parallelizing LINPACK
  • Find the focus of parallelization: DGEFA

9
DGEFA Analysis
  • Inside DGEFA: IDAMAX, DSCAL and DAXPY
  • DAXPY is the main computation
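
A quick operation count makes the last point concrete. The sketch below assumes the usual column-oriented elimination loop (one DSCAL and a group of DAXPY calls per step) and ignores the single reciprocal computed per step; the function name is hypothetical.

```python
def dgefa_flop_breakdown(n):
    """Count floating-point operations per BLAS1 routine in an n x n DGEFA.

    At elimination step k there are m = n - 1 - k rows below the pivot:
    DSCAL scales m elements (m multiplies) and DAXPY updates each of the
    m remaining columns with one multiply and one add per element.
    IDAMAX only compares magnitudes, so it contributes no flops here.
    """
    dscal = 0
    daxpy = 0
    for k in range(n - 1):
        m = n - 1 - k
        dscal += m
        daxpy += 2 * m * m
    return {"DSCAL": dscal, "DAXPY": daxpy}


counts = dgefa_flop_breakdown(100)
share = counts["DAXPY"] / (counts["DAXPY"] + counts["DSCAL"])
print(counts, f"DAXPY share = {share:.1%}")
```

Under these assumptions, for n = 100 DAXPY accounts for more than 99% of the floating-point work in DGEFA.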

10
TMD-MPI
  • TMD-MPI is a lightweight implementation of the
    MPI protocol (message passing interface)
  • TMD-MPE is a hardware implementation of TMD-MPI's
    main functionality (SEND and RECV)

[Figure: MPI Network]
11
DGEFA Parallelization
  • Generate matrix A and vector b (main rank)
  • (MPI) Matrix distribution
  • Perform DGEFA (main loop)
  • Perform IDAMAX and DSCAL
  • (MPI) Broadcast scaled column and pivot
  • Perform loop that contains DAXPY
  • (MPI) Matrix gather (main rank)
  • Perform DGESL
  • Calculate residual
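
The DGEFA portion of these steps can be simulated on a single machine to show where the communication sits. The sketch below makes several assumptions that the slides do not state: columns are assigned to ranks cyclically (column j on rank j % nranks), the matrix is non-singular, and the MPI calls are replaced by plain Python loops. The name parallel_dgefa is hypothetical.

```python
def parallel_dgefa(cols, nranks):
    """Simulate distributed LU factorization on column-major data.

    cols[j] is column j of A. Column j lives on rank j % nranks (an assumed
    cyclic layout). Returns (factored columns, pivot rows, broadcast count).
    """
    n = len(cols)
    # (MPI) Matrix distribution: each rank keeps a dict of its own columns.
    local = [{j: cols[j][:] for j in range(r, n, nranks)}
             for r in range(nranks)]
    ipvt = []
    broadcasts = 0
    for k in range(n - 1):
        owner = k % nranks
        col = local[owner][k]
        # IDAMAX: row of the largest magnitude at or below the diagonal.
        l = max(range(k, n), key=lambda i: abs(col[i]))
        ipvt.append(l)
        col[k], col[l] = col[l], col[k]
        # DSCAL: turn the sub-diagonal entries into multipliers.
        for i in range(k + 1, n):
            col[i] /= col[k]
        # (MPI) Broadcast scaled column and pivot to every rank.
        message = (l, col[k + 1:])
        broadcasts += 1
        for r in range(nranks):
            piv, mult = message
            for j, cj in local[r].items():
                if j <= k:
                    continue            # already factored
                cj[k], cj[piv] = cj[piv], cj[k]
                t = cj[k]
                # DAXPY: cj[k+1:] = cj[k+1:] - t * mult
                for i in range(k + 1, n):
                    cj[i] -= t * mult[i - k - 1]
    # (MPI) Matrix gather back on the main rank.
    out = [None] * n
    for r in range(nranks):
        for j, cj in local[r].items():
            out[j] = cj
    return out, ipvt, broadcasts


cols = [[2.0, 4.0, 8.0], [1.0, 3.0, 7.0], [1.0, 3.0, 9.0]]
factored, ipvt, broadcasts = parallel_dgefa(cols, nranks=3)
```

Each elimination step needs one broadcast before any rank can start its DAXPY updates, so the broadcast latency is paid n - 1 times regardless of how many engines share the work.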

12
LINPACK Engine
[Figure: LINPACK Engine block diagram. Recoverable labels: LINPACK Engine, Data RAM, BLAS1 Engine, Main FSM, Command, FSLs, To On-Chip Network, TMD-MPE, MPE Header, MPE FSM]
13
BLAS1 Engine
  • Performs IDAMAX, DSCAL and DAXPY

14
IDAMAX
  • Finds the element of v1 with the largest absolute value and
    returns its index
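
As a software reference for what the engine computes, a minimal sketch: BLAS IDAMAX compares absolute values and returns the first maximum. The reference BLAS returns a 1-based index; this sketch uses 0-based indexing.

```python
def idamax(v1):
    """Index of the first element of v1 with the largest absolute value."""
    best = 0
    for i in range(1, len(v1)):
        if abs(v1[i]) > abs(v1[best]):
            best = i
    return best


pivot_row = idamax([1.0, -7.5, 3.0, 7.5])
```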

15
DSCAL
  • Performs v2 = a · v1
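
A minimal software equivalent, written to match the slide's notation. BLAS DSCAL overwrites its input vector in place; returning a new list here is a simplification for this sketch.

```python
def dscal(a, v1):
    """Scale every element of v1 by the scalar a and return the result."""
    return [a * x for x in v1]


v2 = dscal(0.5, [8.0, 4.0, -2.0])
```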

16
DAXPY
  • Calculates v3 = a · v1 + v2
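
A minimal software equivalent, again following the slide's notation. BLAS DAXPY writes the result back into the second vector; returning a new list is a simplification for this sketch. Each element costs one multiply and one add, so a vector of length n takes 2n floating-point operations.

```python
def daxpy(a, v1, v2):
    """Return a * v1 + v2, element by element."""
    if len(v1) != len(v2):
        raise ValueError("vectors must have the same length")
    return [a * x + y for x, y in zip(v1, v2)]


v3 = daxpy(2.0, [1.0, 2.0, 3.0], [10.0, 20.0, 30.0])
```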

17
Hardware - BEE2 Board
18
Device Utilization (XC2VP70)
Cores            4-Input LUTs   Number of Occurrences   Total 4-Input LUTs   Total (%)
LINPACK Engine           4360                       6                26160          40
TMD-MPE                   896                       6                 5376           8
NetIf                     579                      11                 6369          10
PLB-MPE                  2685                       1                 2685           4
FSLs                       44                     154                 6776          10
FSL2IC                    349                       4                 1396           2
  • About 34% is dedicated to the network cores (TMD-MPE,
    NetIf, PLB-MPE, FSLs and FSL2IC)
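
The totals can be checked with a few lines of arithmetic. The per-core numbers below come from the table; the device capacity of 66,176 four-input LUTs for the XC2VP70 is an assumption added for this sketch, as the slides do not state it.

```python
# (core name, LUTs per instance, instances, counted as a network core?)
CORES = [
    ("LINPACK Engine", 4360, 6, False),
    ("TMD-MPE", 896, 6, True),
    ("NetIf", 579, 11, True),
    ("PLB-MPE", 2685, 1, True),
    ("FSLs", 44, 154, True),
    ("FSL2IC", 349, 4, True),
]
DEVICE_LUTS = 66176  # assumed XC2VP70 capacity; not stated in the slides


def utilization(cores, device_luts):
    """Return per-core LUT totals, the network total, and its device share in %."""
    totals = {name: per * count for name, per, count, _ in cores}
    network = sum(per * count for _, per, count, net in cores if net)
    return totals, network, 100.0 * network / device_luts


totals, network_luts, network_pct = utilization(CORES, DEVICE_LUTS)
```

Under that capacity assumption, the network cores use roughly one third of the device, close to the 40% taken by the six compute engines.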

19
Methods of Analysis
  • Method 1: Simulation
  • Modelsim waveform
  • Method 2: PPC Timer
  • Timing measured through the C code running on the PPC
  • Method 3: TMD-Profiler
  • Using an external profiler to analyze the engines

20
Processor vs FPGA
  • The most important portion is DGEFA
  • DGEFA benchmark with n = 100
  • Processor performance: 315 MFLOPS
  • FPGA performance (6 engines): 379 MFLOPS
  • Performance of 1 engine: 123 MFLOPS
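
These three figures imply a speedup and a parallel efficiency that the slide does not spell out. The sketch below derives them; the function names are hypothetical and the efficiency definition (total rate divided by engines times single-engine rate) is the usual one, assumed here.

```python
def speedup(new_mflops, base_mflops):
    """Ratio of two sustained rates."""
    return new_mflops / base_mflops


def parallel_efficiency(total_mflops, single_mflops, engines):
    """Fraction of ideal linear scaling actually achieved."""
    return total_mflops / (single_mflops * engines)


fpga_vs_cpu = speedup(379.0, 315.0)             # 6 engines vs. processor
six_vs_one = speedup(379.0, 123.0)              # 6 engines vs. 1 engine
efficiency = parallel_efficiency(379.0, 123.0, 6)
print(f"{fpga_vs_cpu:.2f}x, {six_vs_one:.2f}x, {efficiency:.0%}")
```

Six engines deliver about 3.1 times the rate of one engine, or roughly half of ideal scaling, which is consistent with the communication and idle time discussed on the following slides.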

21
Engines Speedup
[Figure: engine speedup plot, with regions labeled FPGA 1 and FPGA 2]
22
Problem
  • The engines' computation time is being exceeded by
    either communication or idle time
  • TMD-Profiler can be used to track the problem

For 8 Engines
23
TMD-Profiler
[Figure: TMD-Profiler timeline; legend: SEND, RECV, COMP]
24
Scaled Problem Size
[Figure: speedup plot for scaled problem size, with regions labeled FPGA 1 and FPGA 2]
25
Why super speedup?
  • As the matrix grows, the size of each column also
    increases
  • Since each engine holds exactly the same amount of
    data, the number of columns per engine decreases

[Figure annotations: "4 x Latency 20" and "2 x Latency 20"]
26
New Speedup
  • With a matrix size of 195 x 195
  • Performance of 6 engines (one FPGA): 628 MFLOPS
  • Performance of one processor: 324 MFLOPS
  • Speedup of FPGA over processor is 1.94x

27
Newer Technology
  • Max theoretical peak performance of an engine on the
    V2Pro is 200 MFLOPS
  • Newer FPGAs are larger and faster
  • Estimated peak performance for an engine network
    (20) on a Virtex 5 LX330: 4000 MFLOPS
  • Theoretical speedup, compared to a processor, is
    11.4x
  • Compared to HPL, the estimated speedup is 4.4x

28
Scaling to Larger Systems
  • LINPACK is meant to run in large multi-processor
    systems
  • Computer networks suffer from high latency
  • The tighter coupling and lighter protocol used in
    this FPGA system have potential to scale

29
Conclusions
  • TMD-MPE was used to parallelize the LINPACK hardware
    engine
  • Disadvantage: expensive in terms of device
    utilization
  • Advantage: higher flexibility
  • Max speedup of the engines over a processor is 1.9x
  • Newer FPGAs have a better chance of outperforming
    processors (est. 4000 MFLOPS for Virtex 5 LX330)
  • Multi-FPGA systems have good scalability
    potential due to low latencies

30
Future Work
  • Include DDR memory
  • Improve broadcast method (e.g. to tree approach)
  • Optimize DAXPY flow
  • Replicate DAXPY flow inside each engine
  • Explore newer technologies and scalability
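
To illustrate the tree approach mentioned for the broadcast, the sketch below compares two send schedules. It assumes, for the sake of comparison, a baseline in which the root sends to every other rank in turn; the slides do not describe the current method in that detail. Both function names are hypothetical.

```python
def linear_broadcast(nranks, root=0):
    """Root sends to each other rank in turn: nranks - 1 sequential steps."""
    return [[(root, r)] for r in range(nranks) if r != root]


def tree_broadcast(nranks, root=0):
    """Binomial-tree schedule: every rank holding the data forwards it each round."""
    rounds = []
    have = 1                      # ranks 0..have-1 (relative to root) hold the data
    while have < nranks:
        sends = []
        for src in range(have):
            dst = src + have
            if dst < nranks:
                sends.append(((src + root) % nranks, (dst + root) % nranks))
        rounds.append(sends)
        have *= 2
    return rounds


for n in (6, 8, 16):
    print(n, "ranks:", len(linear_broadcast(n)), "linear steps,",
          len(tree_broadcast(n)), "tree rounds")
```

The tree needs about log2 of the rank count in rounds instead of one step per rank, and since the broadcast runs once per elimination step, the saving is repeated for every column of the matrix.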

31
Thank You (Questions?)
32
Additional Slides
33
DGEFA Code
34
MPE Protocol
35
LINPACK Report
36
Opcode TAG
37
Matrix Distribution
  • Considering an n x n matrix and 3 ranks
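
The figure for this slide did not survive in the transcript, so the layout actually used is unknown. As an illustration only, the sketch below shows two common ways of dividing the columns of an n x n matrix among 3 ranks; both function names are hypothetical.

```python
def block_columns(n, nranks):
    """Contiguous blocks of columns, sizes differing by at most one."""
    base, extra = divmod(n, nranks)
    owned = []
    start = 0
    for r in range(nranks):
        size = base + (1 if r < extra else 0)
        owned.append(list(range(start, start + size)))
        start += size
    return owned


def cyclic_columns(n, nranks):
    """Round-robin assignment: column j goes to rank j % nranks."""
    return [list(range(r, n, nranks)) for r in range(nranks)]


print(block_columns(7, 3))    # [[0, 1, 2], [3, 4], [5, 6]]
print(cyclic_columns(7, 3))   # [[0, 3, 6], [1, 4], [2, 5]]
```

LU factorization retires columns from left to right, so with a block layout the rank holding the first block runs out of work early, while a cyclic layout keeps all ranks busy until near the end.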

38
Processor vs. LINPACK Engine
  • Whole LINPACK benchmark with n = 100
  • Performance (MFLOPS)
  • Processor: 319 MFLOPS
  • LINPACK Engine: 164 MFLOPS

39
IDAMAX
40
DSCAL
41
DAXPY
42
FLOPS
43
16 Engines