SciDac2 Kickoff Meeting

Transcript and Presenter's Notes

Title: SciDac2 Kickoff Meeting


1
 SciDac2 Kickoff Meeting
  • Karen A. Tomko
  • Electrical and Computer Engineering Department
  • email Karen.Tomko_at_uc.edu

2
Outline of talk
  • Other Application areas
  • Crash Worthiness
  • Computational Electromagnetics
  • Computational Fluid Dynamics
  • Research Interests
  • Application Performance Challenges
  • Reconfigurable High Performance Computing

3
Crash Worthiness
  • with S. Abraham, E.S. Davidson, Q. Stout on a Ford
    Motor Co. sponsored project
  • Finite Element Method
  • Newtonian physics, deformation models for
    crumpling of car body
  • 100,000 lines of Fortran 77

1996 Ford Taurus Model, http://crash.ncac.gwu.edu/pradeep/Models.html
  • Parallelization and performance enhancement for
    shared memory and distributed memory systems
  • Weighted and multi-constraint domain
    decomposition using graph partitioning algorithms
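
A weighted, multi-constraint decomposition is usually judged by two things: how evenly each kind of work is spread over the parts, and how many mesh connections cross part boundaries. The sketch below only evaluates a given assignment; it is not the partitioner used in the project. The function names, the two-component weights, and the tiny chain graph in the usage note are all hypothetical.

```python
def part_loads(weights, assignment, n_parts):
    """Sum every weight component per part.

    weights[v] is a tuple with one entry per constraint (hypothetical
    example: element cost and contact-search cost).
    assignment[v] is the part index of vertex v.
    """
    n_constraints = len(weights[0])
    loads = [[0.0] * n_constraints for _ in range(n_parts)]
    for vertex, part in enumerate(assignment):
        for c in range(n_constraints):
            loads[part][c] += weights[vertex][c]
    return loads


def imbalance(loads):
    """Return max load divided by average load, one value per constraint."""
    n_parts = len(loads)
    result = []
    for c in range(len(loads[0])):
        column = [loads[p][c] for p in range(n_parts)]
        average = sum(column) / n_parts
        result.append(max(column) / average if average > 0 else 1.0)
    return result


def edge_cut(edges, assignment):
    """Count edges whose endpoints sit in different parts."""
    return sum(1 for u, v in edges if assignment[u] != assignment[v])
```

For a four-vertex chain with weights `[(1, 0), (1, 2), (1, 2), (1, 0)]`, the assignments `[0, 0, 1, 1]` and `[0, 1, 0, 1]` balance both constraints equally well, but the first cuts one edge and the second cuts three, which is why a partitioner considers both criteria.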

4
Computational Electromagnetics
  • with L. Katehi, C. Sarris et al.
  • Time-accurate wireless communication simulation
    (the transients are what is of interest)
  • Solution of Maxwell's equations, based on Yee's
    FDTD approach
  • Adaptive multi-resolution using the Haar-wavelet
    transform
  • C with MPI, dynamic domain decomposition using
    Zoltan (K. Devine et al.)
  • one level of a multi-resolution modeling problem
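
As a rough illustration of the Haar-wavelet idea, the sketch below applies one level of the orthonormal Haar transform to a 1-D list and flags where the detail coefficients are large enough to justify finer resolution. It is not the solver's code; the function names and the threshold rule are assumptions made for the example.

```python
import math


def haar_step(signal):
    """One level of the orthonormal Haar transform of an even-length list.

    Returns (approx, detail): scaled pairwise sums and differences.
    """
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    scale = 1.0 / math.sqrt(2.0)
    approx = [(signal[k] + signal[k + 1]) * scale
              for k in range(0, len(signal), 2)]
    detail = [(signal[k] - signal[k + 1]) * scale
              for k in range(0, len(signal), 2)]
    return approx, detail


def inverse_haar_step(approx, detail):
    """Rebuild the original samples from one level of coefficients."""
    scale = 1.0 / math.sqrt(2.0)
    signal = []
    for a, d in zip(approx, detail):
        signal.append((a + d) * scale)
        signal.append((a - d) * scale)
    return signal


def refine_mask(detail, threshold):
    """Hypothetical adaptivity rule: refine where |detail| exceeds threshold."""
    return [abs(d) > threshold for d in detail]
```

On a smooth region the detail coefficients vanish and the coarse approximation is enough; only the pairs that differ noticeably are marked for refinement.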

5
COSITE INTERFERENCE IN A VEHICULAR TRANSCEIVER
NETWORK WITHIN A FOREST ENVIRONMENT: A HYBRID
FDTD/MOM APPROACH
  • Problem Statement
  • The in-forest communication between multi-antenna
    mobile transmit-receive units is considered.
    Issues to address:
  • Forest propagation and multi-path (FDTD modeling
    requires enormous resources).
  • Effect of an arbitrary platform (MoM requires an
    extremely complex Green's function).
  • Operation of transceiver electronics under
    cosite interference conditions (MoM is incompatible
    with SPICE-type solvers such as TRANSIM).

humvee.net
  • Modeling Approach
  • Use the Method of Moments (Sarabandi and Koh,
    IEEE AP-49, Feb. 2001) to model wave propagation
    through the forest.
  • Enclose the vehicular transceivers in an FDTD
    mesh to rigorously model the effect of the
    platform and the transceiver architecture, as in
    Sarris et al., Proc. 2001 IEEE AP-S.

Joint CEN-5/FCS work. Contributors:
CEN-5: C. D. Sarris, W. Thiel, L. P. Katehi
FCS: I.-S. Koh, K. Sarabandi
6
Computational Fluid Dynamics
  • with D. Rizzetta, P. Morgan, M. Visbal and also
    with A. Hamed, D. Basu, Q. Liu
  • Time-accurate solution of the Navier-Stokes
    equations, overset (Chimera) grids
  • Unsteady and turbulent fluid flow and acoustics
    modeling
  • Fortran 77, coarse-level parallelization with MPI
  • Memory and cache analysis
  • Variety of numerical models implemented and
    compared
  • New effort: turbomachinery modeling with M.
    Turner
  • multi-scale, unsteady, discontinuities,
    multi-domain modeling, periodicity and symmetry

7
Hybrid Turbulence Models
[Figure: cavity mid-span axial vorticity contours; panels: Baseline Grid, Fine Grid]
8
FPGA-based Reconfigurable High Performance
Computing
  • Field-programmable Gate Arrays (FPGA)
  • Programmable digital logic
  • Manufacturers: Xilinx, Altera, others
  • Trends that make FPGAs especially appealing
  • Computational capacity of FPGAs has been scaling
    faster than that of CPUs
  • Current generation chips are able to support
    large numbers of floating point units

[Figure labels: FPGA, Software, Hardware]
9
Programming the FPGA
  • FPGAs are programmed, or configured, with a
    sequence of bits containing the contents of the
    LUTs and the control bits determining the
    connections between LUTs, flip-flops, block RAM,
    etc.
  • This sequence of bits is referred to as the
    configuration or bit file.
  • Programmed/Reprogrammed on-the-fly in
    microseconds.
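
To make the notion of LUT contents concrete, the sketch below models a 4-input look-up table as 16 configuration bits, one per input combination. It is a software illustration only and says nothing about the actual Xilinx bit-file layout; `make_lut` and `lut_eval` are hypothetical helper names.

```python
def make_lut(func, n_inputs=4):
    """Build the configuration bits of a LUT implementing func.

    Bit number addr holds the output for the input combination whose
    binary encoding is addr (input 0 is the least significant bit).
    """
    bits = []
    for addr in range(2 ** n_inputs):
        inputs = [(addr >> pos) & 1 for pos in range(n_inputs)]
        bits.append(func(*inputs) & 1)
    return bits


def lut_eval(bits, *inputs):
    """Evaluate the LUT: the inputs form an address into the stored bits."""
    addr = sum(bit << pos for pos, bit in enumerate(inputs))
    return bits[addr]
```

Loading a different 16-bit pattern turns the same table from, say, a 4-input XOR into a 4-input AND, which is the sense in which the logic is reconfigured rather than rewired.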

10
Xilinx Virtex-II Architecture
Figure from T. El-Ghazawi, K. Gaj, and D.
Pointer, Reconfigurable Supercomputing Systems
tutorial, RSSI 05
11
Configurable Logic Block (CLB) of Xilinx
Virtex™ 2.5 V FPGA
  • 4 logic cells
  • 4-input look-up table
  • Carry logic
  • D flip-flop

Figure from Virtex™ 2.5 V FPGA datasheet by
Xilinx
12
Trends in FPGA Floating Point Capabilities
From V. Natoli, "A Computational Physicist's View
of Reconfigurable High Performance Computing,"
Stone Ridge Technology, RSSI, July 05
13
Xilinx XC4VLX200
  • 32 bit Integer and Fixed Point
  • Thousands of Arithmetic Units
  • Floating Point
  • 600 SP Floating Point Multipliers
  • 100 SP Floating Point Dividers
  • 100 DP Floating Point Multipliers
  • 20 DP Floating Point Dividers
  • SP ≠ 2 x DP
  • Theoretical Peaks
  • SP floating point: 20-120 GFLOPS
  • DP floating point: 4-20 GFLOPS
  • Integer: 0.5-1 TOPS

90 nm; 200,448 logic cells; 750 kB BRAM; 96
18x18-bit multipliers; clock up to 500 MHz
From V. Natoli, "A Computational Physicist's View
of Reconfigurable High Performance Computing,"
Stone Ridge Technology, RSSI, July 05
14
An FPGA-based FDTD Solver for Reconfigurable High
Performance Computing
15
FDTD
  • Maxwell's equations were solved using integral
    equations until Yee introduced the Finite-Difference
    Time-Domain (FDTD) method.
  • The FDTD calculation is highly parallel and is
    currently run as parallel simulations on
    high performance computing (HPC) clusters.
  • Fairly linear improvement in computation time.
  • How to get even further speed-up on HPC systems?

16
FDTD
  • Target System

[Figure: Beowulf cluster; three nodes, each with a CPU and an FPGA, connected by a network]
  • FPGA performs the computation
  • Host Software moves the data.
  • FPGA communication
  • HPC communication

17
FDTD
  • Relation of the Equations
  • $H_x^{t}[i][j] = H_x^{t-1}[i][j] - \mathrm{dtumdy}\,\bigl(E_z^{t-0.5}[i+1][j+1] - E_z^{t-0.5}[i+1][j]\bigr)$

[Figure: timeline of Hx/Hy calculations and transfers alternating with Ez calculations and transfers]
18
FDTD
  • The FDTD calculations have both temporal and
    spatial locality.

[Figure: Hx/Hy update engine datapath built from Add, Multiply (by a constant), and Delay blocks, with Ez[i][j] as input and Hx/Hy[i][j] as input and output]
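  • $H_x^{t}[i][j] = H_x^{t-1}[i][j] - \mathrm{dtumdy}\,\bigl(E_z^{t-0.5}[i+1][j+1] - E_z^{t-0.5}[i+1][j]\bigr)$
  • $H_y^{t}[i][j] = H_y^{t-1}[i][j] + \mathrm{dtumdx}\,\bigl(E_z^{t-0.5}[i+1][j+1] - E_z^{t-0.5}[i][j+1]\bigr)$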
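
The magnetic updates can be sketched in software as a pair of nested loops. This is a minimal illustration, not the FPGA update engine: it assumes the slide's index offsets mean i+1 and j+1, that the Hy term is added, and that `dtumdx` and `dtumdy` are precomputed constants. The function name `update_h` and the list-of-lists grid layout are hypothetical.

```python
def update_h(hx, hy, ez, dtumdx, dtumdy):
    """Advance Hx and Hy in place by one step from the current Ez.

    All fields are lists of rows with identical shape. The last row and
    column are left untouched because the stencil reads ez[i+1][j+1].
    """
    rows, cols = len(ez), len(ez[0])
    for i in range(rows - 1):
        for j in range(cols - 1):
            # Hx depends on the difference of Ez along j.
            hx[i][j] -= dtumdy * (ez[i + 1][j + 1] - ez[i + 1][j])
            # Hy depends on the difference of Ez along i.
            hy[i][j] += dtumdx * (ez[i + 1][j + 1] - ez[i][j + 1])
```

Each output value reads only neighbouring Ez values and its own previous value, which is the spatial and temporal locality the slide refers to, and the Hx and Hy updates are independent of each other.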

19
FDTD
  • Ez calculation has more operations.

[Figure: Ez update engine datapath built from Add, Multiply (by a constant), and Delay blocks, with Hx[i][j], Hy[i][j], and Hy[i-1][j] as inputs and Ez[i][j] as input and output]
$E_z^{t}[i][j] = E_z^{t-1}[i][j] + \mathrm{dtepsdx}\,\bigl(H_y^{t-0.5}[i][j-1] - H_y^{t-0.5}[i-1][j-1]\bigr) - \mathrm{dtepsdy}\,\bigl(H_x^{t-0.5}[i-1][j] - H_x^{t-0.5}[i-1][j-1]\bigr)$
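
A software sketch of the electric update follows. It is not the FPGA engine: it assumes the Hy term is added and the Hx term subtracted, and that `dtepsdx` and `dtepsdy` are precomputed constants. The function name `update_ez` and the list-of-lists layout are hypothetical.

```python
def update_ez(ez, hx, hy, dtepsdx, dtepsdy):
    """Advance Ez in place by one step from the current Hx and Hy.

    All fields are lists of rows with identical shape. Row 0 and
    column 0 are left untouched because the stencil reads [i-1][j-1].
    """
    rows, cols = len(ez), len(ez[0])
    for i in range(1, rows):
        for j in range(1, cols):
            d_hy = hy[i][j - 1] - hy[i - 1][j - 1]  # difference along i
            d_hx = hx[i - 1][j] - hx[i - 1][j - 1]  # difference along j
            ez[i][j] += dtepsdx * d_hy - dtepsdy * d_hx
```

Counting the arithmetic per grid point gives two multiplies and four add/subtract operations here, against one multiply and two add/subtract operations for each magnetic component, which is consistent with the Ez calculation being the larger one.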
20
Cray XD1 System Architecture
Cray XD1 Chassis
21
Cray XD1-Expansion Module
  • AAP: FPGA, Xilinx Virtex-II Pro (xc2vp50-7)
  • RAP: RapidArray Processor

Cray XD1 Expansion Module
22
Baseline Implementation
  • Update engines created by Gandhi [2]
  • Floating point units provided by Belanovic [3] at
    NEU
  • Two clocks: system and update engines
  • Magnetic Updates in parallel (Hx and Hy)
  • Electric update (Ez) every 2 clock cycles
  • Multiple update cycles w/o host intervention
  • Local SRAMs for input and output data
  • SRAMs as ping-pong buffers
  • Slower than Opterons alone
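
The ping-pong buffering mentioned above can be sketched in software as two buffers that swap roles after every update cycle, so each cycle reads a complete, unmodified copy of the previous values. This is an illustration only; `run_ping_pong` and the `update` callback signature are hypothetical and do not describe the SRAM controller.

```python
def run_ping_pong(initial, update, n_steps):
    """Run n_steps update cycles using two alternating buffers.

    update(src, k) returns the new value of element k computed from the
    read-only source buffer src. No copy is made between cycles; the
    buffers simply exchange the roles of source and destination.
    """
    buffers = [list(initial), [0.0] * len(initial)]
    src = 0
    for _ in range(n_steps):
        dst = 1 - src
        for k in range(len(initial)):
            buffers[dst][k] = update(buffers[src], k)
        src = dst  # the freshly written buffer becomes the next source
    return buffers[src]
```

Because the roles alternate without host involvement, several update cycles can run back to back before any result is read out.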

23
FPGA Implementation in Cray XD1
[Figure: FPGA block diagram. Blocks: rt_core, rt_client, mux, app_fdtd, qdr_fdtd, qdr2_core, prog_clock_gen. Interfaces: Fabric Request Interface, User Request Interface, Host Processor Interface, Transmit Data Bus, Receive Data Bus, QDR 1-4 Interfaces to QDR II SRAM1-4 Interfaces, Clock Signals]
  • Cray IP cores: rt_core, qdr2_core
  • rt_client: agent for the host
  • qdr_fdtd: instantiates and controls update
    engine operations
  • mux: multiplexes requests from rt_client and
    qdr_fdtd
  • prog_clock_gen: clock for different blocks, uses
    DCM

24
Performance Analysis: Existing Design
  • Time for one electromagnetic field value update
  • Tcray: total time taken by the Cray XD1 to update
    one electromagnetic field value
  • Dcray: latency of the QDR II SRAMs
  • M: latency of the magnetic (Hx and Hy) update
    engine
  • E: latency of the electric (Ez) update engine
  • N: size of the electromagnetic matrix
    processed by the FPGA
  • (N = Grid_Row x Grid_Column, Grid_Row mod 2 = 0
    and Grid_Column mod 2 = 0)
  • k: minimum 3
  • Tu: clock period of the update engines
  • 2C: number of cycles of the FDTD algorithm
    calculation
  • Design constants: M = 22, E = 30, k = 3
  • Significant variables: N, C, Tu (most
    significant)
  • Reducing Tcray: decrease Tu (most significant),
    M and E (not significant)
  • Increase N and C (typically high): not
    very significant

25
Areas to improve performance
  • Reducing time for update of one value
  • Improve clock speed of update engines
  • Higher clock speed of floating point units
  • Single Clock signal
  • Correct reset behavior
  • SRAM R/W address generation scheme
  • Increasing Throughput
  • One Ez result per clock cycle (vs 1 per 2 cycles)
  • FPGA-initiated boundary output data transfer
  • Multiple copies of update engines
  • FPGA-to-FPGA transfer of boundary data
  • Using Pre-synthesized floating point units
    (Sandia Labs, USA)

26
Performance Comparison
27
Performance Comparison
[Chart legend: Original Design Units; Sandia, Optimized Units]