1
Design and Analysis of Parallel N-Queens on
Reconfigurable Hardware with Handel-C and MPI
  • Vikas Aggarwal, Ian Troxel, and Alan D. George
  • High-performance Computing and Simulation (HCS)
    Research Lab
  • Department of Electrical and Computer Engineering
  • University of Florida
  • Gainesville, FL

2
Outline
  • Introduction
  • N-Queens Solutions
  • Backtracking Approach
  • N-Queens Parallelization
  • Experimental Setup
  • Handel-C and Lessons Learned
  • Results and Analysis
  • Conclusions
  • Future Work and Acknowledgements
  • References

3
Introduction
  • N-Queens dates back to the 19th century
    (studied by Gauss)
  • Classical combinatorial problem, widely used
    as a benchmark because of its simple and regular
    structure
  • Problem involves placing N queens on an N × N
    chessboard such that no queen can attack any
    other
  • Benchmark code versions include finding the first
    solution and finding all solutions

4
Introduction
  • Mathematically stated
  • Find a permutation of the BOARD() vector
    containing the numbers 1..N, such that
    for any $i \neq j$:
  • $\mathrm{Board}(i) - i \neq \mathrm{Board}(j) - j$
  • $\mathrm{Board}(i) + i \neq \mathrm{Board}(j) + j$
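
The two diagonal conditions above can be checked directly in code. The following is a minimal sketch, not code from the talk; the function name `is_valid_board` and the list-based board representation are assumptions made for illustration.

```python
def is_valid_board(board):
    """Return True if `board` is an N-Queens solution.

    board[i] is the row (1..N) of the queen in column i+1, so the list
    plays the role of the BOARD() vector from the slide.
    """
    n = len(board)
    # A permutation of 1..N guarantees one queen per row and per column.
    if sorted(board) != list(range(1, n + 1)):
        return False
    # Columns are numbered 1..N to match the slide's notation.
    diffs = {board[i] - (i + 1) for i in range(n)}  # Board(i) - i
    sums = {board[i] + (i + 1) for i in range(n)}   # Board(i) + i
    # All differences distinct and all sums distinct: no shared diagonal.
    return len(diffs) == n and len(sums) == n
```

For example, `is_valid_board([1, 5, 8, 6, 3, 7, 2, 4])` accepts a well-known 8-queens solution, while the identity permutation fails because every queen lies on the same diagonal.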

5
N-Queens Solutions
  • Various approaches to the problem
  • Brute force [2]
  • Local search algorithms [4]
  • Backtracking [2, 7, 11, 12, 13]
  • Divide and conquer approach [1]
  • Permutation generation [2]
  • Mathematical solutions [6]
  • Graph theory concepts [2]
  • Heuristics and AI [4, 14]

6
Backtracking Approach
  • One of the only approaches that guarantees a
    solution, though it can be slow
  • Can be seen as a form of intelligent depth-first
    search
  • Complexity of backtracking typically rises
    exponentially with problem size
  • Good test case for performance analysis of RC
    systems, as the problem is complex even for small
    data sizes
  • Traditional processors provide a suboptimal
    platform for this iterative application due to
    the serial nature of their processing pipelines
  • Tremendous speedups achieved by adding
    parallelism at the logic level via RC

For an 8x8 board, 981 moves (876 tests + 105
backtracks) are required for the first solution alone
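
The depth-first character of backtracking can be sketched in software as follows. This is an illustrative Python version under assumed names (`first_solution`, 0-based rows), not the Handel-C design from the talk.

```python
def first_solution(n):
    """Return the first N-Queens solution found by backtracking, or None."""
    board = []  # board[c] = row (0-based) of the queen in column c

    def safe(row):
        # The candidate goes in the next free column.
        col = len(board)
        return all(r != row and abs(r - row) != col - c
                   for c, r in enumerate(board))

    def place():
        if len(board) == n:
            return True
        for row in range(n):
            if safe(row):
                board.append(row)
                if place():
                    return True
                board.pop()  # backtrack: undo the placement, try next row
        return False

    return board if place() else None
```

Columns are filled left to right and rows are tried in increasing order, so the result is the lexicographically first solution; sizes 2 and 3 have no solution and return `None`.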
7
Backtracking Approach
Tables: number of operations for 1st solution [7];
number of solutions [8]
  • Tables provide an estimate of the backtracking
    approach's complexity
  • Problem can be made to find the first solution or the
    total number of solutions
  • Finding the total number of solutions is the more
    challenging problem
  • Interesting observation: the 1st solution's
    complexity (i.e. number of operations) does not
    increase monotonically with board size
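
The non-monotonic behavior can be observed with a small counting experiment. The sketch below is an assumption-laden illustration: it counts one "test" per candidate row examined and one "backtrack" per queen removed, which may differ from the counting convention behind the tables cited in the talk, so the absolute numbers are not expected to match them.

```python
def count_operations(n):
    """Return (tests, backtracks) spent reaching the first solution."""
    board = []
    counts = {"tests": 0, "backtracks": 0}

    def safe(row):
        col = len(board)
        return all(r != row and abs(r - row) != col - c
                   for c, r in enumerate(board))

    def place():
        if len(board) == n:
            return True
        for row in range(n):
            counts["tests"] += 1           # one candidate row examined
            if safe(row):
                board.append(row)
                if place():
                    return True
                board.pop()
                counts["backtracks"] += 1  # one queen removed
        return False

    place()
    return counts["tests"], counts["backtracks"]


for n in range(4, 9):
    print(n, count_operations(n))
```

Under this convention the 4 × 4 board already needs more tests than the 5 × 5 board, because the 5 × 5 search reaches a solution without ever backtracking.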

8
N-Queens Parallelization
  • Different levels of parallelism added to improve
    performance
  • Functional Unit replication
  • Parallel column check
  • Parallel row check

Parallelization Comparison
  • Sequential: 11 cycles
  • Parallel column check: 3 cycles
  • Multiple row check appended: 1 cycle
    (11x speedup over sequential operation)
Note: Assume the first four queens have been placed
and the fifth queen starts from the 1st row
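
The three levels of checking can be compared with a toy cost model. This sketch is hypothetical: the function `cycle_model`, the one-comparison-per-cycle assumption, and the example placement are all illustrative choices, and the slide's 11-cycle figure comes from its own diagram and cost model, which the transcript does not preserve.

```python
def cycle_model(placed, n):
    """Estimate cycles to find the first safe row in the next column.

    placed[c] is the row (0-based) of the queen already in column c.
    Returns (sequential, column_parallel, row_parallel).
    """
    col = len(placed)

    def conflicts(row, c):
        return placed[c] == row or abs(placed[c] - row) == col - c

    sequential = 0       # one comparison with one earlier queen per cycle
    column_parallel = 0  # all earlier queens compared at once, one row per cycle
    for row in range(n):
        column_parallel += 1
        hit = False
        for c in range(col):
            sequential += 1
            if conflicts(row, c):
                hit = True
                break  # sequential check stops at the first conflict
        if not hit:
            break
    # Checking every row against every earlier queen at once takes one cycle.
    return sequential, column_parallel, 1
```

With four queens placed at rows `[0, 4, 7, 5]` on an 8 × 8 board, the model gives 7 sequential comparisons, 3 cycles with the parallel column check, and 1 cycle once all rows are checked together.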
9
Experimental Setup
  • Experiments conducted using RC1000 boards from
    Celoxica, Inc., and Tarari RC boards from Tarari,
    Inc.
  • Each RC1000 board features a Xilinx Virtex-2000
    FPGA, 8 MB of on-card SRAM, and PCI Mezzanine
    Card (PMC) sockets for connecting two daughter
    cards
  • Each Tarari board features two user-programmable
    Xilinx Virtex-II FPGAs in addition to a
    controller FPGA, 256 MB of DDR SDRAM
  • Configurations designed in Handel-C using
    Celoxica's application mapping tool DK-2, along
    with Xilinx ISE for place and route
  • Performance compared against 2.4 GHz Xeon server
    and 1.33 GHz Athlon server

10
Celoxica RC1000
  • PCI-based card having one Xilinx FPGA and four
    memory banks
  • FPGA configured from the host processor over the
    PCI bus
  • Four memory banks, each of 2MB, accessible to
    both the FPGA and any other device on the PCI bus
  • Data transfers: the RC1000 provides 3 methods of
    transferring data over the PCI bus between the
    host processor and the FPGA
  • Bulk data transfers performed via memory banks
  • Two unidirectional 8-bit ports, called control
    and status ports, for direct communication
    between FPGA and PCI bus (note: this method was
    used in our experiments)
  • User I/O pins USER1 and USER0 for single-bit
    communication with the FPGA
  • API-layer calls from host to configure and
    communicate with RC board

Figure courtesy of Celoxica RC1000 manual
11
Tarari Content Processing Platform
  • PCI-based board having 3 FPGAs and a 256 MB
    memory bank
  • Two Xilinx Virtex-II FPGAs available for user to
    load configuration files from host over the PCI
    bus
  • Each Content Processing Engine or CPE (user FPGA)
    configured with one or two agents
  • Third FPGA acts as controller, providing
    high-bandwidth access to memory and configuration
    of the CPP with agents
  • 256 MB of DDR SDRAM for data sharing between
    CPEs and the host application
  • Configuration files first uploaded into the
    memory slots and then used to configure each FPGA
  • Both single-word transfers and DMA transfers
    supported between the host and the CPP

Figure courtesy of Tarari CP-DK manual
12
Handel-C Programming Paradigm
  • Handel-C acts as a bridge between VHDL and C
  • Comparison with conventional C
  • More explicit provisioning of parallelism within
    the code
  • Variables declared to have the exact bit-lengths
    to save space
  • Provides more bit-level manipulations beyond
    shifts and logic operations
  • Limited support for many ANSI C standards and
    extensions
  • Comparison with VHDL
  • Application porting is much faster for
    experienced coders
  • Similar to VHDL behavioral models
  • Lacks VHDL concurrent signal assignments which
    can be suspended until changes on input triggers
    (Handel-C requires polling)
  • Provides more higher-level routines

13
Handel-C Design Specifics
  • Design makes use of the following two approaches
  • Approach 1
  • Use of an array of binary numbers to hold a 1
    at a particular bit position to indicate the
    location of queen in the column
  • A 32 x 32 board will require an array of 32
    elements of 32 bits each
  • Correspondingly use bit-shift operations and
    logical-and operations to check diagonal and row
    conditions
  • More closely corresponds to the way the
    operations will take place on the RC fabric
  • Approach 2
  • Use of an array of integers instead of binary
    numbers
  • Correspondingly use the mathematical model of the
    problem to check the validation conditions
  • Smaller variables yield better device
    utilization: slices occupied drop from about
    75% to about 15% for similar performance and
    parallelism
  • Approach 2 found to be more amenable for Handel-C
    designs
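
The difference between the two representations can be sketched in software. This is a Python illustration under assumed names (`safe_bitwise`, `safe_integer`), not the authors' Handel-C code: Approach 1 stores one N-bit word per column and uses shifts and ANDs, while Approach 2 stores small integers and applies the mathematical model.

```python
def safe_bitwise(columns, row, n):
    """Approach 1: columns[c] is an n-bit word with a single 1 at the queen's row."""
    col = len(columns)
    cand = 1 << row          # candidate queen as a one-hot word
    mask = (1 << n) - 1      # keep shifted bits inside the n-bit board
    for c, word in enumerate(columns):
        d = col - c
        # Squares attacked in the new column: same row plus both diagonals,
        # obtained by shifting the earlier queen's bit by the column distance.
        attacked = word | ((word << d) & mask) | (word >> d)
        if attacked & cand:
            return False
    return True


def safe_integer(rows, row):
    """Approach 2: rows[c] is the queen's row stored as a small integer."""
    col = len(rows)
    return all(r != row and r - c != row - col and r + c != row + col
               for c, r in enumerate(rows))
```

Both functions answer the same question, so they should agree on every candidate row; the integer form needs only about log2(N) bits per column instead of N, which is in line with the utilization improvement noted above.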

14
Lessons Learned with Handel-C
  • Some interesting observations
  • Code for which place and route did not work,
    finally worked when the function parameters were
    replaced by global variables
  • Less control at lower level, with place and route
    being a consistent problem even with designs
    using up only 40% of total slices
  • Self-referenced operations (e.g. a = a + x) affect
    the design adversely, so use intermediate
    variables
  • Order of operations and conditional statements
    can affect design
  • Useful to reduce wider-bit operations into a
    sequence of narrower-bit operations
  • Balancing if with else branches leads to
    better designs
  • Comments in the main program sometimes affected
    the synthesis, leading to place and route errors
    in fully commented code
  • We are still learning more every day!

15
Sequential First-Solution Results
  • Sequential version does not perform well versus
    the Xeon and Athlon CPUs
  • Algorithm needs an efficient design to minimize
    resource utilization
  • The results do not include the one-time
    configuration overhead of 150 ms

RC1000 clock speed @ 40 MHz
16
Parallel First-Solution Results

RC1000 clock speed @ 25 MHz
  • The most parallel algorithm runs about 20x
    faster than the sequential algorithm on RC fabric
  • Parallel algorithm with two row checks almost
    matches the performance of the 2.4 GHz Xeon server,
    while 6-row check outperforms it by 74%
  • Further increasing the number of rows checked is
    likely to further improve performance for larger
    problem sizes

17
Total Number of Solutions Method
  • Employ divide-and-conquer approach
  • Seen as a parallel depth-first search
  • Solutions obtained with queen positioned in any
    row in the first column are independent from
    solutions with queens in other positions
  • Technique allows for a high degree of
    parallelism (DoP)
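
The independence of the sub-problems can be illustrated by counting solutions separately for each first-column row and summing the results. The sketch below uses assumed names (`count_with_first_row`, `count_all`) and plain recursion; it shows the decomposition only, not the hardware design.

```python
def count_with_first_row(n, first_row):
    """Count solutions whose first-column queen sits in `first_row`."""
    def extend(board):
        col = len(board)
        if col == n:
            return 1
        total = 0
        for row in range(n):
            if all(r != row and abs(r - row) != col - c
                   for c, r in enumerate(board)):
                total += extend(board + [row])
        return total

    return extend([first_row])


def count_all(n):
    # Each first-column row is an independent sub-problem; the total is the sum.
    return sum(count_with_first_row(n, r) for r in range(n))
```

Because no sub-problem shares state with another, each call to `count_with_first_row` could run on a separate functional unit or board.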

18
One-Board Total-Solutions Results
RC1000 and Tarari clock speed @ 33 MHz
  • Designs on hardware perform around 1.7x faster
    than Xeon server
  • Performance on both RC platforms similar for same
    clock rates
  • RC1000 performs a notch better for smaller chess
    board sizes while Tarari CPP's performance
    improves with chess board sizes
  • Almost entire Virtex-II chip on the Tarari is
    occupied for one FU

19
Multiple Functional Units (FUs)
  • Used additional FUs per chip to increase
    parallelism per chip
  • Each FU searches for the number of solutions
    corresponding to a subset of rows in the first
    column
  • The controller
  • Handles communication with the host
  • Invokes all FUs in parallel
  • Combines all results

[Figure: host processor connected to an on-board FPGA
containing a controller and functional units fu1 to fu10]
20
Total-Solutions Results with Multiple FUs
RC1000 clock speed @ 30 MHz
  • RC1000 with three FUs performs almost 5x faster
    than Xeon server
  • Speedup increases near linearly with number of
    FUs
  • Area occupied scales linearly with number of FUs

RC speedup vs. Xeon server for board size of 17
21
MPI for Inter-Board Communication
  • To further increase system speedup (having more
    functional units), multiple boards employed
  • Each FU programmed to search a subset of the
    solution space
  • Servers communicate using the Message Passing
    Interface (MPI) to start search in parallel and
    obtain the final result
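
The host-side coordination can be sketched with MPI as below. This is an assumption-heavy illustration: it uses the `mpi4py` Python bindings rather than the authors' host code, does the counting in software instead of on an FPGA, and the helper names (`rows_for_rank`, `local_count`, `run_with_mpi`) and the round-robin split are hypothetical.

```python
def count_from(n, board):
    """Count the solutions that extend the partial placement `board`."""
    col = len(board)
    if col == n:
        return 1
    return sum(count_from(n, board + [row])
               for row in range(n)
               if all(r != row and abs(r - row) != col - c
                      for c, r in enumerate(board)))


def rows_for_rank(n, rank, size):
    """First-column rows assigned to one MPI rank (round-robin, assumed)."""
    return list(range(rank, n, size))


def local_count(n, rank, size):
    """Work done by one host; in the talk this part runs on its RC board."""
    return sum(count_from(n, [r]) for r in rows_for_rank(n, rank, size))


def run_with_mpi(n):
    """Start the search on every rank and gather the total on rank 0."""
    from mpi4py import MPI  # imported here so the helpers work without MPI

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()
    total = comm.reduce(local_count(n, rank, size), op=MPI.SUM, root=0)
    if rank == 0:
        print(f"N={n}: {total} solutions across {size} ranks")
    return total
```

A script that calls `run_with_mpi(12)` would be launched with something like `mpiexec -n 4 python script.py`. Only one integer per rank crosses the network, which is consistent with the high computation-to-communication ratio reported in the results.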

[Figure: two host servers, each with an on-board FPGA
(with one or multiple FUs), communicating via MPI]
22
Total-Solutions Results with MPI
Tarari CPP clock speed @ 33 MHz
RC speedup vs. Xeon server for board size of 12
  • Results show total execution time including MPI
    overhead
  • Minimal MPI overhead incurred (high
    computation-to-communication ratio)
  • Communication overhead bounded to 3 ms regardless
    of problem size; initialization overhead is
    around 750 ms
  • Overhead becomes negligible for large problem
    sizes
  • Speedup scales near linearly with number of
    boards
  • 4-board Tarari design performs about 6.5x faster
    than Xeon server
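
One way to read the overhead bullets is through a simple timing model. The model below is an assumption, not a formula from the talk: $T_{\mathrm{CPU}}$ is the Xeon run time, $T_{\mathrm{RC}}(k)$ is the compute time on $k$ boards, $T_{\mathrm{init}}$ is the MPI initialization overhead (around 750 ms here), and $T_{\mathrm{comm}}$ is the communication overhead (at most 3 ms).

```latex
S(k) = \frac{T_{\mathrm{CPU}}}{T_{\mathrm{RC}}(k) + T_{\mathrm{init}} + T_{\mathrm{comm}}}
\qquad\text{and}\qquad
S(k) \to \frac{T_{\mathrm{CPU}}}{T_{\mathrm{RC}}(k)}
\quad\text{as}\quad
\frac{T_{\mathrm{init}} + T_{\mathrm{comm}}}{T_{\mathrm{RC}}(k)} \to 0
```

Under this reading the fixed overheads matter for small boards, where the compute time is comparable to 750 ms, and become negligible as the compute time grows with problem size.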

23
Total-Solutions Results with MPI
RC speedup vs. Xeon server for board size of 12
RC1000 clock speed @ 30 MHz
  • Results show total execution time including MPI
    overhead
  • Minimal MPI overhead incurred (high
    computation-to-communication ratio)
  • Communication overhead bounded to 3 ms regardless
    of problem size; initialization overhead is
    around 750 ms
  • Overhead becomes negligible for large problem
    sizes
  • Speedup scales near linearly with number of
    boards
  • 4-board RC1000 design performs about 12x faster
    than Xeon server

24
Total-Solutions Results with MPI
RC1000 clock speed @ 30 MHz and Tarari clock
speed @ 33 MHz
  • Communication overheads still remain low, while
    MPI initialization overheads increase with number
    of boards (now 1316 ms for 8 boards)
  • Heterogeneous mixture of boards employed to solve
    the problem coordinating via MPI
  • Total of 8 boards (4 RC1000 and 4 Tarari boards)
    allows up to 16 (4×3 + 4×1) FUs
  • 8 boards perform about 21x faster than Xeon
    server for chess board size of 16
  • What appears to be an unfair comparison really
    shows how the approach scales to many more FUs
    per FPGA (on higher density chips)

25
Conclusions
  • Parallel backtracking for solving N-Queens
    problem in RC shows promise for performance
  • N-Queens is an important benchmark in the HPC
    community
  • RC devices outperform CPUs for N-Queens due to
    RC's efficient processing of fine-grained,
    parallel, bit-manipulation operations
  • Previously inefficient methods for CPUs like
    backtracking can be improved by reexamining their
    design
  • This approach can be applied to many other
    applications
  • Numerous parallel approaches developed at several
    levels
  • Handel-C lessons learned
  • A C-based programming model for application
    mapping provides a degree of higher-level
    abstraction, yet still requires programmer to
    code from a hardware perspective
  • Solutions produced to date show promise for
    application mapping

26
Future Work and Acknowledgements
  • Compare application mappers with HDL design in
    terms of mapping efficiency
  • Develop and use direct communication between
    FPGAs to avoid MPI overhead
  • Export approach featured in this talk to variety
    of algorithms and HPC benchmarks for performance
    analysis and optimization
  • Develop library of application and middleware
    kernels for RC-based HPC
  • We wish to thank the following for their support
    of this research
  • Department of Defense
  • Xilinx
  • Celoxica
  • Tarari
  • Key vendors of our HPC cluster resources (Intel,
    AMD, Cisco, Nortel)

27
References
  • [1] "Divide and Conquer under Global
    Constraints: A Solution to the N-Queens Problem,"
    Bruce Abramson and Mordechai M. Yung
  • [2] "Different Perspectives of the N-Queens
    Problem," Cengiz Erbas, Seyed Sarkeshik, Murat
    M. Tanik, Department of Computer Science and
    Engineering, Southern Methodist University, Dallas
  • [3] "Algorithms and Complexity," Herbert S. Wilf,
    University of Pennsylvania, Philadelphia
  • [4] "Fast Search Algorithms for N-Queens
    Problem," Rok Sosic, Jun Gu, IEEE
    Transactions on Systems, Man, and Cybernetics,
    Vol. 21, No. 6, pp. 1572-76, Nov/Dec 1991
  • [5] http://www.cit.gu.edu.au/sosic/nqueens.html
  • [6] http://bridges.canterbury.ac.nz/features/eight.html
  • [7] www.math.utah.edu/alfeld/queens/queens.html
  • [8] www.jsomers.com/nqueen_demo/nqueens.html
  • [9] "A Polynomial Time Algorithm for N-Queens
    Problem"
  • [10] remus.rutgers.edu/rhoads/Code/code.html
  • [11] http://www.mactech.com/articles/mactech/Vol.13/13.12/TheEightQueensProblem/index.html
  • [12] http://www2.ilog.com/preview/Discovery/samples/nqueens/
  • [13] http://www.infosun.fmi.uni-passau.de/br/lehrstuhl/Kurse/Proseminar_ss01/backtracking_nm.pdf
  • [14] "From Alife Agents to a Kingdom of N
    Queens," Han Jing, Jiming Liu, Cai Qingsheng
  • [15] http://www.wi.leidenuniv.nl/kosters/nqueens.html
  • [16] http://www.dsitri.de/projects/NQP/