Design and Analysis of Parallel N-Queens on
Reconfigurable Hardware with Handel-C and MPI
  • Vikas Aggarwal, Ian Troxel, and Alan D. George
  • High-performance Computing and Simulation (HCS)
    Research Lab
  • Department of Electrical and Computer Engineering
  • University of Florida
  • Gainesville, FL

  • Introduction
  • N-Queens Solutions
  • Backtracking Approach
  • N-Queens Parallelization
  • Experimental Setup
  • Handel-C and Lessons Learned
  • Results and Analysis
  • Conclusions
  • Future Work and Acknowledgements
  • References

Introduction
  • N-Queens dates back to the 19th century
    (studied by Gauss)
  • Classical combinatorial problem, widely used
    as a benchmark because of its simple and regular
    structure
  • Problem involves placing N queens on an N × N
    chessboard such that no queen can attack any
    other
  • Benchmark code versions include finding the first
    solution and finding all solutions

  • Mathematically stated
  • Find a permutation of the Board() vector
    containing the numbers 1…N, such that
    for any $i \neq j$:
    $\mathrm{Board}(i) - i \neq \mathrm{Board}(j) - j$
    $\mathrm{Board}(i) + i \neq \mathrm{Board}(j) + j$
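
The mathematical statement above can be checked directly in a few lines. The following is a minimal Python sketch, not the authors' code: the function name `is_solution` and the use of a Python list for the Board vector are assumptions made for illustration. It tests that the vector is a permutation of 1…N and that the differences Board(i) - i and the sums Board(i) + i are pairwise distinct.

```python
from itertools import combinations


def is_solution(board):
    """Check the permutation model of N-Queens.

    board[i] is the row (1..N) of the queen in column i (hypothetical layout).
    """
    n = len(board)
    # Permutation of 1..N: one queen per row and per column.
    if sorted(board) != list(range(1, n + 1)):
        return False
    for i, j in combinations(range(n), 2):
        # Two queens share a diagonal when Board(i) - i == Board(j) - j.
        if board[i] - i == board[j] - j:
            return False
        # Two queens share an anti-diagonal when Board(i) + i == Board(j) + j.
        if board[i] + i == board[j] + j:
            return False
    return True


print(is_solution([2, 4, 1, 3]))  # a valid 4-queens placement
print(is_solution([1, 2, 3, 4]))  # all queens on one diagonal
```

The first call reports a valid placement and the second an invalid one, since every queen in `[1, 2, 3, 4]` has the same value of Board(i) - i.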

N-Queens Solutions
  • Various approaches to the problem
  • Brute force [2]
  • Local search algorithms [4]
  • Backtracking [2, 7, 11, 12, 13]
  • Divide and conquer approach [1]
  • Permutation generation [2]
  • Mathematical solutions [6]
  • Graph theory concepts [2]
  • Heuristics and AI [4, 14]

Backtracking Approach
  • One of the only approaches that guarantees a
    solution, though it can be slow
  • Can be seen as a form of intelligent depth-first
    search
  • Complexity of backtracking typically rises
    exponentially with problem size
  • Good test case for performance analysis of RC
    systems, as the problem is complex even for small
    data size
  • Traditional processors provide a suboptimal
    platform for this iterative application due to
    serial nature of their processing pipelines
  • Tremendous speedups achieved by adding
    parallelism at the logic level via RC
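
As a software reference point for the backtracking approach described above, here is a minimal Python sketch of a column-by-column first-solution search. It is an illustration only, not the Handel-C design from this talk; the names `first_solution`, `safe`, and `place` are hypothetical, and rows are 0-based.

```python
def first_solution(n):
    """Return the first placement found by column-by-column backtracking.

    board[c] is the 0-based row of the queen in column c.
    Returns None when no placement exists (for example n = 2 or n = 3).
    """
    board = []

    def safe(row):
        # Test the candidate row against every queen already placed.
        col = len(board)
        for c, r in enumerate(board):
            if r == row or abs(r - row) == col - c:
                return False
        return True

    def place():
        if len(board) == n:
            return True
        for row in range(n):
            if safe(row):
                board.append(row)
                if place():
                    return True
                board.pop()  # backtrack: undo the placement and try the next row
        return False

    return board if place() else None


print(first_solution(8))
```

Each call to `safe` walks the placed queens one at a time, which is the serial test loop that a hardware design can instead evaluate in parallel.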

Backtracking Approach
For an 8x8 board, 981 moves (876 tests + 105
backtracks) are required for first solution alone
Table: Number of operations for 1st solution [7]
Table: Number of solutions [8]
  • Tables provide an estimate of the backtracking
    approach's complexity
  • Problem can be made to find first solution or the
    total number of solutions
  • Total number of solutions is obviously a more
    challenging problem
  • Interesting observation: the first solution's
    complexity (i.e., number of operations) does not
    increase monotonically with board size

N-Queens Parallelization
  • Different levels of parallelism added to improve
    performance
  • Functional Unit replication
  • Parallel column check
  • Parallel row check

Parallelization Comparison
  • Sequential: 11 cycles
  • Parallel column check: 3 cycles
  • Multiple row check appended: 1 cycle
  • 11x speedup over sequential operation
Note: Assume first four queens have been placed
and the fifth queen starts from the 1st row

Experimental Setup
  • Experiments conducted using RC1000 boards from
    Celoxica, Inc., and Tarari RC boards from Tarari,
    Inc.
  • Each RC1000 board features a Xilinx Virtex-2000
    FPGA, 8 MB of on-card SRAM, and PCI Mezzanine
    Card (PMC) sockets for connecting two daughter
    cards
  • Each Tarari board features two user-programmable
    Xilinx Virtex-II FPGAs in addition to a
    controller FPGA, and 256 MB of DDR SDRAM
  • Configurations designed in Handel-C using
    Celoxica's application mapping tool DK-2, along
    with Xilinx ISE for place and route
  • Performance compared against 2.4 GHz Xeon server
    and 1.33 GHz Athlon server

Celoxica RC1000
  • PCI-based card having one Xilinx FPGA and four
    memory banks
  • FPGA configured from the host processor over the
    PCI bus
  • Four memory banks, each of 2MB, accessible to
    both the FPGA and any other device on the PCI bus
  • Data transfers: the RC1000 provides 3 methods of
    transferring data over the PCI bus between the
    host processor and the FPGA
  • Bulk data transfers performed via memory banks
  • Two unidirectional 8-bit ports, called control
    and status ports, for direct communication
    between FPGA and PCI bus (note: this method was
    used in our experiments)
  • User I/O pins USER1 and USERO for single-bit
    communication with FPGA
  • API-layer calls from host to configure and
    communicate with RC board

Figure courtesy of Celoxica RC1000 manual
Tarari Content Processing Platform
  • PCI-based board having 3 FPGAs and a 256 MB
    memory bank
  • Two Xilinx Virtex-II FPGAs available for user to
    load configuration files from host over the PCI
    bus
  • Each Content Processing Engine or CPE (User FPGA)
    configured with one or two agents
  • Third FPGA acts as controller providing access to
    memory and configuration of CPP with agents
  • 256 MB of DDR SDRAM for data sharing between CPEs
    and the host application
  • Configuration files first uploaded into the
  • memory slots and used to configure each FPGA
  • Both single-word transfers and DMA transfers
    supported between the host and the CPP

Figure courtesy of Tarari CP-DK manual
Handel-C Programming Paradigm
  • Handel-C acts as a bridge between VHDL and C
  • Comparison with conventional C
  • More explicit provisioning of parallelism within
    the code
  • Variables declared to have the exact bit-lengths
    to save space
  • Provides more bit-level manipulations beyond
    shifts and logic operations
  • Limited support for many ANSI C standards and
  • Comparison with VHDL
  • Application porting is much faster for
    experienced coders
  • Similar to VHDL behavioral models
  • Lacks VHDL concurrent signal assignments which
    can be suspended until changes on input triggers
    (Handel-C requires polling)
  • Provides more high-level routines

Handel-C Design Specifics
  • Design makes use of the following two approaches
  • Approach 1
  • Use of an array of binary numbers to hold a 1
    at a particular bit position to indicate the
    location of queen in the column
  • A 32 x 32 board will require an array of 32
    elements of 32 bits each
  • Correspondingly use bit-shift operations and
    logical-and operations to check diagonal and row
  • More closely corresponds to the way the
    operations will take place on the RC fabric
  • Approach 2
  • Use of an array of integers instead of binary
    numbers
  • Correspondingly use the mathematical model of the
    problem to check the validation conditions
  • Smaller variables yield better device
    utilization: slices occupied reduce from about
    75% to about 15% for similar performance and
  • Approach 2 found to be more amenable for Handel-C

Lessons Learned with Handel-C
  • Some interesting observations
  • Code for which place and route did not work,
    finally worked when the function parameters were
    replaced by global variables
  • Less control at lower level, with place and route
    being a consistent problem even with designs
    using up only 40% of total slices
  • Self-referenced operations (e.g., a = a + x) affect
    the design adversely, so use intermediate
    variables
  • Order of operations and conditional statements
    can affect design
  • Useful to reduce wider-bit operations into a
    sequence of narrower-bit operations
  • Balancing "if" with "else" branches leads to
    better designs
  • Comments in the main program sometimes affected
    the synthesis, leading to place and route errors
    in fully commented code
  • We are still learning more every day!

Sequential First-Solution Results
  • Sequential version does not perform well versus
    the Xeon and Athlon CPUs
  • Algorithm needs an efficient design to minimize
    resource utilization
  • The results do not include the one-time
    configuration overhead of 150 ms

RC1000 clock speed @ 40 MHz
Parallel First-Solution Results

RC1000 clock speed @ 25 MHz
  • The most parallel algorithm runs about 20x
    faster than the sequential algorithm on RC fabric
  • Parallel algorithm with two row checks almost
    duplicates behavior of 2.4 GHz Xeon server,
    while 6-row check outperforms it by 74%
  • Further increasing the number of rows checked is
    likely to further improve performance for larger
    problem sizes

Total Number of Solutions Method
  • Employ divide-and-conquer approach
  • Seen as a parallel depth-first search
  • Solutions obtained with queen positioned in any
    row in the first column are independent from
    solutions with queens in other positions
  • Technique allows for high degree of parallelism
    (DoP)
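
The independence of the first-column subproblems can be illustrated with a short Python sketch. This is not the hardware design from the talk; it is a software model under stated assumptions, with hypothetical names `count_completions` and `solutions_by_first_row`. Each first-column row is counted on its own with a bitmask search, and the total is the sum of the independent counts.

```python
def count_completions(n, rows, diag_down, diag_up):
    """Count ways to fill the remaining columns.

    rows holds the occupied rows; diag_down and diag_up hold the rows
    attacked along each diagonal direction in the current column.
    """
    full = (1 << n) - 1
    if rows == full:
        return 1  # one queen in every row means every column is filled
    free = ~(rows | diag_down | diag_up) & full
    total = 0
    while free:
        bit = free & -free  # lowest free row
        free ^= bit
        total += count_completions(
            n, rows | bit, (diag_down | bit) << 1, (diag_up | bit) >> 1
        )
    return total


def solutions_by_first_row(n):
    """Solve one independent subproblem per row of the first column."""
    counts = []
    for r in range(n):
        bit = 1 << r
        counts.append(count_completions(n, bit, bit << 1, bit >> 1))
    return counts


per_row = solutions_by_first_row(8)
print(per_row, sum(per_row))
```

Because no subproblem reads another's state, the entries of `per_row` could be computed in any order or all at once, which is what gives the method its degree of parallelism. For an 8x8 board the counts sum to the well-known total of 92 solutions.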

One-Board Total-Solutions Results
RC1000 and Tarari clock speed @ 33 MHz
  • Designs on hardware perform around 1.7x faster
    than Xeon server
  • Performance on both RC platforms similar for same
    clock rates
  • RC1000 performs a notch better for smaller chess
    board sizes while Tarari CPP's performance
    improves with chess board sizes
  • Almost entire Virtex-II chip on the Tarari is
    occupied for one FU

Multiple Functional Units (FUs)
  • Used additional FUs per chip to increase
    parallelism per chip
  • Each FU searches for the number of solutions
    corresponding to a subset of rows in the first
    column
  • The controller
  • Handles communication with the host
  • Invokes all FUs in parallel
  • Combines all results
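
The controller and FU roles listed above can be mimicked in software. The sketch below is a hedged Python model, not the on-chip implementation: `functional_unit`, `controller`, and the round-robin split of first-column rows are all assumptions for illustration, and a thread pool merely stands in for FUs that run concurrently in hardware (Python threads do not give true parallel speedup here).

```python
from concurrent.futures import ThreadPoolExecutor


def completions(n, col, rows, diag_a, diag_b):
    """Count placements for columns col..n-1 given the occupied lines."""
    if col == n:
        return 1
    total = 0
    for r in range(n):
        if r not in rows and (r - col) not in diag_a and (r + col) not in diag_b:
            total += completions(
                n, col + 1, rows | {r}, diag_a | {r - col}, diag_b | {r + col}
            )
    return total


def functional_unit(n, first_rows):
    """One FU: count solutions whose first-column queen is in first_rows."""
    return sum(completions(n, 1, {r}, {r}, {r}) for r in first_rows)


def controller(n, num_fus):
    """Split first-column rows among FUs, invoke them all, combine results."""
    subsets = [list(range(fu, n, num_fus)) for fu in range(num_fus)]
    with ThreadPoolExecutor(max_workers=num_fus) as pool:
        partial_counts = list(pool.map(lambda rows: functional_unit(n, rows), subsets))
    return sum(partial_counts)


print(controller(8, 3))  # → 92
```

The round-robin split is only one possible assignment. Since subproblems for different first-column rows take different amounts of work, a real design might balance the subsets differently.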

[Figure: on-board FPGA and host processor]
Total-Solutions Results with Multiple FUs
RC1000 clock speed @ 30 MHz
  • RC1000 with three FUs performs almost 5x faster
    than Xeon server
  • Speedup increases near linearly with number of
    FUs
  • Area occupied scales linearly with number of FUs

RC speedup vs. Xeon server for board size of 17
MPI for Inter-Board Communication
  • To further increase system speedup (having more
    functional units), multiple boards employed
  • Each FU programmed to search a subset of the
    solution space
  • Servers communicate using the Message Passing
    Interface (MPI) to start search in parallel and
    obtain the final result

[Figure: two host servers, each with an on-board FPGA (with one or multiple FUs)]
Total-Solutions Results with MPI
Tarari CPP clock speed @ 33 MHz
RC speedup vs. Xeon server for board size of 12
  • Results show total execution time including MPI
    overhead
  • Minimal MPI overhead incurred (high
    computation-to-communication ratio)
  • Communication overhead bounded to 3 ms regardless
    of problem size, and initialization overhead is
    around 750 ms
  • Overhead becomes negligible for large problem
    sizes
  • Speedup scales near linearly with number of
    boards
  • 4-board Tarari design performs about 6.5x faster
    than Xeon server
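
A quick calculation shows why a fixed overhead matters less as the problem grows. The Python sketch below uses the overhead figures quoted above (3 ms communication, 750 ms initialization), but the CPU run times and the assumed 6.5x raw compute advantage are hypothetical inputs chosen for illustration, not measurements from this study. The function name `effective_speedup` is likewise an assumption.

```python
def effective_speedup(cpu_time_s, rc_compute_s, init_s=0.750, comm_s=0.003):
    """Speedup over the CPU once fixed MPI costs are added to RC compute time."""
    return cpu_time_s / (rc_compute_s + init_s + comm_s)


# Hypothetical CPU run times in seconds; RC compute assumed 6.5x faster.
for cpu_time in (1.0, 10.0, 100.0, 1000.0):
    rc_time = cpu_time / 6.5
    print(cpu_time, round(effective_speedup(cpu_time, rc_time), 2))
```

For short runs the 753 ms of fixed cost dominates and the effective speedup stays near 1x, while for long runs it approaches the assumed raw compute advantage.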

Total-Solutions Results with MPI
RC speedup vs. Xeon server for board size of 12
RC1000 clock speed @ 30 MHz
  • Results show total execution time including MPI
    overhead
  • Minimal MPI overhead incurred (high
    computation-to-communication ratio)
  • Communication overhead bounded to 3 ms regardless
    of problem size, and initialization overhead is
    around 750 ms
  • Overhead becomes negligible for large problem
    sizes
  • Speedup scales near linearly with number of
    boards
  • 4-board RC1000 design performs about 12x faster
    than Xeon server

Total-Solutions Results with MPI
RC1000 clock speed @ 30 MHz and Tarari clock
speed @ 33 MHz
  • Communication overheads still remain low, while
    MPI initialization overheads increase with number
    of boards (now 1316 ms for 8 boards)
  • Heterogeneous mixture of boards employed to solve
    the problem coordinating via MPI
  • Total of 8 boards (4 RC1000 and 4 Tarari boards)
    allows up to 16 (4×3 + 4×1) FUs
  • 8 boards perform about 21x faster than Xeon
    server for chess board size of 16
  • What appears to be an unfair comparison really
    shows how the approach scales to many more FUs
    per FPGA (on higher density chips)

Conclusions
  • Parallel backtracking for solving N-Queens
    problem in RC shows promise for performance
  • N-Queens is an important benchmark in the HPC
    community
  • RC devices outperform CPUs for N-Queens due to
    RC's efficient processing of fine-grained,
    parallel, bit-manipulation operations
  • Previously inefficient methods for CPUs like
    backtracking can be improved by reexamining their
  • This approach can be applied to many other
  • Numerous parallel approaches developed at several
    levels
  • Handel-C lessons learned
  • A C-based programming model for application
    mapping provides a degree of higher-level
    abstraction, yet still requires programmer to
    code from a hardware perspective
  • Solutions produced to date show promise for
    application mapping

Future Work and Acknowledgements
  • Compare application mappers with HDL design in
    terms of mapping efficiency
  • Develop and use direct communication between
    FPGAs to avoid MPI overhead
  • Export approach featured in this talk to variety
    of algorithms and HPC benchmarks for performance
    analysis and optimization
  • Develop library of application and middleware
    kernels for RC-based HPC
  • We wish to thank the following for their support
    of this research
  • Department of Defense
  • Xilinx
  • Celoxica
  • Tarari
  • Key vendors of our HPC cluster resources (Intel,
    AMD, Cisco, Nortel)

References
  • [1] "Divide and Conquer under Global
    Constraints: A Solution to the N-Queens Problem,"
    Bruce Abramson and Mordechai M. Yung
  • [2] "Different Perspectives of the N-Queens
    Problem," Cengiz Erbas, Seyed Sarkeshik, Murat
    M. Tanik, Department of Computer Science and
    Engineering, Southern Methodist University, Dallas
  • [3] "Algorithms and Complexity," Herbert S. Wilf,
    University of Pennsylvania, Philadelphia
  • [4] "Fast Search Algorithms for the N-Queens
    Problem," Rok Sosic, Jun Gu, IEEE
    Transactions on Systems, Man, and Cybernetics,
    Vol. 21, No. 6, pp. 1572-76, Nov/Dec 1991
  • [5] http//
  • [6] http//
  • [7]
  • [8]
  • [9] "A Polynomial Time Algorithm for N-Queens"
  • [10]
  • [11] http//
  • [12] http//
  • [13] http//
  • [14] "From Alife Agents to a Kingdom of N
    Queens," Han Jing, Jiming Liu, Cai Qingsheng
  • [15] http//
  • [16] http//