1
NARC: Network-Attached Reconfigurable Computing
for High-performance, Network-based Applications
  • Chris Conger, Ian Troxel, Daniel Espinosa,
  • Vikas Aggarwal, and Alan D. George
  • High-performance Computing and Simulation (HCS)
    Research Lab
  • Department of Electrical and Computer Engineering
  • University of Florida

2
Outline
  • Introduction
  • NARC Board Architecture, Protocols
  • Case Study Applications
  • Experimental Setup
  • Results and Analysis
  • Pitfalls and Lessons Learned
  • Conclusions
  • Future Work

3
Introduction
  • Network-Attached Reconfigurable Computer (NARC)
    project
  • Inspiration: network-attached storage (NAS) devices
  • Core concept: investigate challenges and
    alternatives for enabling direct network access
    and control over reconfigurable (RC) devices
  • Method: prototype a hardware interface and
    software infrastructure, and demonstrate proof of
    concept for the benefits of network-attached RC
    resources
  • Motivations for the NARC project include (but are
    not limited to) applications such as:
  • Network-accessible processing resources
  • Generic network RC resource, a viable alternative
    to server and supercomputer solutions
  • Power and cost savings over server-based FPGA
    cards are key benefits
  • No server needed to host the RC device
  • Infrastructure provided for robust operation and
    interfacing with users
  • Performance increase over existing RC solutions
    is not a primary goal of this approach
  • Network monitoring and packet analysis

4
Envisioned Applications
  • Aerospace and military applications
  • Modular, low-power design lends itself well to
    military craft and munitions deployment
  • FPGAs providing high-performance radar, sonar,
    and other computational capabilities
  • Scientific field operations
  • Quickly provide first-level estimations in the
    field for geologists, biologists, etc.
  • Field-deployable covert operations
  • Completely wireless device enabled through
    battery, WLAN
  • Passive network monitoring applications
  • Active network traffic injection
  • Distributed computing
  • Cost-effective, RC-enabled clusters or cluster
    resources
  • Cluster NARC devices at a fraction of the cost,
    power, and cooling
  • Cost-effective intelligent sensor networks
  • Use FPGAs in close conjunction with sensors to
    provide pre-processing functions before network
    transmission
  • High-performance network technologies
  • Fast Ethernet may be replaced by any network
    technology
  • Gig-E, InfiniBand, RapidIO, proprietary
    communication protocols

5
NARC Board Architecture: Hardware
  • ARM9 network control with FPGA processing power
    (see Figure 1)
  • Prototype design consists of two boards,
    connected via cable
  • Network interface board (ARM9 processor and
    peripherals)
  • Xilinx development board(s) (FPGA)
  • Network interface peripherals include:
  • Layer-2 network connection (hardware PHY/MAC)
  • External memory, SDRAM and Flash
  • Serial port (debug communication link)
  • FPGA control and data lines
  • NARC hardware specifications:
  • ARM-core microcontroller, 1.8V core, 3.3V
    peripheral
  • 32-bit RISC, 5-stage pipeline, in-order execution
  • 16KB data cache, 16KB instruction cache
  • Core clock speed 180MHz, peripheral clock 60MHz
  • On-chip Ethernet MAC layer with DMA
  • External memory, 3.3V
  • 32MB SDRAM, 32-bit data bus
  • 2MB Flash, 16-bit data bus
  • Port available for additional 16-bit SRAM devices

Figure 1: Block diagram of NARC device
6
NARC Board Architecture: Software
  • ARM processor runs Linux kernel 2.4.19
  • Provides TCP(UDP)/IP stack, resource management,
    threaded execution, Berkeley Sockets interface
    for applications
  • Configured and compiled with drivers specifically
    for our board
  • Applications written in C, compiled using GCC
    compiler for ARM (see Figure 2)
  • NARC API: low-level driver function library for
    basic services
  • Initialize and configure on-chip peripherals of
    ARM-core processor
  • Configure FPGA (SelectMAP protocol)
  • Transfer data to/from FPGA, manipulate control
    lines
  • Monitor and initiate network traffic
  • NARC protocol for job exchange (from remote
    workstation)
  • NARC board application and client application
    must follow standard rules and procedures for
    responding to requests from a user
  • User appends a small header onto data (if any)
    containing information about the request before
    sending over the network (see Figure 3 and the
    client-side sketch below)
  • Bootstrap software in on-board Flash,
    automatically loads and executes on power-up
  • Configures clocks, memory controllers, I/O pins,
    etc.
  • Contacts TFTP server running on network,
    downloads Linux and ramdisk
  • Boots Linux, automatically executes NARC board
    software contained in ramdisk
  • Optional serial interface through HyperTerminal
    for debugging/development

Figure 2: Software development process
Figure 3: Request header field definitions
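
The job-submission flow described above can be illustrated with a small
client-side sketch. This is not the authors' code: the header layout, field
names, and RTYPE values are assumptions for illustration only (Figure 3
defines the actual fields), and only standard Berkeley Sockets calls are used.

  /* Hedged sketch: send a NARC request (header + optional data) over TCP.
   * Field names/sizes and RTYPE values below are hypothetical.            */
  #include <arpa/inet.h>
  #include <netinet/in.h>
  #include <stdint.h>
  #include <string.h>
  #include <sys/socket.h>
  #include <unistd.h>

  struct narc_request_hdr {      /* assumed layout; see Figure 3 for the real one */
      uint8_t  rtype;            /* request type, e.g. 0x01 = configure FPGA      */
      uint8_t  flags;            /* reserved/option bits                           */
      uint16_t reserved;
      uint32_t payload_len;      /* bytes of data that follow the header           */
  };

  static int narc_send_request(const char *ip, uint16_t port, uint8_t rtype,
                               const void *payload, uint32_t len)
  {
      struct sockaddr_in addr;
      struct narc_request_hdr hdr;
      int fd = socket(AF_INET, SOCK_STREAM, 0);

      memset(&addr, 0, sizeof(addr));
      addr.sin_family = AF_INET;
      addr.sin_port   = htons(port);
      if (fd < 0 || inet_pton(AF_INET, ip, &addr.sin_addr) != 1)
          return -1;
      if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
          close(fd);
          return -1;
      }
      memset(&hdr, 0, sizeof(hdr));
      hdr.rtype       = rtype;
      hdr.payload_len = htonl(len);     /* header first, then the data (if any) */
      if (send(fd, &hdr, sizeof(hdr), 0) != (ssize_t)sizeof(hdr) ||
          (len > 0 && send(fd, payload, len, 0) != (ssize_t)len)) {
          close(fd);
          return -1;
      }
      return fd;                        /* caller reads the reply, then closes */
  }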
7
NARC Board Architecture: FPGA Interface
  • Data communicated to/from FPGA by means of
    unidirectional data paths
  • 8-bit input port, 8-bit output port, 8 control
    lines (Figure 4)
  • Control lines manage data transfer, also drive
    configuration signals
  • Data transferred one byte at a time, full-duplex
    communication possible (see the handshake sketch
    after Figure 4)
  • Control lines include the following signals:
  • Clock: software-generated signal to clock data
    on the data ports
  • Reset: reset signal for interface logic in FPGA
  • Ready: signal indicating device is ready to
    accept another byte of data
  • Valid: signal indicating device has placed valid
    data on the port
  • SelectMAP: all signals necessary to drive
    SelectMAP configuration

Figure 4: FPGA interface signal diagram
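
A minimal sketch of the software-clocked byte handshake shown in Figure 4,
assuming simple GPIO helper routines. The pin names and gpio_* functions are
hypothetical stand-ins for the NARC API driver calls, not the actual library.

  #include <stdint.h>

  /* Hypothetical board-support primitives (stand-ins for NARC API drivers). */
  enum pin { PIN_CLOCK, PIN_RESET, PIN_VALID_OUT, PIN_FPGA_READY, PIN_FPGA_VALID };
  extern void    gpio_set(enum pin p, int level);     /* drive one control line  */
  extern int     gpio_get(enum pin p);                /* sample one control line */
  extern void    gpio_write_port(uint8_t byte);       /* drive 8-bit output port */
  extern uint8_t gpio_read_port(void);                /* sample 8-bit input port */

  /* Send one byte to the FPGA over the handshaked link. */
  static void fpga_write_byte(uint8_t byte)
  {
      while (!gpio_get(PIN_FPGA_READY))   /* wait until FPGA can accept a byte */
          ;
      gpio_write_port(byte);              /* place byte on the output port     */
      gpio_set(PIN_VALID_OUT, 1);         /* mark the data as valid            */
      gpio_set(PIN_CLOCK, 1);             /* software-generated clock edge     */
      gpio_set(PIN_CLOCK, 0);
      gpio_set(PIN_VALID_OUT, 0);
  }

  /* Receive one byte from the FPGA over the handshaked link. */
  static uint8_t fpga_read_byte(void)
  {
      uint8_t byte;
      while (!gpio_get(PIN_FPGA_VALID))   /* wait for valid data from the FPGA */
          ;
      byte = gpio_read_port();            /* sample the input port             */
      gpio_set(PIN_CLOCK, 1);             /* acknowledge with a clock pulse    */
      gpio_set(PIN_CLOCK, 0);
      return byte;
  }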
  • FPGA configuration through SelectMAP protocol
  • Fastest configuration option for Xilinx FPGAs,
    protocol emulated using GPIO pins of ARM
  • NARC board enables remote configuration and
    management of FPGA
  • User submits configuration request (RTYPE 01),
    along with bitfile and function descriptor
  • Function descriptor is an ASCII string: a
    formatted list of functions with associated
    RTYPE definitions
  • ARM halts and configures FPGA, stores descriptor
    in dedicated RAM buffer for user queries (see the
    dispatch sketch below)
  • All FPGA designs must restrict use of all
    SelectMAP pins after configuration
  • Some signals are shared between SelectMAP port
    and FPGA-ARM link
  • Once configured, SelectMAP pins must remain
    tri-stated and unused
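
To tie the pieces together, here is a hedged sketch of how the ARM-side
software might dispatch an incoming request: a configure request (RTYPE 01)
is pushed to the FPGA through the emulated SelectMAP port, while other
request types stream their payload over the handshaked link. It reuses the
hypothetical narc_request_hdr and fpga_write_byte from the sketches above;
narc_selectmap_configure stands in for the real NARC API, and how the bitfile
and function descriptor are delimited in the payload is omitted here.

  #include <stdint.h>

  #define RTYPE_CONFIGURE 0x01   /* assumed encoding of a configure request */

  /* Hypothetical stand-ins for NARC API calls. */
  extern int  narc_selectmap_configure(const uint8_t *bitfile, uint32_t len);
  extern void fpga_write_byte(uint8_t byte);

  static void narc_dispatch(uint8_t rtype, uint8_t *payload, uint32_t len)
  {
      if (rtype == RTYPE_CONFIGURE) {
          /* Halt normal operation and load the bitfile via SelectMAP; the
           * ASCII function descriptor accompanying the bitfile would be kept
           * in RAM so users can query which functions the design supports.
           * (Delimiting of bitfile vs. descriptor is omitted here.)          */
          narc_selectmap_configure(payload, len);
      } else {
          /* Any other request: forward the payload to the user design in the
           * FPGA one byte at a time over the handshaked link.                */
          for (uint32_t i = 0; i < len; i++)
              fpga_write_byte(payload[i]);
      }
  }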

8
Results and Analysis: Raw Performance
  • FPGA interface I/O throughput (Table 1)
  • 1KB data transferred over link, timed
  • Measured using hardware methods
  • Logic analyzer used to capture raw link data rate:
    divide data sent by the time from first clock to
    last clock (see Figure 9)
  • Performance lower than desired for prototype
  • Handshake protocol may add unnecessary overhead
  • Widening data paths and optimizing the software
    routine will significantly improve FPGA I/O
    performance
  • Network throughput (Table 2)
  • Measured using the Linux network benchmark iperf
  • NARC board located on arbitrary switch within
    network, application partner is user workstation
  • Transfers as much data as possible in 10 seconds,
    calculates throughput as data sent divided by
    10 seconds
  • Performed two experiments, with NARC board serving
    as client in one run and server in the other
  • Both local and remote iperf partners (remote
    location 400 miles away, at Florida State
    University)
  • Network interface achieves reasonably good
    bandwidth efficiency
  • External memory throughput (Table 3)
  • 4KB transferred to external SDRAM, both read and
    write
  • Measurements again taken using logic analyzer
  • Memory throughput sufficient to provide
    wire-speed buffering of network traffic
  • On-chip Ethernet MAC has DMA to this SDRAM

Figure 9: Logic analyzer timing

                    Input    Output
  Logic analyzer     6.08      6.12
Table 1: FPGA interface I/O performance (Mb/s)

                    Local Network    Remote Network (WAN)
  NARC-Server             75.4                4.9
  Server-Server           78.9                5.3
Table 2: Network throughput (Mb/s)

                    Read     Write
  Logic analyzer   183.2       160
Table 3: External SDRAM throughput (Mb/s)
9
Results and Analysis: Raw Performance
  • Reconfiguration speed
  • Includes time to transfer bitfile over network,
    plus time to configure device (transfer bitfile
    from ARM to FPGA), plus time to receive
    acknowledgement
  • Our design currently completes a user-initiated
    reconfiguration request with a 1.2MB bitfile in
    2.35 sec
  • Area/resource usage of minimal wrapper for
    Virtex-II Pro FPGA
  • Statistics on resource requirements for a minimal
    design to provide required link control and data
    transfer in an application wrapper are presented
    below
  • Design implemented on older Virtex-II Pro FPGA
  • Numbers at right indicate requirements for the
    wrapper only; unused resources are available
    for user applications
  • Extremely small footprint!
  • Footprint will be even smaller on a larger FPGA

Device utilization summary
---------------------------------------------
Selected Device: 2vp20ff1152-5
  Number of Slices:            143 out of  9280    1%
  Number of Slice Flip Flops:  120 out of 18560    0%
  Number of 4 input LUTs:      238 out of 18560    1%
  Number of bonded IOBs:        24 out of   564    4%
  Number of BRAMs:               8 out of    88    9%
  Number of GCLKs:               1 out of    16    6%

10
Case Study Applications
  • Clustered RC Devices: N-Queens
  • HPC application demonstrating the NARC board's
    role as a generic compute resource
  • Application characterized by minimal
    communication, heavy computation within FPGA
  • NARC version of N-Queens adapted from previously
    implemented application for PCI-based Celoxica
    RC1000 board housed in a conventional server
  • N-Queens algorithm is a part of the DoD
    high-performance computing benchmark suite and
    representative of select military and
    intelligence processing algorithms
  • Exercises functionality of various developed
    mechanisms and protocols for job submission, data
    transfer, etc. on NARC
  • User specifies a single parameter N; upon
    completion the algorithm returns the total number
    of possible solutions
  • Purpose of algorithm is to determine how many
    possible arrangements of N queens there are on
    an N × N chess board, such that no queen may
    attack another (see Figure 5 and the software
    sketch that follows)
  • Results are presented from both NARC-based
    execution and RC1000-based execution for
    comparison

Figure 5: Possible 8x8 solution (figure c/o Jeff Somers)
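
For reference, the computation itself can be expressed as a small software
backtracking counter. This plain-C sketch only illustrates what the single
parameter N and the returned solution count mean; the NARC and RC1000
implementations realize the search in FPGA hardware.

  #include <stdint.h>

  /* Count N-Queens solutions by backtracking with bitmasks
   * (illustrative software version; valid for N up to 16 here). */
  static uint64_t place(int n, int row, uint32_t cols,
                        uint32_t diag1, uint32_t diag2)
  {
      uint64_t count = 0;
      if (row == n)
          return 1;                                 /* all N queens placed */
      for (int c = 0; c < n; c++) {
          uint32_t col = 1u << c;
          uint32_t d1  = 1u << (row + c);           /* "/" diagonal        */
          uint32_t d2  = 1u << (row - c + n - 1);   /* "\" diagonal        */
          /* Place a queen only if its column and both diagonals are free. */
          if (!(cols & col) && !(diag1 & d1) && !(diag2 & d2))
              count += place(n, row + 1, cols | col, diag1 | d1, diag2 | d2);
      }
      return count;
  }

  uint64_t nqueens(int n)    /* e.g. nqueens(8) returns 92 */
  {
      return place(n, 0, 0, 0, 0);
  }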
11
Case Study Applications
  • Network processing: Bloom Filter
  • This application performs passive packet analysis
    through use of a classification algorithm known
    as a Bloom Filter
  • Application characterized by constant, bursty
    communication patterns
  • Most communication is receive (Rx) over the
    network, with transmission to the FPGA
  • Filter may be programmed or queried
  • NARC device copies all received network frames to
    memory, ARM parses TCP/IP header and sends it to
    Bloom Filter for classification
  • User can send programming requests, which include
    a header and string to be programmed into Filter
  • User can also send result collection requests,
    which causes a formatted results packet to be
    sent back to the user
  • Otherwise, application constantly runs, querying
    each header against the current Bloom Filter and
    recording match/header pair information
  • Bloom Filter works by using multiple hash
    functions on a given bit string, each hash
    function rendering indices of a separate bit
    vector (see Figure 6)
  • To program, hash the input string and set the
    resulting bit positions to 1
  • To query, hash the input string; if all resulting
    bit positions are 1, the string matches (see the
    C sketch after Figure 7)
  • Implemented on Virtex-II Pro FPGA
  • Uses slightly larger, but ultimately more
    effective application wrapper (see Figure 7)
  • Larger FPGA selected to demonstrate
    interoperability with any FPGA

Figure 6: Bloom Filter algorithmic architecture
Figure 7: Bloom Filter implementation architecture
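
The program/query behavior described above can be sketched in plain C. The
constants (four hash functions, 8192-bit vectors) and the simple FNV-style
hash are illustrative assumptions, not the parameters of the FPGA design.

  #include <stddef.h>
  #include <stdint.h>

  #define NUM_HASHES   4                    /* assumed number of hash functions */
  #define VECTOR_BITS  8192                 /* assumed size of each bit vector  */

  static uint8_t bitvec[NUM_HASHES][VECTOR_BITS / 8];   /* one vector per hash */

  /* Simple seeded FNV-1a-style hash, reduced to a bit-vector index. */
  static uint32_t hash_fn(int k, const uint8_t *data, size_t len)
  {
      uint32_t h = 2166136261u ^ ((uint32_t)k * 0x9e3779b9u);
      for (size_t i = 0; i < len; i++) {
          h ^= data[i];
          h *= 16777619u;
      }
      return h % VECTOR_BITS;
  }

  /* Program: hash the string with every function and set the resulting bits. */
  void bloom_program(const uint8_t *data, size_t len)
  {
      for (int k = 0; k < NUM_HASHES; k++) {
          uint32_t idx = hash_fn(k, data, len);
          bitvec[k][idx / 8] |= (uint8_t)(1u << (idx % 8));
      }
  }

  /* Query: the string matches only if every hashed bit position is set. */
  int bloom_query(const uint8_t *data, size_t len)
  {
      for (int k = 0; k < NUM_HASHES; k++) {
          uint32_t idx = hash_fn(k, data, len);
          if (!(bitvec[k][idx / 8] & (1u << (idx % 8))))
              return 0;                     /* definitely not programmed        */
      }
      return 1;                             /* match (possible false positive)  */
  }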
12
Experimental Setup
  • N-Queens: Clustered RC devices
  • NARC device located on arbitrary switch in
    network
  • User interfaces through client application on
    workstation, requests N-Queens procedure
  • Figure 8 illustrates experimental environment
  • Client application records time required to
    satisfy request
  • Power supply measures current draw of active NARC
    device
  • N-Queens also implemented on RC-enabled server
    equipped with Celoxica RC1000 board
  • Client-side function call to NARC board replaced
    with function call to RC1000 board in local
    workstation, same timing measurement
  • Comparison offered in terms of performance,
    power, cost
  • Bloom Filter: Network processing
  • Same experimental setup as N-Queens case study
  • Software on ARM co-processor captures all
    Ethernet frames
  • Only packet headers (TCP/IP) are passed to FPGA
  • Data continuously sent to FPGA as packets arrive
    over network
  • By attaching the NARC device to a switch, only a
    limited set of packets can be captured
  • Only broadcast packets and packets destined for
    the NARC device can be seen
  • A dual-port device could be inserted in-line with
    a network link to monitor all flow-through traffic

Figure 8: Experimental environment
13
Results and Analysis: N-Queens Case Study
  • First, consider an execution time comparison
    between our NARC board and a PCI-based RC card
    (see Figures 10a and 10b)
  • Both FPGA designs clocked at 50MHz
  • Performance difference is minimal between devices
  • Being able to match performance of PCI-based card
    is a resounding success!
  • Power consumption and cost of NARC devices
    drastically lower than those of server-plus-RC-card
    combinations
  • Multiple users may share a NARC device, whereas
    PCI-based cards are somewhat fixed in an
    individual server
  • Power consumption calculated using the following
    method:
  • Three regulated power supplies exist in complete
    NARC device (network interface + FPGA board): 5V,
    3.3V, 2.5V
  • Current draw from each supply was measured
  • Power consumption is calculated as the sum of the
    V × I products of all three supplies

Figure 10: Performance comparison between NARC
board and PCI-based RC card on server
14
Results and Analysis: N-Queens Case Study
  • Figure 11 summarizes the performance ratio of
    N-Queens between the NARC and RC1000 platforms
  • Consider Table 4 for a summary of cost and power
    statistics
  • Unit price shown excludes cost of FPGA
  • FPGA cost is common to both platforms, so it is
    offset in the comparison
  • Price shown includes PCB fabrication and component
    costs
  • Approximate power consumption drastically less
    than a server-plus-RC-card combination
  • Power consumption of server varies depending on
    particular hardware
  • Typical servers operate from 200-400W power
    supplies
  • See Figure 12 for an example of the approximate
    power consumption calculation

Figure 11: N-Queens performance ratio between NARC and RC1000 platforms
                               NARC Board
  Cost per unit (prototype)    $175.00
  Approx. power consumption    3.28 W
Table 4: Price and power figures for NARC device
P = (5V)(I_5) + (3.3V)(I_3.3) + (2.5V)(I_2.5)
I_5 = 0.2A, I_3.3 = 0.49A, I_2.5 = 0.27A
P = (5)(0.2) + (3.3)(0.49) + (2.5)(0.27) = 3.28W
Figure 12: Power consumption calculation
15
Results and Analysis: Bloom Filter
  • Passive, continuous network traffic analysis
  • Wrapper design was slightly larger than previous
    minimal wrapper used with N-Queens
  • Still small footprint on chip, majority of FPGA
    remains for application
  • Maximum wrapper clock frequency of 183 MHz; should
    not limit the application clock if in the same
    clock domain
  • Packets received over network link are parsed by
    ARM, with TCP/IP header saved in buffer
  • Headers sent one at a time as query requests to
    Bloom Filter (FPGA); when a query finishes, another
    header is de-queued if available
  • User may query NARC device at any time for
    results update, program new pattern
  • Figure 13 shows resource usage for Virtex-II Pro
    FPGA
  • Maximum clock frequency of 113MHz
  • Not affected by wrapper constraint
  • Significantly faster computation speed than
    FPGA-ARM link communication speed
  • FPGA-side buffer will not fill up; headers are
    processed before the next header is transmitted to
    the FPGA
  • ARM-side buffer may fill up under heavy traffic
    loads
  • 32MB ARM-side RAM provides a large buffer

Device utilization summary
---------------------------------------------
Selected Device: 2vp20ff1152-5
  Number of Slices:           1174 out of  9280   13%
  Number of Slice Flip Flops: 1706 out of 18560    9%
  Number of 4 input LUTs:     2032 out of 18560   11%
  Number of bonded IOBs:        24 out of   564    4%
  Number of BRAMs:               9 out of    88   10%
  Number of GCLKs:               1 out of    16    6%

Figure 13: Device utilization statistics for Bloom Filter design
16
Pitfalls and Lessons Learned
  • FPGA I/O throughput capacity remains a persistent
    problem
  • One motivation for designing custom hardware is
    to remove typical PCI bottleneck and provide
    wire-speed network connectivity for FPGA
  • Under-provisioned data path between FPGA and
    network interface restricts performance benefits
    for our prototype design
  • Luckily, this problem may be solved through a
    variety of approaches
  • Wider data paths (16-bit, 32-bit) double or
    quadruple throughput, at the expense of a higher
    pin count
  • Use of higher-performance co-processor capable of
    faster I/O switching frequencies
  • Optimized data transfer protocol
  • Having a co-processor in addition to the FPGA to
    handle the network interface is vital to the
    success of our approach
  • Required in order to permit initial remote
    configuration of FPGA, as well as additional
    reconfigurations upon user request
  • Offloading the network stack, basic request
    handling, and other maintenance-type tasks from the
    FPGA saves a significant amount of valuable slices
    for user designs
  • Drastically eases interfacing with user
    application on networked workstation
  • Active co-processor for FPGA applications, e.g.
    parsing network packets as in Bloom Filter
    application

17
Conclusions
  • A novel approach to providing FPGAs with
    standalone network connectivity has been
    prototyped and successfully demonstrated
  • Investigated issues critical to providing remote
    management of standalone NARC resources
  • Proposed and demonstrated solutions to discovered
    challenges
  • Performed a pair of case studies with two distinct,
    representative applications for a NARC device
  • Network-attached RC devices offer potential
    benefits for a variety of applications
  • Impressive cost and power savings over
    server-based RC processing
  • Independent NARC devices may be shared by
    multiple users without physically relocating the
    device
  • Tightly coupled network interface enables FPGA to
    be used directly in path of network traffic for
    real-time analysis and monitoring
  • Two issues that are proving to be a challenge to
    our approach include:
  • Data latency in FPGA communication
  • Software infrastructure required to achieve a
    robust standalone RC unit
  • While the prototype design achieves relatively
    good performance in some areas and limited
    performance in others, this is acceptable for a
    concept demonstration
  • Fairly complex board design; architecture and
    software enhancements are in development
  • As proof of the NARC concept, an important goal of
    the project was achieved in the demonstration of
    an effective and efficient infrastructure for
    managing NARC devices

18
Future Work
  • Expansion of network processing capabilities
  • Further development of packet filtering
    application
  • More specific and practical activity or behavior
    sought from network traffic
  • Analyze streaming packets at or near wire-speed
    rates
  • Expansion of Ethernet link to 2-port hub
  • Permit transparent insertion of device into
    network path
  • Provide easier access to all packets in switched
    IP network
  • Merging FPGA with ARM co-processor and network
    interface into one device
  • Ultimate vision for NARC device
  • Will restrict the number of different FPGAs that
    may be supported, according to the chosen FPGA
    socket/footprint for the board
  • Increased difficulty in PCB design
  • Expansion to Gig-E, other network technologies
  • Fast Ethernet targeted for prototyping effort,
    concept demonstration
  • True high-performance device should support
    Gigabit Ethernet
  • Other potential technologies include (but are not
    limited to) InfiniBand, RapidIO
  • Further development of management infrastructure
  • Need for more robust control/decision-making
    middleware
  • Automatic device discovery, concurrent job
    execution, fault-tolerant operation