Reconfigurable Computing RC Group - PowerPoint PPT Presentation



1
Reconfigurable Computing (RC) Group
  • Reconfigurable Architectures, Networks, and
    Services for COTS-based Cluster Computing Systems
  • Appendix for Q3 Status Report
  • DOD Project MDA904-03-R-0507
  • February 9, 2004

2
Outline
  • RC Group Overview
  • Motivation
  • CARMA Framework Overview
  • CARMA Framework Updates
  • Applications
  • Benchmarking
  • Application Mappers
  • Job Scheduler
  • Configuration Manager
  • Resource Monitoring Service
  • Conclusions
  • Future Work

3
RC Group Overview
  • Group Members (Fall 2003)
  • Vikas Aggarwal, M.S. student in ECE
  • Ramya Chandrasekaran, M.S. student in ECE
  • Gall Gotfried, B.S. student in CISE
  • Aju Jacob, M.S. student in ECE
  • Matt Radlinski, Ph.D. student in ECE
  • Ian Troxel, Ph.D. student in ECE, group leader
  • Girish Venkatasubramanian, M.S. student in ECE
  • Industry Collaborators
  • Honeywell
  • Xilinx (hardware and tools)
  • Celoxica / Alpha Data / Tarari (boards and tools)
  • Starbridge Systems (pending)
  • Silicon Graphics (pending)
  • Numerous other sponsors for cluster resources
    (Intel, AMD, etc.)

4
Motivation
  • Key missing pieces in RC clusters for HPC
  • Dynamic RC fabric discovery and management
  • Coherent multitasking, multi-user environment
  • Robust job scheduling and management
  • Fault tolerance and scalability
  • Performance monitoring down into the RC fabric
  • Automated application mapping into management
    tool
  • The HCS Lab's proposed Cluster-based Approach to
    Reconfigurable Management Architecture (CARMA)
    attempts to unify existing technologies as well
    as fill in the missing pieces

5
CARMA Framework Overview
  • CARMA seeks to integrate
  • Graphical user interface
  • Applications and Benchmarking (New)
  • COTS application mapper
  • Handel-C, Viva, CoreFire, etc.
  • Graph-based job description
  • Condensed Graphs, DAGMan, etc.
  • Robust management tool
  • Distributed, scalable job scheduling
  • Checkpointing, rollback and recovery
  • Distributed configuration management
  • Multilevel monitoring service (GEMS)
  • Clusters, networks, hosts, RC fabric
  • Monitoring down into RC Fabric
  • Bypass Middleware API (for future work)
  • Multiple types of RC boards
  • Multiple high-speed networks
  • SCI, Myrinet, GigE, InfiniBand, etc.

Note: Substituting RC/HPC benchmarking for the
middleware API task
6
Applications
  • Test applications developed
  • Block ciphers
  • DES, Blowfish
  • Floating-point FFT
  • Sonar Beamforming
  • Hyperspectral Imaging (c/o LANL)
  • Future development
  • Additional cryptanalysis applications
  • RC4, RSA, Diffie-Hellman, Serpent, elliptic-curve
    systems
  • RC/HPC benchmarks (c/o Honeywell TC and UCSD)
  • Cryptanalysis benchmarks (c/o DoD)
  • Other benchmarking algorithm possibilities
  • N Queens, Monte Carlo Pi generator, many others
    considered (see slide 8)

7
RC Benchmarking
  • Parallel RC attributes
  • RC speedup vs. software
  • Parallel efficiency
  • Parallel speedup
  • Job throughput
  • Isoefficiency
  • Communication bandwidth
  • Machine and RC overhead
  • Reconfigurability
  • Parallelism
  • Versatility
  • Capacity
  • Time sensitivity
  • Scalability
  • RC metrics
  • Area used (slices, LUTs)
  • Execution time
  • Speedup per FPGA
  • Total configuration time
  • Communication latency
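Two of the attributes above have standard textbook definitions worth pinning down: speedup S = T_serial / T_parallel and parallel efficiency E = S / p. A trivial sketch of the plain definitions (not CARMA code):

```c
/* Speedup S = T_serial / T_parallel: how much faster the parallel
 * (or RC) version runs than the software baseline. */
double speedup(double t_serial, double t_parallel)
{
    return t_serial / t_parallel;
}

/* Parallel efficiency E = S / p: speedup normalized by the number of
 * processors (or FPGAs); 1.0 means perfect scaling. */
double efficiency(double t_serial, double t_parallel, int procs)
{
    return speedup(t_serial, t_parallel) / procs;
}
```

For example, a job taking 100 s in software and 25 s on 8 FPGAs has speedup 4.0 and efficiency 0.5.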
  • Selected References
  • Parallel Computer Architecture: A Hardware/Software
    Approach, David Culler and Jaswinder Pal Singh
  • Parallel Computing: Performance Metrics and
    Models, Sartaj Sahni and Venkat Thanvantri
  • Analyzing Scalability of Parallel Algorithms and
    Architectures, Vipin Kumar and Anshul Gupta
  • A Benchmark Suite for Evaluating Configurable
    Computing Systems: Status, Reflections, and Future
    Directions, Honeywell Tech. Center (FPGA 2000)
  • The RAW Benchmark Suite: Computation Structures
    for General Purpose Computing, MIT Lab for
    Computer Science (FCCM '97)

8
RC Benchmarking
  • Algorithms under consideration
  • Binary heap
  • DES
  • FFT, DCT
  • Game of Life
  • Boolean satisfiability
  • Matrix multiply
  • N queens
  • CORDIC algorithms
  • Huffman encoding
  • Jacobi relaxation
  • Hanoi algorithms
  • Permutation generator
  • Monte Carlo Pi generator
  • Bubble, quick and merge sort
  • Wireless comm. algorithms
  • Sieve of Eratosthenes prime number generator
  • Search over discrete matrices
  • Wavelet-based image compression
  • Differential PCM
  • Adaptive Viterbi Algorithm
  • RC5, DES, Serpent, Blowfish key crack
  • RSA, Diffie-Hellman
  • Elliptic-curve cryptography
  • Graph problems (SSP,SPM,TC)
  • Micro benchmarks to be created as needed
  • Benchmark suites
  • NAS parallel
  • Pallas
  • SPEC HPC
  • SparseBench
  • PARKBENCH
  • DoD crypto. emulation
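As an example of the micro-benchmark style listed above, a Monte Carlo Pi generator is small enough to sketch in full. This is a plain software baseline only; an RC version would presumably replace the software PRNG and stream samples through the fabric.

```c
#include <stdlib.h>

/* Monte Carlo Pi micro-benchmark sketch: draw pseudo-random points in
 * the unit square and count the fraction landing inside the quarter
 * circle; 4 * (inside / total) approximates Pi. The libc rand() here
 * stands in for whatever PRNG a hardware implementation would use. */
double monte_carlo_pi(long samples, unsigned int seed)
{
    long inside = 0;
    srand(seed);
    for (long i = 0; i < samples; i++) {
        double x = (double)rand() / RAND_MAX;
        double y = (double)rand() / RAND_MAX;
        if (x * x + y * y <= 1.0)
            inside++;
    }
    return 4.0 * (double)inside / (double)samples;
}
```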

9
Applications (Blowfish)
  • Parallelization
  • B optimized for parallel network-traffic
    encryption, where the key remains fixed
  • C optimized for parallel cryptanalysis, where
    the key changes rapidly

[Figure: three Blowfish designs. A: a single Blowfish
instance (crypto unit, control, F function, S-boxes,
P-arrays, and init logic; 2 iterations). B (network
packet processing optimized): multiple crypto units
sharing one set of S-boxes and P-arrays. C
(cryptanalysis optimized): S-boxes and P-arrays
replicated per crypto unit. Based on Virtex 1000E at
25 MHz; shared resources denoted in the original
diagram.]
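The cryptanalysis-optimized design sweeps a key space across many crypto units in parallel. A minimal sketch of the key-space partitioning such a design implies (the function name and the even-split-with-remainder policy are illustrative assumptions, not the actual CARMA logic):

```c
#include <stdint.h>

/* Split a key space of `total` candidate keys evenly across `units`
 * crypto units. Unit `id` receives keys [start, start + count); any
 * remainder keys go to the low-numbered units so no key is skipped. */
void key_range(uint64_t total, int units, int id,
               uint64_t *start, uint64_t *count)
{
    uint64_t base = total / units;
    uint64_t rem  = total % units;
    *count = base + ((uint64_t)id < rem ? 1 : 0);
    *start = (uint64_t)id * base
           + ((uint64_t)id < rem ? (uint64_t)id : rem);
}
```

With seven units (matching the seven Blowfish FUs reported later), 100 keys split as 15, 15, 14, 14, 14, 14, 14 with no gaps or overlaps.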
10
Applications
[Figure: local CPU(s), memory, and NIC on a PCI bridge,
linked over the network to remote CPU(s) and remote RC
boards.]
  • Remote access to Functional Units (FU)
  • Remote processes and FUs access local FUs
  • Potential for FUs to perform autonomous
    communication
  • User's ID sets an access level for enhanced
    security
  • Authentication and encryption could be included
  • Q3 Accomplishments
  • Seven Blowfish B FUs addressable from the local
    processor
  • Able to decrypt input data and encrypt output
    data within the FPGA for secure comm. between
    FPGAs and hosts (e.g., FFT)
  • Working to provide remote access and autonomous
    comm.
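The remote-FU addressing scheme itself is not detailed in the slides; as a purely hypothetical illustration, a flat FU address might pack node, board, and unit identifiers into a single word so both local and remote processes can name any FU uniformly:

```c
#include <stdint.h>

/* Hypothetical FU address encoding -- the field widths (16-bit node,
 * 8-bit board, 8-bit FU) are assumptions for illustration only, not
 * the scheme CARMA actually uses. */
typedef uint32_t fu_addr_t;

fu_addr_t fu_pack(uint16_t node, uint8_t board, uint8_t fu)
{
    return ((fu_addr_t)node << 16) | ((fu_addr_t)board << 8) | fu;
}

uint16_t fu_node(fu_addr_t a)  { return (uint16_t)(a >> 16); }
uint8_t  fu_board(fu_addr_t a) { return (uint8_t)(a >> 8); }
uint8_t  fu_unit(fu_addr_t a)  { return (uint8_t)a; }
```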

11
Application Mapper
  • Evaluating three application mappers on the basis
    of:
  • Ease of use, performance, hardware-device
    independence, programming model, parallelization
    support, resource targeting, network support, and
    stand-alone mapping
  • Celoxica - SDK (Handel-C)
  • Provides access to in-house boards
  • ADM-XRC (x1) and Tarari (x4)
  • StarBridge Systems - Viva
  • Provides best option for hardware independence
  • Annapolis Micro Systems - CoreFire
  • Provides access to the AFRL-IFTC 48-node cluster
  • Xilinx - ISE compulsory, evaluating Jbits for
    partial RTR

[Figure: Tarari and ADM-XRC boards.]
12
Application Mapper
  • QFD Comparison (V1)
  • Compared mappers in various categories
  • No clear winner among application mappers
  • Mapping efficiency will be examined next
  • Jbits
  • Allows for flexibility
  • Potential for splicing partial configurations
  • Users can potentially create hazardous designs!
  • Xilinx is unlikely to support Jbits in the future

13
Job Scheduler (JS)
  • Prototyping effort underway (forecasting)
  • Completed first version of JS (coded in Q2 but
    still under test)
  • Single node
  • Task-based execution using Directed Acyclic
    Graphs (DAGs)
  • Separate processes and message queues for
    fault tolerance
  • Second version of JS (Q4 completion)
  • Multi-node
  • Distributed job migration
  • Checkpoint and rollback
  • Links to Configuration Manager and GEMS
  • External extensions to traditional tools
    (interoperability)
  • Expand upon GWU/GMU work (Dr. El-Ghazawi's group)
  • Code and algorithms reviewed, but LSF required
    (now trying to acquire)
  • Other COTS job schedulers under consideration
  • Striving for a plug-and-play approach to JS
    within CARMA
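Task-based execution over a DAG boils down to releasing each task once all of its dependencies have completed. A minimal sketch of that core using Kahn's topological-sort algorithm (illustrative only; the real JS wraps this idea in message queues, RPC, and fault tolerance):

```c
#define MAX_TASKS 32

/* dep[i][j] != 0 means task j depends on task i. Fills `order` with a
 * valid execution order and returns how many tasks were scheduled;
 * a return value below n indicates a cycle (not a valid DAG). */
int dag_schedule(int n, int dep[MAX_TASKS][MAX_TASKS], int order[])
{
    int indeg[MAX_TASKS] = {0};
    int queue[MAX_TASKS], head = 0, tail = 0, done = 0;

    /* Count unmet dependencies per task. */
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            if (dep[i][j]) indeg[j]++;

    /* Tasks with no dependencies are ready immediately. */
    for (int i = 0; i < n; i++)
        if (indeg[i] == 0) queue[tail++] = i;

    /* "Run" each ready task, then release its dependents. */
    while (head < tail) {
        int t = queue[head++];
        order[done++] = t;
        for (int j = 0; j < n; j++)
            if (dep[t][j] && --indeg[j] == 0)
                queue[tail++] = j;
    }
    return done;
}
```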

c/o GWU/GMU
14
Job Scheduler
DAG-based execution
  • Q3 Accomplishments
  • Rewrote JS in C
  • Stable, secure, easily extendable
  • Minimal overhead penalty
  • Network enabled
  • Uses RPC for local and network comm.
  • Minimal overhead
  • Standard interface
  • Under development
  • Enable dynamic job scheduling
  • Fault tolerance (checkpointing, rollback)
  • Evaluate interface with commercial job schedulers

15
Configuration Manager (CM)
[Figure: CM architecture. The CMUI reaches the CM
through a stub/proxy; the CM's decision maker uses a
message queue, config-file registry, network-node
registry, and network compiler, and manages file
location, transport, and loading; Com modules connect
CMs over the networks to remote nodes; configurations
load onto the boards, RC fabric, and processor.]
  • Configuration Manager (CM)
  • Application interface to RC Board
  • Handles configuration caching, distribution and
    loading
  • CM User Interface (CMUI)
  • Allows user input to configure CM
  • Communication Module (Com)
  • Used to transfer configuration files between CMs
    via TCP/Ethernet or SCI
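The CM's caching role can be sketched with a toy cache; the slot count, linear lookup, and round-robin eviction here are illustrative assumptions, since the slides do not specify the actual policy:

```c
#include <string.h>

#define CACHE_SLOTS 8

/* Toy configuration cache: the CM keeps recently used bitfile names
 * so repeat requests can skip the network transfer. */
struct cfg_cache {
    char name[CACHE_SLOTS][64];
    int  next;  /* next slot to evict, round-robin */
};

/* Returns 1 on a cache hit; on a miss, records the entry (evicting
 * round-robin) and returns 0, signaling that a transfer is needed. */
int cfg_lookup(struct cfg_cache *c, const char *bitfile)
{
    for (int i = 0; i < CACHE_SLOTS; i++)
        if (strcmp(c->name[i], bitfile) == 0)
            return 1;
    strncpy(c->name[c->next], bitfile, 63);
    c->name[c->next][63] = '\0';
    c->next = (c->next + 1) % CACHE_SLOTS;
    return 0;
}
```

A second request for the same configuration then hits locally, which is exactly the contention the stress-test results on the next slides measure.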

16
Management Schemes
MW and CS already built,
CB and PP to be built Q4
[Figure: four management schemes. Master-Worker (MW):
jobs submitted centrally; global view of the system at
all times; GJS and GRMAN dispatch tasks and states over
the network to LRMONs on each local system, which
return results and statistics. Client-Server (CS):
server houses configurations; global view of the system
at all times. Client-Broker (CB): server brokers
configurations. Peer-to-Peer (PP).]
A multilevel approach, with different schemes at
different levels, is anticipated for large numbers of
nodes.
17
Configuration Manager
Five nodes in all (four workers, one master or
server)
  • Prelim. stress testing measurements of CM
    software infrastructure excluding exec. time on
    FPGAs
  • Major component of Completion Time is CM Queue
    Latency (over 70% on average)
  • CM Queue Latency directly dependent on contention
    for configuration files
  • MW and CS performance degrades above 5
    configuration requests per second
  • MW passes out tasks (requests) in round-robin
    fashion, so nodes have similar completion times
  • CS produces a first-come-first-served order as
    each node contends for the server, and closer
    nodes receive preference (due to SCI rings), so
    completion times vary across nodes
  • CS provides better average completion latency
    while MW provides less variance between nodes

Note: config. file transfers via SCI
18
Monitoring Service Options
  • Custom agent per Functional Unit (FU)
  • Provides customized information per FU
  • Heavily burdens user (unless automated)
  • Requires additional FPGA area
  • Centralized agent per FPGA
  • Possibly reduces area overhead
  • Reduces data storage and communication
  • Limits scalability
  • Information storage and response
  • Store on chip or on board
  • Periodically send or respond when queried
  • Key parameters to monitor - further study
  • Custom parameters per algorithm
  • Requires all-encompassing interface
  • Automation needed for usability
  • Monitoring is overhead, so use sparingly!

19
Monitoring Service Parameters
  • Many parameters to measure; will start with a
    subset
  • GEMS to be extended for HPC/RC monitoring
  • Various security levels for each parameter type

GEMS is the gossip-enabled monitoring service
developed by the HCS Lab for robust, scalable,
multilevel monitoring of resource health and
performance. For more info, see
http://www.hcs.ufl.edu/gems
20
Results Summary
  • Q1
  • Several algorithms developed as test cases
  • DES, Blowfish, RSA, Sonar Beamforming,
    Hyperspectral Imaging
  • Prototyping of initial mechanisms for CARMA
    framework
  • ExMan, ConfigMan, TaskMan, and simple JS over
    ADM-XRC API
  • Evaluation of cluster management tools
  • Many commercial tools identified and evaluated
    (ref GWU/GMU)
  • Q2
  • Parallelization options identified and under
    development
  • Two Blowfish flavors with multi-board per node,
    multi-instance per FPGA
  • Application mapper evaluation underway
  • Handel-C, Viva, CoreFire, Jbits
  • Further prototyping and test of CARMA framework
  • ConfigMan over TCP and SCI with MW scheme

21
Results Summary
  • Q3
  • FU remote access
  • Addressing scheme designed, developed and tested
  • Multiple Blowfish and Floating-Point FFT modules
    operable
  • Additional algorithms under development for RC
    Benchmarking
  • N Queens, Serpent, Elliptic Curve Crypto., Monte
    Carlo Pi generator, etc.
  • First phase of application mapper evaluation
    concluded
  • No single winner of Handel-C, Viva, CoreFire,
    Jbits
  • Began to study mapper optimization / inefficiency
    issues
  • Further prototyping of mechanisms for CARMA
    framework
  • ConfigMan (MW and CS) and JS over ADM-XRC and
    Tarari API
  • Hardware monitoring scheme designed
  • Monitoring options, parameters and interfaces
    identified

22
Conclusions
  • Broad coverage of RC design space
  • With focus on COTS-based RC clusters for HPC
  • Builds on lab strength in cluster computing and
    communications
  • Key missing pieces in RC cluster design
    identified
  • Initial framework for CARMA developed
  • Design options being refined
  • Prototyping of preliminary mechanisms underway
  • Several test applications for RC developed
  • Parallelization options for RC under development
  • Collaboration with other RC groups
  • Developing collaboration with key groups in
    academia
  • Pursuing and hopeful of significant industry
    collaboration

23
CARMA Future Work (for Q4)
  • Continue development and evaluation of CARMA
  • Applications and Benchmarking
  • Refine attribute, benchmark and metric
    definitions
  • Map algorithms as appropriate
  • Develop initial benchmark suite for HPC/RC
  • Algorithm Mappers
  • Determine efficiency (or inefficiency) of the
    three mappers under test vs. hand-coded VHDL
  • Job Scheduler (JS)
  • Enable dynamic job scheduling
  • Build in fault tolerance (checkpointing,
    rollback)
  • Provide for distributed job submission and
    scheduling
  • Integrate with the CM
  • Configuration Manager (CM)
  • Scale MW and CS up to 32 nodes
  • Provide a multiple server CS to reduce completion
    latency variability
  • Finish coding CB and PP
  • Add support for additional boards as they become
    available
  • Hardware Monitoring
  • Further develop remote access to FPGA functional
    units / processing elements

24
Future Work (beyond Q4)
  • Continue development and evaluation of CARMA
  • Expanded features
  • Support for additional boards, networks, etc.
  • Functionality and performance optimization
  • Extend early work in RC cluster simulation
  • Extend previous analytical modeling work
    (architecture and software)
  • Leverage modeling and simulation tools under
    development by the MS group @ HCS
  • Forecast architecture limitations
  • Forecast software/management limitations
  • Determine key design tradeoffs
  • Investigate network-attached RC resources
  • Currently procuring evaluation boards for our
    donated FPGAs
  • Could provide in-network content processing or
    pre/post processing
  • Develop interfaces for network attached RC
    devices
  • Develop cores for high-performance networks (e.g.
    SCI, Myrinet, InfiniBand)
  • The Virtex II Pro X shows a trend toward merging
    high-performance networking and RC
  • RC extensions to UPC, SHMEM, etc.
  • Investigate/develop HPC/RC system programming
    model
  • Consider additional RC hardware security
    challenges (e.g. FPGA viruses)