Reconfigurable Computing RC Group - PowerPoint PPT Presentation



1
Reconfigurable Computing (RC) Group
  • Reconfigurable Architectures, Networks, and
    Services for COTS-based Cluster Computing Systems
  • Appendix for Q3 Status Report
  • DOD Project MDA904-03-R-0507
  • February 9, 2004

2
Outline
  • RC Group Overview
  • Motivation
  • CARMA Framework Overview
  • CARMA Framework Updates
  • Applications
  • Benchmarking
  • Application Mappers
  • Job Scheduler
  • Configuration Manager
  • Resource Monitoring Service
  • Conclusions
  • Future Work

3
RC Group Overview
  • Group Members (Fall 2003)
  • Vikas Aggarwal, M.S. student in ECE
  • Ramya Chandrasekaran, M.S. student in ECE
  • Gall Gotfried, B.S. student in CISE
  • Aju Jacob, M.S. student in ECE
  • Matt Radlinski, Ph.D. student in ECE
  • Ian Troxel, Ph.D. student in ECE, group leader
  • Girish Venkatasubramanian, M.S. student in ECE
  • Industry Collaborators
  • Honeywell
  • Xilinx (hardware and tools)
  • Celoxica / Alpha Data / Tarari (boards and tools)
  • Starbridge Systems (pending)
  • Silicon Graphics (pending)
  • Numerous other sponsors for cluster resources
    (Intel, AMD, etc.)

4
Motivation
  • Key missing pieces in RC clusters for HPC
  • Dynamic RC fabric discovery and management
  • Coherent multitasking, multi-user environment
  • Robust job scheduling and management
  • Fault tolerance and scalability
  • Performance monitoring down into the RC fabric
  • Automated application mapping into management
    tool
  • The HCS Lab's proposed Cluster-based Approach to
    Reconfigurable Management Architecture (CARMA)
    attempts to unify existing technologies as well
    as fill in the missing pieces

5
CARMA Framework Overview
  • CARMA seeks to integrate
  • Graphical user interface
  • Applications and Benchmarking (New)
  • COTS application mapper
  • Handel-C, Viva, CoreFire, etc.
  • Graph-based job description
  • Condensed Graphs, DAGMan, etc.
  • Robust management tool
  • Distributed, scalable job scheduling
  • Checkpointing, rollback and recovery
  • Distributed configuration management
  • Multilevel monitoring service (GEMS)
  • Clusters, networks, hosts, RC fabric
  • Monitoring down into RC Fabric
  • Bypass Middleware API (for future work)
  • Multiple types of RC boards
  • Multiple high-speed networks
  • SCI, Myrinet, GigE, InfiniBand, etc.

Note: Substituting RC/HPC benchmarking for the
middleware API task
6
Applications
  • Test applications developed
  • Block ciphers
  • DES, Blowfish
  • Floating-point FFT
  • Sonar Beamforming
  • Hyperspectral Imaging (c/o LANL)
  • Future development
  • Additional cryptanalysis applications
  • RC4, RSA, Diffie-Hellman, Serpent, elliptic-curve
    systems
  • RC/HPC benchmarks (c/o Honeywell TC and UCSD)
  • Cryptanalysis benchmarks (c/o DoD)
  • Other benchmarking algorithm possibilities
  • N Queens, Monte Carlo Pi generator, many others
    considered (see slide 8)

7
RC Benchmarking
  • Parallel RC attributes
  • RC speedup vs. software
  • Parallel efficiency
  • Parallel speedup
  • Job throughput
  • Isoefficiency
  • Communication bandwidth
  • Machine and RC overhead
  • Reconfigurability
  • Parallelism
  • Versatility
  • Capacity
  • Time sensitivity
  • Scalability
  • RC metrics
  • Area used (slices, LUTs)
  • Execution time
  • Speedup per FPGA
  • Total configuration time
  • Communication latency
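Two of the attributes above have standard textbook definitions worth pinning down: speedup S = T_serial / T_parallel and parallel efficiency E = S / p. A trivial sketch of the plain definitions (not CARMA code):

```c
/* Speedup S = T_serial / T_parallel: how much faster the parallel
 * (or RC) version runs than the software baseline. */
double speedup(double t_serial, double t_parallel)
{
    return t_serial / t_parallel;
}

/* Parallel efficiency E = S / p: speedup normalized by the number of
 * processors (or FPGAs); 1.0 means perfect scaling. */
double efficiency(double t_serial, double t_parallel, int procs)
{
    return speedup(t_serial, t_parallel) / procs;
}
```

For example, a job taking 100 s in software and 25 s on 8 FPGAs has speedup 4.0 and efficiency 0.5.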
  • Selected References
  • Parallel Computer Architecture: A Hardware/Software
    Approach, David Culler and Jaswinder Pal Singh
  • Parallel Computing: Performance Metrics and
    Models, Sartaj Sahni and Venkat Thanvantri
  • Analyzing Scalability of Parallel Algorithms and
    Architectures, Vipin Kumar and Anshul Gupta
  • A Benchmark Suite for Evaluating Configurable
    Computing Systems: Status, Reflections, and Future
    Directions, Honeywell Tech. Center (FPGA 2000)
  • The RAW Benchmark Suite: Computation Structures
    for General Purpose Computing, MIT Lab for
    Computer Science (FCCM '97)

8
RC Benchmarking
  • Algorithms under consideration
  • Binary heap
  • DES
  • FFT, DCT
  • Game of Life
  • Boolean satisfiability
  • Matrix multiply
  • N queens
  • CORDIC algorithms
  • Huffman encoding
  • Jacobi relaxation
  • Hanoi algorithms
  • Permutation generator
  • Monte Carlo Pi generator
  • Bubble, quick and merge sort
  • Wireless comm. algorithms
  • Sieve of Eratosthenes prime number generator
  • Search over discrete matrices
  • Wavelet-based image compression
  • Differential PCM
  • Adaptive Viterbi Algorithm
  • RC5, DES, Serpent, Blowfish key crack
  • RSA, Diffie-Hellman
  • Elliptic-curve cryptography
  • Graph problems (SSP,SPM,TC)
  • Micro benchmarks to be created as needed
  • Benchmark suites
  • NAS parallel
  • Pallas
  • SPEC HPC
  • SparseBench
  • PARKBENCH
  • DoD crypto. emulation
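As an example of the micro-benchmark style listed above, a Monte Carlo Pi generator is small enough to sketch in full. This is a plain software baseline only; an RC version would presumably replace the software PRNG and stream samples through the fabric.

```c
#include <stdlib.h>

/* Monte Carlo Pi micro-benchmark sketch: draw pseudo-random points in
 * the unit square and count the fraction landing inside the quarter
 * circle; 4 * (inside / total) approximates Pi. The libc rand() here
 * stands in for whatever PRNG a hardware implementation would use. */
double monte_carlo_pi(long samples, unsigned int seed)
{
    long inside = 0;
    srand(seed);
    for (long i = 0; i < samples; i++) {
        double x = (double)rand() / RAND_MAX;
        double y = (double)rand() / RAND_MAX;
        if (x * x + y * y <= 1.0)
            inside++;
    }
    return 4.0 * (double)inside / (double)samples;
}
```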

9
Applications (Blowfish)
  • Parallelization
  • B optimized for parallel network-traffic
    encryption, where the key remains fixed
  • C optimized for parallel cryptanalysis, where
    the key changes rapidly

[Figure: three Blowfish designs. A: a single Blowfish
instance (crypto unit, control, F function, S-boxes,
P-arrays, and init logic; 2 iterations). B (network
packet processing optimized): multiple crypto units
sharing one set of S-boxes and P-arrays. C
(cryptanalysis optimized): S-boxes and P-arrays
replicated per crypto unit. Based on Virtex 1000E at
25 MHz; shared resources denoted in the original
diagram.]
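The cryptanalysis-optimized design sweeps a key space across many crypto units in parallel. A minimal sketch of the key-space partitioning such a design implies (the function name and the even-split-with-remainder policy are illustrative assumptions, not the actual CARMA logic):

```c
#include <stdint.h>

/* Split a key space of `total` candidate keys evenly across `units`
 * crypto units. Unit `id` receives keys [start, start + count); any
 * remainder keys go to the low-numbered units so no key is skipped. */
void key_range(uint64_t total, int units, int id,
               uint64_t *start, uint64_t *count)
{
    uint64_t base = total / units;
    uint64_t rem  = total % units;
    *count = base + ((uint64_t)id < rem ? 1 : 0);
    *start = (uint64_t)id * base
           + ((uint64_t)id < rem ? (uint64_t)id : rem);
}
```

With seven units (matching the seven Blowfish FUs reported later), 100 keys split as 15, 15, 14, 14, 14, 14, 14 with no gaps or overlaps.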
10
Applications
[Figure: local CPU(s), memory, and NIC on a PCI bridge,
linked over the network to remote CPU(s) and remote RC
boards.]
  • Remote access to Functional Units (FU)
  • Remote processes and FUs access local FUs
  • Potential for FUs to perform autonomous
    communication
  • User's ID sets an access level for enhanced
    security
  • Authentication and encryption could be included
  • Q3 Accomplishments
  • Seven Blowfish B FUs addressable from the local
    processor
  • Able to decrypt input data and encrypt output
    data within the FPGA for secure comm. between
    FPGAs and hosts (e.g., FFT)
  • Working to provide remote access and autonomous
    comm.
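The remote-FU addressing scheme itself is not detailed in the slides; as a purely hypothetical illustration, a flat FU address might pack node, board, and unit identifiers into a single word so both local and remote processes can name any FU uniformly:

```c
#include <stdint.h>

/* Hypothetical FU address encoding -- the field widths (16-bit node,
 * 8-bit board, 8-bit FU) are assumptions for illustration only, not
 * the scheme CARMA actually uses. */
typedef uint32_t fu_addr_t;

fu_addr_t fu_pack(uint16_t node, uint8_t board, uint8_t fu)
{
    return ((fu_addr_t)node << 16) | ((fu_addr_t)board << 8) | fu;
}

uint16_t fu_node(fu_addr_t a)  { return (uint16_t)(a >> 16); }
uint8_t  fu_board(fu_addr_t a) { return (uint8_t)(a >> 8); }
uint8_t  fu_unit(fu_addr_t a)  { return (uint8_t)a; }
```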

11
Application Mapper
  • Evaluating three application mappers on the basis
    of:
  • Ease of use, performance, hardware-device
    independence, programming model, parallelization
    support, resource targeting, network support, and
    stand-alone mapping
  • Celoxica - SDK (Handel-C)
  • Provides access to in-house boards
  • ADM-XRC (x1) and Tarari (x4)
  • StarBridge Systems - Viva
  • Provides best option for hardware independence
  • Annapolis Micro Systems - CoreFire
  • Provides access to the AFRL-IFTC 48-node cluster
  • Xilinx - ISE compulsory, evaluating Jbits for
    partial RTR

[Figure: Tarari and ADM-XRC boards.]
12
Application Mapper
  • QFD Comparison (V1)
  • Compared mappers in various categories
  • No clear winner among application mappers
  • Mapping efficiency will be examined next
  • Jbits
  • Allows for flexibility
  • Potential for splicing partial configurations
  • Users can potentially create hazardous designs!
  • Xilinx is unlikely to support Jbits in the future

13
Job Scheduler (JS)
  • Prototyping effort underway (forecasting)
  • Completed first version of JS (coded in Q2 but
    still under test)
  • Single node
  • Task-based execution using Directed Acyclic
    Graphs (DAGs)
  • Separate processes and message queues for
    fault tolerance
  • Second version of JS (Q4 completion)
  • Multi-node
  • Distributed job migration
  • Checkpoint and rollback
  • Links to Configuration Manager and GEMS
  • External extensions to traditional tools
    (interoperability)
  • Expand upon GWU/GMU work (Dr. El-Ghazawi's group)
  • Code and algorithms reviewed, but LSF required
    (now trying to acquire)
  • Other COTS job schedulers under consideration
  • Striving for a plug-and-play approach to JS
    within CARMA
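Task-based execution over a DAG boils down to releasing each task once all of its dependencies have completed. A minimal sketch of that core using Kahn's topological-sort algorithm (illustrative only; the real JS wraps this idea in message queues, RPC, and fault tolerance):

```c
#define MAX_TASKS 32

/* dep[i][j] != 0 means task j depends on task i. Fills `order` with a
 * valid execution order and returns how many tasks were scheduled;
 * a return value below n indicates a cycle (not a valid DAG). */
int dag_schedule(int n, int dep[MAX_TASKS][MAX_TASKS], int order[])
{
    int indeg[MAX_TASKS] = {0};
    int queue[MAX_TASKS], head = 0, tail = 0, done = 0;

    /* Count unmet dependencies per task. */
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            if (dep[i][j]) indeg[j]++;

    /* Tasks with no dependencies are ready immediately. */
    for (int i = 0; i < n; i++)
        if (indeg[i] == 0) queue[tail++] = i;

    /* "Run" each ready task, then release its dependents. */
    while (head < tail) {
        int t = queue[head++];
        order[done++] = t;
        for (int j = 0; j < n; j++)
            if (dep[t][j] && --indeg[j] == 0)
                queue[tail++] = j;
    }
    return done;
}
```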

c/o GWU/GMU
14
Job Scheduler
DAG-based execution
  • Q3 Accomplishments
  • Rewrote JS in C
  • Stable, secure, easily extendable
  • Minimal overhead penalty
  • Network enabled
  • Uses RPC for local and network comm.
  • Minimal overhead
  • Standard interface
  • Under development
  • Enable dynamic job scheduling
  • Fault tolerance (checkpointing, rollback)
  • Evaluate interface with commercial job schedulers

15
Configuration Manager (CM)
[Figure: CM architecture. The CMUI reaches the CM
through a stub/proxy; the CM's decision maker uses a
message queue, config-file registry, network-node
registry, and network compiler, and manages file
location, transport, and loading; Com modules connect
CMs over the networks to remote nodes; configurations
load onto the boards, RC fabric, and processor.]
  • Configuration Manager (CM)
  • Application interface to RC Board
  • Handles configuration caching, distribution and
    loading
  • CM User Interface (CMUI)
  • Allows user input to configure CM
  • Communication Module (Com)
  • Used to transfer configuration files between CMs
    via TCP/Ethernet or SCI
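The CM's caching role can be sketched with a toy cache; the slot count, linear lookup, and round-robin eviction here are illustrative assumptions, since the slides do not specify the actual policy:

```c
#include <string.h>

#define CACHE_SLOTS 8

/* Toy configuration cache: the CM keeps recently used bitfile names
 * so repeat requests can skip the network transfer. */
struct cfg_cache {
    char name[CACHE_SLOTS][64];
    int  next;  /* next slot to evict, round-robin */
};

/* Returns 1 on a cache hit; on a miss, records the entry (evicting
 * round-robin) and returns 0, signaling that a transfer is needed. */
int cfg_lookup(struct cfg_cache *c, const char *bitfile)
{
    for (int i = 0; i < CACHE_SLOTS; i++)
        if (strcmp(c->name[i], bitfile) == 0)
            return 1;
    strncpy(c->name[c->next], bitfile, 63);
    c->name[c->next][63] = '\0';
    c->next = (c->next + 1) % CACHE_SLOTS;
    return 0;
}
```

A second request for the same configuration then hits locally, which is exactly the contention the stress-test results on the next slides measure.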

16
Management Schemes
MW and CS already built,
CB and PP to be built Q4
[Figure: four management schemes. Master-Worker (MW):
jobs submitted centrally; global view of the system at
all times; GJS and GRMAN dispatch tasks and states over
the network to LRMONs on each local system, which
return results and statistics. Client-Server (CS):
server houses configurations; global view of the system
at all times. Client-Broker (CB): server brokers
configurations. Peer-to-Peer (PP).]
A multilevel approach, with different schemes at
different levels, is anticipated for large numbers of
nodes.
17
Configuration Manager
Five nodes in all (four workers, one master or
server)
  • Prelim. stress testing measurements of CM
    software infrastructure excluding exec. time on
    FPGAs
  • Major component of Completion Time is CM Queue
    Latency (over 70% on average)
  • CM Queue Latency directly dependent on contention
    for configuration files
  • MW and CS performance degrades above 5
    configuration requests per second
  • MW passes out tasks (requests) in round-robin
    fashion, so nodes have similar completion times
  • CS produces a first-come-first-served order as
    each node contends for the server, and closer
    nodes receive preference (due to SCI rings), so
    completion times vary across nodes
  • CS provides better average completion latency
    while MW provides less variance between nodes

Note: config. file transfers via SCI
18
Monitoring Service Options
  • Custom agent per Functional Unit (FU)
  • Provides customized information per FU
  • Heavily burdens user (unless automated)
  • Requires additional FPGA area
  • Centralized agent per FPGA
  • Possibly reduces area overhead
  • Reduces data storage and communication
  • Limits scalability
  • Information storage and response
  • Store on chip or on board
  • Periodically send or respond when queried
  • Key parameters to monitor - further study
  • Custom parameters per algorithm
  • Requires all-encompassing interface
  • Automation needed for usability
  • Monitoring is overhead, so use sparingly!

19
Monitoring Service Parameters
  • Many parameters to measure; will start with a
    subset
  • GEMS to be extended for HPC/RC monitoring
  • Various security levels for each parameter type

GEMS is the gossip-enabled monitoring service
developed by the HCS Lab for robust, scalable,
multilevel monitoring of resource health and
performance. For more info, see
http://www.hcs.ufl.edu/gems
20
Results Summary
  • Q1
  • Several algorithms developed as test cases
  • DES, Blowfish, RSA, Sonar Beamforming,
    Hyperspectral Imaging
  • Prototyping of initial mechanisms for CARMA
    framework
  • ExMan, ConfigMan, TaskMan, and simple JS over
    ADM-XRC API
  • Evaluation of cluster management tools
  • Many commercial tools identified and evaluated
    (ref GWU/GMU)
  • Q2
  • Parallelization options identified and under
    development
  • Two Blowfish flavors with multi-board per node,
    multi-instance per FPGA
  • Application mapper evaluation underway
  • Handel-C, Viva, CoreFire, Jbits
  • Further prototyping and test of CARMA framework
  • ConfigMan over TCP and SCI with MW scheme

21
Results Summary
  • Q3
  • FU remote access
  • Addressing scheme designed, developed and tested
  • Multiple Blowfish and Floating-Point FFT modules
    operable
  • Additional algorithms under development for RC
    Benchmarking
  • N Queens, Serpent, Elliptic Curve Crypto., Monte
    Carlo Pi generator, etc.
  • First phase of application mapper evaluation
    concluded
  • No single winner of Handel-C, Viva, CoreFire,
    Jbits
  • Began to study mapper optimization / inefficiency
    issues
  • Further prototyping of mechanisms for CARMA
    framework
  • ConfigMan (MW and CS) and JS over ADM-XRC and
    Tarari API
  • Hardware monitoring scheme designed
  • Monitoring options, parameters and interfaces
    identified

22
Conclusions
  • Broad coverage of RC design space
  • With focus on COTS-based RC clusters for HPC
  • Builds on lab strength in cluster computing and
    communications
  • Key missing pieces in RC cluster design
    identified
  • Initial framework for CARMA developed
  • Design options being refined
  • Prototyping of preliminary mechanisms underway
  • Several test applications for RC developed
  • Parallelization options for RC under development
  • Collaboration with other RC groups
  • Developing collaboration with key groups in
    academia
  • Pursuing and hopeful of significant industry
    collaboration

23
CARMA Future Work (for Q4)
  • Continue development and evaluation of CARMA
  • Applications and Benchmarking
  • Refine attribute, benchmark and metric
    definitions
  • Map algorithms as appropriate
  • Develop initial benchmark suite for HPC/RC
  • Algorithm Mappers
  • Determine efficiency (or inefficiency) of the
    three mappers under test vs. hand-coded VHDL
  • Job Scheduler (JS)
  • Enable dynamic job scheduling
  • Build in fault tolerance (checkpointing,
    rollback)
  • Provide for distributed job submission and
    scheduling
  • Integrate with the CM
  • Configuration Manager (CM)
  • Scale MW and CS up to 32 nodes
  • Provide a multiple server CS to reduce completion
    latency variability
  • Finish coding CB and PP
  • Add support for additional boards as they become
    available
  • Hardware Monitoring
  • Further develop remote access to FPGA functional
    units / processing elements

24
Future Work (beyond Q4)
  • Continue development and evaluation of CARMA
  • Expanded features
  • Support for additional boards, networks, etc.
  • Functionality and performance optimization
  • Extend early work in RC cluster simulation
  • Extend previous analytical modeling work
    (architecture and software)
  • Leverage modeling and simulation tools under
    development by the MS group @ HCS
  • Forecast architecture limitations
  • Forecast software/management limitations
  • Determine key design tradeoffs
  • Investigate network-attached RC resources
  • Currently procuring evaluation boards for our
    donated FPGAs
  • Could provide in-network content processing or
    pre/post processing
  • Develop interfaces for network attached RC
    devices
  • Develop cores for high-performance networks (e.g.
    SCI, Myrinet, InfiniBand)
  • The Virtex II Pro X shows a trend toward merging
    high-performance networking and RC
  • RC extensions to UPC, SHMEM, etc.
  • Investigate/develop HPC/RC system programming
    model
  • Consider additional RC hardware security
    challenges (e.g. FPGA viruses)