Title: Configurable Computing for Mainstream Software Applications
1. Configurable Computing for Mainstream Software Applications
- William D. Bishop
- wdbishop_at_computer.org
2. Presentation Outline
- Introduction to configurable computing
- Motivation
- Definitions and concepts
- Niche applications
- Research into configurable computing for mainstream software
- Model of configurable computing
- Configurable computing experiments
- Test results
- Observations
- Conclusions
3. Motivation
"For a given class of problems, one set of basic instructions may be more efficient than another such set." (John von Neumann, 1958)
- The above statement can be applied to computer architecture in the following way:
  - Application-specific computer hardware may be more efficient than general-purpose computer hardware for solving a given class of problems
4. Introduction to Configurable Computing
- Definition of a configurable computer
  - A configurable computer is a computing device that provides hardware that may be modified at runtime to efficiently compute a set of tasks.
- Configurable computers offer the following advantages:
  - Increased control logic (a.k.a. processing units) flexibility
  - Increased datapath (a.k.a. wiring) flexibility
  - Ability to specialize the computer hardware at runtime
5. Classification of Computing Machines
6. Building a Configurable Computer
- The basic building block of a configurable computer is the High-Density Programmable Logic Device (HDPLD).
- Suitable HDPLDs have the following features:
  - Large capacity for digital hardware designs
  - Electrically programmable in-system
  - Support for high-speed reconfiguration
  - SRAM-based device
A Modern HDPLD: The Altera 10K100 CPLD
Photo Courtesy of Altera
7. Types of Configurable Computers
- Loosely-Coupled
  - Configurable coprocessor connected to a host computer via a peripheral bus
- Tightly-Coupled
  - Configurable coprocessor connected directly to the system bus of a host computer
- Configurable Instruction Set Computer
  - Processor utilizes configurable hardware to implement instructions
8. Niche Applications of Configurable Computers
- What are niche applications of configurable computers?
  - Applications that use bit-wise computations or integer arithmetic
  - Applications with coarse-grain computations
- Examples of niche applications include the following:
  - Image processing (Athanas, 1995): 138× to 236×
  - Cryptography (Vuillemin, 1996): 10× to 1000×
  - Hardware emulation (Dubois, 1995): 123× to 207×
- Performance improvements of 10× to 1000× are typical.
9. Research Goals
- Develop a model of a configurable computer
- Conduct experiments to quantify key factors that influence the performance of configurable computers
- Use the model to predict the performance of a configurable computer for mainstream software applications
- Propose a configurable computer architecture for mainstream software applications
10. Model of Configurable Computing
- Applications can be modelled as a sequence of transaction pairs
  - CPU initiates a transaction by transmitting parameters
  - When not communicating, the CPU and the coprocessor can compute tasks
  - CPU concludes a transaction by receiving results
- Equations are complex and best left for the thesis
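The transaction-pair idea can be illustrated with a deliberately simplified timing sketch. This is not the thesis model, whose equations are not reproduced in these slides. It assumes that the two communication phases cannot overlap with computation and that the CPU and coprocessor compute fully in parallel between them. All function and parameter names are hypothetical, and the numbers are illustrative only.

```python
def transaction_pair_time(t_send_ns, t_cpu_ns, t_coproc_ns, t_recv_ns):
    """Elapsed time of one transaction pair under a simplified model.

    Assumptions (not from the thesis): sending parameters and receiving
    results are serialized with computation, while the CPU and the
    coprocessor compute fully in parallel between the two transfers.
    """
    return t_send_ns + max(t_cpu_ns, t_coproc_ns) + t_recv_ns


def application_time(pairs):
    """Total time for an application modelled as a sequence of pairs."""
    return sum(transaction_pair_time(*pair) for pair in pairs)


# Illustrative numbers only: three identical transaction pairs.
pairs = [(300, 1000, 800, 300)] * 3
total_ns = application_time(pairs)
print(f"total time: {total_ns} ns")
```

In this sketch the coprocessor only helps when the overlapped work is large relative to the two transfers, which is the intuition behind the granularity factor discussed later.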
11. A Configurable Coprocessor Transaction Pair
12. Key Factors Influencing Performance
- Dynamic configuration of the coprocessor (Pcfg, Tcfg)
- Context-switching of the operating system (Pcs, Tcs)
- Memory utilization and bandwidth (Pmem, Tmem, Smem)
- Bus utilization and bandwidth (Pbus, Tcomm1 and Tcomm2)
- Processing power (Sproc)
- Computation granularity (Tcpu)
- Exploitation of parallelism (Thwopp and Tswopp)
13. Configurable Computing Test Platforms
- Two platforms were chosen for experiments in configurable computing
- Platform I: PC with ARC-PCI Board
  - Loosely-coupled configurable computer
- Platform II: Excalibur Development Board
  - Tightly-coupled configurable computer
14. Platform I: PC with ARC-PCI Board
- Processor: Pentium III
  - 450 MHz Pentium III
  - 512 MB of SDRAM (10 ns)
  - L1 and L2 Cache
- Coprocessor: ARC-PCI
  - Three FLEX 10K50 Devices
  - 8,640 LEs (Logic Elements)
  - 60 KB SRAM (20 ns)
- Operating System: Windows NT 4.0
  - Custom ARC-PCI Device Driver
Photo Courtesy of Altera
15. Platform II: Excalibur Development Board
- Processor: 32-Bit Nios
  - 33 MHz Nios 2.0
  - Optimized for speed
  - Hardware multiplication
  - 256 KB SRAM (30 ns)
- Coprocessor: APEX 20K200E
  - One APEX 20K200E Device
  - 8,320 LEs
  - 104 KB SRAM (10 ns)
- Operating System: None
Figure Courtesy of Altera
16. Configurable Computing Experiments
- The following experiments were conducted:
- Platform I Tests
  - CSIM Coprocessor Tests
  - ARC-PCI Interface Transaction Tests
- Platform I and II Tests
  - Pseudo-Random Number Generation (RAND) Tests
  - Min Heap Insertion and Deletion (MIN) Tests
17. CSIM Coprocessor Tests
- CSIM is a process-oriented, discrete-event simulation library...
  - Well-optimized, mainstream software application
  - Popular applications of CSIM include simulating queuing systems, assembly lines, and computer architectures
- Profiling of CSIM using VTune revealed the following statistics
18. CSIM: Choosing Suitable Functions
- Functions ideally suited for coprocessing have the following characteristics:
  - Computationally intensive
  - Very little use of input, output or internal registers
  - Suitable for implementation in configurable hardware
- Functions chosen for acceleration:
  - Pseudo-random number generation (streams and distributions)
  - Event queue insertion and deletion (event management)
19. CSIM Coprocessor System Components
20. Pseudo-Random Number Generation
- Implemented the CSIM pseudo-random number generation algorithm as a configurable coprocessor...
- Specifications
  - 374 lines of VHDL code
  - Utilizes 30% of an Altera 10K50 CPLD
  - Achieves desired performance (33 MHz)
- Configurable coprocessor system provides identical results to the original software implementation
  - It's completely transparent!
NOTE: The pseudo-random number generation algorithm requires only 9 lines of C code!
21. Pseudo-Random Number Generation Observations
- The enhanced version is slower. Why?
- ANSWER
  - The time required to compute a random number on a typical PC ranges from 80 ns to 120 ns for the CSIM pseudo-random number generation algorithm
  - The time required to read a 64-bit quantity using 32-bit PCI bus transfers is at least 330 ns
  - A more complex computation is necessary to justify the communication latency
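The latency argument above can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted on the slide (80 ns to 120 ns in software, at least 330 ns per 64-bit PCI read). It assumes, optimistically, that the coprocessor itself takes no time to compute and that each random number costs exactly one read. The function name is hypothetical.

```python
SW_TIME_NS = (80, 120)   # software time per random number (from the slide)
BUS_READ_NS = 330        # minimum time to read 64 bits over 32-bit PCI (from the slide)


def best_case_speedup(sw_ns, bus_ns, hw_compute_ns=0.0):
    """Upper bound on speedup when every result costs one bus read.

    hw_compute_ns defaults to zero, an optimistic assumption that
    favours the coprocessor.
    """
    return sw_ns / (bus_ns + hw_compute_ns)


low = best_case_speedup(SW_TIME_NS[0], BUS_READ_NS)
high = best_case_speedup(SW_TIME_NS[1], BUS_READ_NS)
print(f"best-case speedup: {low:.2f}x to {high:.2f}x")
```

Even with a free coprocessor, the best case stays well below 1x, so the bus read alone explains the slowdown.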
22. Event Queue Insertion and Deletion
- Implemented algorithms for event queue insertion and deletion in a configurable coprocessor...
- Specifications
  - Min heap with 4096 entries
  - Each entry has both a 32-bit key and a 32-bit data element
  - 2029 lines of VHDL code
  - Utilizes 41% of an Altera 10K50 CPLD
  - Achieves desired performance (33 MHz)
- Difficult to interface with CSIM
23. ARC-PCI Interface Transaction Tests
- Implemented a hardware timer in the ARC-PCI Board to investigate the actual time required to transfer data on Platform I
- Specifications
  - Hardware timer with a 30 ns resolution
  - Simple control / status register interface
  - 149 lines of VHDL code
  - Utilized 3% of an Altera 10K50 CPLD
24. ARC-PCI Interface Transaction Test Results
NOTE: These test results were obtained using Windows NT 4.0
25. Pseudo-Random Number Generation (RAND) Tests
- Implemented a simple pseudo-random number generator
  - Linear Congruential Generator (LCG)
  - Generates a 32-bit unsigned value
  - Suitable for implementation on both platforms
- Developed an application to test the generator
  - Computes between 500,000 and 500,000,000 pseudo-random numbers
  - Performance varies directly with the number of computations performed
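For readers unfamiliar with linear congruential generators, a minimal software sketch follows. The slides do not give the multiplier or increment used in the experiments, so the constants below are the widely published Numerical Recipes values and are an assumption. The function names are hypothetical.

```python
MODULUS = 2 ** 32            # keeps every value in the 32-bit unsigned range
MULTIPLIER = 1664525         # assumed constant (Numerical Recipes), not from the slides
INCREMENT = 1013904223       # assumed constant (Numerical Recipes), not from the slides


def lcg(seed):
    """Yield an endless stream of 32-bit unsigned pseudo-random values."""
    state = seed % MODULUS
    while True:
        state = (MULTIPLIER * state + INCREMENT) % MODULUS
        yield state


def run_test(count, seed=1):
    """Generate `count` numbers and return the last one, as a timing loop would."""
    generator = lcg(seed)
    value = seed
    for _ in range(count):
        value = next(generator)
    return value


print(run_test(500_000))
```

Each value costs one multiply, one add, and one truncation, so the work per number is very small.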
26. RAND Test Results
27. Min Heap Insertion and Deletion (MIN) Tests
- Implemented the min heap initially designed for CSIM
  - Suitable for implementation on both platforms
- Developed an application to test the min heap hardware
  - Inserts random entries into the heap until it is full and then deletes all of the entries
  - Repeats sequence 5000 times
  - Total of 5,000,000 insertions and deletions per test
  - Performance depends upon the depth of the heap
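The fill-then-drain sequence described above can be sketched in software with Python's standard `heapq` module. This is only an illustration of the test pattern, not the hardware design or the original test application. The function name, the seed, and the use of (key, data) tuples are assumptions. A capacity of 1000 repeated 5000 times would give the 5,000,000 insertions and deletions quoted on the slide; the example uses fewer repeats to stay quick.

```python
import heapq
import random


def run_min_test(capacity=1000, repeats=5, seed=0):
    """Fill a min heap with random (key, data) entries, then drain it.

    Returns the total number of insertions, which equals the number
    of deletions because the heap is emptied on every pass.
    """
    rng = random.Random(seed)
    heap = []
    insertions = 0
    for _ in range(repeats):
        while len(heap) < capacity:
            key = rng.getrandbits(32)    # 32-bit key
            data = rng.getrandbits(32)   # 32-bit data element
            heapq.heappush(heap, (key, data))
            insertions += 1
        previous_key = -1
        while heap:
            key, _data = heapq.heappop(heap)
            # A min heap must return keys in non-decreasing order.
            assert key >= previous_key
            previous_key = key
    return insertions


print(run_min_test(capacity=1000, repeats=5))
```

Insertion and deletion each take time proportional to the heap depth, which grows with the logarithm of the number of entries.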
28. MIN Test Results
NOTE: These test results were obtained using a heap with 1000 entries
29. Impact of Heap Depth on Speedup
30. Observations: Key Factors
- Dynamic configuration of the coprocessor
  - Approximately 2 ms to 6 ms for FLEX 10K50
  - May not be required frequently
- Context-switching of the operating system
  - Approximately 2 µs for Windows NT 4.0
  - No operating system on Platform II
- Memory utilization and bandwidth
  - Cache plays a very significant role on Platform I
  - PC can read its memory at least twice as fast as the FLEX 10K50
  - Nios and APEX 20K200E read memory at the same speed
31. Observations: Key Factors
- Bus utilization and bandwidth
  - Depends upon host computer usage
  - Under light loads, bus reads take approximately twice as long on average as they should theoretically (544 ns vs. 300 ns)
  - Bus contention doesn't occur in Platform II (only 1 bus master)
- Processing power
  - Pentium III is approximately 75× to 100× faster than the Nios processor
  - Clock speed only accounts for a factor of 15×
  - Super-scalar architecture and cache subsystem account for the additional processing power of the Pentium III
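The ratios quoted on this slide can be reproduced with simple arithmetic. The sketch below uses only the slide's figures. Treating the processor speed gap as the product of a clock factor and an architectural factor is a simplifying assumption, and the variable names are hypothetical.

```python
# Figures quoted on the slide.
MEASURED_BUS_READ_NS = 544
THEORETICAL_BUS_READ_NS = 300
PENTIUM_VS_NIOS = (75, 100)   # overall speed ratio, low and high estimates
CLOCK_FACTOR = 15             # slide's rounded figure; 450 MHz / 33 MHz is about 13.6

# How much slower measured bus reads are than the theoretical figure.
bus_slowdown = MEASURED_BUS_READ_NS / THEORETICAL_BUS_READ_NS

# Speed ratio left over once clock speed is factored out (assumed multiplicative).
residual = tuple(ratio / CLOCK_FACTOR for ratio in PENTIUM_VS_NIOS)

print(f"bus reads are {bus_slowdown:.2f}x slower than the theoretical figure")
print(f"architecture and cache contribute roughly {residual[0]:.1f}x to {residual[1]:.1f}x")
```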
32. Observations: Key Factors
- Computation granularity
  - Pseudo-random number generation has fine granularity
  - Min heap insertion and deletion granularity is more reasonable
  - Coarse granularity is desirable
- Exploitation of parallelism
  - Depends upon the application
  - Can recover time lost to configuration, context-switching, memory utilization, and bus utilization
33. Comments on Performance
- System performance is not equal to application performance
  - In a multitasking operating system, there is more than one process to consider
  - Application is just one process in the entire system
  - Application may run slower while other processes run faster
- Addition of a configurable coprocessor effectively transforms the computer system into a heterogeneous multiprocessor system
  - Important to exploit all available resources
  - Coprocessor boards provide additional logic and memory
34. Conclusions
- It is possible to completely hide the use of a configurable coprocessor from the end-user.
- Loosely-coupled configurable computers are not suitable for mainstream software applications due to communication latency and lack of memory bandwidth.
- Tightly-coupled configurable computers are suitable for mainstream software applications.
- Configurable computing may be useful for embedded systems.
35. Selected Configurable Computing References
- Katherine Compton and Scott Hauck. Reconfigurable Computing: A Survey of Systems and Software. ACM Computing Surveys, 34(2):171-210, June 2002.
- Peter M. Athanas and A. Lynn Abbott. Addressing the Computational Requirements of Image Processing with a Custom Computing Machine: An Overview. In Proceedings of the Ninth International Parallel Processing Symposium Special Workshop on Reconfigurable Architectures and Algorithms, Santa Barbara, California, April 1995.
- Jean E. Vuillemin, Patrice Bertin, Didier Roncin, Mark Shand, Hervé H. Touati, and Philippe Boucard. Programmable Active Memories: Reconfigurable Systems Come of Age. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 4(1):56-69, March 1996.
- Michel Dubois, Alain Gefflaut, Jaeheon Jeong, Adrian Moga, and Koray Oner. Multiprocessor Emulation with RPM: Early Experience. Technical Report CENG95-23, University of Southern California, Los Angeles, California, December 1995.
- William Bishop. ARC-PCI Website. http://www.pads.uwaterloo.ca/wdbishop/arc-pci.html