Configurable Computing for Mainstream Software Applications - PowerPoint PPT Presentation

September 27, 2002. University of Waterloo, Parallel and Distributed Systems Research Group.

1
Configurable Computing for Mainstream Software
Applications
  • William D. Bishop
  • wdbishop_at_computer.org

2
Presentation Outline
  • Introduction to configurable computing
  • Motivation
  • Definitions and concepts
  • Niche applications
  • Research into configurable computing for
    mainstream software
  • Model of configurable computing
  • Configurable computing experiments
  • Test results
  • Observations
  • Conclusions

3
Motivation
"For a given class of problems, one set of basic
instructions may be more efficient than another
such set." (John von Neumann, 1958)
  • The above statement can be applied to computer
    architecture in the following way:
  • Application-specific computer hardware may be
    more efficient than general-purpose computer
    hardware for solving a given class of problems

4
Introduction to Configurable Computing
  • Definition of a configurable computer
  • Configurable computers offer the following
    advantages
  • Increased control logic (a.k.a. processing units)
    flexibility
  • Increased datapath (a.k.a. wiring) flexibility
  • Ability to specialize the computer hardware at
    runtime

A configurable computer is a computing device
that provides hardware that may be modified at
runtime to efficiently compute a set of tasks.
5
Classification of Computing Machines
6
Building a Configurable Computer
  • The basic building block of a configurable
    computer is the High-Density Programmable Logic
    Device (HDPLD).
  • Suitable HDPLDs have the following features
  • Large capacity for digital hardware designs
  • Electrically programmable in-system
  • Support for high-speed reconfiguration
  • SRAM-based device

A Modern HDPLD: The Altera 10K100 CPLD
Photo Courtesy of Altera
7
Types of Configurable Computers
  • Loosely-Coupled
  • Configurable coprocessor connected to a host
    computer via a peripheral bus
  • Tightly-Coupled
  • Configurable coprocessor connected directly to
    the system bus of a host computer
  • Configurable Instruction Set Computer
  • Processor utilizes configurable hardware to
    implement instructions

8
Niche Applications of Configurable Computers
  • What are niche applications of configurable
    computers?
  • Applications that use bit-wise computations or
    integer arithmetic
  • Applications with coarse-grain computations
  • Examples of niche applications include the
    following
  • Image processing (Athanas, 1995): 138x to 236x
  • Cryptography (Vuillemin, 1996): 10x to 1000x
  • Hardware emulation (Dubois, 1995): 123x to 207x
  • Performance improvements of 10x to 1000x are
    typical.

9
Research Goals
  • Develop a model of a configurable computer
  • Conduct experiments to quantify key factors that
    influence the performance of configurable
    computers
  • Use the model to predict the performance of a
    configurable computer for mainstream software
    applications
  • Propose a configurable computer architecture for
    mainstream software applications

10
Model of Configurable Computing
  • Applications can be modelled as a sequence of
    transaction pairs
  • CPU initiates a transaction by transmitting
    parameters
  • When not communicating, the CPU and the
    coprocessor can compute tasks
  • CPU concludes a transaction by receiving results
  • Equations are complex and best left for the
    thesis
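
The slides do not reproduce the thesis equations, so the following is only a rough sketch of the transaction-pair idea, not the author's model. It assumes one transaction costs the parameter transfer, then whichever is longer of the CPU's and the coprocessor's concurrent work, then the result transfer. The function names, the max() overlap rule, and the sample timings are all hypothetical.

```python
def transaction_time(t_send, t_cpu_work, t_coproc_work, t_receive):
    """Estimated time (ns) for one transaction pair under a simple overlap assumption."""
    # Assumption: while not communicating, the CPU and the coprocessor
    # compute concurrently, so the slower one sets the middle phase.
    return t_send + max(t_cpu_work, t_coproc_work) + t_receive


def application_time(transactions):
    """Sum the estimated times of a sequence of transaction pairs."""
    return sum(transaction_time(*t) for t in transactions)


# Hypothetical timings in ns: (send, CPU work, coprocessor work, receive).
example = [(300, 1000, 800, 300), (300, 200, 900, 300)]
print(application_time(example))  # 1600 + 1500 = 3100
```

In this simplified view, the coprocessor only helps when its work overlaps enough CPU work to pay for the two transfers.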

11
A Configurable Coprocessor Transaction Pair
12
Key Factors Influencing Performance
  • Dynamic configuration of the coprocessor (Pcfg,
    Tcfg)
  • Context-switching of the operating system (Pcs,
    Tcs)
  • Memory utilization and bandwidth (Pmem, Tmem,
    Smem)
  • Bus utilization and bandwidth (Pbus, Tcomm1 and
    Tcomm2)
  • Processing power (Sproc)
  • Computation granularity (Tcpu)
  • Exploitation of parallelism (Thwopp and Tswopp)

13
Configurable Computing Test Platforms
  • Two platforms were chosen for experiments in
    configurable computing
  • Platform I: PC ARC-PCI Board
  • Loosely-coupled configurable computer
  • Platform II: Excalibur Development Board
  • Tightly-coupled configurable computer

14
Platform I: PC ARC-PCI Board
  • Processor: Pentium III
  • 450 MHz Pentium III
  • 512 MB of SDRAM (10 ns)
  • L1 and L2 Cache
  • Coprocessor: ARC-PCI
  • Three FLEX 10K50 Devices
  • 8,640 LEs (Logic Elements)
  • 60 KB SRAM (20 ns)
  • Operating System: Windows NT 4.0
  • Custom ARC-PCI Device Driver

Photo Courtesy of Altera
15
Platform II: Excalibur Development Board
  • Processor: 32-Bit Nios
  • 33 MHz Nios 2.0
  • Optimized for speed
  • Hardware multiplication
  • 256 KB SRAM (30 ns)
  • Coprocessor: APEX 20K200E
  • One APEX 20K200E Device
  • 8,320 LEs
  • 104 KB SRAM (10 ns)
  • Operating System: None

Figure Courtesy of Altera
16
Configurable Computing Experiments
  • The following experiments were conducted
  • Platform I Tests
  • CSIM Coprocessor Tests
  • ARC-PCI Interface Transaction Tests
  • Platform I and II Tests
  • Pseudo-Random Number Generation (RAND) Tests
  • Min Heap Insertion and Deletion (MIN) Tests

17
CSIM Coprocessor Tests
  • CSIM is a process-oriented, discrete-event
    simulation library...
  • Well-optimized, mainstream software application
  • Popular applications of CSIM include simulating
    queuing systems, assembly lines, and computer
    architectures
  • Profiling of CSIM using VTune revealed the
    following statistics

18
CSIM: Choosing Suitable Functions
  • Functions ideally suited for coprocessing have
    the following characteristics
  • Computationally intensive
  • Very little use of input, output or internal
    registers
  • Suitable for implementation in configurable
    hardware
  • Functions chosen for acceleration
  • Pseudo-random number generation (streams and
    distributions)
  • Event queue insertion and deletion (event
    management)

19
CSIM Coprocessor System Components
20
Pseudo-Random Number Generation
  • Implemented the CSIM pseudo-random number
    generation algorithm as a configurable
    coprocessor...
  • Specifications
  • 374 lines of VHDL code
  • Utilizes 30% of an Altera 10K50 CPLD
  • Achieves desired performance (33 MHz)
  • Configurable coprocessor system provides
    identical results to the original software
    implementation
  • It's completely transparent!

NOTE: The pseudo-random number generation
algorithm requires only 9 lines of C code!
21
Pseudo-Random Number Generation Observations
  • The enhanced version is slower. Why?
  • ANSWER
  • The time required to compute a random number on a
    typical PC ranges from 80 ns to 120 ns for the
    CSIM pseudo-random number generation algorithm
  • The time required to read a 64-bit quantity using
    32-bit PCI bus transfers is at least 330 ns
  • A more complex computation is necessary to
    justify the communication latency
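
The comparison above can be checked with a few lines of arithmetic. This sketch uses only the figures quoted on the slide (80 ns to 120 ns in software, at least 330 ns for the bus read). The function name and the optional coprocessor compute time are illustrative assumptions.

```python
SOFTWARE_NS = (80, 120)  # software time per random number on the PC (from the slide)
BUS_READ_NS = 330        # minimum time to read 64 bits over 32-bit PCI (from the slide)


def best_case_speedup(software_ns, bus_read_ns=BUS_READ_NS, coproc_compute_ns=0):
    """Upper bound on speedup when every result costs at least one bus read."""
    return software_ns / (bus_read_ns + coproc_compute_ns)


for sw in SOFTWARE_NS:
    print(f"{sw} ns in software -> speedup at most {best_case_speedup(sw):.2f}x")
```

Even if the coprocessor computed in zero time, the bound stays below 1x. Under this simple model, an offloaded function has to take longer than 330 ns in software before the bus read alone stops being a net loss.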

22
Event Queue Insertion and Deletion
  • Implemented algorithms for event queue insertion
    and deletion in a configurable coprocessor...
  • Specifications
  • Min heap with 4096 entries
  • Each entry has both a 32-bit key and a 32-bit
    data element
  • 2029 lines of VHDL code
  • Utilizes 41% of an Altera 10K50 CPLD
  • Achieves desired performance (33 MHz)
  • Difficult to interface with CSIM
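
As a quick sizing check on the specification above, the sketch below computes the storage such a heap needs and its worst-case sift distance. It assumes entries are packed as 64-bit words in a complete binary tree. The slides do not say how the hardware actually lays the heap out, and the function names are illustrative.

```python
ENTRIES = 4096   # heap capacity (from the slide)
KEY_BITS = 32    # key width (from the slide)
DATA_BITS = 32   # data element width (from the slide)


def heap_storage_bytes(entries=ENTRIES, key_bits=KEY_BITS, data_bits=DATA_BITS):
    """Bytes needed to store every entry, assuming packed key/data pairs."""
    return entries * (key_bits + data_bits) // 8


def heap_levels(entries=ENTRIES):
    """Levels in a complete binary tree with the given number of nodes."""
    # A complete binary tree with n nodes has floor(log2(n)) + 1 levels.
    return entries.bit_length()


print(heap_storage_bytes())  # 32768 bytes, i.e. 32 KB
print(heap_levels() - 1)     # 12 compare-and-swap steps in the worst case
```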

23
ARC-PCI Interface Transaction Tests
  • Implemented a hardware timer in the ARC-PCI Board
    to investigate the actual time required to
    transfer data on Platform I
  • Specifications
  • Hardware timer with a 30 ns resolution
  • Simple control / status register interface
  • 149 lines of VHDL code
  • Utilized 3% of an Altera 10K50 CPLD

24
ARC-PCI Interface Transaction Test Results
NOTE: These test results were obtained using
Windows NT 4.0
25
Pseudo-Random Number Generation (RAND) Tests
  • Implemented a simple pseudo-random number
    generator
  • Linear Congruential Generator (LCG)
  • Generates a 32-bit unsigned value
  • Suitable for implementation on both platforms
  • Developed an application to test the generator
  • Computes between 500,000 and 500,000,000
    pseudo-random numbers
  • Performance varies directly with the number of
    computations performed
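
For readers unfamiliar with LCGs, here is a minimal software sketch of a 32-bit generator and a test loop. The slides do not give the multiplier, increment, or seed used in the experiments. The constants below are the widely used Numerical Recipes values, chosen purely for illustration, and `lcg32` and `run_test` are hypothetical names.

```python
def lcg32(seed, a=1664525, c=1013904223):
    """Yield 32-bit unsigned values from x -> (a * x + c) mod 2**32."""
    x = seed & 0xFFFFFFFF
    while True:
        # Masking with 0xFFFFFFFF keeps the state to 32 unsigned bits.
        x = (a * x + c) & 0xFFFFFFFF
        yield x


def run_test(count, seed=1):
    """Generate `count` numbers and fold them into an XOR checksum."""
    gen = lcg32(seed)
    checksum = 0
    for _ in range(count):
        checksum ^= next(gen)
    return checksum


# The slide's tests range from 500,000 to 500,000,000 numbers;
# the smallest size is used here.
print(run_test(500_000))
```

Each number costs one multiply, one add, and one mask, so run time grows linearly with the count, consistent with the last bullet above.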

26
RAND Test Results
27
Min Heap Insertion and Deletion (MIN) Tests
  • Implemented the min heap initially designed for
    CSIM
  • Suitable for implementation on both platforms
  • Developed an application to test the min heap
    hardware
  • Inserts random entries into the heap until it is
    full and then deletes all of the entries
  • Repeats sequence 5000 times
  • Total of 5,000,000 insertions and deletions per
    test
  • Performance depends upon the depth of the heap
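
The test sequence described above can be sketched in software as follows. This is an illustrative stand-in built on Python's `heapq`, not the test application from the experiments. The capacity of 1000 is taken from the note on the results slide, and `min_heap_test` is a hypothetical name.

```python
import heapq
import random


def min_heap_test(capacity=1000, repeats=5000, seed=0):
    """Fill a min heap with random entries, drain it, and repeat.

    Returns the number of insertions performed (deletions are equal).
    """
    rng = random.Random(seed)
    heap = []
    insertions = 0
    for _ in range(repeats):
        # Insert random (key, data) pairs until the heap is full.
        while len(heap) < capacity:
            heapq.heappush(heap, (rng.getrandbits(32), rng.getrandbits(32)))
            insertions += 1
        # Delete every entry; keys must come out in non-decreasing order.
        previous = -1
        while heap:
            key, _data = heapq.heappop(heap)
            assert key >= previous
            previous = key
    return insertions


# A reduced run; 5000 repeats of a 1000-entry heap gives 5,000,000 insertions.
print(min_heap_test(capacity=1000, repeats=5))  # 5000
```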

28
MIN Test Results
NOTE: These test results were obtained using a
heap with 1000 entries
29
Impact of Heap Depth on Speedup
30
Observations: Key Factors
  • Dynamic configuration of the coprocessor
  • Approximately 2 ms to 6 ms for FLEX 10K50
  • May not be required frequently
  • Context-switching of the operating system
  • Approximately 2 us for Windows NT 4.0
  • No operating system on Platform II
  • Memory utilization and bandwidth
  • Cache plays a very significant role on Platform I
  • PC can read its memory at least twice as fast as
    the FLEX 10K50
  • Nios and APEX 20K200E read memory at the same
    speed

31
Observations: Key Factors
  • Bus utilization and bandwidth
  • Depends upon host computer usage
  • Under light loads, bus reads take approximately
    twice as long on average as they should
    theoretically (544 ns vs. 300 ns)
  • Bus contention doesn't occur in Platform II (only
    1 bus master)
  • Processing power
  • Pentium III is approximately 75x to 100x faster
    than Nios processor
  • Clock speed only accounts for a factor of 15x
  • Super-scalar architecture and cache subsystem
    account for the additional processing power of
    the Pentium III
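
The processing-power figures can be broken down with simple arithmetic. The sketch below uses the clock rates from the platform slides (450 MHz and 33 MHz), which give a raw ratio of about 13.6x, in line with the slide's rounded figure of 15x. Attributing everything left over to the superscalar core and the caches is the slide's interpretation, not something the code establishes.

```python
PENTIUM_MHZ = 450             # Platform I processor clock (from the slides)
NIOS_MHZ = 33                 # Platform II processor clock (from the slides)
OBSERVED_SPEEDUP = (75, 100)  # Pentium III vs. Nios (from the slides)

clock_ratio = PENTIUM_MHZ / NIOS_MHZ
print(f"clock ratio: {clock_ratio:.1f}x")

for observed in OBSERVED_SPEEDUP:
    # Whatever the clock does not explain is left for the architecture.
    residual = observed / clock_ratio
    print(f"observed {observed}x -> about {residual:.1f}x beyond clock speed")
```

By this estimate, roughly 5.5x to 7.3x of the Pentium III's advantage comes from something other than clock speed.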

32
Observations: Key Factors
  • Computation granularity
  • Pseudo-random number generation has fine
    granularity
  • Min heap insertion and deletion granularity is
    more reasonable
  • Coarse granularity is desirable
  • Exploitation of parallelism
  • Depends upon the application
  • Can recover time lost to configuration,
    context-switching, memory utilization, and bus
    utilization

33
Comments on Performance
  • System performance is not equal to application
    performance
  • In a multitasking operating system, there is more
    than one process to consider
  • Application is just one process in the entire
    system
  • Application may run slower while other processes
    run faster
  • Addition of a configurable coprocessor
    effectively transforms the computer system into a
    heterogeneous multiprocessor system
  • Important to exploit all available resources
  • Coprocessor boards provide additional logic and
    memory

34
Conclusions
  • It is possible to completely hide the use of a
    configurable coprocessor from the end-user.
  • Loosely-coupled configurable computers are not
    suitable for mainstream software applications due
    to communication latency and lack of memory
    bandwidth.
  • Tightly-coupled configurable computers are
    suitable for mainstream software applications.
  • Configurable computing may be useful for embedded
    systems.

35
Selected Configurable Computing References
  • Katherine Compton and Scott Hauck. Reconfigurable
    Computing: A Survey of Systems and Software. ACM
    Computing Surveys, 34(2):171-210, June 2002.
  • Peter M. Athanas and A. Lynn Abbott. Addressing
    the Computational Requirements of Image
    Processing with a Custom Computing Machine: An
    Overview. In Proceedings of the Ninth
    International Parallel Processing Symposium
    Special Workshop on Reconfigurable Architectures
    and Algorithms, Santa Barbara, California, April
    1995.
  • Jean E. Vuillemin, Patrice Bertin, Didier Roncin,
    Mark Shand, Hervé H. Touati, and Philippe
    Boucard. Programmable Active Memories:
    Reconfigurable Systems Come of Age. IEEE
    Transactions on Very Large Scale Integration
    (VLSI) Systems, 4(1):56-69, March 1996.
  • Michel Dubois, Alain Gefflaut, Jaeheon Jeong,
    Adrian Moga, and Koray Oner. Multiprocessor
    Emulation with RPM: Early Experience. Technical
    Report CENG95-23, University of Southern
    California, Los Angeles, California, December
    1995.
  • William Bishop. ARC-PCI Website.
    http://www.pads.uwaterloo.ca/wdbishop/arc-pci.html