Title: Experimental Performance Evaluation For Reconfigurable Computer Systems: The GRAM Benchmarks
1. Experimental Performance Evaluation For Reconfigurable Computer Systems: The GRAM Benchmarks
Chitalwala, E., El-Ghazawi, T., Gaj, K.
The George Washington University, George Mason University
MAPLD 2004, Washington DC
2. Abbreviations
- BRAM - Block RAM
- GRAM - Generalized Reconfigurable Architecture Model
- LM - Local Memory
- Max - Maximum
- MAP - Multi Adaptive Processor
- MPM - Microprocessor Memory
- OCM - On-Chip Memory
- PE - Processing Element
- Trans Perms - Transfer of Permissions
3. Outline
- Problem Statement
- GRAM Description
- Assumptions and Methodology
- Testbed Description: SRC-6E
- Results
- Conclusion and Future Direction
4. Problem Statement
- Develop a standardized model of reconfigurable architectures.
- Define a set of synthetic benchmarks based on this model to analyze performance and discover bottlenecks.
- Evaluate the system against the peak performance specifications given by the manufacturer.
- Prove the concept by using these benchmarks to assess and dynamically characterize the performance of a reconfigurable system, using the SRC-6E as a test case.
5. Generalized Reconfigurable Architecture Model (GRAM)
6. GRAM Benchmarks Objective
- Measure maximum sustainable data transfer rates and latency between the various elements of the GRAM.
- Dynamically characterize the performance of the system against system peak performance.
7. Generalized Reconfigurable Architecture Model (GRAM)
8. GRAM Elements
- PE - Processing Element
- OCM - On-Chip Memory
- LM - Local Memory
- Interconnect Network / Shared Memory
- Bus Interface
- Microprocessor Memory
9. GRAM Benchmarks
- OCM ↔ OCM: Measure max. sustainable bandwidth and latency between two OCMs residing on different PEs.
- OCM ↔ LM: Measure max. sustainable bandwidth and latency between OCM and LM in either direction.
- OCM ↔ Shared Memory: Measure max. sustainable bandwidth and latency between OCM and Shared Memory in either direction.
- Shared Memory ↔ MPM: Measure max. sustainable bandwidth and latency between Shared Memory and MPM in either direction.
10. GRAM Benchmarks
- OCM ↔ MPM: Measure max. sustainable bandwidth and latency between OCM and MPM in either direction.
- LM ↔ MPM: Measure max. sustainable bandwidth and latency between LM and MPM in either direction.
- LM ↔ LM: Measure max. sustainable bandwidth and latency between two LMs in either direction.
- LM ↔ Shared Memory: Measure max. sustainable bandwidth and latency between LM and Shared Memory in either direction.
11. GRAM Assumptions
12. Assumptions
- All devices on the board are fed by a single clock.
- There is no direct path between the Local Memories of individual elements.
- Connections for add-on cards may exist but are not shown.
- The generalized architecture has been created based on precedents set by past and current manufacturers of reconfigurable systems.
13. Methodology
- Data paths are parallelized to the maximum extent possible.
- Inputs and outputs have been kept symmetrical.
- Hardware timers are used to measure the times taken to transfer data.
- Measurements are taken for transfers of increasingly large amounts of data.
- Data must be verified for correctness after transfers.
- Multiple paths may exist between the elements specified; our aim is to measure the fastest path available.
- All experiments are conducted using the programming model and library functions of the system.
14. Testbed Description: SRC-6E
15. Hardware Architecture of the SRC-6E
[Block diagram: 64-bit × 6 data paths at 800/1600 Mbytes/sec between the microprocessor boards and the MAP boards; 64-bit paths at 800 Mbytes/sec on board]
16. Programming Model of the SRC-6E
17. GRAM Benchmarks for the SRC-6E
18. GRAM Benchmarks for the SRC-6E
Benchmark | SRC-6E equivalent
OCM ↔ OCM | BRAM ↔ BRAM
OCM ↔ LM | NA
OCM ↔ Shared Memory | BRAM ↔ On-Board Memory
Shared Memory ↔ MPM | On-Board Memory ↔ Common Memory
OCM ↔ MPM | BRAM ↔ Common Memory
LM ↔ MPM | NA
LM ↔ LM | NA
LM ↔ Shared Memory | NA
19. Results
20. Block Diagram for a Single-Bank Transfer between OCM and Shared Memory
Start_timer
Read_timer(ht0)
µProcessor Memory to Shared Memory (DMA_in)
Read_timer(ht1)
Shared Memory to OCM
Read_timer(ht2)
OCM to Shared Memory
Read_timer(ht3)
Shared Memory to µProcessor Memory (DMA_out)
Read_timer(ht4)
21. Latency

Latency | Minimum Data Transferred | Latency in Clock Cycles (P III / P IV) | Latency in µs (P III / P IV)
Shared Memory to OCM | 1 word | 20 / 20 | 0.20 / 0.20
OCM to Shared Memory | 1 word | 15 / 15 | 0.15 / 0.15
OCM to OCM (Bridge Port) | 1 word | 11 / 11 | 0.11 / 0.11
Shared Memory to MPM | 4 words | 4200 / 2100 | 42 / 21
MPM to Shared Memory | 4 words | 1000 / 1000 | 10 / 10
(1 word = 64 bits)
22. Latency
- The difference between read and write times for the OCM and Shared Memory is due to the read latency of OBM (6 clocks) vs. BRAM (1 clock).
- When transferring data from the MPM to Shared Memory, writes are issued on each clock cycle and there is no startup latency involved.
- When reading data from the Shared Memory into the MPM, an additional five clock cycles are required to transfer data after the read has been issued.
23. Data Path from OCM to OCM Using Transfer of Permissions
24. Data Path from OCM to OCM Using the Bridge Port and the Streaming Protocol
25. P III & P IV Bandwidth: OCM and OCM (BM1)
26. P III Bandwidth: OCM and OCM (BM1)
27. P IV Bandwidth: OCM and OCM (BM1)
28. P IV Bandwidth: OCM and OCM (BM1) (Streaming Protocol in Bridge Port)
29. Data Path from OCM to MPM and Shared Memory to MPM
30. P III Bandwidth: OCM and Shared Memory for a Single Bank
31. P III Bandwidth: OCM and Shared Memory
32. P IV Bandwidth: OCM and Shared Memory
33. P III Bandwidth: OCM and µP Memory
34. P IV Bandwidth: OCM and µP Memory
35. P III Bandwidth: Shared Memory and µP Memory (BM5)
36. P IV Bandwidth: Shared Memory and µP Memory
37. P III Bandwidth: Shared Memory and µP Memory
38. P IV Bandwidth: Shared Memory and µP Memory
39. Data Path from FPGA Register to Shared Memory
40. P III Bandwidth: Shared Memory and Register
41. Conclusion and Future Direction
42. GRAM Summation for Pentium III

Benchmark | Peak Performance (Mbytes/s) | Max Sustainable Bandwidth Measured (Mbytes/s) | Efficiency (%) | Normalized Transfer Rate (vs. PCI-X @ 133 MHz, 32 bits, unidirectional)
OCM ↔ OCM a (Bridge Port) | 800 | 149 | 18.6 | 0.28
OCM ↔ OCM b (Trans Perms) | 800 | 793 | 99.13 | 1.5
OCM ↔ OCM c (Streaming) | 800 | NA | NA | NA
OCM ↔ LM | NA | NA | NA | NA
OCM → Shared Memory / Shared Memory → OCM | 2400 | 2373 / 2373 | 98.8 / 98.8 | 4.46
OCM → MPM / MPM → OCM | 800 / 800 | 182.8 / 227.3 | 22.85 / 28.41 | 0.34 / 0.43
Shared Memory → MPM / MPM → Shared Memory | 800 / 800 | 203 / 314 | 25.3 / 39.3 | 0.38 / 0.59
Shared Memory → Reg / Reg → Shared Memory | 800 / 800 | 798 / 798 | 99.75 / 99.75 | 1.5 / 1.5
LM ↔ MPM | NA | NA | NA | NA
LM ↔ LM | NA | NA | NA | NA
LM ↔ Shared Memory | NA | NA | NA | NA
(OCM ↔ Shared Memory figures are for three banks.)
43. GRAM Summation for Pentium IV

Benchmark | Peak Performance (Mbytes/s) | Max Sustainable Bandwidth Measured (Mbytes/s) | Efficiency (%) | Normalized Transfer Rate (vs. PCI-X @ 133 MHz, 32 bits, unidirectional)
OCM ↔ OCM a (Bridge Port) | 800 | 149 | 18.6 | 0.28
OCM ↔ OCM b (Trans Perms) | 800 | 797.39 | 99.67 | 1.5
OCM ↔ OCM c (Streaming) | 800 | 799.49 | 100 | 1.5
OCM ↔ LM | NA | NA | NA | NA
OCM → Shared Memory / Shared Memory → OCM | 2400 | 2392 / 2390 | 99.6 / 99.6 | 4.5 / 4.5
OCM → MPM / MPM → OCM | 800 / 800 | 578 / 562 | 72.25 / 70.25 | 1.08 / 1.05
Shared Memory → MPM / MPM → Shared Memory | 800 / 800 | 796 / 799 | 99.5 / 99.8 | 1.5 / 1.5
Shared Memory → Reg / Reg → Shared Memory | 800 / 800 | 798 / 798 | 99.75 / 99.75 | 1.5 / 1.5
LM ↔ MPM | NA | NA | NA | NA
LM ↔ LM | NA | NA | NA | NA
LM ↔ Shared Memory | NA | NA | NA | NA
(OCM ↔ Shared Memory figures are for three banks.)
44. Conclusions
- The type of components used plays a major role in determining the performance of the system, as seen in the performance of the Pentium III and Pentium IV versions of the SRC-6E.
- The software environment and its state of development play a role in determining how effectively a program can utilize the hardware. This is clear from the difference in bandwidth achieved across the Bridge Ports between the Carte 1.6.2 release and the Carte 1.7 release.
45. Conclusions
- The GRAM Summation Tables serve machine architects in the following ways:
- The efficiency column indicates how well a particular communication channel is being utilized within the hardware context. If efficiency is low, architects may be able to improve performance with a firmware improvement; if efficiency is high but the normalized bandwidth is low, they should consider a hardware upgrade.
- By looking at the normalized bandwidths obtained from the GRAM benchmarks, designers can also determine whether the data transfer rates are balanced across the architectural modules. This helps in identifying bottlenecks.
- Designers can find out which channels have the maximum efficiency and can hence fine-tune their applications to exploit these channels and achieve the maximum data transfer rate.
46. Conclusions
- In addition, the GRAM Summation Tables provide the following information to application developers:
- The tables can tell a designer what bottlenecks to expect and where those bottlenecks lie.
- By comparing the figures for efficiency and the normalized transfer rates, designers can determine whether the bottlenecks are created by the hardware or the software.
- By studying the GRAM Summation Tables, designers can predict the performance of a pre-designed application on a particular reconfigurable system.
47. Future Direction
- The benchmarks can be expanded to include end-to-end performance from asymmetrical and synthetic workloads.
- The benchmarks can also include tables characterizing the performance of reconfigurable computers as it compares to modern parallel architectures. A performance-to-cost analysis can also be considered.