Title: Experimental Performance Evaluation For Reconfigurable Computer Systems: The GRAM Benchmarks
1. Experimental Performance Evaluation For Reconfigurable Computer Systems: The GRAM Benchmarks
Chitalwala, E., El-Ghazawi, T., Gaj, K.
The George Washington University, George Mason University
MAPLD 2004, Washington DC
2. Abbreviations
- BRAM - Block RAM
- GRAM - Generalized Reconfigurable Architecture Model
- LM - Local Memory
- Max - Maximum
- MAP - Multi Adaptive Processor
- MPM - Microprocessor Memory
- OCM - On-Chip Memory
- PE - Processing Element
- Trans Perms - Transfer of Permissions
3. Outline
- Problem Statement
- GRAM Description
- Assumptions and Methodology
- Testbed Description: SRC-6E
- Results
- Conclusion and Future Direction
4. Problem Statement
- Develop a standardized model of reconfigurable architectures.
- Define a set of synthetic benchmarks based on this model to analyze performance and discover bottlenecks.
- Evaluate the system against the peak performance specifications given by the manufacturer.
- Prove the concept by using these benchmarks to assess and dynamically characterize the performance of a reconfigurable system, using the SRC-6E as a test case.
5. Generalized Reconfigurable Architecture Model (GRAM)
6. GRAM Benchmarks Objective
- Measure maximum sustainable data transfer rates and latency between the various elements of the GRAM.
- Dynamically characterize the performance of the system against system peak performance.
7. Generalized Reconfigurable Architecture Model (GRAM)
8. GRAM Elements
- PE - Processing Element
- OCM - On-Chip Memory
- LM - Local Memory
- Interconnect Network / Shared Memory
- Bus Interface
- Microprocessor Memory
9. GRAM Benchmarks
- OCM ↔ OCM: Measure max. sustainable bandwidth and latency between two OCMs residing on different PEs.
- OCM ↔ LM: Measure max. sustainable bandwidth and latency between OCM and LM in either direction.
- OCM ↔ Shared Memory: Measure max. sustainable bandwidth and latency between OCM and Shared Memory in either direction.
- Shared Memory ↔ MPM: Measure max. sustainable bandwidth and latency between Shared Memory and MPM in either direction.
10. GRAM Benchmarks
- OCM ↔ MPM: Measure max. sustainable bandwidth and latency between OCM and MPM in either direction.
- LM ↔ MPM: Measure max. sustainable bandwidth and latency between LM and MPM in either direction.
- LM ↔ LM: Measure max. sustainable bandwidth and latency between two LMs in either direction.
- LM ↔ Shared Memory: Measure max. sustainable bandwidth and latency between LM and Shared Memory in either direction.
11. GRAM Assumptions
12. Assumptions
- All devices on the board are fed by a single clock.
- There is no direct path between the Local Memories of individual elements.
- Connections for add-on cards may exist but are not shown.
- The generalized architecture has been created based on precedents set by past and current manufacturers of reconfigurable systems.
13. Methodology
- Data paths are parallelized to the maximum extent possible.
- Inputs and outputs have been kept symmetrical.
- Hardware timers are used to measure the times taken to transfer data.
- Measurements are taken for transfers of increasingly large amounts of data.
- Data must be verified for correctness after transfers.
- Multiple paths may exist between the elements specified; our aim is to measure the fastest path available.
- All experiments are conducted using the programming model and library functions of the system.
14. Testbed Description: SRC-6E
15. Hardware Architecture of the SRC-6E
[Block diagram: 64-bit × 6 data paths at 800/1600 Mbytes/sec between the microprocessor boards and the MAP boards; 64-bit paths at 800 Mbytes/sec on board]
16. Programming Model of the SRC-6E
17. GRAM Benchmarks for the SRC-6E
18. GRAM Benchmarks for the SRC-6E
Benchmark | SRC-6E equivalent
OCM ↔ OCM | BRAM ↔ BRAM
OCM ↔ LM | NA
OCM ↔ Shared Memory | BRAM ↔ On-Board Memory
Shared Memory ↔ MPM | On-Board Memory ↔ Common Memory
OCM ↔ MPM | BRAM ↔ Common Memory
LM ↔ MPM | NA
LM ↔ LM | NA
LM ↔ Shared Memory | NA
19. Results
20. Block Diagram for a Single-Bank Transfer between OCM and Shared Memory
Start_timer
Read_timer(ht0)
µProcessor Memory to Shared Memory (DMA_in)
Read_timer(ht1)
Shared Memory to OCM
Read_timer(ht2)
OCM to Shared Memory
Read_timer(ht3)
Shared Memory to µProcessor Memory (DMA_out)
Read_timer(ht4)
21. Latency

Latency | Minimum Data Transferred | Latency in Clock Cycles (P III / P IV) | Latency in µs (P III / P IV)
Shared Memory to OCM | 1 word | 20 / 20 | 0.20 / 0.20
OCM to Shared Memory | 1 word | 15 / 15 | 0.15 / 0.15
OCM to OCM (Bridge Port) | 1 word | 11 / 11 | 0.11 / 0.11
Shared Memory to MPM | 4 words | 4200 / 2100 | 42 / 21
MPM to Shared Memory | 4 words | 1000 / 1000 | 10 / 10
(1 word = 64 bits)
22. Latency
- The difference between read and write times for the OCM and Shared Memory is due to the read latency of OBM (6 clocks) vs. BRAM (1 clock).
- When transferring data from the MPM to Shared Memory, writes are issued on each clock cycle and there is no startup latency involved.
- When reading data from the Shared Memory into the MPM, an additional five clock cycles are required to transfer data after the read has been issued.
23. Data Path from OCM to OCM Using Transfer of Permissions
24. Data Path from OCM to OCM Using the Bridge Port and the Streaming Protocol
25. P III & P IV Bandwidth: OCM and OCM (BM1)
26. P III Bandwidth: OCM and OCM (BM1)
27. P IV Bandwidth: OCM and OCM (BM1)
28. P IV Bandwidth: OCM and OCM (BM1) (Streaming Protocol in Bridge Port)
29. Data Path from OCM to MPM and Shared Memory to MPM
30. P III Bandwidth: OCM and Shared Memory for a Single Bank
31. P III Bandwidth: OCM and Shared Memory
32. P IV Bandwidth: OCM and Shared Memory
33. P III Bandwidth: OCM and µP Memory
34. P IV Bandwidth: OCM and µP Memory
35. P III Bandwidth: Shared Memory and µP Memory (BM5)
36. P IV Bandwidth: Shared Memory and µP Memory
37. P III Bandwidth: Shared Memory and µP Memory
38. P IV Bandwidth: Shared Memory and µP Memory
39. Data Path from FPGA Register to Shared Memory
40. P III Bandwidth: Shared Memory and Register
41. Conclusion and Future Direction
42. GRAM Summation for Pentium III

Benchmark | Peak Performance (Mbytes/s) | Max Sustainable Bandwidth Measured (Mbytes/s) | Efficiency (%) | Normalized Transfer Rate (vs. PCI-X @ 133 MHz, 32 bits, unidirectional)
OCM ↔ OCM a (Bridge Port) | 800 | 149 | 18.6 | 0.28
OCM ↔ OCM b (Trans Perms) | 800 | 793 | 99.13 | 1.5
OCM ↔ OCM c (Streaming) | 800 | NA | NA | NA
OCM ↔ LM | NA | NA | NA | NA
OCM → Shared Memory / Shared Memory → OCM | 2400 | 2373 / 2373 | 98.8 / 98.8 | 4.46
OCM → MPM / MPM → OCM | 800 / 800 | 182.8 / 227.3 | 22.85 / 28.41 | 0.34 / 0.43
Shared Memory → MPM / MPM → Shared Memory | 800 / 800 | 203 / 314 | 25.3 / 39.3 | 0.38 / 0.59
Shared Memory → Reg / Reg → Shared Memory | 800 / 800 | 798 / 798 | 99.75 / 99.75 | 1.5 / 1.5
LM ↔ MPM | NA | NA | NA | NA
LM ↔ LM | NA | NA | NA | NA
LM ↔ Shared Memory | NA | NA | NA | NA
(OCM ↔ Shared Memory figures are for three banks.)
43. GRAM Summation for Pentium IV

Benchmark | Peak Performance (Mbytes/s) | Max Sustainable Bandwidth Measured (Mbytes/s) | Efficiency (%) | Normalized Transfer Rate (vs. PCI-X @ 133 MHz, 32 bits, unidirectional)
OCM ↔ OCM a (Bridge Port) | 800 | 149 | 18.6 | 0.28
OCM ↔ OCM b (Trans Perms) | 800 | 797.39 | 99.67 | 1.5
OCM ↔ OCM c (Streaming) | 800 | 799.49 | 100 | 1.5
OCM ↔ LM | NA | NA | NA | NA
OCM → Shared Memory / Shared Memory → OCM | 2400 | 2392 / 2390 | 99.6 / 99.6 | 4.5 / 4.5
OCM → MPM / MPM → OCM | 800 / 800 | 578 / 562 | 72.25 / 70.25 | 1.08 / 1.05
Shared Memory → MPM / MPM → Shared Memory | 800 / 800 | 796 / 799 | 99.5 / 99.8 | 1.5 / 1.5
Shared Memory → Reg / Reg → Shared Memory | 800 / 800 | 798 / 798 | 99.75 / 99.75 | 1.5 / 1.5
LM ↔ MPM | NA | NA | NA | NA
LM ↔ LM | NA | NA | NA | NA
LM ↔ Shared Memory | NA | NA | NA | NA
(OCM ↔ Shared Memory figures are for three banks.)
44. Conclusions
- The type of components used plays a major role in determining the performance of the system, as seen in the performance of the Pentium III and Pentium IV versions of the SRC-6E.
- The software environment and its state of development play a role in determining how effectively a program can utilize the hardware. This is clear from the difference in bandwidth achieved across the Bridge Ports between the Carte 1.6.2 release and the Carte 1.7 release.
45. Conclusions
- The GRAM Summation Tables serve machine architects in the following ways:
- The efficiency column indicates how well a particular communication channel is being utilized within the hardware context. If efficiency is low, architects may be able to improve performance with a firmware improvement; if efficiency is high but the normalized bandwidth is low, they should consider a hardware upgrade.
- By looking at the normalized bandwidths obtained from the GRAM benchmarks, designers can also determine whether the data transfer rates are balanced across the architectural modules. This helps in identifying bottlenecks.
- Designers can find out which channels have the maximum efficiency and can hence fine-tune their applications to exploit these channels and achieve the maximum data transfer rate.
46. Conclusions
- In addition, the GRAM Summation Tables provide the following information to application developers:
- The tables can tell a designer what bottlenecks to expect and where those bottlenecks lie.
- By comparing the figures for efficiency and the normalized transfer rates, designers can determine whether the bottlenecks are created by the hardware or the software.
- By studying the GRAM Summation Tables, designers can predict the performance of a pre-designed application on a particular reconfigurable system.
47. Future Direction
- The benchmarks can be expanded to include end-to-end performance from asymmetrical and synthetic workloads.
- The benchmarks can also include tables characterizing the performance of reconfigurable computers as it compares to modern parallel architectures. A performance-to-cost analysis can also be considered.