Experimental Performance Evaluation For Reconfigurable Computer Systems: The GRAM Benchmarks


Transcript and Presenter's Notes

1
Experimental Performance Evaluation For
Reconfigurable Computer Systems The GRAM
Benchmarks
Chitalwala, E., El-Ghazawi, T., Gaj, K.
The George Washington University, George Mason University
MAPLD 2004, Washington DC
2
Abbreviations
  • BRAM - Block RAM
  • GRAM - Generalized Reconfigurable Architecture
    Model
  • LM - Local Memory
  • Max - Maximum
  • MAP - Multi Adaptive Processor
  • MPM - Microprocessor Memory
  • OCM - On-Chip Memory
  • PE - Processing Element
  • Trans Perms - Transfer of Permissions

3
Outline
  • Problem Statement
  • GRAM Description
  • Assumptions and Methodology
  • Testbed Description: SRC-6E
  • Results
  • Conclusion and Future Direction

4
Problem Statement
  • Develop a standardized model of Reconfigurable
    Architectures.
  • Define a set of synthetic benchmarks based on
    this model to analyze performance and discover
    bottlenecks.
  • Evaluate the system against the peak performance
    specifications given by the manufacturer.
  • Prove the concept by using these benchmarks to
    assess and dynamically characterize the
    performance of a reconfigurable system, using the
    SRC-6E as a test case.

5
Generalized Reconfigurable Architecture Model
(GRAM)
6
GRAM Benchmarks Objective
  • To measure maximum sustainable data transfer
    rates and latency between the various elements of
    the GRAM.
  • Dynamically characterize the performance of the
    system against system peak performance.

7
Generalized Reconfigurable Architecture Model
(GRAM)
8
GRAM Elements
  • PE - Processing Element
  • OCM - On-Chip Memory
  • LM - Local Memory
  • Interconnect Network / Shared Memory
  • Bus Interface
  • Microprocessor Memory (MPM)

9
GRAM Benchmarks
  • OCM ↔ OCM: Measure max. sustainable bandwidth and
    latency between two OCMs residing on different
    PEs.
  • OCM ↔ LM: Measure max. sustainable bandwidth and
    latency between OCM and LM in either direction.
  • OCM ↔ Shared Memory: Measure max. sustainable
    bandwidth and latency between OCM and Shared
    Memory in either direction.
  • Shared Memory ↔ MPM: Measure max. sustainable
    bandwidth and latency between Shared Memory and
    MPM in either direction.

10
GRAM Benchmarks
  • OCM ↔ MPM: Measure max. sustainable bandwidth and
    latency between OCM and MPM in either direction.
  • LM ↔ MPM: Measure max. sustainable bandwidth and
    latency between LM and MPM in either direction.
  • LM ↔ LM: Measure max. sustainable bandwidth and
    latency between two LMs in either direction.
  • LM ↔ Shared Memory: Measure max. sustainable
    bandwidth and latency between LM and Shared
    Memory in either direction.

11
GRAM Assumptions
12
Assumptions
  • All devices on board are fed by a single
    clock.
  • There is no direct path between the Local
    Memories of individual elements.
  • Connections for add-on cards may exist but are
    not shown.
  • The generalized architecture has been created
    based on precedents set by past and current
    manufacturers of Reconfigurable Systems.

13
Methodology
  • Data paths can be parallelized to the maximum
    extent possible.
  • Inputs and Outputs have been kept symmetrical.
  • Hardware timers have been used to measure times
    taken to transfer data.
  • Measurements have been taken for transfers of
    increasingly large amounts of data.
  • Data must be verified for correctness after
    transfers.
  • Multiple paths may exist between the elements
    specified. Our aim will be to measure the fastest
    path available.
  • All experiments will be conducted using the
    programming model and library functions of the
    system.
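The methodology above (hardware timers around each transfer, increasingly large payloads, and a correctness check afterwards) can be sketched as a host-side measurement loop. This is an illustrative sketch only: `dma_transfer`, the cycle counter, and the 100 MHz / one-64-bit-word-per-clock transfer model are assumptions standing in for the vendor's library calls, not the actual Carte API.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins for the vendor's timer and DMA calls
   (the real library names are not shown on the slides). */
static uint64_t fake_clock;                 /* elapsed FPGA clocks */

static void dma_transfer(void *dst, const void *src, size_t bytes)
{
    memcpy(dst, src, bytes);                /* stand-in for a real DMA */
    fake_clock += bytes / 8;                /* model: one 64-bit word per clock */
}

/* Bandwidth in Mbytes/s for `bytes` moved in `cycles` clocks at `mhz`:
   cycles / mhz gives microseconds, and bytes per microsecond = Mbytes/s. */
static double bandwidth_mbs(size_t bytes, uint64_t cycles, double mhz)
{
    return (double)bytes / ((double)cycles / mhz);
}

/* One measurement: move `bytes`, verify correctness, return Mbytes/s
   (or -1.0 if the data arrived corrupted). */
static double measure(void *dst, const void *src, size_t bytes, double mhz)
{
    uint64_t t0 = fake_clock;               /* read hardware timer */
    dma_transfer(dst, src, bytes);
    uint64_t t1 = fake_clock;               /* read hardware timer again */
    if (memcmp(dst, src, bytes) != 0)       /* verify after the transfer */
        return -1.0;
    return bandwidth_mbs(bytes, t1 - t0, mhz);
}
```

Under this model a sustained one-word-per-clock channel at 100 MHz yields 800 Mbytes/s, which is the peak figure the slides quote for the single-bank paths.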

14
Testbed Description: SRC-6E
15
Hardware Architecture of the SRC-6E
[Block diagram of the SRC-6E hardware architecture; the labels surviving from the figure are link rates of 800/1600 Mbytes/sec and 800 Mbytes/sec, and data-path widths of 64 bits and 64 bits × 6.]
16
Programming Model of the SRC-6E
17
GRAM Benchmarks for the SRC-6E
18
GRAM Benchmarks for the SRC-6E
Benchmark                 | SRC-6E
OCM ↔ OCM                 | BRAM ↔ BRAM
OCM ↔ LM                  | NA
OCM ↔ Shared Memory       | BRAM ↔ On-Board Memory
Shared Memory ↔ MPM       | On-Board Memory ↔ Common Memory
OCM ↔ MPM                 | BRAM ↔ Common Memory
LM ↔ MPM                  | NA
LM ↔ LM                   | NA
LM ↔ Shared Memory        | NA
19
Results
20
Block Diagram for a Single Bank transfer between
OCM to Shared Memory
Start_timer
Read_timer(ht0)
µProcessor Memory to Shared Memory (DMA_in)
Read_timer(ht1)
Shared Memory to OCM
Read_timer(ht2)
OCM to Shared Memory
Read_timer(ht3)
Shared Memory to µProcessor Memory (DMA_out)
Read_timer(ht4)
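The five timer readings above bracket the four transfer segments, so each segment's time falls out by subtraction. A minimal sketch in C; `ht0`..`ht4` are the hardware-timer samples named in the diagram, and the struct and function names are hypothetical:

```c
#include <stdint.h>

/* Clocks spent in each of the four transfer segments of the diagram. */
typedef struct {
    uint64_t dma_in;        /* µProcessor Memory -> Shared Memory */
    uint64_t shared_to_ocm; /* Shared Memory -> OCM               */
    uint64_t ocm_to_shared; /* OCM -> Shared Memory               */
    uint64_t dma_out;       /* Shared Memory -> µProcessor Memory */
} segment_clocks;

static segment_clocks split_segments(uint64_t ht0, uint64_t ht1,
                                     uint64_t ht2, uint64_t ht3,
                                     uint64_t ht4)
{
    /* Each segment is the difference of consecutive timer samples. */
    segment_clocks s = { ht1 - ht0, ht2 - ht1, ht3 - ht2, ht4 - ht3 };
    return s;
}
```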
21
Latency
Latency                   | Minimum Data Transferred | Clock Cycles (P III) | Clock Cycles (P IV) | Latency in µs (P III) | Latency in µs (P IV)
Shared Memory to OCM      | 1 word                   | 20                   | 20                  | 0.20                  | 0.20
OCM to Shared Memory      | 1 word                   | 15                   | 15                  | 0.15                  | 0.15
OCM to OCM (Bridge Port)  | 1 word                   | 11                   | 11                  | 0.11                  | 0.11
Shared Memory to MPM      | 4 words                  | 4200                 | 2100                | 42                    | 21
MPM to Shared Memory      | 4 words                  | 1000                 | 1000                | 10                    | 10
(1 word = 64 bits)
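The µs columns follow from the clock-cycle columns assuming a 100 MHz FPGA clock (an inference from the numbers in the table, not stated on the slide): latency in µs = cycles / 100. As a one-line check:

```c
/* Convert a latency measured in FPGA clock cycles to microseconds,
   given the clock frequency in MHz. */
static double cycles_to_us(unsigned cycles, double clock_mhz)
{
    return cycles / clock_mhz;   /* e.g. 20 cycles at 100 MHz = 0.20 µs */
}
```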
22
Latency
  • The difference between read and write times for
    the OCM and Shared Memory is due to the read
    latency of OBM (6 clocks) vs. BRAM (1 clock).
  • When transferring data from the MPM to Shared
    Memory, writes are issued at each clock cycle and
    there is no startup latency involved.
  • When reading data from the Shared Memory to the
    MPM, an additional five clock cycles are
    required to transfer data after the read has
    been issued.

23
Data Path from OCM to OCM Using Transfer Of
Permissions
24
Data Path from OCM to OCM Using The Bridge Port
and the Streaming Protocol
25
P III & IV Bandwidth: OCM and OCM (BM1)
26
P III Bandwidth OCM and OCM (BM1)
27
P IV Bandwidth OCM and OCM (BM1)
28
P IV Bandwidth OCM and OCM (BM1) (Streaming
Protocol in Bridge Port)
29
Data Path from OCM to MPM and Shared Memory to MPM
30
P III Bandwidth OCM and Shared Memory for a
single bank
31
P III Bandwidth OCM and Shared Memory
32
P IV Bandwidth OCM and Shared Memory
33
P III Bandwidth OCM and µP Memory
34
P IV Bandwidth OCM and µP Memory
35
P III Bandwidth Shared Memory and µP Memory
(BM5)
36
P IV Bandwidth Shared Memory and µP Memory
37
P III Bandwidth Shared Memory and µP Memory
38
P IV Bandwidth Shared Memory and µP Memory
39
Data Path from FPGA Register to Shared Memory
40
P III Bandwidth Shared Memory and Register
41
Conclusion & Future Direction
42
GRAM Summation for Pentium III
Benchmarks                                  | Peak Performance (Mbytes/s) | Max. Sustainable Bandwidth Measured (Mbytes/s) | Efficiency (%) | Normalized Transfer Rate (vs. PCI-X @ 133 MHz, 32 bits unidirectional)
OCM ↔ OCM a (Bridge Port)                   | 800                         | 149                                            | 18.6           | 0.28
OCM ↔ OCM b (Trans Perms)                   | 800                         | 793                                            | 99.13          | 1.5
OCM ↔ OCM c (Streaming)                     | 800                         | NA                                             | NA             | NA
OCM ↔ LM                                    | NA                          | NA                                             | NA             | NA
OCM → Shared Memory / Shared Memory → OCM * | 2400                        | 2373 / 2373                                    | 98.8 / 98.8    | 4.46
OCM → MPM / MPM → OCM                       | 800 / 800                   | 182.8 / 227.3                                  | 22.85 / 28.41  | 0.34 / 0.43
Shared Memory → MPM / MPM → Shared Memory   | 800 / 800                   | 203 / 314                                      | 25.3 / 39.3    | 0.38 / 0.59
Shared Memory → Reg / Reg → Shared Memory   | 800 / 800                   | 798 / 798                                      | 99.75 / 99.75  | 1.5 / 1.5
LM ↔ MPM                                    | NA                          | NA                                             | NA             | NA
LM ↔ LM                                     | NA                          | NA                                             | NA             | NA
LM ↔ Shared Memory                          | NA                          | NA                                             | NA             | NA
* For three banks.
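The derived columns in the summation tables are simple ratios: efficiency is measured bandwidth over peak, and the normalized rate divides the measured bandwidth by the PCI-X reference of 133 MHz × 32 bits = 532 Mbytes/s (the reference value is inferred from the column heading and the tabulated ratios, e.g. 793 / 532 ≈ 1.5). A sketch in C with hypothetical function names:

```c
/* PCI-X reference bandwidth: 133 MHz x 4 bytes = 532 Mbytes/s. */
#define PCIX_MBS (133.0 * 4.0)

/* Efficiency column: measured bandwidth as a percentage of peak. */
static double efficiency_pct(double measured_mbs, double peak_mbs)
{
    return 100.0 * measured_mbs / peak_mbs;
}

/* Normalized-transfer-rate column: measured bandwidth over PCI-X. */
static double normalized_rate(double measured_mbs)
{
    return measured_mbs / PCIX_MBS;
}
```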
43
GRAM Summation for Pentium IV
Benchmarks                                  | Peak Performance (Mbytes/s) | Max. Sustainable Bandwidth Measured (Mbytes/s) | Efficiency (%) | Normalized Transfer Rate (vs. PCI-X @ 133 MHz, 32 bits unidirectional)
OCM ↔ OCM a (Bridge Port)                   | 800                         | 149                                            | 18.6           | 0.28
OCM ↔ OCM b (Trans Perms)                   | 800                         | 797.39                                         | 99.67          | 1.5
OCM ↔ OCM c (Streaming)                     | 800                         | 799.49                                         | 100            | 1.5
OCM ↔ LM                                    | NA                          | NA                                             | NA             | NA
OCM → Shared Memory / Shared Memory → OCM * | 2400                        | 2392 / 2390                                    | 99.6 / 99.6    | 4.5 / 4.5
OCM → MPM / MPM → OCM                       | 800 / 800                   | 578 / 562                                      | 72.25 / 70.25  | 1.08 / 1.05
Shared Memory → MPM / MPM → Shared Memory   | 800 / 800                   | 796 / 799                                      | 99.5 / 99.8    | 1.5 / 1.5
Shared Memory → Reg / Reg → Shared Memory   | 800 / 800                   | 798 / 798                                      | 99.75 / 99.75  | 1.5 / 1.5
LM ↔ MPM                                    | NA                          | NA                                             | NA             | NA
LM ↔ LM                                     | NA                          | NA                                             | NA             | NA
LM ↔ Shared Memory                          | NA                          | NA                                             | NA             | NA
* For three banks.
44
Conclusions
  • The type of components used plays a major role
    in determining the performance of the system,
    as seen in the Pentium III and Pentium IV
    versions of the SRC-6E.
  • The software environment and its state of
    development determine how effectively a
    program can utilize the hardware. This is
    clear from the difference in bandwidth
    achieved across the Bridge Ports between the
    Carte 1.6.2 release and the Carte 1.7 release.

45
Conclusions
  • The GRAM Summation Tables serve machine
    architects in the following ways:
  • The efficiency column indicates how well a
    particular communication channel is being
    utilized within the hardware context. If
    efficiency is low, architects may be able to
    improve performance with a firmware improvement;
    if efficiency is high but the normalized
    bandwidth is low, they should consider a
    hardware upgrade.
  • By looking at the normalized bandwidths obtained
    from the GRAM benchmarks, designers can also
    determine whether the data transfer rates are
    balanced across the architectural modules. This
    helps identify bottlenecks.
  • Designers can find out which channels have the
    maximum efficiency and hence fine-tune their
    applications to exploit these channels and
    achieve the maximum data transfer rate.

46
Conclusions
  • In addition, the GRAM Summation Tables provide
    the following information to application
    developers:
  • The tables can tell a designer what bottlenecks
    to expect and where those bottlenecks lie.
  • By comparing the figures for efficiency and the
    normalized transfer rates, designers can
    determine whether the bottlenecks are created
    by the hardware or the software.
  • By observing the GRAM Summation Tables,
    designers can predict the performance of a
    pre-designed application on a particular
    reconfigurable system.

47
Future Direction
  • The benchmarks can be expanded to include
    end-to-end performance from asymmetrical and
    synthetic workloads.
  • The benchmarks can also include tables that
    characterize the performance of reconfigurable
    computers as it compares to modern parallel
    architectures. A performance-to-cost analysis
    can also be considered.