
HiDISC: A Decoupled Architecture for Applications in Data Intensive Computing
PIs: Alvin M. Despain and Jean-Luc Gaudiot
DARPA DIS PI Meeting, Santa Fe, NM
March 26-27, 2002
HiDISC: Hierarchical Decoupled Instruction Set Computer

New Ideas
  • A dedicated processor for each level of the memory hierarchy
  • Explicit management of each level of the memory hierarchy using instructions generated by the compiler
  • Hide memory latency by converting data access predictability to data access locality
  • Exploit instruction-level parallelism without extensive scheduling hardware
  • Zero-overhead prefetches for maximal computation

Impact
  • 2x speedup for scientific benchmarks with large data sets over an in-order superscalar processor
  • 7.4x speedup for matrix multiply over an in-order issue superscalar processor
  • 2.6x speedup for matrix decomposition/substitution over an in-order issue superscalar processor
  • Reduced memory latency for systems that have high memory bandwidths (e.g., PIMs, RAMBUS)
  • Allows the compiler to solve indexing functions for irregular applications
  • Reduced system cost for high-throughput scientific codes

Project Milestones

April 2001:
  • Defined benchmarks
  • Completed simulator
  • Performed instruction-level simulations on benchmarks
  • Continue simulations of more benchmarks
  • Define the HiDISC architecture

March 2002:
  • Benchmark results
  • Updated the simulator
  • Developed and tested a complete decoupling compiler
  • Generated performance statistics and evaluated the design
Project Summary
  • Design of the HiDISC model
  • Compiler development (operational)
  • Simulator design (operational)
  • Performance evaluation:
  • Three DIS benchmarks (Multidimensional Fourier Transform, Method of Moments, Data Management)
  • Five stressmarks (Pointer, Transitive Closure, Neighborhood, Matrix, Field)
  • Results: mostly higher performance (some lower) across a range of applications
  • HiDISC of the future

Outline
  • Original Technical Objective
  • HiDISC Architecture Review
  • Benchmark Results
  • Conclusion and Future Work

HiDISC: Hierarchical Decoupled Instruction Set Computer
  • Technological trend: memory latency is getting longer relative to microprocessor speed (40% per year)
  • Problem: some SPEC benchmarks spend more than half of their time stalling [Lebeck and Wood]
  • Domain: benchmarks with large data sets: symbolic, signal processing, and scientific
  • Present solutions: larger caches, prefetching (software/hardware), simultaneous multithreading

Present Solutions

Larger caches
  • Slow
  • Work well only if the working set fits in the cache and there is temporal locality
Hardware prefetching
  • Cannot be tailored for each application
  • Behavior based on past and present execution-time behavior
Software prefetching
  • Overheads of prefetching must not outweigh the benefits → conservative prefetching (sketched below)
  • Adaptive software prefetching is required to change the prefetch distance during run-time
  • Hard to insert prefetches for irregular access patterns
Multithreading
  • Solves the throughput problem, not the memory latency problem
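
To make the prefetch-distance tradeoff concrete, here is a minimal C sketch of fixed-distance software prefetching using GCC's __builtin_prefetch; the constant PF_DIST is a hypothetical tuning parameter, not something given in the slides:

    #include <stddef.h>

    #define PF_DIST 16  /* elements ahead; must be tuned per machine and application */

    double sum_with_prefetch(const double *a, size_t n)
    {
        double s = 0.0;
        for (size_t i = 0; i < n; i++) {
            /* Prefetch a fixed distance ahead: too small and the data
               arrives late; too large and useful lines may be evicted. */
            if (i + PF_DIST < n)
                __builtin_prefetch(&a[i + PF_DIST], 0, 1);
            s += a[i];
        }
        return s;
    }

Because PF_DIST is fixed at compile time, it cannot track run-time memory latency, which is exactly the limitation HiDISC's adaptive slip distance addresses.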

The HiDISC Approach
  • Observation
  • Software prefetching impacts compute performance
  • PIMs and RAMBUS offer a high-bandwidth memory system, useful for speculative prefetching
  • Approach
  • Add a processor to manage prefetching → hide the overhead
  • The compiler explicitly manages the memory hierarchy
  • The prefetch distance adapts to the program at run-time

What is HiDISC?
  • A dedicated processor for each level of the memory hierarchy
  • Explicitly manage each level of the memory hierarchy using instructions generated by the compiler
  • Hide memory latency by converting data access predictability to data access locality (Just-in-Time Fetch)
  • Exploit instruction-level parallelism without extensive scheduling hardware
  • Zero-overhead prefetches for maximal computation

[HiDISC block diagram: the Computation Processor (CP), Access Processor (AP), and Cache Management Processor (CMP) communicate through the Load Data Queue, Store Address Queue, Store Data Queue, and Slip Control Queues, with the CMP managing the L1 cache]
Slip Control Queue
  • The Slip Control Queue (SCQ) adapts dynamically
  • Late prefetches: prefetched data arrived after the load had been issued
  • Useful prefetches: prefetched data arrived before the load had been issued

    if (prefetch_buffer_full())
        ;  /* don't change the size of the SCQ */
    else if ((2 * late_prefetches) > useful_prefetches)
        increase size of SCQ;
    else
        decrease size of SCQ;
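
A minimal C sketch of this heuristic, with hypothetical counter names (the slides give only the policy, not an implementation; the floor on scq_size is an added safety assumption):

    typedef struct {
        int scq_size;           /* current slip distance */
        int late_prefetches;    /* data arrived after the load had issued */
        int useful_prefetches;  /* data arrived before the load had issued */
        int prefetch_buffer_full;
    } scq_state;

    void scq_adapt(scq_state *s)
    {
        if (s->prefetch_buffer_full)
            return;                       /* don't change the size of the SCQ */
        if (2 * s->late_prefetches > s->useful_prefetches)
            s->scq_size++;                /* slip the AP further ahead */
        else if (s->scq_size > 1)
            s->scq_size--;                /* pull the streams closer together */
    }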
Decoupling Programs for HiDISC (Discrete Convolution - Inner Loop)

Inner Loop Convolution (original code):

    for (j = 0; j < i; j++)
        y[i] = y[i] + (x[j] * h[i-j-1]);

Computation Processor Code:

    while (not EOD)
        y = y + (x * h);
    send y to SDQ

Access Processor Code:

    for (j = 0; j < i; j++) {
        load (x[j]);
        load (h[i-j-1]);
        GET_SCQ;
    }
    send (EOD token)
    send address of y[i] to SAQ

Cache Management Code:

    for (j = 0; j < i; j++) {
        prefetch (x[j]);
        prefetch (h[i-j-1]);
        PUT_SCQ;
    }

SAQ: Store Address Queue, SDQ: Store Data Queue, SCQ: Slip Control Queue, EOD: End of Data
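
To show how the streams communicate, below is a minimal single-threaded C model of the load-data-queue handshake. On real HiDISC hardware the AP and CP run concurrently and the SCQ bounds how far the AP slips ahead; here the two streams are interleaved in one loop purely to illustrate the queue discipline, and all names (ldq, ap_load, cp_consume) are illustrative:

    #define LDQ_SIZE 4                 /* load data queue capacity */

    static double ldq[LDQ_SIZE];
    static int ldq_head, ldq_tail;

    static void ap_load(double v)      /* AP side: enqueue loaded data */
    {
        ldq[ldq_tail] = v;
        ldq_tail = (ldq_tail + 1) % LDQ_SIZE;
    }

    static double cp_consume(void)     /* CP side: dequeue an operand */
    {
        double v = ldq[ldq_head];
        ldq_head = (ldq_head + 1) % LDQ_SIZE;
        return v;
    }

    double convolve_point(const double *x, const double *h, int i)
    {
        double y = 0.0;
        for (int j = 0; j < i; j++) {
            ap_load(x[j]);                     /* access stream issues loads */
            ap_load(h[i - j - 1]);
            y += cp_consume() * cp_consume();  /* computation stream consumes */
        }
        return y;
    }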
General View of the Compiler

[Compiler flow diagram: Source Program → Stream Separator → Computation Code, Access Code, and Cache Mgmt. Code → Binary Code]

HiDISC Compilation Overview

HiDISC Stream Separator
Stream Separation: Backward Load/Store Chasing
  • Load/store instructions and their backward slice are included in the Access stream
  • All other remaining instructions are sent to the Computation stream
  • Communication between the two streams is also defined at this point (see the sketch below)
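
As a small illustrative C sketch of this separation on an indirect access (hypothetical names; the load data queue is modeled as a plain array so both loops stay runnable):

    #include <stddef.h>

    /* Access stream: the a[b[i]] load plus its backward slice,
       i.e., the b[i] load and the index arithmetic. */
    static void access_stream(const double *a, const size_t *b,
                              size_t n, double *ldq)
    {
        for (size_t i = 0; i < n; i++)
            ldq[i] = a[b[i]];          /* value forwarded via the "queue" */
    }

    /* Computation stream: pure arithmetic on queued data. */
    static double computation_stream(const double *ldq, size_t n)
    {
        double s = 0.0;
        for (size_t i = 0; i < n; i++)
            s += ldq[i] * ldq[i];
        return s;
    }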

Stream Separation: Creating an Access Stream
  • Communication takes place via the various hardware queues
  • Insert communication instructions
  • CMP instructions are copied from the access-stream instructions, except that load instructions are replaced by prefetch instructions (illustrated below)
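
Sketched in C for illustration (the real HiDISC compiler performs this copy at the instruction level; GCC's __builtin_prefetch stands in for the prefetch instruction):

    /* AP stream: issues the actual loads. */
    static void ap_stream(const double *x, int i, double *out)
    {
        for (int j = 0; j < i; j++)
            out[j] = x[j];                       /* load */
    }

    /* CMP stream: a copy of the AP stream with each load
       replaced by a prefetch. */
    static void cmp_stream(const double *x, int i)
    {
        for (int j = 0; j < i; j++)
            __builtin_prefetch(&x[j], 0, 1);     /* prefetch instead of load */
    }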

DIS Benchmarks
  • Application-oriented benchmarks
  • Main characteristics: large data sets, non-contiguous memory access, and no temporal locality

DIS Benchmarks: Description

Method of Moments (MoM): Computes the electromagnetic scattering from complex objects; contains computational complexity and requires high memory speed. [Done]
Multidimensional Fourier Transform (FFT): Wide range of application usage: image processing, synthesis, convolution/deconvolution, and digital signal filtering. [Done]
Data Management (DM): Traditional DBMS processing, focusing on index algorithms and ad hoc query processing. [Done]
Image Understanding (IU): Contains algorithms that perform spatial filtering and data reduction: a morphological filter component, a region-of-interest (ROI) selection component, and a feature extraction component. [Compiled; input file problem]
SAR Ray Tracing: A cost-effective alternative to real data collection; utilizes an image-domain approach to SAR image simulation. [Done]
Atlantic Aerospace Stressmark Suite
  • Smaller and more specific procedures
  • Seven individual data-intensive benchmarks
  • Directly illustrate particular elements of the DIS problem
  • Fewer computation operations than the DIS benchmarks
  • Focus on irregular memory accesses

Stressmark Suite

Pointer: Pointer following. Small blocks at unpredictable locations; can be parallelized. [Done]
Update: Pointer following with memory update. Small blocks at unpredictable locations. [Done]
Matrix: Conjugate-gradient simultaneous equation solver. Memory access dependent on matrix representation; likely to be irregular or mixed, with mixed levels of reuse. [Done]
Neighborhood: Calculate image texture measures by finding sum and difference histograms. Regular access to pairs of words at arbitrary distances. [Done]
Field: Collect statistics on a large field of words. Regular access, with little reuse. [Done]
Corner-Turn: Matrix transposition. Block movement between processing nodes with practically nil computation. [N/A]
Transitive Closure: Find the all-pairs shortest-path solution for a directed graph. Memory access dependent on matrix representation, but requires reads and writes to different matrices concurrently. [Done]

Source: DIS Stressmark Suite, Version 1.0, Atlantic Aerospace Division.
Pointer Stressmark
  • Basic idea: repeatedly follow pointers to randomized locations in memory
  • Memory access pattern is unpredictable (input-data dependent)
  • The randomized access pattern leaves insufficient temporal and spatial locality for conventional cache architectures
  • The HiDISC architecture should provide lower average memory access latency
  • The pointer chasing in the AP can run ahead without being blocked by the CP (see the sketch below)
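
For illustration, here is a minimal pointer-chasing kernel in the spirit of this stressmark (the actual DIS code differs; next[] is assumed to encode a randomized permutation, so every load address depends on the previous load):

    #include <stddef.h>

    size_t chase(const size_t *next, size_t start, long hops)
    {
        size_t i = start;
        while (hops-- > 0)
            i = next[i];   /* serially dependent loads defeat conventional caches */
        return i;
    }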

Simulation Parameters (SimpleScalar 3.0, 2000)

Branch prediction mode: bimodal
Branch table size: 2048 entries
Issue width: 4
Window size for superscalar: 16-entry RUU, 8-entry LSQ
Slip distance for AP/CP: 50
Data L1 cache configuration: 128 sets, 32-byte blocks, 4-way set associative, LRU
Data L1 cache latency: 1 cycle
Unified L2 cache configuration: 1024 sets, 64-byte blocks, 4-way set associative, LRU
Unified L2 cache latency: 6 cycles
Integer functional units: ALU (x4), MUL/DIV
Floating-point functional units: ALU (x4), MUL/DIV
Number of memory ports: 2
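
For reference, the superscalar baseline in this table maps roughly onto stock SimpleScalar 3.0 sim-outorder flags as sketched below; the HiDISC-specific parameters (slip distance, the AP/CP split) required the project's modified simulator and have no stock flags:

    sim-outorder \
      -bpred bimod -bpred:bimod 2048 \
      -issue:width 4 -ruu:size 16 -lsq:size 8 \
      -cache:dl1 dl1:128:32:4:l -cache:dl1lat 1 \
      -cache:dl2 ul2:1024:64:4:l -cache:dl2lat 6 \
      -res:ialu 4 -res:imult 1 -res:fpalu 4 -res:fpmult 1 \
      -res:memport 2 \
      <benchmark binary>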
Superscalar and HiDISC
  • Ability to vary the memory latency
  • The superscalar baseline supports out-of-order issue with a 16-entry RUU (Register Update Unit) and an 8-entry LSQ (Load/Store Queue)
  • The AP and CP each issue instructions in order

DIS Benchmark/Stressmark Results

[Summary table: arrows indicating, for each benchmark, whether HiDISC or the superscalar baseline performs better; the arrow symbols were lost in transcription. DIS benchmarks: MoM, FFT, DM, IU (N/A), SAR Ray Tracing. Stressmarks: Pointer, Transitive Closure, Field, Matrix, Neighborhood, Update, Corner-Turn (N/A).]
DIS Benchmark Results

DIS Benchmark Performance
  • The DIS benchmarks perform extremely well in general with our decoupled processing, for the following reasons:
  • Many long-latency floating-point operations
  • Robust under longer memory latency (e.g., Method of Moments is a more stream-like process)
  • In FFT, the HiDISC architecture also suffers from longer memory latencies (due to the data dependencies between the two streams)
  • DM is not affected by the longer memory latency in either case

Stressmark Results: Good Performance
  • Pointer chasing can be executed far ahead by using the decoupled access stream
  • It does not require computation results from the CP
  • Transitive Closure also produces good results
  • Little in the AP depends on the results of the CP
  • The AP can run ahead
Some Weak Performance

Performance Bottlenecks
  • Too much synchronization causes loss of decoupling
  • Unbalanced code size between the two streams
  • The stressmark suite is highly access-heavy
  • The application domain for decoupled architectures needs a balanced ratio between computation and memory access

Conclusions
  • Synchronization degrades performance
  • When the dependency of the AP on the CP increases, the slip distance of the AP cannot be increased
  • More stream-like applications will benefit from using HiDISC
  • Multithreading support is needed
  • Applications should contain enough computation (a 1-to-1 ratio) to hide the memory access latency
  • The CMP should be simple
  • It executes redundant operations if the data is already in the cache
  • Cache pollution can occur

HiDISC of the Future
  • Fundamental characteristic: inherently highly dynamic at execution time
  • Dynamically reconfigurable central computational kernel (CK)
  • Multiple levels of caching and processing around the CK
  • Adjustable prefetching
  • Multiple processors on a chip, providing flexible adaptation from multiple to single processors and horizontal sharing of the existing resources

Partitioning of the Computation Kernel
  • The CK can be allocated to different portions of the application or to different applications
  • The CK requires separation of the next ring to feed it with data
  • The variety of target applications makes the memory accesses unpredictable
  • Identical processing units for the outer rings
  • Highly efficient dynamic partitioning of the resources and their run-time allocation can be achieved