HiDISC: A Decoupled Architecture for Applications in Data Intensive Computing

Transcript and Presenter's Notes

1
HiDISC: A Decoupled Architecture for Applications in Data Intensive Computing
PIs: Alvin M. Despain and Jean-Luc Gaudiot
DARPA DIS PI Meeting, Santa Fe, NM
March 26-27, 2002
2
HiDISC: Hierarchical Decoupled Instruction Set Computer
  • New Ideas
  • A dedicated processor for each level of the memory hierarchy
  • Explicit management of each level of the memory hierarchy using instructions generated by the compiler
  • Hide memory latency by converting data access predictability into data access locality
  • Exploit instruction-level parallelism without extensive scheduling hardware
  • Zero-overhead prefetches for maximal computation throughput
  • Impact
  • 2x speedup over an in-order superscalar processor for scientific benchmarks with large data sets
  • 7.4x speedup for matrix multiply over an in-order-issue superscalar processor
  • 2.6x speedup for matrix decomposition/substitution over an in-order-issue superscalar processor
  • Reduced memory latency for systems that have high memory bandwidth (e.g., PIMs, RAMBUS)
  • Allows the compiler to solve indexing functions for irregular applications
  • Reduced system cost for high-throughput scientific codes

Schedule
  • Defined benchmarks
  • Completed simulator
  • Performed instruction-level simulations on hand-compiled benchmarks
  • Continue simulations of more benchmarks (SAR)
  • Define HiDISC architecture
  • Benchmark results
  • Update simulator
  • Developed and tested a complete decoupling compiler
  • Generated performance statistics and evaluated design

Timeline: Start → April 2001 → March 2002
3
Accomplishments
  • Design of the HiDISC model
  • Compiler development (operational)
  • Simulator design (operational)
  • Performance Evaluation
  • Three DIS benchmarks (Multidimensional Fourier
    Transform, Method of Moments, Data Management)
  • Five stressmarks (Pointer, Transitive Closure,
    Neighborhood, Matrix, Field)
  • Results
  • Mostly higher performance (some lower)
  • Range of applications
  • HiDISC of the future

4
Outline
  • Original Technical Objective
  • HiDISC Architecture Review
  • Benchmark Results
  • Conclusion and Future Work

5
HiDISC: Hierarchical Decoupled Instruction Set Computer
  • Technological trend: Memory latency is growing relative to microprocessor speed (40% per year)
  • Problem: Some SPEC benchmarks spend more than half of their time stalling [Lebeck and Wood 1994]
  • Domain: Benchmarks with large data sets: symbolic, signal processing, and scientific programs
  • Present solutions: Larger caches, prefetching (software and hardware), simultaneous multithreading

6
Present Solutions
Solutions and their limitations:
  • Larger caches: slow; work well only if the working set fits in the cache and there is temporal locality; cannot be tailored for each application
  • Hardware prefetching: behavior is based on past and present execution-time behavior
  • Software prefetching: keeping the overhead of prefetching from outweighing the benefits forces conservative prefetching; adaptive software prefetching is required to change the prefetch distance at run time; hard to insert prefetches for irregular access patterns
  • Multithreading: solves the throughput problem, not the memory latency problem

7
The HiDISC Approach
  • Observation
  • Software prefetching impacts compute performance
  • PIMs and RAMBUS offer a high-bandwidth memory
    system - useful for speculative prefetching
  • Approach
  • Add a processor to manage prefetching → hide the overhead
  • Compiler explicitly manages the memory hierarchy
  • Prefetch distance adapts to the program runtime
    behavior

8
What is HiDISC?
  • A dedicated processor for each level of the
    memory hierarchy
  • Explicitly manage each level of the memory
    hierarchy using instructions generated by the
    compiler
  • Hide memory latency by converting data access predictability to data access locality (just-in-time fetch)
  • Exploit instruction-level parallelism without
    extensive scheduling hardware
  • Zero-overhead prefetches for maximal computation throughput

[Figure: HiDISC block diagram showing a 2-issue Computation Processor (CP) with registers, a 3-issue Access Processor (AP), a 3-issue Cache Management Processor (CMP), the L1 cache, and the Load Data, Store Address, Store Data, and Slip Control Queues]
9
Slip Control Queue
  • The Slip Control Queue (SCQ) adapts dynamically
  • Late prefetches: prefetched data arrives after the load has been issued
  • Useful prefetches: prefetched data arrives before the load has been issued

if (prefetch_buffer_full())
    don't change the size of the SCQ;
else if ((2 * late_prefetches) > useful_prefetches)
    increase the size of the SCQ;
else
    decrease the size of the SCQ;
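As a rough illustration, this adaptation policy could be written as the following C sketch; the function name and counter arguments are hypothetical, not identifiers from the HiDISC simulator.

/* Sketch of the SCQ slip-distance adaptation described above (assumed
   interface): returns the new SCQ size given the prefetch-buffer state
   and the late/useful prefetch counters. */
static int adapt_scq_size(int scq_size, int prefetch_buffer_full,
                          long late_prefetches, long useful_prefetches)
{
    if (prefetch_buffer_full)
        return scq_size;                      /* buffer full: leave size unchanged */
    if (2 * late_prefetches > useful_prefetches)
        return scq_size + 1;                  /* too many late prefetches: slip further ahead */
    return scq_size > 1 ? scq_size - 1 : 1;   /* otherwise shrink, keeping at least 1 */
}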
10
Decoupling Programs for HiDISC (Discrete Convolution - Inner Loop)

Inner Loop Convolution (original code):
    for (j = 0; j < i; ++j) y[i] = y[i] + (x[j] * h[i-j-1]);

Computation Processor Code:
    while (not EOD) y = y + (x * h); send y to SDQ

Access Processor Code:
    for (j = 0; j < i; ++j) { load (x[j]); load (h[i-j-1]); GET_SCQ; }
    send (EOD token); send address of y[i] to SAQ

Cache Management Processor Code:
    for (j = 0; j < i; ++j) { prefetch (x[j]); prefetch (h[i-j-1]); PUT_SCQ; }

SAQ: Store Address Queue, SDQ: Store Data Queue, SCQ: Slip Control Queue, EOD: End of Data
11
General View of the Compiler
[Figure: HiDISC compilation overview: source program → gcc → binary code → disassembler → stream separator → computation code, access code, and cache management code]
12
HiDISC Stream Separator
13
Stream Separation: Backward Load/Store Chasing
  • Load/store instructions and their backward slices are included in the Access stream (see the sketch after this list)

[Figure: stream separation example on the LLL1 kernel]
  • Other remaining instructions are sent to the Computation stream
  • Communication between the two streams is also defined at this point

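A hedged illustration of this rule (the loop, array names, and function below are hypothetical, not taken from the slides): statements that compute addresses or perform loads fall into the Access stream, while statements that only consume loaded values fall into the Computation stream.

/* Hypothetical example of backward load/store chasing on a simple loop. */
double indexed_sum(const double *a, const int *idx, int n)
{
    double s = 0.0;
    for (int i = 0; i < n; i++) {  /* loop index feeds the load addresses -> Access stream      */
        double t = a[idx[i]];      /* loads and their address arithmetic  -> Access stream      */
        s = s + t * t;             /* consumes loaded data only           -> Computation stream */
    }
    return s;                      /* the Access stream forwards t to the CP
                                      through the Load Data Queue */
}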
14
Stream Separation: Creating an Access Stream
  • Communication needs to take place via the various hardware queues
  • Insert communication instructions
  • CMP instructions are copied from the access stream instructions, except that load instructions are replaced by prefetch instructions (see the sketch after this list)

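Continuing the hypothetical loop from the previous sketch, the Access Processor and Cache Management Processor streams might look roughly as follows; the queue mnemonics mirror the convolution example above, and everything else is illustrative rather than actual compiler output.

Access Processor stream (hypothetical):
    for (i = 0; i < n; i++) {
        load (idx[i]);          /* index value needed to form the next address */
        load (a[idx[i]]);       /* loaded data goes to the Load Data Queue     */
        GET_SCQ;                /* synchronize slip distance with the CMP      */
    }

Cache Management Processor stream (hypothetical, same instructions with loads replaced by prefetches):
    for (i = 0; i < n; i++) {
        prefetch (idx[i]);
        prefetch (a[idx[i]]);
        PUT_SCQ;
    }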
15
DIS Benchmarks
  • Application-oriented benchmarks
  • Main characteristics: large data sets, non-contiguous memory access, and no temporal locality

16
DIS Benchmarks Description
  • Method of Moments (MoM): Computes the electromagnetic scattering from complex objects; computationally complex and demands high memory speed. Status: Done
  • Multidimensional Fourier Transform (FFT): Wide range of applications: image processing, synthesis, convolution/deconvolution, and digital signal filtering. Status: Done
  • Data Management (DM): Traditional DBMS processing, focusing on index algorithms and ad hoc query processing. Status: Done
  • Image Understanding (IU): Contains algorithms that perform spatial filtering and data reduction on sampled image data: a morphological filter component, a region-of-interest (ROI) selection component, and a feature extraction component. Status: Compiled (input file problem)
  • SAR Ray Tracing (RAY): Cost-effective alternative to real data collection; uses an image-domain approach to SAR image simulation. Status: Done
17
Atlantic Aerospace Stressmark Suite
  • Smaller and more specific procedures
  • Seven individual data-intensive benchmarks
  • Directly illustrate particular elements of the
    DIS problem
  • Fewer computation operations than DIS benchmarks
  • Focusing on irregular memory accesses

18
Stressmark Suite
  • Pointer: Pointer following. Memory access: small blocks at unpredictable locations; can be parallelized. Status: Done
  • Update: Pointer following with memory update. Memory access: small blocks at unpredictable locations. Status: Done
  • Matrix: Conjugate-gradient simultaneous equation solver. Memory access: dependent on the matrix representation; likely to be irregular or mixed, with mixed levels of reuse. Status: Done
  • Neighborhood: Calculates image texture measures by finding sum and difference histograms. Memory access: regular access to pairs of words at arbitrary distances. Status: Done
  • Field: Collects statistics on a large field of words. Memory access: regular, with little reuse. Status: Done
  • Corner-Turn: Matrix transposition. Memory access: block movement between processing nodes with practically no computation. Status: N/A
  • Transitive Closure: Finds the all-pairs shortest-path solution for a directed graph. Memory access: dependent on the matrix representation, but requires reads and writes to different matrices concurrently. Status: Done

Source: DIS Stressmark Suite Version 1.0, Atlantic Aerospace Division
19
Pointer Stressmark
  • Basic idea: repeatedly follow pointers to randomized locations in memory (a minimal sketch follows below)
  • Memory access pattern is unpredictable (input-data dependent)
  • Randomized memory access pattern
  • Insufficient temporal and spatial locality for conventional cache architectures
  • The HiDISC architecture should provide lower average memory access latency
  • Pointer chasing in the AP can run ahead without being blocked by the CP

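A minimal C sketch of the kind of pointer-following loop this stressmark exercises; the function name, field layout, and hop count are illustrative assumptions, not the actual DIS stressmark code.

#include <stddef.h>

/* Hypothetical pointer-chasing kernel: each loaded word selects the next
   index, so every memory access depends on the previous one and cannot be
   predicted without the data itself. */
unsigned int chase(const unsigned int *field, size_t size,
                   unsigned int start, long hops)
{
    unsigned int index = start % size;
    while (hops-- > 0)
        index = field[index] % size;   /* next location is data-dependent */
    return index;
}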
20
Simulation Parameters (SimpleScalar 3.0, 2000)
Branch prediction mode: bimodal
Branch table size: 2048
Issue width: 4
Window size for superscalar: RUU 16, LSQ 8
Slip distance for AP/CP: 50
Data L1 cache configuration: 128 sets, 32-byte blocks, 4-way set associative, LRU
Data L1 cache latency: 1 cycle
Unified L2 cache configuration: 1024 sets, 64-byte blocks, 4-way set associative, LRU
Unified L2 cache latency: 6 cycles
Integer functional units: ALU (x4), MUL/DIV
Floating-point functional units: ALU (x4), MUL/DIV
Number of memory ports: 2
21
Superscalar and HiDISC
  • Ability to vary the memory latency
  • The superscalar baseline supports out-of-order issue with a 16-entry RUU (Register Update Unit) and an 8-entry LSQ (Load/Store Queue)
  • The AP and CP each issue instructions in order

22
DIS Benchmark/Stressmark Results
[Table: per-benchmark results indicating whether HiDISC or the superscalar baseline performed better. DIS benchmarks: FFT, MoM, DM, IU (N/A), RAY. Stressmarks: Pointer, Transitive Closure, Field, Matrix, Neighborhood, Update, Corner-Turn (N/A).]
23
DIS Benchmark Results
24
DIS Benchmark Performance
  • DIS benchmarks perform extremely well in general with our decoupled processing, for the following reasons
  • Many long-latency floating-point operations
  • Robust under longer memory latency (e.g., Method of Moments is a more stream-like process)
  • In FFT, the HiDISC architecture also suffers from longer memory latencies (due to the data dependencies between the two streams)
  • DM is not affected by the longer memory latency in either case

25
Stressmark Results
26
Stressmark Results: Good Performance
  • Pointer chasing can be executed far ahead by using the decoupled access stream
  • It does not require computation results from the CP
  • Transitive closure also produces good results
  • Little in the AP depends on results from the CP
  • The AP can run ahead

27
Some Weak Performance
28
Performance Bottleneck
  • Too much synchronization causes loss of decoupling
  • Unbalanced code size between the two streams
  • The stressmark suite is highly access-heavy
  • Application domain for decoupled architectures: a balanced ratio between computation and memory access

29
Synthesis
  • Synchronization degrades performance
  • When the AP's dependence on the CP increases, the AP's slip distance cannot be increased
  • More stream-like applications will benefit from using HiDISC
  • Multithreading support is needed
  • Applications should contain enough computation (roughly a 1-to-1 ratio) to hide the memory access latency
  • The CMP should be simple
  • It executes redundant operations if the data is already in the cache
  • Cache pollution can occur

30
Flexi-DISC
  • Fundamental characteristics
  • Inherently highly dynamic at execution time
  • Dynamically reconfigurable central computational kernel (CK)
  • Multiple levels of caching and processing around the CK
  • Adjustable prefetching
  • Multiple processors on a chip, providing flexible adaptation from multiple to single processors and horizontal sharing of the existing resources

31
Flexi-DISC
  • Partitioning of the Computation Kernel
  • It can be allocated to different portions of an application or to different applications
  • The CK requires separation of the next ring to feed it with data
  • The variety of target applications makes memory accesses unpredictable
  • Identical processing units for the outer rings
  • Highly efficient dynamic partitioning of the resources and their run-time allocation can be achieved