Title: HiDISC: A Decoupled Architecture for Applications in Data Intensive Computing
1 HiDISC: A Decoupled Architecture for Applications in Data Intensive Computing
PIs: Alvin M. Despain and Jean-Luc Gaudiot
DARPA DIS PI Meeting, Santa Fe, NM
March 26-27, 2002
2 HiDISC: Hierarchical Decoupled Instruction Set Computer
- New Ideas
  - A dedicated processor for each level of the memory hierarchy
  - Explicit management of each level of the memory hierarchy using instructions generated by the compiler
  - Hide memory latency by converting data access predictability to data access locality
  - Exploit instruction-level parallelism without extensive scheduling hardware
  - Zero-overhead prefetches for maximal computation throughput
- Impact
  - 2x speedup for scientific benchmarks with large data sets over an in-order superscalar processor
  - 7.4x speedup for matrix multiply over an in-order issue superscalar processor
  - 2.6x speedup for matrix decomposition/substitution over an in-order issue superscalar processor
  - Reduced memory latency for systems that have high memory bandwidth (e.g., PIMs, RAMBUS)
  - Allows the compiler to solve indexing functions for irregular applications
  - Reduced system cost for high-throughput scientific codes
- Schedule (April 2001 start through March 2002)
  - Defined benchmarks
  - Completed simulator
  - Performed instruction-level simulations on hand-compiled benchmarks
  - Continue simulations of more benchmarks (SAR)
  - Define HiDISC architecture
  - Benchmark results
  - Update simulator
  - Developed and tested a complete decoupling compiler
  - Generated performance statistics and evaluated design
3 Accomplishments
- Design of the HiDISC model
- Compiler development (operational)
- Simulator design (operational)
- Performance evaluation
  - Three DIS benchmarks (Multidimensional Fourier Transform, Method of Moments, Data Management)
  - Five stressmarks (Pointer, Transitive Closure, Neighborhood, Matrix, Field)
- Results
  - Mostly higher performance (some lower)
  - Range of applications
- HiDISC of the future
4 Outline
- Original Technical Objective
- HiDISC Architecture Review
- Benchmark Results
- Conclusion and Future Work
5 HiDISC: Hierarchical Decoupled Instruction Set Computer
- Technological trend: memory latency is getting longer relative to microprocessor speed (40% per year)
- Problem: some SPEC benchmarks spend more than half of their time stalling [Lebeck and Wood 1994]
- Domain: benchmarks with large data sets, i.e., symbolic, signal processing, and scientific programs
- Present solutions: larger caches, prefetching (software and hardware), simultaneous multithreading
6 Present Solutions and Their Limitations
- Larger caches
  - Slow
  - Work well only if the working set fits the cache and there is temporal locality
  - Cannot be tailored for each application
- Hardware prefetching
  - Behavior based on past and present execution-time behavior
- Software prefetching
  - Must ensure the overheads of prefetching do not outweigh the benefits -> conservative prefetching
  - Adaptive software prefetching is required to change the prefetch distance at run-time
  - Hard to insert prefetches for irregular access patterns
- Multithreading
  - Solves the throughput problem, not the memory latency problem
7 The HiDISC Approach
- Observation
  - Software prefetching impacts compute performance
  - PIMs and RAMBUS offer a high-bandwidth memory system, useful for speculative prefetching
- Approach
  - Add a processor to manage prefetching -> hide the overhead
  - Compiler explicitly manages the memory hierarchy
  - Prefetch distance adapts to the program's runtime behavior
8 What is HiDISC?
- A dedicated processor for each level of the memory hierarchy
- Explicitly manage each level of the memory hierarchy using instructions generated by the compiler
- Hide memory latency by converting data access predictability to data access locality (just-in-time fetch)
- Exploit instruction-level parallelism without extensive scheduling hardware
- Zero-overhead prefetches for maximal computation throughput

[Figure: HiDISC organization. A 2-issue Computation Processor (CP) and a 3-issue Access Processor (AP) communicate through the Load Data Queue, Store Address Queue, Store Data Queue, and Slip Control Queue; a 3-issue Cache Management Processor (CMP) manages the L1 cache.]
9 Slip Control Queue
- The Slip Control Queue (SCQ) adapts dynamically
  - Late prefetches: prefetched data arrived after the load had been issued
  - Useful prefetches: prefetched data arrived before the load had been issued

if (prefetch_buffer_full())
    ;  /* don't change size of SCQ */
else if ((2 * late_prefetches) > useful_prefetches)
    increase size of SCQ;
else
    decrease size of SCQ;
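The adaptation rule above can be sketched as a small function. The +/-1 step size and the bounds are illustrative assumptions; the slides only specify the decision conditions.

```python
def adjust_scq_size(scq_size, prefetch_buffer_full, late_prefetches,
                    useful_prefetches, min_size=1, max_size=256):
    """Return the new SCQ size under the slide's adaptation policy.

    Step size and min/max bounds are assumed for illustration.
    """
    if prefetch_buffer_full:
        return scq_size                      # don't change size of SCQ
    if 2 * late_prefetches > useful_prefetches:
        return min(scq_size + 1, max_size)   # prefetches arriving late -> grow slip
    return max(scq_size - 1, min_size)       # prefetches comfortably early -> shrink
```

Growing the queue lets the AP slip further ahead of the CP when prefetches are arriving too late; shrinking it limits cache pollution when prefetches are already useful.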
10 Decoupling Programs for HiDISC (Discrete Convolution, Inner Loop)

Inner-loop convolution:

for (j = 0; j < i; j++)
    y[i] = y[i] + (x[j] * h[i-j-1]);

Computation Processor code:

while (!EOD)
    y = y + (x * h);
send y to SDQ;

Access Processor code:

for (j = 0; j < i; j++) {
    load (x[j]);
    load (h[i-j-1]);
    GET_SCQ;
}
send (EOD token);
send address of y[i] to SAQ;

Cache Management Processor code:

for (j = 0; j < i; j++) {
    prefetch (x[j]);
    prefetch (h[i-j-1]);
    PUT_SCQ;
}

SAQ: Store Address Queue; SDQ: Store Data Queue; SCQ: Slip Control Queue; EOD: End of Data.
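The access/compute split above can be mimicked in software, with a plain Python queue standing in for the hardware Load Data Queue and `None` standing in for the EOD token (a sketch of the decoupling idea, not of the hardware):

```python
from collections import deque

def access_stream(x, h, i, ldq):
    """AP role: issue the loads ahead of the CP, pushing operand
    pairs (x[j], h[i-j-1]) into the Load Data Queue."""
    for j in range(i):
        ldq.append((x[j], h[i - j - 1]))
    ldq.append(None)  # EOD token

def compute_stream(ldq):
    """CP role: consume operand pairs from the LDQ until EOD,
    accumulating y without computing any addresses itself."""
    y = 0
    while True:
        item = ldq.popleft()
        if item is None:        # EOD reached
            return y
        xv, hv = item
        y += xv * hv

# One inner-loop iteration of the convolution, i = 3:
x, h = [1, 2, 3], [4, 5, 6]
ldq = deque()
access_stream(x, h, 3, ldq)
y = compute_stream(ldq)         # y = x[0]*h[2] + x[1]*h[1] + x[2]*h[0] = 28
```

The point of the split is that `compute_stream` never computes an address: all access logic lives in the access stream, which in hardware can run arbitrarily far ahead.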
11 General View of the Compiler

[Figure: HiDISC compilation overview. Source program -> gcc -> binary code -> disassembler -> stream separator -> computation code, access code, and cache management code.]
12 HiDISC Stream Separator
13 Stream Separation: Backward Load/Store Chasing
- Load/store instructions and their backward slices are included in the Access stream
- All remaining instructions are sent to the Computation stream
- Communication between the two streams is also defined at this point
14 Stream Separation: Creating an Access Stream
- Communication needs to take place via the various hardware queues
- Insert communication instructions
- CMP instructions are copied from access stream instructions, except that load instructions are replaced by prefetch instructions
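The backward load/store chasing of the previous slide can be sketched as a fixed-point slice over a toy instruction list. The `(dest, op, sources)` IR and the iteration scheme are illustrative assumptions, not the actual HiDISC compiler representation:

```python
def separate_streams(instructions):
    """Partition instructions into (access, computation) streams by
    backward-chasing the operands of loads and stores.

    Each instruction is a (dest, op, sources) triple (toy IR).
    """
    # Seed the access stream with every load/store instruction.
    in_access = {i for i, (_, op, _) in enumerate(instructions)
                 if op in ("load", "store")}
    # Backward slice: pull in the producers of any value an access
    # instruction consumes, iterating to a fixed point.
    changed = True
    while changed:
        changed = False
        needed = {s for i in in_access for s in instructions[i][2]}
        for i, (dest, _, _) in enumerate(instructions):
            if i not in in_access and dest in needed:
                in_access.add(i)
                changed = True
    access = [ins for i, ins in enumerate(instructions) if i in in_access]
    compute = [ins for i, ins in enumerate(instructions) if i not in in_access]
    return access, compute

# Address arithmetic ("add") feeds a load, so it joins the access
# stream; the multiply that only consumes loaded data stays in the
# computation stream.
insts = [("a", "add", ("i", "one")),
         ("t", "load", ("a",)),
         ("c", "mul", ("t", "t"))]
acc, comp = separate_streams(insts)
```

A real separator would additionally insert the queue communication instructions and clone the access stream into CMP code with loads rewritten as prefetches, as the slide describes.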
15 DIS Benchmarks
- Application-oriented benchmarks
- Main characteristics: large data sets, non-contiguous memory access, and no temporal locality
16 DIS Benchmarks: Description

Method of Moments (MoM): Computes the electromagnetic scattering from complex objects; computationally complex and demands high memory speed. [Done]
Multidimensional Fourier Transform (FFT): Wide range of applications: image processing, synthesis, convolution/deconvolution, and digital signal filtering. [Done]
Data Management (DM): Traditional DBMS processing, focusing on index algorithms and ad hoc query processing. [Done]
Image Understanding (IU): Contains algorithms that perform spatial filtering and data reduction: a morphological filter component, a region-of-interest (ROI) selection component, and a feature extraction component. [Compiled; input file problem]
SAR Ray Tracing (RAY): A cost-effective alternative to real data collection; the benchmark utilizes an image-domain approach to SAR image simulation. [Done]
17 Atlantic Aerospace Stressmark Suite
- Smaller and more specific procedures
- Seven individual data-intensive benchmarks
- Directly illustrate particular elements of the DIS problem
- Fewer computation operations than the DIS benchmarks
- Focus on irregular memory accesses
18 Stressmark Suite

Pointer: Pointer following. Small blocks at unpredictable locations; can be parallelized. [Done]
Update: Pointer following with memory update. Small blocks at unpredictable locations. [Done]
Matrix: Conjugate-gradient simultaneous equation solver. Access dependent on matrix representation; likely to be irregular or mixed, with mixed levels of reuse. [Done]
Neighborhood: Calculates image texture measures by finding sum and difference histograms. Regular access to pairs of words at arbitrary distances. [Done]
Field: Collects statistics on a large field of words. Regular access, with little reuse. [Done]
Corner-Turn: Matrix transposition. Block movement between processing nodes with practically nil computation. [N/A]
Transitive Closure: Finds the all-pairs shortest-path solution for a directed graph. Dependent on matrix representation, but requires reads and writes to different matrices concurrently. [Done]

Source: DIS Stressmark Suite Version 1.0, Atlantic Aerospace Division.
19 Pointer Stressmark
- Basic idea: repeatedly follow pointers to randomized locations in memory
- Memory access pattern is unpredictable (input-data dependent)
- Insufficient temporal and spatial locality for conventional cache architectures
- The HiDISC architecture should provide lower average memory access latency
- The pointer chasing in the AP can run ahead without being blocked by the CP
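The access pattern can be illustrated with a minimal pointer-chasing kernel over a random cycle (the actual stressmark's data layout and parameters differ; names here are illustrative):

```python
import random

def build_chain(n, seed=0):
    """Build a random pointer chain: mem[i] holds the index of the
    next block, forming one cycle over n slots. Stands in for the
    stressmark's small blocks at unpredictable locations."""
    order = list(range(n))
    random.Random(seed).shuffle(order)
    mem = [0] * n
    for a, b in zip(order, order[1:] + order[:1]):
        mem[a] = b
    return mem, order[0]

def chase(mem, start, hops):
    """Follow the chain for `hops` steps. Each access depends on the
    value returned by the previous one, so no stride-based prefetcher
    can predict the next address."""
    p = start
    for _ in range(hops):
        p = mem[p]
    return p
```

Because every load's address comes from the previous load, a superscalar core serializes them at full memory latency, while a decoupled access stream can at least run the chain ahead of the computation that consumes it.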
20 Simulation Parameters (SimpleScalar 3.0, 2000)

Branch prediction mode: bimodal
Branch table size: 2048 entries
Issue width: 4
Window size for superscalar: RUU 16, LSQ 8
Slip distance for AP/CP: 50
Data L1 cache configuration: 128 sets, 32-byte blocks, 4-way set associative, LRU
Data L1 cache latency: 1 cycle
Unified L2 cache configuration: 1024 sets, 64-byte blocks, 4-way set associative, LRU
Unified L2 cache latency: 6 cycles
Integer functional units: ALU (x4), MUL/DIV
Floating-point functional units: ALU (x4), MUL/DIV
Number of memory ports: 2
21 Superscalar and HiDISC
- Ability to vary the memory latency
- The superscalar supports out-of-order issue with a 16-entry RUU (Register Update Unit) and an 8-entry LSQ (Load/Store Queue)
- The AP and CP each issue instructions in order
22 DIS Benchmark/Stressmark Results

DIS benchmarks:
FFT: ?  MoM: ?  DM: ?  IU: N/A  RAY: ?

Stressmarks:
Pointer: ?  Tran. Cl.: ?  Field: ?  Matrix: ?  Neighbor: ?  Update: ?  Corner: N/A

(? = better with HiDISC; ? = better with superscalar)
23 DIS Benchmark Results
24 DIS Benchmark Performance
- The DIS benchmarks perform extremely well in general with our decoupled processing, for the following reasons:
  - Many long-latency floating-point operations
  - Robust with longer memory latency (e.g., Method of Moments is a more stream-like process)
- In FFT, the HiDISC architecture also suffers from longer memory latencies, due to the data dependencies between the two streams
- DM is not affected by the longer memory latency in either case
25 Stressmark Results
26 Stressmark Results: Good Performance
- Pointer chasing can be executed far ahead using the decoupled access stream
  - It does not require computation results from the CP
- Transitive closure also produces good results
  - Little in the AP depends on the results of the CP
  - The AP can run ahead
27 Some Weak Performance
28 Performance Bottlenecks
- Too much synchronization causes loss of decoupling
- Unbalanced code size between the two streams
  - The stressmark suite is highly access-heavy
- Application domain for decoupled architectures
  - A balanced ratio between computation and memory access
29 Synthesis
- Synchronization degrades performance
  - When the dependency of the AP on the CP increases, the slip distance of the AP cannot be increased
- More stream-like applications will benefit from using HiDISC
- Multithreading support is needed
- Applications should contain enough computation (a 1:1 ratio) to hide the memory access latency
- The CMP should be simple
  - It executes redundant operations if the data is already in the cache
  - Cache pollution can occur
30 Flexi-DISC
- Fundamental characteristics
  - Inherently highly dynamic at execution time
- Dynamically reconfigurable central computation kernel (CK)
- Multiple levels of caching and processing around the CK
  - Adjustable prefetching
- Multiple processors on a chip, providing flexible adaptation from multiple to single processors and horizontal sharing of the existing resources
31 Flexi-DISC
- Partitioning of the computation kernel
  - It can be allocated to different portions of one application or to different applications
- The CK requires separation of the next ring to feed it with data
- The variety of target applications makes the memory accesses unpredictable
- Identical processing units for the outer rings
  - Highly efficient dynamic partitioning of the resources and their run-time allocation can be achieved