Title: HiDISC: A Decoupled Architecture for Applications in Data Intensive Computing

1. HiDISC: A Decoupled Architecture for Applications in Data Intensive Computing
- PIs: Alvin M. Despain and Jean-Luc Gaudiot
- University of Southern California
- http://www-pdpc.usc.edu
- October 12th, 2000
2. HiDISC: Hierarchical Decoupled Instruction Set Computer

New Ideas
- A dedicated processor for each level of the memory hierarchy
- Explicitly manage each level of the memory hierarchy using instructions generated by the compiler
- Hide memory latency by converting data access predictability to data access locality
- Exploit instruction-level parallelism without extensive scheduling hardware
- Zero-overhead prefetches for maximal computation throughput

Impact
- 2x speedup for scientific benchmarks with large data sets over an in-order superscalar processor
- 7.4x speedup for matrix multiply over an in-order issue superscalar processor
- 2.6x speedup for matrix decomposition/substitution over an in-order issue superscalar processor
- Reduced memory latency for systems that have high memory bandwidths (e.g., PIMs, RAMBUS)
- Allows the compiler to solve indexing functions for irregular applications
- Reduced system cost for high-throughput scientific codes
Schedule

April 98 (start):
- Defined benchmarks
- Completed simulator
- Performed instruction-level simulations on hand-compiled benchmarks

April 99:
- Continue simulations of more benchmarks (SAR)
- Define HiDISC architecture
- Benchmark results

April 00:
- Develop and test a full decoupling compiler
- Update simulator
- Generate performance statistics and evaluate design
3. HiDISC: Hierarchical Decoupled Instruction Set Computer

- Technological trend: memory latency is getting longer relative to microprocessor speed (40% per year)
- Problem: some SPEC benchmarks spend more than half of their time stalling [Lebeck and Wood 1994]
- Domain: benchmarks with large data sets: symbolic, signal processing, and scientific programs
- Present solutions: larger caches, hardware prefetching, software prefetching, multithreading
4. Present Solutions

Solutions and their limitations:
- Larger caches
  - Slow
  - Work well only if the working set fits the cache and there is temporal locality
- Hardware prefetching
  - Cannot be tailored for each application
  - Behavior based on past and present execution-time behavior
- Software prefetching
  - Must ensure the overheads of prefetching do not outweigh the benefits -> conservative prefetching
  - Adaptive software prefetching is required to change the prefetch distance at run-time
  - Hard to insert prefetches for irregular access patterns
- Multithreading
  - Solves the throughput problem, not the memory latency problem
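The software-prefetching limitations above can be made concrete with a small sketch. The following C fragment is a hypothetical illustration (not part of the HiDISC toolchain) using GCC's `__builtin_prefetch` with a fixed prefetch distance; `PF_DIST` is an assumed tuning parameter. A distance that is too small makes prefetches late, while one that is too large evicts useful data, which is why the slide calls for conservative, adaptive prefetching.

```c
#include <stddef.h>

/* Hypothetical example: sum an array with fixed-distance software
   prefetching. PF_DIST is an assumed tuning parameter. */
#define PF_DIST 16

double sum_with_prefetch(const double *a, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + PF_DIST < n)
            /* hint: read access, low temporal locality */
            __builtin_prefetch(&a[i + PF_DIST], 0, 1);
        s += a[i];
    }
    return s;
}
```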
5. The HiDISC Approach

Observation
- Software prefetching impacts compute performance
- PIMs and RAMBUS offer a high-bandwidth memory system, useful for speculative prefetching

Approach
- Add a processor to manage prefetching -> hide overhead
- Compiler explicitly manages the memory hierarchy
- Prefetch distance adapts to the program's runtime behavior
6. What is HiDISC?

- A dedicated processor for each level of the memory hierarchy
- Explicitly manage each level of the memory hierarchy using instructions generated by the compiler
- Hide memory latency by converting data access predictability to data access locality (Just-in-Time Fetch)
- Exploit instruction-level parallelism without extensive scheduling hardware
- Zero-overhead prefetches for maximal computation throughput

[Figure: the compiler splits the program into three instruction streams: computation instructions for the Computation Processor (CP), access instructions for the Access Processor (AP), and cache management instructions for the Cache Mgmt. Processor (CMP); the CP and AP share registers.]
7. Decoupled Architectures

[Figure: four configurations compared, each built from a Computation Processor (CP) with registers and cache: MIPS (conventional, single CP), CAPP and DEAP (decoupled, CP plus Access Processor (AP)), and HiDISC (new decoupled, CP plus AP plus Cache Mgmt. Processor (CMP)). Issue widths shown range over 2-issue, 3-issue, 5-issue, and 8-issue variants, all backed by a 2nd-level cache and main memory.]

DEAP [Kurian, Hulina, Coraor 94]; PIPE [Goodman 85]; other decoupled processors: ACRI, ZS-1, WA
8. Slip Control Queue

- The Slip Control Queue (SCQ) adapts dynamically
- Late prefetches: prefetched data arrived after the load had been issued
- Useful prefetches: prefetched data arrived before the load had been issued

if (prefetch_buffer_full())
    ; /* don't change size of SCQ */
else if ((2 * late_prefetches) > useful_prefetches)
    increase size of SCQ;
else
    decrease size of SCQ;
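The SCQ heuristic above can be sketched as runnable C. This is a minimal illustration under assumed details: the struct, the queue-size bounds, and the function name are hypothetical, not from the HiDISC design.

```c
#include <stddef.h>

/* Sketch of the SCQ size-adaptation heuristic from the slide.
   The struct and the bounds (1..max_size) are illustrative assumptions. */
typedef struct {
    size_t size;     /* current slip distance (SCQ entries) */
    size_t max_size; /* capacity bound on the queue          */
} scq_t;

void scq_adapt(scq_t *q, int prefetch_buffer_full,
               unsigned late_prefetches, unsigned useful_prefetches)
{
    if (prefetch_buffer_full) {
        /* don't change the size of the SCQ */
    } else if (2 * late_prefetches > useful_prefetches) {
        if (q->size < q->max_size)
            q->size++;  /* increase slip: prefetch further ahead */
    } else {
        if (q->size > 1)
            q->size--;  /* decrease slip: prefetching too eagerly */
    }
}
```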
9. Decoupling Programs for HiDISC (Discrete Convolution, Inner Loop)

Inner loop convolution:
for (j = 0; j < i; j++)
    y[i] = y[i] + (x[j] * h[i-j-1]);

Computation Processor Code:
while (!EOD) {
    y = y + (x * h);
    send y to SDQ;
}

Access Processor Code:
for (j = 0; j < i; j++) {
    load (x[j]);
    load (h[i-j-1]);
    GET_SCQ;
}
send (EOD token);
send address of y[i] to SAQ;

Cache Management Code:
for (j = 0; j < i; j++) {
    prefetch (x[j]);
    prefetch (h[i-j-1]);
    PUT_SCQ;
}

SAQ: Store Address Queue; SDQ: Store Data Queue; SCQ: Slip Control Queue; EOD: End of Data
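For reference, the computation that the three streams above jointly perform can be written as ordinary sequential C. This is a sketch for illustration only; the `load`/`prefetch`/queue operations are HiDISC-specific and do not appear here.

```c
/* One output point of the discrete convolution inner loop:
   y[i] = sum over j < i of x[j] * h[i-j-1], as in the slide. */
double conv_point(const double *x, const double *h, int i)
{
    double yi = 0.0;
    for (int j = 0; j < i; j++)
        yi += x[j] * h[i - j - 1];
    return yi;
}
```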
10. Benchmarks

Benchmark  Source of Benchmark   Lines of Source Code  Description                                  Data Set Size
LLL1       Livermore Loops [45]  20                    1024-element arrays, 100 iterations          24 KB
LLL2       Livermore Loops       24                    1024-element arrays, 100 iterations          16 KB
LLL3       Livermore Loops       18                    1024-element arrays, 100 iterations          16 KB
LLL4       Livermore Loops       25                    1024-element arrays, 100 iterations          16 KB
LLL5       Livermore Loops       17                    1024-element arrays, 100 iterations          24 KB
Tomcatv    SPECfp95 [68]         190                   33x33-element matrices, 5 iterations         <64 KB
MXM        NAS kernels [5]       113                   Unrolled matrix multiply, 2 iterations       448 KB
CHOLSKY    NAS kernels           156                   Cholsky matrix decomposition                 724 KB
VPENTA     NAS kernels           199                   Invert three pentadiagonals simultaneously   128 KB
Qsort      Quicksort [14]        58                    Quicksort sorting algorithm                  128 KB
11. Simulation

Parameter               Value                     Parameter                 Value
L1 cache size           4 KB                      L2 cache size             16 KB
L1 cache associativity  2                         L2 cache associativity    2
L1 cache block size     32 B                      L2 cache block size       32 B
Memory latency          Variable (0-200 cycles)   Memory contention time    Variable
Victim cache size       32 entries                Prefetch buffer size      8 entries
Load queue size         128                       Store address queue size  128
Store data queue size   128                       Total issue width         8
12. Simulation Results
13. Accomplishments

- 2x speedup for scientific benchmarks with large data sets over an in-order superscalar processor
- 7.4x speedup for matrix multiply (MXM) over an in-order issue superscalar processor (similar operations are used in ATR/SLD)
- 2.6x speedup for matrix decomposition/substitution (Cholsky) over an in-order issue superscalar processor
- Reduced memory latency for systems that have high memory bandwidths (e.g., PIMs, RAMBUS)
- Allows the compiler to solve indexing functions for irregular applications
- Reduced system cost for high-throughput scientific codes
14. Work in Progress

- Compiler design
- Data Intensive Systems (DIS) benchmarks analysis
- Simulator update
- Parameterization of silicon space for VLSI implementation
15. Compiler Requirements

- Source language flexibility
- Sequential assembly code for streaming
- Ease of implementation
- Optimality of sequential code
- Portability and upgradability
16. Gcc-2.95 Features

- Localized register spilling; global common subexpression elimination using lazy code motion algorithms
- Enhanced control flow graph analysis: the new framework simplifies control dependence analysis, which is used by aggressive dead code elimination algorithms
- Provision to add modules for instruction scheduling and delayed branch execution
- Front-ends for C, C++, and Fortran available
- Support for different environments and platforms; cross-compilation
17. Compiler Organization

[Figure: HiDISC compilation overview. The source program is compiled by GCC into assembly code; a stream separator then splits it into computation assembly code, access assembly code, and cache management assembly code, each of which is assembled into object code for its processor.]
18. HiDISC Stream Separator
19. Compiler Front-End Optimizations

- Jump optimization: simplify jumps to the following instruction, jumps across jumps, and jumps to jumps
- Jump threading: detect a conditional jump that branches to an identical or inverse test
- Delayed branch execution: find instructions that can go into the delay slots of other instructions
- Constant propagation: propagate constants into a conditional loop
20. Compiler Front-End Optimizations (contd.)

- Instruction combination: combine groups of two or three instructions that are related by data flow into a single instruction
- Instruction scheduling: look for instructions whose output will not be available by the time it is used in subsequent instructions
- Loop optimizations: move constant expressions out of loops and perform strength reduction
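The loop optimizations named above can be sketched in C. This is an illustrative before/after pair under assumed names (the functions are hypothetical, not from gcc): the invariant expression `scale * bias` is hoisted out of the loop, and strength reduction replaces the per-iteration index computation with an incremented pointer.

```c
/* Before: the loop-invariant product is recomputed every iteration,
   and each access recomputes a[i] from the index. */
int dot_scaled_naive(const int *a, int n, int scale, int bias)
{
    int s = 0;
    for (int i = 0; i < n; i++)
        s += a[i] * (scale * bias);
    return s;
}

/* After: invariant hoisted; pointer induction variable replaces
   the multiply hidden in the a[i] address computation. */
int dot_scaled_optimized(const int *a, int n, int scale, int bias)
{
    int k = scale * bias;                  /* hoisted out of the loop */
    int s = 0;
    for (const int *p = a; p < a + n; p++) /* strength-reduced index */
        s += *p * k;
    return s;
}
```

Both versions compute the same result; the second simply does less work per iteration.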
21. Example of Stressmarks

- Pointer Stressmark
  - Basic idea: repeatedly follow pointers to randomized locations in memory
  - Memory access pattern is unpredictable
  - Randomized memory access pattern
  - Insufficient temporal and spatial locality for conventional cache architectures
  - The HiDISC architecture provides lower memory access latency
22. Decoupling of Pointer Stressmarks

Inner loop:
for (i = j+1; i < w; i++)
    if (field[index+i] > partition) balance++;
if (balance + high == w/2) break;
else if (balance + high > w/2) min = partition;
else max = partition;
high++;

Computation Processor Code:
while (!EOD)
    if (field > partition) balance++;
if (balance + high == w/2) break;
else if (balance + high > w/2) min = partition;
else max = partition;
high++;

Access Processor Code:
for (i = j+1; i < w; i++) {
    load (field[index+i]);
    GET_SCQ;
}
send (EOD token);

Cache Management Code:
for (i = j+1; i < w; i++) {
    prefetch (field[index+i]);  /* inner loop for the next indexing */
    PUT_SCQ;
}
23. Stressmarks

- Hand-compile the 7 individual benchmarks
  - Use gcc as the front-end
  - Manually partition each of the three instruction streams and insert synchronizing instructions
- Evaluate architectural trade-offs
  - Updated simulator characteristics, such as out-of-order issue
  - Large L2 cache and enhanced main memory systems, such as RAMBUS and DDR
24. Simulator Update

- Survey current processor architectures
  - Focus on commercial leading-edge technology for implementation
- Analyze the current simulator and previous benchmark results
- Enhance memory hierarchy configurations
- Add out-of-order issue
25. Memory Hierarchy

- Modern processors have increasingly large on-chip L2 caches
  - E.g., 256 KB L2 cache on Pentium and Athlon processors
  - Reduces the L1 cache miss penalty
- New mechanisms in main memory architecture (e.g., RAMBUS) reduce the L2 cache miss penalty
26. Out-of-Order Multiple Issue

- Most current advanced processors are based on the superscalar, multiple-issue paradigm
  - MIPS R10000, PowerPC, UltraSPARC, Alpha, and the Pentium family
- Compare the HiDISC architecture with modern superscalar processors
- Out-of-order instruction issue
  - For precise exception handling, include in-order completion
- New access decoupling paradigm for out-of-order issue
27. HiDISC with Modern DRAM Architecture

- RAMBUS and DDR DRAM improve memory bandwidth
  - Latency does not improve significantly
- The decoupled access processor can fully utilize the enhanced memory bandwidth
  - More requests are generated by the access processor
  - The prefetching mechanism hides memory access latency
28. HiDISC / SMT

- The reduced memory latency of HiDISC can
  - decrease the number of threads needed by an SMT architecture
  - relieve the memory burden of an SMT architecture
  - lessen the complex issue logic of multithreading
- Functional unit utilization can increase with multithreading features on HiDISC
  - More instruction-level parallelism is possible
29. The McDISC System: Memory-Centered Distributed Instruction Set Computer
30. Summary

- Designing a compiler
  - Porting gcc to HiDISC
- Benchmark simulation with new parameters and an updated simulator
- Analysis of architectural trade-offs for equal silicon area
- Hand-compilation of the Stressmark suites and simulation
- DIS benchmarks simulation