Title: HiDISC: A Decoupled Architecture for Applications in Data Intensive Computing
1. HiDISC: A Decoupled Architecture for Applications in Data Intensive Computing
- PIs: Alvin M. Despain and Jean-Luc Gaudiot
- University of Southern California
- http://www-pdpc.usc.edu
- May 2001
2. Outline
- HiDISC Project Description
- Experiments and Accomplishments
- Work in Progress
- Summary
3. HiDISC: Hierarchical Decoupled Instruction Set Computer
- Technological trend: memory latency is getting longer relative to microprocessor speed (by roughly 40% per year)
- Problem: some SPEC benchmarks spend more than half of their time stalling
- Domain: benchmarks with large data sets -- symbolic, signal processing, and scientific programs
- Present solutions: multithreading (homogeneous), larger caches, hardware prefetching, software prefetching
4. Present Solutions
- Larger caches
  - Slow
  - Work well only if the working set fits the cache and there is temporal locality
  - Cannot be tailored for each application
- Hardware prefetching
  - Behavior based on past and present execution-time behavior
- Software prefetching
  - Must ensure the overheads of prefetching do not outweigh the benefits -> conservative prefetching
  - Adaptive software prefetching is required to change the prefetch distance at run-time
  - Hard to insert prefetches for irregular access patterns
- Multithreading
  - Solves the throughput problem, not the memory latency problem
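To make the software-prefetching limitations above concrete, here is a minimal sketch (not HiDISC code) of the classic fixed-distance software prefetch. The name `sum_with_prefetch` and the `PREFETCH_DISTANCE` value are invented for illustration; `__builtin_prefetch` is a GCC/Clang hint that compiles to a no-op where unsupported.

```c
/* Fixed-distance software prefetching (illustrative sketch).
 * PREFETCH_DISTANCE is a hypothetical tuning knob: too small and the data
 * arrives late, too large and it may be evicted before use -- exactly the
 * "adaptive prefetch distance" problem the slide lists. */
#define PREFETCH_DISTANCE 16

long sum_with_prefetch(const int *a, int n)
{
    long sum = 0;
    for (int i = 0; i < n; i++) {
        if (i + PREFETCH_DISTANCE < n)
            __builtin_prefetch(&a[i + PREFETCH_DISTANCE]); /* hint only */
        sum += a[i];
    }
    return sum;
}
```

Note that the prefetch instructions occupy issue slots in the compute loop, which is the "software prefetching impacts compute performance" observation HiDISC starts from.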
5. The HiDISC Approach
- Observation
  - Software prefetching impacts compute performance
  - PIMs and RAMBUS offer a high-bandwidth memory system, useful for speculative prefetching
- Approach
  - Add a processor to manage prefetching -> hide overhead
  - Compiler explicitly manages the memory hierarchy
  - Prefetch distance adapts to the program's runtime behavior
6. What is HiDISC?
- A dedicated processor for each level of the memory hierarchy
- Explicitly manage each level of the memory hierarchy using instructions generated by the compiler
- Hide memory latency by converting data access predictability to data access locality (just-in-time fetch)
- Exploit instruction-level parallelism without extensive scheduling hardware
- Zero-overhead prefetches for maximal computation throughput

[Figure: the compiler splits the program into three streams -- computation instructions for the Computation Processor (CP), access instructions for the Access Processor (AP), and cache management instructions for the Cache Mgmt. Processor (CMP); the CP and AP share registers.]
7. Decoupled Architectures

[Figure: block diagrams of four architectures -- MIPS (conventional, single 8-issue Computation Processor), CAPP and DEAP (decoupled: Computation Processor plus Access Processor, connected through the SDQ, SAQ, and LQ queues; DEAP adds a Cache Mgmt. Processor), and HiDISC (new decoupled: 2-issue CP, 3-issue AP, and 3-issue CMP, with an SCQ between the AP and CMP) -- each sitting above the cache, 2nd-level cache, and main memory.]

DEAP [Kurian, Hulina, Coraor '94]; PIPE [Goodman '85]; other decoupled processors: ACRI, ZS-1, WM
8. Slip Control Queue
- The Slip Control Queue (SCQ) adapts dynamically
  - Late prefetches: prefetched data arrived after the load had been issued
  - Useful prefetches: prefetched data arrived before the load had been issued

if (prefetch_buffer_full())
    ; /* don't change size of SCQ */
else if ((2 * late_prefetches) > useful_prefetches)
    increase size of SCQ;
else
    decrease size of SCQ;
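The sizing rule above can be sketched as plain C. This is our own model, not the HiDISC hardware: the struct fields, counter names, and the floor of 1 on the queue size are assumptions made for illustration.

```c
/* Minimal model of the slip control queue sizing rule:
 * grow the SCQ when late prefetches dominate, shrink it otherwise,
 * and leave it alone while the prefetch buffer is full. */
typedef struct {
    int size;              /* current SCQ size (slip distance) */
    int late_prefetches;   /* data arrived after the load issued */
    int useful_prefetches; /* data arrived before the load issued */
    int buffer_full;       /* nonzero if the prefetch buffer is full */
} scq_t;

void scq_adapt(scq_t *q)
{
    if (q->buffer_full)
        return;                                    /* don't change size */
    if (2 * q->late_prefetches > q->useful_prefetches)
        q->size++;                                 /* increase size */
    else if (q->size > 1)
        q->size--;                                 /* decrease size */
}
```

The factor of 2 weights late prefetches more heavily than useful ones, so the queue grows as soon as a significant fraction of prefetches arrive late.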
9. Decoupling Programs for HiDISC (Discrete Convolution, Inner Loop)

Inner loop (convolution):
for (j = 0; j < i; j++)
    y[i] = y[i] + (x[j] * h[i-j-1]);

Computation Processor code:
while (not EOD)
    y = y + (x * h);
send y to SDQ;

Access Processor code:
for (j = 0; j < i; j++) {
    load(x[j]);
    load(h[i-j-1]);
    GET_SCQ;
}
send(EOD token);
send address of y[i] to SAQ;

Cache Management code:
for (j = 0; j < i; j++) {
    prefetch(x[j]);
    prefetch(h[i-j-1]);
    PUT_SCQ;
}

SAQ: Store Address Queue; SDQ: Store Data Queue; SCQ: Slip Control Queue; EOD: End of Data
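For reference, the undecoupled computation that all three streams above jointly perform is just the ordinary convolution inner loop. The function name and the test data are ours; only the loop body comes from the slide.

```c
/* Plain single-processor version of the convolution inner loop that
 * slide 9 decouples: accumulate one output point y[i] from x and h. */
void convolve_point(double *y, const double *x, const double *h, int i)
{
    for (int j = 0; j < i; j++)
        y[i] = y[i] + x[j] * h[i - j - 1];
}
```

In the decoupled version, the AP issues the loads of x[j] and h[i-j-1], the CMP prefetches them one SCQ slip ahead, and the CP only consumes operands and produces y.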
10. Where We Were
- HiDISC compiler
  - Frontend selection (gcc)
  - Single thread running without conditionals
  - Hand compiling of benchmarks
    - Livermore Loops, Tomcatv, MXM, Cholsky, Vpenta, and Qsort
11. Benchmarks

Benchmark   Source of Benchmark    Lines of Source Code   Description                                  Data Set Size
LLL1        Livermore Loops [45]   20                     1024-element arrays, 100 iterations          24 KB
LLL2        Livermore Loops        24                     1024-element arrays, 100 iterations          16 KB
LLL3        Livermore Loops        18                     1024-element arrays, 100 iterations          16 KB
LLL4        Livermore Loops        25                     1024-element arrays, 100 iterations          16 KB
LLL5        Livermore Loops        17                     1024-element arrays, 100 iterations          24 KB
Tomcatv     SPECfp95 [68]          190                    33x33-element matrices, 5 iterations         <64 KB
MXM         NAS kernels [5]        113                    Unrolled matrix multiply, 2 iterations       448 KB
CHOLSKY     NAS kernels            156                    Cholsky matrix decomposition                 724 KB
VPENTA      NAS kernels            199                    Invert three pentadiagonals simultaneously   128 KB
Qsort       Quicksort [14]         58                     Quicksort sorting algorithm                  128 KB
12. Simulation Parameters
13. Simulation Results
14. Accomplishments
- 2x speedup for scientific benchmarks with large data sets over an in-order superscalar processor
- 7.4x speedup for matrix multiply (MXM) over an in-order issue superscalar processor (similar operations are used in ATR/SLD)
- 2.6x speedup for matrix decomposition/substitution (Cholsky) over an in-order issue superscalar processor
- Reduced memory latency for systems that have high memory bandwidths (e.g. PIMs, RAMBUS)
- Allows the compiler to solve indexing functions for irregular applications
- Reduced system cost for high-throughput scientific codes
15. Work in Progress
- Silicon Space for VLSI Layout
- Compiler Progress
- Simulator Integration
- Hand-compiling of Benchmarks
- Architectural Enhancement for Data Intensive
Applications
16. VLSI Layout Overhead
- Goal: evaluate the layout effectiveness of the HiDISC architecture
- Cache has become a major portion of the chip area
- Methodology: extrapolate the HiDISC VLSI layout based on the MIPS R10000 processor (0.35 µm, 1996)
- The space overhead is 11.3% over a comparable MIPS processor
17. VLSI Layout Overhead
18. Compiler Progress
- Preprocessor
  - Gnu preprocessor
  - Support for library calls
  - Support for special compiler directives (#include, #define)
- Nested loops
  - Nested loops without data dependencies (for and while)
  - Support for conditional statements where the index variable of an inner loop is not a function of some outer loop computation
- Conditional statements
  - CP performs all computations
  - Need to move the condition to the AP
19. Nested Loops (Assembly)

High-level code (C):                  Assembly code:
/* 6 */ for (i = 0; i < 10; i++)      sw   $0, 12(sp)
                                  $32:
/* 7 */ for (k = 0; k < 10; k++)      sw   $0, 4(sp)
                                  $33:
/* 8 */ j++;                          lw   $15, 8(sp)
                                      addu $24, $15, 1
                                      sw   $24, 8(sp)
                                      lw   $25, 4(sp)
                                      addu $8, $25, 1
                                      sw   $8, 4(sp)
                                      blt  $8, 10, $33
                                      lw   $9, 12(sp)
                                      addu $10, $9, 1
                                      sw   $10, 12(sp)
                                      blt  $10, 10, $32
20. Nested Loops (Three Streams)

CP stream:
/* 6 */ for (i = 0; i < 10; i++)   $32: b_eod loop_i
/* 7 */ for (k = 0; k < 10; k++)   $33: b_eod loop_k
/* 8 */ j++;                            addu $15, LQ, 1
                                        sw   $15, SDQ
                                        b    $33   /* loop_k */
                                        b    $32   /* loop_i */

AP stream:
/* 6 */ for (i = 0; i < 10; i++)   sw   $0, 12(sp)
                               $32:
/* 7 */ for (k = 0; k < 10; k++)   sw   $0, 4(sp)
                               $33:
/* 8 */ j++;                       lw   LQ, 8(sp)
                                   get  SCQ
                                   sw   8(sp), SAQ
                                   lw   $25, 4(sp)
                                   addu $8, $25, 1
                                   sw   $8, 4(sp)
                                   blt  $8, 10, $33
                                   s_eod
                                   lw   $9, 12(sp)
                                   addu $10, $9, 1
                                   sw   $10, 12(sp)
                                   blt  $10, 10, $32

CMP stream:
/* 6 */ for (i = 0; i < 10; i++)   sw   $0, 12(sp)
                               $32:
/* 7 */ for (k = 0; k < 10; k++)   sw   $0, 4(sp)
                               $33:
/* 8 */ j++;                       pref 8(sp)
                                   put  SCQ
                                   lw   $25, 4(sp)
                                   addu $8, $25, 1
                                   sw   $8, 4(sp)
                                   blt  $8, 10, $33
                                   lw   $9, 12(sp)
                                   addu $10, $9, 1
                                   sw   $10, 12(sp)
                                   blt  $10, 10, $32
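For orientation, the high-level source that slides 19-20 compile and split is just two counted loops incrementing j (lines 6-8 in the slides). A plain C rendering, with the wrapper function name ours:

```c
/* The nested loop of slides 19-20 as ordinary C: two counted loops whose
 * body increments j. The AP stream above handles the loads/stores of j,
 * the CMP stream prefetches 8(sp), and the CP stream only does the addu. */
int nested_loop(void)
{
    int j = 0;
    for (int i = 0; i < 10; i++)        /* line 6 */
        for (int k = 0; k < 10; k++)    /* line 7 */
            j++;                        /* line 8 */
    return j;
}
```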
21. HiDISC Compiler
- Gcc backend
  - Use input from the parsing phase for performing loop optimizations
  - Extend the compiler to produce MIPS4
- Handle loops with dependencies
  - Nested loops where the computation depends on the indices of more than one loop
  - e.g. X(i,j) = i*Y(j,i), where i and j are index variables and j is a function of i
22. HiDISC Stream Separator

Previous work:
- Sequential source -> program flow graph
- Classify address registers
- Allocate instructions to streams -> access stream and computation stream

Current work (per stream):
- Fix conditional statements
- Move queue accesses into instructions
- Move loop invariants out of the loop
- Add slip control queue instructions
- Substitute prefetches for loads, remove global stores, and reverse SCQ direction
- Add global data communication and synchronization
- Produce assembly code: computation, access, and cache management assembly code
23. Simulator Integration
- Based on the MIPS RISC pipeline architecture (dsim)
  - Supports the MIPS1 and MIPS2 ISAs
  - Supports dynamic linking of shared libraries
    - Loads shared libraries in the simulator
- Hand compiling
  - Use SGI-cc or gcc as front-end
  - Produce 3 streams of code
  - Use SGI-cc as compiler back-end

Tool flow:
- .c --(cc -mips2 -S)--> .s
- Modify the three .s files into HiDISC assembly (.hs)
- Convert .hs back to .s for the three streams (hs2s): .cp.s, .ap.s, .cmp.s
- cc -mips2 -o, linked with the shared library: .cp, .ap, .cmp
- Run on dsim
24. DIS Benchmarks
- Atlantic Aerospace DIS benchmark suite
  - Application-oriented benchmarks
  - Many defense applications employ large data sets, non-contiguous memory access, and no temporal locality
  - Too large for hand-compiling; wait until the compiler is ready
  - Requires a linker that can handle multiple object files
- Atlantic Aerospace Stressmark suite
  - Smaller and more specific procedures
  - Seven individual data-intensive benchmarks
  - Directly illustrate particular elements of the DIS problem
25. Stressmark Suite
DIS Stressmark Suite, Version 1.0, Atlantic Aerospace Division
26. Example of Stressmarks
- Pointer Stressmark
  - Basic idea: repeatedly follow pointers to randomized locations in memory
  - Memory access pattern is unpredictable
    - Randomized memory access pattern
    - Insufficient temporal and spatial locality for conventional cache architectures
  - The HiDISC architecture provides lower memory access latency
27. Decoupling of Pointer Stressmarks

Inner loop (for the next indexing):
for (i = j+1; i < w; i++) {
    if (field[index+i] > partition) balance++;
}
if (balance + high == w/2) break;
else if (balance + high > w/2) min = partition;
else { max = partition; high++; }

Computation Processor code:
while (not EOD)
    if (field > partition) balance++;
if (balance + high == w/2) break;
else if (balance + high > w/2) min = partition;
else { max = partition; high++; }

Access Processor code:
for (i = j+1; i < w; i++) {
    load(field[index+i]);
    GET_SCQ;
}
send(EOD token);

Cache Management code:
for (i = j+1; i < w; i++) {
    prefetch(field[index+i]);
    PUT_SCQ;
}
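The counting step that the AP and CMP streams feed can be sketched as a sequential C helper. The function name and the test data are ours; the loop structure and variable roles follow the slide.

```c
/* Sequential sketch of the pointer stressmark inner loop: count how many
 * of the fields at positions j+1 .. w-1 (offset by index) exceed the
 * partition value. The decoupled version streams field[index+i] through
 * the load queue instead of reading memory directly. */
int count_above(const int *field, int index, int j, int w, int partition)
{
    int balance = 0;
    for (int i = j + 1; i < w; i++)
        if (field[index + i] > partition)
            balance++;
    return balance;
}
```

Because index is data-dependent, the access pattern is unpredictable, which is exactly why the stressmark defeats conventional caches but still lets the AP run ahead of the CP.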
28. Stressmarks
- Hand-compile the 7 individual benchmarks
  - Use gcc as front-end
  - Manually partition each of the three instruction streams and insert synchronizing instructions
- Evaluate architectural trade-offs
  - Updated simulator characteristics such as out-of-order issue
  - Large L2 cache and enhanced main memory systems such as Rambus and DDR
29. Architectural Enhancements for Data Intensive Applications
- Enhanced memory systems such as RAMBUS DRAM and DDR DRAM
  - Provide high memory bandwidth
  - Latency does not improve significantly
- The decoupled access processor can fully utilize the enhanced memory bandwidth
  - More requests issued by the access processor
  - Prefetching mechanism hides memory access latency
30. Flexi-DISC
- Fundamental characteristics
  - Inherently highly dynamic at execution time
- Dynamically reconfigurable central computation kernel (CK)
- Multiple levels of caching and processing around the CK
  - Adjustable prefetching
- Multiple processors on a chip, providing flexible adaptation from multiple to single processors and horizontal sharing of the existing resources
31. Flexi-DISC
- Partitioning of the computation kernel
  - It can be allocated to different portions of the application or to different applications
- The CK requires separation of the next ring to feed it with data
  - The variety of target applications makes the memory accesses unpredictable
- Identical processing units for the outer rings
  - Highly efficient dynamic partitioning of the resources and their run-time allocation can be achieved
32. Summary
- Gcc backend
  - Use the parse tree to extract the load instructions
  - Handle loops with dependencies where the index variable of an inner loop is not a function of some outer loop computation
- A robust compiler is being designed to experiment with and analyze additional benchmarks
  - Eventually extend it to the DIS benchmarks
- Additional architectural enhancements have been introduced to make HiDISC amenable to the DIS benchmarks
33. HiDISC: Hierarchical Decoupled Instruction Set Computer

New ideas:
- A dedicated processor for each level of the memory hierarchy
- Explicitly manage each level of the memory hierarchy using instructions generated by the compiler
- Hide memory latency by converting data access predictability to data access locality
- Exploit instruction-level parallelism without extensive scheduling hardware
- Zero-overhead prefetches for maximal computation throughput

Impact:
- 2x speedup for scientific benchmarks with large data sets over an in-order superscalar processor
- 7.4x speedup for matrix multiply over an in-order issue superscalar processor
- 2.6x speedup for matrix decomposition/substitution over an in-order issue superscalar processor
- Reduced memory latency for systems that have high memory bandwidths (e.g. PIMs, RAMBUS)
- Allows the compiler to solve indexing functions for irregular applications
- Reduced system cost for high-throughput scientific codes

Schedule (start: April 2001):
- Defined benchmarks
- Completed simulator
- Performed instruction-level simulations on hand-compiled benchmarks
- Continue simulations of more benchmarks (SAR)
- Define HiDISC architecture
- Benchmark results
- Update simulator
- Develop and test a full decoupling compiler
- Generate performance statistics and evaluate design