1
HiDISC: A Decoupled Architecture for Applications
in Data Intensive Computing
  • PIs: Alvin M. Despain and Jean-Luc Gaudiot
  • University of Southern California
  • http://www-pdpc.usc.edu
  • October 12th, 2000

2
HiDISC: Hierarchical Decoupled Instruction Set Computer

New Ideas
  • A dedicated processor for each level of the memory hierarchy
  • Explicitly manage each level of the memory hierarchy using instructions generated by the compiler
  • Hide memory latency by converting data access predictability to data access locality
  • Exploit instruction-level parallelism without extensive scheduling hardware
  • Zero-overhead prefetches for maximal computation throughput

Impact
  • 2x speedup for scientific benchmarks with large data sets over an in-order superscalar processor
  • 7.4x speedup for matrix multiply over an in-order issue superscalar processor
  • 2.6x speedup for matrix decomposition/substitution over an in-order issue superscalar processor
  • Reduced memory latency for systems that have high memory bandwidths (e.g., PIMs, RAMBUS)
  • Allows the compiler to solve indexing functions for irregular applications
  • Reduced system cost for high-throughput scientific codes

Schedule
  • Defined benchmarks
  • Completed simulator
  • Performed instruction-level simulations on hand-compiled benchmarks
  • Continue simulations of more benchmarks (SAR)
  • Define HiDISC architecture
  • Benchmark results
  • Develop and test a full decoupling compiler
  • Update simulator
  • Generate performance statistics and evaluate design

[Timeline markers from the original slide: April 98 (start), April 99, April 00]
3
HiDISC: Hierarchical Decoupled Instruction Set Computer

Technological Trend: Memory latency is getting longer relative to microprocessor speed (40% per year)
Problem: Some SPEC benchmarks spend more than half of their time stalling [Lebeck and Wood 1994]
Domain: benchmarks with large data sets - symbolic, signal processing, and scientific programs
Present Solutions: Multithreading (homogeneous), larger caches, prefetching, software multithreading
4
Present Solutions

Solution               Limitations
Larger Caches          - Slow
                       - Works well only if the working set fits cache and there is temporal locality
Hardware Prefetching   - Cannot be tailored for each application
                       - Behavior based on past and present execution-time behavior
Software Prefetching   - Ensure overheads of prefetching do not outweigh the benefits -> conservative prefetching
                       - Adaptive software prefetching is required to change prefetch distance during run-time (see the sketch below)
                       - Hard to insert prefetches for irregular access patterns
Multithreading         - Solves the throughput problem, not the memory latency problem
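
As an illustration of the prefetch-distance problem above, here is a minimal sketch - ours, not from the presentation - of static software prefetching in C using GCC's __builtin_prefetch; the fixed PREFETCH_DIST constant is a hypothetical tuning parameter that conservative compile-time insertion cannot adapt at run time:

    #include <stddef.h>

    #define PREFETCH_DIST 16  /* hypothetical fixed prefetch distance, in elements */

    /* Sum an array while prefetching a fixed distance ahead.
     * Too small a distance means prefetches arrive late; too large a
     * distance means prefetched lines may be evicted before use. */
    double sum_with_prefetch(const double *a, size_t n)
    {
        double s = 0.0;
        for (size_t i = 0; i < n; i++) {
            if (i + PREFETCH_DIST < n)
                __builtin_prefetch(&a[i + PREFETCH_DIST], 0, 1); /* read, low temporal locality */
            s += a[i];
        }
        return s;
    }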

5
The HiDISC Approach
  • Observation:
  • Software prefetching impacts compute performance
  • PIMs and RAMBUS offer a high-bandwidth memory system - useful for speculative prefetching
  • Approach:
  • Add a processor to manage prefetching -> hide overhead
  • Compiler explicitly manages the memory hierarchy
  • Prefetch distance adapts to the program runtime behavior

6
What is HiDISC?
  • A dedicated processor for each level of the memory hierarchy
  • Explicitly manage each level of the memory hierarchy using instructions generated by the compiler
  • Hide memory latency by converting data access predictability to data access locality (Just-in-Time Fetch)
  • Exploit instruction-level parallelism without extensive scheduling hardware
  • Zero-overhead prefetches for maximal computation throughput

[Diagram: the Compiler takes the Program and emits three streams - Computation Instructions to the Computation Processor (CP), Access Instructions to the Access Processor (AP), which shares Registers with the CP, and Cache Mgmt. Instructions to the Cache Mgmt. Processor (CMP).]
7
Decoupled Architectures

[Diagram: four processor organizations compared side by side - MIPS (conventional), DEAP (decoupled), CAPP (decoupled), and HiDISC (new decoupled). Each pairs a Computation Processor (CP) and registers with the memory hierarchy (cache, 2nd-level cache and main memory); the decoupled designs add a 5-issue or 3-issue Access Processor (AP), and two of them add a Cache Mgmt. Processor (CMP). The CP issue widths shown are 5, 2, 3, and 8, with a total issue width of 8 per design (e.g., HiDISC: 2-issue CP + 3-issue AP + 3-issue CMP).]

DEAP [Kurian, Hulina, Coraor 94], PIPE [Goodman 85]; other decoupled processors: ACRI, ZS-1, WM
8
Slip Control Queue
  • The Slip Control Queue (SCQ) adapts dynamically
  • Late prefetches: prefetched data arrived after the load had been issued
  • Useful prefetches: prefetched data arrived before the load had been issued

    if (prefetch_buffer_full())
        don't change size of SCQ;
    else if ((2 * late_prefetches) > useful_prefetches)
        increase size of SCQ;
    else
        decrease size of SCQ;
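
A minimal runnable C sketch of this adaptation rule (our illustration, not the authors' implementation; the struct fields and the counter-reset policy are assumptions):

    #include <stdbool.h>

    /* Assumed bookkeeping for the Slip Control Queue (SCQ). */
    struct scq {
        int size;              /* current slip (prefetch) distance */
        int late_prefetches;   /* data arrived after its load issued */
        int useful_prefetches; /* data arrived before its load issued */
    };

    /* Apply the adaptation rule above once per sampling interval. */
    void scq_adapt(struct scq *q, bool prefetch_buffer_full)
    {
        if (prefetch_buffer_full) {
            /* buffer full: leave the SCQ size unchanged */
        } else if (2 * q->late_prefetches > q->useful_prefetches) {
            q->size++;   /* many late prefetches: slip further ahead */
        } else {
            q->size--;   /* prefetches mostly timely: reduce slip */
        }
        q->late_prefetches = q->useful_prefetches = 0;
    }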
9
Decoupling Programs for HiDISC (Discrete Convolution - Inner Loop)

Inner Loop Convolution:

    for (j = 0; j < i; j++)
        y[i] = y[i] + (x[j] * h[i-j-1]);

Computation Processor Code:

    while (not EOD)
        y = y + (x * h);
    send y to SDQ;

Access Processor Code:

    for (j = 0; j < i; j++) {
        load (x[j]);
        load (h[i-j-1]);
        GET_SCQ;
    }
    send (EOD token);
    send address of y[i] to SAQ;

Cache Management Code:

    for (j = 0; j < i; j++) {
        prefetch (x[j]);
        prefetch (h[i-j-1]);
        PUT_SCQ;
    }

SAQ: Store Address Queue, SDQ: Store Data Queue, SCQ: Slip Control Queue, EOD: End of Data
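
For reference, a plain sequential C version of this computation - our reconstruction under assumed types and array lengths - which the decoupling compiler splits into the three streams shown above:

    /* Discrete convolution: y[i] = sum over j < i of x[j] * h[i-j-1].
     * HiDISC's compiler turns this single loop into separate
     * computation, access, and cache-management streams. */
    void convolve(const double *x, const double *h, double *y, int n)
    {
        for (int i = 0; i < n; i++) {
            y[i] = 0.0;
            for (int j = 0; j < i; j++)
                y[i] += x[j] * h[i - j - 1];
        }
    }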
10
Benchmarks

Benchmark   Source                  Lines of Code   Description                                  Data Set Size
LLL1        Livermore Loops [45]    20              1024-element arrays, 100 iterations          24 KB
LLL2        Livermore Loops         24              1024-element arrays, 100 iterations          16 KB
LLL3        Livermore Loops         18              1024-element arrays, 100 iterations          16 KB
LLL4        Livermore Loops         25              1024-element arrays, 100 iterations          16 KB
LLL5        Livermore Loops         17              1024-element arrays, 100 iterations          24 KB
Tomcatv     SPECfp95 [68]           190             33x33-element matrices, 5 iterations         <64 KB
MXM         NAS kernels [5]         113             Unrolled matrix multiply, 2 iterations       448 KB
CHOLSKY     NAS kernels             156             Cholesky matrix decomposition                724 KB
VPENTA      NAS kernels             199             Invert three pentadiagonals simultaneously   128 KB
Qsort       Quicksort [14]          58              Quicksort sorting algorithm                  128 KB

11
Simulation
Parameter                 Value                     Parameter                  Value
L1 cache size             4 KB                      L2 cache size              16 KB
L1 cache associativity    2                         L2 cache associativity     2
L1 cache block size       32 B                      L2 cache block size        32 B
Memory latency            Variable (0-200 cycles)   Memory contention time     Variable
Victim cache size         32 entries                Prefetch buffer size       8 entries
Load queue size           128                       Store address queue size   128
Store data queue size     128                       Total issue width          8
12
Simulation Results

[Figure: simulated speedups for the benchmarks above; headline results are summarized under Accomplishments on the next slide.]
13
Accomplishments
  • 2x speedup for scientific benchmarks with large data sets over an in-order superscalar processor
  • 7.4x speedup for matrix multiply (MXM) over an in-order issue superscalar processor (similar operations are used in ATR/SLD)
  • 2.6x speedup for matrix decomposition/substitution (CHOLSKY) over an in-order issue superscalar processor
  • Reduced memory latency for systems that have high memory bandwidths (e.g., PIMs, RAMBUS)
  • Allows the compiler to solve indexing functions for irregular applications
  • Reduced system cost for high-throughput scientific codes

14
Work in Progress
  • Compiler design
  • Data Intensive Systems (DIS) benchmarks analysis
  • Simulator update
  • Parameterization of silicon space for VLSI
    implementation

15
Compiler Requirements
  • Source-language flexibility
  • Sequential assembly code for streaming
  • Optimality of sequential code
  • Ease of implementation
  • Portability and upgradability

16
Gcc-2.95 Features
  • Localized register spilling; global common subexpression elimination using lazy code motion algorithms
  • Enhanced control flow graph analysis: the new framework simplifies control dependence analysis, which is used by aggressive dead code elimination algorithms
  • Provision to add modules for instruction scheduling and delayed branch execution
  • Front-ends for C, C++ and Fortran available
  • Support for different environments and platforms: cross compilation

17
Compiler Organization
[Diagram - HiDISC Compilation Overview: Source Program -> GCC -> Assembly Code -> Stream Separator -> Computation Assembly Code, Access Assembly Code, and Cache Management Assembly Code, each then assembled into object code for its processor.]
18
HiDISC Stream Separator

[Figure: internals of the stream separator, which partitions the sequential assembly code into the computation, access, and cache management streams.]
19
Compiler Front End Optimizations
  • Jump Optimization: simplify jumps to the following instruction, jumps across jumps, and jumps to jumps (see the sketch below)
  • Jump Threading: detect a conditional jump that branches to an identical or inverse test
  • Delayed Branch Execution: find instructions that can go into the delay slots of other instructions
  • Constant Propagation: propagate constants into a conditional loop
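
A tiny before/after illustration of the jump-to-jump case (ours, not from the slides): the optimizer retargets a branch whose destination is itself an unconditional jump:

    /* Before: the conditional branch lands on L1, which only jumps again. */
    int sign_before(int x)
    {
        if (x < 0) goto L1;   /* branch to a jump */
        return 1;
    L1: goto L2;              /* jump to jump */
    L2: return -1;
    }

    /* After jump optimization: the branch targets L2 directly. */
    int sign_after(int x)
    {
        if (x < 0) goto L2;
        return 1;
    L2: return -1;
    }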

20
Compiler Front End Optimizations (contd.)
  • Instruction Combination: combine groups of two or three instructions that are related by data flow into a single instruction
  • Instruction Scheduling: look for instructions whose output will not be available in time for use by subsequent instructions
  • Loop Optimizations: move constant expressions out of loops and perform strength reduction (example below)
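
To make the loop optimizations concrete (our example, not from the slides), strength reduction replaces the per-iteration multiply with an addition, and loop-invariant code motion hoists the constant expression:

    /* Before: multiply and a loop-invariant expression inside the loop. */
    void fill_before(int *a, int n, int s)
    {
        for (int i = 0; i < n; i++)
            a[i] = i * s + (n - 1);   /* i*s recomputed; n-1 is invariant */
    }

    /* After: equivalent code the optimizer would generate. */
    void fill_after(int *a, int n, int s)
    {
        int step = 0;       /* running value of i*s (strength reduction) */
        int inv = n - 1;    /* hoisted loop-invariant expression */
        for (int i = 0; i < n; i++) {
            a[i] = step + inv;
            step += s;      /* addition replaces multiplication */
        }
    }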

21
Example of Stressmarks
  • Pointer Stressmark
  • Basic idea: repeatedly follow pointers to randomized locations in memory (see the pointer-chasing sketch below)
  • Memory access pattern is unpredictable and randomized
  • Insufficient temporal and spatial locality for conventional cache architectures
  • HiDISC architecture provides lower memory access latency
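
A minimal C sketch of the pointer-chasing access pattern (our illustration of the idea, not the official DIS Pointer Stressmark code):

    #include <stddef.h>

    /* Follow a chain of indices through memory in randomized order.
     * Each load depends on the previous one, so caches and simple
     * prefetchers get little temporal or spatial locality to exploit. */
    size_t chase(const size_t *next, size_t start, long hops)
    {
        size_t i = start;
        while (hops-- > 0)
            i = next[i];   /* serialized, cache-hostile dependent load */
        return i;
    }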

22
Decoupling of Pointer Stressmarks

Inner loop (original):

    for (i = j+1; i < w; i++)
        if (field[index+i] > partition) balance++;
    if (balance + high == w/2) break;
    else if (balance + high > w/2) min = partition;
    else { max = partition; high++; }

Computation Processor Code:

    while (not EOD)
        if (field > partition) balance++;
    if (balance + high == w/2) break;
    else if (balance + high > w/2) min = partition;
    else { max = partition; high++; }

Access Processor Code:

    for (i = j+1; i < w; i++) {
        load (field[index+i]);
        GET_SCQ;
    }
    send (EOD token);

Cache Management Code:

    for (i = j+1; i < w; i++) {
        prefetch (field[index+i]);
        PUT_SCQ;
    }
    (inner loop for the next indexing)
23
Stressmarks
  • Hand-compile the 7 individual benchmarks
  • Use gcc as a front-end
  • Manually partition each benchmark into the three instruction streams and insert synchronizing instructions
  • Evaluate architectural trade-offs
  • Updated simulator characteristics, such as out-of-order issue
  • Large L2 cache and enhanced main memory systems such as Rambus and DDR

24
Simulator Update
  • Survey current processor architectures
  • Focus on commercial leading-edge technology for implementation
  • Analyze the current simulator and previous benchmark results
  • Enhance memory hierarchy configurations
  • Add out-of-order issue

25
Memory Hierarchy
  • Modern processors have increasingly large on-chip L2 caches
  • e.g., 256 KB L2 cache on Pentium and Athlon processors, which reduces the L1 cache miss penalty
  • Likewise, new main memory architectures (e.g., RAMBUS) reduce the L2 cache miss penalty

26
Out-of-Order Multiple Issue
  • Most current advanced processors are based on the superscalar, multiple-issue paradigm
  • MIPS R10000, PowerPC, UltraSPARC, Alpha and Pentium families
  • Compare the HiDISC architecture with modern superscalar processors
  • Out-of-order instruction issue
  • For precise exception handling, include in-order completion
  • New access decoupling paradigm for out-of-order issue

27
HiDISC with Modern DRAM Architecture
  • RAMBUS and DDR DRAM improve memory bandwidth
  • Latency does not improve significantly
  • The decoupled access processor can fully utilize the enhanced memory bandwidth
  • More requests are generated by the access processor
  • The prefetching mechanism hides memory access latency

28
HiDISC / SMT
  • The reduced memory latency of HiDISC can
  • decrease the number of threads needed by an SMT architecture
  • relieve the memory burden of an SMT architecture
  • lessen the complex issue logic of multithreading
  • Functional unit utilization can increase with multithreading features on HiDISC
  • More instruction-level parallelism is possible

29
The McDISC System: Memory-Centered Distributed Instruction Set Computer
30
Summary
  • Designing a compiler
  • Porting gcc to HiDISC
  • Benchmark simulation with new parameters and the updated simulator
  • Analysis of architectural trade-offs for equal silicon area
  • Hand-compilation of the Stressmark suite and simulation
  • DIS benchmarks simulation