Modeling and Mapping Techniques for Dynamically Reconfigurable Hybrid Architectures


Slides: 74
Provided by: kiranbon
Transcript and Presenter's Notes



1
Modeling and Mapping Techniques for Dynamically
Reconfigurable Hybrid Architectures
  • Kiran Bondalapati
  • Computer Engineering

2
Outline
  • Introduction
  • Background
  • Thesis Contributions
  • HySAM Model
  • Mapping Techniques
  • DRIVE Simulation
  • Conclusions

3
Computing Landscape
  • Microprocessor: general purpose, good average
    performance
  • ASIC: special purpose, excellent application
    specific performance
  • Cost/performance gap between the two
[Figure: design space with axes Increasing Performance and Increasing Flexibility; labeled example: IBM JPEG]

Introduction
4
Configurable Computing Concept
  • Computation and Communication adapted on-the-fly

Introduction
5
Configurable Computing Characteristics
  • Variable Hardware and Software
  • Spatial Computation
  • Distributed Resources
  • Distributed Control

Introduction
6
Configurable Computing: Variable Hardware
[Figure: Evolution]
Introduction
7
Configurable Computing: Spatial Computing
[Figure: temporal vs. spatial computation; labels: ALU, computations, Chip, Time]
Introduction
8
Configurable Computing: Distributed Resources
  • Active Logic: silicon resources which actually
    perform the computations
[Figure: Configurable Processor vs. Microprocessor]
Introduction
9
Configurable Computing: Distributed Control
[Figure: Microprocessor vs. Configurable Processor; labels: Instruction Broadcast, Decode, Localized Control]
Introduction
10
Early Hybrid Chip: Xilinx XC 6200 FPGA
  • SRAM based FPGA architecture
  • 64 x 64 array of cells
  • 2-input, 2-output logic function in each cell
  • Reconfiguration: less than 1 ms for the entire
    device
  • Partial reconfiguration: 40 ns per cell
  • FastMap processor interface
  • FPGA connected to system bus
  • Memory mapped device
  • Normal load/store instructions for
    reconfiguration
  • Dynamic reconfiguration

Background
11
Billion Transistor Chips
[Figure: Current Systems vs. Emerging Systems; blocks: CPU, Memory, Conf. Logic]
Background
12
BRASS Garp
  • Reconfigurable array unit with a RISC processor
  • Gate array of 32x24 logic blocks
  • Partial configuration of gate array in row
    increments
  • Configuration cache for fast reconfiguration
  • 4 cycles on-chip and 12 cycles off-chip
    reconfiguration time

[Figure: Garp block diagram; blocks: MIPS, Instruction cache, Data cache, Configurable Array, Memory]
Background
13
Chameleon RCP
ARC 32-bit RISC
DMA Controller
128-bit RoadRunner Bus
Background
Reconfigurable Processing Fabric
  • 84 DPUs, 24 Multipliers
  • 24 Billion OPS
  • 3 Billion MACS
  • Single cycle Reconfiguration
  • 2 Gbyte/sec I/O

14
Xilinx Platform FPGA
Distributed Multipliers
PowerPC 32-bit RISC
Background
  • 10 Million System Gates
  • 200 MHz System Clock
  • 300 MHz PowerPC Core
  • 600 Billion MACS
  • 6 Gbyte/sec RISC-Logic b/w

Virtex-II CLB Array
15
Configurable Computing Challenges
  • High-level System Models
  • Current abstraction is register-level
  • Formal Methodologies
  • Performance analysis and optimization
  • Integrated Mapping Techniques
  • Exploit all resources in a unified approach
  • Design and Simulation Tools
  • System-level simulation and analysis

Background
16
Our Model Based Approach
[Figure: layers from top to bottom: Application Developer; optimized mapping algorithms for generic problems and applications; Models (Computational Model, Compilation Model); Devices and Systems]
Background
17
Related Work
  • Mapping Loops
  • Weinhardt, Pande, Luk, etc.
  • No high-level model
  • Limited applicability
  • Compiler Projects
  • National NAPA, Synopsys Nimble, Berkeley Garp,
    USC/ISI DEFACTO, Chameleon c2b
  • Mainly complementary software efforts
  • Reconfiguration costs not addressed
  • Interactions with DEFACTO and Chameleon

Background
18
Thesis Contributions
Mapping Techniques
HySAM Model
DRIVE Simulation
19
Parameterized HySAM Model
Application Developer
Algorithmic techniques for mapping generic
loops onto hybrid architectures
HySAM
Hybrid Reconfigurable Architectures
20
Summary of Mapping Techniques
  • Mapping Linear Loops
  • Polynomial complexity algorithms
  • Mapping onto multi-context architectures
  • Dynamic precision management
  • Reconfigurable Pipelines
  • Mapping onto configurable pipelines
  • Heuristic pipeline segmentation techniques
  • Integrated Mapping Techniques
  • Parallelizing feedback loops
  • Data Context Switching

21
DRIVE Interpretive Simulation Framework
System Abstraction
Performance Characterization
Analysis and Transformation
System Models
Task Models
Mapping Algorithms
Interpretive Simulation
Performance Analysis Design Exploration
22
Thesis Contributions
Mapping Techniques
HySAM Model
HySAM
DRIVE Simulation
23
Hybrid System Architecture Model
Main Processor
Memory
Interconnection Network
HySAM
Conf. Cache
Conf. Logic Unit (CLU)
Parameterized Model
24
Functions and Configurations
  • F - Functions
  • Computational units (e.g. Add, Multiply, Select)
  • Library Modules
  • C - Configurations
  • Area, Configuration time, Execution time,
    Precision, Power dissipation, I/O requirement
  • A Function can be executed by various
    Configurations
  • Aij - Attributes for function Fi in
    configuration Cj
  • Rij - Reconfiguration cost from Ci to Cj
  • Depends on both Ci and Cj
  • Can include partial reconfiguration
  • Reconfiguration cost matrix

HySAM
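The bookkeeping on this slide can be sketched as a small data structure. This is only an illustration under stated assumptions: the class name `HySAM`, its method names, and the sample attribute values below are hypothetical and do not come from the presentation; only the notions of functions, configurations, per-pair attributes (Aij), and a reconfiguration cost matrix (Rij) are taken from the slide.

```python
from dataclasses import dataclass, field


@dataclass
class HySAM:
    # attributes[(function, configuration)] = attribute values (A_ij)
    attributes: dict = field(default_factory=dict)
    # reconf[(src, dst)] = reconfiguration cost R_ij from src to dst
    reconf: dict = field(default_factory=dict)

    def add(self, function, configuration, **attrs):
        """Record that `function` can execute in `configuration`."""
        self.attributes[(function, configuration)] = attrs

    def configurations_for(self, function):
        """All configurations that can execute `function`."""
        return sorted(c for f, c in self.attributes if f == function)

    def reconfiguration_cost(self, src, dst):
        """Cost depends on both endpoints; staying put is assumed free."""
        return 0 if src == dst else self.reconf[(src, dst)]


# Hypothetical library: one function available in two configurations.
model = HySAM()
model.add("Multiply", "C_fast", latency=3, area=40)
model.add("Multiply", "C_slow", latency=5, area=20)
model.add("Add", "C_add", latency=1, area=5)
model.reconf[("C_fast", "C_add")] = 12
model.reconf[("C_slow", "C_add")] = 7
```

Keeping the attributes keyed by (function, configuration) pairs mirrors the slide's statement that one function can be executed by various configurations.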
25
Attributes
Execution of Fi in Cj
  • tij - Latency (execution time)
  • ?ij - Throughput
  • pij - Precision
  • Others
  • Data access
  • Power dissipation

HySAM
26
Scheduled Reconfiguration
Computation (e.g. Program)
Hybrid System Architecture Model
HySAM
Problem Instance (e.g. Input Data)
Configurations and Schedule
Intermediate Results
Computation and Reconfiguration
Result
27
Tasks and Configurations
[Figure: p input application tasks T1, T2, T3, ..., Tp are mapped onto m configurations; example labels: functions F4 and F6, configurations C3 and C5, reconfiguration cost R53]
HySAM
28
Example: Garp Architectural Parameters

Function  Operation              Configuration  Conf. Time (R0j)  Exec. Time (tij)
F1        Multiplication (Fast)  C1             14.4 us           37.5 ns
F1        Multiplication (Slow)  C2             6.4 us            52.5 ns
F2        Addition               C3             1.6 us            7.5 ns
F3        Subtraction            C4             1.6 us            7.5 ns
F4        Shift                  C5             3.2 us            7.5 ns

HySAM
29
HySAM Abstraction
  • Resource Model - hardware
  • Task Model - applications
  • Execution Model - run-time
  • Attribute Model - library
  • Generative Model - design

HySAM
30
Thesis Contributions
Mapping Techniques
HySAM Model
Mapping
DRIVE Simulation
31
Why Loops?
  • Dense and Regular computations
  • Occur in most applications
  • More than 90% of execution time is spent in loops
  • Extensive research in loop analysis
  • Task identification
  • Configuration generation

Mapping
32
Loops: Definitions
  • Loop Iteration
  • Loop Index
  • Dependency
  • Data
  • Control
  • Loop carried dependency

FOR I1 TO 100 AI 2 BI CI AI -
3 DI DI CI
Mapping
33
Sub-space of Problems
[Figure: problems classified by Loop Characteristics (Linear, Feedback) and Resources (Unlimited, Limited); techniques shown: Linear Loop Mapping, Parallel Pipelines, Heuristic Pipeline Segmentation, Data Context Switching]
Mapping
34
Example Mapping Contributions
  • Problems and Highlighted Characteristics
  • LMP: Theoretical complexity
  • DPMA: Novel reconfiguration ideas
  • DCS: Application area focus

Mapping
35
General Mapping Problem
  • Execute a given sequence of tasks for N
    iterations
  • Minimize total execution time
  • E = Computation + Reconfiguration
  • Find a sequence of configurations which minimizes
    E

LOOP 1..N
  T1
  T2
  ...
  Tp
END
NP-Complete!
Mapping - LMP
36
LMP: Linear Mapping Problem
  • Given
  • A set of tasks <T1, T2, ..., Tp> to be executed
    sequentially N times
  • Set of configurations C
  • Reconfiguration cost matrix R
  • Find
  • A sequence <C1, C2, ..., Cq> of configurations
    which minimizes the total execution time E

LOOP 1..N
  T1
  T2
  ...
  Tp
END
Mapping - LMP
37
Optimal Solution
  • Lemma 1
  • An optimal sequence of configurations for
    executing one iteration of the loop can be
    computed in O(pm^2) time
  • Lemma 2
  • An optimal sequence of configurations can be
    computed by unrolling the loop only m times
  • Theorem
  • An optimal sequence of configurations for N
    iterations of a loop statement with p tasks,
    where each task can be executed in one of m
    possible configurations, can be computed in
    O(pm^3) time.

Mapping - LMP
38
Optimal Solution - One Iteration
  • Exploring all possible execution sequences is
    exponential!
[Figure: tasks T1, ..., Ti, Ti+1, ..., Tp, each with candidate configurations (labels: C2, C3, C4, C6, C7, C9) and reconfiguration costs between consecutive choices]
  • Exploit subsequence optimality
  • Utilize dynamic programming to reduce search
    space

Mapping - LMP
39
Optimal Solution: Multiple Iterations
  • Maximum number of distinct <T, C> pairs is pm
  • Compute dynamic programming solution for
    T1...Tp T1...Tp ... T1...Tp (unrolled m times)
  • Solution repeated N/m times is required sequence
  • Complexity O(pm^3)
  • p: number of tasks
  • m: number of configurations

Mapping - LMP
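A minimal sketch of the dynamic-programming idea from these slides, not the author's implementation. The function name `map_tasks`, the toy task set, and all cost values are hypothetical. For readability the sketch copies the partial path at every step, which costs more than the O(pm^2) bound quoted above; back-pointers would avoid that.

```python
from math import inf


def map_tasks(tasks, exec_time, reconf, start=None):
    """Return (total_time, configuration_per_task) for one pass over `tasks`.

    tasks     -- task names in execution order (T1 .. Tp, possibly unrolled)
    exec_time -- exec_time[task][config] = execution time of task in config
    reconf    -- reconf[(src, dst)] = reconfiguration cost from src to dst
    start     -- configuration loaded before the first task (None = empty)
    """
    # best[c] = (cheapest time to finish the tasks seen so far ending in c, path)
    best = {start: (0, [])}
    for task in tasks:
        nxt = {}
        for cfg, t_exec in exec_time[task].items():
            cost, path = inf, None
            for prev, (prev_cost, prev_path) in best.items():
                switch = 0 if prev == cfg else reconf[(prev, cfg)]
                if prev_cost + switch + t_exec < cost:
                    cost, path = prev_cost + switch + t_exec, prev_path + [cfg]
            nxt[cfg] = (cost, path)
        best = nxt
    return min(best.values(), key=lambda entry: entry[0])


# Hypothetical instance: T1 runs in A (slow) or B (fast), T2 only in C.
exec_time = {"T1": {"A": 10, "B": 4}, "T2": {"C": 3}}
reconf = {
    (None, "A"): 1, (None, "B"): 20,
    ("A", "B"): 20, ("B", "A"): 1,
    ("A", "C"): 2, ("B", "C"): 2,
    ("C", "A"): 1, ("C", "B"): 20,
}
total, sequence = map_tasks(["T1", "T2", "T1"], exec_time, reconf)
```

In this toy instance the slower configuration A wins because B is expensive to load. To follow the unrolling argument on the slide, the same function can be called on the task list repeated m times, for example `map_tasks(["T1", "T2"] * m, exec_time, reconf)`.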
40
Summary of LMP Solution
[Figure: Input: Configuration Library (size m); Our Mapping Algorithm; output: Optimal Configuration Sequence C1, C2, ..., Cq]
  • Minimizes total execution time including
    reconfiguration time
  • Algorithm complexity independent of number of
    loop iterations
  • O(pm^3) compile time algorithm

Mapping - LMP

41
Example: Garp Architectural Parameters

Function  Operation              Configuration  Conf. Time (R0j)  Exec. Time (tij)
F1        Multiplication (Fast)  C1             14.4 us           37.5 ns
F1        Multiplication (Slow)  C2             6.4 us            52.5 ns
F2        Addition               C3             1.6 us            7.5 ns
F3        Subtraction            C4             1.6 us            7.5 ns
F4        Shift                  C5             3.2 us            7.5 ns

Mapping - LMP
42
Example: Mapping FFT onto Garp
  • FFT Butterfly Operation
  • One Complex Multiply, One Complex Add, One
    Complex Subtract
  • 4 Real Multiplies, 3 Real Adds, 3 Real Subtracts
  • FFT Linearized Task Sequence
  • TM - Multiplication, TA - Addition, TS -
    Subtraction
  • TM TM TM TM TA TS TA TA TS TS
  • Optimal Solution: N × 13.055 us (N = number of
    iterations)
  • Important Characteristic of Solution
  • Uses slower execution time Multiplier
    configuration
  • Faster reconfiguration helps in amortizing the
    execution cost over all the iterations

Mapping - LMP
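The per-iteration figure can be checked with a short calculation. The sketch below assumes what the R0j column of the Garp table suggests: the cost of loading a configuration depends only on the target configuration, and one load is paid each time the configuration changes, including reloading the multiplier at the start of every iteration. The names `LOAD_NS`, `EXEC_NS`, and `iteration_time_ns` are illustrative.

```python
from itertools import groupby

# Garp parameters from the table in this deck, converted to nanoseconds.
LOAD_NS = {"C1": 14400.0, "C2": 6400.0, "C3": 1600.0, "C4": 1600.0}
EXEC_NS = {"C1": 37.5, "C2": 52.5, "C3": 7.5, "C4": 7.5}
FFT_TASKS = "TM TM TM TM TA TS TA TA TS TS".split()


def iteration_time_ns(mult_cfg):
    """Time for one FFT butterfly iteration with the given multiplier config."""
    cfg_for = {"TM": mult_cfg, "TA": "C3", "TS": "C4"}
    cfgs = [cfg_for[task] for task in FFT_TASKS]
    # One load per run of consecutive tasks that share a configuration.
    runs = [cfg for cfg, _ in groupby(cfgs)]
    return sum(LOAD_NS[c] for c in runs) + sum(EXEC_NS[c] for c in cfgs)


slow = iteration_time_ns("C2")   # slow multiplier, cheap to load
fast = iteration_time_ns("C1")   # fast multiplier, expensive to load
print(slow / 1000, fast / 1000)  # microseconds per iteration
```

Under these assumptions the slow multiplier gives 13.055 us per iteration, matching the slide, while the fast multiplier gives 20.995 us because its 14.4 us load dominates.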
43
Variable Precision Computation
  • Precision requirement is lower than implemented
  • Match implementation to algorithm requirements
  • Less resources
  • Execution time
  • Logic area
  • Power dissipation
  • Run-time precision management
  • Dynamic modification

Mapping - DPMA
44
Precision Variation in Loops
DO 10 I=1,N
  DO 20 J=1,N
    RSQ(J) = RSQ(J) + XDIFF(I,J)*YDIFF(I,J)
20  IF (MAXQ.LT.RSQ(J)) THEN MAXQ = RSQ(J)
10 VIRTXY = VIRTXY + MAXQ*SCALE(I)
Mapping - DPMA
  • 8-bit inputs XDIFF(I,J) and YDIFF(I,J)
  • MAXQ operand and operation
  • Accumulation
  • Precision changes with iterations of I
  • Does not change every iteration
  • Lower than maximum possible precision (for most
    iterations)

45
Precision Variation Curve
Mapping - DPMA
46
Precision Management Problem
  • Given
  • PVC for a given operation in the loop
  • Find
  • A valid optimal schedule which minimizes total
    execution time
  • Valid schedule
  • Satisfies the precision requirements of the
    computation
  • Total execution time
  • Execution time + Reconfiguration time

Mapping - DPMA
47
Dynamic Precision Management Algorithm
  • DPMA algorithm
  • Dynamic programming based
  • Explores sub-optimal configurations
  • For a few iterations
  • Reduces reconfiguration overhead
  • O(um^2) complexity
  • u: number of PVC points, m: number of
    configurations

Mapping - DPMA
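The trade-off described on this slide can be illustrated with a small dynamic program. This is a sketch under simplifying assumptions rather than the DPMA algorithm itself: the precision variation curve is given as segments of (iterations, required bits), every configuration costs the same to load, and all names and numbers are hypothetical.

```python
def schedule_precision(pvc, configs, reconf_cost):
    """Pick one configuration per PVC segment, minimizing total time.

    pvc         -- list of (iterations, required_bits) segments in loop order
    configs     -- name -> (precision_bits, time_per_iteration)
    reconf_cost -- cost of loading any configuration (assumed uniform)
    Returns (total_time, config_per_segment).
    """
    best = {None: (0, [])}
    for iterations, required_bits in pvc:
        nxt = {}
        for name, (bits, t_iter) in configs.items():
            if bits < required_bits:
                continue  # invalid: precision too low for this segment
            nxt[name] = min(
                ((cost + (0 if prev == name else reconf_cost)
                  + iterations * t_iter, path + [name])
                 for prev, (cost, path) in best.items()),
                key=lambda entry: entry[0],
            )
        if not nxt:
            raise ValueError("no configuration meets the required precision")
        best = nxt
    return min(best.values(), key=lambda entry: entry[0])


# Hypothetical multipliers: wider precision means slower iterations.
configs = {"mul16": (16, 5), "mul24": (24, 8), "mul32": (32, 10)}
pvc = [(20, 12), (2, 20), (100, 28)]
total, plan = schedule_precision(pvc, configs, reconf_cost=50)
```

Here the plan skips the 24-bit multiplier: running the short middle segment on the wider, slower multiplier for two iterations is cheaper than paying for an extra reconfiguration, which is the effect the slide describes as exploring sub-optimal configurations for a few iterations.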
48
Experimental Results
Mapping - DPMA
49
Mapping onto XC 6200
Mapping the multiplier operation in MAXQ * SCALE(I)

Algorithm  Execution Time (ns)  Reconfig. Time (ns)  Total Time (ns)
Raw        655360               20480                675840
Static     532480               17920                550400
Greedy     468010               56320                524330
DPMA       471160               33280                504440

  • Raw - 8x32 precision for all iterations
  • Static - 8x28 precision for all iterations
  • Greedy - schedule using greedy algorithm
  • DPMA - schedule using theoretical PVC

Mapping - DPMA
50
Assumptions
  • Higher precision requires more resources
  • Execution time
  • Logic area
  • Monotonic variation in precision
  • Several image processing and signal processing
    applications
  • Split non-monotonic PVC into monotonic
    subsequences
  • Optimal solution for the given PVC
  • Near optimal if actual precision variation is
    different

Mapping - DPMA
51
DCS: Data Context Switching
  • Introduction to Voice Coding
  • Synthesis Filter
  • Single Channel Design
  • Data Context Switching
  • Multi-Channel Design

Mapping - DCS
52
Multimedia Communication
  • Video: H.261, H.263
  • Audio: G.711, G.723.1, G.728, G.729
  • Control and Management: H.225.0 RAS, H.225.0
    Signaling, H.245 Control
  • Data: T.124
  • RTP, RTCP, X.224 Class 0, T.123
  • UDP, TCP
  • Network (IP)
  • Datalink (IEEE 802.3)

Mapping - DCS
53
Hybrid Vocoder
  • Waveform coding encodes voice signal for Tx
  • Vocoding models speech using parameters
  • Hybrid Coding
  • Extract parameters of speech
  • Regenerate the signal using parameters
  • Compare to original voice signal
  • Refine

Mapping - DCS
54
Voice Compression
  • G.729 is a hybrid vocoder algorithm
  • Conjugate-Structure Algebraic Code Excited
    Linear Prediction

LP Analysis Quantization Interpolation
Input Speech
Mapping - DCS
Parameter Encoding Refinement
Synthesis Filter
Transmitted Bitstream

55
Voice Decompression
Fixed Codebook
Received Bitstream
Gc
Synthesis Filter
Post- Processing

Mapping - DCS
Adaptive Codebook
Gp
56
Synthesis Filter
  • 10th order Infinite Impulse Response (IIR) Filter

Mapping - DCS
57
Generalization: Feedback Loops
Loop Carried Dependence
FOR I = 1 TO N DO
  FOR J = 1 TO N DO
    ...
    VAR[J] = f(VAR[J-1])
Mapping - DCS
Many Signal and Image Processing Kernels and
Cryptographic Engines
58
Mapping: Pipelining Timing Constraints
[Figure: pipelined filter datapath; labels: input x(n), output y(n), fed-back y(n)]
Mapping - DCS
59
Limitations of the Design
  • Pipeline delay of 5-12 cycles/stage
  • Feedback limits throughput
  • Cannot feed a new input every cycle
  • Only one output every 5-12 clock cycles
  • 40 sample frame takes 250-600 cycles!

Mapping - DCS
60
Mapping Technique Goals
  • Maximize channels/sec
  • Improve throughput
  • Maximize Multiplier and DPU utilization
  • Integrated Mapping Technique
  • Multi-dimensional optimization

Mapping - DCS
Parallelism Pipelining
Embedded Memory
Configurability
61
Data Context Switching
  • Computed result has to pipe through buffers
  • No useful computation performed in delay cycles
  • Multiplier and DPU are idle
  • Data Context Switching
  • Perform multi-channel computations
  • Switch Data Context
  • Overlapped multiple data set processing
  • Utilize multipliers every cycle

Mapping - DCS
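A toy cycle-level model of why interleaving channels helps. It is a sketch under assumptions that are not in the presentation: a single pipelined datapath that can accept one sample per cycle, a fixed feedback latency, and round-robin selection among channels. The function name `total_cycles` and the parameter values are illustrative.

```python
def total_cycles(channels, samples, latency):
    """Cycles until every channel has produced `samples` outputs.

    A channel may issue its next sample only once its previous result,
    which it depends on through the feedback path, has left the pipeline.
    """
    ready = [0] * channels        # earliest cycle each channel may issue
    left = [samples] * channels   # samples still to be issued per channel
    cycle, finish, first = 0, 0, 0
    while any(left):
        # Round-robin: issue from the first channel that is ready this cycle.
        for k in range(channels):
            ch = (first + k) % channels
            if left[ch] and ready[ch] <= cycle:
                left[ch] -= 1
                ready[ch] = cycle + latency
                finish = max(finish, cycle + latency)
                first = (ch + 1) % channels
                break
        cycle += 1                # the pipeline idles if nothing could issue
    return finish


single = total_cycles(channels=1, samples=40, latency=5)
interleaved = total_cycles(channels=5, samples=40, latency=5)
print(single, 5 * single, interleaved)
```

With one channel the datapath accepts a sample only every `latency` cycles. With as many channels as pipeline stages it accepts one sample every cycle, so five channels finish in roughly the time one channel needed alone.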
62
Overlapped Multi-Channel Processing
Data Parallel Programming
Mapping - DCS
63
DCS: Loop Interchange Transformation
Before:
FOR I = 1 TO N DO
  FOR J = 1 TO N DO
    ...
    VAR[J] = f(VAR[J-1])
After:
FOR J = 1 TO N DO
  FOR I = 1 TO N DO
    ...
    VAR[J] = f(VAR[J-1])
Mapping - DCS
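The interchange can be illustrated with a small runnable example. The sketch assumes that the I loop ranges over independent channels and the J loop over samples within a channel, so each channel carries its own feedback value; that reading, the function names, and the first-order filter `f` are assumptions made for illustration.

```python
def channel_major(inputs, f, init=0):
    """Outer loop over channels (I), inner loop over samples (J)."""
    out = [[None] * len(channel) for channel in inputs]
    for i, channel in enumerate(inputs):
        prev = init
        for j, x in enumerate(channel):
            prev = f(prev, x)      # loop-carried dependence along J
            out[i][j] = prev
    return out


def sample_major(inputs, f, init=0):
    """Interchanged: outer loop over samples (J), inner loop over channels (I)."""
    n_samples = len(inputs[0])
    prev = [init] * len(inputs)    # one feedback value (data context) per channel
    out = [[None] * n_samples for _ in inputs]
    for j in range(n_samples):
        for i in range(len(inputs)):
            prev[i] = f(prev[i], inputs[i][j])
            out[i][j] = prev[i]
    return out


def f(y_prev, x):
    """Hypothetical first-order feedback: y = x + 0.5 * y_prev."""
    return x + 0.5 * y_prev


data = [[1, 2, 3], [4, 5, 6]]
same = channel_major(data, f) == sample_major(data, f)
```

Both orders compute the same values. In the interchanged order, consecutive operations belong to different channels, so they no longer wait on each other's feedback.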
64
Data Flow in Multi-Channel Processing
[Figure: samples i and i+1 of channel 1, channel 2, ..., channel N are interleaved through the datapath; feedback values and coefficients are kept per channel]
Mapping - DCS
65
Multi-Channel Design Datapath
  • Distributed memories store the coefficients
  • Distributed memories as buffers schedule the
    dataflow

Mapping - DCS
66
Chameleon RCP Mapping
  • Pipelined Design
  • Local Resources
  • Routable Design
  • Optimal Throughput

Mapping - DCS
67
Analytical Performance Comparison
  • Single channel design - N × 250 cycles
  • Multi-channel design - N × 50 cycles
  • 5x speedup
  • One output per cycle: Optimal
  • DSPs - N × 400 cycles
  • 8x Chameleon RCP speedup

Mapping - DCS
68
Performance Speedup

Approach            Time (us)  Speedup
UltraSPARC          2000       1.0
DSP                 660        3.0
Virtex Standard     1426       1.4
Virtex DCS          158        12.7
Chameleon Standard  432        4.6
Chameleon DCS       36         27.8

100 channels, 80 samples, 10-stage filter
400 MHz UltraSPARC, 300 MHz DSP (TI C62x), 200 MHz Virtex (max frequency), 125 MHz Chameleon
Mapping - DCS
69
Thesis Contributions
Mapping Techniques
HySAM Model
DRIVE
DRIVE Simulation
70
Simulation Tools
  • Performance Analysis
  • Execution time, memory access, power,
  • Algorithmic Analysis
  • Various mapping and scheduling algorithms
  • Architectural Exploration
  • Device and architectural alternatives

DRIVE
71
EDA Simulation Tools
  • Simulation of VHDL designs
  • High level behavioral simulation
  • Verifies correctness
  • Does not provide performance characteristics
  • Simulation of netlist/placed and routed design
  • Low level timing simulation
  • Fixed to specific implementation on specific
    device
  • Needs final design for each alternative
    device/algorithm

DRIVE
72
DRIVE Goals
  • High level performance analysis
  • Module level performance characterization
  • Architecture abstraction
  • Insulate developer from hardware intricacies
  • Algorithm analysis
  • Extensible to study various algorithmic
    techniques
  • Architecture exploration
  • Parameterized architectural model for exploration

DRIVE
73
Interpretive Simulation Framework
System Abstraction
Performance Characterization
Analysis and Transformation
System Models
Task Models
Mapping Algorithms
DRIVE
Interpretive Simulation
Performance Analysis Design Exploration
74
Interpretive Simulation
  • Simulate the application model on the system
    model
  • Performance is based on module characterization
  • Advantages
  • Exploits the design methodology
  • Elimination of actual execution
  • Interactive and real-time simulation
  • Disadvantages
  • Analysis only as accurate as module analysis
  • Approximates module interactions

DRIVE
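A minimal sketch of the interpretive idea, not the DRIVE implementation: the schedule is walked step by step and time is advanced using per-module characterization numbers instead of executing anything. The function name `simulate`, the event tuple layout, and the assumption that loading cost depends only on the target configuration are illustrative; the timing values are taken from the Garp table in this deck.

```python
def simulate(schedule, exec_ns, load_ns):
    """Walk a schedule of (task, configuration) pairs without executing it.

    Returns (events, total_ns), where each event is (start_ns, kind, name).
    """
    events, now, loaded = [], 0.0, None
    for task, cfg in schedule:
        if cfg != loaded:
            # Charge the characterized load time for the new configuration.
            events.append((now, "reconfigure", cfg))
            now += load_ns[cfg]
            loaded = cfg
        # Charge the characterized execution time of the task's module.
        events.append((now, "execute", task))
        now += exec_ns[(task, cfg)]
    return events, now


# Module characterization (ns), from the Garp parameters table.
exec_ns = {("TM", "C2"): 52.5, ("TA", "C3"): 7.5}
load_ns = {"C2": 6400.0, "C3": 1600.0}
events, total_ns = simulate([("TM", "C2"), ("TA", "C3")], exec_ns, load_ns)
for start, kind, name in events:
    print(f"{start:>8.1f} ns  {kind:<12} {name}")
```

The event list is the kind of trace a visualizer could display, and its accuracy is bounded by the accuracy of the characterization tables, as the slide notes.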
75
DRIVE Components
USER
Visualizer
System State
Simulator Core
Data
DRIVE
HySAM Model
Scheduler
Applications
Architectures
76
Sample Visualizer View
DRIVE
77
Sample Publications
  • Integrated Mapping Techniques for Reconfigurable
    SoC Architectures. FPGAs for Custom Computing
    Machines (FCCM), 2001 (Submitted).
  • Parallelizing DSP Nested Loops on Reconfigurable
    Architectures using Data Context Switching.
    Design Automation Conference 2001 (Submitted).
  • DRIVE: An Interpretive Simulation and
    Visualization Environment for Dynamically
    Reconfigurable Systems. Field Programmable Logic
    and Applications, Aug-Sept 1999.
  • Hardware Object Selection for Mapping Loops onto
    Reconfigurable Architectures. Parallel and
    Distributed Processing Techniques and
    Applications, June 1999.
  • DEFACTO: A Design Environment for Adaptive
    Computing Technology (with ISI DEFACTO).
    Reconfigurable Architectures Workshop 1999,
    April 1999.
  • Dynamic Precision Management for Loop
    Computations on Reconfigurable Architectures.
    FPGAs for Custom Computing Machines (FCCM),
    April 1999.
  • Mapping Loops onto Reconfigurable Architectures.
    Field Programmable Logic and Applications,
    Aug-Sept 1998.
  • Reconfigurable Meshes: Theory and Practice.
    Reconfigurable Architectures Workshop, Int.
    Parallel Processing Symposium, April 1997.

78
Conclusions
  • Reconfigurable Computing Is Here
  • Model-based Approach
  • HySAM: Hybrid System Architecture Model of
    reconfigurable architectures
  • Algorithmic Mapping Techniques
  • Mapping of application loops onto reconfigurable
    architectures
  • Dynamic precision management to exploit run-time
    reconfiguration
  • Configurable pipeline generation and segmentation
  • Integrated mapping techniques for hybrid
    architectures
  • Simulation Methodology
  • DRIVE: module-based interpretive simulation
    framework