Design and Implementation of a NoC-Based Cellular Computational System

By: Shervin Vakili
Supervisors: Dr. Sied Mehdi Fakhraie, Dr. Siamak Mohammadi

February 09, 2009
2
Outline

Introduction and Motivations
Basics of Evolvable Multiprocessor System (EvoMP)
EvoMP Operational View
EvoMP Architectural View
Simulation and Synthesis Results
Summary

4
Introduction and Motivations (1)

Computing systems have played an important role in advancing human life over the last four decades.
The number and complexity of applications are continuously increasing.
More computational power is required.
Three main hardware design approaches
ASIC (hardware realization)
Reconfigurable Computing
Processor-Based Designs (software realization)

Flexibility
Performance
5
Introduction and Motivations (2)

Microprocessors are the most popular approach.
Flexibility and reprogrammability
Low performance
Architectural techniques to improve processor performance:
Pipelining, out-of-order execution, superscalar, VLIW, etc.
These techniques seem to have saturated in recent years.

6
Introduction and Motivations (3)

Emerging trends aim to achieve
More performance
Preserving the classical software development process [1].
7
Why Multi-Processor?

One of the main trends is to increase the number of processors.
Uses Thread-Level Parallelism (TLP)
Similarity to single-processor systems
Short time-to-market
Post-fabrication reusability
Flexibility and programmability
The trend is toward a large number of simple processors on a chip.

8
Number of Processing Cores in Different Products (figure, source: [3])
9
MPSoC Development Challenges (1)

MP systems face some major challenges.
Programming models
MP systems require concurrent software.
Concurrent software development requires two operations:
Decomposition of the program into tasks
Scheduling of the tasks among cooperating processors
Both are NP-complete problems.
Both strongly affect performance.

10
MPSoC Development Challenges (2)

Two main solutions
1. Software development using parallel
programming libraries.
e.g. MPI and OpenMP
Manually by the programmer.
Requires huge investment to re-develop existing
software.
2. Automatic parallelization at compile-time
Does not require reprogramming but requires
re-compilation.
Compiler performs both Task decomposition and
scheduling.

11
MPSoC Development Challenges (3)

Control and Synchronization
To address inter-processor data dependencies
Debugging
Tracking concurrent execution is difficult,
particularly in heterogeneous architectures whose processors have different ISAs.

12
MPSoC Development Challenges (4)

All MPSoCs can be divided into two categories
Static scheduling
Task scheduling is performed before execution.
Predetermined number of contributing processors.
Has access to entire program.
Dynamic scheduling
A run-time scheduler (in hardware or OS) performs
task scheduling.
Does not depend on number of processors.
Only has access to pending tasks and available
resources.

14
Proposal of Evolvable Multi-processor System (1)

This thesis introduces a novel MPSoC called EvoMP (Evolvable Multi-Processor system).
It uses evolutionary strategies for run-time task decomposition and scheduling.
Features
Can directly execute classical sequential code on an MP platform.
Uses a hardware evolutionary algorithm core to perform run-time task decomposition and scheduling.
Distributed control and computing
Flexibility
NoC-Based, 2D mesh, and homogeneous

15
Proposal of Evolvable Multi-processor System (2)

All computational units have one copy of the entire program.
The EvoMP architecture uses a hardware evolutionary core to generate a bit-string (chromosome).
This bit-string determines which processor is in charge of executing each instruction.
The primary version of EvoMP uses a genetic algorithm core.

16
Target Applications

Applications that perform a single, fixed computation on a stream of data, e.g.
Digital signal processing
Packet processing in network applications
Huge sensory data processing

17
Streaming Applications Code Style
Initial:
1- MOV R1, 0
2- MOV R2, 0
L1: Loop
3- MOV R1, Input
4- MUL R3, R1, Coe1
5- MUL R4, R2, Coe2
6- ADD R1, R3, R4
7- MOV Output, R1
8- MOV R1, R2
9- Genetic
10- JUMP L1

Streaming programs have two main parts
Initialization
Infinite (or semi-infinite) Loop

Two-Tap FIR Filter
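
As a rough software analogue of the filter this slide names, the Python sketch below computes a two-tap FIR, y[n] = Coe1*x[n] + Coe2*x[n-1]. The function name, the zero initial state and the sample values are assumptions chosen for illustration; they are not taken from the EvoMP implementation.

```python
def two_tap_fir(samples, coe1, coe2):
    """Return y[n] = coe1 * x[n] + coe2 * x[n-1], with x[-1] taken as 0."""
    previous = 0  # delayed sample, initialised to 0 like the registers in the listing
    outputs = []
    for current in samples:  # one pass of the streaming loop per input sample
        outputs.append(coe1 * current + coe2 * previous)
        previous = current  # shift the delay line for the next sample
    return outputs


# Example stream with assumed coefficients Coe1 = 2 and Coe2 = 3
filtered = two_tap_fir([1, 2, 3], coe1=2, coe2=3)
```

The initialization part runs once and the loop body runs once per sample, which is the two-part structure described above.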

19
EvoMP Top View

Genetic core produces a bit-string (chromosome)
Determines the location of execution of each
instruction

[Figure: 2x2 mesh. Processors P-00, P-01, P-10 and P-11 are attached to NoC switches SW00, SW01, SW10 and SW11. Each processor holds its own copy of the two-tap FIR program (instructions 1 to 9, ending with JUMP L1). The Genetic Core supplies the example chromosome 011011011.]
20
How Does EvoMP Work? (1)

The following process is repeated in each iteration:
At the beginning of each iteration, the genetic core generates the bit-string (chromosome) and sends it to all processors.
The processors execute this iteration with the decomposition and scheduling scheme determined by the chromosome.
A counter in the genetic core counts the number of clock cycles spent.
When all processors have reached the end of the loop, the genetic core uses the output of this counter as the fitness value.
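
The fitness measurement can be sketched in software as follows. This is a toy model under stated assumptions: a chromosome is represented directly as a list giving the processor for each instruction, each instruction has a fixed latency, and data dependencies and NoC traffic are ignored. The real hardware simply counts the elapsed clock cycles; the names `iteration_cycles` and `best_chromosome` are hypothetical.

```python
def iteration_cycles(assignment, latencies):
    """Toy fitness: cycles until every processor has finished its instructions.

    assignment[i] is the processor that executes instruction i and
    latencies[i] is that instruction's cost in clock cycles. Data
    dependencies and NoC traffic are ignored (simplifying assumption).
    """
    busy = {}
    for proc, cost in zip(assignment, latencies):
        busy[proc] = busy.get(proc, 0) + cost
    return max(busy.values()) if busy else 0


def best_chromosome(population, latencies):
    """Return (fitness, chromosome) for the chromosome with the fewest cycles."""
    return min((iteration_cycles(c, latencies), c) for c in population)
```

A lower cycle count is a better fitness, so the search minimizes this value.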

21
How Does EvoMP Work? (2)
[State diagram: Initialize -> Evolution; Evolution -> Final on "Terminate"; Final -> Evolution on "Fault detected"]

Three main working states
Initialize
Only for the first population
The genetic core generates random particles.
Evolution
Uses recombination to produce new populations.
When the termination condition is met, the system goes to the final state.
Final
The best chromosome is used as the constant output of the genetic core.
When one of the processors becomes faulty, the system returns to the evolution state.
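
The three working states can be written as a small transition function. This is only a sketch of the behaviour listed above; the enum and the flag names `terminated` and `fault_detected` are hypothetical, and how the hardware signals these events is not described here.

```python
from enum import Enum


class State(Enum):
    INITIALIZE = "initialize"
    EVOLUTION = "evolution"
    FINAL = "final"


def next_state(state, terminated=False, fault_detected=False):
    """Return the state that follows `state` for the given events."""
    if state is State.INITIALIZE:
        # The random first population has been generated; start evolving.
        return State.EVOLUTION
    if state is State.EVOLUTION:
        # Keep evolving until the termination condition is met.
        return State.FINAL if terminated else State.EVOLUTION
    # FINAL: hold the best chromosome until a processor becomes faulty.
    return State.EVOLUTION if fault_detected else State.FINAL
```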

22
How Does the Chromosome Encode the Scheduling Data? (1)

Each chromosome consists of several small words (genes).
Each word contains two fields:
A processor number
A number of instructions
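
One plausible way to decode such a chromosome is sketched below. The field widths (2-bit processor number, 3-bit instruction count), the example bit-string, and the reading that each word hands the next `count` consecutive instructions to its processor are all assumptions made for illustration.

```python
def decode_chromosome(bits, proc_bits=2, count_bits=3):
    """Split a bit-string into (processor, instruction_count) words."""
    word_len = proc_bits + count_bits
    if len(bits) % word_len:
        raise ValueError("bit-string length is not a multiple of the word length")
    words = []
    for start in range(0, len(bits), word_len):
        word = bits[start:start + word_len]
        words.append((int(word[:proc_bits], 2), int(word[proc_bits:], 2)))
    return words


def assign_instructions(words, first_instruction=1):
    """Map consecutive instruction numbers to processors, word by word."""
    assignment = {}
    line = first_instruction
    for processor, count in words:
        for _ in range(count):
            assignment[line] = processor
            line += 1
    return assignment


# Four illustrative words: (10,001) (01,010) (11,000) (10,101)
words = decode_chromosome("10001" "01010" "11000" "10101")
schedule = assign_instructions(words, first_instruction=3)
```

With these example words, instruction 3 goes to processor 2, instructions 4 and 5 to processor 1, processor 3 receives nothing, and instructions 6 to 10 go to processor 2.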

23
How Does the Chromosome Encode the Scheduling Data? (2)

Assume that we have a 2x2 mesh.

1- MOV R1, 0
2- MOV R2, 0
L1: Loop
3- MOV R1, Input
4- MUL R3, R1, Coe1
5- MUL R4, R2, Coe2
6- ADD R1, R3, R4
7- MOV Output, R1
8- MOV R1, R2
9- GENETIC
10- JUMP L1

Chromosome (each word: processor number, # of instructions)
Word1: 10 001
Word2: 01 010
Word3: 11 000
Word4: 10 101
Mesh processor addresses: 00, 01, 10, 11
24
Data Dependency Problem

Data dependencies are the main challenge.
They must be detected dynamically at run-time.
They are addressed using:
A particular machine code style
Architectural techniques

25
EvoMP Machine Code Style

Each source operand is replaced by the line number (ID) of the most recent instruction that changed it.
This greatly simplifies dependency detection.

10. ADD R1,R2,R3    (R3 <- R1 + R2)
11. AND R2,R6,R7    (R7 <- R2 AND R6)
12. SUB R7,R3,R4    (R4 <- R7 - R3)
Instruction 12 is therefore encoded as:
12. SUB (11), (10), R4
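
The rewriting rule can be sketched as a single pass that remembers the last writer of each register. The tuple layout, with the destination as the last operand as in the example above, and the choice to leave a register unchanged when no earlier writer is known, are assumptions of this sketch and not a description of the actual EvoMP toolchain.

```python
def to_id_style(program):
    """Replace each source register by the line number of its last writer.

    program is a list of (line, opcode, src1, src2, dest) tuples, with the
    destination last as in the example above.
    """
    last_writer = {}  # register name -> line number that last wrote it
    converted = []
    for line, opcode, src1, src2, dest in program:
        # Sources with a known writer become "(line)"; others stay as registers.
        sources = [f"({last_writer[s]})" if s in last_writer else s
                   for s in (src1, src2)]
        converted.append((line, opcode, sources[0], sources[1], dest))
        last_writer[dest] = line  # this instruction is now the latest writer
    return converted


example = [
    (10, "ADD", "R1", "R2", "R3"),
    (11, "AND", "R2", "R6", "R7"),
    (12, "SUB", "R7", "R3", "R4"),
]
rewritten = to_id_style(example)
```

After this pass, an instruction's dependencies can be read directly from its operands: instruction 12 must wait for instructions 11 and 10.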

27
Architecture of each Processor

The number of FUs is configurable.
Homogeneous or heterogeneous policies can be used for FUs.
Supports out-of-order execution.
The first free FU grabs the instruction from the Instr bus (daisy chain).

28
Fetch_Issue Unit

The PC1-Instr bus is used for the instructions to be executed.
The PC2-Invalidate_Instr bus is used for data dependency detection.

29
Functional Unit

Can be configured to execute different
operations
Arithmetic Operations
Add
Sub
Shift/Rotate Right/Left
Multiply Add and shift
Logical Operations

30
Genetic Core
[Figure: 2x2 mesh of cells Cell-00, Cell-01, Cell-10 and Cell-11 with NoC switches SW00, SW01, SW10 and SW11, plus the Genetic Core.]

Population size and mutation rate are
configurable.
Elite count is constant and equal to two in order
to reduce the hardware complexity
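
A software sketch of one generation with a fixed elite count of two is given below. The uniform parent selection, single-point crossover and per-bit mutation are common textbook choices assumed here for illustration; the operators used by the hardware genetic core are not specified on this slide.

```python
import random


def next_generation(population, fitness, mutation_rate, rng=random):
    """Build a new population of bit-string chromosomes; lower fitness is better."""
    ranked = sorted(population, key=fitness)
    new_population = ranked[:2]  # elite count fixed at two
    while len(new_population) < len(population):
        mother, father = rng.sample(ranked, 2)   # assumed: uniform parent choice
        cut = rng.randrange(1, len(mother))      # single-point crossover
        child = mother[:cut] + father[cut:]
        # Flip each bit independently with probability mutation_rate.
        child = "".join(
            ("1" if bit == "0" else "0") if rng.random() < mutation_rate else bit
            for bit in child
        )
        new_population.append(child)
    return new_population
```

Keeping the elite count constant means the two best chromosomes always survive unchanged, so the best fitness found so far never gets worse from one generation to the next.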

31
EvoMP Challenges

The current version uses a centralized memory unit,
located at address 00.
This address does not contain computational circuits.
This is a major issue for scalability.
The search space of the genetic algorithm is very large.
It grows exponentially with a linear increase in the number of processors.

32
PSO Core [8]

34
Configurable Parameters

There are some configurable parameters in EvoMP:
Word-length of the system
Size of the mesh (number of processors)
Flit length (bit-length of NoC switch links)
Population size
Crossover rate

35
Simulation Results

Two sets of applications are used for performance evaluation:
Some DSP programs
Some sample neural networks
Two other decomposition and scheduling methods are implemented to enable comparison:
Static Decomposition Genetic Scheduler (SDGS)
Decomposition is performed statically, i.e. tasks are predetermined manually.
The genetic core only specifies the scheduling scheme.
Static Decomposition First Free Scheduler (FF)
Assigns the first task in the job queue to the first free processor in the system.
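
The first-free policy can be sketched as a simple list scheduler. The sketch assumes independent tasks with known durations and breaks ties by the lowest processor index; task dependencies and communication delays, which the real system has to handle, are ignored. The function name is hypothetical.

```python
import heapq


def first_free_schedule(task_durations, num_processors):
    """Assign queued tasks, in order, to whichever processor frees up first.

    Returns (assignment, makespan), where assignment[i] is the processor
    chosen for task i. Task dependencies are ignored (assumption).
    """
    free_at = [(0, p) for p in range(num_processors)]  # (time free, processor)
    heapq.heapify(free_at)
    assignment = []
    makespan = 0
    for duration in task_durations:
        time_free, proc = heapq.heappop(free_at)  # earliest free processor
        assignment.append(proc)
        finish = time_free + duration
        makespan = max(makespan, finish)
        heapq.heappush(free_at, (finish, proc))
    return assignment, makespan


assignment, makespan = first_free_schedule([3, 1, 2, 2], num_processors=2)
```

Unlike the genetic schedulers, this policy never revisits a decision, which makes it a useful baseline for the comparison.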

36
16-Tap FIR Filter

Parameters
16-bit mode
Population size: 16
Crossover rate: 8
NoC connection width: 16

Best fitness shows the number of clock cycles required to execute one iteration using the best particle found so far.

74 instructions
16 multiplications

37
8-Point DCT

Parameters
16-bit mode
Population size: 16
Crossover rate: 8
NoC connection width: 16

88 instructions
32 multiplications

38
16-point DCT

Parameters
16-bit mode
Population size: 16
Crossover rate: 6
NoC connection width: 16

320 instructions
128 multiplications

39
5x5 Matrix Multiplication

Parameters
16-bit mode
Population size: 16
Crossover rate: 6
NoC connection width: 16

406 instructions
125 multiplications

40
Mesh                  Scheme             Metric                             FIR-16   DCT-8  DCT-16  MATRIX-5x5
-                     -                  Number of Instructions                 74      88     324         406
-                     -                  Number of Multiply Instructions        16      32     128         125
1x2 mesh (One Proc.)  All three schemes  Fitness (clock cycles)                350     671    2722        3181
1x2 mesh (One Proc.)  All three schemes  Speed-up                                1       1       1           1
1x3 mesh              Main Design        Fitness (clock cycles)                214     403    1841        2344
1x3 mesh              Main Design        Speed-up                             1.63    1.66    1.47        1.37
1x3 mesh              Main Design        Evolution Time (us)                 27342   42807   74582      198384
1x3 mesh              SDGS               Fitness (clock cycles)                202     401    1812        2218
1x3 mesh              SDGS               Speed-up                             1.73    1.67    1.50        1.43
1x3 mesh              SDGS               Evolution Time (us)                  1967   29315   84365       65119
1x3 mesh              First Free         Fitness (clock cycles)                293     733    2529        2487
1x3 mesh              First Free         Speed-up                             1.19    0.91    1.08        1.27
2x2 mesh              Main Design        Fitness (clock cycles)                171     319    1460        1868
2x2 mesh              Main Design        Speed-up                             2.04    2.10    1.86        1.70
2x2 mesh              Main Design        Evolution Time (us)                 30174   54790   23319      294828
2x2 mesh              SDGS               Fitness (clock cycles)                161     306    1189        1817
2x2 mesh              SDGS               Speed-up                             2.17    2.19    2.28        1.75
2x2 mesh              SDGS               Evolution Time (us)                 10739   52477  536565       10092
2x2 mesh              First Free         Fitness (clock cycles)                239     681    1933        2098
2x2 mesh              First Free         Speed-up                             1.46    0.98    1.40        1.51
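
The speed-up rows appear to be the single-processor fitness divided by the fitness of the given configuration. This reading is inferred from the numbers rather than stated on the slide, and the reported values are not always rounded the same way, so the sketch below only reproduces them approximately. The function name is hypothetical.

```python
def speed_up(single_processor_cycles, parallel_cycles):
    """Ratio of single-processor cycles to multiprocessor cycles per iteration."""
    if parallel_cycles <= 0:
        raise ValueError("parallel_cycles must be positive")
    return single_processor_cycles / parallel_cycles


# FIR-16: 350 cycles on one processor, 171 cycles on the 2x2 mesh (Main Design)
fir16_2x2 = speed_up(350, 171)

# DCT-8 with First Free on the 1x3 mesh: 671 vs 733 cycles, i.e. a slowdown
dct8_ff_1x3 = speed_up(671, 733)
```

A value below 1, as in the second case, means the multiprocessor configuration needed more cycles per iteration than the single processor.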
41
Mesh                  Scheme             Metric                                 FIR-16   DCT-8  DCT-16  MATRIX-5x5
-                     -                  Number of Instructions                     74      88     324         406
-                     -                  Number of Multiply Instructions            16      32     128         125
1x2 mesh (One Proc.)  All three schemes  Fitness (clock cycles)                    350     671    2722        3181
1x2 mesh (One Proc.)  All three schemes  Speed-up                                    1       1       1           1
2x3 mesh              Main Design        Fitness (clock cycles)            Unevaluated     285    1213        1596
2x3 mesh              Main Design        Speed-up                          Unevaluated    2.33    2.25        1.99
2x3 mesh              Main Design        Evolution Time (us)               Unevaluated   93034  630482      546095
2x3 mesh              SDGS               Fitness (clock cycles)            Unevaluated     256    1106        1575
2x3 mesh              SDGS               Speed-up                          Unevaluated    2.62    2.46        2.01
2x3 mesh              SDGS               Evolution Time (us)               Unevaluated   41023  111118      178219
2x3 mesh              First Free         Fitness (clock cycles)            Unevaluated     496    1587        1815
2x3 mesh              First Free         Speed-up                          Unevaluated    1.35    1.71        1.75
42
Neural Network Case Study
Network   # of Instr.  # of Multiplies  1x2 mesh           1x3 mesh                 2x2 mesh                 2x3 mesh
                                        Fitness  Speed-up  Fitness  Speed-up  Time  Fitness  Speed-up  Time  Fitness  Speed-up  Time
4-4-1              58               20      450         1      281      1.60   125      245      1.83    52      207      2.17   262
3-9-2              95               45      905         1      570      1.59    52      503      1.80   163      463      1.95   342
12-20-10          924              440     8304         1     5153      1.61   892     4365      1.90  1832     3813      2.18  3436
43
Fault Tolerance Results

When a fault is detected in a processor, the evolutionary core excludes that processor from contributing in subsequent iterations.
It also returns to the evolution stage to find a suitable solution for the new situation.
The best fitness obtained in a 2x3 EvoMP for the 16-point DCT program is evaluated.
Faults are injected into processors 010, 001 and 101 at 1000000 us, 2000000 us and 3000000 us, respectively.

44
Genetic vs. PSO

Population size in both experiments is 16

Mesh      Algorithm  Metric                       FIR-16   DCT-8  DCT-16  MAT-5x5
-         -          # of Instr.                      74      88     324      406
-         -          # of Multiplies                  16      32     128      125
-         -          Particle length (bits)          240     280     720      800
1x2 mesh  Both       Fit                             350     671    2722     3181
1x3 mesh  Genetic    Fit                             214     403    1841     2344
1x3 mesh  Genetic    Time                           23.7    93.0    74.5    198.3
1x3 mesh  PSO        Fit                             211     393    1831     2312
1x3 mesh  PSO        Time                           12.3     6.2    41.7     86.3
2x2 mesh  Genetic    Fit                             171     319    1460     1868
2x2 mesh  Genetic    Time                           30.1    99.8    23.3    294.8
2x2 mesh  PSO        Fit                             174     308    1439     1821
2x2 mesh  PSO        Time                           14.3    21.8    45.3    148.3
2x3 mesh  Genetic    Fit                     unevaluated     285    1213     1596
2x3 mesh  Genetic    Time                    unevaluated   138.1   633.7    546.7
2x3 mesh  PSO        Fit                     unevaluated     203    1191     1518
2x3 mesh  PSO        Time                    unevaluated    15.6    98.3    240.9

45
Synthesis Results

Synthesis results on a Virtex-II (XC2V3000) FPGA using Synplify Pro.

                   NoC switch  Genetic Core  PSO Core   MMU         Processor   Total System
Area (Total LUTs)  729 (2%)    1864 (6%)     1642 (5%)  3553 (12%)  4433 (15%)  20112 (70%)
Max Freq. (MHz)    -           68.4          94.6       -           -           61.4

47
Summary

EvoMP, a novel MPSoC system, was studied.
EvoMP exploits evolvable strategies to perform run-time task decomposition and scheduling.
EvoMP does not require concurrent code because it can parallelize sequential code at run-time.
It uses a particular, novel processor architecture to address the data dependency problem.
Experimental results confirm the applicability of the novel ideas in EvoMP.

48
Main References

[1] N. S. Voros and K. Masselos, System Level Design of Reconfigurable Systems-on-Chip. Netherlands: Springer, 2005.
[2] G. Martin, "Overview of the MPSoC design challenge," Proc. Design and Automation Conf., July 2005, pp. 274-279.
[3] S. Amarasinghe, "Multicore programming primer and programming competition," class notes for 6.189, Computer Architecture Group, Massachusetts Institute of Technology. Available: www.cag.csail.mit.edu/ps3/lectures/6.189-lecture1-intro.pdf
[4] M. Hubner, K. Paulsson, and J. Becker, "Parallel and flexible multiprocessor system-on-chip for adaptive automotive applications based on Xilinx MicroBlaze soft-cores," Proc. Intl. Symp. Parallel and Distributed Processing, 2005.
[5] D. Gohringer, M. Hubner, V. Schatz, and J. Becker, "Runtime adaptive multi-processor system-on-chip: RAMPSoC," Proc. Intl. Symp. Parallel and Distributed Processing, April 2008, pp. 1-7.
[6] A. Klimm, L. Braun, and J. Becker, "An adaptive and scalable multiprocessor system for Xilinx FPGAs using minimal sized processor cores," Proc. Symp. Parallel and Distributed Processing, April 2008, pp. 1-7.
[7] Z.Y. Wen and Y.J. Gang, "A genetic algorithm for tasks scheduling in parallel multiprocessor systems," Proc. Intl. Conf. Machine Learning and Cybernetics, Nov. 2003, pp. 1785-1790.
[8] A. Farmahini-Farahani, S. Vakili, S. M. Fakhraie, S. Safari, and C. Lucas, "Parallel scalable hardware implementation of asynchronous discrete particle swarm optimization," Elsevier J. of Engineering Applications of Artificial Intelligence, submitted for publication.

49
Main References (2)

[9] A. A. Jerraya and W. Wolf, Multiprocessor Systems-on-Chips. San Francisco: Morgan Kaufmann Publishers, 2005.
[10] A.J. Page and T.J. Naughton, "Dynamic task scheduling using genetic algorithms for heterogeneous distributed computing," Proc. Intl. Symp. Parallel and Distributed Processing, April 2005, pp. 189.1.
[11] E. Carvalho, N. Calazans, and F. Moraes, "Heuristics for dynamic task mapping in NoC based heterogeneous MPSoCs," Proc. Int. Rapid System Prototyping Workshop, 2007, pp. 34-40.
[12] R. Canham and A. Tyrrell, "An embryonic array with improved efficiency and fault tolerance," Proc. NASA/DoD Conf. on Evolvable Hardware, July 2003, pp. 265-272.
[13] W. Barker, D. M. Halliday, Y. Thoma, E. Sanchez, G. Tempesti, and A. Tyrrell, "Fault tolerance using dynamic reconfiguration on the POEtic Tissue," IEEE Trans. Evolutionary Computing, vol. 11, no. 5, Oct. 2007, pp. 666-684.

50
Related Publications

Journal
S. Vakili, S. M. Fakhraie, and S. Mohammadi, "EvoMP: a novel MPSoC architecture with evolvable task decomposition and scheduling," submitted to IET Comp. Digital Tech. (under revision).
S. Vakili, S. M. Fakhraie, and S. Mohammadi, "Low-cost fault tolerance in evolvable multiprocessor system: a graceful degradation approach," submitted to Journal of Zhejiang University SCIENCE A (JZUS-A).
Conference
S. Vakili, S. M. Fakhraie, and S. Mohammadi, "Designing an MPSoC architecture with run-time and evolvable task decomposition and scheduling," Proc. 5th IEEE Intl. Conf. Innovations in Information Technology, Dec. 2008.
S. Vakili, S. M. Fakhraie, S. Mohammadi, and Ali Ahmadi, "Particle swarm optimization for run-time task decomposition and scheduling in evolvable MPSoC," Proc. IEEE Intl. Conf. Computer Engineering and Technology, Jan. 2009.