Design and Implementation of a NoC-Based Cellular Computational System

1
Design and Implementation of a NoC-Based Cellular
Computational System
  • By Shervin Vakili
  • Supervisors: Dr. Sied Mehdi Fakhraie and
    Dr. Siamak Mohammadi

February 09, 2009
2
Outline
  • Introduction and Motivations
  • Basics of Evolvable Multiprocessor System (EvoMP)
  • EvoMP Operational View
  • EvoMP Architectural View
  • Simulation and Synthesis Results
  • Summary

3
  • Introduction and Motivations
  • Basics of Evolvable Multiprocessor System (EvoMP)
  • EvoMP Operational View
  • EvoMP Architectural View
  • Simulation and Synthesis Results
  • Summary

4
Introduction and Motivations (1)
  • Computing systems have played an important role
    in the advances of human life over the last four
    decades.
  • The number and complexity of applications are
    continuously increasing.
  • More computational power is required.
  • Three main hardware design approaches:
  • ASIC (hardware realization)
  • Reconfigurable computing
  • Processor-based designs (software realization)

[Figure: the three approaches compared along two axes, Flexibility and Performance]
5
Introduction and Motivations (2)
  • Microprocessors are the most popular approach.
  • Flexibility and reprogrammability
  • Low performance
  • Architectural techniques to improve processor
    performance
  • Pipelining, out-of-order execution, superscalar,
    VLIW, etc.
  • These techniques seem to have saturated in recent
    years.

6
Introduction and Motivations (3)
  • Emerging trends aim to achieve
  • More performance
  • Preserving the classical software development
    process.

[1]
7
Why Multi-Processor?
  • One of the main trends is to increase the number
    of processors.
  • Uses thread-level parallelism (TLP)
  • Similarity to single-processor
  • Short time-to-market
  • Post-fabrication reusability
  • Flexibility and programmability
  • Moving toward a large number of simple processors
    on a chip.

8
Number of Processing Cores in Different Products [3]
9
MPSoC Development Challenges (1)
  • MP systems face some major challenges.
  • Programming models
  • MP systems require concurrent software.
  • Concurrent software development requires two
    operations
  • Decomposition of the program into some tasks
  • Scheduling the tasks among cooperating processors
  • Both are NP-complete problems.
  • Both strongly affect performance.

10
MPSoC Development Challenges (2)
  • Two main solutions
  • 1. Software development using parallel
    programming libraries.
  • e.g. MPI and OpenMP
  • Manually by the programmer.
  • Requires huge investment to re-develop existing
    software.
  • 2. Automatic parallelization at compile-time
  • Does not require reprogramming but requires
    re-compilation.
  • Compiler performs both Task decomposition and
    scheduling.

11
MPSoC Development Challenges (3)
  • Control and Synchronization
  • To address inter-processor data dependencies
  • Debugging
  • Tracking concurrent execution is difficult.
  • Particularly in heterogeneous architectures whose
    processors have different ISAs.

12
MPSoC Development Challenges (4)
  • All MPSoCs can be divided into two categories
  • Static scheduling
  • Task scheduling is performed before execution.
  • Predetermined number of contributing processors.
  • Has access to entire program.
  • Dynamic scheduling
  • A run-time scheduler (in hardware or OS) performs
    task scheduling.
  • Does not depend on number of processors.
  • Only has access to pending tasks and available
    resources.

13
  • Introduction and Motivations
  • Basics of Evolvable Multiprocessor System
  • EvoMP Operational View
  • EvoMP Architectural View
  • Simulation and Synthesis Results
  • Summary

14
Proposal of Evolvable Multi-processor System (1)
  • This thesis introduces a novel MPSoC
  • Uses evolutionary strategies for run-time task
    decomposition and scheduling.
  • Is called EvoMP (Evolvable Multi-Processor
    system).
  • Features
  • Can directly execute classical sequential code
    on an MP platform.
  • Uses a hardware evolutionary algorithm core to
    perform run-time task decomposition and
    scheduling.
  • Distributed control and computing
  • Flexibility
  • NoC-Based, 2D mesh, and homogeneous

15
Proposal of Evolvable Multi-processor System (2)
  • All computational units have one copy of the
    entire program.
  • The EvoMP architecture exploits a hardware
    evolutionary core to generate a bit-string
    (chromosome).
  • This bit-string determines which processor is
    in charge of executing each instruction.
  • The primary version of EvoMP uses a genetic
    algorithm core.

16
Target Applications
  • Applications that perform a unique computation
    on a stream of data, e.g.
  • Digital signal processing
  • Packet processing in network applications
  • Processing of large volumes of sensor data

17
Streaming Applications Code Style
Initial
   1- MOV R1, 0
   2- MOV R2, 0
L1 (Loop)
   3- MOV R1, Input
   4- MUL R3, R1, Coe1
   5- MUL R4, R2, Coe2
   6- ADD R1, R3, R4
   7- MOV Output, R1
   8- MOV R1, R2
   9- GENETIC
  10- JUMP L1
  • Streaming programs have two main parts
  • Initialization
  • Infinite (or semi-infinite) Loop

Two-Tap FIR Filter
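
As a point of reference, a textbook two-tap FIR filter computes y[n] = c1*x[n] + c2*x[n-1]. The sketch below is a plain Python rendering of that formula with the same two-part shape (initialization, then a loop over the input stream). It is not a translation of the assembly listing on this slide, and the names `fir2`, `coe1` and `coe2` are chosen here for illustration only.

```python
def fir2(samples, coe1, coe2):
    """Two-tap FIR: y[n] = coe1 * x[n] + coe2 * x[n-1], with x[-1] = 0."""
    prev = 0  # initialization part: the previous sample starts at zero
    out = []
    for x in samples:  # loop part: one iteration per input sample
        out.append(coe1 * x + coe2 * prev)
        prev = x  # remember the current sample for the next iteration
    return out
```

In a streaming setting `samples` would be an unbounded source; a finite list is used here only so the function can return.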
18
  • Introduction and Motivations
  • Basics of Evolvable Multiprocessor System (EvoMP)
  • EvoMP Operational View
  • EvoMP Architectural View
  • Simulation and Synthesis Results
  • Summary

19
EvoMP Top View
  • Genetic core produces a bit-string (chromosome)
  • Determines the location of execution of each
    instruction

[Figure: 2x2 mesh with switches SW00, SW01, SW10, SW11 and processors P-00, P-01, P-10, P-11. Each processor holds its own copy of the full two-tap FIR program (instructions 1 to 9). The Genetic Core sends the chromosome 011011011 to the processors.]
20
How Does EvoMP Work? (1)
  • The following process is repeated in each
    iteration:
  • At the beginning of each iteration, the genetic
    core generates the bit-string (chromosome) and
    sends it to all processors.
  • Processors execute this iteration with the
    determined decomposition and scheduling scheme.
  • A counter in the genetic core counts the number
    of clock cycles spent.
  • When all processors have reached the end of the
    loop, the genetic core uses the output of this
    counter as the fitness value.

21
How Does EvoMP Work? (2)
[Figure: state diagram with states Initialize, Evolution and Final; "Terminate" leads from Evolution to Final and "Fault detected" leads from Final back to Evolution]
  • Three main working states
  • Initialize
  • Only for the first population
  • The genetic core generates random particles.
  • Evolution
  • Uses recombination to produce new populations.
  • When the termination condition is met, the system
    goes to the final state.
  • Final
  • The best chromosome is used as the constant output
    of the genetic core.
  • When one of the processors becomes faulty, the
    system returns to the evolution stage.
22
How Does the Chromosome Encode the Scheduling Data? (1)
  • Each chromosome consists of some small words
    (genes).
  • Each word contains two fields:
  • A processor number
  • A number of instructions

23
How Does the Chromosome Encode the Scheduling Data? (2)
  • Assume that we have a 2x2 mesh

   1- MOV R1, 0
   2- MOV R2, 0
L1 (Loop)
   3- MOV R1, Input
   4- MUL R3, R1, Coe1
   5- MUL R4, R2, Coe2
   6- ADD R1, R3, R4
   7- MOV Output, R1
   8- MOV R1, R2
   9- GENETIC
  10- JUMP L1

Chromosome (each word: processor number, # of instructions)
  Word1: 10 001
  Word2: 01 010
  Word3: 11 000
  Word4: 10 101
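
To make the word format concrete, here is a minimal decoding sketch. It assumes each word is a 2-bit processor number followed by a 3-bit instruction count, and that successive words claim consecutive instructions in program order starting at the first loop instruction. The slides give the two fields but not these details, so the field widths, the ordering, and the names `decode_chromosome` and `first_line` are assumptions.

```python
def decode_chromosome(bits, proc_bits=2, count_bits=3, first_line=1):
    """Map instruction line numbers to processor addresses.

    Assumption: each word is a processor number followed by an instruction
    count, and words claim consecutive instructions in program order.
    """
    word_len = proc_bits + count_bits
    if len(bits) % word_len != 0:
        raise ValueError("chromosome length must be a multiple of the word length")
    assignment = {}
    line = first_line
    for i in range(0, len(bits), word_len):
        word = bits[i:i + word_len]
        proc = word[:proc_bits]           # processor address, e.g. "10"
        count = int(word[proc_bits:], 2)  # number of instructions in this word
        for _ in range(count):
            assignment[line] = proc
            line += 1
    return assignment


# Word values as they appear in the slide's chromosome figure:
# 10|001, 01|010, 11|000, 10|101
chromosome = "10001" + "01010" + "11000" + "10101"
schedule = decode_chromosome(chromosome, first_line=3)
```

Under these assumptions the four counts add up to eight, which covers instructions 3 to 10 of the loop: instruction 3 goes to processor 10, instructions 4 and 5 to processor 01, processor 11 receives nothing, and instructions 6 to 10 go to processor 10.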
24
Data Dependency Problem
  • Data dependencies are the main challenge.
  • Must be detected dynamically at run-time.
  • Is addressed using
  • Particular machine code style
  • Architectural techniques

25
EvoMP Machine Code Style
  • Source operands are replaced by the line number
    (ID) of the most recent instruction that changed
    them.
  • This greatly simplifies dependency detection.

10. ADD R1, R2, R3      (R3 = R1 + R2)
11. AND R2, R6, R7      (R7 = R2 AND R6)
12. SUB R7, R3, R4      (R4 = R7 - R3)

Instruction 12 in EvoMP code style:
12. SUB (11), (10), R4
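
The rewriting rule can be sketched in a few lines. This is an illustrative pass over a list of instructions, assuming the destination-last operand order used in the three-instruction example; the tuple layout and the name `rename_sources` are hypothetical and not part of the EvoMP toolchain described in the slides.

```python
def rename_sources(program):
    """Replace each source register by the ID (line number) of its last writer.

    `program` is a list of (line, opcode, src1, src2, dest) tuples.
    Sources with no earlier writer in `program` are left as register names.
    """
    last_writer = {}  # register name -> line number of the most recent write
    renamed = []
    for line, opcode, src1, src2, dest in program:
        new_srcs = tuple(
            f"({last_writer[s]})" if s in last_writer else s
            for s in (src1, src2)
        )
        renamed.append((line, opcode, *new_srcs, dest))
        last_writer[dest] = line  # record the write only after reading sources
    return renamed


program = [
    (10, "ADD", "R1", "R2", "R3"),
    (11, "AND", "R2", "R6", "R7"),
    (12, "SUB", "R7", "R3", "R4"),
]
renamed = rename_sources(program)
```

For the example, instruction 12 becomes `(12, "SUB", "(11)", "(10)", "R4")`, matching the slide. A processor that sees `(11)` as an operand knows directly which instruction result it has to wait for.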
26
  • Introduction and Motivations
  • Basics of Evolvable Multiprocessor System (EvoMP)
  • EvoMP Operational View
  • EvoMP Architectural View
  • Simulation and Synthesis Results
  • Summary

27
Architecture of each Processor
  • Number of FUs is configurable.
  • Homogeneous or heterogeneous policies can be used
    for FUs.
  • Supports out-of-order execution.
  • The first free FU grabs the instruction from the
    Instr bus (daisy chain).

28
Fetch_Issue Unit
  • The PC1-Instr bus is used for the instructions to
    be executed.
  • The PC2-Invalidate_Instr bus is used for data
    dependency detection.

29
Functional Unit
  • Can be configured to execute different
    operations
  • Arithmetic Operations
  • Add
  • Sub
  • Shift/Rotate Right/Left
  • Multiply (add and shift)
  • Logical Operations

30
Genetic Core
[Figure: the Genetic Core attached to a 2x2 mesh of cells Cell-00, Cell-01, Cell-10, Cell-11 with switches SW00, SW01, SW10, SW11]
  • Population size and mutation rate are
    configurable.
  • Elite count is constant and equal to two in order
    to reduce the hardware complexity
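
A software analogue of one generation step, with the elite count fixed at two, might look as follows. Only the elite count, the configurable population size and mutation rate, and the lower-is-better fitness (clock cycles) come from the slides; the parent selection, single-point crossover and bit-flip mutation are illustrative assumptions, and `next_generation` is a hypothetical name.

```python
import random

ELITE_COUNT = 2  # fixed at two, as stated on the slide


def next_generation(population, fitness, mutation_rate, rng=random):
    """Produce one new population of bit-string chromosomes.

    Lower fitness is better (fitness = clock cycles per iteration).
    Parent selection, single-point crossover and bit-flip mutation are
    illustrative choices, not the documented hardware operators.
    """
    ranked = sorted(population, key=fitness)
    new_pop = ranked[:ELITE_COUNT]  # the two best chromosomes survive unchanged
    parents = ranked[:max(2, len(ranked) // 2)]  # breed from the better half
    while len(new_pop) < len(population):
        a, b = rng.sample(parents, 2)
        cut = rng.randrange(1, len(a))  # single-point crossover position
        child = a[:cut] + b[cut:]
        child = "".join(
            ("1" if bit == "0" else "0") if rng.random() < mutation_rate else bit
            for bit in child
        )
        new_pop.append(child)
    return new_pop
```

The population size is simply `len(population)`, so both configurable parameters from the slide appear as inputs. The sketch expects at least two chromosomes of at least two bits each.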

31
EvoMP Challenges
  • The current version uses a centralized memory
    unit.
  • It is located at address 00.
  • This address does not contain computational
    circuits.
  • This is a major issue for scalability.
  • The search space of the genetic algorithm is very
    large.
  • It grows exponentially with a linear increase in
    the number of processors.

32
PSO Core [8]
33
  • Introduction and Motivations
  • Basics of Evolvable Multiprocessor System (EvoMP)
  • EvoMP Operational View
  • EvoMP Architectural View
  • Simulation and Synthesis Results
  • Summary

34
Configurable Parameters
  • There are some configurable parameters in EvoMP
  • Word-length of the system
  • Size of the mesh (number of processors)
  • Flit length (bit-length of NoC switch links)
  • Population size
  • Crossover rate

35
Simulation Results
  • Two sets of applications are used for performance
    evaluation:
  • Some DSP programs
  • Some sample neural networks
  • Two other decomposition and scheduling methods
    are implemented to enable comparison:
  • Static Decomposition Genetic Scheduler (SDGS)
  • Decomposition is performed statically, i.e. tasks
    are predetermined manually.
  • The genetic core only specifies the scheduling
    scheme.
  • Static Decomposition First Free Scheduler (FF)
  • Assigns the first task in the job queue to the
    first free processor in the system.
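
The first-free policy is simple enough to sketch directly. The version below takes a queue of task durations and hands each task, in order, to whichever processor becomes idle first. It deliberately ignores data dependencies and NoC communication cost, which the real system has to handle, and the name `first_free_schedule` is hypothetical.

```python
def first_free_schedule(task_cycles, num_procs):
    """Assign queued tasks, in order, to the first processor to become free.

    `task_cycles` lists each task's duration in clock cycles. Dependencies
    and communication cost are ignored in this sketch.
    Returns (assignment, makespan), where assignment[i] is the index of the
    processor that runs task i.
    """
    free_at = [0] * num_procs  # cycle at which each processor becomes idle
    assignment = []
    for cycles in task_cycles:
        # Earliest-idle processor; ties go to the lowest index.
        proc = min(range(num_procs), key=lambda p: free_at[p])
        assignment.append(proc)
        free_at[proc] += cycles
    return assignment, max(free_at)


assignment, makespan = first_free_schedule([3, 1, 2, 2], num_procs=2)
```

Because the policy never looks ahead in the queue, it can produce schedules that an evolved assignment would beat, which is the comparison the results slides make.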

36
16-Tap FIR Filter
  • Parameters
  • 16-bit mode
  • Population size: 16
  • Crossover rate: 8
  • NoC connection width: 16

Best fitness is the number of clock cycles
required to execute one iteration using the best
particle found so far.
  • 74 instructions
  • 16 multiplications

37
8-Point DCT
  • Parameters
  • 16-bit mode
  • Population size: 16
  • Crossover rate: 8
  • NoC connection width: 16
  • 88 instructions
  • 32 multiplications

38
16-point DCT
  • Parameters
  • 16-bit mode
  • Population size: 16
  • Crossover rate: 6
  • NoC connection width: 16
  • 320 instructions
  • 128 multiplications

39
5x5 Matrix Multiplication
  • Parameters
  • 16-bit mode
  • Population size: 16
  • Crossover rate: 6
  • NoC connection width: 16
  • 406 instructions
  • 125 multiplications

40
Mesh             Scheme       Metric                           FIR-16   DCT-8  DCT-16  MATRIX-5x5
-                -            Number of Instructions               74      88     324         406
-                -            Number of Multiply Instructions      16      32     128         125
1x2 (One Proc.)  All three    Fitness (clock cycles)              350     671    2722        3181
1x2 (One Proc.)  All three    Speed-up                              1       1       1           1
1x3              Main Design  Fitness (clock cycles)              214     403    1841        2344
1x3              Main Design  Speed-up                           1.63    1.66    1.47        1.37
1x3              Main Design  Evolution Time (us)               27342   42807   74582      198384
1x3              SDGS         Fitness (clock cycles)              202     401    1812        2218
1x3              SDGS         Speed-up                           1.73    1.67    1.50        1.43
1x3              SDGS         Evolution Time (us)                1967   29315   84365       65119
1x3              First Free   Fitness (clock cycles)              293     733    2529        2487
1x3              First Free   Speed-up                           1.19    0.91    1.08        1.27
2x2              Main Design  Fitness (clock cycles)              171     319    1460        1868
2x2              Main Design  Speed-up                           2.04    2.10    1.86        1.70
2x2              Main Design  Evolution Time (us)               30174   54790   23319      294828
2x2              SDGS         Fitness (clock cycles)              161     306    1189        1817
2x2              SDGS         Speed-up                           2.17    2.19    2.28        1.75
2x2              SDGS         Evolution Time (us)               10739   52477  536565       10092
2x2              First Free   Fitness (clock cycles)              239     681    1933        2098
2x2              First Free   Speed-up                           1.46    0.98    1.40        1.51
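
The speed-up rows appear to be the one-processor fitness divided by the fitness of the configuration in question. A short check on the FIR-16 Main Design figures, under that assumption (the function name `speed_up` is chosen here for illustration):

```python
def speed_up(single_proc_cycles, cycles):
    """Speed-up relative to the one-processor (1x2 mesh) baseline."""
    return single_proc_cycles / cycles


# FIR-16, Main Design: 350 cycles on 1x2, 214 on 1x3, 171 on 2x2
fir16 = {
    "1x3": speed_up(350, 214),
    "2x2": speed_up(350, 171),
}
```

These ratios come out at about 1.635 and 2.047, against 1.63 and 2.04 in the table, so the slide values look truncated rather than rounded.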
41
Mesh             Scheme       Metric                           FIR-16       DCT-8  DCT-16  MATRIX-5x5
-                -            Number of Instructions               74          88     324         406
-                -            Number of Multiply Instructions      16          32     128         125
1x2 (One Proc.)  All three    Fitness (clock cycles)              350         671    2722        3181
1x2 (One Proc.)  All three    Speed-up                              1           1       1           1
2x3              Main Design  Fitness (clock cycles)       Unevaluated        285    1213        1596
2x3              Main Design  Speed-up                     Unevaluated       2.33    2.25        1.99
2x3              Main Design  Evolution Time (us)          Unevaluated      93034  630482      546095
2x3              SDGS         Fitness (clock cycles)       Unevaluated        256    1106        1575
2x3              SDGS         Speed-up                     Unevaluated       2.62    2.46        2.01
2x3              SDGS         Evolution Time (us)          Unevaluated      41023  111118      178219
2x3              First Free   Fitness (clock cycles)       Unevaluated        496    1587        1815
2x3              First Free   Speed-up                     Unevaluated       1.35    1.71        1.75
42
Neural Network Case Study
                                   1x2 mesh           1x3 mesh                 2x2 mesh                 2x3 mesh
Network   # of Instr.  # of Mult.  Fitness  Speed-up  Fitness  Speed-up  Time  Fitness  Speed-up  Time  Fitness  Speed-up  Time
4-4-1              58          20      450         1      281      1.60   125      245      1.83    52      207      2.17   262
3-9-2              95          45      905         1      570      1.59    52      503      1.80   163      463      1.95   342
12-20-10          924         440     8304         1     5153      1.61   892     4365      1.90  1832     3813      2.18  3436
43
Fault Tolerance Results
  • When a fault is detected in a processor, the
    evolutionary core excludes it from contributing
    to subsequent iterations.
  • It also returns to the evolution stage to find a
    suitable solution for the new situation.
  • The best fitness obtained in a 2x3 EvoMP for the
    16-point DCT program is evaluated.
  • Faults are injected into processors 010, 001 and
    101 at 1000000 us, 2000000 us and 3000000 us,
    respectively.

44
Genetic vs. PSO
  • Population size in both experiments is 16

Program characteristics
Program   # of Instr.  # of Multiplies  Particle length (bits)  1x2 mesh Fit (both)
FIR-16             74               16                     240                  350
DCT-8              88               32                     280                  671
DCT-16            324              128                     720                 2722
MAT-5x5           406              125                     800                 3181

Genetic vs. PSO per mesh
Program   Mesh  Genetic Fit  Genetic Time  PSO Fit      PSO Time
FIR-16    1x3   214          23.7          211          12.3
FIR-16    2x2   171          30.1          174          14.3
FIR-16    2x3   unevaluated  unevaluated   unevaluated  unevaluated
DCT-8     1x3   403          93.0          393          6.2
DCT-8     2x2   319          99.8          308          21.8
DCT-8     2x3   285          138.1         203          15.6
DCT-16    1x3   1841         74.5          1831         41.7
DCT-16    2x2   1460         23.3          1439         45.3
DCT-16    2x3   1213         633.7         1191         98.3
MAT-5x5   1x3   2344         198.3         2312         86.3
MAT-5x5   2x2   1868         294.8         1821         148.3
MAT-5x5   2x3   1596         546.7         1518         240.9

45
Synthesis Results
  • Synthesis results on a Virtex-II (XC2V3000) FPGA
    using Synplify Pro.

                   NoC switch  Genetic Core  PSO Core   MMU         Processor   Total System
Area (Total LUTs)  729 (2%)    1864 (6%)     1642 (5%)  3553 (12%)  4433 (15%)  20112 (70%)
Max Freq. (MHz)    -           68.4          94.6       -           -           61.4
46
  • Introduction and Motivations
  • Basics of Evolvable Multiprocessor System (EvoMP)
  • EvoMP Operational View
  • EvoMP Architectural View
  • Simulation and Synthesis Results
  • Summary

47
Summary
  • EvoMP, a novel MPSoC system, was studied.
  • EvoMP exploits evolutionary strategies to perform
    run-time task decomposition and scheduling.
  • EvoMP does not require concurrent code because
    it can parallelize sequential code at run-time.
  • It exploits a particular and novel processor
    architecture to address the data dependency
    problem.
  • Experimental results confirm the applicability of
    EvoMP's novel ideas.

48
Main References
  • [1] N. S. Voros and K. Masselos, System Level
    Design of Reconfigurable Systems-on-Chip.
    Netherlands: Springer, 2005.
  • [2] G. Martin, "Overview of the MPSoC design
    challenge," Proc. Design and Automation Conf.,
    July 2005, pp. 274-279.
  • [3] S. Amarasinghe, "Multicore programming primer
    and programming competition," class notes for
    6.189, Computer Architecture Group, Massachusetts
    Institute of Technology. Available:
    www.cag.csail.mit.edu/ps3/lectures/6.189-lecture1-intro.pdf
  • [4] M. Hubner, K. Paulsson, and J. Becker,
    "Parallel and flexible multiprocessor
    system-on-chip for adaptive automotive
    applications based on Xilinx MicroBlaze
    soft-cores," Proc. Intl. Symp. Parallel and
    Distributed Processing, 2005.
  • [5] D. Gohringer, M. Hubner, V. Schatz, and J.
    Becker, "Runtime adaptive multi-processor
    system-on-chip: RAMPSoC," Proc. Intl. Symp.
    Parallel and Distributed Processing, April 2008,
    pp. 1-7.
  • [6] A. Klimm, L. Braun, and J. Becker, "An
    adaptive and scalable multiprocessor system for
    Xilinx FPGAs using minimal sized processor
    cores," Proc. Symp. Parallel and Distributed
    Processing, April 2008, pp. 1-7.
  • [7] Z.Y. Wen and Y.J. Gang, "A genetic algorithm
    for tasks scheduling in parallel multiprocessor
    systems," Proc. Intl. Conf. Machine Learning and
    Cybernetics, Nov. 2003, pp. 1785-1790.
  • [8] A. Farmahini-Farahani, S. Vakili, S. M.
    Fakhraie, S. Safari, and C. Lucas, "Parallel
    scalable hardware implementation of asynchronous
    discrete particle swarm optimization," Elsevier
    J. of Engineering Applications of Artificial
    Intelligence, submitted for publication.

49
Main References (2)
  • [9] A. A. Jerraya and W. Wolf, Multiprocessor
    Systems-on-Chips. San Francisco: Morgan Kaufmann
    Publishers, 2005.
  • [10] A.J. Page and T.J. Naughton, "Dynamic task
    scheduling using genetic algorithms for
    heterogeneous distributed computing," Proc. Intl.
    Symp. Parallel and Distributed Processing, April
    2005, pp. 189.1.
  • [11] E. Carvalho, N. Calazans, and F. Moraes,
    "Heuristics for dynamic task mapping in NoC based
    heterogeneous MPSoCs," Proc. Int. Rapid System
    Prototyping Workshop, 2007, pp. 34-40.
  • [12] R. Canham and A. Tyrrell, "An embryonic
    array with improved efficiency and fault
    tolerance," Proc. NASA/DoD Conf. on Evolvable
    Hardware, July 2003, pp. 265-272.
  • [13] W. Barker, D. M. Halliday, Y. Thoma, E.
    Sanchez, G. Tempesti, and A. Tyrrell, "Fault
    tolerance using dynamic reconfiguration on the
    POEtic Tissue," IEEE Trans. Evolutionary
    Computing, vol. 11, no. 5, Oct. 2007, pp.
    666-684.

50
Related Publications
  • Journal
  • S. Vakili, S. M. Fakhraie, and S. Mohammadi,
    "EvoMP: a novel MPSoC architecture with evolvable
    task decomposition and scheduling," submitted to
    IET Comp. Digital Tech. (under revision).
  • S. Vakili, S. M. Fakhraie, and S. Mohammadi,
    "Low-cost fault tolerance in evolvable
    multiprocessor system: a graceful degradation
    approach," submitted to Journal of Zhejiang
    University SCIENCE A (JZUS-A).
  • Conference
  • S. Vakili, S. M. Fakhraie, and S. Mohammadi,
    "Designing an MPSoC architecture with run-time
    and evolvable task decomposition and scheduling,"
    Proc. 5th IEEE Intl. Conf. Innovations in
    Information Technology, Dec. 2008.
  • S. Vakili, S. M. Fakhraie, S. Mohammadi, and A.
    Ahmadi, "Particle swarm optimization for run-time
    task decomposition and scheduling in evolvable
    MPSoC," Proc. IEEE Intl. Conf. Computer
    Engineering and Technology, Jan. 2009.