1
Lab 2: Parallel processing using NIOS II processors
  • CEG 4131 Computer Architecture III
  • Miodrag Bolic

2
Overview
  • You will learn how to:
  • Design multiprocessing systems that use shared
    memories
  • Partition a sequential program so that it can be
    implemented on multiple processors
  • Synchronize a multiprocessing system
  • Time: 3 weeks
  • Points: 115 (there is an optional task)

3
Overview
  • Part 1
  • Design a multiprocessing system by following the
    steps from the tutorial. Run and debug the
    program that comes with the tutorial.
  • Part 2
  • Use the same hardware designed in part 1
  • Develop a program for parallel matrix
    multiplication and run it on the multiprocessing
    system
  • Compute the speedup by comparing the run time of
    the program on a single processor with its run
    time on the multiprocessing system

4
Part 1
  • Copy the project
    C:\altera\kits\nios2\examples\vhdl\niosII_stratix_1s10\standard
    to your home directory.
  • Go through the steps of the tutorial "Creating
    Multiprocessor Nios II System Tutorial". You can
    download the tutorial from
    tt_nios2_multiprocessor_tutorial.pdf and a program
    from
    http://www.altera.com/literature/tt/hello_world_multi.c
  • Modification: On page 30 of the tutorial, choose
    the NIOS II/s core for CPU3 instead of NIOS II/e.
    All three cores have to be NIOS II/s. Change the
    instruction cache size for all three of them to
    4 Kbytes.
  • Before generating and compiling on page 36 of the
    tutorial, do the following:
  • Add a performance counter in the same way as in
    Lab 1. Connect the performance_counter only to
    the data master of CPU1.
  • Add an on-chip memory block and configure it as
    shown on the next page. Connect the s1 port to
    cpu1/data_master and cpu2/data_master. Connect
    the s2 port to cpu3/data_master.
  • Continue with the tutorial.

5
On-chip memory configuration
6
Task 1 Demonstration and Questions
  • Show the TA that the program is working (20
    points)
  • Questions
  • Describe the program in detail.
  • Why do we need a mutex?
  • If processor 1 gets the mutex for the memory
    message_buffer_ram, can processor 2 write to
    this memory before processor 1 releases the
    mutex?
  • Can processor 1 store two messages in the buffer?

7
Part 2
  • In this part, the same hardware configuration
    will be used.
  • You will design a program for parallel matrix
    multiplication.
  • Problem
  • There is an input/output module which receives
    and stores data in matrices M1 and M2. We will
    simulate this module using the shared_memory
    module that we added in the first part of the
    lab. Our program multiplies these two matrices
    and stores the result C in the same module
    (memory).

8
Sequential solution
  • Program the Altera chip using the same
    configuration from Part 1.
  • Modify the matrix_performance.c file so that
    matrices M1, M2 and C are transferred to the
    shared_memory. Do this step before activating the
    performance counter. Change the number of
    iterations of the matrix multiplication from 100
    to 1000.
  • Change the C/C++ options in your project and the
    syslib project from Debug to Release.
  • Run the code and present the performance counter
    results and the matrix C obtained in iteration
    1000. A sketch of this sequential measurement is
    shown below.
  • Demonstration: show the result to the TA.
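A minimal sketch of what the sequential measurement could look like, assuming the shared on-chip memory appears in system.h as SHARED_MEMORY_BASE and that the Lab 1 performance counter macros are available. The flat matrix layout, the placeholder input data and the function names are illustrative assumptions, not the actual contents of matrix_performance.c.

    #include <stdio.h>
    #include "system.h"                              /* SHARED_MEMORY_BASE and PERFORMANCE_COUNTER_BASE are assumed names */
    #include "altera_avalon_performance_counter.h"

    #define N          10
    #define ITERATIONS 1000

    /* Assumed layout of the shared on-chip memory: M1, M2 and C stored back to back. */
    volatile int *M1 = (volatile int *) SHARED_MEMORY_BASE;
    volatile int *M2 = (volatile int *) SHARED_MEMORY_BASE + N * N;
    volatile int *C  = (volatile int *) SHARED_MEMORY_BASE + 2 * N * N;

    static void matrix_multiply(void)
    {
        int i, j, k;
        for (i = 0; i <= 9; i++)
            for (j = 0; j <= 9; j++) {
                int sum = 0;
                for (k = 0; k <= 9; k++)
                    sum += M1[i * N + k] * M2[k * N + j];
                C[i * N + j] = sum;
            }
    }

    int main(void)
    {
        int i, j, n;

        /* Transfer M1 and M2 to the shared memory before the measurement starts. */
        for (n = 0; n < N * N; n++) { M1[n] = n; M2[n] = n % N; }   /* placeholder data */

        PERF_RESET(PERFORMANCE_COUNTER_BASE);
        PERF_START_MEASURING(PERFORMANCE_COUNTER_BASE);
        PERF_BEGIN(PERFORMANCE_COUNTER_BASE, 1);

        for (n = 0; n < ITERATIONS; n++)
            matrix_multiply();

        PERF_END(PERFORMANCE_COUNTER_BASE, 1);
        PERF_STOP_MEASURING(PERFORMANCE_COUNTER_BASE);
        perf_print_formatted_report((void *) PERFORMANCE_COUNTER_BASE,
                                    ALT_CPU_FREQ, 1, "1000 multiplications");

        /* Print the matrix C obtained in the last (1000th) iteration. */
        for (i = 0; i < N; i++) {
            for (j = 0; j < N; j++)
                printf("%d ", C[i * N + j]);
            printf("\n");
        }
        return 0;
    }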

9
Parallel solution
  • CPU 1 will be used for synchronization and for
    I/O operations, while CPUs 2 and 3 are used for
    the multiplication. CPUs 2 and 3 work in
    single-program multiple-data (SPMD) mode: they
    start each iteration at the same time and execute
    the same code, but on different data. After they
    finish the multiplication, they signal CPU1. The
    program repeats the multiplication of the
    matrices 1000 times.

10
Parallel matrix multiplication
  • CPU1 transfers M1 and M2 to the shared_memory.
  • Algorithm
  • The sequential program is shown below. In the
    parallel implementation, CPU 2 will execute the i
    loop from 0 to 4, and CPU 3 will execute the i
    loop from 5 to 9. CPUs 2 and 3 perform their
    operations at the same time; a sketch of the
    partitioned loop follows the code below.
    for (i = 0; i <= 9; i++)
      for (j = 0; j <= 9; j++) {
        C[i][j] = 0;
        for (k = 0; k <= 9; k++)
          C[i][j] += M1[i][k] * M2[k][j];
      }
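A minimal sketch of how the i loop could be split between the two compute CPUs. The function name, the row-range parameters and the flat layout of the matrices in shared memory are illustrative assumptions:

    #define N 10

    /* CPU 2 and CPU 3 run the same function on different row ranges:
     * CPU 2 with row_start = 0, row_end = 4; CPU 3 with row_start = 5, row_end = 9. */
    void multiply_rows(volatile int *M1, volatile int *M2, volatile int *C,
                       int row_start, int row_end)
    {
        int i, j, k;
        for (i = row_start; i <= row_end; i++)
            for (j = 0; j <= 9; j++) {
                C[i * N + j] = 0;
                for (k = 0; k <= 9; k++)
                    C[i * N + j] += M1[i * N + k] * M2[k * N + j];
            }
    }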

11
Synchronization
  • The variables status_start and status_done will
    be shared variables used for synchronization. All
    three processors will access these variables
    using the mutex. They will be stored in the
    message_buffer_ram memory.
  • It is extremely important that both CPU2 and CPU3
    start the matrix multiplication at the same time.
    This will not happen automatically, since they
    are booted from the same memory, so CPU1 has to
    ensure that both CPU2 and CPU3 start at the same
    time. The shared variable status_start will be
    used for that. CPU1 sets this variable to 1, and
    CPU2 and CPU3 each increment it before they start
    the matrix multiplication. When status_start
    reaches 3, CPU2 and CPU3 start the matrix
    multiplication and CPU1 starts measuring time
    using the performance_counter.
  • At the beginning, CPU 1 will set status_done to
    1. After CPU 2 and CPU 3 finish 1000 iterations
    of the 10x10 matrix multiplication, they each
    increment status_done. CPU 1 periodically reads
    status_done, and when it is 3 the program is
    over. CPU1 then stops the performance_counter and
    prints the performance_counter result and the
    matrix C from the 1000th iteration on the
    terminal. A sketch of this handshake is shown
    below.
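The following sketch shows one way the status_start/status_done handshake could look using the Altera hardware mutex API from altera_avalon_mutex.h. The mutex device name, the placement of the two flags at the start of message_buffer_ram and the function names are assumptions for illustration, not the required solution.

    #include "system.h"                   /* MESSAGE_BUFFER_RAM_BASE and MESSAGE_BUFFER_MUTEX_NAME are assumed names */
    #include "altera_avalon_mutex.h"

    /* Assumed placement of the shared flags at the start of message_buffer_ram. */
    volatile int *status_start = (volatile int *) MESSAGE_BUFFER_RAM_BASE;
    volatile int *status_done  = (volatile int *) MESSAGE_BUFFER_RAM_BASE + 1;

    /* Update a shared flag while holding the hardware mutex. */
    static void locked_add(alt_mutex_dev *mutex, volatile int *flag, int delta)
    {
        altera_avalon_mutex_lock(mutex, 1);   /* spins until this CPU owns the mutex */
        *flag += delta;
        altera_avalon_mutex_unlock(mutex);
    }

    /* CPU1: release the compute CPUs, then wait until both report completion. */
    void cpu1_run(void)
    {
        alt_mutex_dev *mutex = altera_avalon_mutex_open(MESSAGE_BUFFER_MUTEX_NAME);

        altera_avalon_mutex_lock(mutex, 1);
        *status_done  = 1;                    /* initial values, written under the mutex */
        *status_start = 1;                    /* CPU2 and CPU3 each add 1 to this flag */
        altera_avalon_mutex_unlock(mutex);

        while (*status_start < 3)
            ;                                 /* == 3: both CPUs checked in; start the performance counter here */
        while (*status_done < 3)
            ;                                 /* == 3: both CPUs finished; stop the counter and print C */
    }

    /* CPU2 and CPU3 run the same code. */
    void compute_cpu_run(void)
    {
        alt_mutex_dev *mutex = altera_avalon_mutex_open(MESSAGE_BUFFER_MUTEX_NAME);

        while (*status_start == 0)
            ;                                 /* wait for CPU1 to set status_start to 1 */
        locked_add(mutex, status_start, 1);   /* check in */
        while (*status_start < 3)
            ;                                 /* begin only when all three CPUs have checked in */

        /* ... 1000 iterations of the row-range matrix multiplication ... */

        locked_add(mutex, status_done, 1);    /* report completion to CPU1 */
    }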

12
Task 2 - Questions
  • What is the speedup of the parallel
    implementation compared with the sequential one?
    Comment on the speedup result.
  • Why can we design the program for matrix
    multiplication without using mutexes (except for
    the synchronization)?

13
Demonstration (40 points)
  • Send the matrix C from the 1000th iteration of
    the matrix multiplication algorithm to the
    terminal through the JTAG UART. Also send the
    number of clock cycles from the performance
    counter.
  • Show this result to the TA. Explain to the TA how
    your parallel matrix multiplication program works
    and how you achieved synchronization. You will
    lose 10 points if the speedup is less than 1.

14
Optional part - Synchronization
  • If our program emulated a real system, CPU1
    would have to synchronize CPU2 and CPU3 after
    each iteration of the 10x10 matrix
    multiplication, not after 1000 of them. So, in a
    real program, after each 10x10 matrix
    multiplication CPU1 would perform some operations
    on the computed matrix C and initialize a new
    iteration of the 10x10 matrix multiplication once
    matrices M1 and M2 are ready.
  • In this part of the lab, you will use the
    iteration_done variable to notify CPU1 that one
    iteration of the 10x10 matrix multiplication is
    done. An additional shared variable is needed to
    start the next iteration. Let's call it
    start_next_iteration.
  • The program works as follows. At the beginning,
    CPU1 sets start_next_iteration. Once the 10x10
    multiplication iteration starts, CPU2 and CPU3
    reset this variable. After CPU2 and CPU3 are done
    with the execution of their part of the 10x10
    matrix multiplication, they increment
    iteration_done and wait for start_next_iteration
    to be set. CPU1 checks whether iteration_done is
    equal to 3 and, if it is, sets
    start_next_iteration. The new iteration of the
    10x10 matrix multiplication can then start. A
    sketch of this per-iteration handshake is shown
    below.
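A sketch of what this per-iteration handshake could look like. It follows the description above; locked_add() is the same mutex-protected helper as in the slide 11 sketch, and the loop bounds, function names and the way the flag is cleared are illustrative assumptions (in particular, clearing start_next_iteration from both compute CPUs needs care in a real implementation).

    #include "altera_avalon_mutex.h"

    /* Mutex-protected update of a shared flag (same helper as on slide 11). */
    static void locked_add(alt_mutex_dev *mutex, volatile int *flag, int delta)
    {
        altera_avalon_mutex_lock(mutex, 1);
        *flag += delta;
        altera_avalon_mutex_unlock(mutex);
    }

    /* CPU1: per-iteration loop. */
    void cpu1_iterations(alt_mutex_dev *mutex,
                         volatile int *iteration_done,
                         volatile int *start_next_iteration)
    {
        int n;
        for (n = 0; n < 1000; n++) {
            *iteration_done       = 1;     /* CPU1's contribution to the count */
            *start_next_iteration = 1;     /* release CPU2 and CPU3 */
            while (*iteration_done < 3)
                ;                          /* both compute CPUs finished this iteration */
            /* ... operate on the computed matrix C, load the next M1 and M2 ... */
        }
    }

    /* CPU2 / CPU3: per-iteration loop (same code, different row range). */
    void compute_cpu_iterations(alt_mutex_dev *mutex,
                                volatile int *iteration_done,
                                volatile int *start_next_iteration,
                                int row_start, int row_end)
    {
        int n;
        for (n = 0; n < 1000; n++) {
            while (*start_next_iteration == 0)
                ;                                  /* wait for CPU1 to set the start flag */
            *start_next_iteration = 0;             /* reset once the iteration has started */
            /* multiply_rows(M1, M2, C, row_start, row_end); */
            locked_add(mutex, iteration_done, 1);  /* report this iteration as done */
        }
    }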

15
Optional part - Demonstration and Questions
  • Question
  • What is the speedup of this program?
  • Demonstration (10 optional points)
  • Send the sum of the elements of matrix C for each
    iteration of the 10x10 matrix multiplication
    algorithm to the terminal through the JTAG UART.
    Also send the number of clock cycles from the
    performance counter.
  • Show this result to the TA. Explain to them how
    you achieved synchronization.

16
What to submit
  • The report contains the following (30 points):
  • Title page
  • Description of your system with a picture of the
    SOPC Builder system components
  • Detailed description of your solution: the
    algorithm for parallel matrix multiplication and
    the synchronization.
  • Answers to the questions from Tasks 1-2.
  • Conclusions
  • Page 17 of this document signed by the TA.
  • Soft copies of the report and of the source code
    of the programs for sequential and parallel
    multiplication with basic comments (.c files),
    and the Quartus II files .sof and .ptf (10
    points).
  • Optional: Description of the synchronization
    method and the speedup for the optional part as a
    part of the report. Soft copy of the algorithm
    for matrix multiplication. (5 points)

17
Lab 2 Signature page
  • Student name
  • Student name

Part                Demonstrated (TA's signature)   Performance_counter result (time)   Points
Part 1              ____________                    ____________                        ____/20
Part 2 sequential   ____________                    ____________
Part 2 parallel     ____________                    ____________                        ____/40
Part 2 optional     ____________                    ____________                        ____/10
Total                                                                                   ____