1
Efficient Exploitation of Fine-Grained
Parallelism using a microHeterogeneous Computing
Environment
Presented by William Scheidel, September 19th, 2002
  • Computer Engineering Department

2
Agenda
  • Objective
  • Heterogeneous Computing Background
  • microHeterogeneous Computing Architecture
  • microHeterogeneous Framework Implementation
  • Simulation Results
  • Conclusion
  • Future Work
  • Acknowledgements

3
Objectives
  • The main objective of this thesis is to propose a
    new computing paradigm, called microHeterogeneous
    computing or mHC, which incorporates processing
    elements (vector processors, digital signal
    processors, etc.) into a general-purpose machine
    using the high-performance I/O buses that are
    available
  • This architecture is then used to efficiently
    exploit fine-grained parallelism

4
Heterogeneous Computing
  • An architecture that provides an assortment of
    high-performance machines for use by an
    application
  • Arose from the realization that no single machine
    is capable of performing all tasks in an optimal
    manner
  • Machines differ in both speed and capabilities,
    and are connected using high-speed, high-bandwidth
    intelligent interconnects that handle the
    intercommunication between the machines.

5
Motivation For Heterogeneous Computing
  • Hypothetical example of the advantage of using a
    heterogeneous suite of machines, where the
    heterogeneous suite time includes inter-machine
    communication overhead. Not drawn to scale.

6
Heterogeneous Computing Driving Application: Image Understanding
  • Well suited to a heterogeneous environment due to
    its complexity and involvement of different types
    of parallelism
  • Consists of three main levels of processing, each
    level containing a different type of parallelism
  • Three main levels can also be executed in
    parallel
  • Lowest Level
  • Consists of pixel-based operators and pixel
    subset operators such as edge detection
  • Highest amount of parallelism
  • Best suited to mesh connected SIMD machines
  • Intermediate Level
  • Grouping and organization of features previously
    extracted
  • Communication is irregular, parallelism decreases
    as features are grouped
  • Best suited to medium-grained MIMD machines
  • Highest Level
  • Knowledge processing
  • Uses the data from the previous levels in order
    to infer semantic attributes about an image
  • Requires coarse-grained loosely coupled MIMD
    machines

7
Heterogeneous Computing Driving Application: Image Understanding
  • The three levels of processing required for image
    understanding.
  • Each of the levels contains large amounts of
    varying kinds of parallelism that can be
    exploited by heterogeneous computing

8
Heterogeneous Computing Broad Issues
  • Analytical benchmarking
  • Code-Type or task profiling
  • Matching and Scheduling
  • Programming environments
  • Interconnection requirements

9
Execution Steps of Heterogeneous Applications
  • Analytical Benchmarking - determines the optimal
    speedup that a particular machine can achieve on
    different types of tasks
  • Code Profiling - determines the modes of
    computation that exist in each program segment as
    well as the execution times
  • Task Scheduling - tasks are mapped to a
    particular machine using some form of scheduling
    heuristic
  • Task Execution - the tasks are executed on the
    assigned machine

10
HC Issues Analytical Benchmarking
  • Measure of how well a given machine is able to
    perform on a certain type of code
  • Required in HC to determine which types of code
    should be mapped to which machines
  • Benchmarking is an offline process
  • Example results
  • SIMD machines are well suited for matrix
    computations / low level image processing
  • MIMD machines are best suited for tasks that have
    limited intercommunication

11
HC Issues Code-type Profiling
  • Used to determine the types of parallelism in the
    code as well as execution time
  • Tasks are separated into segments which contain a
    homogeneous type of parallelism
  • These segments can then be matched to a
    particular machine that is best suited to execute
    them
  • Code-type profiling is an offline process

12
Code Profiling Example
  • Example results from the code-profiling of a
    task.
  • The task is broken into S segments, each of which
    contains embedded homogeneous parallelism.

13
HC Issues Matching and Scheduling
  • Goal is to map code-types to the best suited
    machine
  • Costs must be carefully weighed
  • Computation Costs - the execution time of a code
    segment depends on the machine it is run on as
    well as the current workload of that machine
  • Communication Costs - dependent on the type of
    interconnect used and its bandwidth
  • Interference Costs - resource contention that
    occurs when multiple tasks are assigned to a
    particular machine
  • Problem determined to be NP-hard even for
    homogeneous environments
  • Addition of heterogeneous processing elements
    adds to the complexity
  • A large number of heuristic algorithms have been
    designed to schedule tasks to machines on
    heterogeneous computing systems.
  • Most such heuristic algorithms are static and
    assume that the ETC (expected time to compute) of
    every task on every machine is known from
    code-type profiling and analytical benchmarking.

14
Example Static Scheduling Heuristics for HC
  • Opportunistic Load Balancing (OLB) - assigns each
    task, in arbitrary order, to the next available
    machine.
  • User-Directed Assignment (UDA) - assigns each
    task, in arbitrary order, to the machine with the
    best expected execution time for the task.
  • Fast Greedy - assigns each task, in arbitrary
    order, to the machine with the minimum completion
    time for that task.
  • Min-min - the minimum completion time for each
    task is computed with respect to all machines. The
    task with the overall minimum completion time is
    selected and assigned to the corresponding
    machine. The newly mapped task is removed, and
    the process repeats until all tasks are mapped.
    (A minimal C sketch follows this list.)
  • Max-min - the Max-min heuristic is very similar
    to the Min-min algorithm. The set of minimum
    completion times is calculated for every task.
    The task with the overall maximum completion time
    from the set is selected and assigned to the
    corresponding machine.
  • Greedy or Duplex - the Greedy heuristic is
    literally a combination of the Min-min and
    Max-min heuristics, running both and using the
    better of the two solutions.
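
For concreteness, here is a minimal C sketch of the Min-min heuristic
described above, assuming a precomputed ETC matrix and per-machine ready
times; the function and variable names are illustrative, not part of the
thesis framework.

    #include <stdlib.h>

    /* Min-min sketch: etc[t][m] holds the expected time to compute task t
     * on machine m; ready[m] is the time at which machine m becomes free.
     * assignment[t] receives the machine chosen for each task. */
    void min_min(int n_tasks, int n_machines,
                 double etc[n_tasks][n_machines],
                 double ready[n_machines], int assignment[n_tasks])
    {
        int *mapped = calloc(n_tasks, sizeof(int));

        for (int done = 0; done < n_tasks; done++) {
            int best_task = -1, best_machine = -1;
            double best_ct = 0.0;

            /* For every unmapped task, find its minimum completion time,
             * then keep the task whose minimum is smallest overall. */
            for (int t = 0; t < n_tasks; t++) {
                if (mapped[t]) continue;
                for (int m = 0; m < n_machines; m++) {
                    double ct = ready[m] + etc[t][m];
                    if (best_task < 0 || ct < best_ct) {
                        best_ct = ct;
                        best_task = t;
                        best_machine = m;
                    }
                }
            }

            assignment[best_task] = best_machine;
            ready[best_machine] = best_ct;   /* machine is busy until then */
            mapped[best_task] = 1;
        }
        free(mapped);
    }

Max-min follows the same structure, except that the task with the largest
minimum completion time is selected in each round.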

15
Example Static Scheduling Heuristics for HC
  • GA - the Genetic Algorithm (GA) is used for
    searching large solution spaces. It operates on a
    population of chromosomes for a given problem.
    The initial population is generated randomly; a
    chromosome could also be generated by any other
    heuristic algorithm.
  • Simulated Annealing (SA) - an iterative technique
    that considers only one possible solution for
    each meta-task at a time. SA uses a procedure
    that probabilistically allows poorer solutions to
    be accepted, based on a system temperature, in an
    attempt to obtain a better search of the solution
    space.
  • GSA - the Genetic Simulated Annealing (GSA)
    heuristic is a combination of the GA and SA
    techniques.
  • Tabu - Tabu search is a solution-space search
    that keeps track of the regions of the solution
    space which have already been searched so as not
    to repeat a search near these areas.
  • A* - A* is a tree search beginning at a root node
    that is usually a null solution. As the tree
    grows, intermediate nodes represent partial
    solutions and leaf nodes represent final
    solutions. Each node has a cost function, and the
    node with the minimum cost function is replaced
    by its children. Each time a node is added, the
    tree is pruned by deleting the node with the
    largest cost function. This process continues
    until a complete mapping (a leaf node) is reached.

16
Example Static Scheduling Heuristics for HC: The Segmented Min-Min Algorithm
  • Every task has an ETC (expected time to compute)
    on a specific machine. If there are t tasks and m
    machines, we can obtain a t x m ETC matrix, where
    ETC(i, j) is the estimated execution time of
    task i on machine j.
  • The Segmented Min-min algorithm sorts the tasks
    according to their ETCs.
  • The tasks can be sorted into an ordered list by
    the average ETC, the minimum ETC, or the maximum
    ETC.
  • Then, the task list is partitioned into segments
    of equal size, and Min-min is applied to each
    segment in turn. (A sketch of the ordering and
    partitioning step follows.)
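
A small C sketch of the ordering and partitioning step, reusing the etc
matrix layout from the earlier Min-min sketch; schedule_segment() is a
placeholder for running Min-min over one segment.

    #include <stdlib.h>

    /* Illustrative Segmented Min-min: sort tasks by average ETC in
     * decreasing order, split them into equal segments, and schedule
     * each segment in turn. */
    typedef struct { int id; double key; } task_key_t;

    static int by_key_desc(const void *a, const void *b)
    {
        double ka = ((const task_key_t *)a)->key;
        double kb = ((const task_key_t *)b)->key;
        return (ka < kb) - (ka > kb);          /* larger key first */
    }

    static void schedule_segment(const task_key_t *seg, int len)
    {
        (void)seg; (void)len;   /* placeholder: run Min-min over seg[0..len-1] */
    }

    void segmented_min_min(int n_tasks, int n_machines, int n_segments,
                           double etc[n_tasks][n_machines])
    {
        task_key_t *order = malloc(n_tasks * sizeof *order);

        for (int t = 0; t < n_tasks; t++) {    /* key = average ETC of task t */
            double sum = 0.0;
            for (int m = 0; m < n_machines; m++)
                sum += etc[t][m];
            order[t].id  = t;
            order[t].key = sum / n_machines;
        }
        qsort(order, n_tasks, sizeof *order, by_key_desc);

        int seg_len = (n_tasks + n_segments - 1) / n_segments;
        for (int s = 0; s < n_tasks; s += seg_len) {
            int len = (s + seg_len <= n_tasks) ? seg_len : n_tasks - s;
            schedule_segment(&order[s], len);
        }
        free(order);
    }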

17
Segmented Min-Min Scheduling Heuristic
18
Example Dynamic Scheduling Heuristics for HC: HEFT Scheduling Heuristic
19
Heterogeneous Parallel Programming
  • Parallel Virtual Machine (PVM)
  • Enables a collection of heterogeneous computers
    to be used as a coherent and flexible concurrent
    computational resource
  • The unit of parallelism is the task, which is
    generally a process
  • The user can either view these resources as an
    attributeless collection of virtual processing
    elements or exploit the capabilities of specific
    machines in the host pool
  • Allows multifaceted virtual machines to be
    configured within the same framework and permits
    messages containing more than one data type to be
    exchanged between machines having different data
    representations
  • Message Passing Interface (MPI)
  • Each processor is assigned a rank which then
    determines which parts of the application that
    processor will execute
  • Division of tasks among processors is left solely
    up to the developer writing the application
  • Includes routines to do point-to-point
    communication between two processing elements,
    collective operations to simultaneously
    communicate information between all processing
    elements, and implicit as well as explicit
    synchronization
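
To make the rank-based division of work concrete, here is a minimal MPI
program in C; the partial-sum computation is purely illustrative.

    #include <mpi.h>
    #include <stdio.h>

    /* Each process computes a partial sum over its share of the iteration
     * space, determined by its rank; rank 0 collects the total. */
    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        const long N = 1000000;
        double local = 0.0, total = 0.0;
        for (long i = rank; i < N; i += size)   /* rank decides which work */
            local += 1.0 / (i + 1);

        /* Collective operation: combine all partial sums on rank 0. */
        MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("sum = %f\n", total);
        MPI_Finalize();
        return 0;
    }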

20
HC Issues Interconnection Requirements
  • The interconnection medium must support high
    bandwidths and low-latency communication (LANs
    won't cut it)
  • Complexity in a heterogeneous system increases
    since different machines use different protocols
    for communication
  • Must support both shared memory and message-based
    communication

21
Heterogeneous Computing Limitations
  • Task Granularity
  • Heterogeneous computing only utilizes
    coarse-grained parallelism to increase
    performance
  • Coarse-grained parallelism results in large task
    sizes and reduced coupling which allows the
    processing elements to work more efficiently
  • This requirement also carries over to most
    heterogeneous schedulers, since they are based on
    the scheduling of meta-tasks, i.e. tasks that
    have no dependencies
  • Communication Overhead
  • Tasks and their working sets must be transmitted
    over some form of network in order to execute
    them, so latency and bandwidth become crucial
    factors
  • Overhead is also incurred when encoding and
    decoding the data for different architectures
  • Cost
  • Machines used in heterogeneous computing
    environments can be prohibitively expensive
  • Expensive high speed, low latency networks are
    required to achieve the best performance
  • Not cost effective for applications where only a
    small portion of the code would benefit from such
    an environment

22
microHeterogeneous Computing
  • A new computing paradigm that attempts to most
    efficiently exploit the fine-grained parallelism
    found in most scientific computing applications
  • Environment is contained within a workstation and
    consists of a host processor and a number of
    additional PCI (or other high performance I/O
    bus) based processing elements
  • Elements might be DSP based, vector based, FPGA
    based, or even reconfigurable computing elements
  • In combination with a host processor, these
    elements create a small scale heterogeneous
    computing environment
  • An mHC specific API was developed that greatly
    simplifies using these types of devices for
    parallel applications

23
microHeterogeneous Computing Environment
24
Comparison Between Heterogeneous Computing and
microHeterogeneous Computing
  • Task Granularity
  • Heterogeneous environments only support
    coarse-grained parallelism, while the mHC
    environment instead focuses on fine-grained
    parallelism by using a tightly coupled shared
    memory environment
  • Task size is reduced to a single function call in
    a mHC environment
  • Drawbacks
  • Processing elements used in mHC are not nearly as
    powerful as the machines used in a standard
    heterogeneous environment
  • Only a small, finite number of processing
    elements can be added to a single machine
  • Communication Overhead
  • High performance I/O buses are twice as fast as
    the fastest network
  • Less overhead is incurred when encoding and
    decoding data since all processing elements use
    the same base architecture
  • Cost Effectiveness
  • Machines used in a heterogeneous environment can
    cost tens of thousands of dollars each, and
    require the extra expense of the high-speed, low
    latency interconnects to achieve acceptable
    performance
  • mHC processing elements cost only hundreds of
    dollars

25
Comparison Between Heterogeneous Computing and
microHeterogeneous Computing
  • Analytical Benchmarking and Profiling
  • Analytical benchmarking is used for the same
    purpose in both computing environments. The
    capabilities of each processing element or
    machine must be known before program execution
    begins so the scheduling algorithm is able to
    determine an efficient mapping of tasks.
  • Profiling, while necessary in heterogeneous
    environments, is not required in
    microHeterogeneous environments
  • Scheduling Heuristics
  • Scheduling in heterogeneous environments
    generally takes place at compile time rather than
    during execution
  • The scheduler for an mHC environment must be
    dynamic and map tasks in real-time in order to
    provide the best performance.

26
microHeterogeneous Computing API
  • An API was created for microHeterogeneous
    computing to provide a flexible and portable
    interface
  • User applications only need to make simple API
    calls; the mapping of tasks onto the available
    devices is performed automatically
  • There is no need for an application to be
    recompiled if the underlying implementation of
    microHeterogeneous Computing changes as long as
    the API is adhered to
  • The API supports a subset of the GNU Scientific
    Library (GSL)
  • GSL is a freely distributable scientific API
    written in C
  • Includes an extensive library of scientific
    functions and data types
  • Directly supports the Basic Linear Algebra
    Subprograms (BLAS)
  • Data structures are compatible with those used by
    the Vector, Signal, and Image Processing Library
    (VSIPL) that is becoming a standard on embedded
    devices
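
A hedged usage sketch of the API: mhc_vector_sub is the function named in
the device configuration example later, while mhc_init, mhc_join,
mhc_finalize, the mhc.h header, and the MHC_SCHED_WRTMM constant are
assumed names inferred from the initialization, join, and finalization
calls described in the framework phases; the gsl_vector calls are real GSL
functions.

    #include <gsl/gsl_vector.h>
    #include "mhc.h"            /* hypothetical mHC API header */

    int main(void)
    {
        /* Assumed initialization call: selects a scheduler and reads the
         * bus and device configuration files (parameters illustrative). */
        mhc_init("bus.xml", "devices.xml", MHC_SCHED_WRTMM);

        gsl_vector *a = gsl_vector_alloc(1024);
        gsl_vector *b = gsl_vector_alloc(1024);
        gsl_vector_set_all(a, 2.0);
        gsl_vector_set_all(b, 1.0);

        /* The API call creates a task; it returns as soon as the scheduler
         * has mapped the task to a device (a <- a - b, as in gsl_vector_sub). */
        mhc_vector_sub(a, b);

        mhc_join();             /* wait for outstanding tasks to complete */

        gsl_vector_free(a);
        gsl_vector_free(b);
        mhc_finalize();
        return 0;
    }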

27
microHeterogeneous Computing API (cont)
The microHeterogeneous Computing API provides
support for the following areas of scientific
computing:
  • Linear Algebra
  • Eigenvectors and Eigenvalues
  • Fast Fourier Transforms
  • Numerical Integration
  • Statistics
  • Vector Operations
  • Matrix Operations
  • Polynomial Solvers
  • Permutations
  • Combinations
  • Sorting

28
Suitable mHC Devices
  • XP-15
  • Developed by Texas Memory Systems
  • DSP based accelerator card
  • Performs 8 billion 32-bit floating point
    operations per second
  • Contains 256 MB of on-board DDR RAM
  • Supports over 500 different scientific functions
  • Increases FFT performance by 20x - 40x over a 1.4
    GHz Pentium 4
  • Pegasus-2
  • Developed by Catalina Research
  • Vector Processor based
  • Supports FFT, matrix and vector operations,
    convolutions, filters and more
  • Supported functions operate between 5x and 15x
    faster than a 1.4 GHz Pentium 4

29
microHeterogeneous Framework Implementation
  • Implemented as a dynamically linked library
    written purely in C that user applications
    interact with by way of the mHC API
  • The framework creates tasks from the API function
    calls and schedules them to the available
    processing elements

30
Phases of the microHeterogeneous Framework
  • Initialization
  • The framework must be initialized before it may
    be used
  • Scheduler and scheduler parameters chosen
  • Bus and Device Configuration Files read
  • Log file specified
  • Data structures created
  • Helper threads are created that move tasks from a
    device's task queue to the device. These threads
    are real-time threads that use a round-robin
    scheduling policy.
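
A minimal sketch of creating such a real-time, round-robin helper thread
with POSIX threads; the thread body, the priority value, and the device
argument are placeholders rather than the framework's actual code.

    #include <pthread.h>
    #include <sched.h>

    /* Helper thread body: would move tasks from a device's queue to the
     * device driver (placeholder). */
    static void *helper_thread(void *device)
    {
        (void)device;
        return NULL;
    }

    /* Create one helper thread per device with the SCHED_RR policy. */
    int spawn_helper(pthread_t *tid, void *device)
    {
        pthread_attr_t attr;
        struct sched_param sp = { .sched_priority = 1 };   /* placeholder */

        pthread_attr_init(&attr);
        pthread_attr_setinheritsched(&attr, PTHREAD_EXPLICIT_SCHED);
        pthread_attr_setschedpolicy(&attr, SCHED_RR);      /* round-robin RT */
        pthread_attr_setschedparam(&attr, &sp);

        int rc = pthread_create(tid, &attr, helper_thread, device);
        pthread_attr_destroy(&attr);
        return rc;
    }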

31
microHeterogeneous Computing Framework Overview
32
Device Configuration File
  • Determines what devices are available in the
    microHeterogeneous environment
  • File is XML based which makes it easy for other
    programs to generate and parse device
    configuration files
  • The following is configurable for each device
  • Unique ID
  • Name
  • Description
  • Bus that the device uses
  • A list of API calls that the device supports,
    each API call in the list contains
  • The ID and Name of the API call
  • The speedup achieved as compared to the host
    processor
  • The expected time to completion (ETC) of the API
    call given in microseconds per byte of input

33
Example Device Configuration File
  • ltmHCDeviceConfiggt
  • ltDevicegt
  • ltIDgt0lt/IDgt
  • ltNamegtHostlt/Namegt
  • ltDescriptiongtA bad host.lt/Descriptiongt
  • ltBusNamegtLocallt/BusNamegt
  • ltBusIDgt0lt/BusIDgt
  • ltAPISupportgt
  • ltFunctiongt
  • ltIDgt26lt/IDgt
  • ltNamegtmhc_combination_nextlt/Namegt
  • ltSpeedupgt1lt/Speedupgt
  • ltCompletionTimegt.015lt/CompletionTimegt
  • lt/Functiongt
  • ltFunctiongt
  • ltIDgt9lt/IDgt
  • ltNamegtmhc_vector_sublt/Namegt
  • ltSpeedupgt1lt/Speedupgt
  • ltCompletionTimegt.001lt/CompletionTimegt
  • ltDevicegt
  • ltIDgt1lt/IDgt
  • ltNamegtVector1lt/Namegt
  • ltDescriptiongtA simple vector
    processor.lt/Descriptiongt
  • ltBusNamegtPCIlt/BusNamegt
  • ltBusIDgt1lt/BusIDgt
  • ltAPISupportgt
  • ltFunctiongt
  • ltIDgt9lt/IDgt
  • ltNamegtmhc_vector_sublt/Namegt
  • ltSpeedupgt10lt/Speedupgt
  • ltCompletionTimegt.0001lt/CompletionTimegt
  • lt/Functiongt
  • lt/APISupportgt
  • lt/Devicegt
  • lt/mHCDeviceConfiggt

34
Bus Configuration File
  • Determines the bus characteristics being used by
    the devices
  • File is XML based which makes it easy for other
    programs to generate and parse bus configuration
    files
  • The following is configurable for each bus
  • Unique ID
  • Name
  • Description
  • Initialization time
  • Specified in microseconds
  • Taken into account once during the framework
    initialization
  • Overhead Time
  • Specified in microseconds
  • Taken into account once for every bus transaction
  • Transfer Time
  • Specified in microseconds per byte
  • Taken into account once for every byte that is
    transmitted over the bus

35
Example Bus Configuration File
  <mHCBusConfig>
    <Bus>
      <ID>0</ID>
      <Name>Local</Name>
      <Description>Used by the host</Description>
      <InitTime>0</InitTime>
      <Overhead>0</Overhead>
      <TransferTime>0</TransferTime>
    </Bus>
    <Bus>
      <ID>1</ID>
      <Name>PCI</Name>
      <Description>PCI bus</Description>
      <InitTime>50</InitTime>
      <Overhead>0.01</Overhead>
      <TransferTime>0.002</TransferTime>
    </Bus>
  </mHCBusConfig>

36
Phases of the microHeterogeneous Framework
  • Task Creation
  • A new task is created for every API call that is
    made, except for initialization, finalization,
    and join calls
  • Tasks encapsulate all of the information of a
    function call
  • ID of function to execute
  • List of pointers to all of the arguments
  • List of pointers to all of the data blocks used
    as inputs and their sizes
  • List of pointers to all of the data blocks used
    as outputs and their sizes
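
As a concrete (illustrative) data layout, such a task might be represented
in C as follows; the field names are assumptions, not the framework's
actual definitions.

    #include <stddef.h>

    /* Illustrative mHC task descriptor. */
    typedef struct mhc_task {
        int      func_id;        /* ID of the API function to execute  */
        void   **args;           /* pointers to the call's arguments   */
        size_t   n_args;
        void   **inputs;         /* input data blocks ...              */
        size_t  *input_sizes;    /* ... and their sizes in bytes       */
        size_t   n_inputs;
        void   **outputs;        /* output data blocks ...             */
        size_t  *output_sizes;   /* ... and their sizes in bytes       */
        size_t   n_outputs;
        struct mhc_task *next;   /* link within a device's task queue  */
    } mhc_task_t;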

37
Phases of the microHeterogeneous Framework (cont)
  • Task Scheduling
  • After a task is created, it is passed to the
    scheduling algorithm that was selected during
    initialization
  • The scheduler determines to which device the task
    should be assigned and places the task in that
    device's task queue
  • Done dynamically in real time
  • Profiling of applications is not required
  • As soon as the scheduler has mapped the task to a
    device, the API call returns and the main user
    program is allowed to continue execution

38
Fast Greedy Scheduling Heuristic
39
Real-Time Min-Min Scheduling Heuristic
40
Weighted Real-Time Min-Min Scheduling Heuristic
41
Phases of the microHeterogeneous Framework (cont)
  • Task Execution
  • If a task is available, the helper thread checks
    whether it has any unresolved dependencies
  • If there are no dependencies, the task is removed
    from the task queue and passed to the device
    driver for execution; otherwise the thread sleeps
  • All tasks are executed on the host processor by
    simulated drivers

42
mHC Applications
  • Four mHC applications were written in order to
    test the performance of both the architecture and
    the different scheduling algorithms that were
    developed
  • Matrix - performs basic matrix operations on a
    set of fifty 100 x 100 matrices
  • First twenty-five matrices are summed together
  • Last twenty-five matrices are subtracted from one
    another
  • Every fifth matrix is scaled by a constant
  • Finally, the inverse of all fifty matrices is
    determined
  • Stats - performs basic statistics on a block of
    five million values (see the sketch after this
    list)
  • Divides the block of five million values into 50
    blocks of 100,000 values
  • Calculates the standard deviation of each block
    of values
  • Determines the blocks of data with the minimum
    and maximum deviations
  • Linalg - solves fifty sets of linear equations,
    each containing one hundred and seventy-five
    variables
  • Random - used to stress test the different
    scheduling algorithms
  • Creates random task graphs consisting of 300
    tasks
  • Tasks created are matrix element multiplications
    between a group of 25 matrices
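
A sequential C sketch of the Stats computation using GSL; the random input
data is a placeholder, and in the mHC version each standard-deviation call
would go through the corresponding mHC API call and be scheduled to a
device.

    #include <stdio.h>
    #include <stdlib.h>
    #include <gsl/gsl_statistics_double.h>

    #define N_VALUES  5000000
    #define N_BLOCKS  50
    #define BLOCK     (N_VALUES / N_BLOCKS)   /* 100,000 values per block */

    int main(void)
    {
        double *data = malloc(N_VALUES * sizeof *data);
        for (long i = 0; i < N_VALUES; i++)
            data[i] = drand48();              /* placeholder input data */

        /* Standard deviation of each block, then the min/max blocks. */
        int min_blk = 0, max_blk = 0;
        double sd_min = 0.0, sd_max = 0.0;
        for (int b = 0; b < N_BLOCKS; b++) {
            double sd = gsl_stats_sd(data + (long)b * BLOCK, 1, BLOCK);
            if (b == 0 || sd < sd_min) { sd_min = sd; min_blk = b; }
            if (b == 0 || sd > sd_max) { sd_max = sd; max_blk = b; }
        }

        printf("min deviation: block %d (%f)\n", min_blk, sd_min);
        printf("max deviation: block %d (%f)\n", max_blk, sd_max);
        free(data);
        return 0;
    }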

43
Simulation Methodology
  • Each simulation run used a three-step process
  • The sequential version of the application was run
  • Done by using the -s -1 parameter when
    initializing the framework
  • Used to determine the estimated time to
    completion (ETC) for each of the API calls on the
    host processor
  • Output recorded for comparison purposes
  • The parallel version of the application was run
  • Done by using the appropriate scheduler
    parameter, bus configuration, and device
    configuration files
  • Used to compare the parallel output to the
    sequential output to make sure that the
    parallelized results were correct
  • The parallel version was run using the timing
    mode
  • Done by specifying the -t parameter along with
    the parameters used in Step 2.
  • Used to get the final timing results for the
    simulation
  • Steps 1 and 2 were run five times; the median run
    was used for calculations

44
Matrix Simulation Results
Scheduler     2     3     4     5     6     7
Fast Greedy   2.33  2.33  2.31  2.34  2.31  2.31
RTmm          1.43  2.00  2.39  3.68  3.75  5.00
WRTmm         1.97  2.78  3.71  3.97  6.58  5.07
45
Stats Simulation Results
Scheduler     2     3     4     5     6     7
Fast Greedy   0.90  0.91  0.90  0.90  0.90  0.90
RTmm          0.97  1.00  1.07  1.08  1.15  1.10
WRTmm         1.03  1.11  1.14  1.14  1.16  1.17
46
Linalg Simulation Results
Scheduler     2     3     4     5     6     7
Fast Greedy   3.20  3.08  3.12  3.18  3.02  2.98
RTmm          1.42  3.40  2.29  4.34  3.72  6.40
WRTmm         1.64  3.40  4.49  6.00  6.44  8.28
47
Random Simulation Similar Processing Elements
  • Simulation used between 1 and 6 additional
    processing elements, each having a speedup of 20x

48
Random Simulation Different Processing Elements
  • Simulation used between 1 and 6 additional
    processing elements
  • Speedups of 20x, 10x, 5x, 2x, 1x, and 0.5x were
    used for devices 1 through 6 respectively.

49
Random Simulation Different Bus Transfer Times
  • Simulation used three additional processing
    elements with various bus transfer times
  • Transfer times range from a 64-bit, 33 MHz PCI
    bus (smallest) to a 100 Mb/s Ethernet connection
    (largest)

50
Random Simulation Various Speedups
  • Simulation used four additional processing
    elements with various speedups

51
Conclusion
  • Accomplishments
  • A new computer architecture, microHeterogeneous
    Computing, was presented that successfully
    exploits fine-grained parallelism in scientific
    applications using additional processing
    elements
  • An API was created that allows developers to
    incorporate mHC into their applications without
    being required to address task scheduling, load
    balancing, or threading issues
  • A highly configurable mHC framework was
    implemented as a standard library which allows
    actual mHC compliant applications to be compiled
    and executed using standard techniques
  • Future Work
  • Creation of mHC compliant device drivers so that
    an actual mHC environment can be created
  • While the microHeterogeneous API currently
    contains the most common scientific functions, it
    needs to be expanded in order to become
    complementary to the GNU Scientific Library.
  • The concept of mHC clusters needs to be fully
    explored in order to determine the applicability
    of mHC to this area of computing.

52
mHC Cluster Based Computing
53
Acknowledgements
  • I would like to thank the following people for
    making this thesis possible
  • My primary advisor, Dr. Shaaban for allowing me
    to work on such an interesting and worthwhile
    project
  • My committee members, Dr. Savakis and Dr.
    Heliotis for working with very tight schedules
  • My family for supporting me through this whole
    process
  • My sister for putting a roof over my head for the
    last month and a half