1
Efficient Exploitation of Fine-Grained
Parallelism using a microHeterogeneous Computing
Environment
Presented by William Scheidel, September 19th, 2002
  • Computer Engineering Department

2
Agenda
  • Objective
  • Heterogeneous Computing Background
  • microHeterogeneous Computing Architecture
  • microHeterogeneous Framework Implementation
  • Simulation Results
  • Conclusion
  • Future Work
  • Acknowledgements

3
Objectives
  • The main objective of this thesis is to propose a
    new computing paradigm, called microHeterogeneous
    computing or mHC, which incorporates processing
    elements (vector processors, digital signal
    processors, etc.) into a general-purpose machine
    using the high-performance I/O buses that are
    available
  • This architecture is then used to efficiently
    exploit fine-grained parallelism

4
Heterogeneous Computing
  • An architecture that provides an assortment of
    high-performance machines for use by an
    application
  • Arose from the realization that no single machine
    is capable of performing all tasks in an optimal
    manner
  • Machines differ in both speed and capabilities,
    and are connected using high-speed, high-bandwidth
    intelligent interconnects that handle the
    intercommunication between the machines.

5
Motivation For Heterogeneous Computing
  • Hypothetical example of the advantage of using a
    heterogeneous suite of machines, where the
    heterogeneous suite time includes inter-machine
    communication overhead. Not drawn to scale.

6
Heterogeneous Computing Driving Application: Image Understanding
  • Well suited to a heterogeneous environment due to
    its complexity and involvement of different types
    of parallelism
  • Consists of three main levels of processing, each
    level containing a different type of parallelism
  • Three main levels can also be executed in
    parallel
  • Lowest Level
  • Consists of pixel-based operators and pixel
    subset operators such as edge detection
  • Highest amount of parallelism
  • Best suited to mesh connected SIMD machines
  • Intermediate Level
  • Grouping and organization of features previously
    extracted
  • Communication is irregular, parallelism decreases
    as features are grouped
  • Best suited to medium-grained MIMD machines
  • Highest Level
  • Knowledge processing
  • Uses the data from the previous levels in order
    to infer semantic attributes about an image
  • Requires coarse-grained loosely coupled MIMD
    machines

7
Heterogeneous Computing Driving Application: Image Understanding
  • The three levels of processing required for image
    understanding.
  • Each of the levels contains large amounts of
    varying kinds of parallelism that can be
    exploited by heterogeneous computing

8
Heterogeneous Computing Broad Issues
  • Analytical benchmarking
  • Code-Type or task profiling
  • Matching and Scheduling
  • Programming environments
  • Interconnection requirements

9
Execution Steps of Heterogeneous Applications
  • Analytical Benchmarking - determines the optimal
    speedup that a particular machine can achieve on
    different types of tasks
  • Code Profiling - determines the modes of
    computation that exist in each program segment as
    well as the execution times
  • Task Scheduling - tasks are mapped to a
    particular machine using some form of scheduling
    heuristic
  • Task Execution - the tasks are executed on the
    assigned machine

10
HC Issues Analytical Benchmarking
  • Measure of how well a given machine is able to
    perform on a certain type of code
  • Required in HC to determine which types of code
    should be mapped to which machines
  • Benchmarking is an offline process
  • Example results
  • SIMD machines are well suited for matrix
    computations / low level image processing
  • MIMD machines are best suited for tasks that have
    limited intercommunication

11
HC Issues Code-type Profiling
  • Used to determine the types of parallelism in the
    code as well as execution time
  • Tasks are separated into segments which contain a
    homogeneous type of parallelism
  • These segments can then be matched to a
    particular machine that is best suited to execute
    them
  • Code-type profiling is an offline process

12
Code Profiling Example
  • Example results from the code-profiling of a
    task.
  • The task is broken into S segments, each of which
    contains embedded homogeneous parallelism.

13
HC Issues Matching and Scheduling
  • Goal is to map code-types to the best suited
    machine
  • Costs must be carefully weighed
  • Computation Costs - the execution time of a code
    segment depends on the machine it is run on as
    well as the current workload of that machine
  • Communication Costs - dependent on the type of
    interconnect used and its bandwidth
  • Interference Costs - resource contention that
    occurs when multiple tasks are assigned to a
    particular machine
  • Problem determined to be NP-hard even for
    homogeneous environments
  • Addition of heterogeneous processing elements
    adds to the complexity
  • A large number of heuristic algorithms have been
    designed to schedule tasks to machines on
    heterogeneous computing systems.
  • Most such heuristic algorithms are static and
    assume that the ETC (expected time to compute) of
    every task on every machine is known from
    code-type profiling and analytical benchmarking.

14
Example Static Scheduling Heuristics for HC
  • Opportunistic Load Balancing (OLB) - assigns each
    task, in arbitrary order, to the next available
    machine.
  • User-Directed Assignment (UDA) - assigns each
    task, in arbitrary order, to the machine with the
    best expected execution time for the task.
  • Fast Greedy - assigns each task, in arbitrary
    order, to the machine with the minimum completion
    time for that task.
  • Min-min - the minimum completion time for each
    task is computed with respect to all machines. The
    task with the overall minimum completion time is
    selected and assigned to the corresponding
    machine. The newly mapped task is removed, and
    the process repeats until all tasks are mapped.
    (A minimal C sketch follows this list.)
  • Max-min - the Max-min heuristic is very similar
    to the Min-min algorithm. The set of minimum
    completion times is calculated for every task.
    The task with the overall maximum completion time
    from the set is selected and assigned to the
    corresponding machine.
  • Greedy or Duplex - the Greedy heuristic is
    literally a combination of the Min-min and
    Max-min heuristics, running both and using the
    better of the two solutions.
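
For concreteness, here is a minimal C sketch of the Min-min heuristic
described above, assuming a precomputed ETC matrix and per-machine ready
times; the function and variable names are illustrative, not part of the
thesis framework.

    #include <stdlib.h>

    /* Min-min sketch: etc[t][m] holds the expected time to compute task t
     * on machine m; ready[m] is the time at which machine m becomes free.
     * assignment[t] receives the machine chosen for each task. */
    void min_min(int n_tasks, int n_machines,
                 double etc[n_tasks][n_machines],
                 double ready[n_machines], int assignment[n_tasks])
    {
        int *mapped = calloc(n_tasks, sizeof(int));

        for (int done = 0; done < n_tasks; done++) {
            int best_task = -1, best_machine = -1;
            double best_ct = 0.0;

            /* For every unmapped task, find its minimum completion time,
             * then keep the task whose minimum is smallest overall. */
            for (int t = 0; t < n_tasks; t++) {
                if (mapped[t]) continue;
                for (int m = 0; m < n_machines; m++) {
                    double ct = ready[m] + etc[t][m];
                    if (best_task < 0 || ct < best_ct) {
                        best_ct = ct;
                        best_task = t;
                        best_machine = m;
                    }
                }
            }

            assignment[best_task] = best_machine;
            ready[best_machine] = best_ct;   /* machine is busy until then */
            mapped[best_task] = 1;
        }
        free(mapped);
    }

Max-min follows the same structure, except that the task with the largest
minimum completion time is selected in each round.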

15
Example Static Scheduling Heuristics for HC
  • GA - the Genetic Algorithm (GA) is used for
    searching large solution spaces. It operates on a
    population of chromosomes for a given problem.
    The initial population is generated randomly; a
    chromosome could also be generated by any other
    heuristic algorithm.
  • Simulated Annealing (SA) - an iterative technique
    that considers only one possible solution for
    each meta-task at a time. SA uses a procedure
    that probabilistically allows poorer solutions to
    be accepted, based on a system temperature, in an
    attempt to obtain a better search of the solution
    space.
  • GSA - the Genetic Simulated Annealing (GSA)
    heuristic is a combination of the GA and SA
    techniques.
  • Tabu - Tabu search is a solution-space search
    that keeps track of the regions of the solution
    space which have already been searched so as not
    to repeat a search near these areas.
  • A* - A* is a tree search beginning at a root node
    that is usually a null solution. As the tree
    grows, intermediate nodes represent partial
    solutions and leaf nodes represent final
    solutions. Each node has a cost function, and the
    node with the minimum cost function is replaced
    by its children. Each time a node is added, the
    tree is pruned by deleting the node with the
    largest cost function. This process continues
    until a complete mapping (a leaf node) is reached.

16
Example Static Scheduling Heuristics for HC: The Segmented Min-Min Algorithm
  • Every task has an ETC (expected time to compute)
    on a specific machine. If there are t tasks and m
    machines, we can obtain a t x m ETC matrix, where
    ETC(i, j) is the estimated execution time of
    task i on machine j.
  • The Segmented Min-min algorithm sorts the tasks
    according to their ETCs.
  • The tasks can be sorted into an ordered list by
    the average ETC, the minimum ETC, or the maximum
    ETC.
  • Then, the task list is partitioned into segments
    of equal size, and Min-min is applied to each
    segment in turn. (A sketch of the ordering and
    partitioning step follows.)
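
A small C sketch of the ordering and partitioning step, reusing the etc
matrix layout from the earlier Min-min sketch; schedule_segment() is a
placeholder for running Min-min over one segment.

    #include <stdlib.h>

    /* Illustrative Segmented Min-min: sort tasks by average ETC in
     * decreasing order, split them into equal segments, and schedule
     * each segment in turn. */
    typedef struct { int id; double key; } task_key_t;

    static int by_key_desc(const void *a, const void *b)
    {
        double ka = ((const task_key_t *)a)->key;
        double kb = ((const task_key_t *)b)->key;
        return (ka < kb) - (ka > kb);          /* larger key first */
    }

    static void schedule_segment(const task_key_t *seg, int len)
    {
        (void)seg; (void)len;   /* placeholder: run Min-min over seg[0..len-1] */
    }

    void segmented_min_min(int n_tasks, int n_machines, int n_segments,
                           double etc[n_tasks][n_machines])
    {
        task_key_t *order = malloc(n_tasks * sizeof *order);

        for (int t = 0; t < n_tasks; t++) {    /* key = average ETC of task t */
            double sum = 0.0;
            for (int m = 0; m < n_machines; m++)
                sum += etc[t][m];
            order[t].id  = t;
            order[t].key = sum / n_machines;
        }
        qsort(order, n_tasks, sizeof *order, by_key_desc);

        int seg_len = (n_tasks + n_segments - 1) / n_segments;
        for (int s = 0; s < n_tasks; s += seg_len) {
            int len = (s + seg_len <= n_tasks) ? seg_len : n_tasks - s;
            schedule_segment(&order[s], len);
        }
        free(order);
    }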

17
Segmented Min-Min Scheduling Heuristic
18
Example Dynamic Scheduling Heuristics for HC: HEFT Scheduling Heuristic
19
Heterogeneous Parallel Programming
  • Parallel Virtual Machine (PVM)
  • Enables a collection of heterogeneous computers
    to be used as a coherent and flexible concurrent
    computational resource
  • The unit of parallelism is the task, which is
    generally a process
  • The user can either view these resources as an
    attributeless collection of virtual processing
    elements or exploit the capabilities of specific
    machines in the host pool
  • Allows multifaceted virtual machines to be
    configured within the same framework and permits
    messages containing more than one data type to be
    exchanged between machines having different data
    representations
  • Message Passing Interface (MPI)
  • Each processor is assigned a rank which then
    determines which parts of the application that
    processor will execute
  • Division of tasks among processors is left solely
    up to the developer writing the application
  • Includes routines to do point-to-point
    communication between two processing elements,
    collective operations to simultaneously
    communicate information between all processing
    elements, and implicit as well as explicit
    synchronization
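
To make the rank-based division of work concrete, here is a minimal MPI
program in C; the partial-sum computation is purely illustrative.

    #include <mpi.h>
    #include <stdio.h>

    /* Each process computes a partial sum over its share of the iteration
     * space, determined by its rank; rank 0 collects the total. */
    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        const long N = 1000000;
        double local = 0.0, total = 0.0;
        for (long i = rank; i < N; i += size)   /* rank decides which work */
            local += 1.0 / (i + 1);

        /* Collective operation: combine all partial sums on rank 0. */
        MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("sum = %f\n", total);
        MPI_Finalize();
        return 0;
    }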

20
HC Issues Interconnection Requirements
  • The interconnection medium must support high
    bandwidths and low-latency communication (LANs
    won't cut it)
  • Complexity in a heterogeneous system increases
    since different machines use different protocols
    for communication
  • Must support both shared memory and message-based
    communication

21
Heterogeneous Computing Limitations
  • Task Granularity
  • Heterogeneous computing only utilizes
    coarse-grained parallelism to increase
    performance
  • Coarse-grained parallelism results in large task
    sizes and reduced coupling which allows the
    processing elements to work more efficiently
  • This requirement also carries over to most
    heterogeneous schedulers, since they are based on
    the scheduling of meta-tasks, i.e. tasks that
    have no dependencies
  • Communication Overhead
  • Tasks and their working sets must be transmitted
    over some form of network in order to execute
    them, so latency and bandwidth become crucial
    factors
  • Overhead is also incurred when encoding and
    decoding the data for different architectures
  • Cost
  • Machines used in heterogeneous computing
    environments can be prohibitively expensive
  • Expensive high speed, low latency networks are
    required to achieve the best performance
  • Not cost effective for applications where only a
    small portion of the code would benefit from such
    an environment

22
microHeterogeneous Computing
  • A new computing paradigm that attempts to most
    efficiently exploit the fine-grained parallelism
    found in most scientific computing applications
  • Environment is contained within a workstation and
    consists of a host processor and a number of
    additional PCI (or other high performance I/O
    bus) based processing elements
  • Elements might be DSP based, vector based, FPGA
    based, or even reconfigurable computing elements
  • In combination with a host processor, these
    elements create a small scale heterogeneous
    computing environment
  • An mHC specific API was developed that greatly
    simplifies using these types of devices for
    parallel applications

23
microHeterogeneous Computing Environment
24
Comparison Between Heterogeneous Computing and
microHeterogeneous Computing
  • Task Granularity
  • Heterogeneous environments only support
    coarse-grained parallelism, while the mHC
    environment instead focuses on fine-grained
    parallelism by using a tightly coupled shared
    memory environment
  • Task size is reduced to a single function call in
    a mHC environment
  • Drawbacks
  • Processing elements used in mHC are not nearly as
    powerful as the machines used in a standard
    heterogeneous environment
  • Only a small, finite number of processing
    elements can be added to a single machine
  • Communication Overhead
  • High performance I/O buses are twice as fast as
    the fastest network
  • Less overhead is incurred when encoding and
    decoding data since all processing elements use
    the same base architecture
  • Cost Effectiveness
  • Machines used in a heterogeneous environment can
    cost tens of thousands of dollars each, and
    require the extra expense of the high-speed, low
    latency interconnects to achieve acceptable
    performance
  • mHC processing elements cost only hundreds of
    dollars

25
Comparison Between Heterogeneous Computing and
microHeterogeneous Computing
  • Analytical Benchmarking and Profiling
  • Analytical benchmarking is used for the same
    purpose in both computing environments. The
    capabilities of each processing element or
    machine must be known before program execution
    begins so the scheduling algorithm is able to
    determine an efficient mapping of tasks.
  • Profiling, while necessary in heterogeneous
    environments, is not required in
    microHeterogeneous environments
  • Scheduling Heuristics
  • Scheduling in heterogeneous environments
    generally takes place at compile time rather than
    during execution
  • The scheduler for an mHC environment must be
    dynamic and map tasks in real-time in order to
    provide the best performance.

26
microHeterogeneous Computing API
  • An API was created for microHeterogeneous
    computing to provide a flexible and portable
    interface
  • User applications only need to make simple API
    calls; the mapping of tasks onto the available
    devices is performed automatically
  • There is no need for an application to be
    recompiled if the underlying implementation of
    microHeterogeneous Computing changes as long as
    the API is adhered to
  • The API supports a subset of the GNU Scientific
    Library (GSL)
  • GSL is a freely distributable scientific API
    written in C
  • Includes an extensive library of scientific
    functions and data types
  • Directly supports the Basic Linear Algebra
    Subprograms (BLAS)
  • Data structures are compatible with those used by
    the Vector, Signal, and Image Processing Library
    (VSIPL) that is becoming a standard on embedded
    devices
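
A hedged usage sketch of the API: mhc_vector_sub is the function named in
the device configuration example later, while mhc_init, mhc_join,
mhc_finalize, the mhc.h header, and the MHC_SCHED_WRTMM constant are
assumed names inferred from the initialization, join, and finalization
calls described in the framework phases; the gsl_vector calls are real GSL
functions.

    #include <gsl/gsl_vector.h>
    #include "mhc.h"            /* hypothetical mHC API header */

    int main(void)
    {
        /* Assumed initialization call: selects a scheduler and reads the
         * bus and device configuration files (parameters illustrative). */
        mhc_init("bus.xml", "devices.xml", MHC_SCHED_WRTMM);

        gsl_vector *a = gsl_vector_alloc(1024);
        gsl_vector *b = gsl_vector_alloc(1024);
        gsl_vector_set_all(a, 2.0);
        gsl_vector_set_all(b, 1.0);

        /* The API call creates a task; it returns as soon as the scheduler
         * has mapped the task to a device (a <- a - b, as in gsl_vector_sub). */
        mhc_vector_sub(a, b);

        mhc_join();             /* wait for outstanding tasks to complete */

        gsl_vector_free(a);
        gsl_vector_free(b);
        mhc_finalize();
        return 0;
    }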

27
microHeterogeneous Computing API (cont)
The microHeterogeneous Computing API provides
support for the following areas of scientific
computing:
  • Linear Algebra
  • Eigenvectors and Eigenvalues
  • Fast Fourier Transforms
  • Numerical Integration
  • Statistics
  • Vector Operations
  • Matrix Operations
  • Polynomial Solvers
  • Permutations
  • Combinations
  • Sorting

28
Suitable mHC Devices
  • XP-15
  • Developed by Texas Memory Systems
  • DSP based accelerator card
  • Performs 8 billion 32-bit floating point
    operations per second
  • Contains 256 MB of on-board DDR RAM
  • Supports over 500 different scientific functions
  • Increases FFT performance by 20x - 40x over a 1.4
    GHz Pentium 4
  • Pegasus-2
  • Developed by Catalina Research
  • Vector Processor based
  • Supports FFT, matrix and vector operations,
    convolutions, filters and more
  • Supported functions operate between 5x and 15x
    faster than a 1.4 GHz Pentium 4

29
microHeterogeneous Framework Implementation
  • Implemented as a dynamically linked library
    written purely in C that user applications
    interact with by way of the mHC API
  • The framework creates tasks from the API function
    calls and schedules them to the available
    processing elements

30
Phases of the microHeterogeneous Framework
  • Initialization
  • The framework must be initialized before it may
    be used
  • Scheduler and scheduler parameters chosen
  • Bus and Device Configuration Files read
  • Log file specified
  • Data structures created
  • Helper threads are created that move tasks from a
    device's task queue to the device. These threads
    are real-time threads that use a round-robin
    scheduling policy.
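
A minimal sketch of creating such a real-time, round-robin helper thread
with POSIX threads; the thread body, the priority value, and the device
argument are placeholders rather than the framework's actual code.

    #include <pthread.h>
    #include <sched.h>

    /* Helper thread body: would move tasks from a device's queue to the
     * device driver (placeholder). */
    static void *helper_thread(void *device)
    {
        (void)device;
        return NULL;
    }

    /* Create one helper thread per device with the SCHED_RR policy. */
    int spawn_helper(pthread_t *tid, void *device)
    {
        pthread_attr_t attr;
        struct sched_param sp = { .sched_priority = 1 };   /* placeholder */

        pthread_attr_init(&attr);
        pthread_attr_setinheritsched(&attr, PTHREAD_EXPLICIT_SCHED);
        pthread_attr_setschedpolicy(&attr, SCHED_RR);      /* round-robin RT */
        pthread_attr_setschedparam(&attr, &sp);

        int rc = pthread_create(tid, &attr, helper_thread, device);
        pthread_attr_destroy(&attr);
        return rc;
    }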

31
microHeterogeneous Computing Framework Overview
32
Device Configuration File
  • Determines what devices are available in the
    microHeterogeneous environment
  • File is XML based which makes it easy for other
    programs to generate and parse device
    configuration files
  • The following is configurable for each device
  • Unique ID
  • Name
  • Description
  • Bus that the device uses
  • A list of API calls that the device supports,
    each API call in the list contains
  • The ID and Name of the API call
  • The speedup achieved as compared to the host
    processor
  • The expected time to completion (ETC) of the API
    call given in microseconds per byte of input

33
Example Device Configuration File
  • ltmHCDeviceConfiggt
  • ltDevicegt
  • ltIDgt0lt/IDgt
  • ltNamegtHostlt/Namegt
  • ltDescriptiongtA bad host.lt/Descriptiongt
  • ltBusNamegtLocallt/BusNamegt
  • ltBusIDgt0lt/BusIDgt
  • ltAPISupportgt
  • ltFunctiongt
  • ltIDgt26lt/IDgt
  • ltNamegtmhc_combination_nextlt/Namegt
  • ltSpeedupgt1lt/Speedupgt
  • ltCompletionTimegt.015lt/CompletionTimegt
  • lt/Functiongt
  • ltFunctiongt
  • ltIDgt9lt/IDgt
  • ltNamegtmhc_vector_sublt/Namegt
  • ltSpeedupgt1lt/Speedupgt
  • ltCompletionTimegt.001lt/CompletionTimegt
  • ltDevicegt
  • ltIDgt1lt/IDgt
  • ltNamegtVector1lt/Namegt
  • ltDescriptiongtA simple vector
    processor.lt/Descriptiongt
  • ltBusNamegtPCIlt/BusNamegt
  • ltBusIDgt1lt/BusIDgt
  • ltAPISupportgt
  • ltFunctiongt
  • ltIDgt9lt/IDgt
  • ltNamegtmhc_vector_sublt/Namegt
  • ltSpeedupgt10lt/Speedupgt
  • ltCompletionTimegt.0001lt/CompletionTimegt
  • lt/Functiongt
  • lt/APISupportgt
  • lt/Devicegt
  • lt/mHCDeviceConfiggt

34
Bus Configuration File
  • Determines the bus characteristics being used by
    the devices
  • File is XML based which makes it easy for other
    programs to generate and parse bus configuration
    files
  • The following is configurable for each bus
  • Unique ID
  • Name
  • Description
  • Initialization time
  • Specified in microseconds
  • Taken into account once during the framework
    initialization
  • Overhead Time
  • Specified in microseconds
  • Taken into account once for every bus transaction
  • Transfer Time
  • Specified in microseconds per byte
  • Taken into account once for every byte that is
    transmitted over the bus

35
Example Bus Configuration File
  <mHCBusConfig>
    <Bus>
      <ID>0</ID>
      <Name>Local</Name>
      <Description>Used by the host</Description>
      <InitTime>0</InitTime>
      <Overhead>0</Overhead>
      <TransferTime>0</TransferTime>
    </Bus>
    <Bus>
      <ID>1</ID>
      <Name>PCI</Name>
      <Description>PCI bus</Description>
      <InitTime>50</InitTime>
      <Overhead>0.01</Overhead>
      <TransferTime>0.002</TransferTime>
    </Bus>
  </mHCBusConfig>

36
Phases of the microHeterogeneous Framework
  • Task Creation
  • A new task is created for every API call that is
    made, except for initialization, finalization,
    and join calls
  • Tasks encapsulate all of the information of a
    function call
  • ID of function to execute
  • List of pointers to all of the arguments
  • List of pointers to all of the data blocks used
    as inputs and their sizes
  • List of pointers to all of the data blocks used
    as outputs and their sizes
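
As a concrete (illustrative) data layout, such a task might be represented
in C as follows; the field names are assumptions, not the framework's
actual definitions.

    #include <stddef.h>

    /* Illustrative mHC task descriptor. */
    typedef struct mhc_task {
        int      func_id;        /* ID of the API function to execute  */
        void   **args;           /* pointers to the call's arguments   */
        size_t   n_args;
        void   **inputs;         /* input data blocks ...              */
        size_t  *input_sizes;    /* ... and their sizes in bytes       */
        size_t   n_inputs;
        void   **outputs;        /* output data blocks ...             */
        size_t  *output_sizes;   /* ... and their sizes in bytes       */
        size_t   n_outputs;
        struct mhc_task *next;   /* link within a device's task queue  */
    } mhc_task_t;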

37
Phases of the microHeterogeneous Framework (cont)
  • Task Scheduling
  • After a task is created, it is passed to the
    scheduling algorithm that was selected during
    initialization
  • The scheduler determines to which device the task
    should be assigned and places the task in that
    device's task queue
  • Done dynamically in real time
  • Profiling of applications is not required
  • As soon as the scheduler has mapped the task to a
    device, the API call returns and the main user
    program is allowed to continue execution

38
Fast Greedy Scheduling Heuristic
39
Real-Time Min-Min Scheduling Heuristic
40
Weighted Real-Time Min-Min Scheduling Heuristic
41
Phases of the microHeterogeneous Framework (cont)
  • Task Execution
  • If a task is available, the helper thread checks
    whether it has any unresolved dependencies
  • If there are no dependencies, the task is removed
    from the task queue and passed to the device
    driver for execution; otherwise the thread sleeps
  • All tasks are executed on the host processor by
    simulated drivers

42
mHC Applications
  • Four mHC applications were written in order to
    test the performance of both the architecture and
    the different scheduling algorithms that were
    developed
  • Matrix - performs basic matrix operations on a
    set of fifty 100 x 100 matrices
  • First twenty-five matrices are summed together
  • Last twenty-five matrices are subtracted from one
    another
  • Every fifth matrix is scaled by a constant
  • Finally, the inverse of all fifty matrices is
    determined
  • Stats - performs basic statistics on a block of
    five million values (see the sketch after this
    list)
  • Divides the block of five million values into 50
    blocks of 100,000 values
  • Calculates the standard deviation of each block
    of values
  • Determines the blocks of data with the minimum
    and maximum deviations
  • Linalg - solves fifty sets of linear equations,
    each containing one hundred and seventy-five
    variables
  • Random - used to stress test the different
    scheduling algorithms
  • Creates random task graphs consisting of 300
    tasks
  • Tasks created are matrix element multiplications
    between a group of 25 matrices
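
A sequential C sketch of the Stats computation using GSL; the random input
data is a placeholder, and in the mHC version each standard-deviation call
would go through the corresponding mHC API call and be scheduled to a
device.

    #include <stdio.h>
    #include <stdlib.h>
    #include <gsl/gsl_statistics_double.h>

    #define N_VALUES  5000000
    #define N_BLOCKS  50
    #define BLOCK     (N_VALUES / N_BLOCKS)   /* 100,000 values per block */

    int main(void)
    {
        double *data = malloc(N_VALUES * sizeof *data);
        for (long i = 0; i < N_VALUES; i++)
            data[i] = drand48();              /* placeholder input data */

        /* Standard deviation of each block, then the min/max blocks. */
        int min_blk = 0, max_blk = 0;
        double sd_min = 0.0, sd_max = 0.0;
        for (int b = 0; b < N_BLOCKS; b++) {
            double sd = gsl_stats_sd(data + (long)b * BLOCK, 1, BLOCK);
            if (b == 0 || sd < sd_min) { sd_min = sd; min_blk = b; }
            if (b == 0 || sd > sd_max) { sd_max = sd; max_blk = b; }
        }

        printf("min deviation: block %d (%f)\n", min_blk, sd_min);
        printf("max deviation: block %d (%f)\n", max_blk, sd_max);
        free(data);
        return 0;
    }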

43
Simulation Methodology
  • Each simulation run used a three-step process
  • The sequential version of the application was run
  • Done by using the -s -1 parameter when
    initializing the framework
  • Used to determine the estimated time to
    completion (ETC) for each of the API calls on the
    host processor
  • Output recorded for comparison purposes
  • The parallel version of the application was run
  • Done by using the appropriate scheduler
    parameter, bus configuration, and device
    configuration files
  • Used to compare the parallel output to the
    sequential output to make sure that the
    parallelized results were correct
  • The parallel version was run using the timing
    mode
  • Done by specifying the -t parameter along with
    the parameters used in Step 2.
  • Used to get the final timing results for the
    simulation
  • Steps 1 and 2 were run five times; the median run
    was used for calculations

44
Matrix Simulation Results
Scheduler     2     3     4     5     6     7
Fast Greedy   2.33  2.33  2.31  2.34  2.31  2.31
RTmm          1.43  2.00  2.39  3.68  3.75  5.00
WRTmm         1.97  2.78  3.71  3.97  6.58  5.07
45
Stats Simulation Results
Scheduler     2     3     4     5     6     7
Fast Greedy   0.90  0.91  0.90  0.90  0.90  0.90
RTmm          0.97  1.00  1.07  1.08  1.15  1.10
WRTmm         1.03  1.11  1.14  1.14  1.16  1.17
46
Linalg Simulation Results
Scheduler     2     3     4     5     6     7
Fast Greedy   3.20  3.08  3.12  3.18  3.02  2.98
RTmm          1.42  3.40  2.29  4.34  3.72  6.40
WRTmm         1.64  3.40  4.49  6.00  6.44  8.28
47
Random Simulation Similar Processing Elements
  • Simulation used between 1 and 6 additional
    processing elements, each having a speedup of 20x

48
Random Simulation Different Processing Elements
  • Simulation used between 1 and 6 additional
    processing elements
  • Speedups of 20x, 10x, 5x, 2x, 1x, and 0.5x were
    used for devices 1 through 6 respectively.

49
Random Simulation Different Bus Transfer Times
  • Simulation used three additional processing
    elements with various bus transfer times
  • Transfer times range from a 64-bit, 33 MHz PCI
    bus (smallest) to a 100 Mb/s Ethernet connection
    (largest)

50
Random Simulation Various Speedups
  • Simulation used four additional processing
    elements with various speedups

51
Conclusion
  • Accomplishments
  • A new computer architecture, microHeterogeneous
    Computing, was presented that successfully
    exploits fine-grained parallelism in scientific
    applications using additional processing
    elements
  • An API was created that allows developers to
    incorporate mHC into their applications without
    being required to address task scheduling, load
    balancing, or threading issues
  • A highly configurable mHC framework was
    implemented as a standard library which allows
    actual mHC compliant applications to be compiled
    and executed using standard techniques
  • Future Work
  • Creation of mHC compliant device drivers so that
    an actual mHC environment can be created
  • While the microHeterogeneous API currently
    contains the most common scientific functions, it
    needs to be expanded in order to become
    complementary to the GNU Scientific Library.
  • The concept of mHC clusters needs to be fully
    explored in order to determine the applicability
    of mHC to this area of computing.

52
mHC Cluster Based Computing
53
Acknowledgements
  • I would like to thank the following people for
    making this thesis possible
  • My primary advisor, Dr. Shaaban for allowing me
    to work on such an interesting and worthwhile
    project
  • My committee members, Dr. Savakis and Dr.
    Heliotis for working with very tight schedules
  • My family for supporting me through this whole
    process
  • My sister for putting a roof over my head for the
    last month and a half