1
Parallel Computing Explained
  • Slides Prepared from the CI-Tutor Courses at NCSA
  • http://ci-tutor.ncsa.uiuc.edu/
  • By
  • S. Masoud Sadjadi
  • School of Computing and Information Sciences
  • Florida International University
  • March 2009

2
Agenda
  • 1 Parallel Computing Overview
  • 2 How to Parallelize a Code
  • 3 Porting Issues
  • 4 Scalar Tuning
  • 5 Parallel Code Tuning
  • 6 Timing and Profiling
  • 7 Cache Tuning
  • 8 Parallel Performance Analysis
  • 9 About the IBM Regatta P690

3
Agenda
  • 1 Parallel Computing Overview
  • 1.1 Introduction to Parallel Computing
  • 1.1.1 Parallelism in our Daily Lives
  • 1.1.2 Parallelism in Computer Programs
  • 1.1.3 Parallelism in Computers
  • 1.1.3.4 Disk Parallelism
  • 1.1.4 Performance Measures
  • 1.1.5 More Parallelism Issues
  • 1.2 Comparison of Parallel Computers
  • 1.3 Summary

4
Parallel Computing Overview
  • Who should read this chapter?
  • New Users: to learn concepts and terminology.
  • Intermediate Users: for review or reference.
  • Management Staff: to understand the basic
    concepts even if you don't plan to do any
    programming.
  • Note: Advanced users may opt to skip this
    chapter.

5
Introduction to Parallel Computing
  • High performance parallel computers
  • can solve large problems much faster than a
    desktop computer
  • fast CPUs, large memory, high speed
    interconnects, and high speed input/output
  • able to speed up computations
  • by making the sequential components run faster
  • by doing more operations in parallel
  • High performance parallel computers are in demand
  • need for tremendous computational capabilities in
    science, engineering, and business.
    require gigabytes/terabytes of memory and
    gigaflops/teraflops of performance
  • scientists are striving for petascale performance

6
Introduction to Parallel Computing
  • HPPC are used in a wide variety of disciplines.
  • Meteorologists: prediction of tornadoes and
    thunderstorms
  • Computational biologists: analysis of DNA
    sequences
  • Pharmaceutical companies: design of new drugs
  • Oil companies: seismic exploration
  • Wall Street: analysis of financial markets
  • NASA: aerospace vehicle design
  • Entertainment industry: special effects in movies
    and commercials
  • These complex scientific and business
    applications all need to perform computations on
    large datasets or large equations.

7
Parallelism in our Daily Lives
  • There are two types of processes that occur in
    computers and in our daily lives
  • Sequential processes
  • occur in a strict order
  • it is not possible to do the next step until the
    current one is completed.
  • Examples:
  • The passage of time: the sun rises and the sun
    sets.
  • Writing a term paper: pick the topic, research,
    and write the paper.
  • Parallel processes
  • many events happen simultaneously
  • Examples:
  • Plant growth in the springtime
  • An orchestra

8
Agenda
  • 1 Parallel Computing Overview
  • 1.1 Introduction to Parallel Computing
  • 1.1.1 Parallelism in our Daily Lives
  • 1.1.2 Parallelism in Computer Programs
  • 1.1.2.1 Data Parallelism
  • 1.1.2.2 Task Parallelism
  • 1.1.3 Parallelism in Computers
  • 1.1.3.4 Disk Parallelism
  • 1.1.4 Performance Measures
  • 1.1.5 More Parallelism Issues
  • 1.2 Comparison of Parallel Computers
  • 1.3 Summary

9
Parallelism in Computer Programs
  • Conventional wisdom:
  • Computer programs are sequential in nature.
  • Only a small subset of them lend themselves to
    parallelism.
  • Algorithm: the "sequence of steps" necessary to
    do a computation.
  • For the first 30 years of computer use, programs
    were run sequentially.
  • The 1980's saw great successes with parallel
    computers.
  • Dr. Geoffrey Fox published a book entitled
    Parallel Computing Works!
  • many scientific accomplishments resulting from
    parallel computing
  • Computer programs are parallel in nature
  • Only a small subset of them need to be run
    sequentially

10
Parallel Computing
  • What a computer does when it carries out more
    than one computation at a time using more than
    one processor.
  • By using many processors at once, we can speed up
    the execution:
  • If one processor can perform the arithmetic in
    time t, then ideally p processors can perform it
    in time t/p.
  • What if I use 100 processors? What if I use 1000
    processors?
  • Almost every program has some form of
    parallelism.
  • You need to determine whether your data or your
    program can be partitioned into independent
    pieces that can be run simultaneously.
  • Decomposition is the name given to this
    partitioning process.
  • Types of parallelism
  • data parallelism
  • task parallelism.

11
Data Parallelism
  • The same code segment runs concurrently on each
    processor, but each processor is assigned its own
    part of the data to work on.
  • Do loops (in Fortran) define the parallelism.
  • The iterations must be independent of each other.
  • Data parallelism is called "fine grain
    parallelism" because the computational work is
    spread into many small subtasks.
  • Example
  • Dense linear algebra, such as matrix
    multiplication, is a perfect candidate for data
    parallelism.

12
An example of data parallelism
  • Original Sequential Code:
      DO K=1,N
        DO J=1,N
          DO I=1,N
            C(I,J) = C(I,J) + A(I,K)*B(K,J)
          END DO
        END DO
      END DO
  • Parallel Code:
!$OMP PARALLEL DO
      DO K=1,N
        DO J=1,N
          DO I=1,N
            C(I,J) = C(I,J) + A(I,K)*B(K,J)
          END DO
        END DO
      END DO
!$OMP END PARALLEL DO

13
Quick Intro to OpenMP
  • OpenMP is a portable standard for parallel
    directives covering both data and task
    parallelism.
  • More information about OpenMP is available on the
    OpenMP website.
  • We will have a lecture on Introduction to OpenMP
    later.
  • With OpenMP, the loop that is performed in
    parallel is the loop that immediately follows the
    Parallel Do directive.
  • In our sample code, it's the K loop:
  • DO K=1,N
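  • A minimal, self-contained version of this example
    is sketched below for readers who want to try it.
    The array size (N=20) and the initial values are
    illustrative assumptions. Note that the directive
    is placed on the J loop here rather than the K
    loop: with the K loop parallel, as on the slide,
    several threads would update the same C(I,J) at
    once, so that form would additionally need a
    reduction or other synchronization on C.

      program matmul_omp
      implicit none
      integer, parameter :: n = 20
      integer :: i, j, k
      real :: a(n,n), b(n,n), c(n,n)
      a = 1.0
      b = 2.0
      c = 0.0
      ! each thread gets its own range of J, so no two
      ! threads ever update the same element of C
!$OMP PARALLEL DO PRIVATE(i, k)
      do j = 1, n
         do k = 1, n
            do i = 1, n
               c(i,j) = c(i,j) + a(i,k)*b(k,j)
            end do
         end do
      end do
!$OMP END PARALLEL DO
      print *, 'c(1,1) =', c(1,1)
      end program matmul_omp

  • Compile with the compiler's OpenMP option (for
    example, f90 -mp on the Origin2000) and set
    OMP_NUM_THREADS with setenv before running, as
    described later in these slides.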

14
OpenMP Loop Parallelism
  • Iteration-Processor Assignments
  • The code segment running on each processor:
      DO J=1,N
        DO I=1,N
          C(I,J) = C(I,J) + A(I,K)*B(K,J)
        END DO
      END DO

Processor Iterations of K Data Elements
proc0 K=1:5 A(I, 1:5) B(1:5, J)
proc1 K=6:10 A(I, 6:10) B(6:10, J)
proc2 K=11:15 A(I, 11:15) B(11:15, J)
proc3 K=16:20 A(I, 16:20) B(16:20, J)
  • (The assignments above assume N=20 and four
    processors.)
15
OpenMP Style of Parallelism
  • OpenMP parallelization can be done incrementally,
    as follows:
  • 1. Parallelize the most computationally intensive
    loop.
  • 2. Compute the performance of the code.
  • 3. If performance is not satisfactory, parallelize
    another loop.
  • 4. Repeat steps 2 and 3 as many times as needed.
  • The ability to perform incremental parallelism is
    considered a positive feature of data
    parallelism.
  • It is contrasted with the MPI (Message Passing
    Interface) style of parallelism, which is an "all
    or nothing" approach.

16
Task Parallelism
  • Task parallelism may be thought of as the
    opposite of data parallelism.
  • Instead of the same operations being performed on
    different parts of the data, each process
    performs different operations.
  • You can use task parallelism when your program
    can be split into independent pieces, often
    subroutines, that can be assigned to different
    processors and run concurrently.
  • Task parallelism is called "coarse grain"
    parallelism because the computational work is
    spread into just a few subtasks.
  • More code is run in parallel because the
    parallelism is implemented at a higher level than
    in data parallelism.
  • Task parallelism is often easier to implement and
    has less overhead than data parallelism.

17
Task Parallelism
  • The abstract code shown in the diagram is
    decomposed into 4 independent code segments that
    are labeled A, B, C, and D. The right hand side
    of the diagram illustrates the 4 code segments
    running concurrently.

18
Task Parallelism
  • Original Code:
      program main
        code segment labeled A
        code segment labeled B
        code segment labeled C
        code segment labeled D
      end
  • Parallel Code:
      program main
!$OMP PARALLEL
!$OMP SECTIONS
        code segment labeled A
!$OMP SECTION
        code segment labeled B
!$OMP SECTION
        code segment labeled C
!$OMP SECTION
        code segment labeled D
!$OMP END SECTIONS
!$OMP END PARALLEL
      end
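  • A compilable sketch of the same structure is shown
    below; the print statements stand in for the four
    code segments and are illustrative, not from the
    original slides.

      program sections_demo
      implicit none
!$OMP PARALLEL
!$OMP SECTIONS
!$OMP SECTION
      print *, 'code segment A'
!$OMP SECTION
      print *, 'code segment B'
!$OMP SECTION
      print *, 'code segment C'
!$OMP SECTION
      print *, 'code segment D'
!$OMP END SECTIONS
!$OMP END PARALLEL
      end program sections_demo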
19
OpenMP Task Parallelism
  • With OpenMP, the code that follows each
    SECTION(S) directive is allocated to a different
    processor. In our sample parallel code, the
    allocation of code segments to processors is as
    follows.

Processor Code
proc0 code segment labeled A
proc1 code segment labeled B
proc2 code segment labeled C
proc3 code segment labeled D
20
Parallelism in Computers
  • How parallelism is exploited and enhanced within
    the operating system and hardware components of a
    parallel computer
  • operating system
  • arithmetic
  • memory
  • disk

21
Operating System Parallelism
  • All of the commonly used parallel computers run a
    version of the Unix operating system. In the
    table below each OS listed is in fact Unix, but
    the name of the Unix OS varies with each vendor.
  • For more information about Unix, a collection of
    Unix documents is available.

Parallel Computer OS
SGI Origin2000 IRIX
HP V-Class HP-UX
Cray T3E Unicos
IBM SP AIX
Workstation Clusters Linux
22
Two Unix Parallelism Features
  • background processing facility
  • With the Unix background processing facility you
    can run the executable a.out in the background
    and simultaneously view the man page for the
    etime function in the foreground. There are two
    Unix commands that accomplish this:
  • a.out > results &
  • man etime
  • cron feature
  • With the Unix cron feature you can submit a job
    that will run at a later time.

23
Arithmetic Parallelism
  • Multiple execution units
  • facilitate arithmetic parallelism.
  • The arithmetic operations of add, subtract,
    multiply, and divide (+, -, *, /) are each done in a
    separate execution unit. This allows several
    execution units to be used simultaneously,
    because the execution units operate
    independently.
  • Fused multiply and add
  • is another parallel arithmetic feature.
  • Parallel computers are able to overlap multiply
    and add. This arithmetic is named Multiply-Add
    (MADD) on SGI computers, and Fused Multiply Add
    (FMA) on HP computers. In either case, the two
    arithmetic operations are overlapped and can
    complete in hardware in one computer cycle.
  • Superscalar arithmetic
  • is the ability to issue several arithmetic
    operations per computer cycle.
  • It makes use of the multiple, independent
    execution units. On superscalar computers there
    are multiple slots per cycle that can be filled
    with work. This gives rise to the name n-way
    superscalar, where n is the number of slots per
    cycle. The SGI Origin2000 is called a 4-way
    superscalar computer.

24
Memory Parallelism
  • memory interleaving
  • memory is divided into multiple banks, and
    consecutive data elements are interleaved among
    them. For example if your computer has 2 memory
    banks, then data elements with even memory
    addresses would fall into one bank, and data
    elements with odd memory addresses into the
    other.
  • multiple memory ports
  • Port means a bi-directional memory pathway. When
    the data elements that are interleaved across the
    memory banks are needed, the multiple memory
    ports allow them to be accessed and fetched in
    parallel, which increases the memory bandwidth
    (MB/s or GB/s).
  • multiple levels of the memory hierarchy
  • There is global memory that any processor can
    access. There is memory that is local to a
    partition of the processors. Finally there is
    memory that is local to a single processor, that
    is, the cache memory and the memory elements held
    in registers.
  • Cache memory
  • Cache is a small memory that has fast access
    compared with the larger main memory and serves
    to keep the faster processor filled with data.

25
Memory Parallelism
  • Memory Hierarchy
  • Cache Memory

26
Disk Parallelism
  • RAID (Redundant Array of Inexpensive Disks)
  • RAID disks are on most parallel computers.
  • The advantage of a RAID disk system is that it
    provides a measure of fault tolerance.
  • If one of the disks goes down, it can be swapped
    out, and the RAID disk system remains
    operational.
  • Disk Striping
  • When a data set is written to disk, it is striped
    across the RAID disk system. That is, it is
    broken into pieces that are written
    simultaneously to the different disks in the RAID
    disk system. When the same data set is read back
    in, the pieces are read in parallel, and the full
    data set is reassembled in memory.

27
Agenda
  • 1 Parallel Computing Overview
  • 1.1 Introduction to Parallel Computing
  • 1.1.1 Parallelism in our Daily Lives
  • 1.1.2 Parallelism in Computer Programs
  • 1.1.3 Parallelism in Computers
  • 1.1.3.4 Disk Parallelism
  • 1.1.4 Performance Measures
  • 1.1.5 More Parallelism Issues
  • 1.2 Comparison of Parallel Computers
  • 1.3 Summary

28
Performance Measures
  • Peak Performance
  • is the top speed at which the computer can
    operate.
  • It is a theoretical upper limit on the computer's
    performance.
  • Sustained Performance
  • is the highest consistently achieved speed.
  • It is a more realistic measure of computer
    performance.
  • Cost Performance
  • is used to determine if the computer is cost
    effective.
  • MHz
  • is a measure of the processor speed.
  • The processor speed is commonly measured in
    millions of cycles per second, where a computer
    cycle is defined as the shortest time in which
    some work can be done.
  • MIPS
  • is a measure of how quickly the computer can
    issue instructions.
  • Millions of instructions per second is
    abbreviated as MIPS, where the instructions are
    computer instructions such as memory reads and
    writes, logical operations, floating point
    operations, integer operations, and branch
    instructions.

29
Performance Measures
  • Mflops (Millions of floating point operations per
    second)
  • measures how quickly a computer can perform
    floating-point operations such as add, subtract,
    multiply, and divide.
  • Speedup
  • measures the benefit of parallelism.
  • It shows how your program scales as you compute
    with more processors, compared to the performance
    on one processor.
  • Ideal speedup happens when the performance gain
    is linearly proportional to the number of
    processors used (a worked example follows below).
  • Benchmarks
  • are used to rate the performance of parallel
    computers and parallel programs.
  • A well known benchmark that is used to compare
    parallel computers is the Linpack benchmark.
  • Based on the Linpack results, a list is produced
    of the Top 500 Supercomputer Sites. This list is
    maintained by the University of Tennessee and the
    University of Mannheim.
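  • A worked example (the numbers here are
    illustrative, not taken from the slides): speedup
    on p processors is commonly defined as S(p) =
    T(1)/T(p), and parallel efficiency as E(p) =
    S(p)/p. If a program runs in T(1) = 100 seconds on
    one processor and in T(8) = 20 seconds on eight
    processors, then S(8) = 5 and E(8) = 0.625,
    compared with an ideal speedup of 8.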

30
More Parallelism Issues
  • Load balancing
  • is the technique of evenly dividing the workload
    among the processors.
  • For data parallelism it involves how iterations
    of loops are allocated to processors.
  • Load balancing is important because the total
    time for the program to complete is the time
    spent by the longest executing thread.
  • The problem size
  • must be large and must be able to grow as you
    compute with more processors.
  • In order to get the performance you expect from a
    parallel computer you need to run a large
    application with large data sizes, otherwise the
    overhead of passing information between
    processors will dominate the calculation time.
  • Good software tools
  • are essential for users of high performance
    parallel computers.
  • These tools include
  • parallel compilers
  • parallel debuggers
  • performance analysis tools
  • parallel math software
  • The availability of a broad set of application
    software is also important.

31
More Parallelism Issues
  • The high performance computing market is risky
    and chaotic. Many supercomputer vendors are no
    longer in business, making the portability of
    your application very important.
  • A workstation farm
  • is defined as a fast network connecting
    heterogeneous workstations.
  • The individual workstations serve as desktop
    systems for their owners.
  • When they are idle, large problems can take
    advantage of the unused cycles in the whole
    system.
  • An application of this concept is the SETI
    project. You can participate in searching for
    extraterrestrial intelligence with your home PC.
    More information about this project is available
    at the SETI Institute.
  • Condor
  • is software that provides resource management
    services for applications that run on
    heterogeneous collections of workstations.
  • Miron Livny at the University of Wisconsin at
    Madison is the director of the Condor project,
    and has coined the phrase high throughput
    computing to describe this process of harnessing
    idle workstation cycles. More information is
    available at the Condor Home Page.

32
Agenda
  • 1 Parallel Computing Overview
  • 1.1 Introduction to Parallel Computing
  • 1.2 Comparison of Parallel Computers
  • 1.2.1 Processors
  • 1.2.2 Memory Organization
  • 1.2.3 Flow of Control
  • 1.2.4 Interconnection Networks
  • 1.2.4.1 Bus Network
  • 1.2.4.2 Cross-Bar Switch Network
  • 1.2.4.3 Hypercube Network
  • 1.2.4.4 Tree Network
  • 1.2.4.5 Interconnection Networks Self-test
  • 1.2.5 Summary of Parallel Computer
    Characteristics
  • 1.3 Summary

33
Comparison of Parallel Computers
  • Now you can explore the hardware components of
    parallel computers
  • kinds of processors
  • types of memory organization
  • flow of control
  • interconnection networks
  • You will see what is common to these parallel
    computers, and what makes each one of them
    unique.

34
Kinds of Processors
  • There are three types of parallel computers
  • computers with a small number of powerful
    processors
  • Typically have tens of processors.
  • The cooling of these computers often requires
    very sophisticated and expensive equipment,
    making these computers very expensive for
    computing centers.
  • They are general-purpose computers that perform
    especially well on applications that have large
    vector lengths.
  • The examples of this type of computer are the
    Cray SV1 and the Fujitsu VPP5000.

35
Kinds of Processors
  • There are three types of parallel computers
  • computers with a large number of less powerful
    processors
  • Named Massively Parallel Processors (MPPs), these
    typically have thousands of processors.
  • The processors are usually proprietary and
    air-cooled.
  • Because of the large number of processors, the
    distance between the furthest processors can be
    quite large requiring a sophisticated internal
    network that allows distant processors to
    communicate with each other quickly.
  • These computers are suitable for applications
    with a high degree of concurrency.
  • The MPP type of computer was popular in the
    1980s.
  • Examples of this type of computer were the
    Thinking Machines CM-2 computer, and the
    computers made by the MasPar company.

36
Kinds of Processors
  • There are three types of parallel computers
  • computers that are medium scale in between the
    two extremes
  • Typically have hundreds of processors.
  • The processor chips are usually not proprietary;
    rather, they are commodity processors like the
    Pentium III.
  • These are general-purpose computers that perform
    well on a wide range of applications.
  • The most common example of this class is the
    Linux Cluster.

37
Trends and Examples
  • Processor trends
  • The processors on today's commonly used parallel
    computers

Decade Processor Type Computer Example
1970s Pipelined, Proprietary Cray-1
1980s Massively Parallel, Proprietary Thinking Machines CM2
1990s Superscalar, RISC, Commodity SGI Origin2000
2000s CISC, Commodity Workstation Clusters
Computer Processor
SGI Origin2000 MIPS RISC R12000
HP V-Class HP PA 8200
Cray T3E Compaq Alpha
IBM SP IBM Power3
Workstation Clusters Intel Pentium III, Intel Itanium
38
Memory Organization
  • The following paragraphs describe the three types
    of memory organization found on parallel
    computers
  • distributed memory
  • shared memory
  • distributed shared memory

39
Distributed Memory
  • In distributed memory computers, the total memory
    is partitioned into memory that is private to
    each processor.
  • There is a Non-Uniform Memory Access time (NUMA),
    which is proportional to the distance between the
    two communicating processors.
  • On NUMA computers, data is accessed the quickest
    from a private memory, while data from the most
    distant processor takes the longest to access.
  • Some examples are the Cray T3E, the IBM SP, and
    workstation clusters.

40
Distributed Memory
  • When programming distributed memory computers,
    the code and the data should be structured such
    that the bulk of a processor's data accesses are
    to its own private (local) memory.
  • This is called having good data locality.
  • Today's distributed memory computers use message
    passing such as MPI to communicate between
    processors as shown in the following example

41
Distributed Memory
  • One advantage of distributed memory computers is
    that they are easy to scale. As the demand for
    resources grows, computer centers can easily add
    more memory and processors.
  • This is often called the LEGO block approach.
  • The drawback is that programming of distributed
    memory computers can be quite complicated.

42
Shared Memory
  • In shared memory computers, all processors have
    access to a single pool of centralized memory
    with a uniform address space.
  • Any processor can address any memory location at
    the same speed so there is Uniform Memory Access
    time (UMA).
  • Processors communicate with each other through
    the shared memory.
  • The advantages and disadvantages of shared memory
    machines are roughly the opposite of distributed
    memory computers.
  • They are easier to program because they resemble
    the programming of single processor machines
  • But they don't scale like their distributed
    memory counterparts

43
Distributed Shared Memory
  • In Distributed Shared Memory (DSM) computers, a
    cluster or partition of processors has access to
    a common shared memory.
  • It accesses the memory of a different processor
    cluster in a NUMA fashion.
  • Memory is physically distributed but logically
    shared.
  • Attention to data locality again is important.
  • Distributed shared memory computers combine the
    best features of both distributed memory
    computers and shared memory computers.
  • That is, DSM computers have both the scalability
    of distributed memory computers and the ease of
    programming of shared memory computers.
  • Some examples of DSM computers are the SGI
    Origin2000 and the HP V-Class computers.

44
Trends and Examples
  • Memory organization trends
  • The memory organization of today's commonly used
    parallel computers

Decade Memory Organization Example
1970s Shared Memory Cray-1
1980s Distributed Memory Thinking Machines CM-2
1990s Distributed Shared Memory SGI Origin2000
2000s Distributed Memory Workstation Clusters
Computer Memory Organization
SGI Origin2000 DSM
HP V-Class DSM
Cray T3E Distributed
IBM SP Distributed
Workstation Clusters Distributed
45
Flow of Control
  • When you look at the control of flow you will see
    three types of parallel computers
  • Single Instruction Multiple Data (SIMD)
  • Multiple Instruction Multiple Data (MIMD)
  • Single Program Multiple Data (SPMD)

46
Flynn's Taxonomy
  • Flynn's Taxonomy, devised in 1972 by Michael
    Flynn of Stanford University, describes computers
    by how streams of instructions interact with
    streams of data.
  • There can be single or multiple instruction
    streams, and there can be single or multiple data
    streams. This gives rise to 4 types of computers
    as shown in the diagram below
  • Flynn's taxonomy names the 4 computer types SISD,
    MISD, SIMD and MIMD.
  • Of these 4, only SIMD and MIMD are applicable to
    parallel computers.
  • Another computer type, SPMD, is a special case of
    MIMD.

47
SIMD Computers
  • SIMD stands for Single Instruction Multiple Data.
  • Each processor follows the same set of
    instructions, with different data elements being
    allocated to each processor.
  • SIMD computers have distributed memory with
    typically thousands of simple processors, and the
    processors run in lock step.
  • SIMD computers, popular in the 1980s, are useful
    for fine grain data parallel applications, such
    as neural networks.
  • Some examples of SIMD computers were the Thinking
    Machines CM-2 computer and the computers from the
    MasPar company.
  • The processors are commanded by the global
    controller that sends instructions to the
    processors.
  • It says add, and they all add.
  • It says shift to the right, and they all shift to
    the right.
  • The processors are like obedient soldiers,
    marching in unison.

48
MIMD Computers
  • MIMD stands for Multiple Instruction Multiple
    Data.
  • There are multiple instruction streams with
    separate code segments distributed among the
    processors.
  • MIMD is actually a superset of SIMD, so that the
    processors can run the same instruction stream or
    different instruction streams.
  • In addition, there are multiple data streams:
    different data elements are allocated to each
    processor.
  • MIMD computers can have either distributed memory
    or shared memory.
  • While the processors on SIMD computers run in
    lock step, the processors on MIMD computers run
    independently of each other.
  • MIMD computers can be used for either data
    parallel or task parallel applications.
  • Some examples of MIMD computers are the SGI
    Origin2000 computer and the HP V-Class computer.

49
SPMD Computers
  • SPMD stands for Single Program Multiple Data.
  • SPMD is a special case of MIMD.
  • SPMD execution happens when a MIMD computer is
    programmed to have the same set of instructions
    per processor.
  • With SPMD computers, while the processors are
    running the same code segment, each processor can
    run that code segment asynchronously.
  • Unlike SIMD, the synchronous execution of
    instructions is relaxed.
  • An example is the execution of an if statement on
    a SPMD computer.
  • Because each processor computes with its own
    partition of the data elements, it may evaluate
    the right hand side of the if statement
    differently from another processor.
  • One processor may take a certain branch of the if
    statement, and another processor may take a
    different branch of the same if statement.
  • Hence, even though each processor has the same
    set of instructions, those instructions may be
    evaluated in a different order from one processor
    to the next.
  • The analogies we used for describing SIMD
    computers can be modified for MIMD computers.
  • Instead of the SIMD obedient soldiers, all
    marching in unison, in the MIMD world the
    processors march to the beat of their own
    drummer.
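  • A hedged sketch of the divergence described above,
    using OpenMP threads in place of processors (the
    even/odd test is illustrative, not from the
    slides):

      program spmd_branch
      use omp_lib
      implicit none
      integer :: me
!$OMP PARALLEL PRIVATE(me)
      me = omp_get_thread_num()
      ! every thread executes this same code, but each
      ! evaluates the condition with its own value and
      ! may therefore take a different branch
      if (mod(me, 2) == 0) then
         print *, 'thread', me, 'took the even branch'
      else
         print *, 'thread', me, 'took the odd branch'
      end if
!$OMP END PARALLEL
      end program spmd_branch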

50
Summary of SIMD versus MIMD
SIMD MIMD
Memory distributed memory distributed memory or shared memory
Code Segment same per processor same or different
Processors Run in lock step asynchronously
Data Elements different per processor different per processor
Applications data parallel data parallel or task parallel
51
Trends and Examples
  • Flow of control trends
  • The flow of control on today's commonly used
    parallel computers

Decade Flow of Control Computer Example
1980's SIMD Thinking Machines CM-2
1990's MIMD SGI Origin2000
2000's MIMD Workstation Clusters
Computer Flow of Control
SGI Origin2000 MIMD
HP V-Class MIMD
Cray T3E MIMD
IBM SP MIMD
Workstation Clusters MIMD
52
Agenda
  • 1 Parallel Computing Overview
  • 1.1 Introduction to Parallel Computing
  • 1.2 Comparison of Parallel Computers
  • 1.2.1 Processors
  • 1.2.2 Memory Organization
  • 1.2.3 Flow of Control
  • 1.2.4 Interconnection Networks
  • 1.2.4.1 Bus Network
  • 1.2.4.2 Cross-Bar Switch Network
  • 1.2.4.3 Hypercube Network
  • 1.2.4.4 Tree Network
  • 1.2.4.5 Interconnection Networks Self-test
  • 1.2.5 Summary of Parallel Computer
    Characteristics
  • 1.3 Summary

53
Interconnection Networks
  • What exactly is the interconnection network?
  • The interconnection network is made up of the
    wires and cables that define how the multiple
    processors of a parallel computer are connected
    to each other and to the memory units.
  • The time required to transfer data is dependent
    upon the specific type of the interconnection
    network.
  • This transfer time is called the communication
    time.
  • What network characteristics are important?
  • Diameter: the maximum distance that data must
    travel for 2 processors to communicate.
  • Bandwidth: the amount of data that can be sent
    through a network connection.
  • Latency: the delay on a network while a data
    packet is being stored and forwarded.
  • Types of Interconnection Networks
  • The network topologies (geometric arrangements of
    the computer network connections) are
  • Bus
  • Cross-bar Switch
  • Hypercube
  • Tree

54
Interconnection Networks
  • The aspects of network issues are
  • Cost
  • Scalability
  • Reliability
  • Suitable Applications
  • Data Rate
  • Diameter
  • Degree
  • General Network Characteristics
  • Some networks can be compared in terms of their
    degree and diameter.
  • Degree: how many communicating wires are coming
    out of each processor.
  • A large degree is a benefit because it provides
    multiple paths.
  • Diameter: the distance between the two
    processors that are farthest apart.
  • A small diameter corresponds to low latency.

55
Bus Network
  • Bus topology is the original coaxial cable-based
    Local Area Network (LAN) topology in which the
    medium forms a single bus to which all stations
    are attached.
  • The positive aspects:
  • It is a mature technology that is well known and
    reliable.
  • The cost is very low.
  • It is simple to construct.
  • The negative aspects:
  • limited data transmission rate.
  • not scalable in terms of performance.
  • Example: SGI Power Challenge, which only scaled to
    18 processors.

56
Cross-Bar Switch Network
  • A cross-bar switch is a network that works
    through a switching mechanism to access shared
    memory.
  • it scales better than the bus network but it
    costs significantly more.
  • The telephone system uses this type of network.
    An example of a computer with this type of
    network is the HP V-Class.
  • Here is a diagram of a cross-bar switch network
    which shows the processors talking through the
    switchboxes to store or retrieve data in memory.
  • There are multiple paths for a processor to
    communicate with a certain memory.
  • The switches determine the optimal route to take.

57
Hypercube Network
  • In a hypercube network, the processors are
    connected as if they were corners of a
    multidimensional cube. Each node in an N
    dimensional cube is directly connected to N other
    nodes.
  • The fact that the number of directly connected,
    "nearest neighbor", nodes increases with the
    total size of the network is also highly
    desirable for a parallel computer.
  • The degree of a hypercube network is log n and
    the diameter is log n, where n is the number of
    processors.
  • Examples of computers with this type of network
    are the CM-2, NCUBE-2, and the Intel iPSC860.
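  • For example (an illustrative calculation, not from
    the slides): with n = 64 processors, a hypercube
    has degree and diameter log2(64) = 6, so a message
    between any two processors needs at most 6 hops.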

58
Tree Network
  • The processors are the bottom nodes of the tree.
    For a processor to retrieve data, it must go up
    in the network and then go back down.
  • This is useful for decision making applications
    that can be mapped as trees.
  • The degree of a tree network is 1. The diameter
    of the network is 2 log(n+1) - 2, where n is the
    number of processors.
  • The Thinking Machines CM-5 is an example of a
    parallel computer with this type of network.
  • Tree networks are very suitable for database
    applications because they allow multiple searches
    through the database at a time.

59
Interconnection Networks
  • Torus Network: a mesh with wrap-around
    connections in both the x and y directions.
  • Multistage Network: a network with more than one
    networking unit.
  • Fully Connected Network: a network where every
    processor is connected to every other processor.
  • Hypercube Network: processors are connected as if
    they were corners of a multidimensional cube.
  • Mesh Network: a network where each interior
    processor is connected to its four nearest
    neighbors.

60
Interconnection Networks
  • Bus Based Network: coaxial cable based LAN
    topology in which the medium forms a single bus
    to which all stations are attached.
  • Cross-bar Switch Network: a network that works
    through a switching mechanism to access shared
    memory.
  • Tree Network: the processors are the bottom nodes
    of the tree.
  • Ring Network: each processor is connected to two
    others and the line of connections forms a circle.

61
Summary of Parallel Computer Characteristics
  • How many processors does the computer have?
  • 10s?
  • 100s?
  • 1000s?
  • How powerful are the processors?
  • what's the MHz rate
  • what's the MIPS rate
  • What's the instruction set architecture?
  • RISC
  • CISC

62
Summary of Parallel Computer Characteristics
  • How much memory is available?
  • total memory
  • memory per processor
  • What kind of memory?
  • distributed memory
  • shared memory
  • distributed shared memory
  • What type of flow of control?
  • SIMD
  • MIMD
  • SPMD

63
Summary of Parallel Computer Characteristics
  • What is the interconnection network?
  • Bus
  • Crossbar
  • Hypercube
  • Tree
  • Torus
  • Multistage
  • Fully Connected
  • Mesh
  • Ring
  • Hybrid

64
Design decisions made by some of the major
parallel computer vendors
Computer Programming Style OS Processors Memory Flow of Control Network
SGI Origin2000 OpenMP, MPI IRIX MIPS RISC R10000 DSM MIMD Crossbar, Hypercube
HP V-Class OpenMP, MPI HP-UX HP PA 8200 DSM MIMD Crossbar, Ring
Cray T3E SHMEM Unicos Compaq Alpha Distributed MIMD Torus
IBM SP MPI AIX IBM Power3 Distributed MIMD IBM Switch
Workstation Clusters MPI Linux Intel Pentium III Distributed MIMD Myrinet, Tree
65
Summary
  • This completes our introduction to parallel
    computing.
  • You have learned about parallelism in computer
    programs, and also about parallelism in the
    hardware components of parallel computers.
  • In addition, you have learned about the commonly
    used parallel computers, and how these computers
    compare to each other.
  • There are many good texts which provide an
    introductory treatment of parallel computing.
    Here are two useful references
  • Highly Parallel Computing, Second Edition, George
    S. Almasi and Allan Gottlieb, Benjamin/Cummings
    Publishers, 1994.
  • Parallel Computing: Theory and Practice, Michael
    J. Quinn, McGraw-Hill, Inc., 1994.

66
Agenda
  • 1 Parallel Computing Overview
  • 2 How to Parallelize a Code
  • 2.1 Automatic Compiler Parallelism
  • 2.2 Data Parallelism by Hand
  • 2.3 Mixing Automatic and Hand Parallelism
  • 2.4 Task Parallelism
  • 2.5 Parallelism Issues
  • 3 Porting Issues
  • 4 Scalar Tuning
  • 5 Parallel Code Tuning
  • 6 Timing and Profiling
  • 7 Cache Tuning
  • 8 Parallel Performance Analysis
  • 9 About the IBM Regatta P690

67
How to Parallelize a Code
  • This chapter describes how to turn a single
    processor program into a parallel one, focusing
    on shared memory machines.
  • Both automatic compiler parallelization and
    parallelization by hand are covered.
  • The details for accomplishing both data
    parallelism and task parallelism are presented.

68
Automatic Compiler Parallelism
  • Automatic compiler parallelism enables you to use
    a single compiler option and let the compiler do
    the work.
  • The advantage is that it's easy to use.
  • The disadvantages are
  • The compiler only does loop level parallelism,
    not task parallelism.
  • The compiler wants to parallelize every do loop
    in your code. If you have hundreds of do loops,
    this creates far too much parallel overhead.

69
Automatic Compiler Parallelism
  • To use automatic compiler parallelism on a Linux
    system with the Intel compilers, specify the
    following.
  • ifort -parallel -O2 ... prog.f
  • The compiler creates conditional code that will
    run with any number of threads.
  • Specify the number of threads with setenv and
    make sure you still get the right answers:
  • setenv OMP_NUM_THREADS 4
  • a.out > results

70
Data Parallelism by Hand
  • First identify the loops that use most of the CPU
    time (the Profiling lecture describes how to do
    this).
  • By hand, insert into the code OpenMP directive(s)
    just before the loop(s) you want to make
    parallel.
  • Some code modifications may be needed to remove
    data dependencies and other inhibitors of
    parallelism.
  • Use your knowledge of the code and data to assist
    the compiler.
  • For the SGI Origin2000 computer, insert into the
    code an OpenMP directive just before the loop
    that you want to make parallel.
  • !$OMP PARALLEL DO
  • do i=1,n
  •   ... lots of computation ...
  • end do
  • !$OMP END PARALLEL DO
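  • One common inhibitor of parallelism is a running
    sum (a reduction) that every iteration updates; the
    sketch below shows how OpenMP's REDUCTION clause
    handles it. The array x, its size, and its values
    are illustrative assumptions, not from the slides.

      program reduction_demo
      implicit none
      integer, parameter :: n = 1000
      integer :: i
      real :: x(n), total
      x = 1.0
      total = 0.0
      ! total = total + x(i) carries a dependence across
      ! iterations; REDUCTION(+:total) gives each thread
      ! a private partial sum and combines them at the end
!$OMP PARALLEL DO REDUCTION(+:total)
      do i = 1, n
         total = total + x(i)
      end do
!$OMP END PARALLEL DO
      print *, 'total =', total
      end program reduction_demo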

71
Data Parallelism by Hand
  • Compile with the -mp compiler option.
  • f90 -mp ... prog.f
  • As before, the compiler generates conditional
    code that will run with any number of threads.
  • If you want to rerun your program with a
    different number of threads, you do not need to
    recompile, just re-specify the setenv command.
  • setenv OMP_NUM_THREADS 8
  • a.out > results2
  • The setenv command can be placed anywhere before
    the a.out command.
  • The setenv command must be typed exactly as
    indicated. If you have a typo, you will not
    receive a warning or error message. To make sure
    that the setenv command is specified correctly,
    type
  • setenv
  • It produces a listing of your environment
    variable settings.

72
Mixing Automatic and Hand Parallelism
  • You can have one source file parallelized
    automatically by the compiler, and another source
    file parallelized by hand. Suppose you split your
    code into two files named prog1.f and prog2.f.
  • f90 -c -apo prog1.f (automatic parallelization of
    prog1.f)
  • f90 -c -mp prog2.f (parallelization by hand for
    prog2.f)
  • f90 prog1.o prog2.o (creates one executable)
  • a.out > results (runs the executable)

73
Task Parallelism
  • You can accomplish task parallelism as follows
  • !$OMP PARALLEL
  • !$OMP SECTIONS
  •   ... lots of computation in part A ...
  • !$OMP SECTION
  •   ... lots of computation in part B ...
  • !$OMP SECTION
  •   ... lots of computation in part C ...
  • !$OMP END SECTIONS
  • !$OMP END PARALLEL
  • Compile with the -mp compiler option.
  • f90 -mp prog.f
  • Use the setenv command to specify the number of
    threads.
  • setenv OMP_NUM_THREADS 3
  • a.out > results

74
Parallelism Issues
  • There are some issues to consider when
    parallelizing a program.
  • Should data parallelism or task parallelism be
    used?
  • Should automatic compiler parallelism or
    parallelism by hand be used?
  • Which loop in a nested loop situation should be
    the one that becomes parallel?
  • How many threads should be used?

75
Agenda
  • 1 Parallel Computing Overview
  • 2 How to Parallelize a Code
  • 3 Porting Issues
  • 3.1 Recompile
  • 3.2 Word Length
  • 3.3 Compiler Options for Debugging
  • 3.4 Standards Violations
  • 3.5 IEEE Arithmetic Differences
  • 3.6 Math Library Differences
  • 3.7 Compute Order Related Differences
  • 3.8 Optimization Level Too High
  • 3.9 Diagnostic Listings
  • 3.10 Further Information

76
Porting Issues
  • In order to run a computer program that presently
    runs on a workstation, a mainframe, a vector
    computer, or another parallel computer, on a new
    parallel computer you must first "port" the code.
  • After porting the code, it is important to have
    some benchmark results you can use for
    comparison.
  • To do this, run the original program on a
    well-defined dataset, and save the results from
    the old or baseline computer.
  • Then run the ported code on the new computer and
    compare the results.
  • If the results are different, don't automatically
    assume that the new results are wrong; they may
    actually be better. There are several reasons why
    this might be true, including:
  • Precision Differences - the new results may
    actually be more accurate than the baseline
    results.
  • Code Flaws - porting your code to a new computer
    may have uncovered a hidden flaw in the code that
    was already there.
  • Detection methods for finding code flaws,
    solutions, and workarounds are provided in this
    lecture.

77
Recompile
  • Some codes just need to be recompiled to get
    accurate results.
  • The compilers available on the NCSA computer
    platforms are shown in the following table

Language  SGI Origin2000 (MIPSpro)  IA-32 Linux (Portland Group / Intel / GNU)  IA-64 Linux (Portland Group / Intel / GNU)
Fortran 77  f77  pgf77 / ifort / g77  - / ifort / g77
Fortran 90  f90  pgf90 / ifort / -  - / ifort / -
Fortran 95  f95  - / ifort / -  - / ifort / -
High Performance Fortran  -  pghpf / - / -  pghpf / - / -
C  cc  pgcc / icc / gcc  - / icc / gcc
C++  CC  pgCC / icpc / g++  - / icpc / g++
78
Word Length
  • Code flaws can occur when you are porting your
    code to a different word length computer.
  • For C, the size of an integer variable differs
    depending on the machine and how the variable is
    generated. On the IA32 and IA64 Linux clusters,
    the size of an integer variable is 4 and 8 bytes,
    respectively. On the SGI Origin2000, the
    corresponding value is 4 bytes if the code is
    compiled with the -n32 flag, and 8 bytes if
    compiled without any flags or explicitly with the
    -64 flag.
  • For Fortran, the SGI MIPSpro and Intel compilers
    contain the following flags to set default
    variable size.
  • -i<n>, where n is a number: sets the default
    INTEGER to INTEGER*n. The value of n can be 4 or 8
    on the SGI, and 2, 4, or 8 on the Linux clusters.
  • -r<n>, where n is a number: sets the default REAL
    to REAL*n. The value of n can be 4 or 8 on the SGI,
    and 4, 8, or 16 on the Linux clusters.
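  • A portable alternative, not mentioned in the
    slides, is to request variable sizes explicitly in
    the source rather than relying on default-size
    flags; a minimal sketch:

      program word_length
      implicit none
      ! kinds selected by required range/precision rather
      ! than by compiler-specific default-size flags
      integer, parameter :: i8 = selected_int_kind(18)
      integer, parameter :: r8 = selected_real_kind(15)
      integer(kind=i8) :: big
      real(kind=r8) :: x
      big = 123456789012345_i8
      x = 1.0_r8 / 3.0_r8
      print *, big, x
      end program word_length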

79
Compiler Options for Debugging
  • On the SGI Origin2000, the MIPSpro compilers
    include debugging options via the -DEBUG group.
    The syntax is as follows:
  • -DEBUG:option1=value1:option2=value2...
  • Two examples are:
  • Array-bound checking (check for subscripts out of
    range at runtime): -DEBUG:subscript_check=ON
  • Force all uninitialized stack, automatic, and
    dynamically allocated variables to be
    initialized: -DEBUG:trap_uninitialized=ON

80
Compiler Options for Debugging
  • On the IA32 Linux cluster, the Intel Fortran
    compiler provides the following -C flags for
    runtime diagnostics:
  • -CA pointers and allocatable references
  • -CB array and subscript bounds
  • -CS consistent shape of intrinsic procedure
  • -CU use of uninitialized variables
  • -CV correspondence between dummy and actual
    arguments

81
Standards Violations
  • Code flaws can occur when the program has
    non-ANSI standard Fortran coding.
  • ANSI standard Fortran is a set of rules for
    compiler writers that specify, for example, the
    value of the do loop index upon exit from the do
    loop.
  • Standards Violations Detection
  • To detect standards violations on the SGI
    Origin2000 computer use the -ansi flag.
  • This option generates a listing of warning
    messages for the use of non-ANSI standard coding.
  • On the Linux clusters, the -ansi[-] flag
    enables/disables the assumption of ANSI conformance.

82
IEEE Arithmetic Differences
  • Code flaws occur when the baseline computer
    conforms to the IEEE arithmetic standard and the
    new computer does not.
  • The IEEE Arithmetic Standard is a set of rules
    governing arithmetic roundoff and overflow
    behavior.
  • For example, it prohibits the compiler writer
    from replacing x/y with x*recip(y), since the
    two results may differ slightly for some
    operands. You can make your program strictly
    conform to the IEEE standard.
  • To make your program conform to the IEEE
    Arithmetic Standards on the SGI Origin2000
    computer use
  • f90 -OPT:IEEE_arithmetic=n ... prog.f, where n is
    1, 2, or 3.
  • This option specifies the level of conformance to
    the IEEE standard where 1 is the most stringent
    and 3 is the most liberal.
  • On the Linux clusters, the Intel compilers can
    achieve conformance to IEEE standard at a
    stringent level with the -mp flag, or a slightly
    relaxed level with the -mp1 flag.

83
Math Library Differences
  • Most high-performance parallel computers are
    equipped with vendor-supplied math libraries.
  • On the SGI Origin2000 platform, there are
    SGI/Cray Scientific Library (SCSL) and
    Complib.sgimath.
  • SCSL contains Level 1, 2, and 3 Basic Linear
    Algebra Subprograms (BLAS), LAPACK and Fast
    Fourier Transform (FFT) routines.
  • SCSL can be linked with -lscs for the serial
    version, or -mp -lscs_mp for the parallel
    version.
  • The complib library can be linked with
    -lcomplib.sgimath for the serial version, or -mp
    -lcomplib.sgimath_mp for the parallel version.
  • The Intel Math Kernel Library (MKL) contains the
    complete set of functions from BLAS, the extended
    BLAS (sparse), the complete set of LAPACK
    routines, and Fast Fourier Transform (FFT)
    routines.

84
Math Library Differences
  • On the IA32 Linux cluster, the libraries to link
    to are
  • For BLAS: -L/usr/local/intel/mkl/lib/32 -lmkl
    -lguide -lpthread
  • For LAPACK: -L/usr/local/intel/mkl/lib/32
    -lmkl_lapack -lmkl -lguide -lpthread
  • When calling MKL routines from C/C++ programs,
    you also need to link with -lF90.
  • On the IA64 Linux cluster, the corresponding
    libraries are:
  • For BLAS: -L/usr/local/intel/mkl/lib/64 -lmkl_itp
    -lpthread
  • For LAPACK: -L/usr/local/intel/mkl/lib/64
    -lmkl_lapack -lmkl_itp -lpthread
  • When calling MKL routines from C/C++ programs,
    you also need to link with -lPEPCF90 -lCEPCF90
    -lF90 -lintrins

85
Compute Order Related Differences
  • Code flaws can occur because of the
    non-deterministic computation of data elements on
    a parallel computer. The compute order in which
    the threads will run cannot be guaranteed.
  • For example, in a data parallel program, the 50th
    index of a do loop may be computed before the
    10th index of the loop. Furthermore, the threads
    may run in one order on the first run, and in
    another order on the next run of the program.
  • Note If your algorithm depends on data being
    compared in a specific order, your code is
    inappropriate for a parallel computer.
  • Use the following method to detect compute order
    related differences
  • If your loop looks like
  • DO I = 1, N
  • change it to:
  • DO I = N, 1, -1
  • The results should not change if the iterations
    are independent.
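  • A hedged illustration of why the reversal test
    works (the arrays and values here are made up for
    this note): the first loop below is independent and
    gives the same result in either order, while the
    second carries a dependence from one iteration to
    the next and does not.

      program order_test
      implicit none
      integer, parameter :: n = 10
      integer :: i
      real :: a(n), b(n)
      a = 1.0
      ! independent iterations: reversing this loop (or
      ! running it in parallel) leaves b unchanged
      do i = 1, n
         b(i) = 2.0*a(i)
      end do
      ! dependent iterations: a(i) uses a(i-1), so
      ! reversing the loop changes the answer; this loop
      ! is order-sensitive as written
      do i = 2, n
         a(i) = a(i-1) + 1.0
      end do
      print *, b(n), a(n)
      end program order_test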

86
Optimization Level Too High
  • Code flaws can occur when the optimization level
    has been set too high thus trading speed for
    accuracy.
  • The compiler reorders and optimizes your code
    based on assumptions it makes about your program.
    This can sometimes cause answers to change at
    higher optimization level.
  • Setting the Optimization Level
  • Both the SGI Origin2000 computer and the IBM Linux
    clusters provide Level 0 (no optimization) through
    Level 3 (most aggressive) optimization, using the
    -O0, -O1, -O2, or -O3 flag. One should bear in
    mind that Level 3 optimization may carry out loop
    transformations that affect the correctness of
    calculations. Checking the correctness and
    precision of the calculation is highly recommended
    when -O3 is used.
  • For example, on the Origin2000:
  • f90 -O0 prog.f turns off all optimizations.

87
Optimization Level Too High
  • Isolating Optimization Level Problems
  • You can sometimes isolate optimization level
    problems using the method of binary chop.
  • To do this, divide your program prog.f into
    halves. Name them prog1.f and prog2.f.
  • Compile the first half with -O0 and the second
    half with -O3
  • f90 -c -O0 prog1.f
  • f90 -c -O3 prog2.f
  • f90 prog1.o prog2.o
  • a.out > results
  • If the results are correct, the optimization
    problem lies in prog1.f
  • Next divide prog1.f into halves. Name them
    prog1a.f and prog1b.f
  • Compile prog1a.f with -O0 and prog1b.f with -O3
  • f90 -c -O0 prog1a.f
  • f90 -c -O3 prog1b.f
  • f90 prog1a.o prog1b.o prog2.o
  • a.out > results
  • Continue in this manner until you have isolated
    the section of code that is producing incorrect
    results.

88
Diagnostic Listings
  • The SGI Origin 2000 compiler will generate all
    kinds of diagnostic warnings and messages, but
    not always by default. Some useful listing
    options are
  • f90 -listing ...
  • f90 -fullwarn ...
  • f90 -showdefaults ...
  • f90 -version ...
  • f90 -help ...

89
Further Information
  • SGI
  • man f77/f90/cc
  • man debug_group
  • man math
  • man complib.sgimath
  • MIPSpro 64-Bit Porting and Transition Guide
  • Online Manuals
  • Linux clusters pages
  • ifort/icc/icpc -help (IA32, IA64, Intel64)
  • Intel Fortran Compiler for Linux
  • Intel C/C++ Compiler for Linux

90
Agenda
  • 1 Parallel Computing Overview
  • 2 How to Parallelize a Code
  • 3 Porting Issues
  • 4 Scalar Tuning
  • 4.1 Aggressive Compiler Options
  • 4.2 Compiler Optimizations
  • 4.3 Vendor Tuned Code
  • 4.4 Further Information

91
Scalar Tuning
  • If you are not satisfied with the performance of
    your program on the new computer, you can tune
    the scalar code to decrease its runtime.
  • This chapter describes many of these techniques
  • The use of the most aggressive compiler options
  • The improvement of loop unrolling
  • The use of subroutine