1
CSE-700 Parallel Programming: Introduction
???
  • POSTECH
  • Sep 6, 2007

2
Common Features?
3
... runs faster on
4
Multi-core CPUs
  • IBM Power4, dual-core, 2000
  • Intel reaches thermal wall, 2004 ⇒ no more free
    lunch!
  • Intel Xeon, quad-core, 2006
  • Sony PlayStation 3 Cell, eight cores enabled,
    2006
  • Intel, 80-cores, 2011 (prototype finished)

Source: Herb Sutter, "Software and the Concurrency
Revolution"
5
Parallel Programming Models
  • Posix threads (API)
  • OpenMP (API)
  • HPF (High Performance Fortran)
  • Cray's Chapel
  • Nesl
  • Sun's Fortress
  • IBM's X10
  • ...
  • and a lot more.

6
Parallelism
  • Data parallelism
  • ability to apply a function in parallel to each
    element of a collection of data (see the C/OpenMP
    sketch after this list)
  • Thread parallelism
  • ability to run multiple threads concurrently
  • Each thread uses its own local state.
  • Shared memory parallelism
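
A minimal sketch of data parallelism in C with OpenMP (illustrative, not from the slides): the same function is applied to every element of an array, and the independent loop iterations are divided among the available threads. Compile with -fopenmp.

    #include <stdio.h>

    #define N 1000000

    /* Data parallelism: apply the same function to each element of a
     * collection; OpenMP splits the independent iterations across threads. */
    double square(double x) { return x * x; }

    int main(void) {
        static double a[N], b[N];
        for (int i = 0; i < N; i++)
            a[i] = (double)i;

        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            b[i] = square(a[i]);    /* each iteration is independent */

        printf("b[42] = %f\n", b[42]);
        return 0;
    }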

7
Data Parallelism
Thread Parallelism
Shared Memory Parallelism
8
Data Parallelism: Data Separation
[Figure: a collection of data elements (a1, a2, ..., an, ..., anm, ...) is partitioned into contiguous chunks, with one chunk assigned to each of hardware threads 1, 2, and 3]
9
Data Parallelism in Hardware
  • GeForce 8800
  • 128 stream processors @ 1.3 GHz, 500 GFLOPS

10
Data Parallelism in Programming Languages
  • Fortress
  • parallelism is the default.
  • for i ← 1:m, j ← 1:n do   // 1:n is a generator
  •   a[i,j] := b[i] c[j]
  • end
  • Nesl (1990's)
  • supports nested data parallelism
  • the function being applied itself can be
    parallel.
  • {sum(a) : a in [[2, 3], [8, 3, 9], [7]]}

11
Data Parallel Haskell (DAMP '07)
  • Haskell + nested data parallelism
  • flattening (vectorization)
  • transforms a nested parallel program such that
    it manipulates only flat arrays.
  • fusion
  • eliminate many intermediate arrays
  • Example: 10,000 x 10,000 sparse matrix multiplication
    with 1 million elements (a sequential C sketch
    follows below)
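
For reference, the computation behind this example is, in sequential C terms, roughly a compressed-sparse-row (CSR) matrix-vector product; the struct and names below are illustrative assumptions, not DPH code. DPH expresses the same thing as a nested parallel array (one subarray of index/value pairs per row) and relies on flattening and fusion to parallelize it over flat arrays.

    /* Illustrative sequential sketch (not DPH code): sparse matrix-vector
     * multiplication over a compressed-sparse-row (CSR) representation. */
    typedef struct {
        int     nrows;
        int    *row_start;  /* nrows+1 offsets into col[] and val[] */
        int    *col;        /* column index of each non-zero entry  */
        double *val;        /* value of each non-zero entry         */
    } csr_matrix;

    void smvm(const csr_matrix *m, const double *x, double *y) {
        for (int i = 0; i < m->nrows; i++) {
            double sum = 0.0;
            for (int k = m->row_start[i]; k < m->row_start[i + 1]; k++)
                sum += m->val[k] * x[m->col[k]];   /* dot product of row i with x */
            y[i] = sum;
        }
    }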

12
Data ParallelismThread ParallelismShared Memory
Parallelism
13
Thread Parallelism
[Figure: two hardware threads, each with its own local state, exchanging messages over synchronous communication channels]
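
A minimal sketch of this model in C with MPI (which the course uses later; the program itself is illustrative, not from the slides): each process keeps its own local state, and the two communicate only by passing messages.

    #include <mpi.h>
    #include <stdio.h>

    /* Two processes with private local state, communicating by messages.
     * (MPI_Ssend would make the send strictly synchronous.) */
    int main(int argc, char **argv) {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        int local_state = rank * 100;   /* visible only to this process */

        if (rank == 0) {
            MPI_Send(&local_state, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            int msg;
            MPI_Recv(&msg, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("process 1 received %d from process 0\n", msg);
        }

        MPI_Finalize();
        return 0;
    }

Run with at least two processes, e.g. "mpirun -np 2 ./a.out".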
14
Pure Functional Threads
  • Purely functional threads can run concurrently.
  • Effect-free computations can be executed in
    parallel with any other effect-free computations.
  • Example: collision detection

[Figure: objects A and B and their updated states A' and B'; the effect-free collision computations can run in parallel]
15
Manticore (DAMP '07)
  • Three layers
  • sequential base language
  • functional language drawn from SML
  • no mutable references and arrays!
  • data-parallel programming
  • Implicit
  • the compiler and runtime system manage thread
    creation.
  • e.g., parallel arrays of parallel arrays
  • [| 2*n | n in nums where n > 0 |]
  • fun mapP f xs = [| f x | x in xs |]
  • concurrent programming

16
Concurrent Programming in Manticore (DAMP '07)
  • Based on Concurrent ML
  • threads and synchronous message passing
  • Threads do not share mutable states.
  • actually no mutable references and arrays
  • explicit
  • The programmer manages thread creation.

17
Data Parallelism
Thread Parallelism
Shared Memory Parallelism
(Shared State Concurrency)
18
Shared Memory Parallelism
[Figure: hardware threads 1, 2, and 3 all read and write a single shared memory]
19
World War II
20
Company of Heroes
  • Interaction of a LOT of objects
  • thousands of objects
  • Each object has its own mutable state.
  • Each object update affects several other objects.
  • All objects are updated 30 times per second.
  • Problem
  • How do we handle simultaneous updates to the same
    memory location?
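
A small pthreads sketch of the problem (illustrative, not from the game): two threads updating the same shared counter without synchronization lose updates, because the read-modify-write in counter++ is not atomic.

    #include <pthread.h>
    #include <stdio.h>

    /* Unsynchronized concurrent updates to shared memory: the final value is
     * often less than 2,000,000 because increments are lost when two threads
     * read-modify-write the same location at the same time. */
    long counter = 0;

    void *bump(void *arg) {
        (void)arg;                     /* unused */
        for (int i = 0; i < 1000000; i++)
            counter++;                 /* data race: not atomic */
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, bump, NULL);
        pthread_create(&t2, NULL, bump, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("counter = %ld (expected 2000000)\n", counter);
        return 0;
    }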

21
Manual Lock-based Synchronization
  • pthread_mutex_lock(&mutex);
  • mutate_variable();
  • pthread_mutex_unlock(&mutex);
    (a self-contained pthreads version appears after
    this list)
  • Locks and condition variables
  • ⇒ fundamentally flawed!
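
A self-contained version of the snippet above (the shared counter and the names are illustrative): the racy update from the previous sketch, now protected by a mutex so no increments are lost, at the cost of serializing the critical section.

    #include <pthread.h>
    #include <stdio.h>

    long counter = 0;
    pthread_mutex_t counter_mutex = PTHREAD_MUTEX_INITIALIZER;

    void *bump(void *arg) {
        (void)arg;                               /* unused */
        for (int i = 0; i < 1000000; i++) {
            pthread_mutex_lock(&counter_mutex);
            counter++;                           /* mutate the shared variable */
            pthread_mutex_unlock(&counter_mutex);
        }
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, bump, NULL);
        pthread_create(&t2, NULL, bump, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("counter = %ld\n", counter);      /* now always 2000000 */
        return 0;
    }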

22
Bank Accounts (Beautiful Concurrency, Peyton Jones, 2007)

[Figure: threads 1 ... n issue transfer requests against accounts A and B held in shared memory]

  • Invariant: atomicity
  • no thread observes a state in which the money has
    left one account, but has not arrived in the
    other.

23
Bank Accounts using Locks
  • In an object-oriented language
  • class Account {
  •   Int balance;
  •   synchronized void deposit (Int n) { balance = balance + n; }
  • }
  • Code for transfer:
  • void transfer (Account from, Account to, Int amount) {
  •   from.withdraw (amount);
  •   to.deposit (amount);
  • }

an intermediate state is observable between the withdraw and the deposit!
24
A Quick Fix: Explicit Locking
  • void transfer (Account from, Account to, Int amount) {
  •   from.lock(); to.lock();
  •   from.withdraw (amount);
  •   to.deposit (amount);
  •   from.unlock(); to.unlock();
  • }
  • Now, the program is prone to deadlock.

25
Locks are Bad
  • Taking too few locks
  • ⇒ simultaneous update
  • Taking too many locks
  • ⇒ no concurrency or deadlock
  • Taking the wrong locks
  • ⇒ error-prone programming
  • Taking locks in the wrong order
  • ⇒ error-prone programming
  • ...
  • Fundamental problem: no modular programming
  • Correct implementations of withdraw and deposit
    do not give a correct implementation of transfer.

26
Transactional Memory
  • An alternative to lock-based synchronization
  • eliminates many problems associated with
    lock-based synchronization
  • no deadlock
  • read sharing
  • safe modular programming
  • Hot research area
  • hardware transactional memory
  • software transactional memory
  • C, Java, functional languages, ...

27
Transactions in Haskell
  • transfer :: Account -> Account -> Int -> IO ()
  • -- transfer 'amount' from account 'from' to
    account 'to'
  • transfer from to amount
  •   = atomically (do { deposit to amount
  •                    ; withdraw from amount })
  • atomically act
  • atomicity
  • the effects become visible to other threads all
    at once.
  • isolation
  • the action act does not see any effects from
    other threads.

28
Conclusion: We need parallelism!
29
Tim Sweeney's POPL '06 Invited Talk - Last Slide
30
CSE-700 Parallel Programming
  • Fall 2007

31
CSE-700 in a Nutshell
  • Scope
  • Parallel computing from the viewpoint of
    programmers and language designers
  • We will not talk about hardware for parallel
    computing
  • Audience
  • Anyone interested in learning parallel
    programming
  • Prerequisite
  • C programming
  • Desire to learn new programming languages

32
Material
  • Books
  • Introduction to Parallel Computing (2nd ed.).
    Ananth Grama et al.
  • Parallel Programming with MPI. Peter Pacheco.
  • Parallel Programming in OpenMP. Rohit Chandra et
    al.
  • Any textbook on MPI and OpenMP is fine.
  • Papers

33
Teaching Staff
  • Instructors
  • Gla
  • Myson
  • ...
  • and YOU!
  • We will lead this course TOGETHER.

34
Resources
  • Plquad
  • quad-core Linux
  • OpenMP and MPI already installed
  • Ask for an account if you need one.

35
Basic Plan - First Half
  • Goal
  • learn the basics of parallel programming through
    5 assignments on OpenMP and MPI
  • Each lecture consists of
  • discussion on the previous assignment
  • Each of you is expected to give a presentation.
  • presentation on OpenMP and MPI by the instructors
  • discussion on the next assignment

36
Basic Plan - Second Half
  • Recent parallel languages
  • learn a recent parallel language
  • write a cool program in your parallel language
  • give a presentation on your experience
  • Topics in parallel language research
  • choose a topic
  • give a presentation on it

37
What Matters Most?
  • Spirit of adventure
  • Proactivity
  • Desire to provoke Happy Chaos
  • I want you to develop this course into a total,
    complete, yet happy chaos.
  • A truly inspirational course borders almost on
    chaos.

38
Impact of Memory and Cache on Performance
39
Impact of Memory Bandwidth 1
  • Consider the following code fragment
  • for (i = 0; i < n; i++) {
  •   column_sum[i] = 0.0;
  •   for (j = 0; j < n; j++)
  •     column_sum[i] += b[j][i];
  • }
  • The code fragment sums columns of the matrix b
    into a vector column_sum.

40
Impact of Memory Bandwidth 2
  • The vector column_sum is small and easily fits
    into the cache
  • The matrix b is accessed in a column order.
  • The strided access results in very poor
    performance.

[Figure: multiplying a matrix with a vector: (a) multiplying column-by-column, keeping a running sum; (b) computing each element of the result as a dot product of a row of the matrix with the vector]
41
Impact of Memory Bandwidth 3
  • We can fix the above code as follows:
  • for (i = 0; i < n; i++)
  •   column_sum[i] = 0.0;
  • for (j = 0; j < n; j++)
  •   for (i = 0; i < n; i++)
  •     column_sum[i] += b[j][i];
  • In this case, the matrix is traversed in row
    order, and performance can be expected to be
    significantly better.

42
Lesson
  • Memory layout and appropriately organized
    computation can have a significant impact on
    spatial and temporal locality.

43
Assignment 1: Cache Matrix Multiplication
44
Typical Sequential Implementation
  • A: n x n
  • B: n x n
  • C = A * B: n x n
  • for i = 1 to n
  •   for j = 1 to n
  •     C[i, j] = 0
  •     for k = 1 to n
  •       C[i, j] += A[i, k] * B[k, j]

45
Using Submatrices
  • Improves data locality significantly (see the
    blocked C sketch below).
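
A sketch of the submatrix (blocking/tiling) idea in C; the matrix dimension, block size, and names are illustrative, and N is assumed to be a multiple of the block size.

    #define N  1024   /* matrix dimension (illustrative) */
    #define BS   32   /* block (tile) size; assumed to divide N evenly */

    /* C = A * B computed block by block, so the tiles of A, B and C touched
     * by the inner loops stay resident in cache while they are reused. */
    void matmul_blocked(double A[N][N], double B[N][N], double C[N][N]) {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                C[i][j] = 0.0;

        for (int ii = 0; ii < N; ii += BS)
            for (int kk = 0; kk < N; kk += BS)
                for (int jj = 0; jj < N; jj += BS)
                    for (int i = ii; i < ii + BS; i++)
                        for (int k = kk; k < kk + BS; k++)
                            for (int j = jj; j < jj + BS; j++)
                                C[i][j] += A[i][k] * B[k][j];
    }

The block size should be tuned so that the three BS x BS tiles of doubles used by the inner loops fit together in the cache of the target machine.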

46
Experimental Results
47
Assignment 1
  • Machine
  • the older, the better.
  • Myson offers his ancient notebook for you.
  • Pentium II, 600 MHz
  • no L1 cache
  • 64KB L2 cache
  • running Linux
  • Prepare a presentation on your experimental
    results.