CS 213: Parallel Processing Architectures - PowerPoint PPT Presentation

About This Presentation
Title:

CS 213: Parallel Processing Architectures

Description:

Parallelism moved to instruction level. Microprocessor performance ... Process Level or Thread level parallelism; mainstream for general purpose computing? ... – PowerPoint PPT presentation

Slides: 27
Provided by: laxmib
Learn more at: http://www.cs.ucr.edu
Transcript and Presenter's Notes

Title: CS 213: Parallel Processing Architectures


1
CS 213 Parallel Processing Architectures
  • Laxmi Narayan Bhuyan
  • http://www.cs.ucr.edu/~bhuyan

2
  • PARALLEL PROCESSING ARCHITECTURES
  • CS213 SYLLABUS
  • Winter 2008
  • INSTRUCTOR: L.N. Bhuyan (http://www.engr.ucr.edu/~bhuyan/)
  • PHONE: (951) 827-2347 E-mail: bhuyan_at_cs.ucr.edu
  • LECTURE TIME: TR 12:40pm-2pm
  • PLACE: HMNSS 1502
  • OFFICE HOURS: W 2:00-4:00 or by appointment

3
  • References
  • John Hennessy and David Patterson, Computer
    Architecture: A Quantitative Approach, Morgan
    Kaufmann Publishers.
  • Research papers to be made available in class
  • COURSE OUTLINE
  • Introduction to Parallel Processing: Flynn's
    classification, SIMD and MIMD operations, shared
    memory vs. message passing multiprocessors,
    distributed shared memory
  • Shared Memory Multiprocessors: SMP and CC-NUMA
    architectures, cache coherence protocols,
    consistency protocols, data prefetching, CC-NUMA
    memory management, SGI 4700 multiprocessor, chip
    multiprocessors, network processors (IXP and
    Cavium)
  • Interconnection Networks: static and dynamic
    networks, switching techniques, Internet
    techniques
  • Message Passing Architectures: message passing
    paradigms, Grid architecture, workstation
    clusters, user-level software
  • Multiprocessor Scheduling: scheduling and
    mapping, Internet web servers, P2P, content-aware
    load balancing
  • PREREQUISITE: CS 203A
  • GRADING
  • Project I: 20 points; Project II: 30 points;
    Test 1: 20 points; Test 2: 30 points

4
Possible Projects
  • Experiments with SGI Altix 4700 Supercomputer:
    algorithm design and FPGA offloading
  • I/O scheduling on SGI
  • Chip Multiprocessor (CMP): design, analysis and
    simulation
  • P2P: using PlanetLab
  • Note: 2 students/group. Expect submission of a
    paper to a conference

5
Useful Web Addresses
  • http://www.sgi.com/products/servers/altix/4000/
    and http://www.sgi.com/products/rasc/
  • Wisconsin Computer Architecture Page, Simulators:
    http://www.cs.wisc.edu/arch/www/tools.html
  • SimpleScalar: www.simplescalar.com (look for
    multiprocessor extensions)
  • NepSim: http://www.cs.ucr.edu/~yluo/nepsim/
  • Working in a cluster environment
  • Beowulf Cluster: www.beowulf.org
  • MPI: www-unix.mcs.anl.gov/mpi
  • Application Benchmarks
  • http://www-flash.stanford.edu/apps/SPLASH/

6
Parallel Computers
  • Definition: "A parallel computer is a collection
    of processing elements that cooperate and
    communicate to solve large problems fast."
  • Almasi and Gottlieb, Highly Parallel Computing,
    1989
  • Questions about parallel computers
  • How large a collection?
  • How powerful are processing elements?
  • How do they cooperate and communicate?
  • How are data transmitted?
  • What type of interconnection?
  • What are HW and SW primitives for programmer?
  • Does it translate into performance?

7
Parallel Processors Myth
  • The dream of computer architects since the 1950s:
    replicate processors to add performance vs.
    design a faster processor
  • Led to innovative organizations tied to particular
    programming models, since uniprocessors can't
    keep going
  • e.g., uniprocessors must stop getting faster due
    to the speed-of-light limit. Has it happened?
  • Killer micros! Parallelism moved to the instruction
    level. Microprocessor performance doubles every
    1.5 years!
  • In the 1990s companies went out of business: Thinking
    Machines, Kendall Square, ...

8
What Level of Parallelism?
  • Bit-level parallelism: 1970 to 1985
  • 4-bit, 8-bit, 16-bit, 32-bit microprocessors
  • Instruction-level parallelism (ILP): 1985
    through today
  • Pipelining
  • Superscalar
  • VLIW
  • Out-of-order execution
  • Limits to benefits of ILP?
  • Process-level or thread-level parallelism:
    mainstream for general-purpose computing?
  • Servers are parallel
  • High-end desktop: dual-processor PC soon?? (or
    just sell the socket?)

9
Why Multiprocessors?
  • Microprocessors as the fastest CPUs
  • Collecting several is much easier than redesigning one
  • Complexity of current microprocessors
  • Do we have enough ideas to sustain 2X/1.5yr?
  • Can we deliver such complexity on schedule?
  • Slow (but steady) improvement in parallel
    software (scientific apps, databases, OS)
  • Emergence of embedded and server markets driving
    microprocessors in addition to desktops
  • Embedded: functional parallelism
  • Network processors: exploiting packet-level
    parallelism
  • SMP servers and clusters of workstations for
    multiple users: less demand for parallel
    computing

10
Amdahl's Law and Parallel Computers
  • Amdahl's Law (f = original fraction
    sequential): Speedup = 1 / (f + (1 - f)/n)
    = n / (1 + (n - 1) f), where n = number of processors
  • A portion f is sequential => limits parallel
    speedup
  • Speedup <= 1/f
  • Ex.: What fraction sequential to get 80X speedup
    from 100 processors? Assume either 1 processor or
    all 100 fully used
  • 80 = 1 / (f + (1 - f)/100) => f ≈ 0.0025
  • Only 0.25% sequential! => Must be a highly
    parallel program
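The arithmetic on this slide can be checked with a few lines of Python. This is a minimal sketch; the function names `amdahl_speedup` and `sequential_fraction_for` are illustrative and do not come from the slides.

```python
def amdahl_speedup(f: float, n: int) -> float:
    """Speedup on n processors when a fraction f of the work is sequential."""
    return 1.0 / (f + (1.0 - f) / n)


def sequential_fraction_for(speedup: float, n: int) -> float:
    """Invert Amdahl's Law: the sequential fraction f giving `speedup` on n processors."""
    # speedup = n / (1 + (n - 1) f)   =>   f = (n / speedup - 1) / (n - 1)
    return (n / speedup - 1.0) / (n - 1)


# The slide's example: 80X speedup from 100 processors.
f = sequential_fraction_for(80, 100)
print(f"sequential fraction f = {f:.4f}")
print(f"speedup with that f on 100 processors = {amdahl_speedup(f, 100):.1f}")
```

Even with a very large number of processors, `amdahl_speedup(f, n)` stays below `1 / f`, which is the bound stated above.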

12
Popular Flynn Categories
  • SISD (Single Instruction Single Data)
  • Uniprocessors
  • MISD (Multiple Instruction Single Data)
  • ??? multiple processors on a single data stream
  • SIMD (Single Instruction Multiple Data)
  • Examples: Illiac-IV, CM-2
  • Simple programming model
  • Low overhead
  • Flexibility
  • All custom integrated circuits
  • (Phrase reused by Intel marketing for media
    instructions, which are vector-like)
  • MIMD (Multiple Instruction Multiple Data)
  • Examples: Sun Enterprise 5000, Cray T3D, SGI
    Origin
  • Flexible
  • Use off-the-shelf micros
  • MIMD is the current winner: concentrate on the major
    design emphasis, <= 128 processor MIMD machines

13
Classification of Parallel Processors
  • SIMD. Ex.: Illiac IV and MasPar
  • MIMD: true multiprocessors
  • 1. Message Passing Multiprocessor:
    interprocessor communication through explicit
    message passing, using send and receive
    operations.
  • Ex.: IBM SP2, Cray XD1, and clusters
  • 2. Shared Memory Multiprocessor: all
    processors share the same address space.
    Interprocessor communication through load/store
    operations to a shared memory.
  • Ex.: SMP servers, SGI Origin, HP V-Class,
    Cray T3E
  • Their advantages and disadvantages?

14
More Message passing Computers
  • Cluster: computers connected over a high-bandwidth
    local area network (Ethernet or Myrinet) and used as
    a parallel computer
  • Network of Workstations (NOW): homogeneous
    cluster of same-type computers
  • Grid: computers connected over a wide area network

15
Another Classification for MIMD Computers
  • Centralized Memory: shared memory located at a
    centralized location; may consist of several
    interleaved modules; same distance from any
    processor. Symmetric Multiprocessor (SMP),
    Uniform Memory Access (UMA)
  • Distributed Memory: memory is distributed to each
    processor, which improves scalability
  • (a) Message passing architectures: no
    processor can directly access another processor's
    memory
  • (b) Hardware Distributed Shared Memory (DSM)
    multiprocessor: memory is distributed, but the
    address space is shared. Non-Uniform Memory
    Access (NUMA)
  • (c) Software DSM: a layer of the OS built on
    top of a message passing multiprocessor to give a
    shared memory view to the programmer.

17
Data Parallel Model
  • Operations can be performed in parallel on each
    element of a large regular data structure, such
    as an array
  • 1 Control Processor (CP) broadcasts to many PEs.
    The CP reads an instruction from the control
    memory, decodes the instruction, and broadcasts
    control signals to all PEs.
  • Condition flag per PE so that it can skip an
    instruction
  • Data distributed in each memory
  • Early 1980s VLSI => SIMD rebirth: 32 1-bit PEs +
    memory on a chip was the PE
  • Data parallel programming languages lay out data
    to processors

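The control-processor and condition-flag mechanism described on this slide can be imitated in ordinary Python. This is only a sketch: the class `ProcessingElement` and the functions `set_flags`, `broadcast`, and `simd_abs` are hypothetical names, each PE holds a single value, and a sequential loop stands in for PEs that would operate in lockstep on real SIMD hardware.

```python
from typing import Callable, List


class ProcessingElement:
    """One PE: a private data value plus a condition flag."""

    def __init__(self, value: int) -> None:
        self.value = value
        self.flag = True  # when False, the PE skips broadcast instructions


def set_flags(pes: List[ProcessingElement], predicate: Callable[[int], bool]) -> None:
    """Each PE sets its own flag from a test on its local data."""
    for pe in pes:
        pe.flag = predicate(pe.value)


def broadcast(pes: List[ProcessingElement], op: Callable[[int], int]) -> None:
    """Control processor: send one instruction to all PEs; flagged-off PEs skip it."""
    for pe in pes:  # conceptually simultaneous; a loop is used here for simplicity
        if pe.flag:
            pe.value = op(pe.value)


def simd_abs(data: List[int]) -> List[int]:
    """Absolute value of every element, one element per PE."""
    pes = [ProcessingElement(x) for x in data]
    set_flags(pes, lambda v: v < 0)   # only PEs holding negatives stay active
    broadcast(pes, lambda v: -v)      # the same instruction goes to every PE
    set_flags(pes, lambda v: True)    # re-enable all PEs
    return [pe.value for pe in pes]


print(simd_abs([3, -1, 4, -5]))
```

The point of the flag is that a data-dependent branch becomes "every PE receives the instruction, some ignore it", so the control processor never needs per-PE instruction streams.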
18
Data Parallel Model
  • Vector processors have similar ISAs, but no data
    placement restriction
  • SIMD led to Data Parallel Programming languages
  • Advancing VLSI led to single chip FPUs and whole
    fast µProcs (SIMD less attractive)
  • SIMD programming model led to Single Program
    Multiple Data (SPMD) model
  • All processors execute identical program
  • Data parallel programming languages still useful;
    do communication all at once: "Bulk Synchronous"
    phases in which all communicate after a global
    barrier
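The SPMD and bulk-synchronous ideas on this slide can be sketched with Python threads. This is an illustration under assumptions: `run_spmd` and `program` are hypothetical names, threads stand in for processors, and `threading.Barrier` plays the role of the global barrier.

```python
import threading
from typing import List


def run_spmd(data: List[int], num_procs: int) -> List[int]:
    """Every 'processor' runs the same program on its own slice of the data."""
    partial = [0] * num_procs   # one slot per processor, written in the compute phase
    totals = [0] * num_procs    # each processor's view of the global sum
    barrier = threading.Barrier(num_procs)

    def program(rank: int) -> None:
        # Compute phase: purely local work on this processor's share of the data.
        partial[rank] = sum(data[rank::num_procs])
        # Global barrier: nobody communicates until everyone has finished computing.
        barrier.wait()
        # Communication phase: all processors exchange results at once.
        totals[rank] = sum(partial)

    threads = [threading.Thread(target=program, args=(r,)) for r in range(num_procs)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return totals


print(run_spmd(list(range(1, 101)), 4))
```

All processors execute the identical `program`; only `rank` differs, which is what distinguishes SPMD from running unrelated programs on each processor.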

19
SIMD Programming: High-Performance Fortran (HPF)
  • Single Program Multiple Data (SPMD)
  • FORALL construct: similar to fork
  • FORALL (I = 1:N), A(I) = B(I) + C(I), END
    FORALL
  • Data Mapping in HPF
  • 1. To reduce interprocessor communication
  • 2. Load balancing among processors
  • http://www.npac.syr.edu/hpfa/
  • http://www.crpc.rice.edu/HPFF/
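The two goals of data mapping can be seen in a small Python sketch of the FORALL above. It assumes a balanced block partition of the index range; `block_bounds` and `forall_add` are hypothetical helpers, and the partition rule is an illustration rather than HPF's exact distribution rule.

```python
from typing import List, Tuple


def block_bounds(n: int, p: int, rank: int) -> Tuple[int, int]:
    """Half-open index range [lo, hi) owned by processor `rank` of `p`."""
    base, extra = divmod(n, p)
    # The first `extra` processors get one extra element, so block sizes
    # differ by at most one (load balancing).
    lo = rank * base + min(rank, extra)
    hi = lo + base + (1 if rank < extra else 0)
    return lo, hi


def forall_add(b: List[float], c: List[float], p: int) -> List[float]:
    """A(I) = B(I) + C(I) for all I, with A, B, C mapped identically."""
    n = len(b)
    a = [0.0] * n
    for rank in range(p):  # each iteration stands in for one processor
        lo, hi = block_bounds(n, p, rank)
        for i in range(lo, hi):
            # a[i], b[i], c[i] all live on the same processor, so this
            # assignment needs no interprocessor communication.
            a[i] = b[i] + c[i]
    return a


print([block_bounds(10, 4, r) for r in range(4)])
print(forall_add([1.0, 2.0, 3.0], [10.0, 20.0, 30.0], 2))
```

If B were mapped differently from A and C, some `b[i]` would have to be fetched from another processor, which is the communication cost that a good mapping avoids.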

20
Major MIMD Styles
  • Centralized shared memory ("Uniform Memory
    Access" time, or "Shared Memory Processor")
  • Decentralized memory (memory module with CPU)
  • Advantages: scalability, more memory
    bandwidth, lower local memory latency
  • Drawbacks: longer remote communication latency,
    more complex software model
  • Two types: shared memory and message passing

21
Symmetric Multiprocessor (SMP)
  • Memory: centralized with uniform access time
    (UMA) and bus interconnect
  • Examples: Sun Enterprise 5000, SGI Challenge,
    Intel SystemPro

22
Decentralized Memory versions
  • Shared Memory with "Non Uniform Memory Access"
    time (NUMA)
  • Message passing "multicomputer" with separate
    address space per processor
  • Can invoke software with Remote Procedure Call
    (RPC)
  • Often via a library, such as MPI (Message Passing
    Interface)
  • Also called "synchronous communication", since
    communication causes synchronization between 2
    processes
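The synchronizing effect of message passing can be sketched as follows. This is an emulation under stated assumptions: Python threads stand in for processes with separate address spaces (real threads do share memory, so the separation here is by convention only), and `SyncChannel` and `demo` are hypothetical names, not part of MPI or any other library.

```python
import queue
import threading
from typing import List, Optional


class SyncChannel:
    """A channel whose send() returns only after the receiver has taken the message."""

    def __init__(self) -> None:
        self._q: "queue.Queue[Optional[int]]" = queue.Queue()

    def send(self, msg: Optional[int]) -> None:
        self._q.put(msg)
        self._q.join()        # block until the receiver acknowledges the message

    def recv(self) -> Optional[int]:
        msg = self._q.get()   # block until a message arrives
        self._q.task_done()   # acknowledge, which releases the blocked sender
        return msg


def demo(values: List[int]) -> int:
    chan = SyncChannel()
    result = {}

    def producer() -> None:
        for v in values:
            chan.send(v)      # cannot run ahead of the consumer
        chan.send(None)       # end-of-stream marker (a convention of this sketch)

    def consumer() -> None:
        total = 0             # private state: only messages carry data across
        while True:
            msg = chan.recv()
            if msg is None:
                break
            total += msg
        result["total"] = total

    threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result["total"]


print(demo([1, 2, 3]))
```

Because each `send` waits for the matching `recv`, the two sides are forced to meet at every message, which is the synchronization the slide refers to.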

23
Distributed Directory MPs
24
Communication Models
  • Shared Memory
  • Processors communicate with shared address space
  • Easy on small-scale machines
  • Advantages
  • Model of choice for uniprocessors, small-scale
    MPs
  • Ease of programming
  • Lower latency
  • Easier to use hardware controlled caching
  • Message passing
  • Processors have private memories, communicate
    via messages
  • Advantages
  • Less hardware, easier to design
  • Good scalability
  • Focuses attention on costly non-local operations
  • Virtual Shared Memory (VSM)

25
Shared Address/Memory Multiprocessor Model
  • Communicate via Load and Store
  • Oldest and most popular model
  • Based on timesharing processes on multiple
    processors vs. sharing single processor
  • Process: a virtual address space and 1 thread
    of control
  • Multiple processes can overlap (share), but ALL
    threads share a process address space
  • Writes to shared address space by one thread are
    visible to reads of other threads
  • Usual model share code, private stack, some
    shared heap, some private heap
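The "shared heap, private stack" model can be illustrated with Python threads. This is a sketch under assumptions: `run_shared_counter` and `worker` are hypothetical names, and the lock is an addition not mentioned on the slide, used here so that concurrent updates to the shared location are not lost.

```python
import threading


def run_shared_counter(num_threads: int, increments: int) -> int:
    shared_heap = {"counter": 0}   # one object, visible to every thread
    lock = threading.Lock()        # assumption: protects the shared update

    def worker() -> None:
        local_count = 0            # lives on this thread's private stack
        for _ in range(increments):
            local_count += 1
        with lock:
            # An ordinary store to the shared address space; later loads by
            # other threads observe the new value.
            shared_heap["counter"] += local_count

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared_heap["counter"]


print(run_shared_counter(4, 1000))
```

No send or receive appears anywhere: the threads communicate purely by loads and stores to `shared_heap`, while `local_count` is never visible outside its own thread.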
