Review: Multiprocessor Basics - PowerPoint PPT Presentation

About This Presentation
Title:

Review: Multiprocessor Basics

Description:

... have to be essentially free and much less likely to ... Video Out. CMP&SMT.11. The PS3 'Cell' Processor Architecture. Composed of a Non-SMP Architecture ... – PowerPoint PPT presentation

Number of Views:131
Avg rating:3.0/5.0
Slides: 14
Provided by: janie3
Learn more at: https://www.cs.wm.edu
Category:

less

Transcript and Presenter's Notes

Title: Review: Multiprocessor Basics


1
Review Multiprocessor Basics
  • Q1 How do they share data?
  • Q2 How do they coordinate?
  • Q3 How scalable is the architecture? How many
    processors?

2
CMP Multiprocessors On One Chip
  • By placing multiple processors, their memories
    and the IN all on one chip, the latencies of
    chip-to-chip communication are drastically
    reduced
  • ARM multi-chip core

Configurable of hardware intr
Private IRQ
Interrupt Distributor
Per-CPU aliased peripherals
CPU Interface
CPU Interface
CPU Interface
CPU Interface
Configurable between 1 4 symmetric CPUs
CPU L1s
CPU L1s
CPU L1s
CPU L1s
Snoop Control Unit
CCB
Private peripheral bus
I D 64-b bus
Optional AXI R/W 64-b bus
Primary AXI R/W 64-b bus
3
Multithreading on A Chip
  • Find a way to hide true data dependency stalls,
    cache miss stalls, and branch stalls by finding
    instructions (from other process threads) that
    are independent of those stalling instructions
  • Multithreading increase the utilization of
    resources on a chip by allowing multiple
    processes (threads) to share the functional units
    of a single processor
  • Processor must duplicate the state hardware for
    each thread a separate register file, PC,
    instruction buffer, and store buffer for each
    thread
  • The caches, buffers can be shared (although the
    miss rates may increase if they are not sized
    accordingly)
  • The memory can be shared through virtual memory
    mechanisms
  • Hardware must support efficient thread context
    switching

4
Types of Multithreading
  • Fine-grain switch threads on every instruction
    issue
  • Round-robin thread interleaving (skipping stalled
    threads)
  • Processor must be able to switch threads on every
    clock cycle
  • Advantage can hide throughput losses that come
    from both short and long stalls
  • Disadvantage slows down the execution of an
    individual thread since a thread that is ready to
    execute without stalls is delayed by instructions
    from other threads
  • Coarse-grain switches threads only on costly
    stalls (e.g., L2 cache misses)
  • Advantages thread switching doesnt have to be
    essentially free and much less likely to slow
    down the execution of an individual thread
  • Disadvantage limited, due to pipeline start-up
    costs, in its ability to overcome throughput loss
  • Pipeline must be flushed and refilled on thread
    switches

5
Multithreaded Example Suns Niagara (UltraSparc
T1)
  • Eight fine grain multithreaded single-issue,
    in-order cores (no speculation, no dynamic branch
    prediction)

4-way MT SPARC pipe
4-way MT SPARC pipe
4-way MT SPARC pipe
4-way MT SPARC pipe
4-way MT SPARC pipe
4-way MT SPARC pipe
4-way MT SPARC pipe
4-way MT SPARC pipe
I/O shared functs
Crossbar
4-way banked L2
Memory controllers
6
Niagara Integer Pipeline
  • Cores are simple (single-issue, 6 stage, no
    branch prediction), small, and power-efficient

Fetch
Thrd Sel
Decode
Execute
Memory
WB
RegFilex4
ALU Mul Shft Div
D DTLB Stbufx4
Crossbar Interface
Inst bufx4
Thrd Sel Mux
I ITLB
Decode
Instr type
Thread Select Logic
Cache misses
Traps interrupts
Thrd Sel Mux
Resource conflicts
PC logicx4
7
Simultaneous Multithreading (SMT)
  • A variation on multithreading that uses the
    resources of a multiple-issue, dynamically
    scheduled processor (superscalar) to exploit both
    program ILP and thread-level parallelism (TLP)
  • Most have more machine level parallelism than
    most programs can effectively use (i.e., than
    have ILP)
  • With register renaming and dynamic scheduling,
    multiple instructions from independent threads
    can be issued without regard to dependencies
    among them
  • Need separate rename tables (ROBs) for each
    thread
  • Need the capability to commit from multiple
    threads (i.e., from multiple ROBs) in one cycle
  • Intels Pentium 4 SMT called hyperthreading
  • Supports just two threads (doubles the
    architecture state)

8
Threading on a 4-way SS Processor Example
SMT
Issue slots ?
Thread A
Thread B
Time ?
Thread C
Thread D
9
Multicore Xbox360 Xenon processor
  • To provide game developers with a balanced and
    powerful platform
  • Three SMT processors, 32KB L1 D I, 1MB UL2
    cache
  • 165M transistors total
  • 3.2 Ghz Near-POWER ISA
  • 2-issue, 21 stage pipeline, with 128 128-bit
    registers
  • Weak branch prediction supported by software
    hinting
  • In order instructions
  • Narrow cores 2 INT units, 2 128-bit VMX units,
    1 of anything else
  • An ATI-designed 500MZ GPU w/ 512MB of DDR3DRAM
  • 337M transistors, 10MB framebuffer
  • 48 pixel shader cores, each with 4 ALUs

10
Xenon Diagram
11
The PS3 Cell Processor Architecture
  • Composed of a Non-SMP Architecture
  • 234M transistors _at_ 4Ghz
  • 1 Power Processing Element, 8 Synergistic
    (SIMD) PEs
  • 512KB L2 - Massively high bandwidth (200GB/s)
    bus connects it to everything else
  • The PPE is strangely similar to one of the Xenon
    cores
  • Almost identical, really. Slight ISA differences,
    and fine-grained MT instead of real SMT
  • The real differences lie in the SPEs (21M
    transistors each)
  • An attempt to fix the memory latency problem by
    giving each processor complete control over its
    own 256KB scratchpad 14M transistors
  • Direct mapped for low latency
  • 4 vector units per SPE, 1 of everything else 7M
    trans.

12
How to make use of the SPEs
13
What about the Software?
  • Makes use of special IBM Hypervisor
  • Like an OS for OSs
  • Runs both a real time OS (for sound) and non-real
    time (for things like AI)
  • Software must be specially coded to run well
  • The single PPE will be quickly bogged down
  • Must make use of SPEs wherever possible
  • This isnt easy, by any standard
  • What about Microsoft?
  • Development suite identifies which 6 threads
    youre expected to run
  • Four of them are DirectX based, and handled by
    the OS
  • Only need to write two threads, functionally
Write a Comment
User Comments (0)
About PowerShow.com