Exploiting Fine-Grained Data Parallelism with Chip Multiprocessors and Fast Barriers

Transcript and Presenter's Notes
1
Exploiting Fine-Grained Data Parallelism with
Chip Multiprocessors and Fast Barriers
  • Jack Sampson, Rubén González, Jean-Francois
    Collard, Norman P. Jouppi, Mike Schlansker,
    Brad Calder

UCSD, UPC Barcelona, Hewlett-Packard Laboratories, UCSD/Microsoft
2
Motivations
  • CMPs are not just small multiprocessors
  • Different computation/communication ratio
  • Different shared resources
  • Inter-core fabric offers potential to support
    optimizations/acceleration
  • CMPs for vector, streaming workloads

3
Fine-grained Parallelism
  • CMPs in role of vector processors
  • Software synchronization still expensive
  • Can target inner-loop parallelism
  • Barriers are a straightforward organizing tool (see
    the sketch below)
  • Opportunity for hardware acceleration
  • Faster barriers allow greater parallelism
  • 1.2x to 6.4x speedup on 256-element vectors
  • 3x to 12.2x speedup on 1024-element vectors
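
A minimal sketch in C of this inner-loop pattern (a Jacobi-style sweep): each thread relaxes its slice of the vector, and the barrier after each sweep is the only synchronization the outer loop needs. barrier_wait() stands in for whichever fast barrier the platform provides; the names are illustrative, not the paper's API.

    /* Assumed to come from a fast-barrier library (illustrative name). */
    extern void barrier_wait(void);

    void jacobi_slice(double *a, double *b, int n, int iters,
                      int tid, int n_threads)
    {
        int chunk = (n - 2) / n_threads;
        int lo = 1 + tid * chunk;
        int hi = (tid == n_threads - 1) ? n - 1 : lo + chunk;

        for (int t = 0; t < iters; t++) {
            for (int i = lo; i < hi; i++)      /* data-parallel inner loop */
                b[i] = 0.5 * (a[i - 1] + a[i + 1]);
            barrier_wait();                    /* neighbors' writes must land
                                                  before anyone reads them */
            double *tmp = a; a = b; b = tmp;   /* every thread swaps locally */
        }
    }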

4
Accelerating Barriers
  • Barrier filters: a new method for barrier
    synchronization
  • No dedicated networks
  • No new instructions
  • Changes only in shared memory system
  • CMP-friendly design point
  • Competitive with dedicated barrier network
  • Achieves 77-95% of dedicated network performance

5
Outline
  • Introduction
  • Barrier Filter Overview
  • Barrier Filter Implementation
  • Results
  • Summary

6
Observation and Intuition
  • Observations
  • Barriers need to stall forward progress
  • There exist events that already stall processors
  • Co-opt and extend existing stall behavior
  • Cache misses
  • Either I-Cache or D-Cache suffices

7
High Level Barrier Behavior
  • A thread can be in one of three states
  • Executing
  • Perform work
  • Enforce memory ordering
  • Signal arrival at barrier
  • Blocking
  • Stall at barrier until all arrive
  • Resuming
  • Release from barrier

8
Barrier Filter Example
  • CMP augmented with filter
  • Private L1
  • Shared, banked L2

9
Example Memory Ordering
  • Before/after view of the memory state
  • Each thread executes a memory fence

10
Example Signaling Arrival
  • Communication with filter
  • Each thread invalidates a designated cache line

11
Example Signaling Arrival
  • Invalidation propagates to shared L2 cache
  • Filter snoops the invalidation
  • Checks address for match
  • Records arrival

12
Example Stalling
  • Thread A attempts to fetch the invalidated data
  • The fill request is not satisfied
  • The withheld fill is the thread-stalling mechanism
    (sketched below)
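
Pulling the preceding slides together, the thread side of the protocol can be sketched as below, assuming an x86-style cache-line flush as the invalidation instruction (the scheme only requires some existing invalidate instruction; my_signal_line is an illustrative name for the address the filter watches).

    #include <emmintrin.h>   /* _mm_mfence, _mm_clflush (SSE2) */

    /* Per-thread signal address handed out when the barrier was set up. */
    extern volatile char *my_signal_line;

    void filter_barrier_wait(void)
    {
        _mm_mfence();                              /* enforce memory ordering */
        _mm_clflush((const void *)my_signal_line); /* invalidate = "arrived" */
        _mm_mfence();                              /* order flush before load */
        (void)*my_signal_line;                     /* fill request: the filter
                                                      withholds the fill until
                                                      all threads arrive, so
                                                      this load stalls us */
    }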

13
Example Release
  • Last thread signals arrival
  • Barrier release
  • Counter resets
  • Filter state for all threads switches

14
Example Release
  • After release
  • New cache-fill requests served
  • Filter serves pending cache-fills

15
Outline
  • Introduction
  • Barrier Filter Overview
  • Barrier Filter Implementation
  • Results
  • Summary

16
Software Interface
  • Communication requirements
  • Let the hardware know which threads participate
  • Let threads know their signal addresses
  • Barrier filters as a virtualized resource
  • Library interface
  • Pure software fallback
  • User scenario
  • Application calls the OS to create a barrier for its
    threads
  • OS allocates a barrier filter, relays the signal
    addresses and thread count
  • OS returns the barrier address to the application
    (interface sketched below)
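
A sketch of what such a library interface could look like in C; all names here are illustrative, not the paper's actual API.

    typedef struct barrier barrier_t;

    /* Asks the OS to set up a barrier for n_threads. The OS allocates a
       hardware barrier filter if one is free, relaying the thread count
       and signal addresses; otherwise the library falls back to a pure
       software barrier behind the same interface. */
    barrier_t *barrier_create(int n_threads);

    /* Waits at the barrier: fence, invalidate this thread's signal line,
       then load from it (or run the software fallback). */
    void barrier_wait_on(barrier_t *b);

    void barrier_destroy(barrier_t *b);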

17
Barrier Filter Hardware
  • Additional hardware: an address filter
  • In controller for shared memory level
  • State table, associated FSMs
  • Snoops invalidations, fill requests for
    designated addresses
  • Makes use of existing instructions and existing
    interconnect network

18
Barrier Filter Internals
  • Each barrier filter supports one barrier
  • Barrier state
  • Per-thread state, FSMs
  • Multiple barrier filters
  • In each controller
  • In banked caches, at a particular bank
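
The filter's bookkeeping can be sketched as a toy software model (illustrative C, not the paper's hardware): a per-thread state table plus an arrival counter, driven by snooped invalidations and fill requests.

    #define MAX_THREADS 16

    enum line_state { BLOCK_FILLS, SERVE_FILLS };

    typedef struct {
        int n_threads;
        int arrived;                         /* arrival counter */
        enum line_state state[MAX_THREADS];  /* per-thread FSM state */
        int fill_pending[MAX_THREADS];       /* stalled fill to replay */
    } filter_t;

    extern void serve_fill(int thread);      /* hand the line back to the core */

    /* Snooped invalidation of thread t's signal line: record the arrival. */
    void on_invalidate(filter_t *f, int t)
    {
        f->state[t] = BLOCK_FILLS;
        if (++f->arrived == f->n_threads) {  /* last arrival: release */
            f->arrived = 0;                  /* counter resets */
            for (int i = 0; i < f->n_threads; i++) {
                f->state[i] = SERVE_FILLS;   /* state switches for all threads */
                if (f->fill_pending[i]) {
                    f->fill_pending[i] = 0;
                    serve_fill(i);           /* replay the stalled fill */
                }
            }
        }
    }

    /* Snooped fill request for thread t's signal line. */
    void on_fill_request(filter_t *f, int t)
    {
        if (f->state[t] == BLOCK_FILLS)
            f->fill_pending[t] = 1;          /* withhold the fill: thread stalls */
        else
            serve_fill(t);                   /* after release: serve normally */
    }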

19
Why have an exit address?
  • Needed for re-entry to barriers
  • When does Resuming again become Executing?
  • Additional fill requests may be issued
  • Delivery is not a guarantee of receipt
  • Context switches
  • Migration
  • Cache eviction

20
Ping-Pong Optimization
  • Draws from sense-reversal barriers
  • Entry and exit operations act as duals
  • Two alternating arrival addresses
  • Each conveys exit from the other's barrier
  • Eliminates the explicit invalidate of the exit
    address (sketched below)
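
On the thread side the ping-pong scheme might look like the following sketch (again assuming an x86-style flush; my_sig names the two arrival lines and is illustrative).

    #include <emmintrin.h>

    /* Two alternating arrival lines for this thread. Arriving at one
       doubles as the exit notification for the other, so no separate
       exit invalidate is needed. */
    extern volatile char *my_sig[2];

    void pingpong_barrier_wait(void)
    {
        static _Thread_local int side = 0;   /* which line this episode uses */

        _mm_mfence();
        _mm_clflush((const void *)my_sig[side]);  /* arrival here also conveys
                                                     exit from the other side */
        _mm_mfence();
        (void)*my_sig[side];                 /* stall until release */
        side ^= 1;                           /* alternate on the next barrier */
    }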

21
Outline
  • Introduction
  • Barrier Filter Overview
  • Barrier Filter Implementation
  • Results
  • Summary

22
Methodology
  • Used a modified version of the SMTSIM simulator
  • We performed experiments using 7 different
    barrier implementations
  • Software
  • Centralized, combining tree
  • Hardware
  • Filter barrier (4 variants), dedicated barrier
    network
  • We examined performance over a set of
    parallelizable kernels
  • Livermore loops 2, 3, and 6
  • EEMBC kernels: Autocorrelation and Viterbi
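
For reference, the centralized software baseline is typically a sense-reversal spin barrier along these lines (a sketch, not the exact code used in the experiments).

    #include <stdatomic.h>

    typedef struct {
        atomic_int count;                    /* how many threads have arrived */
        atomic_int sense;                    /* flips once per barrier episode */
        int n_threads;
    } sw_barrier_t;

    void sw_barrier_wait(sw_barrier_t *b)
    {
        static _Thread_local int local_sense = 0;  /* per-thread, one barrier */
        local_sense ^= 1;                    /* this episode's sense */

        if (atomic_fetch_add(&b->count, 1) == b->n_threads - 1) {
            atomic_store(&b->count, 0);      /* last arrival resets the count */
            atomic_store(&b->sense, local_sense);  /* and releases the others */
        } else {
            while (atomic_load(&b->sense) != local_sense)
                ;                            /* spin until release */
        }
    }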

23
Benchmark Selection
  • Barriers are seen as heavyweight operations
  • Infrequently executed in most workloads
  • Example: Ocean from SPLASH-2
  • On a simulated 16-core CMP, 4% of time is spent in
    barriers
  • Barriers will be used more frequently on CMPs

24
Latency Micro-benchmark
  • Average time of barrier execution (in isolation)
  • Number of threads equals the number of cores

25
Latency Micro-benchmark
  • Notable effects due to bus saturation
  • Barrier filter scales well up until this point

26
Latency Micro-benchmark
  • Filters closer to dedicated network than software
  • Significant speedup vs. software still exhibited

27
Autocorrelation Kernel
  • On a 16-core CMP
  • 7.98x speedup for the dedicated network
  • 7.31x speedup for the best filter barrier
  • 3.86x speedup for the best software barrier
  • Significant speedup opportunities with fast
    barriers

28
Viterbi Kernel
Viterbi on a 4-core CMP
  • Not all applications can scale to arbitrary
    number of cores
  • Viterbi performance higher on 4 or 8 cores than
    on 16 cores

29
Livermore Loops
Livermore Loop 3 on 16-core CMP
  • Serial/parallel crossover
  • HW barriers reach the crossover at a 4x smaller
    problem size

30
Livermore Loops
Livermore Loop 3 on 16-core CMP
  • Reduction in parallelism to avoid false sharing

31
Result Summary
  • Fine-grained parallelism on CMPs
  • Significant speedups possible
  • 1.2x to 6.4x on 256-element vectors
  • 3x to 12.2x on 1024-element vectors
  • False sharing affects problem size/scaling
  • Faster barriers allow greater parallelism
  • HW approaches extend worthwhile problem sizes
  • Barrier filters give competitive performance
  • 77-95% of dedicated network performance

32
Conclusions
  • Fast barriers
  • Can organize fine-grained data parallelism on a
    CMP
  • CMPs can act in a vector processor role
  • Exploit inner-loop parallelism
  • Barrier filters
  • CMP-oriented fast barrier

33
(FIN)
  • Questions?
