Exploiting Fine-Grained Data Parallelism with Chip Multiprocessors and Fast Barriers

Transcript and Presenter's Notes
1
Exploiting Fine-Grained Data Parallelism with
Chip Multiprocessors and Fast Barriers
  • Jack Sampson, Rubén González, Jean-Francois
    Collard, Norman P. Jouppi, Mike Schlansker,
    Brad Calder

UCSD, UPC Barcelona, Hewlett-Packard Laboratories, UCSD/Microsoft
2
Motivations
  • CMPs are not just small multiprocessors
  • Different computation/communication ratio
  • Different shared resources
  • Inter-core fabric offers potential to support
    optimizations/acceleration
  • CMPs for vector, streaming workloads

3
Fine-grained Parallelism
  • CMPs in role of vector processors
  • Software synchronization still expensive
  • Can target inner-loop parallelism
  • Barriers are a straightforward organizing tool (see
    the sketch below)
  • Opportunity for hardware acceleration
  • Faster barriers allow greater parallelism
  • 1.2x to 6.4x speedup on 256-element vectors
  • 3x to 12.2x speedup on 1024-element vectors
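
A minimal sketch in C of this inner-loop pattern (a Jacobi-style sweep): each thread relaxes its slice of the vector, and the barrier after each sweep is the only synchronization the outer loop needs. barrier_wait() stands in for whichever fast barrier the platform provides; the names are illustrative, not the paper's API.

    /* Assumed to come from a fast-barrier library (illustrative name). */
    extern void barrier_wait(void);

    void jacobi_slice(double *a, double *b, int n, int iters,
                      int tid, int n_threads)
    {
        int chunk = (n - 2) / n_threads;
        int lo = 1 + tid * chunk;
        int hi = (tid == n_threads - 1) ? n - 1 : lo + chunk;

        for (int t = 0; t < iters; t++) {
            for (int i = lo; i < hi; i++)      /* data-parallel inner loop */
                b[i] = 0.5 * (a[i - 1] + a[i + 1]);
            barrier_wait();                    /* neighbors' writes must land
                                                  before anyone reads them */
            double *tmp = a; a = b; b = tmp;   /* every thread swaps locally */
        }
    }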

4
Accelerating Barriers
  • Barrier filters: a new method for barrier
    synchronization
  • No dedicated networks
  • No new instructions
  • Changes only in shared memory system
  • CMP-friendly design point
  • Competitive with dedicated barrier network
  • Achieves 77-95% of dedicated network performance

5
Outline
  • Introduction
  • Barrier Filter Overview
  • Barrier Filter Implementation
  • Results
  • Summary

6
Observation and Intuition
  • Observations
  • Barriers need to stall forward progress
  • There exist events that already stall processors
  • Co-opt and extend existing stall behavior
  • Cache misses
  • Either I-Cache or D-Cache suffices

7
High Level Barrier Behavior
  • A thread can be in one of three states
  • Executing
  • Perform work
  • Enforce memory ordering
  • Signal arrival at barrier
  • Blocking
  • Stall at barrier until all arrive
  • Resuming
  • Release from barrier

8
Barrier Filter Example
  • CMP augmented with filter
  • Private L1
  • Shared, banked L2

9
Example Memory Ordering
  • Before/after view of the memory state
  • Each thread executes a memory fence

10
Example Signaling Arrival
  • Communication with filter
  • Each thread invalidates a designated cache line

11
Example Signaling Arrival
  • Invalidation propagates to shared L2 cache
  • Filter snoops the invalidation
  • Checks address for match
  • Records arrival

12
Example Stalling
  • Thread A attempts to fetch the invalidated data
  • The fill request is not satisfied
  • The withheld fill is the thread-stalling mechanism
    (sketched below)
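
Pulling the preceding slides together, the thread side of the protocol can be sketched as below, assuming an x86-style cache-line flush as the invalidation instruction (the scheme only requires some existing invalidate instruction; my_signal_line is an illustrative name for the address the filter watches).

    #include <emmintrin.h>   /* _mm_mfence, _mm_clflush (SSE2) */

    /* Per-thread signal address handed out when the barrier was set up. */
    extern volatile char *my_signal_line;

    void filter_barrier_wait(void)
    {
        _mm_mfence();                              /* enforce memory ordering */
        _mm_clflush((const void *)my_signal_line); /* invalidate = "arrived" */
        _mm_mfence();                              /* order flush before load */
        (void)*my_signal_line;                     /* fill request: the filter
                                                      withholds the fill until
                                                      all threads arrive, so
                                                      this load stalls us */
    }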

13
Example Release
  • Last thread signals arrival
  • Barrier release
  • Counter resets
  • Filter state for all threads switches

14
Example Release
  • After release
  • New cache-fill requests served
  • Filter serves pending cache-fills

15
Outline
  • Introduction
  • Barrier Filter Overview
  • Barrier Filter Implementation
  • Results
  • Summary

16
Software Interface
  • Communication requirements
  • Let the hardware know which threads participate
  • Let threads know their signal addresses
  • Barrier filters as a virtualized resource
  • Library interface
  • Pure software fallback
  • User scenario
  • Application calls the OS to create a barrier for its
    threads
  • OS allocates a barrier filter, relays the signal
    addresses and thread count
  • OS returns the barrier address to the application
    (interface sketched below)
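
A sketch of what such a library interface could look like in C; all names here are illustrative, not the paper's actual API.

    typedef struct barrier barrier_t;

    /* Asks the OS to set up a barrier for n_threads. The OS allocates a
       hardware barrier filter if one is free, relaying the thread count
       and signal addresses; otherwise the library falls back to a pure
       software barrier behind the same interface. */
    barrier_t *barrier_create(int n_threads);

    /* Waits at the barrier: fence, invalidate this thread's signal line,
       then load from it (or run the software fallback). */
    void barrier_wait_on(barrier_t *b);

    void barrier_destroy(barrier_t *b);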

17
Barrier Filter Hardware
  • Additional hardware: an address filter
  • In controller for shared memory level
  • State table, associated FSMs
  • Snoops invalidations, fill requests for
    designated addresses
  • Makes use of existing instructions and existing
    interconnect network

18
Barrier Filter Internals
  • Each barrier filter supports one barrier
  • Barrier state
  • Per-thread state, FSMs
  • Multiple barrier filters
  • In each controller
  • In banked caches, at a particular bank
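
The filter's bookkeeping can be sketched as a toy software model (illustrative C, not the paper's hardware): a per-thread state table plus an arrival counter, driven by snooped invalidations and fill requests.

    #define MAX_THREADS 16

    enum line_state { BLOCK_FILLS, SERVE_FILLS };

    typedef struct {
        int n_threads;
        int arrived;                         /* arrival counter */
        enum line_state state[MAX_THREADS];  /* per-thread FSM state */
        int fill_pending[MAX_THREADS];       /* stalled fill to replay */
    } filter_t;

    extern void serve_fill(int thread);      /* hand the line back to the core */

    /* Snooped invalidation of thread t's signal line: record the arrival. */
    void on_invalidate(filter_t *f, int t)
    {
        f->state[t] = BLOCK_FILLS;
        if (++f->arrived == f->n_threads) {  /* last arrival: release */
            f->arrived = 0;                  /* counter resets */
            for (int i = 0; i < f->n_threads; i++) {
                f->state[i] = SERVE_FILLS;   /* state switches for all threads */
                if (f->fill_pending[i]) {
                    f->fill_pending[i] = 0;
                    serve_fill(i);           /* replay the stalled fill */
                }
            }
        }
    }

    /* Snooped fill request for thread t's signal line. */
    void on_fill_request(filter_t *f, int t)
    {
        if (f->state[t] == BLOCK_FILLS)
            f->fill_pending[t] = 1;          /* withhold the fill: thread stalls */
        else
            serve_fill(t);                   /* after release: serve normally */
    }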

19
Why have an exit address?
  • Needed for re-entry to barriers
  • When does Resuming again become Executing?
  • Additional fill requests may be issued
  • Delivery is not a guarantee of receipt
  • Context switches
  • Migration
  • Cache eviction

20
Ping-Pong Optimization
  • Draws from sense-reversal barriers
  • Entry and exit operations act as duals
  • Two alternating arrival addresses
  • Each conveys exit from the other's barrier
  • Eliminates the explicit invalidate of the exit
    address (sketched below)
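
On the thread side the ping-pong scheme might look like the following sketch (again assuming an x86-style flush; my_sig names the two arrival lines and is illustrative).

    #include <emmintrin.h>

    /* Two alternating arrival lines for this thread. Arriving at one
       doubles as the exit notification for the other, so no separate
       exit invalidate is needed. */
    extern volatile char *my_sig[2];

    void pingpong_barrier_wait(void)
    {
        static _Thread_local int side = 0;   /* which line this episode uses */

        _mm_mfence();
        _mm_clflush((const void *)my_sig[side]);  /* arrival here also conveys
                                                     exit from the other side */
        _mm_mfence();
        (void)*my_sig[side];                 /* stall until release */
        side ^= 1;                           /* alternate on the next barrier */
    }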

21
Outline
  • Introduction
  • Barrier Filter Overview
  • Barrier Filter Implementation
  • Results
  • Summary

22
Methodology
  • Used a modified version of the SMTSIM simulator
  • We performed experiments using 7 different
    barrier implementations
  • Software
  • Centralized, combining tree
  • Hardware
  • Filter barrier (4 variants), dedicated barrier
    network
  • We examined performance over a set of
    parallelizable kernels
  • Livermore loops 2, 3, and 6
  • EEMBC kernels: Autocorrelation and Viterbi
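
For reference, the centralized software baseline is typically a sense-reversal spin barrier along these lines (a sketch, not the exact code used in the experiments).

    #include <stdatomic.h>

    typedef struct {
        atomic_int count;                    /* how many threads have arrived */
        atomic_int sense;                    /* flips once per barrier episode */
        int n_threads;
    } sw_barrier_t;

    void sw_barrier_wait(sw_barrier_t *b)
    {
        static _Thread_local int local_sense = 0;  /* per-thread, one barrier */
        local_sense ^= 1;                    /* this episode's sense */

        if (atomic_fetch_add(&b->count, 1) == b->n_threads - 1) {
            atomic_store(&b->count, 0);      /* last arrival resets the count */
            atomic_store(&b->sense, local_sense);  /* and releases the others */
        } else {
            while (atomic_load(&b->sense) != local_sense)
                ;                            /* spin until release */
        }
    }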

23
Benchmark Selection
  • Barriers are seen as heavyweight operations
  • Infrequently executed in most workloads
  • Example: Ocean from SPLASH-2
  • On a simulated 16-core CMP, 4% of time is spent in
    barriers
  • Barriers will be used more frequently on CMPs

24
Latency Micro-benchmark
  • Average time of barrier execution (in isolation)
  • Number of threads equals the number of cores

25
Latency Micro-benchmark
  • Notable effects due to bus saturation
  • Barrier filter scales well up until this point

26
Latency Micro-benchmark
  • Filters closer to dedicated network than software
  • Significant speedup vs. software still exhibited

27
Autocorrelation Kernel
  • On a 16-core CMP
  • 7.98x speedup for the dedicated network
  • 7.31x speedup for the best filter barrier
  • 3.86x speedup for the best software barrier
  • Significant speedup opportunities with fast
    barriers

28
Viterbi Kernel
Viterbi on a 4-core CMP
  • Not all applications can scale to arbitrary
    number of cores
  • Viterbi performance higher on 4 or 8 cores than
    on 16 cores

29
Livermore Loops
Livermore Loop 3 on 16-core CMP
  • Serial/parallel crossover
  • HW barriers reach the crossover at a 4x smaller
    problem size

30
Livermore Loops
Livermore Loop 3 on 16-core CMP
  • Reduction in parallelism to avoid false sharing

31
Result Summary
  • Fine-grained parallelism on CMPs
  • Significant speedups possible
  • 1.2x to 6.4x on 256-element vectors
  • 3x to 12.2x on 1024-element vectors
  • False sharing affects problem size/scaling
  • Faster barriers allow greater parallelism
  • HW approaches extend worthwhile problem sizes
  • Barrier filters give competitive performance
  • 77-95% of dedicated network performance

32
Conclusions
  • Fast barriers
  • Can organize fine-grained data parallelism on a
    CMP
  • CMPs can act in a vector processor role
  • Exploit inner-loop parallelism
  • Barrier filters
  • CMP-oriented fast barrier

33
(FIN)
  • Questions?
