1
Lecture 3: Laws, Equality, and Inside a Cell
  • John Cavazos
  • Dept. of Computer and Information Sciences
  • University of Delaware
  • www.cis.udel.edu/cavazos/cisc879

2
Lecture 2 Overview
  • Know the Laws
  • All are NOT Created Equal
  • Inside a Cell

3
Two Important Laws
  • Amdahl's Law
  • Gene Amdahl's observation in 1967
  • Speedup is limited by the serial portions
  • Assumes a fixed workload and a fixed problem size
  • Gustafson's Law
  • John Gustafson's observation in 1988
  • Rescues parallel processing from Amdahl's Law
  • Proposes fixed time and increasing work
  • Sequential portions have a diminishing effect (see the formulas below)
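For reference, the standard forms of the two laws, with f the fraction of the
work that can be parallelized, s = 1 - f the serial fraction, and N the number
of processors:

    Amdahl's Law (fixed problem size):   Speedup(N) = 1 / (s + f/N), bounded above by 1/s
    Gustafson's Law (fixed run time):    Scaled speedup(N) = s' + (1 - s')·N,
                                         where s' is the serial fraction of the
                                         time on the N-processor run

Amdahl's speedup saturates as N grows; Gustafson's scaled speedup keeps growing
roughly linearly with N.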

4
Amdahl's Law
Parallelize parts 2 and 4 with 2 processors
[Figure: five program phases of 100 time units each (500 in total); the two
parallelized phases shrink to 50 units each on 2 processors, so total time
drops to 400 units]
Speedup = 25%
5
Amdahl's Law (cont'd)
Parallelize parts 2 and 4 with 4 processors
[Figure: the two parallelized phases shrink from 100 to 25 time units each
on 4 processors, so total time drops from 500 to 350 units]
Speedup ≈ 40%
6
Amdahl's Law (cont'd)
Parallelize parts 2 and 4 with infinite processors
[Figure: the two parallelized phases shrink from 100 time units each to
essentially 0, so total time drops from 500 to 300 units]
Speedup only ≈ 70%
Multicore doesn't look very appealing!
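A quick check of the three percentages above, assuming the figures' five phases
of 100 time units each (500 units total, of which 200 are parallelizable):

    2 processors:        500 / (300 + 200/2) = 500 / 400 = 1.25   -> 25% faster
    4 processors:        500 / (300 + 200/4) = 500 / 350 ≈ 1.43   -> roughly 40% faster
    infinite processors: 500 / (300 + 0)     = 500 / 300 ≈ 1.67   -> roughly 70% faster (the 1/s bound)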
7
Gustafson's Law (cont'd)
Boxes contain units of work now!
500 units of time, but 700 units of work!
[Figure: the same five phases and 500 units of total time, but each of the two
parallel phases now performs 200 units of work on 2 processors, for 700 units
of work in all]
Speedup = 40%
8
Gustafson's Law (cont'd)
Boxes contain units of work now!
500 units of time, but 1100 units of work!
[Figure: each of the two parallel phases now performs 400 units of work on 4
processors, for 1100 units of work in the same 500 units of time]
Speedup = 120% (2.2x)
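A quick check of the two work-scaled figures above, using the scaled-speedup
form S = s' + (1 - s')·N with a serial time fraction s' = 300/500 = 0.6:

    2 processors: S = 0.6 + 0.4·2 = 1.4  -> 700 units of work in 500 units of time (40% more throughput)
    4 processors: S = 0.6 + 0.4·4 = 2.2  -> 1100 units of work in 500 units of time (120% more throughput)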
9
Gustafson's Law (cont'd)
  • Gustafson made an important observation
  • As the number of processors grows, people scale up the problem size
  • Serial bottlenecks do not grow with problem size
  • Increasing the processor count then gives near-linear speedup
  • 20 processors are roughly twice as fast as 10
  • This is why supercomputers are successful
  • More processors allow an increased dataset size

Reference: http://www.scl.ameslab.gov/Publications/Gus/AmdahlsLaw/Amdahls.html
10
Lecture 2 Overview
  • Know the Laws
  • All are NOT Created Equal
  • Inside a Cell

11
All Multicores Not Equal
  • Multicore CPUs and GPUs are very different!
  • CPUs run general purpose programs well
  • GPUs run graphics (or similar programs) well
  • General Purpose Programs have
  • Less parallelism
  • More complex control requirements
  • GPU programs
  • Highly parallel
  • Arithmetic-intensive
  • Simple control requirements

12
Floating-Point Operations
32-bit FP operations per second
GPUs have more computational units and take better advantage of them.
Slide Source: NVIDIA CUDA Programming Guide 1.1
13
CPUs versus GPUs
CPUs devote lots of area to control and storage; GPUs devote most of their
area to computational units.
Slide Source: NVIDIA CUDA Programming Guide 1.1
14
CPU Programming Model
  • Scalar programming model
  • No native data parallelism
  • Few arithmetic units
  • Very small area
  • Optimized for complex control
  • Optimized for low latency, not high bandwidth

Slide Source: John Owens, EEC 227 Graphics Arch course
15
AMD K7 Deerhound
Slide Source: John Owens, EEC 227 Graphics Arch course
16
GPU Programming Model
  • Streams
  • Collections of data records
  • Amenable to data parallelism
  • Kernels
  • Inputs and outputs are streams
  • Perform computation on each element of a stream
  • No dependencies between stream elements (see the sketch below)
  • Stream storage
  • Not a cache (input read once, output written once)
  • Producer-consumer locality
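To make the stream/kernel model concrete, here is a minimal sketch in plain C
(illustrative names, not from the slides, and deliberately not tied to any GPU
API): the kernel reads one input record and writes one output element, with no
dependence on other elements, so every element could be processed in parallel.

    #include <stddef.h>

    /* One record of the input stream (illustrative layout). */
    typedef struct { float a, b; } record_t;

    /* The "kernel": computes one output element from one input record.
       It reads its input once and writes its output once. */
    static float kernel(record_t in) {
        return in.a * in.b;
    }

    /* Apply the kernel over the whole stream.  On a GPU each iteration
       would run as an independent thread; here it is a plain loop that
       only illustrates the programming model. */
    void run_kernel(const record_t *in_stream, float *out_stream, size_t n) {
        size_t i;
        for (i = 0; i < n; i++)
            out_stream[i] = kernel(in_stream[i]);
    }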

Slide Source: John Owens (EEC 227 Graphics Arch) and Pat Hanrahan
(Stream Prog. Env., GP2 Workshop)
17
Lecture 2 Overview
  • Know the Laws
  • All are NOT Created Equal
  • Inside a Cell

18
Cell B.E. Design Goals
  • An accelerator extension to Power
  • Exploits parallelism and achieves high frequency
  • Sustains high memory bandwidth through DMA
  • Designed for flexibility
  • Heterogeneous architecture
  • PPU for control and general-purpose code
  • SPU for computation-intensive work with little control
  • Applicable to a wide variety of applications

The Cell Architecture has characteristics of both
a CPU and GPU.
19
Cell Chip Highlights
  • 241M transistors
  • 9 cores, 10 threads
  • >200 GFlops (SP) - see the quick check below
  • >20 GFlops (DP)
  • >300 GB/s EIB bandwidth
  • 3.2 GHz shipping
  • Top frequency 4.0 GHz (in the lab)
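A back-of-the-envelope check on the single-precision figure (an estimate, not
from the slide): each SPE can issue one 4-wide single-precision fused
multiply-add per cycle, i.e. 8 flops/cycle, so

    8 SPEs x 8 flops/cycle x 3.2 GHz ≈ 204.8 GFlops (SP)

which is where the ">200 GFlops" number comes from.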

Slide Source: Michael Perrone, MIT 6.189, Fall 2007 course
20
Cell Details
  • Heterogeneous multicore architecture
  • Power Processor Element (PPE) for control tasks
  • Synergistic Processor Element (SPE) for
    data-intensive processing
  • SPE Features
  • No cache
  • Large unified register file
  • Synergistic Memory Flow Control (MFC)
  • Interface to high-perf. EIB

Slide Source: Michael Perrone, MIT 6.189, Fall 2007 course
21
Cell PPE Details
  • Power Processor Element (PPE)
  • General-purpose 64-bit PowerPC RISC processor
  • 2-way hardware multithreaded
  • L1: 32 KB I-cache + 32 KB D-cache
  • L2: 512 KB
  • For operating systems and program control

Slide Source: Michael Perrone, MIT 6.189, Fall 2007 course
22
Cell SPE Details
  • Synergistic Processor Element (SPE)
  • 128-bit SIMD architecture
  • Dual issue
  • Register file: 128 x 128-bit
  • Local Store (256 KB)
  • Simplified branch architecture
  • No hardware branch predictor
  • Compiler-managed branch hints
  • Memory Flow Controller (MFC)
  • Dedicated DMA engine - up to 16 outstanding
    requests (see the SPU sketch below)
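A minimal sketch of SPE code reflecting these features, assuming the SDK
headers spu_intrinsics.h and spu_mfcio.h; the buffer size, DMA tag, and the use
of argp as an effective address are illustrative choices, not from the slides.
The MFC DMAs data into the Local Store, the 128-bit SIMD intrinsics process
four floats per instruction, and the result is DMAed back out.

    #include <spu_intrinsics.h>
    #include <spu_mfcio.h>

    #define N 1024                       /* illustrative buffer size, in floats */
    static vector float buf[N / 4] __attribute__((aligned(128)));

    int main(unsigned long long speid, unsigned long long argp,
             unsigned long long envp)
    {
        unsigned int tag = 1;            /* illustrative DMA tag group (0..31) */
        vector float two = spu_splats(2.0f);
        int i;

        /* DMA the input from main memory (effective address passed in argp)
           into the 256 KB Local Store, then wait for it to complete. */
        mfc_get(buf, argp, sizeof(buf), tag, 0, 0);
        mfc_write_tag_mask(1 << tag);
        mfc_read_tag_status_all();

        /* 128-bit SIMD: scale four single-precision floats per instruction. */
        for (i = 0; i < N / 4; i++)
            buf[i] = spu_mul(buf[i], two);

        /* DMA the result back out to the same effective address. */
        mfc_put(buf, argp, sizeof(buf), tag, 0, 0);
        mfc_write_tag_mask(1 << tag);
        mfc_read_tag_status_all();
        return 0;
    }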

Slide Source: Michael Perrone, MIT 6.189, Fall 2007 course
23
Compiler Tools
  • GNU-based C/C++ compilers (Sony)
  • ppu-gcc / ppu-g++ - generate PPU code
  • spu-gcc / spu-g++ - generate SPU code
  • GDB debugger
  • Supports both PPU and SPU debugging
  • Different modes of execution

Slide Source: Michael Perrone, MIT 6.189, Fall 2007 course
24
Compiler Tools
  • The XL C/C++ compiler
  • ppuxlc / ppuxlc++ - generate PPU code
  • spuxlc / spuxlc++ - generate SPU code
  • Includes the following optimization levels
  • -O0 almost no optimization
  • -O2 strong, low-level optimization
  • -O3 intense, low-level opts with basic loop opts
  • -O4 all of -O3 plus detailed loop analysis and
    good whole-program analysis
  • -O5 all of -O4 plus detailed whole-program
    analysis

Slide Source: Michael Perrone, MIT 6.189, Fall 2007 course
25
Performance Tools
  • GNU-based tools
  • OProfile - system-level profiler (PPU only)
  • gprof - generates call graphs
  • IBM tools
  • Static analysis tool (spu_timing)
  • Annotates an assembly file with scheduling and
    instruction-issue estimates
  • Dynamic analysis tool (Cell BE system simulator)
  • Can run your code on an x86 machine
  • Can collect a variety of statistics
  • Can collect a variety of statistics

Slide Source: Michael Perrone, MIT 6.189, Fall 2007 course
26
Compiling with the SDK
  • README_build_env.txt (You should read this - IMPORTANT!)
  • Provides details on the build environment
    features, including files, structure and
    variables.
  • make.footer
  • Specifies all of the build rules needed to
    properly build binaries
  • Must be included in all SDK Makefiles (referenced
    relatively if CELL_TOP is not defined)
  • Includes make.header
  • make.header
  • Specifies definitions needed to process the
    Makefiles
  • Includes make.env
  • make.env
  • Specifies the default compilers and tools to be
    used by make
  • make.footer and make.header should not be
    modified

Slide Source: Cell Programming Workshop at GTech, Cell SDK 2.0
27
Compiling with the SDK
  • Defaults to gcc
  • Set in make.env with three variables set to gcc
    or xlc
  • PPU32_COMPILER
  • PPU64_COMPILER
  • PPU_COMPILER overrides PPU32_COMPILER and
    PPU64_COMPILER
  • SPU_COMPILER
  • Can change from the command line
  • PPU_COMPILER=xlc SPU_COMPILER=xlc make
  • make -e PPU64_COMPILER=gcc -e PPU32_COMPILER=gcc
    -e SPU_COMPILER=gcc
  • export PPU_COMPILER=xlc SPU_COMPILER=xlc; make

Slide Source: Cell Programming Workshop at GTech, Cell SDK 2.0
28
Compiling with the SDK
  • Use CELL_TOP or maintain relative directory
    structure
  • ifdef CELL_TOP
  • include $(CELL_TOP)/make.footer
  • else
  • include ../../../make.footer
  • endif

Slide Source: Cell Programming Workshop at GTech, Cell SDK 2.0
29
Makefile variables
  • DIRS
  • list of subdirectories to build first
  • PROGRAM_ppu PROGRAMS_ppu
  • 32-bit PPU program (or list of programs) to
    build.
  • PROGRAM_ppu64 PROGRAMS_ppu64
  • 64-bit PPU program (or list of programs) to
    build.
  • PROGRAM_spu PROGRAMS_spu
  • SPU program (or list of programs) to build.
  • If written as a standalone binary, can run
    without being embedded in a PPU program.

Slide Source: Cell Programming Workshop at GTech, Cell SDK 2.0
30
Makefile variables (contd)
  • LIBRARY_embed LIBRARY_embed64
  • Creates a linked library from an SPU program to
    be embedded into a 32-bit or 64-bit PPU program.
  • CC_OPT_LEVEL
  • Optimization level for the compiler to use
  • CFLAGS, CFLAGS_gcc, CFLAGS_xlc
  • Additional flags for the compiler (general, or
    specific to gcc/xlc)
  • TARGET_INSTALL_DIR
  • Specifies where built targets are installed
    (a sketch Makefile pair follows below)
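Putting the variables together, a minimal sketch of the Makefile pair they
imply for a project with an spu/ subdirectory. The program names, and the
IMPORTS line used to pull the embedded SPU library into the PPU link, are
illustrative assumptions rather than content from these slides; the actual
build rules come from make.footer.

    # ---- spu/Makefile: build the SPU program and an embeddable library ----
    PROGRAM_spu   := simple_spu
    LIBRARY_embed := simple_spu.a
    include $(CELL_TOP)/make.footer

    # ---- Makefile (top level): build the 32-bit PPU program,
    # ---- descending into spu/ first ----
    DIRS        := spu
    PROGRAM_ppu := simple
    IMPORTS     := spu/simple_spu.a
    include $(CELL_TOP)/make.footer

With these two files, a plain "make" at the top level would build the SPU
binary, wrap it in simple_spu.a, and link it into the 32-bit PPU executable
simple.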

Slide Source: Cell Programming Workshop at GTech, Cell SDK 2.0
31
Sample Project
Slide Source: Cell Programming Workshop at GTech, Cell SDK 2.0
32
Next Time
  • Chapters 1-3 of the NVIDIA CUDA Programming Guide, version 1.1
  • And all of Chapter 29 from GPU Gems 2
  • Links on the website