1
Lecture 3: Laws, Equality, and Inside a Cell
  • John Cavazos
  • Dept. of Computer and Information Sciences
  • University of Delaware
  • www.cis.udel.edu/cavazos/cisc879

2
Lecture 2 Overview
  • Know the Laws
  • All are NOT Created Equal
  • Inside a Cell

3
Two Important Laws
  • Amdahl's Law
  • Gene Amdahl's observation in 1967
  • Speedup is limited by the serial portions
  • Assumes a fixed workload and a fixed problem size
  • Gustafson's Law
  • John Gustafson's observation in 1988
  • Rescues parallel processing from Amdahl's Law
  • Proposes fixed time and increasing work
  • Sequential portions have a diminishing effect (see the formulas below)
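For reference, the standard forms of the two laws, with f the fraction of the
work that can be parallelized, s = 1 - f the serial fraction, and N the number
of processors:

    Amdahl's Law (fixed problem size):   Speedup(N) = 1 / (s + f/N), bounded above by 1/s
    Gustafson's Law (fixed run time):    Scaled speedup(N) = s' + (1 - s')·N,
                                         where s' is the serial fraction of the
                                         time on the N-processor run

Amdahl's speedup saturates as N grows; Gustafson's scaled speedup keeps growing
roughly linearly with N.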

4
Amdahl's Law
Parallelize parts 2 and 4 with 2 processors
[Figure: five program phases of 100 time units each (500 in total); the two
parallelized phases shrink to 50 units each on 2 processors, so total time
drops to 400 units]
Speedup = 25%
5
Amdahl's Law (cont'd)
Parallelize parts 2 and 4 with 4 processors
[Figure: the two parallelized phases shrink from 100 to 25 time units each
on 4 processors, so total time drops from 500 to 350 units]
Speedup ≈ 40%
6
Amdahl's Law (cont'd)
Parallelize parts 2 and 4 with infinite processors
[Figure: the two parallelized phases shrink from 100 time units each to
essentially 0, so total time drops from 500 to 300 units]
Speedup only ≈ 70%
Multicore doesn't look very appealing!
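A quick check of the three percentages above, assuming the figures' five phases
of 100 time units each (500 units total, of which 200 are parallelizable):

    2 processors:        500 / (300 + 200/2) = 500 / 400 = 1.25   -> 25% faster
    4 processors:        500 / (300 + 200/4) = 500 / 350 ≈ 1.43   -> roughly 40% faster
    infinite processors: 500 / (300 + 0)     = 500 / 300 ≈ 1.67   -> roughly 70% faster (the 1/s bound)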
7
Gustafson's Law (cont'd)
Boxes contain units of work now!
500 units of time, but 700 units of work!
[Figure: the same five phases and 500 units of total time, but each of the two
parallel phases now performs 200 units of work on 2 processors, for 700 units
of work in all]
Speedup = 40%
8
Gustafson's Law (cont'd)
Boxes contain units of work now!
500 units of time, but 1100 units of work!
[Figure: each of the two parallel phases now performs 400 units of work on 4
processors, for 1100 units of work in the same 500 units of time]
Speedup = 120% (2.2x)
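A quick check of the two work-scaled figures above, using the scaled-speedup
form S = s' + (1 - s')·N with a serial time fraction s' = 300/500 = 0.6:

    2 processors: S = 0.6 + 0.4·2 = 1.4  -> 700 units of work in 500 units of time (40% more throughput)
    4 processors: S = 0.6 + 0.4·4 = 2.2  -> 1100 units of work in 500 units of time (120% more throughput)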
9
Gustafson's Law (cont'd)
  • Gustafson made an important observation
  • As the number of processors grows, people scale up the problem size
  • Serial bottlenecks do not grow with problem size
  • Increasing the processor count then gives near-linear speedup
  • 20 processors are roughly twice as fast as 10
  • This is why supercomputers are successful
  • More processors allow an increased dataset size

Reference: http://www.scl.ameslab.gov/Publications/Gus/AmdahlsLaw/Amdahls.html
10
Lecture 2 Overview
  • Know the Laws
  • All are NOT Created Equal
  • Inside a Cell

11
All Multicores Not Equal
  • Multicore CPUs and GPUs are very different!
  • CPUs run general purpose programs well
  • GPUs run graphics (or similar programs) well
  • General Purpose Programs have
  • Less parallelism
  • More complex control requirements
  • GPU programs
  • Highly parallel
  • Arithmetic-intensive
  • Simple control requirements

12
Floating-Point Operations
32-bit FP operations per second
GPUs have more computational units and take better advantage of them.
Slide Source: NVIDIA CUDA Programming Guide 1.1
13
CPUs versus GPUs
CPUs devote lots of area to control and storage; GPUs devote most of their
area to computational units.
Slide Source: NVIDIA CUDA Programming Guide 1.1
14
CPU Programming Model
  • Scalar programming model
  • No native data parallelism
  • Few arithmetic units
  • Very small area
  • Optimized for complex control
  • Optimized for low latency, not high bandwidth

Slide Source: John Owens, EEC 227 Graphics Arch course
15
AMD K7 Deerhound
Slide Source: John Owens, EEC 227 Graphics Arch course
16
GPU Programming Model
  • Streams
  • Collections of data records
  • Amenable to data parallelism
  • Kernels
  • Inputs and outputs are streams
  • Perform computation on each element of a stream
  • No dependencies between stream elements (see the sketch below)
  • Stream storage
  • Not a cache (input read once, output written once)
  • Producer-consumer locality
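To make the stream/kernel model concrete, here is a minimal sketch in plain C
(illustrative names, not from the slides, and deliberately not tied to any GPU
API): the kernel reads one input record and writes one output element, with no
dependence on other elements, so every element could be processed in parallel.

    #include <stddef.h>

    /* One record of the input stream (illustrative layout). */
    typedef struct { float a, b; } record_t;

    /* The "kernel": computes one output element from one input record.
       It reads its input once and writes its output once. */
    static float kernel(record_t in) {
        return in.a * in.b;
    }

    /* Apply the kernel over the whole stream.  On a GPU each iteration
       would run as an independent thread; here it is a plain loop that
       only illustrates the programming model. */
    void run_kernel(const record_t *in_stream, float *out_stream, size_t n) {
        size_t i;
        for (i = 0; i < n; i++)
            out_stream[i] = kernel(in_stream[i]);
    }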

Slide Source: John Owens (EEC 227 Graphics Arch) and Pat Hanrahan
(Stream Prog. Env., GP2 Workshop)
17
Lecture 2 Overview
  • Know the Laws
  • All are NOT Created Equal
  • Inside a Cell

18
Cell B.E. Design Goals
  • An accelerator extension to Power
  • Exploits parallelism and achieves high frequency
  • Sustains high memory bandwidth through DMA
  • Designed for flexibility
  • Heterogeneous architecture
  • PPU for control and general-purpose code
  • SPU for computation-intensive work with little control
  • Applicable to a wide variety of applications

The Cell Architecture has characteristics of both
a CPU and GPU.
19
Cell Chip Highlights
  • 241M transistors
  • 9 cores, 10 threads
  • >200 GFlops (SP) - see the quick check below
  • >20 GFlops (DP)
  • >300 GB/s EIB bandwidth
  • 3.2 GHz shipping
  • Top frequency 4.0 GHz (in the lab)
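A back-of-the-envelope check on the single-precision figure (an estimate, not
from the slide): each SPE can issue one 4-wide single-precision fused
multiply-add per cycle, i.e. 8 flops/cycle, so

    8 SPEs x 8 flops/cycle x 3.2 GHz ≈ 204.8 GFlops (SP)

which is where the ">200 GFlops" number comes from.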

Slide Source: Michael Perrone, MIT 6.189, Fall 2007 course
20
Cell Details
  • Heterogeneous multicore architecture
  • Power Processor Element (PPE) for control tasks
  • Synergistic Processor Element (SPE) for
    data-intensive processing
  • SPE Features
  • No cache
  • Large unified register file
  • Synergistic Memory Flow Control (MFC)
  • Interface to high-perf. EIB

Slide Source: Michael Perrone, MIT 6.189, Fall 2007 course
21
Cell PPE Details
  • Power Processor Element (PPE)
  • General-purpose 64-bit PowerPC RISC processor
  • 2-way hardware multithreaded
  • L1: 32 KB I-cache + 32 KB D-cache
  • L2: 512 KB
  • For operating systems and program control

Slide Source: Michael Perrone, MIT 6.189, Fall 2007 course
22
Cell SPE Details
  • Synergistic Processor Element (SPE)
  • 128-bit SIMD architecture
  • Dual issue
  • Register file: 128 x 128-bit
  • Local Store (256 KB)
  • Simplified branch architecture
  • No hardware branch predictor
  • Compiler-managed branch hints
  • Memory Flow Controller (MFC)
  • Dedicated DMA engine - up to 16 outstanding
    requests (see the SPU sketch below)
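A minimal sketch of SPE code reflecting these features, assuming the SDK
headers spu_intrinsics.h and spu_mfcio.h; the buffer size, DMA tag, and the use
of argp as an effective address are illustrative choices, not from the slides.
The MFC DMAs data into the Local Store, the 128-bit SIMD intrinsics process
four floats per instruction, and the result is DMAed back out.

    #include <spu_intrinsics.h>
    #include <spu_mfcio.h>

    #define N 1024                       /* illustrative buffer size, in floats */
    static vector float buf[N / 4] __attribute__((aligned(128)));

    int main(unsigned long long speid, unsigned long long argp,
             unsigned long long envp)
    {
        unsigned int tag = 1;            /* illustrative DMA tag group (0..31) */
        vector float two = spu_splats(2.0f);
        int i;

        /* DMA the input from main memory (effective address passed in argp)
           into the 256 KB Local Store, then wait for it to complete. */
        mfc_get(buf, argp, sizeof(buf), tag, 0, 0);
        mfc_write_tag_mask(1 << tag);
        mfc_read_tag_status_all();

        /* 128-bit SIMD: scale four single-precision floats per instruction. */
        for (i = 0; i < N / 4; i++)
            buf[i] = spu_mul(buf[i], two);

        /* DMA the result back out to the same effective address. */
        mfc_put(buf, argp, sizeof(buf), tag, 0, 0);
        mfc_write_tag_mask(1 << tag);
        mfc_read_tag_status_all();
        return 0;
    }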

Slide Source: Michael Perrone, MIT 6.189, Fall 2007 course
23
Compiler Tools
  • GNU-based C/C++ compilers (Sony)
  • ppu-gcc / ppu-g++ - generate PPU code
  • spu-gcc / spu-g++ - generate SPU code
  • GDB debugger
  • Supports both PPU and SPU debugging
  • Different modes of execution

Slide Source: Michael Perrone, MIT 6.189, Fall 2007 course
24
Compiler Tools
  • The XL C/C++ compiler
  • ppuxlc / ppuxlc++ - generate PPU code
  • spuxlc / spuxlc++ - generate SPU code
  • Includes the following optimization levels
  • -O0 almost no optimization
  • -O2 strong, low-level optimization
  • -O3 intense, low-level opts with basic loop opts
  • -O4 all of -O3 plus detailed loop analysis and
    good whole-program analysis
  • -O5 all of -O4 plus detailed whole-program
    analysis

Slide Source: Michael Perrone, MIT 6.189, Fall 2007 course
25
Performance Tools
  • GNU-based tools
  • OProfile - system-level profiler (PPU only)
  • gprof - generates call graphs
  • IBM tools
  • Static analysis tool (spu_timing)
  • Annotates an assembly file with scheduling and
    instruction-issue estimates
  • Dynamic analysis tool (Cell BE system simulator)
  • Can run your code on an x86 machine
  • Can collect a variety of statistics
  • Can collect a variety of statistics

Slide Source: Michael Perrone, MIT 6.189, Fall 2007 course
26
Compiling with the SDK
  • README_build_env.txt (You should read this - IMPORTANT!)
  • Provides details on the build environment
    features, including files, structure and
    variables.
  • make.footer
  • Specifies all of the build rules needed to
    properly build binaries
  • Must be included in all SDK Makefiles (referenced
    relatively if CELL_TOP is not defined)
  • Includes make.header
  • make.header
  • Specifies definitions needed to process the
    Makefiles
  • Includes make.env
  • make.env
  • Specifies the default compilers and tools to be
    used by make
  • make.footer and make.header should not be
    modified

Slide Source: Cell Programming Workshop at GTech, Cell SDK 2.0
27
Compiling with the SDK
  • Defaults to gcc
  • Set in make.env with three variables set to gcc
    or xlc
  • PPU32_COMPILER
  • PPU64_COMPILER
  • PPU_COMPILER overrides PPU32_COMPILER and
    PPU64_COMPILER
  • SPU_COMPILER
  • Can change from the command line
  • PPU_COMPILER=xlc SPU_COMPILER=xlc make
  • make -e PPU64_COMPILER=gcc -e PPU32_COMPILER=gcc
    -e SPU_COMPILER=gcc
  • export PPU_COMPILER=xlc SPU_COMPILER=xlc; make

Slide Source: Cell Programming Workshop at GTech, Cell SDK 2.0
28
Compiling with the SDK
  • Use CELL_TOP or maintain relative directory
    structure
  • ifdef CELL_TOP
  • include $(CELL_TOP)/make.footer
  • else
  • include ../../../make.footer
  • endif

Slide Source: Cell Programming Workshop at GTech, Cell SDK 2.0
29
Makefile variables
  • DIRS
  • list of subdirectories to build first
  • PROGRAM_ppu PROGRAMS_ppu
  • 32-bit PPU program (or list of programs) to
    build.
  • PROGRAM_ppu64 PROGRAMS_ppu64
  • 64-bit PPU program (or list of programs) to
    build.
  • PROGRAM_spu PROGRAMS_spu
  • SPU program (or list of programs) to build.
  • If written as a standalone binary, can run
    without being embedded in a PPU program.

Slide Source: Cell Programming Workshop at GTech, Cell SDK 2.0
30
Makefile variables (contd)
  • LIBRARY_embed LIBRARY_embed64
  • Creates a linked library from an SPU program to
    be embedded into a 32-bit or 64-bit PPU program.
  • CC_OPT_LEVEL
  • Optimization level for the compiler to use
  • CFLAGS, CFLAGS_gcc, CFLAGS_xlc
  • Additional flags for the compiler (general, or
    specific to gcc/xlc)
  • TARGET_INSTALL_DIR
  • Specifies where built targets are installed
    (a sketch Makefile pair follows below)
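Putting the variables together, a minimal sketch of the Makefile pair they
imply for a project with an spu/ subdirectory. The program names, and the
IMPORTS line used to pull the embedded SPU library into the PPU link, are
illustrative assumptions rather than content from these slides; the actual
build rules come from make.footer.

    # ---- spu/Makefile: build the SPU program and an embeddable library ----
    PROGRAM_spu   := simple_spu
    LIBRARY_embed := simple_spu.a
    include $(CELL_TOP)/make.footer

    # ---- Makefile (top level): build the 32-bit PPU program,
    # ---- descending into spu/ first ----
    DIRS        := spu
    PROGRAM_ppu := simple
    IMPORTS     := spu/simple_spu.a
    include $(CELL_TOP)/make.footer

With these two files, a plain "make" at the top level would build the SPU
binary, wrap it in simple_spu.a, and link it into the 32-bit PPU executable
simple.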

Slide Source: Cell Programming Workshop at GTech, Cell SDK 2.0
31
Sample Project
Slide Source: Cell Programming Workshop at GTech, Cell SDK 2.0
32
Next Time
  • Chapters 1-3 of the NVIDIA CUDA Programming Guide, version 1.1
  • And all of Chapter 29 from GPU Gems 2
  • Links on the website