Enhancing GPU for Scientific Computing - PowerPoint PPT Presentation

Slides: 27
Provided by: duderoc

1
Enhancing GPU for Scientific Computing
  • Some thoughts

2
Outline
  • Motivation
  • Related work
  • BLAS Library
  • Execution Model
  • Benchmarks
  • Recommendations

3
Motivation
  • GPU Computing
  • Vertex and Fragment Processors:
    streaming (super)computers
  • enormous performance!
  • ATI 9700, NV30
  • They have become programmable
  • Emerging application areas
  • Numerical simulation [Schroder03], sorting, genomics,
    etc.
  • Goal: Scientific Computing

4
Motivation
  • Most software is built from small, efficient parts
  • Scientific apps are built on top of software library
    routines
  • Harnessing GPU resources
  • Arithmetic Intensive
  • Data parallel
  • BLAS Library

5
Related work
  • Using non-programmable GPUs
  • [Erik01]: programmable vertex engine for
    lighting/morphing
  • [Oskin02]: vector processing using VP
  • [Ian03]: stream processing using FP
  • Problems
  • Monolithic Big Programs
  • One of VP or FP
  • CPU Passive Mode
  • No Cascading Loop-backs (Parallelism, Setup
    Times)

6
BLAS Library
  • BLAS (Basic Linear Algebra Subprograms)
  • Building blocks for vector and matrix operations
  • Used to develop highly efficient linear algebra
    software
  • LINPACK and LAPACK
  • Operations
  • Scalar Vector
  • Vector Vector
  • Vector Matrix
  • Matrix Matrix
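
The four operation classes above can be illustrated with a small pure-Python sketch. This is only a CPU reference for what each class computes; the function names are illustrative and are not actual BLAS routine names, and the choice of one representative operation per class (scale, dot product, products) is an assumption.

```python
# Pure-Python reference for the four operation classes.
# Names are hypothetical; real BLAS routines use different names and signatures.

def scalar_vector(alpha, x):
    """Scalar-vector: scale every element of x by alpha."""
    return [alpha * xi for xi in x]

def vector_vector(x, y):
    """Vector-vector: dot product of two equal-length vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return sum(xi * yi for xi, yi in zip(x, y))

def matrix_vector(a, x):
    """Vector-matrix: product A x, where a is a list of rows."""
    return [vector_vector(row, x) for row in a]

def matrix_matrix(a, b):
    """Matrix-matrix: product A B, where a and b are lists of rows."""
    b_columns = list(zip(*b))
    return [[vector_vector(row, col) for col in b_columns] for row in a]
```

Each higher class is written in terms of the lower one, which mirrors the "building blocks" idea on this slide.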

7
Mapping
  • Operation processor
  • CPU/FP - All ops
  • VP - no memory access
  • Restricted data-flows
  • CPU FP
  • VP CPU

8
Execution graph: Vector Scalar Add Operation
  • In this example, a vector of length n is
    segmented into m vectors of length 4 in the
    CPU function vsAdd.
  • The vertex program vsAdd.cg is loaded onto the
    vertex processor and the scalar value is passed
    as a parameter.
  • Subsequently, the CPU function vsAdd streams the
    set of m vectors onto the GPU as OpenGL point
    primitives. The vertex program vsAdd.cg adds the
    scalar value to all fields in the m vertices.
  • These vertices then proceed to the fragment
    processor and are written onto the framebuffer
    memory.
  • The CPU function vsAdd then reads the color
    values off the pixel representation of each
    vertex. These color values contain the result of
    the vector scalar add.
  • Lastly, the CPU function concatenates the sequence
    of color values into a vector of length n as the
    result.
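
The data flow of the vector scalar add can be emulated on the CPU with a short Python sketch. It is not GPU code: `vertex_program` merely stands in for vsAdd.cg, the "framebuffer" is an ordinary list, and zero-padding the last vertex when n is not a multiple of 4 is an assumption the slides do not state.

```python
def segment(vector, width=4):
    """Split a vector into chunks of `width`, zero-padding the last one."""
    padded = list(vector) + [0.0] * (-len(vector) % width)
    return [padded[i:i + width] for i in range(0, len(padded), width)]

def vertex_program(vertex, scalar):
    """Stand-in for vsAdd.cg: add the scalar to all four fields of one vertex."""
    return [field + scalar for field in vertex]

def vs_add(vector, scalar):
    """CPU-side driver: segment, process each vertex, read back, concatenate."""
    vertices = segment(vector)                        # m vertices of length 4
    framebuffer = [vertex_program(v, scalar) for v in vertices]
    flat = [value for pixel in framebuffer for value in pixel]
    return flat[:len(vector)]                         # drop the padding

print(vs_add([1.0, 2.0, 3.0, 4.0, 5.0], 10.0))
```

In the real pipeline the list comprehension over `vertices` corresponds to streaming points to the vertex processor, and the final flattening corresponds to the readback of color values.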

[Execution graph. CPU: vAdd sends m vertices (GL_POINTS). GPU: Vertex Processor (vAdd.cg) → Fragment Processor (no program) → PBuffer / texture memory. Texture color values (m) return to vAdd on the CPU as vectors.]

9
Execution graph: Vector Vector Add Operation
  • In this example, 2 vectors of length s are
    transformed into texture data in the CPU function
    vAdd.
  • The fragment program vAdd.cg and the texture data
    are loaded onto the fragment processor and GPU
    memory, respectively.
  • Subsequently, the CPU function vAdd draws a
    quadrilateral primitive covering s pixels.
  • The vertex processor does nothing and passes
    the vertices on to the rasterizer, which converts
    them into a pixel representation.
  • The rasterizer creates the s pixels for fragment
    processing.
  • For each pixel, the fragment processor looks up
    the values from both textures and determines the
    color value of the pixel. These pixels are
    written onto the Pbuffer memory.
  • The CPU function vAdd then reads the color
    values off each pixel. These color values contain
    the result of the vector vector add.
  • The output in the Pbuffer is then converted into a
    texture entry.
  • Lastly, the CPU function reads the texture entry
    and concatenates the sequence of color values
    into a vector of length s as the result.
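
The vector vector add can be emulated the same way. In this Python sketch, `fragment_program` stands in for vAdd.cg and `render_quad` stands in for drawing the quadrilateral; both names are hypothetical. The default of one value per pixel follows the "s pixels" wording above, while `channels=4` shows an assumed RGBA packing that the slides do not confirm.

```python
def to_texture(vector, channels=1):
    """Pack a vector into texels of `channels` values, zero-padding the last."""
    padded = list(vector) + [0.0] * (-len(vector) % channels)
    return [tuple(padded[i:i + channels])
            for i in range(0, len(padded), channels)]

def fragment_program(tex_a, tex_b, pixel):
    """Stand-in for vAdd.cg: look up both textures at `pixel` and add them."""
    return tuple(a + b for a, b in zip(tex_a[pixel], tex_b[pixel]))

def render_quad(tex_a, tex_b):
    """Stand-in for drawing the quad: one fragment invocation per pixel."""
    return [fragment_program(tex_a, tex_b, p) for p in range(len(tex_a))]

def v_add(x, y, channels=1):
    """CPU-side driver: build textures, render, read back, concatenate."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    pbuffer = render_quad(to_texture(x, channels), to_texture(y, channels))
    flat = [value for texel in pbuffer for value in texel]
    return flat[:len(x)]                  # drop any padding

print(v_add([1.0, 2.0, 3.0], [10.0, 20.0, 30.0]))
```

Unlike the vector scalar case, all arithmetic happens per pixel in the fragment stage, which is why both operands must first be turned into textures.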

[Execution graph. CPU: vAdd sends Vertex4m (GL_QUAD). GPU: Vertex Processor (no program) → Fragment Processor (vAdd.cg), with TextureData1m and TextureData2m in texture memory → PBuffer → TextureData3m. Texture color values (m) return to vAdd on the CPU as vectors.]

10
Execution graph: 2 Vector Vector Add Operations
  • In this example, we perform 2 separate vector
    vector add operations.
  • The 1st operation proceeds as described earlier
    for the vector vector add operation.
  • The output of the 1st operation is used as input
    for the 2nd operation.
  • Since it's the same operation, we do not load a
    new vertex or fragment program. However, we
    do load new texture data.
  • The 2nd operation proceeds as normal.
  • Lastly, the CPU function concatenates the sequence
    of color values into a vector of length s as the
    result.
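
A minimal Python sketch of the cascaded case, assuming a toy `FragmentUnit` class (hypothetical, not part of any real API) that counts program loads. It illustrates the two points above: the first result is fed back as an input, and the already-loaded program is reused.

```python
class FragmentUnit:
    """Toy fragment processor that counts how often a program is loaded."""

    def __init__(self):
        self.program = None
        self.program_loads = 0

    def load(self, program):
        # Skip the load when the same program is already resident.
        if program is not self.program:
            self.program = program
            self.program_loads += 1

    def render(self, tex_a, tex_b):
        # One program invocation per pixel, as in the single-add case.
        return [self.program(a, b) for a, b in zip(tex_a, tex_b)]

def add_texels(a, b):
    """Stand-in for vAdd.cg operating on one pair of texel values."""
    return a + b

def chained_add(x, y, z):
    """Compute (x + y) + z, feeding the first result back in as a texture."""
    gpu = FragmentUnit()
    gpu.load(add_texels)
    intermediate = gpu.render(x, y)        # 1st operation
    gpu.load(add_texels)                   # same program: no reload
    result = gpu.render(intermediate, z)   # 2nd operation, one new texture
    return result, gpu.program_loads

print(chained_add([1, 2], [10, 20], [100, 200]))
```

Only the final result is read back to the CPU in this sketch; the intermediate list plays the role of the output texture that stays in GPU memory.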

[Execution graph. CPU: vAdd sends Vertex4m (GL_QUAD). GPU: Vertex Processor (no program) → Fragment Processor (vAdd.cg), with TextureData4m and TextureData3m in texture memory → PBuffer. Texture color values (m) return to vAdd on the CPU as vectors.]

11
Performance Issues
  • Representation inefficiency
  • Memory
  • Data stored both in CPU and GPU
  • Communication costs
  • Loading data onto GPU
  • Reading data from GPU
  • Execution inefficiencies
  • Computation setup overhead
  • Remodeling CPU data for GPU
  • Problem execution time
  • Rendering
  • Texture lookups
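
One way to reason about how these costs interact is a simple linear cost model, sketched below in Python. The model itself and all parameter names are assumptions for illustration; the slides do not give a formula, and the numbers in the usage line are placeholders, not measurements.

```python
import math

def gpu_time(n, setup, upload_per_elem, compute_per_elem, readback_per_elem):
    """Assumed model: fixed setup cost plus per-element transfer and compute."""
    return setup + n * (upload_per_elem + compute_per_elem + readback_per_elem)

def cpu_time(n, compute_per_elem):
    """Assumed model: the CPU pays only a per-element compute cost."""
    return n * compute_per_elem

def break_even(setup, gpu_per_elem, cpu_per_elem):
    """Smallest n with setup + n*gpu_per_elem < n*cpu_per_elem, else None."""
    if gpu_per_elem >= cpu_per_elem:
        return None                        # the GPU never catches up
    return math.floor(setup / (cpu_per_elem - gpu_per_elem)) + 1

# Placeholder costs in arbitrary time units.
print(break_even(setup=100, gpu_per_elem=1, cpu_per_elem=3))
```

Under this model, small problems are dominated by the setup and transfer terms, so the GPU only pays off once n passes the break-even point, if it ever does.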

12-18
(No Transcript)
19
Observations
  • Fixed-point operations are much faster than
    FP16/FP32 operations
  • FP16/FP32 operations have similar performance
  • VP is slower than FP
  • Operation mappings involving both VP and FP
    result in an inefficient pipeline

20
Observations
  • Simple operations perform better on CPU
  • Best to design whole algorithm as single VP/FP
    program
  • Memory cost for storing intermediate results
  • Execution cost ?
  • More textures result in decreased performance

21
Bug Reports Filed!
  • Incorrect dump of floating point values after
    render to texture (NVIDIA confirmed)
  • cgSetcolor parameter does not update alpha values
    (awaiting reply)

22
Recommendations (3D Graphics Hackers)
  • Load important data into Video memory
  • Maximum use of Fixed-point Pipeline
  • Code optimization important (Instr., Memory)
  • Upgrade your video card drivers (must!)
  • Hacking graphics hardware is a real pain!

23
Recommendations (Cg)
  • Pointer meaningful for numerical computing
  • Texture fetch instructions (add. Offsets)
  • Accumulation registers (sum)
  • Preserving State across multiple calls
  • Introduce stack mechanisms
  • Introduce bit wise operators

24
Recommendations (Hardware)
  • Allow GPU to read/write from CPU memory
  • VP and FP as 1st class processors on GPU
  • Similar cores and instruction sets
  • Allow full parallelism
  • Allow CPU to read/write all registers in GPU
    processors
  • Introduce a stack
  • Introduce bit wise operators

25
Deliverables!
  • A draft subset of the BLAS library
  • Architecture Insights (issues/constraints)
  • NV30 Improvements (Bug reports)
  • Technical Write-up

26
The End