Vector Unit Assembly - PowerPoint PPT Presentation

About This Presentation
Title:

Vector Unit Assembly

Slides: 18
Provided by: benjamin8

Transcript and Presenter's Notes

Title: Vector Unit Assembly


1
Vector Unit Assembly
  • bquintero_at_fullsail.com

2
Overview
  • Architecture Review
  • VU0 Macro Mode Instruction Set
  • Building a Vector Library

3
Review
  • The PlayStation 2 has two vector units that are
    similar but not identical
  • VU0 is the CPU's alternate processing unit
  • VU1 is the GS's alternate processing unit
  • Each unit has a direct pipeline to its
    respective processor
  • Vector units are designed for 4D vectors of
    32-bit floats

4
Review
  • VU0 and VU1 each have access to 32 float registers
    and 16 integer registers
  • Float registers are not like PC registers: they
    are 128 bits in size (PC registers are 32 bits)
  • 128 bits can hold 4 float values at once (a 4D
    vector)
  • Integer registers are typically used as loop
    counters and address calculators
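
As a rough illustration of the "four floats in 128 bits" point, here is a minimal C sketch of what a Vec4T type might look like. The slides never show the real definition, so the field names, the alignment attribute (a GCC/Clang extension) and both helper functions are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout for the Vec4T type used on the later slides:
   four 32-bit floats packed into one 128-bit (16-byte) quadword.
   The aligned attribute is a GCC/Clang extension. */
typedef struct Vec4T {
    float x, y, z, w;
} __attribute__((aligned(16))) Vec4T;

/* Returns 1 when p sits on a 16-byte boundary, which is what a
   quadword load/store of a whole vector would expect. */
static int vec4_is_aligned(const Vec4T *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* Hypothetical debug helper: trap early on a misaligned vector. */
static void vec4_check(const Vec4T *p)
{
    assert(p != NULL);
    assert(vec4_is_aligned(p));
}
```

Keeping the struct exactly 16 bytes wide means one vector maps onto one float register, so a single quadword load can fill all four fields.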

5
Review
  • VU0 has two bus lines
  • One bus is dedicated to the CPU
  • The other bus is used to communicate with all
    other devices
  • VU0 has 4KB of

6
Vector Unit Processing Speed
  • The graph shows some vector-math intensive
    function calls
  • 200K calls were made to each function

7
Macro and Micro Modes
  • Vector Unit Zero (VU0) has two modes
  • Micro mode is a mode that allows your vector
    processor to act as an independent CPU
  • A mini program is uploaded and executed in
    parallel to the main CPU
  • Macro mode allows your CPU to directly offload
    heavy vector computation with low overhead
  • Most popular method, hands down.

8
Micro Mode
  • When uploaded, the micro program is executed
    independently of the CPU
  • This means that we must time our execution so
    that the result is fetched by the CPU after the
    program is completed by the Vector Unit
  • Micro mode causes serious stalls and timing
    issues since execution speed is near impossible
    to determine

9
Macro Mode
  • Macro mode is a much easier method of executing
    fast math functionality
  • Assembly can be used as inline instructions,
    telling the compiler to offload the math to VU0
  • Notes
  • Just because it's in assembly does not mean it
    will be faster
  • Switching CPU focus has its overheads

10
Assembly Structure
  • There is typically a specific method to writing
    assembly routines
  • Load the variable data/addresses to registers
  • Apply vector computations to those registers
  • Store the result back into a variable address
  • The overhead of using assembly is in the load and
    store
  • Make sure that the computation stage will improve
    performance enough to offset the load/store
    overhead
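
One way to check whether a routine pays for its load/store overhead is simply to time it. The sketch below is a hypothetical harness in portable C, not the code behind the graph on the earlier slide: it calls a vector routine a fixed number of times (200,000 in the usage note, matching the call count quoted earlier) and reports CPU seconds. The Vec4T layout, the VecOp typedef and time_calls are assumptions.

```c
#include <assert.h>
#include <time.h>

/* Assumed layout; the slides do not show the real Vec4T. */
typedef struct { float x, y, z, w; } Vec4T;

/* Plain C reference add, mirroring the EEVectorAdd slide. */
static void EEVectorAdd(const Vec4T *v0, const Vec4T *v1, Vec4T *v2)
{
    v2->x = v0->x + v1->x;
    v2->y = v0->y + v1->y;
    v2->z = v0->z + v1->z;
    v2->w = v0->w + v1->w;
}

/* Hypothetical signature shared by the routines being compared. */
typedef void (*VecOp)(const Vec4T *, const Vec4T *, Vec4T *);

/* Calls op `calls` times and returns the elapsed CPU time in seconds. */
static double time_calls(VecOp op, long calls)
{
    Vec4T a = {1.0f, 2.0f, 3.0f, 4.0f};
    Vec4T b = {4.0f, 3.0f, 2.0f, 1.0f};
    Vec4T out = {0.0f, 0.0f, 0.0f, 0.0f};
    volatile float sink = 0.0f;   /* keeps the loop from being optimized away */
    clock_t start, end;
    long i;

    assert(calls > 0);
    start = clock();
    for (i = 0; i < calls; ++i) {
        op(&a, &b, &out);
        sink += out.x;
    }
    end = clock();
    (void)sink;
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Calling time_calls(EEVectorAdd, 200000) and then the same with a VU0 version of the routine gives two numbers to compare. If the assembly version is not clearly faster, the load/store overhead is not being offset.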

11
Vector Unit MIPS Instructions
  • Coprocessor Transfer Instructions
  • Store / Load
  • Coprocessor Branch Instructions
  • Macro (primitive) calculation instructions
  • Add / Subtract / Multiply / Divide / etc.
  • Micro subroutine execution instructions
  • (VU Macro Instructions)

12
EEVectorAdd
  • Adding two vectors using the EE Core (CPU)
  • // (Vec4T *v0, Vec4T *v1, Vec4T *v2)
  • v2->x = v0->x + v1->x;
  • v2->y = v0->y + v1->y;
  • v2->z = v0->z + v1->z;
  • v2->w = v0->w + v1->w;

13
VectorAdd
  • Adding two vectors using the VU0
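  • // (Vec4T *v0, Vec4T *v1, Vec4T *v2)
  • asm __volatile__ (
  •     "lqc2      vf05, 0x0(%0)\n"      // vf05 = *v0
  •     "lqc2      vf06, 0x0(%1)\n"      // vf06 = *v1
  •     "vadd.xyzw vf07, vf05, vf06\n"   // vf07 = vf05 + vf06
  •     "sqc2      vf07, 0x0(%2)\n"      // *v2 = vf07
  •     : // no outputs
  •     : "r" (v0), "r" (v1), "r" (v2)
  •     : "memory"
  • );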

14
EECrossProduct
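  • Notice that we must use a temp: each component of
    the cross product reads the other components of
    both inputs, so writing the result in place would
    corrupt it if cross aliases v1 or v2
  • // (Vec4T *v1, Vec4T *v2, Vec4T *cross)
  • Vec4T temp;
  • temp.x = v1->y * v2->z - v1->z * v2->y;
  • temp.y = v1->z * v2->x - v1->x * v2->z;
  • temp.z = v1->x * v2->y - v1->y * v2->x;
  • VectorCopy(temp, cross);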

15
CrossProduct
  • // (Vec4T *v1, Vec4T *v2, Vec4T *cross)
  • asm __volatile__ (
  •     "lqc2        vf05, 0x0(%0)\n"
  •     "lqc2        vf06, 0x0(%1)\n"
  •     "vopmula.xyz ACC,  vf05, vf06\n"   // first half
  •     "vopmsub.xyz vf06, vf06, vf05\n"   // minus second half
  •     "vsub.w      vf06, vf00, vf00\n"   // w = 0
  •     "sqc2        vf06, 0x0(%2)\n"
  •     : // no outputs
  •     : "r" (v1), "r" (v2), "r" (cross)
  •     : "memory"
  • );

16
Vector Outer Product
  • The vopmula instruction performs an outer product
  • The result is stored in the special-purpose ACC
    register
  • (Diagram: the X, Y, Z fields of VF05 and VF06
    combining into the X, Y, Z fields of ACC)
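
To make the ACC step concrete, the sketch below emulates the vopmula/vopmsub pair from the CrossProduct slide in plain C. It is an illustration of the documented field pairing, not VU code: the emu_ function names and the Vec4T layout are hypothetical, and the real instructions operate on VU registers rather than structs in memory.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed layout; the slides do not show the real Vec4T. */
typedef struct { float x, y, z, w; } Vec4T;

/* Emulates vopmula.xyz ACC, fs, ft:
   ACC receives the first half of each cross-product term. */
static void emu_vopmula(Vec4T *acc, const Vec4T *fs, const Vec4T *ft)
{
    acc->x = fs->y * ft->z;
    acc->y = fs->z * ft->x;
    acc->z = fs->x * ft->y;
}

/* Emulates vopmsub.xyz fd, fs, ft:
   fd = ACC minus the second half of each term. fd may alias fs or ft,
   so the result is built in a local copy first; w is left untouched. */
static void emu_vopmsub(Vec4T *fd, const Vec4T *acc,
                        const Vec4T *fs, const Vec4T *ft)
{
    Vec4T r = *fd;
    r.x = acc->x - fs->y * ft->z;
    r.y = acc->y - fs->z * ft->x;
    r.z = acc->z - fs->x * ft->y;
    *fd = r;
}

/* Same instruction order as the CrossProduct slide. */
static void emu_cross(const Vec4T *v1, const Vec4T *v2, Vec4T *cross)
{
    Vec4T vf05, vf06, acc = {0.0f, 0.0f, 0.0f, 0.0f};

    assert(v1 != NULL && v2 != NULL && cross != NULL);
    vf05 = *v1;                               /* lqc2 vf05 */
    vf06 = *v2;                               /* lqc2 vf06 */
    emu_vopmula(&acc, &vf05, &vf06);          /* ACC = first half */
    emu_vopmsub(&vf06, &acc, &vf06, &vf05);   /* vf06 = ACC - second half */
    vf06.w = 0.0f;                            /* vsub.w vf06, vf00, vf00 */
    *cross = vf06;                            /* sqc2 vf06 */
}
```

Because the inputs are copied into local "registers" before any result is written, the output may safely alias an input, which is why the VU0 version needs no temp.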

17
For Next Time
  • Read Chapters 7.3.2 7.4.2
  • Read Chapters 9.3