Intel - PowerPoint PPT Presentation

About This Presentation
Title:

Intel

Description:

Intel's MMX, Dr. Richard Enbody, CSE 820 – PowerPoint PPT presentation

Slides: 44
Provided by: Richard1675
Learn more at: http://www.cse.msu.edu

Transcript and Presenter's Notes


1
Intel's MMX
  • Dr. Richard Enbody
  • CSE 820

2
Why MMX?
  • Make the Common Case Fast
  • Multimedia and Communication consume significant
    computing resources.
  • Providing specific hardware support makes sense.

3
Goals
  • Accelerate multimedia and communications
    applications.
  • Maintain full compatibility with existing
    operating systems and applications.
  • Exploit inherent parallelism in multimedia and
    communication algorithms.
  • Include new instructions and data types to
    improve performance.

4
First Step: Examine Code
  • Examined a wide range of applications: graphics,
    MPEG video, music synthesis, speech compression,
    speech recognition, image processing, games,
    video conferencing.
  • Identified and analyzed the most
    compute-intensive routines.

5
Common Characteristics
  • Small integer data types, e.g. 8-bit pixels,
    16-bit audio samples
  • Small, highly repetitive loops
  • Frequent multiply-and-accumulate
  • Compute-intensive algorithms
  • Highly parallel operations

6
MMX Technology
  • A set of basic, general purpose integer
    instructions
  • Single Instruction, Multiple Data (SIMD)
  • 57 new instructions
  • Eight 64-bit wide MMX registers
  • Four new data types

7
Data Types
8
Data Types
9
Example
  • Pixels are generally 8-bit integers. Pack eight
    pixels into a 64-bit MMX register.
  • An MMX instruction takes all eight of the pixels
    at once from the MMX register, performs the
    arithmetic or logical operation on all eight
    elements in parallel, and writes the result into
    an MMX register.
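
The packing idea on this slide can be sketched in portable C. This is an emulation with a plain 64-bit integer standing in for an MMX register, not actual MMX code; the function names `pack_pixels` and `add_bytes_wrap` are hypothetical, and the choice of putting pixel 0 in the low byte is an assumption.

```c
#include <stdint.h>

/* Pack eight 8-bit pixels into one 64-bit value; px[0] lands in the low byte.
   (Lane order is an assumption made for this sketch.) */
uint64_t pack_pixels(const uint8_t px[8]) {
    uint64_t r = 0;
    for (int i = 0; i < 8; i++)
        r |= (uint64_t)px[i] << (8 * i);
    return r;
}

/* Add corresponding bytes of a and b. Each of the eight lanes is handled
   independently and wraps modulo 256; no carry crosses into a neighbor.
   The loop only emulates what the hardware would do in one instruction. */
uint64_t add_bytes_wrap(uint64_t a, uint64_t b) {
    uint64_t r = 0;
    for (int i = 0; i < 8; i++) {
        uint8_t x = (uint8_t)(a >> (8 * i));
        uint8_t y = (uint8_t)(b >> (8 * i));
        r |= (uint64_t)(uint8_t)(x + y) << (8 * i);
    }
    return r;
}
```

Adding a packed brightness offset to eight packed pixels would then be a single call to `add_bytes_wrap`.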

10
Compatibility
  • No new exceptions or states are added.
  • MMX registers are aliased to the existing FP
    registers. The exponent field of the
    corresponding floating-point register (bits
    64-78) and the sign bit (bit 79) are set to ones
    (1's), making the value in the register a NaN
    (Not a Number) or infinity when viewed as a
    floating-point value.

11
(No Transcript)
12
57 Instructions
  • Basic arithmetic: add, subtract, multiply,
    arithmetic shift, and multiply-add
  • Comparison
  • Conversion: pack and unpack
  • Logical
  • Shift
  • Move: register-to-register
  • Load/Store: 64-bit and 32-bit

13
Packed Add Word with wrap around
  • Each Addition is independent
  • Rightmost overflows and wraps around
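
A rough C emulation of a wrap-around add on four packed 16-bit words may help make the behavior concrete. It uses a 64-bit integer in place of an MMX register; `add_words_wrap` is a hypothetical name, and the "rightmost" element is assumed to be the low 16 bits.

```c
#include <stdint.h>

/* Add four 16-bit lanes. Each lane wraps modulo 65536 on overflow and
   never carries into the lane next to it. */
uint64_t add_words_wrap(uint64_t a, uint64_t b) {
    uint64_t r = 0;
    for (int i = 0; i < 4; i++) {
        uint16_t x = (uint16_t)(a >> (16 * i));
        uint16_t y = (uint16_t)(b >> (16 * i));
        r |= (uint64_t)(uint16_t)(x + y) << (16 * i);
    }
    return r;
}
```

With a low lane of 0xFFFF plus 0x0001, that lane becomes 0x0000 while the other three lanes are unaffected.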

14
Saturation
  • Saturation: if an addition results in overflow or
    underflow, the result is clamped to the largest
    or smallest value representable.
  • This is important for pixel calculations, where
    it prevents a wrap-around add from causing a
    black pixel to suddenly turn white.

15
No Mode
  • There is no "saturation mode" bit: a new mode bit
    would require a change to the operating system.
    Separate instructions are used to generate
    wrap-around and saturating results.

16
Packed Add Word with unsigned saturation
  • Each Addition is independent
  • Rightmost saturates
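
For contrast with wrap-around arithmetic, here is a minimal C sketch of an unsigned saturating add on four packed 16-bit words. It emulates the behavior described on the slide with a 64-bit integer; `add_words_sat_unsigned` is a hypothetical name, and the low 16 bits are assumed to be the "rightmost" element.

```c
#include <stdint.h>

/* Add four unsigned 16-bit lanes. A lane whose sum exceeds 0xFFFF is
   clamped to 0xFFFF instead of wrapping around. */
uint64_t add_words_sat_unsigned(uint64_t a, uint64_t b) {
    uint64_t r = 0;
    for (int i = 0; i < 4; i++) {
        uint32_t x = (uint16_t)(a >> (16 * i));
        uint32_t y = (uint16_t)(b >> (16 * i));
        uint32_t s = x + y;          /* at most 0x1FFFE, fits in 32 bits */
        if (s > 0xFFFFu)
            s = 0xFFFFu;             /* clamp to the largest 16-bit value */
        r |= (uint64_t)s << (16 * i);
    }
    return r;
}
```

Given a low lane of 0xFFFF plus 0x0001, the result stays at 0xFFFF rather than dropping to 0x0000.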

17
Multiply-Accumulate
  • Multiply-accumulate operations are fundamental to
    many signal processing algorithms, such as
    vector dot products, matrix multiplies, FIR and
    IIR filters, FFTs, DCTs, etc.

18
Packed Multiply-Add
Multiply four pairs of 16-bit words, generating
four 32-bit products. Add the 2 products on the
left for one result and the 2 products on the
right for the other result.
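
The packed multiply-add can be sketched in plain C as follows. This is an emulation on arrays rather than MMX registers, it assumes signed 16-bit inputs, and `multiply_add_words` is a hypothetical name.

```c
#include <stdint.h>

/* a and b each hold four signed 16-bit elements. Four 32-bit products are
   formed, then adjacent pairs are summed into two 32-bit results.
   Corner case not handled in this sketch: if both elements of a pair are
   -32768 in both inputs, the 32-bit sum would overflow. */
void multiply_add_words(const int16_t a[4], const int16_t b[4], int32_t out[2]) {
    out[0] = (int32_t)a[0] * b[0] + (int32_t)a[1] * b[1];
    out[1] = (int32_t)a[2] * b[2] + (int32_t)a[3] * b[3];
}
```

Four multiplies and two adds happen per call, which is what makes this operation a natural building block for dot products.
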
19
Packed Parallel Compare
  • No new condition code flags
  • No existing IA condition code flags are affected
    by this instruction.
  • Result can be used as a mask to select elements
    from different inputs using a logical operation,
    eliminating branches.
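
As a sketch of the mask idea, the C code below emulates a signed greater-than compare on four packed 16-bit words and then uses the mask to pick elements with logical operations only. The names `cmpgt_words` and `select_lanes` are hypothetical, and a 64-bit integer stands in for an MMX register.

```c
#include <stdint.h>

/* Per 16-bit lane: all ones (0xFFFF) if a > b as signed values, else all
   zeros. No condition flags are involved; the result is ordinary data.
   The uint16_t to int16_t cast assumes two's complement conversion. */
uint64_t cmpgt_words(uint64_t a, uint64_t b) {
    uint64_t m = 0;
    for (int i = 0; i < 4; i++) {
        int16_t x = (int16_t)(uint16_t)(a >> (16 * i));
        int16_t y = (int16_t)(uint16_t)(b >> (16 * i));
        if (x > y)
            m |= (uint64_t)0xFFFF << (16 * i);
    }
    return m;
}

/* Branch-free select: take a's lane where the mask is ones, b's otherwise. */
uint64_t select_lanes(uint64_t mask, uint64_t a, uint64_t b) {
    return (a & mask) | (b & ~mask);
}
```

Calling `select_lanes(cmpgt_words(a, b), a, b)` yields the per-lane signed maximum of `a` and `b`; the loop in `cmpgt_words` only emulates what the hardware would do in one instruction.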

20
Packed Parallel Compare
21
Pack/Unpack
  • Important when an algorithm needs higher
    precision in its intermediate calculations, as in
    image filtering.
  • For example, image filtering involves a set of
    intermediate multiply operations between filter
    coefficients and a set of adjacent image pixels,
    accumulating all the values together.
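
A minimal C sketch of the widen-compute-narrow pattern is shown below. It assumes 8-bit pixels are zero-extended to 16-bit words for the intermediate math and narrowed back with unsigned saturation; the function names are hypothetical and the arrays stand in for MMX registers.

```c
#include <stdint.h>

/* Unpack: widen four 8-bit pixels to 16-bit words (zero-extended) so that
   intermediate products and sums have room to grow. */
void unpack_bytes_to_words(const uint8_t in[4], int16_t out[4]) {
    for (int i = 0; i < 4; i++)
        out[i] = (int16_t)in[i];
}

/* Pack: narrow four signed 16-bit words back to 8-bit pixels, clamping
   each value to the range 0..255 instead of truncating it. */
void pack_words_to_bytes_sat(const int16_t in[4], uint8_t out[4]) {
    for (int i = 0; i < 4; i++) {
        int16_t v = in[i];
        if (v < 0)
            v = 0;
        if (v > 255)
            v = 255;
        out[i] = (uint8_t)v;
    }
}
```

A filter would unpack the pixels, multiply and accumulate in the 16-bit domain, and pack the final result once at the end.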

22
Pack
23
Conditional Select
  • The Chroma Keying example demonstrates how
    conditional selection using the MMX instruction
    set removes branch mis-predictions, in addition
    to performing multiple selection operations in
    parallel. Text overlay on a pix/video background,
    and sprite overlays in games are some of the
    other operations that would benefit from this
    technique.

24
Chroma Keying
25
Chroma Keying (cont)
  • Take pixels from the picture with the woman on a
    green background.
  • A compare instruction builds a mask for that
    data. That mask is a sequence of bytes that are
    all ones or all zeros.
  • We now know which pixels are unwanted background
    and which ones we want to keep.

26
Create Mask
Assume pixels alternate green/not_green
27
Combine: AND NOT, AND, OR
28
Branch Removal
  • Without MMX technology, each pixel is processed
    separately and requires a conditional branch.
    Using MMX instructions, eight 8-bit pixels can be
    processed in parallel and no conditional branches
    are involved.
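
The chroma-key steps (compare to build a mask, then AND NOT, AND, OR) can be sketched in portable C. This is an emulation on 64-bit integers holding eight 8-bit pixels; `cmpeq_bytes` and `chroma_key` are hypothetical names, and representing the key color as a single byte value is an assumption made for the sketch.

```c
#include <stdint.h>

/* Per byte lane: 0xFF where a == b, else 0x00. The loop emulates a single
   packed compare; it is not part of the per-pixel selection logic. */
uint64_t cmpeq_bytes(uint64_t a, uint64_t b) {
    uint64_t m = 0;
    for (int i = 0; i < 8; i++) {
        uint8_t x = (uint8_t)(a >> (8 * i));
        uint8_t y = (uint8_t)(b >> (8 * i));
        if (x == y)
            m |= (uint64_t)0xFF << (8 * i);
    }
    return m;
}

/* Replace every foreground pixel equal to the key color with the
   corresponding background pixel, using logical operations only. */
uint64_t chroma_key(uint64_t fg, uint64_t bg, uint64_t key) {
    uint64_t mask = cmpeq_bytes(fg, key);   /* ones where fg is "green"  */
    uint64_t keep = ~mask & fg;             /* AND NOT: keep the subject */
    uint64_t fill = mask & bg;              /* AND: take new background  */
    return keep | fill;                     /* OR: merge the two         */
}
```

Eight pixels are resolved per call to `chroma_key` with no data-dependent branch in the selection itself.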

29
Vector Dot Product
  • The vector dot product is one of the most basic
    algorithms used in signal-processing of natural
    data such as images, audio, video and sound.
  • PMADD does 4 multiplies and 2 adds at a time.
    Coupled with PADD, eight multiply-accumulate
    operations can be performed with 2 PMADDs and
    2 PADDs.

30
Vector Dot Product
31
Vector Dot Product
32
Vector Dot Product
  • Assuming precision is sufficient, a dot product
    on an 8-element vector can be completed using 8
    MMX instructions: 2 PMADDs, 2 PADDs, 2 shifts
    (if needed to fix the precision after the
    multiply), and 2 loads for one of the vectors
    (the other vector is loaded by the PMADD
    instruction, which can have one of its operands
    come from memory).
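
The instruction sequence described above might map onto C roughly as follows. This is a scalar emulation, not MMX code; `multiply_add_words` and `dot8` are hypothetical names, the optional precision shifts are omitted, and the 32-bit accumulators are assumed not to overflow.

```c
#include <stdint.h>

/* Emulates one packed multiply-add: four 16-bit multiplies, two 32-bit sums. */
static void multiply_add_words(const int16_t a[4], const int16_t b[4],
                               int32_t out[2]) {
    out[0] = (int32_t)a[0] * b[0] + (int32_t)a[1] * b[1];
    out[1] = (int32_t)a[2] * b[2] + (int32_t)a[3] * b[3];
}

/* Dot product of two 8-element vectors of signed 16-bit values. */
int32_t dot8(const int16_t a[8], const int16_t b[8]) {
    int32_t lo[2], hi[2];
    multiply_add_words(a, b, lo);          /* first multiply-add: elements 0..3  */
    multiply_add_words(a + 4, b + 4, hi);  /* second multiply-add: elements 4..7 */
    int32_t acc0 = lo[0] + hi[0];          /* first packed add, lane 0           */
    int32_t acc1 = lo[1] + hi[1];          /* first packed add, lane 1           */
    return acc0 + acc1;                    /* second add folds the two lanes     */
}
```

In this sketch the two calls to `multiply_add_words` stand in for the 2 PMADDs, and the lane-wise sum followed by the final fold stands in for the 2 PADDs.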

33
Compare
34
Compare
  • With MMX technology, one third of the number of
    instructions is needed.
  • Most MMX instructions can be executed in one
    clock cycle, so the performance improvement will
    be more dramatic than the simple ratio of
    instruction counts.

35
Matrix Multiply
  • 3D game computations that manipulate 3D objects
    use 4-by-4 matrices that are multiplied with
    4-element vectors many times. Each vector holds
    the X, Y, Z and perspective-correction
    information for one pixel. The 4-by-4 matrix is
    used to rotate, scale, translate and update the
    perspective-correction information for each
    pixel.
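
A 4-by-4 matrix times 4-element vector product reduces to four short dot products, each of which fits the multiply-add pattern. The C sketch below assumes 16-bit fixed-point inputs and a row-major matrix stored as a flat array; `mat4_mul_vec4` is a hypothetical name and the code is a scalar emulation, not MMX code.

```c
#include <stdint.h>

/* out = m * v, where m is a row-major 4x4 matrix of signed 16-bit values
   and v is a 4-element vector. Each row is handled as one multiply-add
   (two pair sums) followed by one add of the two halves.
   Assumes the 32-bit results do not overflow. */
void mat4_mul_vec4(const int16_t m[16], const int16_t v[4], int32_t out[4]) {
    for (int r = 0; r < 4; r++) {
        const int16_t *row = m + 4 * r;
        int32_t left  = (int32_t)row[0] * v[0] + (int32_t)row[1] * v[1];
        int32_t right = (int32_t)row[2] * v[2] + (int32_t)row[3] * v[3];
        out[r] = left + right;
    }
}
```

For example, a matrix that scales by 2 and translates by (10, 20, 30) maps the vector (1, 2, 3, 1) to (12, 24, 36, 1).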

36
(No Transcript)
37
Compare
38
Matrix Multiply
  • MMX required half the instructions.

39
Image Dissolve Using Alpha Blending
  • Dissolve a swan into a flower:
    Result_pixel = Flower_pixel * (alpha/255)
                 + Swan_pixel * [1 - (alpha/255)]
  • Assume 640x480 resolution
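
A scalar C sketch of the dissolve for a single 8-bit pixel is given below. It reads the slide's formula as Result = Flower * (alpha/255) + Swan * (1 - alpha/255) and evaluates it in integer arithmetic with rounding; the function name `blend_pixel` and the rounding choice are assumptions.

```c
#include <stdint.h>

/* Blend one 8-bit flower pixel with one 8-bit swan pixel.
   alpha = 255 gives the flower, alpha = 0 gives the swan.
   The weighted sum is at most 255 * 255, so 32 bits are ample. */
uint8_t blend_pixel(uint8_t flower, uint8_t swan, uint8_t alpha) {
    uint32_t sum = (uint32_t)flower * alpha
                 + (uint32_t)swan * (255u - alpha);
    return (uint8_t)((sum + 127u) / 255u);   /* divide by 255, round to nearest */
}
```

At 640x480 this function would run 307,200 times per frame (per color plane, if there are several), which is the kind of per-pixel loop the packed instructions are meant to shorten.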

40
Dissolve: Millions of Instructions
41
Dissolve
  • 1 billion fewer instructions for the 640x480
    dissolve

42
(No Transcript)
43
Conclusion
  • MMX appeared in 1997 in Pentium processors (with
    a bigger cache).
  • According to Intel, an MMX microprocessor runs a
    multimedia application up to 60% faster. In
    addition, it runs other applications about 10%
    faster.