Fast Parallel Expectation Maximization for Gaussian Mixture Models on GPUs using CUDA

1
Fast Parallel Expectation Maximization for
Gaussian Mixture Models on GPUs using CUDA
2
Outline
  • Expectation maximization for GMM
  • CUDA on NVIDIA GPUs
  • CUDA implementation of EM for GMM
  • Performance

3
Expectation maximization for GMM
4
  • Let $X = \{x_1, \dots, x_N\}$ be the data of size N,
    supposedly drawn from a distribution with density
    function $p(x|\Theta)$ governed by some parameters $\Theta$
  • Assume the data vectors $x_i$ are i.i.d. (independent,
    identically distributed)
  • The resultant density is then
    $p(X|\Theta) = \prod_{i=1}^{N} p(x_i|\Theta) = \mathcal{L}(\Theta|X)$

5
likelihood function
  • $\mathcal{L}(\Theta|X) = p(X|\Theta)$ is called the likelihood function
    of the parameters $\Theta$ given the data $X$.
  • The ML estimates are those which maximize this
    likelihood function: $\Theta^{*} = \arg\max_{\Theta} \mathcal{L}(\Theta|X)$
  • This is the same as maximizing $\log \mathcal{L}(\Theta|X)$,
    the log-likelihood function

6
  • In the case of mixture models,
    the density is $p(x|\Theta) = \sum_{m=1}^{M} \alpha_m \, p_m(x|\theta_m)$,
    where $\sum_{m=1}^{M} \alpha_m = 1$
  • $\Theta = (\alpha_1, \dots, \alpha_M, \theta_1, \dots, \theta_M)$
    are the parameters to be estimated

7
  • If the component densities are d-dimensional
    Gaussian densities with mean $\mu_m$ and covariance matrix $\Sigma_m$,
    i.e., given by
    $p_m(x|\mu_m,\Sigma_m) = \dfrac{1}{(2\pi)^{d/2}\,|\Sigma_m|^{1/2}}
      \exp\!\left(-\tfrac{1}{2}(x-\mu_m)^{T}\Sigma_m^{-1}(x-\mu_m)\right)$,
  • the model is called a Gaussian Mixture Model (GMM)

8
Assumption
  • Assume a complete dataset $Z = (X, Y)$ exists,
    where X is the incomplete (observed) data and Y is unobserved
  • The joint density function is
    $p(z|\Theta) = p(x, y|\Theta) = p(y|x, \Theta)\, p(x|\Theta)$
  • The complete-data log-likelihood function is
    $\log \mathcal{L}(\Theta|Z) = \log \mathcal{L}(\Theta|X, Y) = \log p(X, Y|\Theta)$

9
E-step
  • The EM algorithm first finds the expectation of the
    complete log-likelihood function
    w.r.t. the unknown data Y, given the observed data X
    and the current parameter estimates $\Theta^{g}$
  • It is defined by
    $Q(\Theta, \Theta^{g}) = E\!\left[\log p(X, Y|\Theta) \mid X, \Theta^{g}\right]$
  • where $\Theta$ are the (new) parameters to be estimated.

10
M-step
  • The second step is to maximize the expectation
    computed in the E-step, i.e., to find
    $\Theta^{*} = \arg\max_{\Theta} Q(\Theta, \Theta^{g})$
  • These two steps are repeated as necessary.
  • The algorithm is guaranteed to converge to a local
    maximum of the likelihood function.

11
  • For a GMM, let X be the incomplete (observed) data and let
    $Y = \{y_i\}_{i=1}^{N}$ be the unobserved data whose values inform which
    component density generates each data item; then
    the complete-data log-likelihood function is
    $\log \mathcal{L}(\Theta|X, Y) = \sum_{i=1}^{N} \log\!\left(\alpha_{y_i}\, p_{y_i}(x_i|\theta_{y_i})\right)$
  • If $y_i = m$, it implies
    that the $i$-th sample $x_i$ is generated by the $m$-th
    mixture component.

12

  • where $y_i \in \{1, \dots, M\}$ for $1 \le i \le N$,
  • and the $y_i$ are independently drawn unobserved data

13
E-step for the case of GMM
  • The expected complete-data log-likelihood function is
    $Q(\Theta, \Theta^{g}) = \sum_{m=1}^{M}\sum_{i=1}^{N}
      \log(\alpha_m)\, p(m|x_i, \Theta^{g})
      + \sum_{m=1}^{M}\sum_{i=1}^{N}
      \log\!\left(p_m(x_i|\theta_m)\right) p(m|x_i, \Theta^{g})$
  • where the membership probability $p(m|x_i, \Theta^{g})$ is computed by
    Bayes' rule from the current parameter estimates $\Theta^{g}$:
    $p(m|x_i, \Theta^{g}) = \dfrac{\alpha_m^{g}\, p_m(x_i|\theta_m^{g})}
      {\sum_{k=1}^{M} \alpha_k^{g}\, p_k(x_i|\theta_k^{g})}$

14
M-step for the case of GMM
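For reference, a sketch of the standard M-step updates for a GMM with diagonal covariances (consistent with the (M x D) COVARIANCE layout used by the kernels later); these are the usual textbook expressions, not necessarily the authors' exact notation:

    $\alpha_m^{new} = \dfrac{1}{N}\sum_{i=1}^{N} p(m|x_i, \Theta^{g})$

    $\mu_m^{new} = \dfrac{\sum_{i=1}^{N} x_i\, p(m|x_i, \Theta^{g})}{\sum_{i=1}^{N} p(m|x_i, \Theta^{g})}$

    $\sigma_{m,n}^{2\,new} = \dfrac{\sum_{i=1}^{N} \left(x_{i,n} - \mu_{m,n}^{new}\right)^{2} p(m|x_i, \Theta^{g})}{\sum_{i=1}^{N} p(m|x_i, \Theta^{g})}$

Note that the denominator $\sum_{i=1}^{N} p(m|x_i, \Theta^{g})$ equals $N\alpha_m^{new}$, which is exactly the divisor used by the ESTIMATE-MEAN and COMPUTE-COVARIANCE kernels on the later slides.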
15
CUDA on NVIDIA GPUs
16
  • Each multiprocessor has a Single Instruction,
    Multiple Thread (SIMT) architecture.
  • Each active block is split into SIMT groups of
    threads called warps.
  • A thread scheduler periodically switches from one
    warp to another to maximize the use of the
    multiprocessor's computational resources.

17
(No Transcript)
18
(No Transcript)
19
CUDA implementation of EM for GMM
20
  • Given
  • data size N
  • mixture length M
  • dimension of the mixture components D
  • INPUT: observed data X in matrix form (N x D)
  • Parameters to be estimated
  • mixture coefficients A (1 x M)
  • MEAN in matrix form (M x D)
  • COVARIANCE in matrix form (M x D), i.e., diagonal covariances
  • One EM iteration is a sequential launch of 6 kernels
    (see the host-side sketch below)
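As a rough illustration, a hypothetical host-side sketch of one EM iteration as a sequential launch of the six kernels named on the following slides; the kernel names mirror the slides, but the signatures, buffer names, and launch geometries are assumptions, not the authors' code.

    // Sketch only: one EM iteration as six sequential kernel launches.
    __global__ void computeBasis(const float*, const float*, const float*, float*, int, int, int);
    __global__ void computeApriori(const float*, const float*, float*, int, int);
    __global__ void estimateAlpha(const float*, float*, int, int);
    __global__ void estimateMean(const float*, const float*, const float*, float*, int, int, int);
    __global__ void computeVariance(const float*, const float*, float*, int, int, int);
    __global__ void computeCovariance(const float*, const float*, const float*, float*, int, int, int);

    void emIteration(const float *d_X,              // INPUT (N x D)
                     float *d_B, float *d_P,        // basis values, membership probs (N x M)
                     float *d_A, float *d_MEAN,     // A (1 x M), MEAN (M x D)
                     float *d_V, float *d_COV,      // VARIANCE (N x MD), COVARIANCE (M x D)
                     int N, int M, int D, int B)    // B = thread-block size
    {
        computeBasis     <<<(N + B - 1) / B, B>>>(d_X, d_MEAN, d_COV, d_B, N, M, D);     // E-step
        computeApriori   <<<(N + B - 1) / B, B>>>(d_B, d_A, d_P, N, M);                  // p(m | x_i)
        estimateAlpha    <<<(M + B - 1) / B, B>>>(d_P, d_A, N, M);                       // M-step: alpha
        estimateMean     <<<(M + B - 1) / B, B>>>(d_P, d_X, d_A, d_MEAN, N, M, D);       // M-step: mean
        computeVariance  <<<(N * M * D + B - 1) / B, B>>>(d_X, d_MEAN, d_V, N, M, D);
        computeCovariance<<<(M + B - 1) / B, B>>>(d_P, d_V, d_A, d_COV, N, M, D);        // M-step: cov
        cudaDeviceSynchronize();                    // wait for the iteration to finish
    }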

21
  • COMPUTE-BASIS evaluates the component densities $p_m(x_i|\theta_m^{g})$
  • COMPUTE-APRIORI normalizes them into the membership probabilities
    $p(m|x_i, \Theta^{g})$ (see the sketch below)
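A minimal sketch of the per-point work these two kernels perform, collapsed into a single kernel here for brevity: thread i evaluates the M diagonal-covariance Gaussian densities at x_i and normalizes them into membership probabilities. The names, row-major layouts, and one-thread-per-point decomposition are assumptions.

    // Sketch only: combines COMPUTE-BASIS and COMPUTE-APRIORI into one kernel.
    __global__ void computeBasisAndApriori(const float *X,     // INPUT  (N x D)
                                           const float *MEAN,  // MEAN   (M x D)
                                           const float *COV,   // COVARIANCE (M x D), diagonal
                                           const float *ALPHA, // A      (1 x M)
                                           float *P,           // output (N x M), p(m | x_i)
                                           int N, int M, int D)
    {
        const float TWO_PI = 6.283185307f;
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= N) return;

        float sum = 0.0f;
        for (int m = 0; m < M; ++m) {
            float logp = 0.0f;                       // log p_m(x_i) for a diagonal Gaussian
            for (int n = 0; n < D; ++n) {
                float diff = X[i * D + n] - MEAN[m * D + n];
                float var  = COV[m * D + n];
                logp -= 0.5f * (diff * diff / var + logf(TWO_PI * var));
            }
            float w = ALPHA[m] * expf(logp);         // alpha_m * p_m(x_i): the "basis" value
            P[i * M + m] = w;
            sum += w;
        }
        for (int m = 0; m < M; ++m)
            P[i * M + m] /= sum;                     // normalize to p(m | x_i)
    }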

22
  • ESTIMATE-ALPHA
  • This kernel is implemented on a grid of M/B
    thread blocks, each consisting of B threads
    (see the sketch below).
  • ESTIMATE-MEAN
  • expresses the estimate of the MEAN matrix as a
    function of APRIORI (P) and INPUT (X), i.e., as the
    matrix product $P^{T}X$ of order (M x D)
  • Then dividing the $i$-th row of MEAN by $N\alpha_i$ gives the
    updated matrix
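A minimal sketch of ESTIMATE-ALPHA under the launch geometry stated above (M/B blocks of B threads, one thread per mixture component); the variable names and the row-major layout of P are assumptions.

    // Sketch only: thread m sums column m of the N x M membership matrix P
    // and divides by N to produce the updated mixture coefficient alpha_m.
    __global__ void estimateAlpha(const float *P,  // APRIORI (N x M)
                                  float *ALPHA,    // A (1 x M), output
                                  int N, int M)
    {
        int m = blockIdx.x * blockDim.x + threadIdx.x;
        if (m >= M) return;

        float s = 0.0f;
        for (int i = 0; i < N; ++i)
            s += P[i * M + m];      // sum_i p(m | x_i)
        ALPHA[m] = s / N;           // alpha_m^new = (1/N) * sum_i p(m | x_i)
    }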

23
  • COMPUTE-VARIANCE
  • Define a new matrix called VARIANCE (V), of order
    (N x MD), whose $(i, (m-1)D + n)$-th element,
    where $1 \le i \le N$, $1 \le m \le M$ and $1 \le n \le D$, is
    computed as $(x_{i,n} - \mu_{m,n})^{2}$
    (see the sketch below)
  • COMPUTE-COVARIANCE
  • Define another matrix of order (M x MD),
    computed as the product $P^{T}V$; taking from its $m$-th row the
    columns $(m-1)D + 1, \dots, mD$ gives the $m$-th row of the
  • COVARIANCE matrix of order (M x D)
  • Then dividing the $i$-th row of COVARIANCE by $N\alpha_i$ gives the
    updated matrix
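A minimal sketch of COMPUTE-VARIANCE with one thread per element of the (N x MD) VARIANCE matrix; the names, layouts, and per-element decomposition are assumptions.

    // Sketch only: element (i, (m-1)D + n) of V is (x_{i,n} - mu_{m,n})^2.
    __global__ void computeVariance(const float *X,    // INPUT (N x D)
                                    const float *MEAN, // MEAN  (M x D)
                                    float *V,          // VARIANCE (N x M*D), output
                                    int N, int M, int D)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx >= N * M * D) return;

        int i = idx / (M * D);      // data item
        int c = idx % (M * D);      // column in V, i.e. c = m*D + n (0-based)
        int m = c / D;              // mixture component
        int n = c % D;              // dimension

        float diff = X[i * D + n] - MEAN[m * D + n];
        V[i * M * D + c] = diff * diff;
    }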

24
Effect of coalesced memory access on performance
25
  • The data accessed by each thread is read first
    from global memory into the on-chip (shared)
    memory before any arithmetic is done.
  • The data is reordered in global memory in such a
    way that the words read by consecutive threads
    fall into consecutive address locations
    (see the sketch below).
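An illustrative sketch of this staging pattern: each block copies its tile of global data into shared memory with consecutive threads reading consecutive words (a coalesced access), synchronizes, and only then computes. The tile size and the trivial computation are assumptions for illustration only.

    // Sketch only: stage a coalesced tile into shared memory, then compute.
    // Launch with blockDim.x == TILE.
    #define TILE 256

    __global__ void stagedKernel(const float *g_in, float *g_out, int n)
    {
        __shared__ float tile[TILE];

        int gid = blockIdx.x * blockDim.x + threadIdx.x;
        if (gid < n)
            tile[threadIdx.x] = g_in[gid];   // coalesced: thread k reads word k of the tile
        __syncthreads();                     // the tile now resides in on-chip memory

        if (gid < n)
            g_out[gid] = tile[threadIdx.x] * 2.0f;   // arithmetic reads shared, not global, memory
    }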

26
(No Transcript)
27
Performance
28
  • Input data size N
  • mixture length M
  • dimension D
  • block size B = 64 and B = 256

29
  • As the size of the dataset grows, the speedup of the
    GPU over the CPU increases.
  • The 240-core GPU also improves over
    the 128-core GPU.

30
(No Transcript)
31
  • As the number of floating-point operations
    increases, the performance on 240 cores
    improves over that on 128 cores.

32
  • As the dataset grows, the GFLOPS achieved by
    the same number of cores keeps increasing.
  • The bends in the curves occur because too few
    threads are launched on the multiprocessors to hide
    the memory access latencies as the core count and
    the dataset size grow.

33
  • The time taken by the GPU drops drastically with
    coalesced memory accesses.