Fast Parallel Expectation Maximization for Gaussian Mixture Models on GPUs using CUDA

1
Fast Parallel Expectation Maximization for
Gaussian Mixture Models on GPUs using CUDA
2
Outline
  • Expectation maximization for GMM
  • CUDA on NVIDIA GPUs
  • CUDA implementation of EM for GMM
  • Performance

3
Expectation maximization for GMM
4
  • Let $X = \{x_1, \dots, x_N\}$ be the data of size N,
    supposedly drawn from a distribution with density
    function $p(x|\Theta)$ governed by some parameters $\Theta$
  • Assume the data vectors $x_i$ are i.i.d. (independent,
    identically distributed)
  • The resultant density is then
    $p(X|\Theta) = \prod_{i=1}^{N} p(x_i|\Theta) = \mathcal{L}(\Theta|X)$

5
likelihood function
  • $\mathcal{L}(\Theta|X) = p(X|\Theta)$ is called the likelihood function
    of the parameters $\Theta$ given the data $X$.
  • The ML estimates are those which maximize this
    likelihood function: $\Theta^{*} = \arg\max_{\Theta} \mathcal{L}(\Theta|X)$
  • This is the same as maximizing $\log \mathcal{L}(\Theta|X)$,
    the log-likelihood function

6
  • In the case of mixture models,
    the density is $p(x|\Theta) = \sum_{m=1}^{M} \alpha_m \, p_m(x|\theta_m)$,
    where $\sum_{m=1}^{M} \alpha_m = 1$
  • $\Theta = (\alpha_1, \dots, \alpha_M, \theta_1, \dots, \theta_M)$
    are the parameters to be estimated

7
  • If the component densities are d-dimensional
    Gaussian densities with mean $\mu_m$ and covariance matrix $\Sigma_m$,
    i.e., given by
    $p_m(x|\mu_m,\Sigma_m) = \dfrac{1}{(2\pi)^{d/2}\,|\Sigma_m|^{1/2}}
      \exp\!\left(-\tfrac{1}{2}(x-\mu_m)^{T}\Sigma_m^{-1}(x-\mu_m)\right)$,
  • the model is called a Gaussian Mixture Model (GMM)

8
Assumption
  • Assume a complete dataset $Z = (X, Y)$ exists,
    where X is the incomplete (observed) data and Y is unobserved
  • The joint density function is
    $p(z|\Theta) = p(x, y|\Theta) = p(y|x, \Theta)\, p(x|\Theta)$
  • The complete-data log-likelihood function is
    $\log \mathcal{L}(\Theta|Z) = \log \mathcal{L}(\Theta|X, Y) = \log p(X, Y|\Theta)$

9
E-step
  • The EM algorithm first finds the expectation of the
    complete log-likelihood function
    w.r.t. the unknown data Y, given the observed data X
    and the current parameter estimates $\Theta^{g}$
  • It is defined by
    $Q(\Theta, \Theta^{g}) = E\!\left[\log p(X, Y|\Theta) \mid X, \Theta^{g}\right]$
  • where $\Theta$ are the (new) parameters to be estimated.

10
M-step
  • The second step is to maximize the expectation
    computed in the E-step, i.e., to find
    $\Theta^{*} = \arg\max_{\Theta} Q(\Theta, \Theta^{g})$
  • These two steps are repeated as necessary.
  • The algorithm is guaranteed to converge to a local
    maximum of the likelihood function.

11
  • For a GMM, let X be the incomplete (observed) data and let
    $Y = \{y_i\}_{i=1}^{N}$ be the unobserved data whose values inform which
    component density generates each data item; then
    the complete-data log-likelihood function is
    $\log \mathcal{L}(\Theta|X, Y) = \sum_{i=1}^{N} \log\!\left(\alpha_{y_i}\, p_{y_i}(x_i|\theta_{y_i})\right)$
  • If $y_i = m$, it implies
    that the $i$-th sample $x_i$ is generated by the $m$-th
    mixture component.

12

  • where $y_i \in \{1, \dots, M\}$ for $1 \le i \le N$,
  • and the $y_i$ are independently drawn unobserved data

13
E-step for the case of GMM
  • The expected complete-data log-likelihood function is
    $Q(\Theta, \Theta^{g}) = \sum_{m=1}^{M}\sum_{i=1}^{N}
      \log(\alpha_m)\, p(m|x_i, \Theta^{g})
      + \sum_{m=1}^{M}\sum_{i=1}^{N}
      \log\!\left(p_m(x_i|\theta_m)\right) p(m|x_i, \Theta^{g})$
  • where the membership probability $p(m|x_i, \Theta^{g})$ is computed by
    Bayes' rule from the current parameter estimates $\Theta^{g}$:
    $p(m|x_i, \Theta^{g}) = \dfrac{\alpha_m^{g}\, p_m(x_i|\theta_m^{g})}
      {\sum_{k=1}^{M} \alpha_k^{g}\, p_k(x_i|\theta_k^{g})}$

14
M-step for the case of GMM
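For reference, a sketch of the standard M-step updates for a GMM with diagonal covariances (consistent with the (M x D) COVARIANCE layout used by the kernels later); these are the usual textbook expressions, not necessarily the authors' exact notation:

    $\alpha_m^{new} = \dfrac{1}{N}\sum_{i=1}^{N} p(m|x_i, \Theta^{g})$

    $\mu_m^{new} = \dfrac{\sum_{i=1}^{N} x_i\, p(m|x_i, \Theta^{g})}{\sum_{i=1}^{N} p(m|x_i, \Theta^{g})}$

    $\sigma_{m,n}^{2\,new} = \dfrac{\sum_{i=1}^{N} \left(x_{i,n} - \mu_{m,n}^{new}\right)^{2} p(m|x_i, \Theta^{g})}{\sum_{i=1}^{N} p(m|x_i, \Theta^{g})}$

Note that the denominator $\sum_{i=1}^{N} p(m|x_i, \Theta^{g})$ equals $N\alpha_m^{new}$, which is exactly the divisor used by the ESTIMATE-MEAN and COMPUTE-COVARIANCE kernels on the later slides.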
15
CUDA on NVIDIA GPUs
16
  • Each multiprocessor has a Single Instruction,
    Multiple Thread (SIMT) architecture.
  • Each active block is split into SIMT groups of
    threads called warps.
  • A thread scheduler periodically switches from one
    warp to another to maximize the use of the
    multiprocessor's computational resources.

17
(No Transcript)
18
(No Transcript)
19
CUDA implementation of EM for GMM
20
  • Given
  • data size N
  • mixture length M
  • dimension of the mixture components D
  • INPUT: observed data X in matrix form (N x D)
  • Parameters to be estimated
  • mixture coefficients A (1 x M)
  • MEAN in matrix form (M x D)
  • COVARIANCE in matrix form (M x D), i.e., diagonal covariances
  • One EM iteration is a sequential launch of 6 kernels
    (see the host-side sketch below)
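As a rough illustration, a hypothetical host-side sketch of one EM iteration as a sequential launch of the six kernels named on the following slides; the kernel names mirror the slides, but the signatures, buffer names, and launch geometries are assumptions, not the authors' code.

    // Sketch only: one EM iteration as six sequential kernel launches.
    __global__ void computeBasis(const float*, const float*, const float*, float*, int, int, int);
    __global__ void computeApriori(const float*, const float*, float*, int, int);
    __global__ void estimateAlpha(const float*, float*, int, int);
    __global__ void estimateMean(const float*, const float*, const float*, float*, int, int, int);
    __global__ void computeVariance(const float*, const float*, float*, int, int, int);
    __global__ void computeCovariance(const float*, const float*, const float*, float*, int, int, int);

    void emIteration(const float *d_X,              // INPUT (N x D)
                     float *d_B, float *d_P,        // basis values, membership probs (N x M)
                     float *d_A, float *d_MEAN,     // A (1 x M), MEAN (M x D)
                     float *d_V, float *d_COV,      // VARIANCE (N x MD), COVARIANCE (M x D)
                     int N, int M, int D, int B)    // B = thread-block size
    {
        computeBasis     <<<(N + B - 1) / B, B>>>(d_X, d_MEAN, d_COV, d_B, N, M, D);     // E-step
        computeApriori   <<<(N + B - 1) / B, B>>>(d_B, d_A, d_P, N, M);                  // p(m | x_i)
        estimateAlpha    <<<(M + B - 1) / B, B>>>(d_P, d_A, N, M);                       // M-step: alpha
        estimateMean     <<<(M + B - 1) / B, B>>>(d_P, d_X, d_A, d_MEAN, N, M, D);       // M-step: mean
        computeVariance  <<<(N * M * D + B - 1) / B, B>>>(d_X, d_MEAN, d_V, N, M, D);
        computeCovariance<<<(M + B - 1) / B, B>>>(d_P, d_V, d_A, d_COV, N, M, D);        // M-step: cov
        cudaDeviceSynchronize();                    // wait for the iteration to finish
    }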

21
  • COMPUTE-BASIS evaluates the component densities $p_m(x_i|\theta_m^{g})$
  • COMPUTE-APRIORI normalizes them into the membership probabilities
    $p(m|x_i, \Theta^{g})$ (see the sketch below)
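A minimal sketch of the per-point work these two kernels perform, collapsed into a single kernel here for brevity: thread i evaluates the M diagonal-covariance Gaussian densities at x_i and normalizes them into membership probabilities. The names, row-major layouts, and one-thread-per-point decomposition are assumptions.

    // Sketch only: combines COMPUTE-BASIS and COMPUTE-APRIORI into one kernel.
    __global__ void computeBasisAndApriori(const float *X,     // INPUT  (N x D)
                                           const float *MEAN,  // MEAN   (M x D)
                                           const float *COV,   // COVARIANCE (M x D), diagonal
                                           const float *ALPHA, // A      (1 x M)
                                           float *P,           // output (N x M), p(m | x_i)
                                           int N, int M, int D)
    {
        const float TWO_PI = 6.283185307f;
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= N) return;

        float sum = 0.0f;
        for (int m = 0; m < M; ++m) {
            float logp = 0.0f;                       // log p_m(x_i) for a diagonal Gaussian
            for (int n = 0; n < D; ++n) {
                float diff = X[i * D + n] - MEAN[m * D + n];
                float var  = COV[m * D + n];
                logp -= 0.5f * (diff * diff / var + logf(TWO_PI * var));
            }
            float w = ALPHA[m] * expf(logp);         // alpha_m * p_m(x_i): the "basis" value
            P[i * M + m] = w;
            sum += w;
        }
        for (int m = 0; m < M; ++m)
            P[i * M + m] /= sum;                     // normalize to p(m | x_i)
    }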

22
  • ESTIMATE-ALPHA
  • This kernel is implemented on a grid of M/B
    thread blocks, each consisting of B threads
    (see the sketch below).
  • ESTIMATE-MEAN
  • expresses the estimate of the MEAN matrix as a
    function of APRIORI (P) and INPUT (X), i.e., as the
    matrix product $P^{T}X$ of order (M x D)
  • Then dividing the $i$-th row of MEAN by $N\alpha_i$ gives the
    updated matrix
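A minimal sketch of ESTIMATE-ALPHA under the launch geometry stated above (M/B blocks of B threads, one thread per mixture component); the variable names and the row-major layout of P are assumptions.

    // Sketch only: thread m sums column m of the N x M membership matrix P
    // and divides by N to produce the updated mixture coefficient alpha_m.
    __global__ void estimateAlpha(const float *P,  // APRIORI (N x M)
                                  float *ALPHA,    // A (1 x M), output
                                  int N, int M)
    {
        int m = blockIdx.x * blockDim.x + threadIdx.x;
        if (m >= M) return;

        float s = 0.0f;
        for (int i = 0; i < N; ++i)
            s += P[i * M + m];      // sum_i p(m | x_i)
        ALPHA[m] = s / N;           // alpha_m^new = (1/N) * sum_i p(m | x_i)
    }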

23
  • COMPUTE-VARIANCE
  • Define a new matrix called VARIANCE (V), of order
    (N x MD), whose $(i, (m-1)D + n)$-th element,
    where $1 \le i \le N$, $1 \le m \le M$ and $1 \le n \le D$, is
    computed as $(x_{i,n} - \mu_{m,n})^{2}$
    (see the sketch below)
  • COMPUTE-COVARIANCE
  • Define another matrix of order (M x MD),
    computed as the product $P^{T}V$; taking from its $m$-th row the
    columns $(m-1)D + 1, \dots, mD$ gives the $m$-th row of the
  • COVARIANCE matrix of order (M x D)
  • Then dividing the $i$-th row of COVARIANCE by $N\alpha_i$ gives the
    updated matrix
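A minimal sketch of COMPUTE-VARIANCE with one thread per element of the (N x MD) VARIANCE matrix; the names, layouts, and per-element decomposition are assumptions.

    // Sketch only: element (i, (m-1)D + n) of V is (x_{i,n} - mu_{m,n})^2.
    __global__ void computeVariance(const float *X,    // INPUT (N x D)
                                    const float *MEAN, // MEAN  (M x D)
                                    float *V,          // VARIANCE (N x M*D), output
                                    int N, int M, int D)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx >= N * M * D) return;

        int i = idx / (M * D);      // data item
        int c = idx % (M * D);      // column in V, i.e. c = m*D + n (0-based)
        int m = c / D;              // mixture component
        int n = c % D;              // dimension

        float diff = X[i * D + n] - MEAN[m * D + n];
        V[i * M * D + c] = diff * diff;
    }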

24
Effect of coalesced memory access on performance
25
  • The data accessed by each thread is read first
    from global memory into the on-chip (shared)
    memory before any arithmetic is done.
  • The data is reordered in global memory in such a
    way that the words read by consecutive threads
    fall into consecutive address locations
    (see the sketch below).
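An illustrative sketch of this staging pattern: each block copies its tile of global data into shared memory with consecutive threads reading consecutive words (a coalesced access), synchronizes, and only then computes. The tile size and the trivial computation are assumptions for illustration only.

    // Sketch only: stage a coalesced tile into shared memory, then compute.
    // Launch with blockDim.x == TILE.
    #define TILE 256

    __global__ void stagedKernel(const float *g_in, float *g_out, int n)
    {
        __shared__ float tile[TILE];

        int gid = blockIdx.x * blockDim.x + threadIdx.x;
        if (gid < n)
            tile[threadIdx.x] = g_in[gid];   // coalesced: thread k reads word k of the tile
        __syncthreads();                     // the tile now resides in on-chip memory

        if (gid < n)
            g_out[gid] = tile[threadIdx.x] * 2.0f;   // arithmetic reads shared, not global, memory
    }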

26
(No Transcript)
27
Performance
28
  • Input data size N
  • mixture length M
  • dimension D
  • block size B = 64 and B = 256

29
  • As the size of the dataset grows, the speedup of the
    GPU over the CPU increases.
  • The 240-core GPU also improves over
    the 128-core GPU.

30
(No Transcript)
31
  • As the number of floating-point operations
    increases, the performance on 240 cores
    improves over that on 128 cores.

32
  • As the dataset grows, the GFLOPS achieved by
    the same number of cores keeps increasing.
  • The bends in the curves occur because too few
    threads are launched on the multiprocessors to hide
    the memory access latencies as the core count and
    the dataset size grow.

33
  • The time taken by the GPU drops drastically with
    coalesced memory accesses.