Title: Methods in Medical Image Analysis - Statistics of Pattern Recognition: Classification and Clustering
1. Methods in Medical Image Analysis - Statistics of Pattern Recognition: Classification and Clustering
- Some content provided by Milos Hauskrecht,
University of Pittsburgh Computer Science
2. ITK Questions?
3. Classification
6. Features
- Loosely stated, a feature is a value describing something about your data points (e.g., for pixels: intensity, local gradient, distance from a landmark, etc.)
- Multiple (n) features are put together to form a feature vector, which defines a data point's location in n-dimensional feature space (see the sketch below)
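A minimal sketch of how such per-pixel feature vectors might be assembled, assuming a 2-D image stored as a NumPy array; the particular features, the function name pixel_features, and the landmark coordinate are illustrative choices, not from the slides.

```python
import numpy as np

def pixel_features(image, landmark=(0, 0)):
    """Build one feature vector per pixel: intensity, local gradient
    magnitude, and distance from a (hypothetical) landmark."""
    gy, gx = np.gradient(image.astype(float))
    grad_mag = np.sqrt(gx ** 2 + gy ** 2)
    rows, cols = np.indices(image.shape)
    dist = np.sqrt((rows - landmark[0]) ** 2 + (cols - landmark[1]) ** 2)
    # Each row is one data point's location in 3-D feature space
    return np.stack([image.ravel(), grad_mag.ravel(), dist.ravel()], axis=1)

image = np.random.rand(64, 64)            # stand-in for a real image slice
X = pixel_features(image, landmark=(32, 32))
print(X.shape)                            # (4096, 3): 4096 points, 3 features
```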
7. Feature Space
- The theoretical n-dimensional space occupied by n input raster objects (features)
- Each feature represents one dimension, and its values represent positions along one of the orthogonal coordinate axes in feature space
- The set of feature values belonging to a data point defines a vector in feature space
8. Statistical Notation
- Class probability distribution:
- p(x, y) = p(x | y) p(y)
- x: feature vector (x1, x2, x3, ..., xn)
- y: class
- p(x | y): probability of x given y
- p(x, y): probability of both x and y
10. Example: Binary Classification
- Two class-conditional distributions (see the sketch below):
- p(x | y = 0), p(x | y = 1)
- Priors:
- p(y = 0) + p(y = 1) = 1
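A small sketch of this setup with one scalar feature and made-up parameters: two Gaussian class-conditional densities and priors that sum to 1, used to draw a labelled sample from the joint p(x, y) = p(x | y) p(y).

```python
import numpy as np

# Hypothetical class-conditional densities p(x | y) and priors p(y):
# p(x | y = 0) = N(-1, 1^2), p(x | y = 1) = N(2, 1.5^2), p(y = 0) + p(y = 1) = 1
params = {0: (-1.0, 1.0), 1: (2.0, 1.5)}   # (mean, std) per class
prior = {0: 0.6, 1: 0.4}

# Draw a labelled sample: pick y from the prior, then x from p(x | y)
rng = np.random.default_rng(0)
y = rng.choice([0, 1], size=1000, p=[prior[0], prior[1]])
x = np.array([rng.normal(*params[c]) for c in y])
```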
11. Modeling Class Densities
- In the text, they choose to concentrate on methods that use Gaussians to model class densities
13. Generative Approach to Classification
- Represent and learn the distribution p(x, y)
- Use it to define probabilistic discriminant functions, e.g.
- g0(x) = p(y = 0 | x)
- g1(x) = p(y = 1 | x)
14. Generative Approach to Classification
- Typical model (see the fitting sketch below):
- p(x, y) = p(x | y) p(y)
- p(x | y): class-conditional distributions (densities)
- p(y): priors of classes (probability of class y)
- We want:
- p(y | x): posteriors of classes
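A sketch of learning this model from labelled training data, assuming (as the following slides do) Gaussian class-conditional densities; the function name fit_generative_model and the dictionary layout are my own choices.

```python
import numpy as np

def fit_generative_model(X, y):
    """Estimate the pieces of p(x, y) = p(x | y) p(y) from labelled data,
    modelling each class-conditional density as a multivariate Gaussian."""
    model = {}
    for c in np.unique(y):
        Xc = X[y == c]
        model[c] = {
            "prior": len(Xc) / len(X),        # p(y = c)
            "mean": Xc.mean(axis=0),          # mu_c
            "cov": np.cov(Xc, rowvar=False),  # Sigma_c
        }
    return model
```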
15. Class Modeling
- We model the class distributions as multivariate Gaussians
- x ~ N(µ0, Σ0) for y = 0
- x ~ N(µ1, Σ1) for y = 1
- Priors are based on training data, or a distribution can be chosen that is expected to fit the data well (e.g. a Bernoulli distribution for a coin flip)
16. Making a class decision
- We need to define discriminant functions ( gn(x) )
- We have two basic choices:
- Likelihood of data: choose the class (Gaussian) that best explains the input data (x)
- Posterior of class: choose the class with the higher posterior probability
17. Calculating Posteriors
- Use Bayes' Rule (see the sketch below)
- In this case, p(y = 1 | x) = p(x | y = 1) p(y = 1) / [ p(x | y = 0) p(y = 0) + p(x | y = 1) p(y = 1) ]
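A sketch of the posterior computation with SciPy's multivariate normal density; it assumes a model dictionary shaped like the hypothetical fit_generative_model sketch above.

```python
from scipy.stats import multivariate_normal

def posteriors(x, model):
    """Bayes' rule: p(y = c | x) = p(x | y = c) p(y = c) / p(x)."""
    joint = {c: m["prior"] * multivariate_normal.pdf(x, m["mean"], m["cov"])
             for c, m in model.items()}
    evidence = sum(joint.values())            # p(x), the normalizer
    return {c: j / evidence for c, j in joint.items()}

def classify(x, model):
    post = posteriors(x, model)
    return max(post, key=post.get)            # class with the higher posterior
```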
18. Linear Decision Boundary
- When the covariances are the same (derivation sketched below)
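The slides give the result without the algebra; one way to see it (my reconstruction, not slide content) is that with a shared covariance Σ the quadratic terms of the two Gaussian log-densities cancel in the log posterior-odds:

```latex
\log\frac{p(y=1\mid x)}{p(y=0\mid x)}
  = -\tfrac{1}{2}(x-\mu_1)^{\top}\Sigma^{-1}(x-\mu_1)
    + \tfrac{1}{2}(x-\mu_0)^{\top}\Sigma^{-1}(x-\mu_0)
    + \log\frac{p(y=1)}{p(y=0)}
  = (\mu_1-\mu_0)^{\top}\Sigma^{-1}x + w_0
```

where w_0 collects the constant terms; setting the log-odds to zero therefore gives a hyperplane, i.e. a linear decision boundary.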
21. Quadratic Decision Boundary
- When the covariances are different (see the sketch below)
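Again a reconstruction rather than slide content: with different covariances the quadratic terms no longer cancel, so the log posterior-odds keeps a term quadratic in x and the decision boundary is a quadric:

```latex
\log\frac{p(y=1\mid x)}{p(y=0\mid x)}
  = -\tfrac{1}{2}\,x^{\top}\left(\Sigma_1^{-1}-\Sigma_0^{-1}\right)x
    + \left(\Sigma_1^{-1}\mu_1-\Sigma_0^{-1}\mu_0\right)^{\top}x + c
```

where c collects the constants, including the log prior ratio and (1/2) log(|Σ0| / |Σ1|).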
24. Clustering
- Basic clustering problem:
- Distribute data into k different groups such that data points similar to each other are in the same group
- Similarity between points is defined in terms of some distance metric
- Clustering is useful for:
- Similarity/dissimilarity analysis: analyze which data points in the sample are close to each other
- Dimensionality reduction: high-dimensional data are replaced with a group (cluster) label
27. Distance Metrics
- Euclidean distance, in some space (for our purposes, probably a feature space)
- Must fulfill three properties: d(x, y) >= 0 (with equality iff x = y), d(x, y) = d(y, x), and d(x, z) <= d(x, y) + d(y, z)
28. Distance Metrics
- Common simple metrics:
- Euclidean
- Manhattan
- Both work for an arbitrary k-dimensional space (see the sketch below)
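A minimal sketch of both metrics for arbitrary k-dimensional feature vectors, assuming NumPy; the example points are made up.

```python
import numpy as np

def euclidean(a, b):
    """Euclidean distance: sqrt of the sum of squared coordinate differences."""
    return np.sqrt(np.sum((np.asarray(a) - np.asarray(b)) ** 2))

def manhattan(a, b):
    """Manhattan distance: sum of absolute coordinate differences."""
    return np.sum(np.abs(np.asarray(a) - np.asarray(b)))

a, b = [1.0, 2.0, 3.0], [4.0, 0.0, 3.0]
print(euclidean(a, b), manhattan(a, b))   # ~3.606 and 5.0
```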
29. Clustering Algorithms
- k-Nearest Neighbor
- k-Means
- Parzen Windows
30. k-Nearest Neighbor
- In essence, a classifier
- Requires input parameter k
- In this algorithm, k indicates the number of neighboring points to take into account when classifying a data point
- Requires training data
31. k-Nearest Neighbor Algorithm
- For each data point xn, choose its class by finding the most common class among the k nearest data points in the training set (see the sketch below)
- Use any distance measure (usually a Euclidean distance measure)
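A sketch of the algorithm just described, assuming the training set is held in NumPy arrays and Euclidean distance is used; the function name knn_classify is mine.

```python
import numpy as np
from collections import Counter

def knn_classify(x, X_train, y_train, k=5):
    """Classify x by majority vote among its k nearest training points."""
    dists = np.linalg.norm(np.asarray(X_train) - np.asarray(x), axis=1)
    nearest = np.argsort(dists)[:k]             # indices of the k closest points
    votes = np.asarray(y_train)[nearest]
    return Counter(votes).most_common(1)[0][0]  # most common class among them
```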
32. k-Nearest Neighbor Algorithm
- (Figure) Query point q1 and training example e1: with 1-nearest neighbor, q1 takes the concept represented by e1; with 5 nearest neighbors, q1 is classified as negative
33. k-Nearest Neighbor
- Advantages
- Simple
- General (can work with any distance measure you want)
- Disadvantages
- Requires well-classified training data
- Can be sensitive to the k value chosen
- All attributes are used in classification, even ones that may be irrelevant
- Inductive bias: we assume that a data point should be classified the same as points near it
34. k-Means
- Suitable only when data points have continuous values
- Groups are defined in terms of cluster centers (means)
- Requires input parameter k
- In this algorithm, k indicates the number of clusters to be created
- Guaranteed to converge to at least a local optimum
35. k-Means Algorithm
- Algorithm (see the sketch below):
- Randomly initialize k mean values
- Repeat the next two steps until there is no change in the means:
- Partition the data using a similarity measure according to the current means
- Move the means to the center of the data in the current partition
- Stop when there is no change in the means
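A compact sketch of this loop, assuming Euclidean distance as the similarity measure and NumPy arrays; initializing the means from k random training points and the empty-cluster guard are my own choices.

```python
import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    """Plain k-means: alternate partitioning and mean updates until the means stop moving."""
    rng = np.random.default_rng(seed)
    means = X[rng.choice(len(X), size=k, replace=False)]   # random initialization
    for _ in range(max_iter):
        # Partition: assign each point to its nearest current mean
        labels = np.argmin(np.linalg.norm(X[:, None] - means[None], axis=2), axis=1)
        # Update: move each mean to the center of its partition
        new_means = np.array([X[labels == c].mean(axis=0) if np.any(labels == c)
                              else means[c] for c in range(k)])
        if np.allclose(new_means, means):                   # stop when no change
            break
        means = new_means
    return means, labels
```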
37. k-Means
- Advantages
- Simple
- General (can work with any distance measure you want)
- Requires no training phase
- Disadvantages
- Result is very sensitive to initial mean placement
- Can perform poorly on overlapping regions
- Doesn't work on features with non-continuous values (can't compute cluster means)
- Inductive bias: we assume that a data point should be classified the same as points near it
38. Parzen Windows
- Similar to k-Nearest Neighbor, but instead of using the k closest training data points, it uses all points within a kernel (window), weighting their contribution to the classification based on the kernel
- As with our classification algorithms, we will consider a Gaussian kernel as the window
39. Parzen Windows
- Assume a region defined by a d-dimensional Gaussian of scale s
- We can define a window density function (see the sketch below)
- Note that we consider all points in the training set, but if a point is far outside the kernel, its weight will be effectively 0, negating its influence
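A sketch of a window density function of this kind, assuming an isotropic d-dimensional Gaussian kernel of scale sigma; the small classification helper parallels the k-NN discussion and is my own addition.

```python
import numpy as np

def parzen_density(x, X_train, sigma=1.0):
    """Window density at x: average of a d-dimensional Gaussian kernel of
    scale sigma evaluated at every training point."""
    X_train = np.atleast_2d(X_train)
    d = X_train.shape[1]
    sq_dists = np.sum((X_train - np.asarray(x)) ** 2, axis=1)
    norm = (2.0 * np.pi * sigma ** 2) ** (d / 2.0)
    return np.mean(np.exp(-sq_dists / (2.0 * sigma ** 2)) / norm)

def parzen_classify(x, X_train, y_train, sigma=1.0):
    """Pick the class whose windowed density, weighted by its prior, is largest."""
    X_train, y_train = np.asarray(X_train), np.asarray(y_train)
    scores = {c: np.mean(y_train == c) * parzen_density(x, X_train[y_train == c], sigma)
              for c in np.unique(y_train)}
    return max(scores, key=scores.get)
```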
41. Parzen Windows
- Advantages
- More robust than k-nearest neighbor
- Excellent accuracy and consistency
- Disadvantages
- How to choose the size of the window?
- Alone, kernel density estimation techniques
provide little insight into data or problems