Transcript and Presenter's Notes

Title: DATA MINING - from data to information


1
DATA MINING
from data to information
Ronald Westra
Dept. of Mathematics & Knowledge Engineering
Maastricht University
2
PART 1: Introduction
3
All information on the math part of the course is at:
http://www.math.unimaas.nl/personal/ronaldw/DAM/DataMiningPage.htm
4-5
(No Transcript)
6
Data mining - a definition
"Data mining is the process of exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns and rules."
(Berry & Linoff, 1997, 2000)
7-8
DATA MINING
9-35
(No Transcript)
36
DATA AS SETS OF MEASUREMENTS AND OBSERVATIONS
  • Data Mining Lecture II
  • Chapter 2 from Principles of Data Mining by Hand, Mannila, Smyth

37-64
(No Transcript)
65
VISUALISING AND EXPLORING DATA-SPACE
  • Data Mining Lecture III
  • Chapter 2 from Principles of Data Mining by Hand, Mannila, Smyth

66-80
(No Transcript)
81
(Figure: a two-dimensional data cloud with its MEAN marked)
82
(Figure: the same data cloud with the MEAN and principal axes 1 and 2 overlaid)
83-87
(No Transcript)
88
Principal component analysis (PCA) is a technique that is useful for the compression and classification of data. The purpose is to reduce the dimensionality of a data set (sample) by finding a new set of variables, smaller than the original set of variables, that nonetheless retains most of the sample's information. By information we mean the variation present in the sample, given by the correlations between the original variables. The new variables, called principal components (PCs), are uncorrelated, and are ordered by the fraction of the total information each retains.
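For readers who want to experiment, here is a minimal sketch of this idea in Python with NumPy (the data and all names are illustrative, not from the lecture): diagonalize the covariance matrix of a centred sample; the eigenvectors define the uncorrelated PCs, ordered by the fraction of variance each retains.

```python
# Minimal PCA sketch: eigendecomposition of the sample covariance matrix.
import numpy as np

rng = np.random.default_rng(0)
X = rng.multivariate_normal(mean=[0.0, 0.0],
                            cov=[[3.0, 1.2], [1.2, 1.0]],
                            size=500)              # n = 500 observations, p = 2 variables

Xc = X - X.mean(axis=0)                            # centre each variable on its sample mean
S = Xc.T @ Xc / (len(Xc) - 1)                      # sample covariance matrix

eigvals, eigvecs = np.linalg.eigh(S)               # eigh: S is symmetric
order = np.argsort(eigvals)[::-1]                  # order PCs by decreasing variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

Z = Xc @ eigvecs                                   # PC scores: new, uncorrelated variables
print("fraction of total variance per PC:", eigvals / eigvals.sum())
```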
89
overview
  • geometric picture of PCs
  • algebraic definition and derivation of PCs
  • usage of PCA
  • astronomical application

90
Geometric picture of principal components (PCs)
A sample of n observations in the 2-D space $\mathbf{x} = (x_1, x_2)$.
Goal: to account for the variation in a sample in as few variables as possible, to some accuracy.
91
Geometric picture of principal components (PCs)
  • the 1st PC is a minimum-distance fit to a line in the data space
  • the 2nd PC is a minimum-distance fit to a line in the plane perpendicular to the 1st PC

PCs are a series of linear least-squares fits to a sample, each orthogonal to all the previous ones.
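A quick numerical check of this geometric picture (a sketch, assuming NumPy; the sample is synthetic): among all lines through the sample mean, the first PC direction minimizes the summed squared perpendicular distances.

```python
# The 1st PC as a minimum-distance (orthogonal least squares) fit.
import numpy as np

rng = np.random.default_rng(1)
Xc = rng.multivariate_normal([0, 0], [[3.0, 1.0], [1.0, 1.0]], size=400)
Xc -= Xc.mean(axis=0)                              # work about the sample mean

def perp_ss(d):
    """Sum of squared perpendicular distances to the line spanned by d."""
    d = d / np.linalg.norm(d)
    along = np.outer(Xc @ d, d)                    # component of each point along the line
    return np.sum((Xc - along) ** 2)

S = Xc.T @ Xc / (len(Xc) - 1)
pc1 = np.linalg.eigh(S)[1][:, -1]                  # eigenvector of the largest eigenvalue

# No direction on a fine angular grid beats the 1st PC.
grid = (np.array([np.cos(t), np.sin(t)]) for t in np.linspace(0, np.pi, 361))
assert all(perp_ss(pc1) <= perp_ss(d) + 1e-9 for d in grid)
```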
92
Algebraic definition of PCs
Given a sample of n observations on a vector of p variables $\mathbf{x} = (x_1, x_2, \ldots, x_p)^{\mathsf T}$, define the first principal component of the sample by the linear transformation
$$z_1 = \mathbf{a}_1^{\mathsf T}\mathbf{x} = \sum_{i=1}^{p} a_{i1}\,x_i ,$$
where the vector $\mathbf{a}_1 = (a_{11}, a_{21}, \ldots, a_{p1})^{\mathsf T}$ is chosen such that $\mathrm{var}[z_1]$ is maximal.
93
Algebraic definition of PCs
Likewise, define the kth PC of the sample by the linear transformation
$$z_k = \mathbf{a}_k^{\mathsf T}\mathbf{x}, \qquad k = 1, \ldots, p ,$$
where the vector $\mathbf{a}_k = (a_{1k}, a_{2k}, \ldots, a_{pk})^{\mathsf T}$ is chosen such that $\mathrm{var}[z_k]$ is maximal,
subject to $\mathrm{cov}[z_k, z_l] = 0$ for $k > l \ge 1$,
and to $\mathbf{a}_k^{\mathsf T}\mathbf{a}_k = 1$.
94
Algebraic derivation of coefficient vectors
To find $\mathbf{a}_1$, first note that
$$\mathrm{var}[z_1] = \mathbf{a}_1^{\mathsf T}\mathbf{S}\,\mathbf{a}_1 ,$$
where $\mathbf{S}$ is the covariance matrix for the p variables $\mathbf{x}$.
95
Algebraic derivation of coefficient vectors
To find $\mathbf{a}_1$: maximize $\mathrm{var}[z_1] = \mathbf{a}_1^{\mathsf T}\mathbf{S}\,\mathbf{a}_1$
subject to $\mathbf{a}_1^{\mathsf T}\mathbf{a}_1 = 1$.
Let $\lambda$ be a Lagrange multiplier, and maximize
$$\mathbf{a}_1^{\mathsf T}\mathbf{S}\,\mathbf{a}_1 - \lambda\,(\mathbf{a}_1^{\mathsf T}\mathbf{a}_1 - 1)$$
by differentiating with respect to $\mathbf{a}_1$ and setting the result to zero:
$$\mathbf{S}\,\mathbf{a}_1 - \lambda\,\mathbf{a}_1 = 0 ;$$
therefore $\mathbf{a}_1$ is an eigenvector of $\mathbf{S}$, corresponding to eigenvalue $\lambda$.
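A small numerical confirmation of this result (a sketch, assuming NumPy; the data are synthetic): the leading eigenvector of $\mathbf{S}$ attains the largest projected variance among unit vectors, and no random direction does better.

```python
# Check: var[z1] = a1' S a1 equals the largest eigenvalue of S, and is maximal.
import numpy as np

rng = np.random.default_rng(2)
Xc = rng.multivariate_normal([0, 0, 0],
                             [[4.0, 1.0, 0.0],
                              [1.0, 2.0, 0.5],
                              [0.0, 0.5, 1.0]], size=2000)
Xc -= Xc.mean(axis=0)
S = Xc.T @ Xc / (len(Xc) - 1)

lam, A = np.linalg.eigh(S)                         # eigenvalues in ascending order
a1 = A[:, -1]                                      # eigenvector of the largest eigenvalue

assert np.isclose(a1 @ S @ a1, lam[-1])            # var[z1] = lambda_1

for _ in range(10_000):                            # random unit vectors never exceed it
    a = rng.standard_normal(3)
    a /= np.linalg.norm(a)
    assert a @ S @ a <= lam[-1] + 1e-12
```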
96
Algebraic derivation of $\mathbf{a}_1$
We have maximized $\mathrm{var}[z_1] = \mathbf{a}_1^{\mathsf T}\mathbf{S}\,\mathbf{a}_1 = \lambda$.
So $\lambda = \lambda_1$ is the largest eigenvalue of $\mathbf{S}$.
The first PC $z_1$ retains the greatest amount of variation in the sample.
97
Algebraic derivation of coefficient vectors
To find the next coefficient vector $\mathbf{a}_2$: maximize $\mathrm{var}[z_2] = \mathbf{a}_2^{\mathsf T}\mathbf{S}\,\mathbf{a}_2$
subject to $\mathbf{a}_2^{\mathsf T}\mathbf{a}_2 = 1$ and to $\mathrm{cov}[z_2, z_1] = 0$.
First note that $\mathrm{cov}[z_2, z_1] = \mathbf{a}_1^{\mathsf T}\mathbf{S}\,\mathbf{a}_2 = \lambda_1\,\mathbf{a}_1^{\mathsf T}\mathbf{a}_2$;
then let $\lambda$ and $\phi$ be Lagrange multipliers, and maximize
$$\mathbf{a}_2^{\mathsf T}\mathbf{S}\,\mathbf{a}_2 - \lambda\,(\mathbf{a}_2^{\mathsf T}\mathbf{a}_2 - 1) - \phi\,\mathbf{a}_2^{\mathsf T}\mathbf{a}_1 .$$
98
Algebraic derivation of coefficient vectors
We find that $\mathbf{a}_2$ is also an eigenvector of $\mathbf{S}$, whose eigenvalue $\lambda = \lambda_2$ is the second largest.
In general:
  • The kth largest eigenvalue $\lambda_k$ of $\mathbf{S}$ is the variance of the kth PC.
  • The kth PC $z_k$ retains the kth greatest fraction of the variation in the sample.
99
Algebraic formulation of PCA
Given a sample of n observations on a vector of p variables $\mathbf{x}$, define a vector of p PCs
$$\mathbf{z} = (z_1, \ldots, z_p)^{\mathsf T}$$
according to
$$\mathbf{z} = \mathbf{A}^{\mathsf T}\mathbf{x} ,$$
where $\mathbf{A}$ is an orthogonal p x p matrix whose kth column $\mathbf{a}_k$ is the kth eigenvector of $\mathbf{S}$.
Then
$$\mathbf{\Lambda} = \mathbf{A}^{\mathsf T}\mathbf{S}\,\mathbf{A}$$
is the covariance matrix of the PCs, being diagonal with elements $\lambda_k$.
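This formulation takes one line of linear algebra to verify numerically (a sketch, assuming NumPy; the sample is synthetic): with the eigenvectors of $\mathbf{S}$ as the columns of $\mathbf{A}$, the matrix $\mathbf{A}^{\mathsf T}\mathbf{S}\,\mathbf{A}$ comes out diagonal, carrying the PC variances.

```python
# Check: Lambda = A' S A is diagonal with the eigenvalues of S on its diagonal.
import numpy as np

rng = np.random.default_rng(3)
Xc = rng.standard_normal((300, 4)) @ rng.standard_normal((4, 4))  # correlated sample
Xc -= Xc.mean(axis=0)
S = Xc.T @ Xc / (len(Xc) - 1)

lam, A = np.linalg.eigh(S)                         # columns of A are eigenvectors of S
Lam = A.T @ S @ A                                  # covariance matrix of the PCs
assert np.allclose(Lam, np.diag(lam), atol=1e-10)
```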
100
usage of PCA: Probability distribution for sample PCs
If (i) the n observations of $\mathbf{x}$ in the sample are independent, and
(ii) $\mathbf{x}$ is drawn from an underlying population that follows a p-variate normal (Gaussian) distribution with known covariance matrix $\mathbf{\Sigma}$,
then the sample covariance matrix follows a Wishart distribution, $(n-1)\,\mathbf{S} \sim W_p(\mathbf{\Sigma}, n-1)$;
else utilize a bootstrap approximation, as sketched below.
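The bootstrap alternative can be sketched in a few lines (assuming NumPy; the sample, the interval level, and all names are illustrative): resample the observations with replacement and read off the empirical spread of the eigenvalue of interest.

```python
# Bootstrap approximation to the sampling distribution of the leading eigenvalue.
import numpy as np

def leading_eigenvalue(X):
    Xc = X - X.mean(axis=0)
    S = Xc.T @ Xc / (len(Xc) - 1)
    return np.linalg.eigvalsh(S)[-1]

rng = np.random.default_rng(4)
X = rng.multivariate_normal([0, 0], [[3.0, 1.0], [1.0, 2.0]], size=200)

boot = np.array([leading_eigenvalue(X[rng.integers(0, len(X), len(X))])
                 for _ in range(2000)])            # resample rows with replacement
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"lambda_1 = {leading_eigenvalue(X):.3f}, bootstrap 95% interval ({lo:.3f}, {hi:.3f})")
```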
101
usage of PCA: Probability distribution for sample PCs
If (i) $\mathbf{S}$ follows a Wishart distribution, and
(ii) the population eigenvalues $\tilde{\lambda}_k$ are all distinct,
then the following results hold as $n \to \infty$:
  • all the $\lambda_k$ are independent of all the $\mathbf{a}_k$
  • the $\lambda_k$ and the $\mathbf{a}_k$ are jointly normally distributed
(a tilde denotes a population quantity)
102
usage of PCA: Probability distribution for sample PCs
and the asymptotic moments are (standard results; see Jolliffe 2002)
$$E[\lambda_k] = \tilde{\lambda}_k , \qquad \mathrm{var}[\lambda_k] = \frac{2\,\tilde{\lambda}_k^2}{n} ,$$
$$E[\mathbf{a}_k] = \tilde{\mathbf{a}}_k , \qquad \mathrm{var}[\mathbf{a}_k] = \frac{1}{n} \sum_{l \neq k} \frac{\tilde{\lambda}_l\,\tilde{\lambda}_k}{(\tilde{\lambda}_l - \tilde{\lambda}_k)^2}\, \tilde{\mathbf{a}}_l\,\tilde{\mathbf{a}}_l^{\mathsf T}$$
(a tilde denotes a population quantity).
103
usage of PCA: Inference about population PCs
If $\mathbf{x}$ follows a p-variate normal distribution,
then analytic expressions exist for
  MLEs of $\tilde{\lambda}_k$, $\tilde{\mathbf{a}}_k$, and $\tilde{\mathbf{\Sigma}}$,
  confidence intervals for $\tilde{\lambda}_k$ and $\tilde{\mathbf{a}}_k$, and
  hypothesis testing for $\tilde{\lambda}_k$ and $\tilde{\mathbf{a}}_k$;
else bootstrap and jackknife approximations exist
(see references, esp. Jolliffe).
104
usage of PCA: Practical computation of PCs
In general it is useful to define standardized variables by
$$x_k^{*} = \frac{x_k}{\sqrt{s_{kk}}} .$$
If the $x_k$ are each measured about their sample mean,
then the covariance matrix of $\mathbf{x}^{*} = (x_1^{*}, \ldots, x_p^{*})^{\mathsf T}$
will be equal to the correlation matrix of $\mathbf{x}$,
and the PCs $z_k^{*}$ will be dimensionless.

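In code, the standardization amounts to one division per variable (a sketch, assuming NumPy; the sample is synthetic, with deliberately mismatched scales):

```python
# Standardized variables: covariance of x* equals the correlation matrix of x.
import numpy as np

rng = np.random.default_rng(5)
X = rng.multivariate_normal([0, 0], [[100.0, 9.0], [9.0, 1.0]], size=500)
Xc = X - X.mean(axis=0)                            # measure about the sample mean

Xstar = Xc / Xc.std(axis=0, ddof=1)                # x*_k = x_k / sqrt(s_kk)
S_star = Xstar.T @ Xstar / (len(Xstar) - 1)

assert np.allclose(S_star, np.corrcoef(X, rowvar=False))
```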
105
usage of PCA: Practical computation of PCs
Given a sample of n observations on a vector of p variables $\mathbf{x}$ (each measured about its sample mean),
compute the covariance matrix
$$\mathbf{S} = \frac{1}{n-1}\,\mathbf{X}^{\mathsf T}\mathbf{X} ,$$
where $\mathbf{X}$ is the n x p matrix whose ith row is the ith observation $\mathbf{x}_i^{\mathsf T}$.
Then compute the n x p matrix
$$\mathbf{Z} = \mathbf{X}\,\mathbf{A} ,$$
whose ith row $\mathbf{z}_i^{\mathsf T}$ is the PC score vector for the ith observation.
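The same computation in code (a sketch, assuming NumPy; the 1/(n-1) normalization of $\mathbf{S}$ follows the convention used above):

```python
# Covariance matrix S from the centred data matrix X, then the PC scores Z = X A.
import numpy as np

rng = np.random.default_rng(6)
X = rng.standard_normal((100, 3)) @ np.diag([3.0, 1.0, 0.3])  # n = 100, p = 3
X -= X.mean(axis=0)                                # each variable about its sample mean

S = X.T @ X / (len(X) - 1)                         # p x p covariance matrix
lam, A = np.linalg.eigh(S)
lam, A = lam[::-1], A[:, ::-1]                     # order PCs by decreasing variance

Z = X @ A                                          # n x p matrix; row i is the PC score z_i'
assert np.allclose(Z.T @ Z / (len(X) - 1), np.diag(lam), atol=1e-10)
```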
106
usage of PCA: Practical computation of PCs
Write
$$\mathbf{x}_i = \mathbf{A}\,\mathbf{z}_i = \sum_{k=1}^{p} z_{ik}\,\mathbf{a}_k$$
to decompose each observation into PCs.
107
usage of PCA: Data compression
Because the kth PC retains the kth greatest fraction of the variation,
we can approximate each observation by truncating the sum at the first m < p PCs:
$$\mathbf{x}_i \approx \sum_{k=1}^{m} z_{ik}\,\mathbf{a}_k .$$
108
usage of PCA: Data compression
Reduce the dimensionality of the data from p to m < p by approximating
$$\mathbf{X} \approx \mathbf{Z}_m\,\mathbf{A}_m^{\mathsf T} ,$$
where $\mathbf{Z}_m$ is the n x m portion of $\mathbf{Z}$ and $\mathbf{A}_m$ is the p x m portion of $\mathbf{A}$.
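A compression sketch tying the last two slides together (assuming NumPy; m and the sample are illustrative): truncate to the first m PCs, rebuild a rank-m approximation of $\mathbf{X}$, and compare the retained variance with the residual.

```python
# Rank-m reconstruction X ~ Z_m A_m' and its reconstruction error.
import numpy as np

rng = np.random.default_rng(7)
X = rng.standard_normal((200, 5)) @ rng.standard_normal((5, 5))
X -= X.mean(axis=0)

S = X.T @ X / (len(X) - 1)
lam, A = np.linalg.eigh(S)
lam, A = lam[::-1], A[:, ::-1]                     # PCs in decreasing-variance order

m = 2
Zm, Am = (X @ A)[:, :m], A[:, :m]                  # n x m scores, p x m eigenvectors
X_approx = Zm @ Am.T                               # rank-m approximation of X

print(f"variance retained by {m} PCs: {lam[:m].sum() / lam.sum():.1%}")
print("mean squared residual:", np.mean((X - X_approx) ** 2))
```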
109
astronomical application: PCs for elliptical galaxies
Rotating to the principal component in $B_T$-$\Sigma$ space improves the Faber-Jackson relation as a distance indicator
(Dressler et al. 1987).
110
astronomical application: Eigenspectra (KL transform)
(Connolly et al. 1995)
111
references
Connolly, A.J., Szalay, A.S., et al., "Spectral Classification of Galaxies: An Orthogonal Approach", AJ, 110, 1071-1082, 1995.
Dressler, A., et al., "Spectroscopy and Photometry of Elliptical Galaxies. I. A New Distance Estimator", ApJ, 313, 42-58, 1987.
Efstathiou, G., and Fall, S.M., "Multivariate analysis of elliptical galaxies", MNRAS, 206, 453-464, 1984.
Johnston, D.E., et al., "SDSS J090334.92+502819.2: A New Gravitational Lens", AJ, 126, 2281-2290, 2003.
Jolliffe, I.T., 2002, Principal Component Analysis (Springer-Verlag New York, Secaucus, NJ).
Lupton, R., 1993, Statistics in Theory and Practice (Princeton University Press, Princeton, NJ).
Murtagh, F., and Heck, A., Multivariate Data Analysis (D. Reidel Publishing Company, Dordrecht, Holland).
Yip, C.W., Szalay, A.S., et al., "Distributions of Galaxy Spectral Types in the SDSS", AJ, 128, 585-609, 2004.
112-116
(No Transcript)
117
(Figure: four panels labelled 1 pc, 2 pc, 3 pc, 4 pc)
118-134
(No Transcript)