Description of Multivariate Data - PowerPoint PPT Presentation

Provided by: laverty
1
Description of Multivariate Data
2
Multivariate Analysis
The analysis of many variables
3
Multivariate Analysis: The analysis of many
variables
  • More precisely, and also more traditionally, this
    term stands for the study of a random sample of n
    objects (units or cases) such that on each
    object we measure p variables or characteristics.
  • So for each object there is a vector of
    measurements, giving n vectors

$x_1, x_2, \ldots, x_n$

each with p components.
4
  • The variables will be correlated as they are
    measured on the same object.

A common practice is to treat each variable
separately by applying methods of univariate
analysis.
This may lead to an incorrect or inadequate
analysis, because it ignores the correlations
between the variables.
5
  • The challenge of multivariate analysis is to
    untangle the overlapping information provided by
    a set of correlated variables and to reveal the
    underlying structure.

6
  • This is done by a variety of methods,
  • some of which are generalizations of univariate
    methods
  • and some of which are purely multivariate, with
    no univariate counterparts.

7
  • The purpose of this course is
  • to describe and perhaps justify these methods,
    and also
  • provide some guidance about how to select an
    appropriate method for a given multivariate data
    set.

8
Example
Randomly select n = 5 students as objects and for
each student measure
  • x1 = age (in years) at entry to university,
  • x2 = mark out of 100 in an exam at the end of the
    first year,
  • x3 = sex (0 = female, 1 = male).
9
The result may look something like this:

Object   x1     x2   x3
1        18.5   91   0
2        18.0   73   1
3        18.9   64   1
4        18.5   71   0
5        18.4   85   1

  • It is of interest to note that the variables in
    the example are not of the same type
  • x1 is a continuous variable,
  • x2 is a discrete variable and
  • x3 is a binary variable
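As a rough illustration, the variable types in the example could be checked mechanically. The sketch below assumes NumPy and uses a hypothetical helper, `variable_type`, with a simple heuristic (binary if only 0 and 1 occur, discrete if all values are whole numbers, continuous otherwise). With only five observations such a heuristic is easily fooled, so it is not a substitute for knowing how each variable was measured.

```python
import numpy as np

# Example data: rows are students, columns are x1 (age), x2 (mark), x3 (sex).
X = np.array([
    [18.5, 91, 0],
    [18.0, 73, 1],
    [18.9, 64, 1],
    [18.5, 71, 0],
    [18.4, 85, 1],
])

def variable_type(column):
    """Hypothetical heuristic: label a column binary, discrete or continuous."""
    values = set(column.tolist())
    if values <= {0.0, 1.0}:
        return "binary"
    if all(float(v).is_integer() for v in values):
        return "discrete"
    return "continuous"

types = [variable_type(X[:, j]) for j in range(X.shape[1])]
print(types)
```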

10
The Data Matrix
11
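We can write

$$X = \begin{pmatrix} x_1' \\ x_2' \\ \vdots \\ x_n' \end{pmatrix}$$

where $x_i' = (x_{i1}, x_{i2}, \ldots, x_{ip})$ is
the ith row of X.
12
We can also write

$$X = \left( x_{(1)}, x_{(2)}, \ldots, x_{(p)} \right)$$

where $x_{(j)} = (x_{1j}, x_{2j}, \ldots, x_{nj})'$ is
the jth column of X.
13
In this notation $x_1$
is the p-vector denoting the p observations on
the first object, while $x_{(1)}$
is the n-vector denoting the observations on the
first variable.
The rows $x_1, \ldots, x_n$
form a random sample while the columns $x_{(1)}, \ldots, x_{(p)}$
do not (this is emphasized in the notation by the
use of parentheses).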
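The row and column views of the data matrix can be sketched with the five-student example, assuming NumPy; the variable names are illustrative only.

```python
import numpy as np

# Data matrix from the example: n = 5 objects (rows), p = 3 variables (columns).
X = np.array([
    [18.5, 91, 0],
    [18.0, 73, 1],
    [18.9, 64, 1],
    [18.5, 71, 0],
    [18.4, 85, 1],
])
n, p = X.shape

x_1 = X[0, :]        # first row: the p-vector of observations on the first object
x_paren_1 = X[:, 0]  # first column: the n-vector of observations on the first variable

# Stacking the rows, or placing the columns side by side, recovers X.
from_rows = np.vstack([X[i, :] for i in range(n)])
from_cols = np.column_stack([X[:, j] for j in range(p)])

print(x_1)
print(x_paren_1)
```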
14
At times, the objective of multivariate analysis
will be an attempt to find some feature of the
variables (i.e. the columns of the data matrix).
At other times, the objective of multivariate
analysis will be an attempt to find some feature
of the individuals (i.e. the rows of the data
matrix).
The feature that we often look for is a grouping of
the individuals or of the variables. We will
give a classification of multivariate methods
later.
15
Summarization of the data
16
  • Even when n and p are only moderately large, the
    amount of information (np elements of the data
    matrix) can be overwhelming and it is necessary
    to find ways of summarizing the data.
  • Later on we will discuss ways of representing
    the data graphically.

17
Definitions
  1. The sample mean for the ith variable
  2. The sample variance for the ith variable
  3. The sample covariance between the ith variable
    and the jth variable
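The formulas for these three definitions did not survive in the transcript. A minimal sketch of the usual definitions, assuming NumPy: the mean divides the column sum by n, while the variance and covariance divide the sum of products of deviations from the mean by either n - 1 or n. Which divisor the slides use cannot be recovered here, so the hypothetical helpers below take it as a parameter and default to n - 1 (the default of `np.cov`).

```python
import numpy as np

X = np.array([
    [18.5, 91, 0],
    [18.0, 73, 1],
    [18.9, 64, 1],
    [18.5, 71, 0],
    [18.4, 85, 1],
])

def sample_mean(X, i):
    """Mean of the ith variable (column i, 0-based)."""
    return X[:, i].sum() / X.shape[0]

def sample_cov(X, i, j, divisor=None):
    """Covariance between variables i and j; divisor defaults to n - 1 (assumption)."""
    n = X.shape[0]
    if divisor is None:
        divisor = n - 1
    dev_i = X[:, i] - sample_mean(X, i)
    dev_j = X[:, j] - sample_mean(X, j)
    return (dev_i * dev_j).sum() / divisor

def sample_var(X, i, divisor=None):
    """Variance of variable i: the covariance of the variable with itself."""
    return sample_cov(X, i, i, divisor)

print(sample_mean(X, 1))    # mean exam mark
print(sample_var(X, 1))     # variance of the exam mark
print(sample_cov(X, 0, 1))  # covariance between age and exam mark
```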

18
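Putting the definitions together, we are led to
the following definitions.
Defn: The sample mean vector

$$\bar{x} = (\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_p)'$$

where $\bar{x}_i$ is the sample mean of the ith variable.
19
Defn: The sample covariance matrix

$$S = (s_{ij}), \quad i, j = 1, \ldots, p$$

where $s_{ij}$ is the sample covariance between the
ith and jth variables and $s_{ii}$ is the sample
variance of the ith variable.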
20
Expressing the sample mean vector and the sample
covariance matrix in terms of the data matrix
21
The sample mean vector
22
Note

$$\bar{x} = \frac{1}{n} X' \mathbf{1}$$

where $\mathbf{1}$
is the n-vector whose components are all equal to
1.
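A short numerical check of the sample mean vector written as X'1/n, assuming NumPy and the five-student example data:

```python
import numpy as np

X = np.array([
    [18.5, 91, 0],
    [18.0, 73, 1],
    [18.9, 64, 1],
    [18.5, 71, 0],
    [18.4, 85, 1],
])
n, p = X.shape

ones = np.ones(n)          # the n-vector whose components are all 1
x_bar = X.T @ ones / n     # sample mean vector, one entry per variable

# The same result from NumPy's column-wise mean.
print(x_bar)
print(np.allclose(x_bar, X.mean(axis=0)))
```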
23
The sample covariance matrix
24
We can write
25
because
then
26
It is easy to check that
The final step is to realize that
27
So that
28
In the textbook
And then
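The equations on the preceding slides are missing from the transcript. One standard way to express the sample covariance matrix through the data matrix uses a centering matrix, here called `H` (an assumed name), equal to the identity minus (1/n) times the n x n matrix of ones. The sketch below, assuming NumPy, checks that X'HX divided by n - 1 matches `np.cov`. The divisor is an assumption: some texts divide by n instead, and the transcript does not show which convention the slides follow.

```python
import numpy as np

X = np.array([
    [18.5, 91, 0],
    [18.0, 73, 1],
    [18.9, 64, 1],
    [18.5, 71, 0],
    [18.4, 85, 1],
])
n, p = X.shape

ones = np.ones((n, 1))
H = np.eye(n) - ones @ ones.T / n   # centering matrix (assumed name H)

X_centered = H @ X                  # each column has its mean subtracted
divisor = n - 1                     # assumption; use n for the other convention
S = X.T @ H @ X / divisor           # sample covariance matrix

print(np.allclose(H, H.T), np.allclose(H @ H, H))   # H is symmetric and idempotent
print(np.allclose(S, np.cov(X, rowvar=False)))
```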
29
Another Expression for S
30
Note
and
31
Thus
Hence
32
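Data are frequently scaled as well as centered.
The scaling is done by introducing
Defn: the sample correlation coefficient for
(between) the ith and the jth variables

$$r_{ij} = \frac{s_{ij}}{\sqrt{s_{ii}\, s_{jj}}}$$

where $s_{ij}$ is the sample covariance between the
ith and jth variables, and
the sample correlation matrix

$$R = (r_{ij})$$

33
Obviously $r_{ii} = 1$
and, using Schwarz's inequality, $|r_{ij}| \le 1$.
If $R = I$ then we say the variables are
uncorrelated.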
34
Note: if we denote $D = \operatorname{diag}(\sqrt{s_{11}}, \ldots, \sqrt{s_{pp}})$,
then it can be checked that $R = D^{-1} S D^{-1}$.
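A sketch of the scaling step, assuming NumPy and taking D to be the diagonal matrix of sample standard deviations (an assumption, since the slide's own definition of D is not preserved). It checks the result against `np.corrcoef` and confirms the two properties stated above.

```python
import numpy as np

X = np.array([
    [18.5, 91, 0],
    [18.0, 73, 1],
    [18.9, 64, 1],
    [18.5, 71, 0],
    [18.4, 85, 1],
])

S = np.cov(X, rowvar=False)           # sample covariance matrix (n - 1 divisor)
D = np.diag(np.sqrt(np.diag(S)))      # diagonal matrix of standard deviations
D_inv = np.linalg.inv(D)
R = D_inv @ S @ D_inv                 # sample correlation matrix

print(np.allclose(R, np.corrcoef(X, rowvar=False)))
print(np.allclose(np.diag(R), 1.0))           # r_ii = 1
print(bool(np.all(np.abs(R) <= 1 + 1e-12)))   # |r_ij| <= 1
```

The divisor cancels in each ratio, so R is the same whether S is computed with n or n - 1.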
35
Measures of Multivariate Scatter
36
  • The sample variance-covariance matrix S is an
    obvious generalization of the univariate concept
    of variance, which measures scatter about the
    mean.
  • Sometimes it is convenient to have a single
    number to measure the overall multivariate
    scatter.

37
  • There are two common measures of this type.

Defn: The generalized sample variance, $|S|$ (the determinant of S)
Defn: The total sample variance, $\operatorname{tr}(S)$ (the trace of S)
38
  • In both cases, large values indicate a high
    degree of scatter about the centroid, and
    low values indicate concentration about the
    centroid.

Using the eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_p$ of the matrix
S, it can be shown that

$$|S| = \prod_{i=1}^{p} \lambda_i, \qquad \operatorname{tr}(S) = \sum_{i=1}^{p} \lambda_i$$

39
If $\lambda_p = 0$ then $|S| = 0$.
This says that there is a linear dependence
amongst the variables.
Normally, S is positive definite and all the
eigenvalues are positive.
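Both scatter measures and their eigenvalue forms can be checked numerically. The sketch below assumes NumPy and the five-student example. To illustrate the zero-eigenvalue case it appends a hypothetical fourth variable equal to x1 + x2, which makes the variables linearly dependent.

```python
import numpy as np

X = np.array([
    [18.5, 91, 0],
    [18.0, 73, 1],
    [18.9, 64, 1],
    [18.5, 71, 0],
    [18.4, 85, 1],
])

S = np.cov(X, rowvar=False)
eigenvalues = np.linalg.eigvalsh(S)        # S is symmetric; values come back ascending

generalized_variance = np.linalg.det(S)    # |S|, the product of the eigenvalues
total_variance = np.trace(S)               # tr(S), the sum of the eigenvalues

print(np.isclose(generalized_variance, eigenvalues.prod()))
print(np.isclose(total_variance, eigenvalues.sum()))

# Hypothetical extra variable x4 = x1 + x2 introduces a linear dependence,
# so the smallest eigenvalue and the determinant are zero up to rounding.
X_dep = np.column_stack([X, X[:, 0] + X[:, 1]])
S_dep = np.cov(X_dep, rowvar=False)
smallest = np.linalg.eigvalsh(S_dep)[0]
det_dep = np.linalg.det(S_dep)
print(abs(smallest) < 1e-8, abs(det_dep) < 1e-6)
```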
40
Linear combinations
41
  • Taking linear combinations of variables is one of
    the most important tools of multivariate
    analysis.
  • There are basically two reasons for this:
  1. A few appropriately chosen combinations may
    provide more of the information than many of the
    original variables (this is called dimension
    reduction).
  2. Linear combinations can simplify the structure of
    the variance-covariance matrix, which can help in
    the interpretation of the data.

42
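For a given vector of constants

$$a = (a_1, a_2, \ldots, a_p)'$$

we consider the linear combination

$$y_i = a' x_i = a_1 x_{i1} + a_2 x_{i2} + \cdots + a_p x_{ip}$$

for i = 1, 2, ..., n. Then the sample mean of the $y_i$ is

$$\bar{y} = a' \bar{x}$$

43
and the variance of the $y_i$ is

$$s_y^2 = a' S a$$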
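These two identities can be verified directly. The sketch below assumes NumPy, the five-student example, and an arbitrary illustrative coefficient vector `a`; the variance identity holds as long as S and the variance of the y values use the same divisor (n - 1 here).

```python
import numpy as np

X = np.array([
    [18.5, 91, 0],
    [18.0, 73, 1],
    [18.9, 64, 1],
    [18.5, 71, 0],
    [18.4, 85, 1],
])

a = np.array([1.0, 0.5, 0.0])   # hypothetical vector of constants
y = X @ a                       # y_i = a'x_i for each object i

x_bar = X.mean(axis=0)
S = np.cov(X, rowvar=False)     # n - 1 divisor

y_bar = y.mean()
var_y = y.var(ddof=1)           # same n - 1 divisor as S

print(np.isclose(y_bar, a @ x_bar))    # mean of the y values equals a'x_bar
print(np.isclose(var_y, a @ S @ a))    # variance of the y values equals a'Sa
```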