1
Multivariate Statistics
  • Confirmatory Factor Analysis I
  • W. M. van der Veld
  • University of Amsterdam

2
Overview
  • PCA a review
  • Introduction to Confirmatory Factor Analysis
  • Digression History and example
  • Intuitive specification
  • Digression Decomposition rule
  • Identification
  • Exercise 1
  • An example

3
PCA a review
4
PCA a review
  • Principal components analysis (PCA) is a kind of
    data reduction.
  • Start with a set of observed variables.
  • Transform the data so that a new set of variables
    is constructed from the observed variables (full
    component solution).
  • Hopefully, a small subset of the newly
    constructed variables carries as much of their
    information as possible, so that we can reduce
    the number of variables in our study.
  • Full component solution
  • Has as many components as variables
  • Accounts for 100% of the variables' variance
  • Each variable has a final communality of 1.00
  • Truncated component solution
  • Has fewer components than variables (Kaiser
    criterion, scree plot, etc.)
  • Components can hopefully be interpreted
  • Accounts for <100% of the variables' variance
  • Each variable has a communality < 1.00

5
PCA a review
  • Each principal component
  • Accounts for the maximum available variance,
  • Is orthogonal to (uncorrelated with, independent
    of) all prior components (often called factors).
  • The full solution (as many factors as variables)
    accounts for all the variance.
  • [Figure: full component solution vs. truncated
    component solution.]
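A minimal numeric sketch of the full versus truncated
solution, in Python/NumPy; the 3x3 correlation matrix is
hypothetical (it reuses the values from the CFA example
later in this deck):

    import numpy as np

    # Hypothetical correlation matrix of three observed variables.
    R = np.array([[1.00, 0.42, 0.56],
                  [0.42, 1.00, 0.48],
                  [0.56, 0.48, 1.00]])

    # Full component solution: eigendecomposition of R
    # (eigh returns eigenvalues in ascending order).
    eigvals, eigvecs = np.linalg.eigh(R)
    order = np.argsort(eigvals)[::-1]         # sort by variance, descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    # Component loadings: eigenvectors scaled by sqrt(eigenvalue).
    loadings = eigvecs * np.sqrt(eigvals)

    # Full solution: every communality (row sum of squared loadings) is 1.00.
    print(np.round((loadings ** 2).sum(axis=1), 2))   # [1. 1. 1.]

    # Truncated solution: keep only the first component.
    trunc = loadings[:, :1]
    print(np.round((trunc ** 2).sum(axis=1), 2))      # communalities < 1.00
    print(round(eigvals[0] / eigvals.sum(), 2))       # share of total variance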

6
PCA a review
  • PCA is part of the toolset for exploring data.
  • It is used when the researcher does not have, a
    priori, sufficient evidence to form a hypothesis
    about the structure in the data.
  • But anyone who can read should be able to get an
    idea about the structure, i.e. what goes with
    what.
  • It is a highly empirical, data-driven technique,
  • because one will always find components.
  • And when you find them, you have to interpret
    them.
  • Interpretation of components measured by many
    variables can be quite complicated, especially if
    you have no a priori ideas.
  • PCA at best suggests hypotheses for further
    research.
  • Confirmatory Factor Analysis has everything that
    PCA lacks.
  • Nevertheless, PCA is useful when the data are
    collected with a hypothesis in mind that you want
    to explore quickly,
  • and when you want to construct variables by
    definition.

7
Introduction to CFA
8
Introduction to CFA
  • Some variables of interest cannot be directly
    observed.
  • These unobserved variables are referred to as
    either
  • Latent variables, or
  • Factors
  • How do we obtain information about these latent
    variables, if they cannot be observed?
  • We do so by assuming that the latent variables
    have real, observable consequences, and that
    these consequences are related.
  • By studying the relations between the
    consequences, we can infer whether or not there
    is a latent variable.
  • History and example.

9
Digression History and example
  • Sir Francis Galton (1822-1911)
  • Psychologist, etc.
  • Galton was one of the first experimental
    psychologists, and the founder of the field of
    enquiry now called Differential Psychology, which
    concerns itself with psychological differences
    between people rather than with common traits. He
    started virtually from scratch, and had to invent
    the major tools he required, right down to the
    statistical methods - correlation and regression
    - which he later developed.  These are now the
    nuts-and-bolts of the empirical human sciences,
    but were unknown in his time.  One of the
    principal obstacles he had to overcome was the
    treatment of differences on measures as
    measurement error, rather than as natural
    variability.
  • His influential study Hereditary Genius (1869)
    was the first systematic attempt to investigate
    the effect of heredity on intellectual abilities,
    and was notable for its use of the bell-shaped
    Normal Distribution, then called the "Law of
    Errors", to describe differences in intellectual
    ability, and its use of pedigree analysis to
    determine hereditary effects.

10
Digression History and example
  • Charles Edward Spearman (1863-1945)
  • Psychologist, with a strong statistical
    background.
  • Spearman set out to estimate the intelligence of
    twenty-four children in the village school. In
    the course of his study, he realized that any
    empirically observed correlation between two
    variables will underestimate the "true" degree of
    relationship, to the extent that there is
    inaccuracy or unreliability in the measurement of
    those two variables.
  • Further, if the amount of unreliability is
    precisely known, it is possible to "correct" the
    attenuated observed correlation according to the
    formula (where r stands for the correlation
    coefficient): r(true) = r(observed) /
    √(reliability of variable 1 × reliability of
    variable 2). (A worked example is given below.)
  • Using his correction formula, Spearman found
    "perfect" relationships and inferred that
    "General Intelligence" or "g" was in fact
    something real, and not merely an arbitrary
    mathematical abstraction.
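A quick worked check of the correction for attenuation, as
a Python sketch; the observed correlation and the two
reliabilities are hypothetical numbers chosen for
illustration:

    from math import sqrt

    # Hypothetical: observed correlation .48 between two measures
    # whose reliabilities are .64 and .81.
    r_obs, rel1, rel2 = 0.48, 0.64, 0.81

    # Spearman's correction for attenuation.
    r_true = r_obs / sqrt(rel1 * rel2)   # .48 / (.8 * .9) = .48 / .72
    print(round(r_true, 3))              # 0.667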

11
Digression History and example
  • M are measures of math skills, and L are measures
    of language skills.
  • He then discovered yet another marvelous
    coincidence: the correlations were positive and
    hierarchical. These discoveries led Spearman to
    the eventual development of a two-factor theory
    of intelligence.

12
Digression History and example
[Path diagram: a general factor g above two group
factors, Verbal Ability with indicators L1-L3 and Math
Ability with indicators M1-M3.]
13
Introduction to CFA
  • Factor analysis is a very powerful and elegant
    way to analyze data.
  • Among other reasons, because it is close to the
    core of scientific purpose.
  • CFA is a theory-testing method,
  • as opposed to a theory-generating method like
    EFA, incl. PCA.
  • The researcher begins with a hypothesis prior to
    the analysis.
  • The model or hypothesis specifies
  • which variables will be correlated with which
    factors
  • which factors are correlated.
  • The hypothesis should be based on a strong
    theoretical and/or empirical foundation (Stevens,
    1996).

14
Intuitive specification
  • xi are the observed variables,
  • ξ is the common latent variable that is the cause
    of the relations between the xi,
  • λi is the effect of ξ on xi,
  • δi are the unique components of the observed
    variables. They are also latent variables,
    commonly interpreted as random measurement error.

15
Intuitive specification
  • This causal diagram can be represented with the
    following set of equations:
    x1 = λ1ξ + δ1
    x2 = λ2ξ + δ2
    x3 = λ3ξ + δ3
  • With E(xi) = E(ξ) = 0, var(xi) = var(ξ) = 1,
    E(δiδj) = E(δiξ) = 0 (i ≠ j).
  • Only the x variables are known.
  • We don't know anything about the other variables,
    ξ and δ.
  • How can we compute λi?
  • We need a theory about why there are correlations
    between the x variables. (A simulation sketch of
    the model follows below.)
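The measurement model can be checked by simulation. A
minimal sketch in Python/NumPy, assuming the loading
values (.7, .6, .8) that are solved for on the next
slides:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 100_000
    lam = np.array([0.7, 0.6, 0.8])   # assumed loadings

    # Simulate the latent variable and the unique components, with
    # E(.) = 0; the delta variances are set so that var(x_i) = 1.
    xi = rng.standard_normal(n)
    delta = rng.standard_normal((3, n)) * np.sqrt(1 - lam ** 2)[:, None]
    x = lam[:, None] * xi + delta     # x_i = lambda_i * xi + delta_i

    # The observed correlations recover lambda_i * lambda_j.
    print(np.round(np.corrcoef(x), 2))   # off-diagonals near .42, .56, .48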

16
Digression Decomposition rule
  • In Structural Equation Modeling (includes CFA)
  • The correlation between two variables is equal to
    the sum of
  • - the direct effect,
  • - indirect effects,
  • - spurious relations and
  • - joint effects between these variables.

17
Digression Basic concepts
18
Digression Decomposition rule
  • In Structural Equation Modeling (includes CFA)
  • The correlation between two variables is equal to
    the sum of- the direct effect,- indirect
    effects, - spurious relations and- joint
    effects between these variables.
  • The indirect effects, spurious relations and
    joint effects are equal to the products - of the
    coefficients along the path - going from one
    variable to the other - while one can not pass
    the same variable twice - and can not go against
    the direction of the arrows.

19
Intuitive specification
  • So the theory, according to the model, is that
    the correlations between the x variables are
    spurious, due to the latent variable ξ:
  • ρ(x1,x2) = λ1λ2, ρ(x1,x3) = λ1λ3,
    ρ(x2,x3) = λ2λ3.
  • Proof in the formal specification.
  • A simplified notation yields ρ12 = λ1λ2,
    ρ13 = λ1λ3, ρ23 = λ2λ3,
  • where λ1, λ2, and λ3 are unknown, and ρ12, ρ13,
    and ρ23 are known.
  • This is solvable, e.g.

20
Intuitive specification
  • The correlations between the x variables are
    known.
  • We know that
    ρ12 = .42 = λ1λ2
    ρ13 = .56 = λ1λ3
    ρ23 = .48 = λ2λ3
  • λ2 = .42/λ1 and λ3 = .56/λ1; substituting into
    ρ23 = λ2λ3 = .48 gives
  • λ1 = √((.42 × .56)/.48) = .7
  • λ2 = .6
  • λ3 = .8
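The same hand calculation in a few lines of Python (a
sketch of the algebra above, not of how SEM software
actually estimates loadings):

    from math import sqrt

    # Observed correlations among x1, x2, x3.
    r12, r13, r23 = 0.42, 0.56, 0.48

    # lambda2 = r12/lambda1 and lambda3 = r13/lambda1; substituting
    # into r23 = lambda2 * lambda3 gives lambda1**2 = r12 * r13 / r23.
    lam1 = sqrt(r12 * r13 / r23)
    lam2 = r12 / lam1
    lam3 = r13 / lam1
    print(round(lam1, 2), round(lam2, 2), round(lam3, 2))   # 0.7 0.6 0.8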

21
Intuitive specification
  • Notice that this is much more advanced than PCA.
  • It is based upon theory.
  • It is straightforward.
  • It allows the estimation of an effect of
    something that we have not measured!

22
Identification
  • What would happen if the latent variable had four
    indicators?
  • We already had
    ρ12 = λ1λ2 = .42
    ρ13 = λ1λ3 = .56
    ρ23 = λ2λ3 = .48
  • Now we also have
    ρ14 = λ1λ4 = .35
    ρ24 = λ2λ4 = .30
    ρ34 = λ3λ4 = .50

23
Identification
  • Now we also have
    ρ14 = λ1λ4 = .35
    ρ24 = λ2λ4 = .30
    ρ34 = λ3λ4 = .50
  • Of these three equations we need only one to
    determine the value of λ4, once we have solved
    λ1, λ2, λ3.
  • Therefore, this model is called over-identified,
    with 2 degrees of freedom (df = 2):
  • df = (# of correlations) − (# of parameters to be
    estimated) = 6 − 4 = 2.
  • If the number of degrees of freedom is at least
    1, a test of the model parameters is possible.

24
Identification
  • A test of the model:
    ρ14 = λ1λ4 = .7λ4 = .35
    ρ24 = λ2λ4 = .6λ4 = .30
    ρ34 = λ3λ4 = .8λ4 = .50
  • Let's use the first equation to estimate λ4:
    λ4 = .35/.7 = .5.
  • Now we know all coefficients, and two equations
    have not been used yet.
  • These equations can be used to test the model,
    using the idea that the observed correlation
    should equal the estimated correlation, or
    ρ(obs) − ρ(est) = 0:
    ρ24 − λ2λ4 = .30 − .6 × .50 = 0 (good)
    ρ34 − λ3λ4 = .50 − .8 × .50 = 0.10 (less good)
  • These differences are called residuals. If the
    residuals are big, the model must be wrong.
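A sketch of this residual check in Python, using the
estimates from the slides above:

    # Estimated loadings; lambda4 comes from rho14 alone (.35 / .7 = .5).
    lam = {1: 0.7, 2: 0.6, 3: 0.8, 4: 0.5}

    # The two unused correlations provide the test of the model.
    obs = {(2, 4): 0.30, (3, 4): 0.50}
    for (i, j), r in obs.items():
        residual = r - lam[i] * lam[j]    # rho(obs) - rho(est)
        print(f"rho{i}{j}: residual = {residual:+.2f}")
    # rho24: +0.00 (good); rho34: +0.10 (less good)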

25
Identification
  • With df > 0, a test of the model is possible.
  • With df = 0, no test of the model is possible.
  • This is the case with a one-factor model and 3
    observed variables.
  • With df < 0, no test of the model is possible,
    and the parameters cannot even be estimated.
  • However, if the parameters of the model can be
    constrained, degrees of freedom can be gained, so
    that the model can be estimated.
  • The issue of identification is connected to the
    issue of the number of unknowns versus the number
    of equations. Although not entirely, because
    there is more to it.
  • Generally we leave this issue to the computer
    program. If the model is not identified, you will
    get a message.

26
Exercise 1
  • Assume that E(xi) = E(ξi) = 0,
    var(xi) = var(ξi) = 1, and
    E(δiδj) = E(δiξ) = 0 (i ≠ j).
  • Formulate expressions for the correlations
    between the x variables in terms of the
    parameters of the model:
  • ρ12, ρ13, ρ14, ρ23, ρ24, ρ34.
  • ρ12 = λ11λ21
    ρ13 = λ11φ21λ32
    ρ14 = λ11φ21λ42
    ρ23 = λ21φ21λ32
    ρ24 = λ21φ21λ42
    ρ34 = λ32λ42
  • We can get the same result via the formal
    specification.
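Such expressions can be checked numerically by computing
the model-implied correlation matrix ΛΦΛ'. A sketch in
Python/NumPy; the loadings and the factor correlation
φ21 = .4 are assumed values for illustration:

    import numpy as np

    # Assumed loadings: x1, x2 on factor 1; x3, x4 on factor 2.
    Lam = np.array([[0.7, 0.0],
                    [0.6, 0.0],
                    [0.0, 0.8],
                    [0.0, 0.5]])
    Phi = np.array([[1.0, 0.4],
                    [0.4, 1.0]])   # factor correlation phi21 (assumed)

    # Model-implied correlations: Sigma = Lam @ Phi @ Lam.T (the diagonal
    # would additionally carry the unique variances).
    Sigma = Lam @ Phi @ Lam.T
    print(np.round(Sigma, 3))
    # e.g. rho12 = .7 * .6 = .42 and rho13 = .7 * .4 * .8 = .224,
    # matching the expressions above.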

27
An example (PCA)
  • What does the factor model look like for these
    items?
  • We suggested last time that there should be 1
    factor behind these items.

28
An example (PCA)
The varimax-rotated PCA solution claims a better
interpretation. There are two PCs, which can be
interpreted as positive (items 4, 7, 8) and negative
(the other items) loneliness. But this is strange,
because those two factors should then be related!
[Table: PCA solution.]
29
An example (CFA)
  • We did not look at the correlations, but some
    items were worded positively and some negatively.
  • Therefore, if the same response scale is used, we
    expect negative correlations. And they should be
    found in the dark yellow cells.

30
An example (CFA)
31
An example (CFA)
  • This model does not fit (p = 0).
  • So we must search for another model.
  • One possible explanation for finding positive
    correlations where negative ones were expected is
    acquiescence.
  • Acquiescence is the tendency to agree (confirm)
    with a statement, regardless of its content.
  • There are all kinds of psychological explanations
    for this response behavior of respondents: being
    nice, etc.
  • If people agree with both negative and positive
    items, we will find positive correlations between
    those items.
  • Can we correct for acquiescence?
  • YES!

32
An example (CFA)
  • In order to identify an acquiescence factor, I
    have to split the positive and negative items
    into two factors.
  • Theoretically it makes sense for those two
    loneliness factors to be correlated −1, because
    they are the same construct with opposite signs.

33
An example (CFA)
34
An example (CFA)
  • The result is a model that still does not fit,
    but it fits much better than before.
  • This model shows, however, that most variation
    and covariation is due to the acquiescence
    factor.
  • The loneliness factors that remain after
    correction for random measurement error and
    systematic error (acquiescence) are pure.
  • However, given the low loadings, what do these
    items have in common with the underlying factor?
    The loneliness factor(s) explain only about 0.20
    of the variance of even the best indicators
    (0.45² ≈ 0.20).

35
An example (CFA)
  • Conclusion
  • The PCA did not provide a theoretically sound
    solution.
  • Thinking about the items, and about what some
    respondents do when answering questions, led to a
    theory.
  • This theory could be tested, and it produced very
    sensible results.
  • However, the χ² was not acceptable: 74 with
    df = 19, whereas with 19 df the χ² should be
    about 30 at most.