SC968 Panel data methods for sociologists Lecture 2, part 1

Slides: 42

Provided by: mariai

1
SC968 Panel data methods for sociologists
Lecture 2, part 1
Introducing panel data
2
Overview

Panel data
What it is
How to get to know the data
Change over time
Tabulating
Calculating transition probabilities

3
What is panel data?

A data set containing observations on multiple phenomena observed at a single point in time is called cross-sectional data
A data set containing observations on a single phenomenon observed over multiple time periods is called time series data
Observations on multiple phenomena over multiple time periods are panel data
Cross-sectional and time series data are one-dimensional; panel data are two-dimensional

4
Using panel data in Stata

Data on n cases, over t time periods, giving a total of n × t observations
One record per observation, i.e. long format
Stata tools for analyzing panel data begin with the prefix xt
You first need to tell Stata that you have panel data, using xtset
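
To make the long format concrete, here is a minimal sketch in Python (chosen so that it runs without Stata or the course data). The records, the variable stub inc and the helper wide_to_long are all made up for illustration; the point is only that each person-wave becomes its own record.

```python
# Hypothetical wide-format data: one row per person, one income column per wave.
wide = [
    {"pid": 1, "inc1": 800, "inc2": 850, "inc3": 900},
    {"pid": 2, "inc1": 1500, "inc2": None, "inc3": 1450},  # not observed at wave 2
]

def wide_to_long(rows, stub, waves):
    """Return one record per person-wave, skipping waves with no data."""
    long_rows = []
    for row in rows:
        for wave in waves:
            value = row.get(f"{stub}{wave}")
            if value is not None:
                long_rows.append({"pid": row["pid"], "wave": wave, stub: value})
    return long_rows

long_data = wide_to_long(wide, "inc", waves=[1, 2, 3])
for record in long_data:
    print(record)
```

With complete data the result would have n × t = 6 records; here person 2 misses one wave, so the panel is unbalanced.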

5
Complete and incomplete person-wave data
6
Telling Stata you have panel data
. xtset pid wave
       panel variable:  pid (unbalanced)
        time variable:  wave, 1 to 15, but with gaps
                delta:  1 unit
pid: unique cross-wave identifier
wave: time variable
7
. xtset pid wave
       panel variable:  pid (unbalanced)
        time variable:  wave, 1 to 15, but with gaps
                delta:  1 unit
(unbalanced): cases not observed for every time period
delta: period between observations, in units of the time variable
8
Describing the patterns in panel data
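
One way to describe participation patterns is to build, for each person, a string marking the waves in which they were observed, and then count how often each string occurs. The sketch below does this in Python for a made-up four-wave panel; the data and the function name participation_patterns are hypothetical.

```python
from collections import Counter

# Hypothetical (pid, wave) pairs for a 4-wave panel.
observed = [
    (1, 1), (1, 2), (1, 3), (1, 4),
    (2, 1), (2, 2),
    (3, 1), (3, 3), (3, 4),
    (4, 1), (4, 2), (4, 3), (4, 4),
]
N_WAVES = 4

def participation_patterns(pairs, n_waves):
    """Map each pid to a pattern string: '1' if observed in a wave, '.' if not."""
    waves_by_pid = {}
    for pid, wave in pairs:
        waves_by_pid.setdefault(pid, set()).add(wave)
    return {
        pid: "".join("1" if w in waves else "." for w in range(1, n_waves + 1))
        for pid, waves in waves_by_pid.items()
    }

patterns = participation_patterns(observed, N_WAVES)
pattern_counts = Counter(patterns.values())
for pattern, n in pattern_counts.most_common():
    print(pattern, n)
```

A panel is balanced only if every person has the all-ones pattern.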
9
Examining change over two waves
10
Calculating transition probabilities

The transition probability is the probability of transitioning from one state to another

So to calculate by hand:
transition probability = cell count / row total
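
The hand calculation can be sketched in a few lines of Python. The records below are made up, and the function transition_probabilities is a hypothetical helper, not a Stata command. It pools every transition from wave t to wave t+1, cross-tabulates origin by destination, and divides each cell count by its row total.

```python
from collections import defaultdict

# Hypothetical long-format data: (pid, wave, labour market state).
records = [
    (1, 1, "empl"), (1, 2, "empl"), (1, 3, "unemp"),
    (2, 1, "empl"), (2, 2, "empl"), (2, 3, "empl"),
    (3, 1, "unemp"), (3, 2, "empl"), (3, 3, "empl"),
    (4, 1, "olf"), (4, 3, "olf"),  # wave 2 missing, so no usable transition
]

def transition_probabilities(records):
    """Return {origin: {destination: probability}} for moves from wave t to t+1."""
    state_at = {(pid, wave): state for pid, wave, state in records}
    counts = defaultdict(lambda: defaultdict(int))
    for (pid, wave), origin in state_at.items():
        destination = state_at.get((pid, wave + 1))
        if destination is not None:  # only consecutive waves count
            counts[origin][destination] += 1
    probs = {}
    for origin, row in counts.items():
        row_total = sum(row.values())
        probs[origin] = {dest: n / row_total for dest, n in row.items()}
    return probs

probs = transition_probabilities(records)
for origin, row in probs.items():
    print(origin, row)
```

Each row of the resulting matrix sums to one. Restricting the records to a single pair of waves would give the matrix for that pair only.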
11
Transition probability matrix
12
Transition probability matrices in Stata
Leaving out the if statement gives mean transition probabilities over all waves, from t to t+1
13
Change in a categorical variable over time: a decision tree
[Figure: decision tree of transitions between three states, empl (employed), unemp (unemployed) and olf (out of the labour force), with a probability on each branch]
14
Change in a continuous variable over time

Size transition matrix
Quantile transition matrix
Mean transition matrix
Median transition matrix

15
Size transition matrix

Absolute mobility
e.g. movement in and out of poverty
Boundaries are set exogenously, i.e. predetermined
e.g. poverty defined a priori as an income below 5,000
Boundaries do not depend on the distribution under investigation
e.g. comparing mobility in the 1990s and 2000s incorporates both movements in the positions of individuals and economic growth

16
Quantile transition matrix

Mobility as a relative concept
Same number of individuals in each class
Only records movements involving reranking
Cannot take account of economic growth, for example when comparing matrices
Cannot draw a complete picture when comparing mobility in different cohorts, countries or welfare regimes

17
Mean/median transition matrices

Both absolute and relative approaches are incorporated into the matrices
Class boundaries are defined as percentages of the mean or median income of the origin and destination distributions
Example: 25%, 50%, 75% of median income
Note that this is not the same as quartiles
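
The difference between quartile boundaries and boundaries set at fractions of the median can be seen in a small Python sketch. The incomes are invented, and the fractions simply follow the 25%, 50%, 75% example; quartile_boundaries, median_boundaries and assign_band are hypothetical helper names.

```python
import statistics

# Hypothetical incomes for ten people.
incomes = [400, 700, 900, 1200, 1500, 1600, 2100, 2500, 3200, 6000]

def quartile_boundaries(values):
    """Three cut points that put (roughly) equal numbers in each class."""
    return statistics.quantiles(values, n=4)

def median_boundaries(values, fractions=(0.25, 0.5, 0.75)):
    """Cut points defined as fixed fractions of the median."""
    med = statistics.median(values)
    return [f * med for f in fractions]

def assign_band(value, boundaries):
    """Band 1 lies below the first boundary, band 4 at or above the last."""
    return 1 + sum(value >= b for b in boundaries)

q_bounds = quartile_boundaries(incomes)
m_bounds = median_boundaries(incomes)
print("quartile boundaries:", q_bounds)
print("median-based boundaries:", m_bounds)
print("band sizes (quartile):", [sum(assign_band(x, q_bounds) == k for x in incomes) for k in (1, 2, 3, 4)])
print("band sizes (median):  ", [sum(assign_band(x, m_bounds) == k for x in incomes) for k in (1, 2, 3, 4)])
```

Quartile classes are equal-sized by construction, whereas the median-based classes can be very unequal in size, which is why the two matrices answer different questions.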

18
Example income 1991-1992
19
Category boundaries for each method
Matrix    Year  Boundary 1 (n)   Boundary 2 (n)     Boundary 3 (n)      Boundary 4 (n)
Size      1991  0 - 800 (580)    800 - 1500 (650)   1500 - 2200 (504)   2200 - 9231 (715)
          1992  0 - 800 (580)    800 - 1500 (645)   1500 - 2200 (473)   2200 - 10491 (751)
Quartile  1991  0 - 827 (609)    827 - 1511 (615)   1511 - 2365 (611)   2365 - 9231 (614)
          1992  0 - 862 (610)    862 - 1508 (612)   1508 - 2450 (612)   2450 - 10491 (615)
Mean      1991  0 - 887 (654)    887 - 1773 (814)   1773 - 2660 (506)   2660 - 9231 (475)
          1992  0 - 898 (652)    898 - 1795 (766)   1795 - 2693 (501)   2693 - 10491 (530)
Median    1991  0 - 750 (539)    750 - 1500 (685)   1500 - 2250 (540)   2250 - 9231 (685)
          1992  0 - 746 (536)    746 - 1491 (686)   1491 - 2237 (505)   2262 - 10491 (722)
20
Warning!

Measurement error
Causes an over-estimation of mobility
If mothers' and babies' weights are reported to the nearest half pound, this can affect which band an observation falls in
A respondent may describe their marital status as separated in year 1 and single in year 2
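
A small simulation illustrates why measurement error inflates mobility. In this Python sketch everyone's true income is the same in both waves, so true mobility is zero, yet random reporting error pushes some people across a band boundary. The boundaries, the error size and the normal error distribution are all assumptions chosen for illustration.

```python
import random

random.seed(42)  # fixed seed so the run is repeatable

BOUNDARIES = [800, 1500, 2200]  # hypothetical income band boundaries

def band(income):
    """Return the band index (0 to 3) that an income falls into."""
    return sum(income >= b for b in BOUNDARIES)

def reported(true_income, sd=100):
    """True income plus random reporting error (assumed normal)."""
    return true_income + random.gauss(0, sd)

# True income is identical in both waves, so nobody truly changes band.
true_income = [random.uniform(0, 3000) for _ in range(1000)]

wave1 = [reported(x) for x in true_income]
wave2 = [reported(x) for x in true_income]

true_movers = 0  # by construction
observed_movers = sum(band(a) != band(b) for a, b in zip(wave1, wave2))
print(f"true movers: {true_movers}, observed movers: {observed_movers}")
```

Every observed move in this example is spurious, and it affects mainly people whose true income lies close to a boundary.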

21
Finally...

Panel data pose greater challenges for understanding and checking the data
Transition matrices are a good way to summarise mobility patterns
Different methods of constructing matrices lead to distinct interpretations
You may need to take account of measurement error when modelling change

23
SC968 Panel data methods for sociologists
Lecture 2, part 2

Concepts for panel data analysis

24
Overview

Types of variables: time-invariant, time-varying and trend
Between- and within-individual variation
Concept of individual heterogeneity
Within and between estimators
Basic properties of fixed and random effects models
The basics of these models' implementation in Stata

25
Types of variable

Those which vary between individuals but hardly ever over time
  Sex
  Ethnicity
  Parents' social class when you were 14
  The type of primary school you attended (once you've become an adult)
Those which vary over time, but not between individuals
  The retail price index
  National unemployment rates
  Age, in a cohort study
Those which vary both over time and between individuals
  Income
  Health
  Psychological wellbeing
  Number of children you have
  Marital status
Trend variables: vary between individuals and over time, but in highly predictable ways
  Age
  Year

26
Between- and within-individual variation

If you have a sample with repeated observations on the same individuals, there are two sources of variance within the sample:
  The fact that individuals are systematically different from one another (between-individual variation)
  The fact that individuals' behaviour varies between observations (within-individual variation)

Not nearly as scary as it looks:
  Total variation is the sum, over all individuals and years, of the squared difference between each observation of x and the whole-sample mean
  Within variation is the sum of the squared deviations of each individual's observations from his or her own mean
  Between variation is the sum of squared differences between individual means and the whole-sample mean
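
The three sums of squares can be computed directly. The Python sketch below uses a tiny made-up panel; the function decompose is a hypothetical helper. One assumption is worth flagging: the between term is summed over every person-year (so each individual mean is counted once per observation), which is what makes total variation equal within plus between.

```python
from statistics import mean

# Hypothetical panel: values of x for each pid over three waves.
panel = {
    1: [10, 12, 14],
    2: [20, 20, 20],   # no within-individual variation at all
    3: [30, 33, 27],
}

def decompose(panel):
    """Return (total, within, between) sums of squares for x."""
    all_values = [x for xs in panel.values() for x in xs]
    grand_mean = mean(all_values)
    total = sum((x - grand_mean) ** 2 for x in all_values)
    within = sum((x - mean(xs)) ** 2 for xs in panel.values() for x in xs)
    # Each individual mean is counted once per observation of that individual.
    between = sum(len(xs) * (mean(xs) - grand_mean) ** 2 for xs in panel.values())
    return total, within, between

total, within, between = decompose(panel)
print(f"total={total:.1f} within={within:.1f} between={between:.1f}")
```

For a time-invariant variable such as sex, every individual's values equal their own mean, so the within term is zero and all variation is between.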

27
xtsum in Stata

Similar to the ordinary sum command

A balanced sample has been chosen
All variation is between
Most variation is between, because it's fairly rare to switch between having and not having a partner
All variation is within, because this is a balanced sample
28
More on xtsum
Observations with non-missing variable
Number of individuals
Average number of time-points
Between: min and max refer to the individual means (x_i-bar)
Within: min and max refer to individual deviations from own averages, with the global average added back in
29
The xttab command
For simplicity, the jbstat categories of missing, maternity leave, government training and other are omitted.
Overall: pooled sample, broken down by person-years
Within: of those who spent any time in this state, the proportion of their time (on average) that they spent in it
Between: number of people who spent any time in this state
30
Individual heterogeneity

A very simple concept: people are different!
In social science, when we talk about heterogeneity, we are really talking about unobservable (or unobserved) heterogeneity.
Observed heterogeneity: differences in education levels, or parental background, or anything else that we can measure and control for in regressions
Unobserved heterogeneity: anything which is fundamentally unmeasurable, or which is rather poorly measured, or which does not happen to be measured in the particular data set we are using.
Example: the relationship between employment and children
We know that women who have more children are less likely to go out to work, and if they do go out to work they work fewer hours.
A priori, this isn't obvious: women with children do face a higher opportunity cost of work, but one could also argue that they need more money
What is the causal relationship? Does having lots of children cause women to do less paid work? Or are women who have lots of children a fundamentally different type, with different sets of preferences?

31
Unobserved heterogeneity

Extend the OLS equation we used in Week 1, breaking the error term down into two components: one representing the unobservable characteristics of the person, and the other representing genuine error.
In cross-sectional analysis, there is no way of distinguishing between the two.
But in panel data analysis, we have repeated observations, and this allows us to distinguish between them.
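
Written out, the extended equation might look as follows. This is a sketch using the symbols that appear elsewhere in these slides (a for the constant, u for the person-specific component, e for genuine error); the single regressor x and the coefficient symbol \beta are assumptions made for illustration.

```latex
% Cross-section: one observation per person, so u_i and e_i cannot be told apart
y_{i} = a + \beta x_{i} + \underbrace{u_{i} + e_{i}}_{\text{composite error}}

% Panel: repeated observations t = 1, \dots, T on each person i
y_{it} = a + \beta x_{it} + u_{i} + e_{it}
```

Because u_i carries no time subscript, it is the same in every wave for a given person, and it is this repetition that lets the two components be separated.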

32
Within and between estimators
u_i: individual-specific, fixed over time
e_it: varies over time, usual assumptions apply (mean zero, homoscedastic, uncorrelated with x or u or itself)
The between estimator uses only the individual means
The within estimator (fixed effects) uses only deviations from the individual means
θ measures the weight given to between-group variation, and is derived from the variances of u_i and e_it
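
The estimators can be written as transformations of one underlying model. The equations below are a sketch under assumptions: a single regressor x_{it}, a balanced panel with T waves, and the model y_{it} = a + \beta x_{it} + u_i + e_{it}; bars denote individual means over time. The expression for \theta is the standard one for this set-up and may be parameterised differently from the slide.

```latex
% Between estimator: OLS on individual means
\bar{y}_{i} = a + \beta \bar{x}_{i} + u_{i} + \bar{e}_{i}

% Within (fixed effects) estimator: OLS on deviations from individual means;
% a and u_i are constant within a person, so they drop out
y_{it} - \bar{y}_{i} = \beta \,(x_{it} - \bar{x}_{i}) + (e_{it} - \bar{e}_{i})

% Random effects: OLS on quasi-demeaned data
y_{it} - \theta \bar{y}_{i}
  = (1 - \theta)\, a + \beta \,(x_{it} - \theta \bar{x}_{i})
  + \bigl[(1 - \theta)\, u_{i} + (e_{it} - \theta \bar{e}_{i})\bigr],
\qquad
\theta = 1 - \sqrt{\frac{\sigma_{e}^{2}}{T \sigma_{u}^{2} + \sigma_{e}^{2}}}
```

In this parameterisation \theta = 0 reproduces pooled OLS and \theta = 1 reproduces the within estimator, so random effects sits between the two.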
33
Fixed effects (within estimator)

Ignores between-group variation, so it's an inefficient estimator
However, few assumptions are required for FE to be consistent
Disadvantage: can't estimate the effects of any time-invariant variables
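
The within estimator can be sketched by demeaning x and y within each person and running OLS on the deviations. The Python example below uses invented, noise-free data in which the person effect u is deliberately correlated with x; the true slope of 2 and the helper names ols_slope and within_slope are assumptions for illustration.

```python
from statistics import mean

TRUE_BETA = 2.0

# Hypothetical panel: u is a fixed person effect that rises with x.
panel = {
    1: {"u": 0.0, "x": [1.0, 2.0, 3.0]},
    2: {"u": 10.0, "x": [4.0, 5.0, 6.0]},
    3: {"u": 20.0, "x": [7.0, 8.0, 9.0]},
}
rows = [(pid, x, TRUE_BETA * x + d["u"]) for pid, d in panel.items() for x in d["x"]]

def ols_slope(xs, ys):
    """Slope from a simple regression of ys on xs."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

def within_slope(rows):
    """Demean x and y within each pid, then run OLS on the deviations."""
    by_pid = {}
    for pid, x, y in rows:
        by_pid.setdefault(pid, []).append((x, y))
    dev_x, dev_y = [], []
    for obs in by_pid.values():
        mx = mean(x for x, _ in obs)
        my = mean(y for _, y in obs)
        dev_x += [x - mx for x, _ in obs]
        dev_y += [y - my for _, y in obs]
    return ols_slope(dev_x, dev_y)

pooled = ols_slope([x for _, x, _ in rows], [y for _, _, y in rows])
fe = within_slope(rows)
print(f"pooled OLS slope: {pooled:.2f}, within (FE) slope: {fe:.2f}")
```

Demeaning removes anything that is constant within a person. That is why u no longer biases the slope, and also why the effect of a time-invariant variable cannot be estimated this way.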

34
Between estimator

Not much used
Except to calculate the θ parameter for random effects, but Stata does this, not you!
It's inefficient compared to random effects
It doesn't use as much information as is available in the data (only uses means)
Assumption required: that v_i is uncorrelated with x_i
Easy to see why: if they were correlated, how could one decide how much of the variation in y to attribute to the x's (via the betas) as opposed to the correlation?
Can't estimate effects of variables whose mean is invariant over individuals
  Age in a cohort study
  Macro-level variables

35
Random effects estimator

Assumption required: that u_i is uncorrelated with x_i
Rather heroic assumption: think of examples
Will see a test for this later
Uses both within- and between-group variation, so makes best use of the data and is efficient

36
Estimating fixed effects in Stata
R-square-like statistic
Peaks at age 48
u and e are the two parts of the error term
Talk about xtmixed
37
Between regression

Not much used, but useful to compare coefficients
with fixed effects

The coefficient on partner was negative and significant in the FE model. In FE, the partner coefficient really measures the events of gaining or losing a partner
38
Random effects regression
Option theta gives a summary of weights
39
And what about OLS?

OLS simply treats within- and between-group
variation as the same
Pools data across waves

40
Test whether pooling data is valid

If the u_i do not vary between individuals, they can be treated as part of the constant a, and OLS is fine.
Breusch-Pagan Lagrange multiplier test
H0: variance of u_i = 0
H1: variance of u_i not equal to zero
If H0 is not rejected, you can pool the data and use OLS
Post-estimation test after random effects
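
For reference, the textbook form of the Breusch-Pagan Lagrange multiplier statistic is sketched below. It assumes a balanced panel of n individuals observed over T waves, with \hat{e}_{it} denoting the residuals from pooled OLS; software may use a modified version for unbalanced panels.

```latex
LM = \frac{nT}{2\,(T - 1)}
     \left[
       \frac{\sum_{i=1}^{n} \bigl( \sum_{t=1}^{T} \hat{e}_{it} \bigr)^{2}}
            {\sum_{i=1}^{n} \sum_{t=1}^{T} \hat{e}_{it}^{2}}
       - 1
     \right]^{2}
\;\sim\; \chi^{2}(1) \quad \text{under } H_{0}
```

If the residuals within a person are unrelated to each other, the ratio in brackets is close to one and LM is close to zero. Residuals that cluster by person push the ratio above one, which is evidence against pooling.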

41
Comparing models