Title: SC968 Panel data methods for sociologists Lecture 2, part 1
1. SC968 Panel data methods for sociologists, Lecture 2, part 1
Introducing panel data
2. Overview
- Panel data
- What it is
- How to get to know the data
- Change over time
- Tabulating
- Calculating transition probabilities
3. What is panel data?
- A data set containing observations on multiple phenomena observed at a single point in time is called cross-sectional data
- A data set containing observations on a single phenomenon observed over multiple time periods is called time series data
- Observations on multiple phenomena over multiple time periods are panel data
- Cross-sectional and time series data are one-dimensional; panel data are two-dimensional
4. Using panel data in Stata
- Data on n cases, over t time periods, giving a total of n × t observations
- One record per observation
- i.e. long format
- Stata tools for analyzing panel data begin with the prefix xt
- First need to tell Stata that you have panel data using xtset
5. Complete and incomplete person-wave data
6. Telling Stata you have time series data
Unique cross-wave identifier: pid
Time variable: wave
. xtset pid wave
       panel variable:  pid (unbalanced)
        time variable:  wave, 1 to 15, but with gaps
                delta:  1 unit
7. Cases not observed for every time period
. xtset pid wave
       panel variable:  pid (unbalanced)
        time variable:  wave, 1 to 15, but with gaps
                delta:  1 unit
delta: period between observations in units of the time variable
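The xtset report above can be mimicked outside Stata to check what "unbalanced" and "with gaps" mean for a given file. The following is a minimal Python sketch, not Stata's own logic: the function name describe_panel and the toy records are hypothetical, and it assumes integer wave numbers with a delta of 1.

```python
from collections import defaultdict

def describe_panel(records):
    """Summarise a long-format panel given (pid, wave) pairs, one per observation.

    Assumes integer waves spaced one unit apart and no duplicate (pid, wave) pairs.
    """
    waves_by_pid = defaultdict(set)
    for pid, wave in records:
        waves_by_pid[pid].add(wave)

    all_waves = set().union(*waves_by_pid.values())
    # Balanced here means: every case is observed at every wave seen in the data.
    balanced = all(waves == all_waves for waves in waves_by_pid.values())
    # A gap is a wave missing between a case's own first and last observation.
    has_gaps = any(
        len(waves) < max(waves) - min(waves) + 1
        for waves in waves_by_pid.values()
    )
    return {
        "n_cases": len(waves_by_pid),
        "n_obs": sum(len(waves) for waves in waves_by_pid.values()),
        "first_wave": min(all_waves),
        "last_wave": max(all_waves),
        "balanced": balanced,
        "has_gaps": has_gaps,
    }

# Toy data: person 2 misses wave 2, person 3 joins late.
toy = [(1, 1), (1, 2), (1, 3), (2, 1), (2, 3), (3, 2), (3, 3)]
print(describe_panel(toy))
```

Person 3 makes the toy panel unbalanced without creating a gap; person 2 does both.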
8. Describing the patterns in panel data
9. Examining change over two waves
10. Calculating transition probabilities
- The transition probability is the probability of transitioning from one state to another
- So to calculate by hand: transition probability = cell count / row total
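The by-hand rule (cell count divided by row total) can be sketched in a few lines of Python. This is an illustration only: the function name transition_probabilities and the toy records are hypothetical, and it pools every pair of consecutive waves t to t+1.

```python
from collections import Counter, defaultdict

def transition_probabilities(records):
    """Row-normalised transition matrix from long-format (pid, wave, state) records.

    Pools all pairs of consecutive waves (t, t+1); a person contributes a
    transition only if observed in both waves.
    """
    state_at = {(pid, wave): state for pid, wave, state in records}

    counts = defaultdict(Counter)  # counts[origin][destination] = cell count
    for (pid, wave), origin in state_at.items():
        destination = state_at.get((pid, wave + 1))
        if destination is not None:
            counts[origin][destination] += 1

    # Transition probability = cell count / row total.
    return {
        origin: {dest: n / sum(row.values()) for dest, n in row.items()}
        for origin, row in counts.items()
    }

toy = [
    (1, 1, "empl"), (1, 2, "empl"), (1, 3, "unemp"),
    (2, 1, "unemp"), (2, 2, "empl"), (2, 3, "empl"),
]
for origin, row in transition_probabilities(toy).items():
    print(origin, row)
```

Each row of the result sums to 1, because every origin state must lead to some destination state among the observed transitions.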
11. Transition probability matrix
12. Transition probability matrices in Stata
Leaving out the if statement gives the mean transition probabilities across all waves, t to t+1
13. Change in a categorical variable over time: a decision tree
[Figure: decision tree of transition probabilities between three states: empl, unemp and olf. The branch probabilities cannot be reliably matched to branches in this transcript.]
14. Change in a continuous variable over time
- Size transition matrix
- Quantile transition matrix
- Mean transition matrix
- Median transition matrix
15. Size transition matrix
- Absolute mobility
- e.g. movement in and out of poverty
- Boundaries set exogenously, i.e. predetermined
- e.g. poverty defined a priori as an income below 5,000
- Do not depend on distribution under investigation
- e.g. comparing mobility in 1990s and 2000s
- Incorporates both movements of positions of individuals and economic growth
16. Quantile transition matrix
- Mobility as a relative concept
- Same number of individuals in each class
- Only records movements involving reranking
- Cannot take account of economic growth, for example when comparing matrices
- Cannot draw a complete picture if comparing mobility in different cohorts/countries/welfare regimes
17. Mean/median transition matrices
- Both absolute and relative approaches incorporated into matrices
- Class boundaries defined as percentages of mean or median income of the origin and destination distributions
- Example
- 25%, 50%, 75% of median income
- Note that this is not the same as quartiles
18. Example: income 1991-1992
19. Category boundaries for each method
Matrix    Year  Category 1 (n)   Category 2 (n)     Category 3 (n)      Category 4 (n)
Size      1991  0 - 800 (580)    800 - 1500 (650)   1500 - 2200 (504)   2200 - 9231 (715)
Size      1992  0 - 800 (580)    800 - 1500 (645)   1500 - 2200 (473)   2200 - 10491 (751)
Quartile  1991  0 - 827 (609)    827 - 1511 (615)   1511 - 2365 (611)   2365 - 9231 (614)
Quartile  1992  0 - 862 (610)    862 - 1508 (612)   1508 - 2450 (612)   2450 - 10491 (615)
Mean      1991  0 - 887 (654)    887 - 1773 (814)   1773 - 2660 (506)   2660 - 9231 (475)
Mean      1992  0 - 898 (652)    898 - 1795 (766)   1795 - 2693 (501)   2693 - 10491 (530)
Median    1991  0 - 750 (539)    750 - 1500 (685)   1500 - 2250 (540)   2250 - 9231 (685)
Median    1992  0 - 746 (536)    746 - 1491 (686)   1491 - 2237 (505)   2262 - 10491 (722)
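The four ways of setting class boundaries can be compared on any income vector with a short Python sketch. Several things here are assumptions rather than the lecture's method: the function names are hypothetical, the fixed cut points 800, 1500 and 2200 are read off the size rows of the table, the shares 50%, 100% and 150% of the mean or median are inferred from the table (887 is about half of 1773, for instance), and the quartile rule is one of several conventions.

```python
from bisect import bisect_right
from statistics import mean, median, quantiles

def class_boundaries(incomes, size_cuts=(800, 1500, 2200), shares=(0.5, 1.0, 1.5)):
    """Interior cut points for a four-class transition matrix under each method.

    size_cuts and shares are illustrative defaults (assumptions), not fixed rules.
    """
    avg, med = mean(incomes), median(incomes)
    return {
        "size": list(size_cuts),                                   # fixed in advance
        "quartile": quantiles(incomes, n=4, method="inclusive"),   # equal-sized classes
        "mean": [s * avg for s in shares],                         # shares of the mean
        "median": [s * med for s in shares],                       # shares of the median
    }

def assign_class(income, cuts):
    """Class index 0..len(cuts); an income equal to a cut goes to the upper class."""
    return bisect_right(cuts, income)

toy_incomes = [400, 900, 1600, 3000, 1200]
for method, cuts in class_boundaries(toy_incomes).items():
    print(method, cuts, [assign_class(y, cuts) for y in toy_incomes])
```

Applying the same function to the origin year and the destination year separately gives year-specific boundaries for the quartile, mean and median methods, while the size boundaries stay fixed.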
20. Warning!
- Measurement error
- Causes an over-estimation of mobility
- If a mother's and baby's weights are reported to the nearest half pound, this can affect which band the observation falls in
- A respondent may describe their marital status as separated in year 1 and single in year 2
21. Finally...
- Greater challenges to understanding and checking panel data
- Transition matrices are a good way to summarise mobility patterns
- Different methods of constructing matrices lead to distinct interpretations
- May need to take account of measurement error when modelling change
23. SC968 Panel data methods for sociologists, Lecture 2, part 2
- Concepts for panel data analysis
24. Overview
- Types of variables: time-invariant, time-varying and trend
- Between- and within-individual variation
- Concept of individual heterogeneity
- Within and between estimators
- Basic properties of fixed and random effects models
- The basics of these models' implementation in Stata
25. Types of variable
- Those which vary between individuals but hardly ever over time
- Sex
- Ethnicity
- Parents' social class when you were 14
- The type of primary school you attended (once you've become an adult)
- Those which vary over time, but not between individuals
- The retail price index
- National unemployment rates
- Age, in a cohort study
- Those which vary both over time and between individuals
- Income
- Health
- Psychological wellbeing
- Number of children you have
- Marital status
- Trend variables
- Vary between individuals and over time, but in highly predictable ways
- Age
- Year
26. Between- and within-individual variation
- If you have a sample with repeated observations on the same individuals, there are two sources of variance within the sample
- The fact that individuals are systematically different from one another (between-individual variation)
- The fact that individuals' behaviour varies between observations (within-individual variation)
- Not nearly as scary as it looks
- Total variation is the sum, over all individuals and years, of the square of the difference between each observation of x and the mean
- Within variation is the sum of the squares of the deviations of each individual's observations from his or her own mean
- Between variation is the sum of squares of differences between individual means and the whole-sample mean
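The slide's formulas did not survive in this transcript. The three verbal definitions above correspond to the following sums of squares, written in assumed notation: x_it is the value for individual i at time t, T_i is the number of observations on individual i, x-bar_i is that individual's own mean and x-bar is the whole-sample mean.

```latex
\begin{aligned}
\text{Total:}   \quad & \sum_{i=1}^{n}\sum_{t=1}^{T_i} \left(x_{it}-\bar{x}\right)^2 \\
\text{Within:}  \quad & \sum_{i=1}^{n}\sum_{t=1}^{T_i} \left(x_{it}-\bar{x}_i\right)^2 \\
\text{Between:} \quad & \sum_{i=1}^{n} T_i \left(\bar{x}_i-\bar{x}\right)^2
\end{aligned}
```

With the between term weighted by T_i (that is, summed over every person-year), total variation equals within plus between exactly, because deviations from an individual's own mean sum to zero within each individual.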
27. xtsum in Stata
- Similar to ordinary sum command
Have chosen a balanced sample
All variation is between
Most variation is between, because it's fairly rare to switch between having and not having a partner
All variation is within, because this is a balanced sample
28. More on xtsum
Observations with non-missing variable
Number of individuals
Average number of time-points
Between: min and max refer to the individual means, xi-bar
Within: min and max refer to individual deviations from own averages, with the global average added back in
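The overall, between and within standard deviations that xtsum reports can be approximated with a short Python sketch. The function name xtsum_like and the toy data are hypothetical, and the divisors (observations minus one for overall and within, individuals minus one for between) are my reading of how xtsum scales these quantities, so they should be checked against the Stata manual before relying on exact agreement.

```python
from collections import defaultdict
from math import sqrt

def xtsum_like(records):
    """Overall, between and within SDs from (pid, value) person-wave records.

    Needs at least two individuals and two observations in total.
    """
    by_pid = defaultdict(list)
    for pid, value in records:
        by_pid[pid].append(value)

    values = [v for vs in by_pid.values() for v in vs]
    n_obs, n_ind = len(values), len(by_pid)
    grand_mean = sum(values) / n_obs
    ind_mean = {pid: sum(vs) / len(vs) for pid, vs in by_pid.items()}

    overall_ss = sum((v - grand_mean) ** 2 for v in values)
    between_ss = sum((m - grand_mean) ** 2 for m in ind_mean.values())
    within_ss = sum((v - ind_mean[pid]) ** 2 for pid, vs in by_pid.items() for v in vs)

    return {
        "n_obs": n_obs,              # observations with a non-missing value
        "n_individuals": n_ind,      # number of individuals
        "t_bar": n_obs / n_ind,      # average number of time points
        "overall_sd": sqrt(overall_ss / (n_obs - 1)),
        "between_sd": sqrt(between_ss / (n_ind - 1)),
        "within_sd": sqrt(within_ss / (n_obs - 1)),
    }

# A time-invariant variable (e.g. sex) and a wave counter in a balanced sample.
sex = [(1, 0), (1, 0), (2, 1), (2, 1)]
wave = [(1, 1), (1, 2), (2, 1), (2, 2)]
print(xtsum_like(sex))
print(xtsum_like(wave))
```

The two toy variables reproduce the extremes noted on the previous slide: a time-invariant variable has no within variation, and a wave counter in a balanced sample has no between variation.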
29. The xttab command
For simplicity, omitted jbstat categories of missing, maternity leave, gov training and other.
Overall: pooled sample, broken down by person/years
Within: of those who spent any time in this state, the proportion of their time (on average) they spent in it
Between: number of people who spent any time in this state
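The three panels described above can be reproduced with a small Python sketch. It only mirrors the verbal definitions on this slide: the function name xttab_like and the toy records are hypothetical, and counts are returned rather than the percentages Stata prints.

```python
from collections import Counter, defaultdict

def xttab_like(records):
    """Overall, between and within summaries from (pid, state) person-wave records."""
    states_by_pid = defaultdict(list)
    for pid, state in records:
        states_by_pid[pid].append(state)

    # Overall: person-years spent in each state, pooled over everyone.
    overall = Counter(s for states in states_by_pid.values() for s in states)

    table = {}
    for state, person_years in overall.items():
        # Share of own time in the state, for people who were ever in it.
        shares = [
            states.count(state) / len(states)
            for states in states_by_pid.values()
            if state in states
        ]
        table[state] = {
            "overall": person_years,
            "between": len(shares),               # people ever in this state
            "within": sum(shares) / len(shares),  # mean share of time, given ever
        }
    return table

toy = [
    (1, "empl"), (1, "empl"), (1, "unemp"), (1, "empl"),
    (2, "empl"), (2, "empl"),
    (3, "unemp"), (3, "unemp"),
]
for state, row in xttab_like(toy).items():
    print(state, row)
```

The between counts can add up to more than the number of people, since one person may pass through several states.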
30. Individual heterogeneity
- A very simple concept: people are different!
- In social science, when we talk about heterogeneity, we are really talking about unobservable (or unobserved) heterogeneity.
- Observed heterogeneity: differences in education levels, or parental background, or anything else that we can measure and control for in regressions
- Unobserved heterogeneity: anything which is fundamentally unmeasurable, or which is rather poorly measured, or which does not happen to be measured in the particular data set we are using.
- Example: the relationship between employment and children
- We know that women who have more children are less likely to go out to work, and if they do go out to work they work fewer hours.
- A priori, this isn't obvious: women with children do face a higher opportunity cost of work, but one could also argue that they need more money
- What is the causal relationship? Does having lots of children cause women to do less paid work?
- Or are women who have lots of children a fundamentally different type, with different sets of preferences?
31. Unobserved heterogeneity
- Extend the OLS equation we used in Week 1, breaking the error term down into two components: one representing the unobservable characteristics of the person, and the other representing genuine error.
- In cross-sectional analysis, there is no way of distinguishing between the two.
- But in panel data analysis, we have repeated observations, and this allows us to distinguish between them.
32. Within and between estimators
ui: individual-specific, fixed over time
ei: varies over time, usual assumptions apply (mean zero, homoscedastic, uncorrelated with x or u or itself)
This is the between estimator
And this is the within estimator (fixed effects)
θ measures the weight given to between-group variation, and is derived from the variances of ui and ei
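The equations these annotations point to are missing from the transcript. A standard way of writing them, following the notation in Stata's xtreg documentation rather than the original slide, is sketched below. The time-varying error is written e_it here (the slides abbreviate it to ei), and the particular form of θ is an assumption based on Stata's random-effects convention for a balanced panel with T waves.

```latex
\begin{aligned}
\text{Model:}          \quad & y_{it} = \alpha + x_{it}\beta + u_i + e_{it} \\
\text{Between:}        \quad & \bar{y}_i = \alpha + \bar{x}_i\beta + u_i + \bar{e}_i \\
\text{Within (FE):}    \quad & y_{it}-\bar{y}_i = (x_{it}-\bar{x}_i)\beta + (e_{it}-\bar{e}_i) \\
\text{Random effects:} \quad & y_{it}-\theta\bar{y}_i = (1-\theta)\alpha + (x_{it}-\theta\bar{x}_i)\beta
                               + \left\{(1-\theta)u_i + (e_{it}-\theta\bar{e}_i)\right\} \\
\text{with}            \quad & \theta = 1 - \sqrt{\frac{\sigma_e^2}{T\sigma_u^2 + \sigma_e^2}}
\end{aligned}
```

Subtracting the between equation from the model removes u_i entirely, which is why the within estimator is unaffected by time-invariant unobserved heterogeneity. In this parameterisation θ = 0 reduces random effects to pooled OLS and θ = 1 reduces it to the within estimator.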
33. Fixed effects (within estimator)
- Ignores between-group variation, so it's an inefficient estimator
- However, few assumptions are required for FE to be consistent
- Disadvantage: can't estimate the effects of any time-invariant variables
34. Between estimator
- Not much used
- Except to calculate the θ parameter for random effects, but Stata does this, not you!
- It's inefficient compared to random effects
- It doesn't use as much information as is available in the data (only uses means)
- Assumption required: that vi is uncorrelated with xi
- Easy to see why: if they were correlated, how could one decide how much of the variation in y to attribute to the x's (via the betas) as opposed to the correlation?
- Can't estimate effects of variables where mean is invariant over individuals
- Age in a cohort study
- Macro-level variables
35. Random effects estimator
- Assumption required: that ui is uncorrelated with xi
- Rather heroic assumption: think of examples
- Will see a test for this later
- Uses both within- and between-group variation, so makes best use of the data and is efficient
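Why the uncorrelatedness assumption matters can be seen in a toy example with a single regressor. The Python sketch below is an illustration under stated assumptions, not the lecture's code: the function names and the constructed data are hypothetical, there is no random error, and the individual effect is deliberately built to be correlated with x.

```python
from collections import defaultdict

def _slope(xs, ys):
    """Simple-regression slope of ys on xs."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

def within_between_slopes(records):
    """Within and between slopes from (pid, x, y) person-wave records."""
    by_pid = defaultdict(list)
    for pid, x, y in records:
        by_pid[pid].append((x, y))

    x_means, y_means, x_dev, y_dev = [], [], [], []
    for obs in by_pid.values():
        mx = sum(x for x, _ in obs) / len(obs)
        my = sum(y for _, y in obs) / len(obs)
        x_means.append(mx)                      # between: one point per person
        y_means.append(my)
        x_dev.extend(x - mx for x, _ in obs)    # within: deviations from own mean
        y_dev.extend(y - my for _, y in obs)

    return {
        "within": _slope(x_dev, y_dev),
        "between": _slope(x_means, y_means),
    }

# True effect of x on y is 2, but u_i = -20 * i falls as the person's x level rises.
toy = [(i, 5 * i + t, 2 * (5 * i + t) - 20 * i) for i in range(3) for t in range(3)]
print(within_between_slopes(toy))
```

In this constructed case the within slope recovers the true effect of 2, while the between slope is -2 because it attributes the individual effect to x. Random effects combines the two sources of variation, so it would sit between them and inherit part of that bias.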
36. Estimating fixed effects in Stata
R-square-like statistic
Peaks at age 48
u and e are the two parts of the error term
Talk about xtmixed
37. Between regression
- Not much used, but useful to compare coefficients with fixed effects
The coefficient on partner was negative and significant in the FE model. In FE, the partner coefficient really measures the events of gaining or losing a partner
38. Random effects regression
Option theta gives a summary of weights
39. And what about OLS?
- OLS simply treats within- and between-group variation as the same
- Pools data across waves
40. Test whether pooling data is valid
- If the ui do not vary between individuals, they can be treated as part of the constant, a, and OLS is fine.
- Breusch-Pagan Lagrange multiplier test
- H0: variance of ui = 0
- H1: variance of ui not equal to zero
- If H0 is not rejected, you can pool the data and use OLS
- Post-estimation test after random effects
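For a balanced panel, the textbook Breusch-Pagan LM statistic can be computed directly from pooled OLS residuals. The Python sketch below uses that textbook form; the function name and the toy residuals are hypothetical, and Stata's post-estimation command may report a modified version, so the numbers need not match Stata output exactly.

```python
def breusch_pagan_lm(residuals_by_individual):
    """Textbook Breusch-Pagan LM statistic for H0: variance of u_i = 0.

    residuals_by_individual: one list of pooled-OLS residuals per individual.
    Assumes a balanced panel (the same number of waves T >= 2 for everyone).
    Under H0 the statistic is asymptotically chi-squared with 1 degree of freedom.
    """
    n = len(residuals_by_individual)
    t = len(residuals_by_individual[0])
    if t < 2 or any(len(r) != t for r in residuals_by_individual):
        raise ValueError("needs a balanced panel with at least two waves")

    # Sum over individuals of (sum of own residuals) squared.
    sum_sq_of_sums = sum(sum(r) ** 2 for r in residuals_by_individual)
    # Sum of all squared residuals.
    sum_of_sq = sum(e ** 2 for r in residuals_by_individual for e in r)

    return n * t / (2 * (t - 1)) * (sum_sq_of_sums / sum_of_sq - 1) ** 2

# Residuals that cluster by person: each person is consistently above or below the line.
clustered = [[1, 1, 1], [-1, -1, -1]]
print(breusch_pagan_lm(clustered))
```

The statistic is compared with a chi-squared distribution with one degree of freedom, whose 5% critical value is about 3.84; a larger value is evidence against pooling.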
41. Comparing models
- Compare coefficients between models
- Reasonably similar; differences in partner and badhealth coeffs
- R-squareds are similar
- Within and between estimators maximise within and between R-squared respectively.