SC968 Panel data methods for sociologists Lecture 2, part 1 - PowerPoint PPT Presentation

Description:

Types of longitudinal data. Types of analysis possible with panel data. Data management merging, appending, long and wide ... capture drop dif* sort pid wave ... – PowerPoint PPT presentation

Slides: 42
Provided by: mariai
1
SC968 Panel data methods for sociologists: Lecture 2, part 1
Introducing panel data
2
Overview
  • Panel data
  • What it is
  • How to get to know the data
  • Change over time
  • Tabulating
  • Calculating transition probabilities

3
What is panel data?
  • A data set containing observations on multiple
    phenomena observed at a single point in time is
    called cross-sectional data
  • A data set containing observations on a single
    phenomenon observed over multiple time periods is
    called time series data
  • Observations on multiple phenomena over multiple
    time periods are panel data
  • Cross-sectional and time series data are
    one-dimensional; panel data are two-dimensional

4
Using panel data in Stata
  • Data on n cases, over t time periods, giving a
    total of n × t observations
  • One record per observation
  • i.e. long format
  • Stata tools for analyzing panel data begin with
    the prefix xt
  • First need to tell Stata that you have panel data
    using xtset
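As a rough illustration of the long format described above, the sketch below reshapes a small wide data set (one row per person) into one record per person-wave. The variable names (`pid`, `income1` to `income3`) and the values are hypothetical, and the helper `wide_to_long` is written only for this example. In Stata itself the `reshape long` command performs this step.

```python
# Hypothetical wide-format data: one row per person, one income column per wave.
wide = [
    {"pid": 1, "income1": 900, "income2": 950, "income3": 1000},
    {"pid": 2, "income1": 1500, "income2": None, "income3": 1600},  # wave 2 missing
]

def wide_to_long(rows, stub, waves):
    """Return one record per person-wave, skipping waves with no observation."""
    long_rows = []
    for row in rows:
        for wave in waves:
            value = row.get(f"{stub}{wave}")
            if value is not None:
                long_rows.append({"pid": row["pid"], "wave": wave, stub: value})
    return long_rows

long_data = wide_to_long(wide, "income", waves=[1, 2, 3])
for record in long_data:
    print(record)
```

With complete data there would be n × t = 6 records; because person 2 is missing in wave 2, this unbalanced example yields fewer.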

5
Complete and incomplete person-wave data
6
Telling Stata you have time series data
Unique cross-wave identifier: pid
Time variable: wave
. xtset pid wave
       panel variable:  pid (unbalanced)
        time variable:  wave, 1 to 15, but with gaps
                delta:  1 unit
7
Cases not observed for every time period
. xtset pid wave
       panel variable:  pid (unbalanced)
        time variable:  wave, 1 to 15, but with gaps
                delta:  1 unit
delta: period between observations, in units of the time variable
8
Describing the patterns in panel data
9
Examining change over two waves
10
Calculating transition probabilities
  • The transition probability is the probability of
    transitioning from one state to another

So, to calculate by hand:
transition probability = cell count / row total
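The hand calculation (cell count divided by row total) can be sketched in a few lines. The records below are hypothetical person-wave observations of employment state, and `transition_probabilities` is an illustrative helper, not a Stata command. Transitions are only counted between consecutive waves t and t+1, so a gap in a person's record contributes nothing.

```python
from collections import Counter, defaultdict

# Hypothetical long-format records: (pid, wave, state).
records = [
    (1, 1, "empl"), (1, 2, "empl"), (1, 3, "unemp"),
    (2, 1, "unemp"), (2, 2, "empl"), (2, 3, "empl"),
    (3, 1, "olf"), (3, 2, "olf"), (3, 3, "empl"),
    (4, 1, "empl"), (4, 3, "empl"),  # wave 2 missing, so no transition is counted
]

def transition_probabilities(rows):
    """Row-normalised transition matrix for consecutive waves t -> t+1."""
    state_at = {(pid, wave): state for pid, wave, state in rows}
    counts = defaultdict(Counter)
    for (pid, wave), origin in state_at.items():
        destination = state_at.get((pid, wave + 1))
        if destination is not None:
            counts[origin][destination] += 1  # cell count
    return {
        origin: {dest: n / sum(row.values()) for dest, n in row.items()}  # / row total
        for origin, row in counts.items()
    }

matrix = transition_probabilities(records)
for origin, row in matrix.items():
    print(origin, row)
```

Each row of the resulting matrix sums to one, because every origin state is divided by its own row total.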
11
Transition probability matrix
12
Transition probability matrices in Stata
Mean transition probabilities for all waves t to
t+1 when you leave out the if statement
13
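Change in a categorical variable over time: a decision tree
[Figure: decision tree of transitions between three states (empl, unemp, olf). The branch layout is not recoverable from the transcript. Probabilities appear in this order: 0.91, 0.03, 0.06, 0.90, 0.26, 0.03, 0.49, 0.25, 0.04, 0.10, 0.03, 0.87.]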
14
Change in a continuous variable over time
  • Size transition matrix
  • Quantile transition matrix
  • Mean transition matrix
  • Median transition matrix

15
Size transition matrix
  • Absolute mobility
  • e.g. movement in and out of poverty
  • Boundaries set exogenously, i.e. predetermined
  • e.g. poverty defined a priori as an income below
    5,000
  • Do not depend on distribution under investigation
  • e.g. comparing mobility in the 1990s and 2000s
  • Incorporates both movements of positions of
    individuals and economic growth

16
Quantile transition matrix
  • Mobility as a relative concept
  • Same number of individuals in each class
  • Only records movements involving reranking
  • Cannot take account of economic growth, for
    example when comparing matrices
  • Cannot draw a complete picture if comparing
    mobility in different cohorts/countries/welfare
    regimes

17
Mean/median transition matrices
  • Both absolute and relative approaches
    incorporated into matrices
  • Class boundaries defined as percentages of mean
    or median income of the origin and destination
    distributions
  • Example
  • 25%, 50%, 75% of median income
  • Note that this is not the same as quartiles
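To make the difference between the four kinds of matrix concrete, the sketch below computes class boundaries for one wave under each method. The incomes are invented, the fixed cut points (800, 1500, 2200) and the shares (25%, 50%, 75%) are parameters chosen for illustration, and `class_boundaries` and `classify` are hypothetical helpers. Quartiles come from Python's `statistics.quantiles`, whose default interpolation may differ slightly from other packages.

```python
import statistics

# Hypothetical incomes for one wave (not the lecture's data).
incomes = [400, 650, 900, 1200, 1500, 1800, 2100, 2600, 3500, 6000]

def class_boundaries(values, fixed=(800, 1500, 2200), shares=(0.25, 0.5, 0.75)):
    """Cut points separating four classes under each method."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    return {
        "size": list(fixed),                             # set exogenously
        "quartile": statistics.quantiles(values, n=4),   # equal-sized classes
        "mean": [share * mean for share in shares],      # percentages of the mean
        "median": [share * median for share in shares],  # percentages of the median
    }

def classify(value, cuts):
    """Class index 0-3: the number of cut points at or below the value."""
    return sum(value >= cut for cut in cuts)

boundaries = class_boundaries(incomes)
for method, cuts in boundaries.items():
    print(method, cuts, [classify(v, cuts) for v in incomes])
```

Only the size boundaries stay fixed when the income distribution changes; the other three move with the data, which is why the resulting matrices support different interpretations.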

18
Example: income 1991-1992
19
Category boundaries for each method
Matrix    Year  Boundary 1 (n)  Boundary 2 (n)    Boundary 3 (n)     Boundary 4 (n)
Size      1991  0 - 800 (580)   800 - 1500 (650)  1500 - 2200 (504)  2200 - 9231 (715)
          1992  0 - 800 (580)   800 - 1500 (645)  1500 - 2200 (473)  2200 - 10491 (751)
Quartile  1991  0 - 827 (609)   827 - 1511 (615)  1511 - 2365 (611)  2365 - 9231 (614)
          1992  0 - 862 (610)   862 - 1508 (612)  1508 - 2450 (612)  2450 - 10491 (615)
Mean      1991  0 - 887 (654)   887 - 1773 (814)  1773 - 2660 (506)  2660 - 9231 (475)
          1992  0 - 898 (652)   898 - 1795 (766)  1795 - 2693 (501)  2693 - 10491 (530)
Median    1991  0 - 750 (539)   750 - 1500 (685)  1500 - 2250 (540)  2250 - 9231 (685)
          1992  0 - 746 (536)   746 - 1491 (686)  1491 - 2237 (505)  2262 - 10491 (722)
20
Warning!
  • Measurement error
  • Causes an over-estimation of mobility
  • If mothers' and babies' weights are reported to
    the nearest half pound, rounding can affect which
    band an observation falls in
  • A respondent may describe their marital status as
    "separated" in year 1 and "single" in year 2

21
Finally..
  • Greater challenges to understanding and checking
    panel data
  • Transition matrices are a good way to summarise
    mobility patterns
  • Different methods of constructing matrices lead
    to distinct interpretations
  • May need to take account of measurement error
    when modelling change

23
SC968 Panel data methods for sociologists: Lecture 2, part 2
  • Concepts for panel data analysis

24
Overview
  • Types of variables: time-invariant, time-varying
    and trend
  • Between- and within-individual variation
  • Concept of individual heterogeneity
  • Within and between estimators
  • Basic properties of fixed and random effects
    models
  • The basics of implementing these models in
    Stata

25
Types of variable
  • Those which vary between individuals but hardly
    ever over time
      • Sex
      • Ethnicity
      • Parents' social class when you were 14
      • The type of primary school you attended (once
        you've become an adult)
  • Those which vary over time, but not between
    individuals
      • The retail price index
      • National unemployment rates
      • Age, in a cohort study
  • Those which vary both over time and between
    individuals
      • Income
      • Health
      • Psychological wellbeing
      • Number of children you have
      • Marital status
  • Trend variables
      • Vary between individuals and over time, but in
        highly predictable ways
      • Age
      • Year

26
Between- and within-individual variation
  • If you have a sample with repeated observations
    on the same individuals, there are two sources of
    variance within the sample
  • The fact that individuals are systematically
    different from one another (between-individual
    variation)
  • The fact that individuals' behaviour varies
    between observations (within-individual
    variation)
  • Not nearly as scary as it looks
  • Total variation is the sum, over all individuals
    and years, of the square of the difference
    between each observation of x and the mean
  • Within variation is the sum of the squared
    deviations of each individual's observations from
    his or her own mean
  • Between variation is the sum of squares of
    differences between individual means and the
    whole-sample mean
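The three verbal definitions can be written out as sums of squares. The notation below is an assumption, since the slide's own formulas did not survive in the transcript: x_it is the observation for individual i at time t, T_i is the number of observations on individual i, and N is the total number of person-period observations.

```latex
\begin{align*}
\bar{x}_i &= \frac{1}{T_i}\sum_{t} x_{it}, \qquad
\bar{x} = \frac{1}{N}\sum_{i}\sum_{t} x_{it}, \qquad N = \sum_{i} T_i \\[4pt]
\text{Total:}\quad   & \sum_{i}\sum_{t} \left(x_{it} - \bar{x}\right)^2 \\
\text{Within:}\quad  & \sum_{i}\sum_{t} \left(x_{it} - \bar{x}_i\right)^2 \\
\text{Between:}\quad & \sum_{i}\sum_{t} \left(\bar{x}_i - \bar{x}\right)^2
                       = \sum_{i} T_i \left(\bar{x}_i - \bar{x}\right)^2
\end{align*}
```

When the between term is summed over every person-period in this way, total variation equals within plus between, because the deviations of each individual's observations from his or her own mean sum to zero.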

27
xtsum in Stata
  • Similar to ordinary sum command

Have chosen a balanced sample
All variation is between
Most variation is between, because it's fairly
rare to switch between having and not having a
partner
All variation is within, because this is a
balanced sample
28
More on xtsum.
N: observations with non-missing variable
n: number of individuals
T-bar: average number of time-points
Between: min and max refer to x̄_i
Within: min and max refer to individual deviation from own
averages, with global averages added back in.
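The quantities annotated above can be reproduced by hand on a toy panel. This is a minimal sketch with made-up values, and `xtsum_ranges` is a hypothetical helper that mimics only the counts and the min/max columns as described on the slide (standard deviations are left out).

```python
from collections import defaultdict
from statistics import mean

# Hypothetical long-format observations: (pid, wave, x).
obs = [
    (1, 1, 10.0), (1, 2, 14.0),
    (2, 1, 20.0), (2, 2, 20.0),
    (3, 1, 30.0), (3, 2, 26.0),
]

def xtsum_ranges(rows):
    """Counts plus between and within min/max, following the slide's description."""
    by_person = defaultdict(list)
    for pid, _wave, x in rows:
        by_person[pid].append(x)
    grand_mean = mean(x for _pid, _wave, x in rows)
    person_means = {pid: mean(xs) for pid, xs in by_person.items()}
    # Within: deviation from own average, with the global average added back in.
    within = [x - person_means[pid] + grand_mean for pid, _wave, x in rows]
    return {
        "N": len(rows),
        "n": len(by_person),
        "T_bar": len(rows) / len(by_person),
        "mean": grand_mean,
        "between_min_max": (min(person_means.values()), max(person_means.values())),
        "within_min_max": (min(within), max(within)),
    }

summary = xtsum_ranges(obs)
print(summary)
```

Person 2 never changes, so that person contributes only to the between range; persons 1 and 3 drive the within range.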
29
The xttab command
For simplicity, omitted jbstats of missing,
maternity leave, gov training and other.
Overall: pooled sample, broken down by person/years
Within: of those who spent any time in this state, the
proportion of their time (on average) they spent
in it.
Between: number of people who spent any time in this state
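The three readings of the table can be illustrated on a small invented panel. The helper `xttab_summary` below is hypothetical and follows the slide's verbal definitions; in particular, the within share is taken here as a simple unweighted average across people, which is an assumption.

```python
from collections import Counter, defaultdict

# Hypothetical long-format records: (pid, wave, state).
rows = [
    (1, 1, "empl"), (1, 2, "empl"), (1, 3, "unemp"), (1, 4, "empl"),
    (2, 1, "olf"), (2, 2, "olf"), (2, 3, "olf"), (2, 4, "olf"),
    (3, 1, "unemp"), (3, 2, "empl"), (3, 3, "empl"), (3, 4, "empl"),
]

def xttab_summary(data):
    """Overall person-years, people ever in each state, and mean share of time in it."""
    overall = Counter(state for _pid, _wave, state in data)
    per_person = defaultdict(Counter)
    for pid, _wave, state in data:
        per_person[pid][state] += 1
    summary = {}
    for state, person_years in overall.items():
        # Share of own time spent in the state, only for people ever observed in it.
        shares = [c[state] / sum(c.values()) for c in per_person.values() if c[state] > 0]
        summary[state] = {
            "overall_person_years": person_years,
            "between_people": len(shares),
            "within_share": sum(shares) / len(shares),
        }
    return summary

table = xttab_summary(rows)
for state, stats in table.items():
    print(state, stats)
```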
30
Individual heterogeneity
  • A very simple concept: people are different!
  • In social science, when we talk about
    heterogeneity, we are really talking about
    unobservable (or unobserved) heterogeneity.
  • Observed heterogeneity: differences in education
    levels, or parental background, or anything else
    that we can measure and control for in
    regressions
  • Unobserved heterogeneity: anything which is
    fundamentally unmeasurable, or which is rather
    poorly measured, or which does not happen to be
    measured in the particular data set we are using.
  • Example: the relationship between employment and
    children
  • We know that women who have more children are
    less likely to go out to work, and if they do go
    out to work they work fewer hours.
  • A priori, this isn't obvious: women with
    children do face a higher opportunity cost of
    work, but one could also argue that they need
    more money
  • What is the causal relationship? Does having lots
    of children cause women to do less paid work?
  • Or are women who have lots of children a
    fundamentally different type, with different sets
    of preferences?

31
Unobserved heterogeneity
  • Extend the OLS equation we used in Week 1,
    breaking the error term down into two components:
    one representing the unobservable characteristics
    of the person, and the other representing genuine
    error.
  • In cross-sectional analysis, there is no way of
    distinguishing between the two.
  • But in panel data analysis, we have repeated
    observations, and this allows us to distinguish
    between them.
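The slide's equation is missing from the transcript, so the following is a reconstruction under assumed notation rather than the lecturer's exact formula. It uses α for the intercept, u_i for the person-specific component and e_it for the genuine error, matching the symbols that appear elsewhere in the lecture.

```latex
\begin{align*}
\text{Cross-sectional OLS:}\quad & y_i = \alpha + \beta x_i + \varepsilon_i \\
\text{Panel, error decomposed:}\quad & y_{it} = \alpha + \beta x_{it} + u_i + e_{it}
\end{align*}
```

With a single observation per person, u_i and e_it cannot be separated; with repeated observations, u_i is the part of the error that stays constant across a person's records.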

32
Within and between estimators
u_i: individual-specific, fixed over time
e_it: varies over time, usual assumptions apply (mean
zero, homoscedastic, uncorrelated with x or u or
itself)
This is the between estimator
And this is the within estimator (fixed
effects)
θ measures the weight given to between-group
variation, and is derived from the variances of
u_i and e_it
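The equations these annotations point to are not in the transcript. A standard textbook form, given here as an assumed reconstruction with individual means written as bars, starts from y_it = α + β x_it + u_i + e_it:

```latex
\begin{align*}
\text{Between:}\quad & \bar{y}_i = \alpha + \beta \bar{x}_i + u_i + \bar{e}_i \\
\text{Within (fixed effects):}\quad &
  y_{it} - \bar{y}_i = \beta\,(x_{it} - \bar{x}_i) + (e_{it} - \bar{e}_i) \\
\text{Random effects:}\quad &
  y_{it} - \theta\bar{y}_i = (1-\theta)\,\alpha + \beta\,(x_{it} - \theta\bar{x}_i)
  + \left[(1-\theta)\,u_i + (e_{it} - \theta\bar{e}_i)\right]
\end{align*}
```

Subtracting the between equation from the original model removes u_i entirely, which is why the within estimator needs no assumption about how u_i relates to x. Setting θ = 0 in the last line gives pooled OLS and θ = 1 gives the within estimator.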
33
Fixed effects (within estimator)
  • Ignores between-group variation, so it's an
    inefficient estimator
  • However, few assumptions are required for FE to
    be consistent
  • Disadvantage: can't estimate the effects of any
    time-invariant variables
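A small numerical sketch may help show what the within transformation does. The panel below is invented: y is generated as 2x plus a person effect, with no other error, and the person with higher x has the lower person effect. The functions `ols_slope` and `within_slope` are illustrative helpers for a single regressor, not Stata's implementation (in Stata the model would be fitted with `xtreg` and the `fe` option).

```python
from collections import defaultdict
from statistics import mean

# Hypothetical panel, (pid, x, y), generated as y = 2*x + u_i with u_1 = 5, u_2 = -3.
panel = [
    (1, 1.0, 7.0), (1, 2.0, 9.0), (1, 3.0, 11.0),
    (2, 4.0, 5.0), (2, 5.0, 7.0), (2, 6.0, 9.0),
]

def ols_slope(xs, ys):
    """Slope from a simple regression of ys on xs."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx

def within_slope(rows):
    """Demean x and y within each person, then run OLS on the deviations."""
    by_person = defaultdict(list)
    for pid, x, y in rows:
        by_person[pid].append((x, y))
    dx, dy = [], []
    for values in by_person.values():
        x_bar = mean(x for x, _ in values)
        y_bar = mean(y for _, y in values)
        dx += [x - x_bar for x, _ in values]
        dy += [y - y_bar for _, y in values]
    return ols_slope(dx, dy)

pooled = ols_slope([x for _, x, _ in panel], [y for _, _, y in panel])
fe = within_slope(panel)
print("pooled OLS slope:", pooled)
print("within (FE) slope:", fe)
```

Pooled OLS mixes the person effect into the slope, while demeaning removes it. The same demeaning would turn any time-invariant regressor into a column of zeros, which is why its effect cannot be estimated.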

34
Between estimator
  • Not much used
  • Except to calculate the θ parameter for random
    effects, but Stata does this, not you!
  • It's inefficient compared to random effects
  • It doesn't use as much information as is
    available in the data (only uses means)
  • Assumption required: v_i is uncorrelated with
    x_i
  • Easy to see why: if they were correlated, how
    could one decide how much of the variation in y
    to attribute to the x's (via the betas) as
    opposed to the correlation?
  • Can't estimate effects of variables where mean is
    invariant over individuals
  • Age in a cohort study
  • Macro-level variables

35
Random effects estimator
  • Assumption required: u_i is uncorrelated with
    x_i
  • Rather heroic assumption: think of examples
  • Will see a test for this later
  • Uses both within- and between-group variation, so
    makes best use of the data and is efficient
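How random effects combines the two sources of variation is governed by a weight, usually written θ. The sketch below uses the standard balanced-panel formula, θ = 1 - sqrt(σ_e² / (T σ_u² + σ_e²)); the function name `re_theta` and the example values are hypothetical.

```python
import math

def re_theta(sigma_u, sigma_e, n_periods):
    """Random-effects weight for a balanced panel with n_periods waves.

    theta = 1 - sqrt(sigma_e^2 / (T * sigma_u^2 + sigma_e^2))
    theta = 0 reproduces pooled OLS; theta close to 1 approaches fixed effects.
    """
    return 1 - math.sqrt(sigma_e**2 / (n_periods * sigma_u**2 + sigma_e**2))

# No person-level variance: nothing to separate, so theta is zero.
print(re_theta(sigma_u=0.0, sigma_e=1.0, n_periods=5))
# Equal variances, three waves.
print(re_theta(sigma_u=1.0, sigma_e=1.0, n_periods=3))
# More waves or a larger person-level variance push theta towards one.
print(re_theta(sigma_u=2.0, sigma_e=1.0, n_periods=15))
```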

36
Estimating fixed effects in Stata
R-square-like statistic
Peaks at age 48
u and e are the two parts of the error term
Talk about xtmixed
37
Between regression
  • Not much used, but useful to compare coefficients
    with fixed effects

Coefficient on partner was negative and
significant in FE model. In FE, the partner
coeff really measures the events of gaining or
losing a partner
38
Random effects regression
Option theta gives a summary of weights
39
And what about OLS?
  • OLS simply treats within- and between-group
    variation as the same
  • Pools data across waves

40
Test whether pooling data is valid
  • If the u_i do not vary between individuals, they
    can be treated as part of α and OLS is fine.
  • Breusch-Pagan Lagrange multiplier test
  • H0: Variance of u_i = 0
  • H1: Variance of u_i not equal to zero
  • If H0 is not rejected, you can pool the data and
    use OLS
  • Post-estimation test after random effects

41
Comparing models
  • Compare coefficients between models
  • Reasonably similar, with differences in the
    partner and badhealth coeffs
  • R-squareds are similar
  • Within and between estimators maximise within and
    between R-squared respectively.