Title: SC968 Panel data methods for sociologists Lecture 1, part 1
1. SC968: Panel data methods for sociologists. Lecture 1, part 1
- A review of concepts for regression modelling
- Or: things you should know already
2. Overview
- Models
- OLS, logit and probit
- Mathematically and practically
- Interpretation of results, measures of fit and regression diagnostics
- Model specification
- Post-estimation commands
- STATA competence
3. Ordinary Least Squares (OLS)

    y_i = β0 + β1·x1_i + β2·x2_i + ... + βK·xK_i + e_i

- y_i: value of the dependent variable for individual i (LHS variable)
- e_i: residual (disturbance, error term)
- β0: intercept (constant)
- Total no. of explanatory variables (RHS variables, or regressors) is K
- β1: coefficient on variable 1
- x1_i: value of explanatory variable 1 for person i
- Examples:
- y_i = mental health; x1 = sex, x2 = age, x3 = marital status, x4 = employment status, x5 = physical health
- y_i = hourly pay; x1 = sex, x2 = age, x3 = education, x4 = job tenure, x5 = industry, x6 = region
4. OLS
- In vector form: y_i = x_i′β + e_i
- In matrix form: y = Xβ + e
- x_i: vector of explanatory variables
- β: vector of coefficients
- Note: you will often see x_i′β written simply as xβ
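The course estimates these models in Stata, but the matrix form above can be illustrated directly. Below is a minimal numpy sketch of the least-squares solution β̂ = (X′X)⁻¹X′y on simulated data; all variable names and values here are made up for illustration.

```python
import numpy as np

# Simulate a simple "income on age and sex" style model
rng = np.random.default_rng(0)
n = 1000
age = rng.uniform(20, 60, n)
female = rng.integers(0, 2, n).astype(float)
e = rng.normal(0, 1, n)

# True model: y = 5 + 0.2*age - 1.5*female + e
y = 5 + 0.2 * age - 1.5 * female + e

# X = [constant, age, female]; solve the normal equations X'X b = X'y
X = np.column_stack([np.ones(n), age, female])
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
residuals = y - X @ beta_hat   # with a constant, these average to zero
```

With a constant included, the residuals sum to zero, which is the first assumption listed on the next slide.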
5. OLS
- Also called linear regression
- Assumes the dependent variable is a linear combination of the explanatory variables, plus a disturbance
- Least squares: the βs are estimated so as to minimise the sum of the squared e's
6. Assumptions
- Residuals have zero mean
- It follows that the e's and the X's are uncorrelated
- Violated if a regressor is endogenous
- E.g., number of children in female labour supply models
- Cure by (e.g.) Instrumental Variables
- Homoscedasticity: all e's have the same variance
- Classic example: food consumption and income
- Cure by using weighted least squares
7. When is OLS appropriate?
- When you have a continuous dependent variable
- E.g., you would use it to estimate regressions for height, but not for whether a person has a university degree
- When the assumptions are not obviously violated
- As a first step in research, to get ball-park estimates
- We will use it a lot for this purpose
- Worked examples
- Coefficients, p-values, t-statistics
- Measures of fit (R-squared, adjusted R-squared)
- Thinking about specification
- Post-estimation commands
- Regression diagnostics
- A note on the data
- All examples (in lectures and practicals) are drawn from a 20% sample of the British Household Panel Survey (BHPS); more about the data later!
8. Summarize monthly earned income
9. First worked example
For illustrative purposes only; not an example of good practice.
Monthly labour income, for people whose labour income is > 1
- MS = SS/df
- F test: tests whether all coefficients except the constant are jointly zero
- Analysis of variance (ANOVA) table
- R-squared = Model SS / Total SS
- Root MSE = sqrt(Residual MS)
- t-statistic = coefficient / standard error
- Confidence interval = coefficient ± 1.96 standard errors
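The quantities in Stata's regression header all come from the same sum-of-squares decomposition. As a sketch only (simulated data, illustrative names), the arithmetic is:

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 200, 2                      # k regressors plus a constant
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

beta = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ beta

total_ss = np.sum((y - y.mean()) ** 2)
resid_ss = np.sum(e ** 2)
model_ss = total_ss - resid_ss

r2 = model_ss / total_ss                 # R-squared = Model SS / Total SS
resid_ms = resid_ss / (n - k - 1)        # MS = SS / df
root_mse = np.sqrt(resid_ms)             # Root MSE = sqrt(Residual MS)

# t-statistic = coefficient / standard error
se = np.sqrt(np.diag(resid_ms * np.linalg.inv(X.T @ X)))
t = beta / se
ci_lower, ci_upper = beta - 1.96 * se, beta + 1.96 * se
```

(Stata's table uses the exact t distribution rather than 1.96 for the confidence interval, but with large samples the difference is negligible.)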
10. What do the results tell us?
- All coefficients except month of interview are significant
- 29% of variation explained
- Being female reduces income by nearly £600 per month
- Income goes up with age, and then down
- 16,458 observations... oops, this is from panel data, so there are repeated observations on individuals
11. Add ,cluster(pid) as an option
- Coefficients, R-squared etc. are unchanged from the previous specification
- But standard errors are adjusted: standard errors are larger, t-statistics lower
12. Let's get rid of the month variable
Think about the female coefficient a bit more. Could it be to do with women working shorter hours?
13. Control for weekly hours of work
- Is the coefficient on hours of work reasonable?
- £5.65 for every additional hour worked: certainly in the right ball park
14. Looking at 2 specifications together
- R-squared jumps from 29% to 46%
- Coefficient on female goes from -595 to -315
- Almost half the effect of gender is explained by women's shorter hours of work
- Age, partner and education coefficients are also reduced in magnitude, for similar reasons
- Number of observations falls from 16,460 to 13,998: missing data on hours
15. Interesting post-estimation activities
What age does income peak?

    Income = ... + β1·age + β2·age² + ...
    d(Income)/d(age) = β1 + 2·β2·age
    Derivative is zero when age = -β1/(2·β2) = -79.552/(2 × (-0.8732)) ≈ 45.5

Is the effect of university qualifications statistically different from the effect of secondary education?
16. A closer look at the couple coefficient
17.
- Men benefit much more than women from being in a couple.
- Other coefficients also differ between men and women, but with the current specification we can't test whether the differences are significant.
18. Logit and Probit
- Developed for discrete (categorical) dependent variables
- E.g., psychological morbidity, whether one has a job. Think of other examples.
- Outcome variable is always 0 or 1. Estimate Pr(y_i = 1) = F(x_i, β)
- OLS (the linear probability model) would set F(X, β) = Xβ + e
- Inappropriate because:
- Heteroscedasticity: the outcome variable is always 0 or 1, so e only takes the values -xβ or 1-xβ
- More seriously, one cannot constrain estimated probabilities to lie between 0 and 1
19. Logit and Probit
- Looking for a function which lies between 0 and 1
- Cumulative normal distribution: Probit model
- Logistic distribution: Logit (logistic) model
- They are very similar! Note how they lie between 0 and 1 (vertical axis)
20. Maximum likelihood estimation
- Likelihood function: the product of
- Pr(y=1) = F(xβ) over all observations where y=1
- Pr(y=0) = 1 - F(xβ) over all observations where y=0
- (think of the probability of flipping exactly four heads and two tails with six coins)
- Log likelihood: ln L = Σ[y=1] ln F(xβ) + Σ[y=0] ln(1 - F(xβ))
- Estimated using an iterative procedure
- STATA chooses starting values for the βs
- Computes the slope of the likelihood function at these values
- Adjusts the βs accordingly
- Stops when the slope of the LF is 0
- Can take time!
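Stata does this iteration internally, but the procedure can be sketched by hand. Below is a minimal Newton-iteration sketch for the logit model on simulated data (all values illustrative): start at zero, compute the slope (score) of the log-likelihood, step, and repeat until the slope is essentially zero.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
true_beta = np.array([-0.5, 1.0])
p = 1 / (1 + np.exp(-X @ true_beta))
y = (rng.uniform(size=n) < p).astype(float)

beta = np.zeros(2)                # starting values
for _ in range(25):               # iterate until the slope is ~0
    p_hat = 1 / (1 + np.exp(-X @ beta))
    score = X.T @ (y - p_hat)                      # slope of the log-likelihood
    W = p_hat * (1 - p_hat)
    hessian = -(X * W[:, None]).T @ X
    beta = beta - np.linalg.solve(hessian, score)  # Newton step

# Log likelihood at the estimate: sum of ln F and ln(1-F) terms
loglik = np.sum(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))
```

The logit log-likelihood is globally concave, so this converges quickly from any reasonable starting point; Stata's `ml` machinery is a more robust version of the same idea.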
21. Let's look at whether a person works

    gen byte work = (jbstat == 1 | jbstat == 2) if jbstat >= 1 & jbstat != .
22. Logit regression: whether have a job
- All the iterations
- LR chi2 = 2 × (LL of this model - LL of null model)
- Pseudo R2: interpret like R-squared, but it is computed differently
- From these coefficients, we can tell whether estimated effects are positive or negative
- And whether they're significant
- And something about effect sizes, though it is difficult to draw inferences from the coefficients directly
23. Comparing logit and probit
- Scaling factor proposed by Amemiya (1981)
- Multiply Probit coefficients by 1.6 to get an approximation to Logit
- Other authors have suggested a factor of 1.8
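The 1.6 factor can be checked numerically: the logistic CDF evaluated at 1.6·z stays close to the standard normal CDF at z, which is why logit coefficients are roughly 1.6 times their probit counterparts. A small stdlib-only sketch:

```python
import math

def norm_cdf(z):
    # Standard normal CDF via the error function
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def logistic_cdf(z):
    return 1 / (1 + math.exp(-z))

# Largest gap between Phi(z) and Lambda(1.6*z) over a grid of z values
grid = [x / 10 for x in range(-30, 31)]
max_gap = max(abs(norm_cdf(z) - logistic_cdf(1.6 * z)) for z in grid)
```

The maximum gap over this range is under 0.02, so the two link functions are nearly interchangeable in practice.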
24. Marginal effects
- After logit or Probit estimation, type mfx into the command line
- Calculates marginal effects of each of the RHS variables on the dependent variable
- Slope of the function for continuous variables
- Effect of a change from 0 to 1 for a dummy variable
- Can also calculate elasticities
- By default, calculates mfx at the means of the explanatory variables
- Can also calculate at medians, or at specified points
25. Marginal effects
- Logit and Probit mfx are very similar indeed
- OLS is actually not too bad
26. Odds ratios
- Only an option with logit
- Type or after the comma, as an option
- Reports odds ratios: that is, how many times more (or less) likely the outcome becomes
- if the variable is 1 rather than 0, in the case of a dichotomous variable
- for each unit increase of the variable, for a continuous variable
- Results >1 show an increased probability; results <1 show a decrease
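An odds ratio is just an exponentiated logit coefficient. A quick sketch with an illustrative coefficient (not taken from the slides) showing how a ratio translates back into probabilities:

```python
import math

b_female = 0.69                      # illustrative logit coefficient
odds_ratio = math.exp(b_female)      # close to 2: the odds roughly double

p0 = 0.30                            # some baseline probability
odds0 = p0 / (1 - p0)                # baseline odds
odds1 = odds0 * odds_ratio           # odds after a one-unit change
p1 = odds1 / (1 + odds1)             # implied probability after the change
```

Note the ratio multiplies the odds, not the probability: here the probability moves from 0.30 to about 0.46, not to 0.60.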
27. Other post-estimation commands
- Likelihood ratio test: lrtest
- Adding an extra variable to the RHS always increases the likelihood
- But does it add enough to the likelihood?
- The LR test computes -2 ln(L0/L1) = 2(lnL1 - lnL0), where L0 is the restricted and L1 the unrestricted likelihood; this is a chi-squared statistic with d.f. equal to the number of variables you are dropping
- Null hypothesis: the restricted specification
- Only works on nested models, i.e., where the RHS variables in one model are a subset of the RHS variables in the other
- How to do it
- Run the full model
- Type estimates store NAME
- Run a smaller model
- Type estimates store ANOTHERNAME
- ... and so on for as many models as you like
- Type lrtest NAME ANOTHERNAME
- Be careful...
- Sample sizes must be the same for both models
- This won't happen if the dropped variable is missing for some observations
- Solve the problem by running the biggest model first and using e(sample)
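The arithmetic lrtest performs is simple enough to sketch by hand. The log-likelihoods below are made up for illustration; the p-value uses the closed form of the 1-d.f. chi-squared survival function (erfc), so no statistics library is needed.

```python
import math

ll_full = -1200.4          # hypothetical log-likelihood, full model
ll_restricted = -1204.1    # hypothetical, with one variable dropped

# LR statistic = 2 * (lnL1 - lnL0)
lr_stat = 2 * (ll_full - ll_restricted)

# For 1 degree of freedom: P(chi2_1 > s) = erfc(sqrt(s/2))
p_value = math.erfc(math.sqrt(lr_stat / 2))
```

Here the statistic is 7.4 and the p-value is well below 0.05, so the restricted model (the null) would be rejected and the extra variable kept.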
28. LR test: example
- Similar, but not identical, regression to previous examples
- Add regional variables; decide which ones to keep
- Looks as though Scotland might stay, also possibly SW, NW, N
29. LR test: example
REJECT nested specification
DON'T REJECT nested spec
- Reject dropping all regional variables, against keeping the full set
- Don't reject dropping all but 4, against keeping the full set
- Don't reject dropping all but Scotland, against keeping the full set
- Don't reject dropping all but Scotland, against dropping all but 4
- And just to check: DO reject dropping all regional variables against dropping all but Scotland
30. Again, the specification is illustrative only
- This is not an example of a finished labour supply model!
- How could one improve the model?
- Model specification
- Theoretical considerations
- Empirical considerations
- Parsimony
- Stepwise regression techniques
- Regression diagnostics
- Interpreting results
- Spotting unreasonable results
31. Other models
- Other models to be aware of, but not covered on this course
- Ordered models (ologit, oprobit) for ordered outcomes
- Levels of education
- Number of children
- Excellent, good, fair or poor health
- Multinomial models (mlogit, mprobit) for multiple outcomes with no obvious ordering
- Working in the public, private or voluntary sector
- Choice of nursery, childminder or playgroup for pre-school care
- Heckman selection model
- For modelling two-stage procedures
- Earnings, conditional on having a job at all
- Having a job is modelled as a probit; earnings are modelled as OLS
- Used particularly for women's earnings
- Tobit model, for censored or truncated data
- Typically, for data where there are lots of zeros
- Expenditure on rarely-purchased items, e.g. cars
- Children's weights, in an experiment where the scales broke and gave a minimum reading of 10kg
32. Competence in STATA
- You will get the best results from this course if you already know how to use STATA competently
- Check you know how to:
- Get data into STATA (use and using commands)
- Manipulate data (merge, append, rename, drop, save)
- Describe your data (describe, tabulate, table)
- Create new variables (gen, egen)
- Work with subsets of data (if, in, by)
- Do basic regressions (regress, logit, probit)
- Run sessions interactively and in batch mode
- Organise your datasets and do-files so you can find them again
- If you can't do these, upgrade your knowledge ASAP!
- Could enrol in STATA NetCourse 101
- Costs 110
- ESRC might pay
- Courses run regularly
- www.stata.com
33. SC968: Panel data methods for sociologists. Lecture 1, part 2
- Introducing Longitudinal Data
34. Overview
- Cross-sectional and longitudinal data
- Types of longitudinal data
- Types of analysis possible with panel data
- Data management: merging, appending, long and wide forms
- Simple models using longitudinal data
35. Cross-sectional and longitudinal data
- First, draw the distinction between macro- and micro-level data
- Micro level: firms, individuals
- Macro level: local authorities, travel-to-work areas, countries, commodity prices
- Both may exist in cross-sectional or longitudinal forms
- We are interested in micro-level data
- But macro-level variables are often used in conjunction with micro-data
- Cross-sectional data
- Contains information collected at a given point in time
- (More strictly, during a given time window)
- Workplace Industrial Relations Survey (WIRS)
- General Household Survey (GHS)
- Many cross-sectional surveys are repeated annually, but on different individuals
- Longitudinal data
- Contains repeated observations on the same subjects
36. Types of longitudinal data
- Time-series data
- E.g., commodity prices, exchange rates
- Repeated interviews at irregular intervals
- UK cohort studies
- NCDS (1958), BCS70 (1970), MCS (2000)
- Repeated interviews at regular intervals
- Panel surveys
- Usually annual intervals, sometimes two-yearly
- BHPS, ECHP, PSID, SOEP
- Some surveys have both cross-sectional and panel elements
- Panels are more expensive to collect
- LFS and EU-SILC both have a rolling panel element
- Other sources of longitudinal data
- Retrospective data (e.g. work or relationship history)
- Linkage with external data (e.g., tax or benefit records), particularly in Scandinavia
- May be present in both cross-sectional and longitudinal data sets
37. Analysis with longitudinal data
- The snapshot versus the movie
- Essentially, longitudinal data allow us to observe how events evolve
- Study flows as well as stocks
- Example: unemployment
- Cross-sectional analysis shows a steady 5% unemployment rate
- Does this mean that everyone is unemployed one year out of five?
- That 5% of people are unemployed all the time?
- Or something in between?
- Very different implications for equality, social policy, etc.
38. The BHPS
- Interviews about 10,000 adults in about 6,000 households
- Interviews repeated annually
- People are followed when they move
- People join the sample if they move in with a sample member
- Household-level information collected from the head of household
- Individual-level information collected from people aged 17 and over
- Young people aged 11-16 fill in a youth questionnaire
- BHPS is being upgraded to Understanding Society
- Much larger and wider-ranging survey
- BHPS sample being retained as part of the US sample
- The data set used for this course is a 20% sample of the BHPS, with selected variables
39. The BHPS
- All files are prefixed with a letter indicating the year
- All variables within each file are also prefixed with this letter
- 1991: a
- 1992: b, and so on, so far up to p
- Several files each year, containing different information
- hhsamp: information on sample households
- hhresp: household-level information on households that actually responded
- indall: info on all individuals in responding households
- indresp: info on respondents to the main questionnaire (adults)
- egoalt: file showing the relationship of household members to one another
- income: incomes
- Extra files each year containing derived variables
- Work histories, net income files
- And others with occasional modules, e.g. life histories in wave 2
- bjobhist, blifemst, bmarriag, bcohabit, bchildnt
40. Some BHPS files
- Wave a: 768.1k aindall.dta; 10.7M aindresp.dta; 1626.3k ahhresp.dta; 330.6k ahhsamp.dta; 1066.4k aincome.dta; 541.3k aegoalt.dta; 303.8k ajobhist.dta
- Wave b: 635.3k bindsamp.dta; 978.2k bindall.dta; 11.0M bindresp.dta; 1499.7k bhhresp.dta; 257.1k bhhsamp.dta; 1073.0k bincome.dta; 546.5k begoalt.dta; 237.8k bjobhist.dta
- Wave c: 624.3k cindsamp.dta; 975.6k cindall.dta; 11.0M cindresp.dta; 1539.0k chhresp.dta; 287.4k chhsamp.dta; 1008.9k cincome.dta; 542.2k cegoalt.dta; 237.8k cjobhist.dta; 1675.0k clifejob.dta
- Wave d: 616.7k dindsamp.dta; 943.7k dindall.dta; 11.2M dindresp.dta; 1508.9k dhhresp.dta; 301.9k dhhsamp.dta; 1019.7k dincome.dta; 531.8k degoalt.dta; 245.0k djobhist.dta; 129.0k dyouth.dta
- Cross-wave files: 4977.3k xwaveid.dta; 1027.7k xwlsten.dta
- indsamp files: following sample members
- Youth module introduced in 1994 (dyouth)
- Extra modules in Wave 2
- xwaveid, xwlsten: cross-wave identifiers
41. Person and household identifiers
- The BHPS (along with other household panels such as the ECHP and SOEP) is a household survey, so everyone living in a sample household becomes a sample member
- Need identifiers to:
- Associate the same individual with him- or herself in different waves
- Link members of the same household with each other in the same wave: the HID identifier
- Note: there is no such thing as a longitudinal household!
- Household composition changes, household location changes...
- HID is a cross-sectional concept only!
42. What it looks like: 4 waves of data, sorted by pid and wave
Observations in rows, variables in columns. Blue stripes show where one individual ends and another begins.
- Not present at 2nd wave
- A child, so no data on job or marital status
43. (Can also use the ,nol option)
44. Joining data sets together
- Adding extra variables: the merge command
- Adding extra observations: the append command
45. Whether appending or merging
- The data set you are using at the time is called the master data
- The data set you want to combine it with is called the using data
- Make sure you can identify observations properly beforehand
- Make sure you can identify observations uniquely afterwards
46. Appending
- Use this command to add more observations
- Relatively easy
- Check first that you are really adding observations you don't already have (or that, if you are adding duplicates, you really want to do this)
- Syntax: append using using_data
- STATA simply sticks the using data on the end of the master data
- STATA re-orders the variables if necessary
- If the using data contain variables not present in the master data, STATA sets these variables to missing for the observations that came from the master data
- (and vice versa, if the master data contain variables not present in the using data)
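The same behaviour can be seen in pandas, where `concat` plays the role of Stata's append; this is only an illustrative analogue (the toy data below are invented), not part of the course's Stata workflow:

```python
import pandas as pd

# "Master" data lack the sex variable; "using" data have it
master = pd.DataFrame({"pid": [1, 2], "age": [30, 59]})
using = pd.DataFrame({"pid": [3], "age": [45], "sex": ["female"]})

# Stack the observations; sex is set to missing (NaN) for the master rows,
# just as Stata sets unmatched variables to missing after append
combined = pd.concat([master, using], ignore_index=True)
```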
47. Merging is more complicated
- Use merge to add more variables to a data set

Master data, age.dta:

    pid    wave  age
    28005  1     30
    19057  1     59
    28005  2     31
    19057  3     61
    19057  4     62
    28005  4     33

Using data, sex.dta:

    pid    wave  sex
    19057  1     female
    19057  3     female
    28005  1     male
    28005  2     male
    28005  4     male
    42571  1     male
    42571  3     male

- First, make sure both data sets are sorted the same way:

    use sex.dta
    sort pid wave
    save, replace
    use age.dta
    sort pid wave
48. Merging

Master data, age.dta (sorted):

    pid    wave  age
    19057  1     59
    19057  3     61
    19057  4     62
    28005  1     30
    28005  2     31
    28005  4     33

Using data, sex.dta (sorted):

    pid    wave  sex
    19057  1     female
    19057  3     female
    28005  1     male
    28005  2     male
    28005  4     male
    42571  1     male
    42571  3     male

- Notice that the two data sets don't contain the same observations
- merge 1:1 pid wave using sex
- The 1:1 (new in STATA this year) shows you are expecting one using observation for each master observation

Result:

    pid    wave  age  sex     _merge
    19057  1     59   female  3
    19057  3     61   female  3
    19057  4     62   .       1
    28005  1     30   male    3
    28005  2     31   male    3
    28005  4     33   male    3
    42571  1     .    male    2
    42571  3     .    male    2
49.
- STATA creates a variable called _merge after merging
- _merge == 1: observation in master but not using data
- _merge == 2: observation in using but not master data
- _merge == 3: observation in both data sets
- Options are available for discarding some observations: see the manual
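For comparison only, here is the same 1:1 merge done in pandas with the example data from the previous slide; `indicator="_merge"` creates a column that plays exactly the role of Stata's _merge variable (labelled "left_only" / "right_only" / "both" instead of 1 / 2 / 3).

```python
import pandas as pd

age = pd.DataFrame({"pid": [19057, 19057, 19057, 28005, 28005, 28005],
                    "wave": [1, 3, 4, 1, 2, 4],
                    "age": [59, 61, 62, 30, 31, 33]})
sex = pd.DataFrame({"pid": [19057, 19057, 28005, 28005, 28005, 42571, 42571],
                    "wave": [1, 3, 1, 2, 4, 1, 3],
                    "sex": ["female", "female", "male", "male", "male",
                            "male", "male"]})

# how="outer" keeps unmatched rows from both sides, like Stata's default;
# validate="1:1" raises an error if either side has duplicate pid/wave keys
merged = age.merge(sex, on=["pid", "wave"], how="outer",
                   validate="1:1", indicator="_merge")
```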
50. More on merging
- The previous example showed one-to-one merging
- Not every observation was in both data sets, but every observation in the master data was matched with at most one observation in the using data, and vice versa
- Many-to-one merging
- Household-level data sets contain only one observation per household (usually <1 per person)
- Regional data (e.g., regional unemployment data): usually one observation per region
- Sample syntax: merge m:1 hid wave using hhinc_data

Individual-level master data:

    hid   pid    age
    1604  19057  59
    2341  28005  30
    3569  42571  59
    4301  51538  22
    4301  51562  4
    4956  59377  46
    5421  64966  70
    6363  76166  77
    6827  81763  71
    6827  81798  72

Household-level using data:

    hid   h/h income
    1604  780
    2341  1501
    3569  268
    4301  394
    4956  1601
    5421  225
    6363  411
    6827  743

Merged result:

    hid   pid    age  h/h income
    1604  19057  59   780
    2341  28005  30   1501
    3569  42571  59   268
    4301  51538  22   394
    4301  51562  4    394
    4956  59377  46   1601
    5421  64966  70   225
    6363  76166  77   411
    6827  81763  71   743
    6827  81798  72   743

- One-to-many merging
- Job and relationship files contain one observation per episode (potentially >1 per person)
- Income files contain one observation per source of income (potentially >1 per person)
- Sample syntax: merge 1:m pid wave using births_data
51. Long and wide forms
- The data we have here are in long form
- One row for each person/wave combination
- From a few slides back
52. Wide form
- However, it's also possible to put longitudinal data into wide form
- One observation per person, with different variables relating to different years of data
- Sex doesn't change (usually)
- Age at wave 1, and so on
53. The reshape command
- Switching from long to wide
- reshape wide stubnames, i(id) j(year)
- In the BHPS, this becomes
- reshape wide stubnames, i(pid) j(wave)
- What are stub names?
- They are a list of the variables which vary between years
- Variables like sex or ethnicity would not normally be included in this list
- Switching from wide to long
- Exactly the opposite
- reshape long stubnames, i(id) j(wave)
- Lots more info and examples in the STATA manual
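As an illustrative analogue of reshape (toy data invented here), pandas covers the same long-to-wide and wide-to-long round trip: `pivot` spreads a stub variable across wave-suffixed columns, and `wide_to_long` reverses it.

```python
import pandas as pd

# Long form: one row per person/wave combination
long = pd.DataFrame({"pid": [1, 1, 2, 2],
                     "wave": [1, 2, 1, 2],
                     "age": [30, 31, 59, 60]})

# Long -> wide: one row per pid, columns age1, age2 (like Stata's stubs)
wide = long.pivot(index="pid", columns="wave", values="age")
wide.columns = [f"age{w}" for w in wide.columns]
wide = wide.reset_index()

# Wide -> long again: stubname "age", i=pid, j=wave
back = pd.wide_to_long(wide, stubnames="age", i="pid", j="wave").reset_index()
```

Time-invariant variables such as sex would simply be left out of the stub list and carried along as ordinary columns, as the slide notes.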
54. Simple models using longitudinal data
- Auto-regressive and time-lagged models
- Models of change
55. But first, the GHQ
- We use this for lots of analysis in the lectures and practical sessions
- General Health Questionnaire
- Different versions exist; the BHPS carries the GHQ-12, with 12 questions
- Have you recently...
- been able to concentrate on whatever you're doing?
- lost much sleep over worry?
- felt that you were playing a useful part in things?
- felt capable of making decisions about things?
- felt constantly under strain?
- felt you couldn't overcome your difficulties?
- been able to enjoy your normal day-to-day activities?
- been able to face up to problems?
- been feeling unhappy or depressed?
- been losing confidence in yourself?
- been thinking of yourself as a worthless person?
- been feeling reasonably happy, all things considered?
- Each question is answered on a 4-point scale:
- not at all / no more than usual / rather more / much more
56. GHQ

    (ghq) 1 likert      Freq.  Percent    Cum.
    -------------------------------------------
    missing or wild       582     2.10    2.10
    proxy respondent    1,202     4.33    6.43
    0                      77     0.28    6.70
    1                     109     0.39    7.10
    2                     149     0.54    7.63
    3                     288     1.04    8.67
    4                     504     1.82   10.49
    5                     867     3.12   13.61
    6                   2,229     8.03   21.64
    7                   2,265     8.16   29.80
    8                   2,355     8.48   38.28
    9                   2,426     8.74   47.02
    10                  2,259     8.14   55.16
    11                  2,228     8.03   63.19
    12                  2,478     8.93   72.11
    13                  1,316     4.74   76.85
    14                  1,115     4.02   80.87

- HLGHQ1 in the BHPS
- Sum of scores: the Likert scale
- We recode <0 values to missing, and rename the variable LIKERT
- Treated as a continuous variable
57. GHQ
- HLGHQ2
- Caseness scale
- Recodes answers 3-4 as 1, and adds up
- Scores above 2 are used to indicate psychological morbidity
58. Time-lagged models
Start with a simple OLS model. The Likert score is a measure of psychological wellbeing derived from a battery of questions.
59. Generate lagged variable
NB: the "in 1/30" here is just so the listing will fit on the page. You should check many more observations than this!
60. OLS, with lagged dependent variable
- R-squared rockets from 5% to 26%
- Big, very significant coefficient on the lagged variable
- Coefficient on ue_sick falls from 3.6 to 2.1
- It is also possible to include lagged explanatory variables
61. Models of change
- Start with an OLS model (simplified, but imagine more variables)
- Separate model for each year; the suffix denotes the year
- Subtract the 1st from the 2nd model
- Or, express in terms of change
62. Generate difference variables

    capture drop dif*
    sort pid wave
    gen dif_LIKERT  = LIKERT  - LIKERT[_n-1]  if pid == pid[_n-1] & wave == wave[_n-1] + 1
    gen dif_age     = age     - age[_n-1]     if pid == pid[_n-1] & wave == wave[_n-1] + 1
    gen dif_age2    = age2    - age2[_n-1]    if pid == pid[_n-1] & wave == wave[_n-1] + 1
    gen dif_female  = female  - female[_n-1]  if pid == pid[_n-1] & wave == wave[_n-1] + 1
    gen dif_ue_sick = ue_sick - ue_sick[_n-1] if pid == pid[_n-1] & wave == wave[_n-1] + 1
    gen dif_partner = partner - partner[_n-1] if pid == pid[_n-1] & wave == wave[_n-1] + 1

Check you understand why dif_female will very nearly always be zero.
63. Check for sensible results!
64. More checking...
65. Obvious problems
- Interview times may mean a difference of 100 in the age difference variable
- Most differences are zero
- Moving into unemployment or partnership is given equal and opposite weighting to moving out; no real reason why this should be the case
- There are MUCH better ways to use these data!
- Nevertheless, let's proceed!
66. Results
- The age increase is equal and opposite to the constant
- Female drops out
- Coefficients on sick and partner are significant