Title: SC968 Panel data methods for sociologists Lecture 1, part 1
1. SC968: Panel data methods for sociologists. Lecture 1, part 1
- A review of concepts for regression modelling
- Or: things you should know already
2. Overview
- Models
- OLS, logit and probit
- Mathematically and practically
- Interpretation of results, measures of fit and regression diagnostics
- Model specification
- Post-estimation commands
- STATA competence
3. Ordinary Least Squares (OLS)

    y_i = β0 + β1·x1_i + β2·x2_i + ... + βK·xK_i + e_i

- y_i: value of the dependent variable for individual i (LHS variable)
- e_i: residual (disturbance, error term)
- β0: intercept (constant)
- Total no. of explanatory variables (RHS variables, or regressors) is K
- β1: coefficient on variable 1
- x1_i: value of explanatory variable 1 for person i
- Examples:
- y_i = mental health; x1 = sex, x2 = age, x3 = marital status, x4 = employment status, x5 = physical health
- y_i = hourly pay; x1 = sex, x2 = age, x3 = education, x4 = job tenure, x5 = industry, x6 = region
4. OLS
- In vector form: y_i = x_i′β + e_i
- In matrix form: y = Xβ + e
- x_i: vector of explanatory variables
- β: vector of coefficients
- Note: you will often see x_i′β written simply as xβ
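The course estimates these models in Stata, but the matrix form above can be illustrated directly. Below is a minimal numpy sketch of the least-squares solution β̂ = (X′X)⁻¹X′y on simulated data; all variable names and values here are made up for illustration.

```python
import numpy as np

# Simulate a simple "income on age and sex" style model
rng = np.random.default_rng(0)
n = 1000
age = rng.uniform(20, 60, n)
female = rng.integers(0, 2, n).astype(float)
e = rng.normal(0, 1, n)

# True model: y = 5 + 0.2*age - 1.5*female + e
y = 5 + 0.2 * age - 1.5 * female + e

# X = [constant, age, female]; solve the normal equations X'X b = X'y
X = np.column_stack([np.ones(n), age, female])
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
residuals = y - X @ beta_hat   # with a constant, these average to zero
```

With a constant included, the residuals sum to zero, which is the first assumption listed on the next slide.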
5. OLS
- Also called linear regression
- Assumes the dependent variable is a linear combination of the explanatory variables, plus a disturbance
- Least squares: the βs are estimated so as to minimise the sum of the squared e's
6. Assumptions
- Residuals have zero mean
- It follows that the e's and the X's are uncorrelated
- Violated if a regressor is endogenous
- E.g., number of children in female labour supply models
- Cure by (e.g.) Instrumental Variables
- Homoscedasticity: all e's have the same variance
- Classic example: food consumption and income
- Cure by using weighted least squares
7. When is OLS appropriate?
- When you have a continuous dependent variable
- E.g., you would use it to estimate regressions for height, but not for whether a person has a university degree
- When the assumptions are not obviously violated
- As a first step in research, to get ball-park estimates
- We will use it a lot for this purpose
- Worked examples
- Coefficients, p-values, t-statistics
- Measures of fit (R-squared, adjusted R-squared)
- Thinking about specification
- Post-estimation commands
- Regression diagnostics
- A note on the data
- All examples (in lectures and practicals) are drawn from a 20% sample of the British Household Panel Survey (BHPS); more about the data later!
8. Summarize monthly earned income
9. First worked example
For illustrative purposes only; not an example of good practice.
Monthly labour income, for people whose labour income is > 1
- MS = SS/df
- F test: tests whether all coefficients except the constant are jointly zero
- Analysis of variance (ANOVA) table
- R-squared = Model SS / Total SS
- Root MSE = sqrt(Residual MS)
- t-statistic = coefficient / standard error
- Confidence interval = coefficient ± 1.96 standard errors
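The quantities in Stata's regression header all come from the same sum-of-squares decomposition. As a sketch only (simulated data, illustrative names), the arithmetic is:

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 200, 2                      # k regressors plus a constant
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

beta = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ beta

total_ss = np.sum((y - y.mean()) ** 2)
resid_ss = np.sum(e ** 2)
model_ss = total_ss - resid_ss

r2 = model_ss / total_ss                 # R-squared = Model SS / Total SS
resid_ms = resid_ss / (n - k - 1)        # MS = SS / df
root_mse = np.sqrt(resid_ms)             # Root MSE = sqrt(Residual MS)

# t-statistic = coefficient / standard error
se = np.sqrt(np.diag(resid_ms * np.linalg.inv(X.T @ X)))
t = beta / se
ci_lower, ci_upper = beta - 1.96 * se, beta + 1.96 * se
```

(Stata's table uses the exact t distribution rather than 1.96 for the confidence interval, but with large samples the difference is negligible.)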
10. What do the results tell us?
- All coefficients except month of interview are significant
- 29% of variation explained
- Being female reduces income by nearly £600 per month
- Income goes up with age, and then down
- 16,458 observations... oops, this is from panel data, so there are repeated observations on individuals
11. Add ,cluster(pid) as an option
- Coefficients, R-squared etc. are unchanged from the previous specification
- But standard errors are adjusted: standard errors are larger, t-statistics lower
12. Let's get rid of the month variable
Think about the female coefficient a bit more. Could it be to do with women working shorter hours?
13. Control for weekly hours of work
- Is the coefficient on hours of work reasonable?
- £5.65 for every additional hour worked: certainly in the right ball park
14. Looking at 2 specifications together
- R-squared jumps from 29% to 46%
- Coefficient on female goes from -595 to -315
- Almost half the effect of gender is explained by women's shorter hours of work
- Age, partner and education coefficients are also reduced in magnitude, for similar reasons
- Number of observations falls from 16,460 to 13,998: missing data on hours
15. Interesting post-estimation activities
What age does income peak?

    Income = ... + β1·age + β2·age² + ...
    d(Income)/d(age) = β1 + 2·β2·age
    Derivative is zero when age = -β1/(2·β2) = -79.552/(2 × (-0.8732)) ≈ 45.5

Is the effect of university qualifications statistically different from the effect of secondary education?
16. A closer look at the couple coefficient
17.
- Men benefit much more than women from being in a couple.
- Other coefficients also differ between men and women, but with the current specification we can't test whether the differences are significant.
18. Logit and Probit
- Developed for discrete (categorical) dependent variables
- E.g., psychological morbidity, whether one has a job. Think of other examples.
- Outcome variable is always 0 or 1. Estimate Pr(y_i = 1) = F(x_i, β)
- OLS (the linear probability model) would set F(X, β) = Xβ + e
- Inappropriate because:
- Heteroscedasticity: the outcome variable is always 0 or 1, so e only takes the values -xβ or 1-xβ
- More seriously, one cannot constrain estimated probabilities to lie between 0 and 1
19. Logit and Probit
- Looking for a function which lies between 0 and 1
- Cumulative normal distribution: Probit model
- Logistic distribution: Logit (logistic) model
- They are very similar! Note how they lie between 0 and 1 (vertical axis)
20. Maximum likelihood estimation
- Likelihood function: the product of
- Pr(y=1) = F(xβ) over all observations where y=1
- Pr(y=0) = 1 - F(xβ) over all observations where y=0
- (think of the probability of flipping exactly four heads and two tails with six coins)
- Log likelihood: ln L = Σ[y=1] ln F(xβ) + Σ[y=0] ln(1 - F(xβ))
- Estimated using an iterative procedure
- STATA chooses starting values for the βs
- Computes the slope of the likelihood function at these values
- Adjusts the βs accordingly
- Stops when the slope of the LF is 0
- Can take time!
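Stata does this iteration internally, but the procedure can be sketched by hand. Below is a minimal Newton-iteration sketch for the logit model on simulated data (all values illustrative): start at zero, compute the slope (score) of the log-likelihood, step, and repeat until the slope is essentially zero.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
true_beta = np.array([-0.5, 1.0])
p = 1 / (1 + np.exp(-X @ true_beta))
y = (rng.uniform(size=n) < p).astype(float)

beta = np.zeros(2)                # starting values
for _ in range(25):               # iterate until the slope is ~0
    p_hat = 1 / (1 + np.exp(-X @ beta))
    score = X.T @ (y - p_hat)                      # slope of the log-likelihood
    W = p_hat * (1 - p_hat)
    hessian = -(X * W[:, None]).T @ X
    beta = beta - np.linalg.solve(hessian, score)  # Newton step

# Log likelihood at the estimate: sum of ln F and ln(1-F) terms
loglik = np.sum(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))
```

The logit log-likelihood is globally concave, so this converges quickly from any reasonable starting point; Stata's `ml` machinery is a more robust version of the same idea.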
21. Let's look at whether a person works

    gen byte work = (jbstat == 1 | jbstat == 2) if jbstat >= 1 & jbstat != .
22. Logit regression: whether have a job
- All the iterations
- LR chi2 = 2 × (LL of this model - LL of null model)
- Pseudo R2: interpret like R-squared, but it is computed differently
- From these coefficients, we can tell whether estimated effects are positive or negative
- And whether they're significant
- And something about effect sizes, though it is difficult to draw inferences from the coefficients directly
23. Comparing logit and probit
- Scaling factor proposed by Amemiya (1981)
- Multiply Probit coefficients by 1.6 to get an approximation to Logit
- Other authors have suggested a factor of 1.8
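The 1.6 factor can be checked numerically: the logistic CDF evaluated at 1.6·z stays close to the standard normal CDF at z, which is why logit coefficients are roughly 1.6 times their probit counterparts. A small stdlib-only sketch:

```python
import math

def norm_cdf(z):
    # Standard normal CDF via the error function
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def logistic_cdf(z):
    return 1 / (1 + math.exp(-z))

# Largest gap between Phi(z) and Lambda(1.6*z) over a grid of z values
grid = [x / 10 for x in range(-30, 31)]
max_gap = max(abs(norm_cdf(z) - logistic_cdf(1.6 * z)) for z in grid)
```

The maximum gap over this range is under 0.02, so the two link functions are nearly interchangeable in practice.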
24. Marginal effects
- After logit or Probit estimation, type mfx into the command line
- Calculates marginal effects of each of the RHS variables on the dependent variable
- Slope of the function for continuous variables
- Effect of a change from 0 to 1 for a dummy variable
- Can also calculate elasticities
- By default, calculates mfx at the means of the explanatory variables
- Can also calculate at medians, or at specified points
25. Marginal effects
- Logit and Probit mfx are very similar indeed
- OLS is actually not too bad
26. Odds ratios
- Only an option with logit
- Type or after the comma, as an option
- Reports odds ratios: that is, how many times more (or less) likely the outcome becomes
- if the variable is 1 rather than 0, in the case of a dichotomous variable
- for each unit increase of the variable, for a continuous variable
- Results >1 show an increased probability; results <1 show a decrease
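An odds ratio is just an exponentiated logit coefficient. A quick sketch with an illustrative coefficient (not taken from the slides) showing how a ratio translates back into probabilities:

```python
import math

b_female = 0.69                      # illustrative logit coefficient
odds_ratio = math.exp(b_female)      # close to 2: the odds roughly double

p0 = 0.30                            # some baseline probability
odds0 = p0 / (1 - p0)                # baseline odds
odds1 = odds0 * odds_ratio           # odds after a one-unit change
p1 = odds1 / (1 + odds1)             # implied probability after the change
```

Note the ratio multiplies the odds, not the probability: here the probability moves from 0.30 to about 0.46, not to 0.60.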
27. Other post-estimation commands
- Likelihood ratio test: lrtest
- Adding an extra variable to the RHS always increases the likelihood
- But does it add enough to the likelihood?
- The LR test computes -2 ln(L0/L1) = 2(lnL1 - lnL0), where L0 is the restricted and L1 the unrestricted likelihood; this is a chi-squared statistic with d.f. equal to the number of variables you are dropping
- Null hypothesis: the restricted specification
- Only works on nested models, i.e., where the RHS variables in one model are a subset of the RHS variables in the other
- How to do it
- Run the full model
- Type estimates store NAME
- Run a smaller model
- Type estimates store ANOTHERNAME
- ... and so on for as many models as you like
- Type lrtest NAME ANOTHERNAME
- Be careful...
- Sample sizes must be the same for both models
- This won't happen if the dropped variable is missing for some observations
- Solve the problem by running the biggest model first and using e(sample)
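The arithmetic lrtest performs is simple enough to sketch by hand. The log-likelihoods below are made up for illustration; the p-value uses the closed form of the 1-d.f. chi-squared survival function (erfc), so no statistics library is needed.

```python
import math

ll_full = -1200.4          # hypothetical log-likelihood, full model
ll_restricted = -1204.1    # hypothetical, with one variable dropped

# LR statistic = 2 * (lnL1 - lnL0)
lr_stat = 2 * (ll_full - ll_restricted)

# For 1 degree of freedom: P(chi2_1 > s) = erfc(sqrt(s/2))
p_value = math.erfc(math.sqrt(lr_stat / 2))
```

Here the statistic is 7.4 and the p-value is well below 0.05, so the restricted model (the null) would be rejected and the extra variable kept.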
28. LR test: example
- Similar, but not identical, regression to previous examples
- Add regional variables; decide which ones to keep
- Looks as though Scotland might stay, also possibly SW, NW, N
29. LR test: example
REJECT nested specification
DON'T REJECT nested spec
- Reject dropping all regional variables, against keeping the full set
- Don't reject dropping all but 4, against keeping the full set
- Don't reject dropping all but Scotland, against keeping the full set
- Don't reject dropping all but Scotland, against dropping all but 4
- And just to check: DO reject dropping all regional variables against dropping all but Scotland
30. Again, the specification is illustrative only
- This is not an example of a finished labour supply model!
- How could one improve the model?
- Model specification
- Theoretical considerations
- Empirical considerations
- Parsimony
- Stepwise regression techniques
- Regression diagnostics
- Interpreting results
- Spotting unreasonable results
31. Other models
- Other models to be aware of, but not covered on this course
- Ordered models (ologit, oprobit) for ordered outcomes
- Levels of education
- Number of children
- Excellent, good, fair or poor health
- Multinomial models (mlogit, mprobit) for multiple outcomes with no obvious ordering
- Working in the public, private or voluntary sector
- Choice of nursery, childminder or playgroup for pre-school care
- Heckman selection model
- For modelling two-stage procedures
- Earnings, conditional on having a job at all
- Having a job is modelled as a probit; earnings are modelled as OLS
- Used particularly for women's earnings
- Tobit model, for censored or truncated data
- Typically, for data where there are lots of zeros
- Expenditure on rarely-purchased items, e.g. cars
- Children's weights, in an experiment where the scales broke and gave a minimum reading of 10kg
32. Competence in STATA
- You will get the best results from this course if you already know how to use STATA competently
- Check you know how to:
- Get data into STATA (use and using commands)
- Manipulate data (merge, append, rename, drop, save)
- Describe your data (describe, tabulate, table)
- Create new variables (gen, egen)
- Work with subsets of data (if, in, by)
- Do basic regressions (regress, logit, probit)
- Run sessions interactively and in batch mode
- Organise your datasets and do-files so you can find them again
- If you can't do these, upgrade your knowledge ASAP!
- Could enrol in STATA NetCourse 101
- Costs 110
- ESRC might pay
- Courses run regularly
- www.stata.com
33. SC968: Panel data methods for sociologists. Lecture 1, part 2
- Introducing Longitudinal Data
34. Overview
- Cross-sectional and longitudinal data
- Types of longitudinal data
- Types of analysis possible with panel data
- Data management: merging, appending, long and wide forms
- Simple models using longitudinal data
35. Cross-sectional and longitudinal data
- First, draw the distinction between macro- and micro-level data
- Micro level: firms, individuals
- Macro level: local authorities, travel-to-work areas, countries, commodity prices
- Both may exist in cross-sectional or longitudinal forms
- We are interested in micro-level data
- But macro-level variables are often used in conjunction with micro-data
- Cross-sectional data
- Contains information collected at a given point in time
- (More strictly, during a given time window)
- Workplace Industrial Relations Survey (WIRS)
- General Household Survey (GHS)
- Many cross-sectional surveys are repeated annually, but on different individuals
- Longitudinal data
- Contains repeated observations on the same subjects
36. Types of longitudinal data
- Time-series data
- E.g., commodity prices, exchange rates
- Repeated interviews at irregular intervals
- UK cohort studies
- NCDS (1958), BCS70 (1970), MCS (2000)
- Repeated interviews at regular intervals
- Panel surveys
- Usually annual intervals, sometimes two-yearly
- BHPS, ECHP, PSID, SOEP
- Some surveys have both cross-sectional and panel elements
- Panels are more expensive to collect
- LFS and EU-SILC both have a rolling panel element
- Other sources of longitudinal data
- Retrospective data (e.g. work or relationship history)
- Linkage with external data (e.g., tax or benefit records), particularly in Scandinavia
- May be present in both cross-sectional and longitudinal data sets
37. Analysis with longitudinal data
- The snapshot versus the movie
- Essentially, longitudinal data allow us to observe how events evolve
- Study flows as well as stocks
- Example: unemployment
- Cross-sectional analysis shows a steady 5% unemployment rate
- Does this mean that everyone is unemployed one year out of five?
- That 5% of people are unemployed all the time?
- Or something in between?
- Very different implications for equality, social policy, etc.
38. The BHPS
- Interviews about 10,000 adults in about 6,000 households
- Interviews repeated annually
- People are followed when they move
- People join the sample if they move in with a sample member
- Household-level information collected from the head of household
- Individual-level information collected from people aged 17 and over
- Young people aged 11-16 fill in a youth questionnaire
- BHPS is being upgraded to Understanding Society
- Much larger and wider-ranging survey
- BHPS sample being retained as part of the US sample
- The data set used for this course is a 20% sample of the BHPS, with selected variables
39. The BHPS
- All files are prefixed with a letter indicating the year
- All variables within each file are also prefixed with this letter
- 1991: a
- 1992: b, and so on, so far up to p
- Several files each year, containing different information
- hhsamp: information on sample households
- hhresp: household-level information on households that actually responded
- indall: info on all individuals in responding households
- indresp: info on respondents to the main questionnaire (adults)
- egoalt: file showing the relationship of household members to one another
- income: incomes
- Extra files each year containing derived variables
- Work histories, net income files
- And others with occasional modules, e.g. life histories in wave 2
- bjobhist, blifemst, bmarriag, bcohabit, bchildnt
40. Some BHPS files
- Wave a: 768.1k aindall.dta; 10.7M aindresp.dta; 1626.3k ahhresp.dta; 330.6k ahhsamp.dta; 1066.4k aincome.dta; 541.3k aegoalt.dta; 303.8k ajobhist.dta
- Wave b: 635.3k bindsamp.dta; 978.2k bindall.dta; 11.0M bindresp.dta; 1499.7k bhhresp.dta; 257.1k bhhsamp.dta; 1073.0k bincome.dta; 546.5k begoalt.dta; 237.8k bjobhist.dta
- Wave c: 624.3k cindsamp.dta; 975.6k cindall.dta; 11.0M cindresp.dta; 1539.0k chhresp.dta; 287.4k chhsamp.dta; 1008.9k cincome.dta; 542.2k cegoalt.dta; 237.8k cjobhist.dta; 1675.0k clifejob.dta
- Wave d: 616.7k dindsamp.dta; 943.7k dindall.dta; 11.2M dindresp.dta; 1508.9k dhhresp.dta; 301.9k dhhsamp.dta; 1019.7k dincome.dta; 531.8k degoalt.dta; 245.0k djobhist.dta; 129.0k dyouth.dta
- Cross-wave files: 4977.3k xwaveid.dta; 1027.7k xwlsten.dta
- indsamp files: following sample members
- Youth module introduced in 1994 (dyouth)
- Extra modules in Wave 2
- xwaveid, xwlsten: cross-wave identifiers
41. Person and household identifiers
- The BHPS (along with other household panels such as the ECHP and SOEP) is a household survey, so everyone living in a sample household becomes a sample member
- Need identifiers to:
- Associate the same individual with him- or herself in different waves
- Link members of the same household with each other in the same wave: the HID identifier
- Note: there is no such thing as a longitudinal household!
- Household composition changes, household location changes...
- HID is a cross-sectional concept only!
42. What it looks like: 4 waves of data, sorted by pid and wave
Observations in rows, variables in columns. Blue stripes show where one individual ends and another begins.
- Not present at 2nd wave
- A child, so no data on job or marital status
43. (Can also use the ,nol option)
44. Joining data sets together
- Adding extra variables: the merge command
- Adding extra observations: the append command
45. Whether appending or merging
- The data set you are using at the time is called the master data
- The data set you want to combine it with is called the using data
- Make sure you can identify observations properly beforehand
- Make sure you can identify observations uniquely afterwards
46. Appending
- Use this command to add more observations
- Relatively easy
- Check first that you are really adding observations you don't already have (or that, if you are adding duplicates, you really want to do this)
- Syntax: append using using_data
- STATA simply sticks the using data on the end of the master data
- STATA re-orders the variables if necessary
- If the using data contain variables not present in the master data, STATA sets these variables to missing for the observations that came from the master data
- (and vice versa, if the master data contain variables not present in the using data)
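The same behaviour can be seen in pandas, where `concat` plays the role of Stata's append; this is only an illustrative analogue (the toy data below are invented), not part of the course's Stata workflow:

```python
import pandas as pd

# "Master" data lack the sex variable; "using" data have it
master = pd.DataFrame({"pid": [1, 2], "age": [30, 59]})
using = pd.DataFrame({"pid": [3], "age": [45], "sex": ["female"]})

# Stack the observations; sex is set to missing (NaN) for the master rows,
# just as Stata sets unmatched variables to missing after append
combined = pd.concat([master, using], ignore_index=True)
```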
47. Merging is more complicated
- Use merge to add more variables to a data set

Master data, age.dta:

    pid    wave  age
    28005  1     30
    19057  1     59
    28005  2     31
    19057  3     61
    19057  4     62
    28005  4     33

Using data, sex.dta:

    pid    wave  sex
    19057  1     female
    19057  3     female
    28005  1     male
    28005  2     male
    28005  4     male
    42571  1     male
    42571  3     male

- First, make sure both data sets are sorted the same way:

    use sex.dta
    sort pid wave
    save, replace
    use age.dta
    sort pid wave
48. Merging

Master data, age.dta (sorted):

    pid    wave  age
    19057  1     59
    19057  3     61
    19057  4     62
    28005  1     30
    28005  2     31
    28005  4     33

Using data, sex.dta (sorted):

    pid    wave  sex
    19057  1     female
    19057  3     female
    28005  1     male
    28005  2     male
    28005  4     male
    42571  1     male
    42571  3     male

- Notice that the two data sets don't contain the same observations
- merge 1:1 pid wave using sex
- The 1:1 (new in STATA this year) shows you are expecting one using observation for each master observation

Result:

    pid    wave  age  sex     _merge
    19057  1     59   female  3
    19057  3     61   female  3
    19057  4     62   .       1
    28005  1     30   male    3
    28005  2     31   male    3
    28005  4     33   male    3
    42571  1     .    male    2
    42571  3     .    male    2
49.
- STATA creates a variable called _merge after merging
- _merge == 1: observation in master but not using data
- _merge == 2: observation in using but not master data
- _merge == 3: observation in both data sets
- Options are available for discarding some observations: see the manual
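For comparison only, here is the same 1:1 merge done in pandas with the example data from the previous slide; `indicator="_merge"` creates a column that plays exactly the role of Stata's _merge variable (labelled "left_only" / "right_only" / "both" instead of 1 / 2 / 3).

```python
import pandas as pd

age = pd.DataFrame({"pid": [19057, 19057, 19057, 28005, 28005, 28005],
                    "wave": [1, 3, 4, 1, 2, 4],
                    "age": [59, 61, 62, 30, 31, 33]})
sex = pd.DataFrame({"pid": [19057, 19057, 28005, 28005, 28005, 42571, 42571],
                    "wave": [1, 3, 1, 2, 4, 1, 3],
                    "sex": ["female", "female", "male", "male", "male",
                            "male", "male"]})

# how="outer" keeps unmatched rows from both sides, like Stata's default;
# validate="1:1" raises an error if either side has duplicate pid/wave keys
merged = age.merge(sex, on=["pid", "wave"], how="outer",
                   validate="1:1", indicator="_merge")
```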
50. More on merging
- The previous example showed one-to-one merging
- Not every observation was in both data sets, but every observation in the master data was matched with at most one observation in the using data, and vice versa
- Many-to-one merging
- Household-level data sets contain only one observation per household (usually <1 per person)
- Regional data (e.g., regional unemployment data): usually one observation per region
- Sample syntax: merge m:1 hid wave using hhinc_data

Individual-level master data:

    hid   pid    age
    1604  19057  59
    2341  28005  30
    3569  42571  59
    4301  51538  22
    4301  51562  4
    4956  59377  46
    5421  64966  70
    6363  76166  77
    6827  81763  71
    6827  81798  72

Household-level using data:

    hid   h/h income
    1604  780
    2341  1501
    3569  268
    4301  394
    4956  1601
    5421  225
    6363  411
    6827  743

Merged result:

    hid   pid    age  h/h income
    1604  19057  59   780
    2341  28005  30   1501
    3569  42571  59   268
    4301  51538  22   394
    4301  51562  4    394
    4956  59377  46   1601
    5421  64966  70   225
    6363  76166  77   411
    6827  81763  71   743
    6827  81798  72   743

- One-to-many merging
- Job and relationship files contain one observation per episode (potentially >1 per person)
- Income files contain one observation per source of income (potentially >1 per person)
- Sample syntax: merge 1:m pid wave using births_data
51. Long and wide forms
- The data we have here are in long form
- One row for each person/wave combination
- From a few slides back
52. Wide form
- However, it's also possible to put longitudinal data into wide form
- One observation per person, with different variables relating to different years of data
- Sex doesn't change (usually)
- Age at wave 1, and so on
53. The reshape command
- Switching from long to wide
- reshape wide stubnames, i(id) j(year)
- In the BHPS, this becomes
- reshape wide stubnames, i(pid) j(wave)
- What are stub names?
- They are a list of the variables which vary between years
- Variables like sex or ethnicity would not normally be included in this list
- Switching from wide to long
- Exactly the opposite
- reshape long stubnames, i(id) j(wave)
- Lots more info and examples in the STATA manual
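As an illustrative analogue of reshape (toy data invented here), pandas covers the same long-to-wide and wide-to-long round trip: `pivot` spreads a stub variable across wave-suffixed columns, and `wide_to_long` reverses it.

```python
import pandas as pd

# Long form: one row per person/wave combination
long = pd.DataFrame({"pid": [1, 1, 2, 2],
                     "wave": [1, 2, 1, 2],
                     "age": [30, 31, 59, 60]})

# Long -> wide: one row per pid, columns age1, age2 (like Stata's stubs)
wide = long.pivot(index="pid", columns="wave", values="age")
wide.columns = [f"age{w}" for w in wide.columns]
wide = wide.reset_index()

# Wide -> long again: stubname "age", i=pid, j=wave
back = pd.wide_to_long(wide, stubnames="age", i="pid", j="wave").reset_index()
```

Time-invariant variables such as sex would simply be left out of the stub list and carried along as ordinary columns, as the slide notes.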
54. Simple models using longitudinal data
- Auto-regressive and time-lagged models
- Models of change
55. But first, the GHQ
- We use this for lots of analysis in the lectures and practical sessions
- General Health Questionnaire
- Different versions exist; the BHPS carries the GHQ-12, with 12 questions
- Have you recently...
- been able to concentrate on whatever you're doing?
- lost much sleep over worry?
- felt that you were playing a useful part in things?
- felt capable of making decisions about things?
- felt constantly under strain?
- felt you couldn't overcome your difficulties?
- been able to enjoy your normal day-to-day activities?
- been able to face up to problems?
- been feeling unhappy or depressed?
- been losing confidence in yourself?
- been thinking of yourself as a worthless person?
- been feeling reasonably happy, all things considered?
- Each question is answered on a 4-point scale:
- not at all / no more than usual / rather more / much more
56. GHQ

    (ghq) 1 likert      Freq.  Percent    Cum.
    -------------------------------------------
    missing or wild       582     2.10    2.10
    proxy respondent    1,202     4.33    6.43
    0                      77     0.28    6.70
    1                     109     0.39    7.10
    2                     149     0.54    7.63
    3                     288     1.04    8.67
    4                     504     1.82   10.49
    5                     867     3.12   13.61
    6                   2,229     8.03   21.64
    7                   2,265     8.16   29.80
    8                   2,355     8.48   38.28
    9                   2,426     8.74   47.02
    10                  2,259     8.14   55.16
    11                  2,228     8.03   63.19
    12                  2,478     8.93   72.11
    13                  1,316     4.74   76.85
    14                  1,115     4.02   80.87

- HLGHQ1 in the BHPS
- Sum of scores: the Likert scale
- We recode <0 values to missing, and rename the variable LIKERT
- Treated as a continuous variable
57. GHQ
- HLGHQ2
- Caseness scale
- Recodes answers 3-4 as 1, and adds up
- Scores above 2 are used to indicate psychological morbidity
58. Time-lagged models
Start with a simple OLS model. The Likert score is a measure of psychological wellbeing derived from a battery of questions.
59. Generate lagged variable
NB: the "in 1/30" here is just so the listing will fit on the page. You should check many more observations than this!
60. OLS, with lagged dependent variable
- R-squared rockets from 5% to 26%
- Big, very significant coefficient on the lagged variable
- Coefficient on ue_sick falls from 3.6 to 2.1
- It is also possible to include lagged explanatory variables
61. Models of change
- Start with an OLS model (simplified, but imagine more variables)
- Separate model for each year; the suffix denotes the year
- Subtract the 1st from the 2nd model
- Or, express in terms of change
62. Generate difference variables

    capture drop dif*
    sort pid wave
    gen dif_LIKERT  = LIKERT  - LIKERT[_n-1]  if pid == pid[_n-1] & wave == wave[_n-1] + 1
    gen dif_age     = age     - age[_n-1]     if pid == pid[_n-1] & wave == wave[_n-1] + 1
    gen dif_age2    = age2    - age2[_n-1]    if pid == pid[_n-1] & wave == wave[_n-1] + 1
    gen dif_female  = female  - female[_n-1]  if pid == pid[_n-1] & wave == wave[_n-1] + 1
    gen dif_ue_sick = ue_sick - ue_sick[_n-1] if pid == pid[_n-1] & wave == wave[_n-1] + 1
    gen dif_partner = partner - partner[_n-1] if pid == pid[_n-1] & wave == wave[_n-1] + 1

Check you understand why dif_female will very nearly always be zero.
63. Check for sensible results!
64. More checking...
65. Obvious problems
- Interview times may mean a difference of 100 in the age difference variable
- Most differences are zero
- Moving into unemployment or partnership is given equal and opposite weighting to moving out; no real reason why this should be the case
- There are MUCH better ways to use these data!
- Nevertheless, let's proceed!
66. Results
- The age increase is equal and opposite to the constant
- Female drops out
- Coefficients on sick and partner are significant