SC968 Panel data methods for sociologists Lecture 1, part 1 - PowerPoint PPT Presentation

1
SC968 Panel data methods for sociologists
Lecture 1, part 1
  • A review of concepts for regression modelling
  • Or
  • things you should know already

2
Overview
  • Models
  • OLS, logit and probit
  • Mathematically and practically
  • Interpretation of results, measures of fit and
    regression diagnostics
  • Model specification
  • Post-estimation commands
  • STATA competence

3
Ordinary Least Squares (OLS)
The model: yi = ß0 + ß1x1i + ß2x2i + … + ßKxKi + ei
  • yi value of dependent variable for individual i (LHS variable)
  • ei residual (disturbance, error term)
  • ß0 intercept (constant)
  • K total no. of explanatory variables (RHS variables or regressors)
  • ß1 coefficient on variable 1
  • x1i value of explanatory variable 1 for person i
Examples: yi = mental health; x1 = sex; x2 = age; x3 = marital status; x4 = employment status; x5 = physical health
Or: yi = hourly pay; x1 = sex; x2 = age; x3 = education; x4 = job tenure; x5 = industry; x6 = region
4
OLS
In vector form: yi = xi'ß + ei
In matrix form: y = Xß + e
  • xi vector of explanatory variables
  • ß vector of coefficients
Note: you will often see xi'ß (with the transpose mark) written simply as xß
5
OLS
  • Also called linear regression
  • Assumes the dependent variable is a linear
    combination of the explanatory variables, plus a
    disturbance
  • Least squares: the ßs are estimated so as to
    minimise the sum of the squared residuals (the e²s).

6
Assumptions
  • Residuals have zero mean.
  • Follows that the es and Xs are uncorrelated.
  • Violated if a regressor is endogenous
  • Eg, number of children in female labour supply
    models
  • Cure by (eg) instrumental variables
  • Homoscedasticity: all es have the same variance
  • Classic example: food consumption and income
  • Cure by using weighted least squares

7
When is OLS appropriate?
  • When you have a continuous dependent variable
  • Eg, you would use it to estimate regressions for
    height, but not for whether a person has a
    university degree.
  • When the assumptions are not obviously violated
  • As a first step in research to get ball-park
    estimates
  • We will use them a lot for this purpose
  • Worked examples
  • Coefficients, P-values, t-statistics
  • Measures of fit (R-squared, adjusted R-squared)
  • Thinking about specification
  • Post-estimation commands
  • Regression diagnostics.
  • A note on the data
  • All examples (in lectures and practicals) drawn
    from a 20% sample of the British Household Panel
    Survey (BHPS); more about the data later!

8
Summarize monthly earned income
9
First worked example
For illustrative purposes only. Not an example of
good practice.
Monthly labour income, for people whose labour
income is > 1
MS = SS / df
Tests whether all coeffs except the constant are
jointly zero
Analysis of variance (ANOVA) table
R-squared = Model SS / Total SS
Root MSE = sqrt(residual MS)
t-stat = coefficient / standard error
Confidence interval: coefficient ± 1.96 standard errors
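The relationships between these header quantities can be checked with a little arithmetic. A plain-Python sketch (not Stata); the sums of squares, coefficient and standard error below are hypothetical, not taken from the slide's output:

```python
import math

# Hypothetical regression summary quantities, for illustration only
model_SS, resid_SS = 2.9e9, 7.1e9          # model and residual sums of squares
n, k = 16458, 6                            # observations, regressors (excl. constant)

total_SS = model_SS + resid_SS
r_squared = model_SS / total_SS            # R-squared = Model SS / Total SS
resid_MS = resid_SS / (n - k - 1)          # MS = SS / df
root_MSE = math.sqrt(resid_MS)             # Root MSE = sqrt(residual MS)

coef, se = -595.0, 30.0                    # hypothetical coefficient and std. error
t_stat = coef / se                         # t-stat = coefficient / standard error
ci = (coef - 1.96 * se, coef + 1.96 * se)  # 95% CI = coefficient ± 1.96 std. errors
```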
10
What do the results tell us?
  • All coefficients except month of interview are
    significant
  • 29% of variation explained
  • Being female reduces income by nearly £600 per
    month
  • Income goes up with age and then down
  • 16458 observations… oops, this is from panel
    data, so there are repeated observations on
    individuals.

11
Add ,cluster(pid) as an option
  • Coefficients, R-squared etc are unchanged from
    the previous specification
  • But standard errors are adjusted: standard errors
    are larger, t-statistics are lower

12
Let's get rid of the month variable
Think about the female coefficient a bit more.
Could it be to do with women working shorter
hours?
13
Control for weekly hours of work
  • Is the coefficient on hours of work reasonable?
  • £5.65 for every additional hour worked:
    certainly in the right ball park.

14
Looking at 2 specifications together
  • R-squared jumps from 29% to 46%
  • Coefficient on female goes from -595 to -315
  • Almost half the effect of gender is explained by
    women's shorter hours of work
  • Age, partner and education coefficients are also
    reduced in magnitude, for similar reasons
  • Number of observations falls from 16460 to
    13998: missing data on hours

15
Interesting post-estimation activities
What age does income peak? Income: Y = ß1·age + ß2·age²
d(Income)/d(age) = ß1 + 2ß2·age
Derivative is zero when age = -ß1/(2ß2) = -79.552/(2 × -0.8732) ≈ 45.5
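The turning-point arithmetic can be reproduced directly from the quoted coefficients, here as a plain-Python check:

```python
# Income = b1*age + b2*age^2 peaks where the derivative b1 + 2*b2*age is zero,
# i.e. at age = -b1 / (2 * b2). Coefficients are from the slide.
b1, b2 = 79.552, -0.8732   # coefficients on age and age-squared
peak_age = -b1 / (2 * b2)  # income peaks in the mid-40s, matching the slide
```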
Is the effect of university qualifications
statistically different from the effect of
secondary education?
16
A closer look at couple coefficient
17
  • Men benefit much more than women from being in a
    couple.
  • Other coefficients also differ between men and
    women, but with the current specification, we can't
    test whether the differences are significant.

18
Logit and Probit
  • Developed for discrete (categorical) dependent
    variables
  • Eg, psychological morbidity, whether one has a
    job. Think of other examples.
  • Outcome variable is always 0 or 1. Estimate
    Pr(y=1) = F(X, ß)
  • OLS (linear probability model) would set F(X, ß) =
    Xß + e
  • Inappropriate because:
  • Heteroscedasticity: the outcome variable is
    always 0 or 1, so e only takes the value -xß or
    1-xß
  • More seriously, one cannot constrain estimated
    probabilities to lie between 0 and 1.

19
Logit and Probit
  • Looking for a function which lies between 0 and
    1
  • Cumulative normal distribution: Probit model
  • Logistic distribution: Logit (logistic) model
  • They are very similar! Note how they lie between
    0 and 1 (vertical axis)
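Both link functions can be written in a few lines of plain Python (the normal CDF via the standard error function), confirming that each maps any value of xß into (0, 1):

```python
import math

def probit_F(z):
    # cumulative standard normal, via the error function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2)))

def logit_F(z):
    # cumulative logistic distribution
    return 1.0 / (1.0 + math.exp(-z))

# Both links map any real index into (0, 1), unlike the linear probability model
for z in (-3, -1, 0, 1, 3):
    assert 0 < probit_F(z) < 1
    assert 0 < logit_F(z) < 1
```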

20
Maximum likelihood estimation
  • Likelihood function: the product of
  • Pr(y=1) = F(xß) for all observations where
    y=1
  • Pr(y=0) = 1 - F(xß) for all observations where
    y=0
  • (think of the probability of flipping exactly
    four heads and two tails, with six coins)
  • Log likelihood: the sum of the logs of these
    probabilities
  • Estimated using an iterative procedure
  • STATA chooses starting values for ßs
  • Computes slopes of likelihood function at these
    values
  • Adjusts ßs accordingly
  • Stops when slope of LF is 0
  • Can take time!
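The iterative procedure can be sketched in plain Python (not Stata) for the simplest possible case: an intercept-only logit fitted by Newton-Raphson, on made-up 0/1 data:

```python
import math

# Hypothetical 0/1 outcomes; at the maximum the fitted probability
# equals the sample mean of y
y = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]

def fit_logit_intercept(y, tol=1e-10):
    b = 0.0                                   # starting value for the coefficient
    for _ in range(100):
        p = 1 / (1 + math.exp(-b))
        score = sum(yi - p for yi in y)       # slope of the log-likelihood
        hess = -len(y) * p * (1 - p)          # curvature of the log-likelihood
        step = -score / hess                  # Newton-Raphson adjustment
        b += step
        if abs(step) < tol:                   # stop when the slope is ~0
            break
    return b

b = fit_logit_intercept(y)
```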

21
Let's look at whether a person works
gen byte work = (jbstat == 1 | jbstat == 2) if
jbstat >= 1 & jbstat != .
22
Logit regression whether have a job
All the iterations
2 × (LL of this model - LL of null model)
Interpret like R-squared, but it is computed
differently
  • From these coefficients, can tell whether
    estimated effects are positive or negative
  • Whether they're significant
  • Something about effect sizes, but it is difficult to
    draw inferences from the coefficients

23
Comparing logit and probit
  • Scaling factor proposed by Amemiya (1981)
  • Multiply Probit coefficients by 1.6 to get an
    approximation to Logit
  • Other authors have suggested a factor of 1.8
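As a quick illustration of the rule of thumb (the probit coefficients below are made up):

```python
# Amemiya's scaling: logit coefficients are roughly 1.6 x probit coefficients.
# The probit values here are hypothetical, for illustration only.
probit_coefs = {"female": -0.42, "age": 0.11}
logit_approx = {name: 1.6 * b for name, b in probit_coefs.items()}
```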

24
Marginal effects
  • After logit or probit estimation, type mfx into
    the command line
  • Calculates marginal effects of each of the RHS
    variables on the dependent variable
  • Slope of the function for continuous variables
  • Effect of a change from 0 to 1 in a dummy variable
  • Can also calculate elasticities
  • By default, calculates mfx at the means of the
    explanatory variables
  • Can also calculate at medians, or at specified
    points

25
Marginal effects
  • Logit and Probit mfx are very similar indeed
  • OLS is actually not too bad

26
Odds ratios
  • Only an option with logit
  • Type or after the comma, as an option
  • Reports odds ratios, that is, how many times more
    (or less) likely the outcome becomes
  • if the variable is 1 rather than 0, in the case
    of a dichotomous variable
  • for each unit increase of the variable, for a
    continuous variable
  • Results >1 show an increased probability, results
    <1 show a decrease
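The conversion from logit coefficient to odds ratio is just exponentiation; the coefficients below are hypothetical:

```python
import math

# Odds ratios are exponentiated logit coefficients (hypothetical values)
coef_female = 0.69      # positive coefficient
coef_age = -0.05        # negative coefficient

or_female = math.exp(coef_female)   # > 1: outcome more likely when female = 1
or_age = math.exp(coef_age)         # < 1: outcome less likely per year of age
```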

27
Other post-estimation commands
  • Likelihood ratio test lrtest
  • Adding an extra variable to the RHS always
    increases the likelihood
  • But, does it add enough to the likelihood?
  • The LR test computes L0/L1 (L_restricted / L_unrestricted);
    -2 × the log of this ratio is a chi-squared stat with d.f.
    equal to the number of variables you are
    dropping.
  • Null hypothesis: the restricted specification is adequate.
  • Only works on nested models, ie, where the RHS
    variables in one model are a subset of the RHS
    variables in the other.
  • How to do it
  • Run the full model
  • Type estimates store NAME
  • Run a smaller model
  • Type estimates store ANOTHERNAME
  • .. And so on for as many models as you like
  • Type lrtest NAME ANOTHERNAME
  • Be careful:
  • Sample sizes must be the same for both models
  • They won't be if the dropped variable is missing
    for some observations
  • Solve the problem by running the biggest model first
    and using e(sample)
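The test statistic itself is easy to compute by hand from the two log-likelihoods; the values below are hypothetical:

```python
# LR statistic = 2 x (LL_unrestricted - LL_restricted), compared with a
# chi-squared critical value with df = number of dropped variables.
# Log-likelihoods here are hypothetical, for illustration only.
ll_full, ll_restricted, df = -8500.2, -8510.9, 3

lr_stat = 2 * (ll_full - ll_restricted)
chi2_crit_5pct = {1: 3.84, 2: 5.99, 3: 7.81}[df]   # standard 5% critical values
reject_restricted = lr_stat > chi2_crit_5pct       # reject the nested model?
```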

28
LR test - example
  • Similar but not identical regression to previous
    examples
  • Add regional variables, decide which ones to keep
  • Looks as though Scotland might stay, also
    possibly SW, NW, N

29
LR test - example
REJECT nested specification
DON'T REJECT nested spec
  • Reject dropping all regional variables against
    keeping the full set
  • Don't reject dropping all but 4, over keeping the
    full set
  • Don't reject dropping all but Scotland, over
    keeping the full set
  • Don't reject dropping all but Scotland, over
    dropping all but 4
  • and just to check: DO reject dropping all
    regional variables against dropping all but
    Scotland

30
Again, specification is illustrative only
  • This is not an example of a finished labour
    supply model!
  • How could one improve the model?
  • Model specification
  • Theoretical considerations,
  • Empirical considerations
  • Parsimony
  • Stepwise regression techniques
  • Regression diagnostics
  • Interpreting results
  • Spotting unreasonable results

31
Other models
  • Other models to be aware of, but not covered on
    this course
  • Multinomial logit and probit
  • Ordered models (ologit, oprobit) for ordered
    outcomes
  • Levels of education,
  • Number of children
  • Excellent, good, fair or poor health
  • Multinomial models (mlogit, mprobit) for multiple
    outcomes with no obvious ordering
  • Working in public, private or voluntary sector
  • Choice of nursery, childminder or playgroup for
    pre-school care
  • Heckman selection model
  • For modelling two-stage procedures
  • Earnings, conditional on having a job at all
  • Having a job is modelled as a probit, earnings
    are modelled as OLS
  • Used particularly for women's earnings
  • Tobit model for censored or truncated data
  • Typically, for data where there are lots of zeros
  • Expenditure on rarely-purchased items, eg cars
  • Children's weights, in an experiment where the
    scales broke and gave a minimum reading of 10kg

32
Competence in STATA
  • Best results in this course if you already know
    how to use STATA competently.
  • Check you know how to
  • Get data into STATA (use and using commands)
  • Manipulate data, (merge, append, rename, drop,
    save)
  • Describe your data (describe, tabulate, table)
  • Create new variables (gen, egen)
  • Work with subsets of data (if, in, by)
  • Do basic regressions (regress, logit, probit)
  • Run sessions interactively and in batch mode
  • Organise your datasets and do-files so you can
    find them again.
  • If you can't do these, upgrade your knowledge
    ASAP!
  • Could enroll in STATA net course 101
  • Costs 110
  • ESRC might pay
  • Courses run regularly
  • www.stata.com

33
SC968 Panel data methods for sociologists
Lecture 1, part 2
  • Introducing Longitudinal Data

34
Overview
  • Cross-sectional and longitudinal data
  • Types of longitudinal data
  • Types of analysis possible with panel data
  • Data management: merging, appending, long and
    wide forms
  • Simple models using longitudinal data

35
Cross-sectional and longitudinal data
  • First, draw the distinction between macro- and
    micro-level data
  • Micro level: firms, individuals
  • Macro level: local authorities, travel-to-work
    areas, countries, commodity prices
  • Both may exist in cross-sectional or longitudinal
    forms
  • We are interested in micro-level data
  • But macro-level variables are often used in
    conjunction with micro-data
  • Cross-sectional data
  • Contains information collected at a given point
    in time
  • (More strictly, during a given time window)
  • Workplace Industrial Relations Survey (WIRS)
  • General Household Survey (GHS)
  • Many cross-sectional surveys are repeated
    annually, but on different individuals
  • Longitudinal data
  • Contains repeated observations on the same
    subjects

36
Types of longitudinal data
  • Time-series data
  • Eg, commodity prices, exchange rates
  • Repeated interviews at irregular intervals
  • UK cohort studies
  • NCDS (1958), BCS70 (1970), MCS (2000)
  • Repeated interviews at regular intervals
  • Panel surveys
  • Usually annual intervals, sometimes two-yearly
  • BHPS, ECHP, PSID, SOEP
  • Some surveys have both cross-sectional and panel
    elements
  • Panels more expensive to collect
  • LFS, EU-SILC both have a rolling panel element
  • Other sources of longitudinal data
  • Retrospective data (eg work or relationship
    history)
  • Linkage with external data (eg, tax or benefit
    records) particularly in Scandinavia
  • May be present in both cross-sectional or
    longitudinal data sets

37
Analysis with longitudinal data
  • The snapshot versus the movie
  • Essentially, longitudinal data allow us to
    observe how events evolve
  • Study flows as well as stocks.
  • Example: unemployment
  • Cross-sectional analysis shows a steady 5%
    unemployment rate
  • Does this mean that everyone is unemployed one
    year out of five?
  • That 5% of people are unemployed all the time?
  • Or something in between?
  • Very different implications for equality, social
    policy, etc

38
The BHPS
  • Interviews about 10,000 adults in about 6,000
    households
  • Interviews repeated annually
  • People followed when they move
  • People join the sample if they move in with a
    sample member
  • Household-level information collected from head
    of household
  • Individual-level information collected from
    people aged 17 and over
  • Young people aged 11-16 fill in a youth
    questionnaire
  • BHPS is being upgraded to Understanding Society
  • Much larger and wider-ranging survey
  • BHPS sample being retained as part of US sample
  • Data set used for this course is a 20% sample of
    BHPS, with selected variables

39
The BHPS
  • All files prefixed with a letter indicating the
    year
  • All variables within each file also prefixed with
    this letter
  • 1991 = a
  • 1992 = b, and so on; so far up to p
  • Several files each year, containing different
    information
  • hhsamp information on sample households
  • hhresp household-level information on households
    that actually responded
  • indall info on all individuals in responding
    households
  • indresp info on respondents to main questionnaire
    (adults)
  • egoalt file showing relationship of household
    members to one another
  • income incomes
  • Extra files each year containing derived
    variables
  • Work histories, net income files
  • And others with occasional modules, eg life
    histories in wave 2
  • bjobhist blifemst bmarriag bcohabit bchildnt

40
Some BHPS files
  • 768.1k aindall.dta
  • 10.7M aindresp.dta
  • 1626.3k ahhresp.dta
  • 330.6k ahhsamp.dta
  • 1066.4k aincome.dta
  • 541.3k aegoalt.dta
  • 303.8k ajobhist.dta
  • 635.3k bindsamp.dta
  • 978.2k bindall.dta
  • 11.0M bindresp.dta
  • 1499.7k bhhresp.dta
  • 257.1k bhhsamp.dta
  • 1073.0k bincome.dta
  • 546.5k begoalt.dta
  • 237.8k bjobhist.dta

  • 624.3k cindsamp.dta
  • 975.6k cindall.dta
  • 11.0M cindresp.dta
  • 1539.0k chhresp.dta
  • 287.4k chhsamp.dta
  • 1008.9k cincome.dta
  • 542.2k cegoalt.dta
  • 237.8k cjobhist.dta
  • 1675.0k clifejob.dta
  • 616.7k dindsamp.dta
  • 943.7k dindall.dta
  • 11.2M dindresp.dta
  • 1508.9k dhhresp.dta
  • 301.9k dhhsamp.dta
  • 1019.7k dincome.dta
  • 531.8k degoalt.dta
  • 245.0k djobhist.dta
  • 129.0k dyouth.dta
  • 4977.3k xwaveid.dta
  • 1027.7k xwlsten.dta

Following sample members
Youth module introduced 1994
Extra modules in Wave 2
Cross-wave identifiers
41
Person and household identifiers
  • BHPS (along with other panels such as ECHP, SOEP
    and PSID) is a household survey, so everyone living
    in sample households becomes a member
  • Need identifiers to
  • Associate the same individual with him- or
    herself in different waves
  • Link members of same household with each other in
    the same wave
  • - the HID identifier
  • Note no such thing as a longitudinal household!
  • Household composition changes, household location
    changes..
  • HID is a cross-sectional concept only!

42
What it looks like: 4 waves of data, sorted by
pid and wave.
Observations in rows, variables in columns. Blue
stripes show where one individual ends and another
begins
Not present at 2nd wave
A child, so no data on job or marital status
43
(Can also use ,nol option)
44
Joining data sets together
Adding extra variables: the merge command
Adding extra observations: the append command
45
Whether appending or merging
  • The data set you are using at the time is called
    the master data
  • The data set you want to merge it with is called
    the using data
  • Make sure you can identify observations properly
    beforehand
  • Make sure you can identify observations uniquely
    afterwards

46
Appending
  • Use this command to add more observations
  • Relatively easy
  • Check first that you are really adding
    observations you don't already have (or that if
    you are adding duplicates, you really want to do
    this)
  • Syntax: append using using_data
  • STATA simply sticks the using data on the end
    of the master data
  • STATA re-orders the variables if necessary.
  • If the using data contain variables not present
    in the master data, STATA sets the values of
    these variables to missing in the using data
  • (and vice versa if the master data contains
    variables not present in the using data)

47
Merging is more complicated
  • Use merge to add more variables to a data set

Master data: age.dta
  pid    wave  age
  28005  1     30
  19057  1     59
  28005  2     31
  19057  3     61
  19057  4     62
  28005  4     33
Using data: sex.dta
  pid    wave  sex
  19057  1     female
  19057  3     female
  28005  1     male
  28005  2     male
  28005  4     male
  42571  1     male
  42571  3     male
  • First, make sure both data sets are sorted the
    same way
  • use sex.dta
  • sort pid wave
  • save, replace
  • use age.dta
  • sort pid wave

48
Merging
Master data: age.dta
  pid    wave  age
  19057  1     59
  19057  3     61
  19057  4     62
  28005  1     30
  28005  2     31
  28005  4     33
Using data: sex.dta
  pid    wave  sex
  19057  1     female
  19057  3     female
  28005  1     male
  28005  2     male
  28005  4     male
  42571  1     male
  42571  3     male
  • Notice that the two data sets don't contain the same
    observations
  • merge 1:1 pid wave using sex

New in STATA this year: the 1:1 part shows you are expecting
one using observation for each master
observation
  pid    wave  age  sex     _merge
  19057  1     59   female  3
  19057  3     61   female  3
  19057  4     62   .       1
  28005  1     30   male    3
  28005  2     31   male    3
  28005  4     33   male    3
  42571  1     .    male    2
  42571  3     .    male    2
49
  • STATA creates a variable called _merge after
    merging
  • _merge == 1: observation in master but not using data
  • _merge == 2: observation in using but not master data
  • _merge == 3: observation in both data sets
  • Options available for discarding some
    observations: see the manual
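The _merge logic can be mimicked in plain Python (not Stata) with dictionaries keyed on (pid, wave), using the age.dta/sex.dta values from the example above:

```python
# A sketch of what merge 1:1 does: match on (pid, wave) and record
# where each observation came from, as Stata's _merge variable does.
age = {(19057, 1): 59, (19057, 3): 61, (19057, 4): 62,
       (28005, 1): 30, (28005, 2): 31, (28005, 4): 33}
sex = {(19057, 1): "female", (19057, 3): "female", (28005, 1): "male",
       (28005, 2): "male", (28005, 4): "male", (42571, 1): "male",
       (42571, 3): "male"}

merged = {}
for key in sorted(set(age) | set(sex)):
    in_master, in_using = key in age, key in sex
    merged[key] = {
        "age": age.get(key),                 # missing (None) if not in master
        "sex": sex.get(key),                 # missing (None) if not in using
        "_merge": 3 if in_master and in_using else (1 if in_master else 2),
    }
```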

50
More on merging
  • Previous example showed one-to-one merging
  • Not every observation was in both data sets, but
    every observation in the master data was matched
    with a maximum of only one observation in the
    using data and vice versa.
  • Many-to-one merging
  • Household-level data sets contain only one
    observation per household (usually <1 per person)
  • Regional data (eg, regional unemployment data),
    usually one observation per region
  • Sample syntax: merge m:1 hid wave using
    hhinc_data

Individual-level data
  hid   pid    age
  1604  19057  59
  2341  28005  30
  3569  42571  59
  4301  51538  22
  4301  51562  4
  4956  59377  46
  5421  64966  70
  6363  76166  77
  6827  81763  71
  6827  81798  72
After merging
  hid   pid    age  h/h income
  1604  19057  59   780
  2341  28005  30   1501
  3569  42571  59   268
  4301  51538  22   394
  4301  51562  4    394
  4956  59377  46   1601
  5421  64966  70   225
  6363  76166  77   411
  6827  81763  71   743
  6827  81798  72   743
Household-level data
  hid   h/h income
  1604  780
  2341  1501
  3569  268
  4301  394
  4956  1601
  5421  225
  6363  411
  6827  743
  • One-to-many merging
  • Job and relationship files contain one
    observation per episode (potentially >1 per
    person)
  • Income files contain one observation per source
    of income (potentially >1 per person)
  • Sample syntax: merge 1:m pid wave using
    births_data

51
Long and wide forms
  • The data we have here is in long form
  • One row for each person/wave combination
  • From a few slides back

52
Wide form
  • However, it's also possible to put longitudinal
    data into wide form
  • One observation per person, with different
    variables relating to different years of data

Sex doesn't change, usually
Age at wave 1, and so on
53
The reshape command
  • Switching from long to wide
  • reshape wide stubnames, i(id) j(year)
  • In BHPS, this becomes
  • reshape wide stubnames, i(pid) j(wave)
  • What are stub names?
  • They are a list of variables which vary between
    years
  • Variables like sex or ethnicity would not
    normally be included in this list
  • Switching from wide to long
  • Exactly the opposite
  • reshape long stubnames, i(id) j(wave)
  • Lots more info and examples in STATA manual
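What reshape wide does can be sketched in plain Python (not Stata): collect each person's rows and suffix each stub variable with the wave number. The rows below are made up:

```python
# Long form: one row per person/wave combination (hypothetical data)
long_rows = [
    {"pid": 1, "wave": 1, "age": 30}, {"pid": 1, "wave": 2, "age": 31},
    {"pid": 2, "wave": 1, "age": 59}, {"pid": 2, "wave": 2, "age": 60},
]

# Wide form: one record per person, stub 'age' suffixed with the wave number
wide = {}
for row in long_rows:
    person = wide.setdefault(row["pid"], {"pid": row["pid"]})
    person[f"age{row['wave']}"] = row["age"]
```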

54
Simple models using longitudinal data
  • Auto-regressive and time-lagged models
  • Models of change

55
But first the GHQ
  • Use this for lots of analysis in the lectures and
    practical sessions
  • General Health Questionnaire
  • Different versions exist; BHPS carries the GHQ-12, with
    12 questions.
  • Have you recently
  • been able to concentrate on whatever you're
    doing ?
  • lost much sleep over worry ?
  • felt that you were playing a useful part in
    things ?
  • felt capable of making decisions about things ?
  • felt constantly under strain?
  • felt you couldn't overcome your difficulties ?
  • been able to enjoy your normal day to day
    activities ?
  • been able to face up to problems ?
  • been feeling unhappy or depressed?
  • been losing confidence in yourself?
  • been thinking of yourself as a worthless person
    ?
  • been feeling reasonably happy, all things
    considered ?
  • Answer each question on 4-point scale
  • not at all - no more than usual - rather more -
    much more

56
GHQ
(ghq) 1 likert       Freq.  Percent   Cum.
------------------------------------------
missing or wild        582     2.10   2.10
proxy respondent     1,202     4.33   6.43
0                       77     0.28   6.70
1                      109     0.39   7.10
2                      149     0.54   7.63
3                      288     1.04   8.67
4                      504     1.82  10.49
5                      867     3.12  13.61
6                    2,229     8.03  21.64
7                    2,265     8.16  29.80
8                    2,355     8.48  38.28
9                    2,426     8.74  47.02
10                   2,259     8.14  55.16
11                   2,228     8.03  63.19
12                   2,478     8.93  72.11
13                   1,316     4.74  76.85
14                   1,115     4.02  80.87
  • HLGHQ1 in BHPS
  • Sum of scores
  • LIKERT scale
  • We recode <0 values to missing, rename LIKERT
  • Consider as a continuous variable

57
GHQ
  • HLGHQ2
  • Caseness scale
  • Recodes answers 3-4 as 1, and adds up
  • Scores above 2 used to indicate psychological
    morbidity

58
Time-lagged models
Start with a simple OLS model. The Likert score is a
measure of psychological wellbeing derived from a
battery of questions
59
Generate lagged variable
NB the 1/30 here is just so it will fit on the
page. You should check many more observations
than this!
60
OLS, with lagged dependent variable
R-squared rockets from 5% to 26%
Big, very significant coefficient on the lagged
variable
Coeff on ue_sick falls from 3.6 to 2.1
Also possible to include lagged explanatory
variables
61
Models of change
Start with an OLS model (simplified, but imagine
more variables)
Separate model for each year; suffix denotes year
Subtract the 1st from the 2nd model
Or, express in terms of change
62
Generate difference variables
  • capture drop dif
  • sort pid wave
  • gen dif_LIKERT = LIKERT - LIKERT[_n-1] if
    pid == pid[_n-1] & wave == wave[_n-1] + 1
  • gen dif_age = age - age[_n-1] if
    pid == pid[_n-1] & wave == wave[_n-1] + 1
  • gen dif_age2 = age2 - age2[_n-1] if
    pid == pid[_n-1] & wave == wave[_n-1] + 1
  • gen dif_female = female - female[_n-1] if
    pid == pid[_n-1] & wave == wave[_n-1] + 1
  • gen dif_ue_sick = ue_sick - ue_sick[_n-1] if
    pid == pid[_n-1] & wave == wave[_n-1] + 1
  • gen dif_partner = partner - partner[_n-1] if
    pid == pid[_n-1] & wave == wave[_n-1] + 1

Check you understand why dif_female will very
nearly always be zero
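The same differencing rule, sketched in plain Python (not Stata) on made-up rows: subtract the previous row only when it belongs to the same person and the immediately preceding wave:

```python
# (pid, wave, LIKERT) tuples, already sorted by pid then wave (hypothetical)
rows = [
    (1, 1, 10), (1, 2, 12), (1, 4, 9),
    (2, 1, 25), (2, 2, 22),
]

diffs = []
for prev, cur in zip(rows, rows[1:]):
    same_person = cur[0] == prev[0]       # pid == pid[_n-1]
    consecutive = cur[1] == prev[1] + 1   # wave == wave[_n-1] + 1
    diffs.append(cur[2] - prev[2] if same_person and consecutive else None)
```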
63
Check for sensible results!
64
More checking.
65
Obvious problems
  • Interview times may mean a difference of 1 or 0
    in the age difference variable
  • Most differences are zero
  • Moving into unemployment or partnership is given
    equal and opposite weighting to moving out. No
    real reason why this should be the case
  • There are MUCH better ways to use these data!
  • Nevertheless, let's proceed!

66
Results
Age increase equal and opposite to constant
Female drops out
Coeffs on sick and partner significant