Intro to Stata

A Gentle Introduction to STATA
Jose Ramon G. Albert, Research Division Chief, Statistical Research & Training Center (SRTC)
Email: srtcres_at_srtc.gov.ph

1
SIAP-SRTC Training Course on Sampling
Acceed Center, AIM, Makati, Philippines, 4 April 2002
2
OUTLINE
  • Statistical Computing Resources
  • Data Management with Stata
  • Table Generation
  • Tab and Table Commands
  • Survey Commands

4
Computing Resources
  • The Age of ICT has brought about a synergy of
    computing and communications
  • Implications
  • More DATA collected
  • More DATA stored
  • More DATA accessible and distributed

5
Computing Resources
  • A host of statistical software packages provide
    pre-programmed analytical and data management
    capabilities. These packages may be classified
    according to use and cost.

6
Computing Resources
  • Types of Stat Software by usage
  • General Purpose -- SAS, SPSS, R, S-plus,
    Statistica, Stata
  • Special Purpose -- econometric modeling
    (EViews), seasonal adjustment (X12), Bayesian
    modeling (WinBUGS), survey data tabulation and
    variance estimation (IMPS, CENVAR)

7
Computing Resources
  • Types of Stat Software by cost
  • Commercial Software - SAS, SPSS, Stata, S-plus
  • Freeware - R, IMPS, X12

8
Computing Resources
  • FOR SURVEY DATA
  • Bascula from Statistics Netherlands.
  • CENVAR (IMPS) from U.S. Bureau of the Census.
  • CLUSTERS from University of Essex.
  • Epi Info from Centers for Disease Control.
  • Generalized Estimation System (GES) from
    Statistics Canada.
  • IVEWare (beta version) from University of
    Michigan.

9
Computing Resources
  • FOR SURVEY DATA
  • PCCARP from Iowa State University.
  • SAS/STAT from SAS Institute.
  • Stata from Stata Corporation.
  • SUDAAN from Research Triangle Institute.
  • VPLX from U.S. Bureau of the Census.
  • WesVar from Westat, Inc.

10
Computing Resources
  • Lists of Statistical Software
  • http://members.aol.com/johnp71/javasta2.html
  • http://www.stir.ac.uk/Departments/HumanSciences/SocInfo/Statistical.htm
  • http://www.fas.harvard.edu/stats/survey-soft/
  • http://www.feweb.vu.nl/econometriclinks/software.html

11
Computing Resources
  • This afternoon, we will provide a demonstration
    on how to use STATA for accomplishing some of the
    most common tasks of data management, statistical
    computing and analysis of survey data.

12
Computing Resources
  • Stata
  • Estimation of means, totals, ratios, and
    proportions
  • linear regression, logistic regression, and
    probit.
  • Point estimates, associated standard errors,
    confidence intervals, and design effects for the
    full population or subpopulations are displayed.

13
Computing Resources
  • Stata
  • Auxiliary commands display various information
    for linear combinations (e.g., differences) of
    estimators, and conduct hypothesis tests.
  • New in Stata: contingency tables with Rao-Scott
    corrections of chi-squared tests; new
    survey-corrected regression commands including
    tobit, interval, censored, instrumental
    variables, multinomial logit, ordered logit and
    probit, and Poisson

14
Computing Resources
  • Stata
  • stratified designs
  • cluster sampling
  • FPCs can be calculated for simple random sampling
    w/o replacement of sampling units within strata
  • variance estimation for multistage sample data
    carried out through the customary
    between-PSU-squared-differences calculation.

15
Computing Resources
  • Stata
  • Variance estimation is done thru Taylor-series
    linearization in the survey analysis commands.
    There are also commands for jackknife and
    bootstrap variance estimation, but these are not
    specifically oriented toward survey data.

16
Computing Resources
  • Note
  • We will demonstrate the use of STATA version 6.
    The current version is version 7, which also
    comes in a Special Edition (SE) that can handle
    up to 32,766 variables, strings of up to 244
    characters, and matrices up to 11,000 x 11,000.

18
Data Management
  • STARTING UP
  • Go to Start, Programs, Stata, Intercooled Stata
  • Alternatively, from Windows Explorer, go to
    folder
  • c:\stata
  • Double click
  • wstata.exe

19
Data Management
20
Data Management
  • CREATING A NEW DATASET
  • Open the STATA spreadsheet editor

21
Data Management
  • CREATING A NEW DATASET
  • Enter data into the editor, when done close the
    editor.

22
Data Management
  • CREATING A NEW DATASET
  • In the STATA COMMAND window enter the command
  • save newfile

23
Data Management
  • NOTE
  • A STATA dataset has the file extension .dta.
    That is, newfile is actually newfile.dta
  • Public use files of some surveys, e.g. VLSS
    (Vietnam Living Standards Survey), are in Stata
    format.

24
Data Management
  • INSPECTING DATA BASE
  • In the STATA COMMAND window enter the following
    commands
  • describe
  • list
  • summarize
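The three inspection commands can be tried on any dataset in memory. A minimal sketch, assuming a recent Stata release in which the bundled example dataset auto can be loaded with sysuse (the slides use Stata 6 and their own data, so the dataset and variable names below are stand-ins):

```stata
* Load a small example dataset shipped with Stata (assumption: sysuse is available)
sysuse auto, clear

* Variable names, storage types, and labels
describe

* Print the first five observations of three variables
list make price mpg in 1/5

* Number of observations, mean, standard deviation, minimum and maximum
summarize price mpg
```

Restricting list with a variable list and an in range keeps the output short on larger files.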

25
Data Management
  • NOTE
  • Stata is case sensitive.
  • Stata commands may be abbreviated, e.g. d for
    describe, sum for summarize, etc. Commands are
    typed in lowercase.
  • We may use Page Up/Down keys or mouse for
    re-selecting commands in the Review window.

26
Data Management
  • NOTE
  • Commands and output are shown in Results window.
    Windows may be re-sized.
  • Commands and output may be logged into a log
    file by pressing Open Log button.

27
Data Management
  • RENAMING VARIABLES
  • ONE WAY (From Data Editor) Double click
    anywhere in the variables column resulting in a
    dialogue box

28
Data Management
  • RENAMING VARIABLES
  • SECOND WAY (In the STATA COMMAND window) enter
  • rename var1 domain
  • rename var2 hcn
  • rename var3 age
  • label variable age "HH head age"
  • d

29
Data Management
  • SAVING EDITED DATABASE
  • In the STATA COMMAND window enter the following
    commands
  • save newfile, replace
  • Note: typing only
  • save newfile
  • will result in an error message, because
    newfile.dta already exists

30
Data Management
  • READING A PRE-EXISTING STATA DATASET
  • If the dataset is in folder c:\fies2000 and the
    filename is fies00small.dta, enter
  • clear
  • set mem 64m
  • cd c:\fies2000
  • use fies00small

NOTE: set mem is important for MEMORY MANAGEMENT
31
Data Management
  • IMPORTING DATA
  • Suppose we have a dataset try.txt in the
    c:\fies2000 folder

NOTE: Missing data are coded as .
32
Data Management
  • IMPORTING DATA
  • Suppose we have a dataset try.txt in the
    c:\fies2000 folder
  • Use the infile command with syntax
  • infile variable-list using filename.raw
  • In particular, enter
  • cd c:\fies2000
  • infile domain hcn age using try.txt, automatic
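The free-format read can be tried without the course files. The sketch below first writes a small text file and then reads it back with infile; the file name try_demo.txt and the three rows of values are invented for illustration, and the file-writing commands assume a more recent Stata release than the version 6 used in the slides.

```stata
* Write a three-line, space-separated text file (hypothetical values)
tempname fh
file open `fh' using try_demo.txt, write text replace
file write `fh' "1 101 45" _n
file write `fh' "1 102 ." _n
file write `fh' "2 201 38" _n
file close `fh'

* Read it back; the lone period in row 2 is stored as a missing value
infile domain hcn age using try_demo.txt, clear
list
```

The clear option plays the same role as typing clear before infile: it discards whatever dataset is currently in memory.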

33
Data Management
  • TRIVIA ON STRING VARIABLES
  • When using the infile command for character
    (string) variables, we need to identify these
    variables. For instance
  • infile domain hcn str30 prov using tr.txt
  • For more details regarding infile, enter
  • help infile1

34
Data Management
  • IMPORTING DATA
  • Suppose we have a dataset try2.txt in the
    c:\fies2000 folder with the data in specific
    fields

Assumes last line is blank line
35
Data Management
  • IMPORTING DATA
  • Suppose we have a dataset try2.txt in the
    c:\fies2000 folder with the data in specific
    fields
  • Use the infix command
  • infix domain 1 hcn 2 age 3-4 using try2.txt, clear

36
Data Management
  • Thus, Stata can read text files with
  • infile (if the data in the text file are
    separated by spaces and do not have strings, or
    if strings are just one word, or if all strings
    are enclosed in quotes)
  • infix (fixed format text)
  • insheet (if the text file was created by a
    spreadsheet or db program)

37
Data Management
  • NOTE
  • The commands infile, infix, insheet read data
    from ASCII files. The outfile command saves the
    data in ASCII.
  • There are third party programs, esp.
    Stat/Transfer and DBMS/COPY, that perform
    translations from one data format (e.g., dBASE,
    Excel, SAS, SPSS, Stata) to another.

38
Data Management
39
Data Management
  • OTHER USEFUL COMMANDS
  • To sort the dataset by age
  • sort age
  • To get a listing of the dataset
  • list
  • To get a listing of the 2nd to 4th observations
  • list in 2/4

40
Data Management
  • OTHER USEFUL COMMANDS
  • To summarize the restricted dataset of HHs whose
    head's age is less than/equal to 50
  • summarize if age <= 50
  • HH head age between 35 and 50
  • summarize if age <= 50 & age >= 35

41
Data Management
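  • Comparison operators
  • >  (greater than)   >= (greater than or equal)
  • <  (less than)      <= (less than or equal)
  • == (equal)          != (not equal)
  • Logical operators
  • & (and)   ! (not)
  • | (or)    ~ (not)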
42
Data Management
  • OTHER USEFUL COMMANDS
  • To tabulate domain
  • tab domain
  • To generate contingency tables
  • tab domain hcn if age > 35
  • To get the correlation matrix
  • correlate x y z

43
Data Management
  • GENERATING AND REPLACING VARIABLES
  • Suppose we want to obtain per capita income (pci)
    of FIES 2000 households
  • clear
  • cd d:\fies00
  • use fies00small
  • gen pci = toinc/hsize

44
Data Management
  • GENERATING AND REPLACING VARIABLES
  • Now tag the household as poor (1) if pci < some
    threshold, say 13823, and determine the percent
    of HHs that are poor.
  • gen poor = 1 if pci < 13823
  • replace poor = 0 if poor == .
  • sum poor [aw=rfact]
  • save fies00small, replace
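The same tagging can be written in one step. A sketch on a toy dataset of five households: the values are invented for illustration, only the variable names (toinc, hsize, rfact, pci, poor) and the 13823 threshold come from the slides, and the missing() function assumes a newer release than Stata 6.

```stata
* Toy data: five hypothetical households (values invented for illustration)
clear
input toinc hsize rfact
 40000 5 120
 90000 4 150
 30000 6 100
200000 3  80
 55000 5 130
end

* Per capita income
generate pci = toinc/hsize

* Poverty indicator: 1 if below the line, 0 otherwise,
* and missing when pci itself is missing
generate poor = (pci < 13823) if !missing(pci)

* Weighted proportion of poor households
summarize poor [aw=rfact]
```

Unlike the gen/replace pair, this form does not turn households with a missing pci into non-poor households.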

45
Data Management
  • NOTE
  • Small portion of data set of FIES 2000 was used.
    The Family Income and Expenditure Survey (FIES)
    is conducted by the National Statistics Office
    (NSO) every 3 years. Data may be purchased
    through the NSO website
  • www.census.gov.ph

46
SIAP-SRTC Training Course on Sampling
Acceed Center, AIM, Makati, Philippines, 5 April 2002
47
Data Management
  • RECALL
  • That if we use our fies2000 data set
  • set mem 64m
  • cd c:\fies2000
  • use fies00small
  • sum poor [aw=rfact]
  • Note: the poverty line we provided is a weighted
    average of the variable poverty lines in the
    Philippines (for urban-rural areas across the
    different regions)

49
Estimating Food Poverty Line
  • Food poverty line estimated from low-cost one-day
    menus (breakfast, lunch, supper and snack)
    constructed for each urban-rural area of a region
    by the Food and Nutrition Research Institute
    (FNRI), which meet 100% sufficiency in energy and
    protein requirements and 80% sufficiency in other
    nutrients and vitamins.
  • RDA for energy: 2000 kcal per person
  • RDA for protein: 50 grams per person
  • 29 such menus constructed on the basis of the
    1988 Food Consumption Survey

50
Annual Per Capita Food Line Urban, by Region
51
Annual Per Capita Food Line Rural, by Region
52
Estimating Poverty Line
  • Poverty Line = Food Threshold / Engel's Coefficient
  • Engel's coefficient is estimated by analyzing the
    consumption pattern of families having incomes
    within plus or minus 10 percentage points from
    the food threshold.
  • Engel's coeff = Food Exp / Total Basic Exp
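The two ratios can be checked with a few lines of arithmetic. In the sketch below all three input figures are hypothetical round numbers chosen for illustration, not official thresholds, and the scalar names are our own.

```stata
* Hypothetical inputs, not official figures
* Annual per capita food threshold, in pesos
scalar food_threshold = 9000
* Food expenditure and total basic expenditure of the reference families
scalar food_exp    = 6000
scalar total_basic = 10000

* Engel coefficient = food expenditure / total basic expenditure
scalar engel = food_exp/total_basic

* Poverty line = food threshold / Engel coefficient
scalar poverty_line = food_threshold/engel

display "Engel coefficient: " engel
display "Poverty line: " poverty_line
```

Because the coefficient is a share below 1, dividing by it always scales the food threshold upward to allow for non-food basic needs.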

53
Annual Per Capita Poverty Line Urban, by Region
54
Annual Per Capita Poverty Line Rural, by Region
55
Poverty Statistics (Family)
Measures            2000 (SE)    1997
Poverty Incidence   33.6 (0.3)   31.8
Poverty Gap         10.7 (0.1)   10.0
Severity Index       4.6 (0.1)    4.3
SE = Standard Error
56
Poverty Incidence All Areas, by Region
57
Small Area Poverty Stats?
  • Stata has some add ons for generating SEs for
    poverty stats
  • If we wish to generate provincial poverty
    statistics, we will find out that SEs are too
    high, i.e. figures are unreliable

59
Data Management
  • RECALL
  • That if we use our fies2000 data set
  • set mem 64m
  • cd c:\fies2000
  • use fies00small
  • sum poor [aw=rfact]
  • Note: the poverty line we provided is a weighted
    average of the variable poverty lines in the
    Philippines (for urban-rural areas across the
    different regions)

60
Data Management
  • NOTE
  • STATA uses several types of weights
  • fw frequency weights
  • aw analytic weights
  • iw importance weights
  • pw probability weights
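Weights are written in square brackets after the variable list. A sketch of the four types, assuming a recent Stata release and its bundled auto dataset; the variables used as weights (weight, rep78, gear_ratio) are stand-ins picked only so the commands run, not real survey weights.

```stata
sysuse auto, clear

* aweight: weighted mean of mpg
summarize mpg [aw=weight]

* fweight: each observation counts as many times as its integer weight
tabulate foreign [fw=rep78]

* pweight: sampling weights; regress then reports robust standard errors
regress mpg length [pw=weight]

* iweight: weights are used as given
tabulate foreign [iw=gear_ratio]
```

Not every command accepts every weight type; summarize, for example, does not take pweights, which is one reason the survey commands exist.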

61
Data Management
  • NOTE
  • Within the command generate or replace, we may
    transform or create variables by using functions,
    e.g.,
  • generate loginc = ln(toinc)
  • generate y = cos(x*_pi/180)
  • replace newvar = normd(z)
  • generate rvar = uniform()
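A self-contained sketch of the same transformations on five generated observations. It assumes a recent Stata release, where, to our understanding, the normal density and uniform random number functions are documented as normalden() and runiform() rather than the normd() and uniform() of the slides; the variable values are invented.

```stata
clear
set obs 5
set seed 12345

* Angle in degrees, converted to radians with the built-in constant _pi
generate x = 45*_n
generate y = cos(x*_pi/180)

* Natural logarithm of a positive variable
generate toinc = 10000*_n
generate loginc = ln(toinc)

* Standard normal density and a uniform(0,1) draw, under the newer names
generate z = (_n - 3)/2
generate dens = normalden(z)
generate rvar = runiform()

list
```

Setting the seed first makes the random draws reproducible from one run to the next.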

62
Data Management
  • DELETING VARIABLES/DATA
  • To drop a variable, say age
  • drop age
  • To drop some observations
  • drop in 2/3
  • Try also the command keep.
  • To drop all data in memory
  • clear
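drop and keep can be compared side by side. A sketch assuming a recent Stata release and its bundled auto dataset (the variable names are from that dataset):

```stata
sysuse auto, clear

* Drop one variable
drop rep78

* Drop the 2nd and 3rd observations
drop in 2/3

* keep is the mirror image: retain only what is named
keep make price mpg foreign
keep if foreign == 1

* How many observations remain
count
```

Neither command touches the file on disk until save is issued, so the original dataset can be reloaded with use.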

63
Data Management
  • NOTE
  • So far we have used STATA interactively. We can
    also do batch processing through the DO FILE
    editor.
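A do-file is a plain text file of commands run from top to bottom. A sketch of what one might contain, assuming a recent Stata release; the file names example.do and example.log are hypothetical, and the bundled auto dataset stands in for the course data.

```stata
* example.do : a hypothetical do-file
* Close any log left open by an earlier run, then start a new one
capture log close
log using example.log, replace text

* Load the data and produce the output to be logged
sysuse auto, clear
describe
summarize price mpg

log close
```

Saved as example.do in the working directory, it could be run by typing do example in the command window.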

64
Data Management
  • NOTE
  • The STATA toolbar has 13 buttons.
  • The first three are to OPEN a Stata dataset
  • SAVE to the disk the resident dataset
  • PRINT a graph or log

65
Data Management
  • The next five are for Starting/stopping/suspending
    a LOG
  • Bringing the Log to the Front
  • Bringing the Dialog to Front
  • Bringing the Results to Front
  • Bringing the Graph to Front

66
Data Management
  • The last five are for
  • Opening the DO FILE editor
  • Opening the DATA editor
  • Opening the DATA Browser
  • Telling Stata to continue when it has paused
    in the middle of long output
  • Stopping the current task

67
Exercise
  • What is the average income of families that are
    below or above the mean family expenditure?

68
Exercise
  • Compare the correlation of food expenditures
    (fexp) and nonfood expenditures for families in
    rural and urban areas.

69
Extra
  • Enter
  • graph food nfood

70
Extra
  • Now try
  • sort urb
  • graph food nfood, by(urb)
  • graph food nfood, by(urb) total

71
Extra
  • Matrix plots
  • graph toinc food nfood, matrix
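The graph commands on these slides follow the Stata 6 syntax. Later releases reorganized the graphics commands, so the sketch below shows what we understand to be the counterparts in a recent release, using the bundled auto dataset as a stand-in for the FIES variables.

```stata
sysuse auto, clear

* Scatterplot of two variables (first variable on the vertical axis)
scatter mpg weight

* One panel per group, plus a panel for all observations
scatter mpg weight, by(foreign, total)

* Scatterplot matrix
graph matrix price mpg weight
```

In this syntax total is a suboption inside by(), and the data do not need to be sorted first.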

73
Table Generation w/ tab
  • Earlier, we showed the use of the tab(ulate)
    command. Try
  • tab urb
  • tab urb [aw=rfact]
  • tab urb [iw=rfact]
  • tab urb regn

74
Tab
  • The tab command has options for generating 1-way
    tables of freqs
  • tab urb, summ(toinc)
  • and two way tables
  • tab urb sex
  • tab urb sex, row
  • tab urb sex, row col chi2
  • tab urb sex, all exact

75
Table Generation w/ table
  • Aside from the tab command, we can generate
    tables of statistics with the table command.
    Compare
  • tab urb
  • with
  • table urb

76
Table
  • To generate the average (family) income and
    average (family) expenditure across urban and
    rural areas, enter
  • table urb, c(mean toinc mean toexp)
  • Using weights
  • table urb [aw=rfact], c(mean toinc mean toexp)
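The contents() option of table belongs to older Stata releases; to our understanding the table command was redesigned in version 17, so the syntax above may not carry over unchanged. A sketch of the same kind of summary with tabstat, whose syntax has stayed stable, assuming a recent release and the bundled auto dataset as stand-in data:

```stata
sysuse auto, clear

* Mean of two variables within each group
tabstat price mpg, by(foreign) statistics(mean)

* Same, with analytic weights (car weight is only a stand-in here)
tabstat price mpg [aw=weight], by(foreign) statistics(mean)

* Mean and median side by side
tabstat price, by(foreign) statistics(mean median)
```

tabstat prints an overall row by default, which table does not do unless asked.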

77
Table
  • The contents option may specify at most five of
    the ff statistics
  • freq (for frequency)
  • mean varname (for mean of varname)
  • sd varname (for standard deviation)
  • sum varname (for sum)
  • rawsum varname (for sums ignoring optionally
    specified weight)
  • count varname (for count of nonmissing data)

78
Table
  • The contents option may specify at most five of
    the ff statistics
  • n varname (same as count)
  • max varname (for maximum)
  • min varname (for minimum)
  • median varname (for median)
  • p1 varname (for 1st percentile)
  • p2 varname (for 2nd percentile)
  • ...
  • iqr varname (for interquartile range)

79
Exercise Using Table
  • Obtain the average and median per capita income
    of households by sex of household head
  • table sex, c(mean pci median pci)
  • Obtain the weighted frequency of poor and
    nonpoor households across regions
  • table poor regn [iw=rfact]

80
Using Survey Commands
  • STATA has designed a family of commands
    especially for sample surveys. These commands
    all begin with svy
  • svyset setting variables
  • svydes describe strata and PSUs
  • svymean estimate popn and subpop means
  • svytotal estimate popn and subpop totals

81
Using Survey Commands
  • Svy commands
  • svyprop estimate popn and subpop props
  • svyratio estimate popn and subpop ratios
  • svytab for two way tables
  • svyreg for regression
  • svyivreg for instrumental variables reg
  • svylogit for logit reg
  • svyprobit for probit reg

82
Using Survey Commands
  • Svy commands
  • svytest for hypothesis testing
  • svylc for estimating linear combs
  • svymlog for multinomial logistic reg
  • svyolog for ordered logistic reg
  • svyoprob for ordered probit reg
  • svypois for poisson reg
  • svyintrg for censored and interval reg

83
Using Survey Commands
  • Before issuing any svy estimation command, we
    identify the weight, strata and PSU identifier
    variables
  • svyset pweight rfact
  • svyset strata domain
  • svyset psu hcn
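The three svyset lines above are the Stata 6 form. In later releases (from about version 9, to our understanding) the design is declared in one statement and estimation uses the svy: prefix. A sketch on a toy dataset of six households: the values are invented for illustration, and only the variable names (domain, hcn, rfact, toinc, toexp, urb) come from the slides.

```stata
* Toy data: two strata with three PSUs each (values invented)
clear
input domain hcn rfact toinc toexp urb
1 1 120  80000  70000 1
1 2 120  60000  58000 1
1 3 120 150000 110000 0
2 4 200  45000  44000 0
2 5 200  52000  50000 1
2 6 200  38000  39000 0
end

* Declare PSU, sampling weight and strata in a single statement
svyset hcn [pweight=rfact], strata(domain)

* Summarize strata and PSUs
svydescribe

* Design-based means and totals
svy: mean toinc toexp
svy: total toinc toexp, over(urb)
```

Each stratum needs at least two PSUs for the linearized variance to be computed, which is why the toy data has three per stratum.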

84
Using Survey Commands
  • To obtain the average family income and average
    family expenditure
  • svymean toinc toexp
  • To obtain the total family income and total
    family expenditure by region
  • svytotal toinc toexp, by(regn)

85
Using Survey Commands
  • To obtain the per capita income and per capita
    expenditure
  • svyratio toinc/fsize toexp/fsize
  • pci and pce by urban/rural
  • svyratio toinc/fsize toexp/fsize, by(urb)

86
Using Survey Commands
  • Linear regression of ln(pci)
  • gen loginc = ln(pci)
  • svyreg loginc age fsize sex prov urb
  • Compare the results with the regular regression
    command
  • reg loginc age fsize sex prov urb

87
Using Survey Commands
  • Two way tables
  • svytab urb poor, row se
  • compared with
  • tab urb poor [aw=rfact], nofreq row
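The ratio, regression and two-way table commands of the last three slides can be sketched with the svy: prefix of later Stata releases (an assumption; the slides use the version 6 commands svyratio, svyreg and svytab). The eight households below are invented, the ratio labels r_inc and r_exp are our own, and only the variable names and the 13823 threshold come from the slides.

```stata
* Toy data: two strata with four PSUs each (values invented)
clear
input domain hcn rfact fsize toinc toexp urb
1 1 120 4  80000  70000 1
1 2 120 6  60000  58000 1
1 3 120 5 150000 110000 0
1 4 120 7  49000  48000 0
2 5 200 3  45000  44000 0
2 6 200 5  52000  50000 1
2 7 200 6  38000  39000 0
2 8 200 2  90000  60000 1
end

generate pci = toinc/fsize
generate poor = (pci < 13823) if !missing(pci)
generate loginc = ln(pci)

svyset hcn [pweight=rfact], strata(domain)

* Per capita income and expenditure as ratios of totals
svy: ratio (r_inc: toinc/fsize) (r_exp: toexp/fsize)
svy: ratio (r_inc: toinc/fsize), over(urb)

* Design-based regression next to the ordinary one
svy: regress loginc fsize urb
regress loginc fsize urb

* Two-way table with row proportions and standard errors
svy: tabulate urb poor, row se
```

Comparing the two regressions shows the effect of the weights on the coefficients and of the design on the standard errors.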

88
Alternatives to STATA
89
Learning More about Stata
  • Online tutorial, type
  • tutorial intro
  • List of Tutorials
    Tutorial    Description
    -------------------------------------------------
    intro       An introduction to Stata
    graphics    How to make graphs
    tables      How to make tables
    regress     Estimating regression models, inc 2SLS
    anova       Estimating one-, two- and N-way ANOVA
                and ANCOVA models

90
Learning More about Stata
    Tutorial    Description
    -------------------------------------------------
    logit       Estimating maximum-likelihood logit
                and probit models
    survival    Estimating ML survival models
    factor      Estimating factor and principal
                component models
    ourdata     Description of the data we provide
    yourdata    How to input your own data into Stata

91
Learning More about Stata
  • Email distribution list. Send email to
  • Majordomo_at_hsphsun2.harvard.edu
  • In the body of your email message type the
    message
  • subscribe statalist email_at_address
  • or, for a daily summary,
  • subscribe statalist-digest email_at_address

92
  • Maraming Salamat sa inyong pakikinig.
  • (Thank you for your attention)