Generating new variables and manipulating data with STATA - PowerPoint PPT Presentation

1 / 37
About This Presentation
Title:

Generating new variables and manipulating data with STATA

Description:

Sex is m/f, not 1/0. Gestational age has nonsense values (0, 60) ... 1 raw data file, original format. 1 raw data file, Stata format. 1 do file that cleans it up ... – PowerPoint PPT presentation

Number of Views:338
Avg rating:3.0/5.0
Slides: 38
Provided by: MarkPl5
Category:

less

Transcript and Presenter's Notes

Title: Generating new variables and manipulating data with STATA


1
Generating new variables and manipulating data
with STATA
  • Biostatistics 212
  • Lecture 3

2
Housekeeping
  • Lab 1 handed back today
  • Think of red ink as teaching points, not
    penalties
  • Do and Log files
  • Understand each command!
  • Order them appropriately

3
The .do file template
cd C\data\biostat212\ version 10 set more
off clear set memory 10m capture log
close log using name of your log.log,
replace / here are my comments / use name
of your dataset, clear summarize this browse
that tabulate this log close set more on
  • Tell STATA where to look for things and where to
    put things
  • Run correct version of Stata
  • Stop STATA from prompting you to push a button to
    continue
  • Tell STATA to clear any datasets in memory and
    increase its mem capacity
  • Since your do file may not be perfect, tell STATA
    to close any logs that are open when you try to
    run your do file
  • Tell STATA to create a log of your output for you
    and what youre going to call that log. Tell it
    to overwrite it each time
  • Stick in some comments to remind you what this do
    file is for
  • Tell STATA what dataset to work on
  • Leave some SPACE for putting in analysis commands
    you want to keep
  • Lastly, tell STATA to close the log and go back
    to its usual more mode

4
Today...
  • What we did in Lab 1, and why it was unrealistic
  • What does data cleaning mean?
  • How to generate a variable
  • How to manipulate the data in your new variable
  • How to label variables and otherwise document
    your work
  • Examples

5
Last time
  • What was unrealistic?

6
Last time
  • What was unrealistic?
  • The dataset came as a Stata .dta file

7
Last time
  • What was unrealistic?
  • The dataset came as a Stata .dta file
  • The variables were ready to analyze

8
Last time
  • What was unrealistic?
  • The dataset came as a Stata .dta file
  • The variables were ready to analyze
  • Most variables were labeled

9
Last time
  • i.e. The data was clean

10
How your data will arrive
  • On paper forms
  • In a text file (comma or tab delimited)
  • In Excel
  • In Access
  • In another data format (SAS, etc)

11
Importing into Stata
  • Options
  • Copy and Paste
  • insheet, infile, fdause, other flexible Stata
    commands
  • A convenience program like Stat/Transfer

12
Importing into Stata
  • Make sure it worked
  • Look at the data

13
Importing into Stata
  • Demo neonatal opiate withdrawal data
  • Import with cut and paste from Excel
  • Import with insheet (save as .csv file first)

14
Exploring your data
  • Figure out what all those variables mean
  • Options
  • Browse, describe, summarize, list in STATA
  • Refer to a data dictionary
  • Refer to a data collection form
  • Guess, or ask the person who gave it to you

15
Exploring your data
  • Demo Neonatal opiate withdrawal data

16
Exploring your data
  • Demo Neonatal opiate withdrawal data
  • Problems arise
  • Sex is m/f, not 1/0
  • Gestational age has nonsense values (0, 60)
  • Breastfeeding has a bunch of weird text values
  • Drug variables coded y or blank
  • Many variable names are obscure

17
Cleaning your data
  • You must clean your data so it is ready to
    analyze.

18
Cleaning your data
  • What does the variable measure?
  • rename and/or label var so its clear
  • Find nonsense values and outliers
  • recode as missing or track down real value?
  • Deal with missing values
  • Too many values missing in some subjects? Coding
    consistent?
  • drop variable or observation?
  • Categorize as needed
  • generate a new numeric variable
  • recode (dichotomous variables coded as 1/0,
    watch missing values)
  • label define and then label values
  • Check
  • tab oldvar newvar, missing
  • bysort catvar sum contvar

19
Cleaning your data
  • The importance of documentation
  • Retracing your steps
  • Document every step using a do file

20
Data cleaningBasic skill 1 Making a new
variable
  • Creating new variables
  • generate newvar expression

21
Data cleaningBasic skill 1 Making a new
variable
  • Creating new variables
  • generate newvar expression
  • An expression can be
  • A number (constant) - generate allzeros 0
  • A variable - generate ageclone age
  • A function - generate agesqrt sqrt(age)

22
Data cleaningBasic skill 2 Manipulating values
of a variable
  • Changing the values of a variable
  • replace var exp if boolean_expression
  • A boolean expression evaluates to true or false
    for each observation

23
Data cleaningBasic skill 2 Manipulating values
of a variable
  • Examples
  • generate bmi weight/(height2)
  • generate male 0
  • replace male 1 if sexmale
  • generate ageover50 0
  • replace ageover 50 1 if agegt50
  • generate complexvar age
  • replace complexvar (ln(age)3)
  • if (agegt30 male1) (othervar1gtothervar2)

24
Data cleaningBasic skill 2 Manipulating values
of a variable
  • Logical operators for boolean expressions
  • English Stata
  • Equal to
  • Not equal to !,
  • Greater than gt
  • Greater than/equal to gt
  • Less than lt
  • Less than/equal to lt
  • And
  • Or

25
Data cleaningBasic skill 2 Manipulating values
of a variable
  • Mathematical operators
  • English Stata
  • Add
  • Subtract -
  • Multiply
  • Divide /
  • To the power of
  • Natural log of ln(expression)
  • Base 10 log of log10(expression)
  • Etcetera

26
Data cleaningBasic skill 2 Manipulating values
of a variable
  • Another way to manipulate data
  • recode var oldvalue1newvalue1 oldvalue2newvalue
    2 if boolean_expression
  • More complicated, but more flexible command than
    replace

27
Data cleaningBasic skill 2 Manipulating values
of a variable
  • Examples
  • generate male 0
  • recode male 01 if sexmale
  • generate female male
  • recode female 10 01

28
Data cleaningBasic skill 2 Manipulating values
of a variable
  • Examples
  • generate raceethnic race
  • recode raceethnic 16 if ethnichispanic
  • (replace raceethnic 6 if ethnichispanic
    race1)
  • generate tertilescac cac
  • recode tertilescac min/541 55/822 83/max3

29
Data cleaningBasic skill 3 Getting rid of
variables/observations
  • Getting rid of a variable
  • drop var
  • Getting rid of observations
  • drop if boolean_expression

30
Data cleaningBasic skill 4 Labeling things
  • You can label
  • A dataset label data label
  • A variable label var varname label
  • Values of a variable (2-step process)
  • label define labelname value1 label1 value2
    value2
  • label values varname labelname
  • label define caccatlabel 0 0 1 1-100 2
    101-400 3 gt400
  • label values caccat caccatlabel

31
Data cleaningBasic skill 5 Dealing with missing
values
  • Missing values are important, easy to forget
  • . for numbers
  • for text
  • tab var1 var2, missing
  • Watch the total n for tab, summarize commands,
    regression analyses, etc.

32
Data cleaning
  • Demo Neonatal opiate withdrawal data

33
Cleaning your data
  • What does the variable measure?
  • rename or label var so its clear
  • Find nonsense values and outliers
  • recode as missing or track down real value?
  • Deal with missing values
  • Too many? Coding consistent?
  • drop variable or observation?
  • Categorize as needed
  • generate a new numeric variable
  • recode (dichotomous variables coded as 1/0,
    watch missing values)
  • label define and then label values
  • Check
  • tab oldvar newvar, missing
  • bysort catvar sum contvar

34
Data cleaning
  • At the end of the day you have
  • 1 raw data file, original format
  • 1 raw data file, Stata format
  • 1 do file that cleans it up
  • 1 log file that documents the cleaning
  • 1 clean data file, Stata format

35
Summary
  • Data cleaning
  • ALWAYS necessary to some extent
  • ALWAYS use a do file
  • NEVER overwrite original data
  • Check your work
  • Watch out for missing values
  • Label as much as you can

36
Lab this week
  • Its long
  • Its hard
  • Its important
  • Email lab to your section leaders email
  • Due at the beginning of lecture next week

37
Preview of next week
  • Using Excel
  • What is it good for?
  • Formulas
  • Designing a good spreadsheet
  • Formatting
Write a Comment
User Comments (0)
About PowerShow.com