When Setup Files Go Bad. Debugging your SAS, SPSS, and STATA code so it works

1
When Setup Files Go Bad.  Debugging your SAS,
SPSS, and STATA code so it works
  • Felicia B. LeClere, Ph.D.
  • Director, Data Sharing for Demographic Research

2
Overview of webinar
  • Broaden the scope a bit
  • No setup files: this is where we learn to
    debug
  • Setup files: things that might not work
  • When the double click doesn't work

3
Things we will be looking for
  • What it looks like when it runs
  • When things don't work
  • How to diagnose what's wrong

4
It's just numbers: what do I do?
  • Many of our historical files require you to
    create syntax on your own; that means learning to
    read in ASCII data
  • You know you are in trouble when the download
    page looks like this

5
Instead of this..
6
(No Transcript)
7
What to do.
  • Find the documentation and look for the following
    language
  • Column locations, field length, or variable
    position
  • These describe where your variables are in the
    ASCII data file and mark how you will read them
    in.

8
What you will see.
How to read the data
The data file location
The data file
9
What you need to do
This is from the codebook, called the tape position
index
Variable location
Variable
10
Variable descriptions
11
And you know the drill.
  • Identify the method for reading ASCII data in your
    favorite stat package
  • Use a fixed-format infile to read the fields
  • And build...
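As a minimal sketch of what "fixed format" means in SAS (the package used in the examples that follow), the step below reads three fields by column position. The variable names and columns echo the later slides; the three data lines are made up for illustration and do not come from any ICPSR file.

```sas
/* Hypothetical data: pid in columns 1-5, exam in column 16, lang in column 17 */
data demo;
  infile datalines;
  input pid 1-5 exam 16 lang 17;
  datalines;
00001          12
00002          11
00003          21
;
run;

/* Frequency check on one of the fields just read */
proc freq data=demo;
  tables lang;
run;
```

With a real file, the `infile datalines` line would instead name the downloaded ASCII file, and the column positions would come from the codebook.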

12
How do you know when it's gone wrong?
  • Says the file doesn't exist or can't be read, or
    some other message
  • Doesn't read a variable or doesn't recognize a
    variable name
  • Frequency counts really don't match what is in
    the documentation
  • The valid values for a variable don't match
    what's in the codebook
  • The number of cases doesn't match the number given
    in the codebook
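The case-count check can be automated. This is only a sketch: the data set, the macro variable name `expected_n`, and its value are hypothetical, and in practice the expected count would be copied from the codebook.

```sas
/* Assumption: the expected number of cases is taken from the codebook */
%let expected_n = 3;

/* Hypothetical three-case file standing in for the real data */
data new;
  input pid 1-5 lang 7;
  datalines;
00001 1
00002 2
00003 1
;
run;

/* Compare the number of observations read with the expected count */
data _null_;
  if 0 then set new nobs=n_read;   /* fetch the count without reading rows */
  if n_read ne &expected_n then
    put "WARNING: read " n_read "cases, codebook says &expected_n";
  else
    put "NOTE: case count matches codebook: " n_read;
  stop;
run;
```

The message lands in the log, next to the notes SAS already writes about records read.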

13

This looks right

libname in "D:\fleclere\Desktop\misc documents\10294302\ICPSR_08535\DS0002";
data new;
  infile "D:\fleclere\Desktop\misc documents\10294302\ICPSR_08535\DS0002\08535-0002-data.txt";
  input pid 1-5 exam 16 lang 17;
proc freq;
  tables lang;
run;

NOTE: The infile "D:\fleclere\Desktop\misc documents\10294302\ICPSR_08535\DS0002\08535-0002-data.txt" is:
      File Name=D:\fleclere\Desktop\misc documents\10294302\ICPSR_08535\DS0002\08535-0002-data.txt,
      RECFM=V,LRECL=256

NOTE: 11653 records were read from the infile
      "D:\fleclere\Desktop\misc documents\10294302\ICPSR_08535\DS0002\08535-0002-data.txt".
      The minimum record length was 256.
      The maximum record length was 256.
      One or more lines were truncated.
NOTE: The data set WORK.NEW has 11653 observations and 3 variables.
NOTE: DATA statement used (Total process time):
      real time           1.75 seconds
      cpu time            0.15 seconds

6    proc freq;
7    tables lang;
8    run;

NOTE: There were 11653 observations read from the data set WORK.NEW.
NOTE: PROCEDURE FREQ used (Total process time):
      real time           0.96 seconds
      cpu time            0.00 seconds
14
The SAS System                10:27 Tuesday, April 28, 2009   1

The FREQ Procedure

                                Cumulative    Cumulative
lang    Frequency    Percent     Frequency       Percent
--------------------------------------------------------
   1        5986       51.53          5986         51.53
   2        5631       48.47         11617        100.00

Frequency Missing = 36
Looks good!!
15

Not so right...

libname in "D:\fleclere\Desktop\misc documents\10294302\ICPSR_08535\DS0002";
data new;
  infile "D:\fleclere\Desktop\misc documents\10294302\ICPSR_08535\DS0002\08535-0002-data.txt";
  input pid 1-5 exam 16 lang 17 Bite 407;
proc freq;
  tables bite;
run;

NOTE: The infile "D:\fleclere\Desktop\misc documents\10294302\ICPSR_08535\DS0002\08535-0002-data.txt" is:
      File Name=D:\fleclere\Desktop\misc documents\10294302\ICPSR_08535\DS0002\08535-0002-data.txt,
      RECFM=V,LRECL=256

NOTE: LOST CARD.
pid=16785 exam=1 lang=1 Bite=. _ERROR_=1 _N_=5827
NOTE: 11653 records were read from the infile
      "D:\fleclere\Desktop\misc documents\10294302\ICPSR_08535\DS0002\08535-0002-data.txt".
      The minimum record length was 256.
      The maximum record length was 256.
      One or more lines were truncated.
NOTE: SAS went to a new line when INPUT statement reached past the end of a line.
NOTE: The data set WORK.NEW has 5826 observations and 4 variables.
NOTE: DATA statement used (Total process time):
      real time           0.34 seconds
      cpu time            0.21 seconds

14   proc freq;
15   tables bite;
16   run;

NOTE: There were 5826 observations read from the data set WORK.NEW.
NOTE: PROCEDURE FREQ used (Total process time):
      real time           0.01 seconds
      cpu time            0.01 seconds
16
Allowable values don't match
Frequencies don't match
17
Why?
  • The record length SAS allowed by default was 256; the log
    was telling us that (LRECL=256, "One or more lines were
    truncated").
  • Once we got past field position 256, we got
    lost. Language was at position 17 and Bite at
    407.
  • Solution:
  • Reset lrecl in the infile statement (lrecl=815)
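A self-contained sketch of the fix, using a temporary file so it can be run without the ICPSR data. The fileref name `rawtmp` and the single record are hypothetical; the column positions and the `lrecl=815` value are the ones from the slide.

```sas
/* Hypothetical demo file: one record that extends past column 256 */
filename rawtmp temp;

data _null_;
  file rawtmp lrecl=815;
  put @1 '00001' @16 '1' @17 '2' @407 '3';
run;

/* lrecl=815 lets INPUT reach column 407; on the slide, the log
   showed LRECL=256 in effect when the option was left off */
data new;
  infile rawtmp lrecl=815;
  input pid 1-5 exam 16 lang 17 bite 407;
run;

proc print data=new;
run;
```

In the program on the slide, the only change needed is adding `lrecl=815` to the existing infile statement, before its semicolon.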

18
Other reasons things go bad
  • Multiple lines per record --- a product of times
    when data were on cards and the record length was
    fixed at 80
  • You read a string as a numeric or vice versa
  • Data errors or non-standard characters (files
    converted from mainframes or other formats)
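The first two problems can be sketched together in SAS. The layout below is hypothetical: each case takes two lines ("cards"), and one field holds letters, so it must be read as character with `$`.

```sas
/* Hypothetical two-line layout: line 1 holds caseid and state,
   line 2 holds income for the same case */
data cards;
  infile datalines;
  input #1 caseid 1-4 state $ 6-7   /* $ marks a character field */
        #2 income 6-10;             /* #2 moves to the case's second line */
  datalines;
0001 MI
0001 52000
0002 OH
0002 48500
;
run;

proc print data=cards;
run;
```

If `state` were read without the `$`, SAS would write an "Invalid data" note to the log and set the value to missing, which is one of the symptoms to look for.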

19
You have a syntax file
  • You find a file and you download it for your
    favorite flavor of software
  • You decide to keep all the variables
  • You know where the data went (i.e. where you
    downloaded it to)

20
Initial steps to test
  • Get rid of formatting
  • Add a frequency check for variables at the
    beginning and end
  • Simplify if you can (do you really need all those
    variables)
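A stripped-down test run might look like the sketch below: no format or label statements, only a field from the start of the record and one from the end, and a frequency check on both. The data lines and the name `lastvar` are hypothetical.

```sas
/* Hypothetical mini-layout: caseid 1-8, gender in column 9,
   and the last field of the record in column 20 */
data check;
  infile datalines;
  input caseid 1-8 gender 9 lastvar 20;
  datalines;
000000011          2
000000022          1
000000031          1
;
run;

/* If the last field reads correctly, the columns in between
   are probably lined up too */
proc freq data=check;
  tables gender lastvar / missing;
run;
```

Checking a variable at the far end of the record is what catches record-length problems like the one on the earlier slides.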

21
libname in "D:\fleclere\Desktop\misc documents\10294302\ICPSR_08535\DS0002";
DATA;
INFILE "D:\fleclere\Desktop\misc documents\10294302\ICPSR_08535\DS0002\22627-0001-Data.txt" LRECL=2983;
INPUT
   CASEID 1-8        GENDER 9          AGE 10-11
   ETHNONAT 12-13    ETHNOS10 14-15    PANETH4 16-17
   GENERAT3 18-20 .1 GENERAT4 21-23 .1 AGEARRV 24-25
   ABUELOFB 26-27    QUOGRPS 28-31     INTLANG 32-35
   SAMPLE 36-39      QS2AM 40-43       QS2AF 44-47
   QS5A 48-51        QS5B 52-55        QS6A 56-59
   QS6B 60-63        QS7 64-67         QS8 68-71

What happened?

ERROR: Physical file does not exist,
       D:\fleclere\Desktop\misc documents\10294302\ICPSR_08535\DS0002\22627-0001-Data.txt.
NOTE: The SAS System stopped processing this step because of errors.
WARNING: The data set WORK.DATA1 may be incomplete. When this step
         was stopped there were 0 observations and 657 variables.
NOTE: DATA statement used (Total process time):
      real time           0.29 seconds
      cpu time            0.29 seconds

1534  Proc freq;
1535  Tables gender polparty;
1536
1537  RUN;

NOTE: No observations in data set WORK.DATA1.
NOTE: PROCEDURE FREQ used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds
22
With the INFILE path pointing at the folder that actually holds the file, it runs:

NOTE: The infile "D:\fleclere\Desktop\misc documents\10294358\ICPSR_22627\DS0001\22627-0001-Data.txt" is:
      File Name=D:\fleclere\Desktop\misc documents\10294358\ICPSR_22627\DS0001\22627-0001-Data.txt,
      RECFM=V,LRECL=2983

NOTE: 4655 records were read from the infile
      "D:\fleclere\Desktop\misc documents\10294358\ICPSR_22627\DS0001\22627-0001-Data.txt".
      The minimum record length was 2983.
      The maximum record length was 2983.
NOTE: The data set WORK.DATA3 has 4655 observations and 657 variables.
NOTE: DATA statement used (Total process time):
      real time           2.51 seconds
      cpu time            0.73 seconds

4576  Proc freq;
4577  Tables gender polparty;
4578
4579  RUN;

NOTE: There were 4655 observations read from the data set WORK.DATA3.
NOTE: PROCEDURE FREQ used (Total process time):
      real time           0.01 seconds
      cpu time            0.01 seconds
23
This is from our codebook
24
What else should I check?
This is from the original survey documentation,
before ICPSR standardization. Always validate
the data produced against documentation from the
original data set to be sure. The syntax and the
ICPSR codebook have the same origins, so an error
in one may be reproduced in the other. Check total
case counts and frequencies.
25
If the frequencies or case counts don't match
  • Check the lrecl against the documentation
  • Check the field lengths: the codebooks should
    contain, for each variable, its location and field
    length
  • Punctuation counts: SAS likes its semicolons,
    SPSS its spaces and periods, and STATA is fussy
    about what goes before and after a comma

26
If a variable looks weird
  • Print observations.
  • Everything checks out, but there are weird
    fields or non-numeric items in a frequency
    display. Print a record or two.
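One way to do both in SAS is sketched below, on made-up data where the third line has a letter in a numeric field. The `list` statement writes each raw input line to the log, and `proc print` with `obs=2` prints the first two observations as SAS read them.

```sas
/* Hypothetical data: the third line has an X where a number belongs */
data weird;
  infile datalines;
  input pid 1-5 lang 7;
  list;   /* write the raw input line for each case to the log */
  datalines;
00001 1
00002 2
00003 X
;
run;

/* Print the first two observations as SAS read them */
proc print data=weird (obs=2);
run;
```

Comparing the raw line in the log with the printed values usually shows whether the problem is in the data or in the column positions.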

27
I pointed, I clicked, and
  • Things to ask yourself
  • Is it a version issue?
  • (SAS in particular has problems reading different
    versions)
  • Do you have the software?
  • (the icon will not look right)

28
Why should you always run the ASCII syntax
instead?
  • It allows you to customize the file
  • It forces you to know where the data are
  • It forces you to read the log files even if all
    you are doing is watching them go by
  • You have to open the software version --- and it
    will run it in the version you have and create
    the file the way you need it
  • It prevents you from being complacent

29
Steps to prevent bugs
  • Simplify the syntax: take out all the extraneous
    stuff
  • Pick fewer variables
  • Always add frequency counts
  • Always check case counts

30
Steps to prevent bugs
  • Know where you put the data
  • Read the documentation first
  • Save log files as well as program files
  • Verify, verify, verify

31
If you find errors in ICPSR syntax
  • Please help us: send a corrected file and a
    description of the error to
  • netmail@icpsr.umich.edu
  • We do updates all the time

32
If you need help with the basics
  • Our help site for reading in data
  • Using Data
  • Great help in building and debugging statistical
    software programs
  • UCLA