Preparing your data for analysis using SAS
1
Preparing your data for analysis using SAS
  • Landon Sego
  • 24 April 2003
  • Department of Statistics
  • UW-Madison

2
Assumptions
  • That you have used SAS at least a few times.
  • It doesn't matter whether you run SAS in
    interactive mode (Windows) or in batch mode
    (Unix/Linux).

3
Interactive SAS for Windows
4
Editing SAS code in EMACS for Batch mode execution
5
Executing SAS in batch mode at the Linux prompt
6
Where we're going
  • Rarely does data come to you in a form that is
    analyzable.
  • As a best case, all you need to do is clean your
    data and check it for consistency.
  • As a worst case, extensive manipulation of the
    data is needed before you can analyze it.
  • We want to get familiar with some tools in SAS
    used to check, clean, and manipulate data.

7
The dark toolbox
  • SAS is like a toolbox as big as a garage, with
    thousands of tools.
  • For many SAS users, this toolbox is dark and
    there is a monster lurking inside.
  • Let's turn on the light and meet a few of the
    tools available in SAS.
  • No guarantees about the monster.

8
Published resources
  • Lots available
  • I've learned on my own and with the SAS
    documentation
  • Google searches

9
SAS Online Documentation
  • http://shelf.doit.wisc.edu/SASOnlineDocs/onldoc.htm
  • or
  • http://shelf.doit.wisc.edu/SASOnlineDocs/main.htm

10
What I use most often in the SAS Online
Documentation
  • Base SAS Software
  • SAS Language Reference: Concepts
  • SAS Language Reference: Dictionary
  • SAS Macro Language: Reference
  • SAS Procedures Guide
  • SAS/STAT
  • SAS/STAT User's Guide

11
SAS Language Reference: Dictionary
12
SAS Procedures Guide
13
SAS/STAT User's Guide
14
Conventions, terminology, and options
15
Conventions
  • SAS terminology in red
  • SAS code in blue

16
Basic terminology
  • Data in SAS exists as a data set, with variables
    and observations.
  • Variables are the columns.
  • Observations are the rows.
  • Two types of variables: character and numeric
    (see the sketch below).
  • Character variables can range in length from 1 to
    32,767 (2^15 - 1) characters.
  • Numeric variables can be virtually any size
    (within the limitations of the computer).
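
A minimal sketch (with hypothetical variable names) of declaring one
character and one numeric variable in a data step:

data example;
  length city $ 12    /* character variable, up to 12 characters */
         count 8;     /* numeric variable, 8 bytes (the default)  */
  city  = 'Madison';
  count = 42;
run;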

17
My favorite options
  • On the first line of almost every SAS program I
    write, I include the following:
  • options nodate nocenter nonumber ps=3000 ls=200
    mprint mlogic symbolgen;
  • These options control the format of the output
    and make macro code easier to debug.

18
Importing data into SAS
19
Importing data into SAS
  • Data can exist in many forms (text file, Excel
    spreadsheet, permanent SAS data set, etc.)
  • Excel spreadsheets are probably the most common
    form.
  • Can use DDE (dynamic data exchange) (Windows
    version of SAS only)
  • But for Excel files, I like to use CSV file
    format (comma separated value). Works on any
    platform.

20
Excel → CSV text file → SAS data set
A column of j's provides a buffer at the end of
each line of text in the CSV file. If you are
running SAS on a Linux or UNIX machine, you need
to add the j's (or use the dos2unix command to
convert the CSV file to the text formatting used
by UNIX).
21
Save Excel spreadsheet in the CSV format
22
How the CSV file looks (when viewed with a text
editor)
Location,Type,Length,j
Albuquerque,1,1.414,j
Albuquerque,1,2.000,j
Albuquerque,1,1.414,j
Albuquerque,1,2.236,j
Albuquerque,2,2.000,j
Albuquerque,2,2.236,j
Albuquerque,2,2.236,j
Albuquerque,2,2.236,j
Lexington,1,2.000,j
23
SAS code to import the CSV file
data fake;
  infile 'c:\mydata\fake.csv' dsd firstobs=2;
  input location $14. type length;
proc print;
run;

Note: if I were to use  input location $ type length;  it would
truncate the location variable to 8 characters in length.
24
Results from the proc print
Obs    location        type    length

  1    Albuquerque        1     1.414
  2    Albuquerque        1     2.000
  3    Albuquerque        1     2.236
  4    Albuquerque        2     2.000
  5    Albuquerque        2     2.236
  6    Albuquerque        2     2.236
  7    Lexington          1     2.000
  8    Lexington          1     2.000
  9    Lexington          1     1.414
 10    Lexington          1     2.000
 11    Lexington          2     1.732
 12    Lexington          2     1.732
 13    Lexington          2     2.236
 14    Johannesburg       1     2.000
 15    Johannesburg       1     2.236
 16    Johannesburg       1     2.236
 17    Johannesburg       2     2.000
 18    Johannesburg       2     1.414
 19    Johannesburg       2     2.000
25
Checking and summarizing data
26
Checking and summarizing data
This is IMPORTANT! Don't take for granted that
there aren't mistakes in your data.

proc contents data=fake;
proc freq data=fake;
run;
27
Result from proc contents
Data Set Name      WORK.FAKE                          Observations            19
Member Type        DATA                               Variables                3
Engine             V8                                 Indexes                  0
Created            11:46 Wednesday, April 16, 2003    Observation Length      32
Last Modified      11:46 Wednesday, April 16, 2003    Deleted Observations     0
. . .
File Name          /tmp/sastmp_sego/SAS_workB6B700000836_
                   gstat201.stat.wisc.edu/fake.sas7bdat
. . .

-----Alphabetic List of Variables and Attributes-----

#    Variable    Type    Len    Pos
3    length      Num       8      8
1    location    Char     14     16
2    type        Num       8      0
28
Results from proc freq
The FREQ Procedure

                                          Cumulative    Cumulative
location        Frequency     Percent      Frequency       Percent
-------------------------------------------------------------------
Albuquerque             6       31.58              6         31.58
Johannesburg            6       31.58             12         63.16
Lexington               7       36.84             19        100.00

                                      Cumulative    Cumulative
type        Frequency     Percent      Frequency       Percent
---------------------------------------------------------------
1                  10       52.63             10         52.63
2                   9       47.37             19        100.00

                                        Cumulative    Cumulative
length        Frequency     Percent      Frequency       Percent
-----------------------------------------------------------------
1.414                3        15.79              3         15.79
1.732                2        10.53              5         26.32
2                    8        42.11             13         68.42
2.236                6        31.58             19        100.00
29
Selecting subsets of the data
30
Selecting observations (rows)
  • A large contiguous group of observations
  • Specific observation numbers
  • Using selection criteria
  • e.g. when the location is Lexington
  • or when the length is between 1 and 2.

31
Selecting a group of contiguous observations
data smallfake;
  set fake (firstobs=10 obs=15);   /* firstobs= and obs= are data set options */
proc print;

Obs    location        type    length

  1    Lexington          1     2.000
  2    Lexington          2     1.732
  3    Lexington          2     1.732
  4    Lexington          2     2.236
  5    Johannesburg       1     2.000
  6    Johannesburg       1     2.236

Selects observations 10 through 15.
32
Selecting specific observation numbers
data smallfake;
  set fake;
  if _n_ in (7, 11, 16);
proc print;

Obs    location        type    length

  1    Lexington          1     2.000
  2    Lexington          2     1.732
  3    Johannesburg       1     2.236

Selects observation numbers 7, 11, and 16.

33
Selection criteria: the where statement

data smallfake;
  set fake;
  where location = 'Lexington';
or
  where location ne 'Lexington';
or
  where location in ('Lexington', 'Albuquerque');
or
  where (1 le length le 2);
or
  where (length gt 2.3);

34
Selection criteria: the if statement

data smallfake;
  set fake;
  if location in ('Lexington', 'Albuquerque');
or
  if location = 'Lexington' or location = 'Albuquerque';
or
  if location = 'Johannesburg' then delete;

These three if statements produce identical results.

35
Some comparison operators
  • ne or ^=     not equal to
  • eq or =      equal to
  • ge or >=     greater than or equal to
  • gt or >      greater than
  • le or <=     less than or equal to
  • lt or <      less than
  • in           if contained in a group
  • not in       if not contained in a group
  • and or &     logical and
  • or or |      logical or
    (a short example contrasting the mnemonic and symbolic forms follows)
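
A minimal sketch (reusing the fake data set from the earlier slides) showing
that the mnemonic and symbolic operators are interchangeable:

data small;
  set fake;
  where length ge 1.5 and location ne 'Johannesburg';
  /* equivalent: where length >= 1.5 and location ^= 'Johannesburg'; */
run;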

36
Selecting and managing variables
37
Selecting variables using keep
data smallfake (keep = location length);
  set fake;
  where type = 1;

(type is available for processing, but not written to the smallfake data set)

data smallfake;
  set fake (keep = location length);

(type is not available for processing)

keep= is used here as a data set option.
38
Selecting variables using drop
data smallfake (drop = type);
  set fake;
  where type = 1;

(type is available for processing, but not written to the smallfake data set)

data smallfake;
  set fake (drop = type);

(type is not available for processing)

drop= is used here as a data set option.
39
Renaming variables
location → place        type → trt

data fake1;
  set fake (rename = (location=place type=trt));
  where trt = 1;

data fake2 (rename = (location=place type=trt));
  set fake;
  where type = 1;

data fake3 (drop = location type);
  set fake;
  where type = 1;
  place = location;
  trt = type;

These three pieces of code achieve the same result.
Look closely at the where statements.
40
Concatenation
41
Concatenation (stacking)
  • SAS can stack multiple data sets on top of one
    another.
  • Pay attention to whether the variables and their
    attributes (length and variable type) match among
    the different data sets.
  • Can use the set statement or proc append to
    concatenate data sets.

42
Using the set statement to concatenate data sets
  • Suppose you wanted to stack the three data sets
    fake, faux, and fraud on top of one another:

data fantastic;
  set fake faux fraud;
43
Using proc append to concatenate data sets
  • proc append concatenates only two data sets at a
    time, and typically these data sets must have the
    same variable names with the same attributes.
  • proc append base=fake data=faux;
  • Here the observations in faux are tacked onto
    the end of the fake data set. The combined data
    set is called fake.

44
Splitting data into several data sets
Suppose we want all Albuquerque observations to
go into a data set called albuq, the Lexington
observations to go into the data set lexing, and
observations that have lengths larger than 3.0
to go into the data set large.

data albuq lexing large;
  set fake;
  if location = 'Albuquerque' then output albuq;
  else if location = 'Lexington' then output lexing;
  if length gt 3 then output large;
45
Merging data
46
Merging (combining) data
  • Merging data sets places two or more data sets
    side by side into a single data set.
  • If you simply want to place two data sets side by
    side (1-to-1 merging):

data faux;
  set faux (rename = (location=location1 type=type1));

data fantastic;
  merge fake faux;

proc print data = fantastic;
47
Results of 1 to 1 merge
Obs    location        type    length    location1       type1    weight

  1    Albuquerque        1     1.414    Lexington           2     4.456
  2    Albuquerque        1     2.000    Lexington           2     2.546
  3    Albuquerque        1     2.236    Lexington           2     3.789
  4    Albuquerque        2     2.000    Lexington           1     2.457
  5    Albuquerque        2     2.236    Lexington           1     4.456
  6    Albuquerque        2     2.236    Lexington           1     4.551
  7    Lexington          1     2.000    Lexington           1     3.246
  8    Lexington          1     2.000    Johannesburg        1     2.998
  9    Lexington          1     1.414    Johannesburg        1     2.721
 10    Lexington          1     2.000    Johannesburg        1     2.115
 11    Lexington          2     1.732    Johannesburg        2     3.489
 12    Lexington          2     1.732    Johannesburg        2     3.447
 13    Lexington          2     2.236    Johannesburg        2     4.122
 14    Johannesburg       1     2.000    Albuquerque         1     2.118
 15    Johannesburg       1     2.236    Albuquerque         1     2.871
 16    Johannesburg       1     2.236    Albuquerque         1     2.516
 17    Johannesburg       2     2.000    Albuquerque         2     3.445
48
Merging data with a by variable
  • All data sets that will be merged must first be
    sorted by the linking variables

proc sort data = fake;
  by location type;
proc sort data = faux;
  by location type;

data fantastic;
  merge fake faux;
  by location type;

proc print data = fantastic;
49
Results of merging with by variables
Obs    location        type    length    weight

  1    Albuquerque        1     1.414     2.118
  2    Albuquerque        1     2.000     2.871
  3    Albuquerque        1     2.236     2.516
  4    Albuquerque        2     2.000     3.445
  5    Albuquerque        2     2.236     2.998
  6    Albuquerque        2     2.236     3.549
  7    Johannesburg       1     2.000     2.998
  8    Johannesburg       1     2.236     2.721
  9    Johannesburg       1     2.236     2.115
 10    Johannesburg       2     2.000     3.489
 11    Johannesburg       2     1.414     3.447
 12    Johannesburg       2     2.000     4.122
 13    Lexington          1     2.000     2.457
 14    Lexington          1     2.000     4.456
 15    Lexington          1     1.414     4.551
 16    Lexington          1     2.000     3.246
 17    Lexington          2     1.732     4.456
 18    Lexington          2     1.732     2.546
50
More about merging
  • When you merge with a by statement, you may only
    want observations that have by-variable matches
    in both data sets.

[Diagram: the fake and fraud data sets, with only some by-groups present in both]
51
Using the (in=) data set option
  • Assume both fake and fraud are sorted by location
    and type.
data fantastic;
  merge fake (in = tmp1) fraud (in = tmp2);
  by location type;
  from_fake = tmp1;
  from_fraud = tmp2;
proc print;

52
Identifying obs from both data sets
53
Using the (in=) data set option
  • Now select observations that are common to both
    data sets
data fantastic;
  merge fake (in = tmp1) fraud (in = tmp2);
  by location type;
  if tmp1 = 1 and tmp2 = 1;
proc print;

54
After selecting for observations in common
Obs    location        type    length    thickness

  1    Albuquerque        1     1.414        1.120
  2    Albuquerque        1     2.000        1.120
  3    Albuquerque        1     2.236        1.120
  4    Johannesburg       1     2.000        1.320
  5    Johannesburg       1     2.236        1.945
  6    Johannesburg       1     2.236        1.945
  7    Johannesburg       2     2.000        1.440
  8    Johannesburg       2     1.414        1.440
  9    Johannesburg       2     2.000        1.440
 10    Lexington          2     1.732        1.160
 11    Lexington          2     1.732        0.783
 12    Lexington          2     2.236        0.783
55
Merge mania
data fantastic;
  merge fake fraud;
  by location type;

data fantastic;
  merge fake (in=tmp1) fraud;
  by location type;
  if tmp1 = 1;

data fantastic;
  merge fake (in=tmp1) fraud (in=tmp2);
  by location type;
  if tmp1 = 1 and tmp2 = 1;
56
Creating new variables
57
Creating new variables
data fantastic;
  merge fake faux;
  by location type;
  newcode = substr(location,1,1) || '-' || trim(left(type));
  growth_index = length + weight**2;
  if (growth_index gt 15) then large = '*';
  else large = ' ';

58
Results of new variables

                                                             growth_
Obs    location        type    length    weight    newcode     index    large

  1    Albuquerque        1     1.414     2.118    A-1        5.8999
  2    Albuquerque        1     2.000     2.871    A-1       10.2426
  3    Albuquerque        1     2.236     2.516    A-1        8.5663
  4    Albuquerque        2     2.000     3.445    A-2       13.8680
  5    Albuquerque        2     2.236     2.998    A-2       11.2240
  6    Albuquerque        2     2.236     3.549    A-2       14.8314
  7    Johannesburg       1     2.000     2.998    J-1       10.9880
  8    Johannesburg       1     2.236     2.721    J-1        9.6398
  9    Johannesburg       1     2.236     2.115    J-1        6.7092
 10    Johannesburg       2     2.000     3.489    J-2       14.1731
 11    Johannesburg       2     1.414     3.447    J-2       13.2958
 12    Johannesburg       2     2.000     4.122    J-2       18.9909      *
 13    Lexington          1     2.000     2.457    L-1        8.0368
 14    Lexington          1     2.000     4.456    L-1       21.8559      *
 15    Lexington          1     1.414     4.551    L-1       22.1256      *
 16    Lexington          1     2.000     3.246    L-1       12.5365
 17    Lexington          2     1.732     4.456    L-2       21.5879      *
 18    Lexington          2     1.732     2.546    L-2        8.2141
 19    Lexington          2     2.236     3.789    L-2       16.5925      *
59
Common functions used to manipulate text strings
  • compress
  • index
  • left
  • scan
  • substr
  • trim

Refer to the SAS Online Docs: Base SAS Software →
SAS Language Reference: Dictionary →
Dictionary of Language Elements →
Functions and Call Routines
(a short usage sketch of these functions follows)
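
A minimal sketch (with made-up values, not from the presentation) of how
these functions behave:

data strings;
  loc   = '  Albuquerque NM ';
  tidy  = trim(left(loc));          /* strip leading and trailing blanks       */
  nosp  = compress(loc);            /* remove all blanks                       */
  code  = substr(left(loc), 1, 4);  /* first 4 characters: 'Albu'              */
  word2 = scan(loc, 2);             /* second blank-delimited word: 'NM'       */
  pos   = index(loc, 'NM');         /* starting position of the substring 'NM' */
run;

proc print data = strings;
run;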
60
by-group processing
61
by-group processing
  • Suppose you wanted a subset of the data that
    contained the observation with the smallest
    length from each location.

proc sort data = fake;
  by location length;

data shortest;
  set fake;
  by location length;
  first = first.location;
  last = last.location;
62
output from by-group processing
Obs    location        type    length    first    last

  1    Albuquerque        1     1.414        1       0
  2    Albuquerque        1     2.000        0       0
  3    Albuquerque        2     2.000        0       0
  4    Albuquerque        1     2.236        0       0
  5    Albuquerque        2     2.236        0       0
  6    Albuquerque        2     2.236        0       1
  7    Johannesburg       2     1.414        1       0
  8    Johannesburg       1     2.000        0       0
  9    Johannesburg       2     2.000        0       0
 10    Johannesburg       2     2.000        0       0
 11    Johannesburg       1     2.236        0       0
 12    Johannesburg       1     2.236        0       1
 13    Lexington          1     1.414        1       0
 14    Lexington          2     1.732        0       0
 15    Lexington          2     1.732        0       0
 16    Lexington          1     2.000        0       0
 17    Lexington          1     2.000        0       0
 18    Lexington          1     2.000        0       0
 19    Lexington          2     2.236        0       1
63
by-group processing
proc sort data = fake;
  by location length;

data shortest;
  set fake;
  by location length;
  if first.location = 1;

Obs    location        type    length

  1    Albuquerque        1     1.414
  2    Johannesburg       2     1.414
  3    Lexington          1     1.414
64
Basic macros
65
Basic macros
  • SAS macros allow you to easily program repetitive
    tasks.
  • On the surface, creating a SAS macro is very
    similar to creating a function in R or S-Plus.
  • SAS Macro is actually a text generation tool.

66
Macro example
* Name the macro, begin the macro definition, and identify the macro variables;
%macro analyze(dataset, response);

  * Code to be generated by the macro -- note the use of proc append;
  proc mixed data = &dataset;
    class location type;
    model &response = location type;
    lsmeans location type;
    ods output lsmeans = model_means;

  data model_means;
    set model_means;
    variable = "&response";

  proc append base = results data = model_means;

* End the macro definition;
%mend analyze;

* Call the macro;
%analyze(dataset=fake, response=length)
%analyze(dataset=faux, response=weight)

* Print the results;
proc print data = results;
67
Results from macro code
Obs    Effect      location        type    Estimate     StdErr    DF    tValue     Probt    variable

  1    location    Albuquerque      _        2.0203     0.1243    15     16.25    <.0001    length
  2    location    Johannesburg     _        1.9810     0.1243    15     15.94    <.0001    length
  3    location    Lexington        _        1.8728     0.1155    15     16.21    <.0001    length
  4    type                         1        1.9621    0.09678    15     20.27    <.0001    length
  5    type                         2        1.9540     0.1015    15     19.25    <.0001    length
  6    location    Albuquerque      _        2.9162     0.2844    15     10.25    <.0001    weight
  7    location    Johannesburg     _        3.1487     0.2844    15     11.07    <.0001    weight
  8    location    Lexington        _        3.6842     0.2643    15     13.94    <.0001    weight
  9    type                         1        2.9614     0.2214    15     13.37    <.0001    weight
 10    type                         2        3.5379     0.2322    15     15.23    <.0001    weight
68
proc transpose
69
Rearranging data with proc transpose
Consider this output from proc mixed:

Obs    year    cultivar    Effect         trt    Estimate      StdErr     DF    tValue     Probt

  1      95        1       Intercept       _       1.5930      0.4477     38      3.56    0.0010
  2      95        1       trt             1       0.2987      0.7485     35      0.40    0.6923
  3      95        1       calevel         _     -0.00155    0.002401     35     -0.65    0.5229
  4      95        1       calevel*trt     1     -0.00010    0.003992     35     -0.02    0.9804
  5      95        2       Intercept       _       0.5331      0.2782     38      1.92    0.0628
  6      95        2       trt             1       0.3637      0.3501     36      1.04    0.3058
  7      95        2       calevel         _     0.002348    0.001570     36      1.49    0.1437
  8      95        2       calevel*trt     1     -0.00170    0.002173     36     -0.78    0.4388
 13      96        1       Intercept       _       1.4129      0.3650     10      3.87    0.0031
 14      96        1       trt             1       0.4670      0.4387    160      1.06    0.2887
 15      96        1       calevel         _     0.003074    0.001146    160      2.68    0.0081
 16      96        1       calevel*trt     1     -0.00127    0.001387    160     -0.92    0.3598
70
Results from proc transpose
proc transpose data = solution out = tsolution;
  by year cultivar;
  var estimate;
  id effect;

                                                                          calevel_
Obs    year    cultivar    _NAME_      Intercept        trt     calevel        trt

  1      95        1       Estimate       1.5930     0.2987    -0.00155    -0.00010
  2      95        2       Estimate       0.5331     0.3637    0.002348    -0.00170
  3      95        3       Estimate       1.3996    0.08453    -0.00121    0.000326
  4      96        1       Estimate       1.4129     0.4670    0.003074    -0.00127
  5      96        2       Estimate       1.9705     0.1312    0.003656    -0.00073
  6      96        3       Estimate       3.2758    -0.1568    -0.00131    0.001427
  7      96        4       Estimate       2.6396     0.7034    0.001114    -0.00237
  8      97        1       Estimate       1.3943     0.2684    -0.00026    -0.00120
  9      97        2       Estimate       1.5388    -0.2840    -0.00183    0.002834
71
Parting words of advice
72
Attributes of SAS
  • SAS is read/write intensive.
  • Every time you create a data set, the data set is
    written to the disk.
  • Where does it get written? To the SAS Work
    Library, which is assigned to a directory
    somewhere... use proc contents to find out
    (or see the sketch below).
  • For CALS HP users and PC users, the SAS Work
    Library resides on the actual machine.
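
A small sketch (not from the presentation) of two other ways to see where the
Work library currently points:

proc options option = work;
run;

%put The Work library is at: %sysfunc(pathname(work));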

73
Attributes of SAS
  • Users of the AFS system beware! (Stat department,
    CS department)
  • The SAS Work Library is assigned to your account
    in AFS, not to the local machine that is running
    SAS.

[Diagram: your local computer running SAS reads and writes SAS data sets to AFS across the network]
74
Assigning the SAS Work Library
  • To assign the SAS work library to a local
    directory (when running in batch mode on a Linux
    or Unix system)
  • sas mysascode.sas -work /scratch

75
Synthesis
  • Most of what we've covered today involves the
    data step.
  • Many of the techniques shown in this presentation
    can be applied together in a single data step.
  • Now that you know the names of some of the tools,
    use the Online Documentation!