Provided by: mathN

1
Financial Time Series I / Methods of Statistical Prediction

Suggested Answers to Project 2
Logistic Regression and Model Selection
1/19/2003

2
Question 1

Give a brief explanation of the meaning of these commands.
data(birthwt)
Load the data set birthwt. The data set is distributed with the MASS package, so the package has to be loaded first, for example with require(MASS). (The boot package is needed later for cv.glm.)
attach(birthwt)
The data set birthwt is attached to the search path so that each variable in the data set can be accessed directly by its name. Otherwise we have to write birthwt$varname instead of varname.
race <- factor(race, labels = c("white", "black", "other"))
The function factor encodes a vector as a factor. race is originally stored as a numeric code (1, 2, 3). This command converts it into a categorical factor with the labels white, black, and other.
table(ftv)
table uses the cross-classifying factors to build a contingency table of the counts at each combination of factor levels. The result of this command is as follows:
  0   1   2   3   4   6
100  47  30   7   4   1
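As a minimal sketch of what factor and table do, the following R snippet uses a small made-up vector rather than the birthwt data. The names race_code, visits, race_f, race_counts, and cross_counts are illustrative only and do not appear in the project.

```r
# Toy data, made up for illustration only (not the birthwt values).
race_code <- c(1, 2, 3, 1, 1, 2)
visits    <- c(0, 0, 1, 2, 0, 1)

# Encode the numeric codes as a categorical factor with readable labels.
race_f <- factor(race_code, labels = c("white", "black", "other"))

# One factor gives a table of counts per level;
# two vectors give a two-way contingency table.
race_counts  <- table(race_f)
cross_counts <- table(race_f, visits)

print(race_counts)
print(cross_counts)
```

With a single argument, table returns one count per level; with two arguments it returns one count per combination of levels, which is the cross-classification described above.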

3
Question 1

ftv <- factor(ftv)
The function factor encodes a vector as a factor.
levels(ftv)[-(1:2)] <- "2+"
This command collapses the levels of ftv into three: 0, 1, and 2+. Every level other than the first two is renamed to "2+".
2. Convert ptl to two levels and name the new variable ptd.
ptd <- factor(ptl > 0)
The function factor encodes a vector as a factor. Here, ptd is a new factor with only two levels, FALSE and TRUE.
3. Create a new data frame bwt.
bwt <- data.frame(low = factor(low), age, lwt, race, smoke = (smoke > 0), ptd, ht = (ht > 0), ui = (ui > 0), ftv)
Create a new data frame bwt in which low is a factor and smoke, ht, and ui are logical indicators.
4. Clean up data.
detach(birthwt)
Remove the data set from the search path of available R objects. This command can be used to remove either a data frame that has been attached or a package that was loaded previously.
rm(race, ptd, ftv)
remove and rm can be used to remove objects. Here, the variables race, ptd, and ftv, which were created in the workspace, are removed and no longer exist. The copies stored in bwt are not affected.
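The attach, detach, and rm steps can also be avoided altogether. The sketch below assumes that birthwt is taken from the MASS package and builds the same data frame inside with(), so the temporary race, ptd, and ftv never reach the workspace. It is one possible arrangement, not the one used in the project.

```r
library(MASS)  # assumption: birthwt is loaded from MASS

# All recoding happens inside with(); the assignments are local to it.
bwt <- with(birthwt, {
  race <- factor(race, labels = c("white", "black", "other"))
  ptd  <- factor(ptl > 0)
  ftv  <- factor(ftv)
  levels(ftv)[-(1:2)] <- "2+"   # collapse 2, 3, 4, 6 into "2+"
  data.frame(low = factor(low), age, lwt, race, smoke = (smoke > 0),
             ptd, ht = (ht > 0), ui = (ui > 0), ftv)
})

str(bwt)
ftv_counts <- table(bwt$ftv)
print(ftv_counts)
```

The counts for the collapsed ftv should agree with the table(ftv) output shown earlier: the last four columns are merged into the single level 2+.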

4
Question 2

Give a brief explanation of the specification of the regression model.
birthwt.glm <- glm(low ~ ., family = binomial, data = bwt)
glm is used to fit generalized linear models (GLM) to the data.
In a GLM, two things need to be specified.
The first is the error distribution of the dependent variable. Here, the chosen distribution is the binomial distribution, specified as family = binomial.
The second is the unknown parameter of that error distribution. For the binomial distribution it is the probability of success p, and p is what connects the independent variables to the dependent variable. The connection is made by the so-called link function. The default link function for the binomial distribution is the logit, which gives
$$F(x) = P(Y = 1 \mid x) = \frac{\exp(z)}{1 + \exp(z)}, \qquad z = \beta_0 + \beta_1 X_1 + \dots + \beta_K X_K$$
The fitted values of this model are the probabilities of low = 1 given all the other variables in the data frame bwt. Here, we only consider the additive model: all variables enter the model linearly, and there is no interaction term.
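The relation between the linear predictor z and the fitted probabilities can be checked numerically. The following sketch assumes birthwt comes from the MASS package and recodes it as in Question 1. The names fit, eta, p_manual, and max_gap are illustrative.

```r
library(MASS)  # assumption: birthwt is loaded from MASS

# Recode birthwt as in Question 1.
bwt <- with(birthwt, {
  race <- factor(race, labels = c("white", "black", "other"))
  ptd  <- factor(ptl > 0)
  ftv  <- factor(ftv)
  levels(ftv)[-(1:2)] <- "2+"
  data.frame(low = factor(low), age, lwt, race, smoke = (smoke > 0),
             ptd, ht = (ht > 0), ui = (ui > 0), ftv)
})

fit <- glm(low ~ ., family = binomial, data = bwt)

# Linear predictor z = b0 + b1*x1 + ... + bK*xK for every case.
eta <- predict(fit, type = "link")

# Apply the inverse logit by hand and compare with glm's fitted values.
p_manual <- exp(eta) / (1 + exp(eta))
max_gap  <- max(abs(p_manual - fitted(fit)))
print(max_gap)
```

The difference should be zero up to floating-point error, which confirms that the fitted values are the inverse logit of the linear predictor.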

6
Question 2

summary(birthwt.glm, correlation = FALSE)
It gives the deviance residuals and the coefficients.
summary is a generic function that produces summaries of the results of various model-fitting functions.
The correlations of the coefficients are not shown because the argument correlation is set to FALSE.
The AIC value is 217.48 for this additive model containing all the variables. In addition, the prediction error rate is 0.2698413.
In the following analyses we can compare AIC values and prediction error rates against these two numbers to show the effect of including or excluding particular variables.
From the p-values in the Pr(>|z|) column, we find that the coefficients ptdTRUE, htTRUE, lwt, and raceblack are significant for the prediction if the significance level α is set at 0.05.
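The significant coefficients and the prediction error rate can be extracted programmatically. The sketch below assumes birthwt comes from MASS, that the error rate is computed on the same data used for fitting, and that a case is predicted as low = 1 when its fitted probability exceeds 0.5. The names coefs, signif_terms, and err_rate are illustrative.

```r
library(MASS)  # assumption: birthwt is loaded from MASS

# Recode birthwt as in Question 1.
bwt <- with(birthwt, {
  race <- factor(race, labels = c("white", "black", "other"))
  ptd  <- factor(ptl > 0)
  ftv  <- factor(ftv)
  levels(ftv)[-(1:2)] <- "2+"
  data.frame(low = factor(low), age, lwt, race, smoke = (smoke > 0),
             ptd, ht = (ht > 0), ui = (ui > 0), ftv)
})

fit   <- glm(low ~ ., family = binomial, data = bwt)
coefs <- summary(fit, correlation = FALSE)$coefficients

# Coefficients (excluding the intercept) with p-value below 0.05.
signif_terms <- setdiff(rownames(coefs)[coefs[, "Pr(>|z|)"] < 0.05],
                        "(Intercept)")
print(signif_terms)

# Share of cases whose predicted class differs from the observed one.
pred_class <- as.numeric(fitted(fit) > 0.5)
obs_class  <- as.numeric(bwt$low == "1")
err_rate   <- mean(pred_class != obs_class)
print(c(AIC = AIC(fit), error = err_rate))
```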

7
Question 2: Model Selection

Consider a model with the above four important variables.
glm(low ~ lwt + race + ptd + ht, family = binomial, data = bwt)
The AIC value becomes 217.40, which is slightly smaller than that of the model using all of the variables.
The prediction error rate is 0.2698413, the same as before.
If we drop only the two least significant variables, ftv and age, we obtain an alternative model:
glm(low ~ lwt + race + ptd + ht + smoke + ui, family = binomial, data = bwt)
The AIC value becomes 213.8516, which is smaller than that of either model described above.
The prediction error rate is 0.2433862.
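The three models can be compared side by side. This is a sketch under the same assumptions as before: birthwt from MASS, the error rate computed on the fitting data with a 0.5 cutoff. The names fits, train_error, and comparison are hypothetical.

```r
library(MASS)  # assumption: birthwt is loaded from MASS

# Recode birthwt as in Question 1.
bwt <- with(birthwt, {
  race <- factor(race, labels = c("white", "black", "other"))
  ptd  <- factor(ptl > 0)
  ftv  <- factor(ftv)
  levels(ftv)[-(1:2)] <- "2+"
  data.frame(low = factor(low), age, lwt, race, smoke = (smoke > 0),
             ptd, ht = (ht > 0), ui = (ui > 0), ftv)
})

fits <- list(
  full = glm(low ~ ., family = binomial, data = bwt),
  four = glm(low ~ lwt + race + ptd + ht, family = binomial, data = bwt),
  six  = glm(low ~ lwt + race + ptd + ht + smoke + ui,
             family = binomial, data = bwt)
)

# Hypothetical helper: misclassification rate on the fitting data.
train_error <- function(fit) mean((fitted(fit) > 0.5) != (bwt$low == "1"))

comparison <- data.frame(AIC   = sapply(fits, AIC),
                         error = sapply(fits, train_error),
                         row.names = names(fits))
print(comparison)
```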

8
Question 2: Model Selection with AIC

Use AIC as the objective of model selection.
birthwt.step <- step(birthwt.glm, trace = FALSE)
The command step selects a formula-based model by AIC. Its basic idea is to remove variables one at a time in order to find models with smaller AIC values.
The argument trace is turned off so that the deletion process is not printed.
We can build the model with backward elimination or forward selection.
birthwt.step <- step(birthwt.glm, trace = FALSE, direction = c("forward"))
birthwt.step <- step(birthwt.glm, trace = FALSE, direction = c("backward"))
birthwt.step$anova
This component is useful for showing the sequence in which variables were deleted.
For this data set, step removes the two variables ftv and age, and the final AIC value is reduced to 213.8516. The selection procedure stops because removing any further variable would not reduce the AIC value.
Interestingly, the model selected by the AIC criterion coincides with the one we obtained above by dropping the least significant variables.
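A runnable sketch of the backward search follows, assuming birthwt is loaded from MASS and recoded as in Question 1. The names full, backward, and dropped are illustrative.

```r
library(MASS)  # assumption: birthwt is loaded from MASS

# Recode birthwt as in Question 1.
bwt <- with(birthwt, {
  race <- factor(race, labels = c("white", "black", "other"))
  ptd  <- factor(ptl > 0)
  ftv  <- factor(ftv)
  levels(ftv)[-(1:2)] <- "2+"
  data.frame(low = factor(low), age, lwt, race, smoke = (smoke > 0),
             ptd, ht = (ht > 0), ui = (ui > 0), ftv)
})

full     <- glm(low ~ ., family = binomial, data = bwt)
backward <- step(full, direction = "backward", trace = FALSE)

# The anova component lists each deletion step and the AIC after it.
print(backward$anova)

# Terms present in the full model but absent from the selected one.
dropped <- setdiff(attr(terms(full), "term.labels"),
                   attr(terms(backward), "term.labels"))
print(dropped)
```

Comparing the term labels of the two fits is a simple way to confirm which variables the search removed without reading the printed trace.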

9
Question 3

Repeat the steps in Question 2 and consider all models that include pairwise interactions.
To ensure that collinearity is not present in the model, we usually start by checking the correlations among all independent variables.
The difficult question is how to handle correlation with categorical variables. Refer to your notes on association.
Model building strategy
Strategy 1: backward elimination
Add all pairwise interactions and then remove them one by one.
Implementation of strategy 1
Start from the additive model with all independent variables and add all pairwise interactions.
birthwt.glm <- glm(low ~ .^2, family = binomial, data = bwt, maxit = 20)
Because of convergence problems, we increase the upper bound on the number of iterations, which is done with maxit.
Suggestion: Start from the best additive linear model derived in Question 2 (exclude the two variables ftv and age).

10
Question 3: Backward Elimination

birthwt.glm <- glm(low ~ (. - ftv - age)^2, family = binomial, data = bwt, maxit = 20)
Consider the additive model without ftv and age, together with pairwise interactions.
This leads to the following model:
low ~ age + lwt + race + smoke + ptd + ht + ui + ftv + age:ht + age:ftv + lwt:smoke + lwt:ht + lwt:ui + race:ht + smoke:ht + ptd:ht + ht:ui + ht:ftv
Model selection with AIC
birthwt.step.pwall <- stepAIC(birthwt.glm.pwall, trace = FALSE)
birthwt.step.pwall$anova
This leads to the following model with AIC = 210.8205:
low ~ age + lwt + race + smoke + ptd + ht + ui + ftv + lwt:smoke + lwt:ht + lwt:ui + ht:ui
Suggestion: Can we simply compare all possible pairs of pairwise interaction terms?
The best such model has the interaction terms age:ftv and ht:ui.
Its AIC is 209.0006.
Although the variables age and ftv are not important when we consider only an additive linear model, they are included once interaction terms are also taken into consideration.

11
Question 3: Model Searching Strategy

birthwt.step.both <- stepAIC(birthwt.glm, scope = list(upper = ~ .^2, lower = ~ 1), trace = FALSE)
The direction of the stepwise search can be one of "both", "backward", or "forward", with a default of "both".
If the scope argument is missing, the default for direction is "backward".
With direction "both", we not only remove predictors from the model but also add predictors in order to reduce the AIC value.
For this data set, start with the original model with no interaction term.
The interaction terms age:ftv and smoke:ui are added sequentially.
Finally, the race term is removed.
The process stops at the model
low ~ age + lwt + smoke + ptd + ht + ui + ftv + age:ftv + smoke:ui
with the lowest AIC value, 207.0734.
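The two-directional search can be sketched as follows, assuming birthwt and stepAIC are both taken from the MASS package. The object names additive and both are illustrative, and the fit may emit warnings about fitted probabilities close to 0 or 1 while interaction terms are being tried.

```r
library(MASS)  # assumption: provides both birthwt and stepAIC

# Recode birthwt as in Question 1.
bwt <- with(birthwt, {
  race <- factor(race, labels = c("white", "black", "other"))
  ptd  <- factor(ptl > 0)
  ftv  <- factor(ftv)
  levels(ftv)[-(1:2)] <- "2+"
  data.frame(low = factor(low), age, lwt, race, smoke = (smoke > 0),
             ptd, ht = (ht > 0), ui = (ui > 0), ftv)
})

additive <- glm(low ~ ., family = binomial, data = bwt)

# upper: all main effects plus pairwise interactions; lower: intercept only.
both <- stepAIC(additive,
                scope = list(upper = ~ .^2, lower = ~ 1),
                trace = FALSE)

print(both$anova)      # sequence of additions and deletions
print(formula(both))   # selected model
```

Because each step is accepted only if it lowers the AIC, the selected model can never have a higher AIC than the additive starting model.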

12
Question 5

Repeat the above procedure for the two chosen models, using cross-validation to estimate the prediction error again.
How do we divide the data randomly into several groups (fold = 5, for example)?
Use the cv.glm procedure in the boot package.
Some students wrote their own version of cross-validation.
cv.glm
Cross-validation for Generalized Linear Models
This function calculates the estimated K-fold cross-validation prediction error for generalized linear models.
cv.glm(data, glmfit, cost, K)
data: A matrix or data frame containing the data. The rows should be cases and the columns correspond to variables, one of which is the response.
glmfit: An object of class "glm" containing the results of a generalized linear model fitted to data.
cost: A function of two vector arguments specifying the cost function for the cross-validation. The first argument to cost should correspond to the observed responses and the second argument should correspond to the predicted or fitted responses from the generalized linear model. The default is the average squared error function.
K: The number of groups into which the data should be split to estimate the cross-validation prediction error.
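A call to cv.glm with a misclassification cost might look like the sketch below. It assumes birthwt comes from MASS and cv.glm from boot. The cost function misclass, its 0.5 cutoff, and the seed are choices made for this illustration.

```r
library(MASS)  # assumption: source of birthwt
library(boot)  # provides cv.glm

# Recode birthwt as in Question 1.
bwt <- with(birthwt, {
  race <- factor(race, labels = c("white", "black", "other"))
  ptd  <- factor(ptl > 0)
  ftv  <- factor(ftv)
  levels(ftv)[-(1:2)] <- "2+"
  data.frame(low = factor(low), age, lwt, race, smoke = (smoke > 0),
             ptd, ht = (ht > 0), ui = (ui > 0), ftv)
})

fit <- glm(low ~ ., family = binomial, data = bwt)

# Hypothetical cost: share of cases whose predicted probability falls on
# the wrong side of 0.5 (obs is 0/1, pred is a probability).
misclass <- function(obs, pred) mean(abs(obs - pred) > 0.5)

set.seed(1)  # arbitrary seed so the random split is reproducible
cv5 <- cv.glm(bwt, fit, cost = misclass, K = 5)

cv_error <- cv5$delta[1]  # raw cross-validation estimate
print(cv_error)
```

Without the cost argument, cv.glm would report the average squared error between the 0/1 response and the fitted probability, which is not directly comparable to the error rates quoted in Question 2.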

13
Cross-validation Algorithm

# Shuffle the rows of bwt by swapping two randomly chosen rows, 1000 times.
bwt.shuffle <- bwt
shuf.iter <- 1000
datasize <- length(bwt$low)
for (i in 1:shuf.iter) {
  n1 <- round(runif(n = 1, min = 1, max = datasize))
  n2 <- round(runif(n = 1, min = 1, max = datasize))
  temp <- bwt.shuffle[n1, ]
  bwt.shuffle[n1, ] <- bwt.shuffle[n2, ]
  bwt.shuffle[n2, ] <- temp
}
# k is the number of folds and must be set beforehand (for example k <- 5).
fold <- k
testsize <- round(datasize / fold)
rate.fold <- rep(0, fold)
for (i in 1:fold) {
  # Rows (test.start + 1):test.end form the test set of fold i.
  test.start <- (i - 1) * testsize
  test.end <- test.start + testsize
  if (test.end > datasize) test.end <- datasize
  bwt.test <- data.frame(bwt.shuffle[(test.start + 1):test.end, ])
  bwt.train <- data.frame(bwt.shuffle[-((test.start + 1):test.end), ])
  train.glm <- glm(low ~ ., family = binomial, data = bwt.train)
  pred <- predict(train.glm, subset(bwt.test, select = c(age, lwt, race, smoke, ptd, ht, ui, ftv)), type = "response")
  # Share of test cases classified correctly in fold i.
  rate.fold[i] <- sum(round(pred) == bwt.test$low) / length(pred)
}
cvrate <- mean(rate.fold)
cat("prediction error rate of cross validation", 1 - cvrate, "\n")
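The random swaps used for shuffling can be replaced by a single call to sample. The sketch below shows only the fold assignment. The helper make_folds is hypothetical, the seed is arbitrary, and 189 is the number of cases implied by the table(ftv) counts.

```r
# Hypothetical helper: assign each of n rows to one of k folds at random,
# with fold sizes differing by at most one.
make_folds <- function(n, k) {
  sample(rep(seq_len(k), length.out = n))
}

set.seed(2003)                         # arbitrary seed for reproducibility
fold_id    <- make_folds(n = 189, k = 5)
fold_sizes <- table(fold_id)
print(fold_sizes)

# For fold i, the test rows are which(fold_id == i); the rest are training rows.
test_rows_1  <- which(fold_id == 1)
train_rows_1 <- which(fold_id != 1)
```

Indexing by a fold label avoids the start and end arithmetic of the loop above and guarantees that every row is used as a test case exactly once.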