table(ftv) ... Here, the variables 'race', 'ptd', and 'ftv' - PowerPoint PPT Presentation

Slides: 14
Provided by: mathN
Transcript and Presenter's Notes

1
Financial Time Series I/Methods of Statistical
Prediction
  • Suggested Answers to Project 2: Logistic
    Regression and Model Selection
  • 1/19/2003

2
Question 1
  • Give a brief explanation of the meaning of the
    following commands.
  • data(birthwt)
  • Load the data set birthwt from the MASS
    package (the package must be attached first).
  • Alternatively, the command require(MASS) loads
    the whole package into the workspace, which
    makes birthwt available directly.
  • attach(birthwt)
  • The data set birthwt is attached to the search
    path so that each variable in the data set can
    be accessed directly by its name. Otherwise, we
    have to use birthwt$varname instead of varname.
  • race <- factor(race, labels = c("white", "black",
    "other"))
  • The function factor is used to encode a vector
    as a factor. race is originally stored as a
    numeric code (1, 2, 3). This command converts it
    into a categorical factor with labelled levels.
  • table(ftv)
  • table uses the cross-classifying factors to build
    a contingency table of the counts at each
    combination of factor levels. With a single
    variable, it gives the count of each distinct
    value. The result of this command is as follows:
  • ftv:      0    1    2    3    4    6
  • count:  100   47   30    7    4    1
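
The factor and table commands above can be tried on a small vector without loading any data. This is only a minimal sketch: the vector race_code and its values are made up for illustration and are not taken from birthwt.

```r
# Hypothetical numeric codes standing in for the race column (1, 2, 3).
race_code <- c(1, 2, 3, 1, 1, 2)

# Encode the codes as a factor with labelled levels.
race <- factor(race_code, labels = c("white", "black", "other"))

# Count how many observations fall in each level.
counts <- table(race)
print(counts)
```

The labels are matched to the sorted distinct values of the vector, so code 1 becomes "white", 2 becomes "black", and 3 becomes "other".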

3
Question 1
  • ftv <- factor(ftv)
  • The function factor is used to encode a vector
    as a factor.
  • levels(ftv)[-(1:2)] <- "2+"
  • This command collapses every level of ftv other
    than the first two into a single level, leaving
    three levels: 0, 1, and 2+.
  • 2. Convert ptl to two levels and name the new
    variable ptd.
  • ptd <- factor(ptl > 0)
  • The function factor is used to encode a vector
    as a factor. Here, ptd is a new factor with only
    two levels, FALSE and TRUE.
  • 3. Create a new data frame bwt.
  • bwt <- data.frame(low = factor(low), age, lwt,
    race, smoke = (smoke > 0), ptd, ht = (ht > 0),
    ui = (ui > 0), ftv)
  • Create a new data frame that holds the recoded
    variables; smoke, ht, and ui become logical.
  • 4. Clean up data.
  • detach(birthwt)
  • Remove the data set from the search path of
    available R objects. This command can be used to
    remove either a data frame which has been
    attached or a package which was loaded
    previously.
  • rm(race, ptd, ftv)
  • remove and rm can be used to remove
    objects. Here, the variables race, ptd, and
    ftv, which have been created in this workspace,
    are removed and no longer exist. The copies
    stored in the data frame bwt are not affected.
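
The same recoding can be written without attach, detach, and rm, so that no temporary variables are left in the workspace. The sketch below assumes that birthwt comes from the MASS package and wraps the steps of Question 1 in with(); it is one possible arrangement, not necessarily the one intended by the project.

```r
library(MASS)  # assumption: birthwt is taken from the MASS package

# All recoding happens inside with(), so race, ptd and ftv are only
# temporary and never appear in the global workspace.
bwt <- with(birthwt, {
  race <- factor(race, labels = c("white", "black", "other"))
  ptd <- factor(ptl > 0)
  ftv <- factor(ftv)
  levels(ftv)[-(1:2)] <- "2+"   # merge levels 2, 3, 4, 6 into "2+"
  data.frame(low = factor(low), age, lwt, race, smoke = (smoke > 0),
             ptd, ht = (ht > 0), ui = (ui > 0), ftv)
})

str(bwt)
print(table(bwt$ftv))
```

After the merge, the counts of ftv are the first two counts from table(ftv) and the sum of the remaining ones.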

4
Question 2
  • Give a brief explanation of the specification of
    the regression model.
  • birthwt.glm <- glm(low ~ ., family = binomial,
    data = bwt)
  • glm is used to fit generalized linear models
    (GLM) to the data.
  • In a GLM, two components need to be specified.
  • The first is the error distribution of the
    dependent variable. Here, the chosen
    distribution is the binomial distribution. It is
    specified by family = binomial.
  • The second concerns the unknown parameter of
    the error distribution. For the binomial
    distribution, it is the probability of
    success p, which is used to relate the
    independent variables to the dependent variable.
  • The so-called link function relates them. The
    default link function for the binomial
    distribution is the logit, which gives
  • $F(x) = P(Y = 1 \mid x) = \frac{\exp(z)}{1 + \exp(z)}$, where
    $z = \beta_0 + \beta_1 X_1 + \dots + \beta_K X_K$
  • The fitted values of this model are the
    probabilities of low = 1 given all the other
    variables in the data frame bwt. Here, we only
    consider an additive model. It means that all
    variables enter the model linearly. (There
    is no interaction term in the model.)
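
The relation between the linear predictor z and the probability p can be checked numerically. The following is a small sketch with made-up coefficients; beta and x are hypothetical values chosen only so that the arithmetic is easy to follow.

```r
# Inverse of the logit link: maps a linear predictor z to a probability.
inv_logit <- function(z) exp(z) / (1 + exp(z))

beta <- c(-1, 0.5)   # hypothetical coefficients (beta_0, beta_1)
x <- c(1, 2)         # 1 for the intercept, then X_1 = 2

z <- sum(beta * x)   # linear predictor: -1 + 0.5 * 2 = 0
p <- inv_logit(z)    # probability of success for this x

# R's built-in plogis computes the same function.
print(c(z = z, p = p, plogis = plogis(z)))
```

A linear predictor of 0 corresponds to a probability of 0.5, and larger values of z move p towards 1.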

6
Question 2
  • summary(birthwt.glm, correlation = F)
  • It gives the deviance residuals and coefficients.
  • The command summary is a generic function used
    to produce summaries of the results of
    various model fitting functions.
  • The correlations of coefficients are not shown
    because the parameter correlation is set to
    false.
  • The AIC value is 217.48 in this additive model
    containing all the variables. In addition, the
    prediction error rate is 0.2698413.
  • We can compare the AIC values and the prediction
    error rates of the following analyses with
    these values to show the effect of including or
    excluding particular variables.
  • From the p-value Pr(>|z|) of each coefficient,
    we find that ptdTRUE, htTRUE, lwt, and raceblack
    are significant for the prediction if the
    significance level α is set at 0.05.
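
The quantities discussed above can be extracted from the fitted object directly. This is a sketch under two assumptions: birthwt is taken from the MASS package, and the prediction error rate is the share of cases misclassified when the fitted probability is cut at 0.5. The slides do not state how their error rate was computed, so the value printed here need not match theirs.

```r
library(MASS)  # assumption: birthwt is taken from the MASS package

# Rebuild the recoded data frame bwt as described in Question 1.
bwt <- with(birthwt, {
  race <- factor(race, labels = c("white", "black", "other"))
  ptd <- factor(ptl > 0)
  ftv <- factor(ftv)
  levels(ftv)[-(1:2)] <- "2+"
  data.frame(low = factor(low), age, lwt, race, smoke = (smoke > 0),
             ptd, ht = (ht > 0), ui = (ui > 0), ftv)
})

birthwt.glm <- glm(low ~ ., family = binomial, data = bwt)

# Coefficient table: estimate, standard error, z value, Pr(>|z|).
coefs <- summary(birthwt.glm)$coefficients
signif.terms <- rownames(coefs)[coefs[, "Pr(>|z|)"] < 0.05]

# Assumed definition of the prediction error rate: share of cases whose
# fitted probability, cut at 0.5, disagrees with the observed 0/1 response.
pred.class <- as.numeric(fitted(birthwt.glm) > 0.5)
error.rate <- mean(pred.class != birthwt.glm$y)

print(AIC(birthwt.glm))
print(signif.terms)
print(error.rate)
```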

7
Question 2 Model Selection
  • Consider a model with the above four important
    variables.
  • glm(low ~ lwt + race + ptd + ht, family = binomial,
    data = bwt)
  • The AIC value becomes 217.40, which is
    slightly smaller than that of the model using
    all of the variables.
  • The prediction error rate is 0.2698413, the same
    as for the full model.
  • If we drop only the two least significant
    variables, ftv and age, this leads to an
    alternative model:
  • glm(low ~ lwt + race + ptd + ht + smoke + ui,
    family = binomial, data = bwt)
  • The AIC value becomes 213.8516, which is
    smaller than that of the two models described
    above.
  • The prediction error rate is 0.2433862.

8
Question 2 Model Selection with AIC
  • Use AIC as the objective of model selection.
  • birthwt.step <- step(birthwt.glm, trace = F)
  • The command step selects a formula-based model
    by AIC. Its basic idea is to remove the variables
    one by one in order to find better models with
    smaller AIC values.
  • The parameter trace is turned off so that the
    above deletion process is not shown.
  • We can build the model with backward elimination
    or forward selection.
  • birthwt.step <- step(birthwt.glm, trace = F,
    direction = c("forward"))
  • birthwt.step <- step(birthwt.glm, trace = F,
    direction = c("backward"))
  • birthwt.step$anova
  • This command is useful in showing the process of
    deleting variables.
  • For this data set, it removes the two variables
    ftv and age, and the final AIC value is
    reduced to 213.8516. The selection procedure
    stops because no further reduction of the AIC
    value can be achieved by removing any other
    variable.
  • It is interesting that the model selected by the
    AIC criterion coincides with the one obtained
    above by dropping the two least significant
    variables.
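
The backward search can be reproduced as follows. The sketch assumes that birthwt is taken from the MASS package and rebuilds bwt first so that it runs on its own.

```r
library(MASS)  # assumption: birthwt is taken from the MASS package

# Rebuild the recoded data frame bwt as described in Question 1.
bwt <- with(birthwt, {
  race <- factor(race, labels = c("white", "black", "other"))
  ptd <- factor(ptl > 0)
  ftv <- factor(ftv)
  levels(ftv)[-(1:2)] <- "2+"
  data.frame(low = factor(low), age, lwt, race, smoke = (smoke > 0),
             ptd, ht = (ht > 0), ui = (ui > 0), ftv)
})

birthwt.glm <- glm(low ~ ., family = binomial, data = bwt)

# Backward elimination by AIC, starting from the full additive model.
birthwt.step <- step(birthwt.glm, trace = FALSE, direction = "backward")

print(birthwt.step$anova)    # which terms were dropped, and the AIC path
print(formula(birthwt.step)) # the selected model
print(AIC(birthwt.step))
```

Note that direction = "forward" started from the full model has nothing to add unless a larger model is supplied through the scope argument, so in that case step returns the starting model unchanged.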

9
Question 3
  • Repeat the steps in Question 2 and consider all
    models that include pairwise interactions.
  • In order to ensure that collinearity is not
    present in the model, we usually start by
    checking the correlations among
    all independent variables.
  • The difficult question is how to address the
    correlation with categorical variables. Refer to
    your notes on association.
  • Model building strategy
  • Strategy 1: backward elimination
  • Add all pairwise interactions and then remove
    them one by one.
  • Implementation of strategy 1
  • Start from the model with all independent
    variables and all their pairwise interactions.
  • birthwt.glm <- glm(low ~ .^2, family = binomial,
    data = bwt, maxit = 20)
  • Due to convergence problems, we can increase the
    upper bound on the number of iterations, which is
    done by using maxit.
  • Suggestion: start from the best additive linear
    model derived in Question 2. (Exclude the two
    variables ftv and age.)

10
Question 3 backward elimination
  • birthwt.glm <- glm(low ~ (. - ftv - age)^2,
    family = binomial, data = bwt, maxit = 20)
  • Consider the model without ftv and age, together
    with the pairwise interactions of the remaining
    variables.
  • This leads to the following model:
  • low ~ age + lwt + race + smoke + ptd + ht + ui +
    ftv + age:ht + age:ftv + lwt:smoke + lwt:ht +
    lwt:ui + race:ht + smoke:ht + ptd:ht + ht:ui +
    ht:ftv
  • Model selection with AIC
  • birthwt.step.pwall <- stepAIC(birthwt.glm.pwall,
    trace = F)
  • birthwt.step.pwall$anova
  • This leads to the following model with
    AIC = 210.8205:
  • low ~ age + lwt + race + smoke + ptd + ht + ui +
    ftv + lwt:smoke + lwt:ht + lwt:ui + ht:ui
  • Suggestion: can we just compare all possible
    pairs of pairwise interaction terms?
  • The best one is with the interaction terms
    age:ftv and ht:ui.
  • Its AIC is 209.0006.
  • Although the variables age and ftv are not
    important when we only consider an additive
    linear model, they are included when the
    interaction terms are also taken into
    consideration.
11
Question 3 Model Searching Strategy
  • birthwt.step.both <- stepAIC(birthwt.glm, scope =
    list(upper = ~ .^2, lower = ~ 1), trace = F)
  • The direction of the stepwise search can be one
    of "both", "backward", or "forward", with a
    default of "both".
  • If the scope argument is missing, the default
    for direction is "backward".
  • Therefore, we not only remove predictors from
    the model but also add predictors to reduce the
    AIC value.
  • For this data set, start with the original model
    with no interaction term.
  • The interaction terms age:ftv and smoke:ui
    are added sequentially.
  • Finally, the race term is removed.
  • The process stops at the model
  • low ~ age + lwt + smoke + ptd + ht + ui + ftv +
    age:ftv + smoke:ui
  • with the lowest AIC value 207.0734.
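
The two-directional search can be run as sketched below. It assumes that birthwt and stepAIC both come from the MASS package and starts, as described above, from the additive model with no interaction term. Warnings from glm about fitted probabilities may appear while interaction models are tried.

```r
library(MASS)  # assumption: provides both birthwt and stepAIC

# Rebuild the recoded data frame bwt as described in Question 1.
bwt <- with(birthwt, {
  race <- factor(race, labels = c("white", "black", "other"))
  ptd <- factor(ptl > 0)
  ftv <- factor(ftv)
  levels(ftv)[-(1:2)] <- "2+"
  data.frame(low = factor(low), age, lwt, race, smoke = (smoke > 0),
             ptd, ht = (ht > 0), ui = (ui > 0), ftv)
})

# Starting point: additive model with all variables.
birthwt.glm <- glm(low ~ ., family = binomial, data = bwt)

# Terms may be added up to all pairwise interactions (upper)
# and removed down to the intercept-only model (lower).
birthwt.step.both <- stepAIC(birthwt.glm,
                             scope = list(upper = ~ .^2, lower = ~ 1),
                             trace = FALSE)

print(birthwt.step.both$anova)
print(AIC(birthwt.step.both))
```

The anova component lists each added or removed term in order, which makes it easy to compare the path with the one described above.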

12
Question 5
  • Repeat the above procedure with the two chosen
    models, using cross-validation to give the
    prediction error again.
  • How do we divide the data randomly into several
    groups (fold = 5, for example)?
  • Use the cv.glm procedure in the boot package.
  • Some students write their own version of
    cross-validation.
  • cv.glm
  • Cross-validation for Generalized Linear Models
  • This function calculates the estimated K-fold
    cross-validation prediction error for generalized
    linear models.
  • cv.glm(data, glmfit, cost, K)
  • data: A matrix or data frame containing the data.
    The rows should be cases and the columns
    correspond to variables, one of which is the
    response.
  • glmfit: An object of class "glm" containing the
    results of a generalized linear model fitted to
    data.
  • cost: A function of two vector arguments
    specifying the cost function for the
    cross-validation. The first argument to cost
    should correspond to the observed responses and
    the second argument should correspond to the
    predicted or fitted responses from the
    generalized linear model. The default is the
    average squared error function.
  • K: The number of groups into which the data
    should be split to estimate the cross-validation
    prediction error.
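
A call to cv.glm for the additive model might look like the sketch below. It assumes that birthwt comes from the MASS package and cv.glm from the boot package. The cost function, which counts a case as an error when the predicted probability is on the wrong side of 0.5, and the seed value are choices made for this illustration.

```r
library(MASS)  # assumption: birthwt is taken from the MASS package
library(boot)  # provides cv.glm

# Rebuild the recoded data frame bwt as described in Question 1.
bwt <- with(birthwt, {
  race <- factor(race, labels = c("white", "black", "other"))
  ptd <- factor(ptl > 0)
  ftv <- factor(ftv)
  levels(ftv)[-(1:2)] <- "2+"
  data.frame(low = factor(low), age, lwt, race, smoke = (smoke > 0),
             ptd, ht = (ht > 0), ui = (ui > 0), ftv)
})

birthwt.glm <- glm(low ~ ., family = binomial, data = bwt)

# Misclassification cost: y is the observed 0/1 response,
# p the predicted probability.
cost <- function(y, p) mean(abs(y - p) > 0.5)

set.seed(123)  # arbitrary seed so that the random split is repeatable
cv.err <- cv.glm(bwt, birthwt.glm, cost = cost, K = 5)

# delta[1] is the raw cross-validation estimate of the prediction
# error, delta[2] a bias-adjusted version.
print(cv.err$delta)
```

Without a cost argument, cv.glm reports the average squared error, which is not an error rate for a 0/1 response.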

13
Cross-validation Algorithm
  • bwt.shuffle <- bwt; shuf.iter <- 1000; datasize <-
    length(bwt$low)
  • for (i in 1:shuf.iter) {
  • n1 <- round(runif(n = 1, min = 1, max = datasize))
  • n2 <- round(runif(n = 1, min = 1, max = datasize))
  • temp <- bwt.shuffle[n1, ]
  • bwt.shuffle[n1, ] <- bwt.shuffle[n2, ]
  • bwt.shuffle[n2, ] <- temp
  • }
  • fold <- k; testsize <- round(datasize / fold);
    rate.fold <- rep(0, fold)
  • for (i in 1:fold) {
  • test.start <- (i - 1) * testsize; test.end <-
    test.start + testsize
  • if (test.end > datasize) test.end <- datasize
  • bwt.test <- data.frame(bwt.shuffle[(test.start + 1):
    test.end, ])
  • bwt.train <- data.frame(bwt.shuffle[-((test.start +
    1):test.end), ])
  • train.glm <- glm(low ~ ., family = binomial,
    data = bwt.train)
  • pred <- predict(train.glm, subset(bwt.test, select =
    c(age, lwt, race, smoke, ptd, ht, ui, ftv)), type =
    "response")
  • rate.fold[i] <- sum(round(pred) == bwt.test$low) /
    length(pred)
  • }
  • cvrate <- mean(rate.fold); cat("prediction error
    rate of cross validation", 1 - cvrate, "\n")
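
The shuffling by repeated random swaps can be replaced by a single call to sample. The sketch below is an alternative arrangement of the same idea, not the algorithm on the slide: it assigns every row a random fold label and then loops over the folds. It runs on synthetic data (the data frame dat, its columns x and y, and the seed are made up for illustration) so that it does not depend on any package.

```r
set.seed(1)  # arbitrary seed so that the example is repeatable

# Synthetic data: one predictor x and a 0/1 response y.
n <- 200
k <- 5
dat <- data.frame(x = rnorm(n))
dat$y <- rbinom(n, 1, plogis(0.5 + dat$x))

# Give every row a fold label 1..k in random order; the folds are
# as equal in size as possible.
fold.id <- sample(rep(seq_len(k), length.out = n))

err <- numeric(k)
for (i in seq_len(k)) {
  train <- dat[fold.id != i, ]
  test <- dat[fold.id == i, ]
  fit <- glm(y ~ x, family = binomial, data = train)
  pred <- predict(fit, newdata = test, type = "response")
  # Share of test cases whose rounded prediction is wrong.
  err[i] <- mean(round(pred) != test$y)
}

cv.error <- mean(err)
cat("prediction error rate of cross validation", cv.error, "\n")
```

Because each row receives exactly one label, every observation is used for testing exactly once, which also avoids the special handling of the last fold.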