Generalized Linear Models - PowerPoint PPT Presentation


Provided by: themegal


1
Generalized Linear Models
Chapter 8

www.smu.edu.sg

2
Contents
3
GLM: A General Introduction
What is a Generalized Linear Model? A
traditional linear model is of the form
$Y_i = x_i^\top \beta + \varepsilon_i$, $i = 1, 2, \ldots, n$,
where $Y_i$, the responses, are assumed to be
independent, normally distributed random
variables with mean $\mu_i = x_i^\top \beta$ and constant variance
$\sigma^2$. While traditional linear models are used
extensively in statistical data analysis, there
are types of problems for which they are not
appropriate.
4
GLM: A General Introduction

It may not be reasonable to assume that the data are
normally distributed, e.g., proportion data,
count data, etc.
If the mean of the data is naturally restricted
to a range of values, the traditional linear
model may not be appropriate, since the linear
predictor can take on any value.
For example, the mean of a measured proportion
is between 0 and 1, but the linear predictor of
the mean in a traditional linear model is not
restricted to this range.
It may not be realistic to assume that the
variance of the data is constant for all
observations. For example, it is not unusual to
observe data where the variance increases with
the mean of the data.
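
The range problem can be seen in a small R sketch. The doses and proportions below are made-up illustrative values, not data from the slides. An ordinary least-squares line fitted to them gives predicted "proportions" below 0 and above 1 once we move outside the observed range.

```r
# Made-up proportions observed at five values of a predictor x
x    <- c(20, 35, 45, 55, 70)
prop <- c(0.10, 0.30, 0.50, 0.75, 0.90)

# Traditional linear model: the fitted mean is an unrestricted straight line
fit_lm <- lm(prop ~ x)

# Predictions outside the observed range leave the interval [0, 1]
pred <- predict(fit_lm, newdata = data.frame(x = c(5, 90)))
print(pred)
```

A model for a proportion therefore needs a link function that keeps the fitted mean inside (0, 1).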

5
GLM: A General Introduction
A generalized linear model (GLM) extends the
traditional linear model and is, therefore,
applicable to a wider range of data analysis
problems. A generalized linear model consists of
the following components:

a linear component, defined just as it is for the
traditional linear models,
$\eta_i = x_i^\top \beta$
a link function $g$, monotonic and differentiable,
describing how the expected value $\mu_i$ of $Y_i$ is
related to the linear predictor $\eta_i$,
$g(\mu_i) = \eta_i$
a family of distributions called the exponential
family, from which $Y_i$, $i = 1, 2, \ldots, n$, are
independently drawn
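
In R these components are bundled in a family object. The sketch below only inspects two standard family objects from the stats package; the numeric inputs are arbitrary illustrative values.

```r
# A family object bundles the link function and the variance function
fam <- poisson()            # log link by default
fam$link                    # name of the link
fam$linkfun(2)              # g(mu) = log(mu)
fam$linkinv(log(2))         # inverse link, recovers mu = 2
fam$variance(3)             # Poisson variance function V(mu) = mu

bin <- binomial()           # logit link by default
bin$link
bin$variance(0.25)          # V(mu) = mu * (1 - mu)
```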

6
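GLM: A General Introduction
The exponential family has densities of the form
$f(y_i; \theta_i, \phi) = \exp\{ [y_i \theta_i - b(\theta_i)] / a(\phi) + c(y_i, \phi) \}$,
where $\theta_i$ is called the natural parameter and
$\phi$ the dispersion parameter, which is constant
across $i$ and may be known.

Naturally, the GLM models the $\theta_i$ by a linear
model, i.e., $\theta_i = x_i^\top \beta$.
If this is the case, the resulting link is called the
canonical link.
The mean of the exponential family is
$\mu_i = E(Y_i) = b'(\theta_i)$,
i.e., $\theta_i$ is a function of only the mean $\mu_i$,
which gives the link function $g = (b')^{-1}$ upon inversion.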
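
The step from the density to the mean can be written out. The sketch below takes the density in the standard form $\exp\{[y\theta - b(\theta)]/a(\phi) + c(y,\phi)\}$ (an assumption, chosen to agree with the $b$, $a$ and $c$ used on the later slides) and uses the usual identities for the score function.

```latex
\begin{aligned}
\ell(\theta_i) &= \frac{y_i\theta_i - b(\theta_i)}{a(\phi)} + c(y_i,\phi), \\
E\!\left[\frac{\partial \ell}{\partial \theta_i}\right] = 0
  &\;\Rightarrow\; E(Y_i) = \mu_i = b'(\theta_i), \\
E\!\left[\frac{\partial^2 \ell}{\partial \theta_i^2}\right]
  + E\!\left[\left(\frac{\partial \ell}{\partial \theta_i}\right)^{2}\right] = 0
  &\;\Rightarrow\; \mathrm{Var}(Y_i) = b''(\theta_i)\, a(\phi), \\
\theta_i = \eta_i
  &\;\Rightarrow\; g(\mu_i) = (b')^{-1}(\mu_i) \quad \text{(canonical link)}.
\end{aligned}
```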

7
GLM: A General Introduction
It can easily be seen that the following
distributions are members of the exponential
family:

Normal: leading to regular linear regression
Binomial: leading to logistic regression
Poisson: leading to the loglinear model
Multinomial: multinomial response model
Gamma: lifetime data analysis

GLM provides a unified approach in terms of both
modeling and statistical inference.
8
Normal GLM
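If $Y_i \sim N(\mu_i, \sigma^2)$, then $E(Y_i) = \mu_i$, $\mathrm{Var}(Y_i) = \sigma^2$,
and the pdf of $Y_i$ is
$f(y_i; \mu_i, \sigma^2) = (2\pi\sigma^2)^{-1/2} \exp\{ -(y_i - \mu_i)^2 / (2\sigma^2) \}$,
which can be rewritten as
$f(y_i; \mu_i, \sigma^2) = \exp\{ (y_i \mu_i - \mu_i^2/2) / \sigma^2 - y_i^2/(2\sigma^2) - \tfrac{1}{2}\log(2\pi\sigma^2) \}$,
so that $\theta_i = \mu_i$, $b(\theta_i) = \theta_i^2/2$ and $a(\phi) = \sigma^2$.
So, in the normal GLM, the canonical link is the
identity link, leading to linear regression.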
9
Normal GLM
glm(): the R function that fits GLMs
Example 8.1. Education Expenditure Data
  # Finding a linear regression for the Education Expenditure Data (Text 1, p144)
  edu <- read.table("P146.txt", header = TRUE)
  y  <- edu[, 2]   # per capita expenditure on public education
  x1 <- ...        # income
  x2 <- ...        # thousands under 18 years of age
  x3 <- edu[, 5]   # number of people per thousand residing in urban areas
  x4 <- ...        # northeast region
  x5 <- ...        # north central
  x6 <- ...        # south
  fit <- glm(y ~ ..., family = gaussian)
  (summary(fit))
  (vcov(fit))
10
Normal GLM
Example 8.1. Contd. GLM Fit
The results are identical to those from lm()
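
The agreement between glm() with the gaussian family and lm() can be checked on any data set. The sketch below uses R's built-in cars data as a stand-in (an assumption, since the education expenditure file is not reproduced here) and compares the coefficient estimates and their variance-covariance matrices.

```r
# Stand-in data: the built-in 'cars' data set (stopping distance vs. speed)
fit_lm  <- lm(dist ~ speed, data = cars)
fit_glm <- glm(dist ~ speed, family = gaussian, data = cars)

# Same coefficient estimates from both fits
same_coef <- all.equal(coef(fit_lm), coef(fit_glm))

# Same estimated variance-covariance matrix of the coefficients
same_vcov <- all.equal(vcov(fit_lm), vcov(fit_glm))

print(c(same_coef = same_coef, same_vcov = same_vcov))
```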
11
GLM for Binary Data
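If $Y_i \sim \mathrm{Bernoulli}(\mu_i)$, then $E(Y_i) = \mu_i$ and
$\mathrm{Var}(Y_i) = \mu_i(1-\mu_i)$. The probability function of
$Y_i$ is
$f(y_i; \mu_i) = \mu_i^{y_i} (1-\mu_i)^{1-y_i}$, $y_i = 0, 1$.
This can be rewritten in the form of the exponential
family,
$f(y_i; \mu_i) = \exp\{ y_i \log[\mu_i/(1-\mu_i)] + \log(1-\mu_i) \}$,
which shows that the natural parameter is $\theta_i = \log[\mu_i/(1-\mu_i)]$, with $b(\theta_i) =
\log(1+\exp(\theta_i))$, $a(\phi) = 1$, and $c(y_i, \phi) = 0$.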
12
GLM for Binary Data
Thus, a generalized linear model for binary
response data, with the canonical link function, has
the form
$\log[\mu_i/(1-\mu_i)] = x_i^\top \beta$,
which is referred to as logistic regression with the logit link.
This function is called the logit function: $\mathrm{logit}(\mu_i) = \log[\mu_i/(1-\mu_i)]$.
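
As a quick numerical check, the logit and its inverse are available in base R as qlogis() and plogis(). The probabilities below are arbitrary illustrative values.

```r
mu  <- c(0.1, 0.5, 0.9)

eta <- log(mu / (1 - mu))   # logit written out by hand
eta_builtin <- qlogis(mu)   # built-in logit (logistic quantile function)

mu_back <- plogis(eta)      # inverse logit maps the real line back to (0, 1)

print(cbind(mu, eta, eta_builtin, mu_back))
```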

Like the regular linear model, the explanatory
variables can be
Continuous, giving the logistic regression
Categorical, giving the logistic ANOVA
Mixture of continuous and categorical, giving the
logistic analysis of covariance model

13
GLM for Binary Data
To fit a binary response model using glm(), there
are three possibilities for the response

The response is a vector of 0s or 1s,
representing failure and success,
The response is a two-column matrix, with the 1st column
containing the number of successes, and the 2nd
the number of failures,
The response is a factor, with its 1st level
taken as failure (0) and all others as success
(1).

In each of the three cases, the values for the
explanatory variables X should follow
accordingly. The following example illustrates the
second case.
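
The three response forms can be compared on a small made-up data set (the counts below are illustrative only). All three fits should give the same coefficient estimates, since they describe the same Bernoulli trials.

```r
# Grouped (made-up) data: successes and failures at three values of x
x    <- c(1, 2, 3)
succ <- c(1, 2, 4)
fail <- c(4, 3, 1)

# Case 2: two-column matrix of (successes, failures)
fit_matrix <- glm(cbind(succ, fail) ~ x, family = binomial)

# Expand to one row per trial: all successes first, then all failures
x_long <- rep(c(x, x), times = c(succ, fail))
y_long <- rep(c(1, 0), times = c(sum(succ), sum(fail)))

# Case 1: vector of 0s and 1s
fit_vector <- glm(y_long ~ x_long, family = binomial)

# Case 3: factor whose first level is the failure
y_factor   <- factor(y_long, levels = c(0, 1), labels = c("failure", "success"))
fit_factor <- glm(y_factor ~ x_long, family = binomial)

print(rbind(matrix = unname(coef(fit_matrix)),
            vector = unname(coef(fit_vector)),
            factor = unname(coef(fit_factor))))
```

Note that the residual deviance differs between the grouped and ungrouped fits, because the saturated model is different, even though the estimates agree.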
14
GLM for Binary Data
Example 8.2. Age and Eye Disease, Silvey (1970)
On the Aegean Island of Kalythos the male
inhabitants suffer from a congenital eye disease,
the effect of which becomes more marked with
increasing age. Samples of islander males of
various ages were tested for blindness, with the
results shown below.

Age:         20  35  45  55  70
No. tested:  50  50  50  50  50
No. blind:    6  17  26  37  44

Fit a logistic regression model relating age to
the probability of blindness.
Estimate the age at which the chance of blindness
for a male inhabitant is 50%.

15
GLM for Binary Data
Example 8.2. Solution
The R code for running the GLM is as follows:
  # Fit a GLM to the Age and Eye Disease Data
  kalythos <- data.frame(x = c(20, 35, 45, 55, 70),
                         n = rep(50, 5), y = c(6, 17, 26, 37, 44))
  # add a matrix Ymat to the data frame "kalythos"
  # containing a column of successes and a column of failures
  kalythos$Ymat <- cbind(kalythos$y, kalythos$n - kalythos$y)
  fit <- glm(Ymat ~ x, family = binomial, data = kalythos)
  (summary(fit))
  (vcov(fit))
16
GLM for Binary Data
Example 8.2. Solution (1)
Highly significant relation!
              Estimate  Std. Error  z value  Pr(>|z|)
(Intercept)   -3.53778     0.50232   -7.043  1.88e-12
x              0.08114     0.01082    7.498  6.47e-14
Variance-Covariance Matrix of the Parameter Estimates
              (Intercept)              x
(Intercept)   0.252329917  -0.0051852955
x            -0.005185295   0.0001170988

If y is the number of blind at age x and n the
number tested, then $y \sim \mathrm{Binomial}(n, \pi(x))$, where
$\log[\pi(x)/(1-\pi(x))] = \beta_0 + \beta_1 x$.
Let $\pi(x) = 0.5$. Then we have $0 = \beta_0 + \beta_1 x$,
giving $x = -\beta_0/\beta_1$, which is estimated from the
coefficient estimates as $3.53778/0.08114 \approx 43.60$.
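
The estimate uses the two coefficient estimates, not the entries of the variance-covariance matrix. A minimal sketch of the calculation in R follows. The numbers tested and blind are those given in the example; the ages are taken from the version of this example in the "An Introduction to R" manual and should be treated as an assumption here.

```r
# Kalythos data; the ages in x are an assumption (see the note above)
kalythos <- data.frame(x = c(20, 35, 45, 55, 70),
                       n = rep(50, 5),
                       y = c(6, 17, 26, 37, 44))

# Logistic regression with a (successes, failures) response
fit <- glm(cbind(y, n - y) ~ x, family = binomial, data = kalythos)
b   <- coef(fit)

# Age at which the fitted probability of blindness is 0.5: x = -b0 / b1
age50 <- unname(-b[1] / b[2])
print(age50)
```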

17
GLM for Binary Data
Logistic Regression with Probit Link

The logit link function, i.e., $g(\mu) =
\log[\mu/(1-\mu)]$, in the earlier case can be replaced
by other functions with similar features:
it is a monotonic increasing function of $\mu$,
it maps the values of $\mu \in (0, 1)$ onto the whole
real line $(-\infty, \infty)$.
The inverse of a continuous, strictly increasing cumulative distribution function
(CDF) possesses these main features.

If $g(\mu) = \Phi^{-1}(\mu)$, where $\Phi^{-1}$ denotes the inverse
of the standard normal CDF, then this link
function is called the probit link.
18
GLM for Binary Data
Example 8.2. Contd.
The R statement for fitting the model
with the probit link would be
  fit <- glm(Ymat ~ x, family = binomial(link = probit), data = kalythos)
The output is given below.
              Estimate  Std. Error  z value  Pr(>|z|)
(Intercept)  -2.102270    0.276287   -7.609  2.76e-14
x             0.048147    0.005885    8.181  2.82e-16
The variance-covariance matrix
              (Intercept)              x
(Intercept)   0.076334285  -1.539996e-03
x            -0.001539996   3.463864e-05
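
The logit and probit coefficients are on different scales, so they are best compared through the fitted probabilities. The sketch below plugs the estimates reported for the two fits into plogis() and pnorm(); the grid of ages is a hypothetical choice for illustration.

```r
# Coefficient estimates reported for the two fits of Example 8.2
b_logit  <- c(-3.53778, 0.08114)
b_probit <- c(-2.102270, 0.048147)

age <- seq(20, 70, by = 5)          # hypothetical grid of ages

p_logit  <- plogis(b_logit[1]  + b_logit[2]  * age)   # inverse logit
p_probit <- pnorm(b_probit[1] + b_probit[2] * age)    # inverse probit

print(round(cbind(age, p_logit, p_probit), 3))

# Age at which each fitted probability equals 0.5
age50_logit  <- -b_logit[1]  / b_logit[2]
age50_probit <- -b_probit[1] / b_probit[2]
print(c(logit = age50_logit, probit = age50_probit))
```

Over this range the two links give very similar fitted curves, which is typical unless the probabilities are close to 0 or 1.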
19
GLM for Count Data

Many discrete response variables have counts as
possible outcomes. Examples are
Y = number of parties attended in the past
month, for a sample of students,
Y = number of imperfections on each of a sample
of silicon wafers used in manufacturing computer
chips,
Y = number of customers entering a bank on Monday.

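If $Y_i \sim \mathrm{Poisson}(\mu_i)$, then $E(Y_i) = \mathrm{Var}(Y_i) = \mu_i$,
and the probability function has the form
$f(y_i; \mu_i) = \mu_i^{y_i} e^{-\mu_i} / y_i!$, $y_i = 0, 1, 2, \ldots$,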
20
GLM for Count Data
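which can be written as
$f(y_i; \mu_i) = \exp\{ y_i \log(\mu_i) - \mu_i - \log(y_i!) \}$.
Clearly this is an exponential family with
natural parameter $\theta_i = \log(\mu_i)$ and $b(\theta_i) =
\exp(\theta_i)$, $a(\phi) = 1$, and $c(y_i, \phi) = -\log(y_i!)$.
This leads to a Poisson loglinear model under the
canonical link (the log):
$\log(\mu_i) = x_i^\top \beta$.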
21
GLM for Count Data
Example 8.3. Female Horseshoe Crabs and Their
Satellites http://www.stat.ufl.edu/~aa/intro-cda/appendix.html
In a study of nesting horseshoe crabs, each
female horseshoe crab had a male crab attached to
her in her nest. The study investigated the
factors that affect whether the female crab had
any other males, called satellites, residing
near her. The response variable is the number
of satellites (Sa); the explanatory variables are the
female crab's color (C), spine condition (S),
shell width (W), and weight (Wt). The data are
given at the above web site.
22
GLM for Count Data
Example 8.3. Female Horseshoe Crabs and Their
Satellites
  # Fit a GLM to the Horseshoe Crabs Data
  crab <- read.table("horseshoecrab.txt", header = TRUE)
  y  <- crab[, 4]   # number of satellites
  x1 <- crab[, 1]   # color of the female horseshoe crab
  x2 <- crab[, 2]   # spine condition
  x3 <- crab[, 3]   # shell width of the female horseshoe crab
  x4 <- crab[, 5]   # weight of the female horseshoe crab
  fit <- glm(y ~ x3, family = poisson)
  (summary(fit))
  (vcov(fit))
23
GLM for Count Data
Example 8.3. Female Horseshoe Crabs and Their
Satellites
              Estimate  Std. Error  z value  Pr(>|z|)
(Intercept)   -3.30476     0.54224   -6.095  1.10e-09
x3             0.16405     0.01997    8.216   <2e-16
Variance-covariance matrix
              (Intercept)           x3
(Intercept)    0.29402590  -0.01078952
x3            -0.01078952   0.00039861

Shell width has a highly significant effect on
the number of satellites.
More sophisticated loglinear models can be fitted
using glm() by adding more explanatory variables.
Meaningful explanations can be given to the
estimated model parameters.
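
One such explanation comes from exponentiating the slope: under the log link, a one-unit increase in shell width multiplies the expected number of satellites by exp(slope). The sketch below uses the two estimates reported above; the width value is hypothetical.

```r
b0 <- -3.30476   # intercept estimate reported above
b1 <-  0.16405   # shell-width coefficient reported above

# Multiplicative effect on the expected count of one extra unit of width
rate_ratio <- exp(b1)
print(rate_ratio)

# Fitted mean number of satellites at a hypothetical shell width
width  <- 26
mu_hat <- exp(b0 + b1 * width)
print(mu_hat)
```

So each additional unit of width is associated with roughly an 18% increase in the expected number of satellites.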

24
Statistical Inferences
Summary of GLM Families by R Function glm()
25
Statistical Inferences
The exact definition of the built-in link
functions

Identity Link: $g(\mu) = \mu$
Logit Link: $g(\mu) = \log[\mu/(1-\mu)]$
Probit Link: $g(\mu) = \Phi^{-1}(\mu)$
Log Link: $g(\mu) = \log(\mu)$
Complementary Log-Log: $g(\mu) = \log(-\log(1-\mu))$
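
These definitions can be checked against R's own link objects, which make.link() in the stats package returns by name. The values of mu below are arbitrary illustrative probabilities.

```r
mu    <- c(0.2, 0.5, 0.8)
links <- c("identity", "logit", "probit", "log", "cloglog")

# Evaluate each built-in link function at mu (one column per link)
eta <- sapply(links, function(nm) make.link(nm)$linkfun(mu))

# The same links written out from their definitions
manual <- cbind(identity = mu,
                logit    = log(mu / (1 - mu)),
                probit   = qnorm(mu),
                log      = log(mu),
                cloglog  = log(-log(1 - mu)))

print(round(eta, 3))
links_agree <- all.equal(unname(eta), unname(manual))
print(links_agree)
```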

In glm() of R software, the default link function
is the canonical link where the natural parameter
is modeled by the linear combination of the
predictors.
26
Statistical Inferences

Method of Estimation for GLM
Maximum likelihood method or
Quasi-maximum likelihood method

Inference for Model Parameters
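$100(1-\alpha)\%$ confidence interval for $\beta_j$:
$\hat\beta_j \pm z_{\alpha/2}\, SE(\hat\beta_j)$,
where SE is the square root of the corresponding diagonal
element of the output in VCOV
Level $\alpha$ test for testing $H_0: \beta_j = 0$ rejects the null
hypothesis if
$|\hat\beta_j / SE(\hat\beta_j)| > z_{\alpha/2}$
or, equivalently, if
$[\hat\beta_j / SE(\hat\beta_j)]^2 > \chi^2_{1}(\alpha)$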
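
As a worked example, the interval and test can be computed by hand for the slope of the logit fit in Example 8.2, using the estimate and standard error reported there. The level alpha = 0.05 is an illustrative choice.

```r
est <- 0.08114   # slope estimate from the logit fit in Example 8.2
se  <- 0.01082   # its standard error (square root of the VCOV diagonal)

alpha  <- 0.05
z_crit <- qnorm(1 - alpha / 2)          # upper alpha/2 normal quantile

ci      <- est + c(-1, 1) * z_crit * se # Wald confidence interval
z       <- est / se                     # Wald z statistic
p_value <- 2 * pnorm(-abs(z))           # two-sided p-value
reject  <- abs(z) > z_crit              # same decision as z^2 > qchisq(1 - alpha, 1)

print(ci)
print(c(z = z, p_value = p_value))
```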

27
Statistical Inferences

The Deviance
$L_M$: the maximized log-likelihood for a model of
interest
$L_S$: the maximized log-likelihood for the most
complex model possible, i.e., the model which has
a separate parameter for each observation and
provides a perfect fit to the data. This model is
called the saturated model.
Deviance $= -2(L_M - L_S)$

The deviance is the likelihood-ratio statistic
for comparing the model of interest to the
saturated model. Likelihood-ratio statistic
$= -2$(maximized log-likelihood under the null
hypothesis $-$ maximized log-likelihood under the
alternative model)
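
The definition can be verified numerically. The sketch below fits a Poisson model to a small made-up data set (the counts are illustrative only), computes the two maximized log-likelihoods directly, and compares the result with what deviance() reports.

```r
# Made-up count data for illustration
x <- 1:9
y <- c(2, 3, 6, 7, 8, 9, 10, 12, 15)

fit <- glm(y ~ x, family = poisson)

# L_M: maximized log-likelihood of the fitted model
L_M <- as.numeric(logLik(fit))

# L_S: saturated model, one mean per observation, so each fitted mean equals y
L_S <- sum(dpois(y, lambda = y, log = TRUE))

dev_by_hand <- -2 * (L_M - L_S)
print(c(by_hand = dev_by_hand, from_glm = deviance(fit)))
```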
28
Model Checking
glm() has many generic functions for extracting
information
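
A few of the commonly used extractor functions are sketched below on a small made-up Poisson data set (the counts are illustrative only). All functions shown are part of base R's stats package.

```r
# Made-up count data for illustration
x <- 1:9
y <- c(2, 3, 6, 7, 8, 9, 10, 12, 15)
fit <- glm(y ~ x, family = poisson)

coef(fit)                             # parameter estimates
vcov(fit)                             # their variance-covariance matrix
deviance(fit)                         # residual deviance
df.residual(fit)                      # residual degrees of freedom
AIC(fit)                              # Akaike information criterion
fitted(fit)                           # fitted means on the response scale
dres <- residuals(fit, type = "deviance")   # deviance residuals

# Predicted mean at a new value of x
pred <- predict(fit, newdata = data.frame(x = 10), type = "response")

# Analysis of deviance table with chi-squared tests
anova(fit, test = "Chisq")
```

The squared deviance residuals sum to the residual deviance, which ties these outputs back to the definition of the deviance.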
29
Model Checking
For more details on glm(), see the R help page:
?glm
30
Model Checking
Example 8.3. Contd.: fit a larger model
R Command
  fit <- glm(y ~ x1 + x2 + x3 + x4, family = poisson)
Deviance Residuals
    Min       1Q   Median       3Q      Max
-3.0126  -1.8846  -0.5406   0.9448   4.9602
               Estimate  Std. Error  z value  Pr(>|z|)
(Intercept)  -0.3435447   0.9684204   -0.355   0.72278
x1           -0.1849325   0.0665236   -2.780   0.00544 **
x2            0.0399764   0.0568062    0.704   0.48160
x3            0.0275251   0.0479425    0.574   0.56588
x4            0.0004725   0.0001649    2.865   0.00417 **
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
31
Model Checking
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 632.79 on 172 degrees of freedom
Residual deviance: 551.85 on 168 degrees of freedom
AIC: 917.15
Number of Fisher Scoring iterations: 6
Variance-covariance matrix
             (Intercept)          x1          x2          x3          x4
(Intercept)     0.937838  -0.0211481   0.0004346  -0.0438027   0.0001193
x1             -0.021148   0.0044254  -0.0013256   0.0004440  -0.0000008
x2              0.000435  -0.0013256   0.0032269  -0.0003295   0.0000019
x3             -0.043803   0.0004440  -0.0003295   0.0022985  -0.0000072
x4              0.000119  -0.0000008   0.0000019  -0.0000072   0.0000000
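
The null and residual deviances in this output give a likelihood-ratio test of the four-predictor model against the intercept-only model. A minimal sketch using the reported values:

```r
# Deviances and degrees of freedom reported in the output above
null_dev <- 632.79; null_df <- 172
res_dev  <- 551.85; res_df  <- 168

# Likelihood-ratio statistic: drop in deviance from the null model
lr_stat <- null_dev - res_dev
lr_df   <- null_df - res_df

# Upper-tail chi-squared p-value
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)
print(c(lr_stat = lr_stat, df = lr_df, p_value = p_value))
```

The drop in deviance is large relative to its 4 degrees of freedom, so the predictors jointly improve on the null model, even though the residual deviance remains large compared with its degrees of freedom.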
