Generalized Linear Models and some of their extensions using the library glmtoolbox

Luis Hernando Vanegas and Luz Marina Rondón
Universidad Nacional de Colombia, Sede Bogotá
Facultad de Ciencias, Departamento de Estadística
2024
“All models are wrong, but some are useful”
George E. P. Box, 1979
Outline

1 Normal Linear Models
    Motivation
        Richness of the species
        Gasoline yield
    Model Formulation
    Parameter Estimation
        Ordinary Least Squares Estimator
        Maximum Likelihood Estimator
    Goodness-of-fit
        R-squared and adjusted R-squared
        Predicted R-squared
        Generalized Information Criterion
    Statistical Inference
        Confidence intervals
        Hypothesis testing
    Examples
        Richness of the species
        Gasoline yield
    Exercises
    Variable selection
        Hybrid forward stepwise
        Hybrid backward stepwise
    Diagnostic methods
        Residual analysis
        Normal probability plot with simulated envelope
        Influence or Sensitivity analysis
        Leverage points
        Testing the constant variance assumption
        Box-Tidwell transformation
    Lognormal model
        Example 3: Gasoline yield
        Example 4: Lung capacity
2 Generalized Linear Models
    Motivation
        Liver cancer in female mice
        Insecticides
        Burn injury study
        Dilution assay
        Skin cancer in women
        Advertising
        Atlantic Croakers
        Hardened steel
        Gasoline yield
    Model formulation
        Linear exponential family
        Binomial distribution
        Poisson distribution
        Normal distribution
        Gamma distribution
        Inverse-Gaussian distribution
        Exercises
    Parameter estimation
        Maximum likelihood (ML)
        Deviance
        Newton-Raphson algorithm
        Fisher's Scoring algorithm
        Canonical link functions
    Goodness-of-fit
        R-squared and adjusted R-squared
        Generalized Information Criterion
    Statistical inference
        Asymptotic theory
        Hypothesis testing
        Confidence intervals
        Estimation of the dispersion parameter
    Parameter interpretation
        Binomial regression
        Poisson regression
        Normal regression
        Gamma regression
        Lognormal regression
    Genesis of link functions in binary response models
    Receiver Operating Characteristic Curve
    Variable selection
        Hybrid forward stepwise
        Hybrid backward stepwise
        Best subset selection
    Exercises
    Diagnostic methods
        Residual analysis
        Normal probability plot with simulated envelope
        Influence or sensitivity analysis
        Testing the constant dispersion parameter assumption
        Box-Tidwell transformation
    Quasi-likelihood models
3 Overdispersion in Poisson models
    Motivation
        Self-diagnosed ear infections in swimmers
        The Origins of Sex Differences in Science
    Negative Binomial I model
    Negative Binomial model
    Exercises
4 Overdispersion in binomial models
    Motivation
        Cell differentiation
        Teratogenic effects of PHT and TCPO
    Beta Binomial model
    Random-clumped Binomial model
    Exercises
5 Excess of zeros in count data
    Zero-Inflated Poisson model
    Zero-Modified Poisson model
    Zero-Inflated Negative Binomial model
    Zero-Modified Negative Binomial model
    Example: Males attached to female horseshoe crabs
    Exercises
6 Cluster Correlated Data
    Motivation
        Effect of the ozone on the tree growth
        Treatment of severe postnatal depression
    Generalized Estimating Equations
        Example 1: Effect of the ozone on the tree growth
7 Appendix
    A. Linear algebra
    B. Derivatives
    C. Taylor series
    D. Probability
    E. Multivariate normal distribution
    F. Nonlinear optimization
8 References
Normal Linear Models
Richness of the species

The analysis of these data, available in the object richness of the library glmtoolbox, aims to explain the differences between plots as related to species richness (represented by a count of the number of plant species (richness)) using the differences between plots with respect to biomass (Biomass) and soil pH level: low (pH="low"), medium (pH="mid") and high (pH="high").

[Figure: number of plant species versus biomass, by soil pH level (low, mid, high).]
Gasoline yield

The analysis of the data in the object Auto of the library ISLR is aimed at explaining the differences between the vehicles with respect to their gasoline yield (mpg), in miles per gallon, using other vehicle characteristics such as their horsepower (horsepower), their weight (weight), in kilograms, and their origin (origin: America, Europe or Japan).

[Figure: gasoline yield versus horsepower, with point size/shade indicating weight (low, mid, high).]
Normal Linear Model

Let $y_i$ for $i=1,\ldots,n$ be the interest variable measured on $n$ individuals, the values of which are assumed to be realizations of independent random variables denoted here by $Y_i$ for $i=1,\ldots,n$. The random variables $Y_i$ for $i=1,\ldots,n$ are assumed to satisfy the following condition:
$$Y_i = \mu_i + \epsilon_i,$$
where $\epsilon_i \overset{ind}{\sim} \text{Normal}(0,\sigma^2)$ are random errors and $\mu_i = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip} = x_i^\top\beta$, in which $x_i = (1, x_{i1},\ldots,x_{ip})^\top$ includes the (known) values of $p$ explanatory variables, quantitative and/or qualitative, measured on the $i$-th individual, and $\beta=(\beta_0,\ldots,\beta_p)^\top$ is the (unknown) vector of interest parameters. Thus, the observed value of the interest variable, $y$, is assumed to be the result of a "signal", represented by $\mu$, that is additively contaminated by a normally distributed random "noise", represented by the random error $\epsilon$.

Goal: To explain the differences between $y_1,\ldots,y_n$ by using the differences between $x_1,\ldots,x_n$.
Normal Linear Model ($p=1$ and $\sigma^2=0.4$)

[Figure: simulated responses $y$ at $x = 0, 1, 2, \ldots, 10$ scattered around the straight line $\mu = 1.5 + 1.4\,x$, illustrating the normal "noise" around the linear "signal".]
Normal Linear Model

The above model may be summarized as follows:
$$Y_i = \mu_i + \epsilon_i,\qquad \mu_i = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip} = x_i^\top\beta,\qquad \epsilon_i \overset{ind}{\sim} \text{Normal}(0,\sigma^2) \;\Longrightarrow\; Y_i \overset{ind}{\sim} \text{Normal}(\mu_i,\sigma^2),$$
or, equivalently,
$$Y_i = \mu_i + \epsilon_i,\qquad \mu_i = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip} = x_i^\top\beta,\qquad \epsilon \sim \text{Normal}_n(0,\sigma^2 I) \;\Longrightarrow\; Y \sim \text{Normal}_n(\mu,\sigma^2 I),$$
where $Y=(Y_1,\ldots,Y_n)^\top$, $\mu=(\mu_1,\ldots,\mu_n)^\top=X\beta$, $\epsilon=(\epsilon_1,\ldots,\epsilon_n)^\top$ and $X=(x_1,\ldots,x_n)^\top$. In matrix form, the model becomes
$$Y = \mu + \epsilon.$$
This model is known as the normal linear model because: (i) the "signal" ($\mu_i$) can be written as a linear combination of the interest parameters ($\beta_0,\beta_1,\ldots,\beta_p$); and (ii) the random "noise" ($\epsilon_i$) is assumed to follow the normal or Gaussian distribution.
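To make the formulation concrete, the following R sketch simulates data from this model using the illustrative values $\beta_0=1.5$, $\beta_1=1.4$ and $\sigma^2=0.4$ from the figure above (the seed and sample size are arbitrary) and recovers the parameters with lm():

# Minimal simulation sketch; beta0=1.5, beta1=1.4 and sigma^2=0.4 are the
# illustrative values used in the figure above.
set.seed(123)
n <- 100
x <- runif(n, 0, 10)
mu <- 1.5 + 1.4*x                          # the linear "signal"
y <- mu + rnorm(n, mean=0, sd=sqrt(0.4))   # additive Gaussian "noise"
fit <- lm(y ~ x)
coef(fit)               # estimates close to (1.5, 1.4)
summary(fit)$sigma^2    # estimate of sigma^2, close to 0.4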
Ordinary Least Squares (OLS)

Parameter estimation involves selecting a vector from the space $\mathbb{R}^{p+1}$, denoted here by $\hat\beta$, such that the linear combinations given by $X\hat\beta$ provide the "best" representation of the vector of observed responses $y=(y_1,\ldots,y_n)^\top$. Based on intuition, $\hat\beta$ may be selected as the $\beta\in\mathbb{R}^{p+1}$ such that
$$Q_r(\beta) = \frac{1}{r}\sum_{i=1}^{n}\left|\frac{y_i - x_i^\top\beta}{\sigma}\right|^{r},$$
for some $r>0$, is minimum.

If the partial derivatives of $Q_r(\beta)$ are continuous, then the minimum of $Q_r(\beta)$ can be obtained by solving the following equations:
$$\begin{pmatrix} \partial Q_r(\beta)/\partial\beta_0 \\ \partial Q_r(\beta)/\partial\beta_1 \\ \vdots \\ \partial Q_r(\beta)/\partial\beta_p \end{pmatrix} = -\begin{pmatrix} \frac{1}{\sigma^r}\sum_{i=1}^{n}(y_i-x_i^\top\beta)^{r-1} \\ \frac{1}{\sigma^r}\sum_{i=1}^{n}x_{i1}(y_i-x_i^\top\beta)^{r-1} \\ \vdots \\ \frac{1}{\sigma^r}\sum_{i=1}^{n}x_{ip}(y_i-x_i^\top\beta)^{r-1} \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \\ \vdots \\ 0 \end{pmatrix}$$
If $r > 1$, then the partial derivatives of $Q_r(\beta)$ are continuous.
Ordinary Least Squares (OLS)

If $r=2$, then the above system of equations is linear and $\hat\beta$ becomes the Ordinary Least Squares (OLS) estimator, given by
$$\hat\beta = \underset{\beta\in\mathbb{R}^{p+1}}{\text{argmin}}\; Q_2(\beta),$$
where
$$Q_2(\beta) = \frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i-x_i^\top\beta)^2 = \frac{1}{2\sigma^2}(y-X\beta)^\top(y-X\beta) = \frac{1}{2\sigma^2}\left[y^\top y - 2y^\top X\beta + \beta^\top X^\top X\beta\right].$$
The gradient vector of $Q_2(\beta)$ is given by (see Appendix B-1,2)
$$U(\beta) = \begin{pmatrix} \partial Q_2(\beta)/\partial\beta_0 \\ \partial Q_2(\beta)/\partial\beta_1 \\ \vdots \\ \partial Q_2(\beta)/\partial\beta_p \end{pmatrix} = -\begin{pmatrix} \frac{1}{\sigma^2}\sum_{i=1}^{n}(y_i-x_i^\top\beta) \\ \frac{1}{\sigma^2}\sum_{i=1}^{n}x_{i1}(y_i-x_i^\top\beta) \\ \vdots \\ \frac{1}{\sigma^2}\sum_{i=1}^{n}x_{ip}(y_i-x_i^\top\beta) \end{pmatrix} = -\frac{1}{\sigma^2}X^\top(y-X\beta)$$
Ordinary Least Squares (OLS)

[Figure: scatter plot of $(x_i, y_i)$ with the fitted line $\hat\mu = \hat\beta_0 + \hat\beta_1 x$ and dotted vertical segments from each point to the line.]

The OLS estimator selects the straight line ($\hat\beta_0 + \hat\beta_1 x$) that minimizes the sum of the squares of the dotted segment lengths.
Ordinary Least Squares (OLS)

The Hessian matrix of $Q_2(\beta)$ is given by (see Appendix B-3)
$$J(\beta) = \begin{pmatrix} \frac{\partial^2 Q_2(\beta)}{\partial\beta_0^2} & \frac{\partial^2 Q_2(\beta)}{\partial\beta_0\partial\beta_1} & \ldots & \frac{\partial^2 Q_2(\beta)}{\partial\beta_0\partial\beta_p} \\ \frac{\partial^2 Q_2(\beta)}{\partial\beta_1\partial\beta_0} & \frac{\partial^2 Q_2(\beta)}{\partial\beta_1^2} & \ldots & \frac{\partial^2 Q_2(\beta)}{\partial\beta_1\partial\beta_p} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 Q_2(\beta)}{\partial\beta_p\partial\beta_0} & \frac{\partial^2 Q_2(\beta)}{\partial\beta_p\partial\beta_1} & \ldots & \frac{\partial^2 Q_2(\beta)}{\partial\beta_p^2} \end{pmatrix} = \frac{1}{\sigma^2}X^\top X$$
If $X$ has linearly independent columns, then $J(\beta)$ is a positive definite and non-singular matrix (see Appendix A-6), implying that $Q_2(\beta)$ is a strictly convex function (see Appendix F-2). Therefore, if the system of equations $U(\beta)=0$ has a solution, then it is the unique global minimum of $Q_2(\beta)$. Then, the OLS estimator of $\beta$ reduces to the solution of $U(\hat\beta) = -\frac{1}{\sigma^2}X^\top(y-X\hat\beta) = 0$, that is,
$$\hat\beta = (X^\top X)^{-1}X^\top y.$$
A necessary condition for $X$'s columns to be linearly independent is $p+1 \leq n$, that is, $X$ has no more columns than rows.
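A minimal R sketch of this closed form, assuming a simulated design matrix X with linearly independent columns; the manual computation reproduces coef() from lm():

# Closed-form OLS computed directly; X and y are simulated for illustration.
set.seed(1)
n <- 50
X <- cbind(1, rnorm(n), rnorm(n))                   # intercept plus p = 2 covariates
y <- drop(X %*% c(2, -1, 0.5)) + rnorm(n)
beta_hat <- solve(crossprod(X), crossprod(X, y))    # (X'X)^{-1} X'y
cbind(manual = beta_hat, lm = coef(lm(y ~ X - 1)))  # identical up to numerics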
Ordinary Least Squares (OLS)

In conclusion, if the columns of $X$ are linearly independent, then the OLS estimator of $\beta$ satisfies the following:

(1) It always exists.
(2) It is unique.
(3) It has closed form, given by $\hat\beta = (X^\top X)^{-1}X^\top y$.
(4) It is linear with respect to $y$, that is, it may be written as $Ay$, where $A$ is a $(p+1)\times n$ matrix which does not depend on $y$. Therefore, the $(j+1)$-th element of $\hat\beta$, denoted here by $\hat\beta_j$, may be written as $\hat\beta_j = a_j^\top y = a_{j1}y_1 + \ldots + a_{jn}y_n$, where $a_j = (a_{j1},\ldots,a_{jn})^\top$ represents the $(j+1)$-th row of $A$.
(5) The estimation of the mean vector $\mu = X\beta$ is also linear with respect to $y$, that is, it may be written as $\hat\mu = X\hat\beta = Hy$, where $H = X(X^\top X)^{-1}X^\top$ is the orthogonal projection matrix. Therefore, $\hat\mu_i = x_i^\top\hat\beta = h_i^\top y = h_{i1}y_1 + \ldots + h_{in}y_n$, where $h_i = (h_{i1},\ldots,h_{in})^\top$ represents the $i$-th row of $H$.
Exercises

Consider the following simple linear model:
$$Y_i = \mu_i + \epsilon_i,\qquad \mu_i = \beta_0 + \beta_1 x_i,\qquad \epsilon_i \overset{ind}{\sim} \text{Normal}(0,\sigma^2).$$
Show that
$$\hat\beta_0 = \bar{y} - \hat\beta_1\bar{x}\qquad\text{and}\qquad \hat\beta_1 = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^{n}(x_i-\bar{x})^2} = \frac{S_y}{S_x}\,r_{xy},$$
where $(n-1)S_x^2 = \sum_{i=1}^{n}(x_i-\bar{x})^2$, $n\bar{x} = \sum_{i=1}^{n}x_i$, $(n-1)S_y^2 = \sum_{i=1}^{n}(y_i-\bar{y})^2$, $n\bar{y} = \sum_{i=1}^{n}y_i$, and $r_{xy}$ is Pearson's linear correlation coefficient.
Orthogonal projection matrix (H)

[Figure: scatter plot of the observed points $(x_i, y_i)$ and the fitted points $(x_i, \hat\beta_0 + \hat\beta_1 x_i)$ on the fitted line.]

The matrix $H$ orthogonally projects $y_1,\ldots,y_n$ onto the line, plane, or hyperplane given by $\hat\mu = \hat\beta_0 + \hat\beta_1 x_1 + \hat\beta_2 x_2 + \ldots + \hat\beta_p x_p$.
Orthogonal projection matrix (H)

$h_{ij} = x_i^\top(X^\top X)^{-1}x_j$.

$H$ is a symmetric matrix, that is, $H^\top = H$, so the $i$-th row and the $i$-th column of $H$ coincide. Therefore, $h_{ij} = h_{ji}$ for all $i,j = 1,\ldots,n$.

$H$ is an idempotent matrix, that is, $H^2 = H$, so $h_{ij} = h_i^\top h_j$ for all $i,j = 1,\ldots,n$. For example,
$$h_{ii} = h_i^\top h_i = \sum_{j=1}^{n}h_{ij}^2 = h_{ii}^2 + \sum_{j:\,j\neq i}h_{ij}^2 \;\Longrightarrow\; h_{ii} - h_{ii}^2 = h_{ii}(1-h_{ii}) = \sum_{j:\,j\neq i}h_{ij}^2 \geq 0 \;\Longrightarrow\; h_{ii}\in[0,1].$$

$\partial\hat\mu_i/\partial y_j = h_{ij}$ and $\partial\hat\mu_j/\partial y_i = h_{ji}$. Therefore, $\partial\hat\mu_i/\partial y_j = \partial\hat\mu_j/\partial y_i$.

$\text{rank}(H) = \text{trace}(H) = p+1$ (see Appendix A-1,2). Consequently, if $p+1 < n$ then $H$ is a singular matrix.
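These properties can be checked numerically; a sketch with an arbitrary simulated design matrix:

# Numerical check of the properties of H for a simulated design matrix.
set.seed(2)
n <- 30
X <- cbind(1, rnorm(n), rnorm(n))
H <- X %*% solve(crossprod(X)) %*% t(X)   # H = X (X'X)^{-1} X'
all.equal(H, t(H))        # symmetry: H' = H
all.equal(H, H %*% H)     # idempotency: H^2 = H
sum(diag(H))              # trace(H) = p + 1 = 3
range(diag(H))            # every h_ii lies in [0, 1]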
Maximum Likelihood (ML)

This method consists of maximizing, with respect to $\beta$, the chance of observing the data in the sample, that is,
$$\hat\beta = \underset{\beta\in\mathbb{R}^{p+1}}{\text{argmax}}\; L(\beta),$$
where $L(\beta)$ is the likelihood function of $\beta$, which corresponds to the joint probability density function of $Y_1,\ldots,Y_n$, viewed as a function of $\beta$ rather than as a function of $y_1,\ldots,y_n$. Thus,
$$L(\beta) = \prod_{i=1}^{n}\frac{1}{(2\pi\sigma^2)^{\frac{1}{2}}}\exp\left[-\frac{1}{2}\left(\frac{y_i-x_i^\top\beta}{\sigma}\right)^2\right] = (2\pi\sigma^2)^{-\frac{n}{2}}\exp\left[-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i-x_i^\top\beta)^2\right] = (2\pi\sigma^2)^{-\frac{n}{2}}\exp\left[-Q_2(\beta)\right].$$
The function $\log(x)$ is strictly increasing, so any $\beta$ that maximizes $\log[L(\beta)]$ will also maximize $L(\beta)$. In addition, the computation of the gradient of $\log[L(\beta)]$ is easier than that of $L(\beta)$, as the latter is a product of $n$ terms while the former is a sum of $n$ terms. Thus, a more convenient way to express $\hat\beta$ is the following:
$$\hat\beta = \underset{\beta\in\mathbb{R}^{p+1}}{\text{argmax}}\; \ell(\beta),$$
where
$$\ell(\beta) = \log[L(\beta)] = -\frac{n}{2}\log(2\pi\sigma^2) - Q_2(\beta)$$
is the log-likelihood function of $\beta$.

Note that maximizing $\ell(\beta)$ with respect to $\beta$ is equivalent to maximizing $-Q_2(\beta)$, as $-\frac{n}{2}\log(2\pi\sigma^2)$ is a constant. In addition, maximizing $-Q_2(\beta)$ is equivalent to minimizing $Q_2(\beta)$, as $-1$ is a negative constant. Therefore, the ML estimator of $\beta$ in the model where $\epsilon_i \overset{ind}{\sim} \text{Normal}(0,\sigma^2)$ coincides with the OLS estimator. For this reason, the normal or Gaussian distribution is the most widely used to describe the behavior of the random errors.
Statistical properties of $\hat\beta$

If the model is correctly specified, then (see Appendix E-3)
$$\hat\beta \sim \text{Normal}_{p+1}(\text{E}(\hat\beta), \text{Var}(\hat\beta)),$$
where
$$\text{E}(\hat\beta) = \text{E}[(X^\top X)^{-1}X^\top Y] = (X^\top X)^{-1}X^\top \text{E}(Y) = \beta$$
and
$$\text{Var}(\hat\beta) = \text{Var}[(X^\top X)^{-1}X^\top Y] = (X^\top X)^{-1}X^\top\text{Var}(Y)X(X^\top X)^{-1} = \sigma^2(X^\top X)^{-1}.$$
Therefore, $\hat\beta \sim \text{Normal}_{p+1}(\beta, \sigma^2(X^\top X)^{-1})$.

In addition, $\hat\beta$ is an efficient estimator as it achieves the Cramér-Rao bound, that is, the variance-covariance matrix of $\hat\beta$ coincides with the inverse of the Fisher information matrix, where the latter is given by
$$-\text{E}\left[\frac{\partial^2\ell(\beta)}{\partial\beta\,\partial\beta^\top}\right] = \frac{1}{\sigma^2}X^\top X.$$
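A small Monte Carlo sketch of these two properties, under assumed values $\beta = (1, 3)^\top$ and $\sigma = 2$:

# Monte Carlo check of E(beta_hat) = beta and Var(beta_hat) = sigma^2 (X'X)^{-1};
# beta = (1, 3) and sigma = 2 are assumed illustrative values.
set.seed(3)
n <- 40; sigma <- 2
X <- cbind(1, runif(n))
beta <- c(1, 3)
est <- replicate(5000, {
  y <- drop(X %*% beta) + rnorm(n, sd = sigma)
  c(solve(crossprod(X), crossprod(X, y)))
})
rowMeans(est)                  # approximately beta: unbiasedness
var(t(est))                    # approximately the theoretical matrix below
sigma^2 * solve(crossprod(X))  # sigma^2 (X'X)^{-1}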
Exercises

(1) The (generalized) normal linear model may be written as
$$Y_i = \mu_i + \epsilon_i,\qquad \mu_i = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip} = x_i^\top\beta,\qquad \epsilon_i \overset{ind}{\sim} \text{GN}(0,\sigma^2,\tau) \;\Longrightarrow\; Y_i \overset{ind}{\sim} \text{GN}(\mu_i,\sigma^2,\tau),$$
where, unlike $\beta$ and $\sigma^2$, $\tau>0$ is known. The probability density function of $Y_i \sim \text{GN}(\mu_i,\sigma^2,\tau)$ is as follows:
$$f_{Y_i}(y;\mu_i,\sigma^2,\tau) = \frac{\tau^{1-\frac{1}{\tau}}}{2\sigma\,\Gamma\!\left(\frac{1}{\tau}\right)}\exp\left[-\frac{1}{\tau}\left|\frac{y-\mu_i}{\sigma}\right|^{\tau}\right],\quad y\in\mathbb{R}.$$
If $\tau=2$ then $\text{GN}(0,\sigma^2,\tau)$ reduces to $\text{Normal}(0,\sigma^2)$. Show that:
(a) the distribution of $Y_i$ is symmetric about $\mu_i$;
(b) the mode of the distribution of $Y_i$ is $\mu_i$;
(c) the variance of $Y_i$ does not depend on $i$, that is, it is constant;
(d) the ML estimator of $\beta$ reduces to $\hat\beta = \underset{\beta\in\mathbb{R}^{p+1}}{\text{argmin}}\; Q_\tau(\beta)$.
Exercises

(2) Consider the following linear model:
$$Y_i = \mu_i + \epsilon_i,\qquad \mu_i = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip} = x_i^\top\beta,\qquad \epsilon_i \overset{ind}{\sim} \text{Normal}(0,\sigma^2/\omega_i),$$
where $\omega_i > 0$ for $i=1,\ldots,n$ are known weights. Show that:
(a) the estimator of $\beta$ by the least squares method, better known as the Weighted Least Squares (WLS) estimator, minimizes $(y-X\beta)^\top\Omega(y-X\beta) = \sum_{i=1}^{n}\omega_i(y_i-x_i^\top\beta)^2$ and it can be written as $\hat\beta = (X^\top\Omega X)^{-1}X^\top\Omega y$, where $\Omega = \text{diag}\{\omega_1,\ldots,\omega_n\}$;
(b) the ML estimator of $\beta$ coincides with the WLS estimator;
(c) if the model is correctly specified, then the distribution of $\hat\beta$ is $\text{Normal}_{p+1}(\beta, \sigma^2(X^\top\Omega X)^{-1})$.
R-squared

Interest Model (model under investigation):
$$Y_i = \mu_i + \epsilon_i,\qquad \mu_i = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip},\qquad \epsilon_i \overset{ind}{\sim} \text{Normal}(0,\sigma^2),$$
$$X = \begin{pmatrix} 1 & x_{11} & \ldots & x_{1p} \\ 1 & x_{21} & \ldots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & \ldots & x_{np} \end{pmatrix},\qquad \hat\beta = \underset{\beta\in\mathbb{R}^{p+1}}{\text{argmin}}\; Q_2(\beta),\qquad\text{where}\quad Q_2(\beta) = \frac{1}{2\sigma^2}\sum_{i=1}^{n}\Bigl(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Bigr)^2.$$

Reference Model (model with the poorest goodness-of-fit):
$$Y_i = \mu_i + \epsilon_i,\qquad \mu_i = \beta,\qquad \epsilon_i \overset{ind}{\sim} \text{Normal}(0,\sigma^2),$$
$$X = \begin{pmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{pmatrix},\qquad \hat\beta = \underset{\beta\in\mathbb{R}}{\text{argmin}}\; Q_2^{0}(\beta),\qquad\text{where}\quad Q_2^{0}(\beta) = \frac{1}{2\sigma^2}\sum_{i=1}^{n}\bigl(y_i - \beta\bigr)^2.$$

Note that $Q_2(\hat\beta) = \min_{\beta\in\mathbb{R}^{p+1}} Q_2(\beta)$ and $Q_2^{0}(\hat\beta) = \min_{\beta\in\mathbb{R}^{p+1}} Q_2(\beta)$ restricted to $\beta_1 = \ldots = \beta_p = 0$.
R-squared

$$R^2 = 1 - \frac{Q_2(\hat\beta)}{Q_2^{0}(\hat\beta)} = 1 - \frac{\sum_{i=1}^{n}\bigl(y_i - x_i^\top\hat\beta\bigr)^2}{\sum_{i=1}^{n}\bigl(y_i - \bar{y}\bigr)^2}$$

The $R^2$ satisfies the following:

$R^2 \in [0,1]$.

If $X = 1_n$ then $R^2 = 0$, where $1_n$ represents a column vector of 1's of size $n$.

If $X = I_n$ then $R^2 = 1$, where $I_n$ represents the identity matrix of size $n$.

A hundred times its value represents the percentage reduction in Residual Sum of Squares (RSS) caused by the inclusion of $p$ parameters in the model.

It coincides with $r^2(y,\hat\mu)$, where $-1 \leq r(y,\hat\mu) \leq 1$ represents the Pearson's linear correlation coefficient computed between $y = (y_1,\ldots,y_n)$ and $\hat\mu = (\hat\mu_1,\ldots,\hat\mu_n)$.
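Both characterizations are easy to verify in R; a sketch applicable to any lm fit with an intercept (the model and data are whatever fit holds, e.g. the richness fit used later):

# R^2 from its definition and as the squared correlation r^2(y, mu_hat).
R2_both <- function(fit) {
  y <- model.response(model.frame(fit))
  c(definition  = 1 - sum(resid(fit)^2)/sum((y - mean(y))^2),
    correlation = cor(y, fitted(fit))^2,
    lm          = summary(fit)$r.squared)   # the three values coincide
}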
R-squared

The $R^2$ can be used to compare nested models, that is, models where the first model's $X$ matrix coincides with the first $(r+1)$ columns of the second model's $X$ matrix, in which $r < p$.

Model I:
$$Y_i = \mu_i + \epsilon_i,\qquad \mu_i = \beta_0 + \sum_{j=1}^{r}\beta_j x_{ij},\qquad \epsilon_i \overset{ind}{\sim} \text{Normal}(0,\sigma^2),$$
$$\hat\beta^{\text{I}} = \underset{\beta\in\mathbb{R}^{r+1}}{\text{argmin}}\; Q_2^{\text{I}}(\beta),\qquad\text{where}\quad Q_2^{\text{I}}(\beta) \propto \sum_{i=1}^{n}\Bigl(y_i - \beta_0 - \sum_{j=1}^{r}\beta_j x_{ij}\Bigr)^2.$$

Model II:
$$Y_i = \mu_i + \epsilon_i,\qquad \mu_i = \beta_0 + \sum_{j=1}^{r}\beta_j x_{ij} + \sum_{j=r+1}^{p}\beta_j x_{ij},\qquad \epsilon_i \overset{ind}{\sim} \text{Normal}(0,\sigma^2),$$
$$\hat\beta^{\text{II}} = \underset{\beta\in\mathbb{R}^{p+1}}{\text{argmin}}\; Q_2^{\text{II}}(\beta),\qquad\text{where}\quad Q_2^{\text{II}}(\beta) \propto \sum_{i=1}^{n}\Bigl(y_i - \beta_0 - \sum_{j=1}^{r}\beta_j x_{ij} - \sum_{j=r+1}^{p}\beta_j x_{ij}\Bigr)^2.$$

Note that $Q_2^{\text{II}}(\hat\beta^{\text{II}}) = \min_{\beta\in\mathbb{R}^{p+1}} Q_2^{\text{II}}(\beta)$, and $Q_2^{\text{I}}(\hat\beta^{\text{I}}) = \min_{\beta\in\mathbb{R}^{p+1}} Q_2^{\text{II}}(\beta)$ restricted to $\beta_{r+1} = \ldots = \beta_p = 0$.
R-squared

Therefore, $Q_2^{\text{II}}(\hat\beta^{\text{II}}) \leq Q_2^{\text{I}}(\hat\beta^{\text{I}})$. As a result, $R^2_{\text{II}}$ is ALWAYS greater than or equal to $R^2_{\text{I}}$, that is, according to $R^2$, model II is ALWAYS better than model I, even when the additional $(p-r)$ covariates are not helpful in explaining the differences between $y_1,\ldots,y_n$.

The following expression describes an adjusted version of $R^2$ that measures not just goodness-of-fit, but also penalizes model complexity:
$$R^2_{\text{adj}} = 1 - \frac{\sum_{i=1}^{n}\bigl(y_i - x_i^\top\hat\beta\bigr)^2/(n-p-1)}{\sum_{i=1}^{n}\bigl(y_i - \bar{y}\bigr)^2/(n-1)} = 1 - \frac{n-1}{n-p-1}(1-R^2),$$
where $(n-p-1)$ and $(n-1)$ are the degrees-of-freedom of the Interest Model and the Reference Model, respectively.
Predicted R²

Note that $\hat\beta = (X^\top X)^{-1}X^\top y$ also may be written as follows:
$$\hat\beta = \left(\sum_{l=1}^{n}x_l x_l^\top\right)^{-1}\sum_{l=1}^{n}x_l y_l.$$
Hence, the estimate of $\beta$ derived from the dataset with the $i$-th observation excluded, denoted here by $\hat\beta_{(i)}$, is (see, for instance, Cook and Weisberg [1982, page 110])
$$\begin{aligned}
\hat\beta_{(i)} &= \left(\sum_{l=1}^{n}x_l x_l^\top - x_i x_i^\top\right)^{-1}\left(\sum_{l=1}^{n}x_l y_l - x_i y_i\right) = \left(X^\top X - x_i x_i^\top\right)^{-1}\left(X^\top y - x_i y_i\right)\\
&= \left[(X^\top X)^{-1} + \frac{(X^\top X)^{-1}x_i x_i^\top (X^\top X)^{-1}}{1-h_{ii}}\right]\left(X^\top y - x_i y_i\right)\qquad\text{(see Appendix A-3)}\\
&= \hat\beta - \frac{y_i - x_i^\top\hat\beta}{1-h_{ii}}(X^\top X)^{-1}x_i,
\end{aligned}$$
where $h_{ii} = x_i^\top(X^\top X)^{-1}x_i$ represents the $(i,i)$-th entry of $H$.
Predicted R²

Therefore, the predicted residual is
$$y_i - x_i^\top\hat\beta_{(i)} = y_i - x_i^\top\hat\beta + \frac{y_i - x_i^\top\hat\beta}{1-h_{ii}}\,x_i^\top(X^\top X)^{-1}x_i = \frac{y_i - x_i^\top\hat\beta}{1-h_{ii}}.$$
Thus, the predicted $R^2$ may be defined as follows:
$$R^2_{\text{pred}} = 1 - \sum_{i=1}^{n}\Bigl(y_i - x_i^\top\hat\beta_{(i)}\Bigr)^2\Big/\sum_{i=1}^{n}\Bigl(y_i - \bar{y}_{(i)}\Bigr)^2,$$
where $\sum_{i=1}^{n}\bigl(y_i - x_i^\top\hat\beta_{(i)}\bigr)^2$ and $\sum_{i=1}^{n}\bigl(y_i - \bar{y}_{(i)}\bigr)^2$ are the Predicted Residual Error Sum of Squares (PRESS) (Allen [1971]) of the interest and reference models, respectively. If the values of $(y_i - x_i^\top\hat\beta_{(i)})$ and $(y_i - \bar{y}_{(i)})$ are replaced in the expression above, then $R^2_{\text{pred}}$ becomes (see, for example, Alcantara et al. [2023])
$$R^2_{\text{pred}} = 1 - \sum_{i=1}^{n}\frac{\bigl(y_i - x_i^\top\hat\beta\bigr)^2}{(1-h_{ii})^2}\Big/\sum_{i=1}^{n}\frac{\bigl(y_i - \bar{y}\bigr)^2}{(1-n^{-1})^2}.$$
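Because the predicted residuals have the closed form above, PRESS and the predicted R² can be computed without refitting the model n times; a sketch for an lm fit, using hatvalues() for the h_ii:

# Predicted R^2 via the closed-form predicted residuals.
predR2 <- function(fit) {
  y <- model.response(model.frame(fit))
  n <- length(y)
  press  <- sum((resid(fit)/(1 - hatvalues(fit)))^2)  # interest model PRESS
  press0 <- sum(((y - mean(y))/(1 - 1/n))^2)          # reference model PRESS
  1 - press/press0
}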
Generalized Information Criterion

$$\begin{aligned}
\text{GIC}(k) &= \overbrace{-2\,\ell(\hat\beta)}^{\text{measure of the goodness-of-fit}} + \overbrace{k(p+2)}^{\text{complexity penalty}}\\
&= n\log(2\pi\hat\sigma^2) + \frac{1}{\hat\sigma^2}\sum_{i=1}^{n}(y_i - x_i^\top\hat\beta)^2 + k(p+2)\\
&= n\log(2\pi\hat\sigma^2) + (n-p-1) + k(p+2)\\
&= n\log(\hat\sigma^2) + p(k-1) + C,
\end{aligned}$$
where $C > 0$ is a constant as it does not depend on the fitted model. The Akaike Information Criterion (AIC, Akaike [1974]) and the Bayesian Information Criterion (BIC, Schwarz [1978]), also known as the Schwarz Information Criterion (SIC), are special cases of GIC($k$) when $k=2$ and $k=\log(n)$, respectively.

If $n > 7$ then the penalty by model complexity in BIC is greater than that in AIC. Therefore, the BIC tends to choose simpler models than AIC.
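A direct transcription of GIC(k) for an lm fit; note that, as in the derivation above, this sketch uses σ̂² = RSS/(n−p−1), whereas R's built-in AIC() and BIC() use the ML estimate RSS/n, so the numerical values need not coincide:

# GIC(k) for a normal linear model, following the derivation above.
GIC <- function(fit, k = 2) {
  n <- nobs(fit); p <- length(coef(fit)) - 1
  sigma2 <- sum(resid(fit)^2)/(n - p - 1)   # unbiased estimate of sigma^2
  n*log(2*pi*sigma2) + (n - p - 1) + k*(p + 2)
}
# GIC(fit, k = 2)               # AIC-type criterion
# GIC(fit, k = log(nobs(fit)))  # BIC-type criterion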
Exercises

Consider two models: in the first model, denoted here by I, $\mu_i = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip}$, while in the second model, denoted here by II, $\mu_i = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip} + \beta_{p+1}x_{i(p+1)} + \ldots + \beta_{p+r}x_{i(p+r)}$. Show that:

(1) The $R^2_{\text{adj}}$ of model I is greater [smaller] than that of model II if and only if $\hat\sigma^2_{\text{I}}/\hat\sigma^2_{\text{II}} < 1$ $\bigl[\hat\sigma^2_{\text{I}}/\hat\sigma^2_{\text{II}} > 1\bigr]$, where $\hat\sigma^2_{\text{I}}$ and $\hat\sigma^2_{\text{II}}$ represent the estimates of $\sigma^2$ obtained from models I and II, respectively. Hence, choosing the model with the highest value of $R^2_{\text{adj}}$ is equivalent to choosing the model with the lowest estimate of $\sigma^2$.

(2) The AIC of model I is smaller [greater] than that of model II if and only if $\hat\sigma^2_{\text{I}}/\hat\sigma^2_{\text{II}} < \exp(\frac{r}{n})$ $\bigl[\hat\sigma^2_{\text{I}}/\hat\sigma^2_{\text{II}} > \exp(\frac{r}{n})\bigr]$. Therefore, for large sample sizes, choosing the model with the smallest AIC value is equivalent to choosing the model with the smallest estimate of $\sigma^2$.

(3) The BIC of model I is smaller [greater] than that of model II if and only if $\hat\sigma^2_{\text{I}}/\hat\sigma^2_{\text{II}} < \exp(\frac{r}{n}[\log(n)-1])$ $\bigl[\hat\sigma^2_{\text{I}}/\hat\sigma^2_{\text{II}} > \exp(\frac{r}{n}[\log(n)-1])\bigr]$. Therefore, for large sample sizes, larger than in the AIC case, choosing the model with the smallest BIC value is equivalent to choosing the model with the smallest estimate of $\sigma^2$.
Confidence intervals

If $\hat\beta \sim \text{Normal}_{p+1}(\beta, \sigma^2(X^\top X)^{-1})$, then
$$\hat\beta_j \sim \text{Normal}(\beta_j, \sigma^2 c_j),\quad j = 0,1,\ldots,p,$$
where $c_j$ represents the $(j+1)$-th entry of the main diagonal of $(X^\top X)^{-1}$. Therefore,
$$\frac{\hat\beta_j - \beta_j}{\sigma\sqrt{c_j}} \sim \text{Normal}(0,1),$$
and, for all $\alpha\in(0,1)$,
$$\Pr\left[-Z_{1-\frac{\alpha}{2}} \leq \frac{\hat\beta_j - \beta_j}{\sigma\sqrt{c_j}} \leq Z_{1-\frac{\alpha}{2}}\right] = 1-\alpha \;\Longrightarrow\; \Pr\left[\hat\beta_j - Z_{1-\frac{\alpha}{2}}\,\sigma\sqrt{c_j} \leq \beta_j \leq \hat\beta_j + Z_{1-\frac{\alpha}{2}}\,\sigma\sqrt{c_j}\right] = 1-\alpha,$$
where $Z_{1-\frac{\alpha}{2}}$ represents the $100(1-\frac{\alpha}{2})$-th percentile of the standard normal distribution.
Confidence intervals

Then, a confidence interval of $100(1-\alpha)\%$ for $\beta_j$ reduces to
$$\text{IC}_{1-\alpha}(\beta_j) = \hat\beta_j \pm Z_{1-\frac{\alpha}{2}}\,\sigma\sqrt{c_j}.$$
Nevertheless, this confidence interval is not useful in practice because $\sigma^2$ is unknown. An estimator of $\sigma^2$ that is statistically independent of $\hat\beta$ and such that $(n-p-1)\frac{\hat\sigma^2}{\sigma^2} \sim \chi^2(n-p-1)$ is given by the following expression:
$$\hat\sigma^2 = \frac{1}{n-p-1}(y-X\hat\beta)^\top(y-X\hat\beta) = \frac{1}{n-p-1}\sum_{i=1}^{n}(y_i-x_i^\top\hat\beta)^2.$$
The mean and variance of a random variable following the $\chi^2(n-p-1)$ distribution allow one to compute $\text{E}(\hat\sigma^2) = \sigma^2$ and $\text{Var}(\hat\sigma^2) = 2\sigma^4/(n-p-1)$. The above allows one to conclude that $\hat\sigma^2$ is an estimator of $\sigma^2$ which satisfies the following: (i) it is unbiased; (ii) its variance tends to 0 as $n$ tends to $\infty$; and (iii) it is consistent.
Confidence intervals

$Y \sim \text{Normal}_n(X\beta, \sigma^2 I)$ implies that
$$(n-p-1)\frac{\hat\sigma^2}{\sigma^2} = Y^\top\frac{1}{\sigma^2}(I-H)Y \sim \chi^2(n-p-1)$$
provided that (see Appendix E-4)

$\frac{1}{\sigma^2}(I-H)\,\sigma^2 I = I-H$ is an idempotent matrix,

$(X\beta)^\top\frac{1}{\sigma^2}(I-H)X\beta = (X\beta)^\top\frac{1}{\sigma^2}(X-HX)\beta = 0$, and

$\text{rank}(I-H) = \text{trace}(I-H) = n - \text{trace}(H) = n-p-1$.

In addition, $(n-p-1)\frac{\hat\sigma^2}{\sigma^2}$ and $\hat\beta = (X^\top X)^{-1}X^\top Y$ are statistically independent provided that (see Appendix E-5)
$$(X^\top X)^{-1}X^\top\,\sigma^2 I\,\frac{1}{\sigma^2}(I-H) = (X^\top X)^{-1}(X^\top - X^\top H) = 0.$$
Confidence intervals

The distribution of $\frac{Z}{\sqrt{V/m}}$ is Student-$t(m)$ provided that $Z \sim \text{Normal}(0,1)$, $V \sim \chi^2(m)$, and $Z$ and $V$ are statistically independent random variables (see Appendix D-2). Therefore,
$$\left.\frac{\hat\beta_j - \beta_j}{\sigma\sqrt{c_j}}\middle/\sqrt{\frac{\hat\sigma^2}{\sigma^2}}\right. = \frac{\hat\beta_j - \beta_j}{\hat\sigma\sqrt{c_j}} \sim \text{Student-}t(n-p-1).$$
Thus, for all $\alpha\in(0,1)$,
$$\Pr\left[-t^{(n-p-1)}_{1-\frac{\alpha}{2}} \leq \frac{\hat\beta_j - \beta_j}{\hat\sigma\sqrt{c_j}} \leq t^{(n-p-1)}_{1-\frac{\alpha}{2}}\right] = 1-\alpha \;\Longrightarrow\; \Pr\left[\hat\beta_j - t^{(n-p-1)}_{1-\frac{\alpha}{2}}\,\hat\sigma\sqrt{c_j} \leq \beta_j \leq \hat\beta_j + t^{(n-p-1)}_{1-\frac{\alpha}{2}}\,\hat\sigma\sqrt{c_j}\right] = 1-\alpha,$$
where $t^{(n-p-1)}_{1-\frac{\alpha}{2}}$ represents the $100(1-\frac{\alpha}{2})$-th percentile of the Student-$t(n-p-1)$ distribution.
Confidence intervals

Then, a confidence interval of $100(1-\alpha)\%$ for $\beta_j$ reduces to
$$\text{IC}_{1-\alpha}(\beta_j) = \hat\beta_j \pm t^{(n-p-1)}_{1-\frac{\alpha}{2}}\,\hat\sigma\sqrt{c_j}$$
Two desirable attributes for a confidence interval are:

Confidence, which is easily specified by using "small" values of $\alpha\in(0,1)$.

Precision, which is recognized by "small" values of the interval width, given by $2\,t^{(n-p-1)}_{1-\frac{\alpha}{2}}\,\hat\sigma\sqrt{c_j}$.

The uncertainty in $\sigma^2$ subtracts precision from the confidence interval because, for fixed $\alpha$, the value of $t^{(n-p-1)}_{1-\frac{\alpha}{2}}$ is greater than $Z_{1-\frac{\alpha}{2}}$. However, the loss of precision decreases with increasing $(n-p-1)$, since for fixed $\alpha$, $t^{(n-p-1)}_{1-\frac{\alpha}{2}} \to Z_{1-\frac{\alpha}{2}}$ as $(n-p-1) \to \infty$.
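The interval can be assembled by hand from coef(), vcov() and qt(), and compared with confint(); a sketch:

# Manual t-based confidence intervals for the coefficients of an lm fit.
ci_manual <- function(fit, level = 0.95) {
  alpha <- 1 - level
  se <- sqrt(diag(vcov(fit)))               # sigma_hat * sqrt(c_j)
  tq <- qt(1 - alpha/2, df.residual(fit))   # t percentile with n-p-1 df
  cbind(lower = coef(fit) - tq*se, upper = coef(fit) + tq*se)
}
# ci_manual(fit)   # agrees with confint(fit)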
Confidence intervals

[Figure: densities of the standard normal and Student-$t(\nu)$ distributions, each marking the percentiles $\pm Z_{1-\frac{\alpha}{2}}$ and $\pm t_{1-\frac{\alpha}{2}}$, respectively, with tail areas of $\frac{\alpha}{2}$.]
Hypothesis testing

                                        Null hypothesis (H0) is
                                        True              False
Decision about        Don't reject     ✓                 Type II error
hypothesis (H0) is    Reject           Type I error      ✓

α = Pr[Type I error] = Pr[Reject H0 | H0 is True]
γ = Pr[Type II error] = Pr[Don't reject H0 | H0 is False]
Hypothesis testing

$$H_0: \beta_j = \beta_j^{0}\quad\text{versus}\quad H_1: \beta_j \neq \beta_j^{0}$$

Type I error: Conclude that $\beta_j \neq \beta_j^{0}$ when, in fact, $\beta_j = \beta_j^{0}$.
Type II error: Conclude that $\beta_j = \beta_j^{0}$ when, in fact, $\beta_j \neq \beta_j^{0}$.

If $\sigma^2$ is known, then $\frac{\hat\beta_j - \beta_j^{0}}{\sigma\sqrt{c_j}} \sim \text{Normal}(0,1)$ under $H_0$, and the decision rule is

"Reject $H_0$ at the significance level of $100(\alpha)\%$ if $\left|\frac{\hat\beta_j - \beta_j^{0}}{\sigma\sqrt{c_j}}\right| > Z_{1-\frac{\alpha}{2}}$"

$$\begin{aligned}
\Pr[\text{Type I error}] &= \Pr\left[\left|\hat\beta_j - \beta_j^{0}\right|\big/\bigl(\sigma\sqrt{c_j}\bigr) > Z_{1-\frac{\alpha}{2}}\;\Big|\;H_0\text{ is True}\right]\\
&= 1 - \Pr\left[\left|\hat\beta_j - \beta_j^{0}\right|\big/\bigl(\sigma\sqrt{c_j}\bigr) \leq Z_{1-\frac{\alpha}{2}}\;\Big|\;H_0\text{ is True}\right]\\
&= 1 - \Pr\left[-Z_{1-\frac{\alpha}{2}} \leq \bigl(\hat\beta_j - \beta_j^{0}\bigr)\big/\bigl(\sigma\sqrt{c_j}\bigr) \leq Z_{1-\frac{\alpha}{2}}\;\Big|\;H_0\text{ is True}\right]\\
&= 1 - \left[\Phi(Z_{1-\frac{\alpha}{2}}) - \Phi(-Z_{1-\frac{\alpha}{2}})\right] = \alpha,
\end{aligned}$$
where $\Phi(\cdot)$ represents the cumulative distribution function of the standard normal distribution.
Hypothesis testing

$$\begin{aligned}
\Pr[\text{Type II error}] &= \Pr\left[\left|\hat\beta_j - \beta_j^{0}\right|\big/\bigl(\sigma\sqrt{c_j}\bigr) \leq Z_{1-\frac{\alpha}{2}}\;\Big|\;H_0\text{ is False}\right]\\
&= \Pr\left[-Z_{1-\frac{\alpha}{2}} \leq \bigl(\hat\beta_j - \beta_j^{0}\bigr)\big/\bigl(\sigma\sqrt{c_j}\bigr) \leq Z_{1-\frac{\alpha}{2}}\;\Big|\;H_0\text{ is False}\right]\\
&= \Pr\left[-Z_{1-\frac{\alpha}{2}} - \delta \leq \bigl(\hat\beta_j - \beta_j\bigr)\big/\bigl(\sigma\sqrt{c_j}\bigr) \leq Z_{1-\frac{\alpha}{2}} - \delta\;\Big|\;H_0\text{ is False}\right]\\
&= \Phi(Z_{1-\frac{\alpha}{2}} - \delta) - \Phi(-Z_{1-\frac{\alpha}{2}} - \delta),
\end{aligned}$$
where $\delta = (\beta_j - \beta_j^{0})\big/\bigl(\sigma\sqrt{c_j}\bigr)$.

[Figure: Pr[Type II error] as a function of $\delta$ for $\alpha = 0.01$, $0.05$ and $0.10$.]
Hypothesis testing

If $\sigma^2$ is unknown and the aim is to assess the hypothesis system
$$H_0: \beta_j = \beta_j^{0}\quad\text{versus}\quad H_1: \beta_j \neq \beta_j^{0},$$
then the following decision rule may be used:

"Reject $H_0$ at the significance level of $100(\alpha)\%$ if $\left|\frac{\hat\beta_j - \beta_j^{0}}{\hat\sigma\sqrt{c_j}}\right| > t^{(n-p-1)}_{1-\frac{\alpha}{2}}$"

If a random variable follows the Student-$t(n-p-1)$ distribution, then its square follows the $F_{(1,\,n-p-1)}$ distribution (see Appendix D-5). Therefore, the above decision rule is equivalent to the following:

"Reject $H_0$ at the significance level of $100(\alpha)\%$ if $\frac{(\hat\beta_j - \beta_j^{0})^2}{\hat\sigma^2 c_j} > F^{(1,\,n-p-1)}_{1-\alpha}$"

The $p$-value, given by $\Pr[F > (\hat\beta_j - \beta_j^{0})^2/(\hat\sigma^2 c_j)]$, where $F \sim F_{(1,\,n-p-1)}$, allows one to write the above decision rule as follows:

"Reject $H_0$ at the significance level of $100(\alpha)\%$ if $p$-value $< \alpha$"
Hypothesis testing

[Figure: density of the $F_{(r,\,n-p-1)}$ distribution with the critical value $F^{(r,\,n-p-1)}_{1-\alpha}$ marked. When the observed statistic $F$ falls below the critical value, the p-value exceeds $\alpha$ and the null hypothesis is not rejected at the significance level of $100(\alpha)\%$; when it falls above, the p-value is smaller than $\alpha$ and the null hypothesis is rejected.]
Hypothesis testing

The interest is to assess the following hypothesis system:
$$H_0: L^\top\beta = M\quad\text{versus}\quad H_1: L^\top\beta \neq M,$$
where $L^\top$ is an $r\times(p+1)$ matrix whose rows are linearly independent and $M$ is an $r\times 1$ matrix. The following are examples of the hypothesis system above for $p=3$:

(1) $H_0: \beta_2 = 0$ versus $H_1: \beta_2 \neq 0$. Then $r=1$, $L^\top = \begin{bmatrix} 0 & 0 & 1 & 0 \end{bmatrix}$ and $M = 0$.

(2) $H_0: \begin{bmatrix}\beta_1\\\beta_3\end{bmatrix} = \begin{bmatrix}0\\0\end{bmatrix}$ versus $H_1: \begin{bmatrix}\beta_1\\\beta_3\end{bmatrix} \neq \begin{bmatrix}0\\0\end{bmatrix}$. Then $r=2$, $L^\top = \begin{bmatrix} 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}$ and $M = \begin{bmatrix}0\\0\end{bmatrix}$.
Hypothesis testing

(3) $H_0: \beta_2 = \beta_3$ versus $H_1: \beta_2 \neq \beta_3$. Then $r=1$, $L^\top = \begin{bmatrix} 0 & 0 & 1 & -1 \end{bmatrix}$ and $M = 0$.

If the model is correctly specified then $\hat\beta \sim \text{Normal}_{p+1}(\beta, \sigma^2(X^\top X)^{-1})$, which implies that (see Appendix E-3)
$$L^\top\hat\beta \sim \text{Normal}_r\bigl(L^\top\beta,\; L^\top\sigma^2(X^\top X)^{-1}L\bigr),$$
which in turn implies that (see Appendix E-3,4)
$$(L^\top\hat\beta - L^\top\beta)^\top\bigl(L^\top\sigma^2(X^\top X)^{-1}L\bigr)^{-1}(L^\top\hat\beta - L^\top\beta) \sim \chi^2(r).$$
The distribution of $\frac{V/m}{W/r}$ is $F_{(m,r)}$ provided that $V \sim \chi^2(m)$ and $W \sim \chi^2(r)$ are statistically independent random variables (see Appendix D-3). Therefore,
$$\frac{\sigma^2}{r\,\hat\sigma^2}(L^\top\hat\beta - L^\top\beta)^\top\bigl(L^\top\sigma^2(X^\top X)^{-1}L\bigr)^{-1}(L^\top\hat\beta - L^\top\beta) \sim F_{(r,\,n-p-1)}.$$
Hypothesis testing

Then, the decision rule reduces to "Reject $H_0$ at the significance level of $100(\alpha)\%$ if $F > F^{(r,\,n-p-1)}_{1-\alpha}$", where $F$ may be written as follows:
$$F = \frac{1}{r}(L^\top\hat\beta - M)^\top\bigl(L^\top\hat\sigma^2(X^\top X)^{-1}L\bigr)^{-1}(L^\top\hat\beta - M).$$
For example,

(1) $H_0: \beta_2 = 0$ versus $H_1: \beta_2 \neq 0$. Then
$$F = \frac{\hat\beta_2^2}{\widehat{\text{Var}}(\hat\beta_2)} = \frac{\hat\beta_2^2}{[\text{Std.Error}(\hat\beta_2)]^2}.$$

(2) $H_0: \begin{bmatrix}\beta_1\\\beta_3\end{bmatrix} = \begin{bmatrix}0\\0\end{bmatrix}$ versus $H_1: \begin{bmatrix}\beta_1\\\beta_3\end{bmatrix} \neq \begin{bmatrix}0\\0\end{bmatrix}$. Then
$$F = \frac{1}{2}\begin{bmatrix}\hat\beta_1 & \hat\beta_3\end{bmatrix}\begin{bmatrix}\widehat{\text{Var}}(\hat\beta_1) & \widehat{\text{Cov}}(\hat\beta_1,\hat\beta_3)\\ \widehat{\text{Cov}}(\hat\beta_3,\hat\beta_1) & \widehat{\text{Var}}(\hat\beta_3)\end{bmatrix}^{-1}\begin{bmatrix}\hat\beta_1\\ \hat\beta_3\end{bmatrix}.$$
Hypothesis testing

(3) $H_0: \beta_2 = \beta_3$ versus $H_1: \beta_2 \neq \beta_3$. Then
$$F = \frac{(\hat\beta_2 - \hat\beta_3)^2}{\widehat{\text{Var}}(\hat\beta_2) + \widehat{\text{Var}}(\hat\beta_3) - 2\,\widehat{\text{Cov}}(\hat\beta_2,\hat\beta_3)}.$$
Alternatively, $F$ can be expressed as follows:
$$F = \frac{1}{r\,\hat\sigma^2}\left[\sum_{i=1}^{n}(y_i - x_i^\top\hat\beta^{*})^2 - \sum_{i=1}^{n}(y_i - x_i^\top\hat\beta)^2\right],$$
where $\hat\beta = \underset{\beta\in\mathbb{R}^{p+1}}{\text{argmin}}\; Q_2(\beta)$ and $\hat\beta^{*} = \underset{\beta\in\mathbb{R}^{p+1}}{\text{argmin}}\; Q_2(\beta)$ restricted to $L^\top\beta = M$.
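The F statistic for H0: Lᵀβ = M can be computed directly from coef() and vcov(); a sketch (the helper name waldF is ours, not part of any package):

# F test of the general linear hypothesis H0: t(L) %*% beta = M.
# L is a (p+1) x r matrix, so t(L) %*% coef(fit) is the tested combination.
waldF <- function(fit, L, M = rep(0, ncol(L))) {
  d <- drop(t(L) %*% coef(fit)) - M
  r <- ncol(L)
  Fstat <- drop(t(d) %*% solve(t(L) %*% vcov(fit) %*% L) %*% d)/r
  c(F = Fstat, p.value = pf(Fstat, r, df.residual(fit), lower.tail = FALSE))
}
# Example (3) above, in a model with coefficients (b0, b1, b2, b3):
# waldF(fit, L = matrix(c(0, 0, 1, -1), ncol = 1))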
Richness of the species

The analysis of these data, available in the object richness of the library glmtoolbox, aims to explain the differences between plots as related to species richness (represented by a count of the number of plant species (richness)) using the differences between plots with respect to biomass (Biomass) and soil pH level: low (pH="low"), medium (pH="mid") and high (pH="high").

[Figure: number of plant species versus biomass, by soil pH level (low, mid, high).]

$$Y_i = \mu_i + \epsilon_i,\qquad \mu_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3},\qquad \epsilon_i \overset{ind}{\sim} \text{Normal}(0,\sigma^2),$$
where
$y$ = number of plant species,
$x_1$ = biomass,
$x_2 = 0$ and $x_3 = 0$ if pH="low",
$x_2 = 1$ and $x_3 = 0$ if pH="mid",
$x_2 = 0$ and $x_3 = 1$ if pH="high".
Parameter interpretation: without interaction

The expected number of plant species for a plot with biomass $x_1$ and pH level low ($x_2=0$ and $x_3=0$), mid ($x_2=1$ and $x_3=0$) or high ($x_2=0$ and $x_3=1$) is $\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3$.

There are two plots with the same pH level, but the biomass of the first plot is $x+\delta$ while the biomass of the second plot is $x$. As a result, the expected number of plant species for the first plot is $|\beta_1|\delta$ units lower, if $\beta_1 < 0$, or greater, if $\beta_1 > 0$, than that for the second plot:
$$\mu_{x+\delta} - \mu_x = \beta_0 + \beta_1(x+\delta) + \beta_2 x_2 + \beta_3 x_3 - (\beta_0 + \beta_1 x + \beta_2 x_2 + \beta_3 x_3) = \beta_1\delta.$$

There are two plots with the same biomass, but the pH level of the first plot is mid while the pH level of the second plot is low. As a result, the expected number of plant species for the first plot is $|\beta_2|$ units lower, if $\beta_2 < 0$, or greater, if $\beta_2 > 0$, than that for the second plot:
$$\mu_{\text{mid}} - \mu_{\text{low}} = \beta_0 + \beta_1 x_1 + \beta_2 - (\beta_0 + \beta_1 x_1) = \beta_2.$$

There are two plots with the same biomass, but the pH level of the first plot is high while the pH level of the second plot is low. As a result, the expected number of plant species for the first plot is $|\beta_3|$ units lower, if $\beta_3 < 0$, or greater, if $\beta_3 > 0$, than that for the second plot:
$$\mu_{\text{high}} - \mu_{\text{low}} = \beta_0 + \beta_1 x_1 + \beta_3 - (\beta_0 + \beta_1 x_1) = \beta_3.$$
Parameter interpretation: with interaction

The expected number of plant species for a plot with biomass $x_1$ and pH level low ($x_2=0$ and $x_3=0$), mid ($x_2=1$ and $x_3=0$) or high ($x_2=0$ and $x_3=1$) is $\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_1 x_2 + \beta_5 x_1 x_3$.

There are two plots with pH level low, but the biomass of the first plot is $x+\delta$ while the biomass of the second plot is $x$. Therefore, the expected number of plant species for the first plot is $|\beta_1|\delta$ units lower, if $\beta_1 < 0$, or greater, if $\beta_1 > 0$, than that for the second plot:
$$\mu_{x+\delta} - \mu_x = \beta_0 + \beta_1(x+\delta) - (\beta_0 + \beta_1 x) = \beta_1\delta.$$

There are two plots with pH level mid, but the biomass of the first plot is $x+\delta$ while the biomass of the second plot is $x$. Therefore, the expected number of plant species for the first plot is $|\beta_1+\beta_4|\delta$ units lower, if $\beta_1+\beta_4 < 0$, or greater, if $\beta_1+\beta_4 > 0$, than that for the second plot:
$$\mu_{x+\delta} - \mu_x = \beta_0 + \beta_1(x+\delta) + \beta_4(x+\delta) - (\beta_0 + \beta_1 x + \beta_4 x) = (\beta_1+\beta_4)\delta.$$

There are two plots with pH level high, but the biomass of the first plot is $x+\delta$ while the biomass of the second plot is $x$. Therefore, the expected number of plant species for the first plot is $|\beta_1+\beta_5|\delta$ units lower, if $\beta_1+\beta_5 < 0$, or greater, if $\beta_1+\beta_5 > 0$, than that for the second plot:
$$\mu_{x+\delta} - \mu_x = \beta_0 + \beta_1(x+\delta) + \beta_5(x+\delta) - (\beta_0 + \beta_1 x + \beta_5 x) = (\beta_1+\beta_5)\delta.$$
Richness of the species

> data(richness)
> fit <- lm(Species ~ Biomass + pH, data=richness)
> summary(fit)

The columns below are $\hat\beta_j$ (Estimate), $\hat\sigma\sqrt{c_j}$ (Std.Error), $\hat\beta_j/(\hat\sigma\sqrt{c_j})$ (t-value) and $\Pr[F_{(1,\,n-p-1)} > t_j^2]$ (Pr(>|t|)), for the tests $H_0: \beta_j = 0$ vs $H_1: \beta_j \neq 0$:

            Estimate Std.Error t-value Pr(>|t|)
(Intercept)  17.6248    0.7890   22.34   <2e-16
Biomass      -2.7276    0.1717  -15.89   <2e-16
pHmid        11.9782    1.0020   11.95   <2e-16
pHhigh       22.6200    1.0818   20.91   <2e-16

Residual standard error ($\hat\sigma$): 3.783 on 86 ($n-p-1$) degrees of freedom
Multiple R-squared ($R^2$): 0.8524, Adjusted R-squared ($R^2_{\text{adj}}$): 0.8472
F-statistic: 384.8-style test here is $F$ = 165.5 on 3 ($r$) and 86 ($n-p-1$) DF, p-value ($\Pr[F_{(r,\,n-p-1)} > F]$): < 2.2e-16, for $H_0: (\beta_1,\beta_2,\beta_3)^\top = 0$ vs $H_1: (\beta_1,\beta_2,\beta_3)^\top \neq 0$
Richness of the species

Is there an interaction between biomass and soil pH level? That is, does the effect on the expected number of plant species of an increase in biomass depend on the soil pH level?
$$Y_i = \mu_i + \epsilon_i,\qquad \mu_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \beta_4 x_{i1}x_{i2} + \beta_5 x_{i1}x_{i3},\qquad \epsilon_i \overset{ind}{\sim} \text{Normal}(0,\sigma^2)$$

> fit2 <- update(fit, formula=Species ~ Biomass*pH)
> anova(fit,fit2)

This compares $H_0: (\beta_4,\beta_5)^\top = 0$ vs $H_1: (\beta_4,\beta_5)^\top \neq 0$; Df is $r$ and Pr(>F) is $\Pr[F_{(r,\,n-p-1)} > F]$:

Model 1: Species ~ Biomass + pH
Model 2: Species ~ Biomass + pH + Biomass:pH
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1     86 1230.9
2     84 1224.7  2    6.2591 0.2147 0.8073
Richness of the species

Model matrix                  $X$                                                       model.matrix(fit)
Parameter estimates           $\hat\beta = (\hat\beta_0,\hat\beta_1,\ldots,\hat\beta_p)^\top$   coef(fit)
Residual sum of squares       $\sum_{i=1}^{n}(y_i - x_i^\top\hat\beta)^2$               sum(resid(fit)**2)
Log-likelihood                $\ell(\hat\beta)$                                         logLik(fit)
Confidence intervals          $\hat\beta_j \pm t^{(n-p-1)}_{1-\frac{\alpha}{2}}\hat\sigma\sqrt{c_j}$   confint(fit, level=1-α)
Degrees-of-freedom            $n-p-1$                                                   df.residual(fit)
Estimate of $\sigma^2$        $\hat\sigma^2$                                            summary(fit)$sigma**2
Estimate of Var($\hat\beta$)  $\hat\sigma^2(X^\top X)^{-1}$                             vcov(fit)
Fitted values                 $\hat\mu = X\hat\beta = Hy$                               fitted(fit)
Ordinary residuals            $r_i = y_i - x_i^\top\hat\beta$                           resid(fit)
R-squared                     $R^2$                                                     summary(fit)$r.squared
Adjusted R-squared            $R^2_{\text{adj}}$                                        summary(fit)$adj.r.squared
Gasoline yield

The analysis of the data in the object Auto of the library ISLR is aimed at explaining the differences between the vehicles with respect to their gasoline yield (mpg), in miles per gallon, using other vehicle characteristics such as their horsepower (horsepower) and their weight (weight), in kilograms.

[Figure: gasoline yield versus horsepower, with point size/shade indicating weight (low, mid, high).]

$$Y_i = \mu_i + \epsilon_i,\qquad \mu_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i1}x_{i2},\qquad \epsilon_i \overset{ind}{\sim} \text{Normal}(0,\sigma^2),$$
where
$y$ = gasoline yield,
$x_1$ = horsepower,
$x_2$ = weight.
Parameter interpretation: without interaction

The expected gasoline yield for a vehicle with horsepower $x_1$ and weight $x_2$ is $\beta_0 + \beta_1 x_1 + \beta_2 x_2$.

There are two vehicles with the same weight, but the horsepower of the first vehicle is $x+\delta$ while the horsepower of the second vehicle is $x$. As a result, the expected gasoline yield for the first vehicle is $|\beta_1|\delta$ miles per gallon lower, if $\beta_1 < 0$, or greater, if $\beta_1 > 0$, than that for the second vehicle:
$$\mu_{x+\delta} - \mu_x = \beta_0 + \beta_1(x+\delta) + \beta_2 x_2 - (\beta_0 + \beta_1 x + \beta_2 x_2) = \beta_1\delta.$$

There are two vehicles with the same horsepower, but the weight of the first vehicle is $x+\delta$ while the weight of the second vehicle is $x$. As a result, the expected gasoline yield for the first vehicle is $|\beta_2|\delta$ miles per gallon lower, if $\beta_2 < 0$, or greater, if $\beta_2 > 0$, than that for the second vehicle:
$$\mu_{x+\delta} - \mu_x = \beta_0 + \beta_1 x_1 + \beta_2(x+\delta) - (\beta_0 + \beta_1 x_1 + \beta_2 x) = \beta_2\delta.$$
Parameter interpretation: with interaction

The expected gasoline yield for a vehicle with horsepower $x_1$ and weight $x_2$ is $\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2$.

There are two vehicles with weight equal to $z$, but the horsepower of the first vehicle is $x+\delta$ while the horsepower of the second vehicle is $x$. As a result, the expected gasoline yield for the first vehicle is $|\beta_1+\beta_3 z|\delta$ miles per gallon lower, if $\beta_1+\beta_3 z < 0$, or greater, if $\beta_1+\beta_3 z > 0$, than that for the second vehicle:
$$\mu_{x+\delta} - \mu_x = \beta_0 + \beta_1(x+\delta) + \beta_2 z + \beta_3(x+\delta)z - (\beta_0 + \beta_1 x + \beta_2 z + \beta_3 xz) = (\beta_1+\beta_3 z)\delta.$$

There are two vehicles with horsepower equal to $z$, but the weight of the first vehicle is $x+\delta$ while the weight of the second vehicle is $x$. As a result, the expected gasoline yield for the first vehicle is $|\beta_2+\beta_3 z|\delta$ miles per gallon lower, if $\beta_2+\beta_3 z < 0$, or greater, if $\beta_2+\beta_3 z > 0$, than that for the second vehicle:
$$\mu_{x+\delta} - \mu_x = \beta_0 + \beta_1 z + \beta_2(x+\delta) + \beta_3 z(x+\delta) - (\beta_0 + \beta_1 z + \beta_2 x + \beta_3 zx) = (\beta_2+\beta_3 z)\delta.$$
Gasoline yield

> data(Auto,package="ISLR")
> Auto <- within(Auto, weight <- weight/1000)
>
> fit <- lm(mpg ~ horsepower + weight + horsepower:weight,
+           data=Auto)
> summary(fit)

The columns are $\hat\beta_j$, $\hat\sigma\sqrt{c_j}$, $\hat\beta_j/(\hat\sigma\sqrt{c_j})$ and $\Pr[F_{(1,\,n-p-1)} > t_j^2]$, for the tests $H_0: \beta_j = 0$ vs $H_1: \beta_j \neq 0$:

                    Estimate Std.Error t-value Pr(>|t|)
(Intercept)        63.557940  2.342949  27.127  < 2e-16
horsepower         -0.250838  0.027279  -9.195  < 2e-16
weight            -10.772411  0.773809 -13.921  < 2e-16
horsepower:weight   0.053554  0.006649   8.054 9.93e-15

Residual standard error ($\hat\sigma$): 3.93 on 388 ($n-p-1$) degrees of freedom
Multiple R-squared ($R^2$): 0.7484, Adjusted R-squared ($R^2_{\text{adj}}$): 0.7465
F-statistic: 384.8 on 3 ($r$) and 388 ($n-p-1$) DF, p-value ($\Pr[F_{(r,\,n-p-1)} > F]$): < 2.2e-16, for $H_0: (\beta_1,\beta_2,\beta_3)^\top = 0$ vs $H_1: (\beta_1,\beta_2,\beta_3)^\top \neq 0$
Gasoline yield

> fit2 <- update(fit, formula = mpg ~ horsepower + weight)
> anova(fit2,fit)

This compares $H_0: \beta_3 = 0$ vs $H_1: \beta_3 \neq 0$; Df is $r$ and Pr(>F) is $\Pr[F_{(r,\,n-p-1)} > F]$:

Model 1: mpg ~ horsepower + weight
Model 2: mpg ~ horsepower + weight + horsepower:weight
  Res.Df    RSS Df Sum of Sq      F    Pr(>F)
1    389 6993.8
2    388 5992.0  1    1001.8 64.871 9.934e-15

> fit3 <- update(fit, formula = mpg ~ 1)
> anova(fit3,fit)

This compares $H_0: (\beta_1,\beta_2,\beta_3)^\top = 0$ vs $H_1: (\beta_1,\beta_2,\beta_3)^\top \neq 0$:

Model 1: mpg ~ 1
Model 2: mpg ~ horsepower + weight + horsepower:weight
  Res.Df   RSS Df Sum of Sq     F  Pr(>F)
1    391 23819
2    388  5992  3     17827 384.8 <2.2e-16
Exercises

These data, available in the object races of the library glmtoolbox, consist of the record time, distance, and cumulative climb of 35 hill races in Scotland. The aim of the statistical analysis of these data is to explain the differences between the record times of the races (rtime), in minutes, using their differences in distance (distance), in miles, and cumulative climb (cclimb), in thousands of feet.

(a) Fit to the data a normal linear model where the record time of the races is the response variable, and the distance and cumulative climb are the explanatory variables.
(b) Assess at the significance level of 5% if the effect of the distance on the expected record time of the race depends on the cumulative climb.
(c) Interpret the parameter estimates except the intercept term.
(d) Estimate the expected record time, in minutes, of a race whose distance and cumulative climb are 7.5 miles and 1800 feet, respectively. Compute this estimate "manually" and using the predict() function.
Exercises

These data, available in the object whiteside of the library MASS, were collected to assess the effect of insulation on gas consumption. The weekly gas consumption (Gas), in thousands of cubic feet, and the average external temperature (Temp), in degrees Celsius, were recorded for 26 weeks before (Insul="Before") and for 30 weeks after (Insul="After") a cavity-wall insulation was installed in a house.

(a) Fit to the data a normal linear model where the gas consumption is the response variable, and the average external temperature and the presence/absence of the cavity-wall insulation are the explanatory variables.
(b) Assess at the significance level of 5% if the effect of the average external temperature on the expected gas consumption depends on the presence/absence of the wall insulation.
(c) Interpret the parameter estimates except the intercept term.
(d) Estimate the expected gas consumption, in thousands of cubic feet, in a week where the average external temperature is 5 degrees Celsius and the wall insulation is present. Compute this estimate "manually" and using the predict() function.
Variable selection
Hybrid forward stepwise

The algorithm starts with the simplest model, that is, a model with a linear predictor in which the coefficients of all variables and/or effects, except the intercept term, are zero. Next, the candidate models are constructed by hierarchically adding effects to the linear predictor (i.e., the main effects are first, then the two-way interactions, and so on), and their "relevance" is assessed by comparing the models with and without the added effects using a previously defined criterion. After an effect has been added to the model, any other effect previously added can be removed if, according to the specified criterion, that effect is no longer relevant. See, for example, James et al. [2013, page 210].

The criteria to compare models include the following:
    Akaike Information Criterion
    Bayesian Information Criterion
    Adjusted R²
    Predicted R²
    Mallows' CP
    p-value of the F test
Hybrid backward stepwise

The process begins with the most complex model (i.e., one that includes all variables and/or effects in its linear predictor) and then the candidate models are constructed by hierarchically removing effects from the linear predictor (i.e., the highest-level interactions are first, then the next level interactions, and so on). The "relevance" of the candidates is determined by comparing the models with and without the excluded effects using a previously established criterion. If an effect is excluded from the model, then any other effect previously excluded may be added back if, based on the specified criterion, that effect is no longer irrelevant. See, for example, James et al. [2013, page 210].

The criteria to compare models include the following:
    Akaike Information Criterion
    Bayesian Information Criterion
    Adjusted R²
    Predicted R²
    Mallows' CP
    p-value of the F test
Variable selection

[Figure: diagram of the candidate models visited by the forward and backward stepwise procedures.]
Variable selection

> help(stepCriterion.lm)
>
> stepCriterion(model,criterion,direction,levels,trace)

model: the object where the fitted model is stored.

criterion: the criterion specified to compare the candidate models. The available options are: AIC ("aic"), BIC ("bic"), adjusted R-squared ("adjr2"), predicted R-squared ("prdr2"), Mallows' CP ("cp") and p-value of the F test ("p-value"). As default, criterion is set to "bic".

direction: hybrid backward ("backward") or hybrid forward ("forward") procedures. By default, direction="forward".

levels: the pair of values in the interval (0, 1) at which the effects enter and leave the model, respectively, if criterion="p-value". By default, levels=c(0.05,0.05).

trace: logical switch indicating if the stepwise reports should be printed. As default, trace is set to TRUE.
Salaries for Professors

These data, available in the object Salaries of the library car, are the 2008-09 nine-month academic salaries for Assistant Professors, Associate Professors and Professors in a college in the U.S. The data were collected as part of the on-going effort of the college's administration to monitor salary differences between male and female faculty members.

rank: a factor with three levels: "AssocProf", "AsstProf" and "Prof".
discipline: a factor with two levels: "A" (theoretical departments) and "B" (applied departments).
yrs.since.phd: years since PhD.
yrs.service: years of service.
sex: a factor with two levels: "Female" and "Male".
salary: nine-month salary, in dollars.
Salaries for Professors
> data(Salaries, package="car")
> f <- lm(salary ~ rank*discipline*yrs.since.phd*yrs.service*sex,
+ data=Salaries)
>
> stepCriterion(f, direction="forward", criterion="bic")
"~ rank + discipline"
>
> stepCriterion(f, direction="backward", criterion="bic")
"~ rank + discipline*yrs.since.phd + discipline*yrs.service"
>
> stepCriterion(f, direction="forward", criterion="adjr2")
"~ rank + discipline + sex"
>
> stepCriterion(f, direction="forward", criterion="aic")
"~ rank + discipline"
>
> stepCriterion(f, direction="forward", criterion="prdr2")
"~ rank + discipline + sex"
>
> stepCriterion(f, direction="forward", criterion="cp")
"~ rank + discipline"
Diagnostic methods
Residual analysis

The vector of ordinary residuals, denoted here by $r = (r_1,\ldots,r_n)^\top$, where $r_i = Y_i - x_i^\top\hat\beta$, may be written as follows:
$$r = Y - X\hat\beta = (I-H)Y = (I-H)(X\beta + \epsilon) = (I-H)\epsilon.$$
That is, $r_i = \epsilon_i - \sum_{j=1}^{n}h_{ij}\epsilon_j$. According to this identity, $r$ and $\epsilon$ are only related by $H$. In the presence of sufficiently small $h_{ij}$'s, $r$ can substitute for $\epsilon$; otherwise its usefulness may be limited. Indeed,
$$\text{E}(r) = (I-H)\text{E}(\epsilon) = (I-H)0 = 0,$$
$$\text{Var}(r) = (I-H)\text{Var}(\epsilon)(I-H)^\top = \sigma^2(I-H)^2 = \sigma^2(I-H).$$
Therefore, $\text{E}(r_i) = 0$, $\text{Var}(r_i) = \sigma^2(1-h_{ii})$ and $\text{Cov}(r_i,r_j) = -\sigma^2 h_{ij}$. The above indicates that, unlike the random errors ($\epsilon_1,\ldots,\epsilon_n$), the ordinary residuals ($r_1,\ldots,r_n$) are neither independent nor identically distributed random variables.
Residual analysis

Internally Studentized residuals:
$$r^{\text{I}}_i = \frac{y_i - x_i^\top\hat\beta}{\hat\sigma\sqrt{1-h_{ii}}}.$$
Externally Studentized residuals:
$$r^{\text{E}}_i = \frac{y_i - x_i^\top\hat\beta}{\hat\sigma_{(i)}\sqrt{1-h_{ii}}},$$
where $\hat\sigma^2_{(i)}$ is the estimate of $\sigma^2$ derived from the dataset with the $i$-th observation excluded. According to Cook and Weisberg [1982, page 20], the distribution of $r^{\text{E}}_i$ is Student-$t(n-p-2)$ and it also can be expressed as follows:
$$r^{\text{E}}_i = r^{\text{I}}_i\left(\frac{n-p-2}{n-p-1-(r^{\text{I}}_i)^2}\right)^{\frac{1}{2}}.$$
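Both residual types are available in base R as rstandard() and rstudent(); the following sketch verifies the formulas above against them for any lm fit:

# Check the Studentized-residual formulas against base R.
res_check <- function(fit) {
  h <- hatvalues(fit); r <- resid(fit)
  df <- df.residual(fit)                  # n - p - 1
  s2 <- sum(r^2)/df
  rI <- r/sqrt(s2*(1 - h))                # internally Studentized
  rE <- rI*sqrt((df - 1)/(df - rI^2))     # externally Studentized
  c(internal = all.equal(rI, rstandard(fit)),
    external = all.equal(rE, rstudent(fit)))
}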
Normal probability plot with simulated envelope

See, for instance, Atkinson [1981].

Step 0: Set T (a positive integer) and $\alpha\in(0,1)$.
Step 1: Generate T independent samples of size $n$ according to $Y_i \overset{ind}{\sim} \text{Normal}(x_i^\top\hat\beta, \hat\sigma^2)$ for $i=1,\ldots,n$:

        Sample 1   Sample 2   Sample 3   ...   Sample T
        y_11       y_12       y_13       ...   y_1T
        ...        ...        ...        ...   ...
        y_n1       y_n2       y_n3       ...   y_nT

Step 2: Compute $\hat\beta_t = (X^\top X)^{-1}X^\top y_t$ and $\hat\sigma^2_t = \sum_{i=1}^{n}(y_{it} - x_i^\top\hat\beta_t)^2/(n-p-1)$ for $t=1,\ldots,T$, where $y_t = (y_{1t},\ldots,y_{nt})^\top$.
Step 3: Compute the residuals $r_{it}$ for $i=1,\ldots,n$ and $t=1,\ldots,T$:

        Sample 1   Sample 2   Sample 3   ...   Sample T
        r_11       r_12       r_13       ...   r_1T
        ...        ...        ...        ...   ...
        r_n1       r_n2       r_n3       ...   r_nT
Normal probability plot with simulated envelope

Step 4: Compute the order statistics of $r_{it}$ in each sample:

        Sample 1   Sample 2   Sample 3   ...   Sample T
        r_(1)1     r_(1)2     r_(1)3     ...   r_(1)T
        ...        ...        ...        ...   ...
        r_(n)1     r_(n)2     r_(n)3     ...   r_(n)T

Step 5: Compute the $100(\frac{\alpha}{2})$-th, 50th and $100(1-\frac{\alpha}{2})$-th percentiles of row $i$, denoted by $p^{(i)}_{\frac{\alpha}{2}}$, $p^{(i)}_{\frac{1}{2}}$ and $p^{(i)}_{1-\frac{\alpha}{2}}$, for $i=1,\ldots,n$.
Step 6: Plot $p^{(1)}_{\frac{\alpha}{2}},\ldots,p^{(n)}_{\frac{\alpha}{2}}$ versus the expected values of the order statistics in a sample of size $n$ of the standard normal distribution.
Step 7: Plot $p^{(1)}_{\frac{1}{2}},\ldots,p^{(n)}_{\frac{1}{2}}$ versus the expected values of the order statistics in a sample of size $n$ of the standard normal distribution.
Step 8: Plot $p^{(1)}_{1-\frac{\alpha}{2}},\ldots,p^{(n)}_{1-\frac{\alpha}{2}}$ versus the expected values of the order statistics in a sample of size $n$ of the standard normal distribution.
Step 9: Plot $r_1,\ldots,r_n$ versus the expected values of the order statistics in a sample of size $n$ of the standard normal distribution.
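A minimal sketch of Steps 0-9 for an lm fit; it uses externally Studentized residuals and ppoints() as an approximation to the expected normal order statistics (the envelope() function of glmtoolbox, used later, implements a more complete version):

# Simulated envelope sketch for a normal linear model.
envelope_sketch <- function(fit, T = 100, alpha = 0.05) {
  n <- nobs(fit); X <- model.matrix(fit)
  mu <- fitted(fit); s <- summary(fit)$sigma
  R <- replicate(T, sort(rstudent(lm(rnorm(n, mu, s) ~ X - 1))))        # Steps 1-4
  bands <- apply(R, 1, quantile, probs = c(alpha/2, 0.5, 1 - alpha/2))  # Step 5
  q <- qnorm(ppoints(n))                  # approximate expected order statistics
  plot(q, sort(rstudent(fit)), xlab = "Expected quantiles",
       ylab = "Observed quantiles", ylim = range(bands, rstudent(fit))) # Step 9
  matlines(q, t(bands), lty = c(2, 1, 2), col = 1)                      # Steps 6-8
}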
Influence or Sensitivity analysis

The Cook's distance, denoted here by $CD_i$, is a measure of the influence of the $i$-th observation on the estimate of $\beta$, which consists of the "distance" between $\hat\beta$ and $\hat\beta_{(i)}$ using a metric based on the matrix $\widehat{\text{Var}}(\hat\beta)$. The Cook's distance may be written as follows (see Cook [1977]):
$$CD_i = \frac{1}{p+1}\bigl(\hat\beta_{(i)} - \hat\beta\bigr)^\top\bigl[\widehat{\text{Var}}(\hat\beta)\bigr]^{-1}\bigl(\hat\beta_{(i)} - \hat\beta\bigr) = \frac{1}{\hat\sigma^2(p+1)}\bigl(\hat\beta_{(i)} - \hat\beta\bigr)^\top X^\top X\bigl(\hat\beta_{(i)} - \hat\beta\bigr) = \frac{1}{\hat\sigma^2(p+1)}\sum_{l=1}^{n}\bigl(x_l^\top\hat\beta_{(i)} - x_l^\top\hat\beta\bigr)^2$$
If $\hat\beta_{(i)}$ is replaced by $\hat\beta - \frac{r_i}{1-h_{ii}}(X^\top X)^{-1}x_i$, then $CD_i$ becomes
$$CD_i = \frac{1}{p+1}\,\frac{(y_i - x_i^\top\hat\beta)^2}{\hat\sigma^2(1-h_{ii})}\,\frac{h_{ii}}{1-h_{ii}} = \frac{1}{p+1}\,(r^{\text{I}}_i)^2\,\frac{h_{ii}}{1-h_{ii}}.$$
Influence or sensitivity analysis

(1) The matrix $\widehat{\text{Var}}(\hat\beta)$ is positive definite. Therefore, $CD_i \geq 0$, and $CD_i = 0$ if and only if $\hat\beta = \hat\beta_{(i)}$.
(2) If $r^{\text{I}}_i = 0$ (that is, if $y_i = \hat\mu_i$), then $\hat\beta_{(i)} = \hat\beta$ and $CD_i = 0$.
(3) For fixed $h_{ii}$, the value of the Cook's distance is a strictly increasing function of $(r^{\text{I}}_i)^2$.
(4) For fixed $(r^{\text{I}}_i)^2$, the value of the Cook's distance is a strictly increasing function of $h_{ii}$.

A univariate version of $CD_i$ is known as $\text{DFBETA}_{j(i)}$, and it is given by
$$\text{DFBETA}_{j(i)} = \hat\beta_j - \hat\beta_{j(i)},\quad j = 0,\ldots,p,$$
where $\hat\beta_{j(i)}$ is the element $(j+1)$ of $\hat\beta_{(i)}$. Similarly, $\text{DFBETAS}_{j(i)}$ is given by
$$\text{DFBETAS}_{j(i)} = \frac{\hat\beta_j - \hat\beta_{j(i)}}{\text{se}(\hat\beta_{j(i)})},\quad j = 0,\ldots,p.$$
Leverage points

The estimation of $\mu = X\beta$ may be written as $\hat\mu = X\hat\beta = Hy$, where $H$ is the orthogonal projection matrix. Thus,
$$\hat\mu_i = h_i^\top y = h_{i1}y_1 + \ldots + h_{in}y_n,\quad i = 1,\ldots,n,$$
where $h_i = (h_{i1},\ldots,h_{in})^\top$ represents the $i$-th row of $H$. Therefore (see Appendix A-2),
$$\frac{\partial\hat\mu_i}{\partial y_i} = h_{ii}\qquad\text{and}\qquad \frac{1}{n}\sum_{i=1}^{n}\frac{\partial\hat\mu_i}{\partial y_i} = \frac{1}{n}\text{trace}(H) = \frac{p+1}{n}.$$
As a result, "high" values of $h_{ii}$ indicate that $y_i$ has a relatively "large" impact on $\hat\mu_i$, meaning that there is relatively little information about the relationship between $y$ and $x$ when $x = x_i$. Thus, the accuracy of the estimation of $\mu$ when $x = x_i$ is lower than that for other values of $x$, as
$$\text{E}(\hat\mu_i) = \text{E}(x_i^\top\hat\beta) = x_i^\top\text{E}(\hat\beta) = \mu_i$$
and
$$\text{Var}(\hat\mu_i) = \text{Var}(x_i^\top\hat\beta) = x_i^\top\text{Var}(\hat\beta)x_i = \sigma^2 x_i^\top(X^\top X)^{-1}x_i = \sigma^2 h_{ii}.$$
Leverage points

If the model has an intercept, then (Cook and Weisberg [1982, page 12])
$$\frac{1}{n} \leq h_{ii} \leq \frac{1}{c},$$
where $c$ represents the number of rows equal to $x_i$ in $X$. In addition, if $h_{ii} = 1$ then $\hat\mu_i = y_i$.

For example, if $\mu_i = \beta x_i$ for $i=1,\ldots,n$ then
$$h_{ii} = \frac{x_i^2}{\sum_{j=1}^{n}x_j^2}.$$
Similarly, if $\mu_i = \beta_0 + \beta_1 x_i$ for $i=1,\ldots,n$ then
$$h_{ii} = \frac{1}{n} + \frac{(x_i-\bar{x})^2}{\sum_{j=1}^{n}(x_j-\bar{x})^2}.$$
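The simple-regression expression can be checked against hatvalues(); a quick sketch with simulated data:

# Verify the leverage formula for simple linear regression.
set.seed(4)
x <- runif(25); y <- 1 + 2*x + rnorm(25)
fit <- lm(y ~ x)
h_formula <- 1/25 + (x - mean(x))^2/sum((x - mean(x))^2)
all.equal(unname(hatvalues(fit)), h_formula)   # TRUE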
Testing the constant variance assumption

To assess the assumption of constant variance, the following heteroskedastic model is considered:
$$Y_i = \mu_i + \epsilon_i,\qquad \mu_i = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip},\qquad \epsilon_i \overset{ind}{\sim} \text{Normal}(0,\sigma_i^2),\qquad \log(\sigma_i^2) = \gamma_0 + \gamma_1 m_{i1} + \ldots + \gamma_q m_{iq},$$
where $m_i = (m_{i1},\ldots,m_{iq})^\top$ are the values of $q$ explanatory variables, quantitative and/or qualitative, measured on the $i$-th individual. The sets of variables $x_1,\ldots,x_p$ and $m_1,\ldots,m_q$ may or may not coincide. Thus, assessing the assumption of constant variance is equivalent to assessing the following hypothesis in the heteroskedastic model described above:
$$H_0: \gamma = 0\quad\text{versus}\quad H_1: \gamma \neq 0,$$
where $\gamma = (\gamma_1,\ldots,\gamma_q)^\top$. If $H_0$ is not rejected, then it is possible to conclude that there is no significant statistical evidence against the assumption of constant variance.
Testing the constant variance assumption

The decision rule is

"Reject $H_0$ at the approximate significance level of $100(\alpha)\%$ if $\xi > \chi^2_{1-\alpha}(q)$"

The statistic of the test is given by
$$\xi = \frac{1}{2}\,z^\top\tilde{M}(\tilde{M}^\top\tilde{M})^{-1}\tilde{M}^\top z,$$
where $\tilde{M} = \bigl(I - n^{-1}11^\top\bigr)M$, $M = (m_1,\ldots,m_n)^\top$ and $z = (z_1,\ldots,z_n)^\top$, in which
$$z_i = \frac{(y_i - \hat\mu_i)^2}{\bar\phi} - 1\qquad\text{and}\qquad \bar\phi = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat\mu_i)^2.$$
See Breusch and Pagan [1979], Cook and Weisberg [1983].
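A sketch of the statistic ξ, under the assumption that m₁,…,m_q are taken to be the model's own covariates (also the default behavior of the vdtest() function used later):

# Score test for constant variance, with M built from the model's covariates.
bp_score <- function(fit) {
  M <- model.matrix(fit)[, -1, drop = FALSE]     # m_1,...,m_q (drop intercept)
  z <- resid(fit)^2/mean(resid(fit)^2) - 1
  Mt <- scale(M, center = TRUE, scale = FALSE)   # (I - n^{-1} 1 1') M
  xi <- drop(t(z) %*% Mt %*% solve(crossprod(Mt)) %*% t(Mt) %*% z)/2
  c(statistic = xi, df = ncol(M),
    p.value = pchisq(xi, df = ncol(M), lower.tail = FALSE))
}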
Box-Tidwell transformation

Let's consider the following normal nonlinear model:
$$Y_i = \mu_i + \epsilon_i,\qquad \mu_i = \beta_0 + \beta_1 h_1(x_{i1}) + \ldots + \beta_r h_r(x_{ir}) + \beta_{r+1}x_{i(r+1)} + \ldots + \beta_p x_{ip},\qquad \epsilon_i \overset{ind}{\sim} \text{Normal}(0,\sigma^2),$$
where $x_{i1},\ldots,x_{ir}$ are the values of $r$ quantitative and strictly positive explanatory variables measured on the subject $i$, and
$$h_j(x) = \begin{cases} x^{\tau_j} & \text{if } \tau_j \neq 0,\\ \log(x) & \text{if } \tau_j = 0, \end{cases}\qquad j = 1,\ldots,r.$$
Hence, the interest parameter vector is $(\beta_0,\beta_1,\ldots,\beta_p,\tau_1,\ldots,\tau_r)^\top$. Box and Tidwell [1962] provide an efficient algorithm to estimate $\tau = (\tau_1,\ldots,\tau_r)^\top$ by the maximum likelihood method. Note that evaluation of the following hypothesis systems is of interest:
$$H_0: \tau_j = 1\quad\text{versus}\quad H_1: \tau_j \neq 1,$$
for $j = 1,\ldots,r$, and
$$H_0: (\tau_1,\ldots,\tau_r)^\top = 1\quad\text{versus}\quad H_1: (\tau_1,\ldots,\tau_r)^\top \neq 1.$$
Power generation by windmills

The analysis of these data, available in the object windmill of the library GLMsData, aims to explain the differences between the amounts of direct current (DC) output from windmills using the differences between wind velocities (Wind).

[Figure: amount of direct current versus wind velocity.]

$$Y_i = \mu_i + \epsilon_i,\qquad \mu_i = \beta_0 + \beta_1 x_i,\qquad \epsilon_i \overset{ind}{\sim} \text{Normal}(0,\sigma^2),$$
where
$y$ = amount of direct current output from windmills,
$x$ = wind velocity.

Is there a function $h(\cdot)$ such that the relationship between $y$ and $h(x)$ is approximately linear?
Box-Tidwell transformation

> args(BoxTidwell.lm)
> help(BoxTidwell.lm)
> data(windmill, package="GLMsData")
> fit <- lm(DC ~ Wind, data=windmill)
>
> BoxTidwell(fit, transf= ~ Wind)

The row reports $\hat\tau_j$ and the test of $H_0: \tau_j = 1$ versus $H_1: \tau_j \neq 1$:

     Estimate Std.Error   z-value   Pr(>|z|)
Wind -0.83334   0.15566 -11.77799 < 2.22e-16

Wald test for null hypothesis that all taus are 1 ($H_0: (\tau_1,\ldots,\tau_r)^\top = 1$ versus $H_1: (\tau_1,\ldots,\tau_r)^\top \neq 1$):
chi = 138.72 , df = 1 , Pr(>chi) = < 2.22e-16

>
> fit2 <- update(fit,formula = ~ I(Wind**(-0.833)))
> c(adjR2(fit),adjR2(fit2))
[1] 0.86904 0.98010
> c(AIC(fit),AIC(fit2))
[1] 2.67724 -44.43086
> c(BIC(fit),BIC(fit2))
[1] 6.33387 -40.77424
Richness of the species

The analysis of these data, available in the object richness of the library glmtoolbox, aims to explain the differences between plots as related to species richness (represented by a count of the number of plant species (richness)) using the differences between plots with respect to biomass (Biomass) and soil pH level: low (pH="low"), medium (pH="mid") and high (pH="high").

[Figure: number of plant species versus biomass, by soil pH level (low, mid, high).]

$$Y_i = \mu_i + \epsilon_i,\qquad \mu_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3},\qquad \epsilon_i \overset{ind}{\sim} \text{Normal}(0,\sigma^2),$$
where
$y$ = number of plant species,
$x_1$ = biomass,
$x_2 = 0$ and $x_3 = 0$ if pH="low",
$x_2 = 1$ and $x_3 = 0$ if pH="mid",
$x_2 = 0$ and $x_3 = 1$ if pH="high".
Richness of the species

> data(richness)
> fit <- lm(Species ~ Biomass + pH, data=richness)
> residuals2(fit,type="external",plot.it=TRUE,identify=3)

[Figure: externally Studentized residuals $r^{\text{E}}_i$ versus fitted values $\hat\mu$; observations #1, #3 and #18 are flagged.]
Richness of the species

> cbind(richness,mu=fitted(fit))[c(1,3,18),]
     pH   Biomass Species       mu
1  high 0.4692972      30 38.96482
3  high 2.0897785      44 34.54483
18 high 7.2420621      36 20.49156

> envelope(fit,rep=100,identify=3)

[Figure: normal probability plot of the residuals (observed versus expected quantiles) with simulated envelope; observations #3 and #18 are flagged.]
Richness of the species

> cd <- cooks.distance(fit)
> plot(cd, type="h", xlab="Observation Index", ylab="Cook's Distance")
> identify(cd, n=4)

[Figure: Cook's distance versus observation index; observations #1, #3, #18 and #20 stand out.]
Richness of the species

Which subset of the potentially influential observations (#1, #3, #18, #20) causes the largest change in the estimate of $\beta$ when removed from the data set?

> infl <- c(1,3,18,20)
> n <- length(infl)
> metr <- chol2inv(chol(vcov(fit)))
> hm <- 2**n - 1
> results <- matrix(0, nrow=hm, ncol=1)
> id <- 1; nfil <- vector()
> for(i in 1:n){
+   pos <- combn(n,i)
+   for(j in 1:ncol(pos)){
+     fitemp <- update(fit, subset=-infl[pos[,j]])
+     difs <- coef(fitemp)-coef(fit)
+     results[id,1] <- t(difs)%*%metr%*%(difs)
+     nfil <- c(nfil,paste(infl[pos[,j]], collapse=","))
+     id <- id + 1
+   }
+ }
> nfil[results==max(results)]
[1] "3,18,20"

The largest change in the estimate of $\beta$ occurs when observations #3, #18 and #20 are removed.
Richness of the species
> a <- round(summary(fit)$coefficients, digits=5); a
Estimate Std. Error t value Pr(>|t|)
(Intercept) 17.62484 0.78899 22.33861 0
Biomass -2.72758 0.17167 -15.88830 0
pHmid 11.97819 1.00200 11.95431 0
pHhigh 22.62002 1.08183 20.90895 0
>
> fit2 <- update(fit, subset=-c(3,18,20))
> b <- round(summary(fit2)$coefficients, digits=5); b
Estimate Std. Error t value Pr(>|t|)
(Intercept) 17.51424 0.66545 26.31922 0
Biomass -2.67778 0.14883 -17.99242 0
pHmid 11.91347 0.83936 14.19345 0
pHhigh 21.85045 0.92175 23.70550 0
>
> round(100*(b[,1:2]-a[,1:2])/abs(a[,1:2]), digits=3)
Estimate Std. Error
(Intercept) -0.628 -15.658
Biomass 1.826 -13.305
pHmid -0.540 -16.232
pHhigh -3.402 -14.797
Richness of the species

> h <- apply(svd(model.matrix(fit))$u**2,1,sum)
> plot(h, type="h", xlab="Observation Index", ylab=expression(h[ii]))
> identify(h, n=7)

[Figure: leverages $h_{ii}$ versus observation index; observations #1, #10, #11, #20, #30, #44 and #51 stand out.]

> richness[c(1,10,11,20,30,44,51), c("pH","Biomass")]
> boxplot(Biomass ~ pH, data=richness)
Richness of the species
> vdtest(fit)
Score test for varying dispersion parameter
Statistic = 11.07537
degrees of freedom = 3
p-value = 0.011325
> fit3 <- update(fit, subset=-c(18))
> vdtest(fit3)
Score test for varying dispersion parameter
Statistic = 5.29641
degrees of freedom = 3
p-value = 0.15134
> vdtest(fit3,varformula=~ Biomass + pH + Biomass:pH)
Score test for varying dispersion parameter
Statistic = 6.14185
degrees of freedom = 5
p-value = 0.29266
Lognormal distribution

If $Y \sim \text{Normal}(\mu,\sigma^2)$, then $T = \exp(Y) \sim \text{Lognormal}(\exp(\mu),\sigma^2)$.
If $T \sim \text{Lognormal}(\eta,\sigma^2)$, then $Y = \log(T) \sim \text{Normal}(\log(\eta),\sigma^2)$.
If $T \sim \text{Lognormal}(\eta,\sigma^2)$, then:
(1) The distribution of $T$ is strictly positive and right-skewed.
(2) The probability density function of $T$ is
$$f_T(t;\eta,\sigma^2) = \frac{1}{t\sqrt{2\pi\sigma^2}}\exp\left[-\frac{\bigl(\log(t)-\log(\eta)\bigr)^2}{2\sigma^2}\right]\text{I}(0,\infty).$$
(3) The mean, variance and coefficient of variation of $T$ are
$$\eta\exp\left(\frac{\sigma^2}{2}\right),\qquad \eta^2\exp\bigl(\sigma^2\bigr)\bigl[\exp\bigl(\sigma^2\bigr)-1\bigr]\qquad\text{and}\qquad \sqrt{\exp(\sigma^2)-1}.$$
(4) The distribution of $aT$ is $\text{Lognormal}(a\eta,\sigma^2)$ for every positive constant $a$.
Lognormal regression

(5) The $100(\alpha)$-th percentile of $T$ reduces to $Q_T(\alpha) = \eta\exp(\sigma Z_\alpha)$ for all $\alpha\in(0,1)$, where $Z_\alpha$ represents the $100(\alpha)$-th percentile of the Normal(0,1). For example, the median of $T$ is $\eta$.

The model
$$T_i = \eta_i\,\varepsilon_i,\qquad \log(\eta_i) = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip},\qquad \varepsilon_i \overset{ind}{\sim} \text{Lognormal}(1,\sigma^2)$$
is equivalent to
$$Y_i = \mu_i + \epsilon_i,\qquad \mu_i = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip},\qquad \epsilon_i \overset{ind}{\sim} \text{Normal}(0,\sigma^2),$$
where $Y_i = \log(T_i)$, $\mu_i = \log(\eta_i)$ and $\epsilon_i = \log(\varepsilon_i)$. Therefore, if $T_i \overset{ind}{\sim} \text{Lognormal}(\eta_i,\sigma^2)$, then
(1) $\text{E}(T_i) = \eta_i\exp(\sigma^2/2)$
(2) $\text{CV}(T_i) = \sqrt{\exp(\sigma^2)-1}$
(3) $Q_{T_i}(\alpha) = \eta_i\exp(\sigma Z_\alpha)$
Gasoline yield
The analysis of the data in the object Auto of the library ISLR is aimed at explaining the differences between the vehicles with respect to their gasoline yield (mpg), in miles per gallon, using other vehicle characteristics such as their horsepower (horsepower), their weight (weight), in pounds, and their origin (origin).

[Figure: log(gasoline yield) versus horsepower, grouped by weight (low, mid, high)]
T_i = η_i ε_i,
log(η_i) = β_0 + β_1 x_{i1} + β_2 x_{i2} + β_3 x_{i3} + β_4 x_{i4} + β_5 x_{i1} x_{i2},
ε_i ind∼ Lognormal(1, σ²)

t = Gasoline yield
x_1 = Horsepower
x_2 = Weight
x_3 = 0 and x_4 = 0 for America
x_3 = 1 and x_4 = 0 for Europe
x_3 = 0 and x_4 = 1 for Japan
98 / 430
Gasoline yield
> data(Auto, package="ISLR")
> str(Auto)
> summary(Auto)
>
> fit1 <- lm(mpg ~ horsepower*weight*origin, data=Auto)
> fit2 <- lm(log(mpg) ~ horsepower*weight*origin, data=Auto)
>
> stepCriterion(fit1, criterion="bic", trace=FALSE)$final
"~ weight + horsepower + origin + weight:horsepower"
> stepCriterion(fit2, criterion="bic", trace=FALSE)$final
"~ weight + horsepower + origin + weight:horsepower"
>
> fit1a <- update(fit1, formula= ~ weight*horsepower + origin)
> fit2a <- update(fit2, formula= ~ weight*horsepower + origin)
>
> adjR2(fit1a, fit2a, verbose=FALSE)[,4]
0.7517 0.8006
> c(AIC(fit1a), BIC(fit1a))
2184.288 2208.116
> c(AIC(fit2a), BIC(fit2a)) + 2*sum(log(Auto[,"mpg"]))
2070.695 2094.522
>
> summary(fit2a)
99 / 430
Lung capacity
In these data, available in the lungcap object of the library GLMsData, the lung capacity of 654 individuals is measured by their forced expiratory volume (FEV), in liters. The height (Ht), in inches, gender (Gender), and smoking habit (Smoke) of each individual were also recorded. The data analysis aims to explain the differences between lung capacities from the differences in height, gender, and smoking behavior.

[Figure: log(forced expiratory volume) versus height, by gender (F, M)]
T_i = η_i ε_i,
log(η_i) = β_0 + β_1 x_{i1} + β_2 x_{i2} + β_3 x_{i3},
ε_i ind∼ Lognormal(1, σ²)

t = Forced expiratory volume
x_1 = Height
x_2 = 0 if Gender="F" and x_2 = 1 if Gender="M"
x_3 = 0 for non-smokers and x_3 = 1 for smokers
100 / 430
Lung capacity
> data(lungcap, package="GLMsData")
> str(lungcap)
> summary(lungcap)
>
> fit1 <- lm(FEV ~ Ht*Gender*Smoke, data=lungcap)
> fit2 <- lm(log(FEV) ~ Ht*Gender*Smoke, data=lungcap)
>
> stepCriterion(fit1, criterion="aic", trace=FALSE)$final
"~ Ht + Gender + Ht:Gender"
> stepCriterion(fit2, criterion="aic", trace=FALSE)$final
"~ Ht + Gender"
>
> fit1a <- update(fit1, formula= ~ Ht + Gender + Ht:Gender)
> fit2a <- update(fit2, formula= ~ Ht + Gender)
>
> adjR2(fit1a, fit2a, verbose=FALSE)[,4]
0.7649 0.7958
> c(AIC(fit1a), BIC(fit1a))
728.5365 750.9521
> c(AIC(fit2a), BIC(fit2a)) + 2*sum(log(lungcap[,"FEV"]))
582.2044 600.1368
>
> summary(fit2a)
101 / 430
Generalized Linear Models
102 / 430
Motivation
Normal linear models are based on the following assumptions:
The response variable is continuous, symmetrical, and its
variance does not depend on its mean,
Covariates have additive effects on the mean of the
response.
There are many situations in which those assumptions are
incorrect. As an example, when:
The response is binary, a discrete proportion, or a count,
The response is continuous, but skewed to the right and
strictly positive,
Covariates act multiplicatively on the mean of the
response.
103 / 430
Motivation
Generalized Linear Models (GLM) extend the normal linear
model by allowing for
some non-normal response distributions, such as
Bernoulli : binary responses
Binomial : discrete proportions
Poisson : counts
Gamma, Inverse Gaussian : continuous and right skewed
an alternative function to the identity that connects the mean
of the response and the linear predictor, allowing for
effects other than additive effects to be accounted for.
Identity function : additive eects
Logarithmic function : multiplicative eects
104 / 430
Formulation

Random component: States the probability distribution assumed to describe the response, Y, which may be Normal, Bernoulli, Binomial, Poisson, Gamma, and Inverse Normal, among others.

Systematic component: Continuous and/or discrete explanatory variables in the linear predictor, that is,

η = β_0 + β_1 x_1 + … + β_p x_p

Link function: Continuous, strictly monotone, and twice differentiable function, g(·), which connects/links the mean of the response variable, µ, to the systematic component, that is,

g(µ) = β_0 + β_1 x_1 + … + β_p x_p
105 / 430
Examples
106 / 430
Liver cancer in female mice
These data, extracted from Zhang and Zelterman [1999] and available in the object liver of the library glmtoolbox, arose from an experiment where 8 groups of varying sizes (exposed) of female mice were continuously fed dietary concentrations of 2-Acetylaminofluorene (2-AA) (dose), in parts per 10⁴. These data consist of the number of liver neoplasms in mice (cancer) observed for 18 months.

[Figure: % of mice with liver cancer versus dose of 2-Acetylaminofluorene]

m_i Y_i ind∼ Binomial(m_i, µ_i),
g(µ_i) = β_0 + β_1 x_i

m = Number of exposed mice
y = Proportion of mice with cancer
x = Dose of 2-AA, in parts per 10⁴
107 / 430
Insecticides
These data, extracted from Hewlett and Plackett [1950] and available
in the object
deposit of the library GLMsData, arose from an experiment
in which groups of insects were exposed to deposits of insecticides. The
total number of insects exposed in each group (
Number), the number
of the insects killed after six days exposure (
Killed), and the amount
(
Deposit) and type of insecticide used (Insecticide) were recorded.
[Figure: % of killed insects versus amount of insecticide, in milligrams, by type (A, B, C)]

m_i Y_i ind∼ Binomial(m_i, µ_i),
g(µ_i) = β_0 + β_1 x_{i1} + β_2 x_{i2} + β_3 x_{i3}

m = Number of exposed insects
y = Proportion of killed insects
x_1 = Amount of insecticide, in mg.
x_2 = 0 and x_3 = 0 if Insecticide="A"
x_2 = 1 and x_3 = 0 if Insecticide="B"
x_2 = 0 and x_3 = 1 if Insecticide="C"
108 / 430
Burn injury study
This data set, discussed in Hosmer et al. [2013, page 27] and available in the object burn1000 of the library aplore3, is a subset of 1000 subjects treated between 2000 and 2007 at 40 different burn facilities. Available information includes patient demographics, total burn surface area and presence of inhalation injury. The outcome of interest is survival to hospital discharge. The data set includes

age (x_1). The patient's age at admission, in years.
gender (x_2). The patient's gender: "Female" and "Male".
race (x_3). The patient's race: "Non-White" and "White".
tbsa (x_4). Total burn surface area, in percentage.
inh_inj (x_5). Burn involved inhalation injury? "No" and "Yes".
flame (x_6). Flame involved in burn injury? "No" and "Yes".
death (y). Hospital discharge status: "Alive" and "Dead".

Y_i ind∼ Bernoulli(µ_i),
g(µ_i) = β_0 + β_1 x_{i1} + … + β_6 x_{i6}
109 / 430
Dilution assay
These data, extracted from Ridout (1990) and available in the object dilution of the library glmtoolbox, are counts of virus particles (Count) at different dilutions (Dilution). The plot of the data indicates that the location as well as the dispersion of the counts of virus particles increase as the dilution volume increases.

[Figure: count of virus particles versus dilution volume]

Y_i ind∼ Poisson(µ_i),
µ_i = β_0 x_i^{β_1}

y = Count of virus particles.
x = Dilution volume.
110 / 430
Skin cancer in women
These data, extracted from Kleinbaum et al. [2013, page 751] and available in the object skincancer of the library glmtoolbox, describe the number of non-melanoma skin cancer cases (cases) among women (population) stratified by age (age: "15-24", "25-34", "35-44", "45-54", "55-64", "65-74", "75-84", "85+") in two metropolitan areas of the US (city: "St. Paul", in Minneapolis, and "Fort Worth", in Dallas).

[Figure: incidence rate of skin cancer ×10⁴ by age range, for St. Paul and Fort Worth]

Y_i ind∼ Poisson(µ_i),
log(10⁴ µ_i/λ_i) = β_0 + β_1 x_{i1} + β_2 x_{i2}

y = Number of cases of skin cancer.
λ = Number of women exposed.
x_1 = 1 if city="Fort Worth" and x_1 = 0 if city="St. Paul".
x_2 = Midpoint of age range.
111 / 430
Advertising
This data set, extracted from James et al. [2013, page 15] and available in the object advertising of the library glmtoolbox, consists of the sales (sales), in thousands of units, of a product in 200 different markets, along with advertising budgets, in thousands of dollars, for the product in each of those markets for television (TV) and radio (radio).

[Figure: sales versus advertising budget for TV, by level of the advertising budget for radio (low, mid, high)]

Y_i ind∼ Normal(µ_i, σ²),
log(µ_i) = β_0 + β_1 log(x_{i1}) + β_2 x_{i2} + β_3 log(x_{i1}) x_{i2}

y = Sales, in thousands of units.
x_1 = Advertising budget for TV.
x_2 = Advertising budget for radio.
112 / 430
Atlantic Croakers
These data, available in the object Croaker2 of the library FSAdata,
describe the age-at-capture (
age), in years, assigned by using sagit-
tal otoliths, and the total length (
tl), in millimeters, for 204 females
(
sex=“F”) and 114 males (sex=“M”) belonging to a species of fish better
known as Atlantic Croaker.
[Figure: total length at capture versus age-at-capture, by sex]

Y_i ind∼ Normal(µ_i, σ²),
log(µ_i) = β_0 + β_1 log(x_{i1}) + β_2 x_{i2}

y = Total length at capture, in mm
x_1 = Age-at-capture, in years
x_2 = 1 if sex="M" and x_2 = 0 if sex="F"
113 / 430
Hardened steel

These data, extracted from McCool [1980] and available in the object Steel of the library glmtoolbox, consist of the failure times (life) for hardened steel specimens in a 4-ball rolling contact fatigue test. Ten independent observations were taken at each of the four values of contact stress (stress), in pounds per square inch ×10⁶.

[Figure: time until the specimen failed versus contact stress, in psi ×10⁶]

Y_i ind∼ Gamma(µ_i, σ²),
log(µ_i) = β_0 + β_1 x_i

y = Time until the specimen failed
x = Contact stress
114 / 430
Gasoline yield
This data set, discussed in James et al. [2013, page 92] and available in the object Auto of the library ISLR, includes acceleration, horsepower, fuel efficiency, weight, and other features of different makes and models of cars. The outcome of interest is fuel consumption, in miles per gallon. The variables in this data set are the following:

mpg (y). Fuel consumption, in miles per gallon.
cylinders (x_1). Number of cylinders.
displacement (x_2). Engine displacement, in cubic inches.
horsepower (x_3). Engine horsepower.
weight (x_4). Weight, in pounds.
acceleration (x_5). Time, in seconds, to accelerate from 0 to 60 miles per hour.
origin (x_6, x_7). Origin of the vehicle: American (1), European (2) and Japanese (3).

Y_i ind∼ Gamma(µ_i, σ²),
g(µ_i) = β_0 + β_1 x_{i1} + … + β_7 x_{i7}
115 / 430
Model Formulation
116 / 430
Model formulation

Y_i ind∼ LEF(µ_i, φ/ω_i),
g(µ_i) = β_0 + β_1 x_{i1} + … + β_p x_{ip} = x_iᵀβ

Then, E(Y_i) = µ_i and Var(Y_i) = (φ/ω_i) V(µ_i), where

β = (β_0, β_1, …, β_p)ᵀ is the unknown interest parameter vector.
x_i = (1, x_{i1}, …, x_{ip})ᵀ, where (x_{i1}, …, x_{ip}) are the values of p explanatory variables, quantitative and/or qualitative, measured on the i-th individual.
φ > 0 is the unknown dispersion parameter.
ω_1, …, ω_n are known positive weights.
V(·) is the variance function, which describes the way in which Var(Y_i) is assumed to depend on µ_i.
g(·) is the link function, which is assumed to be known, strictly monotone, and twice differentiable.
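In R, these three ingredients (distribution, variance function, and link) are bundled together in a family object; the following minimal sketch (the numeric values are only illustrative) shows how to inspect them:

> fam <- Gamma(link="log")
> fam$variance(2)   # V(mu) = mu^2, so this returns 4
> fam$linkfun(2)    # g(mu) = log(mu)
> fam$linkinv(0.7)  # g^{-1}(eta) = exp(eta)
> fam$mu.eta(0.7)   # d mu/d eta = 1/g'(mu), used by the estimation algorithms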
117 / 430
Random Component
118 / 430
Linear exponential family (Continuous case)
The probability density function of Y ∼ LEF(µ, φ), where µ is its mean and φ > 0 is its dispersion parameter, may be written as

f_Y(y; µ, φ) = C(y, φ) exp{ (1/φ)[ yθ(µ) − b(θ(µ)) ] },

for some known functions θ(·), b(·) and C(·, ·). By definition,

∫ f_Y(y; µ, φ) dy = 1  ⟹  ∫ C(y, φ) exp(yθ/φ) dy = exp(b(θ)/φ).

Thus, the moment generating function of Y reduces to

M_Y(t) = E[exp(tY)] = ∫ C(y, φ) exp{ (1/φ)[ y(θ + tφ) − b(θ) ] } dy
       = exp(−b(θ)/φ) ∫ C(y, φ) exp{ (1/φ) y(θ + tφ) } dy
       = exp{ (1/φ)[ b(θ + tφ) − b(θ) ] },

where the values of t are those for which b(θ + tφ) exists.
119 / 430
Linear exponential family (Discrete case)
The probability mass function of Y ∼ LEF(µ, φ), where µ is its mean and φ > 0 is its dispersion parameter, may be written as

f_Y(y; µ, φ) = C(y, φ) exp{ (1/φ)[ yθ(µ) − b(θ(µ)) ] },

for some known functions θ(·), b(·) and C(·, ·). By definition,

Σ f_Y(y; µ, φ) = 1  ⟹  Σ C(y, φ) exp(yθ/φ) = exp(b(θ)/φ).

Thus, the moment generating function of Y reduces to

M_Y(t) = E[exp(tY)] = Σ C(y, φ) exp{ (1/φ)[ y(θ + tφ) − b(θ) ] }
       = exp(−b(θ)/φ) Σ C(y, φ) exp{ (1/φ) y(θ + tφ) }
       = exp{ (1/φ)[ b(θ + tφ) − b(θ) ] },

where the values of t are those for which b(θ + tφ) exists.
120 / 430
Linear exponential family
Therefore,

E(Y) = M_Y^(1)(0) = b′(θ),
E(Y²) = M_Y^(2)(0) = φ b″(θ) + [b′(θ)]²,

and Var(Y) = M_Y^(2)(0) − [M_Y^(1)(0)]² = φ V(µ), where the derivatives of M_Y(t) are evaluated at t = 0 and V(·) ≡ b″[θ(·)] is known as the variance function.

If Ỹ = (Y − µ)/√φ, then its moment generating function becomes

M_Ỹ(t) = exp(−tµ/√φ) M_Y(t/√φ) = exp{ −tµ/√φ + (1/φ)[ b(θ + t√φ) − b(θ) ] }.

The third-order Taylor series expansion of log[M_Ỹ(t)] around t = 0 may be written as follows:

log[M_Ỹ(t)] = (1/2) b″(θ) t² + (1/6) b‴(θ + t*√φ) √φ t³
            = (1/2) V(µ) t² + (1/6) b‴(θ + t*√φ) √φ t³,
121 / 430
Linear exponential family
where t* = (1 − δ)t for some δ ∈ (0, 1). Therefore,

lim_{φ→0} log[M_Ỹ(t)] = (1/2) V(µ) t²  ⟹  lim_{φ→0} M_Ỹ(t) = exp[ (1/2) V(µ) t² ].

Then, the moment generating function of Ỹ converges to that of the Normal(0, V(µ)). That is (see, for instance, Jorgensen [1987]),

(Y − µ)/√φ  →D  Normal(0, V(µ)) as φ → 0,

where →D denotes convergence in distribution as φ → 0.
122 / 430
Special Cases
123 / 430
Binomial distribution
Let mY ∼ Binomial(m, µ), where m is a known positive integer and µ ∈ (0, 1). The p.m.f. of Y may be written as

f_Y(y; µ, m) = C(m, my) µ^{my} (1 − µ)^{m(1−y)} I_{my}{0, 1, …, m}
             = exp{ m[ y log(µ/(1 − µ)) + log(1 − µ) ] } C(m, my) I_{my}{0, 1, …, m},

where C(m, my) denotes the binomial coefficient "m choose my". Then, θ = log[µ/(1 − µ)], b(θ) = log(1 + e^θ), V(µ) = µ(1 − µ) and φ = m⁻¹. The mean, variance, skewness and excess kurtosis of Y are

µ,   µ(1 − µ)/m,   (1 − 2µ)/√(µ(1 − µ)m)   and   [1 − 6µ(1 − µ)]/[µ(1 − µ)m],

respectively. In fact, the m.g.f. of Y reduces to

M_Y(t) = { 1 + µ[exp(t/m) − 1] }^m.

The binomial distribution tends to the Gaussian when m → ∞, that is, for "large" values of m we have Y ≈̇ Normal(µ, m⁻¹µ(1 − µ)).
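This Gaussian limit can be inspected numerically; a minimal sketch (m and µ are illustrative), comparing the c.d.f. of Y = Z/m, with Z ∼ Binomial(m, µ), against its normal approximation:

> m <- 50; mu <- 0.3
> z <- 0:m
> max(abs(pbinom(z, size=m, prob=mu) -
+         pnorm((z/m - mu)/sqrt(mu*(1-mu)/m))))  # shrinks as m grows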
124 / 430
Probability mass function of Binomial(m, µ) for m = 10
[Figure: p.m.f. of Binomial(10, µ) for µ = 0.1, 0.2, 0.4, 0.6, 0.8, 0.9]
Binary responses and discrete proportions
125 / 430
R codes
> library(ggplot2)
>
> size <- 10
> mus <- c(0.1,0.2,0.4,0.6,0.8,0.9)
> binom <- expand.grid(mu=mus,x=seq(from=0,to=1,by=1/size))
> binom <- within(binom,{prob <- dbinom(x=size*x,size=size,prob=mu)
+ cdf <- pbinom(q=size*x,size=size,prob=mu)
+ mu <- factor(mu,labels=paste("mu==",mus))
+ })
>
> dev.new()
> ggplot(binom) +
+ geom_segment(aes(x=x,xend=x,y=0,yend=prob),col="red",linewidth=1.25) +
+ labs(x="y",y=bquote(~f[Y]~"(y;m,"~mu~")")) +
+ facet_wrap(vars(mu),ncol=2,scale="free_x",labeller=label_parsed)
>
> dev.new()
> ggplot(binom) +
+ geom_segment(aes(x=x,xend=x+0.1,y=cdf,yend=cdf),col="red",linewidth=1.25) +
+ labs(x="y",y=bquote(~F[Y]~"(y;m,"~mu~")")) +
+ facet_wrap(vars(mu),ncol=2,scale="free_x",labeller=label_parsed)
>
126 / 430
Binomial regression
m_i Y_i ind∼ Binomial(m_i, µ_i),
g(µ_i) = β_0 + β_1 x_{i1} + … + β_p x_{ip}

E(Y_i) = µ_i and Var(Y_i) = m_i⁻¹ µ_i(1 − µ_i)

Logit link: g(µ_i) = log[ µ_i/(1 − µ_i) ]
Probit link: g(µ_i) = Φ⁻¹(µ_i)
Complementary log-log link: g(µ_i) = log[ −log(1 − µ_i) ]
Cauchy link: g(µ_i) = tan[ (π/2)(2µ_i − 1) ]

In this model ω_i = m_i for i = 1, …, n.
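All four links are available through R's binomial() family; a minimal sketch evaluating each at an illustrative µ = 0.8:

> binomial("logit")$linkfun(0.8)    # log(0.8/0.2)
> binomial("probit")$linkfun(0.8)   # qnorm(0.8)
> binomial("cloglog")$linkfun(0.8)  # log(-log(1 - 0.8))
> binomial("cauchit")$linkfun(0.8)  # tan((pi/2)*(2*0.8 - 1))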
127 / 430
Binomial model for m = 25
[Figure: binomial observations (m = 25) over x ∈ (0, 5) and the fitted curve
µ = exp(−4 + 1.6x) / [1 + exp(−4 + 1.6x)]]
128 / 430
Poisson distribution
Let Y ∼ Poisson(µ), where µ > 0. The probability mass function of Y may be written as follows:

f_Y(y; µ) = [ exp(−µ) µ^y / y! ] I_y{0, 1, 2, …} = exp[ y log(µ) − µ ] I_y{0, 1, 2, …} / y!

Therefore, θ = log(µ), b(θ) = exp(θ), V(µ) = µ and φ = 1. The mean, variance, skewness and excess kurtosis of Y are

µ,   µ,   µ^{−1/2}   and   µ^{−1},

respectively. Indeed, the moment generating function of Y reduces to

M_Y(t) = exp[ µ(e^t − 1) ].
129 / 430
Poisson as the limit of Binomial distribution
lim_{m→∞, π→0, mπ→µ} Binomial(m, π)
= lim C(m, y) π^y (1 − π)^{m−y} I_y{0, 1, …, m}
= lim [ m(m − 1)⋯(m − y + 1) / y! ] [ π/(1 − π) ]^y (1 − mπ/m)^m I_y{0, 1, …, m}
= lim [ m(m − 1)⋯(m − y + 1) / m^y ] (1/y!) [ mπ/(1 − π) ]^y (1 − mπ/m)^m I_y{0, 1, …, m}
= [ µ^y exp(−µ) / y! ] I_y{0, 1, …}
= Poisson(µ),

since the product of y terms m(m − 1)⋯(m − y + 1)/m^y → 1, mπ/(1 − π) → µ and (1 − mπ/m)^m → exp(−µ).
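The convergence can also be checked numerically; a minimal sketch with µ = 2 fixed and π = µ/m:

> mu <- 2; y <- 0:6
> for (m in c(10, 100, 10000))
+   print(max(abs(dbinom(y, size=m, prob=mu/m) - dpois(y, mu))))  # -> 0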
130 / 430
Probability mass function of Poisson(µ)
[Figure: p.m.f. of Poisson(µ) for µ = 1.0, 1.5, 2.0, 2.5, 3.0, 3.5]
Counts of some relatively rare event.
131 / 430
R codes
> library(ggplot2)
>
> rango <- c(0,8)
> mus <- seq(from=1,to=3.5,by=0.5)
> pois <- expand.grid(mu=mus,x=seq(from=rango[1],to=rango[2],by=1))
> pois <- within(pois,{prob <- dpois(x=x,lambda=mu)
+ cdf <- ppois(q=x,lambda=mu)
+ mu <- factor(mu,labels=paste("mu==",mus))
+ })
>
> dev.new()
> ggplot(pois) +
+ geom_segment(aes(x=x,xend=x,y=0,yend=prob),col="red",linewidth=1.25) +
+ labs(x="y",y=bquote(~f[Y]~"(y;"~mu~")")) +
+ facet_wrap(vars(mu),ncol=2,scale="free_x",labeller=label_parsed)
>
> dev.new()
> ggplot(pois) +
+ geom_segment(aes(x=x,xend=x+1,y=cdf,yend=cdf),col="red",linewidth=1.25) +
+ labs(x="y",y=bquote(~F[Y]~"(y;"~mu~")")) +
+ facet_wrap(vars(mu),ncol=2,scale="free_x",labeller=label_parsed)
>
132 / 430
Poisson regression
Y_i ind∼ Poisson(µ_i),
g(µ_i) = β_0 + β_1 x_{i1} + … + β_p x_{ip}

E(Y_i) = µ_i and Var(Y_i) = µ_i

Logarithmic link: g(µ_i) = log(µ_i)
Identity link: g(µ_i) = µ_i
Square-root link: g(µ_i) = √µ_i
133 / 430
Poisson model
[Figure: Poisson observations at x = 0, 1, …, 4 and the fitted curve µ = exp(0.5 + 0.6x)]
134 / 430
Normal distribution
Let Y ∼ Normal(µ, σ²), where µ ∈ R and σ > 0. The probability density function of Y may be written as follows:

f_Y(y; µ, σ²) = (2πσ²)^{−1/2} exp[ −(y − µ)²/(2σ²) ]
             = (2πσ²)^{−1/2} exp[ −(y² − 2yµ + µ²)/(2σ²) ]
             = exp{ (1/σ²)[ yµ − µ²/2 ] } exp[ −y²/(2σ²) ] / √(2πσ²).

Therefore, θ = µ, b(θ) = θ²/2, V(µ) = 1 and φ = σ². The mean, variance, skewness and excess kurtosis of Y are

µ,   σ²,   0   and   0,

respectively. Indeed, the moment generating function of Y reduces to

M_Y(t) = exp( µt + σ²t²/2 ).
135 / 430
Probability density function of Normal(0, σ²)
[Figure: p.d.f. for σ² = 0.1, 0.2, 0.3, 0.6, 0.9, 1.2]
Continuous and symmetric responses
136 / 430
R codes
> library(ggplot2)
>
> mu <- 0
> s2 <- c(0.1,0.2,0.3,0.6,0.9,1.2)
> x <- seq(from=-4,to=4,length=200)
> norm <- expand.grid(mu=mu,s2s=s2,x=x)
> norm <- within(norm,{pdf <- dnorm(x=x,mean=mu,sd=sqrt(s2s))
+ cdf <- pnorm(q=x,mean=mu,sd=sqrt(s2s))
+ s2s <- factor(s2s,labels=paste("sigma**2==",s2))
+ })
>
> dev.new()
> ggplot(norm) +
+ geom_line(aes(x=x,y=pdf),col="red",linewidth=1.1) +
+ labs(x="y",y=bquote(~f[Y]~"(y;"~mu~","~sigma**2~")")) +
+ facet_wrap(vars(s2s),ncol=2,scale="free_x",labeller=label_parsed)
>
> dev.new()
> ggplot(norm) +
+ geom_line(aes(x=x,y=cdf),col="red",linewidth=1.1) +
+ labs(x="y",y=bquote(~F[Y]~"(y;"~mu~","~sigma**2~")")) +
+ facet_wrap(vars(s2s),ncol=2,scale="free_x",labeller=label_parsed)
>
137 / 430
Normal regression
Y_i ind∼ Normal(µ_i, σ²/ω_i),
g(µ_i) = β_0 + β_1 x_{i1} + … + β_p x_{ip}

or, equivalently,

Y_i = µ_i + ε_i,
g(µ_i) = β_0 + β_1 x_{i1} + … + β_p x_{ip},
ε_i ind∼ Normal(0, σ²/ω_i)

E(Y_i) = µ_i and Var(Y_i) = σ²/ω_i

Inverse link: g(µ_i) = 1/µ_i
Logarithmic link: g(µ_i) = log(µ_i)
Identity link: g(µ_i) = µ_i

If ω_1 = ω_2 = … = ω_n, then the dispersion parameter is constant, which implies that the variance is also constant.
138 / 430
Normal model for σ² = 0.4
[Figure: simulated data and the fitted line µ = 1.5 + 1.4x]
139 / 430
Gamma distribution
Let Y ∼ Gamma(µ, σ²), where µ > 0 and σ > 0. The probability density function of Y may be written as follows:

f_Y(y; µ, σ²) = [ 1/Γ(1/σ²) ] (σ²µ)^{−1/σ²} y^{1/σ² − 1} exp[ −y/(σ²µ) ] I_y(0, ∞)
             = exp{ (1/σ²)[ −y/µ − log(µ) ] } y^{1/σ² − 1} / [ Γ(1/σ²)(σ²)^{1/σ²} ] I_y(0, ∞).

Therefore, θ = −1/µ, b(θ) = −log(−θ), φ = σ² and V(µ) = µ². As special cases we have the Exponential (µ = λ⁻¹, σ² = 1) and Chi-squared (µ = ν, σ² = 2ν⁻¹) distributions. The mean, variance, skewness and excess kurtosis of Y are

µ,   σ²µ²,   2σ   and   6σ²,

respectively. Indeed, the m.g.f. of Y reduces to M_Y(t) = (1 − σ²µt)^{−1/σ²}.

The Gamma distribution tends to the Gaussian when σ² → 0, that is, for "small" values of σ² we have Y ≈̇ Normal(µ, σ²µ²).
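In R, dgamma() and rgamma() use the shape/scale parameterization; under the (µ, σ²) parameterization above, shape = 1/σ² and scale = µσ² (this mapping is used in the R codes below). A quick simulation check with illustrative values:

> set.seed(1)
> mu <- 1; s2 <- 0.35
> y <- rgamma(1e6, shape=1/s2, scale=mu*s2)
> c(mean(y), mu)              # E(Y) = mu
> c(var(y), s2*mu^2)          # Var(Y) = sigma^2*mu^2
> c(sd(y)/mean(y), sqrt(s2))  # coefficient of variation = sigma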
140 / 430
Probability density function of Gamma(1, σ²)
[Figure: p.d.f. for σ² = 0.08, 0.15, 0.35, 0.65, 0.9, 1.2]
Continuous, strictly positive, and right-skewed responses
141 / 430
R codes
> library(ggplot2)
>
> mu <- 1
> s2 <- c(0.08,0.15,0.35,0.65,0.9,1.2)
> x <- seq(from=0.01,to=2.5,length=200)
> gama <- expand.grid(mu=mu,s2s=s2,x=x)
> gama <- within(gama,{pdf <- dgamma(x=x,shape=1/s2s,scale=mu*s2s)
+ cdf <- pgamma(q=x,shape=1/s2s,scale=mu*s2s)
+ s2s <- factor(s2s,labels=paste("sigma**2==",s2))})
>
> dev.new()
> ggplot(gama) +
+ geom_line(aes(x=x,y=pdf),col="red",linewidth=1.1) +
+ labs(x="y",y=bquote(~f[Y]~"(y;"~mu~","~sigma**2~")")) +
+ facet_wrap(vars(s2s),ncol=2,scale="free_x",labeller=label_parsed)
>
> dev.new()
> ggplot(gama) +
+ geom_line(aes(x=x,y=cdf),col="red",linewidth=1.1) +
+ labs(x="y",y=bquote(~F[Y]~"(y;"~mu~","~sigma**2~")")) +
+ facet_wrap(vars(s2s),ncol=2,scale="free_x",labeller=label_parsed)
>
142 / 430
Gamma regression
Y_i ind∼ Gamma(µ_i, σ²/ω_i),
g(µ_i) = β_0 + β_1 x_{i1} + … + β_p x_{ip}

or, equivalently,

Y_i = µ_i ε_i,
g(µ_i) = β_0 + β_1 x_{i1} + … + β_p x_{ip},
ε_i ind∼ Gamma(1, σ²/ω_i)

E(Y_i) = µ_i and Var(Y_i) = (σ²/ω_i) µ_i²

Inverse link: g(µ_i) = 1/µ_i
Logarithmic link: g(µ_i) = log(µ_i)
Identity link: g(µ_i) = µ_i

If ω_1 = ω_2 = … = ω_n, then the dispersion parameter is constant, which implies that the coefficient of variation is also constant.
143 / 430
Gamma model for σ² = 0.4
[Figure: simulated data and the fitted curve µ = 1/(0.5 + 0.4x)]
144 / 430
Inverse-Gaussian distribution
Let Y ∼ Inverse Gaussian(µ, σ²), where µ > 0 and σ > 0. The probability density function of Y may be written as follows:

f_Y(y; µ, σ²) = (2πσ²y³)^{−1/2} exp[ −(y − µ)²/(2σ²µ²y) ] I_y(0, ∞)
             = exp{ (1/σ²)[ −y/(2µ²) + 1/µ ] } exp[ −1/(2σ²y) ] (2πσ²y³)^{−1/2} I_y(0, ∞).

Therefore, θ = −1/(2µ²), b(θ) = −√(−2θ), φ = σ² and V(µ) = µ³. The mean, variance, skewness and excess kurtosis of Y are

µ,   σ²µ³,   3σ√µ   and   15σ²µ,

respectively. Indeed, the m.g.f. of Y reduces to

M_Y(t) = exp{ [1/(σ²µ)][ 1 − √(1 − 2σ²µ²t) ] }.

The Inverse Gaussian distribution tends to the Gaussian when σ² → 0, that is, for "small" values of σ² we have Y ≈̇ Normal(µ, σ²µ³).
145 / 430
Probability density function of Inverse-Gaussian(1, σ²)
[Figure: p.d.f. for σ² = 0.025, 0.1, 0.5, 1, 2, 4]
Continuous, strictly positive, and right-skewed responses
146 / 430
R codes
> library(ggplot2)
> library(statmod)
>
> mu <- 1
> s2 <- c(0.025,0.1,0.5,1,2,4)
> x <- seq(from=0.01,to=2.5,length=200)
> invg <- expand.grid(mu=mu,s2s=s2,x=x)
> invg <- within(invg,{pdf <- dinvgauss(x=x,mean=mu,dispersion=s2s)
+ cdf <- pinvgauss(q=x,mean=mu,dispersion=s2s)
+ s2s <- factor(s2s,labels=paste("sigma**2==",s2))})
>
> dev.new()
> ggplot(invg) +
+ geom_line(aes(x=x,y=pdf),col="red",linewidth=1.1) +
+ labs(x="y",y=bquote(~f[Y]~"(y;"~mu~","~sigma**2~")")) +
+ facet_wrap(vars(s2s),ncol=2,scale="free_x",labeller=label_parsed)
>
> dev.new()
> ggplot(invg) +
+ geom_line(aes(x=x,y=cdf),col="red",linewidth=1.1) +
+ labs(x="y",y=bquote(~F[Y]~"(y;"~mu~","~sigma**2~")")) +
+ facet_wrap(vars(s2s),ncol=2,scale="free_x",labeller=label_parsed)
>
147 / 430
Inverse-Gaussian regression
Y_i ind∼ Inverse Gaussian(µ_i, σ²/ω_i),
g(µ_i) = β_0 + β_1 x_{i1} + … + β_p x_{ip}

E(Y_i) = µ_i and Var(Y_i) = (σ²/ω_i) µ_i³

Inverse link: g(µ_i) = 1/µ_i
Logarithmic link: g(µ_i) = log(µ_i)
Identity link: g(µ_i) = µ_i

The values of ω_1, …, ω_n are assumed to be known and positive.
148 / 430
Inverse-Gaussian model for σ² = 0.4
[Figure: simulated data and the fitted curve µ = 1/(0.6 + 0.2x)²]
149 / 430
Main distributions belonging to the LEF
Distribution               θ_i = θ(µ_i)        b(θ_i)            V(µ_i)        ω_i   φ
Binomial(m_i, µ_i)         log[µ_i/(1−µ_i)]    log(1 + e^{θ_i})  µ_i(1 − µ_i)  m_i   1
Poisson(µ_i)               log(µ_i)            e^{θ_i}           µ_i           1     1
Normal(µ_i, σ²)            µ_i                 θ_i²/2            1             1     σ²
Gamma(µ_i, σ²)             −1/µ_i              −log(−θ_i)        µ_i²          1     σ²
Inv. Gaussian(µ_i, σ²)     −1/(2µ_i²)          −√(−2θ_i)         µ_i³          1     σ²

See, for example, McCullagh and Nelder [1989, page 30].
150 / 430
Exercises
(1) Let mY ∼ Negative Binomial(m, µ), where m is a positive integer and µ > 0. The probability mass function of Y may be written as

f_Y(y; m, µ) = C(m(y + 1) − 1, my) [ µ/(µ + 1) ]^{my} [ 1/(µ + 1) ]^m I_{my}{0, 1, 2, …},

where C(·, ·) denotes the binomial coefficient.
(A) Show that the distribution of Y belongs to the LEF.
(B) Find the mean and variance of Y.
(C) Find the moment generating function of Y.

(2) Let Y ∼ Negative Binomial(µ, σ²), where µ > 0 and σ² is assumed to be known. The probability mass function of Y may be written as

f_Y(y; µ, σ²) = { Γ(y + 1/σ²) / [ Γ(1/σ²) Γ(y + 1) ] } [ σ²µ/(σ²µ + 1) ]^y [ 1/(σ²µ + 1) ]^{1/σ²} I_y{0, 1, 2, …}.

(A) Show that the distribution of Y belongs to the LEF.
(B) Find the mean and variance of Y.
(C) Find the moment generating function of Y.
151 / 430
(3) Let Y ∼ Beta(µ, σ²), where 0 < µ < 1. The probability density function of Y may be written as

f_Y(y; µ, σ²) = { Γ(σ²) / [ Γ(µσ²) Γ((1 − µ)σ²) ] } y^{µσ² − 1} (1 − y)^{(1−µ)σ² − 1},  0 < y < 1.

(A) Does the Beta distribution belong to the exponential family of distributions?
(B) Does the Beta distribution belong to the LEF?

(4) Let Y ∼ Weibull(µ, α), where µ, α > 0. The probability density function of Y may be written as

f_Y(y; µ, α) = [ α c(α)/µ ] (y/µ)^{α − 1} exp[ −c(α)(y/µ)^α ],  y > 0,

where c(α) = [Γ(1 + 1/α)]^α.
(A) Does the Weibull distribution belong to the exponential family of distributions?
(B) Does the Weibull distribution belong to the LEF?
(C) Does the Weibull distribution belong to the LEF when α = 1?
152 / 430
Parameter estimation
153 / 430
Parameter estimation
The maximum likelihood method is used for parameter estimation. Under certain conditions, known as regularity conditions, this method yields estimators with desirable statistical properties, such as consistency, asymptotic unbiasedness, asymptotic efficiency, and asymptotic normality. This method consists of maximizing the likelihood function, denoted here by L(β), which corresponds to the joint probability density function (or joint probability mass function) of Y_1, …, Y_n, viewed as a function of β rather than as a function of y_1, …, y_n. Thus,

β̂ = argmax_{β ∈ R^{p+1}} L(β).

Because Y_1, …, Y_n are independent, L(β) reduces to

L(β) = ∏_{i=1}^n f_{Y_i}(y_i; µ_i, φ/ω_i)
     = ∏_{i=1}^n C(y_i, φ/ω_i) exp{ (ω_i/φ)[ y_iθ(µ_i) − b(θ(µ_i)) ] }.
154 / 430
Parameter estimation
The function log(x) is strictly increasing, so any β that maximizes log[L(β)] will also maximize L(β). In addition, the computation of the gradient of log[L(β)] is easier than that of L(β), as the latter is a product of n terms while the former is a sum of n terms. Therefore, a more convenient form for β̂ is the following:

β̂ = argmax_{β ∈ R^{p+1}} ℓ(β),

where

ℓ(β) = log[L(β)] = Σ_{i=1}^n log[ f_{Y_i}(y_i; µ_i, φ/ω_i) ]
     = Σ_{i=1}^n log[ C(y_i, φ/ω_i) ] + (1/φ) Σ_{i=1}^n ω_i[ y_iθ(µ_i) − b(θ(µ_i)) ]

is the log-likelihood function of β. Note that Σ_{i=1}^n log[C(y_i, φ/ω_i)] and φ > 0 do not depend on β.
155 / 430
Parameter estimation
So,

β̂ = argmax_{β ∈ R^{p+1}} Σ_{i=1}^n ω_i{ y_iθ(µ_i) − b[θ(µ_i)] },

that is, β̂ does not depend on φ.

Binomial: β̂ = argmax_{β ∈ R^{p+1}} Σ_{i=1}^n m_i{ y_i log[ µ_i/(1 − µ_i) ] + log(1 − µ_i) }

Poisson: β̂ = argmax_{β ∈ R^{p+1}} Σ_{i=1}^n [ y_i log(µ_i) − µ_i ]

Normal: β̂ = argmax_{β ∈ R^{p+1}} Σ_{i=1}^n ω_i( y_iµ_i − µ_i²/2 )

Gamma: β̂ = argmax_{β ∈ R^{p+1}} Σ_{i=1}^n ω_i[ −y_i/µ_i − log(µ_i) ]

Inverse Gaussian: β̂ = argmax_{β ∈ R^{p+1}} Σ_{i=1}^n ω_i[ 1/µ_i − y_i/(2µ_i²) ]
Deviance
Deviance is used to quantify the goodness-of-fit of an interest
model by comparing its log-likelihood function to that of the sat-
urated model (that is, a model with the same number of parame-
ters as subjects or individuals i n the sample). Thus, the lower the
deviance, the less the dierence between the fitted values of the
interest model and those of the saturated model. Hence, “small”
values of deviance indicate that the interest model achieves “sim-
ilar fitted values as the saturated model, despite the fact that
the first only has
(p + 1) parameters, while the second has n
parameters. Note that,
ˆ
β = argmax
β R
p+1
(β)
= argmin
β R
p+1
2
sup
µ
1
,...,µ
n
n
X
i=1
log
h
f
Y
i
(y
i
; µ
i
, φ/ω
i
)
i
(β)
,
where sup
µ
1
,...,µ
n
n
P
i=1
log
h
f
Y
i
(y
i
; µ
i
, φ/ω
i
)
i
is the log-likelihood func-
tion of the saturated model.
157 / 430
Deviance
Then, the deviance can be written as the following function of β:

D*(β) = 2{ sup_{µ_1,…,µ_n} Σ_{i=1}^n log[ f_{Y_i}(y_i; µ_i, φ/ω_i) ] − Σ_{i=1}^n log[ f_{Y_i}(y_i; µ_i(β), φ/ω_i) ] }

= 2 Σ_{i=1}^n { sup_{µ_i} log[ f_{Y_i}(y_i; µ_i, φ/ω_i) ] − log[ f_{Y_i}(y_i; µ_i(β), φ/ω_i) ] }

= 2 Σ_{i=1}^n { sup_{µ_i} (ω_i/φ)[ y_iθ(µ_i) − b(θ(µ_i)) ] + log[ C(y_i, φ/ω_i) ]
                − (ω_i/φ)[ y_iθ(µ_i(β)) − b(θ(µ_i(β))) ] − log[ C(y_i, φ/ω_i) ] }

= (2/φ) Σ_{i=1}^n { sup_{µ_i} ω_i[ y_iθ(µ_i) − b(θ(µ_i)) ] − ω_i[ y_iθ(µ_i(β)) − b(θ(µ_i(β))) ] }

= (1/φ) Σ_{i=1}^n d(y_i, µ_i, ω_i),
158 / 430
Deviance
where

d(y_i, µ_i, ω_i) = 2{ sup_µ ω_i[ y_iθ(µ) − b(θ(µ)) ] − ω_i[ y_iθ(µ_i) − b(θ(µ_i)) ] },

in which µ_i = g⁻¹(x_iᵀβ). Note that

(∂/∂µ) ω_i[ y_iθ(µ) − b(θ(µ)) ] = ω_i θ′(µ)[ y_i − b′(θ(µ)) ]

and

(∂²/∂µ²) ω_i[ y_iθ(µ) − b(θ(µ)) ] = ω_i{ θ″(µ)[ y_i − b′(θ(µ)) ] − [θ′(µ)]² b″(θ(µ)) }.

Therefore, if y_i belongs to the parametric space of µ, then the maximum of ω_i[ y_iθ(µ) − b(θ(µ)) ] exists and it is ω_i[ y_iθ(y_i) − b(θ(y_i)) ]. Consequently, d(y_i, µ_i, ω_i) reduces to

d(y_i, µ_i, ω_i) = 2{ ω_i[ y_iθ(y_i) − b(θ(y_i)) ] − ω_i[ y_iθ(µ_i) − b(θ(µ_i)) ] }.
159 / 430
Deviance
If y_i does not belong to the parametric space of µ, then ω_i[ y_iθ(µ) − b(θ(µ)) ] is a strictly monotonic function of µ, so its maximum does not exist and must be replaced by its supremum.

In conclusion, the deviance satisfies the following:

The lower the value of D*(β̂), the better the goodness-of-fit of the model.
D*(β̂) → 0 as d(y_i, µ̂_i, ω_i) → 0 for i = 1, …, n.
β̂ may be written as β̂ = argmin_{β ∈ R^{p+1}} D*(β) = argmin_{β ∈ R^{p+1}} D(β), where D(β) = Σ_{i=1}^n d(y_i, µ_i, ω_i) is the unscaled deviance.
160 / 430
Deviance (see McCullagh and Nelder [1989, page 34])
If m_iY_i ind∼ Binomial(m_i, µ_i) then

d(y_i, µ̂_i, ω_i) = −2m_i log(1 − µ̂_i)   if y_i = 0,
                 = 2m_i{ y_i log[ y_i(1 − µ̂_i) / (µ̂_i(1 − y_i)) ] + log[ (1 − y_i)/(1 − µ̂_i) ] }   if 0 < y_i < 1,
                 = −2m_i log(µ̂_i)   if y_i = 1.

If Y_i ind∼ Poisson(µ_i) then

d(y_i, µ̂_i) = 2µ̂_i   if y_i = 0,
            = 2[ y_i log(y_i/µ̂_i) − (y_i − µ̂_i) ]   if y_i > 0.

If Y_i ind∼ Normal(µ_i, φ/ω_i) then

d(y_i, µ̂_i, ω_i) = 2{ ω_i[ y_iθ(y_i) − b(θ(y_i)) ] − ω_i[ y_iθ(µ̂_i) − b(θ(µ̂_i)) ] }
                 = ω_i[ 2y_i(y_i − µ̂_i) + (µ̂_i² − y_i²) ]
                 = ω_i(y_i − µ̂_i)².

In addition, Σ_{i=1}^n ω_i(Y_i − µ_i)²/φ ∼ χ²(n).
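These unit deviances are exactly what R's family objects return through $dev.resids(); a minimal sketch for a Poisson fit (hcrabs is used later in these notes; any Poisson glm would do):

> data(hcrabs, package="GLMsData")
> fit <- glm(Sat ~ Width, family=poisson, data=hcrabs)
> y <- hcrabs$Sat; mu <- fitted(fit)
> d <- ifelse(y == 0, 2*mu, 2*(y*log(y/mu) - (y - mu)))  # unit deviances
> c(sum(d),
+   sum(poisson()$dev.resids(y, mu, wt=rep(1, length(y)))),
+   deviance(fit))   # all three coincide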
161 / 430
Deviance
If Y_i ind∼ Gamma(µ_i, φ/ω_i) then

d(y_i, µ̂_i, ω_i) = 2{ ω_i[ y_iθ(y_i) − b(θ(y_i)) ] − ω_i[ y_iθ(µ̂_i) − b(θ(µ̂_i)) ] }
                 = 2ω_i[ y_i( 1/µ̂_i − 1/y_i ) − log( y_i/µ̂_i ) ]
                 = 2ω_i[ (y_i − µ̂_i)/µ̂_i − log( y_i/µ̂_i ) ].

If Y_i ind∼ Inverse Gaussian(µ_i, φ/ω_i) then

d(y_i, µ̂_i, ω_i) = 2{ ω_i[ y_iθ(y_i) − b(θ(y_i)) ] − ω_i[ y_iθ(µ̂_i) − b(θ(µ̂_i)) ] }
                 = 2ω_i[ y_i( 1/(2µ̂_i²) − 1/(2y_i²) ) − ( 1/µ̂_i − 1/y_i ) ]
                 = ω_i(y_i − µ̂_i)² / (µ̂_i² y_i).

In addition, Σ_{i=1}^n ω_i(Y_i − µ_i)² / (φ µ_i² Y_i) ∼ χ²(n).
162 / 430
Exercises
Let Y ∼ Negative Binomial(µ, σ²), where µ > 0 and σ² is known. The probability mass function of Y may be written as

f_Y(y; µ, σ²) = { Γ(y + 1/σ²) / [ Γ(1/σ²) Γ(y + 1) ] } [ σ²µ/(σ²µ + 1) ]^y [ 1/(σ²µ + 1) ]^{1/σ²} I_y{0, 1, 2, …}.

Then, the following GLM can be formulated:

Y_i ind∼ Negative Binomial(µ_i, σ²),
g(µ_i) = β_0 + β_1 x_{i1} + … + β_p x_{ip}

(A) What is the maximum of y_iθ(µ_i) − b[θ(µ_i)] with respect to µ_i on the interval (0, ∞)?
(B) What is the supremum of b[θ(µ_i)] on the interval (0, ∞)?
(C) Find the deviance and the unscaled deviance of the model.
(D) Compare the results of (C) above with those obtained in R from negative.binomial(1)$dev.resids, where .Theta = 1/σ².
163 / 430
Parameter estimation
If ℓ(β) is a strictly concave function, that is, if the Hessian matrix of ℓ(β) is negative definite for all β, then β̂ reduces to the unique solution to U(β̂) = 0, if it exists, where U(β) is given by

U(β) = ( ∂ℓ(β)/∂β_0, ∂ℓ(β)/∂β_1, …, ∂ℓ(β)/∂β_p )ᵀ,

in which

∂ℓ(β)/∂β_j = (1/φ) Σ_{i=1}^n ω_i[ y_i (∂θ_i/∂µ_i)(∂µ_i/∂η_i)(∂η_i/∂β_j) − b′(θ_i)(∂θ_i/∂µ_i)(∂µ_i/∂η_i)(∂η_i/∂β_j) ]
           = (1/φ) Σ_{i=1}^n ω_i(y_i − µ_i)(∂θ_i/∂µ_i)(∂µ_i/∂η_i)(∂η_i/∂β_j)
           = (1/φ) Σ_{i=1}^n [ ω_i(y_i − µ_i) / (V(µ_i) g′(µ_i)) ] x_{ij},

because

(1) η_i = β_0 + β_1 x_{i1} + … + β_p x_{ip} ⟹ ∂η_i/∂β_j = x_{ij}, with x_{i0} = 1;
(2) g(µ_i) = η_i ⟹ ∂g(µ_i)/∂η_i = ∂η_i/∂η_i ⟹ g′(µ_i)(∂µ_i/∂η_i) = 1 ⟹ ∂µ_i/∂η_i = 1/g′(µ_i);
(3) µ_i = b′(θ_i) ⟹ ∂µ_i/∂µ_i = ∂b′(θ_i)/∂µ_i = 1 ⟹ b″(θ_i)(∂θ_i/∂µ_i) = 1 ⟹ ∂θ_i/∂µ_i = 1/V(µ_i).
Parameter estimation
In matrix form U(β) becomes
U(β) =
φ
1
n
P
i=1
ω
i
(y
i
µ
i
)
V(µ
i
)g
(µ
i
)
.
φ
1
n
P
i=1
ω
i
(y
i
µ
i
)
V(µ
i
)g
(µ
i
)
x
i1
.
.
.
φ
1
n
P
i=1
ω
i
(y
i
µ
i
)
V(µ
i
)g
(µ
i
)
x
ip
= φ
1
X
KV
1
(y µ),
where
X = (x
1
, . . . , x
n
)
is the model matrix
K = diag{1/g
(µ
1
), . . . , 1/g
(µ
n
)}
V = diag{V(µ
1
)
1
, . . . , V(µ
n
)
n
}
y = (y
1
, . . . , y
n
)
is the vector of observed responses
µ = (µ
1
, . . . , µ
n
)
is the vector of means
In general, the solution to the nonlinear equations given by
U(
ˆ
β) = 0 must be obtained using numerical methods such as
the Newton-Raphson and the Fisher’s Scoring algorithms.
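The score at β̂ can be checked numerically from this matrix form (this is essentially what glmtoolbox's estequa() reports); a minimal sketch for the Gamma model with log link fitted to the Steel data, where 1/g′(µ_i) = µ_i and V(µ_i) = µ_i², so KV⁻¹(y − µ) has elements (y_i − µ_i)/µ_i (the constant factor 1/φ is omitted):

> data(Steel, package="glmtoolbox")
> fit <- glm(life ~ stress, family=Gamma(log), data=Steel)
> X <- model.matrix(fit); y <- Steel$life; mu <- fitted(fit)
> t(X) %*% ((y - mu)/mu)   # essentially zero at the MLE; compare estequa(fit)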
165 / 430
Exercises
Find the solution to U(β̂) = 0 in the following cases:

(A) Y_i ind∼ Normal(µ_i, φ/ω_i), µ_i = β_0 + β_1 x_{i1} + … + β_p x_{ip}

(B) Y_i ind∼ LEF(µ_i, φ/ω_i), g(µ_i) = β_0 + β_1 x_{i1}, where individual i belongs to group 1 or 2. Therefore, x_{i1} = I{i ∈ group 2}.

(C) Y_i ind∼ LEF(µ_i, φ/ω_i), g(µ_i) = β_0 + β_1 x_{i1} + … + β_{r−1} x_{i(r−1)}, where individual i belongs to group 1, 2, …, (r − 1) or r. Hence, x_{ij} = I{i ∈ group (j + 1)}.
166 / 430
Newton-Raphson algorithm
The idea behind this algorithm is to maximize, in its iteration (t + 1) for t = 0, 1, 2, …, a quadratic approximation of ℓ(β), denoted here by ℓ^[t+1](β), so that the smaller the difference between ℓ(β) and ℓ^[t+1](β), the smaller the difference between β̂ and the maximum of ℓ^[t+1](β), denoted here by β^[t+1]. This is accomplished by building, in iteration (t + 1) of the algorithm, an approximation of ℓ(β) based on its second-order Taylor series around β^[t], obtaining the following (see Appendix C-3):

ℓ(β) ≈ ℓ^[t+1](β) = ℓ(β^[t]) + Uᵀ(β^[t])(β − β^[t]) − (1/2)(β − β^[t])ᵀ J(β^[t])(β − β^[t]),

where J(β) = −∇²ℓ(β) is minus the Hessian matrix of ℓ(β), also known as the observed Fisher information matrix of β. The maximum of ℓ^[t+1](β) is easy to compute as it is a quadratic and strictly concave function, which is a consequence of the following: (i) the Hessian matrices of ℓ(β) and ℓ^[t+1](β) coincide at β^[t]; and (ii) J(β) is assumed to be positive definite for all β.
167 / 430
Newton-Raphson algorithm
The maximum of ℓ^[t+1](β) is its zero-gradient point, that is, the maximum of ℓ^[t+1](β) is the solution to the following equation:

U(β^[t]) − J(β^[t])(β^[t+1] − β^[t]) = 0.

Therefore, β^[t+1] may be written as follows (see, for instance, Agresti [2015, section 4.5.1]):

β^[t+1] = β^[t] + [J(β^[t])]⁻¹ U(β^[t]).

The above results motivate the following algorithm:

Step 0: Start the counter at t = 0; set a tolerance limit, ε > 0; set a maximum number of iterations, n_max; and set a starting value for β, say β^[0].
Step 1: Compute β^[t+1] = β^[t] + [J(β^[t])]⁻¹ U(β^[t]).
Step 2: Compute δ^(t+1) = δ(β^[t], β^[t+1]).
Step 3: Update the counter by t = t + 1.
Step 4: Repeat steps 1, 2 and 3 until δ^(t) < ε or t > n_max.
Step 5: If δ^(t) < ε then β̂ is defined to be β^[t]. Otherwise, convergence was not achieved.
168 / 430
Newton-Raphson algorithm
[Figure: one Newton-Raphson step, showing the quadratic approximation ℓ^[t+1](β) to ℓ(β), with β^[t], β^[t+1] and β̂ marked on the β axis]
169 / 430
Newton-Raphson algorithm
Convergence of the algorithm above is obtained when the value of β^[t] stabilizes, that is, convergence is reached when the "difference" between β^[t] and β^[t+1] is "small" enough. Therefore,

(1) If β^[t] and β^[t+1] are equal, then U(β^[t]) = 0 and β̂ = β^[t+1], as J(β^[t]) is a non-singular matrix.
(2) If β^[t] reaches the value of β̂, then the algorithm finishes, as, according to Step 1 of the algorithm, U(β^[t]) = 0 implies that β^[t] and β^[t+1] are equal.

The convergence criterion can be formalized by setting the constant ε > 0 and the non-negative function δ(β^[t], β^[t+1]), which is a strictly increasing function of the "difference" between β^[t] and β^[t+1]. Possible forms for δ(β^[t], β^[t+1]) include the following:

δ(β^[t], β^[t+1]) = ‖β^[t+1] − β^[t]‖_r = [ Σ_{j=0}^p |β_j^[t+1] − β_j^[t]|^r ]^{1/r},  r ≥ 1

δ(β^[t], β^[t+1]) = max{ |β_0^[t+1] − β_0^[t]|/|β_0^[t]|, …, |β_p^[t+1] − β_p^[t]|/|β_p^[t]| }
170 / 430
Fisher’s Scoring algorithm
This algorithm is a modified version of the Newton-Raphson algorithm, where J(β) is replaced by the expected Fisher information matrix, defined by I(β) = E[J(β)] = φ⁻¹XᵀWX, with W = KV⁻¹K. This algorithm can be more convenient than the Newton-Raphson, as: (i) I(β) is a positive definite matrix for all β (see Appendix A-3), which, in general, is not true for J(β); (ii) the structure of I(β) is simpler than that of J(β). The Fisher's Scoring algorithm may be written as follows (see, for instance, Agresti [2015, section 4.5.2]):

Step 0: Start the counter at t = 0; set a tolerance limit, ε > 0; set a maximum number of iterations, n_max; and set a starting value for β, say β^[0].
Step 1: Compute β^[t+1] = β^[t] + [I(β^[t])]⁻¹ U(β^[t]).
Step 2: Compute δ^(t+1) = δ(β^[t], β^[t+1]).
Step 3: Update the counter by t = t + 1.
Step 4: Repeat steps 1, 2 and 3 until δ^(t) < ε or t > n_max.
Step 5: If δ^(t) < ε then β̂ is defined to be β^[t]. Otherwise, convergence was not achieved.
171 / 430
Fisher’s Scoring algorithm
The entry (j, j′) of the matrix J(β) may be obtained as follows:

−∂²ℓ(β)/∂β_j ∂β_{j′} = −(∂/∂β_{j′}) { (1/φ) Σ_{i=1}^n ω_i(Y_i − µ_i) f_i x_{ij} }
= (1/φ) Σ_{i=1}^n ω_i x_{ij} [ f_i − (Y_i − µ_i)(∂f_i/∂µ_i) ] (∂µ_i/∂η_i)(∂η_i/∂β_{j′})
= (1/φ) Σ_{i=1}^n ω_i x_{ij} x_{ij′} / [ V(µ_i)(g′(µ_i))² ]
  − (1/φ) Σ_{i=1}^n ω_i(Y_i − µ_i)(∂f_i/∂µ_i) x_{ij} x_{ij′} / g′(µ_i),

where f_i = [V(µ_i)g′(µ_i)]⁻¹. Therefore, since E(Y_i − µ_i) = 0, the entry (j, j′) of I(β) = E[J(β)] reduces to (Agresti [2015, page 126]):

E( −∂²ℓ(β)/∂β_j ∂β_{j′} ) = (1/φ) Σ_{i=1}^n ω_i x_{ij} x_{ij′} / [ V(µ_i)(g′(µ_i))² ].
172 / 430
Fisher’s Scoring algorithm
By replacing U(β) and I(β) by φ⁻¹XᵀWK⁻¹(y − µ) and φ⁻¹XᵀWX, respectively, Step 1 of the Fisher's Scoring algorithm may be written as follows (see, for instance, Agresti [2015, page 142]):

β^[t+1] = β^[t] + [I(β^[t])]⁻¹ U(β^[t])
        = β^[t] + (XᵀW^[t]X)⁻¹ XᵀW^[t](z^[t] − Xβ^[t])
        = (XᵀW^[t]X)⁻¹ XᵀW^[t] z^[t],

where W^[t] and z^[t] represent W and z = Xβ + K⁻¹(y − µ) = g(µ) + K⁻¹(y − µ) evaluated at β^[t], in which g(µ) is defined to be (g(µ_1), …, g(µ_n))ᵀ. Therefore, β^[t+1] can be regarded as the Weighted Least Squares estimate of β in a linear model such that E(Z^[t]) = Xβ, Var(Z^[t]) = σ²[W^[t]]⁻¹ and z^[t] is the observed value of Z^[t]. Indeed, at convergence of the algorithm, β̂ can be written as β̂ = (XᵀŴX)⁻¹XᵀŴẑ. Hence, the Fisher's Scoring algorithm is also known as Iteratively Reweighted Least Squares (IRLS).
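The update above can be reproduced in a few lines of R; a didactic sketch for the Gamma model with log link fitted to the Steel data (not the production implementation in glm(); for this family/link combination the weights w_i = 1/[V(µ_i)(g′(µ_i))²] equal 1, so W is the identity):

> data(Steel, package="glmtoolbox")
> X <- model.matrix(~ stress, data=Steel); y <- Steel$life
> beta <- c(log(mean(y)), 0)                 # crude starting value
> for (t in 1:25) {
+   eta <- drop(X %*% beta); mu <- exp(eta)  # log link
+   z <- eta + (y - mu)/mu                   # working response g(mu) + K^{-1}(y - mu)
+   W <- diag(1, length(y))                  # W = I for Gamma with log link
+   beta <- drop(solve(t(X) %*% W %*% X, t(X) %*% W %*% z))
+ }
> beta   # close to coef(glm(life ~ stress, family=Gamma(log), data=Steel))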
173 / 430
Fisher’s Scoring algorithm
Thus, an initial or starting value for the Newton-Raphson and Fisher's Scoring algorithms may be defined as

β^[0] = (XᵀW*X)⁻¹ XᵀW* z*,

where W* and z* represent W and z evaluated at µ = y. However, in some cases, the value of y does not belong to the parametric space of µ, so that g(y) and/or g′(y) may be undefined. This occurs in the following cases:

(1) Under binomial response models when y = 0 or y = 1. In those cases, β^[0] may be computed by replacing such values of y by |y − τ|, where τ is a "small" value such as, for instance, 0.01, 0.05 or 0.1.
(2) Under Poisson response models when y = 0. In those cases, β^[0] may be computed by replacing such values of y by τ, where τ is a "small" value such as, for instance, 0.01 or 0.05.
174 / 430
Canonical link functions
The canonical link functions are such that

g(µ) ≡ θ(µ),

which implies that g′(µ) = [V(µ)]⁻¹. In those cases, the expression for the elements of the gradient vector U(β) reduces to

∂ℓ(β)/∂β_j = (1/φ) Σ_{i=1}^n ω_i(y_i − µ_i)x_{ij},  j = 0, 1, …, p.

Similarly, the expression for the elements of the matrix J(β) reduces to

−∂²ℓ(β)/∂β_j∂β_{j′} = (1/φ) Σ_{i=1}^n ω_i V(µ_i)x_{ij}x_{ij′},  j, j′ = 0, 1, …, p.

Therefore,

U(β) = (1/φ) XᵀΩ(y − µ), where Ω = diag{ω_1, …, ω_n},  and  I(β) = J(β).
175 / 430
Canonical link functions
See, for example, McCullagh and Nelder [1989, page 32].

Distribution          Canonical link
Normal                g(µ) = µ (identity)
Binomial              g(µ) = log[µ/(1 − µ)] (logit)
Poisson               g(µ) = log(µ) (log)
Gamma                 g(µ) = µ⁻¹ (inverse)
Inverse Gaussian      g(µ) = µ⁻² (1/mu^2)

Models with those link functions satisfy the following:

(1) The Newton-Raphson and Fisher's Scoring algorithms coincide, as the only difference between them is that the former uses J(β) whereas the latter uses I(β).
(2) The Hessian matrix of ℓ(β), given by −I(β), is negative definite for all β. Therefore, ℓ(β) is a strictly concave function, and the solution to U(β) = 0, if it exists, is the unique global maximum of ℓ(β), that is, it is the maximum likelihood estimate of β.
176 / 430
Hardened steel

These data, extracted from McCool [1980] and available in the object Steel of the library glmtoolbox, consist of the failure times (life) for hardened steel specimens in a 4-ball rolling contact fatigue test. Ten independent observations were taken at each of the four values of contact stress (stress), in pounds per square inch ×10⁶.

[Figure: time until the specimen failed versus contact stress, in psi ×10⁶]

Y_i ind∼ Gamma(µ_i, σ²),
log(µ_i) = β_0 + β_1 x_i

y = Time until the specimen failed
x = Contact stress
177 / 430
Hardened steel

The convergence criterion of the Fisher's Scoring algorithm in the function glm() is the following:

δ(β^[t], β^[t+1]) = | D(β^[t+1]) − D(β^[t]) | / [ |D(β^[t])| + 0.1 ] < ε

> fit <- glm(life ~ stress, family=Gamma(log), data=Steel)
> FisherScoring(fit)
Tolerance Limit: 1e-08                      # the value of ε
Maximum number of iterations: 25            # the value of n_max
Iteration Deviance  Tolerance  Intercept   Stress    # t, D(β^[t]), δ(β^[t−1], β^[t]), β^[t]
        1   43.604         NA     12.660  -12.318
        2   34.873 1.9978e-01     14.829  -13.891
        3   34.231 1.8341e-02     14.270  -13.456
        4   34.226 1.5484e-04     14.190  -13.386
        5   34.226 1.7747e-07     14.187  -13.383
        6   34.226 2.6181e-10     14.186  -13.383   # β̂
> estequa(fit)                              # U(β̂)
(Intercept) -5.827184e-09
stress       3.420245e-06
178 / 430
Hardened steel

> fit0 <- lm(log(life) ~ stress, data=Steel)
> fit1 <- glm(life ~ stress, family=Gamma(log), data=Steel,
+             start=coef(fit0),             # β^[0]
+             control=list(epsilon=1e-8, maxit=25, trace=TRUE))  # ε and n_max
Deviance = 34.87253 Iterations - 1
Deviance = 34.23110 Iterations - 2
Deviance = 34.22578 Iterations - 3
Deviance = 34.22578 Iterations - 4
Deviance = 34.22578 Iterations - 5
> summary(fit1)
             Estimate Std.Error t value Pr(>|t|)     # β̂_j in the first column
(Intercept)    14.186     1.250   11.35 9.02e-14 ***
stress        -13.383     1.203  -11.12 1.63e-13 ***
(Dispersion parameter for Gamma family taken to be 0.7710894)
Null deviance: 110.033 on 39 degrees of freedom
Residual deviance: 34.226 on 38 degrees of freedom   # D(β̂)
Number of Fisher Scoring iterations: 5
179 / 430
Hardened steel

The function glm() also enables the user to specify starting values for η_1, …, η_n (etastart) and for µ_1, …, µ_n (mustart).

> etas <- fitted(fit0)
> fit <- glm(life ~ stress, family=Gamma(log), data=Steel,
+            etastart=etas,                 # (η_1^[0], …, η_n^[0])
+            control=list(epsilon=1e-8, maxit=25, trace=TRUE))  # ε and n_max
Deviance = 34.87253 Iterations - 1
Deviance = 34.23110 Iterations - 2
Deviance = 34.22578 Iterations - 3
Deviance = 34.22578 Iterations - 4
Deviance = 34.22578 Iterations - 5
> summary(fit)
>
> mus <- exp(fitted(fit0))
> fit <- glm(life ~ stress, family=Gamma(log), data=Steel,
+            mustart=mus,                   # (µ_1^[0], …, µ_n^[0])
+            control=list(epsilon=1e-8, maxit=25, trace=TRUE))  # ε and n_max
180 / 430
Goodness-of-fit
181 / 430
R-squared
Interest Model (model under investigation):

Y_i ind∼ LEF(µ_i, φ/ω_i),
g(µ_i) = β_0 + β_1 x_{i1} + … + β_p x_{ip},

whose model matrix X has rows (1, x_{i1}, …, x_{ip}), i = 1, …, n, and β̂ = argmin_{β ∈ R^{p+1}} D(β).

Reference Model (model with the poorest goodness-of-fit):

Y_i ind∼ LEF(µ_i, φ/ω_i),
g(µ_i) = β,

whose model matrix is X = 1_n, and β̂ = argmin_{β ∈ R} D_0(β).

Note that D(β̂) = min_{β ∈ R^{p+1}} D(β) and D_0(β̂) = min_{β ∈ R^{p+1}} D(β) restricted to β_1 = … = β_p = 0, where D(β) and D_0(β) represent the unscaled deviances of the Interest Model and the Reference Model, respectively.
182 / 430
R-squared
R² = 1 − D(β̂)/D_0(β̂)

The R² satisfies the following:

The larger its value, the better the goodness-of-fit of the model.
R² ∈ [0, 1].
If X = 1_n then R² = 0, where 1_n represents a column vector of 1's of size n.
If y_1, …, y_n belong to the parametric space of µ and X = I_n then R² = 1, where I_n represents the identity matrix of size n.
A hundred times its value represents the percentage reduction in unscaled deviance caused by the inclusion of p parameters in the linear predictor.
It is a one-to-one transformation of the unscaled deviance.
183 / 430
R-squared
The R² can be used to compare nested models, that is, models that assume the same probability distribution to describe the response variable, but where the first model's X matrix coincides with the first (r + 1) columns of the second model's X matrix, in which r < p.

Model I:
Y_i ind∼ LEF(µ_i, φ/ω_i),
g(µ_i) = β̃_0 + Σ_{j=1}^r β̃_j x_{ij},
with β̂_I = argmin_{β̃ ∈ R^{r+1}} D_I(β̃).

Model II:
Y_i ind∼ LEF(µ_i, φ/ω_i),
g(µ_i) = β_0 + Σ_{j=1}^r β_j x_{ij} + Σ_{j=r+1}^p β_j x_{ij},
with β̂_II = argmin_{β ∈ R^{p+1}} D_II(β).

Note that D_II(β̂_II) = min_{β ∈ R^{p+1}} D_II(β) and D_I(β̂_I) = min_{β ∈ R^{p+1}} D_II(β) restricted to β_{r+1} = … = β_p = 0, where D_I(β̃) and D_II(β) represent the unscaled deviances of models I and II, respectively.
184 / 430
R-squared
Therefore,

D_II(β̂_II) ≤ D_I(β̂_I),

which implies that R²_II is ALWAYS larger than or equal to R²_I, that is, according to the R², model II is ALWAYS better than model I, even if the (p − r) additional covariates in model II do not help to explain the differences between y_1, …, y_n.

An adjusted version of R², in which not just the model goodness-of-fit is measured but also its complexity is penalized, is given by the following expression:

R²_adj = 1 − [ D(β̂)/(n − p − 1) ] / [ D_0(β̂)/(n − 1) ] = 1 − [ (n − 1)/(n − p − 1) ](1 − R²),

where (n − p − 1) and (n − 1) are the degrees-of-freedom of the Interest Model and the Reference Model, respectively. Note that R²_adj is a one-to-one transformation of D(β̂)/(n − p − 1).
185 / 430
Generalized Information Criterion (GIC)
GIC(k) = −2ℓ(β̂) + k × (number of parameters),

where −2ℓ(β̂) is a measure of the goodness-of-fit and k × (number of parameters) is the penalty for model complexity, with

ℓ(β̂) = Σ_{i=1}^n log[C(y_i, φ/ω_i)] + (1/φ) Σ_{i=1}^n ω_i{ y_iθ(µ̂_i) − b[θ(µ̂_i)] }.

The Akaike Information Criterion (AIC, Akaike [1974]) and the Bayesian Information Criterion (BIC, Schwarz [1978]), also known as the Schwarz Information Criterion (SIC), are special cases of GIC(k) when k = 2 and k = log(n), respectively.

If n > 7 then the penalty for model complexity in BIC is higher than that in AIC. Therefore, the BIC tends to choose simpler models than AIC.
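GIC(k) is easy to compute from logLik(); a minimal sketch (fit is the Gamma model for the Steel data used earlier):

> fit <- glm(life ~ stress, family=Gamma(log), data=Steel)
> GIC <- function(fit, k) {
+   ll <- logLik(fit)
+   -2*as.numeric(ll) + k*attr(ll, "df")  # df counts the betas plus the dispersion
+ }
> c(GIC(fit, 2), AIC(fit))                # k = 2 reproduces the AIC
> c(GIC(fit, log(nobs(fit))), BIC(fit))   # k = log(n) reproduces the BIC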
186 / 430
Statistical inference
187 / 430
Asymptotic theory
If the model is correctly specified then, under certain conditions, better known as regularity conditions, β̂ is a consistent estimator of β, which asymptotically (that is, when n → ∞) satisfies the following (Fahrmeir and Kaufmann [1985]):

it is unbiased,
it is efficient,
it follows a multivariate normal distribution.

As a result, for "large" values of the sample size n, we have that

β̂ ∼̇ Normal_{p+1}(β, [I(β)]⁻¹),

where ∼̇ means "approximately distributed as". A consistent estimator of [I(β)]⁻¹ is [I(β̂)]⁻¹.
188 / 430
Hypothesis testing
The interest is to assess the following hypothesis system:

H_0: Lᵀβ = M versus H_1: Lᵀβ ≠ M,

where Lᵀ is an r × (p + 1) matrix whose rows are linearly independent and M is an r × 1 matrix. The decision rule is the following:

"Reject H_0 at the approximate 100(α)% significance level if ξ > χ²_{1−α}(r)",

where α ∈ (0, 1), χ²_{1−α}(r) is the 100(1 − α)th percentile of the chi-square distribution with r degrees-of-freedom, and ξ is one of the following statistics (see, for instance, Buse [1982] and Terrell [2002]):

Likelihood-ratio
Wald
Rao's Score
Terrell's Gradient

If H_0 is true, then the asymptotic distribution (n → ∞) of ξ is χ²(r). The statistic ξ depends on the dispersion parameter φ. However, the asymptotic distribution of ξ remains χ²(r) even when φ is unknown and is replaced by a consistent estimator.
189 / 430
Hypothesis testing
For example, with p = 3:

(1) H_0: β_2 = 0 versus H_1: β_2 ≠ 0.
    Then r = 1, Lᵀ = [0 0 1 0] and M = 0.

(2) H_0: (β_1, β_3)ᵀ = (0, 0)ᵀ versus H_1: (β_1, β_3)ᵀ ≠ (0, 0)ᵀ.
    Then r = 2, Lᵀ = [0 1 0 0; 0 0 0 1] and M = (0, 0)ᵀ.

(3) H_0: β_2 = β_3 versus H_1: β_2 ≠ β_3.
    Then r = 1, Lᵀ = [0 0 1 −1] and M = 0.
190 / 430
Hypothesis testing
Likelihood-ratio test

ξ_LR = 2 log[ L(β̂)/L(β̂⁰) ] = 2[ ℓ(β̂) − ℓ(β̂⁰) ]

= (2/φ){ Σ_{i=1}^n ω_i[ y_iθ(µ̂_i) − b(θ(µ̂_i)) ] − Σ_{i=1}^n ω_i[ y_iθ(µ̂⁰_i) − b(θ(µ̂⁰_i)) ] }

= (2/φ){ Σ_{i=1}^n sup_{µ_i} ω_i[ y_iθ(µ_i) − b(θ(µ_i)) ] − Σ_{i=1}^n ω_i[ y_iθ(µ̂⁰_i) − b(θ(µ̂⁰_i)) ] }
  − (2/φ){ Σ_{i=1}^n sup_{µ_i} ω_i[ y_iθ(µ_i) − b(θ(µ_i)) ] − Σ_{i=1}^n ω_i[ y_iθ(µ̂_i) − b(θ(µ̂_i)) ] }

= (1/φ)[ D(β̂⁰) − D(β̂) ],

where β̂ = argmin_{β ∈ R^{p+1}} D(β), β̂⁰ = argmin_{β ∈ R^{p+1}} D(β) restricted to Lᵀβ = M, µ̂_i = g⁻¹(x_iᵀβ̂) and µ̂⁰_i = g⁻¹(x_iᵀβ̂⁰).
191 / 430
Hypothesis testing
Wald test

β̂ ∼̇ Normal_{p+1}(β, [I(β)]⁻¹), which implies that (see Appendix E-3)

Lᵀβ̂ ∼̇ Normal_r(Lᵀβ, Lᵀ[I(β)]⁻¹L),

which in turn implies that (see Appendix E-3,4)

(Lᵀβ̂ − Lᵀβ)ᵀ (Lᵀ[I(β)]⁻¹L)⁻¹ (Lᵀβ̂ − Lᵀβ) ∼̇ χ²(r).

So, the statistic of the Wald test reduces to

ξ_W = (Lᵀβ̂ − M)ᵀ (Lᵀ[I(β̂)]⁻¹L)⁻¹ (Lᵀβ̂ − M).

Unlike ξ_LR, the statistic ξ_W does not depend on β̂⁰. As a result, ξ_W has a lower computational cost than ξ_LR, since the former does not require the computation of β̂⁰.
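ξ_W can be computed directly from coef() and vcov(); a minimal sketch for H_0: β_1 = 0 in the Gamma model for the Steel data (Lᵀ picks out the stress coefficient; vcov(fit) estimates [I(β̂)]⁻¹):

> fit <- glm(life ~ stress, family=Gamma(log), data=Steel)
> Lt <- matrix(c(0, 1), nrow=1); M <- 0        # r = 1
> d <- Lt %*% coef(fit) - M
> xiW <- drop(t(d) %*% solve(Lt %*% vcov(fit) %*% t(Lt)) %*% d)
> c(xiW, pchisq(xiW, df=1, lower.tail=FALSE))  # statistic and p-value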
192 / 430
Hypothesis testing
For example,

(1) H_0: β_2 = 0 versus H_1: β_2 ≠ 0. Then

ξ_W = β̂_2² / V̂ar(β̂_2) = β̂_2² / [Std.Error(β̂_2)]².

(2) H_0: (β_1, β_3)ᵀ = (0, 0)ᵀ versus H_1: (β_1, β_3)ᵀ ≠ (0, 0)ᵀ. Then

ξ_W = (β̂_1, β̂_3) [ V̂ar(β̂_1)  Ĉov(β̂_1, β̂_3) ; Ĉov(β̂_3, β̂_1)  V̂ar(β̂_3) ]⁻¹ (β̂_1, β̂_3)ᵀ.

(3) H_0: β_2 = β_3 versus H_1: β_2 ≠ β_3. Then

ξ_W = (β̂_2 − β̂_3)² / [ V̂ar(β̂_2) + V̂ar(β̂_3) − 2Ĉov(β̂_2, β̂_3) ].
193 / 430
Hypothesis testing
Rao's Score test

U(β) ∼̇ Normal_{p+1}(0, I(β)), which implies that (see Appendix E-3)

LᵀU(β) ∼̇ Normal_r(0, Lᵀ[I(β)]L),

which in turn implies that (see Appendix E-3,4)

[LᵀU(β)]ᵀ (Lᵀ[I(β)]L)⁻¹ [LᵀU(β)] ∼̇ χ²(r).

So, the statistic of the Rao's Score test reduces to

ξ_S = [LᵀU(β̂⁰)]ᵀ (Lᵀ[I(β̂⁰)]L)⁻¹ [LᵀU(β̂⁰)].

Unlike ξ_LR, the statistic ξ_S does not depend on β̂. As a result, ξ_S has a lower computational cost than ξ_LR, since the former does not require the computation of β̂.

Terrell's Gradient test

ξ_G = (Lᵀβ̂ − M)ᵀ [LᵀU(β̂⁰)].

Similar to ξ_LR, the statistic of the Terrell's Gradient test depends on both β̂ and β̂⁰.
194 / 430
Hypothesis testing
> help(anova2)
>
> anova2(fit0, fit1, test="wald")

fit0: the restricted model, that is, the model fitted under the restrictions on β stated by H_0.
fit1: the unrestricted model, that is, the model fitted without restrictions on β.
test: the statistical test specified to assess the hypothesis. The available options are: Wald ("wald"), Rao's score ("score"), Terrell's gradient ("gradient"), and likelihood ratio ("lr") tests. By default, test is set to be "wald".

The function anova2() provides the interest values associated with the hypothesis testing, that is, ξ, r and Pr[χ²(r) > ξ].
195 / 430
Confidence intervals (see Agresti [2015, page 131])
The approximate 100(1 − α)% confidence interval for β_j includes all β* for which the hypothesis system given by

H_0: β_j = β* versus H_1: β_j ≠ β*

is not rejected at the approximate 100(α)% significance level.

In general, the limits of the approximate 100(1 − α)% confidence interval for β_j based on the likelihood-ratio, Rao's Score, and Terrell's Gradient tests do not have closed form and require numerical methods. On the other hand, the approximate 100(1 − α)% confidence interval for β_j based on the Wald test comprises the β* such that (β̂_j − β*)²/V̂ar(β̂_j) ≤ χ²_{1−α}(1) = Z²_{1−α/2}. Thus, the interval is the following:

( β̂_j − Z_{1−α/2}[V̂ar(β̂_j)]^{1/2} , β̂_j + Z_{1−α/2}[V̂ar(β̂_j)]^{1/2} ),

where Z_{1−α/2} represents the 100(1 − α/2)th percentile of the standard normal distribution.
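The Wald interval can be reproduced by hand; a minimal sketch at the 95% level for the Gamma model above (compare with confint2(fit, test="wald")):

> fit <- glm(life ~ stress, family=Gamma(log), data=Steel)
> est <- coef(fit); se <- sqrt(diag(vcov(fit)))
> cbind(lower = est - qnorm(0.975)*se,
+       upper = est + qnorm(0.975)*se)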
196 / 430
Confidence intervals
> help(confint2)
>
> confint2(fit, level, test, digits)

fit: the fitted model.
level: the value indicating the required confidence level. By default, level is set to be 0.95.
test: the statistical test on which the approximate confidence intervals are based. The available options are: Wald ("wald"), Rao's score ("score"), Terrell's gradient ("gradient"), and likelihood ratio ("lr") tests. By default, test is set to be "wald".
digits: the integer value indicating the number of decimal places to be used. By default, digits is set to be 5.

The function confint2() provides the lower and upper limits of the confidence intervals for all parameters in the linear predictor.
197 / 430
Estimation of the dispersion parameter
$$\mathrm{E}\!\left[\frac{1}{n}\sum_{i=1}^{n}\frac{\omega_i\,(Y_i-\mu_i)^{2}}{V(\mu_i)}\right]=\phi,$$
and, by the Law of Large Numbers,
$$\frac{1}{n}\sum_{i=1}^{n}\frac{\omega_i\,(Y_i-\mu_i)^{2}}{V(\mu_i)}\;\xrightarrow[n\to\infty]{P}\;\phi.$$
Then, by replacing μ₁, ..., μ_n by μ̂₁, ..., μ̂_n and adjusting the value of n by the number of estimated parameters, the following estimator of φ is obtained:
$$\hat\phi=\frac{1}{n-p-1}\sum_{i=1}^{n}\frac{\omega_i\,(y_i-\hat\mu_i)^{2}}{V(\hat\mu_i)}=\frac{P(\hat\beta)}{n-p-1}.$$
A second estimator of φ is the following:
$$\tilde\phi=\frac{D(\hat\beta)}{n-p-1}.$$
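Both estimators are easy to obtain from a fitted glm object; a minimal sketch, assuming the fit is stored in fit:

# Minimal sketch: the two dispersion estimators above;
# df.residual(fit) equals n - p - 1
phi.hat   <- sum(residuals(fit, type = "pearson")^2) / df.residual(fit)  # Pearson-based
phi.tilde <- deviance(fit) / df.residual(fit)                            # deviance-based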
198 / 430
Special cases
199 / 430
Binomial regression
$$m_i Y_i \overset{ind}{\sim} \mathrm{Binomial}(m_i,\mu_i), \qquad g(\mu_i)=\beta_0+\beta_1 x_{i1}+\ldots+\beta_p x_{ip}$$
$$\mathrm{E}(Y_i)=\mu_i \qquad\text{and}\qquad \mathrm{Var}(Y_i)=m_i^{-1}\,\mu_i(1-\mu_i)$$

Logit link: $g(\mu_i)=\log\!\left(\dfrac{\mu_i}{1-\mu_i}\right)$

Probit link: $g(\mu_i)=\Phi^{-1}(\mu_i)$

Complementary log-log link: $g(\mu_i)=\log\!\left[-\log\left(1-\mu_i\right)\right]$

Cauchy link: $g(\mu_i)=\tan\!\left[\dfrac{\pi}{2}\left(2\mu_i-1\right)\right]$

In this model ω_i = m_i for i = 1, . . . , n.
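In R, the link is selected through the family argument of glm(). A minimal sketch, using hypothetical names (data frame dat with columns killed, exposed and dose):

# Minimal sketch: the same binomial GLM under the four links above
fit.logit   <- glm(killed/exposed ~ dose, weights = exposed,
                   family = binomial("logit"),   data = dat)
fit.probit  <- glm(killed/exposed ~ dose, weights = exposed,
                   family = binomial("probit"),  data = dat)
fit.cloglog <- glm(killed/exposed ~ dose, weights = exposed,
                   family = binomial("cloglog"), data = dat)
fit.cauchit <- glm(killed/exposed ~ dose, weights = exposed,
                   family = binomial("cauchit"), data = dat)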
200 / 430
Males attached to female horseshoe crabs
These data, extracted from Agresti [2015, page 16] and available in the object hcrabs of the library GLMsData, come from a study of female horseshoe crabs on an island in the Gulf of Mexico. During the spawning season, a female migrates to the shore to breed. With a male attached to her posterior spine, she burrows into the sand and lays clusters of eggs. The eggs are fertilized externally in the sand beneath the pair. During spawning, other male crabs (satellites) may cluster around the pair and may also fertilize the eggs. The data set includes

Width (x₁). The carapace width of the female crab, in cm.

Col. The color of the female crab: "LM" (Light Medium), "M" (Medium), "DM" (Dark Medium), or "D" (Dark). Then, x₂ = 1 if Col="LM" or Col="M"; and x₂ = 0 if Col="DM" or Col="D".

Sat. The number of male crabs attached to the female crab. Then, y = 1 if Sat > 0 and y = 0 if Sat = 0.

$$Y_i \overset{ind}{\sim} \mathrm{Bernoulli}(\mu_i), \qquad g(\mu_i)=\beta_0+\beta_1 x_{i1}+\beta_2 x_{i2}$$
201 / 430
Parameter interpretation: logit link
Let Z ∼ Binomial(m, μ) and Z* = m − Z be the number of "successes" and "failures", respectively. Then,
$$\mathrm{Odds}(\mu)=\frac{\mathrm{E}(Z)}{\mathrm{E}(Z^{*})}=\frac{\mu}{1-\mu}=\exp[g(\mu)].$$
Odds(μ) satisfies the following:

(1) It is a strictly positive function of μ.
(2) It is a strictly increasing function of μ. Therefore, Odds(μ) increases when μ increases and decreases when μ decreases.
(3) Odds(0.5) is 1.
(4) It can be interpreted as follows: on average, for every "failure" there are Odds(μ) "successes".

For example, on average, for every female horseshoe crab without satellites, there are Odds(μ) female horseshoe crabs with at least one satellite.
202 / 430
Parameter interpretation: cloglog link
Let Z ∼ Geometric(μ) be the number of independent trials required to get a "success". Then,
$$\Pr(Z=z)=(1-\mu)^{z-1}\mu\; I_{\{1,2,\ldots\}}(z), \qquad \mathrm{E}(Z)=\mu^{-1},$$
and the median of Z is given approximately by
$$\mathrm{MNNT}(\mu)=\frac{-\log(2)}{\log(1-\mu)}=\log(2)\exp[-g(\mu)].$$
MNNT(μ) satisfies the following:

(1) It is a strictly positive function of μ.
(2) It is a strictly decreasing function of μ. Therefore, MNNT(μ) decreases when μ increases and increases when μ decreases.
(3) MNNT(0.5) is 1.
(4) It can be interpreted as the median number of trials required to get a "success".

For example, the median number of female horseshoe crabs required to get one with at least one satellite is MNNT(μ).
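A minimal numerical sketch of this quantity, assuming fit is a glm fitted with family = binomial(cloglog):

# Minimal sketch: MNNT implied by a cloglog fit
mu   <- fitted(fit)
MNNT <- -log(2) / log(1 - mu)   # equals log(2) * exp(-predict(fit))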
203 / 430
Parameter interpretation: without interaction

The odds of having at least one satellite for a female crab with carapace width x₁ and color x₂ is exp(β₀ + β₁x₁ + β₂x₂).

Suppose we have two female crabs whose carapace widths coincide, but the color of the first is Light whereas the color of the second is Dark. Then the odds of having at least one satellite of the first female crab is exp(β₂) times that of the second:
$$\frac{\mathrm{Odds}_{L}}{\mathrm{Odds}_{D}}=\frac{\exp(\beta_0+\beta_1 x_1+\beta_2)}{\exp(\beta_0+\beta_1 x_1)}=\exp(\beta_2).$$

Suppose we have two female crabs whose colors coincide, but the carapace width of the first is x + δ whereas the carapace width of the second is x. Then the odds of having at least one satellite of the first female crab is exp(β₁δ) times that of the second:
$$\frac{\mathrm{Odds}_{x+\delta}}{\mathrm{Odds}_{x}}=\frac{\exp(\beta_0+\beta_1(x+\delta)+\beta_2 x_2)}{\exp(\beta_0+\beta_1 x+\beta_2 x_2)}=\exp(\beta_1\delta).$$
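In practice these odds ratios can be read directly off the fitted coefficients. A minimal sketch, once the logit model of the following slide has been fitted and stored in fit:

# Minimal sketch: estimated odds ratios and Wald limits on the
# odds-ratio scale for the hcrabs logit fit
exp(coef(fit))                            # exp(beta_1) per cm of width, exp(beta_2) for color
exp(confint.default(fit, level = 0.95))   # Wald confidence limits, exponentiated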
204 / 430
Parameter interpretation: with interaction

The odds of having at least one satellite for a female crab with carapace width x₁ and color x₂ is exp(β₀ + β₁x₁ + β₂x₂ + β₃x₁x₂).

If there are two "Dark" female crabs where the carapace width of the first is x + δ whereas that of the second is x, then the odds of having at least one satellite of the first female crab is exp(β₁δ) times that of the second:
$$\frac{\mathrm{Odds}_{x+\delta}}{\mathrm{Odds}_{x}}=\frac{\exp(\beta_0+\beta_1(x+\delta))}{\exp(\beta_0+\beta_1 x)}=\exp(\beta_1\delta).$$

If there are two "Light" female crabs where the carapace width of the first is x + δ whereas that of the second is x, then the odds of having at least one satellite of the first female crab is exp[(β₁ + β₃)δ] times that of the second:
$$\frac{\mathrm{Odds}_{x+\delta}}{\mathrm{Odds}_{x}}=\frac{\exp(\beta_0+(\beta_1+\beta_3)(x+\delta)+\beta_2)}{\exp(\beta_0+(\beta_1+\beta_3)x+\beta_2)}=\exp[(\beta_1+\beta_3)\delta].$$
205 / 430
Males attached to female horseshoe crabs
> data(hcrabs, package="GLMsData")
> hcrabs <- within(hcrabs,{Col2<-ifelse(Col %in% c("LM","M"),1,0)
+ Sat2<-ifelse(Sat > 0,1,0)})
> fit <- glm(Sat2~Width+Col2,family=binomial(logit),data=hcrabs)
> summary(fit)
(Column annotations: Estimate = β̂_j; Std.Error = c_j; z value = z_j = β̂_j/c_j; Pr(>|z|) = Pr[χ²(1) > z_j²].)

Estimate Std.Error z value Pr(>|z|)
(Intercept) -11.9501 2.6643 -4.485 7.28e-06
Width 0.4670 0.1037 4.506 6.61e-06
Col2 0.6531 0.3571 1.829 0.0675

(Dispersion parameter for binomial family taken to be 1)   [φ]

Null deviance : 225.76 on 172 degrees of freedom   [D₀(β̂) on n − 1]
Residual deviance : 191.12 on 170 degrees of freedom   [D(β̂) on n − p − 1]
AIC : 197.12   [−2ℓ(β̂) + 2(p + 1)]
Number of Fisher Scoring iterations: 4
206 / 430
Males attached to female horseshoe crabs
> fit2 <- update(fit, formula=Sat2 ~ Width + Col2 + Width:Col2)
> anova2(fit,fit2,test="wald")
Model 1 : Sat2 ~ Width + Col2
Model 2 : Sat2 ~ Width + Col2 + Width:Col2
(Columns: Chi = ξ_W; df = r; Pr(Chisq>) = Pr[χ²(r) > ξ_W].)
Chi df Pr(Chisq>)
1 vs 2 0.78331 1 0.3761
>
> anova2(fit,fit2, test="lr")
Model 1 : Sat2 ~ Width + Col2
Model 2 : Sat2 ~ Width + Col2 + Width:Col2
(Columns: Chi = ξ_LR; df = r; Pr(Chisq>) = Pr[χ²(r) > ξ_LR].)
Chi df Pr(Chisq>)
1 vs 2 0.80605 1 0.3693
>
> anova2(fit,fit2, test="score")
Model 1 : Sat2 ~ Width + Col2
Model 2 : Sat2 ~ Width + Col2 + Width:Col2
(Columns: Chi = ξ_S; df = r; Pr(Chisq>) = Pr[χ²(r) > ξ_S].)
Chi df Pr(Chisq>)
1 vs 2 0.79097 1 0.3738
207 / 430
Males attached to female horseshoe crabs
Model matrix X: model.matrix(fit)
Parameter estimates β̂ = (β̂₀, β̂₁, ..., β̂_p)ᵀ: coef(fit)
Estimate of the variance-covariance matrix [I(β̂)]⁻¹: vcov(fit)
Fitted values μ̂ = (μ̂₁, ..., μ̂_n)ᵀ: fitted(fit)
Unscaled deviance D(β̂): deviance(fit)
Pearson's statistic P(β̂): sum(resid(fit, type="pearson")**2)
Log-likelihood ℓ(β̂): logLik(fit)
AIC = −2ℓ(β̂) + 2(p + 1): AIC(fit)
BIC = −2ℓ(β̂) + log(n)(p + 1): BIC(fit)
R²_adj = 1 − [D(β̂)/(n − p − 1)] / [D₀(β̂)/(n − 1)]:
with(fit, 1-deviance*df.null/(null.deviance*df.residual))
208 / 430
Insecticides
These data, extracted from Hewlett and Plackett [1950] and available in the object deposit of the library GLMsData, arose from an experiment in which groups of insects were exposed to deposits of insecticides. The total number of insects exposed in each group (Number), the number of insects killed after six days of exposure (Killed), and the amount (Deposit) and type of insecticide used (Insecticide) were recorded.

[Figure: percentage of killed insects versus amount of insecticide, in milligrams, by insecticide type (A, B, C).]

$$m_i Y_i \overset{ind}{\sim} \mathrm{Binomial}(m_i,\mu_i), \qquad g(\mu_i)=\beta_0+\beta_1 x_{i1}+\beta_2 x_{i2}+\beta_3 x_{i3}$$

m = Number of exposed insects
y = Proportion of killed insects
x₁ = Amount of insecticide, in mg.
x₂ = 0 and x₃ = 0 if Insecticide="A"
x₂ = 1 and x₃ = 0 if Insecticide="B"
x₂ = 0 and x₃ = 1 if Insecticide="C"
209 / 430
Parameter interpretation: without interaction

The odds of death for an insect exposed to an amount of insecticide x₁ of type A (x₂ = 0 and x₃ = 0), B (x₂ = 1 and x₃ = 0) or C (x₂ = 0 and x₃ = 1) is exp(β₀ + β₁log(x₁) + β₂x₂ + β₃x₃).

Two insects are exposed to the same type of insecticide, but the amounts the first and second insects were exposed to are xδ and x, respectively. Then, the odds of death of the first insect is δ^β₁ times that of the second:
$$\frac{\mathrm{Odds}_{x\delta}}{\mathrm{Odds}_{x}}=\frac{\exp(\beta_0+\beta_1\log(x\delta)+\beta_2 x_2+\beta_3 x_3)}{\exp(\beta_0+\beta_1\log(x)+\beta_2 x_2+\beta_3 x_3)}=\delta^{\beta_1}.$$

Two insects receive the same dose of insecticide, but the first insect is exposed to type B insecticide and the second insect is exposed to type A insecticide. Then, the odds of death of the first insect is exp(β₂) times that of the second:
$$\frac{\mathrm{Odds}_{B}}{\mathrm{Odds}_{A}}=\frac{\exp(\beta_0+\beta_1\log(x_1)+\beta_2)}{\exp(\beta_0+\beta_1\log(x_1))}=\exp(\beta_2).$$

Two insects receive the same dose of insecticide, but the first insect is exposed to type C insecticide and the second insect is exposed to type A insecticide. Then, the odds of death of the first insect is exp(β₃) times that of the second:
$$\frac{\mathrm{Odds}_{C}}{\mathrm{Odds}_{A}}=\frac{\exp(\beta_0+\beta_1\log(x_1)+\beta_3)}{\exp(\beta_0+\beta_1\log(x_1))}=\exp(\beta_3).$$
210 / 430
Parameter interpretation: with interaction

The odds of death for an insect exposed to an amount of insecticide x₁ of type A (x₂ = 0 and x₃ = 0), B (x₂ = 1 and x₃ = 0) or C (x₂ = 0 and x₃ = 1) is exp(β₀ + β₁log(x₁) + β₂x₂ + β₃x₃ + β₄log(x₁)x₂ + β₅log(x₁)x₃).

Two insects were exposed to insecticide type A. The amounts the first and second insects were exposed to are xδ and x, respectively. Then, the odds of death of the first insect is δ^β₁ times that of the second:
$$\frac{\mathrm{Odds}_{x\delta}}{\mathrm{Odds}_{x}}=\frac{\exp(\beta_0+\beta_1\log(x\delta))}{\exp(\beta_0+\beta_1\log(x))}=\delta^{\beta_1}.$$

Two insects were exposed to insecticide type B. The amounts the first and second insects were exposed to are xδ and x, respectively. Then, the odds of death of the first insect is δ^(β₁+β₄) times that of the second:
$$\frac{\mathrm{Odds}_{x\delta}}{\mathrm{Odds}_{x}}=\frac{\exp(\beta_0+\beta_1\log(x\delta)+\beta_2+\beta_4\log(x\delta))}{\exp(\beta_0+\beta_1\log(x)+\beta_2+\beta_4\log(x))}=\delta^{\beta_1+\beta_4}.$$

Two insects were exposed to insecticide type C. The amounts the first and second insects were exposed to are xδ and x, respectively. Then, the odds of death of the first insect is δ^(β₁+β₅) times that of the second:
$$\frac{\mathrm{Odds}_{x\delta}}{\mathrm{Odds}_{x}}=\frac{\exp(\beta_0+\beta_1\log(x\delta)+\beta_3+\beta_5\log(x\delta))}{\exp(\beta_0+\beta_1\log(x)+\beta_3+\beta_5\log(x))}=\delta^{\beta_1+\beta_5}.$$
211 / 430
Insecticides
> data(deposit, package="GLMsData")
>
> fit <- glm(Killed/Number ~ log(Deposit) + Insecticide,
+ weights=Number, family=binomial(logit), data=deposit)
> summary(fit)
(Column annotations: Estimate = β̂_j; Std.Error = c_j; z value = z_j = β̂_j/c_j; Pr(>|z|) = Pr[χ²(1) > z_j²].)

Estimate Std.Error z value Pr(>|z|)
(Intercept) -4.4613 0.3564 -12.518 <2e-16
log(Deposit) 2.8875 0.2236 12.912 <2e-16
InsecticideB 0.3489 0.2060 1.694 0.0903
InsecticideC 2.8402 0.2538 11.191 <2e-16

(Dispersion parameter for binomial family taken to be 1)   [φ]

Null deviance : 413.644 on 17 degrees of freedom   [D₀(β̂) on n − 1]
Residual deviance : 23.385 on 14 degrees of freedom   [D(β̂) on n − p − 1]
AIC : 93.583   [−2ℓ(β̂) + 2(p + 1)]
Number of Fisher Scoring iterations: 4
212 / 430
Insecticides
> fit2 <- update(fit,formula=Killed/Number~log(Deposit)*Insecticide)
> anova2(fit,fit2,test="wald")
Model 1 : Killed/Number ~ log(Deposit) + Insecticide
Model 2 : Killed/Number ~ log(Deposit) * Insecticide
(Columns: Chi = ξ_W; df = r; Pr(Chisq>) = Pr[χ²(r) > ξ_W].)
Chi df Pr(Chisq>)
1 vs 2 0.99287 2 0.6087
>
> anova2(fit,fit2, test="lr")
Model 1 : Killed/Number ~ log(Deposit) + Insecticide
Model 2 : Killed/Number ~ log(Deposit) * Insecticide
(Columns: Chi = ξ_LR; df = r; Pr(Chisq>) = Pr[χ²(r) > ξ_LR].)
Chi df Pr(Chisq>)
1 vs 2 1.0487 2 0.5919
>
> anova2(fit,fit2, test="score")
Model 1 : Killed/Number ~ log(Deposit) + Insecticide
Model 2 : Killed/Number ~ log(Deposit) * Insecticide
(Columns: Chi = ξ_S; df = r; Pr(Chisq>) = Pr[χ²(r) > ξ_S].)
Chi df Pr(Chisq>)
1 vs 2 0.99865 2 0.6069
213 / 430
Poisson regression
$$Y_i \overset{ind}{\sim} \mathrm{Poisson}(\mu_i), \qquad g(\mu_i)=\beta_0+\beta_1 x_{i1}+\ldots+\beta_p x_{ip}$$
$$\mathrm{E}(Y_i)=\mu_i \qquad\text{and}\qquad \mathrm{Var}(Y_i)=\mu_i$$

Logarithmic link: $g(\mu_i)=\log(\mu_i)$

Identity link: $g(\mu_i)=\mu_i$

Square-root link: $g(\mu_i)=\sqrt{\mu_i}$
214 / 430
Skin cancer in women
These data, extracted from Kleinbaum et al. [2013, page 751] and available in the object skincancer of the library glmtoolbox, describe the number of non-melanoma skin cancer cases (cases) among women (population) stratified by age (age: "15-24", "25-34", "35-44", "45-54", "55-64", "65-74", "75-84", "85+") in two metropolitan areas of the US (city: "St. Paul", in the Minneapolis area, and "Fort Worth", in the Dallas area).

[Figure: incidence rate of skin cancer ×10⁴ versus age range, in years, for St. Paul and Ft. Worth.]

$$Y_i \overset{ind}{\sim} \mathrm{Poisson}(\mu_i), \qquad \log\!\left(\frac{10^{4}\,\mu_i}{\lambda_i}\right)=\beta_0+\beta_1 x_{i1}+\beta_2 x_{i2}$$

y = Number of cases of skin cancer.
λ = Number of women exposed.
x₁ = 1 if city="Fort Worth"
x₁ = 0 if city="St. Paul"
x₂ = Midpoint of age range.
215 / 430
Parameter interpretation: without interaction

The expected incidence of skin cancer for a population group located in the city x₁ and an age range with midpoint x₂ is exp(β₀ + β₁x₁ + β₂log(x₂)).

If there are two population groups whose age ranges coincide but the first is located in Fort Worth whereas the second is located in Saint Paul, then the expected incidence of skin cancer in the first population group is exp(β₁) times that of the second:
$$\frac{\mu_{FW}}{\mu_{SP}}=\frac{\exp(\beta_0+\beta_1+\beta_2\log(x_2))}{\exp(\beta_0+\beta_2\log(x_2))}=\exp(\beta_1).$$

If there are two population groups located in the same city but the midpoint of the age range of the first is xδ whereas that of the second is x, then the expected incidence of skin cancer in the first population group is δ^β₂ times that of the second:
$$\frac{\mu_{x\delta}}{\mu_{x}}=\frac{\exp(\beta_0+\beta_1 x_1+\beta_2\log(x\delta))}{\exp(\beta_0+\beta_1 x_1+\beta_2\log(x))}=\exp(\beta_2\log(\delta))=\delta^{\beta_2}.$$
216 / 430
Parameter interpretation: with interaction

The expected incidence of skin cancer for a population group in the city x₁ and an age range with midpoint x₂ is exp(β₀ + β₁x₁ + β₂log(x₂) + β₃x₁log(x₂)).

If there are two population groups located in Saint Paul but the midpoint of the age range of the first is xδ whereas that of the second is x, then the expected incidence of skin cancer in the first population group is δ^β₂ times that of the second:
$$\frac{\mu_{x\delta}}{\mu_{x}}=\frac{\exp(\beta_0+\beta_2\log(x\delta))}{\exp(\beta_0+\beta_2\log(x))}=\exp(\beta_2\log(\delta))=\delta^{\beta_2}.$$

If there are two population groups located in Fort Worth but the midpoint of the age range of the first is xδ whereas that of the second is x, then the expected incidence of skin cancer in the first population group is δ^(β₂+β₃) times that of the second:
$$\frac{\mu_{x\delta}}{\mu_{x}}=\frac{\exp(\beta_0+\beta_1+\beta_2\log(x\delta)+\beta_3\log(x\delta))}{\exp(\beta_0+\beta_1+\beta_2\log(x)+\beta_3\log(x))}=\delta^{\beta_2+\beta_3}.$$
217 / 430
Skin cancer in women
> data(skincancer)
> fit <- glm(cases ~ city + log(age), offset=log(population),
+ family=poisson(log), data=skincancer)
> summary(fit)
(Column annotations: Estimate = β̂_j; Std.Error = c_j; z value = z_j = β̂_j/c_j; Pr(>|z|) = Pr[χ²(1) > z_j²].)

Estimate Std.Error z value Pr(>|z|)
(Intercept) -20.35008 0.35282 -57.68 <2e-16
cityFt.Worth 0.80928 0.05217 15.51 <2e-16
log(age) 3.37649 0.08469 39.87 <2e-16

(Dispersion parameter for poisson family taken to be 1)   [φ]

Null deviance : 2789.681 on 15 degrees of freedom   [D₀(β̂) on n − 1]
Residual deviance : 31.816 on 13 degrees of freedom   [D(β̂) on n − p − 1]
AIC : 132.06   [−2ℓ(β̂) + 2(p + 1)]
218 / 430
Skin cancer in women
> fit2 <- update(fit,formula = cases ~ city*log(age))
> anova2(fit,fit2, test="wald")
Model 1 : cases ~ city + log(age)
Model 2 : cases ~ city + log(age) + city:log(age)
(Columns: Chi = ξ_W; df = r; Pr(Chisq>) = Pr[χ²(r) > ξ_W].)
Chi df Pr(Chisq>)
1 vs 2 3.6753 1 0.05523
>
> anova2(fit,fit2, test="lr")
Model 1 : cases ~ city + log(age)
Model 2 : cases ~ city + log(age) + city:log(age)
(Columns: Chi = ξ_LR; df = r; Pr(Chisq>) = Pr[χ²(r) > ξ_LR].)
Chi df Pr(Chisq>)
1 vs 2 3.764 1 0.05237
>
> anova2(fit,fit2, test="score")
Model 1 : cases ~ city + log(age)
Model 2 : cases ~ city + log(age) + city:log(age)
(Columns: Chi = ξ_S; df = r; Pr(Chisq>) = Pr[χ²(r) > ξ_S].)
Chi df Pr(Chisq>)
1 vs 2 3.6808 1 0.05504
219 / 430
Richness of the species
The analysis of these data, available in the object richness of the library glmtoolbox, aims to explain the differences between plots as related to species richness (represented by a count of the number of plant species (richness)) using the differences between plots with respect to biomasses (Biomass) and soil pH levels: low (pH="low"), medium (pH="mid") and high (pH="high").

[Figure: number of plant species versus biomass, by pH level (low, mid, high).]

$$Y_i \overset{ind}{\sim} \mathrm{Poisson}(\mu_i), \qquad \mu_i=\beta_0+\beta_1 x_{i1}+\beta_2 x_{i2}+\beta_3 x_{i3}$$

y = Number of plant species
x₁ = Biomass
x₂ = 0 and x₃ = 0 if pH="low"
x₂ = 1 and x₃ = 0 if pH="mid"
x₂ = 0 and x₃ = 1 if pH="high"
220 / 430
Parameter interpretation: without interaction

The expected number of plant species for a plot with biomass x₁ and pH level low (x₂ = 0 and x₃ = 0), mid (x₂ = 1 and x₃ = 0) or high (x₂ = 0 and x₃ = 1) is β₀ + β₁x₁ + β₂x₂ + β₃x₃.

There are two plots with the same pH level, but the biomass of the first plot is x + δ while that of the second plot is x. As a result, the expected number of plant species for the first plot is |β₁|δ units lower, if β₁ < 0, or greater, if β₁ > 0, than that for the second plot:
$$\mu_{x+\delta}-\mu_{x}=\beta_0+\beta_1(x+\delta)+\beta_2 x_2+\beta_3 x_3-(\beta_0+\beta_1 x+\beta_2 x_2+\beta_3 x_3)=\beta_1\delta.$$

There are two plots with the same biomass, but the pH level of the first plot is mid while that of the second plot is low. As a result, the expected number of plant species for the first plot is |β₂| units lower, if β₂ < 0, or greater, if β₂ > 0, than that for the second plot:
$$\mu_{mid}-\mu_{low}=\beta_0+\beta_1 x_1+\beta_2-(\beta_0+\beta_1 x_1)=\beta_2.$$

There are two plots with the same biomass, but the pH level of the first plot is high while that of the second plot is low. As a result, the expected number of plant species for the first plot is |β₃| units lower, if β₃ < 0, or greater, if β₃ > 0, than that for the second plot:
$$\mu_{high}-\mu_{low}=\beta_0+\beta_1 x_1+\beta_3-(\beta_0+\beta_1 x_1)=\beta_3.$$
221 / 430
Parameter interpretation: with interaction

The expected number of plant species for a plot with biomass x₁ and pH level low (x₂ = 0 and x₃ = 0), mid (x₂ = 1 and x₃ = 0) or high (x₂ = 0 and x₃ = 1) is β₀ + β₁x₁ + β₂x₂ + β₃x₃ + β₄x₁x₂ + β₅x₁x₃.

There are two plots with pH level low, but the biomass of the first plot is x + δ while that of the second plot is x. Therefore, the expected number of plant species for the first plot is |β₁|δ units lower, if β₁ < 0, or greater, if β₁ > 0, than that for the second plot:
$$\mu_{x+\delta}-\mu_{x}=\beta_0+\beta_1(x+\delta)-(\beta_0+\beta_1 x)=\beta_1\delta.$$

There are two plots with pH level mid, but the biomass of the first plot is x + δ while that of the second plot is x. Therefore, the expected number of plant species for the first plot is |β₁ + β₄|δ units lower, if β₁ + β₄ < 0, or greater, if β₁ + β₄ > 0, than that for the second plot:
$$\mu_{x+\delta}-\mu_{x}=\beta_0+\beta_1(x+\delta)+\beta_2+\beta_4(x+\delta)-(\beta_0+\beta_1 x+\beta_2+\beta_4 x)=(\beta_1+\beta_4)\delta.$$

There are two plots with pH level high, but the biomass of the first plot is x + δ while that of the second plot is x. Therefore, the expected number of plant species for the first plot is |β₁ + β₅|δ units lower, if β₁ + β₅ < 0, or greater, if β₁ + β₅ > 0, than that for the second plot:
$$\mu_{x+\delta}-\mu_{x}=\beta_0+\beta_1(x+\delta)+\beta_3+\beta_5(x+\delta)-(\beta_0+\beta_1 x+\beta_3+\beta_5 x)=(\beta_1+\beta_5)\delta.$$
222 / 430
Richness of the species
> data(richness)
> fit <- glm(Species ~ Biomass + pH, family=poisson(identity),
+ data=richness)
> summary(fit)
(Column annotations: Estimate = β̂_j; Std.Error = c_j; z value = z_j = β̂_j/c_j; Pr(>|z|) = Pr[χ²(1) > z_j²].)

Estimate Std.Error z value Pr(>|z|)
(Intercept) 17.7915 0.7789 22.84 <2e-16
Biomass -2.8710 0.1830 -15.68 <2e-16
pHmid 12.5603 0.9821 12.79 <2e-16
pHhigh 23.0681 1.1801 19.55 <2e-16

(Dispersion parameter for poisson family taken to be 1)   [φ]

Null deviance : 452.346 on 89 degrees of freedom   [D₀(β̂) on n − 1]
Residual deviance : 68.126 on 86 degrees of freedom   [D(β̂) on n − p − 1]
AIC : 495.32   [−2ℓ(β̂) + 2(p + 1)]
223 / 430
Richness of the species
> fit2 <- update(fit,formula=Species ~ Biomass + pH + Biomass:pH)
> anova2(fit,fit2, test="wald")
Model 1 : Species ~ Biomass + pH
Model 2 : Species ~ Biomass + pH + Biomass:pH
(Columns: Chi = ξ_W; df = r; Pr(Chisq>) = Pr[χ²(r) > ξ_W].)
Chi df Pr(Chisq>)
1 vs 2 1.1275 2 0.5691
>
> anova2(fit,fit2, test="lr")
Model 1 : Species ~ Biomass + pH
Model 2 : Species ~ Biomass + pH + Biomass:pH
(Columns: Chi = ξ_LR; df = r; Pr(Chisq>) = Pr[χ²(r) > ξ_LR].)
Chi df Pr(Chisq>)
1 vs 2 1.0563 2 0.5897
>
> anova2(fit,fit2, test="score")
Model 1 : Species ~ Biomass + pH
Model 2 : Species ~ Biomass + pH + Biomass:pH
(Columns: Chi = ξ_S; df = r; Pr(Chisq>) = Pr[χ²(r) > ξ_S].)
Chi df Pr(Chisq>)
1 vs 2 0.98562 2 0.6109
224 / 430
Normal regression
$$Y_i \overset{ind}{\sim} \mathrm{Normal}(\mu_i,\sigma_i^{2}), \qquad g(\mu_i)=\beta_0+\beta_1 x_{i1}+\ldots+\beta_p x_{ip}$$
Equivalently,
$$Y_i=\mu_i+\varepsilon_i, \qquad g(\mu_i)=\beta_0+\beta_1 x_{i1}+\ldots+\beta_p x_{ip}, \qquad \varepsilon_i \overset{ind}{\sim} \mathrm{Normal}(0,\sigma_i^{2})$$
$$\mathrm{E}(Y_i)=\mu_i \qquad\text{and}\qquad \mathrm{Var}(Y_i)=\frac{\sigma^{2}}{\omega_i}$$

Inverse link: $g(\mu_i)=1/\mu_i$

Logarithmic link: $g(\mu_i)=\log(\mu_i)$

Identity link: $g(\mu_i)=\mu_i$

If ω₁ = ω₂ = ... = ω_n, then the dispersion parameter is constant, which implies that the variance is also constant.
225 / 430
Normal regression
Let Q_ε_i(α) be the 100(α)th percentile of the random error ε_i. Thus,

Pr[ε_i ≤ Q_ε_i(α)] = α
Pr[μ_i + ε_i ≤ μ_i + Q_ε_i(α)] = α
Pr[Y_i ≤ μ_i + Q_ε_i(α)] = α.

That is, the 100(α)th percentile of Y_i, denoted here by Q_Y_i(α), can be written as μ_i + Q_ε_i(α). In addition, if the link function is the identity (that is, g(μ) = μ) and ω₁ = ω₂ = ... = ω_n = 1, then ε₁, ..., ε_n are identically distributed random variables, so Q_ε_i(α) does not depend on i. Hence, Q_Y_i(α) can be expressed as follows:
$$Q_{Y_i}(\alpha)=\mu_i+Q_{\varepsilon}(\alpha)=\beta_0^{*}+\beta_1 x_{i1}+\ldots+\beta_p x_{ip},$$
where β₀* = β₀ + Q_ε(α). Therefore, the parameter interpretation also may be performed by taking into account the additive effects of the covariates on Q_Y_i(α), for all α ∈ (0, 1).
226 / 430
Advertising
This data set, extracted from James et al. [2013, page 15] and available in the object advertising of the library glmtoolbox, consists of the sales (sales), in thousands of units, of a product in 200 different markets, along with advertising budgets, in thousands of dollars, for the product in each of those markets for television (TV) and radio (radio).

[Figure: sales versus advertising budget for TV, by advertising budget for radio (low, mid, high).]

$$Y_i \overset{ind}{\sim} \mathrm{Normal}(\mu_i,\sigma^{2}), \qquad \log(\mu_i)=\beta_0+\beta_1\log(x_{i1})+\beta_2 x_{i2}+\beta_3\log(x_{i1})\,x_{i2}$$

y = Sales, in thousands of units.
x₁ = Advertising budget for TV.
x₂ = Advertising budget for radio.
227 / 430
Parameter interpretation: without interaction

The expected sales of the product in a market where the advertising budget for TV is x₁ and the advertising budget for radio is x₂ are exp(β₀ + β₁log(x₁) + β₂x₂).

In two markets the radio advertising budgets are the same, but the TV advertising budgets are xδ and x, respectively. As a result, the product's expected sales in the first market are δ^β₁ times those in the second market:
$$\frac{\mu_{x\delta}}{\mu_{x}}=\frac{\exp(\beta_0+\beta_1\log(x\delta)+\beta_2 x_2)}{\exp(\beta_0+\beta_1\log(x)+\beta_2 x_2)}=\exp(\beta_1\log(\delta))=\delta^{\beta_1}.$$

In two markets the advertising budgets for TV coincide, but the advertising budget for radio in the first is x + δ, and in the second is x. As a result, the product's expected sales in the first market are exp(β₂δ) times those in the second market:
$$\frac{\mu_{x+\delta}}{\mu_{x}}=\frac{\exp(\beta_0+\beta_1\log(x_1)+\beta_2(x+\delta))}{\exp(\beta_0+\beta_1\log(x_1)+\beta_2 x)}=\exp(\beta_2\delta).$$
228 / 430
Parameter interpretation: with interaction

The expected sales of the product in a market where the advertising budget for TV is x₁ and the advertising budget for radio is x₂ are exp(β₀ + β₁log(x₁) + β₂x₂ + β₃log(x₁)x₂).

Two markets have radio advertising budgets equal to z, but TV advertising budgets of xδ and x, respectively. As a result, the product's expected sales in the first market are δ^(β₁+β₃z) times those in the second market:
$$\frac{\mu_{x\delta}}{\mu_{x}}=\frac{\exp(\beta_0+\beta_1\log(x\delta)+\beta_2 z+\beta_3\log(x\delta)z)}{\exp(\beta_0+\beta_1\log(x)+\beta_2 z+\beta_3\log(x)z)}=\delta^{\beta_1+\beta_3 z}.$$

Two markets have TV advertising budgets equal to z, but radio advertising budgets of x + δ and x, respectively. As a result, the product's expected sales in the first market are exp[(β₂ + β₃log(z))δ] times those in the second market:
$$\frac{\mu_{x+\delta}}{\mu_{x}}=\frac{\exp(\beta_0+\beta_1\log(z)+\beta_2(x+\delta)+\beta_3\log(z)(x+\delta))}{\exp(\beta_0+\beta_1\log(z)+\beta_2 x+\beta_3\log(z)x)}=\exp[(\beta_2+\beta_3\log(z))\delta].$$
229 / 430
Advertising
> data(advertising)
> fit <- glm(sales ~ log(TV) + radio + log(TV):radio,
+ family=gaussian(log), data=advertising)
> summary(fit)
(Column annotations: Estimate = β̂_j; Std.Error = c_j; z value = z_j = β̂_j/c_j; Pr(>|z|) = Pr[χ²(1) > z_j²].)

Estimate Std.Error z value Pr(>|z|)
(Intercept) 1.0302249 0.0467959 22.015 <2e-16
log(TV) 0.2646359 0.0092146 28.719 <2e-16
radio -0.0043242 0.0014929 -2.896 0.0042
log(TV):radio 0.0036644 0.0002909 12.598 <2e-16

(Dispersion parameter for gaussian family taken to be 0.2668851)   [φ̂]

Null deviance : 5417.149 on 199 degrees of freedom   [D₀(β̂) on n − 1]
Residual deviance : 52.309 on 196 degrees of freedom   [D(β̂) on n − p − 1]
AIC : 309.35   [−2ℓ(β̂) + 2(p + 2)]
Number of Fisher Scoring iterations: 5
230 / 430
Advertising
> fit2 <- update(fit,formula = sales ~ log(TV) + radio)
> anova2(fit2,fit, test="wald")
Model 1 : sales ~ log(TV) + radio
Model 2 : sales ~ log(TV) + radio + log(TV)*radio
(Columns: Chi = ξ_W; df = r; Pr(Chisq>) = Pr[χ²(r) > ξ_W].)
Chi df Pr(Chisq>)
1 vs 2 158.7 1 < 2.2e-16
>
> anova2(fit2,fit, test="lr")
Model 1 : sales ~ log(TV) + radio
Model 2 : sales ~ log(TV) + radio + log(TV)*radio
(Columns: Chi = ξ_LR; df = r; Pr(Chisq>) = Pr[χ²(r) > ξ_LR].)
Chi df Pr(Chisq>)
1 vs 2 153.45 1 < 2.2e-16
>
> anova2(fit2,fit, test="gradient")
Model 1 : sales ~ log(TV) + radio
Model 2 : sales ~ log(TV) + radio + log(TV)*radio
(Columns: Chi = ξ_G; df = r; Pr(Chisq>) = Pr[χ²(r) > ξ_G].)
Chi df Pr(Chisq>)
1 vs 2 151.16 1 < 2.2e-16
231 / 430
Advertising
Model matrix X: model.matrix(fit)
Parameter estimates β̂ = (β̂₀, β̂₁, ..., β̂_p)ᵀ: coef(fit)
Estimate of the variance-covariance matrix [I(β̂)]⁻¹: vcov(fit)
Fitted values μ̂ = (μ̂₁, ..., μ̂_n)ᵀ: fitted(fit)
Unscaled deviance D(β̂): deviance(fit)
Pearson's statistic P(β̂): sum(resid(fit, type="pearson")**2)
Log-likelihood ℓ(β̂): logLik(fit)
AIC = −2ℓ(β̂) + 2(p + 2): AIC(fit)
BIC = −2ℓ(β̂) + log(n)(p + 2): BIC(fit)
R²_adj = 1 − [D(β̂)/(n − p − 1)] / [D₀(β̂)/(n − 1)]:
with(fit, 1-deviance*df.null/(null.deviance*df.residual))
232 / 430
Gamma regression
$$Y_i \overset{ind}{\sim} \mathrm{Gamma}(\mu_i,\sigma_i^{2}), \qquad g(\mu_i)=\beta_0+\beta_1 x_{i1}+\ldots+\beta_p x_{ip}$$
Equivalently,
$$Y_i=\mu_i\,\varepsilon_i, \qquad g(\mu_i)=\beta_0+\beta_1 x_{i1}+\ldots+\beta_p x_{ip}, \qquad \varepsilon_i \overset{ind}{\sim} \mathrm{Gamma}(1,\sigma_i^{2})$$
$$\mathrm{E}(Y_i)=\mu_i \qquad\text{and}\qquad \mathrm{Var}(Y_i)=\frac{\sigma^{2}}{\omega_i}\,\mu_i^{2}$$

Inverse link: $g(\mu_i)=1/\mu_i$

Logarithmic link: $g(\mu_i)=\log(\mu_i)$

Identity link: $g(\mu_i)=\mu_i$

If ω₁ = ω₂ = ... = ω_n, then the dispersion parameter is constant, which implies that the coefficient of variation is also constant.
233 / 430
Gamma regression
Let Q_ε_i(α) be the 100(α)th percentile of the random error ε_i. Thus,

Pr[ε_i ≤ Q_ε_i(α)] = α
Pr[μ_i ε_i ≤ μ_i Q_ε_i(α)] = α
Pr[Y_i ≤ μ_i Q_ε_i(α)] = α.

That is, the 100(α)th percentile of Y_i, denoted here by Q_Y_i(α), can be written as μ_i Q_ε_i(α). In addition, if the link function is the logarithm (that is, g(μ) = log(μ)) and ω₁ = ω₂ = ... = ω_n = 1, then ε₁, ..., ε_n are identically distributed random variables, so Q_ε_i(α) does not depend on i. Hence, log[Q_Y_i(α)] can be expressed as follows:
$$\log[Q_{Y_i}(\alpha)]=\log(\mu_i)+\log[Q_{\varepsilon}(\alpha)]=\beta_0^{*}+\beta_1 x_{i1}+\ldots+\beta_p x_{ip},$$
where β₀* = β₀ + log[Q_ε(α)]. Therefore, the parameter interpretation also may be performed by taking into account the multiplicative effects of the covariates on Q_Y_i(α), for all α ∈ (0, 1).
234 / 430
Hill races in Scotland
Each year, the Scottish Hill Runners Association publishes a list of hill races in Scotland for the year. These data, extracted from Agresti [2015, page 62] and available in the object races of the library glmtoolbox, consist of the record time, distance, and cumulative climb of 35 of those races. The aim of the statistical analysis of these data is to explain the differences between the record times of the races using their differences in distance and cumulative climb. The variables in this data set are the following:

distance (x₁). The distance, in miles, of the races.
cclimb (x₂). The cumulative climb, in thousands of feet, of the races.
rtime (y). The record time, in minutes, of the races.
race. The names of the races.

$$Y_i \overset{ind}{\sim} \mathrm{Gamma}(\mu_i,\sigma^{2}), \qquad \log(\mu_i)=\beta_0+\beta_1\log(x_{i1})+\beta_2 x_{i2}$$
235 / 430
Parameter interpretation: without interaction

The expected (median) record time for a race with distance x₁ and cumulative climb x₂ is (proportional to) exp(β₀ + β₁log(x₁) + β₂x₂).

In two races the cumulative climbs are the same, but the distances are xδ and x, respectively. As a result, the expected (median) record time in the first race is δ^β₁ times that in the second race:
$$\frac{\mu_{x\delta}}{\mu_{x}}=\frac{\exp(\beta_0+\beta_1\log(x\delta)+\beta_2 x_2)}{\exp(\beta_0+\beta_1\log(x)+\beta_2 x_2)}=\exp(\beta_1\log(\delta))=\delta^{\beta_1}.$$

In two races the distances coincide, but the cumulative climb in the first is x + δ, and in the second is x. As a result, the expected (median) record time in the first race is exp(β₂δ) times that in the second race:
$$\frac{\mu_{x+\delta}}{\mu_{x}}=\frac{\exp(\beta_0+\beta_1\log(x_1)+\beta_2(x+\delta))}{\exp(\beta_0+\beta_1\log(x_1)+\beta_2 x)}=\exp(\beta_2\delta).$$
236 / 430
Parameter interpretation: with interaction

The expected (median) record time for a race with distance x₁ and cumulative climb x₂ is (proportional to) exp(β₀ + β₁log(x₁) + β₂x₂ + β₃log(x₁)x₂).

Two races have cumulative climbs equal to z, but distances of xδ and x, respectively. As a result, the expected (median) record time in the first race is δ^(β₁+β₃z) times that in the second race:
$$\frac{\mu_{x\delta}}{\mu_{x}}=\frac{\exp(\beta_0+\beta_1\log(x\delta)+\beta_2 z+\beta_3\log(x\delta)z)}{\exp(\beta_0+\beta_1\log(x)+\beta_2 z+\beta_3\log(x)z)}=\delta^{\beta_1+\beta_3 z}.$$

Two races have distances equal to z, but cumulative climbs of x + δ and x, respectively. As a result, the expected (median) record time in the first race is exp[(β₂ + β₃log(z))δ] times that in the second race:
$$\frac{\mu_{x+\delta}}{\mu_{x}}=\frac{\exp(\beta_0+\beta_1\log(z)+\beta_2(x+\delta)+\beta_3\log(z)(x+\delta))}{\exp(\beta_0+\beta_1\log(z)+\beta_2 x+\beta_3\log(z)x)}=\exp[(\beta_2+\beta_3\log(z))\delta].$$
237 / 430
Hill races in Scotland
> data(races)
> fit <- glm(rtime ~ log(distance) + cclimb, family=Gamma(log),
+ data=races)
> summary(fit)
(Column annotations: Estimate = β̂_j; Std.Error = c_j; z value = z_j = β̂_j/c_j; Pr(>|z|) = Pr[χ²(1) > z_j²].)

Estimate Std.Error z value Pr(>|z|)
(Intercept) 1.86703 0.08938 20.888 <2e-16
log(distance) 0.90141 0.05933 15.193 3.47e-16
cclimb 0.13089 0.02165 6.045 9.52e-07

(Dispersion parameter for Gamma family taken to be 0.02141007)   [φ̂]

Null deviance : 19.79114 on 34 degrees of freedom   [D₀(β̂) on n − 1]
Residual deviance : 0.72634 on 32 degrees of freedom   [D(β̂) on n − p − 1]
AIC : 233.91   [−2ℓ(β̂) + 2(p + 2)]
Number of Fisher Scoring iterations: 5
238 / 430
Hill races in Scotland
> fit2 <- update(fit,formula = rtime ~ log(distance)*cclimb)
> anova2(fit,fit2, test="wald")
Model 1 : rtime ~ log(distance) + cclimb
Model 2 : rtime ~ log(distance) + cclimb + log(distance):cclimb
(Columns: Chi = ξ_W; df = r; Pr(Chisq>) = Pr[χ²(r) > ξ_W].)
Chi df Pr(Chisq>)
1 vs 2 1.8245 1 0.1768
>
> anova2(fit,fit2, test="lr")
Model 1 : rtime ~ log(distance) + cclimb
Model 2 : rtime ~ log(distance) + cclimb + log(distance):cclimb
(Columns: Chi = ξ_LR; df = r; Pr(Chisq>) = Pr[χ²(r) > ξ_LR].)
Chi df Pr(Chisq>)
1 vs 2 1.8072 1 0.1788
>
> anova2(fit,fit2, test="gradient")
Model 1 : rtime ~ log(distance) + cclimb
Model 2 : rtime ~ log(distance) + cclimb + log(distance):cclimb
(Columns: Chi = ξ_G; df = r; Pr(Chisq>) = Pr[χ²(r) > ξ_G].)
Chi df Pr(Chisq>)
1 vs 2 1.8061 1 0.179
239 / 430
Lognormal distribution
If Y ∼ Normal(μ, σ²) then T = exp(Y) ∼ Lognormal(exp(μ), σ²). If T ∼ Lognormal(τ, σ²) then Y = log(T) ∼ Normal(log(τ), σ²). The probability density function of T may be written as follows:
$$f_{T}(t;\tau,\sigma^{2})=\frac{1}{t\sqrt{2\pi\sigma^{2}}}\exp\!\left\{-\frac{[\log(t)-\log(\tau)]^{2}}{2\sigma^{2}}\right\} I_{(0,\infty)}(t).$$
The mean, variance and coefficient of variation of T are
$$\tau\exp\!\left(\frac{\sigma^{2}}{2}\right), \qquad \tau^{2}\exp(\sigma^{2})\big[\exp(\sigma^{2})-1\big] \qquad\text{and}\qquad \sqrt{\exp(\sigma^{2})-1},$$
respectively. In addition, the 100(α)th percentile of T reduces to
$$Q_{T}(\alpha)=\tau\exp(\sigma Z_{\alpha}), \qquad \alpha\in(0,1),$$
where Z_α represents the 100(α)th percentile of the Normal(0, 1). For example, the median of T is τ. The lognormal distribution does not belong to the LEF.
240 / 430
[Figure: probability density function f_T(t; τ, σ²) of the Lognormal(1, σ²) distribution for σ² = 0.05, 0.12, 0.30, 0.8, 2.0 and 5.0.]
241 / 430
Lognormal regression
$$T_i=\tau_i\,\varepsilon_i, \qquad h(\tau_i)=\beta_0+\beta_1 x_{i1}+\ldots+\beta_p x_{ip}, \qquad \varepsilon_i \overset{ind}{\sim} \mathrm{Lognormal}(1,\sigma_i^{2})$$
is equivalent to
$$Y_i=\mu_i+\epsilon_i, \qquad g(\mu_i)=\beta_0+\beta_1 x_{i1}+\ldots+\beta_p x_{ip}, \qquad \epsilon_i \overset{ind}{\sim} \mathrm{Normal}(0,\sigma_i^{2}),$$
where Y_i = log(T_i), μ_i = log(τ_i), ϵ_i = log(ε_i) and g(·) ≡ h[exp(·)].
$$\mathrm{E}(T_i)=\tau_i\exp\!\left(\frac{\sigma^{2}}{2\,\omega_i}\right) \qquad\text{and}\qquad \mathrm{Var}(T_i)=\tau_i^{2}\exp\!\left(\frac{\sigma^{2}}{\omega_i}\right)\!\left[\exp\!\left(\frac{\sigma^{2}}{\omega_i}\right)-1\right].$$

Logarithmic link: h(τ_i) = log(τ_i), so g(μ_i) = μ_i

Identity link: h(τ_i) = τ_i, so g(μ_i) = exp(μ_i)

Inverse link: h(τ_i) = 1/τ_i, so g(μ_i) = exp(−μ_i)

If ω₁ = ω₂ = ... = ω_n, then the dispersion parameter is constant, which implies that the coefficient of variation is also constant.
242 / 430
Exercises
Consider the lognormal model
$$T_i=\tau_i\,\varepsilon_i, \qquad h(\tau_i)=\beta_0+\beta_1 x_{i1}+\ldots+\beta_p x_{ip}, \qquad \varepsilon_i \overset{ind}{\sim} \mathrm{Lognormal}(1,\sigma_i^{2}),$$
or, equivalently,
$$Y_i=\mu_i+\epsilon_i, \qquad g(\mu_i)=\beta_0+\beta_1 x_{i1}+\ldots+\beta_p x_{ip}, \qquad \epsilon_i \overset{ind}{\sim} \mathrm{Normal}(0,\sigma_i^{2}),$$
where Y_i = log(T_i), y_i = log(t_i), μ_i = log(τ_i), ϵ_i = log(ε_i) and g(·) ≡ h[exp(·)]. Show that:

(A) The likelihood function of β under the lognormal model is proportional to that under the normal model. Therefore, the maximum likelihood estimator of β under the lognormal model coincides with that under the normal model.

(B) The deviance under the lognormal model coincides with that under the normal model.

(C) The GAIC under the lognormal model is equal to that under the normal model plus 2[log(t₁) + ... + log(t_n)] = 2(y₁ + ... + y_n). (See the sketch after this list.)

(D) If h(τ_i) = log(τ_i) and ω₁ = ω₂ = ... = ω_n, then log[Q_T_i(α)] = β₀* + β₁x_{i1} + ... + β_p x_{ip} for all α ∈ (0, 1), where β₀* = β₀ + log[Q_ε(α)].
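A quick numerical check of (C), as a minimal sketch under assumed, hypothetical names (a data frame dat with a positive response t and a covariate x):

# Minimal sketch checking exercise (C): the lognormal GAIC equals the
# normal AIC of the log-response fit plus 2*sum(log(t))
fit.norm       <- glm(log(t) ~ x, family = gaussian, data = dat)
GAIC.lognormal <- AIC(fit.norm) + 2 * sum(log(dat$t))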
243 / 430
Lognormal versus Gamma model under log link
$$T_i=\tau_i\,\varepsilon_i, \qquad \log(\tau_i)=\beta_0+\beta_1 x_{i1}+\ldots+\beta_p x_{ip}, \qquad \varepsilon_i \overset{ind}{\sim} \mathrm{Lognormal}(1,\sigma^{2})$$
versus
$$Y_i=\mu_i\,\epsilon_i, \qquad \log(\mu_i)=\beta_0+\beta_1 x_{i1}+\ldots+\beta_p x_{ip}, \qquad \epsilon_i \overset{ind}{\sim} \mathrm{Gamma}(1,\sigma^{2})$$

These models have the following similarities:

(1) They are appropriate when the response variable is continuous, strictly positive, and right-skewed.
(2) The constant dispersion parameter assumption is equivalent to the constant coefficient of variation assumption.
(3) A higher value of the dispersion parameter indicates more asymmetry in the distribution of the response variable.
(4) The model parameters can be interpreted by taking into account the following: (i) log[E(T_i)] = β₀* + β₁x_{i1} + ... + β_p x_{ip} and log[E(Y_i)] = β₀ + β₁x_{i1} + ... + β_p x_{ip}; or (ii) log[Q_T_i(α)] = β̄₀ + β₁x_{i1} + ... + β_p x_{ip} and log[Q_Y_i(α)] = β̃₀ + β₁x_{i1} + ... + β_p x_{ip} for all α ∈ (0, 1).
Lognormal versus Gamma model under log link
$$T_i=\tau_i\,\varepsilon_i, \qquad \log(\tau_i)=\beta_0+\beta_1 x_{i1}+\ldots+\beta_p x_{ip}, \qquad \varepsilon_i \overset{ind}{\sim} \mathrm{Lognormal}(1,\sigma^{2})$$
versus
$$Y_i=\mu_i\,\epsilon_i, \qquad \log(\mu_i)=\beta_0+\beta_1 x_{i1}+\ldots+\beta_p x_{ip}, \qquad \epsilon_i \overset{ind}{\sim} \mathrm{Gamma}(1,\sigma^{2})$$

The lognormal model has the following advantages over the Gamma model:

(1) The log-likelihood function of β in the lognormal model, denoted here by ℓ(β), is a strictly concave function. Therefore, if the system of equations given by R(β) = 0 has a solution, where R(β) = (∂ℓ(β)/∂β₀, ..., ∂ℓ(β)/∂β_p)ᵀ is the gradient of ℓ(β), then it is the only global maximum of ℓ(β).

(2) The maximum likelihood estimator of the model parameters in the lognormal model has a closed form (i.e., β̂ = (XᵀX)⁻¹Xᵀt*, where t* = (log(t₁), ..., log(t_n))ᵀ), whereas in the Gamma model it requires numerical methods.

(3) Statistical inference under the lognormal model is exact, whereas inference under the Gamma model is asymptotic, meaning that it is valid only for large sample sizes.
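A minimal sketch of the closed form in (2), under assumed, hypothetical names (a data frame dat with positive response t and covariates x1, x2):

# Minimal sketch of the closed-form ML estimator in the lognormal model:
# beta.hat = (X'X)^{-1} X' log(t)
X        <- model.matrix(~ x1 + x2, data = dat)
beta.hat <- solve(crossprod(X), crossprod(X, log(dat$t)))
# identical to coef(lm(log(t) ~ x1 + x2, data = dat))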
245 / 430
Gasoline yield
The analysis of the data in the object Auto of the library ISLR is aimed at explaining the differences between the vehicles with respect to their gasoline yield (mpg), in miles per gallon, using other vehicle characteristics such as their horsepower (horsepower) and their weight (weight), in kilograms.

[Figure: gasoline yield versus horsepower, by weight (low, mid, high).]

$$Y_i \overset{ind}{\sim} \mathrm{Lognormal}(\mu_i,\sigma^{2}), \qquad \log(\mu_i)=\beta_0+\beta_1 x_{i1}+\beta_2 x_{i2}+\beta_3 x_{i1}x_{i2}$$

y = Gasoline yield
x₁ = Horsepower
x₂ = Weight
246 / 430
Parameter interpretation: without interaction

The expected (median) gasoline yield for a vehicle with horsepower x₁ and weight x₂ is proportional to exp(β₀ + β₁x₁ + β₂x₂).

In two vehicles the weights are the same, but the horsepowers are x + δ and x, respectively. As a result, the expected (median) gasoline yield of the first vehicle is exp(β₁δ) times that of the second vehicle:
$$\frac{\mu_{x+\delta}}{\mu_{x}}=\frac{\exp(\beta_0+\beta_1(x+\delta)+\beta_2 x_2)}{\exp(\beta_0+\beta_1 x+\beta_2 x_2)}=\exp(\beta_1\delta).$$

In two vehicles the horsepowers are the same, but the weights are x + δ and x, respectively. As a result, the expected (median) gasoline yield of the first vehicle is exp(β₂δ) times that of the second vehicle:
$$\frac{\mu_{x+\delta}}{\mu_{x}}=\frac{\exp(\beta_0+\beta_1 x_1+\beta_2(x+\delta))}{\exp(\beta_0+\beta_1 x_1+\beta_2 x)}=\exp(\beta_2\delta).$$
247 / 430
Parameter interpretation: with interaction

The expected (median) gasoline yield for a vehicle with horsepower x₁ and weight x₂ is proportional to exp(β₀ + β₁x₁ + β₂x₂ + β₃x₁x₂).

Two vehicles have weights equal to z, but horsepowers of x + δ and x, respectively. As a result, the expected (median) gasoline yield of the first vehicle is exp[(β₁ + β₃z)δ] times that of the second vehicle:
$$\frac{\mu_{x+\delta}}{\mu_{x}}=\frac{\exp(\beta_0+\beta_1(x+\delta)+\beta_2 z+\beta_3(x+\delta)z)}{\exp(\beta_0+\beta_1 x+\beta_2 z+\beta_3 xz)}=\exp[(\beta_1+\beta_3 z)\delta].$$

Two vehicles have horsepowers equal to z, but weights of x + δ and x, respectively. As a result, the expected (median) gasoline yield of the first vehicle is exp[(β₂ + β₃z)δ] times that of the second vehicle:
$$\frac{\mu_{x+\delta}}{\mu_{x}}=\frac{\exp(\beta_0+\beta_1 z+\beta_2(x+\delta)+\beta_3 z(x+\delta))}{\exp(\beta_0+\beta_1 z+\beta_2 x+\beta_3 zx)}=\exp[(\beta_2+\beta_3 z)\delta].$$
248 / 430
Gasoline yield
> data(Auto, package="ISLR")
> Auto <- within(Auto, weight <- weight/1000)
>
> fit <- glm(log(mpg) ~ horsepower + weight + horsepower:weight,
+ family=gaussian(identity), data=Auto)
> summary(fit)
(Column annotations: Estimate = β̂_j; Std.Error = c_j; z value = z_j = β̂_j/c_j; Pr(>|z|) = Pr[χ²(1) > z_j²].)

Estimate Std.Error z value Pr(>|z|)
(Intercept) 4.5121298 0.0912295 49.459 < 2e-16
horsepower -0.0071145 0.0010622 -6.698 7.44e-11
weight -0.3618254 0.0301305 -12.009 < 2e-16
horsepower:weight 0.0011991 0.0002589 4.631 4.96e-06

(Dispersion parameter for gaussian family taken to be 0.02341458)   [φ̂]

Null deviance : 45.2100 on 391 degrees of freedom   [D₀(β̂) on n − 1]
Residual deviance : 9.0849 on 388 degrees of freedom   [D(β̂) on n − p − 1]

> AIC(fit) + 2*sum(fit$y)   # GAIC of the lognormal model: -2ℓ(β̂) + 2(p + 2)
2075.781
249 / 430
Genesis of link functions in
binary response models
250 / 430
Genesis of link functions in binary response models
$$Y_i \overset{ind}{\sim} \mathrm{Bernoulli}(\mu_i), \qquad g(\mu_i)=\beta_0+\beta_1 x_{i1}+\ldots+\beta_p x_{ip}$$

The link functions frequently used in binary response models (logit, probit, complementary log-log and Cauchy) may be motivated by using the following latent linear model (see, for instance, Agresti [2015, section 5.1.2])
$$Y_i^{*}=\beta_0^{*}+\beta_1^{*}x_{i1}+\ldots+\beta_p^{*}x_{ip}-\sigma e_i,$$
where β* = (β₀*, β₁*, ..., β_p*)ᵀ are regression parameters, σ > 0 is a scale parameter, and e₁, ..., e_n are independent and identically distributed random errors whose cumulative distribution function, denoted here by G(·), satisfies the following: (i) it is a continuous and twice differentiable function; (ii) 0 < G(e) < 1 for all e ∈ ℝ; and (iii) it does not depend on unknown parameters.

The response is dichotomized as follows:
$$Y_i=\begin{cases}1 & \text{if } Y_i^{*}\geq\tau\\ 0 & \text{if } Y_i^{*}<\tau\end{cases},$$
251 / 430
Link functions in binary response models
where τ is an arbitrary value. Therefore, Y_i ∼ind Bernoulli(μ_i), where μ_i is given by

μ_i = Pr[Y_i = 1]
 = Pr[Y_i* ≥ τ]
 = Pr[β₀* + β₁*x_{i1} + ... + β_p*x_{ip} − σe_i ≥ τ]
 = Pr[σe_i ≤ β₀* + β₁*x_{i1} + ... + β_p*x_{ip} − τ]
 = Pr[e_i ≤ (β₀* − τ)/σ + (β₁*/σ)x_{i1} + ... + (β_p*/σ)x_{ip}]
 = Pr[e_i ≤ β₀ + β₁x_{i1} + ... + β_p x_{ip}]
 = G(β₀ + β₁x_{i1} + ... + β_p x_{ip}),

in which β₀ = (β₀* − τ)/σ and β_j = β_j*/σ for j = 1, ..., p. Thus, the link function arises naturally as g(μ_i) = G⁻¹(μ_i).
252 / 430
Link functions in binary response models
(1) Standard Logistic distribution
$$G(e)=\frac{\exp(e)}{1+\exp(e)}.$$
Then,
$$g(\mu_i)=G^{-1}(\mu_i)=\log\!\left(\frac{\mu_i}{1-\mu_i}\right).$$
That is, random errors distributed according to the standard logistic distribution induce the logit link function.

(2) Standard Normal distribution
$$G(e)=\Phi(e)\approx\frac12+\mathrm{sign}(e)\,\frac12\sqrt{1-\exp\!\left(-\sqrt{\pi/8}\;e^{2}\right)}.$$
Then,
$$g(\mu_i)=G^{-1}(\mu_i)=\Phi^{-1}(\mu_i).$$
That is, random errors distributed according to the standard normal distribution induce the probit link function.
253 / 430
Link functions in binary response models
(3) Standard Extreme Value distribution
$$G(e)=1-\exp[-\exp(e)].$$
Then,
$$g(\mu_i)=G^{-1}(\mu_i)=\log[-\log(1-\mu_i)]=\log\!\left[\log\!\left(\frac{\mu_i}{1-\mu_i}+1\right)\right].$$
That is, errors distributed according to the standard extreme value distribution induce the cloglog link function.

(4) Standard Cauchy distribution
$$G(e)=\frac12+\frac{1}{\pi}\arctan(e).$$
Then,
$$g(\mu_i)=G^{-1}(\mu_i)=\tan\!\left[\frac{\pi}{2}(2\mu_i-1)\right].$$
That is, random errors distributed according to the standard Cauchy distribution induce the cauchit link function.
254 / 430
Link functions in binary response models
[Figure: μ = g⁻¹(η) versus the linear predictor η, for η ∈ (−5, 5), under the logit, probit, cloglog and cauchit link functions.]
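The figure can be reproduced with the inverse links available in base R; a minimal sketch:

# Minimal sketch reproducing the figure: mu = g^{-1}(eta) for the four links
eta <- seq(-5, 5, length.out = 200)
plot(eta, plogis(eta), type = "l",
     xlab = expression(eta), ylab = expression(mu))   # logit
lines(eta, pnorm(eta), lty = 2)                       # probit
lines(eta, 1 - exp(-exp(eta)), lty = 3)               # cloglog
lines(eta, pcauchy(eta), lty = 4)                     # cauchit
legend("topleft", c("logit", "probit", "cloglog", "cauchit"), lty = 1:4)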
255 / 430
Receiver Operating
Characteristic (ROC) Curve
256 / 430
ROC Curve
The sample is partitioned into a sample for training and a sample for validation. The binary response model is "trained" using the training sample. In order to evaluate the predictive ability of the trained model, the validation sample is used.
257 / 430
ROC Curve
For a given value of the cutoff point ν, the Classification Rule of interest is the following:
$$\hat Y_i=\begin{cases}1 & \text{if } \hat\mu_i\geq\nu\\ 0 & \text{if } \hat\mu_i<\nu\end{cases}$$
Therefore, the confusion matrix reduces to

                       Predicted condition
                       Ŷ = 0    Ŷ = 1
Actual      Y = 0       a_ν      c_ν
condition   Y = 1       b_ν      d_ν

Then, the Sensitivity or True Positive Rate (TPR), and the Specificity or True Negative Rate (TNR), are given, respectively, by
$$\mathrm{TPR}(\nu)=\frac{d_\nu}{b_\nu+d_\nu} \qquad\text{and}\qquad \mathrm{TNR}(\nu)=\frac{a_\nu}{a_\nu+c_\nu}$$
258 / 430
ROC Curve
[Figure: fitted values μ̂ for the zeros and the ones under five scenarios, A to E.]

The ideal scenario is E, since it allows for perfect differentiation between "successes" and "failures". By contrast, the worst scenario is D, since no distinction can be made between "successes" and "failures".
259 / 430
ROC Curve
The Receiver Operating Characteristic (ROC) curve is the plot of TPR(ν) versus 1 − TNR(ν) for ν₁ < ν₂ < ... < ν_{r−1} < ν_r.

Cutoff point (ν) | 1 − TNR(ν)              | TPR(ν)
ν₁               | c_ν₁/(a_ν₁ + c_ν₁)      | d_ν₁/(b_ν₁ + d_ν₁)
ν₂               | c_ν₂/(a_ν₂ + c_ν₂)      | d_ν₂/(b_ν₂ + d_ν₂)
⋮                | ⋮                       | ⋮
ν_r              | c_ν_r/(a_ν_r + c_ν_r)   | d_ν_r/(b_ν_r + d_ν_r)

The Area Under the ROC curve (AUROC) is a measure of the predictive power of the model, and it can be calculated by successively applying the formula for the area of a trapezoid. Thus,
$$\mathrm{AUROC}=\frac12\sum_{l=2}^{r}\big[\mathrm{TNR}(\nu_l)-\mathrm{TNR}(\nu_{l-1})\big]\big[\mathrm{TPR}(\nu_l)+\mathrm{TPR}(\nu_{l-1})\big].$$
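A minimal sketch of this trapezoid computation, assuming y holds the observed 0/1 responses and mu the corresponding fitted probabilities:

# Minimal sketch: AUROC by the trapezoid formula above
nu  <- c(-Inf, sort(unique(mu)), Inf)              # cutoff grid, with endpoints
TPR <- sapply(nu, function(v) mean(mu[y == 1] >= v))
TNR <- sapply(nu, function(v) mean(mu[y == 0] <  v))
AUROC <- 0.5 * sum(diff(TNR) * (TPR[-1] + TPR[-length(TPR)]))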
260 / 430
ROC Curve
[Figure: one trapezoid of the ROC curve, with vertices at (1 − TNR(ν_l), TPR(ν_l)) and (1 − TNR(ν_{l−1}), TPR(ν_{l−1})).]

The area of the shaded region can be obtained as follows:
$$\big[(1-\mathrm{TNR}(\nu_{l-1}))-(1-\mathrm{TNR}(\nu_{l}))\big]\times\frac12\big[\mathrm{TPR}(\nu_l)+\mathrm{TPR}(\nu_{l-1})\big]
=\big[\mathrm{TNR}(\nu_l)-\mathrm{TNR}(\nu_{l-1})\big]\times\frac12\big[\mathrm{TPR}(\nu_l)+\mathrm{TPR}(\nu_{l-1})\big].$$
261 / 430
ROC Curve
The AUROC also may be expressed using the statistic of the Mann-Whitney test (see Gibbons and Chakraborti [2010, section 6.6]):
$$\mathrm{AUROC}=\frac{1}{N_0 N_1}\sum_{l\in G_0}\sum_{i\in G_1} S(\hat\mu_l,\hat\mu_i),$$
where G₀ = {i : y_i = 0}, G₁ = {i : y_i = 1}, N₀ = #G₀, N₁ = #G₁ and S(a, b) may be expressed as follows:
$$S(a,b)=\begin{cases}1 & \text{if } a<b\\ \tfrac12 & \text{if } a=b\\ 0 & \text{if } a>b\end{cases}$$
If μ̂_l < μ̂_i for all l ∈ G₀ and i ∈ G₁, then AUROC = 1; that is, there is ν* ∈ (0, 1) such that the classification rule given by
$$\hat Y_i=\begin{cases}1 & \text{if } \hat\mu_i\geq\nu^{*}\\ 0 & \text{if } \hat\mu_i<\nu^{*}\end{cases}$$
is "perfect" in the sense that TPR(ν*) = TNR(ν*) = 1.
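A minimal sketch of the Mann-Whitney form, again assuming y holds the observed 0/1 responses and mu the fitted probabilities:

# Minimal sketch: AUROC via the Mann-Whitney statistic
mu0 <- mu[y == 0]; mu1 <- mu[y == 1]
S <- outer(mu0, mu1, function(a, b) (a < b) + 0.5 * (a == b))
AUROC <- sum(S) / (length(mu0) * length(mu1))
# equivalently: wilcox.test(mu1, mu0)$statistic / (length(mu0) * length(mu1))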
262 / 430
ROC Curve
Other measures of the predictive power of the model also may be considered. For instance,
$$\text{Gini Coefficient}=2\left(\mathrm{AUROC}-\frac12\right),$$
and the Kolmogorov-Smirnov (K-S) statistic (see, for instance, Gibbons and Chakraborti [2010, section 6.3]), which compares the empirical cumulative distribution function of {μ̂_i : y_i = 0}, denoted here by ECDF⁽⁰⁾(μ), with that of {μ̂_i : y_i = 1}, denoted here by ECDF⁽¹⁾(μ), in the following form:
$$\text{K-S Statistic}=\max_{\mu}\Big|\mathrm{ECDF}^{(0)}(\mu)-\mathrm{ECDF}^{(1)}(\mu)\Big|=\max_{l}\Big|\mathrm{TNR}(\nu_l)-\big(1-\mathrm{TPR}(\nu_l)\big)\Big|.$$
If μ̂_l < μ̂_i for all l ∈ G₀ and i ∈ G₁, then K-S Statistic = 1.
263 / 430
ROC Curve
For a given value of the cutoff point ν, the Reference Classification Rule is the following:
$$\hat Y_i=\begin{cases}1 & \text{if } u_i\geq\nu\\ 0 & \text{if } u_i<\nu\end{cases}$$
where u₁, u₂, ... are realizations of random variables such that U_i ∼ind Uniform(0, 1). The ROC curve of this classification rule is the zero-intercept and unit-slope straight line, as TPR(ν) = Pr[U_i ≥ ν] = 1 − ν and TNR(ν) = Pr[U_i < ν] = ν for all ν ∈ (0, 1). Thus, the AUROC of the reference classification rule is 0.5.

The classification rule of a model whose linear predictor includes just the intercept term also may be used as a reference. For that model the estimates of μ_i are μ̂_i = μ̂ for all i. Thus, the ROC curve and the AUROC of that classification rule are the zero-intercept and unit-slope straight line and 0.5, respectively, as

(1) TPR(ν) = 1 and TNR(ν) = 0 for all ν ≤ μ̂
(2) TPR(ν) = 0 and TNR(ν) = 1 for all ν > μ̂
264 / 430
Burn injury study
This data set, discussed in Hosmer et al. [2013, page 27] and available in the object burn1000 of the library aplore3, is a subset of 1000 subjects treated between 2000 and 2007 at 40 different burn facilities. Available information includes patient demographics, total burn surface area and presence of inhalation injury. The outcome of interest is survival to hospital discharge. The data set includes the following:

age (x₁). The patient's age at admission, in years.
gender (x₂). The patient's gender: "Female" and "Male".
race (x₃). The patient's race: "Non-White" and "White".
tbsa (x₄). Total burn surface area, in percentage.
inh_inj (x₅). Burn involved inhalation injury? "No" and "Yes".
flame (x₆). Flame involved in burn injury? "No" and "Yes".
death (y). Hospital discharge status: "Alive" and "Dead".

$$Y_i \overset{ind}{\sim} \mathrm{Bernoulli}(\mu_i), \qquad \log\!\left(\frac{\mu_i}{1-\mu_i}\right)=\beta_0+\beta_1 x_{i1}+\ldots+\beta_6 x_{i6}$$
265 / 430
Burn injury study
> help(ROCc)
>
> burn <- aplore3::burn1000
> burn <- within(burn, death2 <- ifelse(death=="Dead",1,0))
> s <- sample(1:nrow(burn), size=nrow(burn)*0.7)
> train <- burn[s,]
> test <- burn[-s,]
>
> fit <- glm(death ~ age + tbsa + inh_inj + age*inh_inj +
+ tbsa*inh_inj, family=binomial(logit), data=train)
>
> pr <- predict(fit, newdata=test, type="response")
>
> out <- ROCc(cbind(test[,"death2"],pr),col="red",col.lab="blue")
> out
This function provides a plot of the ROC curve, as well as the values of the AUROC, the Gini coefficient, and the K-S statistic for the classification rule of interest when it is applied to the validation sample.
266 / 430
ROC Curve: Example
[Figure: ROC curve (Sensitivity versus 1 − Specificity) for the burn injury model; AUROC = 0.972.]
267 / 430
ROC Curve: Example
[Figure: empirical cumulative distribution functions of μ̂ for the zeros and the ones; the largest vertical gap between them is the Kolmogorov-Smirnov statistic, K-S Statistic = 0.851.]
268 / 430
ROC Curve
There are several ways to define the "optimal" cutoff point ν. For example, the "optimal" cutoff point can be defined as the value of ν which minimizes the expected classification cost, which is given by (see, for instance, Habibzadeh et al. [2016], Nahm [2022])
$$\mathrm{E(Cost)}=\mathrm{E(Cost}\mid Y=0)\Pr(Y=0)+\mathrm{E(Cost}\mid Y=1)\Pr(Y=1),$$
where
$$\mathrm{Cost}\mid Y=0=\begin{cases}C_{0|0} & \text{if } \hat Y=0\\ C_{1|0} & \text{if } \hat Y=1\end{cases} \qquad\text{and}\qquad \mathrm{Cost}\mid Y=1=\begin{cases}C_{0|1} & \text{if } \hat Y=0\\ C_{1|1} & \text{if } \hat Y=1\end{cases}$$
in which C₀|₀, C₁|₀, C₀|₁ and C₁|₁ are assumed to be non-negative and known values.
269 / 430
ROC Curve
Therefore,
$$\mathrm{E(Cost}\mid Y=0)=C_{1|0}\,(1-\mathrm{TNR})+C_{0|0}\,\mathrm{TNR}$$
and
$$\mathrm{E(Cost}\mid Y=1)=C_{0|1}\,(1-\mathrm{TPR})+C_{1|1}\,\mathrm{TPR}.$$
Thus, the expected classification cost reduces to
$$\mathrm{E(Cost)}=\big[C_{1|0}\,(1-\mathrm{TNR})+C_{0|0}\,\mathrm{TNR}\big]\Pr(Y=0)+\big[C_{0|1}\,(1-\mathrm{TPR})+C_{1|1}\,\mathrm{TPR}\big]\Pr(Y=1).$$
The values of Pr(Y = 0) and Pr(Y = 1) can be replaced by their estimates, which are given by N₀/(N₀ + N₁) and N₁/(N₀ + N₁), respectively.
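A minimal sketch of the resulting cost curve over cutoffs, assuming y (0/1 responses), mu (fitted probabilities) and known cost values C00, C10, C01, C11 (hypothetical names):

# Minimal sketch: expected classification cost over a grid of cutoffs
nu  <- sort(unique(mu))
TPR <- sapply(nu, function(v) mean(mu[y == 1] >= v))
TNR <- sapply(nu, function(v) mean(mu[y == 0] <  v))
p0  <- mean(y == 0); p1 <- mean(y == 1)
cost <- (C10 * (1 - TNR) + C00 * TNR) * p0 +
        (C01 * (1 - TPR) + C11 * TPR) * p1
nu.opt <- nu[which.min(cost)]   # "optimal" cutoff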
270 / 430
Variable selection
271 / 430
Hybrid forward stepwise
The algorithm starts with the simplest model, that is, a model with a linear predictor in which the coefficients of all variables and/or effects, except the intercept term, are zero. Next, the candidate models are constructed by hierarchically adding effects to the linear predictor (i.e., the main effects come first, then the two-way interactions, and so on), and their "relevance" is assessed by comparing the models with and without the added effects using a previously defined criterion. After an effect has been added to the model, any other effect previously added can be removed if, according to the specified criterion, that effect is no longer relevant. See, for example, James et al. [2013, page 210].

The criteria used to compare models include the following:

Akaike Information Criterion
Bayesian Information Criterion
Adjusted R²
p-value of the significance tests
272 / 430
Hybrid backward stepwise

The process begins with the most complex model (i.e., one that includes all variables and/or effects in its linear predictor), and then the candidate models are constructed by hierarchically removing effects from the linear predictor (i.e., the highest-level interactions come first, then the next level of interactions, and so on). The "relevance" of the candidates is determined by comparing the models with and without the excluded effects using a previously established criterion. If an effect is excluded from the model, then any other effect previously excluded may be added back if, based on the specified criterion, that effect is no longer irrelevant. See, for example, James et al. [2013, page 210].

The criteria used to compare models include the following:

Akaike Information Criterion
Bayesian Information Criterion
Adjusted R²
p-value of the significance tests
273 / 430
Hybrid stepwise
[Diagram: hybrid stepwise selection, alternating forward and backward passes over the set of candidate models.]
274 / 430
Hybrid stepwise
> args(stepCriterion.glm)
> help(stepCriterion.glm)

model: the object where the fitted GLM is stored.

criterion: the criterion used to compare the candidate models. The available options are: AIC ("aic"), BIC ("bic"), adjusted deviance-based R-squared ("adjr2"), and p-value of a test ("p-value"). By default, criterion="bic".

test: the statistical test used to compare candidate models if criterion="p-value". The options are: Wald ("wald"), Rao's score ("score"), Terrell's gradient ("gradient"), and likelihood-ratio ("lr") tests. By default, test="wald".

direction: hybrid backward ("backward") or hybrid forward ("forward") procedure. By default, direction="forward".

levels: the pair of values in the interval (0, 1) at which effects enter and leave the model, respectively, if criterion="p-value". By default, levels=c(0.05,0.05).
275 / 430
Best subset selection
If there are p explanatory variables, which can be introduced into the model by p̃ columns in the model matrix X, then the number of possible models of size i, for i = 1, ..., p̃, is given by
$$\binom{\tilde p}{i}=\frac{\tilde p\,!}{(\tilde p-i)!\;i!}.$$
Therefore, the total number of models is
$$\sum_{i=1}^{\tilde p}\binom{\tilde p}{i}=2^{\tilde p}-1.$$
In order to apply the "best subset" selection, an exhaustive search is conducted, separately for every size i, to identify the model with the smallest deviance value. Therefore, if, for a fixed model size, the model selection criteria of interest reduce to monotone functions of the deviance, thus differing only in the way the sizes of the models are compared, then the results of the "best subset" selection do not depend upon the choice of the trade-off between goodness-of-fit and complexity on which they are based.
276 / 430
Best subset selection
> args(bestsubset)
> help(bestsubset)

object: the object of the class glm where the fitted model is stored.

nvmax: a positive integer value indicating the maximum size of subsets to examine.

nbest: a positive integer value indicating the number of subsets of each size to record.

force.in: a vector of positive integers indicating the indices of the columns of the model matrix that should be in all models.

force.out: a vector of positive integers indicating the indices of the columns of the model matrix that should be in no models.

verbose: a logical value indicating whether the report of results should be printed.

digits: a positive integer value indicating the number of decimal places to be used.
277 / 430
Burn injury study
This data set, discussed in Hosmer et al. [2013, page 27] and available in the object burn1000 of the library aplore3, is a subset of 1000 subjects treated between 2000 and 2007 at 40 different burn facilities. Available information includes patient demographics, total burn surface area and presence of inhalation injury. The outcome of interest is survival to hospital discharge. The data set includes the following:

age (x₁). The patient's age at admission, in years.
gender (x₂). The patient's gender: "Female" and "Male".
race (x₃). The patient's race: "Non-White" and "White".
tbsa (x₄). Total burn surface area, in percentage.
inh_inj (x₅). Burn involved inhalation injury? "No" and "Yes".
flame (x₆). Flame involved in burn injury? "No" and "Yes".
death (y). Hospital discharge status: "Alive" and "Dead".

$$Y_i \overset{ind}{\sim} \mathrm{Bernoulli}(\mu_i), \qquad \log\!\left(\frac{\mu_i}{1-\mu_i}\right)=\beta_0+\beta_1 x_{i1}+\ldots+\beta_6 x_{i6}$$
278 / 430
Burn injury study
> burn <- aplore3::burn1000
> burn <- within(burn, death <- factor(death, levels=c("Dead","Alive")))
> m <- death ~ (gender + race + flame + age + tbsa + inh_inj)**2
> fit <- glm(m, family=binomial(logit), data=burn)
>
> stepCriterion(fit, direction="forward", criterion="bic", test="wald")
Initial model: ~ 1
Step 0 :
df AIC BIC adj.R-squared P(Chisq>)(*)
+ tbsa 1 542.65 552.47 0.3622 < 2.2e-16
+ age 1 674.26 684.07 0.2064 < 2.2e-16
+ inh_inj 1 695.66 705.48 0.1811 < 2.2e-16
+ flame 1 759.46 769.27 0.1055 1.443e-15
<none> 847.42 852.33 0.0000
+ race 1 847.97 857.79 0.0007 0.2319
+ gender 1 848.58 858.40 0.0000 0.3567
Step 1 : + tbsa
df AIC BIC adj.R-squared P(Chisq>)(*)
+ age 1 368.98 383.70 0.5698 < 2.2e-16
+ flame 1 522.68 537.40 0.3876 1.224e-05
+ inh_inj 1 526.73 541.46 0.3828 1.040e-05
<none> 542.65 552.47 0.3622
+ gender 1 538.34 553.07 0.3691 0.01108
+ race 1 544.40 559.12 0.3619 0.61320
279 / 430
Burn injury study
Step 2 : + age
df AIC BIC adj.R-squared P(Chisq>)(*)
+ inh_inj 1 352.18 371.81 0.5917 1.35e-05
<none> 368.98 383.70 0.5698
+ flame 1 365.79 385.42 0.5755 0.02695
+ race 1 366.55 386.18 0.5746 0.03628
+ gender 1 369.95 389.58 0.5706 0.30956
+ age:tbsa 1 370.75 390.38 0.5696 0.63608
- tbsa 1 674.26 684.07 0.2064 < 2.2e-16
Step 3 : + inh_inj
df AIC BIC adj.R-squared P(Chisq>)(*)
+ age:inh_inj 1 340.10 364.63 0.6080 0.000178
<none> 352.18 371.81 0.5917
+ race 1 349.78 374.32 0.5965 0.037011
+ flame 1 352.50 377.03 0.5933 0.200542
+ gender 1 353.28 377.82 0.5923 0.340693
+ inh_inj:tbsa 1 353.46 378.00 0.5921 0.394472
+ age:tbsa 1 354.17 378.71 0.5913 0.935625
- age 1 526.73 541.46 0.3828 < 2.2e-16
- tbsa 1 539.54 554.26 0.3676 < 2.2e-16
280 / 430
Burn injury study
Step 4 : + age:inh_inj
df AIC BIC adj.R-squared P(Chisq>)(*)
+ inh_inj:tbsa 1 332.42 361.87 0.6191 0.002075
<none> 340.10 364.63 0.6080
+ race 1 337.19 366.64 0.6134 0.027951
+ flame 1 339.85 369.30 0.6102 0.140166
+ age:tbsa 1 339.99 369.44 0.6101 0.145308
+ gender 1 342.05 371.49 0.6076 0.826913
- tbsa 1 506.34 525.97 0.4088 < 2.2e-16
Step 5 : + inh_inj:tbsa
df AIC BIC adj.R-squared P(Chisq>)(*)
<none> 332.42 361.87 0.6191
+ race 1 328.74 363.09 0.6255 0.01843
+ flame 1 332.83 367.19 0.6206 0.21233
+ age:tbsa 1 333.61 367.96 0.6197 0.36009
+ gender 1 334.33 368.68 0.6188 0.75873
- age:inh_inj 1 353.46 378.00 0.5921 2.823e-06
Final model: ~ tbsa + age + inh_inj + age:inh_inj + tbsa:inh_inj
*********************************************************************
(*) p-values of the Wald test
281 / 430
Burn injury study
> mod <- death ~ gender + race + flame + age*tbsa*inh_inj
> fit <- glm(mod, family=binomial(logit), data=burn)
> bestSubset(fit, digits=3)
_SIZE_ genderMale raceWhite flameYes age tbsa inh_injYes age:tbsa
1 0 0 0 0 0 0 1
2 0 0 0 1 1 0 0
3 0 0 0 1 1 1 0
4 0 0 0 1 1 1 0
5 0 0 0 1 1 1 0
6 0 1 0 1 1 1 0
7 0 1 0 1 1 1 1
8 0 1 1 1 1 1 1
age:inh_injYes tbsa:inh_injYes age:tbsa:inh_injYes AIC BIC
0 0 0 460.088 469.903
0 0 0 368.977 383.700
0 0 0 352.179 371.810
1 0 0 340.095 364.634
1 1 0 332.423 361.870
1 1 0 328.740 363.095
1 0 1 326.971 366.233
1 0 1 327.657 371.827
282 / 430
Exercises
(1) Show that:
(A) Maximizing $R^2_{adj}$ is equivalent to minimizing $D(\hat\beta)/(n-p-1)$.
(B) In GLMs in which the dispersion parameter is known and equal to 1, such as Poisson and binomial models, minimizing $\mathrm{GAIC}(k)$ is equivalent to minimizing $D(\hat\beta) - k(n-p-1)$.
(2) Consider two GLMs in which the dispersion parameter is known and equal to 1, such as Poisson and binomial models, which are identical except for the linear predictor. In the first model, denoted here by (I),
$$g(\mu_i) = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip},$$
while in the second model, denoted by (II),
$$g(\mu_i) = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip} + \beta_{p+1} x_{i(p+1)} + \ldots + \beta_{p+r} x_{i(p+r)}.$$
Model I is often described as nested in model II. Show that:
(A) If the likelihood-ratio test to assess $H_0: (\beta_{p+1}, \ldots, \beta_{p+r})^\top = \mathbf{0}$ versus $H_1: (\beta_{p+1}, \ldots, \beta_{p+r})^\top \neq \mathbf{0}$ rejects $H_0$ at the approximate significance level $\alpha$, then $\mathrm{GAIC}_{II}(k) < \mathrm{GAIC}_{I}(k)$ for all $k \leq \chi^2_{1-\alpha}(r)/r$.
283 / 430
Exercises
(B) If the likelihood-ratio test to assess $H_0: (\beta_{p+1}, \ldots, \beta_{p+r})^\top = \mathbf{0}$ versus $H_1: (\beta_{p+1}, \ldots, \beta_{p+r})^\top \neq \mathbf{0}$ does not reject $H_0$ at the approximate significance level $\alpha$, then $\mathrm{GAIC}_{II}(k) > \mathrm{GAIC}_{I}(k)$ for all $k > \chi^2_{1-\alpha}(r)/r$.
(C) If $\mathrm{GAIC}_{II}(k) < \mathrm{GAIC}_{I}(k)$, then the likelihood-ratio test to assess $H_0: (\beta_{p+1}, \ldots, \beta_{p+r})^\top = \mathbf{0}$ versus $H_1: (\beta_{p+1}, \ldots, \beta_{p+r})^\top \neq \mathbf{0}$ rejects $H_0$ at the approximate significance level $\alpha$, for all $\alpha \geq 1 - \Pr[\chi^2(r) \leq kr]$.
(D) If $\mathrm{GAIC}_{II}(k) > \mathrm{GAIC}_{I}(k)$, then the likelihood-ratio test to assess $H_0: (\beta_{p+1}, \ldots, \beta_{p+r})^\top = \mathbf{0}$ versus $H_1: (\beta_{p+1}, \ldots, \beta_{p+r})^\top \neq \mathbf{0}$ does not reject $H_0$ at the approximate significance level $\alpha$, for all $\alpha \leq 1 - \Pr[\chi^2(r) \leq kr]$.
(E) $R^2_{adj,II} > R^2_{adj,I}$ if and only if the inclusion of the explanatory variables $x_{p+1}, \ldots, x_{p+r}$ in the model results in a reduction in deviance greater than $100 \times r/(n-p-1)\%$. Therefore, the larger the sample size, the smaller the required reduction in deviance for $R^2_{adj}$ to indicate that model II is "better" than model I.
284 / 430
Diagnostic methods
285 / 430
Residual analysis (see Pierce and Schafer [1986])
The residual $r_i$ is aimed to measure the difference between the observed value ($y_i$) and the fitted value ($\hat\mu_i = g^{-1}(x_i^\top\hat\beta)$) for the $i$-th observation. "Large" values of $r_i$ may indicate that the fitted model does not adequately describe the behavior of the $i$-th observation. Indeed, a "large" (in absolute value) and positive value of $r_i$ indicates that the fitted model underestimates the value of $y_i$, whereas a "large" (in absolute value) and negative value of $r_i$ indicates that the fitted model overestimates the value of $y_i$. Therefore, knowing the distribution of $r_i$ is desirable, since it allows us to determine when the value of $r_i$ (positive or negative) is "large". In other words, knowing the distribution of the residual allows us to determine whether the value of $r_i$ is "large" enough to raise doubts about the difference between the observed and fitted values for the $i$-th observation being a result of chance alone.
286 / 430
Residual analysis
Pearson-type residuals:
This type of residual is defined from the (signed) contribution of the $i$-th observation to the Pearson statistic. Thus,
$$r^P_i = \mathrm{sign}(y_i - \hat\mu_i)\sqrt{\frac{p(y_i, \hat\mu_i, \omega_i)}{\hat\phi\,(1-\hat h_{ii})}} = \frac{y_i - \hat\mu_i}{\sqrt{(\hat\phi/\omega_i)\,V(\hat\mu_i)\,(1-\hat h_{ii})}},$$
where $\hat h_{ii}$ is the $(i,i)$-th element of $\hat H = \hat W^{\frac{1}{2}} X (X^\top \hat W X)^{-1} X^\top \hat W^{\frac{1}{2}}$, in which $\hat W = \mathrm{diag}\{\hat w_1, \ldots, \hat w_n\}$ and $\hat w_i = \omega_i/\{[g'(\hat\mu_i)]^2 V(\hat\mu_i)\}$.
Deviance-type residuals (Davison and Gigli [1989]):
This type of residual is defined from the (signed) contribution of the $i$-th observation to the deviance. Thus,
$$r^D_i = \mathrm{sign}(y_i - \hat\mu_i)\sqrt{\frac{d(y_i, \hat\mu_i, \omega_i)}{\hat\phi\,(1-\hat h_{ii})}}.$$
287 / 430
Residual analysis
Quantile-type residuals (Dunn and Smyth [1996]):
If $Y_1, \ldots, Y_n$ are independent and strictly continuous random variables, then
$$F_{Y_i}(Y_i; \mu_i, \phi/\omega_i) \overset{ind}{\sim} \mathrm{Uniform}(0, 1) \qquad\text{and}\qquad \Phi^{-1}\!\left[F_{Y_i}(Y_i; \mu_i, \phi/\omega_i)\right] \overset{ind}{\sim} \mathrm{Normal}(0, 1).$$
These facts motivate the residuals given by
$$r^Q_i = \Phi^{-1}\!\left[F_{Y_i}(y_i; \hat\mu_i, \hat\phi/\omega_i)\right], \qquad i = 1, \ldots, n,$$
in models where the response is assumed to be distributed as normal, gamma, or inverse Gaussian, since, if the model is correctly specified and the values of $(\hat\beta, \hat\phi)$ and $(\beta, \phi)$ are "close", which occurs for "large" values of $n$ due to the consistency of the estimators of $\beta$ and $\phi$, then the behavior of the residuals $r^Q_1, \ldots, r^Q_n$ should be similar to that of a random sample drawn from the Normal(0, 1).
288 / 430
Residual analysis
Accordingly, values of $r^Q_i$ less than $-3$ or greater than $3$ could be considered "large".
In models where the response is assumed to be distributed as binomial or Poisson, the quantile-type residual is randomized to mimic the behavior expected under a continuous distribution. Therefore, $r^Q_i$ becomes
$$r^Q_i = \begin{cases} \Phi^{-1}\!\left[F_{Y_i}(y_i; \hat\mu_i, \hat\phi/\omega_i)\right] & \text{if } Y_i \text{ is strictly continuous,}\\[1ex] \Phi^{-1}\!\left[a_i + f_{Y_i}(y_i; \hat\mu_i, \hat\phi/\omega_i)\,u_i\right] & \text{in other cases,}\end{cases}$$
where $u_1, \ldots, u_n$ are realizations of random variables $U_1, \ldots, U_n$ such that $U_i \overset{ind}{\sim} \mathrm{Uniform}(0, 1)$ and $a_i = \lim_{y \uparrow y_i} F_{Y_i}(y; \hat\mu_i, \hat\phi/\omega_i)$. The standardized quantile-type residuals are given by $r^Q_i\,(1-\hat h_{ii})^{-\frac{1}{2}}$.
289 / 430
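As a worked illustration of the formula above, the following minimal sketch computes randomized quantile-type residuals "by hand" for a Poisson GLM. The model and the data frame mydata are hypothetical; residuals2() of glmtoolbox automates (and generalizes) this computation.

fit <- glm(y ~ x1 + x2, family=poisson(log), data=mydata)  # hypothetical model and data
mu.hat <- fitted(fit)
y <- fit$y
a <- ppois(y - 1, lambda=mu.hat)              # a_i: the CDF just below y_i
u <- runif(length(y))                          # realizations of U_i ~ Uniform(0,1)
rQ <- qnorm(a + dpois(y, lambda=mu.hat) * u)   # randomized quantile-type residual
qqnorm(rQ); abline(0, 1)                       # roughly Normal(0,1) if the model is adequate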
Residual analysis
> args(residuals2)
> help(residuals2)
object: the object where the fitted GLM is stored.
type: the type of residuals which should be returned. The op-
tions are: Pearson ("pearson"), Deviance ("deviance"), and (ran-
domized) Quantile ("quantile"). By default, type is set to be "de-
viance".
standardized: a logical switch indicating if the residuals should be standardized by dividing by $(1-\hat h_{ii})^{\frac{1}{2}}$. By default, standardized is set to be FALSE.
plot.it: a logical switch indicating if a plot of the residuals
versus the fitted values is required. By default, plot.it=TRUE.
...: some graphical parameters to customize the plot when
plot.it=TRUE.
290 / 430
Normal probability plot with simulated envelope
See, for instance, Atkinson [1981].
Step 0: Set $T$ (a positive integer) and $\alpha \in (0, 1)$.
Step 1: Generate $T$ independent samples of size $n$ according to $Y_i \overset{ind}{\sim} \mathrm{LEF}(\hat\mu_i, \hat\phi/\omega_i)$ for $i = 1, \ldots, n$:
$$\begin{array}{cccc} \text{Sample 1} & \text{Sample 2} & \cdots & \text{Sample } T\\ y_{11} & y_{12} & \cdots & y_{1T}\\ \vdots & \vdots & \ddots & \vdots\\ y_{n1} & y_{n2} & \cdots & y_{nT} \end{array}$$
Step 2: Compute the estimates of $\beta$ and $\phi$ in each sample.
Step 3: Compute the residuals $r_{it}$ for $i = 1, \ldots, n$ and $t = 1, \ldots, T$:
$$\begin{array}{cccc} \text{Sample 1} & \text{Sample 2} & \cdots & \text{Sample } T\\ r_{11} & r_{12} & \cdots & r_{1T}\\ \vdots & \vdots & \ddots & \vdots\\ r_{n1} & r_{n2} & \cdots & r_{nT} \end{array}$$
291 / 430
Normal probability plot with simulated envelope
Step 4: Compute the order statistics of $r_{it}$ in each sample (that is, sort each column):
$$\begin{array}{cccc} \text{Sample 1} & \text{Sample 2} & \cdots & \text{Sample } T\\ r_{(1)1} & r_{(1)2} & \cdots & r_{(1)T}\\ \vdots & \vdots & \ddots & \vdots\\ r_{(n)1} & r_{(n)2} & \cdots & r_{(n)T} \end{array}$$
Step 5: Compute the $100(\frac{\alpha}{2})$th, $50$th and $100(1-\frac{\alpha}{2})$th percentiles of row $i$, denoted by $p^{(i)}_{\alpha/2}$, $p^{(i)}_{1/2}$ and $p^{(i)}_{1-\alpha/2}$, for $i = 1, \ldots, n$.
Step 6: Plot $p^{(1)}_{\alpha/2}, \ldots, p^{(n)}_{\alpha/2}$ versus the expected values of the order statistics in a sample of size $n$ of the standard normal distribution.
292 / 430
Normal probability plot with simulated envelope
Step 7: Plot $p^{(1)}_{1/2}, \ldots, p^{(n)}_{1/2}$ versus the expected values of the order statistics in a sample of size $n$ of the standard normal distribution.
Step 8: Plot $p^{(1)}_{1-\alpha/2}, \ldots, p^{(n)}_{1-\alpha/2}$ versus the expected values of the order statistics in a sample of size $n$ of the standard normal distribution.
Step 9: Plot the residuals of the model of interest (that is, $r_1, \ldots, r_n$) versus the expected values of the order statistics in a sample of size $n$ of the standard normal distribution.
> args(envelope.glm)
> help(envelope.glm)
293 / 430
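The steps above can be coded directly. The following minimal sketch builds the envelope for a Poisson GLM with T = 100 replicates, using standardized deviance-type residuals and ppoints() as an approximation to the expected normal order statistics; the objects fit and mydata are hypothetical, and envelope() automates the whole procedure.

n <- nobs(fit); T <- 100; alpha <- 0.05
R <- matrix(0, n, T)
for (t in 1:T) {
  mydata$ystar <- rpois(n, lambda=fitted(fit))            # Step 1: simulate from the fit
  fit.t <- update(fit, ystar ~ ., data=mydata)            # Step 2: re-estimate beta
  R[,t] <- sort(rstandard(fit.t, type="deviance"))        # Steps 3-4: sorted residuals
}
bands <- apply(R, 1, quantile, probs=c(alpha/2, 0.5, 1 - alpha/2))  # Step 5
z <- qnorm(ppoints(n))                                    # expected normal order statistics
matplot(z, t(bands), type="l", lty=c(2,1,2), col=1,
        xlab="Expected quantiles", ylab="Observed quantiles")       # Steps 6-8
points(z, sort(rstandard(fit, type="deviance")))          # Step 9: observed residuals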
Normal probability plot with simulated envelope
object: the object where the fitted GLM is stored.
rep: a positive integer indicating the number of replicates which
should be used to build the simulated envelope. By default, rep
is set to be
25.
conf: a value in the interval (0, 1) indicating the confidence level
which should be used to build the point-wise confidence intervals.
By default, conf is set to be
0.95.
type: the type of residuals which should be used to build the
envelope. The options are: Pearson ("pearson"), Deviance ("de-
viance"), and (randomized) Quantile ("quantile"). By default,
type is set to be "quantile".
standardized: a logical switch indicating if the residuals should be standardized by dividing by $(1-\hat h_{ii})^{\frac{1}{2}}$. By default, standardized is set to be FALSE.
plot.it: a logical switch indicating if a plot of the residuals
versus the fitted values is required. By default, plot.it=FALSE.
...: some graphical parameters to customize the plot when
plot.it=TRUE.
294 / 430
Influence or sensitivity analysis
The Cook's distance, denoted here by $CD_i$, is a measure of the influence of the observation $i$ on the maximum-likelihood estimate of $\beta$, which consists of the norm of $\hat\beta - \hat\beta_{(i)}$ using a metric based on $\widehat{\mathrm{Var}}(\hat\beta)$, where $\hat\beta_{(i)}$ represents the estimate of $\beta$ computed from the dataset in which the observation $i$ is excluded. $CD_i$ may be expressed as follows (Cook and Weisberg [1982, section 5.4]):
$$CD_i = \frac{1}{p+1}\,(\hat\beta - \hat\beta_{(i)})^\top \left[\widehat{\mathrm{Var}}(\hat\beta)\right]^{-1}(\hat\beta - \hat\beta_{(i)}) = \frac{1}{\hat\phi(p+1)}\sum_{i=1}^{n} \hat w_i \left(x_i^\top\hat\beta - x_i^\top\hat\beta_{(i)}\right)^2,$$
where $\hat w_i = \omega_i/\{[g'(\hat\mu_i)]^2 V(\hat\mu_i)\}$. "Large" values of $CD_i$ indicate that removing the observation $i$ introduces "large" changes in the estimate of $\beta$ and that a comprehensive analysis is required to investigate whether the elimination of this observation changes the sign of the parameter estimates and/or introduces changes in the decision about their statistical significance.
295 / 430
Influence or sensitivity analysis
The computational cost of computing $\hat\beta_{(i)}$ for $i = 1, \ldots, n$ may be "high", especially when $n$ is "large". Therefore, the value of $\hat\beta_{(i)}$ in $CD_i$ is replaced by its one-step approximation (see Pregibon [1981]), denoted here by $\hat\beta^1_{(i)}$. This approximation consists of the first iteration of the Fisher scoring algorithm to estimate $\beta$ when it is performed using: (i) a dataset in which the observation $i$ is excluded; and (ii) $\hat\beta$ as the starting value. Thus,
$$\hat\beta^1_{(i)} = \hat\beta - \frac{\omega_i\,(y_i - \hat\mu_i)}{g'(\hat\mu_i)\,V(\hat\mu_i)\,(1-\hat h_{ii})}\,(X^\top \hat W X)^{-1} x_i.$$
The values of $\hat\beta^1_{(i)}$ and $\hat\beta_{(i)}$ coincide for Gaussian models with identity link function. The Cook's distance may be approximated by the following expression:
$$CD^1_i = \frac{1}{\hat\phi(p+1)}\sum_{i=1}^{n} \hat w_i \left(x_i^\top\hat\beta - x_i^\top\hat\beta^1_{(i)}\right)^2 = \frac{1}{p+1}\left(r^P_i\right)^2 \frac{\hat h_{ii}}{1-\hat h_{ii}}.$$
296 / 430
Influence or sensitivity analysis
(1) The matrix $\widehat{\mathrm{Var}}(\hat\beta)$ is positive definite. Therefore, $CD^1_i \geq 0$, and $CD^1_i = 0$ if and only if $\hat\beta = \hat\beta^1_{(i)}$.
(2) If $r^P_i = 0$ (that is, if $y_i = \hat\mu_i$), then $\hat\beta^1_{(i)} = \hat\beta$ and $CD^1_i = 0$.
A univariate version of $CD^1_i$ is known as $\mathrm{DFBETA}^1_{j(i)}$, and it is given by
$$\mathrm{DFBETA}^1_{j(i)} = \hat\beta_j - \hat\beta^1_{j(i)}, \qquad j = 0, \ldots, p,$$
where $\hat\beta^1_{j(i)}$ is the element $(j+1)$ of $\hat\beta^1_{(i)}$. Similarly, $\mathrm{DFBETAS}^1_{j(i)}$ is given by
$$\mathrm{DFBETAS}^1_{j(i)} = \frac{\hat\beta_j - \hat\beta^1_{j(i)}}{se(\hat\beta^1_{j(i)})}, \qquad j = 0, \ldots, p.$$
> help(cooks.distance)
> help(dfbeta)
> help(dfbetas)
297 / 430
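The closed form of $CD^1_i$ can be checked numerically against cooks.distance(), which implements exactly this one-step approximation for GLMs; fit denotes a hypothetical fitted glm object.

h <- hatvalues(fit)                              # hat values h_ii
rp <- residuals(fit, type="pearson")             # Pearson residuals (without phi)
phi <- summary(fit)$dispersion                   # estimated dispersion parameter
cd1 <- rp^2 * h / (length(coef(fit)) * phi * (1 - h)^2)   # CD^1_i written out
all.equal(cd1, cooks.distance(fit))              # agreement up to numerical rounding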
Testing the constant dispersion parameter assumption
To assess the assumption of a constant dispersion parameter, the following varying dispersion model is considered:
$$Y_i \overset{ind}{\sim} \mathrm{LEF}(\mu_i,\ \phi_i/\omega_i),$$
$$g(\mu_i) = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip},$$
$$\log(\phi_i) = \gamma_0 + \gamma_1 m_{i1} + \ldots + \gamma_q m_{iq},$$
where $\tilde m_i = (m_{i1}, \ldots, m_{iq})^\top$ are the values of $q$ explanatory variables, quantitative and/or qualitative, measured on the $i$-th individual. The sets of variables $x_1, \ldots, x_p$ and $m_1, \ldots, m_q$ may or may not coincide. Thus, assessing the assumption of a constant dispersion parameter is equivalent to assessing the following hypotheses in the varying dispersion model described above:
$$H_0: \gamma = \mathbf{0} \quad\text{versus}\quad H_1: \gamma \neq \mathbf{0},$$
where $\gamma = (\gamma_1, \ldots, \gamma_q)^\top$. If $H_0$ is not rejected, then it is possible to conclude that there is no significant statistical evidence against the assumption of a constant dispersion parameter.
298 / 430
Testing the constant dispersion parameter assumption
The decision rule is: reject $H_0$ at the approximate significance level of $100(\alpha)\%$ if $\xi > \chi^2_{1-\alpha}(q)$. The statistic of Rao's score test is given by
$$\xi = \frac{1}{2}\, z^\top M^{*}\,(M^{*\top} M^{*})^{-1} M^{*\top} z,$$
where $M^{*} = \left(I - n^{-1}\mathbf{1}\mathbf{1}^\top\right)M$, $M = (\tilde m_1, \ldots, \tilde m_n)^\top$ and $z = (z_1, \ldots, z_n)^\top$.
If $Y_i \overset{ind}{\sim} \mathrm{Normal}(\mu_i,\ \phi/\omega_i)$, then
$$z_i = \frac{\omega_i (y_i - \hat\mu_i)^2}{\bar\phi} - 1 \qquad\text{and}\qquad \bar\phi = \frac{1}{n}\sum_{i=1}^{n} \omega_i (y_i - \hat\mu_i)^2.$$
If $Y_i \overset{ind}{\sim} \text{Inverse Gaussian}(\mu_i,\ \phi/\omega_i)$, then
$$z_i = \frac{\omega_i (y_i - \hat\mu_i)^2}{\hat\mu_i^2\, y_i\, \bar\phi} - 1 \qquad\text{and}\qquad \bar\phi = \frac{1}{n}\sum_{i=1}^{n} \frac{\omega_i (y_i - \hat\mu_i)^2}{\hat\mu_i^2\, y_i}.$$
See, for instance, Wei et al. [1998].
299 / 430
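For a Gaussian model with unit weights, the score statistic above has a simple closed form. The following minimal sketch computes $\xi$ by hand for a hypothetical fit whose mean covariates x1 and x2 are also used in the dispersion model; vdtest() automates this computation.

fit <- glm(y ~ x1 + x2, family=gaussian, data=mydata)     # hypothetical model and data
res2 <- residuals(fit)^2
z <- res2/mean(res2) - 1                                  # z_i, with phi-bar = mean(res2)
M <- model.matrix(~ x1 + x2, data=mydata)[, -1, drop=FALSE]
Mc <- scale(M, center=TRUE, scale=FALSE)                  # M* = (I - 11'/n) M
xi <- 0.5 * drop(t(z) %*% Mc %*% solve(crossprod(Mc)) %*% t(Mc) %*% z)
pchisq(xi, df=ncol(Mc), lower.tail=FALSE)                 # compare with vdtest(fit)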
Testing the constant dispersion parameter assumption
> args(vdtest.glm)
> help(vdtest.glm)
model: the object where the fitted GLM is stored. Only
gaussian(), Gamma() and inverse.gaussian() families are
supported.
varformula: By default, the same explanatory variables are taken as in the model for the mean. However, the argument varformula, which must be a formula-type object, may be used to specify the desired explanatory variables.
verbose: a logical switch indicating whether the report of results should be printed. By default, verbose=TRUE.
300 / 430
Box-Tidwell transformation
Let's consider the following generalized nonlinear model:
$$Y_i \overset{ind}{\sim} \mathrm{LEF}(\mu_i,\ \phi/\omega_i),$$
$$g(\mu_i) = \beta_0 + \beta_1 h_1(x_{i1}) + \ldots + \beta_r h_r(x_{ir}) + \beta_{r+1} x_{i(r+1)} + \ldots + \beta_p x_{ip},$$
where $x_{i1}, \ldots, x_{ir}$ are the values of $r$ quantitative and strictly positive explanatory variables measured on the subject $i$, and
$$h_j(x) = \begin{cases} x^{\tau_j} & \text{if } \tau_j \neq 0\\ \log(x) & \text{if } \tau_j = 0 \end{cases}, \qquad j = 1, \ldots, r.$$
Hence, the parameter vector of interest is $(\beta_0, \beta_1, \ldots, \beta_p, \tau_1, \ldots, \tau_r)^\top$. Box and Tidwell [1962] provide an efficient algorithm to estimate $\tau = (\tau_1, \ldots, \tau_r)^\top$ by the maximum likelihood method. Note that evaluation of the following hypothesis systems is of interest:
$$H_0: \tau_j = 1 \quad\text{versus}\quad H_1: \tau_j \neq 1, \qquad j = 1, \ldots, r,$$
and
$$H_0: (\tau_1, \ldots, \tau_r)^\top = \mathbf{1} \quad\text{versus}\quad H_1: (\tau_1, \ldots, \tau_r)^\top \neq \mathbf{1}.$$
301 / 430
Insecticides
These data, extracted from Hewlett and Plackett [1950] and available
in the object
deposit of the library GLMsData, arose from an experiment
in which groups of insects were exposed to deposits of insecticides. The
total number of insects exposed in each group (
Number), the number
of the insects killed after six days exposure (
Killed), and the amount
(
Deposit) and type of insecticide used (Insecticide) were recorded.
[Figure: % of killed insects versus amount of insecticide, in milligrams, by insecticide type (A, B, C).]
$$m_i Y_i \overset{ind}{\sim} \mathrm{Binomial}(m_i, \mu_i), \qquad g(\mu_i) = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3}$$
$m$ = Number of exposed insects
$y$ = Proportion of killed insects
$x_1$ = Amount of insecticide, in mg.
$x_2 = 0$ and $x_3 = 0$ if Insecticide = "A"
$x_2 = 1$ and $x_3 = 0$ if Insecticide = "B"
$x_2 = 0$ and $x_3 = 1$ if Insecticide = "C"
302 / 430
Box-Tidwell transformation
> help(BoxTidwell.glm)
> data(deposit, package="GLMsData")
> fit <- glm(Killed/Number ~ Deposit + Insecticide,
+ weights=Number, family=binomial(logit), data=deposit)
>
> BoxTidwell(fit, transf= ~ Deposit)
(the columns below are $\hat\tau_j$, its standard error, and the statistic for $H_0: \tau_j = 1$ versus $H_1: \tau_j \neq 1$)
Estimate Std.Error z-value Pr(>|z|)
Deposit -0.90393 0.34310 -5.54914 2.8708e-08
Wald test for null hypothesis that all taus are 1, that is, $H_0: (\tau_1, \ldots, \tau_r)^\top = \mathbf{1}$ versus $H_1: (\tau_1, \ldots, \tau_r)^\top \neq \mathbf{1}$:
chi = 30.793 , df = 1 , Pr(>chi) = 2.8708e-08
>
> fit2 <- update(fit,formula=~I(Deposit**(-0.9039))+Insecticide)
> c(adjR2(fit),adjR2(fit2))
[1] 0.85902 0.95405
> c(AIC(fit),AIC(fit2))
[1] 118.22335 85.84911
> c(BIC(fit),BIC(fit2))
[1] 121.7848 89.4106
303 / 430
Skin cancer in women
These data, extracted from Kleinbaum et al. [2013, page 751] and avail-
able in the object
skincancer of the library glmtoolbox, describe the
number of non-melanoma skin cancer cases (cases) among women
cases) among women
(
population) stratified by age (age: “15-24”, “25-34”, “35-44”, “45-
54”, “55-64”, “65-74”, “75-84”, “85+”) in two metropolitan areas of US
(
city: “St. Paul”, in Minneapolis, and “Fort Worth”, in Dallas).
[Figure: incidence rate of skin cancer ($\times 10^4$) versus age range, in years, for St. Paul and Ft. Worth.]
$$Y_i \overset{ind}{\sim} \mathrm{Poisson}(\mu_i), \qquad \log\!\left(\frac{10^4\,\mu_i}{\lambda_i}\right) = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2}$$
$y$ = Number of cases of skin cancer.
$\lambda$ = Number of women exposed.
$x_1 = 1$ if city = "Fort Worth"; $x_1 = 0$ if city = "St. Paul"
$x_2$ = Midpoint of age range.
304 / 430
Box-Tidwell transformation
> help(BoxTidwell.glm)
> data(skincancer)
> fit <- glm(cases ~ age + city, offset=log(population),
+ family=poisson(log), data=skincancer)
>
> BoxTidwell(fit, transf= ~ age)
(the columns below are $\hat\tau_j$, its standard error, and the statistic for $H_0: \tau_j = 1$ versus $H_1: \tau_j \neq 1$)
Estimate Std.Error z-value Pr(>|z|)
age -0.52014 0.12389 -12.26983 < 2.22e-16
Wald test for null hypothesis that all taus are 1, that is, $H_0: (\tau_1, \ldots, \tau_r)^\top = \mathbf{1}$ versus $H_1: (\tau_1, \ldots, \tau_r)^\top \neq \mathbf{1}$:
chi = 150.55 , df = 1 , Pr(>chi) = < 2.22e-16
, 1
>
> fit2 <- update(fit, formula = ~ I(age**(-0.52)) + city)
> c(adjR2(fit),adjR2(fit2))
[1] 0.93512 0.99381
> c(AIC(fit),AIC(fit2))
[1] 257.1082 115.2115
> c(BIC(fit),BIC(fit2))
[1] 259.4259 117.5293
305 / 430
Gasoline yield
This data set, discussed in James et al. [2013, page 92] and available in
the object
Auto of the library ISLR, includes acceleration, horsepower, fuel efficiency, weight, and other features of different makes and models of cars. The outcome of interest is fuel consumption, in miles per gallon. The variables in this data set are the following:
mpg ($y$). Fuel consumption, in miles per gallon.
cylinders ($x_1$). Number of cylinders.
displacement ($x_2$). Engine displacement, in cubic inches.
horsepower ($x_3$). Engine horsepower.
weight ($x_4$). Weight, in pounds.
acceleration ($x_5$). Time, in seconds, to accelerate from 0 to 60 miles per hour.
origin ($x_6$, $x_7$). Origin of the vehicle: American (1), European (2) and Japanese (3).
306 / 430
Gasoline yield
The following is the code of the exponential link function (i.e., $g(\mu) = \exp(\mu)$) for the Gaussian distribution:
> exp.link <- gaussian()
> exp.link$link <- "exp"
> exp.link$linkfun <- function(mu) exp(mu)
> exp.link$linkinv <- function(eta) log(eta)
> exp.link$mu.eta <- function(eta) 1/eta
>
> fit <- glm(..., family=exp.link, ...)
The following is the code of the inverse exponential link function (i.e., $g(\mu) = \exp(-\mu)$) for the Gaussian distribution:
> iexp.link <- gaussian()
> iexp.link$link <- "1/exp"
> iexp.link$linkfun <- function(mu) exp(-mu)
> iexp.link$linkinv <- function(eta) -log(eta)
> iexp.link$mu.eta <- function(eta) -1/eta
>
> fit <- glm(..., family=iexp.link, ...)
307 / 430
Gasoline yield
Simpler model: 1
More complex model: 1 + weight + horsepower + weight:horsepower + displacement + acceleration + cylinders + origin

g(µ)     Distribution       "Best" linear predictor according to BIC                  BIC
log(µ)   Gaussian           1 + displacement + weight*horsepower                    2213.2
         Gamma              1 + weight*horsepower                                   2100.2
         Inverse Gaussian   1 + weight*horsepower                                   2089.4
         Lognormal          1 + weight*horsepower                                   2095.6
µ        Gaussian           1 + weight*horsepower                                   2211.3
         Gamma              1 + weight*horsepower                                   2102.4
         Inverse Gaussian   1 + weight*horsepower                                   2094.0
         Lognormal          1 + weight*horsepower                                   2098.6
1/µ      Gaussian           1 + displacement + horsepower + acceleration            2201.0
         Gamma              1 + weight + horsepower + cylinders                     2098.9
         Inverse Gaussian   1 + weight + horsepower + cylinders                     2089.1
         Lognormal          1 + weight + displacement + horsepower + acceleration  2092.6
> data(Auto, package="ISLR")
> Auto <- within(Auto, weight <- weight/1000)
> fit <- glm(mpg ~ horsepower + weight + horsepower:weight,
+ family=inverse.gaussian(log), data=Auto)
> summary(fit)
308 / 430
Gasoline yield
> anova2(fit, test="lr")
> residuals2(fit, type="quantile ", plot.it=TRUE, identify=3)
15 20 25 30 35 40
−3 −2 −1 0 1 2 3
331
360
382
ˆ
µ
r
Q
i
An “horizontal band” of residuals is roughly drawn around the 0 line.
This suggests that the random component of the model successfully
describes the mean-variance relation present in the data.
309 / 430
Gasoline yield
> envelope(fit, type="quantile", rep=500, identify=3)
> cbind(model.matrix(fit)[,-1],y=fit$y,mu=fitted(fit))[c(331,360,382),]
  horsepower weight horsepower:weight    y       mu
         132   2910            384120 32.7 19.92788
         105   3725            391125 26.6 18.02434
          85   3015            256275 38.0 22.94378
[Figure: normal probability plot with simulated envelope of the quantile-type residuals; observations 331, 360 and 382 are highlighted.]
310 / 430
Gasoline yield
> cdi <- cooks.distance(fit)
> plot(cdi, type="h", xlab="Observation Index", ylab="Cook’s Distance")
> identify(cdi, n=3)
[Figure: Cook's distances versus observation index; observations 29, 116 and 331 stand out.]
311 / 430
Gasoline yield
Which subset of the potentially influential observations (#29, #116, #331) causes
the largest change in the estimate of
β when removed from the data set?
> infl <- c(29,116,331)
> n <- length(infl)
> metr <- chol2inv(chol(vcov(fit)))
> hm <- 2**n - 1
> results <- matrix(0, nrow=hm, ncol=1)
> id <- 1; nfil <- vector()
> for(i in 1:n){
+ pos <- combn(n,i)
+ for(j in 1:ncol(pos)){
+ fitemp <- update(fit, subset=-pos[,j])
+ difs <- coef(fitemp)-coef(fit)
+ results[id,1] <- t(difs)%*%metr%*%(difs)
+ nfil <- c(nfil,paste(infl[pos[,j]], collapse=","))
+ id <- id + 1
+ }
+ }
> nfil[results==max(results)]
[1] "29,331"
The largest change in the estimate of β occurs when observations #29 and #331
are removed.
312 / 430
Gasoline yield
> a <- round(summary(fit)$coefficients, digits=5); a
Estimate Std.Error t value Pr(>|t|)
(Intercept) 4.55600 0.09410 48.41499 0
horsepower -0.00725 0.00102 -7.08715 0
weight -0.37430 0.02942 -12.72132 0
horsepower:weight 0.00126 0.00025 5.09735 0
>
> fit2 <- update(fit, subset=-c(29,331))
> b <- round(summary(fit2)$coefficients, digits=5); b
Estimate Std.Error t value Pr(>|t|)
(Intercept) 4.61036 0.09201 50.10786 0
horsepower -0.00804 0.00100 -8.01447 0
weight -0.38628 0.02874 -13.44288 0
horsepower:weight 0.00144 0.00024 5.94463 0
>
> round(100*(b[,1:2]-a[,1:2])/abs(a[,1:2]), digits=3)
Estimate Std.Error
(Intercept) 1.193 -2.221
horsepower -10.897 -1.961
weight -3.201 -2.311
horsepower:weight 14.286 -4.000
313 / 430
Gasoline yield
> vdtest(fit)
Score test for varying dispersion parameter
Statistic = 1.50238
degrees of freedom = 3
p-value = 0.68172
> vdtest(fit,varformula=~horsepower*weight+cylinders+displacement)
Score test for varying dispersion parameter
Statistic = 4.42663
degrees of freedom = 5
p-value = 0.48976
> vdtest(fit,varformula=~horsepower*weight+cylinders*displacement)
Score test for varying dispersion parameter
Statistic = 7.30864
degrees of freedom = 6
p-value = 0.29324
314 / 430
Quasi-likelihood models
315 / 430
Quasi-likelihood models (QLMs)
$$Y_i \overset{ind}{\sim} \left(\mu_i,\ (\phi/\omega_i)\,V(\mu_i)\right), \qquad g(\mu_i) = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip} = x_i^\top\beta.$$
Then $E(Y_i) = \mu_i$ and $\mathrm{Var}(Y_i) = \dfrac{\phi}{\omega_i}V(\mu_i)$, where
$\beta = (\beta_0, \beta_1, \ldots, \beta_p)^\top$ is the parameter vector of interest.
$x_i = (1, x_{i1}, \ldots, x_{ip})^\top$, where $(x_{i1}, \ldots, x_{ip})$ are the values of $p$ explanatory variables, quantitative and/or qualitative, measured on the $i$-th individual.
$\phi > 0$ is the dispersion parameter.
$\omega_1, \ldots, \omega_n$ are known and positive weights.
$V(\cdot)$ is the variance function.
$g(\cdot)$ is the link function, which is assumed to be known, strictly monotone, and twice differentiable.
316 / 430
Quasi-likelihood models (QLMs)
$$\hat\beta = \underset{\beta\,\in\,\mathbb{R}^{p+1}}{\operatorname{argmax}}\ Q(\beta), \qquad\text{where}\qquad Q(\beta) = \frac{1}{\phi}\sum_{i=1}^{n}\int_{y_i}^{\mu_i}\frac{\omega_i\,(y_i - t)}{V(t)}\,dt$$
is the log-quasi-likelihood function. For example,
If $V(\mu_i) = \mu_i(1-\mu_i)$: $\displaystyle Q(\beta) = C + \frac{1}{\phi}\sum_{i=1}^{n}\omega_i\left[y_i\log\!\left(\frac{\mu_i}{1-\mu_i}\right) + \log(1-\mu_i)\right]$
If $V(\mu_i) = \mu_i$: $\displaystyle Q(\beta) = C + \frac{1}{\phi}\sum_{i=1}^{n}\omega_i\left[y_i\log(\mu_i) - \mu_i\right]$
If $V(\mu_i) = 1$: $\displaystyle Q(\beta) = -\frac{1}{2\phi}\sum_{i=1}^{n}\omega_i\,(y_i - \mu_i)^2$
If $V(\mu_i) = \mu_i^2$: $\displaystyle Q(\beta) = C - \frac{1}{\phi}\sum_{i=1}^{n}\omega_i\left[\log(\mu_i) + \frac{y_i}{\mu_i}\right]$
If $V(\mu_i) = \mu_i^3$: $\displaystyle Q(\beta) = C + \frac{1}{\phi}\sum_{i=1}^{n}\omega_i\left(\frac{1}{\mu_i} - \frac{y_i}{2\mu_i^2}\right)$
317 / 430
Quasi-likelihood models (QLMs)
$$\frac{\partial Q(\beta)}{\partial\beta_j} = \frac{1}{\phi}\sum_{i=1}^{n}\frac{\omega_i\,(y_i - \mu_i)}{V(\mu_i)}\frac{\partial\mu_i}{\partial\eta_i}\,x_{ij} = \frac{1}{\phi}\sum_{i=1}^{n}\frac{\omega_i\,(y_i - \mu_i)}{V(\mu_i)\,g'(\mu_i)}\,x_{ij}.$$
In matrix form, $U(\beta)$ becomes
$$U(\beta) = \begin{pmatrix} \phi^{-1}\displaystyle\sum_{i=1}^{n}\frac{\omega_i\,(y_i - \mu_i)}{V(\mu_i)\,g'(\mu_i)}\\[2ex] \phi^{-1}\displaystyle\sum_{i=1}^{n}\frac{\omega_i\,(y_i - \mu_i)}{V(\mu_i)\,g'(\mu_i)}\,x_{i1}\\ \vdots\\ \phi^{-1}\displaystyle\sum_{i=1}^{n}\frac{\omega_i\,(y_i - \mu_i)}{V(\mu_i)\,g'(\mu_i)}\,x_{ip} \end{pmatrix},$$
which is the same as that in GLMs; that is, the parameter estimates under GLMs and QLMs coincide. Therefore, QLMs relax the distributional assumptions of GLMs, as the former lead to the same results as the latter, but in the former only the correct specification of the mean and variance of the response variable is required.
318 / 430
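A quick numerical confirmation of this point: fitting a Poisson GLM and a quasi-likelihood model with $V(\mu) = \mu$ to the same (hypothetical) count data yields identical estimates of $\beta$; only the treatment of $\phi$ differs.

fit.glm <- glm(y ~ x1 + x2, family=poisson(log), data=mydata)   # hypothetical data
fit.qlm <- update(fit.glm, family=quasipoisson(log))            # V(mu) = mu
all.equal(coef(fit.glm), coef(fit.qlm))      # TRUE: the estimating equations coincide
c(summary(fit.glm)$dispersion,               # fixed at 1 under the Poisson GLM
  summary(fit.qlm)$dispersion)               # estimated from the data under the QLM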
Gasoline Yields from Crude Oil
The dataset, collected by Prater [1956] and included in the betareg
library as GasolineYield, pertains to the percentage of crude oil con-
verted to gasoline after distillation and fractionation. The explanatory
variables are temperature (
temp), in degrees Fahrenheit, vapor pres-
sure of crude oil (
pressure), in pounds per square inch, and crude oil
gravity (
gravity), in degrees API.
[Figure: proportion of crude oil converted to gasoline versus temperature, by vapor pressure (low, high).]
$$Y_i \overset{ind}{\sim} \left(\mu_i,\ \phi\,\mu_i(1-\mu_i)\right), \qquad g(\mu_i) = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \beta_4 x_{i2} x_{i3}$$
$y$ = Proportion of crude oil converted to gasoline
$x_1$ = Temperature
$x_2$ = Vapor pressure
$x_3$ = Gravity
319 / 430
Gasoline Yields from Crude Oil
> library(betareg)
> data(GasolineYield)
> str(GasolineYield)
> summary(GasolineYield)
>
> fit1 <- glm(yield ~ temp+pressure+gravity+pressure:gravity,
+ data=GasolineYield, family=quasibinomial(logit))
> fit2 <- update(fit1, family=quasibinomial(probit))
> fit3 <- update(fit1, family=quasibinomial(cloglog))
> fit4 <- update(fit1, family=quasibinomial(cauchit))
>
> adjR2(fit1,fit2,fit3,fit4)
Deviance R-squared df adj.R-squared
fit1 0.1829340 0.9219 5 0.9103
fit2 0.1637178 0.9301 5 0.9197
fit3 0.2064191 0.9119 5 0.8988
fit4 0.3876270 0.8345 5 0.8100
> dev.new()
> residuals2(fit2)
320 / 430
Gasoline Yields from Crude Oil
> summary(fit2)
(the columns below are $\hat\beta_j$, $c_j$, $\hat\beta_j/c_j$ and $\Pr[\chi^2(1) > z_j^2]$)
Estimate Std.Error z value Pr(>|z|)
(Intercept) -4.3117643 0.3927982 -10.977 1.85e-11
temp 0.0057560 0.0003401 16.926 6.68e-16
pressure 0.2027705 0.0526505 3.851 0.000655
gravity 0.0292699 0.0092548 3.163 0.003842
pressure:gravity -0.0029725 0.0013007 -2.285 0.030367
> fit5 <- betareg(yield ~ temp+pressure+gravity+pressure:gravity,
data=GasolineYield, link="probit")
> summary(fit5)
Estimate Std.Error z value Pr(>|z|)
(Intercept) -4.2782892 0.3608205 -11.857 < 2e-16
temp 0.0057884 0.0003126 18.518 < 2e-16
pressure 0.1976645 0.0483439 4.089 4.34e-05
gravity 0.0279473 0.0084987 3.288 0.00101
pressure:gravity -0.0028123 0.0011945 -2.354 0.01856
321 / 430
Alaska pipeline
This dataset, extracted from Weisberg [2005] and available in the object
pipeline of the library glmtoolbox, consists of ultrasonic measure-
ments of the depth of defects in the Alaska pipeline (Field). A second
measurement of the depth of the defects was conducted in the labora-
tory (Lab). Data were analyzed in order to calibrate the bias between
field measurements and laboratory measurements.
[Figure: field measurement versus log(laboratory measurement).]
$$Y_i \overset{ind}{\sim} \left(\mu_i,\ \phi\,\mu_i^r\right), \qquad \mu_i = \beta_0\, x_i^{\beta_1}$$
$y$ = Field measurement
$x$ = Laboratory measurement
$r$ = 0, 1, 2 or 3
322 / 430
Alaska pipeline
> data(pipeline)
> str(pipeline)
> summary(pipeline)
>
> fit1 <- glm(Field ~ log(Lab), data=pipeline,
+ family=quasi(link=log,variance=constant))
> fit2 <- update(fit1, family=quasi(link=log,variance=mu))
> fit3 <- update(fit1, family=quasi(link=log,variance=mu^2))
> fit4 <- update(fit1, family=quasi(link=log,variance=mu^3))
>
> adjR2(fit1,fit2,fit3,fit4)
Deviance R-squared df adj.R-squared
fit1 3597.6085352 0.9019 2 0.9010
fit2 88.3293923 0.9216 2 0.9209
fit3 2.9264710 0.9256 2 0.9249
fit4 0.1439215 0.9109 2 0.9100
>
> dev.new(); residuals2(fit1)
> dev.new(); residuals2(fit2)
> dev.new(); residuals2(fit3)
> dev.new(); residuals2(fit4)
323 / 430
Alaska pipeline
> summary(fit3)
(the columns below are $\hat\beta_j$, $c_j$, $\hat\beta_j/c_j$ and $\Pr[\chi^2(1) > z_j^2]$)
Estimate Std.Error z value Pr(>|z|)
(Intercept) 0.30610 0.07924 3.863 0.000194
log(Lab) 0.88197 0.02254 39.133 < 2e-16
($\hat\phi$) Dispersion parameter for quasi family taken to be 0.02714494
Null deviance: 39.3211 on 106 degrees of freedom
Residual deviance: 2.9265 on 105 degrees of freedom
>
> fit5 <- glm(Field ~ log(Lab), data=pipeline, family=Gamma(log))
> summary(fit5)
(the columns below are $\hat\beta_j$, $c_j$, $\hat\beta_j/c_j$ and $\Pr[\chi^2(1) > z_j^2]$)
Estimate Std.Error z value Pr(>|z|)
(Intercept) 0.30610 0.07924 3.863 0.000194
log(Lab) 0.88197 0.02254 39.133 < 2e-16
($\hat\phi$) Dispersion parameter for Gamma family taken to be 0.02714494
Null deviance: 39.3211 on 106 degrees of freedom
Residual deviance: 2.9265 on 105 degrees of freedom
324 / 430
Overdispersion in Poisson
models
325 / 430
Overdispersion
Overdispersion occurs when the observed variance “ex-
ceeds” the expected variance according to the Poisson dis-
tribution.
Overdispersion results in an underestimation of the variabi-
lity associated with parameter estimates. As a result, vari-
ables may appear significant when they are actually not
significant, as well as producing confidence intervals with
“low” coverage rates.
As an informal rule, if the deviance statistic is "rather higher" than the degrees of freedom, or if the Pearson's chi-squared statistic is "rather higher" than the degrees of freedom, there may be evidence that the data are overdispersed.
Overdispersion may be addressed by replacing the Poisson
regression model with a regression model based on a more
complex distribution.
326 / 430
Self-diagnosed ear infections in swimmers
The data were extracted from Hand et al. [1993, page 266] and
can be found in the object
swimmers of the library glmtoolbox.
This study was conducted in order to determine if beach swim-
mers are at greater risk of contracting ear infections than non-
beach swimmers. This dataset contains the following variables:
frequency. The recruit’s perception of whether he or she is
a frequent swimmer: “frequent” and “occasional”.
location. The recruit’s usually chosen swimming location:
“beach” and “non-beach”.
age. The recruit’s age range: “15-19”, “20-24” and “25-29”.
gender. The recruit’s gender: “male” and “female”.
infections. The number of self-diagnosed ear infections
that were reported by the recruit.
327 / 430
Self-diagnosed ear infections in swimmers
> fit0 <- glm(infections ~ frequency + location + age + gender,
+ family=poisson(log), data=swimmers)
>
> c("Deviance/df"=deviance(fit0)/df.residual(fit0),
+ "Pearson/df"=sum(resid(fit0, type="pearson")**2)/df.residual(fit0))
Deviance/df Pearson/df
2.688374 3.379853
>
> envelope(fit0, type="quantile")
[Figure: normal probability plot with simulated envelope of the quantile-type residuals for the Poisson fit.]
328 / 430
The Origins of Sex Differences in Science
This data, extracted from Long [1990] and included in the library
pscl object bioChemists, is the result of a study that examined
gender differences in productivity (number of papers published)
during graduate school for a sample of 915 PhD students in
biochemistry, as a result of educational, mentoring, marriage,
and family factors. The following variables are included in this
dataset:
art. Count of articles produced during last 3 years of PhD.
fem. The gender of the student: “Men” or “Women”.
mar. The marital status of the student: “Single” or “Mar-
ried”.
kid5. The number of children aged 5 or younger.
phd. The prestige of the PhD department.
ment. The count of articles produced by the PhD mentor
during last 3 years.
329 / 430
The Origins of Sex Differences in Science
> data(bioChemists, package="psc l")
> fit0 <- glm(art ~ fem + mar + kid5 + phd + ment, family=poisson(log),
+ data=bioChemists)
>
> c("Deviance/df"=deviance(fit0)/df.residual(fit0),
+ "Pearson/df"=sum(resid(fit0, type="pearson")**2)/df.residual(fit0))
Deviance/df Pearson/df
1.797988 1.828984
>
> envelope(fit0, type="quantile")
[Figure: normal probability plot with simulated envelope of the quantile-type residuals for the Poisson fit.]
330 / 430
Negative Binomial I
If $Y|\lambda \sim \mathrm{Poisson}(\lambda)$, so that $E(Y|\lambda) = \lambda$ and $\mathrm{Var}(Y|\lambda) = \lambda$, and $\lambda \sim \mathrm{Gamma}(\mu, \phi)$, so that $E(\lambda) = \mu$ and $\mathrm{Var}(\lambda) = \phi\mu^2$, then $Y \sim$ Negative Binomial I$(\mu, \phi)$. Being so,
(1) $E(Y) = E[E(Y|\lambda)] = \mu$
(2) $\mathrm{Var}(Y) = \mathrm{Var}[E(Y|\lambda)] + E[\mathrm{Var}(Y|\lambda)] = \mu(1 + \phi\mu)$
(3) The probability mass function of $Y$ is
$$f_Y(y; \mu, \phi) = \frac{\Gamma(y + \phi^{-1})}{\Gamma(\phi^{-1})\,\Gamma(y+1)}\left(\frac{\mu}{\mu + \phi^{-1}}\right)^{y}\left(\frac{\phi^{-1}}{\mu + \phi^{-1}}\right)^{\phi^{-1}} I_y\{0, 1, \ldots\}$$
(4) The moment generating function of $Y$ is
$$M_Y(t) = \left[1 - \phi\mu\left(e^t - 1\right)\right]^{-\frac{1}{\phi}} \qquad\text{for } t < \log\!\left(\frac{\mu + \phi^{-1}}{\mu}\right)$$
(5) The mode of $Y$ is $\max\left\{\left\lfloor \mu(1-\phi) \right\rfloor,\ 0\right\}$.
331 / 430
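The Poisson-Gamma genesis of the distribution is easy to check by simulation; the sketch below draws $\lambda$ from a Gamma distribution with mean $\mu$ and variance $\phi\mu^2$ and verifies the mean and variance given in (1) and (2).

set.seed(123)
mu <- 4; phi <- 0.5
lambda <- rgamma(1e5, shape=1/phi, scale=phi*mu)  # E(lambda)=mu, Var(lambda)=phi*mu^2
y <- rpois(1e5, lambda)                           # Y | lambda ~ Poisson(lambda)
c(mean(y), mu)                                    # close to mu
c(var(y), mu*(1 + phi*mu))                        # close to mu(1 + phi*mu)
# the same distribution, drawn directly: rnbinom(1e5, size=1/phi, mu=mu)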
Negative Binomial I
The probability mass function of $Y$ may be obtained as follows:
$$f_Y(y; \mu, \phi) = \int_{-\infty}^{\infty} f_{Y|\lambda}(y; \lambda)\, f_\lambda(\lambda; \mu, \phi)\, d\lambda$$
$$= \int_{0}^{\infty} \frac{\exp(-\lambda)\lambda^{y}}{\Gamma(y+1)}\, I_y\{0,1,\ldots\}\ \frac{\lambda^{\phi^{-1}-1}}{\Gamma(\phi^{-1})(\phi\mu)^{\phi^{-1}}}\exp\!\left(-\frac{\lambda}{\phi\mu}\right) d\lambda$$
$$= \frac{I_y\{0,1,\ldots\}}{\Gamma(\phi^{-1})(\phi\mu)^{\phi^{-1}}\,\Gamma(y+1)}\int_{0}^{\infty} \lambda^{y+\phi^{-1}-1}\exp\!\left(-\frac{\lambda(\mu+\phi^{-1})}{\mu}\right) d\lambda$$
$$= \frac{\Gamma(\tilde\phi^{-1})(\tilde\phi\tilde\mu)^{\tilde\phi^{-1}}\, I_y\{0,1,\ldots\}}{\Gamma(\phi^{-1})(\phi\mu)^{\phi^{-1}}\,\Gamma(y+1)} \underbrace{\int_{0}^{\infty} \frac{\lambda^{\tilde\phi^{-1}-1}}{\Gamma(\tilde\phi^{-1})(\tilde\phi\tilde\mu)^{\tilde\phi^{-1}}}\exp\!\left(-\frac{\lambda}{\tilde\phi\tilde\mu}\right) d\lambda}_{=\ \int_0^\infty f_\lambda(\lambda;\,\tilde\mu,\tilde\phi)\,d\lambda\ =\ 1,\ \text{where } \lambda \sim \mathrm{Gamma}(\tilde\mu,\tilde\phi)}$$
$$= \frac{\Gamma(y+\phi^{-1})}{\Gamma(\phi^{-1})\,\Gamma(y+1)}\left(\frac{\mu}{\mu+\phi^{-1}}\right)^{y}\left(\frac{\phi^{-1}}{\mu+\phi^{-1}}\right)^{\phi^{-1}} I_y\{0,1,\ldots\},$$
where $\tilde\phi = 1/(y + \phi^{-1})$ and $\tilde\mu = \mu\left(y + \phi^{-1}\right)/\left(\mu + \phi^{-1}\right)$.
332 / 430
Negative Binomial I
The moment generating function of $Y$ may be obtained as follows:
$$M_Y(t) = E\left[M_{Y|\lambda}(t)\right] = \int_{0}^{\infty} M_{Y|\lambda}(t)\, f_\lambda(\lambda; \mu, \phi)\, d\lambda$$
$$= \int_{0}^{\infty} \exp\!\left[\lambda\left(e^t - 1\right)\right] \frac{\lambda^{\phi^{-1}-1}}{\Gamma(\phi^{-1})(\phi\mu)^{\phi^{-1}}}\exp\!\left(-\frac{\lambda}{\phi\mu}\right) d\lambda$$
$$= \left(\frac{\tilde\mu}{\mu}\right)^{\phi^{-1}} \underbrace{\int_{0}^{\infty} \frac{\lambda^{\phi^{-1}-1}}{\Gamma(\phi^{-1})(\phi\tilde\mu)^{\phi^{-1}}}\exp\!\left(-\frac{\lambda}{\phi\tilde\mu}\right) d\lambda}_{=\ \int_0^\infty f_\lambda(\lambda;\,\tilde\mu,\phi)\,d\lambda\ =\ 1,\ \text{where } \lambda \sim \mathrm{Gamma}(\tilde\mu,\phi)}$$
$$= \left[1 - \phi\mu\left(e^t - 1\right)\right]^{-\frac{1}{\phi}} \qquad\text{for } t < \log\!\left(\frac{\mu + \phi^{-1}}{\mu}\right),$$
as $\tilde\mu = \dfrac{\mu}{1 - \phi\mu\left(e^t - 1\right)}$, and $\tilde\mu > 0$ if and only if $t < \log\!\left(\dfrac{\mu + \phi^{-1}}{\mu}\right)$.
333 / 430
Negative Binomial distribution
If $Y|\lambda \sim \mathrm{Poisson}(\lambda)$, so that $E(Y|\lambda) = \lambda$ and $\mathrm{Var}(Y|\lambda) = \lambda$, and $\lambda \sim \mathrm{Gamma}(\mu, \phi\mu^\tau)$, so that $E(\lambda) = \mu$ and $\mathrm{Var}(\lambda) = \phi\mu^{\tau+2}$, for $\tau \in \mathbb{R}$, then $Y \sim$ Negative Binomial$(\mu, \phi, \tau)$. Being so,
(1) $E(Y) = E[E(Y|\lambda)] = \mu$
(2) $\mathrm{Var}(Y) = \mathrm{Var}[E(Y|\lambda)] + E[\mathrm{Var}(Y|\lambda)] = \mu(1 + \phi\mu^{\tau+1})$
(3) The probability mass function of $Y$ is
$$f_Y(y; \mu, \phi, \tau) = \frac{\Gamma\!\left(y + \frac{1}{\phi\mu^\tau}\right)}{\Gamma\!\left(\frac{1}{\phi\mu^\tau}\right)\Gamma(y+1)}\left(\frac{\mu}{\mu + \frac{1}{\phi\mu^\tau}}\right)^{y}\left(\frac{\frac{1}{\phi\mu^\tau}}{\mu + \frac{1}{\phi\mu^\tau}}\right)^{\frac{1}{\phi\mu^\tau}} I_y\{0,1,\ldots\}$$
(4) The moment generating function of $Y$ is
$$M_Y(t) = \left[1 - \phi\mu^{\tau+1}\left(e^t - 1\right)\right]^{-\frac{1}{\phi\mu^\tau}} \qquad\text{for } t < \log\!\left(\frac{\mu + \frac{1}{\phi\mu^\tau}}{\mu}\right)$$
(5) The mode of $Y$ is $\max\left\{\left\lfloor \mu(1 - \phi\mu^\tau)\right\rfloor,\ 0\right\}$.
(6) Negative Binomial I$(\mu, \phi)$ corresponds to $\tau = 0$.
(7) Negative Binomial II$(\mu, \phi)$ corresponds to $\tau = -1$.
334 / 430
Negative Binomial regression
$$Y_i \overset{ind}{\sim} \text{Negative Binomial}(\mu_i, \phi, \tau), \qquad g(\mu_i) = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip}$$
$$E(Y_i) = \mu_i \qquad\text{and}\qquad \mathrm{Var}(Y_i) = \mu_i(1 + \phi\mu_i^{\tau+1})$$
Logarithmic link: $g(\mu_i) = \log(\mu_i)$
Identity link: $g(\mu_i) = \mu_i$
Square-root link: $g(\mu_i) = \sqrt{\mu_i}$
Parameter estimation may be accomplished by using the maximum-likelihood method. Inference is based on the asymptotic properties of the maximum-likelihood estimator.
335 / 430
Self-diagnosed ear infections in swimmers
The data were extracted from Hand et al. [1993, page 266] and
can be found in the object
swimmers of the library glmtoolbox.
This study was conducted in order to determine if beach swim-
mers are at greater risk of contracting ear infections than non-
beach swimmers. This dataset contains the following variables:
frequency. The recruit’s perception of whether he or she is
a frequent swimmer: “frequent” and “occasional”.
location. The recruit’s usually chosen swimming location:
“beach” and “non-beach”.
age. The recruit’s age range: “15-19”, “20-24” and “25-29”.
gender. The recruit’s gender: “male” and “female”.
infections. The number of self-diagnosed ear infections
that were reported by the recruit.
336 / 430
Self-diagnosed ear infections in swimmers
> fit1 <- overglm(infections ~ frequency + location + age +
+ gender, family="nb1(log)", data=swimmers)
> fit2 <- update(fit1, family="nb2(log)")
> fit3 <- update(fit1, family="nbf(log)")
>
> AIC(fit0, fit1, fit2, fit3)
df AIC
fit0 6 1139.8280
fit1 7 904.6900
fit2 7 907.2357
fit3 8 906.5393
> BIC(fit0, fit1, fit2, fit3)
df BIC
fit0 6 1161.7849
fit1 7 930.3064
fit2 7 932.8521
fit3 8 935.8152
>
> fit1a <- update(fit1, formula = . ~ . - age - gender)
> anova(fit1a, fit1, test="wald")
Model 1 : infections ~ frequency + location
Model 2 : infections ~ frequency + location + age + gender
Chi Df Pr(>Chi)
1 vs 2 3.8493 3 0.2782
337 / 430
Self-diagnosed ear infections in swimmers
> anova(fit1a, fit1, test="score")
Model 1 : infections ~ frequency + location
Model 2 : infections ~ frequency + location + age + gender
Chi Df Pr(>Chi)
1 vs 2 3.9632 3 0.2655
>
> anova(fit1a, fit1, test="lr")
Model 1 : infections ~ frequency + location
Model 2 : infections ~ frequency + location + age + gender
Chi Df Pr(>Chi)
1 vs 2 3.8214 3 0.2814
>
> anova(fit1a, fit1, test="gradient")
Model 1 : infections ~ frequency + location
Model 2 : infections ~ frequency + location + age + gender
Chi Df Pr(>Chi)
1 vs 2 3.7811 3 0.2861
>
> summary(fit1a)
Estimate Std.Error z-value Pr(>|z|)
(Intercept) -0.27766 0.17191 -1.61512 0.1062847
frequencyoccasional 0.59019 0.19066 3.09554 0.0019645
locationnon-beach 0.48134 0.19004 2.53279 0.0113158
phi 1.78560 0.27779
338 / 430
Self-diagnosed ear infections in swimmers
> summary(update(fit0, formula = . ~ . - age - gender))
Estimate Std.Error z-value Pr(>|z|)
(Intercept) -0.30583 0.10593 -2.88703 0.00389
frequencyoccasional 0.61304 0.10498 5.83931 5.24e-09
locationnon-beach 0.50873 0.10281 4.94833 7.49e-07
>
> envelope(fit1a, type="quantile")
[Figure: normal probability plot with simulated envelope of the quantile-type residuals for the Negative Binomial I fit.]
339 / 430
The Origins of Sex Differences in Science
This data, extracted from Long [1990] and included in the library
pscl object bioChemists, is the result of a study that examined
gender differences in productivity (number of papers published)
during graduate school for a sample of 915 PhD students in
biochemistry, as a result of educational, mentoring, marriage,
and family factors. The following variables are included in this
dataset:
art. Count of articles produced during last 3 years of PhD.
fem. The gender of the student: “Men” or “Women”.
mar. The marital status of the student: “Single” or “Mar-
ried”.
kid5. The number of children aged 5 or younger.
phd. The prestige of the PhD department.
ment. The count of articles produced by the PhD mentor
during last 3 years.
340 / 430
The Origins of Sex Differences in Science
> fit1 <- overglm(art ~ fem + mar + kid5 + phd + ment,
+ family="nb1(log)", data=bioChemists)
> fit2 <- update(fit1, family="nb2(log)")
> fit3 <- update(fit1, family="nbf(log)")
>
> AIC(fit0, fit1, fit2, fit3)
df AIC
fit0 6 3314.113
fit1 7 3135.917
fit2 7 3143.397
fit3 8 3137.905
> BIC(fit0, fit1, fit2, fit3)
df BIC
fit0 6 3343.026
fit1 7 3169.649
fit2 7 3177.130
fit3 8 3176.457
>
> fit1a <- update(fit1, formula = . ~ . - mar - phd)
> anova(fit1a, fit1, test="wald")
Model 1 : art ~ fem + kid5 + ment
Model 2 : art ~ fem + mar + kid5 + phd + ment
Chi Df Pr(>Chi)
1 vs 2 3.4159 2 0.1812
341 / 430
The Origins of Sex Differences in Science
> anova(fit1a, fit1, test="score")
Model 1 : art ~ fem + kid5 + ment
Model 2 : art ~ fem + mar + kid5 + phd + ment
Chi Df Pr(>Chi)
1 vs 2 3.4338 2 0.1796
>
> anova(fit1a, fit1, test="lr")
Model 1 : art ~ fem + kid5 + ment
Model 2 : art ~ fem + mar + kid5 + phd + ment
Chi Df Pr(>Chi)
1 vs 2 3.4118 2 0.1816
>
> anova(fit1a, fit1, test="gradient")
Model 1 : art ~ fem + kid5 + ment
Model 2 : art ~ fem + mar + kid5 + phd + ment
Chi Df Pr(>Chi)
1 vs 2 3.4054 2 0.1822
>
> summary(fit1a)
Estimate Std.Error z-value Pr(>|z|)
(Intercept) 0.39102 0.06628 5.89972 3.6411e-09
femWomen -0.23270 0.07230 -3.21857 0.0012883
kid5 -0.13775 0.04846 -2.84278 0.0044722
ment 0.02937 0.00338 8.67714 < 2.22e-16
342 / 430
The Origins of Sex Differences in Science
> summary(update(fit0, formula = . ~ . - mar - phd))
Estimate Std.Error z-value Pr(>|z|)
(Intercept) 0.43673 0.04683 9.32683 < 2e-16
femWomen -0.24311 0.05421 -4.48475 7.30e-06
kid5 -0.14379 0.03657 -3.93146 8.44e-05
ment 0.02565 0.00195 13.13606 < 2e-16
>
> envelope(fit1a, type="quantile")
[Figure: normal probability plot with simulated envelope of the quantile-type residuals for the Negative Binomial I fit.]
343 / 430
Exercises
(1) Let's assume that
$$Y = \sum_{i=1}^{m} Y_i \qquad\text{and}\qquad Y_i \overset{ind}{\sim} \mathrm{Poisson}(\mu_i).$$
(A) Determine the probability distribution of $Y$.
(B) Let's assume that the $Y_i$'s have a positive correlation. What is the effect of this change on the mean and variance of $Y$?
(C) Let's assume that the $Y_i$'s have a negative correlation. What is the effect of this change on the mean and variance of $Y$?
(2) Show that, if $Y|\lambda \sim \mathrm{Poisson}(\lambda)$ and $\lambda$ follows a continuous distribution on the interval $(0, \infty)$, such that $E(\lambda) = \mu$ and $\mathrm{Var}(\lambda) = \kappa(\sigma^2, \mu)$, for some positive function $\kappa(\cdot,\cdot)$, then $Y$ is a discrete random variable such that: (i) its possible values correspond to the non-negative integers; (ii) its expectation is equal to $\mu$; and (iii) its variance is equal to $\mu + \kappa(\sigma^2, \mu)$. Therefore, a model based on the distribution of $Y$ is an alternative to the Poisson model in the presence of overdispersion.
344 / 430
Overdispersion in binomial
models
345 / 430
Overdispersion
Overdispersion occurs when the observed variance “ex-
ceeds” the expected variance according to the binomial dis-
tribution.
Overdispersion results in an underestimation of the variabi-
lity associated with parameter estimates. As a result, vari-
ables may appear significant when they are actually not
significant, as well as producing confidence intervals with
“low” coverage rates.
As an informal rule, if the deviance statistic is "rather higher" than the degrees of freedom, or if the Pearson's chi-squared statistic is "rather higher" than the degrees of freedom, there may be evidence that the data are overdispersed.
Overdispersion may be addressed by replacing the binomial
regression model with a regression model based on a more
complex distribution.
346 / 430
Cell dierentiation
These data, which can be found in the object cellular of the library glmtoolbox, have been extracted from Piegorsch et al. [1988] and relate to the ability of two immune-activating agents, TNF and IFN, to induce the differentiation of cells. The cells were assessed after they had been exposed to none, one, or both agents in order to determine whether they had displayed markers of differentiation. A total of 200 cells were examined for each of the 16 dose combinations of TNF and INF. The main question is whether the two agents stimulate cell differentiation in a synergistic or independent manner.
[Figure: % of differentiated cells versus dose of TNF (0, 4, 20, 100 U/ml), by dose of INF (0, 1, 10, 100 U/ml).]
$$m_i Y_i \overset{ind}{\sim} \mathrm{Binomial}(m_i, \mu_i), \qquad g(\mu_i) = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i1} x_{i2}$$
$m$ = Number of examined cells
$y$ = Proportion of differentiated cells
$x_1$ = Dose of TNF, in U/ml
$x_2$ = Dose of INF, in U/ml
347 / 430
Cell dierentiation
> fit0 <- glm(cells/200 ~ tnf + ifn + tnf*ifn, weights=rep(200,16),
+ family=binomial(logit), data=cellular)
>
> c("Deviance/df"=deviance(fit0)/df.residual(fit0),
+ "Pearson/df"=sum(resid(fit0, type="pearson")**2)/df.residual(fit0))
Deviance/df Pearson/df
16.02978 19.05526
>
> envelope(fit0, type="quantile")
[Figure: normal probability plot with simulated envelope of the quantile-type residuals for the binomial fit.]
348 / 430
Teratogenic eects of PHT and TCPO
These data, extracted from Morel and Neerchal [1997] and available in the object ossification of the library glmtoolbox, were obtained from an experiment in which pregnant mice were randomly assigned to three treatments: 60 mg/kg PHT, 100 mg/kg TCPO, or 60 mg/kg PHT and 100 mg/kg TCPO, each administered daily by gastric gavage. The left middle third phalanx was examined on day 18 of pregnancy for the presence or absence of ossification. The purpose of the study was to investigate the synergy between PHT and TCPO.
[Figure: % of fetuses showing ossification versus dose of PHT (0 mg/kg, 60 mg/kg), by dose of TCPO (0 mg/kg, 100 mg/kg).]
$$m_i Y_i \overset{ind}{\sim} \mathrm{Binomial}(m_i, \mu_i), \qquad g(\mu_i) = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i1} x_{i2}$$
$m$ = Litter size
$y$ = Proportion of fetuses showing ossification
$x_1$ = Dose of PHT, in mg/kg
$x_2$ = Dose of TCPO, in mg/kg
349 / 430
Teratogenic eects of PHT and TCPO
> fit0 <- glm(fetuses/litter ~ pht + tcpo + pht*tcpo, weights=litter,
+ family=binomial(logit), data=ossification)
>
> c("Deviance/df"=deviance(fit0)/df.residual(fit0),
+ "Pearson/df"=sum(resid(fit0, type="pearson")**2)/df.residual(fit0))
Deviance/df Pearson/df
3.771843 3.273809
>
> envelope(fit0, type="quantile")
[Figure: normal probability plot with simulated envelope of the quantile-type residuals for the binomial fit.]
350 / 430
Beta Binomial distribution
If $mY|\pi \sim \mathrm{Binomial}(m, \pi)$, so that $E(Y|\pi) = \pi$ and $\mathrm{Var}(Y|\pi) = m^{-1}\pi(1-\pi)$, and $\pi \sim \mathrm{Beta}(\mu, \phi)$ for $\phi > 0$, so that $E(\pi) = \mu$ and $\mathrm{Var}(\pi) = (\phi+1)^{-1}\mu(1-\mu)$, then $mY \sim$ Beta Binomial$(m, \mu, \phi)$. Being so,
(1) $E(Y) = E[E(Y|\pi)] = \mu$
(2) $\mathrm{Var}(Y) = \mathrm{Var}[E(Y|\pi)] + E[\mathrm{Var}(Y|\pi)] = \dfrac{\mu(1-\mu)}{m}\left(1 + \dfrac{m-1}{\phi+1}\right)$
(3) The probability mass function of $Y$ is
$$f_Y(y; \mu, \phi) = \binom{m}{my}\frac{B\big(my + \phi\mu,\ m(1-y) + \phi(1-\mu)\big)}{B\big(\phi\mu,\ \phi(1-\mu)\big)}\, I_{my}\{0, 1, \ldots, m\},$$
where $B(a, b) = \dfrac{\Gamma(a)\Gamma(b)}{\Gamma(a+b)}$ is the beta function.
351 / 430
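The Binomial-Beta genesis can likewise be checked by simulation; the sketch below uses the Beta$(\mu, \phi)$ parameterization with shape parameters $\phi\mu$ and $\phi(1-\mu)$, matching the pmf in (3).

set.seed(123)
m <- 20; mu <- 0.3; phi <- 2
pi <- rbeta(1e5, shape1=phi*mu, shape2=phi*(1-mu))   # E(pi)=mu, Var(pi)=mu(1-mu)/(phi+1)
y <- rbinom(1e5, size=m, prob=pi)/m                  # Y = proportion of successes
c(mean(y), mu)                                       # close to mu
c(var(y), mu*(1-mu)/m * (1 + (m-1)/(phi+1)))         # close to the variance in (2)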
Beta Binomial distribution
The probability mass function of $Y$ may be obtained as follows:
$$f_Y(y; \mu, \phi) = \int_{-\infty}^{\infty} f_{mY|\pi}(y; m, \pi)\, f_\pi(\pi; \mu, \phi)\, d\pi$$
$$= \int_{0}^{1} \binom{m}{my}\pi^{my}(1-\pi)^{m(1-y)}\, I_{my}\{0,1,\ldots,m\}\ \frac{\pi^{\mu\phi-1}(1-\pi)^{(1-\mu)\phi-1}}{B\big(\mu\phi, (1-\mu)\phi\big)}\, d\pi$$
$$= \binom{m}{my}\frac{I_{my}\{0,1,\ldots,m\}}{B\big(\mu\phi, (1-\mu)\phi\big)}\int_{0}^{1} \pi^{my+\mu\phi-1}(1-\pi)^{m(1-y)+(1-\mu)\phi-1}\, d\pi$$
$$= \binom{m}{my}\frac{B\big(\tilde\mu\tilde\phi, (1-\tilde\mu)\tilde\phi\big)}{B\big(\mu\phi, (1-\mu)\phi\big)}\, I_{my}\{0,1,\ldots,m\} \underbrace{\int_{0}^{1} \frac{\pi^{\tilde\mu\tilde\phi-1}(1-\pi)^{(1-\tilde\mu)\tilde\phi-1}}{B\big(\tilde\mu\tilde\phi, (1-\tilde\mu)\tilde\phi\big)}\, d\pi}_{=\ \int_0^1 f_\pi(\pi;\,\tilde\mu,\tilde\phi)\,d\pi\ =\ 1,\ \text{where } \pi \sim \mathrm{Beta}(\tilde\mu,\tilde\phi)}$$
$$= \binom{m}{my}\frac{B\big(my + \phi\mu,\ m(1-y) + \phi(1-\mu)\big)}{B\big(\phi\mu,\ \phi(1-\mu)\big)}\, I_{my}\{0,1,\ldots,m\},$$
where $\tilde\mu = (my + \mu\phi)/(m + \phi)$ and $\tilde\phi = m + \phi$.
352 / 430
Beta Binomial regression
$$Y_i \overset{ind}{\sim} \text{Beta Binomial}(\mu_i, \phi), \qquad g(\mu_i) = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip}$$
$$E(Y_i) = \mu_i \qquad\text{and}\qquad \mathrm{Var}(Y_i) = \frac{\mu_i(1-\mu_i)}{m_i}\left(1 + \frac{m_i - 1}{\phi + 1}\right)$$
Logit link: $g(\mu_i) = \log\!\left(\dfrac{\mu_i}{1-\mu_i}\right)$
Probit link: $g(\mu_i) = \Phi^{-1}(\mu_i)$
Complementary log-log link: $g(\mu_i) = \log\{-\log(1-\mu_i)\}$
Parameter estimation may be accomplished by using the maximum-likelihood method. Inference is based on the asymptotic properties of the maximum-likelihood estimator.
353 / 430
Random-clumped Binomial distribution
If $mY|\pi \sim \mathrm{Binomial}(m, \pi)$, so that $E(Y|\pi) = \pi$ and $\mathrm{Var}(Y|\pi) = m^{-1}\pi(1-\pi)$, and $\pi = (1-\phi)\mu + \phi\,\mathrm{Bernoulli}(\mu)$ for $\mu, \phi \in (0, 1)$, so that $E(\pi) = \mu$ and $\mathrm{Var}(\pi) = \phi^2\mu(1-\mu)$, then $mY \sim$ Random-clumped Binomial$(m, \mu, \phi)$. Thus, $E(Y) = E[E(Y|\pi)] = \mu$,
$$\mathrm{Var}(Y) = \mathrm{Var}[E(Y|\pi)] + E[\mathrm{Var}(Y|\pi)] = \frac{\mu(1-\mu)}{m}\left[1 + \phi^2(m-1)\right],$$
and the probability mass function of $Y$ may be written as follows:
$$f_Y(y; m, \mu, \phi) = \mu\binom{m}{my}\pi_1^{my}(1-\pi_1)^{m(1-y)}\, I_{my}\{0,1,\ldots,m\} + (1-\mu)\binom{m}{my}\pi_2^{my}(1-\pi_2)^{m(1-y)}\, I_{my}\{0,1,\ldots,m\},$$
where $\pi_1 = (1-\phi)\mu + \phi$ and $\pi_2 = (1-\phi)\mu$.
354 / 430
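The "random clumping" mechanism can be simulated directly: with probability $\mu$ the clumped trials push $\pi$ up to $\pi_1 = (1-\phi)\mu + \phi$, and with probability $1-\mu$ down to $\pi_2 = (1-\phi)\mu$. A minimal sketch:

set.seed(123)
m <- 20; mu <- 0.3; phi <- 0.4
pi <- (1-phi)*mu + phi*rbinom(1e5, 1, mu)     # pi takes the values pi_1 and pi_2
y <- rbinom(1e5, size=m, prob=pi)/m
c(mean(y), mu)                                # close to mu
c(var(y), mu*(1-mu)/m * (1 + phi^2*(m-1)))    # close to the variance above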
Random-clumped Binomial regression
$$Y_i \overset{ind}{\sim} \text{Random-clumped Binomial}(\mu_i, \phi), \qquad g(\mu_i) = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip}$$
$$E(Y_i) = \mu_i \qquad\text{and}\qquad \mathrm{Var}(Y_i) = \frac{\mu_i(1-\mu_i)}{m_i}\left[1 + \phi^2(m_i - 1)\right]$$
Logit link: $g(\mu_i) = \log\!\left(\dfrac{\mu_i}{1-\mu_i}\right)$
Probit link: $g(\mu_i) = \Phi^{-1}(\mu_i)$
Complementary log-log link: $g(\mu_i) = \log\{-\log(1-\mu_i)\}$
Parameter estimation may be accomplished by using the maximum-likelihood method. Inference is based on the asymptotic properties of the maximum-likelihood estimator.
355 / 430
Cell dierentiation
These data, which can be found in the object cellular of the library glmtoolbox, have been extracted from Piegorsch et al. [1988] and relate to the ability of two immune-activating agents, TNF and IFN, to induce the differentiation of cells. The cells were assessed after they had been exposed to none, one, or both agents in order to determine whether they had displayed markers of differentiation. A total of 200 cells were examined for each of the 16 dose combinations of TNF and INF. The main question is whether the two agents stimulate cell differentiation in a synergistic or independent manner.
[Figure: % of differentiated cells versus dose of TNF (0, 4, 20, 100 U/ml), by dose of INF (0, 1, 10, 100 U/ml).]
$$m_i Y_i \overset{ind}{\sim} \mathrm{Binomial}(m_i, \mu_i), \qquad g(\mu_i) = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i1} x_{i2}$$
$m$ = Number of examined cells
$y$ = Proportion of differentiated cells
$x_1$ = Dose of TNF, in U/ml
$x_2$ = Dose of INF, in U/ml
356 / 430
Cell dierentiation
> fit1 <- overglm(cbind(cells,200-cells) ~ tnf + ifn + tnf*ifn,
+ family="bb(logit)", data=cellular)
> fit2 <- update(fit1, family="rcb(probit)")
>
> AIC(fit0, fit1, fit2)
df AIC
fit0 4 281.8449
fit1 5 148.0026
fit2 5 163.1216
>
> BIC(fit0, fit1, fit2)
df BIC
fit0 4 284.9353
fit1 5 151.8656
fit2 5 166.9846
>
> fit1a <- update(fit1, formula = . ~ . - tnf:ifn)
> anova(fit1a, fit1,test="wald")
Model 1 : cbind(cells, 200 - cells) ~ tnf + ifn
Model 2 : cbind(cells, 200 - cells) ~ tnf * ifn
Chi Df Pr(>Chi)
1 vs 2 0.83757 1 0.3601
357 / 430
Cell dierentiation
> anova(fit4a, fit4,test="lr")
Model 1 : cbind(cells, 200 - cells) ~ tnf + ifn
Model 2 : cbind(cells, 200 - cells) ~ tnf * ifn
Chi Df Pr(>Chi)
1 vs 2 1.1974 1 0.2738
>
> anova(fit1a, fit1, test="score")
Model 1 : cbind(cells, 200 - cells) ~ tnf + ifn
Model 2 : cbind(cells, 200 - cells) ~ tnf * ifn
Chi Df Pr(>Chi)
1 vs 2 1.0277 1 0.3107
>
> anova(fit1a, fit1, test="gradient")
Model 1 : cbind(cells, 200 - cells) ~ tnf + ifn
Model 2 : cbind(cells, 200 - cells) ~ tnf * ifn
Chi Df Pr(>Chi)
1 vs 2 1.3701 1 0.2418
>
> summary(fit1a)
Estimate Std.Error z-value Pr(>|z|)
(Intercept) -1.75132 0.24252 -7.22127 5.1504e-13
tnf 0.02952 0.00399 7.39682 1.3949e-13
ifn 0.01296 0.00367 3.53479 0.0004081
358 / 430
Cell dierentiation
> fit0 <- update(fit0, formula = . ~ . - tnf:ifn)
> summary(fit0)
Estimate Std.Error z-value Pr(>|z|)
(Intercept) -1.83209 0.06955 -26.34045 <2e-16
tnf 0.02996 0.00113 26.56381 <2e-16
ifn 0.01410 0.00106 13.33093 <2e-16
>
> envelope(fit1a, type="quantile")
[Figure: normal probability plot with simulated envelope of the quantile-type residuals for the Beta Binomial fit.]
359 / 430
Teratogenic eects of PHT and TCPO
These data, extracted from Morel and Neerchal [1997] and available in the object ossification of the library glmtoolbox, were obtained from an experiment in which pregnant mice were randomly assigned to three treatments: 60 mg/kg PHT, 100 mg/kg TCPO, or 60 mg/kg PHT and 100 mg/kg TCPO, each administered daily by gastric gavage. The left middle third phalanx was examined on day 18 of pregnancy for the presence or absence of ossification. The purpose of the study was to investigate the synergy between PHT and TCPO.
[Figure: % of fetuses showing ossification versus dose of PHT (0 mg/kg, 60 mg/kg), by dose of TCPO (0 mg/kg, 100 mg/kg).]
$$m_i Y_i \overset{ind}{\sim} \mathrm{Binomial}(m_i, \mu_i), \qquad g(\mu_i) = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i1} x_{i2}$$
$m$ = Litter size
$y$ = Proportion of fetuses showing ossification
$x_1$ = Dose of PHT, in mg/kg
$x_2$ = Dose of TCPO, in mg/kg
360 / 430
Teratogenic eects of PHT and TCPO
> fit1 <- overglm(cbind(fetuses,litter-fetuses) ~ pht + tcpo + pht*tcpo,
+ family="bb(logit)", data=ossification)
> fit2 <- update(fit1, family="rcb(logit)")
>
> AIC(fit0, fit1, fit2)
df AIC
fit0 4 409.7828
fit1 5 316.5751
fit2 5 315.0534
>
> BIC(fit0, fit1, fit2)
df BIC
fit0 4 419.3606
fit1 5 328.5474
fit2 5 327.0257
>
> fit2a <- update(fit2, formula = . ~ . - pht:tcpo)
>
> anova(fit2a, fit2, test="wald")
Model 1 : cbind(fetuses, litter - fetuses) ~ pht + tcpo
Model 2 : cbind(fetuses, litter - fetuses) ~ pht * tcpo
Chi Df Pr(>Chi)
1 vs 2 0.85147 1 0.3561
361 / 430
Teratogenic eects of PHT and TCPO
> anova(fit2a, fit2, test="lr")
Model 1 : cbind(fetuses, litter - fetuses) ~ pht + tcpo
Model 2 : cbind(fetuses, litter - fetuses) ~ pht * tcpo
Chi Df Pr(>Chi)
1 vs 2 1.1974 1 0.2738
>
> anova(fit2a, fit2, test="score")
Model 1 : cbind(fetuses, litter - fetuses) ~ pht + tcpo
Model 2 : cbind(fetuses, litter - fetuses) ~ pht * tcpo
Chi Df Pr(>Chi)
1 vs 2 0.83027 1 0.3622
>
> anova(fit2a, fit2, test="gradient")
Model 1 : cbind(fetuses, litter - fetuses) ~ pht + tcpo
Model 2 : cbind(fetuses, litter - fetuses) ~ pht * tcpo
Chi Df Pr(>Chi)
1 vs 2 0.81836 1 0.3657
>
> summary(fit2a)
Estimate Std.Error z-value Pr(>|z|)
(Intercept) 0.57038 0.21361 2.67014 0.007582
pht60 mg/kg -1.33705 0.33663 -3.97188 7.1307e-05
tcpo100 mg/kg -0.76212 0.31146 -2.44693 0.014408
362 / 430
Teratogenic eects of PHT and TCPO
> fit0 <- update(fit0, formula = . ~ . - pht:tcpo)
> summary(fit0)
Estimate Std.Error z-value Pr(>|z|)
(Intercept) 0.72414 0.12728 5.68926 0.00000
pht60 mg/kg -1.75661 0.19908 -8.82366 0.00000
tcpo100 mg/kg -0.55281 0.19117 -2.89173 0.00383
>
> envelope(fit2a, type="quantile")
[Figure: normal probability plot with simulated envelope of the quantile-type residuals for the Random-clumped Binomial fit.]
363 / 430
Exercises
(1) Let's assume that
$$Y = \frac{1}{n_1 + \ldots + n_m}\sum_{i=1}^{m} Y_i \qquad\text{and}\qquad Y_i \overset{ind}{\sim} \mathrm{Binomial}(n_i, \mu).$$
(A) Determine the probability distribution of $Y$.
(B) Let's assume that the $Y_i$'s have a positive correlation. What is the effect of this change on the mean and variance of $Y$?
(C) Let's assume that the $Y_i$'s have a negative correlation. What is the effect of this change on the mean and variance of $Y$?
(2) Show that, if $mY|\pi \sim \mathrm{Binomial}(m, \pi)$ and $\pi$ follows a continuous distribution on the interval $(0, 1)$, such that $E(\pi) = \mu$ and $\mathrm{Var}(\pi) = \kappa(\sigma^2, \mu)$, for some positive function $\kappa(\cdot,\cdot)$, then $Y$ is a discrete random variable such that: (i) its possible values are $0, m^{-1}, 2m^{-1}, 3m^{-1}, \ldots, (m-1)m^{-1}, 1$; (ii) its expectation is equal to $\mu$; and (iii) its variance is equal to $m^{-1}\mu(1-\mu) + (1 - m^{-1})\,\kappa(\sigma^2, \mu)$. Therefore, a model based on the distribution of $Y$ is an alternative to the binomial model in the presence of overdispersion.
364 / 430
Excess of zeros
in count data
365 / 430
Males attached to female horseshoe crabs
These data, extracted from Agresti [2015] and available in GLMsData’s
object
hcrabs, are drawn from a s tudy of female horseshoe crabs
on an isl and in the Gulf of Mexico. During the spawning season,
females migrate to the shore i n order to breed. She burrows into
the sand with a male attached to her posterior spine and lays
clusters of eggs. In the sand beneath the pair, the eggs are fertil-
ized externally. Other male crabs (satellites) may cluster around
the spawning pair and fertilize the eggs as well. The variables
in the dataset are the following:
Col. The color of the female crab: “LM” (Light Medium),
“M” (Medium), “DM” (Dark Medium) or “D” (Dark).
Spine. The spine condition of the female crab: “BothOK”,
“OneOK” or “NoneOK”.
Width. The carapace width of the female crab, in cms.
Wt. The weight of the female crab, in grams.
Sat. The number of male crabs attached to the female crab.
366 / 430
Males attached to female horseshoe crabs
There is an excess of zeros when the observed number of zeros is "rather greater" than it is expected to be. According to the Poisson model, $Y_i \overset{ind}{\sim} \mathrm{Poisson}(\mu_i)$. Thus, $O = O_1 + \ldots + O_n$ is the number of zeros, where
$$O_i = \begin{cases} 1 & \text{if } Y_i = 0\\ 0 & \text{if } Y_i > 0 \end{cases},$$
so that $O_i \overset{ind}{\sim} \mathrm{Bernoulli}(\exp(-\mu_i))$. Therefore, $E(O)$ and $\mathrm{Var}(O)$ are given, respectively, by
$$\sum_{i=1}^{n} E(O_i) = \sum_{i=1}^{n} \Pr(Y_i = 0) = \sum_{i=1}^{n} \exp(-\mu_i)$$
and
$$\sum_{i=1}^{n} \mathrm{Var}(O_i) = \sum_{i=1}^{n} \Pr(Y_i = 0)\left[1 - \Pr(Y_i = 0)\right] = \sum_{i=1}^{n} \exp(-\mu_i)\left[1 - \exp(-\mu_i)\right].$$
The CLT indicates that
$$Z = \frac{O - E(O)}{\sqrt{\mathrm{Var}(O)}} \xrightarrow[n\to\infty]{D} \mathrm{Normal}(0, 1).$$
367 / 430
Males attached to female horseshoe crabs
This "naive" statistic, which is obtained by substituting $\hat\beta$ for $\beta$ in the above expression for $Z$, will be biased against the asymptotic standard normal, since it treats the estimate $\hat\beta$ as the true $\beta$, thus ignoring the sampling variation associated with $\hat\beta$. In order to correct this, parametric bootstrapping can be used to estimate the mean and variance of the difference between the observed and estimated expected number of zeros.
> fit0 <- glm(Sat ~ log(Wt) + Col + Spine + Width, data=hcrabs,
+ family=poisson(log))
>
> zero.excess(fit0)
Number of Zeros
Observed Expected z-value Pr(>z)
62 16.787 12.266 <2.2e-16
(the columns are $o$, $\hat E(O)$, the $z$ statistic, and $\Pr(\chi^2(1) > z^2)$)
368 / 430
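A minimal sketch of this parametric bootstrap (B = 1000 replicates) for the Poisson fit above; zero.excess() automates the computation.

set.seed(123)
mu.hat <- fitted(fit0); n <- length(mu.hat); B <- 1000
d.obs <- sum(fit0$y == 0) - sum(exp(-mu.hat))   # observed minus estimated expected zeros
d.boot <- replicate(B, {
  ystar <- rpois(n, mu.hat)                     # simulate under the fitted model
  fstar <- glm(ystar ~ log(Wt) + Col + Spine + Width,
               family=poisson(log), data=hcrabs)  # re-fit to carry estimation error
  sum(ystar == 0) - sum(exp(-fitted(fstar)))
})
z <- (d.obs - mean(d.boot))/sd(d.boot)
pnorm(z, lower.tail=FALSE)                      # one-sided p-value for excess of zeros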
Zero-Inflated Poisson distribution
If
$$Y|\nu \sim \begin{cases} 0 & \text{if } \nu = 1\\ \mathrm{Poisson}(\mu) & \text{if } \nu = 0 \end{cases} \qquad\text{and}\qquad \nu \sim \mathrm{Bernoulli}(\pi),$$
then $Y$ is distributed according to the Zero-Inflated Poisson distribution, denoted here by $\mathrm{ZIP}(\mu, \pi)$. The probability mass function of $Y$ may be written as follows:
$$f_Y(y; \mu, \pi) = \begin{cases} \pi + (1-\pi)\exp(-\mu) & \text{if } y = 0\\[1ex] (1-\pi)\dfrac{\exp(-\mu)\,\mu^{y}}{y!} & \text{if } y = 1, 2, 3, \ldots \end{cases}$$
In addition,
(1) $E(Y) = (1-\pi)\mu$,
(2) $\mathrm{Var}(Y) = (1-\pi)\mu + \pi(1-\pi)\mu^2$,
(3) $F_Y(y; \mu, \pi) = \pi\, I_y[0, \infty) + (1-\pi)\,F_{\tilde Y}(y; \mu)$, and
(4) $E(Y^r) = (1-\pi)E(\tilde Y^r)$ for all $r > 0$, where $\tilde Y \sim \mathrm{Poisson}(\mu)$.
369 / 430
Zero-Inflated Poisson model
$$Y_i \overset{ind}{\sim} \mathrm{ZIP}(\mu_i, \pi_i),$$
$$g(\mu_i) = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip},$$
$$h(\pi_i) = \gamma_0 + \gamma_1 z_{i1} + \ldots + \gamma_q z_{iq}$$
Logarithmic link: $g(\mu_i) = \log(\mu_i)$
Identity link: $g(\mu_i) = \mu_i$
Square-root link: $g(\mu_i) = \sqrt{\mu_i}$
Logit link: $h(\pi_i) = \log\!\left(\dfrac{\pi_i}{1-\pi_i}\right)$
Probit link: $h(\pi_i) = \Phi^{-1}(\pi_i)$
Complementary log-log link: $h(\pi_i) = \log\{-\log(1-\pi_i)\}$
Cauchy link: $h(\pi_i) = \tan\!\left[\dfrac{\pi}{2}(2\pi_i - 1)\right]$
370 / 430
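In glmtoolbox, zero-inflated count models can be fitted with zeroinf(), whose two-part formula appears to mirror the zeroalt() call shown later for the zero-altered model; the exact family string used below is an assumption, so check the package documentation.

# Sketch: fitting a ZIP model to the horseshoe-crab counts.
# Assumption: zeroinf() accepts a two-part formula (counts | zero part)
# and a family string, mirroring the zeroalt() call used later.
library(glmtoolbox)
library(GLMsData)
data(hcrabs)
fitzip <- zeroinf(Sat ~ log(Wt) + Col | Width, family = "poi(log)",
                  zero.link = "logit", data = hcrabs)
summary(fitzip)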
Zero-Modified Poisson distribution
If $Y\,|\,(\nu = 1) = 0$ and $Y\,|\,(\nu = 0) \sim \text{ZT-Poisson}(\mu)$, where $\nu \sim \text{Bernoulli}(\pi)$ and “ZT” denotes the Zero-Truncated version, then $Y$ is distributed according to the Zero-Modified Poisson distribution, denoted here by $\text{ZMP}(\mu, \pi)$. The probability mass function of $Y$ may be written as follows:
$$f_Y(y;\mu,\pi) = \begin{cases} \pi & \text{if } y = 0,\\[4pt] \kappa_0\,\dfrac{\exp(-\mu)\,\mu^{y}}{y!} & \text{if } y = 1, 2, 3, \ldots\end{cases}$$
where $\kappa_0 = (1-\pi)/[1-\exp(-\mu)]$. In addition,
(1) $E(Y) = \kappa_0\mu$,
(2) $\text{Var}(Y) = \kappa_0\mu + \kappa_0(1-\kappa_0)\mu^{2}$,
(3) $F_Y(y;\mu,\pi) = \pi\,I_{[0,\infty)}(y) + \kappa_0\,[F_{\tilde{Y}}(y;\mu) - F_{\tilde{Y}}(0;\mu)]$, and
(4) $E(Y^{r}) = \kappa_0\,E(\tilde{Y}^{r})$ for all $r > 0$, where $\tilde{Y} \sim \text{Poisson}(\mu)$.
371 / 430
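A small numerical illustration of how the ZMP handles both directions; dzmp() is a local helper, not a glmtoolbox function. When $\pi$ falls below the Poisson zero probability $\exp(-\mu)$, zeros are deflated rather than inflated.

# ZMP(mu, pi) pmf as defined above; dzmp() is a local helper.
dzmp <- function(y, mu, p) {
  k0 <- (1 - p) / (1 - exp(-mu))
  ifelse(y == 0, p, k0 * dpois(y, mu))
}
mu <- 2
dpois(0, mu)                    # Poisson zero probability: exp(-2) ~ 0.135
dzmp(0, mu, p = 0.30)           # excess of zeros (p > exp(-mu))
dzmp(0, mu, p = 0.05)           # lack of zeros (p < exp(-mu))
sum(dzmp(0:100, mu, p = 0.05))  # still a valid pmf (~ 1)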
Zero-Modified Poisson model
$$Y_i \overset{\text{ind}}{\sim} \text{ZMP}(\mu_i, \pi_i),\qquad g(\mu_i) = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip},\qquad h(\pi_i) = \gamma_0 + \gamma_1 z_{i1} + \ldots + \gamma_q z_{iq}$$
Logarithmic link: $g(\mu_i) = \log(\mu_i)$
Identity link: $g(\mu_i) = \mu_i$
Square-root link: $g(\mu_i) = \sqrt{\mu_i}$
Logit link: $h(\pi_i) = \log\{\pi_i/(1-\pi_i)\}$
Probit link: $h(\pi_i) = \Phi^{-1}(\pi_i)$
Complementary log-log link: $h(\pi_i) = \log\{-\log(1-\pi_i)\}$
Cauchy link: $h(\pi_i) = \tan\{\tfrac{\pi}{2}(2\pi_i - 1)\}$
372 / 430
Zero-Inflated and Zero-Modified Poisson
Parameter estimation may be accomplished by using the maximum-likelihood method. Inference is based on the asymptotic properties of the maximum-likelihood estimator.
Note that:
(1) if $Y \sim \text{ZIP}(\mu,\pi)$ then $\Pr(Y=0) \to 1$ as $\pi \to 1$;
(2) if $Y \sim \text{ZIP}(\mu,\pi)$ then $\Pr(Y=0) \to \exp(-\mu)$ as $\pi \to 0$;
(3) if $Y \sim \text{ZMP}(\mu,\pi)$ then $\Pr(Y=0) \to 1$ as $\pi \to 1$;
(4) if $Y \sim \text{ZMP}(\mu,\pi)$ then $\Pr(Y=0) \to 0$ as $\pi \to 0$.
Therefore, the $\text{ZIP}(\mu,\pi)$ distribution can only handle an excess of zeros, whereas the $\text{ZMP}(\mu,\pi)$ distribution can handle both an excess and a lack of zeros.
373 / 430
Zero-Inflated Negative Binomial distribution
If $Y\,|\,(\nu = 1) = 0$ and $Y\,|\,(\nu = 0) \sim \text{Negative Binomial}(\mu,\phi,\tau)$, where $\nu \sim \text{Bernoulli}(\pi)$, then $Y$ is distributed according to the Zero-Inflated Negative Binomial distribution, denoted here by $\text{ZINB}(\mu,\phi,\tau,\pi)$. The probability mass function of $Y$ may be written as follows:
$$f_Y(y;\mu,\phi,\tau,\pi) = \begin{cases}\pi + (1-\pi)\left(\dfrac{\frac{1}{\phi\mu^{\tau}}}{\mu + \frac{1}{\phi\mu^{\tau}}}\right)^{\frac{1}{\phi\mu^{\tau}}} & \text{if } y = 0,\\[14pt] (1-\pi)\,\dfrac{\Gamma\!\left(y + \frac{1}{\phi\mu^{\tau}}\right)}{\Gamma\!\left(\frac{1}{\phi\mu^{\tau}}\right)\Gamma(y+1)}\left(\dfrac{\mu}{\mu + \frac{1}{\phi\mu^{\tau}}}\right)^{\!y}\left(\dfrac{\frac{1}{\phi\mu^{\tau}}}{\mu + \frac{1}{\phi\mu^{\tau}}}\right)^{\frac{1}{\phi\mu^{\tau}}} & \text{if } y = 1, 2, \ldots\end{cases}$$
In addition,
(1) $E(Y) = (1-\pi)\mu$,
(2) $\text{Var}(Y) = (1-\pi)\mu(1 + \phi\mu^{\tau+1}) + \pi(1-\pi)\mu^{2}$,
(3) $F_Y(y;\mu,\phi,\tau,\pi) = \pi\,I_{[0,\infty)}(y) + (1-\pi)\,F_{\tilde{Y}}(y;\mu,\phi,\tau)$, and
(4) $E(Y^{r}) = (1-\pi)\,E(\tilde{Y}^{r})$ for all $r > 0$, where $\tilde{Y} \sim \text{Negative Binomial}(\mu,\phi,\tau)$.
374 / 430
Zero-Inflated Negative Binomial model
$$Y_i \overset{\text{ind}}{\sim} \text{ZINB}(\mu_i, \phi, \tau, \pi_i),\qquad g(\mu_i) = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip},\qquad h(\pi_i) = \gamma_0 + \gamma_1 z_{i1} + \ldots + \gamma_q z_{iq}$$
Logarithmic link: $g(\mu_i) = \log(\mu_i)$
Identity link: $g(\mu_i) = \mu_i$
Square-root link: $g(\mu_i) = \sqrt{\mu_i}$
Logit link: $h(\pi_i) = \log\{\pi_i/(1-\pi_i)\}$
Probit link: $h(\pi_i) = \Phi^{-1}(\pi_i)$
Complementary log-log link: $h(\pi_i) = \log\{-\log(1-\pi_i)\}$
Cauchy link: $h(\pi_i) = \tan\{\tfrac{\pi}{2}(2\pi_i - 1)\}$
375 / 430
Zero-Modified Negative Binomial distribution
If $Y\,|\,(\nu = 1) = 0$ and $Y\,|\,(\nu = 0) \sim \text{ZT-Negative Binomial}(\mu,\phi,\tau)$, where $\nu \sim \text{Bernoulli}(\pi)$, then $Y$ is distributed according to the Zero-Modified Negative Binomial distribution, denoted here by $\text{ZMNB}(\mu,\phi,\tau,\pi)$. The probability mass function of $Y$ may be written as follows:
$$f_Y(y;\mu,\phi,\tau,\pi) = \begin{cases}\pi & \text{if } y = 0,\\[6pt] \kappa_0\,\dfrac{\Gamma\!\left(y + \frac{1}{\phi\mu^{\tau}}\right)}{\Gamma\!\left(\frac{1}{\phi\mu^{\tau}}\right)\Gamma(y+1)}\left(\dfrac{\mu}{\mu + \frac{1}{\phi\mu^{\tau}}}\right)^{\!y}\left(\dfrac{\frac{1}{\phi\mu^{\tau}}}{\mu + \frac{1}{\phi\mu^{\tau}}}\right)^{\frac{1}{\phi\mu^{\tau}}} & \text{if } y = 1, 2, \ldots\end{cases}$$
where
$$(1-\pi)\,\kappa_0^{-1} = 1 - \left(\dfrac{\frac{1}{\phi\mu^{\tau}}}{\mu + \frac{1}{\phi\mu^{\tau}}}\right)^{\frac{1}{\phi\mu^{\tau}}}.$$
In addition,
(1) $E(Y) = \kappa_0\mu$,
(2) $\text{Var}(Y) = \kappa_0\mu(1 + \phi\mu^{\tau+1}) + \kappa_0(1-\kappa_0)\mu^{2}$,
(3) $F_Y(y;\mu,\phi,\tau,\pi) = \pi\,I_{[0,\infty)}(y) + \kappa_0\,[F_{\tilde{Y}}(y;\mu,\phi,\tau) - f_{\tilde{Y}}(0;\mu,\phi,\tau)]$, and
(4) $E(Y^{r}) = \kappa_0\,E(\tilde{Y}^{r})$ for all $r > 0$, where $\tilde{Y} \sim \text{Negative Binomial}(\mu,\phi,\tau)$.
376 / 430
Zero-Modified Negative Binomial model
$$Y_i \overset{\text{ind}}{\sim} \text{ZMNB}(\mu_i, \phi, \tau, \pi_i),\qquad g(\mu_i) = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip},\qquad h(\pi_i) = \gamma_0 + \gamma_1 z_{i1} + \ldots + \gamma_q z_{iq}$$
Logarithmic link: $g(\mu_i) = \log(\mu_i)$
Identity link: $g(\mu_i) = \mu_i$
Square-root link: $g(\mu_i) = \sqrt{\mu_i}$
Logit link: $h(\pi_i) = \log\{\pi_i/(1-\pi_i)\}$
Probit link: $h(\pi_i) = \Phi^{-1}(\pi_i)$
Complementary log-log link: $h(\pi_i) = \log\{-\log(1-\pi_i)\}$
Cauchy link: $h(\pi_i) = \tan\{\tfrac{\pi}{2}(2\pi_i - 1)\}$
377 / 430
Zero-Inflated and Zero-Modified Negative Binomial
Parameter estimation may be accomplished by using the maximum-likelihood method. Inference is based on the asymptotic properties of the maximum-likelihood estimator.
Note that:
(1) if $Y \sim \text{ZINB}(\mu,\phi,\tau,\pi)$ then $\Pr(Y=0) \to 1$ as $\pi \to 1$;
(2) if $Y \sim \text{ZINB}(\mu,\phi,\tau,\pi)$ then $\Pr(Y=0) \to \left(\dfrac{(\phi\mu^{\tau})^{-1}}{\mu + (\phi\mu^{\tau})^{-1}}\right)^{(\phi\mu^{\tau})^{-1}}$ as $\pi \to 0$;
(3) if $Y \sim \text{ZMNB}(\mu,\phi,\tau,\pi)$ then $\Pr(Y=0) \to 1$ as $\pi \to 1$;
(4) if $Y \sim \text{ZMNB}(\mu,\phi,\tau,\pi)$ then $\Pr(Y=0) \to 0$ as $\pi \to 0$.
Therefore, the $\text{ZINB}(\mu,\phi,\tau,\pi)$ distribution can only handle an excess of zeros, whereas the $\text{ZMNB}(\mu,\phi,\tau,\pi)$ distribution can handle both an excess and a lack of zeros.
378 / 430
Males attached to female horseshoe crabs
> fit1 <- overglm(Sat ~ log(Wt) + Col + Spine + Width,
+ family="ztpoi(log)", data=hcrabs, subset={Sat>0})
> fit2 <- update(fit1, family="ztnb1(log)")
> fit3 <- update(fit1, family="ztnb2(log)")
> fit4 <- update(fit1, family="ztnbf(log)")
>
> AIC(fit1, fit2, fit3, fit4)
df AIC
fit1 8 525.5947
fit2 9 514.6772
fit3 9 512.4328
fit4 10 512.6216
> BIC(fit1, fit2, fit3, fit4)
df BIC
fit1 8 547.2710
fit2 9 539.0630
fit3 9 536.8185
fit4 10 539.7169
>
> fit3a <- update(fit3, formula = Sat ~ log(Wt) + Col)
> anova(fit3a, fit3, test="lr")
379 / 430
Males attached to female horseshoe crabs
Likelihood-ratio test
Model 1 : Sat ~ log(Wt) + Col
Model 2 : Sat ~ log(Wt) + Col + Spine + Width
Chi Df Pr(>Chi)
1 vs 2 5.9052 3 0.1163
>
> summary(fit3a)
Estimate Std.Error z-value Pr(>|z|)
(Intercept) -3.91458 2.29715 -1.70410 0.0883619
log(Wt) 0.75194 0.29754 2.52716 0.0114988
ColDM -0.74218 0.25754 -2.88187 0.0039532
ColLM -0.31739 0.28417 -1.11692 0.2640306
ColM -0.54407 0.22370 -2.43210 0.0150114
> fit5 <- glm(ifelse(Sat==0,1,0) ~ log(Wt) + Col + Spine + Width,
+ family=binomial(logit), data=hcrabs)
> fit5a <- update(fit5, formula=ifelse(Sat==0,1,0) ~ Width)
> anova(fit5a, fit5, test="lr")
Likelihood-ratio test
Model 1 : ifelse(Sat==0,1,0) ~ Width
Model 2 : ifelse(Sat==0,1,0) ~ log(Wt) + Col + Spine + Width
380 / 430
Males attached to female horseshoe crabs
Chi Df Pr(>Chi)
1 vs 2 9.2299 6 0.1611
>
> summary(fit5a)
Estimate Std. Error z value Pr(>|z|)
(Intercept) 12.3508 2.6287 4.698 2.62e-06
Width -0.4972 0.1017 -4.887 1.02e-06
>
> fit6 <- zeroalt(Sat ~ log(Wt) + Col | Width, family="nb2(log)",
+ zero.link="logit", data=hcrabs)
> summary(fit6)
Count model (Negative Binomial type II with log link):
Estimate Std.Error z-value Pr(>|z|)
(Intercept) -3.91458 2.29715 -1.70410 0.0883619
log(Wt) 0.75194 0.29754 2.52716 0.0114988
ColDM -0.74218 0.25754 -2.88187 0.0039532
ColLM -0.31739 0.28417 -1.11692 0.2640299
ColM -0.54407 0.22370 -2.43210 0.0150114
Zero-alteration model (Bernoulli with logit link):
(Intercept) 12.35082 2.62873 4.69840 2.6221e-06
Width -0.49723 0.10174 -4.88746 1.0215e-06
381 / 430
Exercises
Suppose that $Y$ is a discrete random variable whose possible values are all non-negative integers. The cumulative distribution function of $Y$ is denoted by $F_Y(y)$.
(1) Suppose that $\tilde{Y}$ is the Zero-Truncated version of $Y$, that is, $\Pr(\tilde{Y}=0) = 0$ and $\Pr(\tilde{Y}=y) = \Pr(Y=y)/[1-\Pr(Y=0)]$ for $y = 1, 2, \ldots$ Show that:
(A) $E(\tilde{Y}^{r}) = \kappa_0\,E(Y^{r})$ for all $r > 0$ such that $E(Y^{r})$ exists, where $\kappa_0 = 1/[1-\Pr(Y=0)]$. Therefore, if $\text{Var}(Y)$ exists, then $\text{Var}(\tilde{Y}) = \kappa_0\text{Var}(Y) + \kappa_0(1-\kappa_0)E^{2}(Y)$.
(B) $F_{\tilde{Y}}(y) = \Pr(\tilde{Y} \le y) = [\kappa_0 F_Y(y) + (1-\kappa_0)]\,I_{[0,\infty)}(y)$.
(2) Suppose that $\tilde{Y}$ is the Zero-Inflated version of $Y$, that is, $\Pr(\tilde{Y}=0) = \pi + (1-\pi)\Pr(Y=0)$ and $\Pr(\tilde{Y}=y) = (1-\pi)\Pr(Y=y)$ for $y = 1, 2, \ldots$ Show that:
(A) $E(\tilde{Y}^{r}) = \kappa_0\,E(Y^{r})$ for all $r > 0$ such that $E(Y^{r})$ exists, where $\kappa_0 = 1-\pi$. Therefore, if $\text{Var}(Y)$ exists, then $\text{Var}(\tilde{Y}) = \kappa_0\text{Var}(Y) + \kappa_0(1-\kappa_0)E^{2}(Y)$.
(B) $F_{\tilde{Y}}(y) = \Pr(\tilde{Y} \le y) = [\kappa_0 F_Y(y) + (1-\kappa_0)]\,I_{[0,\infty)}(y)$.
382 / 430
Exercises
(3) Suppose that $\tilde{Y}$ is the Zero-Modified version of $Y$, that is, $\Pr(\tilde{Y}=0) = \pi$ and $\Pr(\tilde{Y}=y) = (1-\pi)\Pr(Y=y)/[1-\Pr(Y=0)]$ for $y = 1, 2, \ldots$ Show that:
(A) $E(\tilde{Y}^{r}) = \kappa_0\,E(Y^{r})$ for all $r > 0$ such that $E(Y^{r})$ exists, where $\kappa_0 = (1-\pi)/[1-\Pr(Y=0)]$. Therefore, if $\text{Var}(Y)$ exists, then $\text{Var}(\tilde{Y}) = \kappa_0\text{Var}(Y) + \kappa_0(1-\kappa_0)E^{2}(Y)$.
(B) $F_{\tilde{Y}}(y) = \Pr(\tilde{Y} \le y) = [\kappa_0 F_Y(y) + (1-\kappa_0)]\,I_{[0,\infty)}(y)$.
383 / 430
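These identities are easy to check by simulation before proving them; a minimal sketch for the zero-truncated case of exercise (1)(A), with Y ~ Poisson(1.5) and r = 2:

# Monte Carlo check of E(Ytilde^r) = k0 * E(Y^r) for exercise (1)(A),
# zero-truncated Poisson case; a sketch, not a proof.
set.seed(123)
mu <- 1.5; r <- 2
y  <- rpois(1e6, mu)
yt <- y[y > 0]                 # conditioning on Y > 0 gives the ZT version
k0 <- 1 / (1 - dpois(0, mu))
mean(yt^r)                     # empirical E(Ytilde^r)
k0 * mean(y^r)                 # k0 * E(Y^r): should agree closely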
Cluster Correlated Data
384 / 430
Effect of the ozone on the tree growth
These data, previously analyzed in Diggle et al. [2002] and available in the object spruces of the library glmtoolbox, are used to assess the effect of ozone pollution on tree growth. Ozone pollution is common in urban areas, so the impact of increased ozone concentrations on tree growth is of considerable interest. The response variable is the size of the tree (size), conventionally measured as the product of tree height and stem diameter squared. The experiment examined 79 trees, identified in the dataset by the column tree. A first group of 54 trees was grown in an ozone-enriched atmosphere (treat=“ozone-enriched”), while a second group of 25 trees was grown in a normal atmosphere (treat=“normal”). The size of each tree was measured 13 times over the course of the experiment (days), so the data are balanced and the number of rows in the dataset is 1027. The main objective of the analysis is to compare the growth patterns of trees under the two types of atmosphere: normal and ozone-enriched.
385 / 430
Effect of the ozone on the tree growth
[Figure: boxplots of tree size at each of the 13 measurement occasions (days 152–674 since the beginning of the experiment), by atmosphere (normal vs. ozone-enriched).]
386 / 430
Treatment of severe postnatal depression
This dataset, extracted from Gregoire et al. [1996] and available in the library glmtoolbox as the object depression, is from a placebo-controlled study on the efficacy of oestrogen for the treatment of severe postnatal depression. There were 61 women (subj) with severe depression randomly assigned to one of two treatment groups: one (group=“placebo”) received a placebo patch, and the other (group=“oestrogen”) received an oestrogen patch. The Edinburgh Postnatal Depression Scale (EPDS), on which higher scores indicate greater depression levels, was administered to all women prior to the therapy. Once the treatment began, EPDS scores (dep) were collected monthly for 6 months (visit). The response (depressd) is 1 for severe depression (EPDS ≥ 11) and 0 otherwise.
387 / 430
Treatment of severe postnatal depression
[Figure: percentage of women with depression at each visit (baseline through month 6), by treatment (placebo vs. oestrogen).]
388 / 430
Generalized Least Squares (GLS)
Let $y_i = (y_{i1},\ldots,y_{ij},\ldots,y_{in_i})^{\top}$, for $i = 1,\ldots,n$, be the multivariate responses of interest measured on $n$ clusters, which are assumed to be realizations of independent random vectors denoted here by $Y_i = (Y_{i1},\ldots,Y_{ij},\ldots,Y_{in_i})^{\top}$, for $i = 1,\ldots,n$, where $n_i$ represents the size of the $i$-th cluster or the number of measurements performed on the $i$-th subject. The random variables $Y_{ij}$, for $j = 1,\ldots,n_i$, are assumed to satisfy the following:
$$\mu_i = E(Y_i) = X_i\beta \quad\text{and}\quad \text{Var}(Y_i) = \Sigma_i,$$
where $X_i$ is a (known) $n_i \times (p+1)$ matrix, $\beta = (\beta_0,\beta_1,\ldots,\beta_p)^{\top}$ is the set of regression coefficients, and $\Sigma_i$ is a (known) $n_i \times n_i$ symmetric positive-definite matrix. The estimator of $\beta$ may be written as follows:
$$\hat{\beta} = \underset{\beta\,\in\,\mathbb{R}^{p+1}}{\text{argmin}}\; Q(\beta),$$
where
$$Q(\beta) = \frac{1}{2}(y - X\beta)^{\top}\Sigma^{-1}(y - X\beta) = \frac{1}{2}\sum_{i=1}^{n}(y_i - X_i\beta)^{\top}\Sigma_i^{-1}(y_i - X_i\beta),$$
389 / 430
Generalized Least Squares (GLS)
in which $y = (y_1^{\top},\ldots,y_n^{\top})^{\top}$, $X = (X_1^{\top},\ldots,X_n^{\top})^{\top}$ and $\Sigma = \text{diag}\{\Sigma_1,\ldots,\Sigma_n\}$. The function $Q(\beta)$ is strictly convex. Therefore, $\hat{\beta}$ is the solution to the system of equations given by
$$\sum_{i=1}^{n}\underbrace{X_i^{\top}}_{\partial\mu_i^{\top}/\partial\beta}\;\underbrace{\Sigma_i^{-1}}_{[\text{Var}(Y_i)]^{-1}}\,(y_i - X_i\beta) = 0.$$
How can the GLS method be modified to estimate $\beta$ when the random vectors $Y_i$, for $i = 1,\ldots,n$, are such that
$$g(\mu_i) = X_i\beta \quad\text{and}\quad \text{Var}(Y_i) = \sigma^{2}A_i^{\frac12}R_i(\rho)A_i^{\frac12},$$
where $g(\cdot)$ is a (known) link function applied elementwise, $A_i = \text{diag}\{V(\mu_{i1}),\ldots,V(\mu_{in_i})\}$, $V(\cdot)$ is a (known) variance function, and $R_i(\rho)$, which is assumed to depend on $\rho$, is the correlation matrix of $Y_i$?
390 / 430
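Because Q(β) is quadratic in β, the minimizer has the closed form β̂ = (X⊤Σ⁻¹X)⁻¹X⊤Σ⁻¹y. A minimal numerical sketch with simulated clustered data (all names, dimensions and the AR(1)-type correlation are illustrative):

# GLS on simulated clustered data: a minimal sketch, not glmtoolbox code.
set.seed(1)
n <- 50; ni <- 4                          # 50 clusters of size 4 (illustrative)
X <- cbind(1, rnorm(n * ni))              # design: intercept and one covariate
beta <- c(2, -1)
Ri <- 0.5^abs(outer(1:ni, 1:ni, "-"))     # within-cluster correlation block
Sigma <- kronecker(diag(n), Ri)           # block-diagonal Var(Y)
L <- chol(Sigma)                          # Sigma = L' L
y <- drop(X %*% beta + crossprod(L, rnorm(n * ni)))
SinvX <- solve(Sigma, X)                  # Sigma^{-1} X
beta.gls <- solve(t(X) %*% SinvX, t(SinvX) %*% y)
drop(beta.gls)                            # close to c(2, -1)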
Generalized Estimating Equations (GEEs)
GEEs are a very flexible statistical tool for the analysis of heteroskedastic continuous and discrete cluster-correlated data. An appealing feature of GEEs is that they do not require the full specification of the multivariate distribution of the response vector measured on each cluster. Indeed, in GEEs, only the following needs to be specified:
(1) A variance function describing the mechanism by which heteroskedasticity is assumed to occur (if any), i.e., the way in which the variance is assumed to depend on the mean.
(2) A regression structure, consisting of a link function and a linear predictor, which describes the way in which the mean of the response variable distribution is assumed to depend on some explanatory variables.
(3) A structure for the correlation matrix, aimed at describing the dynamics of the linear association between the different measurements of the response variable performed on the same subject or cluster.
391 / 430
Generalized Estimating Equations
Let $y_i = (y_{i1},\ldots,y_{ij},\ldots,y_{in_i})^{\top}$, for $i = 1,\ldots,n$, be the multivariate responses of interest measured on $n$ clusters, which are assumed to be realizations of independent random vectors denoted here by $Y_i = (Y_{i1},\ldots,Y_{ij},\ldots,Y_{in_i})^{\top}$, for $i = 1,\ldots,n$, where $n_i$ represents the size of the $i$-th cluster or the number of measurements performed on the $i$-th subject. The random variables $Y_{ij}$, for $j = 1,\ldots,n_i$, are assumed to satisfy the following:
$$\text{Var}(Y_{ij}) = \frac{\phi}{\omega_{ij}}\,V(\mu_{ij}) \quad\text{and}\quad \text{Corr}(Y_{ij},Y_{ik}) = r_{jk}(\rho),$$
where $\mu_{ij} = E(Y_{ij})$, $\phi > 0$ is the dispersion parameter, $\omega_{ij} > 0$ are known weights, $V(\mu)$ is the variance function, and $r_{jk}(\rho)$ is Pearson's linear correlation coefficient, which depends on the parameter vector $\rho$. In addition, $\mu_{ij}$ is assumed to depend on $x_{ij} = (1, x_{1ij},\ldots,x_{pij})^{\top}$ in the following way:
$$g(\mu_{ij}) = x_{ij}^{\top}\beta,$$
where $g(\cdot)$ is the link function and $\beta = (\beta_0,\beta_1,\ldots,\beta_p)^{\top}$ is the set of unknown regression coefficients.
392 / 430
Generalized Estimating Equations
According to Liang and Zeger [1986], the estimate of $\beta$, denoted here by $\hat{\beta}$, reduces to the solution to $U(\hat{\beta}) = 0$, where
$$U(\beta) = \sum_{i=1}^{n}\underbrace{X_i^{\top}K_i}_{(\partial\mu_i/\partial\beta^{\top})^{\top}}\;\underbrace{\left[\phi A_i^{\frac12}R_iA_i^{\frac12}\right]^{-1}}_{[\text{Var}(Y_i)]^{-1}}(y_i - \mu_i) = \frac{1}{\phi}\sum_{i=1}^{n}X_i^{\top}W_iK_i^{-1}(y_i - \mu_i),$$
in which
$X_i = (x_{i1},\ldots,x_{in_i})^{\top}$;
$W_i = K_iV_i^{-1}K_i$;
$K_i = \text{diag}\{1/g'(\mu_{i1}),\ldots,1/g'(\mu_{in_i})\}$;
$V_i = A_i^{\frac12}R_iA_i^{\frac12}$;
$A_i = \text{diag}\{V(\mu_{i1})/\omega_{i1},\ldots,V(\mu_{in_i})/\omega_{in_i}\}$;
the $(j,k)$-th entry of $R_i$ is $r_{jk}(\rho)$; and
$\mu_i = (g^{-1}(x_{i1}^{\top}\beta),\ldots,g^{-1}(x_{in_i}^{\top}\beta))^{\top}$.
393 / 430
Estimating algorithm
Step 0: Start the counter at $t = 0$; set the tolerance limit $\epsilon > 0$; set the maximum number of iterations $n_{\max}$; and set the initial value for $\beta$, say $\beta^{[0]}$.
Step 1: Compute $\rho^{[t]}$ from the Pearson residuals evaluated at $\beta^{[t]}$, given by
$$r_{ij}^{[t]} = \frac{y_{ij} - \mu_{ij}^{[t]}}{\sqrt{\phi^{[t]}\,V(\mu_{ij}^{[t]})/\omega_{ij}}}.$$
Step 2: Compute $\beta^{[t+1]} = \beta^{[t]} + \left[I(\beta^{[t]})\right]^{-1}U(\beta^{[t]})$.
Step 3: Compute $\delta^{(t+1)} = \delta(\beta^{[t]}, \beta^{[t+1]})$.
Step 4: Update the counter: $t = t + 1$.
Step 5: Repeat steps 1, 2, 3 and 4 until $\delta^{(t)} < \epsilon$ or $t > n_{\max}$.
Step 6: If $\delta^{(t)} < \epsilon$, then $\hat{\beta}$ is defined to be $\beta^{[t]}$. Otherwise, convergence was not achieved.
Here,
$$I(\beta) = -E\!\left(\frac{\partial U(\beta)}{\partial\beta^{\top}}\right) = \frac{1}{\phi}\sum_{i=1}^{n}X_i^{\top}W_iX_i.$$
A simple choice for $\beta^{[0]}$ is the estimate of $\beta$ obtained from the GEE in which the working correlation matrix is the identity.
394 / 430
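As an illustration of Steps 0–6, the skeleton below implements the update for the special case of an independence working correlation with V(µ) = µ, g = log and ω_ij = 1, in which case ρ plays no role and the iteration reduces to quasi-Poisson Fisher scoring; glmgee() implements the general algorithm.

# Skeleton of the estimating algorithm for the special case
# R_i = identity, V(mu) = mu, g = log, omega = 1 (quasi-Poisson scoring).
# Illustrative only; glmgee() implements the general case.
gee_indep <- function(y, X, eps = 1e-8, nmax = 25) {
  beta <- rep(0, ncol(X))                  # crude starting value (Step 0)
  for (t in seq_len(nmax)) {
    mu <- exp(drop(X %*% beta))            # g^{-1} for the log link
    # Here K = diag(mu) and A = diag(mu), so W K^{-1} = identity and
    # U = X'(y - mu)/phi, I = X' W X / phi; phi cancels in the update.
    U <- crossprod(X, y - mu)
    I <- crossprod(X * mu, X)
    step <- solve(I, U)                    # Step 2
    beta <- beta + drop(step)
    if (sqrt(sum(step^2)) < eps) break     # delta: norm of the update step
  }
  beta
}

A call such as beta.hat <- gee_indep(y, X), with y a count response and X a design matrix, then mirrors Steps 0–6 with δ taken as the Euclidean norm of the update step.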
Asymptotic theory
If the model for the mean ($\mu$) is correctly specified then, under certain regularity conditions, $\hat{\beta}$ is consistent for $\beta$ and its distribution is such that (Liang and Zeger [1986])
$$\sqrt{n}\,(\hat{\beta} - \beta) \xrightarrow[n\to\infty]{D} \text{Normal}(0, \text{Var}(\hat{\beta})),$$
where
$$\text{Var}(\hat{\beta}) = \lim_{n\to\infty}\left(\frac{1}{n}X^{\top}WX\right)^{-1}\left[\frac{1}{n}\sum_{i=1}^{n}X_i^{\top}W_iK_i^{-1}\text{Var}(Y_i)K_i^{-1}W_iX_i\right]\left(\frac{1}{n}X^{\top}WX\right)^{-1}.$$
Therefore, if the model for the mean is correctly specified, then $\hat{\beta}$ remains consistent and asymptotically normal even if the structure of the correlation matrix is not correctly specified. Indeed, if the structure of the correlation matrix is also correctly specified, that is, if $\text{Var}(Y_i) = \phi V_i$ for $i = 1,\ldots,n$, then $\text{Var}(\hat{\beta})$ reduces to
$$\text{Var}(\hat{\beta}) = \phi\lim_{n\to\infty}\left(\frac{1}{n}X^{\top}WX\right)^{-1}.$$
395 / 430
Working-correlation matrix
Independence:
$$\text{Corr}(Y_{ij}, Y_{ik}) = \begin{cases}1 & \text{if } j = k,\\ 0 & \text{if } j \neq k\end{cases}$$
Exchangeable:
$$\text{Corr}(Y_{ij}, Y_{ik}) = \begin{cases}1 & \text{if } j = k,\\ \rho & \text{if } j \neq k,\end{cases}\qquad \hat{\rho} = \frac{1}{M - p - 1}\sum_{i=1}^{n}\sum_{j<k} r_{ij}r_{ik},\quad M = \frac{1}{2}\sum_{i=1}^{n} n_i(n_i - 1).$$
AR-M-dependent(m): For example, if $m = 1$, then the correlations become
$$\text{Corr}(Y_{ij}, Y_{ik}) = \begin{cases}1 & \text{if } j = k,\\ \rho^{|j-k|} & \text{if } j \neq k,\end{cases}\qquad \hat{\rho} = \frac{1}{M - p - 1}\sum_{i=1}^{n}\sum_{j=1}^{n_i-1} r_{ij}r_{i,j+1},\quad M = \sum_{i=1}^{n}(n_i - 1).$$
396 / 430
Working-correlation matrix
Stationary-M-dependent(m):
$$\text{Corr}(Y_{ij}, Y_{i,j+t}) = \begin{cases}1 & \text{if } t = 0,\\ \rho_t & \text{if } t = 1,\ldots,m,\\ 0 & \text{if } t > m,\end{cases}\qquad \hat{\rho}_t = \frac{1}{M_t - p - 1}\sum_{i=1}^{n}\sum_{j=1}^{n_i-t} r_{ij}r_{i,j+t},\quad M_t = \sum_{i=1}^{n}(n_i - t).$$
Non-Stationary-M-dependent(m):
$$\text{Corr}(Y_{ij}, Y_{ik}) = \begin{cases}1 & \text{if } j = k,\\ \rho_{jk} & \text{if } 0 < |j-k| \le m,\\ 0 & \text{if } |j-k| > m,\end{cases}\qquad \hat{\rho}_{jk} = \frac{1}{n - p - 1}\sum_{i=1}^{n} r_{ij}r_{ik}.$$
Unstructured:
$$\text{Corr}(Y_{ij}, Y_{ik}) = \begin{cases}1 & \text{if } j = k,\\ \rho_{jk} & \text{if } j \neq k,\end{cases}\qquad \hat{\rho}_{jk} = \frac{1}{n - p - 1}\sum_{i=1}^{n} r_{ij}r_{ik}.$$
397 / 430
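With balanced clusters, the moment estimators above are easy to compute from the matrix of Pearson residuals; a minimal sketch, using simulated residuals as stand-ins and treating the number of fitted coefficients (p1) as known:

# Moment estimators of rho from Pearson residuals; a sketch assuming
# n balanced clusters of size ni and p1 = p + 1 fitted coefficients.
set.seed(2)
n <- 100; ni <- 5; p1 <- 3
r <- matrix(rnorm(n * ni), n, ni)        # stand-in for Pearson residuals
# Exchangeable: sum over all within-cluster pairs j < k
M.exch <- n * ni * (ni - 1) / 2
rho.exch <- sum(apply(r, 1, function(ri)
  sum(tcrossprod(ri)[lower.tri(diag(ni))]))) / (M.exch - p1)
# AR-1: sum over adjacent pairs only
M.ar1 <- n * (ni - 1)
rho.ar1 <- sum(r[, -ni] * r[, -1]) / (M.ar1 - p1)
c(rho.exch, rho.ar1)                     # both near 0 for independent residuals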
Variance Estimation
model:
$$\widehat{\text{Var}}_M(\hat{\beta}) = [I(\hat{\beta})]^{-1} = \hat{\phi}\left(X^{\top}\widehat{W}X\right)^{-1},$$
where $X = (X_1^{\top},\ldots,X_n^{\top})^{\top}$ and $\widehat{W} = \text{diag}\{\widehat{W}_1,\ldots,\widehat{W}_n\}$.
robust (Liang and Zeger [1986]):
$$\widehat{\text{Var}}_R(\hat{\beta}) = \left(X^{\top}\widehat{W}X\right)^{-1}\left[\sum_{i=1}^{n}X_i^{\top}\widehat{W}_i\widehat{K}_i^{-1}e_ie_i^{\top}\widehat{K}_i^{-1}\widehat{W}_iX_i\right]\left(X^{\top}\widehat{W}X\right)^{-1},$$
where $e_i = y_i - \hat{\mu}_i$. This estimator is robust to misspecification of the working correlation matrix in the sense that it is a consistent estimator of the true variance-covariance matrix of $\hat{\beta}$ as long as the model for the marginal mean is correctly specified.
df-adjusted:
$$\widehat{\text{Var}}_A(\hat{\beta}) = \frac{n}{n - p - 1}\,\widehat{\text{Var}}_R(\hat{\beta})$$
398 / 430
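The robust (sandwich) estimator can be assembled by hand. The sketch below does so for the special case of an independence working correlation, starting from an ordinary glm fit so that the pieces X, W, K and e_i are available in closed form; it is illustrative only, since glmgee reports these estimators directly.

# Cluster-robust sandwich variance for a GLM fitted under an independence
# working correlation; a hand-rolled sketch (glmgee reports this directly).
sandwich_cluster <- function(fit, id) {
  X  <- model.matrix(fit)
  w  <- fit$weights                           # IRLS working weights: diag(W)
  gp <- 1 / fit$family$mu.eta(predict(fit))   # g'(mu) = 1 / (d mu / d eta)
  e  <- residuals(fit, type = "response")     # y - mu
  bread <- solve(crossprod(X * sqrt(w)))      # (X' W X)^{-1}
  U <- X * (w * gp * e)                       # rows: x_ij w_ij g'(mu_ij) e_ij
  meat <- crossprod(rowsum(U, group = id))    # sum over clusters of u_i u_i'
  bread %*% meat %*% bread
}
# Example with the spruces data (independence working structure):
# fit <- glm(size ~ poly(days, 4) + treat, family = Gamma(log), data = spruces)
# sqrt(diag(sandwich_cluster(fit, id = spruces$tree)))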
Criteria to select the Working Correlation Matrix
Correlation Information Criterion (Hin and Wang [2009]):
$$\text{CIC} = \text{trace}\left\{\hat{\phi}^{-1}\left(X^{\top}\widehat{K}\widehat{A}^{-1}\widehat{K}X\right)\widehat{\text{Var}}_R(\hat{\beta})\right\}$$
Akaike-type penalized Gaussian Pseudo-likelihood Criterion (Carey and Wang [2011], Zhu and Zhu [2013], Fu et al. [2018]):
$$\text{AGPC} = \text{Gaussian pseudo-likelihood} + 2\left[p + 1 + \dim(\hat{\rho})\right],$$
where the Gaussian pseudo-likelihood is given by
$$\sum_{i=1}^{n}\left\{n_i\log(2\pi) + \frac{1}{\hat{\phi}}(y_i - \hat{\mu}_i)^{\top}\widehat{V}_i^{-1}(y_i - \hat{\mu}_i) + \log\left(\hat{\phi}\,|\widehat{V}_i|\right)\right\}$$
Schwarz-type penalized Gaussian Pseudo-likelihood Criterion (Carey and Wang [2011], Zhu and Zhu [2013], Fu et al. [2018]):
$$\text{SGPC} = \text{Gaussian pseudo-likelihood} + \log(n)\left[p + 1 + \dim(\hat{\rho})\right]$$
399 / 430
Hypothesis testing
To assess the hypothesis system $H_0: \beta^{*} = 0$ versus $H_1: \beta^{*} \neq 0$, where the elements of $\beta^{*}$ are a subset of those of $\beta$, so that $\beta^{*}$ may be written as $\beta^{*} = L\beta$, in which $L$ is an $r \times (p+1)$ contrast matrix, the following decision rule may be used:
“Reject $H_0$ at the approximate $100(\alpha)\%$ significance level if $\xi > \chi^{2}_{1-\alpha}(r)$”,
where $\alpha \in (0,1)$, $\chi^{2}_{1-\alpha}(r)$ is the $100(1-\alpha)$-th percentile of the $\chi^{2}(r)$ distribution, and $\xi$ is one of the following statistics:
wald:
$$\xi_W = \left(L\hat{\beta}\right)^{\top}\left[L\,\widehat{\text{Var}}_R(\hat{\beta})\,L^{\top}\right]^{-1}\left(L\hat{\beta}\right)$$
generalized score (Rotnitzky and Jewell [1990], Boos [1992]):
$$\xi_S = \left[L\,\widehat{\text{Var}}_M(\hat{\beta})\,U(\beta)\right]^{\top}\left[L\,\widehat{\text{Var}}_R(\hat{\beta})\,L^{\top}\right]^{-1}\left[L\,\widehat{\text{Var}}_M(\hat{\beta})\,U(\beta)\right],$$
evaluated at the parameter estimates obtained under the restriction given by $H_0$, that is, restricted to $\beta^{*} = 0$.
400 / 430
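The Wald statistic is easy to reproduce by hand from a fitted glmgee object, for example for the treatment coefficient of the ozone example's AR-1 fit (fit3, defined later in these notes); it is assumed here that vcov() on a glmgee fit returns the robust covariance estimator, which is worth checking against the package documentation.

# Wald statistic for H0: L beta = 0, testing the 'treat' effect in the
# ozone example's AR-1 fit (fit3, defined later). Assumption: vcov() on a
# glmgee object returns the robust covariance; check the documentation.
b <- coef(fit3)
L <- matrix(0, nrow = 1, ncol = length(b),
            dimnames = list(NULL, names(b)))
L[1, "treatozone-enriched"] <- 1
xi.w <- drop(t(L %*% b) %*% solve(L %*% vcov(fit3) %*% t(L)) %*% (L %*% b))
pchisq(xi.w, df = nrow(L), lower.tail = FALSE)  # compare: anova(..., test="wald")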
Residuals
pearson:
$$r^{P}_{ij} = \frac{y_{ij} - \hat{\mu}_{ij}}{\sqrt{\hat{\phi}\,V(\hat{\mu}_{ij})/\omega_{ij}}},\qquad i = 1,\ldots,n;\; j = 1,\ldots,n_i$$
deviance:
$$r^{D}_{ij} = \text{sign}(y_{ij} - \hat{\mu}_{ij})\sqrt{d(y_{ij}, \hat{\mu}_{ij}, \omega_{ij})/\hat{\phi}},\qquad i = 1,\ldots,n;\; j = 1,\ldots,n_i$$
mahalanobis:
$$r^{M}_{i} = \frac{1}{n_i\,\hat{\phi}}\,(y_i - \hat{\mu}_i)^{\top}\widehat{V}_i^{-1}(y_i - \hat{\mu}_i),\qquad i = 1,\ldots,n$$
401 / 430
Influence or sensitivity
The “leave-one-out” statistics are computed to quantify the effect on the estimates of the parameters in the linear predictor of deleting each subject/cluster. Thus, if the $i$-th cluster is excluded, these statistics may be expressed as
$$\text{Dfbeta}_{(i)} = \hat{\beta} - \hat{\beta}_{(i)}\quad\text{and}\quad \text{CD}_{(i)} = \frac{1}{(p+1)}\,(\hat{\beta} - \hat{\beta}_{(i)})^{\top}\left[\widehat{\text{Var}}(\hat{\beta})\right]^{-1}(\hat{\beta} - \hat{\beta}_{(i)}),$$
respectively, where $\hat{\beta}_{(i)}$ is the estimate of $\beta$ computed from the dataset in which the $i$-th cluster is excluded. To avoid the possible computational burden, the “leave-one-out” estimates of $\beta$ are replaced by their “one-step approximations”, denoted here by $\hat{\beta}^{I}_{(i)}$ for $i = 1,\ldots,n$. The value of $\hat{\beta}^{I}_{(i)}$ corresponds to the first iteration of the estimating algorithm of the GEE when it is performed using: (i) the dataset in which the $i$-th cluster is excluded; and (ii) a starting value which is the solution to the same GEE computed from the dataset including all clusters, that is, $\hat{\beta}$.
402 / 430
Effect of the ozone on the tree growth
The data are analyzed by using the following model:
$$\text{Var}(Y_{ij}) = \phi\mu_{ij}^{2},\qquad \text{Corr}(Y_{ij}, Y_{ik}) = \rho^{|j-k|},$$
and
$$\log(\mu_{ij}) = \beta_0 + \beta_1 x_{1ij} + \beta_2 x_{1ij}^{2} + \beta_3 x_{1ij}^{3} + \beta_4 x_{1ij}^{4} + \beta_5 x_{2ij},$$
where
$y_{ij}$ = $j$-th measurement of size performed on the $i$-th tree;
$x_{1ij}$ = time, in days, from the start of the experiment until the $j$-th measurement on the $i$-th tree was performed;
$x_{2ij}$ = 1 if the $i$-th tree grew under the ozone-enriched atmosphere, 0 if the $i$-th tree grew under the normal atmosphere.
403 / 430
Effect of the ozone on the tree growth
> ggplot(spruces, aes(x=factor(days), y=size, fill=treat)) +
+ geom_boxplot(outlier.shape=16, outlier.size=1) +
+ labs(x="Days since the beginning of the experiment",
+ y="Size", fill="Atmosphere") +
+ scale_fill_manual(values=c("salmon2", "green2")) +
+ theme(legend.position="top", legend.direction="horizontal")
>
> res <- function(x,y)
+ data.frame(type=y,fitted(x),residuals(x,type="pearson"))
> m1 <- glmgee(size ~ poly(days,4)+treat, id=tree, data=spruces,
+ corstr="Independence", family=inverse.gaussian(log))
> r1 <- res(m1,"V(mu)==mu^3")
> m2 <- update(m1,family=poisson(log));r2 <- res(m2,"V(mu)==mu")
> m3 <- update(m1,family=Gamma(log));r3 <- res(m3,"V(mu)==mu^2")
> m4 <- update(m1,family=gaussian(log));r4 <- res(m4,"V(mu)==1")
> rs <- rbind(r1,r2,r3,r4);colnames(rs) <- c("type","mu","res")
>
> ggplot(rs, aes(x=mu, y=res)) +
+ geom_point(shape=16, col="black") +
+ labs(x="Fitted values", y="Pearson's residuals") +
+ facet_wrap(vars(type), labeller=label_parsed)
404 / 430
Effect of the ozone on the tree growth
[Figure: Pearson's residuals versus fitted values under the four working variance functions V(µ) = 1, V(µ) = µ, V(µ) = µ², and V(µ) = µ³.]
405 / 430
Effect of the ozone on the tree growth
> fit1 <- glmgee(size ~ poly(days,4) + treat, family=Gamma(log),
+ id=tree, corstr="Independence", data=spruces)
> fit2 <- update(fit1, corstr="Exchangeable")
> fit3 <- update(fit1, corstr="AR-M-dependent(1)")
> fit4 <- update(fit1, corstr="AR-M-dependent(2)")
> fit5 <- update(fit1, corstr="AR-M-dependent(3)")
> fit6 <- update(fit1, corstr="AR-M-dependent(4)")
>
> a <- AGPC(fit1, fit2, fit3, fit4, fit5, fit6, verbose=FALSE)
> b <- SGPC(fit1, fit2, fit3, fit4, fit5, fit6, verbose=FALSE)
> cbind(a, SGPC=b[,"SGPC"])
Object Correlation GPL Parameters AGPC SGPC
fit1 Independence 13527.34 6 13539.34 13553.55
fit2 Exchangeable 11675.19 7 11689.19 11705.78
fit3 AR-M-dependent(1) 10926.66 7 10940.66 10957.25
fit4 AR-M-dependent(2) 10964.71 8 10980.71 10999.67
fit5 AR-M-dependent(3) 10976.47 9 10994.47 11015.80
fit6 AR-M-dependent(4) 11011.43 10 11031.43 11055.12
406 / 430
Effect of the ozone on the tree growth
> summary(fit3)
Number of observations: 1027
Number of clusters: 79
Cluster size: 13
*************************************************************
Variance function: Gamma
Link function: log
Correlation structure: AR-1
*************************************************************
Estimate Std.Error z-value Pr(>|z|)
(Intercept) 5.90378 0.10486 56.30321 < 2e-16
poly(days, 4)1 19.20015 0.51848 37.03159 < 2e-16
poly(days, 4)2 -2.85755 0.20585 -13.88147 < 2e-16
poly(days, 4)3 5.41639 0.18246 29.68549 < 2e-16
poly(days, 4)4 -3.57407 0.12478 -28.64405 < 2e-16
treatozone-enriched -0.25861 0.12835 -2.01486 0.043919
Dispersion 0.32866
(The AR-1 fit, fit3, is the one selected by AGPC and SGPC above; “Dispersion” is the estimate $\hat{\phi}$.)
407 / 430
Effect of the ozone on the tree growth
> fit3a <- update(fit3, formula=. ~ . + poly(days,4):treat)
>
> anova(fit3a, test="wald")
Model 1 : size ~ 1
Model 2 : size ~ poly(days, 4)
Model 3 : size ~ poly(days, 4) + treat
Model 4 : size ~ poly(days, 4) + treat + poly(days, 4):treat
Chi df Pr(>Chi)
1 vs 2 1931.9813 4 < 2e-16 ***
2 vs 3 4.0597 1 0.04392 *
3 vs 4 3.6641 4 0.45336
>
> anova(fit3a, test="score")
Model 1 : size ~ 1
Model 2 : size ~ poly(days, 4)
Model 3 : size ~ poly(days, 4) + treat
Model 4 : size ~ poly(days, 4) + treat + poly(days, 4):treat
Chi df Pr(>Chi)
1 vs 2 61.3028 4 1.544e-12 ***
2 vs 3 3.3687 1 0.06645 .
3 vs 4 3.4665 4 0.48300
408 / 430
Treatment of severe postnatal depression
As described earlier, this dataset (object depression of the library glmtoolbox; Gregoire et al. [1996]) comes from a placebo-controlled study on the efficacy of oestrogen for the treatment of severe postnatal depression: 61 women (subj) were randomized to a placebo patch or an oestrogen patch (group), their EPDS scores (dep) were collected monthly for 6 months (visit), and the binary response (depressd) is 1 for severe depression (EPDS ≥ 11) and 0 otherwise. The data are analyzed by using the following model:
$$\text{Var}(Y_{ij}) = \mu_{ij}(1-\mu_{ij}),\qquad \text{Corr}(Y_{ij}, Y_{ik}) = \rho^{|j-k|},\qquad \text{logit}(\mu_{ij}) = \beta_0 + \beta_1 x_{1ij} + \beta_2 x_{2ij},$$
where
$y_{ij}$ = $j$-th measurement of the response performed on the $i$-th woman;
$x_{1ij}$ = time, in months, from the start of the therapy until the $j$-th measurement on the $i$-th woman was performed;
$x_{2ij}$ = 1 if the $i$-th woman was treated with the oestrogen patch, 0 if with the placebo patch.
409 / 430
Treatment of severe postnatal depression
[Figure: percentage of women with depression at each visit, by treatment (placebo vs. oestrogen); same plot as shown earlier.]
410 / 430
Treatment of severe postnatal depression
> d2 <- aggregate(depressd ~ visit + group,mean,data=depression)
> ggplot(d2,aes(x=factor(visit),fill=group,weight=100*depressd)) +
+ geom_bar(position=position_dodge(), width=0.7) +
+ labs(x="Months since the therapy began",
+ y="% of women with depression", fill="Treatment") +
+ scale_fill_manual(values=c("yellow", "blue")) +
+ theme(legend.position="top", legend.direction="horizontal")
>
> fit1 <- glmgee(depressd ~ visit + group,family=binomial(logit),
+ id=subj, corstr="Exchangeable", data=depression)
> fit2 <- update(fit1, corstr="AR-M-dependent(1)")
> fit3 <- update(fit1, corstr="AR-M-dependent(2)")
>
> a <- AGPC(fit1, fit2, fit3, verbose=FALSE)
> b <- SGPC(fit1, fit2, fit3, verbose=FALSE)
>
> cbind(a, SGPC=b[,"SGPC"])
Object Correlation GPL Parameters AGPC SGPC
fit1 Exchangeable 239.9647 4 247.9647 256.4082
fit2 AR-M-dependent(1) 226.4696 4 234.4696 242.9131
fit3 AR-M-dependent(2) 224.9438 5 234.9438 245.4982
411 / 430
Treatment of severe postnatal depression
> summary(fit2)
Number of observations: 356
Number of clusters: 61
Min 25% 50% 75% Max
Cluster sizes: 2 4 7 7 7
*************************************************************
Estimate Std.Error z-value Pr(>|z|)
(Intercept) 3.23604 0.51842 6.24218 4.3152e-10
visit -0.62632 0.07477 -8.37681 < 2.22e-16
groupestrogen -1.77723 0.54578 -3.25631 0.0011287
Dispersion 1.02842
*************************************************************
[1] [2] [3] [4] [5] [6] [7]
[1] 1.000 0.513 0.263 0.135 0.069 0.036 0.018
[2] 0.513 1.000 0.513 0.263 0.135 0.069 0.036
[3] 0.263 0.513 1.000 0.513 0.263 0.135 0.069
[4] 0.135 0.263 0.513 1.000 0.513 0.263 0.135
[5] 0.069 0.135 0.263 0.513 1.000 0.513 0.263
[6] 0.036 0.069 0.135 0.263 0.513 1.000 0.513
[7] 0.018 0.036 0.069 0.135 0.263 0.513 1.000
(“Dispersion” is the estimate $\hat{\phi}$, and the printed matrix is the fitted working correlation $R(\hat{\rho})$ under the AR-1 structure; note that $0.513^2 \approx 0.263$.)
412 / 430
Treatment of severe postnatal depression
> fit2a <- update(fit2, formula=. ~ . + visit:group)
>
> anova(fit2a, test="wald")
Model 1 : depressd ~ 1
Model 2 : depressd ~ visit
Model 3 : depressd ~ visit + group
Model 4 : depressd ~ visit + group + visit:group
Chi df Pr(>Chi)
1 vs 2 88.1275 1 < 2.2e-16 ***
2 vs 3 10.6036 1 0.001129 **
3 vs 4 2.2104 1 0.137082
>
> anova(fit2a, test="score")
Model 1 : depressd ~ 1
Model 2 : depressd ~ visit
Model 3 : depressd ~ visit + group
Model 4 : depressd ~ visit + group + visit:group
Chi df Pr(>Chi)
1 vs 2 39.9226 1 2.642e-10 ***
2 vs 3 10.9208 1 0.0009509 ***
3 vs 4 2.3977 1 0.1215150
413 / 430
Appendix
414 / 430
A. Linear algebra
(1) If $A$ is an idempotent matrix, that is, if $A^{2} = A$, then $\text{trace}(A) = \text{rank}(A)$, where $\text{trace}(A)$ and $\text{rank}(A)$ correspond to the trace and rank of $A$, respectively (see, for instance, Gentle [2007, page 281]).
(2) If $A$ is an $m \times n$ matrix and $B$ is an $n \times m$ matrix, then $\text{trace}(AB) = \text{trace}(BA)$ (see, for example, Gentle [2007, page 67]).
(3) If $A$ is a non-singular matrix, and $u$ and $v$ are column vectors, then
$$(A - uv^{\top})^{-1} = A^{-1} + \frac{(A^{-1}u)(v^{\top}A^{-1})}{1 - v^{\top}A^{-1}u}$$
(see, for instance, Gentle [2007, page 220]).
If $A$ is a symmetric and positive definite matrix, then
(4) the square root of $A$, denoted here by $A^{\frac12}$ and such that $A^{\frac12}A^{\frac12} = A$, exists, is unique, is symmetric, and is also positive definite (see, for instance, Gentle [2007, section 3.8.8]);
(5) $A^{-1}$ exists, is symmetric, and is also positive definite (see, for instance, Gentle [2007, section 3.8.8]);
(6) $B^{\top}AB$ is also a positive definite matrix if and only if $B$ has full column rank (that is, if the columns of $B$ are linearly independent) (see, for instance, Gentle [2007, page 89]).
415 / 430
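Identity (3), a Sherman–Morrison-type formula, can be verified numerically; a quick sketch:

# Numerical check of identity (3): (A - u v')^{-1} via Sherman-Morrison.
set.seed(3)
q <- 4
A <- crossprod(matrix(rnorm(q * q), q))  # a non-singular (here s.p.d.) matrix
u <- rnorm(q); v <- rnorm(q)
Ainv <- solve(A)
lhs <- solve(A - tcrossprod(u, v))
rhs <- Ainv + (Ainv %*% u) %*% (t(v) %*% Ainv) / drop(1 - t(v) %*% Ainv %*% u)
max(abs(lhs - rhs))                      # ~ machine precision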
B. Derivatives
If $x$ and $b$ are $q \times 1$ matrices and $A$ is a $q \times q$ matrix, then (see, for instance, Gentle [2007, section 4.2])
(1) $\dfrac{\partial h(x)}{\partial x} = b$ provided that $h(x) = b^{\top}x$.
(2) $\dfrac{\partial h(x)}{\partial x} = (A + A^{\top})x$ provided that $h(x) = x^{\top}Ax$.
(3) $\dfrac{\partial h(x)}{\partial x^{\top}} = A$ provided that $h(x) = Ax$.
(4) $\dfrac{\partial h(x)}{\partial x} = -h(x)\,Ax$ provided that $h(x) = \exp\left(-\tfrac12 x^{\top}Ax\right)$ and $A$ is a symmetric matrix.
(5) $\dfrac{\partial h(A)}{\partial A} = I_q$ provided that $h(A) = \text{trace}(A)$.
(6) $\dfrac{\partial h(A)}{\partial A} = |A|\,A^{-\top}$ provided that $h(A) = |A|$ and $A$ is a non-singular matrix.
(7) $\dfrac{\partial h(A)}{\partial A} = A^{-\top}$ provided that $h(A) = \log|A|$ and $|A| > 0$.
416 / 430
C. Taylor series
See, for example, Marsden and Tromba [2012, section 3.2].
(1) If $h: D \subseteq \mathbb{R} \to \mathbb{R}$ is a function with $(r+1)$ continuous derivatives at $x_0 \in D$, then
$$h(x) = \sum_{j=0}^{r}\frac{(x - x_0)^{j}}{j!}\,h^{(j)}(x_0) + R_r(x, x_0),$$
where $h^{(j)}(x)$ is the $j$-th derivative of $h(x)$ and $R_r(x, x_0)$ is the remainder, given by
$$R_r(x, x_0) = \int_{x_0}^{x}\frac{(x - t)^{r}}{r!}\,h^{(r+1)}(t)\,dt.$$
The remainder $R_r(x, x_0)$ is such that
$$\frac{R_r(x, x_0)}{(x - x_0)^{r}} \to 0\quad\text{as } x \to x_0.$$
417 / 430
C. Taylor series
(2) If $h: D \subseteq \mathbb{R}^{q} \to \mathbb{R}$ is a function with continuous second-order partial derivatives at $x_0 \in D$, then
$$h(x) = h(x_0) + (x - x_0)^{\top}\left[\frac{\partial h(x)}{\partial x}\bigg|_{x = x_0}\right] + R_1(x, x_0),$$
where $R_1(x, x_0)/\|x - x_0\| \to 0$ as $x \to x_0$.
(3) If $h: D \subseteq \mathbb{R}^{q} \to \mathbb{R}$ is a function with continuous third-order partial derivatives at $x_0 \in D$, then
$$h(x) = h(x_0) + (x - x_0)^{\top}\left[\frac{\partial h(x)}{\partial x}\bigg|_{x = x_0}\right] + \frac12(x - x_0)^{\top}\left[\frac{\partial^{2} h(x)}{\partial x\,\partial x^{\top}}\bigg|_{x = x_0}\right](x - x_0) + R_2(x, x_0),$$
where $R_2(x, x_0)/\|x - x_0\|^{2} \to 0$ as $x \to x_0$.
418 / 430
D. Probability
If $Z \sim \text{Normal}(0,1)$, $V \sim \chi^{2}(m)$ and $W \sim \chi^{2}(r)$, then (see, for instance, Searle [1971, chapter 2])
(1) $Z^{2} \sim \chi^{2}(1)$.
(2) $\dfrac{Z}{\sqrt{V/m}} \sim \text{Student-}t(m)$ provided that $Z$ and $V$ are independent, where $\text{Student-}t(m)$ represents a random variable following the Student-$t$ distribution with $m$ degrees of freedom.
(3) $\dfrac{V/m}{W/r} \sim F(m, r)$ provided that $V$ and $W$ are independent, where $F(m, r)$ represents a random variable following the $F$-Snedecor distribution with $m$ and $r$ degrees of freedom.
(4) $\dfrac{W/r}{V/m} \sim F(r, m)$ provided that $V$ and $W$ are independent.
(5) $[\text{Student-}t(m)]^{2} \sim F(1, m)$.
419 / 430
E. Multivariate normal distribution
If $Z \sim \text{Normal}_m(\mu, \Sigma)$, where $\Sigma$ is a positive definite matrix, then (see, for instance, Searle [1971, chapter 2])
(1) The probability density function of $Z$ may be written as
$$f_Z(z; \mu, \Sigma) = (2\pi)^{-\frac{m}{2}}\,|\Sigma|^{-\frac12}\exp\left\{-\frac12(z - \mu)^{\top}\Sigma^{-1}(z - \mu)\right\},\quad z \in \mathbb{R}^{m}.$$
(2) The moment generating function of $Z$ is
$$M(t; \mu, \Sigma) = \exp\left(t^{\top}\mu + \frac12\,t^{\top}\Sigma\,t\right),\quad t = (t_1,\ldots,t_m)^{\top}.$$
(3) $a + AZ \sim \text{Normal}_q(a + A\mu, A\Sigma A^{\top})$, where $a$ is a $q \times 1$ matrix and $A$ is a $q \times m$ matrix of full row rank.
(4) $Z^{\top}AZ \sim \chi^{2}(\text{rank}(A))$ provided that $A$ is symmetric, $A\Sigma$ is idempotent, and $\mu^{\top}A\mu = 0$.
(5) $Z^{\top}AZ$ and $BZ$ are statistically independent provided that $B\Sigma A = 0$, where $A$ is a symmetric matrix.
420 / 430
F. Nonlinear optimization
See, for example, Sun and Yuan [2006, section 1.4].
(1) If $h: D \subseteq \mathbb{R}^{q} \to \mathbb{R}$ is a twice continuously differentiable function, $D$ is an open set,
$$\frac{\partial h(x)}{\partial x}\bigg|_{x = x^{*}} = 0,$$
and $\dfrac{\partial^{2} h(x)}{\partial x\,\partial x^{\top}}\Big|_{x = x^{*}}$ is a positive (negative) definite Hessian matrix, then $x^{*}$ is a local minimum (maximum) of the function $h(x)$ in $D$.
(2) If $h: D \subseteq \mathbb{R}^{q} \to \mathbb{R}$ is a twice differentiable function, $D$ is an open set, and $\dfrac{\partial^{2} h(x)}{\partial x\,\partial x^{\top}}$ is a positive (negative) definite Hessian matrix for all $x \in D$, then $h(x)$ is a strictly convex (concave) function in $D$.
(3) If $h: D \subseteq \mathbb{R}^{q} \to \mathbb{R}$ is a twice differentiable function, $D$ is an open set,
$$\frac{\partial h(x)}{\partial x}\bigg|_{x = x^{*}} = 0,$$
and $h(x)$ is a strictly convex (concave) function, then $x^{*}$ is the unique global minimum (maximum) of $h(x)$ in $D$.
421 / 430
References I
R.D. Cook and S. Weisberg. Residuals and Influence in Regression.
Chapman & Hall/CRC Monographs on Statistics & Applied
Probability. Taylor & Francis, 1982.
D.M. Allen. Mean square error of prediction as a criterion for
selecting variables. Technometrics, 13(3):469–475, 1971.
I.M. Alcantara, J. Naranjo, and Y. Lang. Model selection using the PRESS statistic. Computational Statistics, 38(1):285–298, 2023.
H. Akaike. A new look at the statistical model identification.
IEEE Transactions on Automatic Control, 19(6):716–723, 1974.
Gideon Schwarz. Estimating the dimension of a model. The
Annals of Statistics, 6(2):461–464, 1978.
G. James, D. Witten, T. Hastie, and R. Tibshirani. An Introduction to Statistical Learning: with Applications in R. Springer Texts in Statistics. Springer, 2013.
422 / 430
References II
A.C. Atkinson. Two graphical displays for outlying and influen-
tial observations in regression. Biometrika, 68(1):13–20, 1981.
R.D. Cook. Detection of influential observation in linear regres-
sion. Technometrics, 19(1):15–18, 1977.
T.S. Breusch and A.R. Pagan. A simple test for heteroscedasticity and random coefficient variation. Econometrica, 47(5):1287–1294, 1979.
R.D. Cook and S. Weisberg. Diagnostics for heteroscedasticity in
regression. Biometrika, 70(1):1–10, 1983.
G.E.P. Box and P.W. Tidwell. Transformation of the independent
variables. Technometrics, 4(4):531–550, 1962.
H. Zhang and D. Zelterman. Binary regression for risks in ex-
cess of subject-specific thresholds. Biometrics, 55(4):1247–1251,
1999.
423 / 430
References III
P.S. Hewlett and R.L. Plackett. Statistical aspects of the independent joint action of poisons, particularly insecticides. II. Examination of data for agreement with hypothesis. Annals of Applied Biology, 37:527–552, 1950.
D.W. Hosmer, S. Lemeshow, and R.X. Sturdivant. Applied Logistic
Regression. Wiley Series in Probability and Statistics. Wiley,
2013.
D.G. Kleinbaum, L.L. Kupper, A. Nizam, and E.S. Rosenberg.
Applied Regression Analysis and Other Multivariable Methods.
Cengage Learning, 2013.
J.I. McCool. Confidence limits for Weibull regression with censored data. IEEE Transactions on Reliability, 29(2):145–150, 1980.
B. Jorgensen. Exponential dispersion models. Journal of the Royal Statistical Society. Series B (Methodological), 49(2):127–162, 1987.
424 / 430
References IV
P. McCullagh and J.A. Nelder. Generalized Linear Models, Second
Edition. Chapman and Hall/CRC Monographs on Statistics
and Applied Probability Series. Chapman & Hall, 1989.
A. Agresti. Foundations of Linear and Generalized Linear Models.
Wiley Series in Probability and Statistics. Wiley, 2015.
L. Fahrmeir and H. Kaufmann. Consistency and asymptotic
normality of the maximum likelihood estimator in generalized
linear models. The Annals of Statistics, 13(1):342–368, 1985.
A. Buse. The likelihood ratio, Wald, and Lagrange multiplier tests: An expository note. The American Statistician, 36(3a):153–157, 1982.
G. Terrell. The gradient statistic. Computing Science and Statistics, 34:206–215, 2002.
J.D. Gibbons and S. Chakraborti. Nonparametric Statistical Infer-
ence (5th ed.). Chapman and Hall/CRC, 2010.
425 / 430
References V
F. Habibzadeh, P. Habibzadeh, and M. Yadollahie. On determining the most appropriate test cut-off value: the case of tests with continuous results. Biochemia Medica, 26(3):297–307, 2016.
F.S. Nahm. Receiver operating characteristic curve: overview
and practical use for clinicians. Korean Journal of Anesthesiology,
75(1):25–36, 2022.
D.A. Pierce and D.W. Schafer. Residuals in generalized linear
models. Journal of the American Statistical Association, 81(396):
977–986, 1986.
A.C. Davison and A. Gigli. Deviance residuals and normal scores
plots. Biometrika, 76(2):211–221, 1989.
P.K. Dunn and G.K. Smyth. Randomized quantile residuals.
Journal of Computational and Graphical Statistics, 5(3):236–244,
1996.
426 / 430
References VI
D. Pregibon. Logistic regression diagnostics. The Annals of Statis-
tics, 9(4):705–724, 1981.
B.-C. Wei, J.-Q. Shi, W.-K. Fung, and Y.-Q. Hu. Testing for varying
dispersion in exponential family nonlinear models. Annals of
the Institute of Statistical Mathematics, 50:277–294, 1998.
N.H. Prater. Estimate gasoline yields from crudes. Petroleum Refiner, 35(5):236–238, 1956.
S. Weisberg. Applied Linear Regression. Wiley Series in Probability and Statistics. Wiley, 2005.
D.J. Hand, F. Daly, K. McConway, D. Lunn, and E. Ostrowski. A
Handbook of Small Data Sets. Chapman & Hall Statistics Texts.
Taylor & Francis, 1993.
J.S. Long. The origins of sex differences in science. Social Forces, 68(4):1297–1316, 1990.
427 / 430
References VII
W.W. Piegorsch, C.R. Weinberg, and B.H. Margolin. Exploring
simple independent action in multifactor tables of propor-
tions. Biometrics, 44(2):595–603, 1988.
J.G. Morel and N.K. Neerchal. Clustered binary logistic regression in teratology data using a finite mixture distribution. Statistics in Medicine, 16(24):2843–2853, 1997.
P.J. Diggle, P. Heagerty, K.-Y. Liang, and S.L. Zeger. Analysis of Longitudinal Data. Oxford University Press, 2002.
A.J.P. Gregoire, R. Kumar, B. Everitt, and J.W.W. Studd. Trans-
dermal oestrogen for treatment of severe postnatal depression.
The Lancet, 347:930–933, 1996.
K.Y. Liang and S.L. Zeger. Longitudinal data analysis using
generalized linear models. Biometrika, 73:13–22, 1986.
428 / 430
References VIII
L.-Y. Hin and Y.-G. Wang. Working-correlation-structure identification in generalized estimating equations. Statistics in Medicine, 28(4):642–658, 2009.
V.J. Carey and Y.-G. Wang. Working covariance model selection
for generalized estimating equations. Statistics in Medicine, 30
(26):3117–3124, 2011.
X. Zhu and Z. Zhu. Comparison of criteria to select working correlation matrix in generalized estimating equations. Chinese Journal of Applied Probability and Statistics, 29(5):515–530, 2013.
L. Fu, Y. Hao, and Y.-G. Wang. Working correlation structure selection in generalized estimating equations. Computational Statistics, 33:983–996, 2018.
A. Rotnitzky and N.P. Jewell. Hypothesis testing of regression parameters in semiparametric generalized linear models for cluster correlated data. Biometrika, 77(3):485–497, 1990.
429 / 430
References IX
D.D. Boos. On generalized score tests. The American Statistician, 46(4):327–333, 1992.
J.E. Gentle. Matrix Algebra: Theory, Computations, and Applications
in Statistics. Springer Texts in Statistics. Springer New York,
2007.
J.E. Marsden and A.J. Tromba. Cálculo vectorial, Sixth Edition. W.H. Freeman and Company, 2012.
S.R. Searle. Linear Models. Wiley Classics Library. John Wiley & Sons, 1971.
W. Sun and Y.-X. Yuan. Optimization Theory and Methods. Nonlin-
ear Programming. Springer Optimization and Its Applications.
Springer, 2006.
430 / 430