Generalized Linear Models and some of their
extensions using the library glmtoolbox in R
Luis Hernando Vanegas and Luz Marina Rondón
Universidad Nacional de Colombia
Sede Bogotá
Facultad de Ciencias
Departamento de Estadística
2024
“All models are wrong, but some are useful”
George E. P. Box, 1979
Outline

1. Normal Linear Models
   Motivation: Richness of the species; Gasoline yield
   Model Formulation
   Parameter Estimation: Ordinary Least Squares Estimator; Maximum Likelihood Estimator
   Goodness-of-fit: R-squared and adjusted R-squared; Predicted R-squared; Generalized Information Criterion
   Statistical Inference: Confidence intervals; Hypothesis testing
   Examples: Richness of the species; Gasoline yield
   Exercises
   Variable selection: Hybrid forward stepwise; Hybrid backward stepwise; Best subset selection
   Diagnostic methods: Residual analysis; Normal probability plot with simulated envelope; Influence or sensitivity analysis; Leverage points; Testing the constant variance assumption; Collinearity; Box-Tidwell transformation
   Lognormal model
   Example 3: Gasoline yield
   Example 4: Lung capacity

2. Generalized Linear Models
   Motivation: Liver cancer in female mice; Insecticides; Burn injury study; Dilution assay; Skin cancer in women; Advertising; Atlantic Croakers; Hardened steel; Gasoline yield
   Model formulation: Linear exponential family; Binomial distribution; Poisson distribution; Normal distribution; Gamma distribution; Inverse-Gaussian distribution; Exercises
   Parameter estimation: Maximum likelihood (ML); Deviance; Newton-Raphson algorithm; Fisher's Scoring algorithm; Canonical link functions
   Goodness-of-fit: R-squared and adjusted R-squared; Generalized Information Criterion
   Statistical inference: Asymptotic theory; Hypothesis testing; Confidence intervals; Estimation of the dispersion parameter
   Parameter interpretation: Binomial regression; Poisson regression; Normal regression; Gamma regression; Lognormal regression
   Genesis of link functions in binary response models
   Receiver Operating Characteristic Curve
   Variable selection: Hybrid forward stepwise; Hybrid backward stepwise; Best subset selection; Exercises
   Diagnostic methods: Residual analysis; Normal probability plot with simulated envelope; Influence or sensitivity analysis; Testing the constant dispersion parameter assumption; Box-Tidwell transformation
   Quasi-likelihood models

3. Overdispersion in Poisson models
   Motivation: Self-diagnosed ear infections in swimmers; The Origins of Sex Differences in Science
   Negative Binomial I model; Negative Binomial model; Exercises

4. Overdispersion in binomial models
   Motivation: Cell differentiation; Teratogenic effects of PHT and TCPO
   Beta-Binomial model; Random-clumped Binomial model; Exercises

5. Excess of zeros in count data
   Zero-Inflated Poisson model; Zero-Modified Poisson model; Zero-Inflated Negative Binomial model; Zero-Modified Negative Binomial model
   Example: Males attached to female horseshoe crabs; Exercises

6. Cluster Correlated Data
   Motivation: Effect of the ozone on the tree growth; Treatment of severe postnatal depression
   Generalized Estimating Equations
   Example 1: Effect of the ozone on the tree growth

7. Appendix: A. Linear algebra; B. Derivatives; C. Taylor series; D. Probability; E. Multivariate normal distribution; F. Nonlinear optimization

8. References
Normal Linear Models
Richness of the species

The analysis of these data, available in the object richness of the library glmtoolbox, aims to explain the differences between plots as related to species richness (represented by a count of the number of plant species (richness)) using the differences between plots with respect to biomasses (Biomass) and soil pH levels: low (pH="low"), medium (pH="mid") and high (pH="high").

[Figure: Number of plant species versus Biomass, with points distinguished by pH level (low, mid, high).]
Gasoline yield

The analysis of the data in the object Auto of the library ISLR is aimed at explaining the differences between the vehicles with respect to their gasoline yield (mpg), in miles per gallon, using other vehicle characteristics such as their horsepower (horsepower), their weight (weight), in kilograms, and their origin (origin: America, Europe or Japan).

[Figure: Gasoline yield versus Horsepower, with points distinguished by weight level (low, mid, high).]
Normal Linear Model

Let $y_i$ for $i = 1, \ldots, n$ be the interest variable measured on $n$ individuals, whose values are assumed to be realizations of independent random variables, denoted here by $Y_i$ for $i = 1, \ldots, n$. The random variables $Y_i$ for $i = 1, \ldots, n$ are assumed to satisfy the following condition:

$$Y_i = \mu_i + \epsilon_i,$$

where $\epsilon_i \overset{\text{ind}}{\sim} \text{Normal}(0, \sigma^2)$ are random errors and $\mu_i = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip} = x_i^\top\beta$, in which $x_i = (1, x_{i1}, \ldots, x_{ip})^\top$ includes the (known) values of $p$ explanatory variables, quantitative and/or qualitative, measured on the $i$-th individual, and $\beta = (\beta_0, \ldots, \beta_p)^\top$ is the (unknown) vector of interest parameters. Thus, the observed value of the interest variable, $y$, is assumed to be the result of a "signal", represented by $\mu$, that is additively contaminated by a normally distributed random "noise", represented by the random error $\epsilon$.

Goal: to explain the differences between $y_1, \ldots, y_n$ by using the differences between $x_1, \ldots, x_n$.
Normal Linear Model ($p = 1$ and $\sigma^2 = 0.4$)

[Figure: simulated responses $y$ scattered around the line $\mu = 1.5 + 1.4\,x$ at several values of $x$ between 0 and 10, illustrating the normal distribution of $Y$ around the "signal" at each value of $x$.]
Normal Linear Model

The above model may be summarized as follows:

$$Y_i = \mu_i + \epsilon_i, \qquad \mu_i = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip} = x_i^\top\beta, \qquad \epsilon_i \overset{\text{ind}}{\sim} \text{Normal}(0, \sigma^2) \;\Longrightarrow\; Y_i \overset{\text{ind}}{\sim} \text{Normal}(\mu_i, \sigma^2)$$

or, equivalently,

$$Y_i = \mu_i + \epsilon_i, \qquad \mu_i = x_i^\top\beta, \qquad \epsilon \sim \text{Normal}_n(0, \sigma^2 I) \;\Longrightarrow\; Y \sim \text{Normal}_n(\mu, \sigma^2 I),$$

where $Y = (Y_1, \ldots, Y_n)^\top$, $\mu = (\mu_1, \ldots, \mu_n)^\top = X\beta$, $\epsilon = (\epsilon_1, \ldots, \epsilon_n)^\top$ and $X = (x_1, \ldots, x_n)^\top$. In matrix form, the model becomes

$$Y = \mu + \epsilon.$$

This model is known as the normal linear model because: (i) the "signal" ($\mu_i$) can be written as a linear combination of the interest parameters ($\beta_0, \beta_1, \ldots, \beta_p$); and (ii) the random "noise" ($\epsilon_i$) is assumed to follow the normal or Gaussian distribution.
Ordinary Least Squares (OLS)

Parameter estimation involves selecting a vector from the space $\mathbb{R}^{p+1}$, denoted here by $\hat\beta$, such that the linear combinations given by $X\hat\beta$ provide the "best" representation of the vector of observed responses $y = (y_1, \ldots, y_n)^\top$. Based on intuition, $\hat\beta$ may be selected as the $\beta \in \mathbb{R}^{p+1}$ such that

$$Q_r(\beta) = \frac{1}{r}\sum_{i=1}^{n}\left|\frac{y_i - x_i^\top\beta}{\sigma}\right|^{r},$$

for some $r > 0$, is minimum.

If the partial derivatives of $Q_r(\beta)$ are continuous, then the minimum of $Q_r(\beta)$ can be obtained by solving the following equations:

$$\begin{pmatrix} \partial Q_r(\beta)/\partial\beta_0 \\ \partial Q_r(\beta)/\partial\beta_1 \\ \vdots \\ \partial Q_r(\beta)/\partial\beta_p \end{pmatrix} = -\frac{1}{\sigma^r}\begin{pmatrix} \sum_{i=1}^{n}(y_i - x_i^\top\beta)^{r-1} \\ \sum_{i=1}^{n}x_{i1}(y_i - x_i^\top\beta)^{r-1} \\ \vdots \\ \sum_{i=1}^{n}x_{ip}(y_i - x_i^\top\beta)^{r-1} \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \\ \vdots \\ 0 \end{pmatrix}$$

If $r > 1$, then the partial derivatives of $Q_r(\beta)$ are continuous.
Ordinary Least Squares (OLS)

If $r = 2$, then the above system of equations is linear and $\hat\beta$ becomes the Ordinary Least Squares (OLS) estimator, given by

$$\hat\beta = \underset{\beta\,\in\,\mathbb{R}^{p+1}}{\text{argmin}}\; Q_2(\beta),$$

where

$$Q_2(\beta) = \frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - x_i^\top\beta)^2 = \frac{1}{2\sigma^2}(y - X\beta)^\top(y - X\beta) = \frac{1}{2\sigma^2}\left[y^\top y - 2y^\top X\beta + \beta^\top X^\top X\beta\right].$$

The gradient vector of $Q_2(\beta)$ is given by (see Appendix B-1,2)

$$U(\beta) = \begin{pmatrix} \partial Q_2(\beta)/\partial\beta_0 \\ \partial Q_2(\beta)/\partial\beta_1 \\ \vdots \\ \partial Q_2(\beta)/\partial\beta_p \end{pmatrix} = -\frac{1}{\sigma^2}\begin{pmatrix} \sum_{i=1}^{n}(y_i - x_i^\top\beta) \\ \sum_{i=1}^{n}x_{i1}(y_i - x_i^\top\beta) \\ \vdots \\ \sum_{i=1}^{n}x_{ip}(y_i - x_i^\top\beta) \end{pmatrix} = -\frac{1}{\sigma^2}X^\top(y - X\beta)$$
Ordinary Least Squares (OLS)

[Figure: scatterplot of the points $(x_i, y_i)$ with the fitted line $\hat\mu = \hat\beta_0 + \hat\beta_1 x$ and dotted vertical segments joining each point to the line.]

The OLS estimator selects the straight line ($\hat\beta_0 + \hat\beta_1 x$) that minimizes the sum of the squares of the dotted segment lengths.
Ordinary Least Squares (OLS)

The Hessian matrix of $Q_2(\beta)$ is given by (see Appendix B-3)

$$J(\beta) = \begin{pmatrix} \frac{\partial^2 Q_2(\beta)}{\partial\beta_0^2} & \frac{\partial^2 Q_2(\beta)}{\partial\beta_0\partial\beta_1} & \cdots & \frac{\partial^2 Q_2(\beta)}{\partial\beta_0\partial\beta_p} \\ \frac{\partial^2 Q_2(\beta)}{\partial\beta_1\partial\beta_0} & \frac{\partial^2 Q_2(\beta)}{\partial\beta_1^2} & \cdots & \frac{\partial^2 Q_2(\beta)}{\partial\beta_1\partial\beta_p} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 Q_2(\beta)}{\partial\beta_p\partial\beta_0} & \frac{\partial^2 Q_2(\beta)}{\partial\beta_p\partial\beta_1} & \cdots & \frac{\partial^2 Q_2(\beta)}{\partial\beta_p^2} \end{pmatrix} = \frac{1}{\sigma^2}X^\top X$$

If $X$ has linearly independent columns, then $J(\beta)$ is a positive definite and nonsingular matrix (see Appendix A-6), implying that $Q_2(\beta)$ is a strictly convex function (see Appendix F-2). Therefore, if the system of equations $U(\beta) = 0$ has a solution, then it is the unique global minimum of $Q_2(\beta)$. Then, the OLS estimator of $\beta$ reduces to the solution of $U(\hat\beta) = -\frac{1}{\sigma^2}X^\top(y - X\hat\beta) = 0$, that is,

$$\hat\beta = (X^\top X)^{-1}X^\top y.$$

A necessary condition for the columns of $X$ to be linearly independent is $p + 1 \leq n$, that is, $X$ has no more columns than rows.
Ordinary Least Squares (OLS)

In conclusion, if the columns of $X$ are linearly independent, then the OLS estimator of $\beta$ satisfies the following:

(1) It always exists.
(2) It is unique.
(3) It has closed form, given by $\hat\beta = (X^\top X)^{-1}X^\top y$.
(4) It is linear with respect to $y$, that is, it may be written as $Ay$, where $A$ is a $(p+1)\times n$ matrix which does not depend on $y$. Therefore, the $(j+1)$-th element of $\hat\beta$, denoted here by $\hat\beta_j$, may be written as $\hat\beta_j = a_j^\top y = a_{j1}y_1 + \ldots + a_{jn}y_n$, where $a_j = (a_{j1}, \ldots, a_{jn})^\top$ represents the $(j+1)$-th row of $A$.
(5) The estimation of the mean vector $\mu = X\beta$ is also linear with respect to $y$, that is, it may be written as $\hat\mu = X\hat\beta = Hy$, where $H = X(X^\top X)^{-1}X^\top$ is the orthogonal projection matrix. Therefore, $\hat\mu_i = x_i^\top\hat\beta = h_i^\top y = h_{i1}y_1 + \ldots + h_{in}y_n$, where $h_i = (h_{i1}, \ldots, h_{in})^\top$ represents the $i$-th row of $H$.
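A minimal R sketch of the closed-form OLS estimator and the projection matrix, checked against lm() on the richness data used later in these notes:

# Closed-form OLS, assuming the glmtoolbox data 'richness' is available
library(glmtoolbox)
data(richness)
X <- model.matrix(~ Biomass + pH, data = richness)   # model matrix
y <- richness$Species
beta.hat <- solve(crossprod(X), crossprod(X, y))     # (X'X)^{-1} X'y
H <- X %*% solve(crossprod(X)) %*% t(X)              # orthogonal projection matrix
mu.hat <- H %*% y                                    # fitted values Hy
fit <- lm(Species ~ Biomass + pH, data = richness)
cbind(beta.hat, coef(fit))                           # should coincide
all.equal(as.vector(mu.hat), as.vector(fitted(fit)))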
Exercises

Consider the following simple linear model:

$$Y_i = \mu_i + \epsilon_i, \qquad \mu_i = \beta_0 + \beta_1 x_i, \qquad \epsilon_i \overset{\text{ind}}{\sim} \text{Normal}(0, \sigma^2).$$

Show that

$$\hat\beta_0 = \bar{y} - \hat\beta_1\bar{x} \qquad\text{and}\qquad \hat\beta_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2} = \frac{S_y}{S_x}\,r_{xy},$$

where $(n-1)S_x^2 = \sum_{i=1}^{n}(x_i - \bar{x})^2$, $n\bar{x} = \sum_{i=1}^{n}x_i$, $(n-1)S_y^2 = \sum_{i=1}^{n}(y_i - \bar{y})^2$, $n\bar{y} = \sum_{i=1}^{n}y_i$, and $r_{xy}$ is the Pearson linear correlation coefficient.
Orthogonal projection matrix (H)

[Figure: scatterplot of the points $(x_i, y_i)$ together with their projections $(x_i, \hat\beta_0 + \hat\beta_1 x_i)$ on the fitted line.]

The matrix $H$ orthogonally projects $y_1, \ldots, y_n$ onto the line, plane, or hyperplane given by $\hat\mu = \hat\beta_0 + \hat\beta_1 x_1 + \hat\beta_2 x_2 + \ldots + \hat\beta_p x_p$.
Orthogonal projection matrix (H)

$h_{ij} = x_i^\top(X^\top X)^{-1}x_j$.

- $H$ is a symmetric matrix, that is, $H^\top = H$, so the $i$-th row and the $i$-th column of $H$ coincide. Therefore, $h_{ij} = h_{ji}$ for all $i, j = 1, \ldots, n$.
- $H$ is an idempotent matrix, that is, $H^2 = H$, so $h_{ij} = h_i^\top h_j$ for all $i, j = 1, \ldots, n$. For example,
$$h_{ii} = h_i^\top h_i = \sum_{j=1}^{n}h_{ij}^2 = h_{ii}^2 + \sum_{j\,:\,j\neq i}h_{ij}^2 \;\Longrightarrow\; h_{ii} - h_{ii}^2 = h_{ii}(1 - h_{ii}) = \sum_{j\,:\,j\neq i}h_{ij}^2 \geq 0 \;\Longrightarrow\; h_{ii} \in [0, 1].$$
- $\dfrac{\partial\hat\mu_i}{\partial y_j} = h_{ij}$ and $\dfrac{\partial\hat\mu_j}{\partial y_i} = h_{ji}$. Therefore, $\dfrac{\partial\hat\mu_i}{\partial y_j} = \dfrac{\partial\hat\mu_j}{\partial y_i}$.
- $\text{rank}(H) = \text{trace}(H) = p + 1$ (see Appendix A-1,2). Consequently, if $p + 1 < n$ then $H$ is a singular matrix.
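A quick numerical check of these properties of $H$ (a sketch, reusing the richness model matrix from before):

X <- model.matrix(lm(Species ~ Biomass + pH, data = richness))
H <- X %*% solve(crossprod(X)) %*% t(X)
all.equal(H, t(H))                            # symmetry: H' = H
all.equal(H, H %*% H)                         # idempotency: H^2 = H
c(trace = sum(diag(H)), p.plus.1 = ncol(X))   # trace(H) = p + 1
range(diag(H))                                # each h_ii lies in [0, 1]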
Maximum Likelihood (ML)

This method consists of maximizing, with respect to $\beta$, the chance of observing the data in the sample, that is,

$$\hat\beta = \underset{\beta\,\in\,\mathbb{R}^{p+1}}{\text{argmax}}\; L(\beta),$$

where $L(\beta)$ is the likelihood function of $\beta$, which corresponds to the joint probability density function of $Y_1, \ldots, Y_n$, viewed as a function of $\beta$ rather than as a function of $y_1, \ldots, y_n$. Thus,

$$L(\beta) = \prod_{i=1}^{n}\frac{1}{(2\pi\sigma^2)^{1/2}}\exp\left[-\frac{1}{2}\left(\frac{y_i - x_i^\top\beta}{\sigma}\right)^2\right] = (2\pi\sigma^2)^{-\frac{n}{2}}\exp\left[-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - x_i^\top\beta)^2\right] = (2\pi\sigma^2)^{-\frac{n}{2}}\exp\left[-Q_2(\beta)\right].$$

The function $\log(x)$ is strictly increasing, so any $\beta$ that maximizes $\log[L(\beta)]$ will also maximize $L(\beta)$. In addition, the computation of the gradient of $\log[L(\beta)]$ is easier than that of $L(\beta)$, as the latter is a product of $n$ terms while the former is a sum of $n$ terms. Thus, a more convenient way to express $\hat\beta$ is the following:

$$\hat\beta = \underset{\beta\,\in\,\mathbb{R}^{p+1}}{\text{argmax}}\; \ell(\beta),$$

where $\ell(\beta) = \log[L(\beta)] = -\frac{n}{2}\log(2\pi\sigma^2) - Q_2(\beta)$ is the log-likelihood function of $\beta$.

Note that maximizing $\ell(\beta)$ with respect to $\beta$ is equivalent to maximizing $-Q_2(\beta)$, as $-\frac{n}{2}\log(2\pi\sigma^2)$ is a constant. In addition, maximizing $-Q_2(\beta)$ is equivalent to minimizing $Q_2(\beta)$, as $-1$ is a negative constant. Therefore, the ML estimator of $\beta$ in the model where $\epsilon_i \overset{\text{ind}}{\sim} \text{Normal}(0, \sigma^2)$ coincides with the OLS estimator. For this reason, the normal or Gaussian distribution is the most widely used to describe the behavior of the random errors.
Statistical properties of $\hat\beta$

If the model is correctly specified, then (see Appendix E-3)

$$\hat\beta \sim \text{Normal}_{p+1}(\text{E}(\hat\beta), \text{Var}(\hat\beta)),$$

where

$$\text{E}(\hat\beta) = \text{E}[(X^\top X)^{-1}X^\top Y] = (X^\top X)^{-1}X^\top\text{E}(Y) = \beta$$

and

$$\text{Var}(\hat\beta) = \text{Var}[(X^\top X)^{-1}X^\top Y] = (X^\top X)^{-1}X^\top\text{Var}(Y)X(X^\top X)^{-1} = \sigma^2(X^\top X)^{-1}.$$

Therefore, $\hat\beta \sim \text{Normal}_{p+1}(\beta, \sigma^2(X^\top X)^{-1})$.

In addition, $\hat\beta$ is an efficient estimator as it achieves the Cramér-Rao bound, that is, the variance-covariance matrix of $\hat\beta$ coincides with the inverse of the Fisher information matrix, where the latter is given by

$$\text{E}\left[-\frac{\partial^2\ell(\beta)}{\partial\beta\,\partial\beta^\top}\right] = \frac{1}{\sigma^2}X^\top X.$$
Exercises

(1) The (generalized) normal linear model may be written as

$$Y_i = \mu_i + \epsilon_i, \qquad \mu_i = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip} = x_i^\top\beta, \qquad \epsilon_i \overset{\text{ind}}{\sim} \text{GN}(0, \sigma^2, \tau) \;\Longrightarrow\; Y_i \overset{\text{ind}}{\sim} \text{GN}(\mu_i, \sigma^2, \tau),$$

where, unlike $\beta$ and $\sigma^2$, $\tau > 0$ is known. The probability density function of $Y_i \sim \text{GN}(\mu_i, \sigma^2, \tau)$ is as follows:

$$f_{Y_i}(y; \mu_i, \sigma^2, \tau) = \frac{\tau^{1-\frac{1}{\tau}}}{2\sigma\,\Gamma\!\left(\frac{1}{\tau}\right)}\exp\left(-\frac{1}{\tau}\left|\frac{y - \mu_i}{\sigma}\right|^{\tau}\right), \qquad y \in \mathbb{R}.$$

If $\tau = 2$ then $\text{GN}(0, \sigma^2, \tau)$ reduces to $\text{Normal}(0, \sigma^2)$. Show that:

(a) the distribution of $Y_i$ is symmetric about $\mu_i$;
(b) the mode of the distribution of $Y_i$ is $\mu_i$;
(c) the variance of $Y_i$ does not depend on $i$, that is, it is constant;
(d) the ML estimator of $\beta$ reduces to $\hat\beta = \underset{\beta\,\in\,\mathbb{R}^{p+1}}{\text{argmin}}\; Q_\tau(\beta)$.
Exercises

(2) Consider the following linear model:

$$Y_i = \mu_i + \epsilon_i, \qquad \mu_i = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip} = x_i^\top\beta, \qquad \epsilon_i \overset{\text{ind}}{\sim} \text{Normal}(0, \sigma^2/\omega_i),$$

where $\omega_i > 0$ for $i = 1, \ldots, n$ are known weights. Show that:

(a) the estimator of $\beta$ by the least squares method, better known as the Weighted Least Squares (WLS) estimator, minimizes $(y - X\beta)^\top\Omega(y - X\beta) = \sum_{i=1}^{n}\omega_i(y_i - x_i^\top\beta)^2$ and can be written as $\hat\beta = (X^\top\Omega X)^{-1}X^\top\Omega y$, where $\Omega = \text{diag}\{\omega_1, \ldots, \omega_n\}$;
(b) the ML estimator of $\beta$ coincides with the WLS estimator;
(c) if the model is correctly specified, then the distribution of $\hat\beta$ is $\text{Normal}_{p+1}(\beta, \sigma^2(X^\top\Omega X)^{-1})$.
R-squared

Interest Model (model under investigation):

$$Y_i = \mu_i + \epsilon_i, \qquad \mu_i = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip}, \qquad \epsilon_i \overset{\text{ind}}{\sim} \text{Normal}(0, \sigma^2),$$

with model matrix

$$X = \begin{pmatrix} 1 & x_{11} & \cdots & x_{1p} \\ 1 & x_{21} & \cdots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & \cdots & x_{np} \end{pmatrix}, \qquad \hat\beta = \underset{\beta\,\in\,\mathbb{R}^{p+1}}{\text{argmin}}\; Q_2(\beta), \quad\text{where}\quad Q_2(\beta) = \frac{1}{2\sigma^2}\sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2.$$

Reference Model (model with the poorest goodness-of-fit):

$$Y_i = \mu_i + \epsilon_i, \qquad \mu_i = \beta, \qquad \epsilon_i \overset{\text{ind}}{\sim} \text{Normal}(0, \sigma^2),$$

with model matrix $X^{0} = (1, \ldots, 1)^\top$ and $\hat\beta^{0} = \underset{\beta\,\in\,\mathbb{R}}{\text{argmin}}\; Q_2^{0}(\beta)$, where $Q_2^{0}(\beta) = \frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - \beta)^2$.

Note that $Q_2(\hat\beta) = \min_{\beta\,\in\,\mathbb{R}^{p+1}} Q_2(\beta)$, while $Q_2^{0}(\hat\beta^{0})$ is the minimum of $Q_2(\beta)$ restricted to $\beta_1 = \ldots = \beta_p = 0$.
R-squared

$$R^2 = 1 - \frac{Q_2(\hat\beta)}{Q_2^{0}(\hat\beta^{0})} = 1 - \frac{\sum_{i=1}^{n}\big(y_i - x_i^\top\hat\beta\big)^2}{\sum_{i=1}^{n}\big(y_i - \bar{y}\big)^2}$$

The $R^2$ satisfies the following:

(1) $R^2 \in [0, 1]$.
(2) If $X = 1_n$ then $R^2 = 0$, where $1_n$ represents a column vector of 1's of size $n$.
(3) If $X = I_n$ then $R^2 = 1$, where $I_n$ is the identity matrix of size $n$.
(4) A hundred times its value represents the percentage reduction in the Residual Sum of Squares (RSS) caused by the inclusion of $p$ parameters in the model.
(5) It coincides with $r^2(y, \hat\mu)$, where $r(y, \hat\mu)$ represents the Pearson linear correlation coefficient computed between $y_1, \ldots, y_n$ and $\hat\mu_1, \ldots, \hat\mu_n$.
(6) It may be interpreted as the percentage of the differences between $y_1, \ldots, y_n$ which are explained by the differences between $x_1, \ldots, x_n$.
R-squared

The $R^2$ can be used to compare nested models, that is, models where the first model's $X$ matrix coincides with the first $(r + 1)$ columns of the second model's $X$ matrix, in which $r < p$.

Model I:

$$Y_i = \mu_i + \epsilon_i, \qquad \mu_i = \beta_0 + \sum_{j=1}^{r}\beta_j x_{ij}, \qquad \epsilon_i \overset{\text{ind}}{\sim} \text{Normal}(0, \sigma^2),$$
$$\hat\beta_{\text{I}} = \underset{\beta\,\in\,\mathbb{R}^{r+1}}{\text{argmin}}\; Q_2^{\text{I}}(\beta), \quad\text{where}\quad Q_2^{\text{I}}(\beta) \propto \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{r}\beta_j x_{ij}\Big)^2.$$

Model II:

$$Y_i = \mu_i + \epsilon_i, \qquad \mu_i = \beta_0 + \sum_{j=1}^{r}\beta_j x_{ij} + \sum_{j=r+1}^{p}\beta_j x_{ij}, \qquad \epsilon_i \overset{\text{ind}}{\sim} \text{Normal}(0, \sigma^2),$$
$$\hat\beta_{\text{II}} = \underset{\beta\,\in\,\mathbb{R}^{p+1}}{\text{argmin}}\; Q_2^{\text{II}}(\beta), \quad\text{where}\quad Q_2^{\text{II}}(\beta) \propto \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{r}\beta_j x_{ij} - \sum_{j=r+1}^{p}\beta_j x_{ij}\Big)^2.$$

Note that $Q_2^{\text{II}}(\hat\beta_{\text{II}}) = \min_{\beta\,\in\,\mathbb{R}^{p+1}} Q_2^{\text{II}}(\beta)$, while $Q_2^{\text{I}}(\hat\beta_{\text{I}})$ is the minimum of $Q_2^{\text{II}}(\beta)$ restricted to $\beta_{r+1} = \ldots = \beta_p = 0$.
R-squared

Therefore, $Q_2^{\text{II}}(\hat\beta_{\text{II}}) \leq Q_2^{\text{I}}(\hat\beta_{\text{I}})$. As a result, $R^2_{\text{II}}$ is ALWAYS at least as large as $R^2_{\text{I}}$; that is, according to $R^2$, model II is ALWAYS better than model I, even when the additional $(p - r)$ covariates are not helpful in explaining the differences between $y_1, \ldots, y_n$.

The following expression describes an adjusted version of $R^2$ that measures not just goodness-of-fit, but also penalizes model complexity:

$$R^2_{\text{adj}} = 1 - \frac{\sum_{i=1}^{n}\big(y_i - x_i^\top\hat\beta\big)^2/(n - p - 1)}{\sum_{i=1}^{n}\big(y_i - \bar{y}\big)^2/(n - 1)} = 1 - \frac{n - 1}{n - p - 1}\,(1 - R^2),$$

where $(n - p - 1)$ and $(n - 1)$ are the degrees of freedom of the Interest Model and the Reference Model, respectively.
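A one-line numerical check of the identity above (a sketch, reusing the richness fit):

fit <- lm(Species ~ Biomass + pH, data = richness)
s <- summary(fit)
all.equal(s$adj.r.squared,
          1 - (nobs(fit) - 1) / df.residual(fit) * (1 - s$r.squared))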
Predicted R²

Note that $\hat\beta = (X^\top X)^{-1}X^\top y$ may also be written as follows:

$$\hat\beta = \Big(\sum_{l=1}^{n}x_l x_l^\top\Big)^{-1}\sum_{l=1}^{n}x_l y_l.$$

Hence, the estimate of $\beta$ derived from the dataset with the $i$-th observation excluded, denoted here by $\hat\beta_{(i)}$, is (see, for instance, Cook and Weisberg [1982, page 110])

$$\begin{aligned}
\hat\beta_{(i)} &= \Big(\sum_{l=1}^{n}x_l x_l^\top - x_i x_i^\top\Big)^{-1}\Big(\sum_{l=1}^{n}x_l y_l - x_i y_i\Big) \\
&= \big(X^\top X - x_i x_i^\top\big)^{-1}\big(X^\top y - x_i y_i\big) \\
&= \left[(X^\top X)^{-1} + \frac{(X^\top X)^{-1}x_i x_i^\top(X^\top X)^{-1}}{1 - h_{ii}}\right]\big(X^\top y - x_i y_i\big) \qquad\text{(see Appendix A-3)} \\
&= \hat\beta - \frac{y_i - x_i^\top\hat\beta}{1 - h_{ii}}\,(X^\top X)^{-1}x_i,
\end{aligned}$$

where $h_{ii} = x_i^\top(X^\top X)^{-1}x_i$ represents the $(i, i)$-th entry of $H$.
Predicted R²

Therefore, the predicted residual is

$$y_i - x_i^\top\hat\beta_{(i)} = y_i - x_i^\top\hat\beta + \frac{y_i - x_i^\top\hat\beta}{1 - h_{ii}}\,x_i^\top(X^\top X)^{-1}x_i = \frac{y_i - x_i^\top\hat\beta}{1 - h_{ii}}.$$

Thus, the predicted $R^2$ may be defined as follows:

$$R^2_{\text{pred}} = 1 - \sum_{i=1}^{n}\big(y_i - x_i^\top\hat\beta_{(i)}\big)^2 \Big/ \sum_{i=1}^{n}\big(y_i - \bar{y}_{(i)}\big)^2,$$

where $\sum_{i=1}^{n}(y_i - x_i^\top\hat\beta_{(i)})^2$ and $\sum_{i=1}^{n}(y_i - \bar{y}_{(i)})^2$ are the Predicted Residual Error Sum of Squares (PRESS) (Allen [1971]) of the interest and reference models, respectively. If the values of $(y_i - x_i^\top\hat\beta_{(i)})$ and $(y_i - \bar{y}_{(i)})$ are replaced in the expression above, then $R^2_{\text{pred}}$ becomes (see, for example, Alcantara et al. [2023])

$$R^2_{\text{pred}} = 1 - \sum_{i=1}^{n}\frac{\big(y_i - x_i^\top\hat\beta\big)^2}{(1 - h_{ii})^2} \Big/ \sum_{i=1}^{n}\frac{\big(y_i - \bar{y}\big)^2}{(1 - n^{-1})^2}.$$
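A minimal sketch of the predicted R² using the closed-form PRESS residuals above, for the richness fit:

fit <- lm(Species ~ Biomass + pH, data = richness)
h <- hatvalues(fit)                                   # leverages h_ii
y <- richness$Species
press <- sum((resid(fit) / (1 - h))^2)                # PRESS, interest model
press0 <- sum(((y - mean(y)) / (1 - 1/length(y)))^2)  # PRESS, reference model
1 - press / press0                                    # predicted R-squared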
Generalized Information Criterion

$$\text{GIC}(k) = \underbrace{-2\,\ell(\hat\beta)}_{\substack{\text{measure of the}\\ \text{goodness-of-fit}}} + \underbrace{k\,(p + 2)}_{\substack{\text{complexity}\\ \text{penalty}}} = n\log(2\pi\hat\sigma^2) + \frac{1}{\hat\sigma^2}\sum_{i=1}^{n}(y_i - x_i^\top\hat\beta)^2 + k(p + 2) = n\log(2\pi\hat\sigma^2) + (n - p - 1) + k(p + 2) = n\log(\hat\sigma^2) + p(k - 1) + C,$$

where $C > 0$ is a constant as it does not depend on the fitted model. The Akaike Information Criterion (AIC, Akaike [1974]) and the Bayesian Information Criterion (BIC, Schwarz [1978]), also known as the Schwarz Information Criterion (SIC), are special cases of $\text{GIC}(k)$ when $k = 2$ and $k = \log(n)$, respectively.

If $n > 7$ then the penalty for model complexity in BIC is greater than that in AIC. Therefore, the BIC tends to choose simpler models than the AIC.
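A quick check (a sketch) that GIC(2) and GIC(log n) reproduce R's AIC() and BIC(); note that logLik() for lm objects evaluates the log-likelihood at the ML estimate of σ², i.e. RSS/n:

n <- nobs(fit); p <- length(coef(fit)) - 1
gic <- function(k) -2 * as.numeric(logLik(fit)) + k * (p + 2)
c(gic(2), AIC(fit))        # GIC(2) = AIC
c(gic(log(n)), BIC(fit))   # GIC(log n) = BIC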
Exercises

Consider two models: in the first model, denoted here by I, $\mu_i = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip}$, while in the second model, denoted here by II, $\mu_i = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip} + \beta_{p+1} x_{i(p+1)} + \ldots + \beta_{p+r} x_{i(p+r)}$. Show that:

(1) The $R^2_{\text{adj}}$ of model I is greater [smaller] than that of model II if and only if $\hat\sigma^2_{\text{I}}/\hat\sigma^2_{\text{II}} < 1$ $\big[\hat\sigma^2_{\text{I}}/\hat\sigma^2_{\text{II}} > 1\big]$, where $\hat\sigma^2_{\text{I}}$ and $\hat\sigma^2_{\text{II}}$ represent the estimates of $\sigma^2$ obtained from models I and II, respectively. Hence, choosing the model with the highest value of $R^2_{\text{adj}}$ is equivalent to choosing the model with the lowest estimate of $\sigma^2$.

(2) The AIC of model I is smaller [greater] than that of model II if and only if $\hat\sigma^2_{\text{I}}/\hat\sigma^2_{\text{II}} < \exp(\frac{r}{n})$ $\big[\hat\sigma^2_{\text{I}}/\hat\sigma^2_{\text{II}} > \exp(\frac{r}{n})\big]$. Therefore, for large sample sizes, choosing the model with the smallest AIC value is equivalent to choosing the model with the smallest estimate of $\sigma^2$.

(3) The BIC of model I is smaller [greater] than that of model II if and only if $\hat\sigma^2_{\text{I}}/\hat\sigma^2_{\text{II}} < \exp(\frac{r}{n}[\log(n) - 1])$ $\big[\hat\sigma^2_{\text{I}}/\hat\sigma^2_{\text{II}} > \exp(\frac{r}{n}[\log(n) - 1])\big]$. Therefore, for sample sizes larger than in the AIC case, choosing the model with the smallest BIC value is equivalent to choosing the model with the smallest estimate of $\sigma^2$.
Confidence intervals

If $\hat\beta \sim \text{Normal}_{p+1}(\beta, \sigma^2(X^\top X)^{-1})$, then

$$\hat\beta_j \sim \text{Normal}(\beta_j, \sigma^2 c_j), \qquad j = 0, 1, \ldots, p,$$

where $c_j$ represents the $(j+1)$-th entry of the main diagonal of $(X^\top X)^{-1}$. Therefore,

$$\frac{\hat\beta_j - \beta_j}{\sigma\sqrt{c_j}} \sim \text{Normal}(0, 1),$$

and, for all $\alpha \in (0, 1)$,

$$\Pr\left[-Z_{1-\frac{\alpha}{2}} \leq \frac{\hat\beta_j - \beta_j}{\sigma\sqrt{c_j}} \leq Z_{1-\frac{\alpha}{2}}\right] = 1 - \alpha \;\Longrightarrow\; \Pr\left[\hat\beta_j - Z_{1-\frac{\alpha}{2}}\,\sigma\sqrt{c_j} \leq \beta_j \leq \hat\beta_j + Z_{1-\frac{\alpha}{2}}\,\sigma\sqrt{c_j}\right] = 1 - \alpha,$$

where $Z_{1-\frac{\alpha}{2}}$ represents the $100(1-\frac{\alpha}{2})$-th percentile of the standard normal distribution.
Confidence intervals

Then, a confidence interval of $100(1-\alpha)\%$ for $\beta_j$ reduces to

$$\text{IC}_{1-\alpha}(\beta_j) = \hat\beta_j \pm Z_{1-\frac{\alpha}{2}}\,\sigma\sqrt{c_j}.$$

Nevertheless, this confidence interval is not useful in practice because $\sigma^2$ is unknown. An estimator of $\sigma^2$ that is statistically independent of $\hat\beta$ and such that $(n-p-1)\frac{\hat\sigma^2}{\sigma^2} \sim \chi^2(n-p-1)$ is given by the following expression:

$$\hat\sigma^2 = \frac{1}{n-p-1}(y - X\hat\beta)^\top(y - X\hat\beta) = \frac{1}{n-p-1}\sum_{i=1}^{n}(y_i - x_i^\top\hat\beta)^2.$$

The mean and variance of a random variable following the $\chi^2(n-p-1)$ distribution allow one to compute $\text{E}(\hat\sigma^2) = \sigma^2$ and $\text{Var}(\hat\sigma^2) = 2\sigma^4/(n-p-1)$. The above allows one to conclude that $\hat\sigma^2$ is an estimator of $\sigma^2$ which satisfies the following: (i) it is unbiased; (ii) its variance tends to 0 as $n$ tends to $\infty$; and (iii) it is consistent.
Confidence intervals

$Y \sim \text{Normal}_n(X\beta, \sigma^2 I)$ implies that

$$(n-p-1)\frac{\hat\sigma^2}{\sigma^2} = Y^\top\frac{1}{\sigma^2}(I - H)Y \sim \chi^2(n-p-1)$$

provided that (see Appendix E-4):

- $\frac{1}{\sigma^2}(I - H)\,\sigma^2 I = I - H$ is an idempotent matrix,
- $(X\beta)^\top\frac{1}{\sigma^2}(I - H)X\beta = (X\beta)^\top\frac{1}{\sigma^2}(X - HX)\beta = 0$, and
- $\text{rank}(I - H) = \text{trace}(I - H) = n - \text{trace}(H) = n - p - 1$.

In addition, $(n-p-1)\frac{\hat\sigma^2}{\sigma^2}$ and $\hat\beta = (X^\top X)^{-1}X^\top Y$ are statistically independent provided that (see Appendix E-5)

$$(X^\top X)^{-1}X^\top\,\sigma^2 I\,\frac{1}{\sigma^2}(I - H) = (X^\top X)^{-1}(X^\top - X^\top H) = 0.$$
Confidence intervals

The distribution of $\frac{Z}{\sqrt{V/m}}$ is Student-$t(m)$ provided that $Z \sim \text{Normal}(0, 1)$, $V \sim \chi^2(m)$, and $Z$ and $V$ are statistically independent random variables (see Appendix D-2). Therefore,

$$\frac{\hat\beta_j - \beta_j}{\sigma\sqrt{c_j}} \Big/ \sqrt{\frac{\hat\sigma^2}{\sigma^2}} = \frac{\hat\beta_j - \beta_j}{\hat\sigma\sqrt{c_j}} \sim \text{Student-}t(n-p-1).$$

Thus, for all $\alpha \in (0, 1)$,

$$\Pr\left[-t^{(n-p-1)}_{1-\frac{\alpha}{2}} \leq \frac{\hat\beta_j - \beta_j}{\hat\sigma\sqrt{c_j}} \leq t^{(n-p-1)}_{1-\frac{\alpha}{2}}\right] = 1 - \alpha \;\Longrightarrow\; \Pr\left[\hat\beta_j - t^{(n-p-1)}_{1-\frac{\alpha}{2}}\,\hat\sigma\sqrt{c_j} \leq \beta_j \leq \hat\beta_j + t^{(n-p-1)}_{1-\frac{\alpha}{2}}\,\hat\sigma\sqrt{c_j}\right] = 1 - \alpha,$$

where $t^{(n-p-1)}_{1-\frac{\alpha}{2}}$ represents the $100(1-\frac{\alpha}{2})$-th percentile of the Student-$t(n-p-1)$ distribution.
Confidence intervals

Then, a confidence interval of $100(1-\alpha)\%$ for $\beta_j$ reduces to

$$\text{IC}_{1-\alpha}(\beta_j) = \hat\beta_j \pm t^{(n-p-1)}_{1-\frac{\alpha}{2}}\,\hat\sigma\sqrt{c_j}.$$

Two desirable attributes for a confidence interval are:

- Confidence, which is easily specified by using "small" values of $\alpha \in (0, 1)$.
- Precision, which is recognized by "small" values of the interval width, given by $2\,t^{(n-p-1)}_{1-\frac{\alpha}{2}}\,\hat\sigma\sqrt{c_j}$.

The uncertainty in $\sigma^2$ subtracts precision from the confidence interval because, for fixed $\alpha$, the value of $t^{(n-p-1)}_{1-\frac{\alpha}{2}}$ is greater than $Z_{1-\frac{\alpha}{2}}$. However, the loss of precision decreases with increasing $(n-p-1)$, since for fixed $\alpha$, $t^{(n-p-1)}_{1-\frac{\alpha}{2}} \to Z_{1-\frac{\alpha}{2}}$ as $(n-p-1) \to \infty$.
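A minimal sketch computing the t-based interval "manually" and checking it against confint(), for the richness fit:

fit <- lm(Species ~ Biomass + pH, data = richness)
alpha <- 0.05
se <- sqrt(diag(vcov(fit)))                    # sigma.hat * sqrt(c_j)
tq <- qt(1 - alpha/2, df = df.residual(fit))   # t percentile
cbind(lower = coef(fit) - tq * se, upper = coef(fit) + tq * se)
confint(fit, level = 1 - alpha)                # should agree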
Confidence intervals

[Figure: densities of the standard normal distribution (left) and the Student-t(ν) distribution (right), each marking the percentiles $-Z_{1-\frac{\alpha}{2}}$, $Z_{1-\frac{\alpha}{2}}$ and $-t_{1-\frac{\alpha}{2}}$, $t_{1-\frac{\alpha}{2}}$ with tail areas $\frac{\alpha}{2}$.]
Hypothesis testing

                                        Null hypothesis (H0) is
                                        True            False
Decision about       Don't reject      correct         Type II error
H0 is                Reject            Type I error    correct

α = Pr[Type I error] = Pr[Reject H0 | H0 is True]
γ = Pr[Type II error] = Pr[Don't reject H0 | H0 is False]
Hypothesis testing

$$H_0: \beta_j = \beta_j^{0} \quad\text{versus}\quad H_1: \beta_j \neq \beta_j^{0}$$

Type I error: concluding that $\beta_j \neq \beta_j^{0}$ when, in fact, $\beta_j = \beta_j^{0}$.
Type II error: concluding that $\beta_j = \beta_j^{0}$ when, in fact, $\beta_j \neq \beta_j^{0}$.

If $\sigma^2$ is known, then $\frac{\hat\beta_j - \beta_j^{0}}{\sigma\sqrt{c_j}} \sim \text{Normal}(0, 1)$ under $H_0$, and the decision rule is:

"Reject $H_0$ at the significance level of $100(\alpha)\%$ if $\left|\frac{\hat\beta_j - \beta_j^{0}}{\sigma\sqrt{c_j}}\right| > Z_{1-\frac{\alpha}{2}}$."

$$\begin{aligned}
\Pr[\text{Type I error}] &= \Pr\Big[\big|\hat\beta_j - \beta_j^{0}\big|\big/\sigma\sqrt{c_j} > Z_{1-\frac{\alpha}{2}} \,\Big|\, H_0 \text{ is True}\Big] \\
&= 1 - \Pr\Big[\big|\hat\beta_j - \beta_j^{0}\big|\big/\sigma\sqrt{c_j} \leq Z_{1-\frac{\alpha}{2}} \,\Big|\, H_0 \text{ is True}\Big] \\
&= 1 - \Pr\Big[-Z_{1-\frac{\alpha}{2}} \leq (\hat\beta_j - \beta_j^{0})\big/\sigma\sqrt{c_j} \leq Z_{1-\frac{\alpha}{2}} \,\Big|\, H_0 \text{ is True}\Big] \\
&= 1 - \big[\Phi(Z_{1-\frac{\alpha}{2}}) - \Phi(-Z_{1-\frac{\alpha}{2}})\big] = \alpha,
\end{aligned}$$

where $\Phi(\cdot)$ represents the cumulative distribution function of the standard normal distribution.
Hypothesis testing

$$\begin{aligned}
\Pr[\text{Type II error}] &= \Pr\Big[\big|\hat\beta_j - \beta_j^{0}\big|\big/\sigma\sqrt{c_j} \leq Z_{1-\frac{\alpha}{2}} \,\Big|\, H_0 \text{ is False}\Big] \\
&= \Pr\Big[-Z_{1-\frac{\alpha}{2}} \leq (\hat\beta_j - \beta_j^{0})\big/\sigma\sqrt{c_j} \leq Z_{1-\frac{\alpha}{2}} \,\Big|\, H_0 \text{ is False}\Big] \\
&= \Pr\Big[-Z_{1-\frac{\alpha}{2}} - \delta \leq (\hat\beta_j - \beta_j)\big/\sigma\sqrt{c_j} \leq Z_{1-\frac{\alpha}{2}} - \delta \,\Big|\, H_0 \text{ is False}\Big] \\
&= \Phi(Z_{1-\frac{\alpha}{2}} - \delta) - \Phi(-Z_{1-\frac{\alpha}{2}} - \delta),
\end{aligned}$$

where $\delta = (\beta_j - \beta_j^{0})\big/\big(\sigma\sqrt{c_j}\big)$.

[Figure: Pr[Type II error] as a function of δ for α = 0.01, 0.05 and 0.10.]
Hypothesis testing

If $\sigma^2$ is unknown and the aim is to assess the hypothesis system

$$H_0: \beta_j = \beta_j^{0} \quad\text{versus}\quad H_1: \beta_j \neq \beta_j^{0},$$

then the following decision rule may be used:

"Reject $H_0$ at the significance level of $100(\alpha)\%$ if $\left|\frac{\hat\beta_j - \beta_j^{0}}{\hat\sigma\sqrt{c_j}}\right| > t^{(n-p-1)}_{1-\frac{\alpha}{2}}$."

If a random variable follows the Student-$t(n-p-1)$ distribution, then its square follows the $F_{(1,\,n-p-1)}$ distribution (see Appendix D-5). Therefore, the above decision rule is equivalent to the following:

"Reject $H_0$ at the significance level of $100(\alpha)\%$ if $\frac{(\hat\beta_j - \beta_j^{0})^2}{\hat\sigma^2 c_j} > F^{(1,\,n-p-1)}_{1-\alpha}$."

The $p$-value, given by $\Pr[F > (\hat\beta_j - \beta_j^{0})^2/(\hat\sigma^2 c_j)]$, where $F \sim F_{(1,\,n-p-1)}$, allows one to write the above decision rule as follows:

"Reject $H_0$ at the significance level of $100(\alpha)\%$ if $p$-value $< \alpha$."
Hypothesis testing

[Figure: density of the $F_{(r,\,n-p-1)}$ distribution in a case where the observed $F$ is smaller than $F_{1-\alpha}$, so the $p$-value exceeds $\alpha$: the null hypothesis is not rejected at the significance level of $100(\alpha)\%$.]

[Figure: density of the $F_{(r,\,n-p-1)}$ distribution in a case where the observed $F$ exceeds $F_{1-\alpha}$, so the $p$-value is smaller than $\alpha$: the null hypothesis is rejected at the significance level of $100(\alpha)\%$.]
Hypothesis testing

The interest is to assess the following hypothesis system:

$$H_0: L^\top\beta = M \quad\text{versus}\quad H_1: L^\top\beta \neq M,$$

where $L^\top$ is an $r \times (p+1)$ matrix whose rows are linearly independent and $M$ is an $r \times 1$ matrix. The following are examples of the hypothesis system above for $p = 3$:

(1) $H_0: \beta_2 = 0$ versus $H_1: \beta_2 \neq 0$. Then $r = 1$, $L^\top = \begin{bmatrix} 0 & 0 & 1 & 0 \end{bmatrix}$ and $M = 0$.

(2) $H_0: \begin{bmatrix} \beta_1 \\ \beta_3 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$ versus $H_1: \begin{bmatrix} \beta_1 \\ \beta_3 \end{bmatrix} \neq \begin{bmatrix} 0 \\ 0 \end{bmatrix}$. Then $r = 2$, $L^\top = \begin{bmatrix} 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}$ and $M = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$.
Hypothesis testing

(3) $H_0: \beta_2 = \beta_3$ versus $H_1: \beta_2 \neq \beta_3$. Then $r = 1$, $L^\top = \begin{bmatrix} 0 & 0 & 1 & -1 \end{bmatrix}$ and $M = 0$.

If the model is correctly specified then $\hat\beta \sim \text{Normal}_{p+1}(\beta, \sigma^2(X^\top X)^{-1})$, which implies that (see Appendix E-3)

$$L^\top\hat\beta \sim \text{Normal}_r(L^\top\beta,\; L^\top\sigma^2(X^\top X)^{-1}L),$$

which in turn implies that (see Appendix E-3,4)

$$(L^\top\hat\beta - L^\top\beta)^\top(L^\top\sigma^2(X^\top X)^{-1}L)^{-1}(L^\top\hat\beta - L^\top\beta) \sim \chi^2(r).$$

The distribution of $\frac{V/m}{W/r}$ is $F_{(m,\,r)}$ provided that $V \sim \chi^2(m)$ and $W \sim \chi^2(r)$ are statistically independent random variables (see Appendix D-3). Therefore,

$$\frac{\sigma^2}{r\,\hat\sigma^2}\,(L^\top\hat\beta - L^\top\beta)^\top(L^\top\sigma^2(X^\top X)^{-1}L)^{-1}(L^\top\hat\beta - L^\top\beta) \sim F_{(r,\,n-p-1)}.$$
Hypothesis testing

Then, the decision rule reduces to: "Reject $H_0$ at the significance level of $100(\alpha)\%$ if $F > F^{(r,\,n-p-1)}_{1-\alpha}$", where $F$ may be written as follows:

$$F = \frac{1}{r}\,(L^\top\hat\beta - M)^\top(L^\top\hat\sigma^2(X^\top X)^{-1}L)^{-1}(L^\top\hat\beta - M).$$

For example,

(1) $H_0: \beta_2 = 0$ versus $H_1: \beta_2 \neq 0$. Then
$$F = \frac{\hat\beta_2^2}{\widehat{\text{Var}}(\hat\beta_2)} = \frac{\hat\beta_2^2}{[\text{Std.Error}(\hat\beta_2)]^2}.$$

(2) $H_0: \begin{bmatrix} \beta_1 \\ \beta_3 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$ versus $H_1: \begin{bmatrix} \beta_1 \\ \beta_3 \end{bmatrix} \neq \begin{bmatrix} 0 \\ 0 \end{bmatrix}$. Then
$$F = \frac{1}{2}\begin{bmatrix} \hat\beta_1 & \hat\beta_3 \end{bmatrix}\begin{bmatrix} \widehat{\text{Var}}(\hat\beta_1) & \widehat{\text{Cov}}(\hat\beta_1, \hat\beta_3) \\ \widehat{\text{Cov}}(\hat\beta_3, \hat\beta_1) & \widehat{\text{Var}}(\hat\beta_3) \end{bmatrix}^{-1}\begin{bmatrix} \hat\beta_1 \\ \hat\beta_3 \end{bmatrix}.$$
Hypothesis testing

(3) $H_0: \beta_2 = \beta_3$ versus $H_1: \beta_2 \neq \beta_3$. Then
$$F = \frac{(\hat\beta_2 - \hat\beta_3)^2}{\widehat{\text{Var}}(\hat\beta_2) + \widehat{\text{Var}}(\hat\beta_3) - 2\,\widehat{\text{Cov}}(\hat\beta_2, \hat\beta_3)}.$$

Alternatively, $F$ can be expressed as follows:

$$F = \frac{1}{r\,\hat\sigma^2}\left[\sum_{i=1}^{n}(y_i - x_i^\top\hat\beta^{0})^2 - \sum_{i=1}^{n}(y_i - x_i^\top\hat\beta)^2\right],$$

where $\hat\beta = \underset{\beta\,\in\,\mathbb{R}^{p+1}}{\text{argmin}}\; Q_2(\beta)$ and $\hat\beta^{0}$ is the minimizer of $Q_2(\beta)$ restricted to $L^\top\beta = M$.
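A minimal sketch of the general F statistic for H0: L'β = M, here testing β2 = β3 (i.e., pHmid = pHhigh) in the richness fit; the contrast row is an illustrative choice:

fit <- lm(Species ~ Biomass + pH, data = richness)
L <- matrix(c(0, 0, 1, -1), nrow = 1)   # row of L' over (Intercept, Biomass, pHmid, pHhigh)
M <- 0
d <- L %*% coef(fit) - M
Fstat <- drop(t(d) %*% solve(L %*% vcov(fit) %*% t(L)) %*% d) / nrow(L)
pf(Fstat, df1 = nrow(L), df2 = df.residual(fit), lower.tail = FALSE)  # p-value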
Richness of the species

The analysis of these data, available in the object richness of the library glmtoolbox, aims to explain the differences between plots as related to species richness (represented by a count of the number of plant species (richness)) using the differences between plots with respect to biomasses (Biomass) and soil pH levels: low (pH="low"), medium (pH="mid") and high (pH="high").

[Figure: Number of plant species versus Biomass by pH level (low, mid, high).]

$$Y_i = \mu_i + \epsilon_i, \qquad \mu_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3}, \qquad \epsilon_i \overset{\text{ind}}{\sim} \text{Normal}(0, \sigma^2),$$

where
- $y$ = Number of plant species
- $x_1$ = Biomass
- $x_2 = 0$ and $x_3 = 0$ if pH = "low"
- $x_2 = 1$ and $x_3 = 0$ if pH = "mid"
- $x_2 = 0$ and $x_3 = 1$ if pH = "high"
Parameter interpretation: without interaction

The expected number of plant species for a plot with biomass $x_1$ and pH level low ($x_2 = 0$ and $x_3 = 0$), mid ($x_2 = 1$ and $x_3 = 0$) or high ($x_2 = 0$ and $x_3 = 1$) is $\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3$.

Consider two plots with the same pH level, where the biomass of the first plot is $x + \delta$ while the biomass of the second plot is $x$. As a result, the expected number of plant species for the first plot is $|\beta_1|\delta$ units lower, if $\beta_1 < 0$, or greater, if $\beta_1 > 0$, than that for the second plot:
$$\mu_{x+\delta} - \mu_x = \beta_0 + \beta_1(x + \delta) + \beta_2 x_2 + \beta_3 x_3 - (\beta_0 + \beta_1 x + \beta_2 x_2 + \beta_3 x_3) = \beta_1\delta.$$

Consider two plots with the same biomass, where the pH level of the first plot is mid while that of the second plot is low. As a result, the expected number of plant species for the first plot is $|\beta_2|$ units lower, if $\beta_2 < 0$, or greater, if $\beta_2 > 0$, than that for the second plot:
$$\mu_{\text{mid}} - \mu_{\text{low}} = \beta_0 + \beta_1 x_1 + \beta_2 - (\beta_0 + \beta_1 x_1) = \beta_2.$$

Consider two plots with the same biomass, where the pH level of the first plot is high while that of the second plot is low. As a result, the expected number of plant species for the first plot is $|\beta_3|$ units lower, if $\beta_3 < 0$, or greater, if $\beta_3 > 0$, than that for the second plot:
$$\mu_{\text{high}} - \mu_{\text{low}} = \beta_0 + \beta_1 x_1 + \beta_3 - (\beta_0 + \beta_1 x_1) = \beta_3.$$
Parameter interpretation: with interaction

The expected number of plant species for a plot with biomass $x_1$ and pH level low ($x_2 = 0$ and $x_3 = 0$), mid ($x_2 = 1$ and $x_3 = 0$) or high ($x_2 = 0$ and $x_3 = 1$) is $\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_1 x_2 + \beta_5 x_1 x_3$.

Consider two plots with pH level low, where the biomass of the first plot is $x + \delta$ while the biomass of the second plot is $x$. Therefore, the expected number of plant species for the first plot is $|\beta_1|\delta$ units lower, if $\beta_1 < 0$, or greater, if $\beta_1 > 0$, than that for the second plot:
$$\mu_{x+\delta} - \mu_x = \beta_0 + \beta_1(x + \delta) - (\beta_0 + \beta_1 x) = \beta_1\delta.$$

Consider two plots with pH level mid, where the biomass of the first plot is $x + \delta$ while the biomass of the second plot is $x$. Therefore, the expected number of plant species for the first plot is $|\beta_1 + \beta_4|\delta$ units lower, if $\beta_1 + \beta_4 < 0$, or greater, if $\beta_1 + \beta_4 > 0$, than that for the second plot:
$$\mu_{x+\delta} - \mu_x = \beta_0 + \beta_1(x + \delta) + \beta_4(x + \delta) - (\beta_0 + \beta_1 x + \beta_4 x) = (\beta_1 + \beta_4)\delta.$$

Consider two plots with pH level high, where the biomass of the first plot is $x + \delta$ while the biomass of the second plot is $x$. Therefore, the expected number of plant species for the first plot is $|\beta_1 + \beta_5|\delta$ units lower, if $\beta_1 + \beta_5 < 0$, or greater, if $\beta_1 + \beta_5 > 0$, than that for the second plot:
$$\mu_{x+\delta} - \mu_x = \beta_0 + \beta_1(x + \delta) + \beta_5(x + \delta) - (\beta_0 + \beta_1 x + \beta_5 x) = (\beta_1 + \beta_5)\delta.$$
Richness of the species

> data(richness)
> fit <- lm(Species ~ Biomass + pH, data=richness)
> summary(fit)

Each coefficient row reports $\hat\beta_j$, $\hat\sigma\sqrt{c_j}$, $\hat\beta_j/(\hat\sigma\sqrt{c_j})$ and $\Pr[F_{(1,\,n-p-1)} > t_j^2]$, i.e., the test of $H_0: \beta_j = 0$ versus $H_1: \beta_j \neq 0$:

             Estimate  Std.Error  t-value  Pr(>|t|)
(Intercept)   17.6248     0.7890    22.34    <2e-16
Biomass       -2.7276     0.1717   -15.89    <2e-16
pHmid         11.9782     1.0020    11.95    <2e-16
pHhigh        22.6200     1.0818    20.91    <2e-16

Residual standard error: 3.783 on 86 degrees of freedom    [this is σ̂ and (n - p - 1)]
Multiple R-squared: 0.8524 [R²], Adjusted R-squared: 0.8472 [R²_adj]
F-statistic: 165.5 on 3 and 86 DF, p-value: < 2.2e-16
    [F, r and (n - p - 1), with p-value Pr[F_(r, n-p-1) > F] for H0: (β1, β2, β3)ᵀ = 0 vs H1: (β1, β2, β3)ᵀ ≠ 0]
Richness of the species

Is there an interaction between biomass and soil pH level? That is, does the effect on the expected number of plant species of an increase in biomass depend on soil pH level?

$$Y_i = \mu_i + \epsilon_i, \qquad \mu_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \beta_4 x_{i1}x_{i2} + \beta_5 x_{i1}x_{i3}, \qquad \epsilon_i \overset{\text{ind}}{\sim} \text{Normal}(0, \sigma^2)$$

> fit2 <- update(fit, formula=Species ~ Biomass*pH)
> anova(fit,fit2)

This compares $H_0: (\beta_4, \beta_5)^\top = 0$ versus $H_1: (\beta_4, \beta_5)^\top \neq 0$:

Model 1: Species ~ Biomass + pH
Model 2: Species ~ Biomass + pH + Biomass:pH
  Res.Df     RSS  Df  Sum of Sq       F  Pr(>F)
1     86  1230.9
2     84  1224.7   2     6.2591  0.2147  0.8073
Richness of the species

Model matrix               $X$                                                          model.matrix(fit)
Parameter estimates        $\hat\beta = (\hat\beta_0, \hat\beta_1, \ldots, \hat\beta_p)^\top$   coef(fit)
Residual sum of squares    $\sum_{i=1}^{n}(y_i - x_i^\top\hat\beta)^2$                  sum(resid(fit)**2)
Log-likelihood             $\ell(\hat\beta)$                                            logLik(fit)
Confidence intervals       $\hat\beta_j \pm t^{(n-p-1)}_{1-\frac{\alpha}{2}}\,\hat\sigma\sqrt{c_j}$   confint(fit, level=1-α)
Degrees of freedom         $n - p - 1$                                                  df.residual(fit)
Estimate of $\sigma^2$     $\hat\sigma^2$                                               summary(fit)$sigma**2
Estimate of Var($\hat\beta$)  $\hat\sigma^2(X^\top X)^{-1}$                             vcov(fit)
Fitted values              $\hat\mu = X\hat\beta = Hy$                                  fitted(fit)
Ordinary residuals         $r_i = y_i - x_i^\top\hat\beta$                              resid(fit)
R-squared                  $R^2$                                                        summary(fit)$r.squared
Adjusted R-squared         $R^2_{\text{adj}}$                                           summary(fit)$adj.r.squared
Gasoline yield

The analysis of the data in the object Auto of the library ISLR is aimed at explaining the differences between the vehicles with respect to their gasoline yield (mpg), in miles per gallon, using other vehicle characteristics such as their horsepower (horsepower) and their weights (weight), in kilograms.

[Figure: Gasoline yield versus Horsepower by weight level (low, mid, high).]

$$Y_i = \mu_i + \epsilon_i, \qquad \mu_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i1}x_{i2}, \qquad \epsilon_i \overset{\text{ind}}{\sim} \text{Normal}(0, \sigma^2),$$

where
- $y$ = Gasoline yield
- $x_1$ = Horsepower
- $x_2$ = Weight
Parameter interpretation: without interaction

The expected gasoline yield for a vehicle with horsepower $x_1$ and weight $x_2$ is $\beta_0 + \beta_1 x_1 + \beta_2 x_2$.

Consider two vehicles with the same weight, where the horsepower of the first vehicle is $x + \delta$ while the horsepower of the second vehicle is $x$. As a result, the expected gasoline yield for the first vehicle is $|\beta_1|\delta$ miles per gallon lower, if $\beta_1 < 0$, or greater, if $\beta_1 > 0$, than that for the second vehicle:
$$\mu_{x+\delta} - \mu_x = \beta_0 + \beta_1(x + \delta) + \beta_2 x_2 - (\beta_0 + \beta_1 x + \beta_2 x_2) = \beta_1\delta.$$

Consider two vehicles with the same horsepower, where the weight of the first vehicle is $x + \delta$ while the weight of the second vehicle is $x$. As a result, the expected gasoline yield for the first vehicle is $|\beta_2|\delta$ miles per gallon lower, if $\beta_2 < 0$, or greater, if $\beta_2 > 0$, than that for the second vehicle:
$$\mu_{x+\delta} - \mu_x = \beta_0 + \beta_1 x_1 + \beta_2(x + \delta) - (\beta_0 + \beta_1 x_1 + \beta_2 x) = \beta_2\delta.$$
Parameter interpretation: with interaction

The expected gasoline yield for a vehicle with horsepower $x_1$ and weight $x_2$ is $\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2$.

Consider two vehicles with weight equal to $z$, where the horsepower of the first vehicle is $x + \delta$ while the horsepower of the second vehicle is $x$. As a result, the expected gasoline yield for the first vehicle is $|\beta_1 + \beta_3 z|\delta$ miles per gallon lower, if $\beta_1 + \beta_3 z < 0$, or greater, if $\beta_1 + \beta_3 z > 0$, than that for the second vehicle:
$$\mu_{x+\delta} - \mu_x = \beta_0 + \beta_1(x + \delta) + \beta_2 z + \beta_3(x + \delta)z - (\beta_0 + \beta_1 x + \beta_2 z + \beta_3 xz) = (\beta_1 + \beta_3 z)\delta.$$

Consider two vehicles with horsepower equal to $z$, where the weight of the first vehicle is $x + \delta$ while the weight of the second vehicle is $x$. As a result, the expected gasoline yield for the first vehicle is $|\beta_2 + \beta_3 z|\delta$ miles per gallon lower, if $\beta_2 + \beta_3 z < 0$, or greater, if $\beta_2 + \beta_3 z > 0$, than that for the second vehicle:
$$\mu_{x+\delta} - \mu_x = \beta_0 + \beta_1 z + \beta_2(x + \delta) + \beta_3 z(x + \delta) - (\beta_0 + \beta_1 z + \beta_2 x + \beta_3 zx) = (\beta_2 + \beta_3 z)\delta.$$
Gasoline yield

> data(Auto,package="ISLR")
> Auto <- within(Auto, weight <- weight/1000)
>
> fit <- lm(mpg ~ horsepower + weight + horsepower:weight,
+           data=Auto)
> summary(fit)

Each coefficient row reports $\hat\beta_j$, $\hat\sigma\sqrt{c_j}$, $\hat\beta_j/(\hat\sigma\sqrt{c_j})$ and $\Pr[F_{(1,\,n-p-1)} > t_j^2]$, i.e., the test of $H_0: \beta_j = 0$ versus $H_1: \beta_j \neq 0$:

                     Estimate  Std.Error  t-value  Pr(>|t|)
(Intercept)         63.557940   2.342949   27.127   < 2e-16
horsepower          -0.250838   0.027279   -9.195   < 2e-16
weight             -10.772411   0.773809  -13.921   < 2e-16
horsepower:weight    0.053554   0.006649    8.054  9.93e-15

Residual standard error: 3.93 on 388 degrees of freedom    [this is σ̂ and (n - p - 1)]
Multiple R-squared: 0.7484 [R²], Adjusted R-squared: 0.7465 [R²_adj]
F-statistic: 384.8 on 3 and 388 DF, p-value: < 2.2e-16
    [F, r and (n - p - 1), with p-value Pr[F_(r, n-p-1) > F] for H0: (β1, β2, β3)ᵀ = 0 vs H1: (β1, β2, β3)ᵀ ≠ 0]
Gasoline yield

> fit2 <- update(fit, formula = mpg ~ horsepower + weight)
> anova(fit2,fit)

This compares $H_0: \beta_3 = 0$ versus $H_1: \beta_3 \neq 0$:

Model 1: mpg ~ horsepower + weight
Model 2: mpg ~ horsepower + weight + horsepower:weight
  Res.Df     RSS  Df  Sum of Sq       F     Pr(>F)
1    389  6993.8
2    388  5992.0   1     1001.8  64.871  9.934e-15

> fit3 <- update(fit, formula = mpg ~ 1)
> anova(fit3,fit)

This compares $H_0: (\beta_1, \beta_2, \beta_3)^\top = 0$ versus $H_1: (\beta_1, \beta_2, \beta_3)^\top \neq 0$:

Model 1: mpg ~ 1
Model 2: mpg ~ horsepower + weight + horsepower:weight
  Res.Df    RSS  Df  Sum of Sq      F     Pr(>F)
1    391  23819
2    388   5992   3      17827  384.8  < 2.2e-16
Exercises

These data, available in the object races of the library glmtoolbox, consist of the record time, distance, and cumulative climb of 35 hill races in Scotland. The aim of the statistical analysis of these data is to explain the differences between the record times of the races (rtime), in minutes, using their differences in distance (distance), in miles, and cumulative climb (cclimb), in thousands of feet.

(a) Fit to the data a normal linear model where the record time of the races is the response variable, and the distance and cumulative climb are the explanatory variables.
(b) Assess at the significance level of 5% if the effect of the distance on the expected record time of the race depends on the cumulative climb.
(c) Interpret the parameter estimates except the intercept term.
(d) Estimate the expected record time, in minutes, of a race whose distance and cumulative climb are 7.5 miles and 1800 feet, respectively. Compute this estimate "manually" and using the predict() function (see the sketch after this list).

Exercises

These data, available in the object whiteside of the library MASS, were collected to assess the effect of insulation on gas consumption. The weekly gas consumption (Gas), in thousands of cubic feet, and the average external temperature (Temp), in degrees Celsius, were recorded for 26 weeks before (Insul="Before") and for 30 weeks after (Insul="After") a cavity-wall insulation was installed in a house.

(a) Fit to the data a normal linear model where the gas consumption is the response variable, and the average external temperature and the presence/absence of the cavity-wall insulation are the explanatory variables.
(b) Assess at the significance level of 5% if the effect of the average external temperature on the expected gas consumption depends on the presence/absence of the wall insulation.
(c) Interpret the parameter estimates except the intercept term.
(d) Estimate the expected gas consumption, in thousands of cubic feet, in a week where the average external temperature is 5 degrees Celsius and the wall insulation was present. Compute this estimate "manually" and using the predict() function.
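A hedged sketch for item (d) of the first exercise, assuming the main-effects model of item (a); recall cclimb is measured in thousands of feet, so 1800 feet is 1.8:

library(glmtoolbox)
data(races)
fit <- lm(rtime ~ distance + cclimb, data = races)
x0 <- c(1, 7.5, 1.8)                  # "manually": x0' beta.hat
sum(x0 * coef(fit))
predict(fit, newdata = data.frame(distance = 7.5, cclimb = 1.8))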
Variable selection
Hybrid forward stepwise

The algorithm starts with the simplest model, that is, a model with a linear predictor in which the coefficients of all variables and/or effects, except the intercept term, are zero. Next, the candidate models are constructed by hierarchically adding effects to the linear predictor (i.e., the main effects come first, then the two-way interactions, and so on), and their "relevance" is assessed by comparing the models with and without the added effects using a previously defined criterion. After an effect has been added to the model, any other effect previously added can be removed if, according to the specified criterion, that effect is no longer relevant. See, for example, James et al. [2013, page 210].

The criteria to compare models include the following:
- Akaike Information Criterion
- Bayesian Information Criterion
- Adjusted R²
- Predicted R²
- Mallows' CP
- p-value of the F test
Hybrid backward stepwise

The process begins with the most complex model (i.e., one that includes all variables and/or effects in its linear predictor) and then the candidate models are constructed by hierarchically removing effects from the linear predictor (i.e., the highest-level interactions come first, then the next-level interactions, and so on). The "relevance" of the candidates is determined by comparing the models with and without the excluded effects using a previously established criterion. If an effect is excluded from the model, then any other effect previously excluded may be added back to it if, based on the specified criterion, that effect is no longer irrelevant. See, for example, James et al. [2013, page 210].

The criteria to compare models include the following:
- Akaike Information Criterion
- Bayesian Information Criterion
- Adjusted R²
- Predicted R²
- Mallows' CP
- p-value of the F test
Variable selection

[Figure: schematic of the candidate models explored by the forward and backward stepwise procedures.]
Variable selection

> args(stepCriterion.lm)
> help(stepCriterion.lm)

model: the object where the fitted LM is stored.

criterion: the criterion specified to compare the candidate models. The available options are: AIC ("aic"), BIC ("bic"), adjusted R-squared ("adjr2"), predicted R-squared ("prdr2"), Mallows' CP ("cp") and p-value of the F test ("p-value"). By default, criterion is set to "bic".

direction: hybrid backward ("backward") or hybrid forward ("forward") procedures. By default, direction="forward".

levels: the pair of values in the interval (0, 1) at which an effect must enter and leave the model, respectively, if criterion="p-value". By default, levels=c(0.05,0.05).

trace: logical switch indicating whether the stepwise reports should be printed. By default, trace is set to TRUE.
Best subset selection

If there are $p$ explanatory variables, which can be introduced into the model by $\tilde{p}$ columns in the model matrix $X$, then the number of possible models of size $i$, for $i = 1, \ldots, \tilde{p}$, is given by

$$\binom{\tilde{p}}{i} = \frac{\tilde{p}!}{(\tilde{p} - i)!\,i!}.$$

Therefore, the total number of models is

$$\sum_{i=1}^{\tilde{p}}\binom{\tilde{p}}{i} = 2^{\tilde{p}} - 1.$$

In the "best subset" selection, an exhaustive search is conducted separately for every model size from $1$ to $\tilde{p}$, in which the model with the smallest Residual Sum of Squares (RSS) value is identified. Therefore, if, for a fixed model size, the model selection criteria of interest reduce to monotone functions of RSS, thus differing only in the way the model sizes are compared, then the results of the "best subset" selection do not depend upon the choice of the trade-off between goodness-of-fit and complexity on which they are based.
Gasoline yield
> data(Auto, package="ISLR")
> Auto <- within(Auto, {weight <- weight/1000
+ origin <- factor(origin, labels=c("America","Europe","Japan")) })
>
> fit <- lm(mpg ~ 1 + horsepower + weight + horsepower:weight + origin + acceleration +
+ displacement + cylinders, data=Auto)
>
>
> stepCriterion(fit, direction="forward", criterion="bic")
Initial model:
~ 1
Step 0 :
               df      AIC      BIC  adj.R-squared  prd.R-squared  Mallows' CP  Pr(>F)(*)
+ weight        1 2265.939 2277.852         0.6918         0.6913        96.39   <2.2e-16
+ displacement  1 2318.830 2330.744         0.6473         0.6465       166.37   <2.2e-16
+ horsepower    1 2363.324 2375.237         0.6049         0.6032       232.10   <2.2e-16
+ cylinders     1 2364.574 2376.488         0.6037         0.6030       234.98   <2.2e-16
+ origin        2 2572.313 2588.198         0.3284         0.3250       666.96   <2.2e-16
+ acceleration  1 2650.969 2662.883         0.1771         0.1749       905.51   <2.2e-16
<none>            2726.383 2734.325         0.0000         0.0000      1185.93
Step 1 : + weight
               df      AIC      BIC  adj.R-squared  prd.R-squared  Mallows' CP  Pr(>F)(*)
+ horsepower    1 2250.005 2265.890         0.7049         0.7034        76.73  2.488e-05
+ acceleration  1 2258.820 2274.705         0.6982         0.6958        87.26   0.002645
+ displacement  1 2259.773 2275.658         0.6974         0.6960        88.41   0.004445
+ origin        2 2257.750 2277.606         0.6997         0.6963        85.56   0.002401
+ cylinders     1 2261.726 2277.611         0.6959         0.6948        90.78   0.013087
<none>            2265.939 2277.852         0.6918         0.6913        96.39
Gasoline yield
Step 2 : + horsepower
                    df      AIC      BIC  adj.R-squared  prd.R-squared  Mallows' CP  Pr(>F)(*)
+ horsepower:weight  1 2191.401 2211.258         0.7465         0.7447        12.45  9.992e-15
+ origin             2 2236.388 2260.215         0.7164         0.7126        60.40  0.0001672
<none>                 2250.005 2265.890         0.7049         0.7034        76.73
+ cylinders          1 2250.297 2270.153         0.7054         0.7035        76.72  0.1938059
+ displacement       1 2251.230 2271.086         0.7047         0.7023        77.82  0.3813177
+ acceleration       1 2252.005 2271.861         0.7041         0.7011        78.73  0.9866463
- weight             1 2363.324 2375.237         0.6049         0.6032       232.10   <2.2e-16
Step 3 : + horsepower:weight
               df      AIC      BIC  adj.R-squared  prd.R-squared  Mallows' CP  Pr(>F)(*)
<none>            2191.401 2211.258         0.7465         0.7447       12.450
+ origin        2 2184.990 2212.789         0.7519         0.7477        6.058    0.00594
+ acceleration  1 2191.167 2214.994         0.7473         0.7440       12.196    0.13770
+ displacement  1 2192.028 2215.856         0.7467         0.7437       13.063    0.24457
+ cylinders     1 2192.807 2216.634         0.7462         0.7439       13.848    0.44377
Final model:
~ weight + horsepower + weight:horsepower
> a <- stepCriterion(fit, direction="forward", criterion="bic", trace=FALSE)
> a
$initial
[1] "~ 1"
$direction
[1] "forward"
$criterion
[1] "BIC"
$final
[1] "~ weight + horsepower + weight:horsepower"
Gasoline yield
> stepCriterion(fit, direction="backward", criterion="bic")
Initial model:
~ horsepower + weight + horsepower:weight + origin + acceleration + displacement + cylinders
Step 0 :
                     df       AIC       BIC  adj.R-squared  prd.R-squared  Mallows' CP  Pr(>F)(*)
- displacement        1 2185.9861 2221.7275         0.7525         0.7459       7.1114    0.73876
- cylinders           1 2185.9877 2221.7291         0.7525         0.7449       7.1129    0.73700
- acceleration        1 2188.6987 2224.4401         0.7507         0.7449       9.7717    0.09676
- origin              2 2192.9214 2224.6915         0.7474         0.7418      13.9443    0.01203
<none>                  2187.8721 2227.5848         0.7519         0.7439       9.0000
- horsepower:weight   1 2240.1859 2275.9272         0.7158         0.7087      63.9189  3.338e-13
Step 1 : - displacement
                     df       AIC       BIC  adj.R-squared  prd.R-squared  Mallows' CP  Pr(>F)(*)
- cylinders           1 2184.5512 2216.3213         0.7528         0.7471       5.6640   0.457177
- acceleration        1 2186.7146 2218.4847         0.7514         0.7465       7.7873   0.102302
- origin              2 2192.1403 2219.9391         0.7473         0.7433      13.1649   0.006919
<none>                  2185.9861 2221.7275         0.7525         0.7459       7.1114
- horsepower:weight   1 2239.6375 2271.4076         0.7154         0.7092      63.5511  1.599e-13
Step 2 : - cylinders
                     df       AIC       BIC  adj.R-squared  prd.R-squared  Mallows' CP  Pr(>F)(*)
- acceleration        1 2184.9899 2212.7888         0.7519         0.7477       6.0584   0.121947
- origin              2 2191.1665 2214.9941         0.7473         0.7440      12.1956   0.005446
<none>                  2184.5512 2216.3213         0.7528         0.7471       5.6640
- horsepower:weight   1 2238.3223 2266.1211         0.7157         0.7104      62.3230  1.401e-13
Gasoline yield
Step 3 : - acceleration
df AIC BIC adj.R-squared prd.R-squared Mallows’ CP Pr(>F)(*)
- origin 2 2191.4014 2211.2577 0.7465 0.7447 12.4494 0.00594
<none> 2184.9899 2212.7888 0.7519 0.7477 6.0584
- horsepower:weight 1 2236.3878 2260.2153 0.7164 0.7126 60.3969 4.274e-13
Step 4 : - origin
df AIC BIC adj.R-squared prd.R-squared Mallows’ CP Pr(>F)(*)
<none> 2191.401 2211.258 0.7465 0.7447 12.449
- horsepower:weight 1 2250.005 2265.890 0.7049 0.7034 76.733 9.992e-15
Final model:
~ horsepower + weight + horsepower:weight
> b <- stepCriterion(fit, direction="backward", criterion="bic", trace=FALSE)
> b
$initial
[1]"~horsepower + weight + horsepower:weight + origin + acceleration + displacement + cylinders"
$direction
[1] "backward"
$criterion
[1] "BIC"
$final
[1] "~ weight + horsepower + weight:horsepower"
Gasoline yield
> library(leaps)
> args(regsubsets)
> help(regsubsets)
>
> out <- regsubsets(mpg ~ 1 + horsepower + weight + horsepower:weight + origin + acceleration +
+ displacement + cylinders, data=Auto)
> out2 <- summary(out)
1 subsets of each size up to 8
Selection Algorithm: exhaustive
horsepower weight origEurope origJapan acceleration displacement cylinders horsepower:weight
1 " " "*" " " " " " " " " " " " "
2 "*" "*" " " " " " " " " " " " "
3 "*" "*" " " " " " " " " " " "*"
4 "*" "*" " " "*" " " " " " " "*"
5 "*" "*" " " "*" "*" " " " " "*"
6 "*" "*" " " "*" "*" "*" " " "*"
7 "*" "*" "*" "*" "*" " " "*" "*"
8 "*" "*" "*" "*" "*" "*" "*" "*"
> out2$bic
[1] -450.5016 -462.4637 -517.0963 -521.2104 -517.5362 -512.4959 -506.6265 -500.7692
>
> out2$adjr2
[1] 0.6918423 0.7048656 0.7464901 0.7522907 0.7531000 0.7530459 0.7524672 0.7518930
>
> out2$cp
[1] 96.393839 76.732891 12.449353  4.379641  4.122108  5.210932  7.111377  9.000000
>
> out2$which
>
> X <- model.matrix(fit)
Gasoline yield
> # best model according to BIC
> X.bic. <- X[,out2$which[out2$bic==min(out2$bic),]]
> fit.bic <- lm(mpg ~ 0 + X.bic., data=Auto)
> summary(fit.bic)
Estimate Std.Error t value Pr(>|t|)
X.bic.(Intercept) 61.10110 2.44178 25.023 < 2e-16 ***
X.bic.horsepower -0.24512 0.02703 -9.070 < 2e-16 ***
X.bic.weight -9.95311 0.80724 -12.330 < 2e-16 ***
X.bic.originJapan         1.77132    0.55775    3.176  0.00161 **
X.bic.horsepower:weight   0.05079    0.00663    7.661  1.5e-13 ***
>
> # best model according to adjusted r-squared
> X.adjr2. <- X[,out2$which[out2$adjr2==max(out2$adjr2),]]
> fit.adjr2 <- lm(mpg ~ 0 + X.adjr2., data=Auto)
> summary(fit.adjr2)
Estimate Std.Error t value Pr(>|t|)
X.adjr2.(Intercept) 64.882708 3.499511 18.541 < 2e-16 ***
X.adjr2.horsepower -0.268051 0.030981 -8.652 < 2e-16 ***
X.adjr2.weight -9.723237 0.820242 -11.854 < 2e-16 ***
X.adjr2.originJapan 1.771654 0.556839 3.182 0.00158 **
X.adjr2.acceleration -0.172484 0.114518 -1.506 0.13284
X.adjr2.horsepower:weight 0.052590 0.006726 7.819 5.13e-14 ***
>
> # best model according to Mallows’ Cp
> X.Cp. <- X[,out2$which[out2$cp==min(out2$cp),]]
> fit.Cp <- lm(mpg ~ 0 + X.Cp., data=Auto)
> summary(fit.Cp)
Estimate Std.Error t value Pr(>|t|)
X.Cp.(Intercept) 64.882708 3.499511 18.541 < 2e-16 ***
X.Cp.horsepower -0.268051 0.030981 -8.652 < 2e-16 ***
X.Cp.weight -9.723237 0.820242 -11.854 < 2e-16 ***
X.Cp.originJapan 1.771654 0.556839 3.182 0.00158 **
X.Cp.acceleration -0.172484 0.114518 -1.506 0.13284
X.Cp.horsepower:weight 0.052590 0.006726 7.819 5.13e-14 ***
Diagnostic methods
Residual analysis

The vector of ordinary residuals, denoted here by $r = (r_1, \ldots, r_n)^\top$, where $r_i = Y_i - x_i^\top\hat\beta$, may be written as follows:

$$r = Y - X\hat\beta = (I - H)Y = (I - H)(X\beta + \epsilon) = (I - H)\epsilon.$$

That is, $r_i = \epsilon_i - \sum_{j=1}^{n}h_{ij}\epsilon_j$. According to this identity, $r$ and $\epsilon$ are only related by $H$. In the presence of sufficiently small $h_{ij}$'s, $r$ can substitute for $\epsilon$; otherwise its usefulness may be limited. Indeed,

$$\text{E}(r) = (I - H)\text{E}(\epsilon) = (I - H)0 = 0,$$
$$\text{Var}(r) = (I - H)\text{Var}(\epsilon)(I - H)^\top = \sigma^2(I - H)^2 = \sigma^2(I - H).$$

Therefore, $\text{E}(r_i) = 0$, $\text{Var}(r_i) = \sigma^2(1 - h_{ii})$ and $\text{Cov}(r_i, r_j) = -\sigma^2 h_{ij}$. The above indicates that, unlike the random errors $(\epsilon_1, \ldots, \epsilon_n)$, the ordinary residuals $(r_1, \ldots, r_n)$ are neither independent nor identically distributed random variables.
Residual analysis

Internally Studentized residuals:

$$r^{\text{I}}_i = \frac{y_i - x_i^\top\hat\beta}{\hat\sigma\sqrt{1 - h_{ii}}}.$$

Externally Studentized residuals:

$$r^{\text{E}}_i = \frac{y_i - x_i^\top\hat\beta}{\hat\sigma_{(i)}\sqrt{1 - h_{ii}}},$$

where $\hat\sigma^2_{(i)}$ is the estimate of $\sigma^2$ derived from the dataset with the $i$-th observation excluded. According to Cook and Weisberg [1982, page 20], the distribution of $r^{\text{E}}_i$ is Student-$t(n-p-2)$ and it also can be expressed as follows:

$$r^{\text{E}}_i = r^{\text{I}}_i\left(\frac{n - p - 2}{n - p - 1 - (r^{\text{I}}_i)^2}\right)^{\frac{1}{2}}.$$
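A minimal sketch checking both definitions against rstandard() and rstudent() for the richness fit:

fit <- lm(Species ~ Biomass + pH, data = richness)
h <- hatvalues(fit)
rI <- resid(fit) / (summary(fit)$sigma * sqrt(1 - h))   # internally Studentized
all.equal(rI, rstandard(fit))
rE <- rI * sqrt((df.residual(fit) - 1) / (df.residual(fit) - rI^2))
all.equal(rE, rstudent(fit))                            # externally Studentized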
Normal probability plot with simulated envelope

See, for instance, Atkinson [1981].

Step 0: Set $T$ (a positive integer) and $\alpha \in (0, 1)$.

Step 1: Generate $T$ independent samples of size $n$ according to $Y_i \overset{\text{ind}}{\sim} \text{Normal}(x_i^\top\hat\beta, \hat\sigma^2)$ for $i = 1, \ldots, n$, yielding values $y_{it}$, with sample $t$ in column $t$:

          Sample 1   Sample 2   Sample 3   ...   Sample T
           y_11       y_12       y_13      ...    y_1T
            ...        ...        ...      ...     ...
           y_n1       y_n2       y_n3      ...    y_nT

Step 2: Compute $\hat\beta_t = (X^\top X)^{-1}X^\top y_t$ and $\hat\sigma^2_t = \sum_{i=1}^{n}(y_{it} - x_i^\top\hat\beta_t)^2/(n - p - 1)$ for $t = 1, \ldots, T$, where $y_t = (y_{1t}, \ldots, y_{nt})^\top$.

Step 3: Compute the residuals $r_{it}$ for $i = 1, \ldots, n$ and $t = 1, \ldots, T$.
Normal probability plot with simulated envelope

Step 4: Compute the order statistics $r_{(1)t} \leq \ldots \leq r_{(n)t}$ of the residuals within each sample $t = 1, \ldots, T$.

Step 5: For each row $i = 1, \ldots, n$ of the array of order statistics, compute the $100(\frac{\alpha}{2})$-th, 50th and $100(1-\frac{\alpha}{2})$-th percentiles across the $T$ samples, denoted by $p^{(i)}_{\frac{\alpha}{2}}$, $p^{(i)}_{\frac{1}{2}}$ and $p^{(i)}_{1-\frac{\alpha}{2}}$.

Step 6: Plot $p^{(1)}_{\frac{\alpha}{2}}, \ldots, p^{(n)}_{\frac{\alpha}{2}}$ versus the expected values of the order statistics in a sample of size $n$ of the standard normal distribution.
Normal probability plot with simulated envelope

Step 7: Plot $p^{(1)}_{\frac{1}{2}}, \ldots, p^{(n)}_{\frac{1}{2}}$ versus the expected values of the order statistics in a sample of size $n$ of the standard normal distribution.

Step 8: Plot $p^{(1)}_{1-\frac{\alpha}{2}}, \ldots, p^{(n)}_{1-\frac{\alpha}{2}}$ versus the same expected order statistics.

Step 9: Plot the ordered observed residuals $r_{(1)}, \ldots, r_{(n)}$ versus the same expected order statistics; the bands from Steps 6-8 form the simulated envelope.
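A hedged sketch of the algorithm above (T = 100, α = 0.05, externally Studentized residuals; qnorm(ppoints(n)) approximates the expected normal order statistics):

fit <- lm(Species ~ Biomass + pH, data = richness)
n <- nobs(fit); T <- 100; alpha <- 0.05
R <- matrix(0, n, T)
for (t in 1:T) {
  yt <- rnorm(n, mean = fitted(fit), sd = summary(fit)$sigma)  # Step 1
  fitt <- lm(yt ~ 0 + model.matrix(fit))                       # Step 2
  R[, t] <- sort(rstudent(fitt))                               # Steps 3-4
}
bands <- apply(R, 1, quantile, probs = c(alpha/2, 0.5, 1 - alpha/2))  # Step 5
q <- qnorm(ppoints(n))
plot(q, sort(rstudent(fit)), ylim = range(R))                  # Step 9
matlines(q, t(bands), lty = c(2, 1, 2), col = "gray")          # Steps 6-8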
Influence or Sensitivity analysis

The Cook's distance, denoted here by $\text{CD}_i$, is a measure of the influence of the $i$-th observation on the estimate of $\beta$, which consists of the "distance" between $\hat\beta$ and $\hat\beta_{(i)}$ using a metric based on the matrix $\widehat{\text{Var}}(\hat\beta)$. The Cook's distance may be written as follows (see Cook [1977]):

$$\text{CD}_i = \frac{1}{p+1}\big(\hat\beta_{(i)} - \hat\beta\big)^\top\big[\widehat{\text{Var}}(\hat\beta)\big]^{-1}\big(\hat\beta_{(i)} - \hat\beta\big) = \frac{1}{\hat\sigma^2(p+1)}\big(\hat\beta_{(i)} - \hat\beta\big)^\top X^\top X\big(\hat\beta_{(i)} - \hat\beta\big) = \frac{1}{\hat\sigma^2(p+1)}\sum_{l=1}^{n}\big(x_l^\top\hat\beta_{(i)} - x_l^\top\hat\beta\big)^2$$

If $\hat\beta_{(i)}$ is replaced by $\hat\beta - \dfrac{r_i}{1 - h_{ii}}(X^\top X)^{-1}x_i$, then $\text{CD}_i$ becomes

$$\text{CD}_i = \frac{1}{p+1}\,\frac{(y_i - x_i^\top\hat\beta)^2}{\hat\sigma^2(1 - h_{ii})}\,\frac{h_{ii}}{1 - h_{ii}} = \frac{1}{p+1}\,(r^{\text{I}}_i)^2\,\frac{h_{ii}}{1 - h_{ii}}.$$
Influence or sensitivity analysis

(1) The matrix $\widehat{\text{Var}}(\hat\beta)$ is positive definite. Therefore, $\text{CD}_i \geq 0$, and $\text{CD}_i = 0$ if and only if $\hat\beta = \hat\beta_{(i)}$.
(2) If $r^{\text{I}}_i = 0$ (that is, if $y_i = \hat\mu_i$), then $\hat\beta_{(i)} = \hat\beta$ and $\text{CD}_i = 0$.
(3) For fixed $h_{ii}$, the value of the Cook's distance is a strictly increasing function of $(r^{\text{I}}_i)^2$.
(4) For fixed $(r^{\text{I}}_i)^2$, the value of the Cook's distance is a strictly increasing function of $h_{ii}$.

A univariate version of $\text{CD}_i$ is known as $\text{DFBETA}_{j(i)}$, and it is given by

$$\text{DFBETA}_{j(i)} = \hat\beta_j - \hat\beta_{j(i)}, \qquad j = 0, \ldots, p,$$

where $\hat\beta_{j(i)}$ is the element $(j+1)$ of $\hat\beta_{(i)}$. Similarly, $\text{DFBETAS}_{j(i)}$ is given by

$$\text{DFBETAS}_{j(i)} = \frac{\hat\beta_j - \hat\beta_{j(i)}}{\text{se}(\hat\beta_{j(i)})}, \qquad j = 0, \ldots, p.$$
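A minimal check that the closed-form expression above matches cooks.distance(), with dfbeta()/dfbetas() giving the univariate versions (note R scales DFBETAS by the deleted-case standard error):

fit <- lm(Species ~ Biomass + pH, data = richness)
h <- hatvalues(fit)
rI <- rstandard(fit)                       # internally Studentized residuals
CD <- rI^2 * h / ((1 - h) * length(coef(fit)))
all.equal(CD, cooks.distance(fit))
head(dfbeta(fit))                          # DFBETA_{j(i)}
head(dfbetas(fit))                         # DFBETAS_{j(i)}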
Leverage points

The estimation of $\mu = X\beta$ may be written as $\hat\mu = X\hat\beta = Hy$, where $H$ is the orthogonal projection matrix. Thus,

$$\hat\mu_i = h_i^\top y = h_{i1}y_1 + \ldots + h_{in}y_n, \qquad i = 1, \ldots, n,$$

where $h_i = (h_{i1}, \ldots, h_{in})^\top$ represents the $i$-th row of $H$. Therefore (see Appendix A-2),

$$\frac{\partial\hat\mu_i}{\partial y_i} = h_{ii} \qquad\text{and}\qquad \frac{1}{n}\sum_{i=1}^{n}\frac{\partial\hat\mu_i}{\partial y_i} = \frac{1}{n}\text{trace}(H) = \frac{p+1}{n}.$$

As a result, "high" values of $h_{ii}$ indicate that $y_i$ has a relatively "large" impact on $\hat\mu_i$, meaning that there is relatively little information about the relationship between $y$ and $x$ when $x = x_i$. Thus, the accuracy of the estimation of $\mu$ when $x = x_i$ is lower than that for other values of $x$, as

$$\text{E}(\hat\mu_i) = \text{E}(x_i^\top\hat\beta) = x_i^\top\text{E}(\hat\beta) = \mu_i$$

and

$$\text{Var}(\hat\mu_i) = \text{Var}(x_i^\top\hat\beta) = x_i^\top\text{Var}(\hat\beta)x_i = \sigma^2 x_i^\top(X^\top X)^{-1}x_i = \sigma^2 h_{ii}.$$
Leverage points

If the model has an intercept, then (Cook and Weisberg [1982, page 12])

$$\frac{1}{n} \leq h_{ii} \leq \frac{1}{c},$$

where $c$ represents the number of rows of $X$ equal to $x_i$. In addition, if $h_{ii} = 1$ then $\hat\mu_i = y_i$.

For example, if $\mu_i = \beta x_i$ for $i = 1, \ldots, n$ then

$$h_{ii} = \frac{x_i^2}{\sum_{j=1}^{n}x_j^2}.$$

Similarly, if $\mu_i = \beta_0 + \beta_1 x_i$ for $i = 1, \ldots, n$ then

$$h_{ii} = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_{j=1}^{n}(x_j - \bar{x})^2}.$$
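A quick numerical check of the simple-regression leverage formula, using the Auto data as prepared earlier:

fit1 <- lm(mpg ~ horsepower, data = Auto)
x <- Auto$horsepower
h <- 1/length(x) + (x - mean(x))^2 / sum((x - mean(x))^2)
all.equal(unname(hatvalues(fit1)), h)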
Testing the constant variance assumption

To assess the assumption of constant variance, the following heteroskedastic model is considered:

$$Y_i = \mu_i + \epsilon_i, \qquad \mu_i = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip}, \qquad \epsilon_i \overset{\text{ind}}{\sim} \text{Normal}(0, \sigma_i^2), \qquad \log(\sigma_i^2) = \gamma_0 + \gamma_1 m_{i1} + \ldots + \gamma_q m_{iq},$$

where $m_i = (m_{i1}, \ldots, m_{iq})^\top$ are the values of $q$ explanatory variables, quantitative and/or qualitative, measured on the $i$-th individual. The sets of variables $x_1, \ldots, x_p$ and $m_1, \ldots, m_q$ may or may not coincide. Thus, assessing the assumption of constant variance is equivalent to assessing the following hypothesis in the heteroskedastic model described above:

$$H_0: \gamma = 0 \quad\text{versus}\quad H_1: \gamma \neq 0,$$

where $\gamma = (\gamma_1, \ldots, \gamma_q)^\top$. If $H_0$ is not rejected, then it is possible to conclude that there is no significant statistical evidence against the assumption of constant variance.
Testing the constant variance assumption

The decision rule is the following:

Reject $H_0$ at the approximate significance level of $100(\alpha)\%$ if $\xi > \chi^2_{1-\alpha}(q)$.

The statistic of the test is given by

$$\xi = \frac{1}{2}\,z^\top\tilde{M}(\tilde{M}^\top\tilde{M})^{-1}\tilde{M}^\top z,$$

where $\tilde{M} = (I - n^{-1}\mathbf{1}\mathbf{1}^\top)M$, $M = (m_1, \ldots, m_n)^\top$ and $z = (z_1, \ldots, z_n)^\top$, in which

$$z_i = \frac{n\,(y_i - \hat\mu_i)^2}{\sum_{k=1}^{n}(y_k - \hat\mu_k)^2} - 1.$$

See Breusch and Pagan [1979], Cook and Weisberg [1983].

89 / 460
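The statistic is simple to compute directly. A minimal sketch, assuming for illustration that the same covariates enter the mean and the dispersion models (so $M$ is the model matrix without the intercept); glmtoolbox::vdtest(), used later in this section, implements this test, so the sketch is only a cross-check:

# Sketch of the score statistic xi (Breusch-Pagan / Cook-Weisberg)
fit <- lm(dist ~ speed, data = cars)            # illustrative fit
e   <- residuals(fit)
z   <- length(e) * e^2 / sum(e^2) - 1           # z_i as defined above
M   <- scale(model.matrix(fit)[, -1, drop = FALSE], scale = FALSE)
xi  <- drop(t(z) %*% M %*% solve(crossprod(M), t(M) %*% z)) / 2
q   <- ncol(M)
c(statistic = xi, p.value = pchisq(xi, df = q, lower.tail = FALSE))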
Collinearity

The estimate of the variance-covariance matrix of $(\hat\beta_0, \hat\beta_1, \ldots, \hat\beta_p)^\top$ may be expressed as follows:

$$\hat\sigma^2(X^\top X)^{-1} = \hat\sigma^2\begin{pmatrix} \mathbf{1}^\top\mathbf{1} & \mathbf{1}^\top X_1 & \cdots & \mathbf{1}^\top X_p \\ X_1^\top\mathbf{1} & X_1^\top X_1 & \cdots & X_1^\top X_p \\ \vdots & \vdots & \ddots & \vdots \\ X_p^\top\mathbf{1} & X_p^\top X_1 & \cdots & X_p^\top X_p \end{pmatrix}^{-1},$$

where $X_j$ represents the $(j+1)$th column of $X$. Therefore, the estimate of the variance-covariance matrix of $(\hat\beta_1, \ldots, \hat\beta_p)^\top$ corresponds to the $p \times p$ dimension submatrix at the bottom right, which, due to the expression for the inverse of a partitioned matrix (see Appendix A-8), may be written as

$$\hat\sigma^2(\bar{X}^\top\bar{X})^{-1} = \hat\sigma^2\begin{pmatrix} \bar{X}_1^\top\bar{X}_1 & \bar{X}_1^\top\bar{X}_2 & \cdots & \bar{X}_1^\top\bar{X}_p \\ \vdots & \vdots & \ddots & \vdots \\ \bar{X}_p^\top\bar{X}_1 & \bar{X}_p^\top\bar{X}_2 & \cdots & \bar{X}_p^\top\bar{X}_p \end{pmatrix}^{-1},$$

where $\bar{X} = [\bar{X}_1 \ldots \bar{X}_p]$ and $\bar{X}_j = (x_{1j} - \bar{x}_j, \ldots, x_{nj} - \bar{x}_j)^\top$ represents

90 / 460
Collinearity

the centered version of $X_j$. Thus, $\widehat{\text{Var}}(\hat\beta_j)$ is the $p$th element of the main diagonal of the matrix given by

$$\hat\sigma^2\begin{pmatrix} \bar{X}_{-j}^\top\bar{X}_{-j} & \bar{X}_{-j}^\top\bar{X}_j \\ \bar{X}_j^\top\bar{X}_{-j} & \bar{X}_j^\top\bar{X}_j \end{pmatrix}^{-1},$$

where $\bar{X}_{-j}$ represents the matrix $\bar{X}$ after excluding its $j$th column. According to the expression for the inverse of a partitioned matrix (see Appendix A-8), $\widehat{\text{Var}}(\hat\beta_j)$ reduces to

$$\begin{aligned}
\widehat{\text{Var}}(\hat\beta_j) &= \hat\sigma^2\big[\bar{X}_j^\top\bar{X}_j - \bar{X}_j^\top\bar{X}_{-j}(\bar{X}_{-j}^\top\bar{X}_{-j})^{-1}\bar{X}_{-j}^\top\bar{X}_j\big]^{-1} \\
&= \hat\sigma^2\big[\bar{X}_j^\top\big(I - \bar{X}_{-j}(\bar{X}_{-j}^\top\bar{X}_{-j})^{-1}\bar{X}_{-j}^\top\big)\bar{X}_j\big]^{-1} \\
&= \hat\sigma^2\big[\bar{X}_j^\top(I - \bar{H}_{-j})\bar{X}_j\big]^{-1} \\
&= \hat\sigma^2\Big\{\big[(I - \bar{H}_{-j})\bar{X}_j\big]^\top\big[(I - \bar{H}_{-j})\bar{X}_j\big]\Big\}^{-1} \\
&= \frac{\hat\sigma^2}{\text{RSS}_{\bar{X}_j|\bar{X}_{-j}}} = \frac{\hat\sigma^2}{\sum_{i=1}^{n}(x_{ij} - \bar{x}_j)^2} \times \frac{1}{1 - R^2_{\bar{X}_j|\bar{X}_{-j}}},
\end{aligned}$$

91 / 460
Collinearity

where $\bar{H}_{-j} = \bar{X}_{-j}(\bar{X}_{-j}^\top\bar{X}_{-j})^{-1}\bar{X}_{-j}^\top$, $\text{RSS}_{\bar{X}_j|\bar{X}_{-j}}$ and $R^2_{\bar{X}_j|\bar{X}_{-j}}$ represent, respectively, the orthogonal projection matrix (which implies that it is a symmetric and idempotent matrix), the Residual Sum of Squares (RSS), and the $R^2$ (i.e., coefficient of determination) of the linear regression with $\bar{X}_j$ as the dependent variable and $\bar{X}_{-j}$ as the model matrix.

The Variance Inflation Factor of $\hat\beta_j$, denoted by VIF$_j$, is given by

$$\text{VIF}_j = \frac{1}{1 - R^2_{\bar{X}_j|\bar{X}_{-j}}}, \quad j = 1, \ldots, p.$$

VIF$_j$ can be interpreted as the inflation in the estimate of the variance of $\hat\beta_j$ due to redundant information present in the model matrix $\bar{X}$, that is, information, from a linear perspective, that $\bar{X}_j$ contributes but that $\bar{X}_{-j}$ already contains. Therefore, the optimum value of VIF$_j$ is 1, which is obtained when $\bar{X}_{-j}^\top\bar{X}_j = 0$, that is, when $\bar{X}_j$ is non-correlated with the columns of $\bar{X}_{-j}$.

92 / 460
Collinearity
To generalize VIF, suppose that X
j
is a matrix with r columns,
which represents, for example, a qualitative variable with
r + 1
levels. The 100(1α)% confidence region for β
j
can be expressed
as follows:
CR
1α
(β
j
) = {β R
r
: (β
ˆ
β
j
)
[
ˆ
Var(
ˆ
β
j
)]
1
(β
ˆ
β)/F
(r1,np1)
1α
1}
The (squared) area/volume of an ellipse/ellipsoid/hyperellipsoid
composed by the
z = (z
1
, . . . , z
r
)
R
r
such that,
(z
1
a
1
)
2
b
2
1
+ . . . +
(z
r
a
r
)
2
b
2
r
= (z a)
B
1
(z a) 1,
where a = (a
1
, . . . , a
r
)
and B = diag{b
2
1
, . . . , b
2
r
}, is proportional
to det(B) =
r
Q
l=1
b
2
l
. Then, by the Spectral Decomposition (see, for
instance,
Gentle [2007, section 3.8.7]) of
ˆ
Var(
ˆ
β
j
)
, the (squared)
area/volume of CR
1α
(β
j
), denoted by V
2
, is proportional to the
determinant of
ˆ
Var(
ˆ
β
j
)
, which is the product of its eige nvalues.
93 / 460
Collinearity

Then, the Generalized Variance Inflation Factor, denoted by GVIF$_j$, is defined as (Fox et al. [1992])

$$\text{GVIF}_j = \frac{V^2}{V_0^2} = \frac{\big|\bar{X}_j^\top\bar{X}_j\big|}{\big|\bar{X}_j^\top\bar{X}_j - \bar{X}_j^\top\bar{X}_{-j}(\bar{X}_{-j}^\top\bar{X}_{-j})^{-1}\bar{X}_{-j}^\top\bar{X}_j\big|},$$

where $V_0^2$ represents the (squared) area/volume of $\text{CR}_{1-\alpha}(\beta_j)$ when the columns of the matrix $\bar{X}_j$ are non-correlated with those of $\bar{X}_{-j}$ (i.e., when $\bar{X}_{-j}^\top\bar{X}_j = 0$).

GVIF$_j$ can be interpreted as the inflation in the (squared) area/volume of $\text{CR}_{1-\alpha}(\beta_j)$ due to redundant information present in the model matrix $\bar{X}$, that is, information, from a linear perspective, that $\bar{X}_j$ contributes but that $\bar{X}_{-j}$ already contains. Therefore, the optimum value of GVIF$_j$ is 1. If $r = 1$ then GVIF$_j$ reduces to VIF$_j$. To make the GVIF$_j$ comparable for different values of $r$, it may be standardized as follows:

$$\text{GVIF}_j^{\frac{1}{2r}}$$

94 / 460
Salaries for Professors

These data, available in the object Salaries of the library car, are the 2008-09 nine-month academic salaries for Assistant Professors, Associate Professors and Professors in a college in the U.S. The data were collected as part of the ongoing effort of the college's administration to monitor salary differences between male and female faculty members. The variables are the following:

salary: nine-month salary, in dollars ($y$).
yrs.since.phd: years since PhD ($x_1$).
yrs.service: years of service ($x_2$).
discipline: a factor with two levels: "A" (theoretical departments) ($x_3 = 0$) and "B" (applied departments) ($x_3 = 1$).
sex: a factor with two levels: "Female" ($x_4 = 0$) and "Male" ($x_4 = 1$).
rank: a factor with three levels: "AsstProf" ($x_5 = 0$ and $x_6 = 0$), "AssocProf" ($x_5 = 1$ and $x_6 = 0$), and "Prof" ($x_5 = 0$ and $x_6 = 1$).

95 / 460
Salaries for Professors
> data(Salaries, package="car")
> fit <- lm(salary ~ yrs.since.phd + yrs.service + discipline +
+ sex + rank, data=Salaries)
> gvif(fit)
                 GVIF df GVIF**(1/(2*df))
yrs.service 5.9230 1 2.4337
yrs.since.phd 7.5189 1 2.7421
discipline 1.0641 1 1.0316
sex 1.0308 1 1.0153
rank 2.0132 2 1.1912
>
> f1 <- lm(yrs.service ~ yrs.since.phd + discipline + sex +
+ rank, data=Salaries)
> 1/(1-summary(f1)$r.squared)
5.923038
> f2 <- lm(yrs.since.phd ~ yrs.service + discipline + sex +
+ rank, data=Salaries)
> 1/(1-summary(f2)$r.squared)
7.518936
96 / 460
Box-Tidwell transformation

Let's consider the following normal nonlinear model:

$$Y_i = \mu_i + \epsilon_i, \quad \mu_i = \beta_0 + \beta_1 h_1(x_{i1}) + \ldots + \beta_r h_r(x_{ir}) + \beta_{r+1}x_{i(r+1)} + \ldots + \beta_p x_{ip}, \quad \epsilon_i \overset{\text{ind}}{\sim} \text{Normal}(0, \sigma^2),$$

where $x_{i1}, \ldots, x_{ir}$ are the values of $r$ quantitative and strictly positive explanatory variables measured on the subject $i$, and

$$h_j(x) = \begin{cases} x^{\tau_j} & \text{if } \tau_j \neq 0, \\ \log(x) & \text{if } \tau_j = 0, \end{cases} \quad j = 1, \ldots, r.$$

Hence, the interest parameter vector is $(\beta_0, \beta_1, \ldots, \beta_p, \tau_1, \ldots, \tau_r)^\top$. Box and Tidwell [1962] provide an efficient algorithm to estimate $\tau = (\tau_1, \ldots, \tau_r)^\top$ by the maximum likelihood method. Note that evaluation of the following hypothesis systems is of interest:

$$H_0: \tau_j = 1 \quad\text{versus}\quad H_1: \tau_j \neq 1,$$

for $j = 1, \ldots, r$, and

$$H_0: (\tau_1, \ldots, \tau_r)^\top = \mathbf{1} \quad\text{versus}\quad H_1: (\tau_1, \ldots, \tau_r)^\top \neq \mathbf{1}.$$

97 / 460
Power generation by windmills

The analysis of these data, available in the object windmill of the library GLMsData, aims to explain the differences between the amounts of direct current (DC) output from windmills using the differences between wind velocities (Wind).

[Scatter plot: amount of direct current versus wind velocity]

$$Y_i = \mu_i + \epsilon_i, \quad \mu_i = \beta_0 + \beta_1 x_i, \quad \epsilon_i \overset{\text{ind}}{\sim} \text{Normal}(0, \sigma^2)$$

$y$ = Amount of direct current output from windmills
$x$ = Wind velocity

Is there a function $h(\cdot)$ such that the relationship between $y$ and $h(x)$ is approximately linear?

98 / 460
Power generation by windmills

> data(windmill, package="GLMsData")
> fit <- lm(DC ~ Wind, data=windmill)
>
> BoxTidwell(fit, transf= ~ Wind)

In the output below, "Estimate" is $\hat\tau_j$ and each row tests $H_0: \tau_j = 1$ versus $H_1: \tau_j \neq 1$; the Wald test at the end assesses $H_0: (\tau_1, \ldots, \tau_r)^\top = \mathbf{1}$ versus $H_1: (\tau_1, \ldots, \tau_r)^\top \neq \mathbf{1}$.

     Estimate Std.Error        95% CI       z-value   Pr(>|z|)
Wind -0.83334   0.15566 -1.13843 -0.52826  -11.778  < 2.22e-16

Wald test for null hypothesis that all taus are 1:
chi = 138.72 , df = 1 , Pr(>chi) = < 2.22e-16

>
> fit2 <- update(fit,formula = ~ I(Wind**(-0.833)))
> c(adjR2(fit),adjR2(fit2))
[1] 0.86904 0.98010
> c(AIC(fit),AIC(fit2))
[1] 2.67724 -44.43086
> c(BIC(fit),BIC(fit2))
[1] 6.33387 -40.77424

99 / 460
New York air quality measurements

> fit <- lm(log(Ozone) ~ Solar.R + Wind + Temp, data=airquality)
>
> BoxTidwell(fit, transf= ~ Solar.R + Wind + Temp)

In the output below, "Estimate" is $\hat\tau_j$ and each row tests $H_0: \tau_j = 1$ versus $H_1: \tau_j \neq 1$; the Wald test at the end assesses $H_0: (\tau_1, \ldots, \tau_r)^\top = \mathbf{1}$ versus $H_1: (\tau_1, \ldots, \tau_r)^\top \neq \mathbf{1}$.

         Estimate Std.Error        95% CI       z-value    Pr(>|z|)
Solar.R  -0.17013   0.32306 -0.80331 0.46305  -3.62207  0.00029226
Wind     -0.07896   0.61554 -1.28541 1.12748  -1.75286  0.07962542
Temp      2.64564   1.69191 -0.67043 5.96172   0.97266  0.33072429

Wald test for null hypothesis that all taus are 1:
chi = 18.179 , df = 3 , Pr(>chi) = 0.00040397

>
> fit2 <- update(fit, formula = ~ log(Solar.R) + Wind + Temp)
> c(adjR2(fit),adjR2(fit2))
[1] 0.65502 0.67043
> c(AIC(fit),AIC(fit2))
[1] 170.8251 165.7519
> c(BIC(fit),BIC(fit2))
[1] 184.3727 179.2995

100 / 460
Richness of the species

The analysis of these data, available in the object richness of the library glmtoolbox, aims to explain the differences between plots as related to species richness (represented by a count of the number of plant species (richness)) using the differences between plots with respect to biomasses (Biomass) and soil pH levels: low (pH="low"), medium (pH="mid") and high (pH="high").

[Scatter plot: number of plant species versus biomass, by pH level (low, mid, high)]

$$Y_i = \mu_i + \epsilon_i, \quad \mu_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3}, \quad \epsilon_i \overset{\text{ind}}{\sim} \text{Normal}(0, \sigma^2)$$

$y$ = Number of plant species
$x_1$ = Biomass
$x_2 = 0$ and $x_3 = 0$ if pH="low"
$x_2 = 1$ and $x_3 = 0$ if pH="mid"
$x_2 = 0$ and $x_3 = 1$ if pH="high"

101 / 460
Richness of the species

> data(richness)
> fit <- lm(Species ~ Biomass + pH, data=richness)
> residuals2(fit,type="external",plot.it=TRUE,identify=3)

[Plot: externally studentized residuals $r_{E_i}$ versus fitted values $\hat\mu$; observations #1, #3 and #18 are identified]

102 / 460
Richness of the species

> cbind(richness,mu=fitted(fit))[c(1,3,18),]
     pH   Biomass Species       mu
1  high 0.4692972      30 38.96482
3  high 2.0897785      44 34.54483
18 high 7.2420621      36 20.49156
> envelope(fit,rep=100,identify=3)

[Normal probability plot with simulated envelope: observed versus expected quantiles; observations #3 and #18 are identified]

103 / 460
Richness of the species

> cd <- cooks.distance(fit)
> plot(cd, type="h", xlab="Observation Index", ylab="Cook's Distance")
> identify(cd, n=4)

[Plot: Cook's distance versus observation index; observations #1, #3, #18 and #20 are identified]

104 / 460
Richness of the species
Which subset of the potentially influential observations (#1, #3, #18, #20) causes
the largest change in the estimate of
β when removed from the data set?
> infl <- c(1,3,18,20)
> n <- length(infl)
> metr <- chol2inv(chol(vcov(fit)))
> hm <- 2**n - 1
> results <- matrix(0, nrow=hm, ncol=1)
> id <- 1; nfil <- vector()
> for(i in 1:n){
+ pos <- combn(n,i)
+ for(j in 1:ncol(pos)){
+ fitemp <- update(fit, subset=-pos[,j])
+ difs <- coef(fitemp)-coef(fit)
+ results[id,1] <- t(difs)%*%metr%*%(difs)
+ nfil <- c(nfil,paste(infl[pos[,j]], collapse=","))
+ id <- id + 1
+ }
+ }
> nfil[results==max(results)]
[1] "3,18,20"
The largest change in the estimate of β occurs when observations #3, #18 and
#20 are removed.
105 / 460
Richness of the species
> a <- round(summary(fit)$coefficients, digits=5); a
Estimate Std. Error t value Pr(>|t|)
(Intercept) 17.62484 0.78899 22.33861 0
Biomass -2.72758 0.17167 -15.88830 0
pHmid 11.97819 1.00200 11.95431 0
pHhigh 22.62002 1.08183 20.90895 0
>
> fit2 <- update(fit, subset=-c(3,18,20))
> b <- round(summary(fit2)$coefficients, digits=5); b
Estimate Std. Error t value Pr(>|t|)
(Intercept) 17.51424 0.66545 26.31922 0
Biomass -2.67778 0.14883 -17.99242 0
pHmid 11.91347 0.83936 14.19345 0
pHhigh 21.85045 0.92175 23.70550 0
>
> round(100*(b[,1:2]-a[,1:2])/abs(a[,1:2]), digits=3)
Estimate Std. Error
(Intercept) -0.628 -15.658
Biomass 1.826 -13.305
pHmid -0.540 -16.232
pHhigh -3.402 -14.797
106 / 460
Richness of the species

> h <- apply(svd(model.matrix(fit))$u**2,1,sum)
> plot(h, type="h", xlab="Observation Index", ylab=expression(h[ii]))
> identify(h, n=7)

[Plot: leverages $h_{ii}$ versus observation index; observations #1, #10, #11, #20, #30, #44 and #51 are identified]

> richness[c(1,10,11,20,30,44,51), c("pH","Biomass")]
> boxplot(Biomass ~ pH, data=richness)

107 / 460
Richness of the species
> vdtest(fit)
Score test for varying dispersion parameter
Statistic = 11.07537
degrees of freedom = 3
p-value = 0.011325
> fit3 <- update(fit, subset=-c(18))
> vdtest(fit3)
Score test for varying dispersion parameter
Statistic = 5.29641
degrees of freedom = 3
p-value = 0.15134
> vdtest(fit3,varformula=~ Biomass + pH + Biomass:pH)
Score test for varying dispersion parameter
Statistic = 6.14185
degrees of freedom = 5
p-value = 0.29266
108 / 460
Lognormal model
109 / 460
Lognormal distribution

If $Y \sim \text{Normal}(\mu, \sigma^2)$, then $T = \exp(Y) \sim \text{Lognormal}(\exp(\mu), \sigma^2)$.
If $T \sim \text{Lognormal}(\eta, \sigma^2)$, then $Y = \log(T) \sim \text{Normal}(\log(\eta), \sigma^2)$.

If $T \sim \text{Lognormal}(\eta, \sigma^2)$, then:

(1) The distribution of $T$ is strictly positive and right-skewed.
(2) The probability density function of $T$ is

$$f_T(t; \eta, \sigma^2) = \frac{1}{t\sqrt{2\pi\sigma^2}}\exp\left(-\frac{[\log(t) - \log(\eta)]^2}{2\sigma^2}\right)\mathrm{I}_{(0,\infty)}(t).$$

(3) The mean, variance and coefficient of variation of $T$ are

$$\eta\exp\left(\frac{\sigma^2}{2}\right), \quad \eta^2\exp(\sigma^2)\big[\exp(\sigma^2) - 1\big] \quad\text{and}\quad \sqrt{\exp(\sigma^2) - 1},$$

respectively.
(4) The distribution of $aT$ is $\text{Lognormal}(a\eta, \sigma^2)$ for every positive constant $a$.

110 / 460
Probability density function of Lognormal(1, σ²)

[Figure: density curves of $f_T(t; \eta = 1, \sigma^2)$ for $\sigma^2$ = 0.05, 0.12, 0.30, 0.8, 2.0 and 5.0]

111 / 460
Lognormal regression

(5) The $100(\alpha)$th percentile of $T$ reduces to

$$Q_T(\alpha) = \eta\exp(\sigma Z_\alpha),$$

where $Z_\alpha$ represents the $100(\alpha)$th percentile of the Normal(0, 1). For example, the median of $T$ is $\eta$.

The lognormal regression model and its normal linear form are equivalent:

$$T_i = \eta_i\,\varepsilon_i, \quad \log(\eta_i) = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip}, \quad \varepsilon_i \overset{\text{ind}}{\sim} \text{Lognormal}(1, \sigma^2)$$
$$Y_i = \mu_i + \epsilon_i, \quad \mu_i = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip}, \quad \epsilon_i \overset{\text{ind}}{\sim} \text{Normal}(0, \sigma^2),$$

where $Y_i = \log(T_i)$, $\mu_i = \log(\eta_i)$ and $\epsilon_i = \log(\varepsilon_i)$. Therefore, if $T_i \overset{\text{ind}}{\sim} \text{Lognormal}(\eta_i, \sigma^2)$, then

(1) $E(T_i) = \eta_i\exp(\sigma^2/2)$
(2) $CV(T_i) = \sqrt{\exp(\sigma^2) - 1}$
(3) $Q_{T_i}(\alpha) = \eta_i\exp(\sigma Z_\alpha)$

112 / 460
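In practice the lognormal model is fitted as a normal linear model for the log-response; a brief sketch (illustrative data only) recovering the quantities above from such a fit:

# Sketch: lognormal regression via lm() on the log scale
fit  <- lm(log(dist) ~ speed, data = cars)   # illustrative only
s2   <- summary(fit)$sigma^2                 # estimate of sigma^2
eta  <- exp(fitted(fit))                     # eta_i, the median of T_i
ET   <- eta * exp(s2/2)                      # E(T_i)
CVT  <- sqrt(exp(s2) - 1)                    # CV(T_i), free of x_i
q90  <- eta * exp(sqrt(s2) * qnorm(0.90))    # 90th percentile of T_i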
Gasoline yield

The analysis of the data in the object Auto of the library ISLR is aimed at explaining the differences between the vehicles with respect to their gasoline yield (mpg), in miles per gallon, using other vehicle characteristics such as their horsepower (horsepower) and their weight (weight), in kilograms.

[Scatter plot: gasoline yield versus horsepower, by weight level (low, mid, high)]

$$Y_i \overset{\text{ind}}{\sim} \text{Lognormal}(\mu_i, \sigma^2), \quad \log(\mu_i) = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i1}x_{i2}$$

$y$ = Gasoline yield
$x_1$ = Horsepower
$x_2$ = Weight

113 / 460
Parameter interpretation: without interaction

The expected/median gasoline yield for a vehicle with horsepower $x_1$ and weight $x_2$ is proportional to $\exp(\beta_0 + \beta_1 x_1 + \beta_2 x_2)$.

Consider two vehicles whose weights are the same but whose horsepowers are $x + \delta$ and $x$, respectively. The expected/median gasoline yield of the first vehicle is $\exp(\beta_1\delta)$ times that of the second vehicle:

$$\frac{\mu_{x+\delta}}{\mu_x} = \frac{\exp(\beta_0 + \beta_1(x + \delta) + \beta_2 x_2)}{\exp(\beta_0 + \beta_1 x + \beta_2 x_2)} = \exp(\beta_1\delta).$$

Consider two vehicles whose horsepowers are the same but whose weights are $x + \delta$ and $x$, respectively. The expected/median gasoline yield of the first vehicle is $\exp(\beta_2\delta)$ times that of the second vehicle:

$$\frac{\mu_{x+\delta}}{\mu_x} = \frac{\exp(\beta_0 + \beta_1 x_1 + \beta_2(x + \delta))}{\exp(\beta_0 + \beta_1 x_1 + \beta_2 x)} = \exp(\beta_2\delta).$$

114 / 460
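A quick numerical reading of these ratios, assuming for illustration a no-interaction fit on the Auto data (the chosen $\delta$ is arbitrary):

# Sketch: multiplicative effects in the no-interaction model
data(Auto, package = "ISLR")
fit   <- lm(log(mpg) ~ horsepower + weight, data = Auto)
delta <- 10                             # 10 extra horsepower
exp(coef(fit)["horsepower"] * delta)    # ratio mu_{x+delta}/mu_x
# a value of, say, 0.93 would mean a 7% lower expected/median
# gasoline yield per 10 additional horsepower, weight held fixed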
Parameter interpretation: with interaction

The expected/median gasoline yield for a vehicle with horsepower $x_1$ and weight $x_2$ is proportional to $\exp(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2)$.

Two vehicles have weights equal to $z$, but horsepowers $x + \delta$ and $x$, respectively. The expected/median gasoline yield of the first vehicle is $\exp[(\beta_1 + \beta_3 z)\delta]$ times that of the second vehicle:

$$\frac{\mu_{x+\delta}}{\mu_x} = \frac{\exp(\beta_0 + \beta_1(x + \delta) + \beta_2 z + \beta_3(x + \delta)z)}{\exp(\beta_0 + \beta_1 x + \beta_2 z + \beta_3 xz)} = \exp[(\beta_1 + \beta_3 z)\delta].$$

Two vehicles have horsepowers equal to $z$, but weights $x + \delta$ and $x$, respectively. The expected/median gasoline yield of the first vehicle is $\exp[(\beta_2 + \beta_3 z)\delta]$ times that of the second vehicle:

$$\frac{\mu_{x+\delta}}{\mu_x} = \frac{\exp(\beta_0 + \beta_1 z + \beta_2(x + \delta) + \beta_3 z(x + \delta))}{\exp(\beta_0 + \beta_1 z + \beta_2 x + \beta_3 zx)} = \exp[(\beta_2 + \beta_3 z)\delta].$$

115 / 460
Gasoline yield
> data(Auto, package="ISLR")
> Auto <- within(Auto, weight <- weight/1000)
> summary(Auto[,c("mpg","horsepower","weight")])
mpg horsepower weight
Min. : 9.00 Min. : 46.0 Min. :1.613
1st Qu.:17.00 1st Qu.: 75.0 1st Qu.:2.225
Median :22.75 Median : 93.5 Median :2.804
Mean :23.45 Mean :104.5 Mean :2.978
3rd Qu.:29.00 3rd Qu.:126.0 3rd Qu.:3.615
Max. :46.60 Max. :230.0 Max. :5.140
>
> fit1 <- lm(mpg ~ horsepower + weight + horsepower:weight, data=Auto)
> fit2 <- lm(log(mpg) ~ horsepower + weight + horsepower:weight, data=Auto)
>
> stepCriterion(fit1,criterion="bic",direction="forward",trace=FALSE)$final
"~ weight + horsepower + weight:horsepower"
> stepCriterion(fit2,criterion="bic",direction="forward",trace=FALSE)$final
"~ weight + horsepower + weight:horsepower"
>
> adjR2(fit1,fit2)
RSS R-squared df adj.R-squared
fit1 5992.021638 0.74844 4 0.74649
fit2 9.084858 0.79905 4 0.79750
116 / 460
Gasoline yield

> c(AIC(fit1), BIC(fit1))
2191.401 2211.258
> c(AIC(fit2), BIC(fit2)) + 2*sum(log(Auto[,"mpg"]))
2075.781 2095.638
>
> summary(fit2)

In the coefficients table, "Estimate" is $\hat\beta_j$, "Std.Error" is $\hat\sigma c_j$, "t-value" is $\hat\beta_j/(\hat\sigma c_j)$, and "Pr(>|t|)" is $\Pr[F_{(1,n-p-1)} > t_j^2]$, the p-value for $H_0: \beta_j = 0$ versus $H_1: \beta_j \neq 0$.

                    Estimate  Std.Error t-value Pr(>|t|)
(Intercept)        4.5121298  0.0912295  49.459  < 2e-16
horsepower        -0.0071145  0.0010622  -6.698 7.44e-11
weight            -0.3618254  0.0301305 -12.009  < 2e-16
horsepower:weight  0.0011991  0.0002589   4.631 4.96e-06

Residual standard error: 0.153 on 388 degrees of freedom   ($\hat\sigma$ and $n - p - 1$)
Multiple R-squared: 0.7991 ($R^2$), Adjusted R-squared: 0.7975 ($R^2_{adj}$)
F-statistic: 514.3 on 3 and 388 DF, p-value: < 2.2e-16   ($F$, $r$, $n - p - 1$ and $\Pr[F_{(r,n-p-1)} > F]$ for $H_0: (\beta_1, \beta_2, \beta_3)^\top = 0$ versus $H_1: (\beta_1, \beta_2, \beta_3)^\top \neq 0$)

117 / 460
Lung capacity

In these data, available in the lungcap object of the library GLMsData, the lung capacity of 654 individuals is measured by their forced expiratory volume (FEV), in liters. The height (Ht), in inches, gender (Gender), and smoking habit (Smoke) of each individual were also recorded. The data analysis aims to explain the differences between lung capacities from the differences in height, gender, and smoking behavior.

[Scatter plot: log(forced expiratory volume) versus height, by gender]

$$T_i = \eta_i\,\varepsilon_i, \quad \log(\eta_i) = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3}, \quad \varepsilon_i \overset{\text{ind}}{\sim} \text{Lognormal}(1, \sigma^2)$$

$t$ = Forced expiratory volume
$x_1$ = Height
$x_2 = 0$ if Gender="F"; $x_2 = 1$ if Gender="M"
$x_3 = 0$ for non-smokers; $x_3 = 1$ for smokers

118 / 460
Lung capacity
> data(lungcap, package="GLMsData")
> lungcap <- within(lungcap, Smoke <- factor(Smoke))
> summary(lungcap[,c("FEV","Ht","Gender","Smoke")])
      FEV              Ht        Gender  Smoke
 Min.   :0.791   Min.   :46.00   F:318   0:589
 1st Qu.:1.981   1st Qu.:57.00   M:336   1: 65
 Median :2.547   Median :61.50
 Mean   :2.637   Mean   :61.14
 3rd Qu.:3.119   3rd Qu.:65.50
 Max.   :5.793   Max.   :74.00
>
> fit1 <- lm(FEV ~ Ht*Gender*Smoke, data=lungcap)
> fit2 <- lm(log(FEV) ~ Ht*Gender*Smoke, data=lungcap)
> stepCriterion(fit1,criterion="aic",direction="forward",trace=FALSE)$final
"~ Ht + Gender + Ht:Gender"
> stepCriterion(fit2,criterion="aic",direction="forward",trace=FALSE)$final
"~ Ht + Gender"
> fit1a <- update(fit1, formula= ~ Ht + Gender + Ht:Gender)
> fit2a <- update(fit2, formula= ~ Ht + Gender)
>
> adjR2(fit1a,fit2a)
RSS R-squared df adj.R-squared
fit1a 114.88251 0.76599 4 0.76491
fit2a 14.76626 0.79640 3 0.79577
119 / 460
Lung capacity

> c(AIC(fit1a), BIC(fit1a))
728.5365 750.9521
> c(AIC(fit2a), BIC(fit2a)) + 2*sum(log(lungcap[,"FEV"]))
582.2044 600.1368
>
> summary(fit2a)

In the coefficients table, "Estimate" is $\hat\beta_j$, "Std.Error" is $\hat\sigma c_j$, "t-value" is $\hat\beta_j/(\hat\sigma c_j)$, and "Pr(>|t|)" is $\Pr[F_{(1,n-p-1)} > t_j^2]$, the p-value for $H_0: \beta_j = 0$ versus $H_1: \beta_j \neq 0$.

             Estimate Std.Error t-value Pr(>|t|)
(Intercept) -2.264870  0.063585 -35.619  < 2e-16***
Ht           0.051854  0.001047  49.542  < 2e-16***
GenderM      0.019004  0.011935   1.592    0.112

Residual standard error: 0.1506 on 651 degrees of freedom   ($\hat\sigma$ and $n - p - 1$)
Multiple R-squared: 0.7964 ($R^2$), Adjusted R-squared: 0.7958 ($R^2_{adj}$)
F-statistic: 1273 on 2 and 651 DF, p-value: < 2.2e-16   ($F$, $r$, $n - p - 1$ and $\Pr[F_{(r,n-p-1)} > F]$ for $H_0: (\beta_1, \beta_2)^\top = 0$ versus $H_1: (\beta_1, \beta_2)^\top \neq 0$)

120 / 460
Generalized Linear Models
121 / 460
Motivation
Normal linear models are based on the following assumptions:
The response variable is continuous, symmetrical, and its
variance does not depend on its mean,
Covariates have additive effects on the mean of the response.
There are many situations where those assumptions are incorrect, for example, when:
The response is binary, a discrete proportion, or a count,
The response is continuous, but skewed to the right and
strictly positive,
Covariates act multiplicatively on the mean of the
response.
122 / 460
Motivation

Generalized Linear Models (GLMs) extend the normal linear model by allowing for:

some non-normal response distributions, such as
    Bernoulli: binary responses
    Binomial: discrete proportions
    Poisson: counts
    Gamma, Inverse Gaussian: continuous and right-skewed responses

an alternative function to the identity connecting the mean of the response and the linear predictor, thus enabling effects other than additive ones to be accounted for:
    Identity function: additive effects
    Logarithmic function: multiplicative effects

123 / 460
Formulation

Random component: states the probability distribution assumed to describe the response, $Y$, which may be Normal, Bernoulli, Binomial, Poisson, Gamma, or Inverse Normal, among others.

Systematic component: continuous and/or discrete explanatory variables in the linear predictor, that is,

$$\eta = \beta_0 + \beta_1 x_1 + \ldots + \beta_p x_p$$

Link function: a continuous, strictly monotone, and twice differentiable function, $g(\cdot)$, which connects/links the mean of the response variable, $\mu$, to the systematic component, that is,

$$g(\mu) = \beta_0 + \beta_1 x_1 + \ldots + \beta_p x_p$$

124 / 460
Examples
125 / 460
Liver cancer in female mice

These data, extracted from Zhang and Zelterman [1999] and available in the object liver of the library glmtoolbox, arose from an experiment where 8 groups of varying sizes (exposed) of female mice were continuously fed dietary concentrations of 2-Acetylaminofluorene (2-AA) (dose), in parts per $10^4$. These data consist of the number of liver neoplasms in mice (cancer) observed for 18 months.

[Scatter plot: percentage of mice with liver cancer versus dose of 2-Acetylaminofluorene]

$$m_i Y_i \overset{\text{ind}}{\sim} \text{Binomial}(m_i, \mu_i), \quad g(\mu_i) = \beta_0 + \beta_1 x_i$$

$m$ = Number of exposed mice
$y$ = Proportion of mice with cancer
$x$ = Dose of 2-AA, in parts per $10^4$

126 / 460
Insecticides

These data, extracted from Hewlett and Plackett [1950] and available in the object deposit of the library GLMsData, arose from an experiment in which groups of insects were exposed to deposits of insecticides. The total number of insects exposed in each group (Number), the number of insects killed after six days of exposure (Killed), and the amount (Deposit) and type of insecticide used (Insecticide) were recorded.

[Scatter plot: percentage of killed insects versus amount of insecticide, by insecticide type (A, B, C)]

$$m_i Y_i \overset{\text{ind}}{\sim} \text{Binomial}(m_i, \mu_i), \quad g(\mu_i) = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3}$$

$m$ = Number of exposed insects
$y$ = Proportion of killed insects
$x_1$ = Amount of insecticide, in mg.
$x_2 = 0$ and $x_3 = 0$ if Insecticide="A"
$x_2 = 1$ and $x_3 = 0$ if Insecticide="B"
$x_2 = 0$ and $x_3 = 1$ if Insecticide="C"

127 / 460
Burn injury study

This data set, discussed in Hosmer et al. [2013, page 27] and available in the object burn1000 of the library aplore3, is a subset of 1000 subjects treated between 2000 and 2007 at 40 different burn facilities. Available information includes patient demographics, total burn surface area and presence of inhalation injury. The outcome of interest is survival after hospital discharge. The data set includes the following variables:

age ($x_1$). The patient's age at admission, in years.
gender ($x_2$). The patient's gender: "Female" and "Male".
race ($x_3$). The patient's race: "Non-White" and "White".
tbsa ($x_4$). Total burn surface area, in percentage.
inh_inj ($x_5$). Burn involved inhalation injury? "No" and "Yes".
flame ($x_6$). Flame involved in burn injury? "No" and "Yes".
death ($y$). Hospital discharge status: "Alive" and "Dead".

$$Y_i \overset{\text{ind}}{\sim} \text{Bernoulli}(\mu_i), \quad g(\mu_i) = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_6 x_{i6}$$

128 / 460
Dilution assay

These data, extracted from Ridout (1990) and available in the object dilution of the library glmtoolbox, are counts of virus particles (Count) at different dilutions (Dilution). The plot of the data indicates that the location and dispersion of the counts of virus particles increase as the dilution volume increases.

[Scatter plot: count of virus particles versus dilution volume]

$$Y_i \overset{\text{ind}}{\sim} \text{Poisson}(\mu_i), \quad \mu_i = \beta_0\,x_i^{\beta_1}$$

$y$ = Count of virus particles
$x$ = Dilution volume

129 / 460
Skin cancer in women

These data, extracted from Kleinbaum et al. [2013, page 751] and available in the object skincancer of the library glmtoolbox, describe the number of non-melanoma skin cancer cases (cases) among women (population) stratified by age (age: "15-24", "25-34", "35-44", "45-54", "55-64", "65-74", "75-84", "85+") in two metropolitan areas of the US (city: "St. Paul", in Minneapolis, and "Fort Worth", in Dallas).

[Bar plot: incidence rate of skin cancer ($\times 10^4$) by age range, for St. Paul and Fort Worth]

$$Y_i \overset{\text{ind}}{\sim} \text{Poisson}(\mu_i), \quad \log\left(\frac{10^4\mu_i}{\lambda_i}\right) = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2}$$

$y$ = Number of cases of skin cancer
$\lambda$ = Number of women exposed
$x_1 = 1$ if city="Fort Worth"; $x_1 = 0$ if city="St. Paul"
$x_2$ = Midpoint of age range

130 / 460
Advertising

This data set, extracted from James et al. [2013, page 15] and available in the object advertising of the library glmtoolbox, consists of the sales (sales), in thousands of units, of a product in 200 different markets, along with advertising budgets, in thousands of dollars, for the product in each of those markets for television (TV) and radio (radio).

[Scatter plot: sales versus advertising budget for TV, by advertising budget for radio (low, mid, high)]

$$Y_i \overset{\text{ind}}{\sim} \text{Normal}(\mu_i, \sigma^2), \quad \log(\mu_i) = \beta_0 + \beta_1\log(x_{i1}) + \beta_2 x_{i2} + \beta_3\log(x_{i1})\,x_{i2}$$

$y$ = Sales, in thousands of units
$x_1$ = Advertising budget for TV
$x_2$ = Advertising budget for radio

131 / 460
Atlantic Croakers

These data, available in the object Croaker2 of the library FSAdata, describe the age-at-capture (age), in years, assigned by using sagittal otoliths, and the total length (tl), in millimeters, for 204 females (sex="F") and 114 males (sex="M") belonging to a species of fish better known as the Atlantic Croaker.

[Scatter plot: total length at capture versus age-at-capture, by sex]

$$Y_i \overset{\text{ind}}{\sim} \text{Normal}(\mu_i, \sigma^2), \quad \log(\mu_i) = \beta_0 + \beta_1\log(x_{i1}) + \beta_2 x_{i2}$$

$y$ = Total length at capture, in mm
$x_1$ = Age-at-capture, in years
$x_2 = 1$ if sex="M"; $x_2 = 0$ if sex="F"

132 / 460
Hardened steel

These data, extracted from McCool [1980] and available in the object Steel of the library glmtoolbox, consist of the failure times (life) for hardened steel specimens in a 4-ball rolling contact fatigue test. Ten independent observations were taken at each of the four values of contact stress (stress), in pounds per square inch $\times 10^6$.

[Scatter plot: time until the specimen failed versus contact stress, in psi $\times 10^6$]

$$Y_i \overset{\text{ind}}{\sim} \text{Gamma}(\mu_i, \sigma^2), \quad \log(\mu_i) = \beta_0 + \beta_1 x_i$$

$y$ = Time until the specimen failed
$x$ = Contact stress

133 / 460
Gasoline yield

This data set, discussed in James et al. [2013, page 92] and available in the object Auto of the library ISLR, includes acceleration, horsepower, fuel efficiency, weight, and other features of different makes and models of cars. The variable of interest is fuel consumption, in miles per gallon. The variables in this data set are the following:

mpg ($y$). Fuel consumption, in miles per gallon.
cylinders ($x_1$). Number of cylinders.
displacement ($x_2$). Engine displacement, in cubic inches.
horsepower ($x_3$). Engine horsepower.
weight ($x_4$). Weight, in pounds.
acceleration ($x_5$). Time, in seconds, to accelerate from 0 to 60 miles per hour.
origin ($x_6$, $x_7$). Origin of the vehicle: American (1), European (2) and Japanese (3).

$$Y_i \overset{\text{ind}}{\sim} \text{Gamma}(\mu_i, \sigma^2), \quad g(\mu_i) = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_7 x_{i7}$$

134 / 460
Model Formulation
135 / 460
Model formulation

$$Y_i \overset{\text{ind}}{\sim} \text{LEF}(\mu_i, \phi/\omega_i), \quad g(\mu_i) = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip} = x_i^\top\beta$$

Then, $E(Y_i) = \mu_i$ and $\text{Var}(Y_i) = \dfrac{\phi}{\omega_i}V(\mu_i)$, where

$\beta = (\beta_0, \beta_1, \ldots, \beta_p)^\top$ is the unknown interest parameter vector.
$x_i = (1, x_{i1}, \ldots, x_{ip})^\top$, where $(x_{i1}, \ldots, x_{ip})$ are the values of $p$ explanatory variables, quantitative and/or qualitative, measured on the $i$-th individual.
$\phi > 0$ is the unknown dispersion parameter.
$\omega_1, \ldots, \omega_n$ are known positive weights.
$V(\cdot)$ is the variance function, which describes the way in which $\text{Var}(Y_i)$ is assumed to depend on $\mu_i$.
$g(\cdot)$ is the link function, which is assumed to be known, strictly monotone, and twice differentiable.

136 / 460
Random Component
137 / 460
Linear exponential family (Continuous case)

The probability density function of $Y \sim \text{LEF}(\mu, \phi)$, where $\mu$ is its mean and $\phi > 0$ is its dispersion parameter, may be written as

$$f_Y(y; \mu, \phi) = C(y, \phi)\exp\left(\frac{1}{\phi}\Big\{y\,\theta(\mu) - b[\theta(\mu)]\Big\}\right),$$

for some known functions $\theta(\cdot)$, $b(\cdot)$ and $C(\cdot, \cdot)$. By definition,

$$\int f_Y(y; \mu, \phi)\,dy = 1 \;\Longrightarrow\; \int C(y, \phi)\exp\left(\frac{y\theta}{\phi}\right)dy = \exp\left(\frac{1}{\phi}b(\theta)\right).$$

Thus, the moment generating function of $Y$ reduces to

$$\begin{aligned}
M_Y(t) = E[\exp(tY)] &= \int C(y, \phi)\exp\left(\frac{1}{\phi}\big[y(\theta + t\phi) - b(\theta)\big]\right)dy \\
&= \exp\left(-\frac{1}{\phi}b(\theta)\right)\int C(y, \phi)\exp\left(\frac{1}{\phi}\,y(\theta + t\phi)\right)dy \\
&= \exp\left(\frac{1}{\phi}\big[b(\theta + t\phi) - b(\theta)\big]\right),
\end{aligned}$$

where the values of $t$ are those for which $b(\theta + t\phi)$ exists.

138 / 460
Linear exponential family (Discrete case)

The probability mass function of $Y \sim \text{LEF}(\mu, \phi)$, where $\mu$ is its mean and $\phi > 0$ is its dispersion parameter, may be written as

$$f_Y(y; \mu, \phi) = C(y, \phi)\exp\left(\frac{1}{\phi}\Big\{y\,\theta(\mu) - b[\theta(\mu)]\Big\}\right),$$

for some known functions $\theta(\cdot)$, $b(\cdot)$ and $C(\cdot, \cdot)$. By definition,

$$\sum_y f_Y(y; \mu, \phi) = 1 \;\Longrightarrow\; \sum_y C(y, \phi)\exp\left(\frac{y\theta}{\phi}\right) = \exp\left(\frac{1}{\phi}b(\theta)\right).$$

Thus, the moment generating function of $Y$ reduces to

$$\begin{aligned}
M_Y(t) = E[\exp(tY)] &= \sum_y C(y, \phi)\exp\left(\frac{1}{\phi}\big[y(\theta + t\phi) - b(\theta)\big]\right) \\
&= \exp\left(-\frac{1}{\phi}b(\theta)\right)\sum_y C(y, \phi)\exp\left(\frac{1}{\phi}\,y(\theta + t\phi)\right) \\
&= \exp\left(\frac{1}{\phi}\big[b(\theta + t\phi) - b(\theta)\big]\right),
\end{aligned}$$

where the values of $t$ are those for which $b(\theta + t\phi)$ exists.

139 / 460
Linear exponential family

Therefore,

$$E(Y) = M_Y^{(1)}(0) = b'(\theta), \qquad E(Y^2) = M_Y^{(2)}(0) = \phi\,b''(\theta) + [b'(\theta)]^2,$$

and $\text{Var}(Y) = M_Y^{(2)}(0) - [M_Y^{(1)}(0)]^2 = \phi V(\mu)$, where $V(\cdot) \equiv b''[\theta(\cdot)]$ is known as the variance function.

If $\tilde{Y} = (Y - \mu)/\sqrt{\phi}$, then its moment generating function becomes

$$M_{\tilde{Y}}(t) = \exp\left(-\frac{t\mu}{\sqrt{\phi}}\right)M_Y\big(t/\sqrt{\phi}\big) = \exp\left(-\frac{t\mu}{\sqrt{\phi}} + \frac{1}{\phi}\Big[b\big(\theta + t\sqrt{\phi}\big) - b(\theta)\Big]\right).$$

The third-order Taylor series expansion of $\log[M_{\tilde{Y}}(t)]$ around $t = 0$ may be written as follows:

$$\log\big[M_{\tilde{Y}}(t)\big] = \frac{1}{2}b''(\theta)t^2 + \frac{1}{6}b'''\big(\theta + t^*\sqrt{\phi}\big)\sqrt{\phi}\,t^3 = \frac{1}{2}V(\mu)t^2 + \frac{1}{6}b'''\big(\theta + t^*\sqrt{\phi}\big)\sqrt{\phi}\,t^3,$$

140 / 460
Linear exponential family

where $t^* = (1 - \delta)t$ for some $\delta \in (0, 1)$. Therefore,

$$\lim_{\phi\to 0}\log\big[M_{\tilde{Y}}(t)\big] = \frac{1}{2}V(\mu)t^2 \;\Longrightarrow\; \lim_{\phi\to 0}M_{\tilde{Y}}(t) = \exp\left(\frac{1}{2}V(\mu)t^2\right).$$

Then, the moment generating function of $\tilde{Y}$ converges to that of the Normal(0, $V(\mu)$). That is (see, for instance, Jorgensen [1987]),

$$\frac{Y - \mu}{\sqrt{\phi}}\;\overset{D}{\underset{\phi\to 0}{\longrightarrow}}\;\text{Normal}(0, V(\mu)),$$

where $\overset{D}{\underset{\phi\to 0}{\longrightarrow}}$ means convergence in distribution as $\phi \to 0$.

141 / 460
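This limit can be visualized by simulation; a small sketch using the Gamma case, where $\phi = \sigma^2$ and $V(\mu) = \mu^2$:

# Sketch: (Y - mu)/sqrt(phi) is nearly Normal(0, V(mu)) for small phi
set.seed(1)
mu <- 2; s2 <- 0.01                           # phi = sigma^2 "small"
y  <- rgamma(1e5, shape = 1/s2, scale = mu*s2)
z  <- (y - mu)/sqrt(s2)                       # standardized responses
c(var = var(z), V.mu = mu^2)                  # variance close to V(mu) = 4
qqnorm(z); qqline(z)                          # approximately normal quantiles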
Special Cases
142 / 460
Binomial distribution

Let $mY \sim \text{Binomial}(m, \mu)$, where $m$ is a known positive integer and $\mu \in (0, 1)$. The p.m.f. of $Y$ may be written as

$$\begin{aligned}
f_Y(y; \mu, m) &= \binom{m}{my}\mu^{my}(1 - \mu)^{m(1-y)}\,\mathrm{I}\big\{my \in \{0, 1, \ldots, m\}\big\} \\
&= \exp\left\{m\left[y\log\left(\frac{\mu}{1-\mu}\right) + \log(1 - \mu)\right]\right\}\binom{m}{my}\,\mathrm{I}\big\{my \in \{0, 1, \ldots, m\}\big\}.
\end{aligned}$$

Then, $\theta = \log[\mu/(1-\mu)]$, $b(\theta) = \log(1 + e^\theta)$, $V(\mu) = \mu(1-\mu)$ and $\phi = m^{-1}$. The mean, variance, skewness and excess kurtosis of $Y$ are

$$\mu, \quad \frac{\mu(1-\mu)}{m}, \quad \frac{1 - 2\mu}{\sqrt{\mu(1-\mu)m}} \quad\text{and}\quad \frac{1 - 6\mu(1-\mu)}{\mu(1-\mu)m},$$

respectively. In fact, the m.g.f. of $Y$ reduces to

$$M_Y(t) = \Big[1 + \mu\big(\exp(t/m) - 1\big)\Big]^m.$$

The binomial distribution tends to the Gaussian when $m \to \infty$, that is, for "large" values of $m$ we have $Y \,\dot\sim\, \text{Normal}\big(\mu, m^{-1}\mu(1-\mu)\big)$.

143 / 460
Probability mass function of Binomial(m, µ) for m = 10
0.00 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.00
0.00 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.00
0.00 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.00
0.0
0.1
0.2
0.3
0.4
0.0
0.1
0.2
0.3
0.4
0.0
0.1
0.2
0.3
0.4
µ = 0.1 µ = 0.2
µ = 0.4 µ = 0.6
µ = 0.8 µ = 0.9
y
f
Y
(y; m, µ)
Binary responses and discrete proportions
144 / 460
R codes

> library(ggplot2)
>
> size <- 10
> mus <- c(0.1,0.2,0.4,0.6,0.8,0.9)
> binom <- expand.grid(mu=mus,x=seq(from=0,to=1,by=1/size))
> binom <- within(binom,{prob <- dbinom(x=size*x,size=size,prob=mu)
+                        cdf <- pbinom(q=size*x,size=size,prob=mu)
+                        mu <- factor(mu,labels=paste("mu==",mus))
+ })
>
> dev.new()
> ggplot(binom) +
+   geom_segment(aes(x=x,xend=x,y=0,yend=prob),col="red",linewidth=1.25) +
+   labs(x="y",y=bquote(~f[Y]~"(y;m,"~mu~")")) +
+   facet_wrap(vars(mu),ncol=2,scale="free_x",labeller=label_parsed)
>
> dev.new()
> ggplot(binom) +
+   geom_segment(aes(x=x,xend=x+0.1,y=cdf,yend=cdf),col="red",linewidth=1.25) +
+   labs(x="y",y=bquote(~F[Y]~"(y;m,"~mu~")")) +
+   facet_wrap(vars(mu),ncol=2,scale="free_x",labeller=label_parsed)
>
145 / 460
Binomial regression

$$m_i Y_i \overset{\text{ind}}{\sim} \text{Binomial}(m_i, \mu_i), \quad g(\mu_i) = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip}$$

$$E(Y_i) = \mu_i \quad\text{and}\quad \text{Var}(Y_i) = m_i^{-1}\mu_i(1 - \mu_i)$$

Logit link: $g(\mu_i) = \log\left(\dfrac{\mu_i}{1 - \mu_i}\right)$
Probit link: $g(\mu_i) = \Phi^{-1}(\mu_i)$
Complementary log-log link: $g(\mu_i) = \log\big[-\log(1 - \mu_i)\big]$
Cauchy link: $g(\mu_i) = \tan\left[\dfrac{\pi}{2}\big(2\mu_i - 1\big)\right]$

In this model $\omega_i = m_i$ for $i = 1, \ldots, n$.

146 / 460
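A brief sketch fitting the same binomial regression under several of these links with glm(); the liver data described later in this section (cancer, exposed, dose) serve as the illustration:

# Sketch: one binomial regression, four link functions
library(glmtoolbox)
data(liver)
links <- c("logit", "probit", "cloglog", "cauchit")
fits  <- lapply(links, function(l)
           glm(cbind(cancer, exposed - cancer) ~ dose,
               family = binomial(link = l), data = liver))
setNames(sapply(fits, AIC), links)   # compare the links by AIC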
Binomial model for m = 25

[Figure: simulated proportions $y$ versus $x$, with rug marks for the binomial responses and the fitted logistic curve
$$\mu = \frac{\exp(-4 + 1.6x)}{1 + \exp(-4 + 1.6x)},$$
for $x$ between 0 and 5]

147 / 460
Poisson distribution

Let $Y \sim \text{Poisson}(\mu)$, where $\mu > 0$. The probability mass function of $Y$ may be written as follows:

$$f_Y(y; \mu) = \frac{\exp(-\mu)\mu^y}{y!}\,\mathrm{I}\big\{y \in \{0, 1, 2, \ldots\}\big\} = \exp\big[y\log(\mu) - \mu\big]\frac{\mathrm{I}\big\{y \in \{0, 1, 2, \ldots\}\big\}}{y!}.$$

Therefore, $\theta = \log(\mu)$, $b(\theta) = \exp(\theta)$, $V(\mu) = \mu$ and $\phi = 1$. The mean, variance, skewness and excess kurtosis of $Y$ are

$$\mu, \quad \mu, \quad \mu^{-\frac{1}{2}} \quad\text{and}\quad \mu^{-1},$$

respectively. Indeed, the moment generating function of $Y$ reduces to

$$M_Y(t) = \exp\big[\mu(e^t - 1)\big].$$

148 / 460
Poisson as the limit of the Binomial distribution

$$\begin{aligned}
\lim_{\substack{m\to\infty,\;\pi\to 0 \\ m\pi\to\mu}}\text{Binomial}(m, \pi) &= \lim_{\substack{m\to\infty,\;\pi\to 0 \\ m\pi\to\mu}}\binom{m}{y}\pi^y(1 - \pi)^{m-y}\,\mathrm{I}\big\{y \in \{0, 1, \ldots, m\}\big\} \\
&= \lim_{\substack{m\to\infty,\;\pi\to 0 \\ m\pi\to\mu}}\frac{m(m-1)\cdots(m-y+1)}{y!}\left(\frac{\pi}{1-\pi}\right)^y\left(1 - \frac{m\pi}{m}\right)^m\mathrm{I}\big\{y \in \{0, 1, \ldots, m\}\big\} \\
&= \lim_{\substack{m\to\infty,\;\pi\to 0 \\ m\pi\to\mu}}\underbrace{\frac{m(m-1)\cdots(m-y+1)}{m^y}}_{\text{product of } y \text{ terms over } m^y}\frac{1}{y!}\left(\frac{m\pi}{1-\pi}\right)^y\left(1 - \frac{m\pi}{m}\right)^m\mathrm{I}\big\{y \in \{0, 1, \ldots, m\}\big\} \\
&= \frac{\mu^y\exp(-\mu)}{y!}\,\mathrm{I}\big\{y \in \{0, 1, \ldots\}\big\} = \text{Poisson}(\mu).
\end{aligned}$$

149 / 460
Probability mass function of Poisson(µ)
2 4 6 8 2 4 6 8
2 4 6 8 2 4 6 8
2 4 6 8 2 4 6 8
0.0
0.1
0.2
0.3
0.0
0.1
0.2
0.3
0.0
0.1
0.2
0.3
µ = 1.0 µ = 1.5
µ = 2.0 µ = 2.5
µ = 3.0 µ = 3.5
y
f
Y
(y; µ)
Counts of some relatively rare event.
150 / 460
R codes

> library(ggplot2)
>
> rango <- c(0,8)
> mus <- seq(from=1,to=3.5,by=0.5)
> pois <- expand.grid(mu=mus,x=seq(from=rango[1],to=rango[2],by=1))
> pois <- within(pois,{prob <- dpois(x=x,lambda=mu)
+                      cdf <- ppois(q=x,lambda=mu)
+                      mu <- factor(mu,labels=paste("mu==",mus))
+ })
>
> dev.new()
> ggplot(pois) +
+   geom_segment(aes(x=x,xend=x,y=0,yend=prob),col="red",linewidth=1.25) +
+   labs(x="y",y=bquote(~f[Y]~"(y;"~mu~")")) +
+   facet_wrap(vars(mu),ncol=2,scale="free_x",labeller=label_parsed)
>
> dev.new()
> ggplot(pois) +
+   geom_segment(aes(x=x,xend=x+1,y=cdf,yend=cdf),col="red",linewidth=1.25) +
+   labs(x="y",y=bquote(~F[Y]~"(y;"~mu~")")) +
+   facet_wrap(vars(mu),ncol=2,scale="free_x",labeller=label_parsed)
>
151 / 460
Poisson regression

$$Y_i \overset{\text{ind}}{\sim} \text{Poisson}(\mu_i), \quad g(\mu_i) = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip}$$

$$E(Y_i) = \mu_i \quad\text{and}\quad \text{Var}(Y_i) = \mu_i$$

Logarithmic link: $g(\mu_i) = \log(\mu_i)$
Identity link: $g(\mu_i) = \mu_i$
Square-root link: $g(\mu_i) = \sqrt{\mu_i}$

152 / 460
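For instance, the dilution assay described earlier, whose mean $\mu_i = \beta_0 x_i^{\beta_1}$ becomes linear on the log scale, can be fitted as a Poisson regression with the logarithmic link; a minimal sketch:

# Sketch: Poisson regression with log link for the dilution counts
library(glmtoolbox)
data(dilution)
fit <- glm(Count ~ log(Dilution), family = poisson(link = "log"),
           data = dilution)   # intercept estimates log(beta_0)
summary(fit)                  # exp(coef) acts multiplicatively on the mean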
Poisson model

[Figure: simulated counts $y$ versus $x$ with the fitted curve $\mu = \exp(0.5 + 0.6x)$, for $x$ between 0 and 4]

153 / 460
Normal distribution

Let $Y \sim \text{Normal}(\mu, \sigma^2)$, where $\mu \in \mathbb{R}$ and $\sigma > 0$. The probability density function of $Y$ may be written as follows:

$$\begin{aligned}
f_Y(y; \mu, \sigma^2) &= \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(y - \mu)^2}{2\sigma^2}\right) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(\frac{-y^2 + 2y\mu - \mu^2}{2\sigma^2}\right) \\
&= \exp\left\{\frac{1}{\sigma^2}\left[y\mu - \frac{\mu^2}{2}\right]\right\}\frac{\exp(-y^2/2\sigma^2)}{\sqrt{2\pi\sigma^2}}.
\end{aligned}$$

Therefore, $\theta = \mu$, $b(\theta) = \theta^2/2$, $V(\mu) = 1$ and $\phi = \sigma^2$. The mean, variance, skewness and excess kurtosis of $Y$ are $\mu$, $\sigma^2$, 0 and 0, respectively. Indeed, the moment generating function of $Y$ reduces to

$$M_Y(t) = \exp\left(\mu t + \frac{\sigma^2 t^2}{2}\right).$$

154 / 460
Probability density function of Normal(0, σ²)

[Figure: density curves of $f_Y(y; \mu = 0, \sigma^2)$ for $\sigma^2$ = 0.1, 0.2, 0.3, 0.6, 0.9 and 1.2]

Continuous and symmetric responses

155 / 460
R codes
> library(ggplot2)
>
> mu <- 0
> s2 <- c(0.1,0.2,0.3,0.6,0.9,1.2)
> x <- seq(from=-4,to=4,length=200)
> norm <- expand.grid(mu=mu,s2s=s2,x=x)
> norm <- within(norm,{pdf <- dnorm(x=x,mean=mu,sd=sqrt(s2))
+ cdf <- pnorm(q=x,me an=mu,sd=sqr t(s2))
+ s2s <- factor(s2s,la bels=paste ("sigma**2= =",s2))
+ })
>
> dev.new()
> ggplot(norm) +
+ geom_line(aes(x=x,y=pdf),col="red",linewidth= 1.1) +
+ labs(x="y",y=bquote(~f[Y]~"(y;"~mu~","~sigma* *2~")")) +
+ facet_wrap(vars(s2s),ncol=2,scale="free_x",lab eller=label_ parsed)
>
> dev.new()
> ggplot(norm) +
+ geom_line(aes(x=x,y=cdf),col="red",linewidth= 1.1) +
+ labs(x="y",y=bquote(~F[Y]~"(y;"~mu~","~sigma* *2~")")) +
+ facet_wrap(vars(s2s),ncol=2,scale="free_x",lab eller=label_ parsed)
>
156 / 460
Normal regression

$$Y_i \overset{\text{ind}}{\sim} \text{Normal}\big(\mu_i, \sigma^2/\omega_i\big), \quad g(\mu_i) = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip}$$

or, equivalently,

$$Y_i = \mu_i + \varepsilon_i, \quad g(\mu_i) = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip}, \quad \varepsilon_i \overset{\text{ind}}{\sim} \text{Normal}\big(0, \sigma^2/\omega_i\big)$$

$$E(Y_i) = \mu_i \quad\text{and}\quad \text{Var}(Y_i) = \frac{\sigma^2}{\omega_i}$$

Inverse link: $g(\mu_i) = 1/\mu_i$
Logarithmic link: $g(\mu_i) = \log(\mu_i)$
Identity link: $g(\mu_i) = \mu_i$

If $\omega_1 = \omega_2 = \ldots = \omega_n$, then the dispersion parameter is constant, which implies that the variance is also constant.

157 / 460
Normal model for σ² = 0.4

[Figure: simulated responses $y$ versus $x$ with the fitted line $\mu = 1.5 + 1.4x$, for $x$ between 0 and 10]

158 / 460
Gamma distribution

Let $Y \sim \text{Gamma}(\mu, \sigma^2)$, where $\mu > 0$ and $\sigma > 0$. The probability density function of $Y$ may be written as follows:

$$\begin{aligned}
f_Y(y; \mu, \sigma^2) &= \frac{1}{\Gamma\left(\frac{1}{\sigma^2}\right)}\big(\sigma^2\mu\big)^{-\frac{1}{\sigma^2}}\,y^{\frac{1}{\sigma^2}-1}\exp\left(-\frac{y}{\sigma^2\mu}\right)\mathrm{I}_{(0,\infty)}(y) \\
&= \exp\left\{\frac{1}{\sigma^2}\left[-\frac{y}{\mu} - \log(\mu)\right]\right\}\frac{y^{\frac{1}{\sigma^2}-1}}{\Gamma\left(\frac{1}{\sigma^2}\right)\big(\sigma^2\big)^{\frac{1}{\sigma^2}}}\,\mathrm{I}_{(0,\infty)}(y).
\end{aligned}$$

Therefore, $\theta = -\mu^{-1}$, $b(\theta) = -\log(-\theta)$, $\phi = \sigma^2$ and $V(\mu) = \mu^2$. As special cases we have the Exponential ($\mu = \lambda^{-1}$, $\sigma^2 = 1$) and Chi-squared ($\mu = \nu$, $\sigma^2 = 2\nu^{-1}$) distributions. The mean, variance, skewness and excess kurtosis of $Y$ are

$$\mu, \quad \sigma^2\mu^2, \quad 2\sigma \quad\text{and}\quad 6\sigma^2,$$

respectively. Indeed, the m.g.f. of $Y$ reduces to $M_Y(t) = \big(1 - \sigma^2\mu t\big)^{-\frac{1}{\sigma^2}}$.

The Gamma distribution tends to the Gaussian when $\sigma^2 \to 0$, that is, for "small" values of $\sigma^2$ we have $Y \,\dot\sim\, \text{Normal}\big(\mu, \sigma^2\mu^2\big)$.

159 / 460
Probability density function of Gamma(1, σ²)

[Figure: density curves of $f_Y(y; \mu = 1, \sigma^2)$ for $\sigma^2$ = 0.08, 0.15, 0.35, 0.65, 0.9 and 1.2]

Continuous, strictly positive, and right-skewed responses

160 / 460
R codes

> library(ggplot2)
>
> mu <- 1
> s2 <- c(0.08,0.15,0.35,0.65,0.9,1.2)
> x <- seq(from=0.01,to=2.5,length=200)
> gama <- expand.grid(mu=mu,s2s=s2,x=x)
> gama <- within(gama,{pdf <- dgamma(x=x,shape=1/s2s,scale=mu*s2s)
+                      cdf <- pgamma(q=x,shape=1/s2s,scale=mu*s2s)
+                      s2s <- factor(s2s,labels=paste("sigma**2==",s2))})
>
> dev.new()
> ggplot(gama) +
+   geom_line(aes(x=x,y=pdf),col="red",linewidth=1.1) +
+   labs(x="y",y=bquote(~f[Y]~"(y;"~mu~","~sigma**2~")")) +
+   facet_wrap(vars(s2s),ncol=2,scale="free_x",labeller=label_parsed)
>
> dev.new()
> ggplot(gama) +
+   geom_line(aes(x=x,y=cdf),col="red",linewidth=1.1) +
+   labs(x="y",y=bquote(~F[Y]~"(y;"~mu~","~sigma**2~")")) +
+   facet_wrap(vars(s2s),ncol=2,scale="free_x",labeller=label_parsed)
>
161 / 460
Gamma regression

$$Y_i \overset{\text{ind}}{\sim} \text{Gamma}\big(\mu_i, \sigma^2/\omega_i\big), \quad g(\mu_i) = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip}$$

or, equivalently,

$$Y_i = \mu_i\,\varepsilon_i, \quad g(\mu_i) = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip}, \quad \varepsilon_i \overset{\text{ind}}{\sim} \text{Gamma}\big(1, \sigma^2/\omega_i\big)$$

$$E(Y_i) = \mu_i \quad\text{and}\quad \text{Var}(Y_i) = \frac{\sigma^2}{\omega_i}\mu_i^2$$

Inverse link: $g(\mu_i) = 1/\mu_i$
Logarithmic link: $g(\mu_i) = \log(\mu_i)$
Identity link: $g(\mu_i) = \mu_i$

If $\omega_1 = \omega_2 = \ldots = \omega_n$, then the dispersion parameter is constant, which implies that the coefficient of variation is also constant.

162 / 460
Gamma model for σ² = 0.4

[Figure: simulated responses $y$ versus $x$ with the fitted curve $\mu = \dfrac{1}{0.5 + 0.4x}$, for $x$ between 0 and 6]

163 / 460
Inverse-Gaussian distribution

Let $Y \sim \text{Inverse Gaussian}(\mu, \sigma^2)$, where $\mu > 0$ and $\sigma > 0$. The probability density function of $Y$ may be written as follows:

$$\begin{aligned}
f_Y(y; \mu, \sigma^2) &= \big(2\pi\sigma^2 y^3\big)^{-\frac{1}{2}}\exp\left(-\frac{(y - \mu)^2}{2\sigma^2\mu^2 y}\right)\mathrm{I}_{(0,\infty)}(y) \\
&= \exp\left\{\frac{1}{\sigma^2}\left[-\frac{y}{2\mu^2} + \frac{1}{\mu}\right]\right\}\frac{\exp\big(-1/2\sigma^2 y\big)}{\big(2\pi\sigma^2 y^3\big)^{\frac{1}{2}}}\,\mathrm{I}_{(0,\infty)}(y).
\end{aligned}$$

Therefore, $\theta = -1/2\mu^2$, $b(\theta) = -\sqrt{-2\theta}$, $\phi = \sigma^2$ and $V(\mu) = \mu^3$. The mean, variance, skewness and excess kurtosis of $Y$ are

$$\mu, \quad \sigma^2\mu^3, \quad 3\sigma\sqrt{\mu} \quad\text{and}\quad 15\sigma^2\mu,$$

respectively. Indeed, the m.g.f. of $Y$ reduces to

$$M_Y(t) = \exp\left[\frac{1}{\sigma^2\mu}\left(1 - \sqrt{1 - 2\sigma^2\mu^2 t}\right)\right].$$

The Inverse Gaussian distribution tends to the Gaussian when $\sigma^2 \to 0$, that is, for "small" values of $\sigma^2$ we have $Y \,\dot\sim\, \text{Normal}\big(\mu, \sigma^2\mu^3\big)$.

164 / 460
Probability density function of Inverse-Gaussian(1, σ²)

[Figure: density curves of $f_Y(y; \mu = 1, \sigma^2)$ for $\sigma^2$ = 0.025, 0.1, 0.5, 1, 2 and 4]

Continuous, strictly positive, and right-skewed responses

165 / 460
R codes

> library(ggplot2)
> library(statmod)
>
> mu <- 1
> s2 <- c(0.025,0.1,0.5,1,2,4)
> x <- seq(from=0.01,to=2.5,length=200)
> invg <- expand.grid(mu=mu,s2s=s2,x=x)
> invg <- within(invg,{pdf <- dinvgauss(x=x,mean=mu,dispersion=s2s)
+                      cdf <- pinvgauss(q=x,mean=mu,dispersion=s2s)
+                      s2s <- factor(s2s,labels=paste("sigma**2==",s2))})
>
> dev.new()
> ggplot(invg) +
+   geom_line(aes(x=x,y=pdf),col="red",linewidth=1.1) +
+   labs(x="y",y=bquote(~f[Y]~"(y;"~mu~","~sigma**2~")")) +
+   facet_wrap(vars(s2s),ncol=2,scale="free_x",labeller=label_parsed)
>
> dev.new()
> ggplot(invg) +
+   geom_line(aes(x=x,y=cdf),col="red",linewidth=1.1) +
+   labs(x="y",y=bquote(~F[Y]~"(y;"~mu~","~sigma**2~")")) +
+   facet_wrap(vars(s2s),ncol=2,scale="free_x",labeller=label_parsed)
>
166 / 460
Inverse-Gaussian regression

$$Y_i \overset{\text{ind}}{\sim} \text{Inverse Gaussian}\big(\mu_i, \sigma^2/\omega_i\big), \quad g(\mu_i) = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip}$$

$$E(Y_i) = \mu_i \quad\text{and}\quad \text{Var}(Y_i) = \frac{\sigma^2}{\omega_i}\mu_i^3$$

Inverse link: $g(\mu_i) = 1/\mu_i$
Logarithmic link: $g(\mu_i) = \log(\mu_i)$
Identity link: $g(\mu_i) = \mu_i$

The values of $\omega_1, \ldots, \omega_n$ are assumed to be known and positive.

167 / 460
Inverse-Gaussian model for σ² = 0.4

[Figure: simulated responses $y$ versus $x$ with the fitted curve $\mu = \dfrac{1}{(0.6 + 0.2x)^2}$, for $x$ between 0 and 7]

168 / 460
Main distributions belonging to the LEF

| Distribution                      | $\theta_i = \theta(\mu_i)$                     | $b(\theta_i)$              | $V(\mu_i)$         | $\omega_i$ | $\phi$     |
|-----------------------------------|------------------------------------------------|----------------------------|--------------------|------------|------------|
| Binomial($m_i$, $\mu_i$)          | $\log\!\left(\frac{\mu_i}{1-\mu_i}\right)$     | $\log\!\big(1+e^{\theta_i}\big)$ | $\mu_i(1-\mu_i)$ | $m_i$      | 1          |
| Poisson($\mu_i$)                  | $\log(\mu_i)$                                  | $e^{\theta_i}$             | $\mu_i$            | 1          | 1          |
| Normal($\mu_i$, $\sigma^2$)       | $\mu_i$                                        | $\frac{\theta_i^2}{2}$     | 1                  | 1          | $\sigma^2$ |
| Gamma($\mu_i$, $\sigma^2$)        | $-\frac{1}{\mu_i}$                             | $-\log(-\theta_i)$         | $\mu_i^2$          | 1          | $\sigma^2$ |
| Inv.Gaussian($\mu_i$, $\sigma^2$) | $-\frac{1}{2\mu_i^2}$                          | $-\sqrt{-2\theta_i}$       | $\mu_i^3$          | 1          | $\sigma^2$ |

See, for example, McCullagh and Nelder [1989, page 30].

169 / 460
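The family objects of base R encode several columns of this table; a quick numerical check of the variance functions and the canonical link:

# Sketch: variance functions V(mu) and the canonical link in base R
binomial()$variance(0.3)         # mu(1 - mu)   = 0.21
poisson()$variance(2)            # mu           = 2
Gamma()$variance(2)              # mu^2         = 4
inverse.gaussian()$variance(2)   # mu^3         = 8
binomial()$linkfun(0.3)          # theta(mu) = log(mu/(1 - mu))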
Exercises

(1) Let $mY \sim \text{Negative Binomial}(m, \mu)$, where $m$ is a positive integer and $\mu > 0$. The probability mass function of $Y$ may be written as

$$f_Y(y; m, \mu) = \binom{m(y+1) - 1}{my}\left(\frac{\mu}{\mu + 1}\right)^{my}\left(\frac{1}{\mu + 1}\right)^m\mathrm{I}\big\{my \in \{0, 1, 2, \ldots\}\big\}$$

(A) Show that the distribution of $Y$ belongs to the LEF.
(B) Find the mean and variance of $Y$.
(C) Find the moment generating function of $Y$.

(2) Let $Y \sim \text{Negative Binomial}(\mu, \sigma^2)$, where $\mu > 0$ and $\sigma^2$ is assumed to be known. The probability mass function of $Y$ may be written as

$$f_Y(y; \mu, \sigma^2) = \frac{\Gamma\left(y + \frac{1}{\sigma^2}\right)}{\Gamma\left(\frac{1}{\sigma^2}\right)\Gamma(y + 1)}\left(\frac{\sigma^2\mu}{\sigma^2\mu + 1}\right)^y\left(\frac{1}{\sigma^2\mu + 1}\right)^{\frac{1}{\sigma^2}}\mathrm{I}\big\{y \in \{0, 1, 2, \ldots\}\big\}$$

(A) Show that the distribution of $Y$ belongs to the LEF.
(B) Find the mean and variance of $Y$.
(C) Find the moment generating function of $Y$.

170 / 460
(3) Let $Y \sim \text{Beta}(\mu, \sigma^2)$, where $0 < \mu < 1$. The probability density function of $Y$ may be written as

$$f_Y(y; \mu, \sigma^2) = \frac{\Gamma(\sigma^2)}{\Gamma(\mu\sigma^2)\Gamma((1 - \mu)\sigma^2)}\,y^{\mu\sigma^2 - 1}(1 - y)^{(1-\mu)\sigma^2 - 1}, \quad 0 < y < 1.$$

(A) Does the Beta distribution belong to the exponential family of distributions?
(B) Does the Beta distribution belong to the LEF?

(4) Let $Y \sim \text{Weibull}(\mu, \alpha)$, where $\mu, \alpha > 0$. The probability density function of $Y$ may be written as

$$f_Y(y; \mu, \alpha) = \frac{\alpha\,c(\alpha)}{\mu}\left(\frac{y}{\mu}\right)^{\alpha - 1}\exp\left[-c(\alpha)\left(\frac{y}{\mu}\right)^\alpha\right], \quad y > 0,$$

where $c(\alpha) = [\Gamma(1 + 1/\alpha)]^\alpha$.

(A) Does the Weibull distribution belong to the exponential family of distributions?
(B) Does the Weibull distribution belong to the LEF?
(C) Does the Weibull distribution belong to the LEF when $\alpha = 1$?

171 / 460
Parameter estimation
172 / 460
Parameter estimation

Parameter estimation is performed using the maximum likelihood method. Under certain conditions, known as regularity conditions, this method yields estimators with desirable statistical properties, such as consistency, asymptotic unbiasedness, asymptotic efficiency, and asymptotic normality. This method consists of maximizing the likelihood function, denoted here by $L(\beta)$, which corresponds to the joint probability density function (or joint probability mass function) of $Y_1, \ldots, Y_n$, viewed as a function of $\beta$ rather than as a function of $y_1, \ldots, y_n$. Thus,

$$\hat\beta = \underset{\beta\,\in\,\mathbb{R}^{p+1}}{\text{argmax}}\;L(\beta).$$

Because $Y_1, \ldots, Y_n$ are independent, $L(\beta)$ reduces to

$$L(\beta) = \prod_{i=1}^{n} f_{Y_i}(y_i; \mu_i, \phi/\omega_i) = \prod_{i=1}^{n} C(y_i, \phi/\omega_i)\exp\left(\frac{\omega_i}{\phi}\Big\{y_i\theta(\mu_i) - b[\theta(\mu_i)]\Big\}\right).$$

173 / 460
Parameter estimation

The function $\log(x)$ is strictly increasing, so any $\beta$ that maximizes $\log[L(\beta)]$ will also maximize $L(\beta)$. In addition, the computation of the gradient of $\log[L(\beta)]$ is easier than that of $L(\beta)$, as the latter is a product of $n$ terms while the former is a sum of $n$ terms. Therefore, a more convenient form for $\hat\beta$ is the following:

$$\hat\beta = \underset{\beta\,\in\,\mathbb{R}^{p+1}}{\text{argmax}}\;\ell(\beta),$$

where

$$\ell(\beta) = \log[L(\beta)] = \sum_{i=1}^{n}\log[f_{Y_i}(y_i; \mu_i, \phi/\omega_i)] = \sum_{i=1}^{n}\log[C(y_i, \phi/\omega_i)] + \frac{1}{\phi}\sum_{i=1}^{n}\omega_i\Big\{y_i\theta(\mu_i) - b[\theta(\mu_i)]\Big\}$$

is the log-likelihood function of $\beta$. Note that $\sum_{i=1}^{n}\log[C(y_i, \phi/\omega_i)]$ and $\phi > 0$ do not depend on $\beta$.

174 / 460
Parameter estimation

So,

$$\hat\beta = \underset{\beta\,\in\,\mathbb{R}^{p+1}}{\text{argmax}}\;\sum_{i=1}^{n}\omega_i\Big\{y_i\theta(\mu_i) - b[\theta(\mu_i)]\Big\},$$

that is, $\hat\beta$ does not depend on $\phi$.

Binomial: $\hat\beta = \text{argmax}\;\sum_{i=1}^{n} m_i\left[y_i\log\left(\dfrac{\mu_i}{1-\mu_i}\right) + \log(1 - \mu_i)\right]$

Poisson: $\hat\beta = \text{argmax}\;\sum_{i=1}^{n}\big[y_i\log(\mu_i) - \mu_i\big]$

Normal: $\hat\beta = \text{argmax}\;\sum_{i=1}^{n}\omega_i\left[y_i\mu_i - \dfrac{\mu_i^2}{2}\right]$

Gamma: $\hat\beta = \text{argmax}\;\sum_{i=1}^{n}\omega_i\left[-\dfrac{y_i}{\mu_i} - \log(\mu_i)\right]$

Inverse Gaussian: $\hat\beta = \text{argmax}\;\sum_{i=1}^{n}\omega_i\left[\dfrac{1}{\mu_i} - \dfrac{y_i}{2\mu_i^2}\right]$

175 / 460
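Since $\hat\beta$ maximizes the kernel $\sum_i \omega_i\{y_i\theta(\mu_i) - b[\theta(\mu_i)]\}$ regardless of $\phi$, it can also be obtained by direct numerical maximization. A sketch for the Poisson case with the dilution data, checked against glm():

# Sketch: direct maximization of the Poisson kernel vs. glm()
data(dilution, package = "glmtoolbox")
X <- model.matrix(~ log(Dilution), data = dilution)
y <- dilution$Count
negker <- function(beta) {                # minus the Poisson kernel
  mu <- drop(exp(X %*% beta))
  -sum(y * log(mu) - mu)
}
opt <- optim(c(0, 0), negker, method = "BFGS")
fit <- glm(Count ~ log(Dilution), family = poisson, data = dilution)
cbind(optim = opt$par, glm = coef(fit))   # essentially identical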
Deviance

The deviance is used to quantify the goodness-of-fit of an interest model by comparing its log-likelihood function to that of the saturated model (that is, a model with the same number of parameters as subjects or individuals in the sample). Thus, the lower the deviance, the smaller the difference between the fitted values of the interest model and those of the saturated model. Hence, "small" values of the deviance indicate that the interest model achieves fitted values "similar" to those of the saturated model, despite the fact that the first has only $(p + 1)$ parameters, while the second has $n$ parameters. Note that

$$\hat\beta = \underset{\beta\,\in\,\mathbb{R}^{p+1}}{\text{argmax}}\;\ell(\beta) = \underset{\beta\,\in\,\mathbb{R}^{p+1}}{\text{argmin}}\;2\left\{\underset{\mu_1,\ldots,\mu_n}{\sup}\sum_{i=1}^{n}\log\big[f_{Y_i}(y_i; \mu_i, \phi/\omega_i)\big] - \ell(\beta)\right\},$$

where $\underset{\mu_1,\ldots,\mu_n}{\sup}\sum_{i=1}^{n}\log\big[f_{Y_i}(y_i; \mu_i, \phi/\omega_i)\big]$ is the log-likelihood function of the saturated model.

176 / 460
Deviance

Then, the deviance can be written as the following function of $\beta$:

$$\begin{aligned}
D^*(\beta) &= 2\left\{\underset{\mu_1,\ldots,\mu_n}{\sup}\sum_{i=1}^{n}\log\big[f_{Y_i}(y_i; \mu_i, \phi/\omega_i)\big] - \sum_{i=1}^{n}\log\big[f_{Y_i}(y_i; \mu_i(\beta), \phi/\omega_i)\big]\right\} \\
&= 2\sum_{i=1}^{n}\left\{\underset{\mu_i}{\sup}\,\log\big[f_{Y_i}(y_i; \mu_i, \phi/\omega_i)\big] - \log\big[f_{Y_i}(y_i; \mu_i(\beta), \phi/\omega_i)\big]\right\} \\
&= 2\sum_{i=1}^{n}\left\{\underset{\mu_i}{\sup}\left(\frac{\omega_i}{\phi}\Big\{y_i\theta(\mu_i) - b[\theta(\mu_i)]\Big\} + \log\big[C(y_i, \phi/\omega_i)\big]\right) - \frac{\omega_i}{\phi}\Big\{y_i\theta(\mu_i(\beta)) - b[\theta(\mu_i(\beta))]\Big\} - \log\big[C(y_i, \phi/\omega_i)\big]\right\} \\
&= \frac{2}{\phi}\sum_{i=1}^{n}\left\{\underset{\mu_i}{\sup}\,\omega_i\Big\{y_i\theta(\mu_i) - b[\theta(\mu_i)]\Big\} - \omega_i\Big\{y_i\theta(\mu_i(\beta)) - b[\theta(\mu_i(\beta))]\Big\}\right\} \\
&= \frac{1}{\phi}\sum_{i=1}^{n} d(y_i, \mu_i, \omega_i),
\end{aligned}$$

177 / 460
Deviance

where

$$d(y_i, \mu_i, \omega_i) = 2\left\{\underset{\mu}{\sup}\,\omega_i\Big\{y_i\theta(\mu) - b[\theta(\mu)]\Big\} - \omega_i\Big\{y_i\theta(\mu_i) - b[\theta(\mu_i)]\Big\}\right\},$$

in which $\mu_i = g^{-1}(x_i^\top\beta)$. Note that

$$\frac{\partial}{\partial\mu}\,\omega_i\Big\{y_i\theta(\mu) - b[\theta(\mu)]\Big\} = \omega_i\,\theta'(\mu)\Big\{y_i - b'[\theta(\mu)]\Big\}$$

and

$$\frac{\partial^2}{\partial\mu^2}\,\omega_i\Big\{y_i\theta(\mu) - b[\theta(\mu)]\Big\} = \omega_i\left\{\theta''(\mu)\Big[y_i - b'[\theta(\mu)]\Big] - [\theta'(\mu)]^2\,b''[\theta(\mu)]\right\}.$$

Therefore, if $y_i$ belongs to the parametric space of $\mu$, then the maximum of $\omega_i\{y_i\theta(\mu) - b[\theta(\mu)]\}$ exists and it is $\omega_i\{y_i\theta(y_i) - b[\theta(y_i)]\}$. Consequently, $d(y_i, \mu_i, \omega_i)$ reduces to

$$d(y_i, \mu_i, \omega_i) = 2\Big\{\omega_i\big[y_i\theta(y_i) - b[\theta(y_i)]\big] - \omega_i\big[y_i\theta(\mu_i) - b[\theta(\mu_i)]\big]\Big\}.$$

178 / 460
Deviance

If $y_i$ does not belong to the parametric space of $\mu$, then $\omega_i\{y_i\theta(\mu) - b[\theta(\mu)]\}$ is a strictly monotonic function of $\mu$, so its maximum does not exist and must be replaced by its supremum.

In conclusion, the deviance satisfies the following:

The lower the value of $D(\hat\beta)$, the better the goodness-of-fit of the model.
$D(\hat\beta) \to 0$ as $d(y_i, \hat\mu_i, \omega_i) \to 0$ for $i = 1, \ldots, n$.
$\hat\beta$ may be written as

$$\hat\beta = \underset{\beta\,\in\,\mathbb{R}^{p+1}}{\text{argmin}}\;D^*(\beta) = \underset{\beta\,\in\,\mathbb{R}^{p+1}}{\text{argmin}}\;D(\beta),$$

where $D(\beta) = \sum_{i=1}^{n} d(y_i, \mu_i, \omega_i)$ is the unscaled deviance.

179 / 460
Deviance (see McCullagh and Nelder [1989, page 34])

If $m_i Y_i \overset{\text{ind}}{\sim} \text{Binomial}(m_i, \mu_i)$, then

$$d(y_i, \hat\mu_i, \omega_i) = \begin{cases} -2\,m_i\log(1 - \hat\mu_i) & \text{if } y_i = 0, \\[4pt] 2\,m_i\left\{y_i\log\left[\dfrac{y_i(1 - \hat\mu_i)}{\hat\mu_i(1 - y_i)}\right] + \log\left[\dfrac{1 - y_i}{1 - \hat\mu_i}\right]\right\} & \text{if } 0 < y_i < 1, \\[4pt] -2\,m_i\log(\hat\mu_i) & \text{if } y_i = 1. \end{cases}$$

If $Y_i \overset{\text{ind}}{\sim} \text{Poisson}(\mu_i)$, then

$$d(y_i, \hat\mu_i) = \begin{cases} 2\,\hat\mu_i & \text{if } y_i = 0, \\[4pt] 2\big[y_i\log(y_i/\hat\mu_i) - (y_i - \hat\mu_i)\big] & \text{if } y_i > 0. \end{cases}$$

If $Y_i \overset{\text{ind}}{\sim} \text{Normal}(\mu_i, \phi/\omega_i)$, then

$$d(y_i, \hat\mu_i, \omega_i) = 2\Big\{\omega_i\big[y_i\theta(y_i) - b[\theta(y_i)]\big] - \omega_i\big[y_i\theta(\hat\mu_i) - b[\theta(\hat\mu_i)]\big]\Big\} = \omega_i\Big[2y_i(y_i - \hat\mu_i) + (\hat\mu_i^2 - y_i^2)\Big] = \omega_i(y_i - \hat\mu_i)^2.$$

In addition, $\displaystyle\sum_{i=1}^{n}\frac{\omega_i(y_i - \mu_i)^2}{\phi} \sim \chi^2(n)$.

180 / 460
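These expressions agree with the dev.resids functions stored in base R's family objects; a quick numerical check of the Poisson case, using arbitrary illustrative values:

# Sketch: unit deviances d(y_i, mu_i) for the Poisson, two ways
y  <- c(0, 3, 5); mu <- c(0.5, 2.8, 6.1); w <- rep(1, 3)
d1 <- ifelse(y == 0, 2*mu, 2*(y*log(y/mu) - (y - mu)))
d2 <- poisson()$dev.resids(y, mu, w)
max(abs(d1 - d2))   # approximately zero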
Deviance

If $Y_i \overset{\text{ind}}{\sim} \text{Gamma}(\mu_i, \phi/\omega_i)$, then

$$\begin{aligned}
d(y_i, \hat\mu_i, \omega_i) &= 2\Big\{\omega_i\big[y_i\theta(y_i) - b[\theta(y_i)]\big] - \omega_i\big[y_i\theta(\hat\mu_i) - b[\theta(\hat\mu_i)]\big]\Big\} \\
&= 2\,\omega_i\left[y_i\left(\frac{1}{\hat\mu_i} - \frac{1}{y_i}\right) - \log\left(\frac{y_i}{\hat\mu_i}\right)\right] = 2\,\omega_i\left[\frac{(y_i - \hat\mu_i)}{\hat\mu_i} - \log\left(\frac{y_i}{\hat\mu_i}\right)\right].
\end{aligned}$$

If $Y_i \overset{\text{ind}}{\sim} \text{Inverse Gaussian}(\mu_i, \phi/\omega_i)$, then

$$\begin{aligned}
d(y_i, \hat\mu_i, \omega_i) &= 2\Big\{\omega_i\big[y_i\theta(y_i) - b[\theta(y_i)]\big] - \omega_i\big[y_i\theta(\hat\mu_i) - b[\theta(\hat\mu_i)]\big]\Big\} \\
&= 2\,\omega_i\left[y_i\left(\frac{1}{2\hat\mu_i^2} - \frac{1}{2y_i^2}\right) - \left(\frac{1}{\hat\mu_i} - \frac{1}{y_i}\right)\right] = \frac{\omega_i(y_i - \hat\mu_i)^2}{\hat\mu_i^2\,y_i}.
\end{aligned}$$

In addition, $\displaystyle\sum_{i=1}^{n}\frac{\omega_i(y_i - \mu_i)^2}{\phi\,\mu_i^2\,y_i} \sim \chi^2(n)$.

181 / 460
Exercises

Let $Y \sim \text{Negative Binomial}(\mu, \sigma^2)$, where $\mu > 0$ and $\sigma^2$ is known. The probability mass function of $Y$ may be written as

$$f_Y(y; \mu, \sigma^2) = \frac{\Gamma\left(y + \frac{1}{\sigma^2}\right)}{\Gamma\left(\frac{1}{\sigma^2}\right)\Gamma(y + 1)}\left(\frac{\sigma^2\mu}{\sigma^2\mu + 1}\right)^y\left(\frac{1}{\sigma^2\mu + 1}\right)^{\frac{1}{\sigma^2}}\mathrm{I}\big\{y \in \{0, 1, 2, \ldots\}\big\}$$

Then, the following GLM can be formulated:

$$Y_i \overset{\text{ind}}{\sim} \text{Negative Binomial}(\mu_i, \sigma^2), \quad g(\mu_i) = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip}$$

(A) What is the maximum of $y_i\theta(\mu_i) - b[\theta(\mu_i)]$ with respect to $\mu_i$ on the interval $(0, \infty)$?
(B) What is the supremum of $-b[\theta(\mu_i)]$ on the interval $(0, \infty)$?
(C) Find the deviance and the unscaled deviance of the model.
(D) Compare the results of (C) above with those obtained in R from negative.binomial(1)$dev.resids, where .Theta = $1/\sigma^2$.

182 / 460
Parameter estimation

If $\ell(\beta)$ is a strictly concave function, that is, if the Hessian matrix of $\ell(\beta)$ is negative definite for all $\beta$, then $\hat\beta$ reduces to the unique solution to $U(\hat\beta) = 0$, if it exists, where $U(\beta)$ is given by

$$U(\beta) = \left(\frac{\partial\ell(\beta)}{\partial\beta_0}, \frac{\partial\ell(\beta)}{\partial\beta_1}, \ldots, \frac{\partial\ell(\beta)}{\partial\beta_p}\right)^\top,$$

in which

$$\begin{aligned}
\frac{\partial\ell(\beta)}{\partial\beta_j} &= \frac{1}{\phi}\sum_{i=1}^{n}\omega_i\left[y_i\frac{\partial\theta_i}{\partial\mu_i}\frac{\partial\mu_i}{\partial\eta_i}\frac{\partial\eta_i}{\partial\beta_j} - b'(\theta_i)\frac{\partial\theta_i}{\partial\mu_i}\frac{\partial\mu_i}{\partial\eta_i}\frac{\partial\eta_i}{\partial\beta_j}\right] \\
&= \frac{1}{\phi}\sum_{i=1}^{n}\omega_i(y_i - \mu_i)\frac{\partial\theta_i}{\partial\mu_i}\frac{\partial\mu_i}{\partial\eta_i}\frac{\partial\eta_i}{\partial\beta_j} = \frac{1}{\phi}\sum_{i=1}^{n}\frac{\omega_i(y_i - \mu_i)}{V(\mu_i)\,g'(\mu_i)}\,x_{ij},
\end{aligned}$$

because

(1) $\eta_i = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip} \;\Longrightarrow\; \dfrac{\partial\eta_i}{\partial\beta_j} = x_{ij}$, with $x_{i0} = 1$;
(2) $g(\mu_i) = \eta_i \;\Longrightarrow\; \dfrac{\partial g(\mu_i)}{\partial\eta_i} = \dfrac{\partial\eta_i}{\partial\eta_i} \;\Longrightarrow\; g'(\mu_i)\dfrac{\partial\mu_i}{\partial\eta_i} = 1 \;\Longrightarrow\; \dfrac{\partial\mu_i}{\partial\eta_i} = \dfrac{1}{g'(\mu_i)}$;
(3) $\mu_i = b'(\theta_i) \;\Longrightarrow\; \dfrac{\partial\mu_i}{\partial\mu_i} = \dfrac{\partial b'(\theta_i)}{\partial\mu_i} = 1 \;\Longrightarrow\; b''(\theta_i)\dfrac{\partial\theta_i}{\partial\mu_i} = 1 \;\Longrightarrow\; \dfrac{\partial\theta_i}{\partial\mu_i} = \dfrac{1}{V(\mu_i)}$.

183 / 460
Parameter estimation

In matrix form, $U(\beta)$ becomes

$$U(\beta) = \begin{pmatrix} \phi^{-1}\sum_{i=1}^{n}\dfrac{\omega_i(y_i - \mu_i)}{V(\mu_i)g'(\mu_i)} \\[6pt] \phi^{-1}\sum_{i=1}^{n}\dfrac{\omega_i(y_i - \mu_i)}{V(\mu_i)g'(\mu_i)}x_{i1} \\ \vdots \\ \phi^{-1}\sum_{i=1}^{n}\dfrac{\omega_i(y_i - \mu_i)}{V(\mu_i)g'(\mu_i)}x_{ip} \end{pmatrix} = \phi^{-1}X^\top KV^{-1}(y - \mu),$$

where

$X = (x_1, \ldots, x_n)^\top$ is the model matrix,
$K = \text{diag}\{1/g'(\mu_1), \ldots, 1/g'(\mu_n)\}$,
$V = \text{diag}\{V(\mu_1)/\omega_1, \ldots, V(\mu_n)/\omega_n\}$,
$y = (y_1, \ldots, y_n)^\top$ is the vector of observed responses,
$\mu = (\mu_1, \ldots, \mu_n)^\top$ is the vector of means.

In general, the solution to the nonlinear equations given by $U(\hat\beta) = 0$ must be obtained using numerical methods such as the Newton-Raphson and the Fisher's Scoring algorithms.

184 / 460
Exercises

Find the solution to $U(\hat\beta) = 0$ in the following cases:

(A) $Y_i \overset{\text{ind}}{\sim} \text{Normal}(\mu_i, \phi/\omega_i)$, $\quad \mu_i = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip}$

(B) $Y_i \overset{\text{ind}}{\sim} \text{LEF}(\mu_i, \phi/\omega_i)$, $\quad g(\mu_i) = \beta_0 + \beta_1 x_{i1}$, where the individual $i$ belongs to group 1 or 2. Therefore, $x_{i1} = \mathrm{I}\{i \in \text{group } 2\}$.

(C) $Y_i \overset{\text{ind}}{\sim} \text{LEF}(\mu_i, \phi/\omega_i)$, $\quad g(\mu_i) = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_{r-1} x_{i(r-1)}$, where the individual $i$ belongs to group $1, 2, \ldots, (r-1)$ or $r$. Hence, $x_{ij} = \mathrm{I}\{i \in \text{group } (j+1)\}$.

185 / 460
Newton-Raphson algorithm

The idea behind this algorithm is to maximize, in its iteration $(t+1)$ for $t = 0, 1, 2, \ldots$, a quadratic approximation of $\ell(\beta)$, denoted here by $\ell^{[t+1]}(\beta)$, so that the smaller the difference between $\ell(\beta)$ and $\ell^{[t+1]}(\beta)$, the smaller the difference between $\hat\beta$ and the maximum of $\ell^{[t+1]}(\beta)$, denoted here by $\beta^{[t+1]}$. This is accomplished by building, in the iteration $(t+1)$ of the algorithm, an approximation of $\ell(\beta)$ based on its second-order Taylor series around $\beta^{[t]}$, obtaining the following (see Appendix C-3):

$$\ell(\beta) \approx \ell^{[t+1]}(\beta) = \ell(\beta^{[t]}) + U^\top(\beta^{[t]})(\beta - \beta^{[t]}) - \frac{1}{2}(\beta - \beta^{[t]})^\top J(\beta^{[t]})(\beta - \beta^{[t]}),$$

where $J(\beta) = -\nabla^2\ell(\beta)$ is minus the Hessian matrix of $\ell(\beta)$, also known as the observed Fisher information matrix of $\beta$. The maximum of $\ell^{[t+1]}(\beta)$ is easy to compute as it is a quadratic and strictly concave function, which is a consequence of the following: (i) the Hessian matrices of $\ell(\beta)$ and $\ell^{[t+1]}(\beta)$ coincide at $\beta^{[t]}$; and (ii) $J(\beta)$ is assumed to be positive definite for all $\beta$.

186 / 460
Newton-Raphson algorithm

The maximum of $\ell^{[t+1]}(\beta)$ is its zero-gradient point, that is, the maximum of $\ell^{[t+1]}(\beta)$ is the solution to the following equation:

$$U(\beta^{[t]}) - J(\beta^{[t]})\big(\beta^{[t+1]} - \beta^{[t]}\big) = 0.$$

Therefore, $\beta^{[t+1]}$ may be written as follows (see, for instance, Agresti [2015, section 4.5.1]):

$$\beta^{[t+1]} = \beta^{[t]} + [J(\beta^{[t]})]^{-1}U(\beta^{[t]}).$$

The above results motivate the following algorithm:

Step 0: Start the counter at $t = 0$; set a tolerance limit, $\epsilon > 0$; set a maximum number of iterations, $n_{max}$; and set a starting value for $\beta$, say $\beta^{[0]}$.
Step 1: Compute $\beta^{[t+1]} = \beta^{[t]} + [J(\beta^{[t]})]^{-1}U(\beta^{[t]})$.
Step 2: Compute $\delta^{(t+1)} = \delta(\beta^{[t]}, \beta^{[t+1]})$.
Step 3: Update the counter by $t = t + 1$.
Step 4: Repeat steps 1, 2 and 3 until $\delta^{(t)} < \epsilon$ or $t > n_{max}$.
Step 5: If $\delta^{(t)} < \epsilon$ then $\hat\beta$ is defined to be $\beta^{[t]}$. Otherwise, the convergence was not achieved.

187 / 460
Newton-Raphson algorithm

[Figure: one Newton-Raphson step; the quadratic approximation $\ell^{[t+1]}(\beta)$ of $\ell(\beta)$ around $\beta^{[t]}$ attains its maximum at $\beta^{[t+1]}$, which lies closer to $\hat\beta$]

188 / 460
Newton-Raphson algorithm

Convergence of the algorithm above is obtained when the value of $\beta^{[t]}$ stabilizes, that is, convergence is reached when the "difference" between $\beta^{[t]}$ and $\beta^{[t+1]}$ is "small" enough. Therefore,

(1) If $\beta^{[t]}$ and $\beta^{[t+1]}$ are equal, then $U(\beta^{[t]}) = 0$ and $\hat\beta = \beta^{[t+1]}$, as $J(\beta^{[t]})$ is a nonsingular matrix.

(2) If $\beta^{[t]}$ reaches the value of $\hat\beta$, then the algorithm finishes, as, according to Step 1 of the algorithm, $U(\beta^{[t]}) = 0$ implies that $\beta^{[t]}$ and $\beta^{[t+1]}$ are equal.

The convergence criterion can be formalized by setting the constant $\epsilon > 0$ and the non-negative function $\delta(\beta^{[t]}, \beta^{[t+1]})$, which is a strictly increasing function of the "difference" between $\beta^{[t]}$ and $\beta^{[t+1]}$. Possible forms for $\delta(\beta^{[t]}, \beta^{[t+1]})$ include the following:

$$\delta(\beta^{[t]}, \beta^{[t+1]}) = \left\|\beta^{[t+1]} - \beta^{[t]}\right\|_r = \left(\sum_{j=0}^{p}\left|\beta_j^{[t+1]} - \beta_j^{[t]}\right|^r\right)^{\frac{1}{r}}, \qquad r \geq 1$$

$$\delta(\beta^{[t]}, \beta^{[t+1]}) = \max\left\{\left|\beta_0^{[t+1]} - \beta_0^{[t]}\right|\Big/\left|\beta_0^{[t]}\right|,\; \ldots,\; \left|\beta_p^{[t+1]} - \beta_p^{[t]}\right|\Big/\left|\beta_p^{[t]}\right|\right\}$$
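As a minimal sketch (not part of glmtoolbox), the two criteria above can be written in R as follows; beta_t and beta_t1 are hypothetical numeric vectors holding $\beta^{[t]}$ and $\beta^{[t+1]}$.

# Two possible forms of delta(beta[t], beta[t+1]); r = 2 gives the
# Euclidean norm in the first form
delta_norm <- function(beta_t, beta_t1, r = 2)
  sum(abs(beta_t1 - beta_t)^r)^(1/r)

# Maximum relative change across the components of beta
delta_rel <- function(beta_t, beta_t1)
  max(abs(beta_t1 - beta_t) / abs(beta_t))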
Fisher's Scoring algorithm

This algorithm is a modified version of the Newton-Raphson algorithm, where $J(\beta)$ is replaced by the expected Fisher information matrix, defined by $I(\beta) = E[J(\beta)] = \phi^{-1} X^\top W X$, with $W = K V^{-1} K$. This algorithm can be more convenient than the Newton-Raphson, as: (i) $I(\beta)$ is a positive definite matrix for all $\beta$ (see Appendix A-3), which, in general, is not true for $J(\beta)$; (ii) the structure of $I(\beta)$ is simpler than that of $J(\beta)$. The Fisher's Scoring algorithm may be written as follows (see, for instance, Agresti [2015, section 4.5.2]):

Step 0: Start the counter at $t = 0$; set a tolerance limit, $\epsilon > 0$; set a maximum number of iterations, $n_{\max}$; and set a starting value for $\beta$, say $\beta^{[0]}$.
Step 1: Compute $\beta^{[t+1]} = \beta^{[t]} + [I(\beta^{[t]})]^{-1}\, U(\beta^{[t]})$.
Step 2: Compute $\delta^{(t+1)} = \delta(\beta^{[t]}, \beta^{[t+1]})$.
Step 3: Update the counter by $t = t + 1$.
Step 4: Repeat steps 1, 2 and 3 until $\delta^{(t)} < \epsilon$ or $t > n_{\max}$.
Step 5: If $\delta^{(t)} < \epsilon$ then $\hat\beta$ is defined to be $\beta^{[t]}$. Otherwise, convergence was not achieved.
Fisher's Scoring algorithm

The entry $(j, j')$ of the matrix $J(\beta)$ may be obtained as follows:

$$-\frac{\partial^2\ell(\beta)}{\partial\beta_j\,\partial\beta_{j'}} = -\frac{\partial}{\partial\beta_{j'}}\left[\frac{1}{\phi}\sum_{i=1}^{n}\frac{\omega_i(Y_i-\mu_i)}{V(\mu_i)g'(\mu_i)}\,x_{ij}\right]$$
$$= -\frac{1}{\phi}\sum_{i=1}^{n}\omega_i x_{ij}\left[f_i\,\frac{\partial(Y_i-\mu_i)}{\partial\mu_i} + (Y_i-\mu_i)\,\frac{\partial f_i}{\partial\mu_i}\right]\frac{\partial\mu_i}{\partial\eta_i}\,\frac{\partial\eta_i}{\partial\beta_{j'}}$$
$$= -\frac{1}{\phi}\sum_{i=1}^{n}\omega_i x_{ij}\left[-\frac{1}{V(\mu_i)\,g'(\mu_i)} + (Y_i-\mu_i)\,\frac{\partial f_i}{\partial\mu_i}\right]\frac{x_{ij'}}{g'(\mu_i)}$$
$$= \frac{1}{\phi}\sum_{i=1}^{n}\frac{\omega_i\, x_{ij}\, x_{ij'}}{V(\mu_i)\,[g'(\mu_i)]^2} - \frac{1}{\phi}\sum_{i=1}^{n}\omega_i(Y_i-\mu_i)\,\frac{\partial f_i}{\partial\mu_i}\,\frac{x_{ij}\, x_{ij'}}{g'(\mu_i)},$$

where $f_i = [V(\mu_i)\,g'(\mu_i)]^{-1}$. Therefore, the entry $(j, j')$ of $I(\beta)$ reduces to (Agresti [2015, page 126]):

$$E\left(-\frac{\partial^2\ell(\beta)}{\partial\beta_j\,\partial\beta_{j'}}\right) = \frac{1}{\phi}\sum_{i=1}^{n}\frac{\omega_i\, x_{ij}\, x_{ij'}}{V(\mu_i)\,[g'(\mu_i)]^2}.$$
Fisher's Scoring algorithm

By replacing $U(\beta)$ and $I(\beta)$ by $\phi^{-1} X^\top W K^{-1}(y-\mu)$ and $\phi^{-1} X^\top W X$, respectively, Step 1 of the Fisher's Scoring algorithm may be written as follows (see, for instance, Agresti [2015, page 142]):

$$\beta^{[t+1]} = \beta^{[t]} + [I(\beta^{[t]})]^{-1}\, U(\beta^{[t]}) = \beta^{[t]} + (X^\top W^{[t]} X)^{-1} X^\top W^{[t]}(z^{[t]} - X\beta^{[t]}) = (X^\top W^{[t]} X)^{-1} X^\top W^{[t]} z^{[t]},$$

where $z^{[t]}$ and $W^{[t]}$ represent $z = X\beta + K^{-1}(y-\mu) = g(\mu) + K^{-1}(y-\mu)$ and $W$ evaluated at $\beta^{[t]}$, respectively, in which $g(\mu)$ is defined to be $(g(\mu_1), \ldots, g(\mu_n))^\top$. Therefore, $\beta^{[t+1]}$ can be regarded as the Weighted Least Squares estimate of $\beta$ in a linear model such that $E(Z^{[t]}) = X\beta$, $\text{Var}(Z^{[t]}) = \sigma^2\,[W^{[t]}]^{-1}$ and $z^{[t]}$ is the observed value of $Z^{[t]}$. Indeed, at the convergence of the algorithm, $\hat\beta$ can be written as $\hat\beta = (X^\top \hat{W} X)^{-1} X^\top \hat{W}\hat{z}$. Hence, the Fisher's Scoring algorithm is also known as Iteratively Reweighted Least Squares (IRLS).
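As an illustration, the following is a minimal IRLS sketch in R for a Poisson model with log link (a canonical link, for which the working weights are $W = \mathrm{diag}\{\mu_i\}$ and the working response is $z_i = \eta_i + (y_i - \mu_i)/\mu_i$). It is not the general glm() machinery; the function name and the simulated data are assumptions made only for this example.

# Minimal IRLS sketch for a Poisson log-link model
irls_poisson <- function(X, y, tol = 1e-8, maxit = 25) {
  beta <- rep(0, ncol(X))
  for (t in seq_len(maxit)) {
    eta <- drop(X %*% beta)
    mu  <- exp(eta)
    z   <- eta + (y - mu) / mu          # working response
    w   <- mu                            # working weights
    beta_new <- drop(solve(crossprod(X, w * X), crossprod(X, w * z)))
    if (max(abs(beta_new - beta)) < tol) return(beta_new)
    beta <- beta_new
  }
  warning("convergence was not achieved")
  beta
}

# The sketch should agree with glm() on simulated data
set.seed(1)
x <- runif(100)
X <- cbind(1, x)
y <- rpois(100, lambda = exp(0.5 + 1.2 * x))
cbind(irls_poisson(X, y), coef(glm(y ~ x, family = poisson(log))))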
Fisher's Scoring algorithm

Thus, an initial or starting value for the Newton-Raphson and Fisher's Scoring algorithms may be defined as

$$\beta^{[0]} = (X^\top W^{\circ} X)^{-1} X^\top W^{\circ} z^{\circ},$$

where $W^{\circ}$ and $z^{\circ}$ represent $W$ and $z$ evaluated at $\mu = y$. However, in some cases, the value of $y$ does not belong to the parametric space of $\mu$, so that $g(y)$ and/or $g'(y)$ may be undefined. This occurs in the following cases:

(1) Under binomial response models when $y = 0$ or $y = 1$. In those cases, $\beta^{[0]}$ may be computed by replacing such values of $y$ by $|y - \tau|$, where $\tau$ is a "small" value such as, for instance, 0.01, 0.05 or 0.1.

(2) Under Poisson response models when $y = 0$. In those cases, $\beta^{[0]}$ may be computed by replacing such values of $y$ by $\tau$, where $\tau$ is a "small" value such as, for instance, 0.01 or 0.05; see the sketch below.
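As a small sketch of case (2), reusing the X and y of the IRLS example above: $\beta^{[0]}$ is a single weighted least squares step with $W$ and $z$ evaluated at $\mu = y$, after replacing any $y = 0$ by a small $\tau$ (the value 0.05 here is arbitrary).

# beta[0] for a Poisson log-link model: W and z evaluated at mu = y,
# replacing zeros in y by tau so that log(y) is defined
tau <- 0.05
y0  <- ifelse(y == 0, tau, y)
w0  <- y0            # W at mu = y (log link: W = diag(mu))
z0  <- log(y0)       # z = g(mu) + K^{-1}(y - mu) evaluated at mu = y
beta0 <- drop(solve(crossprod(X, w0 * X), crossprod(X, w0 * z0)))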
Canonical link functions

The canonical link functions are such that

$$g(\mu) \equiv \theta(\mu),$$

which implies that $g'(\mu) = [V(\mu)]^{-1}$. In those cases, the expression for the elements of the gradient vector $U(\beta)$ reduces to

$$\frac{\partial\ell(\beta)}{\partial\beta_j} = \frac{1}{\phi}\sum_{i=1}^{n}\omega_i(y_i-\mu_i)\,x_{ij}, \qquad j = 0, 1, \ldots, p.$$

Similarly, the expression for the elements of $J(\beta)$ reduces to

$$-\frac{\partial^2\ell(\beta)}{\partial\beta_j\,\partial\beta_{j'}} = \frac{1}{\phi}\sum_{i=1}^{n}\omega_i\, V(\mu_i)\,x_{ij}\,x_{ij'}, \qquad j, j' = 0, 1, \ldots, p.$$

Therefore,

$$U(\beta) = \frac{1}{\phi}\, X^\top \Omega\,(y-\mu) \qquad \text{and} \qquad J(\beta) = I(\beta),$$

where $\Omega = \mathrm{diag}\{\omega_1, \ldots, \omega_n\}$.
Canonical link functions

See, for example, McCullagh and Nelder [1989, page 32].

Distribution          Canonical link
Normal                $g(\mu) = \mu$ (identity)
Binomial              $g(\mu) = \log[\mu/(1-\mu)]$ (logit)
Poisson               $g(\mu) = \log(\mu)$ (log)
Gamma                 $g(\mu) = \mu^{-1}$ (inverse)
Inverse Gaussian      $g(\mu) = \mu^{-2}$ (1/mu^2)

Models with those link functions satisfy the following:

(1) The Newton-Raphson and Fisher's Scoring algorithms coincide, as the only difference between them is that the former uses $J(\beta)$ whereas the latter uses $I(\beta)$.

(2) The Hessian matrix of $\ell(\beta)$, given by $-I(\beta)$, is negative definite for all $\beta$. Therefore, $\ell(\beta)$ is a strictly concave function, and the solution to $U(\beta) = 0$, if it exists, is the unique global maximum of $\ell(\beta)$, that is, it is the maximum likelihood estimate of $\beta$.
Hardened steel

These data, extracted from McCool [1980] and available in the object Steel of the library glmtoolbox, consist of the failure times (life) for hardened steel specimens in a 4-ball rolling contact fatigue test. Ten independent observations were taken at each of the four values of contact stress (stress), in pounds per square inch $\times 10^6$.

[Figure: time until the specimen failed versus contact stress, in psi $\times 10^6$.]

$$Y_i \overset{\text{ind}}{\sim} \text{Gamma}(\mu_i, \sigma^2), \qquad \log(\mu_i) = \beta_0 + \beta_1 x_i$$

$y$ = Time until the specimen failed
$x$ = Contact stress
Hardened steel

The convergence criterion of the Fisher's Scoring algorithm in the function glm() is the following:

$$\delta(\beta^{[t]}, \beta^{[t+1]}) = \left|D(\beta^{[t+1]}) - D(\beta^{[t]})\right| \Big/ \left[\left|D(\beta^{[t]})\right| + 0.1\right] < \epsilon$$

> fit <- glm(life ~ stress, family=Gamma(log), data=Steel)
> FisherScoring(fit)
Tolerance Limit: 1e-08               # epsilon
Maximum number of iterations: 25     # n_max

The columns below are $t$, $D(\beta^{[t]})$, $\delta(\beta^{[t-1]}, \beta^{[t]})$ and $\beta^{[t]} = (\beta_0^{[t]}, \beta_1^{[t]})^\top$:

Iteration Deviance  Tolerance  Intercept   Stress
        1   43.604         NA     12.660  -12.318
        2   34.873 1.9978e-01     14.829  -13.891
        3   34.231 1.8341e-02     14.270  -13.456
        4   34.226 1.5484e-04     14.190  -13.386
        5   34.226 1.7747e-07     14.187  -13.383
        6   34.226 2.6181e-10     14.186  -13.383   # beta.hat

> estequa(fit)                       # U(beta.hat)
(Intercept) -5.827184e-09
stress       3.420245e-06
Hardened steel

> fit0 <- lm(log(life) ~ stress, data=Steel)
> fit1 <- glm(life ~ stress, family=Gamma(log), data=Steel,
+             start=coef(fit0),                              # beta[0]
+             control=list(epsilon=1e-8, maxit=25, trace=TRUE))
Deviance = 34.87253 Iterations - 1
Deviance = 34.23110 Iterations - 2
Deviance = 34.22578 Iterations - 3
Deviance = 34.22578 Iterations - 4
Deviance = 34.22578 Iterations - 5
> summary(fit1)
            Estimate Std.Error t value Pr(>|t|)
(Intercept)   14.186     1.250   11.35 9.02e-14 ***
stress       -13.383     1.203  -11.12 1.63e-13 ***
(Dispersion parameter for Gamma family taken to be 0.7710894)
    Null deviance: 110.033 on 39 degrees of freedom
Residual deviance:  34.226 on 38 degrees of freedom   # D(beta.hat)
Number of Fisher Scoring iterations: 5
Hardened steel

The function glm() also enables the user to specify starting values for $\eta_1, \ldots, \eta_n$ (etastart) and for $\mu_1, \ldots, \mu_n$ (mustart).

> etas <- fitted(fit0)
> fit <- glm(life ~ stress, family=Gamma(log), data=Steel,
+            etastart=etas,                    # (eta_1^[0], ..., eta_n^[0])
+            control=list(epsilon=1e-8, maxit=25, trace=TRUE))
Deviance = 34.87253 Iterations - 1
Deviance = 34.23110 Iterations - 2
Deviance = 34.22578 Iterations - 3
Deviance = 34.22578 Iterations - 4
Deviance = 34.22578 Iterations - 5
> summary(fit)
>
> mus <- exp(fitted(fit0))
> fit <- glm(life ~ stress, family=Gamma(log), data=Steel,
+            mustart=mus,                      # (mu_1^[0], ..., mu_n^[0])
+            control=list(epsilon=1e-8, maxit=25, trace=TRUE))
Goodness-of-fit
R-squared

Interest Model (model under investigation):

$$Y_i \overset{\text{ind}}{\sim} \text{LEF}(\mu_i, \phi/\omega_i), \qquad g(\mu_i) = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip},$$

$$X = \begin{pmatrix} 1 & x_{11} & \ldots & x_{1p} \\ 1 & x_{21} & \ldots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & \ldots & x_{np} \end{pmatrix}, \qquad \hat\beta = \underset{\beta\,\in\,\mathbb{R}^{p+1}}{\text{argmin}}\; D(\beta).$$

Reference Model (model with the poorest goodness-of-fit):

$$Y_i \overset{\text{ind}}{\sim} \text{LEF}(\mu_i, \phi/\omega_i), \qquad g(\mu_i) = \beta,$$

$$X = \begin{pmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{pmatrix}, \qquad \hat\beta^{\circ} = \underset{\beta\,\in\,\mathbb{R}}{\text{argmin}}\; D^{\circ}(\beta).$$

Note that $D(\hat\beta) = \min_{\beta\in\mathbb{R}^{p+1}} D(\beta)$ and $D^{\circ}(\hat\beta^{\circ}) = \min_{\beta\in\mathbb{R}^{p+1}} D(\beta)$ restricted to $\beta_1 = \ldots = \beta_p = 0$, where $D(\beta)$ and $D^{\circ}(\beta)$ represent the unscaled deviances of the Interest Model and the Reference Model, respectively.
R-squared

$$R^2 = 1 - \frac{D(\hat\beta)}{D^{\circ}(\hat\beta^{\circ})}$$

The $R^2$ satisfies the following:

(1) The larger its value, the better the goodness-of-fit of the model.
(2) $R^2 \in [0, 1]$.
(3) If $X = 1_n$ then $R^2 = 0$, where $1_n$ represents a column vector of 1's of size $n$.
(4) If $X = I_n$ then $R^2 = 1$, where $I_n$ represents the identity matrix of size $n$.
(5) A hundred times its value represents the percentage reduction in unscaled deviance caused by the inclusion of $p$ parameters in the linear predictor.
(6) It is a one-to-one transformation of the unscaled deviance. It can be computed in R as shown below.
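In R, this $R^2$ can be computed directly from a fitted glm object, since deviance(fit) and the null.deviance component store $D(\hat\beta)$ and $D^{\circ}(\hat\beta^{\circ})$; a minimal sketch:

# R-squared based on unscaled deviances: 1 - D(beta.hat)/D0(beta0.hat)
R2 <- function(fit) 1 - deviance(fit) / fit$null.deviance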
R-squared

The $R^2$ can be used to compare nested models, that is, models that assume the same probability distribution to describe the response variable, but where the first model's $X$ matrix coincides with the first $(r+1)$ columns of the second model's $X$ matrix, in which $r < p$.

Model I: $Y_i \overset{\text{ind}}{\sim} \text{LEF}(\mu_i, \phi/\omega_i)$, $g(\mu_i) = \tilde\beta_0 + \sum_{j=1}^{r}\tilde\beta_j x_{ij}$, with $\hat{\tilde\beta} = \underset{\tilde\beta\,\in\,\mathbb{R}^{r+1}}{\text{argmin}}\; D_{\text{I}}(\tilde\beta)$.

Model II: $Y_i \overset{\text{ind}}{\sim} \text{LEF}(\mu_i, \phi/\omega_i)$, $g(\mu_i) = \beta_0 + \sum_{j=1}^{r}\beta_j x_{ij} + \sum_{j=r+1}^{p}\beta_j x_{ij}$, with $\hat\beta = \underset{\beta\,\in\,\mathbb{R}^{p+1}}{\text{argmin}}\; D_{\text{II}}(\beta)$.

Note that $D_{\text{II}}(\hat\beta) = \min_{\beta\in\mathbb{R}^{p+1}} D_{\text{II}}(\beta)$ and $D_{\text{I}}(\hat{\tilde\beta}) = \min_{\beta\in\mathbb{R}^{p+1}} D_{\text{II}}(\beta)$ restricted to $\beta_{r+1} = \ldots = \beta_p = 0$, where $D_{\text{I}}(\tilde\beta)$ and $D_{\text{II}}(\beta)$ represent the unscaled deviances of models I and II, respectively.
R-squared

Therefore,

$$D_{\text{II}}(\hat\beta) \leq D_{\text{I}}(\hat{\tilde\beta}),$$

which implies that $R^2_{\text{II}}$ is ALWAYS larger than or equal to $R^2_{\text{I}}$, that is, according to the $R^2$, model II is ALWAYS better than model I, even if the $(p - r)$ additional covariates in model II do not help to explain the differences between $y_1, \ldots, y_n$.

An adjusted version of $R^2$, in which not just the model goodness-of-fit is measured but also its complexity is penalized, is given by the following expression:

$$R^2_{\text{adj}} = 1 - \frac{D(\hat\beta)/(n-p-1)}{D^{\circ}(\hat\beta^{\circ})/(n-1)} = 1 - \frac{n-1}{n-p-1}\,(1 - R^2),$$

where $(n-p-1)$ and $(n-1)$ are the degrees-of-freedom of the Interest Model and the Reference Model, respectively. Note that $R^2_{\text{adj}}$ is a one-to-one transformation of $D(\hat\beta)/(n-p-1)$.
Generalized Information Criterion (GIC)

$$GIC(k) = \underbrace{-2\,\ell(\hat\beta)}_{\substack{\text{measure of the}\\ \text{goodness-of-fit}}} + \underbrace{k \times (\text{number of parameters})}_{\substack{\text{complexity}\\ \text{penalty}}},$$

where

$$\ell(\hat\beta) = \sum_{i=1}^{n}\log[C(y_i, \phi_i)] + \frac{1}{\phi}\sum_{i=1}^{n}\omega_i\left\{y_i\,\theta(\hat\mu_i) - b[\theta(\hat\mu_i)]\right\}.$$

The Akaike Information Criterion (AIC, Akaike [1974]) and the Bayesian Information Criterion (BIC, Schwarz [1978]), also known as the Schwarz Information Criterion (SIC), are special cases of $GIC(k)$ when $k = 2$ and $k = \log(n)$, respectively.

If $n > 7$ then the penalty for model complexity in BIC is higher than that in AIC. Therefore, the BIC tends to choose simpler models than AIC.
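A minimal sketch of $GIC(k)$ in R, using logLik() for $\ell(\hat\beta)$ and its "df" attribute for the number of estimated parameters:

# GIC(k) = -2*loglik(beta.hat) + k*(number of estimated parameters)
GIC <- function(fit, k = 2) {
  ll <- logLik(fit)
  -2 * as.numeric(ll) + k * attr(ll, "df")
}
# GIC(fit, 2)              reproduces AIC(fit)
# GIC(fit, log(nobs(fit))) reproduces BIC(fit)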
Statistical inference
Asymptotic theory

If the model is correctly specified then, under certain conditions, better known as regularity conditions, $\hat\beta$ is a consistent estimator of $\beta$, which asymptotically (that is, when $n \to \infty$) satisfies the following (Fahrmeir and Kaufmann [1985]):

- it is unbiased,
- it is efficient,
- it follows a multivariate normal distribution.

As a result, for "large" values of the sample size $n$, we have that

$$\hat\beta \;\dot\sim\; \text{Normal}_{p+1}(\beta, [I(\beta)]^{-1}),$$

where $\dot\sim$ means "approximately distributed as". A consistent estimator of $[I(\beta)]^{-1}$ is $[I(\hat\beta)]^{-1}$.
Hypothesis testing

The interest is to assess the following hypothesis system

$$H_0: L^\top\beta = M \qquad \text{versus} \qquad H_1: L^\top\beta \neq M,$$

where $L^\top$ is an $r \times (p+1)$ matrix whose rows are linearly independent and $M$ is an $r \times 1$ matrix. The decision rule is the following:

"Reject $H_0$ at the approximate $100(\alpha)\%$ significance level if $\xi > \chi^2_{1-\alpha}(r)$",

where $\alpha \in (0, 1)$, $\chi^2_{1-\alpha}(r)$ is the $100(1-\alpha)$th percentile of the chi-square distribution with $r$ degrees-of-freedom, and $\xi$ is one of the following statistics (see, for instance, Buse [1982] and Terrell [2002]):

- Likelihood-ratio
- Wald
- Rao's Score
- Terrell's Gradient

If $H_0$ is true, then the asymptotic distribution ($n \to \infty$) of $\xi$ is $\chi^2(r)$. The statistic $\xi$ depends on the dispersion parameter $\phi$. However, the asymptotic distribution of $\xi$ remains $\chi^2(r)$ even when $\phi$ is unknown and is replaced by a consistent estimator.
Hypothesis testing

For example, with $p = 3$:

(1) $H_0: \beta_2 = 0$ versus $H_1: \beta_2 \neq 0$. Then $r = 1$, $L^\top = \begin{bmatrix} 0 & 0 & 1 & 0 \end{bmatrix}$ and $M = 0$.

(2) $H_0: \begin{bmatrix} \beta_1 \\ \beta_3 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$ versus $H_1: \begin{bmatrix} \beta_1 \\ \beta_3 \end{bmatrix} \neq \begin{bmatrix} 0 \\ 0 \end{bmatrix}$. Then $r = 2$, $L^\top = \begin{bmatrix} 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}$ and $M = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$.

(3) $H_0: \beta_2 = \beta_3$ versus $H_1: \beta_2 \neq \beta_3$. Then $r = 1$, $L^\top = \begin{bmatrix} 0 & 0 & 1 & -1 \end{bmatrix}$ and $M = 0$. A worked version of this case is sketched below.
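To make case (3) concrete, the Wald statistic introduced in the following pages can be computed by hand from coef() and vcov() of any fitted model `fit` with $p = 3$ (the object `fit` is hypothetical here):

# Wald statistic for H0: L'beta = M with L' = (0, 0, 1, -1) and M = 0
Lt <- matrix(c(0, 0, 1, -1), nrow = 1)   # r = 1 row, (p + 1) = 4 columns
M  <- 0
d  <- Lt %*% coef(fit) - M
xi_W <- drop(t(d) %*% solve(Lt %*% vcov(fit) %*% t(Lt)) %*% d)
pchisq(xi_W, df = nrow(Lt), lower.tail = FALSE)   # Pr[chi2(r) > xi_W]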
Hypothesis testing

Likelihood-ratio test

$$\xi_{LR} = 2\log\left(\frac{L(\hat\beta)}{L(\hat\beta^{\circ})}\right) = 2\left[\ell(\hat\beta) - \ell(\hat\beta^{\circ})\right]$$
$$= \frac{2}{\phi}\left\{\sum_{i=1}^{n}\omega_i\left[y_i\,\theta(\hat\mu_i) - b[\theta(\hat\mu_i)]\right] - \sum_{i=1}^{n}\omega_i\left[y_i\,\theta(\hat\mu_i^{\circ}) - b[\theta(\hat\mu_i^{\circ})]\right]\right\}$$
$$= \frac{2}{\phi}\left\{\sum_{i=1}^{n}\sup_{\mu_i}\,\omega_i\left[y_i\,\theta(\mu_i) - b[\theta(\mu_i)]\right] - \sum_{i=1}^{n}\omega_i\left[y_i\,\theta(\hat\mu_i^{\circ}) - b[\theta(\hat\mu_i^{\circ})]\right]\right\}$$
$$\qquad - \frac{2}{\phi}\left\{\sum_{i=1}^{n}\sup_{\mu_i}\,\omega_i\left[y_i\,\theta(\mu_i) - b[\theta(\mu_i)]\right] - \sum_{i=1}^{n}\omega_i\left[y_i\,\theta(\hat\mu_i) - b[\theta(\hat\mu_i)]\right]\right\}$$
$$= \frac{1}{\phi}\left[D(\hat\beta^{\circ}) - D(\hat\beta)\right],$$

where $\hat\beta = \underset{\beta\,\in\,\mathbb{R}^{p+1}}{\text{argmin}}\; D(\beta)$, $\hat\beta^{\circ} = \underset{\beta\,\in\,\mathbb{R}^{p+1}}{\text{argmin}}\; D(\beta)$ restricted to $L^\top\beta = M$, $\hat\mu_i = g^{-1}(x_i^\top\hat\beta)$ and $\hat\mu_i^{\circ} = g^{-1}(x_i^\top\hat\beta^{\circ})$.
Hypothesis testing

Wald test

$\hat\beta \;\dot\sim\; \text{Normal}_{p+1}(\beta, [I(\beta)]^{-1})$, which implies that (see Appendix E-3)

$$L^\top\hat\beta \;\dot\sim\; \text{Normal}_r(L^\top\beta,\; L^\top[I(\beta)]^{-1}L),$$

which in turn implies that (see Appendix E-3,4)

$$(L^\top\hat\beta - L^\top\beta)^\top(L^\top[I(\beta)]^{-1}L)^{-1}(L^\top\hat\beta - L^\top\beta) \;\dot\sim\; \chi^2(r).$$

So, the statistic of the Wald test reduces to

$$\xi_W = (L^\top\hat\beta - M)^\top(L^\top[I(\hat\beta)]^{-1}L)^{-1}(L^\top\hat\beta - M).$$

Unlike $\xi_{LR}$, the statistic $\xi_W$ does not depend on $\hat\beta^{\circ}$. As a result, $\xi_W$ has a lower computational cost than $\xi_{LR}$, since the former does not require the computation of $\hat\beta^{\circ}$.
Hypothesis testing

For example,

(1) $H_0: \beta_2 = 0$ versus $H_1: \beta_2 \neq 0$. Then

$$\xi_W = \frac{\hat\beta_2^{\,2}}{\widehat{\text{Var}}(\hat\beta_2)} = \frac{\hat\beta_2^{\,2}}{[\text{Std.Error}(\hat\beta_2)]^2}$$

(2) $H_0: \begin{bmatrix} \beta_1 \\ \beta_3 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$ versus $H_1: \begin{bmatrix} \beta_1 \\ \beta_3 \end{bmatrix} \neq \begin{bmatrix} 0 \\ 0 \end{bmatrix}$. Then

$$\xi_W = \begin{bmatrix} \hat\beta_1 & \hat\beta_3 \end{bmatrix}\begin{bmatrix} \widehat{\text{Var}}(\hat\beta_1) & \widehat{\text{Cov}}(\hat\beta_1, \hat\beta_3) \\ \widehat{\text{Cov}}(\hat\beta_3, \hat\beta_1) & \widehat{\text{Var}}(\hat\beta_3) \end{bmatrix}^{-1}\begin{bmatrix} \hat\beta_1 \\ \hat\beta_3 \end{bmatrix}$$

(3) $H_0: \beta_2 = \beta_3$ versus $H_1: \beta_2 \neq \beta_3$. Then

$$\xi_W = \frac{(\hat\beta_2 - \hat\beta_3)^2}{\widehat{\text{Var}}(\hat\beta_2) + \widehat{\text{Var}}(\hat\beta_3) - 2\,\widehat{\text{Cov}}(\hat\beta_2, \hat\beta_3)}$$
Hypothesis testing

Rao's Score test

$U(\beta) \;\dot\sim\; \text{Normal}_{p+1}(0, I(\beta))$, which implies that (see Appendix E-3)

$$L^\top U(\beta) \;\dot\sim\; \text{Normal}_r(0,\; L^\top[I(\beta)]L),$$

which in turn implies that (see Appendix E-3,4)

$$[L^\top U(\beta)]^\top(L^\top[I(\beta)]L)^{-1}[L^\top U(\beta)] \;\dot\sim\; \chi^2(r).$$

So, the statistic of the Rao's Score test reduces to

$$\xi_S = [L^\top U(\hat\beta^{\circ})]^\top(L^\top[I(\hat\beta^{\circ})]L)^{-1}[L^\top U(\hat\beta^{\circ})].$$

Unlike $\xi_{LR}$, the statistic $\xi_S$ does not depend on $\hat\beta$. As a result, $\xi_S$ has a lower computational cost than $\xi_{LR}$, since the former does not require the computation of $\hat\beta$.

Terrell's Gradient test

$$\xi_G = (L^\top\hat\beta - M)^\top[L^\top U(\hat\beta^{\circ})]$$

Similar to $\xi_{LR}$, the statistic of the Terrell's Gradient test depends on both $\hat\beta$ and $\hat\beta^{\circ}$.
Hypothesis testing

> help(anova2)
>
> anova2(fit0, fit1, test="wald")

- fit0: the restricted model, that is, the model fitted under the restrictions on $\beta$ stated by $H_0$.
- fit1: the unrestricted model, that is, the model fitted without restrictions on $\beta$.
- test: the statistical test specified to assess the hypothesis. The available options are: Wald ("wald"), Rao's score ("score"), Terrell's gradient ("gradient"), and likelihood ratio ("lr") tests. By default, test is set to be "wald".

The function anova2() provides the interest values associated with the hypothesis testing, that is, $\xi$, $r$ and $\Pr[\chi^2(r) > \xi]$.
Confidence intervals (see Agresti [2015, page 131])

The approximate $100(1-\alpha)\%$ confidence interval for $\beta_j$ includes all $\beta^{\circ}$ for which the hypothesis system given by

$$H_0: \beta_j = \beta^{\circ} \qquad \text{versus} \qquad H_1: \beta_j \neq \beta^{\circ}$$

is not rejected at the approximate $100(\alpha)\%$ significance level.

In general, the limits of the approximate $100(1-\alpha)\%$ confidence interval for $\beta_j$ based on the likelihood-ratio, Rao's Score, and Terrell's Gradient tests do not have closed form and require numerical methods. On the other hand, the approximate $100(1-\alpha)\%$ confidence interval for $\beta_j$ based on the Wald test comprises the $\beta^{\circ}$ such that $(\hat\beta_j - \beta^{\circ})^2/\widehat{\text{Var}}(\hat\beta_j) \leq \chi^2_{1-\alpha}(1) = Z^2_{1-\frac{\alpha}{2}}$. Thus, the interval is the following:

$$\left(\hat\beta_j - Z_{1-\frac{\alpha}{2}}\,[\widehat{\text{Var}}(\hat\beta_j)]^{\frac{1}{2}},\;\; \hat\beta_j + Z_{1-\frac{\alpha}{2}}\,[\widehat{\text{Var}}(\hat\beta_j)]^{\frac{1}{2}}\right),$$

where $Z_{1-\frac{\alpha}{2}}$ represents the $100(1-\frac{\alpha}{2})$th percentile of the standard normal distribution.
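A minimal sketch of these Wald intervals, computed from coef() and vcov() of any fitted glm object:

# Approximate 100(1 - alpha)% Wald confidence intervals for all beta_j
wald_ci <- function(fit, level = 0.95) {
  z  <- qnorm(1 - (1 - level) / 2)
  se <- sqrt(diag(vcov(fit)))
  cbind(lower = coef(fit) - z * se, upper = coef(fit) + z * se)
}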
Confidence intervals

> help(confint2)
>
> confint2(fit, level, test, digits)

- fit: the fitted model.
- level: the value indicating the required confidence level. By default, level is set to be 0.95.
- test: the statistical test on which the approximate confidence intervals are based. The available options are: Wald ("wald"), Rao's score ("score"), Terrell's gradient ("gradient"), and likelihood ratio ("lr") tests. By default, test is set to be "wald".
- digits: the integer value indicating the number of decimal places to be used. By default, digits is set to be 5.

The function confint2() provides the lower and upper limits of the confidence intervals for all parameters in the linear predictor.
Estimation of the dispersion parameter

$$E\left[\frac{1}{n}\sum_{i=1}^{n}\frac{\omega_i(Y_i-\mu_i)^2}{V(\mu_i)}\right] = \phi,$$

and, by the Law of Large Numbers,

$$\frac{1}{n}\sum_{i=1}^{n}\frac{\omega_i(Y_i-\mu_i)^2}{V(\mu_i)} \;\xrightarrow[n\to\infty]{P}\; \phi.$$

Then, by replacing $\mu_1, \ldots, \mu_n$ by $\hat\mu_1, \ldots, \hat\mu_n$ and adjusting the value of $n$ by the number of estimated parameters, the following estimator of $\phi$ is obtained:

$$\hat\phi = \frac{1}{n-p-1}\sum_{i=1}^{n}\frac{\omega_i(y_i-\hat\mu_i)^2}{V(\hat\mu_i)} = \frac{P(\hat\beta)}{n-p-1},$$

where $P(\hat\beta)$ is Pearson's statistic. A second estimator of $\phi$ is the following:

$$\tilde\phi = \frac{D(\hat\beta)}{n-p-1}.$$
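Both estimators are easily obtained from a fitted glm object; a small sketch:

# Pearson-based and deviance-based estimators of phi
phi_hat   <- function(fit) sum(resid(fit, type = "pearson")^2) / df.residual(fit)
phi_tilde <- function(fit) deviance(fit) / df.residual(fit)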
Special cases
Binomial regression

$$m_i Y_i \overset{\text{ind}}{\sim} \text{Binomial}(m_i, \mu_i), \qquad g(\mu_i) = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip}$$

$$E(Y_i) = \mu_i \qquad \text{and} \qquad \text{Var}(Y_i) = m_i^{-1}\mu_i(1-\mu_i)$$

Logit link: $g(\mu_i) = \log\left(\dfrac{\mu_i}{1-\mu_i}\right)$
Probit link: $g(\mu_i) = \Phi^{-1}(\mu_i)$
Complementary log-log link: $g(\mu_i) = \log\left[-\log\left(1-\mu_i\right)\right]$
Cauchy link: $g(\mu_i) = \tan\left[\dfrac{\pi}{2}\left(2\mu_i - 1\right)\right]$

In this model $\omega_i = m_i$ for $i = 1, \ldots, n$.
Males attached to female horseshoe crabs

These data, extracted from Agresti [2015, page 16] and available in the object hcrabs of the library GLMsData, come from a study of female horseshoe crabs on an island in the Gulf of Mexico. During the spawning season, a female migrates to the shore to breed. With a male attached to her posterior spine, she burrows into the sand and lays clusters of eggs. The eggs are fertilized externally in the sand beneath the pair. During spawning, other male crabs (satellites) may cluster around the pair and may also fertilize the eggs. The data set includes:

- Width ($x_1$). The carapace width of the female crab, in cms.
- Col. The color of the female crab: "LM" (Light Medium), "M" (Medium), "DM" (Dark Medium), or "D" (Dark). Then, $x_2 = 1$ if Col="LM" or Col="M"; and $x_2 = 0$ if Col="DM" or Col="D".
- Sat. The number of male crabs attached to the female crab. Then, $y = 1$ if Sat > 0 and $y = 0$ if Sat = 0.

$$Y_i \overset{\text{ind}}{\sim} \text{Bernoulli}(\mu_i), \qquad g(\mu_i) = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2}$$
Parameter interpretation: logit link

Let $Z \sim \text{Binomial}(m, \mu)$ and $Z^{\circ} = m - Z$ be the number of "successes" and "failures", respectively. Then,

$$\text{Odds}(\mu) = \frac{E(Z)}{E(Z^{\circ})} = \frac{\mu}{1-\mu} = \exp[g(\mu)].$$

$\text{Odds}(\mu)$ satisfies the following:

(1) It is a strictly positive function of $\mu$.
(2) It is a strictly increasing function of $\mu$. Therefore, if $\mu$ increases then $\text{Odds}(\mu)$ also increases, and vice versa. In addition, if $\mu$ decreases then $\text{Odds}(\mu)$ also decreases, and vice versa.
(3) $\text{Odds}(0.5)$ is 1.
(4) It can be interpreted as: on average, for every "failure" there are $\text{Odds}(\mu)$ "successes".

For example, on average, for every female horseshoe crab without satellites, there are $\text{Odds}(\mu)$ female horseshoe crabs with at least one satellite.
Parameter interpretation: cloglog link

Let $Z \sim \text{Geometric}(\mu)$ be the number of independent trials required to get a "success". Then, $\Pr(Z = z) = (1-\mu)^{z-1}\mu\; I_{z\in\{1, 2, \ldots\}}$, $E(Z) = \mu^{-1}$, and the median of $Z$ is given approximately by

$$\text{MNNT}(\mu) = \frac{-\log(2)}{\log(1-\mu)} = \log(2)\exp[-g(\mu)].$$

$\text{MNNT}(\mu)$ satisfies the following:

(1) It is a strictly positive function of $\mu$.
(2) It is a strictly decreasing function of $\mu$. Therefore, if $\mu$ increases then $\text{MNNT}(\mu)$ decreases, and vice versa. In addition, if $\mu$ decreases then $\text{MNNT}(\mu)$ increases, and vice versa.
(3) $\text{MNNT}(0.5)$ is 1.
(4) It can be interpreted as the median number of trials required to get a "success".

For example, the median number of female horseshoe crabs required to get one with at least one satellite is $\text{MNNT}(\mu)$.
Parameter interpretation: without interaction

The odds of having at least one satellite for a female crab with carapace width $x_1$ and color $x_2$ is $\exp(\beta_0 + \beta_1 x_1 + \beta_2 x_2)$.

If two female crabs have the same carapace width, but the color of the first is Light whereas the color of the second is Dark, then the odds of having at least one satellite of the first female crab is $\exp(\beta_2)$ times that of the second:

$$\frac{\text{Odds}_L}{\text{Odds}_D} = \frac{\exp(\beta_0 + \beta_1 x_1 + \beta_2)}{\exp(\beta_0 + \beta_1 x_1)} = \exp(\beta_2).$$

If two female crabs have the same color, but the carapace width of the first is $x + \delta$ whereas the carapace width of the second is $x$, then the odds of having at least one satellite of the first female crab is $\exp(\beta_1\delta)$ times that of the second:

$$\frac{\text{Odds}_{x+\delta}}{\text{Odds}_x} = \frac{\exp(\beta_0 + \beta_1(x+\delta) + \beta_2 x_2)}{\exp(\beta_0 + \beta_1 x + \beta_2 x_2)} = \exp(\beta_1\delta).$$
Parameter interpretation: with interaction

The odds of having at least one satellite for a female crab with carapace width $x_1$ and color $x_2$ is $\exp(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2)$.

If there are two "Dark" female crabs where the carapace width of the first is $x + \delta$ whereas the carapace width of the second is $x$, then the odds of having at least one satellite of the first female crab is $\exp(\beta_1\delta)$ times that of the second:

$$\frac{\text{Odds}_{x+\delta}}{\text{Odds}_x} = \frac{\exp(\beta_0 + \beta_1(x+\delta))}{\exp(\beta_0 + \beta_1 x)} = \exp(\beta_1\delta).$$

If there are two "Light" female crabs where the carapace width of the first is $x + \delta$ whereas the carapace width of the second is $x$, then the odds of having at least one satellite of the first female crab is $\exp[(\beta_1 + \beta_3)\delta]$ times that of the second:

$$\frac{\text{Odds}_{x+\delta}}{\text{Odds}_x} = \frac{\exp(\beta_0 + (\beta_1 + \beta_3)(x+\delta) + \beta_2)}{\exp(\beta_0 + (\beta_1 + \beta_3)x + \beta_2)} = \exp[(\beta_1 + \beta_3)\delta].$$
Males attached to female horseshoe crabs

> data(hcrabs, package="GLMsData")
> hcrabs <- within(hcrabs, {Col2 <- ifelse(Col %in% c("LM","M"), 1, 0)
+                           Sat2 <- ifelse(Sat > 0, 1, 0)})
> fit <- glm(Sat2 ~ Width + Col2, family=binomial(logit), data=hcrabs)
> summary(fit)
            Estimate Std.Error z value Pr(>|z|)
(Intercept) -11.9501    2.6643  -4.485 7.28e-06
Width         0.4670    0.1037   4.506 6.61e-06
Col2          0.6531    0.3571   1.829   0.0675
(Dispersion parameter for binomial family taken to be 1)          # phi
    Null deviance: 225.76 on 172 degrees of freedom  # D0(beta0.hat), (n-1)
Residual deviance: 191.12 on 170 degrees of freedom  # D(beta.hat), (n-p-1)
AIC: 197.12                                          # -2*loglik + 2(p+1)
Number of Fisher Scoring iterations: 4

The columns of the coefficients table are $\hat\beta_j$, $c_j = \text{Std.Error}(\hat\beta_j)$, $z_j = \hat\beta_j/c_j$ and $\Pr[\chi^2(1) > z_j^2]$.
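Since the link is the logit, exponentiating the estimates gives the odds ratios discussed in the previous pages; for instance, $\exp(\hat\beta_1)$ is the estimated multiplicative effect on the odds of one extra centimeter of carapace width:

# Estimated odds ratios for the fit above
exp(coef(fit))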
Males attached to female horseshoe crabs

> fit2 <- update(fit, formula=Sat2 ~ Width + Col2 + Width:Col2)
> anova2(fit,fit2,test="wald")
Model 1 : Sat2 ~ Width + Col2
Model 2 : Sat2 ~ Width + Col2 + Width:Col2
       Chi df Pr(Chisq>)          # xi_W, r, Pr[chi2(r) > xi_W]
1 vs 2 0.78331 1 0.3761
>
> anova2(fit,fit2, test="lr")
Model 1 : Sat2 ~ Width + Col2
Model 2 : Sat2 ~ Width + Col2 + Width:Col2
       Chi df Pr(Chisq>)          # xi_LR, r, Pr[chi2(r) > xi_LR]
1 vs 2 0.80605 1 0.3693
>
> anova2(fit,fit2, test="score")
Model 1 : Sat2 ~ Width + Col2
Model 2 : Sat2 ~ Width + Col2 + Width:Col2
       Chi df Pr(Chisq>)          # xi_S, r, Pr[chi2(r) > xi_S]
1 vs 2 0.79097 1 0.3738
Males attached to female horseshoe crabs

- Model matrix $X$: model.matrix(fit)
- Parameter estimates $\hat\beta = (\hat\beta_0, \hat\beta_1, \ldots, \hat\beta_p)^\top$: coef(fit)
- Estimate of the variance-covariance matrix $[I(\hat\beta)]^{-1}$: vcov(fit)
- Fitted values $\hat\mu = (\hat\mu_1, \ldots, \hat\mu_n)^\top$: fitted(fit)
- Unscaled deviance $D(\hat\beta)$: deviance(fit)
- Pearson's statistic $P(\hat\beta)$: sum(resid(fit, type="pearson")**2)
- Log-likelihood $\ell(\hat\beta)$: logLik(fit)
- AIC $= -2\ell(\hat\beta) + 2(p+1)$: AIC(fit)
- BIC $= -2\ell(\hat\beta) + \log(n)(p+1)$: BIC(fit)
- $R^2_{\text{adj}} = 1 - \dfrac{D(\hat\beta)/(n-p-1)}{D^{\circ}(\hat\beta^{\circ})/(n-1)}$: with(fit, 1-deviance*df.null/(null.deviance*df.residual))
Insecticides

These data, extracted from Hewlett and Plackett [1950] and available in the object deposit of the library GLMsData, arose from an experiment in which groups of insects were exposed to deposits of insecticides. The total number of insects exposed in each group (Number), the number of insects killed after six days exposure (Killed), and the amount (Deposit) and type of insecticide used (Insecticide) were recorded.

[Figure: percentage of killed insects versus amount of insecticide, in milligrams, for insecticide types A, B and C.]

$$m_i Y_i \overset{\text{ind}}{\sim} \text{Binomial}(m_i, \mu_i), \qquad g(\mu_i) = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3}$$

$m$ = Number of exposed insects
$y$ = Proportion of killed insects
$x_1$ = Amount of insecticide, in mg.
$x_2 = 0$ and $x_3 = 0$ if Insecticide="A"
$x_2 = 1$ and $x_3 = 0$ if Insecticide="B"
$x_2 = 0$ and $x_3 = 1$ if Insecticide="C"
Parameter interpretation: without interaction

The odds of death for an insect exposed to an amount of insecticide $x_1$ of type A ($x_2 = 0$ and $x_3 = 0$), B ($x_2 = 1$ and $x_3 = 0$) or C ($x_2 = 0$ and $x_3 = 1$) is $\exp(\beta_0 + \beta_1\log(x_1) + \beta_2 x_2 + \beta_3 x_3)$.

Two insects are exposed to the same type of insecticide, but the amounts the first and second insects were exposed to are $x\delta$ and $x$, respectively. Then, the odds of death of the first insect is $\delta^{\beta_1}$ times that of the second:

$$\frac{\text{Odds}_{x\delta}}{\text{Odds}_x} = \frac{\exp(\beta_0 + \beta_1\log(x\delta) + \beta_2 x_2 + \beta_3 x_3)}{\exp(\beta_0 + \beta_1\log(x) + \beta_2 x_2 + \beta_3 x_3)} = \delta^{\beta_1}.$$

Two insects receive the same dose of insecticide, but the first insect is exposed to type B insecticide and the second insect is exposed to type A insecticide. Then, the odds of death of the first insect is $\exp(\beta_2)$ times that of the second:

$$\frac{\text{Odds}_B}{\text{Odds}_A} = \frac{\exp(\beta_0 + \beta_1\log(x_1) + \beta_2)}{\exp(\beta_0 + \beta_1\log(x_1))} = \exp(\beta_2).$$

Two insects receive the same dose of insecticide, but the first insect is exposed to type C insecticide and the second insect is exposed to type A insecticide. Then, the odds of death of the first insect is $\exp(\beta_3)$ times that of the second:

$$\frac{\text{Odds}_C}{\text{Odds}_A} = \frac{\exp(\beta_0 + \beta_1\log(x_1) + \beta_3)}{\exp(\beta_0 + \beta_1\log(x_1))} = \exp(\beta_3).$$
Parameter interpretation: with interaction

The odds of death for an insect exposed to an amount of insecticide $x_1$ of type A ($x_2 = 0$ and $x_3 = 0$), B ($x_2 = 1$ and $x_3 = 0$) or C ($x_2 = 0$ and $x_3 = 1$) is $\exp(\beta_0 + \beta_1\log(x_1) + \beta_2 x_2 + \beta_3 x_3 + \beta_4\log(x_1)x_2 + \beta_5\log(x_1)x_3)$.

Two insects were exposed to insecticide type A. The amounts the first and second insects were exposed to are $x\delta$ and $x$, respectively. Then, the odds of death of the first insect is $\delta^{\beta_1}$ times that of the second:

$$\frac{\text{Odds}_{x\delta}}{\text{Odds}_x} = \frac{\exp(\beta_0 + \beta_1\log(x\delta))}{\exp(\beta_0 + \beta_1\log(x))} = \delta^{\beta_1}.$$

Two insects were exposed to insecticide type B. The amounts the first and second insects were exposed to are $x\delta$ and $x$, respectively. Then, the odds of death of the first insect is $\delta^{\beta_1+\beta_4}$ times that of the second:

$$\frac{\text{Odds}_{x\delta}}{\text{Odds}_x} = \frac{\exp(\beta_0 + \beta_1\log(x\delta) + \beta_4\log(x\delta))}{\exp(\beta_0 + \beta_1\log(x) + \beta_4\log(x))} = \delta^{\beta_1+\beta_4}.$$

Two insects were exposed to insecticide type C. The amounts the first and second insects were exposed to are $x\delta$ and $x$, respectively. Then, the odds of death of the first insect is $\delta^{\beta_1+\beta_5}$ times that of the second:

$$\frac{\text{Odds}_{x\delta}}{\text{Odds}_x} = \frac{\exp(\beta_0 + \beta_1\log(x\delta) + \beta_5\log(x\delta))}{\exp(\beta_0 + \beta_1\log(x) + \beta_5\log(x))} = \delta^{\beta_1+\beta_5}.$$
Insecticides

> data(deposit, package="GLMsData")
>
> fit <- glm(Killed/Number ~ log(Deposit) + Insecticide,
+            weights=Number, family=binomial(logit), data=deposit)
> summary(fit)
             Estimate Std.Error z value Pr(>|z|)
(Intercept)   -4.4613    0.3564 -12.518   <2e-16
log(Deposit)   2.8875    0.2236  12.912   <2e-16
InsecticideB   0.3489    0.2060   1.694   0.0903
InsecticideC   2.8402    0.2538  11.191   <2e-16
(Dispersion parameter for binomial family taken to be 1)          # phi
    Null deviance: 413.644 on 17 degrees of freedom  # D0(beta0.hat), (n-1)
Residual deviance:  23.385 on 14 degrees of freedom  # D(beta.hat), (n-p-1)
AIC: 93.583                                          # -2*loglik + 2(p+1)
Number of Fisher Scoring iterations: 4
Insecticides

> fit2 <- update(fit, formula=Killed/Number ~ log(Deposit)*Insecticide)
> anova2(fit,fit2,test="wald")
Model 1 : Killed/Number ~ log(Deposit) + Insecticide
Model 2 : Killed/Number ~ log(Deposit) * Insecticide
       Chi df Pr(Chisq>)          # xi_W, r, Pr[chi2(r) > xi_W]
1 vs 2 0.99287 2 0.6087
>
> anova2(fit,fit2, test="lr")
Model 1 : Killed/Number ~ log(Deposit) + Insecticide
Model 2 : Killed/Number ~ log(Deposit) * Insecticide
       Chi df Pr(Chisq>)          # xi_LR, r, Pr[chi2(r) > xi_LR]
1 vs 2 1.0487 2 0.5919
>
> anova2(fit,fit2, test="score")
Model 1 : Killed/Number ~ log(Deposit) + Insecticide
Model 2 : Killed/Number ~ log(Deposit) * Insecticide
       Chi df Pr(Chisq>)          # xi_S, r, Pr[chi2(r) > xi_S]
1 vs 2 0.99865 2 0.6069
Poisson regression

$$Y_i \overset{\text{ind}}{\sim} \text{Poisson}(\mu_i), \qquad g(\mu_i) = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip}$$

$$E(Y_i) = \mu_i \qquad \text{and} \qquad \text{Var}(Y_i) = \mu_i$$

Logarithmic link: $g(\mu_i) = \log(\mu_i)$
Identity link: $g(\mu_i) = \mu_i$
Square-root link: $g(\mu_i) = \sqrt{\mu_i}$
Skin cancer in women

These data, extracted from Kleinbaum et al. [2013, page 751] and available in the object skincancer of the library glmtoolbox, describe the number of non-melanoma skin cancer cases (cases) among women (population) stratified by age (age: "15-24", "25-34", "35-44", "45-54", "55-64", "65-74", "75-84", "85+") in two metropolitan areas of the US (city: "St. Paul", in Minneapolis, and "Fort Worth", in Dallas).

[Figure: incidence rate of skin cancer $\times 10^4$ versus age range, in years, for St. Paul and Fort Worth.]

$$Y_i \overset{\text{ind}}{\sim} \text{Poisson}(\mu_i), \qquad \log\left(\frac{10^4\,\mu_i}{\lambda_i}\right) = \beta_0 + \beta_1 x_{i1} + \beta_2\log(x_{i2})$$

$y$ = Number of cases of skin cancer
$\lambda$ = Number of women exposed
$x_1 = 1$ if city="Fort Worth"; $x_1 = 0$ if city="St. Paul"
$x_2$ = Midpoint of age range
Parameter interpretation: without interaction

The expected incidence of skin cancer for a population group located at the city $x_1$ and an age range with midpoint $x_2$ is $\exp(\beta_0 + \beta_1 x_1 + \beta_2\log(x_2))$.

If there are two population groups whose age ranges coincide but the first is located at Fort Worth whereas the second is located at Saint Paul, then the expected incidence of skin cancer in the first population group is $\exp(\beta_1)$ times that of the second:

$$\frac{\mu_{FW}}{\mu_{SP}} = \frac{\exp(\beta_0 + \beta_1 + \beta_2\log(x_2))}{\exp(\beta_0 + \beta_2\log(x_2))} = \exp(\beta_1).$$

If there are two population groups located at the same city but the midpoint of the age range of the first is $x\delta$ whereas that of the second is $x$, then the expected incidence of skin cancer in the first population group is $\delta^{\beta_2}$ times that of the second:

$$\frac{\mu_{x\delta}}{\mu_x} = \frac{\exp(\beta_0 + \beta_1 x_1 + \beta_2\log(x\delta))}{\exp(\beta_0 + \beta_1 x_1 + \beta_2\log(x))} = \exp(\beta_2\log(\delta)) = \delta^{\beta_2}.$$
Parameter interpretation: with interaction

The expected incidence of skin cancer for a population group at the city $x_1$ and an age range with midpoint $x_2$ is $\exp(\beta_0 + \beta_1 x_1 + \beta_2\log(x_2) + \beta_3 x_1\log(x_2))$.

If there are two population groups located at Saint Paul but the midpoint of the age range of the first is $x\delta$ whereas that of the second is $x$, then the expected incidence of skin cancer in the first population group is $\delta^{\beta_2}$ times that of the second:

$$\frac{\mu_{x\delta}}{\mu_x} = \frac{\exp(\beta_0 + \beta_2\log(x\delta))}{\exp(\beta_0 + \beta_2\log(x))} = \exp(\beta_2\log(\delta)) = \delta^{\beta_2}.$$

If there are two population groups located at Fort Worth but the midpoint of the age range of the first is $x\delta$ whereas that of the second is $x$, then the expected incidence of skin cancer in the first population group is $\delta^{\beta_2+\beta_3}$ times that of the second:

$$\frac{\mu_{x\delta}}{\mu_x} = \frac{\exp(\beta_0 + \beta_1 + \beta_2\log(x\delta) + \beta_3\log(x\delta))}{\exp(\beta_0 + \beta_1 + \beta_2\log(x) + \beta_3\log(x))} = \delta^{\beta_2+\beta_3}.$$
Skin cancer in women

> data(skincancer)
> fit <- glm(cases ~ city + log(age), offset=log(population),
+            family=poisson(log), data=skincancer)
> summary(fit)
              Estimate Std.Error z value Pr(>|z|)
(Intercept)  -20.35008   0.35282  -57.68   <2e-16
cityFt.Worth   0.80928   0.05217   15.51   <2e-16
log(age)       3.37649   0.08469   39.87   <2e-16
(Dispersion parameter for poisson family taken to be 1)           # phi
    Null deviance: 2789.681 on 15 degrees of freedom # D0(beta0.hat), (n-1)
Residual deviance:   31.816 on 13 degrees of freedom # D(beta.hat), (n-p-1)
AIC: 132.06                                          # -2*loglik + 2(p+1)
Skin cancer in women

> fit2 <- update(fit, formula = cases ~ city*log(age))
> anova2(fit,fit2, test="wald")
Model 1 : cases ~ city + log(age)
Model 2 : cases ~ city + log(age) + city:log(age)
       Chi df Pr(Chisq>)          # xi_W, r, Pr[chi2(r) > xi_W]
1 vs 2 3.6753 1 0.05523
>
> anova2(fit,fit2, test="lr")
Model 1 : cases ~ city + log(age)
Model 2 : cases ~ city + log(age) + city:log(age)
       Chi df Pr(Chisq>)          # xi_LR, r, Pr[chi2(r) > xi_LR]
1 vs 2 3.764 1 0.05237
>
> anova2(fit,fit2, test="score")
Model 1 : cases ~ city + log(age)
Model 2 : cases ~ city + log(age) + city:log(age)
       Chi df Pr(Chisq>)          # xi_S, r, Pr[chi2(r) > xi_S]
1 vs 2 3.6808 1 0.05504
Richness of the species

The analysis of these data, available in the object richness of the library glmtoolbox, aims to explain the differences between plots as related to species richness (represented by a count of the number of plant species (richness)) using the differences between plots with respect to biomasses (Biomass) and soil pH levels: low (pH="low"), medium (pH="mid") and high (pH="high").

[Figure: number of plant species versus biomass for the low, mid and high pH levels.]

$$Y_i \overset{\text{ind}}{\sim} \text{Poisson}(\mu_i), \qquad \mu_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3}$$

$y$ = Number of plant species
$x_1$ = Biomass
$x_2 = 0$ and $x_3 = 0$ if pH="low"
$x_2 = 1$ and $x_3 = 0$ if pH="mid"
$x_2 = 0$ and $x_3 = 1$ if pH="high"
Parameter interpretation: without interaction

The expected number of plant species for a plot with biomass $x_1$ and pH level low ($x_2 = 0$ and $x_3 = 0$), mid ($x_2 = 1$ and $x_3 = 0$) or high ($x_2 = 0$ and $x_3 = 1$) is $\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3$.

If there are two plots with the same pH level, but the biomass of the first plot is $x + \delta$ while the biomass of the second plot is $x$, then the expected number of plant species for the first plot is $|\beta_1|\delta$ units lower, if $\beta_1 < 0$, or greater, if $\beta_1 > 0$, than that for the second plot:

$$\mu_{x+\delta} - \mu_x = \beta_0 + \beta_1(x+\delta) + \beta_2 x_2 + \beta_3 x_3 - (\beta_0 + \beta_1 x + \beta_2 x_2 + \beta_3 x_3) = \beta_1\delta.$$

If there are two plots with the same biomass, but the pH level of the first plot is mid while the pH level of the second plot is low, then the expected number of plant species for the first plot is $|\beta_2|$ units lower, if $\beta_2 < 0$, or greater, if $\beta_2 > 0$, than that for the second plot:

$$\mu_{\text{mid}} - \mu_{\text{low}} = \beta_0 + \beta_1 x_1 + \beta_2 - (\beta_0 + \beta_1 x_1) = \beta_2.$$

If there are two plots with the same biomass, but the pH level of the first plot is high while the pH level of the second plot is low, then the expected number of plant species for the first plot is $|\beta_3|$ units lower, if $\beta_3 < 0$, or greater, if $\beta_3 > 0$, than that for the second plot:

$$\mu_{\text{high}} - \mu_{\text{low}} = \beta_0 + \beta_1 x_1 + \beta_3 - (\beta_0 + \beta_1 x_1) = \beta_3.$$
Parameter interpretation: with interaction

The expected number of plant species for a plot with biomass $x_1$ and pH level low ($x_2 = 0$ and $x_3 = 0$), mid ($x_2 = 1$ and $x_3 = 0$) or high ($x_2 = 0$ and $x_3 = 1$) is $\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_1 x_2 + \beta_5 x_1 x_3$.

If there are two plots with pH level low, but the biomass of the first plot is $x + \delta$ while the biomass of the second plot is $x$, then the expected number of plant species for the first plot is $|\beta_1|\delta$ units lower, if $\beta_1 < 0$, or greater, if $\beta_1 > 0$, than that for the second plot:

$$\mu_{x+\delta} - \mu_x = \beta_0 + \beta_1(x+\delta) - (\beta_0 + \beta_1 x) = \beta_1\delta.$$

If there are two plots with pH level mid, but the biomass of the first plot is $x + \delta$ while the biomass of the second plot is $x$, then the expected number of plant species for the first plot is $|\beta_1 + \beta_4|\delta$ units lower, if $\beta_1 + \beta_4 < 0$, or greater, if $\beta_1 + \beta_4 > 0$, than that for the second plot:

$$\mu_{x+\delta} - \mu_x = \beta_0 + \beta_1(x+\delta) + \beta_4(x+\delta) - (\beta_0 + \beta_1 x + \beta_4 x) = (\beta_1 + \beta_4)\delta.$$

If there are two plots with pH level high, but the biomass of the first plot is $x + \delta$ while the biomass of the second plot is $x$, then the expected number of plant species for the first plot is $|\beta_1 + \beta_5|\delta$ units lower, if $\beta_1 + \beta_5 < 0$, or greater, if $\beta_1 + \beta_5 > 0$, than that for the second plot:

$$\mu_{x+\delta} - \mu_x = \beta_0 + \beta_1(x+\delta) + \beta_5(x+\delta) - (\beta_0 + \beta_1 x + \beta_5 x) = (\beta_1 + \beta_5)\delta.$$
Richness of the species

> data(richness)
> fit <- glm(Species ~ Biomass + pH, family=poisson(identity),
+            data=richness)
> summary(fit)
            Estimate Std.Error z value Pr(>|z|)
(Intercept)  17.7915    0.7789   22.84   <2e-16
Biomass      -2.8710    0.1830  -15.68   <2e-16
pHmid        12.5603    0.9821   12.79   <2e-16
pHhigh       23.0681    1.1801   19.55   <2e-16
(Dispersion parameter for poisson family taken to be 1)           # phi
    Null deviance: 452.346 on 89 degrees of freedom  # D0(beta0.hat), (n-1)
Residual deviance:  68.126 on 86 degrees of freedom  # D(beta.hat), (n-p-1)
AIC: 495.32                                          # -2*loglik + 2(p+1)
Richness of the species

> fit2 <- update(fit, formula=Species ~ Biomass + pH + Biomass:pH)
> anova2(fit,fit2, test="wald")
Model 1 : Species ~ Biomass + pH
Model 2 : Species ~ Biomass + pH + Biomass:pH
       Chi df Pr(Chisq>)          # xi_W, r, Pr[chi2(r) > xi_W]
1 vs 2 1.1275 2 0.5691
>
> anova2(fit,fit2, test="lr")
Model 1 : Species ~ Biomass + pH
Model 2 : Species ~ Biomass + pH + Biomass:pH
       Chi df Pr(Chisq>)          # xi_LR, r, Pr[chi2(r) > xi_LR]
1 vs 2 1.0563 2 0.5897
>
> anova2(fit,fit2, test="score")
Model 1 : Species ~ Biomass + pH
Model 2 : Species ~ Biomass + pH + Biomass:pH
       Chi df Pr(Chisq>)          # xi_S, r, Pr[chi2(r) > xi_S]
1 vs 2 0.98562 2 0.6109
Normal regression

$$Y_i \overset{\text{ind}}{\sim} \text{Normal}(\mu_i, \sigma^2/\omega_i), \qquad g(\mu_i) = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip}$$

or, equivalently,

$$Y_i = \mu_i + \varepsilon_i, \qquad g(\mu_i) = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip}, \qquad \varepsilon_i \overset{\text{ind}}{\sim} \text{Normal}(0, \sigma^2/\omega_i)$$

$$E(Y_i) = \mu_i \qquad \text{and} \qquad \text{Var}(Y_i) = \frac{\sigma^2}{\omega_i}$$

Inverse link: $g(\mu_i) = 1/\mu_i$
Logarithmic link: $g(\mu_i) = \log(\mu_i)$
Identity link: $g(\mu_i) = \mu_i$

If $\omega_1 = \omega_2 = \ldots = \omega_n$, then the dispersion parameter is constant, which implies that the variance is also constant.
Normal regression

Let $Q_{\varepsilon_i}(\alpha)$ be the $100(\alpha)$th percentile of the random error $\varepsilon_i$. Thus,

$$\Pr[\varepsilon_i \leq Q_{\varepsilon_i}(\alpha)] = \alpha \;\Longrightarrow\; \Pr[\mu_i + \varepsilon_i \leq \mu_i + Q_{\varepsilon_i}(\alpha)] = \alpha \;\Longrightarrow\; \Pr[Y_i \leq \mu_i + Q_{\varepsilon_i}(\alpha)] = \alpha.$$

That is, the $100(\alpha)$th percentile of $Y_i$, denoted here by $Q_{Y_i}(\alpha)$, can be written as $\mu_i + Q_{\varepsilon_i}(\alpha)$. In addition, if the link function is the identity (that is, $g(\mu) = \mu$) and $\omega_1 = \omega_2 = \ldots = \omega_n = 1$, then $\varepsilon_1, \ldots, \varepsilon_n$ are identically distributed random variables, so $Q_{\varepsilon_i}(\alpha)$ does not depend on $i$. Hence, $Q_{Y_i}(\alpha)$ can be expressed as follows:

$$Q_{Y_i}(\alpha) = \mu_i + Q_{\varepsilon}(\alpha) = \beta_0^* + \beta_1 x_{i1} + \ldots + \beta_p x_{ip},$$

where $\beta_0^* = \beta_0 + Q_{\varepsilon}(\alpha)$. Therefore, the parameter interpretation also may be performed by taking into account the additive effects of the covariates on $Q_{Y_i}(\alpha)$, for all $\alpha \in (0, 1)$.
Advertising

This data set, extracted from James et al. [2013, page 15] and available in the object advertising of the library glmtoolbox, consists of the sales (sales), in thousands of units, of a product in 200 different markets, along with advertising budgets, in thousands of dollars, for the product in each of those markets for television (TV) and radio (radio).

[Figure: sales versus advertising budget for TV, for low, mid and high advertising budgets for radio.]

$$Y_i \overset{\text{ind}}{\sim} \text{Normal}(\mu_i, \sigma^2), \qquad \log(\mu_i) = \beta_0 + \beta_1\log(x_{i1}) + \beta_2 x_{i2} + \beta_3\log(x_{i1})x_{i2}$$

$y$ = Sales, in thousands of units
$x_1$ = Advertising budget for TV
$x_2$ = Advertising budget for radio
Parameter interpretation: without interaction

The expected sales of the product in a market where the advertising budget for TV is $x_1$ and the advertising budget for radio is $x_2$ are $\exp(\beta_0 + \beta_1\log(x_1) + \beta_2 x_2)$.

If in two markets the radio advertising budgets are the same, but the TV advertising budgets are $x\delta$ and $x$, respectively, then the product's expected sales in the first market are $\delta^{\beta_1}$ times those in the second market:

$$\frac{\mu_{x\delta}}{\mu_x} = \frac{\exp(\beta_0 + \beta_1\log(x\delta) + \beta_2 x_2)}{\exp(\beta_0 + \beta_1\log(x) + \beta_2 x_2)} = \exp(\beta_1\log(\delta)) = \delta^{\beta_1}.$$

If in two markets the advertising budgets for TV coincide, but the advertising budget for radio in the first is $x + \delta$ and in the second is $x$, then the product's expected sales in the first market are $\exp(\beta_2\delta)$ times those in the second market:

$$\frac{\mu_{x+\delta}}{\mu_x} = \frac{\exp(\beta_0 + \beta_1\log(x_1) + \beta_2(x+\delta))}{\exp(\beta_0 + \beta_1\log(x_1) + \beta_2 x)} = \exp(\beta_2\delta).$$
Parameter interpretation: with interaction

The expected sales of the product in a market where the advertising budget for TV is $x_1$ and the advertising budget for radio is $x_2$ are $\exp(\beta_0 + \beta_1\log(x_1) + \beta_2 x_2 + \beta_3\log(x_1)x_2)$.

Two markets have radio advertising budgets equal to $z$, but TV advertising budgets of $x\delta$ and $x$, respectively. As a result, the product's expected sales in the first market are $\delta^{\beta_1+\beta_3 z}$ times those in the second market:

$$\frac{\mu_{x\delta}}{\mu_x} = \frac{\exp(\beta_0 + \beta_1\log(x\delta) + \beta_2 z + \beta_3\log(x\delta)z)}{\exp(\beta_0 + \beta_1\log(x) + \beta_2 z + \beta_3\log(x)z)} = \delta^{\beta_1+\beta_3 z}.$$

Two markets have TV advertising budgets equal to $z$, but radio advertising budgets of $x + \delta$ and $x$, respectively. As a result, the product's expected sales in the first market are $\exp[(\beta_2 + \beta_3\log(z))\delta]$ times those in the second market:

$$\frac{\mu_{x+\delta}}{\mu_x} = \frac{\exp(\beta_0 + \beta_1\log(z) + \beta_2(x+\delta) + \beta_3\log(z)(x+\delta))}{\exp(\beta_0 + \beta_1\log(z) + \beta_2 x + \beta_3\log(z)x)} = \exp[(\beta_2 + \beta_3\log(z))\delta].$$
Advertising

> data(advertising)
> fit <- glm(sales ~ log(TV) + radio + log(TV):radio,
+            family=gaussian(log), data=advertising)
> summary(fit)
                Estimate Std.Error z value Pr(>|z|)
(Intercept)    1.0302249 0.0467959  22.015   <2e-16
log(TV)        0.2646359 0.0092146  28.719   <2e-16
radio         -0.0043242 0.0014929  -2.896   0.0042
log(TV):radio  0.0036644 0.0002909  12.598   <2e-16
(Dispersion parameter for gaussian family taken to be 0.2668851)  # phi.hat
    Null deviance: 5417.149 on 199 degrees of freedom # D0(beta0.hat), (n-1)
Residual deviance:   52.309 on 196 degrees of freedom # D(beta.hat), (n-p-1)
AIC: 309.35                                           # -2*loglik + 2(p+2)
Number of Fisher Scoring iterations: 5
Advertising

> fit2 <- update(fit, formula = sales ~ log(TV) + radio)
> anova2(fit2,fit, test="wald")
Model 1 : sales ~ log(TV) + radio
Model 2 : sales ~ log(TV) + radio + log(TV)*radio
       Chi df Pr(Chisq>)          # xi_W, r, Pr[chi2(r) > xi_W]
1 vs 2 158.7 1 < 2.2e-16
>
> anova2(fit,fit2, test="lr")
Model 1 : sales ~ log(TV) + radio
Model 2 : sales ~ log(TV) + radio + log(TV)*radio
       Chi df Pr(Chisq>)          # xi_LR, r, Pr[chi2(r) > xi_LR]
1 vs 2 153.45 1 < 2.2e-16
>
> anova2(fit,fit2, test="gradient")
Model 1 : sales ~ log(TV) + radio
Model 2 : sales ~ log(TV) + radio + log(TV)*radio
       Chi df Pr(Chisq>)          # xi_G, r, Pr[chi2(r) > xi_G]
1 vs 2 151.16 1 < 2.2e-16
Advertising

- Model matrix $X$: model.matrix(fit)
- Parameter estimates $\hat\beta = (\hat\beta_0, \hat\beta_1, \ldots, \hat\beta_p)^\top$: coef(fit)
- Estimate of the variance-covariance matrix $[I(\hat\beta)]^{-1}$: vcov(fit)
- Fitted values $\hat\mu = (\hat\mu_1, \ldots, \hat\mu_n)^\top$: fitted(fit)
- Unscaled deviance $D(\hat\beta)$: deviance(fit)
- Pearson's statistic $P(\hat\beta)$: sum(resid(fit, type="pearson")**2)
- Log-likelihood $\ell(\hat\beta)$: logLik(fit)
- AIC $= -2\ell(\hat\beta) + 2(p+2)$: AIC(fit)
- BIC $= -2\ell(\hat\beta) + \log(n)(p+2)$: BIC(fit)
- $R^2_{\text{adj}} = 1 - \dfrac{D(\hat\beta)/(n-p-1)}{D^{\circ}(\hat\beta^{\circ})/(n-1)}$: with(fit, 1-deviance*df.null/(null.deviance*df.residual))
Gamma regression

$$Y_i \overset{\text{ind}}{\sim} \text{Gamma}(\mu_i, \sigma^2/\omega_i), \qquad g(\mu_i) = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip}$$

or, equivalently,

$$Y_i = \mu_i\,\varepsilon_i, \qquad g(\mu_i) = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip}, \qquad \varepsilon_i \overset{\text{ind}}{\sim} \text{Gamma}(1, \sigma^2/\omega_i)$$

$$E(Y_i) = \mu_i \qquad \text{and} \qquad \text{Var}(Y_i) = \frac{\sigma^2}{\omega_i}\,\mu_i^2$$

Inverse link: $g(\mu_i) = 1/\mu_i$
Logarithmic link: $g(\mu_i) = \log(\mu_i)$
Identity link: $g(\mu_i) = \mu_i$

If $\omega_1 = \omega_2 = \ldots = \omega_n$, then the dispersion parameter is constant, which implies that the coefficient of variation is also constant.
Gamma regression

Let $Q_{\varepsilon_i}(\alpha)$ be the $100(\alpha)$th percentile of the random error $\varepsilon_i$. Thus,

$$\Pr[\varepsilon_i \leq Q_{\varepsilon_i}(\alpha)] = \alpha \;\Longrightarrow\; \Pr[\mu_i\varepsilon_i \leq \mu_i Q_{\varepsilon_i}(\alpha)] = \alpha \;\Longrightarrow\; \Pr[Y_i \leq \mu_i Q_{\varepsilon_i}(\alpha)] = \alpha.$$

That is, the $100(\alpha)$th percentile of $Y_i$, denoted here by $Q_{Y_i}(\alpha)$, can be written as $\mu_i Q_{\varepsilon_i}(\alpha)$. In addition, if the link function is the logarithm (that is, $g(\mu) = \log(\mu)$) and $\omega_1 = \omega_2 = \ldots = \omega_n = 1$, then $\varepsilon_1, \ldots, \varepsilon_n$ are identically distributed random variables, so $Q_{\varepsilon_i}(\alpha)$ does not depend on $i$. Hence, $\log[Q_{Y_i}(\alpha)]$ can be expressed as follows:

$$\log[Q_{Y_i}(\alpha)] = \log(\mu_i) + \log[Q_{\varepsilon}(\alpha)] = \beta_0^* + \beta_1 x_{i1} + \ldots + \beta_p x_{ip},$$

where $\beta_0^* = \beta_0 + \log[Q_{\varepsilon}(\alpha)]$. Therefore, the parameter interpretation also may be performed by taking into account the multiplicative effects of the covariates on $Q_{Y_i}(\alpha)$, for all $\alpha \in (0, 1)$.
Hill races in Scotland

Each year, the Scottish Hill Runners Association publishes a list of hill races in Scotland for the year. These data, extracted from Agresti [2015, page 62] and available in the object races of the library glmtoolbox, consist of the record time, distance, and cumulative climb of 35 of those races. The aim of the statistical analysis of these data is to explain the differences between the record times of the races using their differences on distance and cumulative climb. The variables in this data set are the following:

- distance ($x_1$). The distance, in miles, of the races.
- cclimb ($x_2$). The cumulative climb, in thousands of feet, of the races.
- rtime ($y$). The record time, in minutes, of the races.
- race. The names of the races.

$$Y_i \overset{\text{ind}}{\sim} \text{Gamma}(\mu_i, \sigma^2), \qquad \log(\mu_i) = \beta_0 + \beta_1\log(x_{i1}) + \beta_2 x_{i2}$$
Parameter interpretation: without interaction

The expected (median) record time for a race with distance $x_1$ and cumulative climb $x_2$ is (proportional to) $\exp(\beta_0 + \beta_1\log(x_1) + \beta_2 x_2)$.

If in two races the cumulative climbs are the same, but the distances are $x\delta$ and $x$, respectively, then the expected (median) record time in the first race is $\delta^{\beta_1}$ times that in the second race:

$$\frac{\mu_{x\delta}}{\mu_x} = \frac{\exp(\beta_0 + \beta_1\log(x\delta) + \beta_2 x_2)}{\exp(\beta_0 + \beta_1\log(x) + \beta_2 x_2)} = \exp(\beta_1\log(\delta)) = \delta^{\beta_1}.$$

If in two races the distances coincide, but the cumulative climb in the first is $x + \delta$ and in the second is $x$, then the expected (median) record time in the first race is $\exp(\beta_2\delta)$ times that in the second race:

$$\frac{\mu_{x+\delta}}{\mu_x} = \frac{\exp(\beta_0 + \beta_1\log(x_1) + \beta_2(x+\delta))}{\exp(\beta_0 + \beta_1\log(x_1) + \beta_2 x)} = \exp(\beta_2\delta).$$
Parameter interpretation: with interaction

The expected (median) record time for a race with distance $x_1$ and cumulative climb $x_2$ is (proportional to) $\exp(\beta_0 + \beta_1\log(x_1) + \beta_2 x_2 + \beta_3\log(x_1)x_2)$.

Two races have cumulative climbs equal to $z$, but the distances are $x\delta$ and $x$, respectively. As a result, the expected (median) record time in the first race is $\delta^{\beta_1+\beta_3 z}$ times that in the second race:

$$\frac{\mu_{x\delta}}{\mu_x} = \frac{\exp(\beta_0 + \beta_1\log(x\delta) + \beta_2 z + \beta_3\log(x\delta)z)}{\exp(\beta_0 + \beta_1\log(x) + \beta_2 z + \beta_3\log(x)z)} = \delta^{\beta_1+\beta_3 z}.$$

Two races have distances equal to $z$, but the cumulative climbs are $x + \delta$ and $x$, respectively. As a result, the expected (median) record time in the first race is $\exp[(\beta_2 + \beta_3\log(z))\delta]$ times that in the second race:

$$\frac{\mu_{x+\delta}}{\mu_x} = \frac{\exp(\beta_0 + \beta_1\log(z) + \beta_2(x+\delta) + \beta_3\log(z)(x+\delta))}{\exp(\beta_0 + \beta_1\log(z) + \beta_2 x + \beta_3\log(z)x)} = \exp[(\beta_2 + \beta_3\log(z))\delta].$$
Hill races in Scotland

> data(races)
> fit <- glm(rtime ~ log(distance) + cclimb, family=Gamma(log),
+            data=races)
> summary(fit)
              Estimate Std.Error z value Pr(>|z|)
(Intercept)    1.86703   0.08938  20.888   <2e-16
log(distance)  0.90141   0.05933  15.193 3.47e-16
cclimb         0.13089   0.02165   6.045 9.52e-07
(Dispersion parameter for Gamma family taken to be 0.02141007)    # phi.hat
    Null deviance: 19.79114 on 34 degrees of freedom # D0(beta0.hat), (n-1)
Residual deviance:  0.72634 on 32 degrees of freedom # D(beta.hat), (n-p-1)
AIC: 233.91                                          # -2*loglik + 2(p+2)
Number of Fisher Scoring iterations: 5
Hill races in Scotland

> fit2 <- update(fit, formula = rtime ~ log(distance)*cclimb)
> anova2(fit,fit2, test="wald")
Model 1 : rtime ~ log(distance) + cclimb
Model 2 : rtime ~ log(distance) + cclimb + log(distance):cclimb
       Chi df Pr(Chisq>)          # xi_W, r, Pr[chi2(r) > xi_W]
1 vs 2 1.8245 1 0.1768
>
> anova2(fit,fit2, test="lr")
Model 1 : rtime ~ log(distance) + cclimb
Model 2 : rtime ~ log(distance) + cclimb + log(distance):cclimb
       Chi df Pr(Chisq>)          # xi_LR, r, Pr[chi2(r) > xi_LR]
1 vs 2 1.8072 1 0.1788
>
> anova2(fit,fit2, test="gradient")
Model 1 : rtime ~ log(distance) + cclimb
Model 2 : rtime ~ log(distance) + cclimb + log(distance):cclimb
       Chi df Pr(Chisq>)          # xi_G, r, Pr[chi2(r) > xi_G]
1 vs 2 1.8061 1 0.179
Lognormal distribution

If $Y \sim \text{Normal}(\mu, \sigma^2)$ then $T = \exp(Y) \sim \text{Lognormal}(\exp(\mu), \sigma^2)$. If $T \sim \text{Lognormal}(\tau, \sigma^2)$ then $Y = \log(T) \sim \text{Normal}(\log(\tau), \sigma^2)$. The probability density function of $T$ may be written as follows:

$$f_T(t; \tau, \sigma^2) = \frac{1}{t\sqrt{2\pi\sigma^2}}\exp\left(-\frac{[\log(t/\tau)]^2}{2\sigma^2}\right) I_{t\in(0,\infty)}.$$

The mean, variance and coefficient of variation of $T$ are

$$\tau\exp\left(\frac{\sigma^2}{2}\right), \qquad \tau^2\exp(\sigma^2)\left[\exp(\sigma^2) - 1\right] \qquad \text{and} \qquad \sqrt{\exp(\sigma^2) - 1},$$

respectively. In addition, the $100(\alpha)$th percentile of $T$ reduces to

$$Q_T(\alpha) = \tau\exp(\sigma Z_{\alpha}), \qquad \alpha \in (0, 1),$$

where $Z_{\alpha}$ represents the $100(\alpha)$th percentile of the Normal(0, 1). For example, the median of $T$ is $\tau$. The lognormal distribution does not belong to the LEF.
Probability density function of lognormal(1, σ²)

[Figure: probability density functions $f_T(t; 1, \sigma^2)$ for $\sigma^2 = 0.05$, $0.12$, $0.30$, $0.8$, $2.0$ and $5.0$.]
Lognormal regression

$$T_i = \tau_i\,\varepsilon_i, \qquad h(\tau_i) = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip}, \qquad \varepsilon_i \overset{\text{ind}}{\sim} \text{Lognormal}(1, \sigma^2/\omega_i)$$

or, equivalently,

$$Y_i = \mu_i + \epsilon_i, \qquad g(\mu_i) = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip}, \qquad \epsilon_i \overset{\text{ind}}{\sim} \text{Normal}(0, \sigma^2/\omega_i),$$

where $Y_i = \log(T_i)$, $\mu_i = \log(\tau_i)$, $\epsilon_i = \log(\varepsilon_i)$ and $g(\cdot) \equiv h[\exp(\cdot)]$.

$$E(T_i) = \tau_i\exp\left(\frac{\sigma^2}{2\,\omega_i}\right) \qquad \text{and} \qquad \text{Var}(T_i) = \tau_i^2\exp\left(\frac{\sigma^2}{\omega_i}\right)\left[\exp\left(\frac{\sigma^2}{\omega_i}\right) - 1\right].$$

Logarithmic link: $h(\tau_i) = \log(\tau_i) \;\Longrightarrow\; g(\mu_i) = \mu_i$
Identity link: $h(\tau_i) = \tau_i \;\Longrightarrow\; g(\mu_i) = \exp(\mu_i)$
Inverse link: $h(\tau_i) = 1/\tau_i \;\Longrightarrow\; g(\mu_i) = \exp(-\mu_i)$

If $\omega_1 = \omega_2 = \ldots = \omega_n$ then the dispersion parameter is constant, which implies that the coefficient of variation is also constant.
Lognormal versus Gamma model under log link

$$T_i = \tau_i\,\varepsilon_i, \quad \log(\tau_i) = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip}, \quad \varepsilon_i \overset{\text{ind}}{\sim} \text{Lognormal}(1, \sigma^2)$$

versus

$$Y_i = \mu_i\,\epsilon_i, \quad \log(\mu_i) = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip}, \quad \epsilon_i \overset{\text{ind}}{\sim} \text{Gamma}(1, \sigma^2)$$

These models have the following similarities:

(1) They are appropriate when the response variable is continuous, strictly positive, and right skewed.

(2) The constant dispersion parameter assumption is equivalent to the constant coefficient of variation assumption.

(3) A higher value of the dispersion parameter indicates more asymmetry in the distribution of the response variable.

(4) The model parameters can be interpreted by taking into account the following: (i) $\log[E(T_i)] = \beta_0^* + \beta_1 x_{i1} + \ldots + \beta_p x_{ip}$ and $\log[E(Y_i)] = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip}$; or (ii) $\log[Q_{T_i}(\alpha)] = \bar\beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip}$ and $\log[Q_{Y_i}(\alpha)] = \tilde\beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip}$ for all $\alpha \in (0, 1)$.
Lognormal versus Gamma model under log link

The lognormal model has the following advantages over the Gamma model:

(1) The log-likelihood function of $\beta$ in the lognormal model, denoted here by $\ell(\beta)$, is a strictly concave function. Therefore, if the system of equations given by $R(\beta) = 0$ has a solution, where $R(\beta) = (\partial\ell(\beta)/\partial\beta_0, \ldots, \partial\ell(\beta)/\partial\beta_p)^\top$ is the gradient of $\ell(\beta)$, then it is the only global maximum of $\ell(\beta)$.

(2) The maximum likelihood estimator of the model parameters in the lognormal model has a closed form (i.e., $\hat\beta = (X^\top X)^{-1}X^\top t^*$, where $t^* = (\log(t_1), \ldots, \log(t_n))^\top$), whereas in the Gamma model it requires numerical methods; see the sketch below.

(3) Statistical inference under the lognormal model is exact, whereas inference under the Gamma model is asymptotic, meaning that it is valid only for large sample sizes.
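Point (2) can be verified numerically with the Steel data used earlier: the closed-form estimator coincides with the least squares fit on the log scale, while the Gamma fit required Fisher scoring; a sketch:

# Closed-form ML estimate in the lognormal model: (X'X)^{-1} X' log(t)
data(Steel, package = "glmtoolbox")
X  <- model.matrix(~ stress, data = Steel)
bh <- drop(solve(crossprod(X), crossprod(X, log(Steel$life))))
cbind(bh, coef(lm(log(life) ~ stress, data = Steel)))   # identical columns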
Exercises

$$T_i = \tau_i\,\varepsilon_i, \qquad h(\tau_i) = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip}, \qquad \varepsilon_i \overset{\text{ind}}{\sim} \text{Lognormal}(1, \sigma^2/\omega_i)$$

$$Y_i = \mu_i + \epsilon_i, \qquad g(\mu_i) = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip}, \qquad \epsilon_i \overset{\text{ind}}{\sim} \text{Normal}(0, \sigma^2/\omega_i),$$

where $Y_i = \log(T_i)$, $y_i = \log(t_i)$, $\mu_i = \log(\tau_i)$, $\epsilon_i = \log(\varepsilon_i)$ and $g(\cdot) \equiv h[\exp(\cdot)]$. Show that:

(A) The likelihood function of $\beta$ under the lognormal model is proportional to that under the normal model. Therefore, the maximum likelihood estimator of $\beta$ under the lognormal model coincides with that under the normal model.

(B) The deviance under the lognormal model coincides with that under the normal model.

(C) The GAIC under the lognormal model is equal to that under the normal model plus $2[\log(t_1) + \ldots + \log(t_n)] = 2(y_1 + \ldots + y_n)$.

(D) If $h(\tau_i) = \log(\tau_i)$ and $\omega_1 = \omega_2 = \ldots = \omega_n$ then

$$\log[Q_{T_i}(\alpha)] = \beta_0^* + \beta_1 x_{i1} + \ldots + \beta_p x_{ip} \quad \text{for all } \alpha \in (0, 1),$$

where $\beta_0^* = \beta_0 + \log[Q_{\varepsilon}(\alpha)]$.
Gasoline yield

The analysis of the data in the object Auto of the library ISLR is aimed at explaining the differences between the vehicles with respect to their gasoline yield (mpg), in miles per gallon, using other vehicle characteristics such as their horsepower (horsepower) and their weight (weight), in kilograms.

[Figure: gasoline yield versus horsepower for low, mid and high weights.]

$$Y_i \overset{\text{ind}}{\sim} \text{Lognormal}(\mu_i, \sigma^2), \qquad \log(\mu_i) = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i1}x_{i2}$$

$y$ = Gasoline yield
$x_1$ = Horsepower
$x_2$ = Weight
Parameter interpretation: without interaction

The expected (median) gasoline yield for a vehicle with horsepower $x_1$ and weight $x_2$ is proportional to $\exp(\beta_0 + \beta_1 x_1 + \beta_2 x_2)$.

In two vehicles the weights are the same, but the horsepowers are $x + \delta$ and $x$, respectively. As a result, the expected (median) gasoline yield of the first vehicle is $\exp(\beta_1\delta)$ times that of the second vehicle:

$$\frac{\mu_{x+\delta}}{\mu_x} = \frac{\exp(\beta_0 + \beta_1(x+\delta) + \beta_2 x_2)}{\exp(\beta_0 + \beta_1 x + \beta_2 x_2)} = \exp(\beta_1\delta).$$

In two vehicles the horsepowers are the same, but the weights are $x + \delta$ and $x$, respectively. As a result, the expected (median) gasoline yield of the first vehicle is $\exp(\beta_2\delta)$ times that of the second vehicle:

$$\frac{\mu_{x+\delta}}{\mu_x} = \frac{\exp(\beta_0 + \beta_1 x_1 + \beta_2(x+\delta))}{\exp(\beta_0 + \beta_1 x_1 + \beta_2 x)} = \exp(\beta_2\delta).$$
Parameter interpretation: with interaction

The expected (median) gasoline yield for a vehicle with horsepower $x_1$ and weight $x_2$ is proportional to $\exp(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2)$.

Two vehicles have weights equal to $z$, but the horsepowers are $x + \delta$ and $x$, respectively. As a result, the expected (median) gasoline yield of the first vehicle is $\exp[(\beta_1 + \beta_3 z)\delta]$ times that of the second vehicle:

$$\frac{\mu_{x+\delta}}{\mu_x} = \frac{\exp(\beta_0 + \beta_1(x+\delta) + \beta_2 z + \beta_3(x+\delta)z)}{\exp(\beta_0 + \beta_1 x + \beta_2 z + \beta_3 xz)} = \exp[(\beta_1 + \beta_3 z)\delta].$$

Two vehicles have horsepowers equal to $z$, but the weights are $x + \delta$ and $x$, respectively. As a result, the expected (median) gasoline yield of the first vehicle is $\exp[(\beta_2 + \beta_3 z)\delta]$ times that of the second vehicle:

$$\frac{\mu_{x+\delta}}{\mu_x} = \frac{\exp(\beta_0 + \beta_1 z + \beta_2(x+\delta) + \beta_3 z(x+\delta))}{\exp(\beta_0 + \beta_1 z + \beta_2 x + \beta_3 zx)} = \exp[(\beta_2 + \beta_3 z)\delta].$$
Gasoline yield

> data(Auto, package="ISLR")
> Auto <- within(Auto, weight <- weight/1000)
>
> fit <- glm(log(mpg) ~ horsepower + weight + horsepower:weight,
+            family=gaussian(identity), data=Auto)
> summary(fit)
                    Estimate Std.Error z value Pr(>|z|)
(Intercept)        4.5121298 0.0912295  49.459  < 2e-16
horsepower        -0.0071145 0.0010622  -6.698 7.44e-11
weight            -0.3618254 0.0301305 -12.009  < 2e-16
horsepower:weight  0.0011991 0.0002589   4.631 4.96e-06
(Dispersion parameter for gaussian family taken to be 0.02341458)  # phi.hat
    Null deviance: 45.2100 on 391 degrees of freedom  # D0(beta0.hat), (n-1)
Residual deviance:  9.0849 on 388 degrees of freedom  # D(beta.hat), (n-p-1)
> AIC(fit) + 2*sum(fit$y)    # GAIC of the lognormal model: -2*loglik + 2(p+2)
2075.781
Genesis of link functions in binary response models
Genesis of link functions in binary response models

$$Y_i \overset{\text{ind}}{\sim} \text{Bernoulli}(\mu_i), \qquad g(\mu_i) = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip}$$

The link functions frequently used in binary response models (logit, probit, complementary log-log and Cauchy) may be motivated by using the following latent linear model (see, for instance, Agresti [2015, section 5.1.2]):

$$Y_i^* = \beta_0^* + \beta_1^* x_{i1} + \ldots + \beta_p^* x_{ip} - \sigma\, e_i,$$

where $\beta^* = (\beta_0^*, \beta_1^*, \ldots, \beta_p^*)^\top$ are regression parameters, $\sigma > 0$ is a scale parameter, and $e_1, \ldots, e_n$ are independent and identically distributed random errors whose cumulative distribution function, denoted here by $G(\cdot)$, satisfies the following: (i) it is a continuous and twice differentiable function; (ii) $0 < G(e) < 1$ for all $e \in \mathbb{R}$; and (iii) it does not depend on unknown parameters.

The response is dichotomized as follows:

$$Y_i = \begin{cases} 1 & \text{if } Y_i^* \geq \tau \\ 0 & \text{if } Y_i^* < \tau \end{cases},$$
Link functions in binary response models
where τ is an arbitrary value. Therefore, Y_i ∼ind Bernoulli(µ_i), where µ_i is given by

µ_i = Pr[Y_i = 1]
    = Pr[Y*_i ≥ τ]
    = Pr[β*_0 + β*_1 x_i1 + ... + β*_p x_ip − σe_i ≥ τ]
    = Pr[−σe_i ≥ τ − β*_0 − β*_1 x_i1 − ... − β*_p x_ip]
    = Pr[σe_i ≤ β*_0 + β*_1 x_i1 + ... + β*_p x_ip − τ]
    = Pr[e_i ≤ (β*_0 − τ)/σ + (β*_1/σ)x_i1 + ... + (β*_p/σ)x_ip]
    = Pr[e_i ≤ β₀ + β₁x_i1 + ... + β_p x_ip]
    = G(β₀ + β₁x_i1 + ... + β_p x_ip),

in which β₀ = (β*_0 − τ)/σ and β_j = β*_j/σ for j = 1, ..., p. Thus, the link function arises naturally as

g(µ_i) = G⁻¹(µ_i).
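A small simulation sketch of this genesis, assuming standard logistic errors so that the induced link is the logit; all values below are illustrative:

> set.seed(123)
> n <- 5000; x <- runif(n)
> tau <- 0.5; b0 <- 1; b1 <- 2; sigma <- 2
> ystar <- b0 + b1*x - sigma*rlogis(n)   # latent linear model
> y <- as.numeric(ystar >= tau)          # dichotomized response
> fit <- glm(y ~ x, family=binomial(logit))
> rbind(estimate=coef(fit), theory=c((b0 - tau)/sigma, b1/sigma))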
271 / 460
Link functions in binary response models
(1) Standard Logistic distribution:

G(e) = exp(e)/[1 + exp(e)].

Then,

g(µ_i) = G⁻¹(µ_i) = log[µ_i/(1 − µ_i)].

That is, random errors distributed according to the standard logistic distribution induce the logit link function.

(2) Standard Normal distribution:

G(e) = Φ(e) ≈ 1/2 + sign(e)·(1/2)·√(1 − exp(−√(π/8) e²)).

Then,

g(µ_i) = G⁻¹(µ_i) = Φ⁻¹(µ_i).

That is, random errors distributed according to the standard normal distribution induce the probit link function.
272 / 460
Link functions in binary response models
(3) Standard Extreme Value distribution:

G(e) = 1 − exp[−exp(e)].

Then,

g(µ_i) = G⁻¹(µ_i) = log[−log(1 − µ_i)] = log[log(µ_i/(1 − µ_i) + 1)].

That is, errors distributed according to the standard extreme value distribution induce the cloglog link function.

(4) Standard Cauchy distribution:

G(e) = 1/2 + (1/π)arctan(e).

Then,

g(µ_i) = G⁻¹(µ_i) = tan[(π/2)(2µ_i − 1)].

That is, random errors distributed according to the standard Cauchy distribution induce the cauchit link function.
273 / 460
Link functions in binary response models
[Figure: µ = g⁻¹(η) versus the linear predictor η ∈ (−5, 5) for the logit, probit, cloglog and cauchit link functions.]
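A sketch reproducing a figure like the one above, using the inverse links µ = g⁻¹(η) available in base R:

> eta <- seq(-5, 5, length.out=200)
> plot(eta, plogis(eta), type="l", ylim=c(0,1),
+      xlab="Linear predictor", ylab="mu")   # logit
> lines(eta, pnorm(eta), lty=2)              # probit
> lines(eta, 1 - exp(-exp(eta)), lty=3)      # cloglog
> lines(eta, pcauchy(eta), lty=4)            # cauchit
> legend("topleft", c("logit","probit","cloglog","cauchit"), lty=1:4)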
274 / 460
Receiver Operating
Characteristic (ROC) Curve
275 / 460
ROC Curve
The sample is partitioned into a training sample and a validation sample. The binary response model is “trained” using the training sample; the validation sample is then used to evaluate the predictive ability of the trained model.
276 / 460
ROC Curve
For a given value of the cutoff point ν, the Interest Classification Rule is the following:

Ŷ_i = 1 if µ̂_i ≥ ν,  and  Ŷ_i = 0 if µ̂_i < ν.

Therefore, the confusion matrix reduces to

                            Predicted condition
                            Ŷ = 0     Ŷ = 1
Actual condition   Y = 0     a_ν       c_ν
                   Y = 1     b_ν       d_ν

Then, the Sensitivity or True Positive Rate (TPR), and the Specificity or True Negative Rate (TNR), are given, respectively, by

TPR(ν) = d_ν/(b_ν + d_ν)  and  TNR(ν) = a_ν/(a_ν + c_ν).
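A minimal sketch of these quantities, assuming vectors y (the observed 0/1 responses in the validation sample) and mu.hat (the corresponding fitted probabilities); both names are illustrative:

> classify <- function(y, mu.hat, nu) {
+   y.hat <- as.numeric(mu.hat >= nu)   # interest classification rule
+   a <- sum(y == 0 & y.hat == 0); c. <- sum(y == 0 & y.hat == 1)
+   b <- sum(y == 1 & y.hat == 0); d <- sum(y == 1 & y.hat == 1)
+   c(TPR = d/(b + d), TNR = a/(a + c.))
+ }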
277 / 460
ROC Curve
[Figure: five scenarios, A to E, for the distributions of µ̂ among actual zeros and actual ones on the interval (0, 1).]

The ideal scenario is E, since it allows for perfect differentiation between “successes” and “failures”. By contrast, the worst scenario is D, since no distinction can be made between “successes” and “failures”.
278 / 460
ROC Curve
The Receiver Operating Characteristic (ROC) curve is the plot of TPR(ν) versus 1 − TNR(ν) for ν₁ < ν₂ < ... < ν_{r−1} < ν_r.

Cutoff point (ν)   1 − TNR(ν)                    TPR(ν)
ν₁                 c_{ν₁}/(a_{ν₁} + c_{ν₁})      d_{ν₁}/(b_{ν₁} + d_{ν₁})
ν₂                 c_{ν₂}/(a_{ν₂} + c_{ν₂})      d_{ν₂}/(b_{ν₂} + d_{ν₂})
...                ...                           ...
ν_r                c_{ν_r}/(a_{ν_r} + c_{ν_r})   d_{ν_r}/(b_{ν_r} + d_{ν_r})

The Area Under the ROC curve (AUROC) is a measure of the predictive power of the model, and it can be calculated by applying successively the formula for the area of a trapezoid. Thus,

AUROC = (1/2) Σ_{l=2}^{r} [TNR(ν_l) − TNR(ν_{l−1})][TPR(ν_l) + TPR(ν_{l−1})].
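A sketch of this trapezoid computation, reusing the classify() helper from the earlier sketch; y and mu.hat are assumed as before:

> auroc <- function(y, mu.hat, nus = seq(0, 1.0001, length.out=200)) {
+   pts <- t(sapply(nus, function(nu) classify(y, mu.hat, nu)))
+   pts <- pts[order(pts[, "TNR"]), ]   # increasing TNR
+   0.5*sum(diff(pts[, "TNR"])*(pts[-1, "TPR"] + pts[-nrow(pts), "TPR"]))
+ }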
279 / 460
ROC Curve
[Figure: one step of the ROC curve, between the points (1 − TNR(ν_{l−1}), TPR(ν_{l−1})) and (1 − TNR(ν_l), TPR(ν_l)); the trapezoid under this segment is shaded.]

The area of the shaded region can be obtained as follows:

Area = [(1 − TNR(ν_{l−1})) − (1 − TNR(ν_l))] × (1/2)[TPR(ν_l) + TPR(ν_{l−1})]
     = [TNR(ν_l) − TNR(ν_{l−1})] × (1/2)[TPR(ν_l) + TPR(ν_{l−1})].
280 / 460
ROC Curve
The AUROC also may be expressed using the statistic of the Mann-Whitney test (see Gibbons and Chakraborti [2010, section 6.6]):

AUROC = [1/(N₀N₁)] Σ_{l ∈ G₀} Σ_{i ∈ G₁} S(µ̂_l, µ̂_i),

where G₀ = {i : y_i = 0}, G₁ = {i : y_i = 1}, N₀ = #G₀, N₁ = #G₁, and S(a, b) may be expressed as follows:

S(a, b) = 1 if a < b;  1/2 if a = b;  0 if a > b.

If µ̂_l < µ̂_i for all l ∈ G₀ and i ∈ G₁, then AUROC = 1; that is, there is ν* ∈ (0, 1) such that the classification rule given by

Ŷ_i = 1 if µ̂_i ≥ ν*,  and  Ŷ_i = 0 if µ̂_i < ν*

is “perfect” in the sense that TPR(ν*) = TNR(ν*) = 1.
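The Mann-Whitney form is immediate to code; a sketch, with y and mu.hat as before:

> auroc.mw <- function(y, mu.hat) {
+   mu0 <- mu.hat[y == 0]; mu1 <- mu.hat[y == 1]
+   S <- outer(mu0, mu1, function(a, b) (a < b) + 0.5*(a == b))
+   mean(S)   # (1/(N0*N1)) * sum over G0 x G1 of S(mu.hat_l, mu.hat_i)
+ }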
281 / 460
ROC Curve
Other measures of the predictive power of the model also may be considered. For instance, the

Gini Coefficient = 2(AUROC − 1/2),

and the Kolmogorov-Smirnov (K-S) statistic (see, for instance, Gibbons and Chakraborti [2010, section 6.3]), which compares the empirical cumulative distribution function of {µ̂_i : y_i = 0}, denoted here by ECDF⁽⁰⁾(µ), with that of {µ̂_i : y_i = 1}, denoted here by ECDF⁽¹⁾(µ), in the following form:

K-S Statistic = max |ECDF⁽⁰⁾(µ) − ECDF⁽¹⁾(µ)| = max |TNR(ν_l) − (1 − TPR(ν_l))|.

If µ̂_l < µ̂_i for all l ∈ G₀ and i ∈ G₁, then K-S Statistic = 1.
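A sketch of the K-S statistic based on the two empirical CDFs; y and mu.hat as before:

> ks.stat <- function(y, mu.hat) {
+   F0 <- ecdf(mu.hat[y == 0]); F1 <- ecdf(mu.hat[y == 1])
+   grid <- sort(unique(mu.hat))
+   max(abs(F0(grid) - F1(grid)))
+ }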
282 / 460
ROC Curve
For a given value of the cutoff point ν, the Reference Classification Rule is the following:

Ŷ_i = 1 if u_i ≥ ν,  and  Ŷ_i = 0 if u_i < ν,

where u₁, u₂, ... are realizations of random variables such that U_i ∼ind Uniform(0, 1). The ROC curve of this classification rule is the zero-intercept and unit-slope straight line, as TPR(ν) = Pr[U_i ≥ ν] = 1 − ν and TNR(ν) = Pr[U_i < ν] = ν for all ν ∈ (0, 1). Thus, the AUROC of the reference classification rule is 0.5.

The classification rule of a model whose linear predictor includes just the intercept term also may be used as a reference. For that model the estimates of µ_i are µ̂_i = µ̂ for all i. Thus, the ROC curve and the AUROC of that classification rule are the zero-intercept and unit-slope straight line and 0.5, respectively, as

(1) TPR(ν) = 1 and TNR(ν) = 0 for all ν ≤ µ̂;
(2) TPR(ν) = 0 and TNR(ν) = 1 for all ν > µ̂.
283 / 460
Burn injury study
This data set, discussed in Hosmer et al. [2013, page 27] and available in the object burn1000 of the library aplore3, is a subset of 1000 subjects treated between 2000 and 2007 at 40 different burn facilities. Available information includes patient demographics, total burn surface area and presence of inhalation injury. The outcome of interest is survival after hospital discharge. The data set includes the following variables:

age (x₁). The patient's age at admission, in years.
gender (x₂). The patient's gender: “Female” and “Male”.
race (x₃). The patient's race: “Non-White” and “White”.
tbsa (x₄). Total burn surface area, in percentage.
inh_inj (x₅). Burn involved inhalation injury? “No” and “Yes”.
flame (x₆). Flame involved in burn injury? “No” and “Yes”.
death (y). Hospital discharge status: “Alive” and “Dead”.

Y_i ∼ind Bernoulli(µ_i),  g(µ_i) = β₀ + β₁x_i1 + ... + β₆x_i6
284 / 460
Burn injury study
> help(ROCc)
>
> burn <- aplore3::burn1000
> burn <- within(burn, death2 <- ifelse(death=="Dead",1,0))
> s <- sample(1:nrow(burn), size=nrow(burn)*0.7)
> train <- burn[s,]
> test <- burn[-s,]
>
> fit <- glm(death ~ age + tbsa + inh_inj + age*inh_inj +
+ tbsa*inh_inj, family=binomial(logit), data=train)
>
> pr <- predict(fit, newdata=test, type="response")
>
> out <- ROCc(cbind(test[,"death2"],pr),col="red",col.lab="blue")
> out
This function provides a plot of the ROC curve, as well as the values of the AUROC, the Gini coefficient, and the K-S statistic for the interest classification rule when it is applied to the validation sample.
285 / 460
ROC Curve: Example
[Figure: ROC curve (Sensitivity versus 1 − Specificity) for the validation sample; AUROC = 0.972.]
286 / 460
ROC Curve: Example
[Figure: empirical cumulative distribution functions of µ̂ for the zeros and the ones; the largest vertical gap gives the K-S statistic = 0.851.]
287 / 460
ROC Curve
There are several ways to define the “optimal” cutoff point ν. For example, the “optimal” cutoff point can be defined as the value of ν which minimizes the expected classification cost, given by (see, for instance, Habibzadeh et al. [2016], Nahm [2022])

E(Cost) = E(Cost | Y = 0) Pr(Y = 0) + E(Cost | Y = 1) Pr(Y = 1),

where

Cost | Y = 0 = C_{0|0} if Ŷ = 0, and C_{1|0} if Ŷ = 1;
Cost | Y = 1 = C_{0|1} if Ŷ = 0, and C_{1|1} if Ŷ = 1;

in which C_{0|0}, C_{1|0}, C_{0|1} and C_{1|1} are assumed to be non-negative and known values.
288 / 460
ROC Curve
Therefore,

E(Cost | Y = 0) = C_{1|0}(1 − TNR) + C_{0|0} TNR
E(Cost | Y = 1) = C_{0|1}(1 − TPR) + C_{1|1} TPR.

Thus, the expected classification cost reduces to

E(Cost) = [C_{1|0}(1 − TNR) + C_{0|0} TNR] Pr(Y = 0) + [C_{0|1}(1 − TPR) + C_{1|1} TPR] Pr(Y = 1).

The values of Pr(Y = 0) and Pr(Y = 1) can be replaced by their estimates, which are given by N₀/(N₀ + N₁) and N₁/(N₀ + N₁), respectively.
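A sketch of this criterion, reusing classify(); the four costs below are illustrative values, not part of the study:

> expected.cost <- function(y, mu.hat, nu, C00=0, C10=1, C01=5, C11=0) {
+   r <- classify(y, mu.hat, nu)
+   p0 <- mean(y == 0); p1 <- mean(y == 1)  # estimates of Pr(Y=0), Pr(Y=1)
+   (C10*(1 - r["TNR"]) + C00*r["TNR"])*p0 +
+     (C01*(1 - r["TPR"]) + C11*r["TPR"])*p1
+ }
> nus <- seq(0.01, 0.99, by=0.01)
> # nus[which.min(sapply(nus, function(nu) expected.cost(y, mu.hat, nu)))]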
289 / 460
Variable selection
290 / 460
Hybrid forward stepwise
The algorithm starts with the simplest model, that is, a model with a linear predictor in which the coefficients of all variables and/or effects, except the intercept term, are zero. Next, the candidate models are constructed by hierarchically adding effects to the linear predictor (i.e., the main effects first, then the two-way interactions, and so on), and their “relevance” is assessed by comparing the models with and without the added effects using a previously defined criterion. After an effect has been added to the model, any other effect previously added may be removed if, according to the specified criterion, that effect is no longer relevant. See, for example, James et al. [2013, page 210].

The criteria used to compare models include the following:
Akaike Information Criterion
Bayesian Information Criterion
Adjusted R²
p-value of the significance tests
291 / 460
Hybrid backward stepwise

The process begins with the most complex model (i.e., one that includes all variables and/or effects in its linear predictor), and then the candidate models are constructed by hierarchically removing effects from the linear predictor (i.e., the highest-level interactions first, then the next level, and so on). The “relevance” of the candidates is determined by comparing the models with and without the excluded effects using a previously established criterion. After an effect has been excluded from the model, any other effect previously excluded may be added back if, based on the specified criterion, that effect is no longer irrelevant. See, for example, James et al. [2013, page 210].

The criteria used to compare models include the following:
Akaike Information Criterion
Bayesian Information Criterion
Adjusted R²
p-value of the significance tests
292 / 460
Hybrid stepwise
[Diagram: hybrid stepwise search, moving forward and backward among candidate models.]
293 / 460
Hybrid stepwise
> args(stepCriterion.glm)
> help(stepCriterion.glm)
model: the object where the fitted GLM is stored.
criterion: the criterion used to compare the candidate models. The available options are: AIC ("aic"), BIC ("bic"), adjusted deviance-based R-squared ("adjr2"), and p-value of test ("p-value"). By default, criterion="bic".
test: the statistical test used to compare candidate models if criterion="p-value". The options are: Wald ("wald"), Rao's score ("score"), Terrell's gradient ("gradient"), and likelihood-ratio ("lr") tests. By default, test="wald".
direction: hybrid backward ("backward") or hybrid forward ("forward") procedure. By default, direction="forward".
levels: the pair of values in the interval (0, 1) at which effects enter and leave the model, respectively, if criterion="p-value". By default, levels=c(0.05,0.05).
294 / 460
Best subset selection
If there are p explanatory variables, which enter the model through p̃ columns of the model matrix X, then the number of possible models of size i, for i = 1, ..., p̃, is given by

(p̃ choose i) = p̃!/[(p̃ − i)! i!].

Therefore, the total number of models is

Σ_{i=1}^{p̃} (p̃ choose i) = 2^{p̃} − 1.

In the “best subset” selection, an exhaustive search is conducted separately for every model size from 1 to p̃, in which the model with the smallest deviance value is identified. Therefore, if, for a fixed model size, the model selection criteria of interest reduce to monotone functions of the deviance, thus differing only in the way the model sizes are compared, then the results of the “best subset” selection do not depend upon the choice of the trade-off between goodness-of-fit and complexity on which they are based.
295 / 460
Best subset selection
> args(bestsubset)
> help(bestsubset)
object: the object of class glm where the fitted model is stored.
nvmax: a positive integer indicating the maximum size of subsets to examine.
nbest: a positive integer indicating the number of subsets of each size to record.
force.in: a vector of positive integers indicating the indices of the columns of the model matrix that should be in all models.
force.out: a vector of positive integers indicating the indices of the columns of the model matrix that should be in no models.
verbose: a logical indicating whether the report of results should be printed.
digits: a positive integer indicating the number of decimal places to be used.
296 / 460
Burn injury study
Recall the burn injury study (Hosmer et al. [2013, page 27]; object burn1000 of the library aplore3): the outcome is the hospital discharge status (death) and the candidate explanatory variables are gender, race, flame, age, tbsa and inh_inj.
297 / 460
Burn injury study
> burn <- aplore3::burn1000
> burn <- within(burn, death <- factor(death, levels=c("Dead","Alive")))
> m <- death ~ (gender + race + flame + age + tbsa + inh_inj)**2
> fit <- glm(m, family=binomial(logit), data=burn)
>
> stepCriterion(fit, direction="forward", criterion="bic", test="wald")
Initial model: ~ 1
Step 0 :
df AIC BIC adj.R-squared P(Chisq>)(*)
+ tbsa 1 542.65 552.47 0.3622 < 2.2e-16
+ age 1 674.26 684.07 0.2064 < 2.2e-16
+ inh_inj 1 695.66 705.48 0.1811 < 2.2e-16
+ flame 1 759.46 769.27 0.1055 1.443e-15
<none> 847.42 852.33 0.0000
+ race 1 847.97 857.79 0.0007 0.2319
+ gender 1 848.58 858.40 0.0000 0.3567
Step 1 : + tbsa
df AIC BIC adj.R-squared P(Chisq>)(*)
+ age 1 368.98 383.70 0.5698 < 2.2e-16
+ flame 1 522.68 537.40 0.3876 1.224e-05
+ inh_inj 1 526.73 541.46 0.3828 1.040e-05
<none> 542.65 552.47 0.3622
+ gender 1 538.34 553.07 0.3691 0.01108
+ race 1 544.40 559.12 0.3619 0.61320
298 / 460
Burn injury study
Step 2 : + age
df AIC BIC adj.R-squared P(Chisq>)(*)
+ inh_inj 1 352.18 371.81 0.5917 1.35e-05
<none> 368.98 383.70 0.5698
+ flame 1 365.79 385.42 0.5755 0.02695
+ race 1 366.55 386.18 0.5746 0.03628
+ gender 1 369.95 389.58 0.5706 0.30956
+ age:tbsa 1 370.75 390.38 0.5696 0.63608
- tbsa 1 674.26 684.07 0.2064 < 2.2e-16
Step 3 : + inh_inj
df AIC BIC adj.R-squared P(Chisq>)(*)
+ age:inh_inj 1 340.10 364.63 0.6080 0.000178
<none> 352.18 371.81 0.5917
+ race 1 349.78 374.32 0.5965 0.037011
+ flame 1 352.50 377.03 0.5933 0.200542
+ gender 1 353.28 377.82 0.5923 0.340693
+ inh_inj:tbsa 1 353.46 378.00 0.5921 0.394472
+ age:tbsa 1 354.17 378.71 0.5913 0.935625
- age 1 526.73 541.46 0.3828 < 2.2e-16
- tbsa 1 539.54 554.26 0.3676 < 2.2e-16
299 / 460
Burn injury study
Step 4 : + age:inh_inj
df AIC BIC adj.R-squared P(Chisq>)(*)
+ inh_inj:tbsa 1 332.42 361.87 0.6191 0.002075
<none> 340.10 364.63 0.6080
+ race 1 337.19 366.64 0.6134 0.027951
+ flame 1 339.85 369.30 0.6102 0.140166
+ age:tbsa 1 339.99 369.44 0.6101 0.145308
+ gender 1 342.05 371.49 0.6076 0.826913
- tbsa 1 506.34 525.97 0.4088 < 2.2e-16
Step 5 : + inh_inj:tbsa
df AIC BIC adj.R-squared P(Chisq>)(*)
<none> 332.42 361.87 0.6191
+ race 1 328.74 363.09 0.6255 0.01843
+ flame 1 332.83 367.19 0.6206 0.21233
+ age:tbsa 1 333.61 367.96 0.6197 0.36009
+ gender 1 334.33 368.68 0.6188 0.75873
- age:inh_inj 1 353.46 378.00 0.5921 2.823e-06
Final model: ~ tbsa + age + inh_inj + age:inh_inj + tbsa:inh_inj
*********************************************************************
(*) p-values of the Wald test
300 / 460
Burn injury study
> mod <- death ~ gender + race + flame + age*tbsa*inh_inj
> fit <- glm(mod, family=binomial(logit), data=burn)
> bestSubset(fit, digits=3)
_SIZE_ genderMale raceWhite flameYes age tbsa inh_injYes age:tbsa
1 0 0 0 0 0 0 1
2 0 0 0 1 1 0 0
3 0 0 0 1 1 1 0
4 0 0 0 1 1 1 0
5 0 0 0 1 1 1 0
6 0 1 0 1 1 1 0
7 0 1 0 1 1 1 1
8 0 1 1 1 1 1 1
age:inh_injYes tbsa:inh_injYes age:tbsa:inh_injYes AIC BIC
0 0 0 460.088 469.903
0 0 0 368.977 383.700
0 0 0 352.179 371.810
1 0 0 340.095 364.634
1 1 0 332.423 361.870
1 1 0 328.740 363.095
1 0 1 326.971 366.233
1 0 1 327.657 371.827
301 / 460
Exercises
(1) Show that:
(A) Maximizing R²_adj is equivalent to minimizing D(β̂)/(n − p − 1).
(B) In GLMs in which the dispersion parameter is known and equal to 1, such as Poisson and binomial models, minimizing GAIC(k) is equivalent to minimizing D(β̂) − k(n − p − 1).

(2) Consider two GLMs in which the dispersion parameter is known and equal to 1, such as Poisson and binomial models, which are identical except for the linear predictor. In the first model, denoted here by (I),

g(µ_i) = β₀ + β₁x_i1 + ... + β_p x_ip,

while in the second model, denoted by (II),

g(µ_i) = β₀ + β₁x_i1 + ... + β_p x_ip + β_{p+1}x_i(p+1) + ... + β_{p+r}x_i(p+r).

Model I is often described as nested in model II. Show that:

(A) If the likelihood-ratio test to assess H₀: (β_{p+1}, ..., β_{p+r})ᵀ = 0 versus H₁: (β_{p+1}, ..., β_{p+r})ᵀ ≠ 0 rejects H₀ at the approximate significance level α, then GAIC_II(k) < GAIC_I(k) for all k ≤ χ²_{1−α}(r)/r.
302 / 460
Exercises
(B) If the likelihood-ratio test to assess H₀: (β_{p+1}, ..., β_{p+r})ᵀ = 0 versus H₁: (β_{p+1}, ..., β_{p+r})ᵀ ≠ 0 does not reject H₀ at the approximate significance level α, then GAIC_II(k) > GAIC_I(k) for all k > χ²_{1−α}(r)/r.

(C) If GAIC_II(k) < GAIC_I(k), then the likelihood-ratio test to assess H₀: (β_{p+1}, ..., β_{p+r})ᵀ = 0 versus H₁: (β_{p+1}, ..., β_{p+r})ᵀ ≠ 0 rejects H₀ at the approximate significance level α, for all α ≥ 1 − Pr[χ²(r) ≤ kr].

(D) If GAIC_II(k) > GAIC_I(k), then the likelihood-ratio test to assess H₀: (β_{p+1}, ..., β_{p+r})ᵀ = 0 versus H₁: (β_{p+1}, ..., β_{p+r})ᵀ ≠ 0 does not reject H₀ at the approximate significance level α, for all α ≤ 1 − Pr[χ²(r) ≤ kr].

(E) R²_adj,II > R²_adj,I if and only if the inclusion of the explanatory variables x_{p+1}, ..., x_{p+r} in the model results in a reduction in deviance greater than 100 × r/(n − p − 1)%. Therefore, the larger the sample size, the smaller the reduction in deviance required for R²_adj to indicate that model II is “better” than model I.
303 / 460
Diagnostic methods
304 / 460
Residual analysis (see Pierce and Schafer [1986])
The residual r_i aims to measure the difference between the observed value (y_i) and the fitted value (µ̂_i = g⁻¹(x_iᵀβ̂)) for the i-th observation. “Large” values of r_i may indicate that the fitted model does not adequately describe the behavior of the i-th observation. Indeed, a “large” positive value of r_i indicates that the fitted model underestimates the value of y_i, whereas a “large” negative value of r_i indicates that the fitted model overestimates the value of y_i. Therefore, knowing the distribution of r_i is desirable, since it allows us to determine when the value of r_i (positive or negative) is “large”. In other words, knowing the distribution of the residual allows us to determine whether the value of r_i is “large” enough to raise doubts about the difference between the observed and fitted values for the i-th observation being a result of chance alone.
305 / 460
Residual analysis
Pearson-type residuals:
This type of residual is defined from the (signed) contribution of the i-th observation to the Pearson statistic. Thus,

r_{P_i} = sign(y_i − µ̂_i) √{ p(y_i, µ̂_i, ω_i) / [φ̂(1 − ĥ_ii)] } = (y_i − µ̂_i) / √{ (φ̂/ω_i) V(µ̂_i)(1 − ĥ_ii) },

where ĥ_ii is the (i, i)-th element of Ĥ = Ŵ^{1/2}X(XᵀŴX)⁻¹XᵀŴ^{1/2}, in which Ŵ = diag{ŵ₁, ..., ŵ_n} and ŵ_i = ω_i/{[g′(µ̂_i)]² V(µ̂_i)}.

Deviance-type residuals (Davison and Gigli [1989]):
This type of residual is defined from the (signed) contribution of the i-th observation to the deviance. Thus,

r_{D_i} = sign(y_i − µ̂_i) √{ d(y_i, µ̂_i, ω_i) / [φ̂(1 − ĥ_ii)] }.
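These definitions can be checked against R's built-ins; a minimal sketch, assuming a fitted glm object called fit:

> phi <- summary(fit)$dispersion
> h <- hatvalues(fit)
> rP <- residuals(fit, type="pearson")/sqrt(phi*(1 - h))
> rD <- residuals(fit, type="deviance")/sqrt(phi*(1 - h))
> all.equal(rP, rstandard(fit, type="pearson"))   # rstandard() agrees
> all.equal(rD, rstandard(fit, type="deviance"))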
306 / 460
Residual analysis
Quantile-type residuals (Dunn and Smyth [1996]):
If Y₁, ..., Y_n are independent and strictly continuous random variables, then

F_{Y_i}(Y_i; µ_i, φ/ω_i) ∼ind Uniform(0, 1)  and  Φ⁻¹[F_{Y_i}(Y_i; µ_i, φ/ω_i)] ∼ind Normal(0, 1).

These facts motivate the residuals given by

r_{Q_i} = Φ⁻¹[F_{Y_i}(y_i; µ̂_i, φ̂/ω_i)],  i = 1, ..., n,

in models where the response is assumed to be distributed as normal, gamma, or inverse Gaussian, since, if the model is correctly specified and the values of (β̂, φ̂) and (β, φ) are “close”, which occurs for “large” values of n due to the consistency of the estimators of β and φ, then the behavior of the residuals r_{Q_1}, ..., r_{Q_n} should be similar to that of a random sample drawn from the Normal(0, 1) distribution.
307 / 460
Residual analysis
Accordingly, values of r_{Q_i} less than −3 or greater than 3 could be considered “large”.

In models where the response is assumed to be distributed as binomial or Poisson, the quantile-type residual is randomized to mimic the behavior expected under a continuous distribution. Therefore, r_{Q_i} becomes

r_{Q_i} = Φ⁻¹[F_{Y_i}(y_i; µ̂_i, φ̂/ω_i)] if Y_i is strictly continuous, and
r_{Q_i} = Φ⁻¹[a_i + f_{Y_i}(y_i; µ̂_i, φ̂/ω_i) u_i] in other cases,

where u₁, ..., u_n are realizations of random variables U₁, ..., U_n such that U_i ∼ind Uniform(0, 1), and a_i = lim_{y ↑ y_i} F_{Y_i}(y; µ̂_i, φ̂/ω_i). The standardized quantile-type residuals are given by

r_{Q_i}(1 − ĥ_ii)^{−1/2}.
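A sketch of the randomized quantile residuals for a Poisson fit, assuming a fitted model object called fit; compare with residuals2(fit, type="quantile"):

> mu <- fitted(fit); y <- fit$y
> a <- ppois(y - 1, mu)                # a_i, the limit of F from the left
> b <- ppois(y, mu)                    # a_i + f(y_i)
> rQ <- qnorm(runif(length(y), a, b))  # Phi^{-1}[a_i + f(y_i)*u_i]
> rQs <- rQ/sqrt(1 - hatvalues(fit))   # standardized version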
308 / 460
Residual analysis
> args(residuals2)
> help(residuals2)
object: the object where the fitted GLM is stored.
type: the type of residuals which should be returned. The options are: Pearson ("pearson"), Deviance ("deviance"), and (randomized) Quantile ("quantile"). By default, type is set to be "deviance".
standardized: a logical switch indicating if the residuals should be standardized by dividing by (1 − ĥ_ii)^{1/2}. By default, standardized is set to be FALSE.
plot.it: a logical switch indicating if a plot of the residuals
versus the fitted values is required. By default, plot.it=TRUE.
...: some graphical parameters to customize the plot when
plot.it=TRUE.
309 / 460
Normal probability plot with sim ulated envelope
See, for instance, Atkinson [1981].

Step 0: Set T (a positive integer) and α ∈ (0, 1).

Step 1: Generate T independent samples of size n according to Y_i ∼ind LEF(µ̂_i, φ̂/ω_i) for i = 1, ..., n. This yields values y_it, where i = 1, ..., n indexes the observations and t = 1, ..., T indexes the samples.

Step 2: Compute the estimates of β and φ in each sample.

Step 3: Compute the residuals r_it for i = 1, ..., n and t = 1, ..., T.
310 / 460
Normal probability plot with sim ulated envelope
Step 4: Within each sample t = 1, ..., T, compute the order statistics r_(1)t ≤ r_(2)t ≤ ... ≤ r_(n)t.

Step 5: For each row i = 1, ..., n, compute the 100(α/2)th, 50th and 100(1 − α/2)th percentiles of r_(i)1, ..., r_(i)T, denoted here by p_(i)^{α/2}, p_(i)^{1/2} and p_(i)^{1−α/2}, respectively.

Step 6: Plot p_(1)^{α/2}, ..., p_(n)^{α/2} versus the expected values of the order statistics in a sample of size n from the standard normal distribution.
311 / 460
Normal probability plot with sim ulated envelope
Step 7: Plot p_(1)^{1/2}, ..., p_(n)^{1/2} versus the expected values of the order statistics in a sample of size n from the standard normal distribution.

Step 8: Plot p_(1)^{1−α/2}, ..., p_(n)^{1−α/2} versus the expected values of the order statistics in a sample of size n from the standard normal distribution.

Step 9: Plot the residuals of the model of interest (that is, r₁, ..., r_n) versus the expected values of the order statistics in a sample of size n from the standard normal distribution.
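A compact sketch of Steps 0 to 9 for a Poisson fit, assuming a fitted model object called fit; here the residuals are standardized deviance residuals, and qnorm(ppoints(n)) approximates the expected normal order statistics:

> nrep <- 100; alpha <- 0.05; n <- length(fit$y)
> mu <- fitted(fit); X <- model.matrix(fit)
> R <- matrix(0, n, nrep)
> for (t in 1:nrep) {
+   ysim <- rpois(n, mu)                               # Step 1
+   ftmp <- glm(ysim ~ X - 1, family=fit$family)       # Step 2
+   R[, t] <- sort(rstandard(ftmp, type="deviance"))   # Steps 3 and 4
+ }
> bands <- apply(R, 1, quantile, probs=c(alpha/2, 0.5, 1 - alpha/2))  # Step 5
> qn <- qnorm(ppoints(n))
> matplot(qn, t(bands), type="l", lty=c(2, 1, 2), col=1,
+         xlab="Expected quantiles", ylab="Observed quantiles")       # Steps 6 to 8
> points(qn, sort(rstandard(fit, type="deviance")))                   # Step 9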
> args(envelope.glm)
> help(envelope.glm)
312 / 460
Normal probability plot with sim ulated envelope
object: the object where the fitted GLM is stored.
rep: a positive integer indicating the number of replicates which
should be used to build the simulated envelope. By default, rep
is set to be
25.
conf: a value in the interval (0, 1) indicating the confidence level
which should be used to build the point-wise confidence intervals.
By default, conf is set t o be
0.95.
type: the type of residuals which should be used to build the envelope. The options are: Pearson ("pearson"), Deviance ("deviance"), and (randomized) Quantile ("quantile"). By default, type is set to be "quantile".
standardized: a logical switch indicating if the residuals should be standardized by dividing by (1 − ĥ_ii)^{1/2}. By default, standardized is set to be FALSE.
is set to be FALSE.
plot.it: a logical switch indicating if a plot of the residuals
versus the fitted values is required. By default, plot.it=FALSE.
...: some graphical parameters to customize the plot when
plot.it=TRUE.
313 / 460
Influence or sensitivity analysis
The Cook's distance, denoted here by CD_i, is a measure of the influence of observation i on the maximum-likelihood estimate of β. It is the norm of β̂ − β̂_(i) in a metric based on V̂ar(β̂), where β̂_(i) represents the estimate of β computed from the dataset in which observation i is excluded. CD_i may be expressed as follows (Cook and Weisberg [1982, section 5.4]):

CD_i = [1/(p + 1)] (β̂ − β̂_(i))ᵀ [V̂ar(β̂)]⁻¹ (β̂ − β̂_(i)) = [1/(φ̂(p + 1))] Σ_{l=1}^{n} ŵ_l (x_lᵀβ̂ − x_lᵀβ̂_(i))²,

where ŵ_l = ω_l/{[g′(µ̂_l)]² V(µ̂_l)}. “Large” values of CD_i indicate that removing observation i introduces “large” changes in the estimate of β, and that a comprehensive analysis is required to investigate whether the elimination of this observation changes the sign of the parameter estimates and/or changes the decision about their statistical significance.
314 / 460
Influence or sensitivity analysis
The computational cost of computing β̂_(i) for i = 1, ..., n may be “high”, especially when n is “large”. Therefore, the value of β̂_(i) in CD_i is replaced by its one-step approximation (see Pregibon [1981]), denoted here by β̂¹_(i). This approximation consists of the first iteration of the Fisher's Scoring algorithm to estimate β when it is performed using: (i) the dataset in which observation i is excluded; and (ii) β̂ as the starting value. Thus,

β̂¹_(i) = β̂ − { ω_i(y_i − µ̂_i) / [g′(µ̂_i)V(µ̂_i)(1 − ĥ_ii)] } (XᵀŴX)⁻¹ x_i.

The values of β̂¹_(i) and β̂_(i) coincide for Gaussian models with identity link function. The Cook's distance may then be approximated by the following expression:

CD¹_i = [1/(φ̂(p + 1))] Σ_{l=1}^{n} ŵ_l (x_lᵀβ̂ − x_lᵀβ̂¹_(i))² = [1/(p + 1)] r²_{P_i} ĥ_ii/(1 − ĥ_ii).
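This closed-form approximation is what cooks.distance() computes; a quick check, assuming a fitted glm object called fit with p + 1 estimated coefficients:

> h <- hatvalues(fit); phi <- summary(fit)$dispersion
> rP2 <- residuals(fit, type="pearson")^2/(phi*(1 - h))  # squared standardized
> CD1 <- rP2*h/((1 - h)*length(coef(fit)))               # divided by (p + 1)
> all.equal(unname(CD1), unname(cooks.distance(fit)))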
315 / 460
Influence or sensitivity analysis
(1) The matrix V̂ar(β̂) is positive definite. Therefore, CD¹_i ≥ 0, and CD¹_i = 0 if and only if β̂ = β̂¹_(i).

(2) If r_{P_i} = 0 (that is, if y_i = µ̂_i), then β̂¹_(i) = β̂ and CD¹_i = 0.

A univariate version of CD¹_i is known as DFBETA¹_{j(i)}, and it is given by

DFBETA¹_{j(i)} = β̂_j − β̂¹_{j(i)},  j = 0, ..., p,

where β̂¹_{j(i)} is the element (j + 1) of β̂¹_(i). Similarly, DFBETAS¹_{j(i)} is given by

DFBETAS¹_{j(i)} = (β̂_j − β̂¹_{j(i)}) / se(β̂¹_{j(i)}),  j = 0, ..., p.

> help(cooks.distance)
> help(dfbeta)
> help(dfbetas)
316 / 460
Testing the constant dispersion parameter assumption
To assess the assumption of a constant dispersion parameter, the following varying dispersion model is considered:

Y_i ∼ind LEF(µ_i, φ_i/ω_i),
g(µ_i) = β₀ + β₁x_i1 + ... + β_p x_ip,
log(φ_i) = γ₀ + γ₁m_i1 + ... + γ_q m_iq,

where m̃_i = (m_i1, ..., m_iq)ᵀ are the values of q explanatory variables, quantitative and/or qualitative, measured on the i-th individual. The sets of variables x₁, ..., x_p and m₁, ..., m_q may or may not coincide. Thus, assessing the assumption of a constant dispersion parameter is equivalent to assessing the following hypothesis in the varying dispersion model described above:

H₀: γ = 0 versus H₁: γ ≠ 0,

where γ = (γ₁, ..., γ_q)ᵀ. If H₀ is not rejected, then it is possible to conclude that there is no significant statistical evidence against the assumption of a constant dispersion parameter.
317 / 460
Testing the constant dispersion parameter assumption
The decision rule is: reject H₀ at the approximate significance level of 100(α)% if ξ > χ²_{1−α}(q).

The statistic of the Rao's Score test is given by

ξ = (1/2) zᵀ M̃ (M̃ᵀM̃)⁻¹ M̃ᵀ z,

where M̃ = (I_n − n⁻¹11ᵀ)M, M = (m̃₁, ..., m̃_n)ᵀ and z = (z₁, ..., z_n)ᵀ.

If Y_i ∼ind Normal(µ_i, φ/ω_i), then

z_i = ω_i(y_i − µ̂_i)²/φ̄ − 1  and  φ̄ = (1/n) Σ_{i=1}^{n} ω_i(y_i − µ̂_i)².

If Y_i ∼ind Inverse Gaussian(µ_i, φ/ω_i), then

z_i = ω_i(y_i − µ̂_i)²/(µ̂²_i y_i φ̄) − 1  and  φ̄ = (1/n) Σ_{i=1}^{n} ω_i(y_i − µ̂_i)²/(µ̂²_i y_i).
See, for instance, Wei et al. [1998].
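A sketch of this score statistic for a normal-response fit with unit weights (ω_i = 1), assuming a fitted model fit and a matrix M whose columns hold m₁, ..., m_q (both names are assumptions); compare with vdtest():

> r2 <- residuals(fit, type="response")^2
> phibar <- mean(r2)                        # since omega_i = 1
> z <- r2/phibar - 1
> Mt <- scale(M, center=TRUE, scale=FALSE)  # (I - n^{-1} 1 1^t) M
> xi <- 0.5*drop(t(z) %*% Mt %*% solve(crossprod(Mt), t(Mt) %*% z))
> pchisq(xi, df=ncol(M), lower.tail=FALSE)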
318 / 460
Testing the constant dispersion parameter assumption
> args(vdtest.glm)
> help(vdtest.glm)
model: the object where the fitted GLM is stored. Only the gaussian(), Gamma() and inverse.gaussian() families are supported.
varformula: by default, the same explanatory variables are used as in the model for the mean. However, the argument varformula, which must be a formula-type object, may be used to specify the desired explanatory variables.
verbose: a logical switch indicating whether the report of results should be printed. By default, verbose=TRUE.
319 / 460
Box-Tidwell transformation

Let's consider the following generalized nonlinear model:

Y_i ∼ind LEF(µ_i, φ/ω_i),
g(µ_i) = β₀ + β₁h₁(x_i1) + ... + β_r h_r(x_ir) + β_{r+1}x_i(r+1) + ... + β_p x_ip,

where x_i1, ..., x_ir are the values of r quantitative and strictly positive explanatory variables measured on subject i, and

h_j(x) = x^{τ_j} if τ_j ≠ 0,  and  h_j(x) = log(x) if τ_j = 0,  for j = 1, ..., r.

Hence, the interest parameter vector is (β₀, β₁, ..., β_p, τ₁, ..., τ_r)ᵀ. Box and Tidwell [1962] provide an efficient algorithm to estimate τ = (τ₁, ..., τ_r)ᵀ by the maximum likelihood method. Note that evaluation of the following hypothesis systems is of interest:

H₀: τ_j = 1 versus H₁: τ_j ≠ 1, for j = 1, ..., r,

and

H₀: (τ₁, ..., τ_r)ᵀ = 1 versus H₁: (τ₁, ..., τ_r)ᵀ ≠ 1.
320 / 460
Insecticides
These data, extracted from Hewlett and Plackett [1950] and available in the object deposit of the library GLMsData, arose from an experiment in which groups of insects were exposed to deposits of insecticides. The total number of insects exposed in each group (Number), the number of insects killed after six days of exposure (Killed), and the amount (Deposit) and type of insecticide used (Insecticide) were recorded.

[Figure: percentage of killed insects versus amount of insecticide, in milligrams, for insecticide types A, B and C.]

m_iY_i ∼ind Binomial(m_i, µ_i),  g(µ_i) = β₀ + β₁x_i1 + β₂x_i2 + β₃x_i3

m = Number of exposed insects
y = Proportion of killed insects
x₁ = Amount of insecticide, in mg
x₂ = 0 and x₃ = 0 if Insecticide = “A”
x₂ = 1 and x₃ = 0 if Insecticide = “B”
x₂ = 0 and x₃ = 1 if Insecticide = “C”
321 / 460
Box-Tidwell transformation
> help(BoxTidwell.glm)
> data(deposit, package="GLMsData")
> fit <- glm(Killed/Number ~ Deposit + Insecticide,
+ weights=Number, family=binomial(logit), data=deposit)
>
> BoxTidwell(fit, transf= ~ Deposit)
            τ̂_j   (test of H₀: τ_j = 1 versus H₁: τ_j ≠ 1)
          Estimate  Std.Error    z-value     Pr(>|z|)
Deposit   -0.90393    0.34310   -5.54914    2.8708e-08

Wald test for null hypothesis that all taus are 1
(H₀: (τ₁, ..., τ_r)ᵀ = 1 versus H₁: (τ₁, ..., τ_r)ᵀ ≠ 1):
chi = 30.793 , df = 1 , Pr(>chi) = 2.8708e-08
>
> fit2 <- update(fit,formula=~I(Deposit**(-0.9039))+Insecticide)
> c(adjR2(fit),adjR2(fit2))
[1] 0.85902 0.95405
> c(AIC(fit),AIC(fit2))
[1] 118.22335 85.84911
> c(BIC(fit),BIC(fit2))
[1] 121.7848 89.4106
322 / 460
Skin cancer in women
These data, extracted from Kleinbaum et al. [2013, page 751] and available in the object skincancer of the library glmtoolbox, describe the number of non-melanoma skin cancer cases (cases) among women (population), stratified by age (age: “15-24”, “25-34”, “35-44”, “45-54”, “55-64”, “65-74”, “75-84”, “85+”) in two metropolitan areas of the US (city: “St. Paul”, in Minneapolis, and “Fort Worth”, in Dallas).

[Figure: incidence rate of skin cancer (×10⁴) by age range, for St. Paul and Fort Worth.]

Y_i ∼ind Poisson(µ_i),  log(10⁴µ_i/λ_i) = β₀ + β₁x_i1 + β₂x_i2

y = Number of cases of skin cancer
λ = Number of women exposed
x₁ = 1 if city = “Fort Worth”, and x₁ = 0 if city = “St. Paul”
x₂ = Midpoint of the age range
323 / 460
Box-Tidwell transformation
> help(BoxTidwell.glm)
> data(skincancer)
> fit <- glm(cases ~ age + city, offset=log(population),
+ family=poisson(log), data=skincancer)
>
> BoxTidwell(fit, transf= ~ age)
        τ̂_j   (test of H₀: τ_j = 1 versus H₁: τ_j ≠ 1)
      Estimate  Std.Error     z-value     Pr(>|z|)
age   -0.52014    0.12389   -12.26983    < 2.22e-16

Wald test for null hypothesis that all taus are 1
(H₀: (τ₁, ..., τ_r)ᵀ = 1 versus H₁: (τ₁, ..., τ_r)ᵀ ≠ 1):
chi = 150.55 , df = 1 , Pr(>chi) = < 2.22e-16
>
> fit2 <- update(fit, formula = ~ I(age**(-0.52)) + city)
> c(adjR2(fit),adjR2(fit2))
[1] 0.93512 0.99381
> c(AIC(fit),AIC(fit2))
[1] 257.1082 115.2115
> c(BIC(fit),BIC(fit2))
[1] 259.4259 117.5293
324 / 460
Gasoline yield
This data set, discussed in James et al. [2013, page 92] and available in the object Auto of the library ISLR, includes acceleration, horsepower, fuel efficiency, weight, and other features of different makes and models of cars. The variable of interest is fuel consumption, in miles per gallon. The variables in this data set are the following:

mpg (y). Fuel consumption, in miles per gallon.
cylinders (x₁). Number of cylinders.
displacement (x₂). Engine displacement, in cubic inches.
horsepower (x₃). Engine horsepower.
weight (x₄). Vehicle weight, in pounds.
acceleration (x₅). Time, in seconds, to accelerate from 0 to 60 miles per hour.
origin (x₆, x₇). Origin of the vehicle: American (1), European (2) and Japanese (3).
325 / 460
Gasoline yield
The following is the code of the exponential link function (i.e., g(µ) =
exp(µ)
) for the Gaussian distribution:
> exp.link <- gaussian()
> exp.link$link <- "exp"
> exp.link$linkfun <- function(mu) exp(mu)
> exp.link$linkinv <- function(eta) log(eta)
> exp.link$mu.eta <- function(eta) 1/eta
>
> fit <- glm(..., family=exp.link, ...)
The following is the code of the inverse exponential link function (i.e., g(µ) = exp(−µ)) for the Gaussian distribution:
> iexp.link <- gaussian()
> iexp.link$link <- "1/exp"
> iexp.link$linkfun <- function(mu) exp(-mu)
> iexp.link$linkinv <- function(eta) -log(eta)
> iexp.link$mu.eta <- function(eta) -1/eta
>
> fit <- glm(..., family=iexp.link, ...)
326 / 460
Gasoline yield
Simpler model: 1
More complex model: 1 + weight + horsepower + weight:horsepower + displacement + acceleration + cylinders + origin

g(µ)     Distribution       “Best” linear predictor according to BIC               BIC
log(µ)   Gaussian           1 + displacement + weight*horsepower                   2213.2
         Gamma              1 + weight*horsepower                                  2100.2
         Inverse Gaussian   1 + weight*horsepower                                  2089.4
         Lognormal          1 + weight*horsepower                                  2095.6
µ        Gaussian           1 + weight*horsepower                                  2211.3
         Gamma              1 + weight*horsepower                                  2102.4
         Inverse Gaussian   1 + weight*horsepower                                  2094.0
         Lognormal          1 + weight*horsepower                                  2098.6
1/µ      Gaussian           1 + displacement + horsepower + acceleration           2201.0
         Gamma              1 + weight + horsepower + cylinders                    2098.9
         Inverse Gaussian   1 + weight + horsepower + cylinders                    2089.1
         Lognormal          1 + weight + displacement + horsepower + acceleration  2092.6
> data(Auto, package="ISLR")
> Auto <- within(Auto, weight <- weight/1000)
> fit <- glm(mpg ~ horsepower + weight + horsepower:weight,
+ family=inverse.gaussian(log), data=Auto)
> summary(fit)
327 / 460
Gasoline yield
> anova2(fit, test="lr")
> residuals2(fit, type="quantile ", plot.it=TRUE, identify=3)
[Figure: quantile-type residuals r_{Q_i} versus the fitted values µ̂_i; observations 331, 360 and 382 are highlighted.]

A roughly “horizontal band” of residuals is drawn around the 0 line. This suggests that the random component of the model successfully describes the mean-variance relation present in the data.
328 / 460
Gasoline yield
> envelope(fit, type="quantile", rep=500, identify=3)
> cbind(model.matrix(fit)[,-1], y=fit$y, mu=fitted(fit))[c(331,360,382),]
    horsepower weight horsepower:weight    y       mu
331        132   2910            384120 32.7 19.92788
360        105   3725            391125 26.6 18.02434
382         85   3015            256275 38.0 22.94378

[Figure: normal probability plot with simulated envelope for the quantile-type residuals; observations 331, 360 and 382 are highlighted.]
329 / 460
Gasoline yield
> cdi <- cooks.distance(fit)
> plot(cdi, type="h", xlab="Observation Index", ylab="Cook’s Distance")
> identify(cdi, n=3)
[Figure: index plot of Cook's distance; observations 29, 116 and 331 stand out.]
330 / 460
Gasoline yield
Which subset of the potentially influential observations (#29, #116, #331) causes the largest change in the estimate of β when removed from the data set?

> infl <- c(29,116,331)
> n <- length(infl)
> metr <- chol2inv(chol(vcov(fit)))
> hm <- 2**n - 1
> results <- matrix(0, nrow=hm, ncol=1)
> id <- 1; nfil <- vector()
> for(i in 1:n){
+   pos <- combn(n,i)
+   for(j in 1:ncol(pos)){
+     fitemp <- update(fit, subset=-infl[pos[,j]])
+     difs <- coef(fitemp)-coef(fit)
+     results[id,1] <- t(difs)%*%metr%*%(difs)
+     nfil <- c(nfil,paste(infl[pos[,j]], collapse=","))
+     id <- id + 1
+   }
+ }
> nfil[results==max(results)]
[1] "29,331"
The largest change in the estimate of β occurs when observations #29 and #331
are removed.
331 / 460
Gasoline yield
> a <- round(summary(fit)$coefficients, digits=5); a
Estimate Std.Error t value Pr(>|t|)
(Intercept) 4.55600 0.09410 48.41499 0
horsepower -0.00725 0.00102 -7.08715 0
weight -0.37430 0.02942 -12.72132 0
horsepower:weight 0.00126 0.00025 5.09735 0
>
> fit2 <- update(fit, subset=-c(29,331))
> b <- round(summary(fit2)$coefficients, digits=5); b
Estimate Std.Error t value Pr(>|t|)
(Intercept) 4.61036 0.09201 50.10786 0
horsepower -0.00804 0.00100 -8.01447 0
weight -0.38628 0.02874 -13.44288 0
horsepower:weight 0.00144 0.00024 5.94463 0
>
> round(100*(b[,1:2]-a[,1:2])/abs(a[,1:2]), digits=3)
Estimate Std.Error
(Intercept) 1.193 -2.221
horsepower -10.897 -1.961
weight -3.201 -2.311
horsepower:weight 14.286 -4.000
332 / 460
Gasoline yield
> vdtest(fit)
Score test for varying dispersion parameter
Statistic = 1.50238
degrees of freedom = 3
p-value = 0.68172
> vdtest(fit,varformula=~horsepower*weight+cylinders+displacement)
Score test for varying dispersion parameter
Statistic = 4.42663
degrees of freedom = 5
p-value = 0.48976
> vdtest(fit,varformula=~horsepower*weight+cylinders*displacement)
Score test for varying dispersion parameter
Statistic = 7.30864
degrees of freedom = 6
p-value = 0.29324
333 / 460
Quasi-likelihood models
334 / 460
Quasi-likelihood models (QLMs)
Y_i ∼ind [µ_i, (φ/ω_i)V(µ_i)],
g(µ_i) = β₀ + β₁x_i1 + ... + β_p x_ip = x_iᵀβ.

Then, E(Y_i) = µ_i and Var(Y_i) = (φ/ω_i)V(µ_i), where

β = (β₀, β₁, ..., β_p)ᵀ is the interest parameter vector.
x_i = (1, x_i1, ..., x_ip)ᵀ, where (x_i1, ..., x_ip) are the values of p explanatory variables, quantitative and/or qualitative, measured on the i-th individual.
φ > 0 is the dispersion parameter.
ω₁, ..., ω_n are known and positive weights.
V(·) is the variance function.
g(·) is the link function, which is assumed to be known, strictly monotone, and twice differentiable.
335 / 460
Quasi-likelihood models (QLMs)
β̂ = argmax_{β ∈ R^{p+1}} Q(β), where

Q(β) = (1/φ) Σ_{i=1}^{n} ∫_{y_i}^{µ_i} ω_i(y_i − t)/V(t) dt

is the log-quasi-likelihood function. For example,

If V(µ_i) = µ_i(1 − µ_i):  Q(β) = C + (1/φ) Σ_{i=1}^{n} ω_i [y_i log(µ_i/(1 − µ_i)) + log(1 − µ_i)]
If V(µ_i) = µ_i:           Q(β) = C + (1/φ) Σ_{i=1}^{n} ω_i [y_i log(µ_i) − µ_i]
If V(µ_i) = 1:             Q(β) = −(1/(2φ)) Σ_{i=1}^{n} ω_i (y_i − µ_i)²
If V(µ_i) = µ²_i:          Q(β) = C − (1/φ) Σ_{i=1}^{n} ω_i [log(µ_i) + y_i/µ_i]
If V(µ_i) = µ³_i:          Q(β) = C + (1/φ) Σ_{i=1}^{n} ω_i [1/µ_i − y_i/(2µ²_i)]
Quasi-likelihood models (QLMs)
∂Q(β)/∂β_j = (1/φ) Σ_{i=1}^{n} [ω_i(y_i − µ_i)/V(µ_i)] (∂µ_i/∂η_i) x_ij = (1/φ) Σ_{i=1}^{n} ω_i(y_i − µ_i)x_ij/[V(µ_i)g′(µ_i)].

In matrix form, U(β) becomes

U(β) = φ⁻¹ ( Σ_{i=1}^{n} ω_i(y_i − µ_i)/[V(µ_i)g′(µ_i)],
             Σ_{i=1}^{n} ω_i(y_i − µ_i)x_i1/[V(µ_i)g′(µ_i)],
             ...,
             Σ_{i=1}^{n} ω_i(y_i − µ_i)x_ip/[V(µ_i)g′(µ_i)] )ᵀ,

which is the same as that in GLMs; that is, the parameter estimates under GLMs and QLMs coincide. Therefore, QLMs relax the distributional assumptions of GLMs: the former lead to the same point estimates as the latter, but require only the correct specification of the mean and variance of the response variable.
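A small check of this coincidence using the swimmers data of the library glmtoolbox: the Poisson GLM and the quasi-likelihood model with V(µ) = µ solve the same estimating equations, so only the standard errors differ (they are rescaled by φ̂):

> library(glmtoolbox)
> data(swimmers)
> f1 <- glm(infections ~ frequency + location, family=poisson(log),
+           data=swimmers)
> f2 <- update(f1, family=quasipoisson(log))
> all.equal(coef(f1), coef(f2))   # TRUE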
337 / 460
Gasoline Yields from Crude Oil
The dataset, collected by Prater [1956] and included in the library betareg as GasolineYield, pertains to the percentage of crude oil converted to gasoline after distillation and fractionation (yield). The explanatory variables are the temperature (temp), in degrees Fahrenheit, the vapor pressure of the crude oil (pressure), in pounds per square inch, and the crude oil gravity (gravity), in degrees API.

[Figure: proportion of crude oil converted to gasoline versus temperature, for low and high vapor pressure.]

Y_i ∼ind [µ_i, φµ_i(1 − µ_i)],
g(µ_i) = β₀ + β₁x_i1 + β₂x_i2 + β₃x_i3 + β₄x_i2x_i3

y = Proportion of crude oil converted to gasoline
x₁ = Temperature
x₂ = Vapor pressure
x₃ = Gravity
338 / 460
Gasoline Yields from Crude Oil
> library(betareg)
> data(GasolineYield)
> str(GasolineYield)
> summary(GasolineYield)
>
> fit1 <- glm(yield ~ temp+pressure+gravity+pressure:gravity,
+ data=GasolineYield, family=quasibinomial(logit))
> fit2 <- update(fit1, family=quasibinomial(probit))
> fit3 <- update(fit1, family=quasibinomial(cloglog))
> fit4 <- update(fit1, family=quasibinomial(cauchit))
>
> adjR2(fit1,fit2,fit3,fit4)
Deviance R-squared df adj.R-squared
fit1 0.1829340 0.9219 5 0.9103
fit2 0.1637178 0.9301 5 0.9197
fit3 0.2064191 0.9119 5 0.8988
fit4 0.3876270 0.8345 5 0.8100
> dev.new()
> residuals2(fit2)
339 / 460
Gasoline Yields from Crude Oil
> summary(fit2)
                      β̂_j        c_j       β̂_j/c_j   Pr[χ²(1) > z²_j]
                    Estimate   Std.Error   z value    Pr(>|z|)
(Intercept)       -4.3117643   0.3927982   -10.977    1.85e-11
temp               0.0057560   0.0003401    16.926    6.68e-16
pressure           0.2027705   0.0526505     3.851    0.000655
gravity            0.0292699   0.0092548     3.163    0.003842
pressure:gravity  -0.0029725   0.0013007    -2.285    0.030367

> fit5 <- betareg(yield ~ temp+pressure+gravity+pressure:gravity,
+                 data=GasolineYield, link="probit")
> summary(fit5)
                    Estimate   Std.Error   z value    Pr(>|z|)
(Intercept)       -4.2782892   0.3608205   -11.857     < 2e-16
temp               0.0057884   0.0003126    18.518     < 2e-16
pressure           0.1976645   0.0483439     4.089    4.34e-05
gravity            0.0279473   0.0084987     3.288     0.00101
pressure:gravity  -0.0028123   0.0011945    -2.354     0.01856
340 / 460
Alaska pipeline
This dataset, extracted from Weisberg [2005] and available in the object pipeline of the library glmtoolbox, consists of ultrasonic measurements of the depth of defects in the Alaska pipeline (Field). A second measurement of the depth of the defects was conducted in the laboratory (Lab). The data were analyzed in order to calibrate the bias between field measurements and laboratory measurements.

[Figure: field measurement versus log(laboratory measurement).]

Y_i ∼ind [µ_i, φµ^r_i],  µ_i = β₀ x^{β₁}_i

y = Field measurement
x = Laboratory measurement
r = 0, 1, 2 or 3
341 / 460
Alaska pipeline
> data(pipeline)
> str(pipeline)
> summary(pipeline)
>
> fit1 <- glm(Field ~ log(Lab), data=pipeline,
+ family=quasi(link=log,variance=constant))
> fit2 <- update(fit1, family=quasi(link=log,variance=mu))
> fit3 <- update(fit1, family=quasi(link=log,variance=mu^2))
> fit4 <- update(fit1, family=quasi(link=log,variance=mu^3))
>
> adjR2(fit1,fit2,fit3,fit4)
Deviance R-squared df adj.R-squared
fit1 3597.6085352 0.9019 2 0.9010
fit2 88.3293923 0.9216 2 0.9209
fit3 2.9264710 0.9256 2 0.9249
fit4 0.1439215 0.9109 2 0.9100
>
> dev.new(); residuals2(fit1)
> dev.new(); residuals2(fit2)
> dev.new(); residuals2(fit3)
> dev.new(); residuals2(fit4)
342 / 460
Alaska pipeline
> summary(fit3)
                β̂_j       c_j       β̂_j/c_j   Pr[χ²(1) > z²_j]
             Estimate  Std.Error   z value    Pr(>|z|)
(Intercept)   0.30610    0.07924     3.863    0.000194
log(Lab)      0.88197    0.02254    39.133     < 2e-16

Dispersion parameter for quasi family taken to be 0.02714494   ← φ̂

    Null deviance: 39.3211 on 106 degrees of freedom
Residual deviance:  2.9265 on 105 degrees of freedom
>
> fit5 <- glm(Field ~ log(Lab), data=pipeline, family=Gamma(log))
> summary(fit5)
             Estimate  Std.Error   z value    Pr(>|z|)
(Intercept)   0.30610    0.07924     3.863    0.000194
log(Lab)      0.88197    0.02254    39.133     < 2e-16

Dispersion parameter for Gamma family taken to be 0.02714494   ← φ̂

    Null deviance: 39.3211 on 106 degrees of freedom
Residual deviance:  2.9265 on 105 degrees of freedom
343 / 460
Overdispersion in Poisson
models
344 / 460
Overdispersion
Overdispersion occurs when the observed variance “exceeds” the variance expected under the Poisson distribution.

Overdispersion results in an underestimation of the variability associated with the parameter estimates. As a result, variables may appear significant when they actually are not, and confidence intervals may exhibit “low” coverage rates.

As an informal rule, if the deviance statistic is “rather higher” than its degrees of freedom, or if the Pearson chi-squared statistic is “rather higher” than its degrees of freedom, there may be evidence that the data are overdispersed.

Overdispersion may be addressed by replacing the Poisson regression model with a regression model based on a more complex distribution.
345 / 460
Self-diagnosed ear infections in swimmers

The data were extracted from Hand et al. [1993, page 266] and can be found in the object swimmers of the library glmtoolbox. This study was conducted in order to determine whether beach swimmers are at greater risk of contracting ear infections than non-beach swimmers. This dataset contains the following variables:

frequency. The recruit's perception of whether he or she is a frequent swimmer: “frequent” and “occasional”.
location. The recruit's usually chosen swimming location: “beach” and “non-beach”.
age. The recruit's age range: “15-19”, “20-24” and “25-29”.
gender. The recruit's gender: “male” and “female”.
infections. The number of self-diagnosed ear infections reported by the recruit.
346 / 460
Self-diagnosed ear infections in swimmers
> fit0 <- glm(infections ~ frequency + location + age + gender,
+ family=poisson(log), data=swimmers)
>
> c("Deviance/df"=deviance(fit0)/df.residual(fit0),
+ "Pearson/df"=sum(resid(fit0, type="pearson")**2)/df.residual(fit0))
Deviance/df Pearson/df
2.688374 3.379853
>
> envelope(fit0, type="quantile" )
−3 −2 −1 0 1 2 3
−4 −2 0 2 4 6 8
Expected quantiles
Observed quantiles
347 / 460
The Origins of Sex Differences in Science

These data, extracted from Long [1990] and included in the object bioChemists of the library pscl, are the result of a study that examined gender differences in productivity (number of papers published) during graduate school for a sample of 915 PhD students in biochemistry, as a result of educational, mentoring, marriage, and family factors. The following variables are included in this dataset:

art. Count of articles produced during the last 3 years of the PhD.
fem. The gender of the student: “Men” or “Women”.
mar. The marital status of the student: “Single” or “Married”.
kid5. The number of children aged 5 or younger.
phd. The prestige of the PhD department.
ment. The count of articles produced by the PhD mentor during the last 3 years.
348 / 460
The Origins of Sex Dierences in Science
> data(bioChemists, package="psc l")
> fit0 <- glm(art ~ fem + mar + kid5 + phd + ment, family=poisson(log),
+ data=bioChemists)
>
> c("Deviance/df"=deviance(fit0)/df.residual(fit0),
+ "Pearson/df"=sum(resid(fit0, type="pearson")**2)/df.residual(fit0))
Deviance/df Pearson/df
1.797988 1.828984
>
> envelope(fit0, type="quantile" )
−3 −2 −1 0 1 2 3
−4 −2 0 2 4 6 8
Expected quantiles
Observed quantiles
349 / 460
Negative Binomial I
If Y | λ ∼ Poisson(λ), so that E(Y | λ) = λ and Var(Y | λ) = λ, and λ ∼ Gamma(µ, φ), so that E(λ) = µ and Var(λ) = φµ², then Y ∼ Negative Binomial I(µ, φ). Therefore,

(1) E(Y) = E[E(Y | λ)] = µ.
(2) Var(Y) = Var[E(Y | λ)] + E[Var(Y | λ)] = µ(1 + φµ).
(3) The probability mass function of Y is

f_Y(y; µ, φ) = [Γ(y + φ⁻¹)/(Γ(φ⁻¹)Γ(y + 1))] [µ/(µ + φ⁻¹)]^y [φ⁻¹/(µ + φ⁻¹)]^{φ⁻¹} I{y ∈ {0, 1, ...}}.

(4) The moment generating function of Y is

M_Y(t) = [1 − φµ(e^t − 1)]^{−1/φ} for t < log[(µ + φ⁻¹)/µ].

(5) The mode of Y is max{⌊µ(1 − φ)⌋, 0}.
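A simulation sketch of this Poisson-Gamma genesis; µ and φ below are illustrative values, and dnbinom() with size = 1/φ and mu = µ is the NB I probability mass function:

> set.seed(1)
> mu <- 3; phi <- 0.5
> lambda <- rgamma(1e5, shape=1/phi, scale=phi*mu)  # E = mu, Var = phi*mu^2
> y <- rpois(1e5, lambda)
> c(mean=mean(y), var=var(y), theory=mu*(1 + phi*mu))
> rbind(empirical=table(factor(y, levels=0:5))/1e5,
+       theory=dnbinom(0:5, size=1/phi, mu=mu))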
350 / 460
Negative Binomial I
The probability mass function of Y may be obtained as follows:

f_Y(y; µ, φ) = ∫₀^∞ f_{Y|λ}(y; λ) f_λ(λ; µ, φ) dλ
 = ∫₀^∞ [exp(−λ)λ^y/Γ(y + 1)] I{y ∈ {0, 1, ...}} [λ^{φ⁻¹−1}/(Γ(φ⁻¹)(φµ)^{φ⁻¹})] exp[−λ/(φµ)] dλ
 = [I{y ∈ {0, 1, ...}}/(Γ(φ⁻¹)(φµ)^{φ⁻¹}Γ(y + 1))] ∫₀^∞ λ^{y+φ⁻¹−1} exp[−λ(µ + φ⁻¹)/µ] dλ
 = [Γ(φ̃⁻¹)(φ̃µ̃)^{φ̃⁻¹} I{y ∈ {0, 1, ...}}/(Γ(φ⁻¹)(φµ)^{φ⁻¹}Γ(y + 1))] ∫₀^∞ [λ^{φ̃⁻¹−1}/(Γ(φ̃⁻¹)(φ̃µ̃)^{φ̃⁻¹})] exp[−λ/(φ̃µ̃)] dλ,

where the last integral equals 1 because the integrand is the density of a Gamma(µ̃, φ̃) random variable, with φ̃ = 1/(y + φ⁻¹) and µ̃ = µ(y + φ⁻¹)/(µ + φ⁻¹). Hence,

f_Y(y; µ, φ) = [Γ(y + φ⁻¹)/(Γ(φ⁻¹)Γ(y + 1))] [µ/(µ + φ⁻¹)]^y [φ⁻¹/(µ + φ⁻¹)]^{φ⁻¹} I{y ∈ {0, 1, ...}}.
351 / 460
Negative Binomial I
The moment generating function of Y may be obtained as follows:

M_Y(t) = E[M_{Y|λ}(t)] = ∫₀^∞ M_{Y|λ}(t) f_λ(λ; µ, φ) dλ
 = ∫₀^∞ exp[λ(e^t − 1)] [λ^{φ⁻¹−1}/(Γ(φ⁻¹)(φµ)^{φ⁻¹})] exp[−λ/(φµ)] dλ
 = (µ̃/µ)^{φ⁻¹} ∫₀^∞ [λ^{φ⁻¹−1}/(Γ(φ⁻¹)(φµ̃)^{φ⁻¹})] exp[−λ/(φµ̃)] dλ
 = [1 − φµ(e^t − 1)]^{−1/φ} for t < log[(µ + φ⁻¹)/µ],

where the last integral equals 1 because the integrand is the density of a Gamma(µ̃, φ) random variable, with µ̃ = µ/[1 − φµ(e^t − 1)]; note that µ̃ > 0 if and only if t < log[(µ + φ⁻¹)/µ].
352 / 460
Negative Binomial distribution
If Y | λ ∼ Poisson(λ), so that E(Y | λ) = λ and Var(Y | λ) = λ, and λ ∼ Gamma(µ, φµ^τ), so that E(λ) = µ and Var(λ) = φµ^{τ+2}, for τ ∈ R, then Y ∼ Negative Binomial(µ, φ, τ). Therefore,

(1) E(Y) = E[E(Y | λ)] = µ.
(2) Var(Y) = Var[E(Y | λ)] + E[Var(Y | λ)] = µ(1 + φµ^{τ+1}).
(3) The probability mass function of Y is

f_Y(y; µ, φ, τ) = [Γ(y + 1/(φµ^τ))/(Γ(1/(φµ^τ))Γ(y + 1))] [µ/(µ + 1/(φµ^τ))]^y [(1/(φµ^τ))/(µ + 1/(φµ^τ))]^{1/(φµ^τ)} I{y ∈ {0, 1, ...}}.

(4) The moment generating function of Y is

M_Y(t) = [1 − φµ^{τ+1}(e^t − 1)]^{−1/(φµ^τ)} for t < log[(µ + 1/(φµ^τ))/µ].

(5) The mode of Y is max{⌊µ(1 − φµ^τ)⌋, 0}.
(6) Negative Binomial I(µ, φ) is obtained for τ = 0.
(7) Negative Binomial II(µ, φ) is obtained for τ = −1.
353 / 460
Negative Binomial regression
Y_i ∼ind Negative Binomial(µ_i, φ, τ),
g(µ_i) = β₀ + β₁x_i1 + ... + β_p x_ip,

so that E(Y_i) = µ_i and Var(Y_i) = µ_i(1 + φµ_i^{τ+1}).

Logarithmic link: g(µ_i) = log(µ_i)
Identity link: g(µ_i) = µ_i
Square-root link: g(µ_i) = √µ_i

Parameter estimation may be accomplished by using the maximum-likelihood method. Inference is based on the asymptotic properties of the maximum-likelihood estimator.
Self-diagnosed ear infections in swimmers

Recall the swimmers data (Hand et al. [1993, page 266]; object swimmers of the library glmtoolbox): the response is the number of self-diagnosed ear infections (infections) and the explanatory variables are frequency, location, age and gender.
355 / 460
Self-diagnosed ear infections in swimmers
> fit1 <- overglm(infections ~ frequency + location + age +
+ gender, family="nb1(log)", data=swimmers)
> fit2 <- update(fit1, family="nb2(log)")
> fit3 <- update(fit1, family="nbf(log)")
>
> AIC(fit0, fit1, fit2, fit3)
df AIC
fit0 6 1139.8280
fit1 7 904.6900
fit2 7 907.2357
fit3 8 906.5393
> BIC(fit0, fit1, fit2, fit3)
df BIC
fit0 6 1161.7849
fit1 7 930.3064
fit2 7 932.8521
fit3 8 935.8152
>
> fit1a <- update(fit1, formula = . ~ . - age - gender)
> anova(fit1a, fit1, test="wald")
Model 1 : infections ~ frequency + location
Model 2 : infections ~ frequency + location + age + gender
Chi Df Pr(>Chi)
1 vs 2 3.8493 3 0.2782
356 / 460
Self-diagnosed ear infections in swimmers
> anova(fit1a, fit1, test="score")
Model 1 : infections ~ frequency + location
Model 2 : infections ~ frequency + location + age + gender
Chi Df Pr(>Chi)
1 vs 2 3.9632 3 0.2655
>
> anova(fit1a, fit1, test="lr")
Model 1 : infections ~ frequency + location
Model 2 : infections ~ frequency + location + age + gender
Chi Df Pr(>Chi)
1 vs 2 3.8214 3 0.2814
>
> anova(fit1a, fit1, test="gradient")
Model 1 : infections ~ frequency + location
Model 2 : infections ~ frequency + location + age + gender
Chi Df Pr(>Chi)
1 vs 2 3.7811 3 0.2861
>
> summary(fit1a)
Estimate Std.Error z-value Pr(>|z|)
(Intercept) -0.27766 0.17191 -1.61512 0.1062847
frequencyoccasional 0.59019 0.19066 3.09554 0.0019645
locationnon-beach 0.48134 0.19004 2.53279 0.0113158
phi 1.78560 0.27779
357 / 460
Self-diagnosed ear infections in swimmers
> summary(update(fit0, formula = . ~ . - age - gender))
Estimate Std.Error z-value Pr(>|z|)
(Intercept) -0.30583 0.10593 -2.88703 0.00389
frequencyoccasional 0.61304 0.10498 5.83931 5.24e-09
locationnon-beach 0.50873 0.10281 4.94833 7.49e-07
>
> envelope(fit1a, type="quantile ")
−3 −2 −1 0 1 2 3
−4 −2 0 2 4
Expected quantiles
Observed quantiles
358 / 460
The Origins of Sex Differences in Science

Recall the bioChemists data (Long [1990]; object bioChemists of the library pscl): the response is the number of articles produced during the last 3 years of the PhD (art) and the explanatory variables are fem, mar, kid5, phd and ment.
359 / 460
The Origins of Sex Differences in Science
> fit1 <- overglm(art ~ fem + mar + kid5 + phd + ment,
+ family="nb1(log)", data=bioChemists)
> fit2 <- update(fit1, family="nb2(log)")
> fit3 <- update(fit1, family="nbf(log)")
>
> AIC(fit0, fit1, fit2, fit3)
df AIC
fit0 6 3314.113
fit1 7 3135.917
fit2 7 3143.397
fit3 8 3137.905
> BIC(fit0, fit1, fit2, fit3)
df BIC
fit0 6 3343.026
fit1 7 3169.649
fit2 7 3177.130
fit3 8 3176.457
>
> fit1a <- update(fit1, formula = . ~ . - mar - phd)
> anova(fit1a, fit1, test="wald")
Model 1 : art ~ fem + kid5 + ment
Model 2 : art ~ fem + mar + kid5 + phd + ment
Chi Df Pr(>Chi)
1 vs 2 3.4159 2 0.1812
360 / 460
The Origins of Sex Differences in Science
> anova(fit1a, fit1, test="score")
Model 1 : art ~ fem + kid5 + ment
Model 2 : art ~ fem + mar + kid5 + phd + ment
Chi Df Pr(>Chi)
1 vs 2 3.4338 2 0.1796
>
> anova(fit1a, fit1, test="lr")
Model 1 : art ~ fem + kid5 + ment
Model 2 : art ~ fem + mar + kid5 + phd + ment
Chi Df Pr(>Chi)
1 vs 2 3.4118 2 0.1816
>
> anova(fit1a, fit1, test="gradient")
Model 1 : art ~ fem + kid5 + ment
Model 2 : art ~ fem + mar + kid5 + phd + ment
Chi Df Pr(>Chi)
1 vs 2 3.4054 2 0.1822
>
> summary(fit1a)
Estimate Std.Error z-value Pr(>|z|)
(Intercept) 0.39102 0.06628 5.89972 3.6411e-09
femWomen -0.23270 0.07230 -3.21857 0.0012883
kid5 -0.13775 0.04846 -2.84278 0.0044722
ment 0.02937 0.00338 8.67714 < 2.22e-16
361 / 460
The Origins of Sex Differences in Science
> summary(update(fit0, formula = . ~ . - mar - phd))
Estimate Std.Error z-value Pr(>|z|)
(Intercept) 0.43673 0.04683 9.32683 < 2e-16
femWomen -0.24311 0.05421 -4.48475 7.30e-06
kid5 -0.14379 0.03657 -3.93146 8.44e-05
ment 0.02565 0.00195 13.13606 < 2e-16
>
> envelope(fit1a, type="quantile")
[Figure: normal probability plot with simulated envelope for fit1a: observed vs. expected quantiles]
362 / 460
Exercises
(1) Let's assume that Y = Σ_{i=1}^{m} Y_i and Y_i ~ind Poisson(µ_i).

(A) Determine the probability distribution of Y.
(B) Let's assume that the Y_i's have a positive correlation. What is the effect of this change on the mean and variance of Y? (A simulation sketch follows after this exercise list.)
(C) Let's assume that the Y_i's have a negative correlation. What is the effect of this change on the mean and variance of Y?

(2) Show that, if Y | λ ~ Poisson(λ) and λ follows a continuous distribution on the interval (0, ∞), such that E(λ) = µ and Var(λ) = κ(σ², µ) for some positive function κ(·, ·), then Y is a discrete random variable such that: (i) its possible values correspond to the non-negative integers; (ii) its expectation is equal to µ; and (iii) its variance is equal to µ + κ(σ², µ). Therefore, a model based on the distribution of Y is an alternative to the Poisson model in the presence of overdispersion.
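A small simulation makes exercise (1)(B) concrete; this sketch induces positive correlation among the Y_i through a shared gamma frailty, an assumption chosen purely for illustration:

# Positively correlated Poisson counts inflate the variance of their sum,
# while the mean is unaffected: E(Y) = m*mu in both scenarios below.
set.seed(123)
n <- 1e5; m <- 5; mu <- 2
lambda <- rgamma(n, shape = 4, rate = 4/mu)   # shared frailty, E(lambda) = mu
Ycor <- rowSums(matrix(rpois(n*m, rep(lambda, each = m)), n, m, byrow = TRUE))
Yind <- rowSums(matrix(rpois(n*m, mu), n, m))
c(mean(Yind), var(Yind))   # both close to m*mu = 10 (equidispersion)
c(mean(Ycor), var(Ycor))   # mean still near 10, variance well above 10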
363 / 460
Overdispersion in binomial
models
364 / 460
Overdispersion
Overdispersion occurs when the observed variance "exceeds" the expected variance according to the binomial distribution.

Overdispersion results in an underestimation of the variability associated with parameter estimates. As a result, variables may appear significant when they are actually not significant, as well as producing confidence intervals with "low" coverage rates.

As an informal rule, if the deviance statistic is "rather higher" than its degrees of freedom, or if Pearson's chi-squared statistic is "rather higher" than its degrees of freedom, there may be evidence that the data are overdispersed.
Overdispersion may be addressed by replacing the binomial
regression model with a regression model based on a more
complex distribution.
365 / 460
Cell differentiation

These data, which can be found in the object cellular of the library glmtoolbox, have been extracted from Piegorsch et al. [1988] and relate to the ability of two immune-activating agents, TNF and IFN, to induce the differentiation of cells. The cells were assessed after they had been exposed to none, one, or both agents in order to determine whether they had displayed markers of differentiation. A total of 200 cells were examined for each of the 16 dose combinations of TNF and IFN. The main question is whether the two agents stimulate cell differentiation in a synergistic or independent manner.
[Figure: % of differentiated cells versus dose of IFN (0, 4, 20, 100 U/ml), by dose of TNF (0, 1, 10, 100 U/ml)]

m_i Y_i ~ind Binomial(m_i, µ_i),   g(µ_i) = β_0 + β_1 x_{i1} + β_2 x_{i2} + β_3 x_{i1}x_{i2}

m = number of examined cells
y = proportion of differentiated cells
x_1 = dose of TNF, in U/ml
x_2 = dose of IFN, in U/ml
366 / 460
Cell differentiation
> fit0 <- glm(cells/200 ~ tnf + ifn + tnf*ifn, weights=rep(200,16),
+ family=binomial(logit), data=cellular)
>
> c("Deviance/df"=deviance(fit0)/df.residual(fit0),
+ "Pearson/df"=sum(resid(fit0, type="pearson")**2)/df.residual(fit0))
Deviance/df  Pearson/df
   19.05526    16.02978
>
> envelope(fit0, type="quantile")
[Figure: normal probability plot with simulated envelope for fit0: observed vs. expected quantiles]
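A quick complementary check, not shown on the slide: refitting under a quasi-binomial family estimates the dispersion directly, and a value far above 1 points to overdispersion (a sketch using base R only):

# Quasi-binomial refit: the estimated dispersion should be close to the
# Pearson/df ratio reported above (around 16 here), i.e., clearly above 1
fit0q <- update(fit0, family = quasibinomial(logit))
summary(fit0q)$dispersion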
367 / 460
Teratogenic effects of PHT and TCPO

These data, extracted from Morel and Neerchal [1997] and available in the object ossification of the library glmtoolbox, were obtained from an experiment in which pregnant mice were randomly assigned to one of three treatments: 60 mg/kg PHT, 100 mg/kg TCPO, or 60 mg/kg PHT together with 100 mg/kg TCPO, administered daily by gastric gavage. The left middle third phalanx was examined on day 18 of pregnancy for the presence or absence of ossification. The purpose of the study was to investigate the synergy between PHT and TCPO.
[Figure: % of fetuses showing ossification versus dose of PHT (0, 60 mg/kg), by dose of TCPO (0, 100 mg/kg)]

m_i Y_i ~ind Binomial(m_i, µ_i),   g(µ_i) = β_0 + β_1 x_{i1} + β_2 x_{i2} + β_3 x_{i1}x_{i2}

m = litter size
y = proportion of fetuses showing ossification
x_1 = dose of PHT, in mg/kg
x_2 = dose of TCPO, in mg/kg
368 / 460
Teratogenic effects of PHT and TCPO
> fit0 <- glm(fetuses/litter ~ pht + tcpo + pht*tcpo, weights=litter,
+ family=binomial(logit), data=ossification)
>
> c("Deviance/df"=deviance(fit0)/df.residual(fit0),
+ "Pearson/df"=sum(resid(fit0, type="pearson")**2)/df.residual(fit0))
Deviance/df  Pearson/df
   3.273809    3.771843
>
> envelope(fit0, type="quantile")
[Figure: normal probability plot with simulated envelope for fit0: observed vs. expected quantiles]
369 / 460
Beta Binomial distribution
If mY | π ~ Binomial(m, π), so that E(Y | π) = π and Var(Y | π) = m^{−1}π(1 − π), and π ~ Beta(µ, φ) for φ > 0, so that E(π) = µ and Var(π) = (φ + 1)^{−1}µ(1 − µ), then mY ~ Beta-Binomial(m, µ, φ). Being so,

(1) E(Y) = E[E(Y | π)] = µ

(2) Var(Y) = Var[E(Y | π)] + E[Var(Y | π)] = [µ(1 − µ)/m] [1 + (m − 1)/(φ + 1)]

(3) The probability mass function of Y is

f_Y(y; m, µ, φ) = C(m, my) [B(my + φµ, m(1 − y) + φ(1 − µ)) / B(φµ, φ(1 − µ))] I(my ∈ {0, 1, …, m}),

where C(m, my) denotes the binomial coefficient and B(a, b) = Γ(a)Γ(b)/Γ(a + b) is the beta function. (A numeric sanity check follows below.)
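The moment formulas above are easy to verify numerically; a sketch in base R, coding the pmf directly:

# Numeric sanity check of the beta-binomial moments (pmf coded by hand)
m <- 10; mu <- 0.3; phi <- 2
k <- 0:m                                   # values of my
pmf <- choose(m, k) * beta(k + phi*mu, m - k + phi*(1 - mu)) /
       beta(phi*mu, phi*(1 - mu))
sum(pmf)                                   # 1
sum((k/m) * pmf)                           # E(Y) = mu = 0.3
sum((k/m - mu)^2 * pmf)                    # Var(Y), matching the formula:
mu*(1 - mu)/m * (1 + (m - 1)/(phi + 1))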
370 / 460
Beta Binomial distribution
The probability mass function of Y may be obtained as follows:

f_Y(y; m, µ, φ) = ∫ f_{mY|π}(y; m, π) f_π(π; µ, φ) dπ

= ∫₀¹ C(m, my) π^{my}(1 − π)^{m(1−y)} I(my ∈ {0, 1, …, m}) × [π^{µφ−1}(1 − π)^{(1−µ)φ−1} / B(µφ, (1 − µ)φ)] dπ

= C(m, my) [I(my ∈ {0, 1, …, m}) / B(µφ, (1 − µ)φ)] ∫₀¹ π^{my+µφ−1}(1 − π)^{m(1−y)+(1−µ)φ−1} dπ

= C(m, my) [B(µ̃φ̃, (1 − µ̃)φ̃) / B(µφ, (1 − µ)φ)] I(my ∈ {0, 1, …, m}) ∫₀¹ [π^{µ̃φ̃−1}(1 − π)^{(1−µ̃)φ̃−1} / B(µ̃φ̃, (1 − µ̃)φ̃)] dπ

= C(m, my) [B(my + φµ, m(1 − y) + φ(1 − µ)) / B(φµ, φ(1 − µ))] I(my ∈ {0, 1, …, m}),

where µ̃ = (my + µφ)/(m + φ) and φ̃ = m + φ; the last integral equals 1 because its integrand is the density of a Beta(µ̃, φ̃) random variable.
371 / 460
Beta Binomial regression
m_i Y_i ~ind Beta-Binomial(m_i, µ_i, φ),   g(µ_i) = β_0 + β_1 x_{i1} + … + β_p x_{ip}

E(Y_i) = µ_i and Var(Y_i) = [µ_i(1 − µ_i)/m_i] [1 + (m_i − 1)/(φ + 1)]

Logit link: g(µ_i) = log[µ_i/(1 − µ_i)]

Probit link: g(µ_i) = Φ^{−1}(µ_i)

Complementary log-log link: g(µ_i) = log[−log(1 − µ_i)]
Parameter estimation may be accomplished by using the maximum-likelihood method.
Inference is based on the asymptotic properties of the maximum-likelihood estimator.
372 / 460
Random-clumped Binomial distribution
If mY | π ~ Binomial(m, π), so that E(Y | π) = π and Var(Y | π) = m^{−1}π(1 − π), and π = (1 − φ)µ + φ·Bernoulli(µ) for µ, φ ∈ (0, 1), so that E(π) = µ and Var(π) = φ²µ(1 − µ), then mY ~ Random-Clumped Binomial(m, µ, φ). Being so,

E(Y) = E[E(Y | π)] = µ,

Var(Y) = Var[E(Y | π)] + E[Var(Y | π)] = [µ(1 − µ)/m] [1 + φ²(m − 1)],

and the probability mass function of Y may be written as follows:

f_Y(y; m, µ, φ) = µ C(m, my) π_1^{my}(1 − π_1)^{m(1−y)} I(my ∈ {0, 1, …, m}) + (1 − µ) C(m, my) π_2^{my}(1 − π_2)^{m(1−y)} I(my ∈ {0, 1, …, m}),

where π_1 = (1 − φ)µ + φ and π_2 = (1 − φ)µ. (A numeric check follows below.)
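As with the beta-binomial case, the two-component mixture form above can be checked numerically in base R:

# RCB pmf is a two-component binomial mixture; check of mean and variance
m <- 10; mu <- 0.3; phi <- 0.4
p1 <- (1 - phi)*mu + phi; p2 <- (1 - phi)*mu
k  <- 0:m
pmf <- mu*dbinom(k, m, p1) + (1 - mu)*dbinom(k, m, p2)
sum((k/m)*pmf)                             # E(Y) = mu
sum((k/m - mu)^2*pmf)                      # Var(Y), matching the formula:
mu*(1 - mu)/m * (1 + phi^2*(m - 1))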
373 / 460
Random-clumped Binomial regression
m_i Y_i ~ind Random-Clumped Binomial(m_i, µ_i, φ),   g(µ_i) = β_0 + β_1 x_{i1} + … + β_p x_{ip}

E(Y_i) = µ_i and Var(Y_i) = [µ_i(1 − µ_i)/m_i] [1 + φ²(m_i − 1)]

Logit link: g(µ_i) = log[µ_i/(1 − µ_i)]

Probit link: g(µ_i) = Φ^{−1}(µ_i)

Complementary log-log link: g(µ_i) = log[−log(1 − µ_i)]
Parameter estimation may be accomplished by using the maximum-likelihood method.
Inference is based on the asymptotic properties of the maximum-likelihood estimator.
374 / 460
Cell differentiation

These data, which can be found in the object cellular of the library glmtoolbox, have been extracted from Piegorsch et al. [1988] and relate to the ability of two immune-activating agents, TNF and IFN, to induce the differentiation of cells. The cells were assessed after they had been exposed to none, one, or both agents in order to determine whether they had displayed markers of differentiation. A total of 200 cells were examined for each of the 16 dose combinations of TNF and IFN. The main question is whether the two agents stimulate cell differentiation in a synergistic or independent manner.
[Figure: % of differentiated cells versus dose of IFN (0, 4, 20, 100 U/ml), by dose of TNF (0, 1, 10, 100 U/ml)]

m_i Y_i ~ind Binomial(m_i, µ_i),   g(µ_i) = β_0 + β_1 x_{i1} + β_2 x_{i2} + β_3 x_{i1}x_{i2}

m = number of examined cells
y = proportion of differentiated cells
x_1 = dose of TNF, in U/ml
x_2 = dose of IFN, in U/ml
375 / 460
Cell differentiation
> fit1 <- overglm(cbind(cells,200-cells) ~ tnf + ifn + tnf*ifn,
+ family="bb(logit)", data=cellular)
> fit2 <- update(fit1, family="rcb(probit)")
>
> AIC(fit0, fit1, fit2)
df AIC
fit0 4 281.8449
fit1 5 148.0026
fit2 5 163.1216
>
> BIC(fit0, fit1, fit2)
df BIC
fit0 4 284.9353
fit1 5 151.8656
fit2 5 166.9846
>
> fit1a <- update(fit1, formula = . ~ . - tnf:ifn)
> anova(fit1a, fit1,test="wald")
Model 1 : cbind(cells, 200 - cells) ~ tnf + ifn
Model 2 : cbind(cells, 200 - cells) ~ tnf * ifn
Chi Df Pr(>Chi)
1 vs 2 0.83757 1 0.3601
376 / 460
Cell differentiation
> anova(fit1a, fit1, test="lr")
Model 1 : cbind(cells, 200 - cells) ~ tnf + ifn
Model 2 : cbind(cells, 200 - cells) ~ tnf * ifn
Chi Df Pr(>Chi)
1 vs 2 1.1974 1 0.2738
>
> anova(fit1a, fit1, test="score")
Model 1 : cbind(cells, 200 - cells) ~ tnf + ifn
Model 2 : cbind(cells, 200 - cells) ~ tnf * ifn
Chi Df Pr(>Chi)
1 vs 2 1.0277 1 0.3107
>
> anova(fit1a, fit1, test="gradient")
Model 1 : cbind(cells, 200 - cells) ~ tnf + ifn
Model 2 : cbind(cells, 200 - cells) ~ tnf * ifn
Chi Df Pr(>Chi)
1 vs 2 1.3701 1 0.2418
>
> summary(fit1a)
Estimate Std.Error z-value Pr(>|z|)
(Intercept) -1.75132 0.24252 -7.22127 5.1504e-13
tnf 0.02952 0.00399 7.39682 1.3949e-13
ifn 0.01296 0.00367 3.53479 0.0004081
377 / 460
Cell differentiation
> fit0 <- update(fit0, formula = . ~ . - tnf:ifn)
> summary(fit0)
Estimate Std.Error z-value Pr(>|z|)
(Intercept) -1.83209 0.06955 -26.34045 <2e-16
tnf 0.02996 0.00113 26.56381 <2e-16
ifn 0.01410 0.00106 13.33093 <2e-16
>
> envelope(fit1a, type="quantile")
[Figure: normal probability plot with simulated envelope for fit1a: observed vs. expected quantiles]
378 / 460
Teratogenic effects of PHT and TCPO

These data, extracted from Morel and Neerchal [1997] and available in the object ossification of the library glmtoolbox, were obtained from an experiment in which pregnant mice were randomly assigned to one of three treatments: 60 mg/kg PHT, 100 mg/kg TCPO, or 60 mg/kg PHT together with 100 mg/kg TCPO, administered daily by gastric gavage. The left middle third phalanx was examined on day 18 of pregnancy for the presence or absence of ossification. The purpose of the study was to investigate the synergy between PHT and TCPO.
[Figure: % of fetuses showing ossification versus dose of PHT (0, 60 mg/kg), by dose of TCPO (0, 100 mg/kg)]

m_i Y_i ~ind Binomial(m_i, µ_i),   g(µ_i) = β_0 + β_1 x_{i1} + β_2 x_{i2} + β_3 x_{i1}x_{i2}

m = litter size
y = proportion of fetuses showing ossification
x_1 = dose of PHT, in mg/kg
x_2 = dose of TCPO, in mg/kg
379 / 460
Teratogenic effects of PHT and TCPO
> fit1 <- overglm(cbind(fetuses,litter-fetuses) ~ pht + tcpo + pht*tcpo,
+ family="bb(logit)", data=ossification)
> fit2 <- update(fit1, family="rcb(logit)")
>
> AIC(fit0, fit1, fit2)
df AIC
fit0 4 409.7828
fit1 5 316.5751
fit2 5 315.0534
>
> BIC(fit0, fit1, fit2)
df BIC
fit0 4 419.3606
fit1 5 328.5474
fit2 5 327.0257
>
> fit2a <- update(fit2, formula = . ~ . - pht:tcpo)
>
> anova(fit2a, fit2, test="wald")
Model 1 : cbind(fetuses, litter - fetuses) ~ pht + tcpo
Model 2 : cbind(fetuses, litter - fetuses) ~ pht * tcpo
Chi Df Pr(>Chi)
1 vs 2 0.85147 1 0.3561
380 / 460
Teratogenic effects of PHT and TCPO
> anova(fit2a, fit2, test="lr")
Model 1 : cbind(fetuses, litter - fetuses) ~ pht + tcpo
Model 2 : cbind(fetuses, litter - fetuses) ~ pht * tcpo
Chi Df Pr(>Chi)
1 vs 2 1.1974 1 0.2738
>
> anova(fit2a, fit2, test="score")
Model 1 : cbind(fetuses, litter - fetuses) ~ pht + tcpo
Model 2 : cbind(fetuses, litter - fetuses) ~ pht * tcpo
Chi Df Pr(>Chi)
1 vs 2 0.83027 1 0.3622
>
> anova(fit2a, fit2, test="gradient")
Model 1 : cbind(fetuses, litter - fetuses) ~ pht + tcpo
Model 2 : cbind(fetuses, litter - fetuses) ~ pht * tcpo
Chi Df Pr(>Chi)
1 vs 2 0.81836 1 0.3657
>
> summary(fit2a)
Estimate Std.Error z-value Pr(>|z|)
(Intercept) 0.57038 0.21361 2.67014 0.007582
pht60 mg/kg -1.33705 0.33663 -3.97188 7.1307e-05
tcpo100 mg/kg -0.76212 0.31146 -2.44693 0.014408
381 / 460
Teratogenic effects of PHT and TCPO
> fit0 <- update(fit0, formula = . ~ . - pht:tcpo)
> summary(fit0)
Estimate Std.Error z-value Pr(>|z|)
(Intercept) 0.72414 0.12728 5.68926 0.00000
pht60 mg/kg -1.75661 0.19908 -8.82366 0.00000
tcpo100 mg/kg -0.55281 0.19117 -2.89173 0.00383
>
> envelope(fit2a, type="quantile")
[Figure: normal probability plot with simulated envelope for fit2a: observed vs. expected quantiles]
382 / 460
Exercises
(1) Let's assume that

Y = [1/(n_1 + … + n_m)] Σ_{i=1}^{m} Y_i and Y_i ~ind Binomial(n_i, µ).

(A) Determine the probability distribution of Y.
(B) Let's assume that the Y_i's have a positive correlation. What is the effect of this change on the mean and variance of Y?
(C) Let's assume that the Y_i's have a negative correlation. What is the effect of this change on the mean and variance of Y?

(2) Show that, if mY | π ~ Binomial(m, π) and π follows a continuous distribution on the interval (0, 1), such that E(π) = µ and Var(π) = κ(σ², µ) for some positive function κ(·, ·), then Y is a discrete random variable such that: (i) its possible values are 0, m^{−1}, 2m^{−1}, 3m^{−1}, …, (m − 1)m^{−1}, 1; (ii) its expectation is equal to µ; (iii) its variance is equal to m^{−1}µ(1 − µ) + (1 − m^{−1})κ(σ², µ). Therefore, a model based on the distribution of Y is an alternative to the binomial model in the presence of overdispersion.
383 / 460
Excess of zeros
in count data
384 / 460
Males attached to female horseshoe crabs
These data, extracted from Agresti [2015] and available in the object hcrabs of the library GLMsData, are drawn from a study of female horseshoe crabs on an island in the Gulf of Mexico. During the spawning season, females migrate to the shore in order to breed. The female burrows into the sand with a male attached to her posterior spine and lays clusters of eggs. In the sand beneath the pair, the eggs are fertilized externally. Other male crabs (satellites) may cluster around the spawning pair and fertilize the eggs as well. The variables in the dataset are the following:

Col. The color of the female crab: "LM" (Light Medium), "M" (Medium), "DM" (Dark Medium) or "D" (Dark).
Spine. The spine condition of the female crab: "BothOK", "OneOK" or "NoneOK".
Width. The carapace width of the female crab, in cm.
Wt. The weight of the female crab, in grams.
Sat. The number of male crabs attached to the female crab.
385 / 460
Males attached to female horseshoe crabs
There is an excess of zeros when the observed number of zeros is "rather greater" than it is expected to be. According to the Poisson model, Y_i ~ind Poisson(µ_i). Thus, O = O_1 + … + O_n is the number of zeros, where O_i = 1 if Y_i = 0 and O_i = 0 if Y_i > 0, so that O_i ~ind Bernoulli(exp(−µ_i)).

Therefore, E(O) and Var(O) are given, respectively, by

E(O) = Σ_{i=1}^{n} E(O_i) = Σ_{i=1}^{n} Pr(Y_i = 0) = Σ_{i=1}^{n} exp(−µ_i)

and

Var(O) = Σ_{i=1}^{n} Var(O_i) = Σ_{i=1}^{n} Pr(Y_i = 0)[1 − Pr(Y_i = 0)] = Σ_{i=1}^{n} exp(−µ_i)[1 − exp(−µ_i)].

The CLT indicates that

Z = [O − E(O)] / √Var(O) →_D Normal(0, 1) as n → ∞.
386 / 460
Males attached to female horseshoe crabs
This "naive" statistic, which is obtained by substituting β̂ for β in the above expression for Z, will be biased relative to the asymptotic standard normal, since it treats the estimate β̂ as the true β, thus ignoring the sampling variation associated with β̂. In order to correct this, parametric bootstrapping can be used to estimate the mean and variance of the difference between the observed and estimated expected number of zeros. (A by-hand sketch of the naive statistic follows below.)
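For intuition, the naive version of Z can be computed by hand from the Poisson fit fit0 defined just below (a sketch; the zero.excess() output that follows applies the bootstrap correction instead):

# Naive statistic: plugs in the estimated means and ignores the
# sampling variation of beta-hat
mu.hat <- fitted(fit0)
O   <- sum(hcrabs$Sat == 0)                      # observed zeros (62)
E.O <- sum(exp(-mu.hat))                         # estimated expected zeros
V.O <- sum(exp(-mu.hat) * (1 - exp(-mu.hat)))
(z  <- (O - E.O) / sqrt(V.O))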
> fit0 <- glm(Sat ~ log(Wt) + Col + Spine + Width, data=hcrabs,
+ family=poisson(log))
>
> zero.excess(fit0)
Number of Zeros
Observed Expected z-value Pr(>z)
62 16.787 12.266 <2.2e-16
(In the output above, 62 is the observed number of zeros o, 16.787 is Ê(O), and the reported p-value is Pr(χ²(1) > z²).)
387 / 460
Zero-Inflated Poisson distribution
If Y | ν = 0 when ν = 1 and Y | ν ~ Poisson(µ) when ν = 0, with ν ~ Bernoulli(π), then Y is distributed according to the Zero-Inflated Poisson distribution, denoted here by ZIP(µ, π). Remember that

Pr(Y = y) = Pr(Y = y | ν = 1)Pr(ν = 1) + Pr(Y = y | ν = 0)Pr(ν = 0).

Therefore, the probability mass function of Y reduces to

f_Y(y; µ, π) = π + (1 − π)exp(−µ) if y = 0, and f_Y(y; µ, π) = (1 − π) exp(−µ)µ^y / y! if y = 1, 2, 3, …

Note that Pr(Y = 0) is a strictly increasing function of π for fixed µ. Indeed, Pr(Y = 0) → Pr(Ỹ = 0) = exp(−µ) as π → 0, and Pr(Y = 0) → 1 as π → 1, where Ỹ ~ Poisson(µ). In consequence, the ZIP(µ, π) distribution is an alternative to the Poisson distribution in the presence of excess of zeros.
388 / 460
Zero-Inflated Poisson distribution
If r > 0 and E(Ỹ^r) exists, where Ỹ ~ Poisson(µ), then

E(Y^r) = Σ_{y=0}^{∞} f_Y(y; µ, π) y^r = Σ_{y=1}^{∞} f_Y(y; µ, π) y^r = (1 − π) Σ_{y=1}^{∞} [exp(−µ)µ^y / y!] y^r = (1 − π) Σ_{y=0}^{∞} [exp(−µ)µ^y / y!] y^r = (1 − π) E(Ỹ^r).

Hence,

E(Y) = (1 − π)µ, E(Y²) = (1 − π)µ(µ + 1) and Var(Y) = (1 − π)µ + π(1 − π)µ².

Moreover,

F_Y(y; µ, π) = π I(y ∈ [0, ∞)) + (1 − π) F_Ỹ(y; µ). (A numeric check follows below.)
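These identities are simple to confirm numerically with the pmf written out directly (base R only):

# Numeric check of the ZIP mean and variance
mu <- 3; p0 <- 0.25
y   <- 0:200                               # effectively the whole support
pmf <- (1 - p0) * dpois(y, mu)
pmf[1] <- p0 + (1 - p0) * dpois(0, mu)
c(sum(y * pmf), (1 - p0) * mu)                      # E(Y)
c(sum(y^2 * pmf) - sum(y * pmf)^2,
  (1 - p0)*mu + p0*(1 - p0)*mu^2)                   # Var(Y)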
389 / 460
Zero-Inflated Poisson model
Y_i ~ind ZIP(µ_i, π_i),
g(µ_i) = β_0 + β_1 x_{i1} + … + β_p x_{ip},
h(π_i) = γ_0 + γ_1 z_{i1} + … + γ_q z_{iq}

Links for µ_i:
Logarithmic link: g(µ_i) = log(µ_i)
Identity link: g(µ_i) = µ_i
Square-root link: g(µ_i) = √µ_i

Links for π_i:
Logit link: h(π_i) = log[π_i/(1 − π_i)]
Probit link: h(π_i) = Φ^{−1}(π_i)
Complementary log-log link: h(π_i) = log[−log(1 − π_i)]
Cauchy link: h(π_i) = tan[(π/2)(2π_i − 1)]

(A fitting sketch follows below.)
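glmtoolbox fits the zero-altered (hurdle) variant with zeroalt(), used later in this chapter; a companion zeroinf() with the same 'mean model | zero model' two-part formula syntax is assumed here for the zero-inflated case, so treat this as a hedged sketch and check the package documentation:

# Hedged sketch: ZIP fit with a log-linear mean model and a logit model
# for the zero-inflation component (function/argument names assumed)
fit.zip <- zeroinf(Sat ~ log(Wt) + Col | Width, family = "poi(log)",
                   zero.link = "logit", data = hcrabs)
summary(fit.zip)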
390 / 460
Zero-Modified Poisson distribution

If Y | ν = 0 when ν = 1 and Y | ν ~ ZTP(µ) when ν = 0, with ν ~ Bernoulli(π), then the distribution of Y is the Zero-Modified Poisson, denoted here by ZMP(µ, π), where ZTP means the Zero-Truncated Poisson distribution. The probability mass function of Y reduces to

f_Y(y; µ, π) = π if y = 0, and f_Y(y; µ, π) = κ(µ, π) exp(−µ)µ^y / y! if y = 1, 2, 3, …,

where κ(µ, π) = (1 − π)/[1 − Pr(Ỹ = 0)] = (1 − π)/[1 − exp(−µ)], in which Ỹ ~ Poisson(µ). Note that:

(1) If π < exp(−µ) (i.e., if κ(µ, π) > 1), then the distribution of Y is an alternative to the Poisson distribution in the presence of lack of zeros,
391 / 460
Zero-Modified Poisson distribution

(2) If π = exp(−µ) (i.e., if κ(µ, π) = 1), then the distribution of Y reduces to the Poisson distribution,

(3) If π > exp(−µ) (i.e., if κ(µ, π) < 1), then the distribution of Y is an alternative to the Poisson distribution in the presence of excess of zeros.

If r > 0 and E(Ỹ^r) exists, then E(Y^r) = κ(µ, π)E(Ỹ^r). Hence,

E(Y) = κ(µ, π)µ, E(Y²) = κ(µ, π)µ(µ + 1), and Var(Y) = κ(µ, π)µ + κ(µ, π)[1 − κ(µ, π)]µ².

Moreover,

F_Y(y; µ, π) = π I(y ∈ [0, ∞)) + κ(µ, π)[F_Ỹ(y; µ) − exp(−µ)].
392 / 460
Zero-Modified Poisson model
Y_i ~ind ZMP(µ_i, π_i),
g(µ_i) = β_0 + β_1 x_{i1} + … + β_p x_{ip},
h(π_i) = γ_0 + γ_1 z_{i1} + … + γ_q z_{iq}

Links for µ_i:
Logarithmic link: g(µ_i) = log(µ_i)
Identity link: g(µ_i) = µ_i
Square-root link: g(µ_i) = √µ_i

Links for π_i:
Logit link: h(π_i) = log[π_i/(1 − π_i)]
Probit link: h(π_i) = Φ^{−1}(π_i)
Complementary log-log link: h(π_i) = log[−log(1 − π_i)]
Cauchy link: h(π_i) = tan[(π/2)(2π_i − 1)]
393 / 460
Zero-Modified Poisson model
The likelihood function of θ = (γ⊤, β⊤)⊤ becomes

L(θ) = Π_{i=1}^{n} f_Y(y_i; µ_i(β), π_i(γ))

= Π_{i=1}^{n} π_i^{I(y_i = 0)} { [(1 − π_i)/(1 − exp(−µ_i))] exp(−µ_i)µ_i^{y_i} / y_i! }^{I(y_i ∈ {1,2,…})}

= Π_{i=1}^{n} π_i^{I(y_i = 0)} (1 − π_i)^{I(y_i ∈ {1,2,…})} { exp(−µ_i)µ_i^{y_i} / ([1 − exp(−µ_i)] y_i!) }^{I(y_i ∈ {1,2,…})}

= L_1(γ) L_2(β),

where L_1(γ) = Π_{i=1}^{n} π_i^{I(y_i = 0)}(1 − π_i)^{I(y_i ∈ {1,2,…})} is positive and does not depend on β, and L_2(β) = Π_{i=1}^{n} { exp(−µ_i)µ_i^{y_i} / ([1 − exp(−µ_i)] y_i!) }^{I(y_i ∈ {1,2,…})} is positive and does not depend on γ.
394 / 460
Zero-Modified Poisson model
Thus, maximizing L(θ) with respect to θ is equivalent to maximizing L_1(γ) with respect to γ = (γ_0, γ_1, …, γ_q)⊤ and, in parallel, maximizing L_2(β) with respect to β = (β_0, β_1, …, β_p)⊤. Therefore, fitting the ZMP model to the data is equivalent to dividing the problem into two parts:

(M_1) Classify the observed responses into zero and non-zero. Fit a model that assumes that r_1, …, r_n are realizations of independent random variables denoted by R_1, …, R_n and such that R_i ~ Bernoulli(π_i), where r_i = I(y_i = 0) and h(π_i) = γ_0 + γ_1 z_{i1} + … + γ_q z_{iq} for i = 1, …, n. Hence, γ̂ = argmax_{γ ∈ R^{q+1}} ℓ_1(γ).

(M_2) Only consider responses that are greater than zero. Select the m values of y that are greater than zero and fit to them a model that assumes that they are realizations of independent random variables denoted by T_1, …, T_m and such that T_l ~ ZTP(µ_l), where g(µ_l) = β_0 + β_1 x_{l1} + … + β_p x_{lp} for l = 1, …, m. Hence, β̂ = argmax_{β ∈ R^{p+1}} ℓ_2(β). (A fitting sketch follows below.)
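This decomposition is exactly what the horseshoe-crab analysis later in this chapter exploits; as a sketch, the two parts can be fitted separately with standard tools (variable names taken from that example):

# (M1) Bernoulli model for the zero indicator
m1 <- glm(ifelse(Sat == 0, 1, 0) ~ Width,
          family = binomial(logit), data = hcrabs)
# (M2) zero-truncated Poisson for the strictly positive counts
m2 <- overglm(Sat ~ log(Wt) + Col, family = "ztpoi(log)",
              data = hcrabs, subset = {Sat > 0})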
395 / 460
Zero-Modified Poisson model
When n is "large", we have the following:

θ̂ ∼̇ Normal_{p+q+2}(θ, [I(θ)]^{−1}),

where ∼̇ means "approximately distributed as" and I(θ) is the expected Fisher information matrix, which may be expressed as I(θ) = E[−∇²ℓ(θ)], in which ℓ(θ) = log[L(θ)] is the log-likelihood function and ∇²ℓ(θ) is its Hessian matrix. As a consequence, θ̂ is an approximately unbiased and efficient estimator for "large" n.

As L(θ) = L_1(γ)L_2(β), ℓ(θ) = ℓ_1(γ) + ℓ_2(β), where ℓ_1(γ) = log[L_1(γ)] and ℓ_2(β) = log[L_2(β)]. Therefore,

[I(θ)]^{−1} = [ [I_1(γ)]^{−1}      0         ]
              [      0        [I_2(β)]^{−1} ],

where I_1(γ) = E[−∇²ℓ_1(γ)] and I_2(β) = E[−∇²ℓ_2(β)]. Consequently, we have that for "large" n: (i) γ̂ ∼̇ Normal_{q+1}(γ, [I_1(γ)]^{−1}); (ii) β̂ ∼̇ Normal_{p+1}(β, [I_2(β)]^{−1}); and (iii) γ̂ and β̂ are approximately statistically independent.
396 / 460
Zero-Modified Poisson model
Moreover, note that ℓ(θ̂) = ℓ_1(γ̂) + ℓ_2(β̂). Therefore,

AIC = −2ℓ(θ̂) + 2(p + q + 2) = [−2ℓ_1(γ̂) + 2(q + 1)] + [−2ℓ_2(β̂) + 2(p + 1)] = AIC_1 + AIC_2,

where AIC_1 and AIC_2 are the values of AIC for M_1 and M_2, respectively. Thus, searching for the model with the lowest AIC value is equivalent to searching for the model with the lowest AIC_1 value and, simultaneously, searching for the model with the lowest AIC_2 value.

On the other hand,

BIC = −2ℓ(θ̂) + log(n)(p + q + 2) ≠ [−2ℓ_1(γ̂) + log(n)(q + 1)] + [−2ℓ_2(β̂) + log(m)(p + 1)] = BIC_1 + BIC_2,

where BIC_1 and BIC_2 are the values of BIC for M_1 and M_2, respectively. (A numeric check of the AIC identity follows below.)
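With m1 and m2 fitted as in the earlier two-part sketch, the AIC identity can be checked directly; note that BIC does not add up, since BIC_2 penalizes with log(m) rather than log(n):

# The joint ZMP AIC equals the sum of the two parts' AICs
AIC(m1) + AIC(m2)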
397 / 460
Zero-Inflated Negative Binomial distribution

If Y | ν = 0 when ν = 1 and Y | ν ~ Negative Binomial(µ, φ, τ) when ν = 0, with ν ~ Bernoulli(π), then the distribution of Y is the Zero-Inflated Negative Binomial, denoted here by ZINB(µ, φ, τ, π). Writing a = (φµ^τ)^{−1}, the probability mass function of Y may be written as follows:

f_Y(y; µ, φ, τ, π) = π + (1 − π)[a/(µ + a)]^a if y = 0, and

f_Y(y; µ, φ, τ, π) = (1 − π) [Γ(y + a)/(Γ(a)Γ(y + 1))] [µ/(µ + a)]^y [a/(µ + a)]^a if y = 1, 2, …

Note that Pr(Y = 0) is a strictly increasing function of π for fixed µ, φ and τ. Indeed, Pr(Y = 0) → Pr(Ỹ = 0) as π → 0, and Pr(Y = 0) → 1 as π → 1, where Ỹ ~ Negative Binomial(µ, φ, τ). Hence, the ZINB(µ, φ, τ, π) distribution is an alternative to the negative binomial distribution in the presence of excess of zeros.
398 / 460
Zero-Inflated Negative Binomial distribution

If r > 0 is such that E(Ỹ^r) exists, where Ỹ ~ Negative Binomial(µ, φ, τ), then

E(Y^r) = Σ_{y=0}^{∞} f_Y(y; µ, φ, τ, π) y^r = Σ_{y=1}^{∞} f_Y(y; µ, φ, τ, π) y^r = (1 − π) Σ_{y=1}^{∞} f_Ỹ(y; µ, φ, τ) y^r = (1 − π) Σ_{y=0}^{∞} f_Ỹ(y; µ, φ, τ) y^r = (1 − π) E(Ỹ^r).

Hence,

E(Y) = (1 − π)µ and Var(Y) = (1 − π)µ(1 + φµ^{τ+1}) + π(1 − π)µ².

Furthermore,

F_Y(y; µ, φ, τ, π) = π I(y ∈ [0, ∞)) + (1 − π) F_Ỹ(y; µ, φ, τ).
399 / 460
Zero-Inflated Negative Binomial model

Y_i ~ind ZINB(µ_i, φ, τ, π_i),
g(µ_i) = β_0 + β_1 x_{i1} + … + β_p x_{ip},
h(π_i) = γ_0 + γ_1 z_{i1} + … + γ_q z_{iq}

Links for µ_i:
Logarithmic link: g(µ_i) = log(µ_i)
Identity link: g(µ_i) = µ_i
Square-root link: g(µ_i) = √µ_i

Links for π_i:
Logit link: h(π_i) = log[π_i/(1 − π_i)]
Probit link: h(π_i) = Φ^{−1}(π_i)
Complementary log-log link: h(π_i) = log[−log(1 − π_i)]
Cauchy link: h(π_i) = tan[(π/2)(2π_i − 1)]
400 / 460
Zero-Modified Negative Binomial distribution

If Y | ν = 0 when ν = 1 and Y | ν ~ ZTNB(µ, φ, τ) when ν = 0, with ν ~ Bernoulli(π), then the distribution of Y is the Zero-Modified Negative Binomial, denoted here by ZMNB(µ, φ, τ, π), where ZTNB represents the Zero-Truncated Negative Binomial distribution. Writing a = (φµ^τ)^{−1}, the probability mass function of Y reduces to

f_Y(y; µ, φ, τ, π) = π if y = 0, and

f_Y(y; µ, φ, τ, π) = κ(µ, φ, τ, π) [Γ(y + a)/(Γ(a)Γ(y + 1))] [µ/(µ + a)]^y [a/(µ + a)]^a if y = 1, 2, …,

where κ(µ, φ, τ, π) = (1 − π)/[1 − f_Ỹ(0; µ, φ, τ)] and Ỹ ~ Negative Binomial(µ, φ, τ). Note that:

(1) If π < f_Ỹ(0; µ, φ, τ) (i.e., if κ(µ, φ, τ, π) > 1), then the distribution of Y is an alternative to the negative binomial distribution in the presence of lack of zeros,
401 / 460
Zero-Modified Negative Binomial distribution

(2) If π = f_Ỹ(0; µ, φ, τ) (i.e., if κ(µ, φ, τ, π) = 1), then the distribution of Y reduces to the negative binomial distribution,

(3) If π > f_Ỹ(0; µ, φ, τ) (i.e., if κ(µ, φ, τ, π) < 1), then the distribution of Y is an alternative to the negative binomial distribution in the presence of excess of zeros.

If r > 0 and E(Ỹ^r) exists, then E(Y^r) = κ(µ, φ, τ, π)E(Ỹ^r). Hence,

E(Y) = κ(µ, φ, τ, π)µ, E(Y²) = κ(µ, φ, τ, π)µ(1 + φµ^{τ+1} + µ),

and

Var(Y) = κ(µ, φ, τ, π)µ(1 + φµ^{τ+1}) + κ(µ, φ, τ, π)[1 − κ(µ, φ, τ, π)]µ².

Moreover,

F_Y(y; µ, φ, τ, π) = π I(y ∈ [0, ∞)) + κ(µ, φ, τ, π)[F_Ỹ(y; µ, φ, τ) − f_Ỹ(0; µ, φ, τ)].
402 / 460
Zero-Modified Negative Binomial model

Y_i ~ind ZMNB(µ_i, φ, τ, π_i),
g(µ_i) = β_0 + β_1 x_{i1} + … + β_p x_{ip},
h(π_i) = γ_0 + γ_1 z_{i1} + … + γ_q z_{iq}

Links for µ_i:
Logarithmic link: g(µ_i) = log(µ_i)
Identity link: g(µ_i) = µ_i
Square-root link: g(µ_i) = √µ_i

Links for π_i:
Logit link: h(π_i) = log[π_i/(1 − π_i)]
Probit link: h(π_i) = Φ^{−1}(π_i)
Complementary log-log link: h(π_i) = log[−log(1 − π_i)]
Cauchy link: h(π_i) = tan[(π/2)(2π_i − 1)]
403 / 460
Zero-Modified Negative Binomial model

The likelihood function of θ = (γ⊤, β⊤, φ, τ)⊤ becomes

L(θ) = Π_{i=1}^{n} f_Y(y_i; µ_i(β), φ, τ, π_i(γ))

= Π_{i=1}^{n} π_i^{I(y_i = 0)} { [(1 − π_i)/(1 − f_Ỹ(0; µ_i, φ, τ))] f_Ỹ(y_i; µ_i, φ, τ) }^{I(y_i ∈ {1,2,…})}

= Π_{i=1}^{n} π_i^{I(y_i = 0)} (1 − π_i)^{I(y_i ∈ {1,2,…})} { f_Ỹ(y_i; µ_i, φ, τ)/[1 − f_Ỹ(0; µ_i, φ, τ)] }^{I(y_i ∈ {1,2,…})}

= L_1(γ) L_2(β, φ, τ),

where L_1(γ) = Π_{i=1}^{n} π_i^{I(y_i = 0)}(1 − π_i)^{I(y_i ∈ {1,2,…})} is positive and does not depend on (β, φ, τ), and L_2(β, φ, τ) = Π_{i=1}^{n} { f_Ỹ(y_i; µ_i, φ, τ)/[1 − f_Ỹ(0; µ_i, φ, τ)] }^{I(y_i ∈ {1,2,…})} is positive and does not depend on γ.
404 / 460
Zero-Modified Negative Binomial model

Thus, maximizing L(θ) with respect to θ is equivalent to maximizing L_1(γ) with respect to γ and, in parallel, maximizing L_2(β, φ, τ) with respect to β, φ and τ. Therefore, fitting the ZMNB model to the data is equivalent to dividing the problem into two parts:

(M_1) Classify the observed responses into zero and non-zero. Fit a model that assumes that r_1, …, r_n are realizations of independent random variables denoted by R_1, …, R_n and such that R_i ~ Bernoulli(π_i), where r_i = I(y_i = 0) and h(π_i) = γ_0 + γ_1 z_{i1} + … + γ_q z_{iq} for i = 1, …, n. Hence, γ̂ = argmax_{γ ∈ R^{q+1}} ℓ_1(γ).

(M_2) Only consider responses that are greater than zero. Select the m values of y that are greater than zero and fit to them a model that assumes that they are realizations of independent random variables denoted by T_1, …, T_m and such that T_l ~ ZTNB(µ_l, φ, τ), where g(µ_l) = β_0 + β_1 x_{l1} + … + β_p x_{lp} for l = 1, …, m. Hence, (β̂, φ̂, τ̂) = argmax ℓ_2(β, φ, τ) for β ∈ R^{p+1}, φ ∈ R₊ and τ ∈ R.
405 / 460
Zero-Modified Negative Binomial model

When n is "large", we have the following:

θ̂ ∼̇ Normal_{p+q+4}(θ, [I(θ)]^{−1}),

where ∼̇ means "approximately distributed as" and I(θ) is the expected Fisher information matrix, which may be expressed as I(θ) = E[−∇²ℓ(θ)], in which ℓ(θ) = log[L(θ)] is the log-likelihood function and ∇²ℓ(θ) is its Hessian matrix. As a consequence, θ̂ is an approximately unbiased and efficient estimator for "large" n.

As L(θ) = L_1(γ)L_2(β, φ, τ), ℓ(θ) = ℓ_1(γ) + ℓ_2(β, φ, τ), where ℓ_1(γ) = log[L_1(γ)] and ℓ_2(β, φ, τ) = log[L_2(β, φ, τ)]. Therefore,

[I(θ)]^{−1} = [ [I_1(γ)]^{−1}           0              ]
              [      0        [I_2(β, φ, τ)]^{−1} ],

where I_1(γ) = E[−∇²ℓ_1(γ)] and I_2(β, φ, τ) = E[−∇²ℓ_2(β, φ, τ)]. Thus, we have that for "large" n: (i) γ̂ ∼̇ Normal_{q+1}(γ, [I_1(γ)]^{−1}); (ii) (β̂⊤, φ̂, τ̂)⊤ ∼̇ Normal_{p+3}((β⊤, φ, τ)⊤, [I_2(β, φ, τ)]^{−1}); and (iii) γ̂ and (β̂⊤, φ̂, τ̂)⊤ are approximately statistically independent.
406 / 460
Zero-Modified Negative Binomial model

Moreover, note that ℓ(θ̂) = ℓ_1(γ̂) + ℓ_2(β̂, φ̂, τ̂). Therefore,

AIC = −2ℓ(θ̂) + 2(p + q + 4) = [−2ℓ_1(γ̂) + 2(q + 1)] + [−2ℓ_2(β̂, φ̂, τ̂) + 2(p + 3)] = AIC_1 + AIC_2,

where AIC_1 and AIC_2 are the values of AIC for M_1 and M_2, respectively. Thus, searching for the model with the lowest AIC value is equivalent to searching for the model with the lowest AIC_1 value and, simultaneously, searching for the model with the lowest AIC_2 value.

On the other hand,

BIC = −2ℓ(θ̂) + log(n)(p + q + 4) ≠ [−2ℓ_1(γ̂) + log(n)(q + 1)] + [−2ℓ_2(β̂, φ̂, τ̂) + log(m)(p + 3)] = BIC_1 + BIC_2,

where BIC_1 and BIC_2 are the values of BIC for M_1 and M_2, respectively.
407 / 460
Males attached to female horseshoe crabs
> fit1 <- overglm(Sat ~ log(Wt) + Col + Spine + Width,
+ family="ztpoi(log)", data=hcrabs, subset={Sat>0})
> fit2 <- update(fit1, family="ztnb1(log)")
> fit3 <- update(fit1, family="ztnb2(log)")
> fit4 <- update(fit1, family="ztnbf(log)")
>
> AIC(fit1, fit2, fit3, fit4)
df AIC
fit1 8 525.5947
fit2 9 514.6772
fit3 9 512.4328
fit4 10 512.6216
> BIC(fit1, fit2, fit3, fit4)
df BIC
fit1 8 547.2710
fit2 9 539.0630
fit3 9 536.8185
fit4 10 539.7169
>
> fit3a <- update(fit3, formula = Sat ~ log(Wt) + Col)
> anova(fit3a, fit3, test="lr")
408 / 460
Males attached to female horseshoe crabs
Likelihood-ratio test
Model 1 : Sat ~ log(Wt) + Col
Model 2 : Sat ~ log(Wt) + Col + Spine + Width
Chi Df Pr(>Chi)
1 vs 2 5.9052 3 0.1163
>
> summary(fit3a)
Estimate Std.Error z-value Pr(>|z|)
(Intercept) -3.91458 2.29715 -1.70410 0.0883619
log(Wt) 0.75194 0.29754 2.52716 0.0114988
ColDM -0.74218 0.25754 -2.88187 0.0039532
ColLM -0.31739 0.28417 -1.11692 0.2640306
ColM -0.54407 0.22370 -2.43210 0.0150114
> fit5 <- glm(ifelse(Sat==0,1,0) ~ log(Wt) + Col + Spine + Width,
+ family=binomial(logit), data=hcrabs)
> fit5a <- update(fit5, formula=ifelse(Sat==0,1,0) ~ Width)
> anova(fit5a, fit5, test="lr")
Likelihood-ratio test
Model 1 : ifelse(Sat==0,1,0) ~ Width
Model 2 : ifelse(Sat==0,1,0) ~ log(Wt) + Col + Spine + Width
409 / 460
Males attached to female horseshoe crabs
Chi Df Pr(>Chi)
1 vs 2 9.2299 6 0.1611
>
> summary(fit5a)
Estimate Std. Error z value Pr(>|z|)
(Intercept) 12.3508 2.6287 4.698 2.62e-06
Width -0.4972 0.1017 -4.887 1.02e-06
>
> fit6 <- zeroalt(Sat ~ log(Wt) + Col | Width, family="nb2(log)",
+ zero.link="logit", data=hcrabs)
> summary(fit6)
Count model (Negative Binomial type II with log link):
Estimate Std.Error z-value Pr(>|z|)
(Intercept) -3.91458 2.29715 -1.70410 0.0883619
log(Wt) 0.75194 0.29754 2.52716 0.0114988
ColDM -0.74218 0.25754 -2.88187 0.0039532
ColLM -0.31739 0.28417 -1.11692 0.2640299
ColM -0.54407 0.22370 -2.43210 0.0150114
Zero-alteration model (Bernoulli with logit link):
(Intercept) 12.35082 2.62873 4.69840 2.6221e-06
Width -0.49723 0.10174 -4.88746 1.0215e-06
410 / 460
Exercises
Suppose that Y is a discrete random variable whose possible values are all non-negative integers. The cumulative distribution function of Y is denoted by F_Y(y).

(1) Suppose that Ỹ is the Zero-Truncated version of Y, that is, Pr(Ỹ = 0) = 0 and Pr(Ỹ = y) = Pr(Y = y)/[1 − Pr(Y = 0)] for y = 1, 2, … Show that:

(A) E(Ỹ^r) = κE(Y^r) for all r > 0 such that E(Y^r) exists, where κ = 1/[1 − Pr(Y = 0)]. Therefore, if Var(Y) exists, then Var(Ỹ) = κVar(Y) + κ(1 − κ)E²(Y).
(B) F_Ỹ(y) = Pr(Ỹ ≤ y) = [κF_Y(y) + (1 − κ)] I(y ∈ [0, ∞)).

(2) Suppose that Ỹ is the Zero-Inflated version of Y, that is, Pr(Ỹ = 0) = π + (1 − π)Pr(Y = 0) and Pr(Ỹ = y) = (1 − π)Pr(Y = y) for y = 1, 2, … Show that:

(A) E(Ỹ^r) = κE(Y^r) for all r > 0 such that E(Y^r) exists, where κ = 1 − π. Therefore, if Var(Y) exists, then Var(Ỹ) = κVar(Y) + κ(1 − κ)E²(Y).
(B) F_Ỹ(y) = Pr(Ỹ ≤ y) = [κF_Y(y) + (1 − κ)] I(y ∈ [0, ∞)).
411 / 460
Exercises
(3) Suppose that Ỹ is the Zero-Modified version of Y, that is, Pr(Ỹ = 0) = π and Pr(Ỹ = y) = (1 − π)Pr(Y = y)/[1 − Pr(Y = 0)] for y = 1, 2, … Show that:

(A) E(Ỹ^r) = κE(Y^r) for all r > 0 such that E(Y^r) exists, where κ = (1 − π)/[1 − Pr(Y = 0)]. Therefore, if Var(Y) exists, then Var(Ỹ) = κVar(Y) + κ(1 − κ)E²(Y).
(B) F_Ỹ(y) = Pr(Ỹ ≤ y) = [κF_Y(y) + (1 − κ)] I(y ∈ [0, ∞)).
Cluster Correlated Data
413 / 460
Effect of the ozone on the tree growth

The analysis of these data, previously analyzed in Diggle et al. [2002] and available in the object spruces of the library glmtoolbox, aims to assess the effect of ozone pollution on tree growth. Ozone pollution is common in urban areas, thus the impact of increased ozone concentrations on tree growth is of considerable interest. The response variable is the size of the tree (size), where size is conventionally measured by the product of tree height and stem diameter squared. This experiment examined 79 trees identified in the dataset by the column tree. A first group of 54 trees was grown in an ozone-enriched atmosphere (treat="ozone-enriched"), while a second group of 25 trees was grown in a normal atmosphere (treat="normal"). The size of each tree was measured 13 times over the course of the experiment (days), so the data are balanced and the number of rows in the dataset is 1027. The main objective of the analysis is to compare the growth patterns of trees under the two types of atmosphere: normal and ozone-enriched.
Effect of the ozone on the tree growth

[Figure: boxplots of tree size at each measurement day (152 to 674 days since the beginning of the experiment), by atmosphere (normal vs. ozone-enriched)]
415 / 460
Treatment of severe postnatal depression
This dataset, extracted from Gregoire et al. [1996] and available in the library glmtoolbox as the object depression, is from a placebo-controlled study on the efficacy of oestrogen for the treatment of severe postnatal depression. There were 61 women (subj) with severe depression randomly assigned to one of two treatment groups: group="placebo", which received a placebo patch, or group="oestrogen", which received an oestrogen patch. The Edinburgh Postnatal Depression Scale (EPDS) was administered to all women prior to the therapy, where higher scores indicate greater depression levels. Once the treatment began, EPDS scores (dep) were collected monthly for 6 months (visit). The response (depressd) is 1 for severe depression (EPDS ≥ 11) and 0 in other cases.
416 / 460
Treatment of severe postnatal depression
[Figure: % of women with depression versus months since the therapy began (−1 to 6), by treatment (placebo vs. oestrogen)]
417 / 460
Generalized Least Squares (GLS)
Let y_i = (y_{i1}, …, y_{ij}, …, y_{i n_i})⊤ for i = 1, …, n be the multivariate responses of interest measured on n clusters, which are assumed to be realizations of independent random vectors denoted here by Y_i = (Y_{i1}, …, Y_{ij}, …, Y_{i n_i})⊤ for i = 1, …, n, where n_i represents the size of the i-th cluster or the number of measurements performed on the i-th subject. The random variables Y_{ij}, for j = 1, …, n_i, are assumed to satisfy the following:

µ_i = E(Y_i) = X_i β and Var(Y_i) = Σ_i,

where β = (β_0, β_1, …, β_p)⊤ is the set of regression coefficients, whereas X_i is a (known) n_i × (p + 1) matrix and Σ_i is a (known) n_i × n_i symmetric positive-definite matrix for i = 1, …, n. The estimator of β by the GLS method may be written as follows:

β̂ = argmin_{β ∈ R^{p+1}} Q(β),

where

Q(β) = ½(y − Xβ)⊤Σ^{−1}(y − Xβ) = ½ Σ_{i=1}^{n} (y_i − X_i β)⊤ Σ_i^{−1} (y_i − X_i β),
418 / 460
Generalized Least Squares (GLS)
in which y = (y_1⊤, …, y_n⊤)⊤, X = (X_1⊤, …, X_n⊤)⊤ and Σ = diag{Σ_1, …, Σ_n}. The function Q(β) is strictly convex. Therefore, β̂ is the solution to the system of linear equations given by

Σ_{i=1}^{n} X_i⊤ Σ_i^{−1} (y_i − X_i β) = 0,

which reduces to

β̂ = ( Σ_{i=1}^{n} X_i⊤ Σ_i^{−1} X_i )^{−1} Σ_{i=1}^{n} X_i⊤ Σ_i^{−1} y_i.

How can the GLS method be modified to estimate β when the random vectors Y_i, for i = 1, …, n, are such that: (i) g(µ_i) = X_i β, where g(·) is a known link function; (ii) the variance of Y_{ij} depends on its mean through a variance function, that is, Var(Y_{ij}) = φV(µ_{ij})/ω_{ij}; and (iii) the correlation matrix associated with Y_i depends on an unknown parameter vector ρ? (A sketch of the closed-form GLS estimator follows below.)
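Before turning to GEEs, note that the closed-form GLS estimator above is straightforward to code directly; a minimal sketch for cluster data with known Σ_i (illustrative names only, not part of glmtoolbox):

# GLS: beta-hat = (sum_i Xi' Si^{-1} Xi)^{-1} (sum_i Xi' Si^{-1} yi)
gls_beta <- function(X.list, y.list, Sigma.list) {
  A <- 0; b <- 0
  for (i in seq_along(y.list)) {
    Si.inv <- solve(Sigma.list[[i]])
    A <- A + crossprod(X.list[[i]], Si.inv %*% X.list[[i]])
    b <- b + crossprod(X.list[[i]], Si.inv %*% y.list[[i]])
  }
  drop(solve(A, b))
}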
419 / 460
Generalized Estimating Equations (GEEs)
GEEs are a very flexible statistical tool for the analysis of het-
eroskedastic continuous and discrete cluster-correlated data. An
appealing feature of GEEs is that they do not require the full spec-
ification of the multivariate distribution of the response vector
measured in each cluster. Indeed, in GEEs, only the following
needs to be specified:
(1) A variance function describing the mechanism by which heteroskedasticity is assumed to occur (if any), i.e., the way in which the variance is assumed to depend on the mean.
(2) A regression structure that consists of a link function and
a linear predictor, which describes the way in which the
mean of the response variable distribution is assumed to be
dependent on some explanatory variables.
(3) A structure for the correlation matrix, which is aimed at
describing the dynamic of the linear association between the
dierent measurements of the response variable performed
on the same subject or cluster.
420 / 460
Generalized Estimating Equations
Let y_i = (y_{i1}, …, y_{ij}, …, y_{i n_i})⊤ for i = 1, …, n be the multivariate responses of interest measured on n clusters, which are assumed to be realizations of independent random vectors denoted here by Y_i = (Y_{i1}, …, Y_{ij}, …, Y_{i n_i})⊤ for i = 1, …, n, where n_i represents the size of the i-th cluster or the number of measurements performed on the i-th subject. The random variables Y_{ij}, for j = 1, …, n_i, are assumed to satisfy the following:

Var(Y_{ij}) = (φ/ω_{ij}) V(µ_{ij}) and Corr(Y_{ij}, Y_{ik}) = r_{jk}(ρ),

where µ_{ij} = E(Y_{ij}), φ > 0 is the dispersion parameter, ω_{ij} > 0 are known weights, V(µ) is the variance function, and r_{jk}(ρ) is the Pearson linear correlation coefficient, which depends on the parameter vector ρ. In addition, µ_{ij} is assumed to depend on x_{ij} = (1, x_{1ij}, …, x_{pij})⊤ in the following way:

g(µ_{ij}) = x_{ij}⊤ β,

where g(·) is the link function and β = (β_0, β_1, …, β_p)⊤ is the set of unknown regression coefficients.
421 / 460
Generalized Estimating Equations
According to Liang and Zeger [1986], the estimate of β, denoted here by β̂, reduces to the solution of U(β̂) = 0, where

U(β) = Σ_{i=1}^{n} X_i⊤K_i [φ A_i^{1/2} R_i A_i^{1/2}]^{−1} (y_i − µ_i) = (1/φ) Σ_{i=1}^{n} X_i⊤ W_i K_i^{−1} (y_i − µ_i),

in which X_i⊤K_i plays the role of (∂µ_i/∂β⊤)⊤ and φ A_i^{1/2} R_i A_i^{1/2} plays the role of Var(Y_i), with

X_i = (x_{i1}, …, x_{i n_i})⊤
W_i = K_i V_i^{−1} K_i
K_i = diag{1/g′(µ_{i1}), …, 1/g′(µ_{i n_i})}
V_i = A_i^{1/2} R_i A_i^{1/2}
A_i = diag{V(µ_{i1})/ω_{i1}, …, V(µ_{i n_i})/ω_{i n_i}}
The (j, k)-th entry of R_i is r_{jk}(ρ)
µ_i = (g^{−1}(x_{i1}⊤β), …, g^{−1}(x_{i n_i}⊤β))⊤
422 / 460
Estimating algorithm
Step 0: Start the counter at t = 0; set the tolerance limit ε > 0; set the maximum number of iterations, n_max; and set the initial value for β, say β^[0].

Step 1: Compute ρ^[t] from the Pearson residuals evaluated at β^[t], given by r_{ij}^[t] = (y_{ij} − µ_{ij}^[t]) / √(φ^[t] V(µ_{ij}^[t])/ω_{ij}).

Step 2: Compute β^[t+1] = β^[t] + [I(β^[t])]^{−1} U(β^[t]).

Step 3: Compute δ^(t+1) = δ(β^[t], β^[t+1]).

Step 4: Update the counter by t = t + 1.

Step 5: Repeat steps 1, 2, 3 and 4 until δ^(t) < ε or t > n_max.

Step 6: If δ^(t) < ε, then β̂ is defined to be β^[t]. Otherwise, convergence was not achieved.

Here,

I(β) = E(−∂U(β)/∂β⊤) = (1/φ) Σ_{i=1}^{n} X_i⊤ W_i X_i.

A simple choice for β^[0] is the estimate of β obtained from the GEE where the working-correlation matrix is the identity.
423 / 460
Asymptotic theory
If the model for the mean (µ) is correctly specified, then, under certain regularity conditions, β̂ is consistent for β and its distribution is such that (Liang and Zeger [1986]):

√n(β̂ − β) →_D N(0, Var(β̂)) as n → ∞,

where

Var(β̂) = lim_{n→∞} [(1/n)X⊤WX]^{−1} [(1/n) Σ_{i=1}^{n} X_i⊤W_iK_i^{−1}Var(Y_i)K_i^{−1}W_iX_i] [(1/n)X⊤WX]^{−1}.

Therefore, if the model for the mean is correctly specified, then β̂ remains consistent and asymptotically normal even if the structure of the correlation matrix is not correctly specified. Indeed, if the structure of the correlation matrix is also correctly specified, that is, if Var(Y_i) = φV_i for i = 1, …, n, then Var(β̂) reduces to

Var(β̂) = lim_{n→∞} φ[(1/n)X⊤WX]^{−1}.
424 / 460
Working-correlation matrix
Independence:
Corr(Y_{ij}, Y_{ik}) = 1 if j = k; 0 if j ≠ k.

Exchangeable:
Corr(Y_{ij}, Y_{ik}) = 1 if j = k; ρ if j ≠ k, and

ρ̂ = [1/(M − p − 1)] Σ_{i=1}^{n} Σ_{j<k} r_{ij} r_{ik}, where M = ½ Σ_{i=1}^{n} n_i(n_i − 1).

AR-M-dependent(m):
For example, if m = 1, then the correlations become
Corr(Y_{ij}, Y_{ik}) = 1 if j = k; ρ^{|j−k|} if j ≠ k, and

ρ̂ = [1/(M − p − 1)] Σ_{i=1}^{n} Σ_{j=1}^{n_i−1} r_{ij} r_{i,j+1}, where M = Σ_{i=1}^{n} (n_i − 1).
425 / 460
Working-correlation matrix
Stationary-M-dependent(m):
Corr(Y_{ij}, Y_{i,j+t}) = 1 if t = 0; ρ_t if t = 1, …, m; 0 if t > m, and

ρ̂_t = [1/(M_t − p − 1)] Σ_{i=1}^{n} Σ_{j=1}^{n_i−t} r_{ij} r_{i,j+t}, where M_t = Σ_{i=1}^{n} (n_i − t).

Non-Stationary-M-dependent(m):
Corr(Y_{ij}, Y_{ik}) = 1 if j = k; ρ_{jk} if 0 < |j − k| ≤ m; 0 if |j − k| > m, and

ρ̂_{jk} = [1/(n − p − 1)] Σ_{i=1}^{n} r_{ij} r_{ik}.

Unstructured:
Corr(Y_{ij}, Y_{ik}) = 1 if j = k; ρ_{jk} if j ≠ k, and

ρ̂_{jk} = [1/(n − p − 1)] Σ_{i=1}^{n} r_{ij} r_{ik}.
426 / 460
Variance Estimation
model:

V̂ar_M(β̂) = [I(β̂)]^{−1} = φ̂ (X⊤ŴX)^{−1},

where X = (X_1⊤, …, X_n⊤)⊤ and Ŵ = diag{Ŵ_1, …, Ŵ_n}.

robust (Liang and Zeger [1986]):

V̂ar_R(β̂) = (X⊤ŴX)^{−1} [ Σ_{i=1}^{n} X_i⊤ Ŵ_i K̂_i^{−1} e_i e_i⊤ K̂_i^{−1} Ŵ_i X_i ] (X⊤ŴX)^{−1},

where e_i = y_i − µ̂_i. This estimator is robust to misspecification of the working-correlation matrix, in the sense that it is a consistent estimator of the true variance-covariance matrix of β̂ as long as the model for the marginal mean is correctly specified.

df-adjusted:

V̂ar_A(β̂) = [n/(n − p − 1)] V̂ar_R(β̂)

(A sketch comparing these estimators in practice follows below.)
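In practice the three estimators can be compared side by side for a fitted glmgee object; a hedged sketch (it assumes the vcov() method for glmgee objects accepts a type argument with these labels; check ?vcov.glmgee):

# Standard errors under the model-based, robust and df-adjusted estimators
# for a previously fitted glmgee object called 'fit' (hypothetical name)
se <- function(type) sqrt(diag(vcov(fit, type = type)))
cbind(model = se("model"), robust = se("robust"),
      df.adjusted = se("df-adjusted"))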
427 / 460
Criteria to select the Working-Correlation Matrix

Correlation Information Criterion (Hin and Wang [2009]):

CIC = trace{ φ̂^{−1} (X⊤K̂Â^{−1}K̂X) V̂ar_R(β̂) }

Akaike-type penalized Gaussian pseudo-likelihood criterion (Carey and Wang [2011], Zhu and Zhu [2013], Fu et al. [2018]):

AGPC = Gaussian pseudo-likelihood + 2[p + 1 + dim(ρ̂)],

where the Gaussian pseudo-likelihood is given by

Σ_{i=1}^{n} { n_i log(2π) + (1/φ̂)(y_i − µ̂_i)⊤V̂_i^{−1}(y_i − µ̂_i) + log(φ̂|V̂_i|) }

Schwarz-type penalized Gaussian pseudo-likelihood criterion (Carey and Wang [2011], Zhu and Zhu [2013], Fu et al. [2018]):

SGPC = Gaussian pseudo-likelihood + log(n)[p + 1 + dim(ρ̂)]
428 / 460
Hypothesis testing
To assess the hypothesis system H_0: β* = 0 versus H_1: β* ≠ 0, where the elements of β* are a subset of those of β, so that β* may be written as β* = Lβ, in which L is an r × (p + 1) contrast matrix, the following decision rule may be used:

"Reject H_0 at the approximate 100(α)% significance level if ξ > χ²_{1−α}(r)",

where α ∈ (0, 1), χ²_{1−α}(r) is the 100(1 − α)-th percentile of the χ²(r) distribution, and ξ is one of the following statistics:

wald:

ξ_W = (Lβ̂)⊤ [L V̂ar_R(β̂) L⊤]^{−1} (Lβ̂)

generalized score (Rotnitzky and Jewell [1990], Boos [1992]):

ξ_S = [L V̂ar_M(β̂) U(β)]⊤ [L V̂ar_R(β̂) L⊤]^{−1} [L V̂ar_M(β̂) U(β)],

evaluated at the parameter estimates obtained under the restriction given by H_0, that is, restricted to β* = 0.
429 / 460
Residuals
pearson:

r^P_{ij} = (y_{ij} − µ̂_{ij}) / √(φ̂ V(µ̂_{ij})/ω_{ij}),   i = 1, …, n; j = 1, …, n_i

deviance:

r^D_{ij} = sign(y_{ij} − µ̂_{ij}) √(d(y_{ij}, µ̂_{ij}, ω_{ij})/φ̂),   i = 1, …, n; j = 1, …, n_i

mahalanobis:

r^M_i = (y_i − µ̂_i)⊤ V̂_i^{−1} (y_i − µ̂_i) / (n_i φ̂),   i = 1, …, n
430 / 460
Influence or sensitivity
The "leave-one-out" statistics are computed to quantify the effect of deleting each subject/cluster on the estimates of the parameters in the linear predictor. Thus, if the i-th cluster is excluded, then these statistics may be expressed as

Dfbeta_(i) = β̂ − β̂_(i) and CD_(i) = [1/(p + 1)] (β̂ − β̂_(i))⊤ [V̂ar(β̂)]^{−1} (β̂ − β̂_(i)),

respectively, where β̂_(i) is the estimate of β computed from the dataset in which the i-th cluster is excluded. To avoid a possible computational burden, the "leave-one-out" estimates of β are replaced by their "one-step approximations", denoted here by β̂^I_(i) for i = 1, …, n. The value of β̂^I_(i) corresponds to the first iteration of the estimating algorithm of the GEE when it is performed using: (i) the dataset in which the i-th cluster is excluded; and (ii) a starting value given by the solution to the same GEE computed from the dataset including all clusters, that is, β̂. (A usage sketch follows below.)
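A hedged sketch of how these diagnostics are typically inspected for a glmgee fit (cluster-level dfbeta() and cooks.distance() methods for glmgee objects are assumed; argument names may differ, so check the glmtoolbox documentation):

# Cluster-level Cook's distances built from the one-step approximations,
# for a previously fitted glmgee object called 'fit' (hypothetical name)
cd <- cooks.distance(fit)
plot(cd, type = "h", xlab = "Cluster", ylab = "Cook's distance")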
431 / 460
Effect of the ozone on the tree growth

The data are analyzed by using the following model:

Var(Y_{ij}) = φµ²_{ij},   Corr(Y_{ij}, Y_{ik}) = ρ^{|j−k|}   and

log(µ_{ij}) = β_0 + β_1 x_{1ij} + β_2 x²_{1ij} + β_3 x³_{1ij} + β_4 x⁴_{1ij} + β_5 x_{2ij},

where

y_{ij} = j-th measurement of size performed on the i-th tree.
x_{1ij} = time, in days, since the experiment began until the j-th measurement on the i-th tree was performed.
x_{2ij} = 1 if the i-th tree grew under the ozone-enriched atmosphere; 0 if the i-th tree grew under the normal atmosphere.
432 / 460
Effect of the ozone on the tree growth
> ggplot(spruces, aes(x=factor(days), y=size, fill=treat)) +
+ geom_boxplot(outlier.shape=16, outlier.size=1) +
+ labs(x="Days since the beginning of the experiment",
+ y="Size", fill="Atmosphere") +
+ scale_fill_manual(values=c("salmon2", "green2")) +
+ theme(legend.position="top", legend.direction="horizontal")
>
> res <- function(x,y)
+ data.frame(type=y,fitted(x),residuals(x,type="pearson"))
> m1 <- glmgee(size ~ poly(days,4)+treat, id=tree, data=spruces,
+ corstr="Independence", family=inverse.gaussian(log))
> r1 <- res(m1,"V(mu)==mu^3")
> m2 <- update(m1,family=poisson(log));r2 <- res(m2,"V(mu)==mu")
> m3 <- update(m1,family=Gamma(log));r3 <- res(m3,"V(mu)==mu^2")
> m4 <- update(m1,family=gaussian(log));r4 <- res(m4,"V(mu)==1")
> rs <- rbind(r1,r2,r3,r4);colnames(rs) <- c("type","mu","res")
>
> ggplot(rs, aes(x=mu, y=res)) +
+ geom_point(shape=16, col="black") +
+ labs(x="Fitted values", y="Pearson’s residuals") +
+ facet_wrap(vars(type), labeller=label_parsed)
433 / 460
Effect of the ozone on the tree growth

[Figure: Pearson residuals versus fitted values under four variance functions: V(µ) = 1, V(µ) = µ, V(µ) = µ², V(µ) = µ³]
434 / 460
Effect of the ozone on the tree growth
> fit1 <- glmgee(size ~ poly(days,4) + treat, family=Gamma(log),
+ id=tree, corstr="Independence", data=spruces)
> fit2 <- update(fit1, corstr="Exchangeable")
> fit3 <- update(fit1, corstr="AR-M-dependent(1)")
> fit4 <- update(fit1, corstr="AR-M-dependent(2)")
> fit5 <- update(fit1, corstr="AR-M-dependent(3)")
> fit6 <- update(fit1, corstr="AR-M-dependent(4)")
>
> a <- AGPC(fit1, fit2, fit3, fit4, fit5, fit6, verbose=FALSE)
> b <- SGPC(fit1, fit2, fit3, fit4, fit5, fit6, verbose=FALSE)
> cbind(a, SGPC=b[,"SGPC"])
Object Correlation GPL Parameters AGPC SGPC
fit1 Independence 13527.34 6 13539.34 13553.55
fit2 Exchangeable 11675.19 7 11689.19 11705.78
fit3 AR-M-dependent(1) 10926.66 7 10940.66 10957.25
fit4 AR-M-dependent(2) 10964.71 8 10980.71 10999.67
fit5 AR-M-dependent(3) 10976.47 9 10994.47 11015.80
fit6 AR-M-dependent(4) 11011.43 10 11031.43 11055.12
435 / 460
Effect of the ozone on the tree growth
> summary(fit3)
Number of observations: 1027
Number of clusters: 79
Cluster size: 13
*************************************************************
Variance function: Gamma
Link function: log
Correlation structure: AR-1
*************************************************************
Estimate Std.Error z-value Pr(>|z|)
(Intercept) 5.90378 0.10486 56.30321 < 2e-16
poly(days, 4)1 19.20015 0.51848 37.03159 < 2e-16
poly(days, 4)2 -2.85755 0.20585 -13.88147 < 2e-16
poly(days, 4)3 5.41639 0.18246 29.68549 < 2e-16
poly(days, 4)4 -3.57407 0.12478 -28.64405 < 2e-16
treatozone-enriched -0.25861 0.12835 -2.01486 0.043919
Dispersion 0.32866   (φ̂)
436 / 460
Effect of the ozone on the tree growth
> fit3a <- update(fit3, formula=. ~ . + poly(days,4):treat)
>
> anova(fit3a, test="wald")
Model 1 : size ~ 1
Model 2 : size ~ poly(days, 4)
Model 3 : size ~ poly(days, 4) + treat
Model 4 : size ~ poly(days, 4) + treat + poly(days, 4):treat
Chi df Pr(>Chi)
1 vs 2 1931.9813 4 < 2e-16 ***
2 vs 3 4.0597 1 0.04392 *
3 vs 4 3.6641 4 0.45336
>
> anova(fit3a, test="score")
Model 1 : size ~ 1
Model 2 : size ~ poly(days, 4)
Model 3 : size ~ poly(days, 4) + treat
Model 4 : size ~ poly(days, 4) + treat + poly(days, 4):treat
Chi df Pr(>Chi)
1 vs 2 61.3028 4 1.544e-12 ***
2 vs 3 3.3687 1 0.06645 .
3 vs 4 3.4665 4 0.48300
437 / 460
Treatment of severe postnatal depression
This dataset, extracted from Gregoire et al. [1996] and available in the library glmtoolbox as the object depression, is from a placebo-controlled study on the efficacy of oestrogen for the treatment of severe postnatal depression. There were 61 women (subj) with severe depression randomly assigned to one of two treatment groups: group="placebo", which received a placebo patch, or group="oestrogen", which received an oestrogen patch. The Edinburgh Postnatal Depression Scale (EPDS) was administered to all women prior to the therapy, where higher scores indicate greater depression levels. Once the treatment began, EPDS scores (dep) were collected monthly for 6 months (visit). The response (depressd) is 1 for severe depression (EPDS ≥ 11) and 0 in other cases.

Var(Y_{ij}) = µ_{ij}(1 − µ_{ij}),   Corr(Y_{ij}, Y_{ik}) = ρ^{|j−k|},   logit(µ_{ij}) = β_0 + β_1 x_{1ij} + β_2 x_{2ij},

where

y_{ij} = j-th measurement of the response performed on the i-th woman
x_{1ij} = time, in months, since the experiment began until the j-th measurement on the i-th woman was performed
x_{2ij} = 1 if the i-th woman was treated with the oestrogen patch; 0 if the i-th woman was treated with the placebo patch
Treatment of severe postnatal depression
[Figure: % of women with depression versus months since the therapy began (−1 to 6), by treatment (placebo vs. oestrogen)]
439 / 460
Treatment of severe postnatal depression
> d2 <- aggregate(depressd ~ visit + group,mean,data=depression)
> ggplot(d2,aes(x=factor(visit),fill=group,weight=100*depressd)) +
+ geom_bar(position=position_dodge(), width=0.7) +
+ labs(x="Months since the therapy began",
+ y="% of women with depression", fill="Treatment") +
+ scale_fill_manual(values=c("yellow", "blue")) +
+ theme(legend.position="top", legend.direction="horizontal")
>
> fit1 <- glmgee(depressd ~ visit + group,family=binomial(logit),
+ id=subj, corstr="Exchangeable", data=depression)
> fit2 <- update(fit1, corstr="AR-M-dependent(1)")
> fit3 <- update(fit1, corstr="AR-M-dependent(2)")
>
> a <- AGPC(fit1, fit2, fit3, verbose=FALSE)
> b <- SGPC(fit1, fit2, fit3, verbose=FALSE)
>
> cbind(a, SGPC=b[,"SGPC"])
Object Correlation GPL Parameters AGPC SGPC
fit1 Exchangeable 239.9647 4 247.9647 256.4082
fit2 AR-M-dependent(1) 226.4696 4 234.4696 242.9131
fit3 AR-M-dependent(2) 224.9438 5 234.9438 245.4982
440 / 460
Treatment of severe postnatal depression
> summary(fit2)
Number of observations: 356
Number of clusters: 61
Min 25% 50% 75% Max
Cluster sizes: 2 4 7 7 7
*************************************************************
Estimate Std.Error z-value Pr(>|z|)
(Intercept) 3.23604 0.51842 6.24218 4.3152e-10
visit -0.62632 0.07477 -8.37681 < 2.22e-16
groupestrogen -1.77723 0.54578 -3.25631 0.0011287
Dispersion 1.02842   (φ̂)
*************************************************************
Estimated working-correlation matrix R(ρ̂):
      [1]   [2]   [3]   [4]   [5]   [6]   [7]
[1] 1.000 0.513 0.263 0.135 0.069 0.036 0.018
[2] 0.513 1.000 0.513 0.263 0.135 0.069 0.036
[3] 0.263 0.513 1.000 0.513 0.263 0.135 0.069
[4] 0.135 0.263 0.513 1.000 0.513 0.263 0.135
[5] 0.069 0.135 0.263 0.513 1.000 0.513 0.263
[6] 0.036 0.069 0.135 0.263 0.513 1.000 0.513
[7] 0.018 0.036 0.069 0.135 0.263 0.513 1.000
441 / 460
Treatment of severe postnatal depression
> fit2a <- update(fit2, formula=. ~ . + visit:group)
>
> anova(fit2a, test="wald")
Model 1 : depressd ~ 1
Model 2 : depressd ~ visit
Model 3 : depressd ~ visit + group
Model 4 : depressd ~ visit + group + visit:group
Chi df Pr(>Chi)
1 vs 2 88.1275 1 < 2.2e-16 ***
2 vs 3 10.6036 1 0.001129 **
3 vs 4 2.2104 1 0.137082
>
> anova(fit2a, test="score")
Model 1 : depressd ~ 1
Model 2 : depressd ~ visit
Model 3 : depressd ~ visit + group
Model 4 : depressd ~ visit + group + visit:group
Chi df Pr(>Chi)
1 vs 2 39.9226 1 2.642e-10 ***
2 vs 3 10.9208 1 0.0009509 ***
3 vs 4 2.3977 1 0.1215150
442 / 460
Appendix
443 / 460
A. Linear algebra
(1) If A is an idempotent matrix, that is, if A² = A, then trace(A) = rank(A), where trace(A) and rank(A) correspond to the trace and rank of A, respectively (see, for instance, Gentle [2007, page 281]).

(2) If A is an m × n matrix and B is an n × m matrix, then trace(AB) = trace(BA) (see, for example, Gentle [2007, page 67]).

(3) If A is a nonsingular matrix, and u and v are column vectors, then (A − uv⊤)^{−1} = A^{−1} + (A^{−1}u)(v⊤A^{−1})/(1 − v⊤A^{−1}u) (see, for instance, Gentle [2007, page 220]). (A numeric check follows below.)

If A is a symmetric and positive definite matrix, then

(4) The square root of A, denoted here by A^{1/2} and such that A^{1/2}A^{1/2} = A, exists, is unique, is symmetric, and is also positive definite (see, for instance, Gentle [2007, section 3.8.8]).

(5) A^{−1} exists, is symmetric, and is also positive definite (see, for instance, Gentle [2007, section 3.8.8]).

(6) B⊤AB is also a positive definite matrix if and only if B has full column rank (that is, if the columns of B are linearly independent) (see, for instance, Gentle [2007, page 89]).
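Identity (3), the Sherman-Morrison formula, is easy to verify numerically in R:

# Numeric check of (A - u v')^{-1} = A^{-1} + (A^{-1}u)(v'A^{-1})/(1 - v'A^{-1}u)
set.seed(1)
A <- crossprod(matrix(rnorm(9), 3)) + diag(3)   # symmetric, nonsingular
u <- rnorm(3); v <- rnorm(3)
Ai  <- solve(A)
lhs <- solve(A - u %*% t(v))
rhs <- Ai + (Ai %*% u) %*% (t(v) %*% Ai) / drop(1 - t(v) %*% Ai %*% u)
max(abs(lhs - rhs))   # essentially zero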
444 / 460
A. Linear algebra
If A is a nonsingular and symmetric matrix given by

A = [ B   C  ]
    [ C⊤  D ],

where B is nonsingular, then

(7) The determinant of A is given by |A| = |B| |D − C⊤B^{−1}C| (see, for instance, Gentle [2007, section 3.4.2]).

(8) The inverse of A is given by (see, for instance, Gentle [2007, section 3.4.1])

A^{−1} = [ B^{−1} + B^{−1}CE^{−1}C⊤B^{−1}   −B^{−1}CE^{−1} ]
         [ −E^{−1}C⊤B^{−1}                  E^{−1}        ],

where E = D − C⊤B^{−1}C.
445 / 460
B. Derivatives
If x and b are q × 1 matrices and A is a q × q matrix, then (see, for instance, Gentle [2007, section 4.2])

(1) ∂h(x)/∂x = b provided that h(x) = b⊤x.
(2) ∂h(x)/∂x = (A + A⊤)x provided that h(x) = x⊤Ax.
(3) ∂h(x)/∂x⊤ = A provided that h(x) = Ax.
(4) ∂h(x)/∂x = −h(x)Ax provided that h(x) = exp(−½x⊤Ax) and A is a symmetric matrix.
(5) ∂h(A)/∂A = I_q provided that h(A) = trace(A).
(6) ∂h(A)/∂A = |A|A^{−⊤} provided that h(A) = |A| and A is a nonsingular matrix.
(7) ∂h(A)/∂A = A^{−⊤} provided that h(A) = log|A| and |A| > 0.
446 / 460
C. Taylor series
See, for example, Marsden and Tromba [2012, section 3.2].

(1) If h : D ⊆ R → R is a function with (r + 1) continuous derivatives at x_0 ∈ D, then

h(x) = Σ_{j=0}^{r} [(x − x_0)^j / j!] h^{(j)}(x_0) + R_r(x, x_0),

where h^{(j)}(x) is the j-th derivative of h(x) and R_r(x, x_0) is the remainder, which is given by

R_r(x, x_0) = ∫_{x_0}^{x} [(x − t)^r / r!] h^{(r+1)}(t) dt.

The remainder R_r(x, x_0) is such that R_r(x, x_0)/(x − x_0)^r → 0 as x → x_0.
447 / 460
C. Taylor series
(2) If h : D ⊆ R^q → R is a function with continuous second-order partial derivatives at x_0 ∈ D, then

h(x) = h(x_0) + (x − x_0)⊤ [∂h(x)/∂x |_{x=x_0}] + R_1(x, x_0),

where R_1(x, x_0)/‖x − x_0‖ → 0 as x → x_0.

(3) If h : D ⊆ R^q → R is a function with continuous third-order partial derivatives at x_0 ∈ D, then

h(x) = h(x_0) + (x − x_0)⊤ [∂h(x)/∂x |_{x=x_0}] + ½ (x − x_0)⊤ [∂²h(x)/∂x∂x⊤ |_{x=x_0}] (x − x_0) + R_2(x, x_0),

where R_2(x, x_0)/‖x − x_0‖² → 0 as x → x_0.
448 / 460
D. Probability
If Z ~ Normal(0, 1), V ~ χ²(m) and W ~ χ²(r), then (see, for instance, Searle [1971, chapter 2])

(1) Z² ~ χ²(1).
(2) Z/√(V/m) ~ Student-t(m) provided that Z and V are independent, where Student-t(m) represents a random variable following the Student-t distribution with m degrees of freedom.
(3) (V/m)/(W/r) ~ F(m, r) provided that V and W are independent, where F(m, r) represents a random variable following the F-Snedecor distribution with m and r degrees of freedom.
(4) (W/r)/(V/m) ~ F(r, m) provided that V and W are independent.
(5) [Student-t(m)]² ~ F(1, m).
449 / 460
E. Multivariate normal distribution
If Z ~ Normal_m(µ, Σ), where Σ is a positive definite matrix, then (see, for instance, Searle [1971, chapter 2])

(1) The probability density function of Z may be written as

f_Z(z; µ, Σ) = (2π)^{−m/2} |Σ|^{−1/2} exp[−½(z − µ)⊤Σ^{−1}(z − µ)], z ∈ R^m.

(2) The moment generating function of Z is

M(t; µ, Σ) = exp(t⊤µ + ½ t⊤Σt), t = (t_1, …, t_m)⊤.

(3) a + AZ ~ Normal_q(a + Aµ, AΣA⊤), where a is a q × 1 matrix and A is a q × m matrix of full row rank.

(4) Z⊤AZ ~ χ²(rank(A)) provided that AΣ is an idempotent symmetric matrix and µ⊤Aµ = 0.

(5) Z⊤AZ and BZ are statistically independent provided that BΣA = 0, where A is a symmetric matrix.
F. Nonlinear optimization
See, for example, Sun and Yuan [2006, section 1.4].
(1) If $h : D \subseteq \mathbb{R}^q \to \mathbb{R}$ is a twice continuously differentiable function, D is an open set, $\frac{\partial h(x)}{\partial x}\big|_{x = x^*} = 0$, and $\frac{\partial^2 h(x)}{\partial x\, \partial x^\top}\big|_{x = x^*}$ is a positive (negative) definite Hessian matrix, then $x^*$ is a local minimum (maximum) of the function $h(x)$ in D.
(2) If $h : D \subseteq \mathbb{R}^q \to \mathbb{R}$ is a twice differentiable function, D is an open set, and $\frac{\partial^2 h(x)}{\partial x\, \partial x^\top}$ is a positive (negative) definite Hessian matrix for all $x \in D$, then $h(x)$ is a strictly convex (concave) function in D.
(3) If $h : D \subseteq \mathbb{R}^q \to \mathbb{R}$ is a twice differentiable function, D is an open set, $\frac{\partial h(x)}{\partial x}\big|_{x = x^*} = 0$, and $h(x)$ is a strictly convex (concave) function, then $x^*$ is the unique global minimum (maximum) of $h(x)$ in D.
References
R.D. Cook and S. Weisberg. Residuals and Influence in Regression.
Chapman & Hall/CRC Monographs on Statistics & Applied
Probability. Taylor & Francis, 1982.
D.M. Allen. Mean square error of prediction as a criterion for
selecting variables. Technometrics, 13(3):469–475, 1971.
I.M. Alcantara, J. Naranjo, and Y. Lang. Model selection using PRESS statistic. Computational Statistics, 38(1):285–298, 2023.
H. Akaike. A new look at the statistical model identification.
IEEE Transactions on Automatic Control, 19(6):716–723, 1974.
G. Schwarz. Estimating the dimension of a model. The Annals of Statistics, 6(2):461–464, 1978.
G. James, D. Witten, T. Hastie, and R. Tibshirani. An Introduction to Statistical Learning: with Applications in R. Springer Texts in Statistics. Springer, 2013.
A.C. Atkinson. Two graphical displays for outlying and influential observations in regression. Biometrika, 68(1):13–20, 1981.
R.D. Cook. Detection of influential observation in linear regression. Technometrics, 19(1):15–18, 1977.
T.S. Breusch and A.R. Pagan. A simple test for heteroscedasticity and random coefficient variation. Econometrica, 47(5):1287–1294, 1979.
R.D. Cook and S. Weisberg. Diagnostics for heteroscedasticity in regression. Biometrika, 70(1):1–10, 1983.
J.E. Gentle. Matrix Algebra: Theory, Computations, and Applications in Statistics. Springer Texts in Statistics. Springer New York, 2007.
J. Fox and G. Monette. Generalized collinearity diagnostics. Journal of the American Statistical Association, 87(417):178–183, 1992.
G.E.P. Box and P.W. Tidwell. Transformation of the independent variables. Technometrics, 4(4):531–550, 1962.
H. Zhang and D. Zelterman. Binary regression for risks in excess of subject-specific thresholds. Biometrics, 55(4):1247–1251, 1999.
P.S. Hewlett and T.J. Plackett. Statistical aspects of the independent joint action of poisons, particularly insecticides. II. Examination of data for agreement with hypothesis. Annals of Applied Biology, 37:527–552, 1950.
D.W. Hosmer, S. Lemeshow, and R.X. Sturdivant. Applied Logistic
Regression. Wiley Series in Probability and Statistics. Wiley,
2013.
D.G. Kleinbaum, L.L. Kupper, A. Nizam, and E.S. Rosenberg.
Applied Regression Analysis and Other Multivariable Methods.
Cengage Learning, 2013.
J.I. McCool. Confidence limits for Weibull regression with censored data. IEEE Transactions on Reliability, 29(2):145–150, 1980.
B. Jorgensen. Exponential dispersion models. Journal of the Royal Statistical Society. Series B (Methodological), 49(2):127–162, 1987.
P. McCullagh and J.A. Nelder. Generalized Linear Models, Second
Edition. Chapman and Hall/CRC Monographs on Statistics
and Applied Probability Series. Chapman & Hall, 1989.
A. Agresti. Foundations of Linear and Generalized Linear Models.
Wiley Series in Probability and Statistics. Wiley, 2015.
L. Fahrmeir and H. Kaufmann. Consistency and asymptotic
normality of the maximum likelihood estimator in generalized
linear models. The Annals of Statistics, 13(1):342–368, 1985.
A. Buse. The likelihood ratio, Wald, and Lagrange multiplier tests: An expository note. The American Statistician, 36(3a):153–157, 1982.
G. Terrell. The gradient statistic. Computing Science and Statistics, 34:206–215, 2002.
J.D. Gibbons and S. Chakraborti. Nonparametric Statistical Inference (5th ed.). Chapman and Hall/CRC, 2010.
F. Habibzadeh, P. Habibzadeh, and M. Yadollahie. On determining the most appropriate test cut-off value: the case of tests with continuous results. Biochemia Medica, 26(3):297–307, 2016.
F.S. Nahm. Receiver operating characteristic curve: overview
and practical use for clinicians. Korean Journal of Anesthesiology,
75(1):25–36, 2022.
D.A. Pierce and D.W. Schafer. Residuals in generalized linear
models. Journal of the American Statistical Association, 81(396):
977–986, 1986.
A.C. Davison and A. Gigli. Deviance residuals and normal scores
plots. Biometrika, 76(2):211–221, 1989.
P.K. Dunn and G.K. Smyth. Randomized quantile residuals.
Journal of Computational and Graphical Statistics, 5(3):236–244,
1996.
D. Pregibon. Logistic regression diagnostics. The Annals of Statistics, 9(4):705–724, 1981.
B.-C. Wei, J.-Q. Shi, W.-K. Fung, and Y.-Q. Hu. Testing for varying
dispersion in exponential family nonlinear models. Annals of
the Institute of Statistical Mathematics, 50:277–294, 1998.
N.H. Prater. Estimate gasoline yields from crudes. Petroleum Refiner, 35(5):236–238, 1956.
S. Weisberg. Applied Linear Regression. Wiley Series in Probability
and Statistics. Wiley, 2005.
D.J. Hand, F. Daly, K. McConway, D. Lunn, and E. Ostrowski. A
Handbook of Small Data Sets. Chapman & Hall Statistics Texts.
Taylor & Francis, 1993.
J.S. Long. The origins of sex differences in science. Social Forces, 68(4):1297–1316, 1990.
W.W. Piegorsch, C.R. Weinberg, and B.H. Margolin. Exploring simple independent action in multifactor tables of proportions. Biometrics, 44(2):595–603, 1988.
J.G. Morel and N.K. Neerchal. Clustered binary logistic regression in teratology data using a finite mixture distribution. Statistics in Medicine, 16(24):2843–2853, 1997.
P.J. Diggle, P. Heagerty, K.-Y. Liang, and S.L. Zeger. Analysis of Longitudinal Data. Oxford University Press, 2002.
A.J.P. Gregoire, R. Kumar, B. Everitt, and J.W.W. Studd. Transdermal oestrogen for treatment of severe postnatal depression. The Lancet, 347:930–933, 1996.
K.Y. Liang and S.L. Zeger. Longitudinal data analysis using
generalized linear models. Biometrika, 73:13–22, 1986.
L.-Y. Hin and Y.-G. Wang. Working-correlation-structure identification in generalized estimating equations. Statistics in Medicine, 28(4):642–658, 2009.
V.J. Carey and Y.-G. Wang. Working covariance model selection
for generalized estimating equations. Statistics in Medicine, 30
(26):3117–3124, 2011.
X. Zhu and Z. Zhu. Comparison of criteria to select working correlation matrix in generalized estimating equations. Chinese Journal of Applied Probability and Statistics, 5:515–530, 2013.
L. Fu, Y. Hao, and Y.-G. Wang. Working correlation structure selection in generalized estimating equations. Computational Statistics, 33:983–996, 2018.
A. Rotnitzky and N.P. Jewell. Hypothesis testing of regression
parameters in semiparametric generalized linear models for
cluster correlated data. Biometrika, 77(3):485–497, 1990.
D.D. Boos. On generalized score tests. The American Statistician, 46(4):327–333, 1992.
J.E. Marsden and A.J. Tromba. Cálculo vectorial, sixth edition. W.H. Freeman and Company, 2012.
S.R. Searle. Linear Models. Wiley Classics Library. John Wiley & Sons, 1971.
W. Sun and Y.-X. Yuan. Optimization Theory and Methods. Nonlinear Programming. Springer Optimization and Its Applications. Springer, 2006.