Generalized Linear Models and some of their extensions using the library glmtoolbox

Luis Hernando Vanegas and Luz Marina Rondón
Universidad Nacional de Colombia, Sede Bogotá
Facultad de Ciencias, Departamento de Estadística
2024
“All models are wrong, but some are useful”
George E. P. Box, 1979
Outline

1 Normal Linear Models
    Motivation
        Richness of the species
        Gasoline yield
    Model Formulation
    Parameter Estimation
        Ordinary Least Squares Estimator
        Maximum Likelihood Estimator
    Goodness-of-fit
        R-squared and adjusted R-squared
        Predicted R-squared
        Generalized Information Criterion
    Statistical Inference
        Confidence intervals
        Hypothesis testing
    Examples
        Richness of the species
        Gasoline yield
    Exercises
    Variable selection
        Hybrid forward stepwise
        Hybrid backward stepwise
    Diagnostic methods
        Residual analysis
        Normal probability plot with simulated envelope
        Influence or Sensitivity analysis
        Leverage points
        Testing the constant variance assumption
        Box-Tidwell transformation
    Lognormal model
        Example 3: Gasoline yield
        Example 4: Lung capacity
2 Generalized Linear Models
    Motivation
        Liver cancer in female mice
        Insecticides
        Burn injury study
        Dilution assay
        Skin cancer in women
        Advertising
        Atlantic Croakers
        Hardened steel
        Gasoline yield
    Model formulation
        Linear exponential family
        Binomial distribution
        Poisson distribution
        Normal distribution
        Gamma distribution
        Inverse-Gaussian distribution
        Exercises
    Parameter estimation
        Maximum likelihood (ML)
        Deviance
        Newton-Raphson algorithm
        Fisher's Scoring algorithm
        Canonical link functions
    Goodness-of-fit
        R-squared and adjusted R-squared
        Generalized Information Criterion
    Statistical inference
        Asymptotic theory
        Hypothesis testing
        Confidence intervals
        Estimation of the dispersion parameter
    Parameter interpretation
        Binomial regression
        Poisson regression
        Normal regression
        Gamma regression
        Lognormal regression
    Genesis of link functions in binary response models
    Receiver Operating Characteristic Curve
    Variable selection
        Hybrid forward stepwise
        Hybrid backward stepwise
        Best subset selection
    Exercises
    Diagnostic methods
        Residual analysis
        Normal probability plot with simulated envelope
        Influence or sensitivity analysis
        Testing the constant dispersion parameter assumption
        Box-Tidwell transformation
    Quasi-likelihood models
3 Overdispersion in Poisson models
    Motivation
        Self-diagnosed ear infections in swimmers
        The Origins of Sex Differences in Science
    Negative Binomial I model
    Negative Binomial model
    Exercises
4 Overdispersion in binomial models
    Motivation
        Cell differentiation
        Teratogenic effects of PHT and TCPO
    Beta Binomial model
    Random-clumped Binomial model
    Exercises
5 Excess of zeros in count data
    Zero-Inflated Poisson model
    Zero-Modified Poisson model
    Zero-Inflated Negative Binomial model
    Zero-Modified Negative Binomial model
    Example: Males attached to female horseshoe crabs
    Exercises
6 Cluster Correlated Data
    Motivation
        Effect of the ozone on the tree growth
        Treatment of severe postnatal depression
    Generalized Estimating Equations
        Example 1: Effect of the ozone on the tree growth
7 Appendix
    A. Linear algebra
    B. Derivatives
    C. Taylor series
    D. Probability
    E. Multivariate normal distribution
    F. Nonlinear optimization
8 References
Normal Linear Models
Richness of the species

The analysis of these data, available in the object richness of the library glmtoolbox, aims to explain the differences between plots as related to species richness (represented by a count of the number of plant species (richness)) using the differences between plots with respect to biomass (Biomass) and soil pH level: low (pH="low"), medium (pH="mid") and high (pH="high").

[Figure: number of plant species versus biomass, by soil pH level (low, mid, high).]
Gasoline yield

The analysis of the data in the object Auto of the library ISLR is aimed at explaining the differences between the vehicles with respect to their gasoline yield (mpg), in miles per gallon, using other vehicle characteristics such as their horsepower (horsepower), their weight (weight), in kilograms, and their origin (origin: America, Europe or Japan).

[Figure: gasoline yield versus horsepower, with point size/shade indicating weight (low, mid, high).]
Normal Linear Model

Let $y_i$ for $i=1,\ldots,n$ be the interest variable measured on $n$ individuals, the values of which are assumed to be realizations of independent random variables denoted here by $Y_i$ for $i=1,\ldots,n$. The random variables $Y_i$ for $i=1,\ldots,n$ are assumed to satisfy the following condition:
$$Y_i = \mu_i + \epsilon_i,$$
where $\epsilon_i \overset{ind}{\sim} \text{Normal}(0,\sigma^2)$ are random errors and $\mu_i = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip} = x_i^\top\beta$, in which $x_i = (1, x_{i1},\ldots,x_{ip})^\top$ includes the (known) values of $p$ explanatory variables, quantitative and/or qualitative, measured on the $i$-th individual, and $\beta=(\beta_0,\ldots,\beta_p)^\top$ is the (unknown) vector of interest parameters. Thus, the observed value of the interest variable, $y$, is assumed to be the result of a "signal", represented by $\mu$, that is additively contaminated by a normally distributed random "noise", represented by the random error $\epsilon$.

Goal: To explain the differences between $y_1,\ldots,y_n$ by using the differences between $x_1,\ldots,x_n$.
Normal Linear Model ($p=1$ and $\sigma^2=0.4$)

[Figure: simulated responses $y$ at $x = 0, 1, 2, \ldots, 10$ scattered around the straight line $\mu = 1.5 + 1.4\,x$, illustrating the normal "noise" around the linear "signal".]
Normal Linear Model

The above model may be summarized as follows:
$$Y_i = \mu_i + \epsilon_i,\qquad \mu_i = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip} = x_i^\top\beta,\qquad \epsilon_i \overset{ind}{\sim} \text{Normal}(0,\sigma^2) \;\Longrightarrow\; Y_i \overset{ind}{\sim} \text{Normal}(\mu_i,\sigma^2),$$
or, equivalently,
$$Y_i = \mu_i + \epsilon_i,\qquad \mu_i = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip} = x_i^\top\beta,\qquad \epsilon \sim \text{Normal}_n(0,\sigma^2 I) \;\Longrightarrow\; Y \sim \text{Normal}_n(\mu,\sigma^2 I),$$
where $Y=(Y_1,\ldots,Y_n)^\top$, $\mu=(\mu_1,\ldots,\mu_n)^\top=X\beta$, $\epsilon=(\epsilon_1,\ldots,\epsilon_n)^\top$ and $X=(x_1,\ldots,x_n)^\top$. In matrix form, the model becomes
$$Y = \mu + \epsilon.$$
This model is known as the normal linear model because: (i) the "signal" ($\mu_i$) can be written as a linear combination of the interest parameters ($\beta_0,\beta_1,\ldots,\beta_p$); and (ii) the random "noise" ($\epsilon_i$) is assumed to follow the normal or Gaussian distribution.
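To make the formulation concrete, the following R sketch simulates data from this model using the illustrative values $\beta_0=1.5$, $\beta_1=1.4$ and $\sigma^2=0.4$ from the figure above (the seed and sample size are arbitrary) and recovers the parameters with lm():

# Minimal simulation sketch; beta0=1.5, beta1=1.4 and sigma^2=0.4 are the
# illustrative values used in the figure above.
set.seed(123)
n <- 100
x <- runif(n, 0, 10)
mu <- 1.5 + 1.4*x                          # the linear "signal"
y <- mu + rnorm(n, mean=0, sd=sqrt(0.4))   # additive Gaussian "noise"
fit <- lm(y ~ x)
coef(fit)               # estimates close to (1.5, 1.4)
summary(fit)$sigma^2    # estimate of sigma^2, close to 0.4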
Ordinary Least Squares (OLS)

Parameter estimation involves selecting a vector from the space $\mathbb{R}^{p+1}$, denoted here by $\hat\beta$, such that the linear combinations given by $X\hat\beta$ provide the "best" representation of the vector of observed responses $y=(y_1,\ldots,y_n)^\top$. Based on intuition, $\hat\beta$ may be selected as the $\beta\in\mathbb{R}^{p+1}$ such that
$$Q_r(\beta) = \frac{1}{r}\sum_{i=1}^{n}\left|\frac{y_i - x_i^\top\beta}{\sigma}\right|^{r},$$
for some $r>0$, is minimum.

If the partial derivatives of $Q_r(\beta)$ are continuous, then the minimum of $Q_r(\beta)$ can be obtained by solving the following equations:
$$\begin{pmatrix} \partial Q_r(\beta)/\partial\beta_0 \\ \partial Q_r(\beta)/\partial\beta_1 \\ \vdots \\ \partial Q_r(\beta)/\partial\beta_p \end{pmatrix} = -\begin{pmatrix} \frac{1}{\sigma^r}\sum_{i=1}^{n}(y_i-x_i^\top\beta)^{r-1} \\ \frac{1}{\sigma^r}\sum_{i=1}^{n}x_{i1}(y_i-x_i^\top\beta)^{r-1} \\ \vdots \\ \frac{1}{\sigma^r}\sum_{i=1}^{n}x_{ip}(y_i-x_i^\top\beta)^{r-1} \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \\ \vdots \\ 0 \end{pmatrix}$$
If $r > 1$, then the partial derivatives of $Q_r(\beta)$ are continuous.
Ordinary Least Squares (OLS)

If $r=2$, then the above system of equations is linear and $\hat\beta$ becomes the Ordinary Least Squares (OLS) estimator, given by
$$\hat\beta = \underset{\beta\in\mathbb{R}^{p+1}}{\text{argmin}}\; Q_2(\beta),$$
where
$$Q_2(\beta) = \frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i-x_i^\top\beta)^2 = \frac{1}{2\sigma^2}(y-X\beta)^\top(y-X\beta) = \frac{1}{2\sigma^2}\left[y^\top y - 2y^\top X\beta + \beta^\top X^\top X\beta\right].$$
The gradient vector of $Q_2(\beta)$ is given by (see Appendix B-1,2)
$$U(\beta) = \begin{pmatrix} \partial Q_2(\beta)/\partial\beta_0 \\ \partial Q_2(\beta)/\partial\beta_1 \\ \vdots \\ \partial Q_2(\beta)/\partial\beta_p \end{pmatrix} = -\begin{pmatrix} \frac{1}{\sigma^2}\sum_{i=1}^{n}(y_i-x_i^\top\beta) \\ \frac{1}{\sigma^2}\sum_{i=1}^{n}x_{i1}(y_i-x_i^\top\beta) \\ \vdots \\ \frac{1}{\sigma^2}\sum_{i=1}^{n}x_{ip}(y_i-x_i^\top\beta) \end{pmatrix} = -\frac{1}{\sigma^2}X^\top(y-X\beta)$$
Ordinary Least Squares (OLS)

[Figure: scatter plot of $(x_i, y_i)$ with the fitted line $\hat\mu = \hat\beta_0 + \hat\beta_1 x$ and dotted vertical segments from each point to the line.]

The OLS estimator selects the straight line ($\hat\beta_0 + \hat\beta_1 x$) that minimizes the sum of the squares of the dotted segment lengths.
Ordinary Least Squares (OLS)

The Hessian matrix of $Q_2(\beta)$ is given by (see Appendix B-3)
$$J(\beta) = \begin{pmatrix} \frac{\partial^2 Q_2(\beta)}{\partial\beta_0^2} & \frac{\partial^2 Q_2(\beta)}{\partial\beta_0\partial\beta_1} & \ldots & \frac{\partial^2 Q_2(\beta)}{\partial\beta_0\partial\beta_p} \\ \frac{\partial^2 Q_2(\beta)}{\partial\beta_1\partial\beta_0} & \frac{\partial^2 Q_2(\beta)}{\partial\beta_1^2} & \ldots & \frac{\partial^2 Q_2(\beta)}{\partial\beta_1\partial\beta_p} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 Q_2(\beta)}{\partial\beta_p\partial\beta_0} & \frac{\partial^2 Q_2(\beta)}{\partial\beta_p\partial\beta_1} & \ldots & \frac{\partial^2 Q_2(\beta)}{\partial\beta_p^2} \end{pmatrix} = \frac{1}{\sigma^2}X^\top X$$
If $X$ has linearly independent columns, then $J(\beta)$ is a positive definite and non-singular matrix (see Appendix A-6), implying that $Q_2(\beta)$ is a strictly convex function (see Appendix F-2). Therefore, if the system of equations $U(\beta)=0$ has a solution, then it is the unique global minimum of $Q_2(\beta)$. Then, the OLS estimator of $\beta$ reduces to the solution of $U(\hat\beta) = -\frac{1}{\sigma^2}X^\top(y-X\hat\beta) = 0$, that is,
$$\hat\beta = (X^\top X)^{-1}X^\top y.$$
A necessary condition for $X$'s columns to be linearly independent is $p+1 \leq n$, that is, $X$ has no more columns than rows.
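A minimal R sketch of this closed form, assuming a simulated design matrix X with linearly independent columns; the manual computation reproduces coef() from lm():

# Closed-form OLS computed directly; X and y are simulated for illustration.
set.seed(1)
n <- 50
X <- cbind(1, rnorm(n), rnorm(n))                   # intercept plus p = 2 covariates
y <- drop(X %*% c(2, -1, 0.5)) + rnorm(n)
beta_hat <- solve(crossprod(X), crossprod(X, y))    # (X'X)^{-1} X'y
cbind(manual = beta_hat, lm = coef(lm(y ~ X - 1)))  # identical up to numerics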
Ordinary Least Squares (OLS)

In conclusion, if the columns of $X$ are linearly independent, then the OLS estimator of $\beta$ satisfies the following:

(1) It always exists.
(2) It is unique.
(3) It has closed form, given by $\hat\beta = (X^\top X)^{-1}X^\top y$.
(4) It is linear with respect to $y$, that is, it may be written as $Ay$, where $A$ is a $(p+1)\times n$ matrix which does not depend on $y$. Therefore, the $(j+1)$-th element of $\hat\beta$, denoted here by $\hat\beta_j$, may be written as $\hat\beta_j = a_j^\top y = a_{j1}y_1 + \ldots + a_{jn}y_n$, where $a_j = (a_{j1},\ldots,a_{jn})^\top$ represents the $(j+1)$-th row of $A$.
(5) The estimation of the mean vector $\mu = X\beta$ is also linear with respect to $y$, that is, it may be written as $\hat\mu = X\hat\beta = Hy$, where $H = X(X^\top X)^{-1}X^\top$ is the orthogonal projection matrix. Therefore, $\hat\mu_i = x_i^\top\hat\beta = h_i^\top y = h_{i1}y_1 + \ldots + h_{in}y_n$, where $h_i = (h_{i1},\ldots,h_{in})^\top$ represents the $i$-th row of $H$.
Exercises

Consider the following simple linear model:
$$Y_i = \mu_i + \epsilon_i,\qquad \mu_i = \beta_0 + \beta_1 x_i,\qquad \epsilon_i \overset{ind}{\sim} \text{Normal}(0,\sigma^2).$$
Show that
$$\hat\beta_0 = \bar{y} - \hat\beta_1\bar{x}\qquad\text{and}\qquad \hat\beta_1 = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^{n}(x_i-\bar{x})^2} = \frac{S_y}{S_x}\,r_{xy},$$
where $(n-1)S_x^2 = \sum_{i=1}^{n}(x_i-\bar{x})^2$, $n\bar{x} = \sum_{i=1}^{n}x_i$, $(n-1)S_y^2 = \sum_{i=1}^{n}(y_i-\bar{y})^2$, $n\bar{y} = \sum_{i=1}^{n}y_i$, and $r_{xy}$ is Pearson's linear correlation coefficient.
Orthogonal projection matrix (H)

[Figure: scatter plot of the observed points $(x_i, y_i)$ and the fitted points $(x_i, \hat\beta_0 + \hat\beta_1 x_i)$ on the fitted line.]

The matrix $H$ orthogonally projects $y_1,\ldots,y_n$ onto the line, plane, or hyperplane given by $\hat\mu = \hat\beta_0 + \hat\beta_1 x_1 + \hat\beta_2 x_2 + \ldots + \hat\beta_p x_p$.
Orthogonal projection matrix (H)

$h_{ij} = x_i^\top(X^\top X)^{-1}x_j$.

$H$ is a symmetric matrix, that is, $H^\top = H$, so the $i$-th row and the $i$-th column of $H$ coincide. Therefore, $h_{ij} = h_{ji}$ for all $i,j = 1,\ldots,n$.

$H$ is an idempotent matrix, that is, $H^2 = H$, so $h_{ij} = h_i^\top h_j$ for all $i,j = 1,\ldots,n$. For example,
$$h_{ii} = h_i^\top h_i = \sum_{j=1}^{n}h_{ij}^2 = h_{ii}^2 + \sum_{j:\,j\neq i}h_{ij}^2 \;\Longrightarrow\; h_{ii} - h_{ii}^2 = h_{ii}(1-h_{ii}) = \sum_{j:\,j\neq i}h_{ij}^2 \geq 0 \;\Longrightarrow\; h_{ii}\in[0,1].$$

$\partial\hat\mu_i/\partial y_j = h_{ij}$ and $\partial\hat\mu_j/\partial y_i = h_{ji}$. Therefore, $\partial\hat\mu_i/\partial y_j = \partial\hat\mu_j/\partial y_i$.

$\text{rank}(H) = \text{trace}(H) = p+1$ (see Appendix A-1,2). Consequently, if $p+1 < n$ then $H$ is a singular matrix.
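These properties can be checked numerically; a sketch with an arbitrary simulated design matrix:

# Numerical check of the properties of H for a simulated design matrix.
set.seed(2)
n <- 30
X <- cbind(1, rnorm(n), rnorm(n))
H <- X %*% solve(crossprod(X)) %*% t(X)   # H = X (X'X)^{-1} X'
all.equal(H, t(H))        # symmetry: H' = H
all.equal(H, H %*% H)     # idempotency: H^2 = H
sum(diag(H))              # trace(H) = p + 1 = 3
range(diag(H))            # every h_ii lies in [0, 1]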
Maximum Likelihood (ML)

This method consists of maximizing, with respect to $\beta$, the chance of observing the data in the sample, that is,
$$\hat\beta = \underset{\beta\in\mathbb{R}^{p+1}}{\text{argmax}}\; L(\beta),$$
where $L(\beta)$ is the likelihood function of $\beta$, which corresponds to the joint probability density function of $Y_1,\ldots,Y_n$, viewed as a function of $\beta$ rather than as a function of $y_1,\ldots,y_n$. Thus,
$$L(\beta) = \prod_{i=1}^{n}\frac{1}{(2\pi\sigma^2)^{\frac{1}{2}}}\exp\left[-\frac{1}{2}\left(\frac{y_i-x_i^\top\beta}{\sigma}\right)^2\right] = (2\pi\sigma^2)^{-\frac{n}{2}}\exp\left[-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i-x_i^\top\beta)^2\right] = (2\pi\sigma^2)^{-\frac{n}{2}}\exp\left[-Q_2(\beta)\right].$$
The function $\log(x)$ is strictly increasing, so any $\beta$ that maximizes $\log[L(\beta)]$ will also maximize $L(\beta)$. In addition, the computation of the gradient of $\log[L(\beta)]$ is easier than that of $L(\beta)$, as the latter is a product of $n$ terms while the former is a sum of $n$ terms. Thus, a more convenient way to express $\hat\beta$ is the following:
$$\hat\beta = \underset{\beta\in\mathbb{R}^{p+1}}{\text{argmax}}\; \ell(\beta),$$
where
$$\ell(\beta) = \log[L(\beta)] = -\frac{n}{2}\log(2\pi\sigma^2) - Q_2(\beta)$$
is the log-likelihood function of $\beta$.

Note that maximizing $\ell(\beta)$ with respect to $\beta$ is equivalent to maximizing $-Q_2(\beta)$, as $-\frac{n}{2}\log(2\pi\sigma^2)$ is a constant. In addition, maximizing $-Q_2(\beta)$ is equivalent to minimizing $Q_2(\beta)$, as $-1$ is a negative constant. Therefore, the ML estimator of $\beta$ in the model where $\epsilon_i \overset{ind}{\sim} \text{Normal}(0,\sigma^2)$ coincides with the OLS estimator. For this reason, the normal or Gaussian distribution is the most widely used to describe the behavior of the random errors.
Statistical properties of $\hat\beta$

If the model is correctly specified, then (see Appendix E-3)
$$\hat\beta \sim \text{Normal}_{p+1}(\text{E}(\hat\beta), \text{Var}(\hat\beta)),$$
where
$$\text{E}(\hat\beta) = \text{E}[(X^\top X)^{-1}X^\top Y] = (X^\top X)^{-1}X^\top \text{E}(Y) = \beta$$
and
$$\text{Var}(\hat\beta) = \text{Var}[(X^\top X)^{-1}X^\top Y] = (X^\top X)^{-1}X^\top\text{Var}(Y)X(X^\top X)^{-1} = \sigma^2(X^\top X)^{-1}.$$
Therefore, $\hat\beta \sim \text{Normal}_{p+1}(\beta, \sigma^2(X^\top X)^{-1})$.

In addition, $\hat\beta$ is an efficient estimator as it achieves the Cramér-Rao bound, that is, the variance-covariance matrix of $\hat\beta$ coincides with the inverse of the Fisher information matrix, where the latter is given by
$$-\text{E}\left[\frac{\partial^2\ell(\beta)}{\partial\beta\,\partial\beta^\top}\right] = \frac{1}{\sigma^2}X^\top X.$$
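A small Monte Carlo sketch of these two properties, under assumed values $\beta = (1, 3)^\top$ and $\sigma = 2$:

# Monte Carlo check of E(beta_hat) = beta and Var(beta_hat) = sigma^2 (X'X)^{-1};
# beta = (1, 3) and sigma = 2 are assumed illustrative values.
set.seed(3)
n <- 40; sigma <- 2
X <- cbind(1, runif(n))
beta <- c(1, 3)
est <- replicate(5000, {
  y <- drop(X %*% beta) + rnorm(n, sd = sigma)
  c(solve(crossprod(X), crossprod(X, y)))
})
rowMeans(est)                  # approximately beta: unbiasedness
var(t(est))                    # approximately the theoretical matrix below
sigma^2 * solve(crossprod(X))  # sigma^2 (X'X)^{-1}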
Exercises

(1) The (generalized) normal linear model may be written as
$$Y_i = \mu_i + \epsilon_i,\qquad \mu_i = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip} = x_i^\top\beta,\qquad \epsilon_i \overset{ind}{\sim} \text{GN}(0,\sigma^2,\tau) \;\Longrightarrow\; Y_i \overset{ind}{\sim} \text{GN}(\mu_i,\sigma^2,\tau),$$
where, unlike $\beta$ and $\sigma^2$, $\tau>0$ is known. The probability density function of $Y_i \sim \text{GN}(\mu_i,\sigma^2,\tau)$ is as follows:
$$f_{Y_i}(y;\mu_i,\sigma^2,\tau) = \frac{\tau^{1-\frac{1}{\tau}}}{2\sigma\,\Gamma\!\left(\frac{1}{\tau}\right)}\exp\left[-\frac{1}{\tau}\left|\frac{y-\mu_i}{\sigma}\right|^{\tau}\right],\quad y\in\mathbb{R}.$$
If $\tau=2$ then $\text{GN}(0,\sigma^2,\tau)$ reduces to $\text{Normal}(0,\sigma^2)$. Show that:
(a) the distribution of $Y_i$ is symmetric about $\mu_i$;
(b) the mode of the distribution of $Y_i$ is $\mu_i$;
(c) the variance of $Y_i$ does not depend on $i$, that is, it is constant;
(d) the ML estimator of $\beta$ reduces to $\hat\beta = \underset{\beta\in\mathbb{R}^{p+1}}{\text{argmin}}\; Q_\tau(\beta)$.
Exercises

(2) Consider the following linear model:
$$Y_i = \mu_i + \epsilon_i,\qquad \mu_i = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip} = x_i^\top\beta,\qquad \epsilon_i \overset{ind}{\sim} \text{Normal}(0,\sigma^2/\omega_i),$$
where $\omega_i > 0$ for $i=1,\ldots,n$ are known weights. Show that:
(a) the estimator of $\beta$ by the least squares method, better known as the Weighted Least Squares (WLS) estimator, minimizes $(y-X\beta)^\top\Omega(y-X\beta) = \sum_{i=1}^{n}\omega_i(y_i-x_i^\top\beta)^2$ and it can be written as $\hat\beta = (X^\top\Omega X)^{-1}X^\top\Omega y$, where $\Omega = \text{diag}\{\omega_1,\ldots,\omega_n\}$;
(b) the ML estimator of $\beta$ coincides with the WLS estimator;
(c) if the model is correctly specified, then the distribution of $\hat\beta$ is $\text{Normal}_{p+1}(\beta, \sigma^2(X^\top\Omega X)^{-1})$.
R-squared

Interest Model (model under investigation):
$$Y_i = \mu_i + \epsilon_i,\qquad \mu_i = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip},\qquad \epsilon_i \overset{ind}{\sim} \text{Normal}(0,\sigma^2),$$
$$X = \begin{pmatrix} 1 & x_{11} & \ldots & x_{1p} \\ 1 & x_{21} & \ldots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & \ldots & x_{np} \end{pmatrix},\qquad \hat\beta = \underset{\beta\in\mathbb{R}^{p+1}}{\text{argmin}}\; Q_2(\beta),\qquad\text{where}\quad Q_2(\beta) = \frac{1}{2\sigma^2}\sum_{i=1}^{n}\Bigl(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Bigr)^2.$$

Reference Model (model with the poorest goodness-of-fit):
$$Y_i = \mu_i + \epsilon_i,\qquad \mu_i = \beta,\qquad \epsilon_i \overset{ind}{\sim} \text{Normal}(0,\sigma^2),$$
$$X = \begin{pmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{pmatrix},\qquad \hat\beta = \underset{\beta\in\mathbb{R}}{\text{argmin}}\; Q_2^{0}(\beta),\qquad\text{where}\quad Q_2^{0}(\beta) = \frac{1}{2\sigma^2}\sum_{i=1}^{n}\bigl(y_i - \beta\bigr)^2.$$

Note that $Q_2(\hat\beta) = \min_{\beta\in\mathbb{R}^{p+1}} Q_2(\beta)$ and $Q_2^{0}(\hat\beta) = \min_{\beta\in\mathbb{R}^{p+1}} Q_2(\beta)$ restricted to $\beta_1 = \ldots = \beta_p = 0$.
R-squared

$$R^2 = 1 - \frac{Q_2(\hat\beta)}{Q_2^{0}(\hat\beta)} = 1 - \frac{\sum_{i=1}^{n}\bigl(y_i - x_i^\top\hat\beta\bigr)^2}{\sum_{i=1}^{n}\bigl(y_i - \bar{y}\bigr)^2}$$

The $R^2$ satisfies the following:

$R^2 \in [0,1]$.

If $X = 1_n$ then $R^2 = 0$, where $1_n$ represents a column vector of 1's of size $n$.

If $X = I_n$ then $R^2 = 1$, where $I_n$ represents the identity matrix of size $n$.

A hundred times its value represents the percentage reduction in Residual Sum of Squares (RSS) caused by the inclusion of $p$ parameters in the model.

It coincides with $r^2(y,\hat\mu)$, where $-1 \leq r(y,\hat\mu) \leq 1$ represents the Pearson's linear correlation coefficient computed between $y = (y_1,\ldots,y_n)$ and $\hat\mu = (\hat\mu_1,\ldots,\hat\mu_n)$.
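Both characterizations are easy to verify in R; a sketch applicable to any lm fit with an intercept (the model and data are whatever fit holds, e.g. the richness fit used later):

# R^2 from its definition and as the squared correlation r^2(y, mu_hat).
R2_both <- function(fit) {
  y <- model.response(model.frame(fit))
  c(definition  = 1 - sum(resid(fit)^2)/sum((y - mean(y))^2),
    correlation = cor(y, fitted(fit))^2,
    lm          = summary(fit)$r.squared)   # the three values coincide
}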
R-squared

The $R^2$ can be used to compare nested models, that is, models where the first model's $X$ matrix coincides with the first $(r+1)$ columns of the second model's $X$ matrix, in which $r < p$.

Model I:
$$Y_i = \mu_i + \epsilon_i,\qquad \mu_i = \beta_0 + \sum_{j=1}^{r}\beta_j x_{ij},\qquad \epsilon_i \overset{ind}{\sim} \text{Normal}(0,\sigma^2),$$
$$\hat\beta^{\text{I}} = \underset{\beta\in\mathbb{R}^{r+1}}{\text{argmin}}\; Q_2^{\text{I}}(\beta),\qquad\text{where}\quad Q_2^{\text{I}}(\beta) \propto \sum_{i=1}^{n}\Bigl(y_i - \beta_0 - \sum_{j=1}^{r}\beta_j x_{ij}\Bigr)^2.$$

Model II:
$$Y_i = \mu_i + \epsilon_i,\qquad \mu_i = \beta_0 + \sum_{j=1}^{r}\beta_j x_{ij} + \sum_{j=r+1}^{p}\beta_j x_{ij},\qquad \epsilon_i \overset{ind}{\sim} \text{Normal}(0,\sigma^2),$$
$$\hat\beta^{\text{II}} = \underset{\beta\in\mathbb{R}^{p+1}}{\text{argmin}}\; Q_2^{\text{II}}(\beta),\qquad\text{where}\quad Q_2^{\text{II}}(\beta) \propto \sum_{i=1}^{n}\Bigl(y_i - \beta_0 - \sum_{j=1}^{r}\beta_j x_{ij} - \sum_{j=r+1}^{p}\beta_j x_{ij}\Bigr)^2.$$

Note that $Q_2^{\text{II}}(\hat\beta^{\text{II}}) = \min_{\beta\in\mathbb{R}^{p+1}} Q_2^{\text{II}}(\beta)$, and $Q_2^{\text{I}}(\hat\beta^{\text{I}}) = \min_{\beta\in\mathbb{R}^{p+1}} Q_2^{\text{II}}(\beta)$ restricted to $\beta_{r+1} = \ldots = \beta_p = 0$.
R-squared

Therefore, $Q_2^{\text{II}}(\hat\beta^{\text{II}}) \leq Q_2^{\text{I}}(\hat\beta^{\text{I}})$. As a result, $R^2_{\text{II}}$ is ALWAYS greater than or equal to $R^2_{\text{I}}$, that is, according to $R^2$, model II is ALWAYS better than model I, even when the additional $(p-r)$ covariates are not helpful in explaining the differences between $y_1,\ldots,y_n$.

The following expression describes an adjusted version of $R^2$ that measures not just goodness-of-fit, but also penalizes model complexity:
$$R^2_{\text{adj}} = 1 - \frac{\sum_{i=1}^{n}\bigl(y_i - x_i^\top\hat\beta\bigr)^2/(n-p-1)}{\sum_{i=1}^{n}\bigl(y_i - \bar{y}\bigr)^2/(n-1)} = 1 - \frac{n-1}{n-p-1}(1-R^2),$$
where $(n-p-1)$ and $(n-1)$ are the degrees-of-freedom of the Interest Model and the Reference Model, respectively.
Predicted R²

Note that $\hat\beta = (X^\top X)^{-1}X^\top y$ also may be written as follows:
$$\hat\beta = \left(\sum_{l=1}^{n}x_l x_l^\top\right)^{-1}\sum_{l=1}^{n}x_l y_l.$$
Hence, the estimate of $\beta$ derived from the dataset with the $i$-th observation excluded, denoted here by $\hat\beta_{(i)}$, is (see, for instance, Cook and Weisberg [1982, page 110])
$$\begin{aligned}
\hat\beta_{(i)} &= \left(\sum_{l=1}^{n}x_l x_l^\top - x_i x_i^\top\right)^{-1}\left(\sum_{l=1}^{n}x_l y_l - x_i y_i\right) = \left(X^\top X - x_i x_i^\top\right)^{-1}\left(X^\top y - x_i y_i\right)\\
&= \left[(X^\top X)^{-1} + \frac{(X^\top X)^{-1}x_i x_i^\top (X^\top X)^{-1}}{1-h_{ii}}\right]\left(X^\top y - x_i y_i\right)\qquad\text{(see Appendix A-3)}\\
&= \hat\beta - \frac{y_i - x_i^\top\hat\beta}{1-h_{ii}}(X^\top X)^{-1}x_i,
\end{aligned}$$
where $h_{ii} = x_i^\top(X^\top X)^{-1}x_i$ represents the $(i,i)$-th entry of $H$.
Predicted R²

Therefore, the predicted residual is
$$y_i - x_i^\top\hat\beta_{(i)} = y_i - x_i^\top\hat\beta + \frac{y_i - x_i^\top\hat\beta}{1-h_{ii}}\,x_i^\top(X^\top X)^{-1}x_i = \frac{y_i - x_i^\top\hat\beta}{1-h_{ii}}.$$
Thus, the predicted $R^2$ may be defined as follows:
$$R^2_{\text{pred}} = 1 - \sum_{i=1}^{n}\Bigl(y_i - x_i^\top\hat\beta_{(i)}\Bigr)^2\Big/\sum_{i=1}^{n}\Bigl(y_i - \bar{y}_{(i)}\Bigr)^2,$$
where $\sum_{i=1}^{n}\bigl(y_i - x_i^\top\hat\beta_{(i)}\bigr)^2$ and $\sum_{i=1}^{n}\bigl(y_i - \bar{y}_{(i)}\bigr)^2$ are the Predicted Residual Error Sum of Squares (PRESS) (Allen [1971]) of the interest and reference models, respectively. If the values of $(y_i - x_i^\top\hat\beta_{(i)})$ and $(y_i - \bar{y}_{(i)})$ are replaced in the expression above, then $R^2_{\text{pred}}$ becomes (see, for example, Alcantara et al. [2023])
$$R^2_{\text{pred}} = 1 - \sum_{i=1}^{n}\frac{\bigl(y_i - x_i^\top\hat\beta\bigr)^2}{(1-h_{ii})^2}\Big/\sum_{i=1}^{n}\frac{\bigl(y_i - \bar{y}\bigr)^2}{(1-n^{-1})^2}.$$
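Because the predicted residuals have the closed form above, PRESS and the predicted R² can be computed without refitting the model n times; a sketch for an lm fit, using hatvalues() for the h_ii:

# Predicted R^2 via the closed-form predicted residuals.
predR2 <- function(fit) {
  y <- model.response(model.frame(fit))
  n <- length(y)
  press  <- sum((resid(fit)/(1 - hatvalues(fit)))^2)  # interest model PRESS
  press0 <- sum(((y - mean(y))/(1 - 1/n))^2)          # reference model PRESS
  1 - press/press0
}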
Generalized Information Criterion

$$\begin{aligned}
\text{GIC}(k) &= \overbrace{-2\,\ell(\hat\beta)}^{\text{measure of the goodness-of-fit}} + \overbrace{k(p+2)}^{\text{complexity penalty}}\\
&= n\log(2\pi\hat\sigma^2) + \frac{1}{\hat\sigma^2}\sum_{i=1}^{n}(y_i - x_i^\top\hat\beta)^2 + k(p+2)\\
&= n\log(2\pi\hat\sigma^2) + (n-p-1) + k(p+2)\\
&= n\log(\hat\sigma^2) + p(k-1) + C,
\end{aligned}$$
where $C > 0$ is a constant as it does not depend on the fitted model. The Akaike Information Criterion (AIC, Akaike [1974]) and the Bayesian Information Criterion (BIC, Schwarz [1978]), also known as the Schwarz Information Criterion (SIC), are special cases of GIC($k$) when $k=2$ and $k=\log(n)$, respectively.

If $n > 7$ then the penalty by model complexity in BIC is greater than that in AIC. Therefore, the BIC tends to choose simpler models than AIC.
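A direct transcription of GIC(k) for an lm fit; note that, as in the derivation above, this sketch uses σ̂² = RSS/(n−p−1), whereas R's built-in AIC() and BIC() use the ML estimate RSS/n, so the numerical values need not coincide:

# GIC(k) for a normal linear model, following the derivation above.
GIC <- function(fit, k = 2) {
  n <- nobs(fit); p <- length(coef(fit)) - 1
  sigma2 <- sum(resid(fit)^2)/(n - p - 1)   # unbiased estimate of sigma^2
  n*log(2*pi*sigma2) + (n - p - 1) + k*(p + 2)
}
# GIC(fit, k = 2)               # AIC-type criterion
# GIC(fit, k = log(nobs(fit)))  # BIC-type criterion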
Exercises

Consider two models: in the first model, denoted here by I, $\mu_i = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip}$, while in the second model, denoted here by II, $\mu_i = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip} + \beta_{p+1}x_{i(p+1)} + \ldots + \beta_{p+r}x_{i(p+r)}$. Show that:

(1) The $R^2_{\text{adj}}$ of model I is greater [smaller] than that of model II if and only if $\hat\sigma^2_{\text{I}}/\hat\sigma^2_{\text{II}} < 1$ $\bigl[\hat\sigma^2_{\text{I}}/\hat\sigma^2_{\text{II}} > 1\bigr]$, where $\hat\sigma^2_{\text{I}}$ and $\hat\sigma^2_{\text{II}}$ represent the estimates of $\sigma^2$ obtained from models I and II, respectively. Hence, choosing the model with the highest value of $R^2_{\text{adj}}$ is equivalent to choosing the model with the lowest estimate of $\sigma^2$.

(2) The AIC of model I is smaller [greater] than that of model II if and only if $\hat\sigma^2_{\text{I}}/\hat\sigma^2_{\text{II}} < \exp(\frac{r}{n})$ $\bigl[\hat\sigma^2_{\text{I}}/\hat\sigma^2_{\text{II}} > \exp(\frac{r}{n})\bigr]$. Therefore, for large sample sizes, choosing the model with the smallest AIC value is equivalent to choosing the model with the smallest estimate of $\sigma^2$.

(3) The BIC of model I is smaller [greater] than that of model II if and only if $\hat\sigma^2_{\text{I}}/\hat\sigma^2_{\text{II}} < \exp(\frac{r}{n}[\log(n)-1])$ $\bigl[\hat\sigma^2_{\text{I}}/\hat\sigma^2_{\text{II}} > \exp(\frac{r}{n}[\log(n)-1])\bigr]$. Therefore, for large sample sizes, larger than in the AIC case, choosing the model with the smallest BIC value is equivalent to choosing the model with the smallest estimate of $\sigma^2$.
Confidence intervals

If $\hat\beta \sim \text{Normal}_{p+1}(\beta, \sigma^2(X^\top X)^{-1})$, then
$$\hat\beta_j \sim \text{Normal}(\beta_j, \sigma^2 c_j),\quad j = 0,1,\ldots,p,$$
where $c_j$ represents the $(j+1)$-th entry of the main diagonal of $(X^\top X)^{-1}$. Therefore,
$$\frac{\hat\beta_j - \beta_j}{\sigma\sqrt{c_j}} \sim \text{Normal}(0,1),$$
and, for all $\alpha\in(0,1)$,
$$\Pr\left[-Z_{1-\frac{\alpha}{2}} \leq \frac{\hat\beta_j - \beta_j}{\sigma\sqrt{c_j}} \leq Z_{1-\frac{\alpha}{2}}\right] = 1-\alpha \;\Longrightarrow\; \Pr\left[\hat\beta_j - Z_{1-\frac{\alpha}{2}}\,\sigma\sqrt{c_j} \leq \beta_j \leq \hat\beta_j + Z_{1-\frac{\alpha}{2}}\,\sigma\sqrt{c_j}\right] = 1-\alpha,$$
where $Z_{1-\frac{\alpha}{2}}$ represents the $100(1-\frac{\alpha}{2})$-th percentile of the standard normal distribution.
Confidence intervals

Then, a confidence interval of $100(1-\alpha)\%$ for $\beta_j$ reduces to
$$\text{IC}_{1-\alpha}(\beta_j) = \hat\beta_j \pm Z_{1-\frac{\alpha}{2}}\,\sigma\sqrt{c_j}.$$
Nevertheless, this confidence interval is not useful in practice because $\sigma^2$ is unknown. An estimator of $\sigma^2$ that is statistically independent of $\hat\beta$ and such that $(n-p-1)\frac{\hat\sigma^2}{\sigma^2} \sim \chi^2(n-p-1)$ is given by the following expression:
$$\hat\sigma^2 = \frac{1}{n-p-1}(y-X\hat\beta)^\top(y-X\hat\beta) = \frac{1}{n-p-1}\sum_{i=1}^{n}(y_i-x_i^\top\hat\beta)^2.$$
The mean and variance of a random variable following the $\chi^2(n-p-1)$ distribution allow one to compute $\text{E}(\hat\sigma^2) = \sigma^2$ and $\text{Var}(\hat\sigma^2) = 2\sigma^4/(n-p-1)$. The above allows one to conclude that $\hat\sigma^2$ is an estimator of $\sigma^2$ which satisfies the following: (i) it is unbiased; (ii) its variance tends to 0 as $n$ tends to $\infty$; and (iii) it is consistent.
Confidence intervals

$Y \sim \text{Normal}_n(X\beta, \sigma^2 I)$ implies that
$$(n-p-1)\frac{\hat\sigma^2}{\sigma^2} = Y^\top\frac{1}{\sigma^2}(I-H)Y \sim \chi^2(n-p-1)$$
provided that (see Appendix E-4)

$\frac{1}{\sigma^2}(I-H)\,\sigma^2 I = I-H$ is an idempotent matrix,

$(X\beta)^\top\frac{1}{\sigma^2}(I-H)X\beta = (X\beta)^\top\frac{1}{\sigma^2}(X-HX)\beta = 0$, and

$\text{rank}(I-H) = \text{trace}(I-H) = n - \text{trace}(H) = n-p-1$.

In addition, $(n-p-1)\frac{\hat\sigma^2}{\sigma^2}$ and $\hat\beta = (X^\top X)^{-1}X^\top Y$ are statistically independent provided that (see Appendix E-5)
$$(X^\top X)^{-1}X^\top\,\sigma^2 I\,\frac{1}{\sigma^2}(I-H) = (X^\top X)^{-1}(X^\top - X^\top H) = 0.$$
Confidence intervals

The distribution of $\frac{Z}{\sqrt{V/m}}$ is Student-$t(m)$ provided that $Z \sim \text{Normal}(0,1)$, $V \sim \chi^2(m)$, and $Z$ and $V$ are statistically independent random variables (see Appendix D-2). Therefore,
$$\left.\frac{\hat\beta_j - \beta_j}{\sigma\sqrt{c_j}}\middle/\sqrt{\frac{\hat\sigma^2}{\sigma^2}}\right. = \frac{\hat\beta_j - \beta_j}{\hat\sigma\sqrt{c_j}} \sim \text{Student-}t(n-p-1).$$
Thus, for all $\alpha\in(0,1)$,
$$\Pr\left[-t^{(n-p-1)}_{1-\frac{\alpha}{2}} \leq \frac{\hat\beta_j - \beta_j}{\hat\sigma\sqrt{c_j}} \leq t^{(n-p-1)}_{1-\frac{\alpha}{2}}\right] = 1-\alpha \;\Longrightarrow\; \Pr\left[\hat\beta_j - t^{(n-p-1)}_{1-\frac{\alpha}{2}}\,\hat\sigma\sqrt{c_j} \leq \beta_j \leq \hat\beta_j + t^{(n-p-1)}_{1-\frac{\alpha}{2}}\,\hat\sigma\sqrt{c_j}\right] = 1-\alpha,$$
where $t^{(n-p-1)}_{1-\frac{\alpha}{2}}$ represents the $100(1-\frac{\alpha}{2})$-th percentile of the Student-$t(n-p-1)$ distribution.
Confidence intervals

Then, a confidence interval of $100(1-\alpha)\%$ for $\beta_j$ reduces to
$$\text{IC}_{1-\alpha}(\beta_j) = \hat\beta_j \pm t^{(n-p-1)}_{1-\frac{\alpha}{2}}\,\hat\sigma\sqrt{c_j}$$
Two desirable attributes for a confidence interval are:

Confidence, which is easily specified by using "small" values of $\alpha\in(0,1)$.

Precision, which is recognized by "small" values of the interval width, given by $2\,t^{(n-p-1)}_{1-\frac{\alpha}{2}}\,\hat\sigma\sqrt{c_j}$.

The uncertainty in $\sigma^2$ subtracts precision from the confidence interval because, for fixed $\alpha$, the value of $t^{(n-p-1)}_{1-\frac{\alpha}{2}}$ is greater than $Z_{1-\frac{\alpha}{2}}$. However, the loss of precision decreases with increasing $(n-p-1)$, since for fixed $\alpha$, $t^{(n-p-1)}_{1-\frac{\alpha}{2}} \to Z_{1-\frac{\alpha}{2}}$ as $(n-p-1) \to \infty$.
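The interval can be assembled by hand from coef(), vcov() and qt(), and compared with confint(); a sketch:

# Manual t-based confidence intervals for the coefficients of an lm fit.
ci_manual <- function(fit, level = 0.95) {
  alpha <- 1 - level
  se <- sqrt(diag(vcov(fit)))               # sigma_hat * sqrt(c_j)
  tq <- qt(1 - alpha/2, df.residual(fit))   # t percentile with n-p-1 df
  cbind(lower = coef(fit) - tq*se, upper = coef(fit) + tq*se)
}
# ci_manual(fit)   # agrees with confint(fit)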
Confidence intervals

[Figure: densities of the standard normal and Student-$t(\nu)$ distributions, each marking the percentiles $\pm Z_{1-\frac{\alpha}{2}}$ and $\pm t_{1-\frac{\alpha}{2}}$, respectively, with tail areas of $\frac{\alpha}{2}$.]
Hypothesis testing

                                        Null hypothesis (H0) is
                                        True              False
Decision about        Don't reject     ✓                 Type II error
hypothesis (H0) is    Reject           Type I error      ✓

α = Pr[Type I error] = Pr[Reject H0 | H0 is True]
γ = Pr[Type II error] = Pr[Don't reject H0 | H0 is False]
Hypothesis testing

$$H_0: \beta_j = \beta_j^{0}\quad\text{versus}\quad H_1: \beta_j \neq \beta_j^{0}$$

Type I error: Conclude that $\beta_j \neq \beta_j^{0}$ when, in fact, $\beta_j = \beta_j^{0}$.
Type II error: Conclude that $\beta_j = \beta_j^{0}$ when, in fact, $\beta_j \neq \beta_j^{0}$.

If $\sigma^2$ is known, then $\frac{\hat\beta_j - \beta_j^{0}}{\sigma\sqrt{c_j}} \sim \text{Normal}(0,1)$ under $H_0$, and the decision rule is

"Reject $H_0$ at the significance level of $100(\alpha)\%$ if $\left|\frac{\hat\beta_j - \beta_j^{0}}{\sigma\sqrt{c_j}}\right| > Z_{1-\frac{\alpha}{2}}$"

$$\begin{aligned}
\Pr[\text{Type I error}] &= \Pr\left[\left|\hat\beta_j - \beta_j^{0}\right|\big/\bigl(\sigma\sqrt{c_j}\bigr) > Z_{1-\frac{\alpha}{2}}\;\Big|\;H_0\text{ is True}\right]\\
&= 1 - \Pr\left[\left|\hat\beta_j - \beta_j^{0}\right|\big/\bigl(\sigma\sqrt{c_j}\bigr) \leq Z_{1-\frac{\alpha}{2}}\;\Big|\;H_0\text{ is True}\right]\\
&= 1 - \Pr\left[-Z_{1-\frac{\alpha}{2}} \leq \bigl(\hat\beta_j - \beta_j^{0}\bigr)\big/\bigl(\sigma\sqrt{c_j}\bigr) \leq Z_{1-\frac{\alpha}{2}}\;\Big|\;H_0\text{ is True}\right]\\
&= 1 - \left[\Phi(Z_{1-\frac{\alpha}{2}}) - \Phi(-Z_{1-\frac{\alpha}{2}})\right] = \alpha,
\end{aligned}$$
where $\Phi(\cdot)$ represents the cumulative distribution function of the standard normal distribution.
Hypothesis testing

$$\begin{aligned}
\Pr[\text{Type II error}] &= \Pr\left[\left|\hat\beta_j - \beta_j^{0}\right|\big/\bigl(\sigma\sqrt{c_j}\bigr) \leq Z_{1-\frac{\alpha}{2}}\;\Big|\;H_0\text{ is False}\right]\\
&= \Pr\left[-Z_{1-\frac{\alpha}{2}} \leq \bigl(\hat\beta_j - \beta_j^{0}\bigr)\big/\bigl(\sigma\sqrt{c_j}\bigr) \leq Z_{1-\frac{\alpha}{2}}\;\Big|\;H_0\text{ is False}\right]\\
&= \Pr\left[-Z_{1-\frac{\alpha}{2}} - \delta \leq \bigl(\hat\beta_j - \beta_j\bigr)\big/\bigl(\sigma\sqrt{c_j}\bigr) \leq Z_{1-\frac{\alpha}{2}} - \delta\;\Big|\;H_0\text{ is False}\right]\\
&= \Phi(Z_{1-\frac{\alpha}{2}} - \delta) - \Phi(-Z_{1-\frac{\alpha}{2}} - \delta),
\end{aligned}$$
where $\delta = (\beta_j - \beta_j^{0})\big/\bigl(\sigma\sqrt{c_j}\bigr)$.

[Figure: Pr[Type II error] as a function of $\delta$ for $\alpha = 0.01$, $0.05$ and $0.10$.]
Hypothesis testing

If $\sigma^2$ is unknown and the aim is to assess the hypothesis system
$$H_0: \beta_j = \beta_j^{0}\quad\text{versus}\quad H_1: \beta_j \neq \beta_j^{0},$$
then the following decision rule may be used:

"Reject $H_0$ at the significance level of $100(\alpha)\%$ if $\left|\frac{\hat\beta_j - \beta_j^{0}}{\hat\sigma\sqrt{c_j}}\right| > t^{(n-p-1)}_{1-\frac{\alpha}{2}}$"

If a random variable follows the Student-$t(n-p-1)$ distribution, then its square follows the $F_{(1,\,n-p-1)}$ distribution (see Appendix D-5). Therefore, the above decision rule is equivalent to the following:

"Reject $H_0$ at the significance level of $100(\alpha)\%$ if $\frac{(\hat\beta_j - \beta_j^{0})^2}{\hat\sigma^2 c_j} > F^{(1,\,n-p-1)}_{1-\alpha}$"

The $p$-value, given by $\Pr[F > (\hat\beta_j - \beta_j^{0})^2/(\hat\sigma^2 c_j)]$, where $F \sim F_{(1,\,n-p-1)}$, allows one to write the above decision rule as follows:

"Reject $H_0$ at the significance level of $100(\alpha)\%$ if $p$-value $< \alpha$"
Hypothesis testing

[Figure: density of the $F_{(r,\,n-p-1)}$ distribution with the critical value $F^{(r,\,n-p-1)}_{1-\alpha}$ marked. When the observed statistic $F$ falls below the critical value, the p-value exceeds $\alpha$ and the null hypothesis is not rejected at the significance level of $100(\alpha)\%$; when it falls above, the p-value is smaller than $\alpha$ and the null hypothesis is rejected.]
Hypothesis testing

The interest is to assess the following hypothesis system:
$$H_0: L^\top\beta = M\quad\text{versus}\quad H_1: L^\top\beta \neq M,$$
where $L^\top$ is an $r\times(p+1)$ matrix whose rows are linearly independent and $M$ is an $r\times 1$ matrix. The following are examples of the hypothesis system above for $p=3$:

(1) $H_0: \beta_2 = 0$ versus $H_1: \beta_2 \neq 0$. Then $r=1$, $L^\top = \begin{bmatrix} 0 & 0 & 1 & 0 \end{bmatrix}$ and $M = 0$.

(2) $H_0: \begin{bmatrix}\beta_1\\\beta_3\end{bmatrix} = \begin{bmatrix}0\\0\end{bmatrix}$ versus $H_1: \begin{bmatrix}\beta_1\\\beta_3\end{bmatrix} \neq \begin{bmatrix}0\\0\end{bmatrix}$. Then $r=2$, $L^\top = \begin{bmatrix} 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}$ and $M = \begin{bmatrix}0\\0\end{bmatrix}$.
Hypothesis testing

(3) $H_0: \beta_2 = \beta_3$ versus $H_1: \beta_2 \neq \beta_3$. Then $r=1$, $L^\top = \begin{bmatrix} 0 & 0 & 1 & -1 \end{bmatrix}$ and $M = 0$.

If the model is correctly specified then $\hat\beta \sim \text{Normal}_{p+1}(\beta, \sigma^2(X^\top X)^{-1})$, which implies that (see Appendix E-3)
$$L^\top\hat\beta \sim \text{Normal}_r\bigl(L^\top\beta,\; L^\top\sigma^2(X^\top X)^{-1}L\bigr),$$
which in turn implies that (see Appendix E-3,4)
$$(L^\top\hat\beta - L^\top\beta)^\top\bigl(L^\top\sigma^2(X^\top X)^{-1}L\bigr)^{-1}(L^\top\hat\beta - L^\top\beta) \sim \chi^2(r).$$
The distribution of $\frac{V/m}{W/r}$ is $F_{(m,r)}$ provided that $V \sim \chi^2(m)$ and $W \sim \chi^2(r)$ are statistically independent random variables (see Appendix D-3). Therefore,
$$\frac{\sigma^2}{r\,\hat\sigma^2}(L^\top\hat\beta - L^\top\beta)^\top\bigl(L^\top\sigma^2(X^\top X)^{-1}L\bigr)^{-1}(L^\top\hat\beta - L^\top\beta) \sim F_{(r,\,n-p-1)}.$$
Hypothesis testing

Then, the decision rule reduces to "Reject $H_0$ at the significance level of $100(\alpha)\%$ if $F > F^{(r,\,n-p-1)}_{1-\alpha}$", where $F$ may be written as follows:
$$F = \frac{1}{r}(L^\top\hat\beta - M)^\top\bigl(L^\top\hat\sigma^2(X^\top X)^{-1}L\bigr)^{-1}(L^\top\hat\beta - M).$$
For example,

(1) $H_0: \beta_2 = 0$ versus $H_1: \beta_2 \neq 0$. Then
$$F = \frac{\hat\beta_2^2}{\widehat{\text{Var}}(\hat\beta_2)} = \frac{\hat\beta_2^2}{[\text{Std.Error}(\hat\beta_2)]^2}.$$

(2) $H_0: \begin{bmatrix}\beta_1\\\beta_3\end{bmatrix} = \begin{bmatrix}0\\0\end{bmatrix}$ versus $H_1: \begin{bmatrix}\beta_1\\\beta_3\end{bmatrix} \neq \begin{bmatrix}0\\0\end{bmatrix}$. Then
$$F = \frac{1}{2}\begin{bmatrix}\hat\beta_1 & \hat\beta_3\end{bmatrix}\begin{bmatrix}\widehat{\text{Var}}(\hat\beta_1) & \widehat{\text{Cov}}(\hat\beta_1,\hat\beta_3)\\ \widehat{\text{Cov}}(\hat\beta_3,\hat\beta_1) & \widehat{\text{Var}}(\hat\beta_3)\end{bmatrix}^{-1}\begin{bmatrix}\hat\beta_1\\ \hat\beta_3\end{bmatrix}.$$
Hypothesis testing

(3) $H_0: \beta_2 = \beta_3$ versus $H_1: \beta_2 \neq \beta_3$. Then
$$F = \frac{(\hat\beta_2 - \hat\beta_3)^2}{\widehat{\text{Var}}(\hat\beta_2) + \widehat{\text{Var}}(\hat\beta_3) - 2\,\widehat{\text{Cov}}(\hat\beta_2,\hat\beta_3)}.$$
Alternatively, $F$ can be expressed as follows:
$$F = \frac{1}{r\,\hat\sigma^2}\left[\sum_{i=1}^{n}(y_i - x_i^\top\hat\beta^{*})^2 - \sum_{i=1}^{n}(y_i - x_i^\top\hat\beta)^2\right],$$
where $\hat\beta = \underset{\beta\in\mathbb{R}^{p+1}}{\text{argmin}}\; Q_2(\beta)$ and $\hat\beta^{*} = \underset{\beta\in\mathbb{R}^{p+1}}{\text{argmin}}\; Q_2(\beta)$ restricted to $L^\top\beta = M$.
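The F statistic for H0: Lᵀβ = M can be computed directly from coef() and vcov(); a sketch (the helper name waldF is ours, not part of any package):

# F test of the general linear hypothesis H0: t(L) %*% beta = M.
# L is a (p+1) x r matrix, so t(L) %*% coef(fit) is the tested combination.
waldF <- function(fit, L, M = rep(0, ncol(L))) {
  d <- drop(t(L) %*% coef(fit)) - M
  r <- ncol(L)
  Fstat <- drop(t(d) %*% solve(t(L) %*% vcov(fit) %*% L) %*% d)/r
  c(F = Fstat, p.value = pf(Fstat, r, df.residual(fit), lower.tail = FALSE))
}
# Example (3) above, in a model with coefficients (b0, b1, b2, b3):
# waldF(fit, L = matrix(c(0, 0, 1, -1), ncol = 1))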
Richness of the species

The analysis of these data, available in the object richness of the library glmtoolbox, aims to explain the differences between plots as related to species richness (represented by a count of the number of plant species (richness)) using the differences between plots with respect to biomass (Biomass) and soil pH level: low (pH="low"), medium (pH="mid") and high (pH="high").

[Figure: number of plant species versus biomass, by soil pH level (low, mid, high).]

$$Y_i = \mu_i + \epsilon_i,\qquad \mu_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3},\qquad \epsilon_i \overset{ind}{\sim} \text{Normal}(0,\sigma^2),$$
where
$y$ = number of plant species,
$x_1$ = biomass,
$x_2 = 0$ and $x_3 = 0$ if pH="low",
$x_2 = 1$ and $x_3 = 0$ if pH="mid",
$x_2 = 0$ and $x_3 = 1$ if pH="high".
Parameter interpretation: without interaction

The expected number of plant species for a plot with biomass $x_1$ and pH level low ($x_2=0$ and $x_3=0$), mid ($x_2=1$ and $x_3=0$) or high ($x_2=0$ and $x_3=1$) is $\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3$.

There are two plots with the same pH level, but the biomass of the first plot is $x+\delta$ while the biomass of the second plot is $x$. As a result, the expected number of plant species for the first plot is $|\beta_1|\delta$ units lower, if $\beta_1 < 0$, or greater, if $\beta_1 > 0$, than that for the second plot:
$$\mu_{x+\delta} - \mu_x = \beta_0 + \beta_1(x+\delta) + \beta_2 x_2 + \beta_3 x_3 - (\beta_0 + \beta_1 x + \beta_2 x_2 + \beta_3 x_3) = \beta_1\delta.$$

There are two plots with the same biomass, but the pH level of the first plot is mid while the pH level of the second plot is low. As a result, the expected number of plant species for the first plot is $|\beta_2|$ units lower, if $\beta_2 < 0$, or greater, if $\beta_2 > 0$, than that for the second plot:
$$\mu_{\text{mid}} - \mu_{\text{low}} = \beta_0 + \beta_1 x_1 + \beta_2 - (\beta_0 + \beta_1 x_1) = \beta_2.$$

There are two plots with the same biomass, but the pH level of the first plot is high while the pH level of the second plot is low. As a result, the expected number of plant species for the first plot is $|\beta_3|$ units lower, if $\beta_3 < 0$, or greater, if $\beta_3 > 0$, than that for the second plot:
$$\mu_{\text{high}} - \mu_{\text{low}} = \beta_0 + \beta_1 x_1 + \beta_3 - (\beta_0 + \beta_1 x_1) = \beta_3.$$
Parameter interpretation: with interaction

The expected number of plant species for a plot with biomass $x_1$ and pH level low ($x_2=0$ and $x_3=0$), mid ($x_2=1$ and $x_3=0$) or high ($x_2=0$ and $x_3=1$) is $\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_1 x_2 + \beta_5 x_1 x_3$.

There are two plots with pH level low, but the biomass of the first plot is $x+\delta$ while the biomass of the second plot is $x$. Therefore, the expected number of plant species for the first plot is $|\beta_1|\delta$ units lower, if $\beta_1 < 0$, or greater, if $\beta_1 > 0$, than that for the second plot:
$$\mu_{x+\delta} - \mu_x = \beta_0 + \beta_1(x+\delta) - (\beta_0 + \beta_1 x) = \beta_1\delta.$$

There are two plots with pH level mid, but the biomass of the first plot is $x+\delta$ while the biomass of the second plot is $x$. Therefore, the expected number of plant species for the first plot is $|\beta_1+\beta_4|\delta$ units lower, if $\beta_1+\beta_4 < 0$, or greater, if $\beta_1+\beta_4 > 0$, than that for the second plot:
$$\mu_{x+\delta} - \mu_x = \beta_0 + \beta_1(x+\delta) + \beta_4(x+\delta) - (\beta_0 + \beta_1 x + \beta_4 x) = (\beta_1+\beta_4)\delta.$$

There are two plots with pH level high, but the biomass of the first plot is $x+\delta$ while the biomass of the second plot is $x$. Therefore, the expected number of plant species for the first plot is $|\beta_1+\beta_5|\delta$ units lower, if $\beta_1+\beta_5 < 0$, or greater, if $\beta_1+\beta_5 > 0$, than that for the second plot:
$$\mu_{x+\delta} - \mu_x = \beta_0 + \beta_1(x+\delta) + \beta_5(x+\delta) - (\beta_0 + \beta_1 x + \beta_5 x) = (\beta_1+\beta_5)\delta.$$
Richness of the species

> data(richness)
> fit <- lm(Species ~ Biomass + pH, data=richness)
> summary(fit)

The columns below are $\hat\beta_j$ (Estimate), $\hat\sigma\sqrt{c_j}$ (Std.Error), $\hat\beta_j/(\hat\sigma\sqrt{c_j})$ (t-value) and $\Pr[F_{(1,\,n-p-1)} > t_j^2]$ (Pr(>|t|)), for the tests $H_0: \beta_j = 0$ vs $H_1: \beta_j \neq 0$:

            Estimate Std.Error t-value Pr(>|t|)
(Intercept)  17.6248    0.7890   22.34   <2e-16
Biomass      -2.7276    0.1717  -15.89   <2e-16
pHmid        11.9782    1.0020   11.95   <2e-16
pHhigh       22.6200    1.0818   20.91   <2e-16

Residual standard error ($\hat\sigma$): 3.783 on 86 ($n-p-1$) degrees of freedom
Multiple R-squared ($R^2$): 0.8524, Adjusted R-squared ($R^2_{\text{adj}}$): 0.8472
F-statistic: 384.8-style test here is $F$ = 165.5 on 3 ($r$) and 86 ($n-p-1$) DF, p-value ($\Pr[F_{(r,\,n-p-1)} > F]$): < 2.2e-16, for $H_0: (\beta_1,\beta_2,\beta_3)^\top = 0$ vs $H_1: (\beta_1,\beta_2,\beta_3)^\top \neq 0$
Richness of the species

Is there an interaction between biomass and soil pH level? That is, does the effect on the expected number of plant species of an increase in biomass depend on the soil pH level?
$$Y_i = \mu_i + \epsilon_i,\qquad \mu_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \beta_4 x_{i1}x_{i2} + \beta_5 x_{i1}x_{i3},\qquad \epsilon_i \overset{ind}{\sim} \text{Normal}(0,\sigma^2)$$

> fit2 <- update(fit, formula=Species ~ Biomass*pH)
> anova(fit,fit2)

This compares $H_0: (\beta_4,\beta_5)^\top = 0$ vs $H_1: (\beta_4,\beta_5)^\top \neq 0$; Df is $r$ and Pr(>F) is $\Pr[F_{(r,\,n-p-1)} > F]$:

Model 1: Species ~ Biomass + pH
Model 2: Species ~ Biomass + pH + Biomass:pH
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1     86 1230.9
2     84 1224.7  2    6.2591 0.2147 0.8073
Richness of the species

Model matrix                  $X$                                                       model.matrix(fit)
Parameter estimates           $\hat\beta = (\hat\beta_0,\hat\beta_1,\ldots,\hat\beta_p)^\top$   coef(fit)
Residual sum of squares       $\sum_{i=1}^{n}(y_i - x_i^\top\hat\beta)^2$               sum(resid(fit)**2)
Log-likelihood                $\ell(\hat\beta)$                                         logLik(fit)
Confidence intervals          $\hat\beta_j \pm t^{(n-p-1)}_{1-\frac{\alpha}{2}}\hat\sigma\sqrt{c_j}$   confint(fit, level=1-α)
Degrees-of-freedom            $n-p-1$                                                   df.residual(fit)
Estimate of $\sigma^2$        $\hat\sigma^2$                                            summary(fit)$sigma**2
Estimate of Var($\hat\beta$)  $\hat\sigma^2(X^\top X)^{-1}$                             vcov(fit)
Fitted values                 $\hat\mu = X\hat\beta = Hy$                               fitted(fit)
Ordinary residuals            $r_i = y_i - x_i^\top\hat\beta$                           resid(fit)
R-squared                     $R^2$                                                     summary(fit)$r.squared
Adjusted R-squared            $R^2_{\text{adj}}$                                        summary(fit)$adj.r.squared
Gasoline yield

The analysis of the data in the object Auto of the library ISLR is aimed at explaining the differences between the vehicles with respect to their gasoline yield (mpg), in miles per gallon, using other vehicle characteristics such as their horsepower (horsepower) and their weight (weight), in kilograms.

[Figure: gasoline yield versus horsepower, with point size/shade indicating weight (low, mid, high).]

$$Y_i = \mu_i + \epsilon_i,\qquad \mu_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i1}x_{i2},\qquad \epsilon_i \overset{ind}{\sim} \text{Normal}(0,\sigma^2),$$
where
$y$ = gasoline yield,
$x_1$ = horsepower,
$x_2$ = weight.
Parameter interpretation: without interaction

The expected gasoline yield for a vehicle with horsepower $x_1$ and weight $x_2$ is $\beta_0 + \beta_1 x_1 + \beta_2 x_2$.

There are two vehicles with the same weight, but the horsepower of the first vehicle is $x+\delta$ while the horsepower of the second vehicle is $x$. As a result, the expected gasoline yield for the first vehicle is $|\beta_1|\delta$ miles per gallon lower, if $\beta_1 < 0$, or greater, if $\beta_1 > 0$, than that for the second vehicle:
$$\mu_{x+\delta} - \mu_x = \beta_0 + \beta_1(x+\delta) + \beta_2 x_2 - (\beta_0 + \beta_1 x + \beta_2 x_2) = \beta_1\delta.$$

There are two vehicles with the same horsepower, but the weight of the first vehicle is $x+\delta$ while the weight of the second vehicle is $x$. As a result, the expected gasoline yield for the first vehicle is $|\beta_2|\delta$ miles per gallon lower, if $\beta_2 < 0$, or greater, if $\beta_2 > 0$, than that for the second vehicle:
$$\mu_{x+\delta} - \mu_x = \beta_0 + \beta_1 x_1 + \beta_2(x+\delta) - (\beta_0 + \beta_1 x_1 + \beta_2 x) = \beta_2\delta.$$
Parameter interpretation: with interaction

The expected gasoline yield for a vehicle with horsepower $x_1$ and weight $x_2$ is $\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2$.

There are two vehicles with weight equal to $z$, but the horsepower of the first vehicle is $x+\delta$ while the horsepower of the second vehicle is $x$. As a result, the expected gasoline yield for the first vehicle is $|\beta_1+\beta_3 z|\delta$ miles per gallon lower, if $\beta_1+\beta_3 z < 0$, or greater, if $\beta_1+\beta_3 z > 0$, than that for the second vehicle:
$$\mu_{x+\delta} - \mu_x = \beta_0 + \beta_1(x+\delta) + \beta_2 z + \beta_3(x+\delta)z - (\beta_0 + \beta_1 x + \beta_2 z + \beta_3 xz) = (\beta_1+\beta_3 z)\delta.$$

There are two vehicles with horsepower equal to $z$, but the weight of the first vehicle is $x+\delta$ while the weight of the second vehicle is $x$. As a result, the expected gasoline yield for the first vehicle is $|\beta_2+\beta_3 z|\delta$ miles per gallon lower, if $\beta_2+\beta_3 z < 0$, or greater, if $\beta_2+\beta_3 z > 0$, than that for the second vehicle:
$$\mu_{x+\delta} - \mu_x = \beta_0 + \beta_1 z + \beta_2(x+\delta) + \beta_3 z(x+\delta) - (\beta_0 + \beta_1 z + \beta_2 x + \beta_3 zx) = (\beta_2+\beta_3 z)\delta.$$
Gasoline yield

> data(Auto,package="ISLR")
> Auto <- within(Auto, weight <- weight/1000)
>
> fit <- lm(mpg ~ horsepower + weight + horsepower:weight,
+           data=Auto)
> summary(fit)

The columns are $\hat\beta_j$, $\hat\sigma\sqrt{c_j}$, $\hat\beta_j/(\hat\sigma\sqrt{c_j})$ and $\Pr[F_{(1,\,n-p-1)} > t_j^2]$, for the tests $H_0: \beta_j = 0$ vs $H_1: \beta_j \neq 0$:

                    Estimate Std.Error t-value Pr(>|t|)
(Intercept)        63.557940  2.342949  27.127  < 2e-16
horsepower         -0.250838  0.027279  -9.195  < 2e-16
weight            -10.772411  0.773809 -13.921  < 2e-16
horsepower:weight   0.053554  0.006649   8.054 9.93e-15

Residual standard error ($\hat\sigma$): 3.93 on 388 ($n-p-1$) degrees of freedom
Multiple R-squared ($R^2$): 0.7484, Adjusted R-squared ($R^2_{\text{adj}}$): 0.7465
F-statistic: 384.8 on 3 ($r$) and 388 ($n-p-1$) DF, p-value ($\Pr[F_{(r,\,n-p-1)} > F]$): < 2.2e-16, for $H_0: (\beta_1,\beta_2,\beta_3)^\top = 0$ vs $H_1: (\beta_1,\beta_2,\beta_3)^\top \neq 0$
Gasoline yield

> fit2 <- update(fit, formula = mpg ~ horsepower + weight)
> anova(fit2,fit)

This compares $H_0: \beta_3 = 0$ vs $H_1: \beta_3 \neq 0$; Df is $r$ and Pr(>F) is $\Pr[F_{(r,\,n-p-1)} > F]$:

Model 1: mpg ~ horsepower + weight
Model 2: mpg ~ horsepower + weight + horsepower:weight
  Res.Df    RSS Df Sum of Sq      F    Pr(>F)
1    389 6993.8
2    388 5992.0  1    1001.8 64.871 9.934e-15

> fit3 <- update(fit, formula = mpg ~ 1)
> anova(fit3,fit)

This compares $H_0: (\beta_1,\beta_2,\beta_3)^\top = 0$ vs $H_1: (\beta_1,\beta_2,\beta_3)^\top \neq 0$:

Model 1: mpg ~ 1
Model 2: mpg ~ horsepower + weight + horsepower:weight
  Res.Df   RSS Df Sum of Sq     F  Pr(>F)
1    391 23819
2    388  5992  3     17827 384.8 <2.2e-16
Exercises

These data, available in the object races of the library glmtoolbox, consist of the record time, distance, and cumulative climb of 35 hill races in Scotland. The aim of the statistical analysis of these data is to explain the differences between the record times of the races (rtime), in minutes, using their differences in distance (distance), in miles, and cumulative climb (cclimb), in thousands of feet.

(a) Fit to the data a normal linear model where the record time of the races is the response variable, and the distance and cumulative climb are the explanatory variables.
(b) Assess at the significance level of 5% if the effect of the distance on the expected record time of the race depends on the cumulative climb.
(c) Interpret the parameter estimates except the intercept term.
(d) Estimate the expected record time, in minutes, of a race whose distance and cumulative climb are 7.5 miles and 1800 feet, respectively. Compute this estimate "manually" and using the predict() function.
Exercises

These data, available in the object whiteside of the library MASS, were collected to assess the effect of insulation on gas consumption. The weekly gas consumption (Gas), in thousands of cubic feet, and the average external temperature (Temp), in degrees Celsius, were recorded for 26 weeks before (Insul="Before") and for 30 weeks after (Insul="After") a cavity-wall insulation was installed in a house.

(a) Fit to the data a normal linear model where the gas consumption is the response variable, and the average external temperature and the presence/absence of the cavity-wall insulation are the explanatory variables.
(b) Assess at the significance level of 5% if the effect of the average external temperature on the expected gas consumption depends on the presence/absence of the wall insulation.
(c) Interpret the parameter estimates except the intercept term.
(d) Estimate the expected gas consumption, in thousands of cubic feet, in a week where the average external temperature is 5 degrees Celsius and the wall insulation is present. Compute this estimate "manually" and using the predict() function.
Variable selection
Hybrid forward stepwise

The algorithm starts with the simplest model, that is, a model with a linear predictor in which the coefficients of all variables and/or effects, except the intercept term, are zero. Next, the candidate models are constructed by hierarchically adding effects to the linear predictor (i.e., the main effects are first, then the two-way interactions, and so on), and their "relevance" is assessed by comparing the models with and without the added effects using a previously defined criterion. After an effect has been added to the model, any other effect previously added can be removed if, according to the specified criterion, that effect is no longer relevant. See, for example, James et al. [2013, page 210].

The criteria to compare models include the following:
    Akaike Information Criterion
    Bayesian Information Criterion
    Adjusted R²
    Predicted R²
    Mallows' CP
    p-value of the F test
Hybrid backward stepwise

The process begins with the most complex model (i.e., one that includes all variables and/or effects in its linear predictor) and then the candidate models are constructed by hierarchically removing effects from the linear predictor (i.e., the highest-level interactions are first, then the next level interactions, and so on). The "relevance" of the candidates is determined by comparing the models with and without the excluded effects using a previously established criterion. If an effect is excluded from the model, then any other effect previously excluded may be added back if, based on the specified criterion, that effect is no longer irrelevant. See, for example, James et al. [2013, page 210].

The criteria to compare models include the following:
    Akaike Information Criterion
    Bayesian Information Criterion
    Adjusted R²
    Predicted R²
    Mallows' CP
    p-value of the F test
Variable selection

[Figure: diagram of the candidate models visited by the forward and backward stepwise procedures.]
Variable selection

> help(stepCriterion.lm)
>
> stepCriterion(model,criterion,direction,levels,trace)

model: the object where the fitted model is stored.

criterion: the criterion specified to compare the candidate models. The available options are: AIC ("aic"), BIC ("bic"), adjusted R-squared ("adjr2"), predicted R-squared ("prdr2"), Mallows' CP ("cp") and p-value of the F test ("p-value"). As default, criterion is set to "bic".

direction: hybrid backward ("backward") or hybrid forward ("forward") procedures. By default, direction="forward".

levels: the pair of values in the interval (0, 1) at which the effects enter and leave the model, respectively, if criterion="p-value". By default, levels=c(0.05,0.05).

trace: logical switch indicating if the stepwise reports should be printed. As default, trace is set to TRUE.
Salaries for Professors

These data, available in the object Salaries of the library car, are the 2008-09 nine-month academic salaries for Assistant Professors, Associate Professors and Professors in a college in the U.S. The data were collected as part of the on-going effort of the college's administration to monitor salary differences between male and female faculty members.

rank: a factor with three levels: "AssocProf", "AsstProf" and "Prof".
discipline: a factor with two levels: "A" (theoretical departments) and "B" (applied departments).
yrs.since.phd: years since PhD.
yrs.service: years of service.
sex: a factor with two levels: "Female" and "Male".
salary: nine-month salary, in dollars.
Salaries for Professors
> data(Salaries, package="car")
> f <- lm(salary ~ rank*discipline*yrs.since.phd*yrs.service*sex,
+ data=Salaries)
>
> stepCriterion(f, direction="forward", criterion="bic")
"~ rank + discipline"
>
> stepCriterion(f, direction="backward", criterion="bic")
"~ rank + discipline*yrs.since.phd + discipline*yrs.service"
>
> stepCriterion(f, direction="forward", criterion="adjr2")
"~ rank + discipline + sex"
>
> stepCriterion(f, direction="forward", criterion="aic")
"~ rank + discipline"
>
> stepCriterion(f, direction="forward", criterion="prdr2")
"~ rank + discipline + sex"
>
> stepCriterion(f, direction="forward", criterion="cp")
"~ rank + discipline"
Diagnostic methods
Residual analysis

The vector of ordinary residuals, denoted here by $r = (r_1,\ldots,r_n)^\top$, where $r_i = Y_i - x_i^\top\hat\beta$, may be written as follows:
$$r = Y - X\hat\beta = (I-H)Y = (I-H)(X\beta + \epsilon) = (I-H)\epsilon.$$
That is, $r_i = \epsilon_i - \sum_{j=1}^{n}h_{ij}\epsilon_j$. According to this identity, $r$ and $\epsilon$ are only related by $H$. In the presence of sufficiently small $h_{ij}$'s, $r$ can substitute for $\epsilon$; otherwise its usefulness may be limited. Indeed,
$$\text{E}(r) = (I-H)\text{E}(\epsilon) = (I-H)0 = 0,$$
$$\text{Var}(r) = (I-H)\text{Var}(\epsilon)(I-H)^\top = \sigma^2(I-H)^2 = \sigma^2(I-H).$$
Therefore, $\text{E}(r_i) = 0$, $\text{Var}(r_i) = \sigma^2(1-h_{ii})$ and $\text{Cov}(r_i,r_j) = -\sigma^2 h_{ij}$. The above indicates that, unlike the random errors ($\epsilon_1,\ldots,\epsilon_n$), the ordinary residuals ($r_1,\ldots,r_n$) are neither independent nor identically distributed random variables.
Residual analysis

Internally Studentized residuals:
$$r^{\text{I}}_i = \frac{y_i - x_i^\top\hat\beta}{\hat\sigma\sqrt{1-h_{ii}}}.$$
Externally Studentized residuals:
$$r^{\text{E}}_i = \frac{y_i - x_i^\top\hat\beta}{\hat\sigma_{(i)}\sqrt{1-h_{ii}}},$$
where $\hat\sigma^2_{(i)}$ is the estimate of $\sigma^2$ derived from the dataset with the $i$-th observation excluded. According to Cook and Weisberg [1982, page 20], the distribution of $r^{\text{E}}_i$ is Student-$t(n-p-2)$ and it also can be expressed as follows:
$$r^{\text{E}}_i = r^{\text{I}}_i\left(\frac{n-p-2}{n-p-1-(r^{\text{I}}_i)^2}\right)^{\frac{1}{2}}.$$
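Both residual types are available in base R as rstandard() and rstudent(); the following sketch verifies the formulas above against them for any lm fit:

# Check the Studentized-residual formulas against base R.
res_check <- function(fit) {
  h <- hatvalues(fit); r <- resid(fit)
  df <- df.residual(fit)                  # n - p - 1
  s2 <- sum(r^2)/df
  rI <- r/sqrt(s2*(1 - h))                # internally Studentized
  rE <- rI*sqrt((df - 1)/(df - rI^2))     # externally Studentized
  c(internal = all.equal(rI, rstandard(fit)),
    external = all.equal(rE, rstudent(fit)))
}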
Normal probability plot with simulated envelope

See, for instance, Atkinson [1981].

Step 0: Set T (a positive integer) and $\alpha\in(0,1)$.
Step 1: Generate T independent samples of size $n$ according to $Y_i \overset{ind}{\sim} \text{Normal}(x_i^\top\hat\beta, \hat\sigma^2)$ for $i=1,\ldots,n$:

        Sample 1   Sample 2   Sample 3   ...   Sample T
        y_11       y_12       y_13       ...   y_1T
        ...        ...        ...        ...   ...
        y_n1       y_n2       y_n3       ...   y_nT

Step 2: Compute $\hat\beta_t = (X^\top X)^{-1}X^\top y_t$ and $\hat\sigma^2_t = \sum_{i=1}^{n}(y_{it} - x_i^\top\hat\beta_t)^2/(n-p-1)$ for $t=1,\ldots,T$, where $y_t = (y_{1t},\ldots,y_{nt})^\top$.
Step 3: Compute the residuals $r_{it}$ for $i=1,\ldots,n$ and $t=1,\ldots,T$:

        Sample 1   Sample 2   Sample 3   ...   Sample T
        r_11       r_12       r_13       ...   r_1T
        ...        ...        ...        ...   ...
        r_n1       r_n2       r_n3       ...   r_nT
Normal probability plot with simulated envelope

Step 4: Compute the order statistics of $r_{it}$ in each sample:

        Sample 1   Sample 2   Sample 3   ...   Sample T
        r_(1)1     r_(1)2     r_(1)3     ...   r_(1)T
        ...        ...        ...        ...   ...
        r_(n)1     r_(n)2     r_(n)3     ...   r_(n)T

Step 5: Compute the $100(\frac{\alpha}{2})$-th, 50th and $100(1-\frac{\alpha}{2})$-th percentiles of row $i$, denoted by $p^{(i)}_{\frac{\alpha}{2}}$, $p^{(i)}_{\frac{1}{2}}$ and $p^{(i)}_{1-\frac{\alpha}{2}}$, for $i=1,\ldots,n$.
Step 6: Plot $p^{(1)}_{\frac{\alpha}{2}},\ldots,p^{(n)}_{\frac{\alpha}{2}}$ versus the expected values of the order statistics in a sample of size $n$ of the standard normal distribution.
Step 7: Plot $p^{(1)}_{\frac{1}{2}},\ldots,p^{(n)}_{\frac{1}{2}}$ versus the expected values of the order statistics in a sample of size $n$ of the standard normal distribution.
Step 8: Plot $p^{(1)}_{1-\frac{\alpha}{2}},\ldots,p^{(n)}_{1-\frac{\alpha}{2}}$ versus the expected values of the order statistics in a sample of size $n$ of the standard normal distribution.
Step 9: Plot $r_1,\ldots,r_n$ versus the expected values of the order statistics in a sample of size $n$ of the standard normal distribution.
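A minimal sketch of Steps 0-9 for an lm fit; it uses externally Studentized residuals and ppoints() as an approximation to the expected normal order statistics (the envelope() function of glmtoolbox, used later, implements a more complete version):

# Simulated envelope sketch for a normal linear model.
envelope_sketch <- function(fit, T = 100, alpha = 0.05) {
  n <- nobs(fit); X <- model.matrix(fit)
  mu <- fitted(fit); s <- summary(fit)$sigma
  R <- replicate(T, sort(rstudent(lm(rnorm(n, mu, s) ~ X - 1))))        # Steps 1-4
  bands <- apply(R, 1, quantile, probs = c(alpha/2, 0.5, 1 - alpha/2))  # Step 5
  q <- qnorm(ppoints(n))                  # approximate expected order statistics
  plot(q, sort(rstudent(fit)), xlab = "Expected quantiles",
       ylab = "Observed quantiles", ylim = range(bands, rstudent(fit))) # Step 9
  matlines(q, t(bands), lty = c(2, 1, 2), col = 1)                      # Steps 6-8
}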
Influence or Sensitivity analysis

The Cook's distance, denoted here by $CD_i$, is a measure of the influence of the $i$-th observation on the estimate of $\beta$, which consists of the "distance" between $\hat\beta$ and $\hat\beta_{(i)}$ using a metric based on the matrix $\widehat{\text{Var}}(\hat\beta)$. The Cook's distance may be written as follows (see Cook [1977]):
$$CD_i = \frac{1}{p+1}\bigl(\hat\beta_{(i)} - \hat\beta\bigr)^\top\bigl[\widehat{\text{Var}}(\hat\beta)\bigr]^{-1}\bigl(\hat\beta_{(i)} - \hat\beta\bigr) = \frac{1}{\hat\sigma^2(p+1)}\bigl(\hat\beta_{(i)} - \hat\beta\bigr)^\top X^\top X\bigl(\hat\beta_{(i)} - \hat\beta\bigr) = \frac{1}{\hat\sigma^2(p+1)}\sum_{l=1}^{n}\bigl(x_l^\top\hat\beta_{(i)} - x_l^\top\hat\beta\bigr)^2$$
If $\hat\beta_{(i)}$ is replaced by $\hat\beta - \frac{r_i}{1-h_{ii}}(X^\top X)^{-1}x_i$, then $CD_i$ becomes
$$CD_i = \frac{1}{p+1}\,\frac{(y_i - x_i^\top\hat\beta)^2}{\hat\sigma^2(1-h_{ii})}\,\frac{h_{ii}}{1-h_{ii}} = \frac{1}{p+1}\,(r^{\text{I}}_i)^2\,\frac{h_{ii}}{1-h_{ii}}.$$
Influence or sensitivity analysis

(1) The matrix $\widehat{\text{Var}}(\hat\beta)$ is positive definite. Therefore, $CD_i \geq 0$, and $CD_i = 0$ if and only if $\hat\beta = \hat\beta_{(i)}$.
(2) If $r^{\text{I}}_i = 0$ (that is, if $y_i = \hat\mu_i$), then $\hat\beta_{(i)} = \hat\beta$ and $CD_i = 0$.
(3) For fixed $h_{ii}$, the value of the Cook's distance is a strictly increasing function of $(r^{\text{I}}_i)^2$.
(4) For fixed $(r^{\text{I}}_i)^2$, the value of the Cook's distance is a strictly increasing function of $h_{ii}$.

A univariate version of $CD_i$ is known as $\text{DFBETA}_{j(i)}$, and it is given by
$$\text{DFBETA}_{j(i)} = \hat\beta_j - \hat\beta_{j(i)},\quad j = 0,\ldots,p,$$
where $\hat\beta_{j(i)}$ is the element $(j+1)$ of $\hat\beta_{(i)}$. Similarly, $\text{DFBETAS}_{j(i)}$ is given by
$$\text{DFBETAS}_{j(i)} = \frac{\hat\beta_j - \hat\beta_{j(i)}}{\text{se}(\hat\beta_{j(i)})},\quad j = 0,\ldots,p.$$
Leverage points

The estimation of $\mu = X\beta$ may be written as $\hat\mu = X\hat\beta = Hy$, where $H$ is the orthogonal projection matrix. Thus,
$$\hat\mu_i = h_i^\top y = h_{i1}y_1 + \ldots + h_{in}y_n,\quad i = 1,\ldots,n,$$
where $h_i = (h_{i1},\ldots,h_{in})^\top$ represents the $i$-th row of $H$. Therefore (see Appendix A-2),
$$\frac{\partial\hat\mu_i}{\partial y_i} = h_{ii}\qquad\text{and}\qquad \frac{1}{n}\sum_{i=1}^{n}\frac{\partial\hat\mu_i}{\partial y_i} = \frac{1}{n}\text{trace}(H) = \frac{p+1}{n}.$$
As a result, "high" values of $h_{ii}$ indicate that $y_i$ has a relatively "large" impact on $\hat\mu_i$, meaning that there is relatively little information about the relationship between $y$ and $x$ when $x = x_i$. Thus, the accuracy of the estimation of $\mu$ when $x = x_i$ is lower than that for other values of $x$, as
$$\text{E}(\hat\mu_i) = \text{E}(x_i^\top\hat\beta) = x_i^\top\text{E}(\hat\beta) = \mu_i$$
and
$$\text{Var}(\hat\mu_i) = \text{Var}(x_i^\top\hat\beta) = x_i^\top\text{Var}(\hat\beta)x_i = \sigma^2 x_i^\top(X^\top X)^{-1}x_i = \sigma^2 h_{ii}.$$
Leverage points

If the model has an intercept, then (Cook and Weisberg [1982, page 12])
$$\frac{1}{n} \leq h_{ii} \leq \frac{1}{c},$$
where $c$ represents the number of rows equal to $x_i$ in $X$. In addition, if $h_{ii} = 1$ then $\hat\mu_i = y_i$.

For example, if $\mu_i = \beta x_i$ for $i=1,\ldots,n$ then
$$h_{ii} = \frac{x_i^2}{\sum_{j=1}^{n}x_j^2}.$$
Similarly, if $\mu_i = \beta_0 + \beta_1 x_i$ for $i=1,\ldots,n$ then
$$h_{ii} = \frac{1}{n} + \frac{(x_i-\bar{x})^2}{\sum_{j=1}^{n}(x_j-\bar{x})^2}.$$
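The simple-regression expression can be checked against hatvalues(); a quick sketch with simulated data:

# Verify the leverage formula for simple linear regression.
set.seed(4)
x <- runif(25); y <- 1 + 2*x + rnorm(25)
fit <- lm(y ~ x)
h_formula <- 1/25 + (x - mean(x))^2/sum((x - mean(x))^2)
all.equal(unname(hatvalues(fit)), h_formula)   # TRUE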
Testing the constant variance assumption

To assess the assumption of constant variance, the following heteroskedastic model is considered:
$$Y_i = \mu_i + \epsilon_i,\qquad \mu_i = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip},\qquad \epsilon_i \overset{ind}{\sim} \text{Normal}(0,\sigma_i^2),\qquad \log(\sigma_i^2) = \gamma_0 + \gamma_1 m_{i1} + \ldots + \gamma_q m_{iq},$$
where $m_i = (m_{i1},\ldots,m_{iq})^\top$ are the values of $q$ explanatory variables, quantitative and/or qualitative, measured on the $i$-th individual. The sets of variables $x_1,\ldots,x_p$ and $m_1,\ldots,m_q$ may or may not coincide. Thus, assessing the assumption of constant variance is equivalent to assessing the following hypothesis in the heteroskedastic model described above:
$$H_0: \gamma = 0\quad\text{versus}\quad H_1: \gamma \neq 0,$$
where $\gamma = (\gamma_1,\ldots,\gamma_q)^\top$. If $H_0$ is not rejected, then it is possible to conclude that there is no significant statistical evidence against the assumption of constant variance.
Testing the constant variance assumption

The decision rule is

"Reject $H_0$ at the approximate significance level of $100(\alpha)\%$ if $\xi > \chi^2_{1-\alpha}(q)$"

The statistic of the test is given by
$$\xi = \frac{1}{2}\,z^\top\tilde{M}(\tilde{M}^\top\tilde{M})^{-1}\tilde{M}^\top z,$$
where $\tilde{M} = \bigl(I - n^{-1}11^\top\bigr)M$, $M = (m_1,\ldots,m_n)^\top$ and $z = (z_1,\ldots,z_n)^\top$, in which
$$z_i = \frac{(y_i - \hat\mu_i)^2}{\bar\phi} - 1\qquad\text{and}\qquad \bar\phi = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat\mu_i)^2.$$
See Breusch and Pagan [1979], Cook and Weisberg [1983].
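A sketch of the statistic ξ, under the assumption that m₁,…,m_q are taken to be the model's own covariates (also the default behavior of the vdtest() function used later):

# Score test for constant variance, with M built from the model's covariates.
bp_score <- function(fit) {
  M <- model.matrix(fit)[, -1, drop = FALSE]     # m_1,...,m_q (drop intercept)
  z <- resid(fit)^2/mean(resid(fit)^2) - 1
  Mt <- scale(M, center = TRUE, scale = FALSE)   # (I - n^{-1} 1 1') M
  xi <- drop(t(z) %*% Mt %*% solve(crossprod(Mt)) %*% t(Mt) %*% z)/2
  c(statistic = xi, df = ncol(M),
    p.value = pchisq(xi, df = ncol(M), lower.tail = FALSE))
}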
Box-Tidwell transformation

Let's consider the following normal nonlinear model:
$$Y_i = \mu_i + \epsilon_i,\qquad \mu_i = \beta_0 + \beta_1 h_1(x_{i1}) + \ldots + \beta_r h_r(x_{ir}) + \beta_{r+1}x_{i(r+1)} + \ldots + \beta_p x_{ip},\qquad \epsilon_i \overset{ind}{\sim} \text{Normal}(0,\sigma^2),$$
where $x_{i1},\ldots,x_{ir}$ are the values of $r$ quantitative and strictly positive explanatory variables measured on the subject $i$, and
$$h_j(x) = \begin{cases} x^{\tau_j} & \text{if } \tau_j \neq 0,\\ \log(x) & \text{if } \tau_j = 0, \end{cases}\qquad j = 1,\ldots,r.$$
Hence, the interest parameter vector is $(\beta_0,\beta_1,\ldots,\beta_p,\tau_1,\ldots,\tau_r)^\top$. Box and Tidwell [1962] provide an efficient algorithm to estimate $\tau = (\tau_1,\ldots,\tau_r)^\top$ by the maximum likelihood method. Note that evaluation of the following hypothesis systems is of interest:
$$H_0: \tau_j = 1\quad\text{versus}\quad H_1: \tau_j \neq 1,$$
for $j = 1,\ldots,r$, and
$$H_0: (\tau_1,\ldots,\tau_r)^\top = 1\quad\text{versus}\quad H_1: (\tau_1,\ldots,\tau_r)^\top \neq 1.$$
Power generation by windmills

The analysis of these data, available in the object windmill of the library GLMsData, aims to explain the differences between the amounts of direct current (DC) output from windmills using the differences between wind velocities (Wind).

[Figure: amount of direct current versus wind velocity.]

$$Y_i = \mu_i + \epsilon_i,\qquad \mu_i = \beta_0 + \beta_1 x_i,\qquad \epsilon_i \overset{ind}{\sim} \text{Normal}(0,\sigma^2),$$
where
$y$ = amount of direct current output from windmills,
$x$ = wind velocity.

Is there a function $h(\cdot)$ such that the relationship between $y$ and $h(x)$ is approximately linear?
Box-Tidwell transformation

> args(BoxTidwell.lm)
> help(BoxTidwell.lm)
> data(windmill, package="GLMsData")
> fit <- lm(DC ~ Wind, data=windmill)
>
> BoxTidwell(fit, transf= ~ Wind)

The row reports $\hat\tau_j$ and the test of $H_0: \tau_j = 1$ versus $H_1: \tau_j \neq 1$:

     Estimate Std.Error   z-value   Pr(>|z|)
Wind -0.83334   0.15566 -11.77799 < 2.22e-16

Wald test for null hypothesis that all taus are 1 ($H_0: (\tau_1,\ldots,\tau_r)^\top = 1$ versus $H_1: (\tau_1,\ldots,\tau_r)^\top \neq 1$):
chi = 138.72 , df = 1 , Pr(>chi) = < 2.22e-16

>
> fit2 <- update(fit,formula = ~ I(Wind**(-0.833)))
> c(adjR2(fit),adjR2(fit2))
[1] 0.86904 0.98010
> c(AIC(fit),AIC(fit2))
[1] 2.67724 -44.43086
> c(BIC(fit),BIC(fit2))
[1] 6.33387 -40.77424
Richness of the species

The analysis of these data, available in the object richness of the library glmtoolbox, aims to explain the differences between plots as related to species richness (represented by a count of the number of plant species (richness)) using the differences between plots with respect to biomass (Biomass) and soil pH level: low (pH="low"), medium (pH="mid") and high (pH="high").

[Figure: number of plant species versus biomass, by soil pH level (low, mid, high).]

$$Y_i = \mu_i + \epsilon_i,\qquad \mu_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3},\qquad \epsilon_i \overset{ind}{\sim} \text{Normal}(0,\sigma^2),$$
where
$y$ = number of plant species,
$x_1$ = biomass,
$x_2 = 0$ and $x_3 = 0$ if pH="low",
$x_2 = 1$ and $x_3 = 0$ if pH="mid",
$x_2 = 0$ and $x_3 = 1$ if pH="high".
Richness of the species

> data(richness)
> fit <- lm(Species ~ Biomass + pH, data=richness)
> residuals2(fit,type="external",plot.it=TRUE,identify=3)

[Figure: externally Studentized residuals $r^{\text{E}}_i$ versus fitted values $\hat\mu$; observations #1, #3 and #18 are flagged.]
Richness of the species

> cbind(richness,mu=fitted(fit))[c(1,3,18),]
     pH   Biomass Species       mu
1  high 0.4692972      30 38.96482
3  high 2.0897785      44 34.54483
18 high 7.2420621      36 20.49156

> envelope(fit,rep=100,identify=3)

[Figure: normal probability plot of the residuals (observed versus expected quantiles) with simulated envelope; observations #3 and #18 are flagged.]
Richness of the species

> cd <- cooks.distance(fit)
> plot(cd, type="h", xlab="Observation Index", ylab="Cook's Distance")
> identify(cd, n=4)

[Figure: Cook's distance versus observation index; observations #1, #3, #18 and #20 stand out.]
Richness of the species

Which subset of the potentially influential observations (#1, #3, #18, #20) causes the largest change in the estimate of $\beta$ when removed from the data set?

> infl <- c(1,3,18,20)
> n <- length(infl)
> metr <- chol2inv(chol(vcov(fit)))
> hm <- 2**n - 1
> results <- matrix(0, nrow=hm, ncol=1)
> id <- 1; nfil <- vector()
> for(i in 1:n){
+   pos <- combn(n,i)
+   for(j in 1:ncol(pos)){
+     fitemp <- update(fit, subset=-infl[pos[,j]])
+     difs <- coef(fitemp)-coef(fit)
+     results[id,1] <- t(difs)%*%metr%*%(difs)
+     nfil <- c(nfil,paste(infl[pos[,j]], collapse=","))
+     id <- id + 1
+   }
+ }
> nfil[results==max(results)]
[1] "3,18,20"

The largest change in the estimate of $\beta$ occurs when observations #3, #18 and #20 are removed.
Richness of the species
> a <- round(summary(fit)$coefficients, digits=5); a
Estimate Std. Error t value Pr(>|t|)
(Intercept) 17.62484 0.78899 22.33861 0
Biomass -2.72758 0.17167 -15.88830 0
pHmid 11.97819 1.00200 11.95431 0
pHhigh 22.62002 1.08183 20.90895 0
>
> fit2 <- update(fit, subset=-c(3,18,20))
> b <- round(summary(fit2)$coefficients, digits=5); b
Estimate Std. Error t value Pr(>|t|)
(Intercept) 17.51424 0.66545 26.31922 0
Biomass -2.67778 0.14883 -17.99242 0
pHmid 11.91347 0.83936 14.19345 0
pHhigh 21.85045 0.92175 23.70550 0
>
> round(100*(b[,1:2]-a[,1:2])/abs(a[,1:2]), digits=3)
Estimate Std. Error
(Intercept) -0.628 -15.658
Biomass 1.826 -13.305
pHmid -0.540 -16.232
pHhigh -3.402 -14.797
Richness of the species

> h <- apply(svd(model.matrix(fit))$u**2,1,sum)
> plot(h, type="h", xlab="Observation Index", ylab=expression(h[ii]))
> identify(h, n=7)

[Figure: leverages $h_{ii}$ versus observation index; observations #1, #10, #11, #20, #30, #44 and #51 stand out.]

> richness[c(1,10,11,20,30,44,51), c("pH","Biomass")]
> boxplot(Biomass ~ pH, data=richness)
Richness of the species
> vdtest(fit)
Score test for varying dispersion parameter
Statistic = 11.07537
degrees of freedom = 3
p-value = 0.011325
> fit3 <- update(fit, subset=-c(18))
> vdtest(fit3)
Score test for varying dispersion parameter
Statistic = 5.29641
degrees of freedom = 3
p-value = 0.15134
> vdtest(fit3,varformula=~ Biomass + pH + Biomass:pH)
Score test for varying dispersion parameter
Statistic = 6.14185
degrees of freedom = 5
p-value = 0.29266
Lognormal distribution

If $Y \sim \text{Normal}(\mu,\sigma^2)$, then $T = \exp(Y) \sim \text{Lognormal}(\exp(\mu),\sigma^2)$.
If $T \sim \text{Lognormal}(\eta,\sigma^2)$, then $Y = \log(T) \sim \text{Normal}(\log(\eta),\sigma^2)$.
If $T \sim \text{Lognormal}(\eta,\sigma^2)$, then:
(1) The distribution of $T$ is strictly positive and right-skewed.
(2) The probability density function of $T$ is
$$f_T(t;\eta,\sigma^2) = \frac{1}{t\sqrt{2\pi\sigma^2}}\exp\left[-\frac{\bigl(\log(t)-\log(\eta)\bigr)^2}{2\sigma^2}\right]\text{I}(0,\infty).$$
(3) The mean, variance and coefficient of variation of $T$ are
$$\eta\exp\left(\frac{\sigma^2}{2}\right),\qquad \eta^2\exp\bigl(\sigma^2\bigr)\bigl[\exp\bigl(\sigma^2\bigr)-1\bigr]\qquad\text{and}\qquad \sqrt{\exp(\sigma^2)-1}.$$
(4) The distribution of $aT$ is $\text{Lognormal}(a\eta,\sigma^2)$ for every positive constant $a$.
Lognormal regression

(5) The $100(\alpha)$-th percentile of $T$ reduces to $Q_T(\alpha) = \eta\exp(\sigma Z_\alpha)$ for all $\alpha\in(0,1)$, where $Z_\alpha$ represents the $100(\alpha)$-th percentile of the Normal(0,1). For example, the median of $T$ is $\eta$.

The model
$$T_i = \eta_i\,\varepsilon_i,\qquad \log(\eta_i) = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip},\qquad \varepsilon_i \overset{ind}{\sim} \text{Lognormal}(1,\sigma^2)$$
is equivalent to
$$Y_i = \mu_i + \epsilon_i,\qquad \mu_i = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip},\qquad \epsilon_i \overset{ind}{\sim} \text{Normal}(0,\sigma^2),$$
where $Y_i = \log(T_i)$, $\mu_i = \log(\eta_i)$ and $\epsilon_i = \log(\varepsilon_i)$. Therefore, if $T_i \overset{ind}{\sim} \text{Lognormal}(\eta_i,\sigma^2)$, then
(1) $\text{E}(T_i) = \eta_i\exp(\sigma^2/2)$
(2) $\text{CV}(T_i) = \sqrt{\exp(\sigma^2)-1}$
(3) $Q_{T_i}(\alpha) = \eta_i\exp(\sigma Z_\alpha)$
Gasoline yield
The analysis of the data in the object Auto of the library ISLR is aimed at explaining the differences between the vehicles with respect to their gasoline yield (mpg), in miles per gallon, using other vehicle characteristics such as their horsepower (horsepower), their weight (weight), in pounds, and their origin (origin).

[Figure: log(gasoline yield) versus horsepower, grouped by weight (low, mid, high)]
T_i = η_i ε_i,
log(η_i) = β_0 + β_1 x_{i1} + β_2 x_{i2} + β_3 x_{i3} + β_4 x_{i4} + β_5 x_{i1} x_{i2},
ε_i ind∼ Lognormal(1, σ²)

t = Gasoline yield
x_1 = Horsepower
x_2 = Weight
x_3 = 0 and x_4 = 0 for America
x_3 = 1 and x_4 = 0 for Europe
x_3 = 0 and x_4 = 1 for Japan
98 / 430
Gasoline yield
> data(Auto, package="ISLR")
> str(Auto)
> summary(Auto)
>
> fit1 <- lm(mpg ~ horsepower*weight*origin, data=Auto)
> fit2 <- lm(log(mpg) ~ horsepower*weight*origin, data=Auto)
>
> stepCriterion(fit1, criterion="bic", trace=FALSE)$final
"~ weight + horsepower + origin + weight:horsepower"
> stepCriterion(fit2, criterion="bic", trace=FALSE)$final
"~ weight + horsepower + origin + weight:horsepower"
>
> fit1a <- update(fit1, formula= ~ weight*horsepower + origin)
> fit2a <- update(fit2, formula= ~ weight*horsepower + origin)
>
> adjR2(fit1a, fit2a, verbose=FALSE)[,4]
0.7517 0.8006
> c(AIC(fit1a), BIC(fit1a))
2184.288 2208.116
> c(AIC(fit2a), BIC(fit2a)) + 2*sum(log(Auto[,"mpg"]))
2070.695 2094.522
>
> summary(fit2a)
99 / 430
Lung capacity
In these data, available in the lungcap object of the library GLMsData, the lung capacity of 654 individuals is measured by their forced expiratory volume (FEV), in liters. The height (Ht), in inches, gender (Gender), and smoking habit (Smoke) of each individual were also recorded. The data analysis aims to explain the differences between lung capacities from the differences in height, gender, and smoking behavior.

[Figure: log(forced expiratory volume) versus height, by gender (F, M)]
T_i = η_i ε_i,
log(η_i) = β_0 + β_1 x_{i1} + β_2 x_{i2} + β_3 x_{i3},
ε_i ind∼ Lognormal(1, σ²)

t = Forced expiratory volume
x_1 = Height
x_2 = 0 if Gender="F" and x_2 = 1 if Gender="M"
x_3 = 0 for non-smokers and x_3 = 1 for smokers
100 / 430
Lung capacity
> data(lungcap, package="GLMsData")
> str(lungcap)
> summary(lungcap)
>
> fit1 <- lm(FEV ~ Ht*Gender*Smoke, data=lungcap)
> fit2 <- lm(log(FEV) ~ Ht*Gender*Smoke, data=lungcap)
>
> stepCriterion(fit1, criterion="aic", trace=FALSE)$final
"~ Ht + Gender + Ht:Gender"
> stepCriterion(fit2, criterion="aic", trace=FALSE)$final
"~ Ht + Gender"
>
> fit1a <- update(fit1, formula= ~ Ht + Gender + Ht:Gender)
> fit2a <- update(fit2, formula= ~ Ht + Gender)
>
> adjR2(fit1a, fit2a, verbose=FALSE)[,4]
0.7649 0.7958
> c(AIC(fit1a), BIC(fit1a))
728.5365 750.9521
> c(AIC(fit2a), BIC(fit2a)) + 2*sum(log(lungcap[,"FEV"]))
582.2044 600.1368
>
> summary(fit2a)
101 / 430
Generalized Linear Models
102 / 430
Motivation
Normal linear models are based on the following assumptions:
The response variable is continuous, symmetrical, and its
variance does not depend on its mean,
Covariates have additive effects on the mean of the
response.
There are many situations in which those assumptions are
incorrect. As an example, when:
The response is binary, a discrete proportion, or a count,
The response is continuous, but skewed to the right and
strictly positive,
Covariates act multiplicatively on the mean of the
response.
103 / 430
Motivation
Generalized Linear Models (GLM) extend the normal linear
model by allowing for
some non-normal response distributions, such as
Bernoulli : binary responses
Binomial : discrete proportions
Poisson : counts
Gamma, Inverse Gaussian : continuous and right skewed
an alternative function to the identity that connects the mean
of the response and the linear predictor, allowing for
effects other than additive effects to be accounted for.
Identity function : additive eects
Logarithmic function : multiplicative eects
104 / 430
Formulation

Random component: States the probability distribution assumed to describe the response, Y, which may be Normal, Bernoulli, Binomial, Poisson, Gamma, and Inverse Normal, among others.

Systematic component: Continuous and/or discrete explanatory variables in the linear predictor, that is,

η = β_0 + β_1 x_1 + … + β_p x_p

Link function: Continuous, strictly monotone, and twice differentiable function, g(·), which connects/links the mean of the response variable, µ, to the systematic component, that is,

g(µ) = β_0 + β_1 x_1 + … + β_p x_p
105 / 430
Examples
106 / 430
Liver cancer in female mice
These data, extracted from Zhang and Zelterman [1999] and available in the object liver of the library glmtoolbox, arose from an experiment where 8 groups of varying sizes (exposed) of female mice were continuously fed dietary concentrations of 2-Acetylaminofluorene (2-AA) (dose), in parts per 10⁴. These data consist of the number of liver neoplasms in mice (cancer) observed for 18 months.

[Figure: % of mice with liver cancer versus dose of 2-Acetylaminofluorene]

m_i Y_i ind∼ Binomial(m_i, µ_i),
g(µ_i) = β_0 + β_1 x_i

m = Number of exposed mice
y = Proportion of mice with cancer
x = Dose of 2-AA, in parts per 10⁴
107 / 430
Insecticides
These data, extracted from Hewlett and Plackett [1950] and available
in the object
deposit of the library GLMsData, arose from an experiment
in which groups of insects were exposed to deposits of insecticides. The
total number of insects exposed in each group (
Number), the number
of the insects killed after six days exposure (
Killed), and the amount
(
Deposit) and type of insecticide used (Insecticide) were recorded.
[Figure: % of killed insects versus amount of insecticide, in milligrams, by type (A, B, C)]

m_i Y_i ind∼ Binomial(m_i, µ_i),
g(µ_i) = β_0 + β_1 x_{i1} + β_2 x_{i2} + β_3 x_{i3}

m = Number of exposed insects
y = Proportion of killed insects
x_1 = Amount of insecticide, in mg.
x_2 = 0 and x_3 = 0 if Insecticide="A"
x_2 = 1 and x_3 = 0 if Insecticide="B"
x_2 = 0 and x_3 = 1 if Insecticide="C"
108 / 430
Burn injury study
This data set, discussed in Hosmer et al. [2013, page 27] and available in the object burn1000 of the library aplore3, is a subset of 1000 subjects treated between 2000 and 2007 at 40 different burn facilities. Available information includes patient demographics, total burn surface area and presence of inhalation injury. The outcome of interest is survival to hospital discharge. The data set includes

age (x_1). The patient's age at admission, in years.
gender (x_2). The patient's gender: "Female" and "Male".
race (x_3). The patient's race: "Non-White" and "White".
tbsa (x_4). Total burn surface area, in percentage.
inh_inj (x_5). Burn involved inhalation injury? "No" and "Yes".
flame (x_6). Flame involved in burn injury? "No" and "Yes".
death (y). Hospital discharge status: "Alive" and "Dead".

Y_i ind∼ Bernoulli(µ_i),
g(µ_i) = β_0 + β_1 x_{i1} + … + β_6 x_{i6}
109 / 430
Dilution assay
These data, extracted from Ridout (1990) and available in the object dilution of the library glmtoolbox, are counts of virus particles (Count) at different dilutions (Dilution). The plot of the data indicates that the location as well as the dispersion of the counts of virus particles increase as the dilution volume increases.

[Figure: count of virus particles versus dilution volume]

Y_i ind∼ Poisson(µ_i),
µ_i = β_0 x_i^{β_1}

y = Count of virus particles.
x = Dilution volume.
110 / 430
Skin cancer in women
These data, extracted from Kleinbaum et al. [2013, page 751] and available in the object skincancer of the library glmtoolbox, describe the number of non-melanoma skin cancer cases (cases) among women (population) stratified by age (age: "15-24", "25-34", "35-44", "45-54", "55-64", "65-74", "75-84", "85+") in two metropolitan areas of the US (city: "St. Paul", in Minneapolis, and "Fort Worth", in Dallas).

[Figure: incidence rate of skin cancer ×10⁴ by age range, for St. Paul and Fort Worth]

Y_i ind∼ Poisson(µ_i),
log(10⁴ µ_i/λ_i) = β_0 + β_1 x_{i1} + β_2 x_{i2}

y = Number of cases of skin cancer.
λ = Number of women exposed.
x_1 = 1 if city="Fort Worth" and x_1 = 0 if city="St. Paul".
x_2 = Midpoint of age range.
111 / 430
Advertising
This data set, extracted from James et al. [2013, page 15] and available in the object advertising of the library glmtoolbox, consists of the sales (sales), in thousands of units, of a product in 200 different markets, along with advertising budgets, in thousands of dollars, for the product in each of those markets for television (TV) and radio (radio).

[Figure: sales versus advertising budget for TV, by level of the advertising budget for radio (low, mid, high)]

Y_i ind∼ Normal(µ_i, σ²),
log(µ_i) = β_0 + β_1 log(x_{i1}) + β_2 x_{i2} + β_3 log(x_{i1}) x_{i2}

y = Sales, in thousands of units.
x_1 = Advertising budget for TV.
x_2 = Advertising budget for radio.
112 / 430
Atlantic Croakers
These data, available in the object Croaker2 of the library FSAdata,
describe the age-at-capture (
age), in years, assigned by using sagit-
tal otoliths, and the total length (
tl), in millimeters, for 204 females
(
sex=“F”) and 114 males (sex=“M”) belonging to a species of fish better
known as Atlantic Croaker.
[Figure: total length at capture versus age-at-capture, by sex]

Y_i ind∼ Normal(µ_i, σ²),
log(µ_i) = β_0 + β_1 log(x_{i1}) + β_2 x_{i2}

y = Total length at capture, in mm
x_1 = Age-at-capture, in years
x_2 = 1 if sex="M" and x_2 = 0 if sex="F"
113 / 430
Hardened steel

These data, extracted from McCool [1980] and available in the object Steel of the library glmtoolbox, consist of the failure times (life) for hardened steel specimens in a 4-ball rolling contact fatigue test. Ten independent observations were taken at each of the four values of contact stress (stress), in pounds per square inch ×10⁶.

[Figure: time until the specimen failed versus contact stress, in psi ×10⁶]

Y_i ind∼ Gamma(µ_i, σ²),
log(µ_i) = β_0 + β_1 x_i

y = Time until the specimen failed
x = Contact stress
114 / 430
Gasoline yield
This data set, discussed in James et al. [2013, page 92] and available in the object Auto of the library ISLR, includes acceleration, horsepower, fuel efficiency, weight, and other features of different makes and models of cars. The outcome of interest is fuel consumption, in miles per gallon. The variables in this data set are the following:

mpg (y). Fuel consumption, in miles per gallon.
cylinders (x_1). Number of cylinders.
displacement (x_2). Engine displacement, in cubic inches.
horsepower (x_3). Engine horsepower.
weight (x_4). Weight, in pounds.
acceleration (x_5). Time, in seconds, to accelerate from 0 to 60 miles per hour.
origin (x_6, x_7). Origin of the vehicle: American (1), European (2) and Japanese (3).

Y_i ind∼ Gamma(µ_i, σ²),
g(µ_i) = β_0 + β_1 x_{i1} + … + β_7 x_{i7}
115 / 430
Model Formulation
116 / 430
Model formulation

Y_i ind∼ LEF(µ_i, φ/ω_i),
g(µ_i) = β_0 + β_1 x_{i1} + … + β_p x_{ip} = x_iᵀβ

Then, E(Y_i) = µ_i and Var(Y_i) = (φ/ω_i) V(µ_i), where

β = (β_0, β_1, …, β_p)ᵀ is the unknown interest parameter vector.
x_i = (1, x_{i1}, …, x_{ip})ᵀ, where (x_{i1}, …, x_{ip}) are the values of p explanatory variables, quantitative and/or qualitative, measured on the i-th individual.
φ > 0 is the unknown dispersion parameter.
ω_1, …, ω_n are known positive weights.
V(·) is the variance function, which describes the way in which Var(Y_i) is assumed to depend on µ_i.
g(·) is the link function, which is assumed to be known, strictly monotone, and twice differentiable.
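In R, these three ingredients (distribution, variance function, and link) are bundled together in a family object; the following minimal sketch (the numeric values are only illustrative) shows how to inspect them:

> fam <- Gamma(link="log")
> fam$variance(2)   # V(mu) = mu^2, so this returns 4
> fam$linkfun(2)    # g(mu) = log(mu)
> fam$linkinv(0.7)  # g^{-1}(eta) = exp(eta)
> fam$mu.eta(0.7)   # d mu/d eta = 1/g'(mu), used by the estimation algorithms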
117 / 430
Random Component
118 / 430
Linear exponential family (Continuous case)
The probability density function of Y ∼ LEF(µ, φ), where µ is its mean and φ > 0 is its dispersion parameter, may be written as

f_Y(y; µ, φ) = C(y, φ) exp{ (1/φ)[ yθ(µ) − b(θ(µ)) ] },

for some known functions θ(·), b(·) and C(·, ·). By definition,

∫ f_Y(y; µ, φ) dy = 1  ⟹  ∫ C(y, φ) exp(yθ/φ) dy = exp(b(θ)/φ).

Thus, the moment generating function of Y reduces to

M_Y(t) = E[exp(tY)] = ∫ C(y, φ) exp{ (1/φ)[ y(θ + tφ) − b(θ) ] } dy
       = exp(−b(θ)/φ) ∫ C(y, φ) exp{ (1/φ) y(θ + tφ) } dy
       = exp{ (1/φ)[ b(θ + tφ) − b(θ) ] },

where the values of t are those for which b(θ + tφ) exists.
119 / 430
Linear exponential family (Discrete case)
The probability mass function of Y ∼ LEF(µ, φ), where µ is its mean and φ > 0 is its dispersion parameter, may be written as

f_Y(y; µ, φ) = C(y, φ) exp{ (1/φ)[ yθ(µ) − b(θ(µ)) ] },

for some known functions θ(·), b(·) and C(·, ·). By definition,

Σ f_Y(y; µ, φ) = 1  ⟹  Σ C(y, φ) exp(yθ/φ) = exp(b(θ)/φ).

Thus, the moment generating function of Y reduces to

M_Y(t) = E[exp(tY)] = Σ C(y, φ) exp{ (1/φ)[ y(θ + tφ) − b(θ) ] }
       = exp(−b(θ)/φ) Σ C(y, φ) exp{ (1/φ) y(θ + tφ) }
       = exp{ (1/φ)[ b(θ + tφ) − b(θ) ] },

where the values of t are those for which b(θ + tφ) exists.
120 / 430
Linear exponential family
Therefore,

E(Y) = M_Y^(1)(0) = b′(θ),
E(Y²) = M_Y^(2)(0) = φ b″(θ) + [b′(θ)]²,

and Var(Y) = M_Y^(2)(0) − [M_Y^(1)(0)]² = φ V(µ), where the derivatives of M_Y(t) are evaluated at t = 0 and V(·) ≡ b″[θ(·)] is known as the variance function.

If Ỹ = (Y − µ)/√φ, then its moment generating function becomes

M_Ỹ(t) = exp(−tµ/√φ) M_Y(t/√φ) = exp{ −tµ/√φ + (1/φ)[ b(θ + t√φ) − b(θ) ] }.

The third-order Taylor series expansion of log[M_Ỹ(t)] around t = 0 may be written as follows:

log[M_Ỹ(t)] = (1/2) b″(θ) t² + (1/6) b‴(θ + t*√φ) √φ t³
            = (1/2) V(µ) t² + (1/6) b‴(θ + t*√φ) √φ t³,
121 / 430
Linear exponential family
where t* = (1 − δ)t for some δ ∈ (0, 1). Therefore,

lim_{φ→0} log[M_Ỹ(t)] = (1/2) V(µ) t²  ⟹  lim_{φ→0} M_Ỹ(t) = exp[ (1/2) V(µ) t² ].

Then, the moment generating function of Ỹ converges to that of the Normal(0, V(µ)). That is (see, for instance, Jorgensen [1987]),

(Y − µ)/√φ  →D  Normal(0, V(µ)) as φ → 0,

where →D denotes convergence in distribution as φ → 0.
122 / 430
Special Cases
123 / 430
Binomial distribution
Let mY ∼ Binomial(m, µ), where m is a known positive integer and µ ∈ (0, 1). The p.m.f. of Y may be written as

f_Y(y; µ, m) = C(m, my) µ^{my} (1 − µ)^{m(1−y)} I_{my}{0, 1, …, m}
             = exp{ m[ y log(µ/(1 − µ)) + log(1 − µ) ] } C(m, my) I_{my}{0, 1, …, m},

where C(m, my) denotes the binomial coefficient "m choose my". Then, θ = log[µ/(1 − µ)], b(θ) = log(1 + e^θ), V(µ) = µ(1 − µ) and φ = m⁻¹. The mean, variance, skewness and excess kurtosis of Y are

µ,   µ(1 − µ)/m,   (1 − 2µ)/√(µ(1 − µ)m)   and   [1 − 6µ(1 − µ)]/[µ(1 − µ)m],

respectively. In fact, the m.g.f. of Y reduces to

M_Y(t) = { 1 + µ[exp(t/m) − 1] }^m.

The binomial distribution tends to the Gaussian when m → ∞, that is, for "large" values of m we have Y ≈̇ Normal(µ, m⁻¹µ(1 − µ)).
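This Gaussian limit can be inspected numerically; a minimal sketch (m and µ are illustrative), comparing the c.d.f. of Y = Z/m, with Z ∼ Binomial(m, µ), against its normal approximation:

> m <- 50; mu <- 0.3
> z <- 0:m
> max(abs(pbinom(z, size=m, prob=mu) -
+         pnorm((z/m - mu)/sqrt(mu*(1-mu)/m))))  # shrinks as m grows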
124 / 430
Probability mass function of Binomial(m, µ) for m = 10
[Figure: p.m.f. of Binomial(10, µ) for µ = 0.1, 0.2, 0.4, 0.6, 0.8, 0.9]
Binary responses and discrete proportions
125 / 430
R codes
> library(ggplot2)
>
> size <- 10
> mus <- c(0.1,0.2,0.4,0.6,0.8,0.9)
> binom <- expand.grid(mu=mus,x=seq(from=0,to=1,by=1/size))
> binom <- within(binom,{prob <- dbinom(x=size*x,size=size,prob=mu)
+ cdf <- pbinom(q=size*x,size=size,prob=mu)
+ mu <- factor(mu,labels=paste("mu==",mus))
+ })
>
> dev.new()
> ggplot(binom) +
+ geom_segment(aes(x=x,xend=x,y=0,yend=prob),col="red",linewidth=1.25) +
+ labs(x="y",y=bquote(~f[Y]~"(y;m,"~mu~")")) +
+ facet_wrap(vars(mu),ncol=2,scale="free_x",labeller=label_parsed)
>
> dev.new()
> ggplot(binom) +
+ geom_segment(aes(x=x,xend=x+0.1,y=cdf,yend=cdf),col="red",linewidth=1.25) +
+ labs(x="y",y=bquote(~F[Y]~"(y;m,"~mu~")")) +
+ facet_wrap(vars(mu),ncol=2,scale="free_x",labeller=label_parsed)
>
126 / 430
Binomial regression
m_i Y_i ind∼ Binomial(m_i, µ_i),
g(µ_i) = β_0 + β_1 x_{i1} + … + β_p x_{ip}

E(Y_i) = µ_i and Var(Y_i) = m_i⁻¹ µ_i(1 − µ_i)

Logit link: g(µ_i) = log[ µ_i/(1 − µ_i) ]
Probit link: g(µ_i) = Φ⁻¹(µ_i)
Complementary log-log link: g(µ_i) = log[ −log(1 − µ_i) ]
Cauchy link: g(µ_i) = tan[ (π/2)(2µ_i − 1) ]

In this model ω_i = m_i for i = 1, …, n.
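All four links are available through R's binomial() family; a minimal sketch evaluating each at an illustrative µ = 0.8:

> binomial("logit")$linkfun(0.8)    # log(0.8/0.2)
> binomial("probit")$linkfun(0.8)   # qnorm(0.8)
> binomial("cloglog")$linkfun(0.8)  # log(-log(1 - 0.8))
> binomial("cauchit")$linkfun(0.8)  # tan((pi/2)*(2*0.8 - 1))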
127 / 430
Binomial model for m = 25
[Figure: binomial observations (m = 25) over x ∈ (0, 5) and the fitted curve
µ = exp(−4 + 1.6x) / [1 + exp(−4 + 1.6x)]]
128 / 430
Poisson distribution
Let Y ∼ Poisson(µ), where µ > 0. The probability mass function of Y may be written as follows:

f_Y(y; µ) = [ exp(−µ) µ^y / y! ] I_y{0, 1, 2, …} = exp[ y log(µ) − µ ] I_y{0, 1, 2, …} / y!

Therefore, θ = log(µ), b(θ) = exp(θ), V(µ) = µ and φ = 1. The mean, variance, skewness and excess kurtosis of Y are

µ,   µ,   µ^{−1/2}   and   µ^{−1},

respectively. Indeed, the moment generating function of Y reduces to

M_Y(t) = exp[ µ(e^t − 1) ].
129 / 430
Poisson as the limit of Binomial distribution
lim_{m→∞, π→0, mπ→µ} Binomial(m, π)
= lim C(m, y) π^y (1 − π)^{m−y} I_y{0, 1, …, m}
= lim [ m(m − 1)⋯(m − y + 1) / y! ] [ π/(1 − π) ]^y (1 − mπ/m)^m I_y{0, 1, …, m}
= lim [ m(m − 1)⋯(m − y + 1) / m^y ] (1/y!) [ mπ/(1 − π) ]^y (1 − mπ/m)^m I_y{0, 1, …, m}
= [ µ^y exp(−µ) / y! ] I_y{0, 1, …}
= Poisson(µ),

since the product of y terms m(m − 1)⋯(m − y + 1)/m^y → 1, mπ/(1 − π) → µ and (1 − mπ/m)^m → exp(−µ).
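The convergence can also be checked numerically; a minimal sketch with µ = 2 fixed and π = µ/m:

> mu <- 2; y <- 0:6
> for (m in c(10, 100, 10000))
+   print(max(abs(dbinom(y, size=m, prob=mu/m) - dpois(y, mu))))  # -> 0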
130 / 430
Probability mass function of Poisson(µ)
[Figure: p.m.f. of Poisson(µ) for µ = 1.0, 1.5, 2.0, 2.5, 3.0, 3.5]
Counts of some relatively rare event.
131 / 430
R codes
> library(ggplot2)
>
> rango <- c(0,8)
> mus <- seq(from=1,to=3.5,by=0.5)
> pois <- expand.grid(mu=mus,x=seq(from=rango[1],to=rango[2],by=1))
> pois <- within(pois,{prob <- dpois(x=x,lambda=mu)
+ cdf <- ppois(q=x,lambda=mu)
+ mu <- factor(mu,labels=paste("mu==",mus))
+ })
>
> dev.new()
> ggplot(pois) +
+ geom_segment(aes(x=x,xend=x,y=0,yend=prob),col="red",linewidth=1.25) +
+ labs(x="y",y=bquote(~f[Y]~"(y;"~mu~")")) +
+ facet_wrap(vars(mu),ncol=2,scale="free_x",labeller=label_parsed)
>
> dev.new()
> ggplot(pois) +
+ geom_segment(aes(x=x,xend=x+1,y=cdf,yend=cdf),col="red",linewidth=1.25) +
+ labs(x="y",y=bquote(~F[Y]~"(y;"~mu~")")) +
+ facet_wrap(vars(mu),ncol=2,scale="free_x",labeller=label_parsed)
>
132 / 430
Poisson regression
Y_i ind∼ Poisson(µ_i),
g(µ_i) = β_0 + β_1 x_{i1} + … + β_p x_{ip}

E(Y_i) = µ_i and Var(Y_i) = µ_i

Logarithmic link: g(µ_i) = log(µ_i)
Identity link: g(µ_i) = µ_i
Square-root link: g(µ_i) = √µ_i
133 / 430
Poisson model
[Figure: Poisson observations at x = 0, 1, …, 4 and the fitted curve µ = exp(0.5 + 0.6x)]
134 / 430
Normal distribution
Let Y ∼ Normal(µ, σ²), where µ ∈ R and σ > 0. The probability density function of Y may be written as follows:

f_Y(y; µ, σ²) = (2πσ²)^{−1/2} exp[ −(y − µ)²/(2σ²) ]
             = (2πσ²)^{−1/2} exp[ −(y² − 2yµ + µ²)/(2σ²) ]
             = exp{ (1/σ²)[ yµ − µ²/2 ] } exp[ −y²/(2σ²) ] / √(2πσ²).

Therefore, θ = µ, b(θ) = θ²/2, V(µ) = 1 and φ = σ². The mean, variance, skewness and excess kurtosis of Y are

µ,   σ²,   0   and   0,

respectively. Indeed, the moment generating function of Y reduces to

M_Y(t) = exp( µt + σ²t²/2 ).
135 / 430
Probability density function of Normal(0, σ²)
[Figure: p.d.f. for σ² = 0.1, 0.2, 0.3, 0.6, 0.9, 1.2]
Continuous and symmetric responses
136 / 430
R codes
> library(ggplot2)
>
> mu <- 0
> s2 <- c(0.1,0.2,0.3,0.6,0.9,1.2)
> x <- seq(from=-4,to=4,length=200)
> norm <- expand.grid(mu=mu,s2s=s2,x=x)
> norm <- within(norm,{pdf <- dnorm(x=x,mean=mu,sd=sqrt(s2s))
+ cdf <- pnorm(q=x,mean=mu,sd=sqrt(s2s))
+ s2s <- factor(s2s,labels=paste("sigma**2==",s2))
+ })
>
> dev.new()
> ggplot(norm) +
+ geom_line(aes(x=x,y=pdf),col="red",linewidth=1.1) +
+ labs(x="y",y=bquote(~f[Y]~"(y;"~mu~","~sigma**2~")")) +
+ facet_wrap(vars(s2s),ncol=2,scale="free_x",labeller=label_parsed)
>
> dev.new()
> ggplot(norm) +
+ geom_line(aes(x=x,y=cdf),col="red",linewidth=1.1) +
+ labs(x="y",y=bquote(~F[Y]~"(y;"~mu~","~sigma**2~")")) +
+ facet_wrap(vars(s2s),ncol=2,scale="free_x",labeller=label_parsed)
>
137 / 430
Normal regression
Y_i ind∼ Normal(µ_i, σ²/ω_i),
g(µ_i) = β_0 + β_1 x_{i1} + … + β_p x_{ip}

or, equivalently,

Y_i = µ_i + ε_i,
g(µ_i) = β_0 + β_1 x_{i1} + … + β_p x_{ip},
ε_i ind∼ Normal(0, σ²/ω_i)

E(Y_i) = µ_i and Var(Y_i) = σ²/ω_i

Inverse link: g(µ_i) = 1/µ_i
Logarithmic link: g(µ_i) = log(µ_i)
Identity link: g(µ_i) = µ_i

If ω_1 = ω_2 = … = ω_n, then the dispersion parameter is constant, which implies that the variance is also constant.
138 / 430
Normal model for σ² = 0.4
[Figure: simulated data and the fitted line µ = 1.5 + 1.4x]
139 / 430
Gamma distribution
Let Y ∼ Gamma(µ, σ²), where µ > 0 and σ > 0. The probability density function of Y may be written as follows:

f_Y(y; µ, σ²) = [ 1/Γ(1/σ²) ] (σ²µ)^{−1/σ²} y^{1/σ² − 1} exp[ −y/(σ²µ) ] I_y(0, ∞)
             = exp{ (1/σ²)[ −y/µ − log(µ) ] } y^{1/σ² − 1} / [ Γ(1/σ²)(σ²)^{1/σ²} ] I_y(0, ∞).

Therefore, θ = −1/µ, b(θ) = −log(−θ), φ = σ² and V(µ) = µ². As special cases we have the Exponential (µ = λ⁻¹, σ² = 1) and Chi-squared (µ = ν, σ² = 2ν⁻¹) distributions. The mean, variance, skewness and excess kurtosis of Y are

µ,   σ²µ²,   2σ   and   6σ²,

respectively. Indeed, the m.g.f. of Y reduces to M_Y(t) = (1 − σ²µt)^{−1/σ²}.

The Gamma distribution tends to the Gaussian when σ² → 0, that is, for "small" values of σ² we have Y ≈̇ Normal(µ, σ²µ²).
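In R, dgamma() and rgamma() use the shape/scale parameterization; under the (µ, σ²) parameterization above, shape = 1/σ² and scale = µσ² (this mapping is used in the R codes below). A quick simulation check with illustrative values:

> set.seed(1)
> mu <- 1; s2 <- 0.35
> y <- rgamma(1e6, shape=1/s2, scale=mu*s2)
> c(mean(y), mu)              # E(Y) = mu
> c(var(y), s2*mu^2)          # Var(Y) = sigma^2*mu^2
> c(sd(y)/mean(y), sqrt(s2))  # coefficient of variation = sigma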
140 / 430
Probability density function of Gamma(1, σ²)
[Figure: p.d.f. for σ² = 0.08, 0.15, 0.35, 0.65, 0.9, 1.2]
Continuous, strictly positive, and right-skewed responses
141 / 430
R codes
> library(ggplot2)
>
> mu <- 1
> s2 <- c(0.08,0.15,0.35,0.65,0.9,1.2)
> x <- seq(from=0.01,to=2.5,length=200)
> gama <- expand.grid(mu=mu,s2s=s2,x=x)
> gama <- within(gama,{pdf <- dgamma(x=x,shape=1/s2s,scale=mu*s2s)
+ cdf <- pgamma(q=x,shape=1/s2s,scale=mu*s2s)
+ s2s <- factor(s2s,labels=paste("sigma**2==",s2))})
>
> dev.new()
> ggplot(gama) +
+ geom_line(aes(x=x,y=pdf),col="red",linewidth=1.1) +
+ labs(x="y",y=bquote(~f[Y]~"(y;"~mu~","~sigma**2~")")) +
+ facet_wrap(vars(s2s),ncol=2,scale="free_x",labeller=label_parsed)
>
> dev.new()
> ggplot(gama) +
+ geom_line(aes(x=x,y=cdf),col="red",linewidth=1.1) +
+ labs(x="y",y=bquote(~F[Y]~"(y;"~mu~","~sigma**2~")")) +
+ facet_wrap(vars(s2s),ncol=2,scale="free_x",labeller=label_parsed)
>
142 / 430
Gamma regression
Y_i ind∼ Gamma(µ_i, σ²/ω_i),
g(µ_i) = β_0 + β_1 x_{i1} + … + β_p x_{ip}

or, equivalently,

Y_i = µ_i ε_i,
g(µ_i) = β_0 + β_1 x_{i1} + … + β_p x_{ip},
ε_i ind∼ Gamma(1, σ²/ω_i)

E(Y_i) = µ_i and Var(Y_i) = (σ²/ω_i) µ_i²

Inverse link: g(µ_i) = 1/µ_i
Logarithmic link: g(µ_i) = log(µ_i)
Identity link: g(µ_i) = µ_i

If ω_1 = ω_2 = … = ω_n, then the dispersion parameter is constant, which implies that the coefficient of variation is also constant.
143 / 430
Gamma model for σ² = 0.4
[Figure: simulated data and the fitted curve µ = 1/(0.5 + 0.4x)]
144 / 430
Inverse-Gaussian distribution
Let Y ∼ Inverse Gaussian(µ, σ²), where µ > 0 and σ > 0. The probability density function of Y may be written as follows:

f_Y(y; µ, σ²) = (2πσ²y³)^{−1/2} exp[ −(y − µ)²/(2σ²µ²y) ] I_y(0, ∞)
             = exp{ (1/σ²)[ −y/(2µ²) + 1/µ ] } exp[ −1/(2σ²y) ] (2πσ²y³)^{−1/2} I_y(0, ∞).

Therefore, θ = −1/(2µ²), b(θ) = −√(−2θ), φ = σ² and V(µ) = µ³. The mean, variance, skewness and excess kurtosis of Y are

µ,   σ²µ³,   3σ√µ   and   15σ²µ,

respectively. Indeed, the m.g.f. of Y reduces to

M_Y(t) = exp{ [1/(σ²µ)][ 1 − √(1 − 2σ²µ²t) ] }.

The Inverse Gaussian distribution tends to the Gaussian when σ² → 0, that is, for "small" values of σ² we have Y ≈̇ Normal(µ, σ²µ³).
145 / 430
Probability density function of Inverse-Gaussian(1, σ²)
[Figure: p.d.f. for σ² = 0.025, 0.1, 0.5, 1, 2, 4]
Continuous, strictly positive, and right-skewed responses
146 / 430
R codes
> library(ggplot2)
> library(statmod)
>
> mu <- 1
> s2 <- c(0.025,0.1,0.5,1,2,4)
> x <- seq(from=0.01,to=2.5,length=200)
> invg <- expand.grid(mu=mu,s2s=s2,x=x)
> invg <- within(invg,{pdf <- dinvgauss(x=x,mean=mu,dispersion=s2s)
+ cdf <- pinvgauss(q=x,mean=mu,dispersion=s2s)
+ s2s <- factor(s2s,labels=paste("sigma**2==",s2))})
>
> dev.new()
> ggplot(invg) +
+ geom_line(aes(x=x,y=pdf),col="red",linewidth=1.1) +
+ labs(x="y",y=bquote(~f[Y]~"(y;"~mu~","~sigma**2~")")) +
+ facet_wrap(vars(s2s),ncol=2,scale="free_x",labeller=label_parsed)
>
> dev.new()
> ggplot(invg) +
+ geom_line(aes(x=x,y=cdf),col="red",linewidth=1.1) +
+ labs(x="y",y=bquote(~F[Y]~"(y;"~mu~","~sigma**2~")")) +
+ facet_wrap(vars(s2s),ncol=2,scale="free_x",labeller=label_parsed)
>
147 / 430
Inverse-Gaussian regression
Y_i ind∼ Inverse Gaussian(µ_i, σ²/ω_i),
g(µ_i) = β_0 + β_1 x_{i1} + … + β_p x_{ip}

E(Y_i) = µ_i and Var(Y_i) = (σ²/ω_i) µ_i³

Inverse link: g(µ_i) = 1/µ_i
Logarithmic link: g(µ_i) = log(µ_i)
Identity link: g(µ_i) = µ_i

The values of ω_1, …, ω_n are assumed to be known and positive.
148 / 430
Inverse-Gaussian model for σ² = 0.4
[Figure: simulated data and the fitted curve µ = 1/(0.6 + 0.2x)²]
149 / 430
Main distributions belonging to the LEF
Distribution               θ_i = θ(µ_i)        b(θ_i)            V(µ_i)        ω_i   φ
Binomial(m_i, µ_i)         log[µ_i/(1−µ_i)]    log(1 + e^{θ_i})  µ_i(1 − µ_i)  m_i   1
Poisson(µ_i)               log(µ_i)            e^{θ_i}           µ_i           1     1
Normal(µ_i, σ²)            µ_i                 θ_i²/2            1             1     σ²
Gamma(µ_i, σ²)             −1/µ_i              −log(−θ_i)        µ_i²          1     σ²
Inv. Gaussian(µ_i, σ²)     −1/(2µ_i²)          −√(−2θ_i)         µ_i³          1     σ²

See, for example, McCullagh and Nelder [1989, page 30].
150 / 430
Exercises
(1) Let mY ∼ Negative Binomial(m, µ), where m is a positive integer and µ > 0. The probability mass function of Y may be written as

f_Y(y; m, µ) = C(m(y + 1) − 1, my) [ µ/(µ + 1) ]^{my} [ 1/(µ + 1) ]^m I_{my}{0, 1, 2, …},

where C(·, ·) denotes the binomial coefficient.
(A) Show that the distribution of Y belongs to the LEF.
(B) Find the mean and variance of Y.
(C) Find the moment generating function of Y.

(2) Let Y ∼ Negative Binomial(µ, σ²), where µ > 0 and σ² is assumed to be known. The probability mass function of Y may be written as

f_Y(y; µ, σ²) = { Γ(y + 1/σ²) / [ Γ(1/σ²) Γ(y + 1) ] } [ σ²µ/(σ²µ + 1) ]^y [ 1/(σ²µ + 1) ]^{1/σ²} I_y{0, 1, 2, …}.

(A) Show that the distribution of Y belongs to the LEF.
(B) Find the mean and variance of Y.
(C) Find the moment generating function of Y.
151 / 430
(3) Let Y ∼ Beta(µ, σ²), where 0 < µ < 1. The probability density function of Y may be written as

f_Y(y; µ, σ²) = { Γ(σ²) / [ Γ(µσ²) Γ((1 − µ)σ²) ] } y^{µσ² − 1} (1 − y)^{(1−µ)σ² − 1},  0 < y < 1.

(A) Does the Beta distribution belong to the exponential family of distributions?
(B) Does the Beta distribution belong to the LEF?

(4) Let Y ∼ Weibull(µ, α), where µ, α > 0. The probability density function of Y may be written as

f_Y(y; µ, α) = [ α c(α)/µ ] (y/µ)^{α − 1} exp[ −c(α)(y/µ)^α ],  y > 0,

where c(α) = [Γ(1 + 1/α)]^α.
(A) Does the Weibull distribution belong to the exponential family of distributions?
(B) Does the Weibull distribution belong to the LEF?
(C) Does the Weibull distribution belong to the LEF when α = 1?
152 / 430
Parameter estimation
153 / 430
Parameter estimation
The maximum likelihood method is used for parameter estimation. Under certain conditions, known as regularity conditions, this method yields estimators with desirable statistical properties, such as consistency, asymptotic unbiasedness, asymptotic efficiency, and asymptotic normality. This method consists of maximizing the likelihood function, denoted here by L(β), which corresponds to the joint probability density function (or joint probability mass function) of Y_1, …, Y_n, viewed as a function of β rather than as a function of y_1, …, y_n. Thus,

β̂ = argmax_{β ∈ R^{p+1}} L(β).

Because Y_1, …, Y_n are independent, L(β) reduces to

L(β) = ∏_{i=1}^n f_{Y_i}(y_i; µ_i, φ/ω_i)
     = ∏_{i=1}^n C(y_i, φ/ω_i) exp{ (ω_i/φ)[ y_iθ(µ_i) − b(θ(µ_i)) ] }.
154 / 430
Parameter estimation
The function log(x) is strictly increasing, so any β that maximizes log[L(β)] will also maximize L(β). In addition, the computation of the gradient of log[L(β)] is easier than that of L(β), as the latter is a product of n terms while the former is a sum of n terms. Therefore, a more convenient form for β̂ is the following:

β̂ = argmax_{β ∈ R^{p+1}} ℓ(β),

where

ℓ(β) = log[L(β)] = Σ_{i=1}^n log[ f_{Y_i}(y_i; µ_i, φ/ω_i) ]
     = Σ_{i=1}^n log[ C(y_i, φ/ω_i) ] + (1/φ) Σ_{i=1}^n ω_i[ y_iθ(µ_i) − b(θ(µ_i)) ]

is the log-likelihood function of β. Note that Σ_{i=1}^n log[C(y_i, φ/ω_i)] and φ > 0 do not depend on β.
155 / 430
Parameter estimation
So,

β̂ = argmax_{β ∈ R^{p+1}} Σ_{i=1}^n ω_i{ y_iθ(µ_i) − b[θ(µ_i)] },

that is, β̂ does not depend on φ.

Binomial: β̂ = argmax_{β ∈ R^{p+1}} Σ_{i=1}^n m_i{ y_i log[ µ_i/(1 − µ_i) ] + log(1 − µ_i) }

Poisson: β̂ = argmax_{β ∈ R^{p+1}} Σ_{i=1}^n [ y_i log(µ_i) − µ_i ]

Normal: β̂ = argmax_{β ∈ R^{p+1}} Σ_{i=1}^n ω_i( y_iµ_i − µ_i²/2 )

Gamma: β̂ = argmax_{β ∈ R^{p+1}} Σ_{i=1}^n ω_i[ −y_i/µ_i − log(µ_i) ]

Inverse Gaussian: β̂ = argmax_{β ∈ R^{p+1}} Σ_{i=1}^n ω_i[ 1/µ_i − y_i/(2µ_i²) ]
Deviance
Deviance is used to quantify the goodness-of-fit of an interest
model by comparing its log-likelihood function to that of the sat-
urated model (that is, a model with the same number of parame-
ters as subjects or individuals i n the sample). Thus, the lower the
deviance, the less the dierence between the fitted values of the
interest model and those of the saturated model. Hence, “small”
values of deviance indicate that the interest model achieves “sim-
ilar fitted values as the saturated model, despite the fact that
the first only has
(p + 1) parameters, while the second has n
parameters. Note that,
ˆ
β = argmax
β R
p+1
(β)
= argmin
β R
p+1
2
sup
µ
1
,...,µ
n
n
X
i=1
log
h
f
Y
i
(y
i
; µ
i
, φ/ω
i
)
i
(β)
,
where sup
µ
1
,...,µ
n
n
P
i=1
log
h
f
Y
i
(y
i
; µ
i
, φ/ω
i
)
i
is the log-likelihood func-
tion of the saturated model.
157 / 430
Deviance
Then, the deviance can be written as the following function of β:

D*(β) = 2{ sup_{µ_1,…,µ_n} Σ_{i=1}^n log[ f_{Y_i}(y_i; µ_i, φ/ω_i) ] − Σ_{i=1}^n log[ f_{Y_i}(y_i; µ_i(β), φ/ω_i) ] }

= 2 Σ_{i=1}^n { sup_{µ_i} log[ f_{Y_i}(y_i; µ_i, φ/ω_i) ] − log[ f_{Y_i}(y_i; µ_i(β), φ/ω_i) ] }

= 2 Σ_{i=1}^n { sup_{µ_i} (ω_i/φ)[ y_iθ(µ_i) − b(θ(µ_i)) ] + log[ C(y_i, φ/ω_i) ]
                − (ω_i/φ)[ y_iθ(µ_i(β)) − b(θ(µ_i(β))) ] − log[ C(y_i, φ/ω_i) ] }

= (2/φ) Σ_{i=1}^n { sup_{µ_i} ω_i[ y_iθ(µ_i) − b(θ(µ_i)) ] − ω_i[ y_iθ(µ_i(β)) − b(θ(µ_i(β))) ] }

= (1/φ) Σ_{i=1}^n d(y_i, µ_i, ω_i),
158 / 430
Deviance
where

d(y_i, µ_i, ω_i) = 2{ sup_µ ω_i[ y_iθ(µ) − b(θ(µ)) ] − ω_i[ y_iθ(µ_i) − b(θ(µ_i)) ] },

in which µ_i = g⁻¹(x_iᵀβ). Note that

(∂/∂µ) ω_i[ y_iθ(µ) − b(θ(µ)) ] = ω_i θ′(µ)[ y_i − b′(θ(µ)) ]

and

(∂²/∂µ²) ω_i[ y_iθ(µ) − b(θ(µ)) ] = ω_i{ θ″(µ)[ y_i − b′(θ(µ)) ] − [θ′(µ)]² b″(θ(µ)) }.

Therefore, if y_i belongs to the parametric space of µ, then the maximum of ω_i[ y_iθ(µ) − b(θ(µ)) ] exists and it is ω_i[ y_iθ(y_i) − b(θ(y_i)) ]. Consequently, d(y_i, µ_i, ω_i) reduces to

d(y_i, µ_i, ω_i) = 2{ ω_i[ y_iθ(y_i) − b(θ(y_i)) ] − ω_i[ y_iθ(µ_i) − b(θ(µ_i)) ] }.
159 / 430
Deviance
If y_i does not belong to the parametric space of µ, then ω_i[ y_iθ(µ) − b(θ(µ)) ] is a strictly monotonic function of µ, so its maximum does not exist and must be replaced by its supremum.

In conclusion, the deviance satisfies the following:

The lower the value of D*(β̂), the better the goodness-of-fit of the model.
D*(β̂) → 0 as d(y_i, µ̂_i, ω_i) → 0 for i = 1, …, n.
β̂ may be written as β̂ = argmin_{β ∈ R^{p+1}} D*(β) = argmin_{β ∈ R^{p+1}} D(β), where D(β) = Σ_{i=1}^n d(y_i, µ_i, ω_i) is the unscaled deviance.
160 / 430
Deviance (see McCullagh and Nelder [1989, page 34])
If m_iY_i ind∼ Binomial(m_i, µ_i) then

d(y_i, µ̂_i, ω_i) = −2m_i log(1 − µ̂_i)   if y_i = 0,
                 = 2m_i{ y_i log[ y_i(1 − µ̂_i) / (µ̂_i(1 − y_i)) ] + log[ (1 − y_i)/(1 − µ̂_i) ] }   if 0 < y_i < 1,
                 = −2m_i log(µ̂_i)   if y_i = 1.

If Y_i ind∼ Poisson(µ_i) then

d(y_i, µ̂_i) = 2µ̂_i   if y_i = 0,
            = 2[ y_i log(y_i/µ̂_i) − (y_i − µ̂_i) ]   if y_i > 0.

If Y_i ind∼ Normal(µ_i, φ/ω_i) then

d(y_i, µ̂_i, ω_i) = 2{ ω_i[ y_iθ(y_i) − b(θ(y_i)) ] − ω_i[ y_iθ(µ̂_i) − b(θ(µ̂_i)) ] }
                 = ω_i[ 2y_i(y_i − µ̂_i) + (µ̂_i² − y_i²) ]
                 = ω_i(y_i − µ̂_i)².

In addition, Σ_{i=1}^n ω_i(Y_i − µ_i)²/φ ∼ χ²(n).
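These unit deviances are exactly what R's family objects return through $dev.resids(); a minimal sketch for a Poisson fit (hcrabs is used later in these notes; any Poisson glm would do):

> data(hcrabs, package="GLMsData")
> fit <- glm(Sat ~ Width, family=poisson, data=hcrabs)
> y <- hcrabs$Sat; mu <- fitted(fit)
> d <- ifelse(y == 0, 2*mu, 2*(y*log(y/mu) - (y - mu)))  # unit deviances
> c(sum(d),
+   sum(poisson()$dev.resids(y, mu, wt=rep(1, length(y)))),
+   deviance(fit))   # all three coincide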
161 / 430
Deviance
If Y_i ind∼ Gamma(µ_i, φ/ω_i) then

d(y_i, µ̂_i, ω_i) = 2{ ω_i[ y_iθ(y_i) − b(θ(y_i)) ] − ω_i[ y_iθ(µ̂_i) − b(θ(µ̂_i)) ] }
                 = 2ω_i[ y_i( 1/µ̂_i − 1/y_i ) − log( y_i/µ̂_i ) ]
                 = 2ω_i[ (y_i − µ̂_i)/µ̂_i − log( y_i/µ̂_i ) ].

If Y_i ind∼ Inverse Gaussian(µ_i, φ/ω_i) then

d(y_i, µ̂_i, ω_i) = 2{ ω_i[ y_iθ(y_i) − b(θ(y_i)) ] − ω_i[ y_iθ(µ̂_i) − b(θ(µ̂_i)) ] }
                 = 2ω_i[ y_i( 1/(2µ̂_i²) − 1/(2y_i²) ) − ( 1/µ̂_i − 1/y_i ) ]
                 = ω_i(y_i − µ̂_i)² / (µ̂_i² y_i).

In addition, Σ_{i=1}^n ω_i(Y_i − µ_i)² / (φ µ_i² Y_i) ∼ χ²(n).
162 / 430
Exercises
Let Y ∼ Negative Binomial(µ, σ²), where µ > 0 and σ² is known. The probability mass function of Y may be written as

f_Y(y; µ, σ²) = { Γ(y + 1/σ²) / [ Γ(1/σ²) Γ(y + 1) ] } [ σ²µ/(σ²µ + 1) ]^y [ 1/(σ²µ + 1) ]^{1/σ²} I_y{0, 1, 2, …}.

Then, the following GLM can be formulated:

Y_i ind∼ Negative Binomial(µ_i, σ²),
g(µ_i) = β_0 + β_1 x_{i1} + … + β_p x_{ip}

(A) What is the maximum of y_iθ(µ_i) − b[θ(µ_i)] with respect to µ_i on the interval (0, ∞)?
(B) What is the supremum of b[θ(µ_i)] on the interval (0, ∞)?
(C) Find the deviance and the unscaled deviance of the model.
(D) Compare the results of (C) above with those obtained in R from negative.binomial(1)$dev.resids, where .Theta = 1/σ².
163 / 430
Parameter estimation
If ℓ(β) is a strictly concave function, that is, if the Hessian matrix of ℓ(β) is negative definite for all β, then β̂ reduces to the unique solution to U(β̂) = 0, if it exists, where U(β) is given by

U(β) = ( ∂ℓ(β)/∂β_0, ∂ℓ(β)/∂β_1, …, ∂ℓ(β)/∂β_p )ᵀ,

in which

∂ℓ(β)/∂β_j = (1/φ) Σ_{i=1}^n ω_i[ y_i (∂θ_i/∂µ_i)(∂µ_i/∂η_i)(∂η_i/∂β_j) − b′(θ_i)(∂θ_i/∂µ_i)(∂µ_i/∂η_i)(∂η_i/∂β_j) ]
           = (1/φ) Σ_{i=1}^n ω_i(y_i − µ_i)(∂θ_i/∂µ_i)(∂µ_i/∂η_i)(∂η_i/∂β_j)
           = (1/φ) Σ_{i=1}^n [ ω_i(y_i − µ_i) / (V(µ_i) g′(µ_i)) ] x_{ij},

because

(1) η_i = β_0 + β_1 x_{i1} + … + β_p x_{ip} ⟹ ∂η_i/∂β_j = x_{ij}, with x_{i0} = 1;
(2) g(µ_i) = η_i ⟹ ∂g(µ_i)/∂η_i = ∂η_i/∂η_i ⟹ g′(µ_i)(∂µ_i/∂η_i) = 1 ⟹ ∂µ_i/∂η_i = 1/g′(µ_i);
(3) µ_i = b′(θ_i) ⟹ ∂µ_i/∂µ_i = ∂b′(θ_i)/∂µ_i = 1 ⟹ b″(θ_i)(∂θ_i/∂µ_i) = 1 ⟹ ∂θ_i/∂µ_i = 1/V(µ_i).
Parameter estimation
In matrix form U(β) becomes
U(β) =
φ
1
n
P
i=1
ω
i
(y
i
µ
i
)
V(µ
i
)g
(µ
i
)
.
φ
1
n
P
i=1
ω
i
(y
i
µ
i
)
V(µ
i
)g
(µ
i
)
x
i1
.
.
.
φ
1
n
P
i=1
ω
i
(y
i
µ
i
)
V(µ
i
)g
(µ
i
)
x
ip
= φ
1
X
KV
1
(y µ),
where
X = (x
1
, . . . , x
n
)
is the model matrix
K = diag{1/g
(µ
1
), . . . , 1/g
(µ
n
)}
V = diag{V(µ
1
)
1
, . . . , V(µ
n
)
n
}
y = (y
1
, . . . , y
n
)
is the vector of observed responses
µ = (µ
1
, . . . , µ
n
)
is the vector of means
In general, the solution to the nonlinear equations given by
U(
ˆ
β) = 0 must be obtained using numerical methods such as
the Newton-Raphson and the Fisher’s Scoring algorithms.
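The score at β̂ can be checked numerically from this matrix form (this is essentially what glmtoolbox's estequa() reports); a minimal sketch for the Gamma model with log link fitted to the Steel data, where 1/g′(µ_i) = µ_i and V(µ_i) = µ_i², so KV⁻¹(y − µ) has elements (y_i − µ_i)/µ_i (the constant factor 1/φ is omitted):

> data(Steel, package="glmtoolbox")
> fit <- glm(life ~ stress, family=Gamma(log), data=Steel)
> X <- model.matrix(fit); y <- Steel$life; mu <- fitted(fit)
> t(X) %*% ((y - mu)/mu)   # essentially zero at the MLE; compare estequa(fit)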
165 / 430
Exercises
Find the solution to U(β̂) = 0 in the following cases:

(A) Y_i ind∼ Normal(µ_i, φ/ω_i), µ_i = β_0 + β_1 x_{i1} + … + β_p x_{ip}

(B) Y_i ind∼ LEF(µ_i, φ/ω_i), g(µ_i) = β_0 + β_1 x_{i1}, where individual i belongs to group 1 or 2. Therefore, x_{i1} = I{i ∈ group 2}.

(C) Y_i ind∼ LEF(µ_i, φ/ω_i), g(µ_i) = β_0 + β_1 x_{i1} + … + β_{r−1} x_{i(r−1)}, where individual i belongs to group 1, 2, …, (r − 1) or r. Hence, x_{ij} = I{i ∈ group (j + 1)}.
166 / 430
Newton-Raphson algorithm
The idea behind this algorithm is to maximize, in its iteration (t + 1) for t = 0, 1, 2, …, a quadratic approximation of ℓ(β), denoted here by ℓ^[t+1](β), so that the smaller the difference between ℓ(β) and ℓ^[t+1](β), the smaller the difference between β̂ and the maximum of ℓ^[t+1](β), denoted here by β^[t+1]. This is accomplished by building, in iteration (t + 1) of the algorithm, an approximation of ℓ(β) based on its second-order Taylor series around β^[t], obtaining the following (see Appendix C-3):

ℓ(β) ≈ ℓ^[t+1](β) = ℓ(β^[t]) + Uᵀ(β^[t])(β − β^[t]) − (1/2)(β − β^[t])ᵀ J(β^[t])(β − β^[t]),

where J(β) = −∇²ℓ(β) is minus the Hessian matrix of ℓ(β), also known as the observed Fisher information matrix of β. The maximum of ℓ^[t+1](β) is easy to compute as it is a quadratic and strictly concave function, which is a consequence of the following: (i) the Hessian matrices of ℓ(β) and ℓ^[t+1](β) coincide at β^[t]; and (ii) J(β) is assumed to be positive definite for all β.
167 / 430
Newton-Raphson algorithm
The maximum of ℓ^[t+1](β) is its zero-gradient point, that is, the maximum of ℓ^[t+1](β) is the solution to the following equation:

U(β^[t]) − J(β^[t])(β^[t+1] − β^[t]) = 0.

Therefore, β^[t+1] may be written as follows (see, for instance, Agresti [2015, section 4.5.1]):

β^[t+1] = β^[t] + [J(β^[t])]⁻¹ U(β^[t]).

The above results motivate the following algorithm:

Step 0: Start the counter at t = 0; set a tolerance limit, ε > 0; set a maximum number of iterations, n_max; and set a starting value for β, say β^[0].
Step 1: Compute β^[t+1] = β^[t] + [J(β^[t])]⁻¹ U(β^[t]).
Step 2: Compute δ^(t+1) = δ(β^[t], β^[t+1]).
Step 3: Update the counter by t = t + 1.
Step 4: Repeat steps 1, 2 and 3 until δ^(t) < ε or t > n_max.
Step 5: If δ^(t) < ε then β̂ is defined to be β^[t]. Otherwise, convergence was not achieved.
168 / 430
Newton-Raphson algorithm
[Figure: one Newton-Raphson step, showing the quadratic approximation ℓ^[t+1](β) to ℓ(β), with β^[t], β^[t+1] and β̂ marked on the β axis]
169 / 430
Newton-Raphson algorithm
Convergence of the algorithm above is obtained when the value of β^[t] stabilizes, that is, convergence is reached when the "difference" between β^[t] and β^[t+1] is "small" enough. Therefore,

(1) If β^[t] and β^[t+1] are equal, then U(β^[t]) = 0 and β̂ = β^[t+1], as J(β^[t]) is a non-singular matrix.
(2) If β^[t] reaches the value of β̂, then the algorithm finishes, as, according to Step 1 of the algorithm, U(β^[t]) = 0 implies that β^[t] and β^[t+1] are equal.

The convergence criterion can be formalized by setting the constant ε > 0 and the non-negative function δ(β^[t], β^[t+1]), which is a strictly increasing function of the "difference" between β^[t] and β^[t+1]. Possible forms for δ(β^[t], β^[t+1]) include the following:

δ(β^[t], β^[t+1]) = ‖β^[t+1] − β^[t]‖_r = [ Σ_{j=0}^p |β_j^[t+1] − β_j^[t]|^r ]^{1/r},  r ≥ 1

δ(β^[t], β^[t+1]) = max{ |β_0^[t+1] − β_0^[t]|/|β_0^[t]|, …, |β_p^[t+1] − β_p^[t]|/|β_p^[t]| }
170 / 430
Fisher’s Scoring algorithm
This algorithm is a modified version of the Newton-Raphson algorithm, where J(β) is replaced by the expected Fisher information matrix, defined by I(β) = E[J(β)] = φ⁻¹XᵀWX, with W = KV⁻¹K. This algorithm can be more convenient than the Newton-Raphson, as: (i) I(β) is a positive definite matrix for all β (see Appendix A-3), which, in general, is not true for J(β); (ii) the structure of I(β) is simpler than that of J(β). The Fisher's Scoring algorithm may be written as follows (see, for instance, Agresti [2015, section 4.5.2]):

Step 0: Start the counter at t = 0; set a tolerance limit, ε > 0; set a maximum number of iterations, n_max; and set a starting value for β, say β^[0].
Step 1: Compute β^[t+1] = β^[t] + [I(β^[t])]⁻¹ U(β^[t]).
Step 2: Compute δ^(t+1) = δ(β^[t], β^[t+1]).
Step 3: Update the counter by t = t + 1.
Step 4: Repeat steps 1, 2 and 3 until δ^(t) < ε or t > n_max.
Step 5: If δ^(t) < ε then β̂ is defined to be β^[t]. Otherwise, convergence was not achieved.
171 / 430
Fisher’s Scoring algorithm
The entry (j, j′) of the matrix J(β) may be obtained as follows:

−∂²ℓ(β)/∂β_j ∂β_{j′} = −(∂/∂β_{j′}) { (1/φ) Σ_{i=1}^n ω_i(Y_i − µ_i) f_i x_{ij} }
= (1/φ) Σ_{i=1}^n ω_i x_{ij} [ f_i − (Y_i − µ_i)(∂f_i/∂µ_i) ] (∂µ_i/∂η_i)(∂η_i/∂β_{j′})
= (1/φ) Σ_{i=1}^n ω_i x_{ij} x_{ij′} / [ V(µ_i)(g′(µ_i))² ]
  − (1/φ) Σ_{i=1}^n ω_i(Y_i − µ_i)(∂f_i/∂µ_i) x_{ij} x_{ij′} / g′(µ_i),

where f_i = [V(µ_i)g′(µ_i)]⁻¹. Therefore, since E(Y_i − µ_i) = 0, the entry (j, j′) of I(β) = E[J(β)] reduces to (Agresti [2015, page 126]):

E( −∂²ℓ(β)/∂β_j ∂β_{j′} ) = (1/φ) Σ_{i=1}^n ω_i x_{ij} x_{ij′} / [ V(µ_i)(g′(µ_i))² ].
172 / 430
Fisher’s Scoring algorithm
By replacing U(β) and I(β) by φ⁻¹XᵀWK⁻¹(y − µ) and φ⁻¹XᵀWX, respectively, Step 1 of the Fisher's Scoring algorithm may be written as follows (see, for instance, Agresti [2015, page 142]):

β^[t+1] = β^[t] + [I(β^[t])]⁻¹ U(β^[t])
        = β^[t] + (XᵀW^[t]X)⁻¹ XᵀW^[t](z^[t] − Xβ^[t])
        = (XᵀW^[t]X)⁻¹ XᵀW^[t] z^[t],

where W^[t] and z^[t] represent W and z = Xβ + K⁻¹(y − µ) = g(µ) + K⁻¹(y − µ) evaluated at β^[t], in which g(µ) is defined to be (g(µ_1), …, g(µ_n))ᵀ. Therefore, β^[t+1] can be regarded as the Weighted Least Squares estimate of β in a linear model such that E(Z^[t]) = Xβ, Var(Z^[t]) = σ²[W^[t]]⁻¹ and z^[t] is the observed value of Z^[t]. Indeed, at convergence of the algorithm, β̂ can be written as β̂ = (XᵀŴX)⁻¹XᵀŴẑ. Hence, the Fisher's Scoring algorithm is also known as Iteratively Reweighted Least Squares (IRLS).
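The update above can be reproduced in a few lines of R; a didactic sketch for the Gamma model with log link fitted to the Steel data (not the production implementation in glm(); for this family/link combination the weights w_i = 1/[V(µ_i)(g′(µ_i))²] equal 1, so W is the identity):

> data(Steel, package="glmtoolbox")
> X <- model.matrix(~ stress, data=Steel); y <- Steel$life
> beta <- c(log(mean(y)), 0)                 # crude starting value
> for (t in 1:25) {
+   eta <- drop(X %*% beta); mu <- exp(eta)  # log link
+   z <- eta + (y - mu)/mu                   # working response g(mu) + K^{-1}(y - mu)
+   W <- diag(1, length(y))                  # W = I for Gamma with log link
+   beta <- drop(solve(t(X) %*% W %*% X, t(X) %*% W %*% z))
+ }
> beta   # close to coef(glm(life ~ stress, family=Gamma(log), data=Steel))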
173 / 430
Fisher’s Scoring algorithm
Thus, an initial or starting value for the Newton-Raphson and Fisher's Scoring algorithms may be defined as

β^[0] = (XᵀW*X)⁻¹ XᵀW* z*,

where W* and z* represent W and z evaluated at µ = y. However, in some cases, the value of y does not belong to the parametric space of µ, so that g(y) and/or g′(y) may be undefined. This occurs in the following cases:

(1) Under binomial response models when y = 0 or y = 1. In those cases, β^[0] may be computed by replacing such values of y by |y − τ|, where τ is a "small" value such as, for instance, 0.01, 0.05 or 0.1.
(2) Under Poisson response models when y = 0. In those cases, β^[0] may be computed by replacing such values of y by τ, where τ is a "small" value such as, for instance, 0.01 or 0.05.
174 / 430
Canonical link functions
The canonical link functions are such that

g(µ) ≡ θ(µ),

which implies that g′(µ) = [V(µ)]⁻¹. In those cases, the expression for the elements of the gradient vector U(β) reduces to

∂ℓ(β)/∂β_j = (1/φ) Σ_{i=1}^n ω_i(y_i − µ_i)x_{ij},  j = 0, 1, …, p.

Similarly, the expression for the elements of the matrix J(β) reduces to

−∂²ℓ(β)/∂β_j∂β_{j′} = (1/φ) Σ_{i=1}^n ω_i V(µ_i)x_{ij}x_{ij′},  j, j′ = 0, 1, …, p.

Therefore,

U(β) = (1/φ) XᵀΩ(y − µ), where Ω = diag{ω_1, …, ω_n},  and  I(β) = J(β).
175 / 430
Canonical link functions
See, for example, McCullagh and Nelder [1989, page 32].

Distribution          Canonical link
Normal                g(µ) = µ (identity)
Binomial              g(µ) = log[µ/(1 − µ)] (logit)
Poisson               g(µ) = log(µ) (log)
Gamma                 g(µ) = µ⁻¹ (inverse)
Inverse Gaussian      g(µ) = µ⁻² (1/mu^2)

Models with those link functions satisfy the following:

(1) The Newton-Raphson and Fisher's Scoring algorithms coincide, as the only difference between them is that the former uses J(β) whereas the latter uses I(β).
(2) The Hessian matrix of ℓ(β), given by −I(β), is negative definite for all β. Therefore, ℓ(β) is a strictly concave function, and the solution to U(β) = 0, if it exists, is the unique global maximum of ℓ(β), that is, it is the maximum likelihood estimate of β.
176 / 430
Hardened steel

These data, extracted from McCool [1980] and available in the object Steel of the library glmtoolbox, consist of the failure times (life) for hardened steel specimens in a 4-ball rolling contact fatigue test. Ten independent observations were taken at each of the four values of contact stress (stress), in pounds per square inch ×10⁶.

[Figure: time until the specimen failed versus contact stress, in psi ×10⁶]

Y_i ind∼ Gamma(µ_i, σ²),
log(µ_i) = β_0 + β_1 x_i

y = Time until the specimen failed
x = Contact stress
177 / 430
Hardened steel

The convergence criterion of the Fisher's Scoring algorithm in the function glm() is the following:

δ(β^[t], β^[t+1]) = | D(β^[t+1]) − D(β^[t]) | / [ |D(β^[t])| + 0.1 ] < ε

> fit <- glm(life ~ stress, family=Gamma(log), data=Steel)
> FisherScoring(fit)
Tolerance Limit: 1e-08                      # the value of ε
Maximum number of iterations: 25            # the value of n_max
Iteration Deviance  Tolerance  Intercept   Stress    # t, D(β^[t]), δ(β^[t−1], β^[t]), β^[t]
        1   43.604         NA     12.660  -12.318
        2   34.873 1.9978e-01     14.829  -13.891
        3   34.231 1.8341e-02     14.270  -13.456
        4   34.226 1.5484e-04     14.190  -13.386
        5   34.226 1.7747e-07     14.187  -13.383
        6   34.226 2.6181e-10     14.186  -13.383   # β̂
> estequa(fit)                              # U(β̂)
(Intercept) -5.827184e-09
stress       3.420245e-06
178 / 430
Hardened steel

> fit0 <- lm(log(life) ~ stress, data=Steel)
> fit1 <- glm(life ~ stress, family=Gamma(log), data=Steel,
+             start=coef(fit0),             # β^[0]
+             control=list(epsilon=1e-8, maxit=25, trace=TRUE))  # ε and n_max
Deviance = 34.87253 Iterations - 1
Deviance = 34.23110 Iterations - 2
Deviance = 34.22578 Iterations - 3
Deviance = 34.22578 Iterations - 4
Deviance = 34.22578 Iterations - 5
> summary(fit1)
             Estimate Std.Error t value Pr(>|t|)     # β̂_j in the first column
(Intercept)    14.186     1.250   11.35 9.02e-14 ***
stress        -13.383     1.203  -11.12 1.63e-13 ***
(Dispersion parameter for Gamma family taken to be 0.7710894)
Null deviance: 110.033 on 39 degrees of freedom
Residual deviance: 34.226 on 38 degrees of freedom   # D(β̂)
Number of Fisher Scoring iterations: 5
179 / 430
Hardened steel

The function glm() also enables the user to specify starting values for η_1, …, η_n (etastart) and for µ_1, …, µ_n (mustart).

> etas <- fitted(fit0)
> fit <- glm(life ~ stress, family=Gamma(log), data=Steel,
+            etastart=etas,                 # (η_1^[0], …, η_n^[0])
+            control=list(epsilon=1e-8, maxit=25, trace=TRUE))  # ε and n_max
Deviance = 34.87253 Iterations - 1
Deviance = 34.23110 Iterations - 2
Deviance = 34.22578 Iterations - 3
Deviance = 34.22578 Iterations - 4
Deviance = 34.22578 Iterations - 5
> summary(fit)
>
> mus <- exp(fitted(fit0))
> fit <- glm(life ~ stress, family=Gamma(log), data=Steel,
+            mustart=mus,                   # (µ_1^[0], …, µ_n^[0])
+            control=list(epsilon=1e-8, maxit=25, trace=TRUE))  # ε and n_max
180 / 430
Goodness-of-fit
181 / 430
R-squared
Interest Model (model under investigation):

Y_i ind∼ LEF(µ_i, φ/ω_i),
g(µ_i) = β_0 + β_1 x_{i1} + … + β_p x_{ip},

whose model matrix X has rows (1, x_{i1}, …, x_{ip}), i = 1, …, n, and β̂ = argmin_{β ∈ R^{p+1}} D(β).

Reference Model (model with the poorest goodness-of-fit):

Y_i ind∼ LEF(µ_i, φ/ω_i),
g(µ_i) = β,

whose model matrix is X = 1_n, and β̂ = argmin_{β ∈ R} D_0(β).

Note that D(β̂) = min_{β ∈ R^{p+1}} D(β) and D_0(β̂) = min_{β ∈ R^{p+1}} D(β) restricted to β_1 = … = β_p = 0, where D(β) and D_0(β) represent the unscaled deviances of the Interest Model and the Reference Model, respectively.
182 / 430
R-squared
R² = 1 − D(β̂)/D_0(β̂)

The R² satisfies the following:

The larger its value, the better the goodness-of-fit of the model.
R² ∈ [0, 1].
If X = 1_n then R² = 0, where 1_n represents a column vector of 1's of size n.
If y_1, …, y_n belong to the parametric space of µ and X = I_n then R² = 1, where I_n represents the identity matrix of size n.
A hundred times its value represents the percentage reduction in unscaled deviance caused by the inclusion of p parameters in the linear predictor.
It is a one-to-one transformation of the unscaled deviance.
183 / 430
R-squared
The R² can be used to compare nested models, that is, models that assume the same probability distribution to describe the response variable, but where the first model's X matrix coincides with the first (r + 1) columns of the second model's X matrix, in which r < p.

Model I:
Y_i ind∼ LEF(µ_i, φ/ω_i),
g(µ_i) = β̃_0 + Σ_{j=1}^r β̃_j x_{ij},
with β̂_I = argmin_{β̃ ∈ R^{r+1}} D_I(β̃).

Model II:
Y_i ind∼ LEF(µ_i, φ/ω_i),
g(µ_i) = β_0 + Σ_{j=1}^r β_j x_{ij} + Σ_{j=r+1}^p β_j x_{ij},
with β̂_II = argmin_{β ∈ R^{p+1}} D_II(β).

Note that D_II(β̂_II) = min_{β ∈ R^{p+1}} D_II(β) and D_I(β̂_I) = min_{β ∈ R^{p+1}} D_II(β) restricted to β_{r+1} = … = β_p = 0, where D_I(β̃) and D_II(β) represent the unscaled deviances of models I and II, respectively.
184 / 430
R-squared
Therefore,

D_II(β̂_II) ≤ D_I(β̂_I),

which implies that R²_II is ALWAYS larger than or equal to R²_I, that is, according to the R², model II is ALWAYS better than model I, even if the (p − r) additional covariates in model II do not help to explain the differences between y_1, …, y_n.

An adjusted version of R², in which not just the model goodness-of-fit is measured but also its complexity is penalized, is given by the following expression:

R²_adj = 1 − [ D(β̂)/(n − p − 1) ] / [ D_0(β̂)/(n − 1) ] = 1 − [ (n − 1)/(n − p − 1) ](1 − R²),

where (n − p − 1) and (n − 1) are the degrees-of-freedom of the Interest Model and the Reference Model, respectively. Note that R²_adj is a one-to-one transformation of D(β̂)/(n − p − 1).
185 / 430
Generalized Information Criterion (GIC)
GIC(k) = −2ℓ(β̂) + k × (number of parameters),

where −2ℓ(β̂) is a measure of the goodness-of-fit and k × (number of parameters) is the penalty for model complexity, with

ℓ(β̂) = Σ_{i=1}^n log[C(y_i, φ/ω_i)] + (1/φ) Σ_{i=1}^n ω_i{ y_iθ(µ̂_i) − b[θ(µ̂_i)] }.

The Akaike Information Criterion (AIC, Akaike [1974]) and the Bayesian Information Criterion (BIC, Schwarz [1978]), also known as the Schwarz Information Criterion (SIC), are special cases of GIC(k) when k = 2 and k = log(n), respectively.

If n > 7 then the penalty for model complexity in BIC is higher than that in AIC. Therefore, the BIC tends to choose simpler models than AIC.
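GIC(k) is easy to compute from logLik(); a minimal sketch (fit is the Gamma model for the Steel data used earlier):

> fit <- glm(life ~ stress, family=Gamma(log), data=Steel)
> GIC <- function(fit, k) {
+   ll <- logLik(fit)
+   -2*as.numeric(ll) + k*attr(ll, "df")  # df counts the betas plus the dispersion
+ }
> c(GIC(fit, 2), AIC(fit))                # k = 2 reproduces the AIC
> c(GIC(fit, log(nobs(fit))), BIC(fit))   # k = log(n) reproduces the BIC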
186 / 430
Statistical inference
187 / 430
Asymptotic theory
If the model is correctly specified then, under certain conditions, better known as regularity conditions, β̂ is a consistent estimator of β, which asymptotically (that is, when n → ∞) satisfies the following (Fahrmeir and Kaufmann [1985]):

it is unbiased,
it is efficient,
it follows a multivariate normal distribution.

As a result, for "large" values of the sample size n, we have that

β̂ ∼̇ Normal_{p+1}(β, [I(β)]⁻¹),

where ∼̇ means "approximately distributed as". A consistent estimator of [I(β)]⁻¹ is [I(β̂)]⁻¹.
188 / 430
Hypothesis testing
The interest is to assess the following hypothesis system:

H_0: Lᵀβ = M versus H_1: Lᵀβ ≠ M,

where Lᵀ is an r × (p + 1) matrix whose rows are linearly independent and M is an r × 1 matrix. The decision rule is the following:

"Reject H_0 at the approximate 100(α)% significance level if ξ > χ²_{1−α}(r)",

where α ∈ (0, 1), χ²_{1−α}(r) is the 100(1 − α)th percentile of the chi-square distribution with r degrees-of-freedom, and ξ is one of the following statistics (see, for instance, Buse [1982] and Terrell [2002]):

Likelihood-ratio
Wald
Rao's Score
Terrell's Gradient

If H_0 is true, then the asymptotic distribution (n → ∞) of ξ is χ²(r). The statistic ξ depends on the dispersion parameter φ. However, the asymptotic distribution of ξ remains χ²(r) even when φ is unknown and is replaced by a consistent estimator.
189 / 430
Hypothesis testing
For example, with p = 3:

(1) H_0: β_2 = 0 versus H_1: β_2 ≠ 0.
    Then r = 1, Lᵀ = [0 0 1 0] and M = 0.

(2) H_0: (β_1, β_3)ᵀ = (0, 0)ᵀ versus H_1: (β_1, β_3)ᵀ ≠ (0, 0)ᵀ.
    Then r = 2, Lᵀ = [0 1 0 0; 0 0 0 1] and M = (0, 0)ᵀ.

(3) H_0: β_2 = β_3 versus H_1: β_2 ≠ β_3.
    Then r = 1, Lᵀ = [0 0 1 −1] and M = 0.
190 / 430
Hypothesis testing
Likelihood-ratio test

ξ_LR = 2 log[ L(β̂)/L(β̂⁰) ] = 2[ ℓ(β̂) − ℓ(β̂⁰) ]

= (2/φ){ Σ_{i=1}^n ω_i[ y_iθ(µ̂_i) − b(θ(µ̂_i)) ] − Σ_{i=1}^n ω_i[ y_iθ(µ̂⁰_i) − b(θ(µ̂⁰_i)) ] }

= (2/φ){ Σ_{i=1}^n sup_{µ_i} ω_i[ y_iθ(µ_i) − b(θ(µ_i)) ] − Σ_{i=1}^n ω_i[ y_iθ(µ̂⁰_i) − b(θ(µ̂⁰_i)) ] }
  − (2/φ){ Σ_{i=1}^n sup_{µ_i} ω_i[ y_iθ(µ_i) − b(θ(µ_i)) ] − Σ_{i=1}^n ω_i[ y_iθ(µ̂_i) − b(θ(µ̂_i)) ] }

= (1/φ)[ D(β̂⁰) − D(β̂) ],

where β̂ = argmin_{β ∈ R^{p+1}} D(β), β̂⁰ = argmin_{β ∈ R^{p+1}} D(β) restricted to Lᵀβ = M, µ̂_i = g⁻¹(x_iᵀβ̂) and µ̂⁰_i = g⁻¹(x_iᵀβ̂⁰).
191 / 430
Hypothesis testing
Wald test

β̂ ∼̇ Normal_{p+1}(β, [I(β)]⁻¹), which implies that (see Appendix E-3)

Lᵀβ̂ ∼̇ Normal_r(Lᵀβ, Lᵀ[I(β)]⁻¹L),

which in turn implies that (see Appendix E-3,4)

(Lᵀβ̂ − Lᵀβ)ᵀ (Lᵀ[I(β)]⁻¹L)⁻¹ (Lᵀβ̂ − Lᵀβ) ∼̇ χ²(r).

So, the statistic of the Wald test reduces to

ξ_W = (Lᵀβ̂ − M)ᵀ (Lᵀ[I(β̂)]⁻¹L)⁻¹ (Lᵀβ̂ − M).

Unlike ξ_LR, the statistic ξ_W does not depend on β̂⁰. As a result, ξ_W has a lower computational cost than ξ_LR, since the former does not require the computation of β̂⁰.
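ξ_W can be computed directly from coef() and vcov(); a minimal sketch for H_0: β_1 = 0 in the Gamma model for the Steel data (Lᵀ picks out the stress coefficient; vcov(fit) estimates [I(β̂)]⁻¹):

> fit <- glm(life ~ stress, family=Gamma(log), data=Steel)
> Lt <- matrix(c(0, 1), nrow=1); M <- 0        # r = 1
> d <- Lt %*% coef(fit) - M
> xiW <- drop(t(d) %*% solve(Lt %*% vcov(fit) %*% t(Lt)) %*% d)
> c(xiW, pchisq(xiW, df=1, lower.tail=FALSE))  # statistic and p-value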
192 / 430
Hypothesis testing
For example,

(1) H_0: β_2 = 0 versus H_1: β_2 ≠ 0. Then

ξ_W = β̂_2² / V̂ar(β̂_2) = β̂_2² / [Std.Error(β̂_2)]².

(2) H_0: (β_1, β_3)ᵀ = (0, 0)ᵀ versus H_1: (β_1, β_3)ᵀ ≠ (0, 0)ᵀ. Then

ξ_W = (β̂_1, β̂_3) [ V̂ar(β̂_1)  Ĉov(β̂_1, β̂_3) ; Ĉov(β̂_3, β̂_1)  V̂ar(β̂_3) ]⁻¹ (β̂_1, β̂_3)ᵀ.

(3) H_0: β_2 = β_3 versus H_1: β_2 ≠ β_3. Then

ξ_W = (β̂_2 − β̂_3)² / [ V̂ar(β̂_2) + V̂ar(β̂_3) − 2Ĉov(β̂_2, β̂_3) ].
193 / 430
Hypothesis testing
Rao's Score test

U(β) ∼̇ Normal_{p+1}(0, I(β)), which implies that (see Appendix E-3)

LᵀU(β) ∼̇ Normal_r(0, Lᵀ[I(β)]L),

which in turn implies that (see Appendix E-3,4)

[LᵀU(β)]ᵀ (Lᵀ[I(β)]L)⁻¹ [LᵀU(β)] ∼̇ χ²(r).

So, the statistic of the Rao's Score test reduces to

ξ_S = [LᵀU(β̂⁰)]ᵀ (Lᵀ[I(β̂⁰)]L)⁻¹ [LᵀU(β̂⁰)].

Unlike ξ_LR, the statistic ξ_S does not depend on β̂. As a result, ξ_S has a lower computational cost than ξ_LR, since the former does not require the computation of β̂.

Terrell's Gradient test

ξ_G = (Lᵀβ̂ − M)ᵀ [LᵀU(β̂⁰)].

Similar to ξ_LR, the statistic of the Terrell's Gradient test depends on both β̂ and β̂⁰.
194 / 430
Hypothesis testing
> help(anova2)
>
> anova2(fit0, fit1, test="wald")

fit0: the restricted model, that is, the model fitted under the restrictions on β stated by H_0.
fit1: the unrestricted model, that is, the model fitted without restrictions on β.
test: the statistical test specified to assess the hypothesis. The available options are: Wald ("wald"), Rao's score ("score"), Terrell's gradient ("gradient"), and likelihood ratio ("lr") tests. By default, test is set to be "wald".

The function anova2() provides the interest values associated with the hypothesis testing, that is, ξ, r and Pr[χ²(r) > ξ].
195 / 430
Confidence intervals (see Agresti [2015, page 131])
The approximate 100(1 − α)% confidence interval for β_j includes all β* for which the hypothesis system given by

H_0: β_j = β* versus H_1: β_j ≠ β*

is not rejected at the approximate 100(α)% significance level.

In general, the limits of the approximate 100(1 − α)% confidence interval for β_j based on the likelihood-ratio, Rao's Score, and Terrell's Gradient tests do not have closed form and require numerical methods. On the other hand, the approximate 100(1 − α)% confidence interval for β_j based on the Wald test comprises the β* such that (β̂_j − β*)²/V̂ar(β̂_j) ≤ χ²_{1−α}(1) = Z²_{1−α/2}. Thus, the interval is the following:

( β̂_j − Z_{1−α/2}[V̂ar(β̂_j)]^{1/2} , β̂_j + Z_{1−α/2}[V̂ar(β̂_j)]^{1/2} ),

where Z_{1−α/2} represents the 100(1 − α/2)th percentile of the standard normal distribution.
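The Wald interval can be reproduced by hand; a minimal sketch at the 95% level for the Gamma model above (compare with confint2(fit, test="wald")):

> fit <- glm(life ~ stress, family=Gamma(log), data=Steel)
> est <- coef(fit); se <- sqrt(diag(vcov(fit)))
> cbind(lower = est - qnorm(0.975)*se,
+       upper = est + qnorm(0.975)*se)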
196 / 430
Confidence intervals
> help(confint2)
>
> confint2(fit, level, test, digits)

fit: the fitted model.
level: the value indicating the required confidence level. By default, level is set to be 0.95.
test: the statistical test on which the approximate confidence intervals are based. The available options are: Wald ("wald"), Rao's score ("score"), Terrell's gradient ("gradient"), and likelihood ratio ("lr") tests. By default, test is set to be "wald".
digits: the integer value indicating the number of decimal places to be used. By default, digits is set to be 5.

The function confint2() provides the lower and upper limits of the confidence intervals for all parameters in the linear predictor.
197 / 430
Estimation of the dispersion parameter
$$\mathrm{E}\!\left[\frac{1}{n}\sum_{i=1}^{n}\frac{\omega_i\,(Y_i-\mu_i)^{2}}{V(\mu_i)}\right]=\phi,$$
and, by the Law of Large Numbers,
$$\frac{1}{n}\sum_{i=1}^{n}\frac{\omega_i\,(Y_i-\mu_i)^{2}}{V(\mu_i)}\;\xrightarrow[n\to\infty]{P}\;\phi.$$
Then, by replacing μ₁, ..., μ_n by μ̂₁, ..., μ̂_n and adjusting the value of n by the number of estimated parameters, the following estimator of φ is obtained:
$$\hat\phi=\frac{1}{n-p-1}\sum_{i=1}^{n}\frac{\omega_i\,(y_i-\hat\mu_i)^{2}}{V(\hat\mu_i)}=\frac{P(\hat\beta)}{n-p-1}.$$
A second estimator of φ is the following:
$$\tilde\phi=\frac{D(\hat\beta)}{n-p-1}.$$
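Both estimators are easy to obtain from a fitted glm object; a minimal sketch, assuming the fit is stored in fit:

# Minimal sketch: the two dispersion estimators above;
# df.residual(fit) equals n - p - 1
phi.hat   <- sum(residuals(fit, type = "pearson")^2) / df.residual(fit)  # Pearson-based
phi.tilde <- deviance(fit) / df.residual(fit)                            # deviance-based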
198 / 430
Special cases
199 / 430
Binomial regression
$$m_i Y_i \overset{ind}{\sim} \mathrm{Binomial}(m_i,\mu_i), \qquad g(\mu_i)=\beta_0+\beta_1 x_{i1}+\ldots+\beta_p x_{ip}$$
$$\mathrm{E}(Y_i)=\mu_i \qquad\text{and}\qquad \mathrm{Var}(Y_i)=m_i^{-1}\,\mu_i(1-\mu_i)$$

Logit link: $g(\mu_i)=\log\!\left(\dfrac{\mu_i}{1-\mu_i}\right)$

Probit link: $g(\mu_i)=\Phi^{-1}(\mu_i)$

Complementary log-log link: $g(\mu_i)=\log\!\left[-\log\left(1-\mu_i\right)\right]$

Cauchy link: $g(\mu_i)=\tan\!\left[\dfrac{\pi}{2}\left(2\mu_i-1\right)\right]$

In this model ω_i = m_i for i = 1, . . . , n.
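In R, the link is selected through the family argument of glm(). A minimal sketch, using hypothetical names (data frame dat with columns killed, exposed and dose):

# Minimal sketch: the same binomial GLM under the four links above
fit.logit   <- glm(killed/exposed ~ dose, weights = exposed,
                   family = binomial("logit"),   data = dat)
fit.probit  <- glm(killed/exposed ~ dose, weights = exposed,
                   family = binomial("probit"),  data = dat)
fit.cloglog <- glm(killed/exposed ~ dose, weights = exposed,
                   family = binomial("cloglog"), data = dat)
fit.cauchit <- glm(killed/exposed ~ dose, weights = exposed,
                   family = binomial("cauchit"), data = dat)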
200 / 430
Males attached to female horseshoe crabs
These data, extracted from Agresti [2015, page 16] and available in the object hcrabs of the library GLMsData, come from a study of female horseshoe crabs on an island in the Gulf of Mexico. During the spawning season, a female migrates to the shore to breed. With a male attached to her posterior spine, she burrows into the sand and lays clusters of eggs. The eggs are fertilized externally in the sand beneath the pair. During spawning, other male crabs (satellites) may cluster around the pair and may also fertilize the eggs. The data set includes

Width (x₁). The carapace width of the female crab, in cm.

Col. The color of the female crab: "LM" (Light Medium), "M" (Medium), "DM" (Dark Medium), or "D" (Dark). Then, x₂ = 1 if Col="LM" or Col="M"; and x₂ = 0 if Col="DM" or Col="D".

Sat. The number of male crabs attached to the female crab. Then, y = 1 if Sat > 0 and y = 0 if Sat = 0.

$$Y_i \overset{ind}{\sim} \mathrm{Bernoulli}(\mu_i), \qquad g(\mu_i)=\beta_0+\beta_1 x_{i1}+\beta_2 x_{i2}$$
201 / 430
Parameter interpretation: logit link
Let Z ∼ Binomial(m, μ) and Z* = m − Z be the number of "successes" and "failures", respectively. Then,
$$\mathrm{Odds}(\mu)=\frac{\mathrm{E}(Z)}{\mathrm{E}(Z^{*})}=\frac{\mu}{1-\mu}=\exp[g(\mu)].$$
Odds(μ) satisfies the following:

(1) It is a strictly positive function of μ.
(2) It is a strictly increasing function of μ. Therefore, Odds(μ) increases when μ increases and decreases when μ decreases.
(3) Odds(0.5) is 1.
(4) It can be interpreted as follows: on average, for every "failure" there are Odds(μ) "successes".

For example, on average, for every female horseshoe crab without satellites, there are Odds(μ) female horseshoe crabs with at least one satellite.
202 / 430
Parameter interpretation: cloglog link
Let Z ∼ Geometric(μ) be the number of independent trials required to get a "success". Then,
$$\Pr(Z=z)=(1-\mu)^{z-1}\mu\; I_{\{1,2,\ldots\}}(z), \qquad \mathrm{E}(Z)=\mu^{-1},$$
and the median of Z is given approximately by
$$\mathrm{MNNT}(\mu)=\frac{-\log(2)}{\log(1-\mu)}=\log(2)\exp[-g(\mu)].$$
MNNT(μ) satisfies the following:

(1) It is a strictly positive function of μ.
(2) It is a strictly decreasing function of μ. Therefore, MNNT(μ) decreases when μ increases and increases when μ decreases.
(3) MNNT(0.5) is 1.
(4) It can be interpreted as the median number of trials required to get a "success".

For example, the median number of female horseshoe crabs required to get one with at least one satellite is MNNT(μ).
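A minimal numerical sketch of this quantity, assuming fit is a glm fitted with family = binomial(cloglog):

# Minimal sketch: MNNT implied by a cloglog fit
mu   <- fitted(fit)
MNNT <- -log(2) / log(1 - mu)   # equals log(2) * exp(-predict(fit))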
203 / 430
Parameter interpretation: without interaction

The odds of having at least one satellite for a female crab with carapace width x₁ and color x₂ is exp(β₀ + β₁x₁ + β₂x₂).

Suppose we have two female crabs whose carapace widths coincide, but the color of the first is Light whereas the color of the second is Dark. Then the odds of having at least one satellite of the first female crab is exp(β₂) times that of the second:
$$\frac{\mathrm{Odds}_{L}}{\mathrm{Odds}_{D}}=\frac{\exp(\beta_0+\beta_1 x_1+\beta_2)}{\exp(\beta_0+\beta_1 x_1)}=\exp(\beta_2).$$

Suppose we have two female crabs whose colors coincide, but the carapace width of the first is x + δ whereas the carapace width of the second is x. Then the odds of having at least one satellite of the first female crab is exp(β₁δ) times that of the second:
$$\frac{\mathrm{Odds}_{x+\delta}}{\mathrm{Odds}_{x}}=\frac{\exp(\beta_0+\beta_1(x+\delta)+\beta_2 x_2)}{\exp(\beta_0+\beta_1 x+\beta_2 x_2)}=\exp(\beta_1\delta).$$
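In practice these odds ratios can be read directly off the fitted coefficients. A minimal sketch, once the logit model of the following slide has been fitted and stored in fit:

# Minimal sketch: estimated odds ratios and Wald limits on the
# odds-ratio scale for the hcrabs logit fit
exp(coef(fit))                            # exp(beta_1) per cm of width, exp(beta_2) for color
exp(confint.default(fit, level = 0.95))   # Wald confidence limits, exponentiated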
204 / 430
Parameter interpretation: with interaction

The odds of having at least one satellite for a female crab with carapace width x₁ and color x₂ is exp(β₀ + β₁x₁ + β₂x₂ + β₃x₁x₂).

If there are two "Dark" female crabs where the carapace width of the first is x + δ whereas that of the second is x, then the odds of having at least one satellite of the first female crab is exp(β₁δ) times that of the second:
$$\frac{\mathrm{Odds}_{x+\delta}}{\mathrm{Odds}_{x}}=\frac{\exp(\beta_0+\beta_1(x+\delta))}{\exp(\beta_0+\beta_1 x)}=\exp(\beta_1\delta).$$

If there are two "Light" female crabs where the carapace width of the first is x + δ whereas that of the second is x, then the odds of having at least one satellite of the first female crab is exp[(β₁ + β₃)δ] times that of the second:
$$\frac{\mathrm{Odds}_{x+\delta}}{\mathrm{Odds}_{x}}=\frac{\exp(\beta_0+(\beta_1+\beta_3)(x+\delta)+\beta_2)}{\exp(\beta_0+(\beta_1+\beta_3)x+\beta_2)}=\exp[(\beta_1+\beta_3)\delta].$$
205 / 430
Males attached to female horseshoe crabs
> data(hcrabs, package="GLMsData")
> hcrabs <- within(hcrabs,{Col2<-ifelse(Col %in% c("LM","M"),1,0)
+ Sat2<-ifelse(Sat > 0,1,0)})
> fit <- glm(Sat2~Width+Col2,family=binomial(logit),data=hcrabs)
> summary(fit)
(Column annotations: Estimate = β̂_j; Std.Error = c_j; z value = z_j = β̂_j/c_j; Pr(>|z|) = Pr[χ²(1) > z_j²].)

Estimate Std.Error z value Pr(>|z|)
(Intercept) -11.9501 2.6643 -4.485 7.28e-06
Width 0.4670 0.1037 4.506 6.61e-06
Col2 0.6531 0.3571 1.829 0.0675

(Dispersion parameter for binomial family taken to be 1)   [φ]

Null deviance : 225.76 on 172 degrees of freedom   [D₀(β̂) on n − 1]
Residual deviance : 191.12 on 170 degrees of freedom   [D(β̂) on n − p − 1]
AIC : 197.12   [−2ℓ(β̂) + 2(p + 1)]
Number of Fisher Scoring iterations: 4
206 / 430
Males attached to female horseshoe crabs
> fit2 <- update(fit, formula=Sat2 ~ Width + Col2 + Width:Col2)
> anova2(fit,fit2,test="wald")
Model 1 : Sat2 ~ Width + Col2
Model 2 : Sat2 ~ Width + Col2 + Width:Col2
(Columns: Chi = ξ_W; df = r; Pr(Chisq>) = Pr[χ²(r) > ξ_W].)
Chi df Pr(Chisq>)
1 vs 2 0.78331 1 0.3761
>
> anova2(fit,fit2, test="lr")
Model 1 : Sat2 ~ Width + Col2
Model 2 : Sat2 ~ Width + Col2 + Width:Col2
(Columns: Chi = ξ_LR; df = r; Pr(Chisq>) = Pr[χ²(r) > ξ_LR].)
Chi df Pr(Chisq>)
1 vs 2 0.80605 1 0.3693
>
> anova2(fit,fit2, test="score")
Model 1 : Sat2 ~ Width + Col2
Model 2 : Sat2 ~ Width + Col2 + Width:Col2
(Columns: Chi = ξ_S; df = r; Pr(Chisq>) = Pr[χ²(r) > ξ_S].)
Chi df Pr(Chisq>)
1 vs 2 0.79097 1 0.3738
207 / 430
Males attached to female horseshoe crabs
Model matrix X: model.matrix(fit)
Parameter estimates β̂ = (β̂₀, β̂₁, ..., β̂_p)ᵀ: coef(fit)
Estimate of the variance-covariance matrix [I(β̂)]⁻¹: vcov(fit)
Fitted values μ̂ = (μ̂₁, ..., μ̂_n)ᵀ: fitted(fit)
Unscaled deviance D(β̂): deviance(fit)
Pearson's statistic P(β̂): sum(resid(fit, type="pearson")**2)
Log-likelihood ℓ(β̂): logLik(fit)
AIC = −2ℓ(β̂) + 2(p + 1): AIC(fit)
BIC = −2ℓ(β̂) + log(n)(p + 1): BIC(fit)
R²_adj = 1 − [D(β̂)/(n − p − 1)] / [D₀(β̂)/(n − 1)]:
with(fit, 1-deviance*df.null/(null.deviance*df.residual))
208 / 430
Insecticides
These data, extracted from Hewlett and Plackett [1950] and available in the object deposit of the library GLMsData, arose from an experiment in which groups of insects were exposed to deposits of insecticides. The total number of insects exposed in each group (Number), the number of insects killed after six days of exposure (Killed), and the amount (Deposit) and type of insecticide used (Insecticide) were recorded.

[Figure: percentage of killed insects versus amount of insecticide, in milligrams, by insecticide type (A, B, C).]

$$m_i Y_i \overset{ind}{\sim} \mathrm{Binomial}(m_i,\mu_i), \qquad g(\mu_i)=\beta_0+\beta_1 x_{i1}+\beta_2 x_{i2}+\beta_3 x_{i3}$$

m = Number of exposed insects
y = Proportion of killed insects
x₁ = Amount of insecticide, in mg.
x₂ = 0 and x₃ = 0 if Insecticide="A"
x₂ = 1 and x₃ = 0 if Insecticide="B"
x₂ = 0 and x₃ = 1 if Insecticide="C"
209 / 430
Parameter interpretation: without interaction

The odds of death for an insect exposed to an amount of insecticide x₁ of type A (x₂ = 0 and x₃ = 0), B (x₂ = 1 and x₃ = 0) or C (x₂ = 0 and x₃ = 1) is exp(β₀ + β₁log(x₁) + β₂x₂ + β₃x₃).

Two insects are exposed to the same type of insecticide, but the amounts the first and second insects were exposed to are xδ and x, respectively. Then, the odds of death of the first insect is δ^β₁ times that of the second:
$$\frac{\mathrm{Odds}_{x\delta}}{\mathrm{Odds}_{x}}=\frac{\exp(\beta_0+\beta_1\log(x\delta)+\beta_2 x_2+\beta_3 x_3)}{\exp(\beta_0+\beta_1\log(x)+\beta_2 x_2+\beta_3 x_3)}=\delta^{\beta_1}.$$

Two insects receive the same dose of insecticide, but the first insect is exposed to type B insecticide and the second insect is exposed to type A insecticide. Then, the odds of death of the first insect is exp(β₂) times that of the second:
$$\frac{\mathrm{Odds}_{B}}{\mathrm{Odds}_{A}}=\frac{\exp(\beta_0+\beta_1\log(x_1)+\beta_2)}{\exp(\beta_0+\beta_1\log(x_1))}=\exp(\beta_2).$$

Two insects receive the same dose of insecticide, but the first insect is exposed to type C insecticide and the second insect is exposed to type A insecticide. Then, the odds of death of the first insect is exp(β₃) times that of the second:
$$\frac{\mathrm{Odds}_{C}}{\mathrm{Odds}_{A}}=\frac{\exp(\beta_0+\beta_1\log(x_1)+\beta_3)}{\exp(\beta_0+\beta_1\log(x_1))}=\exp(\beta_3).$$
210 / 430
Parameter interpretation: with interaction

The odds of death for an insect exposed to an amount of insecticide x₁ of type A (x₂ = 0 and x₃ = 0), B (x₂ = 1 and x₃ = 0) or C (x₂ = 0 and x₃ = 1) is exp(β₀ + β₁log(x₁) + β₂x₂ + β₃x₃ + β₄log(x₁)x₂ + β₅log(x₁)x₃).

Two insects were exposed to insecticide type A. The amounts the first and second insects were exposed to are xδ and x, respectively. Then, the odds of death of the first insect is δ^β₁ times that of the second:
$$\frac{\mathrm{Odds}_{x\delta}}{\mathrm{Odds}_{x}}=\frac{\exp(\beta_0+\beta_1\log(x\delta))}{\exp(\beta_0+\beta_1\log(x))}=\delta^{\beta_1}.$$

Two insects were exposed to insecticide type B. The amounts the first and second insects were exposed to are xδ and x, respectively. Then, the odds of death of the first insect is δ^(β₁+β₄) times that of the second:
$$\frac{\mathrm{Odds}_{x\delta}}{\mathrm{Odds}_{x}}=\frac{\exp(\beta_0+\beta_1\log(x\delta)+\beta_2+\beta_4\log(x\delta))}{\exp(\beta_0+\beta_1\log(x)+\beta_2+\beta_4\log(x))}=\delta^{\beta_1+\beta_4}.$$

Two insects were exposed to insecticide type C. The amounts the first and second insects were exposed to are xδ and x, respectively. Then, the odds of death of the first insect is δ^(β₁+β₅) times that of the second:
$$\frac{\mathrm{Odds}_{x\delta}}{\mathrm{Odds}_{x}}=\frac{\exp(\beta_0+\beta_1\log(x\delta)+\beta_3+\beta_5\log(x\delta))}{\exp(\beta_0+\beta_1\log(x)+\beta_3+\beta_5\log(x))}=\delta^{\beta_1+\beta_5}.$$
211 / 430
Insecticides
> data(deposit, package="GLMsData")
>
> fit <- glm(Killed/Number ~ log(Deposit) + Insecticide,
+ weights=Number, family=binomial(logit), data=deposit)
> summary(fit)
(Column annotations: Estimate = β̂_j; Std.Error = c_j; z value = z_j = β̂_j/c_j; Pr(>|z|) = Pr[χ²(1) > z_j²].)

Estimate Std.Error z value Pr(>|z|)
(Intercept) -4.4613 0.3564 -12.518 <2e-16
log(Deposit) 2.8875 0.2236 12.912 <2e-16
InsecticideB 0.3489 0.2060 1.694 0.0903
InsecticideC 2.8402 0.2538 11.191 <2e-16

(Dispersion parameter for binomial family taken to be 1)   [φ]

Null deviance : 413.644 on 17 degrees of freedom   [D₀(β̂) on n − 1]
Residual deviance : 23.385 on 14 degrees of freedom   [D(β̂) on n − p − 1]
AIC : 93.583   [−2ℓ(β̂) + 2(p + 1)]
Number of Fisher Scoring iterations: 4
212 / 430
Insecticides
> fit2 <- update(fit,formula=Killed/Number~log(Deposit)*Insecticide)
> anova2(fit,fit2,test="wald")
Model 1 : Killed/Number ~ log(Deposit) + Insecticide
Model 2 : Killed/Number ~ log(Deposit) * Insecticide
(Columns: Chi = ξ_W; df = r; Pr(Chisq>) = Pr[χ²(r) > ξ_W].)
Chi df Pr(Chisq>)
1 vs 2 0.99287 2 0.6087
>
> anova2(fit,fit2, test="lr")
Model 1 : Killed/Number ~ log(Deposit) + Insecticide
Model 2 : Killed/Number ~ log(Deposit) * Insecticide
(Columns: Chi = ξ_LR; df = r; Pr(Chisq>) = Pr[χ²(r) > ξ_LR].)
Chi df Pr(Chisq>)
1 vs 2 1.0487 2 0.5919
>
> anova2(fit,fit2, test="score")
Model 1 : Killed/Number ~ log(Deposit) + Insecticide
Model 2 : Killed/Number ~ log(Deposit) * Insecticide
(Columns: Chi = ξ_S; df = r; Pr(Chisq>) = Pr[χ²(r) > ξ_S].)
Chi df Pr(Chisq>)
1 vs 2 0.99865 2 0.6069
213 / 430
Poisson regression
$$Y_i \overset{ind}{\sim} \mathrm{Poisson}(\mu_i), \qquad g(\mu_i)=\beta_0+\beta_1 x_{i1}+\ldots+\beta_p x_{ip}$$
$$\mathrm{E}(Y_i)=\mu_i \qquad\text{and}\qquad \mathrm{Var}(Y_i)=\mu_i$$

Logarithmic link: $g(\mu_i)=\log(\mu_i)$

Identity link: $g(\mu_i)=\mu_i$

Square-root link: $g(\mu_i)=\sqrt{\mu_i}$
214 / 430
Skin cancer in women
These data, extracted from Kleinbaum et al. [2013, page 751] and available in the object skincancer of the library glmtoolbox, describe the number of non-melanoma skin cancer cases (cases) among women (population) stratified by age (age: "15-24", "25-34", "35-44", "45-54", "55-64", "65-74", "75-84", "85+") in two metropolitan areas of the US (city: "St. Paul", in the Minneapolis area, and "Fort Worth", in the Dallas area).

[Figure: incidence rate of skin cancer ×10⁴ versus age range, in years, for St. Paul and Ft. Worth.]

$$Y_i \overset{ind}{\sim} \mathrm{Poisson}(\mu_i), \qquad \log\!\left(\frac{10^{4}\,\mu_i}{\lambda_i}\right)=\beta_0+\beta_1 x_{i1}+\beta_2 x_{i2}$$

y = Number of cases of skin cancer.
λ = Number of women exposed.
x₁ = 1 if city="Fort Worth"
x₁ = 0 if city="St. Paul"
x₂ = Midpoint of age range.
215 / 430
Parameter interpretation: without interaction

The expected incidence of skin cancer for a population group located in the city x₁ and an age range with midpoint x₂ is exp(β₀ + β₁x₁ + β₂log(x₂)).

If there are two population groups whose age ranges coincide but the first is located in Fort Worth whereas the second is located in Saint Paul, then the expected incidence of skin cancer in the first population group is exp(β₁) times that of the second:
$$\frac{\mu_{FW}}{\mu_{SP}}=\frac{\exp(\beta_0+\beta_1+\beta_2\log(x_2))}{\exp(\beta_0+\beta_2\log(x_2))}=\exp(\beta_1).$$

If there are two population groups located in the same city but the midpoint of the age range of the first is xδ whereas that of the second is x, then the expected incidence of skin cancer in the first population group is δ^β₂ times that of the second:
$$\frac{\mu_{x\delta}}{\mu_{x}}=\frac{\exp(\beta_0+\beta_1 x_1+\beta_2\log(x\delta))}{\exp(\beta_0+\beta_1 x_1+\beta_2\log(x))}=\exp(\beta_2\log(\delta))=\delta^{\beta_2}.$$
216 / 430
Parameter interpretation: with interaction

The expected incidence of skin cancer for a population group in the city x₁ and an age range with midpoint x₂ is exp(β₀ + β₁x₁ + β₂log(x₂) + β₃x₁log(x₂)).

If there are two population groups located in Saint Paul but the midpoint of the age range of the first is xδ whereas that of the second is x, then the expected incidence of skin cancer in the first population group is δ^β₂ times that of the second:
$$\frac{\mu_{x\delta}}{\mu_{x}}=\frac{\exp(\beta_0+\beta_2\log(x\delta))}{\exp(\beta_0+\beta_2\log(x))}=\exp(\beta_2\log(\delta))=\delta^{\beta_2}.$$

If there are two population groups located in Fort Worth but the midpoint of the age range of the first is xδ whereas that of the second is x, then the expected incidence of skin cancer in the first population group is δ^(β₂+β₃) times that of the second:
$$\frac{\mu_{x\delta}}{\mu_{x}}=\frac{\exp(\beta_0+\beta_1+\beta_2\log(x\delta)+\beta_3\log(x\delta))}{\exp(\beta_0+\beta_1+\beta_2\log(x)+\beta_3\log(x))}=\delta^{\beta_2+\beta_3}.$$
217 / 430
Skin cancer in women
> data(skincancer)
> fit <- glm(cases ~ city + log(age), offset=log(population),
+ family=poisson(log), data=skincancer)
> summary(fit)
(Column annotations: Estimate = β̂_j; Std.Error = c_j; z value = z_j = β̂_j/c_j; Pr(>|z|) = Pr[χ²(1) > z_j²].)

Estimate Std.Error z value Pr(>|z|)
(Intercept) -20.35008 0.35282 -57.68 <2e-16
cityFt.Worth 0.80928 0.05217 15.51 <2e-16
log(age) 3.37649 0.08469 39.87 <2e-16

(Dispersion parameter for poisson family taken to be 1)   [φ]

Null deviance : 2789.681 on 15 degrees of freedom   [D₀(β̂) on n − 1]
Residual deviance : 31.816 on 13 degrees of freedom   [D(β̂) on n − p − 1]
AIC : 132.06   [−2ℓ(β̂) + 2(p + 1)]
218 / 430
Skin cancer in women
> fit2 <- update(fit,formula = cases ~ city*log(age))
> anova2(fit,fit2, test="wald")
Model 1 : cases ~ city + log(age)
Model 2 : cases ~ city + log(age) + city:log(age)
(Columns: Chi = ξ_W; df = r; Pr(Chisq>) = Pr[χ²(r) > ξ_W].)
Chi df Pr(Chisq>)
1 vs 2 3.6753 1 0.05523
>
> anova2(fit,fit2, test="lr")
Model 1 : cases ~ city + log(age)
Model 2 : cases ~ city + log(age) + city:log(age)
(Columns: Chi = ξ_LR; df = r; Pr(Chisq>) = Pr[χ²(r) > ξ_LR].)
Chi df Pr(Chisq>)
1 vs 2 3.764 1 0.05237
>
> anova2(fit,fit2, test="score")
Model 1 : cases ~ city + log(age)
Model 2 : cases ~ city + log(age) + city:log(age)
(Columns: Chi = ξ_S; df = r; Pr(Chisq>) = Pr[χ²(r) > ξ_S].)
Chi df Pr(Chisq>)
1 vs 2 3.6808 1 0.05504
219 / 430
Richness of the species
The analysis of these data, available in the object richness of the library glmtoolbox, aims to explain the differences between plots as related to species richness (represented by a count of the number of plant species (richness)) using the differences between plots with respect to biomasses (Biomass) and soil pH levels: low (pH="low"), medium (pH="mid") and high (pH="high").

[Figure: number of plant species versus biomass, by pH level (low, mid, high).]

$$Y_i \overset{ind}{\sim} \mathrm{Poisson}(\mu_i), \qquad \mu_i=\beta_0+\beta_1 x_{i1}+\beta_2 x_{i2}+\beta_3 x_{i3}$$

y = Number of plant species
x₁ = Biomass
x₂ = 0 and x₃ = 0 if pH="low"
x₂ = 1 and x₃ = 0 if pH="mid"
x₂ = 0 and x₃ = 1 if pH="high"
220 / 430
Parameter interpretation: without interaction

The expected number of plant species for a plot with biomass x₁ and pH level low (x₂ = 0 and x₃ = 0), mid (x₂ = 1 and x₃ = 0) or high (x₂ = 0 and x₃ = 1) is β₀ + β₁x₁ + β₂x₂ + β₃x₃.

There are two plots with the same pH level, but the biomass of the first plot is x + δ while that of the second plot is x. As a result, the expected number of plant species for the first plot is |β₁|δ units lower, if β₁ < 0, or greater, if β₁ > 0, than that for the second plot:
$$\mu_{x+\delta}-\mu_{x}=\beta_0+\beta_1(x+\delta)+\beta_2 x_2+\beta_3 x_3-(\beta_0+\beta_1 x+\beta_2 x_2+\beta_3 x_3)=\beta_1\delta.$$

There are two plots with the same biomass, but the pH level of the first plot is mid while that of the second plot is low. As a result, the expected number of plant species for the first plot is |β₂| units lower, if β₂ < 0, or greater, if β₂ > 0, than that for the second plot:
$$\mu_{mid}-\mu_{low}=\beta_0+\beta_1 x_1+\beta_2-(\beta_0+\beta_1 x_1)=\beta_2.$$

There are two plots with the same biomass, but the pH level of the first plot is high while that of the second plot is low. As a result, the expected number of plant species for the first plot is |β₃| units lower, if β₃ < 0, or greater, if β₃ > 0, than that for the second plot:
$$\mu_{high}-\mu_{low}=\beta_0+\beta_1 x_1+\beta_3-(\beta_0+\beta_1 x_1)=\beta_3.$$
221 / 430
Parameter interpretation: with interaction

The expected number of plant species for a plot with biomass x₁ and pH level low (x₂ = 0 and x₃ = 0), mid (x₂ = 1 and x₃ = 0) or high (x₂ = 0 and x₃ = 1) is β₀ + β₁x₁ + β₂x₂ + β₃x₃ + β₄x₁x₂ + β₅x₁x₃.

There are two plots with pH level low, but the biomass of the first plot is x + δ while that of the second plot is x. Therefore, the expected number of plant species for the first plot is |β₁|δ units lower, if β₁ < 0, or greater, if β₁ > 0, than that for the second plot:
$$\mu_{x+\delta}-\mu_{x}=\beta_0+\beta_1(x+\delta)-(\beta_0+\beta_1 x)=\beta_1\delta.$$

There are two plots with pH level mid, but the biomass of the first plot is x + δ while that of the second plot is x. Therefore, the expected number of plant species for the first plot is |β₁ + β₄|δ units lower, if β₁ + β₄ < 0, or greater, if β₁ + β₄ > 0, than that for the second plot:
$$\mu_{x+\delta}-\mu_{x}=\beta_0+\beta_1(x+\delta)+\beta_2+\beta_4(x+\delta)-(\beta_0+\beta_1 x+\beta_2+\beta_4 x)=(\beta_1+\beta_4)\delta.$$

There are two plots with pH level high, but the biomass of the first plot is x + δ while that of the second plot is x. Therefore, the expected number of plant species for the first plot is |β₁ + β₅|δ units lower, if β₁ + β₅ < 0, or greater, if β₁ + β₅ > 0, than that for the second plot:
$$\mu_{x+\delta}-\mu_{x}=\beta_0+\beta_1(x+\delta)+\beta_3+\beta_5(x+\delta)-(\beta_0+\beta_1 x+\beta_3+\beta_5 x)=(\beta_1+\beta_5)\delta.$$
222 / 430
Richness of the species
> data(richness)
> fit <- glm(Species ~ Biomass + pH, family=poisson(identity),
+ data=richness)
> summary(fit)
(Column annotations: Estimate = β̂_j; Std.Error = c_j; z value = z_j = β̂_j/c_j; Pr(>|z|) = Pr[χ²(1) > z_j²].)

Estimate Std.Error z value Pr(>|z|)
(Intercept) 17.7915 0.7789 22.84 <2e-16
Biomass -2.8710 0.1830 -15.68 <2e-16
pHmid 12.5603 0.9821 12.79 <2e-16
pHhigh 23.0681 1.1801 19.55 <2e-16

(Dispersion parameter for poisson family taken to be 1)   [φ]

Null deviance : 452.346 on 89 degrees of freedom   [D₀(β̂) on n − 1]
Residual deviance : 68.126 on 86 degrees of freedom   [D(β̂) on n − p − 1]
AIC : 495.32   [−2ℓ(β̂) + 2(p + 1)]
223 / 430
Richness of the species
> fit2 <- update(fit,formula=Species ~ Biomass + pH + Biomass:pH)
> anova2(fit,fit2, test="wald")
Model 1 : Species ~ Biomass + pH
Model 2 : Species ~ Biomass + pH + Biomass:pH
(Columns: Chi = ξ_W; df = r; Pr(Chisq>) = Pr[χ²(r) > ξ_W].)
Chi df Pr(Chisq>)
1 vs 2 1.1275 2 0.5691
>
> anova2(fit,fit2, test="lr")
Model 1 : Species ~ Biomass + pH
Model 2 : Species ~ Biomass + pH + Biomass:pH
(Columns: Chi = ξ_LR; df = r; Pr(Chisq>) = Pr[χ²(r) > ξ_LR].)
Chi df Pr(Chisq>)
1 vs 2 1.0563 2 0.5897
>
> anova2(fit,fit2, test="score")
Model 1 : Species ~ Biomass + pH
Model 2 : Species ~ Biomass + pH + Biomass:pH
(Columns: Chi = ξ_S; df = r; Pr(Chisq>) = Pr[χ²(r) > ξ_S].)
Chi df Pr(Chisq>)
1 vs 2 0.98562 2 0.6109
224 / 430
Normal regression
$$Y_i \overset{ind}{\sim} \mathrm{Normal}(\mu_i,\sigma_i^{2}), \qquad g(\mu_i)=\beta_0+\beta_1 x_{i1}+\ldots+\beta_p x_{ip}$$
Equivalently,
$$Y_i=\mu_i+\varepsilon_i, \qquad g(\mu_i)=\beta_0+\beta_1 x_{i1}+\ldots+\beta_p x_{ip}, \qquad \varepsilon_i \overset{ind}{\sim} \mathrm{Normal}(0,\sigma_i^{2})$$
$$\mathrm{E}(Y_i)=\mu_i \qquad\text{and}\qquad \mathrm{Var}(Y_i)=\frac{\sigma^{2}}{\omega_i}$$

Inverse link: $g(\mu_i)=1/\mu_i$

Logarithmic link: $g(\mu_i)=\log(\mu_i)$

Identity link: $g(\mu_i)=\mu_i$

If ω₁ = ω₂ = ... = ω_n, then the dispersion parameter is constant, which implies that the variance is also constant.
225 / 430
Normal regression
Let Q_ε_i(α) be the 100(α)th percentile of the random error ε_i. Thus,

Pr[ε_i ≤ Q_ε_i(α)] = α
Pr[μ_i + ε_i ≤ μ_i + Q_ε_i(α)] = α
Pr[Y_i ≤ μ_i + Q_ε_i(α)] = α.

That is, the 100(α)th percentile of Y_i, denoted here by Q_Y_i(α), can be written as μ_i + Q_ε_i(α). In addition, if the link function is the identity (that is, g(μ) = μ) and ω₁ = ω₂ = ... = ω_n = 1, then ε₁, ..., ε_n are identically distributed random variables, so Q_ε_i(α) does not depend on i. Hence, Q_Y_i(α) can be expressed as follows:
$$Q_{Y_i}(\alpha)=\mu_i+Q_{\varepsilon}(\alpha)=\beta_0^{*}+\beta_1 x_{i1}+\ldots+\beta_p x_{ip},$$
where β₀* = β₀ + Q_ε(α). Therefore, the parameter interpretation also may be performed by taking into account the additive effects of the covariates on Q_Y_i(α), for all α ∈ (0, 1).
226 / 430
Advertising
This data set, extracted from James et al. [2013, page 15] and available in the object advertising of the library glmtoolbox, consists of the sales (sales), in thousands of units, of a product in 200 different markets, along with advertising budgets, in thousands of dollars, for the product in each of those markets for television (TV) and radio (radio).

[Figure: sales versus advertising budget for TV, by advertising budget for radio (low, mid, high).]

$$Y_i \overset{ind}{\sim} \mathrm{Normal}(\mu_i,\sigma^{2}), \qquad \log(\mu_i)=\beta_0+\beta_1\log(x_{i1})+\beta_2 x_{i2}+\beta_3\log(x_{i1})\,x_{i2}$$

y = Sales, in thousands of units.
x₁ = Advertising budget for TV.
x₂ = Advertising budget for radio.
227 / 430
Parameter interpretation: without interaction

The expected sales of the product in a market where the advertising budget for TV is x₁ and the advertising budget for radio is x₂ are exp(β₀ + β₁log(x₁) + β₂x₂).

In two markets the radio advertising budgets are the same, but the TV advertising budgets are xδ and x, respectively. As a result, the product's expected sales in the first market are δ^β₁ times those in the second market:
$$\frac{\mu_{x\delta}}{\mu_{x}}=\frac{\exp(\beta_0+\beta_1\log(x\delta)+\beta_2 x_2)}{\exp(\beta_0+\beta_1\log(x)+\beta_2 x_2)}=\exp(\beta_1\log(\delta))=\delta^{\beta_1}.$$

In two markets the advertising budgets for TV coincide, but the advertising budget for radio in the first is x + δ, and in the second is x. As a result, the product's expected sales in the first market are exp(β₂δ) times those in the second market:
$$\frac{\mu_{x+\delta}}{\mu_{x}}=\frac{\exp(\beta_0+\beta_1\log(x_1)+\beta_2(x+\delta))}{\exp(\beta_0+\beta_1\log(x_1)+\beta_2 x)}=\exp(\beta_2\delta).$$
228 / 430
Parameter interpretation: with interaction

The expected sales of the product in a market where the advertising budget for TV is x₁ and the advertising budget for radio is x₂ are exp(β₀ + β₁log(x₁) + β₂x₂ + β₃log(x₁)x₂).

Two markets have radio advertising budgets equal to z, but TV advertising budgets of xδ and x, respectively. As a result, the product's expected sales in the first market are δ^(β₁+β₃z) times those in the second market:
$$\frac{\mu_{x\delta}}{\mu_{x}}=\frac{\exp(\beta_0+\beta_1\log(x\delta)+\beta_2 z+\beta_3\log(x\delta)z)}{\exp(\beta_0+\beta_1\log(x)+\beta_2 z+\beta_3\log(x)z)}=\delta^{\beta_1+\beta_3 z}.$$

Two markets have TV advertising budgets equal to z, but radio advertising budgets of x + δ and x, respectively. As a result, the product's expected sales in the first market are exp[(β₂ + β₃log(z))δ] times those in the second market:
$$\frac{\mu_{x+\delta}}{\mu_{x}}=\frac{\exp(\beta_0+\beta_1\log(z)+\beta_2(x+\delta)+\beta_3\log(z)(x+\delta))}{\exp(\beta_0+\beta_1\log(z)+\beta_2 x+\beta_3\log(z)x)}=\exp[(\beta_2+\beta_3\log(z))\delta].$$
229 / 430
Advertising
> data(advertising)
> fit <- glm(sales ~ log(TV) + radio + log(TV):radio,
+ family=gaussian(log), data=advertising)
> summary(fit)
(Column annotations: Estimate = β̂_j; Std.Error = c_j; z value = z_j = β̂_j/c_j; Pr(>|z|) = Pr[χ²(1) > z_j²].)

Estimate Std.Error z value Pr(>|z|)
(Intercept) 1.0302249 0.0467959 22.015 <2e-16
log(TV) 0.2646359 0.0092146 28.719 <2e-16
radio -0.0043242 0.0014929 -2.896 0.0042
log(TV):radio 0.0036644 0.0002909 12.598 <2e-16

(Dispersion parameter for gaussian family taken to be 0.2668851)   [φ̂]

Null deviance : 5417.149 on 199 degrees of freedom   [D₀(β̂) on n − 1]
Residual deviance : 52.309 on 196 degrees of freedom   [D(β̂) on n − p − 1]
AIC : 309.35   [−2ℓ(β̂) + 2(p + 2)]
Number of Fisher Scoring iterations: 5
230 / 430
Advertising
> fit2 <- update(fit,formula = sales ~ log(TV) + radio)
> anova2(fit2,fit, test="wald")
Model 1 : sales ~ log(TV) + radio
Model 2 : sales ~ log(TV) + radio + log(TV)*radio
(Columns: Chi = ξ_W; df = r; Pr(Chisq>) = Pr[χ²(r) > ξ_W].)
Chi df Pr(Chisq>)
1 vs 2 158.7 1 < 2.2e-16
>
> anova2(fit2,fit, test="lr")
Model 1 : sales ~ log(TV) + radio
Model 2 : sales ~ log(TV) + radio + log(TV)*radio
(Columns: Chi = ξ_LR; df = r; Pr(Chisq>) = Pr[χ²(r) > ξ_LR].)
Chi df Pr(Chisq>)
1 vs 2 153.45 1 < 2.2e-16
>
> anova2(fit2,fit, test="gradient")
Model 1 : sales ~ log(TV) + radio
Model 2 : sales ~ log(TV) + radio + log(TV)*radio
(Columns: Chi = ξ_G; df = r; Pr(Chisq>) = Pr[χ²(r) > ξ_G].)
Chi df Pr(Chisq>)
1 vs 2 151.16 1 < 2.2e-16
231 / 430
Advertising
Model matrix X: model.matrix(fit)
Parameter estimates β̂ = (β̂₀, β̂₁, ..., β̂_p)ᵀ: coef(fit)
Estimate of the variance-covariance matrix [I(β̂)]⁻¹: vcov(fit)
Fitted values μ̂ = (μ̂₁, ..., μ̂_n)ᵀ: fitted(fit)
Unscaled deviance D(β̂): deviance(fit)
Pearson's statistic P(β̂): sum(resid(fit, type="pearson")**2)
Log-likelihood ℓ(β̂): logLik(fit)
AIC = −2ℓ(β̂) + 2(p + 2): AIC(fit)
BIC = −2ℓ(β̂) + log(n)(p + 2): BIC(fit)
R²_adj = 1 − [D(β̂)/(n − p − 1)] / [D₀(β̂)/(n − 1)]:
with(fit, 1-deviance*df.null/(null.deviance*df.residual))
232 / 430
Gamma regression
$$Y_i \overset{ind}{\sim} \mathrm{Gamma}(\mu_i,\sigma_i^{2}), \qquad g(\mu_i)=\beta_0+\beta_1 x_{i1}+\ldots+\beta_p x_{ip}$$
Equivalently,
$$Y_i=\mu_i\,\varepsilon_i, \qquad g(\mu_i)=\beta_0+\beta_1 x_{i1}+\ldots+\beta_p x_{ip}, \qquad \varepsilon_i \overset{ind}{\sim} \mathrm{Gamma}(1,\sigma_i^{2})$$
$$\mathrm{E}(Y_i)=\mu_i \qquad\text{and}\qquad \mathrm{Var}(Y_i)=\frac{\sigma^{2}}{\omega_i}\,\mu_i^{2}$$

Inverse link: $g(\mu_i)=1/\mu_i$

Logarithmic link: $g(\mu_i)=\log(\mu_i)$

Identity link: $g(\mu_i)=\mu_i$

If ω₁ = ω₂ = ... = ω_n, then the dispersion parameter is constant, which implies that the coefficient of variation is also constant.
233 / 430
Gamma regression
Let Q_ε_i(α) be the 100(α)th percentile of the random error ε_i. Thus,

Pr[ε_i ≤ Q_ε_i(α)] = α
Pr[μ_i ε_i ≤ μ_i Q_ε_i(α)] = α
Pr[Y_i ≤ μ_i Q_ε_i(α)] = α.

That is, the 100(α)th percentile of Y_i, denoted here by Q_Y_i(α), can be written as μ_i Q_ε_i(α). In addition, if the link function is the logarithm (that is, g(μ) = log(μ)) and ω₁ = ω₂ = ... = ω_n = 1, then ε₁, ..., ε_n are identically distributed random variables, so Q_ε_i(α) does not depend on i. Hence, log[Q_Y_i(α)] can be expressed as follows:
$$\log[Q_{Y_i}(\alpha)]=\log(\mu_i)+\log[Q_{\varepsilon}(\alpha)]=\beta_0^{*}+\beta_1 x_{i1}+\ldots+\beta_p x_{ip},$$
where β₀* = β₀ + log[Q_ε(α)]. Therefore, the parameter interpretation also may be performed by taking into account the multiplicative effects of the covariates on Q_Y_i(α), for all α ∈ (0, 1).
234 / 430
Hill races in Scotland
Each year, the Scottish Hill Runners Association publishes a list of hill races in Scotland for the year. These data, extracted from Agresti [2015, page 62] and available in the object races of the library glmtoolbox, consist of the record time, distance, and cumulative climb of 35 of those races. The aim of the statistical analysis of these data is to explain the differences between the record times of the races using their differences in distance and cumulative climb. The variables in this data set are the following:

distance (x₁). The distance, in miles, of the races.
cclimb (x₂). The cumulative climb, in thousands of feet, of the races.
rtime (y). The record time, in minutes, of the races.
race. The names of the races.

$$Y_i \overset{ind}{\sim} \mathrm{Gamma}(\mu_i,\sigma^{2}), \qquad \log(\mu_i)=\beta_0+\beta_1\log(x_{i1})+\beta_2 x_{i2}$$
235 / 430
Parameter interpretation: without interaction

The expected (median) record time for a race with distance x₁ and cumulative climb x₂ is (proportional to) exp(β₀ + β₁log(x₁) + β₂x₂).

In two races the cumulative climbs are the same, but the distances are xδ and x, respectively. As a result, the expected (median) record time in the first race is δ^β₁ times that in the second race:
$$\frac{\mu_{x\delta}}{\mu_{x}}=\frac{\exp(\beta_0+\beta_1\log(x\delta)+\beta_2 x_2)}{\exp(\beta_0+\beta_1\log(x)+\beta_2 x_2)}=\exp(\beta_1\log(\delta))=\delta^{\beta_1}.$$

In two races the distances coincide, but the cumulative climb in the first is x + δ, and in the second is x. As a result, the expected (median) record time in the first race is exp(β₂δ) times that in the second race:
$$\frac{\mu_{x+\delta}}{\mu_{x}}=\frac{\exp(\beta_0+\beta_1\log(x_1)+\beta_2(x+\delta))}{\exp(\beta_0+\beta_1\log(x_1)+\beta_2 x)}=\exp(\beta_2\delta).$$
236 / 430
Parameter interpretation: with interaction

The expected (median) record time for a race with distance x₁ and cumulative climb x₂ is (proportional to) exp(β₀ + β₁log(x₁) + β₂x₂ + β₃log(x₁)x₂).

Two races have cumulative climbs equal to z, but distances of xδ and x, respectively. As a result, the expected (median) record time in the first race is δ^(β₁+β₃z) times that in the second race:
$$\frac{\mu_{x\delta}}{\mu_{x}}=\frac{\exp(\beta_0+\beta_1\log(x\delta)+\beta_2 z+\beta_3\log(x\delta)z)}{\exp(\beta_0+\beta_1\log(x)+\beta_2 z+\beta_3\log(x)z)}=\delta^{\beta_1+\beta_3 z}.$$

Two races have distances equal to z, but cumulative climbs of x + δ and x, respectively. As a result, the expected (median) record time in the first race is exp[(β₂ + β₃log(z))δ] times that in the second race:
$$\frac{\mu_{x+\delta}}{\mu_{x}}=\frac{\exp(\beta_0+\beta_1\log(z)+\beta_2(x+\delta)+\beta_3\log(z)(x+\delta))}{\exp(\beta_0+\beta_1\log(z)+\beta_2 x+\beta_3\log(z)x)}=\exp[(\beta_2+\beta_3\log(z))\delta].$$
237 / 430
Hill races in Scotland
> data(races)
> fit <- glm(rtime ~ log(distance) + cclimb, family=Gamma(log),
+ data=races)
> summary(fit)
(Column annotations: Estimate = β̂_j; Std.Error = c_j; z value = z_j = β̂_j/c_j; Pr(>|z|) = Pr[χ²(1) > z_j²].)

Estimate Std.Error z value Pr(>|z|)
(Intercept) 1.86703 0.08938 20.888 <2e-16
log(distance) 0.90141 0.05933 15.193 3.47e-16
cclimb 0.13089 0.02165 6.045 9.52e-07

(Dispersion parameter for Gamma family taken to be 0.02141007)   [φ̂]

Null deviance : 19.79114 on 34 degrees of freedom   [D₀(β̂) on n − 1]
Residual deviance : 0.72634 on 32 degrees of freedom   [D(β̂) on n − p − 1]
AIC : 233.91   [−2ℓ(β̂) + 2(p + 2)]
Number of Fisher Scoring iterations: 5
238 / 430
Hill races in Scotland
> fit2 <- update(fit,formula = rtime ~ log(distance)*cclimb)
> anova2(fit,fit2, test="wald")
Model 1 : rtime ~ log(distance) + cclimb
Model 2 : rtime ~ log(distance) + cclimb + log(distance):cclimb
(Columns: Chi = ξ_W; df = r; Pr(Chisq>) = Pr[χ²(r) > ξ_W].)
Chi df Pr(Chisq>)
1 vs 2 1.8245 1 0.1768
>
> anova2(fit,fit2, test="lr")
Model 1 : rtime ~ log(distance) + cclimb
Model 2 : rtime ~ log(distance) + cclimb + log(distance):cclimb
(Columns: Chi = ξ_LR; df = r; Pr(Chisq>) = Pr[χ²(r) > ξ_LR].)
Chi df Pr(Chisq>)
1 vs 2 1.8072 1 0.1788
>
> anova2(fit,fit2, test="gradient")
Model 1 : rtime ~ log(distance) + cclimb
Model 2 : rtime ~ log(distance) + cclimb + log(distance):cclimb
(Columns: Chi = ξ_G; df = r; Pr(Chisq>) = Pr[χ²(r) > ξ_G].)
Chi df Pr(Chisq>)
1 vs 2 1.8061 1 0.179
239 / 430
Lognormal distribution
If Y ∼ Normal(μ, σ²) then T = exp(Y) ∼ Lognormal(exp(μ), σ²). If T ∼ Lognormal(τ, σ²) then Y = log(T) ∼ Normal(log(τ), σ²). The probability density function of T may be written as follows:
$$f_{T}(t;\tau,\sigma^{2})=\frac{1}{t\sqrt{2\pi\sigma^{2}}}\exp\!\left\{-\frac{[\log(t)-\log(\tau)]^{2}}{2\sigma^{2}}\right\} I_{(0,\infty)}(t).$$
The mean, variance and coefficient of variation of T are
$$\tau\exp\!\left(\frac{\sigma^{2}}{2}\right), \qquad \tau^{2}\exp(\sigma^{2})\big[\exp(\sigma^{2})-1\big] \qquad\text{and}\qquad \sqrt{\exp(\sigma^{2})-1},$$
respectively. In addition, the 100(α)th percentile of T reduces to
$$Q_{T}(\alpha)=\tau\exp(\sigma Z_{\alpha}), \qquad \alpha\in(0,1),$$
where Z_α represents the 100(α)th percentile of the Normal(0, 1). For example, the median of T is τ. The lognormal distribution does not belong to the LEF.
240 / 430
[Figure: probability density function f_T(t; τ, σ²) of the Lognormal(1, σ²) distribution for σ² = 0.05, 0.12, 0.30, 0.8, 2.0 and 5.0.]
241 / 430
Lognormal regression
$$T_i=\tau_i\,\varepsilon_i, \qquad h(\tau_i)=\beta_0+\beta_1 x_{i1}+\ldots+\beta_p x_{ip}, \qquad \varepsilon_i \overset{ind}{\sim} \mathrm{Lognormal}(1,\sigma_i^{2})$$
is equivalent to
$$Y_i=\mu_i+\epsilon_i, \qquad g(\mu_i)=\beta_0+\beta_1 x_{i1}+\ldots+\beta_p x_{ip}, \qquad \epsilon_i \overset{ind}{\sim} \mathrm{Normal}(0,\sigma_i^{2}),$$
where Y_i = log(T_i), μ_i = log(τ_i), ϵ_i = log(ε_i) and g(·) ≡ h[exp(·)].
$$\mathrm{E}(T_i)=\tau_i\exp\!\left(\frac{\sigma^{2}}{2\,\omega_i}\right) \qquad\text{and}\qquad \mathrm{Var}(T_i)=\tau_i^{2}\exp\!\left(\frac{\sigma^{2}}{\omega_i}\right)\!\left[\exp\!\left(\frac{\sigma^{2}}{\omega_i}\right)-1\right].$$

Logarithmic link: h(τ_i) = log(τ_i), so g(μ_i) = μ_i

Identity link: h(τ_i) = τ_i, so g(μ_i) = exp(μ_i)

Inverse link: h(τ_i) = 1/τ_i, so g(μ_i) = exp(−μ_i)

If ω₁ = ω₂ = ... = ω_n, then the dispersion parameter is constant, which implies that the coefficient of variation is also constant.
242 / 430
Exercises
Consider the lognormal model
$$T_i=\tau_i\,\varepsilon_i, \qquad h(\tau_i)=\beta_0+\beta_1 x_{i1}+\ldots+\beta_p x_{ip}, \qquad \varepsilon_i \overset{ind}{\sim} \mathrm{Lognormal}(1,\sigma_i^{2}),$$
or, equivalently,
$$Y_i=\mu_i+\epsilon_i, \qquad g(\mu_i)=\beta_0+\beta_1 x_{i1}+\ldots+\beta_p x_{ip}, \qquad \epsilon_i \overset{ind}{\sim} \mathrm{Normal}(0,\sigma_i^{2}),$$
where Y_i = log(T_i), y_i = log(t_i), μ_i = log(τ_i), ϵ_i = log(ε_i) and g(·) ≡ h[exp(·)]. Show that:

(A) The likelihood function of β under the lognormal model is proportional to that under the normal model. Therefore, the maximum likelihood estimator of β under the lognormal model coincides with that under the normal model.

(B) The deviance under the lognormal model coincides with that under the normal model.

(C) The GAIC under the lognormal model is equal to that under the normal model plus 2[log(t₁) + ... + log(t_n)] = 2(y₁ + ... + y_n). (See the sketch after this list.)

(D) If h(τ_i) = log(τ_i) and ω₁ = ω₂ = ... = ω_n, then log[Q_T_i(α)] = β₀* + β₁x_{i1} + ... + β_p x_{ip} for all α ∈ (0, 1), where β₀* = β₀ + log[Q_ε(α)].
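A quick numerical check of (C), as a minimal sketch under assumed, hypothetical names (a data frame dat with a positive response t and a covariate x):

# Minimal sketch checking exercise (C): the lognormal GAIC equals the
# normal AIC of the log-response fit plus 2*sum(log(t))
fit.norm       <- glm(log(t) ~ x, family = gaussian, data = dat)
GAIC.lognormal <- AIC(fit.norm) + 2 * sum(log(dat$t))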
243 / 430
Lognormal versus Gamma model under log link
$$T_i=\tau_i\,\varepsilon_i, \qquad \log(\tau_i)=\beta_0+\beta_1 x_{i1}+\ldots+\beta_p x_{ip}, \qquad \varepsilon_i \overset{ind}{\sim} \mathrm{Lognormal}(1,\sigma^{2})$$
versus
$$Y_i=\mu_i\,\epsilon_i, \qquad \log(\mu_i)=\beta_0+\beta_1 x_{i1}+\ldots+\beta_p x_{ip}, \qquad \epsilon_i \overset{ind}{\sim} \mathrm{Gamma}(1,\sigma^{2})$$

These models have the following similarities:

(1) They are appropriate when the response variable is continuous, strictly positive, and right-skewed.
(2) The constant dispersion parameter assumption is equivalent to the constant coefficient of variation assumption.
(3) A higher value of the dispersion parameter indicates more asymmetry in the distribution of the response variable.
(4) The model parameters can be interpreted by taking into account the following: (i) log[E(T_i)] = β₀* + β₁x_{i1} + ... + β_p x_{ip} and log[E(Y_i)] = β₀ + β₁x_{i1} + ... + β_p x_{ip}; or (ii) log[Q_T_i(α)] = β̄₀ + β₁x_{i1} + ... + β_p x_{ip} and log[Q_Y_i(α)] = β̃₀ + β₁x_{i1} + ... + β_p x_{ip} for all α ∈ (0, 1).
Lognormal versus Gamma model under log link
$$T_i=\tau_i\,\varepsilon_i, \qquad \log(\tau_i)=\beta_0+\beta_1 x_{i1}+\ldots+\beta_p x_{ip}, \qquad \varepsilon_i \overset{ind}{\sim} \mathrm{Lognormal}(1,\sigma^{2})$$
versus
$$Y_i=\mu_i\,\epsilon_i, \qquad \log(\mu_i)=\beta_0+\beta_1 x_{i1}+\ldots+\beta_p x_{ip}, \qquad \epsilon_i \overset{ind}{\sim} \mathrm{Gamma}(1,\sigma^{2})$$

The lognormal model has the following advantages over the Gamma model:

(1) The log-likelihood function of β in the lognormal model, denoted here by ℓ(β), is a strictly concave function. Therefore, if the system of equations given by R(β) = 0 has a solution, where R(β) = (∂ℓ(β)/∂β₀, ..., ∂ℓ(β)/∂β_p)ᵀ is the gradient of ℓ(β), then it is the only global maximum of ℓ(β).

(2) The maximum likelihood estimator of the model parameters in the lognormal model has a closed form (i.e., β̂ = (XᵀX)⁻¹Xᵀt*, where t* = (log(t₁), ..., log(t_n))ᵀ), whereas in the Gamma model it requires numerical methods.

(3) Statistical inference under the lognormal model is exact, whereas inference under the Gamma model is asymptotic, meaning that it is valid only for large sample sizes.
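A minimal sketch of the closed form in (2), under assumed, hypothetical names (a data frame dat with positive response t and covariates x1, x2):

# Minimal sketch of the closed-form ML estimator in the lognormal model:
# beta.hat = (X'X)^{-1} X' log(t)
X        <- model.matrix(~ x1 + x2, data = dat)
beta.hat <- solve(crossprod(X), crossprod(X, log(dat$t)))
# identical to coef(lm(log(t) ~ x1 + x2, data = dat))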
245 / 430
Gasoline yield
The analysis of the data in the object Auto of the library ISLR is aimed at explaining the differences between the vehicles with respect to their gasoline yield (mpg), in miles per gallon, using other vehicle characteristics such as their horsepower (horsepower) and their weight (weight), in kilograms.

[Figure: gasoline yield versus horsepower, by weight (low, mid, high).]

$$Y_i \overset{ind}{\sim} \mathrm{Lognormal}(\mu_i,\sigma^{2}), \qquad \log(\mu_i)=\beta_0+\beta_1 x_{i1}+\beta_2 x_{i2}+\beta_3 x_{i1}x_{i2}$$

y = Gasoline yield
x₁ = Horsepower
x₂ = Weight
246 / 430
Parameter interpretation: without interaction

The expected (median) gasoline yield for a vehicle with horsepower x₁ and weight x₂ is proportional to exp(β₀ + β₁x₁ + β₂x₂).

In two vehicles the weights are the same, but the horsepowers are x + δ and x, respectively. As a result, the expected (median) gasoline yield of the first vehicle is exp(β₁δ) times that of the second vehicle:
$$\frac{\mu_{x+\delta}}{\mu_{x}}=\frac{\exp(\beta_0+\beta_1(x+\delta)+\beta_2 x_2)}{\exp(\beta_0+\beta_1 x+\beta_2 x_2)}=\exp(\beta_1\delta).$$

In two vehicles the horsepowers are the same, but the weights are x + δ and x, respectively. As a result, the expected (median) gasoline yield of the first vehicle is exp(β₂δ) times that of the second vehicle:
$$\frac{\mu_{x+\delta}}{\mu_{x}}=\frac{\exp(\beta_0+\beta_1 x_1+\beta_2(x+\delta))}{\exp(\beta_0+\beta_1 x_1+\beta_2 x)}=\exp(\beta_2\delta).$$
247 / 430
Parameter interpretation: with interaction

The expected (median) gasoline yield for a vehicle with horsepower x₁ and weight x₂ is proportional to exp(β₀ + β₁x₁ + β₂x₂ + β₃x₁x₂).

Two vehicles have weights equal to z, but horsepowers of x + δ and x, respectively. As a result, the expected (median) gasoline yield of the first vehicle is exp[(β₁ + β₃z)δ] times that of the second vehicle:
$$\frac{\mu_{x+\delta}}{\mu_{x}}=\frac{\exp(\beta_0+\beta_1(x+\delta)+\beta_2 z+\beta_3(x+\delta)z)}{\exp(\beta_0+\beta_1 x+\beta_2 z+\beta_3 xz)}=\exp[(\beta_1+\beta_3 z)\delta].$$

Two vehicles have horsepowers equal to z, but weights of x + δ and x, respectively. As a result, the expected (median) gasoline yield of the first vehicle is exp[(β₂ + β₃z)δ] times that of the second vehicle:
$$\frac{\mu_{x+\delta}}{\mu_{x}}=\frac{\exp(\beta_0+\beta_1 z+\beta_2(x+\delta)+\beta_3 z(x+\delta))}{\exp(\beta_0+\beta_1 z+\beta_2 x+\beta_3 zx)}=\exp[(\beta_2+\beta_3 z)\delta].$$
248 / 430
Gasoline yield
> data(Auto, package="ISLR")
> Auto <- within(Auto, weight <- weight/1000)
>
> fit <- glm(log(mpg) ~ horsepower + weight + horsepower:weight,
+ family=gaussian(identity), data=Auto)
> summary(fit)
(Column annotations: Estimate = β̂_j; Std.Error = c_j; z value = z_j = β̂_j/c_j; Pr(>|z|) = Pr[χ²(1) > z_j²].)

Estimate Std.Error z value Pr(>|z|)
(Intercept) 4.5121298 0.0912295 49.459 < 2e-16
horsepower -0.0071145 0.0010622 -6.698 7.44e-11
weight -0.3618254 0.0301305 -12.009 < 2e-16
horsepower:weight 0.0011991 0.0002589 4.631 4.96e-06

(Dispersion parameter for gaussian family taken to be 0.02341458)   [φ̂]

Null deviance : 45.2100 on 391 degrees of freedom   [D₀(β̂) on n − 1]
Residual deviance : 9.0849 on 388 degrees of freedom   [D(β̂) on n − p − 1]

> AIC(fit) + 2*sum(fit$y)   # GAIC of the lognormal model: -2ℓ(β̂) + 2(p + 2)
2075.781
249 / 430
Genesis of link functions in
binary response models
250 / 430
Genesis of link functions in binary response models
$$Y_i \overset{ind}{\sim} \mathrm{Bernoulli}(\mu_i), \qquad g(\mu_i)=\beta_0+\beta_1 x_{i1}+\ldots+\beta_p x_{ip}$$

The link functions frequently used in binary response models (logit, probit, complementary log-log and Cauchy) may be motivated by using the following latent linear model (see, for instance, Agresti [2015, section 5.1.2])
$$Y_i^{*}=\beta_0^{*}+\beta_1^{*}x_{i1}+\ldots+\beta_p^{*}x_{ip}-\sigma e_i,$$
where β* = (β₀*, β₁*, ..., β_p*)ᵀ are regression parameters, σ > 0 is a scale parameter, and e₁, ..., e_n are independent and identically distributed random errors whose cumulative distribution function, denoted here by G(·), satisfies the following: (i) it is a continuous and twice differentiable function; (ii) 0 < G(e) < 1 for all e ∈ ℝ; and (iii) it does not depend on unknown parameters.

The response is dichotomized as follows:
$$Y_i=\begin{cases}1 & \text{if } Y_i^{*}\geq\tau\\ 0 & \text{if } Y_i^{*}<\tau\end{cases},$$
251 / 430
Link functions in binary response models
where τ is an arbitrary value. Therefore, Y_i ∼ind Bernoulli(μ_i), where μ_i is given by

μ_i = Pr[Y_i = 1]
 = Pr[Y_i* ≥ τ]
 = Pr[β₀* + β₁*x_{i1} + ... + β_p*x_{ip} − σe_i ≥ τ]
 = Pr[σe_i ≤ β₀* + β₁*x_{i1} + ... + β_p*x_{ip} − τ]
 = Pr[e_i ≤ (β₀* − τ)/σ + (β₁*/σ)x_{i1} + ... + (β_p*/σ)x_{ip}]
 = Pr[e_i ≤ β₀ + β₁x_{i1} + ... + β_p x_{ip}]
 = G(β₀ + β₁x_{i1} + ... + β_p x_{ip}),

in which β₀ = (β₀* − τ)/σ and β_j = β_j*/σ for j = 1, ..., p. Thus, the link function arises naturally as g(μ_i) = G⁻¹(μ_i).
252 / 430
Link functions in binary response models
(1) Standard Logistic distribution
$$G(e)=\frac{\exp(e)}{1+\exp(e)}.$$
Then,
$$g(\mu_i)=G^{-1}(\mu_i)=\log\!\left(\frac{\mu_i}{1-\mu_i}\right).$$
That is, random errors distributed according to the standard logistic distribution induce the logit link function.

(2) Standard Normal distribution
$$G(e)=\Phi(e)\approx\frac12+\mathrm{sign}(e)\,\frac12\sqrt{1-\exp\!\left(-\sqrt{\pi/8}\;e^{2}\right)}.$$
Then,
$$g(\mu_i)=G^{-1}(\mu_i)=\Phi^{-1}(\mu_i).$$
That is, random errors distributed according to the standard normal distribution induce the probit link function.
253 / 430
Link functions in binary response models
(3) Standard Extreme Value distribution
$$G(e)=1-\exp[-\exp(e)].$$
Then,
$$g(\mu_i)=G^{-1}(\mu_i)=\log[-\log(1-\mu_i)]=\log\!\left[\log\!\left(\frac{\mu_i}{1-\mu_i}+1\right)\right].$$
That is, errors distributed according to the standard extreme value distribution induce the cloglog link function.

(4) Standard Cauchy distribution
$$G(e)=\frac12+\frac{1}{\pi}\arctan(e).$$
Then,
$$g(\mu_i)=G^{-1}(\mu_i)=\tan\!\left[\frac{\pi}{2}(2\mu_i-1)\right].$$
That is, random errors distributed according to the standard Cauchy distribution induce the cauchit link function.
254 / 430
Link functions in binary response models
[Figure: μ = g⁻¹(η) versus the linear predictor η, for η ∈ (−5, 5), under the logit, probit, cloglog and cauchit link functions.]
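The figure can be reproduced with the inverse links available in base R; a minimal sketch:

# Minimal sketch reproducing the figure: mu = g^{-1}(eta) for the four links
eta <- seq(-5, 5, length.out = 200)
plot(eta, plogis(eta), type = "l",
     xlab = expression(eta), ylab = expression(mu))   # logit
lines(eta, pnorm(eta), lty = 2)                       # probit
lines(eta, 1 - exp(-exp(eta)), lty = 3)               # cloglog
lines(eta, pcauchy(eta), lty = 4)                     # cauchit
legend("topleft", c("logit", "probit", "cloglog", "cauchit"), lty = 1:4)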
255 / 430
Receiver Operating
Characteristic (ROC) Curve
256 / 430
ROC Curve
The sample is partitioned into a sample for training and a sample for validation. The binary response model is "trained" using the training sample. In order to evaluate the predictive ability of the trained model, the validation sample is used.
257 / 430
ROC Curve
For a given value of the cutoff point ν, the Classification Rule of interest is the following:
$$\hat Y_i=\begin{cases}1 & \text{if } \hat\mu_i\geq\nu\\ 0 & \text{if } \hat\mu_i<\nu\end{cases}$$
Therefore, the confusion matrix reduces to

                       Predicted condition
                       Ŷ = 0    Ŷ = 1
Actual      Y = 0       a_ν      c_ν
condition   Y = 1       b_ν      d_ν

Then, the Sensitivity or True Positive Rate (TPR), and the Specificity or True Negative Rate (TNR), are given, respectively, by
$$\mathrm{TPR}(\nu)=\frac{d_\nu}{b_\nu+d_\nu} \qquad\text{and}\qquad \mathrm{TNR}(\nu)=\frac{a_\nu}{a_\nu+c_\nu}$$
258 / 430
ROC Curve
[Figure: fitted values μ̂ for the zeros and the ones under five scenarios, A to E.]

The ideal scenario is E, since it allows for perfect differentiation between "successes" and "failures". By contrast, the worst scenario is D, since no distinction can be made between "successes" and "failures".
259 / 430
ROC Curve
The Receiver Operating Characteristic (ROC) curve is the plot of TPR(ν) versus 1 − TNR(ν) for ν₁ < ν₂ < ... < ν_{r−1} < ν_r.

Cutoff point (ν) | 1 − TNR(ν)              | TPR(ν)
ν₁               | c_ν₁/(a_ν₁ + c_ν₁)      | d_ν₁/(b_ν₁ + d_ν₁)
ν₂               | c_ν₂/(a_ν₂ + c_ν₂)      | d_ν₂/(b_ν₂ + d_ν₂)
⋮                | ⋮                       | ⋮
ν_r              | c_ν_r/(a_ν_r + c_ν_r)   | d_ν_r/(b_ν_r + d_ν_r)

The Area Under the ROC curve (AUROC) is a measure of the predictive power of the model, and it can be calculated by successively applying the formula for the area of a trapezoid. Thus,
$$\mathrm{AUROC}=\frac12\sum_{l=2}^{r}\big[\mathrm{TNR}(\nu_l)-\mathrm{TNR}(\nu_{l-1})\big]\big[\mathrm{TPR}(\nu_l)+\mathrm{TPR}(\nu_{l-1})\big].$$
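A minimal sketch of this trapezoid computation, assuming y holds the observed 0/1 responses and mu the corresponding fitted probabilities:

# Minimal sketch: AUROC by the trapezoid formula above
nu  <- c(-Inf, sort(unique(mu)), Inf)              # cutoff grid, with endpoints
TPR <- sapply(nu, function(v) mean(mu[y == 1] >= v))
TNR <- sapply(nu, function(v) mean(mu[y == 0] <  v))
AUROC <- 0.5 * sum(diff(TNR) * (TPR[-1] + TPR[-length(TPR)]))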
260 / 430
ROC Curve
[Figure: one trapezoid of the ROC curve, with vertices at (1 − TNR(ν_l), TPR(ν_l)) and (1 − TNR(ν_{l−1}), TPR(ν_{l−1})).]

The area of the shaded region can be obtained as follows:
$$\big[(1-\mathrm{TNR}(\nu_{l-1}))-(1-\mathrm{TNR}(\nu_{l}))\big]\times\frac12\big[\mathrm{TPR}(\nu_l)+\mathrm{TPR}(\nu_{l-1})\big]
=\big[\mathrm{TNR}(\nu_l)-\mathrm{TNR}(\nu_{l-1})\big]\times\frac12\big[\mathrm{TPR}(\nu_l)+\mathrm{TPR}(\nu_{l-1})\big].$$
261 / 430
ROC Curve
The AUROC also may be expressed using the statistic of the Mann-Whitney test (see Gibbons and Chakraborti [2010, section 6.6]):
$$\mathrm{AUROC}=\frac{1}{N_0 N_1}\sum_{l\in G_0}\sum_{i\in G_1} S(\hat\mu_l,\hat\mu_i),$$
where G₀ = {i : y_i = 0}, G₁ = {i : y_i = 1}, N₀ = #G₀, N₁ = #G₁ and S(a, b) may be expressed as follows:
$$S(a,b)=\begin{cases}1 & \text{if } a<b\\ \tfrac12 & \text{if } a=b\\ 0 & \text{if } a>b\end{cases}$$
If μ̂_l < μ̂_i for all l ∈ G₀ and i ∈ G₁, then AUROC = 1; that is, there is ν* ∈ (0, 1) such that the classification rule given by
$$\hat Y_i=\begin{cases}1 & \text{if } \hat\mu_i\geq\nu^{*}\\ 0 & \text{if } \hat\mu_i<\nu^{*}\end{cases}$$
is "perfect" in the sense that TPR(ν*) = TNR(ν*) = 1.
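A minimal sketch of the Mann-Whitney form, again assuming y holds the observed 0/1 responses and mu the fitted probabilities:

# Minimal sketch: AUROC via the Mann-Whitney statistic
mu0 <- mu[y == 0]; mu1 <- mu[y == 1]
S <- outer(mu0, mu1, function(a, b) (a < b) + 0.5 * (a == b))
AUROC <- sum(S) / (length(mu0) * length(mu1))
# equivalently: wilcox.test(mu1, mu0)$statistic / (length(mu0) * length(mu1))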
262 / 430
ROC Curve
Other measures of the predictive power of the model also may be considered. For instance,
$$\text{Gini Coefficient}=2\left(\mathrm{AUROC}-\frac12\right),$$
and the Kolmogorov-Smirnov (K-S) statistic (see, for instance, Gibbons and Chakraborti [2010, section 6.3]), which compares the empirical cumulative distribution function of {μ̂_i : y_i = 0}, denoted here by ECDF⁽⁰⁾(μ), with that of {μ̂_i : y_i = 1}, denoted here by ECDF⁽¹⁾(μ), in the following form:
$$\text{K-S Statistic}=\max_{\mu}\Big|\mathrm{ECDF}^{(0)}(\mu)-\mathrm{ECDF}^{(1)}(\mu)\Big|=\max_{l}\Big|\mathrm{TNR}(\nu_l)-\big(1-\mathrm{TPR}(\nu_l)\big)\Big|.$$
If μ̂_l < μ̂_i for all l ∈ G₀ and i ∈ G₁, then K-S Statistic = 1.
263 / 430
ROC Curve
For a given value of the cutoff point ν, the Reference Classification Rule is the following:
$$\hat Y_i=\begin{cases}1 & \text{if } u_i\geq\nu\\ 0 & \text{if } u_i<\nu\end{cases}$$
where u₁, u₂, ... are realizations of random variables such that U_i ∼ind Uniform(0, 1). The ROC curve of this classification rule is the zero-intercept and unit-slope straight line, as TPR(ν) = Pr[U_i ≥ ν] = 1 − ν and TNR(ν) = Pr[U_i < ν] = ν for all ν ∈ (0, 1). Thus, the AUROC of the reference classification rule is 0.5.

The classification rule of a model whose linear predictor includes just the intercept term also may be used as a reference. For that model the estimates of μ_i are μ̂_i = μ̂ for all i. Thus, the ROC curve and the AUROC of that classification rule are the zero-intercept and unit-slope straight line and 0.5, respectively, as

(1) TPR(ν) = 1 and TNR(ν) = 0 for all ν ≤ μ̂
(2) TPR(ν) = 0 and TNR(ν) = 1 for all ν > μ̂
264 / 430
Burn injury study
This data set, discussed in Hosmer et al. [2013, page 27] and available in the object burn1000 of the library aplore3, is a subset of 1000 subjects treated between 2000 and 2007 at 40 different burn facilities. Available information includes patient demographics, total burn surface area and presence of inhalation injury. The outcome of interest is survival to hospital discharge. The data set includes the following:

age (x₁). The patient's age at admission, in years.
gender (x₂). The patient's gender: "Female" and "Male".
race (x₃). The patient's race: "Non-White" and "White".
tbsa (x₄). Total burn surface area, in percentage.
inh_inj (x₅). Burn involved inhalation injury? "No" and "Yes".
flame (x₆). Flame involved in burn injury? "No" and "Yes".
death (y). Hospital discharge status: "Alive" and "Dead".

$$Y_i \overset{ind}{\sim} \mathrm{Bernoulli}(\mu_i), \qquad \log\!\left(\frac{\mu_i}{1-\mu_i}\right)=\beta_0+\beta_1 x_{i1}+\ldots+\beta_6 x_{i6}$$
265 / 430
Burn injury study
> help(ROCc)
>
> burn <- aplore3::burn1000
> burn <- within(burn, death2 <- ifelse(death=="Dead",1,0))
> s <- sample(1:nrow(burn), size=nrow(burn)*0.7)
> train <- burn[s,]
> test <- burn[-s,]
>
> fit <- glm(death ~ age + tbsa + inh_inj + age*inh_inj +
+ tbsa*inh_inj, family=binomial(logit), data=train)
>
> pr <- predict(fit, newdata=test, type="response")
>
> out <- ROCc(cbind(test[,"death2"],pr),col="red",col.lab="blue")
> out
This function provides a plot of the ROC curve, as well as the values of the AUROC, the Gini coefficient, and the K-S statistic for the classification rule of interest when it is applied to the validation sample.
266 / 430
ROC Curve: Example
[Figure: ROC curve (Sensitivity versus 1 − Specificity) for the burn injury model; AUROC = 0.972.]
267 / 430
ROC Curve: Example
[Figure: empirical cumulative distribution functions of μ̂ for the zeros and the ones; the largest vertical gap between them is the Kolmogorov-Smirnov statistic, K-S Statistic = 0.851.]
268 / 430
ROC Curve
There are several ways to define the "optimal" cutoff point ν. For example, the "optimal" cutoff point can be defined as the value of ν which minimizes the expected classification cost, which is given by (see, for instance, Habibzadeh et al. [2016], Nahm [2022])
$$\mathrm{E(Cost)}=\mathrm{E(Cost}\mid Y=0)\Pr(Y=0)+\mathrm{E(Cost}\mid Y=1)\Pr(Y=1),$$
where
$$\mathrm{Cost}\mid Y=0=\begin{cases}C_{0|0} & \text{if } \hat Y=0\\ C_{1|0} & \text{if } \hat Y=1\end{cases} \qquad\text{and}\qquad \mathrm{Cost}\mid Y=1=\begin{cases}C_{0|1} & \text{if } \hat Y=0\\ C_{1|1} & \text{if } \hat Y=1\end{cases}$$
in which C₀|₀, C₁|₀, C₀|₁ and C₁|₁ are assumed to be non-negative and known values.
269 / 430
ROC Curve
Therefore,
$$\mathrm{E(Cost}\mid Y=0)=C_{1|0}\,(1-\mathrm{TNR})+C_{0|0}\,\mathrm{TNR}$$
and
$$\mathrm{E(Cost}\mid Y=1)=C_{0|1}\,(1-\mathrm{TPR})+C_{1|1}\,\mathrm{TPR}.$$
Thus, the expected classification cost reduces to
$$\mathrm{E(Cost)}=\big[C_{1|0}\,(1-\mathrm{TNR})+C_{0|0}\,\mathrm{TNR}\big]\Pr(Y=0)+\big[C_{0|1}\,(1-\mathrm{TPR})+C_{1|1}\,\mathrm{TPR}\big]\Pr(Y=1).$$
The values of Pr(Y = 0) and Pr(Y = 1) can be replaced by their estimates, which are given by N₀/(N₀ + N₁) and N₁/(N₀ + N₁), respectively.
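A minimal sketch of the resulting cost curve over cutoffs, assuming y (0/1 responses), mu (fitted probabilities) and known cost values C00, C10, C01, C11 (hypothetical names):

# Minimal sketch: expected classification cost over a grid of cutoffs
nu  <- sort(unique(mu))
TPR <- sapply(nu, function(v) mean(mu[y == 1] >= v))
TNR <- sapply(nu, function(v) mean(mu[y == 0] <  v))
p0  <- mean(y == 0); p1 <- mean(y == 1)
cost <- (C10 * (1 - TNR) + C00 * TNR) * p0 +
        (C01 * (1 - TPR) + C11 * TPR) * p1
nu.opt <- nu[which.min(cost)]   # "optimal" cutoff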
270 / 430
Variable selection
271 / 430
Hybrid forward stepwise
The algorithm starts with the simplest model, that is, a model with a linear predictor in which the coefficients of all variables and/or effects, except the intercept term, are zero. Next, the candidate models are constructed by hierarchically adding effects to the linear predictor (i.e., the main effects come first, then the two-way interactions, and so on), and their "relevance" is assessed by comparing the models with and without the added effects using a previously defined criterion. After an effect has been added to the model, any other effect previously added can be removed if, according to the specified criterion, that effect is no longer relevant. See, for example, James et al. [2013, page 210].

The criteria used to compare models include the following:

Akaike Information Criterion
Bayesian Information Criterion
Adjusted R²
p-value of the significance tests
272 / 430
Hybrid backward stepwise

The process begins with the most complex model (i.e., one that includes all variables and/or effects in its linear predictor), and then the candidate models are constructed by hierarchically removing effects from the linear predictor (i.e., the highest-level interactions come first, then the next level of interactions, and so on). The "relevance" of the candidates is determined by comparing the models with and without the excluded effects using a previously established criterion. If an effect is excluded from the model, then any other effect previously excluded may be added back if, based on the specified criterion, that effect is no longer irrelevant. See, for example, James et al. [2013, page 210].

The criteria used to compare models include the following:

Akaike Information Criterion
Bayesian Information Criterion
Adjusted R²
p-value of the significance tests
273 / 430
Hybrid stepwise
[Diagram: hybrid stepwise selection, alternating forward and backward passes over the set of candidate models.]
274 / 430
Hybrid stepwise
> args(stepCriterion.glm)
> help(stepCriterion.glm)

model: the object where the fitted GLM is stored.

criterion: the criterion used to compare the candidate models. The available options are: AIC ("aic"), BIC ("bic"), adjusted deviance-based R-squared ("adjr2"), and p-value of a test ("p-value"). By default, criterion="bic".

test: the statistical test used to compare candidate models if criterion="p-value". The options are: Wald ("wald"), Rao's score ("score"), Terrell's gradient ("gradient"), and likelihood-ratio ("lr") tests. By default, test="wald".

direction: hybrid backward ("backward") or hybrid forward ("forward") procedure. By default, direction="forward".

levels: the pair of values in the interval (0, 1) at which effects enter and leave the model, respectively, if criterion="p-value". By default, levels=c(0.05,0.05).
275 / 430
Best subset selection
If there are p explanatory variables, which can be introduced into the model by p̃ columns in the model matrix X, then the number of possible models of size i, for i = 1, ..., p̃, is given by
$$\binom{\tilde p}{i}=\frac{\tilde p\,!}{(\tilde p-i)!\;i!}.$$
Therefore, the total number of models is
$$\sum_{i=1}^{\tilde p}\binom{\tilde p}{i}=2^{\tilde p}-1.$$
In order to apply the "best subset" selection, an exhaustive search is conducted, separately for every size i, to identify the model with the smallest deviance value. Therefore, if, for a fixed model size, the model selection criteria of interest reduce to monotone functions of the deviance, thus differing only in the way the sizes of the models are compared, then the results of the "best subset" selection do not depend upon the choice of the trade-off between goodness-of-fit and complexity on which they are based.
276 / 430
Best subset selection
> args(bestsubset)
> help(bestsubset)

object: the object of the class glm where the fitted model is stored.

nvmax: a positive integer value indicating the maximum size of subsets to examine.

nbest: a positive integer value indicating the number of subsets of each size to record.

force.in: a vector of positive integers indicating the indices of the columns of the model matrix that should be in all models.

force.out: a vector of positive integers indicating the indices of the columns of the model matrix that should be in no models.

verbose: a logical value indicating whether the report of results should be printed.

digits: a positive integer value indicating the number of decimal places to be used.
277 / 430
Burn injury study
This data set, discussed in Hosmer et al. [2013, page 27] and available in the object burn1000 of the library aplore3, is a subset of 1000 subjects treated between 2000 and 2007 at 40 different burn facilities. Available information includes patient demographics, total burn surface area and presence of inhalation injury. The outcome of interest is survival to hospital discharge. The data set includes the following:

age (x₁). The patient's age at admission, in years.
gender (x₂). The patient's gender: "Female" and "Male".
race (x₃). The patient's race: "Non-White" and "White".
tbsa (x₄). Total burn surface area, in percentage.
inh_inj (x₅). Burn involved inhalation injury? "No" and "Yes".
flame (x₆). Flame involved in burn injury? "No" and "Yes".
death (y). Hospital discharge status: "Alive" and "Dead".

$$Y_i \overset{ind}{\sim} \mathrm{Bernoulli}(\mu_i), \qquad \log\!\left(\frac{\mu_i}{1-\mu_i}\right)=\beta_0+\beta_1 x_{i1}+\ldots+\beta_6 x_{i6}$$
278 / 430
Burn injury study
> burn <- aplore3::burn1000
> burn <- within(burn, death <- factor(death, levels=c("Dead","Alive")))
> m <- death ~ (gender + race + flame + age + tbsa + inh_inj)**2
> fit <- glm(m, family=binomial(logit), data=burn)
>
> stepCriterion(fit, direction="forward", criterion="bic", test="wald")
Initial model: ~ 1
Step 0 :
df AIC BIC adj.R-squared P(Chisq>)(*)
+ tbsa 1 542.65 552.47 0.3622 < 2.2e-16
+ age 1 674.26 684.07 0.2064 < 2.2e-16
+ inh_inj 1 695.66 705.48 0.1811 < 2.2e-16
+ flame 1 759.46 769.27 0.1055 1.443e-15
<none> 847.42 852.33 0.0000
+ race 1 847.97 857.79 0.0007 0.2319
+ gender 1 848.58 858.40 0.0000 0.3567
Step 1 : + tbsa
df AIC BIC adj.R-squared P(Chisq>)(*)
+ age 1 368.98 383.70 0.5698 < 2.2e-16
+ flame 1 522.68 537.40 0.3876 1.224e-05
+ inh_inj 1 526.73 541.46 0.3828 1.040e-05
<none> 542.65 552.47 0.3622
+ gender 1 538.34 553.07 0.3691 0.01108
+ race 1 544.40 559.12 0.3619 0.61320
279 / 430
Burn injury study
Step 2 : + age
df AIC BIC adj.R-squared P(Chisq>)(*)
+ inh_inj 1 352.18 371.81 0.5917 1.35e-05
<none> 368.98 383.70 0.5698
+ flame 1 365.79 385.42 0.5755 0.02695
+ race 1 366.55 386.18 0.5746 0.03628
+ gender 1 369.95 389.58 0.5706 0.30956
+ age:tbsa 1 370.75 390.38 0.5696 0.63608
- tbsa 1 674.26 684.07 0.2064 < 2.2e-16
Step 3 : + inh_inj
df AIC BIC adj.R-squared P(Chisq>)(*)
+ age:inh_inj 1 340.10 364.63 0.6080 0.000178
<none> 352.18 371.81 0.5917
+ race 1 349.78 374.32 0.5965 0.037011
+ flame 1 352.50 377.03 0.5933 0.200542
+ gender 1 353.28 377.82 0.5923 0.340693
+ inh_inj:tbsa 1 353.46 378.00 0.5921 0.394472
+ age:tbsa 1 354.17 378.71 0.5913 0.935625
- age 1 526.73 541.46 0.3828 < 2.2e-16
- tbsa 1 539.54 554.26 0.3676 < 2.2e-16
280 / 430
Burn injury study
Step 4 : + age:inh_inj
df AIC BIC adj.R-squared P(Chisq>)(*)
+ inh_inj:tbsa 1 332.42 361.87 0.6191 0.002075
<none> 340.10 364.63 0.6080
+ race 1 337.19 366.64 0.6134 0.027951
+ flame 1 339.85 369.30 0.6102 0.140166
+ age:tbsa 1 339.99 369.44 0.6101 0.145308
+ gender 1 342.05 371.49 0.6076 0.826913
- tbsa 1 506.34 525.97 0.4088 < 2.2e-16
Step 5 : + inh_inj:tbsa
df AIC BIC adj.R-squared P(Chisq>)(*)
<none> 332.42 361.87 0.6191
+ race 1 328.74 363.09 0.6255 0.01843
+ flame 1 332.83 367.19 0.6206 0.21233
+ age:tbsa 1 333.61 367.96 0.6197 0.36009
+ gender 1 334.33 368.68 0.6188 0.75873
- age:inh_inj 1 353.46 378.00 0.5921 2.823e-06
Final model: ~ tbsa + age + inh_inj + age:inh_inj + tbsa:inh_inj
*********************************************************************
(*) p-values of the Wald test
281 / 430
Burn injury study
> mod <- death ~ gender + race + flame + age*tbsa*inh_inj
> fit <- glm(mod, family=binomial(logit), data=burn)
> bestSubset(fit, digits=3)
_SIZE_ genderMale raceWhite flameYes age tbsa inh_injYes age:tbsa
1 0 0 0 0 0 0 1
2 0 0 0 1 1 0 0
3 0 0 0 1 1 1 0
4 0 0 0 1 1 1 0
5 0 0 0 1 1 1 0
6 0 1 0 1 1 1 0
7 0 1 0 1 1 1 1
8 0 1 1 1 1 1 1
age:inh_injYes tbsa:inh_injYes age:tbsa:inh_injYes AIC BIC
0 0 0 460.088 469.903
0 0 0 368.977 383.700
0 0 0 352.179 371.810
1 0 0 340.095 364.634
1 1 0 332.423 361.870
1 1 0 328.740 363.095
1 0 1 326.971 366.233
1 0 1 327.657 371.827
282 / 430
Exercises
(1) Show that:
(A) Maximizing $R^2_{adj}$ is equivalent to minimizing $D(\hat\beta)/(n-p-1)$.
(B) In GLMs in which the dispersion parameter is known and equal to 1, such as Poisson and binomial models, minimizing $\mathrm{GAIC}(k)$ is equivalent to minimizing $D(\hat\beta) - k(n-p-1)$.
(2) Consider two GLMs in which the dispersion parameter is known and equal to 1, such as Poisson and binomial models, which are identical except for the linear predictor. In the first model, denoted here by (I),
$$g(\mu_i) = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip},$$
while in the second model, denoted by (II),
$$g(\mu_i) = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip} + \beta_{p+1} x_{i(p+1)} + \ldots + \beta_{p+r} x_{i(p+r)}.$$
Model I is often described as nested in model II. Show that:
(A) If the likelihood-ratio test to assess $H_0: (\beta_{p+1}, \ldots, \beta_{p+r})^\top = \mathbf{0}$ versus $H_1: (\beta_{p+1}, \ldots, \beta_{p+r})^\top \neq \mathbf{0}$ rejects $H_0$ at the approximate significance level $\alpha$, then $\mathrm{GAIC}_{II}(k) < \mathrm{GAIC}_{I}(k)$ for all $k \leq \chi^2_{1-\alpha}(r)/r$.
283 / 430
Exercises
(B) If the likelihood-ratio test to assess $H_0: (\beta_{p+1}, \ldots, \beta_{p+r})^\top = \mathbf{0}$ versus $H_1: (\beta_{p+1}, \ldots, \beta_{p+r})^\top \neq \mathbf{0}$ does not reject $H_0$ at the approximate significance level $\alpha$, then $\mathrm{GAIC}_{II}(k) > \mathrm{GAIC}_{I}(k)$ for all $k > \chi^2_{1-\alpha}(r)/r$.
(C) If $\mathrm{GAIC}_{II}(k) < \mathrm{GAIC}_{I}(k)$, then the likelihood-ratio test to assess $H_0: (\beta_{p+1}, \ldots, \beta_{p+r})^\top = \mathbf{0}$ versus $H_1: (\beta_{p+1}, \ldots, \beta_{p+r})^\top \neq \mathbf{0}$ rejects $H_0$ at the approximate significance level $\alpha$, for all $\alpha \geq 1 - \Pr[\chi^2(r) \leq kr]$.
(D) If $\mathrm{GAIC}_{II}(k) > \mathrm{GAIC}_{I}(k)$, then the likelihood-ratio test to assess $H_0: (\beta_{p+1}, \ldots, \beta_{p+r})^\top = \mathbf{0}$ versus $H_1: (\beta_{p+1}, \ldots, \beta_{p+r})^\top \neq \mathbf{0}$ does not reject $H_0$ at the approximate significance level $\alpha$, for all $\alpha \leq 1 - \Pr[\chi^2(r) \leq kr]$.
(E) $R^2_{adj,II} > R^2_{adj,I}$ if and only if the inclusion of the explanatory variables $x_{p+1}, \ldots, x_{p+r}$ in the model results in a reduction in deviance greater than $100 \times r/(n-p-1)\%$. Therefore, the larger the sample size, the smaller the required reduction in deviance for $R^2_{adj}$ to indicate that model II is "better" than model I.
284 / 430
Diagnostic methods
285 / 430
Residual analysis (see Pierce and Schafer [1986])
The residual $r_i$ is aimed to measure the difference between the observed value ($y_i$) and the fitted value ($\hat\mu_i = g^{-1}(x_i^\top\hat\beta)$) for the $i$-th observation. "Large" values of $r_i$ may indicate that the fitted model does not adequately describe the behavior of the $i$-th observation. Indeed, a "large" (in absolute value) and positive value of $r_i$ indicates that the fitted model underestimates the value of $y_i$, whereas a "large" (in absolute value) and negative value of $r_i$ indicates that the fitted model overestimates the value of $y_i$. Therefore, knowing the distribution of $r_i$ is desirable, since it allows us to determine when the value of $r_i$ (positive or negative) is "large". In other words, knowing the distribution of the residual allows us to determine whether the value of $r_i$ is "large" enough to raise doubts about the difference between the observed and fitted values for the $i$-th observation being a result of chance alone.
286 / 430
Residual analysis
Pearson-type residuals:
This type of residual is defined from the (signed) contribution of the $i$-th observation to the Pearson statistic. Thus,
$$r^P_i = \mathrm{sign}(y_i - \hat\mu_i)\sqrt{\frac{p(y_i, \hat\mu_i, \omega_i)}{\hat\phi\,(1-\hat h_{ii})}} = \frac{y_i - \hat\mu_i}{\sqrt{(\hat\phi/\omega_i)\,V(\hat\mu_i)\,(1-\hat h_{ii})}},$$
where $\hat h_{ii}$ is the $(i,i)$-th element of $\hat H = \hat W^{\frac{1}{2}} X (X^\top \hat W X)^{-1} X^\top \hat W^{\frac{1}{2}}$, in which $\hat W = \mathrm{diag}\{\hat w_1, \ldots, \hat w_n\}$ and $\hat w_i = \omega_i/\{[g'(\hat\mu_i)]^2 V(\hat\mu_i)\}$.
Deviance-type residuals (Davison and Gigli [1989]):
This type of residual is defined from the (signed) contribution of the $i$-th observation to the deviance. Thus,
$$r^D_i = \mathrm{sign}(y_i - \hat\mu_i)\sqrt{\frac{d(y_i, \hat\mu_i, \omega_i)}{\hat\phi\,(1-\hat h_{ii})}}.$$
287 / 430
Residual analysis
Quantile-type residuals (Dunn and Smyth [1996]):
If $Y_1, \ldots, Y_n$ are independent and strictly continuous random variables, then
$$F_{Y_i}(Y_i; \mu_i, \phi/\omega_i) \overset{ind}{\sim} \mathrm{Uniform}(0, 1) \qquad\text{and}\qquad \Phi^{-1}\!\left[F_{Y_i}(Y_i; \mu_i, \phi/\omega_i)\right] \overset{ind}{\sim} \mathrm{Normal}(0, 1).$$
These facts motivate the residuals given by
$$r^Q_i = \Phi^{-1}\!\left[F_{Y_i}(y_i; \hat\mu_i, \hat\phi/\omega_i)\right], \qquad i = 1, \ldots, n,$$
in models where the response is assumed to be distributed as normal, gamma, or inverse Gaussian, since, if the model is correctly specified and the values of $(\hat\beta, \hat\phi)$ and $(\beta, \phi)$ are "close", which occurs for "large" values of $n$ due to the consistency of the estimators of $\beta$ and $\phi$, then the behavior of the residuals $r^Q_1, \ldots, r^Q_n$ should be similar to that of a random sample drawn from the Normal(0, 1).
288 / 430
Residual analysis
Accordingly, values of $r^Q_i$ less than $-3$ or greater than $3$ could be considered "large".
In models where the response is assumed to be distributed as binomial or Poisson, the quantile-type residual is randomized to mimic the behavior expected under a continuous distribution. Therefore, $r^Q_i$ becomes
$$r^Q_i = \begin{cases} \Phi^{-1}\!\left[F_{Y_i}(y_i; \hat\mu_i, \hat\phi/\omega_i)\right] & \text{if } Y_i \text{ is strictly continuous,}\\[1ex] \Phi^{-1}\!\left[a_i + f_{Y_i}(y_i; \hat\mu_i, \hat\phi/\omega_i)\,u_i\right] & \text{in other cases,}\end{cases}$$
where $u_1, \ldots, u_n$ are realizations of random variables $U_1, \ldots, U_n$ such that $U_i \overset{ind}{\sim} \mathrm{Uniform}(0, 1)$ and $a_i = \lim_{y \uparrow y_i} F_{Y_i}(y; \hat\mu_i, \hat\phi/\omega_i)$. The standardized quantile-type residuals are given by $r^Q_i\,(1-\hat h_{ii})^{-\frac{1}{2}}$.
289 / 430
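As a worked illustration of the formula above, the following minimal sketch computes randomized quantile-type residuals "by hand" for a Poisson GLM. The model and the data frame mydata are hypothetical; residuals2() of glmtoolbox automates (and generalizes) this computation.

fit <- glm(y ~ x1 + x2, family=poisson(log), data=mydata)  # hypothetical model and data
mu.hat <- fitted(fit)
y <- fit$y
a <- ppois(y - 1, lambda=mu.hat)              # a_i: the CDF just below y_i
u <- runif(length(y))                          # realizations of U_i ~ Uniform(0,1)
rQ <- qnorm(a + dpois(y, lambda=mu.hat) * u)   # randomized quantile-type residual
qqnorm(rQ); abline(0, 1)                       # roughly Normal(0,1) if the model is adequate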
Residual analysis
> args(residuals2)
> help(residuals2)
object: the object where the fitted GLM is stored.
type: the type of residuals which should be returned. The op-
tions are: Pearson ("pearson"), Deviance ("deviance"), and (ran-
domized) Quantile ("quantile"). By default, type is set to be "de-
viance".
standardized: a logical switch indicating if the residuals should be standardized by dividing by $(1-\hat h_{ii})^{\frac{1}{2}}$. By default, standardized is set to be FALSE.
plot.it: a logical switch indicating if a plot of the residuals
versus the fitted values is required. By default, plot.it=TRUE.
...: some graphical parameters to customize the plot when
plot.it=TRUE.
290 / 430
Normal probability plot with simulated envelope
See, for instance, Atkinson [1981].
Step 0: Set $T$ (a positive integer) and $\alpha \in (0, 1)$.
Step 1: Generate $T$ independent samples of size $n$ according to $Y_i \overset{ind}{\sim} \mathrm{LEF}(\hat\mu_i, \hat\phi/\omega_i)$ for $i = 1, \ldots, n$:
$$\begin{array}{cccc} \text{Sample 1} & \text{Sample 2} & \cdots & \text{Sample } T\\ y_{11} & y_{12} & \cdots & y_{1T}\\ \vdots & \vdots & \ddots & \vdots\\ y_{n1} & y_{n2} & \cdots & y_{nT} \end{array}$$
Step 2: Compute the estimates of $\beta$ and $\phi$ in each sample.
Step 3: Compute the residuals $r_{it}$ for $i = 1, \ldots, n$ and $t = 1, \ldots, T$:
$$\begin{array}{cccc} \text{Sample 1} & \text{Sample 2} & \cdots & \text{Sample } T\\ r_{11} & r_{12} & \cdots & r_{1T}\\ \vdots & \vdots & \ddots & \vdots\\ r_{n1} & r_{n2} & \cdots & r_{nT} \end{array}$$
291 / 430
Normal probability plot with simulated envelope
Step 4: Compute the order statistics of $r_{it}$ in each sample (that is, sort each column):
$$\begin{array}{cccc} \text{Sample 1} & \text{Sample 2} & \cdots & \text{Sample } T\\ r_{(1)1} & r_{(1)2} & \cdots & r_{(1)T}\\ \vdots & \vdots & \ddots & \vdots\\ r_{(n)1} & r_{(n)2} & \cdots & r_{(n)T} \end{array}$$
Step 5: Compute the $100(\frac{\alpha}{2})$th, $50$th and $100(1-\frac{\alpha}{2})$th percentiles of row $i$, denoted by $p^{(i)}_{\alpha/2}$, $p^{(i)}_{1/2}$ and $p^{(i)}_{1-\alpha/2}$, for $i = 1, \ldots, n$.
Step 6: Plot $p^{(1)}_{\alpha/2}, \ldots, p^{(n)}_{\alpha/2}$ versus the expected values of the order statistics in a sample of size $n$ of the standard normal distribution.
292 / 430
Normal probability plot with simulated envelope
Step 7: Plot $p^{(1)}_{1/2}, \ldots, p^{(n)}_{1/2}$ versus the expected values of the order statistics in a sample of size $n$ of the standard normal distribution.
Step 8: Plot $p^{(1)}_{1-\alpha/2}, \ldots, p^{(n)}_{1-\alpha/2}$ versus the expected values of the order statistics in a sample of size $n$ of the standard normal distribution.
Step 9: Plot the residuals of the model of interest (that is, $r_1, \ldots, r_n$) versus the expected values of the order statistics in a sample of size $n$ of the standard normal distribution.
> args(envelope.glm)
> help(envelope.glm)
293 / 430
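The steps above can be coded directly. The following minimal sketch builds the envelope for a Poisson GLM with T = 100 replicates, using standardized deviance-type residuals and ppoints() as an approximation to the expected normal order statistics; the objects fit and mydata are hypothetical, and envelope() automates the whole procedure.

n <- nobs(fit); T <- 100; alpha <- 0.05
R <- matrix(0, n, T)
for (t in 1:T) {
  mydata$ystar <- rpois(n, lambda=fitted(fit))            # Step 1: simulate from the fit
  fit.t <- update(fit, ystar ~ ., data=mydata)            # Step 2: re-estimate beta
  R[,t] <- sort(rstandard(fit.t, type="deviance"))        # Steps 3-4: sorted residuals
}
bands <- apply(R, 1, quantile, probs=c(alpha/2, 0.5, 1 - alpha/2))  # Step 5
z <- qnorm(ppoints(n))                                    # expected normal order statistics
matplot(z, t(bands), type="l", lty=c(2,1,2), col=1,
        xlab="Expected quantiles", ylab="Observed quantiles")       # Steps 6-8
points(z, sort(rstandard(fit, type="deviance")))          # Step 9: observed residuals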
Normal probability plot with simulated envelope
object: the object where the fitted GLM is stored.
rep: a positive integer indicating the number of replicates which
should be used to build the simulated envelope. By default, rep
is set to be
25.
conf: a value in the interval (0, 1) indicating the confidence level
which should be used to build the point-wise confidence intervals.
By default, conf is set to be
0.95.
type: the type of residuals which should be used to build the
envelope. The options are: Pearson ("pearson"), Deviance ("de-
viance"), and (randomized) Quantile ("quantile"). By default,
type is set to be "quantile".
standardized: a logical switch indicating if the residuals should be standardized by dividing by $(1-\hat h_{ii})^{\frac{1}{2}}$. By default, standardized is set to be FALSE.
plot.it: a logical switch indicating if a plot of the residuals
versus the fitted values is required. By default, plot.it=FALSE.
...: some graphical parameters to customize the plot when
plot.it=TRUE.
294 / 430
Influence or sensitivity analysis
The Cook's distance, denoted here by $CD_i$, is a measure of the influence of the observation $i$ on the maximum-likelihood estimate of $\beta$, which consists of the norm of $\hat\beta - \hat\beta_{(i)}$ using a metric based on $\widehat{\mathrm{Var}}(\hat\beta)$, where $\hat\beta_{(i)}$ represents the estimate of $\beta$ computed from the dataset in which the observation $i$ is excluded. $CD_i$ may be expressed as follows (Cook and Weisberg [1982, section 5.4]):
$$CD_i = \frac{1}{p+1}\,(\hat\beta - \hat\beta_{(i)})^\top \left[\widehat{\mathrm{Var}}(\hat\beta)\right]^{-1}(\hat\beta - \hat\beta_{(i)}) = \frac{1}{\hat\phi(p+1)}\sum_{i=1}^{n} \hat w_i \left(x_i^\top\hat\beta - x_i^\top\hat\beta_{(i)}\right)^2,$$
where $\hat w_i = \omega_i/\{[g'(\hat\mu_i)]^2 V(\hat\mu_i)\}$. "Large" values of $CD_i$ indicate that removing the observation $i$ introduces "large" changes in the estimate of $\beta$ and that a comprehensive analysis is required to investigate whether the elimination of this observation changes the sign of the parameter estimates and/or introduces changes in the decision about their statistical significance.
295 / 430
Influence or sensitivity analysis
The computational cost of computing $\hat\beta_{(i)}$ for $i = 1, \ldots, n$ may be "high", especially when $n$ is "large". Therefore, the value of $\hat\beta_{(i)}$ in $CD_i$ is replaced by its one-step approximation (see Pregibon [1981]), denoted here by $\hat\beta^1_{(i)}$. This approximation consists of the first iteration of the Fisher scoring algorithm to estimate $\beta$ when it is performed using: (i) a dataset in which the observation $i$ is excluded; and (ii) $\hat\beta$ as the starting value. Thus,
$$\hat\beta^1_{(i)} = \hat\beta - \frac{\omega_i\,(y_i - \hat\mu_i)}{g'(\hat\mu_i)\,V(\hat\mu_i)\,(1-\hat h_{ii})}\,(X^\top \hat W X)^{-1} x_i.$$
The values of $\hat\beta^1_{(i)}$ and $\hat\beta_{(i)}$ coincide for Gaussian models with identity link function. The Cook's distance may be approximated by the following expression:
$$CD^1_i = \frac{1}{\hat\phi(p+1)}\sum_{i=1}^{n} \hat w_i \left(x_i^\top\hat\beta - x_i^\top\hat\beta^1_{(i)}\right)^2 = \frac{1}{p+1}\left(r^P_i\right)^2 \frac{\hat h_{ii}}{1-\hat h_{ii}}.$$
296 / 430
Influence or sensitivity analysis
(1) The matrix $\widehat{\mathrm{Var}}(\hat\beta)$ is positive definite. Therefore, $CD^1_i \geq 0$, and $CD^1_i = 0$ if and only if $\hat\beta = \hat\beta^1_{(i)}$.
(2) If $r^P_i = 0$ (that is, if $y_i = \hat\mu_i$), then $\hat\beta^1_{(i)} = \hat\beta$ and $CD^1_i = 0$.
A univariate version of $CD^1_i$ is known as $\mathrm{DFBETA}^1_{j(i)}$, and it is given by
$$\mathrm{DFBETA}^1_{j(i)} = \hat\beta_j - \hat\beta^1_{j(i)}, \qquad j = 0, \ldots, p,$$
where $\hat\beta^1_{j(i)}$ is the element $(j+1)$ of $\hat\beta^1_{(i)}$. Similarly, $\mathrm{DFBETAS}^1_{j(i)}$ is given by
$$\mathrm{DFBETAS}^1_{j(i)} = \frac{\hat\beta_j - \hat\beta^1_{j(i)}}{se(\hat\beta^1_{j(i)})}, \qquad j = 0, \ldots, p.$$
> help(cooks.distance)
> help(dfbeta)
> help(dfbetas)
297 / 430
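The closed form of $CD^1_i$ can be checked numerically against cooks.distance(), which implements exactly this one-step approximation for GLMs; fit denotes a hypothetical fitted glm object.

h <- hatvalues(fit)                              # hat values h_ii
rp <- residuals(fit, type="pearson")             # Pearson residuals (without phi)
phi <- summary(fit)$dispersion                   # estimated dispersion parameter
cd1 <- rp^2 * h / (length(coef(fit)) * phi * (1 - h)^2)   # CD^1_i written out
all.equal(cd1, cooks.distance(fit))              # agreement up to numerical rounding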
Testing the constant dispersion parameter assumption
To assess the assumption of a constant dispersion parameter, the following varying dispersion model is considered:
$$Y_i \overset{ind}{\sim} \mathrm{LEF}(\mu_i,\ \phi_i/\omega_i),$$
$$g(\mu_i) = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip},$$
$$\log(\phi_i) = \gamma_0 + \gamma_1 m_{i1} + \ldots + \gamma_q m_{iq},$$
where $\tilde m_i = (m_{i1}, \ldots, m_{iq})^\top$ are the values of $q$ explanatory variables, quantitative and/or qualitative, measured on the $i$-th individual. The sets of variables $x_1, \ldots, x_p$ and $m_1, \ldots, m_q$ may or may not coincide. Thus, assessing the assumption of a constant dispersion parameter is equivalent to assessing the following hypotheses in the varying dispersion model described above:
$$H_0: \gamma = \mathbf{0} \quad\text{versus}\quad H_1: \gamma \neq \mathbf{0},$$
where $\gamma = (\gamma_1, \ldots, \gamma_q)^\top$. If $H_0$ is not rejected, then it is possible to conclude that there is no significant statistical evidence against the assumption of a constant dispersion parameter.
298 / 430
Testing the constant dispersion parameter assumption
The decision rule is: reject $H_0$ at the approximate significance level of $100(\alpha)\%$ if $\xi > \chi^2_{1-\alpha}(q)$. The statistic of Rao's score test is given by
$$\xi = \frac{1}{2}\, z^\top M^{*}\,(M^{*\top} M^{*})^{-1} M^{*\top} z,$$
where $M^{*} = \left(I - n^{-1}\mathbf{1}\mathbf{1}^\top\right)M$, $M = (\tilde m_1, \ldots, \tilde m_n)^\top$ and $z = (z_1, \ldots, z_n)^\top$.
If $Y_i \overset{ind}{\sim} \mathrm{Normal}(\mu_i,\ \phi/\omega_i)$, then
$$z_i = \frac{\omega_i (y_i - \hat\mu_i)^2}{\bar\phi} - 1 \qquad\text{and}\qquad \bar\phi = \frac{1}{n}\sum_{i=1}^{n} \omega_i (y_i - \hat\mu_i)^2.$$
If $Y_i \overset{ind}{\sim} \text{Inverse Gaussian}(\mu_i,\ \phi/\omega_i)$, then
$$z_i = \frac{\omega_i (y_i - \hat\mu_i)^2}{\hat\mu_i^2\, y_i\, \bar\phi} - 1 \qquad\text{and}\qquad \bar\phi = \frac{1}{n}\sum_{i=1}^{n} \frac{\omega_i (y_i - \hat\mu_i)^2}{\hat\mu_i^2\, y_i}.$$
See, for instance, Wei et al. [1998].
299 / 430
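For a Gaussian model with unit weights, the score statistic above has a simple closed form. The following minimal sketch computes $\xi$ by hand for a hypothetical fit whose mean covariates x1 and x2 are also used in the dispersion model; vdtest() automates this computation.

fit <- glm(y ~ x1 + x2, family=gaussian, data=mydata)     # hypothetical model and data
res2 <- residuals(fit)^2
z <- res2/mean(res2) - 1                                  # z_i, with phi-bar = mean(res2)
M <- model.matrix(~ x1 + x2, data=mydata)[, -1, drop=FALSE]
Mc <- scale(M, center=TRUE, scale=FALSE)                  # M* = (I - 11'/n) M
xi <- 0.5 * drop(t(z) %*% Mc %*% solve(crossprod(Mc)) %*% t(Mc) %*% z)
pchisq(xi, df=ncol(Mc), lower.tail=FALSE)                 # compare with vdtest(fit)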
Testing the constant dispersion parameter assumption
> args(vdtest.glm)
> help(vdtest.glm)
model: the object where the fitted GLM is stored. Only
gaussian(), Gamma() and inverse.gaussian() families are
supported.
varformula: By default, the same explanatory variables are taken as in the model for the mean. However, the argument varformula, which must be a formula-type object, may be used to specify the desired explanatory variables.
verbose: a logical switch indicating whether the report of results should be printed. By default, verbose=TRUE.
300 / 430
Box-Tidwell transformation
Let's consider the following generalized nonlinear model:
$$Y_i \overset{ind}{\sim} \mathrm{LEF}(\mu_i,\ \phi/\omega_i),$$
$$g(\mu_i) = \beta_0 + \beta_1 h_1(x_{i1}) + \ldots + \beta_r h_r(x_{ir}) + \beta_{r+1} x_{i(r+1)} + \ldots + \beta_p x_{ip},$$
where $x_{i1}, \ldots, x_{ir}$ are the values of $r$ quantitative and strictly positive explanatory variables measured on the subject $i$, and
$$h_j(x) = \begin{cases} x^{\tau_j} & \text{if } \tau_j \neq 0\\ \log(x) & \text{if } \tau_j = 0 \end{cases}, \qquad j = 1, \ldots, r.$$
Hence, the parameter vector of interest is $(\beta_0, \beta_1, \ldots, \beta_p, \tau_1, \ldots, \tau_r)^\top$. Box and Tidwell [1962] provide an efficient algorithm to estimate $\tau = (\tau_1, \ldots, \tau_r)^\top$ by the maximum likelihood method. Note that evaluation of the following hypothesis systems is of interest:
$$H_0: \tau_j = 1 \quad\text{versus}\quad H_1: \tau_j \neq 1, \qquad j = 1, \ldots, r,$$
and
$$H_0: (\tau_1, \ldots, \tau_r)^\top = \mathbf{1} \quad\text{versus}\quad H_1: (\tau_1, \ldots, \tau_r)^\top \neq \mathbf{1}.$$
301 / 430
Insecticides
These data, extracted from Hewlett and Plackett [1950] and available
in the object
deposit of the library GLMsData, arose from an experiment
in which groups of insects were exposed to deposits of insecticides. The
total number of insects exposed in each group (
Number), the number
of the insects killed after six days exposure (
Killed), and the amount
(
Deposit) and type of insecticide used (Insecticide) were recorded.
[Figure: % of killed insects versus amount of insecticide, in milligrams, by insecticide type (A, B, C).]
$$m_i Y_i \overset{ind}{\sim} \mathrm{Binomial}(m_i, \mu_i), \qquad g(\mu_i) = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3}$$
$m$ = Number of exposed insects
$y$ = Proportion of killed insects
$x_1$ = Amount of insecticide, in mg.
$x_2 = 0$ and $x_3 = 0$ if Insecticide = "A"
$x_2 = 1$ and $x_3 = 0$ if Insecticide = "B"
$x_2 = 0$ and $x_3 = 1$ if Insecticide = "C"
302 / 430
Box-Tidwell transformation
> help(BoxTidwell.glm)
> data(deposit, package="GLMsData")
> fit <- glm(Killed/Number ~ Deposit + Insecticide,
+ weights=Number, family=binomial(logit), data=deposit)
>
> BoxTidwell(fit, transf= ~ Deposit)
(the columns below are $\hat\tau_j$, its standard error, and the statistic for $H_0: \tau_j = 1$ versus $H_1: \tau_j \neq 1$)
Estimate Std.Error z-value Pr(>|z|)
Deposit -0.90393 0.34310 -5.54914 2.8708e-08
Wald test for null hypothesis that all taus are 1, that is, $H_0: (\tau_1, \ldots, \tau_r)^\top = \mathbf{1}$ versus $H_1: (\tau_1, \ldots, \tau_r)^\top \neq \mathbf{1}$:
chi = 30.793 , df = 1 , Pr(>chi) = 2.8708e-08
>
> fit2 <- update(fit,formula=~I(Deposit**(-0.9039))+Insecticide)
> c(adjR2(fit),adjR2(fit2))
[1] 0.85902 0.95405
> c(AIC(fit),AIC(fit2))
[1] 118.22335 85.84911
> c(BIC(fit),BIC(fit2))
[1] 121.7848 89.4106
303 / 430
Skin cancer in women
These data, extracted from Kleinbaum et al. [2013, page 751] and avail-
able in the object
skincancer of the library glmtoolbox, describe the
number of non-melanoma skin cancer cases (cases) among women
cases) among women
(
population) stratified by age (age: “15-24”, “25-34”, “35-44”, “45-
54”, “55-64”, “65-74”, “75-84”, “85+”) in two metropolitan areas of US
(
city: “St. Paul”, in Minneapolis, and “Fort Worth”, in Dallas).
[Figure: incidence rate of skin cancer ($\times 10^4$) versus age range, in years, for St. Paul and Ft. Worth.]
$$Y_i \overset{ind}{\sim} \mathrm{Poisson}(\mu_i), \qquad \log\!\left(\frac{10^4\,\mu_i}{\lambda_i}\right) = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2}$$
$y$ = Number of cases of skin cancer.
$\lambda$ = Number of women exposed.
$x_1 = 1$ if city = "Fort Worth"; $x_1 = 0$ if city = "St. Paul"
$x_2$ = Midpoint of age range.
304 / 430
Box-Tidwell transformation
> help(BoxTidwell.glm)
> data(skincancer)
> fit <- glm(cases ~ age + city, offset=log(population),
+ family=poisson(log), data=skincancer)
>
> BoxTidwell(fit, transf= ~ age)
(the columns below are $\hat\tau_j$, its standard error, and the statistic for $H_0: \tau_j = 1$ versus $H_1: \tau_j \neq 1$)
Estimate Std.Error z-value Pr(>|z|)
age -0.52014 0.12389 -12.26983 < 2.22e-16
Wald test for null hypothesis that all taus are 1, that is, $H_0: (\tau_1, \ldots, \tau_r)^\top = \mathbf{1}$ versus $H_1: (\tau_1, \ldots, \tau_r)^\top \neq \mathbf{1}$:
chi = 150.55 , df = 1 , Pr(>chi) = < 2.22e-16
, 1
>
> fit2 <- update(fit, formula = ~ I(age**(-0.52)) + city)
> c(adjR2(fit),adjR2(fit2))
[1] 0.93512 0.99381
> c(AIC(fit),AIC(fit2))
[1] 257.1082 115.2115
> c(BIC(fit),BIC(fit2))
[1] 259.4259 117.5293
305 / 430
Gasoline yield
This data set, discussed in James et al. [2013, page 92] and available in
the object
Auto of the library ISLR, includes acceleration, horsepower, fuel efficiency, weight, and other features of different makes and models of cars. The outcome of interest is fuel consumption, in miles per gallon. The variables in this data set are the following:
mpg ($y$). Fuel consumption, in miles per gallon.
cylinders ($x_1$). Number of cylinders.
displacement ($x_2$). Engine displacement, in cubic inches.
horsepower ($x_3$). Engine horsepower.
weight ($x_4$). Weight, in pounds.
acceleration ($x_5$). Time, in seconds, to accelerate from 0 to 60 miles per hour.
origin ($x_6$, $x_7$). Origin of the vehicle: American (1), European (2) and Japanese (3).
306 / 430
Gasoline yield
The following is the code of the exponential link function (i.e., $g(\mu) = \exp(\mu)$) for the Gaussian distribution:
> exp.link <- gaussian()
> exp.link$link <- "exp"
> exp.link$linkfun <- function(mu) exp(mu)
> exp.link$linkinv <- function(eta) log(eta)
> exp.link$mu.eta <- function(eta) 1/eta
>
> fit <- glm(..., family=exp.link, ...)
The following is the code of the inverse exponential link function (i.e., $g(\mu) = \exp(-\mu)$) for the Gaussian distribution:
> iexp.link <- gaussian()
> iexp.link$link <- "1/exp"
> iexp.link$linkfun <- function(mu) exp(-mu)
> iexp.link$linkinv <- function(eta) -log(eta)
> iexp.link$mu.eta <- function(eta) -1/eta
>
> fit <- glm(..., family=iexp.link, ...)
307 / 430
Gasoline yield
Simpler model: 1
More complex model: 1 + weight + horsepower + weight:horsepower + displacement + acceleration + cylinders + origin

g(µ)     Distribution       "Best" linear predictor according to BIC                  BIC
log(µ)   Gaussian           1 + displacement + weight*horsepower                    2213.2
         Gamma              1 + weight*horsepower                                   2100.2
         Inverse Gaussian   1 + weight*horsepower                                   2089.4
         Lognormal          1 + weight*horsepower                                   2095.6
µ        Gaussian           1 + weight*horsepower                                   2211.3
         Gamma              1 + weight*horsepower                                   2102.4
         Inverse Gaussian   1 + weight*horsepower                                   2094.0
         Lognormal          1 + weight*horsepower                                   2098.6
1/µ      Gaussian           1 + displacement + horsepower + acceleration            2201.0
         Gamma              1 + weight + horsepower + cylinders                     2098.9
         Inverse Gaussian   1 + weight + horsepower + cylinders                     2089.1
         Lognormal          1 + weight + displacement + horsepower + acceleration  2092.6
> data(Auto, package="ISLR")
> Auto <- within(Auto, weight <- weight/1000)
> fit <- glm(mpg ~ horsepower + weight + horsepower:weight,
+ family=inverse.gaussian(log), data=Auto)
> summary(fit)
308 / 430
Gasoline yield
> anova2(fit, test="lr")
> residuals2(fit, type="quantile ", plot.it=TRUE, identify=3)
15 20 25 30 35 40
−3 −2 −1 0 1 2 3
331
360
382
ˆ
µ
r
Q
i
An “horizontal band” of residuals is roughly drawn around the 0 line.
This suggests that the random component of the model successfully
describes the mean-variance relation present in the data.
309 / 430
Gasoline yield
> envelope(fit, type="quantile", rep=500, identify=3)
> cbind(model.matrix(fit)[,-1],y=fit$y,mu=fitted(fit))[c(331,360,382),]
  horsepower weight horsepower:weight    y       mu
         132   2910            384120 32.7 19.92788
         105   3725            391125 26.6 18.02434
          85   3015            256275 38.0 22.94378
[Figure: normal probability plot with simulated envelope of the quantile-type residuals; observations 331, 360 and 382 are highlighted.]
310 / 430
Gasoline yield
> cdi <- cooks.distance(fit)
> plot(cdi, type="h", xlab="Observation Index", ylab="Cook’s Distance")
> identify(cdi, n=3)
[Figure: Cook's distances versus observation index; observations 29, 116 and 331 stand out.]
311 / 430
Gasoline yield
Which subset of the potentially influential observations (#29, #116, #331) causes
the largest change in the estimate of
β when removed from the data set?
> infl <- c(29,116,331)
> n <- length(infl)
> metr <- chol2inv(chol(vcov(fit)))
> hm <- 2**n - 1
> results <- matrix(0, nrow=hm, ncol=1)
> id <- 1; nfil <- vector()
> for(i in 1:n){
+ pos <- combn(n,i)
+ for(j in 1:ncol(pos)){
+ fitemp <- update(fit, subset=-pos[,j])
+ difs <- coef(fitemp)-coef(fit)
+ results[id,1] <- t(difs)%*%metr%*%(difs)
+ nfil <- c(nfil,paste(infl[pos[,j]], collapse=","))
+ id <- id + 1
+ }
+ }
> nfil[results==max(results)]
[1] "29,331"
The largest change in the estimate of β occurs when observations #29 and #331
are removed.
312 / 430
Gasoline yield
> a <- round(summary(fit)$coefficients, digits=5); a
Estimate Std.Error t value Pr(>|t|)
(Intercept) 4.55600 0.09410 48.41499 0
horsepower -0.00725 0.00102 -7.08715 0
weight -0.37430 0.02942 -12.72132 0
horsepower:weight 0.00126 0.00025 5.09735 0
>
> fit2 <- update(fit, subset=-c(29,331))
> b <- round(summary(fit2)$coefficients, digits=5); b
Estimate Std.Error t value Pr(>|t|)
(Intercept) 4.61036 0.09201 50.10786 0
horsepower -0.00804 0.00100 -8.01447 0
weight -0.38628 0.02874 -13.44288 0
horsepower:weight 0.00144 0.00024 5.94463 0
>
> round(100*(b[,1:2]-a[,1:2])/abs(a[,1:2]), digits=3)
Estimate Std.Error
(Intercept) 1.193 -2.221
horsepower -10.897 -1.961
weight -3.201 -2.311
horsepower:weight 14.286 -4.000
313 / 430
Gasoline yield
> vdtest(fit)
Score test for varying dispersion parameter
Statistic = 1.50238
degrees of freedom = 3
p-value = 0.68172
> vdtest(fit,varformula=~horsepower*weight+cylinders+displacement)
Score test for varying dispersion parameter
Statistic = 4.42663
degrees of freedom = 5
p-value = 0.48976
> vdtest(fit,varformula=~horsepower*weight+cylinders*displacement)
Score test for varying dispersion parameter
Statistic = 7.30864
degrees of freedom = 6
p-value = 0.29324
314 / 430
Quasi-likelihood models
315 / 430
Quasi-likelihood models (QLMs)
$$Y_i \overset{ind}{\sim} \left(\mu_i,\ (\phi/\omega_i)\,V(\mu_i)\right), \qquad g(\mu_i) = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip} = x_i^\top\beta.$$
Then $E(Y_i) = \mu_i$ and $\mathrm{Var}(Y_i) = \dfrac{\phi}{\omega_i}V(\mu_i)$, where
$\beta = (\beta_0, \beta_1, \ldots, \beta_p)^\top$ is the parameter vector of interest.
$x_i = (1, x_{i1}, \ldots, x_{ip})^\top$, where $(x_{i1}, \ldots, x_{ip})$ are the values of $p$ explanatory variables, quantitative and/or qualitative, measured on the $i$-th individual.
$\phi > 0$ is the dispersion parameter.
$\omega_1, \ldots, \omega_n$ are known and positive weights.
$V(\cdot)$ is the variance function.
$g(\cdot)$ is the link function, which is assumed to be known, strictly monotone, and twice differentiable.
316 / 430
Quasi-likelihood models (QLMs)
$$\hat\beta = \underset{\beta\,\in\,\mathbb{R}^{p+1}}{\operatorname{argmax}}\ Q(\beta), \qquad\text{where}\qquad Q(\beta) = \frac{1}{\phi}\sum_{i=1}^{n}\int_{y_i}^{\mu_i}\frac{\omega_i\,(y_i - t)}{V(t)}\,dt$$
is the log-quasi-likelihood function. For example,
If $V(\mu_i) = \mu_i(1-\mu_i)$: $\displaystyle Q(\beta) = C + \frac{1}{\phi}\sum_{i=1}^{n}\omega_i\left[y_i\log\!\left(\frac{\mu_i}{1-\mu_i}\right) + \log(1-\mu_i)\right]$
If $V(\mu_i) = \mu_i$: $\displaystyle Q(\beta) = C + \frac{1}{\phi}\sum_{i=1}^{n}\omega_i\left[y_i\log(\mu_i) - \mu_i\right]$
If $V(\mu_i) = 1$: $\displaystyle Q(\beta) = -\frac{1}{2\phi}\sum_{i=1}^{n}\omega_i\,(y_i - \mu_i)^2$
If $V(\mu_i) = \mu_i^2$: $\displaystyle Q(\beta) = C - \frac{1}{\phi}\sum_{i=1}^{n}\omega_i\left[\log(\mu_i) + \frac{y_i}{\mu_i}\right]$
If $V(\mu_i) = \mu_i^3$: $\displaystyle Q(\beta) = C + \frac{1}{\phi}\sum_{i=1}^{n}\omega_i\left(\frac{1}{\mu_i} - \frac{y_i}{2\mu_i^2}\right)$
317 / 430
Quasi-likelihood models (QLMs)
$$\frac{\partial Q(\beta)}{\partial\beta_j} = \frac{1}{\phi}\sum_{i=1}^{n}\frac{\omega_i\,(y_i - \mu_i)}{V(\mu_i)}\frac{\partial\mu_i}{\partial\eta_i}\,x_{ij} = \frac{1}{\phi}\sum_{i=1}^{n}\frac{\omega_i\,(y_i - \mu_i)}{V(\mu_i)\,g'(\mu_i)}\,x_{ij}.$$
In matrix form, $U(\beta)$ becomes
$$U(\beta) = \begin{pmatrix} \phi^{-1}\displaystyle\sum_{i=1}^{n}\frac{\omega_i\,(y_i - \mu_i)}{V(\mu_i)\,g'(\mu_i)}\\[2ex] \phi^{-1}\displaystyle\sum_{i=1}^{n}\frac{\omega_i\,(y_i - \mu_i)}{V(\mu_i)\,g'(\mu_i)}\,x_{i1}\\ \vdots\\ \phi^{-1}\displaystyle\sum_{i=1}^{n}\frac{\omega_i\,(y_i - \mu_i)}{V(\mu_i)\,g'(\mu_i)}\,x_{ip} \end{pmatrix},$$
which is the same as that in GLMs; that is, the parameter estimates under GLMs and QLMs coincide. Therefore, QLMs relax the distributional assumptions of GLMs, as the former lead to the same results as the latter, but in the former only the correct specification of the mean and variance of the response variable is required.
318 / 430
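A quick numerical confirmation of this point: fitting a Poisson GLM and a quasi-likelihood model with $V(\mu) = \mu$ to the same (hypothetical) count data yields identical estimates of $\beta$; only the treatment of $\phi$ differs.

fit.glm <- glm(y ~ x1 + x2, family=poisson(log), data=mydata)   # hypothetical data
fit.qlm <- update(fit.glm, family=quasipoisson(log))            # V(mu) = mu
all.equal(coef(fit.glm), coef(fit.qlm))      # TRUE: the estimating equations coincide
c(summary(fit.glm)$dispersion,               # fixed at 1 under the Poisson GLM
  summary(fit.qlm)$dispersion)               # estimated from the data under the QLM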
Gasoline Yields from Crude Oil
The dataset, collected by Prater [1956] and included in the betareg
library as GasolineYield, pertains to the percentage of crude oil con-
verted to gasoline after distillation and fractionation. The explanatory
variables are temperature (
temp), in degrees Fahrenheit, vapor pres-
sure of crude oil (
pressure), in pounds per square inch, and crude oil
gravity (
gravity), in degrees API.
[Figure: proportion of crude oil converted to gasoline versus temperature, by vapor pressure (low, high).]
$$Y_i \overset{ind}{\sim} \left(\mu_i,\ \phi\,\mu_i(1-\mu_i)\right), \qquad g(\mu_i) = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \beta_4 x_{i2} x_{i3}$$
$y$ = Proportion of crude oil converted to gasoline
$x_1$ = Temperature
$x_2$ = Vapor pressure
$x_3$ = Gravity
319 / 430
Gasoline Yields from Crude Oil
> library(betareg)
> data(GasolineYield)
> str(GasolineYield)
> summary(GasolineYield)
>
> fit1 <- glm(yield ~ temp+pressure+gravity+pressure:gravity,
+ data=GasolineYield, family=quasibinomial(logit))
> fit2 <- update(fit1, family=quasibinomial(probit))
> fit3 <- update(fit1, family=quasibinomial(cloglog))
> fit4 <- update(fit1, family=quasibinomial(cauchit))
>
> adjR2(fit1,fit2,fit3,fit4)
Deviance R-squared df adj.R-squared
fit1 0.1829340 0.9219 5 0.9103
fit2 0.1637178 0.9301 5 0.9197
fit3 0.2064191 0.9119 5 0.8988
fit4 0.3876270 0.8345 5 0.8100
> dev.new()
> residuals2(fit2)
320 / 430
Gasoline Yields from Crude Oil
> summary(fit2)
(the columns below are $\hat\beta_j$, $c_j$, $\hat\beta_j/c_j$ and $\Pr[\chi^2(1) > z_j^2]$)
Estimate Std.Error z value Pr(>|z|)
(Intercept) -4.3117643 0.3927982 -10.977 1.85e-11
temp 0.0057560 0.0003401 16.926 6.68e-16
pressure 0.2027705 0.0526505 3.851 0.000655
gravity 0.0292699 0.0092548 3.163 0.003842
pressure:gravity -0.0029725 0.0013007 -2.285 0.030367
> fit5 <- betareg(yield ~ temp+pressure+gravity+pressure:gravity,
data=GasolineYield, link="probit")
> summary(fit5)
Estimate Std.Error z value Pr(>|z|)
(Intercept) -4.2782892 0.3608205 -11.857 < 2e-16
temp 0.0057884 0.0003126 18.518 < 2e-16
pressure 0.1976645 0.0483439 4.089 4.34e-05
gravity 0.0279473 0.0084987 3.288 0.00101
pressure:gravity -0.0028123 0.0011945 -2.354 0.01856
321 / 430
Alaska pipeline
This dataset, extracted from Weisberg [2005] and available in the object
pipeline of the library glmtoolbox, consists of ultrasonic measure-
ments of the depth of defects in the Alaska pipeline (Field). A second
measurement of the depth of the defects was conducted in the labora-
tory (Lab). Data were analyzed in order to calibrate the bias between
field measurements and laboratory measurements.
[Figure: field measurement versus log(laboratory measurement).]
$$Y_i \overset{ind}{\sim} \left(\mu_i,\ \phi\,\mu_i^r\right), \qquad \mu_i = \beta_0\, x_i^{\beta_1}$$
$y$ = Field measurement
$x$ = Laboratory measurement
$r$ = 0, 1, 2 or 3
322 / 430
Alaska pipeline
> data(pipeline)
> str(pipeline)
> summary(pipeline)
>
> fit1 <- glm(Field ~ log(Lab), data=pipeline,
+ family=quasi(link=log,variance=constant))
> fit2 <- update(fit1, family=quasi(link=log,variance=mu))
> fit3 <- update(fit1, family=quasi(link=log,variance=mu^2))
> fit4 <- update(fit1, family=quasi(link=log,variance=mu^3))
>
> adjR2(fit1,fit2,fit3,fit4)
Deviance R-squared df adj.R-squared
fit1 3597.6085352 0.9019 2 0.9010
fit2 88.3293923 0.9216 2 0.9209
fit3 2.9264710 0.9256 2 0.9249
fit4 0.1439215 0.9109 2 0.9100
>
> dev.new(); residuals2(fit1)
> dev.new(); residuals2(fit2)
> dev.new(); residuals2(fit3)
> dev.new(); residuals2(fit4)
323 / 430
Alaska pipeline
> summary(fit3)
(the columns below are $\hat\beta_j$, $c_j$, $\hat\beta_j/c_j$ and $\Pr[\chi^2(1) > z_j^2]$)
Estimate Std.Error z value Pr(>|z|)
(Intercept) 0.30610 0.07924 3.863 0.000194
log(Lab) 0.88197 0.02254 39.133 < 2e-16
($\hat\phi$) Dispersion parameter for quasi family taken to be 0.02714494
Null deviance: 39.3211 on 106 degrees of freedom
Residual deviance: 2.9265 on 105 degrees of freedom
>
> fit5 <- glm(Field ~ log(Lab), data=pipeline, family=Gamma(log))
> summary(fit5)
(the columns below are $\hat\beta_j$, $c_j$, $\hat\beta_j/c_j$ and $\Pr[\chi^2(1) > z_j^2]$)
Estimate Std.Error z value Pr(>|z|)
(Intercept) 0.30610 0.07924 3.863 0.000194
log(Lab) 0.88197 0.02254 39.133 < 2e-16
($\hat\phi$) Dispersion parameter for Gamma family taken to be 0.02714494
Null deviance: 39.3211 on 106 degrees of freedom
Residual deviance: 2.9265 on 105 degrees of freedom
324 / 430
Overdispersion in Poisson
models
325 / 430
Overdispersion
Overdispersion occurs when the observed variance “ex-
ceeds” the expected variance according to the Poisson dis-
tribution.
Overdispersion results in an underestimation of the variabi-
lity associated with parameter estimates. As a result, vari-
ables may appear significant when they are actually not
significant, as well as producing confidence intervals with
“low” coverage rates.
As an informal rule, if the deviance statistic is "rather higher" than the degrees of freedom, or if the Pearson's chi-squared statistic is "rather higher" than the degrees of freedom, there may be evidence that the data are overdispersed.
Overdispersion may be addressed by replacing the Poisson
regression model with a regression model based on a more
complex distribution.
326 / 430
Self-diagnosed ear infections in swimmers
The data were extracted from Hand et al. [1993, page 266] and
can be found in the object
swimmers of the library glmtoolbox.
This study was conducted in order to determine if beach swim-
mers are at greater risk of contracting ear infections than non-
beach swimmers. This dataset contains the following variables:
frequency. The recruit’s perception of whether he or she is
a frequent swimmer: “frequent” and “occasional”.
location. The recruit’s usually chosen swimming location:
“beach” and “non-beach”.
age. The recruit’s age range: “15-19”, “20-24” and “25-29”.
gender. The recruit’s gender: “male” and “female”.
infections. The number of self-diagnosed ear infections
that were reported by the recruit.
327 / 430
Self-diagnosed ear infections in swimmers
> fit0 <- glm(infections ~ frequency + location + age + gender,
+ family=poisson(log), data=swimmers)
>
> c("Deviance/df"=deviance(fit0)/df.residual(fit0),
+ "Pearson/df"=sum(resid(fit0, type="pearson")**2)/df.residual(fit0))
Deviance/df Pearson/df
2.688374 3.379853
>
> envelope(fit0, type="quantile")
[Figure: normal probability plot with simulated envelope of the quantile-type residuals for the Poisson fit.]
328 / 430
The Origins of Sex Differences in Science
This data, extracted from Long [1990] and included in the library
pscl object bioChemists, is the result of a study that examined
gender differences in productivity (number of papers published)
during graduate school for a sample of 915 PhD students in
biochemistry, as a result of educational, mentoring, marriage,
and family factors. The following variables are included in this
dataset:
art. Count of articles produced during last 3 years of PhD.
fem. The gender of the student: “Men” or “Women”.
mar. The marital status of the student: “Single” or “Mar-
ried”.
kid5. The number of children aged 5 or younger.
phd. The prestige of the PhD department.
ment. The count of articles produced by the PhD mentor
during last 3 years.
329 / 430
The Origins of Sex Differences in Science
> data(bioChemists, package="psc l")
> fit0 <- glm(art ~ fem + mar + kid5 + phd + ment, family=poisson(log),
+ data=bioChemists)
>
> c("Deviance/df"=deviance(fit0)/df.residual(fit0),
+ "Pearson/df"=sum(resid(fit0, type="pearson")**2)/df.residual(fit0))
Deviance/df Pearson/df
1.797988 1.828984
>
> envelope(fit0, type="quantile")
[Figure: normal probability plot with simulated envelope of the quantile-type residuals for the Poisson fit.]
330 / 430
Negative Binomial I
If $Y|\lambda \sim \mathrm{Poisson}(\lambda)$, so that $E(Y|\lambda) = \lambda$ and $\mathrm{Var}(Y|\lambda) = \lambda$, and $\lambda \sim \mathrm{Gamma}(\mu, \phi)$, so that $E(\lambda) = \mu$ and $\mathrm{Var}(\lambda) = \phi\mu^2$, then $Y \sim$ Negative Binomial I$(\mu, \phi)$. Being so,
(1) $E(Y) = E[E(Y|\lambda)] = \mu$
(2) $\mathrm{Var}(Y) = \mathrm{Var}[E(Y|\lambda)] + E[\mathrm{Var}(Y|\lambda)] = \mu(1 + \phi\mu)$
(3) The probability mass function of $Y$ is
$$f_Y(y; \mu, \phi) = \frac{\Gamma(y + \phi^{-1})}{\Gamma(\phi^{-1})\,\Gamma(y+1)}\left(\frac{\mu}{\mu + \phi^{-1}}\right)^{y}\left(\frac{\phi^{-1}}{\mu + \phi^{-1}}\right)^{\phi^{-1}} I_y\{0, 1, \ldots\}$$
(4) The moment generating function of $Y$ is
$$M_Y(t) = \left[1 - \phi\mu\left(e^t - 1\right)\right]^{-\frac{1}{\phi}} \qquad\text{for } t < \log\!\left(\frac{\mu + \phi^{-1}}{\mu}\right)$$
(5) The mode of $Y$ is $\max\left\{\left\lfloor \mu(1-\phi) \right\rfloor,\ 0\right\}$.
331 / 430
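The Poisson-Gamma genesis of the distribution is easy to check by simulation; the sketch below draws $\lambda$ from a Gamma distribution with mean $\mu$ and variance $\phi\mu^2$ and verifies the mean and variance given in (1) and (2).

set.seed(123)
mu <- 4; phi <- 0.5
lambda <- rgamma(1e5, shape=1/phi, scale=phi*mu)  # E(lambda)=mu, Var(lambda)=phi*mu^2
y <- rpois(1e5, lambda)                           # Y | lambda ~ Poisson(lambda)
c(mean(y), mu)                                    # close to mu
c(var(y), mu*(1 + phi*mu))                        # close to mu(1 + phi*mu)
# the same distribution, drawn directly: rnbinom(1e5, size=1/phi, mu=mu)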
Negative Binomial I
The probability mass function of $Y$ may be obtained as follows:
$$f_Y(y; \mu, \phi) = \int_{-\infty}^{\infty} f_{Y|\lambda}(y; \lambda)\, f_\lambda(\lambda; \mu, \phi)\, d\lambda$$
$$= \int_{0}^{\infty} \frac{\exp(-\lambda)\lambda^{y}}{\Gamma(y+1)}\, I_y\{0,1,\ldots\}\ \frac{\lambda^{\phi^{-1}-1}}{\Gamma(\phi^{-1})(\phi\mu)^{\phi^{-1}}}\exp\!\left(-\frac{\lambda}{\phi\mu}\right) d\lambda$$
$$= \frac{I_y\{0,1,\ldots\}}{\Gamma(\phi^{-1})(\phi\mu)^{\phi^{-1}}\,\Gamma(y+1)}\int_{0}^{\infty} \lambda^{y+\phi^{-1}-1}\exp\!\left(-\frac{\lambda(\mu+\phi^{-1})}{\mu}\right) d\lambda$$
$$= \frac{\Gamma(\tilde\phi^{-1})(\tilde\phi\tilde\mu)^{\tilde\phi^{-1}}\, I_y\{0,1,\ldots\}}{\Gamma(\phi^{-1})(\phi\mu)^{\phi^{-1}}\,\Gamma(y+1)} \underbrace{\int_{0}^{\infty} \frac{\lambda^{\tilde\phi^{-1}-1}}{\Gamma(\tilde\phi^{-1})(\tilde\phi\tilde\mu)^{\tilde\phi^{-1}}}\exp\!\left(-\frac{\lambda}{\tilde\phi\tilde\mu}\right) d\lambda}_{=\ \int_0^\infty f_\lambda(\lambda;\,\tilde\mu,\tilde\phi)\,d\lambda\ =\ 1,\ \text{where } \lambda \sim \mathrm{Gamma}(\tilde\mu,\tilde\phi)}$$
$$= \frac{\Gamma(y+\phi^{-1})}{\Gamma(\phi^{-1})\,\Gamma(y+1)}\left(\frac{\mu}{\mu+\phi^{-1}}\right)^{y}\left(\frac{\phi^{-1}}{\mu+\phi^{-1}}\right)^{\phi^{-1}} I_y\{0,1,\ldots\},$$
where $\tilde\phi = 1/(y + \phi^{-1})$ and $\tilde\mu = \mu\left(y + \phi^{-1}\right)/\left(\mu + \phi^{-1}\right)$.
332 / 430
Negative Binomial I
The moment generating function of $Y$ may be obtained as follows:
$$M_Y(t) = E\left[M_{Y|\lambda}(t)\right] = \int_{0}^{\infty} M_{Y|\lambda}(t)\, f_\lambda(\lambda; \mu, \phi)\, d\lambda$$
$$= \int_{0}^{\infty} \exp\!\left[\lambda\left(e^t - 1\right)\right] \frac{\lambda^{\phi^{-1}-1}}{\Gamma(\phi^{-1})(\phi\mu)^{\phi^{-1}}}\exp\!\left(-\frac{\lambda}{\phi\mu}\right) d\lambda$$
$$= \left(\frac{\tilde\mu}{\mu}\right)^{\phi^{-1}} \underbrace{\int_{0}^{\infty} \frac{\lambda^{\phi^{-1}-1}}{\Gamma(\phi^{-1})(\phi\tilde\mu)^{\phi^{-1}}}\exp\!\left(-\frac{\lambda}{\phi\tilde\mu}\right) d\lambda}_{=\ \int_0^\infty f_\lambda(\lambda;\,\tilde\mu,\phi)\,d\lambda\ =\ 1,\ \text{where } \lambda \sim \mathrm{Gamma}(\tilde\mu,\phi)}$$
$$= \left[1 - \phi\mu\left(e^t - 1\right)\right]^{-\frac{1}{\phi}} \qquad\text{for } t < \log\!\left(\frac{\mu + \phi^{-1}}{\mu}\right),$$
as $\tilde\mu = \dfrac{\mu}{1 - \phi\mu\left(e^t - 1\right)}$, and $\tilde\mu > 0$ if and only if $t < \log\!\left(\dfrac{\mu + \phi^{-1}}{\mu}\right)$.
333 / 430
Negative Binomial distribution
If $Y|\lambda \sim \mathrm{Poisson}(\lambda)$, so that $E(Y|\lambda) = \lambda$ and $\mathrm{Var}(Y|\lambda) = \lambda$, and $\lambda \sim \mathrm{Gamma}(\mu, \phi\mu^\tau)$, so that $E(\lambda) = \mu$ and $\mathrm{Var}(\lambda) = \phi\mu^{\tau+2}$, for $\tau \in \mathbb{R}$, then $Y \sim$ Negative Binomial$(\mu, \phi, \tau)$. Being so,
(1) $E(Y) = E[E(Y|\lambda)] = \mu$
(2) $\mathrm{Var}(Y) = \mathrm{Var}[E(Y|\lambda)] + E[\mathrm{Var}(Y|\lambda)] = \mu(1 + \phi\mu^{\tau+1})$
(3) The probability mass function of $Y$ is
$$f_Y(y; \mu, \phi, \tau) = \frac{\Gamma\!\left(y + \frac{1}{\phi\mu^\tau}\right)}{\Gamma\!\left(\frac{1}{\phi\mu^\tau}\right)\Gamma(y+1)}\left(\frac{\mu}{\mu + \frac{1}{\phi\mu^\tau}}\right)^{y}\left(\frac{\frac{1}{\phi\mu^\tau}}{\mu + \frac{1}{\phi\mu^\tau}}\right)^{\frac{1}{\phi\mu^\tau}} I_y\{0,1,\ldots\}$$
(4) The moment generating function of $Y$ is
$$M_Y(t) = \left[1 - \phi\mu^{\tau+1}\left(e^t - 1\right)\right]^{-\frac{1}{\phi\mu^\tau}} \qquad\text{for } t < \log\!\left(\frac{\mu + \frac{1}{\phi\mu^\tau}}{\mu}\right)$$
(5) The mode of $Y$ is $\max\left\{\left\lfloor \mu(1 - \phi\mu^\tau)\right\rfloor,\ 0\right\}$.
(6) Negative Binomial I$(\mu, \phi)$ corresponds to $\tau = 0$.
(7) Negative Binomial II$(\mu, \phi)$ corresponds to $\tau = -1$.
334 / 430
Negative Binomial regression
$$Y_i \overset{ind}{\sim} \text{Negative Binomial}(\mu_i, \phi, \tau), \qquad g(\mu_i) = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip}$$
$$E(Y_i) = \mu_i \qquad\text{and}\qquad \mathrm{Var}(Y_i) = \mu_i(1 + \phi\mu_i^{\tau+1})$$
Logarithmic link: $g(\mu_i) = \log(\mu_i)$
Identity link: $g(\mu_i) = \mu_i$
Square-root link: $g(\mu_i) = \sqrt{\mu_i}$
Parameter estimation may be accomplished by using the maximum-likelihood method. Inference is based on the asymptotic properties of the maximum-likelihood estimator.
335 / 430
Self-diagnosed ear infections in swimmers
The data were extracted from Hand et al. [1993, page 266] and
can be found in the object
swimmers of the library glmtoolbox.
This study was conducted in order to determine if beach swim-
mers are at greater risk of contracting ear infections than non-
beach swimmers. This dataset contains the following variables:
frequency. The recruit’s perception of whether he or she is
a frequent swimmer: “frequent” and “occasional”.
location. The recruit’s usually chosen swimming location:
“beach” and “non-beach”.
age. The recruit’s age range: “15-19”, “20-24” and “25-29”.
gender. The recruit’s gender: “male” and “female”.
infections. The number of self-diagnosed ear infections
that were reported by the recruit.
336 / 430
Self-diagnosed ear infections in swimmers
> fit1 <- overglm(infections ~ frequency + location + age +
+ gender, family="nb1(log)", data=swimmers)
> fit2 <- update(fit1, family="nb2(log)")
> fit3 <- update(fit1, family="nbf(log)")
>
> AIC(fit0, fit1, fit2, fit3)
df AIC
fit0 6 1139.8280
fit1 7 904.6900
fit2 7 907.2357
fit3 8 906.5393
> BIC(fit0, fit1, fit2, fit3)
df BIC
fit0 6 1161.7849
fit1 7 930.3064
fit2 7 932.8521
fit3 8 935.8152
>
> fit1a <- update(fit1, formula = . ~ . - age - gender)
> anova(fit1a, fit1, test="wald")
Model 1 : infections ~ frequency + location
Model 2 : infections ~ frequency + location + age + gender
Chi Df Pr(>Chi)
1 vs 2 3.8493 3 0.2782
337 / 430
Self-diagnosed ear infections in swimmers
> anova(fit1a, fit1, test="score")
Model 1 : infections ~ frequency + location
Model 2 : infections ~ frequency + location + age + gender
Chi Df Pr(>Chi)
1 vs 2 3.9632 3 0.2655
>
> anova(fit1a, fit1, test="lr")
Model 1 : infections ~ frequency + location
Model 2 : infections ~ frequency + location + age + gender
Chi Df Pr(>Chi)
1 vs 2 3.8214 3 0.2814
>
> anova(fit1a, fit1, test="gradient")
Model 1 : infections ~ frequency + location
Model 2 : infections ~ frequency + location + age + gender
Chi Df Pr(>Chi)
1 vs 2 3.7811 3 0.2861
>
> summary(fit1a)
Estimate Std.Error z-value Pr(>|z|)
(Intercept) -0.27766 0.17191 -1.61512 0.1062847
frequencyoccasional 0.59019 0.19066 3.09554 0.0019645
locationnon-beach 0.48134 0.19004 2.53279 0.0113158
phi 1.78560 0.27779
338 / 430
Self-diagnosed ear infections in swimmers
> summary(update(fit0, formula = . ~ . - age - gender))
Estimate Std.Error z-value Pr(>|z|)
(Intercept) -0.30583 0.10593 -2.88703 0.00389
frequencyoccasional 0.61304 0.10498 5.83931 5.24e-09
locationnon-beach 0.50873 0.10281 4.94833 7.49e-07
>
> envelope(fit1a, type="quantile")
[Figure: normal probability plot with simulated envelope of the quantile-type residuals for the Negative Binomial I fit.]
339 / 430
The Origins of Sex Differences in Science
This data, extracted from Long [1990] and included in the library
pscl object bioChemists, is the result of a study that examined
gender differences in productivity (number of papers published)
during graduate school for a sample of 915 PhD students in
biochemistry, as a result of educational, mentoring, marriage,
and family factors. The following variables are included in this
dataset:
art. Count of articles produced during last 3 years of PhD.
fem. The gender of the student: “Men” or “Women”.
mar. The marital status of the student: “Single” or “Mar-
ried”.
kid5. The number of children aged 5 or younger.
phd. The prestige of the PhD department.
ment. The count of articles produced by the PhD mentor
during last 3 years.
340 / 430
The Origins of Sex Differences in Science
> fit1 <- overglm(art ~ fem + mar + kid5 + phd + ment,
+ family="nb1(log)", data=bioChemists)
> fit2 <- update(fit1, family="nb2(log)")
> fit3 <- update(fit1, family="nbf(log)")
>
> AIC(fit0, fit1, fit2, fit3)
df AIC
fit0 6 3314.113
fit1 7 3135.917
fit2 7 3143.397
fit3 8 3137.905
> BIC(fit0, fit1, fit2, fit3)
df BIC
fit0 6 3343.026
fit1 7 3169.649
fit2 7 3177.130
fit3 8 3176.457
>
> fit1a <- update(fit1, formula = . ~ . - mar - phd)
> anova(fit1a, fit1, test="wald")
Model 1 : art ~ fem + kid5 + ment
Model 2 : art ~ fem + mar + kid5 + phd + ment
Chi Df Pr(>Chi)
1 vs 2 3.4159 2 0.1812
341 / 430
The Origins of Sex Differences in Science
> anova(fit1a, fit1, test="score")
Model 1 : art ~ fem + kid5 + ment
Model 2 : art ~ fem + mar + kid5 + phd + ment
Chi Df Pr(>Chi)
1 vs 2 3.4338 2 0.1796
>
> anova(fit1a, fit1, test="lr")
Model 1 : art ~ fem + kid5 + ment
Model 2 : art ~ fem + mar + kid5 + phd + ment
Chi Df Pr(>Chi)
1 vs 2 3.4118 2 0.1816
>
> anova(fit1a, fit1, test="gradient")
Model 1 : art ~ fem + kid5 + ment
Model 2 : art ~ fem + mar + kid5 + phd + ment
Chi Df Pr(>Chi)
1 vs 2 3.4054 2 0.1822
>
> summary(fit1a)
Estimate Std.Error z-value Pr(>|z|)
(Intercept) 0.39102 0.06628 5.89972 3.6411e-09
femWomen -0.23270 0.07230 -3.21857 0.0012883
kid5 -0.13775 0.04846 -2.84278 0.0044722
ment 0.02937 0.00338 8.67714 < 2.22e-16
342 / 430
The Origins of Sex Differences in Science
> summary(update(fit0, formula = . ~ . - mar - phd))
Estimate Std.Error z-value Pr(>|z|)
(Intercept) 0.43673 0.04683 9.32683 < 2e-16
femWomen -0.24311 0.05421 -4.48475 7.30e-06
kid5 -0.14379 0.03657 -3.93146 8.44e-05
ment 0.02565 0.00195 13.13606 < 2e-16
>
> envelope(fit1a, type="quantile")
[Figure: normal probability plot with simulated envelope of the quantile-type residuals for the Negative Binomial I fit.]
343 / 430
Exercises
(1) Let's assume that
$$Y = \sum_{i=1}^{m} Y_i \qquad\text{and}\qquad Y_i \overset{ind}{\sim} \mathrm{Poisson}(\mu_i).$$
(A) Determine the probability distribution of $Y$.
(B) Let's assume that the $Y_i$'s have a positive correlation. What is the effect of this change on the mean and variance of $Y$?
(C) Let's assume that the $Y_i$'s have a negative correlation. What is the effect of this change on the mean and variance of $Y$?
(2) Show that, if $Y|\lambda \sim \mathrm{Poisson}(\lambda)$ and $\lambda$ follows a continuous distribution on the interval $(0, \infty)$, such that $E(\lambda) = \mu$ and $\mathrm{Var}(\lambda) = \kappa(\sigma^2, \mu)$, for some positive function $\kappa(\cdot,\cdot)$, then $Y$ is a discrete random variable such that: (i) its possible values correspond to the non-negative integers; (ii) its expectation is equal to $\mu$; and (iii) its variance is equal to $\mu + \kappa(\sigma^2, \mu)$. Therefore, a model based on the distribution of $Y$ is an alternative to the Poisson model in the presence of overdispersion.
344 / 430
Overdispersion in binomial
models
345 / 430
Overdispersion
Overdispersion occurs when the observed variance “ex-
ceeds” the expected variance according to the binomial dis-
tribution.
Overdispersion results in an underestimation of the variabi-
lity associated with parameter estimates. As a result, vari-
ables may appear significant when they are actually not
significant, as well as producing confidence intervals with
“low” coverage rates.
As an informal rule, if the deviance statistic is "rather higher" than the degrees of freedom, or if the Pearson's chi-squared statistic is "rather higher" than the degrees of freedom, there may be evidence that the data are overdispersed.
Overdispersion may be addressed by replacing the binomial
regression model with a regression model based on a more
complex distribution.
346 / 430
Cell dierentiation
These data, which can be found in the object cellular of the library glmtoolbox, have been extracted from Piegorsch et al. [1988] and relate to the ability of two immune-activating agents, TNF and IFN, to induce the differentiation of cells. The cells were assessed after they had been exposed to none, one, or both agents in order to determine whether they had displayed markers of differentiation. A total of 200 cells were examined for each of the 16 dose combinations of TNF and INF. The main question is whether the two agents stimulate cell differentiation in a synergistic or independent manner.
[Figure: % of differentiated cells versus dose of TNF (0, 4, 20, 100 U/ml), by dose of INF (0, 1, 10, 100 U/ml).]
$$m_i Y_i \overset{ind}{\sim} \mathrm{Binomial}(m_i, \mu_i), \qquad g(\mu_i) = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i1} x_{i2}$$
$m$ = Number of examined cells
$y$ = Proportion of differentiated cells
$x_1$ = Dose of TNF, in U/ml
$x_2$ = Dose of INF, in U/ml
347 / 430
Cell dierentiation
> fit0 <- glm(cells/200 ~ tnf + ifn + tnf*ifn, weights=rep(200,16),
+ family=binomial(logit), data=cellular)
>
> c("Deviance/df"=deviance(fit0)/df.residual(fit0),
+ "Pearson/df"=sum(resid(fit0, type="pearson")**2)/df.residual(fit0))
Deviance/df Pearson/df
16.02978 19.05526
>
> envelope(fit0, type="quantile")
[Figure: normal probability plot with simulated envelope of the quantile-type residuals for the binomial fit.]
348 / 430
Teratogenic eects of PHT and TCPO
These data, extracted from Morel and Neerchal [1997] and available in the object ossification of the library glmtoolbox, were obtained from an experiment in which pregnant mice were randomly assigned to three treatments: 60 mg/kg PHT, 100 mg/kg TCPO, or 60 mg/kg PHT and 100 mg/kg TCPO, each administered daily by gastric gavage. The left middle third phalanx was examined on day 18 of pregnancy for the presence or absence of ossification. The purpose of the study was to investigate the synergy between PHT and TCPO.
[Figure: % of fetuses showing ossification versus dose of PHT (0 mg/kg, 60 mg/kg), by dose of TCPO (0 mg/kg, 100 mg/kg).]
$$m_i Y_i \overset{ind}{\sim} \mathrm{Binomial}(m_i, \mu_i), \qquad g(\mu_i) = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i1} x_{i2}$$
$m$ = Litter size
$y$ = Proportion of fetuses showing ossification
$x_1$ = Dose of PHT, in mg/kg
$x_2$ = Dose of TCPO, in mg/kg
349 / 430
Teratogenic eects of PHT and TCPO
> fit0 <- glm(fetuses/litter ~ pht + tcpo + pht*tcpo, weights=litter,
+ family=binomial(logit), data=ossification)
>
> c("Deviance/df"=deviance(fit0)/df.residual(fit0),
+ "Pearson/df"=sum(resid(fit0, type="pearson")**2)/df.residual(fit0))
Deviance/df Pearson/df
3.771843 3.273809
>
> envelope(fit0, type="quantile")
[Figure: normal probability plot with simulated envelope of the quantile-type residuals for the binomial fit.]
350 / 430
Beta Binomial distribution
If $mY|\pi \sim \mathrm{Binomial}(m, \pi)$, so that $E(Y|\pi) = \pi$ and $\mathrm{Var}(Y|\pi) = m^{-1}\pi(1-\pi)$, and $\pi \sim \mathrm{Beta}(\mu, \phi)$ for $\phi > 0$, so that $E(\pi) = \mu$ and $\mathrm{Var}(\pi) = (\phi+1)^{-1}\mu(1-\mu)$, then $mY \sim$ Beta Binomial$(m, \mu, \phi)$. Being so,
(1) $E(Y) = E[E(Y|\pi)] = \mu$
(2) $\mathrm{Var}(Y) = \mathrm{Var}[E(Y|\pi)] + E[\mathrm{Var}(Y|\pi)] = \dfrac{\mu(1-\mu)}{m}\left(1 + \dfrac{m-1}{\phi+1}\right)$
(3) The probability mass function of $Y$ is
$$f_Y(y; \mu, \phi) = \binom{m}{my}\frac{B\big(my + \phi\mu,\ m(1-y) + \phi(1-\mu)\big)}{B\big(\phi\mu,\ \phi(1-\mu)\big)}\, I_{my}\{0, 1, \ldots, m\},$$
where $B(a, b) = \dfrac{\Gamma(a)\Gamma(b)}{\Gamma(a+b)}$ is the beta function.
351 / 430
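The Binomial-Beta genesis can likewise be checked by simulation; the sketch below uses the Beta$(\mu, \phi)$ parameterization with shape parameters $\phi\mu$ and $\phi(1-\mu)$, matching the pmf in (3).

set.seed(123)
m <- 20; mu <- 0.3; phi <- 2
pi <- rbeta(1e5, shape1=phi*mu, shape2=phi*(1-mu))   # E(pi)=mu, Var(pi)=mu(1-mu)/(phi+1)
y <- rbinom(1e5, size=m, prob=pi)/m                  # Y = proportion of successes
c(mean(y), mu)                                       # close to mu
c(var(y), mu*(1-mu)/m * (1 + (m-1)/(phi+1)))         # close to the variance in (2)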
Beta Binomial distribution
The probability mass function of $Y$ may be obtained as follows:
$$f_Y(y; \mu, \phi) = \int_{-\infty}^{\infty} f_{mY|\pi}(y; m, \pi)\, f_\pi(\pi; \mu, \phi)\, d\pi$$
$$= \int_{0}^{1} \binom{m}{my}\pi^{my}(1-\pi)^{m(1-y)}\, I_{my}\{0,1,\ldots,m\}\ \frac{\pi^{\mu\phi-1}(1-\pi)^{(1-\mu)\phi-1}}{B\big(\mu\phi, (1-\mu)\phi\big)}\, d\pi$$
$$= \binom{m}{my}\frac{I_{my}\{0,1,\ldots,m\}}{B\big(\mu\phi, (1-\mu)\phi\big)}\int_{0}^{1} \pi^{my+\mu\phi-1}(1-\pi)^{m(1-y)+(1-\mu)\phi-1}\, d\pi$$
$$= \binom{m}{my}\frac{B\big(\tilde\mu\tilde\phi, (1-\tilde\mu)\tilde\phi\big)}{B\big(\mu\phi, (1-\mu)\phi\big)}\, I_{my}\{0,1,\ldots,m\} \underbrace{\int_{0}^{1} \frac{\pi^{\tilde\mu\tilde\phi-1}(1-\pi)^{(1-\tilde\mu)\tilde\phi-1}}{B\big(\tilde\mu\tilde\phi, (1-\tilde\mu)\tilde\phi\big)}\, d\pi}_{=\ \int_0^1 f_\pi(\pi;\,\tilde\mu,\tilde\phi)\,d\pi\ =\ 1,\ \text{where } \pi \sim \mathrm{Beta}(\tilde\mu,\tilde\phi)}$$
$$= \binom{m}{my}\frac{B\big(my + \phi\mu,\ m(1-y) + \phi(1-\mu)\big)}{B\big(\phi\mu,\ \phi(1-\mu)\big)}\, I_{my}\{0,1,\ldots,m\},$$
where $\tilde\mu = (my + \mu\phi)/(m + \phi)$ and $\tilde\phi = m + \phi$.
352 / 430
Beta Binomial regression
$$Y_i \overset{ind}{\sim} \text{Beta Binomial}(\mu_i, \phi), \qquad g(\mu_i) = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip}$$
$$E(Y_i) = \mu_i \qquad\text{and}\qquad \mathrm{Var}(Y_i) = \frac{\mu_i(1-\mu_i)}{m_i}\left(1 + \frac{m_i - 1}{\phi + 1}\right)$$
Logit link: $g(\mu_i) = \log\!\left(\dfrac{\mu_i}{1-\mu_i}\right)$
Probit link: $g(\mu_i) = \Phi^{-1}(\mu_i)$
Complementary log-log link: $g(\mu_i) = \log\{-\log(1-\mu_i)\}$
Parameter estimation may be accomplished by using the maximum-likelihood method. Inference is based on the asymptotic properties of the maximum-likelihood estimator.
353 / 430
Random-clumped Binomial distribution
If $mY|\pi \sim \mathrm{Binomial}(m, \pi)$, so that $E(Y|\pi) = \pi$ and $\mathrm{Var}(Y|\pi) = m^{-1}\pi(1-\pi)$, and $\pi = (1-\phi)\mu + \phi\,\mathrm{Bernoulli}(\mu)$ for $\mu, \phi \in (0, 1)$, so that $E(\pi) = \mu$ and $\mathrm{Var}(\pi) = \phi^2\mu(1-\mu)$, then $mY \sim$ Random-clumped Binomial$(m, \mu, \phi)$. Thus, $E(Y) = E[E(Y|\pi)] = \mu$,
$$\mathrm{Var}(Y) = \mathrm{Var}[E(Y|\pi)] + E[\mathrm{Var}(Y|\pi)] = \frac{\mu(1-\mu)}{m}\left[1 + \phi^2(m-1)\right],$$
and the probability mass function of $Y$ may be written as follows:
$$f_Y(y; m, \mu, \phi) = \mu\binom{m}{my}\pi_1^{my}(1-\pi_1)^{m(1-y)}\, I_{my}\{0,1,\ldots,m\} + (1-\mu)\binom{m}{my}\pi_2^{my}(1-\pi_2)^{m(1-y)}\, I_{my}\{0,1,\ldots,m\},$$
where $\pi_1 = (1-\phi)\mu + \phi$ and $\pi_2 = (1-\phi)\mu$.
354 / 430
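The "random clumping" mechanism can be simulated directly: with probability $\mu$ the clumped trials push $\pi$ up to $\pi_1 = (1-\phi)\mu + \phi$, and with probability $1-\mu$ down to $\pi_2 = (1-\phi)\mu$. A minimal sketch:

set.seed(123)
m <- 20; mu <- 0.3; phi <- 0.4
pi <- (1-phi)*mu + phi*rbinom(1e5, 1, mu)     # pi takes the values pi_1 and pi_2
y <- rbinom(1e5, size=m, prob=pi)/m
c(mean(y), mu)                                # close to mu
c(var(y), mu*(1-mu)/m * (1 + phi^2*(m-1)))    # close to the variance above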
Random-clumped Binomial regression
$$Y_i \overset{ind}{\sim} \text{Random-clumped Binomial}(\mu_i, \phi), \qquad g(\mu_i) = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip}$$
$$E(Y_i) = \mu_i \qquad\text{and}\qquad \mathrm{Var}(Y_i) = \frac{\mu_i(1-\mu_i)}{m_i}\left[1 + \phi^2(m_i - 1)\right]$$
Logit link: $g(\mu_i) = \log\!\left(\dfrac{\mu_i}{1-\mu_i}\right)$
Probit link: $g(\mu_i) = \Phi^{-1}(\mu_i)$
Complementary log-log link: $g(\mu_i) = \log\{-\log(1-\mu_i)\}$
Parameter estimation may be accomplished by using the maximum-likelihood method. Inference is based on the asymptotic properties of the maximum-likelihood estimator.
355 / 430
Cell dierentiation
These data, which can be found in the object cellular of the library glmtoolbox, have been extracted from Piegorsch et al. [1988] and relate to the ability of two immune-activating agents, TNF and IFN, to induce the differentiation of cells. The cells were assessed after they had been exposed to none, one, or both agents in order to determine whether they had displayed markers of differentiation. A total of 200 cells were examined for each of the 16 dose combinations of TNF and INF. The main question is whether the two agents stimulate cell differentiation in a synergistic or independent manner.
[Figure: % of differentiated cells versus dose of TNF (0, 4, 20, 100 U/ml), by dose of INF (0, 1, 10, 100 U/ml).]
$$m_i Y_i \overset{ind}{\sim} \mathrm{Binomial}(m_i, \mu_i), \qquad g(\mu_i) = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i1} x_{i2}$$
$m$ = Number of examined cells
$y$ = Proportion of differentiated cells
$x_1$ = Dose of TNF, in U/ml
$x_2$ = Dose of INF, in U/ml
356 / 430
Cell dierentiation
> fit1 <- overglm(cbind(cells,200-cells) ~ tnf + ifn + tnf*ifn,
+ family="bb(logit)", data=cellular)
> fit2 <- update(fit1, family="rcb(probit)")
>
> AIC(fit0, fit1, fit2)
df AIC
fit0 4 281.8449
fit1 5 148.0026
fit2 5 163.1216
>
> BIC(fit0, fit1, fit2)
df BIC
fit0 4 284.9353
fit1 5 151.8656
fit2 5 166.9846
>
> fit1a <- update(fit1, formula = . ~ . - tnf:ifn)
> anova(fit1a, fit1,test="wald")
Model 1 : cbind(cells, 200 - cells) ~ tnf + ifn
Model 2 : cbind(cells, 200 - cells) ~ tnf * ifn
Chi Df Pr(>Chi)
1 vs 2 0.83757 1 0.3601
357 / 430
Cell dierentiation
> anova(fit4a, fit4,test="lr")
Model 1 : cbind(cells, 200 - cells) ~ tnf + ifn
Model 2 : cbind(cells, 200 - cells) ~ tnf * ifn
Chi Df Pr(>Chi)
1 vs 2 1.1974 1 0.2738
>
> anova(fit1a, fit1, test="score")
Model 1 : cbind(cells, 200 - cells) ~ tnf + ifn
Model 2 : cbind(cells, 200 - cells) ~ tnf * ifn
Chi Df Pr(>Chi)
1 vs 2 1.0277 1 0.3107
>
> anova(fit1a, fit1, test="gradient")
Model 1 : cbind(cells, 200 - cells) ~ tnf + ifn
Model 2 : cbind(cells, 200 - cells) ~ tnf * ifn
Chi Df Pr(>Chi)
1 vs 2 1.3701 1 0.2418
>
> summary(fit1a)
Estimate Std.Error z-value Pr(>|z|)
(Intercept) -1.75132 0.24252 -7.22127 5.1504e-13
tnf 0.02952 0.00399 7.39682 1.3949e-13
ifn 0.01296 0.00367 3.53479 0.0004081
358 / 430
Cell dierentiation
> fit0 <- update(fit0, formula = . ~ . - tnf:ifn)
> summary(fit0)
Estimate Std.Error z-value Pr(>|z|)
(Intercept) -1.83209 0.06955 -26.34045 <2e-16
tnf 0.02996 0.00113 26.56381 <2e-16
ifn 0.01410 0.00106 13.33093 <2e-16
>
> envelope(fit1a, type="quantile")
[Figure: normal probability plot with simulated envelope of the quantile-type residuals for the Beta Binomial fit.]
359 / 430
Teratogenic eects of PHT and TCPO
These data, extracted from Morel and Neerchal [1997] and available in the object ossification of the library glmtoolbox, were obtained from an experiment in which pregnant mice were randomly assigned to three treatments: 60 mg/kg PHT, 100 mg/kg TCPO, or 60 mg/kg PHT and 100 mg/kg TCPO, each administered daily by gastric gavage. The left middle third phalanx was examined on day 18 of pregnancy for the presence or absence of ossification. The purpose of the study was to investigate the synergy between PHT and TCPO.
[Figure: % of fetuses showing ossification versus dose of PHT (0 mg/kg, 60 mg/kg), by dose of TCPO (0 mg/kg, 100 mg/kg).]
$$m_i Y_i \overset{ind}{\sim} \mathrm{Binomial}(m_i, \mu_i), \qquad g(\mu_i) = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i1} x_{i2}$$
$m$ = Litter size
$y$ = Proportion of fetuses showing ossification
$x_1$ = Dose of PHT, in mg/kg
$x_2$ = Dose of TCPO, in mg/kg
360 / 430
Teratogenic eects of PHT and TCPO
> fit1 <- overglm(cbind(fetuses,litter-fetuses) ~ pht + tcpo + pht*tcpo,
+ family="bb(logit)", data=ossification)
> fit2 <- update(fit1, family="rcb(logit)")
>
> AIC(fit0, fit1, fit2)
df AIC
fit0 4 409.7828
fit1 5 316.5751
fit2 5 315.0534
>
> BIC(fit0, fit1, fit2)
df BIC
fit0 4 419.3606
fit1 5 328.5474
fit2 5 327.0257
>
> fit2a <- update(fit2, formula = . ~ . - pht:tcpo)
>
> anova(fit2a, fit2, test="wald")
Model 1 : cbind(fetuses, litter - fetuses) ~ pht + tcpo
Model 2 : cbind(fetuses, litter - fetuses) ~ pht * tcpo
Chi Df Pr(>Chi)
1 vs 2 0.85147 1 0.3561
361 / 430
Teratogenic eects of PHT and TCPO
> anova(fit2a, fit2, test="lr")
Model 1 : cbind(fetuses, litter - fetuses) ~ pht + tcpo
Model 2 : cbind(fetuses, litter - fetuses) ~ pht * tcpo
Chi Df Pr(>Chi)
1 vs 2 1.1974 1 0.2738
>
> anova(fit2a, fit2, test="score")
Model 1 : cbind(fetuses, litter - fetuses) ~ pht + tcpo
Model 2 : cbind(fetuses, litter - fetuses) ~ pht * tcpo
Chi Df Pr(>Chi)
1 vs 2 0.83027 1 0.3622
>
> anova(fit2a, fit2, test="gradient")
Model 1 : cbind(fetuses, litter - fetuses) ~ pht + tcpo
Model 2 : cbind(fetuses, litter - fetuses) ~ pht * tcpo
Chi Df Pr(>Chi)
1 vs 2 0.81836 1 0.3657
>
> summary(fit2a)
Estimate Std.Error z-value Pr(>|z|)
(Intercept) 0.57038 0.21361 2.67014 0.007582
pht60 mg/kg -1.33705 0.33663 -3.97188 7.1307e-05
tcpo100 mg/kg -0.76212 0.31146 -2.44693 0.014408
362 / 430
Teratogenic eects of PHT and TCPO
> fit0 <- update(fit0, formula = . ~ . - pht:tcpo)
> summary(fit0)
Estimate Std.Error z-value Pr(>|z|)
(Intercept) 0.72414 0.12728 5.68926 0.00000
pht60 mg/kg -1.75661 0.19908 -8.82366 0.00000
tcpo100 mg/kg -0.55281 0.19117 -2.89173 0.00383
>
> envelope(fit2a, type="quantile")
[Figure: normal probability plot with simulated envelope of the quantile-type residuals for the Random-clumped Binomial fit.]
363 / 430
Exercises
(1) Let's assume that
$$Y = \frac{1}{n_1 + \ldots + n_m}\sum_{i=1}^{m} Y_i \qquad\text{and}\qquad Y_i \overset{ind}{\sim} \mathrm{Binomial}(n_i, \mu).$$
(A) Determine the probability distribution of $Y$.
(B) Let's assume that the $Y_i$'s have a positive correlation. What is the effect of this change on the mean and variance of $Y$?
(C) Let's assume that the $Y_i$'s have a negative correlation. What is the effect of this change on the mean and variance of $Y$?
(2) Show that, if $mY|\pi \sim \mathrm{Binomial}(m, \pi)$ and $\pi$ follows a continuous distribution on the interval $(0, 1)$, such that $E(\pi) = \mu$ and $\mathrm{Var}(\pi) = \kappa(\sigma^2, \mu)$, for some positive function $\kappa(\cdot,\cdot)$, then $Y$ is a discrete random variable such that: (i) its possible values are $0, m^{-1}, 2m^{-1}, 3m^{-1}, \ldots, (m-1)m^{-1}, 1$; (ii) its expectation is equal to $\mu$; and (iii) its variance is equal to $m^{-1}\mu(1-\mu) + (1 - m^{-1})\,\kappa(\sigma^2, \mu)$. Therefore, a model based on the distribution of $Y$ is an alternative to the binomial model in the presence of overdispersion.
364 / 430
Excess of zeros
in count data
365 / 430
Males attached to female horseshoe crabs
These data, extracted from Agresti [2015] and available in GLMsData’s
object
hcrabs, are drawn from a s tudy of female horseshoe crabs
on an isl and in the Gulf of Mexico. During the spawning season,
females migrate to the shore i n order to breed. She burrows into
the sand with a male attached to her posterior spine and lays
clusters of eggs. In the sand beneath the pair, the eggs are fertil-
ized externally. Other male crabs (satellites) may cluster around
the spawning pair and fertilize the eggs as well. The variables
in the dataset are the following:
Col. The color of the female crab: “LM” (Light Medium),
“M” (Medium), “DM” (Dark Medium) or “D” (Dark).
Spine. The spine condition of the female crab: “BothOK”,
“OneOK” or “NoneOK”.
Width. The carapace width of the female crab, in cms.
Wt. The weight of the female crab, in grams.
Sat. The number of male crabs attached to the female crab.
366 / 430
Males attached to female horseshoe crabs
There is an excess of zeros when the observed number of zeros is "rather greater" than it is expected to be. According to the Poisson model, $Y_i \overset{ind}{\sim} \mathrm{Poisson}(\mu_i)$. Thus, $O = O_1 + \ldots + O_n$ is the number of zeros, where
$$O_i = \begin{cases} 1 & \text{if } Y_i = 0\\ 0 & \text{if } Y_i > 0 \end{cases},$$
so that $O_i \overset{ind}{\sim} \mathrm{Bernoulli}(\exp(-\mu_i))$. Therefore, $E(O)$ and $\mathrm{Var}(O)$ are given, respectively, by
$$\sum_{i=1}^{n} E(O_i) = \sum_{i=1}^{n} \Pr(Y_i = 0) = \sum_{i=1}^{n} \exp(-\mu_i)$$
and
$$\sum_{i=1}^{n} \mathrm{Var}(O_i) = \sum_{i=1}^{n} \Pr(Y_i = 0)\left[1 - \Pr(Y_i = 0)\right] = \sum_{i=1}^{n} \exp(-\mu_i)\left[1 - \exp(-\mu_i)\right].$$
The CLT indicates that
$$Z = \frac{O - E(O)}{\sqrt{\mathrm{Var}(O)}} \xrightarrow[n\to\infty]{D} \mathrm{Normal}(0, 1).$$
367 / 430
Males attached to female horseshoe crabs
This "naive" statistic, which is obtained by substituting $\hat\beta$ for $\beta$ in the above expression for $Z$, will be biased against the asymptotic standard normal, since it treats the estimate $\hat\beta$ as the true $\beta$, thus ignoring the sampling variation associated with $\hat\beta$. In order to correct this, parametric bootstrapping can be used to estimate the mean and variance of the difference between the observed and estimated expected number of zeros.
> fit0 <- glm(Sat ~ log(Wt) + Col + Spine + Width, data=hcrabs,
+ family=poisson(log))
>
> zero.excess(fit0)
Number of Zeros
Observed Expected z-value Pr(>z)
62 16.787 12.266 <2.2e-16
(the columns are $o$, $\hat E(O)$, the $z$ statistic, and $\Pr(\chi^2(1) > z^2)$)
368 / 430
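A minimal sketch of this parametric bootstrap (B = 1000 replicates) for the Poisson fit above; zero.excess() automates the computation.

set.seed(123)
mu.hat <- fitted(fit0); n <- length(mu.hat); B <- 1000
d.obs <- sum(fit0$y == 0) - sum(exp(-mu.hat))   # observed minus estimated expected zeros
d.boot <- replicate(B, {
  ystar <- rpois(n, mu.hat)                     # simulate under the fitted model
  fstar <- glm(ystar ~ log(Wt) + Col + Spine + Width,
               family=poisson(log), data=hcrabs)  # re-fit to carry estimation error
  sum(ystar == 0) - sum(exp(-fitted(fstar)))
})
z <- (d.obs - mean(d.boot))/sd(d.boot)
pnorm(z, lower.tail=FALSE)                      # one-sided p-value for excess of zeros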
Zero-Inflated Poisson distribution
If
$$Y|\nu \sim \begin{cases} 0 & \text{if } \nu = 1\\ \mathrm{Poisson}(\mu) & \text{if } \nu = 0 \end{cases} \qquad\text{and}\qquad \nu \sim \mathrm{Bernoulli}(\pi),$$
then $Y$ is distributed according to the Zero-Inflated Poisson distribution, denoted here by $\mathrm{ZIP}(\mu, \pi)$. The probability mass function of $Y$ may be written as follows:
$$f_Y(y; \mu, \pi) = \begin{cases} \pi + (1-\pi)\exp(-\mu) & \text{if } y = 0\\[1ex] (1-\pi)\dfrac{\exp(-\mu)\,\mu^{y}}{y!} & \text{if } y = 1, 2, 3, \ldots \end{cases}$$
In addition,
(1) $E(Y) = (1-\pi)\mu$,
(2) $\mathrm{Var}(Y) = (1-\pi)\mu + \pi(1-\pi)\mu^2$,
(3) $F_Y(y; \mu, \pi) = \pi\, I_y[0, \infty) + (1-\pi)\,F_{\tilde Y}(y; \mu)$, and
(4) $E(Y^r) = (1-\pi)E(\tilde Y^r)$ for all $r > 0$, where $\tilde Y \sim \mathrm{Poisson}(\mu)$.
369 / 430
Zero-Inflated Poisson model
$$Y_i \overset{ind}{\sim} \mathrm{ZIP}(\mu_i, \pi_i),$$
$$g(\mu_i) = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip},$$
$$h(\pi_i) = \gamma_0 + \gamma_1 z_{i1} + \ldots + \gamma_q z_{iq}$$
Logarithmic link: $g(\mu_i) = \log(\mu_i)$
Identity link: $g(\mu_i) = \mu_i$
Square-root link: $g(\mu_i) = \sqrt{\mu_i}$
Logit link: $h(\pi_i) = \log\!\left(\dfrac{\pi_i}{1-\pi_i}\right)$
Probit link: $h(\pi_i) = \Phi^{-1}(\pi_i)$
Complementary log-log link: $h(\pi_i) = \log\{-\log(1-\pi_i)\}$
Cauchy link: $h(\pi_i) = \tan\!\left[\dfrac{\pi}{2}(2\pi_i - 1)\right]$
370 / 430
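In glmtoolbox, zero-inflated count models can be fitted with zeroinf(), whose two-part formula appears to mirror the zeroalt() call shown later for the zero-altered model; the exact family string used below is an assumption, so check the package documentation.

# Sketch: fitting a ZIP model to the horseshoe-crab counts.
# Assumption: zeroinf() accepts a two-part formula (counts | zero part)
# and a family string, mirroring the zeroalt() call used later.
library(glmtoolbox)
library(GLMsData)
data(hcrabs)
fitzip <- zeroinf(Sat ~ log(Wt) + Col | Width, family = "poi(log)",
                  zero.link = "logit", data = hcrabs)
summary(fitzip)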
Zero-Modified Poisson distribution
If $Y\,|\,(\nu = 1) = 0$ and $Y\,|\,(\nu = 0) \sim \text{ZT-Poisson}(\mu)$, where $\nu \sim \text{Bernoulli}(\pi)$ and “ZT” denotes the Zero-Truncated version, then $Y$ is distributed according to the Zero-Modified Poisson distribution, denoted here by $\text{ZMP}(\mu, \pi)$. The probability mass function of $Y$ may be written as follows:
$$f_Y(y;\mu,\pi) = \begin{cases} \pi & \text{if } y = 0,\\[4pt] \kappa_0\,\dfrac{\exp(-\mu)\,\mu^{y}}{y!} & \text{if } y = 1, 2, 3, \ldots\end{cases}$$
where $\kappa_0 = (1-\pi)/[1-\exp(-\mu)]$. In addition,
(1) $E(Y) = \kappa_0\mu$,
(2) $\text{Var}(Y) = \kappa_0\mu + \kappa_0(1-\kappa_0)\mu^{2}$,
(3) $F_Y(y;\mu,\pi) = \pi\,I_{[0,\infty)}(y) + \kappa_0\,[F_{\tilde{Y}}(y;\mu) - F_{\tilde{Y}}(0;\mu)]$, and
(4) $E(Y^{r}) = \kappa_0\,E(\tilde{Y}^{r})$ for all $r > 0$, where $\tilde{Y} \sim \text{Poisson}(\mu)$.
371 / 430
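A small numerical illustration of how the ZMP handles both directions; dzmp() is a local helper, not a glmtoolbox function. When $\pi$ falls below the Poisson zero probability $\exp(-\mu)$, zeros are deflated rather than inflated.

# ZMP(mu, pi) pmf as defined above; dzmp() is a local helper.
dzmp <- function(y, mu, p) {
  k0 <- (1 - p) / (1 - exp(-mu))
  ifelse(y == 0, p, k0 * dpois(y, mu))
}
mu <- 2
dpois(0, mu)                    # Poisson zero probability: exp(-2) ~ 0.135
dzmp(0, mu, p = 0.30)           # excess of zeros (p > exp(-mu))
dzmp(0, mu, p = 0.05)           # lack of zeros (p < exp(-mu))
sum(dzmp(0:100, mu, p = 0.05))  # still a valid pmf (~ 1)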
Zero-Modified Poisson model
$$Y_i \overset{\text{ind}}{\sim} \text{ZMP}(\mu_i, \pi_i),\qquad g(\mu_i) = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip},\qquad h(\pi_i) = \gamma_0 + \gamma_1 z_{i1} + \ldots + \gamma_q z_{iq}$$
Logarithmic link: $g(\mu_i) = \log(\mu_i)$
Identity link: $g(\mu_i) = \mu_i$
Square-root link: $g(\mu_i) = \sqrt{\mu_i}$
Logit link: $h(\pi_i) = \log\{\pi_i/(1-\pi_i)\}$
Probit link: $h(\pi_i) = \Phi^{-1}(\pi_i)$
Complementary log-log link: $h(\pi_i) = \log\{-\log(1-\pi_i)\}$
Cauchy link: $h(\pi_i) = \tan\{\tfrac{\pi}{2}(2\pi_i - 1)\}$
372 / 430
Zero-Inflated and Zero-Modified Poisson
Parameter estimation may be accomplished by using the maximum-likelihood method. Inference is based on the asymptotic properties of the maximum-likelihood estimator.
Note that:
(1) if $Y \sim \text{ZIP}(\mu,\pi)$ then $\Pr(Y=0) \to 1$ as $\pi \to 1$;
(2) if $Y \sim \text{ZIP}(\mu,\pi)$ then $\Pr(Y=0) \to \exp(-\mu)$ as $\pi \to 0$;
(3) if $Y \sim \text{ZMP}(\mu,\pi)$ then $\Pr(Y=0) \to 1$ as $\pi \to 1$;
(4) if $Y \sim \text{ZMP}(\mu,\pi)$ then $\Pr(Y=0) \to 0$ as $\pi \to 0$.
Therefore, the $\text{ZIP}(\mu,\pi)$ distribution can only handle an excess of zeros, whereas the $\text{ZMP}(\mu,\pi)$ distribution can handle both an excess and a lack of zeros.
373 / 430
Zero-Inflated Negative Binomial distribution
If $Y\,|\,(\nu = 1) = 0$ and $Y\,|\,(\nu = 0) \sim \text{Negative Binomial}(\mu,\phi,\tau)$, where $\nu \sim \text{Bernoulli}(\pi)$, then $Y$ is distributed according to the Zero-Inflated Negative Binomial distribution, denoted here by $\text{ZINB}(\mu,\phi,\tau,\pi)$. The probability mass function of $Y$ may be written as follows:
$$f_Y(y;\mu,\phi,\tau,\pi) = \begin{cases}\pi + (1-\pi)\left(\dfrac{\frac{1}{\phi\mu^{\tau}}}{\mu + \frac{1}{\phi\mu^{\tau}}}\right)^{\frac{1}{\phi\mu^{\tau}}} & \text{if } y = 0,\\[14pt] (1-\pi)\,\dfrac{\Gamma\!\left(y + \frac{1}{\phi\mu^{\tau}}\right)}{\Gamma\!\left(\frac{1}{\phi\mu^{\tau}}\right)\Gamma(y+1)}\left(\dfrac{\mu}{\mu + \frac{1}{\phi\mu^{\tau}}}\right)^{\!y}\left(\dfrac{\frac{1}{\phi\mu^{\tau}}}{\mu + \frac{1}{\phi\mu^{\tau}}}\right)^{\frac{1}{\phi\mu^{\tau}}} & \text{if } y = 1, 2, \ldots\end{cases}$$
In addition,
(1) $E(Y) = (1-\pi)\mu$,
(2) $\text{Var}(Y) = (1-\pi)\mu(1 + \phi\mu^{\tau+1}) + \pi(1-\pi)\mu^{2}$,
(3) $F_Y(y;\mu,\phi,\tau,\pi) = \pi\,I_{[0,\infty)}(y) + (1-\pi)\,F_{\tilde{Y}}(y;\mu,\phi,\tau)$, and
(4) $E(Y^{r}) = (1-\pi)\,E(\tilde{Y}^{r})$ for all $r > 0$, where $\tilde{Y} \sim \text{Negative Binomial}(\mu,\phi,\tau)$.
374 / 430
Zero-Inflated Negative Binomial model
$$Y_i \overset{\text{ind}}{\sim} \text{ZINB}(\mu_i, \phi, \tau, \pi_i),\qquad g(\mu_i) = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip},\qquad h(\pi_i) = \gamma_0 + \gamma_1 z_{i1} + \ldots + \gamma_q z_{iq}$$
Logarithmic link: $g(\mu_i) = \log(\mu_i)$
Identity link: $g(\mu_i) = \mu_i$
Square-root link: $g(\mu_i) = \sqrt{\mu_i}$
Logit link: $h(\pi_i) = \log\{\pi_i/(1-\pi_i)\}$
Probit link: $h(\pi_i) = \Phi^{-1}(\pi_i)$
Complementary log-log link: $h(\pi_i) = \log\{-\log(1-\pi_i)\}$
Cauchy link: $h(\pi_i) = \tan\{\tfrac{\pi}{2}(2\pi_i - 1)\}$
375 / 430
Zero-Modified Negative Binomial distribution
If $Y\,|\,(\nu = 1) = 0$ and $Y\,|\,(\nu = 0) \sim \text{ZT-Negative Binomial}(\mu,\phi,\tau)$, where $\nu \sim \text{Bernoulli}(\pi)$, then $Y$ is distributed according to the Zero-Modified Negative Binomial distribution, denoted here by $\text{ZMNB}(\mu,\phi,\tau,\pi)$. The probability mass function of $Y$ may be written as follows:
$$f_Y(y;\mu,\phi,\tau,\pi) = \begin{cases}\pi & \text{if } y = 0,\\[6pt] \kappa_0\,\dfrac{\Gamma\!\left(y + \frac{1}{\phi\mu^{\tau}}\right)}{\Gamma\!\left(\frac{1}{\phi\mu^{\tau}}\right)\Gamma(y+1)}\left(\dfrac{\mu}{\mu + \frac{1}{\phi\mu^{\tau}}}\right)^{\!y}\left(\dfrac{\frac{1}{\phi\mu^{\tau}}}{\mu + \frac{1}{\phi\mu^{\tau}}}\right)^{\frac{1}{\phi\mu^{\tau}}} & \text{if } y = 1, 2, \ldots\end{cases}$$
where
$$(1-\pi)\,\kappa_0^{-1} = 1 - \left(\dfrac{\frac{1}{\phi\mu^{\tau}}}{\mu + \frac{1}{\phi\mu^{\tau}}}\right)^{\frac{1}{\phi\mu^{\tau}}}.$$
In addition,
(1) $E(Y) = \kappa_0\mu$,
(2) $\text{Var}(Y) = \kappa_0\mu(1 + \phi\mu^{\tau+1}) + \kappa_0(1-\kappa_0)\mu^{2}$,
(3) $F_Y(y;\mu,\phi,\tau,\pi) = \pi\,I_{[0,\infty)}(y) + \kappa_0\,[F_{\tilde{Y}}(y;\mu,\phi,\tau) - f_{\tilde{Y}}(0;\mu,\phi,\tau)]$, and
(4) $E(Y^{r}) = \kappa_0\,E(\tilde{Y}^{r})$ for all $r > 0$, where $\tilde{Y} \sim \text{Negative Binomial}(\mu,\phi,\tau)$.
376 / 430
Zero-Modified Negative Binomial model
$$Y_i \overset{\text{ind}}{\sim} \text{ZMNB}(\mu_i, \phi, \tau, \pi_i),\qquad g(\mu_i) = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip},\qquad h(\pi_i) = \gamma_0 + \gamma_1 z_{i1} + \ldots + \gamma_q z_{iq}$$
Logarithmic link: $g(\mu_i) = \log(\mu_i)$
Identity link: $g(\mu_i) = \mu_i$
Square-root link: $g(\mu_i) = \sqrt{\mu_i}$
Logit link: $h(\pi_i) = \log\{\pi_i/(1-\pi_i)\}$
Probit link: $h(\pi_i) = \Phi^{-1}(\pi_i)$
Complementary log-log link: $h(\pi_i) = \log\{-\log(1-\pi_i)\}$
Cauchy link: $h(\pi_i) = \tan\{\tfrac{\pi}{2}(2\pi_i - 1)\}$
377 / 430
Zero-Inflated and Zero-Modified Negative Binomial
Parameter estimation may be accomplished by using the maximum-likelihood method. Inference is based on the asymptotic properties of the maximum-likelihood estimator.
Note that:
(1) if $Y \sim \text{ZINB}(\mu,\phi,\tau,\pi)$ then $\Pr(Y=0) \to 1$ as $\pi \to 1$;
(2) if $Y \sim \text{ZINB}(\mu,\phi,\tau,\pi)$ then $\Pr(Y=0) \to \left(\dfrac{(\phi\mu^{\tau})^{-1}}{\mu + (\phi\mu^{\tau})^{-1}}\right)^{(\phi\mu^{\tau})^{-1}}$ as $\pi \to 0$;
(3) if $Y \sim \text{ZMNB}(\mu,\phi,\tau,\pi)$ then $\Pr(Y=0) \to 1$ as $\pi \to 1$;
(4) if $Y \sim \text{ZMNB}(\mu,\phi,\tau,\pi)$ then $\Pr(Y=0) \to 0$ as $\pi \to 0$.
Therefore, the $\text{ZINB}(\mu,\phi,\tau,\pi)$ distribution can only handle an excess of zeros, whereas the $\text{ZMNB}(\mu,\phi,\tau,\pi)$ distribution can handle both an excess and a lack of zeros.
378 / 430
Males attached to female horseshoe crabs
> fit1 <- overglm(Sat ~ log(Wt) + Col + Spine + Width,
+ family="ztpoi(log)", data=hcrabs, subset={Sat>0})
> fit2 <- update(fit1, family="ztnb1(log)")
> fit3 <- update(fit1, family="ztnb2(log)")
> fit4 <- update(fit1, family="ztnbf(log)")
>
> AIC(fit1, fit2, fit3, fit4)
df AIC
fit1 8 525.5947
fit2 9 514.6772
fit3 9 512.4328
fit4 10 512.6216
> BIC(fit1, fit2, fit3, fit4)
df BIC
fit1 8 547.2710
fit2 9 539.0630
fit3 9 536.8185
fit4 10 539.7169
>
> fit3a <- update(fit3, formula = Sat ~ log(Wt) + Col)
> anova(fit3a, fit3, test="lr")
379 / 430
Males attached to female horseshoe crabs
Likelihood-ratio test
Model 1 : Sat ~ log(Wt) + Col
Model 2 : Sat ~ log(Wt) + Col + Spine + Width
Chi Df Pr(>Chi)
1 vs 2 5.9052 3 0.1163
>
> summary(fit3a)
Estimate Std.Error z-value Pr(>|z|)
(Intercept) -3.91458 2.29715 -1.70410 0.0883619
log(Wt) 0.75194 0.29754 2.52716 0.0114988
ColDM -0.74218 0.25754 -2.88187 0.0039532
ColLM -0.31739 0.28417 -1.11692 0.2640306
ColM -0.54407 0.22370 -2.43210 0.0150114
> fit5 <- glm(ifelse(Sat==0,1,0) ~ log(Wt) + Col + Spine + Width,
+ family=binomial(logit), data=hcrabs)
> fit5a <- update(fit5, formula=ifelse(Sat==0,1,0) ~ Width)
> anova(fit5a, fit5, test="lr")
Likelihood-ratio test
Model 1 : ifelse(Sat==0,1,0) ~ Width
Model 2 : ifelse(Sat==0,1,0) ~ log(Wt) + Col + Spine + Width
380 / 430
Males attached to female horseshoe crabs
Chi Df Pr(>Chi)
1 vs 2 9.2299 6 0.1611
>
> summary(fit5a)
Estimate Std. Error z value Pr(>|z|)
(Intercept) 12.3508 2.6287 4.698 2.62e-06
Width -0.4972 0.1017 -4.887 1.02e-06
>
> fit6 <- zeroalt(Sat ~ log(Wt) + Col | Width, family="nb2(log)",
+ zero.link="logit", data=hcrabs)
> summary(fit6)
Count model (Negative Binomial type II with log link):
Estimate Std.Error z-value Pr(>|z|)
(Intercept) -3.91458 2.29715 -1.70410 0.0883619
log(Wt) 0.75194 0.29754 2.52716 0.0114988
ColDM -0.74218 0.25754 -2.88187 0.0039532
ColLM -0.31739 0.28417 -1.11692 0.2640299
ColM -0.54407 0.22370 -2.43210 0.0150114
Zero-alteration model (Bernoulli with logit link):
(Intercept) 12.35082 2.62873 4.69840 2.6221e-06
Width -0.49723 0.10174 -4.88746 1.0215e-06
381 / 430
Exercises
Suppose that $Y$ is a discrete random variable whose possible values are all non-negative integers. The cumulative distribution function of $Y$ is denoted by $F_Y(y)$.
(1) Suppose that $\tilde{Y}$ is the Zero-Truncated version of $Y$, that is, $\Pr(\tilde{Y}=0) = 0$ and $\Pr(\tilde{Y}=y) = \Pr(Y=y)/[1-\Pr(Y=0)]$ for $y = 1, 2, \ldots$ Show that:
(A) $E(\tilde{Y}^{r}) = \kappa_0\,E(Y^{r})$ for all $r > 0$ such that $E(Y^{r})$ exists, where $\kappa_0 = 1/[1-\Pr(Y=0)]$. Therefore, if $\text{Var}(Y)$ exists, then $\text{Var}(\tilde{Y}) = \kappa_0\text{Var}(Y) + \kappa_0(1-\kappa_0)E^{2}(Y)$.
(B) $F_{\tilde{Y}}(y) = \Pr(\tilde{Y} \le y) = [\kappa_0 F_Y(y) + (1-\kappa_0)]\,I_{[0,\infty)}(y)$.
(2) Suppose that $\tilde{Y}$ is the Zero-Inflated version of $Y$, that is, $\Pr(\tilde{Y}=0) = \pi + (1-\pi)\Pr(Y=0)$ and $\Pr(\tilde{Y}=y) = (1-\pi)\Pr(Y=y)$ for $y = 1, 2, \ldots$ Show that:
(A) $E(\tilde{Y}^{r}) = \kappa_0\,E(Y^{r})$ for all $r > 0$ such that $E(Y^{r})$ exists, where $\kappa_0 = 1-\pi$. Therefore, if $\text{Var}(Y)$ exists, then $\text{Var}(\tilde{Y}) = \kappa_0\text{Var}(Y) + \kappa_0(1-\kappa_0)E^{2}(Y)$.
(B) $F_{\tilde{Y}}(y) = \Pr(\tilde{Y} \le y) = [\kappa_0 F_Y(y) + (1-\kappa_0)]\,I_{[0,\infty)}(y)$.
382 / 430
Exercises
(3) Suppose that $\tilde{Y}$ is the Zero-Modified version of $Y$, that is, $\Pr(\tilde{Y}=0) = \pi$ and $\Pr(\tilde{Y}=y) = (1-\pi)\Pr(Y=y)/[1-\Pr(Y=0)]$ for $y = 1, 2, \ldots$ Show that:
(A) $E(\tilde{Y}^{r}) = \kappa_0\,E(Y^{r})$ for all $r > 0$ such that $E(Y^{r})$ exists, where $\kappa_0 = (1-\pi)/[1-\Pr(Y=0)]$. Therefore, if $\text{Var}(Y)$ exists, then $\text{Var}(\tilde{Y}) = \kappa_0\text{Var}(Y) + \kappa_0(1-\kappa_0)E^{2}(Y)$.
(B) $F_{\tilde{Y}}(y) = \Pr(\tilde{Y} \le y) = [\kappa_0 F_Y(y) + (1-\kappa_0)]\,I_{[0,\infty)}(y)$.
383 / 430
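These identities are easy to check by simulation before proving them; a minimal sketch for the zero-truncated case of exercise (1)(A), with Y ~ Poisson(1.5) and r = 2:

# Monte Carlo check of E(Ytilde^r) = k0 * E(Y^r) for exercise (1)(A),
# zero-truncated Poisson case; a sketch, not a proof.
set.seed(123)
mu <- 1.5; r <- 2
y  <- rpois(1e6, mu)
yt <- y[y > 0]                 # conditioning on Y > 0 gives the ZT version
k0 <- 1 / (1 - dpois(0, mu))
mean(yt^r)                     # empirical E(Ytilde^r)
k0 * mean(y^r)                 # k0 * E(Y^r): should agree closely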
Cluster Correlated Data
384 / 430
Effect of the ozone on the tree growth
These data, previously analyzed in Diggle et al. [2002] and available in the object spruces of the library glmtoolbox, are used to assess the effect of ozone pollution on tree growth. Ozone pollution is common in urban areas, so the impact of increased ozone concentrations on tree growth is of considerable interest. The response variable is the size of the tree (size), conventionally measured as the product of tree height and stem diameter squared. The experiment examined 79 trees, identified in the dataset by the column tree. A first group of 54 trees was grown in an ozone-enriched atmosphere (treat=“ozone-enriched”), while a second group of 25 trees was grown in a normal atmosphere (treat=“normal”). The size of each tree was measured 13 times over the course of the experiment (days), so the data are balanced and the number of rows in the dataset is 1027. The main objective of the analysis is to compare the growth patterns of trees under the two types of atmosphere: normal and ozone-enriched.
385 / 430
Effect of the ozone on the tree growth
[Figure: boxplots of tree size at each of the 13 measurement occasions (days 152–674 since the beginning of the experiment), by atmosphere (normal vs. ozone-enriched).]
386 / 430
Treatment of severe postnatal depression
This dataset, extracted from Gregoire et al. [1996] and available in the library glmtoolbox as the object depression, is from a placebo-controlled study on the efficacy of oestrogen for the treatment of severe postnatal depression. There were 61 women (subj) with severe depression randomly assigned to one of two treatment groups: one (group=“placebo”) received a placebo patch, and the other (group=“oestrogen”) received an oestrogen patch. The Edinburgh Postnatal Depression Scale (EPDS), on which higher scores indicate greater depression levels, was administered to all women prior to the therapy. Once the treatment began, EPDS scores (dep) were collected monthly for 6 months (visit). The response (depressd) is 1 for severe depression (EPDS ≥ 11) and 0 otherwise.
387 / 430
Treatment of severe postnatal depression
[Figure: percentage of women with depression at each visit (baseline through month 6), by treatment (placebo vs. oestrogen).]
388 / 430
Generalized Least Squares (GLS)
Let $y_i = (y_{i1},\ldots,y_{ij},\ldots,y_{in_i})^{\top}$, for $i = 1,\ldots,n$, be the multivariate responses of interest measured on $n$ clusters, which are assumed to be realizations of independent random vectors denoted here by $Y_i = (Y_{i1},\ldots,Y_{ij},\ldots,Y_{in_i})^{\top}$, for $i = 1,\ldots,n$, where $n_i$ represents the size of the $i$-th cluster or the number of measurements performed on the $i$-th subject. The random variables $Y_{ij}$, for $j = 1,\ldots,n_i$, are assumed to satisfy the following:
$$\mu_i = E(Y_i) = X_i\beta \quad\text{and}\quad \text{Var}(Y_i) = \Sigma_i,$$
where $X_i$ is a (known) $n_i \times (p+1)$ matrix, $\beta = (\beta_0,\beta_1,\ldots,\beta_p)^{\top}$ is the set of regression coefficients, and $\Sigma_i$ is a (known) $n_i \times n_i$ symmetric positive-definite matrix. The estimator of $\beta$ may be written as follows:
$$\hat{\beta} = \underset{\beta\,\in\,\mathbb{R}^{p+1}}{\text{argmin}}\; Q(\beta),$$
where
$$Q(\beta) = \frac{1}{2}(y - X\beta)^{\top}\Sigma^{-1}(y - X\beta) = \frac{1}{2}\sum_{i=1}^{n}(y_i - X_i\beta)^{\top}\Sigma_i^{-1}(y_i - X_i\beta),$$
389 / 430
Generalized Least Squares (GLS)
in which $y = (y_1^{\top},\ldots,y_n^{\top})^{\top}$, $X = (X_1^{\top},\ldots,X_n^{\top})^{\top}$ and $\Sigma = \text{diag}\{\Sigma_1,\ldots,\Sigma_n\}$. The function $Q(\beta)$ is strictly convex. Therefore, $\hat{\beta}$ is the solution to the system of equations given by
$$\sum_{i=1}^{n}\underbrace{X_i^{\top}}_{\partial\mu_i^{\top}/\partial\beta}\;\underbrace{\Sigma_i^{-1}}_{[\text{Var}(Y_i)]^{-1}}\,(y_i - X_i\beta) = 0.$$
How can the GLS method be modified to estimate $\beta$ when the random vectors $Y_i$, for $i = 1,\ldots,n$, are such that
$$g(\mu_i) = X_i\beta \quad\text{and}\quad \text{Var}(Y_i) = \sigma^{2}A_i^{\frac12}R_i(\rho)A_i^{\frac12},$$
where $g(\cdot)$ is a (known) link function applied elementwise, $A_i = \text{diag}\{V(\mu_{i1}),\ldots,V(\mu_{in_i})\}$, $V(\cdot)$ is a (known) variance function, and $R_i(\rho)$, which is assumed to depend on $\rho$, is the correlation matrix of $Y_i$?
390 / 430
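Because Q(β) is quadratic in β, the minimizer has the closed form β̂ = (X⊤Σ⁻¹X)⁻¹X⊤Σ⁻¹y. A minimal numerical sketch with simulated clustered data (all names, dimensions and the AR(1)-type correlation are illustrative):

# GLS on simulated clustered data: a minimal sketch, not glmtoolbox code.
set.seed(1)
n <- 50; ni <- 4                          # 50 clusters of size 4 (illustrative)
X <- cbind(1, rnorm(n * ni))              # design: intercept and one covariate
beta <- c(2, -1)
Ri <- 0.5^abs(outer(1:ni, 1:ni, "-"))     # within-cluster correlation block
Sigma <- kronecker(diag(n), Ri)           # block-diagonal Var(Y)
L <- chol(Sigma)                          # Sigma = L' L
y <- drop(X %*% beta + crossprod(L, rnorm(n * ni)))
SinvX <- solve(Sigma, X)                  # Sigma^{-1} X
beta.gls <- solve(t(X) %*% SinvX, t(SinvX) %*% y)
drop(beta.gls)                            # close to c(2, -1)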
Generalized Estimating Equations (GEEs)
GEEs are a very flexible statistical tool for the analysis of heteroskedastic continuous and discrete cluster-correlated data. An appealing feature of GEEs is that they do not require the full specification of the multivariate distribution of the response vector measured on each cluster. Indeed, in GEEs, only the following needs to be specified:
(1) A variance function describing the mechanism by which heteroskedasticity is assumed to occur (if any), i.e., the way in which the variance is assumed to depend on the mean.
(2) A regression structure, consisting of a link function and a linear predictor, which describes the way in which the mean of the response variable distribution is assumed to depend on some explanatory variables.
(3) A structure for the correlation matrix, aimed at describing the dynamics of the linear association between the different measurements of the response variable performed on the same subject or cluster.
391 / 430
Generalized Estimating Equations
Let $y_i = (y_{i1},\ldots,y_{ij},\ldots,y_{in_i})^{\top}$, for $i = 1,\ldots,n$, be the multivariate responses of interest measured on $n$ clusters, which are assumed to be realizations of independent random vectors denoted here by $Y_i = (Y_{i1},\ldots,Y_{ij},\ldots,Y_{in_i})^{\top}$, for $i = 1,\ldots,n$, where $n_i$ represents the size of the $i$-th cluster or the number of measurements performed on the $i$-th subject. The random variables $Y_{ij}$, for $j = 1,\ldots,n_i$, are assumed to satisfy the following:
$$\text{Var}(Y_{ij}) = \frac{\phi}{\omega_{ij}}\,V(\mu_{ij}) \quad\text{and}\quad \text{Corr}(Y_{ij},Y_{ik}) = r_{jk}(\rho),$$
where $\mu_{ij} = E(Y_{ij})$, $\phi > 0$ is the dispersion parameter, $\omega_{ij} > 0$ are known weights, $V(\mu)$ is the variance function, and $r_{jk}(\rho)$ is Pearson's linear correlation coefficient, which depends on the parameter vector $\rho$. In addition, $\mu_{ij}$ is assumed to depend on $x_{ij} = (1, x_{1ij},\ldots,x_{pij})^{\top}$ in the following way:
$$g(\mu_{ij}) = x_{ij}^{\top}\beta,$$
where $g(\cdot)$ is the link function and $\beta = (\beta_0,\beta_1,\ldots,\beta_p)^{\top}$ is the set of unknown regression coefficients.
392 / 430
Generalized Estimating Equations
According to Liang and Zeger [1986], the estimate of $\beta$, denoted here by $\hat{\beta}$, reduces to the solution to $U(\hat{\beta}) = 0$, where
$$U(\beta) = \sum_{i=1}^{n}\underbrace{X_i^{\top}K_i}_{(\partial\mu_i/\partial\beta^{\top})^{\top}}\;\underbrace{\left[\phi A_i^{\frac12}R_iA_i^{\frac12}\right]^{-1}}_{[\text{Var}(Y_i)]^{-1}}(y_i - \mu_i) = \frac{1}{\phi}\sum_{i=1}^{n}X_i^{\top}W_iK_i^{-1}(y_i - \mu_i),$$
in which
$X_i = (x_{i1},\ldots,x_{in_i})^{\top}$;
$W_i = K_iV_i^{-1}K_i$;
$K_i = \text{diag}\{1/g'(\mu_{i1}),\ldots,1/g'(\mu_{in_i})\}$;
$V_i = A_i^{\frac12}R_iA_i^{\frac12}$;
$A_i = \text{diag}\{V(\mu_{i1})/\omega_{i1},\ldots,V(\mu_{in_i})/\omega_{in_i}\}$;
the $(j,k)$-th entry of $R_i$ is $r_{jk}(\rho)$; and
$\mu_i = (g^{-1}(x_{i1}^{\top}\beta),\ldots,g^{-1}(x_{in_i}^{\top}\beta))^{\top}$.
393 / 430
Estimating algorithm
Step 0: Start the counter at $t = 0$; set the tolerance limit $\epsilon > 0$; set the maximum number of iterations $n_{\max}$; and set the initial value for $\beta$, say $\beta^{[0]}$.
Step 1: Compute $\rho^{[t]}$ from the Pearson residuals evaluated at $\beta^{[t]}$, given by
$$r_{ij}^{[t]} = \frac{y_{ij} - \mu_{ij}^{[t]}}{\sqrt{\phi^{[t]}\,V(\mu_{ij}^{[t]})/\omega_{ij}}}.$$
Step 2: Compute $\beta^{[t+1]} = \beta^{[t]} + \left[I(\beta^{[t]})\right]^{-1}U(\beta^{[t]})$.
Step 3: Compute $\delta^{(t+1)} = \delta(\beta^{[t]}, \beta^{[t+1]})$.
Step 4: Update the counter: $t = t + 1$.
Step 5: Repeat steps 1, 2, 3 and 4 until $\delta^{(t)} < \epsilon$ or $t > n_{\max}$.
Step 6: If $\delta^{(t)} < \epsilon$, then $\hat{\beta}$ is defined to be $\beta^{[t]}$. Otherwise, convergence was not achieved.
Here,
$$I(\beta) = -E\!\left(\frac{\partial U(\beta)}{\partial\beta^{\top}}\right) = \frac{1}{\phi}\sum_{i=1}^{n}X_i^{\top}W_iX_i.$$
A simple choice for $\beta^{[0]}$ is the estimate of $\beta$ obtained from the GEE in which the working correlation matrix is the identity.
394 / 430
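As an illustration of Steps 0–6, the skeleton below implements the update for the special case of an independence working correlation with V(µ) = µ, g = log and ω_ij = 1, in which case ρ plays no role and the iteration reduces to quasi-Poisson Fisher scoring; glmgee() implements the general algorithm.

# Skeleton of the estimating algorithm for the special case
# R_i = identity, V(mu) = mu, g = log, omega = 1 (quasi-Poisson scoring).
# Illustrative only; glmgee() implements the general case.
gee_indep <- function(y, X, eps = 1e-8, nmax = 25) {
  beta <- rep(0, ncol(X))                  # crude starting value (Step 0)
  for (t in seq_len(nmax)) {
    mu <- exp(drop(X %*% beta))            # g^{-1} for the log link
    # Here K = diag(mu) and A = diag(mu), so W K^{-1} = identity and
    # U = X'(y - mu)/phi, I = X' W X / phi; phi cancels in the update.
    U <- crossprod(X, y - mu)
    I <- crossprod(X * mu, X)
    step <- solve(I, U)                    # Step 2
    beta <- beta + drop(step)
    if (sqrt(sum(step^2)) < eps) break     # delta: norm of the update step
  }
  beta
}

A call such as beta.hat <- gee_indep(y, X), with y a count response and X a design matrix, then mirrors Steps 0–6 with δ taken as the Euclidean norm of the update step.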
Asymptotic theory
If the model for the mean ($\mu$) is correctly specified then, under certain regularity conditions, $\hat{\beta}$ is consistent for $\beta$ and its distribution is such that (Liang and Zeger [1986])
$$\sqrt{n}\,(\hat{\beta} - \beta) \xrightarrow[n\to\infty]{D} \text{Normal}(0, \text{Var}(\hat{\beta})),$$
where
$$\text{Var}(\hat{\beta}) = \lim_{n\to\infty}\left(\frac{1}{n}X^{\top}WX\right)^{-1}\left[\frac{1}{n}\sum_{i=1}^{n}X_i^{\top}W_iK_i^{-1}\text{Var}(Y_i)K_i^{-1}W_iX_i\right]\left(\frac{1}{n}X^{\top}WX\right)^{-1}.$$
Therefore, if the model for the mean is correctly specified, then $\hat{\beta}$ remains consistent and asymptotically normal even if the structure of the correlation matrix is not correctly specified. Indeed, if the structure of the correlation matrix is also correctly specified, that is, if $\text{Var}(Y_i) = \phi V_i$ for $i = 1,\ldots,n$, then $\text{Var}(\hat{\beta})$ reduces to
$$\text{Var}(\hat{\beta}) = \phi\lim_{n\to\infty}\left(\frac{1}{n}X^{\top}WX\right)^{-1}.$$
395 / 430
Working-correlation matrix
Independence:
$$\text{Corr}(Y_{ij}, Y_{ik}) = \begin{cases}1 & \text{if } j = k,\\ 0 & \text{if } j \neq k\end{cases}$$
Exchangeable:
$$\text{Corr}(Y_{ij}, Y_{ik}) = \begin{cases}1 & \text{if } j = k,\\ \rho & \text{if } j \neq k,\end{cases}\qquad \hat{\rho} = \frac{1}{M - p - 1}\sum_{i=1}^{n}\sum_{j<k} r_{ij}r_{ik},\quad M = \frac{1}{2}\sum_{i=1}^{n} n_i(n_i - 1).$$
AR-M-dependent(m): For example, if $m = 1$, then the correlations become
$$\text{Corr}(Y_{ij}, Y_{ik}) = \begin{cases}1 & \text{if } j = k,\\ \rho^{|j-k|} & \text{if } j \neq k,\end{cases}\qquad \hat{\rho} = \frac{1}{M - p - 1}\sum_{i=1}^{n}\sum_{j=1}^{n_i-1} r_{ij}r_{i,j+1},\quad M = \sum_{i=1}^{n}(n_i - 1).$$
396 / 430
Working-correlation matrix
Stationary-M-dependent(m):
$$\text{Corr}(Y_{ij}, Y_{i,j+t}) = \begin{cases}1 & \text{if } t = 0,\\ \rho_t & \text{if } t = 1,\ldots,m,\\ 0 & \text{if } t > m,\end{cases}\qquad \hat{\rho}_t = \frac{1}{M_t - p - 1}\sum_{i=1}^{n}\sum_{j=1}^{n_i-t} r_{ij}r_{i,j+t},\quad M_t = \sum_{i=1}^{n}(n_i - t).$$
Non-Stationary-M-dependent(m):
$$\text{Corr}(Y_{ij}, Y_{ik}) = \begin{cases}1 & \text{if } j = k,\\ \rho_{jk} & \text{if } 0 < |j-k| \le m,\\ 0 & \text{if } |j-k| > m,\end{cases}\qquad \hat{\rho}_{jk} = \frac{1}{n - p - 1}\sum_{i=1}^{n} r_{ij}r_{ik}.$$
Unstructured:
$$\text{Corr}(Y_{ij}, Y_{ik}) = \begin{cases}1 & \text{if } j = k,\\ \rho_{jk} & \text{if } j \neq k,\end{cases}\qquad \hat{\rho}_{jk} = \frac{1}{n - p - 1}\sum_{i=1}^{n} r_{ij}r_{ik}.$$
397 / 430
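With balanced clusters, the moment estimators above are easy to compute from the matrix of Pearson residuals; a minimal sketch, using simulated residuals as stand-ins and treating the number of fitted coefficients (p1) as known:

# Moment estimators of rho from Pearson residuals; a sketch assuming
# n balanced clusters of size ni and p1 = p + 1 fitted coefficients.
set.seed(2)
n <- 100; ni <- 5; p1 <- 3
r <- matrix(rnorm(n * ni), n, ni)        # stand-in for Pearson residuals
# Exchangeable: sum over all within-cluster pairs j < k
M.exch <- n * ni * (ni - 1) / 2
rho.exch <- sum(apply(r, 1, function(ri)
  sum(tcrossprod(ri)[lower.tri(diag(ni))]))) / (M.exch - p1)
# AR-1: sum over adjacent pairs only
M.ar1 <- n * (ni - 1)
rho.ar1 <- sum(r[, -ni] * r[, -1]) / (M.ar1 - p1)
c(rho.exch, rho.ar1)                     # both near 0 for independent residuals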
Variance Estimation
model:
$$\widehat{\text{Var}}_M(\hat{\beta}) = [I(\hat{\beta})]^{-1} = \hat{\phi}\left(X^{\top}\widehat{W}X\right)^{-1},$$
where $X = (X_1^{\top},\ldots,X_n^{\top})^{\top}$ and $\widehat{W} = \text{diag}\{\widehat{W}_1,\ldots,\widehat{W}_n\}$.
robust (Liang and Zeger [1986]):
$$\widehat{\text{Var}}_R(\hat{\beta}) = \left(X^{\top}\widehat{W}X\right)^{-1}\left[\sum_{i=1}^{n}X_i^{\top}\widehat{W}_i\widehat{K}_i^{-1}e_ie_i^{\top}\widehat{K}_i^{-1}\widehat{W}_iX_i\right]\left(X^{\top}\widehat{W}X\right)^{-1},$$
where $e_i = y_i - \hat{\mu}_i$. This estimator is robust to misspecification of the working correlation matrix in the sense that it is a consistent estimator of the true variance-covariance matrix of $\hat{\beta}$ as long as the model for the marginal mean is correctly specified.
df-adjusted:
$$\widehat{\text{Var}}_A(\hat{\beta}) = \frac{n}{n - p - 1}\,\widehat{\text{Var}}_R(\hat{\beta})$$
398 / 430
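The robust (sandwich) estimator can be assembled by hand. The sketch below does so for the special case of an independence working correlation, starting from an ordinary glm fit so that the pieces X, W, K and e_i are available in closed form; it is illustrative only, since glmgee reports these estimators directly.

# Cluster-robust sandwich variance for a GLM fitted under an independence
# working correlation; a hand-rolled sketch (glmgee reports this directly).
sandwich_cluster <- function(fit, id) {
  X  <- model.matrix(fit)
  w  <- fit$weights                           # IRLS working weights: diag(W)
  gp <- 1 / fit$family$mu.eta(predict(fit))   # g'(mu) = 1 / (d mu / d eta)
  e  <- residuals(fit, type = "response")     # y - mu
  bread <- solve(crossprod(X * sqrt(w)))      # (X' W X)^{-1}
  U <- X * (w * gp * e)                       # rows: x_ij w_ij g'(mu_ij) e_ij
  meat <- crossprod(rowsum(U, group = id))    # sum over clusters of u_i u_i'
  bread %*% meat %*% bread
}
# Example with the spruces data (independence working structure):
# fit <- glm(size ~ poly(days, 4) + treat, family = Gamma(log), data = spruces)
# sqrt(diag(sandwich_cluster(fit, id = spruces$tree)))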
Criteria to select the Working Correlation Matrix
Correlation Information Criterion (Hin and Wang [2009]):
$$\text{CIC} = \text{trace}\left\{\hat{\phi}^{-1}\left(X^{\top}\widehat{K}\widehat{A}^{-1}\widehat{K}X\right)\widehat{\text{Var}}_R(\hat{\beta})\right\}$$
Akaike-type penalized Gaussian Pseudo-likelihood Criterion (Carey and Wang [2011], Zhu and Zhu [2013], Fu et al. [2018]):
$$\text{AGPC} = \text{Gaussian pseudo-likelihood} + 2\left[p + 1 + \dim(\hat{\rho})\right],$$
where the Gaussian pseudo-likelihood is given by
$$\sum_{i=1}^{n}\left\{n_i\log(2\pi) + \frac{1}{\hat{\phi}}(y_i - \hat{\mu}_i)^{\top}\widehat{V}_i^{-1}(y_i - \hat{\mu}_i) + \log\left(\hat{\phi}\,|\widehat{V}_i|\right)\right\}$$
Schwarz-type penalized Gaussian Pseudo-likelihood Criterion (Carey and Wang [2011], Zhu and Zhu [2013], Fu et al. [2018]):
$$\text{SGPC} = \text{Gaussian pseudo-likelihood} + \log(n)\left[p + 1 + \dim(\hat{\rho})\right]$$
399 / 430
Hypothesis testing
To assess the hypothesis system $H_0: \beta^{*} = 0$ versus $H_1: \beta^{*} \neq 0$, where the elements of $\beta^{*}$ are a subset of those of $\beta$, so that $\beta^{*}$ may be written as $\beta^{*} = L\beta$, in which $L$ is an $r \times (p+1)$ contrast matrix, the following decision rule may be used:
“Reject $H_0$ at the approximate $100(\alpha)\%$ significance level if $\xi > \chi^{2}_{1-\alpha}(r)$”,
where $\alpha \in (0,1)$, $\chi^{2}_{1-\alpha}(r)$ is the $100(1-\alpha)$-th percentile of the $\chi^{2}(r)$ distribution, and $\xi$ is one of the following statistics:
wald:
$$\xi_W = \left(L\hat{\beta}\right)^{\top}\left[L\,\widehat{\text{Var}}_R(\hat{\beta})\,L^{\top}\right]^{-1}\left(L\hat{\beta}\right)$$
generalized score (Rotnitzky and Jewell [1990], Boos [1992]):
$$\xi_S = \left[L\,\widehat{\text{Var}}_M(\hat{\beta})\,U(\beta)\right]^{\top}\left[L\,\widehat{\text{Var}}_R(\hat{\beta})\,L^{\top}\right]^{-1}\left[L\,\widehat{\text{Var}}_M(\hat{\beta})\,U(\beta)\right],$$
evaluated at the parameter estimates obtained under the restriction given by $H_0$, that is, restricted to $\beta^{*} = 0$.
400 / 430
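The Wald statistic is easy to reproduce by hand from a fitted glmgee object, for example for the treatment coefficient of the ozone example's AR-1 fit (fit3, defined later in these notes); it is assumed here that vcov() on a glmgee fit returns the robust covariance estimator, which is worth checking against the package documentation.

# Wald statistic for H0: L beta = 0, testing the 'treat' effect in the
# ozone example's AR-1 fit (fit3, defined later). Assumption: vcov() on a
# glmgee object returns the robust covariance; check the documentation.
b <- coef(fit3)
L <- matrix(0, nrow = 1, ncol = length(b),
            dimnames = list(NULL, names(b)))
L[1, "treatozone-enriched"] <- 1
xi.w <- drop(t(L %*% b) %*% solve(L %*% vcov(fit3) %*% t(L)) %*% (L %*% b))
pchisq(xi.w, df = nrow(L), lower.tail = FALSE)  # compare: anova(..., test="wald")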
Residuals
pearson:
$$r^{P}_{ij} = \frac{y_{ij} - \hat{\mu}_{ij}}{\sqrt{\hat{\phi}\,V(\hat{\mu}_{ij})/\omega_{ij}}},\qquad i = 1,\ldots,n;\; j = 1,\ldots,n_i$$
deviance:
$$r^{D}_{ij} = \text{sign}(y_{ij} - \hat{\mu}_{ij})\sqrt{d(y_{ij}, \hat{\mu}_{ij}, \omega_{ij})/\hat{\phi}},\qquad i = 1,\ldots,n;\; j = 1,\ldots,n_i$$
mahalanobis:
$$r^{M}_{i} = \frac{1}{n_i\,\hat{\phi}}\,(y_i - \hat{\mu}_i)^{\top}\widehat{V}_i^{-1}(y_i - \hat{\mu}_i),\qquad i = 1,\ldots,n$$
401 / 430
Influence or sensitivity
The “leave-one-out” statistics are computed to quantify the effect on the estimates of the parameters in the linear predictor of deleting each subject/cluster. Thus, if the $i$-th cluster is excluded, these statistics may be expressed as
$$\text{Dfbeta}_{(i)} = \hat{\beta} - \hat{\beta}_{(i)}\quad\text{and}\quad \text{CD}_{(i)} = \frac{1}{(p+1)}\,(\hat{\beta} - \hat{\beta}_{(i)})^{\top}\left[\widehat{\text{Var}}(\hat{\beta})\right]^{-1}(\hat{\beta} - \hat{\beta}_{(i)}),$$
respectively, where $\hat{\beta}_{(i)}$ is the estimate of $\beta$ computed from the dataset in which the $i$-th cluster is excluded. To avoid the possible computational burden, the “leave-one-out” estimates of $\beta$ are replaced by their “one-step approximations”, denoted here by $\hat{\beta}^{I}_{(i)}$ for $i = 1,\ldots,n$. The value of $\hat{\beta}^{I}_{(i)}$ corresponds to the first iteration of the estimating algorithm of the GEE when it is performed using: (i) the dataset in which the $i$-th cluster is excluded; and (ii) a starting value which is the solution to the same GEE computed from the dataset including all clusters, that is, $\hat{\beta}$.
402 / 430
Effect of the ozone on the tree growth
The data are analyzed by using the following model:
$$\text{Var}(Y_{ij}) = \phi\mu_{ij}^{2},\qquad \text{Corr}(Y_{ij}, Y_{ik}) = \rho^{|j-k|},$$
and
$$\log(\mu_{ij}) = \beta_0 + \beta_1 x_{1ij} + \beta_2 x_{1ij}^{2} + \beta_3 x_{1ij}^{3} + \beta_4 x_{1ij}^{4} + \beta_5 x_{2ij},$$
where
$y_{ij}$ = $j$-th measurement of size performed on the $i$-th tree;
$x_{1ij}$ = time, in days, from the start of the experiment until the $j$-th measurement on the $i$-th tree was performed;
$x_{2ij}$ = 1 if the $i$-th tree grew under the ozone-enriched atmosphere, 0 if the $i$-th tree grew under the normal atmosphere.
403 / 430
Effect of the ozone on the tree growth
> ggplot(spruces, aes(x=factor(days), y=size, fill=treat)) +
+ geom_boxplot(outlier.shape=16, outlier.size=1) +
+ labs(x="Days since the beginning of the experiment",
+ y="Size", fill="Atmosphere") +
+ scale_fill_manual(values=c("salmon2", "green2")) +
+ theme(legend.position="top", legend.direction="horizontal")
>
> res <- function(x,y)
+ data.frame(type=y,fitted(x),residuals(x,type="pearson"))
> m1 <- glmgee(size ~ poly(days,4)+treat, id=tree, data=spruces,
+ corstr="Independence", family=inverse.gaussian(log))
> r1 <- res(m1,"V(mu)==mu^3")
> m2 <- update(m1,family=poisson(log));r2 <- res(m2,"V(mu)==mu")
> m3 <- update(m1,family=Gamma(log));r3 <- res(m3,"V(mu)==mu^2")
> m4 <- update(m1,family=gaussian(log));r4 <- res(m4,"V(mu)==1")
> rs <- rbind(r1,r2,r3,r4);colnames(rs) <- c("type","mu","res")
>
> ggplot(rs, aes(x=mu, y=res)) +
+ geom_point(shape=16, col="black") +
+ labs(x="Fitted values", y="Pearson's residuals") +
+ facet_wrap(vars(type), labeller=label_parsed)
404 / 430
Effect of the ozone on the tree growth
[Figure: Pearson's residuals versus fitted values under the four working variance functions V(µ) = 1, V(µ) = µ, V(µ) = µ², and V(µ) = µ³.]
405 / 430
Effect of the ozone on the tree growth
> fit1 <- glmgee(size ~ poly(days,4) + treat, family=Gamma(log),
+ id=tree, corstr="Independence", data=spruces)
> fit2 <- update(fit1, corstr="Exchangeable")
> fit3 <- update(fit1, corstr="AR-M-dependent(1)")
> fit4 <- update(fit1, corstr="AR-M-dependent(2)")
> fit5 <- update(fit1, corstr="AR-M-dependent(3)")
> fit6 <- update(fit1, corstr="AR-M-dependent(4)")
>
> a <- AGPC(fit1, fit2, fit3, fit4, fit5, fit6, verbose=FALSE)
> b <- SGPC(fit1, fit2, fit3, fit4, fit5, fit6, verbose=FALSE)
> cbind(a, SGPC=b[,"SGPC"])
Object Correlation GPL Parameters AGPC SGPC
fit1 Independence 13527.34 6 13539.34 13553.55
fit2 Exchangeable 11675.19 7 11689.19 11705.78
fit3 AR-M-dependent(1) 10926.66 7 10940.66 10957.25
fit4 AR-M-dependent(2) 10964.71 8 10980.71 10999.67
fit5 AR-M-dependent(3) 10976.47 9 10994.47 11015.80
fit6 AR-M-dependent(4) 11011.43 10 11031.43 11055.12
406 / 430
Effect of the ozone on the tree growth
> summary(fit3)
Number of observations: 1027
Number of clusters: 79
Cluster size: 13
*************************************************************
Variance function: Gamma
Link function: log
Correlation structure: AR-1
*************************************************************
Estimate Std.Error z-value Pr(>|z|)
(Intercept) 5.90378 0.10486 56.30321 < 2e-16
poly(days, 4)1 19.20015 0.51848 37.03159 < 2e-16
poly(days, 4)2 -2.85755 0.20585 -13.88147 < 2e-16
poly(days, 4)3 5.41639 0.18246 29.68549 < 2e-16
poly(days, 4)4 -3.57407 0.12478 -28.64405 < 2e-16
treatozone-enriched -0.25861 0.12835 -2.01486 0.043919
Dispersion 0.32866
(The AR-1 fit, fit3, is the one selected by AGPC and SGPC above; “Dispersion” is the estimate $\hat{\phi}$.)
407 / 430
Effect of the ozone on the tree growth
> fit3a <- update(fit3, formula=. ~ . + poly(days,4):treat)
>
> anova(fit3a, test="wald")
Model 1 : size ~ 1
Model 2 : size ~ poly(days, 4)
Model 3 : size ~ poly(days, 4) + treat
Model 4 : size ~ poly(days, 4) + treat + poly(days, 4):treat
Chi df Pr(>Chi)
1 vs 2 1931.9813 4 < 2e-16 ***
2 vs 3 4.0597 1 0.04392 *
3 vs 4 3.6641 4 0.45336
>
> anova(fit3a, test="score")
Model 1 : size ~ 1
Model 2 : size ~ poly(days, 4)
Model 3 : size ~ poly(days, 4) + treat
Model 4 : size ~ poly(days, 4) + treat + poly(days, 4):treat
Chi df Pr(>Chi)
1 vs 2 61.3028 4 1.544e-12 ***
2 vs 3 3.3687 1 0.06645 .
3 vs 4 3.4665 4 0.48300
408 / 430
Treatment of severe postnatal depression
As described earlier, this dataset (object depression of the library glmtoolbox; Gregoire et al. [1996]) comes from a placebo-controlled study on the efficacy of oestrogen for the treatment of severe postnatal depression: 61 women (subj) were randomized to a placebo patch or an oestrogen patch (group), their EPDS scores (dep) were collected monthly for 6 months (visit), and the binary response (depressd) is 1 for severe depression (EPDS ≥ 11) and 0 otherwise. The data are analyzed by using the following model:
$$\text{Var}(Y_{ij}) = \mu_{ij}(1-\mu_{ij}),\qquad \text{Corr}(Y_{ij}, Y_{ik}) = \rho^{|j-k|},\qquad \text{logit}(\mu_{ij}) = \beta_0 + \beta_1 x_{1ij} + \beta_2 x_{2ij},$$
where
$y_{ij}$ = $j$-th measurement of the response performed on the $i$-th woman;
$x_{1ij}$ = time, in months, from the start of the therapy until the $j$-th measurement on the $i$-th woman was performed;
$x_{2ij}$ = 1 if the $i$-th woman was treated with the oestrogen patch, 0 if with the placebo patch.
409 / 430
Treatment of severe postnatal depression
[Figure: percentage of women with depression at each visit, by treatment (placebo vs. oestrogen); same plot as shown earlier.]
410 / 430
Treatment of severe postnatal depression
> d2 <- aggregate(depressd ~ visit + group,mean,data=depression)
> ggplot(d2,aes(x=factor(visit),fill=group,weight=100*depressd)) +
+ geom_bar(position=position_dodge(), width=0.7) +
+ labs(x="Months since the therapy began",
+ y="% of women with depression", fill="Treatment") +
+ scale_fill_manual(values=c("yellow", "blue")) +
+ theme(legend.position="top", legend.direction="horizontal")
>
> fit1 <- glmgee(depressd ~ visit + group,family=binomial(logit),
+ id=subj, corstr="Exchangeable", data=depression)
> fit2 <- update(fit1, corstr="AR-M-dependent(1)")
> fit3 <- update(fit1, corstr="AR-M-dependent(2)")
>
> a <- AGPC(fit1, fit2, fit3, verbose=FALSE)
> b <- SGPC(fit1, fit2, fit3, verbose=FALSE)
>
> cbind(a, SGPC=b[,"SGPC"])
Object Correlation GPL Parameters AGPC SGPC
fit1 Exchangeable 239.9647 4 247.9647 256.4082
fit2 AR-M-dependent(1) 226.4696 4 234.4696 242.9131
fit3 AR-M-dependent(2) 224.9438 5 234.9438 245.4982
411 / 430
Treatment of severe postnatal depression
> summary(fit2)
Number of observations: 356
Number of clusters: 61
Min 25% 50% 75% Max
Cluster sizes: 2 4 7 7 7
*************************************************************
Estimate Std.Error z-value Pr(>|z|)
(Intercept) 3.23604 0.51842 6.24218 4.3152e-10
visit -0.62632 0.07477 -8.37681 < 2.22e-16
groupestrogen -1.77723 0.54578 -3.25631 0.0011287
Dispersion 1.02842
*************************************************************
[1] [2] [3] [4] [5] [6] [7]
[1] 1.000 0.513 0.263 0.135 0.069 0.036 0.018
[2] 0.513 1.000 0.513 0.263 0.135 0.069 0.036
[3] 0.263 0.513 1.000 0.513 0.263 0.135 0.069
[4] 0.135 0.263 0.513 1.000 0.513 0.263 0.135
[5] 0.069 0.135 0.263 0.513 1.000 0.513 0.263
[6] 0.036 0.069 0.135 0.263 0.513 1.000 0.513
[7] 0.018 0.036 0.069 0.135 0.263 0.513 1.000
(“Dispersion” is the estimate $\hat{\phi}$, and the printed matrix is the fitted working correlation $R(\hat{\rho})$ under the AR-1 structure; note that $0.513^2 \approx 0.263$.)
412 / 430
Treatment of severe postnatal depression
> fit2a <- update(fit2, formula=. ~ . + visit:group)
>
> anova(fit2a, test="wald")
Model 1 : depressd ~ 1
Model 2 : depressd ~ visit
Model 3 : depressd ~ visit + group
Model 4 : depressd ~ visit + group + visit:group
Chi df Pr(>Chi)
1 vs 2 88.1275 1 < 2.2e-16 ***
2 vs 3 10.6036 1 0.001129 **
3 vs 4 2.2104 1 0.137082
>
> anova(fit2a, test="score")
Model 1 : depressd ~ 1
Model 2 : depressd ~ visit
Model 3 : depressd ~ visit + group
Model 4 : depressd ~ visit + group + visit:group
Chi df Pr(>Chi)
1 vs 2 39.9226 1 2.642e-10 ***
2 vs 3 10.9208 1 0.0009509 ***
3 vs 4 2.3977 1 0.1215150
413 / 430
Appendix
414 / 430
A. Linear algebra
(1) If $A$ is an idempotent matrix, that is, if $A^{2} = A$, then $\text{trace}(A) = \text{rank}(A)$, where $\text{trace}(A)$ and $\text{rank}(A)$ correspond to the trace and rank of $A$, respectively (see, for instance, Gentle [2007, page 281]).
(2) If $A$ is an $m \times n$ matrix and $B$ is an $n \times m$ matrix, then $\text{trace}(AB) = \text{trace}(BA)$ (see, for example, Gentle [2007, page 67]).
(3) If $A$ is a non-singular matrix, and $u$ and $v$ are column vectors, then
$$(A - uv^{\top})^{-1} = A^{-1} + \frac{(A^{-1}u)(v^{\top}A^{-1})}{1 - v^{\top}A^{-1}u}$$
(see, for instance, Gentle [2007, page 220]).
If $A$ is a symmetric and positive definite matrix, then
(4) the square root of $A$, denoted here by $A^{\frac12}$ and such that $A^{\frac12}A^{\frac12} = A$, exists, is unique, is symmetric, and is also positive definite (see, for instance, Gentle [2007, section 3.8.8]);
(5) $A^{-1}$ exists, is symmetric, and is also positive definite (see, for instance, Gentle [2007, section 3.8.8]);
(6) $B^{\top}AB$ is also a positive definite matrix if and only if $B$ has full column rank (that is, if the columns of $B$ are linearly independent) (see, for instance, Gentle [2007, page 89]).
415 / 430
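Identity (3), a Sherman–Morrison-type formula, can be verified numerically; a quick sketch:

# Numerical check of identity (3): (A - u v')^{-1} via Sherman-Morrison.
set.seed(3)
q <- 4
A <- crossprod(matrix(rnorm(q * q), q))  # a non-singular (here s.p.d.) matrix
u <- rnorm(q); v <- rnorm(q)
Ainv <- solve(A)
lhs <- solve(A - tcrossprod(u, v))
rhs <- Ainv + (Ainv %*% u) %*% (t(v) %*% Ainv) / drop(1 - t(v) %*% Ainv %*% u)
max(abs(lhs - rhs))                      # ~ machine precision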
B. Derivatives
If $x$ and $b$ are $q \times 1$ matrices and $A$ is a $q \times q$ matrix, then (see, for instance, Gentle [2007, section 4.2])
(1) $\dfrac{\partial h(x)}{\partial x} = b$ provided that $h(x) = b^{\top}x$.
(2) $\dfrac{\partial h(x)}{\partial x} = (A + A^{\top})x$ provided that $h(x) = x^{\top}Ax$.
(3) $\dfrac{\partial h(x)}{\partial x^{\top}} = A$ provided that $h(x) = Ax$.
(4) $\dfrac{\partial h(x)}{\partial x} = -h(x)\,Ax$ provided that $h(x) = \exp\left(-\tfrac12 x^{\top}Ax\right)$ and $A$ is a symmetric matrix.
(5) $\dfrac{\partial h(A)}{\partial A} = I_q$ provided that $h(A) = \text{trace}(A)$.
(6) $\dfrac{\partial h(A)}{\partial A} = |A|\,A^{-\top}$ provided that $h(A) = |A|$ and $A$ is a non-singular matrix.
(7) $\dfrac{\partial h(A)}{\partial A} = A^{-\top}$ provided that $h(A) = \log|A|$ and $|A| > 0$.
416 / 430
C. Taylor series
See, for example, Marsden and Tromba [2012, section 3.2].
(1) If $h: D \subseteq \mathbb{R} \to \mathbb{R}$ is a function with $(r+1)$ continuous derivatives at $x_0 \in D$, then
$$h(x) = \sum_{j=0}^{r}\frac{(x - x_0)^{j}}{j!}\,h^{(j)}(x_0) + R_r(x, x_0),$$
where $h^{(j)}(x)$ is the $j$-th derivative of $h(x)$ and $R_r(x, x_0)$ is the remainder, given by
$$R_r(x, x_0) = \int_{x_0}^{x}\frac{(x - t)^{r}}{r!}\,h^{(r+1)}(t)\,dt.$$
The remainder $R_r(x, x_0)$ is such that
$$\frac{R_r(x, x_0)}{(x - x_0)^{r}} \to 0\quad\text{as } x \to x_0.$$
417 / 430
C. Taylor series
(2) If $h: D \subseteq \mathbb{R}^{q} \to \mathbb{R}$ is a function with continuous second-order partial derivatives at $x_0 \in D$, then
$$h(x) = h(x_0) + (x - x_0)^{\top}\left[\frac{\partial h(x)}{\partial x}\bigg|_{x = x_0}\right] + R_1(x, x_0),$$
where $R_1(x, x_0)/\|x - x_0\| \to 0$ as $x \to x_0$.
(3) If $h: D \subseteq \mathbb{R}^{q} \to \mathbb{R}$ is a function with continuous third-order partial derivatives at $x_0 \in D$, then
$$h(x) = h(x_0) + (x - x_0)^{\top}\left[\frac{\partial h(x)}{\partial x}\bigg|_{x = x_0}\right] + \frac12(x - x_0)^{\top}\left[\frac{\partial^{2} h(x)}{\partial x\,\partial x^{\top}}\bigg|_{x = x_0}\right](x - x_0) + R_2(x, x_0),$$
where $R_2(x, x_0)/\|x - x_0\|^{2} \to 0$ as $x \to x_0$.
418 / 430
D. Probability
If $Z \sim \text{Normal}(0,1)$, $V \sim \chi^{2}(m)$ and $W \sim \chi^{2}(r)$, then (see, for instance, Searle [1971, chapter 2])
(1) $Z^{2} \sim \chi^{2}(1)$.
(2) $\dfrac{Z}{\sqrt{V/m}} \sim \text{Student-}t(m)$ provided that $Z$ and $V$ are independent, where $\text{Student-}t(m)$ represents a random variable following the Student-$t$ distribution with $m$ degrees of freedom.
(3) $\dfrac{V/m}{W/r} \sim F(m, r)$ provided that $V$ and $W$ are independent, where $F(m, r)$ represents a random variable following the $F$-Snedecor distribution with $m$ and $r$ degrees of freedom.
(4) $\dfrac{W/r}{V/m} \sim F(r, m)$ provided that $V$ and $W$ are independent.
(5) $[\text{Student-}t(m)]^{2} \sim F(1, m)$.
419 / 430
E. Multivariate normal distribution
If $Z \sim \text{Normal}_m(\mu, \Sigma)$, where $\Sigma$ is a positive definite matrix, then (see, for instance, Searle [1971, chapter 2])
(1) The probability density function of $Z$ may be written as
$$f_Z(z; \mu, \Sigma) = (2\pi)^{-\frac{m}{2}}\,|\Sigma|^{-\frac12}\exp\left\{-\frac12(z - \mu)^{\top}\Sigma^{-1}(z - \mu)\right\},\quad z \in \mathbb{R}^{m}.$$
(2) The moment generating function of $Z$ is
$$M(t; \mu, \Sigma) = \exp\left(t^{\top}\mu + \frac12\,t^{\top}\Sigma\,t\right),\quad t = (t_1,\ldots,t_m)^{\top}.$$
(3) $a + AZ \sim \text{Normal}_q(a + A\mu, A\Sigma A^{\top})$, where $a$ is a $q \times 1$ matrix and $A$ is a $q \times m$ matrix of full row rank.
(4) $Z^{\top}AZ \sim \chi^{2}(\text{rank}(A))$ provided that $A$ is symmetric, $A\Sigma$ is idempotent, and $\mu^{\top}A\mu = 0$.
(5) $Z^{\top}AZ$ and $BZ$ are statistically independent provided that $B\Sigma A = 0$, where $A$ is a symmetric matrix.
420 / 430
F. Nonlinear optimization
See, for example, Sun and Yuan [2006, section 1.4].
(1) If $h: D \subseteq \mathbb{R}^{q} \to \mathbb{R}$ is a twice continuously differentiable function, $D$ is an open set,
$$\frac{\partial h(x)}{\partial x}\bigg|_{x = x^{*}} = 0,$$
and $\dfrac{\partial^{2} h(x)}{\partial x\,\partial x^{\top}}\Big|_{x = x^{*}}$ is a positive (negative) definite Hessian matrix, then $x^{*}$ is a local minimum (maximum) of the function $h(x)$ in $D$.
(2) If $h: D \subseteq \mathbb{R}^{q} \to \mathbb{R}$ is a twice differentiable function, $D$ is an open set, and $\dfrac{\partial^{2} h(x)}{\partial x\,\partial x^{\top}}$ is a positive (negative) definite Hessian matrix for all $x \in D$, then $h(x)$ is a strictly convex (concave) function in $D$.
(3) If $h: D \subseteq \mathbb{R}^{q} \to \mathbb{R}$ is a twice differentiable function, $D$ is an open set,
$$\frac{\partial h(x)}{\partial x}\bigg|_{x = x^{*}} = 0,$$
and $h(x)$ is a strictly convex (concave) function, then $x^{*}$ is the unique global minimum (maximum) of $h(x)$ in $D$.
421 / 430
References I
R.D. Cook and S. Weisberg. Residuals and Influence in Regression.
Chapman & Hall/CRC Monographs on Statistics & Applied
Probability. Taylor & Francis, 1982.
D.M. Allen. Mean square error of prediction as a criterion for
selecting variables. Technometrics, 13(3):469–475, 1971.
I.M. Alcantara, J. Naranjo, and Y. Lang. Model selection using the PRESS statistic. Computational Statistics, 38(1):285–298, 2023.
H. Akaike. A new look at the statistical model identification.
IEEE Transactions on Automatic Control, 19(6):716–723, 1974.
Gideon Schwarz. Estimating the dimension of a model. The
Annals of Statistics, 6(2):461–464, 1978.
G. James, D. Witten, T. Hastie, and R. Tibshirani. An Introduction to Statistical Learning: with Applications in R. Springer Texts in Statistics. Springer, 2013.
422 / 430
References II
A.C. Atkinson. Two graphical displays for outlying and influen-
tial observations in regression. Biometrika, 68(1):13–20, 1981.
R.D. Cook. Detection of influential observation in linear regres-
sion. Technometrics, 19(1):15–18, 1977.
T.S. Breusch and A.R. Pagan. A simple test for heteroscedasticity and random coefficient variation. Econometrica, 47(5):1287–1294, 1979.
R.D. Cook and S. Weisberg. Diagnostics for heteroscedasticity in
regression. Biometrika, 70(1):1–10, 1983.
G.E.P. Box and P.W. Tidwell. Transformation of the independent
variables. Technometrics, 4(4):531–550, 1962.
H. Zhang and D. Zelterman. Binary regression for risks in ex-
cess of subject-specific thresholds. Biometrics, 55(4):1247–1251,
1999.
423 / 430
References III
P.S. Hewlett and R.L. Plackett. Statistical aspects of the independent joint action of poisons, particularly insecticides. II. Examination of data for agreement with hypothesis. Annals of Applied Biology, 37:527–552, 1950.
D.W. Hosmer, S. Lemeshow, and R.X. Sturdivant. Applied Logistic
Regression. Wiley Series in Probability and Statistics. Wiley,
2013.
D.G. Kleinbaum, L.L. Kupper, A. Nizam, and E.S. Rosenberg.
Applied Regression Analysis and Other Multivariable Methods.
Cengage Learning, 2013.
J.I. McCool. Confidence limits for Weibull regression with censored data. IEEE Transactions on Reliability, 29(2):145–150, 1980.
B. Jorgensen. Exponential dispersion models. Journal of the Royal Statistical Society. Series B (Methodological), 49(2):127–162, 1987.
424 / 430
References IV
P. McCullagh and J.A. Nelder. Generalized Linear Models, Second
Edition. Chapman and Hall/CRC Monographs on Statistics
and Applied Probability Series. Chapman & Hall, 1989.
A. Agresti. Foundations of Linear and Generalized Linear Models.
Wiley Series in Probability and Statistics. Wiley, 2015.
L. Fahrmeir and H. Kaufmann. Consistency and asymptotic
normality of the maximum likelihood estimator in generalized
linear models. The Annals of Statistics, 13(1):342–368, 1985.
A. Buse. The likelihood ratio, Wald, and Lagrange multiplier tests: An expository note. The American Statistician, 36(3a):153–157, 1982.
G. Terrell. The gradient statistic. Computing Science and Statistics, 34:206–215, 2002.
J.D. Gibbons and S. Chakraborti. Nonparametric Statistical Infer-
ence (5th ed.). Chapman and Hall/CRC, 2010.
425 / 430
References V
F. Habibzadeh, P. Habibzadeh, and M. Yadollahie. On determining the most appropriate test cut-off value: the case of tests with continuous results. Biochemia Medica, 26(3):297–307, 2016.
F.S. Nahm. Receiver operating characteristic curve: overview
and practical use for clinicians. Korean Journal of Anesthesiology,
75(1):25–36, 2022.
D.A. Pierce and D.W. Schafer. Residuals in generalized linear
models. Journal of the American Statistical Association, 81(396):
977–986, 1986.
A.C. Davison and A. Gigli. Deviance residuals and normal scores
plots. Biometrika, 76(2):211–221, 1989.
P.K. Dunn and G.K. Smyth. Randomized quantile residuals.
Journal of Computational and Graphical Statistics, 5(3):236–244,
1996.
426 / 430
References VI
D. Pregibon. Logistic regression diagnostics. The Annals of Statis-
tics, 9(4):705–724, 1981.
B.-C. Wei, J.-Q. Shi, W.-K. Fung, and Y.-Q. Hu. Testing for varying
dispersion in exponential family nonlinear models. Annals of
the Institute of Statistical Mathematics, 50:277–294, 1998.
N.H. Prater. Estimate gasoline yields from crudes. Petroleum Refiner, 35(5):236–238, 1956.
S. Weisberg. Applied Linear Regression. Wiley Series in Probability and Statistics. Wiley, 2005.
D.J. Hand, F. Daly, K. McConway, D. Lunn, and E. Ostrowski. A
Handbook of Small Data Sets. Chapman & Hall Statistics Texts.
Taylor & Francis, 1993.
J.S. Long. The origins of sex differences in science. Social Forces, 68(4):1297–1316, 1990.
427 / 430
References VII
W.W. Piegorsch, C.R. Weinberg, and B.H. Margolin. Exploring
simple independent action in multifactor tables of propor-
tions. Biometrics, 44(2):595–603, 1988.
J.G. Morel and N.K. Neerchal. Clustered binary logistic regression in teratology data using a finite mixture distribution. Statistics in Medicine, 16(24):2843–2853, 1997.
P.J. Diggle, P. Heagerty, K.-Y. Liang, and S.L. Zeger. Analysis of Longitudinal Data. Oxford University Press, 2002.
A.J.P. Gregoire, R. Kumar, B. Everitt, and J.W.W. Studd. Trans-
dermal oestrogen for treatment of severe postnatal depression.
The Lancet, 347:930–933, 1996.
K.Y. Liang and S.L. Zeger. Longitudinal data analysis using
generalized linear models. Biometrika, 73:13–22, 1986.
428 / 430
References VIII
L.-Y. Hin and Y.-G. Wang. Working-correlation-structure identification in generalized estimating equations. Statistics in Medicine, 28(4):642–658, 2009.
V.J. Carey and Y.-G. Wang. Working covariance model selection
for generalized estimating equations. Statistics in Medicine, 30
(26):3117–3124, 2011.
X. Zhu and Z. Zhu. Comparison of criteria to select working correlation matrix in generalized estimating equations. Chinese Journal of Applied Probability and Statistics, 29(5):515–530, 2013.
L. Fu, Y. Hao, and Y.-G. Wang. Working correlation structure selection in generalized estimating equations. Computational Statistics, 33:983–996, 2018.
A. Rotnitzky and N.P. Jewell. Hypothesis testing of regression parameters in semiparametric generalized linear models for cluster correlated data. Biometrika, 77(3):485–497, 1990.
429 / 430
References IX
D.D. Boos. On generalized score tests. The American Statistician, 46(4):327–333, 1992.
J.E. Gentle. Matrix Algebra: Theory, Computations, and Applications
in Statistics. Springer Texts in Statistics. Springer New York,
2007.
J.E. Marsden and A.J. Tromba. Cálculo vectorial, Sixth Edition. W.H. Freeman and Company, 2012.
S.R. Searle. Linear Models. Wiley Classics Library. John Wiley & Sons, 1971.
W. Sun and Y.-X. Yuan. Optimization Theory and Methods. Nonlin-
ear Programming. Springer Optimization and Its Applications.
Springer, 2006.
430 / 430