\SweaveOpts{engine=R,eps=FALSE,pdf=TRUE,width=8,strip.white=all}
\SweaveOpts{keep.source=TRUE}
\SweaveOpts{prefix=TRUE,prefix.string=figs/GLMMB,include=TRUE}
\setkeys{Gin}{width=\textwidth}
<>=
options(width=74, show.signif.stars = FALSE,
lattice.theme = function() canonical.theme("pdf", color = FALSE),
str = strOptions(strict.width = "cut"))
library(splines)
library(lattice)
library(Matrix)
library(Rcpp)
library(minqa)
library(lme4)
.f5 = "%.5f"
.f1 = "%.1f"
data(Contraception, package = "mlmRev")
if (file.exists("fm10.rda")) {load("fm10.rda")
} else {
fm10 <- glmer(use ~ 1+age+I(age^2)+urban+livch+(1|district),
Contraception, binomial)
save(fm10, file="fm10.rda")
}
@
\chapter{Generalized Linear Mixed Models for Binary Responses}
\label{chap:GLMMbinomial}
In this chapter we consider mixed-effects models for data sets in
which the response is \emph{binary},\index{binary
response}\index{response!binary} representing
yes/no or true/false or correct/incorrect responses.
Because the response can only take on one of two values we adapt our
models to predict the probability of the positive response. We retain
the concept of the linear predictor, $\vec X\vec\beta+\vec Z\vec b$,
depending on the fixed-effects parameters, $\vec\beta$, and the random
effects, $\vec b$, with the corresponding model matrices, $\vec X$ and
$\vec Z$, determining the conditional mean, $\vec\mu$, of the
response, given the random effects, but the relationship is more
general than that for the linear mixed model. The linear predictor,
which we shall write as $\vec\eta$, determines $\vec\mu$ according
to a \emph{link function}, $g$. For historical reasons it is the
function taking an element of $\vec\mu$ to the corresponding element
of $\vec\eta$ that is called the link. The transformation in the
opposite direction, from $\vec\eta$ to $\vec\mu$, is called the
\emph{inverse link}.
As described in earlier chapters, models based on a Gaussian
distribution for the response vector with its mean determined by the
linear predictor are called linear models. Models incorporating
coefficients in a linear predictor but allowing for more general forms
of the distribution of the response are called \emph{generalized
linear models}\index{generalized linear model}\index{glm}. When the
linear predictor incorporates random effects in addition to the
fixed-effects parameters we call them \emph{generalized linear mixed
models} (GLMMs)\index{GLMM} and fit such models with the
\code{glmer}\index{glmer} function. As in previous chapters, we will
begin with an example to help illustrate these ideas.
\section{Artificial contraception use in regions of Bangladesh}
\label{sec:contraception}
One of the test data sets from the Center for Multilevel
Modelling,\index{Center for Multilevel Modelling} University of
Bristol is derived from the 1989 Bangladesh Fertility Survey,
\citep{huq:cleland:1990}. The data are a subsample of responses from
1934 women grouped in 60 districts and are available as the
\code{Contraception}\index{Contraception data}\index{data
set!Contraception} data set in the \code{mlmRev} package.
<>=
str(Contraception)
@
The response of interest is \code{use} --- whether the woman chooses
to use artificial contraception. The covariates include the district
in which the woman resides, the number of live children she currently
has, her age and whether she is in a rural or an urban setting.
Note that the \code{age} variable is centered about a particular age
so some values are negative. Regretably, the information on what the
centering age was does not seem to be available.
\subsection{Plotting the binary response}
\label{sec:plottingbinary}
Producing informative graphical displays of a binary response as it
relates to covariates is somewhat more challenging that the
corresponding plots for responses on a continuous scale. If we were
to plot the 1934 responses as 0/1 values versus, for example, the
woman's centered age, we would end up with a rather uninformative plot
because all the points would fall on one of two horizontal lines.
One approach to illustrating the structure of the data more
effectively is to add \emph{scatterplot smoother}\index{scatterplot
smoother} lines (Fig.~\ref{fig:Contra1})
\begin{figure}[tbp]
\centering
<>=
print(xyplot(ifelse(use == "Y", 1, 0) ~ age|urban, Contraception,
groups = livch, type = c("g", "smooth"),
auto.key = list(space = "top", points = FALSE,
lines = TRUE, columns = 4),
ylab = "Proportion", xlab = "Centered age"))
@
\caption[Contraception use versus centered age]{Contraception use
versus centered age for women in the Bangladesh Fertility Survey
1989. Panels are determined by whether the woman is in an urban
setting or not. Lines within the panels are scatterplot smoother
lines for women with 0, 1, 2 and 3 or more live children.}
\label{fig:Contra1}
\end{figure}
to show the trend in the response with respect to the covariate. Once
we have the smoother lines in such a plot we can omit the data points
themselves, as we did here, because they add very little information.
The first thing to notice about the plot is that the proportion of
women using contraception is not linear in age, which, on reflection,
makes sense. A woman in the middle of this age range (probably
corresponding to an age around 25) is more likely to use artificial
contraception than is a girl in her early teens or a woman in her
forties. We also see that women in an urban setting are more likely
to use contraception than those in a rural setting and that women with
no live children are less likely than women who have live children.
There do not seem to be strong differences between women who have 1, 2
or 3 or more children compared to the differences between women with
children and those without children.
Interestingly, the quadratic pattern with respect to age does not seem
to have been noticed. Comparisons of model fits through different
software systems, as provided by the Center for Multilevel Modelling,
incorporate only a linear term in age, even though the pattern is
clearly nonlinear. The lesson here is similar to what we have seen in
other examples; careful plotting of the data should, whenever
possible, precede attempts to fit models to the data.
\subsection{Initial GLMM fit to the contraception data}
\label{sec:ContraGLMM}
As for the \code{lmer} function, the first two arguments to the
\code{glmer} function for fitting generalized linear mixed models are
the model formula and the name of the data frame. The third argument
to \code{glmer}, named \code{family}, describes the type of
conditional distribution of the response given the random effects.
Actually, as the name \code{family} implies, it contains more
information than just the distribution type in that each distribution
and link are described by several functions. Certain distributions,
including the binomial, have canonical link functions (described in
Sec.~\ref{sec:canonicallink}) associated with them and if we specify just
the distribution type we get the family with the canonical link. Thus
our initial fit is generated as
<>=
fm10 <- glmer(use ~ 1+age+I(age^2)+urban+livch+(1|district),
Contraception, binomial)
@
\index{fitted models!fm10}
When displaying a fitted model like this that has several
fixed-effects coefficients it is helpful to specify the optional
argument \code{corr=FALSE} to suppress printing of the rather large
correlation matrix of the fixed effects estimators.
<>=
print(fm10, corr=FALSE)
@
The interpretation of the coefficients in this model is somewhat
different from the linear mixed models coefficients that we examined
previously but many of the model-building steps are similar. A rough
assessment of the utility of a particular term in the fixed-effects
part of the model can be obtained from examining the estimates of the
coefficients associated with it and their standard errors. To test
whether a particular term is useful we omit it from the model, refit
and compare the reduced model fit to the original according to the change
in deviance.
We will examine the terms in the model first and discuss the
interpretation of the coefficients in Sec.~\ref{sec:GLMMlink}.
Recall from Chap.~\ref{chap:Covariates} that the default set of
contrasts for a factor such as \code{livch} is offsets relative to the
reference level, in this case women who do not have any live children.
Although the coefficients labeled \code{livch1}, \code{livch2} and
\code{livch3+} are all large relative to their standard errors, they
are quite close to each other. This confirms our earlier impression
that the main distinction is between women with children and those
without and, for those who do have children, the number of children is
not an important distinction.
After incorporating a new variable \code{ch} --- an indicator of whether the
woman has any children --- in the data
<>=
Contraception <- within(Contraception,
ch <- factor(livch != 0, labels = c("N", "Y")))
@
we fit a reduced model, \code{fm11}, with summary
<>=
print(fm11 <- glmer(use ~ age + I(age^2) + urban + ch + (1|district),
Contraception, binomial), corr = FALSE)
@
\index{fitted models!fm11}
Comparing this model to the previous model
<>=
anova(fm11,fm10)
@
indicates that the reduced model is adequate.
A plot of the smoothed observed proportions versus centered age
according to \code{ch} and \code{urban} (Fig.~\ref{fig:Contra2})
\begin{figure}[tbp]
\centering
<>=
print(xyplot(ifelse(use == "Y", 1, 0) ~ age|urban, Contraception,
groups = ch, type = c("g", "smooth"),
auto.key = list(space = "top", points = FALSE,
lines = TRUE, columns = 2),
ylab = "Proportion", xlab = "Centered age"))
@
\caption[Contraception use versus centered age (children/no
children)]{Contraception use versus centered age for women in the
Bangladesh Fertility Survey 1989. Panels are determined by whether
the woman is in an urban setting or not. Lines within the panels
are scatterplot smoother lines for women without children and women
with one or more live children.}
\label{fig:Contra2}
\end{figure}
indicates that all four groups have a quadratic trend with respect to
age but the location of the peak proportion is shifted for those
without children relative to those with children. Incorporating an
interaction of \code{age} and \code{ch} allows for such a shift.
<>=
print(fm12 <- glmer(use ~ age*ch + I(age^2) + urban + (1|district),
Contraception, binomial), corr = FALSE)
@
\index{fitted models!fm12}
Comparing this fitted model to the previous one
<>=
anova(fm11, fm12)
@
confirms the usefulness of this term.
Continuing with the model-building we turn our attention to the random
effects specification to see whether urban/rural differences vary
significantly between districts and whether the distinction between
childless women and women with children varies between districts. We
fit a succession of models, described in the exercises for this chapter,
before settling on model \code{fm13},
<>=
(fm13 <- glmer(use ~ age*ch + I(age^2) + urban + (1|urban:district),
Contraception, binomial))
<>
anova(fm12,fm13)
@
\index{fitted models!fm13}
Notice that although there are 60 distinct districts there are only
102 distinct combinations of \code{urban:district} represented in
the data. In 15 of the 60 districts there are no rural women in the
sample and in 3 districts there are no urban women in the sample, as
shown in
<>=
xtabs(~ urban + district, Contraception)
@
\section{Link functions and interpreting coefficients}
\label{sec:GLMMlink}
To this point the only difference we have encountered between
\code{glmer} and \code{lmer} as model-fitting functions is the need to
specify the distribution family in a call to \code{glmer}. The
formula specification is identical and the assessment of the
significance of terms using likelihood ratio tests is similar. This
is intentional. We have emphasized the use of likelihood ratio tests
on terms, whether fixed-effects or random-effects terms, exactly so
the approach will be general.
However, the interpretation of the coefficient estimates in the
different types of models is different. In a linear mixed model the
linear predictor is the conditional mean (or ``expected value'') of
the response given the random effects. That is, if we assume that we
know the values of the fixed-effects parameters and the random effects
then the expected response for a particular combination of covariate
values is the linear predictor. Individual coefficients can be
interpreted as slopes of the fitted response with respect to a numeric
covariate or as shifts between levels of a categorical covariate.
To interpret the estimates of coefficients in a GLMM we must define
and examine the link function that we mentioned earlier.
\subsection{The logit link function for binary responses}
\label{sec:logitlink}
The probability model for a binary response is the Bernoulli
distribution, which is about the simplest probability distribution we
can concoct. There are only two possible values: 0 and 1. If the
probability of the response 1 is $p$ then the probability of 0 must be
$1-p$. It is easy to establish that the expected value is also $p$.
For consistency across distribution families we write this expected
response as $\mu$ instead of $p$. We should, however, keep
in mind that, for this distribution, $\mu$ corresponds to a probability
and hence must satisfy $0\le\mu\le 1$.
In general we don't want to have restrictions on the values of the
linear predictor so we equate the linear predictor to a function of
$\mu$ that has an unrestricted range. In the case of the Bernoulli
distribution with the canonical link function we equate the linear
predictor to the \emph{log odds} or \emph{logit} of the positive
response. That is
\begin{equation}
\label{eq:logodds}
\eta = \logit(\mu) = \log\left(\frac{\mu}{1-\mu}\right) .
\end{equation}
To understand why this is called the ``log odds'' recall that $\mu$
corresponds to a probability in $[0,1]$. The corresponding
odds ratio, $\frac{\mu}{1-\mu}$, is in $[0,\infty)$ and
the logarithm of the odds ratio, $\logit(\mu)$, is in
$(-\infty, \infty)$.
The inverse of the logit link function,
\begin{equation}
\label{eq:logitinv}
\mu = \frac{1}{1+\exp(-\eta)}
\end{equation}
shown in
\begin{figure}[tbp]
\centering
<>=
eta <- seq(-7, 7, len=701)
print(xyplot(plogis(eta) ~ eta, type = c("g","l"),
xlab = expression(eta),
ylab = expression(mu == frac(1,1+exp(-eta)))))
@
\caption[Inverse of the logit link function]{Inverse of the logit link
function. The linear predictor value, $\eta$, which is on an
unrestricted scale, is mapped to $\mu$ on the probability scale, $[0,1]$.}
\label{fig:logitinv}
\end{figure}
Fig.~\ref{fig:logitinv}, takes a value on the unrestricted range,
$(-\infty,\infty)$, and maps it to the probability range, $[0,1]$. It
happens this function is also the cumulative distribution function for
the standard logistic distribution, available in \R{} as the function
\code{plogis}. In some presentations the relationship between the
logit link and the logistic distribution is emphasized but that often
leads to questions of why we should focus on the logistic
distribution. Also, it is not clear how this approach would
generalize to other distributions such as the Poisson or the Gamma
distributions.
\subsection{Canonical link functions}
\label{sec:canonicallink}
A way of deriving the logit link that does generalize to a class of
common distributions in what is called the \emph{exponential family}
is to consider the logarithm of the probability function (for discrete
distributions) or the probability density function (for continuous
distributions). The probability function for the Bernoulli
distribution is $\mu$ for $y=1$ and $1-\mu$ for $y=0$. If we write this
in a somewhat peculiar way as $\mu^y\cdot(1-\mu)^{1-y}$ for $y\in\{0,1\}$
then the logarithm of the probability function becomes
\begin{equation}
\label{eq:BernoulliProb}
\log\left(\mu^y\cdot(1-\mu)^{1-y}\right) = \log(1-\mu) +
y\,\log\left(\frac{\mu}{1-\mu}\right) .
\end{equation}
Notice that the logit link function is the multiple of $y$ in the last term.
For members of the exponential family the logarithm of the probability
or probability density function can be expressed as a sum of up
to three terms: one that involves $y$ only, one that involves the
parameters only and the product of $y$ and a function of the
parameters. This function is the canonical link.
In the case of the Poisson distribution the probability function is
$\frac{e^{-\mu}\mu^y}{y!}$ for $y\in\{0,1,2,\dots\}$ so the log
probability function is
\begin{equation}
\label{eq:PoissonProb}
-\log(y!)-\mu+y\log(\mu)
\end{equation}
and the canonical link function is $\log(\mu)$.
\subsection{Interpreting coefficient estimates}
\label{sec:GLMMcoefficients}
Returning to the interpretation of the estimated coefficients in model
\code{fm13} we apply exactly the same interpretation as for a linear
mixed model but taking into account that slopes or differences in
levels are with respect to the logit or log-odds function. If we wish
to express results in the probability scale then we should apply the
\code{plogis} function to whatever combination of coefficients is of
interest to us.
For example, we see from Fig.~\ref{fig:Contra2} that the observed
proportion of childless women with a centered age of
0 living in a rural setting who use artificial contraception is about
20\%. The fitted value of the log-odds for a typical district
(i.e. with a random effect of zero) is
\Sexpr{sprintf(.f5, fixef(fm13)[[1]])} corresponding to a fitted probability of
<>=
plogis(fixef(fm13)[[1]])
@
or \Sexpr{sprintf(.f1, 100*plogis(fixef(fm13)[[1]]))}\%.
Similarly, the predicted log-odds of a childless woman with a centered
age of 0 in an urban setting of a typical district using artificial
contraception is
<>=
sum(fixef(fm13)[c("(Intercept)","urbanY")])
@
corresponding to a probability of
<>=
plogis(sum(fixef(fm13)[c("(Intercept)","urbanY")]))
@
The predicted log-odds and predicted probability for a woman with
children and at the same age and location are
<>=
logodds <- sum(fixef(fm13)[c("(Intercept)","chY","urbanY")])
c("log-odds"=logodds, "probability"=plogis(logodds))
@
We should also be aware that the random effects are defined on the
linear predictor scale and not on the probability scale. A normal
probability plot of the conditional modes of the random effects for
model \code{fm13} (Fig.~\ref{fig:fm13predqq})
\begin{figure}[tbp]
<>=
print(qqmath(ranef(fm13, condVar =TRUE), strip = FALSE))
@
\caption{95\% prediction intervals on the random effects in
\code{fm13} versus quantiles of the standard normal distribution.}
\label{fig:fm13predqq}
\end{figure}
shows that the smallest random effects are approximately $-1$ and the
largest are approximately $1$. The numerical values and the identifier
of the combination of \code{urban} and \code{district} for these
extreme values can be obtained as
<>=
head(sort(ranef(fm13, drop=TRUE)[[1]]), 3)
@
and
<>=
tail(sort(ranef(fm13, drop=TRUE)[[1]]), 3)
@
The exponential of the random effect is the relative odds of a woman
in a particular urban/district combination using artificial birth
control compared to her counterpart (same age, same with/without
children status, same urban/rural status) in a typical district.
The odds of a rural woman in district 1 (i.e. the \code{N:1} value of the
\code{urban:district} interaction) using artifical contraception is
<>=
exp(ranef(fm13, drop=TRUE)[[1]]["N:1"])
@
or about 40\% of that of her rural counterpart in a typical district.
Notice that there is considerable variability in the lengths of the
prediction intervals in Fig.~\ref{fig:fm13predqq}, unlike those from
previous model fits (e.g. Fig.~\ref{fig:fm01MLpreddot},
p.~\pageref{fig:fm01MLpreddot} or Fig.~\ref{fig:fm03ranef},
p.~\ref{fig:fm03ranef}). This is to be expected with data from a
highly unbalanced observational study.
Consider the cross-tabulation of counts of interviewees by district
and urban/rural status presented at the end of
Sec.~\ref{sec:ContrGLMM}. The data contains responses from 54 rural
women in district 1 but only 21 rural women from district 11. Thus
the bottom line in Fig.~\ref{fig:fm13predqq}, from the \code{N:1}
level of the \code{urban:district} interaction, and based on 54
responses, is shorter than the line second from the bottom, for
\code{N:11} and based on 21 women only.