\SweaveOpts{engine=R,eps=FALSE,pdf=TRUE,width=8,strip.white=all} \SweaveOpts{keep.source=TRUE} \SweaveOpts{prefix=TRUE,prefix.string=figs/GLMMB,include=TRUE} \setkeys{Gin}{width=\textwidth} <>= options(width=74, show.signif.stars = FALSE, lattice.theme = function() canonical.theme("pdf", color = FALSE), str = strOptions(strict.width = "cut")) library(splines) library(lattice) library(Matrix) library(Rcpp) library(minqa) library(lme4) .f5 = "%.5f" .f1 = "%.1f" data(Contraception, package = "mlmRev") if (file.exists("fm10.rda")) {load("fm10.rda") } else { fm10 <- glmer(use ~ 1+age+I(age^2)+urban+livch+(1|district), Contraception, binomial) save(fm10, file="fm10.rda") } @ \chapter{Generalized Linear Mixed Models for Binary Responses} \label{chap:GLMMbinomial} In this chapter we consider mixed-effects models for data sets in which the response is \emph{binary},\index{binary response}\index{response!binary} representing yes/no or true/false or correct/incorrect responses. Because the response can only take on one of two values we adapt our models to predict the probability of the positive response. We retain the concept of the linear predictor, $\vec X\vec\beta+\vec Z\vec b$, depending on the fixed-effects parameters, $\vec\beta$, and the random effects, $\vec b$, with the corresponding model matrices, $\vec X$ and $\vec Z$, determining the conditional mean, $\vec\mu$, of the response, given the random effects, but the relationship is more general than that for the linear mixed model. The linear predictor, which we shall write as $\vec\eta$, determines $\vec\mu$ according to a \emph{link function}, $g$. For historical reasons it is the function taking an element of $\vec\mu$ to the corresponding element of $\vec\eta$ that is called the link. The transformation in the opposite direction, from $\vec\eta$ to $\vec\mu$, is called the \emph{inverse link}. As described in earlier chapters, models based on a Gaussian distribution for the response vector with its mean determined by the linear predictor are called linear models. Models incorporating coefficients in a linear predictor but allowing for more general forms of the distribution of the response are called \emph{generalized linear models}\index{generalized linear model}\index{glm}. When the linear predictor incorporates random effects in addition to the fixed-effects parameters we call them \emph{generalized linear mixed models} (GLMMs)\index{GLMM} and fit such models with the \code{glmer}\index{glmer} function. As in previous chapters, we will begin with an example to help illustrate these ideas. \section{Artificial contraception use in regions of Bangladesh} \label{sec:contraception} One of the test data sets from the Center for Multilevel Modelling,\index{Center for Multilevel Modelling} University of Bristol is derived from the 1989 Bangladesh Fertility Survey, \citep{huq:cleland:1990}. The data are a subsample of responses from 1934 women grouped in 60 districts and are available as the \code{Contraception}\index{Contraception data}\index{data set!Contraception} data set in the \code{mlmRev} package. <>= str(Contraception) @ The response of interest is \code{use} --- whether the woman chooses to use artificial contraception. The covariates include the district in which the woman resides, the number of live children she currently has, her age and whether she is in a rural or an urban setting. Note that the \code{age} variable is centered about a particular age so some values are negative. Regretably, the information on what the centering age was does not seem to be available. \subsection{Plotting the binary response} \label{sec:plottingbinary} Producing informative graphical displays of a binary response as it relates to covariates is somewhat more challenging that the corresponding plots for responses on a continuous scale. If we were to plot the 1934 responses as 0/1 values versus, for example, the woman's centered age, we would end up with a rather uninformative plot because all the points would fall on one of two horizontal lines. One approach to illustrating the structure of the data more effectively is to add \emph{scatterplot smoother}\index{scatterplot smoother} lines (Fig.~\ref{fig:Contra1}) \begin{figure}[tbp] \centering <>= print(xyplot(ifelse(use == "Y", 1, 0) ~ age|urban, Contraception, groups = livch, type = c("g", "smooth"), auto.key = list(space = "top", points = FALSE, lines = TRUE, columns = 4), ylab = "Proportion", xlab = "Centered age")) @ \caption[Contraception use versus centered age]{Contraception use versus centered age for women in the Bangladesh Fertility Survey 1989. Panels are determined by whether the woman is in an urban setting or not. Lines within the panels are scatterplot smoother lines for women with 0, 1, 2 and 3 or more live children.} \label{fig:Contra1} \end{figure} to show the trend in the response with respect to the covariate. Once we have the smoother lines in such a plot we can omit the data points themselves, as we did here, because they add very little information. The first thing to notice about the plot is that the proportion of women using contraception is not linear in age, which, on reflection, makes sense. A woman in the middle of this age range (probably corresponding to an age around 25) is more likely to use artificial contraception than is a girl in her early teens or a woman in her forties. We also see that women in an urban setting are more likely to use contraception than those in a rural setting and that women with no live children are less likely than women who have live children. There do not seem to be strong differences between women who have 1, 2 or 3 or more children compared to the differences between women with children and those without children. Interestingly, the quadratic pattern with respect to age does not seem to have been noticed. Comparisons of model fits through different software systems, as provided by the Center for Multilevel Modelling, incorporate only a linear term in age, even though the pattern is clearly nonlinear. The lesson here is similar to what we have seen in other examples; careful plotting of the data should, whenever possible, precede attempts to fit models to the data. \subsection{Initial GLMM fit to the contraception data} \label{sec:ContraGLMM} As for the \code{lmer} function, the first two arguments to the \code{glmer} function for fitting generalized linear mixed models are the model formula and the name of the data frame. The third argument to \code{glmer}, named \code{family}, describes the type of conditional distribution of the response given the random effects. Actually, as the name \code{family} implies, it contains more information than just the distribution type in that each distribution and link are described by several functions. Certain distributions, including the binomial, have canonical link functions (described in Sec.~\ref{sec:canonicallink}) associated with them and if we specify just the distribution type we get the family with the canonical link. Thus our initial fit is generated as <>= fm10 <- glmer(use ~ 1+age+I(age^2)+urban+livch+(1|district), Contraception, binomial) @ \index{fitted models!fm10} When displaying a fitted model like this that has several fixed-effects coefficients it is helpful to specify the optional argument \code{corr=FALSE} to suppress printing of the rather large correlation matrix of the fixed effects estimators. <>= print(fm10, corr=FALSE) @ The interpretation of the coefficients in this model is somewhat different from the linear mixed models coefficients that we examined previously but many of the model-building steps are similar. A rough assessment of the utility of a particular term in the fixed-effects part of the model can be obtained from examining the estimates of the coefficients associated with it and their standard errors. To test whether a particular term is useful we omit it from the model, refit and compare the reduced model fit to the original according to the change in deviance. We will examine the terms in the model first and discuss the interpretation of the coefficients in Sec.~\ref{sec:GLMMlink}. Recall from Chap.~\ref{chap:Covariates} that the default set of contrasts for a factor such as \code{livch} is offsets relative to the reference level, in this case women who do not have any live children. Although the coefficients labeled \code{livch1}, \code{livch2} and \code{livch3+} are all large relative to their standard errors, they are quite close to each other. This confirms our earlier impression that the main distinction is between women with children and those without and, for those who do have children, the number of children is not an important distinction. After incorporating a new variable \code{ch} --- an indicator of whether the woman has any children --- in the data <>= Contraception <- within(Contraception, ch <- factor(livch != 0, labels = c("N", "Y"))) @ we fit a reduced model, \code{fm11}, with summary <>= print(fm11 <- glmer(use ~ age + I(age^2) + urban + ch + (1|district), Contraception, binomial), corr = FALSE) @ \index{fitted models!fm11} Comparing this model to the previous model <>= anova(fm11,fm10) @ indicates that the reduced model is adequate. A plot of the smoothed observed proportions versus centered age according to \code{ch} and \code{urban} (Fig.~\ref{fig:Contra2}) \begin{figure}[tbp] \centering <>= print(xyplot(ifelse(use == "Y", 1, 0) ~ age|urban, Contraception, groups = ch, type = c("g", "smooth"), auto.key = list(space = "top", points = FALSE, lines = TRUE, columns = 2), ylab = "Proportion", xlab = "Centered age")) @ \caption[Contraception use versus centered age (children/no children)]{Contraception use versus centered age for women in the Bangladesh Fertility Survey 1989. Panels are determined by whether the woman is in an urban setting or not. Lines within the panels are scatterplot smoother lines for women without children and women with one or more live children.} \label{fig:Contra2} \end{figure} indicates that all four groups have a quadratic trend with respect to age but the location of the peak proportion is shifted for those without children relative to those with children. Incorporating an interaction of \code{age} and \code{ch} allows for such a shift. <>= print(fm12 <- glmer(use ~ age*ch + I(age^2) + urban + (1|district), Contraception, binomial), corr = FALSE) @ \index{fitted models!fm12} Comparing this fitted model to the previous one <>= anova(fm11, fm12) @ confirms the usefulness of this term. Continuing with the model-building we turn our attention to the random effects specification to see whether urban/rural differences vary significantly between districts and whether the distinction between childless women and women with children varies between districts. We fit a succession of models, described in the exercises for this chapter, before settling on model \code{fm13}, <>= (fm13 <- glmer(use ~ age*ch + I(age^2) + urban + (1|urban:district), Contraception, binomial)) <> anova(fm12,fm13) @ \index{fitted models!fm13} Notice that although there are 60 distinct districts there are only 102 distinct combinations of \code{urban:district} represented in the data. In 15 of the 60 districts there are no rural women in the sample and in 3 districts there are no urban women in the sample, as shown in <>= xtabs(~ urban + district, Contraception) @ \section{Link functions and interpreting coefficients} \label{sec:GLMMlink} To this point the only difference we have encountered between \code{glmer} and \code{lmer} as model-fitting functions is the need to specify the distribution family in a call to \code{glmer}. The formula specification is identical and the assessment of the significance of terms using likelihood ratio tests is similar. This is intentional. We have emphasized the use of likelihood ratio tests on terms, whether fixed-effects or random-effects terms, exactly so the approach will be general. However, the interpretation of the coefficient estimates in the different types of models is different. In a linear mixed model the linear predictor is the conditional mean (or ``expected value'') of the response given the random effects. That is, if we assume that we know the values of the fixed-effects parameters and the random effects then the expected response for a particular combination of covariate values is the linear predictor. Individual coefficients can be interpreted as slopes of the fitted response with respect to a numeric covariate or as shifts between levels of a categorical covariate. To interpret the estimates of coefficients in a GLMM we must define and examine the link function that we mentioned earlier. \subsection{The logit link function for binary responses} \label{sec:logitlink} The probability model for a binary response is the Bernoulli distribution, which is about the simplest probability distribution we can concoct. There are only two possible values: 0 and 1. If the probability of the response 1 is $p$ then the probability of 0 must be $1-p$. It is easy to establish that the expected value is also $p$. For consistency across distribution families we write this expected response as $\mu$ instead of $p$. We should, however, keep in mind that, for this distribution, $\mu$ corresponds to a probability and hence must satisfy $0\le\mu\le 1$. In general we don't want to have restrictions on the values of the linear predictor so we equate the linear predictor to a function of $\mu$ that has an unrestricted range. In the case of the Bernoulli distribution with the canonical link function we equate the linear predictor to the \emph{log odds} or \emph{logit} of the positive response. That is \begin{equation} \label{eq:logodds} \eta = \logit(\mu) = \log\left(\frac{\mu}{1-\mu}\right) . \end{equation} To understand why this is called the ``log odds'' recall that $\mu$ corresponds to a probability in $[0,1]$. The corresponding odds ratio, $\frac{\mu}{1-\mu}$, is in $[0,\infty)$ and the logarithm of the odds ratio, $\logit(\mu)$, is in $(-\infty, \infty)$. The inverse of the logit link function, \begin{equation} \label{eq:logitinv} \mu = \frac{1}{1+\exp(-\eta)} \end{equation} shown in \begin{figure}[tbp] \centering <>= eta <- seq(-7, 7, len=701) print(xyplot(plogis(eta) ~ eta, type = c("g","l"), xlab = expression(eta), ylab = expression(mu == frac(1,1+exp(-eta))))) @ \caption[Inverse of the logit link function]{Inverse of the logit link function. The linear predictor value, $\eta$, which is on an unrestricted scale, is mapped to $\mu$ on the probability scale, $[0,1]$.} \label{fig:logitinv} \end{figure} Fig.~\ref{fig:logitinv}, takes a value on the unrestricted range, $(-\infty,\infty)$, and maps it to the probability range, $[0,1]$. It happens this function is also the cumulative distribution function for the standard logistic distribution, available in \R{} as the function \code{plogis}. In some presentations the relationship between the logit link and the logistic distribution is emphasized but that often leads to questions of why we should focus on the logistic distribution. Also, it is not clear how this approach would generalize to other distributions such as the Poisson or the Gamma distributions. \subsection{Canonical link functions} \label{sec:canonicallink} A way of deriving the logit link that does generalize to a class of common distributions in what is called the \emph{exponential family} is to consider the logarithm of the probability function (for discrete distributions) or the probability density function (for continuous distributions). The probability function for the Bernoulli distribution is $\mu$ for $y=1$ and $1-\mu$ for $y=0$. If we write this in a somewhat peculiar way as $\mu^y\cdot(1-\mu)^{1-y}$ for $y\in\{0,1\}$ then the logarithm of the probability function becomes \begin{equation} \label{eq:BernoulliProb} \log\left(\mu^y\cdot(1-\mu)^{1-y}\right) = \log(1-\mu) + y\,\log\left(\frac{\mu}{1-\mu}\right) . \end{equation} Notice that the logit link function is the multiple of $y$ in the last term. For members of the exponential family the logarithm of the probability or probability density function can be expressed as a sum of up to three terms: one that involves $y$ only, one that involves the parameters only and the product of $y$ and a function of the parameters. This function is the canonical link. In the case of the Poisson distribution the probability function is $\frac{e^{-\mu}\mu^y}{y!}$ for $y\in\{0,1,2,\dots\}$ so the log probability function is \begin{equation} \label{eq:PoissonProb} -\log(y!)-\mu+y\log(\mu) \end{equation} and the canonical link function is $\log(\mu)$. \subsection{Interpreting coefficient estimates} \label{sec:GLMMcoefficients} Returning to the interpretation of the estimated coefficients in model \code{fm13} we apply exactly the same interpretation as for a linear mixed model but taking into account that slopes or differences in levels are with respect to the logit or log-odds function. If we wish to express results in the probability scale then we should apply the \code{plogis} function to whatever combination of coefficients is of interest to us. For example, we see from Fig.~\ref{fig:Contra2} that the observed proportion of childless women with a centered age of 0 living in a rural setting who use artificial contraception is about 20\%. The fitted value of the log-odds for a typical district (i.e. with a random effect of zero) is \Sexpr{sprintf(.f5, fixef(fm13)[[1]])} corresponding to a fitted probability of <>= plogis(fixef(fm13)[[1]]) @ or \Sexpr{sprintf(.f1, 100*plogis(fixef(fm13)[[1]]))}\%. Similarly, the predicted log-odds of a childless woman with a centered age of 0 in an urban setting of a typical district using artificial contraception is <>= sum(fixef(fm13)[c("(Intercept)","urbanY")]) @ corresponding to a probability of <>= plogis(sum(fixef(fm13)[c("(Intercept)","urbanY")])) @ The predicted log-odds and predicted probability for a woman with children and at the same age and location are <>= logodds <- sum(fixef(fm13)[c("(Intercept)","chY","urbanY")]) c("log-odds"=logodds, "probability"=plogis(logodds)) @ We should also be aware that the random effects are defined on the linear predictor scale and not on the probability scale. A normal probability plot of the conditional modes of the random effects for model \code{fm13} (Fig.~\ref{fig:fm13predqq}) \begin{figure}[tbp] <>= print(qqmath(ranef(fm13, condVar =TRUE), strip = FALSE)) @ \caption{95\% prediction intervals on the random effects in \code{fm13} versus quantiles of the standard normal distribution.} \label{fig:fm13predqq} \end{figure} shows that the smallest random effects are approximately $-1$ and the largest are approximately $1$. The numerical values and the identifier of the combination of \code{urban} and \code{district} for these extreme values can be obtained as <>= head(sort(ranef(fm13, drop=TRUE)[[1]]), 3) @ and <>= tail(sort(ranef(fm13, drop=TRUE)[[1]]), 3) @ The exponential of the random effect is the relative odds of a woman in a particular urban/district combination using artificial birth control compared to her counterpart (same age, same with/without children status, same urban/rural status) in a typical district. The odds of a rural woman in district 1 (i.e. the \code{N:1} value of the \code{urban:district} interaction) using artifical contraception is <>= exp(ranef(fm13, drop=TRUE)[[1]]["N:1"]) @ or about 40\% of that of her rural counterpart in a typical district. Notice that there is considerable variability in the lengths of the prediction intervals in Fig.~\ref{fig:fm13predqq}, unlike those from previous model fits (e.g. Fig.~\ref{fig:fm01MLpreddot}, p.~\pageref{fig:fm01MLpreddot} or Fig.~\ref{fig:fm03ranef}, p.~\ref{fig:fm03ranef}). This is to be expected with data from a highly unbalanced observational study. Consider the cross-tabulation of counts of interviewees by district and urban/rural status presented at the end of Sec.~\ref{sec:ContrGLMM}. The data contains responses from 54 rural women in district 1 but only 21 rural women from district 11. Thus the bottom line in Fig.~\ref{fig:fm13predqq}, from the \code{N:1} level of the \code{urban:district} interaction, and based on 54 responses, is shorter than the line second from the bottom, for \code{N:11} and based on 21 women only.