class: center, middle, inverse, title-slide

# STA 610L: Module 4.1

## Generalized linear mixed effects models (Part I)

### Dr. Olanrewaju Michael Akande

---

## Generalized linear mixed effects model (GLMM)

As we continue to generalize the concepts we have covered, let's think about the incorporation of random effects into the standard representation of generalized linear models.

The basic idea is that we assume there is natural heterogeneity across groups in a subset of the regression coefficients.

These coefficients are assumed to vary across groups according to some distribution.

Conditional on the random effects, we then assume the responses for a single subject are independent observations from a distribution in the exponential family.

---

## GLMM

Note: when we look at longitudinal data, we will group by subject `\(i\)`; otherwise, we will group by `\(j\)`.

In the .hlight[generalized linear mixed effects model (GLMM)] for longitudinal data, we assume the conditional distribution of each `\(Y_{ij}\)`, conditional on `\({\bf b}_i\)`, belongs to the exponential family with conditional mean

`$$g(E[Y_{ij} \mid {\bf b}_i])={\bf X}_{ij}'\boldsymbol{\beta}+{\bf Z}_{ij}'{\bf b}_i,$$`

where `\(g(\cdot)\)` is a known link function.

Assume the `\({\bf b}_i\)` are independent across subjects with `\({\bf b}_i \sim N({\bf 0}, {\bf D})\)`.

We also assume that, given `\({\bf b}_i\)`, the responses `\(Y_{i1}, \ldots, Y_{in}\)` are mutually independent.

---

## Example: multilevel linear regression

`$$Y_{ij}={\bf X}_{ij}'\boldsymbol{\beta}+b_i + \varepsilon_{ij},$$`

where

`$$b_i\overset{iid}\sim N(0,\sigma_b^2) \perp \varepsilon_{ij} \overset{iid}\sim N(0,\sigma_e^2)$$`

and

`$$E(Y_{ij} \mid b_i)={\bf X}_{ij}'\boldsymbol{\beta}+b_i$$`

---

## Example: multilevel logistic model with random intercepts

`$$\text{logit}(E(Y_{ij} \mid b_i))={\bf X}_{ij}'\boldsymbol{\beta}+b_i,$$`

where

`$$b_i\sim N(0,\sigma^2)$$`

<br>

Question: what happened to `\(\varepsilon_{ij}\)`?

---

## Example: random coefficients Poisson regression

`$$\log(E(Y_{ij} \mid {\bf b}_i))={\bf X}_{ij}'\boldsymbol{\beta}+{\bf Z}_{ij}'{\bf b}_i.$$`

If we set

`$${\bf X}_{ij}={\bf Z}_{ij}=[1,t_{ij}],$$`

that is, we have random slopes and intercepts, then we can assume

`$${\bf b}_i \sim N({\bf 0}, {\bf D}).$$`

---

## Interpretation of GLMM estimates

In the model

`$$\text{logit}(E[Y_{ij} \mid b_i])={\bf X}_{ij}'\boldsymbol{\beta}+b_i,$$`

with `\(b_i \sim N(0,\sigma^2)\)`, each element of `\(\boldsymbol{\beta}\)` measures the change in the log odds of a 'positive' response per unit change in the respective covariate, in a specific group that has an underlying propensity to respond positively given by `\(b_i\)`.

<br>

That is, we need to hold the group membership constant when interpreting `\(\beta_k\)`, just as we would hold the values of `\({\bf x}_{i,(-k)}\)` constant when interpreting `\(\beta_k\)`.

---

## Caution

Note also that with a non-linear link function, a non-linear contrast of the averages is not equal to the average of non-linear contrasts, so that the parameters do not in general have population-average interpretations (as they would in a linear mixed effects model, which has identity link).

So while in the LMM,

`$$g(E(Y_{ij} \mid {\bf X}_{ij}, {\bf b}_i))={\bf X}_{ij}'\boldsymbol{\beta}+{\bf Z}_{ij}'{\bf b}_i$$`

with `\(g(\cdot)\)` the identity implies `\(E(Y_{ij} \mid {\bf X}_{ij})={\bf X}_{ij}'\boldsymbol{\beta}\)`, when `\(g(\cdot)\)` is non-linear (say the logit), in general

`$$g(E(Y_{ij} \mid {\bf X}_{ij}))\neq {\bf X}_{ij}'\boldsymbol{\beta}$$`

once we average over the distribution of the random effects.
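---

## Caution

Here is a quick simulation sketch of this point (a sketch only; the values of `beta0` and `sigma` below are hypothetical, chosen just for illustration). In a random intercepts logistic model, the subject-specific probability at `\(b_i = 0\)` differs from the population-averaged probability:

```r
set.seed(1)
beta0 <- 0.5; sigma <- 2                 #hypothetical intercept and random effect SD
b <- rnorm(1e5, 0, sigma)                #draws from the random effects distribution
p_subject <- plogis(beta0)               #subject-specific probability at b_i = 0
p_marginal <- mean(plogis(beta0 + b))    #population-averaged probability
c(subject_specific = p_subject, population_averaged = p_marginal)
```

The two probabilities differ, which is why the elements of `\(\boldsymbol{\beta}\)` should be interpreted conditionally on the random effects.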
---

## Intraclass correlation

Let's focus on binary responses for the rest of this module. That is, each `\(Y_{ij} \in \{0,1\}\)`.

Now consider an unobserved continuous variable `\(W_{ij}\)`.

`\(W_{ij}\)` is related to `\(Y_{ij}\)` in the following manner: `\(Y_{ij}=1\)` if `\(W_{ij}<c\)`, and `\(Y_{ij}=0\)` otherwise.

<br>

The location of `\(c\)` and the distribution of `\(W\)` govern the probability that `\(Y=1\)`.

---

## Intraclass correlation

A useful way of thinking about the model, though not an essential assumption:

`$$W_{ij}= {\bf X}_{ij}'\boldsymbol{\beta}+b_i+\varepsilon_{ij}$$`

- `\(\varepsilon_{ij} \sim N(0,1)\)`: probit regression
- `\(\varepsilon_{ij} \sim\)` standard logistic (mean 0, variance `\(\frac{\pi^2}{3}\)`): logistic regression

<br>

We can use this framework to calculate ICCs:

- `\(ICC=\frac{\sigma^2}{\sigma^2+1} ~~~ \text{for probit}\)`
- `\(ICC=\frac{\sigma^2}{\sigma^2+\frac{\pi^2}{3}} ~~~ \text{for logistic}\)`
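---

## Intraclass correlation

These formulas are easy to compute once we have an estimate of the random intercept variance (a minimal sketch; `sigma2` below is a hypothetical value of `\(\sigma^2\)`):

```r
sigma2 <- 0.64                            #hypothetical random intercept variance
c(probit = sigma2/(sigma2 + 1),           #ICC under the probit link
  logistic = sigma2/(sigma2 + pi^2/3))    #ICC under the logit link
```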
---

## Estimation using ML

The joint probability density function is given by

`$$f({\bf Y}_i \mid {\bf X}_i, {\bf b}_i)f({\bf b}_i).$$`

However, recall that the `\({\bf b}_i\)` are unobserved. How then do we handle the `\({\bf b}_i\)` in the maximization?

<br>

Typically, we base frequentist inferences on the marginal (integrated) likelihood function, given by

`$$\prod_{i=1}^N \int f({\bf Y}_i \mid {\bf X}_i, {\bf b}_i)f({\bf b}_i) d{\bf b}_i.$$`

Estimation using maximum likelihood then involves a two-step procedure.

---

## ML estimation steps

.hlight[Step 1]: Obtain ML estimates of `\(\boldsymbol{\beta}\)` and `\({\bf D}\)` based on the marginal likelihood of the data.

While this may sound simple, numerical or Monte Carlo integration techniques are typically required, and the techniques used may have substantial impacts on resulting inferences.

.hlight[Step 2]: Given estimates of `\(\boldsymbol{\beta}\)` and `\({\bf D}\)`, predict the random effects as

`$$\widehat{{\bf b}}_i=E({\bf b}_i \mid {\bf Y}_i, \widehat{\boldsymbol{\beta}}, \widehat{{\bf D}}).$$`

Again, simple analytic solutions are rarely available.

<br>

Even when the burden of integration is not that great, the optimization problem may be difficult to solve.
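---

## ML estimation steps

As a concrete sketch of Step 1 in `R` (the simulated data below are hypothetical, for illustration only), the `nAGQ` argument of `lme4::glmer` controls the number of adaptive Gauss-Hermite quadrature points; `nAGQ = 1` (the default) is the Laplace approximation, and larger values give more accurate integration when the model has a single scalar random effect.

```r
library(lme4)
set.seed(42)
#simulate a small hypothetical grouped binary dataset
mydata <- data.frame(group = rep(1:20, each = 10), x = rnorm(200))
mydata$y <- rbinom(200, 1, plogis(0.5*mydata$x + rnorm(20)[mydata$group]))

fit_laplace <- glmer(y ~ x + (1|group), family = binomial, data = mydata)            #nAGQ = 1 (Laplace)
fit_agq     <- glmer(y ~ x + (1|group), family = binomial, data = mydata, nAGQ = 10) #adaptive quadrature
```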
---

## Random effects logistic regression

Inverse logit functions for random intercepts logistic model with a single predictor.

<img src="4-1-generalized-linear-mixed-effects-models_files/figure-html/unnamed-chunk-2-1.png" style="display: block; margin: auto;" />

---

## Random effects logistic regression

Inverse logit functions for random slopes logistic model with a single predictor.

<img src="4-1-generalized-linear-mixed-effects-models_files/figure-html/unnamed-chunk-3-1.png" style="display: block; margin: auto;" />

---

## Random effects logistic regression

Inverse logit functions for random intercepts and random slopes logistic model with a single predictor.

<img src="4-1-generalized-linear-mixed-effects-models_files/figure-html/unnamed-chunk-4-1.png" style="display: block; margin: auto;" />

---

## 1988 elections analysis

To illustrate how to fit and interpret the results of random effects logistic models, we will use sample data on election polls.

National opinion polls are conducted by a variety of organizations (e.g., media, polling organizations, campaigns) leading up to elections.

While many of the best opinion polls are conducted at a national level, it can often also be interesting to estimate voting opinions and preferences at the state or even local level.

Well-designed polls are generally based on national random samples with corrections for nonresponse based on a variety of demographic factors (e.g., sex, ethnicity, race, age, education).

The data is from CBS News surveys conducted during the week before the 1988 election. Respondents were asked about their preferences for either the Republican candidate (Bush Sr.) or the Democratic candidate (Dukakis).

---

## 1988 elections analysis

The dataset includes 2193 observations from one of the eight surveys in the original full data (the most recent CBS News survey right before the election).

.small[
Variable | Description
:------------- | :------------
org | cbsnyt = CBS/NYT
.hlight[bush] | .hlight[1 = preference for Bush Sr., 0 = otherwise]
state | 1-51: 50 states and DC (number 9)
edu | education: 1=No HS, 2=HS, 3=Some College, 4=College Grad
age | 1=18-29, 2=30-44, 3=45-64, 4=65+
female | 1=female, 0=male
black | 1=black, 0=otherwise
region | 1=NE, 2=S, 3=N, 4=W, 5=DC
v_prev | average Republican vote share in the three previous elections (adjusted for home-state and home-region effects in the previous elections)
]

Given that the data has a natural multilevel structure (through `state` and `region`), it makes sense to explore hierarchical models for this data.

---

## 1988 elections analysis

Both voting turnout and preferences often depend on a complex combination of demographic factors.

In our example dataset, we have demographic factors such as biological sex, race, age, and education, which we may all want to look at by state, resulting in `\(2 \times 2 \times 4 \times 4 \times 51 = 3264\)` potential categories of respondents.

We may even want to control for `region`, adding to the number of categories.

Clearly, without a very large survey (most political surveys poll around 1,000 people), we will need to make assumptions in order to even obtain estimates in each category.

We usually cannot include all interactions; we should therefore select those to explore (through EDA and background knowledge).

The data is in the file `polls_subset.txt` on Sakai.

---

## 1988 elections analysis

```r
###### Load the data
polls_subset <- read.table("data/polls_subset.txt",header=TRUE)
str(polls_subset)
```

```
## 'data.frame':	2193 obs. of  10 variables:
##  $ org   : chr  "cbsnyt" "cbsnyt" "cbsnyt" "cbsnyt" ...
##  $ survey: int  9158 9158 9158 9158 9158 9158 9158 9158 9158 9158 ...
##  $ bush  : int  NA 1 0 0 1 1 1 1 0 0 ...
##  $ state : int  7 39 31 7 33 33 39 20 33 40 ...
##  $ edu   : int  3 4 2 3 2 4 2 2 4 1 ...
##  $ age   : int  1 2 4 1 2 4 2 4 3 3 ...
##  $ female: int  1 1 1 1 1 1 0 1 0 0 ...
##  $ black : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ region: int  1 1 1 1 1 1 1 1 1 1 ...
##  $ v_prev: num  0.567 0.527 0.564 0.567 0.524 ...
```

```r
head(polls_subset)
```

```
##      org survey bush state edu age female black region    v_prev
## 1 cbsnyt   9158   NA     7   3   1      1     0      1 0.5666333
## 2 cbsnyt   9158    1    39   4   2      1     0      1 0.5265667
## 3 cbsnyt   9158    0    31   2   4      1     0      1 0.5641667
## 4 cbsnyt   9158    0     7   3   1      1     0      1 0.5666333
## 5 cbsnyt   9158    1    33   2   2      1     0      1 0.5243666
## 6 cbsnyt   9158    1    33   4   4      1     0      1 0.5243666
```

---

## 1988 elections analysis

```r
summary(polls_subset)
```

```
##      org                survey          bush            state      
##  Length:2193        Min.   :9158   Min.   :0.0000   Min.   : 1.00  
##  Class :character   1st Qu.:9158   1st Qu.:0.0000   1st Qu.:14.00  
##  Mode  :character   Median :9158   Median :1.0000   Median :26.00  
##                     Mean   :9158   Mean   :0.5578   Mean   :26.11  
##                     3rd Qu.:9158   3rd Qu.:1.0000   3rd Qu.:39.00  
##                     Max.   :9158   Max.   :1.0000   Max.   :51.00  
##                                    NA's   :178                     
##       edu             age            female            black        
##  Min.   :1.000   Min.   :1.000   Min.   :0.0000   Min.   :0.00000  
##  1st Qu.:2.000   1st Qu.:2.000   1st Qu.:0.0000   1st Qu.:0.00000  
##  Median :2.000   Median :2.000   Median :1.0000   Median :0.00000  
##  Mean   :2.653   Mean   :2.289   Mean   :0.5887   Mean   :0.07615  
##  3rd Qu.:4.000   3rd Qu.:3.000   3rd Qu.:1.0000   3rd Qu.:0.00000  
##  Max.   :4.000   Max.   :4.000   Max.   :1.0000   Max.   :1.00000  
##                                                                    
##      region          v_prev      
##  Min.   :1.000   Min.   :0.1530  
##  1st Qu.:2.000   1st Qu.:0.5278  
##  Median :2.000   Median :0.5481  
##  Mean   :2.431   Mean   :0.5550  
##  3rd Qu.:3.000   3rd Qu.:0.5830  
##  Max.   :5.000   Max.   :0.6927  
## 
```

---

## 1988 elections analysis

```r
polls_subset$v_prev <- polls_subset$v_prev*100 #rescale
polls_subset$region_label <- factor(polls_subset$region,levels=1:5,
                                    labels=c("NE","S","N","W","DC"))
#we consider DC as a separate region due to its distinctive voting patterns
polls_subset$edu_label <- factor(polls_subset$edu,levels=1:4,
                                 labels=c("No HS","HS","Some College","College Grad"))
polls_subset$age_label <- factor(polls_subset$age,levels=1:4,
                                 labels=c("18-29","30-44","45-64","65+"))

#the data includes states but without the names, which we will need,
#so let's grab that from R datasets
data(state)
#"state" is an R data file (type ?state from the R command window for info)
state.abb #does not include DC, so we will create ours
```

```
##  [1] "AL" "AK" "AZ" "AR" "CA" "CO" "CT" "DE" "FL" "GA" "HI" "ID" "IL" "IN" "IA"
## [16] "KS" "KY" "LA" "ME" "MD" "MA" "MI" "MN" "MS" "MO" "MT" "NE" "NV" "NH" "NJ"
## [31] "NM" "NY" "NC" "ND" "OH" "OK" "OR" "PA" "RI" "SC" "SD" "TN" "TX" "UT" "VT"
## [46] "VA" "WA" "WV" "WI" "WY"
```

```r
#In the polls data, DC is the 9th "state" in alphabetical order
state_abbr <- c(state.abb[1:8], "DC", state.abb[9:50])
polls_subset$state_label <- factor(polls_subset$state,levels=1:51,labels=state_abbr)
rm(list = ls(pattern = "state")) #remove unnecessary values in the environment
```

---

## 1988 elections analysis

```r
###### View properties of the data
head(polls_subset)
```

```
##      org survey bush state edu age female black region   v_prev region_label
## 1 cbsnyt   9158   NA     7   3   1      1     0      1 56.66333           NE
## 2 cbsnyt   9158    1    39   4   2      1     0      1 52.65667           NE
## 3 cbsnyt   9158    0    31   2   4      1     0      1 56.41667           NE
## 4 cbsnyt   9158    0     7   3   1      1     0      1 56.66333           NE
## 5 cbsnyt   9158    1    33   2   2      1     0      1 52.43666           NE
## 6 cbsnyt   9158    1    33   4   4      1     0      1 52.43666           NE
##      edu_label age_label state_label
## 1 Some College     18-29          CT
## 2 College Grad     30-44          PA
## 3           HS       65+          NJ
## 4 Some College     18-29          CT
## 5           HS     30-44          NY
## 6 College Grad       65+          NY
```

```r
dim(polls_subset)
```

```
## [1] 2193   14
```

---

## 1988 elections analysis

```r
###### View properties of the data
str(polls_subset)
```

```
## 'data.frame':	2193 obs. of  14 variables:
##  $ org         : chr  "cbsnyt" "cbsnyt" "cbsnyt" "cbsnyt" ...
##  $ survey      : int  9158 9158 9158 9158 9158 9158 9158 9158 9158 9158 ...
##  $ bush        : int  NA 1 0 0 1 1 1 1 0 0 ...
##  $ state       : int  7 39 31 7 33 33 39 20 33 40 ...
##  $ edu         : int  3 4 2 3 2 4 2 2 4 1 ...
##  $ age         : int  1 2 4 1 2 4 2 4 3 3 ...
##  $ female      : int  1 1 1 1 1 1 0 1 0 0 ...
##  $ black       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ region      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ v_prev      : num  56.7 52.7 56.4 56.7 52.4 ...
##  $ region_label: Factor w/ 5 levels "NE","S","N","W",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ edu_label   : Factor w/ 4 levels "No HS","HS","Some College",..: 3 4 2 3 2 4 2 2 4 1 ...
##  $ age_label   : Factor w/ 4 levels "18-29","30-44",..: 1 2 4 1 2 4 2 4 3 3 ...
##  $ state_label : Factor w/ 51 levels "AL","AK","AZ",..: 7 39 31 7 33 33 39 20 33 40 ...
```

---

## 1988 elections analysis

I will not do any meaningful EDA here. I expect you to be able to do this yourself.

Let's just take a look at the amount of data we have for "bush" and the age:edu interaction.

```r
###### Exploratory data analysis
table(polls_subset$bush) #well split by the two values
```

```
## 
##    0    1 
##  891 1124
```

```r
table(polls_subset$edu,polls_subset$age)
```

```
##    
##       1   2   3   4
##   1  44  42  67  96
##   2 232 283 223 116
##   3 141 205  99  54
##   4 119 285 125  62
```
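---

## 1988 elections analysis

As one example of a quick EDA step you might try (a sketch; any grouping variable of interest could be substituted), we can look at the observed proportion supporting Bush by region:

```r
#observed support for Bush by region (NAs in bush dropped)
tapply(polls_subset$bush, polls_subset$region_label, mean, na.rm = TRUE)
```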
---

## 1988 elections analysis

As a start, we will consider a simple model with fixed effects of race and sex and a random effect for state (50 states + the District of Columbia).

$$
`\begin{split}
\textrm{bush}_{ij} | \boldsymbol{x}_{ij} & \sim \textrm{Bernoulli}(\pi_{ij}); \ \ \ i = 1, \ldots, n; \ \ \ j = 1, \ldots, J=51; \\
\textrm{log}\left(\dfrac{\pi_{ij}}{1-\pi_{ij}}\right) & = \beta_{0} + b_{0j} + \beta_1 \textrm{female}_{ij} + \beta_2 \textrm{black}_{ij}; \\
b_{0j} & \sim N(0, \sigma^2).
\end{split}`
$$

In `R`, we have

```r
#library(lme4)
model1 <- glmer(bush ~ black+female+(1|state_label),
                family=binomial(link="logit"),
                data=polls_subset)
summary(model1)
```

---

## 1988 elections analysis

```
## Generalized linear mixed model fit by maximum likelihood (Laplace
##   Approximation) [glmerMod]
##  Family: binomial  ( logit )
## Formula: bush ~ black + female + (1 | state_label)
##    Data: polls_subset
## 
##      AIC      BIC   logLik deviance df.resid 
##   2666.7   2689.1  -1329.3   2658.7     2011 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -1.7276 -1.0871  0.6673  0.8422  2.5271 
## 
## Random effects:
##  Groups      Name        Variance Std.Dev.
##  state_label (Intercept) 0.1692   0.4113  
## Number of obs: 2015, groups:  state_label, 49
## 
## Fixed effects:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)  0.44523    0.10139   4.391 1.13e-05
## black       -1.74161    0.20954  -8.312  < 2e-16
## female      -0.09705    0.09511  -1.020    0.308
## 
## Correlation of Fixed Effects:
##        (Intr) black 
## black  -0.119       
## female -0.551 -0.005
```

---

## 1988 elections analysis

Looks like we dropped some NAs.

```r
c(sum(complete.cases(polls_subset)),sum(!complete.cases(polls_subset)))
```

```
## [1] 2015  178
```

Not ideal; we'll learn about methods for dealing with missing data soon.

Interpretation of results:

+ For a fixed state (or across all states), a non-black male respondent has odds of supporting Bush of `\(e^{0.45}=1.57\)`.
+ For a fixed state and sex, a black respondent has `\(e^{-1.74}=0.18\)` times (an 82% decrease) the odds of supporting Bush as a non-black respondent; black respondents are much less likely to support Bush than non-black respondents.
+ For a given state and race, a female respondent has `\(e^{-0.10}=0.91\)` times (a 9% decrease) the odds of supporting Bush as a male respondent.

However, this last effect is not actually statistically significant!
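---

## 1988 elections analysis

Since the interpretations above are on the odds scale, it can be convenient to exponentiate the fixed effects directly (a quick sketch using `lme4` accessors; Wald intervals are used here only because they are fast to compute):

```r
exp(fixef(model1))                                    #estimated odds ratios (baseline odds for the intercept)
exp(confint(model1, parm = "beta_", method = "Wald")) #Wald intervals for the fixed effects, exponentiated
```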
---

## 1988 elections analysis

The state-level standard deviation is estimated at 0.41, so the states do vary some, but not by much.

I expect that you will be able to interpret the corresponding confidence intervals.

```
## Computing profile confidence intervals ...
```

```
##                  2.5 %      97.5 %
## .sig01       0.2608567  0.60403428
## (Intercept)  0.2452467  0.64871247
## black       -2.1666001 -1.34322366
## female      -0.2837100  0.08919986
```

---

## 1988 elections analysis

We can definitely fit a more sophisticated model that includes other relevant survey factors, such as

+ region
+ prior vote history (note that this is a state-level predictor),
+ age, education, and the interaction between them.

Given the structure of the data, it makes sense to include region as a second (nested) grouping variable. We are yet to discuss that, so I will return to this later.

---

## 1988 elections analysis

For now, let's just fit two models, one with the main effects for age and education, and the second with the interaction between them.

.large[

```
## Generalized linear mixed model fit by maximum likelihood (Laplace
##   Approximation) [glmerMod]
##  Family: binomial  ( logit )
## Formula: bush ~ black + female + edu_label + age_label + (1 | state_label)
##    Data: polls_subset
## 
##      AIC      BIC   logLik deviance df.resid 
##   2662.2   2718.3  -1321.1   2642.2     2005 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -1.8921 -1.0606  0.6420  0.8368  2.7906 
## 
## Random effects:
##  Groups      Name        Variance Std.Dev.
##  state_label (Intercept) 0.1738   0.4168  
## Number of obs: 2015, groups:  state_label, 49
## 
## Fixed effects:
##                       Estimate Std. Error z value Pr(>|z|)
## (Intercept)            0.31206    0.19438   1.605  0.10841
## black                 -1.74378    0.21124  -8.255  < 2e-16
## female                -0.09681    0.09593  -1.009  0.31289
## edu_labelHS            0.23282    0.16569   1.405  0.15998
## edu_labelSome College  0.51598    0.17921   2.879  0.00399
## edu_labelCollege Grad  0.31585    0.17454   1.810  0.07036
## age_label30-44        -0.29222    0.12352  -2.366  0.01800
## age_label45-64        -0.06744    0.13738  -0.491  0.62352
## age_label65+          -0.22509    0.16142  -1.394  0.16318
```

]

Can you interpret the results?

---

## 1988 elections analysis

```r
model3 <- glmer(bush ~ black + female + edu_label*age_label + (1|state_label),
                family=binomial(link="logit"),data=polls_subset)
```

```
## Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
## Model failed to converge with max|grad| = 0.00802313 (tol = 0.002, component 1)
```

It looks like we have a convergence issue. These issues can happen.

We have many parameters to estimate from the interaction terms `edu_label*age_label` (the cross-classification has `\(4 \times 4 = 16\)` cells), and it looks like that's causing a problem.

---

## Note on estimation

ML estimation is typically carried out using adaptive Gaussian quadrature.

To improve accuracy over many package defaults (the Laplace approximation), increase the number of quadrature points to be greater than one (e.g., via the `nAGQ` argument in `lme4::glmer`).

Note that, for the sake of computational efficiency, some software packages only allow the Laplace approximation (rather than adaptive Gaussian quadrature with multiple quadrature points) when the model includes more than one random effect.

---

class: center, middle

# What's next?

### Move on to the readings for the next module!