
STA 610L: Module 1.3

Introduction to hierarchical models

Dr. Olanrewaju Michael Akande

1 / 13

Introduction to hierarchical models

The terminology hierarchical model is quite general and can imply anything from simple use of a prior distribution to a highly organized data hierarchy (students nested in classes nested in schools nested in school systems nested in states nested in countries).

For grouped or nested data for example, we may want to infer or estimate the relationship between a response variable and certain predictors collected across all the groups.

In that case, we should do so in a way that takes advantage of the relationship between observations in the same group, but we should also look to borrow information across groups.

2 / 13

Introduction to hierarchical models

Hierarchical models are often used in the following commonly encountered settings:

  • members of a "cluster" share more similarities with each other than with members of other clusters, violating the typical independence assumption of generalized linear models (like linear or logistic regression) -- examples of clusters include members of a family or students in a class

  • hypotheses of interest include context-dependent associations, often across a large number of settings -- e.g., does the success of a new mode of instruction depend on the individual teacher?

  • it is necessary to borrow information across groups in order to stabilize estimates or to obtain estimates with desirable properties -- e.g., we want to make state-specific estimates of election candidate preference by country of origin, but some states may have few immigrants from a given country

3 / 13

Hypothetical school testing example

Suppose we wish to estimate the distribution of test scores for students at $J$ different high schools.

In each school $j$, where $j = 1, \ldots, J$, suppose we test a random sample of $n_j$ students.

Let $y_{ij}$ be the test score for the $i$th student in school $j$, with $i = 1, \ldots, n_j$.

Option I: estimation can be done separately in each group, where we assume

$$y_{ij} \mid \mu_j, \sigma_j^2 \sim \mathcal{N}\left(\mu_j, \sigma_j^2\right),$$

where for each school $j$, $\mu_j$ is the school-wide average test score, and $\sigma_j^2$ is the school-wide variance of individual test scores.

4 / 13

Hypothetical school testing example

We can do classical inference for each school based on the large-sample 95% CI: $\bar{y}_j \pm 1.96\sqrt{s_j^2/n_j}$, where $\bar{y}_j$ is the sample average in school $j$, and $s_j^2$ is the sample variance in school $j$.

Clearly, we can overfit the data within schools; for example, what if we only have 4 students from one of the schools?
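To make the small-sample concern concrete, here is a minimal Python sketch of the per-school interval; the scores below are made up purely for illustration:

```python
import numpy as np

def school_ci(scores, z=1.96):
    """Large-sample 95% CI for one school's mean: ybar +/- z * sqrt(s2 / n)."""
    scores = np.asarray(scores, dtype=float)
    n = len(scores)
    ybar = scores.mean()
    s2 = scores.var(ddof=1)               # sample variance s_j^2
    half_width = z * np.sqrt(s2 / n)
    return ybar - half_width, ybar + half_width

# A school where we sampled only 4 students: the interval is very wide
lo, hi = school_ci([52, 71, 64, 49])      # roughly (48.9, 69.1)
```

With so few observations, $\bar{y}_j$ alone is a noisy summary of the school, which is exactly the problem that borrowing information across schools addresses.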

Option II: alternatively, we might believe that $\mu_j = \mu$ for all $j$; that is, all schools have the same mean. This is the assumption (null hypothesis) in ANOVA models, for example.

Option I ignores that the $\mu_j$'s should be reasonably similar, whereas Option II ignores any differences between them.

It would be nice to find a compromise!

This is what we are able to do with hierarchical modeling.

5 / 13

Hierarchical model

Once again, suppose

$$y_{ij} \mid \mu_j, \sigma_j^2 \sim \mathcal{N}\left(\mu_j, \sigma_j^2\right); \quad i = 1, \ldots, n_j; \quad j = 1, \ldots, J.$$

We can assume that the $\mu_j$'s are drawn from a distribution based on the following: conceive of the schools themselves as being a random sample from all possible schools.

Suppose $\mu_0$ is the overall mean of all schools' average scores (a mean of the means), and $\tau^2$ is the variance of all schools' average scores (a variance of the means).

Then, we can think of each μj as being drawn from a distribution, e.g.,

$$\mu_j \mid \mu_0, \tau^2 \sim \mathcal{N}\left(\mu_0, \tau^2\right),$$

which gives us one more level, resulting in a hierarchical specification.

Usually, $\mu_0$ and $\tau^2$ will also be unknown, so that we need to estimate them (usually via MLE or Bayesian methods).
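One way to internalize the two-level specification is to simulate from it. Here is a small Python sketch; all parameter values are illustrative choices, not values from the slides:

```python
import numpy as np

rng = np.random.default_rng(610)

# Illustrative hyperparameters: mean and variance of the school means
mu0, tau2 = 50.0, 25.0
sigma2 = 100.0                            # within-school variance, common here
J = 20
n_j = rng.integers(4, 40, size=J)         # unequal sample sizes per school

# Level 2: draw each school mean mu_j ~ N(mu0, tau2)
mu = rng.normal(mu0, np.sqrt(tau2), size=J)

# Level 1: draw scores y_ij ~ N(mu_j, sigma2) within each school
scores = [rng.normal(mu[j], np.sqrt(sigma2), size=n_j[j]) for j in range(J)]
```

Each school's scores vary around its own mean, while the school means themselves vary around the grand mean $\mu_0$.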

6 / 13

Hierarchical model: school testing example

Back to our example, it turns out that the multilevel estimate is

$$\hat{\mu}_j \approx \frac{\frac{n_j}{\sigma_j^2}\bar{y}_j + \frac{1}{\tau^2}\mu_0}{\frac{n_j}{\sigma_j^2} + \frac{1}{\tau^2}},$$

but since the unknown parameters have to be estimated, the classical estimate is

$$\hat{\mu}_j \approx \frac{\frac{n_j}{s_j^2}\bar{y}_j + \frac{1}{\hat{\tau}^2}\bar{y}_{\text{all}}}{\frac{n_j}{s_j^2} + \frac{1}{\hat{\tau}^2}},$$

where $\bar{y}_{\text{all}}$ is the completely pooled estimate (the overall sample mean of all test scores).
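Numerically, the precision-weighted average behaves as follows; every quantity below is a made-up plug-in value for illustration, not an estimate from real data:

```python
def multilevel_estimate(ybar_j, s2_j, n_j, tau2_hat, ybar_all):
    """Precision-weighted compromise between a school's mean and the pooled mean."""
    w_data = n_j / s2_j        # precision of ybar_j from the school's own data
    w_pool = 1.0 / tau2_hat    # precision contributed by the second level
    return (w_data * ybar_j + w_pool * ybar_all) / (w_data + w_pool)

# Same sample mean (70), same pooled mean (50), very different sample sizes
big   = multilevel_estimate(70.0, s2_j=100.0, n_j=100, tau2_hat=25.0, ybar_all=50.0)
small = multilevel_estimate(70.0, s2_j=100.0, n_j=4,   tau2_hat=25.0, ybar_all=50.0)
# small (= 60.0) is pulled much closer to 50 than big (about 69.2)
```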

7 / 13

Hierarchical model: implications

Our estimate for each $\mu_j$ is a weighted average of $\bar{y}_j$ and $\mu_0$, ensuring that we are borrowing information across all levels through $\mu_0$ and $\tau^2$.

The weights in the weighted average are determined by the relative precisions (the inverse of a variance is often referred to as a precision) from the data and from the second-level model.

Suppose all $\sigma_j^2 = \sigma^2$. Then the schools with smaller $n_j$ have estimated $\mu_j$ closer to $\mu_0$ than schools with larger $n_j$.

Thus, the hierarchical model shrinks estimates with high variance towards the grand mean.

We seek to specify models like this in many different contexts, for many reasons, including the idea of "shrinkage".

We will do this over and over throughout the course.

8 / 13

Generalized linear models (GLM)

The generalized linear model framework accommodates many popular statistical models, including linear regression, logistic regression, probit regression, and Poisson regression, among others.

Two popular GLMs we will use in class are the linear regression model and the logistic regression model.

9 / 13

Linear regression

Linear regression is perhaps the most widely-used statistical model.

Recall that the model is given by $y_i = \beta_0 + \beta_1 x_{1i} + \cdots + \beta_p x_{pi} + \epsilon_i$, where $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$.

If the parameter $\beta_j > 0$, then increasing levels of $x_j$ are associated with larger expected values of $y$, and values of $\beta_j < 0$ are associated with smaller expected values of $y$.

$\beta_j = 0$ is consistent with no association between $x_j$ and $y$.
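A short simulation shows ordinary least squares recovering such coefficients; everything below (seed, true values, sample size) is invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate y_i = beta0 + beta1 * x_i + eps_i, with eps_i ~ N(0, sigma^2)
beta0, beta1, sigma = 2.0, 3.0, 1.0
x = rng.uniform(0, 10, size=500)
y = beta0 + beta1 * x + rng.normal(0.0, sigma, size=500)

# Ordinary least squares with an intercept column
X = np.column_stack([np.ones_like(x), x])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
# beta_hat lands close to the true (2, 3)
```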

10 / 13

Logistic regression

Logistic regression is a type of generalized linear model, which generalizes the typical linear model to binary data.

Let $y_i$ take either the value 1 or the value 0 (the labels assigned to 1 and 0 are arbitrary -- that is, we could let 1 denote voters and 0 denote non-voters, or we could exchange the labels -- we just need to remember our coding).

The logistic regression model is linear on the log of the odds: $$\log \frac{\pi_i}{1-\pi_i} = \beta_0 + \beta_1 x_{1i} + \cdots + \beta_p x_{pi},$$ where $\pi_i = \Pr(y_i = 1)$.

If the parameter $\beta_j > 0$, then increasing levels of $x_j$ are associated with higher probabilities that $y = 1$, and values of $\beta_j < 0$ are associated with lower probabilities that $y = 1$.

$\beta_j = 0$ is consistent with no association between $x_j$ and $y$.
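The same kind of simulation check works for logistic regression. Below is a sketch that fits the model by Newton-Raphson on the log-likelihood (a standard approach for GLMs); all numbers are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulate binary data whose log-odds are linear in x
beta0, beta1 = -1.0, 0.8
x = rng.uniform(-3, 3, size=2000)
p = 1.0 / (1.0 + np.exp(-(beta0 + beta1 * x)))
y = rng.binomial(1, p)

# Maximize the log-likelihood by Newton-Raphson
X = np.column_stack([np.ones_like(x), x])
beta = np.zeros(2)
for _ in range(25):
    pi = 1.0 / (1.0 + np.exp(-(X @ beta)))          # fitted probabilities
    grad = X.T @ (y - pi)                           # score vector
    hess = X.T @ (X * (pi * (1.0 - pi))[:, None])   # observed information
    beta = beta + np.linalg.solve(hess, grad)
# beta lands close to the true (-1, 0.8)
```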

11 / 13

Hierarchical model: demo

For some intuition behind hierarchical models, we'll check out this neat tutorial by Michael Freeman at the University of Washington.

12 / 13

What's next?

Move on to the readings for the next module!

13 / 13
