Concepts: Linear mixed models

This page introduces the statistical ideas behind interlace — linear mixed models, random effects, variance components, REML estimation, and BLUPs. No prior experience with mixed models is assumed, though familiarity with ordinary linear regression helps.


Why not ordinary regression?

Ordinary least squares (OLS) regression assumes that observations are independent of each other. That assumption breaks down whenever data have a grouping structure: students from the same school tend to be more similar to each other than to students from a different school; items in a questionnaire elicit systematically different response times regardless of condition; measurements from the same patient on different days share a common baseline.

Ignoring this structure has two consequences:

  1. Standard errors are too small — the model treats 100 observations from 10 students as if they were 100 independent data points rather than 10 clusters of 10 correlated ones. Significance tests become anti-conservative.

  2. Variance is misattributed — variability that belongs to between-group differences leaks into the residual, inflating σ² and distorting effect estimates.

Mixed models address both problems by explicitly modelling the grouping structure.
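The standard-error problem in point 1 can be quantified with the classic design effect, DEFF = 1 + (m − 1) × ICC, where m is the cluster size and ICC is the intraclass correlation defined later on this page. A minimal sketch in plain Python (not part of interlace):

```python
def design_effect(cluster_size: int, icc: float) -> float:
    """Variance inflation of the sample mean under equal-size clusters."""
    return 1.0 + (cluster_size - 1) * icc

# 100 observations from 10 students (10 each) with a modest ICC of 0.3:
deff = design_effect(cluster_size=10, icc=0.3)
n_effective = 100 / deff  # information content in independent-observation units
print(deff, n_effective)  # 3.7, ~27: the 100 points carry the weight of ~27
```

Naive OLS standard errors would be too small by a factor of roughly the square root of DEFF here.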


Fixed effects and random effects

A linear mixed model partitions the predictors into two parts:

Fixed effects capture the population-level relationships you care about — the intercept, treatment conditions, continuous predictors. These are the same as coefficients in ordinary regression.

Random effects capture the deviation of each group from the population mean. Rather than estimating a separate intercept for every school (which would require many parameters and overfit small groups), the model assumes group intercepts are drawn from a common Normal distribution with mean zero and variance σ²_group. The variance σ²_group is estimated from the data; individual group offsets are predicted conditionally.

Rule of thumb for choosing: a factor is fixed if you want to make inferences about its specific levels (treatment A vs treatment B); it is random if the levels are a sample from a larger population and you want to generalise beyond the observed groups (these 30 schools, these 50 items).


Variance components

A mixed model with two crossed grouping factors decomposes the total variance in the outcome into additive pieces:

σ²_total  =  σ²_group1  +  σ²_group2  +  σ²_residual

Each σ² is called a variance component:

Component        Meaning
σ²_group1        How much groups in factor 1 differ from each other
σ²_group2        How much groups in factor 2 differ from each other
σ²_residual      Residual within-group variability

Large σ²_group relative to σ²_residual means the grouping structure explains a substantial fraction of the total variance — the groups are very different from each other. The intraclass correlation coefficient (ICC) summarises this:

ICC_g  =  σ²_group / σ²_total

An ICC of 0.3 for schools means 30% of the variance in scores is attributable to school-level differences.

In interlace, variance components are returned as result.variance_components — a dict of {group_col: σ²}. The residual variance is result.scale.
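Given those two pieces, the ICC is simple arithmetic. A sketch in plain Python — the dict mirrors the shape of result.variance_components, but the numbers are invented for illustration:

```python
# Shapes mirror result.variance_components and result.scale; values are made up.
variance_components = {"school": 12.0, "item": 3.0}
residual = 10.0  # result.scale

total = sum(variance_components.values()) + residual
icc = {g: v / total for g, v in variance_components.items()}
print(icc)  # {'school': 0.48, 'item': 0.12}
```

Here 48% of the outcome variance is attributable to school-level differences and 12% to item-level differences; the remaining 40% is residual.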


Nested vs crossed designs

The grouping structure of your data determines which type of model you need.

Nested (hierarchical) design: every level of a lower factor exists within exactly one level of the factor above it. A classic three-level example: students are grouped into classes, and classes are grouped into schools. Class 1A only exists in School 1 — it is not shared with School 2.

Nested structure (students within classes within schools):

school 1 ──┬── class 1A ──┬── student A
           │              ├── student B
           │              └── student C
           └── class 1B ──┬── student D
                          ├── student E
                          └── student F

school 2 ──┬── class 2A ── ...
           └── class 2B ── ...

Key property: no class is shared between schools.
Each student belongs to exactly one class, one school.

The hierarchy is strict: no class spans two schools, no student spans two classes. Variance decomposes across levels — some students differ because of school-level effects, some because of class-level effects within a school, and the rest is residual.

This is often called a multilevel or hierarchical linear model and is well supported by statsmodels.MixedLM.

Crossed design: the grouping factors are independent of each other — each level of factor A co-occurs with multiple levels of factor B, and vice versa. A classic example: subjects responding to items in a reading-time experiment. Subject 1 sees items from across the whole item pool; item 3 is seen by subjects from across the whole subject pool. Neither factor is contained within the other.

Crossed structure (subjects × items):

           item 1  item 2  item 3  item 4
subject A    ✓       ✓              ✓
subject B    ✓               ✓      ✓
subject C            ✓       ✓
subject D    ✓       ✓       ✓

Key property: subjects and items are independent sources of variance.
Subject A appears with items 1, 2, 4 — not confined to any "item group".
Item 2 appears with subjects A, C, D — not confined to any "subject group".

Both subject and item independently shift the outcome. Fitting a model with only one of them leaves the other’s variance in the residual, biasing inference. This is the defining feature of a crossed design, and it cannot be expressed as a hierarchy.

The same pattern appears in many applied settings:

  • HR / compensation: employees rated by managers, where each manager rates employees from multiple departments and each department has employees rated by multiple managers

  • Education: students taking standardised tests across subjects, where a student’s performance reflects both student ability and item difficulty

  • E-commerce: customers purchasing across product categories, where a transaction reflects both customer propensity and category-level demand
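Whether two grouping columns are nested or crossed can be checked directly from the data: a factor is nested in another exactly when each of its levels co-occurs with a single level of the other. A small pandas sketch (the column names and data are hypothetical):

```python
import pandas as pd

def is_nested(df: pd.DataFrame, inner: str, outer: str) -> bool:
    """True if every level of `inner` appears under exactly one level of `outer`."""
    return bool((df.groupby(inner)[outer].nunique() == 1).all())

df = pd.DataFrame({
    "school": ["s1", "s1", "s2", "s2"],
    "class":  ["1A", "1B", "2A", "2B"],  # each class in exactly one school -> nested
    "item":   ["i1", "i2", "i1", "i2"],  # items reused across schools -> crossed
})
print(is_nested(df, "class", "school"))  # True
print(is_nested(df, "item", "school"))   # False
```

Running this check before model fitting is a cheap way to confirm which design you actually have.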

interlace targets crossed designs with random intercepts and random slopes, and also supports generalised linear mixed models (GLMMs) for non-normal outcomes via interlace.glmer() — see the GLMM quickstart. For nested designs, both interlace and statsmodels.MixedLM work well. See For Python users for a side-by-side comparison.


REML: restricted maximum likelihood

Mixed models have two estimators in common use: maximum likelihood (ML) and restricted maximum likelihood (REML).

Plain ML treats fixed-effect coefficients as known when estimating variance components. In practice they are estimated, and ignoring that estimation uncertainty causes ML to systematically underestimate variance components — more so with smaller samples and more predictors.

REML corrects for this by maximising a likelihood that integrates out the fixed effects first (the “restricted” part), separating the estimation of variance from the estimation of coefficients. The result is unbiased variance component estimates.
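The bias is easiest to see in the simplest possible case — estimating a variance around an estimated mean. ML divides the sum of squares by n, while the REML-style correction divides by n − 1 (n − p in general, with p fixed-effect coefficients). A NumPy sketch, unrelated to interlace's internals:

```python
import numpy as np

rng = np.random.default_rng(0)
n, true_var = 5, 4.0

# 20,000 replicated samples of size n with true variance 4:
y = rng.normal(0.0, np.sqrt(true_var), size=(20_000, n))
ss = ((y - y.mean(axis=1, keepdims=True)) ** 2).sum(axis=1)

ml = ss / n          # ML: pretends the estimated mean were the true mean
reml = ss / (n - 1)  # REML-style: accounts for the one estimated coefficient

print(ml.mean(), reml.mean())  # ≈ 3.2 (biased low by (n-1)/n) vs ≈ 4.0
```

The same mechanism, with p fixed-effect coefficients instead of one mean, is why plain ML underestimates variance components in mixed models.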

When to use REML vs ML:

  • Comparing models with the same fixed effects (different random structures): use REML.

  • Comparing models with different fixed effects: use ML (REML likelihoods are not comparable across different fixed structures).

  • Reporting final parameter estimates: use REML.

interlace uses profiled REML by default — the same algorithm as R’s lme4::lmer(). Profiled REML collapses the optimisation to a low-dimensional problem over variance parameters, which is faster and more numerically stable than optimising all parameters jointly.


BLUPs: best linear unbiased predictors

Once the model is fitted, we want to estimate the group-level deviations — how much each school, each subject, each item shifts the outcome relative to the population mean.

These estimates are called BLUPs (Best Linear Unbiased Predictors), or sometimes conditional modes or empirical Bayes estimates. They are not the same as including a fixed dummy variable for each group:

  • A dummy variable for each group gives an unregularised estimate — it uses only the data from that group.

  • A BLUP shrinks the estimate toward zero. How much it shrinks depends on the group’s sample size and on σ²_group relative to σ²_residual: small or sparse groups shrink more; large, consistent groups shrink less.

This shrinkage is why BLUPs are useful for prediction: a new subject with only one observation should not be assigned an extreme intercept. The BLUP for that subject pulls toward the population mean, which produces better out-of-sample predictions.
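For a single random intercept, the shrinkage has a closed form: the BLUP is the group's raw mean deviation multiplied by the factor n_g σ²_group / (n_g σ²_group + σ²_residual). A sketch with invented numbers:

```python
def blup_intercept(raw_mean_dev: float, n_g: int,
                   var_group: float, var_resid: float) -> float:
    """Shrink a group's raw mean deviation toward zero (random-intercept BLUP)."""
    shrink = n_g * var_group / (n_g * var_group + var_resid)
    return shrink * raw_mean_dev

# Same raw deviation of 2.0, different amounts of data per group:
print(blup_intercept(2.0, n_g=1,  var_group=1.0, var_resid=4.0))  # 0.4  (heavy shrinkage)
print(blup_intercept(2.0, n_g=50, var_group=1.0, var_resid=4.0))  # ~1.85 (little shrinkage)
```

A group seen once keeps only 20% of its raw deviation; a group seen 50 times keeps almost all of it.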

In interlace:

# Dict of {group_col → Series of BLUPs, indexed by group level}
result.random_effects["subject"]  # e.g. Series: s1→1.3, s2→-0.8, ...

An unseen group level at prediction time automatically receives a BLUP of zero (pure population mean), as implemented in result.predict(newdata=...).
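Conceptually, this fallback amounts to looking up each group's BLUP with a default of 0.0, so unseen levels receive the marginal (fixed-effects-only) prediction. A hand-rolled sketch of the idea, not interlace's actual implementation, with invented numbers:

```python
# Mirrors the shape of result.random_effects["subject"]; values are made up.
blups = {"s1": 1.3, "s2": -0.8}

def conditional_predict(fixed_part: float, subject: str) -> float:
    """Fixed-effects prediction plus the subject's BLUP; 0.0 for unseen subjects."""
    return fixed_part + blups.get(subject, 0.0)

print(conditional_predict(10.0, "s1"))   # 11.3 (known subject)
print(conditional_predict(10.0, "s99"))  # 10.0 (unseen -> marginal prediction)
```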


Glossary

Terminology note: this documentation uses grouping factor to mean a categorical variable whose levels define groups (e.g. subject, item, school). The term grouping column refers to the same thing in the context of a DataFrame — the column whose values identify which group each observation belongs to. The two terms are interchangeable.

Similarly, random intercept and random effect are often used synonymously in this documentation; more precisely, a random intercept is the simplest type of random effect (a per-group offset), while a random effect is the general concept that includes random slopes.

Fixed effect
    A population-level coefficient; inference focuses on the specific levels

Random effect
    A group-level deviation assumed drawn from a Normal distribution; encompasses random intercepts and random slopes

Random intercept
    A per-group offset from the population mean — the most common type of random effect

Random slope
    A per-group deviation in the effect of a predictor — the slope varies by group

Grouping factor
    A categorical variable whose levels define groups (also called grouping column)

Variance component
    The variance σ² of a random effect distribution

ICC
    Intraclass correlation: proportion of total variance due to grouping; ICC = σ²_group / σ²_total

Nested design
    Groups at one level exist entirely within a single group at the next level

Crossed design
    Each level of factor A co-occurs with multiple levels of factor B

REML
    Restricted maximum likelihood: unbiased estimator for variance components

ML
    Maximum likelihood: biased for variance components but comparable across fixed structures

Profiled REML
    REML where the optimisation is collapsed to variance parameters only

Profiled likelihood
    A likelihood surface obtained by optimising out nuisance parameters (here, the fixed effects) so only variance parameters remain

BLUP
    Best Linear Unbiased Predictor: shrinkage estimate of a group’s deviation; also called conditional mode

Conditional mode
    Synonym for BLUP — the mode of the conditional distribution of random effects given the data

Shrinkage
    Pulling individual group estimates toward the population mean

Conditional prediction
    A prediction that includes the estimated group-level BLUPs; appropriate for known groups

Marginal prediction
    A prediction based on fixed effects only, averaging over the random-effect distribution; appropriate for new or unknown groups

Conditional residual
    Observed minus conditional prediction; reflects within-group unexplained variation

Marginal residual
    Observed minus marginal prediction; reflects both between- and within-group unexplained variation

Leverage
    The diagonal of the hat matrix: how much an observation can influence its own fitted value

Cook’s distance
    A scalar measure of how much the fixed-effect estimates change when one observation is deleted

MDFFITS
    Measures of Difference in Fits: a scale-free version of Cook’s distance standardised by the full fixed-effect covariance matrix

COVTRACE
    Influence on the trace of the fixed-effect precision matrix; positive values mean deletion improves precision

COVRATIO
    Ratio of fixed-effect covariance determinants with vs without an observation; values near 1 indicate little influence on precision

RVC
    Relative variance change: how much a variance component changes (proportionally) when one observation is deleted

GLMM
    Generalised linear mixed model: extends LMM to non-normal outcomes (Poisson, binomial) via link functions; fitted with interlace.glmer()

Link function
    A monotone function that maps the expected response to the linear predictor (e.g. logit for binomial, log for Poisson)

Laplace approximation
    A method for approximating the marginal likelihood in GLMMs by a second-order Taylor expansion around the conditional modes

PIRLS
    Penalised iteratively reweighted least squares: the inner algorithm used to solve for random effects and fixed effects in a GLMM

Sparse Cholesky
    Matrix factorisation used internally for efficient REML computation on large, sparse random-effect structures


Where to go next