Concepts: Linear mixed models¶
This page introduces the statistical ideas behind interlace — linear mixed models,
random effects, variance components, REML estimation, and BLUPs. No prior experience
with mixed models is assumed, though familiarity with ordinary linear regression helps.
Why not ordinary regression?¶
Ordinary least squares (OLS) regression assumes that observations are independent of each other. That assumption breaks down whenever data have a grouping structure: students from the same school tend to be more similar to each other than to students from a different school; items in a questionnaire elicit systematically different response times regardless of condition; measurements from the same patient on different days share a common baseline.
Ignoring this structure has two consequences:
Standard errors are too small — the model treats 100 observations from 10 students as if they were 100 independent data points rather than 10 clusters of 10 correlated observations. Significance tests become anti-conservative.
Variance is misattributed — variability that belongs to between-group differences leaks into the residual, inflating σ² and distorting effect estimates.
Mixed models address both problems by explicitly modelling the grouping structure.
Fixed effects and random effects¶
A linear mixed model is built from two kinds of terms:
Fixed effects capture the population-level relationships you care about — the intercept, treatment conditions, continuous predictors. These are the same as coefficients in ordinary regression.
Random effects capture the deviation of each group from the population mean. Rather
than estimating a separate intercept for every school (which would require many
parameters and overfit small groups), the model assumes group intercepts are drawn from
a common Normal distribution with mean zero and variance σ²_group. The variance
σ²_group is estimated from the data; individual group offsets are predicted
conditionally.
Rule of thumb for choosing: a factor is fixed if you want to make inferences about its specific levels (treatment A vs treatment B); it is random if the levels are a sample from a larger population and you want to generalise beyond the observed groups (these 30 schools, these 50 items).
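As a concrete sketch, a crossed random-intercepts fit might look like the following. The function name interlace.lmer() and the lme4-style formula are assumptions for illustration (the Quickstart shows the exact interface), and the column names are hypothetical:

import interlace

# df: long-format DataFrame with one row per observation and columns
# rt (outcome), condition (fixed effect), subject and item (grouping factors).
# NOTE: interlace.lmer() and the formula syntax below are assumed here for
# illustration; see the Quickstart for the actual interface.
result = interlace.lmer("rt ~ condition + (1 | subject) + (1 | item)", data=df)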
Variance components¶
A mixed model with two crossed grouping factors decomposes the total variance in the outcome into additive pieces:
σ²_total = σ²_group1 + σ²_group2 + σ²_residual
Each σ² is called a variance component:
| Component | Meaning |
|---|---|
| σ²_group1 | How much groups in factor 1 differ from each other |
| σ²_group2 | How much groups in factor 2 differ from each other |
| σ²_residual | Residual within-group variability |
Large σ²_group relative to σ²_residual means the grouping structure explains a
substantial fraction of the total variance — the groups are very different from each
other. The intraclass correlation coefficient (ICC) summarises this:
ICC_g = σ²_group / σ²_total
An ICC of 0.3 for schools means 30% of the variance in scores is attributable to school-level differences.
In interlace, variance components are returned as result.variance_components — a
dict of {group_col: σ²}. The residual variance is result.scale.
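For example, a minimal sketch of turning these into ICCs (result is assumed to be a fitted interlace model; the grouping-column names are hypothetical):

# result.variance_components is {group_col: σ²}; result.scale is σ²_residual.
vc = result.variance_components            # e.g. {"subject": 0.42, "item": 0.18}
total = sum(vc.values()) + result.scale    # σ²_total (e.g. 0.42 + 0.18 + 0.60 = 1.20)

icc = {group: sigma2 / total for group, sigma2 in vc.items()}
# e.g. {"subject": 0.35, "item": 0.15}: 35% of the variance is between subjects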
Nested vs crossed designs¶
The grouping structure of your data determines which type of model you need.
Nested (hierarchical) design: every level of a lower factor exists within exactly one level of the factor above it. A classic three-level example: students are grouped into classes, and classes are grouped into schools. Class 1A only exists in School 1 — it is not shared with School 2.
Nested structure (students within classes within schools):
school 1 ──┬── class 1A ──┬── student A
│ ├── student B
│ └── student C
└── class 1B ──┬── student D
├── student E
└── student F
school 2 ──┬── class 2A ── ...
└── class 2B ── ...
Key property: no class is shared between schools.
Each student belongs to exactly one class, one school.
The hierarchy is strict: no class spans two schools, no student spans two classes. Variance decomposes across levels — some students differ because of school-level effects, some because of class-level effects within a school, and the rest is residual.
This is often called a multilevel or hierarchical linear model and is well
supported by statsmodels.MixedLM.
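For reference, a minimal sketch of fitting such a nested model with statsmodels (the column names score, treatment, school and classroom are hypothetical):

import statsmodels.formula.api as smf

# Students within classrooms within schools: school is the top-level grouping,
# and classroom enters as a variance component nested inside each school.
model = smf.mixedlm(
    "score ~ treatment",
    data=df,
    groups=df["school"],                            # random intercept per school
    re_formula="1",
    vc_formula={"classroom": "0 + C(classroom)"},   # classroom-within-school intercepts
)
nested_fit = model.fit(reml=True)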
Crossed design: the grouping factors are independent of each other — each level of factor A co-occurs with multiple levels of factor B, and vice versa. A classic example: subjects responding to items in a reading-time experiment. Subject 1 sees items from across the whole item pool; item 3 is seen by subjects from across the whole subject pool. Neither factor is contained within the other.
Crossed structure (subjects × items):
item 1 item 2 item 3 item 4
subject A ✓ ✓ ✓
subject B ✓ ✓ ✓
subject C ✓ ✓
subject D ✓ ✓ ✓
Key property: subjects and items are independent sources of variance.
Subject A appears with items 1, 2, 4 — not confined to any "item group".
Item 2 appears with subjects A, C, D — not confined to any "subject group".
Both subject and item independently shift the outcome. Fitting a model with only
one of them leaves the other’s variance in the residual, biasing inference. This is
the defining feature of a crossed design, and it cannot be expressed as a
hierarchy.
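In long format this means every row carries both grouping columns, and neither column determines the other. A small sketch mirroring the diagram above (column names and values are illustrative only):

import pandas as pd

# One row per (subject, item) observation; rt values are made up for illustration.
# Subject A appears with items 1, 2 and 4; item 2 appears with subjects A, C and D;
# neither grouping factor is contained within the other.
df = pd.DataFrame({
    "subject": ["A", "A", "A", "B", "B", "B", "C", "C", "D", "D", "D"],
    "item":    [1,   2,   4,   1,   3,   4,   2,   3,   1,   2,   3],
    "rt":      [512, 498, 530, 471, 488, 502, 544, 526, 463, 455, 480],
})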
The same pattern appears in many applied settings:
HR / compensation: employees rated by managers, where each manager rates employees from multiple departments and each department has employees rated by multiple managers
Education: students taking standardised tests across subjects, where a student’s performance reflects both student ability and item difficulty
E-commerce: customers purchasing across product categories, where a transaction reflects both customer propensity and category-level demand
interlace targets crossed designs with random intercepts and random slopes, and also
supports generalised linear mixed models (GLMMs) for non-normal outcomes via
interlace.glmer() — see the GLMM quickstart. For nested
designs, both interlace and statsmodels.MixedLM work well. See
For Python users for a side-by-side comparison.
REML: restricted maximum likelihood¶
Mixed models have two estimators in common use: maximum likelihood (ML) and restricted maximum likelihood (REML).
Plain ML treats fixed-effect coefficients as known when estimating variance components. In practice they are estimated, and ignoring that estimation uncertainty causes ML to systematically underestimate variance components — more so with smaller samples and more predictors.
REML corrects for this by maximising a likelihood that integrates out the fixed effects first (the “restricted” part), separating the estimation of variance from the estimation of coefficients. The result is unbiased variance component estimates.
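In symbols (standard notation, not specific to interlace): REML maximises the likelihood with the fixed effects β integrated out, and in the simplest one-sample case this reproduces the familiar n − 1 denominator for a sample variance.

\mathcal{L}_{\mathrm{REML}}(\theta, \sigma^2 \mid y) \;=\; \int \mathcal{L}(\beta, \theta, \sigma^2 \mid y)\, d\beta

\hat{\sigma}^2_{\mathrm{ML}} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \bar{y})^2
\qquad
\hat{\sigma}^2_{\mathrm{REML}} = \frac{1}{n-1}\sum_{i=1}^{n}(y_i - \bar{y})^2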
When to use REML vs ML:
| Situation | Estimator |
|---|---|
| Comparing models with the same fixed effects (different random structures) | REML |
| Comparing models with different fixed effects | ML (REML likelihoods are not comparable across different fixed structures) |
| Reporting final parameter estimates | REML |
interlace uses profiled REML by default — the same algorithm as R’s lme4::lmer().
Profiled REML collapses the optimisation to a low-dimensional problem over variance
parameters, which is faster and more numerically stable than optimising all parameters
jointly.
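Schematically: for any fixed value of the variance parameters θ, the optimal residual variance (and the corresponding fixed-effect estimates) have closed forms, so the quantity being minimised reduces to a function of θ alone (notation illustrative):

d_{\mathrm{REML}}(\theta) \;=\; -2 \log \mathcal{L}_{\mathrm{REML}}\!\big(\theta, \hat{\sigma}^2(\theta)\big)

The optimiser only ever searches over θ, which for a random-intercepts model with two grouping factors is typically just two numbers.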
BLUPs: best linear unbiased predictors¶
Once the model is fitted, we want to estimate the group-level deviations — how much each school, each subject, each item shifts the outcome relative to the population mean.
These estimates are called BLUPs (Best Linear Unbiased Predictors), or sometimes conditional modes or empirical Bayes estimates. They are not the same as including a fixed dummy variable for each group:
A dummy variable for each group gives an unregularised estimate — it uses only the data from that group.
A BLUP shrinks the estimate toward zero in proportion to the group’s sample size and
σ²_group. Small or sparse groups shrink more; large, consistent groups shrink less.
This shrinkage is why BLUPs are useful for prediction: a new subject with only one observation should not be assigned an extreme intercept. The BLUP for that subject pulls toward the population mean, which produces better out-of-sample predictions.
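For a random-intercept-only model this shrinkage has a simple closed form (a standard result, shown for intuition):

\hat{b}_j \;=\; \frac{n_j\,\sigma^2_{\mathrm{group}}}{n_j\,\sigma^2_{\mathrm{group}} + \sigma^2_{\mathrm{residual}}}\,\big(\bar{y}_j - \hat{\mu}\big)

where n_j is the group's sample size, ȳ_j its observed mean, and μ̂ the population-level prediction. As n_j grows the factor approaches 1 and the BLUP approaches the raw group deviation; for small n_j the estimate is pulled strongly toward zero.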
In interlace:
# Dict of {group_col → Series of BLUPs, indexed by group level}
result.random_effects["subject"] # e.g. Series: s1→1.3, s2→-0.8, ...
An unseen group level at prediction time automatically receives a BLUP of zero
(pure population mean), as implemented in result.predict(newdata=...).
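For example (a hedged sketch: the newdata column names are hypothetical, and result is a fitted model as above):

import pandas as pd

# "s1" was seen during fitting and gets its BLUP added to the fixed-effect
# prediction; "s999" was not, so its BLUP is zero and the prediction falls
# back to the population-level fixed effects.
new_df = pd.DataFrame({
    "subject":   ["s1", "s999"],
    "item":      ["i3", "i3"],
    "condition": ["a", "a"],
})
preds = result.predict(newdata=new_df)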
Glossary¶
Terminology note: this documentation uses grouping factor to mean a categorical
variable whose levels define groups (e.g. subject, item, school). The term
grouping column refers to the same thing in the context of a DataFrame — the column
whose values identify which group each observation belongs to. The two terms are
interchangeable.
Similarly, random intercept and random effect are often used synonymously in this documentation; more precisely, a random intercept is the simplest type of random effect (a per-group offset), while a random effect is the general concept that includes random slopes.
| Term | Definition |
|---|---|
| Fixed effect | A population-level coefficient; inference focuses on the specific levels |
| Random effect | A group-level deviation assumed drawn from a Normal distribution; encompasses random intercepts and random slopes |
| Random intercept | A per-group offset from the population mean — the most common type of random effect |
| Random slope | A per-group deviation in the effect of a predictor — the slope varies by group |
| Grouping factor | A categorical variable whose levels define groups (also called grouping column) |
| Variance component | The variance attributed to a single grouping factor (e.g. σ²_group) or to the residual |
| ICC | Intraclass correlation: proportion of total variance due to grouping; ICC = σ²_group / σ²_total |
| Nested design | Groups at one level exist entirely within a single group at the next level |
| Crossed design | Each level of factor A co-occurs with multiple levels of factor B |
| REML | Restricted maximum likelihood: unbiased estimator for variance components |
| ML | Maximum likelihood: biased for variance components but comparable across fixed structures |
| Profiled REML | REML where the optimisation is collapsed to variance parameters only |
| Profiled likelihood | A likelihood surface obtained by optimising out nuisance parameters (here, the fixed effects) so only variance parameters remain |
| BLUP | Best Linear Unbiased Predictor: shrinkage estimate of a group’s deviation; also called conditional mode |
| Conditional mode | Synonym for BLUP — the mode of the conditional distribution of random effects given the data |
| Shrinkage | Pulling individual group estimates toward the population mean |
| Conditional prediction | A prediction that includes the estimated group-level BLUPs; appropriate for known groups |
| Marginal prediction | A prediction based on fixed effects only, averaging over the random-effect distribution; appropriate for new or unknown groups |
| Conditional residual | Observed minus conditional prediction; reflects within-group unexplained variation |
| Marginal residual | Observed minus marginal prediction; reflects both between- and within-group unexplained variation |
| Leverage | The diagonal of the hat matrix: how much an observation can influence its own fitted value |
| Cook’s distance | A scalar measure of how much the fixed-effect estimates change when one observation is deleted |
| MDFFITS | Measures of Difference in Fits: a scale-free version of Cook’s distance standardised by the full fixed-effect covariance matrix |
| COVTRACE | Influence on the trace of the fixed-effect precision matrix; positive values mean deletion improves precision |
| COVRATIO | Ratio of fixed-effect covariance determinants with vs without an observation; values near 1 indicate little influence on precision |
| RVC | Relative variance change: how much a variance component changes (proportionally) when one observation is deleted |
| GLMM | Generalised linear mixed model: extends LMM to non-normal outcomes (Poisson, binomial) via link functions; fitted with interlace.glmer() |
| Link function | A monotone function that maps the expected response to the linear predictor (e.g. logit for binomial, log for Poisson) |
| Laplace approximation | A method for approximating the marginal likelihood in GLMMs by a second-order Taylor expansion around the conditional modes |
| PIRLS | Penalised iteratively reweighted least squares: the inner algorithm used to solve for random effects and fixed effects in a GLMM |
| Sparse Cholesky | Matrix factorisation used internally for efficient REML computation on large, sparse random-effect structures |
Where to go next¶
For Python / statsmodels users — code-level comparison and when to use interlace vs statsmodels
For R / lme4 users — formula syntax mapping and shared references
Quickstart — fit your first crossed random-intercepts model
Interpreting results — what to do with the output once the model is fitted