Concepts: Linear mixed models

This page introduces the statistical ideas behind interlace — linear mixed models, random effects, variance components, REML estimation, and BLUPs. No prior experience with mixed models is assumed, though familiarity with ordinary linear regression helps.


Why not ordinary regression?

Ordinary least squares (OLS) regression assumes that observations are independent of each other. That assumption breaks down whenever data have a grouping structure: students from the same school tend to be more similar to each other than to students from a different school; items in a questionnaire elicit systematically different response times regardless of condition; measurements from the same patient on different days share a common baseline.

Ignoring this structure has two consequences:

  1. Standard errors are too small — the model treats 100 observations from 10 students as if they were 100 independent data points rather than 10 clusters of 10 correlated ones. Significance tests become anti-conservative.

  2. Variance is misattributed — variability that belongs to between-group differences leaks into the residual, inflating σ² and distorting effect estimates.

Mixed models address both problems by explicitly modelling the grouping structure.
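The standard-error problem in point 1 can be quantified with the classic design effect, DEFF = 1 + (m − 1) × ICC, where m is the cluster size and ICC is the intraclass correlation defined later on this page. A minimal sketch in plain Python (not part of interlace):

```python
def design_effect(cluster_size: int, icc: float) -> float:
    """Variance inflation of the sample mean under equal-size clusters."""
    return 1.0 + (cluster_size - 1) * icc

# 100 observations from 10 students (10 each) with a modest ICC of 0.3:
deff = design_effect(cluster_size=10, icc=0.3)
n_effective = 100 / deff  # information content in independent-observation units
print(deff, n_effective)  # 3.7, ~27: the 100 points carry the weight of ~27
```

Naive OLS standard errors would be too small by a factor of roughly the square root of DEFF here.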


Fixed effects and random effects

A linear mixed model partitions the predictors into two parts:

Fixed effects capture the population-level relationships you care about — the intercept, treatment conditions, continuous predictors. These are the same as coefficients in ordinary regression.

Random effects capture the deviation of each group from the population mean. Rather than estimating a separate intercept for every school (which would require many parameters and overfit small groups), the model assumes group intercepts are drawn from a common Normal distribution with mean zero and variance σ²_group. The variance σ²_group is estimated from the data; individual group offsets are predicted conditionally.

Rule of thumb for choosing: a factor is fixed if you want to make inferences about its specific levels (treatment A vs treatment B); it is random if the levels are a sample from a larger population and you want to generalise beyond the observed groups (these 30 schools, these 50 items).


Variance components

A mixed model with two crossed grouping factors decomposes the total variance in the outcome into additive pieces:

σ²_total  =  σ²_group1  +  σ²_group2  +  σ²_residual

Each σ² is called a variance component:

Component        Meaning
σ²_group1        How much groups in factor 1 differ from each other
σ²_group2        How much groups in factor 2 differ from each other
σ²_residual      Residual within-group variability

Large σ²_group relative to σ²_residual means the grouping structure explains a substantial fraction of the total variance — the groups are very different from each other. The intraclass correlation coefficient (ICC) summarises this:

ICC_g  =  σ²_group / σ²_total

An ICC of 0.3 for schools means 30% of the variance in scores is attributable to school-level differences.

In interlace, variance components are returned as result.variance_components — a dict of {group_col: σ²}. The residual variance is result.scale.
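Given those two pieces, the ICC is simple arithmetic. A sketch in plain Python — the dict mirrors the shape of result.variance_components, but the numbers are invented for illustration:

```python
# Shapes mirror result.variance_components and result.scale; values are made up.
variance_components = {"school": 12.0, "item": 3.0}
residual = 10.0  # result.scale

total = sum(variance_components.values()) + residual
icc = {g: v / total for g, v in variance_components.items()}
print(icc)  # {'school': 0.48, 'item': 0.12}
```

Here 48% of the outcome variance is attributable to school-level differences and 12% to item-level differences; the remaining 40% is residual.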


Nested vs crossed designs

The grouping structure of your data determines which type of model you need.

Nested (hierarchical) design: every level of a lower factor exists within exactly one level of the factor above it. A classic three-level example: students are grouped into classes, and classes are grouped into schools. Class 1A only exists in School 1 — it is not shared with School 2.

Nested structure (students within classes within schools):

school 1 ──┬── class 1A ──┬── student A
           │              ├── student B
           │              └── student C
           └── class 1B ──┬── student D
                          ├── student E
                          └── student F

school 2 ──┬── class 2A ── ...
           └── class 2B ── ...

Key property: no class is shared between schools.
Each student belongs to exactly one class, one school.

The hierarchy is strict: no class spans two schools, no student spans two classes. Variance decomposes across levels — some students differ because of school-level effects, some because of class-level effects within a school, and the rest is residual.

This is often called a multilevel or hierarchical linear model and is well supported by statsmodels.MixedLM.

Crossed design: the grouping factors are independent of each other — each level of factor A co-occurs with multiple levels of factor B, and vice versa. A classic example: subjects responding to items in a reading-time experiment. Subject 1 sees items from across the whole item pool; item 3 is seen by subjects from across the whole subject pool. Neither factor is contained within the other.

Crossed structure (subjects × items):

           item 1  item 2  item 3  item 4
subject A    ✓       ✓              ✓
subject B    ✓               ✓      ✓
subject C            ✓       ✓
subject D    ✓       ✓       ✓

Key property: subjects and items are independent sources of variance.
Subject A appears with items 1, 2, 4 — not confined to any "item group".
Item 2 appears with subjects A, C, D — not confined to any "subject group".

Both subject and item independently shift the outcome. Fitting a model with only one of them leaves the other’s variance in the residual, biasing inference. This is the defining feature of a crossed design, and it cannot be expressed as a hierarchy.

The same pattern appears in many applied settings:

  • HR / compensation: employees rated by managers, where each manager rates employees from multiple departments and each department has employees rated by multiple managers

  • Education: students taking standardised tests across subjects, where a student’s performance reflects both student ability and item difficulty

  • E-commerce: customers purchasing across product categories, where a transaction reflects both customer propensity and category-level demand
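Whether two grouping columns are nested or crossed can be checked directly from the data: a factor is nested in another exactly when each of its levels co-occurs with a single level of the other. A small pandas sketch (the column names and data are hypothetical):

```python
import pandas as pd

def is_nested(df: pd.DataFrame, inner: str, outer: str) -> bool:
    """True if every level of `inner` appears under exactly one level of `outer`."""
    return bool((df.groupby(inner)[outer].nunique() == 1).all())

df = pd.DataFrame({
    "school": ["s1", "s1", "s2", "s2"],
    "class":  ["1A", "1B", "2A", "2B"],  # each class in exactly one school -> nested
    "item":   ["i1", "i2", "i1", "i2"],  # items reused across schools -> crossed
})
print(is_nested(df, "class", "school"))  # True
print(is_nested(df, "item", "school"))   # False
```

Running this check before model fitting is a cheap way to confirm which design you actually have.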

interlace targets crossed designs with random intercepts and random slopes, and also supports generalised linear mixed models (GLMMs) for non-normal outcomes via interlace.glmer() — see the GLMM quickstart. For nested designs, both interlace and statsmodels.MixedLM work well. See For Python users for a side-by-side comparison.


REML: restricted maximum likelihood

Mixed models have two estimators in common use: maximum likelihood (ML) and restricted maximum likelihood (REML).

Plain ML treats fixed-effect coefficients as known when estimating variance components. In practice they are estimated, and ignoring that estimation uncertainty causes ML to systematically underestimate variance components — more so with smaller samples and more predictors.

REML corrects for this by maximising a likelihood that integrates out the fixed effects first (the “restricted” part), separating the estimation of variance from the estimation of coefficients. The result is unbiased variance component estimates.
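The bias is easiest to see in the simplest possible case — estimating a variance around an estimated mean. ML divides the sum of squares by n, while the REML-style correction divides by n − 1 (n − p in general, with p fixed-effect coefficients). A NumPy sketch, unrelated to interlace's internals:

```python
import numpy as np

rng = np.random.default_rng(0)
n, true_var = 5, 4.0

# 20,000 replicated samples of size n with true variance 4:
y = rng.normal(0.0, np.sqrt(true_var), size=(20_000, n))
ss = ((y - y.mean(axis=1, keepdims=True)) ** 2).sum(axis=1)

ml = ss / n          # ML: pretends the estimated mean were the true mean
reml = ss / (n - 1)  # REML-style: accounts for the one estimated coefficient

print(ml.mean(), reml.mean())  # ≈ 3.2 (biased low by (n-1)/n) vs ≈ 4.0
```

The same mechanism, with p fixed-effect coefficients instead of one mean, is why plain ML underestimates variance components in mixed models.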

When to use REML vs ML:

  • Comparing models with the same fixed effects (different random structures): use REML.

  • Comparing models with different fixed effects: use ML (REML likelihoods are not comparable across different fixed structures).

  • Reporting final parameter estimates: use REML.

interlace uses profiled REML by default — the same algorithm as R’s lme4::lmer(). Profiled REML collapses the optimisation to a low-dimensional problem over variance parameters, which is faster and more numerically stable than optimising all parameters jointly.


BLUPs: best linear unbiased predictors

Once the model is fitted, we want to estimate the group-level deviations — how much each school, each subject, each item shifts the outcome relative to the population mean.

These estimates are called BLUPs (Best Linear Unbiased Predictors), or sometimes conditional modes or empirical Bayes estimates. They are not the same as including a fixed dummy variable for each group:

  • A dummy variable for each group gives an unregularised estimate — it uses only the data from that group.

  • A BLUP shrinks the estimate toward zero. How much it shrinks depends on the group’s sample size and on σ²_group relative to σ²_residual: small or sparse groups shrink more; large, consistent groups shrink less.

This shrinkage is why BLUPs are useful for prediction: a new subject with only one observation should not be assigned an extreme intercept. The BLUP for that subject pulls toward the population mean, which produces better out-of-sample predictions.
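For a single random intercept, the shrinkage has a closed form: the BLUP is the group's raw mean deviation multiplied by the factor n_g σ²_group / (n_g σ²_group + σ²_residual). A sketch with invented numbers:

```python
def blup_intercept(raw_mean_dev: float, n_g: int,
                   var_group: float, var_resid: float) -> float:
    """Shrink a group's raw mean deviation toward zero (random-intercept BLUP)."""
    shrink = n_g * var_group / (n_g * var_group + var_resid)
    return shrink * raw_mean_dev

# Same raw deviation of 2.0, different amounts of data per group:
print(blup_intercept(2.0, n_g=1,  var_group=1.0, var_resid=4.0))  # 0.4  (heavy shrinkage)
print(blup_intercept(2.0, n_g=50, var_group=1.0, var_resid=4.0))  # ~1.85 (little shrinkage)
```

A group seen once keeps only 20% of its raw deviation; a group seen 50 times keeps almost all of it.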

In interlace:

# Dict of {group_col → Series of BLUPs, indexed by group level}
result.random_effects["subject"]  # e.g. Series: s1→1.3, s2→-0.8, ...

An unseen group level at prediction time automatically receives a BLUP of zero (pure population mean), as implemented in result.predict(newdata=...).
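Conceptually, this fallback amounts to looking up each group's BLUP with a default of 0.0, so unseen levels receive the marginal (fixed-effects-only) prediction. A hand-rolled sketch of the idea, not interlace's actual implementation, with invented numbers:

```python
# Mirrors the shape of result.random_effects["subject"]; values are made up.
blups = {"s1": 1.3, "s2": -0.8}

def conditional_predict(fixed_part: float, subject: str) -> float:
    """Fixed-effects prediction plus the subject's BLUP; 0.0 for unseen subjects."""
    return fixed_part + blups.get(subject, 0.0)

print(conditional_predict(10.0, "s1"))   # 11.3 (known subject)
print(conditional_predict(10.0, "s99"))  # 10.0 (unseen -> marginal prediction)
```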


Glossary

Terminology note: this documentation uses grouping factor to mean a categorical variable whose levels define groups (e.g. subject, item, school). The term grouping column refers to the same thing in the context of a DataFrame — the column whose values identify which group each observation belongs to. The two terms are interchangeable.

Similarly, random intercept and random effect are often used synonymously in this documentation; more precisely, a random intercept is the simplest type of random effect (a per-group offset), while a random effect is the general concept that includes random slopes.

Fixed effect
    A population-level coefficient; inference focuses on the specific levels

Random effect
    A group-level deviation assumed drawn from a Normal distribution; encompasses random intercepts and random slopes

Random intercept
    A per-group offset from the population mean — the most common type of random effect

Random slope
    A per-group deviation in the effect of a predictor — the slope varies by group

Grouping factor
    A categorical variable whose levels define groups (also called grouping column)

Variance component
    The variance σ² of a random effect distribution

ICC
    Intraclass correlation: proportion of total variance due to grouping; ICC = σ²_group / σ²_total

Nested design
    Groups at one level exist entirely within a single group at the next level

Crossed design
    Each level of factor A co-occurs with multiple levels of factor B

REML
    Restricted maximum likelihood: unbiased estimator for variance components

ML
    Maximum likelihood: biased for variance components but comparable across fixed structures

Profiled REML
    REML where the optimisation is collapsed to variance parameters only

Profiled likelihood
    A likelihood surface obtained by optimising out nuisance parameters (here, the fixed effects) so only variance parameters remain

BLUP
    Best Linear Unbiased Predictor: shrinkage estimate of a group’s deviation; also called conditional mode

Conditional mode
    Synonym for BLUP — the mode of the conditional distribution of random effects given the data

Shrinkage
    Pulling individual group estimates toward the population mean

Conditional prediction
    A prediction that includes the estimated group-level BLUPs; appropriate for known groups

Marginal prediction
    A prediction based on fixed effects only, averaging over the random-effect distribution; appropriate for new or unknown groups

Conditional residual
    Observed minus conditional prediction; reflects within-group unexplained variation

Marginal residual
    Observed minus marginal prediction; reflects both between- and within-group unexplained variation

Leverage
    The diagonal of the hat matrix: how much an observation can influence its own fitted value

Cook’s distance
    A scalar measure of how much the fixed-effect estimates change when one observation is deleted

MDFFITS
    Measures of Difference in Fits: a scale-free version of Cook’s distance standardised by the full fixed-effect covariance matrix

COVTRACE
    Influence on the trace of the fixed-effect precision matrix; positive values mean deletion improves precision

COVRATIO
    Ratio of fixed-effect covariance determinants with vs without an observation; values near 1 indicate little influence on precision

RVC
    Relative variance change: how much a variance component changes (proportionally) when one observation is deleted

GLMM
    Generalised linear mixed model: extends LMM to non-normal outcomes (Poisson, binomial) via link functions; fitted with interlace.glmer()

Link function
    A monotone function that maps the expected response to the linear predictor (e.g. logit for binomial, log for Poisson)

Laplace approximation
    A method for approximating the marginal likelihood in GLMMs by a second-order Taylor expansion around the conditional modes

PIRLS
    Penalised iteratively reweighted least squares: the inner algorithm used to solve for random effects and fixed effects in a GLMM

Sparse Cholesky
    Matrix factorisation used internally for efficient REML computation on large, sparse random-effect structures


Where to go next