# Concepts: Linear mixed models

This page introduces the statistical ideas behind `interlace` — linear mixed models,
random effects, variance components, REML estimation, and BLUPs. No prior experience
with mixed models is assumed, though familiarity with ordinary linear regression helps.

---

## Why not ordinary regression?

Ordinary least squares (OLS) regression assumes that observations are **independent**
of each other. That assumption breaks down whenever data have a **grouping structure**:
students from the same school tend to be more similar to each other than to students
from a different school; items in a questionnaire elicit systematically different
response times regardless of condition; measurements from the same patient on different
days share a common baseline.

Ignoring this structure has two consequences:

1. **Standard errors are too small** — the model treats 100 observations from 10
   students as if they were 100 independent data points rather than 10 × 10 correlated
   ones. Significance tests become anti-conservative.
2. **Variance is misattributed** — variability that belongs to between-group differences
   leaks into the residual, inflating `σ²` and distorting effect estimates.

Mixed models address both problems by explicitly modelling the grouping structure.

---

## Fixed effects and random effects

A linear mixed model partitions the model into two parts:

**Fixed effects** capture the population-level relationships you care about — the
intercept, treatment conditions, continuous predictors. These are the same as
coefficients in ordinary regression.

**Random effects** capture the deviation of each group from the population mean. Rather
than estimating a separate intercept for every school (which would require many
parameters and overfit small groups), the model assumes group intercepts are drawn from
a common Normal distribution with mean zero and variance `σ²_group`. The variance
`σ²_group` is estimated from the data; individual group offsets are predicted
conditionally.

**Rule of thumb for choosing:** a factor is fixed if you want to make inferences about
its specific levels (treatment A vs treatment B); it is random if the levels are a
sample from a larger population and you want to generalise beyond the observed groups
(these 30 schools, these 50 items).

---

## Variance components

A mixed model with two crossed grouping factors decomposes the total variance in the
outcome into additive pieces:

```
σ²_total  =  σ²_group1  +  σ²_group2  +  σ²_residual
```

Each `σ²` is called a **variance component**:

| Component | Meaning |
|---|---|
| `σ²_group1` | How much groups in factor 1 differ from each other |
| `σ²_group2` | How much groups in factor 2 differ from each other |
| `σ²_residual` | Residual within-group variability |

Large `σ²_group` relative to `σ²_residual` means the grouping structure explains a
substantial fraction of the total variance — the groups are very different from each
other. The **intraclass correlation coefficient (ICC)** summarises this:

```
ICC_g  =  σ²_group / σ²_total
```

An ICC of 0.3 for schools means 30% of the variance in scores is attributable to
school-level differences.

In `interlace`, variance components are returned as `result.variance_components` — a
dict of `{group_col: σ²}`. The residual variance is `result.scale`.

---

## Nested vs crossed designs

The grouping structure of your data determines which type of model you need.

**Nested (hierarchical) design:** every level of a lower factor exists within exactly
one level of the factor above it. A classic three-level example: students are grouped
into classes, and classes are grouped into schools. Class 1A only exists in School 1 —
it is not shared with School 2.

```
Nested structure (students within classes within schools):

school 1 ──┬── class 1A ──┬── student A
           │              ├── student B
           │              └── student C
           └── class 1B ──┬── student D
                          ├── student E
                          └── student F

school 2 ──┬── class 2A ── ...
           └── class 2B ── ...

Key property: no class is shared between schools.
Each student belongs to exactly one class, one school.
```

The hierarchy is strict: no class spans two schools, no student spans two classes.
Variance decomposes across levels — some students differ because of school-level
effects, some because of class-level effects within a school, and the rest is residual.

This is often called a **multilevel** or **hierarchical linear model** and is well
supported by `statsmodels.MixedLM`.

**Crossed design:** the grouping factors are independent of each other — each level of
factor A co-occurs with multiple levels of factor B, and vice versa. A classic example:
subjects responding to items in a reading-time experiment. Subject 1 sees items from
across the whole item pool; item 3 is seen by subjects from across the whole subject
pool. Neither factor is contained within the other.

```
Crossed structure (subjects × items):

           item 1  item 2  item 3  item 4
subject A    ✓       ✓              ✓
subject B    ✓               ✓      ✓
subject C            ✓       ✓
subject D    ✓       ✓       ✓

Key property: subjects and items are independent sources of variance.
Subject A appears with items 1, 2, 4 — not confined to any "item group".
Item 2 appears with subjects A, C, D — not confined to any "subject group".
```

Both `subject` and `item` independently shift the outcome. Fitting a model with only
one of them leaves the other's variance in the residual, biasing inference. This is
the defining feature of a **crossed** design, and it cannot be expressed as a
hierarchy.

The same pattern appears in many applied settings:

- **HR / compensation:** employees rated by managers, where each manager rates employees
  from multiple departments and each department has employees rated by multiple managers
- **Education:** students taking standardised tests across subjects, where a student's
  performance reflects both student ability and item difficulty
- **E-commerce:** customers purchasing across product categories, where a transaction
  reflects both customer propensity and category-level demand

`interlace` targets crossed designs with random intercepts and random slopes, and also
supports **generalised linear mixed models** (GLMMs) for non-normal outcomes via
`interlace.glmer()` — see the [GLMM quickstart](glmm-quickstart.md). For nested
designs, both `interlace` and `statsmodels.MixedLM` work well. See
[For Python users](why-python.md) for a side-by-side comparison.

---

## REML: restricted maximum likelihood

Mixed models have two estimators in common use: **maximum likelihood (ML)** and
**restricted maximum likelihood (REML)**.

Plain ML treats fixed-effect coefficients as known when estimating variance components.
In practice they are estimated, and ignoring that estimation uncertainty causes ML to
**systematically underestimate variance components** — more so with smaller samples and
more predictors.

REML corrects for this by maximising a likelihood that integrates out the fixed effects
first (the "restricted" part), separating the estimation of variance from the estimation
of coefficients. The result is unbiased variance component estimates.

**When to use REML vs ML:**

| Situation | Estimator |
|---|---|
| Comparing models with the **same fixed effects** (different random structures) | REML |
| Comparing models with **different fixed effects** | ML (REML likelihoods are not comparable across different fixed structures) |
| Reporting final parameter estimates | REML |

`interlace` uses profiled REML by default — the same algorithm as R's `lme4::lmer()`.
Profiled REML collapses the optimisation to a low-dimensional problem over variance
parameters, which is faster and more numerically stable than optimising all parameters
jointly.

---

## BLUPs: best linear unbiased predictors

Once the model is fitted, we want to estimate the group-level deviations — how much
each school, each subject, each item shifts the outcome relative to the population mean.

These estimates are called **BLUPs** (Best Linear Unbiased Predictors), or sometimes
**conditional modes** or **empirical Bayes estimates**. They are not the same as
including a fixed dummy variable for each group:

- A dummy variable for each group gives an **unregularised** estimate — it uses only
  the data from that group.
- A BLUP **shrinks** the estimate toward zero in proportion to the group's sample size
  and `σ²_group`. Small or sparse groups shrink more; large, consistent groups shrink
  less.

This shrinkage is why BLUPs are useful for **prediction**: a new subject with only one
observation should not be assigned an extreme intercept. The BLUP for that subject
pulls toward the population mean, which produces better out-of-sample predictions.

In `interlace`:

```python
# Dict of {group_col → Series of BLUPs, indexed by group level}
result.random_effects["subject"]  # e.g. Series: s1→1.3, s2→-0.8, ...
```

An **unseen group level** at prediction time automatically receives a BLUP of zero
(pure population mean), as implemented in `result.predict(newdata=...)`.

---

## Glossary

**Terminology note:** this documentation uses **grouping factor** to mean a categorical
variable whose levels define groups (e.g. `subject`, `item`, `school`). The term
**grouping column** refers to the same thing in the context of a DataFrame — the column
whose values identify which group each observation belongs to. The two terms are
interchangeable.

Similarly, **random intercept** and **random effect** are often used synonymously in
this documentation; more precisely, a random intercept is the simplest type of random
effect (a per-group offset), while a random effect is the general concept that includes
random slopes.

| Term | Definition |
|---|---|
| **Fixed effect** | A population-level coefficient; inference focuses on the specific levels |
| **Random effect** | A group-level deviation assumed drawn from a Normal distribution; encompasses random intercepts and random slopes |
| **Random intercept** | A per-group offset from the population mean — the most common type of random effect |
| **Random slope** | A per-group deviation in the effect of a predictor — the slope varies by group |
| **Grouping factor** | A categorical variable whose levels define groups (also called grouping column) |
| **Variance component** | The variance `σ²` of a random effect distribution |
| **ICC** | Intraclass correlation: proportion of total variance due to grouping; ICC = σ²_group / σ²_total |
| **Nested design** | Groups at one level exist entirely within a single group at the next level |
| **Crossed design** | Each level of factor A co-occurs with multiple levels of factor B |
| **REML** | Restricted maximum likelihood: unbiased estimator for variance components |
| **ML** | Maximum likelihood: biased for variance components but comparable across fixed structures |
| **Profiled REML** | REML where the optimisation is collapsed to variance parameters only |
| **Profiled likelihood** | A likelihood surface obtained by optimising out nuisance parameters (here, the fixed effects) so only variance parameters remain |
| **BLUP** | Best Linear Unbiased Predictor: shrinkage estimate of a group's deviation; also called conditional mode |
| **Conditional mode** | Synonym for BLUP — the mode of the conditional distribution of random effects given the data |
| **Shrinkage** | Pulling individual group estimates toward the population mean |
| **Conditional prediction** | A prediction that includes the estimated group-level BLUPs; appropriate for known groups |
| **Marginal prediction** | A prediction based on fixed effects only, averaging over the random-effect distribution; appropriate for new or unknown groups |
| **Conditional residual** | Observed minus conditional prediction; reflects within-group unexplained variation |
| **Marginal residual** | Observed minus marginal prediction; reflects both between- and within-group unexplained variation |
| **Leverage** | The diagonal of the hat matrix: how much an observation can influence its own fitted value |
| **Cook's distance** | A scalar measure of how much the fixed-effect estimates change when one observation is deleted |
| **MDFFITS** | Measures of Difference in Fits: a scale-free version of Cook's distance standardised by the full fixed-effect covariance matrix |
| **COVTRACE** | Influence on the trace of the fixed-effect precision matrix; positive values mean deletion improves precision |
| **COVRATIO** | Ratio of fixed-effect covariance determinants with vs without an observation; values near 1 indicate little influence on precision |
| **RVC** | Relative variance change: how much a variance component changes (proportionally) when one observation is deleted |
| **GLMM** | Generalised linear mixed model: extends LMM to non-normal outcomes (Poisson, binomial) via link functions; fitted with `interlace.glmer()` |
| **Link function** | A monotone function that maps the expected response to the linear predictor (e.g. logit for binomial, log for Poisson) |
| **Laplace approximation** | A method for approximating the marginal likelihood in GLMMs by a second-order Taylor expansion around the conditional modes |
| **PIRLS** | Penalised iteratively reweighted least squares: the inner algorithm used to solve for random effects and fixed effects in a GLMM |
| **Sparse Cholesky** | Matrix factorisation used internally for efficient REML computation on large, sparse random-effect structures |

---

## Where to go next

- [For Python / statsmodels users](why-python.md) — code-level comparison and when to use `interlace` vs `statsmodels`
- [For R / lme4 users](why-r.md) — formula syntax mapping and shared references
- [Quickstart](quickstart.md) — fit your first crossed random-intercepts model
- [Interpreting results](interpreting-results.ipynb) — what to do with the output once the model is fitted