Model Comparison

Comparing mixed models lets you test whether adding predictors or expanding the random-effect structure improves fit beyond what you’d expect by chance. This page covers the likelihood ratio test (LRT) workflow and when to use ML vs REML.


REML vs ML: which to use

| Goal | Estimator | Why |
| --- | --- | --- |
| Final parameter estimates (fixed effects, variance components) | method="REML" (default) | Unbiased variance estimates |
| Comparing models with different fixed effects | method="ML" | REML likelihoods are not comparable when the fixed-effect design differs |
| Comparing models with the same fixed effects, different random structure | Either (REML preferred) | REML likelihoods are comparable here; use REML for unbiased variance estimates |

Practical workflow:

  1. Use method="ML" to select fixed effects (test whether adding a predictor helps)

  2. Refit the winning model with method="REML" (default) to get the final estimates


Likelihood ratio test (LRT)

The LRT compares two nested models. The test statistic is:

χ² = 2 × (log-likelihood_full − log-likelihood_reduced)

Under the null hypothesis (the reduced model is adequate), the statistic asymptotically follows a χ² distribution with degrees of freedom equal to the number of added parameters.
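For instance, with illustrative log-likelihoods of −1233.41 (reduced) and −1224.20 (full), the statistic is 2 × (−1224.20 − (−1233.41)) = 18.42.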

Comparing fixed-effect structures

import interlace
import scipy.stats

# Both models must use method="ML"
m_reduced = interlace.fit(
    "rt ~ 1",           # intercept only
    data=df,
    groups=["subject", "item"],
    method="ML",
)

m_full = interlace.fit(
    "rt ~ condition",   # + condition effect
    data=df,
    groups=["subject", "item"],
    method="ML",
)

lrt_stat = 2 * (m_full.llf - m_reduced.llf)
df_diff  = 1  # one added parameter (condition coefficient)
p_value  = scipy.stats.chi2.sf(lrt_stat, df=df_diff)

print(f"LRT χ²({df_diff}) = {lrt_stat:.3f}, p = {p_value:.4f}")
# LRT χ²(1) = 18.42, p = 0.0000

Comparing random-effect structures

When adding a random slope, the number of added parameters depends on the parameterisation:

| Change | Added parameters |
| --- | --- |
| Add intercept-only term | 1 (variance) |
| Add correlated slope (1 + x \| g) | 2 (slope variance + intercept-slope covariance) |
| Add independent slope (1 + x \|\| g) | 1 (slope variance only, covariance = 0) |

# Baseline: random intercepts only for subject and item
m_intercept = interlace.fit(
    "rt ~ condition",
    data=df,
    groups=["subject", "item"],
    method="ML",
)

# Adds a correlated random slope for condition by subject
m_slopes = interlace.fit(
    "rt ~ condition",
    data=df,
    random=["(1 + condition | subject)", "(1 | item)"],
    method="ML",
)

lrt_stat = 2 * (m_slopes.llf - m_intercept.llf)
p_value  = scipy.stats.chi2.sf(lrt_stat, df=2)  # 2 extra params
print(f"LRT χ²(2) = {lrt_stat:.3f}, p = {p_value:.4f}")

AIC and BIC

For non-nested comparisons (e.g. two different fixed-effect structures where neither is a special case of the other), use information criteria:

print(f"AIC: {m_reduced.aic:.1f} vs {m_full.aic:.1f}")
print(f"BIC: {m_reduced.bic:.1f} vs {m_full.bic:.1f}")

Lower AIC/BIC is better. BIC penalises complexity more heavily than AIC and therefore tends to favour simpler models. As a rule of thumb, differences of less than 2 are not meaningful, while differences greater than 10 indicate strong support for the lower-scoring model.
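A small sketch applying those thresholds to the comparison above (treating the 2–10 range as moderate support follows common practice, not an interlace rule):

delta_aic = m_reduced.aic - m_full.aic   # positive values favour the full model
better = "full" if delta_aic > 0 else "reduced"
if abs(delta_aic) > 10:
    verdict = f"strong support for the {better} model"
elif abs(delta_aic) >= 2:
    verdict = f"moderate support for the {better} model"
else:
    verdict = "no meaningful difference"
print(f"ΔAIC = {delta_aic:.1f}: {verdict}")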


Step-by-step workflow

A practical model-building workflow for a typical analysis:

import interlace
import scipy.stats

# Step 1: Fit baseline with ML
m0 = interlace.fit("rt ~ 1", data=df, groups=["subject", "item"], method="ML")

# Step 2: Add predictors one at a time, test with LRT
m1 = interlace.fit("rt ~ condition", data=df, groups=["subject", "item"], method="ML")
lrt = 2 * (m1.llf - m0.llf)
print(f"Adding condition: χ²(1) = {lrt:.2f}, p = {scipy.stats.chi2.sf(lrt, 1):.4f}")

m2 = interlace.fit("rt ~ condition + frequency", data=df, groups=["subject", "item"], method="ML")
lrt = 2 * (m2.llf - m1.llf)
print(f"Adding frequency: χ²(1) = {lrt:.2f}, p = {scipy.stats.chi2.sf(lrt, 1):.4f}")

# Step 3: Refit winning model with REML for final estimates
m_final = interlace.fit("rt ~ condition + frequency", data=df, groups=["subject", "item"])
# method="REML" is the default

print(m_final.fe_params)
print(m_final.variance_components)

anova() — likelihood-ratio test shortcut

The manual LRT (compute 2 * (m_full.llf - m_reduced.llf), look up the χ² p-value) works but is repetitive. interlace.anova() automates it, producing a two-row table that matches lme4’s anova.merMod() output:

import interlace

m0 = interlace.fit("rt ~ 1",         data=df, groups=["subject", "item"], method="ML")
m1 = interlace.fit("rt ~ condition", data=df, groups=["subject", "item"], method="ML")

print(interlace.anova(m0, m1))
#    Df   AIC    BIC  logLik  deviance  Chisq  Chi Df  Pr(>Chisq)
# 0   4  ...    ...    ...      ...     NaN     NaN       NaN
# 1   5  ...    ...    ...      ...    18.4     1.0     0.0000

The simpler model is always shown first regardless of argument order. Chisq, Chi Df, and Pr(>Chisq) are NaN for the reduced model row.

anova() raises ValueError if either model was fitted with REML — a deliberate guard against comparing incomparable likelihoods.
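For example, the guard fires as soon as a REML fit slips into the comparison (a short sketch; the exact message text may differ):

m_reml = interlace.fit("rt ~ condition", data=df, groups=["subject", "item"])  # REML by default

try:
    interlace.anova(m0, m_reml)
except ValueError as err:
    print(err)  # refit with method="ML" before comparing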

With update()

anova() and update() compose naturally for incremental model building:

m0 = interlace.fit("rt ~ 1", data=df, groups=["subject", "item"], method="ML")
m1 = m0.update(". ~ . + condition")
m2 = m1.update(". ~ . + frequency")

print(interlace.anova(m0, m1))
print(interlace.anova(m1, m2))

# Refit winner with REML for final estimates
m_final = m2.update(method="REML")

Iterative refinement with update()

Repeatedly calling interlace.fit() with slightly different formulas is verbose. The update() method reruns the fit with only the parts you want to change — formula, data, or any keyword argument — while inheriting the rest from the original call.

Dot notation

A . in the new formula expands to the corresponding part of the original:

# Original model
m0 = interlace.fit("rt ~ condition", data=df, groups=["subject", "item"], method="ML")

# Add a predictor: . ~ . + frequency
m1 = m0.update(". ~ . + frequency")

# Remove a predictor: . ~ . - condition
m_reduced = m0.update(". ~ . - condition")

# Replace the response (LHS)
m_alt = m0.update("log_rt ~ .")

The original m0 is unchanged; update() always returns a new CrossedLMEResult.
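Because update() returns a fresh result, the original and updated fits stay available side by side, e.g. for a manual LRT (a short sketch reusing m0 and m1 from above):

assert m1 is not m0            # m0 is untouched by the update
lrt = 2 * (m1.llf - m0.llf)    # both fits remain available for comparison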

Changing the dataset

Pass data= to refit on a different frame — useful for sensitivity analyses or rolling-window designs:

# Refit on a filtered subset
m_large = m0.update(data=df[df["school_size"] > 200])

# Change data and formula together
m_sens = m0.update(". ~ . + frequency", data=df_filtered)

Overriding fit arguments

Any keyword accepted by interlace.fit() can be overridden:

# Switch from ML to REML for final estimates
m_reml = m1.update(method="REML")

# Switch optimizer
m_bobyqa = m0.update(optimizer="bobyqa")

Typical workflow

import interlace, scipy.stats

# 1. Build models incrementally with ML
m0 = interlace.fit("rt ~ 1",         data=df, groups=["subject", "item"], method="ML")
m1 = m0.update(". ~ . + condition")
m2 = m1.update(". ~ . + frequency")

# 2. Test each step with LRT
for reduced, full, name in [(m0, m1, "condition"), (m1, m2, "frequency")]:
    stat = 2 * (full.llf - reduced.llf)
    p    = scipy.stats.chi2.sf(stat, df=1)
    print(f"{name}: χ²(1) = {stat:.2f}, p = {p:.4f}")

# 3. Refit winner with REML for final estimates
m_final = m2.update(method="REML")
print(m_final.summary())

Notes and caveats

  • LRT p-values for variance components are conservative. The null hypothesis puts the parameter on the boundary of the parameter space (variance ≥ 0), so the true null distribution is a 50:50 mixture of χ² distributions rather than a single χ², and the naive χ² approximation yields p-values that are too large. A rule of thumb: halve the p-value, or compute the mixture p-value directly (see the sketch after this list). For fixed effects, the approximation is accurate.

  • Do not compare REML likelihoods across different fixed-effect structures. REML integrates out the fixed effects, so the likelihood depends on the fixed-effect design matrix — two models with different fixed effects have incomparable REML likelihoods.

  • AIC/BIC with REML: result.aic and result.bic are computed from the REML log-likelihood when method="REML". Use them only for models with identical fixed effects.
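The boundary correction from the first caveat can be computed directly. A minimal sketch: boundary_lrt_pvalue is a hypothetical helper, not part of interlace, implementing the common 50:50 mixture with χ²(0) treated as a point mass at zero.

import scipy.stats

def boundary_lrt_pvalue(stat, df_added):
    """50:50 mixture p-value: 0.5*chi2(df_added) + 0.5*chi2(df_added - 1).

    Suitable when one variance parameter lies on the boundary under the null;
    chi2(0) is a point mass at zero, so its tail probability is 0.
    """
    upper = scipy.stats.chi2.sf(stat, df_added)
    lower = scipy.stats.chi2.sf(stat, df_added - 1) if df_added > 1 else 0.0
    return 0.5 * (upper + lower)

# One added variance (df_added=1): equivalent to halving the naive p-value
print(boundary_lrt_pvalue(3.84, 1))   # ~0.025 rather than ~0.05

# Correlated slope (variance + covariance): mixture of chi2(1) and chi2(2)
print(boundary_lrt_pvalue(5.99, 2))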


See also