Diagnostics guide

After fitting a mixed model, diagnostics help you check whether the model assumptions hold and whether any observations are unduly influencing the estimates.

This guide covers:

  1. Residuals — marginal vs conditional, what patterns to look for

  2. Leverage — which observations have high potential influence

  3. Cook’s distance — which observations actually change the fixed-effect estimates

  4. MDFFITS — standardised version of Cook’s D, scale-free

  5. COVTRACE and COVRATIO — influence on the precision of fixed-effect estimates

  6. Relative variance change (RVC) — influence on variance components

  7. hlm_augment — combining everything into one tidy DataFrame

  8. Worked example — introducing an outlier and watching the diagnostics react

  9. lmer_influence_measures — HLMdiag-parity influence measures in one call

  10. OLS DFBETAS — coefficient-level influence for OLS models

Throughout we use synthetic exam-score data: students and schools as crossed grouping factors, with hours_studied and prior_gpa as fixed predictors.

import numpy as np
import pandas as pd
import interlace
from interlace import hlm_resid, leverage, hlm_influence, hlm_augment, isSingular
from plotnine import (
    ggplot, aes,
    geom_point, geom_hline, geom_vline, geom_histogram, geom_qq, geom_qq_line,
    geom_text, geom_segment,
    scale_color_manual,
    labs, theme_bw, theme, element_text,
    facet_wrap,
)

rng = np.random.default_rng(42)
n = 300

df = pd.DataFrame({
    "score":         60 + rng.normal(0, 8, n),
    "hours_studied": rng.uniform(0, 10, n),
    "prior_gpa":     rng.uniform(2.0, 4.0, n),
    "student_id":    [f"s{i}"   for i in rng.integers(1, 31, n)],
    "school_id":     [f"sch{i}" for i in rng.integers(1, 11, n)],
})

result = interlace.fit(
    "score ~ hours_studied + prior_gpa",
    data=df,
    groups=["student_id", "school_id"],
)
/tmp/ipykernel_2458/179038553.py:25: UserWarning: boundary (singular) fit: see help(interlace.isSingular)

0. Singular fit check

Before running any diagnostics, check whether the model is at the boundary of its parameter space. A singular fit means one or more variance components collapsed to zero. This can make residual patterns and influence measures harder to interpret.

from interlace import isSingular

print(f"Singular fit: {result.is_singular}")
print(f"Boundary flags: {result.boundary_flags}")

if isSingular(result):
    print("\nWarning: one or more variance components are at the boundary.")
    print("Fixed-effect SEs and influence measures may be unreliable.")
    print("Consider dropping the flagged grouping factor or using optimizer='bobyqa'.")
Singular fit: True
Boundary flags: {'student_id': False, 'school_id': True}

Warning: one or more variance components are at the boundary.
Fixed-effect SEs and influence measures may be unreliable.
Consider dropping the flagged grouping factor or using optimizer='bobyqa'.
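Acting on that suggestion can be sketched with the same fit() call; whether dropping school_id is appropriate is a judgment call for the design, not a rule:

```python
# Sketch: refit without the boundary grouping factor and check that the
# fixed effects are stable. Only sensible if the school-level variance
# is genuinely negligible.
result_reduced = interlace.fit(
    "score ~ hours_studied + prior_gpa",
    data=df,
    groups=["student_id"],
)
print("Reduced model singular:", result_reduced.is_singular)
print("Fixed effects:", result_reduced.fe_params.round(3).to_dict())
```

If the fixed-effect estimates barely move, the reduced model is the safer basis for the influence diagnostics below.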

1. Residuals

interlace.hlm_resid(result, level=1) returns observation-level residuals, both marginal and conditional; passing a grouping-factor name as level returns group-level quantities instead.

| Type | Formula | Interpretation |
|---|---|---|
| Marginal | y − Xβ̂ | Deviation from the population-level prediction (ignoring BLUPs) |
| Conditional | y − Xβ̂ − Zû | Deviation from the group-specific prediction (using BLUPs) |

Use marginal residuals to check the fixed-effect structure and the overall distributional assumption (normality of the response given predictors). Use conditional residuals to check within-group model fit and to spot individual outliers after accounting for group membership.

Both types should show:

  • No trend against fitted values (homoscedasticity)

  • Approximate normality (check with a Q–Q plot)

  • No systematic pattern by group

resid = hlm_resid(result, level=1)  # observation-level residuals

# level=1 returns observation-level residuals; here we assume the frame
# exposes the same column names hlm_augment uses: ".resid" (conditional)
# and ".resid_mar" (marginal).
resid_df = pd.DataFrame({
    "fitted":      result.fittedvalues,
    "marginal":    resid[".resid_mar"],
    "conditional": resid[".resid"],
})

resid_df.describe().round(3)

Residuals vs fitted values

A good plot shows points scattered randomly around zero with roughly constant spread. Any fan shape (wider spread at high fitted values) indicates heteroscedasticity; any curve indicates a missing non-linear term.

(
    ggplot(resid_df, aes(x="fitted", y="conditional"))
    + geom_point(alpha=0.4)
    + geom_hline(yintercept=0, linetype="dashed", color="red")
    + labs(
        x="Fitted values",
        y="Conditional residual",
        title="Residuals vs fitted — good pattern (random scatter around zero)",
    )
    + theme_bw()
)

Q–Q plot

Points should fall on the diagonal. Heavy tails or an S-curve indicate departures from normality — consider a robustness check or a transformation.

(
    ggplot(resid_df, aes(sample="conditional"))
    + geom_qq(alpha=0.4)
    + geom_qq_line(color="red")
    + labs(
        x="Theoretical quantiles",
        y="Sample quantiles",
        title="Q–Q plot of conditional residuals",
    )
    + theme_bw()
)
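The plots are the primary check, but a numeric test can back them up. A minimal sketch using scipy's D'Agostino-Pearson omnibus test (scipy is an extra assumption here, not an interlace dependency):

```python
import numpy as np
from scipy import stats

def check_normality(residuals, alpha=0.05):
    """D'Agostino-Pearson omnibus test; a small p-value suggests
    the residuals are not normally distributed."""
    stat, p = stats.normaltest(np.asarray(residuals, dtype=float))
    return {"statistic": float(stat), "p_value": float(p), "looks_normal": bool(p > alpha)}

# e.g. check_normality(resid_df["conditional"])
```

As with the thresholds later in this guide, treat a rejection as a prompt to look at the Q–Q plot, not as a verdict on its own.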

2. Leverage

interlace.leverage() decomposes the hat-matrix diagonal into three parts following Demidenko & Stukel (2005) and Nobre & Singer (2007):

| Column | Meaning |
|---|---|
| fixef (H1) | Leverage due to the fixed-effect design matrix — unusual predictor values |
| ranef (H2) | Leverage due to the random-effect structure (Demidenko & Stukel) |
| ranef.uc | Unconfounded H2 (Nobre & Singer) — H2 with the fixed-effect contribution removed |
| overall | H1 + H2 — total leverage |

Interpretation:

  • High fixef leverage: the observation has an unusual combination of predictors; it controls the slope estimates.

  • High ranef leverage: the observation strongly determines its group’s BLUP.

  • A rule of thumb borrowed from OLS: overall > 2p/n (where p = number of fixed parameters, n = observations) flags potentially high-leverage points. In mixed models this is a heuristic — leverage is distributed differently — but it gives a starting threshold.

lev = leverage(result)
lev.head()
p = len(result.fe_params)  # number of fixed-effect parameters
threshold = 2 * p / result.nobs

lev_plot_df = lev.reset_index(drop=True).assign(
    obs=range(len(lev)),
    high=lev["overall"] > threshold,
)

(
    ggplot(lev_plot_df, aes(x="obs", y="overall", color="high"))
    + geom_point(alpha=0.6)
    + geom_hline(yintercept=threshold, linetype="dashed", color="red")
    + scale_color_manual(values={True: "#e74c3c", False: "#2c3e50"})
    + labs(
        x="Observation index",
        y="Overall leverage (H1 + H2)",
        color="> 2p/n",
        title="Leverage plot — dashed line at 2p/n threshold",
    )
    + theme_bw()
)

High leverage alone is not a problem — it means the observation could be influential. Combine leverage with residual size (next section) to identify observations that are both unusual and poorly fit.
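That combination can be sketched as a small helper; the leverage cutoff and the 2-SD residual cutoff are both heuristics, and `lev` / `resid_df` come from the cells above:

```python
import numpy as np
import pandas as pd

def flag_leverage_and_residual(overall_leverage, residuals, lev_cutoff, resid_z=2.0):
    """True where an observation combines high leverage with a large
    standardized conditional residual."""
    resid = np.asarray(residuals, dtype=float)
    z = (resid - resid.mean()) / resid.std(ddof=1)
    high_lev = np.asarray(overall_leverage, dtype=float) > lev_cutoff
    return pd.Series(high_lev & (np.abs(z) > resid_z), name="unusual_and_poorly_fit")

# e.g. flag_leverage_and_residual(lev["overall"], resid_df["conditional"], 2 * p / n)
```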


3. Cook’s distance

Cook’s distance (cooksd) measures how much the vector of fixed-effect estimates changes when observation i is deleted. It combines residual size and leverage:

$$D_i = \frac{(\hat{\boldsymbol{\beta}} - \hat{\boldsymbol{\beta}}_{(-i)})^\top (\mathbf{X}^\top \hat{\boldsymbol{\Omega}}^{-1} \mathbf{X}) (\hat{\boldsymbol{\beta}} - \hat{\boldsymbol{\beta}}_{(-i)})}{p \, \hat{\sigma}^2}$$

Thresholds (heuristics):

  • cooksd > 4/n — worth inspecting

  • cooksd > 1 — substantial influence on fixed-effect estimates

These thresholds are rules of thumb from OLS, applied here as starting points. Always inspect flagged observations in context before removing them.

infl = hlm_influence(result)
infl.head()
n = result.nobs
cooksd_threshold = 4 / n

cooksd_df = infl[["cooksd"]].reset_index(drop=True).assign(
    obs=range(n),
    high=infl["cooksd"].values > cooksd_threshold,
)

(
    ggplot(cooksd_df, aes(x="obs", y="cooksd", color="high"))
    + geom_point(alpha=0.6)
    + geom_hline(yintercept=cooksd_threshold, linetype="dashed", color="red")
    + scale_color_manual(values={True: "#e74c3c", False: "#2c3e50"})
    + labs(
        x="Observation index",
        y="Cook's distance",
        color="> 4/n",
        title="Cook's distance — dashed line at 4/n",
    )
    + theme_bw()
)

4. MDFFITS

MDFFITS (multivariate DFFITS) is a scale-free version of Cook’s distance. It standardises by the full covariance matrix of the fixed effects rather than by p * σ², so it is comparable across models with different numbers of parameters.

Threshold: mdffits > 2 * sqrt(p/n) is a common guideline.

MDFFITS and Cook’s D usually agree on which observations are influential; MDFFITS is preferred when comparing across models or when σ² is poorly estimated.

mdffits_threshold = 2 * np.sqrt(p / n)

mdffits_df = infl[["mdffits"]].reset_index(drop=True).assign(
    obs=range(n),
    high=infl["mdffits"].values > mdffits_threshold,
)

(
    ggplot(mdffits_df, aes(x="obs", y="mdffits", color="high"))
    + geom_point(alpha=0.6)
    + geom_hline(yintercept=mdffits_threshold, linetype="dashed", color="red")
    + scale_color_manual(values={True: "#e74c3c", False: "#2c3e50"})
    + labs(
        x="Observation index",
        y="MDFFITS",
        color="> 2√(p/n)",
        title="MDFFITS — dashed line at 2√(p/n)",
    )
    + theme_bw()
)
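To check the agreement claim on your own fit, a small cross-tabulation of the two flags works (a sketch over the `infl` frame from section 3):

```python
import pandas as pd

def influence_agreement(infl_df, cooksd_cut, mdffits_cut):
    """Cross-tabulate which observations each measure flags."""
    return pd.crosstab(
        infl_df["cooksd"] > cooksd_cut,
        infl_df["mdffits"] > mdffits_cut,
        rownames=["cooksd flagged"],
        colnames=["mdffits flagged"],
    )

# e.g. influence_agreement(infl, 4 / n, 2 * np.sqrt(p / n))
```

Off-diagonal counts are the observations on which the two measures disagree; those are the ones worth a closer look.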

5. COVTRACE and COVRATIO

These diagnostics measure influence on the precision of fixed-effect estimates (the covariance matrix V), not just on their values.

| Measure | Formula | Interpretation |
|---|---|---|
| covtrace | tr(V⁻¹Vᵢ) − p | Positive → deletion improves precision; negative → deletion hurts precision |
| covratio | det(Vᵢ) / det(V) | < 1 → deletion improves precision (det shrinks); > 1 → deletion hurts precision |

Here Vᵢ is the fixed-effect covariance matrix fitted without observation i, and p is the number of fixed parameters.

Thresholds:

  • |covtrace| > p is notable

  • covratio far from 1 (say, < 0.9 or > 1.1) is worth checking

A large positive covtrace means observation i inflates standard errors — removing it would make estimates more precise. Check whether it is an outlier or simply at an extreme predictor value (high leverage).

cov_df = infl[["covtrace", "covratio"]].reset_index(drop=True).assign(
    obs=range(n),
)

(
    ggplot(cov_df, aes(x="obs", y="covratio"))
    + geom_point(alpha=0.5)
    + geom_hline(yintercept=1, linetype="dashed", color="red")
    + geom_hline(yintercept=0.9, linetype="dotted", color="orange")
    + geom_hline(yintercept=1.1, linetype="dotted", color="orange")
    + labs(
        x="Observation index",
        y="COVRATIO",
        title="COVRATIO — dashed=1 (neutral), dotted=0.9/1.1 (guideline bounds)",
    )
    + theme_bw()
)
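The plot above covers covratio; covtrace can be screened with the |covtrace| > p rule directly (a sketch over the same `infl` frame):

```python
import pandas as pd

def flag_covtrace(infl_df, p):
    """True where |covtrace| > p, i.e. deleting the observation shifts
    the precision of the fixed effects by a notable amount."""
    return infl_df["covtrace"].abs() > p

# e.g. infl[flag_covtrace(infl, p)] lists the flagged observations
```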

6. Relative variance change (RVC)

RVC columns (rvc.<name>) measure how much each variance component changes when observation i is deleted, relative to the full-model estimate:

$$\text{RVC}_i(\theta_k) = \frac{\hat{\theta}_{k(-i)} - \hat{\theta}_k}{\hat{\theta}_k}$$

A value of +0.3 means the variance component increases by 30% when observation i is removed — that observation is suppressing variance in that grouping factor.

Threshold: |RVC| > 0.5 is a common guideline for flagging notable influence on a specific variance component.

rvc_cols = [c for c in infl.columns if c.startswith("rvc.")]
rvc_df = (
    infl[rvc_cols]
    .reset_index(drop=True)
    .assign(obs=range(n))
    .melt(id_vars="obs", var_name="component", value_name="rvc")
)

(
    ggplot(rvc_df, aes(x="obs", y="rvc"))
    + geom_point(alpha=0.4)
    + geom_hline(yintercept=0,    linetype="dashed", color="grey")
    + geom_hline(yintercept= 0.5, linetype="dotted", color="orange")
    + geom_hline(yintercept=-0.5, linetype="dotted", color="orange")
    + facet_wrap("component", ncol=1)
    + labs(
        x="Observation index",
        y="Relative variance change",
        title="RVC per variance component — dotted lines at ±0.5",
    )
    + theme_bw()
)
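Flagging against the ±0.5 guideline is a one-liner over the rvc.<name> columns (a sketch; `infl` comes from section 3):

```python
import pandas as pd

def flag_high_rvc(infl_df, cutoff=0.5):
    """True where any rvc.<name> column exceeds the cutoff in absolute value."""
    rvc = infl_df.filter(like="rvc.")
    return rvc.abs().gt(cutoff).any(axis=1)

# e.g. infl[flag_high_rvc(infl)] lists the flagged observations
```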

7. hlm_augment — everything in one DataFrame

interlace.hlm_augment() appends all residual and influence diagnostics to your original data. This makes it easy to filter, join, and plot.

Columns added:

| Column | Source |
|---|---|
| .resid | Conditional residual |
| .resid_mar | Marginal residual |
| .fitted | Conditional fitted value |
| .leverage | Overall leverage (H1 + H2) |
| .cooksd | Cook’s distance |
| .mdffits | MDFFITS |
| .covtrace | COVTRACE |
| .covratio | COVRATIO |
| .rvc.<name> | RVC per variance component |

aug = hlm_augment(result, data=df)
aug.head()
# Flag any observation that exceeds both the leverage and Cook's D thresholds
aug["influential"] = (
    (aug[".leverage"] > 2 * p / n) &
    (aug[".cooksd"]   > 4 / n)
)

print(f"Observations flagged as influential: {aug['influential'].sum()} of {len(aug)}")
aug[aug["influential"]][["student_id", "school_id", "score", ".leverage", ".cooksd"]].head()

8. Worked example — introducing an outlier

To see how the diagnostics respond to a genuine outlier, we refit the model after injecting a single extreme observation.

# Inject one outlier: extreme score and unusual predictor values
outlier = pd.DataFrame([{
    "score":         120.0,   # far above the ~60-point mean
    "hours_studied": 0.1,     # almost no study time
    "prior_gpa":     2.0,     # lowest GPA
    "student_id":    "s1",
    "school_id":     "sch1",
}])

df_outlier = pd.concat([df, outlier], ignore_index=True)
result_outlier = interlace.fit(
    "score ~ hours_studied + prior_gpa",
    data=df_outlier,
    groups=["student_id", "school_id"],
)

print("Without outlier — hours_studied coefficient:", round(result.fe_params["hours_studied"], 3))
print("With outlier    — hours_studied coefficient:", round(result_outlier.fe_params["hours_studied"], 3))
infl_outlier = hlm_influence(result_outlier)
n2 = result_outlier.nobs

cooksd_out_df = infl_outlier[["cooksd"]].reset_index(drop=True).assign(
    obs=range(n2),
    is_outlier=np.arange(n2) == n2 - 1,  # last row is the injected outlier
)
# label the injected outlier
cooksd_out_df["label"] = cooksd_out_df.apply(
    lambda r: "injected\noutlier" if r["is_outlier"] else "", axis=1
)

(
    ggplot(cooksd_out_df, aes(x="obs", y="cooksd", color="is_outlier"))
    + geom_point(alpha=0.6)
    + geom_hline(yintercept=4 / n2, linetype="dashed", color="red")
    + geom_text(
        data=cooksd_out_df[cooksd_out_df["is_outlier"]],
        mapping=aes(label="label"),
        nudge_x=-8, nudge_y=0,
        size=9,
    )
    + scale_color_manual(values={True: "#e74c3c", False: "#2c3e50"})
    + labs(
        x="Observation index",
        y="Cook's distance",
        color="Injected outlier",
        title="Cook's distance after injecting an outlier — outlier is clearly detected",
    )
    + theme_bw()
)

The injected observation stands out dramatically. Compare its diagnostic values to the rest of the dataset:

aug_outlier = hlm_augment(result_outlier, data=df_outlier)

pd.concat([
    aug_outlier[[".leverage", ".cooksd", ".mdffits", ".covtrace", ".covratio"]]
    .describe()
    .loc[["mean", "max"]]
    .rename(index={"mean": "dataset mean", "max": "dataset max"}),
    aug_outlier.iloc[[-1]][
        [".leverage", ".cooksd", ".mdffits", ".covtrace", ".covratio"]
    ].rename(index={n2 - 1: "injected outlier"}),
]).round(4)

9. lmer_influence_measures() — HLMdiag-parity all-in-one

lmer_influence_measures() combines case-deletion diagnostics, analytical leverage, and DFBETAS into a single dict that matches R’s HLMdiag::hlm_influence output exactly. Use it when you need full R parity or when you want all measures in one call.

For large models, pass n_jobs=-1 to parallelise the case-deletion loop across all available CPUs (Linux: fork, fast; macOS/Windows: spawn, slower).

from interlace.influence import lmer_influence_measures

# n_jobs=1 (default) is sequential; n_jobs=-1 uses all CPUs
measures = lmer_influence_measures(result, n_jobs=1, show_progress=False)
print(list(measures.keys()))
# ['cooks', 'hat', 'hat_overall', 'hat_fixef', 'dfbetas', 'dffits', 'residuals', 'sigma']
# Use the correct hat vector for flagging (matches R's HLMdiag convention)
# - hat_fixef (H1 only) for crossed multi-RE models
# - hat_overall (H1+H2) for single-RE models
n = result.nobs
p = len(result.fe_params)

cooksd_thresh   = 4 / n
leverage_thresh = 2 * p / n

flag_cooksd   = measures['cooks']   > cooksd_thresh
flag_leverage = measures['hat']     > leverage_thresh
flag_both     = flag_cooksd & flag_leverage

print(f"Influential (Cook's D): {flag_cooksd.sum()}")
print(f"High leverage:          {flag_leverage.sum()}")
print(f"Both:                   {flag_both.sum()}")

DFBETAS from lmer_influence_measures

The 'dfbetas' key contains an analytical DFBETAS matrix (n, p) using the same formula as R. This is faster than case-deletion-based DFBETAS and matches HLMdiag’s values:

import pandas as pd
from plotnine import ggplot, aes, geom_tile, scale_fill_gradient2, theme_bw, labs

dfbetas = measures['dfbetas']   # shape (n, p)
threshold = 2 / np.sqrt(n)

dfbetas_df = (
    pd.DataFrame(dfbetas, columns=result.fe_params.index)
    .assign(obs=range(n))
    .melt(id_vars='obs', var_name='coefficient', value_name='dfbetas')
)

(
    ggplot(dfbetas_df, aes(x='coefficient', y='obs', fill='dfbetas'))
    + geom_tile()
    + scale_fill_gradient2(low='#2166ac', mid='white', high='#d73027', midpoint=0)
    + theme_bw()
    + labs(title='Mixed-model DFBETAS', x='Coefficient', y='Observation', fill='DFBETAS')
)

10. OLS DFBETAS — coefficient-level influence for OLS models

DFBETAS measure how much each regression coefficient changes (in units of its standard error) when a single observation is removed. Unlike Cook’s distance — which summarises influence on all coefficients simultaneously — DFBETAS give a per-coefficient, per-observation breakdown.

ols_dfbetas_qr() computes DFBETAS for a statsmodels OLS model via QR decomposition: fully vectorised, no Python loops.

When to use:

  • As a pre-check before fitting the mixed model (stage-1 diagnostics)

  • In two-stage workflows where OLS is used on group-level summaries

  • When you need to know which coefficient an influential observation drives

from interlace.influence import ols_dfbetas_qr
import statsmodels.formula.api as smf

# Fit an OLS model on the same dataset
ols = smf.ols("score ~ hours_studied + prior_gpa", data=df).fit()

# Compute DFBETAS — shape (n_obs, n_params)
dfbetas = ols_dfbetas_qr(ols)
print(f"DFBETAS shape: {dfbetas.shape}")
print(f"Columns correspond to: {ols.model.exog_names}")
# Rule-of-thumb threshold: 2 / sqrt(n), where n is the number of OLS observations
threshold = 2 / np.sqrt(len(dfbetas))
influential_mask = np.any(np.abs(dfbetas) > threshold, axis=1)
print(f"Threshold: {threshold:.3f}")
print(f"Observations exceeding threshold on at least one coefficient: {influential_mask.sum()}")

Heatmap visualisation

A heatmap shows the full DFBETAS matrix at a glance. Rows are observations, columns are coefficients. Dark cells indicate high influence on that coefficient.

dfbetas_df = (
    pd.DataFrame(dfbetas, columns=ols.model.exog_names)
    .assign(obs=range(len(dfbetas)))
    .melt(id_vars="obs", var_name="coefficient", value_name="dfbetas")
)

from plotnine import ggplot, aes, geom_tile, scale_fill_gradient2, theme_bw, labs

(
    ggplot(dfbetas_df, aes(x="coefficient", y="obs", fill="dfbetas"))
    + geom_tile()
    + scale_fill_gradient2(low="#2166ac", mid="white", high="#d73027", midpoint=0)
    + theme_bw()
    + labs(title="OLS DFBETAS", x="Coefficient", y="Observation", fill="DFBETAS")
)

Observations with large absolute DFBETAS for a coefficient are driving that coefficient’s estimate. The same observations may or may not be flagged by Cook’s distance — DFBETAS can reveal influence on a single coefficient that is diluted in the Cook’s D summary.

Note that ols_dfbetas_qr() operates on OLS models only. For coefficient-level influence in mixed models, use the analytical 'dfbetas' matrix returned by lmer_influence_measures() (section 9); hlm_influence()'s mdffits column gives the corresponding multivariate summary.
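That "diluted influence" claim is easy to check for an OLS fit: statsmodels exposes Cook's distances via `ols.get_influence().cooks_distance[0]`, and the set difference of the two flag vectors is a few lines (a sketch):

```python
import numpy as np

def dfbetas_only_flags(dfbetas, cooks_d, n):
    """Indices flagged by DFBETAS (|value| > 2/sqrt(n) on any coefficient)
    but NOT by Cook's distance (> 4/n): influence concentrated on a single
    coefficient that the overall summary dilutes."""
    df_flag = np.any(np.abs(np.asarray(dfbetas)) > 2 / np.sqrt(n), axis=1)
    cd_flag = np.asarray(cooks_d) > 4 / n
    return np.where(df_flag & ~cd_flag)[0]

# e.g. dfbetas_only_flags(dfbetas, ols.get_influence().cooks_distance[0], len(dfbetas))
```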


Acting on diagnostic results

Diagnostics answer the question "is my model adequate?" Below is a decision workflow for the most common findings.

Residual patterns suggest model misspecification

| Pattern | Likely cause | Action |
|---|---|---|
| Fan shape (variance increases with fitted) | Heteroscedasticity | Transform response (log/sqrt) or consider a GLMM |
| Curve in residual-vs-fitted | Missing non-linear term | Add a polynomial or interaction term |
| Heavy tails in Q–Q plot | Non-normal errors | Check for data errors; consider robust estimation |
| Systematic trend by group | Missing random slope | Fit a random slope for that predictor |

Checking for a missing random slope: If conditional residuals show a systematic trend with a predictor within groups, a random slope may be warranted:

Checking for a missing random slope: If conditional residuals show a systematic trend with a predictor within groups, a random slope may be warranted:

# Refit with a by-student random slope for hours_studied
result_slopes = interlace.fit(
    "score ~ hours_studied + prior_gpa",
    data=df,
    random=["(1 + hours_studied | student_id)", "(1 | school_id)"],
)
# Then compare with LRT — see model-comparison.md
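The LRT itself is standard arithmetic; a minimal sketch, assuming both results expose a log-likelihood attribute (`.llf` here is an assumed name) and noting that the naive chi-square p-value is conservative because the slope variance is tested at its boundary:

```python
from scipy import stats

def lrt(llf_null, llf_alt, df_diff):
    """Likelihood-ratio test: 2 * (llf_alt - llf_null) against a chi-square.
    For a random slope, df_diff = 2 (slope variance plus intercept-slope
    covariance). The naive chi-square p-value is conservative when a
    variance component sits on the boundary under the null."""
    stat = 2.0 * (llf_alt - llf_null)
    return stat, stats.chi2.sf(stat, df_diff)

# e.g. lrt(result.llf, result_slopes.llf, df_diff=2)  # .llf attribute assumed
```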

High leverage + high Cook’s D: influential observation

Steps:

  1. Identify: aug[aug[".cooksd"] > 4/n]

  2. Check for data errors (typos, unit mismatches)

  3. Refit without the observation and compare fixed-effect estimates

  4. If estimates change substantially, report both models

aug = hlm_augment(result, data=df)
drop_idx = aug[".cooksd"].idxmax()

result_sens = interlace.fit(
    "score ~ hours_studied + prior_gpa",
    data=df.drop(index=drop_idx),
    groups=["student_id", "school_id"],
)
print("Original:", result.fe_params.round(3).to_dict())
print("Without obs", drop_idx, ":", result_sens.fe_params.round(3).to_dict())

High RVC: influential observation for a variance component

Large rvc.<name> often occurs when a group has very few observations (especially singletons). This is expected. Check group sizes:

print(df.groupby("student_id").size().sort_values())
# Groups with n=1 will always have high RVC — treat their BLUPs with caution

Diagnostic thresholds — quick reference

| Diagnostic | Guideline threshold | What it flags |
|---|---|---|
| Overall leverage | > 2p/n | Unusual predictor values |
| Cook’s distance | > 4/n or > 1 | Influence on fixed-effect estimates |
| MDFFITS | > 2√(p/n) | Scale-free fixed-effect influence |
| COVTRACE | \|value\| > p | Influence on fixed-effect precision |
| COVRATIO | < 0.9 or > 1.1 | Influence on covariance determinant |
| RVC | \|value\| > 0.5 | Influence on a specific variance component |

All thresholds are heuristics. An observation that exceeds a threshold is worth inspecting, not automatically removed. Ask:

  1. Is it a data-entry error?

  2. Is it a legitimate extreme value that the model should fit?

  3. Does removing it change your substantive conclusions?

If removal does not change conclusions, the result is robust. If it does, report both analyses and discuss the influential case.


Where to go next