Interpreting model results

After calling interlace.fit(), you receive a CrossedLMEResult object. This page explains what each part of the output means and how to use it to draw conclusions from your model.

Throughout this page we use a running example: exam scores modelled as a function of hours studied and prior GPA, with students and schools as crossed grouping factors.

import numpy as np
import pandas as pd
import interlace

rng = np.random.default_rng(42)
n = 300

df = pd.DataFrame({
    "score":         60 + rng.normal(0, 8, n),
    "hours_studied": rng.uniform(0, 10, n),
    "prior_gpa":     rng.uniform(2.0, 4.0, n),
    "student_id":    [f"s{i}"   for i in rng.integers(1, 31, n)],
    "school_id":     [f"sch{i}" for i in rng.integers(1, 11, n)],
})

result = interlace.fit(
    "score ~ hours_studied + prior_gpa",
    data=df,
    groups=["student_id", "school_id"],
)
/tmp/ipykernel_2509/4095886218.py:16: UserWarning: boundary (singular) fit: see help(interlace.isSingular)

Fixed effects

Fixed effects capture the population-level relationships between predictors and the outcome, controlling for the grouping structure.

The result exposes four attributes for fixed-effect inference:

| Attribute   | Contents                                                         |
|-------------|------------------------------------------------------------------|
| fe_params   | Coefficient estimates (pd.Series)                                |
| fe_bse      | Standard errors                                                  |
| fe_pvalues  | Two-sided p-values (normal approximation)                        |
| fe_conf_int | 95% confidence intervals (pd.DataFrame, columns [0.025, 0.975])  |

Reading the coefficients

fe_params is a pandas.Series indexed by term name. Each coefficient is the expected change in the outcome per unit increase in the predictor, holding grouping structure constant — i.e., the within-group effect.

result.fe_params
Intercept        60.454607
hours_studied     0.032509
prior_gpa        -0.322930
dtype: float64

Standard errors and inference

Standard errors in fe_bse account for the clustering: they are derived from the fixed-effect covariance matrix fe_cov = scale * (X'Ω⁻¹X)⁻¹, where X is the fixed-effect design matrix (one row per observation, one column per predictor including the intercept) and Ω is the marginal covariance matrix of the response that encodes the random effect structure. This is why they are typically larger than OLS standard errors on the same data — the effective sample size is reduced by the grouping.
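Under this definition, fe_bse should equal the square roots of the diagonal of fe_cov. A minimal sanity check, assuming the rows and columns of fe_cov are ordered the same way as fe_params:

import numpy as np

se_from_cov = np.sqrt(np.diag(result.fe_cov))               # diagonal entries are the coefficient variances
print(np.allclose(se_from_cov, result.fe_bse.to_numpy()))   # expected: True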

A convenient summary table:

pd.DataFrame({
    "estimate": result.fe_params,
    "se":       result.fe_bse,
    "p":        result.fe_pvalues,
    "ci_lo":    result.fe_conf_int.iloc[:, 0],
    "ci_hi":    result.fe_conf_int.iloc[:, 1],
})
                estimate        se         p      ci_lo      ci_hi
Intercept      60.454607  2.446897  0.000000  55.658688  65.250526
hours_studied   0.032509  0.149269  0.827601  -0.260059   0.325077
prior_gpa      -0.322930  0.714485  0.651300  -1.723321   1.077460

P-values use a normal approximation (z-test), which is appropriate for large samples. For small samples with few groups, consider using likelihood-ratio tests or parametric bootstrap instead.
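The z-test is straightforward to reproduce by hand. A minimal sketch recomputing fe_pvalues from the estimates and standard errors:

import numpy as np
from scipy.stats import norm

z = result.fe_params / result.fe_bse        # Wald z-statistics
p_manual = 2 * norm.sf(np.abs(z))           # two-sided p-values under the normal approximation
print(np.allclose(p_manual, result.fe_pvalues))   # expected: True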

Comparing models with different fixed structures

To test whether a predictor improves fit, refit with method="ML" (not REML) and compare log-likelihoods. Use REML (the default) for your final reported estimates; use ML only for likelihood-ratio tests comparing fixed structures. See Concepts: REML for the reason.

from scipy.stats import chi2

m1 = interlace.fit("score ~ hours_studied",              data=df, groups=["student_id", "school_id"], method="ML")
m2 = interlace.fit("score ~ hours_studied + prior_gpa", data=df, groups=["student_id", "school_id"], method="ML")

lr_stat = 2 * (m2.llf - m1.llf)
df_diff = 1  # one extra parameter in m2
p_value = chi2.sf(lr_stat, df=df_diff)

print(f"LR statistic: {lr_stat:.3f}  (df={df_diff})")
print(f"p-value:      {p_value:.4f}")
LR statistic: 0.210  (df=1)
p-value:      0.6470
/tmp/ipykernel_2509/2957097658.py:3: UserWarning: boundary (singular) fit: see help(interlace.isSingular)
/tmp/ipykernel_2509/2957097658.py:4: UserWarning: boundary (singular) fit: see help(interlace.isSingular)

Variance components

Variance components quantify how much variability in the outcome is attributable to each grouping factor and to residual within-group noise.

Reading variance components

print("Between-group variances:")
for g, v in result.variance_components.items():
    print(f"  {g}: σ² = {v:.3f}  (σ ≈ {v**0.5:.2f} score points)")

print(f"\nResidual variance (scale): σ² = {result.scale:.3f}  (σ ≈ {result.scale**0.5:.2f} score points)")
Between-group variances:
  student_id: σ² = 4.513  (σ ≈ 2.12 score points)
  school_id: σ² = 0.000  (σ ≈ 0.00 score points)

Residual variance (scale): σ² = 51.391  (σ ≈ 7.17 score points)

Intraclass correlation (ICC)

The ICC tells you what fraction of total variance each grouping factor explains. An ICC below ~0.05 suggests the grouping factor contributes little and may not need a random intercept.

total = sum(result.variance_components.values()) + result.scale

icc = {g: v / total for g, v in result.variance_components.items()}

for g, v in icc.items():
    print(f"ICC({g}): {v:.3f}  →  {v*100:.1f}% of variance explained by {g}")
ICC(student_id): 0.081  →  8.1% of variance explained by student_id
ICC(school_id): 0.000  →  0.0% of variance explained by school_id

What if a variance component is near zero?

A variance component close to zero (sometimes reported as exactly 0 at the boundary of the parameter space) means the corresponding grouping factor explains almost no variance in the outcome after conditioning on the fixed effects. This is what triggered the boundary (singular) fit warning when the model was fitted above. It is not an error: it is a valid result indicating that the groups in that factor are essentially homogeneous. You may choose to drop that grouping factor and refit, or retain it for theoretical reasons.
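If you do decide to drop the factor, the refit is a one-liner. A sketch for this example (both fits use REML with the same fixed structure, so their AICs are comparable; see Model fit below):

result_reduced = interlace.fit(
    "score ~ hours_studied + prior_gpa",
    data=df,
    groups=["student_id"],   # school_id dropped: its variance component was 0
)
print(result.aic, result_reduced.aic)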


Model fit

print(f"AIC:       {result.aic:.2f}")
print(f"BIC:       {result.bic:.2f}")
print(f"log-lik:   {result.llf:.2f}")
print(f"converged: {result.converged}")
AIC:       2061.00
BIC:       2083.22
log-lik:   -1024.50
converged: True

AIC and BIC

AIC and BIC are used to compare models with the same fixed structure fitted with REML, or models with different fixed structures fitted with ML. Both penalise model complexity; BIC penalises more heavily for large samples.

AIC is defined as −2 log L + 2k (Akaike, 1974); BIC as −2 log L + k log n (Schwarz, 1978), where k is the number of estimated parameters and n the number of observations.
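As an arithmetic check against the output above: this fit estimates k = 6 parameters (three fixed effects, two variance components, and the residual variance) from n = 300 observations, which reproduces the printed AIC and BIC:

import numpy as np

k = 6
aic_manual = -2 * result.llf + 2 * k                     # -2 × (-1024.50) + 12   ≈ 2061.00
bic_manual = -2 * result.llf + k * np.log(result.nobs)   # 2049.00 + 6 log 300    ≈ 2083.22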

A difference of ≥ 4 AIC points between models is often treated as meaningful; a difference of 1–3 is weak evidence. These are heuristics — use them alongside subject-matter reasoning, not as automatic decision rules.
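For example, the two ML fits from the likelihood-ratio test earlier can be compared directly. Given the LR statistic of 0.210 and one extra parameter, m2's AIC is higher by 2 − 0.210 ≈ 1.79, i.e. weak evidence against the extra term:

delta_aic = m2.aic - m1.aic
print(f"AIC(m2) - AIC(m1): {delta_aic:.2f}")   # ≈ 1.79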

References

Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19(6), 716–723. doi:10.1109/TAC.1974.1100705

Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics, 6(2), 461–464. doi:10.1214/aos/1176344136

Checking convergence

Always verify convergence before reporting results. Non-convergence typically occurs with very small datasets, degenerate variance (one group per level), or near-zero variance components.

if not result.converged:
    print("Warning: optimiser did not converge — estimates may be unreliable")
else:
    print("Converged successfully.")
Converged successfully.

Random effects (BLUPs)

BLUPs (Best Linear Unbiased Predictors) are the estimated deviations of each group from the population mean. They are exposed as result.random_effects, a dict mapping each grouping factor to a pd.Series (random-intercept models) or a pd.DataFrame (random-slopes models).

Reading BLUPs

student_blups = result.random_effects["student_id"]
school_blups  = result.random_effects["school_id"]

student_blups.sort_values()
s17   -2.813086
s12   -2.566059
s18   -1.792370
s9    -1.789333
s3    -1.499837
s22   -1.007045
s11   -0.940747
s16   -0.891217
s21   -0.824995
s6    -0.657869
s19   -0.594837
s28   -0.518625
s2    -0.512487
s7    -0.369438
s14   -0.158327
s30    0.032735
s29    0.120162
s26    0.324681
s5     0.325169
s20    0.363378
s8     0.379302
s23    0.605424
s15    0.829710
s1     1.317096
s13    1.367523
s27    1.627519
s24    1.823487
s4     2.066743
s25    2.825764
s10    2.927580
Name: student_id, dtype: float64

Each value is the estimated deviation of that group from the population mean, in the same units as the outcome (exam score points). A positive BLUP means the group scores above average after accounting for the fixed predictors and other grouping factors.
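For intercept-only models like this one, a conditional prediction decomposes additively into the fixed-effects part plus the BLUPs of the observation's groups. A sketch of that identity for the first row, assuming predict() (covered under Predictions below) returns values aligned with the training rows:

import numpy as np

i = 0
sid, sch = df.loc[i, "student_id"], df.loc[i, "school_id"]
marginal_i    = np.asarray(result.predict(include_re=False))[i]   # fixed effects only
conditional_i = np.asarray(result.predict())[i]                   # fixed effects + BLUPs
print(np.isclose(conditional_i, marginal_i + student_blups[sid] + school_blups[sch]))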

For random slopes models, random_effects[group] is a pd.DataFrame with one column per term (intercept + slope).

Shrinkage

BLUPs are shrunk toward zero relative to the group’s raw deviation. Groups with fewer observations or groups whose variance component is small are shrunk more. This is intentional: a student with only one observation should not be assigned an extreme intercept — the BLUP borrows strength from the population distribution.
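For a model with a single random intercept, shrinkage has a closed form: the raw group mean deviation is multiplied by n_g·σ²_g / (n_g·σ²_g + σ²), where n_g is the number of observations in the group. With crossed factors this is only approximate, but it still shows which students are shrunk hardest. A sketch under that caveat:

sigma2_student = result.variance_components["student_id"]
n_g = df.groupby("student_id").size()
shrink_factor = (n_g * sigma2_student) / (n_g * sigma2_student + result.scale)
print(shrink_factor.sort_values().head())   # smaller factor = stronger shrinkage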

The plot below makes shrinkage visible: each point is one student. Points fall between the dashed identity line and the horizontal axis because each BLUP is pulled toward zero relative to the raw group mean; the further a point lies from the line, the stronger the shrinkage.

from plotnine import ggplot, aes, geom_point, geom_abline, labs, theme_bw

# Raw (no-pooling) deviations: group means of the *marginal* residuals
# (observed minus the fixed-effects-only prediction). Note result.resid is
# conditional, i.e. it already has the BLUPs subtracted.
raw = (
    df.assign(resid=df["score"] - result.predict(include_re=False))
      .groupby("student_id")["resid"].mean()
)

shrinkage_df = pd.DataFrame({
    "raw_mean_resid": raw,
    "blup":           student_blups.reindex(raw.index),
}).reset_index()

(
    ggplot(shrinkage_df, aes(x="raw_mean_resid", y="blup"))
    + geom_abline(slope=1, intercept=0, linetype="dashed", color="grey")
    + geom_point(alpha=0.7)
    + labs(
        x="Raw group mean residual (no shrinkage)",
        y="BLUP (shrunk estimate)",
        title="Shrinkage: BLUPs vs raw group means",
    )
    + theme_bw()
)

BLUPs vs fixed group dummies

BLUPs are not the same as including a fixed dummy variable for each group. Dummies use only the data from that group; BLUPs partially pool across groups. For groups with very many observations, BLUPs converge to the dummy-variable estimate. For small or sparse groups, BLUPs are much more stable.
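A rough numeric view of the same comparison, reusing the raw (no-pooling) group means computed for the shrinkage plot above as a stand-in for per-group dummy estimates (a simplification: a real dummy fit would estimate all groups jointly):

comparison = pd.DataFrame({
    "no_pooling": raw,                                 # from the shrinkage plot above
    "blup":       student_blups.reindex(raw.index),
})
comparison["shrunk_by"] = comparison["no_pooling"] - comparison["blup"]
print(comparison.head())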


Predictions

result.predict() returns fitted values, optionally including random effects.

|                     | Conditional (include_re=True)              | Marginal (include_re=False)    |
|---------------------|--------------------------------------------|--------------------------------|
| Includes BLUPs      | Yes                                        | No                             |
| Use case            | In-sample fit, known groups                | New groups, population-level   |
| Unseen group levels | BLUP = 0 (shrinks to mean), identical to marginal | Unchanged (BLUPs never used) |

Conditional predictions are appropriate when you know which group an observation belongs to and want the best estimate for that specific group. They are used to compute residuals (result.resid = observed − conditional prediction).
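That identity is easy to verify, assuming resid and predict() are aligned with the training rows:

import numpy as np

print(np.allclose(result.resid, df["score"].to_numpy() - np.asarray(result.predict())))   # expected: True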

Marginal predictions give the expected outcome for a hypothetical average group — useful for reporting population-level effects or when the new data includes groups not seen during training.

# Conditional predictions (uses BLUPs, default)
y_conditional = result.predict()

# Marginal predictions (fixed effects only)
y_marginal = result.predict(include_re=False)

print(f"Conditional predictions — mean: {y_conditional.mean():.2f}, std: {y_conditional.std():.2f}")
print(f"Marginal predictions    — mean: {y_marginal.mean():.2f}, std: {y_marginal.std():.2f}")
Conditional predictions — mean: 59.67, std: 1.43
Marginal predictions    — mean: 59.67, std: 1.43

Predicting on new data

Unseen group levels automatically receive a BLUP of zero — the population mean for that factor. This is the correct Bayesian-optimal behaviour under the model’s assumptions.

df_new = pd.DataFrame({
    "hours_studied": [5.0, 8.0, 5.0],
    "prior_gpa":     [3.2, 3.8, 3.2],
    "student_id":    ["s1", "s2", "s_new"],   # s_new is unseen → BLUP = 0
    "school_id":     ["sch1", "sch1", "sch1"],
})

preds = result.predict(newdata=df_new)

df_new.assign(predicted_score=preds.round(2))
   hours_studied  prior_gpa student_id school_id  predicted_score
0            5.0        3.2         s1      sch1            60.90
1            8.0        3.8         s2      sch1            58.98
2            5.0        3.2      s_new      sch1            59.58

Bootstrap standard error

For scalar statistics of the outcome (currently: median), bootstrap_se() computes a cluster-bootstrap standard error that respects the grouping structure.

Ordinary bootstrap resamples individual observations, which underestimates the SE when group variance is substantial. Cluster bootstrap resamples group levels with replacement, then includes all observations from the sampled groups — preserving within-group correlation.

Use resample_level="observation" only if you have verified that the ICC is negligible.

se_cluster     = result.bootstrap_se(statistic="median", n_bootstrap=500, resample_level="group",       seed=42)
se_observation = result.bootstrap_se(statistic="median", n_bootstrap=500, resample_level="observation", seed=42)

print(f"Median: {np.median(df['score']):.2f}")
print(f"Cluster-bootstrap SE:     {se_cluster:.4f}")
print(f"Observation-bootstrap SE: {se_observation:.4f}  (underestimates when ICC > 0)")
Median: 59.30
Cluster-bootstrap SE:     0.8252
Observation-bootstrap SE: 0.5958  (underestimates when ICC > 0)
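To make the mechanics concrete, here is a minimal hand-rolled cluster bootstrap for the median. It resamples student_id levels only, so it is a simplification; how the library handles the crossed school_id factor is not shown:

rng_boot = np.random.default_rng(0)

def cluster_bootstrap_median_se(data, group_col, n_boot=500):
    # Resample whole groups with replacement, keep every observation from
    # each sampled group, then recompute the statistic on the stacked sample.
    groups = [sub["score"].to_numpy() for _, sub in data.groupby(group_col)]
    medians = []
    for _ in range(n_boot):
        idx = rng_boot.integers(0, len(groups), size=len(groups))
        medians.append(np.median(np.concatenate([groups[i] for i in idx])))
    return np.std(medians, ddof=1)

print(f"Hand-rolled cluster SE: {cluster_bootstrap_median_se(df, 'student_id'):.4f}")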

Quick reference

| Attribute           | Type         | What it tells you                      |
|---------------------|--------------|----------------------------------------|
| fe_params           | pd.Series    | Fixed-effect coefficient estimates     |
| fe_bse              | pd.Series    | Standard errors of fixed effects       |
| fe_pvalues          | pd.Series    | Two-sided p-values (z-test)            |
| fe_conf_int         | pd.DataFrame | 95% confidence intervals               |
| fe_cov              | np.ndarray   | Fixed-effect covariance matrix         |
| variance_components | dict         | Between-group variance per factor      |
| scale               | float        | Within-group (residual) variance σ²    |
| random_effects      | dict         | BLUPs per grouping factor              |
| fittedvalues        | np.ndarray   | Conditional fitted values              |
| resid               | np.ndarray   | Conditional residuals                  |
| aic                 | float        | Akaike Information Criterion           |
| bic                 | float        | Bayesian Information Criterion         |
| llf                 | float        | Log-likelihood at optimum              |
| converged           | bool         | Whether the optimiser converged        |
| nobs                | int          | Number of observations                 |
| ngroups             | dict         | Number of levels per grouping factor   |


Where to go next