Simulation and Bootstrap Guide

interlace provides two functions for simulation-based inference on fitted mixed models:

  • simulate() — draw new response vectors from the fitted model, using the same random-effect structure and parameter estimates

  • bootMer() — run a full parametric bootstrap: simulate, refit, and collect any statistic you define

Together these cover three common workflows: bootstrap standard errors and confidence intervals, power analysis, and simulation-based model checking (posterior predictive checks).


simulate() — drawing from the fitted model

simulate() draws new response vectors from the fitted model, sampling fresh random effects u* and residuals ε*:

y* = X β + Z u* + ε*

where u* ~ N(0, σ² ΛΛᵀ) and ε* ~ N(0, σ² Iₙ).

import interlace
import numpy as np

result = interlace.fit("score ~ time + treatment", data=df, groups="subject")

# Draw 500 simulated response vectors (shape: n × 500)
y_sim = interlace.simulate(result, nsim=500, seed=42)

The return value is an (n, nsim) array — each column is one draw from the model’s predictive distribution. A single draw (nsim=1) gives an (n, 1) array.
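Under the hood, each draw follows the equation above. A minimal self-contained sketch of the same mechanics in plain NumPy (a toy random-intercept model with made-up dimensions and parameter values, not the interlace internals):

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy random-intercept model: 10 subjects, 5 observations each (assumed sizes)
n_subjects, n_obs = 10, 5
n = n_subjects * n_obs
subject = np.repeat(np.arange(n_subjects), n_obs)

X = np.column_stack([np.ones(n), rng.normal(size=n)])   # intercept + one covariate
beta = np.array([2.0, 0.5])                             # "estimated" fixed effects
sigma_u, sigma_e = 1.0, 0.8                             # "estimated" SDs

def simulate_draws(nsim):
    """Each column is one draw of y* = X beta + Z u* + eps*."""
    out = np.empty((n, nsim))
    for j in range(nsim):
        u = rng.normal(0.0, sigma_u, n_subjects)  # fresh random intercepts u*
        eps = rng.normal(0.0, sigma_e, n)         # fresh residuals eps*
        out[:, j] = X @ beta + u[subject] + eps
    return out

y_sim = simulate_draws(500)
print(y_sim.shape)   # (50, 500)
```

Because u* is redrawn on every replicate, the draws come from the marginal distribution of y, not from the distribution conditional on the estimated random effects.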

When to use simulate() directly

simulate() is the building block for the two higher-level workflows below, but it is also useful on its own when you want full control over the simulated draws — for example, to compute a custom summary across replicates without the overhead of refitting the model.
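For instance, given the (n, nsim) array of draws, a column-wise summary is just a NumPy reduction (sketch with a stand-in y_sim in place of the real simulate() output):

```python
import numpy as np

rng = np.random.default_rng(0)
y_sim = rng.normal(size=(100, 500))   # stand-in for interlace.simulate(...)

# One custom summary per simulated replicate: e.g. the sample maximum
max_per_draw = y_sim.max(axis=0)      # shape: (500,)

# Predictive distribution of that summary
print(max_per_draw.mean(), np.percentile(max_per_draw, [2.5, 97.5]))
```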


Bootstrap standard errors and CIs with bootMer()

bootMer() runs a parametric bootstrap: for each of B replicates it simulates y*, refits the model on the simulated data, and records a statistic vector. The default statistic collects [fixed effects..., σ, θ...].

boot = interlace.bootMer(result, B=500, seed=42, show_progress=True)

# Boot SEs for each parameter
se = boot.estimates.std(axis=0)
print(se)

# 95% percentile CIs (shape: p × 2, columns are [lower, upper])
ci = boot.ci(level=0.95)
print(ci)

boot.estimates has shape (B, p) where p is the length of the statistic vector.
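The SE and percentile-CI computations above reduce to NumPy operations on that (B, p) array. A sketch with synthetic estimates (illustrative values, not the BootResult internals):

```python
import numpy as np

rng = np.random.default_rng(1)
B, p = 500, 3
# Synthetic bootstrap estimates: three parameters, SE ~ 0.1 each
estimates = rng.normal(loc=[1.0, -0.5, 2.0], scale=0.1, size=(B, p))

se = estimates.std(axis=0, ddof=1)                    # bootstrap SE per parameter
ci = np.percentile(estimates, [2.5, 97.5], axis=0).T  # shape (p, 2): [lower, upper]
print(se.shape, ci.shape)   # (3,) (3, 2)
```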

Custom statistic

Pass any callable with signature (CrossedLMEResult) → np.ndarray as the statistic argument:

def fixed_effects_only(r):
    return np.asarray(r.fe_params)

boot = interlace.bootMer(result, statistic=fixed_effects_only, B=500, seed=42)

# Columns: one per fixed-effect coefficient
print(boot.estimates.shape)       # (500, p)
print(boot.ci())                  # (p, 2)

Bootstrap CIs vs Wald CIs

Bootstrap CIs are asymmetry-aware: if the sampling distribution of a variance component is right-skewed, the percentile CI will correctly be wider on the upper end. Wald CIs from result.conf_int() assume approximate normality, which is reasonable for fixed effects but can be misleading for variance components and their ratios.

Use bootMer() when:

  • Reporting CIs for variance components or ICC (right-skewed sampling distributions)

  • The sample is small and normality of the fixed-effect estimators is in doubt

  • You want a custom statistic with no closed-form SE (e.g. a ratio of coefficients, a predicted difference between two groups)
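The asymmetry point can be seen directly by bootstrapping a variance estimate, whose sampling distribution is right-skewed for small samples, and comparing the percentile CI with a symmetric Wald-style interval (self-contained sketch, not interlace code):

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.normal(0.0, 1.0, size=20)   # small sample, true variance = 1
var_hat = x.var(ddof=1)

# Parametric bootstrap of the sample variance
boot = np.array([rng.normal(0.0, np.sqrt(var_hat), 20).var(ddof=1)
                 for _ in range(2000)])

se = boot.std(ddof=1)
wald = (var_hat - 1.96 * se, var_hat + 1.96 * se)   # symmetric by construction
perc = tuple(np.percentile(boot, [2.5, 97.5]))      # asymmetry-aware

print(f"Wald:       ({wald[0]:.2f}, {wald[1]:.2f})")
print(f"Percentile: ({perc[0]:.2f}, {perc[1]:.2f})")
```

For a right-skewed statistic the percentile interval extends further above the point estimate than below it, while the Wald interval can even dip below zero.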


Power analysis

The parametric bootstrap is the standard approach for power analysis in mixed models, because closed-form power formulas assume balanced designs and ignore the variability introduced by estimating random effects.

The recipe:

  1. Specify the data-generating model (effect sizes and variance components).

  2. Simulate a dataset from that model.

  3. Fit the model and test the effect of interest.

  4. Repeat many times and count how often the test is significant.

import numpy as np
import pandas as pd
import interlace

rng = np.random.default_rng(2024)

def power_lmm(
    n_subjects: int,
    n_obs: int,
    beta_treatment: float,
    sigma_subject: float = 1.0,
    sigma_resid: float = 1.0,
    B: int = 500,
    alpha: float = 0.05,
) -> float:
    """Estimated power for a two-group treatment effect in a random-intercept LMM."""
    n = n_subjects * n_obs
    significant = 0

    for _ in range(B):
        # Simulate data from the assumed model
        subject = np.repeat(np.arange(n_subjects), n_obs)
        treatment = (subject % 2).astype(float)          # half treated
        u = rng.normal(0, sigma_subject, n_subjects)     # random intercepts
        eps = rng.normal(0, sigma_resid, n)
        y = beta_treatment * treatment + u[subject] + eps

        df = pd.DataFrame({"y": y, "treatment": treatment, "subject": subject})
        result = interlace.fit("y ~ treatment", data=df, groups="subject")

        # Test: is the treatment effect significant at level alpha?
        pval = result.pvalues["treatment"]
        if pval < alpha:
            significant += 1

    return significant / B


power = power_lmm(
    n_subjects=40,
    n_obs=5,
    beta_treatment=0.5,
)
print(f"Estimated power: {power:.2f}")

Varying sample size

Wrap the function in a grid search to find the required n:

results = {}
for n_subjects in [20, 30, 40, 60, 80]:
    results[n_subjects] = power_lmm(n_subjects=n_subjects, n_obs=5,
                                    beta_treatment=0.5, B=200)
    print(f"n={n_subjects}: power={results[n_subjects]:.2f}")
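With the grid in hand, the smallest sample size reaching a target power is a one-liner (sketch using an illustrative results dict; your numbers will differ):

```python
# Illustrative power-curve results (n_subjects -> estimated power)
results = {20: 0.41, 30: 0.58, 40: 0.72, 60: 0.86, 80: 0.94}

target = 0.80
required_n = min(n for n, power in results.items() if power >= target)
print(required_n)   # 60
```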

Tips for power analysis

  • Use realistic variance components. Power is very sensitive to the ICC (ratio of between-subject variance to total variance). If you have pilot data, plug its variance component estimates in rather than guessing.

  • Match your analysis model to your data-generating model. If the true model has random slopes but you power for random intercepts only, your power estimate will be optimistic.

  • Report the assumed effect size and variance. Power results are only interpretable relative to the assumed parameters.
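As a reminder, the ICC in a random-intercept model is the between-subject share of total variance, σ²_subject / (σ²_subject + σ²_resid). Plugging in hypothetical pilot estimates:

```python
sigma_subject_sq = 1.0   # pilot estimate of between-subject variance (assumed)
sigma_resid_sq = 3.0     # pilot estimate of residual variance (assumed)

icc = sigma_subject_sq / (sigma_subject_sq + sigma_resid_sq)
print(icc)   # 0.25
```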


Simulation-based model checking

A posterior predictive check asks: do the data look like they could have come from the fitted model? This is a useful complement to residual plots, especially for catching systematic misfit (e.g. overdispersion, heavy tails, floor effects).

Checking distributional assumptions

import matplotlib.pyplot as plt

result = interlace.fit("score ~ time + treatment", data=df, groups="subject")

# Draw 200 simulated response vectors
y_sim = interlace.simulate(result, nsim=200, seed=0)

# Overlay observed and simulated densities
fig, ax = plt.subplots()
for j in range(y_sim.shape[1]):
    ax.hist(y_sim[:, j], bins=30, alpha=0.02, color="steelblue", density=True)
ax.hist(df["score"], bins=30, alpha=0.7, color="black", density=True, label="Observed")
ax.set_xlabel("score")
ax.legend()
plt.show()

If the observed histogram consistently falls in the tails of the simulated distribution, the model is misspecified in some way (e.g., skewed residuals, outliers, incorrect link function for a GLMM).

Checking group-level summaries

# Compare observed vs simulated group means
observed_group_means = df.groupby("subject")["score"].mean().values

sim_group_means = np.array([
    df.assign(score=y_sim[:, j]).groupby("subject")["score"].mean().values
    for j in range(y_sim.shape[1])
])  # shape: (nsim, n_subjects)

# 95% predictive interval for each group's mean
lo = np.percentile(sim_group_means, 2.5, axis=0)
hi = np.percentile(sim_group_means, 97.5, axis=0)

# How many observed group means fall outside the interval?
outside = np.mean((observed_group_means < lo) | (observed_group_means > hi))
print(f"Fraction of groups outside 95% PI: {outside:.2f}")

A well-calibrated model should have roughly 5% of group means outside the 95% predictive interval. Substantially more than 5% suggests the random effect variance is underestimated.

Checking a scalar summary statistic (posterior predictive p-value)

For a single summary statistic T(y), you can compute a simulation-based p-value:

# Example: kurtosis of the residuals
from scipy.stats import kurtosis

y_obs = np.asarray(df["score"])
T_obs = kurtosis(y_obs - result.fittedvalues)

T_sim = np.array([
    kurtosis(y_sim[:, j] - result.fittedvalues)
    for j in range(y_sim.shape[1])
])

p_ppc = np.mean(T_sim >= T_obs)   # P(T(y*) >= T(y) | model)
print(f"PPC p-value (kurtosis): {p_ppc:.3f}")

A p-value near 0 or 1 indicates the observed statistic is extreme relative to what the model predicts.


Tips

  • Seed everything. Pass an integer seed to both simulate() and bootMer() for reproducible results.

  • Progress bar. bootMer(..., show_progress=True) uses tqdm if installed. Install it with pip install tqdm.

  • Speed. Each bootMer replicate refits the model from scratch. For large datasets or many replicates, budget accordingly — 500 replicates on a model that takes 0.5 s to fit takes ~4 min.

  • Percentile CIs are not bias-corrected. BootResult.ci() uses the raw percentile method. For small B or heavily skewed statistics, the BCa (bias-corrected accelerated) method is more accurate, but requires larger B to be stable.
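For comparison, SciPy ships a BCa implementation in scipy.stats.bootstrap (nonparametric resampling of a scalar statistic, not a drop-in replacement for BootResult.ci()):

```python
import numpy as np
from scipy.stats import bootstrap

rng = np.random.default_rng(3)
x = rng.lognormal(size=50)   # a right-skewed sample

# Percentile vs BCa intervals for the mean
res_perc = bootstrap((x,), np.mean, method="percentile",
                     confidence_level=0.95, random_state=rng)
res_bca = bootstrap((x,), np.mean, method="BCa",
                    confidence_level=0.95, random_state=rng)

print("percentile:", res_perc.confidence_interval)
print("BCa:       ", res_bca.confidence_interval)
```

BCa shifts and rescales the percentile endpoints using a bias-correction term and a jackknife-based acceleration term, which is why it needs more replicates to stabilize.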


See also