Cross-Validation Guide

Cross-validation estimates how well a fitted model generalises to unseen data. For mixed models, the key challenge is data leakage: observations within the same group share a random effect, so splitting at the observation level lets the test set “peek” at training-set information. interlace.cross_val() avoids this by splitting at the group level.


Why group-level CV?

Suppose you split randomly and a school appears in both training and test sets. The model’s BLUP for that school is informed by its training observations, giving an optimistic error estimate on the test observations from the same school. Group-level CV holds out entire groups, giving a realistic picture of prediction error for new, unseen groups.
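
To make the distinction concrete, here is a minimal sketch of a group-level split done by hand with NumPy, assuming a pandas DataFrame df with a school_id column. cross_val() does this internally; you never need to split by hand:

import numpy as np

# Hold out ~20% of schools, not 20% of rows. No school_id appears in
# both halves, so the held-out schools contribute nothing to the fit.
rng = np.random.default_rng(0)
schools = df["school_id"].unique()
test_schools = rng.choice(schools, size=max(1, len(schools) // 5), replace=False)

test_mask = df["school_id"].isin(test_schools)
train_df, test_df = df[~test_mask], df[test_mask]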



K-fold by groups

cv="kfold" randomly partitions the unique group labels into k folds (default k=5). Each fold holds out all observations from roughly n_groups / k groups. This is faster than LOGO while still preventing leakage.

from interlace import cross_val

cv_k = cross_val(
    "score ~ hours_studied + prior_gpa",
    data=df,
    groups="school_id",
    cv="kfold",
    k=5,
    scoring="rmse",
)

print(f"5-fold CV RMSE: {cv_k.mean:.3f} ± {cv_k.std:.3f}")

The fold assignment uses a fixed seed (42) for reproducibility.
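
For intuition, the assignment is roughly equivalent to the sketch below; the library's internal scheme may differ in detail, but the seed plays the same role:

import numpy as np

# Shuffle the unique group labels with a fixed seed, then split the
# shuffled sequence into k contiguous chunks -- one chunk per fold.
rng = np.random.default_rng(42)
shuffled = rng.permutation(df["school_id"].unique())
folds = np.array_split(shuffled, 5)  # folds[i] = groups held out in fold i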


Scoring

Built-in metrics

scoring=              Metric
"rmse" (default)      Root mean squared error
"mae"                 Mean absolute error
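
For reference, the two built-ins correspond to the usual formulas. The sketch below shows equivalent NumPy implementations; cross_val() computes these internally:

import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: square root of the mean squared residual."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mae(y_true, y_pred):
    """Mean absolute error: mean of the absolute residuals."""
    return float(np.mean(np.abs(y_true - y_pred)))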

Custom scoring function

Pass any callable with the signature scorer(y_true, y_pred) -> float:

import numpy as np

def mape(y_true, y_pred):
    """Mean absolute percentage error (undefined if y_true contains zeros)."""
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

cv = cross_val(
    "score ~ hours_studied",
    data=df,
    groups="school_id",
    scoring=mape,
)
print(f"MAPE: {cv.mean:.1f}%")

Passing fit arguments

Extra keyword arguments are forwarded to interlace.fit(). Use this to select a different optimizer, set the random-effect structure, or switch to maximum-likelihood (ML) estimation:

cv = cross_val(
    "score ~ hours_studied + prior_gpa",
    data=df,
    groups="school_id",
    cv="logo",
    random=["(1 + hours_studied | school_id)"],   # random slopes
    optimizer="bobyqa",
)

Inspecting per-fold models

Set return_models=True to store the fitted model and prediction details for each fold. Each element of cv.fold_results is a dict with keys:

  • "model" — the fitted CrossedLMEResult for that fold

  • "train_groups" — group labels used for training

  • "test_groups" — group labels held out

  • "y_true" — observed response in the test set

  • "y_pred" — model predictions on the test set

cv = cross_val(
    "score ~ hours_studied + prior_gpa",
    data=df,
    groups="school_id",
    cv="logo",
    return_models=True,
)

for fold in cv.fold_results:
    test_g  = fold["test_groups"]
    rmse    = np.sqrt(np.mean((fold["y_true"] - fold["y_pred"]) ** 2))
    print(f"Held-out group: {test_g}, RMSE: {rmse:.3f}")

CVResult reference

cross_val() returns a CVResult dataclass:

Attribute / property   Type                 Description
scores                 np.ndarray           Per-fold score values (length = number of folds)
mean                   float                Mean of scores
std                    float                Standard deviation of scores (ddof=1)
fold_results           list[dict] | None    Only populated when return_models=True


Tips

  • Prefer LOGO when groups are few (< 20): each group gets exactly one held-out fold, and the variance estimate is more informative.

  • Prefer kfold when groups are many (≥ 50): LOGO becomes expensive; 5- or 10-fold CV is a good trade-off.

  • Interpret cv.std carefully: with LOGO, fold scores are correlated (each fold removes one group from a common pool), so std underestimates true uncertainty; see the sketch after this list.

  • Use return_models=True sparingly: storing one fitted model per fold can consume substantial memory for large datasets.
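
As a rough illustration of that caveat, the naive standard error below treats fold scores as independent, which they are not under LOGO. Read it as an optimistic lower bound, not a calibrated uncertainty; it uses only the documented CVResult attributes:

import numpy as np

# Naive standard error of the mean score. Under LOGO the fold scores
# are positively correlated, so this understates the true uncertainty.
se_naive = cv.std / np.sqrt(len(cv.scores))
print(f"Mean score: {cv.mean:.3f} (naive SE: {se_naive:.3f}, lower bound)")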


See also