# FAQ & Troubleshooting

Common questions, error messages, and tuning guidance for `interlace`.

---

## Data format requirements

**What format does my data need to be in?**

`interlace.fit()` accepts a **pandas DataFrame** in long (tidy) format — one row per observation.

- The response variable and all predictors must be **numeric** columns (int or float).
- Grouping columns (`groups` / `random`) should be **string or categorical** columns that identify group membership; integer IDs work too.
- Column names must be valid Python identifiers and must not clash with formula operators. Avoid names containing `~`, `+`, `*`, `:`, `|`, or spaces.

**Missing values**

Rows with `NaN` in the response, any predictor, or any grouping column are **dropped silently** before fitting. Check `result.nobs` against `len(df)` to confirm how many rows were used.

```python
result = interlace.fit("y ~ x", data=df, groups="firm")
print(f"Used {result.nobs} of {len(df)} rows")
```

**Minimum group sizes**

There is no hard minimum, but groups with only one observation cannot contribute information about within-group variance. Random effects for singleton groups are estimated by shrinkage toward zero; treat their BLUPs with caution.
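To spot singleton or very small groups before fitting, a quick pandas check like the one below works; the grouping column `firm` is a placeholder for your own column name:

```python
# Observations per group, smallest first: singleton groups appear at the top
group_sizes = df.groupby("firm").size().sort_values()
print(group_sizes.head())
print(f"{(group_sizes == 1).sum()} singleton group(s)")
```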
---

## Common error messages

### `ConvergenceWarning: optimizer did not converge`

The REML optimiser hit its iteration limit without finding a stable solution. See [Convergence troubleshooting](#convergence-troubleshooting) below.

### `LinAlgError: singular matrix` or `Matrix is not positive definite`

The model matrix or the estimated covariance matrix is singular. Common causes:

- **Perfect multicollinearity** — a predictor is an exact linear combination of others (e.g. dummy variables that sum to 1). Drop one category or use `pd.get_dummies(..., drop_first=True)`.
- **Near-zero variance component** — a random effect explains essentially no variance. The Hessian becomes ill-conditioned at the boundary. Try removing the offending grouping factor or see [Numerical stability](#numerical-stability).

### `ValueError: groups column 'X' not found in data`

The string passed to `groups=` does not match any column name. Check for trailing spaces or case differences:

```python
print(df.columns.tolist())  # verify exact column names
```

### `ValueError: formula refers to unknown variables`

A variable in the formula is not a column in the DataFrame. Remember that the formula uses Python-style names — check that column names are not reserved words and contain no special characters.

### `KeyError` when accessing `result.random_effects["group"]`

The key must exactly match the column name passed to `groups`. If you passed a list, each element becomes a separate key:

```python
result = interlace.fit("y ~ x", data=df, groups=["student_id", "school_id"])
blups = result.random_effects["student_id"]  # correct
```

---

## Convergence troubleshooting

If the optimiser does not converge, try the following steps in order.

### 1. Rescale your predictors

The REML surface is sensitive to the scale of the design matrix. Centre and standardise continuous predictors before fitting:

```python
from sklearn.preprocessing import StandardScaler

df["x_scaled"] = StandardScaler().fit_transform(df[["x"]])
model = interlace.fit("y ~ x_scaled", data=df, groups="firm")
```

### 2. Switch optimisers

The default L-BFGS-B optimiser uses gradients and can struggle near variance-component boundaries. BOBYQA is gradient-free and often more robust in these cases:

```python
# Requires: pip install "interlace-lme[bobyqa]"
model = interlace.fit("y ~ x", data=df, groups="firm", optimizer="bobyqa")
```

### 3. Check your model structure

A model that does not converge is often overparameterised. Ask:

- Do you have enough groups? As a rough guide, at least 5–10 groups per random-effect term.
- Are all random slopes warranted? If a random slope variance is near zero, remove it.
- Are crossed groupings truly crossed, or is one nested in the other?

---

## Model comparison

**Can I compare two models to test if a predictor improves fit?**

Yes — fit both models with `method="ML"` and use a likelihood ratio test (LRT). Do not compare models fitted with REML when the fixed-effect structures differ. See the [Model Comparison Guide](model-comparison.md) for the full workflow.
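A minimal sketch of that workflow, assuming the candidate predictor is called `z` and that fitted results expose their maximised log-likelihood as `llf` (both names are placeholders, not guaranteed API):

```python
import interlace
from scipy import stats

# Both fits must use ML (not REML) so the log-likelihoods are comparable
m0 = interlace.fit("y ~ x", data=df, groups="firm", method="ML")
m1 = interlace.fit("y ~ x + z", data=df, groups="firm", method="ML")

# `llf` is assumed here to hold the maximised log-likelihood of each fit
lr_stat = 2 * (m1.llf - m0.llf)
p_value = stats.chi2.sf(lr_stat, df=1)  # one extra fixed-effect parameter
print(f"LR = {lr_stat:.2f}, p = {p_value:.4f}")
```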
**Can I update a model's formula without retyping the full call?**

Yes — use `result.update()`. It accepts a new formula (with lme4-style dot notation), a new dataset, or any keyword argument to `fit()`, and inherits everything else from the original fit:

```python
# Add a predictor
m2 = result.update(". ~ . + frequency")

# Refit on filtered data
m_sens = result.update(data=df[df["n_obs"] > 5])

# Chain both: add predictor and switch to REML
m_final = result.update(". ~ . + frequency", method="REML")
```

See the [Model Comparison Guide](model-comparison.md#iterative-refinement-with-update) for a complete step-by-step workflow.

**How do I assess how well my model generalises to new groups?**

Use `interlace.cross_val()`. It splits at the *group* level to avoid leakage — observations within the same group share a random effect, so row-level splits give over-optimistic error estimates:

```python
from interlace import cross_val

cv = cross_val(
    "score ~ hours_studied + prior_gpa",
    data=df,
    groups="school_id",
    cv="logo",        # leave-one-group-out; use cv="kfold" for many groups
    scoring="rmse",
)
print(f"LOGO RMSE: {cv.mean:.3f} ± {cv.std:.3f}")
```

See the [Cross-Validation Guide](cross-validation.md) for LOGO vs k-fold trade-offs, custom scoring, and per-fold model inspection.

---

## How do I know if my model has a boundary/singular fit?

Call `isSingular()` or check `result.is_singular`. A singular fit means one or more variance components have collapsed to zero — the model is at the boundary of its parameter space.

```python
from interlace import isSingular

if isSingular(result):
    print(result.boundary_flags)  # {'school_id': True, 'student_id': False}
```

`fit()` also issues a `ConvergenceWarning` automatically when this happens. See the [Variance Inference Guide](variance-inference.md) for what to do.

## How do I get confidence intervals for variance components?

Use `result.confint()` for profile likelihood CIs, which are more accurate than Wald intervals near the boundary:

```python
ci = result.confint()  # 95 % by default
print(ci)
#             estimate   2.5 %   97.5 %
# school_id      0.412   0.201    0.731
# residual       1.000   1.000    1.000
```

CIs are on the theta (relative Cholesky factor) scale. Multiply by `result.scale ** 0.5` to convert to the standard-deviation scale. See the [Variance Inference Guide](variance-inference.md#profile-likelihood-cis-for-variance-parameters) for full details.

## How do I speed up influence diagnostics on large models?

Pass `n_jobs=-1` to `hlm_influence()` or `lmer_influence_measures()` to parallelise the case-deletion loop across all available CPUs:

```python
from interlace.influence import hlm_influence, lmer_influence_measures

# Parallel — uses all CPUs
infl = hlm_influence(result, n_jobs=-1, show_progress=True)

# Or the all-in-one HLMdiag-parity version
measures = lmer_influence_measures(result, n_jobs=-1)
```

On Linux, parallel workers use `fork` (fast). On macOS/Windows they use `spawn` (slower startup), so parallelism pays off most for larger models (n > 500 or slow per-deletion refits).

---

## Can I fit a Poisson or binomial mixed model (GLMM)?

Yes. Use `interlace.glmer()` with `family="binomial"` or `family="poisson"`:

```python
import interlace

# Binomial GLMM (proportion response)
result = interlace.glmer(
    "incidence / size ~ period",
    data=df,
    family="binomial",
    groups="herd",
    weights=df["size"].values,
)

# Poisson GLMM (count response)
result = interlace.glmer(
    "count ~ x1 + x2",
    data=df,
    family="poisson",
    groups=["site", "year"],
)
```

`glmer()` uses Laplace approximation with PIRLS (penalised iteratively reweighted least squares), matching R's `lme4::glmer()`. See the [GLMM quickstart](glmm-quickstart.md) for a full walkthrough and the [glmer API reference](api/glmer.md) for all parameters.

---

## My response is binary (0/1). Which function should I use?

Use `interlace.glmer()` with `family="binomial"`. Do not use `interlace.fit()` for binary outcomes — `fit()` assumes a continuous, normally distributed response.

```python
result = interlace.glmer(
    "outcome ~ treatment + age",
    data=df,
    family="binomial",
    groups="subject",
)
```

Fixed-effect coefficients are on the log-odds scale. Use `result.predict(type="response")` for predicted probabilities.

---

## Solver choice: CHOLMOD vs default

`interlace` uses a dense Cholesky factorisation of the random-effect covariance blocks by default. For large models this can be slow.

**When to use CHOLMOD:**

- You have many groups (e.g. > 1 000 firms) or many random-effect levels.
- Fitting takes more than a few seconds on your machine.
- You have a sparse crossing structure (most combinations of grouping levels are unobserved).

**Installing CHOLMOD:**

```bash
pip install "interlace-lme[cholmod]"
```

No code change is required. `interlace` detects the library at runtime and automatically uses the sparse path: symbolic analysis runs once, and the sparsity pattern is reused on every REML iteration.

**When to stick with the default:**

- Small datasets (< 500 observations, < 50 groups).
- Fully-balanced designs where the covariance matrix is dense — CHOLMOD's sparsity savings disappear.
- Environments where compiling SuiteSparse is difficult (e.g. restricted CI images).

---

## Optimizer choice: BOBYQA vs default

`interlace` defaults to scipy's L-BFGS-B, a gradient-based optimizer. BOBYQA (Bound Optimisation BY Quadratic Approximation) is an alternative gradient-free optimizer that more closely matches lme4's internal algorithm.

**When to use BOBYQA:**

- The default optimizer does not converge or produces a `ConvergenceWarning`.
- You need close numerical parity with lme4 (e.g. for reproducing R results or using `hlm_influence` diagnostics).
- Your model has variance components near the boundary (zero), where gradients become unreliable.

**Installing BOBYQA:**

```bash
pip install "interlace-lme[bobyqa]"
```

**Using BOBYQA:**

```python
model = interlace.fit("y ~ x", data=df, groups="firm", optimizer="bobyqa")

# Also pass it to influence functions for consistent results
from interlace.influence import hlm_influence
infl = hlm_influence(model, optimizer="bobyqa")
```

**When to stick with the default:**

- Fast convergence is the priority and you are not comparing to lme4 output.
- You have not installed the `bobyqa` extra (the package is optional).

---

## Numerical stability

### Near-zero variance components

If a random-effect variance is estimated at or near zero, the model is at a boundary of the parameter space. This is not an error — it means the grouping factor explains very little variance — but it can cause:

- Inflated condition numbers and near-singular Hessians.
- Inaccurate standard errors for fixed effects.
- Convergence warnings from the gradient-based optimizer.

**What to do:**

1. Check `result.variance_components` — if a component is `< 1e-4` relative to `result.scale`, consider dropping that grouping factor (see the sketch below).
2. Switch to BOBYQA, which handles boundary cases more gracefully.
3. If the near-zero component is meaningful (e.g. a pre-registered random effect), report the boundary result and note the limitation.
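A rough sketch of the check in step 1, assuming `result.variance_components` behaves like a mapping of component names to variances (adapt this to however your version exposes it):

```python
# Flag components whose variance is tiny relative to the residual variance.
# Assumes a dict-like {name: variance}; adapt if it is a Series or DataFrame.
for name, var in result.variance_components.items():
    rel = var / result.scale
    flag = "  <-- near boundary" if rel < 1e-4 else ""
    print(f"{name}: {rel:.2e}{flag}")
```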
### Ill-conditioned fixed-effect design

If your fixed-effect matrix is ill-conditioned (detected by a `LinAlgWarning`), standardise all continuous predictors and ensure categorical variables are not perfectly collinear.

```python
# Check condition number of the design matrix
import numpy as np
import patsy

_, X = patsy.dmatrices("y ~ x1 + x2", data=df)
print(np.linalg.cond(X))  # values > 1e6 indicate near-singularity
```
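If the condition number is large, one remedy along the lines above is to standardise the continuous predictors and rebuild the design matrix; a minimal sketch (the column names `x1` and `x2` are placeholders):

```python
import numpy as np
import patsy
from sklearn.preprocessing import StandardScaler

# Standardise the continuous predictors in place, then rebuild the design matrix
df[["x1", "x2"]] = StandardScaler().fit_transform(df[["x1", "x2"]])
_, X = patsy.dmatrices("y ~ x1 + x2", data=df)
print(np.linalg.cond(X))  # re-check the condition number after scaling
```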