{ "cells": [ { "cell_type": "markdown", "id": "d0000001", "metadata": {}, "source": [ "# Diagnostics guide\n", "\n", "After fitting a mixed model, diagnostics help you check whether the model assumptions\n", "hold and whether any observations are unduly influencing the estimates.\n", "\n", "This guide covers:\n", "\n", "1. **Residuals** \u2014 marginal vs conditional, what patterns to look for\n", "2. **Leverage** \u2014 which observations have high potential influence\n", "3. **Cook's distance** \u2014 which observations actually change the fixed-effect estimates\n", "4. **MDFFITS** \u2014 standardised version of Cook's D, scale-free\n", "5. **COVTRACE and COVRATIO** \u2014 influence on the precision of fixed-effect estimates\n", "6. **Relative variance change (RVC)** \u2014 influence on variance components\n", "7. **`hlm_augment`** \u2014 combining everything into one tidy DataFrame\n", "8. **Worked example** \u2014 introducing an outlier and watching the diagnostics react\n", "9. **`lmer_influence_measures`** \u2014 case-deletion diagnostics, leverage, and DFBETAS in one call\n", "10. **OLS DFBETAS** \u2014 coefficient-level influence for companion OLS models\n", "\n", "Throughout we use synthetic exam-score data: students and schools as crossed grouping\n", "factors, with `hours_studied` and `prior_gpa` as fixed predictors." 
] }, { "cell_type": "code", "execution_count": null, "id": "d0000002", "metadata": {}, "outputs": [], "source": [ "import numpy as np\nimport pandas as pd\nimport interlace\nfrom interlace import hlm_resid, leverage, hlm_influence, hlm_augment, isSingular\nfrom plotnine import (\n ggplot, aes,\n geom_point, geom_hline, geom_vline, geom_histogram, geom_qq, geom_qq_line,\n geom_text, geom_segment,\n scale_color_manual,\n labs, theme_bw, theme, element_text,\n facet_wrap,\n)\n\nrng = np.random.default_rng(42)\nn = 300\n\ndf = pd.DataFrame({\n \"score\": 60 + rng.normal(0, 8, n),\n \"hours_studied\": rng.uniform(0, 10, n),\n \"prior_gpa\": rng.uniform(2.0, 4.0, n),\n \"student_id\": [f\"s{i}\" for i in rng.integers(1, 31, n)],\n \"school_id\": [f\"sch{i}\" for i in rng.integers(1, 11, n)],\n})\n\nresult = interlace.fit(\n \"score ~ hours_studied + prior_gpa\",\n data=df,\n groups=[\"student_id\", \"school_id\"],\n)" ] }, { "cell_type": "markdown", "id": "singular-header", "metadata": {}, "source": [ "---\n", "\n", "## 0. Singular fit check\n", "\n", "Before running any diagnostics, check whether the model is at the boundary\n", "of its parameter space. A **singular fit** means one or more variance\n", "components collapsed to zero. This can make residual patterns and influence\n", "measures harder to interpret.\n" ] }, { "cell_type": "code", "execution_count": null, "id": "singular-check", "metadata": {}, "outputs": [], "source": [ "from interlace import isSingular\n", "\n", "print(f\"Singular fit: {result.is_singular}\")\n", "print(f\"Boundary flags: {result.boundary_flags}\")\n", "\n", "if isSingular(result):\n", " print(\"\\nWarning: one or more variance components are at the boundary.\")\n", " print(\"Fixed-effect SEs and influence measures may be unreliable.\")\n", " print(\"Consider dropping the flagged grouping factor or using optimizer='bobyqa'.\")\n" ] }, { "cell_type": "markdown", "id": "d0000003", "metadata": {}, "source": [ "---\n", "\n", "## 1. 
Residuals\n", "\n", "`interlace.hlm_resid()` returns either **marginal** or **conditional** residuals.\n", "\n", "| Type | Formula | Interpretation |\n", "|------|---------|----------------|\n", "| **Marginal** | `y \u2212 X\u03b2\u0302` | Deviation from the population-level prediction (ignoring BLUPs) |\n", "| **Conditional** | `y \u2212 X\u03b2\u0302 \u2212 Z\u00fb` | Deviation from the group-specific prediction (using BLUPs) |\n", "\n", "Use **marginal residuals** to check the fixed-effect structure and the overall\n", "distributional assumption (normality of the response given predictors).\n", "Use **conditional residuals** to check within-group model fit and to spot\n", "individual outliers after accounting for group membership.\n", "\n", "Both types should show:\n", "- No trend against fitted values (homoscedasticity)\n", "- Approximate normality (check with a Q\u2013Q plot)\n", "- No systematic pattern by group" ] }, { "cell_type": "code", "execution_count": null, "id": "d0000004", "metadata": {}, "outputs": [], "source": [ "resid_m = hlm_resid(result, level=1) # marginal\n", "resid_c = hlm_resid(result, level=0) # conditional\n", "\n", "resid_df = pd.DataFrame({\n", " \"fitted\": result.fittedvalues,\n", " \"marginal\": resid_m,\n", " \"conditional\": resid_c,\n", "})\n", "\n", "resid_df.describe().round(3)" ] }, { "cell_type": "markdown", "id": "d0000005", "metadata": {}, "source": [ "### Residuals vs fitted values\n", "\n", "A good plot shows points scattered randomly around zero with roughly constant spread.\n", "Any fan shape (wider spread at high fitted values) indicates heteroscedasticity;\n", "any curve indicates a missing non-linear term." 
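, "\n", "The \u201cno systematic pattern by group\u201d check from the residuals list can be scripted\n", "as well. A minimal sketch, using the `resid_c` and `df` objects defined above (the\n", "2-standard-error flag is a heuristic, not part of the `interlace` API):\n", "\n", "```python\n", "by_school = (\n", "    pd.DataFrame({\"school_id\": df[\"school_id\"].values, \"resid\": np.asarray(resid_c)})\n", "    .groupby(\"school_id\")[\"resid\"]\n", "    .agg([\"mean\", \"std\", \"count\"])\n", ")\n", "# Flag schools whose mean residual sits > 2 standard errors from zero\n", "by_school[\"flag\"] = (\n", "    by_school[\"mean\"].abs() > 2 * by_school[\"std\"] / np.sqrt(by_school[\"count\"])\n", ")\n", "print(by_school.sort_values(\"mean\").round(3))\n", "```\n", "\n", "Flagged groups point at a missing group-level term or random slope rather than at\n", "individual outliers."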
] }, { "cell_type": "code", "execution_count": null, "id": "d0000006", "metadata": {}, "outputs": [], "source": [ "(\n", " ggplot(resid_df, aes(x=\"fitted\", y=\"conditional\"))\n", " + geom_point(alpha=0.4)\n", " + geom_hline(yintercept=0, linetype=\"dashed\", color=\"red\")\n", " + labs(\n", " x=\"Fitted values\",\n", " y=\"Conditional residual\",\n", " title=\"Residuals vs fitted \u2014 good pattern (random scatter around zero)\",\n", " )\n", " + theme_bw()\n", ")" ] }, { "cell_type": "markdown", "id": "d0000007", "metadata": {}, "source": [ "### Q\u2013Q plot\n", "\n", "Points should fall on the diagonal. Heavy tails or an S-curve indicate\n", "departures from normality \u2014 consider a robustness check or a transformation." ] }, { "cell_type": "code", "execution_count": null, "id": "d0000008", "metadata": {}, "outputs": [], "source": [ "(\n", " ggplot(resid_df, aes(sample=\"conditional\"))\n", " + geom_qq(alpha=0.4)\n", " + geom_qq_line(color=\"red\")\n", " + labs(\n", " x=\"Theoretical quantiles\",\n", " y=\"Sample quantiles\",\n", " title=\"Q\u2013Q plot of conditional residuals\",\n", " )\n", " + theme_bw()\n", ")" ] }, { "cell_type": "markdown", "id": "d0000009", "metadata": {}, "source": [ "---\n", "\n", "## 2. 
Leverage\n", "\n", "`interlace.leverage()` decomposes the hat-matrix diagonal into three parts following\n", "Demidenko & Stukel (2005) and Nobre & Singer (2007):\n", "\n", "| Column | Meaning |\n", "|--------|--------|\n", "| `fixef` (H1) | Leverage due to the fixed-effect design matrix \u2014 unusual predictor values |\n", "| `ranef` (H2) | Leverage due to the random-effect structure (Demidenko & Stukel) |\n", "| `ranef.uc` | Unconfounded H2 (Nobre & Singer) \u2014 H2 with fixed-effect contribution removed |\n", "| `overall` | H1 + H2 \u2014 total leverage |\n", "\n", "**Interpretation:**\n", "- High `fixef` leverage: the observation has an unusual combination of predictors;\n", " it controls the slope estimates.\n", "- High `ranef` leverage: the observation strongly determines its group's BLUP.\n", "- A rule of thumb borrowed from OLS: `overall > 2p/n` (where `p` = number of fixed\n", " parameters, `n` = observations) flags potentially high-leverage points. In mixed\n", " models this is a heuristic \u2014 leverage is distributed differently \u2014 but it gives a\n", " starting threshold." 
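, "\n", "To see *why* an observation has high fixed-effect leverage, compare the predictor\n", "values of the most-leveraged rows with the full sample. A minimal sketch (the 95th\n", "percentile cut is an arbitrary inspection threshold, not an `interlace` convention):\n", "\n", "```python\n", "lev = leverage(result)\n", "cut = lev[\"fixef\"].quantile(0.95)\n", "high = df.loc[(lev[\"fixef\"] > cut).values, [\"hours_studied\", \"prior_gpa\"]]\n", "print(high.describe().round(2))  # top-5% fixef-leverage rows\n", "print(df[[\"hours_studied\", \"prior_gpa\"]].describe().round(2))  # full sample\n", "```\n", "\n", "If the flagged rows cluster at the edges of the predictor ranges, the leverage is\n", "explained by the design rather than by anything wrong with the data."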
] }, { "cell_type": "code", "execution_count": null, "id": "d0000010", "metadata": {}, "outputs": [], "source": [ "lev = leverage(result)\n", "lev.head()" ] }, { "cell_type": "code", "execution_count": null, "id": "d0000011", "metadata": {}, "outputs": [], "source": [ "p = len(result.fe_params) # number of fixed-effect parameters\n", "threshold = 2 * p / result.nobs\n", "\n", "lev_plot_df = lev.reset_index(drop=True).assign(\n", " obs=range(len(lev)),\n", " high=lev[\"overall\"] > threshold,\n", ")\n", "\n", "(\n", " ggplot(lev_plot_df, aes(x=\"obs\", y=\"overall\", color=\"high\"))\n", " + geom_point(alpha=0.6)\n", " + geom_hline(yintercept=threshold, linetype=\"dashed\", color=\"red\")\n", " + scale_color_manual(values={True: \"#e74c3c\", False: \"#2c3e50\"})\n", " + labs(\n", " x=\"Observation index\",\n", " y=\"Overall leverage (H1 + H2)\",\n", " color=\"> 2p/n\",\n", " title=\"Leverage plot \u2014 dashed line at 2p/n threshold\",\n", " )\n", " + theme_bw()\n", ")" ] }, { "cell_type": "markdown", "id": "d0000012", "metadata": {}, "source": [ "High leverage alone is not a problem \u2014 it means the observation *could* be influential.\n", "Combine leverage with residual size (next section) to identify observations that are\n", "both unusual and poorly fit." ] }, { "cell_type": "markdown", "id": "d0000013", "metadata": {}, "source": [ "---\n", "\n", "## 3. Cook's distance\n", "\n", "Cook's distance (`cooksd`) measures how much the vector of fixed-effect estimates\n", "changes when observation *i* is deleted. 
It combines residual size and leverage:\n", "\n", "$$D_i = \\frac{(\\hat{\\boldsymbol{\\beta}} - \\hat{\\boldsymbol{\\beta}}_{(-i)})^\\top\n", " (\\mathbf{X}^\\top \\hat{\\boldsymbol{\\Omega}}^{-1} \\mathbf{X})\n", " (\\hat{\\boldsymbol{\\beta}} - \\hat{\\boldsymbol{\\beta}}_{(-i)})}\n", " {p \\, \\hat{\\sigma}^2}$$\n", "\n", "**Thresholds (heuristics):**\n", "- `cooksd > 4/n` \u2014 worth inspecting\n", "- `cooksd > 1` \u2014 substantial influence on fixed-effect estimates\n", "\n", "These thresholds are rules of thumb from OLS, applied here as starting points.\n", "Always inspect flagged observations in context before removing them." ] }, { "cell_type": "code", "execution_count": null, "id": "d0000014", "metadata": {}, "outputs": [], "source": [ "infl = hlm_influence(result)\n", "infl.head()" ] }, { "cell_type": "code", "execution_count": null, "id": "d0000015", "metadata": {}, "outputs": [], "source": [ "n = result.nobs\n", "cooksd_threshold = 4 / n\n", "\n", "cooksd_df = infl[[\"cooksd\"]].reset_index(drop=True).assign(\n", " obs=range(n),\n", " high=infl[\"cooksd\"].values > cooksd_threshold,\n", ")\n", "\n", "(\n", " ggplot(cooksd_df, aes(x=\"obs\", y=\"cooksd\", color=\"high\"))\n", " + geom_point(alpha=0.6)\n", " + geom_hline(yintercept=cooksd_threshold, linetype=\"dashed\", color=\"red\")\n", " + scale_color_manual(values={True: \"#e74c3c\", False: \"#2c3e50\"})\n", " + labs(\n", " x=\"Observation index\",\n", " y=\"Cook's distance\",\n", " color=\"> 4/n\",\n", " title=\"Cook's distance \u2014 dashed line at 4/n\",\n", " )\n", " + theme_bw()\n", ")" ] }, { "cell_type": "markdown", "id": "d0000016", "metadata": {}, "source": [ "---\n", "\n", "## 4. 
MDFFITS\n", "\n", "MDFFITS (Measures of Difference in Fits) is a scale-free version of Cook's distance.\n", "It standardises by the full covariance matrix of the fixed effects rather than\n", "by `p * \u03c3\u00b2`, so it is comparable across models with different numbers of parameters.\n", "\n", "**Threshold:** `mdffits > 2 * sqrt(p/n)` is a common guideline.\n", "\n", "MDFFITS and Cook's D usually agree on which observations are influential; MDFFITS is\n", "preferred when comparing across models or when `\u03c3\u00b2` is poorly estimated." ] }, { "cell_type": "code", "execution_count": null, "id": "d0000017", "metadata": {}, "outputs": [], "source": [ "mdffits_threshold = 2 * np.sqrt(p / n)\n", "\n", "mdffits_df = infl[[\"mdffits\"]].reset_index(drop=True).assign(\n", " obs=range(n),\n", " high=infl[\"mdffits\"].values > mdffits_threshold,\n", ")\n", "\n", "(\n", " ggplot(mdffits_df, aes(x=\"obs\", y=\"mdffits\", color=\"high\"))\n", " + geom_point(alpha=0.6)\n", " + geom_hline(yintercept=mdffits_threshold, linetype=\"dashed\", color=\"red\")\n", " + scale_color_manual(values={True: \"#e74c3c\", False: \"#2c3e50\"})\n", " + labs(\n", " x=\"Observation index\",\n", " y=\"MDFFITS\",\n", " color=f\"> 2\u221a(p/n)\",\n", " title=\"MDFFITS \u2014 dashed line at 2\u221a(p/n)\",\n", " )\n", " + theme_bw()\n", ")" ] }, { "cell_type": "markdown", "id": "d0000018", "metadata": {}, "source": [ "---\n", "\n", "## 5. 
COVTRACE and COVRATIO\n", "\n", "These diagnostics measure influence on the **precision** of fixed-effect estimates\n", "(the covariance matrix `V`), not just on their values.\n", "\n", "| Measure | Formula | Interpretation |\n", "|---------|---------|----------------|\n", "| `covtrace` | tr(V\u207b\u00b9V\u1d62) \u2212 p | Positive \u2192 deletion hurts precision (variance grows); negative \u2192 deletion improves precision |\n", "| `covratio` | det(V\u1d62) / det(V) | < 1 \u2192 deletion improves precision (det shrinks); > 1 \u2192 deletion hurts precision |\n", "\n", "Here V\u1d62 is the fixed-effect covariance matrix fitted without observation *i*, and p\n", "is the number of fixed parameters.\n", "\n", "**Thresholds:**\n", "- `|covtrace| > p` is notable\n", "- `covratio` far from 1 (say, < 0.9 or > 1.1) is worth checking\n", "\n", "A large negative `covtrace` (equivalently, `covratio` well below 1) means observation *i*\n", "is inflating standard errors \u2014 removing it would make estimates more precise; check\n", "whether it is an outlier. A large positive `covtrace` means the observation is adding\n", "precision, typically because it sits at an extreme predictor value (high leverage)." ] }, { "cell_type": "code", "execution_count": null, "id": "d0000019", "metadata": {}, "outputs": [], "source": [ "cov_df = infl[[\"covtrace\", \"covratio\"]].reset_index(drop=True).assign(\n", " obs=range(n),\n", ")\n", "\n", "(\n", " ggplot(cov_df, aes(x=\"obs\", y=\"covratio\"))\n", " + geom_point(alpha=0.5)\n", " + geom_hline(yintercept=1, linetype=\"dashed\", color=\"red\")\n", " + geom_hline(yintercept=0.9, linetype=\"dotted\", color=\"orange\")\n", " + geom_hline(yintercept=1.1, linetype=\"dotted\", color=\"orange\")\n", " + labs(\n", " x=\"Observation index\",\n", " y=\"COVRATIO\",\n", " title=\"COVRATIO \u2014 dashed=1 (neutral), dotted=0.9/1.1 (guideline bounds)\",\n", " )\n", " + theme_bw()\n", ")" ] }, { "cell_type": "markdown", "id": "d0000020", "metadata": {}, "source": [ "---\n", "\n", "## 6. 
Relative variance change (RVC)\n", "\n", "RVC columns (`rvc.`) measure how much each variance component changes when\n", "observation *i* is deleted, relative to the full-model estimate:\n", "\n", "$$\\text{RVC}_i(\\theta_k) = \\frac{\\hat{\\theta}_{k(-i)} - \\hat{\\theta}_k}{\\hat{\\theta}_k}$$\n", "\n", "A value of +0.3 means the variance component increases by 30% when observation *i* is\n", "removed \u2014 that observation is suppressing variance in that grouping factor.\n", "\n", "**Threshold:** `|RVC| > 0.5` is a common guideline for flagging notable influence\n", "on a specific variance component." ] }, { "cell_type": "code", "execution_count": null, "id": "d0000021", "metadata": {}, "outputs": [], "source": [ "rvc_cols = [c for c in infl.columns if c.startswith(\"rvc.\")]\n", "rvc_df = (\n", " infl[rvc_cols]\n", " .reset_index(drop=True)\n", " .assign(obs=range(n))\n", " .melt(id_vars=\"obs\", var_name=\"component\", value_name=\"rvc\")\n", ")\n", "\n", "(\n", " ggplot(rvc_df, aes(x=\"obs\", y=\"rvc\"))\n", " + geom_point(alpha=0.4)\n", " + geom_hline(yintercept=0, linetype=\"dashed\", color=\"grey\")\n", " + geom_hline(yintercept= 0.5, linetype=\"dotted\", color=\"orange\")\n", " + geom_hline(yintercept=-0.5, linetype=\"dotted\", color=\"orange\")\n", " + facet_wrap(\"component\", ncol=1)\n", " + labs(\n", " x=\"Observation index\",\n", " y=\"Relative variance change\",\n", " title=\"RVC per variance component \u2014 dotted lines at \u00b10.5\",\n", " )\n", " + theme_bw()\n", ")" ] }, { "cell_type": "markdown", "id": "d0000022", "metadata": {}, "source": [ "---\n", "\n", "## 7. `hlm_augment` \u2014 everything in one DataFrame\n", "\n", "`interlace.hlm_augment()` appends all residual and influence diagnostics to your\n", "original data. 
This makes it easy to filter, join, and plot.\n", "\n", "Columns added:\n", "\n", "| Column | Source |\n", "|--------|--------|\n", "| `.resid` | Conditional residual |\n", "| `.resid_mar` | Marginal residual |\n", "| `.fitted` | Conditional fitted value |\n", "| `.leverage` | Overall leverage (H1 + H2) |\n", "| `.cooksd` | Cook's distance |\n", "| `.mdffits` | MDFFITS |\n", "| `.covtrace` | COVTRACE |\n", "| `.covratio` | COVRATIO |\n", "| `.rvc.` | RVC per variance component |" ] }, { "cell_type": "code", "execution_count": null, "id": "d0000023", "metadata": {}, "outputs": [], "source": [ "aug = hlm_augment(result, data=df)\n", "aug.head()" ] }, { "cell_type": "code", "execution_count": null, "id": "d0000024", "metadata": {}, "outputs": [], "source": [ "# Flag any observation that exceeds both the leverage and Cook's D thresholds\n", "aug[\"influential\"] = (\n", " (aug[\".leverage\"] > 2 * p / n) &\n", " (aug[\".cooksd\"] > 4 / n)\n", ")\n", "\n", "print(f\"Observations flagged as influential: {aug['influential'].sum()} of {len(aug)}\")\n", "aug[aug[\"influential\"]][[\"student_id\", \"school_id\", \"score\", \".leverage\", \".cooksd\"]].head()" ] }, { "cell_type": "markdown", "id": "d0000025", "metadata": {}, "source": [ "---\n", "\n", "## 8. Worked example \u2014 introducing an outlier\n", "\n", "To see how the diagnostics respond to a genuine outlier, we refit the model after\n", "injecting a single extreme observation." 
] }, { "cell_type": "code", "execution_count": null, "id": "d0000026", "metadata": {}, "outputs": [], "source": [ "# Inject one outlier: extreme score and unusual predictor values\n", "outlier = pd.DataFrame([{\n", " \"score\": 120.0, # far above the ~60-point mean\n", " \"hours_studied\": 0.1, # almost no study time\n", " \"prior_gpa\": 2.0, # lowest GPA\n", " \"student_id\": \"s1\",\n", " \"school_id\": \"sch1\",\n", "}])\n", "\n", "df_outlier = pd.concat([df, outlier], ignore_index=True)\n", "result_outlier = interlace.fit(\n", " \"score ~ hours_studied + prior_gpa\",\n", " data=df_outlier,\n", " groups=[\"student_id\", \"school_id\"],\n", ")\n", "\n", "print(\"Without outlier \u2014 hours_studied coefficient:\", round(result.fe_params[\"hours_studied\"], 3))\n", "print(\"With outlier \u2014 hours_studied coefficient:\", round(result_outlier.fe_params[\"hours_studied\"], 3))" ] }, { "cell_type": "code", "execution_count": null, "id": "d0000027", "metadata": {}, "outputs": [], "source": [ "infl_outlier = hlm_influence(result_outlier)\n", "n2 = result_outlier.nobs\n", "\n", "cooksd_out_df = infl_outlier[[\"cooksd\"]].reset_index(drop=True).assign(\n", " obs=range(n2),\n", " is_outlier=(np.arange(n2) == n2 - 1), # last row is the injected outlier\n", ")\n", "# label the injected outlier\n", "cooksd_out_df[\"label\"] = cooksd_out_df.apply(\n", " lambda r: \"injected\\noutlier\" if r[\"is_outlier\"] else \"\", axis=1\n", ")\n", "\n", "(\n", " ggplot(cooksd_out_df, aes(x=\"obs\", y=\"cooksd\", color=\"is_outlier\"))\n", " + geom_point(alpha=0.6)\n", " + geom_hline(yintercept=4 / n2, linetype=\"dashed\", color=\"red\")\n", " + geom_text(\n", " data=cooksd_out_df[cooksd_out_df[\"is_outlier\"]],\n", " mapping=aes(label=\"label\"),\n", " nudge_x=-8, nudge_y=0,\n", " size=9,\n", " )\n", " + scale_color_manual(values={True: \"#e74c3c\", False: \"#2c3e50\"})\n", " + labs(\n", " x=\"Observation index\",\n", " y=\"Cook's distance\",\n", " color=\"Injected outlier\",\n", " title=\"Cook's distance after injecting an outlier \u2014 outlier is clearly detected\",\n", " )\n", " + theme_bw()\n", ")" ] }, { "cell_type": "markdown", "id": "d0000028", "metadata": {}, "source": [ "The injected observation stands out dramatically. Compare its diagnostic values to\n", "the rest of the dataset:" ] }, { "cell_type": "code", "execution_count": null, "id": "d0000029", "metadata": {}, "outputs": [], "source": [ "aug_outlier = hlm_augment(result_outlier, data=df_outlier)\n", "\n", "pd.concat([\n", " aug_outlier[[\".leverage\", \".cooksd\", \".mdffits\", \".covtrace\", \".covratio\"]]\n", " .describe()\n", " .loc[[\"mean\", \"max\"]]\n", " .rename(index={\"mean\": \"dataset mean\", \"max\": \"dataset max\"}),\n", " aug_outlier.iloc[[-1]][\n", " [\".leverage\", \".cooksd\", \".mdffits\", \".covtrace\", \".covratio\"]\n", " ].rename(index={n2 - 1: \"injected outlier\"}),\n", "]).round(4)" ] }, { "cell_type": "markdown", "id": "lmer-influence-header", "metadata": {}, "source": [ "---\n", "\n", "## 9. `lmer_influence_measures()` \u2014 HLMdiag-parity all-in-one\n", "\n", "`lmer_influence_measures()` combines case-deletion diagnostics, analytical\n", "leverage, and DFBETAS into a single dict that matches R's\n", "`HLMdiag::hlm_influence` output exactly. 
Use it when you need full R parity\n", "or when you want all measures in one call.\n", "\n", "For large models, pass `n_jobs=-1` to parallelise the case-deletion loop\n", "across all available CPUs (Linux: fork, fast; macOS/Windows: spawn, slower).\n" ] }, { "cell_type": "code", "execution_count": null, "id": "lmer-influence-call", "metadata": {}, "outputs": [], "source": [ "from interlace.influence import lmer_influence_measures\n", "\n", "# n_jobs=1 (default) is sequential; n_jobs=-1 uses all CPUs\n", "measures = lmer_influence_measures(result, n_jobs=1, show_progress=False)\n", "print(list(measures.keys()))\n", "# ['cooks', 'hat', 'hat_overall', 'hat_fixef', 'dfbetas', 'dffits', 'residuals', 'sigma']\n" ] }, { "cell_type": "code", "execution_count": null, "id": "lmer-influence-flagging", "metadata": {}, "outputs": [], "source": [ "# Use the correct hat vector for flagging (matches R's HLMdiag convention)\n", "# - hat_fixef (H1 only) for crossed multi-RE models\n", "# - hat_overall (H1+H2) for single-RE models\n", "n = result.nobs\n", "p = len(result.fe_params)\n", "\n", "cooksd_thresh = 4 / n\n", "leverage_thresh = 2 * p / n\n", "\n", "flag_cooksd = measures['cooks'] > cooksd_thresh\n", "flag_leverage = measures['hat'] > leverage_thresh\n", "flag_both = flag_cooksd & flag_leverage\n", "\n", "print(f\"Influential (Cook's D): {flag_cooksd.sum()}\")\n", "print(f\"High leverage: {flag_leverage.sum()}\")\n", "print(f\"Both: {flag_both.sum()}\")\n" ] }, { "cell_type": "markdown", "id": "lmer-influence-dfbetas", "metadata": {}, "source": [ "### DFBETAS from `lmer_influence_measures`\n", "\n", "The `'dfbetas'` key contains an analytical DFBETAS matrix `(n, p)` using the\n", "same formula as R. 
This is faster than case-deletion-based DFBETAS and\n", "matches `HLMdiag`'s values:\n" ] }, { "cell_type": "code", "execution_count": null, "id": "lmer-influence-dfbetas-code", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "from plotnine import ggplot, aes, geom_tile, scale_fill_gradient2, theme_bw, labs\n", "\n", "dfbetas = measures['dfbetas'] # shape (n, p)\n", "threshold = 2 / np.sqrt(n)\n", "\n", "dfbetas_df = (\n", " pd.DataFrame(dfbetas, columns=result.fe_params.index)\n", " .assign(obs=range(n))\n", " .melt(id_vars='obs', var_name='coefficient', value_name='dfbetas')\n", ")\n", "\n", "(\n", " ggplot(dfbetas_df, aes(x='coefficient', y='obs', fill='dfbetas'))\n", " + geom_tile()\n", " + scale_fill_gradient2(low='#2166ac', mid='white', high='#d73027', midpoint=0)\n", " + theme_bw()\n", " + labs(title='Mixed-model DFBETAS', x='Coefficient', y='Observation', fill='DFBETAS')\n", ")\n" ] }, { "cell_type": "markdown", "id": "ols-dfbetas-header", "metadata": {}, "source": [ "---\n", "\n", "## 10. OLS DFBETAS \u2014 coefficient-level influence for OLS models\n", "\n", "**DFBETAS** measure how much each regression coefficient changes (in units of its\n", "standard error) when a single observation is removed. Unlike Cook's distance \u2014\n", "which summarises influence on *all* coefficients simultaneously \u2014 DFBETAS give a\n", "**per-coefficient, per-observation** breakdown.\n", "\n", "`ols_dfbetas_qr()` computes DFBETAS for a statsmodels OLS model via QR\n", "decomposition: fully vectorised, no Python loops.\n", "\n", "**When to use:**\n", "- As a pre-check before fitting the mixed model (stage-1 diagnostics)\n", "- In two-stage workflows where OLS is used on group-level summaries\n", "- When you need to know *which coefficient* an influential observation drives" ] }, { "cell_type": "code", "execution_count": null, "id": "ols-dfbetas-setup", "metadata": {}, "outputs": [], "source": [ "from interlace.influence import ols_dfbetas_qr\n", "import statsmodels.formula.api as smf\n", "\n", "# Fit an OLS model on the same dataset\n", "ols = smf.ols(\"score ~ hours_studied + prior_gpa\", data=df).fit()\n", "\n", "# Compute DFBETAS \u2014 shape (n_obs, n_params)\n", "dfbetas = ols_dfbetas_qr(ols)\n", "print(f\"DFBETAS shape: {dfbetas.shape}\")\n", "print(f\"Columns correspond to: {ols.model.exog_names}\")" ] }, { "cell_type": "code", "execution_count": null, "id": "ols-dfbetas-threshold", "metadata": {}, "outputs": [], "source": [ "# Rule-of-thumb threshold: 2 / sqrt(n), using the OLS model's own n\n", "threshold = 2 / np.sqrt(ols.nobs)\n", "influential_mask = np.any(np.abs(dfbetas) > threshold, axis=1)\n", "print(f\"Threshold: {threshold:.3f}\")\n", "print(f\"Observations exceeding threshold on at least one coefficient: {influential_mask.sum()}\")" ] }, { "cell_type": "markdown", "id": "ols-dfbetas-heatmap-intro", "metadata": {}, "source": [ "### Heatmap visualisation\n", "\n", "A heatmap shows the full DFBETAS matrix at a glance. Rows are observations,\n", "columns are coefficients. Dark cells indicate high influence on that coefficient." 
] }, { "cell_type": "code", "execution_count": null, "id": "ols-dfbetas-heatmap", "metadata": {}, "outputs": [], "source": [ "dfbetas_df = (\n", " pd.DataFrame(dfbetas, columns=ols.model.exog_names)\n", " .assign(obs=range(len(dfbetas)))\n", " .melt(id_vars=\"obs\", var_name=\"coefficient\", value_name=\"dfbetas\")\n", ")\n", "\n", "from plotnine import ggplot, aes, geom_tile, scale_fill_gradient2, theme_bw, labs\n", "\n", "(\n", " ggplot(dfbetas_df, aes(x=\"coefficient\", y=\"obs\", fill=\"dfbetas\"))\n", " + geom_tile()\n", " + scale_fill_gradient2(low=\"#2166ac\", mid=\"white\", high=\"#d73027\", midpoint=0)\n", " + theme_bw()\n", " + labs(title=\"OLS DFBETAS\", x=\"Coefficient\", y=\"Observation\", fill=\"DFBETAS\")\n", ")" ] }, { "cell_type": "markdown", "id": "ols-dfbetas-note", "metadata": {}, "source": [ "Observations with large absolute DFBETAS for a coefficient are driving\n", "that coefficient's estimate. The same observations may or may not be flagged\n", "by Cook's distance \u2014 DFBETAS can reveal influence on a *single* coefficient\n", "that is diluted in the Cook's D summary.\n", "\n", "Note that `ols_dfbetas_qr()` operates on **OLS** models only. For mixed-model\n", "influence at the coefficient level, use `hlm_influence()` and examine the\n", "`mdffits` column, which is the multivariate analogue." 
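] }, { "cell_type": "markdown", "id": "ols-dfbetas-crosscheck", "metadata": {}, "source": [ "That dilution is easy to check directly: cross-tabulate per-coefficient DFBETAS\n", "flags against a Cook's distance flag for the same OLS fit. A sketch using the\n", "`ols` and `dfbetas` objects from this section (`OLSInfluence` is statsmodels'\n", "influence helper; both thresholds are the usual heuristics):\n", "\n", "```python\n", "from statsmodels.stats.outliers_influence import OLSInfluence\n", "\n", "# First element of the (distance, p-value) tuple\n", "cooks = OLSInfluence(ols).cooks_distance[0]\n", "flag_cooks = cooks > 4 / len(cooks)\n", "flag_dfb = np.abs(dfbetas).max(axis=1) > 2 / np.sqrt(len(cooks))\n", "\n", "print(pd.crosstab(flag_dfb, flag_cooks,\n", "                  rownames=[\"DFBETAS flag\"], colnames=[\"Cook's D flag\"]))\n", "```\n", "\n", "Rows flagged by DFBETAS but not by Cook's D are candidates for the\n", "single-coefficient influence described above."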
] }, { "cell_type": "markdown", "id": "d0000031", "metadata": {}, "source": [ "---\n", "\n", "## Acting on diagnostic results\n", "\n", "Diagnostics answer the question *is my model adequate?* Below is a decision workflow for the most common findings.\n", "\n", "### Residual patterns suggest model misspecification\n", "\n", "| Pattern | Likely cause | Action |\n", "|---------|-------------|--------|\n", "| Fan shape (variance increases with fitted) | Heteroscedasticity | Transform response (log/sqrt) or consider GLMM |\n", "| Curve in residual-vs-fitted | Missing non-linear term | Add polynomial or interaction term |\n", "| Heavy tails in Q-Q plot | Non-normal errors | Check for data errors; consider robust estimation |\n", "| Systematic trend by group | Missing random slope | Fit a random slope for that predictor |\n", "\n", "**Checking for a missing random slope:**\n", "If conditional residuals show a systematic trend with a predictor *within* groups, a random slope may be warranted:\n", "\n", "```python\n", "# Refit with a by-student random slope for hours_studied\n", "result_slopes = interlace.fit(\n", " \"score ~ hours_studied + prior_gpa\",\n", " data=df,\n", " random=[\"(1 + hours_studied | student_id)\", \"(1 | school_id)\"],\n", ")\n", "# Then compare with LRT \u2014 see model-comparison.md\n", "```\n", "\n", "### High leverage + high Cook's D: influential observation\n", "\n", "Steps:\n", "1. Identify: `aug[aug[\".cooksd\"] > 4/n]`\n", "2. Check for data errors (typos, unit mismatches)\n", "3. Refit without the observation and compare fixed-effect estimates\n", "4. 
If estimates change substantially, report both models\n", "\n", "```python\n", "aug = hlm_augment(result, data=df)\n", "drop_idx = aug[\".cooksd\"].idxmax()\n", "\n", "result_sens = interlace.fit(\n", " \"score ~ hours_studied + prior_gpa\",\n", " data=df.drop(index=drop_idx),\n", " groups=[\"student_id\", \"school_id\"],\n", ")\n", "print(\"Original:\", result.fe_params.round(3).to_dict())\n", "print(\"Without obs\", drop_idx, \":\", result_sens.fe_params.round(3).to_dict())\n", "```\n", "\n", "### High RVC: influential observation for a variance component\n", "\n", "Large `rvc.` often occurs when a group has very few observations (especially singletons). This is expected. Check group sizes:\n", "\n", "```python\n", "print(df.groupby(\"student_id\").size().sort_values())\n", "# Groups with n=1 will always have high RVC \u2014 treat their BLUPs with caution\n", "```\n" ] }, { "cell_type": "markdown", "id": "d0000030", "metadata": {}, "source": [ "---\n", "\n", "## Diagnostic thresholds \u2014 quick reference\n", "\n", "| Diagnostic | Guideline threshold | What it flags |\n", "|------------|--------------------|--------------|\n", "| Overall leverage | > 2p/n | Unusual predictor values |\n", "| Cook's distance | > 4/n or > 1 | Influence on fixed-effect estimates |\n", "| MDFFITS | > 2\u221a(p/n) | Scale-free fixed-effect influence |\n", "| COVTRACE | \\|value\\| > p | Influence on fixed-effect precision |\n", "| COVRATIO | < 0.9 or > 1.1 | Influence on covariance determinant |\n", "| RVC | \\|value\\| > 0.5 | Influence on a specific variance component |\n", "\n", "All thresholds are heuristics. An observation that exceeds a threshold is **worth\n", "inspecting**, not automatically removed. Ask:\n", "\n", "1. Is it a data-entry error?\n", "2. Is it a legitimate extreme value that the model should fit?\n", "3. Does removing it change your substantive conclusions?\n", "\n", "If removal does not change conclusions, the result is robust. 
If it does, report both\n", "analyses and discuss the influential case.\n", "\n", "---\n", "\n", "## Where to go next\n", "\n", "- {doc}`interpreting-results` \u2014 reading fixed effects, variance components, and BLUPs\n", "- {doc}`api/influence` \u2014 `hlm_influence`, `cooks_distance`, `mdffits`, `n_influential`\n", "- {doc}`api/leverage` \u2014 `leverage` decomposition columns\n", "- {doc}`api/augment` \u2014 `hlm_augment` column reference\n", "- {doc}`api/residuals` \u2014 `hlm_resid` level parameter\n", "- {doc}`faq` \u2014 convergence issues and numerical stability" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "name": "python", "version": "3.13.0" } }, "nbformat": 4, "nbformat_minor": 5 }