{ "cells": [ { "cell_type": "markdown", "id": "a1b2c3d4", "metadata": {}, "source": [ "# Examples\n", "\n", "All examples use the same synthetic exam scores dataset. Run this setup block first:" ] }, { "cell_type": "code", "execution_count": null, "id": "e5f6a7b8", "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "from interlace import fit\n", "\n", "rng = np.random.default_rng(42)\n", "n = 300\n", "df = pd.DataFrame({\n", " \"score\": 60 + rng.normal(0, 8, n),\n", " \"hours_studied\": rng.uniform(0, 10, n),\n", " \"prior_gpa\": rng.uniform(2.0, 4.0, n),\n", " \"student_id\": [f\"s{i}\" for i in rng.integers(1, 31, n)],\n", " \"school_id\": [f\"sch{i}\" for i in rng.integers(1, 11, n)],\n", "})\n", "\n", "result = fit(\n", " formula=\"score ~ hours_studied + prior_gpa\",\n", " data=df,\n", " groups=[\"student_id\", \"school_id\"],\n", ")" ] }, { "cell_type": "markdown", "id": "c9d0e1f2", "metadata": {}, "source": [ "## Single grouping factor\n", "\n", "When you only have one source of clustering, pass a single column name:" ] }, { "cell_type": "code", "execution_count": null, "id": "a3b4c5d6", "metadata": {}, "outputs": [], "source": [ "result_single = fit(\n", " formula=\"score ~ hours_studied + prior_gpa\",\n", " data=df,\n", " groups=\"student_id\",\n", ")\n", "\n", "print(result_single.variance_components)" ] }, { "cell_type": "markdown", "id": "e7f8a9b0", "metadata": {}, "source": [ "## Residuals\n", "\n", "`hlm_resid` returns a DataFrame with `.resid` and `.fitted` columns alongside the\n", "original data. Use `type=\"marginal\"` to ignore random effects, or\n", "`type=\"conditional\"` to subtract predicted BLUPs." ] }, { "cell_type": "code", "execution_count": null, "id": "c1d2e3f4", "metadata": {}, "outputs": [], "source": [ "from interlace import hlm_resid\n", "\n", "# Marginal residuals: y - Xβ\n", "marginal = hlm_resid(result, type=\"marginal\")\n", "print(marginal[[\"resid\", \"fitted\"]].describe())" ] }, { "cell_type": "code", "execution_count": null, "id": "a5b6c7d8", "metadata": {}, "outputs": [], "source": [ "# Conditional residuals: y - Xβ - Zû\n", "conditional = hlm_resid(result, type=\"conditional\")\n", "\n", "# Standardised\n", "std_resid = hlm_resid(result, type=\"conditional\", standardized=True)\n", "\n", "# Group-level random effects\n", "school_re = hlm_resid(result, level=\"school_id\")\n", "print(school_re.head())" ] }, { "cell_type": "markdown", "id": "e9f0a1b2", "metadata": {}, "source": [ "## Leverage\n", "\n", "The hat-matrix diagonal is decomposed into fixed-effect and random-effect components\n", "following Demidenko & Stukel (2005) and Nobre & Singer (2007)." ] }, { "cell_type": "code", "execution_count": null, "id": "c3d4e5f6", "metadata": {}, "outputs": [], "source": [ "from interlace import leverage\n", "\n", "lev = leverage(result)\n", "print(lev.columns)\n", "\n", "# High-leverage observations\n", "high_lev = lev[lev[\"overall\"] > 2 * lev[\"overall\"].mean()]\n", "print(f\"{len(high_lev)} high-leverage observations\")" ] }, { "cell_type": "markdown", "id": "a7b8c9d0", "metadata": {}, "source": [ "## Influence diagnostics\n", "\n", "`hlm_influence` fits the model *n* times with one observation (or group) deleted,\n", "computing Cook's D, MDFFITS, COVTRACE, COVRATIO, and RVC for each deletion." ] }, { "cell_type": "code", "execution_count": null, "id": "e1f2a3b4", "metadata": {}, "outputs": [], "source": [ "from interlace import hlm_influence\n", "\n", "# Observation-level influence\n", "infl = hlm_influence(result, level=1)\n", "print(infl.columns)" ] }, { "cell_type": "code", "execution_count": null, "id": "c5d6e7f8", "metadata": {}, "outputs": [], "source": [ "# Group-level influence (delete one school at a time)\n", "school_infl = hlm_influence(result, level=\"school_id\")\n", "print(school_infl.head())" ] }, { "cell_type": "markdown", "id": "a9b0c1d2", "metadata": {}, "source": [ "### Cook's distance and MDFFITS" ] }, { "cell_type": "code", "execution_count": null, "id": "e3f4a5b6", "metadata": {}, "outputs": [], "source": [ "from interlace import cooks_distance, mdffits\n", "\n", "cd = cooks_distance(result) # np.ndarray, shape (n,)\n", "mdf = mdffits(result) # np.ndarray, shape (n,)\n", "\n", "print(f\"Max Cook's D: {cd.max():.4f}\")" ] }, { "cell_type": "markdown", "id": "c7d8e9f0", "metadata": {}, "source": [ "### Count and measure influential observations" ] }, { "cell_type": "code", "execution_count": null, "id": "a1b2c3d4e", "metadata": {}, "outputs": [], "source": [ "from interlace import n_influential, tau_gap\n", "\n", "# Count observations exceeding the 4/n heuristic threshold\n", "n_inf = n_influential(result)\n", "print(f\"{n_inf} influential observations (Cook's D > 4/n)\")\n", "\n", "# Change in random-effects std devs after removing influential observations\n", "gap = tau_gap(result)\n", "print(gap)" ] }, { "cell_type": "markdown", "id": "f5a6b7c8", "metadata": {}, "source": [ "## Combined augmented DataFrame\n", "\n", "`hlm_augment` is a convenience wrapper that returns a single DataFrame containing the\n", "original data, conditional residuals, fitted values, and all influence statistics.\n", "Useful for exploratory analysis or downstream filtering." ] }, { "cell_type": "code", "execution_count": null, "id": "d9e0f1a2", "metadata": {}, "outputs": [], "source": [ "from interlace import hlm_augment\n", "\n", "aug = hlm_augment(result)\n", "print(aug.columns.tolist())" ] }, { "cell_type": "code", "execution_count": null, "id": "b3c4d5e6", "metadata": {}, "outputs": [], "source": [ "# Find the most influential observations\n", "aug.nlargest(5, \"cooksd\")[[\"student_id\", \"school_id\", \"score\", \"cooksd\"]]" ] }, { "cell_type": "code", "execution_count": null, "id": "f7a8b9c0", "metadata": {}, "outputs": [], "source": [ "# Skip the influence refit loop (faster, residuals only)\n", "aug_fast = hlm_augment(result, include_influence=False)\n", "print(aug_fast.columns.tolist())" ] }, { "cell_type": "markdown", "id": "d1e2f3a4", "metadata": {}, "source": [ "## Prediction on new data" ] }, { "cell_type": "code", "execution_count": null, "id": "b5c6d7e8", "metadata": {}, "outputs": [], "source": [ "df_new = pd.DataFrame({\n", " \"hours_studied\": [3.0, 7.0, 5.0],\n", " \"prior_gpa\": [2.5, 3.8, 3.1],\n", " \"student_id\": [\"s1\", \"s2\", \"s_new\"], # s_new is unseen\n", " \"school_id\": [\"sch1\", \"sch1\", \"sch_new\"],\n", "})\n", "\n", "# Conditional prediction (known BLUPs applied, unknown → 0)\n", "y_hat = result.predict(newdata=df_new)\n", "print(y_hat)\n", "\n", "# Fixed-effects only (population-level)\n", "y_fe = result.predict(newdata=df_new, include_re=False)\n", "print(y_fe)" ] }, { "cell_type": "markdown", "id": "f9a0b1c2", "metadata": {}, "source": [ "## Plotting\n", "\n", "All plots return `plotnine.ggplot` objects and can be further customised with\n", "standard plotnine layers." ] }, { "cell_type": "markdown", "id": "d3e4f5a6", "metadata": {}, "source": [ "### Residual plots" ] }, { "cell_type": "code", "execution_count": null, "id": "b7c8d9e0", "metadata": {}, "outputs": [], "source": [ "from interlace import plot_resid\n", "\n", "resid_df = hlm_resid(result, type=\"conditional\")" ] }, { "cell_type": "code", "execution_count": null, "id": "f1a2b3c4", "metadata": {}, "outputs": [], "source": [ "plot_resid(resid_df, type=\"resid_vs_fitted\")" ] }, { "cell_type": "code", "execution_count": null, "id": "d5e6f7a8", "metadata": {}, "outputs": [], "source": [ "plot_resid(resid_df, type=\"qq\")" ] }, { "cell_type": "markdown", "id": "b9c0d1e2", "metadata": {}, "source": [ "### Influence index plot" ] }, { "cell_type": "code", "execution_count": null, "id": "f3a4b5c6", "metadata": {}, "outputs": [], "source": [ "from interlace import plot_influence\n", "\n", "plot_influence(infl, diag=\"cooksd\")" ] }, { "cell_type": "code", "execution_count": null, "id": "d7e8f9a0", "metadata": {}, "outputs": [], "source": [ "plot_influence(infl, diag=\"mdffits\")" ] }, { "cell_type": "markdown", "id": "b1c2d3e4", "metadata": {}, "source": [ "### Ranked dotplot with outlier labels\n", "\n", "`dotplot_diag` ranks observations by the chosen diagnostic and labels any that\n", "exceed 3 × IQR above Q3." ] }, { "cell_type": "code", "execution_count": null, "id": "f5a6b7c8d", "metadata": {}, "outputs": [], "source": [ "from interlace import dotplot_diag\n", "\n", "dotplot_diag(infl, diag=\"cooksd\", cutoff=\"internal\", name=\"index\")" ] }, { "cell_type": "code", "execution_count": null, "id": "e9f0a1b2c", "metadata": {}, "outputs": [], "source": [ "dotplot_diag(infl, diag=\"cooksd\", cutoff=4 / len(df))" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "name": "python", "version": "3.13.0" } }, "nbformat": 4, "nbformat_minor": 5 }