{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "a1b2c3d4",
   "metadata": {},
   "source": [
    "# Examples\n",
    "\n",
    "All examples use the same synthetic exam scores dataset. Run this setup block first:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e5f6a7b8",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "from interlace import fit\n",
    "\n",
    "rng = np.random.default_rng(42)\n",
    "n = 300\n",
    "df = pd.DataFrame({\n",
    "    \"score\":         60 + rng.normal(0, 8, n),\n",
    "    \"hours_studied\": rng.uniform(0, 10, n),\n",
    "    \"prior_gpa\":     rng.uniform(2.0, 4.0, n),\n",
    "    \"student_id\":    [f\"s{i}\"   for i in rng.integers(1, 31, n)],\n",
    "    \"school_id\":     [f\"sch{i}\" for i in rng.integers(1, 11, n)],\n",
    "})\n",
    "\n",
    "result = fit(\n",
    "    formula=\"score ~ hours_studied + prior_gpa\",\n",
    "    data=df,\n",
    "    groups=[\"student_id\", \"school_id\"],\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c9d0e1f2",
   "metadata": {},
   "source": [
    "## Single grouping factor\n",
    "\n",
    "When you only have one source of clustering, pass a single column name:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a3b4c5d6",
   "metadata": {},
   "outputs": [],
   "source": [
    "result_single = fit(\n",
    "    formula=\"score ~ hours_studied + prior_gpa\",\n",
    "    data=df,\n",
    "    groups=\"student_id\",\n",
    ")\n",
    "\n",
    "print(result_single.variance_components)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e7f8a9b0",
   "metadata": {},
   "source": [
    "## Residuals\n",
    "\n",
    "`hlm_resid` returns a DataFrame with `.resid` and `.fitted` columns alongside the\n",
    "original data. Use `type=\"marginal\"` to ignore random effects, or\n",
    "`type=\"conditional\"` to subtract predicted BLUPs."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c1d2e3f4",
   "metadata": {},
   "outputs": [],
   "source": [
    "from interlace import hlm_resid\n",
    "\n",
    "# Marginal residuals: y - Xβ\n",
    "marginal = hlm_resid(result, type=\"marginal\")\n",
    "print(marginal[[\"resid\", \"fitted\"]].describe())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a5b6c7d8",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Conditional residuals: y - Xβ - Zû\n",
    "conditional = hlm_resid(result, type=\"conditional\")\n",
    "\n",
    "# Standardised\n",
    "std_resid = hlm_resid(result, type=\"conditional\", standardized=True)\n",
    "\n",
    "# Group-level random effects\n",
    "school_re = hlm_resid(result, level=\"school_id\")\n",
    "print(school_re.head())"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e9f0a1b2",
   "metadata": {},
   "source": [
    "## Leverage\n",
    "\n",
    "The hat-matrix diagonal is decomposed into fixed-effect and random-effect components\n",
    "following Demidenko & Stukel (2005) and Nobre & Singer (2007)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c3d4e5f6",
   "metadata": {},
   "outputs": [],
   "source": [
    "from interlace import leverage\n",
    "\n",
    "lev = leverage(result)\n",
    "print(lev.columns)\n",
    "\n",
    "# High-leverage observations\n",
    "high_lev = lev[lev[\"overall\"] > 2 * lev[\"overall\"].mean()]\n",
    "print(f\"{len(high_lev)} high-leverage observations\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a7b8c9d0",
   "metadata": {},
   "source": [
    "## Influence diagnostics\n",
    "\n",
    "`hlm_influence` fits the model *n* times with one observation (or group) deleted,\n",
    "computing Cook's D, MDFFITS, COVTRACE, COVRATIO, and RVC for each deletion."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e1f2a3b4",
   "metadata": {},
   "outputs": [],
   "source": [
    "from interlace import hlm_influence\n",
    "\n",
    "# Observation-level influence\n",
    "infl = hlm_influence(result, level=1)\n",
    "print(infl.columns)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c5d6e7f8",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Group-level influence (delete one school at a time)\n",
    "school_infl = hlm_influence(result, level=\"school_id\")\n",
    "print(school_infl.head())"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a9b0c1d2",
   "metadata": {},
   "source": [
    "### Cook's distance and MDFFITS"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e3f4a5b6",
   "metadata": {},
   "outputs": [],
   "source": [
    "from interlace import cooks_distance, mdffits\n",
    "\n",
    "cd  = cooks_distance(result)   # np.ndarray, shape (n,)\n",
    "mdf = mdffits(result)          # np.ndarray, shape (n,)\n",
    "\n",
    "print(f\"Max Cook's D: {cd.max():.4f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c7d8e9f0",
   "metadata": {},
   "source": [
    "### Count and measure influential observations"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a1b2c3d4e",
   "metadata": {},
   "outputs": [],
   "source": [
    "from interlace import n_influential, tau_gap\n",
    "\n",
    "# Count observations exceeding the 4/n heuristic threshold\n",
    "n_inf = n_influential(result)\n",
    "print(f\"{n_inf} influential observations (Cook's D > 4/n)\")\n",
    "\n",
    "# Change in random-effects std devs after removing influential observations\n",
    "gap = tau_gap(result)\n",
    "print(gap)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f5a6b7c8",
   "metadata": {},
   "source": [
    "## Combined augmented DataFrame\n",
    "\n",
    "`hlm_augment` is a convenience wrapper that returns a single DataFrame containing the\n",
    "original data, conditional residuals, fitted values, and all influence statistics.\n",
    "Useful for exploratory analysis or downstream filtering."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d9e0f1a2",
   "metadata": {},
   "outputs": [],
   "source": [
    "from interlace import hlm_augment\n",
    "\n",
    "aug = hlm_augment(result)\n",
    "print(aug.columns.tolist())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b3c4d5e6",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Find the most influential observations\n",
    "aug.nlargest(5, \"cooksd\")[[\"student_id\", \"school_id\", \"score\", \"cooksd\"]]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f7a8b9c0",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Skip the influence refit loop (faster, residuals only)\n",
    "aug_fast = hlm_augment(result, include_influence=False)\n",
    "print(aug_fast.columns.tolist())"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d1e2f3a4",
   "metadata": {},
   "source": [
    "## Prediction on new data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b5c6d7e8",
   "metadata": {},
   "outputs": [],
   "source": [
    "df_new = pd.DataFrame({\n",
    "    \"hours_studied\": [3.0, 7.0, 5.0],\n",
    "    \"prior_gpa\":     [2.5, 3.8, 3.1],\n",
    "    \"student_id\":    [\"s1\", \"s2\", \"s_new\"],  # s_new is unseen\n",
    "    \"school_id\":     [\"sch1\", \"sch1\", \"sch_new\"],\n",
    "})\n",
    "\n",
    "# Conditional prediction (known BLUPs applied, unknown → 0)\n",
    "y_hat = result.predict(newdata=df_new)\n",
    "print(y_hat)\n",
    "\n",
    "# Fixed-effects only (population-level)\n",
    "y_fe = result.predict(newdata=df_new, include_re=False)\n",
    "print(y_fe)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f9a0b1c2",
   "metadata": {},
   "source": [
    "## Plotting\n",
    "\n",
    "All plots return `plotnine.ggplot` objects and can be further customised with\n",
    "standard plotnine layers."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d3e4f5a6",
   "metadata": {},
   "source": [
    "### Residual plots"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b7c8d9e0",
   "metadata": {},
   "outputs": [],
   "source": [
    "from interlace import plot_resid\n",
    "\n",
    "resid_df = hlm_resid(result, type=\"conditional\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f1a2b3c4",
   "metadata": {},
   "outputs": [],
   "source": [
    "plot_resid(resid_df, type=\"resid_vs_fitted\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d5e6f7a8",
   "metadata": {},
   "outputs": [],
   "source": [
    "plot_resid(resid_df, type=\"qq\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b9c0d1e2",
   "metadata": {},
   "source": [
    "### Influence index plot"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f3a4b5c6",
   "metadata": {},
   "outputs": [],
   "source": [
    "from interlace import plot_influence\n",
    "\n",
    "plot_influence(infl, diag=\"cooksd\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d7e8f9a0",
   "metadata": {},
   "outputs": [],
   "source": [
    "plot_influence(infl, diag=\"mdffits\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b1c2d3e4",
   "metadata": {},
   "source": [
    "### Ranked dotplot with outlier labels\n",
    "\n",
    "`dotplot_diag` ranks observations by the chosen diagnostic and labels any that\n",
    "exceed 3 × IQR above Q3."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f5a6b7c8d",
   "metadata": {},
   "outputs": [],
   "source": [
    "from interlace import dotplot_diag\n",
    "\n",
    "dotplot_diag(infl, diag=\"cooksd\", cutoff=\"internal\", name=\"index\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e9f0a1b2c",
   "metadata": {},
   "outputs": [],
   "source": [
    "dotplot_diag(infl, diag=\"cooksd\", cutoff=4 / len(df))"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python",
   "version": "3.13.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}