{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/htpu/barryhan.net/blob/main/usaaio/notebooks/04_math_bayes_ab_test.ipynb)\n",
    "\n",
    "# Bayesian A/B Testing -- Beta-Binomial Conjugacy\n",
    "\n",
    "We compare two versions of a webpage (A and B), each with a binary outcome\n",
    "(click / no click). We derive the Beta-Binomial posterior by hand, simulate\n",
    "incoming data, watch the posteriors update, and compare a Bayesian decision rule\n",
    "to a frequentist z-test.\n",
    "\n",
    "**Runtime.** ~30 seconds on Colab CPU.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. The model\n",
    "\n",
    "Each visitor to variant $A$ converts with unknown probability $\\theta_A \\in [0,1]$,\n",
    "likewise $\\theta_B$ for $B$. After showing $A$ to $n_A$ visitors and observing\n",
    "$k_A$ conversions, the likelihood is binomial:\n",
    "\n",
    "$$\n",
    "P(k_A \\mid \\theta_A) = \\binom{n_A}{k_A}\\,\\theta_A^{k_A}(1-\\theta_A)^{n_A - k_A}.\n",
    "$$\n",
    "\n",
    "Take a $\\mathrm{Beta}(\\alpha, \\beta)$ prior on $\\theta_A$:\n",
    "\n",
    "$$\n",
    "p(\\theta_A) = \\frac{\\theta_A^{\\alpha-1}(1-\\theta_A)^{\\beta-1}}{B(\\alpha, \\beta)}.\n",
    "$$\n",
    "\n",
    "By Bayes' rule, the posterior is proportional to prior times likelihood:\n",
    "\n",
    "$$\n",
    "p(\\theta_A \\mid k_A) \\propto \\theta_A^{\\alpha + k_A - 1}(1-\\theta_A)^{\\beta + n_A - k_A - 1},\n",
    "$$\n",
    "\n",
    "which is again a Beta -- this is **conjugacy**:\n",
    "\n",
    "$$\n",
    "\\theta_A \\mid k_A \\;\\sim\\; \\mathrm{Beta}(\\alpha + k_A,\\; \\beta + n_A - k_A).\n",
    "$$\n",
    "\n",
    "A flat prior $\\mathrm{Beta}(1,1)$ is the uniform distribution and a common default.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. Setup"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "from scipy import stats\n",
    "\n",
    "rng = np.random.default_rng(0)\n",
    "plt.rcParams['figure.figsize'] = (9, 4)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. Simulate true ground truth"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "TRUE_A = 0.10   # 10% conversion\n",
    "TRUE_B = 0.12   # 12% conversion -- B is genuinely better\n",
    "N_PER_DAY = 200\n",
    "N_DAYS   = 14\n",
    "print(f\"True lift: {(TRUE_B - TRUE_A) * 100:.1f} percentage points\")\n"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "# Stream data day by day.\n",
    "days = []\n",
    "kA = kB = 0; nA = nB = 0\n",
    "for d in range(1, N_DAYS + 1):\n",
    "    aA = rng.binomial(N_PER_DAY, TRUE_A)\n",
    "    aB = rng.binomial(N_PER_DAY, TRUE_B)\n",
    "    kA += aA; nA += N_PER_DAY\n",
    "    kB += aB; nB += N_PER_DAY\n",
    "    days.append((d, kA, nA, kB, nB))\n",
    "import pandas as pd\n",
    "log = pd.DataFrame(days, columns=['day', 'kA', 'nA', 'kB', 'nB'])\n",
    "log['rateA'] = log['kA'] / log['nA']\n",
    "log['rateB'] = log['kB'] / log['nB']\n",
    "log\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4. Posterior evolution\n",
    "\n",
    "With a $\\mathrm{Beta}(1,1)$ prior the posterior after $k$ successes in $n$ trials is $\\mathrm{Beta}(1+k,\\,1+n-k)$. We plot how the posterior tightens each day."
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "ALPHA0, BETA0 = 1, 1\n",
    "xs = np.linspace(0, 0.25, 600)\n",
    "checkpoints = [1, 3, 7, 14]\n",
    "\n",
    "fig, axes = plt.subplots(1, len(checkpoints), figsize=(14, 3.2), sharey=True)\n",
    "for ax, d in zip(axes, checkpoints):\n",
    "    row = log[log.day == d].iloc[0]\n",
    "    pA = stats.beta(ALPHA0 + row.kA, BETA0 + row.nA - row.kA)\n",
    "    pB = stats.beta(ALPHA0 + row.kB, BETA0 + row.nB - row.kB)\n",
    "    ax.plot(xs, pA.pdf(xs), label='A')\n",
    "    ax.plot(xs, pB.pdf(xs), label='B')\n",
    "    ax.axvline(TRUE_A, color='C0', ls=':', alpha=.5)\n",
    "    ax.axvline(TRUE_B, color='C1', ls=':', alpha=.5)\n",
    "    ax.set_title(f'day {d}  (nA={int(row.nA)})')\n",
    "    ax.set_xlabel(r'$\theta$')\n",
    "    ax.legend(fontsize=8)\n",
    "axes[0].set_ylabel('posterior density')\n",
    "plt.tight_layout(); plt.show()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5. The decision quantity -- $P(\\theta_B > \\theta_A \\mid \\text{data})$\n",
    "\n",
    "There's no closed form, but Monte Carlo is trivial:\n",
    "\n",
    "1. Draw $M$ samples from each posterior.\n",
    "2. Estimate $\\hat{p} = \\frac{1}{M}\\sum_{i=1}^{M} \\mathbf{1}[\\theta_B^{(i)} > \\theta_A^{(i)}]$.\n"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "def prob_B_better(kA, nA, kB, nB, draws=200_000, prior=(1, 1)):\n",
    "    a, b = prior\n",
    "    pA = stats.beta(a + kA, b + nA - kA).rvs(size=draws, random_state=rng)\n",
    "    pB = stats.beta(a + kB, b + nB - kB).rvs(size=draws, random_state=rng)\n",
    "    return float(np.mean(pB > pA)), float(np.mean(pB - pA)), float(np.std(pB - pA))\n",
    "\n",
    "log['P(B>A)'] = [prob_B_better(r.kA, r.nA, r.kB, r.nB)[0] for r in log.itertuples()]\n",
    "log[['day','kA','nA','kB','nB','rateA','rateB','P(B>A)']]\n"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "plt.figure()\n",
    "plt.plot(log['day'], log['P(B>A)'], marker='o')\n",
    "plt.axhline(0.95, color='red', ls='--', label='95% threshold')\n",
    "plt.axhline(0.5, color='grey', ls=':')\n",
    "plt.xlabel('day'); plt.ylabel(r'$P(\theta_B > \theta_A \\mid \\mathrm{data})$')\n",
    "plt.title('Bayesian decision quantity over time')\n",
    "plt.legend(); plt.tight_layout(); plt.show()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 6. Decision rule\n",
    "\n",
    "Stop and ship $B$ when $P(\\theta_B > \\theta_A \\mid \\text{data}) \\ge 0.95$\n",
    "**and** the expected lift exceeds a minimum effect size you'd care about (say 1pp).\n"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "MIN_LIFT = 0.01\n",
    "decided = False\n",
    "for r in log.itertuples():\n",
    "    p, mean_lift, _ = prob_B_better(r.kA, r.nA, r.kB, r.nB)\n",
    "    if p >= 0.95 and mean_lift >= MIN_LIFT:\n",
    "        print(f\"Day {r.day}: ship B  (P(B>A)={p:.3f}, E[lift]={mean_lift*100:.2f}pp)\")\n",
    "        decided = True\n",
    "        break\n",
    "if not decided:\n",
    "    print('No decision yet -- need more data.')\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 7. Comparison vs. frequentist two-proportion z-test\n",
    "\n",
    "The classical test computes\n",
    "\n",
    "$$\n",
    "z = \\frac{\\hat p_B - \\hat p_A}{\\sqrt{\\hat p (1-\\hat p)\\left(\\tfrac{1}{n_A} + \\tfrac{1}{n_B}\\right)}},\\qquad\n",
    "\\hat p = \\frac{k_A + k_B}{n_A + n_B},\n",
    "$$\n",
    "\n",
    "and rejects $H_0:\\theta_A=\\theta_B$ when $|z| > 1.96$ (two-sided, $\\alpha=0.05$).\n"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "def ztest(kA, nA, kB, nB):\n",
    "    pA = kA / nA; pB = kB / nB\n",
    "    p  = (kA + kB) / (nA + nB)\n",
    "    se = np.sqrt(p * (1 - p) * (1 / nA + 1 / nB))\n",
    "    z  = (pB - pA) / se\n",
    "    # two-sided p-value\n",
    "    pval = 2 * (1 - stats.norm.cdf(abs(z)))\n",
    "    return z, pval\n",
    "\n",
    "log['z']     = [ztest(r.kA, r.nA, r.kB, r.nB)[0] for r in log.itertuples()]\n",
    "log['p_freq']= [ztest(r.kA, r.nA, r.kB, r.nB)[1] for r in log.itertuples()]\n",
    "log[['day', 'P(B>A)', 'z', 'p_freq']]\n"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "fig, ax1 = plt.subplots()\n",
    "ax1.plot(log['day'], log['P(B>A)'], 'o-', color='C0', label='P(B>A)')\n",
    "ax1.set_ylabel('Bayesian P(B>A)', color='C0'); ax1.axhline(0.95, color='C0', ls='--', alpha=.4)\n",
    "ax2 = ax1.twinx()\n",
    "ax2.plot(log['day'], log['p_freq'], 's-', color='C3', label='frequentist p-value')\n",
    "ax2.set_ylabel('frequentist p-value', color='C3'); ax2.axhline(0.05, color='C3', ls='--', alpha=.4)\n",
    "plt.title('Bayesian vs frequentist evidence over time')\n",
    "plt.tight_layout(); plt.show()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 8. Drill problems\n",
    "\n",
    "Try these by hand before peeking at the worked solutions.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Problem 1.** With a $\\mathrm{Beta}(2,8)$ prior on $\\theta$, you observe 7 successes in 20 trials. What's the posterior mean?\n",
    "\n",
    "**Problem 2.** Variant A: 90 conversions in 1000 visitors. Variant B: 120 in 1000. With flat priors, estimate $P(\\theta_B > \\theta_A \\mid \\text{data})$ by Monte Carlo with 100k draws.\n",
    "\n",
    "**Problem 3.** Why does the Bayesian decision rule allow continuous monitoring while the frequentist test (in its naive form) doesn't?\n"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "# Worked solution 1: posterior is Beta(2 + 7, 8 + 13) = Beta(9, 21).\n",
    "post1 = stats.beta(9, 21)\n",
    "print(f\"Problem 1 posterior mean = {post1.mean():.4f}  (closed form: 9/(9+21)={9/30:.4f})\")\n",
    "print(f\"  95% credible interval = ({post1.ppf(.025):.3f}, {post1.ppf(.975):.3f})\")\n"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "# Worked solution 2:\n",
    "p2, lift, _ = prob_B_better(90, 1000, 120, 1000, draws=100_000)\n",
    "print(f\"Problem 2  P(B>A) ~ {p2:.3f}   E[lift] ~ {lift*100:.2f}pp\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Worked solution 3.** Frequentist p-values are calibrated under a *fixed sample size* protocol: peek-and-stop inflates the false-positive rate (each peek is another chance to cross the threshold). The Bayesian posterior is conditional on the data you've actually seen, so it doesn't depend on a stopping rule. Caveat: optional stopping still affects long-run frequentist properties of any decision you derive from the posterior, so in practice you pair it with sequential designs or a minimum sample size.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## What to try next\n",
    "\n",
    "- Replace the flat prior with $\\mathrm{Beta}(\\hat p \\cdot s,\\, (1-\\hat p) \\cdot s)$ from historical data (informative prior).\n",
    "- Multi-variant test: A vs B vs C, decision = $\\arg\\max_v \\theta_v$ -- use Thompson sampling for allocation.\n",
    "- Sequential test with the *expected loss* stopping rule instead of the 95% rule.\n",
    "\n",
    "Back to [Math you need](../math.html) on the USAAIO site.\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python",
   "version": "3.11"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}