{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/htpu/barryhan.net/blob/main/usaaio/notebooks/01_ml_tabular_titanic.ipynb)\n",
    "\n",
    "# Classical ML on Tabular Data \u2014 Titanic Survival\n",
    "\n",
    "End-to-end pipeline: load -> EDA -> feature engineering -> preprocessing -> baseline ->\n",
    "model comparison -> cross-validation -> feature importance -> Kaggle-style submission.\n",
    "\n",
    "**Dataset.** Kaggle's *Titanic: Machine Learning from Disaster*. We pull a public\n",
    "mirror so the notebook runs on Colab without Kaggle credentials.\n",
    "\n",
    "**Runtime.** ~2 minutes on Colab CPU.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 0. Setup\n"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "# Standard scientific Python stack -- all pre-installed on Colab.\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns\n",
    "\n",
    "from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold\n",
    "from sklearn.preprocessing import StandardScaler, OneHotEncoder\n",
    "from sklearn.impute import SimpleImputer\n",
    "from sklearn.compose import ColumnTransformer\n",
    "from sklearn.pipeline import Pipeline\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, StackingClassifier\n",
    "from sklearn.metrics import accuracy_score, classification_report\n",
    "\n",
    "RANDOM_STATE = 42\n",
    "np.random.seed(RANDOM_STATE)\n",
    "sns.set_theme(style=\"whitegrid\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Load the data\n",
    "\n",
    "We use the CSV mirror hosted by DataScienceDojo, a copy of Kaggle's labeled Titanic training file (891 rows)."
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "URL = \"https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv\"\n",
    "df = pd.read_csv(URL)\n",
    "print(df.shape)\n",
    "df.head()\n"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "df.info()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. Exploratory Data Analysis"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "df.describe(include='all')\n"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "# Survival rate by sex and class -- two of the strongest predictors.\n",
    "fig, axes = plt.subplots(1, 2, figsize=(11, 4))\n",
    "sns.barplot(data=df, x='Sex', y='Survived', ax=axes[0])\n",
    "axes[0].set_title('Survival by Sex')\n",
    "sns.barplot(data=df, x='Pclass', y='Survived', ax=axes[1])\n",
    "axes[1].set_title('Survival by Passenger Class')\n",
    "plt.tight_layout()\n",
    "plt.show()\n"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "# Correlation heatmap on numeric features (PassengerId is an arbitrary row label, so drop it).\n",
    "num_df = df.select_dtypes(include=[np.number]).drop(columns=['PassengerId'])\n",
    "plt.figure(figsize=(7, 5))\n",
    "sns.heatmap(num_df.corr(), annot=True, fmt='.2f', cmap='coolwarm', center=0)\n",
    "plt.title('Numeric feature correlation')\n",
    "plt.show()\n"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "# Age distribution by survival.\n",
    "plt.figure(figsize=(8, 4))\n",
    "sns.histplot(data=df, x='Age', hue='Survived', kde=True, bins=30, multiple='stack')\n",
    "plt.title('Age distribution by survival')\n",
    "plt.show()\n"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "# Missingness summary.\n",
    "df.isna().mean().sort_values(ascending=False)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. Feature engineering\n",
    "\n",
    "We build three classic Titanic features:\n",
    "\n",
    "- **Title** -- extracted from `Name` (Mr/Mrs/Miss/Master/...). Encodes age + sex + status jointly.\n",
    "- **FamilySize** -- `SibSp + Parch + 1`. Passengers in small families survived more often than solo travelers or members of large families.\n",
    "- **FareBin** -- quartile bucket of `Fare`. Robust to outliers such as the few fares above 500.\n"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "def extract_title(name: str) -> str:\n",
    "    # 'Braund, Mr. Owen Harris' -> 'Mr'\n",
    "    return name.split(',')[1].split('.')[0].strip()\n",
    "\n",
    "df['Title'] = df['Name'].apply(extract_title)\n",
    "# Collapse rare titles.\n",
    "rare = df['Title'].value_counts()[lambda s: s < 10].index\n",
    "df['Title'] = df['Title'].replace(list(rare), 'Rare')\n",
    "df['Title'].value_counts()\n"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "df['FamilySize'] = df['SibSp'] + df['Parch'] + 1\n",
    "df['IsAlone'] = (df['FamilySize'] == 1).astype(int)\n",
    "df['FareBin'] = pd.qcut(df['Fare'].fillna(df['Fare'].median()), 4, labels=False)\n",
    "df[['FamilySize', 'IsAlone', 'FareBin']].head()\n"
   ]
  },
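  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Because `pd.qcut` derives its bin edges from the data it is given, binning two data sets separately\n",
    "produces bins with different boundaries. The cell below is a minimal sketch of learning the edges once\n",
    "and reusing them. It runs on made-up fares, and `fit_fare_bins` / `apply_fare_bins` are hypothetical\n",
    "helper names, not part of pandas or of this notebook's pipeline.\n"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "def fit_fare_bins(fares, q=4):\n",
    "    # Learn quantile edges from the reference (training) fares only.\n",
    "    _, edges = pd.qcut(fares, q, retbins=True, labels=False)\n",
    "    edges = edges.copy()\n",
    "    # Open the outer edges so fares outside the reference range still land in a bin.\n",
    "    edges[0], edges[-1] = -np.inf, np.inf\n",
    "    return edges\n",
    "\n",
    "def apply_fare_bins(fares, edges):\n",
    "    # Assign each fare to the bin defined by the stored edges.\n",
    "    return pd.cut(fares, bins=edges, labels=False)\n",
    "\n",
    "# Toy values for illustration only (not taken from the Titanic file).\n",
    "toy_train = pd.Series([5.0, 7.5, 8.0, 13.0, 26.0, 31.0, 80.0, 250.0])\n",
    "toy_new = pd.Series([4.0, 10.0, 30.0, 600.0])\n",
    "\n",
    "toy_edges = fit_fare_bins(toy_train)\n",
    "print(toy_edges)\n",
    "print(apply_fare_bins(toy_new, toy_edges).tolist())\n"
   ]
  },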
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "# Quick sanity check: survival rate by title.\n",
    "df.groupby('Title')['Survived'].agg(['mean', 'count']).sort_values('mean', ascending=False)\n"
   ]
  },
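  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4. Preprocessing pipeline\n",
    "\n",
    "Wrapping imputation, scaling, and one-hot encoding in a `ColumnTransformer` keeps train/val/test transformations consistent: inside a `Pipeline`, their statistics are learned only from the rows passed to `fit`, which prevents leakage from these steps. The rare-title grouping and the `FareBin` edges from section 3 were computed on the full data frame, so they sit outside this guarantee."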
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "target = 'Survived'\n",
    "drop_cols = ['PassengerId', 'Name', 'Ticket', 'Cabin', target]\n",
    "X = df.drop(columns=drop_cols)\n",
    "y = df[target]\n",
    "\n",
    "numeric_features = ['Age', 'Fare', 'SibSp', 'Parch', 'FamilySize', 'IsAlone', 'FareBin', 'Pclass']\n",
    "categorical_features = ['Sex', 'Embarked', 'Title']\n",
    "\n",
    "numeric_tf = Pipeline([\n",
    "    ('imputer', SimpleImputer(strategy='median')),\n",
    "    ('scaler', StandardScaler()),\n",
    "])\n",
    "categorical_tf = Pipeline([\n",
    "    ('imputer', SimpleImputer(strategy='most_frequent')),\n",
    "    ('onehot', OneHotEncoder(handle_unknown='ignore')),\n",
    "])\n",
    "preprocessor = ColumnTransformer([\n",
    "    ('num', numeric_tf, numeric_features),\n",
    "    ('cat', categorical_tf, categorical_features),\n",
    "])\n",
    "preprocessor\n"
   ]
  },
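  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the fit-versus-transform distinction concrete, the sketch below fits a small imputer-plus-scaler\n",
    "pipeline on toy numbers and then transforms separate toy hold-out rows. The arrays and the `toy_` names\n",
    "are illustrative assumptions, not part of the notebook's data.\n"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from sklearn.impute import SimpleImputer\n",
    "from sklearn.pipeline import Pipeline\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "\n",
    "toy_fit = np.array([[1.0], [2.0], [3.0], [np.nan]])  # rows the pipeline is fit on\n",
    "toy_holdout = np.array([[100.0], [np.nan]])           # rows it only transforms\n",
    "\n",
    "toy_pipe = Pipeline([\n",
    "    ('imputer', SimpleImputer(strategy='median')),\n",
    "    ('scaler', StandardScaler()),\n",
    "])\n",
    "toy_pipe.fit(toy_fit)\n",
    "\n",
    "# The median comes from toy_fit alone; the extreme hold-out value never influences it.\n",
    "learned_median = toy_pipe.named_steps['imputer'].statistics_[0]\n",
    "print(learned_median)\n",
    "# The missing hold-out value is filled with that median, then scaled with toy_fit's mean and std.\n",
    "print(toy_pipe.transform(toy_holdout))\n"
   ]
  },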
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5. Train / validation split"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "X_train, X_val, y_train, y_val = train_test_split(\n",
    "    X, y, test_size=0.2, stratify=y, random_state=RANDOM_STATE\n",
    ")\n",
    "print(f\"train={X_train.shape}  val={X_val.shape}\")\n",
    "print(f\"train survival rate={y_train.mean():.3f}  val={y_val.mean():.3f}\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 6. Baseline -- Logistic Regression"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "baseline = Pipeline([\n",
    "    ('prep', preprocessor),\n",
    "    ('clf', LogisticRegression(max_iter=1000, random_state=RANDOM_STATE)),\n",
    "])\n",
    "baseline.fit(X_train, y_train)\n",
    "pred = baseline.predict(X_val)\n",
    "print(f\"Baseline LR validation accuracy: {accuracy_score(y_val, pred):.4f}\")\n",
    "print(classification_report(y_val, pred))\n"
   ]
  },
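  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 7. Model comparison\n",
    "\n",
    "Three single models plus one stacking ensemble, which trains a logistic-regression meta-learner on the base models' cross-validated predictions."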
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "models = {\n",
    "    'LogReg': LogisticRegression(max_iter=1000, random_state=RANDOM_STATE),\n",
    "    'RandomForest': RandomForestClassifier(n_estimators=300, max_depth=6, random_state=RANDOM_STATE),\n",
    "    'GradBoost': GradientBoostingClassifier(n_estimators=200, max_depth=3, random_state=RANDOM_STATE),\n",
    "}\n",
    "stack = StackingClassifier(\n",
    "    estimators=[(name, mdl) for name, mdl in models.items()],\n",
    "    final_estimator=LogisticRegression(max_iter=1000),\n",
    "    cv=5,\n",
    ")\n",
    "models['Stacking'] = stack\n"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "results = {}\n",
    "for name, mdl in models.items():\n",
    "    pipe = Pipeline([('prep', preprocessor), ('clf', mdl)])\n",
    "    pipe.fit(X_train, y_train)\n",
    "    pred = pipe.predict(X_val)\n",
    "    acc = accuracy_score(y_val, pred)\n",
    "    results[name] = (pipe, acc)\n",
    "    print(f\"{name:14s}  val_acc = {acc:.4f}\")\n"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "# Bar chart of validation accuracy.\n",
    "names = list(results.keys())\n",
    "accs = [results[n][1] for n in names]\n",
    "plt.figure(figsize=(7, 4))\n",
    "sns.barplot(x=names, y=accs)\n",
    "plt.ylim(0.7, max(accs) + 0.02)\n",
    "plt.ylabel('Validation accuracy')\n",
    "plt.title('Model comparison')\n",
    "for i, a in enumerate(accs):\n",
    "    plt.text(i, a + 0.002, f\"{a:.3f}\", ha='center')\n",
    "plt.show()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 8. Cross-validation\n",
    "\n",
    "Accuracy from a single 20% hold-out (179 of 891 rows) is noisy. Stratified 5-fold CV scores every row once and averages over five splits, which gives a more stable estimate."
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=RANDOM_STATE)\n",
    "cv_results = {}\n",
    "for name, mdl in models.items():\n",
    "    pipe = Pipeline([('prep', preprocessor), ('clf', mdl)])\n",
    "    scores = cross_val_score(pipe, X, y, cv=cv, scoring='accuracy', n_jobs=-1)\n",
    "    cv_results[name] = scores\n",
    "    print(f\"{name:14s}  mean={scores.mean():.4f}  std={scores.std():.4f}\")\n"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "# Boxplot of CV folds per model.\n",
    "cv_df = pd.DataFrame(cv_results)\n",
    "plt.figure(figsize=(8, 4))\n",
    "sns.boxplot(data=cv_df)\n",
    "sns.swarmplot(data=cv_df, color='black', size=4)\n",
    "plt.ylabel('Accuracy')\n",
    "plt.title('5-fold CV accuracy by model')\n",
    "plt.show()\n"
   ]
  },
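  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A rough way to quantify the noise is the binomial standard error of an accuracy estimate,\n",
    "$\\sqrt{p(1-p)/n}$. The sketch below compares a 20% hold-out with scoring all 891 rows. The accuracy\n",
    "of 0.80 is an assumed illustrative value, not a result from this notebook, and the formula treats\n",
    "predictions as independent, which is only approximately true for cross-validation.\n"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "import math\n",
    "\n",
    "def accuracy_standard_error(acc, n):\n",
    "    # Binomial standard error of an accuracy estimated from n independent predictions.\n",
    "    return math.sqrt(acc * (1.0 - acc) / n)\n",
    "\n",
    "assumed_acc = 0.80             # illustrative value only\n",
    "n_val = math.ceil(0.2 * 891)   # rows in a 20% hold-out split\n",
    "n_all = 891                    # every row is scored once across the CV folds\n",
    "\n",
    "print(n_val, round(accuracy_standard_error(assumed_acc, n_val), 4))\n",
    "print(n_all, round(accuracy_standard_error(assumed_acc, n_all), 4))\n"
   ]
  },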
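  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 9. Feature importance\n",
    "\n",
    "Impurity-based importances from the Random Forest, one value per transformed column (each one-hot level gets its own bar). These scores tend to favor continuous, high-cardinality features such as `Age` and `Fare`, so read the ranking as a rough guide."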
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "rf_pipe = results['RandomForest'][0]\n",
    "rf = rf_pipe.named_steps['clf']\n",
    "prep = rf_pipe.named_steps['prep']\n",
    "feat_names = prep.get_feature_names_out()\n",
    "\n",
    "importances = pd.Series(rf.feature_importances_, index=feat_names).sort_values(ascending=False).head(15)\n",
    "plt.figure(figsize=(7, 5))\n",
    "sns.barplot(x=importances.values, y=importances.index)\n",
    "plt.title('Top 15 features -- Random Forest importance')\n",
    "plt.xlabel('Importance')\n",
    "plt.tight_layout()\n",
    "plt.show()\n"
   ]
  },
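  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Permutation importance on held-out rows is a common cross-check for impurity-based rankings. The\n",
    "sketch below is not the method used above. It runs on synthetic data in which one binary column fully\n",
    "determines the label and a second continuous column is pure noise; all `toy_` names are hypothetical.\n"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from sklearn.ensemble import RandomForestClassifier\n",
    "from sklearn.inspection import permutation_importance\n",
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "toy_rng = np.random.default_rng(0)\n",
    "toy_n = 400\n",
    "toy_signal = toy_rng.integers(0, 2, size=toy_n)   # binary feature that determines the label\n",
    "toy_noise = toy_rng.normal(size=toy_n)            # continuous feature unrelated to the label\n",
    "toy_X = np.column_stack([toy_signal, toy_noise])\n",
    "toy_y = toy_signal\n",
    "\n",
    "toy_Xtr, toy_Xte, toy_ytr, toy_yte = train_test_split(\n",
    "    toy_X, toy_y, test_size=0.25, stratify=toy_y, random_state=0\n",
    ")\n",
    "toy_rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(toy_Xtr, toy_ytr)\n",
    "\n",
    "# Shuffle one column at a time on held-out rows and measure the drop in accuracy.\n",
    "toy_perm = permutation_importance(toy_rf, toy_Xte, toy_yte, n_repeats=10, random_state=0)\n",
    "for toy_name, toy_mean in zip(['signal', 'noise'], toy_perm.importances_mean):\n",
    "    print(f'{toy_name:6s} {toy_mean:.3f}')\n"
   ]
  },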
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 10. Pick the best model and refit on full data"
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "best_name = max(cv_results, key=lambda k: cv_results[k].mean())\n",
    "print(f\"Best model by CV mean: {best_name}\")\n",
    "best_pipe = Pipeline([('prep', preprocessor), ('clf', models[best_name])])\n",
    "best_pipe.fit(X, y)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 11. Write a Kaggle-style submission\n",
    "\n",
    "Kaggle's Titanic test set has the same columns minus `Survived`. We download it,\n",
    "apply the *same* feature engineering, predict, and write `submission.csv`.\n"
   ]
  },
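  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "TEST_URL = \"https://raw.githubusercontent.com/agconti/kaggle-titanic/master/data/test.csv\"\n",
    "test = pd.read_csv(TEST_URL)\n",
    "# Keep only titles present in the training data; anything else becomes 'Rare'.\n",
    "test['Title'] = test['Name'].apply(extract_title)\n",
    "test['Title'] = test['Title'].where(test['Title'].isin(df['Title'].unique()), 'Rare')\n",
    "test['FamilySize'] = test['SibSp'] + test['Parch'] + 1\n",
    "test['IsAlone'] = (test['FamilySize'] == 1).astype(int)\n",
    "# Bin test fares with the training quartile edges so FareBin means the same thing in both sets.\n",
    "_, fare_edges = pd.qcut(df['Fare'], 4, retbins=True, labels=False)\n",
    "fare_edges = np.concatenate(([-np.inf], fare_edges[1:-1], [np.inf]))\n",
    "test['FareBin'] = pd.cut(test['Fare'].fillna(df['Fare'].median()), bins=fare_edges, labels=False)\n",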
    "test_X = test.drop(columns=[c for c in ['PassengerId', 'Name', 'Ticket', 'Cabin'] if c in test.columns])\n",
    "preds = best_pipe.predict(test_X)\n",
    "submission = pd.DataFrame({'PassengerId': test['PassengerId'], 'Survived': preds.astype(int)})\n",
    "submission.to_csv('submission.csv', index=False)\n",
    "print(submission.head())\n",
    "print(f\"Wrote submission.csv with {len(submission)} rows.\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## What to try next\n",
    "\n",
    "- Hyperparameter search with `GridSearchCV` or `RandomizedSearchCV`.\n",
    "- Try `XGBoost` or `LightGBM` -- often a small accuracy gain on tabular data, though on a data set this small the difference may fall within CV noise.\n",
    "- Add interaction features (`Pclass * Sex`, `Age * Pclass`).\n",
    "- Calibrate probabilities with `CalibratedClassifierCV` if your metric is log-loss.\n",
    "\n",
    "Back to [Classical ML](../ml.html) on the USAAIO site.\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python",
   "version": "3.11"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
