Python data stack
The exact subset of Python you need for USAAIO: NumPy for arrays, pandas for tabular data, matplotlib + seaborn for plots, scikit-learn for ML, and PyTorch for deep learning. Set them up once, learn the idioms, then stop thinking about syntax.
Environment setup. Use a virtual environment (or conda) so library versions are pinned.
For contests, install exact versions you'll use: a working notebook on day one beats a broken environment on day seven.
Setup
python3 -m venv .venv
source .venv/bin/activate
pip install numpy pandas matplotlib seaborn scikit-learn
pip install torch torchvision torchaudio
pip install jupyterlab
jupyter lab
On Mac with Apple silicon, PyTorch installs with MPS (Metal) acceleration by default; you don't need CUDA.
NumPy — arrays
NumPy is the foundation. Every other library wraps or interoperates with ndarray.
Creating arrays
import numpy as np
a = np.array([1, 2, 3]) # 1-D from list
b = np.zeros((3, 4)) # 3x4 zeros
c = np.ones(5) # 1-D ones
d = np.arange(0, 10, 2) # [0 2 4 6 8]
e = np.linspace(0, 1, 11) # 11 evenly spaced
r = np.random.randn(100, 3) # 100x3 standard normal
Shape, indexing, slicing
r.shape # (100, 3)
r[0] # first row, shape (3,)
r[:, 0] # first column
r[10:20, :2] # rows 10-19, first 2 columns
r[r > 0] # boolean mask: positive values only
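The slicing rules above can be checked directly. A small self-contained sketch (using NumPy's newer seeded-generator API rather than the bare `np.random.randn` above, so the run is reproducible):

```python
import numpy as np

rng = np.random.default_rng(0)          # seeded generator, reproducible
r = rng.standard_normal((100, 3))       # same shape as the example above

assert r.shape == (100, 3)
assert r[0].shape == (3,)               # a single row drops to 1-D
assert r[:, 0].shape == (100,)          # so does a single column
assert r[10:20, :2].shape == (10, 2)    # slicing keeps dimensions

positives = r[r > 0]                    # boolean mask flattens to 1-D
assert positives.ndim == 1
assert (positives > 0).all()
```

Note the asymmetry: integer indexing (`r[0]`) drops a dimension, slicing (`r[10:20]`) keeps it.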
Vectorized arithmetic
x = np.array([1.0, 2.0, 3.0])
y = np.array([10.0, 20.0, 30.0])
x + y # elementwise
x * y # elementwise
x @ y # dot product → 140.0
np.exp(x) # elementwise exp
np.mean(r, axis=0) # column means, shape (3,)
Broadcasting
M = np.zeros((4, 3))
v = np.array([1, 2, 3])
M + v # v broadcasts to each row → shape (4, 3)
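A typical practical use of broadcasting is centering each column of a matrix, since a `(4, 3)` array minus a `(3,)` row of means broadcasts the subtraction across every row (a minimal sketch):

```python
import numpy as np

M = np.arange(12, dtype=float).reshape(4, 3)   # 4 rows, 3 columns
col_means = M.mean(axis=0)                     # shape (3,): one mean per column

centered = M - col_means                       # (4, 3) - (3,) broadcasts row-wise
assert centered.shape == (4, 3)
assert np.allclose(centered.mean(axis=0), 0.0) # every column now has mean 0
```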
Vectorize, don't loop. An explicit Python for loop over a NumPy array is typically 50–500× slower than the equivalent vectorized expression, because the loop pays Python-level overhead per element while the vectorized form runs one compiled pass. If you find yourself writing a loop, ask "is there a NumPy operation that does this?"
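A quick demonstration that the two forms compute the same thing (timing them yourself with `%timeit` in a notebook will show the gap):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(100_000)

# Loop version: one Python-level iteration per element.
def loop_sumsq(a):
    total = 0.0
    for v in a:
        total += v * v
    return total

# Vectorized version: a single C-level pass.
vec = float(x @ x)                 # equivalently np.sum(x * x)

assert np.isclose(loop_sumsq(x), vec)
```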
pandas — tables
import pandas as pd
df = pd.read_csv("data.csv")
df.head()
df.info()
df.describe()
df["age"].mean()
df.groupby("category")["value"].mean()
df[df["age"] > 18]
df["log_x"] = np.log(df["x"] + 1)
df = df.dropna(subset=["target"])
df = df.fillna({"income": df["income"].median()})
- Indexing: df.loc[label_rows, label_cols] vs. df.iloc[int_rows, int_cols]. Always be explicit.
- Boolean filtering: df[df["x"] > 0].
- GroupBy + aggregation is your bread and butter for feature engineering.
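The idioms above on a toy frame, end to end (column names here are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "category": ["a", "a", "b", "b"],
    "value":    [1.0, 3.0, 10.0, 30.0],
    "age":      [15, 20, 25, 12],
})

# GroupBy + aggregation: one mean per category.
means = df.groupby("category")["value"].mean()
assert means["a"] == 2.0 and means["b"] == 20.0

# Boolean filtering keeps only rows where the condition holds.
adults = df[df["age"] > 18]
assert len(adults) == 2

# .loc is label-based and explicit: select rows and column in one step.
df.loc[df["age"] > 18, "adult"] = True
assert df.loc[df["age"] > 18, "adult"].all()
```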
Matplotlib & seaborn — plots
import matplotlib.pyplot as plt
import seaborn as sns
plt.figure(figsize=(6, 4))
plt.plot(x, y, label="train")
plt.plot(x, y_val, label="val")
plt.xlabel("epoch")
plt.ylabel("loss")
plt.legend()
plt.tight_layout()
plt.savefig("loss.png", dpi=120)
sns.scatterplot(data=df, x="feature1", y="feature2", hue="label")
sns.heatmap(df.corr(), annot=True, cmap="coolwarm")
Plots are how you debug models. Always plot the training and validation loss curves; always plot a few predictions vs. ground truth.
scikit-learn — the ML API
Every classical-ML model in scikit-learn follows the same pattern:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
from sklearn.preprocessing import StandardScaler
X_train, X_val, y_train, y_val = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_val_s = scaler.transform(X_val)
model = LogisticRegression(max_iter=1000)
model.fit(X_train_s, y_train)
pred = model.predict(X_val_s)
print(accuracy_score(y_val, pred))
print(classification_report(y_val, pred))
Always fit the scaler on training data only. Then transform the validation/test sets with that fitted scaler. Fitting on the whole dataset before splitting leaks information.
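The same rule written out in plain NumPy, so it is explicit which statistics come from which split (toy data, assumed shapes):

```python
import numpy as np

rng = np.random.default_rng(42)
X_train = rng.normal(5.0, 2.0, size=(80, 3))
X_val = rng.normal(5.0, 2.0, size=(20, 3))

# "Fit" = compute the statistics, on the training split only...
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)

# ..."transform" = apply those same statistics to both splits.
X_train_s = (X_train - mu) / sigma
X_val_s = (X_val - mu) / sigma

# Train is exactly standardized; val is only approximately so, because
# its values never influenced mu and sigma. That is the point.
assert np.allclose(X_train_s.mean(axis=0), 0.0)
assert np.allclose(X_train_s.std(axis=0), 1.0)
```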
Common gotchas
- Random seeds. Set np.random.seed(42), torch.manual_seed(42), and pass random_state=42 to scikit-learn estimators for reproducibility.
- Shape mismatches. 90% of debugging time. Print x.shape liberally.
- Mixing int and float. Floor division (//) on integer arrays silently truncates, and in-place operations on an int array can't store float results. Cast to float early in numerical code.
- Chained assignment in pandas. df[df["x"] > 0]["y"] = 1 often doesn't do what you want. Use .loc.
- Notebook out-of-order execution. Always "Restart & Run All" before submitting — it's the only way to know your notebook runs top-to-bottom.
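The chained-assignment gotcha, made concrete on a toy frame:

```python
import pandas as pd

df = pd.DataFrame({"x": [-1, 2, 3], "y": [0, 0, 0]})

# Chained assignment writes into a temporary copy, so df itself may be
# unchanged (pandas warns with SettingWithCopyWarning, or errors under
# copy-on-write):
#     df[df["x"] > 0]["y"] = 1   # don't do this

# .loc selects rows and column in a single indexing op, so the write sticks.
df.loc[df["x"] > 0, "y"] = 1
assert df["y"].tolist() == [0, 1, 1]
```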