Python data stack
The exact subset of Python you need for USAAIO: NumPy for arrays, pandas for tabular data, matplotlib + seaborn for plots, scikit-learn for ML, and PyTorch for deep learning. Set them up once, learn the idioms, then stop thinking about syntax.
Environment setup. Use a virtual environment (or conda) so library versions are pinned.
For contests, install exact versions you'll use: a working notebook on day one beats a broken environment on day seven.
Setup
python3 -m venv .venv
source .venv/bin/activate
pip install numpy pandas matplotlib seaborn scikit-learn
pip install torch torchvision torchaudio
pip install jupyterlab
jupyter lab
On Mac with Apple silicon, PyTorch installs with MPS (Metal) acceleration by default; you don't need CUDA.
NumPy — arrays
NumPy is the foundation. Every other library wraps or interoperates with ndarray.
Creating arrays
import numpy as np
a = np.array([1, 2, 3]) # 1-D from list
b = np.zeros((3, 4)) # 3x4 zeros
c = np.ones(5) # 1-D ones
d = np.arange(0, 10, 2) # [0 2 4 6 8]
e = np.linspace(0, 1, 11) # 11 evenly spaced
r = np.random.randn(100, 3) # 100x3 standard normal
Shape, indexing, slicing
r.shape # (100, 3)
r[0] # first row, shape (3,)
r[:, 0] # first column
r[10:20, :2] # rows 10-19, first 2 columns
r[r > 0] # boolean mask: positive values only
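The slicing rules above can be checked directly. A small self-contained sketch (using NumPy's newer seeded-generator API rather than the bare `np.random.randn` above, so the run is reproducible):

```python
import numpy as np

rng = np.random.default_rng(0)          # seeded generator, reproducible
r = rng.standard_normal((100, 3))       # same shape as the example above

assert r.shape == (100, 3)
assert r[0].shape == (3,)               # a single row drops to 1-D
assert r[:, 0].shape == (100,)          # so does a single column
assert r[10:20, :2].shape == (10, 2)    # slicing keeps dimensions

positives = r[r > 0]                    # boolean mask flattens to 1-D
assert positives.ndim == 1
assert (positives > 0).all()
```

Note the asymmetry: integer indexing (`r[0]`) drops a dimension, slicing (`r[10:20]`) keeps it.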
Vectorized arithmetic
x = np.array([1.0, 2.0, 3.0])
y = np.array([10.0, 20.0, 30.0])
x + y # elementwise
x * y # elementwise
x @ y # dot product → 140.0
np.exp(x) # elementwise exp
np.mean(r, axis=0) # column means, shape (3,)
Broadcasting
M = np.zeros((4, 3))
v = np.array([1, 2, 3])
M + v # v broadcasts to each row → shape (4, 3)
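A typical practical use of broadcasting is centering each column of a matrix, since a `(4, 3)` array minus a `(3,)` row of means broadcasts the subtraction across every row (a minimal sketch):

```python
import numpy as np

M = np.arange(12, dtype=float).reshape(4, 3)   # 4 rows, 3 columns
col_means = M.mean(axis=0)                     # shape (3,): one mean per column

centered = M - col_means                       # (4, 3) - (3,) broadcasts row-wise
assert centered.shape == (4, 3)
assert np.allclose(centered.mean(axis=0), 0.0) # every column now has mean 0
```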
Vectorize, don't loop. An explicit Python for loop over a NumPy array is typically 50–500× slower than the equivalent vectorized expression, because the loop pays Python-level overhead per element while the vectorized form runs one compiled pass. If you find yourself writing a loop, ask "is there a NumPy operation that does this?"
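A quick demonstration that the two forms compute the same thing (timing them yourself with `%timeit` in a notebook will show the gap):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(100_000)

# Loop version: one Python-level iteration per element.
def loop_sumsq(a):
    total = 0.0
    for v in a:
        total += v * v
    return total

# Vectorized version: a single C-level pass.
vec = float(x @ x)                 # equivalently np.sum(x * x)

assert np.isclose(loop_sumsq(x), vec)
```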
pandas — tables
import pandas as pd
df = pd.read_csv("data.csv")
df.head()
df.info()
df.describe()
df["age"].mean()
df.groupby("category")["value"].mean()
df[df["age"] > 18]
df["log_x"] = np.log(df["x"] + 1)
df = df.dropna(subset=["target"])
df = df.fillna({"income": df["income"].median()})
- Indexing: df.loc[label_rows, label_cols] vs. df.iloc[int_rows, int_cols]. Always be explicit.
- Boolean filtering: df[df["x"] > 0].
- GroupBy + aggregation is your bread and butter for feature engineering.
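The idioms above on a toy frame, end to end (column names here are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "category": ["a", "a", "b", "b"],
    "value":    [1.0, 3.0, 10.0, 30.0],
    "age":      [15, 20, 25, 12],
})

# GroupBy + aggregation: one mean per category.
means = df.groupby("category")["value"].mean()
assert means["a"] == 2.0 and means["b"] == 20.0

# Boolean filtering keeps only rows where the condition holds.
adults = df[df["age"] > 18]
assert len(adults) == 2

# .loc is label-based and explicit: select rows and column in one step.
df.loc[df["age"] > 18, "adult"] = True
assert df.loc[df["age"] > 18, "adult"].all()
```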
Matplotlib & seaborn — plots
import matplotlib.pyplot as plt
import seaborn as sns
plt.figure(figsize=(6, 4))
plt.plot(x, y, label="train")
plt.plot(x, y_val, label="val")
plt.xlabel("epoch")
plt.ylabel("loss")
plt.legend()
plt.tight_layout()
plt.savefig("loss.png", dpi=120)
sns.scatterplot(data=df, x="feature1", y="feature2", hue="label")
sns.heatmap(df.corr(), annot=True, cmap="coolwarm")
Plots are how you debug models. Always plot the training and validation loss curves; always plot a few predictions vs. ground truth.
scikit-learn — the ML API
Every classical-ML model in scikit-learn follows the same pattern:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
from sklearn.preprocessing import StandardScaler
X_train, X_val, y_train, y_val = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_val_s = scaler.transform(X_val)
model = LogisticRegression(max_iter=1000)
model.fit(X_train_s, y_train)
pred = model.predict(X_val_s)
print(accuracy_score(y_val, pred))
print(classification_report(y_val, pred))
Always fit the scaler on training data only. Then transform the validation/test sets with that fitted scaler. Fitting on the whole dataset before splitting leaks information.
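The same rule written out in plain NumPy, so it is explicit which statistics come from which split (toy data, assumed shapes):

```python
import numpy as np

rng = np.random.default_rng(42)
X_train = rng.normal(5.0, 2.0, size=(80, 3))
X_val = rng.normal(5.0, 2.0, size=(20, 3))

# "Fit" = compute the statistics, on the training split only...
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)

# ..."transform" = apply those same statistics to both splits.
X_train_s = (X_train - mu) / sigma
X_val_s = (X_val - mu) / sigma

# Train is exactly standardized; val is only approximately so, because
# its values never influenced mu and sigma. That is the point.
assert np.allclose(X_train_s.mean(axis=0), 0.0)
assert np.allclose(X_train_s.std(axis=0), 1.0)
```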
Common gotchas
- Random seeds. Set np.random.seed(42), torch.manual_seed(42), and pass random_state=42 to scikit-learn estimators for reproducibility.
- Shape mismatches. 90% of debugging time. Print x.shape liberally.
- Mixing int and float. Floor division (//) on integer arrays silently truncates, and in-place operations on an int array can't store float results. Cast to float early in numerical code.
- Chained assignment in pandas. df[df["x"] > 0]["y"] = 1 often doesn't do what you want. Use .loc.
- Notebook out-of-order execution. Always "Restart & Run All" before submitting — it's the only way to know your notebook runs top-to-bottom.
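The chained-assignment gotcha, made concrete on a toy frame:

```python
import pandas as pd

df = pd.DataFrame({"x": [-1, 2, 3], "y": [0, 0, 0]})

# Chained assignment writes into a temporary copy, so df itself may be
# unchanged (pandas warns with SettingWithCopyWarning, or errors under
# copy-on-write):
#     df[df["x"] > 0]["y"] = 1   # don't do this

# .loc selects rows and column in a single indexing op, so the write sticks.
df.loc[df["x"] > 0, "y"] = 1
assert df["y"].tolist() == [0, 1, 1]
```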