Data Science Model Evaluation Metrics: Condensed Reference Sheet

Quick overview of what to use, when to use it, and what to watch out for.

Metric Selection Guide

  1. Problem Type → Metric Category
    Binary/Multiclass Classification → Performance (Classification) metrics
    Regression → Regression & Correlation metrics
    Hypothesis/A/B Testing → Inference & Hypothesis Testing
  2. Dataset Imbalance?
    Yes → Focus on Recall, Precision, F1, PR AUC, MCC; avoid plain Accuracy
    No → Accuracy, ROC AUC, F1 are fine
  3. Ranking vs. Threshold Performance
    Ranking quality → ROC AUC / PR AUC
    Specific operating point → Confusion matrix, Precision/Recall/F1 at threshold
  4. Regression Goals
    Heavily penalize big errors → RMSE / MSE
    Interpret “average absolute deviation” → MAE
    Targets span orders of magnitude → RMSLE or log transforms
  5. Need Interpretability?
    Global feature importance → Permutation importance, SHAP
    Local explanations → SHAP, LIME

Classification Metrics (Quick Reference)

Key formulas

Accuracy  = (TP + TN) / (TP + FP + TN + FN)
Precision = TP / (TP + FP)
Recall    = TP / (TP + FN)
F1        = 2 × (Precision × Recall) / (Precision + Recall)
MCC       = (TP×TN − FP×FN) / √[(TP+FP)(TP+FN)(TN+FP)(TN+FN)]      
Metric    | Range   | Good            | Fair                  | Poor       | When to Use
Accuracy  | 0 – 1   | > 0.8           | 0.7 – 0.8             | < 0.5      | Balanced data only
Precision | 0 – 1   | > 0.8           | 0.7 – 0.8             | < 0.5      | False positives costly
Recall    | 0 – 1   | > 0.8           | 0.7 – 0.8             | < 0.5      | False negatives costly
F1 Score  | 0 – 1   | > 0.8           | 0.5 – 0.8             | < 0.5      | Imbalanced data
ROC AUC   | 0 – 1   | > 0.8           | 0.7 – 0.8             | < 0.7      | Ranking problems
PR AUC    | 0 – 1   | Much > baseline | Moderately > baseline | ≈ baseline | Imbalanced binary
MCC       | −1 to 1 | > 0.5           | 0.3 – 0.5             | < 0        | Imbalanced binary
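The imbalance-friendly metrics in this table are all available in scikit-learn. A minimal sketch, assuming y_true, y_pred, and y_proba are defined as in the snippets later in this sheet:

from sklearn.metrics import matthews_corrcoef, balanced_accuracy_score, average_precision_score

mcc     = matthews_corrcoef(y_true, y_pred)          # −1 to 1; robust to class imbalance
bal_acc = balanced_accuracy_score(y_true, y_pred)    # mean of per-class recall
pr_auc  = average_precision_score(y_true, y_proba)   # average precision, a standard PR-curve summary; compare against the positive rate (its baseline)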

Regression Metrics (Quick Reference)

When to use:

  • RMSE → Large errors are critically bad
  • MAE → Want “average error” interpretation
  • RMSLE → Targets span orders of magnitude
Metric      | Good           | Fair             | Poor          | Characteristics
R²          | > 0.7          | 0.5 – 0.7        | < 0.3         | % variance explained
Adjusted R² | Close to R²    | Lower than R²    | Much lower    | Accounts for number of predictors
RMSE        | << target SD   | ≈ target SD      | ≥ target SD   | Penalizes large errors
MAE         | << target mean | ≈ 10–20% of mean | ≥ 30% of mean | Robust to outliers
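To make “RMSE << target SD” concrete, here is a small sketch (assuming numpy arrays y_true and y_pred) that compares the model's errors against the standard deviation of the target, which is also the RMSE of a mean-only baseline:

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

rmse = mean_squared_error(y_true, y_pred) ** 0.5
mae  = mean_absolute_error(y_true, y_pred)
target_sd = np.std(y_true)                   # RMSE of always predicting the mean; want model RMSE well below this

print(f"RMSE {rmse:.2f} / MAE {mae:.2f} vs target SD {target_sd:.2f}")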

Statistical Testing Essentials

Hypothesis Testing Flow

State H₀ & H₁ → Choose α (0.05) → Compute p-value → Compare:

  • p < α → Reject H₀ (statistically significant)
  • p ≥ α → Fail to reject H₀

Common Tests

Test | Use Case | Key Assumptions
t-test | Compare 2 group means | Normality, equal variance (or use Welch’s)
ANOVA | Compare ≥ 3 group means | Normality, equal variance, independence
Chi-square | Test independence in categorical data | Expected counts ≥ 5
Permutation test | Non-parametric alternative | Exchangeability of labels
Kolmogorov–Smirnov (KS) | Compare a sample to a reference distribution (1-sample) or two samples (2-sample) | Continuous data; distribution specified under H₀; sensitive to differences in shape and location
Levene’s test | Test equality of variances across groups (do we trust “equal variance” for t-test / ANOVA?) | Independent groups; works reasonably well even when data are not normal

Multiple Testing Corrections

Method | Controls | When to Use
Bonferroni | FWER (strict) | Few tests, false positives very costly
Benjamini–Hochberg | FDR (less strict) | Many tests (genomics, feature screening)
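Both corrections are available in statsmodels. A minimal sketch, assuming p_values is a list or array of p-values from the individual tests:

from statsmodels.stats.multitest import multipletests

reject_bonf, p_bonf, _, _ = multipletests(p_values, alpha=0.05, method="bonferroni")  # FWER control
reject_bh,   p_bh,   _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")      # FDR control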

Model Selection Criteria

Rule of thumb: ΔAIC/BIC < 2 → models are essentially equivalent.

Criterion | Preference | Best For
AIC | Lower is better | Prediction quality, larger models
BIC | Lower is better | True model identification, smaller models
Cross-validated error | Lower is better | Direct out-of-sample performance
Adjusted R² | Higher is better | Regression with multiple predictors
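AIC and BIC come directly from most statsmodels fits. A minimal sketch, assuming a predictor matrix X and response y:

import statsmodels.api as sm

ols = sm.OLS(y, sm.add_constant(X)).fit()
print(ols.aic, ols.bic)          # lower is better; ΔAIC/BIC < 2 ≈ essentially equivalent models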

Interpretability Methods

These show association, not causation!

Method | Scope | Key Insight
Permutation importance | Global | Feature importance via performance drop
SHAP values | Global + Local | Additive feature contributions
LIME | Local | Local surrogate model explanations
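Permutation importance is built into scikit-learn. A minimal sketch, assuming a fitted estimator model, a held-out validation set X_val / y_val, and a list feature_names of column names:

from sklearn.inspection import permutation_importance

result = permutation_importance(model, X_val, y_val, n_repeats=10, random_state=0)
for name, mean, std in zip(feature_names, result.importances_mean, result.importances_std):
    print(f"{name}: {mean:.3f} ± {std:.3f}")     # drop in score when that feature is shuffled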

Top 10 Common Pitfalls

  1. Using Accuracy on imbalanced data
  2. Ignoring false negatives in medical/safety applications
  3. Optimizing only Precision or Recall (neglecting the other)
  4. Treating p-value as probability H₀ is true
  5. Not correcting for multiple testing
  6. Comparing R² across different datasets
  7. Using RMSE when outliers are unimportant
  8. Interpreting SHAP/LIME as causal effects
  9. Selecting models based on tiny metric differences
  10. Ignoring confidence intervals for metrics

Quick Decision Checklist

Before choosing metrics

  • What’s the business objective?
  • Balanced or imbalanced data?
  • Cost of false positives vs false negatives?
  • Need probability rankings or binary decisions?
  • Require interpretability?

After model evaluation

  • Check multiple metrics (not just one)
  • Examine confusion matrix / error patterns
  • Validate on holdout / test set
  • Consider confidence intervals
  • Compare to reasonable baselines

Essential Python Snippets

Classification Metrics

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

acc     = accuracy_score(y_true, y_pred)      # overall share of correct predictions
prec    = precision_score(y_true, y_pred)     # TP / (TP + FP)
rec     = recall_score(y_true, y_pred)        # TP / (TP + FN)
f1      = f1_score(y_true, y_pred)            # harmonic mean of precision and recall
roc_auc = roc_auc_score(y_true, y_proba)      # needs scores / probabilities, not hard labels

Regression Metrics

from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

mae  = mean_absolute_error(y_true, y_pred)
rmse = mean_squared_error(y_true, y_pred) ** 0.5   # works on all versions; newer scikit-learn also offers root_mean_squared_error
r2   = r2_score(y_true, y_pred)

Statistical Tests

from scipy import stats

# t-test (two independent groups, equal variance assumed)
t_stat, p_value = stats.ttest_ind(group1, group2)

# Chi-square test of independence for a contingency table
chi2, p_chi, dof, expected = stats.chi2_contingency(contingency_table)

# Kolmogorov–Smirnov:
# 1-sample KS: does sample follow a normal distribution?
ks_stat, p_ks = stats.kstest(sample, 'norm')

# 2-sample KS: are two samples from the same distribution?
ks2_stat, p_ks2 = stats.ks_2samp(sample1, sample2)

# Levene’s test for equality of variances
lev_stat, p_lev = stats.levene(group1, group2, group3)      
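A few more tests mentioned in this sheet are also available in scipy.stats. A hedged sketch using the same group1/group2/group3 arrays as above:

from scipy import stats

# Welch's t-test: drops the equal-variance assumption of the plain t-test above
t_w, p_w = stats.ttest_ind(group1, group2, equal_var=False)

# One-way ANOVA across three or more groups
f_stat, p_anova = stats.f_oneway(group1, group2, group3)

# Non-parametric (rank-based) alternatives
u_stat, p_mw = stats.mannwhitneyu(group1, group2)
h_stat, p_kw = stats.kruskal(group1, group2, group3)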

Key Takeaways

  1. No single metric tells the whole story → always use multiple.
  2. Context is everything → choose metrics aligned with business goals.
  3. Visualize → confusion matrices, ROC/PR curves, residual plots.
  4. Uncertainty matters → report confidence intervals, not just point estimates.
  5. Baseline comparison → always compare to simple benchmarks.

Based on the “Data Science Model Evaluation Metrics Dashboard” condensed sheet.

How to Read the Color Bars

The colors indicate the qualitative interpretation of a metric's value, following a consistent rule:

Green: A desirable or good result for the metric's purpose.
Yellow: A moderate, acceptable, or cautionary result.
Red: An undesirable, poor, or problematic result.

Key context: what counts as “desirable” depends on the metric's role. For a p-value testing an effect (e.g. “is there a difference?”), a low value (significant) is good (green). For a p-value checking an assumption (e.g. “are residuals normal?”), a high value (no violation) is good (green). Correlation bars show strength (absolute value), from weak to strong.

Performance (Classification)
Metric Decision Criterion (Value Range) Purpose Description Working Mechanism Example Limitations
Accuracy
0 0.5 0.7 0.8 1.0
Aim as high as possible.

< 0.5 = poor overall performance.

0.5–0.7 = fair (better than chance but weak).

0.7–0.8 = good.

> 0.8 = very good (relative to chance level).
Gauge overall classification success rate. Useful for quick assessment on balanced data, but unreliable alone on strongly imbalanced data. Proportion of all predictions that are correct: \[ \text{Accuracy} = \frac{TP + TN}{TP+FP+TN+FN}. \] Counts correct vs total predictions with all errors weighted equally. 90 correct predictions out of 100 → accuracy = 0.90. Can be very misleading for imbalanced data – a majority‑class predictor can have high accuracy but be useless.

Common pitfalls:
  • Comparing models by accuracy without checking class balance or baseline (e.g. majority‑class accuracy).
  • Interpreting small accuracy differences as meaningful without considering confidence intervals or variability.
Precision (Positive Predictive Value, PPV)
0 0.5 0.7 0.8 1.0
Higher is better.

< 0.5 = more than half of predicted positives are wrong.

0.5–0.7 = fair.

0.7–0.8 = good.

> 0.8 (especially > 0.9) = very few false positives.
Measures how reliable positive predictions are, crucial when false positives are costly. \[ \text{Precision} = \frac{TP}{TP+FP}. \] Improves as false positives decrease, even if recall suffers. If 100 flagged spam emails contain 90 real spam, precision = 0.90. Can be gamed by predicting very few positives; must be balanced with recall.

Common pitfalls:
  • Optimising precision alone and ending up with a model that rarely predicts positives (very low recall).
  • Comparing precision across datasets with very different prevalence without context.
Recall (Sensitivity / True Positive Rate)
0 0.5 0.7 0.8 1.0
Higher is better.

< 0.5 = more than half of positives are missed.

0.5–0.7 = fair coverage.

0.7–0.8 = good.

> 0.8 (especially > 0.9) = very high detection.
Measures completeness of positive detection; key when missing positives is costly. \[ \text{Recall} = \frac{TP}{TP+FN}. \] Improves when false negatives decrease. Recall = 0.95 means 95% of true positives are found. Ignores false positives; predicting everything positive yields recall 1.0.

Common pitfalls:
  • Maximising recall at the expense of an unacceptably high false‑positive rate.
  • Interpreting high recall as “good model” without looking at precision or class prevalence.
Specificity (True Negative Rate)
0 0.5 0.7 0.8 1.0
Higher is better.

< 0.5 = majority of negatives misclassified.

0.5–0.7 = fair.

0.7–0.8 = good.

> 0.8 = very low false‑alarm rate.
Measures ability to correctly identify negatives; important when false positives are costly. \[ \text{Specificity} = \frac{TN}{TN+FP}. \] Complement of false positive rate: FPR = 1 − specificity. Specificity 0.98 means 98% of real negatives are correctly left unflagged. Trivially high if model predicts almost everything negative.

Common pitfalls:
  • Quoting high specificity while recall on the positive class is extremely low.
  • Confusing specificity with NPV or accuracy when explaining results to stakeholders.
False Positive Rate / False Negative Rate
(FPR / FNR)
0 moderate high
Lower is better for both.

< 0.05 on either side = very low error rate there.

≈ 0.10–0.20 = modest error rate.

> 0.30 = high error rate; typically unacceptable.
Quantify false alarms (FPR) and missed positives (FNR). \[ \text{FPR} = \frac{FP}{FP+TN}, \quad \text{FNR} = \frac{FN}{FN+TP}. \] Threshold choice trades FPR vs FNR; ROC and PR curves illustrate this trade‑off. A cancer test may tolerate FPR 0.15 for FNR 0.01 (very few missed cases). Need domain‑specific cost trade‑offs; no single “correct” target.

Common pitfalls:
  • Optimising only one of FPR or FNR without considering the business/clinical cost of the other side.
  • Reporting FPR but not stating what threshold or prevalence it corresponds to.
F1 Score
0 0.5 0.8 1.0
Higher is better.

< 0.5 = poor precision–recall balance.

0.5–0.8 = acceptable to good.

> 0.8 = strong overall performance.
Single number that balances precision and recall, especially on imbalanced datasets. \[ F_1 = 2 \frac{\text{Precision} \times \text{Recall}} {\text{Precision} + \text{Recall}}. \] Harmonic mean, so dominated by the smaller of precision and recall. Precision 0.9, recall 0.9 → F1 0.9; precision 0.9, recall 0.1 → F1 ≈ 0.18. Weights precision and recall equally; may not match domain priorities.

Common pitfalls:
  • Using F1 when precision and recall have very different real‑world costs (e.g. medical diagnosis).
  • Comparing F1 scores across datasets with very different class imbalance.
ROC AUC
0.5 0.7 0.8 1.0
< 0.7 = weak discrimination.

0.7–0.8 = moderate.

> 0.8 = good; 0.9+ often excellent.
Measures how well model ranks positives above negatives across thresholds. Equivalent to probability a random positive has higher score than a random negative. Integrates TPR vs FPR curve over all thresholds. AUC 0.85 → model ranks positives higher 85% of the time. Doesn’t reflect calibration; can look good on heavily imbalanced data even if practical performance is poor.

Common pitfalls:
  • Using ROC AUC on extreme class imbalance where PR AUC is more informative.
  • Assuming high AUC always implies good performance at the specific threshold used in production.
Negative Predictive Value (NPV)
0 0.5 0.7 0.8 1.0
Higher is better.

< 0.5 = “negative” predictions often wrong.

0.5–0.8 = moderate trust.

> 0.8 = “all clear” predictions usually correct.
Probability that a predicted negative is truly negative; key when false negatives are costly. \[ \text{NPV} = \frac{TN}{TN+FN}. \] Complements precision (PPV). In disease screening with low prevalence, NPV often very high even for moderate models. Strongly dependent on prevalence; high NPV doesn’t automatically imply a good model.

Common pitfalls:
  • Interpreting high NPV as strong evidence of good model quality in very low‑prevalence settings where almost everyone is negative.
  • Confusing NPV with specificity when explaining metrics.
Balanced Accuracy
0 0.5 0.7 1.0
0.5 = chance‑level performance (binary).

0.6–0.7 = modest improvement over chance.

> 0.7 = good performance on both classes.
Accounts for class imbalance by averaging recall over classes. \[ \text{Balanced Acc} = \frac{\text{TPR} + \text{TNR}}{2}. \] Penalises models that perform well on majority but poorly on minority class. TPR 0.9, TNR 0.5 → balanced accuracy 0.7. Still hides which side is weak; always inspect TPR and TNR separately.

Common pitfalls:
  • Reporting balanced accuracy without showing class‑specific recalls, hiding which class is performing poorly.
  • Comparing balanced accuracy across tasks with very different numbers of classes.
Brier Score
0 0.25 1.0
Lower is better.

0–0.05 = excellent probabilistic predictions.

0.05–0.25 = reasonable calibration / accuracy.

> 0.25 = poor (similar to random or worse).
Measures quality of probabilistic predictions for binary outcomes. \[ \text{Brier} = \frac{1}{N} \sum (\hat{p}_i - y_i)^2. \] Combines calibration and sharpness; penalises confident wrong predictions. A model always predicting p=0.5 on a balanced dataset has Brier 0.25. Absolute values need baseline for interpretation; doesn’t distinguish where (in p‑space) miscalibration occurs.

Common pitfalls:
  • Comparing Brier scores across tasks with very different prevalence or outcome scales without normalising.
  • Assuming a low Brier score guarantees good ranking performance (it does not replace ROC/PR metrics).
Calibration Error (ECE)
0 0.1 0.3+
Lower is better.

0–0.02 = very well calibrated.

0.02–0.10 = minor issues.

> 0.10 = noticeable miscalibration.
Measures mismatch between predicted probabilities and observed frequencies. Bins predictions by confidence; compares average predicted probability to empirical frequency in each bin, then averages absolute differences. Low ECE means that “70% probability” events do happen around 70% of the time. Used for risk models in credit, medicine, etc. as a complement to AUC. Depends on binning; can be unstable with small sample sizes.

Common pitfalls:
  • Using a single ECE value without inspecting calibration plots per probability region.
  • Ignoring that ECE can be low even when model ranking (AUC/PR AUC) is poor.
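A minimal numpy sketch of equal-width-bin ECE as described above, assuming arrays y_true (0/1 labels) and y_prob (predicted probabilities):

import numpy as np

def expected_calibration_error(y_true, y_prob, n_bins=10):
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    bins = np.clip((y_prob * n_bins).astype(int), 0, n_bins - 1)   # equal-width probability bins
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            gap = abs(y_prob[mask].mean() - y_true[mask].mean())   # avg confidence vs observed frequency
            ece += mask.mean() * gap                               # weighted by share of samples in the bin
    return ece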
Confusion Matrix
0 0.5 0.8 1.0
Interpreted via the share of counts on the diagonal (correct) vs off‑diagonal (errors).

Diagonal similar to off‑diagonal = many misclassifications.

Diagonal somewhat dominant = moderate performance.

Diagonal strongly dominant, off‑diagonals small = strong performance across classes.
Summarise classification results in terms of true positives, false positives, false negatives, and true negatives for each class. Matrix whose rows are actual classes and columns are predicted classes (or vice versa); each cell counts how often that combination occurs. From predictions and labels, fill contingency table; derived metrics like precision, recall, F1, MCC are computed from its cells. A binary confusion matrix with large TP and TN and very small FP/FN indicates a strong classifier; see interactive explorer below. Not a single scalar; becomes large for many classes and may be hard to read without normalisation.

Common pitfalls:
  • Looking only at totals without normalising per row/column, which hides minority‑class errors.
  • Comparing confusion matrices across datasets with different sizes without converting to rates.
Precision–Recall Curve (PR curve / PR AUC)
baseline 0.7 0.8 1.0
Interpret relative to baseline positive rate π.

PR AUC ≈ π = little better than random ranking.

Clearly above π but < 0.7 = modest improvement.

Substantially above π (e.g. > 0.7) = useful on rare‑positive problems.
Assess trade‑off between precision and recall across thresholds, especially for highly imbalanced binary classification. Curve of precision vs recall as the decision threshold moves; area under it (PR AUC) summarises overall performance on positives. Emphasises performance on the positive class; random classifier’s baseline is the prevalence π rather than 0.5. In fraud detection with 0.5% positives, PR AUC 0.65 is a huge improvement over baseline 0.005, even if ROC AUC is “only” moderate. Baseline depends on prevalence; PR AUC numbers are not directly comparable across datasets with different class balance.

Common pitfalls:
  • Comparing PR AUC values across datasets without accounting for different positive rates.
  • Only inspecting a single operating point on PR curve instead of the region relevant to business constraints.
Cohen’s Kappa
-1 0 0.6 1
Agreement beyond chance (−1 to 1).

≤ 0 = no better (or worse) than random.

0.01–0.40 = slight–fair agreement.

0.41–0.60 = moderate.

> 0.60 = substantial to almost perfect agreement.
Measure inter‑rater reliability or agreement between model predictions and labels while adjusting for agreement expected by chance. Compares observed accuracy \(p_o\) to expected accuracy \(p_e\) under random agreement given class marginals. \[ \kappa = \frac{p_o - p_e}{1 - p_e}. \] Large κ implies agreement much higher than chance, small κ close to or below 0 implies near‑random agreement. Often used to compare two human labelers, or model vs clinician, in medical imaging or annotation tasks. Sensitive to prevalence and marginal distributions; κ can be low even with high observed accuracy in imbalanced datasets.

Common pitfalls:
  • Interpreting low κ as “bad model” without considering skewed class frequencies or label noise.
  • Applying generic qualitative cutoffs (e.g. “good” at 0.6) without domain‑specific context.
Matthews Correlation Coefficient (MCC)
-1 0 0.5 1
−1 ≤ MCC ≤ 1.

≤ 0 = random or worse than random.

0.1–0.3 = weak signal.

0.3–0.5 = moderate.

> 0.5 = strong classifier, particularly on imbalanced data.
Single‑number summary of binary classifier quality that is symmetric in classes and robust to class imbalance. Correlation between predicted and true labels using all four entries of the confusion matrix. \[ \text{MCC} = \frac{TP \cdot TN - FP \cdot FN} {\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}. \] In a fraud‑detection setting with extreme imbalance, MCC distinguishes a useful classifier (e.g. 0.6) from one that just predicts the majority class (MCC ≈ 0). Less intuitive than accuracy or F1; defined for binary classification and needs generalisation for multiclass problems.

Common pitfalls:
  • Ignoring the sign: a negative MCC means systematic misclassification, not just “slightly bad”.
  • Comparing MCC values from confusion matrices built at very different thresholds without explanation.

Interactive Confusion Matrix Explorer

Adjust TP / FP / FN / TN to see how the main metrics change (toy calculator only).

                  Predicted Positive   Predicted Negative
Actual Positive   TP                   FN
Actual Negative   FP                   TN

Derived metrics from the current confusion matrix:
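In this text version, the same derived metrics can be computed with a small Python sketch (tp, fp, fn, tn are plain integer counts):

def confusion_metrics(tp, fp, fn, tn):
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall    = tp / (tp + fn) if (tp + fn) else 0.0
    denom = ((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)) ** 0.5
    return {
        "accuracy":    (tp + tn) / total,
        "precision":   precision,
        "recall":      recall,
        "specificity": tn / (tn + fp) if (tn + fp) else 0.0,
        "f1":          2 * precision * recall / (precision + recall) if (precision + recall) else 0.0,
        "mcc":         (tp * tn - fp * fn) / denom if denom else 0.0,
    }

print(confusion_metrics(tp=90, fp=10, fn=5, tn=895))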

Survival / Time-to-Event Performance

Use these metrics when the outcome is a time until an event (death, relapse, failure, churn) and some observations are censored (we only know the event has not happened yet by the end of follow-up).

When to use

  • Outcomes like “time from diagnosis to death”, “time to device failure”, “time until customer churn”.
  • Many people are still event-free at the end of the study (right-censoring).
  • We care about risk over time, not just a yes/no label at a fixed date.

Key tools & metrics

  • Kaplan–Meier curve – step-shaped curve showing the fraction still event-free over time. Great for visualising survival patterns and comparing groups.
  • Log-rank test – tests whether two or more Kaplan–Meier curves are systematically different over time.
  • Cox proportional hazards model – regression model for time-to-event data that estimates hazard ratios for predictors.
  • C-index (concordance index) – rank-based performance measure: probability that a person who experiences the event earlier gets a higher predicted risk. (1 = perfect, 0.5 = random.)
  • Time-dependent ROC / AUC(t) – ROC-style discrimination at specific time points (e.g. 1-year, 5-year AUC).
  • Brier score over time – mean squared error of predicted event probabilities at a given time horizon, adjusted for censoring. Lower is better.
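The Kaplan–Meier curve, log-rank test, Cox model, and C-index listed above are all covered by the third-party lifelines package. A hedged sketch, assuming lifelines is installed and that durations/events (and the per-group variants) are arrays, with df a DataFrame of covariates plus "time" and "event" columns:

from lifelines import KaplanMeierFitter, CoxPHFitter
from lifelines.statistics import logrank_test

kmf = KaplanMeierFitter()
kmf.fit(durations, event_observed=events)            # Kaplan–Meier survival curve
kmf.plot_survival_function()

res = logrank_test(durations_a, durations_b,
                   event_observed_A=events_a, event_observed_B=events_b)
print(res.p_value)                                   # log-rank comparison of two groups

cph = CoxPHFitter()
cph.fit(df, duration_col="time", event_col="event")  # Cox proportional hazards regression
print(cph.concordance_index_)                        # C-index of the fitted model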

Common pitfalls

  • Ignoring censoring – treating censored cases as if the event never happened biases estimates. Always use survival-aware methods (Kaplan–Meier, Cox, etc.).
  • Immortal-time bias – giving people “risk-free” time because they must survive long enough to receive a treatment or enter a group.
  • Competing risks – if other events (e.g. death from another cause) prevent the event of interest, standard survival curves can overstate risk unless competing-risk methods are used.
  • Too short follow-up – if few events occur, any metric (C-index, Brier, log-rank) will be noisy and underpowered.
Inference & Hypothesis Testing
Metric / Test Decision Criterion (Test Ranges) Purpose Description Working Mechanism Example Limitations
p-value (Significance Level)
0 0.05 1.0
Typically compared to α = 0.05.

p < 0.05 = statistically significant evidence against H0 (if an effect is of interest).

p ≥ 0.05 = no statistically significant evidence; cannot rule out H0.
Quantifies evidence against the null hypothesis and provides a common decision rule across many tests. Probability, assuming H0 is true, of obtaining a result at least as extreme as observed. Derived from a test statistic’s null distribution; if p < α, reject H0. p = 0.03 for a drug effect typically leads to rejecting “no effect” at α=0.05. Does not measure effect size or importance; heavily sample‑size‑dependent and often misinterpreted.

Common pitfalls:
  • Interpreting p as the probability that the null hypothesis is true.
  • Equating “not significant” with “no effect”, especially in low‑power studies.
Adjusted α (Bonferroni Correction)
0 α/m 1.0
For m tests, use αadj = α / m.

p < αadj = significant even after strict multiple‑testing control.

p ≥ αadj = not significant after correction.

This keeps family‑wise error rate ≈ α (e.g. 0.05).
Control family‑wise probability of any false positive across multiple tests. Divides α by number of tests; each test uses αadj as threshold. Simple and conservative; strong control of false positives when tests are independent or mildly correlated. 20 tests with α=0.05 → αadj=0.0025; only very small p‑values survive. Can be overly conservative for large m, greatly reducing power.

Common pitfalls:
  • Applying Bonferroni mechanically in exploratory analyses where some false positives are acceptable.
  • Ignoring correlation between tests, which can make the correction overly strict.
Benjamini–Hochberg FDR
0 q 1.0
Choose FDR q (e.g. 0.05).

Sort p‑values p(1) ≤ … ≤ p(m), find largest k with p(k) ≤ (k/m)·q.

Tests 1…k are called significant; expected fraction of false discoveries ≈ q.
Control expected proportion of false positives among declared discoveries. Step‑up procedure comparing ordered p‑values to increasing thresholds (k/m)·q. Less conservative than Bonferroni; more discoveries at the cost of some false positives. Common in genomics / high‑dimensional feature screening. Controls FDR only in expectation and under certain dependence structures.

Common pitfalls:
  • Treating FDR‑controlled discoveries as if they had strict family‑wise error control.
  • Misunderstanding that, even at FDR 5%, some proportion of reported “hits” are expected to be false.
Type I & Type II Error
0 moderate high
α ≈ 0.01–0.05 = standard false‑positive risk.

β ≈ 0.10–0.20 (power 80–90%) = typical design target.

Very high α or β (> 0.10 or > 0.30) = too many wrong decisions.
Frame trade‑off between false positives (Type I) and false negatives (Type II) when designing tests and studies. Type I: reject true H0. Type II: fail to reject false H0. Power = 1 − β. Power analysis couples α, β, effect size, and sample size. A clinical trial might fix α=0.025 (one‑sided) and β=0.10 (90% power). Reducing α without increasing sample size generally increases β.

Common pitfalls:
  • Choosing a very small α without increasing sample size, making studies underpowered.
  • Focusing only on Type I error while ignoring the cost of missed true effects (Type II).
Statistical Power
0 0.5 0.8 1.0
< 0.5 = often misses real effects.

0.5–0.8 = moderate.

> 0.8 = usually acceptable; >0.9 even better.
Probability a test detects a true effect of given size. Depends on α, effect size, variability, and sample size. Higher n or larger effect sizes raise power. Power 0.8 means 80% chance to detect the specified effect if it exists. Post‑hoc power is usually uninformative; better to plan it a priori.

Common pitfalls:
  • Doing “observed power” calculations after non‑significant results and over‑interpreting them.
  • Ignoring that low power inflates the proportion of false positives among significant findings.
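A priori power calculations of the kind described above are available in statsmodels. A minimal sketch for a two-sample t-test design:

from statsmodels.stats.power import TTestIndPower

# Sample size per group needed to detect Cohen's d = 0.5 at α = 0.05 with 80% power
n_per_group = TTestIndPower().solve_power(effect_size=0.5, alpha=0.05, power=0.8)
print(round(n_per_group))                            # ≈ 64 per group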
Z / t Tests
0 1 2 3+
Two‑sided tests (large df):

|Z| or |t| < 1 = little evidence.

|Z| or |t| ≈ 2 = borderline (p ≈ 0.05).

|Z| or |t| ≥ 3 = strong evidence against H0.
Test if a mean / difference / regression coefficient differs from a null value. Statistic = (estimate − null) / SE, compared against normal or t distribution. Large |statistic| means estimate is many SE away from null. |t| = 4 with df≈50 usually implies p < 0.001 (strong evidence). Assumes approximate normality and independence; sensitive to outliers.

Common pitfalls:
  • Using a Z‑test instead of a t‑test with small samples or unknown population variance.
  • Ignoring multiple‑testing corrections when running many t‑tests in parallel.
Chi-Square Test (χ²)
0 χ²crit high
χ² near 0 = data close to H0 (little evidence of association/effect).

χ² around χ²crit = borderline significance.

χ² well above χ²crit = strong evidence of association / lack of fit to H0.
Test independence (contingency tables) or goodness‑of‑fit of categorical data to expected counts. \[ χ^2 = \sum \frac{(O_i - E_i)^2}{E_i}. \] Large χ² means observed counts deviate strongly from expectations. Used for testing independence between categorical variables or Mendelian ratios in genetics, etc. With very large n, tiny practical differences become significant. Needs effect size (e.g. Cramér’s V) for magnitude.

Common pitfalls:
  • Applying χ² when expected frequencies are too small (e.g. < 5 per cell).
  • Interpreting a significant χ² as evidence of large practical effect without reporting effect size.
ANOVA F-test
1 2 3 5+
F ≈ 1 = no detected difference between means.

F ≈ 2–3 = might be marginally significant (depends on df).

F ≫ 1 (e.g. >5) = strong evidence at least one group mean differs.
Test if means of 3+ groups are all equal vs at least one differs. Compares between‑group variance to within‑group variance. Large F implies between‑group differences are large relative to noise. Follow significant F with post‑hoc tests to identify which groups differ. Assumes normality and equal variances; does not identify specific groups or effect sizes by itself.

Common pitfalls:
  • Stopping at a significant overall F without reporting effect sizes or post‑hoc comparisons.
  • Ignoring heteroscedasticity or unbalanced designs where standard ANOVA assumptions fail.
Confidence Interval (e.g. 95% CI)
narrow medium wide
95% CI not containing 0 (difference) or 1 (ratio) → statistically significant at ≈5%.

Narrow CI → precise estimate.

Very wide CI → high uncertainty.
Provide a range of plausible values for a parameter. Usually estimate ± critical value × standard error. Reflects both effect size and uncertainty. Difference 5 with 95% CI [2, 8] suggests a clearly positive but moderately uncertain effect. Relies on model assumptions; misinterpreted as containing the true value with 95% probability (frequentist CIs don’t strictly mean that).
Permutation Test
0 0.05 1.0
Let pperm be the permutation p‑value.

pperm < 0.05 → statistic is in extreme tail of null distribution (significant effect).

pperm ≥ 0.05 → compatible with chance re‑labeling.
Distribution‑free significance test using label shuffling. Permute labels many times; recompute statistic to get null distribution and pperm. Useful for complex statistics where analytic null distributions are hard to derive. Accuracy 0.8 vs permutation null mean 0.5 with pperm=0.01 is strong evidence of real signal. Computationally heavy; must respect structure (e.g. grouping or time).

Common pitfalls:
  • Permuting labels in time series or clustered data where observations are not exchangeable.
  • Using too few permutations, leading to coarse p‑value resolution and unstable conclusions.
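A minimal numpy sketch of a two-sided permutation test on the difference in means, assuming two 1-D arrays a and b of exchangeable observations:

import numpy as np

def permutation_p_value(a, b, n_perm=10_000, seed=0):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    rng = np.random.default_rng(seed)
    observed = a.mean() - b.mean()
    pooled = np.concatenate([a, b])
    hits = 0
    for _ in range(n_perm):
        perm = rng.permutation(pooled)                       # shuffle group labels
        diff = perm[:len(a)].mean() - perm[len(a):].mean()
        hits += abs(diff) >= abs(observed)                   # two-sided comparison
    return (hits + 1) / (n_perm + 1)                         # add-one smoothing avoids p = 0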
Effect Size (Cohen’s d)
0 0.2 0.5 0.8 >1
|d| < 0.2 = negligible.

0.2–0.5 = small.

0.5–0.8 = medium.

> 0.8 = large effect.
Quantify standardized mean differences independent of sample size. \[ d = \frac{\bar{x}_1 - \bar{x}_2}{s_{pooled}}. \] Helps separate statistical from practical significance. d = 0.6 for treatment vs control often considered practically important. Assumes similar SDs; thresholds are rough and context‑dependent.

Common pitfalls:
  • Using generic “small/medium/large” thresholds without considering what effect size is practically meaningful in context.
  • Ignoring unequal variances where standard pooled‑SD formula is inappropriate.
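A small sketch of pooled-SD Cohen's d following the formula above, assuming two numpy arrays x1 and x2:

import numpy as np

def cohens_d(x1, x2):
    x1, x2 = np.asarray(x1, dtype=float), np.asarray(x2, dtype=float)
    n1, n2 = len(x1), len(x2)
    s_pooled = np.sqrt(((n1 - 1) * x1.var(ddof=1) + (n2 - 1) * x2.var(ddof=1)) / (n1 + n2 - 2))
    return (x1.mean() - x2.mean()) / s_pooled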
Cliff’s Delta
-1 -0.5 0 0.5 1
|δ| < 0.147 = negligible.

0.147–0.33 = small.

0.33–0.474 = medium.

> 0.474 = large effect (strong dominance).
Non‑parametric effect size based on ranks; robust to non‑normal data. δ = P(X>Y) − P(Y>X), where X/Y from two groups; ranges −1..1. Equivalent to rank‑biserial correlation; sign shows direction, magnitude shows strength. δ = 0.5 means X exceeds Y in 75% of pairs, a very strong effect. Summarises ordering, not magnitude of differences; can be less intuitive than differences in means.

Common pitfalls:
  • Reporting δ without clarifying the underlying direction (which group is X vs Y).
  • Assuming δ behaves like Pearson r or Cohen’s d in terms of interpretation thresholds.

Real‑Dataset Examples for Common Tests

Statistical Tests & Assumptions – Quick Reference

What each test is asking, when to use it, and what can go wrong.

Big picture

  • Every test asks a question.
    “Are these means equal?”, “Are these variances equal?”, “Do these two distributions look the same?”, “Is there autocorrelation?”.
  • p-value is not the effect size.
    A tiny p-value can correspond to a tiny, unimportant effect if the sample is huge.
  • Assumptions matter.
    Many tests assume things like normal residuals, equal variances, or independent observations. If those are badly violated, the p-values can be misleading.
  • Multiple testing inflates false positives.
    If you run many tests, you need FWER/FDR control (Bonferroni, Holm, BH).

Distribution shape & normality

These tests check whether data or residuals look like they come from a particular distribution (usually normal). They are sensitive to sample size and to outliers.

Test | Main question | Typical use | Notes & pitfalls
Shapiro–Wilk | “Do these data look roughly normal?” | Small to medium samples (n < ~2000) | Powerful for normality; very sensitive to even small deviations in large samples. Always combine with plots (QQ-plot, histogram).
Kolmogorov–Smirnov (KS) | “Is the sample distribution different from a reference distribution (or from another sample)?” | Comparing one sample to a known distribution, or two independent samples | Works for continuous data; most sensitive near the centre, less in the tails. With estimated parameters, classical p-values need corrections.
Anderson–Darling | “Do these data follow a given distribution, especially in the tails?” | Checking normality with more tail focus than KS | Gives more weight to tails than KS. As with other tests, large n ⇒ tiny deviations become “significant”.

Equality of variances

Many tests (for example classical t-test, ANOVA) assume similar variances across groups. These tests check that.

Test | Main question | Data type | Notes & pitfalls
Levene’s test | “Do these groups have equal variances?” | Continuous outcome, categorical groups | More robust to non-normal data than classical tests. A small p-value suggests at least one group has a different variance.
Brown–Forsythe | Levene’s-test variant using medians instead of means | Continuous outcome, heavy tails or outliers | Even more robust when distributions are skewed. Often preferred if outliers are expected.

Comparing means

These tests compare group averages. Non-parametric alternatives use ranks instead of assuming normality.

Test | Main question | Design | Notes & pitfalls
t-test (independent) | “Are the means of two independent groups equal?” | Two groups, continuous outcome | Assumes normal residuals and (often) equal variances. For unequal variances, use Welch’s t-test.
t-test (paired) | “Is the mean difference between paired measurements zero?” | Before/after, matched pairs | Applied to the differences. Assumes the differences are roughly normal.
One-way ANOVA | “Are all group means equal?” | 3+ groups, continuous outcome | Global test; if significant, follow with post-hoc comparisons (and multiple-testing correction). Assumes normal residuals and equal variances.
Mann–Whitney U | “Do two groups differ in their typical values (medians/ranks)?” | Two independent groups, ranked/continuous outcome | Non-parametric alternative to the independent t-test. Tests for a distribution shift, not strictly medians.
Kruskal–Wallis | “Do 3+ groups differ in their distributions (ranks)?” | 3+ independent groups, ranked/continuous outcome | Non-parametric analogue of one-way ANOVA. If significant, follow with pairwise rank tests + multiple-testing correction.

Categorical data & independence

These tests work on counts in contingency tables and ask whether patterns could be explained by chance alone.

Test | Main question | Typical table | Notes & pitfalls
Chi-square test of independence | “Are two categorical variables independent?” | R × C contingency table (for example treatment × outcome) | Expected counts should not be too small (rules of thumb apply). Large samples make tiny deviations “significant”.
Chi-square goodness-of-fit | “Do observed category frequencies match a specified distribution?” | 1 × C table (observed vs expected counts) | Used to compare observed counts to a theoretical or historical pattern.
Fisher’s exact test | “Is there association in a 2×2 table?” | 2 × 2 table with small counts | Exact test, no large-sample approximation. Useful when any expected count is small (< 5).

Time-series residuals & autocorrelation

For time-ordered data, errors often correlate over time. These tests check whether residuals look “independent” or show systematic patterns.

Test | Main question | Typical use | Notes & pitfalls
Durbin–Watson | “Is there first-order autocorrelation in regression residuals?” | Linear regression on time-ordered data | Values near 2 ≈ no autocorrelation; near 0 ≈ strong positive autocorrelation; near 4 ≈ strong negative. Not designed for complex time-series models.
Ljung–Box | “Are a set of autocorrelations jointly zero?” | Checking whether residuals from a time-series model look like white noise | Tests several lags at once. A small p-value suggests remaining structure in the residuals (the model underfits the dynamics).

Summary: how to think about tests

  • Always pair tests with plots. QQ-plots, residual plots and histograms often tell the story faster than p-values.
  • Large samples detect tiny issues. A “significant” deviation may be practically irrelevant.
  • Small samples lack power. A non-significant result does not prove that assumptions are perfect or effects are zero.
  • Use non-parametric tests when normality or equal variances are clearly violated and sample sizes are moderate.
  • Remember multiple testing. Running many tests on the same data requires FWER/FDR control, otherwise false positives accumulate fast.
Robustness & Resampling

Robustness Playground: Mean vs Median & Outliers

Type any numbers, then drag the outlier slider. Watch how the mean swings while the median and IQR stay more stable. This is what “robustness to outliers” looks like in practice.


Try: 1, 2, 3, 4, 5 then add an outlier like +30. The mean moves a lot; the median barely moves.


The "Robustness Playground" is a visual tool proving that the median is generally a better measure of the "typical" center for data that might contain errors or extreme values, because it resists the pull of outliers much better than the mean. Robustness in statistics means that a measurement (like the median) remains relatively unchanged even if a few data points are very unusual or extreme (outliers). Non-robust measurements (like the mean) are highly sensitive to these extreme values and can be pulled significantly in one direction.

Resampling Stability Playground (Bootstrap vs Jackknife)

This playground simulates a simple linear model with one true signal feature and one noise feature. Play with sample size, noise, and signal strength, then compare how a single fit, bootstrap, and jackknife disagree about the coefficient. Watch how CI width and sign stability change.

Default settings: sample size 100, noise level σ = 1.0, signal strength β₁ = 1.5.
Model: y = β₀ + β₁·x₁ + ε with an extra noise feature x₂ (true β₂ = 0).

Look for: when noise is high or n is small, single fit can be very misleading. Bootstrap and jackknife show how uncertain the coefficient really is.

How to read this playground

This Resampling Stability Playground compares two resampling methods – Bootstrap and Jackknife – for a simple linear model. It answers: “How stable is my estimated coefficient under resampling?”

Purpose

The model is y = β₀ + β₁·x₁ + ε with one true signal feature (x₁) and one pure noise feature (x₂, true β₂ = 0). By changing sample size, noise level, and signal strength, you can see when estimates are stable vs. when they are fragile.

How to use the controls

  • Sample size – more observations usually mean tighter intervals.
  • Noise level – higher noise makes estimates wobble more.
  • Signal strength – stronger β₁ is easier to detect reliably.
  • Generate new data – redraws a fresh dataset and new resamples.

Reading the plot

  • The vertical green dashed line shows the true β₁.
  • The orange dot is the single-fit estimate on the full sample.
  • The blue bar is the Bootstrap 95% CI for β₁.
  • The purple bar is the Jackknife 95% CI for β₁.

Reading the table

  • CI width – how wide the interval is (narrow = more precise).
  • Sign stable – fraction of resamples that keep the same sign as the mean.
  • Green “high stability” tags mean the method gives a tight CI and almost never flips sign; red “low stability” tags mean the estimate is fragile.

Use this to build intuition for robustness: when resampling methods agree and intervals are narrow, your inference is much safer than when everything jumps around.


High stability = coefficient keeps sign and tight CI.
Low stability = sign flips often / very wide CI.

Metric / Method Decision Criterion Purpose Description Working Mechanism Example Limitations
Cross-Validation Mean & Std (k-fold CV)
0 0.5 1.0
Higher mean score = better average performance.

high var medium low var
Low std across folds = stable / robust model.

High std across folds = performance sensitive to data split.
Estimate out‑of‑sample performance and stability. Repeatedly train/test on different folds of the data. Mean = expected performance; std = sensitivity to sample. 0.88 ± 0.01 across folds is strong & stable; 0.90 ± 0.10 is unstable. More expensive than simple train/test; must respect temporal / grouped structure.

Common pitfalls:
  • Randomly shuffling time‑series data or grouped data, breaking dependencies.
  • Choosing k so large that folds are too small (high variance) or so small that variance estimates are unreliable.
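A minimal scikit-learn sketch, assuming an unfitted estimator model and arrays X, y; for time-ordered or grouped data, pass a TimeSeriesSplit or GroupKFold object as cv instead of a plain integer:

from sklearn.model_selection import cross_val_score

scores = cross_val_score(model, X, y, cv=5, scoring="f1")
print(f"{scores.mean():.3f} ± {scores.std():.3f}")   # mean = expected performance, std = stability across folds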
Bias–Variance Tradeoff
underfit balanced overfit
High bias (underfit) → high error on train & test.

Balanced bias–variance → low and similar train/test error.

High variance (overfit) → very low train error, high test error.
Conceptual tool for selecting model complexity. Simple models: high bias, low variance; complex models: low bias, high variance. Analyse learning curves vs model capacity to find sweet spot. Deep tree that fits training perfectly but fails on test is high‑variance. Not a single numeric statistic; patterns can be subtle for deep models.

Common pitfalls:
  • Assuming that increasing model capacity always improves performance without monitoring overfitting.
  • Using training error alone as proxy for generalisation error.
Jackknife Variability
high var medium low var
Low jackknife SE → estimator stable to leaving out single observations.

Moderate SE → some sensitivity.

High SE → highly sensitive to specific cases.
Assess estimator stability and approximate standard errors by leave‑one‑out recomputation. Compute estimates on N datasets, each missing one observation; inspect spread. Big changes when omitting particular points indicate influence. If dropping any one observation barely changes a coefficient, model is stable. Less flexible than bootstrap for complex estimators; can be noisy for small N.

Common pitfalls:
  • Confusing jackknife variability (influence of individual points) with overall sampling variability.
  • Applying jackknife to highly non‑smooth estimators where leave‑one‑out changes structure dramatically.
Bootstrap CI Width
narrow medium wide
Narrow bootstrap CI = precise estimate.

Medium width = acceptable uncertainty.

Very wide or irregular CI = high uncertainty / instability.
Non‑parametric uncertainty quantification for statistics and model parameters. Resample with replacement, recompute estimator, and use empirical distribution. Percentile or BCa intervals reflect sampling variability without assuming normality. Narrow [4.1, 4.2] CI is very precise; [0, 20] shows extreme uncertainty. Expensive for large models; assumes sample is representative.

Common pitfalls:
  • Bootstrapping data with temporal or grouped dependence without respecting structure.
  • Using too few bootstrap samples, resulting in noisy interval estimates.
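A minimal numpy sketch of a percentile bootstrap CI for a generic statistic, assuming a 1-D array x of independent observations (recent SciPy versions also provide scipy.stats.bootstrap):

import numpy as np

def bootstrap_ci(x, stat=np.mean, n_boot=5000, alpha=0.05, seed=0):
    x = np.asarray(x)
    rng = np.random.default_rng(seed)
    boot = np.array([stat(rng.choice(x, size=len(x), replace=True)) for _ in range(n_boot)])
    return np.percentile(boot, [100 * alpha / 2, 100 * (1 - alpha / 2)])   # percentile interval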
Feature Stability Across Resamples
0 0.5 1.0
< 0.5 = feature selected or important in <50% of resamples (unstable).

0.5–0.8 = moderate stability.

> 0.8 = highly stable across resamples.
Check whether discovered “important features” are robust to sampling variation. Repeat feature selection / importance computation over many resamples and count how often each feature is selected as important. Helps distinguish real signal from noise‑driven feature choices. Stable features (e.g. 95/100 bootstrap samples) are more trustworthy than features selected rarely. Correlated predictors may appear unstable individually; requires careful interpretation.

Common pitfalls:
  • Judging features purely by single‑fit importance scores without looking at stability.
  • Ignoring that correlated feature groups may swap roles across resamples while jointly representing the same signal.
Regression & Correlation
Metric Decision Criterion Purpose Description Working Mechanism Example Limitations
R² (Coefficient of Determination)
0 0.25 0.5 1.0
< 0.25 = weak fit.

0.25–0.5 = moderate fit.

> 0.5 = strong fit.
Proportion of variance in response explained by the model. \[ R^2 = 1 - \frac{SS_{res}}{SS_{tot}}. \] Compares residual variance to variance around the mean. R² = 0.85 → model explains 85% of variability in outcome. Can be inflated by overfitting; does not imply causality or good out‑of‑sample performance.

Common pitfalls:
  • Equating high R² with causal explanation rather than predictive association.
  • Comparing R² across models fitted to different datasets or with different outcome variances.
Adjusted R²
0 0.25 0.5 1.0
Higher adjusted R² = better fit after penalising extra predictors.

Small gains = marginal benefit of added predictors.

Decrease when adding predictors = likely overfitting / noise variables.
Compare models with different numbers of predictors while penalising complexity. Adjusts R² downward for each extra degree of freedom. Only increases when added variables meaningfully reduce residual variance. If R² rises but adjusted R² falls when adding variables, they’re probably not helpful. Can’t compare across different datasets; still doesn’t guarantee predictive performance.

Common pitfalls:
  • Using small differences in adjusted R² as decisive evidence between models.
  • Ignoring other diagnostics (residual plots, multicollinearity) when adjusted R² looks acceptable.
RMSE (Root Mean Square Error)
0 moderate large
Lower RMSE = better (small typical errors).

Similar to target SD = modest improvement over baseline.

Close to or above target SD/range = weak predictive value.
Square root of the average squared prediction error, expressed in the original units of the target. \[ \text{RMSE} = \sqrt{\frac{1}{n}\sum (y_i - \hat{y}_i)^2 }. \] Squares errors (emphasising large ones), then takes the square root. RMSE of $20k on houses with SD $60k is good; $55k is poor. Highly sensitive to outliers; needs a baseline to interpret its magnitude.

Common pitfalls:
  • Comparing RMSE across targets with different scales instead of using relative metrics.
  • Optimising RMSE when extreme outliers are less important than typical error magnitude (where MAE may be better).
MAE (Mean Absolute Error)
0 moderate large
Lower MAE = predictions on average close to truth.

Moderate fraction of target scale = acceptable.

Large fraction of target scale = poor accuracy.
Average absolute prediction error; more robust than RMSE. \[ \text{MAE} = \frac{1}{n}\sum |y_i - \hat{y}_i|. \] Each error contributes linearly. MAE of 1.5k on 20k prices ≈ 7.5% typical error (good). Doesn’t strongly penalise rare huge errors.

Common pitfalls:
  • Using MAE when large outliers are mission‑critical, under‑penalising them.
  • Interpreting MAE without relating it to the typical value or variance of the target.
MSE / RMSLE
0 moderate large
Lower MSE/RMSLE = smaller errors overall.

Intermediate values = moderate performance.

High values = many large errors.
Alternative regression loss functions: MSE for squared errors, RMSLE for relative/log errors. RMSLE uses log1p transform, emphasising multiplicative errors. Useful when underestimation of large values is particularly problematic or targets are highly skewed. Popular in Kaggle competitions for positive‑only targets. RMSLE can’t straightforwardly handle zeros/negatives; MSE sensitive to outliers.

Common pitfalls:
  • Applying RMSLE when the target has zeros/negatives without appropriate transformations.
  • Ignoring that MSE/RMSE heavily weight a small number of very large residuals.
Pearson Correlation (r)
-1 -0.5 0 0.5 1
Interpreting |r|:

< 0.3 = weak.

0.3–0.5 = moderate.

> 0.7 = strong linear association.
Strength and direction of linear association between two numeric variables. Standardised covariance: \[ r = \frac{\text{Cov}(X,Y)}{\sigma_X\sigma_Y}. \] Ranges −1..1; sign gives direction, magnitude gives strength. Height vs weight often r≈0.7–0.8. Very sensitive to outliers; misses non‑linear relationships.

Common pitfalls:
  • Interpreting correlation as causation or assuming no hidden confounders.
  • Quoting r without visualising scatter plots to check linearity and outliers.
Spearman Correlation (ρ)
-1 -0.5 0 0.5 1
Interpreting |ρ|:

< 0.3 = weak monotonic association.

0.3–0.5 = moderate.

> 0.7 = strong monotonic relationship.
Correlation on ranks; robust to outliers and non‑linear but monotonic trends. Compute ranks of X and Y, then Pearson r on ranks. Captures relationships where one variable consistently increases/decreases with the other. Useful for ordinal data or non‑linear monotonic relationships. Near zero for non‑monotonic relationships (e.g. U‑shaped).

Common pitfalls:
  • Assuming Spearman detects arbitrary non‑linear patterns; it only captures monotonic tendencies.
  • Not accounting for many ties in ranked data, which can affect estimates.
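Both correlations (with p-values) are one call each in scipy, assuming paired numeric arrays x and y:

from scipy import stats

r, p_r     = stats.pearsonr(x, y)      # linear association (sensitive to outliers)
rho, p_rho = stats.spearmanr(x, y)     # monotonic, rank-based association (robust to outliers)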
Partial Correlation
-1 -0.5 0 0.5 1
Interpreting |rpartial|:

≈ 0 = little remaining association beyond controls.

≈ 0.3–0.5 = moderate residual link.

> 0.5 = strong association beyond controlled factors.
Measure association between two variables while controlling for others. Regress each variable on controls, then correlate residuals. Helps separate direct from indirect effects. Correlation between exercise and blood pressure may shrink after controlling for age. Only removes linear effects; can be unstable with many correlated controls.

Common pitfalls:
  • Interpreting partial correlation as proof of direct causal influence.
  • Including too many collinear controls, leading to noisy or unstable estimates.
Regression Coefficients + CI
narrow medium wide
CI not containing 0 → coefficient significantly different from 0.

Narrow CI → precise effect estimate.

Wide CI including 0 → weak or uncertain effect.
Interpret predictor effects and uncertainty in regression models. Coefficients describe expected change in response for unit change in predictor, holding others fixed; CI shows uncertainty. Based on estimated SEs and t / normal critical values. “Each extra year of experience adds $2k (95% CI $1.5k–$2.5k)” is clear and interpretable. Interpretation assumes correct model form and no severe multicollinearity.

Common pitfalls:
  • Interpreting coefficients from poorly specified models (e.g. missing confounders) as causal.
  • Ignoring the width of CIs and focusing only on significance.
Durbin–Watson Test
0 2 4
< 1.5 = likely positive autocorrelation (bad for OLS SEs).

1.5–2.5 = little evidence of serious autocorrelation.

> 2.5 = possible negative autocorrelation.
Detect first‑order serial correlation in regression residuals. DW ≈ 2(1 − ρ1), where ρ1 is lag‑1 autocorrelation. Values far from 2 suggest residual dependence. DW = 0.9 suggests strong positive autocorrelation; consider time‑series models or GLS. Primarily detects first‑order correlation; interpretation uses critical value tables or approximations.

Common pitfalls:
  • Applying Durbin–Watson to models with lagged dependent variables where its distribution changes.
  • Ignoring serial correlation even when DW indicates strong dependence, leading to underestimated SEs.
Breusch–Pagan Test
0 0.05 1.0
p < 0.05 → reject homoscedasticity; heteroscedastic errors (bad for standard OLS SEs).

p ≥ 0.05 → no strong evidence of non‑constant variance.
Test for heteroscedasticity (non‑constant variance) in regression. Regress squared residuals on predictors; statistic ~χ² under constant variance. Significant p suggests need for robust SEs, transforms, or alternate models. Often used in linear regression diagnostics. Power depends on auxiliary regression specification; may miss complex patterns.

Common pitfalls:
  • Assuming homoscedasticity solely because the Breusch–Pagan p‑value is slightly above 0.05.
  • Using standard OLS SEs despite strong evidence of heteroscedasticity.
Shapiro–Wilk Normality Test (on residuals)
0 0.05 1.0
Null: residuals are normal.

p < 0.05 → reject normality (assumption violation).

p ≥ 0.05 → no strong evidence against normality.
Check normality assumption for regression residuals. Statistic W measures agreement between ordered residuals and expected normal order statistics. Used with residual plots to assess normality assumption for t‑based inference. p = 0.4 suggests residual normality is acceptable; p = 0.001 suggests heavy tails / skew. Very powerful for large n (tiny deviations flagged); doesn’t tell how residuals deviate.

Common pitfalls:
  • Overreacting to tiny deviations from normality in large samples where CLT‑based inference is still robust.
  • Relying solely on the test without inspecting Q–Q plots.
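The three residual diagnostics above are one call each in statsmodels/scipy. A hedged sketch, assuming ols_res is a fitted statsmodels OLS results object:

from scipy import stats
from statsmodels.stats.stattools import durbin_watson
from statsmodels.stats.diagnostic import het_breuschpagan

resid = ols_res.resid
dw = durbin_watson(resid)                                        # ≈ 2 → little first-order autocorrelation
bp_lm, bp_p, _, _ = het_breuschpagan(resid, ols_res.model.exog)  # small p → heteroscedasticity
sw_stat, sw_p = stats.shapiro(resid)                             # small p → residuals deviate from normality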

Comparative Table – When to Use MAE vs RMSE

Scenario | Prefer MAE | Prefer RMSE
Robustness to outliers | Yes – if occasional extreme errors are not critical. | No – RMSE will be dominated by a few large residuals.
Penalising large errors heavily | No – treats all deviations linearly. | Yes – squared errors heavily punish large mistakes.
Interpretability | “On average, we are off by …” is intuitive. | Less intuitive, but mathematically convenient for optimisation.
Gradient‑based optimisation | Non‑differentiable at 0 but workable. | Smooth and strongly convex; widely used loss.
Highly skewed targets | Sometimes combined with median‑based models. | May require log‑transform or RMSLE for stability.
Multicollinearity Diagnostics
Metric Decision Criterion Purpose Description Working Mechanism Example Limitations
Variance Inflation Factor (VIF)
1 5 10 20+
VIF ≈ 1 → no multicollinearity.

5–10 = moderate concern.

> 10 = serious multicollinearity.
Quantify how much variance of a coefficient is inflated by linear dependence with other predictors. \[ \text{VIF}_j = \frac{1}{1-R_j^2} \] where Rj² from regressing predictor j on others. Large Rj² → large VIF → unstable coefficient for that predictor. VIF = 12 suggests coefficient may be poorly estimated and highly sensitive to small data changes. Doesn’t indicate which predictors are collinear with each other; only that some redundancy exists.

Common pitfalls:
  • Dropping variables solely because VIF is above a rule‑of‑thumb threshold without considering domain meaning.
  • Ignoring that standardising predictors can change VIF interpretation.
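Per-predictor VIFs are available in statsmodels. A sketch assuming X is a pandas DataFrame of predictors (without an intercept column):

import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

X_const = sm.add_constant(X)                                     # add the intercept before computing VIFs
vifs = {col: variance_inflation_factor(X_const.values, i)
        for i, col in enumerate(X_const.columns)}
print(vifs)                                                      # rule of thumb: VIF > 10 is a red flag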
Condition Number
1 10 30 60+
< 10 = low multicollinearity.

10–30 = moderate.

> 30 = severe (near singular matrix).
Measure overall multicollinearity of predictor matrix. Ratio of largest to smallest singular value (or sqrt of eigenvalue ratio). Large condition number means XᵀX is ill‑conditioned; coefficient estimates can be unstable. Condition number ≈ 50 suggests strong collinearity somewhere among predictors. Scaling affects value; interpret with standardised predictors. Does not pinpoint which variables are problematic.

Common pitfalls:
  • Ignoring collinearity when condition number is large but VIFs seem moderate.
  • Comparing condition numbers across models where predictors are scaled differently.
Outlier & Distribution Metrics
Metric/Test Decision Criterion Purpose Description Working Mechanism Example Limitations
Skewness (Distribution Asymmetry)
-3 -1 0 1 3
Between -1 and +1 = modest skew (often acceptable).

Between -2 and -1 or 1 and 2 = moderate skew.

< -2 or > 2 = strong skew; consider transform or robust methods.
Quantify asymmetry of a distribution (left vs right tail). Third standardized moment; sign indicates direction of long tail. Positive skew → long right tail; negative → long left tail. Incomes are typically right‑skewed; log‑incomes are closer to symmetric. Unstable in small samples; easily influenced by a few extreme points.

Common pitfalls:
  • Using skewness alone to justify transformations without visual inspection.
  • Interpreting minor non‑zero skewness in large samples as serious model violation.
Kurtosis (Tailedness)
(Using excess kurtosis, normal ≈ 0.)

≈ 0 = tails similar to normal.

Between -2 and -0.5 = somewhat light‑tailed.

> 2 = heavy tails / many extreme values.
Describe how heavy the tails are compared to normal. Fourth standardized moment minus 3. High kurtosis indicates variance dominated by rare large deviations. Financial returns often show high positive kurtosis. Very sensitive to outliers; hard to interpret without skewness and plots.

Common pitfalls:
  • Attributing all high kurtosis to “fat tails” instead of checking for data quality or structural breaks.
  • Using kurtosis for tiny samples where estimates are extremely noisy.
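
A minimal sketch of sample skewness and excess kurtosis with SciPy, using log‑normal data as an illustrative right‑skewed example:

import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
x = rng.lognormal(mean=0.0, sigma=1.0, size=5000)   # right-skewed, heavy-tailed

print(f"skewness        : {stats.skew(x):.2f}")                     # > 0 -> long right tail
print(f"excess kurtosis : {stats.kurtosis(x, fisher=True):.2f}")    # > 0 -> heavier tails than normal
print(f"skew after log  : {stats.skew(np.log(x)):.2f}")             # close to 0 after log-transform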
Shapiro–Wilk Test (Normality test)
Null: data are normal.

p < 0.05 → reject normality.

p ≥ 0.05 → no strong evidence against normality.
Formal test of normality for small–moderate samples. Statistic W measures agreement between ordered data and expected normal quantiles. p-value derived from W; small p indicates deviation from normality. p=0.08 → normality acceptable; p=0.001 → strong deviation. Very sensitive with large n; use with plots and domain context.

Common pitfalls:
  • Automatically transforming data because p < 0.05 even when deviations are minor and models are robust.
  • Using the test on discrete or heavily censored data where normality is impossible.
Z-score Outlier Detection
Under normality:

|z| < 2 = typical.

2 ≤ |z| ≤ 3 = borderline outlier.

|z| > 3 = potential outlier.
Flag univariate outliers relative to mean and SD. \[ z = \frac{x - \mu}{\sigma}. \] Extremely large |z| values are unlikely under a normal model. z = 5 is extremely unusual (probability < 10⁻⁶ under normal). Assumes normality; heavy‑tailed data produce many |z|>3 that are not truly abnormal. Mean/SD can be distorted by the outliers.

Common pitfalls:
  • Using z‑score thresholds on clearly non‑normal data (e.g. Pareto‑like heavy tails).
  • Removing points based only on z‑scores without investigating data quality or domain context.
IQR Method (Tukey’s Fences)
Within [Q1−1.5·IQR, Q3+1.5·IQR] = typical range.

Outside 1.5·IQR but within 3·IQR = moderate outlier.

Beyond 3·IQR fences = extreme outlier.
Non‑parametric, robust rule of thumb for univariate outliers. Uses quartiles and interquartile range (IQR = Q3–Q1). Points outside whiskers in a boxplot correspond to Tukey outliers. Common default rule in statistical software boxplots. Skewed distributions may produce many flagged points; rule is heuristic and dimension‑wise only.

Common pitfalls:
  • Applying standard 1.5×IQR rule to strongly skewed data without adjustment.
  • Dropping all outliers automatically instead of investigating their cause.
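
A minimal sketch of Tukey's 1.5·IQR fences with NumPy; the injected extreme values are illustrative:

import numpy as np

rng = np.random.default_rng(6)
x = np.append(rng.normal(size=500), [8.5, 9.2, -7.7])   # a few injected extremes

q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = x[(x < lower) | (x > upper)]
print(f"fences: [{lower:.2f}, {upper:.2f}]  flagged: {np.sort(outliers)}")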
Robust Z-score (MAD)
For the robust z‑score z_MAD:

|z_MAD| < 2.5 = typical.

2.5–3.5 = borderline.

> 3.5 = strong outlier candidate.
Detect outliers robustly using median and median absolute deviation. Robust Z = (x − median) / (1.4826 · MAD), where MAD is median(|x − median|). More stable than classical z‑score in presence of outliers. Good for heavy‑tailed or skewed distributions where mean/SD are distorted. Still assumes a roughly unimodal distribution; threshold choices are heuristic.

Common pitfalls:
  • Using robust z‑scores but still computing MAD on a mixture of very different populations.
  • Believing robust methods remove the need for visual inspection or domain knowledge.
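
A minimal sketch contrasting classical z‑scores with MAD‑based robust z‑scores on contaminated data (scipy.stats.median_abs_deviation with scale="normal" applies the 1.4826 factor; it requires a reasonably recent SciPy):

import numpy as np
from scipy.stats import median_abs_deviation

rng = np.random.default_rng(7)
x = np.append(rng.normal(size=200), [15.0, 18.0, 20.0])    # three gross outliers

z_classic = (x - x.mean()) / x.std()                       # mean/SD are inflated by the outliers
mad_scaled = median_abs_deviation(x, scale="normal")       # = 1.4826 * MAD
z_robust = (x - np.median(x)) / mad_scaled

print("classical |z| > 3  :", np.sum(np.abs(z_classic) > 3))
print("robust    |z| > 3.5:", np.sum(np.abs(z_robust) > 3.5))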
Cook’s Distance
< 0.5 = typically low influence.

0.5–1 = potentially influential, inspect.

> 1 (especially much >1) = highly influential point.
Measure influence of each observation on regression fit. Combines leverage and residual size to approximate change in all fitted values when a point is removed. Large Cook’s D means the observation strongly affects estimates. One point with D=1.5 while others < 0.1 indicates a dominating data point. Thresholds are rules of thumb; influential points may be valid data, not necessarily errors.

Common pitfalls:
  • Automatically deleting points with Cook’s D above a threshold without checking if they are legitimate.
  • Ignoring the leverage–residual decomposition that explains why points are influential.
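
A minimal sketch extracting Cook's distances from a fitted statsmodels OLS, with one deliberately planted high‑leverage, large‑residual point:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(8)
x = rng.normal(size=80)
y = 1 + 1.2 * x + rng.normal(scale=1.0, size=80)
x[0], y[0] = 4.0, -8.0                       # high leverage + large residual -> influential

results = sm.OLS(y, sm.add_constant(x)).fit()
cooks_d, _ = results.get_influence().cooks_distance
print("max Cook's D:", cooks_d.max(), "at index", cooks_d.argmax())
print("points with D > 0.5:", np.where(cooks_d > 0.5)[0])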
Mahalanobis Distance
For p dimensions, compare the squared distance MD² to χ² quantiles with p degrees of freedom:

MD² ≤ the 0.975 quantile = inside main cloud.

Between the 0.975 and 0.99 quantiles = potential multivariate outlier.

MD² > the 0.99 quantile = strong multivariate outlier.
Detect multivariate outliers accounting for correlations between variables. \[ MD(x) = \sqrt{(x-\mu)^\top \Sigma^{-1} (x-\mu)}. \] Squaring MD gives a χ² statistic under multivariate normality. Points with large MD² lie far from the multivariate mean in whitened space. Useful for anomaly detection in multi‑feature settings. Requires good estimates of μ and Σ; classical covariance is itself distorted by outliers; robust covariance estimators may be needed.

Common pitfalls:
  • Computing Mahalanobis distance with non‑invertible or poorly conditioned covariance matrices.
  • Applying χ² cutoffs to data that are far from multivariate normal.
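
A minimal sketch of Mahalanobis distances with a χ² cutoff, assuming roughly elliptical data; in practice a robust covariance estimate may be preferable:

import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(9)
cov = np.array([[1.0, 0.8], [0.8, 1.0]])
X = rng.multivariate_normal(mean=[0, 0], cov=cov, size=500)
X = np.vstack([X, [[4.0, -4.0]]])            # far from the correlated cloud, though not extreme marginally

mu = X.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
diff = X - mu
md2 = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)     # squared Mahalanobis distances

cutoff = chi2.ppf(0.99, df=X.shape[1])
print("flagged indices:", np.where(md2 > cutoff)[0])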
Kolmogorov–Smirnov Test (KS)
Two‑sample KS for distribution shift:

p < 0.05 → distributions differ significantly (potential shift / mismatch).

p ≥ 0.05 → no strong evidence of difference.

Larger D (0–1) = stronger discrepancy.
Compare empirical distributions (e.g. train vs production) or sample vs theoretical distribution. KS statistic D = sup |F1(x) − F2(x)| over x. p‑value derived from D; sensitive to location and shape differences. Useful in monitoring feature distribution drift over time. More sensitive near median than in tails; assumes continuous data and independent samples.

Common pitfalls:
  • Using KS on discrete or heavily binned data without an appropriate variant.
  • Interpreting a non‑significant KS as proof that distributions are identical (rather than “no strong evidence of difference”).
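
A minimal sketch of the two‑sample KS test (scipy.stats.ks_2samp) as a simple drift check between a "train" and a "production" sample; the shift is synthetic:

import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(10)
train_feature = rng.normal(loc=0.0, scale=1.0, size=2000)
prod_feature = rng.normal(loc=0.3, scale=1.2, size=2000)   # shifted and more spread out

d_stat, p_value = ks_2samp(train_feature, prod_feature)
print(f"KS D = {d_stat:.3f}, p = {p_value:.2e}")   # small p -> distributions likely differ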
Influence & Robustness Lab (Interactive Playgrounds)

    These playgrounds show how single points, noise and multiple comparisons can quietly break your models, even when headline metrics still look good.

Outlier Impact 2.0 – How one point can twist a regression line

Controls

  • Sample size (base points): 30
  • Noise level (σ): 1.0
  • Outlier height: 6.0
True model: y = 1 + 1.2·x. Outlier shares the same x-range, but you can move it far up or down.

How to read this playground

This playground shows how a single outlier can dramatically change an ordinary least-squares regression line.

Purpose

We simulate a simple linear model and compare two fits: one without the outlier (baseline) and one with the outlier. If a single point can flip the slope or change it a lot, your model is fragile.

How to use the controls

  • Sample size – more points ⟹ harder to twist the line.
  • Noise level – more noise hides the clean relationship.
  • Outlier height – drag far up/down and watch the orange line tilt.
  • Regenerate – new random base cloud with same settings.

Interpretation

  • Blue dots: base data. Red dot: outlier. Blue line: fit without outlier.
  • Orange line: fit including outlier.
  • In the table, Δ slope and the tag fragile/stable tell you how influential the point is.
Chart legend: base points · outlier · fit without outlier · fit with outlier. Results table columns: Model · Slope · Intercept · Δ slope · Stability.

If one point can flip the conclusion, you don’t have a stable finding – you have an anecdote dressed up as a model.

Cook’s Distance & Leverage – Influence of a single high-leverage point

Controls

  • Leverage (x position): 2.5
  • Residual size (vertical offset): 2.0
Base model: same y = 1 + 1.2·x with noise. The purple point is the high-leverage candidate.

How to read this playground

Cook’s Distance combines two ideas: leverage (how unusual x is) and residual (how badly the point is fit). A point with both high leverage and large residual is highly influential.

What you see

  • Blue dots: regular data used to fit the baseline regression line.
  • Purple dot: candidate point at the chosen x-position and residual.
  • The table shows its Cook’s Distance and a qualitative flag (OK vs influential).

Heuristics

  • Rough rule: Cook’s D > 1 is clearly influential; D > 0.5 is worth attention.
  • High leverage with tiny residual can still be dangerous if a future small mistake there would flip the slope.
Chart legend: base points · candidate point · baseline fit · fit with candidate. Results table columns: Quantity · Value · Interpretation.

Use this playground to feel that leverage (far-out x) without residual is mild, residual without leverage is local – but the combination gives large Cook’s Distance.

Noise Injection Playground – How noise quietly kills stability

Noise controls

  • Gaussian feature noise: 0.3
  • Random label flips: 0.05
  • Feature noise (jitter): 0.2
  • # Irrelevant features: 5

This is a pedagogical model: we don’t fit an actual classifier, but apply a simple response surface that mimics how noise hurts cross-validated performance.

Effect on performance & stability

  • Baseline accuracy (clean data): 0.90
  • Expected train accuracy: 0.90
  • Expected CV accuracy: 0.86
  • Variance of CV scores: 0.01
  • Stability index (0–1): 0.80
  • Verdict: moderate robustness

How to read this playground

  • Increase noise and watch train accuracy stay high while CV accuracy drops and variance inflates.
  • Many irrelevant features increase overfitting pressure even if the core signal is unchanged.
  • The stability index is a compact score combining CV accuracy and variability: low values mean your model is too dependent on random quirks.

Moral: always think in terms of “signal vs noise”. Robust models maintain high CV accuracy and low instability even when noise rises.

Bonferroni & VIF – Why many tests and correlated features can fool you

In real projects we almost never test just one thing. We try many features, model variants, time points, segments, and outcomes. Every extra test is another chance to see a “significant” result that is actually just noise.

1. Multiple testing & Bonferroni

If you test one hypothesis at α = 0.05, there is a 5% chance of a false positive (a Type I error). If you test many hypotheses at the same threshold, the chance that at least one of them is a false positive grows very quickly.

  • 1 test at α = 0.05 → about 5% chance of a false positive.
  • 100 independent tests at α = 0.05 → about 99% chance that at least one is “significant” just by luck.
What is family-wise error (FWE / FWER)?

Family-wise error (FWE), often called the family-wise error rate (FWER), is the probability of making at least one Type I error (false positive) when performing a “family” of statistical tests. When you run many tests, this probability increases. FWE / FWER is used to quantify and control this risk, usually by adjusting the significance level or using corrections such as Bonferroni or Holm.

What Bonferroni does

Bonferroni is a conservative safety brake:

New per-test α = original α ÷ number of tests.

Example: with α = 0.05 and 100 tests, the Bonferroni-corrected threshold becomes 0.05 ÷ 100 = 0.0005 for each test. This keeps the FWER near 5%, but makes it harder to detect real effects.

  • Pros: very safe; strong control of FWER (false discoveries are rare).
  • Cons: conservative; with many tests it can hide real signals.

The playground shows how FWER explodes with many tests when you don’t correct, and how Bonferroni pulls it back under control.
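
A minimal sketch mirroring the playground: the analytic FWER for m uncorrected tests, plus a small simulation with all nulls true, using statsmodels' multipletests for the Bonferroni step:

import numpy as np
from statsmodels.stats.multitest import multipletests

m, alpha = 100, 0.05
print("analytic FWER, uncorrected:", 1 - (1 - alpha) ** m)      # ~0.994 for m = 100

rng = np.random.default_rng(11)
n_sims, false_any_raw, false_any_bonf = 2000, 0, 0
for _ in range(n_sims):
    pvals = rng.uniform(size=m)                 # p-values under true nulls are Uniform(0, 1)
    false_any_raw += (pvals < alpha).any()
    reject, _, _, _ = multipletests(pvals, alpha=alpha, method="bonferroni")
    false_any_bonf += reject.any()

print("simulated FWER, uncorrected:", false_any_raw / n_sims)
print("simulated FWER, Bonferroni :", false_any_bonf / n_sims)   # stays near alpha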

2. VIF – When features tell the same story

In the right-hand panel, we are not testing many hypotheses, but we are using many correlated features in a regression. VIF explains how this hurts the stability of your coefficients.

Intuition:

  • The slider "Correlation between similar features" (ρ) says how strongly a group of features move together.
  • The slider "# of similarly correlated features" (k) says how many features sit in that correlated pack.

VIF is defined as VIF = 1 / (1 − R²) when regressing one feature on the others. It tells you how much the variance of a coefficient is inflated because of collinearity.

  • VIF ≈ 1 – almost no collinearity.
  • VIF 5–10 – coefficients are unstable and hard to interpret.
  • VIF > 10 – strong collinearity; rethink features or model design.

If ρ and k are high, the model cannot cleanly separate which feature carries the signal. Coefficients may swing wildly, flip sign, or become impossible to interpret, even though the underlying relationship hasn’t changed.


3. Big picture

Both panels demonstrate the same core idea in different ways:

  • Multiple testing – too many chances to “win” by noise → false discoveries.
  • Collinearity – too many overlapping features → unstable estimates.

In short: noise + many decisions = unreliable statistics. The corrections and diagnostics you see here (Bonferroni, FWER, VIF) are tools to keep that under control.

Bonferroni & VIF Intuition – Multiple tests and collinearity

Bonferroni Correction – Family-wise error

  • # of independent tests (m): 20
  • Per-test α (uncorrected): 0.05
  • FWER before correction: 0.64
  • Bonferroni αcorr = α / m: 0.0025
  • ≈ FWER after correction: 0.05
  • Verdict: reasonable

Every extra test is another chance to see “significance” by luck. Bonferroni shrinks the per-test α so the family-wise error rate stays near your target (e.g. 5%).

Try m = 1 vs m = 100 with α = 0.05 and feel how uncorrected FWER explodes.

VIF Intuition – How correlation inflates variance

  • Correlation between similar features (ρ): 0.7
  • # of similarly correlated features (k): 3
  • Approx. R² when regressing one feature on the others: 0.66
  • VIF = 1 / (1 − R²): 2.94
  • Interpretation: acceptable

VIF tells you how much the variance of a coefficient is inflated by collinearity with other predictors.

  • VIF ≈ 1 – almost no collinearity.
  • VIF 5–10 – concerning; coefficients unstable and hard to interpret.
  • VIF > 10 – strong collinearity, rethink your design/features.

Increase ρ or k and watch VIF explode while the underlying signal hasn’t changed at all.

Model Selection Criteria
For each criterion below: decision criterion · purpose · description · working mechanism · example · limitations · common pitfalls.
Akaike Information Criterion (AIC)
Compare ΔAIC relative to best (lowest).

ΔAIC < 2 = essentially equally good.

2–10 = some to strong evidence against.

> 10 = much worse than best model.
Trade off fit vs complexity for out‑of‑sample prediction quality. \[ \text{AIC} = 2k - 2\log(L), \] k = parameters, L = likelihood. Lower AIC preferred among models fit to same data/response. Models with ΔAIC < 2 are often considered similarly plausible. Asymptotic; for small n, AICc is preferable. Not directly interpretable in absolute terms.

Common pitfalls:
  • Treating small AIC differences (< 2) as meaningful when they are usually negligible.
  • Comparing AIC values across different datasets or response variables.
Bayesian Information Criterion (BIC)
ΔBIC vs best:

< 2 = weak evidence against.

2–6 = positive evidence against.

6–10 = strong.

> 10 = very strong evidence against model.
More heavily penalise complexity, tending to pick more parsimonious models as n grows. \[ \text{BIC} = k \log(n) - 2\log(L). \] Approximate Bayes factor under certain priors; lower is better. BIC often selects smaller subset of predictors than AIC in large samples. Assumes true model is in candidate set; may underfit when prediction, not true model recovery, is goal.

Common pitfalls:
  • Using BIC to choose between models when the goal is pure prediction rather than parsimony.
  • Comparing BIC across datasets with different numbers of observations.
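
A minimal sketch comparing AIC and BIC for two nested OLS models fit to the same response (x2 is pure noise by construction):

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(12)
df = pd.DataFrame({"x1": rng.normal(size=200), "x2": rng.normal(size=200)})
df["y"] = 1 + 2 * df["x1"] + rng.normal(size=200)          # x2 carries no signal

small = smf.ols("y ~ x1", data=df).fit()
large = smf.ols("y ~ x1 + x2", data=df).fit()

for name, res in [("y ~ x1", small), ("y ~ x1 + x2", large)]:
    print(f"{name:12s}  AIC = {res.aic:8.1f}  BIC = {res.bic:8.1f}")
# Compare deltas relative to the lowest value; both criteria should favour the smaller model here.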
Mallows’ Cp
For subset size p (predictors), good models have Cp ≈ p+1.

Cp − (p+1) ≈ 0 = balanced bias–variance.

Far above p+1 = underfitting (missing predictors).

Far below p+1 = possible overfitting.
Guide subset selection in linear regression. Compares residual SS of subset model to full model’s error variance. Plot Cp vs p; models near diagonal line Cp = p+1 are desirable. Helps choose among many subset models with similar R². Relies on full model as reference; computationally expensive for large predictor sets without heuristics.

Common pitfalls:
  • Ignoring the uncertainty in selecting the “best” Cp model when many subsets have similar values.
  • Using Cp when the full model is itself misspecified.
Cross‑validated Deviance / Log‑Loss
Lower is better.

Lowest CV deviance among candidates = preferred model.

Differences < 1–2% = often negligible in practice.

Much larger deviance = clearly worse predictive model.
Directly compare predictive models using out‑of‑sample negative log‑likelihood. Compute mean deviance / log‑loss across CV folds. Penalises wrong confident predictions more strongly than Brier score. Standard for logistic / probabilistic models in ML competitions. Can be dominated by rare but very miscalibrated regions; needs context for what constitutes a big improvement.

Common pitfalls:
  • Over‑tuning hyperparameters to tiny differences in log‑loss on noisy validation data.
  • Ignoring class weights or imbalance when interpreting deviance alone.
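
A minimal sketch of cross‑validated log‑loss for two candidate classifiers with scikit‑learn (scoring="neg_log_loss"); the imbalanced synthetic data are illustrative:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=20, weights=[0.8, 0.2], random_state=0)

for name, clf in [("logreg", LogisticRegression(max_iter=1000)),
                  ("forest", RandomForestClassifier(n_estimators=200, random_state=0))]:
    scores = cross_val_score(clf, X, y, cv=5, scoring="neg_log_loss")
    print(f"{name}: CV log-loss = {-scores.mean():.4f} ± {scores.std():.4f}")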
Regularisation Strength (λ)
Very small λ → unregularised, risk of overfitting.

λ chosen by CV → good bias–variance compromise.

Very large λ → coefficients shrunk too much (underfitting).
Control complexity of models like ridge, lasso, elastic‑net. Penalise large coefficients (L2) or enforce sparsity (L1) to improve generalisation. Typically chosen via cross‑validated performance curves over λ. L1 (lasso) can produce sparse models; L2 (ridge) stabilises coefficients with multicollinearity. Interpretation depends on scaling; different λ scales across algorithms and implementations.

Common pitfalls:
  • Using default λ without checking for under/over‑regularisation via validation curves.
  • Comparing λ values across models with different feature standardisation or penalty definitions.
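
A minimal sketch of choosing the penalty by cross‑validation with RidgeCV (scikit‑learn calls the strength alpha rather than λ); features are standardised first so the chosen value is easier to compare across runs:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=300, n_features=50, n_informative=10, noise=10.0, random_state=0)

alphas = np.logspace(-3, 3, 25)
model = make_pipeline(StandardScaler(), RidgeCV(alphas=alphas))   # standardise, then search over alpha
model.fit(X, y)
print("selected alpha:", model.named_steps["ridgecv"].alpha_)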
Cutting-Edge ML Metrics (2020–2025)

Modern machine-learning research uses metrics that go beyond classical AUC, RMSE, and p-values. These tools measure calibration quality, distribution shift, risk ranking, uncertainty, robustness, fairness, and high-dimensional model stability.

  • Advanced Calibration Metrics (research → practice):
    ACE (Adaptive Calibration Error), TCE (Thresholded Calibration Error), Squared-ECE, Adaptive Reliability Error.
  • Skill Scores (common in production):
    Brier Skill Score (BSS) — compares your model to a naïve baseline to give calibration improvement.
  • Modern Ranking & Risk Metrics (research → practice):
    C-index, Somers’ D, IDI (Integrated Discrimination Improvement), NRI (Net Reclassification Improvement).
  • Bayesian / Predictive Model Selection (research → practice):
    WAIC, PSIS-LOO, Expected Log Predictive Density (ELPD), Pareto-k diagnostics.
  • Distribution Shift Diagnostics (research → practice):
    PSI (Population Stability Index), KL Divergence, Maximum Mean Discrepancy (MMD), Wasserstein Distance; a minimal PSI sketch follows this list.
  • Uncertainty Diagnostics (primarily research):
    Negative Log-Likelihood (NLL), epistemic vs aleatoric variance decomposition, Expected Sharpness.
  • Robustness & Stability (primarily research):
    Model Stability Index (MSI), permutation effect-size curves, influence-function diagnostics.
  • Fairness / Ethical ML Metrics (research → practice):
    Equalized Odds, Demographic Parity, Predictive Parity, Calibration Within Groups.
  • Interaction-Aware Interpretability (primarily research):
    SHAP Interaction Values — modern extension of SHAP for modelling non-linear pairwise effects.
  • Leakage Diagnostics (primarily research):
    Target Leakage Test (permutation with shuffled labels), Cross-fold Correlation Leakage Index.
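
As referenced in the Distribution Shift bullet above, a minimal PSI sketch; note that the bin choice and the usual 0.1 / 0.25 rules of thumb are conventions, not standards:

import numpy as np

def psi(expected, actual, n_bins=10, eps=1e-6):
    """PSI = sum((p_e - p_a) * ln(p_e / p_a)) over quantile bins of the expected sample."""
    edges = np.quantile(expected, np.linspace(0, 1, n_bins + 1))
    edges[0] = min(edges[0], np.min(actual)) - eps      # widen outer edges to cover both samples
    edges[-1] = max(edges[-1], np.max(actual)) + eps
    p_e = np.histogram(expected, bins=edges)[0] / len(expected) + eps
    p_a = np.histogram(actual, bins=edges)[0] / len(actual) + eps
    return float(np.sum((p_e - p_a) * np.log(p_e / p_a)))

rng = np.random.default_rng(13)
train = rng.normal(size=10_000)
print(f"PSI, no drift  : {psi(train, rng.normal(size=10_000)):.3f}")                      # close to 0
print(f"PSI, with drift: {psi(train, rng.normal(loc=0.4, scale=1.1, size=10_000)):.3f}")  # noticeably larger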
Model Interpretability & Explainability
For each method below: decision criterion · purpose · description · working mechanism · example · limitations · common pitfalls.
Permutation Feature Importance
Large performance drop when permuted = strong importance.

Near‑zero drop = little contribution (under current model).
Global importance ranking for features in any black‑box model. Measures how much a model’s performance metric worsens when the values of a feature are randomly permuted. Break the relationship between a feature and target by shuffling that feature; recompute performance and compare to baseline. Rank features by drop in ROC AUC or RMSE in a tree ensemble to understand which inputs matter most. Correlated features can share importance (each appears less important alone). Requires many evaluations of the model; may be expensive.

Common pitfalls:
  • Interpreting low importance as “irrelevant” when it may be redundant with correlated variables.
  • Permuting time‑series or grouped features independently, breaking their structure.
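
A minimal sketch of permutation importance with scikit‑learn, computed on a held‑out set so the drop reflects generalisation rather than training fit:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1500, n_features=8, n_informative=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_train, y_train)
result = permutation_importance(model, X_test, y_test,
                                scoring="roc_auc", n_repeats=20, random_state=0)

order = result.importances_mean.argsort()[::-1]
for i in order:
    print(f"feature_{i}: drop in ROC AUC = {result.importances_mean[i]:.3f} "
          f"± {result.importances_std[i]:.3f}")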
SHAP Values
Stable patterns that match domain knowledge are desirable.

Very large |SHAP| values indicate strongly influential features for a given prediction.
Provide theoretically grounded, additive feature attributions for individual predictions and global patterns. Based on Shapley values from cooperative game theory; each feature gets a contribution to pushing the prediction away from a baseline. Approximate each feature’s marginal contribution by averaging over many coalitions of features; specialised fast algorithms exist for tree‑based models. Global SHAP summary plots highlight which variables drive model predictions overall; local plots explain single predictions (e.g. why a loan was rejected). Computationally expensive for complex models without specialised approximations; explanations can be misread as causal rather than correlational.

Common pitfalls:
  • Over‑interpreting SHAP values as causal effects instead of associations.
  • Ignoring the impact of feature collinearity on SHAP attribution.
LIME (Local Interpretable Model‑agnostic Explanations)
Good explanations are locally faithful (approximate the model well near the point of interest) and sparse enough to be interpretable.
Explain individual predictions by fitting a simple surrogate model around a local neighbourhood. Samples points near the instance, queries the black‑box model, and fits an interpretable model (e.g. linear or small tree) to those local outputs. Weights samples by proximity to the instance; the surrogate’s coefficients are reported as local feature importance. Useful for debugging why a specific prediction was made, especially in regulated domains where explanations must be human‑readable. Local surrogate may be unstable (different runs give different explanations); only valid near the point; can be misleading globally.

Common pitfalls:
  • Treating LIME’s local linear model as a global explanation.
  • Ignoring randomness in neighbourhood sampling, leading to inconsistent explanations.
Python Code Snippets (scikit‑learn / statsmodels / SHAP / LIME)

Classification Metrics & Confusion Matrix


import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Generate dummy data for demonstration
X, y = np.random.rand(100, 5), np.random.randint(0, 2, 100)

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Train a simple classification model (e.g., Logistic Regression)
model = LogisticRegression(random_state=42)
model.fit(X_train, y_train)

# Define y_true and X for the existing metrics calculation cell
y_true = y_test
X = X_test

from sklearn.metrics import (
    confusion_matrix, accuracy_score, precision_score, recall_score,
    f1_score, roc_auc_score, average_precision_score,
    matthews_corrcoef, cohen_kappa_score
)

y_proba = model.predict_proba(X)[:, 1]
y_pred = (y_proba >= 0.5).astype(int)

cm = confusion_matrix(y_true, y_pred)
acc = accuracy_score(y_true, y_pred)
prec = precision_score(y_true, y_pred)
rec = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
roc_auc = roc_auc_score(y_true, y_proba)
pr_auc = average_precision_score(y_true, y_proba)
mcc = matthews_corrcoef(y_true, y_pred)
kappa = cohen_kappa_score(y_true, y_pred)

print("Confusion Matrix:\\n", cm)
print("Accuracy:", acc)
print("Precision:", prec)
print("Recall:", rec)
print("F1-score:", f1)
print("ROC AUC:", roc_auc)
print("PR AUC:", pr_auc)
print("Matthews Corr. Coeff.:", mcc)
print("Cohen's Kappa:", kappa)
  


Statistical Tests (t-test, ANOVA, chi-square)


import numpy as np
import pandas as pd
from scipy import stats
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Sample data for two-sample t-test
group1 = np.random.rand(20) * 10
group2 = np.random.rand(25) * 10 + 2  # Slightly different mean

# two-sample t-test
t_stat, p_value = stats.ttest_ind(group1, group2, equal_var=False)
print(f"Two-sample t-test: t-statistic = {t_stat:.3f}, p-value = {p_value:.3f}")

# Sample data for chi-square test of independence
contingency_table = np.array([[10, 20], [15, 5]])

# chi-square test of independence
chi2, p, dof, expected = stats.chi2_contingency(contingency_table)
print(f"\nChi-square test: chi2 = {chi2:.3f}, p-value = {p:.3f}, dof = {dof}")
print(f"Expected frequencies:\n{expected}")

# Sample data for one-way ANOVA
data = {
    'y': np.concatenate([
        np.random.rand(10) * 5,
        np.random.rand(10) * 5 + 2,
        np.random.rand(10) * 5 + 1
    ]),
    'group': ['A'] * 10 + ['B'] * 10 + ['C'] * 10
}
df = pd.DataFrame(data)

# one-way ANOVA with statsmodels
model = smf.ols('y ~ C(group)', data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)
print(f"\nOne-way ANOVA:\n{anova_table}")
  


Regression Metrics (MAE vs RMSE)


import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split

# Small synthetic regression problem so this cell runs on its own
rng = np.random.RandomState(42)
X_reg = rng.rand(200, 3)
y_reg = 3 * X_reg[:, 0] - 2 * X_reg[:, 1] + rng.normal(scale=0.3, size=200)

X_train, X_test, y_train, y_test = train_test_split(X_reg, y_reg, test_size=0.3, random_state=42)
reg = LinearRegression().fit(X_train, y_train)

y_true = y_test
y_pred = reg.predict(X_test)

mae = mean_absolute_error(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))

print(f"Mean Absolute Error (MAE): {mae:.4f}")
print(f"Root Mean Squared Error (RMSE): {rmse:.4f}")
  


SHAP / LIME Basics


import shap
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from lime.lime_tabular import LimeTabularExplainer

# Dummy feature matrix so this cell runs on its own
X = np.random.rand(100, 5)

# Create a dummy binary target variable for X (since none is provided)
y_binary = (X[:, 0] > np.mean(X[:, 0])).astype(int)  # Example: binarize based on first feature

# Wrap the features in DataFrames so SHAP/LIME can use readable feature names, then split
feature_names = [f'feature_{i}' for i in range(X.shape[1])]
X_df = pd.DataFrame(X, columns=feature_names)
X_train_df, X_test_df, y_train, y_test = train_test_split(
    X_df, y_binary, test_size=0.3, random_state=42
)

# Train a Logistic Regression model as the linear model to explain
model = LogisticRegression(solver='liblinear')
model.fit(X_train_df, y_train)

# SHAP for the linear model
explainer = shap.LinearExplainer(model, X_train_df)
shap_values = explainer.shap_values(X_test_df)

# Summary plot (global view of feature contributions)
shap.summary_plot(shap_values, X_test_df)

# LIME explainer (local explanation for a single test instance)
lime_explainer = LimeTabularExplainer(
    training_data=X_train_df.values,
    feature_names=feature_names,
    class_names=['negative', 'positive'],
    mode='classification'
)

i = 0
exp = lime_explainer.explain_instance(
    X_test_df.iloc[i].values,
    model.predict_proba,
    num_features=5
)

exp.show_in_notebook()
  

Limits of Models, Metrics & Science

This dashboard summarises many of the tools we currently use to make sense of data – from classical statistics to modern ML metrics. They are powerful, but they are not the whole story.

Gödel’s incompleteness theorems show that any consistent formal system rich enough to express basic arithmetic will contain true statements that cannot be proven within that system: such a system cannot be both complete and consistent.

By analogy, no modelling framework or collection of metrics can ever fully capture the processes we study. We always work with:

  • finite, noisy, and biased data
  • simplified models of complex systems
  • metrics that highlight some aspects while ignoring others
  • assumptions that are never perfectly satisfied

The goal here is not to pretend that AUC, RMSE, p-values, or any “cutting-edge” metric delivers final truth. Instead, this dashboard makes our tools explicit – showing where they are informative, where they are fragile, and where they leave important questions unanswered.

In other words: this is a map, not the territory. The map is useful, and it keeps improving – but if we ever forget that it is incomplete, we stop doing science.

Thanks for your attention.
— manu

Email: x34mev@proton.me • GitHub: https://github.com/slashennui
