Metric Selection Guide (Flowchart‑style)
High‑level guide for picking suitable metrics:
- Problem type?
- Binary / multiclass classification → use metrics in Performance (Classification).
- Regression → use metrics in Regression & Correlation.
- Hypothesis / A/B test → use Inference & Hypothesis Testing.
- Is the dataset imbalanced?
- Yes → emphasise recall, precision, F1, PR AUC, MCC; avoid relying on plain accuracy.
- No → accuracy + ROC AUC + F1 are typically fine.
- Threshold‑free ranking vs threshold‑specific performance?
- Ranking quality → ROC AUC / PR AUC.
- Specific operating point → confusion matrix, precision/recall/F1 at that threshold.
- Regression goals?
- Penalise big errors heavily → RMSE / MSE.
- Interpret “average absolute deviation” → MAE.
- Targets span orders of magnitude → RMSLE or log‑transforms.
- Need interpretability?
- Global feature importance → permutation importance, SHAP.
- Local explanations → SHAP, LIME.
This resource prioritizes practical experience, highlighting the critical errors that lead to faulty models and bad decisions. Learn to avoid common pitfalls like:
- The Accuracy Trap when working with imbalanced datasets.
- Misinterpreting the p-value – it is not the probability that the null hypothesis is true.
- Data leakage from incorrectly applying cross-validation to time-series or temporally ordered data.
Data Science Model Evaluation Metrics: Condensed Reference Sheet
Quick overview of what to use, when to use it, and what to watch out for.
Classification Metrics (Quick Reference)
Key formulas
Accuracy = (TP + TN) / (TP + FP + TN + FN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 = 2 × (Precision × Recall) / (Precision + Recall)
MCC = (TP×TN − FP×FN) / √[(TP+FP)(TP+FN)(TN+FP)(TN+FN)]
| Metric | Range | Good | Fair | Poor | When to Use |
|---|---|---|---|---|---|
| Accuracy | 0 – 1 | > 0.8 | 0.5 – 0.8 | < 0.5 | Balanced data only |
| Precision | 0 – 1 | > 0.8 | 0.5 – 0.8 | < 0.5 | False positives costly |
| Recall | 0 – 1 | > 0.8 | 0.5 – 0.8 | < 0.5 | False negatives costly |
| F1 Score | 0 – 1 | > 0.8 | 0.5 – 0.8 | < 0.5 | Imbalanced data |
| ROC AUC | 0 – 1 | > 0.8 | 0.7 – 0.8 | < 0.7 | Ranking problems |
| PR AUC | 0 – 1 | Much > baseline | Moderate > baseline | ≈ baseline | Imbalanced binary |
| MCC | −1 to 1 | > 0.5 | 0.3 – 0.5 | < 0 | Imbalanced binary |
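A minimal sketch (plain Python, no libraries) of the key formulas above applied to illustrative confusion-matrix counts; the counts themselves are made up for the example:

```python
import math

# Illustrative confusion-matrix counts (made-up numbers)
TP, FP, TN, FN = 80, 10, 95, 15

accuracy = (TP + TN) / (TP + FP + TN + FN)
precision = TP / (TP + FP)
recall = TP / (TP + FN)
f1 = 2 * precision * recall / (precision + recall)
mcc = (TP * TN - FP * FN) / math.sqrt(
    (TP + FP) * (TP + FN) * (TN + FP) * (TN + FN)
)

print(f"Accuracy={accuracy:.3f} Precision={precision:.3f} "
      f"Recall={recall:.3f} F1={f1:.3f} MCC={mcc:.3f}")
```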
Regression Metrics (Quick Reference)
When to use:
- RMSE → Large errors are critically bad
- MAE → Want “average error” interpretation
- RMSLE → Targets span orders of magnitude
| Metric | Good | Fair | Poor | Characteristics |
|---|---|---|---|---|
| R² | > 0.7 | 0.5 – 0.7 | < 0.3 | % variance explained |
| Adjusted R² | Close to R² | Lower than R² | Much lower | Accounts for predictors |
| RMSE | << target SD | ≈ target SD | ≥ target SD | Penalizes large errors |
| MAE | << target mean | ≈ 10–20% mean | ≥ 30% mean | Robust to outliers |
Statistical Testing Essentials
Hypothesis Testing Flow
State H₀ & H₁ → Choose α (0.05) → Compute p-value → Compare:
- p < α → Reject H₀ (statistically significant)
- p ≥ α → Fail to reject H₀
Common Tests
| Test | Use Case | Key Assumptions |
|---|---|---|
| t-test | Compare 2 group means | Normality, equal variance (or use Welch’s) |
| ANOVA | Compare ≥ 3 group means | Normality, equal variance, independence |
| Chi-square | Test independence in categorical data | Expected counts ≥ 5 |
| Permutation Test | Non-parametric alternative | Exchangeability of labels |
| Kolmogorov–Smirnov (KS) | Compare a sample to a reference distribution (1-sample) or compare two samples (2-sample) | Continuous data, specifies distribution under H₀. Sensitive to differences in shape and location. |
| Levene’s test | Test equality of variances across groups (do we trust “equal variance” for t-test / ANOVA?) | Groups independent. Works reasonably well even when data are not normal. |
Multiple Testing Corrections
| Method | Controls | When to Use |
|---|---|---|
| Bonferroni | FWER (strict) | Few tests, false positives very costly |
| Benjamini–Hochberg | FDR (less strict) | Many tests (genomics, feature screening) |
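A hedged sketch of applying both corrections with statsmodels' multipletests; the p-values below are made-up examples:

```python
from statsmodels.stats.multitest import multipletests

p_values = [0.001, 0.008, 0.020, 0.041, 0.300]  # made-up example p-values

# Bonferroni: controls the family-wise error rate (strict)
rej_bonf, p_bonf, _, _ = multipletests(p_values, alpha=0.05, method="bonferroni")

# Benjamini–Hochberg: controls the false discovery rate (less strict)
rej_bh, p_bh, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")

print("Bonferroni rejections:", rej_bonf)
print("BH rejections:        ", rej_bh)
```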
Model Selection Criteria
Rule of thumb: ΔAIC/BIC < 2 → models are essentially equivalent.
| Criterion | Preference | Best For |
|---|---|---|
| AIC | Lower is better | Prediction quality, larger models |
| BIC | Lower is better | True model identification, smaller models |
| Cross-validated Error | Lower is better | Direct out-of-sample performance |
| Adjusted R² | Higher is better | Regression with multiple predictors |
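One way to read off these criteria for a linear model, sketched with statsmodels on simulated placeholder data:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] + rng.normal(size=200)   # only the first feature carries signal

results = sm.OLS(y, sm.add_constant(X)).fit()
print(f"AIC={results.aic:.1f}  BIC={results.bic:.1f}  adj. R²={results.rsquared_adj:.3f}")
```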
Interpretability Methods
These show association, not causation!
| Method | Scope | Key Insight |
|---|---|---|
| Permutation Importance | Global | Feature importance by performance drop |
| SHAP Values | Global + Local | Additive feature contributions |
| LIME | Local | Local surrogate model explanations |
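A minimal sketch of global importance via scikit-learn's permutation_importance (SHAP and LIME need their own packages and are omitted here); the model and data are placeholders:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

# Global importance = drop in validation score when each feature is shuffled
result = permutation_importance(model, X_val, y_val, n_repeats=10, random_state=0)
for i, (mean, std) in enumerate(zip(result.importances_mean, result.importances_std)):
    print(f"feature {i}: {mean:.3f} ± {std:.3f}")
```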
Top 10 Common Pitfalls
- Using Accuracy on imbalanced data
- Ignoring false negatives in medical/safety applications
- Optimizing only Precision or Recall (neglecting the other)
- Treating p-value as probability H₀ is true
- Not correcting for multiple testing
- Comparing R² across different datasets
- Using RMSE when outliers are unimportant
- Interpreting SHAP/LIME as causal effects
- Selecting models based on tiny metric differences
- Ignoring confidence intervals for metrics
Quick Decision Checklist
Before choosing metrics
- What’s the business objective?
- Balanced or imbalanced data?
- Cost of false positives vs false negatives?
- Need probability rankings or binary decisions?
- Require interpretability?
After model evaluation
- Check multiple metrics (not just one)
- Examine confusion matrix / error patterns
- Validate on holdout / test set
- Consider confidence intervals
- Compare to reasonable baselines
Essential Python Snippets
Classification Metrics
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
acc = accuracy_score(y_true, y_pred)
prec = precision_score(y_true, y_pred)
rec = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
roc_auc = roc_auc_score(y_true, y_proba)
Regression Metrics
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
mae = mean_absolute_error(y_true, y_pred)
rmse = mean_squared_error(y_true, y_pred, squared=False)  # newer scikit-learn versions: use root_mean_squared_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)
Statistical Tests
from scipy import stats
# t-test (two independent groups, equal variance assumed)
t_stat, p_value = stats.ttest_ind(group1, group2)
# Chi-square test of independence for a contingency table
chi2, p_chi, dof, expected = stats.chi2_contingency(contingency_table)
# Kolmogorov–Smirnov:
# 1-sample KS: does the sample follow a normal distribution?
ks_stat, p_ks = stats.kstest(sample, 'norm')
# 2-sample KS: are two samples from the same distribution?
ks2_stat, p_ks2 = stats.ks_2samp(sample1, sample2)
# Levene’s test for equality of variances
lev_stat, p_lev = stats.levene(group1, group2, group3)
Key Takeaways
- No single metric tells the whole story → always use multiple.
- Context is everything → choose metrics aligned with business goals.
- Visualize → confusion matrices, ROC/PR curves, residual plots.
- Uncertainty matters → report confidence intervals, not just point estimates.
- Baseline comparison → always compare to simple benchmarks.
Based on the “Data Science Model Evaluation Metrics Dashboard” condensed sheet.
How to Read the Color Bars
The colors indicate the qualitative interpretation of a metric's value, following a consistent rule: green always marks the desirable end of a metric's range.
Key Context: "Desirable" depends on context. For a p-value testing an effect (e.g., "is there a difference?"), a low value (significant) is good (green). For a p-value checking an assumption (e.g., "are residuals normal?"), a high value (no violation) is good (green). Correlation bars show strength (absolute value) from weak to strong.
Performance (Classification)
| Metric | Decision Criterion (Value Range) | Purpose | Description | Working Mechanism | Example | Limitations |
|---|---|---|---|---|---|---|
| Accuracy | Aim as high as possible. < 0.5 = poor overall performance. 0.5–0.7 = fair (better than chance but weak). 0.7–0.8 = good. > 0.8 = very good (relative to chance level). | Gauge overall classification success rate. Useful for quick assessment on balanced data, but unreliable alone on strongly imbalanced data. | Proportion of all predictions that are correct: \[ \text{Accuracy} = \frac{TP + TN}{TP+FP+TN+FN}. \] | Counts correct vs total predictions with all errors weighted equally. | 90 correct predictions out of 100 → accuracy = 0.90. | Can be very misleading for imbalanced data – a majority‑class predictor can have high accuracy but be useless. |
| Precision (Positive Predictive Value, PPV) | Higher is better. < 0.5 = more than half of predicted positives are wrong. 0.5–0.7 = fair. 0.7–0.8 = good. > 0.8 (especially > 0.9) = very few false positives. | Measures how reliable positive predictions are, crucial when false positives are costly. | \[ \text{Precision} = \frac{TP}{TP+FP}. \] | Improves as false positives decrease, even if recall suffers. | If 100 flagged spam emails contain 90 real spam, precision = 0.90. | Can be gamed by predicting very few positives; must be balanced with recall. |
| Recall (Sensitivity / True Positive Rate) | Higher is better. < 0.5 = more than half of positives are missed. 0.5–0.7 = fair coverage. 0.7–0.8 = good. > 0.8 (especially > 0.9) = very high detection. | Measures completeness of positive detection; key when missing positives is costly. | \[ \text{Recall} = \frac{TP}{TP+FN}. \] | Improves when false negatives decrease. | Recall = 0.95 means 95% of true positives are found. | Ignores false positives; predicting everything positive yields recall 1.0. |
| Specificity (True Negative Rate) | Higher is better. < 0.5 = majority of negatives misclassified. 0.5–0.7 = fair. 0.7–0.8 = good. > 0.8 = very low false‑alarm rate. | Measures ability to correctly identify negatives; important when false positives are costly. | \[ \text{Specificity} = \frac{TN}{TN+FP}. \] | Complement of false positive rate: FPR = 1 − specificity. | Specificity 0.98 means 98% of real negatives are correctly left unflagged. | Trivially high if model predicts almost everything negative. |
| False Positive Rate / False Negative Rate (FPR / FNR) | Lower is better for both. < 0.05 on either side = very low error rate there. ≈ 0.10–0.20 = modest error rate. > 0.30 = high error rate; typically unacceptable. | Quantify false alarms (FPR) and missed positives (FNR). | \[ \text{FPR} = \frac{FP}{FP+TN}, \quad \text{FNR} = \frac{FN}{FN+TP}. \] | Threshold choice trades FPR vs FNR; ROC and PR curves illustrate this trade‑off. | A cancer test may tolerate FPR 0.15 for FNR 0.01 (very few missed cases). | Need domain‑specific cost trade‑offs; no single “correct” target. |
| F1 Score | Higher is better. < 0.5 = poor precision–recall balance. 0.5–0.8 = acceptable to good. > 0.8 = strong overall performance. | Single number that balances precision and recall, especially on imbalanced datasets. | \[ F_1 = 2 \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}. \] | Harmonic mean, so dominated by the smaller of precision and recall. | Precision 0.9, recall 0.9 → F1 0.9; precision 0.9, recall 0.1 → F1 ≈ 0.18. | Weights precision and recall equally; may not match domain priorities. |
| ROC AUC | < 0.7 = weak discrimination. 0.7–0.8 = moderate. > 0.8 = good; 0.9+ often excellent. | Measures how well model ranks positives above negatives across thresholds. | Equivalent to probability a random positive has higher score than a random negative. | Integrates TPR vs FPR curve over all thresholds. | AUC 0.85 → model ranks positives higher 85% of the time. | Doesn’t reflect calibration; can look good on heavily imbalanced data even if practical performance is poor. |
| Negative Predictive Value (NPV) | Higher is better. < 0.5 = “negative” predictions often wrong. 0.5–0.8 = moderate trust. > 0.8 = “all clear” predictions usually correct. | Probability that a predicted negative is truly negative; key when false negatives are costly. | \[ \text{NPV} = \frac{TN}{TN+FN}. \] | Complements precision (PPV). | In disease screening with low prevalence, NPV often very high even for moderate models. | Strongly dependent on prevalence; high NPV doesn’t automatically imply a good model. |
| Balanced Accuracy | 0.5 = chance‑level performance (binary). 0.6–0.7 = modest improvement over chance. > 0.7 = good performance on both classes. | Accounts for class imbalance by averaging recall over classes. | \[ \text{Balanced Acc} = \frac{\text{TPR} + \text{TNR}}{2}. \] | Penalises models that perform well on majority but poorly on minority class. | TPR 0.9, TNR 0.5 → balanced accuracy 0.7. | Still hides which side is weak; always inspect TPR and TNR separately. |
| Brier Score | Lower is better. 0–0.05 = excellent probabilistic predictions. 0.05–0.25 = reasonable calibration / accuracy. > 0.25 = poor (similar to random or worse). | Measures quality of probabilistic predictions for binary outcomes. | \[ \text{Brier} = \frac{1}{N} \sum (\hat{p}_i - y_i)^2. \] | Combines calibration and sharpness; penalises confident wrong predictions. | A model always predicting p=0.5 on a balanced dataset has Brier 0.25. | Absolute values need baseline for interpretation; doesn’t distinguish where (in p‑space) miscalibration occurs. |
| Calibration Error (ECE) | Lower is better. 0–0.02 = very well calibrated. 0.02–0.10 = minor issues. > 0.10 = noticeable miscalibration. | Measures mismatch between predicted probabilities and observed frequencies. | Bins predictions by confidence; compares average predicted probability to empirical frequency in each bin, then averages absolute differences. | Low ECE means that “70% probability” events do happen around 70% of the time. | Used for risk models in credit, medicine, etc. as a complement to AUC. | Depends on binning; can be unstable with small sample sizes. |
| Confusion Matrix | Interpreted via the share of counts on the diagonal (correct) vs off‑diagonal (errors). Diagonal similar to off‑diagonal = many misclassifications. Diagonal somewhat dominant = moderate performance. Diagonal strongly dominant, off‑diagonals small = strong performance across classes. | Summarise classification results in terms of true positives, false positives, false negatives, and true negatives for each class. | Matrix whose rows are actual classes and columns are predicted classes (or vice versa); each cell counts how often that combination occurs. | From predictions and labels, fill contingency table; derived metrics like precision, recall, F1, MCC are computed from its cells. | A binary confusion matrix with large TP and TN and very small FP/FN indicates a strong classifier; see interactive explorer below. | Not a single scalar; becomes large for many classes and may be hard to read without normalisation. |
| Precision–Recall Curve (PR curve / PR AUC) | Interpret relative to baseline positive rate π. PR AUC ≈ π = little better than random ranking. Clearly above π but < 0.7 = modest improvement. Substantially above π (e.g. > 0.7) = useful on rare‑positive problems. | Assess trade‑off between precision and recall across thresholds, especially for highly imbalanced binary classification. | Curve of precision vs recall as the decision threshold moves; area under it (PR AUC) summarises overall performance on positives. | Emphasises performance on the positive class; random classifier’s baseline is the prevalence π rather than 0.5. | In fraud detection with 0.5% positives, PR AUC 0.65 is a huge improvement over baseline 0.005, even if ROC AUC is “only” moderate. | Baseline depends on prevalence; PR AUC numbers are not directly comparable across datasets with different class balance. |
| Cohen’s Kappa | Agreement beyond chance (−1 to 1). ≤ 0 = no better (or worse) than random. 0.01–0.40 = slight–fair agreement. 0.41–0.60 = moderate. > 0.60 = substantial to almost perfect agreement. | Measure inter‑rater reliability or agreement between model predictions and labels while adjusting for agreement expected by chance. | Compares observed accuracy \(p_o\) to expected accuracy \(p_e\) under random agreement given class marginals. | \[ \kappa = \frac{p_o - p_e}{1 - p_e}. \] Large κ implies agreement much higher than chance; κ close to or below 0 implies near‑random agreement. | Often used to compare two human labelers, or model vs clinician, in medical imaging or annotation tasks. | Sensitive to prevalence and marginal distributions; κ can be low even with high observed accuracy in imbalanced datasets. |
| Matthews Correlation Coefficient (MCC) | −1 ≤ MCC ≤ 1. ≤ 0 = random or worse than random. 0.1–0.3 = weak signal. 0.3–0.5 = moderate. > 0.5 = strong classifier, particularly on imbalanced data. | Single‑number summary of binary classifier quality that is symmetric in classes and robust to class imbalance. | Correlation between predicted and true labels using all four entries of the confusion matrix. | \[ \text{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}. \] | In a fraud‑detection setting with extreme imbalance, MCC distinguishes a useful classifier (e.g. 0.6) from one that just predicts the majority class (MCC ≈ 0). | Less intuitive than accuracy or F1; defined for binary classification and needs generalisation for multiclass problems. |
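A hedged sketch of how several of the metrics above are computed with scikit-learn; the labels and probabilities below are tiny made-up arrays purely for illustration:

```python
import numpy as np
from sklearn.metrics import (average_precision_score, balanced_accuracy_score,
                             brier_score_loss, cohen_kappa_score,
                             confusion_matrix, matthews_corrcoef)

# Tiny made-up example: true labels, predicted probabilities, hard predictions
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_proba = np.array([0.1, 0.2, 0.3, 0.2, 0.6, 0.4, 0.8, 0.7, 0.4, 0.9])
y_pred = (y_proba >= 0.5).astype(int)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TP, FP, TN, FN:", tp, fp, tn, fn)
print("PR AUC (average precision):", average_precision_score(y_true, y_proba))
print("Balanced accuracy:", balanced_accuracy_score(y_true, y_pred))
print("Brier score:", brier_score_loss(y_true, y_proba))
print("Cohen's kappa:", cohen_kappa_score(y_true, y_pred))
print("MCC:", matthews_corrcoef(y_true, y_pred))
```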
Interactive Confusion Matrix Explorer
Adjust TP / FP / FN / TN to see how the main metrics change (toy calculator only).
Survival / Time-to-Event Performance
Use these metrics when the outcome is a time until an event (death, relapse, failure, churn) and some observations are censored (we only know the event has not happened yet by the end of follow-up).
When to use
- Outcomes like “time from diagnosis to death”, “time to device failure”, “time until customer churn”.
- Many people are still event-free at the end of the study (right-censoring).
- We care about risk over time, not just a yes/no label at a fixed date.
Key tools & metrics
- Kaplan–Meier curve – step-shaped curve showing the fraction still event-free over time. Great for visualising survival patterns and comparing groups.
- Log-rank test – tests whether two or more Kaplan–Meier curves are systematically different over time.
- Cox proportional hazards model – regression model for time-to-event data that estimates hazard ratios for predictors.
- C-index (concordance index) – rank-based performance measure: probability that a person who experiences the event earlier gets a higher predicted risk. (1 = perfect, 0.5 = random.)
- Time-dependent ROC / AUC(t) – ROC-style discrimination at specific time points (e.g. 1-year, 5-year AUC).
- Brier score over time – mean squared error of predicted event probabilities at a given time horizon, adjusted for censoring. Lower is better.
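A minimal sketch of these tools using the lifelines package and its bundled Waltons toy dataset (both are assumptions beyond the original sheet, not part of the dashboard itself):

```python
from lifelines import CoxPHFitter, KaplanMeierFitter
from lifelines.datasets import load_waltons
from lifelines.statistics import logrank_test

df = load_waltons()                         # toy dataset: T (duration), E (event), group

# Kaplan–Meier curve for the whole cohort
kmf = KaplanMeierFitter().fit(df["T"], event_observed=df["E"])
print("median survival time:", kmf.median_survival_time_)

# Log-rank test between the two groups
a, b = df[df["group"] == "control"], df[df["group"] != "control"]
lr = logrank_test(a["T"], b["T"], event_observed_A=a["E"], event_observed_B=b["E"])
print("log-rank p-value:", lr.p_value)

# Cox proportional hazards model and its C-index
cox_df = df.assign(treated=(df["group"] != "control").astype(int)).drop(columns="group")
cph = CoxPHFitter().fit(cox_df, duration_col="T", event_col="E")
print("hazard ratio (treated):", cph.hazard_ratios_["treated"])
print("C-index:", cph.concordance_index_)
```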
Common pitfalls
- Ignoring censoring – treating censored cases as if the event never happened biases estimates. Always use survival-aware methods (Kaplan–Meier, Cox, etc.).
- Immortal-time bias – giving people “risk-free” time because they must survive long enough to receive a treatment or enter a group.
- Competing risks – if other events (e.g. death from another cause) prevent the event of interest, standard survival curves can overstate risk unless competing-risk methods are used.
- Too short follow-up – if few events occur, any metric (C-index, Brier, log-rank) will be noisy and underpowered.
Inference & Hypothesis Testing
| Metric / Test | Decision Criterion (Test Ranges) | Purpose | Description | Working Mechanism | Example | Limitations |
|---|---|---|---|---|---|---|
| p-value (Significance Level) | Typically compared to α = 0.05. p < 0.05 = statistically significant evidence against H₀ (if an effect is of interest). p ≥ 0.05 = no statistically significant evidence; cannot rule out H₀. | Quantifies evidence against the null hypothesis and provides a common decision rule across many tests. | Probability, assuming H₀ is true, of obtaining a result at least as extreme as observed. | Derived from a test statistic’s null distribution; if p < α, reject H₀. | p = 0.03 for a drug effect typically leads to rejecting “no effect” at α=0.05. | Does not measure effect size or importance; heavily sample‑size‑dependent and often misinterpreted. |
| Adjusted α (Bonferroni Correction) | For m tests, use α_adj = α / m. p < α_adj = significant even after strict multiple‑testing control. p ≥ α_adj = not significant after correction. This keeps family‑wise error rate ≈ α (e.g. 0.05). | Control family‑wise probability of any false positive across multiple tests. | Divides α by number of tests; each test uses α_adj as threshold. | Simple and conservative; strong control of false positives when tests are independent or mildly correlated. | 20 tests with α=0.05 → α_adj=0.0025; only very small p‑values survive. | Can be overly conservative for large m, greatly reducing power. |
| Benjamini–Hochberg FDR | Choose FDR q (e.g. 0.05). Sort p‑values p(1) ≤ … ≤ p(m), find largest k with p(k) ≤ (k/m)·q. Tests 1…k are called significant; expected fraction of false discoveries ≈ q. | Control expected proportion of false positives among declared discoveries. | Step‑up procedure comparing ordered p‑values to increasing thresholds (k/m)·q. | Less conservative than Bonferroni; more discoveries at the cost of some false positives. | Common in genomics / high‑dimensional feature screening. | Controls FDR only in expectation and under certain dependence structures. |
| Type I & Type II Error | α ≈ 0.01–0.05 = standard false‑positive risk. β ≈ 0.10–0.20 (power 80–90%) = typical design target. Very high α or β (> 0.10 or > 0.30) = too many wrong decisions. | Frame trade‑off between false positives (Type I) and false negatives (Type II) when designing tests and studies. | Type I: reject true H₀. Type II: fail to reject false H₀. Power = 1 − β. | Power analysis couples α, β, effect size, and sample size. | A clinical trial might fix α=0.025 (one‑sided) and β=0.10 (90% power). | Reducing α without increasing sample size generally increases β. |
| Statistical Power | < 0.5 = often misses real effects. 0.5–0.8 = moderate. > 0.8 = usually acceptable; > 0.9 even better. | Probability a test detects a true effect of given size. | Depends on α, effect size, variability, and sample size. | Higher n or larger effect sizes raise power. | Power 0.8 means 80% chance to detect the specified effect if it exists. | Post‑hoc power is usually uninformative; better to plan it a priori. |
| Z / t Tests | Two‑sided tests (large df): \|Z\| or \|t\| < 1 = little evidence. \|Z\| or \|t\| ≈ 2 = borderline (p ≈ 0.05). \|Z\| or \|t\| ≥ 3 = strong evidence against H₀. | Test if a mean / difference / regression coefficient differs from a null value. | Statistic = (estimate − null) / SE, compared against normal or t distribution. | Large \|statistic\| means estimate is many SE away from null. | \|t\| = 4 with df ≈ 50 usually implies p < 0.001 (strong evidence). | Assumes approximate normality and independence; sensitive to outliers. |
| Chi-Square Test (χ²) | χ² near 0 = data close to H₀ (little evidence of association/effect). χ² around χ²crit = borderline significance. χ² well above χ²crit = strong evidence of association / lack of fit to H₀. | Test independence (contingency tables) or goodness‑of‑fit of categorical data to expected counts. | \[ χ^2 = \sum \frac{(O_i - E_i)^2}{E_i}. \] | Large χ² means observed counts deviate strongly from expectations. | Used for testing independence between categorical variables or Mendelian ratios in genetics, etc. | With very large n, tiny practical differences become significant. Needs effect size (e.g. Cramér’s V) for magnitude. |
| ANOVA F-test | F ≈ 1 = no detected difference between means. F ≈ 2–3 = might be marginally significant (depends on df). F ≫ 1 (e.g. > 5) = strong evidence at least one group mean differs. | Test if means of 3+ groups are all equal vs at least one differs. | Compares between‑group variance to within‑group variance. | Large F implies between‑group differences are large relative to noise. | Follow significant F with post‑hoc tests to identify which groups differ. | Assumes normality and equal variances; does not identify specific groups or effect sizes by itself. |
| Confidence Interval (e.g. 95% CI) | 95% CI not containing 0 (difference) or 1 (ratio) → statistically significant at ≈ 5%. Narrow CI → precise estimate. Very wide CI → high uncertainty. | Provide a range of plausible values for a parameter. | Usually estimate ± critical value × standard error. | Reflects both effect size and uncertainty. | Difference 5 with 95% CI [2, 8] suggests a clearly positive but moderately uncertain effect. | Relies on model assumptions; often misinterpreted as containing the true value with 95% probability (frequentist CIs don’t strictly mean that). |
| Permutation Test | Let p_perm be the permutation p‑value. p_perm < 0.05 → statistic is in extreme tail of null distribution (significant effect). p_perm ≥ 0.05 → compatible with chance re‑labeling. | Distribution‑free significance test using label shuffling. | Permute labels many times; recompute statistic to get null distribution and p_perm. | Useful for complex statistics where analytic null distributions are hard to derive. | Accuracy 0.8 vs permutation null mean 0.5 with p_perm = 0.01 is strong evidence of real signal. | Computationally heavy; must respect structure (e.g. grouping or time). |
| Effect Size (Cohen’s d) | \|d\| < 0.2 = negligible. 0.2–0.5 = small. 0.5–0.8 = medium. > 0.8 = large effect. | Quantify standardized mean differences independent of sample size. | \[ d = \frac{\bar{x}_1 - \bar{x}_2}{s_{pooled}}. \] | Helps separate statistical from practical significance. | d = 0.6 for treatment vs control is often considered practically important. | Assumes similar SDs; thresholds are rough and context‑dependent. |
| Cliff’s Delta | \|δ\| < 0.147 = negligible. 0.147–0.33 = small. 0.33–0.474 = medium. > 0.474 = large effect (strong dominance). | Non‑parametric effect size based on ranks; robust to non‑normal data. | δ = P(X>Y) − P(Y>X), where X/Y come from the two groups; ranges −1..1. | Equivalent to rank‑biserial correlation; sign shows direction, magnitude shows strength. | δ = 0.5 means X exceeds Y in 75% of pairs, a very strong effect. | Summarises ordering, not magnitude of differences; can be less intuitive than differences in means. |
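A small sketch (plain numpy) of the two effect sizes at the bottom of the table; the samples are made up for illustration:

```python
import numpy as np

def cohens_d(x, y):
    """Standardized mean difference using a pooled SD."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * np.var(x, ddof=1) + (ny - 1) * np.var(y, ddof=1)) / (nx + ny - 2)
    return (np.mean(x) - np.mean(y)) / np.sqrt(pooled_var)

def cliffs_delta(x, y):
    """P(X > Y) - P(Y > X), estimated over all pairs."""
    diffs = np.asarray(x)[:, None] - np.asarray(y)[None, :]
    return (np.sum(diffs > 0) - np.sum(diffs < 0)) / diffs.size

rng = np.random.default_rng(0)
treatment = rng.normal(0.6, 1.0, 50)   # made-up samples
control = rng.normal(0.0, 1.0, 50)
print(f"Cohen's d = {cohens_d(treatment, control):.2f}, "
      f"Cliff's delta = {cliffs_delta(treatment, control):.2f}")
```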
Real‑Dataset Examples for Common Tests
- t‑test: compare mean blood pressure in treatment vs control groups in a clinical trial.
- Paired t‑test: before/after measurements of the same patients after an intervention.
- Chi‑square test: association between smoking status and lung‑disease incidence.
- ANOVA F‑test: compare average click‑through rate across several ad creatives.
- Permutation test: evaluate whether model accuracy exceeds chance by shuffling labels.
- KS test: detect distribution shift between training and production feature distributions.
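A sketch of the label-shuffling idea behind the permutation-test example above, applied here to a difference in group means with plain numpy (data are simulated placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
group1 = rng.normal(0.5, 1.0, 40)   # made-up data with a real mean difference
group2 = rng.normal(0.0, 1.0, 40)

observed = group1.mean() - group2.mean()
pooled = np.concatenate([group1, group2])

n_perm, extreme = 10_000, 0
for _ in range(n_perm):
    rng.shuffle(pooled)                                  # shuffle = re-label at random
    diff = pooled[:group1.size].mean() - pooled[group1.size:].mean()
    extreme += abs(diff) >= abs(observed)

p_perm = (extreme + 1) / (n_perm + 1)                    # add-one avoids p = 0
print(f"observed difference = {observed:.3f}, permutation p-value ≈ {p_perm:.4f}")
```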
Statistical Tests & Assumptions – Quick Reference
What each test is asking, when to use it, and what can go wrong.
Big picture
- Every test asks a question. “Are these means equal?”, “Are these variances equal?”, “Do these two distributions look the same?”, “Is there autocorrelation?”.
- p-value is not the effect size. A tiny p-value can correspond to a tiny, unimportant effect if the sample is huge.
- Assumptions matter. Many tests assume things like normal residuals, equal variances, or independent observations. If those are badly violated, the p-values can be misleading.
- Multiple testing inflates false positives. If you run many tests, you need FWER/FDR control (Bonferroni, Holm, BH).
Distribution shape & normality
These tests check whether data or residuals look like they come from a particular distribution (usually normal). They are sensitive to sample size and to outliers.
| Test | Main question | Typical use | Notes & pitfalls |
|---|---|---|---|
| Shapiro–Wilk | “Do these data look roughly normal?” | Small to medium samples (n < ~2000) | Powerful for normality; very sensitive to even small deviations in large samples. Always combine with plots (QQ-plot, histogram). |
| Kolmogorov–Smirnov (KS) | “Is the sample distribution different from a reference distribution (or from another sample)?” | Comparing one sample to a known distribution, or two independent samples. | Works for continuous data; most sensitive near the centre, less in the tails. With estimated parameters, classical p-values need corrections. |
| Anderson–Darling | “Do these data follow a given distribution, especially in the tails?” | Checking normality with more tail focus than KS. | Gives more weight to tails than KS. As with other tests, large n ⇒ tiny deviations become “significant”. |
Equality of variances
Many tests (for example classical t-test, ANOVA) assume similar variances across groups. These tests check that.
| Test | Main question | Data type | Notes & pitfalls |
|---|---|---|---|
| Levene’s test | “Do these groups have equal variances?” | Continuous outcome, groups categorical | More robust to non-normal data than classical tests. A small p-value suggests at least one group has a different variance. |
| Brown–Forsythe | Levene’s test variant using medians instead of means. | Continuous outcome, heavy tails or outliers | Even more robust when distributions are skewed. Often preferred if outliers are expected. |
Comparing means
These tests compare group averages. Non-parametric alternatives use ranks instead of assuming normality.
| Test | Main question | Design | Notes & pitfalls |
|---|---|---|---|
| t-test (independent) | “Are the means of two independent groups equal?” | Two groups, continuous outcome | Assumes normal residuals and (often) equal variances. For unequal variances, use Welch’s t-test. |
| t-test (paired) | “Is the mean difference between paired measurements zero?” | Before/after, matched pairs | Applied to differences. Assumes differences are roughly normal. |
| One-way ANOVA | “Are all group means equal?” | 3+ groups, continuous outcome | Global test; if significant, follow with post-hoc comparisons (and multiple-testing correction). Assumes normal residuals & equal variances. |
| Mann–Whitney U | “Do two groups differ in their typical values (medians/ranks)?” | Two independent groups, ranked/continuous outcome | Non-parametric alternative to the independent t-test. Tests distribution shift, not strictly medians. |
| Kruskal–Wallis | “Do 3+ groups differ in their distributions (ranks)?” | 3+ independent groups, ranked/continuous outcome | Non-parametric analogue of one-way ANOVA. If significant, follow with pairwise rank tests + multiple-testing correction. |
Categorical data & independence
These tests work on counts in contingency tables and ask whether patterns could be explained by chance alone.
| Test | Main question | Typical table | Notes & pitfalls |
|---|---|---|---|
| Chi-square test of independence | “Are two categorical variables independent?” | R × C contingency table (for example treatment × outcome) | Expected counts should not be too small (rules of thumb apply). Large samples make tiny deviations “significant”. |
| Chi-square goodness-of-fit | “Do observed category frequencies match a specified distribution?” | 1 × C table (observed vs expected counts) | Used to compare observed counts to a theoretical or historical pattern. |
| Fisher’s exact test | “Is there association in a 2×2 table?” | 2 × 2 table with small counts | Exact test, no large-sample approximation. Useful when any expected count is small (<5). |
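Hedged scipy.stats examples for several of the tests in the tables above; the group samples are simulated placeholders and the 2×2 counts are made up:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group1, group2, group3 = (rng.normal(loc, 1.0, 30) for loc in (0.0, 0.3, 0.6))

w_stat, p_sw = stats.shapiro(group1)                  # Shapiro–Wilk normality check
u_stat, p_mw = stats.mannwhitneyu(group1, group2)     # rank-based two-group comparison
h_stat, p_kw = stats.kruskal(group1, group2, group3)  # rank-based 3+ group comparison
odds, p_fisher = stats.fisher_exact([[8, 2], [1, 5]]) # exact test for a small 2×2 table

print(f"Shapiro p={p_sw:.3f}  Mann–Whitney p={p_mw:.3f}  "
      f"Kruskal–Wallis p={p_kw:.3f}  Fisher p={p_fisher:.3f}")
```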
Time-series residuals & autocorrelation
For time-ordered data, errors often correlate over time. These tests check whether residuals look “independent” or show systematic patterns.
| Test | Main question | Typical use | Notes & pitfalls |
|---|---|---|---|
| Durbin–Watson | “Is there first-order autocorrelation in regression residuals?” | Linear regression on time-ordered data | Values near 2 ≈ no autocorrelation; near 0 ≈ strong positive autocorrelation; near 4 ≈ strong negative. Not designed for complex time-series models. |
| Ljung–Box | “Are a set of autocorrelations jointly zero?” | Checking whether residuals from a time-series model look like white noise. | Tests several lags at once. A small p-value suggests remaining structure in residuals (model underfits dynamics). |
Summary: how to think about tests
- Always pair tests with plots. QQ-plots, residual plots and histograms often tell the story faster than p-values.
- Large samples detect tiny issues. A “significant” deviation may be practically irrelevant.
- Small samples lack power. A non-significant result does not prove that assumptions are perfect or effects are zero.
- Use non-parametric tests when normality or equal variances are clearly violated and sample sizes are moderate.
- Remember multiple testing. Running many tests on the same data requires FWER/FDR control, otherwise false positives accumulate fast.
Robustness & Resampling
Robustness Playground: Mean vs Median & Outliers
Type any numbers, then drag the outlier slider. Watch how the mean swings while the median and IQR stay more stable. This is what “robustness to outliers” looks like in practice.
Try: 1, 2, 3, 4, 5 then add an outlier like +30. The mean moves a lot; the median barely moves.
The "Robustness Playground" is a visual tool proving that the median is generally a better measure of the "typical" center for data that might contain errors or extreme values, because it resists the pull of outliers much better than the mean.
Robustness in statistics means that a measurement (like the median) remains relatively unchanged even if a few data points are very unusual or extreme (outliers).
Non-robust measurements (like the mean) are highly sensitive to these extreme values and can be pulled significantly in one direction.
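The same experiment in a few lines of numpy, using the numbers suggested above (1–5 plus an outlier of 30):

```python
import numpy as np

data = np.array([1, 2, 3, 4, 5])
with_outlier = np.append(data, 30)

print("mean:  ", data.mean(), "->", with_outlier.mean())          # 3.0 -> 7.5
print("median:", np.median(data), "->", np.median(with_outlier))  # 3.0 -> 3.5
q1, q3 = np.percentile(with_outlier, [25, 75])
print("IQR with outlier:", q3 - q1)
```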
Resampling Stability Playground (Bootstrap vs Jackknife)
This playground simulates a simple linear model with one true signal feature and one noise feature. Play with sample size, noise, and signal strength, then compare how a single fit, bootstrap, and jackknife disagree about the coefficient. Watch how CI width and sign stability change.
Look for: when noise is high or n is small, single fit can be very misleading. Bootstrap and jackknife show how uncertain the coefficient really is.
How to read this playground
This Resampling Stability Playground compares two resampling methods – Bootstrap and Jackknife – for a simple linear model. It answers: “How stable is my estimated coefficient under resampling?”
Purpose
The model is y = β₀ + β₁·x₁ + ε with one true signal feature (x₁) and one pure noise feature (x₂, true β₂ = 0). By changing sample size, noise level, and signal strength, you can see when estimates are stable vs. when they are fragile.
How to use the controls
- Sample size – more observations usually mean tighter intervals.
- Noise level – higher noise makes estimates wobble more.
- Signal strength – stronger β₁ is easier to detect reliably.
- Generate new data – redraws a fresh dataset and new resamples.
Reading the plot
- The vertical green dashed line shows the true β₁.
- The orange dot is the single-fit estimate on the full sample.
- The blue bar is the Bootstrap 95% CI for β₁.
- The purple bar is the Jackknife 95% CI for β₁.
Reading the table
- CI width – how wide the interval is (narrow = more precise).
- Sign stable – fraction of resamples that keep the same sign as the mean.
- Green “high stability” tags mean the method gives a tight CI and almost never flips sign; red “low stability” tags mean the estimate is fragile.
Use this to build intuition for robustness: when resampling methods agree and intervals are narrow, your inference is much safer than when everything jumps around.
High stability = coefficient keeps sign and tight CI.
Low stability = sign flips often / very wide CI.
| Metric / Method | Decision Criterion | Purpose | Description | Working Mechanism | Example | Limitations |
|---|---|---|---|---|---|---|
| Cross-Validation Mean & Std (k-fold CV) | Higher mean score = better average performance. Low std across folds = stable / robust model. High std across folds = performance sensitive to data split. | Estimate out‑of‑sample performance and stability. | Repeatedly train/test on different folds of the data. | Mean = expected performance; std = sensitivity to sample. | 0.88 ± 0.01 across folds is strong & stable; 0.90 ± 0.10 is unstable. | More expensive than simple train/test; must respect temporal / grouped structure. |
| Bias–Variance Tradeoff | High bias (underfit) → high error on train & test. Balanced bias–variance → low and similar train/test error. High variance (overfit) → very low train error, high test error. | Conceptual tool for selecting model complexity. | Simple models: high bias, low variance; complex models: low bias, high variance. | Analyse learning curves vs model capacity to find the sweet spot. | A deep tree that fits training perfectly but fails on test is high‑variance. | Not a single numeric statistic; patterns can be subtle for deep models. |
| Jackknife Variability | Low jackknife SE → estimator stable to leaving out single observations. Moderate SE → some sensitivity. High SE → highly sensitive to specific cases. | Assess estimator stability and approximate standard errors by leave‑one‑out recomputation. | Compute estimates on N datasets, each missing one observation; inspect spread. | Big changes when omitting particular points indicate influence. | If dropping any one observation barely changes a coefficient, the model is stable. | Less flexible than bootstrap for complex estimators; can be noisy for small N. |
| Bootstrap CI Width | Narrow bootstrap CI = precise estimate. Medium width = acceptable uncertainty. Very wide or irregular CI = high uncertainty / instability. | Non‑parametric uncertainty quantification for statistics and model parameters. | Resample with replacement, recompute the estimator, and use the empirical distribution. | Percentile or BCa intervals reflect sampling variability without assuming normality. | A narrow [4.1, 4.2] CI is very precise; [0, 20] shows extreme uncertainty. | Expensive for large models; assumes the sample is representative. |
| Feature Stability Across Resamples | < 0.5 = feature selected or important in < 50% of resamples (unstable). 0.5–0.8 = moderate stability. > 0.8 = highly stable across resamples. | Check whether discovered “important features” are robust to sampling variation. | Repeat feature selection / importance computation over many resamples and count how often each feature is selected as important. | Helps distinguish real signal from noise‑driven feature choices. | Stable features (e.g. 95/100 bootstrap samples) are more trustworthy than features selected rarely. | Correlated predictors may appear unstable individually; requires careful interpretation. |
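A compact sketch of the bootstrap and jackknife ideas from the table above, applied to a simple statistic (the mean) with numpy; the skewed sample is simulated as a placeholder:

```python
import numpy as np

rng = np.random.default_rng(0)
sample = rng.lognormal(mean=0.0, sigma=0.8, size=60)    # made-up skewed data

# Bootstrap: resample with replacement, percentile 95% CI for the mean
boot_means = np.array([rng.choice(sample, size=sample.size, replace=True).mean()
                       for _ in range(5000)])
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])

# Jackknife: leave-one-out estimates and the jackknife SE of the mean
n = sample.size
loo_means = np.array([np.delete(sample, i).mean() for i in range(n)])
jack_se = np.sqrt((n - 1) / n * np.sum((loo_means - loo_means.mean()) ** 2))

print(f"bootstrap 95% CI for the mean: [{ci_low:.2f}, {ci_high:.2f}]")
print(f"jackknife SE of the mean: {jack_se:.3f}")
```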
Regression & Correlation
| Metric | Decision Criterion | Purpose | Description | Working Mechanism | Example | Limitations |
|---|---|---|---|---|---|---|
| R² (Coefficient of Determination) | < 0.25 = weak fit. 0.25–0.5 = moderate fit. > 0.5 = strong fit. | Proportion of variance in response explained by the model. | \[ R^2 = 1 - \frac{SS_{res}}{SS_{tot}}. \] | Compares residual variance to variance around the mean. | R² = 0.85 → model explains 85% of variability in outcome. | Can be inflated by overfitting; does not imply causality or good out‑of‑sample performance. |
| Adjusted R² | Higher adjusted R² = better fit after penalising extra predictors. Small gains = marginal benefit of added predictors. Decrease when adding predictors = likely overfitting / noise variables. | Compare models with different numbers of predictors while penalising complexity. | Adjusts R² downward for each extra degree of freedom. | Only increases when added variables meaningfully reduce residual variance. | If R² rises but adjusted R² falls when adding variables, they’re probably not helpful. | Can’t compare across different datasets; still doesn’t guarantee predictive performance. |
| RMSE (Root Mean Square Error) | Lower RMSE = better (small typical errors). Similar to target SD = modest improvement over baseline. Close to or above target SD/range = weak predictive value. | Average magnitude of squared prediction errors, back in original units. | \[ \text{RMSE} = \sqrt{\frac{1}{n}\sum (y_i - \hat{y}_i)^2 }. \] | Squares errors (emphasising large ones), then square‑roots. | RMSE \$20k on houses with SD \$60k is good; \$55k is poor. | Highly sensitive to outliers; needs baseline to interpret magnitude. |
| MAE (Mean Absolute Error) | Lower MAE = predictions on average close to truth. Moderate fraction of target scale = acceptable. Large fraction of target scale = poor accuracy. | Average absolute prediction error; more robust than RMSE. | \[ \text{MAE} = \frac{1}{n}\sum \lvert y_i - \hat{y}_i \rvert. \] | Each error contributes linearly. | MAE of 1.5k on 20k prices ≈ 7.5% typical error (good). | Doesn’t strongly penalise rare huge errors. |
| MSE / RMSLE | Lower MSE/RMSLE = smaller errors overall. Intermediate values = moderate performance. High values = many large errors. | Alternative regression loss functions: MSE for squared errors, RMSLE for relative/log errors. | RMSLE uses a log1p transform, emphasising multiplicative errors. | Useful when underestimation of large values is particularly problematic or targets are highly skewed. | Popular in Kaggle competitions for positive‑only targets. | RMSLE can’t straightforwardly handle zeros/negatives; MSE is sensitive to outliers. |
| Pearson Correlation (r) | Interpreting \|r\|: < 0.3 = weak. 0.3–0.5 = moderate. > 0.7 = strong linear association. | Strength and direction of linear association between two numeric variables. | Standardised covariance: \[ r = \frac{\text{Cov}(X,Y)}{\sigma_X\sigma_Y}. \] | Ranges −1..1; sign gives direction, magnitude gives strength. | Height vs weight often r ≈ 0.7–0.8. | Very sensitive to outliers; misses non‑linear relationships. |
| Spearman Correlation (ρ) | Interpreting \|ρ\|: < 0.3 = weak monotonic association. 0.3–0.5 = moderate. > 0.7 = strong monotonic relationship. | Correlation on ranks; robust to outliers and non‑linear but monotonic trends. | Compute ranks of X and Y, then Pearson r on the ranks. | Captures relationships where one variable consistently increases/decreases with the other. | Useful for ordinal data or non‑linear monotonic relationships. | Near zero for non‑monotonic relationships (e.g. U‑shaped). |
| Partial Correlation | Interpreting \|r_partial\|: ≈ 0 = little remaining association beyond controls. ≈ 0.3–0.5 = moderate residual link. > 0.5 = strong association beyond controlled factors. | Measure association between two variables while controlling for others. | Regress each variable on the controls, then correlate the residuals. | Helps separate direct from indirect effects. | Correlation between exercise and blood pressure may shrink after controlling for age. | Only removes linear effects; can be unstable with many correlated controls. |
| Regression Coefficients + CI | CI not containing 0 → coefficient significantly different from 0. Narrow CI → precise effect estimate. Wide CI including 0 → weak or uncertain effect. | Interpret predictor effects and uncertainty in regression models. | Coefficients describe expected change in response for a unit change in predictor, holding others fixed; CI shows uncertainty. | Based on estimated SEs and t / normal critical values. | “Each extra year of experience adds \$2k (95% CI \$1.5k–\$2.5k)” is clear and interpretable. | Interpretation assumes correct model form and no severe multicollinearity. |
| Durbin–Watson Test | < 1.5 = likely positive autocorrelation (bad for OLS SEs). 1.5–2.5 = little evidence of serious autocorrelation. > 2.5 = possible negative autocorrelation. | Detect first‑order serial correlation in regression residuals. | DW ≈ 2(1 − ρ₁), where ρ₁ is the lag‑1 autocorrelation. | Values far from 2 suggest residual dependence. | DW = 0.9 suggests strong positive autocorrelation; consider time‑series models or GLS. | Primarily detects first‑order correlation; interpretation uses critical value tables or approximations. |
| Breusch–Pagan Test | p < 0.05 → reject homoscedasticity; heteroscedastic errors (bad for standard OLS SEs). p ≥ 0.05 → no strong evidence of non‑constant variance. | Test for heteroscedasticity (non‑constant variance) in regression. | Regress squared residuals on predictors; statistic ~χ² under constant variance. | Significant p suggests need for robust SEs, transforms, or alternate models. | Often used in linear regression diagnostics. | Power depends on auxiliary regression specification; may miss complex patterns. |
| Shapiro–Wilk Normality Test (on residuals) | Null: residuals are normal. p < 0.05 → reject normality (assumption violation). p ≥ 0.05 → no strong evidence against normality. | Check normality assumption for regression residuals. | Statistic W measures agreement between ordered residuals and expected normal order statistics. | Used with residual plots to assess the normality assumption for t‑based inference. | p = 0.4 suggests residual normality is acceptable; p = 0.001 suggests heavy tails / skew. | Very powerful for large n (tiny deviations flagged); doesn’t tell how residuals deviate. |
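A hedged sketch of the three residual diagnostics above on an OLS fit with statsmodels and scipy; the data are simulated placeholders:

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(0)
X = sm.add_constant(rng.normal(size=(200, 2)))
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=200)

results = sm.OLS(y, X).fit()
resid = results.resid

dw = durbin_watson(resid)                         # ≈ 2 → little autocorrelation
bp_stat, bp_p, _, _ = het_breuschpagan(resid, X)  # small p → heteroscedasticity
sw_stat, sw_p = stats.shapiro(resid)              # small p → non-normal residuals

print(f"Durbin–Watson={dw:.2f}  Breusch–Pagan p={bp_p:.3f}  Shapiro–Wilk p={sw_p:.3f}")
```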
Comparative Table – When to Use MAE vs RMSE
| Scenario | Prefer MAE | Prefer RMSE |
|---|---|---|
| Robustness to outliers | Yes – if occasional extreme errors are not critical. | No – RMSE will be dominated by a few large residuals. |
| Penalising large errors heavily | No – treats all deviations linearly. | Yes – squared errors heavily punish large mistakes. |
| Interpretability | “On average, we are off by …” is intuitive. | Less intuitive, but mathematically convenient for optimisation. |
| Gradient‑based optimisation | Non‑differentiable at 0 but workable. | Smooth and strongly convex; widely used loss. |
| Highly skewed targets | Sometimes combined with median‑based models. | May require log‑transform or RMSLE for stability. |
Multicollinearity Diagnostics
| Metric | Decision Criterion | Purpose | Description | Working Mechanism | Example | Limitations |
|---|---|---|---|---|---|---|
| Variance Inflation Factor (VIF) | VIF ≈ 1 → no multicollinearity. 5–10 = moderate concern. > 10 = serious multicollinearity. | Quantify how much the variance of a coefficient is inflated by linear dependence with other predictors. | \[ \text{VIF}_j = \frac{1}{1-R_j^2} \] where R_j² comes from regressing predictor j on the others. | Large R_j² → large VIF → unstable coefficient for that predictor. | VIF = 12 suggests the coefficient may be poorly estimated and highly sensitive to small data changes. | Doesn’t indicate which predictors are collinear with each other; only that some redundancy exists. |
| Condition Number | < 10 = low multicollinearity. 10–30 = moderate. > 30 = severe (near‑singular matrix). | Measure overall multicollinearity of the predictor matrix. | Ratio of largest to smallest singular value (or sqrt of eigenvalue ratio). | A large condition number means XᵀX is ill‑conditioned; coefficient estimates can be unstable. | Condition number ≈ 50 suggests strong collinearity somewhere among predictors. | Scaling affects the value; interpret with standardised predictors. Does not pinpoint which variables are problematic. |
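A sketch of both diagnostics with statsmodels and numpy on a small simulated design matrix (the variable names and data are placeholders):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
x1 = rng.normal(size=300)
X = pd.DataFrame({
    "x1": x1,
    "x2": x1 + rng.normal(scale=0.1, size=300),   # nearly a copy of x1 → collinear
    "x3": rng.normal(size=300),
})

X_const = sm.add_constant(X)                      # compute VIF with the intercept included
vifs = pd.Series(
    [variance_inflation_factor(X_const.values, i) for i in range(1, X_const.shape[1])],
    index=X.columns,
)
print(vifs)                                       # x1 and x2 should show large VIFs

X_std = (X - X.mean()) / X.std()                  # standardise before the condition number
print("condition number:", np.linalg.cond(X_std.values))
```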
Outlier & Distribution Metrics
| Metric/Test | Decision Criterion | Purpose | Description | Working Mechanism | Example | Limitations |
|---|---|---|---|---|---|---|
| Skewness (Distribution Asymmetry) | Between -1 and +1 = modest skew (often acceptable). Between -2 and -1 or 1 and 2 = moderate skew. < -2 or > 2 = strong skew; consider a transform or robust methods. | Quantify asymmetry of a distribution (left vs right tail). | Third standardized moment; sign indicates direction of the long tail. | Positive skew → long right tail; negative → long left tail. | Incomes are typically right‑skewed; log‑incomes are closer to symmetric. | Unstable in small samples; easily influenced by a few extreme points. |
| Kurtosis (Tailedness) | (Using excess kurtosis, normal ≈ 0.) ≈ 0 = tails similar to normal. Between -2 and -0.5 = somewhat light‑tailed. > 2 = heavy tails / many extreme values. | Describe how heavy the tails are compared to normal. | Fourth standardized moment minus 3. | High kurtosis indicates variance dominated by rare large deviations. | Financial returns often show high positive kurtosis. | Very sensitive to outliers; hard to interpret without skewness and plots. |
| Shapiro–Wilk Test (Normality test) | Null: data are normal. p < 0.05 → reject normality. p ≥ 0.05 → no strong evidence against normality. | Formal test of normality for small–moderate samples. | Statistic W measures agreement between ordered data and expected normal quantiles. | p-value derived from W; small p indicates deviation from normality. | p = 0.08 → normality acceptable; p = 0.001 → strong deviation. | Very sensitive with large n; use with plots and domain context. |
| Z-score Outlier Detection | Under normality: \|z\| < 2 = typical. 2 ≤ \|z\| ≤ 3 = borderline outlier. \|z\| > 3 = potential outlier. | Flag univariate outliers relative to mean and SD. | \[ z = \frac{x - \mu}{\sigma}. \] | Extremely large \|z\| values are unlikely under a normal model. | z = 5 is extremely unusual (probability < 10⁻⁶ under normality). | Assumes normality; heavy‑tailed data produce many \|z\| > 3 points that are not truly abnormal, and the mean/SD can be distorted by the outliers themselves. |
| IQR Method (Tukey’s Fences) | Within [Q1 − 1.5·IQR, Q3 + 1.5·IQR] = typical range. Outside 1.5·IQR but within 3·IQR = moderate outlier. Beyond the 3·IQR fences = extreme outlier. | Non‑parametric, robust rule of thumb for univariate outliers. | Uses quartiles and interquartile range (IQR = Q3 − Q1). | Points outside the whiskers in a boxplot correspond to Tukey outliers. | Common default rule in statistical software boxplots. | Skewed distributions may produce many flagged points; the rule is heuristic and dimension‑wise only. |
| Robust Z-score (MAD) | For robust z_MAD: \|z_MAD\| < 2.5 = typical. 2.5–3.5 = borderline. > 3.5 = strong outlier candidate. | Detect outliers robustly using the median and median absolute deviation. | Robust Z = (x − median) / (1.4826 · MAD), where MAD is median(\|x − median\|). | More stable than the classical z‑score in the presence of outliers. | Good for heavy‑tailed or skewed distributions where mean/SD are distorted. | Still assumes a roughly unimodal distribution; threshold choices are heuristic. |
| Cook’s Distance | < 0.5 = typically low influence. 0.5–1 = potentially influential, inspect. > 1 (especially much > 1) = highly influential point. | Measure the influence of each observation on the regression fit. | Combines leverage and residual size to approximate the change in all fitted values when a point is removed. | A large Cook’s D means the observation strongly affects estimates. | One point with D = 1.5 while others are < 0.1 indicates a dominating data point. | Thresholds are rules of thumb; influential points may be valid data, not necessarily errors. |
| Mahalanobis Distance | For p dimensions: MD² ≤ χ²(p, 0.975) = inside the main cloud. Between χ²(p, 0.975) and χ²(p, 0.99) = potential multivariate outlier. MD² > χ²(p, 0.99) = strong multivariate outlier. | Detect multivariate outliers accounting for correlations between variables. | \[ MD(x) = \sqrt{(x-\mu)^\top \Sigma^{-1} (x-\mu)}. \] Squaring MD gives a χ² statistic under multivariate normality. | Points with large MD² lie far from the multivariate mean in whitened space. | Useful for anomaly detection in multi‑feature settings. | Requires good estimates of μ and Σ; the classical covariance is itself distorted by outliers, so robust covariance estimators may be needed. |
| Kolmogorov–Smirnov Test (KS) | Two‑sample KS for distribution shift: p < 0.05 → distributions differ significantly (potential shift / mismatch). p ≥ 0.05 → no strong evidence of difference. Larger D (0–1) = stronger discrepancy. | Compare empirical distributions (e.g. train vs production) or a sample vs a theoretical distribution. | KS statistic D = sup \|F1(x) − F2(x)\| over x. | p‑value derived from D; sensitive to location and shape differences. | Useful in monitoring feature distribution drift over time. | More sensitive near the median than in the tails; assumes continuous data and independent samples. |
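A hedged sketch of the univariate rules and the Mahalanobis distance with numpy/scipy; all data here are simulated placeholders:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = np.append(rng.normal(0, 1, 100), [6.0, -5.5])      # made-up data with two outliers

z = (x - x.mean()) / x.std(ddof=1)                     # classical z-score
mad = np.median(np.abs(x - np.median(x)))
z_mad = (x - np.median(x)) / (1.4826 * mad)            # robust (MAD-based) z-score

q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
tukey = (x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)    # Tukey fences

print("flagged by |z| > 3:       ", int(np.sum(np.abs(z) > 3)))
print("flagged by |z_MAD| > 3.5: ", int(np.sum(np.abs(z_mad) > 3.5)))
print("flagged by Tukey fences:  ", int(np.sum(tukey)))

# Mahalanobis distance on a 2-D placeholder dataset
X = rng.multivariate_normal([0, 0], [[1, 0.6], [0.6, 1]], size=200)
diff = X - X.mean(axis=0)
inv_cov = np.linalg.inv(np.cov(X, rowvar=False))
md2 = np.einsum("ij,jk,ik->i", diff, inv_cov, diff)    # squared distances
cutoff = stats.chi2.ppf(0.99, df=X.shape[1])
print("multivariate outliers (MD² above the 99% χ² cutoff):", int(np.sum(md2 > cutoff)))
```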
Influence & Robustness Lab (Interactive Playgrounds)
These playgrounds show how single points, noise, and multiple comparisons can quietly break your models, even when headline metrics still look good.
Outlier Impact 2.0 – How one point can twist a regression line
How to read this playground
This playground shows how a single outlier can dramatically change an ordinary least-squares regression line.
Purpose
We simulate a simple linear model and compare two fits: one without the outlier (baseline) and one with the outlier. If a single point can flip the slope or change it a lot, your model is fragile.
How to use the controls
- Sample size – more points ⟹ harder to twist the line.
- Noise level – more noise hides the clean relationship.
- Outlier height – drag far up/down and watch the orange line tilt.
- Regenerate – new random base cloud with same settings.
Interpretation
- Blue dots: base data. Red dot: outlier. Blue line: fit without outlier.
- Orange line: fit including outlier.
- In the table, Δ slope and the tag fragile/stable tell you how influential the point is.
If one point can flip the conclusion, you don’t have a stable finding – you have an anecdote dressed up as a model.
Cook’s Distance & Leverage – Influence of a single high-leverage point
How to read this playground
Cook’s Distance combines two ideas: leverage (how unusual x is) and residual (how badly the point is fit). A point with both high leverage and large residual is highly influential.
What you see
- Blue dots: regular data used to fit the baseline regression line.
- Purple dot: candidate point at the chosen x-position and residual.
- The table shows its Cook’s Distance and a qualitative flag (OK vs influential).
Heuristics
- Rough rule: Cook’s D > 1 is clearly influential; D > 0.5 is worth attention.
- High leverage with tiny residual can still be dangerous if a future small mistake there would flip the slope.
Use this playground to feel that leverage (far-out x) without residual is mild, residual without leverage is local – but the combination gives large Cook’s Distance.
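A sketch of how Cook's distance and leverage are obtained from a fitted OLS model with statsmodels; the data (including the single bad point) are simulated placeholders:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 40)
y = 1.5 * x + rng.normal(0, 1, 40)
x[0], y[0] = 25.0, 0.0                       # one made-up high-leverage, badly-fit point

results = sm.OLS(y, sm.add_constant(x)).fit()
influence = results.get_influence()

cooks_d = influence.cooks_distance[0]        # one distance per observation
leverage = influence.hat_matrix_diag

worst = int(np.argmax(cooks_d))
print(f"max Cook's D = {cooks_d[worst]:.2f} at index {worst} "
      f"(leverage = {leverage[worst]:.2f})")
```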
Noise Injection Playground – How noise quietly kills stability
This is a pedagogical model: we don’t fit an actual classifier, but apply a simple response surface that mimics how noise hurts cross-validated performance.
How to read this playground
- Increase noise and watch train accuracy stay high while CV accuracy drops and variance inflates.
- Many irrelevant features increase overfitting pressure even if the core signal is unchanged.
- The stability index is a compact score combining CV accuracy and variability: low values mean your model is too dependent on random quirks.
Moral: always think in terms of “signal vs noise”. Robust models maintain high CV accuracy and low instability even when noise rises.
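The playground itself uses a simple response surface rather than a real classifier, but the qualitative effect is easy to reproduce with scikit-learn. The sketch below (synthetic data, arbitrary noise levels) flips a fraction of the labels and compares training accuracy with cross-validated accuracy:
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
rng = np.random.default_rng(0)
for flip_frac in (0.0, 0.2, 0.4):
    X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                               random_state=0)
    # Flip a fraction of the labels to simulate label noise
    flip = rng.random(y.shape[0]) < flip_frac
    y_noisy = np.where(flip, 1 - y, y)
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    cv_scores = cross_val_score(clf, X, y_noisy, cv=5)
    train_acc = clf.fit(X, y_noisy).score(X, y_noisy)
    print(f"label noise {flip_frac:.0%}: train acc {train_acc:.2f}, "
          f"CV acc {cv_scores.mean():.2f} ± {cv_scores.std():.2f}")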
Bonferroni & VIF – Why many tests and correlated features can fool you
In real projects we almost never test just one thing. We try many features, model variants, time points, segments, and outcomes. Every extra test is another chance to see a “significant” result that is actually just noise.
1. Multiple testing & Bonferroni
If you test one hypothesis at α = 0.05, there is a 5% chance of a
false positive (a Type I error). If you test many hypotheses at the
same threshold, the chance that at least one of them is a false positive
grows very quickly.
- 1 test at α = 0.05 → about 5% chance of a false positive.
- 100 independent tests at α = 0.05 → about 99% chance that at least one is “significant” just by luck.
What is family-wise error (FWE / FWER)?
Family-wise error (FWE), often called the family-wise error rate (FWER), is the probability of making at least one Type I error (false positive) when performing a “family” of statistical tests. When you run many tests, this probability increases. FWE / FWER is used to quantify and control this risk, usually by adjusting the significance level or using corrections such as Bonferroni or Holm.
What Bonferroni does
Bonferroni is a conservative safety brake:
New per-test α = original α ÷ number of tests.
Example: with α = 0.05 and 100 tests, the Bonferroni-corrected threshold becomes 0.05 ÷ 100 = 0.0005 for each test. This keeps the FWER near 5%, but makes it harder to detect real effects.
- Pros: very safe; strong control of FWER (false discoveries are rare).
- Cons: conservative; with many tests it can hide real signals.
The playground shows how FWER explodes with many tests when you don’t correct, and how Bonferroni pulls it back under control.
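The arithmetic behind this panel, assuming independent tests, as a quick sketch:
alpha = 0.05
for m in (1, 10, 100):
    fwer_uncorrected = 1 - (1 - alpha) ** m           # P(at least one false positive)
    alpha_bonferroni = alpha / m                       # Bonferroni per-test threshold
    fwer_corrected = 1 - (1 - alpha_bonferroni) ** m   # stays near the original alpha
    print(f"m={m:>3}: uncorrected FWER ≈ {fwer_uncorrected:.3f}, "
          f"Bonferroni α per test = {alpha_bonferroni:.4f}, "
          f"corrected FWER ≈ {fwer_corrected:.3f}")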
2. VIF – When features tell the same story
In the right-hand panel, we are not testing many hypotheses, but we are using many correlated features in a regression. VIF explains how this hurts the stability of your coefficients.
Intuition:
- The slider "Correlation between similar features" (ρ) says how strongly a group of features move together.
- The slider "# of similarly correlated features" (k) says how many features sit in that correlated pack.
VIF is defined as VIF = 1 / (1 − R²) when regressing one feature on the
others. It tells you how much the variance of a coefficient is
inflated because of collinearity.
- VIF ≈ 1 – almost no collinearity.
- VIF 5–10 – coefficients are unstable and hard to interpret.
- VIF > 10 – strong collinearity; rethink features or model design.
If ρ and k are high, the model cannot cleanly separate which feature carries the signal. Coefficients may swing wildly, flip sign, or become impossible to interpret, even though the underlying relationship hasn’t changed.
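A minimal sketch of the same computation with statsmodels’ variance_inflation_factor, using a synthetic correlated feature block (ρ and k below mirror the sliders, but the data and values are made up):
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
rng = np.random.default_rng(0)
n, k, rho = 500, 4, 0.9
# Build k features sharing a common latent factor → pairwise correlation ≈ rho
common = rng.normal(size=n)
X = np.column_stack([
    np.sqrt(rho) * common + np.sqrt(1 - rho) * rng.normal(size=n)
    for _ in range(k)
])
X = sm.add_constant(X)
# VIF for each (non-constant) column
for j in range(1, X.shape[1]):
    print(f"feature_{j}: VIF = {variance_inflation_factor(X, j):.1f}")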
3. Big picture
Both panels demonstrate the same core idea in different ways:
- Multiple testing – too many chances to “win” by noise → false discoveries.
- Collinearity – too many overlapping features → unstable estimates.
In short: noise + many decisions = unreliable statistics. The corrections and diagnostics you see here (Bonferroni, FWER, VIF) are tools to keep that under control.
Bonferroni & VIF Intuition – Multiple tests and collinearity
Bonferroni Correction – Family-wise error
Every extra test is another chance to see “significance” by luck. Bonferroni shrinks the per-test α so the family-wise error rate stays near your target (e.g. 5%).
Try m = 1 vs m = 100 with α = 0.05 and feel how uncorrected FWER explodes.
VIF Intuition – How correlation inflates variance
VIF tells you how much the variance of a coefficient is inflated by collinearity with other predictors.
- VIF ≈ 1 – almost no collinearity.
- VIF 5–10 – concerning; coefficients unstable and hard to interpret.
- VIF > 10 – strong collinearity, rethink your design/features.
Increase ρ or k and watch VIF explode while the underlying signal hasn’t changed at all.
Model Selection Criteria
| Metric | Decision Criterion | Purpose | Description | Working Mechanism | Example | Limitations |
|---|---|---|---|---|---|---|
| Akaike Information Criterion (AIC) | Compare ΔAIC relative to the best (lowest) model: ΔAIC < 2 = essentially equally good; 2–10 = some to strong evidence against; > 10 = much worse than the best model. | Trade off fit vs complexity for out‑of‑sample prediction quality. | \[ \text{AIC} = 2k - 2\log(L), \] k = parameters, L = likelihood. | Lower AIC is preferred among models fit to the same data/response. | Models with ΔAIC < 2 are often considered similarly plausible. | Asymptotic; for small n, AICc is preferable. Not directly interpretable in absolute terms. |
| Bayesian Information Criterion (BIC) | ΔBIC vs the best model: < 2 = weak evidence against; 2–6 = positive evidence against; 6–10 = strong; > 10 = very strong evidence against the model. | Penalise complexity more heavily, tending to pick more parsimonious models as n grows. | \[ \text{BIC} = k \log(n) - 2\log(L). \] | Approximates a Bayes factor under certain priors; lower is better. | BIC often selects a smaller subset of predictors than AIC in large samples. | Assumes the true model is in the candidate set; may underfit when prediction, not true‑model recovery, is the goal. |
| Mallows’ Cp | For a subset with p predictors, good models have Cp ≈ p+1: Cp − (p+1) ≈ 0 = balanced bias–variance; far above p+1 = underfitting (missing predictors); far below p+1 = possible overfitting. | Guide subset selection in linear regression. | Compares the residual SS of a subset model to the full model’s error variance. | Plot Cp vs p; models near the diagonal line Cp = p+1 are desirable. | Helps choose among many subset models with similar R². | Relies on the full model as reference; computationally expensive for large predictor sets without heuristics. |
| Cross‑validated Deviance / Log‑Loss | Lower is better: the lowest CV deviance among candidates = preferred model; differences < 1–2% are often negligible in practice; much larger deviance = clearly worse predictive model. | Directly compare predictive models using out‑of‑sample negative log‑likelihood. | Compute the mean deviance / log‑loss across CV folds. | Penalises confidently wrong predictions more strongly than the Brier score. | Standard for logistic / probabilistic models in ML competitions. | Can be dominated by rare but very miscalibrated regions; needs context for what constitutes a big improvement. |
| Regularisation Strength (λ) | Very small λ → essentially unregularised, risk of overfitting; λ chosen by CV → good bias–variance compromise; very large λ → coefficients shrunk too much (underfitting). | Control the complexity of models like ridge, lasso and elastic‑net. | Penalise large coefficients (L2) or enforce sparsity (L1) to improve generalisation. | Typically chosen via cross‑validated performance curves over λ. | L1 (lasso) can produce sparse models; L2 (ridge) stabilises coefficients under multicollinearity. | Interpretation depends on feature scaling; λ scales differ across algorithms and implementations. |
Cutting-Edge ML Metrics (2020–2025)
Modern machine-learning research uses metrics that go beyond classical AUC, RMSE, and p-values. These tools measure calibration quality, distribution shift, risk ranking, uncertainty, robustness, fairness, and high-dimensional model stability.
- Advanced Calibration Metrics (research → practice): ACE (Adaptive Calibration Error), TCE (Thresholded Calibration Error), Squared-ECE, Adaptive Reliability Error.
- Skill Scores (common in production): Brier Skill Score (BSS) — compares your model to a naïve baseline to give calibration improvement.
- Modern Ranking & Risk Metrics (research → practice): C-index, Somers’ D, IDI (Integrated Discrimination Improvement), NRI (Net Reclassification Improvement).
- Bayesian / Predictive Model Selection (research → practice): WAIC, PSIS-LOO, Expected Log Predictive Density (ELPD), Pareto-k diagnostics.
- Distribution Shift Diagnostics (research → practice): PSI (Population Stability Index), KL Divergence, Maximum Mean Discrepancy (MMD), Wasserstein Distance.
- Uncertainty Diagnostics (primarily research): Negative Log-Likelihood (NLL), epistemic vs aleatoric variance decomposition, Expected Sharpness.
- Robustness & Stability (primarily research): Model Stability Index (MSI), permutation effect-size curves, influence-function diagnostics.
- Fairness / Ethical ML Metrics (research → practice): Equalized Odds, Demographic Parity, Predictive Parity, Calibration Within Groups.
- Interaction-Aware Interpretability (primarily research): SHAP Interaction Values — a modern extension of SHAP for modelling non-linear pairwise effects.
- Leakage Diagnostics (primarily research): Target Leakage Test (permutation with shuffled labels), Cross-fold Correlation Leakage Index.
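Most of these metrics live in specialised libraries, but two of the simpler ones can be sketched directly. The bin scheme and baselines below are illustrative choices on synthetic data, not a standard:
import numpy as np
from sklearn.metrics import brier_score_loss
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)
y_prob = np.clip(y_true * 0.7 + rng.normal(scale=0.2, size=1000), 0.01, 0.99)
# Brier Skill Score: improvement over a naive constant-rate baseline
brier_model = brier_score_loss(y_true, y_prob)
brier_base = brier_score_loss(y_true, np.full_like(y_prob, y_true.mean()))
bss = 1 - brier_model / brier_base
print(f"Brier Skill Score: {bss:.3f}")   # > 0 means better than the baseline
# Population Stability Index between a "train" and a "production" score distribution
def psi(expected, actual, bins=10):
    # Bin by quantiles of the expected (e.g. training) distribution
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    e_cnt, _ = np.histogram(expected, bins=edges)
    a_cnt, _ = np.histogram(actual, bins=edges)
    e_pct = np.clip(e_cnt / e_cnt.sum(), 1e-6, None)
    a_pct = np.clip(a_cnt / a_cnt.sum(), 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))
train_scores = rng.beta(2, 5, size=5000)
prod_scores = rng.beta(2.5, 4, size=5000)    # simulated shift
print(f"PSI: {psi(train_scores, prod_scores):.3f}")  # common rule of thumb: > 0.25 = major shift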
Model Interpretability & Explainability
| Method / Metric | Decision Criterion | Purpose | Description | Working Mechanism | Example | Limitations |
|---|---|---|---|---|---|---|
| Permutation Feature Importance | Large performance drop when a feature is permuted = strong importance; near‑zero drop = little contribution (under the current model). | Global importance ranking for features in any black‑box model. | Measures how much a model’s performance metric worsens when the values of a feature are randomly permuted. | Break the relationship between a feature and the target by shuffling that feature; recompute performance and compare to the baseline. | Rank features by the drop in ROC AUC or RMSE in a tree ensemble to understand which inputs matter most. | Correlated features can share importance (each appears less important alone); requires many evaluations of the model, which may be expensive. |
| SHAP Values | Stable patterns that match domain knowledge are desirable; very large \|SHAP\| values indicate strongly influential features for a given prediction. | Provide theoretically grounded, additive feature attributions for individual predictions and global patterns. | Based on Shapley values from cooperative game theory; each feature gets a contribution to pushing the prediction away from a baseline. | Approximate each feature’s marginal contribution by averaging over many coalitions of features; specialised fast algorithms exist for tree‑based models. | Global SHAP summary plots highlight which variables drive model predictions overall; local plots explain single predictions (e.g. why a loan was rejected). | Computationally expensive for complex models without specialised approximations; explanations can be misread as causal rather than correlational. |
| LIME (Local Interpretable Model‑agnostic Explanations) | Good explanations are locally faithful (approximate the model well near the point of interest) and sparse enough to be interpretable. | Explain individual predictions by fitting a simple surrogate model around a local neighbourhood. | Samples points near the instance, queries the black‑box model, and fits an interpretable model (e.g. linear or small tree) to those local outputs. | Weights samples by proximity to the instance; the surrogate’s coefficients are reported as local feature importance. | Useful for debugging why a specific prediction was made, especially in regulated domains where explanations must be human‑readable. | The local surrogate may be unstable (different runs give different explanations); only valid near the point; can be misleading globally. |
Python Code Snippets (scikit‑learn / statsmodels / SHAP / LIME)
Classification Metrics & Confusion Matrix
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
# Generate dummy data for demonstration
X, y = np.random.rand(100, 5), np.random.randint(0, 2, 100)
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=42
)
# Train a simple classification model (e.g., Logistic Regression)
model = LogisticRegression(random_state=42)
model.fit(X_train, y_train)
# Use the held-out test set for the metric calculations below
y_true = y_test
X = X_test
from sklearn.metrics import (
confusion_matrix, accuracy_score, precision_score, recall_score,
f1_score, roc_auc_score, average_precision_score,
matthews_corrcoef, cohen_kappa_score
)
y_proba = model.predict_proba(X)[:, 1]
y_pred = (y_proba >= 0.5).astype(int)
cm = confusion_matrix(y_true, y_pred)
acc = accuracy_score(y_true, y_pred)
prec = precision_score(y_true, y_pred)
rec = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
roc_auc = roc_auc_score(y_true, y_proba)
pr_auc = average_precision_score(y_true, y_proba)
mcc = matthews_corrcoef(y_true, y_pred)
kappa = cohen_kappa_score(y_true, y_pred)
print("Confusion Matrix:\\n", cm)
print("Accuracy:", acc)
print("Precision:", prec)
print("Recall:", rec)
print("F1-score:", f1)
print("ROC AUC:", roc_auc)
print("PR AUC:", pr_auc)
print("Matthews Corr. Coeff.:", mcc)
print("Cohen's Kappa:", kappa)
Statistical Tests (t-test, ANOVA, chi-square)
import numpy as np
import pandas as pd
from scipy import stats
import statsmodels.api as sm
import statsmodels.formula.api as smf
# Sample data for two-sample t-test
group1 = np.random.rand(20) * 10
group2 = np.random.rand(25) * 10 + 2 # Slightly different mean
# two-sample t-test
t_stat, p_value = stats.ttest_ind(group1, group2, equal_var=False)
print(f"Two-sample t-test: t-statistic = {t_stat:.3f}, p-value = {p_value:.3f}")
# Sample data for chi-square test of independence
contingency_table = np.array([[10, 20], [15, 5]])
# chi-square test of independence
chi2, p, dof, expected = stats.chi2_contingency(contingency_table)
print(f"\nChi-square test: chi2 = {chi2:.3f}, p-value = {p:.3f}, dof = {dof}")
print(f"Expected frequencies:\n{expected}")
# Sample data for one-way ANOVA
data = {
'y': np.concatenate([
np.random.rand(10) * 5,
np.random.rand(10) * 5 + 2,
np.random.rand(10) * 5 + 1
]),
'group': ['A'] * 10 + ['B'] * 10 + ['C'] * 10
}
df = pd.DataFrame(data)
# one-way ANOVA with statsmodels
model = smf.ols('y ~ C(group)', data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)
print(f"\nOne-way ANOVA:\n{anova_table}")
Regression Metrics (MAE vs RMSE)
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split
import numpy as np
# Small synthetic regression problem so this snippet is self-contained
rng = np.random.default_rng(42)
X_reg = rng.normal(size=(200, 3))
y_reg = 2.0 * X_reg[:, 0] - X_reg[:, 1] + rng.normal(scale=0.5, size=200)
X_tr, X_te, y_tr, y_te = train_test_split(X_reg, y_reg, test_size=0.3, random_state=42)
reg = LinearRegression().fit(X_tr, y_tr)
y_pred_reg = reg.predict(X_te)
mae = mean_absolute_error(y_te, y_pred_reg)
rmse = np.sqrt(mean_squared_error(y_te, y_pred_reg))
print(f"Mean Absolute Error (MAE): {mae:.4f}")
print(f"Root Mean Squared Error (RMSE): {rmse:.4f}")
SHAP / LIME Basics
import shap
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from lime.lime_tabular import LimeTabularExplainer
import numpy as np
import pandas as pd
# Create a dummy binary target for X (this snippet reuses X from the classification cell above)
y_binary = (X[:, 0] > np.mean(X[:, 0])).astype(int)  # binarise on the first feature
# Wrap X in DataFrames so SHAP and LIME can report feature names
# (for brevity the same data serve as "train" and "test" here)
X_train_df = pd.DataFrame(X, columns=[f'feature_{i}' for i in range(X.shape[1])])
X_test_df = X_train_df.copy()
# Train a logistic regression as the linear model to be explained
model = LogisticRegression(solver='liblinear')
model.fit(X_train_df, y_binary)
# SHAP for the Linear model
explainer = shap.LinearExplainer(model, X_train_df)
shap_values = explainer.shap_values(X_test_df)
# Summary plot
shap.summary_plot(shap_values, X_test_df)
# LIME explainer
explainer = LimeTabularExplainer(
training_data=X_train_df.values,
feature_names=X_train_df.columns,
class_names=['negative', 'positive'],
mode='classification'
)
i = 0
exp = explainer.explain_instance(
X_test_df.iloc[i].values,
model.predict_proba,
num_features=5
)
exp.show_in_notebook()
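Permutation Feature Importance
A minimal sketch with scikit-learn’s permutation_importance, matching the interpretability table above (synthetic data, arbitrary hyperparameters):
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
X_demo, y_demo = make_classification(n_samples=500, n_features=8, n_informative=3,
                                     random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
# Drop in ROC AUC when each feature is shuffled on held-out data
result = permutation_importance(clf, X_te, y_te, scoring='roc_auc',
                                n_repeats=10, random_state=0)
for i in result.importances_mean.argsort()[::-1]:
    print(f"feature_{i}: mean drop = {result.importances_mean[i]:.3f} "
          f"± {result.importances_std[i]:.3f}")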
Limits of Models, Metrics & Science
This dashboard summarises many of the tools we currently use to make sense of data – from classical statistics to modern ML metrics. They are powerful, but they are not the whole story.
Gödel’s incompleteness theorems show that any consistent formal system rich enough to express basic arithmetic will contain true statements that cannot be proven within that system: no such system can be both complete and consistent.
By analogy, no modelling framework or collection of metrics can ever fully capture the processes we study. We always work with:
- finite, noisy, and biased data
- simplified models of complex systems
- metrics that highlight some aspects while ignoring others
- assumptions that are never perfectly satisfied
The goal here is not to pretend that AUC, RMSE, p-values, or any “cutting-edge” metric delivers final truth. Instead, this dashboard makes our tools explicit – showing where they are informative, where they are fragile, and where they leave important questions unanswered.
In other words: this is a map, not the territory. The map is useful, and it keeps improving – but if we ever forget that it is incomplete, we stop doing science.
Thanks for your attention.
— manu
Email: x34mev@proton.me • GitHub: https://github.com/slashennui