Statistical Analysis
All statistical analyses are implemented in pipeline/analysis/statistics.py and run via scripts/04_run_analysis.py.
Primary Comparison: McNemar's Test
Why McNemar? Each model answers the same 200 questions, creating a natural paired design. McNemar's test is the correct paired binary test for this structure — it asks: "Does model A get questions right that model B gets wrong (and vice versa), more often than chance?"
How it works: For each model pair (A, B), we build a 2×2 discordant-pairs table:
| | B Correct | B Wrong |
|---|---|---|
| A Correct | (ignored) | b |
| A Wrong | c | (ignored) |
The McNemar statistic tests the null hypothesis that b = c, i.e., that the two models are equally likely to win a discordant pair. We use the exact binomial version (statsmodels.stats.contingency_tables.mcnemar(exact=True)) for robustness with small discordant-cell counts.
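As a minimal sketch of this procedure (with made-up correctness vectors standing in for two models' per-question results), the 2×2 table can be built and passed to statsmodels directly:

```python
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

# Hypothetical per-question correctness for two models, paired on the
# same 200 questions; 1 = correct, 0 = wrong.
rng = np.random.default_rng(0)
a_correct = rng.integers(0, 2, size=200)
b_correct = rng.integers(0, 2, size=200)

# 2x2 table: rows = model A (correct, wrong), cols = model B (correct, wrong).
table = np.array([
    [np.sum((a_correct == 1) & (b_correct == 1)),   # both correct (ignored)
     np.sum((a_correct == 1) & (b_correct == 0))],  # b: A right, B wrong
    [np.sum((a_correct == 0) & (b_correct == 1)),   # c: A wrong, B right
     np.sum((a_correct == 0) & (b_correct == 0))],  # both wrong (ignored)
])

# Exact binomial McNemar test on the discordant cells b and c.
result = mcnemar(table, exact=True)
print(result.statistic, result.pvalue)
```

Only the off-diagonal cells b and c enter the test; the diagonal (questions both models got right or both got wrong) carries no information about which model is better.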
Correction for multiple comparisons: With 8 models, there are $\binom{8}{2} = 28$ pairwise tests. We apply the Holm-Bonferroni step-down procedure, which is uniformly more powerful than the Bonferroni correction while still controlling the family-wise error rate at α = 0.05.
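The step-down correction is available off the shelf in statsmodels; a sketch with hypothetical unadjusted p-values for the 28 pairwise tests:

```python
from statsmodels.stats.multitest import multipletests

# Hypothetical unadjusted p-values from the 28 pairwise McNemar tests.
raw_p = [0.001, 0.004, 0.03, 0.05, 0.20, 0.44, 0.76] + [0.9] * 21

# Holm-Bonferroni step-down at family-wise alpha = 0.05.
reject, p_adjusted, _, _ = multipletests(raw_p, alpha=0.05, method="holm")
```

Holm sorts the p-values, compares the smallest against α/28, the next against α/27, and so on, stopping at the first failure; with these inputs only the smallest p-value (0.001 × 28 = 0.028) survives.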
Effect Size: Cohen's h
For pairwise accuracy differences, we report Cohen's h:
$$h = 2\arcsin(\sqrt{p_1}) - 2\arcsin(\sqrt{p_2})$$
Interpretation guidelines (Cohen, 1988):
| \|h\| | Interpretation |
|-------|----------------|
| 0.20 | Small effect |
| 0.50 | Medium effect |
| 0.80 | Large effect |
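The formula above is a one-liner; a sketch with illustrative accuracies (0.85 vs. 0.75, not real results):

```python
import math

def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h: difference between arcsine-transformed proportions."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))

# e.g. two models with accuracies 0.85 and 0.75
h = cohens_h(0.85, 0.75)
```

The arcsine transform stabilizes the variance of proportions, so a given h means roughly the same thing whether the accuracies sit near 0.5 or near the extremes.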
Subgroup Analysis: Chi-Square
Within each model, we test whether accuracy varies significantly across subgroups (e.g., medical subject or difficulty level) using a chi-square test of independence on the accuracy-by-subgroup contingency table.
```python
from pipeline.analysis.statistics import subgroup_chi_square

# Test whether subject explains accuracy variance for each model
df = subgroup_chi_square(results, groupby_col="subject")
```
IMG Gap: Fisher's Exact Test
For the IMG vs. non-IMG accuracy comparison, we use Fisher's exact test rather than chi-square because cell counts for IMG-relevant questions may be small. The test is two-tailed.
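A sketch of the test with hypothetical counts for a single model (rows are question groups, columns are correct/wrong):

```python
from scipy.stats import fisher_exact

# Hypothetical counts: rows = (IMG-relevant, non-IMG),
# cols = (correct, wrong).
table = [[12, 8],     # IMG-relevant: 12 correct, 8 wrong
         [150, 30]]   # non-IMG: 150 correct, 30 wrong

# Two-tailed Fisher's exact test on the 2x2 table.
odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
```

Unlike chi-square, Fisher's exact test computes the p-value from the hypergeometric distribution directly, so it remains valid when an expected cell count drops below 5.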
Confidence Intervals
All accuracy estimates are reported with 95% normal-approximation (Wald) confidence intervals:
$$\text{CI} = \hat{p} \pm 1.96 \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$
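A minimal sketch of this interval (the clamping to [0, 1] is an assumption of this sketch, not necessarily what the pipeline does):

```python
import math

def wald_ci(correct: int, n: int, z: float = 1.96):
    """95% normal-approximation (Wald) CI for a proportion."""
    p_hat = correct / n
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
    # Clamp to the valid range for a proportion.
    return max(0.0, p_hat - half_width), min(1.0, p_hat + half_width)

# e.g. 160 correct out of 200 questions
lower, upper = wald_ci(160, 200)
```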
Reading the Output CSVs
After running scripts/04_run_analysis.py, the CSVs in evaluation/results/ contain:
accuracy_by_model.csv
| Column | Description |
|---|---|
| group | Model name |
| n | Total questions answered |
| correct | Number correct |
| accuracy | Proportion correct |
| ci_lower / ci_upper | 95% CI bounds |
mcnemar_tests.csv
| Column | Description |
|---|---|
| model_a, model_b | Model pair |
| statistic | McNemar test statistic |
| p_value | Unadjusted p-value |
| p_adjusted | Holm-Bonferroni adjusted p-value |
| significant | Boolean: p_adjusted < 0.05 |
img_gap_analysis.csv
| Column | Description |
|---|---|
| model | Model name |
| img_accuracy | Accuracy on IMG-relevant questions |
| non_img_accuracy | Accuracy on non-IMG questions |
| accuracy_gap | non_img_accuracy - img_accuracy |
| p_value | Fisher's exact p-value |
| significant | Boolean: p_value < 0.05 |