Dataset

MedQA-USMLE

This project uses the MedQA-USMLE dataset in its four-option variant, available on HuggingFace as GBaker/MedQA-USMLE-4-options.

The dataset was originally introduced by Jin et al. (2021) and contains questions derived from publicly released USMLE practice materials. All questions are formatted as single-best-answer multiple-choice items with four options (A–D). The combined train, validation, and test splits contain approximately 12,700 questions.

Each question record includes:

Field       Description
question    The question stem (plain text)
options     Dict with keys A, B, C, D
answer_idx  Ground-truth answer letter
meta_info   Subject tag (not always populated)
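As an illustration, a record has roughly the following shape (the question below is invented for this example, not taken from the dataset):

```python
# Hypothetical MedQA-USMLE record, invented to illustrate the schema above.
record = {
    "question": (
        "A 45-year-old man presents with crushing chest pain. "
        "Which serum marker rises first?"
    ),
    "options": {"A": "Troponin I", "B": "Myoglobin", "C": "CK-MB", "D": "LDH"},
    "answer_idx": "B",
    "meta_info": "step1",
}
```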

Step Classification

Since MedQA does not provide an explicit Step 1 / Step 2 CK label, we infer the exam step with a keyword heuristic: questions containing management or clinical decision-making language ("next step", "most appropriate management", "hospitalized", "follow-up", etc.) are classified as Step 2 CK, and all others as Step 1. The medical co-author validated the heuristic on a random subsample, confirming reasonable face validity.
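The heuristic amounts to a substring check over a keyword list. A minimal sketch (the keyword list here is an illustrative subset, not the project's full list):

```python
# Illustrative subset of the Step 2 CK keyword list; the project may use more terms.
STEP2_KEYWORDS = (
    "next step",
    "most appropriate management",
    "hospitalized",
    "follow-up",
)

def classify_step(question_stem: str) -> str:
    """Return 'step2ck' if any management keyword appears in the stem, else 'step1'."""
    stem = question_stem.lower()
    return "step2ck" if any(kw in stem for kw in STEP2_KEYWORDS) else "step1"
```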

Stratified Sampling

We draw 100 questions per exam step (200 total) using the following stratification:

  1. By subject — Proportional allocation across up to 15 medical subjects.
  2. By difficulty — Balanced across easy / medium / hard within each subject. Difficulty is inferred from question length: <60 words → easy, 60–119 words → medium, ≥120 words → hard.
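The length-based difficulty binning can be sketched directly from the thresholds above (function name is ours, for illustration):

```python
def difficulty(question_stem: str) -> str:
    """Bin a question by stem length: <60 words easy, 60-119 medium, >=120 hard."""
    n_words = len(question_stem.split())
    if n_words < 60:
        return "easy"
    if n_words < 120:
        return "medium"
    return "hard"
```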

The sampling uses a fixed random seed (42) so that the same sample is produced on every run. Sampled questions are stored as versioned JSON files:

data/sampled/
├── step1_sample.json       # 100 Step 1 questions
├── step2ck_sample.json     # 100 Step 2 CK questions
└── sample_metadata.json    # Subject distribution, seed, strategy
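The reproducibility guarantee of the fixed seed can be sketched as follows. This is a simplified version of proportional allocation by subject (the difficulty balancing is omitted, and the function name is hypothetical):

```python
import random
from collections import defaultdict

def sample_questions(questions, n_total=100, seed=42):
    """Draw a fixed-seed sample, allocating slots proportionally by subject.

    Simplified sketch: real stratification would also balance difficulty
    within each subject.
    """
    by_subject = defaultdict(list)
    for q in questions:
        by_subject[q["subject"]].append(q)

    rng = random.Random(seed)  # fixed seed -> identical sample on every run
    sampled = []
    # Iterate subjects in sorted order so allocation is deterministic.
    for subject, pool in sorted(by_subject.items()):
        k = max(1, round(n_total * len(pool) / len(questions)))
        sampled.extend(rng.sample(pool, min(k, len(pool))))
    return sampled[:n_total]
```

Because the generator is seeded with a constant and subjects are visited in a deterministic order, two runs over the same input produce byte-identical samples.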

IMG Relevance Tagging

After sampling, questions are automatically tagged as img_relevant = True if any keyword from the US-centric lexicon appears in the question stem or answer options. See the IMG Perspective page and Appendix D of the paper for the full keyword list.
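The tagging is again a substring scan, this time over both the stem and the options. A sketch with an illustrative subset of the lexicon (the authoritative list is in Appendix D, not here):

```python
# Illustrative subset of the US-centric lexicon; see Appendix D for the full list.
US_LEXICON = ("medicare", "medicaid", "emtala", "hmo")

def tag_img_relevant(question: dict) -> bool:
    """True if any lexicon keyword appears in the stem or any answer option."""
    text = " ".join([question["question"], *question["options"].values()]).lower()
    return any(kw in text for kw in US_LEXICON)
```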

Inspecting the Sample

To view the sampled questions after running scripts/02_sample_questions.py:

# Summary
python -c "
import json
from pathlib import Path
qs = json.loads(Path('data/sampled/step1_sample.json').read_text())
img = sum(1 for q in qs if q['img_relevant'])
subjects = {}
for q in qs:
    subjects[q['subject']] = subjects.get(q['subject'], 0) + 1
print(f'Total: {len(qs)} | IMG-relevant: {img}')
for s, n in sorted(subjects.items(), key=lambda x: -x[1]):
    print(f'  {s}: {n}')
"

# View sampling metadata
cat data/sampled/sample_metadata.json

License

The MedQA dataset is licensed for research use. This project does not redistribute the raw dataset; it downloads it directly from HuggingFace at runtime via the datasets library.