# Models

## Evaluated Models
All eight models are evaluated under identical conditions: the same prompt template, temperature=0, max_tokens=512, and no system prompt.
| Model Key | Display Name | Provider | ~Parameters | Input $/1K | Output $/1K |
|---|---|---|---|---|---|
| gpt-4o | GPT-4o | OpenAI | Not disclosed | $0.005 | $0.015 |
| gpt-4o-mini | GPT-4o-mini | OpenAI | Not disclosed | $0.00015 | $0.0006 |
| claude-3-5-sonnet | Claude 3.5 Sonnet | Anthropic | Not disclosed | $0.003 | $0.015 |
| claude-3-haiku | Claude 3 Haiku | Anthropic | Not disclosed | $0.00025 | $0.00125 |
| gemini-1.5-pro | Gemini 1.5 Pro | Google | Not disclosed | $0.00125 | $0.005 |
| gemini-1.5-flash | Gemini 1.5 Flash | Google | Not disclosed | $0.000075 | $0.0003 |
| llama-3.3-70b | Llama 3.3 70B | Meta (via Groq) | 70B | Free | Free |
| deepseek-r1-70b | DeepSeek R1 Distill 70B | DeepSeek (via Groq) | 70B | Free | Free |
Prices reflect approximate rates at the time of study design. Check current provider pricing before running at scale.
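The per-1K rates above translate into run costs as a simple product of question count and token usage. A rough back-of-the-envelope sketch, where the per-question token counts (400 input, 300 output) are illustrative assumptions rather than measured values:

```python
# Per-1K-token rates from the table above: (input $/1K, output $/1K).
RATES = {
    "gpt-4o": (0.005, 0.015),
    "gpt-4o-mini": (0.00015, 0.0006),
    "claude-3-5-sonnet": (0.003, 0.015),
    "claude-3-haiku": (0.00025, 0.00125),
    "gemini-1.5-pro": (0.00125, 0.005),
    "gemini-1.5-flash": (0.000075, 0.0003),
    "llama-3.3-70b": (0.0, 0.0),      # free via Groq
    "deepseek-r1-70b": (0.0, 0.0),    # free via Groq
}

def estimate_cost(model_key, n_questions, in_tok=400, out_tok=300):
    """Estimated USD cost for a run, assuming fixed tokens per question."""
    rate_in, rate_out = RATES[model_key]
    return n_questions * (in_tok / 1000 * rate_in + out_tok / 1000 * rate_out)

print(f"${estimate_cost('gpt-4o', 1000):.2f}")  # → $6.50 for 1,000 questions
```

Re-running the same arithmetic with current provider rates is usually enough to budget a full evaluation.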
## Selecting Models for a Run
Use the `--models` flag in `scripts/03_run_evaluation.py`:
```bash
# Run all 8 models
python scripts/03_run_evaluation.py --models all --questions data/sampled/step1_sample.json

# Run a specific subset
python scripts/03_run_evaluation.py \
  --models gpt-4o,claude-3-5-sonnet,llama-3.3-70b \
  --questions data/sampled/step1_sample.json

# Run only free models (Groq-hosted)
python scripts/03_run_evaluation.py \
  --models llama-3.3-70b,deepseek-r1-70b \
  --questions data/sampled/step1_sample.json
```
## Adding a New Model
To add a model not in the current registry:
Step 1 — Add an entry to `MODELS` in `pipeline/config.py`:
```python
MODELS["my-new-model"] = {
    "provider": "openai",        # or "anthropic", "google", "groq"
    "model_id": "gpt-4-turbo",   # exact API model identifier
    "cost_per_1k_input": 0.01,
    "cost_per_1k_output": 0.03,
}
```
Step 2 — If the provider is new, create a file `pipeline/models/my_provider_model.py` that extends `LLMModel` from `pipeline/models/base.py` and implements the `answer_question()` method.
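As a hypothetical sketch of what such an adapter could look like: the `LLMModel` stand-in below (its constructor and the `answer_question()` signature) is an assumption based on the description above, so consult `pipeline/models/base.py` for the real contract before copying it.

```python
from abc import ABC, abstractmethod

# Stand-in for pipeline/models/base.py's LLMModel; the real base class
# may carry extra responsibilities (API keys, retries, cost tracking).
class LLMModel(ABC):
    def __init__(self, model_id: str):
        self.model_id = model_id

    @abstractmethod
    def answer_question(self, question: str) -> str:
        """Return the model's answer text for a single question."""

class MyProviderModel(LLMModel):
    """Hypothetical adapter for a new provider's API."""

    def answer_question(self, question: str) -> str:
        # A real implementation would call the provider's SDK here,
        # passing self.model_id and returning the response text.
        return f"[{self.model_id}] answer to: {question}"
```

The important part is that every adapter exposes the same `answer_question()` interface, so the evaluation loop stays provider-agnostic.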
Step 3 — Register it in `pipeline/models/registry.py`:
```python
from .my_provider_model import MyProviderModel

# Add to the provider dispatch dict:
"myprovider": MyProviderModel,
```
The new model will automatically be included in --models all runs and cost tracking.
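The automatic inclusion works because `--models all` simply expands to every key in the registry. A toy sketch of that expansion, with a made-up two-entry `MODELS` dict standing in for the real one in `pipeline/config.py`:

```python
# Toy stand-in for the MODELS registry in pipeline/config.py.
MODELS = {
    "gpt-4o": {"provider": "openai", "model_id": "gpt-4o"},
    "my-new-model": {"provider": "openai", "model_id": "gpt-4-turbo"},
}

def resolve_models(spec: str) -> list[str]:
    """Expand a --models value ('all' or comma-separated keys) to model keys."""
    if spec == "all":
        return list(MODELS)
    return [key.strip() for key in spec.split(",")]
```

Any entry added to `MODELS` is therefore picked up with no further changes to the run script.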
## Model Architecture Notes
- GPT-4o / GPT-4o-mini: OpenAI's latest generation multimodal models. Both support the standard chat completions API.
- Claude 3.5 Sonnet / Haiku: Anthropic's messages API. Sonnet is the most capable; Haiku is optimized for speed and cost.
- Gemini 1.5 Pro / Flash: Accessed via Google's `google-generativeai` SDK. Pro supports a 1M-token context window; Flash is optimized for latency.
- Llama 3.3 70B: Meta's open-weight model, served via Groq's LPU inference hardware. Uses the same chat completions API interface as OpenAI.
- DeepSeek R1 Distill Llama 70B: A reasoning-focused model distilled from DeepSeek R1 into the Llama 70B architecture. Also served via Groq. Expected to produce longer, more deliberate reasoning chains.