
Models

Evaluated Models

All eight models are evaluated under identical conditions: the same prompt template, temperature=0, max_tokens=512, and no system prompt.

Model Key           Display Name               Provider              ~Parameters     Input $/1K   Output $/1K
gpt-4o              GPT-4o                     OpenAI                Not disclosed   $0.005       $0.015
gpt-4o-mini         GPT-4o-mini                OpenAI                Not disclosed   $0.00015     $0.0006
claude-3-5-sonnet   Claude 3.5 Sonnet          Anthropic             Not disclosed   $0.003       $0.015
claude-3-haiku      Claude 3 Haiku             Anthropic             Not disclosed   $0.00025     $0.00125
gemini-1.5-pro      Gemini 1.5 Pro             Google                Not disclosed   $0.00125     $0.005
gemini-1.5-flash    Gemini 1.5 Flash           Google                Not disclosed   $0.000075    $0.0003
llama-3.3-70b       Llama 3.3 70B              Meta (via Groq)       70B             Free         Free
deepseek-r1-70b     DeepSeek R1 Distill 70B    DeepSeek (via Groq)   70B             Free         Free

Prices reflect approximate rates at the time of study design. Check current provider pricing before running at scale.
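The per-token rates above can be turned into a rough budget before a run. The sketch below is only a back-of-the-envelope helper: the function name and the per-question token counts (400 input, 256 output) are illustrative assumptions, not values from the pipeline.

```python
# Per-1K-token (input, output) prices in USD, copied from the table above.
PRICES = {
    "gpt-4o": (0.005, 0.015),
    "gemini-1.5-flash": (0.000075, 0.0003),
}

def estimate_cost(model, n_questions, in_tok=400, out_tok=256):
    """Rough total USD cost for n_questions, assuming fixed token counts."""
    p_in, p_out = PRICES[model]
    return n_questions * (in_tok / 1000 * p_in + out_tok / 1000 * p_out)

print(round(estimate_cost("gpt-4o", 1000), 2))  # ≈ $5.84 for 1,000 questions
```

With the table's rates, 1,000 questions on GPT-4o costs a few dollars, while the same run on Gemini 1.5 Flash is under 15 cents.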

Selecting Models for a Run

Use the --models flag in scripts/03_run_evaluation.py:

# Run all 8 models
python scripts/03_run_evaluation.py --models all --questions data/sampled/step1_sample.json

# Run a specific subset
python scripts/03_run_evaluation.py \
  --models gpt-4o,claude-3-5-sonnet,llama-3.3-70b \
  --questions data/sampled/step1_sample.json

# Run only free models (Groq-hosted)
python scripts/03_run_evaluation.py \
  --models llama-3.3-70b,deepseek-r1-70b \
  --questions data/sampled/step1_sample.json

Adding a New Model

To add a model not in the current registry:

Step 1 — Add an entry to MODELS in pipeline/config.py:

MODELS["my-new-model"] = {
    "provider": "openai",        # or "anthropic", "google", "groq"
    "model_id": "gpt-4-turbo",   # exact API model identifier
    "cost_per_1k_input": 0.01,   # USD per 1K input tokens
    "cost_per_1k_output": 0.03,  # USD per 1K output tokens
}

Step 2 — If the provider is new, create pipeline/models/my_provider_model.py with a class that extends LLMModel (from pipeline/models/base.py) and implements the answer_question() method.
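A provider adapter following this step might look like the sketch below. To keep the example self-contained, a minimal LLMModel stub stands in for the real base class in pipeline/models/base.py; the actual constructor signature and interface may differ.

```python
from abc import ABC, abstractmethod

class LLMModel(ABC):
    """Stub of the base class in pipeline/models/base.py (assumed interface)."""
    def __init__(self, model_key: str, model_id: str):
        self.model_key = model_key
        self.model_id = model_id

    @abstractmethod
    def answer_question(self, question: str) -> str:
        """Return the model's answer text for one question."""

class MyProviderModel(LLMModel):
    """Hypothetical adapter for a new provider."""
    def answer_question(self, question: str) -> str:
        # A real implementation would call the provider's SDK here with the
        # shared study settings (temperature=0, max_tokens=512).
        return f"[{self.model_id}] answer to: {question}"

m = MyProviderModel("my-new-model", "gpt-4-turbo")
print(m.answer_question("What is 2+2?"))
```

The abstract method ensures any adapter that forgets to implement answer_question() fails at construction time rather than mid-run.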

Step 3 — Register it in pipeline/models/registry.py:

from .my_provider_model import MyProviderModel
# Add to the provider dispatch dict:
"myprovider": MyProviderModel,

The new model will automatically be included in --models all runs and cost tracking.
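The dispatch pattern from Step 3 can be sketched in isolation as follows. The class bodies are stand-ins; the real mapping lives in pipeline/models/registry.py.

```python
# Stand-in adapter classes; real ones would extend LLMModel.
class OpenAIModel: ...
class MyProviderModel: ...

# Provider name -> adapter class, as in pipeline/models/registry.py.
PROVIDER_DISPATCH = {
    "openai": OpenAIModel,
    "myprovider": MyProviderModel,
}

def build_model(provider: str):
    """Instantiate the adapter registered for a provider name."""
    return PROVIDER_DISPATCH[provider]()

assert isinstance(build_model("myprovider"), MyProviderModel)
```

Because --models all iterates over the registry, any entry added to this mapping is picked up without further changes.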

Model Architecture Notes

  • GPT-4o / GPT-4o-mini: OpenAI's latest generation multimodal models. Both support the standard chat completions API.
  • Claude 3.5 Sonnet / Haiku: Anthropic's messages API. Sonnet is the most capable; Haiku is optimized for speed and cost.
  • Gemini 1.5 Pro / Flash: Google's google-generativeai SDK. Pro supports a 1M-token context window; Flash is optimized for latency.
  • Llama 3.3 70B: Meta's open-weight model, served via Groq's LPU inference hardware. Uses the same chat completions API interface as OpenAI.
  • DeepSeek R1 Distill Llama 70B: A reasoning-focused model distilled from DeepSeek R1 into the Llama 70B architecture. Also served via Groq. Expected to produce longer, more deliberate reasoning chains.
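Since the Groq-hosted models use the same chat completions interface as OpenAI (as noted above), one request payload shape covers both. The payload below is only constructed, not sent; the model string uses the registry key, and the provider-side ID may differ.

```python
# Chat-completions request body shared by OpenAI and Groq-hosted models.
# Decoding settings match the study conditions (temperature=0, max_tokens=512).
payload = {
    "model": "llama-3.3-70b",  # registry key; provider-side ID may differ
    "messages": [
        {"role": "user", "content": "What is the capital of France?"}
    ],
    "temperature": 0,    # deterministic decoding, per the study setup
    "max_tokens": 512,
}
print(sorted(payload))
```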