GPT-4.1 vs GPT-4o: MMLU accuracy and cost efficiency

Claim: GPT-4.1 scores 90.2% on MMLU compared to GPT-4o's 88.7% and is 26% cheaper for median queries.

Opposing view: MMLU has known quality issues, and some researchers consider it outdated. The 1.5-percentage-point accuracy gap may not be statistically significant, depending on how many questions were evaluated. Cost comparisons depend on token pricing, which changes frequently.
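
Whether the gap is significant depends heavily on sample size. The following is a minimal sketch of an unpaired two-proportion z-test applied to the two accuracies from the claim. The function name and the sample sizes are assumptions for illustration: 14,042 is the size of the full MMLU test split, and 500 stands in for a small subsample. Because both models answer the same questions, a paired test such as McNemar's would be more appropriate when per-question results are available; this version needs only the headline accuracies.

```python
import math

def two_proportion_z(acc_a, acc_b, n):
    """Unpaired two-proportion z-test for two accuracies, each measured on n questions.

    Returns (z, two_sided_p). Hypothetical helper, not part of any library.
    """
    pooled = (acc_a + acc_b) / 2          # equal n, so the pooled rate is the mean
    se = math.sqrt(pooled * (1 - pooled) * 2 / n)
    z = (acc_a - acc_b) / se
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value under the normal approximation
    return z, p

if __name__ == "__main__":
    # Sample sizes are assumptions: a small subsample vs. the full test split.
    for n in (500, 14042):
        z, p = two_proportion_z(0.902, 0.887, n)
        print(f"n={n}: z={z:.2f}, p={p:.4g}")
```

Under these assumptions the same 1.5-point gap does not reach the 0.05 level on 500 questions but does on the full test split, so the number of questions evaluated should be reported alongside any accuracy comparison.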

Evidence needed: Independent MMLU evaluation confirming the accuracy gap, plus cost-per-query analysis at current API pricing across different prompt lengths.
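
A cost-per-query analysis across prompt lengths could be sketched as below. The per-token prices are placeholders chosen for illustration, not verified current rates, and the `Pricing` class, `query_cost`, and `savings_table` names are hypothetical. Replace the numbers with the provider's published pricing before drawing conclusions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Pricing:
    input_per_million: float   # USD per 1M input tokens
    output_per_million: float  # USD per 1M output tokens

def query_cost(p, input_tokens, output_tokens):
    """Cost in USD of one query under per-token pricing."""
    return (input_tokens * p.input_per_million
            + output_tokens * p.output_per_million) / 1_000_000

# ASSUMPTION: placeholder prices for illustration only; check current API pricing.
PRICES = {
    "gpt-4.1": Pricing(input_per_million=2.00, output_per_million=8.00),
    "gpt-4o": Pricing(input_per_million=2.50, output_per_million=10.00),
}

def savings_table(prompt_lengths, output_tokens=200):
    """Return (prompt_tokens, cost_4_1, cost_4o, fractional_saving) per prompt length."""
    rows = []
    for n in prompt_lengths:
        a = query_cost(PRICES["gpt-4.1"], n, output_tokens)
        b = query_cost(PRICES["gpt-4o"], n, output_tokens)
        rows.append((n, a, b, 1 - a / b))
    return rows

if __name__ == "__main__":
    for n, a, b, saving in savings_table([100, 1_000, 10_000, 100_000]):
        print(f"{n:>7} prompt tokens: ${a:.5f} vs ${b:.5f} ({saving:.0%} cheaper)")
```

With these placeholder prices both the input and output rates differ by the same ratio, so the saving is identical at every prompt length. A different headline figure would have to come from factors this sketch does not model, which is why the analysis should state its pricing source and token counts explicitly.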

Recipe: A promptfoo YAML configuration is provided, and the Hugging Face dataset cais/mmlu is available for reproduction.
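
For readers who want to inspect the data outside promptfoo, the sketch below loads a few rows of cais/mmlu with the Hugging Face `datasets` library and formats them as multiple-choice prompts. It assumes the dataset's `question`, `choices`, and `answer` (an integer index) fields; the helper names, the prompt wording, and the choice of the `abstract_algebra` subject are illustrative and are not taken from the promptfoo guide.

```python
LETTERS = "ABCD"

def format_question(row):
    """Render one MMLU row as a multiple-choice prompt (wording is an assumption)."""
    options = "\n".join(f"{LETTERS[i]}. {c}" for i, c in enumerate(row["choices"]))
    return f"{row['question']}\n{options}\nAnswer with a single letter."

def is_correct(row, model_reply):
    """Score a reply by comparing its first letter to the gold answer index."""
    reply = model_reply.strip().upper()
    return reply[:1] == LETTERS[row["answer"]]

def load_rows(subject="abstract_algebra", limit=10):
    """Fetch up to `limit` test rows for one subject (requires `pip install datasets`)."""
    from datasets import load_dataset  # imported lazily so the helpers above work offline
    ds = load_dataset("cais/mmlu", subject, split="test")
    return [ds[i] for i in range(min(limit, len(ds)))]

if __name__ == "__main__":
    for row in load_rows(limit=2):
        print(format_question(row))
        print("gold:", LETTERS[row["answer"]])
        print()
```

Scoring on the first letter of the reply is a deliberately simple rule; replies that restate the question or explain before answering would need stricter parsing.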

Source: https://promptfoo.dev/docs/guides/gpt-4.1-vs-gpt-4o-mmlu
