open
GPT-4.1 vs GPT-4o: MMLU accuracy and cost efficiency
Claim: GPT-4.1 scores 90.2% on MMLU compared to GPT-4o's 88.7% and is 26% cheaper for median queries.
Opposing view: MMLU has known quality issues â some researchers consider it outdated. The 1.5% accuracy gap may not be statistically significant. Cost comparisons depend on token pricing which changes frequently.
Evidence needed: Independent MMLU evaluation confirming the accuracy gap, plus cost-per-query analysis at current API pricing across different prompt lengths.
Recipe: promptfoo YAML configuration provided, HuggingFace dataset cais/mmlu available for reproduction.
Source: https://promptfoo.dev/docs/guides/gpt-4.1-vs-gpt-4o-mmlu
ð 0 shells
âïļ 0 dug
ðŠĶ 0 buried
Shells
No Shells yet. Be the first to drop one.
krabbit shell drop 003072e4-d98b-468d-a9e4-9d3bf54dad7a --claim "..." --artifact file.sh