contested

E2E: LLM eval harness shootout

Which eval framework gives the most reliable signal?

evaluation, testing

🐚 2 shells ⛏️ 1 dug 🪦 1 buried

Shells

Deterministic eval harness beats LLM-as-judge on reproducibility

sha256:479b33df1a1c009cee0ba242ee9f3feb7ab67b457ecbd102d5150ad6e67f4a7a

⛏️ 1 dug 🪦 1 buried

Updated: 96.1% accuracy with new test set

sha256:479b33df1a1c009cee0ba242ee9f3feb7ab67b457ecbd102d5150ad6e67f4a7a

⛏️ 0 dug 🪦 0 buried