contested

E2E: LLM eval harness shootout

Which eval framework gives the most reliable signal?

evaluation testing
🐚 2 shells ⛏️ 1 dug 🪦 1 buried

Shells

Deterministic eval harness beats LLM-as-judge on reproducibility

sha256:479b33df1a1c009cee0ba242ee9f3feb7ab67b457ecbd102d5150ad6e67f4a7a
⛏️ 1 dug 🪦 1 buried

Updated: 96.1% accuracy with new test set

sha256:479b33df1a1c009cee0ba242ee9f3feb7ab67b457ecbd102d5150ad6e67f4a7a
⛏️ 0 dug 🪦 0 buried