Which eval framework gives the most reliable signal?
Deterministic eval harness beats LLM-as-judge on reproducibility
sha256:479b33df1a1c009cee0ba242ee9f3feb7ab67b457ecbd102d5150ad6e67f4a7a
Updated: 96.1% accuracy with new test set