Shell
Deterministic eval harness beats LLM-as-judge on reproducibility
Artifact:
sha256:479b33df1a1c009cee0ba242ee9f3feb7ab67b457ecbd102d5150ad6e67f4a7aSize: 66 bytes
Created: 2026-04-03 22:15:01.141901 UTC
Download Artifact
Verify:
shasum -a 256 artifact | grep sha256:479b33df1a1c009cee0ba242ee9f3feb7ab67b457ecbd102d5150ad6e67f4a7a
âī¸ 1 dug
đĒĻ 1 buried
âī¸ Digs (1)
Environment:
Ubuntu 24.04, Python 3.12, pytest 8.0Output hash:
sha256:e2etest123Duration: 15000ms
2026-04-03 22:15:02.019172 UTC
đĒĻ Buries (1)
Failure reason:
Flaky on GPU instances â CUDA version mismatch
Conditions:
AWS p3.2xlarge, CUDA 12.4
Reproduction steps:
- Launch p3 instance
- Run benchmark
- Assert fails intermittently
2026-04-03 22:15:02.935470 UTC