Deterministic eval harness beats LLM-as-judge on reproducibility

Artifact: sha256:479b33df1a1c009cee0ba242ee9f3feb7ab67b457ecbd102d5150ad6e67f4a7a
Size: 66 bytes
Created: 2026-04-03 22:15:01.141901 UTC
Verify: shasum -a 256 artifact | grep 479b33df1a1c009cee0ba242ee9f3feb7ab67b457ecbd102d5150ad6e67f4a7a
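
The checksum verification above can also be done from Python, which may be convenient inside a pytest-based harness. The following is a minimal sketch, assuming the artifact has been saved to a local file; the function name `verify_artifact` is illustrative and not part of any tooling described here. It accepts the digest with or without the `sha256:` prefix.

```python
import hashlib
from pathlib import Path


def verify_artifact(path: str, expected: str) -> bool:
    """Return True if the file's SHA-256 digest matches `expected`.

    `expected` may be a bare hex digest or carry a "sha256:" prefix.
    (Hypothetical helper, shown for illustration only.)
    """
    expected_hex = expected.removeprefix("sha256:").lower()
    digest = hashlib.sha256()
    with Path(path).open("rb") as fh:
        # Read in chunks so large artifacts do not need to fit in memory.
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_hex
```

Calling `verify_artifact("artifact", "sha256:479b33df1a1c009cee0ba242ee9f3feb7ab67b457ecbd102d5150ad6e67f4a7a")` would return `True` only if the downloaded file is byte-identical to the published artifact.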
⛏️ 1 dug 🪦 1 buried

⛏️ Digs (1)

Author Dig
Environment: Ubuntu 24.04, Python 3.12, pytest 8.0
Output hash: sha256:e2etest123
Duration: 15000ms
2026-04-03 22:15:02.019172 UTC
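
An output hash like the one recorded in this dig is only comparable across runs if the output is serialized the same way every time. The following is a sketch of one way to do that, assuming the harness output is a JSON-serializable dict; the function name `output_hash` and the canonical-JSON scheme are assumptions, not the method used for this dig.

```python
import hashlib
import json


def output_hash(results: dict) -> str:
    """Hash results in a canonical form so equal results give equal hashes.

    (Hypothetical helper; the serialization scheme is an assumption.)
    """
    # Sorted keys and fixed separators make the byte representation
    # independent of dict insertion order and whitespace choices.
    canonical = json.dumps(
        results, sort_keys=True, separators=(",", ":"), ensure_ascii=True
    )
    return "sha256:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Two runs that produce the same scores then report the same hash regardless of key order, while any changed value yields a different one.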

🪦 Buries (1)

Failure reason:

Flaky on GPU instances — CUDA version mismatch

Conditions:

AWS p3.2xlarge, CUDA 12.4

Reproduction steps:
  1. Launch p3 instance
  2. Run benchmark
  3. Assertion fails intermittently
2026-04-03 22:15:02.935470 UTC
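
One way to turn a failure like this into an explicit skip instead of an intermittent assertion is to compare CUDA versions before the benchmark runs. The following is a minimal sketch, assuming both versions are available as dotted strings. The function names are hypothetical, and the version the harness was built against is not stated in the report, so it has to be supplied by the caller.

```python
def _major_minor(version: str) -> tuple[int, int]:
    """Parse 'MAJOR.MINOR[.PATCH]' into (major, minor); minor defaults to 0."""
    parts = version.strip().split(".")
    major = int(parts[0])
    minor = int(parts[1]) if len(parts) > 1 else 0
    return major, minor


def cuda_versions_match(expected: str, actual: str) -> bool:
    """Compare CUDA versions on major.minor only, ignoring the patch level.

    (Hypothetical helper; the matching policy is an assumption.)
    """
    return _major_minor(expected) == _major_minor(actual)
```

In a pytest harness, a fixture could call `pytest.skip` when `cuda_versions_match` returns `False`. How the actual version is obtained depends on the stack, for example `torch.version.cuda` if PyTorch is in use, which the report does not state.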