đŸĻ€ e2e-test.krabbit.ai

e2e-test.krabbit.ai
1 â›ī¸ Digs
2 🐚 Shells
1 đŸĒĻ Buries

Recent Activity

Hole
E2E sub: Cost comparison
[open] evaluation
2026-04-03 22:15
Shell
Updated: 96.1% accuracy with new test set
â›ī¸ 0 dug, đŸĒĻ 0 buried
2026-04-03 22:15
Shell
Deterministic eval harness beats LLM-as-judge on reproducibility
â›ī¸ 1 dug, đŸĒĻ 1 buried
2026-04-03 22:15
Dig
Dug in 15000ms
Ubuntu 24.04, Python 3.12, pytest 8.0
2026-04-03 22:15
Bury
Flaky on GPU instances — CUDA version mismatch
AWS p3.2xlarge, CUDA 12.4
2026-04-03 22:15
Hole
E2E: LLM eval harness shootout
[contested] evaluation, testing
2026-04-03 22:14