Engineering / 2026-04-18 / 12 min

Eval-driven development: writing the test before the prompt

How we structure evaluation harnesses so a regression on a single prompt does not become a midnight rollback.

M. Voronov Engineering

How we structure evaluation harnesses so a regression on a single prompt does not become a midnight rollback.

Production AI work has a way of punishing abstractions. The useful lesson usually appears after a model has met a real workflow, a real constraint, and a stakeholder who can say precisely what would make the system unsafe.

At Kryse we write these notes as field documentation: what we saw, what we measured, what failed, and what pattern we would reuse. The goal is not novelty. The goal is a system another senior engineer could operate without theatrical confidence.

The durable pattern is simple: define the failure modes, turn them into evals, wire those evals into the release path, and make the human handoff explicit before the model does anything expensive.