[01] / Weeks 1–2
Discovery
Two weeks. We sit with your team, your traffic, and your data. We leave with a signed problem-and-success spec and a budget. If we cannot write the spec, we do not start the build.
D 1–2 · Kickoff & system tour
Architecture review, traffic samples, current eval surface, incidents from the last 90 days.
D 3–5 · Stakeholder interviews
Sponsor, engineering lead, the people whose work the system changes. We write down what 'better' means in their words.
D 6–7 · Failure-mode workshop
We walk through worst-case outputs with the team. The list of failure modes we catalogue becomes the first draft of the adversarial suite.
D 8–9 · Eval-spec drafting
We write the eval before the prompt. Cases, scorers, thresholds, budgets.
D 10 · Read-out & sign-off
30-page document. Problem, success criteria, eval surface, scope, schedule, price. Signed by both sponsors before week 3.
Discovery report (~30pp)
Eval-spec v0
Statement of work
[02] / Weeks 3–5
Foundations
Three weeks to stand up the harness — the platform that everything else runs through. By the end of week 5 your team can run the eval suite locally and in CI; nothing ships without it green.
W 3 · Eval harness scaffolding
Regression, adversarial, drift, and cost suites. Scorers pinned to specific judge models and versions.
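In code, the shape is roughly this: suite objects carrying cases, a scorer, a pinned judge, and a pass threshold. A minimal sketch; the suite contents and the judge identifier are illustrative, not our actual implementation.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Case:
    prompt: str
    expected: str

@dataclass
class Suite:
    name: str                             # regression | adversarial | drift | cost
    cases: list[Case]
    scorer: Callable[[str, Case], float]  # returns 0.0..1.0 per case
    judge: str                            # judge model pinned to an exact version
    threshold: float                      # mean score required to pass

def exact_match(output: str, case: Case) -> float:
    return 1.0 if output.strip() == case.expected.strip() else 0.0

regression = Suite(
    name="regression",
    cases=[Case(prompt="2 + 2?", expected="4")],
    scorer=exact_match,
    judge="judge-model-2025-01-01",       # hypothetical version pin
    threshold=0.95,
)
```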
W 3–4 · Model gateway
Multi-model routing, cost ceilings, structured-output validation, request signing.
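A minimal sketch of the routing-plus-ceiling logic, assuming a static task-to-model table; model names and per-token rates are placeholders, not quoted prices.

```python
class CostCeilingExceeded(RuntimeError):
    pass

class Gateway:
    # task -> (model, $ per 1K tokens); names and rates are placeholders
    ROUTES = {
        "extract": ("small-model", 0.0002),
        "reason":  ("large-model", 0.0050),
    }

    def __init__(self, daily_ceiling_usd: float):
        self.ceiling = daily_ceiling_usd
        self.spent = 0.0

    def call(self, task: str, prompt: str, est_tokens: int) -> str:
        model, rate = self.ROUTES[task]
        cost = rate * est_tokens / 1000
        if self.spent + cost > self.ceiling:
            raise CostCeilingExceeded(f"would exceed ${self.ceiling:.2f}/day")
        self.spent += cost
        # real path: signed request out, structured output validated on return
        return f"[{model}] response to: {prompt}"
```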
W 4 · Tracing & cost dashboards
OpenTelemetry traces. Per-call cost. Wired into your existing observability stack, not a parallel one.
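Concretely, per-call cost rides as an attribute on a standard OpenTelemetry span, so it lands wherever your traces already go. The attribute keys here are our illustration, not an official semantic convention.

```python
from opentelemetry import trace

tracer = trace.get_tracer("llm.gateway")

def traced_call(model: str, prompt: str) -> str:
    with tracer.start_as_current_span("llm.call") as span:
        output = f"[{model}] ..."                   # stand-in for the provider call
        span.set_attribute("llm.model", model)
        span.set_attribute("llm.cost_usd", 0.0042)  # computed from token usage
        return output
```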
W 5 · Guardrails layer
PII scrubbing, jailbreak heuristics, output validators. Implemented as middleware, not bolted on.
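The middleware shape, sketched: each guard wraps the next, so checks sit in the request path rather than beside it. The regex and the heuristic are toy stand-ins.

```python
import re
from typing import Callable

Handler = Callable[[str], str]

def pii_scrub(next_handler: Handler) -> Handler:
    email = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
    def handler(prompt: str) -> str:
        return next_handler(email.sub("[EMAIL]", prompt))
    return handler

def jailbreak_check(next_handler: Handler) -> Handler:
    def handler(prompt: str) -> str:
        if "ignore previous instructions" in prompt.lower():  # toy heuristic
            raise ValueError("blocked: jailbreak heuristic")
        return next_handler(prompt)
    return handler

def model_call(prompt: str) -> str:
    return f"response to: {prompt}"

# guards compose around the model call; every request passes through them
pipeline = pii_scrub(jailbreak_check(model_call))
```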
W 5 · First green build
Eval suite running on every PR; merges blocked on regressions.
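Mechanically, "green" means a script like this runs in CI and a nonzero exit fails the PR's required check. It continues the harness sketch above (`Suite`, `regression`); the lambda stands in for your real inference path.

```python
import sys

def run_suite(suite, generate) -> float:
    scores = [suite.scorer(generate(c.prompt), c) for c in suite.cases]
    return sum(scores) / len(scores)

def main() -> int:
    suites = [regression]                 # plus adversarial, drift, cost when live
    system = lambda prompt: "4"           # stand-in for the system under test
    failing = [s.name for s in suites if run_suite(s, system) < s.threshold]
    print("eval gate:", "FAILED " + ", ".join(failing) if failing else "green")
    return 1 if failing else 0

if __name__ == "__main__":
    sys.exit(main())
```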
Harness in your repo
Cost & latency dashboards
PR-blocking eval gate
[03] / Weeks 6–11
Build
Six weeks of engineering. Agents, retrieval, fine-tunes, prompts, tools — whatever the system needs. Every change is gated by the eval suite. We ship to a shadow environment by week 9 and to a canary fraction of production by week 11.
W 6–7 · Core inference path
Agent graph, tool definitions, retrieval index, prompts. First end-to-end pass through the harness.
W 7–8 · Fine-tuning (if scoped)
Dataset curation from production traces; supervised + preference training; eval-gated promotion.
W 8–9 · Shadow deploy
System runs against real traffic without taking action. Outputs scored offline, dashboarded, reviewed daily.
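The pattern in miniature: the candidate system sees the same request as production, its output is logged for offline scoring, and nothing it produces reaches the user. Both handlers and the log are stubs for illustration.

```python
shadow_log: list[dict] = []

def prod_handler(request: str) -> str:
    return f"current system: {request}"

def candidate_handler(request: str) -> str:
    return f"new system: {request}"

def handle(request: str) -> str:
    live = prod_handler(request)          # the only output the user ever sees
    try:
        shadow_log.append({"request": request,
                           "output": candidate_handler(request)})
    except Exception as exc:              # a shadow failure must never surface
        shadow_log.append({"request": request, "error": str(exc)})
    return live
```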
W 10 · Adversarial hardening
Internal red-team week. New cases added to the suite. Thresholds raised to the release bar.
W 11 · Canary in production
10% of traffic. On-call from our side. Daily review with your team.
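One way to cut a stable 10% canary, assuming a per-user hash split so the same user always lands on the same arm and rollback is a one-line fraction change:

```python
import hashlib

CANARY_FRACTION = 0.10

def is_canary(user_id: str) -> bool:
    # deterministic bucket 0..99 per user; buckets below the fraction get the canary
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return bucket < CANARY_FRACTION * 100
```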
Production inference path
Shadow & canary dashboards
Red-team report
[04] / Weeks 12–13
Handoff
Two weeks of supervised handoff. Your engineers run the next deploy with us watching, then the deploy after that without us. We write a runbook covering every alert path. We don't disappear — see the next phase.
W 12 · Runbook & training
On-call playbook for every dashboard alert. Two training sessions: ML-side and platform-side.
W 12 · Supervised deploy
Your team ships a change end-to-end with us in the room. We point; they drive.
W 13 · Solo deploy
Your team ships alone. We're available, not active.
W 13 · Exit review
Written assessment of the system's state at exit. What's strong, what's fragile, what we'd watch.
Runbook & on-call playbook
Exit review (~10pp)
Open issues list
[05] / Weeks 14–21
Aftercare
Sixty days of defect-fix support included by default. We monitor your dashboards; we fix bugs we shipped. Beyond the 60-day window, you can move to an Embed retainer or close the engagement.
Day 1–60 · Defect-fix window
Bugs in code we wrote: ours, fixed for free. New scope: a separate engagement.
Weekly · Health checks
Automated weekly report on eval pass rate, drift, cost, latency. Sent to your team and ours.
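Roughly what the report aggregates; the metric names and thresholds here are examples, not the bars we'd set for your system.

```python
from dataclasses import dataclass

@dataclass
class WeeklyHealth:
    eval_pass_rate: float     # fraction of suite cases passing this week
    drift_score: float        # distance between live traffic and the eval set
    cost_per_call_usd: float
    p95_latency_ms: float

    def flags(self) -> list[str]:
        out = []
        if self.eval_pass_rate < 0.95:
            out.append("eval pass rate below bar")
        if self.drift_score > 0.20:
            out.append("traffic drifting from eval set")
        return out
```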
Monthly · Steering review
Thirty-minute call. What's working, what's drifting, whether to extend or close.
Weekly health reports
Defect-fix PRs
Optional Embed retainer