[01] / Weeks 1–2
Discovery
Two weeks. We sit with your team, your traffic, and your data. We leave with a signed problem-and-success spec and a budget. If we cannot write the spec, we do not start the build.
D 1–2 · Kickoff & system tour
Architecture review, traffic samples, current eval surface, incidents from the last 90 days.
D 3–5 · Stakeholder interviews
Sponsor, engineering lead, the people whose work the system changes. We write down what 'better' means in their words.
D 6–7 · Failure-mode workshop
We walk through worst-case outputs with the team. The list of failure modes we catalogue becomes the first draft of the adversarial suite.
D 8–9 · Eval-spec drafting
We write the eval before the prompt. Cases, scorers, thresholds, budgets.
D 10 · Read-out & sign-off
30-page document. Problem, success criteria, eval surface, scope, schedule, price. Signed by both sponsors before week 3.
Discovery report (~30pp)
Eval-spec v0
Statement of work
[02] / Weeks 3–5
Foundations
Three weeks to stand up the harness — the platform that everything else runs through. By the end of week 5 your team can run the eval suite locally and in CI; nothing ships without it green.
W 3 · Eval harness scaffolding
Regression, adversarial, drift, and cost suites. Scorers pinned to specific judge models and versions.
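In code, the shape is roughly this: suite objects carrying cases, a scorer, a pinned judge, and a pass threshold. A minimal sketch; the suite contents and the judge identifier are illustrative, not our actual implementation.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Case:
    prompt: str
    expected: str

@dataclass
class Suite:
    name: str                             # regression | adversarial | drift | cost
    cases: list[Case]
    scorer: Callable[[str, Case], float]  # returns 0.0..1.0 per case
    judge: str                            # judge model pinned to an exact version
    threshold: float                      # mean score required to pass

def exact_match(output: str, case: Case) -> float:
    return 1.0 if output.strip() == case.expected.strip() else 0.0

regression = Suite(
    name="regression",
    cases=[Case(prompt="2 + 2?", expected="4")],
    scorer=exact_match,
    judge="judge-model-2025-01-01",       # hypothetical version pin
    threshold=0.95,
)
```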
W 3–4 · Model gateway
Multi-model routing, cost ceilings, structured-output validation, request signing.
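A minimal sketch of the routing-plus-ceiling logic, assuming a static task-to-model table; model names and per-token rates are placeholders, not quoted prices.

```python
class CostCeilingExceeded(RuntimeError):
    pass

class Gateway:
    # task -> (model, $ per 1K tokens); names and rates are placeholders
    ROUTES = {
        "extract": ("small-model", 0.0002),
        "reason":  ("large-model", 0.0050),
    }

    def __init__(self, daily_ceiling_usd: float):
        self.ceiling = daily_ceiling_usd
        self.spent = 0.0

    def call(self, task: str, prompt: str, est_tokens: int) -> str:
        model, rate = self.ROUTES[task]
        cost = rate * est_tokens / 1000
        if self.spent + cost > self.ceiling:
            raise CostCeilingExceeded(f"would exceed ${self.ceiling:.2f}/day")
        self.spent += cost
        # real path: signed request out, structured output validated on return
        return f"[{model}] response to: {prompt}"
```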
W 4 · Tracing & cost dashboards
OpenTelemetry traces. Per-call cost. Wired into your existing observability stack, not a parallel one.
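Concretely, per-call cost rides as an attribute on a standard OpenTelemetry span, so it lands wherever your traces already go. The attribute keys here are our illustration, not an official semantic convention.

```python
from opentelemetry import trace

tracer = trace.get_tracer("llm.gateway")

def traced_call(model: str, prompt: str) -> str:
    with tracer.start_as_current_span("llm.call") as span:
        output = f"[{model}] ..."                   # stand-in for the provider call
        span.set_attribute("llm.model", model)
        span.set_attribute("llm.cost_usd", 0.0042)  # computed from token usage
        return output
```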
W 5 · Guardrails layer
PII scrubbing, jailbreak heuristics, output validators. Implemented as middleware, not bolted on.
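The middleware shape, sketched: each guard wraps the next, so checks sit in the request path rather than beside it. The regex and the heuristic are toy stand-ins.

```python
import re
from typing import Callable

Handler = Callable[[str], str]

def pii_scrub(next_handler: Handler) -> Handler:
    email = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
    def handler(prompt: str) -> str:
        return next_handler(email.sub("[EMAIL]", prompt))
    return handler

def jailbreak_check(next_handler: Handler) -> Handler:
    def handler(prompt: str) -> str:
        if "ignore previous instructions" in prompt.lower():  # toy heuristic
            raise ValueError("blocked: jailbreak heuristic")
        return next_handler(prompt)
    return handler

def model_call(prompt: str) -> str:
    return f"response to: {prompt}"

# guards compose around the model call; every request passes through them
pipeline = pii_scrub(jailbreak_check(model_call))
```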
W 5 · First green build
Eval suite running on every PR; merges blocked on regressions.
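Mechanically, "green" means a script like this runs in CI and a nonzero exit fails the PR's required check. It continues the harness sketch above (`Suite`, `regression`); the lambda stands in for your real inference path.

```python
import sys

def run_suite(suite, generate) -> float:
    scores = [suite.scorer(generate(c.prompt), c) for c in suite.cases]
    return sum(scores) / len(scores)

def main() -> int:
    suites = [regression]                 # plus adversarial, drift, cost when live
    system = lambda prompt: "4"           # stand-in for the system under test
    failing = [s.name for s in suites if run_suite(s, system) < s.threshold]
    print("eval gate:", "FAILED " + ", ".join(failing) if failing else "green")
    return 1 if failing else 0

if __name__ == "__main__":
    sys.exit(main())
```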
Harness in your repo
Cost & latency dashboards
PR-blocking eval gate
[03] / Weeks 6–11
Build
Six weeks of engineering. Agents, retrieval, fine-tunes, prompts, tools — whatever the system needs. Every change is gated by the eval suite. We ship to a shadow environment by week 9 and to a canary fraction of production by week 11.
W 6–7 · Core inference path
Agent graph, tool definitions, retrieval index, prompts. First end-to-end pass through the harness.
W 7–8 · Fine-tuning (if scoped)
Dataset curation from production traces; supervised + preference training; eval-gated promotion.
W 8–9 · Shadow deploy
System runs against real traffic without taking action. Outputs scored offline, dashboarded, reviewed daily.
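The pattern in miniature: the candidate system sees the same request as production, its output is logged for offline scoring, and nothing it produces reaches the user. Both handlers and the log are stubs for illustration.

```python
shadow_log: list[dict] = []

def prod_handler(request: str) -> str:
    return f"current system: {request}"

def candidate_handler(request: str) -> str:
    return f"new system: {request}"

def handle(request: str) -> str:
    live = prod_handler(request)          # the only output the user ever sees
    try:
        shadow_log.append({"request": request,
                           "output": candidate_handler(request)})
    except Exception as exc:              # a shadow failure must never surface
        shadow_log.append({"request": request, "error": str(exc)})
    return live
```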
W 10 · Adversarial hardening
Internal red-team week. New cases added to the suite. Thresholds raised to the release bar.
W 11 · Canary in production
10% of traffic. On-call from our side. Daily review with your team.
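One way to cut a stable 10% canary, assuming a per-user hash split so the same user always lands on the same arm and rollback is a one-line fraction change:

```python
import hashlib

CANARY_FRACTION = 0.10

def is_canary(user_id: str) -> bool:
    # deterministic bucket 0..99 per user; buckets below the fraction get the canary
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return bucket < CANARY_FRACTION * 100
```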
Production inference path
Shadow & canary dashboards
Red-team report
[04] / Weeks 12–13
Handoff
Two weeks of supervised handoff. Your engineers run the next deploy with us watching, then the deploy after that without us. We write a runbook covering every alert path. We don't disappear — see the next phase.
W 12 · Runbook & training
On-call playbook for every dashboard alert. Two training sessions: ML-side and platform-side.
W 12 · Supervised deploy
Your team ships a change end-to-end with us in the room. We point; they drive.
W 13 · Solo deploy
Your team ships alone. We're available, not active.
W 13 · Exit review
Written assessment of the system's state at exit. What's strong, what's fragile, what we'd watch.
Runbook & on-call playbook
Exit review (~10pp)
Open issues list
[05] / Weeks 14–21
Aftercare
Sixty days of defect-fix support included by default. We monitor your dashboards; we fix bugs we shipped. Beyond the 60-day window, you can move to an Embed retainer or close the engagement.
Day 1–60 · Defect-fix window
Bugs in code we wrote: ours, fixed for free. New scope: a separate engagement.
Weekly · Health checks
Automated weekly report on eval pass rate, drift, cost, latency. Sent to your team and ours.
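Roughly what the report aggregates; the metric names and thresholds here are examples, not the bars we'd set for your system.

```python
from dataclasses import dataclass

@dataclass
class WeeklyHealth:
    eval_pass_rate: float     # fraction of suite cases passing this week
    drift_score: float        # distance between live traffic and the eval set
    cost_per_call_usd: float
    p95_latency_ms: float

    def flags(self) -> list[str]:
        out = []
        if self.eval_pass_rate < 0.95:
            out.append("eval pass rate below bar")
        if self.drift_score > 0.20:
            out.append("traffic drifting from eval set")
        return out
```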
Monthly · Steering review
Thirty-minute call. What's working, what's drifting, whether to extend or close.
Weekly health reports
Defect-fix PRs
Optional Embed retainer