A field guide to AI-native ops
How an embed actually runs week-by-week, what we measure, and the difference between systems that compound and ones that decay.
Most AI-native ops engagements take the same shape. Ten weeks. Five phases. One production system. The work is not glamorous. The wins are.
Here's how an embed actually runs.
Week 0 · Scope
A working session, not a sales call. We trace the operation, watch the team work for a day, and write down what success looks like in measurable terms. If we can't name the outcome by the end of the session, we don't take the engagement.
Weeks 1–2 · Embed
A senior operator joins the team in your tools. Slack, Linear, the CRM, the support queue. They sit in standups, work the actual backlog, and listen. By the end of week two we know where the leverage is, and it's almost never where the brief said it would be.
Weeks 3–4 · Re-architect
Most AI projects fail because they bolt a model onto a broken process. We do the opposite. Map the loop. Find the leverage point. Write the workflow spec. Define what success looks like as evals: concrete pass/fail tests against real cases.
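A minimal sketch of what "success as evals" can look like for a ticket-triage loop. Every name here is illustrative, not a fixed harness: the cases come from real historical tickets, and the function under test stands in for whatever the workflow actually is.

```python
from dataclasses import dataclass

@dataclass
class EvalCase:
    ticket_text: str      # real input pulled from the support queue
    expected_queue: str   # where a senior operator would route it

# Two illustrative cases; the real set is built from historical volume.
CASES = [
    EvalCase("Card charged twice for the same order", "billing"),
    EvalCase("Password reset email never arrives", "auth"),
]

def triage_ticket(text: str) -> str:
    """Stand-in for the workflow under test (model call, rules, or both)."""
    return "billing" if "charge" in text.lower() else "auth"

def pass_rate(cases: list[EvalCase]) -> float:
    """Share of real cases the workflow gets right; reported per release."""
    passed = sum(triage_ticket(c.ticket_text) == c.expected_queue for c in cases)
    return passed / len(cases)
```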
Weeks 5–8 · Ship
Production system, against real volume. Each release is gated on the evals from the re-architect phase. Human-in-the-loop where the cost of being wrong is high. Monitoring, alerting, runbooks. We don't ship volume past a failing eval.
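A minimal sketch of that gate. The hooks and the 95% bar are assumptions for illustration; the real threshold is set per workflow during re-architect.

```python
PASS_THRESHOLD = 0.95               # assumed bar, set per workflow
ROLLOUT_STEPS = [0.05, 0.25, 1.0]   # share of real volume per release

def gated_rollout(run_evals, set_traffic_share):
    """Step volume up only while the eval set keeps passing."""
    for share in ROLLOUT_STEPS:
        rate = run_evals()
        if rate < PASS_THRESHOLD:
            # Failing eval: hold volume where it is and page a human.
            raise RuntimeError(f"eval pass rate {rate:.0%} below the gate; rollout halted")
        set_traffic_share(share)
```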
Weeks 9–10 · Hand-off
The system stays. Owned by your team, with a runbook they wrote, an on-call rotation they own, and a champion who can make the next change without us.
What we measure
- Coverage. What share of the loop runs through the system today.
- Quality. Eval pass rate against current and historical cases.
- Latency. Time from input to resolution.
- Cost. Per-resolution cost, including model spend.
- Toil. Hours the human team spent on the loop this week.
Coverage and quality go up. Latency, cost, and toil go down. If they don't, we name it and fix it before the next phase.
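A minimal sketch of the weekly snapshot behind those five numbers and the direction check we hold it to. Field names and units are illustrative, not a fixed schema.

```python
from dataclasses import dataclass

@dataclass
class WeeklySnapshot:
    coverage: float             # share of the loop running through the system, 0-1
    eval_pass_rate: float       # quality: pass rate on current and historical cases
    latency_hours: float        # median time from input to resolution
    cost_per_resolution: float  # dollars per resolution, including model spend
    toil_hours: float           # human hours spent on the loop this week

def trending_right(prev: WeeklySnapshot, cur: WeeklySnapshot) -> bool:
    """Coverage and quality up; latency, cost, and toil down."""
    return (cur.coverage >= prev.coverage
            and cur.eval_pass_rate >= prev.eval_pass_rate
            and cur.latency_hours <= prev.latency_hours
            and cur.cost_per_resolution <= prev.cost_per_resolution
            and cur.toil_hours <= prev.toil_hours)
```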
Compounding vs. decaying
The difference that matters most is the one between systems that compound and systems that decay.
A compounding system gets sharper as it sees more traffic. The eval set grows. Edge cases get encoded. The team that owns it adds new pass/fail tests as new failure modes appear in production. The model improves because the work to improve it is built into the operation.
A decaying system was wired up correctly the day it shipped and never touched again. Six months later it's making decisions on a stale rubric, against a changed input distribution, and nobody on the team knows how to retune it.
The difference isn't the model. It's the loop you built around it. Telemetry feeds into evals. Evals feed into tuning. Tuning feeds back into production. Break that chain and the system decays.
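A minimal sketch of that chain, with every stage as a hypothetical hook rather than a specific tool. The point is that each stage feeds the next on a schedule, so the system sharpens as traffic flows through it.

```python
# Telemetry -> evals -> tuning -> production, run on a regular cadence.
def run_compounding_loop(collect_failures, extend_eval_set, tune_workflow, deploy):
    failures = collect_failures()          # telemetry: real cases the system got wrong
    eval_set = extend_eval_set(failures)   # encode them as new pass/fail cases
    candidate = tune_workflow(eval_set)    # retune against the grown eval set
    deploy(candidate)                      # back into production, gated on those evals
```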
When we hand the work off, we hand off the chain. Not the model.