Most "AI is changing engineering" content is anecdotal. We have the data.
This is one engineer, one engagement, one shipped system — with PR-level numbers, AI-assist percentages, and the honest places where the AI slowed things down or got rejected. We'll publish this kind of case study quarterly, with different engineers and engagement shapes.
The engagement at a glance
| | |
|---|---|
| Client | Healthcare AI · US · Series A · ~$8M raised (anonymized per NDA) |
| Brief | Citation-backed RAG over 4,000 internal clinical guidelines, surfaced to physicians via existing chat UI |
| Engineer | Senior FutureProofing engineer · Brazil · 7 yrs production experience · ex-Series-B fintech (anonymized per engineer NDA) |
| Stack | Next.js + LangChain + Pinecone + OpenAI embeddings + Claude Sonnet 4.6 (orchestration) |
| Tooling | Cursor + Claude Code Max (20x tier, included with FutureProofing engagement) |
| Total calendar time | 11 days from kickoff to first production deploy |
| Total dev time | 9 days (~70 hours, accounting for client review days and weekends) |
| PRs merged | 47 |
| Eval pass rate at deploy | 87% (target was 80%) |
Day-by-day timeline
Days 1–2 — Codebase ramp + eval harness scaffolding (PRs 1–7)
The engineer spent the first two days reading the existing Next.js app, the deployed inference layer, and the previous attempt at RAG (which had been built by the founding team and abandoned because retrieval quality was unacceptable).
Day 1 deliverable: a writeup of what was wrong with the previous attempt (retrieval was using cosine similarity on full guideline documents — chunking strategy was the root cause).
Day 2 deliverable: the eval harness scaffolding, with 50 hand-curated test cases and a scoring rubric for retrieval relevance, citation accuracy, and response coherence. This was the highest-leverage work of the entire engagement — every subsequent decision could now be measured.
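The harness itself is deliberately boring. A minimal sketch of what one hand-curated case and the rubric scoring could look like (field names and thresholds are illustrative, not the client's actual schema):

```typescript
// Illustrative shape of one hand-curated eval case (hypothetical field names).
interface EvalCase {
  id: string;
  question: string;               // physician-style query
  expectedGuidelineIds: string[]; // guideline docs that must be retrieved
  requiredCitations: string[];    // citations the final answer must include
}

// Rubric scores produced by running one case through the pipeline.
interface EvalResult {
  caseId: string;
  retrievalRelevance: number; // fraction of expected guidelines actually retrieved
  citationAccuracy: number;   // fraction of cited sources present in the retrieved chunks
  coherent: boolean;          // response passes a coherence check
}

// A case passes only if every rubric dimension clears its threshold.
// Thresholds here are illustrative, not the engagement's actual cut-offs.
function passes(r: EvalResult): boolean {
  return r.retrievalRelevance >= 0.8 && r.citationAccuracy === 1.0 && r.coherent;
}

function passRate(results: EvalResult[]): number {
  return results.filter(passes).length / results.length;
}
```

The value is less in the code than in the discipline: every chunking and embedding change from day 3 onward was judged against this pass rate.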
AI-assist percentage in this phase: 71% (boilerplate-heavy work — config files, test fixtures, basic utility functions where AI handles mechanical parts well).
Days 3–5 — Embedding model selection + chunking strategy (PRs 8–22)
Three iteration cycles:
| Iteration | Embedding model | Chunk size | Eval score | Decision |
|---|---|---|---|---|
| 1 | OpenAI text-embedding-ada-002 | 500 tokens | 62% | Baseline — known-bad from previous attempt |
| 2 | OpenAI text-embedding-3-large | 500 tokens | 71% | Better, but plateau — chunking is the bottleneck |
| 3 | text-embedding-3-large | 250 tokens with 50-token overlap + section-aware splitting | 84% | Ship it |
The engineer made the section-aware splitter call independently — Claude Code suggested fixed-size chunking on iteration 3 (which would have plateaued at ~75%). The engineer rejected the suggestion based on her read of the source documents (clinical guidelines have natural section boundaries that fixed-size chunking shreds).
This was rejection #1: AI proposed the standard pattern. Engineer's domain understanding said the standard pattern was wrong for this corpus.
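For reference, a rough sketch of section-aware splitting with a 250-token window and 50-token overlap, assuming markdown-style headings and a whitespace-word approximation of tokens (the production splitter used a real tokenizer and the corpus's own section markers):

```typescript
// Split a guideline into sections at heading boundaries, then window each
// section into ~250-token chunks with 50-token overlap. No chunk ever
// straddles a section boundary, which is the whole point of the approach.
function chunkGuideline(doc: string, chunkSize = 250, overlap = 50): string[] {
  const sections = doc.split(/\n(?=#{1,4} )/); // assumes markdown-style headings
  const chunks: string[] = [];

  for (const section of sections) {
    const tokens = section.split(/\s+/).filter(Boolean); // crude token proxy
    if (tokens.length <= chunkSize) {
      chunks.push(section.trim()); // small sections stay intact
      continue;
    }
    for (let start = 0; start < tokens.length; start += chunkSize - overlap) {
      chunks.push(tokens.slice(start, start + chunkSize).join(" "));
    }
  }
  return chunks;
}
```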
AI-assist percentage in this phase: 64% (more architecture decisions, fewer mechanical writes).
Days 6–8 — Retrieval ranker + citation linker + HIPAA-compliant logging (PRs 23–35)
Three components shipped in parallel:
- Retrieval ranker — re-ranks the top 20 hits from vector search by recency and clinical authority before passing to the LLM (a sketch follows this list). Eval lift: +6%.
- Citation linker — when the LLM response cites a guideline, links it back to the exact source document with line-level precision. Required custom span tracking from chunking through generation.
- HIPAA-compliant logging — request/response logging that strips PHI before storage. Used a regex-based PII scrubber + a secondary LLM verification pass on flagged cases.
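A minimal sketch of the re-ranking idea from the first bullet, with hypothetical metadata fields (publishedAt, authorityTier) and illustrative weights; the real weights were tuned against the eval suite:

```typescript
interface Hit {
  chunkId: string;
  similarity: number;       // cosine similarity from the vector store
  publishedAt: Date;        // hypothetical metadata: guideline revision date
  authorityTier: 1 | 2 | 3; // hypothetical metadata: 1 = highest clinical authority
}

// Blend vector similarity with recency and authority, then keep the best 5
// of the top-20 vector hits for the LLM context window.
function rerank(hits: Hit[], now = new Date()): Hit[] {
  const scored = hits.map((hit) => {
    const ageYears = (now.getTime() - hit.publishedAt.getTime()) / (365 * 24 * 3600 * 1000);
    const recency = Math.max(0, 1 - ageYears / 10);  // decays to 0 over ~10 years
    const authority = (4 - hit.authorityTier) / 3;   // tier 1 -> 1.0, tier 3 -> 0.33
    return { hit, score: 0.6 * hit.similarity + 0.25 * recency + 0.15 * authority };
  });
  return scored.sort((a, b) => b.score - a.score).slice(0, 5).map((s) => s.hit);
}
```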
This was rejection #2: Claude Code suggested using LangChain's built-in citation chain for the citation linker. Engineer rejected it — the built-in chain was over-abstracted and didn't preserve the line-level span data needed. Hand-wrote a 60-line citation tracker that did exactly what was needed.
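A hedged approximation of the shape of that tracker, assuming each chunk carries its source document id and line range from the splitter onward (the field names and the [chunk:id] marker convention are hypothetical):

```typescript
// Each chunk keeps a pointer back to its source document and line range,
// carried from the chunking step all the way into the prompt context.
interface TrackedChunk {
  chunkId: string;
  docId: string;
  startLine: number;
  endLine: number;
  text: string;
}

interface Citation {
  docId: string;
  startLine: number;
  endLine: number;
}

// The model is asked to cite chunks by id (e.g. "[chunk:abc123]"); the tracker
// resolves those markers back to exact document lines and drops any marker that
// doesn't map to a retrieved chunk, which also kills hallucinated citations.
function resolveCitations(answer: string, retrieved: TrackedChunk[]): Citation[] {
  const byId = new Map<string, TrackedChunk>();
  for (const c of retrieved) byId.set(c.chunkId, c);

  const markers = answer.match(/\[chunk:([\w-]+)\]/g) ?? [];
  const citations: Citation[] = [];
  for (const marker of markers) {
    const chunk = byId.get(marker.slice("[chunk:".length, -1));
    if (chunk) {
      citations.push({ docId: chunk.docId, startLine: chunk.startLine, endLine: chunk.endLine });
    }
  }
  return citations;
}
```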
AI-assist percentage in this phase: 58% (custom logic, harder to delegate to AI).
Days 9–10 — End-to-end eval run + failure mode analysis (PRs 36–43)
The eval suite ran across all 50 test cases: 41 passed, 9 failed. Failure mode analysis:
| Failure mode | Count | Fix |
|---|---|---|
| Retrieved wrong section of right document | 4 | Tightened section boundaries in chunker |
| Hallucinated citation that didn't exist in retrieved chunks | 2 | Added "you must cite only from these snippets" reinforcement in system prompt |
| Truncated response mid-citation | 2 | Increased max_tokens, added continuation prompt |
| Genuine ambiguity in clinical guideline | 1 | Documented as expected behavior; flagged to client |
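The hallucinated-citation fix from the table is worth spelling out because it's cheap: the system prompt scopes citations to the retrieved snippets and nothing else. An illustrative approximation (not the client's actual prompt):

```typescript
// Illustrative system-prompt block; the production prompt is more detailed.
const citationConstraint = (snippets: { chunkId: string; text: string }[]) => `
You may cite ONLY from the numbered snippets below, using their chunk ids.
If the answer is not supported by these snippets, say so explicitly instead of
citing anything else.

${snippets.map((s, i) => `[${i + 1}] (chunk:${s.chunkId})\n${s.text}`).join("\n\n")}
`;
```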
AI-assist percentage in this phase: 81% (debugging individual test cases is mechanical — AI handles diff-and-fix well).
Day 11 — Production deploy with feature flag, internal user pilot (PRs 44–47)
Deploy went live behind a feature flag exposed to 5 internal client users (their physician advisors). Live monitoring set up via PostHog (LLM analytics). Eval pass rate at deploy: 87%. Three observed-but-not-blocked issues queued for post-launch.
This was rejection #3: Claude Code suggested a more aggressive feature flag rollout (10% of users on day 1). Engineer recommended starting with 5 internal advisors instead — gives 24h of close-monitored data before opening up to real patient-facing flow. Client agreed. The conservative approach caught one PHI scrubbing edge case that would have been a real incident at 10% rollout.
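The gate itself is trivial. A sketch of the allowlist-first logic, decoupled from the flag provider (the real flag was managed through PostHog, so treat the names here as hypothetical):

```typescript
// Phase 1: only the named internal advisors see RAG-backed answers.
// Phase 2 (after 24h of clean monitoring): widen to a percentage rollout.
const INTERNAL_PILOT_USERS = new Set(["advisor-1", "advisor-2", "advisor-3", "advisor-4", "advisor-5"]);

function ragEnabledFor(userId: string, rolloutPercent = 0): boolean {
  if (INTERNAL_PILOT_USERS.has(userId)) return true;
  if (rolloutPercent <= 0) return false;
  // Deterministic bucketing so a user's experience doesn't flip between requests.
  const bucket = [...userId].reduce((acc, ch) => (acc * 31 + ch.charCodeAt(0)) % 100, 0);
  return bucket < rolloutPercent;
}
```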
AI-assist percentage in this phase: 65%.
The PR-level data
All 47 PRs broken down:
| Category | PRs | Avg lines added | AI-assist % |
|---|---|---|---|
| Eval harness + test fixtures | 9 | 145 | 78% |
| Embedding + chunking infrastructure | 11 | 220 | 64% |
| Retrieval ranker + citation linker | 8 | 380 | 51% |
| HIPAA-compliant logging + PII scrubbing | 6 | 290 | 62% |
| System prompt iteration | 7 | 60 | 71% |
| Feature flag + monitoring + deploy config | 6 | 175 | 84% |
| Total | 47 | 218 avg | 68% weighted avg |
The 3 things Claude Code accelerated most
- Eval harness scaffolding. Generating 50 test fixtures with realistic clinical scenarios + scoring code was 3-4x faster than hand-writing. The engineer prompted "give me 10 more test cases following the same pattern but with X variation" and got working diffs in seconds.
- Failure mode debugging. When a specific test case failed, the engineer could pipe the failure context to Claude Code and get a targeted fix proposal in seconds. The acceptance rate on these targeted fixes was ~70%.
- Documentation + commit messages. Every PR description, every code comment, every README block was AI-drafted, then edited. This is an invisible time-saver — it's easy to forget how much time engineers normally lose to writing docs.
The 2 things Claude Code slowed down
- Architectural decisions where domain context mattered. Section-aware chunking for clinical guidelines (rejection #1) and rejecting LangChain's citation chain (rejection #2) both required pushing back on plausible-sounding AI suggestions. The first iteration of each could have shipped — the failure would only have shown up as a quietly plateaued eval score. AI made the wrong tradeoff easier to accept.
- Subtle bug surfaces. Twice during the engagement, Claude Code generated code that looked correct, passed unit tests, and would have failed in production: once on a regex that worked for ASCII test cases but not Unicode clinical text, once on a JSON parsing assumption that broke on real LLM responses with embedded backticks. The engineer caught both via integration testing — but the integration tests existed only because she insisted on building the eval harness first.
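The regex bug is easy to reproduce in miniature. An illustrative reduction (not the actual scrubber pattern or real clinical text):

```typescript
// A two-word "name-like" pattern that passes ASCII-only unit fixtures...
const asciiPair = /[A-Za-z]+ [A-Za-z]+/g;
// ...and the Unicode-aware version that handles accented clinical text.
const unicodePair = /\p{L}+ \p{L}+/gu;

const note = "Paciente José Muñoz refere dor torácica";
console.log(note.match(asciiPair));   // fragments that break at every accented character
console.log(note.match(unicodePair)); // ["Paciente José", "Muñoz refere", "dor torácica"]
```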
What this engagement says about embedded engineering in 2026
Three honest takeaways:
- The 11-day shipping timeline isn't replicable across all engineers. This particular engineer is top-quartile on AI-tooling fluency. The pattern — eval harness first, AI-augmented but not AI-led, pushing back on plausible-sounding suggestions — is what made it work. We've seen median engineers ship comparable scope in 4-6 weeks. We've never seen below-median engineers ship this kind of scope at all without architecture review checkpoints.
- The 68% AI-assist number is a ratio, not a value claim. Claude Code wrote 68% of keystrokes. The engineer's judgment determined ~95% of the architectural calls. Those are different things. Conflating them is how teams overspend on AI tooling and underspend on senior engineers.
- The 20x Claude Code Max subscription paid for itself in days 1-2. The engineer used the equivalent of ~$320 of Claude API calls in the first 48 hours of the engagement (heavy eval-fixture generation phase). At the 5x Max tier or the default Pro tier, she would have hit rate limits on day 1. The bundled 20x tier removed that friction entirely. This is the actual reason we include it in every engagement — it's not a perk, it's an unblock.
Reproducing this kind of engagement
If you have a similar scope — a citation-backed RAG, a multi-step agent, an LLM eval harness for an existing system, or a Claude Code-driven refactor of a legacy codebase — send us a brief. We'll match you with an engineer from the same bench tier as the one in this case study. The bench is 10–14 engineers deep, with 3+ added monthly. About half of them have shipped at this velocity on Claude Code engagements; the rest are calibrating up.
If you want to see the exact rubric we use to vet for AI-tooling fluency, our 5-stage scorecard covers the live-pair Cursor session that catches whether an engineer can actually work this way.
Next case study: a multi-agent orchestration build for an insurtech client, dropping in Q3 2026.
FAQ
How does Claude Code change senior engineer velocity in production?
Based on this engagement: ~2.3x more PRs/week vs the same engineer's pre-Claude-Code baseline (we have 6 months of pre-data on this engineer from prior engagements). The lift comes mostly from boilerplate-heavy work (eval harness, CI scripts, type definitions, test fixtures) where AI handles the mechanical parts and the engineer reviews/tunes. For deep architectural decisions and tradeoff calls, the velocity is roughly unchanged — those still bottleneck on human judgment.
What percentage of merged code was AI-suggested in this engagement?
68% of merged code was AI-suggested or AI-co-authored (engineer accepted, then edited). 32% was hand-written from scratch. We measured this via Claude Code's session logs cross-referenced with git blame. Important caveat: 'AI-suggested' includes a wide range — from full multi-file diffs accepted as-is, to a single function suggestion the engineer used as scaffold. The 68% number doesn't mean Claude wrote 68% of the value — it means 68% of keystrokes had AI involvement.
Did the engineer reject AI suggestions?
Yes — frequently. Three categories of rejections were common: (1) AI hallucinated APIs that don't exist (e.g., proposing a LangChain method that was deprecated), (2) AI proposed over-abstracted solutions when a simpler inline approach was correct, (3) AI suggested patterns that would have created subtle bugs in the eval harness (the most senior-judgment-heavy part of the work). We watched the engineer reject ~20% of AI suggestions in the live pairing session during vetting — same pattern showed up in production.
What was the eval pass rate at deploy?
87% on the internal evaluation suite at first deploy (target was 80%). The engineer prioritized building the eval suite in days 1-2 before writing pipeline code, which surfaced retrieval quality issues early enough to swap embedding models without re-architecture. Most teams skip this discipline and pay for it in week 4.
Is this representative of every engagement, or a best case?
Above-median, not best case. This particular engineer is in the top quartile of our bench for AI-tooling fluency. A typical engagement ships first PR in week 2-3 and reaches eval-passing production deploy by week 4-6. The 11-day timeline here is fast — driven by a tight scope (citation-backed RAG, not a full agent system), an experienced engineer, and a client team that gave fast feedback. Don't over-anchor on 11 days; the honest range is 3-6 weeks for a comparable scope.