Earning the Right to Have an Opinion on AI in the SDLC
```json
{
"titleJa": "AI開発ツールに対する意見を述べる資格を得るために",
"summaryJa": "SVPの筆者は、AI開発ツールに関する意思決定を行うために、9週間もの間、顧客向けのプロダクションコードを自身で開発しました。
その結果、現在のAI開発ツールは、既存の複雑なシステムへの適用や、AIのタイプライティングと人間の監督・レビュー・修正を組み合わせた作業が重要であることがわかった。
AIによるコードレビューは、人間のレビュアーと並行して活用することで、より高い品質を維持できる点も明らかになりました。
しかし、CSSや視覚的な作業、制約条件の無視、レガシー契約のずれといった課題も存在し、常に人間の目による確認が不可欠です。
"
}
```
大手テック企業のSVP(上級副社長)が、AIを活用した開発プロセス(SDLC)の真の可能性を検証するため、9週間にわたり実践的な実験を行った結果を公開しました。AIツールがエンジニアの仕事を代替するのか、単なる流行なのかという議論に対し、実践に基づいた具体的な知見を提示しています。
AI実験の限界と実践の必要性
多くの企業のAI活用は、週末チャットボットやデモ用のPoC(概念実証)といった「おもちゃ」に留まりがちだそうです。しかし、本物のレガシーコードや複雑なシステムにAIを適用した場合、その実力は大きく変わると指摘しています。AIツールが真価を発揮するには、単なる新規開発(Greenfield)ではなく、長年蓄積された複雑なコードベースへの対応が鍵となると述べています。
AIと人間の協働による生産性向上
筆者は、AIがコードを生成する「タイピング」を担当し、自身が「指示、レビュー、修正、最終承認」を行うという形で開発を進めました。その結果、従来のチームで半年かかる規模の作業を、パートタイムで約3倍の効率で完了させたとのことです。特に、定型的な翻訳作業やテストコードの自動生成など、人間の工数がかかっていた部分でAIが大きな力を発揮したと説明しています。
AIレビューの新たな可能性と課題
さらに、AIをコードのレビュー担当者として導入した結果、130件以上のレビューコメントが生成され、すべての問題がマージ前に修正されたそうです。これは、AIが人間のレビューアとは異なる「盲点」を持ち、疲労や忖度なく厳しくチェックを行う「並行レビューア」として機能することを示しています。ただし、AIは人間より賢いわけではなく、人間の監督と組み合わせることで相乗効果を生むと強調しています。
まとめ
この実験は、AIがエンジニアを完全に置き換えるのではなく、人間の能力を飛躍的に高める強力なツールであることを示唆しています。重要なのは、AIを単なる代替品として捉えるのではなく、開発プロセス全体を強化するパートナーとして活用する視点だといえます。
原文の冒頭を表示(英語・3段落のみ)
I run engineering as an SVP. I haven’t shipped production code in years. In late February, I decided that was a problem worth fixing — not because I missed coding, but because I was about to make a lot of decisions about AI tooling that I had no business making.Every CTO and tech executive I know is being asked the same questions right now. Are these tools real? Will they replace engineers? How fast should we go? What do we let them do? The honest answer for most of us is: we’re guessing. We read the same blog posts everyone else reads, watch the same demos, and form opinions from the cheap seats.So I went back to the keyboard. For nine weeks I shipped customer-facing production code on a real, gnarly, legacy migration — every line generated by AI, every line reviewed by me. Here is what I learned. Most executive AI experiments are toys. A weekend chatbot. An internal tool nobody depends on. A POC that gets a demo slot and dies.Toys flatter the tools. Production reveals them.I wanted to know two things, and a POC could answer neither:Can current AI dev tools handle real legacy complexity? Greenfield is kind to LLMs. The harder question is whether they hold up against a codebase that has accreted decisions for a decade — patterns from three eras, deprecated dependencies, base components with dozens of downstream consumers, and edge cases that exist for reasons no one alive remembers.Can I credibly lead this transformation without doing it myself? I was telling teams to adopt these tools. I was approving budgets. I was writing strategy memos. And I had not personally felt what it’s like to ship production code with them. That’s the gap between being a mouthpiece and being a leader. I wanted to close it.The work I picked was a design system migration — replacing legacy UI components with our internal next-generation library across a customer-facing e-commerce platform. Bounded enough to be measurable. Public enough that mistakes hurt. Legacy enough to be honest.The single most important sentence in this post: every line of code that shipped was AI-generated, and not one line shipped without me at the keyboard.And yet the industry conversation has flattened into a false binary: AI is replacing engineers, or AI is hype that can't ship anything serious. Both camps miss the actual frontier, which lives entirely in the part between them.The truth: the AI did the typing. I did the directing, reviewing, correcting, and the final yes. There was no spec-and-walk-away. There was no “let it run overnight.” Every component took multiple prompts, multiple iterations, multiple course corrections. When the AI generated something subtly wrong — and it generated something subtly wrong on every single component — we caught it because we forced a test-first discipline.Anyone telling you they shipped serious production code with AI without a senior engineer in the loop is either lying, has a very forgiving definition of “serious,” or hasn’t seen their bug count yet.In nine weeks, working part-time around an executive day job, I shipped 27 pull requests to production. The eighteen substantive ones added 10,193 lines of code and removed 8,603 — a net codebase growth of under 1,600 lines. Six of those eighteen were net-negative: more code deleted than added. 64% of every line I added was a test.I single-handedly migrated a third of an epic projected to take a full team most of a year, fixed eight production bugs that emerged along the way, and the codebase came out smaller and better-tested than I found it. The same epic in the year before AI tools became part of the workflow had completed roughly six tickets per month and zero of the heavy base-component migrations. My throughput alone was nearly three times that, on structurally harder work.These numbers are not the point. They are the floor. They tell you what one part-time executive can do with no team support beyond normal review. The implication for a full-time engineer with these tools is not subtle.Boilerplate translation. Anywhere the work was “take this pattern and apply it to forty places,” the AI was extraordinary. The first time I ran a migration across forty downstream consumers in an afternoon, I felt the floor of what’s possible move.Test scaffolding. Coverage that would have been a tax on a human engineer was free. The AI happily wrote tests for paths I would have been tempted to skip. That’s how 64% of insertions ended up being tests — not because I planned it, but because the marginal cost of writing one more test went to nearly zero.Refactors with clear specs. When I could write a precise before/after spec, the execution was crisp. The skill was no longer typing the refactor; it was writing the spec.Code review. This one surprised me the most.I started using a setup where one AI authored the code and a different AI reviewed it. Sometimes a third reviewed both. Different models, different vendors, different prompts.Across the nine weeks the AI reviewer covered every changed file on every iteration of every PR — and raised more than 130 review comments. Every one was addressed before merge. On the most complex migrations the same PR went through 14–16 successive review passes before it was clean.This caught things human reviewers miss. Not because the AIs are smarter than humans — they aren’t — but because they’re tireless, they don’t skim, and they don’t politely defer to a senior engineer’s reputation. They flagged my mistakes the way I’d flag a junior’s. The bar for what gets merged went up, not down.If you take one tactical idea from this post: stop thinking of AI code review as a replacement for human review and start thinking of it as a parallel reviewer with different blind spots. The combination is stronger than either alone. The marginal cost is essentially zero. The downstream effect is real — every one of those 130 comments that got fixed pre-merge is a regression, a rollback, or a customer-facing incident that didn’t happen.CSS and visual work. The single biggest failure mode. Pixel alignment, responsive edge cases, anything where the model couldn’t see the rendered output — the AI confidently produced code that looked right and rendered wrong. I shipped regressions in this category that I had to chase down post-deploy. The bug pattern was consistent enough that I could predict it: any PR that changed component styling needed a manual visual pass that no current AI tool could substitute for.Ignoring explicit requirements. This is the most underreported failure mode in industry writing about AI coding, and the one operators most need to know about. I would write a clear spec — “do X, do not do Y” — and the AI would do X, do Y, and confidently describe its work as faithful to the spec. The fix is not better prompts. The fix is to assume the AI did not read the constraints and to verify every constraint manually. This is a tax on every interaction. It is also why the “let it run overnight” model fails.Legacy contract drift. Replacing one component with another sounds simple until you discover that ten downstream consumers depend on undocumented behavioral quirks of the old component. The AI cannot infer these. It assumes the new contract is the contract. Real users encountering subtly broken behavior is what you get if you don’t catch this. I caught most. I missed a few. I shipped fixes within forty-eight hours each time.Scope creep within a PR. A request to fix one file would, more than once, return as a fix that touched two hundred files. The AI’s ambition exceeded its discretion. Every PR needed an explicit scope check before merge.Test infrastructure side-effects. When the new components were heavier than the old ones in subtle ways, the test suite started running out of memory in CI. This was not a code bug; it was an emergent property of swapping components at scale. The AI generated more tests; it did not understand that more tests in this harness meant CI failure. A junior engineer would have made the same mistake. The senior judgment was knowing where to look.A pattern worth flagging honestly: the migrations where my test share dipped lowest — into the 25–30% range instead of the 64% average — were also the ones that shipped the most regressions in the days that followed. Coverage and quality moved together. That’s not a coincidence. The data wrote its own lesson.If you only measure your AI rollout by feature throughput, you are measuring the wrong thing.The deeper finding from these nine weeks isn’t that I shipped faster. It’s what happened to the human capacity that AI freed up. It didn’t sit idle. It went into the work that always gets deprioritized — the work that quietly compounds margin and risk in every engineering org I’ve ever seen.In the same nine weeks, I also: cleaned up the test harness to fix the OOM issues above, removed unused vulnerable dependencies (one of which had been flagged but not fixed for over a year), rewrote internal documentation on testing standards, authored migration spec documents that the rest of the team now uses as templates, and held a higher code review bar across other people’s work than the team had been holding before.There was a second-order effect I didn't expect: I became braver. Large refactors that I would have flinched at on my own — the kind that touch dozens of files and a decade of accreted assumptions — felt routine, because I had a tireless, patient, and competent partner at the keyboard with me at all times. The fear tax that senior engineers pay on every risky change quietly evaporated. That's where a meaningful share of the reallocated capacity actually came from. Not from typing speed. From courage.The AI didn’t just give me speed; it gave me the cognitive room to invest in the systemic problems that were eating into our engineering capacity in the first place.This is the answer to the question every executive is implicitly asking: “but what do humans do now?” They do the things that always got deprioritized. They raise standards. They mentor. They invest in test infrastructure. They retire technical debt that has been accruing interest for years. The CFO version of this story isn’t just “we ship features X% faster” — it’s “we redirected senior engineering capacity from a tax line into a margin line, and the codebase got healthier in the process.”The single most important organizational decision in any AI rollout is whether security is a partner or a gatekeeper. If they’re gatekeepers, your engineers will route around them, and you’ll have a worse security posture than before AI. If they’re partners, you’ll find the line together.We chose partners. The result, lived not theorized:Allow-lists, not deny-lists. We have an explicit, named list of which integrations the AI is allowed to talk to: our ticketing platform, our internal knowledge platform, our cloud infrastructure tooling, a small handful of others. Database access is currently off-limits. Anything not on the list requires a conversation. The list grows as trust grows.Read-only first, write later. Every new integration started in read-only mode and graduated to write access after we’d watched it work for a while. The first time I asked our security partner why the AI couldn’t write to our ticketing system, the answer was “because we started read-only on purpose, and we can graduate it now.” That answer is the entire model in one sentence.Verbal-approve, formalize-after. When the AI tooling needed an integration approved on a tight deadline, the security team developed a pattern of giving verbal approval in a meeting and formalizing in a ticket afterwards. A small organizational hack that matters more than it sounds. It removes the failure mode where security review becomes a bottleneck and engineers route around it. The integrity of the formal record is preserved. The speed of execution is preserved. The partnership is preserved.The graduation moment. Several weeks into the project, my security partner sent me a message: a particular AI access pattern had moved from “needs my review every time” to “manager-approved by default.” That sentence was the whole methodology working. Trust had been earned, and the system reflected it.Here is the part most people will get wrong if they only read about AI security in the abstract.Approvals are a security control. Approvals are also a UX surface. The fiftieth time you click “approve” on a tool prompt, you don’t read it. You click. The control has failed at that exact moment, regardless of what your policy document says.Approval fatigue is not a UX nit. It’s a security failure mode. The fix is not “more approvals” — it’s fewer, better-designed approvals. Allow-list the proven-safe patterns. Tier the trust. Batch the routine. Reserve the friction for the things that genuinely matter — the ones humans should be reading carefully because the consequences are real.The next year of AI engineering productivity is gated less on model capability than on getting the trust-graduation model right. CTOs and CISOs reading this: this is the conversation. Not “should we use AI.” Not “which model.” It’s “what are we comfortable letting it do without asking, what are we not, and how do we move things from the second category to the first as we learn?”Do it. Pick a real production project, scope it tightly, and ship code yourself for at least a quarter. Not a POC. Not a side project. A thing customers will see. Leading an AI transformation through PowerPoint and vendor demos is no longer enough. Open a terminal.Do it with security in the room. From the first day. Not as a checkpoint, as a partner. The friction you encounter early is the friction your engineers will encounter at scale. Solve it together while you’re small.Stay honest about the failure modes. The strongest version of the AI productivity story has the regressions in it, the requirements ignored, the CSS bugs, the scope creep — all of it. Anyone selling the story without those pieces is selling a story. The real version is more useful and, in the long run, more bullish.I went into this expecting to come out with a measured executive opinion about AI dev tools. I came out remembering why I got into this in the first place. There was a stretch of about three weeks where I didn’t want to do anything else. The flow state of pairing with a tool that types as fast as you can think is something I’d forgotten was possible.That part wasn’t in the plan. But it’s the most honest sentence in this post: I went looking for data, and I found my craft again. If you’re an executive who hasn’t shipped code in years, you might find something similar. It’s worth the nine weeks.No posts
※ 著作権に配慮し、引用は冒頭3段落までです。続きは元記事をご覧ください。