AI coding agents have, in eighteen months, become extraordinary at writing code. They are still terrible at retiring it. This is not a small gap. It is the gap between a tool that helps you ship and a tool that helps you ship without leaving a graveyard behind.
Watch any AI agent work for an hour. It adds a feature flag to test a new code path. It writes the new path. It writes the old path's fallback. It writes the tests. It writes the migration script. It opens a PR. Six PRs later, the flag is hardcoded to true. Six PRs after that, the fallback is dead code. Six months after that, nobody on the team remembers why checkout_v2_redesign still exists in billing/handler.ts:147, and removing it requires a junior engineer with twenty minutes of courage.
The flag is alive because no agent is incentivized to retire it. The agent's success metric is "did this PR ship?" The flag's removal is a future PR by a different agent on a different day, and that PR is harder, scarier, and produces nothing visibly new. So it never happens.
This is going to get materially worse. Every additional AI-assisted feature adds another flag. Flag debt is the flip side of velocity: the faster you ship, the faster the graveyard fills.
The future of clean codebases is not faster writers. It's better retirers.
FlagShark is a primitive in the strict sense: a single capability that does one thing, that other tools can call, that has no opinion about how it's used. The whole API surface is two operations:
scan(repo)    // returns: stale flag candidates with confidence scores
retire(flag)  // returns: a removal PR, ready to review
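As a rough sketch of what that two-operation surface could look like when typed out, here is one possible TypeScript shape. Every name below (FlagCandidate, RemovalPR, FlagPrimitive, the field names, the canned confidence value) is an assumption made for illustration, not FlagShark's published API. The stub returns canned data so the example runs on its own.

```typescript
// Hypothetical types: field names are assumptions, not FlagShark's real schema.
interface FlagCandidate {
  flag: string;         // flag key, e.g. "checkout_v2_redesign"
  confidence: number;   // 0..1, how sure the scanner is that the flag is stale
  references: string[]; // "path:line" locations where the flag is read
}

interface RemovalPR {
  flag: string;
  title: string;
  filesChanged: string[];
}

// The whole surface: two operations, no opinion about who calls them.
interface FlagPrimitive {
  scan(repo: string): FlagCandidate[];
  retire(candidate: FlagCandidate): RemovalPR;
}

// Stub with canned data so the sketch is runnable without a real scanner.
const stub: FlagPrimitive = {
  scan(_repo: string): FlagCandidate[] {
    return [
      {
        flag: "checkout_v2_redesign",
        confidence: 0.97, // made-up score for illustration
        references: ["billing/handler.ts:147"],
      },
    ];
  },
  retire(candidate: FlagCandidate): RemovalPR {
    // Collect the distinct files touched by the flag's references.
    const files: string[] = [];
    for (const ref of candidate.references) {
      const file = ref.split(":")[0];
      if (files.indexOf(file) === -1) files.push(file);
    }
    return {
      flag: candidate.flag,
      title: "Remove stale flag " + candidate.flag,
      filesChanged: files,
    };
  },
};
```

A caller only ever sees FlagPrimitive, which is the point: an agent, an IDE command, and a CI step could all depend on the same two signatures.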
That's the whole product. Everything else (the GitHub Action, the dashboard, the Slack digest, the pricing tiers) is packaging. The primitive itself is small enough that an AI agent could call it as a tool. An IDE extension could call it as a command. A CI step could call it as a check. The point of building a primitive is that you don't get to decide how it's used; whoever needs cleanup PRs can have them.
The most interesting deployment of FlagShark is not on flagshark.com. It's inside someone else's product. The pattern looks like this:
The agent runs flagshark.scan() on the resulting repo state.
Stale candidates are handed to a flagshark.retire() call.
The agent never had to learn how to safely refactor flag-removal patterns across 12 languages. FlagShark already does that. The agent just calls a primitive. This is the unit of leverage: every AI coding agent ships a cleanup capability without owning its complexity.
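From the agent's side, the pattern might reduce to a few lines. The sketch below assumes a hypothetical CleanupTool interface, a hypothetical confidence threshold, and a fake tool with invented flag data (new_pricing_banner is made up). It shows only the calling shape, not how a real integration would authenticate or open PRs.

```typescript
// Hypothetical shapes, chosen for illustration only.
type Candidate = { flag: string; confidence: number };
type RemovalPR = { flag: string; title: string };

interface CleanupTool {
  scan(repo: string): Candidate[];
  retire(candidate: Candidate): RemovalPR;
}

// After the feature work is done: scan, keep high-confidence findings, retire each.
function cleanupAfterFeature(
  tool: CleanupTool,
  repo: string,
  minConfidence: number
): RemovalPR[] {
  const prs: RemovalPR[] = [];
  for (const candidate of tool.scan(repo)) {
    if (candidate.confidence >= minConfidence) {
      prs.push(tool.retire(candidate));
    }
  }
  return prs;
}

// Fake tool with canned findings so the sketch runs without a real scanner.
const fakeTool: CleanupTool = {
  scan: () => [
    { flag: "checkout_v2_redesign", confidence: 0.98 },
    { flag: "new_pricing_banner", confidence: 0.41 }, // invented example flag
  ],
  retire: (c) => ({ flag: c.flag, title: "Remove stale flag " + c.flag }),
};

// Only the high-confidence candidate results in a removal PR.
const opened = cleanupAfterFeature(fakeTool, "acme/shop", 0.9);
```

The agent holds no flag-removal logic of its own here. The threshold is the only policy decision it makes, and even that could be left to the tool.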
A general-purpose AI coding agent can theoretically do flag cleanup. In practice it doesn't, for three reasons:
An AI agent might propose removing checkout_v2_redesign. Is it confident? It produces a probability distribution, not a guarantee. If it's wrong, the cleanup PR breaks production. FlagShark is deterministic: same input, same finding, every time. The same flag scanned by the same tool today and tomorrow produces the same answer. You can't audit a probabilistic refactor. You can audit a deterministic one.
Flag cleanup needs cross-file dataflow: where is the flag referenced, what does each reference do, what tests cover those branches, is the flag's value statically determinable. A general agent loads context into a window that maxes out at hundreds of thousands of tokens. FlagShark walks the AST of the entire repo with no context-window limit. The hard part of cleanup is whole-repo reasoning, and that's structurally easier with a deterministic tool than with an LLM.
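To make "whole-repo" concrete, here is a deliberately simplified stand-in. A real tool would parse each file into an AST and follow dataflow. This sketch only does substring matching over an in-memory file map, and the file contents are invented. What it does illustrate is the shape of the job: visit every file, record every reference, and return results in a stable order so the same input always yields the same findings.

```typescript
// In-memory stand-in for a repository: path -> file contents (invented data).
type RepoFiles = { [path: string]: string };

interface Reference {
  path: string;
  line: number; // 1-based line number
  text: string; // trimmed source line containing the flag
}

// Substring matching, NOT AST analysis: an assumption made to keep this short.
function findFlagReferences(files: RepoFiles, flag: string): Reference[] {
  const refs: Reference[] = [];
  // Sort paths so output order never depends on object insertion order.
  const paths = Object.keys(files).sort();
  for (const path of paths) {
    const lines = files[path].split("\n");
    for (let i = 0; i < lines.length; i++) {
      if (lines[i].indexOf(flag) !== -1) {
        refs.push({ path: path, line: i + 1, text: lines[i].trim() });
      }
    }
  }
  return refs;
}

const repo: RepoFiles = {
  "billing/handler.ts": [
    'import { isEnabled } from "../flags";',
    'if (isEnabled("checkout_v2_redesign")) {',
    "  renderNewCheckout();",
    "}",
  ].join("\n"),
  "billing/handler.test.ts": 'mockFlag("checkout_v2_redesign", true);',
  "README.md": "No flags here.",
};

const refs = findFlagReferences(repo, "checkout_v2_redesign");
```

Nothing in this loop cares how large the repo is, which is the structural advantage the paragraph above describes. The cost is that it only finds what it was explicitly written to look for.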
When an AI agent ships a PR that breaks prod, you can't fire the AI agent. You blame the engineer who didn't review carefully. So the engineer reviews more carefully. So the AI agent's productivity drops. FlagShark's PRs are reviewed in 30 seconds because the diffs are small, the logic is mechanical, and the tool is conservative by design. Trust is earned by being boring, not by being clever.
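What a small, mechanical diff means for a flag that is hardcoded to true can be sketched like this. The helper names (isEnabled, checkoutNew, checkoutLegacy) and the handler bodies are invented for illustration. Only the flag name comes from the example earlier in this piece.

```typescript
// Hypothetical flag store: the flag has been hardcoded to true for months.
const FLAGS: { [key: string]: boolean } = { checkout_v2_redesign: true };

function isEnabled(flag: string): boolean {
  return FLAGS[flag] === true;
}

function checkoutNew(total: number): string {
  return "v2:" + total.toFixed(2);
}

function checkoutLegacy(total: number): string {
  return "v1:" + total.toFixed(2);
}

// Before: the fallback branch can never run, but it still has to be read and maintained.
function handleCheckoutBefore(total: number): string {
  if (isEnabled("checkout_v2_redesign")) {
    return checkoutNew(total);
  }
  return checkoutLegacy(total); // dead code
}

// After: the flag check and the dead branch are gone; behavior is unchanged.
function handleCheckoutAfter(total: number): string {
  return checkoutNew(total);
}
```

The reviewer's job is to confirm that the two functions agree for every input while the flag is true, which is a question about a handful of lines, not about the feature.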
Feature flag cleanup is the first capability. It's a beachhead, not the product. The longer-term primitive is "all the categories of code that exist but shouldn't":
Each one is the same shape: deterministic AST analysis, conservative findings, mechanical PRs. Each one is a capability an AI coding agent can call. The roadmap is the codebase-health stack.
If you're at an AI lab building a coding agent product, the question is: do you build cleanup as a feature, or do you call a primitive that someone else maintains? The answer should be the same as why your agent uses someone else's deterministic linter, type checker, and test runner instead of LLM-grading every line. Some things should be boring infrastructure. Cleanup PRs are one of them.
"You can write the new code in an hour. You can write the removal of the old code in five minutes, but only if someone built the tool that lets you. Otherwise the removal is a six-month migration."
— Every senior engineer who has ever cleaned up after a year of feature work
FlagShark is currently four things: a GitHub Action, a SaaS dashboard, an open-source scanner, and a brand. None of those are the actual product. The actual product is the primitive. If the right path is an OEM relationship with an AI coding tool, where FlagShark becomes the cleanup capability inside Claude Code or Cursor or Copilot, that's the path we're optimizing for. If the right path is staying independent and serving every AI tool equally, that works too. The product is the same.
Either way: when an AI ships your code, someone has to clean it up. We'd like it to be us.
A first-party Claude Code skill that runs FlagShark on the current repo. Findings appear inline in the conversation. /flagshark retire opens the removal PR via Claude Code's existing PR workflow.
user: /flagshark scan claude: found 4 stale flags · open all retirement PRs?
Cursor extension: keyboard shortcut runs FlagShark on the open repo. Findings appear in the sidebar with one-click retirement PRs opened via Cursor's GitHub integration.
Cursor sidebar · FlagShark · 4 stale flags found · [Retire all]
A tool registered in GitHub Copilot's agent mode. Agent invokes it as part of post-implementation hygiene. The agent never needs to learn flag-cleanup logic; it just calls the primitive.
tool_call: flagshark.scan() · result: 4 candidates
We have a working scanner. You have a coding product. The integration story above is real and shippable. If you're at Anthropic, OpenAI, Cursor, GitHub Next, Sourcegraph, or any team shipping AI-mediated code workflows, we'd love a 30-minute chat about what an OEM or first-party integration could look like.