Manifesto · 2026

The cleanup primitive your AI agent doesn't have.

by Joe McGrath · founder, FlagShark · May 2026

AI coding agents have, in eighteen months, become extraordinary at writing code. They are still terrible at retiring it. This is not a small gap. It is the gap between a tool that helps you ship and a tool that helps you ship without leaving a graveyard behind.

Watch any AI agent work for an hour. It adds a feature flag to test a new code path. It writes the new path. It writes the old path's fallback. It writes the tests. It writes the migration script. It opens a PR. Six PRs later, the flag is hardcoded to true. Six PRs after that, the fallback is dead code. Six months after that, nobody on the team remembers why checkout_v2_redesign still exists in billing/handler.ts:147, and removing it requires a junior engineer with twenty minutes of courage.

The flag is alive because no agent is incentivized to retire it. The agent's success metric is did this PR ship. The flag's removal is a future PR by a different agent on a different day, and that PR is harder, scarier, and produces nothing visibly new. So it never happens.

This is going to get materially worse. Every additional AI-assisted feature adds another flag. Flag debt is the inverse of velocity: the faster you ship, the faster the graveyard fills.

The future of clean codebases is not faster writers. It's better retirers.

What a cleanup primitive actually is

FlagShark is a primitive in the strict sense: a single capability that does one thing, that other tools can call, that has no opinion about how it's used. The whole API surface is two operations:

scan(repo)     // returns: stale flag candidates with confidence scores
retire(flag)   // returns: a removal PR, ready to review

That's the whole product. Everything else (the GitHub Action, the dashboard, the Slack alerts, the pricing tiers) is packaging. The primitive itself is small enough that an AI agent could call it as a tool. An IDE extension could call it as a command. A CI step could call it as a check. The point of building a primitive is that you don't get to decide how it's used; whoever needs cleanup PRs can have them.
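As a rough sketch, the two operations could be typed like this. Only `scan` and `retire` come from the text above; the field names (`confidence`, `references`, `filesChanged`), the `makeStub` helper, the `acme/billing` repo name, and the second reference location are assumptions for illustration, not FlagShark's published API.

```typescript
// Hypothetical shapes. Only scan() and retire() come from the text above.
interface StaleFlagCandidate {
  flag: string;         // e.g. "checkout_v2_redesign"
  confidence: number;   // assumed range 0..1, higher means more likely stale
  references: string[]; // assumed "path:line" locations
}

interface RemovalPR {
  flag: string;
  title: string;
  filesChanged: string[];
}

// Synchronous for brevity; a real client would presumably be async.
interface CleanupPrimitive {
  scan(repo: string): StaleFlagCandidate[];
  retire(flag: string): RemovalPR;
}

// In-memory stub so a caller can be written and tested without a real backend.
function makeStub(known: StaleFlagCandidate[]): CleanupPrimitive {
  return {
    scan: (_repo) => known.slice(),
    retire: (flag) => {
      const match = known.filter((c) => c.flag === flag)[0];
      if (!match) throw new Error("unknown flag: " + flag);
      // Derive touched files from the "path:line" references, without duplicates.
      const files: string[] = [];
      for (const ref of match.references) {
        const file = ref.split(":")[0];
        if (files.indexOf(file) === -1) files.push(file);
      }
      return { flag: flag, title: "Retire stale flag " + flag, filesChanged: files };
    },
  };
}

const stub = makeStub([
  {
    flag: "checkout_v2_redesign",
    confidence: 0.97,
    references: ["billing/handler.ts:147", "billing/handler.ts:152"],
  },
]);
const pr = stub.retire(stub.scan("acme/billing")[0].flag);
console.log(pr.title, pr.filesChanged);
```

Keeping the surface this small is what lets an agent, an IDE command, or a CI step all sit behind the same two calls.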

The shape of the integration

The most interesting deployment of FlagShark is not on flagshark.com. It's inside someone else's product. The pattern looks like this:

1. An AI agent (Claude Code, Cursor, Copilot Agent, Amp) finishes implementing a feature.
2. As part of its post-merge hygiene, it invokes flagshark.scan() on the resulting repo state.
3. For each finding above a confidence threshold, it queues a flagshark.retire() call.
4. The user sees: their AI shipped the feature, and a day later, a small "retired 3 old flags from your repo" notification.

The agent never had to learn how to safely refactor flag-removal patterns across seven languages. FlagShark already does that. The agent just calls a primitive. This is the unit of leverage: every AI coding agent ships a cleanup capability without owning its complexity.
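The pattern above might look roughly like this from the agent's side. This is a minimal sketch: the `Cleanup` type, the `postMergeHygiene` function, the 0.9 threshold, the `new_onboarding` flag, and the `acme/billing` repo name are all hypothetical, and the calls are synchronous only to keep the example short.

```typescript
// Hypothetical types; the text only specifies scan(), retire(), and a confidence threshold.
type Finding = { flag: string; confidence: number };
type Cleanup = {
  scan: (repo: string) => Finding[];
  retire: (flag: string) => string; // returns an identifier for the removal PR
};

// Post-merge hygiene: scan, keep findings at or above the threshold, retire each one.
function postMergeHygiene(cleanup: Cleanup, repo: string, threshold: number): string[] {
  const retired: string[] = [];
  for (const finding of cleanup.scan(repo)) {
    if (finding.confidence < threshold) continue; // leave uncertain flags for a human
    retired.push(cleanup.retire(finding.flag));
  }
  return retired;
}

// Usage with a fake client standing in for the real primitive.
const fake: Cleanup = {
  scan: () => [
    { flag: "checkout_v2_redesign", confidence: 0.97 },
    { flag: "new_onboarding", confidence: 0.42 },
  ],
  retire: (flag) => "PR: retire " + flag,
};

const opened = postMergeHygiene(fake, "acme/billing", 0.9);
console.log(opened); // one entry, for checkout_v2_redesign
```

The agent's own code stays a loop and a threshold; everything language-specific lives behind `retire`.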

Why this is specifically hard for general-purpose AI agents

A general-purpose AI coding agent can theoretically do flag cleanup. In practice it doesn't, for three reasons:

1. The confidence problem

An AI agent might propose removing checkout_v2_redesign. Is it confident? It produces a probability distribution, not a guarantee. If it's wrong, the cleanup PR breaks production. FlagShark is deterministic: same input, same finding, every time. The same flag scanned by the same tool today and tomorrow produces the same answer. You can't audit a probabilistic refactor. You can audit a deterministic one.
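One way to make "same input, same finding" checkable is to reduce each run to a canonical form and compare runs directly. The sketch below assumes a hypothetical `Finding` shape and helper names; it illustrates the audit idea and says nothing about how FlagShark stores or compares results internally.

```typescript
// Hypothetical finding shape, for illustration only.
type Finding = { flag: string; file: string; line: number };

// Canonical form: one string per finding, sorted so traversal order does not matter.
function canonicalize(findings: Finding[]): string {
  return findings
    .map((f) => f.flag + "@" + f.file + ":" + f.line)
    .sort()
    .join("\n");
}

// Two runs over the same input are reproducible if their canonical forms match.
function isReproducible(runA: Finding[], runB: Finding[]): boolean {
  return canonicalize(runA) === canonicalize(runB);
}

const today: Finding[] = [
  { flag: "checkout_v2_redesign", file: "billing/handler.ts", line: 147 },
  { flag: "checkout_v2_redesign", file: "billing/handler.test.ts", line: 12 },
];
// Same findings, discovered in a different order.
const tomorrow: Finding[] = [today[1], today[0]];

console.log(isReproducible(today, tomorrow)); // true
console.log(isReproducible(today, [today[0]])); // false
```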

2. The context problem

Flag cleanup needs cross-file dataflow: where is the flag referenced, what does each reference do, what tests cover those branches, is the flag's value statically determinable. A general agent loads context into a window that maxes out at hundreds of thousands of tokens. FlagShark parses each file directly (a tree-sitter AST for TypeScript, JavaScript, Go, Python, Java, C#, PHP, and Rust; precise patterns for the rest) and reasons across the whole repo without context-window limits. The hard part of cleanup is whole-repo reasoning, and that's structurally easier with a deterministic tool than with an LLM.
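To make "where is the flag referenced" concrete, here is a deliberately simplified stand-in: a pattern-based search over an in-memory file map. It is not the tree-sitter analysis described above, and `findReferences`, `flags.isEnabled`, `mockFlag`, and the sample files are hypothetical. The point is only that the search runs over every file, with no window to fit into.

```typescript
type FlagReference = { file: string; line: number; text: string };

// Escape regex metacharacters so the flag name is matched literally.
function escapeRegExp(s: string): string {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

function findReferences(files: Record<string, string>, flag: string): FlagReference[] {
  // Word boundaries keep checkout_v2_redesign from matching checkout_v2_redesign_old.
  const pattern = new RegExp("\\b" + escapeRegExp(flag) + "\\b");
  const refs: FlagReference[] = [];
  // Sort file names so the output order is stable across runs.
  for (const file of Object.keys(files).sort()) {
    const lines = files[file].split("\n");
    for (let i = 0; i < lines.length; i++) {
      if (pattern.test(lines[i])) {
        refs.push({ file: file, line: i + 1, text: lines[i].trim() });
      }
    }
  }
  return refs;
}

// A tiny hypothetical repo held in memory.
const repo: Record<string, string> = {
  "billing/handler.ts": 'if (flags.isEnabled("checkout_v2_redesign")) {\n  renderV2();\n}',
  "billing/handler.test.ts": 'mockFlag("checkout_v2_redesign", true);',
  "cart/legacy.ts": 'const name = "checkout_v2_redesign_old";',
};

const refs = findReferences(repo, "checkout_v2_redesign");
console.log(refs.map((r) => r.file + ":" + r.line));
// ["billing/handler.test.ts:1", "billing/handler.ts:1"]
```

A text match only answers the first of the four questions. What each reference does and whether the value is statically determinable need the syntax tree.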

3. The trust problem

When an AI agent ships a PR that breaks prod, you can't fire the AI agent. You blame the engineer who didn't review carefully. So the engineer reviews more carefully. So the AI agent's productivity drops. FlagShark's PRs are reviewed in 30 seconds because the diffs are small, the logic is mechanical, and the tool is conservative by design. Trust is earned by being boring, not by being clever.
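For a sense of what "small and mechanical" means, here is a hypothetical before and after for a flag that has been hardcoded to true. The flag client and the render functions are made up for illustration; only the flag name comes from the text.

```typescript
// Hypothetical flag client and handlers, for illustration only.
const flags = { isEnabled: (_name: string): boolean => true };

function renderCheckoutV1(total: number): string {
  return "v1:" + total;
}
function renderCheckoutV2(total: number): string {
  return "v2:" + total.toFixed(2);
}

// Before: the flag always evaluates to true, so the else branch is dead.
function checkoutBefore(total: number): string {
  if (flags.isEnabled("checkout_v2_redesign")) {
    return renderCheckoutV2(total);
  } else {
    return renderCheckoutV1(total);
  }
}

// After: the conditional is gone and only the live branch remains.
// renderCheckoutV1 can be deleted too once nothing else references it.
function checkoutAfter(total: number): string {
  return renderCheckoutV2(total);
}

console.log(checkoutBefore(19.5) === checkoutAfter(19.5)); // true
```

A reviewer only has to confirm that the kept branch is the one the flag already selected, which is why a diff like this can be read in seconds.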

What we'd like to be: the longer-term primitive

Feature flag cleanup is the first capability. It's a beachhead, not the product. The longer-term primitive is "all the categories of code that exist but shouldn't":

Stale feature flags · live (Q1 2026)
Dead code branches · planned Q3 2026
Unreferenced config keys (YAML, JSON, env) · planned Q4 2026
Unused dependencies (beyond Dependabot's tree analysis) · planned 2027
Eventually: any pattern where the AST proves the code is unreachable, unused, or replaced

Each one is the same shape: deterministic AST analysis, conservative findings, mechanical PRs. Each one is a capability an AI coding agent can call. The roadmap is the codebase-health stack.

The case for AI labs to own this layer

If you're at an AI lab building a coding agent product, the question is: do you build cleanup as a feature, or do you call a primitive that someone else maintains? The reasoning is the same one that already has your agent using someone else's deterministic linter, type checker, and test runner instead of LLM-grading every line. Some things should be boring infrastructure. Cleanup PRs are one of them.

"You can write the new code in an hour. You can write the removal of the old code in five minutes, but only if someone built the tool that lets you. Otherwise the removal is a six-month migration."
— Every senior engineer who has ever cleaned up after a year of feature work

FlagShark is currently four things: a GitHub Action, a SaaS dashboard, an open-source scanner, and a brand. None of those are the actual product. The actual product is the primitive. If the right path is an OEM relationship with an AI coding tool, where FlagShark becomes the cleanup capability inside Claude Code or Cursor or Copilot, that's the path we're optimizing for. If the right path is staying independent and serving every AI tool equally, that works too. The product is the same.

Either way: when an AI ships your code, someone has to clean it up. We'd like it to be us.

The surfaces FlagShark could ship through

Three integration specimens.

Claude Code · skill

target: Q3 2026

/flagshark scan

A first-party Claude Code skill that runs FlagShark on the current repo. Findings appear inline in the conversation. /flagshark retire opens the removal PR via Claude Code's existing PR workflow.

user: /flagshark scan
claude: found 4 stale flags · open all retirement PRs?

Cursor · extension

target: Q3 2026

cmd-shift-F

A Cursor extension whose keyboard shortcut runs FlagShark on the open repo. Findings appear in the sidebar, with one-click retirement PRs opened via Cursor's GitHub integration.

Cursor sidebar · FlagShark · 4 stale flags found · [Retire all]

Copilot Agent · tool

target: Q4 2026

flagshark()

A tool registered in GitHub Copilot's agent mode. The agent invokes it as part of post-implementation hygiene. The agent never needs to learn flag-cleanup logic; it just calls the primitive.

tool_call: flagshark.scan() · result: 4 candidates

For AI labs & IDE teams

If you're building a coding agent, let's talk.

We have a working scanner. You have a coding product. The integration story above is real and shippable. If you're at Anthropic, OpenAI, Cursor, GitHub Next, Sourcegraph, or any team shipping AI-mediated code workflows, we'd love a 30-minute chat about what an OEM or first-party integration could look like.
