Most teams adopt feature flags reactively. Someone needs a kill switch, another team wants to A/B test a checkout flow, a third is migrating a database and needs a safe rollback path. Flags get added, and eventually you have dozens -- then hundreds -- of them scattered across your codebase with no shared process for managing them.
The gap between "we use feature flags" and "we manage feature flags well" is massive. It is the difference between a team that ships confidently with flags as precision instruments and a team that treats every flag as permanent infrastructure because nobody is sure what is safe to remove. A maturity model gives you a way to measure where you are today and a concrete path toward where you need to be.
TL;DR: Feature flag maturity progresses through five stages: ad-hoc (no process), basic tracking (naming conventions and spreadsheets), lifecycle management (ownership and expiry dates), automated detection and alerting, and fully automated cleanup with health scores. Most teams are stuck at Stage 1-2. Advancing requires investing in tooling and process at each stage -- the jump from Stage 3 to Stage 4 is where dedicated tooling like FlagShark provides the most leverage.
What is a feature flag maturity model?
A maturity model maps an organization's practices against progressive stages of sophistication. You have probably encountered maturity models for CI/CD, DevOps, or cloud adoption. The concept is the same for feature flags: it captures how intentionally your team manages flags from the moment they are created through to the moment they are removed from the codebase.
Why does this matter? Because immature flag practices create compounding technical debt. A flag added without an owner, a naming convention, or an expiry date is a flag that will almost certainly never be removed. Multiply that across every engineer on every team over months and years, and you end up with a codebase where conditional logic is layered so deep that simple changes take days instead of hours. The maturity model gives your team a shared vocabulary for diagnosing where the process breaks down and what to fix first.
The model below has five stages. Most organizations operate somewhere between Stage 1 and Stage 2. The teams that reach Stage 4 or 5 consistently report faster development velocity, fewer production incidents related to flag interactions, and shorter onboarding times for new engineers.
What are the stages of feature flag maturity?
Each stage builds on the previous one. You cannot skip stages -- a team that jumps to automated cleanup without first establishing ownership and naming conventions will automate chaos rather than eliminate it.
Stage 1: Ad-hoc flags
This is where every team starts. Flags are added wherever needed, by whoever needs them, with no shared conventions or tracking. There is no central inventory, no ownership, and no cleanup process.
What it looks like in practice:
- Flag names are arbitrary: `new_feature`, `test_flag`, `temp_fix`, `johns_experiment`, `v2_checkout`
- Nobody can answer "how many flags do we have?" without running a grep across every repository
- Flags are created in pull requests with no discussion about when they will be removed
- The flag management platform (if one exists) has dozens of flags that no longer appear in the codebase, and the codebase has flag evaluations that do not exist in the platform
- Cleanup rate is near zero -- flags are only removed when someone stumbles across one during an unrelated refactor
The core problem at Stage 1: Flags are treated as disposable implementation details rather than managed artifacts. There is no lifecycle because nobody has decided that flags should have one.
Who is here: Teams that adopted feature flags within the last year, teams without a flag management platform, and teams where individual engineers make all flag decisions independently. This is the most common stage -- in our experience, the majority of engineering teams that use feature flags operate here.
Stage 2: Basic tracking
The team recognizes that ad-hoc flags create confusion and takes the first step toward order. Someone introduces a naming convention. Maybe a spreadsheet or wiki page appears where flags are listed. Code reviewers start asking "do we have a flag for this already?" before new ones are created.
What it looks like in practice:
- A naming convention exists: `YYYY-MM_team_feature` or `release_checkout_redesign` or similar
- A shared document (spreadsheet, Notion page, wiki) lists active flags with basic metadata
- Pull request reviewers check that new flags follow naming conventions
- Someone on the team periodically reminds others to update the tracking document
- A quarterly "flag cleanup day" is attempted (with mixed results)
The core problem at Stage 2: Visibility exists, but enforcement is manual and decays over time. The spreadsheet goes stale within weeks because updating it is not part of any automated workflow. The naming convention has exceptions because there is no automated check. The cleanup day removes a few flags, but the backlog grows faster than it shrinks.
The improvement over Stage 1: The team can answer basic questions like "what flags do we have?" and "who created this flag?" -- but the answers are only as current as the last time someone updated the tracking document.
Stage 3: Lifecycle management
This is where process matures from tracking into actual management. Every flag gets an owner and an expiry date at the time it is created. Your flag management platform's metadata features are used: LaunchDarkly tags, Unleash lifecycle markers, or equivalent. There is a regular cadence for reviewing stale flags, and code review checklists include flag hygiene.
What it looks like in practice:
- Every new flag has a named owner (an individual, not a team) and an expiration date
- Flag metadata in the management platform includes purpose, owner, creation date, and expected removal date
- The team runs monthly flag reviews where stale flags are identified and cleanup tickets are created
- Code review checklists include: "Does this flag have an owner? Does it have an expiry date? Is a cleanup ticket created?"
- Flag lifecycle stages are understood and roughly followed: creation, rollout, stabilization, deprecation, cleanup
The core problem at Stage 3: Everything depends on humans remembering to follow the process. The monthly review happens when the team lead remembers to schedule it. Cleanup tickets are created but deprioritized in favor of feature work. Expiration dates pass without consequence because there is no automated enforcement. Engineers who leave the company take their flag ownership context with them.
The improvement over Stage 2: The team has genuine lifecycle awareness. Flags are no longer treated as permanent -- they are understood to have finite lifetimes. But the gap between intention and execution is wide, because enforcement remains manual.
This is where most "good" teams plateau. Stage 3 feels like enough. The process exists, the reviews happen (mostly), and the flag count grows slowly instead of explosively. The problem is that "slowly" still compounds. A team that adds 2 net flags per month has 24 more stale flags after a year, 48 after two. The ratchet only tightens.
Stage 4: Automated detection and alerting
The jump from Stage 3 to Stage 4 is the most significant in the model because it replaces human discipline with automated systems. Tooling detects when flags are added or removed in pull requests without anyone needing to check. Alerts fire automatically when flags exceed age thresholds. A dashboard shows flag health metrics that the entire team can see.
What it looks like in practice:
- CI/CD integration automatically detects when a pull request adds or removes a feature flag
- PR comments surface flag metadata: "This PR adds 2 new flags. This PR removes 1 flag."
- Automated alerts fire when flags exceed configured age thresholds (e.g., 30 days for release flags, 60 days for experiment flags)
- A dashboard shows real-time flag health metrics: total count, age distribution, flags per team, cleanup velocity, stale percentage
- Flag ownership changes are tracked automatically when engineers leave or change teams
- The team can answer questions like "which flags are oldest?" and "which team has the most stale flags?" in seconds, not hours
The core problem at Stage 4: Detection and alerting create visibility, but the actual cleanup work is still manual. An alert that says "flag `release_checkout_v2` is 45 days old" is useful, but someone still needs to trace every reference to that flag across the codebase, determine which code path to keep, update tests, and submit a PR. This manual cleanup work is the bottleneck that prevents most Stage 4 teams from achieving consistently low stale flag percentages.
The improvement over Stage 3: The process no longer depends on human memory. Flags cannot be added silently. Stale flags cannot age indefinitely without someone being notified. Metrics create accountability at the team level. This is where flag governance becomes enforceable rather than aspirational.
What is required to reach Stage 4: Integration between your flag management platform and your codebase. This can be built internally (CI scripts that parse diffs for flag SDK calls) or adopted via dedicated tooling. The key capability is automated detection of flag additions and removals in pull requests, combined with persistent tracking of flag age and ownership.
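As a concrete (if simplified) sketch of the build-it-internally path, a CI step can scan each pull request's diff for flag SDK calls. The `is_enabled`/`variation` call names below are placeholders for whatever your flag client actually exposes:

```python
import re

# Hypothetical SDK call patterns; adjust to your flag client's API.
FLAG_CALL = re.compile(r'(?:is_enabled|variation)\(\s*["\']([\w.-]+)["\']')

def detect_flag_changes(diff_text: str) -> dict:
    """Scan a unified diff for flag evaluations added or removed."""
    added, removed = set(), set()
    for line in diff_text.splitlines():
        # Skip diff file headers like '+++ b/file.py' and '--- a/file.py'.
        if line.startswith(("+++", "---")):
            continue
        if line.startswith("+"):
            added.update(FLAG_CALL.findall(line))
        elif line.startswith("-"):
            removed.update(FLAG_CALL.findall(line))
    # A flag appearing on both sides was merely moved or edited, not changed.
    return {"added": sorted(added - removed), "removed": sorted(removed - added)}

diff = """\
--- a/checkout.py
+++ b/checkout.py
-    if flags.is_enabled("release_old_checkout"):
+    if flags.is_enabled("release_checkout_v2"):
"""
print(detect_flag_changes(diff))
# {'added': ['release_checkout_v2'], 'removed': ['release_old_checkout']}
```

A script like this, run in CI and posted as a PR comment, is enough to stop flags from being added silently; the persistent age and ownership tracking still needs a datastore behind it.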
Stage 5: Fully automated cleanup
Stage 5 is the aspirational state. Stale flags are not just detected and surfaced -- they are automatically cleaned up. The system identifies flags that have exceeded their lifecycle, generates cleanup pull requests that remove the flag evaluation and dead code paths, and presents these PRs for human review and merge.
What it looks like in practice:
- When a flag reaches its expiration threshold, a cleanup PR is automatically generated
- The cleanup PR uses AST-level code transformation (not regex) to remove flag evaluations, dead code branches, and unused imports
- The PR includes context: flag age, owner, original purpose, which code path is being preserved
- Flag health score is a first-class engineering metric, tracked alongside deployment frequency and test coverage
- Net flag growth is consistently at or below zero -- the team removes flags at least as fast as they create them
- Flag debt is treated the same way security vulnerabilities are: detected automatically, surfaced with context, and expected to be resolved within a defined SLA
The core problem at Stage 5: There is not one. This is the target state. The remaining challenge is maintaining the system and calibrating thresholds as the organization evolves.
What is required to reach Stage 5: AST-level code parsing and transformation across every language in your codebase. Regex-based cleanup is fragile and error-prone -- it cannot reliably determine which code path to keep when a flag is removed, handle nested conditionals, or clean up imports and variables that become unused after flag removal. This is why Stage 5 requires specialized tooling like FlagShark, which uses tree-sitter parsing across 11 programming languages to generate safe, accurate cleanup PRs.
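To see why AST-level transformation is the right primitive, here is a deliberately simplified, single-language illustration using Python's built-in `ast` module (not how FlagShark itself works). It keeps the enabled branch of an `if is_enabled("..."):` block and drops the dead one; a production tool must also handle negated checks, flag values bound to variables, and imports that become unused:

```python
import ast

class FlagRemover(ast.NodeTransformer):
    """Replace `if is_enabled("<flag>"):` blocks with the 'on' branch.

    Simplified sketch of AST-level cleanup for a single pattern.
    """
    def __init__(self, flag_name: str):
        self.flag_name = flag_name

    def visit_If(self, node: ast.If):
        self.generic_visit(node)  # handle nested conditionals first
        test = node.test
        if (isinstance(test, ast.Call)
                and isinstance(test.func, ast.Name)
                and test.func.id == "is_enabled"
                and test.args
                and isinstance(test.args[0], ast.Constant)
                and test.args[0].value == self.flag_name):
            return node.body  # keep the enabled path, drop the dead branch
        return node

source = """
def checkout(cart):
    if is_enabled("release_checkout_v2"):
        return new_checkout(cart)
    else:
        return old_checkout(cart)
"""
tree = FlagRemover("release_checkout_v2").visit(ast.parse(source))
print(ast.unparse(ast.fix_missing_locations(tree)))
# prints the function with only the new_checkout path remaining
```

Even this toy version shows what regex cannot do: it operates on the parsed structure, so it knows exactly where the `else` branch begins and ends regardless of formatting or nesting.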
How do you assess your team's feature flag maturity level?
Use the following self-assessment to identify your current stage. Answer each question honestly -- the goal is an accurate diagnosis, not a flattering one.
| Question | If "Yes," You Have At Least... |
|---|---|
| Do you use feature flags in your codebase? | Stage 1 |
| Do your flags follow a naming convention? | Stage 2 |
| Can you list every active flag in your codebase? | Stage 2 |
| Does every flag have a documented owner? | Stage 3 |
| Does every flag have an expiration date? | Stage 3 |
| Do you conduct regular flag reviews (monthly or more)? | Stage 3 |
| Are flag additions and removals automatically detected in PRs? | Stage 4 |
| Are you automatically alerted when flags exceed age thresholds? | Stage 4 |
| Do you have a dashboard showing flag health metrics? | Stage 4 |
| Are cleanup PRs generated automatically for stale flags? | Stage 5 |
| Is your net flag growth at or below zero? | Stage 5 |
| Is flag health a tracked engineering metric alongside deploy frequency? | Stage 5 |
How to read your results: Your maturity level is the highest stage where you can answer "yes" to every question in that stage and all stages below it. If you have naming conventions (Stage 2) but not every flag has an owner (Stage 3), you are at Stage 2 -- even if you have a metrics dashboard (Stage 4). Maturity is sequential; skipping stages creates gaps that undermine the higher-level practices.
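The reading rule can be expressed directly in code. A minimal sketch, with hypothetical answers to the table's questions grouped by stage:

```python
# Answers keyed by stage, mirroring the self-assessment table above
# (hypothetical team: solid Stage 2, gaps at Stage 3, partial Stage 4).
answers = {
    1: [True],                  # flags in use
    2: [True, True],            # naming convention, full inventory
    3: [True, True, False],     # owners, expiry dates, monthly reviews
    4: [True, False, True],     # PR detection, age alerts, dashboard
    5: [False, False, False],   # auto cleanup PRs, net growth <= 0, tracked metric
}

def maturity_stage(answers: dict[int, list[bool]]) -> int:
    """Highest stage N where every question in stages 1..N is 'yes'."""
    stage = 0
    for n in sorted(answers):
        if all(answers[n]):
            stage = n
        else:
            break  # a gap at this stage caps the result, per the rule above
    return stage

print(maturity_stage(answers))
# 2 -- the missed Stage 3 questions cap the result,
# even though some Stage 4 answers are "yes"
```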
What strategies help you advance to the next maturity stage?
Each transition requires specific investments. Here is what to focus on at each step.
Moving from Stage 1 to Stage 2: Establish conventions
This is the cheapest transition. It requires no tooling -- just agreement.
- Adopt a naming convention and document it in your engineering handbook. A simple pattern like `{type}_{feature}_{date}` (e.g., `release_checkout_redesign_2026_03`) gives every flag a parseable name that conveys purpose and expected lifetime.
- Create a shared tracking document. A spreadsheet with columns for flag name, owner, creation date, purpose, and expected removal date is sufficient. Do not over-engineer this -- a Google Sheet works fine at this stage.
- Add flag hygiene to your PR review checklist. Reviewers should verify that new flags follow the naming convention and appear in the tracking document.
- Run your first flag audit. Search your codebase for flag SDK calls, list every flag, and populate the tracking document. This one-time effort creates the baseline you need for everything that follows.
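A parseable naming convention pays off immediately because even simple tooling can validate and decompose names. A minimal sketch, assuming the hypothetical `{type}_{feature}_{yyyy}_{mm}` pattern above:

```python
import re

# Hypothetical convention: {type}_{feature}_{yyyy}_{mm}; adjust to yours.
NAME = re.compile(r'^(release|experiment|ops)_(.+)_(\d{4})_(\d{2})$')

def parse_flag_name(name: str):
    """Validate a flag name and pull out its type, feature, and date."""
    m = NAME.match(name)
    if not m:
        return None  # fails the convention; flag it in code review
    flag_type, feature, year, month = m.groups()
    return {"type": flag_type, "feature": feature,
            "created": f"{year}-{month}"}

print(parse_flag_name("release_checkout_redesign_2026_03"))
# {'type': 'release', 'feature': 'checkout_redesign', 'created': '2026-03'}
print(parse_flag_name("johns_experiment"))
# None -- Stage 1 naming, rejected by the convention
```

A check like this can run in CI from day one, which is what keeps the convention from accumulating exceptions.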
Moving from Stage 2 to Stage 3: Add ownership and lifecycle
This transition requires process changes and buy-in from the team.
- Make ownership mandatory at creation. Every new flag must have an individual owner recorded in the flag management platform and in the tracking document. "The team" is not an owner.
- Set expiration dates for every flag type. Release flags: 30-90 days. Experiment flags: 14-30 days after the experiment ends. Operational flags: reviewed annually. Document these expectations and enforce them during code review.
- Establish a monthly flag review ritual. A 20-minute meeting where the team reviews flags approaching or past their expiration date. Assign cleanup owners and create tickets with deadlines.
- Create cleanup tickets at flag creation time. When an engineer creates a flag, they also create the ticket for its removal. This front-loads the planning and prevents the "we'll get to it later" anti-pattern.
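The expiration policy above can live in a small table that reviewers (or later, tooling) consult when a flag is created. A sketch using the upper bounds suggested above -- the specific windows are assumptions to tune for your team:

```python
from datetime import date, timedelta

# Expiry windows from the policy above; tune these to your team's norms.
EXPIRY_DAYS = {
    "release": 90,      # 30-90 days; using the upper bound here
    "experiment": 30,   # counted from when the experiment ends
    "ops": 365,         # operational flags reviewed annually
}

def expiry_date(flag_type: str, created: date) -> date:
    """When a flag of the given type should be removed or re-reviewed."""
    return created + timedelta(days=EXPIRY_DAYS[flag_type])

print(expiry_date("release", date(2026, 3, 1)))  # 2026-05-30
```

Recording this date at creation time, alongside the cleanup ticket, is what makes the monthly review a lookup instead of an archaeology session.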
Moving from Stage 3 to Stage 4: Integrate automated detection
This transition requires tooling investment -- either building internal tools or adopting external ones.
- Integrate flag detection into your CI/CD pipeline. At minimum, your CI should parse pull request diffs to detect when flags are added or removed and leave a comment on the PR with relevant metadata.
- Build or adopt a flag health dashboard. Track total flag count, age distribution, stale percentage, cleanup ratio (flags removed / flags created), and flags per team. These metrics are the leading indicators that tell you whether your practices are working.
- Configure automated alerts for stale flags. When a flag passes its expiration date, the owner should receive a notification. When it passes a second threshold (e.g., 14 days past expiration), the engineering manager should be notified.
- Track ownership changes automatically. When an engineer leaves the team or the company, their flags should be surfaced for ownership transfer -- not discovered months later during an incident.
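Once flag metadata is structured, the dashboard metrics above reduce to a few lines of arithmetic. A minimal sketch with made-up flag records (a real system would pull these from the flag management platform):

```python
from datetime import date

# Minimal made-up flag records for illustration.
flags = [
    {"name": "release_checkout_v2", "created": date(2026, 1, 5),
     "expires": date(2026, 3, 5)},
    {"name": "experiment_promo_banner", "created": date(2026, 2, 1),
     "expires": date(2026, 3, 1)},
    {"name": "ops_read_replica", "created": date(2025, 6, 1),
     "expires": date(2026, 6, 1)},
]

def health_metrics(flags, today, removed_this_quarter=0, created_this_quarter=1):
    """Compute the dashboard numbers: staleness, age, and cleanup ratio."""
    stale = [f for f in flags if f["expires"] < today]
    return {
        "total": len(flags),
        "stale_pct": round(100 * len(stale) / len(flags), 1),
        "oldest": max(flags, key=lambda f: today - f["created"])["name"],
        # cleanup ratio: flags removed / flags created over the same window
        "cleanup_ratio": removed_this_quarter / max(created_this_quarter, 1),
    }

print(health_metrics(flags, today=date(2026, 4, 1)))
```

Alerting is then just a threshold check over the same records: notify the owner when `expires < today`, and escalate to the manager when the overrun passes the second threshold.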
Moving from Stage 4 to Stage 5: Adopt automated cleanup
This is the most significant tooling investment, but it delivers the highest return.
- Adopt AST-level cleanup tooling. Tools like FlagShark use tree-sitter parsing to understand your code's structure and generate cleanup PRs that safely remove flag evaluations, dead code branches, and orphaned imports. This is not achievable with regex-based approaches.
- Automate cleanup PR generation. When a flag exceeds its lifecycle threshold, the system should generate a PR that removes the flag. The PR goes through normal code review -- the automation handles the tedious work of tracing references and determining which code paths to preserve.
- Make flag health a first-class metric. Add flag health score to your engineering dashboards alongside deployment frequency, test coverage, and incident rate. What gets measured gets managed.
- Set a cleanup SLA. Once a cleanup PR is generated, define how quickly it should be reviewed and merged. A 48-hour SLA for cleanup PR review sends a clear signal that flag hygiene is an engineering priority, not an afterthought.
Key takeaways
- Most teams operate at Stage 1-2, creating flags without naming conventions, ownership, or expiration dates. The first step is simply establishing shared conventions and a tracking mechanism.
- Stage 3 is the most common plateau for mature teams. Process exists, but enforcement is manual and decays over time. Monthly reviews happen inconsistently, cleanup tickets are deprioritized, and stale flags accumulate slowly but steadily.
- The jump from Stage 3 to Stage 4 delivers the biggest improvement. Automated detection and alerting replace human memory with systems. Flags can no longer age silently, and metrics create team-level accountability.
- Stage 5 requires specialized tooling, specifically AST-level code parsing and transformation. Regex-based approaches are too fragile for reliable automated cleanup across real codebases with nested conditionals, complex imports, and multi-language architectures.
- Maturity is sequential. Skipping stages creates gaps -- automated cleanup without ownership and naming conventions automates chaos. Build the foundation before adding automation.
- The metric that matters most is cleanup ratio (flags removed / flags created). A balanced or negative net flag growth rate is the clearest signal that your maturity practices are working, regardless of which stage label you assign yourself.
People also ask
How many feature flags is too many?
There is no universal number. The answer depends on team size, cleanup velocity, and how intentionally your flags are managed. The metrics that matter are flags per engineer (danger zone: 7+), stale flag percentage (danger zone: 40%+), and cleanup ratio (danger zone: below 0.5). A team with 200 flags and a cleanup ratio near 1.0 is healthier than a team with 40 flags that never removes any. For a deeper analysis with industry benchmarks, see our post on how many feature flags is too many.
What is the best way to manage feature flags?
The best approach depends on your maturity stage. At Stage 1-2, invest in process: naming conventions, ownership, expiration dates, and monthly reviews. At Stage 3, invest in tooling: automated detection, dashboards, and alerting. At Stage 4-5, invest in automation: cleanup PR generation and AST-level code transformation. The mistake most teams make is jumping to tooling before establishing the process foundation. A dashboard that shows 200 ownerless flags with no naming convention does not help -- you need the conventions first so that the tooling has structured data to work with.
How do you know when a feature flag should be removed?
A feature flag should be removed when its rollout has reached 100% and the feature has been stable in production for a defined period -- typically 2-4 weeks for release flags. If the flag has been fully enabled for 30+ days with no incidents and no rollbacks, it is no longer serving a purpose. The conditional logic is dead weight, and the "off" code path is dead code. The other strong signal is zero evaluations: if your flag management platform reports that a flag has not been evaluated in 30+ days, it is almost certainly safe to remove. Understanding the full flag lifecycle -- from creation through stabilization to cleanup -- makes these removal decisions straightforward rather than anxiety-inducing.
Why do most teams get stuck at Stage 2 or 3?
The fundamental reason is that flag creation has immediate value (safe rollout, quick rollback) while flag cleanup has deferred value (less debt, faster development later). Humans consistently prioritize immediate returns over deferred ones. This is why manual cleanup processes fail: every sprint planning session, feature work wins over flag cleanup because the feature has a visible stakeholder and the cleanup does not. Breaking past Stage 3 requires removing humans from the enforcement loop. Automated detection, alerting, and cleanup generation shift the work from "proactive discipline" to "reactive review" -- and engineers are much better at reviewing generated PRs than they are at proactively hunting for stale flags. This is precisely why teams that invest in tooling at Stage 4 and 5 see sustained improvement where process-only approaches eventually regress.