You have 347 feature flags in production. Is that too many?
The question seems simple. The answer is not. Engineering managers ask this when something already feels wrong -- builds are slower, code reviews take longer, new hires stare at the codebase in confusion, and nobody can confidently explain what half the flags do. But putting a number on "too many" requires context that most advice articles skip entirely.
This post provides that context. Drawing on published data from LaunchDarkly and Unleash, public engineering blog posts, and our own experience working with engineering teams, we will establish concrete thresholds for when flag counts cross from healthy to harmful. You will walk away with benchmarks you can compare against, a self-assessment checklist, and a framework for setting limits that make sense for your team.
The short answer: it depends (but here are the numbers)
There is no universal "safe" number of feature flags, just as there is no universal "safe" amount of technical debt. The right number depends on team size, deployment cadence, codebase complexity, and -- critically -- your cleanup velocity.
That said, based on publicly available data and our own experience, we can identify clear warning thresholds. Here are the benchmarks we use:
| Metric | Healthy | Caution Zone | Danger Zone |
|---|---|---|---|
| Flags per engineer | 1-3 | 4-6 | 7+ |
| Flags per 1,000 lines of code (KLOC) | 0.5-1.5 | 1.5-3.0 | 3.0+ |
| Flags per repository | 5-30 | 30-80 | 80+ |
| Stale flag percentage (>90 days, non-operational) | <20% | 20-40% | 40%+ |
| Net flag growth per month | 0 (balanced) | +1 to +5 | +6 or more |
| Average flag age | <45 days | 45-90 days | 90+ days |
The single most important metric is not the total count -- it is the ratio of flag creation to flag removal. An organization with 300 flags and a balanced creation/removal rate is healthier than one with 50 flags where nothing ever gets cleaned up.
Industry benchmarks: What the data actually says
LaunchDarkly's published data
LaunchDarkly, the largest feature flag management platform, has published data points about their customer base. Their public guidance indicates that the median customer maintains between 50 and 200 flags per project, with enterprise accounts averaging significantly higher. Their "code references" feature -- which scans codebases for flag usage -- frequently finds that a significant percentage of flags in a typical project have no remaining code references, meaning the flag exists in the management platform but has already been removed from (or was never added to) the codebase.
LaunchDarkly's own best practice guidance recommends treating flags as temporary by default and establishing expiration policies. The fact that the market leader in flag management explicitly tells customers to remove flags is telling. Even the company that profits from more flags acknowledges the debt problem.
Unleash open-source data
Unleash, the leading open-source feature flag platform, publishes usage metrics from their hosted offering. Their data shows a similar pattern: organizations create flags at approximately 3x the rate they archive or remove them. The average Unleash project accumulates 8-15 new flags per month while removing only 3-5.
Unleash introduced "potentially stale" flag detection in their platform specifically because users requested it -- a signal that flag accumulation was causing enough pain for users to ask for tooling to address it.
Patterns across the industry
Based on our experience working with engineering teams at various stages, public conference talks, and published case studies, the following rough estimates emerge for flag counts by organization size:
| Organization Size | Total Active Flags | Flags Per Engineer | Stale Percentage |
|---|---|---|---|
| Startup (<50 eng) | 15-40 | 2-4 | Lower but growing |
| Scaleup (50-200 eng) | 80-200 | 4-6 | Moderate |
| Mid-market (200-1K eng) | 200-600 | 5-8 | High |
| Enterprise (1K+ eng) | 500-5,000+ | 4-7 | Highest |
The pattern is consistent: organizations at every scale are creating flags faster than they remove them, and the stale percentage increases with organizational size. But notice the enterprise "flags per engineer" number actually dips compared to mid-market. This is because large enterprises tend to have more centralized flag management and stricter governance -- but even with those controls, their stale percentage is the highest.
The three metrics that actually matter
Raw flag count is a vanity metric. It tells you something, but not enough to act on. These three derived metrics are far more diagnostic.
1. Flag density: Flags per KLOC
Flag density measures how thoroughly flags are woven into your codebase. It normalizes for codebase size, making it comparable across projects and teams.
How to calculate it: Count all feature flag evaluations in your codebase (calls to your flag SDK) and divide by thousands of lines of code.
Flag Density = Total flag evaluations / (Total lines of code / 1000)
Benchmarks:
| Flag Density (per KLOC) | Assessment | Typical Scenario |
|---|---|---|
| 0.1-0.5 | Low | Small team, few flags, or very large codebase |
| 0.5-1.5 | Healthy | Active flag usage with reasonable management |
| 1.5-3.0 | Elevated | Heavy flag usage, likely some stale flags |
| 3.0-5.0 | High | Flag-heavy codebase, cleanup needed |
| 5.0+ | Critical | Flags dominating code paths, urgent cleanup required |
Note the distinction between flag evaluations and unique flags. A single flag evaluated in 15 places creates more complexity (and more removal work) than 15 flags each evaluated once. Tools that use AST parsing -- like tree-sitter-based detection -- can count both, giving you a more accurate density picture than simple text search.
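To make this concrete, here is a minimal sketch of a density calculation using plain text search. It assumes your flag SDK exposes an evaluation call along the lines of `.variation("flag-key", ...)` (LaunchDarkly-style); substitute your own SDK's method. As the note above suggests, an AST-based scanner will be more accurate than a regex, but this is enough for a first estimate of both evaluations and unique flags.

```python
# Minimal flag-density sketch using text search. The ".variation(" pattern is
# an assumption (LaunchDarkly-style SDKs); adjust it for your own SDK.
import re
from pathlib import Path

FLAG_CALL_PATTERN = re.compile(r'\.variation\(\s*["\']([\w.-]+)["\']')
SOURCE_EXTENSIONS = {".py", ".ts", ".tsx", ".js", ".go", ".java"}

def flag_density(repo_root: str) -> None:
    total_lines = 0
    evaluations = []  # one (flag_key, file) entry per call site found
    for path in Path(repo_root).rglob("*"):
        if not path.is_file() or path.suffix not in SOURCE_EXTENSIONS:
            continue
        text = path.read_text(errors="ignore")
        total_lines += text.count("\n") + 1
        evaluations += [(key, path) for key in FLAG_CALL_PATTERN.findall(text)]

    kloc = total_lines / 1000
    unique_flags = {key for key, _ in evaluations}
    print(f"{len(evaluations)} evaluations of {len(unique_flags)} flags across {kloc:.1f} KLOC")
    print(f"Flag density: {len(evaluations) / kloc:.2f} evaluations per KLOC")

flag_density(".")
```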
2. Cleanup ratio: Flags removed / Flags created
The cleanup ratio is the single best predictor of whether your flag count will become a problem. It measures organizational discipline, not just current state.
Cleanup Ratio = Flags removed this month / Flags created this month
Benchmarks:
| Cleanup Ratio | Assessment | Trajectory |
|---|---|---|
| >1.0 | Excellent | Actively reducing debt |
| 0.8-1.0 | Healthy | Roughly balanced, slight growth manageable |
| 0.5-0.8 | Concerning | Accumulating debt, intervention needed within 6 months |
| 0.3-0.5 | Poor | Significant accumulation, process/tooling gaps |
| <0.3 | Critical | Creating 3x+ more than removing, flag graveyard forming |
Based on what we have seen across codebases, the typical ratio is around 0.33 -- meaning teams create roughly three flags for every one they remove. This is how codebases end up with hundreds of stale flags despite nobody intending for it to happen.
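For illustration, here is a minimal sketch of the calculation, assuming you can export flag creation and archival timestamps from your flag management platform or audit spreadsheet; the data shapes below are illustrative, not a real API schema.

```python
# Minimal cleanup-ratio sketch: removed-per-month divided by created-per-month.
# The input lists are assumed to come from a platform export or audit spreadsheet.
from collections import Counter
from datetime import date

def cleanup_ratio_by_month(created: list[date], removed: list[date]) -> dict[str, float]:
    created_per_month = Counter(d.strftime("%Y-%m") for d in created)
    removed_per_month = Counter(d.strftime("%Y-%m") for d in removed)
    return {
        month: removed_per_month[month] / created_per_month[month]
        for month in sorted(created_per_month)
    }

# Three flags created and one removed in March gives a ratio of 0.33.
created = [date(2024, 3, 2), date(2024, 3, 10), date(2024, 3, 21)]
removed = [date(2024, 3, 28)]
print(cleanup_ratio_by_month(created, removed))  # {'2024-03': 0.333...}
```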
3. Stale flag percentage
Stale flags are flags that have completed their purpose but remain in the codebase. A release flag that has been 100% enabled for six months is stale. An experiment flag from a test that concluded three months ago is stale. An operational kill switch reviewed and re-approved quarterly is not stale.
How to calculate it: Count flags older than your expiration threshold (commonly 90 days for release flags) that are not documented as long-lived operational flags. Divide by total flag count.
Stale Percentage = Stale flags / Total flags * 100
Benchmarks:
| Stale Percentage | Assessment | Impact |
|---|---|---|
| <15% | Excellent | Minimal debt, strong lifecycle management |
| 15-30% | Good | Some debt, manageable with periodic cleanup |
| 30-50% | Elevated | Noticeable developer friction, code reviews slowed |
| 50-70% | High | Significant productivity drain, onboarding impacted |
| 70%+ | Critical | Flag graveyard; major investment needed to recover |
In our experience, the typical stale percentage for most organizations is well above 50%. If your number is below 30%, you are doing better than most teams we have worked with.
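Sketched below is one way to compute it, assuming you have an inventory with a category and creation date per flag; the `Flag` record is hypothetical, and operational and permission flags are excluded per the definition above.

```python
# Minimal stale-percentage sketch. The Flag record is a hypothetical shape you
# would build from a platform export or audit spreadsheet.
from dataclasses import dataclass
from datetime import date, timedelta

@dataclass
class Flag:
    key: str
    category: str  # "release", "experiment", "operational", "permission", "migration"
    created: date

LONG_LIVED = {"operational", "permission"}   # excluded from staleness
STALE_AFTER = timedelta(days=90)             # expiration threshold for temporary flags

def stale_percentage(flags: list[Flag], today: date) -> float:
    stale = [
        f for f in flags
        if f.category not in LONG_LIVED and today - f.created > STALE_AFTER
    ]
    return 100 * len(stale) / len(flags) if flags else 0.0
```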
Why raw count is misleading
Consider two teams:
Team A: 200 flags, cleanup ratio of 0.9, average flag age of 38 days, stale percentage of 18%. They create and remove flags aggressively as part of a mature trunk-based development workflow. Flags are temporary by design and culture.
Team B: 45 flags, cleanup ratio of 0.1, average flag age of 210 days, stale percentage of 78%. They adopted flags two years ago and have never removed one. Every flag is load-bearing in production, and nobody is confident about removing any of them.
Team A has 4x the flags but dramatically better flag health. Team B has a small number of flags but is sitting on a minefield. This is why "how many is too many" requires context beyond the count.
The compounding cost of excess flags
Flag debt compounds in ways that are not immediately obvious. Each additional stale flag does not add cost linearly -- it adds cost geometrically because of interactions between flags.
The testing combinatorial problem
Every boolean flag doubles the theoretical state space of your application. In practice, teams do not test every combination, but they do need to reason about them during code reviews, debugging, and incident response.
| Active Flags | Theoretical Combinations | Realistic Test Paths | Review Complexity |
|---|---|---|---|
| 10 | 1,024 | 20-30 | Manageable |
| 25 | 33+ million | 50-100 | Requires discipline |
| 50 | 1.1 quadrillion | 100-200 | Significant overhead |
| 100 | 1.27 x 10^30 | 200-400 | Requires tooling |
| 200 | 1.61 x 10^60 | 400-800 | Unsustainable without automation |
Even if you only test the "important" combinations, the cognitive load of understanding which combinations matter grows with every flag added. This is where developer velocity silently erodes.
The onboarding multiplier
New engineers must understand existing flags to work effectively. More flags means more conditional logic to learn, more context to absorb, and more time before a new hire can contribute confidently. In our experience, teams with hundreds of stale flags consistently report that new engineer onboarding takes meaningfully longer, and this cost multiplies with every new hire.
The incident response tax
During production incidents, engineers must navigate flag state to diagnose issues. Every flag in a code path is a potential variable that could explain the behavior.
In our experience, flag-heavy codebases consistently show longer mean time to resolution (MTTR). The more flags in a code path, the more variables an engineer must consider during diagnosis, and the longer it takes to rule out flag-related causes.
An extra 30 minutes on a P1 incident at 2 AM is not just a cost -- it is a morale event. Engineers who repeatedly deal with flag-induced debugging complexity develop learned helplessness that depresses velocity long after the incident is resolved.
The self-assessment checklist
Use this checklist to evaluate your team's flag health. Score each item honestly -- the goal is diagnosis, not perfection.
Quantitative signals (measure these)
- Flags per engineer is above 6. Each engineer is responsible for more flags than they can reasonably track.
- Stale flag percentage exceeds 40%. Nearly half your flags have completed their purpose yet remain in the codebase.
- Cleanup ratio is below 0.5. You are creating flags at least 2x faster than removing them.
- Average flag age exceeds 90 days. Most flags are living well beyond a typical release cycle.
- No flag has been removed in the past 30 days. Flag removal is not happening at all.
Qualitative signals (observe these)
- New hires ask "what does this flag do?" more than once per day. Flags are obscuring rather than clarifying the codebase.
- Code reviews include comments like "is this flag still needed?" Reviewers cannot tell if flags are active or stale.
- Nobody knows who owns specific flags. Flag ownership has diffused to the point of anonymity.
- Engineers are afraid to remove flags. The team has learned helplessness around flag cleanup due to past incidents.
- You have flags referencing features that shipped over a year ago. Release flags have become permanent architecture.
- Flag names include "temp", "test", "v2", "old", or "new". Naming suggests these were always intended to be temporary.
- Your flag management platform shows flags with zero evaluations. Dead flags exist in configuration but not in code (or vice versa).
Scoring
| Red Flags Checked | Assessment | Recommended Action |
|---|---|---|
| 0-2 | Healthy | Maintain current practices, consider preventive policies |
| 3-5 | Caution | Establish cleanup cadence, set flag expiration policies |
| 6-8 | Concerning | Invest in cleanup tooling and process, audit current flags |
| 9-11 | Critical | Dedicate a sprint to flag cleanup, implement automated lifecycle management |
| All 12 | Emergency | Flag debt is a top-3 engineering productivity issue; treat accordingly |
Setting the right thresholds for your team
Instead of adopting universal limits, establish thresholds calibrated to your organization's context.
Step 1: Establish your baseline
Run a one-time audit. Count total flags, categorize them (release, experiment, operational, permission), calculate the three key metrics (density, cleanup ratio, stale percentage), and document flag age distribution.
Tools like FlagShark can automate this audit across repositories by scanning your codebase with tree-sitter AST parsing and producing a complete inventory. If you prefer a manual approach, search your codebase for your flag SDK's evaluation method calls and build a spreadsheet.
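As a rough illustration of the manual route, the sketch below scans for SDK evaluation calls and writes a per-flag inventory to a CSV you can sort and categorize in a spreadsheet. The `.variation(` pattern and the `*.py` glob are assumptions; adjust both for your SDK and languages.

```python
# Minimal audit sketch: build a flag inventory CSV from text search.
import csv
import re
from collections import defaultdict
from pathlib import Path

PATTERN = re.compile(r'\.variation\(\s*["\']([\w.-]+)["\']')  # assumed SDK call shape

inventory: dict[str, list[str]] = defaultdict(list)
for path in Path(".").rglob("*.py"):  # extend the glob to your languages
    for key in PATTERN.findall(path.read_text(errors="ignore")):
        inventory[key].append(str(path))

with open("flag_inventory.csv", "w", newline="") as fh:
    writer = csv.writer(fh)
    writer.writerow(["flag_key", "evaluation_count", "files"])
    for key, files in sorted(inventory.items()):
        writer.writerow([key, len(files), ";".join(sorted(set(files)))])
```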
Step 2: Categorize flags by intended lifetime
Not all flags should have the same expiration threshold. Establish categories with distinct expectations:
| Flag Category | Description | Expected Lifetime | Expiration Policy |
|---|---|---|---|
| Release flags | Gate new features during rollout | 1-4 weeks | Remove within 30 days of 100% rollout |
| Experiment flags | A/B tests and experiments | 2-8 weeks | Remove within 14 days of experiment conclusion |
| Operational flags | Kill switches, circuit breakers | Indefinite | Annual review and re-approval |
| Permission flags | Entitlement and access control | Indefinite | Quarterly review |
| Migration flags | Database or service migrations | 2-12 weeks | Remove within 30 days of migration completion |
The crucial distinction: release and experiment flags should be temporary by default. Operational and permission flags are intentionally long-lived and should not count toward your stale percentage. If you do not make this distinction, your stale percentage will be inflated and your team will ignore the metric entirely.
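One way to keep the distinction honest is to encode the category table as policy-as-code that your tooling can read. The sketch below is illustrative: the windows match the table above, and `None` marks long-lived categories that get scheduled reviews instead of expiration dates.

```python
# Minimal policy-as-code sketch of the category table above.
from datetime import date, timedelta

EXPIRATION_DAYS = {
    "release": 30,        # remove within 30 days of 100% rollout
    "experiment": 14,     # remove within 14 days of experiment conclusion
    "migration": 30,      # remove within 30 days of migration completion
    "operational": None,  # annual review instead of expiration
    "permission": None,   # quarterly review instead of expiration
}

def expiration_date(category: str, completed_on: date) -> date | None:
    """Date a flag should be gone by, or None for long-lived categories."""
    days = EXPIRATION_DAYS[category]
    return completed_on + timedelta(days=days) if days is not None else None
```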
Step 3: Set limits and enforce them
Once you have categories and baselines, set explicit limits:
Total flag budget = (Number of engineers * target flags per engineer) + Operational flags
Operational flags are counted separately because they are intentionally long-lived and sit outside the per-engineer target.
For example, a 30-person team targeting 3 flags per engineer with 15 operational flags:
Flag budget = (30 * 3) + 15 = 105 flags
If your current count exceeds the budget, establish a debt reduction target. A reasonable pace is reducing stale flags by 10-15% per month -- aggressive enough to make progress, sustainable enough to not disrupt feature work.
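As a rough illustration, the sketch below encodes the budget formula and estimates how long debt reduction takes at a given monthly pace. It applies the reduction rate to the whole flag count for a ballpark timeline, and the 12% figure is simply one point inside the suggested 10-15% range.

```python
# Minimal sketch of the flag budget and a ballpark debt-reduction timeline.
def flag_budget(engineers: int, target_per_engineer: int, operational_flags: int) -> int:
    return engineers * target_per_engineer + operational_flags

def months_to_budget(current_flags: float, budget: int, monthly_reduction: float = 0.12) -> int:
    months = 0
    while current_flags > budget:
        current_flags *= 1 - monthly_reduction  # remove ~12% of remaining flags each month
        months += 1
    return months

budget = flag_budget(engineers=30, target_per_engineer=3, operational_flags=15)
print(budget)                         # 105
print(months_to_budget(347, budget))  # roughly 10 months at 12% per month
```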
Step 4: Automate enforcement
Manual flag policies fail. In our experience, organizations that rely on process alone (cleanup sprints, manual reviews, "flag Fridays") see temporary improvements that regress within 2-3 months.
Sustainable flag management requires automation at key points in the lifecycle:
- At creation: Require expiration dates, owners, and categories for new flags. Block PR merges that add flags without metadata.
- During lifecycle: Monitor flag age and send alerts when flags approach expiration. Tools like FlagShark track flag lifecycle automatically by analyzing PRs as they are opened and merged.
- At expiration: Generate cleanup PRs automatically when flags exceed their expiration date. Automated cleanup PRs remove the single biggest friction point in flag management: the manual work of identifying stale code paths, removing flag evaluations, and cleaning up dead branches.
- In CI/CD: Fail builds or raise warnings when flag count exceeds your budget or stale percentage exceeds your threshold.
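As one example of the CI/CD point above, here is a minimal gate script that fails the build when either threshold is exceeded. The environment variable names are illustrative, not from any particular tool; the counts would come from your audit scripts or a platform export.

```python
# Minimal CI gate sketch: exit non-zero when flag debt exceeds the agreed limits.
import os
import sys

FLAG_BUDGET = int(os.environ.get("FLAG_BUDGET", "105"))           # assumed budget
STALE_THRESHOLD = float(os.environ.get("STALE_THRESHOLD", "40"))  # stale % ceiling

flag_count = int(os.environ["FLAG_COUNT"])
stale_percentage = float(os.environ["STALE_PERCENTAGE"])

failures = []
if flag_count > FLAG_BUDGET:
    failures.append(f"flag count {flag_count} exceeds budget {FLAG_BUDGET}")
if stale_percentage > STALE_THRESHOLD:
    failures.append(f"stale percentage {stale_percentage:.0f}% exceeds {STALE_THRESHOLD:.0f}%")

if failures:
    print("Flag budget check failed: " + "; ".join(failures))
    sys.exit(1)
print("Flag budget check passed")
```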
When more flags are actually fine
Not every high flag count indicates a problem. Some situations legitimately call for more flags:
Multi-tenant SaaS products often use flags for customer-specific configuration and entitlements. A B2B platform with 500 enterprise customers might have 500+ permission flags that are each intentionally long-lived. These should be categorized as operational/permission flags and excluded from stale calculations.
Platform teams and infrastructure use flags as operational controls -- circuit breakers, gradual migrations, load shedding toggles. A platform team with 50 operational flags is not necessarily unhealthy; it depends on whether those flags are documented, owned, and reviewed.
High-velocity product teams practicing trunk-based development may have elevated flag counts at any given moment because they are shipping daily. If their cleanup ratio is near 1.0 and average flag age is under 30 days, a higher absolute count is healthy -- it reflects velocity, not debt.
The key question is always: are these flags intentional and managed, or accidental and forgotten?
The answer, summarized
How many feature flags is too many? Here is the answer in brief:
Your flag count is too high when your stale percentage exceeds 40%, your cleanup ratio falls below 0.5, or your flags-per-engineer exceeds 6. These thresholds indicate that flag creation has outpaced your ability to manage the lifecycle, and debt is accumulating in ways that will measurably impact developer productivity.
The absolute number matters less than the trend. A team moving from 200 to 180 flags is healthier than a team moving from 40 to 60, regardless of who has "more" flags.
If you are asking the question, you probably already know the answer. Engineering managers do not Google "how many feature flags is too many" when everything is fine. The fact that you are here means something is causing friction. Use the benchmarks and checklist in this post to quantify the problem, then set concrete thresholds and -- most importantly -- automate the enforcement. The teams that treat flag lifecycle as infrastructure rather than discipline are the ones that keep their codebases clean at scale.
Flag count is a symptom. Cleanup velocity is the diagnosis. Automation is the treatment. Measure the three metrics that matter (density, cleanup ratio, stale percentage), compare against the benchmarks above, and establish policies with automated enforcement. The organizations that get this right build faster, ship safer, and spend their engineering budget on features instead of archaeology.