Nobody wakes up one morning and decides to remove 500 feature flags. What happens instead is this: a team runs a routine audit, counts the flags in their codebase, and the number is so much larger than anyone expected that ignoring it becomes impossible. The audit forces a conversation. The conversation becomes a planning meeting. The planning meeting becomes a quarter-long initiative that touches every service, every team, and every assumption about what "done" looks like in software delivery.
This is the story of what that initiative looks like in practice -- the process, the results, the failures, and the lessons -- synthesized from documented large-scale cleanup efforts including Uber's removal of approximately 2,000 stale flags with Piranha, publicly shared engineering postmortems, and patterns observed across organizations that have undertaken flag debt remediation programs. The specific numbers represent realistic composites drawn from these sources, not a single organization's data.
The narrative follows a mid-size engineering organization: 180 engineers, 12 services, 3 primary programming languages (Go, TypeScript, Python), and one flag management platform (LaunchDarkly). The numbers are real-world plausible. The pain is universal.
The before state: 537 flags across 12 services
The audit started as a side project. A senior engineer, frustrated after spending an entire afternoon debugging an issue caused by a flag interaction nobody understood, wrote a script to count every flag evaluation across the organization's repositories.
The count came back: 537 active flags across 12 services.
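A first pass at such an audit does not require special tooling. The sketch below is a hypothetical Python reconstruction of that kind of script, assuming all repositories are cloned under one directory and that flags are evaluated through LaunchDarkly-style `variation(...)` calls; a real audit would need additional patterns for each language and SDK wrapper in use.

```python
import re
from pathlib import Path

# Match common LaunchDarkly-style evaluation calls, e.g.
#   client.variation("new-checkout-flow", user, False)
#   client.boolVariation("new-checkout-flow", ctx, false)
FLAG_CALL = re.compile(r'\b\w*[vV]ariation\w*\(\s*["\']([\w.:-]+)["\']')

SOURCE_EXTENSIONS = {".go", ".ts", ".tsx", ".py"}

def audit(repos_root: str) -> dict[str, int]:
    """Return a map of flag key -> number of evaluation call sites."""
    counts: dict[str, int] = {}
    for path in Path(repos_root).rglob("*"):
        if (not path.is_file()
                or ".git" in path.parts
                or path.suffix not in SOURCE_EXTENSIONS):
            continue
        text = path.read_text(encoding="utf-8", errors="ignore")
        for match in FLAG_CALL.finditer(text):
            counts[match.group(1)] = counts.get(match.group(1), 0) + 1
    return counts

if __name__ == "__main__":
    flags = audit("./repos")
    print(f"{len(flags)} distinct flags, {sum(flags.values())} call sites")
    for key, n in sorted(flags.items(), key=lambda kv: -kv[1]):
        print(f"{n:5d}  {key}")
```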
The inventory breakdown
| Category | Count | Percentage |
|---|---|---|
| Release flags (100% enabled, feature shipped) | 248 | 46% |
| Experiment flags (experiment concluded, never cleaned up) | 89 | 17% |
| Operational flags (kill switches, circuit breakers) | 62 | 12% |
| Permission/entitlement flags | 48 | 9% |
| Migration flags (migration completed) | 41 | 8% |
| Unknown purpose (no documentation, no active owner) | 49 | 9% |
Nearly half of all flags were release flags that had been at 100% for months or years. These flags were controlling features that had been fully shipped -- in some cases, features that had been in production for over two years. The flags served no operational purpose. They were dead weight.
The 49 flags with "unknown purpose" were the most alarming. Nobody on the current team could explain what they controlled or why they existed. The engineers who created them had left the company, and the flags had no documentation, no comments, and names like TEMP_ROLLBACK_V2, NEW_FLOW_ENABLED, and EXPERIMENT_ALPHA.
The impact before cleanup
The team measured several baseline metrics before beginning the cleanup:
| Metric | Value | Notes |
|---|---|---|
| Average CI build time | 14.2 minutes | Across all services |
| Full test suite duration | 38 minutes | Including flag-variant tests |
| Average PR review time | 4.2 hours | From open to first review |
| Mean time to resolve incidents | 47 minutes | Flag-related incidents: 68 minutes |
| New engineer onboarding (productive) | 6 weeks | Time to first meaningful PR |
| Developer satisfaction (internal survey) | 6.1/10 | "Codebase health" category |
| Deployment frequency | 3.8 deploys/week per service | Down from 5.2 two years prior |
| Flag-related incidents (quarterly) | 8 | Average over previous 4 quarters |
The deployment frequency decline was the metric that got executive attention. Two years earlier, before the flag count crossed 300, the same services were deploying 5.2 times per week. The slowdown was gradual enough that nobody noticed until the trend line was plotted.
The cleanup process
The organization allocated one quarter to the initiative. A three-person team -- one senior backend engineer, one senior frontend engineer, and one QA engineer -- would lead the effort, with every service team contributing 10% of their sprint capacity to cleanup work in their own services.
Phase 1: Inventory and categorization (Weeks 1-2)
The first two weeks were spent building a complete inventory of every flag, its current state, its age, its owner (if identifiable), and its dependencies.
Categorization criteria:
| Category | Criteria | Action |
|---|---|---|
| Safe to remove | 100% enabled for > 90 days, no incidents in 90 days | Remove immediately |
| Likely safe | 100% enabled for 30-90 days, no incidents | Remove with monitoring |
| Needs investigation | Unknown purpose, no owner, complex targeting rules | Research before removing |
| Intentionally long-lived | Documented kill switches, active entitlements | Keep, document, set review date |
| Active | Currently in rollout or experiment | Leave alone |
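Expressed as code, the decision table above is one short function. A minimal sketch in Python -- the `Flag` record and its field names are assumptions about what the inventory collected, not the team's actual schema:

```python
from dataclasses import dataclass

@dataclass
class Flag:
    key: str
    days_at_100: int              # days fully enabled (0 if not at 100%)
    incidents_in_90_days: int     # incidents referencing this flag
    documented_long_lived: bool   # kill switch / entitlement with an owner
    in_active_rollout: bool       # currently rolling out or experimenting

def categorize(f: Flag) -> str:
    if f.in_active_rollout:
        return "active"                      # leave alone
    if f.documented_long_lived:
        return "intentionally-long-lived"    # keep, document, set review date
    if f.days_at_100 > 90 and f.incidents_in_90_days == 0:
        return "safe-to-remove"              # remove immediately
    if 30 <= f.days_at_100 <= 90 and f.incidents_in_90_days == 0:
        return "likely-safe"                 # remove with monitoring
    return "needs-investigation"             # research before removing
```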
Inventory results:
| Category | Count |
|---|---|
| Safe to remove | 289 |
| Likely safe | 74 |
| Needs investigation | 68 |
| Intentionally long-lived | 62 |
| Active | 44 |
| Total flagged for removal | 431 |
The team initially targeted the 431 flags flagged for removal. Investigation confirmed that 66 of the 68 "needs investigation" flags, plus all "likely safe" flags, were indeed removable; the other two investigation flags turned out to be actively needed despite having no documentation (more on those later). As the quarter progressed, additional release flags crossed the 90-day "safe to remove" threshold, bringing the final removal count to 502.
Phase 2: Prioritization (Week 3)
Not all flag removals are equal. The team prioritized based on a scoring model:
| Factor | Weight | Rationale |
|---|---|---|
| Number of files touched | 30% | More files = more code simplified |
| Service criticality | 25% | Highest-traffic services benefit most |
| Flag interactions (depends on other flags) | 20% | Removing interaction sources reduces combinatorial complexity |
| Age of flag | 15% | Oldest flags are most likely to surprise |
| Difficulty of removal | 10% | Easy wins first to build momentum |
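Applied literally, the model is a weighted sum over normalized factors. A hypothetical sketch, assuming each factor is first normalized to a 0-1 score (the weights come from the table above; the field names are invented):

```python
WEIGHTS = {
    "files_touched": 0.30,        # more files = more code simplified
    "service_criticality": 0.25,  # highest-traffic services benefit most
    "interactions": 0.20,         # interaction sources drive combinatorics
    "age": 0.15,                  # oldest flags are most likely to surprise
    "ease_of_removal": 0.10,      # easy wins first to build momentum
}

def priority_score(factors: dict[str, float]) -> float:
    """Each factor is pre-normalized to [0, 1]; higher score = remove sooner."""
    return sum(WEIGHTS[name] * factors.get(name, 0.0) for name in WEIGHTS)

# Example: an old, easy-to-remove flag in a mid-tier service.
score = priority_score({
    "files_touched": 0.4,
    "service_criticality": 0.5,
    "interactions": 0.1,
    "age": 0.9,
    "ease_of_removal": 1.0,
})
# 0.12 + 0.125 + 0.02 + 0.135 + 0.10 = 0.50
```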
The prioritization produced four tiers:
- Tier 1 (Weeks 4-6): 180 flags across the 4 highest-traffic services
- Tier 2 (Weeks 7-9): 160 flags across the next 4 services
- Tier 3 (Weeks 10-11): 110 flags across the remaining services
- Tier 4 (Week 12): 52 flags requiring special handling (complex interactions, shared libraries)
Phase 3: Phased removal (Weeks 4-12)
The removal process followed a strict protocol for each flag:
- Create a cleanup PR removing the flag evaluation and the dead code path
- Run the full test suite and fix any test failures
- Deploy to staging and run smoke tests
- Deploy to production during low-traffic hours
- Monitor for 24 hours for any anomalies
- Remove the flag from LaunchDarkly after the monitoring window
- Close the associated cleanup ticket
For the first two weeks, the team removed flags one at a time, deploying each removal individually to build confidence. After establishing the pattern and confirming no issues, they began batching: up to 5 related flags per PR, deployed as a group.
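For illustration, here is a typical single-flag cleanup PR in miniature -- a hypothetical Python before/after with an invented flag key, showing the evaluation and the dead "off" path collapsing into the "on" path:

```python
# Hypothetical helpers standing in for the real page renderers.
def render_new_checkout(user): ...
def render_legacy_checkout(user): ...

# Before: a flag evaluation guarding a dead "off" path. The flag has been
# at 100% for a year; no production traffic reaches the legacy branch.
def render_checkout(user, ld_client):
    if ld_client.variation("new-checkout-flow", user, False):
        return render_new_checkout(user)
    return render_legacy_checkout(user)

# After: the "on" path is inlined, and the dead branch (plus, eventually,
# render_legacy_checkout itself) is deleted.
def render_checkout(user, ld_client):
    return render_new_checkout(user)
```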
Removal velocity:
| Week | Flags Removed | Cumulative | Issues Encountered |
|---|---|---|---|
| 4 | 28 | 28 | 0 |
| 5 | 42 | 70 | 1 (test failure, quick fix) |
| 6 | 51 | 121 | 0 |
| 7 | 58 | 179 | 1 (see "What went wrong" below) |
| 8 | 63 | 242 | 0 |
| 9 | 71 | 313 | 1 (see "What went wrong" below) |
| 10 | 68 | 381 | 0 |
| 11 | 62 | 443 | 1 (see "What went wrong" below) |
| 12 | 59 | 502 | 0 |
Total flags removed: 502 over 9 weeks of active cleanup (12 weeks including planning).
What improved
The results exceeded the team's expectations in almost every measured dimension. Some improvements were anticipated; others were genuine surprises.
Build and test performance
| Metric | Before Cleanup | After Cleanup | Improvement |
|---|---|---|---|
| Average CI build time | 14.2 minutes | 11.3 minutes | 20% faster |
| Full test suite duration | 38 minutes | 24 minutes | 37% faster |
| Test case count | 4,847 | 3,921 | 926 tests removed (19%) |
| Flaky test rate | 3.2% | 1.8% | 44% reduction |
The test suite improvement was the most dramatic. Nearly 1,000 test cases existed solely to test flag variant combinations. Many of these were testing the "off" path of flags that had been 100% enabled for over a year -- testing code paths that no user had exercised in production for months. The flaky test reduction was a secondary benefit: many flaky tests were caused by race conditions in flag evaluation mocking.
Developer productivity
| Metric | Before Cleanup | After Cleanup | Improvement |
|---|---|---|---|
| Average PR review time | 4.2 hours | 3.1 hours | 26% faster |
| Lines of code removed | - | 47,200 | Net reduction |
| Files modified/deleted | - | 1,340 files modified, 89 files deleted | Simplified codebase |
| Deployment frequency | 3.8/week per service | 4.9/week per service | 29% increase |
| Developer satisfaction (survey) | 6.1/10 | 7.8/10 | +1.7 points |
The deployment frequency increase was the most strategically significant result. With fewer flags to reason about, engineers felt more confident shipping changes. The psychological burden of "what if this change interacts with a flag I don't understand" was materially reduced.
The developer satisfaction jump was unusually large for a single initiative. In post-cleanup surveys, engineers cited "the codebase feels cleaner" and "I can understand what's happening now" as the primary drivers. One engineer's comment captured the sentiment: "I didn't realize how much cognitive load the flags were adding until they were gone."
Operational reliability
| Metric | Before Cleanup | After Cleanup | Improvement |
|---|---|---|---|
| Flag-related incidents (quarterly) | 8 | 2 | 75% reduction |
| Mean time to resolve (all incidents) | 47 minutes | 38 minutes | 19% faster |
| Mean time to resolve (flag-related) | 68 minutes | 41 minutes | 40% faster |
| On-call escalations involving flags | 14/quarter | 3/quarter | 79% reduction |
Fewer flags means fewer flag interactions, which means fewer surprises in production. The remaining flag-related incidents were caused by active flags in rollout (expected and manageable), not by stale flags creating confusion.
New engineer onboarding
| Metric | Before Cleanup | After Cleanup | Improvement |
|---|---|---|---|
| Time to first meaningful PR | 6 weeks | 4 weeks | 33% faster |
| Onboarding questions about flags | 12 per new hire | 4 per new hire | 67% fewer |
| Services understood after 30 days | 2-3 | 4-5 | Broader early contribution |
New engineers could read the code and understand what it did. They no longer needed to ask "what does this flag do?" a dozen times during their first month, only to learn that the answer was "nothing, it's been on for two years."
What went wrong
Not everything went smoothly. Three flags caused problems during removal, and each taught the team something important about the risks of large-scale cleanup.
Incident 1: The shadow dependency (Week 7)
Flag: migration_payment_processor_v2
What happened: This flag had been at 100% for 14 months and was categorized as "safe to remove." When the team removed the flag and its associated "off" code path, a downstream batch processing job started failing. The batch job was not in the organization's primary codebase -- it was in a separate repository owned by the finance team, and it was checking the flag's value directly via the LaunchDarkly API (not through the SDK in the application code).
Impact: The batch job failed for 3 hours before the finance team noticed. No customer-facing impact, but financial reconciliation was delayed by one day.
Root cause: The flag's consumers extended beyond the application code. The batch job treated the flag as a feature toggle, using the LaunchDarkly API to check whether the new payment processor was active before running reconciliation logic.
Fix: The team restored the flag in LaunchDarkly (not in the code), fixed the batch job to remove its flag dependency, then re-removed the flag. Total resolution: 6 hours.
Lesson: Before removing a flag from the management platform, check for API-level consumers outside the application code. The management platform's usage analytics showed the batch job's evaluations, but nobody checked.
Incident 2: The accidental kill switch (Week 9)
Flag: EXPERIMENT_ALPHA
What happened: This was one of the 49 "unknown purpose" flags. Investigation during Phase 1 found no documentation, no owner, and the flag had been at 100% for 22 months. The team categorized it as safe to remove. After removal and deployment, a small percentage of users (approximately 2%) started seeing errors on the settings page.
Impact: 2% of users affected for 45 minutes. The on-call engineer rolled back the deployment within 15 minutes of the first alert; the remaining 30 minutes were spent verifying the rollback was clean.
Root cause: EXPERIMENT_ALPHA was not an experiment flag. It was a release flag for a settings page redesign that had been rolled out to 100% of users. But the "off" path was not the old settings page -- it was an error page. The original developer had deleted the old settings page code during the flag's rollout, leaving the "off" path as a broken render. With the flag at 100%, nobody ever hit the "off" path, so the broken state was invisible.
Fix: The team merged a PR that removed the flag properly -- keeping the "on" code path and deleting the broken "off" path. Total resolution: 2 hours including the rollback.
Lesson: When removing a flag, do not assume the "off" path is functional. Test both paths before removing the flag, even if the flag has been at 100% for years. The "off" path may have been broken for its entire lifetime without anyone noticing.
Incident 3: The performance cliff (Week 11)
Flag: cache_strategy_new
What happened: This flag controlled which caching strategy a high-traffic service used. The "on" path used a new, more efficient caching implementation. The "off" path used the legacy caching layer. The flag had been at 100% for 8 months. When the team removed the flag and its "off" code path (the legacy cache), the service's memory usage spiked by 40%.
Impact: No user-facing errors, but the service hit its memory limits on 2 of 8 pods, triggering restarts. Latency increased by 15% for approximately 30 minutes until the restarts stabilized.
Root cause: The legacy caching code, despite being behind a permanently disabled flag path, was still being initialized at startup. It was pre-warming a shared cache that the "new" caching strategy was inadvertently relying on. Removing the legacy code removed the pre-warming, causing the new cache to start cold and consume more memory during initial population.
Fix: The team added explicit cache pre-warming to the new caching implementation (which should have been there from the start) and re-deployed. Total resolution: 4 hours.
Lesson: Flags do not always create clean boundaries between code paths. Shared state, initialization side effects, and resource dependencies can span flag boundaries. High-traffic services need extra scrutiny during cleanup.
The surprising findings
Beyond the expected improvements in build times and developer productivity, the cleanup surfaced several unexpected discoveries.
12,400 lines of dead code exposed
Removing 502 flags did not just delete flag evaluations. It exposed large blocks of code that were only reachable through disabled flag paths. The team found:
- 12,400 lines of code that were completely unreachable (behind permanently disabled flags)
- 3 entire API endpoints that had been disabled for over a year and were still being maintained (rate limiting, monitoring, documentation) despite serving zero traffic
- 2 database migrations that had been written but never executed, gated behind migration flags that were never enabled
The dead code had been invisible because it was syntactically valid and passed all linting checks. It was only reachable through flag paths that were permanently off, making it functionally equivalent to deleted code -- but with ongoing maintenance costs.
14 unused dependencies discovered
Removing flag-gated code paths revealed that 14 third-party dependencies were only used by dead code. These dependencies were being downloaded, compiled, and potentially scanned for vulnerabilities -- all for code that would never execute.
| Dependency Type | Count | Impact |
|---|---|---|
| Go modules | 6 | Build time reduction |
| npm packages | 5 | Bundle size reduction (380KB) |
| Python packages | 3 | Docker image size reduction |
Removing these dependencies reduced the frontend bundle size by 380KB (a 4% reduction) and reduced Go build times by an additional 8% beyond the test suite improvements.
2 security issues found
Two of the removed flags were gating access to admin-level functionality. In both cases, the flags had been set to 100% (enabled for all users), but the code behind the flags contained authorization checks that were less strict than the current production authorization system. If the flags had been toggled off and then on again (by anyone with access to the flag management platform), the weaker authorization would have been re-enabled.
Neither flag had been exploited, but both represented a latent security vulnerability that existed for over a year. The cleanup eliminated the risk entirely by removing the code paths with weaker authorization.
1 billing discrepancy revealed
One flag controlled whether a service reported usage metrics to the billing system. The flag had been enabled, but the "on" code path contained a bug that under-counted API calls by approximately 3%. The "off" path (which nobody had examined in 16 months) contained the correct counting logic. The bug had been losing the organization an estimated $8,000-12,000 per month in unbilled usage.
The cleanup did not fix this bug directly -- the team discovered it during the investigation phase when they were categorizing flags and reviewing both code paths. The fix was deployed separately, and the team recovered partial revenue through usage reconciliation.
Metrics: Before and after summary
| Category | Metric | Before | After | Change |
|---|---|---|---|---|
| Codebase | Active flags | 537 | 97 | -82% |
| | Lines of code | 312,000 | 264,800 | -15% |
| | Dead code paths | Unknown (est. 12,400 lines) | 0 verified | Eliminated |
| | Third-party dependencies | 189 | 175 | -7% |
| Performance | CI build time | 14.2 min | 11.3 min | -20% |
| | Test suite duration | 38 min | 24 min | -37% |
| | Test count | 4,847 | 3,921 | -19% |
| | Flaky test rate | 3.2% | 1.8% | -44% |
| | Frontend bundle size | 9.2 MB | 8.8 MB | -4% |
| Productivity | PR review time | 4.2 hrs | 3.1 hrs | -26% |
| | Deployment frequency | 3.8/week | 4.9/week | +29% |
| | Developer satisfaction | 6.1/10 | 7.8/10 | +28% |
| | Onboarding time | 6 weeks | 4 weeks | -33% |
| Reliability | Flag-related incidents/quarter | 8 | 2 | -75% |
| | MTTR (all incidents) | 47 min | 38 min | -19% |
| | On-call escalations (flag) | 14/quarter | 3/quarter | -79% |
| Cleanup costs | Engineering time invested | - | ~480 person-hours | One-time cost |
| | Incidents during cleanup | - | 3 | All resolved within 6 hours |
Key lessons learned
The team documented 10 lessons from the initiative. These are the most broadly applicable:
Lesson 1: The hardest part is deciding to start
The audit that revealed 537 flags could have been run at any time. The data was always available. What was missing was the will to look. The team spent months knowing they had a flag problem while avoiding quantifying it. Once the number was visible, action became inevitable.
Takeaway: Run the audit. Count your flags. The number will be higher than you think, and seeing it will catalyze action.
Lesson 2: Phased removal is dramatically safer than big-bang
The team's protocol of removing flags in batches with 24-hour monitoring windows between deployments caught all three incidents early, before they could cascade. A big-bang approach (removing all 502 flags in a single deployment) would have made incident diagnosis nearly impossible.
Takeaway: Remove flags in small batches. One flag per PR is ideal. Five related flags per PR is the maximum. Never combine unrelated flag removals.
Lesson 3: The "off" path is more dangerous than you think
Two of the three incidents were caused by assumptions about the "off" code path. In one case, the "off" path was broken. In another, it had hidden side effects. After years without execution, the "off" path is essentially untested production code.
Takeaway: Before removing a flag, deploy with the flag set to "off" in a staging environment and verify the application behaves correctly. If the "off" path fails in staging, that is valuable information about what will happen if the flag is ever toggled in production.
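One cheap way to enforce this before the staging deploy is to run the affected tests against both flag states in CI. A hypothetical pytest sketch, using a simple test double rather than any specific SDK fixture; `render_settings_page` is a stand-in for the handler under test:

```python
import pytest

class FakeFlagClient:
    """Test double: returns a fixed value for every flag evaluation."""
    def __init__(self, value: bool):
        self._value = value

    def variation(self, key: str, user: dict, default: bool) -> bool:
        return self._value

# Hypothetical handler under test; in a real suite this is imported
# from the application code.
def render_settings_page(user: dict, ld_client) -> int:
    if ld_client.variation("settings-redesign", user, False):
        return 200        # "on" path: the redesigned page
    return 200            # "off" path: must also be a working page

@pytest.mark.parametrize("flag_on", [True, False])
def test_settings_page_renders_in_both_flag_states(flag_on: bool):
    status = render_settings_page({"key": "test-user"}, FakeFlagClient(flag_on))
    # EXPERIMENT_ALPHA would have failed here: its "off" path was an error page.
    assert status == 200
```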
Lesson 4: Flag consumers extend beyond your codebase
The shadow dependency incident (a batch job querying the flag via API) revealed that flags can have consumers that do not appear in code search results. Any system that integrates with your flag management platform's API is a potential consumer.
Takeaway: Check the flag management platform's evaluation analytics before removing flags from the platform. If a flag shows evaluation traffic from unexpected sources, investigate before removing.
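The check can be scripted against the platform's REST API. A hypothetical sketch against LaunchDarkly's flag status endpoint, which reports when a flag was last evaluated -- verify the path and response shape against the current API documentation before relying on it:

```python
import os

import requests  # third-party HTTP client: pip install requests

API = "https://app.launchdarkly.com/api/v2"
HEADERS = {"Authorization": os.environ["LD_API_TOKEN"]}

def last_requested(project: str, env: str, flag_key: str) -> str | None:
    """Return the flag's lastRequested timestamp from the flag status
    endpoint, or None if the flag has never been evaluated."""
    resp = requests.get(
        f"{API}/flag-statuses/{project}/{env}/{flag_key}", headers=HEADERS
    )
    resp.raise_for_status()
    return resp.json().get("lastRequested")

# A flag that is gone from every repo but still shows recent evaluations
# has a consumer outside the codebase -- a batch job, a script, a partner.
print(last_requested("default", "production", "migration_payment_processor_v2"))
```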
Lesson 5: Cleanup is a feature, not a chore
The team initially framed the cleanup as "paying down technical debt" -- necessary but unglamorous. By the end, they reframed it as "shipping a performance and reliability improvement." The build time reduction, the deployment frequency increase, and the incident rate decrease were tangible product improvements that benefited every engineer and every customer.
Takeaway: Frame cleanup work as the feature delivery it is. A 37% faster test suite and a 29% increase in deployment frequency are features. They have measurable business value.
Lesson 6: Invest in automation for the next time
The 480 person-hours invested in the cleanup initiative were a one-time cost, but flag creation did not stop. Without automation, the organization would need another cleanup initiative in 12-18 months. The team implemented FlagShark for continuous flag lifecycle monitoring and automated cleanup PR generation, converting the episodic initiative into an ongoing process.
Takeaway: Large-scale cleanup solves the backlog. Automation prevents the backlog from re-forming. You need both.
Lesson 7: Celebrate the wins publicly
The team shared the cleanup results in an all-engineering meeting. The build time improvements, the dead code discoveries, the security findings -- all of it was presented with specific numbers and before/after comparisons. The presentation shifted the organizational culture: cleanup work went from "that thing we should do someday" to "that thing that demonstrably made everything better."
Takeaway: Make the results visible. Numbers convince. "We removed 12,400 lines of dead code and found 2 security issues" is a more compelling argument for ongoing flag hygiene than any policy document.
Would you do it again?
When asked whether the initiative was worth the investment, the team's answer was immediate and unanimous: yes, but earlier.
The 480 person-hours cost was significant -- roughly equivalent to 3 engineer-months of focused work. But the returns in build time, deployment frequency, incident reduction, and developer satisfaction dwarfed the investment. The ongoing productivity savings were substantial enough that the payback period was well under a quarter.
The only regret was not starting sooner. Every month of delay meant more flags accumulating, more dead code growing, and more productivity draining away silently. The audit that triggered the initiative could have been run a year earlier, and the savings would have been compounding ever since.
Removing 500 stale feature flags is not a heroic act. It is an overdue one. Every organization that uses feature flags at scale is accumulating the same debt, experiencing the same productivity drag, and carrying the same operational risk. The question is not whether to clean up -- it is when, and whether you will wait for the debt to force the conversation or start the audit today. The flags are there. The cost is real. And the results of cleanup are better than you expect.