Nobody wakes up one morning and decides to remove 500 feature flags. What happens instead is this: a team runs a routine audit, counts the flags in their codebase, and the number is so much larger than anyone expected that ignoring it becomes impossible. The audit forces a conversation. The conversation becomes a planning meeting. The planning meeting becomes a quarter-long initiative that touches every service, every team, and every assumption about what "done" looks like in software delivery.
This is the story of what that initiative looks like in practice -- the process, the results, the failures, and the lessons -- synthesized from documented large-scale cleanup efforts including Uber's removal of approximately 2,000 stale flags with Piranha, publicly shared engineering postmortems, and patterns observed across organizations that have undertaken flag debt remediation programs. The specific numbers represent realistic composites drawn from these sources, not a single organization's data.
The narrative follows a mid-size engineering organization: 180 engineers, 12 services, 3 primary programming languages (Go, TypeScript, Python), and one flag management platform (LaunchDarkly). The numbers are real-world plausible. The pain is universal.
The before state: 537 flags across 12 services
The audit started as a side project. A senior engineer, frustrated after spending an entire afternoon debugging an issue caused by a flag interaction nobody understood, wrote a script to count every flag evaluation across the organization's repositories.
The count came back: 537 active flags across 12 services.
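A first pass at such an audit does not require special tooling. The sketch below is a hypothetical Python reconstruction of that kind of script, assuming all repositories are cloned under one directory and that flags are evaluated through LaunchDarkly-style `variation(...)` calls; a real audit would need additional patterns for each language and SDK wrapper in use.

```python
import re
from pathlib import Path

# Match common LaunchDarkly-style evaluation calls, e.g.
#   client.variation("new-checkout-flow", user, False)
#   client.boolVariation("new-checkout-flow", ctx, false)
FLAG_CALL = re.compile(r'\b\w*[vV]ariation\w*\(\s*["\']([\w.:-]+)["\']')

SOURCE_EXTENSIONS = {".go", ".ts", ".tsx", ".py"}

def audit(repos_root: str) -> dict[str, int]:
    """Return a map of flag key -> number of evaluation call sites."""
    counts: dict[str, int] = {}
    for path in Path(repos_root).rglob("*"):
        if (not path.is_file()
                or ".git" in path.parts
                or path.suffix not in SOURCE_EXTENSIONS):
            continue
        text = path.read_text(encoding="utf-8", errors="ignore")
        for match in FLAG_CALL.finditer(text):
            counts[match.group(1)] = counts.get(match.group(1), 0) + 1
    return counts

if __name__ == "__main__":
    flags = audit("./repos")
    print(f"{len(flags)} distinct flags, {sum(flags.values())} call sites")
    for key, n in sorted(flags.items(), key=lambda kv: -kv[1]):
        print(f"{n:5d}  {key}")
```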
The inventory breakdown
| Category | Count | Percentage |
|---|---|---|
| Release flags (100% enabled, feature shipped) | 248 | 46% |
| Experiment flags (experiment concluded, never cleaned up) | 89 | 17% |
| Operational flags (kill switches, circuit breakers) | 62 | 12% |
| Permission/entitlement flags | 48 | 9% |
| Migration flags (migration completed) | 41 | 8% |
| Unknown purpose (no documentation, no active owner) | 49 | 9% |
Nearly half of all flags were release flags that had been at 100% for months or years. These flags were controlling features that had been fully shipped -- in some cases, features that had been in production for over two years. The flags served no operational purpose. They were dead weight.
The 49 flags with "unknown purpose" were the most alarming. Nobody on the current team could explain what they controlled or why they existed. The engineers who created them had left the company, and the flags had no documentation, no comments, and names like TEMP_ROLLBACK_V2, NEW_FLOW_ENABLED, and EXPERIMENT_ALPHA.
The impact before cleanup
The team measured several baseline metrics before beginning the cleanup:
| Metric | Value | Notes |
|---|---|---|
| Average CI build time | 14.2 minutes | Across all services |
| Full test suite duration | 38 minutes | Including flag-variant tests |
| Average PR review time | 4.2 hours | From open to first review |
| Mean time to resolve incidents | 47 minutes | Flag-related incidents: 68 minutes |
| New engineer onboarding (productive) | 6 weeks | Time to first meaningful PR |
| Developer satisfaction (internal survey) | 6.1/10 | "Codebase health" category |
| Deployment frequency | 3.8 deploys/week per service | Down from 5.2 two years prior |
| Flag-related incidents (quarterly) | 8 | Average over previous 4 quarters |
The deployment frequency decline was the metric that got executive attention. Two years earlier, before the flag count crossed 300, the same services were deploying 5.2 times per week. The slowdown was gradual enough that nobody noticed until the trend line was plotted.
The cleanup process
The organization allocated one quarter to the initiative. A three-person team -- one senior backend engineer, one senior frontend engineer, and one QA engineer -- would lead the effort, with every service team contributing 10% of their sprint capacity to cleanup work in their own services.
Phase 1: Inventory and categorization (Weeks 1-2)
The first two weeks were spent building a complete inventory of every flag, its current state, its age, its owner (if identifiable), and its dependencies.
Categorization criteria:
| Category | Criteria | Action |
|---|---|---|
| Safe to remove | 100% enabled for > 90 days, no incidents in 90 days | Remove immediately |
| Likely safe | 100% enabled for 30-90 days, no incidents | Remove with monitoring |
| Needs investigation | Unknown purpose, no owner, complex targeting rules | Research before removing |
| Intentionally long-lived | Documented kill switches, active entitlements | Keep, document, set review date |
| Active | Currently in rollout or experiment | Leave alone |
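Expressed as code, the decision table above is one short function. A minimal sketch in Python -- the `Flag` record and its field names are assumptions about what the inventory collected, not the team's actual schema:

```python
from dataclasses import dataclass

@dataclass
class Flag:
    key: str
    days_at_100: int              # days fully enabled (0 if not at 100%)
    incidents_in_90_days: int     # incidents referencing this flag
    documented_long_lived: bool   # kill switch / entitlement with an owner
    in_active_rollout: bool       # currently rolling out or experimenting

def categorize(f: Flag) -> str:
    if f.in_active_rollout:
        return "active"                      # leave alone
    if f.documented_long_lived:
        return "intentionally-long-lived"    # keep, document, set review date
    if f.days_at_100 > 90 and f.incidents_in_90_days == 0:
        return "safe-to-remove"              # remove immediately
    if 30 <= f.days_at_100 <= 90 and f.incidents_in_90_days == 0:
        return "likely-safe"                 # remove with monitoring
    return "needs-investigation"             # research before removing
```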
Inventory results:
| Category | Count |
|---|---|
| Safe to remove | 289 |
| Likely safe | 74 |
| Needs investigation | 68 |
| Intentionally long-lived | 62 |
| Active | 44 |
| Total flagged for removal | 431 |
The team initially targeted the 431 flags flagged for removal. Investigation confirmed that 66 of the 68 "needs investigation" flags, plus all "likely safe" flags, were indeed removable; the other two investigation flags turned out to be actively needed despite having no documentation (more on those later). As the quarter progressed, additional release flags crossed the 90-day "safe to remove" threshold, bringing the final removal count to 502.
Phase 2: Prioritization (Week 3)
Not all flag removals are equal. The team prioritized based on a scoring model:
| Factor | Weight | Rationale |
|---|---|---|
| Number of files touched | 30% | More files = more code simplified |
| Service criticality | 25% | Highest-traffic services benefit most |
| Flag interactions (depends on other flags) | 20% | Removing interaction sources reduces combinatorial complexity |
| Age of flag | 15% | Oldest flags are most likely to surprise |
| Difficulty of removal | 10% | Easy wins first to build momentum |
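Applied literally, the model is a weighted sum over normalized factors. A hypothetical sketch, assuming each factor is first normalized to a 0-1 score (the weights come from the table above; the field names are invented):

```python
WEIGHTS = {
    "files_touched": 0.30,        # more files = more code simplified
    "service_criticality": 0.25,  # highest-traffic services benefit most
    "interactions": 0.20,         # interaction sources drive combinatorics
    "age": 0.15,                  # oldest flags are most likely to surprise
    "ease_of_removal": 0.10,      # easy wins first to build momentum
}

def priority_score(factors: dict[str, float]) -> float:
    """Each factor is pre-normalized to [0, 1]; higher score = remove sooner."""
    return sum(WEIGHTS[name] * factors.get(name, 0.0) for name in WEIGHTS)

# Example: an old, easy-to-remove flag in a mid-tier service.
score = priority_score({
    "files_touched": 0.4,
    "service_criticality": 0.5,
    "interactions": 0.1,
    "age": 0.9,
    "ease_of_removal": 1.0,
})
# 0.12 + 0.125 + 0.02 + 0.135 + 0.10 = 0.50
```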
The prioritization produced four tiers:
- Tier 1 (Weeks 4-6): 180 flags across the 4 highest-traffic services
- Tier 2 (Weeks 7-9): 160 flags across the next 4 services
- Tier 3 (Weeks 10-11): 110 flags across the remaining services
- Tier 4 (Week 12): 52 flags requiring special handling (complex interactions, shared libraries)
Phase 3: Phased removal (Weeks 4-12)
The removal process followed a strict protocol for each flag:
- Create a cleanup PR removing the flag evaluation and the dead code path
- Run the full test suite and fix any test failures
- Deploy to staging and run smoke tests
- Deploy to production during low-traffic hours
- Monitor for 24 hours for any anomalies
- Remove the flag from LaunchDarkly after the monitoring window
- Close the associated cleanup ticket
For the first two weeks, the team removed flags one at a time, deploying each removal individually to build confidence. After establishing the pattern and confirming no issues, they began batching: up to 5 related flags per PR, deployed as a group.
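For illustration, here is a typical single-flag cleanup PR in miniature -- a hypothetical Python before/after with an invented flag key, showing the evaluation and the dead "off" path collapsing into the "on" path:

```python
# Hypothetical helpers standing in for the real page renderers.
def render_new_checkout(user): ...
def render_legacy_checkout(user): ...

# Before: a flag evaluation guarding a dead "off" path. The flag has been
# at 100% for a year; no production traffic reaches the legacy branch.
def render_checkout(user, ld_client):
    if ld_client.variation("new-checkout-flow", user, False):
        return render_new_checkout(user)
    return render_legacy_checkout(user)

# After: the "on" path is inlined, and the dead branch (plus, eventually,
# render_legacy_checkout itself) is deleted.
def render_checkout(user, ld_client):
    return render_new_checkout(user)
```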
Removal velocity:
| Week | Flags Removed | Cumulative | Issues Encountered |
|---|---|---|---|
| 4 | 28 | 28 | 0 |
| 5 | 42 | 70 | 1 (test failure, quick fix) |
| 6 | 51 | 121 | 0 |
| 7 | 58 | 179 | 1 (see "What went wrong" below) |
| 8 | 63 | 242 | 0 |
| 9 | 71 | 313 | 1 (see "What went wrong" below) |
| 10 | 68 | 381 | 0 |
| 11 | 62 | 443 | 1 (see "What went wrong" below) |
| 12 | 59 | 502 | 0 |
Total flags removed: 502 over 9 weeks of active cleanup (12 weeks including planning).
What improved
The results exceeded the team's expectations in almost every measured dimension. Some improvements were anticipated; others were genuine surprises.
Build and test performance
| Metric | Before Cleanup | After Cleanup | Improvement |
|---|---|---|---|
| Average CI build time | 14.2 minutes | 11.3 minutes | 20% faster |
| Full test suite duration | 38 minutes | 24 minutes | 37% faster |
| Test case count | 4,847 | 3,921 | 926 tests removed (19%) |
| Flaky test rate | 3.2% | 1.8% | 44% reduction |
The test suite improvement was the most dramatic. Nearly 1,000 test cases existed solely to test flag variant combinations. Many of these were testing the "off" path of flags that had been 100% enabled for over a year -- testing code paths that no user had exercised in production for months. The flaky test reduction was a secondary benefit: many flaky tests were caused by race conditions in flag evaluation mocking.
Developer productivity
| Metric | Before Cleanup | After Cleanup | Improvement |
|---|---|---|---|
| Average PR review time | 4.2 hours | 3.1 hours | 26% faster |
| Lines of code removed | - | 47,200 | Net reduction |
| Files modified/deleted | - | 1,340 files modified, 89 files deleted | Simplified codebase |
| Deployment frequency | 3.8/week per service | 4.9/week per service | 29% increase |
| Developer satisfaction (survey) | 6.1/10 | 7.8/10 | +1.7 points |
The deployment frequency increase was the most strategically significant result. With fewer flags to reason about, engineers felt more confident shipping changes. The psychological burden of "what if this change interacts with a flag I don't understand" was materially reduced.
The developer satisfaction jump was unusually large for a single initiative. In post-cleanup surveys, engineers cited "the codebase feels cleaner" and "I can understand what's happening now" as the primary drivers. One engineer's comment captured the sentiment: "I didn't realize how much cognitive load the flags were adding until they were gone."
Operational reliability
| Metric | Before Cleanup | After Cleanup | Improvement |
|---|---|---|---|
| Flag-related incidents (quarterly) | 8 | 2 | 75% reduction |
| Mean time to resolve (all incidents) | 47 minutes | 38 minutes | 19% faster |
| Mean time to resolve (flag-related) | 68 minutes | 41 minutes | 40% faster |
| On-call escalations involving flags | 14/quarter | 3/quarter | 79% reduction |
Fewer flags means fewer flag interactions, which means fewer surprises in production. The remaining flag-related incidents were caused by active flags in rollout (expected and manageable), not by stale flags creating confusion.
New engineer onboarding
| Metric | Before Cleanup | After Cleanup | Improvement |
|---|---|---|---|
| Time to first meaningful PR | 6 weeks | 4 weeks | 33% faster |
| Onboarding questions about flags | 12 per new hire | 4 per new hire | 67% fewer |
| Services understood after 30 days | 2-3 | 4-5 | Broader early contribution |
New engineers could read the code and understand what it did. They no longer needed to ask "what does this flag do?" a dozen times during their first month, only to learn that the answer was "nothing, it's been on for two years."
What went wrong
Not everything went smoothly. Three flags caused problems during removal, and each taught the team something important about the risks of large-scale cleanup.
Incident 1: The shadow dependency (Week 7)
Flag: migration_payment_processor_v2
What happened: This flag had been at 100% for 14 months and was categorized as "safe to remove." When the team removed the flag and its associated "off" code path, a downstream batch processing job started failing. The batch job was not in the organization's primary codebase -- it was in a separate repository owned by the finance team, and it was checking the flag's value directly via the LaunchDarkly API (not through the SDK in the application code).
Impact: The batch job failed for 3 hours before the finance team noticed. No customer-facing impact, but financial reconciliation was delayed by one day.
Root cause: The flag's consumers extended beyond the application code. The batch job treated the flag as a feature toggle, using the LaunchDarkly API to check whether the new payment processor was active before running reconciliation logic.
Fix: The team restored the flag in LaunchDarkly (not in the code), fixed the batch job to remove its flag dependency, then re-removed the flag. Total resolution: 6 hours.
Lesson: Before removing a flag from the management platform, check for API-level consumers outside the application code. The management platform's usage analytics showed the batch job's evaluations, but nobody checked.
Incident 2: The accidental kill switch (Week 9)
Flag: EXPERIMENT_ALPHA
What happened: This was one of the 49 "unknown purpose" flags. Investigation during Phase 1 found no documentation, no owner, and the flag had been at 100% for 22 months. The team categorized it as safe to remove. After removal and deployment, a small percentage of users (approximately 2%) started seeing errors on the settings page.
Impact: 2% of users affected for 45 minutes. The on-call engineer rolled back the deployment within 15 minutes of the first alert; the remaining 30 minutes were spent verifying the rollback was clean.
Root cause: EXPERIMENT_ALPHA was not an experiment flag. It was a release flag for a settings page redesign that had been rolled out to 100% of users. But the "off" path was not the old settings page -- it was an error page. The original developer had deleted the old settings page code during the flag's rollout, leaving the "off" path as a broken render. With the flag at 100%, nobody ever hit the "off" path, so the broken state was invisible.
Fix: The team merged a PR that removed the flag properly -- keeping the "on" code path and deleting the broken "off" path. Total resolution: 2 hours including the rollback.
Lesson: When removing a flag, do not assume the "off" path is functional. Test both paths before removing the flag, even if the flag has been at 100% for years. The "off" path may have been broken for its entire lifetime without anyone noticing.
Incident 3: The performance cliff (Week 11)
Flag: cache_strategy_new
What happened: This flag controlled which caching strategy a high-traffic service used. The "on" path used a new, more efficient caching implementation. The "off" path used the legacy caching layer. The flag had been at 100% for 8 months. When the team removed the flag and its "off" code path (the legacy cache), the service's memory usage spiked by 40%.
Impact: No user-facing errors, but the service hit its memory limits on 2 of 8 pods, triggering restarts. Latency increased by 15% for approximately 30 minutes until the restarts stabilized.
Root cause: The legacy caching code, despite being behind a permanently disabled flag path, was still being initialized at startup. It was pre-warming a shared cache that the "new" caching strategy was inadvertently relying on. Removing the legacy code removed the pre-warming, causing the new cache to start cold and consume more memory during initial population.
Fix: The team added explicit cache pre-warming to the new caching implementation (which should have been there from the start) and re-deployed. Total resolution: 4 hours.
Lesson: Flags do not always create clean boundaries between code paths. Shared state, initialization side effects, and resource dependencies can span flag boundaries. High-traffic services need extra scrutiny during cleanup.
The surprising findings
Beyond the expected improvements in build times and developer productivity, the cleanup surfaced several unexpected discoveries.
12,400 lines of dead code exposed
Removing 502 flags did not just delete flag evaluations. It exposed large blocks of code that were only reachable through disabled flag paths. The team found:
- 12,400 lines of code that were completely unreachable (behind permanently disabled flags)
- 3 entire API endpoints that had been disabled for over a year and were still being maintained (rate limiting, monitoring, documentation) despite serving zero traffic
- 2 database migrations that had been written but never executed, gated behind migration flags that were never enabled
The dead code had been invisible because it was syntactically valid and passed all linting checks. It was only reachable through flag paths that were permanently off, making it functionally equivalent to deleted code -- but with ongoing maintenance costs.
14 unused dependencies discovered
Removing flag-gated code paths revealed that 14 third-party dependencies were only used by dead code. These dependencies were being downloaded, compiled, and potentially scanned for vulnerabilities -- all for code that would never execute.
| Dependency Type | Count | Impact |
|---|---|---|
| Go modules | 6 | Build time reduction |
| npm packages | 5 | Bundle size reduction (380KB) |
| Python packages | 3 | Docker image size reduction |
Removing these dependencies reduced the frontend bundle size by 380KB (a 4% reduction) and reduced Go build times by an additional 8% beyond the test suite improvements.
2 security issues found
Two of the removed flags were gating access to admin-level functionality. In both cases, the flags had been set to 100% (enabled for all users), but the code behind the flags contained authorization checks that were less strict than the current production authorization system. If the flags had been toggled off and then on again (by anyone with access to the flag management platform), the weaker authorization would have been re-enabled.
Neither flag had been exploited, but both represented a latent security vulnerability that existed for over a year. The cleanup eliminated the risk entirely by removing the code paths with weaker authorization.
1 billing discrepancy revealed
One flag controlled whether a service reported usage metrics to the billing system. The flag had been enabled, but the "on" code path contained a bug that under-counted API calls by approximately 3%. The "off" path (which nobody had examined in 16 months) contained the correct counting logic. The bug had been losing the organization an estimated $8,000-12,000 per month in unbilled usage.
The cleanup did not fix this bug directly -- the team discovered it during the investigation phase when they were categorizing flags and reviewing both code paths. The fix was deployed separately, and the team recovered partial revenue through usage reconciliation.
Metrics: Before and after summary
| Category | Metric | Before | After | Change |
|---|---|---|---|---|
| Codebase | Active flags | 537 | 97 | -82% |
| | Lines of code | 312,000 | 264,800 | -15% |
| | Dead code paths | Unknown (est. 12,400 lines) | 0 verified | Eliminated |
| | Third-party dependencies | 189 | 175 | -7% |
| Performance | CI build time | 14.2 min | 11.3 min | -20% |
| | Test suite duration | 38 min | 24 min | -37% |
| | Test count | 4,847 | 3,921 | -19% |
| | Flaky test rate | 3.2% | 1.8% | -44% |
| | Frontend bundle size | 9.2 MB | 8.8 MB | -4% |
| Productivity | PR review time | 4.2 hrs | 3.1 hrs | -26% |
| | Deployment frequency | 3.8/week | 4.9/week | +29% |
| | Developer satisfaction | 6.1/10 | 7.8/10 | +28% |
| | Onboarding time | 6 weeks | 4 weeks | -33% |
| Reliability | Flag-related incidents/quarter | 8 | 2 | -75% |
| | MTTR (all incidents) | 47 min | 38 min | -19% |
| | On-call escalations (flag) | 14/quarter | 3/quarter | -79% |
| Cleanup costs | Engineering time invested | - | ~480 person-hours | One-time cost |
| | Incidents during cleanup | - | 3 | All resolved within 6 hours |
Key lessons learned
The team documented 10 lessons from the initiative. These are the most broadly applicable:
Lesson 1: The hardest part is deciding to start
The audit that revealed 537 flags could have been run at any time. The data was always available. What was missing was the will to look. The team spent months knowing they had a flag problem while avoiding quantifying it. Once the number was visible, action became inevitable.
Takeaway: Run the audit. Count your flags. The number will be higher than you think, and seeing it will catalyze action.
Lesson 2: Phased removal is dramatically safer than big-bang
The team's protocol of removing flags in batches with 24-hour monitoring windows between deployments caught all three incidents early, before they could cascade. A big-bang approach (removing all 502 flags in a single deployment) would have made incident diagnosis nearly impossible.
Takeaway: Remove flags in small batches. One flag per PR is ideal. Five related flags per PR is the maximum. Never combine unrelated flag removals.
Lesson 3: The "off" path is more dangerous than you think
Two of the three incidents were caused by assumptions about the "off" code path. In one case, the "off" path was broken. In another, it had hidden side effects. After years without execution, the "off" path is essentially untested production code.
Takeaway: Before removing a flag, deploy with the flag set to "off" in a staging environment and verify the application behaves correctly. If the "off" path fails in staging, that is valuable information about what will happen if the flag is ever toggled in production.
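One cheap way to enforce this before the staging deploy is to run the affected tests against both flag states in CI. A hypothetical pytest sketch, using a simple test double rather than any specific SDK fixture; `render_settings_page` is a stand-in for the handler under test:

```python
import pytest

class FakeFlagClient:
    """Test double: returns a fixed value for every flag evaluation."""
    def __init__(self, value: bool):
        self._value = value

    def variation(self, key: str, user: dict, default: bool) -> bool:
        return self._value

# Hypothetical handler under test; in a real suite this is imported
# from the application code.
def render_settings_page(user: dict, ld_client) -> int:
    if ld_client.variation("settings-redesign", user, False):
        return 200        # "on" path: the redesigned page
    return 200            # "off" path: must also be a working page

@pytest.mark.parametrize("flag_on", [True, False])
def test_settings_page_renders_in_both_flag_states(flag_on: bool):
    status = render_settings_page({"key": "test-user"}, FakeFlagClient(flag_on))
    # EXPERIMENT_ALPHA would have failed here: its "off" path was an error page.
    assert status == 200
```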
Lesson 4: Flag consumers extend beyond your codebase
The shadow dependency incident (a batch job querying the flag via API) revealed that flags can have consumers that do not appear in code search results. Any system that integrates with your flag management platform's API is a potential consumer.
Takeaway: Check the flag management platform's evaluation analytics before removing flags from the platform. If a flag shows evaluation traffic from unexpected sources, investigate before removing.
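The check can be scripted against the platform's REST API. A hypothetical sketch against LaunchDarkly's flag status endpoint, which reports when a flag was last evaluated -- verify the path and response shape against the current API documentation before relying on it:

```python
import os

import requests  # third-party HTTP client: pip install requests

API = "https://app.launchdarkly.com/api/v2"
HEADERS = {"Authorization": os.environ["LD_API_TOKEN"]}

def last_requested(project: str, env: str, flag_key: str) -> str | None:
    """Return the flag's lastRequested timestamp from the flag status
    endpoint, or None if the flag has never been evaluated."""
    resp = requests.get(
        f"{API}/flag-statuses/{project}/{env}/{flag_key}", headers=HEADERS
    )
    resp.raise_for_status()
    return resp.json().get("lastRequested")

# A flag that is gone from every repo but still shows recent evaluations
# has a consumer outside the codebase -- a batch job, a script, a partner.
print(last_requested("default", "production", "migration_payment_processor_v2"))
```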
Lesson 5: Cleanup is a feature, not a chore
The team initially framed the cleanup as "paying down technical debt" -- necessary but unglamorous. By the end, they reframed it as "shipping a performance and reliability improvement." The build time reduction, the deployment frequency increase, and the incident rate decrease were tangible product improvements that benefited every engineer and every customer.
Takeaway: Frame cleanup work as the feature delivery it is. A 37% faster test suite and a 29% increase in deployment frequency are features. They have measurable business value.
Lesson 6: Invest in automation for the next time
The 480 person-hours invested in the cleanup initiative were a one-time cost, but flag creation did not stop. Without automation, the organization would need another cleanup initiative in 12-18 months. The team implemented FlagShark for continuous flag lifecycle monitoring and automated cleanup PR generation, converting the episodic initiative into an ongoing process.
Takeaway: Large-scale cleanup solves the backlog. Automation prevents the backlog from re-forming. You need both.
Lesson 7: Celebrate the wins publicly
The team shared the cleanup results in an all-engineering meeting. The build time improvements, the dead code discoveries, the security findings -- all of it was presented with specific numbers and before/after comparisons. The presentation shifted the organizational culture: cleanup work went from "that thing we should do someday" to "that thing that demonstrably made everything better."
Takeaway: Make the results visible. Numbers convince. "We removed 12,400 lines of dead code and found 2 security issues" is a more compelling argument for ongoing flag hygiene than any policy document.
Would you do it again?
When asked whether the initiative was worth the investment, the team's answer was immediate and unanimous: yes, but earlier.
The 480 person-hours cost was significant -- roughly equivalent to 3 engineer-months of focused work. But the returns in build time, deployment frequency, incident reduction, and developer satisfaction dwarfed the investment. The ongoing productivity savings were substantial enough that the payback period was well under a quarter.
The only regret was not starting sooner. Every month of delay meant more flags accumulating, more dead code growing, and more productivity draining away silently. The audit that triggered the initiative could have been run a year earlier, and the savings would have been compounding ever since.
Removing 500 stale feature flags is not a heroic act. It is an overdue one. Every organization that uses feature flags at scale is accumulating the same debt, experiencing the same productivity drag, and carrying the same operational risk. The question is not whether to clean up -- it is when, and whether you will wait for the debt to force the conversation or start the audit today. The flags are there. The cost is real. And the results of cleanup are better than you expect.