At 2:47 AM on a Tuesday, your on-call engineer gets paged. The new recommendation engine -- deployed to production three hours earlier -- is returning results that mix up user profiles. Customers are seeing other people's purchase history in their "recommended for you" section. This is a privacy incident, not just a bug.
The engineer has two options. Option A: initiate a full rollback through the deployment pipeline -- revert the commit, trigger CI, wait for builds, deploy to staging, verify, promote to production. Estimated time to resolution: 25 to 40 minutes. Option B: flip a kill switch. One API call to the feature flag provider, the recommendation engine reverts to the previous algorithm, and the privacy leak stops. Estimated time to resolution: 45 seconds.
This is the promise of kill switches. When they work, they are the fastest rollback mechanism in your toolbox. But when they are poorly designed, forgotten, or allowed to accumulate unchecked, they become a different kind of liability -- one that can be just as dangerous as the incidents they were built to prevent.
Understanding flag types by purpose
Before diving into kill switch design, it is important to understand where kill switches fit in the broader taxonomy of feature flags. Not all flags serve the same purpose, and conflating them leads to mismanaged lifecycles.
| Flag Type | Purpose | Expected Lifespan | Rollback Relevance |
|---|---|---|---|
| Release toggle | Gradually roll out a new feature to users | 2-8 weeks | Medium -- can disable a feature during rollout, but not designed for emergency use |
| Experiment flag | A/B test or multivariate experiment | 1-4 weeks | Low -- experiments are typically isolated and low-risk |
| Operational toggle | Control system behavior (rate limits, circuit breakers, feature degradation) | Long-lived, reviewed quarterly | High -- designed for ongoing operational control |
| Kill switch | Instantly disable a feature or revert to previous behavior during an incident | Long-lived, but should have a retirement plan | Critical -- this is the flag's entire purpose |
| Permission gate | Control access to features by user segment, role, or subscription tier | Varies | Low -- not typically used for rollback |
The critical distinction: release toggles and experiment flags are temporary by nature. They should be removed once the rollout is complete or the experiment concludes. Kill switches and operational toggles are intentionally longer-lived, but "longer-lived" does not mean "forever" -- a point that many teams overlook with expensive consequences.
Anatomy of an effective kill switch
A kill switch that works at 2:47 AM during a privacy incident is not the same as a flag you casually wrap around a feature during development. Effective kill switches are engineered for reliability under pressure.
Design principles
1. Single responsibility. A kill switch should control exactly one feature or behavior. Kill switches that disable multiple unrelated features create collateral damage during incidents. If disabling the recommendation engine also disables the search ranking algorithm because they share a flag, your "targeted" rollback just became a shotgun blast.
// Good: Single-purpose kill switch
if (flags.isEnabled('kill-switch-recommendation-engine-v3')) {
return recommendationEngineV3.getResults(user);
}
return recommendationEngineV2.getResults(user);
// Bad: Multi-purpose flag masquerading as a kill switch
if (flags.isEnabled('new-ml-features')) {
// Controls recommendations AND search ranking AND email personalization
return {
recommendations: newRecommendations(user),
searchRanking: newSearchRanking(query),
emailContent: newEmailPersonalization(user),
};
}
2. Safe defaults. When a kill switch is flipped off, the system should revert to a known-good state. This means the fallback path must be maintained and tested, not abandoned as dead code.
func GetRecommendations(ctx context.Context, user *User) ([]Product, error) {
if featureFlags.IsEnabled("kill-switch-rec-engine-v3") {
results, err := recEngineV3.Query(ctx, user)
if err != nil {
// If the new engine fails, fall back even without
// flipping the kill switch
log.Warn("rec-engine-v3 failed, falling back", "error", err)
return recEngineV2.Query(ctx, user)
}
return results, nil
}
// Kill switch is off: use the proven previous version
return recEngineV2.Query(ctx, user)
}
3. No approval gates. The on-call engineer flipping the kill switch at 3 AM should not need permissions beyond those already granted for incident response, and should never have to wait on an approval workflow. Kill switch access should be pre-authorized for everyone in the on-call rotation.
4. Minimal evaluation overhead. Kill switches should evaluate as close to instantly as possible. A kill switch that requires a network call to a remote flag service with a 500ms timeout is a kill switch that might not work when you need it most. Use local caching with short TTLs, or configure your flag provider to serve kill switch values from an in-memory cache.
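A minimal sketch of that caching layer, assuming a hypothetical FlagClient interface with an asynchronous isEnabled call; most providers' SDKs offer an equivalent streaming or in-memory cache:
// Sketch: an in-memory cache with a short TTL in front of a hypothetical FlagClient.
// Evaluation never blocks on the network; flips propagate within ttlMs.
interface FlagClient {
  isEnabled(key: string): Promise<boolean>;
}

class CachedKillSwitch {
  private value: boolean;
  private fetchedAt = 0;

  constructor(
    private client: FlagClient,
    private key: string,
    private ttlMs = 5_000,   // a few seconds keeps flips fast without hammering the provider
    defaultValue = false,    // fail safe: if the provider is unreachable, serve the fallback path
  ) {
    this.value = defaultValue;
  }

  // Returns the cached value immediately; refreshes in the background when stale.
  isEnabled(): boolean {
    const now = Date.now();
    if (now - this.fetchedAt > this.ttlMs) {
      this.fetchedAt = now;
      this.client
        .isEnabled(this.key)
        .then((fresh) => { this.value = fresh; })
        .catch(() => { /* keep the last known value if the provider is unreachable */ });
    }
    return this.value;
  }
}
The tradeoff is that a flip can take up to one TTL to propagate, which is why kill switch TTLs should be measured in seconds, not minutes.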
5. Observable state. It must be immediately clear whether a kill switch is on or off. This means logging, dashboards, and alerting. When an engineer flips a kill switch, the team should know -- via Slack notification, PagerDuty annotation, or equivalent -- within seconds.
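One lightweight way to make flips visible, sketched with a hypothetical Slack incoming-webhook URL -- substitute PagerDuty annotations or whatever channel your team actually watches:
// Sketch: record and broadcast every kill switch flip. SLACK_WEBHOOK_URL is a
// hypothetical incoming-webhook endpoint; swap in your own notification channel.
async function recordKillSwitchFlip(flagKey: string, enabled: boolean, actor: string): Promise<void> {
  const event = {
    event: "kill_switch_flip",
    flagKey,
    enabled,
    actor,
    at: new Date().toISOString(),
  };

  // Structured log line for dashboards and the audit trail
  console.log(JSON.stringify(event));

  // Push notification so the whole team sees the flip within seconds
  await fetch(process.env.SLACK_WEBHOOK_URL!, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ text: `Kill switch ${flagKey} set to ${enabled ? "ON" : "OFF"} by ${actor}` }),
  });
}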
Kill switch naming convention
Kill switches should be instantly identifiable in the codebase. A naming convention that distinguishes them from other flag types eliminates ambiguity during incidents when speed matters.
Recommended format: kill-switch-[feature]-[version]
| Name | What It Controls |
|---|---|
| kill-switch-recommendation-engine-v3 | New recommendation engine (v3) |
| kill-switch-payment-processor-stripe | Stripe payment integration |
| kill-switch-realtime-notifications | WebSocket-based notification system |
| kill-switch-ai-content-generation | AI-powered content generation feature |
The kill-switch- prefix is critical. During an incident, an engineer scanning a list of flags in the provider dashboard can immediately identify which flags are designed to be flipped in emergencies versus which are release toggles or experiments that should not be touched without broader context.
Rollback strategies using feature flags
Kill switches are one tool in the rollback arsenal. The broader question is how feature flags fit into your overall rollback strategy and when they are the right choice versus other mechanisms.
Strategy 1: Kill switch rollback (instant)
How it works: Flip the kill switch flag off. The new code path is bypassed, and the system reverts to the previous behavior.
Time to rollback: Seconds to a few minutes (depending on flag propagation speed).
When to use:
- Privacy or security incidents where every second counts
- Performance degradation from a newly deployed feature
- User-facing bugs that affect a large percentage of traffic
- Any situation where the blast radius is growing and speed is paramount
When not to use:
- Infrastructure failures (flags cannot fix a downed database)
- Data corruption (disabling the feature does not uncorrupt the data)
- Issues in the flag evaluation system itself (circular dependency)
Requirements:
- Kill switch must be pre-deployed before the feature goes live
- Fallback code path must be maintained and functional
- Flag provider must be highly available (99.99%+ uptime)
Strategy 2: Percentage rollback (gradual)
How it works: Reduce the rollout percentage of a release toggle from, say, 50% to 10% or 0%.
Time to rollback: Minutes (percentage changes propagate through the flag provider).
When to use:
- Issues discovered during a progressive rollout that are not emergencies
- Degraded metrics (conversion rate drop, latency increase) that warrant investigation
- Bugs affecting a specific user segment that can be narrowed by adjusting targeting rules
Example scenario: You are rolling out a new checkout flow at 25% of traffic. Conversion rate drops 3% for the test group. You reduce the percentage to 5%, isolate the issue to mobile Safari users, fix the CSS bug, and resume the rollout. Total user impact: minimal. No deployment pipeline involved.
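Under the hood, most flag providers bucket users deterministically -- hashing the user ID and flag key into a stable bucket -- which is why lowering the percentage shrinks the exposed group rather than reshuffling it. A rough sketch of the idea (not any particular provider's algorithm):
import { createHash } from "node:crypto";

// Sketch: deterministic percentage bucketing (not any specific provider's algorithm).
// Hashing userId + flagKey maps each user to a stable bucket in [0, 100), so lowering
// the rollout from 25% to 5% keeps a stable 5% exposed instead of reshuffling users.
function inRollout(flagKey: string, userId: string, percentage: number): boolean {
  const digest = createHash("sha256").update(`${flagKey}:${userId}`).digest();
  const bucket = digest.readUInt32BE(0) % 100;
  return bucket < percentage;
}

// The same user gets the same answer on every call for a given percentage:
// inRollout("new-checkout-flow", "user-123", 25)
This stability is also what lets you layer targeting rules (for example, excluding mobile Safari) on top of the percentage without touching the deployment pipeline.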
Strategy 3: Deployment rollback (traditional)
How it works: Revert the commit, build, deploy the previous version through the standard pipeline.
Time to rollback: 15 to 45 minutes (depending on CI/CD speed and deployment process).
When to use:
- Issues in code that is not behind a feature flag
- Infrastructure-level changes (database migrations, schema changes, service mesh configuration)
- When the flag-controlled code has been removed after full rollout (no flag to flip)
- When the fallback code path behind the kill switch is itself broken
When not to use:
- During active incidents where time-to-resolution is the primary concern
- When the deployment pipeline itself is the source of the problem
Strategy 4: Blue-green or canary rollback
How it works: Route traffic from the new deployment (green/canary) back to the stable deployment (blue/baseline).
Time to rollback: 1 to 5 minutes (traffic routing change at the load balancer or service mesh level).
When to use:
- Infrastructure changes that cannot be controlled by feature flags
- Database migration issues
- Service-level incompatibilities
- When you need to roll back the entire deployment, not just one feature
Comparison matrix
| Strategy | Speed | Granularity | Pre-requisites | Risk |
|---|---|---|---|---|
| Kill switch | Seconds | Single feature | Flag must exist, fallback must work | Low (if well-designed) |
| Percentage rollback | Minutes | Feature with user targeting | Feature must use percentage-based rollout | Low |
| Deployment rollback | 15-45 min | Entire deployment | Working CI/CD pipeline | Medium (deployment could introduce new issues) |
| Blue-green/canary | 1-5 min | Entire deployment | Infrastructure support (load balancer, routing) | Low-medium |
The key insight: these strategies are complementary, not competing. The strongest rollback posture uses kill switches for feature-level instant rollback, blue-green for deployment-level rollback, and traditional deployment rollback as the last resort.
Flag-based incident response playbook
When an incident occurs, the last thing you want is an on-call engineer improvising a rollback strategy. A pre-defined playbook that incorporates kill switches reduces mean time to recovery (MTTR) and removes decision-making burden during high-stress situations.
The playbook
Step 1: Identify the blast radius (0-2 minutes)
Determine what is affected. Is it a single feature, a user segment, or the entire application? Check monitoring dashboards, error rates, and user reports.
Step 2: Determine if a kill switch exists (1-2 minutes)
Search your flag provider for a kill switch associated with the affected feature. If your naming convention includes the kill-switch- prefix, this search takes seconds.
| Finding | Action |
|---|---|
| Kill switch exists and is ON | Proceed to Step 3 |
| Kill switch exists and is OFF | Feature is already rolled back -- the issue is elsewhere |
| No kill switch exists | Skip to Step 4 |
Step 3: Flip the kill switch (30 seconds)
Disable the feature via the kill switch. Immediately verify that the fallback behavior is working correctly. Monitor error rates and user-facing metrics for recovery.
# Example: flipping a kill switch with the LaunchDarkly CLI (illustrative -- exact flags vary by CLI version)
ldcli flags update \
--project production \
--flag kill-switch-recommendation-engine-v3 \
--off
# Example: Using a custom flag management API
curl -X PATCH https://flags.internal.company.com/api/flags/kill-switch-recommendation-engine-v3 \
-H "Authorization: Bearer $ONCALL_TOKEN" \
-d '{"enabled": false}'
Step 4: If no kill switch, escalate to deployment rollback (5-10 minutes)
Initiate the standard deployment rollback process. Revert the commit, trigger the CI pipeline, and promote the previous build. While waiting, communicate the expected timeline to stakeholders.
Step 5: Post-incident flag review (within 24 hours)
After the incident is resolved, review the kill switch response:
- Did the kill switch work as expected?
- Was the fallback behavior correct and complete?
- How long did the total rollback take?
- Should a kill switch be added for this feature if one did not exist?
This last question is critical. Every incident that requires a 30-minute deployment rollback instead of a 30-second kill switch flip is a signal that your kill switch coverage has a gap.
The lifecycle of a kill switch
Kill switches are intentionally longer-lived than release toggles. But "longer-lived" must have boundaries. An unmanaged kill switch lifecycle creates the same technical debt problems as any other stale flag -- with the added danger that the fallback code path may silently rot.
Phase 1: Introduction (Day 0)
The kill switch is created alongside the feature it protects. It is documented with:
- The feature it controls
- The fallback behavior when disabled
- The owner responsible for the switch
- The expected review date
Phase 2: Active protection (Day 1 - Day 90)
The kill switch serves its primary purpose. The feature is live, the kill switch is enabled, and the fallback path is maintained and tested. During this phase, the kill switch justifies its existence through the operational safety it provides.
Phase 3: Stability review (Day 90)
After 90 days with the feature running without incidents, the first lifecycle question arises: is this kill switch still providing value that justifies the maintenance cost of the fallback code path?
| Scenario | Recommendation |
|---|---|
| Feature has been incident-free for 90 days and is non-critical | Remove the kill switch and the fallback code |
| Feature is in a critical path (payments, auth, data pipeline) | Retain the kill switch, schedule next review in 90 days |
| Feature has had incidents but was successfully rolled back via the switch | Retain the kill switch, investigate root cause of instability |
| Feature has been modified significantly since the switch was created | Verify the fallback path still works; update or remove |
Phase 4: Quarterly review (ongoing for retained switches)
Kill switches that survive the 90-day review enter a quarterly review cycle. Each review must answer:
- Has the fallback code path been tested recently?
- Has the feature changed in ways that invalidate the fallback?
- Is the kill switch still in the on-call runbook?
- Does the team still know how to use this switch?
If the answer to any of these questions is "no," the kill switch has become a liability rather than an asset.
Phase 5: Retirement
When a kill switch is retired, both the switch and its fallback code path are removed. This is a cleanup task that must be treated with the same rigor as any code change:
- Remove the flag evaluation from the code
- Remove the fallback code path
- Update the on-call runbook to remove references to the switch
- Remove the flag from the provider dashboard
- Verify that tests pass without the flag
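Concretely, retirement collapses the call site to the surviving code path. A sketch, with minimal stand-in types mirroring the recommendation-engine example from earlier:
// Sketch: the call site after retirement, with stand-in types for the
// recommendation-engine example used earlier in this article.
type User = { id: string };
type Product = { id: string; name: string };

const recommendationEngineV3 = {
  getResults(user: User): Product[] {
    return [{ id: "p1", name: `picked for ${user.id}` }];
  },
};

// Before retirement this function held the kill switch check plus a fallback to
// recommendationEngineV2; after retirement only the surviving path remains, and
// both the v2 code and the flag definition can be deleted.
function getRecommendations(user: User): Product[] {
  return recommendationEngineV3.getResults(user);
}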
The danger of long-lived kill switches
Here is where most teams get into trouble. Kill switches are created with good intentions during a feature launch, retained because they provide a sense of safety, and then forgotten. Over time, the fallback code path they protect becomes a maintenance burden that nobody recognizes as such.
How kill switches rot
The fallback path stops being tested. When a kill switch has been in the "on" position for a year, the fallback path has not executed in production for a year. Dependencies change. APIs evolve. Data schemas migrate. The fallback code that worked perfectly 12 months ago may now throw exceptions, return stale data, or simply crash.
The team forgets the switch exists. Engineers who created the kill switch leave the company. New team members do not know it is there. The on-call runbook references it by a name that no longer matches the current architecture. When an incident occurs and someone tries to flip the switch, they discover it either does not work or produces worse behavior than the original problem.
The fallback path creates maintenance overhead. Every code change in the feature's area must account for both the active path and the fallback path. This means more complex PRs, longer code reviews, and a higher chance of introducing bugs. The kill switch that was supposed to reduce risk is now increasing the surface area for errors.
Dependency drift. The fallback path may depend on services, APIs, or database schemas that have been deprecated or modified since the kill switch was created.
# Kill switch created 14 months ago
if feature_flags.is_enabled("kill-switch-search-v2"):
return search_v2.query(request) # Current implementation
else:
# This fallback calls search_v1, which was decommissioned 6 months ago.
# The endpoint still exists but returns empty results.
# Nobody has tested this path since the kill switch was created.
return search_v1.query(request) # Silently broken
This is the worst possible outcome: a kill switch that appears functional but actually makes the situation worse when flipped. The on-call engineer flips the switch expecting to revert to the previous behavior. Instead, search returns zero results for all users. The incident just escalated.
Quantifying kill switch rot
| Kill Switch Age | Probability Fallback Still Works | Risk Level |
|---|---|---|
| 0-30 days | 95%+ | Low |
| 30-90 days | 80-95% | Low-Medium |
| 90-180 days | 60-80% | Medium |
| 180-365 days | 30-60% | High |
| 365+ days | Below 30% | Critical -- likely a liability |
These numbers are estimates based on typical codebase change velocity. In fast-moving codebases with frequent refactoring, the decay is faster. In stable, slow-changing systems, kill switches may remain viable longer. But the trend is universal: untested fallback paths decay over time.
Testing your kill switches
If the fallback path is not tested, the kill switch is theater -- it creates the illusion of safety without providing actual safety.
Testing strategies
1. Periodic fallback testing in staging. Schedule monthly tests where kill switches are flipped in a staging environment and the fallback behavior is verified end-to-end. Automate the verification where possible.
2. Chaos engineering integration. Include kill switch flips in your chaos engineering practice. Randomly disable features via their kill switches in a canary or staging environment and verify that the system degrades gracefully.
3. Fallback path unit tests. Write explicit tests for the fallback code path, not just the active path. These tests should be run in CI like any other test, ensuring the fallback path stays functional as the codebase evolves.
describe('RecommendationEngine', () => {
it('returns v3 results when kill switch is enabled', async () => {
mockFlags.enable('kill-switch-recommendation-engine-v3');
const results = await getRecommendations(testUser);
expect(results.source).toBe('v3');
expect(results.items.length).toBeGreaterThan(0);
});
// This test is critical -- it verifies the fallback path works
it('returns v2 results when kill switch is disabled', async () => {
mockFlags.disable('kill-switch-recommendation-engine-v3');
const results = await getRecommendations(testUser);
expect(results.source).toBe('v2');
expect(results.items.length).toBeGreaterThan(0);
});
it('falls back to v2 when v3 throws an error', async () => {
mockFlags.enable('kill-switch-recommendation-engine-v3');
mockRecEngineV3.throwOnNextCall(new Error('timeout'));
const results = await getRecommendations(testUser);
expect(results.source).toBe('v2');
});
});
4. Production dark testing. For critical kill switches, periodically route a small percentage of shadow traffic through the fallback path (without serving the results to users) to verify it still produces valid output.
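A sketch of that shadow sampling -- the primary and fallback functions and the Metrics interface are placeholders for your own engines and metrics client:
// Sketch: shadow-test the dormant fallback path on a small sample of requests.
interface Metrics {
  increment(name: string): void;
}

async function withShadowFallback<T>(
  primary: () => Promise<T[]>,
  fallback: () => Promise<T[]>,
  metrics: Metrics,
  sampleRate = 0.01,   // ~1% of requests exercise the dormant path
): Promise<T[]> {
  const result = await primary();

  if (Math.random() < sampleRate) {
    // Run the fallback in the background; count the outcome, never serve it to the user.
    fallback()
      .then((shadow) =>
        metrics.increment(shadow.length > 0 ? "fallback_shadow.ok" : "fallback_shadow.empty"),
      )
      .catch(() => metrics.increment("fallback_shadow.error"));
  }

  return result;
}
An alert on fallback_shadow.empty or fallback_shadow.error is an early warning that the kill switch has rotted before you need it in anger.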
When to use flags vs. other rollback mechanisms
Feature flags are powerful, but they are not the right rollback mechanism for every situation. Choosing the wrong tool creates a false sense of security.
| Situation | Best Rollback Mechanism | Why Not Feature Flags? |
|---|---|---|
| Application feature misbehaving | Kill switch | Best fit -- instant, granular |
| Database migration failure | Blue-green deployment | Flags cannot unmigrate data |
| Infrastructure outage | Service mesh routing / DNS failover | Flags depend on application layer being functional |
| Third-party API failure | Circuit breaker (can be flag-controlled) | Good fit for flags, but circuit breaker pattern is more appropriate |
| Security vulnerability in a dependency | Deployment rollback | Flags do not change running dependency versions |
| Configuration error (env vars, secrets) | Configuration rollback | Flags control code paths, not configuration |
| Data corruption from a bug | Data restoration from backup | Flags can stop further corruption but cannot repair existing damage |
The general rule: feature flags are for code path rollbacks. They are not for infrastructure, data, or dependency rollbacks. When teams try to use flags for situations where deployments, infrastructure changes, or data operations are needed, they create gaps in their incident response that will eventually be exposed.
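For the third-party API row above, the circuit breaker itself can be flag-aware: it trips on its own after repeated failures, and the flag gives on-call a manual override without a deploy. A sketch -- the flag name and the Flags interface are illustrative:
// Sketch: a circuit breaker with a flag-controlled override. The breaker trips on its
// own after repeated failures; flipping the flag off bypasses the dependency immediately.
interface Flags {
  isEnabled(key: string): boolean;
}

class FlagAwareBreaker {
  private consecutiveFailures = 0;

  constructor(private flags: Flags, private failureThreshold = 5) {}

  async call<T>(request: () => Promise<T>, fallback: () => T): Promise<T> {
    const integrationOn = this.flags.isEnabled("kill-switch-thirdparty-enrichment");

    // Skip the dependency if on-call has flipped the switch off or the breaker has tripped.
    if (!integrationOn || this.consecutiveFailures >= this.failureThreshold) {
      return fallback();
    }

    try {
      const result = await request();
      this.consecutiveFailures = 0; // success closes the breaker
      return result;
    } catch {
      this.consecutiveFailures++;
      return fallback();
    }
  }
}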
Putting it all together: a kill switch governance framework
Managing kill switches effectively requires the same organizational discipline as managing any other critical system component. Here is a governance framework that balances operational safety with technical debt prevention.
Creation standards
- Every feature in a critical path (payments, authentication, data pipelines, user-facing core flows) must have a kill switch before deploying to production
- Kill switches must follow the kill-switch-[feature]-[version] naming convention
- Kill switches must be documented in the on-call runbook with clear instructions for when and how to flip them
- The fallback code path must be tested before the feature launches
Lifecycle management
- 90-day initial review: Is the kill switch still needed?
- Quarterly reviews for retained switches: Is the fallback path still functional?
- Annual audit: Remove all kill switches that have not been flipped or tested in the past 12 months
- Automated alerts when kill switches exceed their review date
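The automated review-date alert can be a few lines against whatever flag metadata you already track. A sketch, assuming a simple record shape pulled from your provider's API or an internal flag registry:
// Sketch: find kill switches past their review date. The record shape is an
// assumption -- pull it from your flag provider's API or an internal flag registry.
interface KillSwitchRecord {
  key: string;
  lastReviewedAt: Date;
}

const REVIEW_INTERVAL_DAYS = 90;

function overdueKillSwitches(records: KillSwitchRecord[], now = new Date()): KillSwitchRecord[] {
  const cutoffMs = REVIEW_INTERVAL_DAYS * 24 * 60 * 60 * 1000;
  return records
    .filter((r) => r.key.startsWith("kill-switch-"))
    .filter((r) => now.getTime() - r.lastReviewedAt.getTime() > cutoffMs);
}
Feed the result into whatever alerting you already have -- a ticket per overdue switch, or a weekly digest to the owning team.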
Tools like FlagShark can automate the detection and tracking of kill switches across your codebase. By parsing your code with tree-sitter AST analysis, FlagShark identifies flag usage across 11 languages and tracks the lifecycle of every flag -- including kill switches that have exceeded their expected lifespan. When a kill switch becomes stale, FlagShark can generate a cleanup PR that removes both the flag check and the obsolete fallback code, ensuring your kill switch inventory stays lean and functional.
Retirement criteria
A kill switch should be retired when any of the following conditions are met:
| Condition | Rationale |
|---|---|
| Feature has been incident-free for 180+ days | The risk the kill switch mitigates has diminished to near-zero |
| Fallback path has not been tested in 90+ days | The kill switch is likely non-functional and provides false confidence |
| The feature has been significantly refactored | The fallback path probably does not match the current architecture |
| The team cannot explain what the kill switch does | If nobody knows what it controls, it is more dangerous than helpful |
| The fallback path depends on deprecated services | Flipping the switch would make things worse, not better |
The retirement checklist
When retiring a kill switch:
- Verify that the feature it protects is stable and well-monitored
- Remove the flag evaluation and the fallback code path from the application code
- Remove the flag from the feature flag provider
- Update the on-call runbook to remove all references
- Update any monitoring or alerting that references the flag
- Communicate the retirement to the on-call rotation
Kill switches are among the most valuable tools in your operational resilience arsenal. A well-designed kill switch can turn a 40-minute outage into a 45-second blip. But like any tool, they require maintenance. A kill switch you cannot trust is worse than no kill switch at all -- it creates the illusion of a safety net while the actual net has rotted away.
Build your kill switches with intention. Test them regularly. Review them quarterly. And retire them when they have served their purpose. The goal is not to have the most kill switches. The goal is to have kill switches that work when the page fires at 2:47 AM on a Tuesday.