Every engineering team knows it has technical debt. The problem is rarely awareness---it is measurement. Ask five engineers how much debt exists in a codebase and you will get five different answers, ranging from "it's fine" to "burn it all down." Without concrete metrics, technical debt discussions devolve into opinions, and opinions do not survive sprint planning against a product manager holding a prioritized feature list.
The single biggest reason technical debt accumulates unchecked is that teams cannot quantify it. If you cannot measure it, you cannot track it. If you cannot track it, you cannot prove it is getting worse. And if you cannot prove it is getting worse, you will never get the time and budget to fix it.
This article covers metrics that actually work for measuring technical debt---not abstract academic concepts, but numbers you can compute from your codebase today, track over time, and present to leadership in a language they understand. Feature flags, as one of the most measurable forms of debt, serve as a particularly useful lens for understanding and quantifying the broader problem.
Why most technical debt measurement fails
Before diving into specific metrics, it is worth understanding why traditional approaches fall short.
The "story point" trap
Many teams try to measure debt by assigning story points to debt-reduction tickets. This is circular reasoning: you are using an estimate (story points) to measure something you cannot yet quantify (the debt itself). Story points measure effort to fix, not the severity or impact of the debt. A 1-point flag removal might eliminate 200 lines of dead code, while an 8-point refactor might improve nothing measurable.
The code smell counter
SonarQube, CodeClimate, and similar tools provide "debt ratios" based on code smells---long methods, deep nesting, duplicated blocks. These catch surface-level issues but miss structural debt entirely. A codebase with zero code smells can still carry enormous debt in the form of stale feature flags, abandoned abstractions, and dead code paths that pass every linter.
The developer survey
Quarterly surveys asking developers "how bad is the debt?" produce useful sentiment data but nothing actionable. You learn that developers are frustrated, which you already knew. You do not learn which debt costs the most, which is trending worse, or where to invest cleanup effort for maximum return.
What actually works
Effective technical debt measurement requires metrics that are:
- Computable from source code (not from surveys or estimates)
- Trackable over time (trending matters more than absolute numbers)
- Correlated with developer pain (the metric should predict slowdowns)
- Actionable (knowing the number should suggest what to do next)
Feature flags are one of the best starting points because they meet all four criteria. A stale flag is unambiguously identifiable, its age is precisely measurable, its code impact is quantifiable, and its cleanup action is clear.
Metric 1: Flag age distribution
The single most revealing metric for feature flag debt is the age distribution of active flags in your codebase. "Active" here means flags that still have conditional logic in source code, regardless of whether the flag is enabled or disabled in the provider.
How to compute it
For each flag reference in the codebase, determine when it was first introduced. If you use a flag management platform like LaunchDarkly or Unleash, the creation date is available via API. If flags are defined inline, use git log to find the first commit that introduced the flag key:
# Find when the feature flag definitions file was first added
git log --all --diff-filter=A --reverse --format='%aI' -- '**/feature_flags*' \
  | head -1
# For a specific flag key, search across all files
git log --all -S "new-checkout-flow" --format='%aI %H %s' | tail -1
For a more systematic approach, build a script that extracts all flag keys and computes their age:
import subprocess
from datetime import datetime, timezone
def get_flag_age(flag_key: str) -> int:
"""Returns the age of a flag in days based on git history."""
result = subprocess.run(
["git", "log", "--all", "-S", flag_key,
"--format=%aI", "--reverse"],
capture_output=True, text=True
)
if not result.stdout.strip():
return -1 # Flag key not found in git history
first_date = result.stdout.strip().split("\n")[0]
introduced = datetime.fromisoformat(first_date)
age = (datetime.now(timezone.utc) - introduced).days
return age
def categorize_flags(flags: list[str]) -> dict:
"""Categorize flags into age buckets."""
buckets = {
"fresh (< 7 days)": [],
"active (7-30 days)": [],
"aging (30-90 days)": [],
"stale (90-180 days)": [],
"ancient (180+ days)": [],
}
for flag in flags:
        age = get_flag_age(flag)
        if age < 0:
            continue  # Flag key not found in git history; skip it
        if age < 7:
            buckets["fresh (< 7 days)"].append((flag, age))
elif age < 30:
buckets["active (7-30 days)"].append((flag, age))
elif age < 90:
buckets["aging (30-90 days)"].append((flag, age))
elif age < 180:
buckets["stale (90-180 days)"].append((flag, age))
else:
buckets["ancient (180+ days)"].append((flag, age))
return buckets
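A minimal usage sketch for the script above; the flag_keys list is a placeholder, and in practice you would extract it from your flag provider's export or a search for your flag helper calls:
# Hypothetical flag keys -- pull these from your provider or the codebase itself
flag_keys = ["new-checkout-flow", "dark-mode-beta", "legacy-billing-path"]
for bucket, entries in categorize_flags(flag_keys).items():
    print(f"{bucket}: {len(entries)} flags")
    for key, age in sorted(entries, key=lambda e: e[1], reverse=True):
        print(f"  {key}: {age} days old")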
What to look for
A healthy flag age distribution looks like a pyramid: many fresh flags at the bottom (features in active development), fewer in the 7-30 day range (recently rolled out), and very few beyond 90 days. An unhealthy distribution is an inverted pyramid---more ancient flags than fresh ones.
Target thresholds:
| Age Bucket | Healthy % | Warning % | Critical % |
|---|---|---|---|
| < 30 days | > 50% | 30-50% | < 30% |
| 30-90 days | 20-35% | 35-50% | > 50% |
| 90-180 days | < 15% | 15-25% | > 25% |
| 180+ days | < 5% | 5-15% | > 15% |
Track the median flag age and 90th percentile flag age monthly. If either is trending upward, flag debt is accumulating faster than you are cleaning it up.
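Both statistics fall out of the same age data. A small sketch using the standard library, reusing get_flag_age and the placeholder flag_keys list from above:
import statistics
ages = [a for a in (get_flag_age(key) for key in flag_keys) if a >= 0]
median_age = statistics.median(ages)
# quantiles(n=10) returns the nine deciles; index 8 is the 90th percentile
p90_age = statistics.quantiles(ages, n=10)[8]
print(f"median flag age: {median_age:.0f} days, p90: {p90_age:.0f} days")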
Metric 2: Flag density per file
Flag density measures how many feature flag conditionals exist per file (or per 1,000 lines of code). High density in a single file signals that the file has become a hub of conditional complexity---every change requires understanding multiple flag states.
How to compute it
import ast
from pathlib import Path
def count_flag_references(filepath: str,
flag_functions: list[str]) -> int:
"""Count feature flag function calls in a Python file."""
with open(filepath) as f:
tree = ast.parse(f.read())
count = 0
for node in ast.walk(tree):
if isinstance(node, ast.Call):
func_name = ""
if isinstance(node.func, ast.Attribute):
func_name = node.func.attr
elif isinstance(node.func, ast.Name):
func_name = node.func.id
if func_name in flag_functions:
count += 1
return count
def compute_flag_density(directory: str) -> list[dict]:
"""Compute flag density across all Python files."""
flag_functions = [
"is_enabled", "flag_is_active", "variation",
"bool_variation", "isEnabled", "is_feature_enabled",
]
results = []
for path in Path(directory).rglob("*.py"):
lines = len(path.read_text().splitlines())
flags = count_flag_references(str(path), flag_functions)
if flags > 0:
results.append({
"file": str(path),
"flags": flags,
"lines": lines,
"density": round(flags / max(lines, 1) * 1000, 2),
})
return sorted(results, key=lambda x: x["density"], reverse=True)
For TypeScript or JavaScript codebases, a similar approach using tree-sitter or the TypeScript compiler API yields the same metric. The key is counting call expressions that match your flag provider's API surface.
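As an illustration of that approach, here is a rough sketch that counts flag call expressions in JavaScript using the py-tree-sitter bindings. The setup API differs between versions of the tree_sitter and tree_sitter_javascript packages, so treat the first few lines as an assumption to adapt rather than a drop-in script:
import tree_sitter_javascript as tsjs
from tree_sitter import Language, Parser
JS = Language(tsjs.language())
parser = Parser(JS)  # older py-tree-sitter versions use Parser() plus set_language(JS)
def count_js_flag_calls(source: bytes, flag_functions: set[str]) -> int:
    """Count call expressions whose callee matches a known flag helper."""
    count = 0
    stack = [parser.parse(source).root_node]
    while stack:
        node = stack.pop()
        if node.type == "call_expression":
            callee = node.child_by_field_name("function")
            # For member calls like client.isEnabled(...), use the property name
            if callee is not None and callee.type == "member_expression":
                callee = callee.child_by_field_name("property")
            if callee is not None and callee.text.decode() in flag_functions:
                count += 1
        stack.extend(node.children)
    return count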
What to look for
Flag density thresholds (per 1,000 lines):
- 0-2 flags/kLOC: Normal. Flags are isolated and manageable.
- 3-5 flags/kLOC: Elevated. The file is becoming a testing combinatorics problem.
- 6+ flags/kLOC: Critical. With 6 independent boolean flags, there are 2^6 = 64 possible combinations of flag states. Nobody is testing all of them.
Files with density above 5 flags per 1,000 lines should be flagged (no pun intended) in pull request reviews. Track the top 10 densest files monthly---these are your highest-risk areas for flag-related bugs and the highest-value cleanup targets.
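One way to enforce that threshold is a small CI gate built on compute_flag_density above that fails the build when a file crosses it; the threshold value and the src directory are assumptions to adjust for your repository:
import sys
DENSITY_THRESHOLD = 5.0  # flags per 1,000 lines
offenders = [r for r in compute_flag_density("src")
             if r["density"] > DENSITY_THRESHOLD]
if offenders:
    print("Files exceeding the flag density threshold:")
    for r in offenders:
        print(f"  {r['file']}: {r['flags']} flags ({r['density']}/kLOC)")
    sys.exit(1)  # Fail the job so the pull request gets a closer look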
Metric 3: Cleanup velocity
Cleanup velocity measures how many flags your team removes per unit of time. This is the debt repayment rate, and it is the most important trend line for engineering managers to watch.
How to compute it
Track flag removals over time using git history:
# Count flag-related removals in the last 30 days
git log --since="30 days ago" --all --oneline -i \
--grep="remove.*flag\|cleanup.*flag\|delete.*flag" \
| wc -l
# More precise: count PRs that reduced flag count
git log --since="30 days ago" --format="%H" | while read sha; do
added=$(git show "$sha" | grep -c '^\+.*isEnabled\|^\+.*variation' || true)
removed=$(git show "$sha" | grep -c '^\-.*isEnabled\|^\-.*variation' || true)
if [ "$removed" -gt "$added" ] && [ "$removed" -gt 0 ]; then
echo "$sha: net removal of $((removed - added)) flag references"
fi
done
The velocity ratio
The most useful derived metric is the flag velocity ratio: flags removed per month divided by flags added per month.
- Ratio > 1.0: You are reducing debt. The codebase is getting cleaner.
- Ratio = 1.0: You are treading water. Every new flag is matched by a cleanup.
- Ratio < 1.0: Debt is growing. You are adding flags faster than you remove them.
- Ratio < 0.5: Debt is compounding. Without intervention, your flag count will double.
Most teams, when they first measure this, discover a ratio between 0.3 and 0.6. They are creating two to three flags for every one they clean up. This is the metric that makes the case for dedicated cleanup time in sprint planning.
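The ratio itself is simple to compute once you have the monthly counts, whether they come from the git analysis above or from your flag platform's audit log; a small helper as a sketch:
def velocity_ratio(flags_removed: int, flags_added: int) -> float:
    """Flags removed per flag added over the same time window."""
    if flags_added == 0:
        # Nothing added this month: pure cleanup is an infinite ratio, no activity is 1.0
        return float("inf") if flags_removed > 0 else 1.0
    return flags_removed / flags_added
# Example: 4 flags cleaned up, 9 introduced this month
print(f"velocity ratio: {velocity_ratio(4, 9):.2f}")  # 0.44 -- debt is growing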
Metric 4: Unused code percentage
Unused code percentage estimates how much of the codebase exists solely to support the "off" branch of a permanently-enabled flag. This is the dead weight metric---it tells you how much code is being maintained, compiled, deployed, and puzzled over for no functional reason.
How to estimate it
For each stale flag (age > 90 days, permanently enabled), count the lines of code in the disabled branch:
def estimate_dead_code_from_flags(
flags: list[dict],
stale_threshold_days: int = 90
) -> dict:
"""Estimate dead code attributable to stale flags."""
total_dead_lines = 0
stale_flags = []
for flag in flags:
if flag["age_days"] > stale_threshold_days:
# Lines in the else/fallback branch
dead_lines = flag.get("disabled_branch_lines", 0)
# Associated test code for the dead branch
dead_test_lines = flag.get("dead_test_lines", 0)
total_dead_lines += dead_lines + dead_test_lines
stale_flags.append({
"key": flag["key"],
"age_days": flag["age_days"],
"dead_lines": dead_lines + dead_test_lines,
})
return {
"total_dead_lines": total_dead_lines,
"stale_flag_count": len(stale_flags),
"avg_dead_lines_per_flag": (
total_dead_lines // max(len(stale_flags), 1)
),
"flags": sorted(
stale_flags, key=lambda x: x["dead_lines"], reverse=True
),
}
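The function assumes each flag record carries its age and the line counts for its disabled branch. A usage sketch with hypothetical values; in practice the ages come from the git analysis in Metric 1 and the branch line counts from a per-flag diff or manual audit:
flags = [
    {"key": "new-checkout-flow", "age_days": 240,
     "disabled_branch_lines": 180, "dead_test_lines": 95},
    {"key": "dark-mode-beta", "age_days": 45,
     "disabled_branch_lines": 30, "dead_test_lines": 12},
]
report = estimate_dead_code_from_flags(flags)
print(f"{report['stale_flag_count']} stale flags account for "
      f"{report['total_dead_lines']} dead lines "
      f"({report['avg_dead_lines_per_flag']} per flag on average)")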
Typical findings
Across mid-size codebases (100k-500k lines), stale feature flags typically account for 3-8% of total code. That number may sound small, but it translates to thousands of lines of dead code---with associated test files, configuration entries, and documentation---that adds to build times, increases cognitive load during code review, and creates false positives in security scans.
The more actionable way to frame this is in time: if your team spends 20% of code review time on files that contain stale flags, and those flags account for 30% of the conditional logic in those files, you can estimate the review time wasted on understanding dead branches. As a rough estimate, 20% x 30% = 6% of all review time goes to reasoning about code that can never execute.
Metric 5: Flag-to-incident correlation
This metric connects technical debt to business impact. Track whether files with high flag density have higher incident rates, more bugs, or longer time-to-resolution.
How to compute it
Cross-reference your incident management system with flag density data:
def correlate_flags_to_incidents(
flag_density: dict[str, float],
incidents: list[dict]
) -> dict:
"""Correlate file-level flag density with incident data."""
flagged_file_incidents = 0
clean_file_incidents = 0
flagged_file_count = 0
clean_file_count = 0
incident_files = set()
for incident in incidents:
for f in incident.get("affected_files", []):
incident_files.add(f)
for filepath, density in flag_density.items():
if density > 3.0: # "flagged" file
flagged_file_count += 1
if filepath in incident_files:
flagged_file_incidents += 1
else:
clean_file_count += 1
if filepath in incident_files:
clean_file_incidents += 1
flagged_rate = (flagged_file_incidents /
max(flagged_file_count, 1)) * 100
clean_rate = (clean_file_incidents /
max(clean_file_count, 1)) * 100
return {
"flagged_file_incident_rate": round(flagged_rate, 1),
"clean_file_incident_rate": round(clean_rate, 1),
"risk_multiplier": round(
flagged_rate / max(clean_rate, 0.1), 1
),
}
In our experience, teams that run this analysis often discover that files with high flag density have a noticeably higher incident rate than files with no flags. This is the metric that converts technical debt from an engineering concern into a business concern.
Building a technical debt dashboard
Individual metrics are useful. A dashboard that tracks all of them over time is transformative. Here is a practical architecture for a technical debt dashboard focused on measurable indicators.
Dashboard components
1. Executive summary panel
| Metric | Current | 30-Day Trend | Target |
|---|---|---|---|
| Total active flags | 47 | +3 (up) | < 30 |
| Median flag age | 62 days | +8 days | < 30 days |
| Cleanup velocity ratio | 0.4 | -0.1 | > 1.0 |
| Estimated dead code | 4.2% | +0.3% | < 2% |
| Flag-dense files (> 5/kLOC) | 12 | +2 | 0 |
2. Flag age histogram: A bar chart showing the distribution of flags across age buckets, updated weekly. The shape of this chart tells the story at a glance: a distribution weighted toward the older buckets means debt is accumulating.
3. Top offenders list: The 10 files with the highest flag density, the 10 oldest flags, and the 10 flags with the most dead code. These are your highest-value cleanup targets.
4. Velocity trend line: A line chart showing flags added vs. flags removed per month over the last 12 months. The gap between these lines is your debt accumulation rate.
Automation with CI/CD
The most reliable way to keep a debt dashboard accurate is to compute metrics on every pull request:
# .github/workflows/debt-metrics.yml
name: Technical Debt Metrics
on:
pull_request:
branches: [main]
schedule:
- cron: '0 9 * * 1' # Weekly Monday 9am
jobs:
compute-metrics:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
with:
fetch-depth: 0 # Full history for git log analysis
- name: Count flag references
run: |
python scripts/count_flags.py > metrics/flags.json
- name: Compute flag ages
run: |
python scripts/flag_ages.py > metrics/ages.json
- name: Compute cleanup velocity
run: |
python scripts/cleanup_velocity.py \
--since="30 days ago" > metrics/velocity.json
- name: Upload metrics
run: |
python scripts/upload_metrics.py \
--dashboard-url=${{ secrets.DASHBOARD_URL }}
Using FlagShark for automated measurement
Building and maintaining custom scripts for flag detection across multiple languages and frameworks is significant ongoing work. FlagShark automates this by integrating directly with GitHub to detect flag additions and removals in every pull request using tree-sitter-based parsing. It tracks each flag's full lifecycle---from introduction to cleanup---across 11 languages, providing the raw data needed for flag age distribution, cleanup velocity, and density metrics without requiring custom CI scripts or manual counting.
Turning metrics into action
Metrics without action are just numbers. Here is how to translate each metric into specific engineering decisions.
The weekly debt standup
Dedicate 15 minutes per week to reviewing debt metrics as a team. Not a full retrospective---just a quick check on the trend lines:
- Is the cleanup velocity ratio above or below 1.0? If below, discuss which flags are closest to removal and assign them.
- Did any file cross the 5 flags/kLOC threshold? If so, that file should not accept new flags until existing ones are cleaned up.
- Are any flags approaching 90 days? Create cleanup tickets proactively, before flags graduate from "aging" to "stale."
Cleanup budgeting
Use the dead code percentage metric to justify a specific time allocation. If 5% of your code is dead weight from stale flags, and your team spends 20% of its time navigating and reviewing code that includes those dead paths, then dedicating 10% of sprint capacity to flag cleanup will have a measurable productivity return.
The back-of-envelope formula: dead code % x review time multiplier = sprint capacity lost to dead code.
For example: 5% dead code x 2.5x review overhead = 12.5% of sprint capacity spent navigating dead code. Investing even half of that (roughly 6% of capacity) in cleanup will produce a measurable return.
Setting targets
Targets should be set based on your current baseline, not on ideal numbers. If your median flag age is 90 days, setting a target of 15 days will feel impossible. Instead, target a 20% improvement per quarter:
- Q1: Reduce median flag age from 90 to 72 days
- Q2: Reduce from 72 to 58 days
- Q3: Reduce from 58 to 46 days
- Q4: Reduce from 46 to 37 days
By Q4, you are within striking distance of the 30-day target, and the improvement trend is visible in every quarterly review.
Beyond flags: Generalizing the measurement approach
Feature flags are an ideal starting point for technical debt measurement because they are unambiguously identifiable and precisely trackable. But the same measurement principles apply to other categories of debt:
- Dependency staleness: Age of outdated dependencies, number of major versions behind, security vulnerability count. Computable from package.json, go.mod, or requirements.txt (see the sketch below).
- Test coverage gaps: Not total coverage percentage (which is often misleading), but coverage delta---the difference between coverage of recently-changed code and coverage of old code. Low coverage in frequently-changed files is higher-risk debt than low coverage in stable files.
- API deprecation debt: Count of deprecated API calls, age since deprecation, availability of replacement API.
- Configuration drift: Number of environment-specific overrides, age of "temporary" configuration changes, count of TODO comments in configuration files.
Each of these can be computed from source code, tracked over time, correlated with pain points, and used to justify specific cleanup investments.
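For example, the dependency staleness numbers for a Python service can be scripted the same way; a minimal sketch assuming pip is available and using its JSON output for outdated packages:
import json
import subprocess
def count_outdated_dependencies() -> dict:
    """Summarize outdated pip packages and how many are at least a major version behind."""
    result = subprocess.run(
        ["pip", "list", "--outdated", "--format=json"],
        capture_output=True, text=True, check=True
    )
    outdated = json.loads(result.stdout or "[]")
    majors_behind = sum(
        1 for pkg in outdated
        if pkg["latest_version"].split(".")[0] != pkg["version"].split(".")[0]
    )
    return {"outdated_packages": len(outdated),
            "at_least_one_major_behind": majors_behind}
print(count_outdated_dependencies())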
Key takeaways
Measuring technical debt is not about producing a single number that summarizes the health of your codebase. It is about establishing trend lines that tell you whether things are getting better or worse, and by how much.
Start with what is measurable. Feature flags, by their nature, create the most precisely trackable form of technical debt: you know exactly when each flag was introduced, where it lives in the code, how much dead code it creates, and when it was cleaned up. Build your measurement practice around flags first, then extend the same principles to other categories of debt.
The teams that consistently reduce technical debt are not the ones with the most discipline or the biggest cleanup budgets. They are the ones that measure it, make it visible, and treat the trend line as a first-class engineering metric alongside uptime, deployment frequency, and lead time.
If you cannot see it, you cannot fix it. Start measuring.