Every engineering team knows it has technical debt. The problem is rarely awareness---it is measurement. Ask five engineers how much debt exists in a codebase and you will get five different answers, ranging from "it's fine" to "burn it all down." Without concrete metrics, technical debt discussions devolve into opinions, and opinions do not survive sprint planning against a product manager holding a prioritized feature list.
The single biggest reason technical debt accumulates unchecked is that teams cannot quantify it. If you cannot measure it, you cannot track it. If you cannot track it, you cannot prove it is getting worse. And if you cannot prove it is getting worse, you will never get the time and budget to fix it.
This article covers metrics that actually work for measuring technical debt---not abstract academic concepts, but numbers you can compute from your codebase today, track over time, and present to leadership in a language they understand. Feature flags, as one of the most measurable forms of debt, serve as a particularly useful lens for understanding and quantifying the broader problem.
Why most technical debt measurement fails
Before diving into specific metrics, it is worth understanding why traditional approaches fall short.
The "story point" trap
Many teams try to measure debt by assigning story points to debt-reduction tickets. This is circular reasoning: you are using an estimate (story points) to measure something you cannot yet quantify (the debt itself). Story points measure effort to fix, not the severity or impact of the debt. A 1-point flag removal might eliminate 200 lines of dead code, while an 8-point refactor might improve nothing measurable.
The code smell counter
SonarQube, CodeClimate, and similar tools provide "debt ratios" based on code smells---long methods, deep nesting, duplicated blocks. These catch surface-level issues but miss structural debt entirely. A codebase with zero code smells can still carry enormous debt in the form of stale feature flags, abandoned abstractions, and dead code paths that pass every linter.
The developer survey
Quarterly surveys asking developers "how bad is the debt?" produce useful sentiment data but nothing actionable. You learn that developers are frustrated, which you already knew. You do not learn which debt costs the most, which is trending worse, or where to invest cleanup effort for maximum return.
What actually works
Effective technical debt measurement requires metrics that are:
- Computable from source code (not from surveys or estimates)
- Trackable over time (trending matters more than absolute numbers)
- Correlated with developer pain (the metric should predict slowdowns)
- Actionable (knowing the number should suggest what to do next)
Feature flags are one of the best starting points because they meet all four criteria. A stale flag is unambiguously identifiable, its age is precisely measurable, its code impact is quantifiable, and its cleanup action is clear.
Metric 1: Flag age distribution
The single most revealing metric for feature flag debt is the age distribution of active flags in your codebase. "Active" here means flags that still have conditional logic in source code, regardless of whether the flag is enabled or disabled in the provider.
How to compute it
For each flag reference in the codebase, determine when it was first introduced. If you use a flag management platform like LaunchDarkly or Unleash, the creation date is available via API. If flags are defined inline, use git log to find the first commit that introduced the flag key:
# Find when the feature flag definitions file was first added
git log --all --diff-filter=A --reverse --format='%aI' -- '**/feature_flags*' \
  | head -1
# For a specific flag key, search across all files
git log --all -S "new-checkout-flow" --format='%aI %H %s' | tail -1
For a more systematic approach, build a script that extracts all flag keys and computes their age:
import subprocess
from datetime import datetime, timezone
def get_flag_age(flag_key: str) -> int:
"""Returns the age of a flag in days based on git history."""
result = subprocess.run(
["git", "log", "--all", "-S", flag_key,
"--format=%aI", "--reverse"],
capture_output=True, text=True
)
if not result.stdout.strip():
return -1 # Flag key not found in git history
first_date = result.stdout.strip().split("\n")[0]
introduced = datetime.fromisoformat(first_date)
age = (datetime.now(timezone.utc) - introduced).days
return age
def categorize_flags(flags: list[str]) -> dict:
"""Categorize flags into age buckets."""
buckets = {
"fresh (< 7 days)": [],
"active (7-30 days)": [],
"aging (30-90 days)": [],
"stale (90-180 days)": [],
"ancient (180+ days)": [],
}
for flag in flags:
        age = get_flag_age(flag)
        if age < 0:
            continue  # Flag key not found in git history; skip it
        if age < 7:
            buckets["fresh (< 7 days)"].append((flag, age))
elif age < 30:
buckets["active (7-30 days)"].append((flag, age))
elif age < 90:
buckets["aging (30-90 days)"].append((flag, age))
elif age < 180:
buckets["stale (90-180 days)"].append((flag, age))
else:
buckets["ancient (180+ days)"].append((flag, age))
return buckets
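A minimal usage sketch for the script above; the flag_keys list is a placeholder, and in practice you would extract it from your flag provider's export or a search for your flag helper calls:
# Hypothetical flag keys -- pull these from your provider or the codebase itself
flag_keys = ["new-checkout-flow", "dark-mode-beta", "legacy-billing-path"]
for bucket, entries in categorize_flags(flag_keys).items():
    print(f"{bucket}: {len(entries)} flags")
    for key, age in sorted(entries, key=lambda e: e[1], reverse=True):
        print(f"  {key}: {age} days old")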
What to look for
A healthy flag age distribution looks like a pyramid: many fresh flags at the bottom (features in active development), fewer in the 7-30 day range (recently rolled out), and very few beyond 90 days. An unhealthy distribution is an inverted pyramid---more ancient flags than fresh ones.
Target thresholds:
| Age Bucket | Healthy % | Warning % | Critical % |
|---|---|---|---|
| < 30 days | > 50% | 30-50% | < 30% |
| 30-90 days | 20-35% | 35-50% | > 50% |
| 90-180 days | < 15% | 15-25% | > 25% |
| 180+ days | < 5% | 5-15% | > 15% |
Track the median flag age and 90th percentile flag age monthly. If either is trending upward, flag debt is accumulating faster than you are cleaning it up.
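Both statistics fall out of the same age data. A small sketch using the standard library, reusing get_flag_age and the placeholder flag_keys list from above:
import statistics
ages = [a for a in (get_flag_age(key) for key in flag_keys) if a >= 0]
median_age = statistics.median(ages)
# quantiles(n=10) returns the nine deciles; index 8 is the 90th percentile
p90_age = statistics.quantiles(ages, n=10)[8]
print(f"median flag age: {median_age:.0f} days, p90: {p90_age:.0f} days")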
Metric 2: Flag density per file
Flag density measures how many feature flag conditionals exist per file (or per 1,000 lines of code). High density in a single file signals that the file has become a hub of conditional complexity---every change requires understanding multiple flag states.
How to compute it
import ast
from pathlib import Path
def count_flag_references(filepath: str,
flag_functions: list[str]) -> int:
"""Count feature flag function calls in a Python file."""
with open(filepath) as f:
tree = ast.parse(f.read())
count = 0
for node in ast.walk(tree):
if isinstance(node, ast.Call):
func_name = ""
if isinstance(node.func, ast.Attribute):
func_name = node.func.attr
elif isinstance(node.func, ast.Name):
func_name = node.func.id
if func_name in flag_functions:
count += 1
return count
def compute_flag_density(directory: str) -> list[dict]:
"""Compute flag density across all Python files."""
flag_functions = [
"is_enabled", "flag_is_active", "variation",
"bool_variation", "isEnabled", "is_feature_enabled",
]
results = []
for path in Path(directory).rglob("*.py"):
lines = len(path.read_text().splitlines())
flags = count_flag_references(str(path), flag_functions)
if flags > 0:
results.append({
"file": str(path),
"flags": flags,
"lines": lines,
"density": round(flags / max(lines, 1) * 1000, 2),
})
return sorted(results, key=lambda x: x["density"], reverse=True)
For TypeScript or JavaScript codebases, a similar approach using tree-sitter or the TypeScript compiler API yields the same metric. The key is counting call expressions that match your flag provider's API surface.
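As an illustration of that approach, here is a rough sketch that counts flag call expressions in JavaScript using the py-tree-sitter bindings. The setup API differs between versions of the tree_sitter and tree_sitter_javascript packages, so treat the first few lines as an assumption to adapt rather than a drop-in script:
import tree_sitter_javascript as tsjs
from tree_sitter import Language, Parser
JS = Language(tsjs.language())
parser = Parser(JS)  # older py-tree-sitter versions use Parser() plus set_language(JS)
def count_js_flag_calls(source: bytes, flag_functions: set[str]) -> int:
    """Count call expressions whose callee matches a known flag helper."""
    count = 0
    stack = [parser.parse(source).root_node]
    while stack:
        node = stack.pop()
        if node.type == "call_expression":
            callee = node.child_by_field_name("function")
            # For member calls like client.isEnabled(...), use the property name
            if callee is not None and callee.type == "member_expression":
                callee = callee.child_by_field_name("property")
            if callee is not None and callee.text.decode() in flag_functions:
                count += 1
        stack.extend(node.children)
    return count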
What to look for
Flag density thresholds (per 1,000 lines):
- 0-2 flags/kLOC: Normal. Flags are isolated and manageable.
- 3-5 flags/kLOC: Elevated. The file is becoming a testing combinatorics problem.
- 6+ flags/kLOC: Critical. With 6 independent boolean flags, there are 2^6 = 64 possible combinations of flag states. Nobody is testing all of them.
Files with density above 5 flags per 1,000 lines should be flagged (no pun intended) in pull request reviews. Track the top 10 densest files monthly---these are your highest-risk areas for flag-related bugs and the highest-value cleanup targets.
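One way to enforce that threshold is a small CI gate built on compute_flag_density above that fails the build when a file crosses it; the threshold value and the src directory are assumptions to adjust for your repository:
import sys
DENSITY_THRESHOLD = 5.0  # flags per 1,000 lines
offenders = [r for r in compute_flag_density("src")
             if r["density"] > DENSITY_THRESHOLD]
if offenders:
    print("Files exceeding the flag density threshold:")
    for r in offenders:
        print(f"  {r['file']}: {r['flags']} flags ({r['density']}/kLOC)")
    sys.exit(1)  # Fail the job so the pull request gets a closer look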
Metric 3: Cleanup velocity
Cleanup velocity measures how many flags your team removes per unit of time. This is the debt repayment rate, and it is the most important trend line for engineering managers to watch.
How to compute it
Track flag removals over time using git history:
# Count flag-related removals in the last 30 days
git log --since="30 days ago" --all --oneline -i \
--grep="remove.*flag\|cleanup.*flag\|delete.*flag" \
| wc -l
# More precise: count PRs that reduced flag count
git log --since="30 days ago" --format="%H" | while read sha; do
added=$(git show "$sha" | grep -c '^\+.*isEnabled\|^\+.*variation' || true)
removed=$(git show "$sha" | grep -c '^\-.*isEnabled\|^\-.*variation' || true)
if [ "$removed" -gt "$added" ] && [ "$removed" -gt 0 ]; then
echo "$sha: net removal of $((removed - added)) flag references"
fi
done
The velocity ratio
The most useful derived metric is the flag velocity ratio: flags removed per month divided by flags added per month.
- Ratio > 1.0: You are reducing debt. The codebase is getting cleaner.
- Ratio = 1.0: You are treading water. Every new flag is matched by a cleanup.
- Ratio < 1.0: Debt is growing. You are adding flags faster than you remove them.
- Ratio < 0.5: Debt is compounding. Without intervention, your flag count will double.
Most teams, when they first measure this, discover a ratio between 0.3 and 0.6. They are creating two to three flags for every one they clean up. This is the metric that makes the case for dedicated cleanup time in sprint planning.
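The ratio itself is simple to compute once you have the monthly counts, whether they come from the git analysis above or from your flag platform's audit log; a small helper as a sketch:
def velocity_ratio(flags_removed: int, flags_added: int) -> float:
    """Flags removed per flag added over the same time window."""
    if flags_added == 0:
        # Nothing added this month: pure cleanup is an infinite ratio, no activity is 1.0
        return float("inf") if flags_removed > 0 else 1.0
    return flags_removed / flags_added
# Example: 4 flags cleaned up, 9 introduced this month
print(f"velocity ratio: {velocity_ratio(4, 9):.2f}")  # 0.44 -- debt is growing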
Metric 4: Unused code percentage
Unused code percentage estimates how much of the codebase exists solely to support the "off" branch of a permanently-enabled flag. This is the dead weight metric---it tells you how much code is being maintained, compiled, deployed, and puzzled over for no functional reason.
How to estimate it
For each stale flag (age > 90 days, permanently enabled), count the lines of code in the disabled branch:
def estimate_dead_code_from_flags(
flags: list[dict],
stale_threshold_days: int = 90
) -> dict:
"""Estimate dead code attributable to stale flags."""
total_dead_lines = 0
stale_flags = []
for flag in flags:
if flag["age_days"] > stale_threshold_days:
# Lines in the else/fallback branch
dead_lines = flag.get("disabled_branch_lines", 0)
# Associated test code for the dead branch
dead_test_lines = flag.get("dead_test_lines", 0)
total_dead_lines += dead_lines + dead_test_lines
stale_flags.append({
"key": flag["key"],
"age_days": flag["age_days"],
"dead_lines": dead_lines + dead_test_lines,
})
return {
"total_dead_lines": total_dead_lines,
"stale_flag_count": len(stale_flags),
"avg_dead_lines_per_flag": (
total_dead_lines // max(len(stale_flags), 1)
),
"flags": sorted(
stale_flags, key=lambda x: x["dead_lines"], reverse=True
),
}
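The function assumes each flag record carries its age and the line counts for its disabled branch. A usage sketch with hypothetical values; in practice the ages come from the git analysis in Metric 1 and the branch line counts from a per-flag diff or manual audit:
flags = [
    {"key": "new-checkout-flow", "age_days": 240,
     "disabled_branch_lines": 180, "dead_test_lines": 95},
    {"key": "dark-mode-beta", "age_days": 45,
     "disabled_branch_lines": 30, "dead_test_lines": 12},
]
report = estimate_dead_code_from_flags(flags)
print(f"{report['stale_flag_count']} stale flags account for "
      f"{report['total_dead_lines']} dead lines "
      f"({report['avg_dead_lines_per_flag']} per flag on average)")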
Typical findings
Across mid-size codebases (100k-500k lines), stale feature flags typically account for 3-8% of total code. That number may sound small, but it translates to thousands of lines of dead code---with associated test files, configuration entries, and documentation---that adds to build times, increases cognitive load during code review, and creates false positives in security scans.
The more actionable way to frame this is in time: if your team spends 20% of code review time on files that contain stale flags, and those flags account for 30% of the conditional logic in those files, you can estimate the review time wasted on understanding dead branches. As a rough estimate, 20% x 30% = 6% of all review time goes to reasoning about code that can never execute.
Metric 5: Flag-to-incident correlation
This metric connects technical debt to business impact. Track whether files with high flag density have higher incident rates, more bugs, or longer time-to-resolution.
How to compute it
Cross-reference your incident management system with flag density data:
def correlate_flags_to_incidents(
flag_density: dict[str, float],
incidents: list[dict]
) -> dict:
"""Correlate file-level flag density with incident data."""
flagged_file_incidents = 0
clean_file_incidents = 0
flagged_file_count = 0
clean_file_count = 0
incident_files = set()
for incident in incidents:
for f in incident.get("affected_files", []):
incident_files.add(f)
for filepath, density in flag_density.items():
if density > 3.0: # "flagged" file
flagged_file_count += 1
if filepath in incident_files:
flagged_file_incidents += 1
else:
clean_file_count += 1
if filepath in incident_files:
clean_file_incidents += 1
flagged_rate = (flagged_file_incidents /
max(flagged_file_count, 1)) * 100
clean_rate = (clean_file_incidents /
max(clean_file_count, 1)) * 100
return {
"flagged_file_incident_rate": round(flagged_rate, 1),
"clean_file_incident_rate": round(clean_rate, 1),
"risk_multiplier": round(
flagged_rate / max(clean_rate, 0.1), 1
),
}
In our experience, teams that run this analysis often discover that files with high flag density have a noticeably higher incident rate than files with no flags. This is the metric that converts technical debt from an engineering concern into a business concern.
Building a technical debt dashboard
Individual metrics are useful. A dashboard that tracks all of them over time is transformative. Here is a practical architecture for a technical debt dashboard focused on measurable indicators.
Dashboard components
1. Executive summary panel
| Metric | Current | 30-Day Trend | Target |
|---|---|---|---|
| Total active flags | 47 | +3 (up) | < 30 |
| Median flag age | 62 days | +8 days | < 30 days |
| Cleanup velocity ratio | 0.4 | -0.1 | > 1.0 |
| Estimated dead code | 4.2% | +0.3% | < 2% |
| Flag-dense files (> 5/kLOC) | 12 | +2 | 0 |
2. Flag age histogram: A bar chart showing the distribution of flags across age buckets, updated weekly. The shape of this chart tells the story at a glance: a distribution weighted toward the older buckets means debt is accumulating.
3. Top offenders list: The 10 files with the highest flag density, the 10 oldest flags, and the 10 flags with the most dead code. These are your highest-value cleanup targets.
4. Velocity trend line: A line chart showing flags added vs. flags removed per month over the last 12 months. The gap between these lines is your debt accumulation rate.
Automation with CI/CD
The most reliable way to keep a debt dashboard accurate is to compute metrics on every pull request:
# .github/workflows/debt-metrics.yml
name: Technical Debt Metrics
on:
pull_request:
branches: [main]
schedule:
- cron: '0 9 * * 1' # Weekly Monday 9am
jobs:
compute-metrics:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
with:
fetch-depth: 0 # Full history for git log analysis
- name: Count flag references
run: |
python scripts/count_flags.py > metrics/flags.json
- name: Compute flag ages
run: |
python scripts/flag_ages.py > metrics/ages.json
- name: Compute cleanup velocity
run: |
python scripts/cleanup_velocity.py \
--since="30 days ago" > metrics/velocity.json
- name: Upload metrics
run: |
python scripts/upload_metrics.py \
--dashboard-url=${{ secrets.DASHBOARD_URL }}
Using FlagShark for automated measurement
Building and maintaining custom scripts for flag detection across multiple languages and frameworks is significant ongoing work. FlagShark automates this by integrating directly with GitHub to detect flag additions and removals in every pull request using tree-sitter-based parsing. It tracks each flag's full lifecycle---from introduction to cleanup---across 11 languages, providing the raw data needed for flag age distribution, cleanup velocity, and density metrics without requiring custom CI scripts or manual counting.
Turning metrics into action
Metrics without action are just numbers. Here is how to translate each metric into specific engineering decisions.
The weekly debt standup
Dedicate 15 minutes per week to reviewing debt metrics as a team. Not a full retrospective---just a quick check on the trend lines:
- Is the cleanup velocity ratio above or below 1.0? If below, discuss which flags are closest to removal and assign them.
- Did any file cross the 5 flags/kLOC threshold? If so, that file should not accept new flags until existing ones are cleaned up.
- Are any flags approaching 90 days? Create cleanup tickets proactively, before flags graduate from "aging" to "stale."
Cleanup budgeting
Use the dead code percentage metric to justify a specific time allocation. If 5% of your code is dead weight from stale flags, and your team spends 20% of its time navigating and reviewing code that includes those dead paths, then dedicating 10% of sprint capacity to flag cleanup will have a measurable productivity return.
The back-of-envelope formula: dead code % x review time multiplier = sprint capacity lost to dead code.
For example: 5% dead code x 2.5x review overhead = 12.5% of sprint capacity spent navigating dead code. Investing even half of that (roughly 6% of capacity) in cleanup will produce a measurable return.
Setting targets
Targets should be set based on your current baseline, not on ideal numbers. If your median flag age is 90 days, setting a target of 15 days will feel impossible. Instead, target a 20% improvement per quarter:
- Q1: Reduce median flag age from 90 to 72 days
- Q2: Reduce from 72 to 58 days
- Q3: Reduce from 58 to 46 days
- Q4: Reduce from 46 to 37 days
By Q4, you are within striking distance of the 30-day target, and the improvement trend is visible in every quarterly review.
Beyond flags: Generalizing the measurement approach
Feature flags are an ideal starting point for technical debt measurement because they are unambiguously identifiable and precisely trackable. But the same measurement principles apply to other categories of debt:
- Dependency staleness: Age of outdated dependencies, number of major versions behind, security vulnerability count. Computable from package.json, go.mod, or requirements.txt (see the sketch below).
- Test coverage gaps: Not total coverage percentage (which is often misleading), but coverage delta---the difference between coverage of recently-changed code and coverage of old code. Low coverage in frequently-changed files is higher-risk debt than low coverage in stable files.
- API deprecation debt: Count of deprecated API calls, age since deprecation, availability of replacement API.
- Configuration drift: Number of environment-specific overrides, age of "temporary" configuration changes, count of TODO comments in configuration files.
Each of these can be computed from source code, tracked over time, correlated with pain points, and used to justify specific cleanup investments.
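For example, the dependency staleness numbers for a Python service can be scripted the same way; a minimal sketch assuming pip is available and using its JSON output for outdated packages:
import json
import subprocess
def count_outdated_dependencies() -> dict:
    """Summarize outdated pip packages and how many are at least a major version behind."""
    result = subprocess.run(
        ["pip", "list", "--outdated", "--format=json"],
        capture_output=True, text=True, check=True
    )
    outdated = json.loads(result.stdout or "[]")
    majors_behind = sum(
        1 for pkg in outdated
        if pkg["latest_version"].split(".")[0] != pkg["version"].split(".")[0]
    )
    return {"outdated_packages": len(outdated),
            "at_least_one_major_behind": majors_behind}
print(count_outdated_dependencies())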
Key takeaways
Measuring technical debt is not about producing a single number that summarizes the health of your codebase. It is about establishing trend lines that tell you whether things are getting better or worse, and by how much.
Start with what is measurable. Feature flags, by their nature, create the most precisely trackable form of technical debt: you know exactly when each flag was introduced, where it lives in the code, how much dead code it creates, and when it was cleaned up. Build your measurement practice around flags first, then extend the same principles to other categories of debt.
The teams that consistently reduce technical debt are not the ones with the most discipline or the biggest cleanup budgets. They are the ones that measure it, make it visible, and treat the trend line as a first-class engineering metric alongside uptime, deployment frequency, and lead time.
If you cannot see it, you cannot fix it. Start measuring.