The CRUCIBLE Protocol: What We Learned When 3,277 Passing Tests Failed to Catch a Bug
March 20, 2026 by Asif Waliuddin

3,277 tests passed. We shipped a bug that silently deleted all our graph metadata.
Let me tell you exactly how.
The System
dx3 is our semantic memory and intelligence layer. It runs Apache AGE for graph database operations: traversals, shortest paths, influence mapping across knowledge graphs. Most of what makes our AI portfolio coherent passes through it at some point. It's not a toy project. It's running in production with real data.
On March 6, 2026, the team completed a rewrite of three core graph methods from AGE Cypher to SQL CTEs. It was a legitimate architectural improvement. The AGE Cypher approach had known issues. The SQL CTE approach was cleaner, more portable, better understood. The team did their due diligence. They ran local tests. 3,277 passed.
CI failed 3 integration tests.
The Trace
12 minutes to root cause. Here's the chain:
_extract_node_id_from_agtype() was failing to parse real AGE agtype output and returning None. The local test environment either mocked the connection or used fixture data that didn't produce real agtype objects. CI used a real PostgreSQL+AGE instance via Dockerfile.ci. Real agtype output. Real parsing failure.
That None got passed to _store_node_metadata(). Inside that function, the INSERT hit a BIGINT NOT NULL constraint. The INSERT failed. The exception was caught:
```python
except Exception:
    pass
```
Silent. No log. No error status. No re-raise. Just... nothing.
The metadata tables stayed empty. Downstream queries (graph_traversal(), find_influential_nodes()) queried empty tables and returned empty result sets.
The tests asserted:
```python
assert result.success is True
assert isinstance(result.data, list)
```
Both pass with an empty list. Nobody had added `assert len(result.data) >= 1`. The failure was invisible. The data was gone. The tests said everything was fine.
3,277 green tests.
It Got Worse
Between the rewrite commit and the fix commit, 323 tests disappeared. Not failed. Not skipped. Disappeared.
| Metric | Before rewrite | After fix | Delta |
|---|---|---|---|
| Passed | 3,277 | 3,016 | -261 |
| Skipped | 108 | 46 | -62 |
| Total tracked | 3,418 | 3,095 | -323 |
The fix commit modified zero test files. 323 tests gone with no explanation, no flagging, no justification in any commit message.
Nobody noticed. We were focused on the 3 CI failures. P0 incidents create tunnel vision. That's human. But 323 tests vanishing without comment is not a small thing. It means our count of "how many tests we have" was wrong by 10%. And we had no automated mechanism to catch that.
The 6 Failure Patterns
We did a forensic audit of every commit involved. Six failure patterns. All of them documented in research. Most of them common across software teams running AI agents.
Pattern 1: Mock-Implementation Tautology.
The rewrite commit updated 5 unit mocks in the same commit as the production code. The mocks were updated from the old return schema to the new one. At first glance this looks correct: the implementation changed, so the mocks need to change.
Look closer. Those mocks now assert that the code returns exactly what the code returns. They were reverse-engineered from the new implementation, not from a specification of what the methods should return. If you change the implementation again next week, someone will update the mocks to match, and the tests will pass, and nobody will have learned whether the implementation is correct.
This is circular validation. The test is a tautology. It proves nothing.
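To make the circularity concrete, here's a minimal sketch. Everything in it is invented for illustration (`fetch_node`, `NODE_SPEC_KEYS`, the spec itself); the point is the difference between a mock whose expected value was copied from the implementation and a check derived from a written contract.

```python
from unittest.mock import Mock

# Tautological: the expected dict below was copied from the new
# implementation's return value, so the assertion can never disagree
# with the code it claims to test.
def test_tautology():
    client = Mock()
    client.fetch_node.return_value = {"id": 1, "label": "A"}  # copied from impl
    assert client.fetch_node("A") == {"id": 1, "label": "A"}  # proves nothing

# Spec-driven: the required fields come from a written contract that
# exists independently of any implementation. (Hypothetical spec.)
NODE_SPEC_KEYS = {"id", "label", "created_at"}

def check_node_against_spec(node: dict) -> bool:
    """True only if the node carries every field the spec requires."""
    return NODE_SPEC_KEYS.issubset(node.keys())
```

If the implementation's return shape drifts away from the spec, the second check fails; the first one will happily be "updated to match" and stay green.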
Pattern 2: Premature xfail Removal.
Four @xfail markers were removed from integration tests based on local test results. The xfails existed because those methods were known-broken against real AGE. The team ran the tests locally, saw them pass, removed the markers, and committed.
CI disagreed. Three integration tests failed.
The xfail markers were guards. They said: "this test is not reliable yet, don't trust it." Removing them without CI validation gave false confidence. The team saw "0 failed" locally and declared victory. CI had a different environment, a different opinion, and a different answer.
Pattern 3: Silent Failure Masking.
This is the worst one. except Exception: pass inside _store_node_metadata(). One line. Silent. The exception is caught, discarded, and the execution continues as if nothing happened.
No log line. No error status returned. No metric incremented. The metadata tables stay empty. The caller doesn't know. The tests don't know. The only way to detect this failure is to assert that the data you created actually exists after you created it, and nobody had done that.
Silent failure masking is the failure mode that kills data pipelines in production. The system reports success while losing data. By the time anyone notices, it's hours or days later, and the damage is already done.
Pattern 4: Environment Coupling.
Local tests and CI tests disagreed because they run against different environments. Local: mocked connections or simplified fixtures. CI: real PostgreSQL with real AGE, producing real agtype output that the Python extractor couldn't handle.
The gap between local and CI is exactly where bugs hide. Not because anyone was careless. Because it's genuinely hard to make local environments perfectly mirror production. The principle of "no mocks, real tests on real hardware" exists for this reason, but it only helps if the local hardware and the CI hardware are actually the same hardware.
Pattern 5: Test Count Deflation.
323 tests disappeared. No test files were modified. The delta must come from test collection differences: environment state, parametrization changes, module discovery issues. Something changed between commits that caused pytest to collect fewer tests.
Our portfolio rule says "test counts never decrease." That rule is in every CLAUDE.md. It has zero automated enforcement. It's a rule that relies on humans to notice a number in a commit message. Under P0 pressure, with 3 CI failures to fix and a 28-minute clock ticking, nobody checked the count.
Rules without enforcement are not rules. They're aspirations.
Pattern 6: Hollow Assertions.
isinstance(result.data, list) is technically an assertion. It's in a test file. It contributes to the test count. And it proves almost nothing.
Any function that returns a list passes this assertion. A function that returns an empty list passes. A function that returns a list of entirely wrong data passes. The assertion is checking data type, not correctness.
We had thousands of tests like this. They existed. They were green. They were not protecting us from anything.
What the Team Did Right
Before I get into the protocol, this needs to be said clearly.
The fix was production code only. Zero test changes. They did not weaken tests to match broken code. They did not add new xfails to skip the failing tests. They fixed the production code. That is the exact opposite of the reward-hacking behavior we documented in the research: AI agents that, under optimization pressure, strip tests rather than fix code. The dx3 team did the right thing.
The root cause analysis in the commit message is excellent. Full chain: agtype parsing failure, None node_id, NOT NULL constraint, silent catch, empty tables, empty results, tests blind to it. Transparent. Reproducible. Debuggable.
28 minutes from CI failure to production fix. P0 directive issued in 12 minutes. Fix committed in 16 more. The governance machinery worked exactly as designed.
The structure caught it. The test quality didn't.
The CRUCIBLE Protocol
CRUCIBLE: Code Review Under Conditions Inducing Bug Latency Exposure.
Your code survives the crucible or it doesn't ship.
We built this from two sources: the BKR research (70+ citations on AI agent testing failures, detailed in part 1 of this series) and this incident (commit-level forensic evidence). Seven gates. Each one catches a specific failure mode we observed.
Gate 1: xfail Governance.
Never remove an @xfail / @pytest.mark.xfail / .skip() marker based on local results only. xfail removal is blocked until the corresponding test passes in CI. Cite the CI run in the commit message.
Evidence from the incident: 4 xfails removed on local evidence. CI disagreed. 3 integration tests failed.
Gate 2: Non-Empty Result Assertions.
If an integration test creates data and then queries it back, it must assert the result set is non-empty.
```python
# Fails CRUCIBLE: passes even when data was silently lost
assert result.success is True
assert isinstance(result.data, list)

# Passes CRUCIBLE: catches silent data loss
assert result.success is True
assert isinstance(result.data, list)
assert len(result.data) >= 1, "Expected non-empty results after data creation"
```
One extra line. Catches the entire class of silent data loss failures.
Gate 3: Mock Drift Detection.
When a commit modifies both implementation code and the mocks/fixtures testing that code, the reviewer asks: did the SPEC change, or did only the implementation change?
If only the implementation changed, the mocks may have just become tautological. They're asserting that the code does what the code does. That's not a test.
Gate 4: Test Count Delta Gate.
CI reports the test count delta. Any decrease greater than 5 requires explicit justification in the commit message.
```
Tests: 3016 passed (+0/-261 vs previous)
WARNING: 261 fewer tests than last run. Justification required.
```
Acceptable justifications: test deduplication, feature removal, refactoring from parametrized to individual tests. "Will fix later" is not acceptable. No justification is not acceptable.
Rules without enforcement aren't rules.
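A minimal sketch of what the enforcement could look like, assuming CI persists the previous run's collected-test count somewhere (a cache file, an artifact). The threshold and function name are invented; the shape is the point: the gate is a few lines of code, not a sentence in a doc.

```python
MAX_DECREASE = 5  # any larger drop requires written justification

def check_test_count(previous: int, current: int, justification: str = "") -> bool:
    """Return True if the test-count change passes the gate.

    Small fluctuations are tolerated; a drop beyond MAX_DECREASE is
    only acceptable when a non-empty justification accompanies it.
    """
    decrease = previous - current
    if decrease <= MAX_DECREASE:
        return True
    return bool(justification.strip())
```

Wired into CI, `check_test_count(3418, 3095)` fails the build unless the commit carries a justification string. That is the difference between a rule and an aspiration.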
Gate 5: Silent Exception Audit.
except blocks that catch exceptions without logging, re-raising, or returning an error status are flagged.
```python
# Fails CRUCIBLE: swallows the failure
except Exception:
    pass

# Passes CRUCIBLE: failure is visible
except Exception as e:
    logger.warning(f"Metadata store failed: {e}")
    return error_result
```
The dx3 incident root cause was a single `except Exception: pass`. One line. Silence. That silence allowed silent data loss to reach production.
Gate 6: Mutation Testing.
Systematically mutate the code. Change a > to >=. Flip a boolean. Remove a return value. Run the tests. If the tests still pass, the test suite is hollow. That mutation survived. The code changed in a meaningful way and nobody knew.
Tools: mutmut or Cosmic Ray for Python, Stryker for TypeScript and JavaScript, cargo-mutants for Rust. None of these are exotic. They're not in most CI pipelines because adding them requires a decision.
Minimum 60% mutation score on critical paths. A test that survives all mutations proves nothing.
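Here's the mechanic those tools automate, hand-rolled for illustration (both functions are invented). The mutant flips `>=` to `>`. A type-only assertion passes against both versions, so the mutation "survives"; a boundary assertion kills it.

```python
def is_adult(age: int) -> bool:
    """Original implementation."""
    return age >= 18

def is_adult_mutant(age: int) -> bool:
    """What a mutation tool would generate: >= flipped to >."""
    return age > 18

def hollow_check(fn) -> bool:
    """Type-only assertion: passes for original AND mutant. Hollow."""
    return isinstance(fn(30), bool)

def boundary_check(fn) -> bool:
    """Boundary assertion: passes for the original, fails for the mutant."""
    return fn(18) is True
```

`mutmut` and friends generate thousands of mutants like this and report how many your suite kills. A suite full of `hollow_check`-style assertions scores near zero, no matter how many tests it counts.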
Gate 7: Spec-Test Traceability.
Integration test assertions should trace to a SPEC, NEXUS acceptance criterion, or documented requirement. Not to an implementation.
```python
def test_graph_traversal_returns_connected_nodes():
    """Validates: NEXUS N-14 AC-3 — traversal returns all nodes
    within N hops of the start node."""
    create_edge("A", "B")
    create_edge("B", "C")
    result = graph_traversal(start="A", max_depth=2)

    # Assert against SPEC, not implementation return shape
    entity_ids = {r["entity_id"] for r in result.data}
    assert "B" in entity_ids, "Direct neighbor must be in results"
    assert "C" in entity_ids, "2-hop neighbor must be in results"
    assert len(result.data) >= 2
```
The test says what it's verifying and why. If the spec changes, the test changes. If the implementation changes without a spec change, the test should not change.
Oracle Triangulation
The bigger structural problem underneath all of this: most projects rely on a single test oracle.
An oracle is anything that can determine whether a test result is correct. Example-based assertions (assert output == expected) are one oracle type. Most test suites have only this one. When it has gaps, you have no backup.
Four machine-executable oracle types:
| Oracle Type | What It Tests |
|---|---|
| Example-based | Specific input/output pairs |
| Property-based | Invariants that must always hold |
| Contract | API schemas, consumer/provider agreements |
| Integration | Real system behavior end-to-end |
Property-based tests (Hypothesis for Python, fast-check for TypeScript) generate random inputs and verify that invariants hold. They catch edge cases that example-based tests miss because they test the shape of behavior, not specific instances.
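The mechanics look like this, sketched with only the stdlib so the loop is visible; in practice you'd let Hypothesis's `@given` generate and shrink the inputs for you. The invariants here use `sorted` as a stand-in subject, purely for illustration.

```python
import random

def check_sort_properties(trials: int = 200) -> bool:
    """Random inputs, invariant assertions: the shape of a property-based oracle."""
    for _ in range(trials):
        xs = [random.randint(-1000, 1000) for _ in range(random.randint(0, 50))]
        once = sorted(xs)
        if sorted(once) != once:                     # idempotence invariant
            return False
        if sorted(xs, reverse=True)[::-1] != once:   # agreement invariant
            return False
        if len(once) != len(xs):                     # no-data-loss invariant
            return False
    return True
```

Note the last invariant: "no elements lost." That is precisely the property an example-based suite missed in the dx3 incident, stated once and checked against hundreds of generated inputs.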
Contract tests verify API schemas match documented interfaces. They catch drift between what a service promises and what it delivers.
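A minimal contract check, stdlib only; real suites would reach for jsonschema or Pact. The contract dict below is invented: it stands in for whatever fields your service documents.

```python
# Hypothetical contract: every response must carry these fields with these types.
NODE_RESPONSE_CONTRACT = {"entity_id": str, "score": float}

def satisfies_contract(payload: dict, contract: dict) -> bool:
    """Every promised field must be present with the promised type."""
    return all(
        key in payload and isinstance(payload[key], typ)
        for key, typ in contract.items()
    )
```

A service that quietly drops `score` from its responses keeps passing example-based tests that never looked at that field; the contract check fails the moment the promise is broken.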
Integration tests against real systems catch what unit mocks don't. The dx3 incident is the proof: unit mocks were green. Integration tests against real AGE failed.
The fifth oracle is human. The founder running the product cold. No machine test catches: "this error message is confusing," "I don't know what to click next," "I would be embarrassed to show this to someone." Those are judgment calls that require a human who has never seen the internals.
The human oracle is the most expensive oracle and the most valuable oracle. It should be the last gate before anything ships.
Requiring two or more independent oracle types per feature prevents "all tests pass but the app is broken." Requiring four on critical paths means there's no single point of failure in your verification chain.
The Quality Stack
CRUCIBLE sits in a three-layer verification stack:
Human Oracle (mandatory, Asif) -- catches intent failures
CRUCIBLE 7 Gates (mandatory, CoS) -- catches quality failures
CI Gate (mandatory, pre-push hook) -- catches correctness failures
These are not redundant. They're independent. The CI gate catches "tests fail." CRUCIBLE catches "tests exist but prove nothing." The human oracle catches "tests and code are correct but the thing still doesn't work the way a real person expects."
Three independent layers. Three different failure modes. None of them substitutes for the others.
The Portfolio Deployment
We deployed CRUCIBLE to 10 projects in one afternoon via CoS directives. Each project got a specific gate list based on architecture: there's a deployment matrix in the protocol that maps each gate to the projects it applies to.
That's the thing about a concrete protocol versus an abstract principle. When the protocol has seven specific, implementable gates, you can actually deploy it. "Test quality matters" is not actionable. "Add assert len(result.data) >= 1 to data-producing integration tests" takes five minutes.
Gates 1 and 2: one-line code changes or one-sentence CLAUDE.md rules. Immediate. Gates 3 and 4: review habits and CI reporting. Days. Gate 5: a grep across the codebase. Hours. Gates 6 and 7: tooling and discipline. Weeks to months.
We're in Phase 1. The protocol has four phases over a month. We'll get there.
Test Count is a Vanity Metric
I want to be careful here because this can be misread.
Test count is necessary. Tests must exist. Test counts must not decrease. These are real requirements. I'm not arguing otherwise.
But counting tests and measuring test quality are different activities, and most of us only do the first one. I built 16,442 tests across 17 projects and felt good about that number. Then 3,277 of those tests failed to catch a silent data loss bug. And 323 of them disappeared between commits without anyone noticing.
The number wasn't wrong. The number was just... not enough information.
CRUCIBLE doesn't replace the CI gate or the test count ratchet. It layers on top of them and asks: of the tests we have, how many actually prove something? How many would catch a real failure? How many would survive a mutation? How many trace to a specification?
The answer, for most projects, is fewer than you think.
A project with 300 mutation-hardened, spec-traced, oracle-triangulated tests is safer than a project with 3,000 hollow assertions. The number is not the point. The point is whether the tests catch real bugs.
The CRUCIBLE Protocol is open. If you want the full document, it's at ~/ASIF/standards/crucible-protocol.md in the ASIF framework. The forensic audit is at ~/ASIF/learning/2026-03-06-dx3-test-forensic-audit.md. Use both. The audit shows you the failure modes in real commits. The protocol gives you the gates to close them.
We built this from a real incident on real production code. Every gate exists because we observed the failure mode it prevents. Not theory. Not academic research alone. Evidence.
Asif Waliuddin builds AI infrastructure at NXTG.AI. 16,442 tests across 17 projects, and still learning what they actually catch.