Our Testing Philosophy: Why 446 Tests Isn't Enough

Authensor Team · 2026-02-13

We have 446 tests in SafeClaw's test suite. We're proud of that number. But we also know it's a vanity metric. Test count tells you nothing about test quality, coverage, or effectiveness. A project with 50 excellent tests can be better protected than one with 5,000 mediocre ones.

Here's how we actually think about testing at Authensor.

Tests Are Specifications

Our primary view of tests is that they're executable specifications. Each test documents a specific behavior that SafeClaw must exhibit. "When a file write targets a path outside the workspace boundary, the classifier must return deny." That's not just a test — it's a contract.
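To make the contract idea concrete, here's a minimal sketch of that specification written as a pytest test. The import path, `classify_action`, and `Decision` are hypothetical names for illustration, not SafeClaw's actual API:

```python
# Sketch only: classify_action and Decision are hypothetical stand-ins
# for SafeClaw's real classifier API.
from pathlib import Path

from safeclaw.classifier import Decision, classify_action  # hypothetical import


def test_write_outside_workspace_is_denied(tmp_path: Path) -> None:
    """Spec: a file write targeting a path outside the workspace
    boundary must be classified as deny."""
    workspace = tmp_path / "project"
    workspace.mkdir()

    decision = classify_action(
        kind="file_write",
        target=Path("/etc/passwd"),  # clearly outside the workspace
        workspace=workspace,
    )

    assert decision is Decision.DENY
```

Read that way, the test name and docstring are the spec, and the assertion is the contract's enforcement clause.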

This perspective changes how we write tests. We don't start with "what code should I test?" We start with "what behavior must SafeClaw guarantee?" The test suite is the definitive specification of SafeClaw's behavior, more authoritative than documentation and more precise than prose.

When a user reports unexpected behavior, the first question is: "Is there a test that specifies what should happen?" If yes, and the test passes, then SafeClaw is working as specified and we need to update the spec (and the test). If no, we've found a gap in our specification, which is worse than a bug.

The Testing Pyramid

Our test suite follows a classic testing pyramid:

Unit Tests (320+) — Fast, isolated tests for individual functions and modules. The classifier's pattern matching, the policy engine's rule evaluation, the boundary enforcer's path resolution — each component is tested in isolation with known inputs and expected outputs. These tests run in under 2 seconds and catch the majority of regressions.

Integration Tests (90+) — Tests that verify interactions between components. An action flows through the classifier, is evaluated by the policy engine, and produces a decision that's logged in the session — this end-to-end flow is tested with realistic scenarios (see the sketch after this list). These tests run in under 10 seconds and catch interface mismatches between components.

Behavioral Tests (36+) — High-level tests that simulate real user scenarios. "A developer is working on a React project with the standard policy. The agent tries to read a file, write a file, delete a file, and make a network request. Verify the correct decisions for each." These tests take longer to run but provide confidence that SafeClaw works correctly in realistic conditions.
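Here's a minimal sketch of what an integration-style test of that classifier-to-policy-to-session flow might look like. `Classifier`, `PolicyEngine`, and `Session` are hypothetical names, not SafeClaw's real components:

```python
# Sketch only: Classifier, PolicyEngine, and Session are hypothetical
# stand-ins for SafeClaw's real components.
from pathlib import Path

from safeclaw import Classifier, PolicyEngine, Session  # hypothetical imports


def test_denied_write_is_logged(tmp_path: Path) -> None:
    classifier = Classifier(workspace=tmp_path)
    engine = PolicyEngine.load("standard")  # hypothetical built-in policy
    session = Session()

    # Stage 1: classify the proposed action.
    classification = classifier.classify("file_write", Path("/etc/passwd"))
    # Stage 2: evaluate it against the active policy.
    decision = engine.evaluate(classification)
    # Stage 3: record the outcome in the session log.
    session.record(classification, decision)

    assert decision.verdict == "deny"
    assert session.entries[-1].decision == decision
```

A unit test would exercise each stage alone; this test exists to catch the seams, like a classifier output the policy engine can't consume.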

What We Test Beyond Code

Not all important properties can be tested with unit tests.

Performance Tests — We have a benchmark suite that measures classifier latency under load. If a code change causes the 99th percentile latency to exceed 5 milliseconds, the build fails. Performance is a correctness property for SafeClaw (see the benchmark sketch after this list).

Offline Tests — Our CI pipeline includes a test run with network access disabled at the OS level. Every feature must work offline, and this test enforces it.

Upgrade Tests — We test that configurations written for previous versions of SafeClaw work correctly with the current version. Breaking configuration compatibility is a test failure.

Adversarial Tests — We specifically test for attacks: path traversal via symlinks, TOCTOU race conditions, glob pattern injection, policy rule conflicts that create unintended gaps. These tests are informed by our threat model and updated whenever we learn about new attack techniques.
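As an illustration of the latency gate, here's a minimal benchmark sketch. The 5 ms p99 budget matches the CI gate described above; `classify_action` is the same hypothetical stand-in used in the earlier sketches:

```python
# Sketch only: classify_action is a hypothetical stand-in; the 5 ms
# p99 budget is the CI threshold described in the post.
import time
from pathlib import Path

from safeclaw.classifier import classify_action  # hypothetical import

P99_BUDGET_MS = 5.0
SAMPLES = 10_000


def test_classifier_p99_latency(tmp_path: Path) -> None:
    latencies_ms = []
    for _ in range(SAMPLES):
        start = time.perf_counter()
        classify_action(kind="file_read", target=tmp_path / "app.py",
                        workspace=tmp_path)
        latencies_ms.append((time.perf_counter() - start) * 1_000)

    latencies_ms.sort()
    p99 = latencies_ms[int(0.99 * SAMPLES)]  # 99th percentile sample
    assert p99 <= P99_BUDGET_MS, f"p99 {p99:.2f} ms exceeds {P99_BUDGET_MS} ms"
```

Treating the assertion as a test failure, rather than a dashboard metric, is what makes performance a correctness property: a regression blocks the merge the same way a wrong decision would.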

Mutation Testing

Test count and code coverage don't tell you whether your tests actually catch bugs. Mutation testing does. We use mutation testing to introduce small changes (mutations) into SafeClaw's code and verify that at least one test fails for each mutation.

If a mutation survives — meaning no test fails despite a code change — it indicates a weakness in our test suite. Either we're missing a test, or our existing tests aren't asserting on the right things. We run mutation testing weekly and track the mutation score over time.
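To see why a surviving mutant matters, here's a toy illustration (not SafeClaw code) of a mutant that a weak test lets through but a boundary-pinning test kills. A mutation tool such as mutmut generates changes like flipping `<=` to `<` automatically and reruns the suite:

```python
# Toy example of mutation testing, not SafeClaw code.

def within_latency_budget(p99_ms: float, budget_ms: float = 5.0) -> bool:
    # A mutation tool might rewrite `<=` here to `<` or `>=`.
    return p99_ms <= budget_ms


def test_weak() -> None:
    # Only checks an interior point, so the `<=` -> `<` mutant survives:
    # 1.0 < 5.0 is still True.
    assert within_latency_budget(1.0)


def test_boundary() -> None:
    # Pins the boundary: fails if `<=` mutates to `<` (5.0 < 5.0 is False).
    assert within_latency_budget(5.0)
    # Pins the direction: fails if `<=` mutates to `>=`.
    assert not within_latency_budget(5.01)
```

With only test_weak in the suite, the mutant survives and the mutation score drops; that's the signal telling us which assertion to strengthen.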

Why 446 Isn't Enough

Our test count will keep growing because SafeClaw's behavior surface keeps growing. Every new feature adds new behaviors that need specification. Every bug report reveals a gap in our specification. Every new attack technique requires new adversarial tests.

We'll never declare our test suite "done." A safety-critical tool requires continuous investment in testing, just as it requires continuous investment in security. The number 446 is where we are today. Tomorrow, we'll need more.

Explore our test suite on GitHub. Read about our testing standards in the contributor guide. And if you find a behavior that isn't tested — tell us. That's one of the most valuable contributions you can make.