
How We Got to 446 Tests: Our Approach to AI Safety Quality

Authensor Team · 2026-02-13

SafeClaw has 446 tests across 24 test files. For a project with 25 source files, that is roughly 18 tests per source file. We did not set out to hit a number. We set out to build safety tooling we could trust, and the test count is a side effect of that standard.

This post is about how we think about testing when the software's job is to prevent AI agents from doing dangerous things.

The Stakes Are Different

Most software tests verify that features work correctly. If a test fails, a button is broken or an API returns the wrong status code. The consequence is a bug report.

SafeClaw's tests verify that dangerous actions are blocked. If our gateway test fails, it might mean an agent can delete files without approval. If our classifier test fails, it might mean a shell command is misclassified as a safe read operation. The consequence is not a bug report. It is an uncontrolled agent.

That difference in stakes drove every testing decision we made.

What We Test

Our 24 test files cover every layer of the system:

Classifier tests (classifier.test.js) verify that every SDK tool maps to the correct action type. Bash must map to code.exec. Write must map to filesystem.write. MCP tools must parse into mcp.{server}.{action}. Secret patterns must be redacted. We test the positive cases, the edge cases, and the adversarial cases (tools with names designed to confuse classification). Risk signal detection has its own test suite: 28 tests verifying that obfuscated commands, credential-adjacent paths, destructive operations, and persistence mechanisms are correctly tagged. (A minimal sketch of one such assertion appears just after this overview.)

Gateway tests (gateway.test.js) verify the full decision lifecycle. We test that safe reads bypass the control plane, that denied actions produce proper hook responses, that approval polling works with timeouts, that the gateway fails closed when the control plane is unreachable, and that risk signals flow through from classifier to audit entry. Every test constructs a real gateway hook with mocked dependencies and exercises the actual code path.

Policy tests (policy.test.js, policy-advanced.test.js) verify rule evaluation, first-match-wins semantics, every operator (eq, startsWith, contains, matches, in), boolean combinators (any, all), time-based rules, auto-expire, versioning, rollback, and simulation. The policy engine is where safety rules are defined, so we test it exhaustively. (A compact sketch of rule evaluation follows as well.)

Audit tests (audit.test.js) verify append-only behavior, hash chain integrity, the GENESIS sentinel, chain verification across hundreds of entries, and the behavior of verifyAuditIntegrity when entries are inserted, deleted, or modified.

Security tests (security.test.js) verify CSRF protection on every write endpoint, ReDoS rejection for nested quantifier patterns, secret redaction in SSE output, file permission enforcement (0o600), oversized payload rejection, and security header presence.

Integration tests (integration.test.js) spin up the actual HTTP server and make real requests against every API endpoint, verifying status codes, response formats, CSRF enforcement, and error handling.
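To make that concrete, here is a minimal sketch of what one of the classifier assertions looks like in spirit, using Node's built-in test runner. The inline classify function is a toy stand-in, not SafeClaw's classifier, and the mcp__server__tool name format it parses is an assumption for illustration.

```js
// Sketch only: a toy classifier standing in for SafeClaw's real one,
// so the shape of the assertions is clear.
const { test } = require('node:test');
const assert = require('node:assert');

function classify(toolName) {
  if (toolName === 'Bash') return { type: 'code.exec' };
  if (toolName === 'Write') return { type: 'filesystem.write' };
  const mcp = toolName.match(/^mcp__([^_]+)__(.+)$/); // assumed MCP tool-name format
  if (mcp) return { type: `mcp.${mcp[1]}.${mcp[2]}` };
  return { type: 'unknown' };
}

test('Bash maps to code.exec', () => {
  assert.strictEqual(classify('Bash').type, 'code.exec');
});

test('Write maps to filesystem.write', () => {
  assert.strictEqual(classify('Write').type, 'filesystem.write');
});

test('MCP tools parse into mcp.{server}.{action}', () => {
  assert.strictEqual(classify('mcp__github__create_issue').type, 'mcp.github.create_issue');
});
```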
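The policy tests revolve around rule evaluation, so here is a compact, self-contained sketch of first-match-wins semantics with a few of the operators and combinators listed above. The rule schema, field names, and default decision are illustrative assumptions, not SafeClaw's actual policy format.

```js
// Toy first-match-wins evaluator; the rule schema here is an assumption, not SafeClaw's.
const operators = {
  eq: (a, b) => a === b,
  startsWith: (a, b) => typeof a === 'string' && a.startsWith(b),
  contains: (a, b) => typeof a === 'string' && a.includes(b),
  in: (a, b) => Array.isArray(b) && b.includes(a),
};

function evalCondition(condition, action) {
  if (condition.any) return condition.any.some((c) => evalCondition(c, action));
  if (condition.all) return condition.all.every((c) => evalCondition(c, action));
  return operators[condition.op](action[condition.field], condition.value);
}

function evaluate(rules, action) {
  for (const rule of rules) {
    if (evalCondition(rule.when, action)) return rule.decision; // first match wins
  }
  return 'ask'; // illustrative default, not necessarily SafeClaw's
}

const rules = [
  { decision: 'deny', when: { all: [
      { field: 'type', op: 'eq', value: 'filesystem.write' },
      { field: 'path', op: 'startsWith', value: '/etc' },
    ] } },
  { decision: 'allow', when: { field: 'type', op: 'in', value: ['filesystem.read'] } },
];

console.log(evaluate(rules, { type: 'filesystem.write', path: '/etc/passwd' })); // deny
console.log(evaluate(rules, { type: 'filesystem.read', path: '/tmp/notes' }));   // allow
```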

We also have dedicated test files for the config module, settings, analytics, budget controls, cache, rate limiting, webhooks, scheduler, workspace scoping, session management, templates, validation helpers, the doctor command, the OpenAI agent tools, and the retry logic.
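The audit tests mentioned above lean on a hash chain: each entry's hash commits to its predecessor's hash, starting from the GENESIS sentinel, so inserting, deleting, or editing any entry breaks verification. Below is a conceptual sketch of that idea, not SafeClaw's verifyAuditIntegrity implementation; the entry shape and hash inputs are assumptions.

```js
const crypto = require('node:crypto');

// Conceptual hash-chain sketch; entry shape and hash inputs are illustrative.
const GENESIS = 'GENESIS';

function chainHash(prevHash, entry) {
  return crypto.createHash('sha256')
    .update(prevHash)
    .update(JSON.stringify(entry))
    .digest('hex');
}

// Append an entry whose hash commits to everything before it.
function append(log, entry) {
  const prev = log.length ? log[log.length - 1].hash : GENESIS;
  log.push({ ...entry, hash: chainHash(prev, entry) });
}

// Recompute every hash from GENESIS; any tampering surfaces as a mismatch.
function verifyChain(log) {
  let prev = GENESIS;
  for (const { hash, ...entry } of log) {
    if (hash !== chainHash(prev, entry)) return false;
    prev = hash;
  }
  return true;
}

const log = [];
append(log, { action: 'code.exec', decision: 'deny' });
append(log, { action: 'filesystem.read', decision: 'allow' });
console.log(verifyChain(log)); // true
log.splice(0, 1);              // delete the first entry...
console.log(verifyChain(log)); // ...and verification fails: false
```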

Our Testing Principles

1. Test the security boundary, not just the happy path. For every test that confirms "allowed actions execute," there is a corresponding test confirming "denied actions do not execute." The deny path is more important than the allow path. If allows break, the agent is temporarily unproductive. If denies break, the agent is uncontrolled.

2. Test the failure mode explicitly. We do not rely on "it probably works if nothing throws." We assert specific deny reasons, specific error messages, and specific HTTP status codes. When the control plane is unreachable, we assert that the gateway returns a deny decision with a reason string containing "fail-closed." When a ReDoS pattern is submitted, we assert that the safeRegex function returns { valid: false } with the specific error message.

3. No mocks for the safety-critical path. The classifier is a pure function. We test it directly with real inputs, not mocked versions. The policy engine's simulatePolicy function is deterministic. We test it with real policy objects and real action types. Mocks are reserved for external boundaries (the Authensor control plane HTTP client), never for the core safety logic.

4. Regression tests for every fix. When we fixed the ReDoS vulnerability in policy regex patterns, we did not just ship the fix. We added tests for every known ReDoS pattern class: (a+)+, (a*)+, (?:a+)+, (a|b+)+, and quantifier-brace combinations. When we hardened file permissions, we added tests that verify the mode bits on every sensitive file write. (A sketch of such a regression test follows this list.)
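As an illustration of principle 4, here is roughly what a ReDoS regression test looks like in spirit. The safeRegex below is a deliberately crude inline stand-in (a simple nested-quantifier check) so the sketch stays self-contained; it is not SafeClaw's validator, and the error shape is an assumption.

```js
const { test } = require('node:test');
const assert = require('node:assert');

// Crude stand-in for SafeClaw's safeRegex: flags patterns where a quantified
// group is itself quantified (the classic catastrophic-backtracking shape).
function safeRegex(pattern) {
  const nestedQuantifier = /\([^)]*[+*][^)]*\)\s*[+*{]/;
  if (nestedQuantifier.test(pattern)) {
    return { valid: false, error: 'nested quantifier' };
  }
  return { valid: true };
}

// One assertion per known-dangerous pattern class from the original fix.
const redosPatterns = ['(a+)+', '(a*)+', '(?:a+)+', '(a|b+)+', '(a+){2,}'];

test('nested-quantifier patterns are rejected', () => {
  for (const pattern of redosPatterns) {
    assert.strictEqual(safeRegex(pattern).valid, false, `expected ${pattern} to be rejected`);
  }
});

test('a plain pattern is accepted', () => {
  assert.strictEqual(safeRegex('^filesystem\\.read$').valid, true);
});
```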

The Growth Curve

The test count has tracked the feature count closely as SafeClaw has grown: every feature ships with tests, and every security fix ships with regression tests. The test suite runs in CI on Node 20 and Node 22 on every commit.

Tests Are a Feature

When we say SafeClaw has 446 tests, we are not bragging about a metric. We are telling you that every safety-critical code path in the system has been verified, that every security boundary has been tested from both sides, and that every regression has a test preventing recurrence.

For safety tooling, this is not overhead. It is the product.

The full test suite is in the repo: github.com/AUTHENSOR/SafeClaw. Run it yourself with npm test.