How We Built SafeClaw's Action Classifier
Every time an AI agent attempts an action on your system — writing a file, executing a shell command, making an API call — SafeClaw has to answer a single question in milliseconds: should this be allowed, denied, or escalated to a human?
That decision is the job of the action classifier, and it's the most critical component in the entire SafeClaw pipeline. Get it wrong, and you either block legitimate work or let dangerous actions slip through. Here's how we built it.
Why Not Use ML for Classification?
The obvious temptation was to train a machine learning model. We had the data. We had the compute. But we made a deliberate choice to go deterministic.
The reason is simple: predictability. When a developer writes a policy that says "deny all file writes outside /workspace," they need to know with absolute certainty that the classifier will enforce it. A probabilistic model introduces doubt. In safety-critical systems, doubt is unacceptable.
Our classifier is a rule-based engine that evaluates actions against policy definitions using exact pattern matching, glob expressions, and structured predicates. No neural networks. No embeddings. No surprises.
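As a rough sketch of how such a rule engine works (the names and shapes here are illustrative, not SafeClaw's actual API), a rule can pair an action type and a glob pattern with a verdict, evaluated first-match-wins:

```python
from dataclasses import dataclass
from fnmatch import fnmatch

@dataclass
class Rule:
    action_type: str   # e.g. "file_write", "shell_exec"
    pattern: str       # glob matched against the action's target
    verdict: str       # "allow", "deny", or "escalate"

def classify(action_type: str, target: str, rules: list[Rule]) -> str:
    # First matching rule wins; anything unmatched is escalated by default.
    for rule in rules:
        if rule.action_type == action_type and fnmatch(target, rule.pattern):
            return rule.verdict
    return "escalate"

rules = [
    Rule("file_write", "/workspace/*", "allow"),
    Rule("file_write", "*", "deny"),  # "deny all file writes outside /workspace"
]
```

Because every rule is an exact predicate, the same inputs always produce the same verdict, which is the predictability argument above in miniature.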
The Classification Pipeline
When an action comes in, it passes through four stages, ending in one of three verdicts: allow, deny, or escalate. Escalated actions are held in a pending queue until a human approves or rejects them.

Handling Ambiguity
The hardest cases are the ambiguous ones. Consider curl https://internal-api.company.com/deploy — is that a read or a write? Is it safe?
We handle ambiguity with a conservative default: when the classifier cannot determine intent with high confidence, the action is escalated. This is our deny-by-default philosophy in practice. We would rather ask a human than guess wrong.
For common ambiguous patterns, we provide pre-built heuristics that users can enable. For example, our network-write-detection heuristic flags HTTP methods like POST, PUT, and DELETE while allowing GET and HEAD by default.
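A minimal version of that heuristic (the function name and verdict mapping are ours, for illustration) just partitions HTTP methods by whether they can mutate remote state:

```python
# Methods that only read remote state pass; mutating methods are flagged.
SAFE_METHODS = {"GET", "HEAD"}
WRITE_METHODS = {"POST", "PUT", "DELETE"}

def network_write_verdict(method: str) -> str:
    method = method.upper()
    if method in SAFE_METHODS:
        return "allow"
    # Known write methods and anything unrecognized both escalate,
    # consistent with the conservative deny-by-default posture.
    return "escalate"
```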
Performance Constraints
The classifier runs in the hot path of every agent action. If it's slow, the agent feels slow. We set a hard performance budget: classification must complete in under 5 milliseconds for 99% of actions.
To hit this target, we pre-compile policy rules into optimized lookup structures at startup. Glob patterns are converted to compiled matchers. Path-based rules are organized in trie structures, so lookup cost scales with path depth rather than with the number of rules. The result is a classifier that typically completes in under 1 millisecond.
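To illustrate the pre-compilation idea (a sketch under assumed names, not SafeClaw's actual data structures), glob patterns can be translated to compiled regexes once at startup so the hot path only runs pre-built matchers:

```python
import fnmatch
import re

class CompiledPolicy:
    """Compile glob rules once at startup; classification only runs matchers."""

    def __init__(self, rules: list[tuple[str, str]]):
        # Each rule is (glob_pattern, verdict), compiled in declaration order.
        self._compiled = [
            (re.compile(fnmatch.translate(pattern)), verdict)
            for pattern, verdict in rules
        ]

    def classify(self, target: str) -> str:
        for matcher, verdict in self._compiled:
            if matcher.match(target):
                return verdict
        return "escalate"  # conservative default for unmatched targets

policy = CompiledPolicy([("/workspace/*", "allow"), ("*", "deny")])
```

Paying the translation cost once at startup, instead of re-parsing patterns on every action, is what keeps per-action latency well inside the 5 ms budget.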
Testing the Classifier
Our test suite includes over 200 test cases specifically for the classifier, covering edge cases like symlink resolution, Unicode file paths, nested command substitution, and policy rule conflicts. Every new action pattern we encounter in production becomes a new test case.
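As a flavor of what those edge-case tests check (hypothetical helper names; the real suite is in the repo), a symlink inside the workspace must not be allowed to smuggle a write outside it, so targets are resolved before path rules are matched:

```python
import os
import tempfile

def resolve_target(path: str) -> str:
    # Resolve symlinks before matching path rules, so a link inside
    # /workspace cannot redirect a "safe" write to somewhere like /etc.
    return os.path.realpath(path)

def test_symlink_resolution():
    with tempfile.TemporaryDirectory() as workspace:
        link = os.path.join(workspace, "sneaky")
        os.symlink("/etc/hosts", link)
        resolved = resolve_target(link)
        # The resolved path must fall outside the workspace,
        # so workspace-scoped allow rules will not match it.
        assert not resolved.startswith(workspace)

test_symlink_resolution()
```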
You can explore the full implementation on GitHub and read more about how policies work in our documentation.
What's Next
We're actively working on user-defined custom classifiers — the ability to write your own classification logic as plugins. If you have patterns specific to your workflow that SafeClaw doesn't handle natively, you'll soon be able to teach it.
The action classifier is the foundation that everything else in SafeClaw is built on. We spent months getting it right, and we continue to refine it with every release.