How We Built SafeClaw's Action Classifier
Every time an AI agent attempts an action on your system — writing a file, executing a shell command, making an API call — SafeClaw has to answer a single question in milliseconds: should this be allowed, denied, or escalated to a human?
That decision is the job of the action classifier, and it's the most critical component in the entire SafeClaw pipeline. Get it wrong, and you either block legitimate work or let dangerous actions slip through. Here's how we built it.
Why Not Use ML for Classification?
The obvious temptation was to train a machine learning model. We had the data. We had the compute. But we made a deliberate choice to go deterministic.
The reason is simple: predictability. When a developer writes a policy that says "deny all file writes outside /workspace," they need to know with absolute certainty that the classifier will enforce it. A probabilistic model introduces doubt. In safety-critical systems, doubt is unacceptable.
Our classifier is a rule-based engine that evaluates actions against policy definitions using exact pattern matching, glob expressions, and structured predicates. No neural networks. No embeddings. No surprises.
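As a rough sketch of how such a rule engine works (the names and shapes here are illustrative, not SafeClaw's actual API), a rule can pair an action type and a glob pattern with a verdict, evaluated first-match-wins:

```python
from dataclasses import dataclass
from fnmatch import fnmatch

@dataclass
class Rule:
    action_type: str   # e.g. "file_write", "shell_exec"
    pattern: str       # glob matched against the action's target
    verdict: str       # "allow", "deny", or "escalate"

def classify(action_type: str, target: str, rules: list[Rule]) -> str:
    # First matching rule wins; anything unmatched is escalated by default.
    for rule in rules:
        if rule.action_type == action_type and fnmatch(target, rule.pattern):
            return rule.verdict
    return "escalate"

rules = [
    Rule("file_write", "/workspace/*", "allow"),
    Rule("file_write", "*", "deny"),  # "deny all file writes outside /workspace"
]
```

Because every rule is an exact predicate, the same inputs always produce the same verdict, which is the predictability argument above in miniature.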
The Classification Pipeline
When an action comes in, it passes through four stages, ending in one of three verdicts: allow, deny, or escalate. Escalated actions are held in a pending queue until a human approves or rejects them.

Handling Ambiguity
The hardest cases are the ambiguous ones. Consider curl https://internal-api.company.com/deploy — is that a read or a write? Is it safe?
We handle ambiguity with a conservative default: when the classifier cannot determine intent with high confidence, the action is escalated. This is our deny-by-default philosophy in practice. We would rather ask a human than guess wrong.
For common ambiguous patterns, we provide pre-built heuristics that users can enable. For example, our network-write-detection heuristic flags HTTP methods like POST, PUT, and DELETE while allowing GET and HEAD by default.
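A minimal version of that heuristic (the function name and verdict mapping are ours, for illustration) just partitions HTTP methods by whether they can mutate remote state:

```python
# Methods that only read remote state pass; mutating methods are flagged.
SAFE_METHODS = {"GET", "HEAD"}
WRITE_METHODS = {"POST", "PUT", "DELETE"}

def network_write_verdict(method: str) -> str:
    method = method.upper()
    if method in SAFE_METHODS:
        return "allow"
    # Known write methods and anything unrecognized both escalate,
    # consistent with the conservative deny-by-default posture.
    return "escalate"
```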
Performance Constraints
The classifier runs in the hot path of every agent action. If it's slow, the agent feels slow. We set a hard performance budget: classification must complete in under 5 milliseconds for 99% of actions.
To hit this target, we pre-compile policy rules into optimized lookup structures at startup. Glob patterns are converted to compiled matchers. Path-based rules are organized in trie structures, so lookup cost scales with path depth rather than with the number of rules. The result is a classifier that typically completes in under 1 millisecond.
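To illustrate the pre-compilation idea (a sketch under assumed names, not SafeClaw's actual data structures), glob patterns can be translated to compiled regexes once at startup so the hot path only runs pre-built matchers:

```python
import fnmatch
import re

class CompiledPolicy:
    """Compile glob rules once at startup; classification only runs matchers."""

    def __init__(self, rules: list[tuple[str, str]]):
        # Each rule is (glob_pattern, verdict), compiled in declaration order.
        self._compiled = [
            (re.compile(fnmatch.translate(pattern)), verdict)
            for pattern, verdict in rules
        ]

    def classify(self, target: str) -> str:
        for matcher, verdict in self._compiled:
            if matcher.match(target):
                return verdict
        return "escalate"  # conservative default for unmatched targets

policy = CompiledPolicy([("/workspace/*", "allow"), ("*", "deny")])
```

Paying the translation cost once at startup, instead of re-parsing patterns on every action, is what keeps per-action latency well inside the 5 ms budget.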
Testing the Classifier
Our test suite includes over 200 test cases specifically for the classifier, covering edge cases like symlink resolution, Unicode file paths, nested command substitution, and policy rule conflicts. Every new action pattern we encounter in production becomes a new test case.
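As a flavor of what those edge-case tests check (hypothetical helper names; the real suite is in the repo), a symlink inside the workspace must not be allowed to smuggle a write outside it, so targets are resolved before path rules are matched:

```python
import os
import tempfile

def resolve_target(path: str) -> str:
    # Resolve symlinks before matching path rules, so a link inside
    # /workspace cannot redirect a "safe" write to somewhere like /etc.
    return os.path.realpath(path)

def test_symlink_resolution():
    with tempfile.TemporaryDirectory() as workspace:
        link = os.path.join(workspace, "sneaky")
        os.symlink("/etc/hosts", link)
        resolved = resolve_target(link)
        # The resolved path must fall outside the workspace,
        # so workspace-scoped allow rules will not match it.
        assert not resolved.startswith(workspace)

test_symlink_resolution()
```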
You can explore the full implementation on GitHub and read more about how policies work in our documentation.
What's Next
We're actively working on user-defined custom classifiers — the ability to write your own classification logic as plugins. If you have patterns specific to your workflow that SafeClaw doesn't handle natively, you'll soon be able to teach it.
The action classifier is the foundation that everything else in SafeClaw is built on. We spent months getting it right, and we continue to refine it with every release.