How SafeClaw Detects Risk Signals in Tool Calls
An AI agent says it wants to run rm -rf /tmp/build/*. Seems harmless. But what if the agent resolved a symlink incorrectly, and /tmp/build actually points to your home directory? What if the command before it was ln -s ~ /tmp/build?
Individual tool calls can look benign in isolation. The risk emerges from context, from sequences, and from subtle patterns that are easy to miss. SafeClaw's risk signal detection system is designed to catch what individual action checks would not.
What Is a Risk Signal?
A risk signal is any property of a tool call — or a sequence of tool calls — that indicates elevated risk. Risk signals don't automatically trigger a deny. Instead, they accumulate and contribute to an overall risk score for the current session.
We categorize risk signals into five classes:
- Destructive Operations: Commands that delete, overwrite, or truncate data. Examples: rm, truncate, > file, DROP TABLE, and similar patterns.
- Privilege Escalation: Actions that attempt to gain elevated permissions. Examples: sudo, chmod 777, chown, modifying /etc, and writes to system directories.
- Data Exfiltration: Outbound data transfers, especially when preceded by reads of sensitive files. Examples: curl -d @secrets.env, scp, posting file contents to URLs.
- Scope Violations: Actions that reach outside the defined workspace boundary. Reading files in parent directories, accessing other users' home folders, or touching paths outside the allowed set.
- Unusual Patterns: Tool call sequences that deviate from the agent's established behavioral baseline. A coding agent that suddenly starts making network requests, or a documentation agent that starts writing shell scripts.
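To make the classes concrete, here is a minimal Python sketch of how single-call signals might be matched. It is not SafeClaw's actual implementation: the SIGNAL_PATTERNS table, the classify and outside_workspace functions, and the regexes are all hypothetical and chosen only for illustration. The scope check resolves symlinks and ".." segments before comparing paths, which is the kind of check the /tmp/build example calls for.

```python
import os
import re

# Hypothetical pattern table. Class names follow the list above;
# the regexes are illustrative assumptions, not SafeClaw's real rules.
SIGNAL_PATTERNS = {
    "destructive": [r"\brm\b", r"\btruncate\b", r"\bDROP\s+TABLE\b"],
    "privilege_escalation": [r"\bsudo\b", r"\bchmod\s+777\b", r"\bchown\b"],
    "data_exfiltration": [r"\bcurl\b.*\s-d\s+@", r"\bscp\b"],
}


def classify(command: str) -> set:
    """Return the signal classes whose patterns match the command string."""
    return {
        signal_class
        for signal_class, patterns in SIGNAL_PATTERNS.items()
        if any(re.search(p, command, re.IGNORECASE) for p in patterns)
    }


def outside_workspace(path: str, workspace: str) -> bool:
    """Return True if path, after resolving symlinks and '..', leaves workspace."""
    real_path = os.path.realpath(path)
    real_root = os.path.realpath(workspace)
    return os.path.commonpath([real_path, real_root]) != real_root
```

A plain regex match is deliberately naive: it cannot see through shell quoting or aliases, so a real detector would likely parse the command first.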
The Detection Pipeline
Risk signal detection operates in two modes: synchronous and asynchronous.
Synchronous detection runs inline with every tool call. It checks the current action against the risk signal patterns and computes an immediate score. If the score exceeds the configured threshold, the action is escalated or denied before it executes. Synchronous detection is fast: it adds less than 2 milliseconds of latency.
Asynchronous detection analyzes sequences of actions over the session history. After each tool call completes, a background process evaluates the recent action sequence for multi-step attack patterns. For example, a read of .env followed by a curl POST would trigger an exfiltration signal, even if each action individually was below threshold.
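The read-then-POST example can be sketched as a small sliding-window check. This is an illustration under stated assumptions, not SafeClaw's detection logic: the ToolCall shape, the tool names "read_file" and "shell", the list of sensitive suffixes, and the window size are all hypothetical. In a real deployment, observe would be called from the background process after each tool call completes.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class ToolCall:
    tool: str      # hypothetical tool name, e.g. "read_file" or "shell"
    argument: str  # file path or command string


# Assumed list of file name endings treated as sensitive.
SENSITIVE_SUFFIXES = (".env", ".pem", "id_rsa")


def is_sensitive_read(call: ToolCall) -> bool:
    return call.tool == "read_file" and call.argument.endswith(SENSITIVE_SUFFIXES)


def is_outbound_post(call: ToolCall) -> bool:
    cmd = call.argument
    return call.tool == "shell" and cmd.startswith("curl") and (
        " -d " in cmd or "-X POST" in cmd
    )


class SequenceDetector:
    """Flags an outbound POST that follows a sensitive read within a window."""

    def __init__(self, window: int = 10):
        self.history = deque(maxlen=window)  # most recent tool calls only

    def observe(self, call: ToolCall) -> list:
        signals = []
        if is_outbound_post(call) and any(is_sensitive_read(p) for p in self.history):
            signals.append("data_exfiltration")
        self.history.append(call)
        return signals
```

Neither call raises a signal on its own here; only the pair does, which is the point of analyzing the session history rather than single actions.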
Scoring and Thresholds
Each risk signal carries a weight between 0.1 and 1.0. Weights are configurable per policy profile, so you can tune sensitivity for your environment. A research project might tolerate higher network access scores, while a production environment might set aggressive thresholds.
The session's cumulative risk score is visible in the SafeClaw dashboard and exposed via the API. When the score crosses a threshold, SafeClaw can automatically downgrade the session to a more restrictive policy profile — reducing the agent's permissions until a human reviews the situation.
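As a rough sketch of how weights, a cumulative score, and an automatic downgrade could fit together, consider the Python below. The additive scoring rule, the specific weights, the threshold value, and the profile names "standard" and "restricted" are assumptions made for illustration; the real scoring formula and configuration keys are in the SafeClaw docs.

```python
from dataclasses import dataclass

# Hypothetical per-class weights, each within the 0.1 to 1.0 range.
DEFAULT_WEIGHTS = {
    "destructive": 0.8,
    "privilege_escalation": 0.9,
    "data_exfiltration": 1.0,
    "scope_violation": 0.6,
    "unusual_pattern": 0.3,
}


@dataclass
class Session:
    weights: dict
    threshold: float = 2.0     # assumed value; tune per policy profile
    profile: str = "standard"  # assumed profile name
    score: float = 0.0

    def record(self, signals: list) -> str:
        """Add each signal's weight to the score; downgrade once past threshold."""
        for signal in signals:
            # Unknown signals fall back to the minimum weight (an assumption).
            self.score += self.weights.get(signal, 0.1)
        if self.score >= self.threshold:
            self.profile = "restricted"  # stays restricted until a human resets it
        return self.profile
```

With these numbers, a destructive signal followed by a privilege escalation signal leaves the session at about 1.7 and still on the standard profile; a subsequent exfiltration signal pushes it past 2.0 and triggers the downgrade.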
Reducing False Positives
The biggest challenge with risk detection is noise. Early versions of our system flagged too many legitimate actions, which trained users to ignore alerts — the worst possible outcome.
We addressed this through contextual baselines. SafeClaw learns the typical tool call patterns for each project and adjusts signal weights accordingly. A web development project where curl is used constantly will have a much higher threshold for network-related signals than a pure filesystem project.
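One simple way to picture a contextual baseline is to scale a signal's weight down by how often that signal appears in the project's normal activity. The sketch below is an assumption about how such an adjustment could work, not a description of SafeClaw's learning method; the adjusted_weight function, the linear scaling rule, and the 0.1 floor are hypothetical.

```python
from collections import Counter


def adjusted_weight(
    base_weight: float,
    signal: str,
    baseline: Counter,
    total_calls: int,
    floor: float = 0.1,
) -> float:
    """Reduce a signal's weight in proportion to how routine it is for the project.

    baseline counts how many past tool calls raised each signal;
    total_calls is the number of tool calls observed for the project.
    """
    if total_calls == 0:
        return base_weight  # no history yet, so use the configured weight
    frequency = baseline[signal] / total_calls
    # Never drop below the floor, so a routine signal is still counted.
    return max(floor, base_weight * (1.0 - frequency))
```

For a web project where 80 of 100 past calls raised a network signal, a base weight of 0.5 drops to the 0.1 floor, while a filesystem-only project with no network history keeps the full 0.5.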
Try It Yourself
Risk signal detection is enabled by default in SafeClaw. You can view detected signals in real time through the dashboard, or query them via the CLI. The full configuration reference and signal definitions are in our docs, and the detection logic is fully open source on GitHub.
We built risk signal detection because single-action checks are not enough. Real threats are sequences, and catching them requires watching the whole story unfold.