A Safety Belt for Autonomous Agents
My agents browse the internet all day. Now they ask permission first.
I’ve been running compounding judgment agents for a few months now. Some on OpenClaw, some on custom harnesses I’ve built for specific workflows. The 30-agent organization I wrote about in February was the most ambitious of these, but it wasn’t the only one. Most days I have several smaller setups running in parallel, doing research, drafting, comparing, escalating.
The trajectory is clear. The agents do more on their own every week. The list of things I have to approve shrinks. The list of things they decide for themselves grows.
Which means the surface area for things going sideways grows too.
the mundane attack surface
When I think about what my agents actually do, most of it is unremarkable.
One agent is figuring out where to pull updated BDC holdings data. Another is reading a Swagger spec for a new connector. A third is researching what’s for dinner because I asked it to plan the week’s meals around what’s already in the fridge. Several of them are reading email from other agents in the network.
In every one of those cases, the agent is on the open internet. Brave search results. Perplexity output. A vendor’s docs page. A recipe blog. A link in a message from another agent in my own network that pulled it from somewhere I haven’t audited.
This isn’t the dramatic attack surface people write about. It’s the ambient one. The agent doesn’t need to be tricked into doing something catastrophic. It just needs to wander somewhere it shouldn’t, render something it shouldn’t, follow a redirect into something I’d never click on myself.
I’m not mostly worried about prompt injection. There’s a real and growing literature on that, and the model providers are pushing on it. I’m worried about the unglamorous stuff. Sketchy ad networks. Domains that exist to scrape agent traffic. Pages that are obviously garbage to a human and invisible to an agent in a hurry.
I don’t want my agents browsing suspicious sites. That’s the whole requirement.
the skill
I wrote earlier this year about the difference between expertise you should rent and judgment you should own. URL safety was my example then too. Building a good URL validator is a security research problem. Months of work to do well. Years to do really well. But deciding what my agents should do when a URL comes back suspicious? That’s ten minutes of my time and three rules.
So I built a skill for my agents to religiously check every URL before visiting it. The skill calls Kovrex’s URL safety authority and waits for a verdict before any fetch.
Three rules govern it:
If the verdict comes back safe, the skill caches the response locally for 24 hours. The same URL doesn’t get re-checked for a day. This keeps latency down for repeat visits, which is what most agent browsing actually is.
If the verdict comes back unsafe, the skill discards the cache. The next time any agent in the network tries that URL, it gets re-checked. Verdicts can change. A clean domain today can be compromised tomorrow. I don’t want a stale “safe” answer locking in an outdated assumption.
If the verdict comes back unsafe a second time, the skill stops the agent and demands my approval before proceeding. The agent surfaces the URL, the verdict, the reason for the flag. I make the call. Sometimes I approve. Usually I don’t.
That’s the whole skill. A few hundred lines of code, one external authority, three rules. It took about ten minutes to build, mostly because one of my agents wrote it for me.
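The three rules reduce to a small decision function. Here's a minimal sketch, assuming a verdict string from the authority and a per-URL strike counter kept by the skill; the function and action names are my own, not the actual implementation:

```python
# Sketch of the skill's three rules as a pure decision function.
# `verdict` is the authority's answer; `prior_unsafe_strikes` counts
# earlier non-SAFE verdicts for this URL. Names are illustrative.
CACHE_TTL_SECONDS = 24 * 60 * 60  # rule 1: cache clean answers for a day

def next_action(verdict, prior_unsafe_strikes):
    """Map an authority verdict onto the skill's three rules."""
    if verdict == "SAFE":
        # Rule 1: fetch, and cache the clean verdict for 24 hours.
        return ("fetch_and_cache", CACHE_TTL_SECONDS)
    if prior_unsafe_strikes == 0:
        # Rule 2: first bad verdict -> discard any cache, recheck next time.
        return ("block_and_recheck", 0)
    # Rule 3: second strike -> stop the agent and ask the human.
    return ("escalate_to_human", 0)
```

The point of keeping it this small is that the policy is auditable at a glance, which is the whole reason it was a ten-minute build.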
what comes back
Here’s the clean path. My agent wants to fetch Drudge Report. Before it does, the skill sends the URL to the Kovrex authority:
```json
POST https://gateway.kovrex.ai/v1/call/url-safety
{
  "url": "https://drudgereport.com",
  "context": "fetch"
}
```

What comes back:
```json
{
  "verdict": "SAFE",
  "risk_score": 0.03,
  "confidence": 0.92,
  "verdict_reason": "clean",
  "recommendation": "Safe to proceed.",
  "signals": [
    { "signal": "virustotal", "status": "ok", "score": 0.0, "weight": 0.30 },
    { "signal": "phishtank", "status": "ok", "score": 0.0, "weight": 0.15 },
    { "signal": "domain_age", "status": "ok", "score": 0.0, "weight": 0.10 },
    { "signal": "ssl_valid", "status": "ok", "score": 0.0, "weight": 0.10 },
    { "signal": "ipqs", "status": "ok", "score": 0.05, "weight": 0.20 }
  ]
}
```

SAFE, low risk, high confidence. Every signal checked in, none flagged anything. The skill caches this for 24 hours and the agent moves on. That’s the boring case. The boring case should be boring.
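For anyone wiring this up themselves, the check is one POST. Here's a sketch of the call, assuming only the endpoint and request fields shown above; the helper name, timeout, and injectable `post` hook (so the skill can be tested without network access) are my assumptions:

```python
# Hypothetical pre-fetch check against the gateway. The endpoint and
# payload fields come from the post above; everything else is a sketch.
import json
import urllib.request

GATEWAY = "https://gateway.kovrex.ai/v1/call/url-safety"

def check_url(url, post=None):
    """Ask the authority for a verdict before fetching `url`.

    `post` is an injectable callable taking the request body and
    returning the parsed response, so tests can stub the network.
    """
    payload = {"url": url, "context": "fetch"}
    if post is None:
        def post(body):
            req = urllib.request.Request(
                GATEWAY,
                data=json.dumps(body).encode(),
                headers={"Content-Type": "application/json"},
            )
            with urllib.request.urlopen(req, timeout=10) as resp:
                return json.loads(resp.read())
    return post(payload)
```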
Now here’s what happens when I throw it a nonsense domain. One of my agents was researching data sources and surfaced a link to a98nf9aenf.com. The skill caught it before the agent visited:
```json
{
  "verdict": "UNKNOWN",
  "risk_score": 0.048,
  "confidence": 0.66,
  "verdict_reason": "pending_analysis",
  "recommendation": "Pending threat intel analysis. Please retry soon.",
  "signals": [
    { "signal": "virustotal", "status": "ok", "value": { "not_found": true, "submitted_for_scan": true }, "latency_ms": 627 },
    { "signal": "domain_age", "status": "error", "error": "No match for A98NF9AENF.COM", "error_type": "lookup_failed" },
    { "signal": "ssl_validity", "status": "ok", "value": { "valid": false }, "error": "Name or service not known", "error_type": "lookup_failed" },
    { "signal": "abuseipdb", "status": "skipped", "value": { "reason": "dns_failed" } },
    { "signal": "ipqualityscore", "status": "ok" }
  ]
}
```

Look at what this tells you. The domain has no WHOIS record. DNS doesn’t resolve. SSL lookup fails because the host doesn’t exist. VirusTotal has never seen it and submitted it for scanning. AbuseIPDB skipped entirely because there’s no IP to check.
A naive safety tool would look at this and say “no threats found, you’re good.” The risk score is low. Nothing flagged it as malicious. But the Kovrex authority doesn’t say SAFE. It says UNKNOWN with a confidence of 0.66 and a recommendation to retry later. It refuses to call something safe just because no threat list has gotten around to flagging it yet.
That’s the whole philosophy in one response. Insufficient data is better than a false SAFE.
My skill sees UNKNOWN, treats it as unsafe, and the next time an agent tries to visit this domain it checks again. If it comes back UNKNOWN a second time, the skill stops the agent and asks me. In this case I’d say no and move on. The agent would route around and find its data somewhere else.
Now here’s the one that surprised me. I ran both youtube.com and www.youtube.com through the same check. Both came back SAFE. But not the same SAFE.
youtube.com scored 0.0 risk. Clean across the board. www.youtube.com scored 0.175 with one VirusTotal engine flagging it as malicious out of 91 total scanners:
```json
{
  "verdict": "SAFE",
  "risk_score": 0.175,
  "confidence": 0.78,
  "verdict_reason": "clean",
  "signals": [
    { "signal": "virustotal", "status": "ok", "value": { "analysis_stats": { "malicious": 1, "harmless": 63, "undetected": 27 }, "positives": 1 }, "latency_ms": 217 },
    { "signal": "google_safebrowsing", "status": "ok", "value": { "flagged": false }, "latency_ms": 71 },
    { "signal": "ssl_validity", "status": "ok", "value": { "valid": true }, "latency_ms": 26 }
  ]
}
```

The authority still says SAFE. One engine out of 91 is noise, not signal. But the risk score is higher, and the signal breakdown shows you exactly why. This is the kind of verdict where your policy matters. If you’re running a strict shop and any VirusTotal positive triggers human review, you can build that into the skill. If you trust the aggregate, you let it through.
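The strict-shop variant is a few lines on top of the verdict routing. Here's a sketch, using the field names from the responses above; the function name and the specific policy (any VirusTotal positive forces review, even on SAFE) are assumptions you'd tune to your own risk tolerance:

```python
# Sketch of a stricter policy layered over the authority's verdict:
# trust SAFE only if no VirusTotal engine flagged the URL.
# Field names follow the response format shown above.
def needs_review(response):
    """Return True if a human should look before the agent proceeds."""
    if response["verdict"] != "SAFE":
        # Anything short of SAFE already routes to recheck/escalation.
        return True
    for sig in response.get("signals", []):
        if sig.get("signal") == "virustotal":
            positives = sig.get("value", {}).get("positives", 0)
            if positives > 0:
                # One engine disagreeing is enough for the strict shop.
                return True
    return False
```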
That’s three cases. A clean pass, an authority that refuses to guess, and a safe verdict with visible disagreement in the underlying signals. The skill routes on the verdict. But the signals are there if I ever want to tighten the policy.
what this isn’t
To be clear, this is not an enterprise URL filtering proxy. I’m not replacing Zscaler. I’m not building a corporate web gateway. The Kovrex URL safety authority checks individual URLs on demand. It doesn’t do content inspection, DLP, bandwidth management, or any of the other things a real network security stack handles.
What it does is answer a narrow question well: should my agent visit this URL right now? That’s the question I actually need answered fifty times a day, and I’d rather have an opinionated authority with published methodology answering it than build my own heuristics or skip the check entirely.
The skill pattern is the contribution here. An agent that checks before it acts, caches when the answer is clean, escalates when it isn’t. That pattern works with this authority. It would work with a different one. The important thing is that the check exists and the agent respects it.
why I trust the verdict
The reason I’m comfortable putting an external authority in the path of every fetch is that it aggregates and shows its work. VirusTotal, PhishTank, URLhaus, Google Safe Browsing, IPQS, AbuseIPDB, plus structural checks like domain age, SSL validity, redirect chains, and homograph detection. Each source gets a weight. Each weight is visible in the response. The methodology is published openly.
That matters more to me than raw accuracy. A single-source safety check is one outage or one bad day at a vendor away from being wrong. A multi-source check with a declared methodology is something I can reason about. When it flags a URL, I’m not just trusting an opaque score. I’m trusting a process I’ve actually looked at.
This is the part I keep coming back to in everything I build. Authorities should declare their methodology. If I can’t see how the verdict was reached, I can’t responsibly route an autonomous agent through it. The skill works because the underlying authority is willing to show its work.
why this works for the way agents actually browse
The thing that makes this fit my workflow rather than fight it is the cache. Agents revisit the same URLs constantly. Vendor docs. The same handful of data sources. Internal links between project artifacts. If every fetch required a fresh safety check, the latency would compound into something painful.
Twenty-four hours is short enough that I don’t worry much about a verdict going stale on a site that turns. Long enough that the working set of URLs an agent uses in a session almost never gets re-checked mid-task. The skill barely registers in the trace most of the time. It only shows up when there’s something to actually look at.
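The cache behavior described here fits in a small class. This is a sketch under the post's stated policy (safe verdicts live 24 hours, anything else is never cached); the class name and the injectable clock, included so the expiry can be tested, are my additions:

```python
# Sketch of the verdict cache: SAFE answers expire after 24 hours,
# non-SAFE verdicts evict any cached entry so the next visit rechecks.
import time

CACHE_TTL = 24 * 60 * 60

class VerdictCache:
    def __init__(self, now=time.time):
        self._now = now                # injectable clock for testing
        self._entries = {}             # url -> (verdict, expires_at)

    def put(self, url, verdict):
        if verdict == "SAFE":
            self._entries[url] = (verdict, self._now() + CACHE_TTL)
        else:
            # Unsafe/unknown is never cached; drop any stale SAFE entry.
            self._entries.pop(url, None)

    def get(self, url):
        entry = self._entries.get(url)
        if entry is None:
            return None
        verdict, expires_at = entry
        if self._now() >= expires_at:
            del self._entries[url]     # expired: force a fresh check
            return None
        return verdict
```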
The approval gate is the part I care most about. Most safety tooling I’ve used either blocks silently or warns and continues. Both are wrong for autonomous agents. Silent blocking means the agent doesn’t know why it failed and may retry in a loop or route around. Warn-and-continue is theater. The agent should stop, surface the decision to me, and wait. That’s what an agent that respects its own boundaries looks like.
the larger pattern
URL safety is one surface. It won’t be the last. The same skill pattern applies anywhere an autonomous agent is about to do something with real-world consequences. File downloads. Following links in email from other agents. Authenticating to an unfamiliar API. Each of those is a moment where the agent should check before it acts, cache when the answer is clean, and escalate when it isn’t.
I haven’t built those skills yet. But the shape is obvious now that the first one works. A narrow authority that answers a specific question well. A local cache that keeps the latency invisible. An approval gate that puts a human in front of anything the authority flags. Ten minutes to build. Invisible when things are normal. Loud when they aren’t.
The agents do more every week. The safety belts have to come along for the ride.
Happy to share the skill with anyone who wants to try it. Reach out and I’ll send it over.
If you’re building agent organizations and want to compare notes on the boring infrastructure that keeps them running, subscribe. That’s most of what I write about here.