A year ago, “AI agents” at Airfind meant a rough Python script that called Claude in a loop and occasionally did something useful. Six months ago, we built a lightweight harness to coordinate that loop across Slack, a shared session, and a few cron-scheduled prompts. We ended up open-sourcing it as ClaudeClaw, mostly because once we had the pattern working, it felt wasteful to keep it in a drawer. This is less a pitch for that tool and more a blunt write-up of what we learned shipping agents into a small company's daily work.
The short version: the wins were real, but smaller and more boring than the conference-stage version suggests. The failures were also more boring. Mostly we learned what to automate and what to leave alone.
The pattern that keeps working
Every agent that has earned its keep in our stack does the same three-step dance. It reads a narrow slice of state (an inbox, a metrics dashboard, a PR diff), runs a well-framed LLM call, and reports back to a channel where a human can actually see it. That is not a flashy architecture. It is a read, a think, and a nudge. The trick is being rigorous about what goes into the “think” step.
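In code, the dance is barely more than a function per step. A minimal sketch of the shape (all names here are illustrative, not from our actual harness, and the LLM call is stubbed):

```python
# Read -> think -> nudge, as three small functions.
# Names and data are illustrative; the real harness wires these
# to Slack, cron, and the model API.

def read_state() -> str:
    """Read a narrow slice of state (an inbox, a dashboard, a diff)."""
    return "yesterday_revenue=1023.50 baseline=1180.00"

def think(state: str) -> str:
    """One well-framed LLM call. Stubbed with a canned reply here;
    the framing of the prompt is where the real effort goes."""
    prompt = f"Compare revenue to baseline and flag anomalies: {state}"
    return "Revenue is below baseline; largest drop looks regional."

def nudge(report: str) -> str:
    """Report to a channel where a human will actually see it."""
    return f"#analytics <- {report}"

def run_agent() -> str:
    return nudge(think(read_state()))

print(run_agent())
```

The point of keeping the steps this separate is that each one can be swapped or tested on its own, and the "think" step stays a single, reviewable prompt.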
The agents that fail are the ones we let “take action” autonomously without a human in the loop. The ones that work are the ones that draft, summarize, or flag. Put differently: we have stopped letting Claude push buttons and started letting Claude highlight which buttons we should push. That shift alone doubled the value we got out of the stack.
Three agents that actually moved the needle
Out of the dozen or so we have built, these are the three that survived six months of real use.
- Morning revenue brief. A scheduled job pulls yesterday's ad revenue, compares it to the rolling baseline, and drops a paragraph in Slack. If something is off, it flags the dimension that broke (country, app, ad unit). It saves the analytics team about 30 minutes every morning and catches regressions that used to go unnoticed until lunch.
- PR summarizer. On every pull request, Claude summarizes the diff, flags anything that changes a public API, and leaves a comment tagging the right reviewer. We still review. It just means we are reviewing with our brain already oriented.
- Support triage. Incoming publisher emails get read, classified, and paired with a draft reply. A human approves, edits, or rewrites. The team says the hardest part is no longer drafting; it is judging.
What we stopped trying
We spent a few weeks trying to make agents do multi-step outbound work: schedule meetings, negotiate ad inventory over email, auto-remediate pipeline failures. All of it either needed a human in the loop anyway or failed in ways that were expensive to unwind. The meeting scheduler once confidently double-booked a client call with two different team members in two different time zones. That was the week we decided external-action agents were a bad bet for a company our size.
We also stopped trying to build one giant agent that could “handle everything.” Every time we did, it got worse at the specific things and no better at the general things. Ten scoped agents beat one general one, every time.
The hardest problem is not building agents. It is resisting the urge to keep giving them more to do.
What surprised us
Three things that were not on our bingo card at the start:
First, observability matters more than the model. Whether we ran an agent on Claude 4.7, GPT-5.1, or a local Qwen model, the single biggest variable in whether it kept working over time was whether we logged every tool call and reviewed the logs weekly. Without that, silent drift in behavior took months to notice.
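The logging itself is almost trivially cheap to add. A sketch of what we mean by logging every tool call, as a decorator that emits one JSON line per call (field names are illustrative):

```python
# Wrap each tool so every call appends one JSON record. Weekly
# review then becomes a grep, not an archaeology dig.
# Field names are illustrative.

import functools
import json
import time

def logged_tool(fn):
    """Decorator: record tool name, args, timestamp, and outcome."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        record = {"tool": fn.__name__, "args": repr(args),
                  "ts": time.time()}
        try:
            result = fn(*args, **kwargs)
            record["ok"] = True
            return result
        except Exception as e:
            record["ok"] = False
            record["error"] = str(e)
            raise
        finally:
            print(json.dumps(record))  # or append to a log file
    return wrapper

@logged_tool
def fetch_revenue(day: str) -> float:
    return 1023.50  # stand-in for the real dashboard query

fetch_revenue("2025-01-01")
```

One structured line per call is enough: behavioral drift shows up as the mix of tools and error rates shifting week over week, which you can see from the logs alone.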
Second, the cost was lower than we expected. Our heaviest agent (support triage, ~3,000 tickets a day) runs for about $140 a month in inference. That is less than a single part-time contractor. Cost has basically not been the limiting factor.
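The back-of-envelope behind that claim is simple enough to sanity-check (assuming a 30-day month):

```python
# Per-ticket inference cost for the support-triage agent,
# from the numbers above. 30 days/month is an assumption.
tickets_per_day = 3_000
days_per_month = 30
monthly_cost_usd = 140

tickets_per_month = tickets_per_day * days_per_month   # 90,000
cost_per_ticket = monthly_cost_usd / tickets_per_month
print(f"${cost_per_ticket:.4f} per ticket")
```

A fraction of a cent per ticket leaves a lot of headroom before cost becomes the thing you optimize.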
Third, the team's attitude matters more than the tech. When engineers believed the agent was a helpful first-pass, they reviewed its output carefully and flagged issues. When they treated it as a magic black box, they rubber-stamped bad outputs and the system degraded. The lesson: do not let anyone treat an agent as an authority. It is an eager intern.
Where we are going next
The next six months are about consolidation, not expansion. Better evals, better observability, and turning the three that work into the kind of infrastructure the whole team can rely on without thinking about it. The shiny new agents can wait.
If there is a single takeaway for other teams looking to do this: pick one workflow. Make the scope embarrassingly small. Put a human between the agent and anyone who matters. See if the world gets better. If yes, do one more. If not, kill it without ceremony. That is the playbook. Everything else is noise.