Most “AI automation” content we read is either a screenshot of someone’s Zapier canvas or a 4-hour YouTube tutorial that ends with hello world. So we wrote the thing we wished existed: a small, working LLM script that replaced one real chore. In our case, that chore was triaging a shared support inbox. Here is the code, the prompts, the bills, and the mistakes.
The goal was narrow on purpose. We wanted to classify every new email into one of four buckets, auto-reply to the easy ones with a templated answer, and flag the rest for a human. No agents. No RAG. No vector database. Just one script, one model call per email, and a cron job. Ten days in, it handled about 140 messages a day at roughly $0.42 a day of API spend. That number shifted over the next few weeks, and the why is most of what we want to talk about.
Start with the boring part: what you are automating
Before writing a line of code, we pulled 200 recent emails from the inbox and hand-labeled them. This felt wasteful. It was the single most valuable thing we did. Four buckets fell out immediately. Roughly 55% were password reset requests we already had a canned reply for, 20% were billing questions that needed a real human, 15% were feature requests (archive and tag), and the last 10% was everything else (partnership pitches, bug reports, angry customers).
That label set became the ground truth we evaluated the script against. Without it, you are flying blind: your prompt feels great on five cherry-picked examples and then quietly mislabels 12% of real traffic. We built a tiny spreadsheet with email_id, true_label, and model_label columns and looked at it every morning for the first two weeks.
If you cannot describe what correct looks like in a spreadsheet, you cannot tell whether your automation is working. You just feel like it is, which is worse.
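The morning check itself is barely code. A minimal sketch of scoring `model_label` against `true_label` from that spreadsheet, assuming it has been exported to rows (the types and function name here are ours, not from the script):

```typescript
// Row shape mirrors the spreadsheet columns described above.
type LabeledRow = { email_id: string; true_label: string; model_label: string };

// Per-category accuracy: fraction of rows where the model agreed with the hand label.
function accuracyByCategory(rows: LabeledRow[]): Map<string, number> {
  const totals = new Map<string, { right: number; seen: number }>();
  for (const row of rows) {
    const t = totals.get(row.true_label) ?? { right: 0, seen: 0 };
    t.seen += 1;
    if (row.model_label === row.true_label) t.right += 1;
    totals.set(row.true_label, t);
  }
  const out = new Map<string, number>();
  for (const [label, t] of totals) out.set(label, t.right / t.seen);
  return out;
}
```

Breaking accuracy out per category matters: a 95% overall number can hide a bucket the model gets wrong half the time.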
The stack, kept deliberately small
We used Node.js 22, the Gmail API for fetching and sending, and Claude Sonnet 4.6 via the Anthropic SDK. You could swap any of these. The lesson is not the stack, it is the size of the stack. One file, one deployment target, one API bill. A coworker later ported this to Python in an afternoon. The logic is the value, the runtime is the commodity.
Three libraries did real work:
- `googleapis` for Gmail access (OAuth token stored in 1Password, pulled at runtime)
- `@anthropic-ai/sdk` for the model call with a JSON schema on the response
- `better-sqlite3` for a local audit log of every classification, with the raw prompt and response saved for a week
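The only subtle part of the audit log is the one-week retention. In the script it is a single DELETE through better-sqlite3; a hedged sketch of the cutoff rule itself, with names that are ours:

```typescript
// Keep raw prompts and responses for one week, then sweep them.
const RETENTION_MS = 7 * 24 * 60 * 60 * 1000;

// True when an audit row (identified by its ISO created_at timestamp)
// is older than the retention window and should be deleted.
function shouldPurge(createdAtIso: string, now: Date): boolean {
  return now.getTime() - Date.parse(createdAtIso) > RETENTION_MS;
}
```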
The classifier prompt (and its scars)
Here is roughly what the model call looks like after three rounds of edits. The structured-output schema is the important bit. Without it, you will spend an afternoon writing a regex to parse “Certainly! Based on my analysis...” into something you can switch on.
// triage.ts
const response = await anthropic.messages.create({
  model: "claude-sonnet-4-6",
  max_tokens: 400,
  system: TRIAGE_SYSTEM_PROMPT,
  messages: [
    {
      role: "user",
      content: `Subject: ${email.subject}\n\nFrom: ${email.from}\n\nBody:\n${email.bodyText.slice(0, 6000)}`,
    },
  ],
  tools: [
    {
      name: "classify_email",
      description: "Record the triage decision for one email.",
      input_schema: {
        type: "object",
        properties: {
          category: {
            type: "string",
            enum: ["password_reset", "billing", "feature_request", "other"],
          },
          confidence: { type: "number", minimum: 0, maximum: 1 },
          suggested_reply: { type: "string" },
          reason: { type: "string" },
        },
        required: ["category", "confidence", "reason"],
      },
    },
  ],
  tool_choice: { type: "tool", name: "classify_email" },
});

Two details saved us. First, `tool_choice` forces the model to return the tool call. We stopped getting the occasional prose reply that blew up our JSON parser. Second, we truncate the body at 6000 characters. One customer sent a 40KB log file pasted inline and turned a $0.002 request into a $0.19 one.
The system prompt we landed on
We started with a long, careful system prompt. It was worse than the short one. The short one is below. The model knows what a password reset email looks like. Your job is not to teach it, it is to tell it which buckets you care about and what “unsure” looks like.
You triage incoming support emails for a small SaaS.
Return exactly one category. If the email could plausibly fit two
categories, pick "other" and explain why in `reason`.
"password_reset" means the user cannot log in and wants their
access restored. It does NOT include billing login issues.
Set `confidence` below 0.7 whenever the email is ambiguous,
contains multiple requests, or is written in a language you are
uncertain about. A human will review anything below 0.8.

Acting on the output without blowing things up
The auto-reply was the part that scared us. A bad classification sending a chipper “here is your password reset link” to an angry enterprise customer is not a good week. Two guardrails kept us out of trouble.
- A confidence floor of 0.85. Below that, we flag for a human, regardless of category. In production this meant about 22% of emails still went to a person.
- A dry-run mode for the first 10 days. Every would-be auto-reply was written as a Gmail draft instead of being sent. A human had to click send. After 10 days of reviewing drafts, we flipped the switch.
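Both guardrails reduce to one small decision function. A sketch under our reading of the bucket breakdown above (only password resets get the canned reply, feature requests get archived, billing and everything else always go to a person); the names are ours:

```typescript
type Action = "auto_reply" | "draft_for_review" | "archive_and_tag" | "flag_human";

const CONFIDENCE_FLOOR = 0.85;

// One place where both guardrails live: the confidence floor and dry-run mode.
function decideAction(category: string, confidence: number, dryRun: boolean): Action {
  if (confidence < CONFIDENCE_FLOOR) return "flag_human";
  switch (category) {
    case "password_reset":
      // Dry-run writes a Gmail draft a human must click send on.
      return dryRun ? "draft_for_review" : "auto_reply";
    case "feature_request":
      return "archive_and_tag";
    default:
      return "flag_human"; // billing and "other" always get a person
  }
}
```

Having one function own the decision also means the dashboard and the audit log can record exactly why each email went where it went.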
The dry-run period caught one genuine horror. A customer replying to a marketing email with the phrase “please reset my expectations about your service” was classified as a password reset with 0.91 confidence. We added “reset” context to the prompt and moved on. Without the human-in-the-loop, that would have shipped.
What went wrong over the next month
The script worked. Then it slowly stopped working. This is the part nobody writes about.
Prompt drift. Around week three, our support team changed the canned password reset reply to include a link to a new help doc. The script was still sending the old version because the template was hardcoded. Nothing the model did was wrong. The ground truth had moved. We now store templates in a shared Notion doc that the script pulls at runtime.
Cost creep. We launched at around $0.42 a day. By week five we were at $1.80. The culprit was a marketing blast that generated a wave of replies with huge signature blocks. The 6000 character cap was doing its job on body length but the blasts also got classified at a higher rate of other, which meant more human review, which meant we had saved less time than we thought. A cheap heuristic (auto-archive anything from a list of known newsletter senders before hitting the model) cut spend by 40%.
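The heuristic itself is the cheapest possible filter: check the sender before spending a model call. A sketch, with an illustrative sender list (the real one lives in config):

```typescript
// Hypothetical entries; the production list came from auditing the blast replies.
const NEWSLETTER_SENDERS = new Set([
  "newsletter@example-saas.com",
  "digest@updates.example.net",
]);

// Gmail "From" headers look like `Name <address>`; compare on the address alone.
function shouldSkipModel(from: string): boolean {
  const match = from.match(/<([^>]+)>/);
  const address = (match ? match[1] : from).trim().toLowerCase();
  return NEWSLETTER_SENDERS.has(address);
}
```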
False positives in the long tail. Our accuracy on the top three categories was 96%+. On the `other` bucket it was 71%. That makes sense. "Other" is a grab bag. We split it into `other_english` and `other_spanish`, because half the misses were Spanish emails the model was trying to squeeze into English categories. The fix was one extra enum entry.
The numbers, four months in
A bit of real data so you have a reference point. Over the four months we ran the script, it processed 17,412 emails. 78% were classified with confidence above our 0.85 threshold and acted on automatically. The remaining 22% were flagged for human review. Of the auto-actioned group, we audited a random sample of 300 and found 9 misclassifications. That is a 97% precision rate on the confident subset, which was good enough for support work but would not be good enough for, say, compliance-sensitive communication.
Our average cost per classified email landed at $0.0041. The full script, including Gmail API overhead, runs in about 2.8 seconds per email on a t4g.small EC2 instance we were already paying for. Total monthly bill for the whole operation: about $20 of API spend, zero incremental infrastructure. The support lead estimated we saved her about 90 minutes of triage a day. Rounded to her fully-loaded cost, the script returned roughly its API spend every 45 minutes.
Those numbers are the answer to the “is this worth it” question for us. Your numbers will be different. Measure yours. The shape of the math is usually the same if you pick a task where the human was doing repetitive classification work.
Observability beats intuition every time
The single highest-leverage thing we added after week one was a cheap dashboard. Not Datadog, not a monitoring platform, a hand-rolled HTML page on the internal network that read from our SQLite audit log and showed:
- Emails classified per day, broken down by category and confidence bucket
- Auto-action rate (what % of emails we actually did something with vs flagged)
- Rolling 7-day cost in dollars
- Last 20 low-confidence classifications with the model's `reason` field shown
That last one was the killer feature. When the support team told us “the bot is being weird lately,” we could pull up the reason field and see exactly what the model was thinking about ambiguous emails. Twice it turned out to be us, not the model: our own help docs had changed and a new ticket type had emerged without anyone updating the script. We only noticed because the reason field showed the model trying to squeeze new content into old buckets.
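The queries behind the page are nothing fancy. The real version is a GROUP BY against the SQLite audit log; the same bucketing, sketched as pure TypeScript over in-memory rows (names are ours):

```typescript
type AuditRow = { category: string; confidence: number; created_at: string };

// Count emails per day, category, and confidence bucket, keyed "day|category|bucket".
// The bucket split mirrors the 0.85 auto-action floor.
function dailyBreakdown(rows: AuditRow[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const r of rows) {
    const day = r.created_at.slice(0, 10); // ISO date prefix, e.g. "2025-03-01"
    const bucket = r.confidence >= 0.85 ? "auto" : "review";
    const key = `${day}|${r.category}|${bucket}`;
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  return counts;
}
```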
When to graduate from a script to a real tool
We ran this script for four months before we started feeling the walls. The signals were clear when they arrived. You probably want to graduate when:
- Multiple people need to edit prompts or templates and you keep stepping on each other in git
- You need to replay the last week’s decisions against a new prompt to see if it is better, not just eyeball a few examples
- You have more than two automations and they are starting to share concepts (template storage, retry logic, audit logs)
- Someone non-technical needs to turn a rule on or off without pinging engineering
At that point, consider a lightweight orchestration layer. Our team moved to a small internal tool built on Inngest for scheduling and retries, with prompts stored in a database and an admin UI for the support team. It took three engineer-weeks. We would not have built it on day one because we would have built the wrong thing.
What we’d actually do
If a friend asked us how to start tomorrow, the answer is embarrassingly boring. Pick one task you already do badly and slowly. Hand-label 100 examples in a spreadsheet. Write a 150-line script with a structured-output model call. Run it in dry-run for two weeks. Add one guardrail (confidence floor, human approval, allowlist of senders, something). Measure accuracy against your spreadsheet weekly. Only then think about platforms.
The trap is reaching for a framework on day one. Frameworks are great when you know what you need. They are expensive when they make assumptions you have not earned yet. A 200-line script that ran in production for four months taught our team more about LLM automation than any “agent builder” demo ever did.