AI apps do not look like normal web apps. Responses stream for 30 seconds. Sessions have state. A single request can fan out into four model calls, a retrieval query, and a database write. Tail latency matters because users are waiting in real time. And the cost model changes month to month because the provider keeps shipping cheaper models. None of this is what “serverless Hello World” blog posts were written for.

We have shipped AI apps on all three major platforms. Different projects, same team. Here is what each one is genuinely good at, what bit us, and what we would pick for three specific workloads.

The shape of an AI app

Before the tools, name the shape. An AI app is usually one or more of:

  • Streaming responses: users want tokens as they are generated, not all at once after 20 seconds of silence
  • Long-running requests: a single user action might kick off a multi-step agent that runs for 2 minutes
  • Stateful sessions: conversation history, document context, per-user memory
  • Bursty compute: one user, one request, but each request lights up a GPU somewhere for 10 seconds

Traditional serverless platforms were designed for short, stateless, request/response workloads. AI apps break all four of those assumptions. The platforms have been racing to catch up since 2023. In 2026, they have, unevenly.

Vercel: the “just ship it” platform

Vercel in 2026 is the fastest way to get a streaming AI app in front of users. The AI SDK is genuinely good. Edge runtime plus Fluid Compute plus Next.js 16 React Server Components gets you a chat UI in a long weekend. Deploys are magic. Observability is built in. You will be shipping in hours, not days.

The catch is a known shape. Serverless Functions cap at a 300-second maximum duration. Fluid Compute (their higher-density runtime) helps enormously with streaming economics, because you are no longer paying for a full container to sit there waiting on a model that is only emitting two tokens per second. The pricing, though, is still expensive at serious scale.

Rough math from our own RAG chatbot project, at 30K monthly active users, with roughly 4 messages per user per session, two sessions a day:

  • Vercel compute (Fluid): ~$820/month for the frontend + streaming API routes
  • Vercel bandwidth: ~$180/month
  • Vercel analytics + logging: ~$60/month

Call it $1,060/month just for the app shell. Your model bills are separate (and likely dwarf the platform bill at serious volume). For a side project or a small team, this is cheap. For a company at 500K MAU, this is the point where someone starts asking hard questions in the ops review.
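The arithmetic behind that estimate is worth making explicit, because MAU alone does not drive the bill; requests do. Here is a sketch with a placeholder blended rate (illustrative numbers only, not Vercel's price sheet):

```typescript
// Back-of-envelope platform cost model for a streaming chat app.
// The per-unit rate is an illustrative placeholder, NOT a published price.
interface CostInputs {
  mau: number;               // monthly active users
  messagesPerSession: number;
  sessionsPerDay: number;
  costPer1kRequests: number; // assumed blended compute + bandwidth rate
}

function monthlyRequests(i: CostInputs): number {
  return i.mau * i.messagesPerSession * i.sessionsPerDay * 30;
}

function monthlyPlatformCost(i: CostInputs): number {
  return (monthlyRequests(i) / 1000) * i.costPer1kRequests;
}

// 30K MAU, 4 messages/session, 2 sessions/day → 7.2M requests/month
const inputs: CostInputs = {
  mau: 30_000,
  messagesPerSession: 4,
  sessionsPerDay: 2,
  costPer1kRequests: 0.125, // placeholder
};
```

At the placeholder rate this lands near $900/month; our real bill ran higher because streaming duration, analytics, and logging all bill separately on top of the per-request compute.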

Where Vercel shines: a product team wants to ship and iterate on UX. Time from PR to production is measured in seconds. Preview environments with full streaming work out of the box. For internal tools, MVPs, and anything where developer velocity is the constraint, we start here every time.

Cloudflare: the cheap-at-scale platform with weird edges

Cloudflare Workers + Durable Objects + AI Gateway is the most interesting AI-app stack of 2026, and the one with the steepest weirdness curve. Workers are cheap. Like, embarrassingly cheap. 10 million requests for $5. Durable Objects give you stateful sessions with single-instance coordination (great for chat state, rate limiting per user, coordinating a long agent run).

AI Gateway sits between your Worker and whichever model provider you use, and it caches, rate-limits, retries, and gives you a universal logging layer. We cannot overstate how useful this is. We added AI Gateway to a production app as a one-line change and cut our model spend by 12% from caching alone.

The tradeoffs are real. Workers have CPU time limits. You used to be capped at 30 seconds of CPU time per request (wall clock could be longer because of I/O waits), and in 2026 the unbound tier removes that, but you are now in a different pricing bracket. Your framework of choice may or may not have a Workers target. Next.js on Workers is possible but not the official happy path.

Here is roughly what a streaming AI endpoint on a Worker looks like:

// app/api/chat/route.ts running on Cloudflare Workers
// `env` exposes secrets/bindings to module code (Workers runtime import;
// exact wiring depends on your Next.js-on-Workers adapter)
import { env } from "cloudflare:workers";

export async function POST(req: Request) {
  const { messages } = await req.json();

  const upstream = await fetch(
    "https://gateway.ai.cloudflare.com/v1/ACCOUNT/my-gateway/anthropic/v1/messages",
    {
      method: "POST",
      headers: {
        "x-api-key": env.ANTHROPIC_API_KEY,
        "anthropic-version": "2023-06-01",
        "content-type": "application/json",
      },
      body: JSON.stringify({
        model: "claude-sonnet-4-6",
        max_tokens: 2048,
        stream: true,
        messages,
      }),
    },
  );

  // Pass the provider's status and SSE body straight through to the client
  return new Response(upstream.body, {
    status: upstream.status,
    headers: {
      "content-type": "text/event-stream",
      "cache-control": "no-cache",
      "x-content-type-options": "nosniff",
    },
  });
}

That endpoint runs on every edge in the world, uses approximately zero CPU time (it is just proxying bytes), and costs somewhere in the fraction-of-a-cent range per thousand requests. This is the Cloudflare pitch in one block of code.
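On the client side, consuming that stream is a plain ReadableStream read loop. A minimal sketch (deliberately skipping SSE event parsing; `consumeStream` is our own helper, not a platform API):

```typescript
// Read a streaming Response body chunk by chunk, handing decoded text to a callback.
async function consumeStream(
  res: Response,
  onText: (chunk: string) => void,
): Promise<void> {
  if (!res.body) throw new Error("response has no body");
  const reader = res.body.getReader();
  const decoder = new TextDecoder();
  for (;;) {
    const { done, value } = await reader.read();
    if (done) break;
    // stream: true keeps multi-byte characters intact across chunk boundaries
    onText(decoder.decode(value, { stream: true }));
  }
}
```

In a real chat UI you would parse the `data:` lines of the SSE protocol before appending text to the bubble, but the read loop itself does not change.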

Durable Objects are the secret weapon for stateful AI workloads. A conversation lives in a Durable Object. The user hits the nearest edge, which routes to the specific DO that owns their session. You get consistent state without a database call on the hot path. For chat, we have seen this halve our p99 response latency compared to round-tripping to Postgres.
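Stripped of the Workers plumbing, the state a conversation's Durable Object owns reduces to something like this (a plain-class sketch; in production this logic lives inside a class extending `DurableObject` and persists through its storage API):

```typescript
interface ChatMessage {
  role: "user" | "assistant";
  content: string;
}

// In-memory model of the per-session state a Durable Object would own.
class ChatSession {
  private history: ChatMessage[] = [];
  constructor(private maxTurns: number = 20) {}

  append(msg: ChatMessage): void {
    this.history.push(msg);
    // Keep a sliding window so the prompt stays bounded
    const max = this.maxTurns * 2; // one user + one assistant message per turn
    if (this.history.length > max) {
      this.history = this.history.slice(-max);
    }
  }

  // What gets sent as `messages` to the model on the next turn
  context(): ChatMessage[] {
    return [...this.history];
  }
}
```

Because a single object instance owns the session, appends never race and there is no read-modify-write round trip to a database on the hot path.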

Cloudflare is the cheapest real way to run AI apps at scale in 2026, but you will spend the first month yelling at limits and the documentation.

AWS: the “build your own platform” platform

AWS in 2026 gives you the most power and the most rope to hang yourself with. Bedrock is genuinely good as a managed model-hosting layer, especially for companies that need data residency or VPC isolation. Lambda now supports streaming responses and up to 15-minute execution times. ECS Fargate plus Application Load Balancer is the grown-up escape hatch when your workload does not fit serverless.

For AI apps specifically, the AWS stack we have shipped looks like:

  • Lambda with Function URL streaming for short, stateless AI endpoints (RAG queries, content generation)
  • ECS Fargate for long-running agents, voice processing, anything over 15 minutes
  • Bedrock for model inference when the customer demands AWS-resident data
  • OpenSearch Serverless for vector storage, or pgvector on RDS if you already have Postgres
  • DynamoDB for per-user session state

The power is real. You can do anything. The cost is also real. You will spend weeks wiring IAM policies, VPC configs, CloudWatch alarms, and figuring out why your Lambda is cold starting. This is not a platform for a two-person team.

Where it wins is enterprise: a regulated customer that requires HIPAA or SOC2 Type II with custom controls, data never leaving a specific region, full audit trails, private model endpoints. None of the other platforms match AWS here. If your customers demand it, you are on AWS whether you like it or not.

Cost at 30K MAU for the same RAG chatbot, AWS version (rough):

  • Lambda streaming: ~$140/month
  • Bedrock (if you use it): varies, priced like any model API
  • DynamoDB on-demand: ~$60/month
  • OpenSearch Serverless: ~$320/month (minimum OCU floor)
  • CloudFront + ALB + data transfer: ~$180/month

Around $700/month before the model bill. Cheaper than Vercel at this scale, more expensive than a tight Cloudflare setup. But you are also paying someone to babysit it. Factor in the engineer cost.

Three scenarios, three answers

Scenario A: RAG chatbot with streaming, B2C at 100K MAU

Pick Cloudflare. Workers + Durable Objects + AI Gateway is purpose-built for this shape. Streaming is cheap because you are proxying bytes. Session state lives in a Durable Object per user. AI Gateway gives you caching and rate limits without building your own. Estimated platform bill: under $200/month at 100K MAU. You will spend a week fighting the tooling and then it will run.

Pick Vercel if you need to ship in three weeks and move fast on UX. The bill will be larger. It will work on day one. Revisit at 250K MAU.

Scenario B: Background jobs with LLM calls

A queue of 50,000 documents to process with a model. Each takes 10-40 seconds. Pick AWS: SQS + Lambda for orchestration, ECS Fargate Spot for the long-running workers if you need more than 15 minutes. Lambda’s per-invocation cost makes this economical. You will want to set a max concurrency to avoid rate-limit blowups on whichever model API you are calling.
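The concurrency cap does not need framework support; a shared-index worker pool is enough. A sketch (our own helper, independent of SQS or any queue service):

```typescript
// Process items with a bounded number of in-flight calls.
// processOne is whatever calls your model API for a single document.
async function processWithConcurrency<T, R>(
  items: T[],
  limit: number,
  processOne: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;

  // Each worker pulls the next index; the read-and-increment is atomic
  // because JS is single-threaded between awaits.
  async function worker(): Promise<void> {
    while (next < items.length) {
      const i = next++;
      results[i] = await processOne(items[i]);
    }
  }

  await Promise.all(
    Array.from({ length: Math.min(limit, items.length) }, worker),
  );
  return results;
}
```

Pick `limit` from your model provider's rate limit, not from how many Lambdas you could theoretically run in parallel.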

Cloudflare Queues + Workers also works now, and is cheaper, but the tooling for batch reprocessing, retries, and dead-letter handling is less mature than SQS. If you have an AWS-shaped organization, do not fight it.

Scenario C: Real-time voice (speech-to-speech agent)

Pick Cloudflare. This is what Durable Objects + WebSockets were built for: a persistent connection, state, low latency across the globe. Running voice on Vercel is a pain (WebSockets are not first-class in their runtime model for functions). Running voice on AWS is doable (API Gateway WebSockets + Lambda, or ECS), but you are building infrastructure.

The one caveat: if your voice stack needs a specific ML model running on a GPU somewhere, you will end up with a hybrid setup: Cloudflare for the connection handling and routing, GPU inference somewhere else (Modal, Baseten, RunPod, or a fine-tuned AWS G5/G6 instance). None of the big three has the GPU inference story fully solved.

The stuff that surprised us

Observability is the hidden winner. Vercel Observability is excellent out of the box. Cloudflare Logs + Workers Analytics is good enough. CloudWatch is its own universe and you will need a third-party layer like Datadog or Honeycomb to actually be productive. For AI specifically, tracing a slow request through retrieval, reranking, and three model calls is a core skill. Platforms that make this easy save you real hours.

Cold starts matter again. For streaming AI endpoints, a 700ms cold start feels horrible because the user is waiting with an empty chat bubble. Vercel and Cloudflare both do a reasonable job at keeping things warm. Lambda cold starts have improved (especially with provisioned concurrency or SnapStart) but are still the worst of the three for user-facing work.

Egress costs are a trap. AWS charges for data out. Vercel charges for bandwidth. Cloudflare does not charge for egress, which at scale is a rounding error that turns out not to be rounded. We have seen 10-20% of a monthly cloud bill disappear by moving traffic through Cloudflare in front of AWS backends. If you are ever tempted by AWS for a user-facing streaming workload, run the egress numbers first.

What we’d actually do

For a B2C AI product in 2026, starting from scratch today, we would build on Cloudflare Workers with AI Gateway in front of whichever model provider fits the task. It is the cheapest at scale, the fastest at the edge, and the stateful primitives actually fit AI workloads. Budget a week for the onboarding pain.

For a B2B AI product where the customer is an enterprise that demands VPC peering and data residency, go AWS. Bedrock plus Lambda plus a disciplined infra team will get you there. Budget a senior engineer for the ops side.

For an MVP or a project that has to be demo-able in three weeks, go Vercel. Migrate later if the economics demand it. Plenty of real companies never migrate off Vercel because the hours you would spend moving cost more than the Vercel bill. That is a perfectly good answer.

The worst choice is picking the platform first and then trying to fit your AI app into it. The AI app has a shape. Pick the platform that matches the shape.