How to Run AI on Cloudflare Workers

Running AI inference at the edge with Cloudflare Workers AI — what models you get, how requests work, and when the edge beats a traditional server.

Written by BSH Technologies

Published on 2026-04-06

How do you run AI on Cloudflare Workers?

Cloudflare Workers AI lets you call machine-learning models from a Worker that runs at the edge, close to your users, without managing any GPU or server. You bind the AI service to your Worker, send it a prompt or input, and get inference back, all inside Cloudflare's network and billed by usage rather than per idle hour.

This matters because traditional AI hosting means renting a GPU box that costs money whether or not anyone is using it. Workers AI flips that to a serverless model: there is nothing to keep warm, it scales to zero, and your code already sits in hundreds of locations worldwide, so latency to the user is low.

What you can run at the edge

Workers AI ships a catalogue of ready models you invoke by name, covering the common needs of an AI feature.

Text generation with open large language models for chat, summarization and rewriting.
Embeddings for semantic search and retrieval, which pair naturally with Cloudflare's Vectorize vector database.
Image and vision tasks such as classification and captioning.
Speech to text for transcription workloads.

You do not download or host these models yourself. They run on Cloudflare's GPUs; your Worker just sends inputs and receives outputs, which keeps your code tiny and your deploy fast.

What a request looks like

The flow is deliberately simple. In your Worker you reference the bound AI service, pick a model by its identifier, and pass your input. The Worker awaits the result and returns it to the caller. Because Workers run on a request-response basis, you wire AI into an HTTP endpoint, a scheduled trigger or a queue consumer with the same few lines. Keys and bindings are configured in your Worker settings, so nothing sensitive ends up in client code. Pair Workers AI with KV for caching, R2 for file storage and Vectorize for embeddings, and you have an entire AI backend that never provisions a server.
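As a rough illustration of that flow, the sketch below sends one prompt to the bound AI service and returns the generated text. It is a minimal sketch under assumptions, not the definitive API: `AiBinding` is a hand-written stand-in for the binding's real type, `generateText` is a hypothetical helper, the binding name `AI` is whatever you configure, and the model identifier and the `{ response }` result shape should be checked against the current catalogue.

```typescript
// Minimal stand-in for the Workers AI binding (assumption: the real type
// comes from Cloudflare's type definitions and is richer than this).
interface AiBinding {
  run(model: string, input: Record<string, unknown>): Promise<unknown>;
}

// Assumed model identifier; confirm it exists in the current catalogue.
const CHAT_MODEL = "@cf/meta/llama-3.1-8b-instruct";

// Sends one prompt to the bound AI service and returns the generated text.
async function generateText(ai: AiBinding, prompt: string): Promise<string> {
  const result = (await ai.run(CHAT_MODEL, { prompt })) as { response?: string };
  if (typeof result.response !== "string") {
    throw new Error("Unexpected response shape from the model");
  }
  return result.response;
}

// Inside a Worker, the helper would be called from the fetch handler,
// with the binding arriving on `env` (binding name `AI` is an assumption):
//
//   export default {
//     async fetch(request: Request, env: { AI: AiBinding }): Promise<Response> {
//       const { prompt } = (await request.json()) as { prompt: string };
//       return Response.json({ answer: await generateText(env.AI, prompt) });
//     },
//   };
```

Keeping the model call in a small function that takes the binding as an argument means the same helper can be reused from an HTTP endpoint, a scheduled trigger or a queue consumer, and tested with a fake binding.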

When the edge beats a traditional server

Workers AI shines for short, stateless inference where low latency and scale-to-zero matter: a chat endpoint, an embedding generator behind search, a classification call inside a form. You pay only for the inference you run, so a feature with spiky or low traffic costs almost nothing when idle, which an always-on GPU instance cannot match.

It is less suited to very large frontier models, long-running multi-minute jobs, or workloads that need a specific framework and full control of the runtime. Workers have execution-time and memory limits, and you are choosing from the provided model catalogue rather than running arbitrary checkpoints. When you need that level of control, a dedicated GPU host or a managed inference service is the better fit — but for a great many AI features, the edge is faster to ship and cheaper to run.

Watch the limits and the cost model

Two things catch people out. The first is the execution-time and memory ceiling on a Worker: a single inference call is fine, but chaining several model calls or processing a large payload in one request can hit the limit, so push heavy or multi-step work onto a queue consumer rather than cramming it into one request. The second is billing. Workers AI is priced by usage, measured in the units each model consumes, so a chatty endpoint without caching can cost more than you expect even though there is no idle charge. Put a cache in front of identical or near-identical requests and the average cost per request drops sharply, because the cheapest inference is the one you never run. Knowing both constraints up front lets you design around them instead of discovering them in production.
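One way to put a cache in front of near-identical requests is to normalize the prompt before using it as a cache key. The sketch below is illustrative only: `AiBinding` and `KvNamespace` are hand-written stand-ins for the real bindings, `cacheKey` and `cachedGenerate` are hypothetical helpers, and the 170-character cut-off is a conservative assumption chosen to stay under the KV key size limit.

```typescript
// Minimal stand-ins for the bindings this sketch relies on (assumptions).
interface AiBinding {
  run(model: string, input: Record<string, unknown>): Promise<unknown>;
}
interface KvNamespace {
  get(key: string): Promise<string | null>;
  put(key: string, value: string, options?: { expirationTtl?: number }): Promise<void>;
}

// Conservative character limit so the key stays within KV's key size limit
// even when characters take several bytes each (assumed value).
const MAX_KEY_CHARS = 170;

// Builds a cache key that treats prompts differing only in case or
// whitespace as the same request. Returns null when the key would be too long.
function cacheKey(model: string, prompt: string): string | null {
  const normalized = prompt.trim().toLowerCase().replace(/\s+/g, " ");
  const key = `ai:${model}:${normalized}`;
  return key.length <= MAX_KEY_CHARS ? key : null;
}

// Returns a cached answer when one exists; otherwise runs inference once
// and stores the result with a time-to-live.
async function cachedGenerate(
  ai: AiBinding,
  cache: KvNamespace,
  model: string,
  prompt: string,
  ttlSeconds = 3600,
): Promise<string> {
  const key = cacheKey(model, prompt);
  if (key !== null) {
    const hit = await cache.get(key);
    if (hit !== null) return hit; // cache hit: no inference is run
  }
  const result = (await ai.run(model, { prompt })) as { response?: string };
  if (typeof result.response !== "string") {
    throw new Error("Unexpected response shape from the model");
  }
  if (key !== null) {
    await cache.put(key, result.response, { expirationTtl: ttlSeconds });
  }
  return result.response;
}
```

Normalizing this aggressively is a design choice: it raises the hit rate, but it is only appropriate when case and spacing do not change the intended answer. Prompts too long for a key simply bypass the cache in this sketch.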

A sensible edge AI architecture

A clean pattern: a Worker receives the request, checks KV for a cached answer, calls Workers AI on a miss, stores the result and responds. For search, generate embeddings with Workers AI, store them in Vectorize, and query Vectorize at request time to find relevant context before a generation call. This keeps the whole pipeline inside Cloudflare, with no origin server to maintain and no cold GPU to pay for.
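The search half of this pattern can be sketched as follows: embed documents, store the vectors, then retrieve the closest matches and pass them as context to a generation call. Treat it as a sketch under assumptions: `AiBinding` and `VectorIndex` are hand-written stand-ins for the real bindings, `embed`, `indexDocuments` and `answerWithContext` are hypothetical helpers, and the model identifiers, the `{ data }` and `{ response }` result shapes, and the `returnMetadata` option should be verified against the current Workers AI and Vectorize documentation.

```typescript
// Minimal stand-ins for the bindings this sketch relies on (assumptions).
interface AiBinding {
  run(model: string, input: Record<string, unknown>): Promise<unknown>;
}
interface StoredVector {
  id: string;
  values: number[];
  metadata?: Record<string, unknown>;
}
interface VectorMatch {
  id: string;
  score: number;
  metadata?: Record<string, unknown>;
}
interface VectorIndex {
  upsert(vectors: StoredVector[]): Promise<unknown>;
  query(
    vector: number[],
    options: { topK: number; returnMetadata?: "all" | "indexed" | "none" },
  ): Promise<{ matches: VectorMatch[] }>;
}

// Assumed model identifiers; confirm them in the current catalogue.
const EMBED_MODEL = "@cf/baai/bge-base-en-v1.5";
const CHAT_MODEL = "@cf/meta/llama-3.1-8b-instruct";

// Turns each text into one embedding vector.
async function embed(ai: AiBinding, texts: string[]): Promise<number[][]> {
  const result = (await ai.run(EMBED_MODEL, { text: texts })) as { data?: number[][] };
  if (!Array.isArray(result.data) || result.data.length !== texts.length) {
    throw new Error("Unexpected response shape from the embedding model");
  }
  return result.data;
}

// Indexing time: embed documents and store them with their text as metadata.
async function indexDocuments(
  ai: AiBinding,
  index: VectorIndex,
  docs: { id: string; text: string }[],
): Promise<void> {
  const vectors = await embed(ai, docs.map((d) => d.text));
  await index.upsert(
    docs.map((d, i) => ({ id: d.id, values: vectors[i], metadata: { text: d.text } })),
  );
}

// Request time: find relevant context, then make the generation call.
async function answerWithContext(
  ai: AiBinding,
  index: VectorIndex,
  question: string,
  topK = 3,
): Promise<string> {
  const [queryVector] = await embed(ai, [question]);
  const { matches } = await index.query(queryVector, { topK, returnMetadata: "all" });
  const context = matches
    .map((m) => String(m.metadata?.text ?? ""))
    .filter((t) => t.length > 0)
    .join("\n");
  const prompt = `Answer using only this context:\n${context}\n\nQuestion: ${question}`;
  const result = (await ai.run(CHAT_MODEL, { prompt })) as { response?: string };
  if (typeof result.response !== "string") {
    throw new Error("Unexpected response shape from the model");
  }
  return result.response;
}
```

Storing the source text in the vector's metadata keeps the sketch short; for larger documents, storing only an identifier and fetching the text from KV or R2 may be the better fit.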

Prefer it built and managed for you?

Edge AI is powerful but the right-fit decision — edge versus dedicated GPU — is easy to get wrong and expensive to undo. Talk to BSH Technologies and we will help you decide where your inference belongs and build it cleanly. Take a look at our cloud engineering services to see how we design serverless and edge architectures that stay cheap under real load.

Frequently asked questions

What is Cloudflare Workers AI?

It is a service that runs machine-learning models on Cloudflare GPUs, callable from a Worker at the edge. You bind the AI service to your Worker, send an input, and receive inference back. There is no GPU to manage, it scales to zero, and you are billed per request rather than per idle hour.

What models can I run on Workers AI?

Cloudflare provides a catalogue you invoke by name, including open large language models for text generation, embedding models for semantic search, image and vision models, and speech-to-text models. You do not host the models yourself; the Worker simply sends inputs and returns outputs from the provided catalogue.

When should I use edge AI instead of a server?

Use it for short, stateless inference where low latency and scale-to-zero matter, such as chat endpoints, embedding generation or classification. It is less suited to very large frontier models, multi-minute jobs, or workloads needing full runtime control, where a dedicated GPU host fits better.

How does Workers AI handle vector search?

It pairs with Vectorize, the Cloudflare vector database. You generate embeddings with Workers AI, store them in Vectorize, then query Vectorize at request time to find relevant context before a generation call. The whole retrieval pipeline stays inside Cloudflare with no separate database server to run.

Is Workers AI cheaper than running my own GPU?

For spiky or low-traffic features, usually yes, because you pay per request and nothing while idle. An always-on GPU instance bills whether or not it is used. For sustained, high-volume inference on large models, a dedicated GPU can become more economical, so it depends on your traffic shape.
