
How to Run an LLM Without a GPU

Yes, you can run capable language models on a plain CPU. Here is how quantization, smaller models, and the right runtime make it work.

Written by
BSH Technologies
Published on 2026-05-15

You can run an LLM without a GPU using quantization and a CPU-friendly runtime

Running a language model on a CPU is entirely practical for small to mid-sized models, and the trick that makes it work is quantization: storing the model's weights in 4-bit or 8-bit precision instead of 16-bit. That shrinks memory needs and cuts the amount of data the CPU must read for every token, which is enough for a CPU to keep up. Pair a quantized model with a runtime built for CPU inference, such as Ollama or llama.cpp, and a modern laptop with 16 GB of RAM will happily run a 7B or 8B model. It will not match a GPU's speed, but for many uses it is more than fast enough.
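To make the idea of lower-precision weights concrete, here is a minimal sketch of one simple quantization scheme: each block of weights is stored as small signed integers plus a single shared scale. This is an illustration only. The function names are hypothetical, and the formats actually used by llama.cpp and Ollama are more elaborate than this.

```python
def quantize_block(weights, bits=4):
    """Map a block of floats to signed integers that share one scale factor."""
    qmax = 2 ** (bits - 1) - 1              # 7 for 4-bit, 127 for 8-bit
    absmax = max(abs(w) for w in weights)   # largest magnitude in the block
    scale = absmax / qmax if absmax > 0 else 1.0
    # Round each weight to the nearest integer step and clamp to the valid range.
    quantized = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize_block(quantized, scale):
    """Recover approximate float weights from the integers and the scale."""
    return [q * scale for q in quantized]
```

Each restored weight differs from the original by at most half a step (scale / 2), which is where the modest quality cost of quantization comes from.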

Why a GPU is normally preferred

GPUs win at language model inference because they do thousands of arithmetic operations in parallel, and generating each token is mostly parallel arithmetic. A CPU does far fewer operations at once, so it produces tokens more slowly. That difference is real, but it is a difference of degree, not a hard wall — a CPU still computes the same result, just at a more measured pace. For interactive chat the lag is noticeable; for background tasks, scripts, or occasional queries it often does not matter at all.

The three levers that make CPU inference work

Getting acceptable performance without a GPU comes down to three choices you control.

  1. Use a smaller model. A 7B or 8B model is the comfortable ceiling for CPU inference. Larger models run but crawl, and the quality gain rarely justifies the wait.
  2. Use aggressive quantization. A 4-bit quantized model uses roughly a quarter of the memory of the 16-bit version and runs markedly faster on a CPU, with only a modest quality cost for most tasks.
  3. Use a CPU-optimised runtime. llama.cpp and the tools built on it, including Ollama, are engineered specifically for fast CPU inference and use your processor's vector instructions to squeeze out every bit of speed.

Memory, not raw clock speed, is usually the binding constraint. If a quantized model fits in your available RAM, it will run. If it does not fit, it either fails or swaps to disk and becomes unusably slow.
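The memory arithmetic behind these choices is simple enough to sketch. The functions below are hypothetical helpers, not part of any runtime. The 2 GB overhead for context and runtime buffers is an assumed placeholder, and real 4-bit formats also store per-block scales, so they use somewhat more than 4 bits per weight.

```python
def weight_memory_gb(params_billions, bits_per_weight):
    """Approximate size of the weights alone, in gigabytes (10^9 bytes)."""
    # billions of parameters * bits per weight / 8 bits per byte
    return params_billions * bits_per_weight / 8


def fits_in_ram(params_billions, bits_per_weight, free_ram_gb, overhead_gb=2.0):
    """Rough check: weights plus an assumed overhead must fit in free RAM."""
    needed = weight_memory_gb(params_billions, bits_per_weight) + overhead_gb
    return needed <= free_ram_gb
```

For example, weight_memory_gb(7, 16) gives 14.0 and weight_memory_gb(7, 4) gives 3.5, which is the "roughly a quarter" saving described above.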

What performance to expect

On a typical modern laptop CPU, a 4-bit 7B model generates a handful of tokens per second: readable, but visibly slower than a hosted API or a GPU. Apple Silicon machines do notably better here because their unified memory and accelerated frameworks give CPU-class inference a real boost. The practical test is simple: pull a quantized 7B or 8B model in Ollama, ask it a representative question, and judge whether the speed suits your use. For drafting, classification, extraction, and other non-interactive jobs, it usually does.
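If you would rather measure than guess, the sketch below asks a locally running Ollama server for one completion and derives tokens per second from the timing fields in its reply. It assumes Ollama is listening on its default local address and that the model you name has already been pulled. The function names are hypothetical.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumed default local endpoint


def tokens_per_second(response):
    """Compute generation speed from an Ollama /api/generate response dict."""
    # eval_count is the number of generated tokens; eval_duration is in nanoseconds.
    return response["eval_count"] / (response["eval_duration"] / 1e9)


def measure(model, prompt, url=OLLAMA_URL):
    """Send one non-streaming request and return the measured tokens per second."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False})
    request = urllib.request.Request(
        url,
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as reply:
        return tokens_per_second(json.loads(reply.read()))
```

A call such as measure("llama3.1:8b", "Summarise this paragraph in one sentence: ...") returns a single number you can compare against your own tolerance; the model tag here is only an example.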

When to stop fighting it and add a GPU

CPU inference has a ceiling. If you need fast responses for many users, large models, or high throughput, no amount of tuning will make a CPU compete with a GPU, and the sensible move is to add one or rent GPU time in the cloud. But for a single user, a private assistant, or a batch process that runs overnight, a CPU is a perfectly legitimate place to run a capable open model — and it costs you nothing beyond the computer you already own.
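For batch work, a quick calculation shows whether a job fits in the time you have. The sketch below is a rough estimate with hypothetical function names. It counts only generated tokens, so the time spent reading each prompt comes on top.

```python
def batch_hours(items, tokens_per_item, tokens_per_second):
    """Estimated hours to generate output for every item in a queue."""
    total_seconds = items * tokens_per_item / tokens_per_second
    return total_seconds / 3600


def finishes_within(items, tokens_per_item, tokens_per_second, window_hours):
    """True if the estimated generation time fits inside the time window."""
    return batch_hours(items, tokens_per_item, tokens_per_second) <= window_hours
```

As an illustration with made-up numbers, 1,000 items at 100 output tokens each and 5 tokens per second comes to about 5.6 hours, which fits in an overnight window.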

Squeezing more from the hardware you have

Before you conclude a CPU is too slow, there are a few levers worth pulling that often make the difference between unusable and acceptable. They cost nothing but a little configuration.

  • Let the runtime use all your physical CPU cores rather than a default subset. More threads generally means more tokens per second up to that point; beyond it, memory bandwidth tends to be the limit and extra threads add little.
  • Keep the context window only as large as the task needs — a shorter prompt and history is faster to process and uses less memory.
  • Close memory-hungry applications so the model has room to stay resident in RAM instead of swapping to disk, which is what makes CPU inference feel genuinely slow.
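The first two settings above can be passed to Ollama per request through its options field. The sketch below builds such a request body. The helper names and the 2048-token default are assumptions for illustration. Note that os.cpu_count() reports logical cores, which can exceed the number of physical cores, so it is worth trying a lower thread count as well.

```python
import os


def cpu_friendly_options(context_tokens=2048, threads=None):
    """Build an Ollama options dict with an explicit thread count and context size."""
    if threads is None:
        threads = os.cpu_count() or 1   # logical cores; falls back to 1 if unknown
    return {"num_thread": threads, "num_ctx": context_tokens}


def build_request(model, prompt, options):
    """Assemble a non-streaming /api/generate request body as a plain dict."""
    return {"model": model, "prompt": prompt, "stream": False, "options": options}
```

Serialise the resulting dict as JSON and POST it to the local Ollama server, then compare generation speed across a few thread counts and context sizes.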

Where CPU-only genuinely shines

It is worth naming the cases where a CPU is not a compromise at all but the obviously right choice. A private assistant for one person, an offline tool that must work without any network, a script that classifies or summarises a queue of items overnight, or a sensitive workload that cannot touch external hardware — in all of these, throughput barely matters and privacy or cost matters enormously. For this whole category of work, running a quantized open model on an ordinary CPU is not a fallback you settle for; it is simply the sensible engineering answer.

Prefer it built and managed for you?

Knowing exactly how far a CPU will carry your workload — and when a modest GPU pays for itself — is the kind of sizing decision that is easy to get wrong. BSH Technologies right-sizes local AI to your hardware and your actual usage, so you neither overspend on GPUs you do not need nor cripple a tool on hardware that cannot keep up. To get the balance right, talk to BSH Technologies or see our AI & automation services.

Frequently asked questions

Can you run an LLM without a GPU?

Yes. Small to mid-sized models, up to around 7B or 8B parameters, run on a CPU when you use quantization and a CPU-optimised runtime like Ollama or llama.cpp. A laptop with 16 GB of RAM is enough. Generation is slower than on a GPU but is fine for non-interactive tasks and occasional queries.

How much RAM do I need to run an LLM on CPU?

For a 4-bit quantized 7B or 8B model, around 6 to 8 GB of free RAM is enough, so a 16 GB machine works comfortably. Larger models need proportionally more memory. Memory is the binding constraint: if the model fits in RAM it runs, and if it does not it either fails or swaps to disk and becomes too slow to use.

How slow is running an LLM on a CPU?

On a typical modern laptop CPU, a 4-bit 7B model generates a handful of tokens per second. That is readable but noticeably slower than a GPU or hosted API. Apple Silicon machines perform better due to unified memory. The speed suits drafting, classification, and batch jobs more than fast interactive chat.

What is the best tool to run an LLM on CPU?

llama.cpp is the foundational CPU inference engine, and Ollama, which builds on it, is the easiest way to use it. Both exploit your processor's vector instructions for speed and run quantized models efficiently. Ollama adds simple model management and a local API, making it the practical choice for most CPU-only setups.

