LLM Quantization Explained (Run Big Models Cheap)
Quantization shrinks language models so they run on modest hardware. Here is what it does, what it costs, and which level to pick.

Quantization shrinks an LLM by storing its weights in lower precision, so big models run on small hardware
Quantization is the technique that lets a model which would normally demand an expensive GPU run on a laptop instead. A language model is millions or billions of numbers — its weights — and by default each is stored in 16-bit precision. Quantization rounds those numbers to 8-bit or 4-bit, which can cut the memory needed by half or three-quarters and speed up inference, with only a modest loss in quality for most tasks. It is the single most important reason running capable open models at home is practical at all.
Why smaller numbers help so much
The benefit is direct: fewer bits per weight means less memory to hold the model and less data to move around while it runs, and memory bandwidth is often the real bottleneck in inference. A 7B model in full 16-bit precision needs roughly 14 GB just for its weights; the same model quantized to 4-bit needs around 4 GB. That is the difference between needing a dedicated GPU and fitting comfortably on an ordinary laptop, which is why nearly every model you download with Ollama or LM Studio is quantized by default.
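The arithmetic behind those figures can be sketched in a few lines. This is a rough estimate that counts the weights only; the function name `weight_memory_gb` is a hypothetical helper, and it assumes every weight is stored at exactly the stated bit depth.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in gigabytes (10**9 bytes)."""
    bytes_total = num_params * bits_per_weight / 8  # 8 bits per byte
    return bytes_total / 1e9


# A 7B-parameter model at the precisions discussed above.
for bits in (16, 8, 4):
    print(f"7B at {bits}-bit: ~{weight_memory_gb(7e9, bits):.1f} GB")
```

This prints roughly 14.0 GB, 7.0 GB, and 3.5 GB. Real 4-bit files tend to land nearer 4 GB because they also store scaling metadata and often keep some parts of the model at higher precision.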
The precision levels and what they cost
Quantization is a dial, not a switch, and the common settings trade memory against fidelity.
- 8-bit — halves memory versus full precision with quality loss so small it is usually imperceptible. A safe choice when you have the room.
- 4-bit — the popular sweet spot, cutting memory to roughly a quarter with a quality cost most tasks tolerate well. This is what most local setups use.
- Lower than 4-bit — smaller still, but quality degradation becomes noticeable, so these are for when memory is truly scarce and you have tested that the output remains acceptable.
The quality cost of quantization is real but easy to overstate. For summarisation, extraction, chat, and most everyday tasks, a 4-bit model is hard to distinguish from its full-precision twin. The honest way to know is to test both on your own work.
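To see why fidelity drops as the bit depth falls, here is a minimal sketch of the rounding itself. It assumes a deliberately simple scheme (one scale for the whole list, symmetric around zero) and randomly generated stand-in weights; real tools typically use more refined block-wise methods, so treat the numbers as illustrative only.

```python
import random


def quantize_roundtrip(weights, bits):
    """Symmetric absmax quantization: map to signed integers, then back to floats."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8-bit, 7 for 4-bit, 1 for 2-bit
    scale = max(abs(w) for w in weights) / qmax
    ints = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return [q * scale for q in ints]


def mean_abs_error(weights, bits):
    """Average absolute difference between original and restored weights."""
    restored = quantize_roundtrip(weights, bits)
    return sum(abs(a - b) for a, b in zip(weights, restored)) / len(weights)


random.seed(0)
# Stand-in weights: small values centred on zero (an assumption for illustration).
weights = [random.gauss(0.0, 0.02) for _ in range(4096)]

for bits in (8, 4, 3, 2):
    print(f"{bits}-bit mean abs error: {mean_abs_error(weights, bits):.6f}")
```

The error grows at each step down, and the jump below 4-bit is where the rounding starts to erase most of the detail in the weights. Whether that matters for a given task still has to be checked on real output.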
Reading those cryptic quantization tags
Download a model and you will meet labels like Q4_K_M or Q5_K_S, which look intimidating but are simpler than they appear. The number is the nominal bit depth: Q4 is 4-bit, Q5 is 5-bit. The letters describe the method and how aggressively different parts of the model are compressed: K marks the k-quant family of methods, and the trailing M or S (medium or small) sets how much of the model is kept at higher precision, so K_M trades a little extra size for quality and K_S does the reverse. You do not need to memorise the scheme; a 4-bit K_M variant is a dependable default, and you only need to explore further if you are chasing a specific memory or quality target.
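As an illustration, a tag of this shape can be split into its parts with a short helper. The function `describe_tag` is hypothetical, it handles only the `Q<number>_K_<letter>` pattern, and it assumes the trailing S, M, or L stands for small, medium, or large as in llama.cpp-style naming; other tag formats exist and are rejected here.

```python
import re

# Matches tags such as Q4_K_M or Q5_K_S; other naming styles are not covered.
TAG_PATTERN = re.compile(r"^Q(?P<bits>\d+)_K_(?P<size>[SML])$")
SIZE_NAMES = {"S": "small", "M": "medium", "L": "large"}


def describe_tag(tag: str) -> dict:
    """Break a quantization tag into nominal bit depth, method, and variant."""
    match = TAG_PATTERN.match(tag.upper())
    if match is None:
        raise ValueError(f"unrecognised tag: {tag!r}")
    return {
        "nominal_bits": int(match["bits"]),
        "method": "K",
        "variant": SIZE_NAMES[match["size"]],
    }


print(describe_tag("Q4_K_M"))
print(describe_tag("Q5_K_S"))
```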
When full precision is worth it
Quantization is not always the answer. For the most demanding reasoning tasks, or when you are fine-tuning a model and want maximum fidelity, running at 8-bit or full precision can be worth the extra hardware. But for the vast majority of inference workloads — the ones that actually fill a workday — quantization is a near-free win that turns expensive models into affordable ones. The pragmatic stance is to start quantized, confirm the quality meets your bar, and only step up precision if a measured shortfall demands it.
A bigger model at lower precision often wins
Here is a counterintuitive rule worth internalising: when memory is your constraint, a larger model quantized more aggressively frequently beats a smaller model at higher precision. A 13B model squeezed to 4-bit can outperform a 7B model at 8-bit while using a similar amount of memory, because the larger model simply knows more, and quantization costs it less than the gap in raw capability. So when you are deciding how to spend a fixed memory budget, do not automatically reach for the smaller model at full fidelity. Test the larger-but-quantized option too, since it often delivers more useful capability for the same hardware.
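A quick way to apply this rule is to list which model-and-precision combinations fit a given memory budget. The sketch below is a weights-only estimate with hypothetical helper names; it ignores context cache and runtime overhead, so leave some headroom when reading the results.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Approximate memory for the weights alone, in gigabytes."""
    return num_params * bits_per_weight / 8 / 1e9


def options_within_budget(budget_gb, param_counts, bit_depths):
    """Return (params, bits, GB) combinations that fit, largest model first."""
    fits = [
        (params, bits, weight_memory_gb(params, bits))
        for params in param_counts
        for bits in bit_depths
        if weight_memory_gb(params, bits) <= budget_gb
    ]
    # Larger models first; within one model size, higher precision first.
    return sorted(fits, key=lambda option: (-option[0], -option[1]))


for params, bits, gb in options_within_budget(8.0, [7e9, 13e9], [16, 8, 4]):
    print(f"{params / 1e9:.0f}B at {bits}-bit: ~{gb:.1f} GB")
```

With an 8 GB budget, the candidates are 13B at 4-bit (about 6.5 GB), 7B at 8-bit (about 7.0 GB), and 7B at 4-bit (about 3.5 GB). The first two cost similar memory, which is exactly the pair worth testing side by side.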
Letting your tools handle the details
The reassuring part is that you rarely need to perform quantization yourself. Tools like Ollama and LM Studio download models already quantized and pick sensible defaults, and the community publishes ready-made quantized versions of every popular open model. Your job is mostly to choose a precision level that fits your memory and confirm the quality holds on your own tasks — the heavy lifting of converting the weights has already been done for you. Only when you have an unusual model or a very specific target do you need to quantize from scratch, and even then mature tooling makes it a routine step rather than a research project.
Prefer it built and managed for you?
Choosing the right quantization for each model and task — balancing memory, speed, and quality — is exactly the sort of tuning that separates a sluggish setup from a snappy one. BSH Technologies picks the precision that fits your hardware and your quality requirements, so you get the most capability your machines can deliver. To get that balance right the first time, talk to BSH Technologies or browse our AI & automation services.
Frequently asked questions
What is LLM quantization?
Quantization stores a model's weights in lower precision, such as 4-bit or 8-bit instead of the default 16-bit. This cuts the memory the model needs and speeds up inference, with only a modest quality loss for most tasks. It is the main reason large open models can run on laptops and modest GPUs rather than expensive hardware.
Does quantization make an LLM worse?
Slightly, but usually far less than people expect. For chat, summarisation, and extraction, a 4-bit model is often hard to distinguish from full precision. Quality loss becomes noticeable at bit depths below 4-bit. The reliable way to judge the trade-off is to test a quantized and a full-precision model on your own tasks and compare.
What does Q4_K_M mean in a model name?
The number is the bit depth, so Q4 means 4-bit precision. The letters describe the quantization method and how aggressively parts of the model are compressed, with K_M and K_S variants balancing size and quality differently. You do not need to learn the full scheme; a 4-bit K_M version is a dependable default for most local setups.
How much memory does a quantized model save?
A 4-bit model uses roughly a quarter of the memory of the full 16-bit version, and an 8-bit model about half. A 7B model that needs around 14 GB at full precision drops to roughly 4 GB at 4-bit. That saving is what turns a model requiring a dedicated GPU into one that fits on an ordinary laptop.