Back

How to Run Llama 3 Locally With Ollama

A step-by-step guide to running Meta’s Llama 3 on your own machine with Ollama — install, pull, chat, and the hardware that actually matters.

How to Run Llama 3 Locally With Ollama
Written by
BSH Technologies
Published on2026-05-18

You can run Llama 3 locally in three commands with Ollama

Install Ollama, run ollama pull llama3, then ollama run llama3, and you are chatting with Meta's Llama 3 entirely on your own hardware with no API key and no data leaving the machine. Ollama is a free, open-source runner that wraps the fiddly parts of local inference — model downloads, quantization, and a local server — behind a command line that feels like Docker for language models. For a developer who wants a private model on a laptop or workstation, it is the fastest honest path there is.

Step one: install Ollama

Ollama runs on macOS, Linux, and Windows. On macOS and Windows you download an installer; on Linux a single curl-piped install script sets up the background service. Once installed, Ollama runs a local daemon that listens on port 11434 and serves an HTTP API, which is the same endpoint your code will talk to later. You do not need a GPU to start, though one helps enormously.

Step two: pull a Llama 3 model

Models live in a registry, and you fetch one with a pull. The plain llama3 tag grabs the 8-billion-parameter instruct model, which is the right default for most machines. Tags let you be specific about size and quantization.

  • ollama pull llama3 — the 8B instruct model, a sensible starting point that runs on a modern laptop.
  • ollama pull llama3:70b — the far larger 70B model, which needs a serious GPU or a lot of patience on CPU.
  • Append a quantization tag such as llama3:8b-instruct-q4_K_M when you want to trade a sliver of quality for a smaller memory footprint.

Step three: chat, or call the API

Running ollama run llama3 drops you into an interactive prompt. That is fine for kicking the tyres, but the real value is the API. A POST to http://localhost:11434/api/chat with a JSON body of messages returns a completion your application can consume, which means Ollama doubles as a local, OpenAI-shaped backend for whatever you are building.

The model size you can run is governed by memory, not magic. An 8B model in 4-bit quantization needs roughly 5 to 6 GB of RAM or VRAM; a 70B model needs ten times that. Match the model to the machine and everything else follows.

What hardware do you actually need?

The honest answer is that an 8B model runs comfortably on any recent laptop with 16 GB of RAM, and runs quickly if that laptop has an Apple Silicon chip or a discrete GPU. CPU-only inference works but generates a few tokens per second rather than dozens, which is fine for occasional queries and frustrating for a chat interface. If you want responsive local inference at larger sizes, a GPU with 12 GB or more of VRAM changes the experience entirely. Start with the 8B model, confirm it does what you need, and only reach for bigger models or better hardware once you have hit a wall you can measure.

Customising behaviour with a Modelfile

Ollama lets you tailor a model without retraining it through a small text file called a Modelfile. In it you set a system prompt that defines the assistant's role and tone, fix generation parameters like temperature so responses stay consistent, and then build a named variant you can run like any other model. This is the right tool when you want the same base model to behave as, say, a terse code reviewer in one context and a patient explainer in another. You are not changing the model's weights, only the instructions and settings wrapped around it, which makes experimenting cheap and reversible. For most teams this covers the customisation they actually need, long before fine-tuning enters the conversation.

Common first-run snags

A few predictable issues trip people up on day one, and all of them are quick to resolve once you know the cause.

  • A model that runs slowly is almost always falling back to CPU because it did not fit in VRAM — switch to a smaller model or a more aggressive quantization and the speed returns.
  • If a pull stalls, it is usually network or disk space rather than Ollama itself, so check both before retrying the download.
  • Port 11434 already in use means an Ollama instance is running already; you do not need a second one, as the existing service handles every request.

Prefer it built and managed for you?

Running a model on your laptop is a weekend; running one your business depends on is a system — with monitoring, access control, and a path to scale. BSH Technologies designs and operates local and self-hosted LLM stacks so you get private inference without becoming an infrastructure team. If a private model belongs in your workflow, talk to BSH Technologies or explore our AI & automation services.

Frequently asked questions

Is Ollama free to use?

Yes. Ollama is free and open source under the MIT licence, and the open models it runs, such as Llama 3, Mistral, and Qwen, carry their own permissive or community licences. There is no subscription and no API cost. Your only real expense is the hardware you run it on and the electricity it draws.

Do I need a GPU to run Llama 3 with Ollama?

No, a GPU is not required. An 8-billion-parameter model runs on a CPU with 16 GB of RAM, producing a few tokens per second. A GPU or Apple Silicon chip makes generation far faster and is recommended for interactive chat or larger models, but you can start and test entirely on CPU.

Where does Ollama store downloaded models?

Ollama keeps pulled models in a local directory, typically under your user home folder in a hidden .ollama path on macOS and Linux, or the equivalent on Windows. Models are cached so a second pull is instant. You can delete a model with ollama rm to reclaim disk space when you no longer need it.

Can I use Ollama models in my own application?

Yes. Ollama exposes an HTTP API on localhost port 11434 with chat and generate endpoints that return JSON. Any language that can make an HTTP request can call it, and the request shape is close to the OpenAI format, so existing client code often needs only the base URL changed to point at your local server.

Related Topics

#Ollama#Llama#Local AI

From the blog

View all posts
How to Build an AI Agent for Free in 2026
Applied AI

How to Build an AI Agent for Free in 2026

You can build a working AI agent for free in 2026 using n8n, open-source frameworks, and a free LLM tier. Here is the exact stack and the steps.

BSH Technologies
BSH Technologies · 2026-06-17
Best Free AI Agent Frameworks in 2026
Applied AI

Best Free AI Agent Frameworks in 2026

The best free AI agent frameworks in 2026 are LangChain, CrewAI, Microsoft AutoGen, LangGraph, and n8n. Here is how to choose between them.

BSH Technologies
BSH Technologies · 2026-06-16