How to Build a RAG Chatbot for Free

Build a working retrieval-augmented chatbot on a zero-cost stack using pgvector, sentence-transformers, and a free-tier LLM API.

Written by BSH Technologies

Published on 2026-05-28

You can build a RAG chatbot for free with open tools

A working retrieval-augmented generation chatbot costs nothing to build if you assemble it from free, open components: a local embedding model, an open-source vector store, and a generous free-tier LLM API. The architecture is the same one that powers paid products — ingest documents, embed them, retrieve the closest passages for each question, and ask a language model to answer from those passages. What changes on a free stack is where each piece runs, not how it works, which means everything you learn building the free version transfers directly to a production system later.

This guide walks the whole pipeline end to end so you finish with something you can demo. Treat it as a learning build: it will comfortably handle a few thousand documents and a handful of users, which is exactly the scale at which most teams decide whether RAG is worth investing in properly. Resist the temptation to optimise prematurely — get the simplest version answering questions first, then improve the weakest link once you can see where it struggles.

Step 1: Pick a free embedding model

Embeddings turn text into vectors that capture meaning, and they are the foundation everything else rests on. You do not need a paid embedding API to start — the sentence-transformers library runs strong open models like all-MiniLM-L6-v2 on your own machine, CPU included, with no per-token cost. For most internal documents the quality is more than adequate, and you avoid both a bill and a dependency on a third party seeing your text.

Install sentence-transformers and load a compact model; MiniLM is fast and good enough for most documents, while larger models trade speed for a quality bump you may not need.
Note the model's output dimension — MiniLM produces 384-dimensional vectors — because your vector column has to match it exactly or inserts will fail.
Keep the original text alongside every vector so you can show sources, verify answers, and rebuild the index later without re-fetching the documents.
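The three points above can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation: the helper names `load_model` and `embed_records` and the record layout are assumptions made for this example, while `SentenceTransformer(...)` and `encode(...)` are the library's own calls. The import is deferred so the rest of the module loads even before the package is installed.

```python
from typing import Protocol, Sequence

EMBED_DIM = 384  # output dimension of all-MiniLM-L6-v2, as noted above


class Encoder(Protocol):
    """Anything with an encode() method, such as a SentenceTransformer."""

    def encode(self, sentences, **kwargs): ...


def load_model(name: str = "all-MiniLM-L6-v2"):
    """Load a sentence-transformers model (weights download on first use)."""
    # Imported lazily so this module works without the package installed.
    from sentence_transformers import SentenceTransformer

    return SentenceTransformer(name)


def embed_records(model: Encoder, texts: Sequence[str]) -> list[dict]:
    """Return one record per text, keeping the original text next to its vector."""
    vectors = model.encode(list(texts), normalize_embeddings=True)
    records = []
    for text, vector in zip(texts, vectors):
        values = [float(x) for x in vector]
        # Fail early if the model does not match the planned vector column.
        if len(values) != EMBED_DIM:
            raise ValueError(f"expected {EMBED_DIM} dimensions, got {len(values)}")
        records.append({"text": text, "embedding": values})
    return records
```

A typical call would be `records = embed_records(load_model(), ["First passage.", "Second passage."])`. Checking the dimension here surfaces a model mismatch before anything reaches the database.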

Step 2: Store vectors in pgvector

You almost certainly have access to PostgreSQL, and the pgvector extension turns it into a capable vector database at no extra cost. A free Postgres instance from Supabase or Neon, or a local Docker container, is plenty to begin with, and using a database you already understand means one fewer new system to learn while you focus on the RAG logic itself.

Enable the extension, then add a vector column sized to your embedding model's dimension.
Create an HNSW index once you have more than a few thousand rows so similarity queries stay fast under load.
Store metadata — source file, page number, a link — in ordinary columns so you can filter results and cite sources in answers.
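One way to express that setup is shown below, as a sketch under stated assumptions. The table name `chunks`, its column names, and the helpers `schema_statements`, `to_vector_literal`, and `create_schema` are all hypothetical choices for this example. The SQL itself uses pgvector's `vector(n)` type and its `hnsw` index with the `vector_cosine_ops` operator class, and the connection code assumes psycopg 3.

```python
def schema_statements(dim: int = 384, table: str = "chunks") -> list[str]:
    """Build the SQL that enables pgvector and creates a chunk table."""
    return [
        "CREATE EXTENSION IF NOT EXISTS vector",
        f"""CREATE TABLE IF NOT EXISTS {table} (
            id bigserial PRIMARY KEY,
            source_file text NOT NULL,
            page_number integer,
            url text,
            content text NOT NULL,
            embedding vector({dim}) NOT NULL
        )""",
        # HNSW index for cosine distance; worth adding once the table grows.
        f"CREATE INDEX IF NOT EXISTS {table}_embedding_hnsw "
        f"ON {table} USING hnsw (embedding vector_cosine_ops)",
    ]


def to_vector_literal(values) -> str:
    """Format floats in pgvector's text form, for example '[0.5,1.0]'."""
    return "[" + ",".join(repr(float(v)) for v in values) + "]"


def create_schema(dsn: str) -> None:
    """Run the statements against a Postgres instance (assumes psycopg 3)."""
    import psycopg  # imported lazily so the helpers above need no driver

    # The connection context manager commits when the block exits cleanly.
    with psycopg.connect(dsn) as conn:
        for statement in schema_statements():
            conn.execute(statement)
```

Keeping the dimension as a parameter makes it harder for the column and the embedding model to drift apart if you later swap models.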

Step 3: Chunk and retrieve well

Retrieval quality decides answer quality, so this step deserves more care than the generation step that follows it. Chunk your documents on natural boundaries such as paragraphs and headings, at roughly 300 to 800 tokens with a small overlap, and prefix each chunk with its document title so an isolated passage still carries enough context to be useful. Check that range against your embedding model's input limit: all-MiniLM-L6-v2 truncates input beyond 256 word pieces by default, so with that model either keep chunks under the limit or pick a model with a longer input window. At query time, embed the user's question with the same model, fetch the nearest chunks from pgvector by cosine distance, and concatenate the top results into a context block.
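Chunking and retrieval might look roughly like the sketch below. Several things here are assumptions for illustration: the function names, the use of a word count as a crude stand-in for tokens, the conservative default of 150 words (chosen with small models in mind, so adjust it to your model), and a hypothetical `chunks` table with `content`, `source_file`, and `embedding` columns. The `<=>` operator is pgvector's cosine distance, and `conn` is assumed to be a psycopg 3 connection.

```python
def chunk_document(title, text, max_words=150, overlap_words=20):
    """Pack paragraphs into chunks of at most max_words, each prefixed with the title.

    A single paragraph longer than max_words is kept whole in this sketch.
    """
    paragraphs = [p.split() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], []
    for words in paragraphs:
        # Close the current chunk when this paragraph would exceed the budget.
        if current and len(current) + len(words) > max_words:
            chunks.append(current)
            # Carry the tail of the previous chunk forward as overlap.
            current = current[-overlap_words:] if overlap_words else []
        current = current + words
    if current:
        chunks.append(current)
    return [f"{title}\n\n{' '.join(c)}" for c in chunks]


RETRIEVE_SQL = """
SELECT content, source_file, embedding <=> %s::vector AS distance
FROM chunks
ORDER BY embedding <=> %s::vector
LIMIT %s
"""


def retrieve(conn, model, question, k=5):
    """Embed the question with the same model and fetch the k nearest chunks."""
    vector = model.encode([question], normalize_embeddings=True)[0]
    literal = "[" + ",".join(str(float(x)) for x in vector) + "]"
    return conn.execute(RETRIEVE_SQL, (literal, literal, k)).fetchall()
```

Splitting on blank lines keeps paragraphs intact, which is the simplest way to respect natural boundaries. A heading-aware splitter is a reasonable next step once this version is answering questions.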

The model is the last mile. If the right passage is not in the retrieved context, no amount of prompt wording will rescue the answer.

A reranking pass over the top candidates is the cheapest meaningful upgrade you can add later: it reorders the retrieved chunks by relevance to the question before they reach the model. It usually takes only a few lines of code and is frequently the single biggest jump in answer quality for a struggling free build.
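A reranking pass could be sketched as follows. The article does not name a reranker, so using the `CrossEncoder` class from sentence-transformers with the `cross-encoder/ms-marco-MiniLM-L-6-v2` model is an assumption made for this example, as are the helper names `load_scorer` and `rerank`. The scorer is passed in as a plain function so any other scoring method can be swapped in.

```python
def load_scorer(name="cross-encoder/ms-marco-MiniLM-L-6-v2"):
    """Return a function that scores (question, passage) pairs."""
    # Imported lazily so rerank() can be used with any scoring function.
    from sentence_transformers import CrossEncoder

    return CrossEncoder(name).predict


def rerank(question, chunks, score_pairs, top_k=3):
    """Reorder retrieved chunks by score, highest first, and keep the top_k."""
    if not chunks:
        return []
    scores = score_pairs([(question, chunk) for chunk in chunks])
    ranked = sorted(
        zip(scores, chunks), key=lambda pair: float(pair[0]), reverse=True
    )
    return [chunk for _, chunk in ranked[:top_k]]
```

In use, you would retrieve more candidates than you need (say 20), then call `rerank(question, candidates, load_scorer())` and send only the survivors to the model.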

Step 4: Generate answers on a free-tier LLM

Several providers offer free tiers generous enough for a prototype, and open-weight models run locally through Ollama if you would rather keep everything on your own hardware with no external calls at all. The orchestration glue — loading documents, chunking, retrieving, prompting — is handled cleanly by LangChain or LlamaIndex, both free and open source, so you write surprisingly little code to tie the pieces together.

Instruct the model to answer only from the supplied context and to say so plainly when the context is insufficient.
Return the source chunks with every answer so a reader can verify each claim instead of trusting the bot on faith.
Cap the context length you send so you stay comfortably inside free-tier token limits and keep responses fast.
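The three rules above can be combined in one prompt-building step, sketched here under a few assumptions. The names `build_prompt` and `ask_ollama`, the chunk dictionary layout, the character budget used as a rough proxy for tokens, and the `llama3.2` model name are all illustrative. The HTTP call targets Ollama's local `/api/generate` endpoint on its default port; a hosted free-tier API would need its own client instead.

```python
import json
import urllib.request

RULES = (
    "Answer using only the numbered context passages below. "
    "If the context does not contain the answer, say that you cannot "
    "answer from the provided documents."
)


def build_prompt(question, chunks, max_context_chars=6000):
    """Build a grounded prompt and return it with the chunks actually used.

    chunks: list of dicts with 'content' and 'source' keys, best match first.
    """
    used, total = [], 0
    for chunk in chunks:
        # Stop adding passages once the context budget would be exceeded.
        if total + len(chunk["content"]) > max_context_chars:
            break
        used.append(chunk)
        total += len(chunk["content"])
    context = "\n\n".join(
        f"[{i}] ({c['source']}) {c['content']}" for i, c in enumerate(used, start=1)
    )
    prompt = f"{RULES}\n\nContext:\n{context}\n\nQuestion: {question}\nAnswer:"
    return prompt, used


def ask_ollama(prompt, model="llama3.2", url="http://localhost:11434/api/generate"):
    """Send the prompt to a local Ollama server and return its text response."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    request = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]
```

Returning `used` alongside the prompt lets the application show exactly the passages the model saw, so every answer can be checked against its sources.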

Step 5: Know where the free stack runs out

A free build is perfect for learning and proving value, but be honest about its ceiling. Embedding on a local CPU takes longer as the corpus grows, so indexing and re-indexing become slow past a few thousand documents; free Postgres tiers cap storage and connections; and free LLM tiers throttle requests once you have more than a trickle of users. None of these limits matter for a prototype, and all of them matter the moment real people depend on the system. When that day comes you will want managed infrastructure, monitoring, evaluation, and a freshness pipeline, none of which the free stack provides on its own.

Prefer it built and managed for you?

BSH Technologies builds production RAG and chatbots that stay grounded, cite their sources, and scale past the free-tier ceiling without the rough edges. If a prototype has proven the idea and you need something your team can rely on, talk to BSH Technologies or explore our AI & automation services.

Frequently asked questions

Is it really possible to build a RAG chatbot for free?

Yes, for a prototype. Local embedding models from sentence-transformers, the pgvector extension on a free Postgres tier, and a free-tier LLM API or a local model via Ollama cover every piece at no cost. Free tiers throttle requests and cap storage, so production traffic eventually needs paid, managed infrastructure.

What is the cheapest vector database for RAG?

pgvector is effectively free if you already run PostgreSQL or use a free tier from Supabase or Neon. It adds vector columns and approximate-nearest-neighbour indexes to a database you already operate, avoiding a separate service. For learning and small corpora it is the most cost-effective option available.

Do I need a GPU to run a RAG chatbot?

No. Compact embedding models like all-MiniLM-L6-v2 run on CPU through sentence-transformers, and hosted LLM APIs do the generation remotely. A GPU only matters if you self-host a large language model locally; for embedding and retrieval, a normal laptop is sufficient at prototype scale.

What is the difference between LangChain and LlamaIndex?

Both are free orchestration frameworks for RAG. LangChain is broader, covering agents, chains, and many integrations, while LlamaIndex focuses tightly on indexing and retrieval over your data. For a straightforward document chatbot either works; LlamaIndex is often simpler for pure retrieval, LangChain for complex multi-step flows.

How many documents can a free RAG stack handle?

Comfortably a few thousand documents and a handful of concurrent users. Beyond that, local embedding slows, free Postgres tiers hit storage and connection limits, and free LLM tiers throttle. That ceiling is fine for proving the concept before committing to managed infrastructure for real traffic.
