Building a RAG Chatbot Over Internal Docs
A field guide to wiring retrieval over your wiki, PDFs, and tickets — chunking, hybrid search, and the failure modes nobody warns you about.
Most RAG chatbot failures are retrieval failures
When a retrieval-augmented generation chatbot over internal docs gives a confidently wrong answer, the instinct is to blame the model. In our experience the far more common cause is that the model never saw the right passage. The retriever pulled three plausible-but-irrelevant chunks, and the language model dutifully summarised them. Fix retrieval first and most of your quality problems disappear before you ever touch the prompt.
The architecture is deceptively simple: ingest documents, split them into chunks, embed each chunk into a vector, store the vectors, and at query time embed the question, fetch the nearest chunks, and place them into the prompt. Every one of those steps has a sharp edge, and the gap between a weekend demo and a system your staff trust is made up entirely of how you handle those edges.
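The steps in that paragraph can be sketched end to end in a few dozen lines. This is a toy under loud assumptions: `toy_embed` is a hashed bag-of-words stand-in for a real embedding model, `TinyIndex` is an in-memory list rather than a vector store, and every name here is hypothetical. It shows the shape of the pipeline, not a production design.

```python
import hashlib
import math
import re


def toy_embed(text: str, dim: int = 512) -> list[float]:
    """ASSUMPTION: stand-in for a real embedding model.
    Hashes each token into one of `dim` buckets and L2-normalises the counts."""
    vec = [0.0] * dim
    for token in re.findall(r"[a-z0-9_]+", text.lower()):
        bucket = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def cosine(a: list[float], b: list[float]) -> float:
    # Both vectors are unit length, so the dot product is the cosine similarity.
    return sum(x * y for x, y in zip(a, b))


class TinyIndex:
    """In-memory stand-in for a vector store."""

    def __init__(self) -> None:
        self.rows: list[tuple[str, list[float]]] = []

    def add(self, chunk: str) -> None:
        self.rows.append((chunk, toy_embed(chunk)))

    def nearest(self, question: str, top_k: int = 3) -> list[tuple[float, str]]:
        q = toy_embed(question)
        scored = [(cosine(q, vec), chunk) for chunk, vec in self.rows]
        return sorted(scored, key=lambda pair: -pair[0])[:top_k]


def build_prompt(question: str, chunks: list[str]) -> str:
    """Place the retrieved chunks into the prompt ahead of the question."""
    context = "\n\n".join(f"[{i}] {c}" for i, c in enumerate(chunks, start=1))
    return f"Context:\n{context}\n\nQuestion: {question}"


index = TinyIndex()
for chunk in [
    "Reset your VPN password from the self-service portal.",
    "Expense reports are due on the fifth working day.",
    "The staging cluster is redeployed every night.",
]:
    index.add(chunk)

hits = index.nearest("reset vpn password", top_k=1)
prompt = build_prompt("How do I reset my VPN password?", [text for _, text in hits])
```

Each function above maps to one of the sharp edges: what goes into `add` is the chunking decision, `nearest` is the retrieval decision, and `build_prompt` is where grounding instructions would live.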
Chunking is the decision that matters most
Chunk too large and a single vector has to represent a page covering four topics, so it matches everything weakly and nothing strongly. Chunk too small and you shred the context a sentence needs to make sense. We have had the best results with structure-aware splitting that respects headings and paragraphs rather than blindly cutting every 500 characters, because the document's own structure is a strong signal of where ideas begin and end.
- Split on semantic boundaries — headings, list items, table rows — before falling back to a character count.
- Target roughly 300 to 800 tokens per chunk and keep a 10 to 15 percent overlap so a fact that straddles a boundary survives.
- Prepend the document title and section heading to each chunk before embedding. A chunk that reads only "the limit is 200 requests per minute" is useless without knowing which API it describes.
- Store the source path and a deep link in metadata so every answer can cite where it came from and a reader can verify it in one click.
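A minimal sketch of the splitting rules above, assuming the source is markdown-style text with `#` headings and blank-line paragraph breaks. Word counts stand in for token counts, and the `Chunk` fields, function names, and default sizes are illustrative choices rather than a recommended configuration.

```python
import re
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str         # what gets embedded: title and heading prepended to the body
    source_path: str  # metadata used for citations and deep links
    heading: str


def split_sections(doc: str) -> list[tuple[str, str]]:
    """Split on heading lines first, returning (heading, body) pairs."""
    sections, heading, lines = [], "", []
    for line in doc.splitlines():
        match = re.match(r"^#{1,6}\s+(.*)", line)
        if match:
            if any(l.strip() for l in lines):
                sections.append((heading, "\n".join(lines).strip()))
            heading, lines = match.group(1).strip(), []
        else:
            lines.append(line)
    if any(l.strip() for l in lines):
        sections.append((heading, "\n".join(lines).strip()))
    return sections


def pack_paragraphs(body: str, max_words: int = 400, overlap_words: int = 50) -> list[str]:
    """Pack whole paragraphs into windows of roughly max_words, carrying a tail
    overlap into the next window. A single oversized paragraph is kept whole."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", body) if p.strip()]
    windows, current = [], []
    for para in paragraphs:
        words = para.split()
        if current and len(current) + len(words) > max_words:
            windows.append(" ".join(current))
            current = current[-overlap_words:] if overlap_words else []
        current.extend(words)
    if current:
        windows.append(" ".join(current))
    return windows


def chunk_document(doc: str, title: str, source_path: str) -> list[Chunk]:
    chunks = []
    for heading, body in split_sections(doc):
        prefix = f"{title} > {heading}" if heading else title
        for window in pack_paragraphs(body):
            chunks.append(Chunk(f"{prefix}\n{window}", source_path, heading))
    return chunks


doc = "# Rate limits\n\nThe limit is 200 requests per minute.\n\n# Auth\n\nUse a bearer token."
chunks = chunk_document(doc, title="Billing API", source_path="wiki/billing-api.md")
```

With the prefix in place, the rate-limit chunk now reads "Billing API > Rate limits" followed by the sentence, so the embedding carries the context the bare sentence lacked.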
Hybrid retrieval beats pure vector search
Dense vector search is excellent at semantic similarity and surprisingly bad at exact matches. Ask for error code "E_4012" or a specific person's name and embeddings often miss it, because the token is rare and its vector is noisy. Keyword search, the unglamorous BM25 that has powered search engines for decades, nails those cases without breaking a sweat.
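To make the keyword side concrete, here is a small self-contained BM25 scorer. It is a sketch, assuming a simple lowercase tokenizer that keeps underscores (so "E_4012" survives as one token), the common defaults `k1 = 1.5` and `b = 0.75`, and the non-negative IDF variant. In practice you would more likely lean on the BM25 built into your search engine than maintain your own.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Keep letters, digits and underscores together so codes like E_4012 stay whole.
    return re.findall(r"[a-z0-9_]+", text.lower())


class BM25:
    def __init__(self, docs: list[str], k1: float = 1.5, b: float = 0.75) -> None:
        self.k1, self.b = k1, b
        self.doc_tokens = [tokenize(d) for d in docs]
        self.doc_freqs = [Counter(tokens) for tokens in self.doc_tokens]
        total_len = sum(len(tokens) for tokens in self.doc_tokens)
        self.avg_len = total_len / len(docs) if total_len else 1.0
        doc_count = Counter()
        for tokens in self.doc_tokens:
            doc_count.update(set(tokens))
        n = len(docs)
        # Rare terms get a high IDF, which is why exact codes and names rank well.
        self.idf = {
            term: math.log((n - c + 0.5) / (c + 0.5) + 1.0)
            for term, c in doc_count.items()
        }

    def score(self, query: str, index: int) -> float:
        freqs = self.doc_freqs[index]
        length = len(self.doc_tokens[index])
        norm = self.k1 * (1 - self.b + self.b * length / self.avg_len)
        total = 0.0
        for term in tokenize(query):
            f = freqs.get(term, 0)
            if f:
                total += self.idf[term] * f * (self.k1 + 1) / (f + norm)
        return total

    def search(self, query: str, top_k: int = 5) -> list[tuple[int, float]]:
        """Return (document index, score) pairs for documents with any match."""
        scored = [(i, self.score(query, i)) for i in range(len(self.doc_tokens))]
        scored = [(i, s) for i, s in scored if s > 0]
        return sorted(scored, key=lambda pair: -pair[1])[:top_k]


docs = [
    "Restart the worker when error E_4012 appears in the logs.",
    "The billing API limit is 200 requests per minute.",
    "Error handling guidelines for the ingestion service.",
]
hits = BM25(docs).search("what does E_4012 mean")
```

Only the first document contains the token `e_4012`, so it is the only hit, regardless of how semantically close the other passages might look to an embedding model.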
Run both and fuse the results. Reciprocal rank fusion is a short function that combines the two ranked lists using only each item's rank, so there are no per-retriever weights to tune (just one smoothing constant, commonly left at 60), and in our experience it usually beats either retriever alone. On top of that, a cross-encoder reranker re-scores the top 20 or so candidates against the query and reorders them by relevance before the top 5 reach the model. That reranking pass is often the single highest-leverage upgrade we add to a struggling system, and with an off-the-shelf reranker it is usually a small addition to the pipeline.
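Reciprocal rank fusion as described here fits in a few lines: each document earns 1 / (k + rank) from every list it appears in, and the sums decide the final order. The sketch below assumes chunk IDs are strings and uses the common constant `k = 60`. The `rerank` helper only marks where a cross-encoder would plug in; its `score_pair` argument is a hypothetical hook, not any particular library's API.

```python
from collections import defaultdict
from typing import Callable


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of chunk IDs into one, best first."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by ID so the output is deterministic.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


def rerank(
    query: str,
    candidates: list[str],
    score_pair: Callable[[str, str], float],
    top_n: int = 5,
) -> list[str]:
    """Re-score candidate texts against the query and keep the best top_n.
    score_pair(query, text) is a placeholder for a cross-encoder's relevance score."""
    ordered = sorted(candidates, key=lambda text: score_pair(query, text), reverse=True)
    return ordered[:top_n]


dense_hits = ["a", "b", "c"]    # IDs from vector search, best first
keyword_hits = ["c", "a", "d"]  # IDs from BM25, best first
fused = reciprocal_rank_fusion([dense_hits, keyword_hits])
```

Chunks that appear in both lists ("a" and "c") rise above those found by only one retriever, which is the behaviour that makes fusion robust without per-retriever weights.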
Treat the model as the last mile. If the right text is not in the retrieved context, no amount of prompt engineering will conjure it into existence.
Grounding, citations, and the courage to refuse
A useful internal chatbot has to be honest about what it does not know. Instruct the model to answer only from the supplied context and to say so plainly when that context is insufficient. Pair that with citations — each claim linked to the chunk it came from — so a sceptical engineer can verify the answer instantly. This single behaviour does more for adoption than any benchmark, because it converts a black box into a research assistant people are willing to rely on.
- Return the source documents alongside the answer, not buried in a log nobody reads.
- Detect when the top retrieval score falls below a threshold and short-circuit to "I could not find this in the documentation" rather than guessing.
- Log every question, the retrieved chunks, and the answer. Your evaluation set is hiding in those logs, waiting to be labelled.
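The three behaviours in this list can be sketched together. Everything named here is an assumption for illustration: the `Retrieved` record, the `min_score` default of 0.35 (a real threshold has to be calibrated against your own retriever's score scale), and `generate`, which stands in for whatever function calls your model.

```python
import json
from dataclasses import asdict, dataclass
from typing import Callable, Optional

REFUSAL = "I could not find this in the documentation."


@dataclass
class Retrieved:
    text: str
    source: str   # path or deep link shown to the reader
    score: float  # retriever or reranker score, higher is better


def build_grounded_prompt(
    question: str, hits: list[Retrieved], min_score: float = 0.35
) -> Optional[str]:
    """Return a prompt with numbered, citable context, or None to signal a refusal."""
    if not hits or max(h.score for h in hits) < min_score:
        return None
    context = "\n\n".join(
        f"[{i}] ({h.source})\n{h.text}" for i, h in enumerate(hits, start=1)
    )
    return (
        "Answer using only the numbered context below. "
        "Cite the passage number after each claim, like [1]. "
        f'If the context is insufficient, reply exactly: "{REFUSAL}"\n\n'
        f"Context:\n{context}\n\nQuestion: {question}"
    )


def answer(
    question: str,
    hits: list[Retrieved],
    generate: Callable[[str], str],
    min_score: float = 0.35,
) -> dict:
    """Short-circuit to a refusal on weak retrieval, otherwise answer with sources."""
    prompt = build_grounded_prompt(question, hits, min_score)
    if prompt is None:
        return {"answer": REFUSAL, "sources": []}
    return {"answer": generate(prompt), "sources": [h.source for h in hits]}


def log_record(question: str, hits: list[Retrieved], result: dict) -> str:
    """One JSON line per interaction: the raw material for an evaluation set."""
    return json.dumps(
        {"question": question, "chunks": [asdict(h) for h in hits], "answer": result["answer"]}
    )
```

Note that the refusal path never calls the model at all, so a weak retrieval cannot be papered over by a fluent guess.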
Keeping the index fresh
Internal docs change constantly. A static index built once and forgotten will quietly rot, and users lose faith the first time it cites a policy that was retired months ago. Wire ingestion to a change feed (a webhook from your wiki, a nightly crawl, a queue from your ticketing system) and re-embed only what changed by hashing chunk content. In the same pass, delete the chunks whose source page was removed or retired, otherwise the stale policy stays retrievable. Re-embedding the entire corpus every night is wasteful and slow once you pass a few thousand documents, and it is unnecessary when a content hash tells you precisely what changed.
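A minimal sketch of the hash comparison, assuming each chunk has a stable ID and that the index stores the hash it was last embedded from. The function names and the two-dictionary shape are hypothetical. The sketch also returns the IDs that no longer exist in the source, since retired content has to leave the index as well.

```python
import hashlib


def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_sync(
    current_chunks: dict[str, str],  # chunk_id -> current text from the source
    indexed_hashes: dict[str, str],  # chunk_id -> hash stored at last embedding
) -> tuple[dict[str, str], list[str]]:
    """Return (to_embed, to_delete).
    to_embed maps new or changed chunk IDs to their fresh hash;
    to_delete lists indexed IDs that no longer exist in the source."""
    to_embed = {}
    for chunk_id, text in current_chunks.items():
        digest = content_hash(text)
        if indexed_hashes.get(chunk_id) != digest:
            to_embed[chunk_id] = digest
    to_delete = [cid for cid in indexed_hashes if cid not in current_chunks]
    return to_embed, to_delete


indexed = {
    "leave-policy#0": content_hash("Staff get 20 days of leave."),
    "vpn-guide#0": content_hash("Use the old VPN client."),
    "retired-policy#0": content_hash("This policy is no longer in force."),
}
current = {
    "leave-policy#0": "Staff get 20 days of leave.",  # unchanged, skipped
    "vpn-guide#0": "Use the new VPN client.",         # edited, re-embedded
    "onboarding#0": "Welcome to the team.",           # new, embedded
}
to_embed, to_delete = plan_sync(current, indexed)
```

Only the edited and new chunks reach the embedding model, and the retired policy is queued for deletion instead of lingering in search results.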
How BSH can help
At BSH Technologies we build RAG systems on GCP and AWS that stay grounded, cite their sources, and refuse to bluff. From chunking strategy to hybrid retrieval, reranking, and a freshness pipeline that keeps answers current, our Thrissur team can take you from a promising demo to something your staff actually rely on. If you have a knowledge base worth talking to, let us help you make it answer back.