How to Chat With Your PDFs Using AI

Turn a folder of PDFs into a question-answering assistant: extract text, chunk it, embed it, and retrieve grounded answers with citations.

Written by BSH Technologies

Published on 2026-05-27

Chatting with a PDF means retrieving its text, not reading the file

To chat with your PDFs using AI, you do not feed the whole file to a model. You extract the text, split it into passages, embed those passages into a vector store, and at question time retrieve the most relevant ones to ground the model's answer. The language model never receives the PDF itself, only the handful of passages most relevant to the question. That keeps the approach accurate and affordable, and lets it cite exactly where each answer came from.

This is one of the most requested "chat with your data" use cases, and it is genuinely achievable with open tools. The quality of the result lives almost entirely in two steps most tutorials rush past: clean text extraction and sensible chunking. Get those right and the rest of the pipeline is straightforward; get them wrong and the chatbot answers confidently from garbled text, which is worse than not answering at all.

Step 1: Extract text cleanly from the PDF

PDFs are notoriously messy because the format describes how a page looks, not how it reads. A text-based PDF extracts cleanly with libraries like PyMuPDF or pdfplumber, but a scanned PDF is just images and needs optical character recognition (OCR), for example with Tesseract, before any text can be extracted. Knowing which kind you have, and handling each correctly, is the difference between a reliable assistant and one that hallucinates.

Detect whether each page has a text layer; fall back to OCR only for the pages that do not, since OCR is slower and more error-prone than reading an embedded text layer.
Preserve page numbers as you extract so every answer can cite the exact page a reader can open and verify.
Handle tables deliberately — pdfplumber extracts them as structured rows, which embeds far better than a jumble of numbers that has lost its column structure.
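
The three points above can be sketched in Python roughly as follows. This is a minimal sketch, not a definitive implementation: it assumes PyMuPDF (imported as `fitz`), pytesseract with a Tesseract binary installed, and Pillow. The helper names `needs_ocr`, `table_to_text`, and `extract_pages`, and the 20-character cutoff, are illustrative assumptions rather than anything those libraries prescribe.

```python
from __future__ import annotations

import io


def needs_ocr(text: str, min_chars: int = 20) -> bool:
    """Treat a page as scanned when its text layer is (almost) empty.

    The 20-character cutoff is an assumed heuristic; tune it on your files.
    """
    return len(text.strip()) < min_chars


def table_to_text(rows: list[list[str | None]]) -> str:
    """Render table rows (the shape pdfplumber's extract_tables returns)
    as 'header: value' lines so each cell keeps its column label."""
    if not rows:
        return ""
    header = [(cell or "").strip() for cell in rows[0]]
    lines = []
    for row in rows[1:]:
        cells = [(cell or "").strip() for cell in row]
        lines.append("; ".join(f"{h}: {c}" for h, c in zip(header, cells)))
    return "\n".join(lines)


def extract_pages(pdf_path: str) -> list[dict]:
    """Return one record per page: 1-based page number, text, and method."""
    # Imported lazily so the pure helpers above work without these packages.
    import fitz  # PyMuPDF
    import pytesseract
    from PIL import Image

    pages = []
    with fitz.open(pdf_path) as doc:
        for index, page in enumerate(doc):
            text = page.get_text()
            method = "text-layer"
            if needs_ocr(text):
                # Render this page to an image and OCR only this page.
                pixmap = page.get_pixmap(dpi=300)
                image = Image.open(io.BytesIO(pixmap.tobytes("png")))
                text = pytesseract.image_to_string(image)
                method = "ocr"
            pages.append({"page": index + 1, "text": text, "method": method})
    return pages
```

Recording the extraction method next to the page number makes it easy to see later which answers rest on OCR output and deserve a closer look.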

Step 2: Chunk with structure in mind

Chunk too large and one vector blurs several topics, so it matches everything weakly and nothing strongly; chunk too small and you sever the context a sentence needs to make sense. Aim for passages of roughly 300 to 800 tokens with a small overlap so a fact that straddles a boundary survives, and split on natural boundaries — paragraphs and headings — before falling back to a raw character count.

Prepend the document title and section heading to each chunk before embedding. A chunk reading only "the deadline is March 31" is useless without knowing which document and which deadline it refers to.
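
A chunker along these lines might look like the sketch below. It assumes tokens can be approximated by whitespace-separated words, which is only a rough stand-in for a real tokenizer, and the function name `chunk_section` and its default sizes are hypothetical choices within the range suggested above.

```python
from __future__ import annotations


def chunk_section(
    title: str,
    heading: str,
    text: str,
    max_tokens: int = 500,
    overlap: int = 50,
) -> list[str]:
    """Split one section into overlapping chunks, each prefixed with context.

    Tokens are approximated by whitespace-separated words (an assumption);
    swap in your embedding model's tokenizer for exact counts.
    """
    if overlap >= max_tokens:
        raise ValueError("overlap must be smaller than max_tokens")
    prefix = f"{title} > {heading}\n"
    paragraphs = [p.split() for p in text.split("\n\n") if p.strip()]

    chunks: list[str] = []
    current: list[str] = []
    for words in paragraphs:
        # Close the current chunk at a paragraph boundary if it would overflow.
        if current and len(current) + len(words) > max_tokens:
            chunks.append(prefix + " ".join(current))
            current = current[-overlap:] if overlap else []
        current.extend(words)
        # Fallback: cut an oversized run by raw word count, keeping the overlap.
        while len(current) > max_tokens:
            chunks.append(prefix + " ".join(current[:max_tokens]))
            current = current[max_tokens - overlap:]
    if current:
        chunks.append(prefix + " ".join(current))
    return chunks
```

Because the title and heading are prepended before embedding, a chunk such as "the deadline is March 31" carries its document and section with it into the vector store.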

Step 3: Embed and store the passages

Embed each chunk with a model such as those in sentence-transformers or a hosted embedding API, then store the vectors. For a single-user PDF assistant, ChromaDB runs locally with almost no setup; for something multi-user, pgvector on Postgres is a solid, durable choice that reuses your existing database backups and lets you filter on metadata with ordinary SQL.

Keep source metadata — filename, page, and the original text — in the store so retrieval can both cite and verify.
Use the same embedding model for documents and questions; mixing models produces meaningless distances and silently bad results.
Persist the store to disk so your work survives a restart rather than vanishing into memory.
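
For the local, single-user case, storing chunks in ChromaDB could look roughly like this. It is a sketch under stated assumptions: the `chromadb` package is installed, chunks arrive as dictionaries with `page` and `text` keys, and the names `build_records`, `index_chunks`, `pdf_chunks`, and `./pdf_store` are all hypothetical. No embeddings are passed in, so the collection's own embedding function embeds the documents and, later, the question text, which keeps both sides on the same model.

```python
from __future__ import annotations


def build_records(filename: str, pages: list[dict]) -> dict:
    """Turn page-level chunks into the ids/documents/metadatas Chroma expects.

    Each item in `pages` is assumed to look like {"page": 3, "text": "..."}.
    """
    ids, documents, metadatas = [], [], []
    for n, item in enumerate(pages):
        ids.append(f"{filename}:p{item['page']}:c{n}")
        documents.append(item["text"])  # original text kept for verification
        metadatas.append({"filename": filename, "page": item["page"]})
    return {"ids": ids, "documents": documents, "metadatas": metadatas}


def index_chunks(records: dict, store_dir: str = "./pdf_store"):
    """Persist the records in a local Chroma collection and return it."""
    import chromadb  # imported lazily; requires `pip install chromadb`

    # PersistentClient writes to disk, so the index survives a restart.
    client = chromadb.PersistentClient(path=store_dir)
    collection = client.get_or_create_collection(name="pdf_chunks")
    # upsert makes re-indexing the same file idempotent.
    collection.upsert(**records)
    return collection
```

A later question can then be answered from `collection.query(query_texts=[question], n_results=3)`, which returns the stored documents and metadata alongside distances.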

Step 4: Retrieve, ground, and cite

When a question arrives, embed it, fetch the closest passages, and place them in the prompt with a clear instruction to answer only from that context. The decisive behaviour for trust is citation: show which page each part of the answer came from so a reader can open the PDF and confirm it, turning a black box into a research assistant people are willing to rely on.

Set a relevance threshold and reply "I could not find this in the document" when nothing clears it, rather than inventing an answer.
Return the matching page numbers and snippets alongside the response so every claim is traceable.
Frameworks like LangChain and LlamaIndex wire this loop together with ready-made PDF loaders, so you write little custom code.
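
If you prefer to see the loop without a framework, the retrieval and grounding logic can be sketched in plain Python. This assumes the question and chunks are already embedded as lists of floats; the 0.5 cosine threshold, the chunk dictionary shape, and the function names are illustrative assumptions to be tuned against your own documents.

```python
from __future__ import annotations

import math

NOT_FOUND = "I could not find this in the document."


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(question_vec, chunks, top_k=3, min_score=0.5):
    """Return up to top_k chunks whose similarity clears min_score, best first.

    Each chunk is assumed to be {"vector": [...], "text": str, "page": int}.
    """
    scored = [(cosine(question_vec, c["vector"]), c) for c in chunks]
    scored = [(s, c) for s, c in scored if s >= min_score]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:top_k]]


def build_prompt(question: str, hits: list[dict]) -> str | None:
    """Build a grounded prompt, or return None when nothing is relevant."""
    if not hits:
        return None  # caller replies with NOT_FOUND instead of calling the model
    context = "\n\n".join(f"[page {h['page']}] {h['text']}" for h in hits)
    return (
        "Answer only from the context below and cite the page for each claim. "
        f'If the answer is not in the context, reply "{NOT_FOUND}"\n\n'
        f"Context:\n{context}\n\nQuestion: {question}"
    )
```

Returning `hits` to the user interface together with the model's reply gives readers the page numbers and snippets they need to check each claim.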

Step 5: Handle the hard PDFs

Real document sets include scanned contracts, multi-column layouts, and pages that are mostly tables or figures. These break naive pipelines quietly — the text extracts in the wrong order, or not at all — and the chatbot answers confidently from garbage with no warning that anything went wrong. Test your extraction on the worst documents you have, not the cleanest, before you trust any answer it gives. A few minutes spent reading the raw extracted text from your messiest files will save hours of confusion later, and it is the step that separates a demo from something dependable.
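
One way to decide which pages to read by hand is a crude health check over the raw extracted text, sketched below. The thresholds and the names `page_warnings` and `report` are assumptions, and the check only catches gross failures such as empty pages or symbol soup. It cannot tell whether multi-column text came out in the wrong order, so it narrows the manual review rather than replacing it.

```python
from __future__ import annotations


def page_warnings(
    text: str, min_chars: int = 50, min_alpha_ratio: float = 0.6
) -> list[str]:
    """Flag symptoms of failed extraction on one page of raw text.

    Both thresholds are assumed starting points, not established cutoffs.
    """
    stripped = text.strip()
    if len(stripped) < min_chars:
        return ["almost no text (scanned page or figure?)"]
    warnings = []
    visible = [ch for ch in stripped if not ch.isspace()]
    alpha_ratio = sum(ch.isalpha() for ch in visible) / len(visible)
    if alpha_ratio < min_alpha_ratio:
        warnings.append("mostly digits or symbols (table or OCR noise?)")
    if "\ufffd" in stripped:
        warnings.append("contains replacement characters (encoding problem)")
    return warnings


def report(pages: list[dict]) -> dict[int, list[str]]:
    """Map page number -> warnings, keeping only pages worth reading by hand."""
    flagged = {}
    for p in pages:
        found = page_warnings(p["text"])
        if found:
            flagged[p["page"]] = found
    return flagged
```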

Prefer it built and managed for you?

BSH Technologies builds production RAG and chatbots that handle messy real-world PDFs, OCR scanned pages, and cite every answer back to its source page. If you have a document set worth talking to, talk to BSH Technologies or explore our AI & automation services.

Frequently asked questions

How do I chat with a PDF using AI?

Extract the PDF text with a library like PyMuPDF or pdfplumber, split it into passages, embed those passages into a vector store such as ChromaDB or pgvector, then at question time retrieve the closest passages and ask a language model to answer from them. The model sees only relevant passages, not the whole file.

Can AI read scanned PDFs?

Not directly — a scanned PDF is images with no text layer. Run optical character recognition first using a tool like Tesseract to convert the page images into text, then feed that text through the normal extract, chunk, and embed pipeline. Detect text-layer pages and only OCR the ones that genuinely need it.

Why does my PDF chatbot give wrong answers?

Usually the text extraction or chunking failed silently. Multi-column layouts and tables often extract in the wrong order, so the model reasons over garbled text. Test extraction on your messiest PDFs, preserve structure when chunking, and require the model to cite source pages so errors become visible quickly.

Which library is best for extracting PDF text?

PyMuPDF is fast and reliable for text-based PDFs, while pdfplumber excels at extracting tables as structured rows. For scanned documents, pair either with Tesseract for OCR. Choosing based on whether your PDFs contain tables or scans matters more than the specific library name you pick.

Is ChromaDB or pgvector better for a PDF assistant?

For a single-user, local PDF assistant, ChromaDB is simpler and needs almost no setup. For a multi-user or durable application, pgvector on PostgreSQL gives you backups, metadata filtering, and one system to operate. Start with ChromaDB for prototypes and move to pgvector when the app becomes shared.

Do PDF chatbots keep my documents private?

It depends on the stack. Local embedding models and a local vector store like ChromaDB keep documents on your machine, but a hosted embedding API sends every chunk to a third party, and a hosted LLM API sends the retrieved passages. For sensitive documents, use self-hosted models so no text leaves your environment at any point.


