Building a PDF Chat System with RAG: A Step-by-Step Implementation Guide
A practical guide to implementing Retrieval-Augmented Generation with local LLMs and vector search

Have you ever wished you could have a conversation with your PDF documents? In this comprehensive guide, we’ll build a PDF chat system from scratch using RAG (Retrieval-Augmented Generation) and local LLMs, creating an implementation we’ll call docuRAG.js.
By following along, you’ll learn to build a system that can process PDFs, understand their content, and answer questions accurately using the power of vector search and large language models. Grab your favorite code editor, and let’s dive into building something awesome!
What You’ll Learn
- How RAG systems work and why they’re powerful
- Step-by-step implementation of a PDF chat system
- Practical code examples you can use today
Required Services
1. Local LLM Service (Ollama)
- Download and install Ollama
- Pull the Llama2 model:
ollama pull llama2
- Verify it’s running at http://localhost:11434
2. Vector Database (Qdrant)
- Run using Docker:
docker run -p 6333:6333 qdrant/qdrant
- Verify it’s running at http://localhost:6333
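Before moving on, it’s worth confirming that both services actually respond. Here’s a minimal check you could run (assuming Node 18+ so the built-in fetch is available; the file name is just a suggestion):
// quick-check.mjs — verify Ollama and Qdrant are reachable
const services = [
  { name: 'Ollama', url: 'http://localhost:11434' },
  { name: 'Qdrant', url: 'http://localhost:6333' },
];

for (const { name, url } of services) {
  try {
    const res = await fetch(url);
    console.log(`${name} is reachable (HTTP ${res.status})`);
  } catch (err) {
    console.error(`${name} is NOT reachable at ${url}: ${err.message}`);
  }
}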
What is RAG and Why Use It?
RAG (Retrieval-Augmented Generation) combines two powerful capabilities:
1. Retrieval: Finding relevant information in your documents
2. Generation: Creating human-like responses using an LLM
Instead of the LLM making things up, RAG helps it give accurate answers by first searching through your documents to find the most relevant information. It then uses these found passages as context, allowing the LLM to generate responses that are firmly grounded in your actual document content. This approach ensures that every answer is backed by real information from your documents.
The Implementation Journey
Let’s break this down into manageable steps. We’ll start with document processing and end with a working chat system.
1. Text Chunking: Document Segmentation
First, we need to read and process the PDF document. Let’s break this down into steps:
Reading the PDF
import pdfParse from 'pdf-parse';

// Extract the raw text from a PDF buffer
async function readPDFContent(pdfBuffer) {
  const data = await pdfParse(pdfBuffer);
  return data.text;
}
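To try this out locally, read a PDF from disk into a Buffer first. A quick usage sketch (the ./sample.pdf path is just a placeholder):
import { readFile } from 'fs/promises';

const pdfBuffer = await readFile('./sample.pdf'); // placeholder path to any PDF
const text = await readPDFContent(pdfBuffer);
console.log(text.slice(0, 200)); // preview the first 200 characters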
Splitting into Chunks
We use langchain’s text splitter to break the document into manageable pieces while preserving context:
import { RecursiveCharacterTextSplitter } from 'langchain/text_splitter';
// Split documents into overlapping chunks so context survives chunk boundaries
this.textSplitter = new RecursiveCharacterTextSplitter({
  chunkSize: this.config.chunkSize,       // characters per chunk (default: 1000)
  chunkOverlap: this.config.chunkOverlap, // characters shared with the neighbouring chunk (default: 200)
  lengthFunction: (text) => text.length,  // measure length in plain characters
});
Let’s break down these parameters:
1. chunkSize (default: 1000): This is how many characters we want in each piece of text. Think of it like deciding how big each “bite” of the document should be. If it’s too big, the LLM might get overwhelmed. If it’s too small, we might lose important context. We found that 1000 characters gives us a good balance — about a paragraph or two.
2. chunkOverlap (default: 200): This is where we let chunks “share” some text with their neighbours. Imagine you’re reading a book and each chunk is a page — the overlap is like having the last few sentences of the previous page repeated at the start of the next page. This helps us not lose context when important information spans across chunks.
Here’s a simple example:
Original text: "The new AI model was developed in 2023.
It uses advanced neural networks."
Without overlap:
Chunk 1: "The new AI model was developed in 2023."
Chunk 2: "It uses advanced neural networks."
↓ We might miss the connection between the model and its technology! ↓
With overlap:
Chunk 1: "The new AI model was developed in 2023. It uses"
Chunk 2: "2023. It uses advanced neural networks."
Now we can connect the dots between the model and its features!
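Putting the splitter to work on the extracted PDF text is then a single call. Here’s a small sketch using the defaults discussed above (pdfText is assumed to be the string returned by readPDFContent):
import { RecursiveCharacterTextSplitter } from 'langchain/text_splitter';

const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 1000,   // roughly a paragraph or two per chunk
  chunkOverlap: 200, // shared text between neighbouring chunks
});

const chunks = await splitter.splitText(pdfText);
console.log(`Created ${chunks.length} chunks`);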
2. Vector Embeddings: Making Text Searchable
To make our text chunks searchable, we need to convert them into a format that computers can efficiently process and compare. This is where vector embeddings come in.
What are vector embeddings?
Think of vector embeddings like turning words into numbers that computers can understand. These numbers are special because they capture not just the words, but also their meaning and how they relate to each other. So when we say “happy” and “joyful”, their number codes end up being very similar, just like their meanings!
Why do we need them?
- Similar meanings get similar numbers (like how “happy” and “joyful” end up close to each other)
- We can find related text by looking for similar number patterns
- It’s much faster to search through numbers than through the raw text!
And here’s where the magic happens!
Let’s look at how these numbers help our system understand meaning, even when questions are asked differently:
Text: "The weather is really cold today"
↓
Vector: [0.4, 0.7, 0.3, …]
Similar question: "What's the temperature outside?"
↓
Vector: [0.38, 0.72, 0.28, …] <- Notice how these numbers are close
because the meanings are related!
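Under the hood, “close” typically means a high cosine similarity between the two vectors (this is also the distance metric we’ll configure in Qdrant later). A tiny illustration with the toy vectors from above:
// Cosine similarity: ~1 means near-identical direction (similar meaning), ~0 means unrelated
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

const weatherVec = [0.4, 0.7, 0.3];
const temperatureVec = [0.38, 0.72, 0.28];
console.log(cosineSimilarity(weatherVec, temperatureVec).toFixed(3)); // ≈ 0.999, very similar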
Converting Chunks to Vectors
With our understanding of vectors, we can now process each text chunk through the LLM’s embedding API:
const vector = await this.getEmbedding(chunk);
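The getEmbedding helper isn’t shown above; one possible standalone implementation, assuming Ollama’s /api/embeddings endpoint and the llama2 model, could look like this:
// Hypothetical helper: ask the local Ollama instance to embed a piece of text
async function getEmbedding(text) {
  const res = await fetch('http://localhost:11434/api/embeddings', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ model: 'llama2', prompt: text }),
  });
  const data = await res.json();
  return data.embedding; // an array of floats, e.g. [0.4, 0.7, 0.3, …]
}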
3. Storing in the Vector Database
Now we can save both the text chunks and their vector representations in Qdrant. This creates a searchable knowledge base that we’ll query later when answering questions.
import { v4 as uuidv4 } from 'uuid';

// Embed every chunk and package it as a Qdrant point
const vectors = await Promise.all(
  chunks.map(async (chunk, index) => {
    const embedding = await this.getEmbedding(chunk);
    return {
      id: uuidv4(),
      vector: embedding,
      payload: { text: chunk, fileName: metadata.fileName, chunkIndex: index }
    };
  })
);

// Store the points in the document's collection
await this.qdrant.upsert(metadata.collectionName, {
  points: vectors
});
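One detail the snippet assumes: the collection has to exist before the first upsert, created with a vector size that matches your embedding model and cosine distance. A minimal sketch, assuming the @qdrant/js-client-rest client:
import { QdrantClient } from '@qdrant/js-client-rest';

const qdrant = new QdrantClient({ url: 'http://localhost:6333' });

// Use one embedding as a probe to discover the model's output dimension
const probe = await getEmbedding('dimension probe');
await qdrant.createCollection(metadata.collectionName, {
  vectors: { size: probe.length, distance: 'Cosine' },
});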
4. Building the Chat System
Now let’s put it all together into a working chat system:
Processing Chat Messages
When you send a chat message, the system follows a similar process to find and generate relevant answers:
1. Converting Your Question
First, we convert your question into a vector using the same embedding API. This allows us to search for similar content in our database.
// Example: processing a question
const question = "What are neural networks?";
const questionVector = await getEmbedding(question);
// questionVector now represents the question, e.g. [0.45, 0.82, …]
2. Finding Relevant Information
We use this vector to search the database for chunks that are most similar to your question.
// Search the collection for the most similar chunks
const matches = await qdrant.search('my_collection', {
  vector: questionVector,
  limit: 3 // get the top 3 most relevant chunks
});
Results:
[
{ text: "Neural networks are powerful algorithms…", score: 0.89 },
{ text: "Deep learning uses neural networks to…", score: 0.76 },
{ text: "Machine learning includes neural networks…", score: 0.71 }
]
3. Generating the Answer
Finally, we send both your question and the retrieved chunks to the LLM service. The LLM reads this context and generates an accurate, focused answer based on your documents.
// Generate an answer using the retrieved context
const answer = await llm.chat({
  question: "What are neural networks?",
  context: matches.map(m => m.text).join('\n')
});
Streams back: “Neural networks are powerful algorithms that…”
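The llm.chat call is a thin wrapper around the local model. A non-streaming sketch of such a wrapper, assuming Ollama’s /api/chat endpoint, might look like this:
// Hypothetical wrapper: answer a question using only the retrieved context
async function chat({ question, context }) {
  const res = await fetch('http://localhost:11434/api/chat', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      model: 'llama2',
      stream: false,
      messages: [
        { role: 'system', content: `Answer the question using only this context:\n${context}` },
        { role: 'user', content: question },
      ],
    }),
  });
  const data = await res.json();
  return data.message.content;
}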
5. How It All Works Together
The system connects document processing and question answering in a seamless flow:
1. Document Processing:
- Read and parse PDF content using pdf-parse
- Split text into chunks using RecursiveCharacterTextSplitter (1000 chars with 200 char overlap)
- Generate vector embeddings for each chunk using the LLM API (3072-dimensional vectors)
- Store in Qdrant collections with rich metadata
- Create optimized Cosine similarity indices for efficient search
2. Question Answering:
- Convert user’s question into a vector using the same embedding model
- Search Qdrant for the most relevant chunks (default: top 3)
- Create a structured prompt combining: Retrieved context chunks, User’s question, Response guidelines (bullet points, markdown formatting)
- Send to LLM for final answer generation
- Stream the response back to the user with source references
For a smoother user experience, responses are streamed in real-time as they’re generated, with each chunk including both the answer and its source references. This provides immediate feedback while maintaining the accuracy and traceability of RAG systems.
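Streaming comes down to forwarding Ollama’s chunked response to the client as it arrives. A rough Express sketch, assuming /api/chat with stream: true (each line of the response body is a small JSON object carrying a partial message):
import express from 'express';

const app = express();
app.use(express.json());

// Hypothetical route that streams the answer back token by token
app.post('/chat', async (req, res) => {
  const { question, context } = req.body;

  const ollama = await fetch('http://localhost:11434/api/chat', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      model: 'llama2',
      stream: true,
      messages: [
        { role: 'system', content: `Answer the question using only this context:\n${context}` },
        { role: 'user', content: question },
      ],
    }),
  });

  res.setHeader('Content-Type', 'text/plain; charset=utf-8');
  const decoder = new TextDecoder();
  let buffer = '';
  for await (const chunk of ollama.body) {
    buffer += decoder.decode(chunk, { stream: true });
    const lines = buffer.split('\n');
    buffer = lines.pop(); // keep any incomplete line for the next chunk
    for (const line of lines.filter(Boolean)) {
      const part = JSON.parse(line);
      if (part.message?.content) res.write(part.message.content);
    }
  }
  res.end();
});

app.listen(3000);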
System Flow Diagram:

Conclusion:
RAG gives AI systems the ability to have meaningful conversations about your documents, combining an AI’s natural understanding of language with specific information from your documents to create responses that are both intelligent and factually grounded.
A real-world analogy is working with an architect designing your dream house.
When you consult with the architect, they already know the fundamentals of house design and construction — just like how an LLM understands language and general concepts. But to create exactly what you want, you show them your inspiration: floor plans you like from different houses, photos of designs that inspire you, specific material choices you prefer, and layout diagrams that appeal to you.
The architect then takes all these references and works their magic. They carefully review your materials, identify the elements that match your needs, and blend them with their expertise. The result is a design that’s not only personalized and structurally sound but also transparent — they can show you exactly how each part of the final design was inspired by your references. Just like our RAG system, the architect doesn’t make up designs from thin air, but rather combines and adapts from verified sources to create something perfect for you.
The more relevant references you provide, the better the architect can understand your vision and create exactly what you want.
Similarly, our system uses vectors to identify and provide the most relevant context to the LLM, ensuring you get precise, well-informed answers from your documents.
You now have both the understanding and the practical knowledge to build a PDF chat system that can:
- Process a PDF document
- Find relevant information quickly
- Generate accurate, context-aware answers
Try out the library
You can find the complete implementation of this RAG system in our GitHub Repository, including all the code we’ve discussed in this guide, along with working examples using Express and NestJS frameworks.
About the Author: You can find more details about my projects and contributions on my website: msroot.me