How to Build an AI Chatbot Trained on Your Own Data

"Trained on your data" is a phrase that gets used loosely. For most business use cases, it does not mean fine-tuning a model — it means building a RAG pipeline: a system that retrieves the most relevant pieces of your content at query time and passes them to the model as context. This is faster to build, cheaper to run, and easier to update than fine-tuning. It's also how most production chatbots that answer domain-specific questions actually work.
Here's what that build actually involves — from data preparation to deployment, with realistic estimates on time and cost.
The Architecture (RAG, Not Fine-Tuning)
A custom chatbot built on RAG has four main components:
- An indexing pipeline that processes your source documents, chunks them, embeds them, and stores the vectors.
- A retrieval layer that, at query time, embeds the user's question and finds the most relevant chunks.
- A generation layer that passes the retrieved chunks plus the user's question to the LLM and returns the response.
- A chat UI and admin layer: the interface users interact with and the tools your team uses to manage the system.
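To make the moving parts concrete, here's a minimal sketch of the query-time flow (retrieval plus generation) using OpenAI's Python SDK. The `find_similar_chunks` function is a placeholder for the vector store lookup sketched later in this article; the prompt wording and model choice are illustrative, not prescriptive.

```python
from openai import OpenAI

client = OpenAI()

def answer(question: str, find_similar_chunks) -> str:
    """Minimal RAG query flow: embed the question, retrieve context, generate."""
    # 1. Embed the question with the same model used at indexing time.
    q_vec = client.embeddings.create(
        model="text-embedding-3-small", input=question
    ).data[0].embedding

    # 2. Retrieve the most relevant chunks (vector store lookup is assumed).
    chunks = find_similar_chunks(q_vec, top_k=5)

    # 3. Generate: pass the retrieved chunks plus the question to the LLM.
    context = "\n\n".join(chunks)
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Answer only from the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content
```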
Fine-tuning is the alternative, and it's almost never the right choice for this use case. Fine-tuning bakes knowledge into the model's weights — which sounds appealing but has serious practical problems: the model still hallucinates, it's expensive to retrain when your data changes, and it doesn't give you source attribution. RAG solves all three of these problems. The only situation where fine-tuning is genuinely better is when you want to change the model's behavior — tone, output format, reasoning style — rather than give it access to new information.
Data Preparation
This is consistently the part that takes longer than teams expect, and the part that has the most impact on chatbot quality.
Step 1: Audit your content. Collect everything the chatbot should be able to answer questions about — product documentation, PDFs, website content, FAQ databases, support ticket resolutions, and internal SOPs. Be selective. Including low-quality, outdated, or contradictory content will directly degrade the chatbot's answer quality. If you wouldn't point a new employee at a document and tell them to use it as a reference, don't include it.
Step 2: Clean and format. Raw documents almost always need preprocessing. PDFs need to be parsed to extract clean text (headers, footers, and page numbers create noise). HTML needs to be stripped of navigation and boilerplate. Scanned PDFs need OCR. Duplicate or near-duplicate content should be de-duped before indexing. This is tedious but cannot be skipped.
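Exact-duplicate removal is the easy part of that de-dupe pass and worth doing first. A minimal sketch, assuming plain-text documents (near-duplicate detection needs fuzzier techniques like MinHash and isn't shown):

```python
import hashlib
import re

def normalize(text: str) -> str:
    """Collapse whitespace and lowercase so trivial variants hash identically."""
    return re.sub(r"\s+", " ", text).strip().lower()

def dedupe(docs: list[str]) -> list[str]:
    """Drop exact duplicates before indexing; keeps the first occurrence."""
    seen: set[str] = set()
    unique = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique
```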
Step 3: Chunking strategy. Documents get split into chunks before embedding. Dense technical documentation works best at 300–500 tokens with 50-token overlap; FAQ content works as one Q&A pair per chunk; long-form prose works at 500–700 tokens with overlap. Structured data like tables and lists should be chunked by section rather than token count. Overlap between chunks ensures that content near chunk boundaries is fully captured by at least one chunk.
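A minimal token-window chunker, assuming the tiktoken library for tokenization (the chunk size and overlap defaults are illustrative starting points, not tuned values):

```python
import tiktoken

def chunk_tokens(text: str, chunk_size: int = 400, overlap: int = 50) -> list[str]:
    """Split text into fixed-size token windows with overlap at the boundaries."""
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    chunks = []
    step = chunk_size - overlap  # each window re-covers the last `overlap` tokens
    for start in range(0, len(tokens), step):
        window = tokens[start : start + chunk_size]
        chunks.append(enc.decode(window))
        if start + chunk_size >= len(tokens):
            break
    return chunks
```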
Embedding and Vector Store
Each chunk is passed through an embedding model to generate a vector representation. We use OpenAI's text-embedding-3-small as the default — it costs $0.02 per 1M tokens and produces excellent quality for most use cases. For a 1,000-page document corpus at roughly 750 tokens per page and 300 tokens per chunk, you're looking at approximately 750K tokens across roughly 2,500 chunks — under $0.02 to embed the entire corpus.
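Embedding the corpus is a straightforward batch job. A sketch using OpenAI's Python SDK (the batch size is an illustrative choice, and production code would want retry handling):

```python
from openai import OpenAI

client = OpenAI()

def embed_chunks(chunks: list[str], batch_size: int = 100) -> list[list[float]]:
    """Embed chunks in batches; the API accepts a list of inputs per request."""
    vectors = []
    for i in range(0, len(chunks), batch_size):
        batch = chunks[i : i + batch_size]
        resp = client.embeddings.create(model="text-embedding-3-small", input=batch)
        vectors.extend(item.embedding for item in resp.data)
    return vectors
```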
For vector storage, pgvector on Postgres is the simplest path if you're already running Postgres — no additional managed service, handles millions of vectors comfortably for most use cases. Pinecone is a managed service with better performance at very large scale (tens of millions of vectors). Qdrant and Weaviate are open source and self-hostable middle grounds. For a typical business chatbot — a few thousand to a few hundred thousand chunks — pgvector on an existing Postgres instance is the right answer.
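A minimal pgvector setup, assuming psycopg and a `chunks` table whose name and columns are illustrative. The `<=>` operator is pgvector's cosine distance, so similarity is one minus the distance:

```python
import psycopg

# Run once: text-embedding-3-small produces 1536-dimensional vectors.
SCHEMA = """
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE IF NOT EXISTS chunks (
    id bigserial PRIMARY KEY,
    document_id text,
    content text,
    embedding vector(1536)
);
"""
# conn = psycopg.connect("dbname=app")  # then: conn.execute(SCHEMA)

def find_similar_chunks(conn, query_vec: list[float], top_k: int = 5):
    """Return (id, content, similarity) rows, nearest first."""
    vec_literal = "[" + ",".join(str(x) for x in query_vec) + "]"
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT id, content, 1 - (embedding <=> %s::vector) AS similarity
            FROM chunks
            ORDER BY embedding <=> %s::vector
            LIMIT %s
            """,
            (vec_literal, vec_literal, top_k),
        )
        return cur.fetchall()
```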
Retrieval Quality: How to Test It
The retrieval step is where most chatbots fail, and it's the hardest step to debug because failures are silent. If the relevant chunk isn't retrieved, the model has no way to know that — it'll either say "I don't have information about that" or hallucinate.
- Build a retrieval test set. Take 30–50 questions the chatbot should answer. For each, manually identify which chunks contain the answer, then run retrieval and check whether those chunks appear in the top-K results (see the sketch after this list). A hit rate below 80% means your chunking or embedding is failing.
- Check similarity scores. Log the cosine similarity scores of retrieved chunks. If the top chunk has a similarity of 0.65, that's a weak match. Set a minimum threshold below which you return "I don't have information on that" rather than passing weak context to the model.
- Review conversations with low confidence. Filter your admin panel for responses where the model indicated uncertainty. Read the retrieved chunks that were passed. This is the fastest way to identify gaps in your corpus.
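A minimal version of that hit-rate check, reusing the find_similar_chunks sketch above; the embed_fn argument and the hand-labeled test set format are assumptions for illustration:

```python
def retrieval_hit_rate(conn, test_set, embed_fn, top_k: int = 5) -> float:
    """test_set: list of (question, expected_chunk_ids) pairs labeled by hand."""
    hits = 0
    for question, expected_ids in test_set:
        results = find_similar_chunks(conn, embed_fn(question), top_k=top_k)
        retrieved_ids = {row[0] for row in results}
        if retrieved_ids & set(expected_ids):
            hits += 1
    return hits / len(test_set)  # below 0.8, revisit chunking or embeddings
```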
The Chat UI
The interface needs three things that a generic chat UI doesn't always provide:
- Source citations: show users which documents the answer came from. This builds trust, reduces hallucination impact, and is often a compliance requirement in regulated industries.
- Streaming responses: the user should see the response appear progressively, not wait 4–8 seconds for the full response.
- Feedback mechanism: a thumbs-up/thumbs-down or "Was this helpful?" prompt on each response. This generates labeled data you can use to identify retrieval failures and improve the system over time.
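Streaming is mostly a matter of forwarding the model's token deltas to the client as they arrive. A minimal sketch with OpenAI's Python SDK (model choice and prompt are illustrative):

```python
from openai import OpenAI

client = OpenAI()

def stream_answer(context: str, question: str):
    """Yield response tokens as they arrive so the UI can render progressively."""
    stream = client.chat.completions.create(
        model="gpt-4o-mini",
        stream=True,
        messages=[
            {"role": "system", "content": "Answer only from the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    for event in stream:
        delta = event.choices[0].delta.content
        if delta:  # the final chunk carries no content, only a finish reason
            yield delta
```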
Admin Panel Requirements
Your team needs to be able to manage the chatbot without touching the codebase. The minimum viable admin layer includes:
- Document management: add, update, and remove documents from the knowledge base, with a trigger to re-index affected content.
- A conversation browser filterable by date and user.
- A review queue for failed and flagged responses.
- Usage stats: queries per day, average response latency, and cost per day.
Without an admin layer, your team will update the knowledge base by editing config files and re-deploying — that's not sustainable.
Cost to Build and Operate
Build cost: a production-quality RAG chatbot — indexing pipeline, retrieval layer, streaming chat UI, source citations, admin panel, feedback mechanism, and monitoring — is typically a 4–6 week build.
Ongoing API cost: a rough estimate for a business chatbot with 500 queries per day, averaging 2,000 input tokens per query and 400 output tokens: that's about 30M input tokens and 6M output tokens per month, or roughly $108/month with GPT-4.1 and $8/month with GPT-4o Mini. For most internal tools and customer-facing bots where quality is important, GPT-4o Mini with a strong system prompt is the right default. Use GPT-4.1 or Claude Sonnet 4.6 for high-stakes external-facing bots where output quality directly affects user trust.
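A back-of-the-envelope estimator for those numbers; the per-1M-token prices below are assumptions matching the figures above and should be re-checked against current pricing before budgeting:

```python
def monthly_api_cost(queries_per_day: int, in_tokens: int, out_tokens: int,
                     in_price: float, out_price: float) -> float:
    """Prices are USD per 1M tokens; assumes a 30-day month."""
    monthly_in = queries_per_day * in_tokens * 30 / 1_000_000
    monthly_out = queries_per_day * out_tokens * 30 / 1_000_000
    return monthly_in * in_price + monthly_out * out_price

# Illustrative prices per 1M tokens; verify against current pricing pages.
print(monthly_api_cost(500, 2000, 400, 2.00, 8.00))  # GPT-4.1:     $108.00
print(monthly_api_cost(500, 2000, 400, 0.15, 0.60))  # GPT-4o Mini:   $8.10
```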
Vector store hosting: pgvector on your existing Postgres instance has no additional cost. Pinecone starts at $70/month for a managed starter plan.
Realistic Timeline
- Week 1: data audit, cleaning, and preparation
- Week 2: indexing pipeline, vector store setup, retrieval layer
- Week 3: LLM integration, prompt engineering, streaming chat UI
- Week 4: admin panel, feedback mechanism, monitoring, testing
- Weeks 5–6: refinement based on retrieval quality testing, QA, deployment
The timeline compresses if your data is already clean and in a consistent format. It expands if you have messy PDFs, multiple data sources with different formats, or complex permission requirements.
This is one article in our complete AI integration guide.
Related Posts

AI Integration for Web and Mobile Apps: The Complete Guide
How to add AI features to your product — LLM selection, RAG pipelines, cost controls, prompt engineering, and architecture patterns that work in production.

What Is a RAG Pipeline? (And Do You Actually Need One?)
RAG — Retrieval-Augmented Generation — is the most practical way to make an LLM answer questions about your data. Here's how it works and when to use it.