RAG Architecture Decisions: What I'd Do Differently
I've spent the last 6 months building RAG (Retrieval-Augmented Generation) systems. Not toy demos: production systems processing thousands of documents daily.
Here's my honest retrospective on what worked, what broke, and what I'd do differently.
The Setup
My main project was a document analysis system for enterprise clients. Requirements:
- Process PDFs, Word docs, scanned images
- Extract structured data (invoice fields, contract terms)
- Sub-3-second latency for queries
- 95%+ accuracy target
Decisions I Got Right
1. PostgreSQL + pgvector Over Managed Vector DBs
The temptation: Use Pinecone, Weaviate, or Qdrant. They're optimized for vectors.
What I did: PostgreSQL with the pgvector extension.
Why it worked:
- Cost: Free tier handles 1M+ vectors. Pinecone would cost $70+/month
- Simplicity: One database for everything. No sync issues.
- Filtering: Native SQL for metadata filtering is powerful
- Locality: Data stays in my infrastructure
Trade-off accepted: pgvector is slower than purpose-built vector DBs at massive scale. But I'm not at "massive scale" yet.
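As a concrete illustration of the SQL filtering point, here's a minimal sketch of a filtered similarity query through psycopg2. The chunks table, its columns, and the client_id filter are hypothetical, not my actual schema.

```python
import psycopg2

# Hypothetical schema: chunks(id, content, metadata jsonb, embedding vector(1536))
conn = psycopg2.connect("dbname=rag")

def search_chunks(query_embedding: list[float], client_id: str, k: int = 5):
    """Similarity search restricted by a plain SQL metadata filter."""
    # pgvector accepts a '[x,y,...]' literal cast to the vector type; <-> is L2 distance.
    vector_literal = "[" + ",".join(str(x) for x in query_embedding) + "]"
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT id, content
            FROM chunks
            WHERE metadata->>'client_id' = %s
            ORDER BY embedding <-> %s::vector
            LIMIT %s
            """,
            (client_id, vector_literal, k),
        )
        return cur.fetchall()
```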
2. Chunking with Overlap
I use 1000-token chunks with 200-token overlap:
```python
def chunk_document(text: str, chunk_size: int = 1000, overlap: int = 200):
    """Smart chunking with overlap."""
    # Note: chunk_size and overlap are measured in characters here,
    # a rough stand-in for the token counts mentioned above.
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunk = text[start:end]
        # Try to end at a sentence boundary
        if end < len(text):
            last_period = chunk.rfind('.')
            if last_period > chunk_size * 0.8:
                end = start + last_period + 1
                chunk = text[start:end]
        chunks.append(chunk)
        start = end - overlap
    return chunks
```
The overlap prevents information loss at chunk boundaries. Worth the 20% storage overhead.
3. Caching Embeddings Aggressively
Embeddings don't change. Cache them forever.
I use Redis with a simple hash:
- Key: embed:{hash(text + model_version)}
- Value: the embedding vector
This reduced my OpenAI costs by 60% for repeat queries.
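A minimal sketch of that cache layer, assuming redis-py and the current OpenAI Python client; the model name is a stand-in for whatever embedding model you've pinned.

```python
import hashlib
import json

import redis
from openai import OpenAI

r = redis.Redis()
client = OpenAI()
MODEL_VERSION = "text-embedding-3-small"  # assumed; use whatever model you've pinned

def get_embedding(text: str) -> list[float]:
    """Return a cached embedding if present, otherwise embed once and cache forever."""
    key = "embed:" + hashlib.sha256((text + MODEL_VERSION).encode()).hexdigest()
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)
    embedding = client.embeddings.create(model=MODEL_VERSION, input=text).data[0].embedding
    r.set(key, json.dumps(embedding))  # no TTL: embeddings for a fixed model never change
    return embedding
```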
Decisions I Got Wrong
1. Starting with 4K Token Context Windows
The mistake: "More context = better retrieval, right?"
What happened: The LLM would hallucinate. Too much context meant distant, irrelevant information bled into the response.
The fix: Dropped to 1K token chunks + rich metadata. Let the retrieval be precise, not comprehensive.
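For concreteness, this is roughly the shape of a stored chunk after that change; the metadata fields are illustrative, not my exact schema.

```python
# A small chunk plus rich metadata for filtering and citation (fields are illustrative).
chunk_record = {
    "content": "Payment is due within 30 days of the invoice date...",
    "metadata": {
        "doc_id": "contract-2024-0017",  # hypothetical identifiers
        "doc_type": "contract",
        "section": "Payment Terms",
        "page": 4,
    },
}
```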
2. Pure Vector Search
The mistake: "Semantic search is magic! It'll understand everything!"
What happened: Queries like "invoice #12345" returned random invoices because the embedding didn't capture the exact number.
The fix: Hybrid search. BM25 for keyword matching + vector for semantic. Combine with reciprocal rank fusion.
```python
def hybrid_search(query: str, k: int = 5):
    # Get both result sets
    vector_results = vector_search(query, k=k*2)
    keyword_results = bm25_search(query, k=k*2)
    # Reciprocal Rank Fusion (60 is the standard RRF smoothing constant)
    scores = {}
    for rank, doc in enumerate(vector_results):
        scores[doc.id] = scores.get(doc.id, 0) + 1 / (rank + 60)
    for rank, doc in enumerate(keyword_results):
        scores[doc.id] = scores.get(doc.id, 0) + 1 / (rank + 60)
    # Return top k by combined score
    sorted_ids = sorted(scores.keys(), key=lambda x: scores[x], reverse=True)
    return [get_document(doc_id) for doc_id in sorted_ids[:k]]
```
3. No Evaluation Framework from Day 1
The mistake: "I'll add tests later"
What happened: Spent 3 weeks iterating based on vibes. "This seems better?"
The fix: Created a test set of 100 queries with expected outputs. Now every change is measurable.
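A minimal sketch of that harness, assuming a hypothetical queries.json of {query, expected} pairs and an answer_query() entry point into the pipeline:

```python
import json

def evaluate(answer_query, path: str = "queries.json") -> float:
    """Run the fixed test set and report exact-match accuracy."""
    with open(path) as f:
        cases = json.load(f)  # [{"query": ..., "expected": ...}, ...]
    correct = 0
    for case in cases:
        answer = answer_query(case["query"])
        if answer.strip() == case["expected"].strip():
            correct += 1
    accuracy = correct / len(cases)
    print(f"{correct}/{len(cases)} correct ({accuracy:.1%})")
    return accuracy
```

Exact match is crude for generated text, but it's enough to turn "this seems better?" into a number.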
What I'd Do Differently
1. Start with a Simpler Architecture
My initial design had:
- Separate embedding service
- Message queue for async processing
- Complex retry logic
What I actually needed:
- A single FastAPI app
- Synchronous processing (latency was fine)
- PostgreSQL for everything
Lesson: Build the simplest thing first. Add complexity when you hit limits.
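For reference, a minimal sketch of that single-app shape; hybrid_search and generate_answer stand in for the retrieval and LLM calls and are assumptions, not my actual code.

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Query(BaseModel):
    question: str

@app.post("/query")
def handle_query(q: Query):
    # Synchronous end-to-end path: retrieve, then generate. No queue, no workers.
    chunks = hybrid_search(q.question, k=5)        # hybrid retrieval from earlier
    answer = generate_answer(q.question, chunks)   # hypothetical LLM call
    return {"answer": answer, "sources": [c.id for c in chunks]}
```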
2. Invest in Document Preprocessing
Garbage in, garbage out. I underestimated how much time I'd spend on:
- OCR quality for scanned documents
- Table extraction (still hard)
- Handling multi-column layouts
Next time, I'd allocate 30% of the timeline to the preprocessing pipeline.
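As one example of where that time goes, here's a rough sketch of per-page text and table extraction with pdfplumber, falling back to pytesseract OCR for scanned pages. The library choices are mine for illustration, not necessarily what belongs in your pipeline.

```python
import pdfplumber
import pytesseract

def extract_pages(pdf_path: str):
    """Yield (text, tables) per page, OCRing pages that have no text layer."""
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            text = page.extract_text() or ""
            if not text.strip():
                # Scanned page: rasterize it and run OCR.
                image = page.to_image(resolution=300).original
                text = pytesseract.image_to_string(image)
            tables = page.extract_tables()  # table extraction still needs manual cleanup
            yield text, tables
```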
3. Add Observability Earlier
Debugging "why did the LLM say that?" is hard without traces. I eventually added:
- Full prompt logging
- Retrieved chunk logging
- Confidence scores for each extraction
Should have done this from the start.
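A minimal sketch of that tracing, using the standard logging module with JSON payloads; the field names are illustrative.

```python
import json
import logging
import uuid

logger = logging.getLogger("rag.trace")

def log_trace(query: str, prompt: str, chunks, extraction: dict, confidence: float) -> str:
    """Log everything needed to answer 'why did the LLM say that?' after the fact."""
    trace_id = str(uuid.uuid4())
    logger.info(json.dumps({
        "trace_id": trace_id,
        "query": query,
        "prompt": prompt,                     # full prompt sent to the LLM
        "chunk_ids": [c.id for c in chunks],  # which chunks retrieval returned
        "extraction": extraction,
        "confidence": confidence,
    }))
    return trace_id
```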
The Architecture I'd Use Today
Ingestion path: Document Ingestion → Preprocess (OCR, PDF) → Chunk + Embed → Storage (PostgreSQL)

Query path: Query Handler → Hybrid Retrieval → Generate Response
Simple. Debuggable. Cheap.
Building RAG systems? I'd love to compare notes. Find me on LinkedIn.