
RAG Architecture Decisions: What I'd Do Differently


I’ve spent the last 6 months building RAG (Retrieval-Augmented Generation) systems. Not toy demos: production systems processing thousands of documents daily.

Here’s my honest retrospective on what worked, what broke, and what I’d do differently.

The Setup

My main project was a document analysis system for enterprise clients. Requirements:

  • Process PDFs, Word docs, scanned images
  • Extract structured data (invoice fields, contract terms)
  • Sub-3-second latency for queries
  • 95%+ accuracy target

Decisions I Got Right

1. PostgreSQL + pgvector Over Managed Vector DBs

The temptation: Use Pinecone, Weaviate, or Qdrant. They’re optimized for vectors.

What I did: PostgreSQL with the pgvector extension.

Why it worked:

  • Cost: Free tier handles 1M+ vectors. Pinecone would cost $70+/month
  • Simplicity: One database for everything. No sync issues.
  • Filtering: Native SQL for metadata filtering is powerful
  • Locality: Data stays in my infrastructure

Trade-off accepted: pgvector is slower than purpose-built vector DBs at massive scale. But I’m not at “massive scale” yet.
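
To make the filtering point concrete, here’s a minimal sketch of the kind of query this setup enables, using psycopg2 and pgvector’s cosine-distance operator. The chunks table, client_id column, and connection string are illustrative, not my actual schema:

import psycopg2

conn = psycopg2.connect("dbname=rag")  # illustrative DSN

def search_chunks(query_embedding: list[float], client_id: int, k: int = 5):
    """Nearest-neighbor search restricted to one client's documents."""
    vec = "[" + ",".join(str(x) for x in query_embedding) + "]"
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT id, content
            FROM chunks
            WHERE client_id = %s               -- plain SQL metadata filter
            ORDER BY embedding <=> %s::vector  -- cosine distance via pgvector
            LIMIT %s
            """,
            (client_id, vec, k),
        )
        return cur.fetchall()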

2. Chunking with Overlap

I use 1000-token chunks with 200-token overlap:

def chunk_document(text: str, chunk_size: int = 1000, overlap: int = 200):
    """Chunk text with overlap (sizes here are characters, not tokens)."""
    chunks = []
    start = 0

    while start < len(text):
        end = start + chunk_size
        chunk = text[start:end]

        # Try to end at a sentence boundary, but only within the last 20% of the chunk
        if end < len(text):
            last_period = chunk.rfind('.')
            if last_period > chunk_size * 0.8:
                end = start + last_period + 1
                chunk = text[start:end]

        chunks.append(chunk)
        if end >= len(text):
            break  # avoid emitting a trailing chunk that is just the overlap
        start = end - overlap

    return chunks

The overlap prevents information loss at chunk boundaries. Worth the 20% storage overhead.

3. Caching Embeddings Aggressively

Embeddings don’t change. Cache them forever.

I use Redis with a simple hash:

  • Key: embed:{hash(text + model_version)}
  • Value: The embedding vector

This reduced my OpenAI costs by 60% for repeat queries.
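
Here’s roughly what that looks like in code, assuming redis-py and the OpenAI Python client; the model name and key prefix are placeholders for whatever you embed with:

import hashlib
import json

import redis
from openai import OpenAI

r = redis.Redis()
client = OpenAI()
EMBED_MODEL = "text-embedding-3-small"  # placeholder: use whatever model you embed with

def get_embedding(text: str) -> list[float]:
    """Return a cached embedding, computing and storing it on a miss."""
    key = "embed:" + hashlib.sha256((text + EMBED_MODEL).encode()).hexdigest()
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)
    embedding = client.embeddings.create(model=EMBED_MODEL, input=text).data[0].embedding
    r.set(key, json.dumps(embedding))  # no TTL: embeddings for a fixed model never change
    return embedding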

Decisions I Got Wrong

1. Starting with 4K Token Context Windows

The mistake: “More context = better retrieval, right?”

What happened: The LLM would hallucinate. Too much context meant distant, irrelevant information bled into the response.

The fix: Dropped to 1K token chunks + rich metadata. Let the retrieval be precise, not comprehensive.

2. Relying on Vector Search Alone

The mistake: “Semantic search is magic! It’ll understand everything!”

What happened: Queries like β€œinvoice #12345” returned random invoices because the embedding didn’t capture the exact number.

The fix: Hybrid search. BM25 for keyword matching + vector for semantic. Combine with reciprocal rank fusion.

def hybrid_search(query: str, k: int = 5):
    # Get both result sets
    vector_results = vector_search(query, k=k*2)
    keyword_results = bm25_search(query, k=k*2)
    
    # Reciprocal Rank Fusion (the +60 is the standard RRF smoothing constant)
    scores = {}
    for rank, doc in enumerate(vector_results):
        scores[doc.id] = scores.get(doc.id, 0) + 1 / (rank + 60)
    for rank, doc in enumerate(keyword_results):
        scores[doc.id] = scores.get(doc.id, 0) + 1 / (rank + 60)

    # Return top k by combined score
    sorted_ids = sorted(scores, key=scores.get, reverse=True)
    return [get_document(doc_id) for doc_id in sorted_ids[:k]]

3. No Evaluation Framework from Day 1

The mistake: “I’ll add tests later”

What happened: Spent 3 weeks iterating based on vibes. “This seems better?”

The fix: Created a test set of 100 queries with expected outputs. Now every change is measurable.
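
The framework doesn’t need to be fancy. A minimal sketch, where answer_query() and queries.json stand in for your own pipeline and fixture file:

import json

def answer_query(question: str) -> str:
    """Placeholder for the real RAG pipeline."""
    return ""

def evaluate(path: str = "queries.json") -> float:
    """Score the pipeline against a fixed query set with expected outputs."""
    with open(path) as f:
        cases = json.load(f)  # [{"query": ..., "expected": ...}, ...]
    correct = sum(answer_query(c["query"]).strip() == c["expected"] for c in cases)
    accuracy = correct / len(cases)
    print(f"{correct}/{len(cases)} correct ({accuracy:.0%})")
    return accuracy

Exact-match scoring is crude, but even a crude number beats iterating on vibes.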

What I’d Do Differently

1. Start with a Simpler Architecture

My initial design had:

  • Separate embedding service
  • Message queue for async processing
  • Complex retry logic

What I actually needed:

  • A single FastAPI app
  • Synchronous processing (latency was fine)
  • PostgreSQL for everything

Lesson: Build the simplest thing first. Add complexity when you hit limits.
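
As a sketch of the shape I mean; the retrieve() and generate() helpers here are placeholders for the hybrid search and LLM call described above:

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class QueryRequest(BaseModel):
    question: str

def retrieve(question: str) -> list[str]:
    """Placeholder: the hybrid search from earlier goes here."""
    return ["example chunk"]

def generate(question: str, chunks: list[str]) -> str:
    """Placeholder: a single synchronous LLM call goes here."""
    return f"Answer grounded in {len(chunks)} chunks."

@app.post("/query")
def query(req: QueryRequest):
    chunks = retrieve(req.question)          # synchronous: no queue, no worker
    answer = generate(req.question, chunks)
    return {"answer": answer, "sources": len(chunks)}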

2. Invest in Document Preprocessing

Garbage in, garbage out. I underestimated how much time I’d spend on:

  • OCR quality for scanned documents
  • Table extraction (still hard)
  • Handling multi-column layouts

Next time, I’d allocate 30% of timeline to preprocessing pipeline.

3. Add Observability Earlier

Debugging “why did the LLM say that?” is hard without traces. I eventually added:

  • Full prompt logging
  • Retrieved chunk logging
  • Confidence scores for each extraction

Should have done this from the start.
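
A minimal version of this is just structured logging; the field names here are illustrative rather than any particular tracing product:

import json
import logging

logger = logging.getLogger("rag.trace")

def log_trace(query: str, prompt: str, chunk_ids: list[str], confidence: float) -> None:
    """Record everything needed to answer 'why did the LLM say that?'."""
    logger.info(json.dumps({
        "query": query,
        "prompt": prompt,            # the full prompt sent to the LLM
        "chunk_ids": chunk_ids,      # which chunks were retrieved
        "confidence": confidence,    # per-extraction confidence score
    }))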

The Architecture I’d Use Today

┌──────────────┐     ┌──────────────┐
│   Document   │────▶│  Preprocess  │
│   Ingestion  │     │  (OCR, PDF)  │
└──────────────┘     └──────┬───────┘
                            │
                            ▼
┌──────────────┐     ┌──────────────┐
│   Chunk +    │◀────│   Storage    │
│   Embed      │     │ (PostgreSQL) │
└──────────────┘     └──────┬───────┘
                            │
                            ▼
┌──────────────┐     ┌──────────────┐
│   Hybrid     │◀────│    Query     │
│   Retrieval  │     │   Handler    │
└──────────────┘     └──────┬───────┘
                            │
                            ▼
                     ┌──────────────┐
                     │   Generate   │
                     │   Response   │
                     └──────────────┘

Simple. Debuggable. Cheap.


Building RAG systems? I’d love to compare notes. Find me on LinkedIn.
