
RAG Architecture Decisions: What I'd Do Differently


I’ve spent the last 6 months building RAG (Retrieval-Augmented Generation) systems. Not toy demos: production systems processing thousands of documents daily.

Here’s my honest retrospective on what worked, what broke, and what I’d do differently.

The Setup

My main project was a document analysis system for enterprise clients. Requirements:

  • Process PDFs, Word docs, scanned images
  • Extract structured data (invoice fields, contract terms)
  • Sub-3-second latency for queries
  • 95%+ accuracy target

Decisions I Got Right

1. PostgreSQL + pgvector Over Managed Vector DBs

The temptation: Use Pinecone, Weaviate, or Qdrant. They’re optimized for vectors.

What I did: PostgreSQL with the pgvector extension.

Why it worked:

  • Cost: Free tier handles 1M+ vectors. Pinecone would cost $70+/month
  • Simplicity: One database for everything. No sync issues.
  • Filtering: Native SQL for metadata filtering is powerful
  • Locality: Data stays in my infrastructure

Trade-off accepted: pgvector is slower than purpose-built vector DBs at massive scale. But I’m not at “massive scale” yet.
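
To make the filtering point concrete, here’s a minimal sketch of the kind of query this setup enables, using psycopg2 and pgvector’s cosine-distance operator. The chunks table, client_id column, and connection string are illustrative, not my actual schema:

import psycopg2

conn = psycopg2.connect("dbname=rag")  # illustrative DSN

def search_chunks(query_embedding: list[float], client_id: int, k: int = 5):
    """Nearest-neighbor search restricted to one client's documents."""
    vec = "[" + ",".join(str(x) for x in query_embedding) + "]"
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT id, content
            FROM chunks
            WHERE client_id = %s               -- plain SQL metadata filter
            ORDER BY embedding <=> %s::vector  -- cosine distance via pgvector
            LIMIT %s
            """,
            (client_id, vec, k),
        )
        return cur.fetchall()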

2. Chunking with Overlap

I use 1000-token chunks with 200-token overlap:

def chunk_document(text: str, chunk_size: int = 1000, overlap: int = 200):
    """Chunk text with overlap (sizes here are characters, not tokens)."""
    chunks = []
    start = 0

    while start < len(text):
        end = start + chunk_size
        chunk = text[start:end]

        # Try to end at a sentence boundary, but only within the last 20% of the chunk
        if end < len(text):
            last_period = chunk.rfind('.')
            if last_period > chunk_size * 0.8:
                end = start + last_period + 1
                chunk = text[start:end]

        chunks.append(chunk)
        if end >= len(text):
            break  # avoid emitting a trailing chunk that is just the overlap
        start = end - overlap

    return chunks

The overlap prevents information loss at chunk boundaries. Worth the 20% storage overhead.

3. Caching Embeddings Aggressively

Embeddings don’t change. Cache them forever.

I use Redis with a simple hash:

  • Key: embed:{hash(text + model_version)}
  • Value: The embedding vector

This reduced my OpenAI costs by 60% for repeat queries.
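
Here’s roughly what that looks like in code, assuming redis-py and the OpenAI Python client; the model name and key prefix are placeholders for whatever you embed with:

import hashlib
import json

import redis
from openai import OpenAI

r = redis.Redis()
client = OpenAI()
EMBED_MODEL = "text-embedding-3-small"  # placeholder: use whatever model you embed with

def get_embedding(text: str) -> list[float]:
    """Return a cached embedding, computing and storing it on a miss."""
    key = "embed:" + hashlib.sha256((text + EMBED_MODEL).encode()).hexdigest()
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)
    embedding = client.embeddings.create(model=EMBED_MODEL, input=text).data[0].embedding
    r.set(key, json.dumps(embedding))  # no TTL: embeddings for a fixed model never change
    return embedding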

Decisions I Got Wrong

1. Starting with 4K Token Context Windows

The mistake: “More context = better retrieval, right?”

What happened: The LLM would hallucinate. Too much context meant distant, irrelevant information bled into the response.

The fix: Dropped to 1K token chunks + rich metadata. Let the retrieval be precise, not comprehensive.

2. Relying on Vector Search Alone

The mistake: “Semantic search is magic! It’ll understand everything!”

What happened: Queries like β€œinvoice #12345” returned random invoices because the embedding didn’t capture the exact number.

The fix: Hybrid search. BM25 for keyword matching + vector for semantic. Combine with reciprocal rank fusion.

def hybrid_search(query: str, k: int = 5):
    # Get both result sets
    vector_results = vector_search(query, k=k*2)
    keyword_results = bm25_search(query, k=k*2)
    
    # Reciprocal Rank Fusion (the +60 is the standard RRF smoothing constant)
    scores = {}
    for rank, doc in enumerate(vector_results):
        scores[doc.id] = scores.get(doc.id, 0) + 1 / (rank + 60)
    for rank, doc in enumerate(keyword_results):
        scores[doc.id] = scores.get(doc.id, 0) + 1 / (rank + 60)

    # Return top k by combined score
    sorted_ids = sorted(scores, key=scores.get, reverse=True)
    return [get_document(doc_id) for doc_id in sorted_ids[:k]]

3. No Evaluation Framework from Day 1

The mistake: “I’ll add tests later”

What happened: Spent 3 weeks iterating based on vibes. “This seems better?”

The fix: Created a test set of 100 queries with expected outputs. Now every change is measurable.
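
The framework doesn’t need to be fancy. A minimal sketch, where answer_query() and queries.json stand in for your own pipeline and fixture file:

import json

def answer_query(question: str) -> str:
    """Placeholder for the real RAG pipeline."""
    return ""

def evaluate(path: str = "queries.json") -> float:
    """Score the pipeline against a fixed query set with expected outputs."""
    with open(path) as f:
        cases = json.load(f)  # [{"query": ..., "expected": ...}, ...]
    correct = sum(answer_query(c["query"]).strip() == c["expected"] for c in cases)
    accuracy = correct / len(cases)
    print(f"{correct}/{len(cases)} correct ({accuracy:.0%})")
    return accuracy

Exact-match scoring is crude, but even a crude number beats iterating on vibes.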

What I’d Do Differently

1. Start with a Simpler Architecture

My initial design had:

  • Separate embedding service
  • Message queue for async processing
  • Complex retry logic

What I actually needed:

  • A single FastAPI app
  • Synchronous processing (latency was fine)
  • PostgreSQL for everything

Lesson: Build the simplest thing first. Add complexity when you hit limits.
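
As a sketch of the shape I mean; the retrieve() and generate() helpers here are placeholders for the hybrid search and LLM call described above:

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class QueryRequest(BaseModel):
    question: str

def retrieve(question: str) -> list[str]:
    """Placeholder: the hybrid search from earlier goes here."""
    return ["example chunk"]

def generate(question: str, chunks: list[str]) -> str:
    """Placeholder: a single synchronous LLM call goes here."""
    return f"Answer grounded in {len(chunks)} chunks."

@app.post("/query")
def query(req: QueryRequest):
    chunks = retrieve(req.question)          # synchronous: no queue, no worker
    answer = generate(req.question, chunks)
    return {"answer": answer, "sources": len(chunks)}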

2. Invest in Document Preprocessing

Garbage in, garbage out. I underestimated how much time I’d spend on:

  • OCR quality for scanned documents
  • Table extraction (still hard)
  • Handling multi-column layouts

Next time, I’d allocate 30% of timeline to preprocessing pipeline.

3. Add Observability Earlier

Debugging “why did the LLM say that?” is hard without traces. I eventually added:

  • Full prompt logging
  • Retrieved chunk logging
  • Confidence scores for each extraction

Should have done this from the start.
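
A minimal version of this is just structured logging; the field names here are illustrative rather than any particular tracing product:

import json
import logging

logger = logging.getLogger("rag.trace")

def log_trace(query: str, prompt: str, chunk_ids: list[str], confidence: float) -> None:
    """Record everything needed to answer 'why did the LLM say that?'."""
    logger.info(json.dumps({
        "query": query,
        "prompt": prompt,            # the full prompt sent to the LLM
        "chunk_ids": chunk_ids,      # which chunks were retrieved
        "confidence": confidence,    # per-extraction confidence score
    }))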

The Architecture I’d Use Today

┌──────────────┐     ┌──────────────┐
│   Document   │────▶│  Preprocess  │
│   Ingestion  │     │  (OCR, PDF)  │
└──────────────┘     └──────┬───────┘
                            │
                            ▼
┌──────────────┐     ┌──────────────┐
│   Chunk +    │◀────│   Storage    │
│   Embed      │     │ (PostgreSQL) │
└──────────────┘     └──────┬───────┘
                            │
                            ▼
┌──────────────┐     ┌──────────────┐
│   Hybrid     │◀────│    Query     │
│   Retrieval  │     │   Handler    │
└──────────────┘     └──────┬───────┘
                            │
                            ▼
                     ┌──────────────┐
                     │   Generate   │
                     │   Response   │
                     └──────────────┘

Simple. Debuggable. Cheap.


Building RAG systems? I’d love to compare notes. Find me on LinkedIn.
