
WebRAG – Scalable RAG Engine

High-concurrency RAG system built with Gemini-embedding-001 and Qdrant for low-latency document retrieval


The Problem

Building production-ready RAG systems that handle high concurrency while maintaining low latency is challenging; most implementations fail to scale beyond the prototype stage.

The goal: Engineer a scalable RAG engine that handles concurrent requests with consistent low-latency retrieval.

Technical Implementation

Architecture Decisions

Built a high-concurrency RAG system with three core components:

  1. Embedding Pipeline: Document ingestion → Gemini-embedding-001 → Qdrant vector storage
  2. Async Processing: FastAPI + Celery for non-blocking document processing
  3. Metadata Persistence: PostgreSQL for document metadata and tracking
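The retrieval half of this pipeline can be sketched in miniature. The sketch below is illustrative only: a hash-based stub stands in for the Gemini-embedding-001 call, and a plain Python list stands in for Qdrant; all function names are made up for the example.

```python
import hashlib
import math

def embed(text: str) -> list[float]:
    # Stub: in production this would call gemini-embedding-001.
    # A deterministic pseudo-vector derived from a hash keeps the demo offline.
    digest = hashlib.sha256(text.encode()).digest()
    return [b / 255 for b in digest[:8]]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# "Vector store": (chunk, vector) pairs standing in for a Qdrant collection.
store: list[tuple[str, list[float]]] = []

def ingest(chunks: list[str]) -> None:
    for chunk in chunks:
        store.append((chunk, embed(chunk)))

def retrieve(query: str, k: int = 3) -> list[str]:
    qv = embed(query)
    ranked = sorted(store, key=lambda item: cosine(qv, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

ingest(["Qdrant stores vectors.", "Celery runs async tasks.", "FastAPI serves requests."])
print(retrieve("Qdrant stores vectors.", k=1))  # prints ['Qdrant stores vectors.']
```

The real system swaps the stub for the embedding API and the list for Qdrant, but the ingest/retrieve shape is the same.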

Key Technical Implementations

| Component | Implementation | Purpose |
| --- | --- | --- |
| Embeddings | Gemini-embedding-001 | High-quality semantic representations |
| Vector DB | Qdrant | Low-latency similarity search |
| Text Chunking | RecursiveCharacterTextSplitter | Optimized chunk sizing for embedding storage |
| Task Queue | Celery + Redis | Async document processing pipeline |
| Deployment | Docker Compose | Reproducible, scalable deployment |

Tech Stack Rationale

Why Qdrant over alternatives?

  • Native support for payload filtering
  • Excellent performance at scale
  • Simple deployment with Docker

Why Celery?

  • Reliable async task execution
  • Redis as broker for fast message passing
  • Easy horizontal scaling for document processing

What I Learned

Things That Worked

  1. Gemini-embedding-001 quality: Consistently high-quality embeddings improved retrieval accuracy significantly.

  2. Async-first architecture: FastAPI + Celery combination handled concurrent loads efficiently.

  3. RecursiveCharacterTextSplitter: LangChain’s intelligent chunking preserved semantic context better than naive splitting.

Things I’d Improve

  1. Add hybrid search: Combine vector search with BM25 for better keyword matching.

  2. Implement caching layer: Cache frequent queries to reduce embedding API calls.

  3. Add evaluation pipeline: Systematic evaluation of retrieval quality.
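The hybrid-search idea in point 1 is often implemented as reciprocal rank fusion over the two ranked lists, one from vector search and one from BM25. A self-contained sketch (the hit lists are hard-coded stand-ins for real Qdrant and keyword-index results):

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal rank fusion: score(d) = sum over lists of 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["doc3", "doc1", "doc2"]  # stand-in for Qdrant similarity results
bm25_hits = ["doc1", "doc4", "doc3"]    # stand-in for keyword-index results
print(rrf([vector_hits, bm25_hits]))    # prints ['doc1', 'doc3', 'doc4', 'doc2']
```

Rank fusion needs no score calibration between the two retrievers, which is why it is a common first step toward hybrid search.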
