RAG Architecture for Production: Choosing the Right Vector Database for Your Mid-Market Stack

What a RAG System Actually Requires

Retrieval-Augmented Generation (RAG) allows an LLM to answer questions about content it was not trained on — your internal documentation, your product data, your customer support history — by retrieving relevant context at inference time and including it in the prompt.

The vector database is the component that stores embedded representations of your content and retrieves the most semantically relevant chunks in response to a query embedding. Choosing the wrong database creates performance problems, operational overhead, or retrieval quality issues that are expensive to fix after launch.

This post covers the decision framework for vector database selection in production RAG systems at companies with 50-200 people.

The Five Decisions That Drive Vector Database Selection

Decision 1: Is vector storage your primary use case or a feature of an existing system?

If you are building a standalone retrieval service — a RAG system that exists as its own component — a purpose-built vector database (Qdrant, Weaviate, Pinecone) is the right starting point.

If you are adding retrieval capability to an existing PostgreSQL-based application, and your team already runs and operates PostgreSQL, pgvector is worth serious consideration. It adds vector storage and similarity search as a PostgreSQL extension, so there is no separate database to operate. At mid-market scale (up to roughly 5 million vectors with well-tuned indexes), its performance is competitive with purpose-built alternatives for moderate query volumes.

The operational simplicity of one database system rather than two is a real advantage for a team with two platform engineers. Do not dismiss pgvector because it is "just an extension" — for many mid-market use cases, it is the correct choice.
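As a rough illustration of how little is needed to add pgvector to an existing PostgreSQL database, the sketch below holds the setup and query SQL as Python strings. The table name `doc_chunks`, the index name, and the 1536-dimension column are hypothetical choices for this example, and the `%(name)s` placeholders assume a psycopg-style driver. Treat it as a starting point under those assumptions, not a tuned schema.

```python
# Sketch only: table, column, and index names are hypothetical.
EMBEDDING_DIM = 1536  # assumed; must match your embedding model's output size

SETUP_STATEMENTS = [
    # Enable the pgvector extension in the current database.
    "CREATE EXTENSION IF NOT EXISTS vector;",
    # Store each chunk next to its embedding.
    f"""CREATE TABLE IF NOT EXISTS doc_chunks (
        id bigserial PRIMARY KEY,
        content text NOT NULL,
        embedding vector({EMBEDDING_DIM}) NOT NULL
    );""",
    # HNSW index for approximate nearest-neighbor search by cosine distance.
    """CREATE INDEX IF NOT EXISTS doc_chunks_embedding_hnsw
        ON doc_chunks USING hnsw (embedding vector_cosine_ops);""",
]

# <=> is pgvector's cosine distance operator; smaller means more similar.
# Placeholders assume a psycopg-style driver with named parameters.
TOP_K_QUERY = """
SELECT id, content
FROM doc_chunks
ORDER BY embedding <=> %(query_embedding)s::vector
LIMIT %(k)s;
"""


def to_vector_literal(values):
    """Format a sequence of numbers as a pgvector text literal, e.g. '[0.1,2.0]'."""
    return "[" + ",".join(str(float(v)) for v in values) + "]"
```

With a driver such as psycopg, the statements in `SETUP_STATEMENTS` would be executed once at migration time, and `TOP_K_QUERY` would be run per request with `to_vector_literal(query_embedding)` bound to `query_embedding`.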

Decision 2: What scale do you need to support?

Vector database performance characteristics diverge significantly at different scales:

Under 1 million vectors: All major options perform adequately. This is where most mid-market RAG systems operate.
1-10 million vectors: Purpose-built options (Qdrant, Weaviate) begin to show meaningful performance advantages over pgvector for high-QPS retrieval. pgvector remains viable with HNSW indexing.
Above 10 million vectors: Managed services with distributed architecture (Pinecone, Weaviate Cloud) or self-hosted Qdrant with proper resource allocation are the practical options.

Most internal enterprise RAG systems — knowledge bases, documentation retrieval, support ticket history — operate in the 100,000-2,000,000 vector range. Scale is rarely the differentiating factor at mid-market.
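To put the scale tiers above in concrete terms, a back-of-the-envelope estimate of raw vector storage can help. The sketch below assumes float32 values (4 bytes each) and 1536-dimensional embeddings. It counts only the raw vectors, so index structures, metadata, and replication would add to these figures.

```python
def raw_vector_bytes(num_vectors: int, dimensions: int, bytes_per_value: int = 4) -> int:
    """Bytes needed for the raw vectors alone (float32 by default)."""
    return num_vectors * dimensions * bytes_per_value


def to_gib(num_bytes: int) -> float:
    """Convert a byte count to gibibytes."""
    return num_bytes / (1024 ** 3)


# Vector counts taken from the ranges discussed above; 1536 dims is an assumption.
for count in (100_000, 2_000_000, 10_000_000):
    size = raw_vector_bytes(count, dimensions=1536)
    print(f"{count:>10,} vectors x 1536 dims: {to_gib(size):.2f} GiB raw")
```

Under these assumptions the raw vectors come to about 0.57 GiB at 100,000 vectors, 11.44 GiB at 2 million, and 57.22 GiB at 10 million, which is one reason the lower tiers fit comfortably on a single node.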

Decision 3: What is your team's operational capacity?

Self-hosted vector databases require deployment, monitoring, backup, and upgrade management. For a team with limited platform engineering capacity, managed services reduce operational burden at the cost of higher per-unit pricing.

Managed options: Pinecone (fully managed, minimal operational overhead), Weaviate Cloud, Qdrant Cloud. All three abstract away infrastructure management.

Self-hosted options: Qdrant (Docker or Kubernetes, well-documented), Weaviate (Kubernetes deployment, more complex), pgvector (PostgreSQL extension, lowest operational overhead if you already run Postgres).

For teams with existing Kubernetes operations experience and a need to minimize third-party managed service costs, self-hosted Qdrant is the most operationally straightforward purpose-built option. For teams without Kubernetes operations experience, a managed service is worth the cost premium.

Decision 4: What metadata filtering do you need?

RAG retrieval almost always requires filtering — "find the 10 most relevant documents from team X" or "retrieve context from documents published after date Y" are common retrieval patterns that require filtering on metadata alongside vector similarity.

Not all vector databases handle metadata filtering equally well. Qdrant has particularly strong support for complex payload filtering with minimal performance degradation. Pinecone and Weaviate support metadata filtering well. pgvector combined with PostgreSQL WHERE clauses is flexible, but filtered queries over high-cardinality metadata require careful index design.

If your RAG system will need to filter on multiple metadata dimensions (author, date range, document type, access permissions), test your specific filtering patterns on candidate databases before committing.
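One way to test filtering patterns is to compare each candidate database against an exact, brute-force baseline on a sample of your data. The sketch below is such a baseline in plain Python: it applies a metadata predicate first, then ranks the survivors by cosine similarity. The `Chunk` type, the field names `team` and `year`, and the sample data are all hypothetical.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Chunk:
    chunk_id: str
    embedding: list[float]
    metadata: dict = field(default_factory=dict)


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def filtered_top_k(query, chunks, predicate, k=10):
    """Exact search: keep chunks whose metadata passes the predicate, then rank."""
    candidates = [c for c in chunks if predicate(c.metadata)]
    ranked = sorted(
        candidates,
        key=lambda c: cosine_similarity(query, c.embedding),
        reverse=True,
    )
    return [c.chunk_id for c in ranked[:k]]


# Hypothetical sample data with two metadata dimensions.
chunks = [
    Chunk("a", [1.0, 0.0], {"team": "platform", "year": 2025}),
    Chunk("b", [0.9, 0.1], {"team": "support", "year": 2025}),
    Chunk("c", [0.0, 1.0], {"team": "platform", "year": 2023}),
    Chunk("d", [0.7, 0.7], {"team": "platform", "year": 2025}),
]
result = filtered_top_k(
    [1.0, 0.0],
    chunks,
    lambda m: m["team"] == "platform" and m["year"] >= 2024,
    k=2,
)
print(result)  # ['a', 'd']
```

Running the same query and filter against a candidate database and comparing its result IDs with this baseline shows whether filtering costs you recall, and timing both paths shows what it costs in latency.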

Decision 5: Does the rest of your stack create a forcing function?

Teams running entirely on GCP may find Vertex AI Vector Search integrated enough with their existing infrastructure to be the lowest-friction choice. Teams running AWS with OpenSearch already deployed may find OpenSearch k-NN the path of least resistance.

Ecosystem integration is a legitimate selection criterion. The best vector database for your team is often the one that requires the fewest new operational patterns.

The Four Most Common Mid-Market Choices

pgvector: Best for teams already running PostgreSQL who do not want to operate a new service. Good performance at up to 5 million vectors with HNSW indexing. Free and open-source. Limitations: not purpose-built for vector-heavy workloads, may require PostgreSQL tuning expertise for high-QPS retrieval.

Qdrant (self-hosted): Best for teams with Kubernetes operations experience who want a purpose-built database without managed service costs. Strong filtering support, good performance at scale, active development, and excellent documentation. Free and open-source.

Qdrant Cloud or Weaviate Cloud: Best for teams that want a purpose-built database without operational overhead. Predictable pricing, no infrastructure management. Worth the cost premium if platform engineering capacity is the binding constraint.

Pinecone: Best for teams that want zero operational overhead and have a straightforward retrieval use case (no complex metadata filtering requirements). Highest cost per vector among the options but the lowest operational burden. Appropriate for early-stage RAG systems where time-to-production is the priority.

Embedding Model Selection

The vector database is only one half of the retrieval equation. Embedding quality determines what the database can find.

For most enterprise text RAG use cases in 2026, reasonable defaults are OpenAI text-embedding-3-small (1536 dimensions, best cost-to-performance ratio for general text) or Cohere embed-english-v3.0 for higher retrieval accuracy at higher cost. For multilingual content, consider OpenAI text-embedding-3-large or Cohere embed-multilingual-v3.0.

Avoid the mistake of selecting the highest-dimensional model by default. Higher dimensions mean higher storage cost, higher memory usage, and slower retrieval. Test retrieval accuracy on your specific domain content before committing to a high-dimensional embedding model.
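Testing retrieval accuracy can be as simple as computing recall@k over a small set of hand-labeled queries for each candidate embedding model. The sketch below assumes you already have, for each test query, the ranked IDs a model retrieved and the IDs a human marked as relevant. The document IDs shown are hypothetical.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant chunks that appear in the top-k retrieved results."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("relevant_ids must not be empty")
    top_k = set(retrieved_ids[:k])
    return len(top_k & relevant) / len(relevant)


def mean_recall_at_k(results, k):
    """results: list of (retrieved_ids, relevant_ids) pairs, one per test query."""
    scores = [recall_at_k(retrieved, relevant, k) for retrieved, relevant in results]
    return sum(scores) / len(scores)


# Hypothetical labeled results for one embedding model.
model_runs = [
    (["d1", "d7", "d3"], ["d1", "d3"]),  # both relevant chunks found
    (["d9", "d2", "d4"], ["d4", "d5"]),  # one of two relevant chunks found
]
print(mean_recall_at_k(model_runs, k=3))  # 0.75
```

Computing the same score for a lower-dimensional and a higher-dimensional model on identical queries shows whether the extra dimensions buy a measurable improvement on your content.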

Chunking Strategy

Before vectors reach the database, content must be split into chunks. Chunk size affects retrieval quality significantly:

Too small (under 100 tokens): Chunks lack sufficient context for the LLM to use effectively.
Too large (over 600 tokens): Retrieval becomes less precise as each chunk covers too many topics.
Recommended starting point: 300-400 tokens with 50-token overlap between adjacent chunks.

Fixed-size chunking with overlap is the correct starting point for most RAG systems. Semantic chunking — splitting at semantic boundaries rather than fixed token counts — produces marginally better retrieval accuracy for well-structured documents but adds implementation complexity that is rarely justified at the initial production deployment.
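A minimal sketch of fixed-size chunking with overlap is shown below. It operates on a list of tokens that has already been produced. In practice the tokens would come from the tokenizer of your embedding model; the placeholder tokens here are an assumption made to keep the example self-contained. The default of 350 tokens is simply a midpoint of the 300-400 range suggested above.

```python
def chunk_tokens(tokens, chunk_size=350, overlap=50):
    """Split a token list into fixed-size chunks that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - overlap  # how far each new chunk starts from the last
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a chunk reaches the end, so no redundant tail chunk is emitted.
        if start + chunk_size >= len(tokens):
            break
    return chunks


# Placeholder tokens standing in for real tokenizer output.
tokens = [f"tok{i}" for i in range(1000)]
chunks = chunk_tokens(tokens, chunk_size=350, overlap=50)
print([len(c) for c in chunks])  # [350, 350, 350, 100]
```

Each chunk begins with the last 50 tokens of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk.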

Frequently Asked Questions

What is the difference between Qdrant and pgvector for a production RAG system?

Qdrant is a purpose-built vector database with a dedicated query engine, strong metadata filtering, and horizontal scaling support. pgvector is a PostgreSQL extension that adds vector storage and similarity search to an existing relational database. For teams already operating PostgreSQL at mid-market scale, pgvector is often the correct choice. For teams building a dedicated retrieval service, Qdrant provides better performance at high query volumes.

How many vectors do most mid-market RAG systems actually use?

Internal knowledge bases and documentation retrieval systems at 50-200 person companies typically index 50,000-500,000 vectors. Customer support history and product data indexes range from 500,000 to 5 million vectors. Very few mid-market RAG systems require the distributed architectures that become necessary above 10 million vectors.

Should we use a different embedding model for different content types?

Potentially. Embedding models trained on general text may underperform on code, legal documents, or highly technical domain content. If retrieval accuracy is unsatisfactory with a general embedding model, evaluate domain-specific or fine-tuned alternatives before investing in more complex architectural changes.

When should we consider a managed vector database over self-hosted?

When the platform engineering team does not have the capacity to own another service. A managed vector database at $200-500 per month typically costs less than the engineering time needed to operate a self-hosted alternative, especially if that time is better spent on other infrastructure.

If you are designing a RAG system and want a review of your architecture decisions before committing to an implementation, we offer a free AI architecture review for mid-market engineering teams.

Mohak Deep Singh
