Skip to content

Vector Index Architecture Migration Plan: Pinecone to Supabase pgvector

1. Overview and Motivation

Currently, the vector database (Pinecone) partitions curriculum data into hard isolation boundaries using namespaces (e.g., grade-elementary-6-geography). While this design natively filters results, it prevents true broad Retrieval-Augmented Generation (RAG) across the entire curriculum and introduces architectural complexity with dual storage (vectors in Pinecone, text/metadata in Postgres).

The Goal: Migrate away from Pinecone and replace it with Supabase pgvector. This simplifies the architecture by unifying data storage and vector search within the same Postgres database. It naturally solves the "namespace" problem by storing vectors alongside chunks, allowing true broad semantic retrieval across the entire knowledge base, utilizing metadata to filter or softly boost (bias) results based on user profiles or request hints, and natively combining semantic and lexical searches in SQL.


2. Target Architecture

2.1. Vector DB Schema (Postgres)

  • Target Store: Supabase Postgres (chunks or embeddings table).
  • Column: Add an embedding vector(1536) column to the appropriate table.
  • Index: Create an hnsw or ivfflat index on the embedding column for fast vector similarity search using cosine distance (<=>).
  • Existing Metadata: Metadata filtering is inherently solved because the vectors live in the same row as all document metadata (grade, subject, document_id, language, etc.).

2.2. Query Execution Paradigm

  • Global Search Default: Semantic queries search the entire table by default.
  • Hybrid RPC: Create a single Postgres RPC (e.g., search_chunks_hybrid) that:
    1. Retrieves: Performs a vector similarity search (<=>) for the top $K$ semantic matches and a full-text search (websearch_to_tsquery) for the top $K$ lexical matches.
    2. RRF Fusion: Applies Reciprocal Rank Fusion (RRF) natively in SQL to blend the ranked semantic and lexical sets.
    3. Profile Bias / Filter: Optionally accepts p_grade and p_subject parameters to strictly filter (WHERE clause) or softly boost results.
  • LLM Rerank: Pass the top fused candidates to the GPT-mini reranker for final semantic validation in the Python application.

3. Component Migration Breakdown

Phase 1: Database Schema Updates

  • Target Files: New DB Migration (db/migrations/XXXX_add_vector_column.sql).
  • Changes:
    • Enable pgvector extension if not already enabled.
    • Add the embedding column (type vector(1536)) to the table storing text chunks (e.g., chunks table).
    • Create an index (HNSW recommended) for optimal distance queries.

Phase 2: Ingestion Pipeline Consolidation

  • Target Files: app/services/embedding_service.py, app/services/ingestion.py.
  • Changes:
    • Deprecate/Remove pinecone_adapter.py.
    • Update ingestion logic to write OpenAI embeddings directly to Supabase via the Python Supabase client during the chunk insertion phase.

Phase 3: Unified Hybrid Retrieval Pipeline

  • Target Files: app/services/retrieval_pipeline.py, New DB Migration for RPC.
  • Changes:
    • Write a new stored procedure search_chunks_hybrid that performs the combined vector/text search and RRF.
    • Update retrieval_pipeline.py to call the single Supabase RPC instead of coordinating Pinecone (semantic) and Supabase (lexical) separately.

Phase 4: Data Wiping and Re-ingestion

  • Strategy: Because this is a pre-production environment, no backfill script is needed.
  • Action: Wipe existing chunks/documents and run the standard ingestion pipeline to re-process the curriculum, creating the new pgvector embeddings natively.

4. Rollout Strategy & Risk Mitigation

  1. Code Deletion: Safely remove all Pinecone dependencies (pinecone-client from requirements.txt).
  2. Validation: Run the existing test suites (promptfoo or pytest) to verify that hybrid retrieval still meets the relevance baseline using the new pgvector implementation.
  3. Simplicity: Rely on the fact that production data does not exist to fast-track this migration without complex dual-write/dual-read logic.

5. Explicit Validation Plan

  1. Unit Tests:
    • Verify that retrieval_pipeline.py only calls Supabase and no longer references Pinecone.
    • Update ingestion pipeline tests to assert embeddings are passed to Supabase.
  2. Database Index Check: Ensure EXPLAIN ANALYZE on the hybrid RPC uses the vector index rather than an exact sequential scan when tables grow.
  3. End-to-End Broad Retrieval Test:
    • Perform a manual chat query with no grade or subject provided.
    • Verify the response pulls context seamlessly across the entire table.

6. Explicit Go/No-Go Checklist

  • [ ] Supabase project supports pgvector extension.
  • [ ] Pinecone SDK removed from dependencies.
  • [ ] New search_chunks_hybrid RPC correctly implements RRF blending in SQL.
  • [ ] Ingestion pipeline modified to store vectors in Postgres directly.