Question 1

How do I evaluate which embedding model works best for my data?

Accepted Answer

MTEB benchmark scores are a useful starting point, but they don't always predict performance on domain-specific data. Build a small evaluation set of 50-200 queries, each paired with the documents you expect it to retrieve. Run every candidate model against that set and compute mean reciprocal rank (MRR) or recall@k. This takes roughly 2-3 hours and gives you a basis for model selection that generic benchmarks cannot.
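As a minimal sketch of that comparison, the snippet below computes MRR and recall@k from ranked result lists. The query IDs, document IDs, and the two "models" are made-up placeholders; in practice each ranked list would come from embedding your queries with a candidate model and searching your own index.

```python
from typing import Dict, List, Set


def reciprocal_rank(ranked_ids: List[str], relevant: Set[str]) -> float:
    """Return 1/rank of the first relevant document, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked_ids: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(ranked_ids[:k]) & relevant) / len(relevant)


def evaluate(
    results: Dict[str, List[str]], qrels: Dict[str, Set[str]], k: int = 5
) -> Dict[str, float]:
    """Average both metrics over every query in the evaluation set."""
    if not qrels:
        raise ValueError("evaluation set is empty")
    n = len(qrels)
    mrr = sum(reciprocal_rank(results.get(q, []), rel) for q, rel in qrels.items()) / n
    recall = sum(recall_at_k(results.get(q, []), rel, k) for q, rel in qrels.items()) / n
    return {"mrr": mrr, f"recall@{k}": recall}


# Hypothetical evaluation set: query ID -> IDs of the documents that should come back.
qrels = {"q1": {"d1"}, "q2": {"d3", "d4"}}

# Hypothetical ranked results from two candidate models.
model_a = {"q1": ["d2", "d1", "d5"], "q2": ["d3", "d9", "d4"]}
model_b = {"q1": ["d1", "d2", "d5"], "q2": ["d3", "d4", "d9"]}

scores_a = evaluate(model_a, qrels, k=2)
scores_b = evaluate(model_b, qrels, k=2)
print("model A:", scores_a)
print("model B:", scores_b)
```

MRR rewards putting the first relevant document near the top, while recall@k measures how many of the expected documents make it into the top k at all, so it is usually worth reporting both.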

Question 2

Should I use the same embedding model for storage and query?

Accepted Answer

Yes. Embeddings from different models live in different vector spaces and are not comparable: storing documents with model A and querying with model B produces meaningless similarity scores. If you change embedding models, you must re-embed the entire corpus. This is one reason to choose a model carefully before going to production: re-embedding 1 million documents costs roughly $100-150 at OpenAI pricing (the exact figure depends on the model and on average document length) and takes hours of compute time.
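To adapt that cost figure to your own corpus, a back-of-the-envelope estimate is enough. The sketch below is only arithmetic; the document count, average token count, and per-token price are illustrative placeholders rather than quoted prices, so substitute your provider's current rate.

```python
def estimate_reembedding_cost(
    num_docs: int,
    avg_tokens_per_doc: float,
    price_per_million_tokens_usd: float,
) -> float:
    """Estimated cost in USD of embedding the whole corpus once."""
    total_tokens = num_docs * avg_tokens_per_doc
    return total_tokens / 1_000_000 * price_per_million_tokens_usd


# Placeholder inputs (assumptions, not measured values or quoted prices).
cost = estimate_reembedding_cost(
    num_docs=1_000_000,
    avg_tokens_per_doc=1_000,
    price_per_million_tokens_usd=0.13,
)
print(f"Estimated re-embedding cost: ${cost:.2f}")
```

Because the estimate scales linearly with both document length and price, halving the average chunk size or switching to a cheaper model changes the total proportionally.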

Question 3

What is a matryoshka embedding?

Accepted Answer

Matryoshka Representation Learning (MRL) trains a single model so that prefixes of its embedding are themselves usable embeddings: the first 256 dimensions are meaningful, the first 512 are better, and the full vector (3072 dimensions for text-embedding-3-large) is best. OpenAI's text-embedding-3 models support this via the `dimensions` parameter. You can store 256-dim vectors, answer most queries accurately, and fall back to full-dimension vectors for specific documents only when needed.
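If you already hold full-length vectors from an MRL-trained model, shortening them yourself amounts to keeping the leading dimensions and re-normalizing to unit length so that cosine or dot-product scores stay comparable. The sketch below shows that step on a toy vector; the function name and the sample values are illustrative, and this approach is only expected to work well for models trained with MRL.

```python
import math
from typing import List, Sequence


def truncate_embedding(vec: Sequence[float], dims: int) -> List[float]:
    """Keep the first `dims` values of an embedding and rescale to unit length."""
    if not 0 < dims <= len(vec):
        raise ValueError(f"dims must be between 1 and {len(vec)}")
    head = list(vec[:dims])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # an all-zero prefix cannot be normalized
    return [x / norm for x in head]


# Toy 3-dimensional "embedding" standing in for a real 3072-dimensional one.
full = [3.0, 4.0, 12.0]
short = truncate_embedding(full, 2)
print(short)
```

Queries must be truncated to the same length as the stored vectors; mixing 256-dim and full-length vectors in one similarity computation is not meaningful.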

Embedding Model Picker: Match Your Model to Your Retrieval Task
