Benchmarking Text Embeddings
Intro
Embeddings are the foundation of modern AI applications — from semantic search and clustering to classification and retrieval. But how do we measure how "good" an embedding really is? In this article, we explore key benchmarks designed specifically to evaluate the quality of text embeddings across various natural language processing (NLP) tasks. You'll learn about well-established suites like MTEB and BEIR, as well as task-specific evaluations for semantic similarity, classification, and clustering. Each section includes a clear description, use case, and Python code sample to help you get started with embedding evaluation in real-world settings.
MTEB (Massive Text Embedding Benchmark)
MTEB is a benchmark suite and evaluation framework introduced by researchers at Hugging Face and others to evaluate how well text embedding models perform across a wide range of natural language processing (NLP) tasks. MTEB focuses on tasks that rely on high-quality text embeddings, such as classification, clustering, retrieval, reranking, and semantic textual similarity. The benchmark supports over 100 datasets in more than 100 languages and is designed to test both English and multilingual models. It is implemented as a Python library and currently does not have a native JavaScript or Node.js interface, though embeddings generated from any language can be evaluated if formatted properly.
from mteb import MTEB
from sentence_transformers import SentenceTransformer
# Load a pre-trained sentence embedding model
model = SentenceTransformer('all-MiniLM-L6-v2')
# Select benchmark tasks (e.g., STSBenchmark)
evaluation = MTEB(tasks=["STSBenchmark"])
# Run the benchmark; a SentenceTransformer model can be passed directly,
# and per-task results are written as JSON files to the output folder
results = evaluation.run(model, output_folder="mteb_results/")
BEIR (Benchmarking Information Retrieval)
BEIR is a benchmark suite and evaluation framework designed to assess the performance of retrieval models on a variety of information retrieval (IR) tasks. It focuses on tasks like passage retrieval, question answering, fact checking, and argument retrieval across diverse domains such as news, scientific articles, and web data. Unlike other benchmarks, BEIR is specifically tailored for evaluating how well models retrieve relevant documents given a natural language query. It includes over a dozen datasets, mostly in English, and is implemented as a Python library. While there is no native JavaScript or Node.js support, embeddings generated in any language can still be benchmarked if integrated properly.
from beir import util, LoggingHandler
from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval import models
from beir.retrieval.evaluation import EvaluateRetrieval
from beir.retrieval.search.dense import DenseRetrievalExactSearch as DRES
import logging
# Setup logging and download the dataset
logging.basicConfig(format='%(asctime)s - %(message)s', level=logging.INFO, handlers=[LoggingHandler()])
dataset = "trec-covid"  # You can change to 'scifact', 'nfcorpus', etc.
url = f"https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{dataset}.zip"
data_path = util.download_and_unzip(url, "./datasets")
# Load dataset and wrap a dense retrieval model
corpus, queries, qrels = GenericDataLoader(data_path).load(split="test")
model = DRES(models.SentenceBERT("msmarco-distilbert-base-v3"), batch_size=32)
# Retrieval using exact dense (cosine-similarity) search
retriever = EvaluateRetrieval(model, score_function="cos_sim", k_values=[1, 3, 5, 10])
results = retriever.retrieve(corpus, queries)
# Evaluate performance (NDCG@k, MAP@k, Recall@k, Precision@k)
ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)
print(ndcg, _map, recall, precision)
STS (Semantic Textual Similarity)
STS Benchmark is a benchmark dataset and evaluation framework designed to assess how well models can measure the semantic similarity between pairs of text. It primarily focuses on semantic similarity tasks, such as paraphrase detection and sentence-level meaning comparison. The benchmark has been widely used in evaluating embedding models and semantic search systems. Unlike broader benchmarks, STS is narrowly focused on how similar two texts are in meaning, rather than supporting tasks like question answering or document classification. It includes several English-language datasets collected over multiple years (e.g., STS 2012–2017), and is typically used via Python-based libraries such as SentenceTransformers. While there is no native JavaScript or Node.js implementation, embeddings generated elsewhere can be evaluated against the benchmark if properly formatted.
from sentence_transformers import SentenceTransformer, util
# Load a pretrained model
model = SentenceTransformer('all-MiniLM-L6-v2')
# Example sentence pairs
sentence1 = "The cat sits outside."
sentence2 = "A feline is sitting outdoors."
# Encode the sentences
emb1 = model.encode(sentence1, convert_to_tensor=True)
emb2 = model.encode(sentence2, convert_to_tensor=True)
# Compute cosine similarity
cosine_score = util.cos_sim(emb1, emb2)
print(f"Similarity Score: {cosine_score.item():.4f}")
Classification
Classification is a type of benchmark task used to evaluate how well models can assign predefined labels to input text based on its content. It is commonly used to assess models on applications like spam detection, sentiment analysis, and topic categorization. In benchmarks such as MTEB, classification tasks help determine the quality of a model's text embeddings by testing how well simple classifiers (e.g., logistic regression) can separate different categories based on those embeddings. Performance is typically evaluated using metrics such as accuracy, precision, recall, and F1-score.
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
# Example texts and labels
texts = ["I love this!", "This is awful.", "Great product", "Terrible service", "Happy with it", "Not worth it"]
labels = [1, 0, 1, 0, 1, 0] # 1 = positive, 0 = negative
# Encode texts
model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(texts)
# Train/test split
X_train, X_test, y_train, y_test = train_test_split(embeddings, labels, test_size=0.33, random_state=42, stratify=labels)
# Train classifier
clf = LogisticRegression()
clf.fit(X_train, y_train)
# Predict and evaluate
y_pred = clf.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
Clustering
Clustering is a type of benchmark task used to evaluate how well models can group semantically similar data points together without labeled supervision. It is commonly used to assess models on applications like document clustering, topic modeling, and anomaly detection. In benchmarks such as MTEB, clustering tasks help determine the quality of a model's text embeddings by applying unsupervised clustering algorithms (e.g., k-means or hierarchical clustering) to group similar text samples. Performance is typically evaluated using metrics such as the silhouette score, which measures how well separated the clusters are, and the adjusted Rand index (ARI) and normalized mutual information (NMI), which measure how well the discovered clusters align with known class labels.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
# Example texts
texts = [
"I love machine learning",
"Natural language processing is amazing",
"AI is the future",
"Pizza is delicious",
"I enjoy Italian food",
"Pasta and pizza are my favorites"
]
# Step 1: Generate embeddings
model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(texts)
# Step 2: Apply KMeans clustering
num_clusters = 2
kmeans = KMeans(n_clusters=num_clusters, random_state=42)
labels = kmeans.fit_predict(embeddings)
# Step 3: Evaluate clustering performance
score = silhouette_score(embeddings, labels)
print("Silhouette Score:", round(score, 3))
# Optional: Show clustered results
for i, text in enumerate(texts):
print(f"Cluster {labels[i]}: {text}")
Summary
Text embedding benchmarks provide a practical way to evaluate the semantic power of models beyond traditional accuracy metrics. Frameworks like MTEB offer a comprehensive view across retrieval, classification, and clustering tasks, while specialized benchmarks like STS and BEIR help focus on specific capabilities such as similarity scoring or document ranking. These tools are critical when choosing or fine-tuning models for search engines, chatbots, recommendation systems, or any application that depends on vectorized language understanding. Whether you're building in Python or generating embeddings in another language, these benchmarks ensure your vectors are not just fast—but smart.