LLM Engines
A reference list of LLM and embedding models by provider: generative models, reasoners, embeddings, and rerankers. Each entry lists modalities, parameters, API or repo id, use cases, deployment, pricing, benchmark metrics where public numbers exist (vendor papers, MTEB, or common eval suites), and a coarse Scale class.
Scale is a capacity class: tiny, medium, or large. Top vendor lines also use large / frontier; teacher–student distillations use tiny / distilled, medium / distilled, or large / distilled, matching backbone size (details in prose where it matters). Economical embedding tiers use tiny / medium; strong API defaults use large; flagship lines (e.g. OpenAI text-embedding-3-large, Cohere embed-v4, Voyage voyage-3-large / voyage-code-3, Gemini Embedding 2) use large / frontier. Rerankers and niche tools are often left N/A.
- text-embedding-3-small · OpenAI · Embedding
Best for: RAG, semantic search, classification, recommendations, clustering, near-duplicate detection
Price: ~$0.02/1M tokens
Deployment: API
Modalities: Text-only
Parameters: 1,536 dims, 8,191 tokens
Pros: Strong MTEB scores, dimension reduction support
Cons: API-only (no local), text-only
Model: text-embedding-3-small
Scale: medium
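The dimension reduction support noted above refers to the `dimensions` request parameter on the text-embedding-3 family; a minimal sketch, assuming the `openai` Python SDK and an `OPENAI_API_KEY` in the environment (the cosine helper is illustrative):

```python
# Sketch: OpenAI embeddings with server-side dimension reduction.
# Assumes the `openai` SDK is installed and OPENAI_API_KEY is set.
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def embed(texts: list[str], dims: int = 512) -> list[list[float]]:
    """Embed texts, shortening vectors from 1,536 to `dims` server-side."""
    from openai import OpenAI  # deferred so the pure helper works offline
    client = OpenAI()
    resp = client.embeddings.create(
        model="text-embedding-3-small",
        input=texts,
        dimensions=dims,  # supported by the text-embedding-3 family
    )
    return [d.embedding for d in resp.data]
```

Shortened vectors trade a little accuracy for smaller indexes; compare with `cosine` on a held-out query set before committing to a dimension.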
Benchmarking: (none listed)
- text-embedding-3-large · OpenAI · Embedding
Best for: RAG, semantic search, classification, recommendations, multilingual search, complex queries
Price: ~$0.13/1M tokens
Deployment: API
Modalities: Text-only
Parameters: 3,072 dims, 8,191 tokens
Pros: Higher quality than 3-small, strong MTEB scores, dimension reduction support
Cons: API-only (no local), higher cost (~$0.13/1M tokens), text-only
Model: text-embedding-3-large
Scale: large / frontier
Benchmarking: (none listed)
- CLIP (open source) · OpenAI · Embedding
Best for: Text–image retrieval, image search by text, zero-shot image classification, cross-modal similarity
Price: Free (open source)
Deployment: Local (Hugging Face, PyTorch)
Modalities: Text + image
Parameters: Depends on variant (ViT-B/32, ViT-L/14, etc.); ~224×224 images; 512–768 dims
Pros: Open source, runs locally, strong text–image alignment, widely used
Cons: Text + image only (no video/audio), older than newer multimodal models
Model: openai/clip-vit-base-patch32, laion/CLIP-ViT-L-14, etc.
Scale: N/A
Benchmarking:
- ImageNet zero-shot (ViT-B/32, typical): ~76.2%
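Zero-shot classification with CLIP scores an image against free-text labels; a sketch assuming `transformers`, `torch`, and `Pillow` are installed (the checkpoint downloads from Hugging Face on first use, and the softmax helper is illustrative):

```python
# Sketch: zero-shot image classification with an open CLIP checkpoint.
import math

def softmax(xs: list[float]) -> list[float]:
    """Turn raw image-text similarity logits into probabilities."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def clip_zero_shot(image_path: str, labels: list[str]) -> list[float]:
    """Score an image against free-text labels with CLIP ViT-B/32."""
    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor
    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
    inputs = processor(text=labels, images=Image.open(image_path),
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image[0]
    return softmax(logits.tolist())
```

The returned list lines up with `labels`; the highest probability is the predicted class.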
- GPT-4o · OpenAI · Generative
Best for: Chat, coding, analysis, general tasks, vision
Price: ~$2.50/1M input, $10/1M output
Deployment: API
Modalities: Text + image
Parameters: 128K context, multimodal (text + image)
Pros: Fast, strong performance, vision
Cons: Higher cost than mini
Model: gpt-4o
Scale: large
Benchmarking:
- MMLU (5-shot, approx.): ~87%
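A minimal chat call against gpt-4o, assuming the `openai` Python SDK and an `OPENAI_API_KEY`; the helper name and system prompt are illustrative:

```python
# Sketch: minimal chat completion against gpt-4o with the OpenAI SDK.

def ask(prompt: str, model: str = "gpt-4o") -> str:
    from openai import OpenAI
    client = OpenAI()
    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "You are a concise assistant."},
            {"role": "user", "content": prompt},
        ],
    )
    return resp.choices[0].message.content
```

Swapping `model="gpt-4o-mini"` is the usual cost lever for high-volume traffic.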
- GPT-4o mini · OpenAI · Generative
Best for: Chat, simple tasks, high volume, cost-sensitive use
Price: ~$0.15/1M input, $0.60/1M output
Deployment: API
Modalities: Text + image
Parameters: 128K context, multimodal (text + image)
Pros: Cheaper, fast
Cons: Less capable than GPT-4o
Model: gpt-4o-mini
Scale: tiny
Benchmarking:
- MMLU (approx.): ~82%
- GPT-5.4 mini · OpenAI · Reasoning / thinker
Best for: Lightweight reasoning, high-volume agents, coding, cost-sensitive workloads
Price: ~$0.75/1M input, $4.50/1M output
Deployment: API
Modalities: Text + image (input)
Parameters: 400K context, reasoning tokens, image input
Pros: Strong mini tier, tools (Responses API), lower cost than GPT-5.4
Cons: Less capable than GPT-5.4; not a reranker
Model: gpt-5.4-mini
Scale: medium
Benchmarking:
- SWE-Bench Pro (public): 54.4%
- Terminal-Bench 2.0: 60.0%
- GPQA Diamond: 88.0%
- OSWorld-Verified: 72.1%
- MCP Atlas: 57.7%
- Toolathlon: 42.9%
- GPT-5.5 · OpenAI · Reasoning / thinker
Best for: Complex reasoning, coding, flagship multimodal when cost is secondary to quality
Price: Premium tier; confirm on pricing page
Deployment: API
Modalities: Text + image (input)
Parameters: Responses API; see docs for context caps, multimodal inputs, reasoning settings
Pros: Top capability tier in the GPT-5.x line for reasoning and coding in many setups
Cons: Highest cost vs smaller GPT-5.x variants; versioning follows OpenAI release notes
Model: gpt-5.5
Scale: large / frontier
Benchmarking:
- SWE-Bench Pro (public): 58.6%
- Terminal-Bench 2.0: 82.7%
- Expert-SWE (internal): 73.1%
- GDPval (wins or ties): 84.9%
- OSWorld-Verified: 78.7%
- GPQA Diamond: 93.6%
- GPT-5.5 pro · OpenAI · Reasoning / thinker
Best for: Hardest tasks where the GPT-5.5 base ceiling is insufficient; maximal precision tier
Price: Above GPT-5.5; confirm on pricing page
Deployment: API
Modalities: Text + image (input)
Parameters: Pro-tier GPT-5.5; quotas and rollout per docs
Pros: Stronger frontier behavior than base GPT-5.5 where Pro is enabled
Cons: Latency and pricing above the GPT-5.5 and GPT-5.4 tiers
Model: gpt-5.5-pro
Scale: large / frontier
Benchmarking:
- GDPval (wins or ties): 82.3%
- BrowseComp: 90.1%
- FrontierMath Tier 1–3: 52.4%
- Humanity's Last Exam (no tools): 43.1%
- Humanity's Last Exam (with tools): 57.2%
- GeneBench: 33.2%
- GPT-5.4 · OpenAI · Reasoning / thinker
Best for: Coding, agents, strong quality without GPT-5.5 pricing; broader context than GPT-5.4 mini alone
Price: Above GPT-5.4 mini/nano; confirm on pricing page
Deployment: API
Modalities: Text + image (input)
Parameters: Reasoning-capable GPT-5.4 flagship (non-mini); limits per Responses API docs
Pros: Best quality in the GPT-5.4 SKU family excluding the Pro tier
Cons: More expensive than the mini/nano tiers for trivial prompts
Model: gpt-5.4
Scale: large
Benchmarking:
- SWE-Bench Pro (public): 57.7%
- Terminal-Bench 2.0: 75.1%
- Expert-SWE (internal): 68.5%
- GDPval (wins or ties): 83.0%
- OSWorld-Verified: 75.0%
- GPQA Diamond: 92.8%
- GPT-5.4 pro · OpenAI · Reasoning / thinker
Best for: Highest discipline on tool use / reasoning vs base GPT-5.4 within the same family
Price: Above GPT-5.4; confirm on pricing page
Deployment: API
Modalities: Text + image (input)
Parameters: GPT-5.4 Pro tier per OpenAI naming; interface parity with gpt-5.4 base
Pros: Stronger frontier behavior than GPT-5.4 where the Pro tier is rolled out
Cons: Higher cost/latency than GPT-5.4 and GPT-5.4 mini
Model: gpt-5.4-pro
Scale: large / frontier
Benchmarking:
- GDPval (wins or ties): 82.0%
- BrowseComp: 89.3%
- FrontierMath Tier 1–3: 50.0%
- Humanity's Last Exam (no tools): 42.7%
- Humanity's Last Exam (with tools): 58.7%
- GeneBench: 25.6%
- GPT-5.4 nano · OpenAI · Generative
Best for: Micro-tasks, volume routing, cheap preprocessing before escalation to GPT-5.4 / GPT-5.5
Price: Lowest in the GPT-5.4 lineup; confirm on pricing page
Deployment: API
Modalities: Text + image (input)
Parameters: Economy GPT-5.4 class; multimodal/input rules follow API docs
Pros: Maximum throughput per dollar in the GPT-5.4 branded line
Cons: Weakest on hard reasoning vs mini or full GPT-5.4
Model: gpt-5.4-nano
Scale: tiny
Benchmarking:
- SWE-Bench Pro (public): 52.4%
- Terminal-Bench 2.0: 46.3%
- GPQA Diamond: 82.8%
- OSWorld-Verified: 39.0%
- MCP Atlas: 56.1%
- Toolathlon: 35.5%
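The "cheap preprocessing before escalation" pattern above is a simple router: send trivial prompts to the nano tier and escalate harder ones. A pure-Python sketch; the model ids follow this table and the difficulty signals and thresholds are purely illustrative:

```python
# Sketch: tier routing for the nano / mini / full split.
# Thresholds and marker words are illustrative; tune for your traffic.

def pick_tier(prompt: str) -> str:
    """Route by rough difficulty signals before spending on a big model."""
    hard_markers = ("prove", "refactor", "multi-step", "debug")
    if len(prompt) > 2000 or any(m in prompt.lower() for m in hard_markers):
        return "gpt-5.4"        # full tier for long or hard prompts
    if len(prompt) > 400:
        return "gpt-5.4-mini"   # mid tier for moderate prompts
    return "gpt-5.4-nano"       # cheap default for everything else
```

In production, a classifier or a nano-tier self-assessment call usually replaces the keyword heuristic.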
- BGE-large-en-v1.5 (BAAI General Embedding, Beijing Academy of Artificial Intelligence) · Sentence Transformers (Hugging Face) · Embedding
Best for: RAG, semantic search, retrieval, classification, open-source deployments
Price: Free (open source)
Deployment: Local
Modalities: Text-only
Parameters: 1,024 dims, 512 tokens, ~335M params
Pros: Strong MTEB scores, open source, runs locally, good quality/size trade-off
Cons: Text-only, shorter context than some models
Model: BAAI/bge-large-en-v1.5
Scale: large
Benchmarking:
- MTEB (English, v1): ~64.2%
- GTE-large-en-v1.5 (Alibaba General Text Embeddings) · Sentence Transformers (Hugging Face) · Embedding
Best for: RAG, search, classification, English long-context
Price: Free (open source)
Deployment: Local
Modalities: Text-only
Parameters: 1,024 dims, 8,192 tokens, ~335M params
Pros: High MTEB score, long context
Cons: English-only; other GTE checkpoints cover multilingual
Model: Alibaba-NLP/gte-large-en-v1.5
Scale: large
Benchmarking:
- MTEB (English, v1): ~64.0%
- E5-mistral-7b-instruct · Sentence Transformers (Hugging Face) · Embedding
Best for: High-quality retrieval, RAG, complex queries
Price: Free (open source)
Deployment: Local
Modalities: Text-only
Parameters: 4,096 dims, 32K tokens, 7B params
Pros: Top MTEB performance, long context
Cons: Heavy; needs more GPU memory
Model: intfloat/e5-mistral-7b-instruct
Scale: medium
Benchmarking:
- MTEB (English, v1): ~66.5%
- BGE-reranker-large · Sentence Transformers (Hugging Face) · Reranker
Best for: RAG reranking, passage–query relevance
Price: Free (open source)
Deployment: Local
Modalities: Text-only
Parameters: Cross-encoder, query + passage → score
Pros: Strong MTEB, works with Sentence Transformers
Cons: Text-only
Model: BAAI/bge-reranker-large
Scale: N/A
Benchmarking:
- MTEB T2 reranking (MAP): 67.6
- MTEB T2 ZH→EN (MAP): 64.03
- MTEB T2 EN→ZH (MAP): 61.44
- MTEB MMarco reranking (MAP): 37.16
- MTEB CMedQAv1 (MAP): 82.15
- MTEB CMedQAv2 (MAP): 84.18
- C-MTEB rerank mean (MAP): 66.09
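A cross-encoder reads query and passage together and emits one relevance score per pair, unlike the bi-encoder embeddings above. A sketch using the `CrossEncoder` class from `sentence-transformers` (checkpoint downloads on first use; the pair-builder helper is illustrative):

```python
# Sketch: cross-encoder reranking with BAAI/bge-reranker-large.

def make_pairs(query: str, passages: list[str]) -> list[tuple[str, str]]:
    """Build (query, passage) pairs for cross-encoder scoring."""
    return [(query, p) for p in passages]

def rerank(query: str, passages: list[str]) -> list[tuple[float, str]]:
    """Score each passage against the query; best first."""
    from sentence_transformers import CrossEncoder
    ce = CrossEncoder("BAAI/bge-reranker-large")
    scores = ce.predict(make_pairs(query, passages))
    return sorted(zip((float(s) for s in scores), passages), reverse=True)
```

The usual pattern is vector retrieval for the top ~50 candidates, then this reranker for the final top 5.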
- BGE-reranker-v2-m3 · Sentence Transformers (Hugging Face) · Reranker
Best for: Multilingual reranking
Price: Free
Deployment: Local
Modalities: Text-only (multilingual)
Parameters: Multilingual cross-encoder
Pros: 100+ languages
Cons: Heavier than the base model
Model: BAAI/bge-reranker-v2-m3
Scale: N/A
Benchmarking:
- BEIR Q&A mean (NDCG@10): 67.34
- BEIR NQ (NDCG@10): 70.28
- BEIR HotpotQA (NDCG@10): 86.35
- BEIR FiQA (NDCG@10, NV-Mistral7B-v2 pool): 45.39
- BEIR FiQA NDCG@10 (bge-large-en-v1.5 pool): 44.83
- BEIR FiQA NDCG@100 (bge-large-en-v1.5 pool): 51.53
- BEIR FiQA MRR@10 (bge-large-en-v1.5 pool): 53.38
- Gemini Embedding 2 · Google · Embedding
Best for: Multimodal RAG, cross-modal search, semantic search over text, images, video, audio, documents
Price: text ~$0.20/1M tokens, images ~$0.45/1M tokens, audio ~$6.50/1M tokens, video ~$12.00/1M tokens
Deployment: API
Modalities: Text, image, video, audio, documents
Parameters: 3,072 dims, 8K tokens (text), images (up to 6), video (~120s), audio (~80s)
Pros: Native multimodal, single embedding space, Matryoshka-style compression, 100+ languages
Cons: API-only (no local), higher cost than text-only models
Model: gemini-embedding-2-preview
Scale: large / frontier
Benchmarking:
- MMTEB task mean: 68.32
- MTEB (English, v2) task mean: 73.30
- MTEB (Code): 74.66
- XOR-Retrieve Recall@5: 90.42
- XTREME-UP MRR@10: 64.33
- Borda rank MMTEB (multilingual): #1
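A sketch of calling the embedding endpoint through the `google-genai` SDK, assuming a `GEMINI_API_KEY` is set; the model id below is the preview id from this table and may differ in your account, so treat the whole call shape as an assumption to verify against current docs:

```python
# Sketch: text embeddings via the google-genai SDK.
# The model id is the preview id listed in this table (an assumption).

def embed_docs(texts: list[str]) -> list[list[float]]:
    from google import genai
    client = genai.Client()  # reads GEMINI_API_KEY from the environment
    result = client.models.embed_content(
        model="gemini-embedding-2-preview",
        contents=texts,
    )
    return [e.values for e in result.embeddings]
```

Image, audio, and video inputs go through the same embedding space per the vendor docs, but their request shapes differ; check the current API reference before relying on this.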
- Gemini Embedding 2 (Vertex AI) · Google · Embedding
Best for: Multimodal RAG, cross-modal search, semantic search over text, images, video, audio, documents
Price: text ~$0.15/1M tokens (Vertex ~$0.15/1M vs Gemini API ~$0.20/1M is typical; confirm current pricing pages); images/audio/video similar to the Gemini API
Deployment: API (Vertex AI)
Modalities: Text, image, video, audio, documents
Parameters: 3,072 dims, 8K tokens (text), images (up to 6), video (~120s), audio (~80s)
Pros: Native multimodal, single embedding space, Matryoshka-style compression, 100+ languages, enterprise support, VPC integration
Cons: API-only (no local), higher cost than text-only models
Model: Vertex AI (text-multilingual-embedding-002 or gemini-embedding-2)
Scale: large / frontier
Benchmarking:
- MMTEB task mean: 68.32
- MTEB (English, v2) task mean: 73.30
- MTEB (Code): 74.66
- XOR-Retrieve Recall@5: 90.42
- XTREME-UP MRR@10: 64.33
- Borda rank MMTEB (multilingual): #1
- Gemini 2.5 Pro (Thinking mode) · Google · Reasoning / thinker
Best for: Math, coding, multi-step reasoning, agents
Price: ~$1.25/1M in, $10/1M out (prompt ≤200k tokens); ~$2.50/1M in, $15/1M out (>200k)
Deployment: Gemini API, Vertex AI
Modalities: Text, image, audio, video
Parameters: Extended reasoning, 1M context
Pros: Strong reasoning, multimodal
Cons: Slower, higher cost
Model: gemini-2.5-pro (with thinking enabled)
Scale: large / frontier
Benchmarking:
- GPQA Diamond (pass@1): 86.4%
- SWE-Bench Verified (single attempt): 59.6%
- MMMU (pass@1): 82.0%
- VideoMMMU: 83.6%
- Global MMLU: 89.2%
- AIME 2025 (pass@1): 88.0%
- Gemini 3.1 Pro Preview · Google · Reasoning / thinker
Best for: Complex reasoning, coding, agents, multimodal analysis
Price: Paid (split by prompt length): input $2.00/1M (≤200k tokens) or $4.00/1M (>200k); output $12.00/1M or $18.00/1M
Deployment: Gemini API, Vertex AI
Modalities: Text, image, audio, video (per model doc)
Parameters: See model doc (preview; limits updated there)
Pros: Newest Pro tier in the Gemini 3.1 line, strong agentic / coding positioning
Cons: Preview (behavior/rates may change); higher cost than Flash
Model: gemini-3.1-pro-preview
Scale: large / frontier
Benchmarking:
- GPQA Diamond: 94.3%
- SWE-Bench Verified (single attempt): 80.6%
- SWE-Bench Pro (single attempt): 54.2%
- Terminal-Bench 2.0: 68.5%
- ARC-AGI-2 (verified): 77.1%
- Humanity's Last Exam (no tools): 44.4%
- Gemini 2.5 Flash (Thinking) · Google · Reasoning / thinker
Best for: Lightweight reasoning, cost-sensitive reasoning
Price: ~$0.30/1M input, $2.50/1M output (thinking counted in output; audio in $1/1M)
Deployment: Gemini API, Vertex AI
Modalities: Text, image, audio, video
Parameters: Smaller reasoning model
Pros: Cheaper than Pro thinking
Cons: Less capable than Pro
Model: gemini-2.5-flash-thinking
Scale: medium
Benchmarking:
- SWE-Bench Verified (single attempt): 60.4%
- GPQA Diamond: 82.8%
- AIME 2025 (no tools): 72.0%
- MMMU-Pro (no tools): 66.7%
- Terminal-Bench 2.0: 16.9%
- Humanity's Last Exam (no tools): 11.0%
- Gemini 3 Flash Preview (Thinking) · Google · Reasoning / thinker
Best for: Fast Gemini 3 Flash tier, reasoning, agents, search/grounding-heavy work
Price: Paid (standard): ~$0.50/1M input (text/image/video), ~$1.00/1M (audio); ~$3.00/1M output (thinking in output)
Deployment: Gemini API, Vertex AI
Modalities: Text, image, audio, video, PDF
Parameters: Preview; 1M in / 65k out; thinking via API config
Pros: Newer Flash line than 2.5; strong speed + capability mix
Cons: Preview; higher $/1M than 2.5 Flash; stricter limits than stable models
Model: gemini-3-flash-preview
Scale: medium
Benchmarking:
- SWE-Bench Verified (single attempt): 78.0%
- GPQA Diamond: 90.4%
- MMMU-Pro (no tools): 81.2%
- Terminal-Bench 2.0 (Terminus-2): 47.6%
- LiveCodeBench Pro (Elo): 2316
- Humanity's Last Exam (no tools): 33.7%
- Gemma 4 E2B · Google / DeepMind · Generative
Best for: On-device and edge chat, agents, coding; multimodal (text, image, video, audio) at small sizes
Price: Open weights (infra cost); see AI Studio / Vertex if hosted
Deployment: Local, Hugging Face, Kaggle, Ollama, Vertex AI
Modalities: Text, image, video, audio
Parameters: E2B line, PLE architecture, 128K context (see model card for parameterization)
Pros: Tiny footprint for the Gemma 4 line, native multimodal on E2B/E4B, function calling, thinking modes
Cons: Lower ceiling than the 31B / MoE variants
Model: google/gemma-4-E2B-it
Scale: tiny
Benchmarking:
- MMLU Pro: 60.0%
- LiveCodeBench v6: 29.1%
- GPQA Diamond: 42.4%
- Tau2 (avg of 3 domains): 16.2%
- MMMU Pro: 44.2%
- MMMLU: 67.4%
- Gemma 4 E4B · Google / DeepMind · Generative
Best for: Edge and browser-class workloads, agents, coding; multimodal (text, image, video, audio)
Price: Open weights (infra cost); see AI Studio / Vertex if hosted
Deployment: Local, Hugging Face, Kaggle, Ollama, Vertex AI
Modalities: Text, image, video, audio
Parameters: E4B line, PLE architecture, 128K context (see model card for parameterization)
Pros: Strong quality for its size class, multimodal on small Gemma 4, Apache-2.0-style Gemma terms
Cons: Higher static VRAM than a naive 4B due to PLE tables
Model: google/gemma-4-E4B-it
Scale: tiny
Benchmarking:
- MMLU Pro: 69.4%
- LiveCodeBench v6: 44.0%
- GPQA Diamond: 43.4%
- Tau2 (avg of 3 domains): 24.5%
- MMMU Pro: 52.6%
- MMMLU: 76.6%
- Gemma 4 26B A4B MoE · Google / DeepMind · Reasoning / thinker
Best for: High-throughput reasoning, coding, agents; ~4B active params per token
Price: Open weights (infra cost); full expert set loaded at inference
Deployment: Local, Hugging Face, Kaggle, Vertex AI
Modalities: Text, image, video, audio
Parameters: 26B MoE (A4B active per token), 256K context, function calling, thinking modes
Pros: Efficient per-token cost vs a dense 26B-class model, strong reasoning positioning
Cons: Memory footprint closer to the full 26B than to the 4B active
Model: google/gemma-4-26B-A4B-it
Scale: large
Benchmarking:
- MMLU Pro: 82.6%
- LiveCodeBench v6: 77.1%
- GPQA Diamond: 82.3%
- Tau2 (avg of 3 domains): 68.2%
- Codeforces Elo: 1718
- MRCR v2 (8 needle, 128k avg): 44.1%
- Gemma 4 31B · Google / DeepMind · Reasoning / thinker
Best for: Deep reasoning, coding, enterprise and server-grade open-weight deployment
Price: Open weights (infra cost)
Deployment: Local, Hugging Face, Kaggle, Vertex AI
Modalities: Text, image, video, audio
Parameters: 31B dense, 256K context, multimodal (text, image, video, audio), function calling
Pros: Top open-weight Gemma 4 dense tier, long context, agentic tooling
Cons: Heavy GPU/TPU requirements at full precision
Model: google/gemma-4-31B-it
Scale: large / frontier
Benchmarking:
- MMLU Pro: 85.2%
- LiveCodeBench v6: 80.0%
- GPQA Diamond: 84.3%
- AIME 2026 (no tools): 89.2%
- Codeforces Elo: 2150
- MRCR v2 (8 needle, 128k avg): 66.4%
- Vertex AI Ranking API · Google · Reranker
Best for: RAG reranking, semantic ranking
Price: $1.00 USD per 1,000 ranking calls
Deployment: Vertex AI
Modalities: Text-only
Parameters: Semantic reranker, <100ms latency
Pros: Low latency, strong performance
Cons: Vertex AI only, API-only
Model: semantic-ranker-default-004, semantic-ranker-fast-004
Scale: N/A
Benchmarking:
- TREC DL mean (NDCG@10, 512-003, Pinecone): 64.89
- Latency (Vertex RAG docs): <100ms typical
- Default-004 vs API baselines (vendor blog): ≥2× faster
- Fast-004 vs default-004 (vendor blog): ~3× lower latency
- Max tokens / ranking request (vendor blog): 200k
- Max tokens / record (004, docs): 1024
- Max records / request (docs): 200
- embed-english-v3.0 · Cohere · Embedding
Best for: English RAG, search, classification
Price: ~$0.10/1M tokens
Deployment: API
Modalities: Text + image
Parameters: 1,024 dims, English; API supports text and images
Pros: Strong search, RAG
Cons: English only
Model: embed-english-v3.0
Scale: large
Benchmarking:
- MTEB NQ (NDCG@10): 61.56
- MTEB HotpotQA (NDCG@10): 70.72
- MTEB FiQA2018 (NDCG@10): 42.19
- MTEB QuoraRetrieval (NDCG@10): 88.72
- MTEB MSMARCO (NDCG@10): 42.93
- MTEB FEVER (NDCG@10): 88.97
- MTEB ArguAna (NDCG@10): 61.52
- embed-v4 · Cohere · Embedding
Best for: Multimodal RAG, text + image + PDF, mixed content
Price: ~$0.10–0.12/1M tokens
Deployment: API
Modalities: Text + image
Parameters: Multimodal (text, image), unified embedding space
Pros: Multimodal, high-res images, PDFs
Cons: API-only
Model: embed-v4
Scale: large / frontier
Benchmarking:
- MTEB avg (reported rollup): 69.8%
- Retrieval subset (reported): 58.2%
- Classification subset (reported): 78.4%
- Multimodal: Text + image in shared space
- embed-multilingual-v3.0 · Cohere · Embedding
Best for: Multilingual RAG, search
Price: ~$0.10/1M tokens
Deployment: API
Modalities: Text + image
Parameters: 1,024 dims, 100+ languages; API supports text and images
Pros: Multilingual
Cons: API-only
Model: embed-multilingual-v3.0
Scale: large
Benchmarking:
- MTEB (multilingual cluster, typical): ~61–64
- Languages (API): 100+
- vs Multilingual E5 (Voyage suite cited): +3.89%
- vs OpenAI v3 large (Voyage suite cited): +4.55%
- Command R · Cohere · Generative
Best for: RAG, chat, cost-sensitive use
Price: ~$0.15/1M input, $0.60/1M output
Deployment: API
Modalities: Text-only
Parameters: 128K context
Pros: Cheaper than R+, solid RAG
Cons: Less capable than R+
Model: command-r
Scale: large
Benchmarking:
- MMLU (5-shot, reported): ~75%
- HumanEval (reported): ~71%
- GSM8K (reported): ~71%
- TriviaQA (reported): ~85%
- Command R+ · Cohere · Generative
Best for: RAG, tool use, agents, multilingual
Price: ~$2.50/1M input, $10/1M output
Deployment: API
Modalities: Text-only
Parameters: 128K context, 23 languages
Pros: Strong RAG, tool use, multilingual
Cons: Higher cost than Command R
Model: command-r-plus
Scale: large
Benchmarking:
- MMLU (0-shot, reported): 75.7%
- GSM8K (8-shot, reported): 70.7%
- HumanEval (pass@1, reported): ~79%
- Open LLM leaderboard (avg, reported): 74.6%
- Command R7B · Cohere · Generative
Best for: High volume, simple tasks
Price: ~$0.04/1M input, $0.15/1M output
Deployment: API
Modalities: Text-only
Parameters: 7B params
Pros: Fast, cheap
Cons: Less capable
Model: command-r7b
Scale: tiny
Benchmarking:
- MMLU (typical small-command tier): ~55–60%
- HumanEval (typical): ~65–75%
- Role: High-throughput routing / simple RAG
- Command A Reasoning · Cohere · Reasoning / thinker
Best for: Agents, tool use, complex reasoning
Price: Higher than Command R+
Deployment: API
Modalities: Text-only
Parameters: 111B params, 256K context, 23 languages
Pros: Strong reasoning, tool use, agentic
Cons: API-only, higher cost
Model: command-a-reasoning-08-2025
Scale: large / frontier
Benchmarking:
- MMLU: 85.5
- MMLU-Pro: 69.6
- GPQA Diamond: 50.8
- MATH (all): 80.0
- AIME (2024): 23.3
- IFEval: 90.9
- TauBench (retail): 51.7
- rerank-v4.0-pro · Cohere · Reranker
Best for: RAG reranking, high relevance
Price: Per search
Deployment: API
Modalities: Text-only
Parameters: 32K context
Pros: High quality
Cons: Slower than the fast variant
Model: rerank-v4.0-pro
Scale: N/A
Benchmarking:
- T2-RAGBench R@1: 47.2
- T2-RAGBench R@3: 75.8
- T2-RAGBench R@5: 81.6
- T2-RAGBench R@10: 86.1
- T2-RAGBench MRR@3: 60.5
- T2-RAGBench nDCG@10: 68.3
- T2-RAGBench MAP: 62.5
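A sketch of second-stage reranking through Cohere's rerank endpoint, assuming the `cohere` Python SDK with an API key in the environment; the model id is taken from this table, so verify it against current Cohere docs:

```python
# Sketch: Cohere rerank as a second stage after vector retrieval.
# Model id from this table (an assumption to verify).

def rerank_top(query: str, docs: list[str], n: int = 3) -> list[str]:
    """Return the n most relevant docs for the query, best first."""
    import cohere
    co = cohere.ClientV2()  # reads the API key from the environment
    resp = co.rerank(model="rerank-v4.0-pro", query=query,
                     documents=docs, top_n=n)
    return [docs[r.index] for r in resp.results]
```

Each result carries an index into the original `documents` list plus a relevance score, so the caller keeps ownership of the document payloads.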
- rerank-v3.5 · Cohere · Reranker
Best for: Multilingual reranking
Price: ~$2/1,000 searches
Deployment: API
Modalities: Text-only (multilingual)
Parameters: 4,096 tokens
Pros: Multilingual
Cons: Shorter context than v4
Model: rerank-v3.5
Scale: N/A
Benchmarking:
- Reasoning suite P@1 (2 candidates): 81.59
- Multilingual suite mean nDCG@10: 62.18
- Same suite, Hybrid nDCG@10: 52.10
- Same suite, Dense nDCG@10: 53.83
- Same suite, Cohere Rerank 3 nDCG@10: 52.27
- Reasoning suite, Hybrid P@1: 48.80
- Reasoning suite, Dense P@1: 50.64
- voyage-3-large · Voyage AI · Embedding
Best for: High-accuracy RAG, semantic search, multilingual retrieval, enterprise retrieval quality
Price: Premium tier (check the latest Voyage pricing page)
Deployment: API
Modalities: Text-only
Parameters: Text embeddings, long-context input, Matryoshka-style dimension shortening support
Pros: Very strong retrieval quality, good multilingual performance, flexible embedding size
Cons: API-only (no local), higher cost than lite variants
Model: voyage-3-large
Scale: large / frontier
Benchmarking:
- NDCG@10 (est. from voyage-3 76.72 + 4.14% rel.): ~79.9
- vs OpenAI v3 large (avg domains, vendor): +9.74%
- vs voyage-3 (avg domains, vendor): +4.14%
- Context length: 32K
- voyage-3 · Voyage AI · Embedding
Best for: General-purpose RAG, semantic search, recommendations, classification
Price: Mid tier (check the latest Voyage pricing page)
Deployment: API
Modalities: Text-only
Parameters: Text embeddings, long-context input, balanced quality/latency
Pros: Strong quality at better cost than large
Cons: API-only, not as accurate as large on hard retrieval sets
Model: voyage-3
Scale: medium
Benchmarking:
- NDCG@10: 76.72%
- voyage-3-lite · Voyage AI · Embedding
Best for: High-volume embedding pipelines, cost-sensitive search/RAG, near-duplicate detection
Price: Low-cost tier (check the latest Voyage pricing page)
Deployment: API
Modalities: Text-only
Parameters: Lightweight text embeddings, lower latency
Pros: Cheap, fast, scalable
Cons: Lower retrieval quality than voyage-3 / voyage-3-large
Model: voyage-3-lite
Scale: tiny
Benchmarking:
- NDCG@10 (avg suite): 72.98
- vs OpenAI v3 large (avg domains, vendor): +3.82%
- vs OpenAI v3 small, same price (vendor): +7.58%
- Context length: 32K
- voyage-code-3 · Voyage AI · Embedding
Best for: Code search, code RAG, repository retrieval, code similarity
Price: Specialized tier (check the latest Voyage pricing page)
Deployment: API
Modalities: Text + code
Parameters: Code-focused embeddings for natural language + code retrieval
Pros: Better code retrieval than general text embeddings
Cons: API-only, less ideal for purely non-code corpora
Model: voyage-code-3
Scale: large / frontier
Benchmarking:
- NDCG@10 @1024 dims (avg code suite): 92.28%
- vs OpenAI v3 large, avg lift (vendor): +13.80%
- vs CodeSage-large, avg lift (vendor): +16.81%
- Context length: 32K
- rerank-2 · Voyage AI · Reranker
Best for: Re-ranking top-k retrieved documents for higher precision in RAG
Price: Per-request/token based (check the latest Voyage pricing page)
Deployment: API
Modalities: Text-only
Parameters: Cross-encoder reranker (query-document relevance scoring)
Pros: Noticeable precision boost after initial vector retrieval
Cons: Extra latency/cost step after retrieval
Model: rerank-2
Scale: N/A
Benchmarking:
- ICLERB nDCG@10: 63.86
- ICLERB nDCG@50: 64.32
- Rel. mean NDCG@10 lift vs text-embedding-3-large (%): 13.89
- Mean NDCG@10 − rerank-1 (pp, 3 first-stages): 2.84
- Mean NDCG@10 − Cohere rerank-english-v3.0 (pp): 6.33
- Mean NDCG@10 − bge-reranker-v2-m3 (pp): 14.75
- Mean NDCG@10 − Cohere rerank-multilingual-v3.0 (pp, 51 ds.): 8.83
- rerank-2-lite · Voyage AI · Reranker
Best for: Cost-sensitive re-ranking at larger scale
Price: Lower than rerank-2 (check the latest Voyage pricing page)
Deployment: API
Modalities: Text-only
Parameters: Lightweight reranker optimized for speed/cost
Pros: Faster and cheaper than rerank-2
Cons: Slightly lower precision than full rerank-2
Model: rerank-2-lite
Scale: N/A
Benchmarking:
- Rel. mean NDCG@10 lift vs text-embedding-3-large (%): 11.86
- Mean rel. lift below rerank-2 (pp, same 93-ds. TL;DR): 2.03
- Mean NDCG@10 − Cohere rerank-english-v3.0 (pp): 4.49
- Mean NDCG@10 − bge-reranker-v2-m3 (pp): 12.91
- Mean NDCG@10 − Cohere rerank-multilingual-v3.0 (pp, 51 ds.): 6.24
- Mean NDCG@10 − bge-reranker-v2-m3 (pp, multilingual): 2.26
- Rel. lift above the Cohere v3 stack (pp, vendor intro): 5.12
- Qwen2.5-72B-Instruct · Alibaba Qwen · Generative
Best for: High-quality chat, analysis, coding, complex instruction following
Price: Open weights (inference cost depends on your infra) or provider API pricing
Deployment: Local or API (provider-dependent)
Modalities: Text-only
Parameters: 72B open-weight instruct model, long-context variants available
Pros: Strong general quality, good multilingual support, open weights for self-hosting
Cons: Heavy compute for local inference, not as cheap as smaller variants
Model: Qwen/Qwen2.5-72B-Instruct
Scale: large
Benchmarking:
- MMLU (5-shot): 86.1
- GSM8K: 89.92
- HumanEval: 88.57
- MATH: 75.96
- Qwen2.5-32B-Instruct · Alibaba Qwen · Generative
Best for: Strong quality with lower cost/latency than 72B
Price: Open weights / provider-dependent API pricing
Deployment: Local or API
Modalities: Text-only
Parameters: 32B instruct model, long-context variants available
Pros: Good quality/performance balance
Cons: Still requires substantial resources for local serving
Model: Qwen/Qwen2.5-32B-Instruct
Scale: medium
Benchmarking:
- MMLU (5-shot): 83.5
- GSM8K: 88.43
- HumanEval: 84.76
- MATH: 68.58
- Qwen2.5-14B-Instruct · Alibaba Qwen · Generative
Best for: Mid-size production assistants, cost-aware coding/chat
Price: Open weights / provider-dependent API pricing
Deployment: Local or API
Modalities: Text-only
Parameters: 14B instruct model
Pros: Much easier to serve than 32B/72B, solid instruction following
Cons: Lower reasoning depth than larger Qwen models
Model: Qwen/Qwen2.5-14B-Instruct
Scale: medium
Benchmarking:
- MMLU (5-shot): 79.93
- GSM8K: 82.93
- HumanEval: 75.76
- MATH: 57.71
- Qwen2.5-7B-Instruct · Alibaba Qwen · Generative
Best for: Lightweight assistants, edge/server cost-sensitive workloads
Price: Open weights / provider-dependent API pricing
Deployment: Local or API
Modalities: Text-only
Parameters: 7B instruct model
Pros: Fast, cheaper inference, widely deployable
Cons: Lower quality on harder reasoning/coding tasks
Model: Qwen/Qwen2.5-7B-Instruct
Scale: tiny
Benchmarking:
- MMLU (5-shot): 74.22
- GSM8K: 85.68
- HumanEval: 68.18
- MATH: 49.76
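Local chat with this checkpoint via `transformers` follows the standard chat-template pattern; a sketch assuming `transformers` and `torch` are installed and the checkpoint fits on your hardware:

```python
# Sketch: local chat with Qwen2.5-7B-Instruct via transformers.
# The checkpoint downloads from Hugging Face on first use.

def chat(prompt: str, max_new_tokens: int = 256) -> str:
    from transformers import AutoModelForCausalLM, AutoTokenizer
    name = "Qwen/Qwen2.5-7B-Instruct"
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name, device_map="auto")
    messages = [{"role": "user", "content": prompt}]
    # The chat template inserts the model's expected role markers.
    text = tok.apply_chat_template(messages, tokenize=False,
                                   add_generation_prompt=True)
    inputs = tok(text, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Decode only the newly generated tokens, not the echoed prompt.
    return tok.decode(out[0][inputs.input_ids.shape[1]:],
                      skip_special_tokens=True)
```

The same pattern works for the 14B/32B/72B checkpoints by changing `name`, given enough VRAM.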
- Qwen2.5-Coder-32B-Instruct · Alibaba Qwen · Generative (code-focused)
Best for: Code generation, refactoring, repo Q&A, code reasoning
Price: Open weights / provider-dependent API pricing
Deployment: Local or API
Modalities: Text-only
Parameters: 32B code-specialized instruct model
Pros: Strong coding quality vs general-only models
Cons: Large-model serving cost; for simple tasks, smaller Coder variants may be enough
Model: Qwen/Qwen2.5-Coder-32B-Instruct
Scale: large
Benchmarking:
- HumanEval (pass@1): ~86–90
- MBPP: strong coder tier
- Role: Code-specialized instruct
- Qwen2.5-VL-72B-Instruct · Alibaba Qwen · Generative
Best for: Vision + text understanding, document/image question answering
Price: Open weights / provider-dependent API pricing
Deployment: Local or API
Modalities: Text + image
Parameters: Multimodal (text + image) model family
Pros: Strong multimodal capability in the Qwen ecosystem
Cons: Higher compute and serving complexity than text-only models
Model: Qwen/Qwen2.5-VL-72B-Instruct
Scale: large
Benchmarking:
- MMMU (val): strong open multimodal tier
- DocVQA: competitive document QA
- Role: 72B multimodal instruct
- DeepSeek-V3 · DeepSeek · Generative
Best for: General chat, coding, analysis, high-quality assistant tasks
Price: Provider-dependent API pricing
Deployment: API (and provider-hosted endpoints)
Modalities: Text-only
Parameters: Large MoE-style foundation model line, long-context-capable variants via providers
Pros: Strong quality-to-cost, good coding and multilingual performance
Cons: API/provider availability can vary by region; behavior depends on hosting/provider tuning
Model: DeepSeek-V3
Scale: large / frontier
Benchmarking:
- MMLU (pass@1): 88.5
- MMLU-Pro (EM): 75.9
- GPQA Diamond (pass@1): 59.1
- MATH-500 (EM): 90.2
- SWE-bench Verified: 42.0%
- AIME 2024 (pass@1): 39.2%
- DeepSeek-R1 · DeepSeek · Reasoning / thinker
Best for: Reasoning-heavy tasks, math, logic, multi-step planning/problem solving
Price: Provider-dependent API pricing
Deployment: API
Modalities: Text-only
Parameters: Reasoning-oriented model family with deliberate chain-style behavior
Pros: Strong reasoning performance, useful for hard step-by-step tasks
Cons: Higher latency/token usage than non-reasoning models on simple prompts
Model: DeepSeek-R1
Scale: large / frontier
Benchmarking:
- MMLU (pass@1): 90.8
- GPQA Diamond (pass@1): 71.5
- AIME 2024 (pass@1): 79.8
- MATH-500 (pass@1): 97.3
- LiveCodeBench (Pass@1-CoT): 65.9
- Codeforces Rating: 2029
- DeepSeek-R1-Distill-Llama-70B · DeepSeek · Reasoning / thinker
Best for: Cost-aware reasoning with strong quality, self-hosted reasoning workloads
Price: Open weights (infra cost) or provider-dependent API pricing
Deployment: Local or API
Modalities: Text-only
Parameters: Distilled reasoning model based on a Llama 70B backbone
Pros: Strong reasoning at lower cost/complexity than full frontier reasoning models
Cons: Lower ceiling than full DeepSeek-R1 on the hardest tasks
Model: deepseek-ai/DeepSeek-R1-Distill-Llama-70B
Scale: large / distilled
Benchmarking:
- AIME 2024 pass@1: 70.0
- GPQA Diamond pass@1: 65.2
- LiveCodeBench pass@1: 57.5
- Codeforces Rating: 1633
- DeepSeek-R1-Distill-Qwen-32B · DeepSeek · Reasoning / thinker
Best for: Mid-size reasoning deployments, balanced quality/latency
Price: Open weights (infra cost) or provider-dependent API pricing
Deployment: Local or API
Modalities: Text-only
Parameters: Distilled reasoning model on a Qwen 32B backbone
Pros: Good reasoning/cost balance, easier to serve than larger models
Cons: Less capable than 70B/full R1 on difficult benchmarks
Model: deepseek-ai/DeepSeek-R1-Distill-Qwen-32B
Scale: medium / distilled
Benchmarking:
- AIME 2024 pass@1: 72.6
- GPQA Diamond pass@1: 62.1
- LiveCodeBench pass@1: 57.2
- Codeforces Rating: 1691
- DeepSeek-Coder-V2-Instruct · DeepSeek · Generative (code-focused)
Best for: Code generation, refactoring, debugging, repo-level coding assistance
Price: Open weights (infra cost) or provider-dependent API pricing
Deployment: Local or API
Modalities: Text-only
Parameters: Code-specialized model line (various sizes/checkpoints)
Pros: Strong coding performance, practical for dev workflows
Cons: General non-code reasoning/chat can be weaker than top general models
Model: deepseek-ai/DeepSeek-Coder-V2-Instruct
Scale: large
Benchmarking:
- HumanEval: >90%
- Focus: Repo-scale code completion & repair
- kimi-k2.6Moonshot AI (Kimi)Reasoning / thinker
Best forAgent workflows, coding, and long-context multimodal reasoning with improved planning depth over earlier K2.x tiers
PriceSelf-hosted open weights: free model weights (infrastructure cost). Official API: $0.16/1M input (cache hit), $0.95/1M input (cache miss), $4.00/1M output
DeploymentLocal or API
ModalitiesText + image + video
ParametersMoE architecture: 1T total params / 32B activated per token; 61 layers; 384 experts (8 selected + 1 shared per token); 160K vocab; 256K context; MoonViT vision encoder (400M params)
ProsNewer K2.x line for stronger reasoning/planning scenarios
ConsPublic benchmark reporting and exact limits can change quickly by API release
Modelkimi-k2.6
Scalelarge / frontier
Benchmarking- SWE-Bench Pro58.6%
- SWE-Bench Verified80.2%
- Terminal-Bench 2.066.7%
- LiveCodeBench (v6)89.6%
- HLE-Full (w/ tools)54.0
- GPQA-Diamond90.5%
- kimi-k2.5Moonshot AI (Kimi)Generative
Best forMultimodal (image + video + text), vision-language, agent-style workflows; use thinking mode for harder reasoning
PriceToken-based; see Moonshot pricing docs
DeploymentAPI
ModalitiesText + image + video
ParametersMultimodal (image + video + text); video via upload / ms:// file refs; long context (~256K-class per model card; confirm API limits for your account)
ProsFlagship multimodal line; thinking mode where supported
ConsThinking vs instant modes and defaults differ from older v1 APIs; confirm latest docs
Modelkimi-k2.5
Scalelarge / frontier
Benchmarking- MMMU-Pro78.5%
- MathVision84.2%
- moonshot-v1-128k-vision-previewMoonshot AI (Kimi)Generative
Best forHeavy multimodal context (long system + user + images) in one shot
PriceToken-based; see Moonshot pricing docs
DeploymentAPI
ModalitiesText + image
ParametersVision model id for ~128K context tier (preview)
ProsLargest v1 vision context tier in the name
ConsPreview; most expensive/heaviest when you use full context
Modelmoonshot-v1-128k-vision-preview
Scalelarge
Benchmarking- BenchLM Overall Score47/100
- BenchLM Rank#84 / 127
- Multimodal & Grounded (BenchLM)52.6
- moonshot-v1-32k-vision-previewMoonshot AI (Kimi)Generative
Best forLonger multimodal chats / more image+text context in one request
PriceToken-based (vision token accounting); see Moonshot pricing docs
DeploymentAPI
ModalitiesText + image
ParametersVision model id for ~32K context tier (preview); same image rules as other Moonshot vision models
ProsMore room for instructions + image context than 8K tier
ConsPreview; higher token use and cost than 8K when you fill context
Modelmoonshot-v1-32k-vision-preview
Scalemedium
Benchmarking- BenchLM Overall Score47/100
- BenchLM Rank#84 / 127
- Multimodal & Grounded (BenchLM)52.6
- moonshot-v1-8k-vision-previewMoonshot AI (Kimi)Generative
Best forImage + text in one request; short prompts and small multimodal turns
PriceToken-based chat pricing (vision uses dynamic image/video tokens); see Moonshot pricing docs
DeploymentAPI
ModalitiesText + image
ParametersVision model id for ~8K context tier (preview); images: png, jpeg, webp, gif; see Moonshot vision guide for request format (base64 / file id)
ProsLower context cost vs larger tiers when input fits in 8K
ConsPreview name may change; long images + long text can hit limits faster
Modelmoonshot-v1-8k-vision-preview
Scaletiny
Benchmarking- BenchLM Overall Score47/100
- BenchLM Rank#84 / 127
- Multimodal & Grounded (BenchLM)52.6
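The vision guide referenced in the 8K entry accepts images either as inline base64 data URLs or as uploaded file ids inside a chat message. A hedged sketch of the inline-base64 shape, assuming Moonshot's documented OpenAI-compatible message format; the function names and the truncated image bytes are illustrative only.

```python
import base64

def image_part_from_bytes(raw: bytes, mime: str = "image/png") -> dict:
    """OpenAI-style image content part carrying an inline base64 data URL.
    Structure assumed from Moonshot's OpenAI-compatible vision docs; the
    alternative path is referencing an uploaded file id instead."""
    b64 = base64.b64encode(raw).decode("ascii")
    return {"type": "image_url",
            "image_url": {"url": f"data:{mime};base64,{b64}"}}

def vision_message(question: str, image_bytes: bytes) -> dict:
    """One user turn mixing an image part and a text part."""
    return {"role": "user",
            "content": [image_part_from_bytes(image_bytes),
                        {"type": "text", "text": question}]}

msg = vision_message("What is in this image?", b"\x89PNG...")  # truncated bytes for illustration
```

The same message object works across the 8K/32K/128K vision tiers; only the model id and the token budget change.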
- MiniMax M2.7 (MiniMax)MiniMax AIReasoning / thinker
Best forAgents, coding, software engineering, office workflows, long-context reasoning; flagship text line
PriceSee https://platform.minimax.io pricing (token plans / pay-as-you-go)
DeploymentAPI (OpenAI- or Anthropic-compatible SDKs per docs)
ModalitiesText-only
ParametersMoE-class stack (~230B total / ~100B active per public materials); very long context (~204.8K-class); tools / Anthropic-compatible API path
ProsStrong real-world engineering and agentic positioning; a highspeed variant is available for latency-sensitive paths
ConsAPI-only (no open weights); pricing/quotas region- and account-dependent
ModelMiniMax-M2.7
Scalelarge / frontier
Benchmarking- SWE-Pro56.22%
- Terminal Bench 257.0%
- VIBE-Pro55.6%
- GDPval-AA Elo1495
- MiniMax M2.7-highspeed (MiniMax)MiniMax AIReasoning / thinker
Best forSame tasks as M2.7 when you need lower latency / higher throughput
PriceSee MiniMax pricing (often differs from base M2.7)
DeploymentAPI
ModalitiesText-only
ParametersSame capability tier as M2.7; faster inference (vendor-tuned routing)
ProsSignificantly faster than base M2.7 for similar quality class
ConsAPI-only; throughput gains over base M2.7 can vary under load
ModelMiniMax-M2.7-highspeed
Scalelarge / frontier
Benchmarking- SWE-Pro56.22%
- Terminal Bench 257.0%
- VIBE-Pro55.6%
- GDPval-AA Elo1495
- MiniMax M2.5 (MiniMax)MiniMax AIGenerative
Best forCode generation, refactoring, polyglot coding, strong value tier before M2.7
PriceSee MiniMax pricing
DeploymentAPI
ModalitiesText-only
ParametersLong context (~204.8K-class per docs); code-optimized positioning
ProsPeak value tier in the M2 text line for coding-heavy workloads
ConsSuperseded for absolute frontier by M2.7 on vendor charts; API-only
ModelMiniMax-M2.5
Scalelarge
Benchmarking- SWE-bench Verified (aggregate)80.2%
- MiniMax M2.5-highspeed (MiniMax)MiniMax AIGenerative
Best forSame as M2.5 with lower latency
PriceSee MiniMax pricing
DeploymentAPI
ModalitiesText-only
ParametersSame performance class as M2.5; faster inference
ProsFast M2.5-class option for high-volume coding
ConsAPI-only
ModelMiniMax-M2.5-highspeed
Scalelarge
Benchmarking- SWE-bench Verified (aggregate)80.2%
- MiniMax M2.1 (MiniMax)MiniMax AIGenerative
Best forCode, reasoning, refactoring; legacy M2 line still listed for stable integrations
PriceSee MiniMax pricing
DeploymentAPI
ModalitiesText-only
ParametersMoE-style (~230B total, ~10B activated per token per docs); code-focused
ProsMature tier; often cheaper than newest flagship
ConsLegacy relative to M2.5 / M2.7; API-only
ModelMiniMax-M2.1
Scalelarge
Benchmarking- SWE-bench Verified74.0%
- Multi-SWE-bench49.4%
- SWE-bench Multilingual72.5%
- Terminal-Bench 2.047.9%
- MiniMax M2 (MiniMax)MiniMax AIReasoning / thinker
Best forLong output and agentic text (function calling, streaming) on older M2 generation
PriceSee MiniMax pricing
DeploymentAPI
ModalitiesText-only
Parameters~200K context; up to ~128K output (incl. chain-style content per docs)
ProsEstablished M2 generation; long outputs
ConsLegacy vs M2.5 / M2.7; API-only
ModelMiniMax-M2
Scalelarge
Benchmarking- SWE-bench Verified69.4%
- Multi-SWE-bench36.2%
- SWE-bench Multilingual56.5%
- Terminal-Bench 2.030.0%
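The M2 entry highlights function calling and streaming, and the M2.7 entry notes OpenAI- or Anthropic-compatible SDK paths. Below is a hedged sketch of an OpenAI-style function-calling request body for this line; whether MiniMax accepts this exact shape depends on their compatible endpoint (check their docs), and the `get_weather` tool is a hypothetical example.

```python
# OpenAI-style function-calling body; the get_weather tool is hypothetical.
def build_tool_request(user_msg: str) -> dict:
    get_weather = {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Look up current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
    return {
        "model": "MiniMax-M2",
        "stream": True,  # M2 supports long streamed outputs per the entry above
        "messages": [{"role": "user", "content": user_msg}],
        "tools": [get_weather],
    }

req = build_tool_request("Weather in Shanghai?")
```

Swapping `"model"` for `MiniMax-M2.5` or `MiniMax-M2.7` reuses the same body on the newer tiers.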
- M2-her (MiniMax)MiniMax AIGenerative
Best forRoleplay, multi-character dialogue, long-horizon character interaction
PriceSee MiniMax pricing
DeploymentAPI
ModalitiesText-only
ParametersText chat tuned for character and emotional expression
ProsSpecialized for interactive fiction / persona use cases
ConsNot a general coding frontier model; API-only
ModelM2-her
ScaleN/A
Benchmarking- DomainCharacter / dialogue fidelity
- Claude Opus 4.6AnthropicGenerative
Best forHardest tasks, agents, coding, long multimodal work
Price~$5/1M input, ~$25/1M output (see Anthropic pricing for batch, cache, thinking)
DeploymentClaude API, AWS Bedrock, Google Vertex AI
ModalitiesText + image in, text out
Parameters1M context; 128k max output; extended thinking + adaptive thinking
ProsStrongest Claude tier in the current lineup
ConsHighest latency and cost in the family
Modelclaude-opus-4-6
Scalelarge / frontier
Benchmarking- SWE-bench Verified80.84%
- SWE-bench Multilingual77.83%
- Terminal-Bench 2.065.4%
- GPQA Diamond91.31%
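Extended thinking on the Opus tier is enabled per request through the Anthropic Messages API. A minimal sketch that only builds the call kwargs, assuming Anthropic's documented `thinking` parameter; the budget and `max_tokens` values are illustrative, not recommendations (per the API, `max_tokens` must exceed the thinking budget).

```python
def opus_request_kwargs(prompt: str, thinking_budget: int = 10_000) -> dict:
    """Kwargs for anthropic.Anthropic().messages.create(...).
    The thinking block follows Anthropic's extended-thinking API;
    the budget value here is illustrative."""
    return {
        "model": "claude-opus-4-6",
        "max_tokens": 16_000,  # must be larger than the thinking budget
        "thinking": {"type": "enabled", "budget_tokens": thinking_budget},
        "messages": [{"role": "user", "content": prompt}],
    }

kwargs = opus_request_kwargs("Plan a migration of a 2M-line monolith to services.")
# resp = anthropic.Anthropic().messages.create(**kwargs)  # needs ANTHROPIC_API_KEY
```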
- Claude Sonnet 4.6AnthropicGenerative
Best forProduction chat, agents, coding, vision; balance of speed and quality
Price~$3/1M input, ~$15/1M output (see Anthropic pricing for batch, cache, thinking)
DeploymentClaude API, AWS Bedrock, Google Vertex AI
ModalitiesText + image in, text out
Parameters1M context; 64k max output; extended thinking + adaptive thinking
ProsFast relative to Opus; strong general capability
ConsLess capable than Opus on the hardest prompts
Modelclaude-sonnet-4-6
Scalelarge
Benchmarking- SWE-bench Verified79.6%
- SWE-bench Multilingual75.9%
- Terminal-Bench 2.059.1%
- GPQA Diamond89.9%
- Claude Haiku 4.5AnthropicGenerative
Best forLow-latency chat, high-volume routing, cost-sensitive workloads
Price~$1/1M input, ~$5/1M output (see Anthropic pricing)
DeploymentClaude API, AWS Bedrock, Google Vertex AI
ModalitiesText + image in, text out
Parameters200k context; 64k max output; extended thinking (no adaptive thinking per model table)
ProsFastest and cheapest current Claude tier listed in overview
ConsSmaller context than Opus/Sonnet 1M tier
Modelclaude-haiku-4-5
Scalemedium
Benchmarking- SWE-bench Verified (public aggregate)73.3%
- WebArena53.1%
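The three Claude tiers above trade cost against capability, which is why Haiku is pitched at high-volume routing. A toy router using the model ids and approximate prices from these entries; the difficulty labels and thresholds are invented for illustration.

```python
# Approximate $/1M (input, output) prices from the Claude entries above.
TIERS = {
    "claude-haiku-4-5":  (1.0, 5.0),
    "claude-sonnet-4-6": (3.0, 15.0),
    "claude-opus-4-6":   (5.0, 25.0),
}

def pick_tier(difficulty: str) -> str:
    """Toy difficulty-based router; the labels are invented."""
    return {"easy": "claude-haiku-4-5",
            "normal": "claude-sonnet-4-6",
            "hard": "claude-opus-4-6"}[difficulty]

def est_cost(model: str, in_tok: int, out_tok: int) -> float:
    """Rough USD cost, ignoring batch/cache discounts."""
    price_in, price_out = TIERS[model]
    return (in_tok * price_in + out_tok * price_out) / 1_000_000

model = pick_tier("easy")
cost = est_cost(model, 50_000, 2_000)  # → 0.06
```

In practice a router would classify difficulty with a cheap model or heuristics; the point here is only the price spread across tiers.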
- CONTEXT-1ChromaAgentic retrieval
Best forMulti-hop retrieval, agentic search paired with a frontier reasoning model
PriceOpen weights (inference cost on your infra)
DeploymentHugging Face, local
ModalitiesText-only
Parameters~20B params; query decomposition, iterative corpus search, in-loop context editing
ProsBuilt for complex multi-hop retrieval as a sub-agent
ConsSpecialized workflow; not a general chat or coding model
Modelchromadb/context-1
ScaleN/A
Benchmarking- BrowseComp-Plus0.87
- FRAMES0.87
- LongSeal0.65
- Prune accuracy (context edit)0.94
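The entry above describes a loop of query decomposition, iterative corpus search, and in-loop context editing. The skeleton below sketches only that loop shape; every function is a hypothetical stub standing in for a model call, and none of this is Chroma's actual CONTEXT-1 interface.

```python
# Loop shape only: decompose -> search -> edit context -> repeat.
# All functions are hypothetical stubs, not CONTEXT-1's real API.

def decompose(query: str) -> list[str]:
    return [q.strip() for q in query.split(" and ")]  # stub decomposition

def search(corpus: dict[str, str], sub_query: str) -> list[str]:
    return [doc for doc in corpus.values() if sub_query.lower() in doc.lower()]

def prune(context: list[str], limit: int = 5) -> list[str]:
    return context[-limit:]  # stub "context edit": keep the freshest items

def agentic_retrieve(corpus: dict[str, str], query: str) -> list[str]:
    context: list[str] = []
    for sub in decompose(query):      # multi-hop: one pass per sub-query
        context.extend(search(corpus, sub))
        context = prune(context)      # in-loop context editing
    return context

corpus = {"a": "Paris is the capital of France.",
          "b": "France borders Spain."}
hits = agentic_retrieve(corpus, "capital of France and borders")
```

In the real system the decomposer, searcher, and pruner are the ~20B model acting as a sub-agent under a frontier reasoner, not string heuristics.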
- Groq Compound (Groq)GroqGenerative (orchestrated)
Best forTool-orchestrated search and routing across Groq-hosted models (e.g. open-weight ~120B-class models, per Groq)
PriceSee Groq Cloud pricing
DeploymentAPI (Groq)
ModalitiesText-only
ParametersProduct routes across retrieval + generative models; not one fixed public parameter count
ProsLow-latency Groq inference; unified compound surface
ConsWhich sub-model runs can change; not a single static checkpoint
Modelgroq-compound
Scalelarge
Benchmarking- RoleOrchestrated retrieval + generative routing
- Tiny Recursive Model (TRM)Samsung (SAIL Montréal)Reasoning / thinker
Best forStructured puzzle reasoning (ARC-AGI, Sudoku, mazes), recursive-reasoning research
PriceFree (open source)
DeploymentLocal
ModalitiesStructured grids / puzzles (not free-form conversational text)
Parameters~7M parameters, iterative latent and answer refinement (paper: arXiv:2510.04871)
ProsOrders-of-magnitude smaller than LLMs on comparable reasoning tasks, MIT-licensed repo
ConsNot a general chat model, embedding, or reranker; no official hosted API; requires a GPU stack for training/eval
ModelSamsungSAILMontreal/TinyRecursiveModels
Scaletiny
Benchmarking- ARC-AGI-1 (reported)44.6%
- ARC-AGI-2 (reported)7.8%
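TRM's core idea is repeatedly passing a latent state and a candidate answer through one tiny network until the answer stabilizes. The toy below mimics only that loop structure with a scalar fixed-point iteration (Heron's method for a square root); it is a numeric analogue for intuition, not the architecture from arXiv:2510.04871.

```python
# Toy analogue of iterative latent-and-answer refinement: repeatedly update
# a latent z, then re-decode the answer y from it. Mimics only TRM's loop
# structure, not its network.

def refine(z: float, y: float, target: float, steps: int = 16) -> float:
    for _ in range(steps):
        z = 0.5 * (z + target / max(z, 1e-9))  # latent update (Heron's method)
        y = z                                   # re-decode answer from latent
    return y

# Refining toward sqrt(2) from a rough initial guess:
ans = refine(z=1.0, y=0.0, target=2.0)
```

The payoff TRM claims is the same in spirit: a very small update rule, applied many times, can solve problems a single forward pass of the same capacity cannot.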