LLM Engines
A reference list of LLM and embedding models by provider: generative models, reasoners, embeddings, and rerankers. Each entry lists modalities, parameters, API or repo id, use cases, deployment, pricing, benchmark metrics where public numbers exist (vendor papers, MTEB, or common eval suites), and a coarse Scale class.
Scale is a capacity class: tiny, medium, or large. Top vendor lines also use large / frontier; teacher–student distillations use tiny / distilled, medium / distilled, or large / distilled, matching backbone size (details in prose where it matters). Economical embedding tiers use tiny / medium; strong API defaults use large; flagship lines (e.g. OpenAI text-embedding-3-large, Cohere embed-v4, Voyage voyage-3-large / voyage-code-3, Gemini Embedding 2) use large / frontier. Rerankers and niche tools are often left N/A.
- text-embedding-3-small · OpenAI · Embedding
Best for: RAG, semantic search, classification, recommendations, clustering, near-duplicate detection
Price: ~$0.02/1M tokens
Deployment: API
Modalities: Text-only
Parameters: 1,536 dims, 8,191 tokens
Pros: Strong MTEB scores, dimension reduction support
Cons: API-only (no local), text-only
Model: text-embedding-3-small
Scale: medium
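The dimension reduction support noted above refers to the `dimensions` request parameter on the text-embedding-3 family; a minimal sketch, assuming the `openai` Python SDK and an `OPENAI_API_KEY` in the environment (the cosine helper is illustrative):

```python
# Sketch: OpenAI embeddings with server-side dimension reduction.
# Assumes the `openai` SDK is installed and OPENAI_API_KEY is set.
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def embed(texts: list[str], dims: int = 512) -> list[list[float]]:
    """Embed texts, shortening vectors from 1,536 to `dims` server-side."""
    from openai import OpenAI  # deferred so the pure helper works offline
    client = OpenAI()
    resp = client.embeddings.create(
        model="text-embedding-3-small",
        input=texts,
        dimensions=dims,  # supported by the text-embedding-3 family
    )
    return [d.embedding for d in resp.data]
```

Shortened vectors trade a little accuracy for smaller indexes; compare with `cosine` on a held-out query set before committing to a dimension.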
Benchmarking: (none listed)
- text-embedding-3-large · OpenAI · Embedding
Best for: RAG, semantic search, classification, recommendations, multilingual search, complex queries
Price: ~$0.13/1M tokens
Deployment: API
Modalities: Text-only
Parameters: 3,072 dims, 8,191 tokens
Pros: Higher quality than 3-small, strong MTEB scores, dimension reduction support
Cons: API-only (no local), higher cost (~$0.13/1M tokens), text-only
Model: text-embedding-3-large
Scale: large / frontier
Benchmarking: (none listed)
- CLIP (open source) · OpenAI · Embedding
Best for: Text–image retrieval, image search by text, zero-shot image classification, cross-modal similarity
Price: Free (open source)
Deployment: Local (Hugging Face, PyTorch)
Modalities: Text + image
Parameters: Depends on variant (ViT-B/32, ViT-L/14, etc.); ~224×224 images; 512–768 dims
Pros: Open source, runs locally, strong text–image alignment, widely used
Cons: Text + image only (no video/audio), older than newer multimodal models
Model: openai/clip-vit-base-patch32, laion/CLIP-ViT-L-14, etc.
Scale: N/A
Benchmarking:
- ImageNet zero-shot (ViT-B/32, typical): ~76.2%
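Zero-shot classification with CLIP scores an image against free-text labels; a sketch assuming `transformers`, `torch`, and `Pillow` are installed (the checkpoint downloads from Hugging Face on first use, and the softmax helper is illustrative):

```python
# Sketch: zero-shot image classification with an open CLIP checkpoint.
import math

def softmax(xs: list[float]) -> list[float]:
    """Turn raw image-text similarity logits into probabilities."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def clip_zero_shot(image_path: str, labels: list[str]) -> list[float]:
    """Score an image against free-text labels with CLIP ViT-B/32."""
    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor
    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
    inputs = processor(text=labels, images=Image.open(image_path),
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image[0]
    return softmax(logits.tolist())
```

The returned list lines up with `labels`; the highest probability is the predicted class.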
- GPT-4o · OpenAI · Generative
Best for: Chat, coding, analysis, general tasks, vision
Price: ~$2.50/1M input, $10/1M output
Deployment: API
Modalities: Text + image
Parameters: 128K context, multimodal (text + image)
Pros: Fast, strong performance, vision
Cons: Higher cost than mini
Model: gpt-4o
Scale: large
Benchmarking:
- MMLU (5-shot, approx.): ~87%
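A minimal chat call against gpt-4o, assuming the `openai` Python SDK and an `OPENAI_API_KEY`; the helper name and system prompt are illustrative:

```python
# Sketch: minimal chat completion against gpt-4o with the OpenAI SDK.

def ask(prompt: str, model: str = "gpt-4o") -> str:
    from openai import OpenAI
    client = OpenAI()
    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "You are a concise assistant."},
            {"role": "user", "content": prompt},
        ],
    )
    return resp.choices[0].message.content
```

Swapping `model="gpt-4o-mini"` is the usual cost lever for high-volume traffic.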
- GPT-4o mini · OpenAI · Generative
Best for: Chat, simple tasks, high volume, cost-sensitive use
Price: ~$0.15/1M input, $0.60/1M output
Deployment: API
Modalities: Text + image
Parameters: 128K context, multimodal (text + image)
Pros: Cheaper, fast
Cons: Less capable than GPT-4o
Model: gpt-4o-mini
Scale: tiny
Benchmarking:
- MMLU (approx.): ~82%
- GPT-5.4 mini · OpenAI · Reasoning / thinker
Best for: Lightweight reasoning, high-volume agents, coding, cost-sensitive workloads
Price: ~$0.75/1M input, $4.50/1M output
Deployment: API
Modalities: Text + image (input)
Parameters: 400K context, reasoning tokens, image input
Pros: Strong mini tier, tools (Responses API), lower cost than GPT-5.4
Cons: Less capable than GPT-5.4; not a reranker
Model: gpt-5.4-mini
Scale: medium
Benchmarking:
- SWE-Bench Pro (public): 54.4%
- Terminal-Bench 2.0: 60.0%
- GPQA Diamond: 88.0%
- OSWorld-Verified: 72.1%
- MCP Atlas: 57.7%
- Toolathlon: 42.9%
- GPT-5.5 · OpenAI · Reasoning / thinker
Best for: Complex reasoning, coding, flagship multimodal when cost is secondary to quality
Price: Premium tier; confirm on pricing page
Deployment: API
Modalities: Text + image (input)
Parameters: Responses API; see docs for context caps, multimodal inputs, reasoning settings
Pros: Top capability tier in the GPT-5.x line for reasoning and coding in many setups
Cons: Highest cost vs smaller GPT-5.x variants; versioning follows OpenAI release notes
Model: gpt-5.5
Scale: large / frontier
Benchmarking:
- SWE-Bench Pro (public): 58.6%
- Terminal-Bench 2.0: 82.7%
- Expert-SWE (internal): 73.1%
- GDPval (wins or ties): 84.9%
- OSWorld-Verified: 78.7%
- GPQA Diamond: 93.6%
- GPT-5.5 pro · OpenAI · Reasoning / thinker
Best for: Hardest tasks where the GPT-5.5 base ceiling is insufficient; maximal precision tier
Price: Above GPT-5.5; confirm on pricing page
Deployment: API
Modalities: Text + image (input)
Parameters: Pro-tier GPT-5.5; quotas and rollout per docs
Pros: Stronger frontier behavior than base GPT-5.5 where Pro is enabled
Cons: Latency and pricing above the GPT-5.5 and GPT-5.4 tiers
Model: gpt-5.5-pro
Scale: large / frontier
Benchmarking:
- GDPval (wins or ties): 82.3%
- BrowseComp: 90.1%
- FrontierMath Tier 1–3: 52.4%
- Humanity's Last Exam (no tools): 43.1%
- Humanity's Last Exam (with tools): 57.2%
- GeneBench: 33.2%
- GPT-5.4 · OpenAI · Reasoning / thinker
Best for: Coding, agents, strong quality without GPT-5.5 pricing; broader context than GPT-5.4 mini alone
Price: Above GPT-5.4 mini/nano; confirm on pricing page
Deployment: API
Modalities: Text + image (input)
Parameters: Reasoning-capable GPT-5.4 flagship (non-mini); limits per Responses API docs
Pros: Best quality in the GPT-5.4 SKU family excluding the Pro tier
Cons: More expensive than the mini/nano tiers for trivial prompts
Model: gpt-5.4
Scale: large
Benchmarking:
- SWE-Bench Pro (public): 57.7%
- Terminal-Bench 2.0: 75.1%
- Expert-SWE (internal): 68.5%
- GDPval (wins or ties): 83.0%
- OSWorld-Verified: 75.0%
- GPQA Diamond: 92.8%
- GPT-5.4 pro · OpenAI · Reasoning / thinker
Best for: Highest discipline on tool use / reasoning vs base GPT-5.4 within the same family
Price: Above GPT-5.4; confirm on pricing page
Deployment: API
Modalities: Text + image (input)
Parameters: GPT-5.4 Pro tier per OpenAI naming; interface parity with gpt-5.4 base
Pros: Stronger frontier behavior than GPT-5.4 where the Pro tier is rolled out
Cons: Higher cost/latency than GPT-5.4 and GPT-5.4 mini
Model: gpt-5.4-pro
Scale: large / frontier
Benchmarking:
- GDPval (wins or ties): 82.0%
- BrowseComp: 89.3%
- FrontierMath Tier 1–3: 50.0%
- Humanity's Last Exam (no tools): 42.7%
- Humanity's Last Exam (with tools): 58.7%
- GeneBench: 25.6%
- GPT-5.4 nano · OpenAI · Generative
Best for: Micro-tasks, volume routing, cheap preprocessing before escalation to GPT-5.4 / GPT-5.5
Price: Lowest in the GPT-5.4 lineup; confirm on pricing page
Deployment: API
Modalities: Text + image (input)
Parameters: Economy GPT-5.4 class; multimodal/input rules follow API docs
Pros: Maximum throughput per dollar in the GPT-5.4 branded line
Cons: Weakest on hard reasoning vs mini or full GPT-5.4
Model: gpt-5.4-nano
Scale: tiny
Benchmarking:
- SWE-Bench Pro (public): 52.4%
- Terminal-Bench 2.0: 46.3%
- GPQA Diamond: 82.8%
- OSWorld-Verified: 39.0%
- MCP Atlas: 56.1%
- Toolathlon: 35.5%
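The "cheap preprocessing before escalation" pattern above is a simple router: send trivial prompts to the nano tier and escalate harder ones. A pure-Python sketch; the model ids follow this table and the difficulty signals and thresholds are purely illustrative:

```python
# Sketch: tier routing for the nano / mini / full split.
# Thresholds and marker words are illustrative; tune for your traffic.

def pick_tier(prompt: str) -> str:
    """Route by rough difficulty signals before spending on a big model."""
    hard_markers = ("prove", "refactor", "multi-step", "debug")
    if len(prompt) > 2000 or any(m in prompt.lower() for m in hard_markers):
        return "gpt-5.4"        # full tier for long or hard prompts
    if len(prompt) > 400:
        return "gpt-5.4-mini"   # mid tier for moderate prompts
    return "gpt-5.4-nano"       # cheap default for everything else
```

In production, a classifier or a nano-tier self-assessment call usually replaces the keyword heuristic.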
- BGE-large-en-v1.5 (BAAI General Embedding, Beijing Academy of Artificial Intelligence) · Sentence Transformers (Hugging Face) · Embedding
Best for: RAG, semantic search, retrieval, classification, open-source deployments
Price: Free (open source)
Deployment: Local
Modalities: Text-only
Parameters: 1,024 dims, 512 tokens, ~335M params
Pros: Strong MTEB scores, open source, runs locally, good quality/size trade-off
Cons: Text-only, shorter context than some models
Model: BAAI/bge-large-en-v1.5
Scale: large
Benchmarking:
- MTEB (English, v1): ~64.2%
- GTE-large-en-v1.5 (Alibaba General Text Embeddings) · Sentence Transformers (Hugging Face) · Embedding
Best for: RAG, search, classification, English long-context
Price: Free (open source)
Deployment: Local
Modalities: Text-only
Parameters: 1,024 dims, 8,192 tokens, ~335M params
Pros: High MTEB score, long context
Cons: English-only; other GTE checkpoints cover multilingual
Model: Alibaba-NLP/gte-large-en-v1.5
Scale: large
Benchmarking:
- MTEB (English, v1): ~64.0%
- E5-mistral-7b-instruct · Sentence Transformers (Hugging Face) · Embedding
Best for: High-quality retrieval, RAG, complex queries
Price: Free (open source)
Deployment: Local
Modalities: Text-only
Parameters: 4,096 dims, 32K tokens, 7B params
Pros: Top MTEB performance, long context
Cons: Heavy; needs more GPU memory
Model: intfloat/e5-mistral-7b-instruct
Scale: medium
Benchmarking:
- MTEB (English, v1): ~66.5%
- BGE-reranker-large · Sentence Transformers (Hugging Face) · Reranker
Best for: RAG reranking, passage–query relevance
Price: Free (open source)
Deployment: Local
Modalities: Text-only
Parameters: Cross-encoder, query + passage → score
Pros: Strong MTEB, works with Sentence Transformers
Cons: Text-only
Model: BAAI/bge-reranker-large
Scale: N/A
Benchmarking:
- MTEB T2 reranking (MAP): 67.6
- MTEB T2 ZH→EN (MAP): 64.03
- MTEB T2 EN→ZH (MAP): 61.44
- MTEB MMarco reranking (MAP): 37.16
- MTEB CMedQAv1 (MAP): 82.15
- MTEB CMedQAv2 (MAP): 84.18
- C-MTEB rerank mean (MAP): 66.09
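A cross-encoder reads query and passage together and emits one relevance score per pair, unlike the bi-encoder embeddings above. A sketch using the `CrossEncoder` class from `sentence-transformers` (checkpoint downloads on first use; the pair-builder helper is illustrative):

```python
# Sketch: cross-encoder reranking with BAAI/bge-reranker-large.

def make_pairs(query: str, passages: list[str]) -> list[tuple[str, str]]:
    """Build (query, passage) pairs for cross-encoder scoring."""
    return [(query, p) for p in passages]

def rerank(query: str, passages: list[str]) -> list[tuple[float, str]]:
    """Score each passage against the query; best first."""
    from sentence_transformers import CrossEncoder
    ce = CrossEncoder("BAAI/bge-reranker-large")
    scores = ce.predict(make_pairs(query, passages))
    return sorted(zip((float(s) for s in scores), passages), reverse=True)
```

The usual pattern is vector retrieval for the top ~50 candidates, then this reranker for the final top 5.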
- BGE-reranker-v2-m3 · Sentence Transformers (Hugging Face) · Reranker
Best for: Multilingual reranking
Price: Free
Deployment: Local
Modalities: Text-only (multilingual)
Parameters: Multilingual cross-encoder
Pros: 100+ languages
Cons: Heavier than the base model
Model: BAAI/bge-reranker-v2-m3
Scale: N/A
Benchmarking:
- BEIR Q&A mean (NDCG@10): 67.34
- BEIR NQ (NDCG@10): 70.28
- BEIR HotpotQA (NDCG@10): 86.35
- BEIR FiQA (NDCG@10, NV-Mistral7B-v2 pool): 45.39
- BEIR FiQA NDCG@10 (bge-large-en-v1.5 pool): 44.83
- BEIR FiQA NDCG@100 (bge-large-en-v1.5 pool): 51.53
- BEIR FiQA MRR@10 (bge-large-en-v1.5 pool): 53.38
- Gemini Embedding 2 · Google · Embedding
Best for: Multimodal RAG, cross-modal search, semantic search over text, images, video, audio, documents
Price: text ~$0.20/1M tokens, images ~$0.45/1M tokens, audio ~$6.50/1M tokens, video ~$12.00/1M tokens
Deployment: API
Modalities: Text, image, video, audio, documents
Parameters: 3,072 dims, 8K tokens (text), images (up to 6), video (~120s), audio (~80s)
Pros: Native multimodal, single embedding space, Matryoshka-style compression, 100+ languages
Cons: API-only (no local), higher cost than text-only models
Model: gemini-embedding-2-preview
Scale: large / frontier
Benchmarking:
- MMTEB task mean: 68.32
- MTEB (English, v2) task mean: 73.30
- MTEB (Code): 74.66
- XOR-Retrieve Recall@5: 90.42
- XTREME-UP MRR@10: 64.33
- Borda rank MMTEB (multilingual): #1
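A sketch of calling the embedding endpoint through the `google-genai` SDK, assuming a `GEMINI_API_KEY` is set; the model id below is the preview id from this table and may differ in your account, so treat the whole call shape as an assumption to verify against current docs:

```python
# Sketch: text embeddings via the google-genai SDK.
# The model id is the preview id listed in this table (an assumption).

def embed_docs(texts: list[str]) -> list[list[float]]:
    from google import genai
    client = genai.Client()  # reads GEMINI_API_KEY from the environment
    result = client.models.embed_content(
        model="gemini-embedding-2-preview",
        contents=texts,
    )
    return [e.values for e in result.embeddings]
```

Image, audio, and video inputs go through the same embedding space per the vendor docs, but their request shapes differ; check the current API reference before relying on this.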
- Gemini Embedding 2 (Vertex AI) · Google · Embedding
Best for: Multimodal RAG, cross-modal search, semantic search over text, images, video, audio, documents
Price: text ~$0.15/1M tokens (Vertex ~$0.15/1M vs Gemini API ~$0.20/1M is typical; confirm current pricing pages); images/audio/video similar to the Gemini API
Deployment: API (Vertex AI)
Modalities: Text, image, video, audio, documents
Parameters: 3,072 dims, 8K tokens (text), images (up to 6), video (~120s), audio (~80s)
Pros: Native multimodal, single embedding space, Matryoshka-style compression, 100+ languages, enterprise support, VPC integration
Cons: API-only (no local), higher cost than text-only models
Model: Vertex AI (text-multilingual-embedding-002 or gemini-embedding-2)
Scale: large / frontier
Benchmarking:
- MMTEB task mean: 68.32
- MTEB (English, v2) task mean: 73.30
- MTEB (Code): 74.66
- XOR-Retrieve Recall@5: 90.42
- XTREME-UP MRR@10: 64.33
- Borda rank MMTEB (multilingual): #1
- Gemini 2.5 Pro (Thinking mode) · Google · Reasoning / thinker
Best for: Math, coding, multi-step reasoning, agents
Price: ~$1.25/1M in, $10/1M out (prompt ≤200k tokens); ~$2.50/1M in, $15/1M out (>200k)
Deployment: Gemini API, Vertex AI
Modalities: Text, image, audio, video
Parameters: Extended reasoning, 1M context
Pros: Strong reasoning, multimodal
Cons: Slower, higher cost
Model: gemini-2.5-pro (with thinking enabled)
Scale: large / frontier
Benchmarking:
- GPQA Diamond (pass@1): 86.4%
- SWE-Bench Verified (single attempt): 59.6%
- MMMU (pass@1): 82.0%
- VideoMMMU: 83.6%
- Global MMLU: 89.2%
- AIME 2025 (pass@1): 88.0%
- Gemini 3.1 Pro Preview · Google · Reasoning / thinker
Best for: Complex reasoning, coding, agents, multimodal analysis
Price: Paid (split by prompt length): input $2.00/1M (≤200k tokens) or $4.00/1M (>200k); output $12.00/1M or $18.00/1M
Deployment: Gemini API, Vertex AI
Modalities: Text, image, audio, video (per model doc)
Parameters: See model doc (preview; limits updated there)
Pros: Newest Pro tier in the Gemini 3.1 line, strong agentic / coding positioning
Cons: Preview (behavior/rates may change); higher cost than Flash
Model: gemini-3.1-pro-preview
Scale: large / frontier
Benchmarking:
- GPQA Diamond: 94.3%
- SWE-Bench Verified (single attempt): 80.6%
- SWE-Bench Pro (single attempt): 54.2%
- Terminal-Bench 2.0: 68.5%
- ARC-AGI-2 (verified): 77.1%
- Humanity's Last Exam (no tools): 44.4%
- Gemini 2.5 Flash (Thinking) · Google · Reasoning / thinker
Best for: Lightweight reasoning, cost-sensitive reasoning
Price: ~$0.30/1M input, $2.50/1M output (thinking counted in output; audio in $1/1M)
Deployment: Gemini API, Vertex AI
Modalities: Text, image, audio, video
Parameters: Smaller reasoning model
Pros: Cheaper than Pro thinking
Cons: Less capable than Pro
Model: gemini-2.5-flash-thinking
Scale: medium
Benchmarking:
- SWE-Bench Verified (single attempt): 60.4%
- GPQA Diamond: 82.8%
- AIME 2025 (no tools): 72.0%
- MMMU-Pro (no tools): 66.7%
- Terminal-Bench 2.0: 16.9%
- Humanity's Last Exam (no tools): 11.0%
- Gemini 3 Flash Preview (Thinking) · Google · Reasoning / thinker
Best for: Fast Gemini 3 Flash tier, reasoning, agents, search/grounding-heavy work
Price: Paid (standard): ~$0.50/1M input (text/image/video), ~$1.00/1M (audio); ~$3.00/1M output (thinking in output)
Deployment: Gemini API, Vertex AI
Modalities: Text, image, audio, video, PDF
Parameters: Preview; 1M in / 65k out; thinking via API config
Pros: Newer Flash line than 2.5; strong speed + capability mix
Cons: Preview; higher $/1M than 2.5 Flash; stricter limits than stable models
Model: gemini-3-flash-preview
Scale: medium
Benchmarking:
- SWE-Bench Verified (single attempt): 78.0%
- GPQA Diamond: 90.4%
- MMMU-Pro (no tools): 81.2%
- Terminal-Bench 2.0 (Terminus-2): 47.6%
- LiveCodeBench Pro (Elo): 2316
- Humanity's Last Exam (no tools): 33.7%
- Gemma 4 E2B · Google / DeepMind · Generative
Best for: On-device and edge chat, agents, coding; multimodal (text, image, video, audio) at small sizes
Price: Open weights (infra cost); see AI Studio / Vertex if hosted
Deployment: Local, Hugging Face, Kaggle, Ollama, Vertex AI
Modalities: Text, image, video, audio
Parameters: E2B line, PLE architecture, 128K context (see model card for parameterization)
Pros: Tiny footprint for the Gemma 4 line, native multimodal on E2B/E4B, function calling, thinking modes
Cons: Lower ceiling than the 31B / MoE variants
Model: google/gemma-4-E2B-it
Scale: tiny
Benchmarking:
- MMLU Pro: 60.0%
- LiveCodeBench v6: 29.1%
- GPQA Diamond: 42.4%
- Tau2 (avg of 3 domains): 16.2%
- MMMU Pro: 44.2%
- MMMLU: 67.4%
- Gemma 4 E4B · Google / DeepMind · Generative
Best for: Edge and browser-class workloads, agents, coding; multimodal (text, image, video, audio)
Price: Open weights (infra cost); see AI Studio / Vertex if hosted
Deployment: Local, Hugging Face, Kaggle, Ollama, Vertex AI
Modalities: Text, image, video, audio
Parameters: E4B line, PLE architecture, 128K context (see model card for parameterization)
Pros: Strong quality for its size class, multimodal on small Gemma 4, Apache-2.0-style Gemma terms
Cons: Higher static VRAM than a naive 4B due to PLE tables
Model: google/gemma-4-E4B-it
Scale: tiny
Benchmarking:
- MMLU Pro: 69.4%
- LiveCodeBench v6: 44.0%
- GPQA Diamond: 43.4%
- Tau2 (avg of 3 domains): 24.5%
- MMMU Pro: 52.6%
- MMMLU: 76.6%
- Gemma 4 26B A4B MoE · Google / DeepMind · Reasoning / thinker
Best for: High-throughput reasoning, coding, agents; ~4B active params per token
Price: Open weights (infra cost); full expert set loaded at inference
Deployment: Local, Hugging Face, Kaggle, Vertex AI
Modalities: Text, image, video, audio
Parameters: 26B MoE (A4B active per token), 256K context, function calling, thinking modes
Pros: Efficient per-token cost vs a dense 26B-class model, strong reasoning positioning
Cons: Memory footprint closer to the full 26B than to the 4B active
Model: google/gemma-4-26B-A4B-it
Scale: large
Benchmarking:
- MMLU Pro: 82.6%
- LiveCodeBench v6: 77.1%
- GPQA Diamond: 82.3%
- Tau2 (avg of 3 domains): 68.2%
- Codeforces Elo: 1718
- MRCR v2 (8 needle, 128k avg): 44.1%
- Gemma 4 31B · Google / DeepMind · Reasoning / thinker
Best for: Deep reasoning, coding, enterprise and server-grade open-weight deployment
Price: Open weights (infra cost)
Deployment: Local, Hugging Face, Kaggle, Vertex AI
Modalities: Text, image, video, audio
Parameters: 31B dense, 256K context, multimodal (text, image, video, audio), function calling
Pros: Top open-weight Gemma 4 dense tier, long context, agentic tooling
Cons: Heavy GPU/TPU requirements at full precision
Model: google/gemma-4-31B-it
Scale: large / frontier
Benchmarking:
- MMLU Pro: 85.2%
- LiveCodeBench v6: 80.0%
- GPQA Diamond: 84.3%
- AIME 2026 (no tools): 89.2%
- Codeforces Elo: 2150
- MRCR v2 (8 needle, 128k avg): 66.4%
- Vertex AI Ranking API · Google · Reranker
Best for: RAG reranking, semantic ranking
Price: $1.00 USD per 1,000 ranking calls
Deployment: Vertex AI
Modalities: Text-only
Parameters: Semantic reranker, <100ms latency
Pros: Low latency, strong performance
Cons: Vertex AI only, API-only
Model: semantic-ranker-default-004, semantic-ranker-fast-004
Scale: N/A
Benchmarking:
- TREC DL mean (NDCG@10, 512-003, Pinecone): 64.89
- Latency (Vertex RAG docs): <100ms typical
- Default-004 vs API baselines (vendor blog): ≥2× faster
- Fast-004 vs default-004 (vendor blog): ~3× lower latency
- Max tokens / ranking request (vendor blog): 200k
- Max tokens / record (004, docs): 1024
- Max records / request (docs): 200
- embed-english-v3.0 · Cohere · Embedding
Best for: English RAG, search, classification
Price: ~$0.10/1M tokens
Deployment: API
Modalities: Text + image
Parameters: 1,024 dims, English; API supports text and images
Pros: Strong search, RAG
Cons: English only
Model: embed-english-v3.0
Scale: large
Benchmarking:
- MTEB NQ (NDCG@10): 61.56
- MTEB HotpotQA (NDCG@10): 70.72
- MTEB FiQA2018 (NDCG@10): 42.19
- MTEB QuoraRetrieval (NDCG@10): 88.72
- MTEB MSMARCO (NDCG@10): 42.93
- MTEB FEVER (NDCG@10): 88.97
- MTEB ArguAna (NDCG@10): 61.52
- embed-v4 · Cohere · Embedding
Best for: Multimodal RAG, text + image + PDF, mixed content
Price: ~$0.10–0.12/1M tokens
Deployment: API
Modalities: Text + image
Parameters: Multimodal (text, image), unified embedding space
Pros: Multimodal, high-res images, PDFs
Cons: API-only
Model: embed-v4
Scale: large / frontier
Benchmarking:
- MTEB avg (reported rollup): 69.8%
- Retrieval subset (reported): 58.2%
- Classification subset (reported): 78.4%
- Multimodal: Text + image in shared space
- embed-multilingual-v3.0 · Cohere · Embedding
Best for: Multilingual RAG, search
Price: ~$0.10/1M tokens
Deployment: API
Modalities: Text + image
Parameters: 1,024 dims, 100+ languages; API supports text and images
Pros: Multilingual
Cons: API-only
Model: embed-multilingual-v3.0
Scale: large
Benchmarking:
- MTEB (multilingual cluster, typical): ~61–64
- Languages (API): 100+
- vs Multilingual E5 (Voyage suite cited): +3.89%
- vs OpenAI v3 large (Voyage suite cited): +4.55%
- Command R · Cohere · Generative
Best for: RAG, chat, cost-sensitive use
Price: ~$0.15/1M input, $0.60/1M output
Deployment: API
Modalities: Text-only
Parameters: 128K context
Pros: Cheaper than R+, solid RAG
Cons: Less capable than R+
Model: command-r
Scale: large
Benchmarking:
- MMLU (5-shot, reported): ~75%
- HumanEval (reported): ~71%
- GSM8K (reported): ~71%
- TriviaQA (reported): ~85%
- Command R+ · Cohere · Generative
Best for: RAG, tool use, agents, multilingual
Price: ~$2.50/1M input, $10/1M output
Deployment: API
Modalities: Text-only
Parameters: 128K context, 23 languages
Pros: Strong RAG, tool use, multilingual
Cons: Higher cost than Command R
Model: command-r-plus
Scale: large
Benchmarking:
- MMLU (0-shot, reported): 75.7%
- GSM8K (8-shot, reported): 70.7%
- HumanEval (pass@1, reported): ~79%
- Open LLM leaderboard (avg, reported): 74.6%
- Command R7B · Cohere · Generative
Best for: High volume, simple tasks
Price: ~$0.04/1M input, $0.15/1M output
Deployment: API
Modalities: Text-only
Parameters: 7B params
Pros: Fast, cheap
Cons: Less capable
Model: command-r7b
Scale: tiny
Benchmarking:
- MMLU (typical small-command tier): ~55–60%
- HumanEval (typical): ~65–75%
- Role: High-throughput routing / simple RAG
- Command A Reasoning · Cohere · Reasoning / thinker
Best for: Agents, tool use, complex reasoning
Price: Higher than Command R+
Deployment: API
Modalities: Text-only
Parameters: 111B params, 256K context, 23 languages
Pros: Strong reasoning, tool use, agentic
Cons: API-only, higher cost
Model: command-a-reasoning-08-2025
Scale: large / frontier
Benchmarking:
- MMLU: 85.5
- MMLU-Pro: 69.6
- GPQA Diamond: 50.8
- MATH (all): 80.0
- AIME (2024): 23.3
- IFEval: 90.9
- TauBench (retail): 51.7
- rerank-v4.0-pro · Cohere · Reranker
Best for: RAG reranking, high relevance
Price: Per search
Deployment: API
Modalities: Text-only
Parameters: 32K context
Pros: High quality
Cons: Slower than the fast variant
Model: rerank-v4.0-pro
Scale: N/A
Benchmarking:
- T2-RAGBench R@1: 47.2
- T2-RAGBench R@3: 75.8
- T2-RAGBench R@5: 81.6
- T2-RAGBench R@10: 86.1
- T2-RAGBench MRR@3: 60.5
- T2-RAGBench nDCG@10: 68.3
- T2-RAGBench MAP: 62.5
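A sketch of second-stage reranking through Cohere's rerank endpoint, assuming the `cohere` Python SDK with an API key in the environment; the model id is taken from this table, so verify it against current Cohere docs:

```python
# Sketch: Cohere rerank as a second stage after vector retrieval.
# Model id from this table (an assumption to verify).

def rerank_top(query: str, docs: list[str], n: int = 3) -> list[str]:
    """Return the n most relevant docs for the query, best first."""
    import cohere
    co = cohere.ClientV2()  # reads the API key from the environment
    resp = co.rerank(model="rerank-v4.0-pro", query=query,
                     documents=docs, top_n=n)
    return [docs[r.index] for r in resp.results]
```

Each result carries an index into the original `documents` list plus a relevance score, so the caller keeps ownership of the document payloads.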
- rerank-v3.5 · Cohere · Reranker
Best for: Multilingual reranking
Price: ~$2/1,000 searches
Deployment: API
Modalities: Text-only (multilingual)
Parameters: 4,096 tokens
Pros: Multilingual
Cons: Shorter context than v4
Model: rerank-v3.5
Scale: N/A
Benchmarking:
- Reasoning suite P@1 (2 candidates): 81.59
- Multilingual suite mean nDCG@10: 62.18
- Same suite, Hybrid nDCG@10: 52.10
- Same suite, Dense nDCG@10: 53.83
- Same suite, Cohere Rerank 3 nDCG@10: 52.27
- Reasoning suite, Hybrid P@1: 48.80
- Reasoning suite, Dense P@1: 50.64
- voyage-3-large · Voyage AI · Embedding
Best for: High-accuracy RAG, semantic search, multilingual retrieval, enterprise retrieval quality
Price: Premium tier (check the latest Voyage pricing page)
Deployment: API
Modalities: Text-only
Parameters: Text embeddings, long-context input, Matryoshka-style dimension shortening support
Pros: Very strong retrieval quality, good multilingual performance, flexible embedding size
Cons: API-only (no local), higher cost than lite variants
Model: voyage-3-large
Scale: large / frontier
Benchmarking:
- NDCG@10 (est. from voyage-3 76.72 + 4.14% rel.): ~79.9
- vs OpenAI v3 large (avg domains, vendor): +9.74%
- vs voyage-3 (avg domains, vendor): +4.14%
- Context length: 32K
- voyage-3 · Voyage AI · Embedding
Best for: General-purpose RAG, semantic search, recommendations, classification
Price: Mid tier (check the latest Voyage pricing page)
Deployment: API
Modalities: Text-only
Parameters: Text embeddings, long-context input, balanced quality/latency
Pros: Strong quality at better cost than large
Cons: API-only, not as accurate as large on hard retrieval sets
Model: voyage-3
Scale: medium
Benchmarking:
- NDCG@10: 76.72%
- voyage-3-lite · Voyage AI · Embedding
Best for: High-volume embedding pipelines, cost-sensitive search/RAG, near-duplicate detection
Price: Low-cost tier (check the latest Voyage pricing page)
Deployment: API
Modalities: Text-only
Parameters: Lightweight text embeddings, lower latency
Pros: Cheap, fast, scalable
Cons: Lower retrieval quality than voyage-3 / voyage-3-large
Model: voyage-3-lite
Scale: tiny
Benchmarking:
- NDCG@10 (avg suite): 72.98
- vs OpenAI v3 large (avg domains, vendor): +3.82%
- vs OpenAI v3 small, same price (vendor): +7.58%
- Context length: 32K
- voyage-code-3 · Voyage AI · Embedding
Best for: Code search, code RAG, repository retrieval, code similarity
Price: Specialized tier (check the latest Voyage pricing page)
Deployment: API
Modalities: Text + code
Parameters: Code-focused embeddings for natural language + code retrieval
Pros: Better code retrieval than general text embeddings
Cons: API-only, less ideal for purely non-code corpora
Model: voyage-code-3
Scale: large / frontier
Benchmarking:
- NDCG@10 @1024 dims (avg code suite): 92.28%
- vs OpenAI v3 large, avg lift (vendor): +13.80%
- vs CodeSage-large, avg lift (vendor): +16.81%
- Context length: 32K
- rerank-2 · Voyage AI · Reranker
Best for: Re-ranking top-k retrieved documents for higher precision in RAG
Price: Per-request/token based (check the latest Voyage pricing page)
Deployment: API
Modalities: Text-only
Parameters: Cross-encoder reranker (query-document relevance scoring)
Pros: Noticeable precision boost after initial vector retrieval
Cons: Extra latency/cost step after retrieval
Model: rerank-2
Scale: N/A
Benchmarking:
- ICLERB nDCG@10: 63.86
- ICLERB nDCG@50: 64.32
- Rel. mean NDCG@10 lift vs text-embedding-3-large (%): 13.89
- Mean NDCG@10 − rerank-1 (pp, 3 first-stages): 2.84
- Mean NDCG@10 − Cohere rerank-english-v3.0 (pp): 6.33
- Mean NDCG@10 − bge-reranker-v2-m3 (pp): 14.75
- Mean NDCG@10 − Cohere rerank-multilingual-v3.0 (pp, 51 ds.): 8.83
- rerank-2-lite · Voyage AI · Reranker
Best for: Cost-sensitive re-ranking at larger scale
Price: Lower than rerank-2 (check the latest Voyage pricing page)
Deployment: API
Modalities: Text-only
Parameters: Lightweight reranker optimized for speed/cost
Pros: Faster and cheaper than rerank-2
Cons: Slightly lower precision than full rerank-2
Model: rerank-2-lite
Scale: N/A
Benchmarking:
- Rel. mean NDCG@10 lift vs text-embedding-3-large (%): 11.86
- Mean rel. lift below rerank-2 (pp, same 93-ds. TL;DR): 2.03
- Mean NDCG@10 − Cohere rerank-english-v3.0 (pp): 4.49
- Mean NDCG@10 − bge-reranker-v2-m3 (pp): 12.91
- Mean NDCG@10 − Cohere rerank-multilingual-v3.0 (pp, 51 ds.): 6.24
- Mean NDCG@10 − bge-reranker-v2-m3 (pp, multilingual): 2.26
- Rel. lift above the Cohere v3 stack (pp, vendor intro): 5.12
- Qwen2.5-72B-Instruct · Alibaba Qwen · Generative
Best for: High-quality chat, analysis, coding, complex instruction following
Price: Open weights (inference cost depends on your infra) or provider API pricing
Deployment: Local or API (provider-dependent)
Modalities: Text-only
Parameters: 72B open-weight instruct model, long-context variants available
Pros: Strong general quality, good multilingual support, open weights for self-hosting
Cons: Heavy compute for local inference, not as cheap as smaller variants
Model: Qwen/Qwen2.5-72B-Instruct
Scale: large
Benchmarking:
- MMLU (5-shot): 86.1
- GSM8K: 89.92
- HumanEval: 88.57
- MATH: 75.96
- Qwen2.5-32B-Instruct · Alibaba Qwen · Generative
Best for: Strong quality with lower cost/latency than 72B
Price: Open weights / provider-dependent API pricing
Deployment: Local or API
Modalities: Text-only
Parameters: 32B instruct model, long-context variants available
Pros: Good quality/performance balance
Cons: Still requires substantial resources for local serving
Model: Qwen/Qwen2.5-32B-Instruct
Scale: medium
Benchmarking:
- MMLU (5-shot): 83.5
- GSM8K: 88.43
- HumanEval: 84.76
- MATH: 68.58
- Qwen2.5-14B-Instruct · Alibaba Qwen · Generative
Best for: Mid-size production assistants, cost-aware coding/chat
Price: Open weights / provider-dependent API pricing
Deployment: Local or API
Modalities: Text-only
Parameters: 14B instruct model
Pros: Much easier to serve than 32B/72B, solid instruction following
Cons: Lower reasoning depth than larger Qwen models
Model: Qwen/Qwen2.5-14B-Instruct
Scale: medium
Benchmarking:
- MMLU (5-shot): 79.93
- GSM8K: 82.93
- HumanEval: 75.76
- MATH: 57.71
- Qwen2.5-7B-Instruct · Alibaba Qwen · Generative
Best for: Lightweight assistants, edge/server cost-sensitive workloads
Price: Open weights / provider-dependent API pricing
Deployment: Local or API
Modalities: Text-only
Parameters: 7B instruct model
Pros: Fast, cheaper inference, widely deployable
Cons: Lower quality on harder reasoning/coding tasks
Model: Qwen/Qwen2.5-7B-Instruct
Scale: tiny
Benchmarking:
- MMLU (5-shot): 74.22
- GSM8K: 85.68
- HumanEval: 68.18
- MATH: 49.76
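Local chat with this checkpoint via `transformers` follows the standard chat-template pattern; a sketch assuming `transformers` and `torch` are installed and the checkpoint fits on your hardware:

```python
# Sketch: local chat with Qwen2.5-7B-Instruct via transformers.
# The checkpoint downloads from Hugging Face on first use.

def chat(prompt: str, max_new_tokens: int = 256) -> str:
    from transformers import AutoModelForCausalLM, AutoTokenizer
    name = "Qwen/Qwen2.5-7B-Instruct"
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name, device_map="auto")
    messages = [{"role": "user", "content": prompt}]
    # The chat template inserts the model's expected role markers.
    text = tok.apply_chat_template(messages, tokenize=False,
                                   add_generation_prompt=True)
    inputs = tok(text, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Decode only the newly generated tokens, not the echoed prompt.
    return tok.decode(out[0][inputs.input_ids.shape[1]:],
                      skip_special_tokens=True)
```

The same pattern works for the 14B/32B/72B checkpoints by changing `name`, given enough VRAM.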
- Qwen2.5-Coder-32B-Instruct · Alibaba Qwen · Generative (code-focused)
Best for: Code generation, refactoring, repo Q&A, code reasoning
Price: Open weights / provider-dependent API pricing
Deployment: Local or API
Modalities: Text-only
Parameters: 32B code-specialized instruct model
Pros: Strong coding quality vs general-only models
Cons: Large-model serving cost; for simple tasks, smaller Coder variants may be enough
Model: Qwen/Qwen2.5-Coder-32B-Instruct
Scale: large
Benchmarking:
- HumanEval (pass@1): ~86–90
- MBPP: strong coder tier
- Role: Code-specialized instruct
- Qwen2.5-VL-72B-Instruct · Alibaba Qwen · Generative
Best for: Vision + text understanding, document/image question answering
Price: Open weights / provider-dependent API pricing
Deployment: Local or API
Modalities: Text + image
Parameters: Multimodal (text + image) model family
Pros: Strong multimodal capability in the Qwen ecosystem
Cons: Higher compute and serving complexity than text-only models
Model: Qwen/Qwen2.5-VL-72B-Instruct
Scale: large
Benchmarking:
- MMMU (val): strong open multimodal tier
- DocVQA: competitive document QA
- Role: 72B multimodal instruct
- DeepSeek-V3 · DeepSeek · Generative
Best for: General chat, coding, analysis, high-quality assistant tasks
Price: Provider-dependent API pricing
Deployment: API (and provider-hosted endpoints)
Modalities: Text-only
Parameters: Large MoE-style foundation model line, long-context-capable variants via providers
Pros: Strong quality-to-cost, good coding and multilingual performance
Cons: API/provider availability can vary by region; behavior depends on hosting/provider tuning
Model: DeepSeek-V3
Scale: large / frontier
Benchmarking:
- MMLU (pass@1): 88.5
- MMLU-Pro (EM): 75.9
- GPQA Diamond (pass@1): 59.1
- MATH-500 (EM): 90.2
- SWE-bench Verified: 42.0%
- AIME 2024 (pass@1): 39.2%
- DeepSeek-R1 · DeepSeek · Reasoning / thinker
Best for: Reasoning-heavy tasks, math, logic, multi-step planning/problem solving
Price: Provider-dependent API pricing
Deployment: API
Modalities: Text-only
Parameters: Reasoning-oriented model family with deliberate chain-style behavior
Pros: Strong reasoning performance, useful for hard step-by-step tasks
Cons: Higher latency/token usage than non-reasoning models on simple prompts
Model: DeepSeek-R1
Scale: large / frontier
Benchmarking:
- MMLU (pass@1): 90.8
- GPQA Diamond (pass@1): 71.5
- AIME 2024 (pass@1): 79.8
- MATH-500 (pass@1): 97.3
- LiveCodeBench (Pass@1-CoT): 65.9
- Codeforces Rating: 2029
- DeepSeek-R1-Distill-Llama-70B · DeepSeek · Reasoning / thinker
Best for: Cost-aware reasoning with strong quality, self-hosted reasoning workloads
Price: Open weights (infra cost) or provider-dependent API pricing
Deployment: Local or API
Modalities: Text-only
Parameters: Distilled reasoning model based on a Llama 70B backbone
Pros: Strong reasoning at lower cost/complexity than full frontier reasoning models
Cons: Lower ceiling than full DeepSeek-R1 on the hardest tasks
Model: deepseek-ai/DeepSeek-R1-Distill-Llama-70B
Scale: large / distilled
Benchmarking:
- AIME 2024 pass@1: 70.0
- GPQA Diamond pass@1: 65.2
- LiveCodeBench pass@1: 57.5
- Codeforces Rating: 1633
- DeepSeek-R1-Distill-Qwen-32B · DeepSeek · Reasoning / thinker
Best for: Mid-size reasoning deployments, balanced quality/latency
Price: Open weights (infra cost) or provider-dependent API pricing
Deployment: Local or API
Modalities: Text-only
Parameters: Distilled reasoning model on a Qwen 32B backbone
Pros: Good reasoning/cost balance, easier to serve than larger models
Cons: Less capable than 70B/full R1 on difficult benchmarks
Model: deepseek-ai/DeepSeek-R1-Distill-Qwen-32B
Scale: medium / distilled
Benchmarking:
- AIME 2024 pass@1: 72.6
- GPQA Diamond pass@1: 62.1
- LiveCodeBench pass@1: 57.2
- Codeforces Rating: 1691
- DeepSeek-Coder-V2-Instruct · DeepSeek · Generative (code-focused)
Best for: Code generation, refactoring, debugging, repo-level coding assistance
Price: Open weights (infra cost) or provider-dependent API pricing
Deployment: Local or API
Modalities: Text-only
Parameters: Code-specialized model line (various sizes/checkpoints)
Pros: Strong coding performance, practical for dev workflows
Cons: General non-code reasoning/chat can be weaker than top general models
Model: deepseek-ai/DeepSeek-Coder-V2-Instruct
Scale: large
Benchmarking:
- HumanEval: >90%
- Focus: Repo-scale code completion & repair
- kimi-k2.6Moonshot AI (Kimi)Reasoning / thinker
Best forAgent workflows, coding, and long-context multimodal reasoning with improved planning depth over earlier K2.x tiers
PriceSelf-hosted open weights: free model weights (infrastructure cost). Official API: $0.16/1M input (cache hit), $0.95/1M input (cache miss), $4.00/1M output
DeploymentLocal or API
ModalitiesText + image + video
ParametersMoE architecture: 1T total params / 32B activated per token; 61 layers; 384 experts (8 selected + 1 shared per token); 160K vocab; 256K context; MoonViT vision encoder (400M params)
ProsNewer K2.x line for stronger reasoning/planning scenarios
ConsPublic benchmark reporting and exact limits can change quickly by API release
Modelkimi-k2.6
Scalelarge / frontier
Benchmarking- SWE-Bench Pro58.6%
- SWE-Bench Verified80.2%
- Terminal-Bench 2.066.7%
- LiveCodeBench (v6)89.6%
- HLE-Full (w/ tools)54.0
- GPQA-Diamond90.5%
- kimi-k2.5Moonshot AI (Kimi)Generative
Best forMultimodal (image + video + text), vision-language, agent-style workflows; use thinking mode for harder reasoning
PriceToken-based; see Moonshot pricing docs
DeploymentAPI
ModalitiesText + image + video
ParametersMultimodal (image + video + text); video via upload / ms:// file refs; long context (~256K-class per model card; confirm API limits for your account)
ProsFlagship multimodal line; thinking mode where supported
ConsThinking vs instant modes and defaults differ from older v1 APIs; confirm latest docs
Modelkimi-k2.5
Scalelarge / frontier
Benchmarking- MMMU-Pro78.5%
- MathVision84.2%
- moonshot-v1-128k-vision-previewMoonshot AI (Kimi)Generative
Best forHeavy multimodal context (long system + user + images) in one shot
PriceToken-based; see Moonshot pricing docs
DeploymentAPI
ModalitiesText + image
ParametersVision model id for ~128K context tier (preview)
ProsLargest v1 vision context tier in the name
ConsPreview; most expensive/heaviest when you use full context
Modelmoonshot-v1-128k-vision-preview
Scalelarge
Benchmarking- BenchLM Overall Score47/100
- BenchLM Rank#84 / 127
- Multimodal & Grounded (BenchLM)52.6
- moonshot-v1-32k-vision-previewMoonshot AI (Kimi)Generative
Best forLonger multimodal chats / more image+text context in one request
PriceToken-based (vision token accounting); see Moonshot pricing docs
DeploymentAPI
ModalitiesText + image
ParametersVision model id for ~32K context tier (preview); same image rules as other Moonshot vision models
ProsMore room for instructions + image context than 8K tier
ConsPreview; higher token use and cost than 8K when you fill context
Modelmoonshot-v1-32k-vision-preview
Scalemedium
Benchmarking- BenchLM Overall Score47/100
- BenchLM Rank#84 / 127
- Multimodal & Grounded (BenchLM)52.6
- moonshot-v1-8k-vision-previewMoonshot AI (Kimi)Generative
Best forImage + text in one request; short prompts and small multimodal turns
PriceToken-based chat pricing (vision uses dynamic image/video tokens); see Moonshot pricing docs
DeploymentAPI
ModalitiesText + image
ParametersVision model id for ~8K context tier (preview); images: png, jpeg, webp, gif; see Moonshot vision guide for request format (base64 / file id)
ProsLower context cost vs larger tiers when input fits in 8K
ConsPreview name may change; long images + long text can hit limits faster
Modelmoonshot-v1-8k-vision-preview
Scaletiny
Benchmarking- BenchLM Overall Score47/100
- BenchLM Rank#84 / 127
- Multimodal & Grounded (BenchLM)52.6
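The vision guide referenced in the 8K entry accepts images either as inline base64 data URLs or as uploaded file ids inside a chat message. A hedged sketch of the inline-base64 shape, assuming Moonshot's documented OpenAI-compatible message format; the function names and the truncated image bytes are illustrative only.

```python
import base64

def image_part_from_bytes(raw: bytes, mime: str = "image/png") -> dict:
    """OpenAI-style image content part carrying an inline base64 data URL.
    Structure assumed from Moonshot's OpenAI-compatible vision docs; the
    alternative path is referencing an uploaded file id instead."""
    b64 = base64.b64encode(raw).decode("ascii")
    return {"type": "image_url",
            "image_url": {"url": f"data:{mime};base64,{b64}"}}

def vision_message(question: str, image_bytes: bytes) -> dict:
    """One user turn mixing an image part and a text part."""
    return {"role": "user",
            "content": [image_part_from_bytes(image_bytes),
                        {"type": "text", "text": question}]}

msg = vision_message("What is in this image?", b"\x89PNG...")  # truncated bytes for illustration
```

The same message object works across the 8K/32K/128K vision tiers; only the model id and the token budget change.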
- MiniMax M2.7 (MiniMax)MiniMax AIReasoning / thinker
Best forAgents, coding, software engineering, office workflows, long-context reasoning; flagship text line
PriceSee https://platform.minimax.io pricing (token plans / pay-as-you-go)
DeploymentAPI (OpenAI- or Anthropic-compatible SDKs per docs)
ModalitiesText-only
ParametersMoE-class stack (~230B total / ~100B active per public materials); very long context (~204.8K-class); tools / Anthropic-compatible API path
ProsStrong real-world engineering and agentic positioning; a highspeed variant is available for latency-sensitive paths
ConsAPI-only (no open weights); pricing/quotas region- and account-dependent
ModelMiniMax-M2.7
Scalelarge / frontier
Benchmarking- SWE-Pro56.22%
- Terminal Bench 257.0%
- VIBE-Pro55.6%
- GDPval-AA Elo1495
- MiniMax M2.7-highspeed (MiniMax)MiniMax AIReasoning / thinker
Best forSame tasks as M2.7 when you need lower latency / higher throughput
PriceSee MiniMax pricing (often differs from base M2.7)
DeploymentAPI
ModalitiesText-only
ParametersSame capability tier as M2.7; faster inference (vendor-tuned routing)
ProsSignificantly faster than base M2.7 for similar quality class
ConsAPI-only; throughput gains over base M2.7 can vary under load
ModelMiniMax-M2.7-highspeed
Scalelarge / frontier
Benchmarking- SWE-Pro56.22%
- Terminal Bench 257.0%
- VIBE-Pro55.6%
- GDPval-AA Elo1495
- MiniMax M2.5 (MiniMax)MiniMax AIGenerative
Best forCode generation, refactoring, polyglot coding, strong value tier before M2.7
PriceSee MiniMax pricing
DeploymentAPI
ModalitiesText-only
ParametersLong context (~204.8K-class per docs); code-optimized positioning
ProsPeak value tier in the M2 text line for coding-heavy workloads
ConsSuperseded for absolute frontier by M2.7 on vendor charts; API-only
ModelMiniMax-M2.5
Scalelarge
Benchmarking- SWE-bench Verified (aggregate)80.2%
- MiniMax M2.5-highspeed (MiniMax)MiniMax AIGenerative
Best forSame as M2.5 with lower latency
PriceSee MiniMax pricing
DeploymentAPI
ModalitiesText-only
ParametersSame performance class as M2.5; faster inference
ProsFast M2.5-class option for high-volume coding
ConsAPI-only
ModelMiniMax-M2.5-highspeed
Scalelarge
Benchmarking- SWE-bench Verified (aggregate)80.2%
- MiniMax M2.1 (MiniMax)MiniMax AIGenerative
Best forCode, reasoning, refactoring; legacy M2 line still listed for stable integrations
PriceSee MiniMax pricing
DeploymentAPI
ModalitiesText-only
ParametersMoE-style (~230B total, ~10B activated per token per docs); code-focused
ProsMature tier; often cheaper than newest flagship
ConsLegacy relative to M2.5 / M2.7; API-only
ModelMiniMax-M2.1
Scalelarge
Benchmarking- SWE-bench Verified74.0%
- Multi-SWE-bench49.4%
- SWE-bench Multilingual72.5%
- Terminal-Bench 2.047.9%
- MiniMax M2 (MiniMax)MiniMax AIReasoning / thinker
Best forLong output and agentic text (function calling, streaming) on older M2 generation
PriceSee MiniMax pricing
DeploymentAPI
ModalitiesText-only
Parameters~200K context; up to ~128K output (incl. chain-style content per docs)
ProsEstablished M2 generation; long outputs
ConsLegacy vs M2.5 / M2.7; API-only
ModelMiniMax-M2
Scalelarge
Benchmarking- SWE-bench Verified69.4%
- Multi-SWE-bench36.2%
- SWE-bench Multilingual56.5%
- Terminal-Bench 2.030.0%
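The M2 entry highlights function calling and streaming, and the M2.7 entry notes OpenAI- or Anthropic-compatible SDK paths. Below is a hedged sketch of an OpenAI-style function-calling request body for this line; whether MiniMax accepts this exact shape depends on their compatible endpoint (check their docs), and the `get_weather` tool is a hypothetical example.

```python
# OpenAI-style function-calling body; the get_weather tool is hypothetical.
def build_tool_request(user_msg: str) -> dict:
    get_weather = {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Look up current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
    return {
        "model": "MiniMax-M2",
        "stream": True,  # M2 supports long streamed outputs per the entry above
        "messages": [{"role": "user", "content": user_msg}],
        "tools": [get_weather],
    }

req = build_tool_request("Weather in Shanghai?")
```

Swapping `"model"` for `MiniMax-M2.5` or `MiniMax-M2.7` reuses the same body on the newer tiers.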
- M2-her (MiniMax)MiniMax AIGenerative
Best forRoleplay, multi-character dialogue, long-horizon character interaction
PriceSee MiniMax pricing
DeploymentAPI
ModalitiesText-only
ParametersText chat tuned for character and emotional expression
ProsSpecialized for interactive fiction / persona use cases
ConsNot a general coding frontier model; API-only
ModelM2-her
ScaleN/A
Benchmarking- DomainCharacter / dialogue fidelity
- Claude Opus 4.6AnthropicGenerative
Best forHardest tasks, agents, coding, long multimodal work
Price~$5/1M input, ~$25/1M output (see Anthropic pricing for batch, cache, thinking)
DeploymentClaude API, AWS Bedrock, Google Vertex AI
ModalitiesText + image in, text out
Parameters1M context; 128k max output; extended thinking + adaptive thinking
ProsStrongest Claude tier in the current lineup
ConsHighest latency and cost in the family
Modelclaude-opus-4-6
Scalelarge / frontier
Benchmarking- SWE-bench Verified80.84%
- SWE-bench Multilingual77.83%
- Terminal-Bench 2.065.4%
- GPQA Diamond91.31%
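Extended thinking on the Opus tier is enabled per request through the Anthropic Messages API. A minimal sketch that only builds the call kwargs, assuming Anthropic's documented `thinking` parameter; the budget and `max_tokens` values are illustrative, not recommendations (per the API, `max_tokens` must exceed the thinking budget).

```python
def opus_request_kwargs(prompt: str, thinking_budget: int = 10_000) -> dict:
    """Kwargs for anthropic.Anthropic().messages.create(...).
    The thinking block follows Anthropic's extended-thinking API;
    the budget value here is illustrative."""
    return {
        "model": "claude-opus-4-6",
        "max_tokens": 16_000,  # must be larger than the thinking budget
        "thinking": {"type": "enabled", "budget_tokens": thinking_budget},
        "messages": [{"role": "user", "content": prompt}],
    }

kwargs = opus_request_kwargs("Plan a migration of a 2M-line monolith to services.")
# resp = anthropic.Anthropic().messages.create(**kwargs)  # needs ANTHROPIC_API_KEY
```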
- Claude Sonnet 4.6AnthropicGenerative
Best forProduction chat, agents, coding, vision; balance of speed and quality
Price~$3/1M input, ~$15/1M output (see Anthropic pricing for batch, cache, thinking)
DeploymentClaude API, AWS Bedrock, Google Vertex AI
ModalitiesText + image in, text out
Parameters1M context; 64k max output; extended thinking + adaptive thinking
ProsFast relative to Opus; strong general capability
ConsLess capable than Opus on the hardest prompts
Modelclaude-sonnet-4-6
Scalelarge
Benchmarking- SWE-bench Verified79.6%
- SWE-bench Multilingual75.9%
- Terminal-Bench 2.059.1%
- GPQA Diamond89.9%
- Claude Haiku 4.5AnthropicGenerative
Best forLow-latency chat, high-volume routing, cost-sensitive workloads
Price~$1/1M input, ~$5/1M output (see Anthropic pricing)
DeploymentClaude API, AWS Bedrock, Google Vertex AI
ModalitiesText + image in, text out
Parameters200k context; 64k max output; extended thinking (no adaptive thinking per model table)
ProsFastest and cheapest current Claude tier listed in overview
ConsSmaller context than Opus/Sonnet 1M tier
Modelclaude-haiku-4-5
Scalemedium
Benchmarking- SWE-bench Verified (public aggregate)73.3%
- WebArena53.1%
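The three Claude tiers above trade cost against capability, which is why Haiku is pitched at high-volume routing. A toy router using the model ids and approximate prices from these entries; the difficulty labels and thresholds are invented for illustration.

```python
# Approximate $/1M (input, output) prices from the Claude entries above.
TIERS = {
    "claude-haiku-4-5":  (1.0, 5.0),
    "claude-sonnet-4-6": (3.0, 15.0),
    "claude-opus-4-6":   (5.0, 25.0),
}

def pick_tier(difficulty: str) -> str:
    """Toy difficulty-based router; the labels are invented."""
    return {"easy": "claude-haiku-4-5",
            "normal": "claude-sonnet-4-6",
            "hard": "claude-opus-4-6"}[difficulty]

def est_cost(model: str, in_tok: int, out_tok: int) -> float:
    """Rough USD cost, ignoring batch/cache discounts."""
    price_in, price_out = TIERS[model]
    return (in_tok * price_in + out_tok * price_out) / 1_000_000

model = pick_tier("easy")
cost = est_cost(model, 50_000, 2_000)  # → 0.06
```

In practice a router would classify difficulty with a cheap model or heuristics; the point here is only the price spread across tiers.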
- CONTEXT-1ChromaAgentic retrieval
Best forMulti-hop retrieval, agentic search paired with a frontier reasoning model
PriceOpen weights (inference cost on your infra)
DeploymentHugging Face, local
ModalitiesText-only
Parameters~20B params; query decomposition, iterative corpus search, in-loop context editing
ProsBuilt for complex multi-hop retrieval as a sub-agent
ConsSpecialized workflow; not a general chat or coding model
Modelchromadb/context-1
ScaleN/A
Benchmarking- BrowseComp-Plus0.87
- FRAMES0.87
- LongSeal0.65
- Prune accuracy (context edit)0.94
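The entry above describes a loop of query decomposition, iterative corpus search, and in-loop context editing. The skeleton below sketches only that loop shape; every function is a hypothetical stub standing in for a model call, and none of this is Chroma's actual CONTEXT-1 interface.

```python
# Loop shape only: decompose -> search -> edit context -> repeat.
# All functions are hypothetical stubs, not CONTEXT-1's real API.

def decompose(query: str) -> list[str]:
    return [q.strip() for q in query.split(" and ")]  # stub decomposition

def search(corpus: dict[str, str], sub_query: str) -> list[str]:
    return [doc for doc in corpus.values() if sub_query.lower() in doc.lower()]

def prune(context: list[str], limit: int = 5) -> list[str]:
    return context[-limit:]  # stub "context edit": keep the freshest items

def agentic_retrieve(corpus: dict[str, str], query: str) -> list[str]:
    context: list[str] = []
    for sub in decompose(query):      # multi-hop: one pass per sub-query
        context.extend(search(corpus, sub))
        context = prune(context)      # in-loop context editing
    return context

corpus = {"a": "Paris is the capital of France.",
          "b": "France borders Spain."}
hits = agentic_retrieve(corpus, "capital of France and borders")
```

In the real system the decomposer, searcher, and pruner are the ~20B model acting as a sub-agent under a frontier reasoner, not string heuristics.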
- Groq Compound (Groq)GroqGenerative (orchestrated)
Best forTool-orchestrated search and routing across Groq-hosted models (e.g. open-weight ~120B-class models, per Groq)
PriceSee Groq Cloud pricing
DeploymentAPI (Groq)
ModalitiesText-only
ParametersProduct routes across retrieval + generative models; not one fixed public parameter count
ProsLow-latency Groq inference; unified compound surface
ConsWhich sub-model runs can change; not a single static checkpoint
Modelgroq-compound
Scalelarge
Benchmarking- RoleOrchestrated retrieval + generative routing
- Tiny Recursive Model (TRM)Samsung (SAIL Montréal)Reasoning / thinker
Best forStructured puzzle reasoning (ARC-AGI, Sudoku, mazes), recursive-reasoning research
PriceFree (open source)
DeploymentLocal
ModalitiesStructured grids / puzzles (not free-form conversational text)
Parameters~7M parameters, iterative latent and answer refinement (paper: arXiv:2510.04871)
ProsOrders-of-magnitude smaller than LLMs on comparable reasoning tasks, MIT-licensed repo
ConsNot a general chat model, embedding, or reranker; no official hosted API; requires a GPU stack for training/eval
ModelSamsungSAILMontreal/TinyRecursiveModels
Scaletiny
Benchmarking- ARC-AGI-1 (reported)44.6%
- ARC-AGI-2 (reported)7.8%
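TRM's core idea is repeatedly passing a latent state and a candidate answer through one tiny network until the answer stabilizes. The toy below mimics only that loop structure with a scalar fixed-point iteration (Heron's method for a square root); it is a numeric analogue for intuition, not the architecture from arXiv:2510.04871.

```python
# Toy analogue of iterative latent-and-answer refinement: repeatedly update
# a latent z, then re-decode the answer y from it. Mimics only TRM's loop
# structure, not its network.

def refine(z: float, y: float, target: float, steps: int = 16) -> float:
    for _ in range(steps):
        z = 0.5 * (z + target / max(z, 1e-9))  # latent update (Heron's method)
        y = z                                   # re-decode answer from latent
    return y

# Refining toward sqrt(2) from a rough initial guess:
ans = refine(z=1.0, y=0.0, target=2.0)
```

The payoff TRM claims is the same in spirit: a very small update rule, applied many times, can solve problems a single forward pass of the same capacity cannot.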