Evaluation of LLM systems
Introduction
Shipping an LLM feature is not finished when the model answers plausibly in a demo. Evaluation is how you know whether a change in prompts, retrieval, tools, or weights actually improves task success, grounding, safety, and operating cost for real users. This article series will cover practical evaluation for generative systems: reference tasks and benchmarks, golden sets and judges, RAG-specific checks, agent and tool traces, online metrics, and how to wire those signals into release and monitoring. For now, this page sets the scope; deeper sections will follow here.

Core analytics parameters
In RAG and agentic stacks you still anchor dashboards on a small set of fundamentals (a minimal logging sketch follows the list):
- 🔥 Latency — how long a single request takes from send to a complete response (end-to-end or per hop, e.g. milliseconds from ask to answer).
- 🔥 Retrieval health — overall quality and stability of the retriever over time: relevant, grounded context at acceptable latency and coverage for real queries.
- 🔥 Faithfulness — how accurate and reliable the generated answer is with respect to the retrieved documents.
- 🔥 Relevance — how well the answer matches the user's query.
- 🔥 Diversity — whether retrieved and surfaced knowledge stays broad and useful rather than overly narrow or repetitive.
- 🔥 Cost — LLM and embedding usage in tokens and money so spend stays predictable.
- 🔥 Failures — errors, timeouts, and empty paths that cap every other metric.
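A minimal sketch of what capturing these fundamentals per request can look like, assuming your pipeline exposes a retriever callable and an LLM callable that returns text plus token usage; the record fields are illustrative, not a standard schema.

```python
import time
from dataclasses import dataclass, field


@dataclass
class RequestRecord:
    """Illustrative per-request record covering the fundamentals above."""
    query: str
    answer: str = ""
    retrieved_ids: list[str] = field(default_factory=list)
    latency_s: float = 0.0            # end-to-end latency
    prompt_tokens: int = 0            # cost inputs
    completion_tokens: int = 0
    error: str | None = None          # failures: errors, timeouts, empty paths


def answer_with_metrics(query: str, retriever, llm) -> RequestRecord:
    record = RequestRecord(query=query)
    start = time.perf_counter()
    try:
        docs = retriever(query)                      # retrieval health: what came back?
        record.retrieved_ids = [d["id"] for d in docs]   # assumes dict-shaped chunks
        if not docs:
            record.error = "empty_retrieval"         # an "empty path" failure
            return record
        response = llm(query, docs)                  # assumed to return text plus token usage
        record.answer = response["text"]
        record.prompt_tokens = response["usage"]["prompt_tokens"]
        record.completion_tokens = response["usage"]["completion_tokens"]
    except Exception as exc:                         # timeouts and hard errors cap everything else
        record.error = type(exc).__name__
    finally:
        record.latency_s = time.perf_counter() - start
    return record
```

Faithfulness, relevance, and diversity are usually scored offline over records like these, with the tools described below, rather than computed on the request path itself.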
DeepEval
DeepEval is an open-source Python framework (Apache-2.0) from Confident AI for unit-testing LLM applications in the same spirit as pytest: you define test cases, attach metrics, and assert thresholds. It ships ready-made scores for RAG (answer relevancy, faithfulness, contextual recall, contextual precision, contextual relevancy, RAGAS-style bundles), agents (task completion, tool correctness, plan adherence, and related checks), multi-turn chat, MCP-oriented metrics, and general criteria via G-Eval and similar LLM-as-judge approaches, or statistical and local NLP-backed evaluators when you do not want every score to call a remote model. You can run end-to-end black-box tests, trace nested components (retrieval, tools, sub-calls) for finer-grained evaluation, plug into CI/CD, generate synthetic datasets, and optionally sync runs with the Confident AI platform for reports and traces. It integrates with common stacks (for example OpenAI, LangChain, LangGraph, LlamaIndex, CrewAI, Anthropic), so the parameters above become concrete tests rather than slide bullets.
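A minimal sketch of that pytest-style pattern; the hard-coded strings stand in for your pipeline's real input, output, and retrieved context, and the LLM-as-judge metrics assume a judge model is configured (for example an OpenAI key). Check metric names and thresholds against the DeepEval version you pin.

```python
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric


def test_refund_policy_answer():
    # In a real test the output and retrieval_context come from your pipeline, not literals.
    test_case = LLMTestCase(
        input="What is the refund window?",
        actual_output="You can request a refund within 30 days of purchase.",
        retrieval_context=["Refunds are accepted within 30 days of purchase."],
    )
    # Both metrics are LLM-as-judge scores; assert_test fails if either drops below its threshold.
    assert_test(
        test_case,
        [AnswerRelevancyMetric(threshold=0.7), FaithfulnessMetric(threshold=0.7)],
    )
```

Running the file through DeepEval's pytest integration (deepeval test run, or plain pytest) turns those thresholds into CI gates.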
Arize Phoenix
Phoenix (Arize Phoenix) is an open-source observability and evaluation app for LLM, RAG, and agent workflows. You instrument with OpenInference-compatible telemetry, then inspect traces, datasets, and experiment runs in a UI—instead of relying only on stdout logs—so teams can debug failures and compare pipeline versions interactively.
It is typically used for:
- 🔥 Traces — multi-step flows (retrieval, tool calls, nested LLM calls) with latency and inputs/outputs.
- 🔥 RAG / retrieval — inspect retrieved chunks against the final answer (“why did it say that?”).
- 🔥 Evals — run or record evaluation results on traces or datasets (quality, grounding, safety-style checks, including LLM-as-judge patterns via their eval tooling).
- 🔥 Datasets and experiments — capture production-like examples and compare prompts, models, or architecture variants on the same set.
Alongside code-first runners such as DeepEval, Phoenix leans toward trace-first debugging and visual regression; many teams combine both.
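A minimal local-instrumentation sketch, assuming the arize-phoenix and openinference-instrumentation-openai packages plus an OpenAI key; the registration helpers have moved between Phoenix versions, so treat this as the shape of the setup rather than the exact current incantation.

```python
import phoenix as px
from phoenix.otel import register
from openinference.instrumentation.openai import OpenAIInstrumentor
from openai import OpenAI

# Start the local Phoenix app (it prints a localhost URL for the trace viewer).
px.launch_app()

# Route OpenTelemetry spans from this process to Phoenix under a named project.
tracer_provider = register(project_name="rag-demo")

# Auto-instrument OpenAI client calls so each request shows up as a span.
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)

client = OpenAI()
client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Why is the sky blue?"}],
)
# The call (inputs, outputs, latency, token usage) now appears as a trace in the Phoenix UI.
```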
RAGAS
RAGAS (Ragas) is a Python library aimed at turning one-off “vibe checks” into repeatable evaluation loops for LLM applications: it pairs LLM-based scores with structured runs so you can compare changes with confidence. Classic offline metrics rarely match what matters for retrieval and generation quality, and hand-labeling does not scale; Ragas focuses on objective, automatable signals plus experimentation so iteration becomes measurable rather than argumentative.
Highlights from the project:
- 🔥 Objective metrics — evaluate with LLM-backed and traditional scores tuned for LLM apps, not only generic classifiers.
- 🔥 Test data generation — synthesize broader scenario coverage when you do not yet have a large golden set.
- 🔥 Integrations — works with common frameworks such as LangChain and with major observability stacks for end-to-end workflows.
- 🔥 Feedback loops — use production-shaped data and experiment runs so improvements compound instead of stalling after launch.
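A minimal sketch of the classic Ragas quickstart shape: build a dataset of question / answer / contexts / ground-truth rows and score it with a few metrics. Newer Ragas releases also expose a sample-based API (EvaluationDataset / SingleTurnSample), and the LLM-backed metrics assume a judge model is configured, so check the version you pin.

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision

# One evaluation row per question; in practice these come from your RAG pipeline's logs.
dataset = Dataset.from_dict({
    "question": ["What is the refund window?"],
    "answer": ["You can request a refund within 30 days of purchase."],
    "contexts": [["Refunds are accepted within 30 days of purchase."]],
    "ground_truth": ["Refunds are accepted within 30 days."],
})

# Each metric is scored per row by an LLM judge; the result aggregates over the dataset.
result = evaluate(dataset, metrics=[faithfulness, answer_relevancy, context_precision])
print(result)
```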
LLM-Tool-Survey
LLM-Tool-Survey is a research-oriented collection—not a runnable evaluation framework—that maps how modern LLMs interact with external tools end to end: planning, tool choice, API-style calling, and answer synthesis. It also summarizes training angles (prompting, fine-tuning, reinforcement learning), open problems, and how the field measures tool use through benchmarks and metrics. Use it as a structured field guide when you design agents or decide what to evaluate next, rather than as drop-in code.
What it organizes:
- 🔥 Agent workflow breakdown — step-by-step view of how models plan, select, and invoke tools.
- 🔥 Tool learning methods — prompting, fine-tuning, and RL-style approaches.
- 🔥 Evaluation overview — benchmarks and metrics aimed at tool usage.
- 🔥 Curated research hub — papers and references grouped for navigation.
- 🔥 Limitations and challenges — practical issues in real tool-based agents.
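As a deliberately schematic illustration of the workflow the survey decomposes (not the survey's code), here is the plan, select, call, synthesize loop with hypothetical llm and tools callables; each stage is where its catalogued benchmarks and metrics attach.

```python
def run_tool_agent(question: str, llm, tools: dict) -> str:
    """Schematic plan -> select -> call -> synthesize loop (illustrative only)."""
    # 1. Planning: the model decomposes the question into tool-solvable steps
    #    (assume llm(...) returns a list of step strings for this prompt).
    steps = llm(f"Break this task into tool-solvable steps: {question}")

    observations = []
    for step in steps:
        # 2. Tool selection: the model (or a router) picks a tool name and arguments
        #    (assume a dict like {"name": "search", "arguments": {...}} comes back).
        choice = llm(f"Pick a tool and arguments for: {step}", tools=list(tools))
        # 3. API-style calling: execute the chosen tool and record what it returned.
        result = tools[choice["name"]](**choice["arguments"])
        observations.append((step, result))

    # 4. Answer synthesis: fold the observations back into one final answer string.
    return llm(f"Answer {question!r} using these observations: {observations}")
```

Plan quality, tool-choice accuracy, call success rate, and final-answer correctness map onto those four stages.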
MemoryAgentBench
MemoryAgentBench is benchmark infrastructure and code for evaluating memory in LLM agents when interaction is incremental and multi-turn (a long chat), not a single static prompt. The work is research-facing (paper on incremental multi-turn memory evaluation; ICLR 2026) rather than a drop-in memory product.
The README groups competence areas in a LongMemEval-style framing:
- 🔥 Accurate retrieval (AR)
- 🔥 Test-time learning (TTL)
- 🔥 Long-range understanding (LRU)
- 🔥 Conflict resolution (CR)
What the repo provides:
- 🔥 Datasets and tasks — adapted prior benchmarks, chunked for multi-turn exposure; includes constructs such as EventQA and FactConsolidation and an “inject once, query many times” design (one long source, several questions; sketched after this list).
- 🔥 Run scripts — bash entrypoints for long-context agents, RAG-style setups, and agentic memory methods.
- 🔥 LLM-as-judge helpers — under llm_based_eval/ (e.g. LongmemEval QA, InfBench summarization), typically needing API access.
- 🔥 Baselines / integrations — repo folders (e.g. mem0, letta, cognee) to compare memory approaches.
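A schematic of the “inject once, query many times” shape (illustrative only; the observe/answer methods are hypothetical stand-ins, not the repo's API): the long source is fed to the agent in chunks across turns, then several questions probe whatever it retained.

```python
def evaluate_memory_agent(agent, long_source: str, questions: list[dict],
                          chunk_size: int = 2000) -> float:
    """Feed one long source incrementally, then ask several questions about it."""
    # Incremental, multi-turn exposure: the agent never sees the source in a single prompt.
    for start in range(0, len(long_source), chunk_size):
        agent.observe(long_source[start:start + chunk_size])   # hypothetical memory-write API

    # "Inject once, query many times": several questions amortize one long ingestion.
    correct = 0
    for q in questions:
        answer = agent.answer(q["question"])                   # hypothetical read API
        correct += int(q["expected"].lower() in answer.lower())
    return correct / len(questions)
```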
It is not a memory library like Nemori; use it when you need standardized, paper-linked experiments on memory under chunked multi-turn conditions—not when you only want a production memory API.
ET-Agent
ET-Agent is a research codebase for training tool-integrated reasoning (TIR) agents—systems that reason while calling search, retrieval, or other tools—with emphasis on shaping how the agent behaves (when it searches, how long chains get, efficiency), not only raw answer correctness.
At a glance:
- 🔥 Goal — improve behavioral patterns on TIR tasks through both data and algorithms.
- 🔥 Data — a Self-Evolving Data Flywheel that generates richer training data so RFT can cover more of the tool-use action space.
- 🔥 Algorithms — Behavior Calibration Training: supervised fine-tuning (e.g. via LLaMA-Factory), then iterative reinforcement learning built around ARPO-style training (group / Pareto-style sampling on rollouts, curriculum scripts), with rewards that encode efficiency—not only final correctness.
It also ships an evaluation track: inference drivers, F1, LLM-as-judge, efficiency, and analyses such as conciseness, successful execution, and reasoning length (via a radar script). The project is still primarily a training and data-generation pipeline—how to build a strong TIR agent—not a minimal, benchmark-only harness like MemoryAgentBench.
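As a generic illustration of rewards that encode efficiency rather than only final correctness (a sketch, not ET-Agent's actual reward code), one common shape is answer-level F1 discounted by how many tool calls and reasoning tokens the rollout consumed:

```python
def token_f1(prediction: str, reference: str) -> float:
    """Whitespace-token F1 between predicted and reference answers (set overlap, simplified)."""
    pred, ref = set(prediction.split()), set(reference.split())
    common = len(pred & ref)
    if common == 0:
        return 0.0
    precision, recall = common / len(pred), common / len(ref)
    return 2 * precision * recall / (precision + recall)


def efficiency_aware_reward(prediction: str, reference: str,
                            num_tool_calls: int, num_reasoning_tokens: int,
                            call_penalty: float = 0.02, token_penalty: float = 1e-4) -> float:
    """Correctness (token F1) discounted by how expensively the agent reached it."""
    reward = token_f1(prediction, reference)
    reward -= call_penalty * num_tool_calls          # discourage gratuitous searches
    reward -= token_penalty * num_reasoning_tokens   # discourage inflated reasoning chains
    return max(reward, 0.0)
```

Tuning the penalty coefficients is exactly the behavior-calibration knob: set them too high and the agent stops searching, too low and chains inflate.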
In one line: behavior-calibrated agent training for tool-using reasoning—flywheel data, SFT, RL (ARPO), retrieval/search plumbing, and bundled metrics rather than a stand-alone evaluator.
Conclusion
Good evaluation is layered. The analytics parameters give you the vocabulary for what to watch in production: latency, retrieval health, faithfulness, relevance, diversity, cost, and failure modes. Tools such as DeepEval, RAGAS, and Phoenix turn that vocabulary into tests, scores, and traces you can run in CI or explore in a UI. Surveys and benchmarks—LLM-Tool-Survey, MemoryAgentBench—anchor you to how the field defines tool use or memory under multi-turn conditions. Training-oriented stacks such as ET-Agent show that evaluation is inseparable from how you generate data and calibrate behavior; the same metrics should eventually feed back from offline runs and online monitoring into the next release.
None of these replace a clear product decision about what “good” means for your users—pick a small metric set, wire it to gates you actually enforce, and revisit it when models, tools, or traffic change.