Evaluation of LLM systems

Introduction

Shipping an LLM feature is not finished when the model answers plausibly in a demo. Evaluation is how you know whether a change in prompts, retrieval, tools, or weights actually improves task success, grounding, safety, and operating cost for real users. This article series will cover practical evaluation for generative systems: reference tasks and benchmarks, golden sets and judges, RAG-specific checks, agent and tool traces, online metrics, and how to wire those signals into release and monitoring. For now, this page sets the scope; deeper sections will follow here.

Figure (abstract evaluation loop): From prompts, retrieval, and tools to measurable signals: offline tests, judges, traces, and release gates on the same vocabulary of quality and cost.

Core analytics parameters

In RAG and agentic stacks you still anchor dashboards on a small set of fundamentals: latency, retrieval health, faithfulness, relevance, diversity, cost, and failure modes.

DeepEval

DeepEval is an open-source Python framework (Apache-2.0) from Confident AI for unit-testing LLM applications in the same spirit as pytest: you define test cases, attach metrics, and assert thresholds. It ships ready-made scores for RAG (answer relevancy, faithfulness, contextual recall, precision, relevancy, RAGAS-style bundles), agents (task completion, tool correctness, plan adherence, and related checks), multi-turn chat, MCP-oriented metrics, and general criteria via G-Eval and similar LLM-as-judge—or statistical / local NLP-backed evaluators when you do not want every score to call a remote model. You can run end-to-end black-box tests, trace nested components (retrieval, tools, sub-calls) for finer-grained evaluation, plug into CI/CD, generate synthetic datasets, and optionally sync runs with the Confident AI platform for reports and traces. It integrates with common stacks (for example OpenAI, LangChain, LangGraph, LlamaIndex, CrewAI, Anthropic) so the parameters above become concrete tests instead of slide bullets alone.

Sketch (Python)

Mapping:

Parameter | What this uses
Latency | test_latency_end_to_end (perf_counter; split per-hop in your own RAG wrapper).
Retrieval health | ContextualRelevancyMetric, plus ContextualRecallMetric / ContextualPrecisionMetric if you supply expected_output. Stability over time = your logging + repeated runs on a suite.
Faithfulness | FaithfulnessMetric + retrieval_context.
Relevance | AnswerRelevancyMetric.
Diversity | Proxy via GEval on RETRIEVAL_CONTEXT (not a native “MMR diversity” meter).
Cost | evaluation_cost is often meaningless for OllamaModel (for example, the DAG code path sets it to 0); real $/token figures usually need provider usage data plus OpenAI-style billing.
Failures | test_failure_missing_retrieval_raises; add pytest.mark.timeout for timeouts.

import os
import time

import pytest
from deepeval import assert_test
from deepeval.errors import MissingTestCaseParamsError
from deepeval.metrics import (
    AnswerRelevancyMetric,  # how well the answer addresses the input question
    ContextualPrecisionMetric,  # whether relevant chunks rank above irrelevant ones (needs expected_output)
    ContextualRecallMetric,  # how much of expected_output the retrieved chunks support
    ContextualRelevancyMetric,  # how relevant the retrieved chunks are to the input
    FaithfulnessMetric,  # whether the answer's claims are supported by retrieval_context
    GEval,  # LLM-as-judge scoring against custom criteria
)

# Judge model wrappers: hosted OpenAI models or a local Ollama server.
from deepeval.models import GPTModel, OllamaModel

# Test-case container and the enum naming which fields a GEval metric reads.
from deepeval.test_case import LLMTestCase, SingleTurnParams


def _evaluation_llm():
    """Pick the judge model from environment variables.

    Returning None lets each metric fall back to DeepEval's default model.
    """
    if os.environ.get("OPENAI_API_KEY"):
        mn = os.environ.get("OPENAI_MODEL_NAME")
        return GPTModel(model=mn) if mn else None
    name = os.environ.get("OLLAMA_MODEL_NAME") or os.environ.get("LOCAL_MODEL_NAME") or "qwen2.5:7b"
    base = os.environ.get("LOCAL_MODEL_BASE_URL")
    if base:
        u = str(base).rstrip("/")
        if u.endswith("/v1"):
            u = u[:-3]
        return OllamaModel(model=name, base_url=u)
    return OllamaModel(model=name)

# Shared fixture: one question, a grounded answer, a reference answer, and three retrieved chunks.
def _sample_case():
    return LLMTestCase(
        input="What is the ROI of the Q3 rollout?",
        actual_output="Q3 rollout ROI was 12% based on pilot regions A and B.",
        expected_output="Roughly 12% ROI in pilot regions.",
        retrieval_context=[
            "Q3 rollout in regions A and B achieved 12% ROI.",
            "Q3 pilot ROI is defined as net gain over implementation cost.",
            "Executive summary: Q3 rollout return is 12% in pilot geographies.",
        ],
    )

# Rough wall-clock time for three judge-backed metrics; this times the evaluators, not your RAG pipeline.
def test_latency_end_to_end():
    tc = _sample_case()
    mdl = _evaluation_llm()
    metrics = [
        AnswerRelevancyMetric(threshold=0.5, model=mdl),
        FaithfulnessMetric(threshold=0.5, model=mdl),
        ContextualRelevancyMetric(threshold=0.5, model=mdl),
    ]
    t0 = time.perf_counter()
    for m in metrics:
        m.measure(tc, _show_indicator=False)
    elapsed_ms = (time.perf_counter() - t0) * 1000
    print(f"end_to_end_latency_ms_approx {elapsed_ms:.2f}")

# Retrieval health: are the retrieved chunks relevant to the question?
def test_retrieval_health_contextual_relevancy():
    mdl = _evaluation_llm()
    assert_test(_sample_case(), [ContextualRelevancyMetric(threshold=0.5, model=mdl)])


def test_retrieval_health_recall_precision_when_reference_exists():
    mdl = _evaluation_llm()
    assert_test(
        _sample_case(),
        [
            ContextualRecallMetric(threshold=0.5, model=mdl),
            ContextualPrecisionMetric(threshold=0.5, model=mdl),
        ],
    )

# Faithfulness and answer relevance, each asserted against its own threshold.
def test_faithfulness():
    mdl = _evaluation_llm()
    assert_test(_sample_case(), [FaithfulnessMetric(threshold=0.35, model=mdl)])


def test_answer_relevance():
    mdl = _evaluation_llm()
    assert_test(_sample_case(), [AnswerRelevancyMetric(threshold=0.35, model=mdl)])

# Diversity proxy: a GEval judge reads only the retrieved chunks and penalizes near-duplicates.
def test_diversity_via_geval():
    mdl = _evaluation_llm()
    m = GEval(
        name="Diversity",
        criteria="Retrieved chunks should cover distinct facts without near-duplicate fluff.",
        evaluation_params=[SingleTurnParams.RETRIEVAL_CONTEXT],
        evaluation_steps=[
            "Check chunks are not repetitive paraphrases of one fact.",
            "Reward coverage of distinct subtopics.",
        ],
        model=mdl,
        threshold=0.35,
    )
    assert_test(_sample_case(), [m])


def test_failure_missing_retrieval_raises():
    mdl = _evaluation_llm()
    tc = LLMTestCase(input="Hello?", actual_output="Hi", retrieval_context=None)
    m = FaithfulnessMetric(threshold=0.35, model=mdl)
    with pytest.raises(MissingTestCaseParamsError):
        m.measure(tc, _show_indicator=False)

# Cost placeholder: print whatever evaluation_cost the metric recorded (may be None or 0 for local models).
def test_metric_cost_placeholder():
    tc = _sample_case()
    mdl = _evaluation_llm()
    m = AnswerRelevancyMetric(threshold=0.35, model=mdl)
    m.measure(tc, _show_indicator=False)
    print("evaluation_cost", getattr(m, "evaluation_cost", None))
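
The mapping above suggests splitting latency per hop in your own RAG wrapper. A minimal sketch of that idea follows, using only the standard library. The names timed_hop, hop_latency_ms, fake_retrieve, and fake_generate are hypothetical stand-ins and are not part of DeepEval; swap in your real retrieval and generation calls.

```python
import time
from contextlib import contextmanager

# Hypothetical store: wall-clock time per pipeline hop, in milliseconds.
hop_latency_ms: dict[str, float] = {}


@contextmanager
def timed_hop(name: str):
    """Record how long the wrapped block takes under `name`."""
    t0 = time.perf_counter()
    try:
        yield
    finally:
        hop_latency_ms[name] = (time.perf_counter() - t0) * 1000


# Stand-ins for real retrieval and generation calls (assumptions for this sketch).
def fake_retrieve(query: str) -> list[str]:
    time.sleep(0.01)
    return ["Q3 rollout in regions A and B achieved 12% ROI."]


def fake_generate(query: str, chunks: list[str]) -> str:
    time.sleep(0.02)
    return "Q3 rollout ROI was 12%."


def answer(query: str) -> str:
    """Run retrieval and generation, timing each hop separately."""
    with timed_hop("retrieve"):
        chunks = fake_retrieve(query)
    with timed_hop("generate"):
        out = fake_generate(query, chunks)
    return out


result = answer("What is the ROI of the Q3 rollout?")
print(result, hop_latency_ms)
```

Logging the per-hop dictionary next to each test case makes it easier to tell whether a slow run came from retrieval or from generation.
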

DeepEval Metrics

DeepEval offers a large variety of ready-to-use LLM evaluation metrics, all of which return explanations alongside scores. They can be powered by any LLM of your choice, by statistical methods, or by NLP models that run locally on your machine, and are grouped as follows:

Custom, All-Purpose Metrics

  • G-Eval: a research-backed LLM-as-a-judge metric for evaluating on any custom criteria with human-like accuracy.
  • DAG: DeepEval's graph-based deterministic LLM-as-a-judge metric builder.

Agentic Metrics

Task completion, tool correctness, plan adherence, and related checks for tool-using and multi-step agents.

RAG Metrics

Answer relevancy, faithfulness, contextual recall, precision, relevancy, and RAGAS-style bundles tied to retrieval and grounded answers.

Multi-Turn Metrics

Scores for conversational threads, turn consistency, and long-running chat behavior.

MCP Metrics

Evaluations aimed at Model Context Protocol workflows and tool surfaces exposed via MCP.

Multimodal Metrics

Judges and rubrics when inputs or outputs span images, audio, or other non-text modalities.

Other Metrics

Statistical and local NLP evaluators when you do not want every score to call a remote LLM.

  • 🎯 Supports both end-to-end and component-level LLM evaluation.
  • 🧩 Build your own custom metrics that are automatically integrated with DeepEval's ecosystem.
  • 🔮 Generate both single and multi-turn synthetic datasets for evaluation.
  • 🔗 Integrates seamlessly with ANY CI/CD environment.
  • 🧬 Optimize prompts automatically based on evaluation results.
  • 🏆 Easily benchmark ANY LLM on popular LLM benchmarks in under 10 lines of code, including MMLU, HellaSwag, DROP, BIG-Bench Hard, TruthfulQA, HumanEval, GSM8K.

DeepEval does not ship a self-hosted local web UI; you run tests and read results in code, logs, and CI. For a browser-based experience—reports, traces, and collaboration—use the optional Confident AI cloud platform when you sync runs. For purely local runs, plot or summarize scores yourself with libraries such as matplotlib from exported metrics or DataFrames.
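
For purely local runs, a text summary is often enough before reaching for a plotting library. The sketch below assumes you have already exported scores as (metric name, score, threshold) tuples; that row shape and the sample values are hypothetical, not a DeepEval export format.

```python
import statistics
from collections import defaultdict

# Hypothetical export: one (metric_name, score, threshold) tuple per test case,
# for example collected from your own logging after each metric run.
rows = [
    ("faithfulness", 0.9, 0.5),
    ("faithfulness", 0.4, 0.5),
    ("answer_relevancy", 0.8, 0.5),
    ("answer_relevancy", 0.7, 0.5),
]


def summarize(rows):
    """Group scores by metric and compute mean, minimum, and pass rate."""
    by_metric = defaultdict(list)
    for name, score, threshold in rows:
        by_metric[name].append((score, threshold))
    summary = {}
    for name, pairs in by_metric.items():
        scores = [s for s, _ in pairs]
        summary[name] = {
            "mean": statistics.mean(scores),
            "min": min(scores),
            "pass_rate": sum(s >= t for s, t in pairs) / len(pairs),
        }
    return summary


summary = summarize(rows)
for name, stats in summary.items():
    print(
        f"{name:<18} mean={stats['mean']:.2f} "
        f"min={stats['min']:.2f} pass_rate={stats['pass_rate']:.0%}"
    )
```

The same summary dictionary can be handed to matplotlib or a DataFrame later without changing how the scores are collected.
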

Arize Phoenix

Phoenix (Arize Phoenix) is an open-source observability and evaluation app for LLM, RAG, and agent workflows. You instrument with OpenInference-compatible telemetry, then inspect traces, datasets, and experiment runs in a UI—instead of relying only on stdout logs—so teams can debug failures and compare pipeline versions interactively.

It is typically used for:

Alongside code-first runners such as DeepEval, Phoenix leans toward trace-first debugging and visual regression; many teams combine both.

Sketch (Python)
import logging

import ollama
from opentelemetry import trace
from opentelemetry.instrumentation.ollama import OllamaInstrumentor
from opentelemetry.trace import Status, StatusCode

from openinference.semconv.trace import OpenInferenceSpanKindValues, SpanAttributes
from phoenix.evals import LLM, bind_evaluator, create_classifier, create_evaluator
from phoenix.evals.tracing import trace as span_trace
from phoenix.otel import register

logging.basicConfig(level=logging.WARNING)

OLLAMA_CHAT_MODEL = "llama3.2"
OLLAMA_OPENAI_URL = "http://127.0.0.1:11434/v1"
OTLP_GRPC = "http://127.0.0.1:4317"
PROJECT = "default"
CANON_TERMS = ("austin", "acme", "widget")

# Fake retriever span: keyword-filter a tiny corpus with a fallback 
# to the first two documents.
@span_trace(span_name="rag.retrieve", span_kind=OpenInferenceSpanKindValues.RETRIEVER)
def faux_retrieval(query: str) -> list[str]:
    corpus = [
        "Acme Corp HQ is in Austin, Texas.",
        "Acme was founded in 1999.",
        "Acme sells cloud widgets and on-prem gateways.",
        "Off-topic: whales migrate seasonally.",
    ]
    ql = query.lower()
    picks = [
        c
        for c in corpus
        if any(len(tok) > 2 and tok in c.lower() for tok in ql.split())
        or "acme" in c.lower()
    ]
    return picks or corpus[:2]

# LLM generation via Ollama, copying token counts onto the span 
# when the response includes them.
@span_trace(span_name="rag.answer", span_kind=OpenInferenceSpanKindValues.LLM)
def generate_answer_via_ollama(messages: list[dict]) -> str:
    out = ollama.chat(model=OLLAMA_CHAT_MODEL, messages=messages)
    txt = str(out.message.content)
    pv = getattr(out, "prompt_eval_count", None)
    cv = getattr(out, "eval_count", None)
    sp = trace.get_current_span()
    if pv is not None:
        sp.set_attribute(SpanAttributes.LLM_TOKEN_COUNT_PROMPT, int(pv))
    if cv is not None:
        sp.set_attribute(SpanAttributes.LLM_TOKEN_COUNT_COMPLETION, int(cv))
    if pv is not None or cv is not None:
        sp.set_attribute(
            SpanAttributes.LLM_TOKEN_COUNT_TOTAL,
            int((pv or 0) + (cv or 0)),
        )
    return txt

# One traced RAG chain: retrieve chunks, pack them into the prompt, 
# generate the answer with Ollama.
@span_trace(span_name="rag.pipeline", span_kind=OpenInferenceSpanKindValues.CHAIN)
def run_rag_once(query: str) -> tuple[str, list[str]]:
    chunks = faux_retrieval(query)
    prompt = (
        "Use ONLY the context. If insufficient, say you do not know.\n\n"
        "Context:\n"
        + "\n".join(f"- {c}" for c in chunks)
        + "\n\nQuestion:\n"
        + query
    )
    answer = generate_answer_via_ollama([{"role": "user", "content": prompt}])
    return answer, chunks


# Cheap retrieval-health proxy: fraction of canonical keywords present 
# in retrieved chunk text.
@create_evaluator(name="retrieval_hit_rate")
def retrieval_hit_rate(chunk_texts: list[str]) -> float:
    blob = " ".join(chunk_texts).lower()
    hits = sum(1 for kw in CANON_TERMS if kw in blob)
    return hits / len(CANON_TERMS)

# Heuristic retrieval diversity: ratio of distinct words 
# to total words across chunks.
@create_evaluator(name="diversity_lexical")
def diversity_lexical(chunk_texts: list[str]) -> float:
    toks: list[str] = []
    for t in chunk_texts:
        toks.extend(t.lower().replace(",", " ").split())
    return len(set(toks)) / len(toks) if toks else 0.0

# Judge model via Phoenix LLM adapter: OpenAI SDK pointed at Ollama's 
# /v1-compatible endpoint.
def build_judge_llm() -> LLM:
    return LLM(
        provider="openai",
        model=OLLAMA_CHAT_MODEL,
        base_url=OLLAMA_OPENAI_URL,
        api_key="ollama",
    )

# Demonstrate failure observability: one span ends in ERROR after a 
# guaranteed Ollama error.
def intentional_failure() -> None:
    tracer = trace.get_tracer(__name__)
    with tracer.start_as_current_span("demo.forced_failure") as sp:
        try:
            ollama.chat(
                model="model-that-does-not-exist-xyz",
                messages=[{"role": "user", "content": "x"}],
            )
        except Exception as exc:
            sp.record_exception(exc)
            sp.set_status(Status(StatusCode.ERROR, str(exc)))

# Wire Phoenix OTLP, instrument Ollama, run a toy RAG trace, run heuristic + 
# LLM judges, log a forced error span, flush.
def main() -> None:
    register(project_name=PROJECT, endpoint=OTLP_GRPC, batch=False, verbose=False)
    OllamaInstrumentor().instrument()

    query = "Where is Acme headquartered and what do they sell?"

    answer, chunks = run_rag_once(query)

    rhr = bind_evaluator(
        evaluator=retrieval_hit_rate,
        input_mapping={"chunk_texts": lambda r: r["chunks"]},
    )
    div = bind_evaluator(
        evaluator=diversity_lexical,
        input_mapping={"chunk_texts": lambda r: r["chunks"]},
    )
    row = {"chunks": chunks}
    rhr.evaluate(row)
    div.evaluate(row)

    judge = build_judge_llm()
    grounded = create_classifier(
        name="faithfulness_vs_context",
        prompt_template=(
            "Question: {query}\nContext:\n{context}\nAnswer:\n{answer}\n"
            "Are all factual statements in the answer supported only by the context?\n"
            "Answer with exactly one label: yes, partly, no."
        ),
        llm=judge,
        choices=["yes", "partly", "no"],
    )
    relevance = create_classifier(
        name="answer_relevance",
        prompt_template=(
            "Question: {query}\nAnswer:\n{answer}\n"
            "Does the answer respond to what was asked?\n"
            "Answer with exactly one label: yes, partly, no."
        ),
        llm=judge,
        choices=["yes", "partly", "no"],
    )

    try:
        grounded.evaluate({"query": query, "answer": answer, "context": "\n".join(chunks)})
        relevance.evaluate({"query": query, "answer": answer})
    except Exception as exc:
        logging.warning("LLM evaluator failed (often JSON/tooling limits on small models): %s", exc)

    intentional_failure()
    trace.get_tracer_provider().force_flush(timeout_millis=10_000)
    print("Done. Open tracing view and expand the latest trace.")


if __name__ == "__main__":
    main()

Arize Phoenix metrics

Arize Phoenix is an open-source AI observability platform designed for experimentation, evaluation, and troubleshooting. It provides:

  • Tracing — trace your LLM application's runtime using OpenTelemetry-based instrumentation.
  • Evaluation — leverage LLMs to benchmark your application's performance using response and retrieval evals.
  • Datasets — create versioned datasets of examples for experimentation, evaluation, and fine-tuning.
  • Experiments — track and evaluate changes to prompts, LLMs, and retrieval.
  • Playground — optimize prompts, compare models, adjust parameters, and replay traced LLM calls.
  • Prompt Management — manage and test prompt changes systematically using version control, tagging, and experimentation.

Arize Phoenix UI

Phoenix is not only SDKs and OTLP exporters: it ships a dedicated web UI you open in the browser to work with the same signals you send from code. Run it locally (for example alongside phoenix serve or your own container), or use a hosted instance; either way you get trace timelines, eval reviews, datasets, experiments, and prompt tools without building a custom dashboard.

A public demo lives at phoenix-demo.arize.com.

RAGAS

RAGAS (Ragas) is a Python library aimed at turning one-off “vibe checks” into repeatable evaluation loops for LLM applications: it pairs LLM-based scores with structured runs so you can compare changes with confidence. Classic offline metrics rarely match what matters for retrieval and generation quality, and hand-labeling does not scale; Ragas focuses on objective, automatable signals plus experimentation so iteration becomes measurable rather than argumentative.

Highlights from the project:

Sketch (Python)

What Ragas can cover in one evaluate call, given a dataset with the columns user_input, response, and retrieved_contexts:

  • Faithfulness → faithfulness
  • Answer relevance → answer_relevancy (needs embeddings)
  • Retrieval helpfulness without gold labels → context_utilization (LLM judges if each chunk helped for this answer; no reference)

Needs extra columns

context_recall / context_precision (the usual bundled ones tied to reference) expect a reference (and sometimes more). Add a gold answer or gold contexts if you want those.

Not really “one Ragas line”

Latency: wrap evaluate with time.perf_counter() and keep any per-row latency_* fields in your own records.

Diversity: not a standard Ragas KPI; typically embedding spread / unique sources / MMR outside Ragas.

Cost: evaluate(..., token_usage_parser=...) only if you wire a parser to your LLM responses.

Failures: raise_exceptions=False + RunConfig (timeout / retries); then count NaNs in result.to_pandas().
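
Since diversity sits outside Ragas, here is a minimal sketch of two of the proxies named above: embedding spread (mean pairwise cosine distance) and unique-source ratio. The function names and the toy two-dimensional vectors are assumptions for illustration; in practice the vectors would come from your embedding model.

```python
import math
from itertools import combinations


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def embedding_spread(vectors: list[list[float]]) -> float:
    """Mean pairwise cosine distance (1 - similarity); 0.0 for fewer than two vectors."""
    pairs = list(combinations(vectors, 2))
    if not pairs:
        return 0.0
    return sum(1.0 - cosine_similarity(a, b) for a, b in pairs) / len(pairs)


def unique_source_ratio(sources: list[str]) -> float:
    """Share of retrieved chunks that come from distinct source documents."""
    return len(set(sources)) / len(sources) if sources else 0.0


# Toy embeddings (hypothetical): three identical chunks versus two orthogonal ones.
near_duplicates = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
varied = [[1.0, 0.0], [0.0, 1.0]]

print("spread, near duplicates:", embedding_spread(near_duplicates))
print("spread, varied:", embedding_spread(varied))
print("unique sources:", unique_source_ratio(["a.pdf", "a.pdf", "b.pdf"]))
```

A spread near 0 signals near-duplicate chunks; higher values mean the retrieved set covers more distinct directions in embedding space.
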

The Ragas “List of available metrics” page documents every built-in score and what your dataset must supply for each.

context_utilization import: it is defined in ragas.metrics._context_precision; the lazy ragas.metrics namespace may not export the lowercase singleton, so import it explicitly:

import time

from ragas import evaluate
from ragas.run_config import RunConfig
from ragas.metrics import faithfulness, answer_relevancy
from ragas.metrics._context_precision import context_utilization

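# Assumed to exist already: ds (a Ragas dataset with user_input, response,
# and retrieved_contexts columns), llm (the judge model wrapper), and
# embeddings_model (the embeddings wrapper that answer_relevancy needs).

# Per-call timeout in seconds and the number of parallel workers.
run_config = RunConfig(timeout=180, max_workers=4)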

t0 = time.perf_counter()
result = evaluate(
    dataset=ds,
    metrics=[faithfulness, answer_relevancy, context_utilization],
    llm=llm,
    embeddings=embeddings_model,
    run_config=run_config,
    raise_exceptions=False,
)
ragas_wall_s = time.perf_counter() - t0

scores_df = result.to_pandas()
print("aggregate", result)
print("eval_wall_s", ragas_wall_s)
print("missing_scores", scores_df.isna().sum())

Ragas metrics

  • ContextRelevance: dual LLM judges each rate how pertinent the retrieved contexts are to the user input, then Ragas averages those ratings into a single 0–1 score.
  • Faithfulness: splits the model answer into atomic claims and uses an NLI-style pass to see how many are supported by the retrieved context, reporting that share as a 0–1 score.
  • AnswerRelevancy: the LLM invents several questions from the answer, embeddings compare them to the original user question (cosine similarity), and Ragas averages that similarity while zeroing out evasive/noncommittal generations.
  • ContextPrecisionWithReference: for each retrieved chunk an LLM decides if it helps answer the question relative to a supplied reference answer, and Ragas aggregates those yes/no verdicts into an average-precision-style 0–1 score (needs a gold/reference field).
  • AnswerAccuracy: two LLM judges each score how well the model answer matches a supplied reference (NVIDIA-style rubric), and Ragas averages those ratings into one 0–1 accuracy score.
  • AnswerCorrectness: blends a statement-level factuality F-score from LLM TP/FP/FN classification with optional embedding cosine similarity between the full answer and reference, using default 75% / 25% weights to yield one 0–1 score.
  • TokenUsage: Ragas's token bucket (input/output counts + model id) with parsers for common LangChain result shapes and a callback handler so you can feed token_usage_parser into evaluate and price runs from accumulated usage.
  • Langfuse: optional tracing glue around Langfuse (observe shim, trace sync/wait helpers, and URL utilities) so Ragas evals can align with Langfuse traces when the SDK is installed.
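
The arithmetic behind three of the scores above can be sketched without any LLM calls. This is a simplified illustration derived from the descriptions in the list, not the Ragas source code; the function names are hypothetical, and the judge verdicts that Ragas obtains from an LLM are passed in here as plain values.

```python
def faithfulness_score(claim_supported: list[bool]) -> float:
    """Share of atomic claims that the retrieved context supports."""
    return sum(claim_supported) / len(claim_supported) if claim_supported else 0.0


def context_precision(verdicts: list[int]) -> float:
    """Average precision over ranked chunks; verdicts[k] is 1 if chunk k was judged useful."""
    hits = 0
    total = 0.0
    for rank, verdict in enumerate(verdicts, start=1):
        if verdict:
            hits += 1
            total += hits / rank  # precision at this rank, counted only at useful chunks
    return total / hits if hits else 0.0


def answer_correctness(tp: int, fp: int, fn: int, similarity: float,
                       weights: tuple[float, float] = (0.75, 0.25)) -> float:
    """Blend a statement-level F1 with answer/reference embedding similarity."""
    denom = tp + 0.5 * (fp + fn)
    f1 = tp / denom if denom else 0.0
    return weights[0] * f1 + weights[1] * similarity


# Three of four claims supported by the context.
print(faithfulness_score([True, True, False, True]))
# Useful chunks at ranks 1 and 3, an unhelpful chunk at rank 2.
print(context_precision([1, 0, 1]))
# Three correct statements, one unsupported, one missing, similarity 0.8.
print(answer_correctness(tp=3, fp=1, fn=1, similarity=0.8))
```

Working through the numbers by hand on a few rows is a quick way to check that a surprising score comes from the judge verdicts and not from the aggregation.
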

Ragas does not include a built-in UI; you work with scores in code or exported data. For charts or dashboards, use a plotting stack such as matplotlib or any library you already standardize on.

LLM-Tool-Survey

LLM-Tool-Survey is a research-oriented collection—not a runnable evaluation framework—that maps how modern LLMs interact with external tools end to end: planning, tool choice, API-style calling, and answer synthesis. It also summarizes training angles (prompting, fine-tuning, reinforcement learning), open problems, and how the field measures tool use through benchmarks and metrics. Use it as a structured field guide when you design agents or decide what to evaluate next, rather than as drop-in code.

What it organizes:

MemoryAgentBench

MemoryAgentBench is benchmark infrastructure and code for evaluating memory in LLM agents when interaction is incremental and multi-turn (a long chat), not a single static prompt. The work is research-facing (paper on incremental multi-turn memory evaluation; ICLR 2026) rather than a drop-in memory product.

The README groups competence areas in a LongMemEval-style framing:

What the repo provides:

It is not a memory library like Nemori; use it when you need standardized, paper-linked experiments on memory under chunked multi-turn conditions—not when you only want a production memory API.
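
To make the idea of chunked, incremental evaluation concrete, here is a toy loop: a document is fed to an agent one chunk at a time, and questions are asked only after all chunks have been seen. Everything in it is hypothetical, including the KeywordMemoryAgent class, its observe/answer interface, and the substring scoring rule; none of it is MemoryAgentBench code or its API.

```python
from dataclasses import dataclass, field


def _words(text: str) -> set[str]:
    """Lowercased words with simple trailing punctuation stripped."""
    return {w.strip("?.,").lower() for w in text.split()}


@dataclass
class KeywordMemoryAgent:
    """Toy agent: stores every chunk and answers by keyword overlap."""
    memory: list[str] = field(default_factory=list)

    def observe(self, chunk: str) -> None:
        self.memory.append(chunk)

    def answer(self, question: str) -> str:
        # Prefer the most recent chunk that shares a word with the question.
        q = _words(question)
        for chunk in reversed(self.memory):
            if q & _words(chunk):
                return chunk
        return ""


def chunk_text(text: str, size: int) -> list[str]:
    """Split text into chunks of `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def run_incremental_eval(agent, document: str, qa_pairs, chunk_words: int = 6) -> float:
    """Feed chunks one turn at a time, then score answers by substring match."""
    for chunk in chunk_text(document, chunk_words):
        agent.observe(chunk)
    hits = sum(
        1 for question, expected in qa_pairs
        if expected.lower() in agent.answer(question).lower()
    )
    return hits / len(qa_pairs)


document = (
    "Alice moved to Paris in 2020. Bob adopted a cat named Miso. "
    "Alice later moved to Berlin in 2023."
)
qa_pairs = [
    ("Where does Alice live now?", "Berlin"),
    ("What is the cat called?", "Miso"),
]
agent = KeywordMemoryAgent()
accuracy = run_incremental_eval(agent, document, qa_pairs)
print("accuracy", accuracy)
```

The first question only succeeds because the agent prefers the most recent matching chunk, which is the kind of behavior a memory benchmark under incremental input is meant to expose.
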

ET-Agent

ET-Agent is a research codebase for training tool-integrated reasoning (TIR) agents—systems that reason while calling search, retrieval, or other tools—with emphasis on shaping how the agent behaves (when it searches, how long chains get, efficiency), not only raw answer correctness.

At a glance:

It also ships an evaluation track: inference drivers, F1, LLM-as-judge, efficiency, and analyses such as conciseness, successful execution, and reasoning length (via a radar script). The project is still primarily a training and data-generation pipeline—how to build a strong TIR agent—not a minimal, benchmark-only harness like MemoryAgentBench.
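
For readers unfamiliar with F1 as an answer metric, the sketch below shows the common token-overlap form used in question answering. Whether ET-Agent computes F1 in exactly this way is an assumption; the function name token_f1 and the whitespace tokenization are illustrative only.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a predicted answer and a reference answer."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    if not pred or not ref:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred == ref)
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


score = token_f1("12% ROI in pilot regions", "Roughly 12% ROI in pilot regions")
print(f"token F1: {score:.3f}")
```

Token-level F1 rewards partial matches, which is why it is often paired with an LLM judge that can catch answers that are worded differently but still correct.
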

In one line: behavior-calibrated agent training for tool-using reasoning—flywheel data, SFT, RL (ARPO), retrieval/search plumbing, and bundled metrics rather than a stand-alone evaluator alone.

Conclusion

Good evaluation is layered. The analytics parameters give you the vocabulary for what to watch in production: latency, retrieval health, faithfulness, relevance, diversity, cost, and failure modes. Tools such as DeepEval, RAGAS, and Phoenix turn that vocabulary into tests, scores, and traces you can run in CI or explore in a UI. Surveys and benchmarks—LLM-Tool-Survey, MemoryAgentBench—anchor you to how the field defines tool use or memory under multi-turn conditions. Training-oriented stacks such as ET-Agent show that evaluation is inseparable from how you generate data and calibrate behavior; the same metrics should eventually feed back from offline runs and online monitoring into the next release.

None of these replace a clear product decision about what “good” means for your users—pick a small metric set, wire it to gates you actually enforce, and revisit it when models, tools, or traffic change.
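
As one way to wire a small metric set to an enforced gate, the sketch below compares aggregated scores against fixed thresholds and returns the list of violations. The Gate class, the metric names, and every threshold value are hypothetical placeholders; choose numbers that reflect your own product decision.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gate:
    metric: str
    threshold: float
    higher_is_better: bool = True


# Hypothetical gate set: two quality floors and one latency ceiling.
GATES = [
    Gate("faithfulness", 0.80),
    Gate("answer_relevancy", 0.75),
    Gate("p95_latency_ms", 2500.0, higher_is_better=False),
]


def check_gates(scores: dict[str, float], gates: list[Gate]) -> list[str]:
    """Return one message per violated or missing gate; an empty list means pass."""
    failures = []
    for gate in gates:
        value = scores.get(gate.metric)
        if value is None:
            failures.append(f"{gate.metric}: missing")
            continue
        ok = value >= gate.threshold if gate.higher_is_better else value <= gate.threshold
        if not ok:
            bound = "min" if gate.higher_is_better else "max"
            failures.append(f"{gate.metric}: {value} ({bound} {gate.threshold})")
    return failures


candidate = {"faithfulness": 0.90, "answer_relevancy": 0.80, "p95_latency_ms": 3000.0}
failures = check_gates(candidate, GATES)
print("release blocked:" if failures else "release ok", failures)
```

In CI, a non-empty failure list would typically fail the job; treating a missing metric as a failure keeps a broken evaluation run from passing silently.
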