Phoenix Evals provides:
  • Built-in evaluators for common tasks (faithfulness, correctness, relevance, conciseness, and more)
  • A unified LLM wrapper supporting OpenAI, Anthropic, Google, LiteLLM, and other providers
  • Batch evaluation over pandas DataFrames with evaluate_dataframe
  • Custom evaluator creation via create_evaluator decorator or create_classifier factory
  • Benchmark datasets for testing evaluator accuracy

Installation

pip install "arize-phoenix-evals>=3"
Install the LLM vendor SDK for your chosen provider:
pip install openai  # or anthropic, google-generativeai, litellm, etc.

Quick example

import os
import pandas as pd
from phoenix.evals import LLM, evaluate_dataframe
from phoenix.evals.metrics import DocumentRelevanceEvaluator

os.environ["OPENAI_API_KEY"] = "<your-openai-key>"

llm = LLM(provider="openai", model="gpt-4o")
relevance_evaluator = DocumentRelevanceEvaluator(llm=llm)

# The DataFrame must have columns matching the evaluator's input schema.
# DocumentRelevanceEvaluator expects: 'input' (query) and 'output' (document).
df = pd.DataFrame(
    {
        "input": ["What is Phoenix Evals?"],
        "output": ["Phoenix Evals is a library of LLM evaluators."],
    }
)

results_df = evaluate_dataframe(
    dataframe=df,
    evaluators=[relevance_evaluator],
)
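Because `evaluate_dataframe` relies on the DataFrame's columns matching each evaluator's input schema, a quick pre-flight check can catch schema mismatches before any LLM calls are made. This is an illustrative sketch, not part of the phoenix.evals API; the `required_fields` list is hard-coded here rather than read from the evaluator object:

```python
import pandas as pd

# Hypothetical pre-flight check before calling evaluate_dataframe.
# DocumentRelevanceEvaluator expects 'input' and 'output' columns.
required_fields = ["input", "output"]

df = pd.DataFrame(
    {
        "input": ["What is the capital of France?"],
        "output": ["Paris is the capital of France."],
    }
)

missing = [f for f in required_fields if f not in df.columns]
if missing:
    raise ValueError(f"DataFrame is missing required columns: {missing}")
```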

Core API

| Symbol | Description |
|---|---|
| `LLM(provider=..., model=...)` | Unified LLM wrapper for all supported providers |
| `evaluate_dataframe(dataframe, evaluators)` | Run evaluators over a DataFrame (sync) |
| `async_evaluate_dataframe(dataframe, evaluators)` | Async variant with concurrency control |
| `create_evaluator(name, kind=...)` | Decorator to turn a function into an `Evaluator` |
| `create_classifier(name, prompt_template, llm, choices)` | Factory for `ClassificationEvaluator` instances |
| `bind_evaluator(evaluator, input_mapping)` | Bind field mappings to an evaluator |
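The idea behind `bind_evaluator`'s `input_mapping` is to translate your column names into the field names an evaluator expects. The toy function below sketches that translation in plain Python; it is a stand-in for illustration, not the phoenix.evals implementation:

```python
# Toy sketch of the field-mapping concept: rename record keys from
# your source columns to the evaluator's expected input fields.
# Mapping direction assumed here: evaluator field -> source column.
def apply_input_mapping(record: dict, input_mapping: dict) -> dict:
    """Return a record keyed by the evaluator's field names."""
    return {field: record[column] for field, column in input_mapping.items()}

row = {"query": "What is 2 + 2?", "response": "4"}
mapped = apply_input_mapping(row, {"input": "query", "output": "response"})
# mapped == {"input": "What is 2 + 2?", "output": "4"}
```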

Built-in evaluators

All built-in evaluators live in phoenix.evals.metrics; the LLM-based ones accept llm: LLM as their first argument:

| Evaluator | Input fields | Labels |
|---|---|---|
| `FaithfulnessEvaluator` | input, output, context | faithful / unfaithful |
| `CorrectnessEvaluator` | input, output | correct / incorrect |
| `DocumentRelevanceEvaluator` | input, output | relevant / unrelated |
| `ConcisenessEvaluator` | input, output | concise / verbose / too_concise |
| `RefusalEvaluator` | input, output | refusal / no_refusal |
| `ToolSelectionEvaluator` | input, output, tool_name, expected_tool | correct / incorrect |
| `ToolInvocationEvaluator` | input, tool_name, tool_call_args | correct / incorrect |
| `ToolResponseHandlingEvaluator` | input, tool_response, output | good / bad |
| `exact_match` | output, expected | (code-based, no LLM) |
| `MatchesRegex(pattern=...)` | output | (code-based, no LLM) |
To learn more about LLM Evals, see the evals quickstart.
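The semantics of the two code-based checks in the table above can be summarized with toy stand-ins; these are illustrative reimplementations, not the phoenix.evals functions:

```python
import re

# Toy stand-in for exact_match: the output must equal the expected
# string exactly.
def exact_match(output: str, expected: str) -> bool:
    return output == expected

# Toy stand-in for a MatchesRegex-style check: the pattern must be
# found somewhere in the output.
def matches_regex(output: str, pattern: str) -> bool:
    return re.search(pattern, output) is not None

print(exact_match("4", "4"))                 # True
print(matches_regex("Answer: 42", r"\d+"))   # True
```

Because neither check calls an LLM, they run instantly and deterministically, which makes them useful as cheap baseline evaluators alongside the LLM-based ones.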

Reference Documentation

Full API Reference: complete API documentation for evaluators, metrics, and LLM classification.