Phoenix Evals provides:
  • Built-in evaluators for common tasks (faithfulness, correctness, relevance, conciseness, and more)
  • A unified LLM wrapper supporting OpenAI, Anthropic, Google, LiteLLM, and other providers
  • Batch evaluation over pandas DataFrames with evaluate_dataframe
  • Custom evaluator creation via create_evaluator decorator or create_classifier factory
  • Benchmark datasets for testing evaluator accuracy

Installation

pip install "arize-phoenix-evals>=3"
Install the LLM vendor SDK for your chosen provider:
pip install openai  # or anthropic, google-generativeai, litellm, etc.

Quick example

import os
import pandas as pd
from phoenix.evals import LLM, evaluate_dataframe
from phoenix.evals.metrics import DocumentRelevanceEvaluator

os.environ["OPENAI_API_KEY"] = "<your-openai-key>"

llm = LLM(provider="openai", model="gpt-4o")
relevance_evaluator = DocumentRelevanceEvaluator(llm=llm)

# The DataFrame must have columns matching the evaluator's input schema.
# DocumentRelevanceEvaluator expects: 'input' (query) and 'output' (document).
df = pd.DataFrame(
    {
        "input": ["What is Phoenix Evals?"],
        "output": ["Phoenix Evals is a library of LLM evaluators."],
    }
)

results_df = evaluate_dataframe(
    dataframe=df,
    evaluators=[relevance_evaluator],
)
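Because `evaluate_dataframe` relies on the DataFrame's columns matching each evaluator's input schema, a quick pre-flight check can catch schema mismatches before any LLM calls are made. This is an illustrative sketch, not part of the phoenix.evals API; the `required_fields` list is hard-coded here rather than read from the evaluator object:

```python
import pandas as pd

# Hypothetical pre-flight check before calling evaluate_dataframe.
# DocumentRelevanceEvaluator expects 'input' and 'output' columns.
required_fields = ["input", "output"]

df = pd.DataFrame(
    {
        "input": ["What is the capital of France?"],
        "output": ["Paris is the capital of France."],
    }
)

missing = [f for f in required_fields if f not in df.columns]
if missing:
    raise ValueError(f"DataFrame is missing required columns: {missing}")
```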

Core API

| Symbol | Description |
|---|---|
| `LLM(provider=..., model=...)` | Unified LLM wrapper for all supported providers |
| `evaluate_dataframe(dataframe, evaluators)` | Run evaluators over a DataFrame (sync) |
| `async_evaluate_dataframe(dataframe, evaluators)` | Async variant with concurrency control |
| `create_evaluator(name, kind=...)` | Decorator to turn a function into an `Evaluator` |
| `create_classifier(name, prompt_template, llm, choices)` | Factory for `ClassificationEvaluator` instances |
| `bind_evaluator(evaluator, input_mapping)` | Bind field mappings to an evaluator |
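The idea behind `bind_evaluator`'s `input_mapping` is to translate your column names into the field names an evaluator expects. The toy function below sketches that translation in plain Python; it is a stand-in for illustration, not the phoenix.evals implementation:

```python
# Toy sketch of the field-mapping concept: rename record keys from
# your source columns to the evaluator's expected input fields.
# Mapping direction assumed here: evaluator field -> source column.
def apply_input_mapping(record: dict, input_mapping: dict) -> dict:
    """Return a record keyed by the evaluator's field names."""
    return {field: record[column] for field, column in input_mapping.items()}

row = {"query": "What is 2 + 2?", "response": "4"}
mapped = apply_input_mapping(row, {"input": "query", "output": "response"})
# mapped == {"input": "What is 2 + 2?", "output": "4"}
```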

Built-in evaluators

All built-in evaluators live in phoenix.evals.metrics; the LLM-based ones accept llm: LLM as their first argument:

| Evaluator | Input fields | Labels |
|---|---|---|
| `FaithfulnessEvaluator` | input, output, context | faithful / unfaithful |
| `CorrectnessEvaluator` | input, output | correct / incorrect |
| `DocumentRelevanceEvaluator` | input, output | relevant / unrelated |
| `ConcisenessEvaluator` | input, output | concise / verbose / too_concise |
| `RefusalEvaluator` | input, output | refusal / no_refusal |
| `ToolSelectionEvaluator` | input, output, tool_name, expected_tool | correct / incorrect |
| `ToolInvocationEvaluator` | input, tool_name, tool_call_args | correct / incorrect |
| `ToolResponseHandlingEvaluator` | input, tool_response, output | good / bad |
| `exact_match` | output, expected | (code-based, no LLM) |
| `MatchesRegex(pattern=...)` | output | (code-based, no LLM) |
To learn more about LLM Evals, see the evals quickstart.
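The semantics of the two code-based checks in the table above can be summarized with toy stand-ins; these are illustrative reimplementations, not the phoenix.evals functions:

```python
import re

# Toy stand-in for exact_match: the output must equal the expected
# string exactly.
def exact_match(output: str, expected: str) -> bool:
    return output == expected

# Toy stand-in for a MatchesRegex-style check: the pattern must be
# found somewhere in the output.
def matches_regex(output: str, pattern: str) -> bool:
    return re.search(pattern, output) is not None

print(exact_match("4", "4"))                 # True
print(matches_regex("Answer: 42", r"\d+"))   # True
```

Because neither check calls an LLM, they run instantly and deterministically, which makes them useful as cheap baseline evaluators alongside the LLM-based ones.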

Reference Documentation

Full API Reference: complete API documentation for evaluators, metrics, and LLM classification.