- Built-in evaluators for common tasks (faithfulness, correctness, relevance, conciseness, and more)
- A unified `LLM` wrapper supporting OpenAI, Anthropic, Google, LiteLLM, and other providers
- Batch evaluation over pandas DataFrames with `evaluate_dataframe`
- Custom evaluator creation via the `create_evaluator` decorator or `create_classifier` factory
- Benchmark datasets for testing evaluator accuracy
## Installation
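Assuming the evals package is published on PyPI under the `arize-phoenix-evals` name (check the project's PyPI page to confirm), installation is a single pip command:

```shell
pip install arize-phoenix-evals
```

Provider SDKs (e.g. `openai`, `anthropic`) are typically installed separately, depending on which backend you pass to `LLM(provider=...)`.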
## Quick example
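The sketch below shows the intended flow: build a DataFrame of records, run an evaluator over each row, and collect labels. To keep it runnable offline, a hypothetical `StubLLM` stands in for the real `LLM(provider=..., model=...)` wrapper from the Core API table, and the evaluator is a plain function mirroring the `FaithfulnessEvaluator` contract (fields `input`, `output`, `context`) rather than the library class:

```python
import pandas as pd

class StubLLM:
    """Offline stand-in for phoenix.evals.LLM; always answers 'faithful'."""
    def generate_classification(self, prompt: str) -> str:
        # A real implementation would call the configured provider here.
        return "faithful"

def faithfulness(llm, row: dict) -> str:
    # Mirrors the FaithfulnessEvaluator input fields: input, output, context.
    prompt = (
        f"Question: {row['input']}\n"
        f"Answer: {row['output']}\n"
        f"Context: {row['context']}\n"
        "Is the answer faithful to the context? Answer 'faithful' or 'unfaithful'."
    )
    return llm.generate_classification(prompt)

df = pd.DataFrame(
    [
        {
            "input": "What is the capital of France?",
            "output": "Paris",
            "context": "Paris is the capital of France.",
        }
    ]
)
llm = StubLLM()
df["faithfulness"] = [faithfulness(llm, row) for row in df.to_dict("records")]
print(df["faithfulness"].tolist())
```

With the real library, the loop is replaced by `evaluate_dataframe(df, [FaithfulnessEvaluator(llm)])`, per the Core API table below.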
## Core API
| Symbol | Description |
|---|---|
| `LLM(provider=..., model=...)` | Unified LLM wrapper for all supported providers |
| `evaluate_dataframe(dataframe, evaluators)` | Run evaluators over a DataFrame (sync) |
| `async_evaluate_dataframe(dataframe, evaluators)` | Async variant with concurrency control |
| `create_evaluator(name, kind=...)` | Decorator to turn a function into an `Evaluator` |
| `create_classifier(name, prompt_template, llm, choices)` | Factory for `ClassificationEvaluator` instances |
| `bind_evaluator(evaluator, input_mapping)` | Bind field mappings to an evaluator |
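To illustrate what a field mapping buys you, the sketch below is a hypothetical plain-Python reimplementation of the `bind_evaluator(evaluator, input_mapping)` idea, not the library's code: the mapping translates an evaluator's expected field names to the column names actually present in your data, so the same evaluator can run over differently shaped DataFrames.

```python
def bind_evaluator(evaluator, input_mapping: dict):
    """Return a wrapper that remaps record keys before delegating.

    input_mapping maps evaluator field name -> source column name.
    """
    def bound(record: dict):
        remapped = {field: record[column] for field, column in input_mapping.items()}
        return evaluator(**remapped)
    return bound

def exact_match(output: str, expected: str) -> bool:
    # Simple code-based evaluator used as the bind target.
    return output == expected

# A row whose columns don't match the evaluator's field names:
bound = bind_evaluator(exact_match, {"output": "model_answer", "expected": "gold_answer"})
print(bound({"model_answer": "42", "gold_answer": "42"}))  # True
```

The real `bind_evaluator` presumably returns an `Evaluator`-compatible object rather than a bare function, but the remapping semantics shown here are the core of it.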
## Built-in evaluators
All built-in evaluators live in `phoenix.evals.metrics` and accept `llm: LLM` as their first argument:
| Evaluator | Input fields | Labels |
|---|---|---|
| `FaithfulnessEvaluator` | input, output, context | faithful / unfaithful |
| `CorrectnessEvaluator` | input, output | correct / incorrect |
| `DocumentRelevanceEvaluator` | input, output | relevant / unrelated |
| `ConcisenessEvaluator` | input, output | concise / verbose / too_concise |
| `RefusalEvaluator` | input, output | refusal / no_refusal |
| `ToolSelectionEvaluator` | input, output, tool_name, expected_tool | correct / incorrect |
| `ToolInvocationEvaluator` | input, tool_name, tool_call_args | correct / incorrect |
| `ToolResponseHandlingEvaluator` | input, tool_response, output | good / bad |
| `exact_match` | output, expected | (code-based, no LLM) |
| `MatchesRegex(pattern=...)` | output | (code-based, no LLM) |
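The last two entries are code-based checks that need no LLM at all. As a rough sketch of their behavior (a hypothetical plain-Python reimplementation, not the library's source), `exact_match` is string equality and `MatchesRegex` searches the output for a compiled pattern:

```python
import re

def exact_match(output: str, expected: str) -> bool:
    # Code-based evaluator: plain string equality, no LLM involved.
    return output == expected

class MatchesRegex:
    """Code-based evaluator: passes when the pattern is found in the output."""
    def __init__(self, pattern: str):
        self.pattern = re.compile(pattern)

    def __call__(self, output: str) -> bool:
        return bool(self.pattern.search(output))

is_iso_date = MatchesRegex(pattern=r"\d{4}-\d{2}-\d{2}")
print(exact_match("Paris", "Paris"))        # True
print(is_iso_date("Released 2024-06-01"))   # True
print(is_iso_date("Released last summer"))  # False
```

Because they are deterministic and free, these checks are a good first pass before spending LLM calls on the judgment-based evaluators above.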
## Reference Documentation

- Full API Reference: complete API documentation for evaluators, metrics, and LLM classification

