Patronus AI
Patronus AI is an evaluation and guardrails platform for production LLM applications and AI agents. It combines an API-first evaluation service with Python and TypeScript SDKs, in-house judge models (Lynx for hallucination detection, Glider for reasoning evaluation, Percival for agent debugging), and a portfolio of open benchmarks, datasets, and RL environments, including FinanceBench and BLUR. Customers use Patronus for experimentation, production monitoring, RAG and agent evaluation, dataset generation, and human-in-the-loop annotation.
Patronus AI publishes 7 APIs on the APIs.io network. Tagged areas include LLM Evaluation, Guardrails, Judges, Hallucination Detection, and AI Research.
Patronus AI’s developer surface includes documentation, an API reference, an engineering blog, pricing, and 6 more developer resources.
APIs
Patronus Evaluation API
The Patronus Evaluation API scores LLM outputs against built-in and custom evaluators covering hallucination, answer relevance, context utilization, safety, and PII. Evaluators ...
Patronus Python SDK
The Patronus Python SDK provides decorators and clients for instrumenting LLM applications, running evaluators inline, recording traces, and pushing experiments to the Patronus ...
Patronus TypeScript SDK
The Patronus TypeScript SDK brings the same evaluation, tracing, and experiment workflows to Node.js and browser environments used by JavaScript-first AI applications.
Lynx
Lynx is Patronus's open-weights hallucination detection model published on Hugging Face. It is positioned as state-of-the-art on hallucination benchmarks and is available both a...
Glider
Glider is Patronus's small judge model for evaluating reasoning chains and rubric-based scoring with low latency and cost relative to large frontier judges.
Percival
Percival is Patronus's agent debugging product that ingests agent traces and surfaces failure modes, tool misuse, and reasoning errors across multi-step runs.
FinanceBench
FinanceBench is an open benchmark of more than 10,000 financial question-answer pairs grounded in public filings, used to evaluate LLM performance on financial document understanding.
Features
Hosted API for running built-in and custom evaluators on LLM inputs and outputs.
Lynx, an open-weights hallucination judge positioned as state-of-the-art, available as a hosted evaluator.
Glider, a small reasoning-focused judge for rubric-based evaluation at production latency.
Percival agent trace analysis surfacing failure modes, tool misuse, and reasoning errors.
Compare prompts, models, and configurations across datasets with side-by-side outputs.
Real-time alerts, tracing, and dashboards for live LLM applications.
Synthetic dataset creation including red-teaming sets for RAG and agent systems.
Workflows for human-in-the-loop labeling and reviewer agreement tracking.
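The experiment-comparison feature listed above can be illustrated with a small, generic sketch. It assumes evaluator scores have already been collected per configuration. It does not call the Patronus SDK or API, and the function name `compare_configs`, the `tolerance` parameter, and the report fields are hypothetical, chosen only for illustration.

```python
from statistics import mean
from typing import Dict, List


def compare_configs(scores: Dict[str, List[float]], baseline: str,
                    tolerance: float = 0.02) -> Dict[str, dict]:
    """Compare mean evaluator scores per configuration against a baseline.

    `scores` maps a configuration name (hypothetical, e.g. "prompt_v1") to
    per-example scores in [0, 1], all computed over the same dataset.
    """
    base_mean = mean(scores[baseline])
    report = {}
    for name, values in scores.items():
        config_mean = mean(values)
        delta = config_mean - base_mean
        report[name] = {
            "mean": config_mean,
            "delta": delta,
            # Flag a regression only when the drop exceeds the tolerance.
            "regressed": delta < -tolerance,
        }
    return report


if __name__ == "__main__":
    # Made-up scores, purely to show the shape of the input and output.
    example = {"prompt_v1": [0.8, 0.9, 1.0], "prompt_v2": [0.6, 0.7, 0.8]}
    for name, row in compare_configs(example, baseline="prompt_v1").items():
        print(name, row)
```

In practice the per-example scores would come from whichever evaluators the experiment runs; comparing against a fixed baseline with a tolerance is one simple design choice among several.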
Use Cases
Score retrieval and generation quality in RAG applications across faithfulness, relevance, and context utilization.
Trace and diagnose failures in multi-step agentic systems using Percival.
Benchmark candidate models against domain-specific datasets such as FinanceBench.
Apply Patronus judges as runtime guardrails on LLM responses.
Detect quality regressions across prompt, model, and configuration changes.
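The runtime-guardrail use case above follows a simple pattern: score a response with a judge before returning it, and fall back when the judge fails it. The sketch below shows that pattern only. The `Verdict` type, the `guarded_answer` function, and the toy word-overlap judge are all assumptions made for illustration; they are not the Patronus SDK, and a real deployment would replace the toy judge with a call to a hosted or self-hosted evaluator.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Verdict:
    """Result of one judge call; field names are illustrative only."""
    passed: bool
    score: float
    reason: str = ""


# A judge takes (question, context, answer) and returns a Verdict.
Judge = Callable[[str, str, str], Verdict]


def guarded_answer(question: str, context: str, answer: str, judge: Judge,
                   fallback: str = "I can't answer that reliably.") -> str:
    """Return the answer only if the judge passes it, else the fallback."""
    verdict = judge(question, context, answer)
    return answer if verdict.passed else fallback


def _words(text: str) -> set:
    """Lowercase words with surrounding punctuation stripped."""
    cleaned = {w.strip(".,;:!?").lower() for w in text.split()}
    cleaned.discard("")
    return cleaned


def toy_overlap_judge(question: str, context: str, answer: str) -> Verdict:
    """Toy stand-in for a hallucination judge.

    Scores the fraction of answer words that also occur in the context.
    This is NOT how a model-based judge works; it only makes the sketch
    runnable without network access.
    """
    answer_words = _words(answer)
    if not answer_words:
        return Verdict(False, 0.0, "empty answer")
    score = len(answer_words & _words(context)) / len(answer_words)
    return Verdict(score >= 0.5, score, "word overlap with context")


if __name__ == "__main__":
    ctx = "Revenue was 5 million dollars in 2023."
    print(guarded_answer("What was revenue?", ctx,
                         "Revenue was 5 million dollars.", toy_overlap_judge))
    print(guarded_answer("What was revenue?", ctx,
                         "Profit doubled to 90 billion euros.", toy_overlap_judge))
```

Passing the judge in as a callable keeps the gating logic independent of any particular evaluator, so the same wrapper could sit in front of different judges.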
Integrations
Score outputs from OpenAI models inside Patronus experiments and monitoring.
Evaluate Anthropic Claude outputs using Patronus judges.
SDK integrations for LangChain chains and agents.
Evaluate LlamaIndex RAG pipelines with Patronus evaluators.
Ingest OpenTelemetry (OTel)-compatible LLM traces for evaluation and monitoring.
Lynx and Glider weights are distributed via Hugging Face for self-hosting.