Evals

A landscape catalog of the platforms, frameworks, libraries, and benchmark suites used to evaluate large language models, LLM-based applications, and AI agents. The topic spans human-rated, LLM-as-a-judge, reference-based, reference-free, and benchmark-aligned approaches to measuring AI system quality. Tracked alongside the eval platforms are the canonical multi-task and code/agent benchmark suites (MMLU, HumanEval, GAIA, AgentBench, BIG-Bench) that establish public points of comparison.

20 APIs 10 Features

Evals, LLM Evaluation, AI Quality, Benchmarks, LLM as a Judge, Observability, Agent Evaluation, RAG Evaluation, Test-Driven AI

APIs

OpenAI Evals

OpenAI Evals is the open-source framework released by OpenAI for evaluating large language models and LLM-based systems. The README states "Evals provide a framework for evaluat...

Inspect AI

Inspect AI is an open-source framework for large language model evaluations developed and maintained by the UK AI Security Institute (UK AISI) and Meridian Labs. It supports tex...

Braintrust

Braintrust is a commercial evaluation platform that captures eval runs as immutable, comparable experiment snapshots. The product supports code-based scorers, built-in autoevals...

LangSmith Evaluation

LangSmith Evaluation is LangChain's evaluation framework for measuring application quality across the lifecycle. The docs describe evals as "a way to breakdown what 'good' looks...

Promptfoo

Promptfoo is an open-source CLI and library for evaluating and red-teaming LLM applications. The docs describe it as enabling "test-driven LLM development rather than trial-and-...

Helicone

Helicone is an open-source observability and monitoring platform for LLM applications. The homepage states "The world's fastest-growing AI companies rely on Helicone to route, d...

Patronus AI

Patronus AI is a frontier lab building evaluation infrastructure and Digital World Models for human-aligned AGI. Its evaluator models include Lynx (a hallucination-detection mod...

DeepEval (Confident AI)

DeepEval is an open-source LLM evaluation package, paired with Confident AI as the hosted observability/evals/monitoring tier. The docs call DeepEval "an open-source LLM eval pa...

Arize AI (Phoenix)

Arize AI provides an AI observability and evaluation platform centered on Arize AX (the commercial product) and Phoenix (open-source LLM tracing and evaluation). Phoenix runs LL...

Galileo

Galileo is an enterprise AI observability and evaluation engineering platform. The product line emphasizes "20+ built-in evaluators" spanning RAG, agents, safety, and security, ...

Humanloop

Humanloop was a development platform for LLM applications, describing itself as having been "the first development platform for LLM applications" and having "shaped industry sta...

TruLens

TruLens is an open-source evaluation and tracing platform for AI agents that helps developers "move from vibes to metrics." Its feedback-function library covers the RAG triad — ...

Weights and Biases Weave

W&B Weave is a platform for evaluating, monitoring, and iterating on AI agents and applications, started with "one line of code." Weave Evaluations enable visual comparison of r...

Ragas

Ragas is an open-source evaluation library focused on retrieval-augmented generation, described in its own docs as "a library that helps you move from 'vibe checks' to systemati...

MLflow LLM Evaluate

MLflow LLM Evaluate extends MLflow's experiment tracking with mlflow.evaluate() support for LLM tasks. The API runs reference-based and reference-free metrics (toxicity, perplex...

MMLU Benchmark

MMLU (Measuring Massive Multitask Language Understanding) is a multiple-choice benchmark spanning 57 subjects from STEM and international law to nutrition and religion. It conta...

HumanEval Benchmark

HumanEval is OpenAI's evaluation harness for code-generation models, described in its README as "an evaluation harness for the HumanEval problem solving dataset described in the...

GAIA Benchmark

GAIA is "a benchmark for General AI Assistants," published in 2023 (arXiv 2311.12983). It tests general-purpose AI agent capability across reasoning, tool use, multi-modality, a...

AgentBench

AgentBench describes itself as the first benchmark designed to evaluate LLM-as-Agent across a diverse spectrum of environments. It bundles 8 environments, 5 newly created (Operating System, Dat...

BIG-Bench

The Beyond the Imitation Game Benchmark (BIG-Bench) is "a collaborative benchmark intended to probe large language models and extrapolate their future capabilities." It contains...

Features

LLM-as-a-Judge Scoring

A second LLM evaluates the output of the system-under-test, producing a numeric or categorical score and (optionally) a written rationale. The dominant scoring mode for free-form text outputs across Braintrust, LangSmith, DeepEval, Weave, TruLens, Phoenix, and Patronus.
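
The judge flow described above can be sketched in a few lines. This is a minimal illustration, not any platform's API: the prompt template, the 1-5 scale, the `Verdict` type, and the `call_judge` callable are all assumptions, and `fake_judge` stands in for a real model call so the sketch runs offline.

```python
import json
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical prompt template; real platforms ship their own.
JUDGE_PROMPT = (
    "You are grading an answer.\n"
    "Question: {question}\n"
    "Answer: {answer}\n"
    'Reply with JSON: {{"score": <integer 1-5>, "rationale": "<one sentence>"}}'
)


@dataclass
class Verdict:
    score: int
    rationale: Optional[str]


def judge_output(question: str, answer: str,
                 call_judge: Callable[[str], str]) -> Verdict:
    """Ask a judge model to grade `answer`; `call_judge` maps a prompt to raw text."""
    raw = call_judge(JUDGE_PROMPT.format(question=question, answer=answer))
    data = json.loads(raw)  # malformed judge output raises ValueError
    score = int(data["score"])
    if not 1 <= score <= 5:
        raise ValueError(f"judge score out of range: {score}")
    return Verdict(score=score, rationale=data.get("rationale"))


def fake_judge(prompt: str) -> str:
    """Stand-in for a real model call (assumption for illustration)."""
    return '{"score": 4, "rationale": "Correct but terse."}'


verdict = judge_output("What is 2 + 2?", "4", fake_judge)
print(verdict.score, verdict.rationale)
```

Validating the judge's reply before trusting it matters in practice, since a judge model can return malformed or out-of-range output just like the system under test.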

Reference-Based Scoring

Compares model output against a ground-truth expected answer using exact match, BLEU, ROUGE, embedding similarity, or task-specific equality (e.g. unit-test pass/fail). The native mode for benchmarks like MMLU, HumanEval, and GAIA.
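
A minimal sketch of two reference-based scorers, assuming plain-string outputs. Normalized exact match follows the description above; token-overlap F1 is used here only as a simple stand-in for overlap metrics such as BLEU and ROUGE, which it does not reproduce. The function names and normalization rules are illustrative assumptions.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()


def exact_match(output: str, expected: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return 1.0 if normalize(output) == normalize(expected) else 0.0


def token_f1(output: str, expected: str) -> float:
    """Harmonic mean of token precision and recall against the reference."""
    out_tokens = normalize(output).split()
    exp_tokens = normalize(expected).split()
    if not out_tokens or not exp_tokens:
        return 1.0 if out_tokens == exp_tokens else 0.0
    overlap = sum((Counter(out_tokens) & Counter(exp_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(out_tokens)
    recall = overlap / len(exp_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("Paris.", " paris"))
print(token_f1("The capital is Paris", "Paris"))
```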

Reference-Free Scoring

Assesses output quality without ground truth — toxicity, coherence, faithfulness against retrieved context, criterion adherence. Enables online (production-traffic) evaluation where labels do not exist.

Pairwise Comparison

A judge compares two candidate outputs, A and B, and picks a winner or declares a tie. Useful when absolute scoring is hard but relative preference is reliable. Surfaced explicitly by LangSmith and used widely in chatbot arenas.
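
Pairwise verdicts are usually rolled up into a win rate. A minimal sketch, assuming each verdict is one of the strings "A", "B", or "tie" and that a tie counts as half a win for each side; both conventions are assumptions for illustration, not a specific platform's behavior.

```python
from collections import Counter
from typing import Dict, Iterable


def win_rate(verdicts: Iterable[str]) -> Dict[str, float]:
    """Summarize pairwise verdicts into win rates, counting ties as half a win."""
    counts = Counter(verdicts)
    unknown = set(counts) - {"A", "B", "tie"}
    if unknown:
        raise ValueError(f"unexpected verdict labels: {sorted(unknown)}")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no verdicts to summarize")
    a_rate = (counts["A"] + 0.5 * counts["tie"]) / total
    return {"A": a_rate, "B": 1.0 - a_rate, "ties": float(counts["tie"])}


print(win_rate(["A", "A", "B", "tie"]))
```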

Benchmark-Aligned Evaluation

Runs the system-under-test against a standardized public dataset (MMLU, HumanEval, GAIA, AgentBench, BIG-Bench) to produce comparable headline scores. The basis of model leaderboards.
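
As one example of a headline score, the HumanEval entry in this catalog reports pass@1, pass@10, and pass@100. A minimal sketch of the computation, assuming the unbiased estimator from the paper cited in that entry, 1 - C(n - c, k) / C(n, k), where n samples are drawn per problem and c of them pass the unit tests. The function name and the sample counts below are illustrative.

```python
from typing import List, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples, c of which passed."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    # C(n - c, k) / C(n, k) written as a product to avoid huge binomials.
    prob_all_fail = 1.0
    for i in range(n - c + 1, n + 1):
        prob_all_fail *= 1.0 - k / i
    return 1.0 - prob_all_fail


def benchmark_pass_at_k(results: List[Tuple[int, int]], k: int) -> float:
    """Average the per-problem estimates; `results` holds (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical counts: 3 problems, 10 samples each.
print(benchmark_pass_at_k([(10, 3), (10, 0), (10, 10)], k=1))
```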

Human-Rated Scoring

Domain experts or end users provide thumbs-up/down, Likert ratings, or written critiques. Used as ground truth, as a judge-calibration signal, and as a final acceptance gate before production.

RAG Triad

Three feedback functions — groundedness, context relevance, answer relevance — codified by TruLens and widely adopted across Ragas, Phoenix, DeepEval, and LangSmith for evaluating retrieval-augmented generation pipelines.

Agent and Tool-Use Evaluation

Evaluating multi-step agent trajectories — did the agent pick the right tool, did the tool call succeed, did the final answer satisfy the goal. Supported by Inspect AI, Galileo, Weave, LangSmith, Braintrust, and benchmarks like AgentBench and GAIA.
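
The three questions above map onto three simple trajectory metrics. A minimal sketch, assuming a trajectory is a list of tool calls and that the expected tool sequence is known and compared position by position; the `Step` type, field names, and positional matching are all illustrative assumptions, and real platforms often use more forgiving matching.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Step:
    tool: str        # name of the tool the agent called
    succeeded: bool  # whether the call returned without error


def score_trajectory(steps: List[Step], expected_tools: List[str],
                     goal_satisfied: bool) -> Dict[str, float]:
    """Score tool selection, tool-call success, and goal completion."""
    hits = sum(1 for step, want in zip(steps, expected_tools) if step.tool == want)
    tool_selection = hits / len(expected_tools) if expected_tools else 1.0
    call_success = sum(s.succeeded for s in steps) / len(steps) if steps else 0.0
    return {
        "tool_selection": tool_selection,
        "call_success": call_success,
        "goal_satisfied": 1.0 if goal_satisfied else 0.0,
    }


run = [Step("search", True), Step("search", False), Step("calculator", True)]
print(score_trajectory(run, ["search", "calculator"], goal_satisfied=True))
```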

Online Production Monitoring

Eval scorers (typically reference-free LLM judges and Luna-style distilled evaluators) attach to live traffic via tracing/observability layers (Phoenix, Arize, Helicone, Galileo, Weave) to flag regressions in real time.

Red-Team / Safety Evaluation

Adversarial test suites probe jailbreaks, prompt injection, PII leakage, harmful content, and policy violations. First-class in Promptfoo, Patronus, Galileo, and Inspect AI's safety evals.

Use Cases

Model Selection

Run candidate models (GPT-5, Claude 4.7, Gemini 3, open-weight) against a shared eval suite to choose the best fit for a specific application by quality, cost, and latency.

Prompt Engineering Iteration

Compare prompt variants in a matrix-style eval (Promptfoo, LangSmith experiments, Braintrust experiments) to pick the best prompt before shipping.

Regression Detection in CI/CD

Wire an eval suite into CI so a pull request that drops a key scorer below a threshold fails the build, preventing quality regressions from reaching production.
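
A minimal sketch of such a gate, assuming the eval suite has already produced one aggregate score per scorer. The scorer names, thresholds, and function names are hypothetical; a missing score is treated as a failure so that a silently skipped scorer cannot pass the build.

```python
from typing import List, Mapping


def failing_scorers(scores: Mapping[str, float],
                    thresholds: Mapping[str, float]) -> List[str]:
    """Return the scorers whose score is missing or below its threshold."""
    failures = []
    for name, minimum in thresholds.items():
        value = scores.get(name)
        if value is None or value < minimum:
            failures.append(name)
    return sorted(failures)


def gate(scores: Mapping[str, float], thresholds: Mapping[str, float]) -> int:
    """Return a process exit code: 0 to pass the build, 1 to fail it."""
    failures = failing_scorers(scores, thresholds)
    for name in failures:
        print(f"FAIL {name}: {scores.get(name)} < {thresholds[name]}")
    return 1 if failures else 0


# Hypothetical scores from a pull-request eval run.
exit_code = gate(
    {"faithfulness": 0.91, "answer_relevance": 0.72},
    {"faithfulness": 0.85, "answer_relevance": 0.80},
)
print("exit code:", exit_code)
```

In a CI step the script would finish with `sys.exit(exit_code)`, so a nonzero code fails the job.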

RAG Pipeline Tuning

Use RAG-triad scores (groundedness / context relevance / answer relevance) and faithfulness to tune chunking, embedding, reranking, and prompt choices.

Agent Trajectory Quality

Score multi-step agent runs on tool-selection correctness, step efficiency, and final-answer faithfulness — the core measurement for production agentic apps.

Hallucination and Safety Guardrails

Deploy dedicated judge models (Lynx, GLIDER, Luna) to flag hallucinations, toxic content, PII leakage, and policy violations in real time.

Frontier Capability and Safety Assessment

Independent labs (UK AISI, US AISI) run capability and safety evaluations on frontier models before release, using frameworks like Inspect AI.

Public Leaderboard Reporting

Submit a model's scores against MMLU, HumanEval, GAIA, AgentBench, and BIG-Bench to position it on community leaderboards and back marketing claims with reproducible numbers.

Integrations

OpenTelemetry

Phoenix, TruLens, Weave, and most modern eval platforms ingest LLM traces via OpenTelemetry, making eval a layer on top of standard observability.

LangChain / LangGraph

LangSmith is the native eval tier for LangChain/LangGraph apps; most other platforms also integrate.

LlamaIndex

Ragas, DeepEval, and Phoenix integrate directly with LlamaIndex for RAG evaluation.

Hugging Face Datasets

MMLU, HumanEval, and GAIA are distributed as Hugging Face datasets and consumed by many eval frameworks.

CI/CD (GitHub Actions, etc.)

Braintrust, Promptfoo, LangSmith, and DeepEval ship CI integrations to fail builds on regression.

Snowflake

OpenAI Evals can log eval results to Snowflake; TruLens (TruEra) is now part of Snowflake.

Eval Suite Schema

Eval Run Structure

Sources

name: Evals
description: A landscape catalog of the platforms, frameworks, libraries, and benchmark suites used to evaluate large language models, LLM-based applications, and AI agents. The topic spans human-rated, LLM-as-a-judge, reference-based, reference-free, and benchmark-aligned approaches to measuring AI system quality. Tracked alongside the eval platforms are the canonical multi-task and code/agent benchmark suites (MMLU, HumanEval, GAIA, AgentBench, BIG-Bench) that establish public points of comparison.
url: https://github.com/api-evangelist/evals
created: '2026-05-22'
modified: '2026-05-22'
specificationVersion: '0.18'
tags:
  - Evals
  - LLM Evaluation
  - AI Quality
  - Benchmarks
  - LLM as a Judge
  - Observability
  - Agent Evaluation
  - RAG Evaluation
  - Test-Driven AI
apis:
  - name: OpenAI Evals
    description: OpenAI Evals is the open-source framework released by OpenAI for evaluating large language models and LLM-based systems. The README states "Evals provide a framework for evaluating large language models (LLMs) or systems built using LLMs." The repo bundles a registry of benchmark evals, support for model-graded grading without writing custom code, private eval data via Snowflake logging, and templates for prompt chains and tool-using agents. Written primarily in Python, the project sits at roughly 18.5k stars / 3k forks.
    humanURL: https://github.com/openai/evals
    baseURL: https://github.com/openai/evals
    tags:
      - OpenAI
      - Open Source
      - Model Graded
      - Benchmark Registry
      - Python
    properties:
      - type: GitHubRepository
        url: https://github.com/openai/evals
      - type: Documentation
        url: https://github.com/openai/evals/tree/main/docs
      - type: License
        url: https://github.com/openai/evals/blob/main/LICENSE.md

  - name: Inspect AI
    description: Inspect AI is an open-source framework for large language model evaluations developed and maintained by the UK AI Security Institute (UK AISI) and Meridian Labs. It supports text comparisons, model-based grading such as model_graded_fact(), and custom scorers. Datasets carry input and target columns, with multimodal support across image, audio, and video. The framework targets frontier-AI capability and safety assessment across coding, reasoning, knowledge, behavior, and multimodal understanding.
    humanURL: https://inspect.aisi.org.uk/
    baseURL: https://inspect.aisi.org.uk
    tags:
      - UK AISI
      - Open Source
      - Frontier AI
      - Model Graded
      - Safety Evaluation
    properties:
      - type: Documentation
        url: https://inspect.aisi.org.uk/
      - type: GitHubRepository
        url: https://github.com/UKGovernmentBEIS/inspect_ai

  - name: Braintrust
    description: Braintrust is a commercial evaluation platform that captures eval runs as immutable, comparable experiment snapshots. The product supports code-based scorers, built-in autoevals, and LLM-as-a-judge evaluators for both offline and production use. Datasets are collections of test cases (input, optional expected output, metadata) sourced from production logs, user feedback, or manual curation. Experiments slot into CI/CD pipelines to detect regressions "before they reach production."
    humanURL: https://www.braintrust.dev/
    baseURL: https://www.braintrust.dev
    tags:
      - Commercial
      - LLM as a Judge
      - CI/CD
      - Experiments
      - Regression Detection
    properties:
      - type: Documentation
        url: https://www.braintrust.dev/docs
      - type: EvaluationGuide
        url: https://www.braintrust.dev/docs/guides/evals
      - type: Pricing
        url: https://www.braintrust.dev/pricing

  - name: LangSmith Evaluation
    description: LangSmith Evaluation is LangChain's evaluation framework for measuring application quality across the lifecycle. The docs describe evals as "a way to breakdown what 'good' looks like and measure it." It supports code evaluators (deterministic rules), LLM-as-judge evaluators (reference-based or reference-free), and heuristic checks (length, latency, keywords). Concepts include datasets and examples, experiments, and pairwise evaluation for relative comparisons.
    humanURL: https://docs.langchain.com/langsmith/evaluation-concepts
    baseURL: https://api.smith.langchain.com
    tags:
      - LangChain
      - LLM as a Judge
      - Pairwise
      - Reference-Free
      - Online and Offline
    properties:
      - type: Documentation
        url: https://docs.langchain.com/langsmith/evaluation-concepts
      - type: Portal
        url: https://smith.langchain.com
      - type: Pricing
        url: https://www.langchain.com/pricing-langsmith

  - name: Promptfoo
    description: Promptfoo is an open-source CLI and library for evaluating and red-teaming LLM applications. The docs describe it as enabling "test-driven LLM development rather than trial-and-error" and producing "matrix views that let you quickly evaluate outputs across many prompts." It supports assertion-based scoring, integrations across OpenAI, Anthropic, Azure, Google, HuggingFace, and open-source models, plus automated red-team and pentest runs that produce vulnerability and risk reports.
    humanURL: https://www.promptfoo.dev/
    baseURL: https://www.promptfoo.dev
    tags:
      - Open Source
      - CLI
      - Red Teaming
      - Assertions
      - Test-Driven
    properties:
      - type: Documentation
        url: https://www.promptfoo.dev/docs/intro/
      - type: GitHubRepository
        url: https://github.com/promptfoo/promptfoo
      - type: RedTeam
        url: https://www.promptfoo.dev/docs/red-team/

  - name: Helicone
    description: Helicone is an open-source observability and monitoring platform for LLM applications. The homepage states "The world's fastest-growing AI companies rely on Helicone to route, debug, and analyze their applications." Beyond observability dashboards (requests, segments, sessions, users), it offers prompt management, datasets, a playground, rate-limit tracking, and alerts. LLM-as-a-judge style evaluation runs against captured request logs.
    humanURL: https://www.helicone.ai/
    baseURL: https://api.helicone.ai
    tags:
      - Open Source
      - Observability
      - Proxy
      - Prompt Management
      - Y Combinator
    properties:
      - type: Documentation
        url: https://docs.helicone.ai/
      - type: GitHubRepository
        url: https://github.com/Helicone/helicone
      - type: Pricing
        url: https://www.helicone.ai/pricing

  - name: Patronus AI
    description: Patronus AI is a frontier lab building evaluation infrastructure and Digital World Models for human-aligned AGI. Its evaluator models include Lynx (a hallucination-detection model reported to outperform GPT-4 on hallucination tasks) and GLIDER (an evaluation model producing reasoning chains with explainable judgments). Coverage spans research science, software development, customer service, product applications, finance, and multi-turn dialogue / long-horizon task planning.
    humanURL: https://www.patronus.ai/
    baseURL: https://api.patronus.ai
    tags:
      - Commercial
      - Hallucination Detection
      - Judge Models
      - Lynx
      - GLIDER
    properties:
      - type: Documentation
        url: https://docs.patronus.ai/
      - type: Portal
        url: https://app.patronus.ai

  - name: DeepEval (Confident AI)
    description: DeepEval is an open-source LLM evaluation package, paired with Confident AI as the hosted observability/evals/monitoring tier. The docs call DeepEval "an open-source LLM eval package" and Confident AI "an AI quality platform with observability, evals, and monitoring." Metrics include GEval (research-backed custom metric), AnswerRelevancyMetric, TaskCompletionMetric, and ConversationalGEval. Test cases use LLMTestCase and ConversationalTestCase shapes; datasets organize Golden test cases for sync or async runs.
    humanURL: https://www.deepeval.com/
    baseURL: https://api.confident-ai.com
    tags:
      - Open Source
      - GEval
      - RAG
      - Conversational
      - Python
    properties:
      - type: Documentation
        url: https://www.deepeval.com/docs/getting-started
      - type: GitHubRepository
        url: https://github.com/confident-ai/deepeval
      - type: Portal
        url: https://app.confident-ai.com

  - name: Arize AI (Phoenix)
    description: Arize AI provides an AI observability and evaluation platform centered on Arize AX (the commercial product) and Phoenix (open-source LLM tracing and evaluation). Phoenix runs LLM-as-a-judge evaluators across traces, supports datasets and experiments, and integrates with OpenTelemetry. Arize AX layers monitoring, drift detection, and root-cause analysis on top of model and LLM telemetry.
    humanURL: https://arize.com/
    baseURL: https://api.arize.com
    tags:
      - Commercial
      - Open Source
      - Phoenix
      - OpenTelemetry
      - Observability
    properties:
      - type: Documentation
        url: https://arize.com/docs/ax/
      - type: GitHubRepository
        url: https://github.com/Arize-ai/phoenix
      - type: Portal
        url: https://app.arize.com

  - name: Galileo
    description: Galileo is an enterprise AI observability and evaluation engineering platform. The product line emphasizes "20+ built-in evaluators" spanning RAG, agents, safety, and security, plus custom evaluators that "auto-tune metrics from live feedback." Luna refers to compact distilled evaluator models that "monitor 100% of your traffic at 97% lower cost." The homepage tagline reads "Don't just monitor AI failures. Stop them."
    humanURL: https://www.galileo.ai/
    baseURL: https://api.galileo.ai
    tags:
      - Commercial
      - Enterprise
      - Luna
      - RAG
      - Safety
    properties:
      - type: Documentation
        url: https://docs.galileo.ai/
      - type: Portal
        url: https://app.galileo.ai

  - name: Humanloop
    description: Humanloop was a development platform for LLM applications, describing itself as having been "the first development platform for LLM applications" and having "shaped industry standards for how to manage and evaluate AI." Following its acquisition by Anthropic, the platform has been sunset, with a migration path published for former customers. Retained in this catalog for historical completeness.
    humanURL: https://humanloop.com/
    baseURL: https://api.humanloop.com
    tags:
      - Historical
      - Acquired
      - Anthropic
      - Prompt Management
    properties:
      - type: Documentation
        url: https://humanloop.com/docs
      - type: AcquisitionNotice
        url: https://humanloop.com/

  - name: TruLens
    description: TruLens is an open-source evaluation and tracing platform for AI agents that helps developers "move from vibes to metrics." Its feedback-function library covers the RAG triad — groundedness (responses supported by retrieved content), context relevance (retrieved documents match the query), and answer relevance (responses address the user question) — plus coherence, comprehensiveness, toxicity, sentiment, fairness, and custom metrics. Integrates with OpenTelemetry traces and any agent framework.
    humanURL: https://www.trulens.org/
    baseURL: https://www.trulens.org
    tags:
      - Open Source
      - RAG Triad
      - Feedback Functions
      - Snowflake
      - OpenTelemetry
    properties:
      - type: Documentation
        url: https://www.trulens.org/
      - type: GitHubRepository
        url: https://github.com/truera/trulens

  - name: Weights and Biases Weave
    description: W&B Weave is a platform for evaluating, monitoring, and iterating on AI agents and applications, started with "one line of code." Weave Evaluations enable visual comparison of runs, automatic versioning of datasets and scorers, an interactive playground, and leaderboards. Scorers include pre-built ones (toxicity, hallucination), custom Python scoring functions, human feedback collection, and third-party scorers from providers such as RAGAS and LangChain.
    humanURL: https://wandb.ai/site/weave
    baseURL: https://api.wandb.ai
    tags:
      - Weights and Biases
      - Commercial
      - Scorers
      - Leaderboards
      - Human Feedback
    properties:
      - type: Documentation
        url: https://weave-docs.wandb.ai/
      - type: GitHubRepository
        url: https://github.com/wandb/weave
      - type: Portal
        url: https://wandb.ai

  - name: Ragas
    description: Ragas is an open-source evaluation library focused on retrieval-augmented generation, described in its own docs as "a library that helps you move from 'vibe checks' to systematic evaluation loops for your AI applications." It exposes LLM-driven metrics for RAG (faithfulness, context recall, answer relevancy), integrates with LangChain and LlamaIndex, and supports custom metric authoring as a complement to other eval platforms (Weave, LangSmith).
    humanURL: https://docs.ragas.io/
    baseURL: https://docs.ragas.io
    tags:
      - Open Source
      - RAG
      - Faithfulness
      - Context Recall
      - Library
    properties:
      - type: Documentation
        url: https://docs.ragas.io/en/stable/
      - type: GitHubRepository
        url: https://github.com/explodinggradients/ragas

  - name: MLflow LLM Evaluate
    description: MLflow LLM Evaluate extends MLflow's experiment tracking with mlflow.evaluate() support for LLM tasks. The API runs reference-based and reference-free metrics (toxicity, perplexity, BLEU, ROUGE, exact match, custom LLM judges) over a logged model or a function and persists results into MLflow's experiment store alongside traditional ML metrics. Sits inside the broader MLflow open-source project.
    humanURL: https://mlflow.org/
    baseURL: https://mlflow.org
    tags:
      - Open Source
      - MLflow
      - Experiment Tracking
      - LLM Judges
      - Apache
    properties:
      - type: Documentation
        url: https://mlflow.org/docs/latest/llms/llm-evaluate/index.html
      - type: GitHubRepository
        url: https://github.com/mlflow/mlflow

  - name: MMLU Benchmark
    description: MMLU (Measuring Massive Multitask Language Understanding) is a multiple-choice benchmark spanning 57 subjects from STEM and international law to nutrition and religion. It contains 15,908 multiple-choice questions (four options each), of which 1,540 are reserved for hyperparameter tuning. Per its overview, "It was one of the most commonly used benchmarks for comparing the capabilities of large language models, with over 100 million downloads as of July 2024."
    humanURL: https://github.com/hendrycks/test
    baseURL: https://github.com/hendrycks/test
    tags:
      - Benchmark
      - Knowledge
      - Multiple Choice
      - Multitask
      - Reference-Based
    properties:
      - type: GitHubRepository
        url: https://github.com/hendrycks/test
      - type: Paper
        url: https://arxiv.org/abs/2009.03300
      - type: Dataset
        url: https://huggingface.co/datasets/cais/mmlu

  - name: HumanEval Benchmark
    description: HumanEval is OpenAI's evaluation harness for code-generation models, described in its README as "an evaluation harness for the HumanEval problem solving dataset described in the paper 'Evaluating Large Language Models Trained on Code'." Functional correctness is measured by executing model-generated code against unit tests, reported as pass@1, pass@10, and pass@100 by default.
    humanURL: https://github.com/openai/human-eval
    baseURL: https://github.com/openai/human-eval
    tags:
      - Benchmark
      - Code Generation
      - Functional Correctness
      - Pass@k
      - Reference-Based
    properties:
      - type: GitHubRepository
        url: https://github.com/openai/human-eval
      - type: Paper
        url: https://arxiv.org/abs/2107.03374
      - type: Dataset
        url: https://huggingface.co/datasets/openai/openai_humaneval

  - name: GAIA Benchmark
    description: GAIA is "a benchmark for General AI Assistants," published in 2023 (arXiv 2311.12983). It tests general-purpose AI agent capability across reasoning, tool use, multi-modality, and web browsing, with a public leaderboard hosted on Hugging Face for community submissions. The benchmark has become a reference point for evaluating agentic systems that combine an LLM with tools and a browser.
    humanURL: https://huggingface.co/gaia-benchmark
    baseURL: https://huggingface.co/gaia-benchmark
    tags:
      - Benchmark
      - AI Agents
      - Reasoning
      - Tool Use
      - Leaderboard
    properties:
      - type: Dataset
        url: https://huggingface.co/datasets/gaia-benchmark/GAIA
      - type: Paper
        url: https://arxiv.org/abs/2311.12983
      - type: Leaderboard
        url: https://huggingface.co/spaces/gaia-benchmark/leaderboard

  - name: AgentBench
    description: AgentBench describes itself as the first benchmark designed to evaluate LLM-as-Agent across a diverse spectrum of environments. It bundles 8 environments, 5 newly created (Operating System, Database, Knowledge Graph, Digital Card Game, Lateral Thinking Puzzles) and 3 adapted (House-Holding from ALFWorld, Web Shopping from WebShop, Web Browsing from Mind2Web). The benchmark requires roughly 4,000 dev-set and 13,000 test-set interactions per model.
    humanURL: https://github.com/THUDM/AgentBench
    baseURL: https://github.com/THUDM/AgentBench
    tags:
      - Benchmark
      - AI Agents
      - Multi-Environment
      - LLM-as-Agent
      - Tsinghua
    properties:
      - type: GitHubRepository
        url: https://github.com/THUDM/AgentBench
      - type: Paper
        url: https://arxiv.org/abs/2308.03688
      - type: Leaderboard
        url: https://llmbench.ai/agent

  - name: BIG-Bench
    description: The Beyond the Imitation Game Benchmark (BIG-Bench) is "a collaborative benchmark intended to probe large language models and extrapolate their future capabilities." It contains more than 200 tasks across JSON-based simplified tasks and programmatic tasks; a curated subset (BIG-Bench Lite) of 24 tasks is provided as the canonical headline measurement. Maintained on GitHub by Google with open community task submissions.
    humanURL: https://github.com/google/BIG-bench
    baseURL: https://github.com/google/BIG-bench
    tags:
      - Benchmark
      - Collaborative
      - Multitask
      - Google
      - BIG-Bench Lite
    properties:
      - type: GitHubRepository
        url: https://github.com/google/BIG-bench
      - type: Paper
        url: https://arxiv.org/abs/2206.04615
      - type: Documentation
        url: https://github.com/google/BIG-bench/blob/main/README.md

common:
  - type: GitHubOrganization
    url: https://github.com/api-evangelist
  - type: JSONSchema
    url: https://raw.githubusercontent.com/api-evangelist/evals/refs/heads/main/json-schema/evals-eval-run-schema.json
    title: Eval Run Schema
  - type: JSONSchema
    url: https://raw.githubusercontent.com/api-evangelist/evals/refs/heads/main/json-schema/evals-eval-suite-schema.json
    title: Eval Suite Schema
  - type: JSONSchema
    url: https://raw.githubusercontent.com/api-evangelist/evals/refs/heads/main/json-schema/evals-eval-case-schema.json
    title: Eval Case Schema
  - type: JSONSchema
    url: https://raw.githubusercontent.com/api-evangelist/evals/refs/heads/main/json-schema/evals-scorer-schema.json
    title: Scorer Schema
  - type: JSONSchema
    url: https://raw.githubusercontent.com/api-evangelist/evals/refs/heads/main/json-schema/evals-judge-schema.json
    title: Judge Schema
  - type: JSONSchema
    url: https://raw.githubusercontent.com/api-evangelist/evals/refs/heads/main/json-schema/evals-dataset-schema.json
    title: Dataset Schema
  - type: JSONStructure
    url: https://raw.githubusercontent.com/api-evangelist/evals/refs/heads/main/json-structure/evals-eval-run-structure.json
    title: Eval Run Structure
  - type: JSONLD
    url: https://raw.githubusercontent.com/api-evangelist/evals/refs/heads/main/json-ld/evals-context.jsonld
  - type: Vocabulary
    url: https://raw.githubusercontent.com/api-evangelist/evals/refs/heads/main/vocabulary/evals-vocabulary.yml
  - type: Features
    data:
      - name: LLM-as-a-Judge Scoring
        description: A second LLM evaluates the output of the system-under-test, producing a numeric or categorical score and (optionally) a written rationale. The dominant scoring mode for free-form text outputs across Braintrust, LangSmith, DeepEval, Weave, TruLens, Phoenix, and Patronus.
      - name: Reference-Based Scoring
        description: Compares model output against a ground-truth expected answer using exact match, BLEU, ROUGE, embedding similarity, or task-specific equality (e.g. unit-test pass/fail). The native mode for benchmarks like MMLU, HumanEval, and GAIA.
      - name: Reference-Free Scoring
        description: Assesses output quality without ground truth — toxicity, coherence, faithfulness against retrieved context, criterion adherence. Enables online (production-traffic) evaluation where labels do not exist.
      - name: Pairwise Comparison
        description: A judge compares two candidate outputs, A and B, and picks a winner or declares a tie. Useful when absolute scoring is hard but relative preference is reliable. Surfaced explicitly by LangSmith and used widely in chatbot arenas.
      - name: Benchmark-Aligned Evaluation
        description: Runs the system-under-test against a standardized public dataset (MMLU, HumanEval, GAIA, AgentBench, BIG-Bench) to produce comparable headline scores. The basis of model leaderboards.
      - name: Human-Rated Scoring
        description: Domain experts or end users provide thumbs-up/down, Likert ratings, or written critiques. Used as ground truth, as a judge-calibration signal, and as a final acceptance gate before production.
      - name: RAG Triad
        description: Three feedback functions — groundedness, context relevance, answer relevance — codified by TruLens and widely adopted across Ragas, Phoenix, DeepEval, and LangSmith for evaluating retrieval-augmented generation pipelines.
      - name: Agent and Tool-Use Evaluation
        description: Evaluating multi-step agent trajectories — did the agent pick the right tool, did the tool call succeed, did the final answer satisfy the goal. Supported by Inspect AI, Galileo, Weave, LangSmith, Braintrust, and benchmarks like AgentBench and GAIA.
      - name: Online Production Monitoring
        description: Eval scorers (typically reference-free LLM judges and Luna-style distilled evaluators) attach to live traffic via tracing/observability layers (Phoenix, Arize, Helicone, Galileo, Weave) to flag regressions in real time.
      - name: Red-Team / Safety Evaluation
        description: Adversarial test suites probe jailbreaks, prompt injection, PII leakage, harmful content, and policy violations. First-class in Promptfoo, Patronus, Galileo, and Inspect AI's safety evals.
  - type: UseCases
    data:
      - name: Model Selection
        description: Run candidate models (GPT-5, Claude 4.7, Gemini 3, open-weight) against a shared eval suite to choose the best fit for a specific application by quality, cost, and latency.
      - name: Prompt Engineering Iteration
        description: Compare prompt variants in a matrix-style eval (Promptfoo, LangSmith experiments, Braintrust experiments) to pick the best prompt before shipping.
      - name: Regression Detection in CI/CD
        description: Wire an eval suite into CI so a pull request that drops a key scorer below a threshold fails the build, preventing quality regressions from reaching production.
      - name: RAG Pipeline Tuning
        description: Use RAG-triad scores (groundedness / context relevance / answer relevance) and faithfulness to tune chunking, embedding, reranking, and prompt choices.
      - name: Agent Trajectory Quality
        description: Score multi-step agent runs on tool-selection correctness, step efficiency, and final-answer faithfulness — the core measurement for production agentic apps.
      - name: Hallucination and Safety Guardrails
        description: Deploy dedicated judge models (Lynx, GLIDER, Luna) to flag hallucinations, toxic content, PII leakage, and policy violations in real time.
      - name: Frontier Capability and Safety Assessment
        description: Independent labs (UK AISI, US AISI) run capability and safety evaluations on frontier models before release, using frameworks like Inspect AI.
      - name: Public Leaderboard Reporting
        description: Submit a model's scores against MMLU, HumanEval, GAIA, AgentBench, and BIG-Bench to position it on community leaderboards and back marketing claims with reproducible numbers.
  - type: Integrations
    data:
      - name: OpenTelemetry
        description: Phoenix, TruLens, Weave, and most modern eval platforms ingest LLM traces via OpenTelemetry, making eval a layer on top of standard observability.
      - name: LangChain / LangGraph
        description: LangSmith is the native eval tier for LangChain/LangGraph apps; most other platforms also integrate.
      - name: LlamaIndex
        description: Ragas, DeepEval, and Phoenix integrate directly with LlamaIndex for RAG evaluation.
      - name: Hugging Face Datasets
        description: MMLU, HumanEval, and GAIA are distributed as Hugging Face datasets and consumed by many eval frameworks.
      - name: CI/CD (GitHub Actions, etc.)
        description: Braintrust, Promptfoo, LangSmith, and DeepEval ship CI integrations to fail builds on regression.
      - name: Snowflake
        description: OpenAI Evals can log eval results to Snowflake; TruLens (TruEra) is now part of Snowflake.

maintainers:
  - FN: Kin Lane
    email: [email protected]
