Scalable Inference Serving
A collection of APIs, frameworks, and platforms for scalable machine learning model inference serving, deployment, and management. This includes the KServe Open Inference Protocol (the CNCF standard for model serving on Kubernetes), BentoML (developer packaging and serving), vLLM (high-throughput LLM inference), NVIDIA Triton Inference Server, and supporting observability and registry tools. KServe recently joined CNCF as an incubating project (November 2025).
Included capability: a workflow for ML engineers and data scientists performing model inference operations, health monitoring, and metadata inspection against OIP-compliant inference servers.
Sources
name: Scalable Inference Serving
description: >-
A collection of APIs, frameworks, and platforms for scalable machine learning
model inference serving, deployment, and management. This includes the KServe
Open Inference Protocol (the CNCF standard for model serving on Kubernetes),
BentoML (developer packaging and serving), vLLM (high-throughput LLM inference),
NVIDIA Triton Inference Server, and supporting observability and registry tools.
KServe recently joined CNCF as an incubating project (November 2025).
image: https://kserve.github.io/website/images/KServe.png
url: https://raw.githubusercontent.com/api-evangelist/scalable-inference-serving/refs/heads/main/apis.yml
created: '2024-01-01'
modified: '2026-05-02'
specificationVersion: '0.18'
tags:
- AI
- CNCF
- Deployment
- Inference
- Kubernetes
- LLM
- Machine Learning
- Model Serving
- MLOps
- Scalability
apis:
- name: KServe Open Inference Protocol API
description: >-
KServe implements the Open Inference Protocol (OIP), also known as the KServe
V2 Inference Protocol, which provides a standardized REST and gRPC interface
for model inference across frameworks. KServe is a standardized distributed
generative and predictive AI inference platform for scalable, multi-framework
deployment on Kubernetes. CNCF incubating project since November 2025.
Supports TensorFlow, PyTorch, scikit-learn, XGBoost, ONNX, vLLM, and HuggingFace.
image: https://kserve.github.io/website/images/KServe.png
humanUrl: https://kserve.github.io/website/
baseUrl: https://inference.kserve.example.com
tags:
- CNCF
- Inference
- Kubernetes
- Model Serving
- Open Inference Protocol
- Open Source
properties:
- type: Documentation
url: https://kserve.github.io/website/docs/intro
- type: OpenAPI
url: https://raw.githubusercontent.com/api-evangelist/scalable-inference-serving/main/openapi/kserve-open-inference-protocol-openapi.yml
- type: GitHub
url: https://github.com/kserve/kserve
- type: Changelog
url: https://github.com/kserve/kserve/releases
- type: Getting Started
url: https://kserve.github.io/website/docs/get_started/
- type: SwaggerUI
url: https://kserve.github.io/website/latest/reference/swagger-ui/
contact:
- type: Slack
url: https://kubernetes.slack.com/archives/CH6E58LNP
- type: GitHub Issues
url: https://github.com/kserve/kserve/issues
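# A minimal OIP v2 inference call, sketched in Python with requests — the model
# name, input shape, and data below are illustrative assumptions, not part of
# the protocol:
#
#   import requests
#
#   resp = requests.post(
#       "https://inference.kserve.example.com/v2/models/example-model/infer",
#       json={"inputs": [{"name": "input-0", "shape": [1, 4],
#                         "datatype": "FP32", "data": [0.1, 0.2, 0.3, 0.4]}]},
#   )
#   print(resp.json()["outputs"])  # OIP responses return an "outputs" array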
- name: BentoML REST API
description: >-
BentoML is an open-source unified inference platform for deploying and scaling
AI models. It auto-generates RESTful APIs from Python service definitions,
provides built-in OpenAPI/Swagger documentation, supports adaptive batching,
and integrates with KServe for Kubernetes deployment. BentoML 1.0 introduced
the Runner abstraction for parallelizing inference workloads with adaptive
batching and independent scaling of pre/post-processing from model inference.
image: https://www.bentoml.com/favicon.ico
humanUrl: https://www.bentoml.com/
baseUrl: https://api.bentoml.example.com
tags:
- Batching
- Inference
- Model Serving
- Open Source
- Python
- REST API
properties:
- type: Documentation
url: https://docs.bentoml.com/en/latest/
- type: GitHub
url: https://github.com/bentoml/BentoML
- type: Getting Started
url: https://docs.bentoml.com/en/latest/get-started/quickstart.html
- type: Pricing
url: https://www.bentoml.com/pricing
- type: API Reference
url: https://docs.bentoml.com/en/latest/reference/index.html
contact:
- type: Community
url: https://l.bentoml.com/join-slack
- type: GitHub Issues
url: https://github.com/bentoml/BentoML/issues
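# A sketch of the Python service definition BentoML generates a REST API from,
# assuming the BentoML 1.2+ decorator style — class and method names are
# illustrative:
#
#   import bentoml
#
#   @bentoml.service
#   class Echo:
#       @bentoml.api
#       def predict(self, text: str) -> str:
#           # BentoML derives a POST /predict route and OpenAPI docs from this
#           return text.upper()
#
# Running `bentoml serve` against this file exposes the route plus Swagger UI.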
- name: vLLM OpenAI-Compatible API
description: >-
vLLM is a high-throughput and memory-efficient inference engine for LLMs,
implementing PagedAttention for efficient KV cache management. vLLM exposes
an OpenAI-compatible REST API allowing seamless migration from OpenAI endpoints.
vLLM also integrates with KServe via LLMInferenceService and llm-d for
production-grade distributed LLM inference, and powers large-scale LLM deployments in production.
image: https://docs.vllm.ai/en/stable/_static/logo/vllm-logo-text-light.png
humanUrl: https://docs.vllm.ai/
baseUrl: https://vllm.example.com/v1
tags:
- GPU
- Inference
- KV Cache
- LLM
- Model Serving
- Open Source
- OpenAI-Compatible
properties:
- type: Documentation
url: https://docs.vllm.ai/en/stable/
- type: GitHub
url: https://github.com/vllm-project/vllm
- type: API Reference
url: https://docs.vllm.ai/en/stable/serving/openai_compatible_server.html
- type: Changelog
url: https://github.com/vllm-project/vllm/releases
contact:
- type: GitHub Issues
url: https://github.com/vllm-project/vllm/issues
- type: Slack
url: https://vllm-dev.slack.com/
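# Because the server is OpenAI-compatible, the standard openai Python client
# works by pointing base_url at the vLLM endpoint — the model name is an
# illustrative assumption:
#
#   from openai import OpenAI
#
#   client = OpenAI(base_url="https://vllm.example.com/v1", api_key="EMPTY")
#   out = client.chat.completions.create(
#       model="meta-llama/Llama-3.1-8B-Instruct",
#       messages=[{"role": "user", "content": "Explain PagedAttention briefly."}],
#   )
#   print(out.choices[0].message.content)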
- name: NVIDIA Triton Inference Server HTTP API
description: >-
NVIDIA Triton Inference Server is an open-source inference serving software
that implements the KServe Open Inference Protocol (V2). Supports TensorRT,
ONNX, TensorFlow, PyTorch, and Python backends. Provides dynamic batching,
model ensembles, the Model Analyzer tool, and GPU/CPU inference. Used extensively in
production ML pipelines requiring maximum throughput.
image: https://developer.nvidia.com/favicon.ico
humanUrl: https://developer.nvidia.com/triton-inference-server
baseUrl: https://triton.example.com
tags:
- GPU
- Inference
- Model Serving
- NVIDIA
- Open Source
- TensorRT
- Triton
properties:
- type: Documentation
url: https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/
- type: GitHub
url: https://github.com/triton-inference-server/server
- type: Getting Started
url: https://github.com/triton-inference-server/tutorials
- type: API Reference
url: https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/customization_guide/inference_protocols.html
contact:
- type: GitHub Issues
url: https://github.com/triton-inference-server/server/issues
- type: Forums
url: https://forums.developer.nvidia.com/c/ai-data-science/deep-learning/triton-inference-server/
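# Triton exposes the same OIP V2 REST surface, so readiness and model metadata
# are plain HTTP calls — sketched in Python; the model name is illustrative:
#
#   import requests
#
#   base = "https://triton.example.com"
#   assert requests.get(f"{base}/v2/health/ready").status_code == 200
#   meta = requests.get(f"{base}/v2/models/example-model").json()
#   print(meta["inputs"], meta["outputs"])  # per-model tensor signatures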
- name: MLflow Model Registry REST API
description: >-
MLflow is an open source platform for managing the ML lifecycle, including
experiment tracking, reproducibility, and deployment. The MLflow REST API
manages experiments, runs, metrics, parameters, artifacts, and the Model
Registry for versioning and staging model deployments. A Linux Foundation
project, commonly paired with KServe for model lifecycle management.
image: https://mlflow.org/favicon.ico
humanUrl: https://mlflow.org/
baseUrl: https://mlflow.example.com/api/2.0
tags:
- Experiment Tracking
- Machine Learning
- Model Registry
- MLOps
- Open Source
- Versioning
properties:
- type: Documentation
url: https://mlflow.org/docs/latest/rest-api.html
- type: GitHub
url: https://github.com/mlflow/mlflow
- type: Getting Started
url: https://mlflow.org/docs/latest/getting-started/intro-quickstart/
- type: API Reference
url: https://mlflow.org/docs/latest/rest-api.html
contact:
- type: Community
url: https://github.com/mlflow/mlflow/discussions
- type: GitHub Issues
url: https://github.com/mlflow/mlflow/issues
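# A sketch of registering a model through the Model Registry REST API — the
# model name is illustrative, and a real server will typically also require
# authentication:
#
#   import requests
#
#   base = "https://mlflow.example.com/api/2.0"
#   resp = requests.post(f"{base}/mlflow/registered-models/create",
#                        json={"name": "churn-classifier"})
#   print(resp.json()["registered_model"]["name"])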
- name: Ray Serve REST API
description: >-
Ray Serve is a scalable model serving library built on Ray, designed for
building online inference APIs. Supports composable deployments, autoscaling,
HTTP ingress, gRPC, WebSockets, and request batching. Integrates with any ML
framework. The Ray Serve dashboard and REST API manage deployments, replicas,
routes, and application status.
image: https://www.ray.io/favicon.ico
humanUrl: https://docs.ray.io/en/latest/serve/index.html
baseUrl: https://ray-serve.example.com
tags:
- Autoscaling
- Inference
- Machine Learning
- Model Serving
- Open Source
- Python
- Ray
properties:
- type: Documentation
url: https://docs.ray.io/en/latest/serve/index.html
- type: GitHub
url: https://github.com/ray-project/ray
- type: Getting Started
url: https://docs.ray.io/en/latest/serve/getting_started.html
- type: API Reference
url: https://docs.ray.io/en/latest/serve/api/index.html
contact:
- type: Community
url: https://discuss.ray.io/
- type: GitHub Issues
url: https://github.com/ray-project/ray/issues
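# A minimal Ray Serve deployment sketch — the class name and replica count are
# illustrative; serve.run() starts a local Ray instance if none is attached:
#
#   from ray import serve
#
#   @serve.deployment(num_replicas=2)
#   class Hello:
#       async def __call__(self, request):
#           return {"message": "ok"}
#
#   serve.run(Hello.bind())  # HTTP ingress listens on port 8000 by default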
common:
- type: Authentication
url: https://kserve.github.io/website/docs/intro
- type: Getting Started
url: https://kserve.github.io/website/docs/get_started/
- type: GitHub Organization
url: https://github.com/kserve
- type: CNCF Landscape
url: https://landscape.cncf.io/card-mode?project=incubating
- type: Blog
url: https://kserve.github.io/website/blog/
- type: OpenAPI
url: https://raw.githubusercontent.com/api-evangelist/scalable-inference-serving/main/openapi/kserve-open-inference-protocol-openapi.yml
- type: SpectralRuleset
url: https://raw.githubusercontent.com/api-evangelist/scalable-inference-serving/main/rules/kserve-open-inference-protocol-rules.yml
- type: NaftikoCapability
url: https://raw.githubusercontent.com/api-evangelist/scalable-inference-serving/main/capabilities/model-inference-operations.yaml
- type: JSONSchema
url: https://raw.githubusercontent.com/api-evangelist/scalable-inference-serving/main/json-schema/kserve-inference-request-schema.json
- type: JSONSchema
url: https://raw.githubusercontent.com/api-evangelist/scalable-inference-serving/main/json-schema/kserve-model-metadata-schema.json
- type: JSONLd
url: https://raw.githubusercontent.com/api-evangelist/scalable-inference-serving/main/json-ld/scalable-inference-serving-context.jsonld
- type: Vocabulary
url: https://raw.githubusercontent.com/api-evangelist/scalable-inference-serving/main/vocabulary/scalable-inference-serving-vocabulary.yml
maintainers:
- name: API Evangelist
email: [email protected]
url: https://apievangelist.com