Home
NVIDIA NIM
NVIDIA NIM
NVIDIA NIM (NVIDIA Inference Microservices) is a catalog of GPU-accelerated, containerized AI inference microservices that package optimized model engines (TensorRT-LLM, vLLM, SGLang, Triton) behind industry-standard OpenAI-compatible REST APIs. NIM covers large language models, embeddings and reranking, vision-language models, speech (Riva), visual generative AI, and biology (BioNeMo) — exposed identically whether consumed from the hosted endpoint at integrate.api.nvidia.com or self-hosted via Docker containers and the Kubernetes-native NIM Operator. NIM ships with NVIDIA AI Enterprise for commercial deployment and is the inference layer underneath NVIDIA AI Blueprints, NeMo Retriever, NeMo Guardrails, and the broader NVIDIA developer stack.
10 APIs
11 Capabilities
19 Features
AI Artificial Intelligence Inference Microservices LLM Foundation Models GPU Kubernetes NVIDIA OpenAI Compatible
NVIDIA NIM publishes 10 APIs on the APIs.io network, including Chat Completions API, Completions API, Embeddings API, and 7 more. Tagged areas include AI, Artificial Intelligence, Inference, Microservices, and LLM.
The NVIDIA NIM catalog on APIs.io includes 11 machine-runnable capabilities and 1 JSON-LD context.
NVIDIA NIM’s developer surface includes developer portal, documentation, getting-started guide, signup flow, sandbox, pricing, engineering blog, and 29 more developer resources.
OpenAI-compatible chat completions endpoint exposing 100+ foundation models — Meta Llama, Mistral, Mixtral, NVIDIA Nemotron, DeepSeek, Qwen, Microsoft Phi, Google Gemma, IBM Gra...
Legacy OpenAI-compatible text completion endpoint (/v1/completions) for non-chat foundation models served by NIM. Accepts a raw prompt and returns generated text with the same s...
OpenAI-compatible embeddings endpoint (/v1/embeddings) backed by NVIDIA NeMo Retriever text embedding models including NV-Embed, NV-EmbedQA-E5, llama-3.2-nv-embedqa-1b, and BAAI...
NeMo Retriever cross-encoder reranking endpoint (/v1/ranking) for scoring candidate passages against a query. Improves retrieval relevance on RAG pipelines and supports the llam...
OpenAI-compatible model catalog endpoint (/v1/models) returning the list of models served by the NIM endpoint or container. Each entry includes id, owned_by, and created timesta...
Vision-language model inference through the standard /v1/chat/completions surface with image inputs (base64 or URL) in the messages payload. Supports NVIDIA NeVA, microsoft/kosm...
Liveness, readiness, and startup probes exposed by self-hosted NIM containers (/v1/health/live, /v1/health/ready) and a Prometheus /v1/metrics scrape endpoint for GPU utilizatio...
Visual generative AI endpoints for text-to-image, image-to-image, and image editing using models such as Black Forest Labs FLUX.1, Stable Diffusion XL, Shutterstock-trained mode...
NVIDIA Riva-powered speech NIMs delivering automatic speech recognition (Parakeet, Canary), neural machine translation, and text-to-speech (Magpie-TTS, FastPitch) through HTTP a...
BioNeMo NIMs for protein structure prediction (AlphaFold2, ESMFold, OpenFold), protein generation (ProtGPT2, RFDiffusion), molecular property prediction (MolMIM), small molecule...
Run Capabilities with Naftiko — Deploy and orchestrate these API capabilities using Naftiko Fleet.
Run with Naftiko
BioNeMo NIMs — protein structure, ligand docking, and molecule generation.
Run with Naftiko
NVIDIA NIM Chat Completions — OpenAI-compatible chat completions across 100+ foundation models.
Run with Naftiko
NeMo Retriever text embeddings via OpenAI-compatible /v1/embeddings.
Run with Naftiko
Liveness, readiness, and Prometheus metrics endpoints for self-hosted NIM containers.
Run with Naftiko
Vision-language model inference via /v1/chat/completions with image inputs.
Run with Naftiko
Run Capabilities with Naftiko — Deploy and orchestrate these API capabilities using Naftiko Fleet.
Run with Naftiko
OpenAI-compatible REST surface — /v1/chat/completions, /v1/completions, /v1/embeddings, /v1/models, /v1/ranking
100+ foundation models exposed through a single API contract — Llama 3.1/3.2/3.3, Mistral, Mixtral, NVIDIA Nemotron, DeepSeek-R1, Qwen 2.5, Microsoft Phi, Google Gemma, IBM Granite, and Falcon
Free hosted inference at build.nvidia.com on DGX Cloud — 1,000 credits on signup, 40 RPM rate limit
Self-hosted deployment via Docker containers shipping TensorRT-LLM, vLLM, or SGLang inference engines
Kubernetes-native deployment via the NIM Operator with NIMService, NIMCache, NIMPipeline CRDs
GPU-aware autoscaling, persistent model caches, and rolling upgrades managed by the operator
Multi-tenant licensing through NVIDIA AI Enterprise (commercial production use)
NeMo Retriever NIMs for embeddings, reranking, OCR, and PDF-to-Markdown extraction in RAG pipelines
Vision Language Model NIMs reusing the chat-completions surface for multimodal inputs
NVIDIA Riva speech NIMs (Parakeet ASR, Canary translation, Magpie TTS) with HTTP and gRPC adapters
BioNeMo NIMs for AlphaFold2, ESMFold, ProtGPT2, MolMIM, DiffDock, RFDiffusion
Visual generative AI NIMs — FLUX.1, SDXL, Shutterstock Edify Image, Edify 3D
NeMo Guardrails for input/output safety and topic policy enforcement
Function calling, JSON mode, tool use, and structured outputs across compatible LLMs
Streaming via Server-Sent Events on chat/completions
Prometheus /v1/metrics scrape endpoint and /v1/health/{live,ready} probes for Kubernetes
LangChain, LlamaIndex, Haystack, OpenAI SDK, and direct REST client compatibility
NVIDIA AI Blueprints — full reference RAG, multimodal search, drug discovery, and digital human stacks
Available on DGX Cloud, AWS, Azure, Google Cloud, Oracle Cloud, GKE, EKS, AKS, OpenShift, and on-prem
40 classes · 10 properties
JSON-LD
Sources
aid: nvidia-nim
url: https://raw.githubusercontent.com/api-evangelist/nvidia-nim/refs/heads/main/apis.yml
apis:
- aid: nvidia-nim:nvidia-nim-chat-completions-api
name: NVIDIA NIM Chat Completions API
tags:
- AI
- Artificial Intelligence
- Chat
- Completions
- LLM
- OpenAI Compatible
humanURL: https://docs.api.nvidia.com/nim/reference/llm-apis
baseURL: https://integrate.api.nvidia.com/v1
properties:
- url: https://docs.api.nvidia.com/nim/reference/llm-apis
type: Documentation
- url: https://docs.nvidia.com/nim/large-language-models/latest/api-reference.html
type: Documentation
- url: openapi/nvidia-nim-chat-completions-api-openapi.yml
type: OpenAPI
- url: json-schema/nvidia-nim-chat-completion-schema.json
type: JSONSchema
- url: json-ld/nvidia-nim-context.jsonld
type: JSONLD
- type: NaftikoCapability
url: capabilities/chat-completions-chat.yaml
description: OpenAI-compatible chat completions endpoint exposing 100+ foundation models — Meta Llama, Mistral,
Mixtral, NVIDIA Nemotron, DeepSeek, Qwen, Microsoft Phi, Google Gemma, IBM Granite, and more — through a single
/v1/chat/completions surface. Supports streaming, tool/function calling, structured outputs, vision inputs on
multimodal models, and the standard temperature/top_p/max_tokens parameters. Switching models is a one-line
change to the model string. Available hosted on integrate.api.nvidia.com or self-hosted via NIM containers on
any GPU.
- aid: nvidia-nim:nvidia-nim-completions-api
name: NVIDIA NIM Completions API
tags:
- AI
- Artificial Intelligence
- Completions
- LLM
- OpenAI Compatible
humanURL: https://docs.nvidia.com/nim/large-language-models/latest/api-reference.html
baseURL: https://integrate.api.nvidia.com/v1
properties:
- url: https://docs.nvidia.com/nim/large-language-models/latest/api-reference.html
type: Documentation
- url: openapi/nvidia-nim-completions-api-openapi.yml
type: OpenAPI
- type: NaftikoCapability
url: capabilities/completions-completions.yaml
description: Legacy OpenAI-compatible text completion endpoint (/v1/completions) for non-chat foundation models
served by NIM. Accepts a raw prompt and returns generated text with the same streaming, sampling, and
stopping-criterion controls as the chat endpoint.
- aid: nvidia-nim:nvidia-nim-embeddings-api
name: NVIDIA NIM Embeddings API
tags:
- AI
- Artificial Intelligence
- Embeddings
- Retrieval
- RAG
- OpenAI Compatible
humanURL: https://docs.nvidia.com/nim/nemo-retriever/text-embedding/latest/api-reference.html
baseURL: https://integrate.api.nvidia.com/v1
properties:
- url: https://docs.nvidia.com/nim/nemo-retriever/text-embedding/latest/api-reference.html
type: Documentation
- url: openapi/nvidia-nim-embeddings-api-openapi.yml
type: OpenAPI
- url: json-schema/nvidia-nim-embedding-schema.json
type: JSONSchema
- type: NaftikoCapability
url: capabilities/embeddings-embeddings.yaml
description: OpenAI-compatible embeddings endpoint (/v1/embeddings) backed by NVIDIA NeMo Retriever text
embedding models including NV-Embed, NV-EmbedQA-E5, llama-3.2-nv-embedqa-1b, and BAAI BGE-M3. Returns dense
float vectors for documents or queries to power RAG, semantic search, and clustering. Supports
`input_type=passage|query` for asymmetric retrieval and the standard `dimensions` parameter on models that
permit dimension reduction.
- aid: nvidia-nim:nvidia-nim-reranking-api
name: NVIDIA NIM Reranking API
tags:
- AI
- Artificial Intelligence
- Reranking
- Retrieval
- RAG
humanURL: https://docs.nvidia.com/nim/nemo-retriever/text-reranking/latest/api-reference.html
baseURL: https://integrate.api.nvidia.com/v1
properties:
- url: https://docs.nvidia.com/nim/nemo-retriever/text-reranking/latest/api-reference.html
type: Documentation
- url: openapi/nvidia-nim-reranking-api-openapi.yml
type: OpenAPI
- type: NaftikoCapability
url: capabilities/reranking-reranking.yaml
description: NeMo Retriever cross-encoder reranking endpoint (/v1/ranking) for scoring candidate passages
against a query. Improves retrieval relevance on RAG pipelines and supports the llama-3.2-nv-rerankqa-1b and
NV-RerankQA-Mistral-4B-v3 models. Accepts a query plus a list of passages and returns a sorted list of
relevance scores.
- aid: nvidia-nim:nvidia-nim-models-api
name: NVIDIA NIM Models API
tags:
- AI
- Artificial Intelligence
- Models
- Catalog
- OpenAI Compatible
humanURL: https://docs.nvidia.com/nim/large-language-models/latest/api-reference.html
baseURL: https://integrate.api.nvidia.com/v1
properties:
- url: https://docs.nvidia.com/nim/large-language-models/latest/api-reference.html
type: Documentation
- url: openapi/nvidia-nim-models-api-openapi.yml
type: OpenAPI
- type: NaftikoCapability
url: capabilities/models-models.yaml
description: OpenAI-compatible model catalog endpoint (/v1/models) returning the list of models served by the
NIM endpoint or container. Each entry includes id, owned_by, and created timestamp. Used by clients to
discover the model name strings to pass to chat-completions / completions / embeddings.
- aid: nvidia-nim:nvidia-nim-vision-api
name: NVIDIA NIM Vision Language Models API
tags:
- AI
- Artificial Intelligence
- Vision
- Multimodal
- VLM
humanURL: https://docs.api.nvidia.com/nim/reference/vlm-apis
baseURL: https://integrate.api.nvidia.com/v1
properties:
- url: https://docs.api.nvidia.com/nim/reference/vlm-apis
type: Documentation
- url: openapi/nvidia-nim-vision-api-openapi.yml
type: OpenAPI
- type: NaftikoCapability
url: capabilities/vision-vision.yaml
description: Vision-language model inference through the standard /v1/chat/completions surface with image
inputs (base64 or URL) in the messages payload. Supports NVIDIA NeVA, microsoft/kosmos-2, Phi-3-vision,
llama-3.2-90b-vision-instruct, and other VLMs hosted in the NIM catalog.
- aid: nvidia-nim:nvidia-nim-health-api
name: NVIDIA NIM Health API
tags:
- Health
- Observability
- Kubernetes
humanURL: https://docs.nvidia.com/nim/large-language-models/latest/observability.html
properties:
- url: https://docs.nvidia.com/nim/large-language-models/latest/observability.html
type: Documentation
- url: openapi/nvidia-nim-health-api-openapi.yml
type: OpenAPI
- type: NaftikoCapability
url: capabilities/health-health.yaml
description: Liveness, readiness, and startup probes exposed by self-hosted NIM containers (/v1/health/live,
/v1/health/ready) and a Prometheus /v1/metrics scrape endpoint for GPU utilization, request latency, and
queue depth. Drives Kubernetes pod lifecycle and HPA scaling via the NIM Operator.
- aid: nvidia-nim:nvidia-nim-image-generation-api
name: NVIDIA NIM Image Generation API
tags:
- AI
- Artificial Intelligence
- Image Generation
- Visual
humanURL: https://docs.api.nvidia.com/nim/reference/visual-models
baseURL: https://integrate.api.nvidia.com/v1
properties:
- url: https://docs.api.nvidia.com/nim/reference/visual-models
type: Documentation
- url: openapi/nvidia-nim-image-generation-api-openapi.yml
type: OpenAPI
- type: NaftikoCapability
url: capabilities/image-generation-images.yaml
description: Visual generative AI endpoints for text-to-image, image-to-image, and image editing using models
such as Black Forest Labs FLUX.1, Stable Diffusion XL, Shutterstock-trained models, and NVIDIA-curated
Edify Image. Returns base64-encoded PNG/JPEG artifacts.
- aid: nvidia-nim:nvidia-nim-speech-api
name: NVIDIA NIM Speech API
tags:
- AI
- Artificial Intelligence
- Speech
- ASR
- TTS
- Riva
humanURL: https://docs.nvidia.com/nim/riva/latest/index.html
baseURL: https://integrate.api.nvidia.com/v1
properties:
- url: https://docs.nvidia.com/nim/riva/latest/index.html
type: Documentation
- url: openapi/nvidia-nim-speech-api-openapi.yml
type: OpenAPI
- type: NaftikoCapability
url: capabilities/speech-asr.yaml
- type: NaftikoCapability
url: capabilities/speech-tts.yaml
description: NVIDIA Riva-powered speech NIMs delivering automatic speech recognition (Parakeet, Canary),
neural machine translation, and text-to-speech (Magpie-TTS, FastPitch) through HTTP and gRPC surfaces.
Hosted endpoints accept WAV/FLAC audio and return transcripts or synthesized speech.
- aid: nvidia-nim:nvidia-nim-biology-api
name: NVIDIA NIM Biology (BioNeMo) API
tags:
- AI
- Biology
- BioNeMo
- Drug Discovery
- Healthcare
humanURL: https://docs.nvidia.com/nim/bionemo/latest/index.html
baseURL: https://integrate.api.nvidia.com/v1
properties:
- url: https://docs.nvidia.com/nim/bionemo/latest/index.html
type: Documentation
- url: openapi/nvidia-nim-biology-api-openapi.yml
type: OpenAPI
- type: NaftikoCapability
url: capabilities/biology-bionemo.yaml
description: BioNeMo NIMs for protein structure prediction (AlphaFold2, ESMFold, OpenFold), protein
generation (ProtGPT2, RFDiffusion), molecular property prediction (MolMIM), small molecule generation,
and molecular docking (DiffDock). Each model is a containerized microservice with the same OpenAPI surface.
name: NVIDIA NIM
tags:
- AI
- Artificial Intelligence
- Inference
- Microservices
- LLM
- Foundation Models
- GPU
- Kubernetes
- NVIDIA
- OpenAI Compatible
kind: contract
image: https://kinlane-productions2.s3.amazonaws.com/apis-json/apis-json-logo.jpg
access: 3rd-Party
common:
- type: Portal
url: https://build.nvidia.com
- type: Documentation
url: https://docs.nvidia.com/nim/index.html
- type: Documentation
url: https://docs.api.nvidia.com/nim/reference/llm-apis
- type: Documentation
url: https://developer.nvidia.com/nim
- type: GettingStarted
url: https://docs.nvidia.com/nim/large-language-models/latest/getting-started.html
- type: SignUp
url: https://build.nvidia.com/explore/discover
- type: Sandbox
url: https://build.nvidia.com/explore/discover
- type: Pricing
url: https://www.nvidia.com/en-us/data-center/products/ai-enterprise/
- type: GitHubOrganization
url: https://github.com/NVIDIA
- type: GitHubOrganization
url: https://github.com/NVIDIA-NIM-Agent-Blueprints
- type: StatusPage
url: https://status.nvidia.com
- type: Blog
url: https://developer.nvidia.com/blog/category/generative-ai/
- type: Blog
url: https://blogs.nvidia.com/blog/category/generative-ai/
- type: Forum
url: https://forums.developer.nvidia.com/c/ai-data-science/nemo-llm-service/
- type: TrustCenter
url: https://www.nvidia.com/en-us/about-nvidia/legal-info/
- type: TermsOfService
url: https://www.nvidia.com/en-us/about-nvidia/terms-of-service/
- type: PrivacyPolicy
url: https://www.nvidia.com/en-us/about-nvidia/privacy-policy/
- type: Documentation
url: https://docs.nvidia.com/nim-operator/latest/index.html
name: NIM Operator Documentation
- type: SDK
url: https://github.com/NVIDIA/nim-deploy
name: NIM Deploy (Helm Charts and Reference Implementations)
- type: SDK
url: https://github.com/NVIDIA/k8s-nim-operator
name: Kubernetes NIM Operator
- type: SDK
url: https://github.com/NVIDIA/GenerativeAIExamples
name: Generative AI Examples
- type: SDK
url: https://github.com/NVIDIA/NeMo
name: NeMo Toolkit
- type: SDK
url: https://github.com/NVIDIA/NeMo-Guardrails
name: NeMo Guardrails
- type: SDK
url: https://github.com/NVIDIA/TensorRT-LLM
name: TensorRT-LLM
- type: SDK
url: https://github.com/triton-inference-server/server
name: Triton Inference Server
- type: SDK
url: https://github.com/langchain-ai/langchain-nvidia
name: LangChain NVIDIA AI Endpoints
- type: SDK
url: https://pypi.org/project/openai/
name: OpenAI Python SDK (compatible)
- type: CodeExamples
url: https://github.com/NVIDIA/GenerativeAIExamples
name: NVIDIA Generative AI Examples
- type: CodeExamples
url: https://github.com/NVIDIA-AI-Blueprints
name: NVIDIA AI Blueprints
- type: Models
url: https://build.nvidia.com/explore/discover
- type: KubernetesCRD
url: https://github.com/NVIDIA/k8s-nim-operator/tree/main/api
name: NIMService / NIMCache / NIMPipeline CRDs
- type: RateLimits
url: https://docs.api.nvidia.com/nim/reference/limits
- type: Versioning
url: https://docs.nvidia.com/nim/large-language-models/latest/release-notes.html
- url: plans/nvidia-nim-plans-pricing.yml
type: Plans
- url: rate-limits/nvidia-nim-rate-limits.yml
type: RateLimits
- url: finops/nvidia-nim-finops.yml
type: FinOps
- type: Features
data:
- OpenAI-compatible REST surface — /v1/chat/completions, /v1/completions, /v1/embeddings, /v1/models, /v1/ranking
- 100+ foundation models exposed through a single API contract — Llama 3.1/3.2/3.3, Mistral, Mixtral, NVIDIA
Nemotron, DeepSeek-R1, Qwen 2.5, Microsoft Phi, Google Gemma, IBM Granite, and Falcon
- Free hosted inference at build.nvidia.com on DGX Cloud — 1,000 credits on signup, 40 RPM rate limit
- Self-hosted deployment via Docker containers shipping TensorRT-LLM, vLLM, or SGLang inference engines
- Kubernetes-native deployment via the NIM Operator with NIMService, NIMCache, NIMPipeline CRDs
- GPU-aware autoscaling, persistent model caches, and rolling upgrades managed by the operator
- Multi-tenant licensing through NVIDIA AI Enterprise (commercial production use)
- NeMo Retriever NIMs for embeddings, reranking, OCR, and PDF-to-Markdown extraction in RAG pipelines
- Vision Language Model NIMs reusing the chat-completions surface for multimodal inputs
- NVIDIA Riva speech NIMs (Parakeet ASR, Canary translation, Magpie TTS) with HTTP and gRPC adapters
- BioNeMo NIMs for AlphaFold2, ESMFold, ProtGPT2, MolMIM, DiffDock, RFDiffusion
- Visual generative AI NIMs — FLUX.1, SDXL, Shutterstock Edify Image, Edify 3D
- NeMo Guardrails for input/output safety and topic policy enforcement
- Function calling, JSON mode, tool use, and structured outputs across compatible LLMs
- Streaming via Server-Sent Events on chat/completions
- Prometheus /v1/metrics scrape endpoint and /v1/health/{live,ready} probes for Kubernetes
- LangChain, LlamaIndex, Haystack, OpenAI SDK, and direct REST client compatibility
- NVIDIA AI Blueprints — full reference RAG, multimodal search, drug discovery, and digital human stacks
- Available on DGX Cloud, AWS, Azure, Google Cloud, Oracle Cloud, GKE, EKS, AKS, OpenShift, and on-prem
sources:
- https://build.nvidia.com
- https://docs.nvidia.com/nim/index.html
- https://docs.api.nvidia.com/nim/reference/llm-apis
- https://www.nvidia.com/en-us/ai-data-science/products/nim-microservices/
- https://github.com/NVIDIA/k8s-nim-operator
- https://github.com/NVIDIA/nim-deploy
updated: '2026-05-25'
created: '2026-05-25'
modified: '2026-05-25'
position: Consuming
description: NVIDIA NIM (NVIDIA Inference Microservices) is a catalog of GPU-accelerated, containerized
AI inference microservices that package optimized model engines (TensorRT-LLM, vLLM, SGLang, Triton) behind
industry-standard OpenAI-compatible REST APIs. NIM covers large language models, embeddings and reranking,
vision-language models, speech (Riva), visual generative AI, and biology (BioNeMo) — exposed identically
whether consumed from the hosted endpoint at integrate.api.nvidia.com or self-hosted via Docker containers
and the Kubernetes-native NIM Operator. NIM ships with NVIDIA AI Enterprise for commercial deployment and
is the inference layer underneath NVIDIA AI Blueprints, NeMo Retriever, NeMo Guardrails, and the broader
NVIDIA developer stack.
maintainers:
- FN: Kin Lane
email: [email protected]
X: apievangelist
url: https://apievangelist.com
specificationVersion: '0.16'