vLLM
vLLM is a high-throughput, memory-efficient open-source inference and serving engine for LLMs. It provides an OpenAI-compatible REST server (`vllm serve`) plus a Python API. vLLM is Apache 2.0-licensed and runs on your own GPU infrastructure; there is no hosted vLLM SaaS from the project itself.
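A minimal sketch of the offline Python API; the model ID and sampling values here are illustrative assumptions, not recommendations:

```python
from vllm import LLM, SamplingParams

# Load a model for offline, in-process inference. The model ID is a
# placeholder -- substitute any Hugging Face model that vLLM supports.
llm = LLM(model="facebook/opt-125m")

# Sampling parameters control decoding; these values are example choices.
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# generate() batches prompts and returns one RequestOutput per prompt.
outputs = llm.generate(["Explain paged attention in one sentence."], params)
for out in outputs:
    print(out.outputs[0].text)
```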
Tags: LLM · Inference · Open Source · GPU · OpenAI Compatible · Self-Hosted
APIs
vLLM OpenAI-Compatible Server
OpenAI-compatible REST API exposed by `vllm serve`. Endpoints include `/v1/chat/completions`, `/v1/completions`, `/v1/embeddings`, `/v1/score`, `/v1/audio/transcriptions`, `/v1/audio/trans...`
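For illustration, a client call against a locally launched server, assuming something like `vllm serve Qwen/Qwen2.5-0.5B-Instruct` is already running; the model name is an assumption, and vLLM listens on port 8000 by default:

```python
from openai import OpenAI

# Point the standard OpenAI client at the local vLLM server. The server
# ignores the API key unless one was configured with --api-key.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="Qwen/Qwen2.5-0.5B-Instruct",  # must match the model passed to `vllm serve`
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(resp.choices[0].message.content)
```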
Resources
- Website
- Developer Portal
- Open Source
- Plans
- Rate Limits
- FinOps