OctoAI logo

OctoAI

OctoAI (formerly OctoML) was a Seattle-based AI inference platform founded in 2019 as a University of Washington Allen School spin-out of the Apache TVM project. The company originally focused on machine-learning model optimization and compilation across CPUs, GPUs, and accelerators, and in June 2023 launched a generative-AI SaaS inference platform that served open-source foundation models (Llama 2, Mixtral, SDXL, Stable Diffusion, Whisper) behind OpenAI-style REST APIs with Python and TypeScript SDKs. In January 2024 OctoML formally rebranded to OctoAI and in April 2024 unveiled OctoStack, a self-contained generative-AI production stack for deploying models inside customer VPC and on-premises environments across NVIDIA GPUs, AMD GPUs, and AWS Inferentia. NVIDIA acquired OctoAI in September 2024 for a reported $165M (down from a 2021 peak valuation of ~$900M), with CEO Luis Ceze and key staff joining NVIDIA. OctoAI sent customers a "Wind down of OctoAI Services" notice and terminated all hosted endpoints, accounts, and SDK access on 31 October 2024. The octo.ai domain now 301-redirects to nvidia.com and no public OctoAI product, API, dashboard, or developer portal remains; the technology has been absorbed into NVIDIA's internal AI inference stack and is not separately purchasable. This catalog entry is a historical record of the former OctoAI developer surface and the GitHub artifacts that remain.

5 APIs 6 Features
AcquiredDefunctAI InferenceGenerative AILLMFoundation ModelsModel OptimizationApache TVMGPUPrivate AINVIDIA

OctoAI publishes 5 APIs on the APIs.io network. Tagged areas include Acquired, Defunct, AI Inference, Generative AI, and LLM.

OctoAI’s developer surface includes SDKs and 7 more developer resources.

APIs

OctoAI Text Gen Inference API

OpenAI-compatible chat and text-completion endpoints serving open-source LLMs including Llama 2, Llama 3, Mixtral 8x7B, Mistral 7B, Code Llama, and customer fine-tunes. Supporte...

OctoAI Image Gen Inference API

Text-to-image and image-to-image inference for SDXL, SDXL-Lightning, Stable Diffusion 1.5, and SSD-1B with ControlNet, LoRA, and adapter support, plus inpainting and asset-manag...

OctoAI Asset Library API

Endpoints for uploading, listing, and managing user assets — checkpoints, LoRAs, textual inversions, ControlNets, and VAE files — used by the image and text inference APIs. The ...

OctoAI Compute Service API

Container-deployment API ("Compute Service") that let customers build, register, and serve their own custom model containers on OctoAI's managed GPU fleet, with autoscaling and ...

OctoStack

OctoStack was OctoAI's self-contained generative-AI production stack for deploying open and customer-trained foundation models inside a customer's VPC or on-premises environment...

Features

OpenAI-Compatible Inference

OctoAI's text and image endpoints implemented OpenAI-style request and response shapes so existing OpenAI client code could be repointed by changing the base URL and API key.

Open-Source Model Catalog

A shared catalog hosted Llama 2/3, Mixtral, Mistral, Code Llama, SDXL, SSD-1B, Stable Diffusion 1.5, and Whisper behind per-token and per-image pricing without GPU provisioning.

Custom Model Compute Service

Customers could package their own model containers and have OctoAI autoscale them on a managed GPU fleet, billed by GPU-second.

Asset Library

Upload and manage LoRAs, checkpoints, textual inversions, VAEs, and ControlNets and apply them at request time to image and text-generation endpoints.

OctoStack Private Deployment

Self-contained inference stack that ran inside a customer VPC or on-premises across NVIDIA, AMD, and AWS Inferentia hardware with fine-tuning, batching, and asset management built in.

TVM-Based Model Optimization

OctoAI's optimization pipeline descended from Apache TVM (created by founder Tianqi Chen) and used ML-guided compilation to improve throughput and latency across heterogeneous accelerators.

Use Cases

Repointing OpenAI Workloads to Open Models

Teams used the OpenAI-compatible endpoints to swap GPT-3.5/4 calls for Llama 2 / Mixtral at lower cost without rewriting client code.

Generative Image Pipelines

Product, marketing, and creative teams ran SDXL-based image generation with custom LoRAs and ControlNets for branded asset production.

Private Generative AI in Regulated Industries

Healthcare, financial-services, and government customers deployed OctoStack in-VPC or on-premises to keep prompts, completions, and model weights inside their security boundary.

Custom Fine-Tune Hosting

Teams fine-tuned open-weights models and served the resulting adapters and full-weight checkpoints behind OctoAI inference endpoints without managing GPU infrastructure.

Integrations

NVIDIA

Acquired OctoAI in September 2024 for a reported $165M; OctoAI team and technology absorbed into NVIDIA's AI inference stack and all OctoAI hosted services terminated on 31 October 2024.

Apache TVM

OctoAI's optimization stack originated from Apache TVM, the deep-learning compiler founded by OctoAI co-founder Tianqi Chen at the University of Washington.

AWS

OctoAI was an AWS Partner; OctoStack ran on AWS GPU instances and AWS Inferentia accelerators, with sagemaker-examples published in the GitHub org.

Docker

OctoAI ran a DockerCon 2023 generative-AI workshop and published the dockercon23-octoai workshop repo.

LangChain & LlamaIndex

OctoAI's LLM endpoints shipped with documented LangChain and LlamaIndex providers, demonstrated in the octoml-llm-qa sample repo.

Resources

🔗
Website
Website
👥
GitHubOrganization
GitHubOrganization
🔗
Acquirer
Acquirer
🔗
AcquisitionAnnouncement
AcquisitionAnnouncement
🔗
WindDownNotice
WindDownNotice
🔗
Crunchbase
Crunchbase
🔗
LinkedIn
LinkedIn
🔗
SuccessorOrganization
SuccessorOrganization

Sources

apis.yml Raw ↑
aid: octoai
name: OctoAI
description: >-
  OctoAI (formerly OctoML) was a Seattle-based AI inference platform founded
  in 2019 as a University of Washington Allen School spin-out of the Apache
  TVM project. The company originally focused on machine-learning model
  optimization and compilation across CPUs, GPUs, and accelerators, and in
  June 2023 launched a generative-AI SaaS inference platform that served
  open-source foundation models (Llama 2, Mixtral, SDXL, Stable Diffusion,
  Whisper) behind OpenAI-style REST APIs with Python and TypeScript SDKs.
  In January 2024 OctoML formally rebranded to OctoAI and in April 2024
  unveiled OctoStack, a self-contained generative-AI production stack for
  deploying models inside customer VPC and on-premises environments across
  NVIDIA GPUs, AMD GPUs, and AWS Inferentia. NVIDIA acquired OctoAI in
  September 2024 for a reported $165M (down from a 2021 peak valuation of
  ~$900M), with CEO Luis Ceze and key staff joining NVIDIA. OctoAI sent
  customers a "Wind down of OctoAI Services" notice and terminated all
  hosted endpoints, accounts, and SDK access on 31 October 2024. The
  octo.ai domain now 301-redirects to nvidia.com and no public OctoAI
  product, API, dashboard, or developer portal remains; the technology has
  been absorbed into NVIDIA's internal AI inference stack and is not
  separately purchasable. This catalog entry is a historical record of the
  former OctoAI developer surface and the GitHub artifacts that remain.
type: Index
image: https://kinlane-productions.s3.amazonaws.com/apis-json/apis-json-logo.jpg
tags:
  - Acquired
  - Defunct
  - AI Inference
  - Generative AI
  - LLM
  - Foundation Models
  - Model Optimization
  - Apache TVM
  - GPU
  - Private AI
  - NVIDIA
url: https://raw.githubusercontent.com/api-evangelist/octoai/refs/heads/main/apis.yml
created: '2026-05-25'
modified: '2026-05-25'
specificationVersion: '0.20'
apis:
  - aid: octoai:octoai-text-gen-api
    name: OctoAI Text Gen Inference API
    description: >-
      OpenAI-compatible chat and text-completion endpoints serving open-source
      LLMs including Llama 2, Llama 3, Mixtral 8x7B, Mistral 7B, Code Llama,
      and customer fine-tunes. Supported streaming, function calling, JSON mode,
      and a shared model catalog. The API was reachable at
      https://text.octoai.run/v1 and shut down on 31 October 2024.
    humanURL: https://octo.ai
    baseURL: https://text.octoai.run/v1
    tags:
      - LLM
      - Chat
      - Completions
      - OpenAI Compatible
      - Defunct
    properties:
      - type: StatusPage
        url: https://octo.ai
        description: Domain now 301-redirects to nvidia.com; service terminated 31 October 2024.
  - aid: octoai:octoai-image-gen-api
    name: OctoAI Image Gen Inference API
    description: >-
      Text-to-image and image-to-image inference for SDXL, SDXL-Lightning,
      Stable Diffusion 1.5, and SSD-1B with ControlNet, LoRA, and adapter
      support, plus inpainting and asset-management endpoints. The API was
      reachable at https://image.octoai.run and shut down on 31 October 2024.
    humanURL: https://octo.ai
    baseURL: https://image.octoai.run
    tags:
      - Images
      - Diffusion
      - SDXL
      - ControlNet
      - Defunct
    properties:
      - type: StatusPage
        url: https://octo.ai
        description: Domain now 301-redirects to nvidia.com; service terminated 31 October 2024.
  - aid: octoai:octoai-asset-library-api
    name: OctoAI Asset Library API
    description: >-
      Endpoints for uploading, listing, and managing user assets — checkpoints,
      LoRAs, textual inversions, ControlNets, and VAE files — used by the image
      and text inference APIs. The API was reachable under api.octoai.cloud and
      shut down on 31 October 2024.
    humanURL: https://octo.ai
    baseURL: https://api.octoai.cloud
    tags:
      - Assets
      - LoRA
      - Checkpoints
      - Defunct
    properties:
      - type: StatusPage
        url: https://octo.ai
        description: Domain now 301-redirects to nvidia.com; service terminated 31 October 2024.
  - aid: octoai:octoai-compute-service-api
    name: OctoAI Compute Service API
    description: >-
      Container-deployment API ("Compute Service") that let customers build,
      register, and serve their own custom model containers on OctoAI's
      managed GPU fleet, with autoscaling and OpenAI-style invocation. Shut
      down on 31 October 2024.
    humanURL: https://octo.ai
    baseURL: https://api.octoai.cloud
    tags:
      - Compute
      - Containers
      - Custom Models
      - Deployment
      - Defunct
    properties:
      - type: StatusPage
        url: https://octo.ai
        description: Domain now 301-redirects to nvidia.com; service terminated 31 October 2024.
  - aid: octoai:octostack
    name: OctoStack
    description: >-
      OctoStack was OctoAI's self-contained generative-AI production stack
      for deploying open and customer-trained foundation models inside a
      customer's VPC or on-premises environment. Announced April 2024, it
      supported NVIDIA GPUs, AMD GPUs, and AWS Inferentia, claimed 4x
      better GPU utilization, and bundled high-utilization batching,
      fine-tuning, and asset management. OctoStack is no longer offered as
      a standalone product after the NVIDIA acquisition; its technology has
      been absorbed into NVIDIA's inference stack.
    humanURL: https://octo.ai
    tags:
      - Private AI
      - On-Prem
      - VPC
      - Inference
      - Defunct
    properties:
      - type: StatusPage
        url: https://octo.ai
        description: Product wound down after NVIDIA acquisition; absorbed into NVIDIA's inference stack.
common:
  - type: Website
    url: https://octo.ai
  - type: GitHubOrganization
    url: https://github.com/octoml
  - type: Acquirer
    url: https://www.nvidia.com
  - type: AcquisitionAnnouncement
    url: https://www.geekwire.com/2024/chip-giant-nvidia-acquires-octoai-a-seattle-startup-that-helps-companies-run-ai-models/
  - type: WindDownNotice
    url: https://www.sunsethq.com/blog/octoai-acquisition
  - type: Crunchbase
    url: https://www.crunchbase.com/organization/octoml
  - type: LinkedIn
    url: https://www.linkedin.com/company/octoml
  - type: Features
    data:
      - name: OpenAI-Compatible Inference
        description: >-
          OctoAI's text and image endpoints implemented OpenAI-style request
          and response shapes so existing OpenAI client code could be
          repointed by changing the base URL and API key.
      - name: Open-Source Model Catalog
        description: >-
          A shared catalog hosted Llama 2/3, Mixtral, Mistral, Code Llama,
          SDXL, SSD-1B, Stable Diffusion 1.5, and Whisper behind per-token
          and per-image pricing without GPU provisioning.
      - name: Custom Model Compute Service
        description: >-
          Customers could package their own model containers and have OctoAI
          autoscale them on a managed GPU fleet, billed by GPU-second.
      - name: Asset Library
        description: >-
          Upload and manage LoRAs, checkpoints, textual inversions, VAEs,
          and ControlNets and apply them at request time to image and
          text-generation endpoints.
      - name: OctoStack Private Deployment
        description: >-
          Self-contained inference stack that ran inside a customer VPC or
          on-premises across NVIDIA, AMD, and AWS Inferentia hardware with
          fine-tuning, batching, and asset management built in.
      - name: TVM-Based Model Optimization
        description: >-
          OctoAI's optimization pipeline descended from Apache TVM (created
          by founder Tianqi Chen) and used ML-guided compilation to improve
          throughput and latency across heterogeneous accelerators.
  - type: UseCases
    data:
      - name: Repointing OpenAI Workloads to Open Models
        description: >-
          Teams used the OpenAI-compatible endpoints to swap GPT-3.5/4 calls
          for Llama 2 / Mixtral at lower cost without rewriting client code.
      - name: Generative Image Pipelines
        description: >-
          Product, marketing, and creative teams ran SDXL-based image
          generation with custom LoRAs and ControlNets for branded asset
          production.
      - name: Private Generative AI in Regulated Industries
        description: >-
          Healthcare, financial-services, and government customers deployed
          OctoStack in-VPC or on-premises to keep prompts, completions, and
          model weights inside their security boundary.
      - name: Custom Fine-Tune Hosting
        description: >-
          Teams fine-tuned open-weights models and served the resulting
          adapters and full-weight checkpoints behind OctoAI inference
          endpoints without managing GPU infrastructure.
  - type: Integrations
    data:
      - name: NVIDIA
        description: >-
          Acquired OctoAI in September 2024 for a reported $165M; OctoAI
          team and technology absorbed into NVIDIA's AI inference stack and
          all OctoAI hosted services terminated on 31 October 2024.
      - name: Apache TVM
        description: >-
          OctoAI's optimization stack originated from Apache TVM, the
          deep-learning compiler founded by OctoAI co-founder Tianqi Chen at
          the University of Washington.
      - name: AWS
        description: >-
          OctoAI was an AWS Partner; OctoStack ran on AWS GPU instances and
          AWS Inferentia accelerators, with sagemaker-examples published in
          the GitHub org.
      - name: Docker
        description: >-
          OctoAI ran a DockerCon 2023 generative-AI workshop and published
          the dockercon23-octoai workshop repo.
      - name: LangChain & LlamaIndex
        description: >-
          OctoAI's LLM endpoints shipped with documented LangChain and
          LlamaIndex providers, demonstrated in the octoml-llm-qa sample
          repo.
  - type: SDK
    data:
      - name: Python SDK
        description: >-
          octoai-python-sdk — Python client for the OctoAI inference,
          asset-library, and compute-service APIs. Package and repo were
          retired alongside the service shutdown on 31 October 2024.
      - name: TypeScript SDK
        description: >-
          octoai-typescript-sdk — TypeScript / Node.js client for the
          OctoAI inference and asset APIs. Retired alongside the service
          shutdown on 31 October 2024.
  - type: SuccessorOrganization
    url: https://www.nvidia.com
maintainers:
  - FN: Kin Lane
    email: [email protected]