Apache Druid
Apache Druid is a high-performance, real-time analytics database governed by the Apache Software Foundation, designed for fast slice-and-dice OLAP queries on event-time data. It features a distributed, column-oriented storage engine with automatic rollup, supports both streaming (Kafka, Kinesis) and batch (S3, HDFS, local) data ingestion, and provides a SQL query interface plus a native JSON query API via REST. Druid is optimized for sub-second queries at petabyte scale with high concurrency.
APIs
Apache Druid REST API
Druid exposes REST APIs for Druid SQL (POST /druid/v2/sql), native JSON queries (POST /druid/v2), batch and streaming data ingestion tasks, supervisor management for Kafka/Kines...
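As a minimal sketch of how the two query endpoints named above might be called, the following Python builds POST requests for /druid/v2/sql and /druid/v2 using only the standard library. The base URL, the datasource name "events", and the helper names are assumptions made for illustration; they are not part of Druid itself.

```python
import json
import urllib.request

# Assumption: a Druid Router or Broker is reachable at this address
# (8888 is the Router port in the quickstart setup; adjust for your cluster).
DRUID_URL = "http://localhost:8888"


def build_sql_request(sql, base_url=DRUID_URL):
    """Build a POST request for the Druid SQL endpoint (/druid/v2/sql)."""
    body = json.dumps({"query": sql}).encode("utf-8")
    return urllib.request.Request(
        base_url + "/druid/v2/sql",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_native_request(query, base_url=DRUID_URL):
    """Build a POST request for the native JSON query endpoint (/druid/v2)."""
    body = json.dumps(query).encode("utf-8")
    return urllib.request.Request(
        base_url + "/druid/v2",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run(request, timeout=30):
    """Send a prepared request and decode the JSON response."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Hypothetical datasource "events"; replace with a real datasource name.
sql_request = build_sql_request(
    "SELECT COUNT(*) AS n FROM events "
    "WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' HOUR"
)

# The same kind of question expressed as a native Timeseries query.
native_request = build_native_request({
    "queryType": "timeseries",
    "dataSource": "events",
    "granularity": "hour",
    "intervals": ["2024-01-01/2024-01-02"],
    "aggregations": [{"type": "count", "name": "rows"}],
})
```

Calling run(sql_request) against a live cluster would send the query; the builders are kept separate from the network call so the payloads can be inspected or logged first.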
Features
Columnar storage with bitmap indexes, dictionary encoding, and pre-aggregation (rollup) enables sub-second queries on billions of events.
REST endpoint for submitting Druid SQL queries, with ANSI SQL support, time-based filtering, and streaming response options.
Druid-native query format (Timeseries, TopN, GroupBy, Scan, Search) for maximum control and performance.
Real-time data ingestion from Apache Kafka and Amazon Kinesis with supervisor-managed offset tracking and exactly-once semantics.
Parallel batch indexing tasks from local files, S3, GCS, HDFS, and other external storage systems.
Rollup pre-aggregates metrics at ingestion time to reduce storage and query time, and is configurable per datasource.
All data is partitioned by time into segments, enabling efficient pruning of time-range queries.
Query isolation and resource management via query lanes, scheduler priorities, and row-level access control.
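The rollup idea in the list above can be illustrated with a small, self-contained Python sketch. It is a conceptual model only, not Druid's implementation: it truncates each event timestamp to the hour and keeps one pre-aggregated row per combination of hour and dimension values. The event fields ("timestamp", "page", "bytes") and the function name are hypothetical.

```python
from collections import defaultdict
from datetime import datetime


def rollup(events, dimensions, metric):
    """Pre-aggregate events per (hour, dimension values), summing one metric."""
    totals = defaultdict(lambda: {"count": 0, "sum": 0})
    for event in events:
        # Truncate the timestamp to the hour (the chosen granularity here).
        ts = datetime.fromisoformat(event["timestamp"])
        bucket = ts.replace(minute=0, second=0, microsecond=0).isoformat()
        key = (bucket,) + tuple(event[d] for d in dimensions)
        totals[key]["count"] += 1
        totals[key]["sum"] += event[metric]
    return dict(totals)


# Hypothetical click events; field names are illustrative only.
events = [
    {"timestamp": "2024-01-01T10:05:00", "page": "/home", "bytes": 100},
    {"timestamp": "2024-01-01T10:40:00", "page": "/home", "bytes": 250},
    {"timestamp": "2024-01-01T11:02:00", "page": "/home", "bytes": 50},
]
rolled = rollup(events, dimensions=["page"], metric="bytes")
```

Three raw events collapse into two stored rows here, one per hour. The trade-off is that individual events within a bucket can no longer be distinguished after aggregation.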
Use Cases
Analyze click streams, IoT events, application logs, and user behavior data with sub-second query latency.
Power interactive BI dashboards with high-concurrency low-latency queries backed by Druid's columnar engine.
Ingest and analyze network flow data and security events in real time for threat detection and capacity planning.
Process advertising impression, click, and conversion events at high volume with real-time aggregation.
Monitor application performance metrics and operational data with drilldown and filtering capabilities.
Integrations
KafkaSupervisor for real-time continuous ingestion from Kafka topics into Druid datasources.
KinesisSupervisor for real-time data ingestion from AWS Kinesis data streams.
Hadoop-based batch indexing task for bulk loading data from HDFS or MapReduce job outputs.
Batch ingestion from object storage (S3, GCS, Azure Blob) using index tasks.
Druid-Hive integration for querying Druid datasources from HiveQL and performing joins.
Community-maintained Kubernetes operator (druid-operator) for deploying and managing Druid clusters on Kubernetes.
Imply provides a commercial managed Druid service with additional features and enterprise support.
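For the Kafka integration listed above, a supervisor is typically created by posting a JSON spec to the Overlord's supervisor endpoint (POST /druid/indexer/v1/supervisor). The Python below is a minimal sketch of such a spec. The datasource, topic, broker address, timestamp column, dimension names, and helper names are all assumptions for illustration; a real spec usually needs further tuning.

```python
import json
import urllib.request


def kafka_supervisor_spec(datasource, topic, bootstrap_servers, dimensions):
    """Return a minimal Kafka supervisor spec as a Python dict."""
    return {
        "type": "kafka",
        "spec": {
            "dataSchema": {
                "dataSource": datasource,
                # Assumption: events carry an ISO 8601 "timestamp" field.
                "timestampSpec": {"column": "timestamp", "format": "iso"},
                "dimensionsSpec": {"dimensions": list(dimensions)},
                "granularitySpec": {
                    "segmentGranularity": "hour",
                    "queryGranularity": "none",
                    "rollup": False,
                },
            },
            "ioConfig": {
                "topic": topic,
                "inputFormat": {"type": "json"},
                "consumerProperties": {"bootstrap.servers": bootstrap_servers},
                "useEarliestOffset": True,
            },
            "tuningConfig": {"type": "kafka"},
        },
    }


def build_supervisor_request(spec, base_url="http://localhost:8888"):
    """Build the POST request that would submit the spec (not sent here)."""
    return urllib.request.Request(
        base_url + "/druid/indexer/v1/supervisor",
        data=json.dumps(spec).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical names throughout; replace with values from your environment.
spec = kafka_supervisor_spec("events", "events-topic", "localhost:9092", ["page"])
request = build_supervisor_request(spec)
```

Passing the request to urllib.request.urlopen would submit it to a running cluster. Setting "rollup" to True in the granularitySpec, together with a metricsSpec, is how pre-aggregation would be enabled for the datasource.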