Apache Hudi
Apache Hudi is a data lake platform that provides incremental data processing primitives including upserts and incremental queries. It manages storage of large analytical datasets on distributed file systems with ACID transactions, timeline-based versioning, and integrations for Spark, Flink, and Hive.
APIs
Apache Hudi Timeline Server API
REST API for the Apache Hudi Timeline Server providing table timeline management, commit metadata inspection, and table administration for Hudi data lake tables.
Apache Hudi Java API
Java API for writing Hudi tables with upserts, inserts, and deletes, plus timeline management, compaction, and Spark/Flink DataSource integration APIs.
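As a concrete illustration of the write path, the sketch below assembles the core Hudi DataSource write options for an upsert. The option keys are standard Hudi write configs; the `build_upsert_options` helper and the `trips` table, fields, and path are illustrative assumptions, not part of the Hudi API.

```python
# Sketch: configuring an upsert write to a Hudi table through the Spark
# DataSource API. Option keys are standard Hudi write configs; the helper
# function and the example table/field names are illustrative.

def build_upsert_options(table_name, record_key, precombine_field, partition_field):
    """Assemble the core Hudi DataSource write options for an upsert."""
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.operation": "upsert",
        "hoodie.datasource.write.table.type": "COPY_ON_WRITE",
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.precombine.field": precombine_field,
        "hoodie.datasource.write.partitionpath.field": partition_field,
    }

opts = build_upsert_options("trips", "trip_id", "ts", "city")

# With a SparkSession in scope, the write itself would look like:
#   df.write.format("hudi").options(**opts).mode("append").save("s3://bucket/trips")
```

The precombine field resolves duplicate keys within a batch: when two incoming records share a record key, the one with the larger precombine value wins.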
Capabilities
Features
Atomically insert or update records in data lake tables with ACID guarantees using record keys.
Immutable commit timeline tracking all mutations for time travel, rollback, and incremental queries.
Query only the data changed since a given commit timestamp for efficient streaming ingestion.
Copy-on-Write (COW) tables rewrite the affected Parquet base files on upsert, keeping data in a read-optimized columnar layout.
Merge-on-Read (MOR) tables append updates to delta log files for fast writes, with periodic compaction merging logs into base files for read performance.
Built-in cleaning, compaction, clustering, and indexing services for table maintenance.
Read and write Hudi tables from Apache Spark, Flink, Hive, Presto, Trino, and Athena.
Support for adding, renaming, and dropping columns with backward-compatible schema evolution.
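The timeline and incremental-query features above can be sketched conceptually: the commit timeline is an ordered list of instants, and an incremental read returns only records committed after a given begin instant. This is a plain-Python model of the idea, not Hudi's implementation; the timestamps and records are made up.

```python
# Conceptual sketch of Hudi's incremental-query semantics. A real incremental
# read uses Spark read options such as:
#   hoodie.datasource.query.type = "incremental"
#   hoodie.datasource.read.begin.instanttime = "<commit timestamp>"

# Timeline: (instant time, records committed at that instant), in order.
timeline = [
    ("20240101120000", [{"id": 1, "fare": 10.0}]),
    ("20240102120000", [{"id": 2, "fare": 20.0}]),
    ("20240103120000", [{"id": 1, "fare": 12.5}]),  # upsert of id 1
]

def incremental_read(timeline, begin_instant):
    """Return all records from commits strictly after begin_instant."""
    return [rec for instant, recs in timeline if instant > begin_instant for rec in recs]

# Pull only what changed after the first commit.
changed = incremental_read(timeline, "20240101120000")
```

Because instant times are lexicographically ordered timestamps, a consumer can checkpoint the last instant it processed and resume from there, which is the basis of the incremental ETL pattern described below.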
Use Cases
Ingest change data capture (CDC) events from databases into data lake tables with upsert support.
Build near-real-time data lake pipelines with Spark Structured Streaming or Flink.
Manage storage costs with automated cleaning, compaction, and clustering of Hudi tables.
Build incremental ETL pipelines that process only changed data since the last run.
Implement GDPR right-to-erasure by deleting records from Hudi tables with delete operations.
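For the right-to-erasure case, setting the write operation to "delete" tells Hudi to remove the records whose keys appear in the incoming batch. The option keys below are standard Hudi write configs; the `users` table and `user_id` field are illustrative assumptions.

```python
# Sketch: record-level deletes for GDPR right-to-erasure. The incoming
# DataFrame only needs to carry the record keys to erase.

delete_opts = {
    "hoodie.table.name": "users",
    "hoodie.datasource.write.operation": "delete",
    "hoodie.datasource.write.recordkey.field": "user_id",
}

# With Spark, a DataFrame holding just the keys to erase would be written as:
#   keys_df.write.format("hudi").options(**delete_opts).mode("append").save(path)
```

The delete is recorded as a commit on the timeline like any other write, so it is atomic and visible to downstream incremental consumers.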
Integrations
Apache Spark: primary write and read engine via the Hudi DataSource and Spark SQL extensions.
Apache Flink: sink and source connectors for streaming writes and incremental reads.
Apache Hive: Metastore sync for making Hudi tables queryable from HiveQL.
Presto and Trino: native Hudi input format support for querying Hudi tables.
Amazon Athena: reads Hudi COW and MOR tables stored in Amazon S3.
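The Hive Metastore sync mentioned above is driven by write-time options: when enabled, Hudi registers (or updates) the table in the Metastore after each commit, making it visible to Hive, Presto, and Trino. The keys below are standard Hudi `hive_sync` configs; the database, table, and partition names are illustrative assumptions.

```python
# Sketch: Hive Metastore sync options appended to a Hudi write so the table
# is registered and kept up to date in the Metastore.

hive_sync_opts = {
    "hoodie.datasource.hive_sync.enable": "true",
    "hoodie.datasource.hive_sync.mode": "hms",          # talk to the Metastore directly
    "hoodie.datasource.hive_sync.database": "analytics",
    "hoodie.datasource.hive_sync.table": "trips",
    "hoodie.datasource.hive_sync.partition_fields": "city",
}

# These options are merged into the regular write options, e.g.:
#   df.write.format("hudi").options(**write_opts, **hive_sync_opts)...
```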