Apache Hudi
Apache Hudi is a data lake platform that provides incremental data processing primitives including upserts and incremental queries. It manages storage of large analytical datasets on distributed file systems with ACID transactions, timeline-based versioning, and integrations for Spark, Flink, and Hive.
APIs
Apache Hudi Timeline Server API
REST API for the Apache Hudi Timeline Server providing table timeline management, commit metadata inspection, and table administration for Hudi data lake tables.
Apache Hudi Java API
Java API for writing Hudi tables with upserts, inserts, and deletes, plus timeline management, compaction, and Spark/Flink DataSource integration APIs.
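As a concrete illustration of the write path, the sketch below assembles the core Hudi DataSource write options for an upsert. The option keys are standard Hudi write configs; the `build_upsert_options` helper and the `trips` table, fields, and path are illustrative assumptions, not part of the Hudi API.

```python
# Sketch: configuring an upsert write to a Hudi table through the Spark
# DataSource API. Option keys are standard Hudi write configs; the helper
# function and the example table/field names are illustrative.

def build_upsert_options(table_name, record_key, precombine_field, partition_field):
    """Assemble the core Hudi DataSource write options for an upsert."""
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.operation": "upsert",
        "hoodie.datasource.write.table.type": "COPY_ON_WRITE",
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.precombine.field": precombine_field,
        "hoodie.datasource.write.partitionpath.field": partition_field,
    }

opts = build_upsert_options("trips", "trip_id", "ts", "city")

# With a SparkSession in scope, the write itself would look like:
#   df.write.format("hudi").options(**opts).mode("append").save("s3://bucket/trips")
```

The precombine field resolves duplicate keys within a batch: when two incoming records share a record key, the one with the larger precombine value wins.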
Capabilities
Features
Atomically insert or update records in data lake tables with ACID guarantees using record keys.
Immutable commit timeline tracking all mutations for time travel, rollback, and incremental queries.
Query only the data changed since a given commit timestamp for efficient streaming ingestion.
Copy-on-Write (COW) tables rewrite the affected Parquet base files on upsert, keeping data in a read-optimized columnar layout.
Merge-on-Read (MOR) tables append updates to delta log files for fast writes, with periodic compaction merging logs into base files for read performance.
Built-in cleaning, compaction, clustering, and indexing services for table maintenance.
Read and write Hudi tables from Apache Spark, Flink, Hive, Presto, Trino, and Athena.
Support for adding, renaming, and dropping columns with backward-compatible schema evolution.
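The timeline and incremental-query features above can be sketched conceptually: the commit timeline is an ordered list of instants, and an incremental read returns only records committed after a given begin instant. This is a plain-Python model of the idea, not Hudi's implementation; the timestamps and records are made up.

```python
# Conceptual sketch of Hudi's incremental-query semantics. A real incremental
# read uses Spark read options such as:
#   hoodie.datasource.query.type = "incremental"
#   hoodie.datasource.read.begin.instanttime = "<commit timestamp>"

# Timeline: (instant time, records committed at that instant), in order.
timeline = [
    ("20240101120000", [{"id": 1, "fare": 10.0}]),
    ("20240102120000", [{"id": 2, "fare": 20.0}]),
    ("20240103120000", [{"id": 1, "fare": 12.5}]),  # upsert of id 1
]

def incremental_read(timeline, begin_instant):
    """Return all records from commits strictly after begin_instant."""
    return [rec for instant, recs in timeline if instant > begin_instant for rec in recs]

# Pull only what changed after the first commit.
changed = incremental_read(timeline, "20240101120000")
```

Because instant times are lexicographically ordered timestamps, a consumer can checkpoint the last instant it processed and resume from there, which is the basis of the incremental ETL pattern described below.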
Use Cases
Ingest change data capture (CDC) events from databases into data lake tables with upsert support.
Build near-real-time data lake pipelines with Spark Structured Streaming or Flink.
Manage storage costs with automated cleaning, compaction, and clustering of Hudi tables.
Build incremental ETL pipelines that process only changed data since the last run.
Implement GDPR right-to-erasure by deleting records from Hudi tables with delete operations.
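For the right-to-erasure case, setting the write operation to "delete" tells Hudi to remove the records whose keys appear in the incoming batch. The option keys below are standard Hudi write configs; the `users` table and `user_id` field are illustrative assumptions.

```python
# Sketch: record-level deletes for GDPR right-to-erasure. The incoming
# DataFrame only needs to carry the record keys to erase.

delete_opts = {
    "hoodie.table.name": "users",
    "hoodie.datasource.write.operation": "delete",
    "hoodie.datasource.write.recordkey.field": "user_id",
}

# With Spark, a DataFrame holding just the keys to erase would be written as:
#   keys_df.write.format("hudi").options(**delete_opts).mode("append").save(path)
```

The delete is recorded as a commit on the timeline like any other write, so it is atomic and visible to downstream incremental consumers.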
Integrations
Apache Spark: primary write and read engine via the Hudi DataSource and Spark SQL extensions.
Apache Flink: sink and source connectors for streaming writes and incremental reads.
Apache Hive: Metastore sync for making Hudi tables queryable from HiveQL.
Presto and Trino: native Hudi input format support for querying Hudi tables.
Amazon Athena: reads Hudi COW and MOR tables stored in Amazon S3.
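The Hive Metastore sync mentioned above is driven by write-time options: when enabled, Hudi registers (or updates) the table in the Metastore after each commit, making it visible to Hive, Presto, and Trino. The keys below are standard Hudi `hive_sync` configs; the database, table, and partition names are illustrative assumptions.

```python
# Sketch: Hive Metastore sync options appended to a Hudi write so the table
# is registered and kept up to date in the Metastore.

hive_sync_opts = {
    "hoodie.datasource.hive_sync.enable": "true",
    "hoodie.datasource.hive_sync.mode": "hms",          # talk to the Metastore directly
    "hoodie.datasource.hive_sync.database": "analytics",
    "hoodie.datasource.hive_sync.table": "trips",
    "hoodie.datasource.hive_sync.partition_fields": "city",
}

# These options are merged into the regular write options, e.g.:
#   df.write.format("hudi").options(**write_opts, **hive_sync_opts)...
```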