
Apache Hudi

Apache Hudi is a data lake platform that provides incremental data processing primitives including upserts and incremental queries. It manages storage of large analytical datasets on distributed file systems with ACID transactions, timeline-based versioning, and integrations for Spark, Flink, and Hive.

Tags: ACID · Apache · Big Data · Data Lake · Incremental Processing · Lakehouse · Open Source

APIs

Apache Hudi Timeline Server API

REST API for the Apache Hudi Timeline Server providing table timeline management, commit metadata inspection, and table administration for Hudi data lake tables.

Apache Hudi Java API

Java API for writing Hudi tables with upserts, inserts, and deletes, plus timeline management, compaction, and Spark/Flink DataSource integration APIs.

Features

ACID Upserts

Atomically insert or update records in data lake tables with ACID guarantees using record keys.
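
The upsert path can be sketched as a configuration for the Hudi Spark DataSource writer. The option keys below are documented Hudi settings; the table, key, and field names (`orders`, `order_id`, `ts`, `region`) are illustrative assumptions.

```python
# A minimal sketch of Hudi upsert options for the Spark DataSource writer.
# Option keys are documented Hudi settings; table/field names are assumptions.
hudi_upsert_options = {
    "hoodie.table.name": "orders",                            # assumed table name
    "hoodie.datasource.write.recordkey.field": "order_id",    # key used to match updates
    "hoodie.datasource.write.precombine.field": "ts",         # latest value wins on key collision
    "hoodie.datasource.write.partitionpath.field": "region",  # assumed partition column
    "hoodie.datasource.write.operation": "upsert",            # insert-or-update by record key
}

# With a SparkSession `spark` and a DataFrame `df` (not shown), the write would be:
#   df.write.format("hudi").options(**hudi_upsert_options) \
#     .mode("append").save("s3://bucket/orders")
```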

Hudi Timeline

Immutable commit timeline tracking all mutations for time travel, rollback, and incremental queries.

Incremental Queries

Query only the data changed since a given commit timestamp for efficient streaming ingestion.
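
An incremental query can be sketched with two documented read options: set the query type to incremental and supply the last processed commit instant as the checkpoint. The instant-time value below is an assumption.

```python
# Sketch of an incremental read: pull only records committed after a checkpoint.
# Keys are documented Hudi read options; the instant-time value is an assumption.
incremental_read_options = {
    "hoodie.datasource.query.type": "incremental",
    # Commit instant (yyyyMMddHHmmssSSS) to read from, exclusive; assumed value.
    "hoodie.datasource.read.begin.instanttime": "20240101000000000",
}
# spark.read.format("hudi").options(**incremental_read_options).load(path)
```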

Copy-On-Write Tables

COW table type rewrites entire Parquet files on upsert for read-optimized query performance.

Merge-On-Read Tables

MOR table type appends delta logs for fast writes with compaction-based read optimization.
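
The choice between the two table types above is made at write time with a single documented option; a sketch:

```python
# Sketch: selecting the Hudi table type via the documented write option.
cow = {"hoodie.datasource.write.table.type": "COPY_ON_WRITE"}  # rewrite Parquet files on upsert
mor = {"hoodie.datasource.write.table.type": "MERGE_ON_READ"}  # append delta logs, compact later
```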

Table Services

Built-in cleaning, compaction, clustering, and indexing services for table maintenance.
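
These services can run inline with writes; a sketch of the relevant settings, using documented Hudi option keys with illustrative (not recommended) values:

```python
# Sketch of inline table-service settings. Keys are documented Hudi options;
# the numeric values are illustrative assumptions, not tuning advice.
table_service_options = {
    "hoodie.cleaner.commits.retained": "10",         # keep file versions for the last 10 commits
    "hoodie.compact.inline": "true",                 # MOR: compact delta logs as part of writes
    "hoodie.compact.inline.max.delta.commits": "5",  # trigger compaction every 5 delta commits
    "hoodie.clustering.inline": "true",              # rewrite small files into larger sorted ones
}
```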

Multi-Engine Support

Write Hudi tables from Apache Spark and Flink; query them from Hive, Presto, Trino, and Athena.

Schema Evolution

Support for adding, renaming, and dropping columns with backward-compatible schema evolution.
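
Schema changes are typically issued as Spark SQL DDL against the Hudi table; the statements below are a sketch, and the table and column names are assumptions.

```python
# Sketch of schema evolution via Spark SQL DDL on a Hudi table.
# Table/column names are assumptions. Column renames generally require
# schema-on-read evolution to be enabled (hoodie.schema.on.read.enable=true).
add_column_sql = "ALTER TABLE hudi_orders ADD COLUMNS (discount DOUBLE)"
rename_column_sql = "ALTER TABLE hudi_orders RENAME COLUMN discount TO rebate"
# spark.sql(add_column_sql)
```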

Use Cases

CDC Pipeline Ingestion

Ingest change data capture (CDC) events from databases into data lake tables with upsert support.

Streaming Data Lake

Build near-real-time data lake pipelines with Spark Structured Streaming or Flink.

Data Lake Maintenance

Manage storage costs with automated cleaning, compaction, and clustering of Hudi tables.

Incremental ETL

Build incremental ETL pipelines that process only changed data since the last run.

Regulatory Data Retention

Implement GDPR right-to-erasure by deleting records from Hudi tables with delete operations.
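
A hard delete for right-to-erasure can be sketched by writing the keys to remove with the documented `delete` operation; the table and field names here are assumptions.

```python
# Sketch of a hard delete: write a DataFrame of record keys with the
# documented "delete" operation. Table/field names are assumptions.
delete_options = {
    "hoodie.table.name": "users",                          # assumed table name
    "hoodie.datasource.write.recordkey.field": "user_id",  # key identifying rows to erase
    "hoodie.datasource.write.operation": "delete",         # removes matching record keys
}
# df_of_keys.write.format("hudi").options(**delete_options).mode("append").save(path)
```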

Integrations

Apache Spark

Primary write and read engine with Hudi DataSource and Spark SQL extensions.

Apache Flink

Flink sink and source connectors for streaming writes and incremental reads.

Apache Hive

Hive Metastore sync for making Hudi tables queryable from HiveQL.
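
Metastore sync is configured on the write path; a sketch using documented Hudi sync options, where the database, table, and partition names are assumptions:

```python
# Sketch of Hive Metastore sync options. Keys are documented Hudi settings;
# database/table/partition names are assumptions.
hive_sync_options = {
    "hoodie.datasource.hive_sync.enable": "true",
    "hoodie.datasource.hive_sync.mode": "hms",            # sync directly via the Hive Metastore
    "hoodie.datasource.hive_sync.database": "analytics",  # assumed Hive database
    "hoodie.datasource.hive_sync.table": "orders",        # assumed table name
    "hoodie.datasource.hive_sync.partition_fields": "region",
}
```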

Presto / Trino

Native Hudi input format support for querying Hudi tables from Presto and Trino.

AWS Athena

Athena supports reading Hudi COW and MOR tables stored in Amazon S3.

Semantic Vocabularies

Apache Hudi Timeline Context

25 classes · 0 properties

JSON-LD

API Governance Rules

Apache Hudi API Rules

8 rules · 2 errors · 5 warnings · 1 info

SPECTRAL

Resources

Documentation
Getting Started
GitHub Organization
GitHub Repository
Spectral Rules
Vocabulary
Naftiko Capability