Apache Beam
Apache Beam is a unified, open-source programming model developed by the Apache Software Foundation for defining both batch and streaming data processing pipelines. It provides a portable API layer that lets developers write pipeline logic once in Java, Python, or Go and deploy it to multiple execution engines (runners) including Apache Flink, Apache Spark, Google Cloud Dataflow, and the direct runner for local testing. The Beam portability framework enables cross-language pipelines and runner-agnostic execution.
APIs
Apache Beam SDK
The Apache Beam SDK provides the programming model for constructing data processing pipelines. Available in Java, Python, and Go, it exposes the core abstractions of PCollections (immutable distributed datasets), PTransforms (operations applied to them), and Runners that execute the resulting pipeline graph.
Apache Beam Job Service API
The Beam Job Service API provides a gRPC-based interface for submitting, managing, and monitoring Apache Beam pipeline jobs on supported runners. It is part of the Beam portability framework.
Features
Single programming model for both batch and streaming data processing with consistent semantics.
Write pipeline logic once and execute on Apache Flink, Spark, Google Dataflow, Samza, or the local direct runner.
Native SDKs for Java, Python, and Go with cross-language transform support for mixing languages.
Flexible windowing (fixed, sliding, session, global) and trigger strategies for streaming data processing.
Built-in connectors for BigQuery, Kafka, Pub/Sub, GCS, HDFS, databases, and many other sources and sinks.
SQL-based data processing on Beam PCollections using Apache Calcite for query planning.
RunInference transform for integrating ML model inference into Beam pipelines with TensorFlow, PyTorch, and sklearn.
Schema inference and typed PCollections for structured data processing with automatic serialization.
Call Java transforms from Python pipelines and vice versa via the Beam portability framework.
Built-in metrics API and integration with runner-specific monitoring dashboards.
Use Cases
Extract, transform, and load data between storage systems using portable, reusable pipeline components.
Process high-throughput event streams with low-latency windowing and triggering strategies.
Compute aggregate statistics, joins, and group-by operations on large historical datasets.
Run ML model inference in distributed pipelines using the RunInference transform.
Parse, filter, and enrich log events from Kafka or Pub/Sub for operational analytics.
Migrate data between cloud providers and storage systems using Beam's portable I/O connectors.
Integrations
Google Cloud Dataflow runner for managed Beam execution on Google Cloud with autoscaling and monitoring.
Apache Flink runner for stateful stream processing with exactly-once semantics.
Apache Spark runner for batch and streaming processing on Spark clusters.
Kafka I/O connector for reading and writing Kafka topics in Beam pipelines.
BigQuery I/O connector for reading and writing BigQuery tables in Beam pipelines.
HDFS I/O connector for reading and writing files on Hadoop HDFS.
TFX uses Beam as the runtime for ML data validation and preprocessing components.