Apache Oozie
Apache Oozie is a workflow scheduler system for managing Apache Hadoop jobs. It enables users to define workflows as directed acyclic graphs (DAGs) of actions, including MapReduce, Pig, Hive, Sqoop, and custom Java/shell steps. Coordinator jobs trigger workflows based on time schedules or data availability, while bundle jobs group multiple coordinators. Oozie provides a REST API for job submission, lifecycle management, monitoring, and system administration. It is written in Java and governed by the Apache Software Foundation under the Apache License 2.0.
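As a concrete illustration of the DAG model, here is a minimal sketch of a workflow definition parsed with Python's standard `xml.etree`. The workflow XML follows the standard Oozie `workflow-app` schema (start, action with ok/error transitions, kill, end); the workflow name, action name, and `cleanup.sh` script are hypothetical placeholders, not taken from a real deployment.

```python
import xml.etree.ElementTree as ET

# Minimal, hypothetical Oozie workflow: one shell action followed by the
# required kill/end nodes. ${jobTracker}/${nameNode} are resolved by Oozie
# from job properties at run time, not by this script.
WORKFLOW_XML = """\
<workflow-app xmlns="uri:oozie:workflow:0.5" name="demo-wf">
  <start to="cleanup-step"/>
  <action name="cleanup-step">
    <shell xmlns="uri:oozie:shell-action:0.3">
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <exec>cleanup.sh</exec>
    </shell>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>Workflow failed at [${wf:lastErrorNode()}]</message>
  </kill>
  <end name="end"/>
</workflow-app>
"""

NS = {"wf": "uri:oozie:workflow:0.5"}
root = ET.fromstring(WORKFLOW_XML)
# The action nodes are the executable vertices of the DAG; ok/error
# transitions are its edges.
actions = [a.get("name") for a in root.findall("wf:action", NS)]
print(actions)  # → ['cleanup-step']
```

A real pipeline would chain several such actions by pointing each `<ok to="...">` at the next action's name.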
APIs
Apache Oozie REST API
The Oozie Web Services API provides REST endpoints for submitting, managing, and monitoring workflow, coordinator, and bundle jobs on Apache Hadoop. Supports full job lifecycle ...
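To make the submission flow concrete, here is a stdlib-only Python sketch that builds the Hadoop-style configuration XML payload and the `POST /v2/jobs?action=start` request the REST API expects. The server URL, user name, and HDFS application path are hypothetical; the request is constructed but not sent, since that requires a live Oozie server.

```python
import urllib.request
import xml.etree.ElementTree as ET

def job_config(properties):
    """Render job properties as the Hadoop-style <configuration> XML
    that the Oozie REST API accepts as a submission payload."""
    conf = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(conf, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(conf, encoding="unicode")

# Hypothetical server and HDFS paths -- adjust for a real cluster.
OOZIE_URL = "http://oozie.example.com:11000/oozie"
payload = job_config({
    "user.name": "etl",
    "oozie.wf.application.path": "hdfs:///user/etl/apps/demo-wf",
})

# Submit and start in one call; the response body is JSON with the new
# job id, e.g. {"id": "0000001-...-W"}.
req = urllib.request.Request(
    f"{OOZIE_URL}/v2/jobs?action=start",
    data=payload.encode("utf-8"),
    headers={"Content-Type": "application/xml;charset=UTF-8"},
    method="POST",
)
# urllib.request.urlopen(req) would perform the actual submission.
print(req.full_url)
```

Job status would then be polled with a `GET .../v2/job/<job-id>?show=info` request against the same base URL.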
Capabilities
Apache Oozie Workflow Orchestration
Workflow capability for orchestrating Hadoop data processing pipelines using Apache Oozie. Covers workflow, coordinator, and bundle job lifecycle management for data engineers a...
Features
Define complex data processing pipelines as DAGs of actions executed on Apache Hadoop.
Schedule recurring workflows triggered by time intervals or data availability conditions in HDFS.
Group multiple coordinator jobs into a single bundle for coordinated lifecycle management.
Full REST API for job submission, lifecycle control, monitoring, and system administration.
Built-in support for MapReduce, Pig, Hive, Sqoop, DistCp, and custom Java/shell actions.
Define and monitor service level agreements on workflow and coordinator actions with alert capabilities.
Generate PNG, SVG, or DOT graph visualizations of workflow DAGs for debugging and documentation.
Retrieve execution logs, error logs, and audit trails for jobs via REST API with filtering support.
Built-in high availability (HA) support with multiple Oozie server instances and distributed state management.
Manage shared Hadoop libraries across workflows for consistent classpath management.
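The time-based scheduling feature above works by materializing one workflow run per interval between a coordinator's start and end times. This is a simplified sketch of that stepping logic, assuming the frequency is expressed in minutes (as in Oozie's classic `frequency` attribute); real coordinators also handle timezone and EL expressions like `${coord:days(1)}`, which are omitted here.

```python
from datetime import datetime, timedelta

def materialize(start, end, frequency_minutes):
    """Yield the nominal times a time-triggered coordinator would
    materialize between start (inclusive) and end (exclusive)."""
    t = start
    while t < end:
        yield t
        t += timedelta(minutes=frequency_minutes)

# A daily coordinator over a three-day window yields three nominal
# actions, one per day at midnight.
times = list(materialize(datetime(2024, 1, 1), datetime(2024, 1, 4), 24 * 60))
print([t.isoformat() for t in times])
# → ['2024-01-01T00:00:00', '2024-01-02T00:00:00', '2024-01-03T00:00:00']
```

Each nominal time becomes one coordinator action, which in turn launches one workflow run.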
Use Cases
Orchestrate multi-step ETL pipelines combining Hive queries, MapReduce jobs, and data transfers on Hadoop.
Run recurring Hadoop batch jobs on time-based schedules using coordinator jobs.
Trigger workflows automatically when new data arrives in HDFS using coordinator data-in conditions.
Automate ML model training and evaluation pipelines on Hadoop with dependency chaining.
Orchestrate large-scale data migration, compaction, and archival workflows across Hadoop clusters.
Coordinate workflows that span multiple Hadoop clusters using DistCp and remote actions.
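For the data-availability use case, a coordinator resolves a dataset URI template against each nominal time and waits for the dataset's done flag before triggering the workflow. The sketch below shows that resolution step with a hypothetical HDFS layout; `_SUCCESS` is the conventional Hadoop done-flag file, and real Oozie templates support more placeholders than the three substituted here.

```python
from datetime import datetime

def resolve_dataset(template, nominal_time):
    """Substitute the YEAR/MONTH/DAY placeholders used in Oozie dataset
    URI templates with values taken from the nominal time."""
    return (template
            .replace("${YEAR}", f"{nominal_time.year:04d}")
            .replace("${MONTH}", f"{nominal_time.month:02d}")
            .replace("${DAY}", f"{nominal_time.day:02d}"))

# Hypothetical dataset layout partitioned by date.
template = "hdfs:///data/logs/${YEAR}/${MONTH}/${DAY}"
path = resolve_dataset(template, datetime(2024, 3, 7))
done_flag = path + "/_SUCCESS"
print(done_flag)  # → hdfs:///data/logs/2024/03/07/_SUCCESS
```

Only when the resolved done flag exists in HDFS does the coordinator consider the input available and launch the workflow for that nominal time.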
Integrations
Core integration with HDFS for data storage and YARN for resource management.
Native Hive action type for executing HiveQL queries as workflow steps.
Native Pig action type for data transformation scripts in workflow pipelines.
Native Sqoop action type for importing and exporting data between Hadoop and RDBMS.
Spark action type for running Spark jobs within Oozie workflows.
Native MapReduce action type as the foundational Hadoop computation framework.
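As an example of how these native action types appear inside a workflow, here is a sketch of a Hive action fragment, parsed with Python's `xml.etree`. The action name, script name, and `DATE` parameter are hypothetical; the element structure follows the standard Oozie Hive action schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical Hive action: runs a HiveQL script (report.q, deployed
# alongside the workflow in HDFS) as one step of a larger pipeline.
HIVE_ACTION = """\
<action name="daily-report">
  <hive xmlns="uri:oozie:hive-action:0.5">
    <job-tracker>${jobTracker}</job-tracker>
    <name-node>${nameNode}</name-node>
    <script>report.q</script>
    <param>DATE=${nominalDate}</param>
  </hive>
  <ok to="end"/>
  <error to="fail"/>
</action>
"""

HIVE_NS = "{uri:oozie:hive-action:0.5}"
node = ET.fromstring(HIVE_ACTION)
hive = node.find(f"{HIVE_NS}hive")
script = hive.find(f"{HIVE_NS}script").text
print(node.get("name"), script)  # → daily-report report.q
```

The Pig, Sqoop, and Spark action types follow the same pattern: a typed element inside `<action>` carrying the tool-specific script or command, with shared `ok`/`error` transitions.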