Apache Spark
Apache Spark is a unified analytics engine for large-scale data processing. It provides high-level APIs in Java, Scala, Python, and R, and an optimized engine that supports general execution graphs. These APIs cover batch processing, SQL queries, streaming analytics, machine learning, and graph computation. The project is governed by the Apache Software Foundation.
APIs
Apache Spark REST API
REST API for monitoring Spark applications, accessing cluster information, and managing Spark jobs through the Spark UI backend. Exposes endpoints for applications, jobs, stages...
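As a rough illustration of the monitoring endpoints, the sketch below builds URLs under the `/api/v1` prefix and fetches JSON from them with the Python standard library only. The base address `http://localhost:4040` assumes a driver UI running locally on its default port, and `endpoint_url`, `fetch_json`, and `list_jobs` are hypothetical helper names, not part of Spark.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

API_PREFIX = "/api/v1"


def endpoint_url(base, *parts):
    """Join a UI base address and path segments into a REST endpoint URL."""
    path = "/".join(quote(str(p), safe="") for p in parts)
    return f"{base.rstrip('/')}{API_PREFIX}/{path}"


def fetch_json(url, timeout=5.0):
    """GET a URL and decode the JSON body (needs a reachable Spark UI)."""
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def list_jobs(base, app_id):
    """Return the job list for one application (assumed helper)."""
    return fetch_json(endpoint_url(base, "applications", app_id, "jobs"))
```

With an application running, `fetch_json(endpoint_url("http://localhost:4040", "applications"))` should return the list of applications, and each entry's `id` can then be passed to `list_jobs`.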
Apache Spark SQL API
Spark module for structured data processing with DataFrame and Dataset APIs. Provides a SQL interface and supports various data sources including Parquet, ORC, JSON, CSV, JDBC, ...
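A minimal sketch of the two interfaces side by side, assuming PySpark is installed and a local Java runtime is available. The `orders` data, the query, and the function name `totals_by_customer` are invented for illustration. The same aggregation is expressed once as SQL against a temporary view and once through the DataFrame API.

```python
# Sample rows and query are made up for illustration.
ORDERS = [("alice", 20.0), ("bob", 12.5), ("alice", 30.0)]
QUERY = """
    SELECT customer, SUM(amount) AS total
    FROM orders
    GROUP BY customer
    ORDER BY customer
"""


def totals_by_customer(rows=ORDERS):
    """Run one aggregation via SQL and via the DataFrame API."""
    # Imported lazily so the module loads even without PySpark installed.
    from pyspark.sql import SparkSession, functions as F

    spark = (
        SparkSession.builder.master("local[1]").appName("sql-sketch").getOrCreate()
    )
    try:
        df = spark.createDataFrame(rows, ["customer", "amount"])
        df.createOrReplaceTempView("orders")  # expose the DataFrame to SQL

        via_sql = spark.sql(QUERY).collect()
        via_api = (
            df.groupBy("customer")
            .agg(F.sum("amount").alias("total"))
            .orderBy("customer")
            .collect()
        )
        # Convert Row objects to plain tuples for easy comparison.
        return [tuple(r) for r in via_sql], [tuple(r) for r in via_api]
    finally:
        spark.stop()
```

Calling `totals_by_customer()` returns both result lists, which are expected to agree because the SQL and DataFrame paths describe the same query.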
Apache Spark Streaming API
Scalable, high-throughput, fault-tolerant stream processing of live data streams. Supports Structured Streaming (the newer DataFrame-based API that supersedes the legacy DStream API) with exactly-once semantics, contin...
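A small Structured Streaming sketch, assuming PySpark and a local Java runtime. It reads from the built-in `rate` test source, counts events per time window, and writes to the in-memory sink so the result can be queried with SQL. The constants, the query name `rate_counts`, and the function name are illustrative choices, and a real pipeline would typically use a source such as Kafka instead.

```python
ROWS_PER_SECOND = 5      # synthetic event rate for the test source
WINDOW = "2 seconds"     # tumbling window size


def count_rate_events(seconds=6):
    """Stream synthetic events for a few seconds and return windowed counts."""
    # Imported lazily so the module loads even without PySpark installed.
    from pyspark.sql import SparkSession, functions as F

    spark = (
        SparkSession.builder.master("local[2]").appName("stream-sketch").getOrCreate()
    )
    try:
        # The rate source emits rows with `timestamp` and `value` columns.
        events = (
            spark.readStream.format("rate")
            .option("rowsPerSecond", ROWS_PER_SECOND)
            .load()
        )
        counts = events.groupBy(F.window("timestamp", WINDOW)).count()

        query = (
            counts.writeStream.outputMode("complete")
            .format("memory")           # keep results in a queryable table
            .queryName("rate_counts")
            .start()
        )
        query.awaitTermination(seconds)  # returns once the timeout elapses
        rows = spark.sql("SELECT * FROM rate_counts").collect()
        query.stop()
        return rows
    finally:
        spark.stop()
```

The streaming query is written with the same DataFrame operations as a batch job, which is the main design point of Structured Streaming. The memory sink is meant for debugging and small demos only.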
Apache Spark MLlib API
Spark's scalable machine learning library consisting of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, dime...
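For classification, a minimal sketch with the DataFrame-based `pyspark.ml` package might look like the following. It assumes PySpark and NumPy are installed and a local Java runtime is available. The tiny training set, the hyperparameters, and the function name `fit_and_predict` are illustrative only.

```python
# Toy training data: (label, features). Values are made up for illustration.
TRAIN = [
    (1.0, [0.0, 1.1, 0.1]),
    (0.0, [2.0, 1.0, -1.0]),
    (0.0, [2.0, 1.3, 1.0]),
    (1.0, [0.0, 1.2, -0.5]),
]


def fit_and_predict(rows=TRAIN):
    """Fit a logistic regression model and predict on the training rows."""
    # Imported lazily so the module loads even without PySpark installed.
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.linalg import Vectors
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.master("local[1]").appName("mllib-sketch").getOrCreate()
    )
    try:
        df = spark.createDataFrame(
            [(label, Vectors.dense(features)) for label, features in rows],
            ["label", "features"],
        )
        model = LogisticRegression(maxIter=10, regParam=0.01).fit(df)
        predictions = model.transform(df).select("label", "prediction").collect()
        return [(r["label"], r["prediction"]) for r in predictions]
    finally:
        spark.stop()
```

`fit_and_predict()` returns one `(label, prediction)` pair per input row. On real data the model would be evaluated on a held-out split rather than on the training rows.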
Apache Spark GraphX API
Spark API for graphs and graph-parallel computation with a collection of graph algorithms and builders, including PageRank, Connected Components, Triangle Counting, and shortest...
Features
Single engine for batch, streaming, SQL, ML, and graph processing workloads.
Optimized execution plans with Catalyst optimizer and DAG scheduling.
In-memory caching that can make iterative algorithms up to 100x faster than Hadoop MapReduce.
Unified streaming and batch processing with exactly-once semantics and Kafka integration.
High-level APIs in Scala, Java, Python (PySpark), and R (SparkR).
ACID transactions, schema evolution, and time travel for data lakes through table formats such as Delta Lake.
Native Kubernetes scheduling for cloud-native deployment of Spark workloads.
Use Cases
Extract, transform, and load petabytes of data across distributed clusters.
Streaming analytics on live event data with sub-second latency.
Distributed ML training and feature engineering at scale with MLlib.
Query and transform data stored in cloud object stores and HDFS.
Interactive SQL queries on structured and semi-structured data at scale.
Integrations
HDFS storage, YARN cluster manager, and Hadoop ecosystem integration.
Apache Kafka as a Structured Streaming source and sink for real-time event processing.
Delta Lake, an open-source storage layer with ACID transactions for data lakes.
Apache Iceberg, an open table format for huge analytic datasets on cloud storage.
Hive metastore integration for table catalog and metadata management.
Native Kubernetes scheduling for cloud-native Spark deployments.
Workflow orchestration for scheduling and managing Spark jobs.