Apache Arrow
Apache Arrow is a cross-language development platform for in-memory analytics developed by the Apache Software Foundation. It specifies a standardized, language-independent columnar memory format for flat and nested data, organized for efficient analytic operations on modern hardware including CPUs and GPUs. Arrow provides computational libraries in C++, Java, Python (PyArrow), R, Go, Rust, JavaScript, C#, Ruby, Julia, and Swift, along with zero-copy streaming messaging via IPC and a high-performance data transfer framework called Arrow Flight (built on gRPC).
APIs
Apache Arrow Flight RPC
Arrow Flight is a high-performance RPC framework built on gRPC for transferring large datasets using the Arrow columnar format. It enables efficient bulk data transport between ...
Apache Arrow Libraries
Arrow provides native libraries in C++, Java, Python (PyArrow), R, Go, Rust, JavaScript, C#, Ruby, Julia, and Swift for reading, writing, and processing columnar data in the Arr...
Apache Arrow Format Specification
The Apache Arrow columnar format specification defines the binary layout for in-memory columnar data, including the IPC format for streaming and file-based data exchange. It cov...
Features
Standardized language-independent columnar memory layout for efficient analytic operations with zero-copy access.
Arrow Flight, a high-performance gRPC-based framework for transferring large Arrow datasets between services with minimal serialization overhead.
Arrow Flight SQL, an extension of Arrow Flight that provides a SQL query execution interface over the Flight protocol.
Inter-process communication via shared memory and memory-mapped files, enabling zero-copy data sharing across process boundaries.
Native libraries for C++, Java, Python, R, Go, Rust, JavaScript, C#, Ruby, Julia, and Swift.
SIMD-optimized compute functions for analytical operations on Arrow arrays and tables.
First-class support for reading and writing Apache Parquet files via the Arrow columnar format.
Unified Dataset API for reading partitioned datasets from local filesystems, S3, GCS, and HDFS.
CUDA integration for working with Arrow data in GPU device memory, including sharing GPU buffers between processes without copying.
Custom extension types for encoding domain-specific data using the Arrow format.
Use Cases
Share large analytical datasets between Python, R, Java, and other runtimes without serialization overhead.
Return query results from databases in Arrow format so that clients can analyze them without row-by-row deserialization into Python or Java objects.
Accelerate ETL and data processing pipelines using columnar computation and SIMD optimizations.
Store and serve ML features in Arrow format for efficient batch and real-time feature retrieval.
Build high-throughput data microservices using Arrow Flight for efficient bulk data transfer over gRPC.
Share in-memory data between Python pandas/polars, Java, and Rust applications with zero-copy semantics.
Integrations
Native read/write support for Apache Parquet, a widely used columnar file format for big data storage.
Spark uses Arrow, via PyArrow, to transfer data between the JVM and Python for pandas UDFs and for conversions between Spark and pandas DataFrames.
Deep integration with pandas DataFrames via PyArrow's to_pandas() and from_pandas() conversions.
DuckDB can query Arrow tables and datasets directly and return query results in Arrow format, exchanging data with little or no copying.
The Polars DataFrame library uses the Arrow columnar memory format and supports zero-copy interop with Arrow arrays.
Arrow Database Connectivity (ADBC) provides an Arrow-native database driver interface analogous to ODBC/JDBC.
Delta Lake integrates with Arrow for reading and writing Delta table data in columnar format.
The Ray distributed computing framework uses Arrow for shared-memory object storage between workers.