Apache Arrow logo

Apache Arrow

Apache Arrow is a cross-language development platform for in-memory analytics developed by the Apache Software Foundation. It specifies a standardized, language-independent columnar memory format for flat and nested data, organized for efficient analytic operations on modern hardware including CPUs and GPUs. Arrow provides computational libraries in C++, Java, Python (PyArrow), R, Go, Rust, JavaScript, C#, Ruby, Julia, and Swift, along with zero-copy streaming messaging via IPC and a high-performance data transfer framework called Arrow Flight (built on gRPC).

3 APIs 10 Features
AnalyticsApacheColumnar FormatDatagRPCIn-MemoryIPCOpen SourcePython

APIs

Apache Arrow Flight RPC

Arrow Flight is a high-performance RPC framework built on gRPC for transferring large datasets using the Arrow columnar format. It enables efficient bulk data transport between ...

Apache Arrow Libraries

Arrow provides native libraries in C++, Java, Python (PyArrow), R, Go, Rust, JavaScript, C#, Ruby, Julia, and Swift for reading, writing, and processing columnar data in the Arr...

Apache Arrow Format Specification

The Apache Arrow columnar format specification defines the binary layout for in-memory columnar data, including the IPC format for streaming and file-based data exchange. It cov...

Features

Columnar In-Memory Format

Standardized language-independent columnar memory layout for efficient analytic operations with zero-copy access.

Arrow Flight RPC

High-performance gRPC-based framework for transferring large Arrow datasets between services with minimal serialization overhead.

Flight SQL

Extension of Arrow Flight providing a SQL query execution interface over the Arrow Flight protocol.

Zero-Copy IPC

Inter-process communication via shared memory and memory-mapped files, enabling zero-copy data sharing across process boundaries.

Multi-Language Support

Native libraries for C++, Java, Python, R, Go, Rust, JavaScript, C#, Ruby, Julia, and Swift.

Vectorized Computation

SIMD-optimized compute functions for analytical operations on Arrow arrays and tables.

Parquet Integration

First-class support for reading and writing Apache Parquet files via the Arrow columnar format.

Dataset API

Unified Dataset API for reading partitioned datasets from local filesystems, S3, GCS, and HDFS.

GPU Support

CUDA integration for zero-copy data sharing between CPU and GPU memory via the CUDA Arrow device.

Extension Types

Custom extension types for encoding domain-specific data using the Arrow format.

Use Cases

Analytics Data Exchange

Share large analytical datasets between Python, R, Java, and other runtimes without serialization overhead.

Database Query Results

Return query results from databases in Arrow format for fast analytics without Python/Java deserialization.

Data Pipeline Acceleration

Accelerate ETL and data processing pipelines using columnar computation and SIMD optimizations.

Machine Learning Feature Stores

Store and serve ML features in Arrow format for efficient batch and real-time feature retrieval.

High-Throughput Data Services

Build high-throughput data microservices using Arrow Flight for efficient bulk data transfer over gRPC.

Cross-Language Data Sharing

Share in-memory data between Python pandas/polars, Java, and Rust applications with zero-copy semantics.

Integrations

Apache Parquet

Native read/write support for Parquet columnar file format, the most common big data storage format.

Apache Spark

Spark uses Arrow for Python UDF execution and pandas-on-Spark operations via PyArrow.

pandas

Deep integration with pandas DataFrames via PyArrow's to_pandas() and from_pandas() conversions.

DuckDB

DuckDB uses Arrow as its primary in-memory data format for zero-copy query result exchange.

Polars

Polars DataFrame library is built on Arrow and supports zero-copy interop with Arrow arrays.

ADBC (Arrow Database Connectivity)

Arrow Database Connectivity provides an Arrow-native database driver interface analogous to ODBC/JDBC.

Delta Lake

Delta Lake integrates with Arrow for reading and writing Delta table data in columnar format.

Ray

Ray distributed computing framework uses Arrow for shared-memory object storage between workers.

Resources

👥
GitHubOrganization
GitHubOrganization
👥
GitHubRepository
GitHubRepository
👥
GitHubRepository
GitHubRepository
👥
GitHubRepository
GitHubRepository
👥
GitHubRepository
GitHubRepository
👥
GitHubRepository
GitHubRepository
🔗
Documentation
Documentation
🚀
GettingStarted
GettingStarted
💬
Support
Support
📜
TermsOfService
TermsOfService
📄
ChangeLog
ChangeLog
📦
PyArrow (Python)
SDK
📦
Apache Arrow Java (Maven)
SDK
📦
Arrow-rs (Rust, crates.io)
SDK
📦
Arrow Go
SDK
📦
Apache Arrow JavaScript (npm)
SDK