Apache ORC
Apache ORC is a self-describing, type-aware columnar file format designed for Hadoop workloads. It provides high compression ratios and fast read performance for large-scale data processing, with support for complex data types such as structs, lists, maps, and unions.
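To illustrate why a column-oriented layout like ORC's compresses well, here is a toy sketch in plain Java (a conceptual illustration, not the ORC library API): transposing row-oriented records into per-column arrays places similar values next to each other, so a simple encoder such as run-length encoding can collapse repeats.

```java
import java.util.ArrayList;
import java.util.List;

public class ColumnarSketch {
    // Toy run-length encoder: ("a","a","a","b") -> [("a",3), ("b",1)]
    static List<Object[]> rle(List<String> values) {
        List<Object[]> runs = new ArrayList<>();
        for (String v : values) {
            if (!runs.isEmpty() && runs.get(runs.size() - 1)[0].equals(v)) {
                Object[] last = runs.get(runs.size() - 1);
                last[1] = (Integer) last[1] + 1;   // extend the current run
            } else {
                runs.add(new Object[]{v, 1});       // start a new run
            }
        }
        return runs;
    }

    public static void main(String[] args) {
        // Row-oriented records: (id, country)
        String[][] rows = {
            {"1", "US"}, {"2", "US"}, {"3", "US"}, {"4", "DE"}, {"5", "DE"}
        };
        // Column-oriented view: gather just the "country" column
        List<String> countryCol = new ArrayList<>();
        for (String[] row : rows) countryCol.add(row[1]);
        // Repeated values are now adjacent, so RLE collapses 5 cells into 2 runs
        List<Object[]> encoded = rle(countryCol);
        System.out.println(encoded.size()); // prints 2
    }
}
```

Real ORC files apply far more sophisticated encodings per column (dictionary, delta, bit-packing) plus a general-purpose codec on top, but the locality benefit shown here is the same.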
APIs
Apache ORC
ORC provides Java and C++ APIs for reading and writing ORC columnar files, with support for predicate pushdown, column projection, compression codecs, and integration with Hive and other Hadoop ecosystem engines.
Capabilities
Apache ORC File Processing Workflow
Workflow capability for reading, writing, converting, and analyzing Apache ORC columnar files.
Features
Columnar storage: stores data by column for efficient compression and query performance
Predicate pushdown: skips reading data that does not match query predicates
Column projection: reads only the columns needed for a query
ACID support: full transactional support when used with Apache Hive
Schema evolution: add, rename, and remove columns while preserving backward compatibility
Compression: supports the ZLIB, Snappy, LZO, LZ4, and ZSTD codecs
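Predicate pushdown works because ORC stores min/max statistics for each stripe (a large block of rows), letting a reader skip stripes whose value range cannot match the query. The sketch below is a simplified in-memory model of that idea; the Stripe class and its minId/maxId fields are illustrative stand-ins, not the ORC library API.

```java
import java.util.ArrayList;
import java.util.List;

public class PushdownSketch {
    // Illustrative stand-in for an ORC stripe: a block of rows plus
    // min/max statistics for one column
    static class Stripe {
        final long minId, maxId;
        Stripe(long minId, long maxId) { this.minId = minId; this.maxId = maxId; }
    }

    // Predicate pushdown: keep only stripes whose [min, max] range could
    // possibly contain a row matching "id == target"; all others are skipped
    // without being read or decompressed
    static List<Stripe> stripesToRead(List<Stripe> stripes, long target) {
        List<Stripe> keep = new ArrayList<>();
        for (Stripe s : stripes) {
            if (target >= s.minId && target <= s.maxId) keep.add(s);
        }
        return keep;
    }

    public static void main(String[] args) {
        List<Stripe> stripes = List.of(
            new Stripe(0, 999), new Stripe(1000, 1999), new Stripe(2000, 2999));
        // Only the middle stripe can contain id 1500; the other two are skipped
        System.out.println(stripesToRead(stripes, 1500).size()); // prints 1
    }
}
```

Column projection is complementary: because each column is stored contiguously within a stripe, a reader can fetch only the columns a query references, reducing I/O further.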
Use Cases
Store Hive tables in highly efficient ORC format
Process large ORC datasets with Apache Spark SQL
Fast analytical queries over ORC files with Presto or Trino
Efficient columnar storage for data lake architectures
Integrations
Native ORC support as default Hive storage format
ORC data source support in Spark SQL
Fast ORC reading with native vectorized reader
ORC file format support for batch and streaming
ORC to Arrow conversion for in-memory analytics