Apache Hive
Apache Hive is data warehouse software that facilitates reading, writing, and managing large datasets residing in distributed storage using SQL. It provides a SQL-like query language, HiveQL, for querying data stored in Hadoop, along with the WebHCat REST API for job submission and metastore access.
APIs
Apache Hive WebHCat REST API
WebHCat (Templeton) REST API for Apache Hive providing DDL operations, HiveQL job submission, and Hive Metastore metadata access over HTTP.
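WebHCat serves its resources under the `/templeton/v1` path on port 50111 by default, with the caller identified via the `user.name` query parameter. As a minimal sketch (the hostname and the helper names `ddl_table_url` and `hive_job_body` are hypothetical; the endpoint paths `ddl/database/{db}/table/{table}` and `hive` are WebHCat's), the requests can be assembled like this:

```python
from urllib.parse import urlencode

# Hypothetical host; 50111 is WebHCat's default port.
WEBHCAT_BASE = "http://webhcat.example.com:50111/templeton/v1"

def ddl_table_url(database, table, user):
    """URL for GETting a table's metadata via the WebHCat DDL resource."""
    query = urlencode({"user.name": user})
    return f"{WEBHCAT_BASE}/ddl/database/{database}/table/{table}?{query}"

def hive_job_body(query):
    """Form fields for POST /hive, which submits a HiveQL job;
    WebHCat writes job output under the given statusdir."""
    return {"execute": query, "statusdir": "/tmp/hive.output"}

print(ddl_table_url("default", "page_views", "analyst"))
# -> http://webhcat.example.com:50111/templeton/v1/ddl/database/default/table/page_views?user.name=analyst
```

A real client would send these with any HTTP library; the sketch only shows how the DDL and job-submission resources are addressed.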
Apache Hive JDBC API
JDBC interface to HiveServer2 for standard SQL client connectivity, supporting parameterized queries, result sets, and connection pooling from Java and ODBC-bridge applications.
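A Java client connects through `java.sql.DriverManager` using the `org.apache.hive.jdbc.HiveDriver` driver class and a URL of the form `jdbc:hive2://<host>:<port>/<db>;sess_var=value;...`, where 10000 is HiveServer2's default port. A minimal sketch of assembling such a URL (the hostname and helper name are hypothetical):

```python
def hive2_jdbc_url(host, port=10000, database="default", session_conf=None):
    """Build a HiveServer2 JDBC URL; session variables (e.g. transportMode)
    are appended as semicolon-separated key=value pairs."""
    url = f"jdbc:hive2://{host}:{port}/{database}"
    if session_conf:
        url += ";" + ";".join(f"{k}={v}" for k, v in session_conf.items())
    return url

print(hive2_jdbc_url("hs2.example.com", session_conf={"transportMode": "binary"}))
# -> jdbc:hive2://hs2.example.com:10000/default;transportMode=binary
```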
Capabilities
Features
SQL-like query language for reading, writing, and aggregating data stored in distributed storage.
HTTP REST API (Templeton) for DDL operations, job submission, and metastore metadata access.
Thrift-based HiveServer2 with JDBC and ODBC drivers for standard SQL client connectivity.
Central repository for table schema, partition metadata, and storage location information.
Partition tables by column values for efficient query pruning and data organization.
Optimized columnar storage formats with predicate pushdown and compression support.
Full ACID transaction support for inserts, updates, and deletes on managed ORC tables.
Vectorized query execution that processes rows in batches of column vectors for improved query throughput.
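The partitioning feature above maps directly onto the storage layout: Hive stores each partition as a chain of `key=value` subdirectories under the table's location, which is what lets the planner prune whole directories from a scan. A simplified sketch of that layout (Hive's actual escaping rules for special characters are more involved; the helper name is hypothetical):

```python
import posixpath
from urllib.parse import quote

def partition_path(table_location, spec):
    """Compose a Hive-style partition directory: one key=value
    path segment per partition column, in declaration order."""
    parts = [f"{k}={quote(str(v), safe='')}" for k, v in spec.items()]
    return posixpath.join(table_location, *parts)

print(partition_path("/warehouse/logs", {"dt": "2024-01-01", "country": "US"}))
# -> /warehouse/logs/dt=2024-01-01/country=US
```

A query filtering on `dt` and `country` only reads the matching directories rather than the whole table.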
Use Cases
Run SQL analytics on petabyte-scale datasets stored in HDFS or object storage.
Use HiveQL scripts to transform and load data between raw and curated data lake zones.
Query structured data interactively using Beeline or JDBC-connected BI tools.
Parse and aggregate application logs stored as text or JSON in HDFS using Hive SerDes.
Use the Hive Metastore as a shared schema registry for Spark, Flink, and Presto.
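The log-analysis use case above relies on a SerDe such as `org.apache.hive.hcatalog.data.JsonSerDe` to map each JSON line onto table columns, after which plain HiveQL aggregation applies. A toy Python stand-in for what `SELECT level, COUNT(*) FROM logs GROUP BY level` computes over JSON-lines input (the sample records are invented):

```python
import json
from collections import Counter

# Sample JSON-lines log records, as a JsonSerDe-backed table would read them.
raw = [
    '{"level": "ERROR", "msg": "disk full"}',
    '{"level": "INFO",  "msg": "started"}',
    '{"level": "ERROR", "msg": "timeout"}',
]

# Equivalent of GROUP BY level with COUNT(*).
counts = Counter(json.loads(line)["level"] for line in raw)
print(counts["ERROR"])
# -> 2
```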
Integrations
Hive reads and writes data stored in HDFS as the primary storage layer.
Spark uses the Hive Metastore for table discovery and supports Hive UDFs.
Hive HBase storage handler enables HiveQL queries against HBase tables.
Apache Tez DAG execution engine replaces MapReduce for faster Hive query processing.
Presto and Trino use the Hive Metastore for table metadata in federated SQL queries.
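The shared-metastore integrations above all hinge on engines pointing at the same metastore service, configured through the `hive.metastore.uris` property (the metastore's Thrift listener defaults to port 9083). A minimal hive-site.xml fragment, with a hypothetical hostname:

```xml
<configuration>
  <!-- Remote Hive Metastore that Hive, Spark, Flink, Presto, and Trino share. -->
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://metastore.example.com:9083</value>
  </property>
</configuration>
```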