
Apache Tika

Apache Tika is a toolkit for detecting and extracting metadata and structured text content from over 1,000 file formats including PDF, Microsoft Office (Word, Excel, PowerPoint), OpenDocument, HTML, XML, images, audio, video, and archive formats. Tika provides a REST API server, Java library, and command-line tool. It is used by Apache Solr, Apache Nutch, and many other systems for content extraction and indexing. It is maintained by the Apache Software Foundation.

2 APIs, 7 Features
Tags: Content Extraction, Document Processing, Metadata, Text Extraction, Open Source

APIs

Apache Tika REST API

The Tika Server REST API provides HTTP endpoints for content type detection, text extraction, metadata extraction, and language detection from uploaded documents. Key endpoints ...
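As a minimal sketch of how a client might call the server, the Python snippet below builds PUT requests with the standard library only. It assumes a Tika Server running locally on port 9998 and uses the `/tika`, `/meta`, `/detect/stream`, and `/language/stream` resources; the helper names `build_tika_request` and `send` are hypothetical and not part of Tika.

```python
import urllib.request

# Assumed base URL: a Tika Server started locally on its default port.
TIKA_SERVER = "http://localhost:9998"

# Tika Server resources used in this sketch, keyed by task.
ENDPOINTS = {
    "text": "/tika",
    "metadata": "/meta",
    "mime": "/detect/stream",
    "language": "/language/stream",
}


def build_tika_request(task, payload, accept="text/plain", base_url=TIKA_SERVER):
    """Build (but do not send) a PUT request carrying the raw document bytes."""
    if task not in ENDPOINTS:
        raise ValueError(f"unknown task: {task!r}")
    return urllib.request.Request(
        url=base_url + ENDPOINTS[task],
        data=payload,
        method="PUT",
        headers={"Accept": accept},
    )


def send(request, timeout=30):
    """Send a prepared request and return the response body as text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

With a server running, `send(build_tika_request("text", document_bytes))` would return the extracted text, and passing `accept="application/json"` with the `"metadata"` task would request the metadata as JSON.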

Apache Tika Java API

The Tika Java API provides the AutoDetectParser for automatic format detection and parsing, Metadata class for reading extracted metadata fields, ContentHandler for streaming SA...

Features

1000+ Format Support

Detect and extract content from over 1,000 file formats using parser plugins.

Metadata Extraction

Extract document metadata including author, creation date, title, and format-specific properties.

Language Detection

Automatic language detection from extracted text content.

MIME Type Detection

MIME type detection based on file content (magic bytes) rather than on the file extension alone.
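The idea behind content-based detection can be sketched in a few lines of Python. This is an illustration of the general magic-byte technique, not Tika's detector: the signature table is a small hand-picked sample of well-known signatures, and the function name `sniff_mime` is hypothetical.

```python
# A few well-known file signatures ("magic bytes"). This table is illustrative
# only; it is not Tika's detector or its MIME database.
SIGNATURES = [
    (b"%PDF-", "application/pdf"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"PK\x03\x04", "application/zip"),
]


def sniff_mime(data, default="application/octet-stream"):
    """Return the MIME type whose signature matches the start of the data."""
    for magic, mime in SIGNATURES:
        if data.startswith(magic):
            return mime
    return default


# The file name plays no part: bytes that start like a PDF are reported as PDF
# even if the file was saved as "report.txt".
print(sniff_mime(b"%PDF-1.7\n..."))
```

A signature check alone has limits. Formats such as .docx are ZIP containers, so this sketch would report them as `application/zip`; telling them apart requires looking inside the container.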

REST Server

Standalone HTTP server for document processing, so client applications need no dependency on the Java library.

OCR Integration

Optional Tesseract OCR integration for text extraction from images and scanned PDFs.

Recursive Parsing

Recursive parsing of archive formats (ZIP, TAR, JAR) and embedded documents.
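To illustrate what recursive parsing means, the Python sketch below walks a ZIP archive and descends into any ZIP files nested inside it. It covers only ZIP, uses the standard library, and does not reflect how Tika implements recursion; `walk_archive` and `make_zip` are hypothetical helper names, and the `!/` path separator is an arbitrary choice.

```python
import io
import zipfile


def walk_archive(data, prefix="", max_depth=5):
    """Yield (path, bytes) for each leaf entry, descending into nested ZIPs."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            entry = archive.read(info)
            path = prefix + info.filename
            nested = zipfile.is_zipfile(io.BytesIO(entry))
            if nested and max_depth > 0:
                # Recurse into the embedded archive, extending the path.
                yield from walk_archive(entry, path + "!/", max_depth - 1)
            else:
                yield path, entry


def make_zip(entries):
    """Build an in-memory ZIP from a {name: bytes} mapping (demo helper)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, content in entries.items():
            archive.writestr(name, content)
    return buffer.getvalue()


inner = make_zip({"note.txt": b"text inside the nested archive"})
outer = make_zip({"readme.txt": b"text in the outer archive", "bundle.zip": inner})
for path, content in walk_archive(outer):
    print(path, len(content))
```

The `max_depth` limit is a guard against deeply nested or maliciously crafted archives; once it reaches zero, a nested archive is returned as an ordinary entry instead of being opened.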

Use Cases

Search Indexing

Extract text from documents for indexing in Apache Solr or Elasticsearch.

Document Intelligence

Automated metadata extraction and classification for document management systems.

Content Migration

Batch content extraction during digital archive migration and transformation.

E-Discovery

Legal e-discovery content extraction from diverse document collections.

Integrations

Apache Solr

Solr Cell uses Tika for extracting text from uploaded documents for indexing.

Apache Nutch

The Nutch web crawler uses Tika to parse the content of fetched web pages.

Elasticsearch

The ingest attachment processor uses Tika to extract content from documents.
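As a rough sketch of how the attachment processor is typically used, the Python snippet below builds the two JSON bodies involved: an ingest pipeline definition containing an `attachment` processor, and a document whose source field holds the file as base64. The function names, the field name `data`, and the pipeline description are assumptions for illustration; sending the bodies to a cluster is left out.

```python
import base64
import json


def attachment_pipeline(field="data"):
    """Body for an ingest pipeline that runs the attachment processor."""
    return {
        "description": "Extract attachment content",
        "processors": [{"attachment": {"field": field}}],
    }


def attachment_document(raw, field="data"):
    """Body for a document whose source field carries the file as base64."""
    return {field: base64.b64encode(raw).decode("ascii")}


print(json.dumps(attachment_pipeline(), indent=2))
print(json.dumps(attachment_document(b"hello world")))
```

The pipeline body would be registered under a name of your choosing, and documents indexed with that pipeline selected would then have their extracted content added by the processor.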

Tesseract OCR

Optional Tesseract integration for OCR on images and scanned documents.

Apache NiFi

NiFi processor integration for automated document parsing workflows.

Resources

GitHub Repository
Documentation
Portal
Getting Started
Release Notes
Support
Terms of Service
Python Tika Package (SDK)