Apache Tika
Apache Tika is a toolkit for detecting and extracting metadata and structured text content from over 1,000 file formats, including PDF, Microsoft Office (Word, Excel, PowerPoint), OpenDocument, HTML, XML, images, audio, video, and archive formats. Tika is available as a REST API server, a Java library, and a command-line tool. Apache Solr, Apache Nutch, and many other systems use it for content extraction and indexing. It is maintained by the Apache Software Foundation.
APIs
Apache Tika REST API
The Tika Server REST API provides HTTP endpoints for content type detection, text extraction, metadata extraction, and language detection from uploaded documents. Key endpoints ...
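As a rough illustration of calling the server from a client, the sketch below builds an HTTP PUT request for text extraction using the JDK's built-in HTTP client. It assumes a Tika Server instance on localhost at port 9998 and the /tika endpoint with an Accept: text/plain header; check both against the documentation for your server version. The class and method names (TikaServerSketch, buildExtractRequest, extractText) are hypothetical.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Path;

// Hypothetical helper class, not part of Tika itself.
final class TikaServerSketch {
    // Assumption: Tika Server runs locally on port 9998.
    static final String BASE_URL = "http://localhost:9998";

    // Builds a PUT request that uploads a document body and asks for plain text back.
    static HttpRequest buildExtractRequest(String baseUrl, HttpRequest.BodyPublisher body) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/tika"))
                .header("Accept", "text/plain")
                .PUT(body)
                .build();
    }

    // Sends the file to the server and returns the extracted text.
    static String extractText(Path file) throws IOException, InterruptedException {
        HttpRequest request =
                buildExtractRequest(BASE_URL, HttpRequest.BodyPublishers.ofFile(file));
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IOException("Unexpected status from server: " + response.statusCode());
        }
        return response.body();
    }

    public static void main(String[] args) {
        // Only prints the request line; no network call is made here.
        HttpRequest request =
                buildExtractRequest(BASE_URL, HttpRequest.BodyPublishers.noBody());
        System.out.println(request.method() + " " + request.uri());
        // prints "PUT http://localhost:9998/tika"
    }
}
```

With a server running, TikaServerSketch.extractText(Path.of("report.pdf")) would return the extracted text; report.pdf is a placeholder file name. Keeping request construction separate from sending makes the request easy to inspect without a live server.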
Apache Tika Java API
The Tika Java API provides the AutoDetectParser for automatic format detection and parsing, Metadata class for reading extracted metadata fields, ContentHandler for streaming SA...
Features
Detect and extract content from over 1,000 file formats using parser plugins.
Extract document metadata including author, creation date, title, and format-specific properties.
Automatic language detection from extracted text content.
MIME type detection based on file content (magic bytes), not just the file extension.
Standalone HTTP server for document processing without a dependency on the Java library.
Optional Tesseract OCR integration for text extraction from images and scanned PDFs.
Recursive parsing of archive formats (ZIP, TAR, JAR) and embedded documents.
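To make the magic-byte idea from the feature list concrete, here is a deliberately simplified sketch that maps a file's leading bytes to a MIME type. It is not Tika's detector, which draws on a much larger signature database and further inspection; the class name MagicByteSketch and the three signatures chosen are illustrative assumptions only.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Illustrative only: a tiny content-based detector, not Tika's implementation.
final class MagicByteSketch {
    private static final byte[] PDF = "%PDF-".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] PNG = {(byte) 0x89, 'P', 'N', 'G', 0x0D, 0x0A, 0x1A, 0x0A};
    private static final byte[] ZIP = {'P', 'K', 0x03, 0x04};

    // Returns a MIME type based on the first bytes of the content.
    static String detect(byte[] head) {
        if (startsWith(head, PDF)) return "application/pdf";
        if (startsWith(head, PNG)) return "image/png";
        if (startsWith(head, ZIP)) return "application/zip";
        return "application/octet-stream"; // generic fallback for unknown content
    }

    // True when data begins with the given prefix; safe for short or empty input.
    private static boolean startsWith(byte[] data, byte[] prefix) {
        return data.length >= prefix.length
                && Arrays.equals(data, 0, prefix.length, prefix, 0, prefix.length);
    }

    public static void main(String[] args) {
        // A PDF header is recognized even if the file were named "report.txt".
        byte[] head = "%PDF-1.7\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(detect(head)); // prints "application/pdf"
    }
}
```

The ZIP signature also matches ZIP-based containers such as JAR files, which is one reason a real detector has to look beyond the first few bytes.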
Use Cases
Extract text from documents for indexing in Apache Solr or Elasticsearch.
Automated metadata extraction and classification for document management systems.
Batch content extraction during digital archive migration and transformation.
Legal e-discovery content extraction from diverse document collections.
Integrations
Apache Solr's Solr Cell uses Tika to extract text from uploaded documents for indexing.
The Apache Nutch web crawler uses Tika to parse fetched web page content.
The Elasticsearch ingest attachment processor uses Tika for document content extraction.
Optional Tesseract integration for OCR on images and scanned documents.
Apache NiFi processor integration for automated document parsing workflows.