Apache Nutch

Apache Nutch is a highly extensible and scalable open-source web crawler built on Apache Hadoop data structures for batch processing. It provides a pluggable architecture supporting custom parse filters, scoring filters, index writers, and protocol implementations. Nutch integrates with Apache Solr and Elasticsearch for full-text search and exposes a REST API for managing crawl jobs, configurations, seed lists, and database queries. The project is governed by the Apache Software Foundation and released under the Apache License 2.0.

1 API · 1 Capability · 10 Features
Web Crawler · Indexing · Search · Apache · Java · Hadoop · Open Source

APIs

Apache Nutch REST API

REST API for managing Apache Nutch crawl jobs, configurations, seed URL lists, database queries (CrawlDB and FetchDB), and data readers. Supports full crawl lifecycle management...

Capabilities

Apache Nutch Crawl Management

Workflow capability for managing end-to-end web crawl pipelines with Apache Nutch. Covers job lifecycle management, configuration control, seed list management, and CrawlDB quer...

Features

Scalable Batch Crawling

Leverages Apache Hadoop data structures for distributed, large-scale batch web crawling.

Pluggable Architecture

Extensible plugin system supporting custom parse filters, scoring filters, index writers, protocol plugins, and URL filters.

REST API for Crawl Management

Full REST API for managing crawl jobs, configurations, seed lists, CrawlDB/FetchDB queries, and sequence file readers.
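
As a rough illustration of how a crawl job might be submitted over the REST API, the sketch below builds (but does not send) a job-creation request. The base URL `http://localhost:8081`, the `/job/create` path, and the payload field names (`type`, `crawlId`, `confId`, `args`) are assumptions modelled on the Nutch 1.x REST server and should be checked against the deployed version. `build_job_request` is a hypothetical helper, not part of Nutch.

```python
import json
import urllib.request

# Assumed default address of a locally started Nutch REST server.
NUTCH_BASE_URL = "http://localhost:8081"


def build_job_request(job_type, crawl_id, conf_id="default", args=None,
                      base_url=NUTCH_BASE_URL):
    """Build a POST request asking the server to create a crawl job.

    The endpoint path and JSON field names are assumptions, not details
    confirmed by this page.
    """
    payload = {
        "type": job_type,
        "crawlId": crawl_id,
        "confId": conf_id,
        "args": args or {},
    }
    return urllib.request.Request(
        url=f"{base_url}/job/create",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Build an inject job for a hypothetical seed directory named "seeds".
request = build_job_request("INJECT", "crawl01", args={"url_dir": "seeds"})
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
# To actually submit it: urllib.request.urlopen(request)
```

Keeping request construction separate from sending makes the payload easy to inspect before it reaches a live server.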

Full-Text Search Integration

Built-in index writers for Apache Solr and Elasticsearch to enable full-text search over crawled content.

Apache Tika Parsing

Uses Apache Tika for parsing a wide variety of document formats during the crawl pipeline.

Duplicate Detection

Built-in deduplication support to identify and remove duplicate content from the crawl database and search index.
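
The idea behind signature-based deduplication can be sketched in a few lines. This is a simplified in-memory illustration, not Nutch's implementation: hashing the content with MD5, keeping the highest-scoring page per group, and breaking ties by shortest URL are all assumptions made for the example, and `signature` and `find_duplicates` are hypothetical names.

```python
import hashlib
from collections import defaultdict


def signature(content):
    """Hash page content to a fixed-size signature (MD5 is an assumption)."""
    return hashlib.md5(content.encode("utf-8")).hexdigest()


def find_duplicates(pages):
    """Return the URLs that would be marked as duplicates.

    `pages` is a list of dicts with 'url', 'content', and 'score' keys.
    Within each signature group the highest-scoring page is kept; ties go
    to the shortest URL.
    """
    groups = defaultdict(list)
    for page in pages:
        groups[signature(page["content"])].append(page)

    duplicates = []
    for group in groups.values():
        # Best page first: highest score, then shortest URL.
        ranked = sorted(group, key=lambda p: (-p["score"], len(p["url"])))
        duplicates.extend(p["url"] for p in ranked[1:])
    return sorted(duplicates)


pages = [
    {"url": "http://example.org/a", "content": "hello", "score": 1.0},
    {"url": "http://example.org/a?ref=1", "content": "hello", "score": 1.0},
    {"url": "http://example.org/b", "content": "world", "score": 0.5},
]
print(find_duplicates(pages))
```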

Configurable URL Filtering

Regex-based and custom URL filter plugins to control crawl scope and exclusions.
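
A regex-based URL filter can be pictured as an ordered rule list in which the first matching rule decides. The sketch below is modelled on the `+`/`-` prefixed rule convention of Nutch's `regex-urlfilter.txt`, but it is an approximation: Python's `re` syntax is not identical to Java's, the host `example.org` is hypothetical, and `parse_rules` and `accepts` are illustrative names.

```python
import re


def parse_rules(text):
    """Parse '+regex' / '-regex' lines into (accept, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError(f"rule must start with '+' or '-': {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(url, rules):
    """Return True if the first matching rule is an accept rule.

    URLs that match no rule at all are rejected.
    """
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False


RULES = parse_rules(r"""
# skip common static-asset suffixes
-\.(gif|jpg|png|css|js|zip)$
# skip URLs containing query-like characters
-[?*!@=]
# stay inside one (hypothetical) host
+^https?://([a-z0-9-]+\.)*example\.org/
""")

for url in ("https://www.example.org/docs/index.html",
            "https://www.example.org/logo.png"):
    print(url, accepts(url, RULES))
```

Because rule order matters, exclusions are listed before the broad accept rule.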

Incremental Crawling

Supports multi-round incremental crawling workflows to keep the crawl database fresh.
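
One way to picture a multi-round crawl is as a repeated generate, fetch, parse, and update cycle against the same crawl database. The sketch below only assembles the command lines for such a round. It assumes the Nutch 1.x `bin/nutch` sub-commands (`generate`, `fetch`, `parse`, `updatedb`) and a local `crawl/` directory layout; the function names and the `run_rounds` driver are hypothetical and would need adapting to a real installation.

```python
import os
import subprocess


def generate_command(crawl_dir, top_n, nutch="bin/nutch"):
    """Command that selects up to `top_n` URLs into a new segment."""
    return [nutch, "generate", f"{crawl_dir}/crawldb",
            f"{crawl_dir}/segments", "-topN", str(top_n)]


def segment_commands(crawl_dir, segment, nutch="bin/nutch"):
    """Commands that fetch and parse one segment, then update the crawl database."""
    return [
        [nutch, "fetch", segment],
        [nutch, "parse", segment],
        [nutch, "updatedb", f"{crawl_dir}/crawldb", segment],
    ]


def run_rounds(crawl_dir, rounds, top_n, nutch="bin/nutch"):
    """Run several rounds; assumes seeds were already injected into the crawldb."""
    segments_dir = f"{crawl_dir}/segments"
    for _ in range(rounds):
        subprocess.run(generate_command(crawl_dir, top_n, nutch), check=True)
        # Assumes segment directories sort chronologically by name.
        newest = sorted(os.listdir(segments_dir))[-1]
        for cmd in segment_commands(crawl_dir, f"{segments_dir}/{newest}", nutch):
            subprocess.run(cmd, check=True)


# Print the planned commands for one round without executing anything.
print(" ".join(generate_command("crawl", 1000)))
for cmd in segment_commands("crawl", "crawl/segments/SEGMENT"):
    print(" ".join(cmd))
```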

CommonCrawl Export

Service operations for exporting crawl data in CommonCrawl-compatible formats.

HTTP Authentication Support

Configurable HTTP authentication schemes for crawling password-protected sites.

Use Cases

Enterprise Search

Build enterprise search engines over internal or external web content using Nutch as the crawler and Solr/Elasticsearch as the search backend.

Research Data Collection

Academic and research teams use Nutch for large-scale systematic web data collection and indexing.

Intranet Document Search

Crawl and index intranet sites, wikis, and document repositories for internal enterprise search.

Web Archive Creation

Create structured web archives compatible with the CommonCrawl format for long-term data preservation.

SEO and Content Monitoring

Monitor web content changes, track competitor sites, and analyze web structure at scale.

Custom Data Extraction Pipelines

Build custom extraction pipelines using the Nutch plugin architecture for targeted data acquisition tasks.

Integrations

Apache Solr

Native index writer plugin for indexing crawled content into Apache Solr for full-text search.

Elasticsearch

Index writer plugin for sending crawled content to Elasticsearch clusters.

Apache Hadoop

Core dependency providing distributed storage and processing via HDFS and MapReduce.

Apache Tika

Used for content detection and extraction from a wide range of document formats during parsing.

SolrCloud

Support for SolrCloud distributed search clusters for scalable indexing.

Semantic Vocabularies

Apache Nutch Context

16 classes · 35 properties

JSON-LD

API Governance Rules

Apache Nutch API Rules

33 rules · 12 errors · 19 warnings · 2 info

SPECTRAL

Resources

Apache Nutch GitHub Repository · GitHubRepository
Apache Software Foundation GitHub · GitHubOrganization
Apache Nutch Documentation · Documentation
Nutch Tutorial · GettingStarted
Apache Nutch Tutorials · Tutorials
Apache Nutch FAQs · FAQ
Nutch Release Notes · ReleaseNotes
Apache License 2.0 · TermsOfService
Mailing Lists · Support
Nutch on Stack Overflow · StackOverflow
Apache Nutch Spectral Rules · SpectralRules
Apache Nutch Crawl Management · NaftikoCapability
Apache Nutch Vocabulary · Vocabulary
Apache Nutch JSON-LD Context · JSONLD