Apache Nutch
Apache Nutch is a highly extensible, scalable open-source web crawler built on Apache Hadoop data structures for batch processing. It provides a pluggable architecture supporting custom parse filters, scoring filters, index writers, and protocol implementations. Nutch integrates with Apache Solr and Elasticsearch for full-text search and exposes a REST API for managing crawl jobs, configurations, seed lists, and database queries. It is governed by the Apache Software Foundation under the Apache License 2.0.
APIs
Apache Nutch REST API
REST API for managing Apache Nutch crawl jobs, configurations, seed URL lists, database queries (CrawlDB and LinkDB), and data readers. Supports full crawl lifecycle management.
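As a minimal sketch, job submission through the REST API takes a small JSON body naming the job type, configuration, and crawl ID. The endpoint path, default port, and the `url_dir` argument below are assumptions drawn from the Nutch 1.x REST documentation; verify them against your Nutch version:

```python
import json

# Sketch of a request body for POST /job/create on a Nutch REST server
# (started with `bin/nutch startserver`; port 8081 is the assumed default).
NUTCH_API = "http://localhost:8081"

def make_job_request(job_type, crawl_id, conf_id="default", args=None):
    """Build the JSON body for creating one crawl job."""
    return {
        "type": job_type,    # e.g. INJECT, GENERATE, FETCH, PARSE, UPDATEDB
        "confId": conf_id,   # named configuration to run the job under
        "crawlId": crawl_id, # groups DB and segment paths for one crawl
        "args": args or {},  # job-specific arguments
    }

body = make_job_request("INJECT", "crawl01", args={"url_dir": "seed/"})
print(json.dumps(body))
```

The same shape is reused for each phase of the crawl cycle, with only `type` and `args` changing per job.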
Capabilities
Apache Nutch Crawl Management
Workflow capability for managing end-to-end web crawl pipelines with Apache Nutch. Covers job lifecycle management, configuration control, seed list management, and CrawlDB queries.
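The generate → fetch → parse → updatedb cycle such a pipeline repeats can be sketched with a toy in-memory frontier. The link graph and state strings below are purely illustrative; in Nutch each phase runs as a Hadoop batch job over the CrawlDB and segments:

```python
# Toy model of Nutch's multi-round crawl cycle: each round GENERATEs a
# fetch list from unfetched URLs, FETCHes and PARSEs them, and UPDATEDBs
# the crawl database with newly discovered outlinks.
LINKS = {  # made-up outlink graph standing in for the live web
    "http://a.example/": ["http://a.example/one", "http://b.example/"],
    "http://a.example/one": ["http://a.example/two"],
    "http://b.example/": [],
    "http://a.example/two": [],
}

def crawl(seeds, rounds):
    crawldb = {url: "unfetched" for url in seeds}                        # INJECT
    for _ in range(rounds):
        fetchlist = [u for u, s in crawldb.items() if s == "unfetched"]  # GENERATE
        for url in fetchlist:
            crawldb[url] = "fetched"                                     # FETCH + PARSE
            for out in LINKS.get(url, []):                               # UPDATEDB
                crawldb.setdefault(out, "unfetched")
    return crawldb

db = crawl(["http://a.example/"], rounds=2)
print(sorted(u for u, s in db.items() if s == "fetched"))
```

Each additional round fetches the URLs discovered in the previous one, which is why incremental multi-round crawling (listed under Features below) keeps the database fresh.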
Features
Leverages Apache Hadoop data structures for distributed, large-scale web crawling batch processing.
Extensible plugin system supporting custom parse filters, scoring filters, index writers, protocol plugins, and URL filters.
Full REST API for managing crawl jobs, configurations, seed lists, CrawlDB/LinkDB queries, and sequence file readers.
Built-in index writers for Apache Solr and Elasticsearch to enable full-text search over crawled content.
Uses Apache Tika for parsing a wide variety of document formats during the crawl pipeline.
Built-in deduplication support to identify and remove duplicate content from the crawl database and search index.
Regex-based and custom URL filter plugins to control crawl scope and exclusions.
Supports multi-round incremental crawling workflows to keep the crawl database fresh.
Service operations for exporting crawl data in CommonCrawl-compatible formats.
Configurable HTTP authentication schemes for crawling password-protected sites.
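The regex-based URL filtering mentioned above follows the convention of Nutch's `conf/regex-urlfilter.txt`: each rule is a `+` (accept) or `-` (reject) sign followed by a regex, the first rule whose pattern matches the URL decides, and a URL matching no rule is rejected. A minimal sketch of that semantics, with illustrative rules rather than Nutch's defaults:

```python
import re

# Nutch-style first-match-wins URL filter rules (sign, pattern).
RULES = [
    ("-", re.compile(r"\.(gif|jpg|png|css|js)$")),  # skip static assets
    ("-", re.compile(r"[?*!@=]")),                  # skip query-ish URLs
    ("+", re.compile(r"^https?://([a-z0-9-]+\.)*example\.com/")),  # stay on-site
]

def accept(url):
    for sign, pattern in RULES:
        if pattern.search(url):
            return sign == "+"
    return False  # no rule matched: reject

print(accept("https://www.example.com/docs/index.html"))  # True
print(accept("https://www.example.com/logo.png"))         # False
print(accept("https://elsewhere.org/page"))               # False
```

Ordering the reject rules before the broad accept rule is what keeps assets and dynamic URLs out even when they live on the allowed host.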
Use Cases
Build enterprise search engines over internal or external web content using Nutch as the crawler and Solr/Elasticsearch as the search backend.
Academic and research teams use Nutch for large-scale systematic web data collection and indexing.
Crawl and index intranet sites, wikis, and document repositories for internal enterprise search.
Create structured web archives compatible with CommonCrawl format for long-term data preservation.
Monitor web content changes, track competitor sites, and analyze web structure at scale.
Build custom extraction pipelines using Nutch plugin architecture for targeted data acquisition tasks.
Integrations
Native index writer plugin for indexing crawled content into Apache Solr for full-text search.
Index writer plugin for sending crawled content to Elasticsearch clusters.
Apache Hadoop is a core dependency, providing distributed storage and processing via HDFS and MapReduce.
Apache Tika is used for content detection and extraction from a wide range of document formats during parsing.
Support for SolrCloud distributed search clusters for scalable indexing.