apache-nutch 2.0 API

Apache Nutch 2.X is a branch of the Apache Nutch open source web-search software project.

Core
org.apache.nutch.api  
org.apache.nutch.api.impl  
org.apache.nutch.crawl Crawl control code.
org.apache.nutch.fetcher The Nutch robot.
org.apache.nutch.host  
org.apache.nutch.html  
org.apache.nutch.indexer Maintain Lucene full-text indexes.
org.apache.nutch.indexer.solr  
org.apache.nutch.metadata A Multi-valued Metadata container, and set of constant fields for Nutch Metadata.
org.apache.nutch.net  
org.apache.nutch.net.protocols  
org.apache.nutch.parse  
org.apache.nutch.plugin The Nutch Plugin System.
org.apache.nutch.protocol  
org.apache.nutch.protocol.sftp Protocol plugin which supports retrieving documents via the sftp protocol.
org.apache.nutch.scoring  
org.apache.nutch.storage  
org.apache.nutch.tools  
org.apache.nutch.tools.arc  
org.apache.nutch.tools.proxy  
org.apache.nutch.urlfilter.automaton A url filter plugin based on dk.brics.automaton Finite-State Automata for Java™.
org.apache.nutch.urlfilter.domain A url filter plugin that filters by domain.
org.apache.nutch.urlfilter.prefix A url filter plugin.
org.apache.nutch.urlfilter.regex A url filter plugin.
org.apache.nutch.urlfilter.suffix  
org.apache.nutch.urlfilter.validator A url filter plugin that validates given urls.
org.apache.nutch.util  
org.apache.nutch.util.domain  

Plugins API
org.apache.nutch.protocol.http.api Common API used by HTTP plugins (http, httpclient)
org.apache.nutch.urlfilter.api  

Protocol Plugins
org.apache.nutch.protocol.file Protocol plugin which supports retrieving local file resources.
org.apache.nutch.protocol.ftp Protocol plugin which supports retrieving documents via the ftp protocol.
org.apache.nutch.protocol.http Protocol plugin which supports retrieving documents via the http protocol.
org.apache.nutch.protocol.httpclient Protocol plugin which supports retrieving documents via the HTTP and HTTPS protocols, optionally with Basic, Digest and NTLM authentication schemes for web server as well as proxy server.

URL Normalizer Plugins
org.apache.nutch.net.urlnormalizer.basic  
org.apache.nutch.net.urlnormalizer.pass  
org.apache.nutch.net.urlnormalizer.regex  

Scoring Plugins
org.apache.nutch.scoring.link  
org.apache.nutch.scoring.opic  
org.apache.nutch.scoring.tld Top Level Domain Scoring plugin.

Parse Plugins
org.apache.nutch.parse.ext  
org.apache.nutch.parse.feed  
org.apache.nutch.parse.html An HTML document parsing plugin.
org.apache.nutch.parse.js  
org.apache.nutch.parse.swf  
org.apache.nutch.parse.tika  
org.apache.nutch.parse.zip  

Indexing Filter Plugins
org.apache.nutch.indexer.anchor An indexing plugin for inbound anchor text.
org.apache.nutch.indexer.basic A basic indexing plugin.
org.apache.nutch.indexer.feed  
org.apache.nutch.indexer.more A more indexing plugin.
org.apache.nutch.indexer.subcollection  
org.apache.nutch.indexer.tld Top Level Domain Indexing plugin.

Misc. Plugins
org.apache.nutch.analysis.lang Text document language identifier.
org.apache.nutch.collection Subcollection is a subset of an index.
org.apache.nutch.microformats.reltag A microformats Rel-Tag Parser/Indexer/Querier plugin.
org.creativecommons.nutch Sample plugins that parse and index Creative Commons metadata.

Apache Nutch 2.X builds on Apache Gora for data persistence and Apache Solr for indexing, adding web-specific components such as a crawler, a link-graph database, and parsing support handled by Apache Tika for HTML and an array of other document formats.



Copyright © 2012 The Apache Software Foundation