org.apache.nutch.indexer.subcollection
Class SubcollectionIndexingFilter

java.lang.Object
  extended by org.apache.hadoop.conf.Configured
      extended by org.apache.nutch.indexer.subcollection.SubcollectionIndexingFilter
All Implemented Interfaces:
Configurable, IndexingFilter, FieldPluggable, Pluggable

public class SubcollectionIndexingFilter
extends Configured
implements IndexingFilter


Field Summary
static String FIELD_NAME
          Name of the document field that this filter writes to.
static org.slf4j.Logger LOG
          Logger
 
Fields inherited from interface org.apache.nutch.indexer.IndexingFilter
X_POINT_ID
 
Constructor Summary
SubcollectionIndexingFilter()
           
SubcollectionIndexingFilter(Configuration conf)
           
 
Method Summary
 NutchDocument filter(NutchDocument doc, String url, WebPage page)
          Adds fields or otherwise modifies the document that will be indexed for a parse.
 Collection<WebPage.Field> getFields()
           
 
Methods inherited from class org.apache.hadoop.conf.Configured
getConf, setConf
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 
Methods inherited from interface org.apache.hadoop.conf.Configurable
getConf, setConf
 

Field Detail

FIELD_NAME

public static final String FIELD_NAME
Name of the document field that this filter writes to.

See Also:
Constant Field Values

LOG

public static final org.slf4j.Logger LOG
Logger

Constructor Detail

SubcollectionIndexingFilter

public SubcollectionIndexingFilter()

SubcollectionIndexingFilter

public SubcollectionIndexingFilter(Configuration conf)
Method Detail

getFields

public Collection<WebPage.Field> getFields()
Specified by:
getFields in interface FieldPluggable

filter

public NutchDocument filter(NutchDocument doc,
                            String url,
                            WebPage page)
                     throws IndexingException
Description copied from interface: IndexingFilter
Adds fields or otherwise modifies the document that will be indexed for a parse. Unwanted documents can be removed from indexing by returning a null value.

Specified by:
filter in interface IndexingFilter
Parameters:
doc - document instance for collecting fields
url - page url
page - the WebPage being indexed
Returns:
modified (or a new) document instance, or null (meaning the document should be discarded)
Throws:
IndexingException
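
The return contract described above (return the document to keep it, return null to discard it) can be sketched without Nutch on the classpath. The example below is a minimal illustration under stated assumptions, not the plugin's actual implementation: Doc and Filter are hypothetical stand-ins for NutchDocument and IndexingFilter, the WebPage parameter is omitted, and the field name, field value, and URL rules are invented for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class FilterContractSketch {

    /** Hypothetical stand-in for NutchDocument: named fields with multiple values. */
    static final class Doc {
        final Map<String, List<String>> fields = new LinkedHashMap<>();

        void add(String name, String value) {
            fields.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
        }
    }

    /** Hypothetical stand-in for IndexingFilter, reduced to document and URL. */
    interface Filter {
        /** Returns the (possibly modified) document, or null to discard it. */
        Doc filter(Doc doc, String url);
    }

    /** Illustrative field name; the real value of FIELD_NAME is not shown on this page. */
    static final String FIELD = "subcollection";

    /** Adds a field value when the URL matches an invented rule. */
    static final Filter TAGGER = (doc, url) -> {
        if (url.contains("example.org")) {
            doc.add(FIELD, "example");
        }
        return doc;
    };

    /** Discards documents under an invented "/private/" path by returning null. */
    static final Filter DROP_PRIVATE = (doc, url) -> url.contains("/private/") ? null : doc;

    /** Applies filters in order and stops as soon as one of them returns null. */
    static Doc applyAll(List<Filter> filters, Doc doc, String url) {
        Doc current = doc;
        for (Filter f : filters) {
            current = f.filter(current, url);
            if (current == null) {
                return null; // document is dropped from indexing
            }
        }
        return current;
    }

    public static void main(String[] args) {
        List<Filter> chain = Arrays.asList(TAGGER, DROP_PRIVATE);
        Doc kept = applyAll(chain, new Doc(), "http://example.org/index.html");
        Doc dropped = applyAll(chain, new Doc(), "http://example.org/private/a.html");
        System.out.println("kept fields: " + kept.fields);
        System.out.println("dropped is null: " + (dropped == null));
    }
}
```

A caller that chains several filters has to check for null after each call, as applyAll does here, because a later filter cannot be handed a discarded document.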


Copyright © 2012 The Apache Software Foundation