org.apache.nutch.urlfilter.suffix
Class SuffixURLFilter

java.lang.Object
  extended by org.apache.nutch.urlfilter.suffix.SuffixURLFilter
All Implemented Interfaces:
Configurable, URLFilter, Pluggable

public class SuffixURLFilter
extends Object
implements URLFilter

Filters URLs based on a file of URL suffixes. The file name is taken from either

  1. the property "urlfilter.suffix.file" in ./conf/nutch-default.xml, or
  2. the attribute "file" in plugin.xml of this plugin.

If the attribute "file" is defined, it takes precedence over the property. If the config file is missing, all URLs are rejected.

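This filter can be configured to work in one of two modes:

  1. default to reject ('-'): only URLs that match a suffix listed in the config file are accepted; all other URLs are rejected.
  2. default to accept ('+'): only URLs that match a suffix listed in the config file are rejected; all other URLs are accepted.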

The format of this config file is one URL suffix per line, with no preceding whitespace. The order in which suffixes are specified doesn't matter. Blank lines and comments (#) are allowed.

A single '+' or '-' sign, not followed by any suffix, must appear exactly once to select the mode this plugin operates in. An optional single 'I' can be appended to the sign to make suffix matches case-insensitive. The default, if 'I' is not specified, is case-sensitive matching, i.e. suffix '.JPG' does not match '.jpg'.

NOTE: the format of this file is different from that of urlfilter-prefix, because that plugin supports only allowed prefixes, not prohibited ones. Please note that this plugin does not support regular expressions; it accepts only literal suffixes. For example, a suffix "+*.jpg" is most probably wrong; you should use "+.jpg" instead.
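
The file format described above can be illustrated with a small, self-contained sketch. This is not the Nutch implementation: the class SuffixConfigSketch and its parse method are hypothetical names introduced here, the input is taken as a String rather than a Reader, surrounding whitespace is tolerated, and a file without a mode line is rejected with an exception. Those last three points are choices of this sketch, not documented plugin behavior.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical helper, not part of Nutch: parses the suffix file format. */
final class SuffixConfigSketch {
  final boolean modeAccept;    // true for '+', false for '-'
  final boolean ignoreCase;    // true when the mode sign carries an 'I'
  final List<String> suffixes; // literal suffixes, lower-cased if ignoreCase

  private SuffixConfigSketch(boolean modeAccept, boolean ignoreCase, List<String> suffixes) {
    this.modeAccept = modeAccept;
    this.ignoreCase = ignoreCase;
    this.suffixes = suffixes;
  }

  static SuffixConfigSketch parse(String text) {
    Boolean mode = null;
    boolean ignoreCase = false;
    List<String> raw = new ArrayList<String>();
    for (String rawLine : text.split("\\r?\\n")) {
      String line = rawLine.trim(); // leniency of this sketch only
      if (line.isEmpty() || line.startsWith("#")) {
        continue; // blank lines and comments are allowed
      }
      boolean isModeLine = line.equals("+") || line.equals("-")
          || line.equals("+I") || line.equals("-I");
      if (isModeLine) {
        if (mode != null) {
          throw new IllegalArgumentException("mode line given more than once");
        }
        mode = Boolean.valueOf(line.charAt(0) == '+');
        ignoreCase = line.endsWith("I");
        continue;
      }
      raw.add(line); // anything else is a literal suffix
    }
    if (mode == null) {
      // Assumption of this sketch: a missing mode line is treated as an error.
      throw new IllegalArgumentException("missing '+' or '-' mode line");
    }
    // Normalize after reading everything, since line order does not matter.
    List<String> suffixes = new ArrayList<String>();
    for (String s : raw) {
      suffixes.add(ignoreCase ? s.toLowerCase(Locale.ROOT) : s);
    }
    return new SuffixConfigSketch(mode.booleanValue(), ignoreCase, suffixes);
  }

  public static void main(String[] args) {
    SuffixConfigSketch cfg = parse("# a comment\n+I\n\n.GIF\n.png");
    System.out.println("modeAccept=" + cfg.modeAccept
        + " ignoreCase=" + cfg.ignoreCase + " suffixes=" + cfg.suffixes);
    // prints: modeAccept=true ignoreCase=true suffixes=[.gif, .png]
  }
}
```

Suffixes are normalized only after the whole input has been read, so the 'I' flag takes effect even when the mode line follows the suffix lines.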

Example 1

The configuration shown below will accept all URLs with '.html' or '.htm' suffixes (case-sensitive, so '.HTML' or '.HTM' will be rejected), and reject URLs with any other suffix.

  # this is a comment
  
  # prohibit all unknown, case-sensitive matching
  -

  # collect only HTML files.
  .html
  .htm
 

Example 2

The configuration shown below will accept all URLs except those ending in the suffixes of common graphical formats, matched case-insensitively.

  # this is a comment
  
  # allow all unknown, case-insensitive matching
  +I
  
  # prohibited suffixes
  .gif
  .png
  .jpg
  .jpeg
  .bmp
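
The decision rule behind the two examples can be sketched in a few lines of plain Java. The class SuffixDecisionSketch and the method decide are hypothetical names, not part of Nutch. The sketch compares suffixes against the whole URL string and returns null for a rejected URL; both are assumptions made here for illustration, and the actual plugin may extract the path or treat query strings differently.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/** Hypothetical helper, not part of Nutch: applies the accept/reject rule. */
final class SuffixDecisionSketch {

  /** Returns the URL if accepted, or null if rejected (assumed convention). */
  static String decide(String url, boolean modeAccept, boolean ignoreCase,
                       List<String> suffixes) {
    if (url == null) {
      return null;
    }
    String candidate = ignoreCase ? url.toLowerCase(Locale.ROOT) : url;
    boolean matched = false;
    for (String suffix : suffixes) {
      String s = ignoreCase ? suffix.toLowerCase(Locale.ROOT) : suffix;
      if (candidate.endsWith(s)) {
        matched = true;
        break;
      }
    }
    // '+' mode: listed suffixes are rejected, everything else is accepted.
    // '-' mode: listed suffixes are accepted, everything else is rejected.
    boolean accept = modeAccept ? !matched : matched;
    return accept ? url : null;
  }

  public static void main(String[] args) {
    // Example 1: '-' mode, case-sensitive, accept only .html and .htm
    List<String> html = Arrays.asList(".html", ".htm");
    System.out.println(decide("http://example.org/index.html", false, false, html));
    System.out.println(decide("http://example.org/INDEX.HTML", false, false, html));

    // Example 2: '+I' mode, case-insensitive, reject common image suffixes
    List<String> images = Arrays.asList(".gif", ".png", ".jpg", ".jpeg", ".bmp");
    System.out.println(decide("http://example.org/logo.PNG", true, true, images));
    System.out.println(decide("http://example.org/about.html", true, true, images));
  }
}
```

Under these assumptions, the first and fourth calls print the URL unchanged, while the second and third print null.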
 

Author:
Andrzej Bialecki

Field Summary
 
Fields inherited from interface org.apache.nutch.net.URLFilter
X_POINT_ID
 
Constructor Summary
SuffixURLFilter()
           
SuffixURLFilter(Reader reader)
           
 
Method Summary
 String filter(String url)
           
 Configuration getConf()
           
 boolean isIgnoreCase()
           
 boolean isModeAccept()
           
static void main(String[] args)
           
 void readConfiguration(Reader reader)
           
 void setConf(Configuration conf)
           
 void setFilterFromPath(boolean filterFromPath)
           
 void setIgnoreCase(boolean ignoreCase)
           
 void setModeAccept(boolean modeAccept)
           
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

SuffixURLFilter

public SuffixURLFilter()
                throws IOException
Throws:
IOException

SuffixURLFilter

public SuffixURLFilter(Reader reader)
                throws IOException
Throws:
IOException
Method Detail

filter

public String filter(String url)
Specified by:
filter in interface URLFilter

readConfiguration

public void readConfiguration(Reader reader)
                       throws IOException
Throws:
IOException

main

public static void main(String[] args)
                 throws IOException
Throws:
IOException

setConf

public void setConf(Configuration conf)
Specified by:
setConf in interface Configurable

getConf

public Configuration getConf()
Specified by:
getConf in interface Configurable

isModeAccept

public boolean isModeAccept()

setModeAccept

public void setModeAccept(boolean modeAccept)

isIgnoreCase

public boolean isIgnoreCase()

setIgnoreCase

public void setIgnoreCase(boolean ignoreCase)

setFilterFromPath

public void setFilterFromPath(boolean filterFromPath)


Copyright © 2012 The Apache Software Foundation