org.apache.nutch.urlfilter.api
Class RegexURLFilterBase

java.lang.Object
  extended by org.apache.nutch.urlfilter.api.RegexURLFilterBase
All Implemented Interfaces:
Configurable, URLFilter, Pluggable
Direct Known Subclasses:
AutomatonURLFilter, RegexURLFilter

public abstract class RegexURLFilterBase
extends Object
implements URLFilter

Generic URL filter based on regular expressions.

The regular expression rules are read from a file. Each implementation supplies its rules through the getRulesReader(Configuration) method.

The file contains one rule per line, in the form:
[+-]<regex>
where plus (+) means the URL is accepted (go ahead and index it) and minus (-) means it is rejected.
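
To make the rule format concrete, here is a minimal, self-contained sketch that parses [+-]<regex> lines and applies them to a URL. It is not the Nutch implementation: the class RuleFormatSketch and its methods are hypothetical, and three behaviours are assumptions rather than facts stated on this page (blank lines and lines starting with '#' are skipped, the first matching rule decides, and a URL that matches no rule is rejected).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical stand-in for illustration only; not the Nutch implementation.
class RuleFormatSketch {

    // One parsed rule: the sign ('+' is true, '-' is false) and the compiled regex.
    static final class Rule {
        final boolean accept;
        final Pattern pattern;

        Rule(boolean accept, String regex) {
            this.accept = accept;
            this.pattern = Pattern.compile(regex);
        }
    }

    // Parses rules of the form [+-]<regex>, one per line.
    // Assumption: blank lines and lines starting with '#' are skipped.
    static List<Rule> parse(String rulesText) {
        List<Rule> rules = new ArrayList<>();
        for (String line : rulesText.split("\\R")) {
            if (line.trim().isEmpty() || line.startsWith("#")) {
                continue;
            }
            char sign = line.charAt(0);
            if (sign != '+' && sign != '-') {
                throw new IllegalArgumentException("Invalid rule: " + line);
            }
            rules.add(new Rule(sign == '+', line.substring(1)));
        }
        return rules;
    }

    // Assumption: the first rule whose regex is found in the URL decides;
    // a URL that matches no rule is rejected (null is returned).
    static String filter(List<Rule> rules, String url) {
        for (Rule rule : rules) {
            if (rule.pattern.matcher(url).find()) {
                return rule.accept ? url : null;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        List<Rule> rules = parse("# skip images\n-\\.(gif|jpg|png)$\n+^https?://\n");
        System.out.println(filter(rules, "http://example.org/index.html"));
        System.out.println(filter(rules, "http://example.org/logo.png"));
    }
}
```

Under these assumptions rule order matters: the exclusion for image suffixes is listed before the broad http(s) inclusion, so it is consulted first.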

Author:
Jérôme Charron

Field Summary
 
Fields inherited from interface org.apache.nutch.net.URLFilter
X_POINT_ID
 
Constructor Summary
  RegexURLFilterBase()
          Constructs a new empty RegexURLFilterBase.
  RegexURLFilterBase(File filename)
          Constructs a new RegexURLFilterBase and initializes it with a file of rules.
protected RegexURLFilterBase(Reader reader)
          Constructs a new RegexURLFilterBase and initializes it with a Reader of rules.
  RegexURLFilterBase(String rules)
          Constructs a new RegexURLFilterBase and initializes it with a list of rules.
 
Method Summary
protected abstract  RegexRule createRule(boolean sign, String regex)
          Creates a new RegexRule.
 String filter(String url)
           
 Configuration getConf()
           
protected abstract  Reader getRulesReader(Configuration conf)
          Returns a Reader over the rules to use for a particular implementation.
static void main(RegexURLFilterBase filter, String[] args)
          Filters the standard input using a RegexURLFilterBase.
 void setConf(Configuration conf)
           
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

RegexURLFilterBase

public RegexURLFilterBase()
Constructs a new empty RegexURLFilterBase.


RegexURLFilterBase

public RegexURLFilterBase(File filename)
                   throws IOException,
                          IllegalArgumentException
Constructs a new RegexURLFilterBase and initializes it with a file of rules.

Parameters:
filename - the file containing the rules.
Throws:
IOException
IllegalArgumentException

RegexURLFilterBase

public RegexURLFilterBase(String rules)
                   throws IOException,
                          IllegalArgumentException
Constructs a new RegexURLFilterBase and initializes it with a list of rules.

Parameters:
rules - a string containing the list of rules, one rule per line.
Throws:
IOException
IllegalArgumentException

RegexURLFilterBase

protected RegexURLFilterBase(Reader reader)
                      throws IOException,
                             IllegalArgumentException
Constructs a new RegexURLFilterBase and initializes it with a Reader of rules.

Parameters:
reader - a Reader supplying the rules.
Throws:
IOException
IllegalArgumentException

Method Detail

createRule

protected abstract RegexRule createRule(boolean sign,
                                        String regex)
Creates a new RegexRule.

Parameters:
sign - the sign of the regular expression. A true value means that any URL matching this rule must be included, whereas a false value means that any URL matching this rule must be excluded.
regex - the regular expression associated with this rule.
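
As an illustration of the factory role createRule plays, the following sketch shows how a concrete subclass might back each rule with java.util.regex. It is self-contained and does not use the Nutch classes: SketchRule, SketchFilterBase, and JavaRegexSketchFilter are hypothetical stand-ins, and the accept() and match(String) methods are assumptions about what a rule type needs, not a description of RegexRule.

```java
import java.util.regex.Pattern;

// Hypothetical stand-in for a rule type: a sign plus a match test.
abstract class SketchRule {
    private final boolean sign;

    SketchRule(boolean sign) {
        this.sign = sign;
    }

    // true means matching URLs are included, false means they are excluded.
    boolean accept() {
        return sign;
    }

    // Returns true if this rule applies to the given URL.
    abstract boolean match(String url);
}

// Hypothetical stand-in for the abstract base: it delegates rule creation
// to subclasses, mirroring createRule(boolean sign, String regex).
abstract class SketchFilterBase {
    protected abstract SketchRule createRule(boolean sign, String regex);
}

// A concrete filter that compiles each rule's regex with java.util.regex.
class JavaRegexSketchFilter extends SketchFilterBase {
    @Override
    protected SketchRule createRule(boolean sign, String regex) {
        final Pattern pattern = Pattern.compile(regex);
        return new SketchRule(sign) {
            @Override
            boolean match(String url) {
                // Assumption: a rule matches if the regex is found anywhere in the URL.
                return pattern.matcher(url).find();
            }
        };
    }
}
```

Keeping rule creation abstract lets each subclass choose its own matching engine while the base class handles reading and applying the rules.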

getRulesReader

protected abstract Reader getRulesReader(Configuration conf)
                                  throws IOException
Returns a Reader over the rules to use for a particular implementation.

Parameters:
conf - the current configuration.
Returns:
a Reader over the resource containing the rules to use.
Throws:
IOException

filter

public String filter(String url)
Specified by:
filter in interface URLFilter

setConf

public void setConf(Configuration conf)
Specified by:
setConf in interface Configurable

getConf

public Configuration getConf()
Specified by:
getConf in interface Configurable

main

public static void main(RegexURLFilterBase filter,
                        String[] args)
                 throws IOException,
                        IllegalArgumentException
Filters the standard input using a RegexURLFilterBase.

Parameters:
filter - the RegexURLFilterBase to use for filtering the standard input.
args - optional parameters (not used).
Throws:
IOException
IllegalArgumentException
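
The idea of filtering the standard input line by line can be sketched as below. This is a self-contained illustration, not the Nutch code: StdinFilterSketch is a hypothetical class, the filter is modelled as a UnaryOperator<String> that returns the URL when it is kept and null when it is rejected, and the "+"/"-" output prefixes are an assumption about a convenient output format.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.io.Reader;
import java.util.function.UnaryOperator;

// Hypothetical stand-in for illustration only; not the Nutch implementation.
class StdinFilterSketch {

    // Reads one URL per line and writes "+url" if the filter keeps it,
    // "-url" otherwise. The +/- output prefix is an assumption.
    static void run(UnaryOperator<String> filter, Reader in, PrintWriter out)
            throws IOException {
        BufferedReader reader = new BufferedReader(in);
        String line;
        while ((line = reader.readLine()) != null) {
            String kept = filter.apply(line);
            out.println(kept != null ? "+" + kept : "-" + line);
        }
        out.flush();
    }

    public static void main(String[] args) throws IOException {
        // Example filter (hypothetical): keep only plain http URLs.
        UnaryOperator<String> httpOnly =
                url -> url.startsWith("http://") ? url : null;
        run(httpOnly, new InputStreamReader(System.in), new PrintWriter(System.out));
    }
}
```

Passing the input and output as parameters keeps the loop easy to exercise with in-memory readers and writers instead of the real standard streams.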


Copyright © 2012 The Apache Software Foundation