org.apache.nutch.urlfilter.regex
Class RegexURLFilter
java.lang.Object
org.apache.nutch.urlfilter.api.RegexURLFilterBase
org.apache.nutch.urlfilter.regex.RegexURLFilter
- All Implemented Interfaces:
- Configurable, URLFilter, Pluggable
public class RegexURLFilter
- extends RegexURLFilterBase
Filters URLs based on a file of regular expressions using the Java Regex implementation.
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
URLFILTER_REGEX_FILE
public static final String URLFILTER_REGEX_FILE
- See Also:
- Constant Field Values
URLFILTER_REGEX_RULES
public static final String URLFILTER_REGEX_RULES
- See Also:
- Constant Field Values
RegexURLFilter
public RegexURLFilter()
RegexURLFilter
public RegexURLFilter(String filename)
throws IOException,
PatternSyntaxException
- Throws:
IOException
PatternSyntaxException
getRulesReader
protected Reader getRulesReader(Configuration conf)
throws IOException
- Rules specified in a configuration property override rules specified in a configuration file.
- Specified by:
getRulesReader
in class RegexURLFilterBase
- Parameters:
conf - the current configuration.
- Returns:
- a Reader over the rules to use.
- Throws:
IOException
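The precedence described for getRulesReader can be sketched as follows. This is a minimal illustration, not the Nutch source: a plain Map stands in for Configuration, the class name RulesSourceSketch and the method openRules are hypothetical, and the two property names are assumptions (the actual values are listed under Constant Field Values).

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Map;

// Hypothetical stand-in for the precedence rule; not the Nutch implementation.
class RulesSourceSketch {
    // Assumed property names for URLFILTER_REGEX_RULES and URLFILTER_REGEX_FILE.
    static final String RULES_KEY = "urlfilter.regex.rules";
    static final String FILE_KEY = "urlfilter.regex.file";

    // A plain Map stands in for the Configuration object.
    static Reader openRules(Map<String, String> conf) throws IOException {
        String inline = conf.get(RULES_KEY);
        if (inline != null) {
            // Inline rules win, so the rules file is never opened.
            return new StringReader(inline);
        }
        String file = conf.get(FILE_KEY);
        if (file == null) {
            throw new IOException("no rules configured");
        }
        // Fall back to the rules file only when no inline rules are set.
        return Files.newBufferedReader(Paths.get(file), StandardCharsets.UTF_8);
    }
}
```

With both keys set, the reader returns the inline rules text and the file path is ignored, which is why a missing file does not cause a failure in that case.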
createRule
protected RegexRule createRule(boolean sign,
String regex)
- Description copied from class:
RegexURLFilterBase
- Creates a new RegexRule.
- Specified by:
createRule
in class RegexURLFilterBase
- Parameters:
sign - the sign of the regular expression. A true value means that any URL matching this rule must be included, whereas a false value means that any URL matching this rule must be excluded.
regex - the regular expression associated with this rule.
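The sign semantics above can be illustrated with a small self-contained sketch built on java.util.regex. It is not the Nutch implementation: the class SignedRuleSketch, its nested Rule type, and the parse and filter helpers are hypothetical. The rule-line syntax (a leading '+' or '-' followed by the regex, with '#' comment lines skipped), the use of find() for matching, and the first-match-wins evaluation are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration of signed regex rules; not the Nutch classes.
class SignedRuleSketch {
    static final class Rule {
        final boolean sign;     // true = include matching URLs, false = exclude them
        final Pattern pattern;

        Rule(boolean sign, String regex) {
            this.sign = sign;
            // Pattern.compile throws PatternSyntaxException for an invalid regex.
            this.pattern = Pattern.compile(regex);
        }

        // Assumption: a rule matches if the regex is found anywhere in the URL.
        boolean matches(String url) {
            return pattern.matcher(url).find();
        }
    }

    // Assumed line syntax: '+' or '-' then the regex; blank and '#' lines are skipped.
    static List<Rule> parse(String rulesText) {
        List<Rule> rules = new ArrayList<>();
        for (String line : rulesText.split("\n")) {
            line = line.trim();
            if (line.isEmpty() || line.charAt(0) == '#') {
                continue;
            }
            char first = line.charAt(0);
            if (first != '+' && first != '-') {
                throw new IllegalArgumentException("rule must start with + or -: " + line);
            }
            rules.add(new Rule(first == '+', line.substring(1)));
        }
        return rules;
    }

    // Assumed evaluation: the first matching rule decides; no match means rejection.
    // Returns the URL when it is accepted and null when it is rejected.
    static String filter(List<Rule> rules, String url) {
        for (Rule rule : rules) {
            if (rule.matches(url)) {
                return rule.sign ? url : null;
            }
        }
        return null;
    }
}
```

Under these assumptions, rule order matters: placing an exclusion such as `-\.(gif|jpg|png)$` before a broad inclusion rejects image URLs even on an otherwise accepted host.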
main
public static void main(String[] args)
throws IOException
- Throws:
IOException
Copyright © 2012 The Apache Software Foundation