java.lang.Object
  org.apache.nutch.urlfilter.suffix.SuffixURLFilter
public class SuffixURLFilter
Filters URLs based on a file of URL suffixes. The file is named by
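This filter can be configured to work in one of two modes:
default to reject ('-'): only URLs that match a suffix listed in the config file are accepted; all other URLs are rejected.
default to accept ('+'): only URLs that match a suffix listed in the config file are rejected; all other URLs are accepted.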
The format of this config file is one URL suffix per line, with no preceding whitespace. The order in which suffixes are specified doesn't matter. Blank lines and comments (#) are allowed.
A single '+' or '-' sign not followed by any suffix must be used once, to signify the mode this plugin operates in. An optional single 'I' can be appended, to signify that suffix matches should be case-insensitive. The default, if not specified, is to use case-sensitive matches, i.e. suffix '.JPG' does not match '.jpg'.
NOTE: the format of this file differs from that of urlfilter-prefix, because that plugin supports only allowed prefixes, not prohibited ones. This plugin does not support regular expressions; it accepts only literal suffixes. For example, a suffix "+*.jpg" is most probably wrong; use "+.jpg" instead.
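The format rules above can be sketched as a small stand-alone parser. This is only an illustration, not the plugin's own code: the class `SuffixConfigSketch`, its fields, and the `parse` method are hypothetical names, and the behaviour when no mode line is present (falling back to reject mode) is an assumption, since the format requires a mode line.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical stand-alone parser for the suffix config format; not the Nutch implementation. */
class SuffixConfigSketch {
    boolean modeAccept = false;  // assumption: reject mode if the mode line is missing
    boolean ignoreCase = false;  // documented default: case-sensitive matching
    final Set<String> suffixes = new LinkedHashSet<>();

    static SuffixConfigSketch parse(String text) {
        SuffixConfigSketch cfg = new SuffixConfigSketch();
        Set<String> raw = new LinkedHashSet<>();
        for (String line : text.split("\n")) {
            // Lenient trim; the documented format has no preceding whitespace.
            line = line.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue;  // blank lines and comments are allowed
            }
            switch (line) {
                case "+":  cfg.modeAccept = true;  cfg.ignoreCase = false; break;
                case "+I": cfg.modeAccept = true;  cfg.ignoreCase = true;  break;
                case "-":  cfg.modeAccept = false; cfg.ignoreCase = false; break;
                case "-I": cfg.modeAccept = false; cfg.ignoreCase = true;  break;
                default:   raw.add(line);  // anything else is a literal suffix
            }
        }
        // Normalise after reading everything, because line order does not matter.
        for (String s : raw) {
            cfg.suffixes.add(cfg.ignoreCase ? s.toLowerCase(Locale.ROOT) : s);
        }
        return cfg;
    }

    public static void main(String[] args) {
        SuffixConfigSketch cfg = parse("# sample\n+I\n\n.GIF\n.png\n");
        System.out.println("modeAccept=" + cfg.modeAccept
                + " ignoreCase=" + cfg.ignoreCase
                + " suffixes=" + cfg.suffixes);
    }
}
```

Lower-casing the suffixes only after the whole file has been read keeps the result independent of where the mode line appears.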
The configuration shown below will accept all URLs with '.html' or '.htm' suffixes (case-sensitive - '.HTML' or '.HTM' will be rejected), and prohibit all other suffixes.
    # this is a comment

    # prohibit all unknown, case-sensitive matching
    -

    # collect only HTML files.
    .html
    .htm
The configuration shown below will accept all URLs except common graphical formats.
    # this is a comment

    # allow all unknown, case-insensitive matching
    +I

    # prohibited suffixes
    .gif
    .png
    .jpg
    .jpeg
    .bmp
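The effect of the two example configurations can be sketched with a few lines of stand-alone Java. This is a minimal illustration of the accept/reject decision, not the plugin's implementation: the class `SuffixModeSketch` and its `filter` helper are hypothetical, the mode, case flag and suffix list are passed in directly instead of being read from a file, and returning the URL for "accepted" and `null` for "rejected" is the convention assumed here.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/** Hypothetical illustration of the two modes; not the Nutch implementation. */
class SuffixModeSketch {
    /** Returns the URL if it is accepted, or null if it is rejected. */
    static String filter(String url, boolean modeAccept, boolean ignoreCase, List<String> suffixes) {
        String candidate = ignoreCase ? url.toLowerCase(Locale.ROOT) : url;
        boolean matched = false;
        for (String suffix : suffixes) {
            String s = ignoreCase ? suffix.toLowerCase(Locale.ROOT) : suffix;
            if (candidate.endsWith(s)) {  // literal suffix match, no regular expressions
                matched = true;
                break;
            }
        }
        // '+' mode: listed suffixes are prohibited, everything else is accepted.
        // '-' mode: listed suffixes are accepted, everything else is rejected.
        boolean accepted = modeAccept ? !matched : matched;
        return accepted ? url : null;
    }

    public static void main(String[] args) {
        // First configuration: '-' with .html and .htm, case-sensitive.
        List<String> html = Arrays.asList(".html", ".htm");
        System.out.println(filter("http://example.com/index.html", false, false, html));
        System.out.println(filter("http://example.com/INDEX.HTML", false, false, html));

        // Second configuration: '+I' with common image suffixes.
        List<String> images = Arrays.asList(".gif", ".png", ".jpg", ".jpeg", ".bmp");
        System.out.println(filter("http://example.com/logo.PNG", true, true, images));
        System.out.println(filter("http://example.com/about", true, true, images));
    }
}
```

In this sketch the original URL, not the lower-cased copy, is returned on acceptance, so case-insensitive matching does not alter the URL itself.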
Field Summary |
---|
Fields inherited from interface org.apache.nutch.net.URLFilter |
---|
X_POINT_ID |
Constructor Summary |
---|
SuffixURLFilter() |
SuffixURLFilter(Reader reader) |
Method Summary | |
---|---|
String | filter(String url) |
Configuration | getConf() |
boolean | isIgnoreCase() |
boolean | isModeAccept() |
static void | main(String[] args) |
void | readConfiguration(Reader reader) |
void | setConf(Configuration conf) |
void | setFilterFromPath(boolean filterFromPath) |
void | setIgnoreCase(boolean ignoreCase) |
void | setModeAccept(boolean modeAccept) |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Constructor Detail |
---|
public SuffixURLFilter() throws IOException
Throws: IOException
public SuffixURLFilter(Reader reader) throws IOException
Throws: IOException
Method Detail |
---|
public String filter(String url)
Specified by: filter in interface URLFilter
public void readConfiguration(Reader reader) throws IOException
Throws: IOException
public static void main(String[] args) throws IOException
Throws: IOException
public void setConf(Configuration conf)
Specified by: setConf in interface Configurable
public Configuration getConf()
Specified by: getConf in interface Configurable
public boolean isModeAccept()
public void setModeAccept(boolean modeAccept)
public boolean isIgnoreCase()
public void setIgnoreCase(boolean ignoreCase)
public void setFilterFromPath(boolean filterFromPath)