java.lang.Object
  org.apache.hadoop.conf.Configured
    org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer
public class RegexURLNormalizer extends Configured implements URLNormalizer
Applies regular-expression substitutions to any or all encountered URLs, which is useful for stripping session IDs from URLs.
This class reads the urlnormalizer.regex.file property, which should be set to the name of an XML file containing the patterns and substitutions to apply to encountered URLs.
This class also supports different rules depending on the scope. See the Javadoc for URLNormalizers for more details.
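The substitution idea described above can be sketched with the standard java.util.regex package. This is a minimal illustration, not the class's actual implementation: the RegexRuleSketch class, its Rule type, the applyRules method, and the jsessionid pattern are all hypothetical, and the rules are hard-coded here rather than loaded from the file named by urlnormalizer.regex.file.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class RegexRuleSketch {

    /** Hypothetical rule type: a compiled pattern plus its replacement text. */
    public static final class Rule {
        final Pattern pattern;
        final String substitution;

        public Rule(String regex, String substitution) {
            this.pattern = Pattern.compile(regex);
            this.substitution = substitution;
        }
    }

    /** Applies every rule in order; each rule sees the output of the previous one. */
    public static String applyRules(String url, List<Rule> rules) {
        String result = url;
        for (Rule rule : rules) {
            result = rule.pattern.matcher(result).replaceAll(rule.substitution);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Rule> rules = new ArrayList<>();
        // Illustrative rule (an assumption, not taken from any shipped rules file):
        // drop a ";jsessionid=..." path parameter, case-insensitively.
        rules.add(new Rule("(?i);jsessionid=[0-9a-z]+", ""));

        // prints "http://example.com/page?x=1"
        System.out.println(applyRules("http://example.com/page;jsessionid=ABC123?x=1", rules));
    }
}
```

Because the rules are applied in sequence, their order matters: a later rule operates on the URL as already rewritten by earlier ones.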
Field Summary |
---|
Fields inherited from interface org.apache.nutch.net.URLNormalizer |
---|
X_POINT_ID |
Constructor Summary | |
---|---|
RegexURLNormalizer()
Default constructor, called via normalizerClass.newInstance() in UrlNormalizerFactory.getNormalizer(). |
|
RegexURLNormalizer(Configuration conf)
|
|
RegexURLNormalizer(Configuration conf, String filename)
Constructor that takes the rules file name directly instead of looking it up in the configuration. |
Method Summary | |
---|---|
HashMap<String,List<org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer.Rule>> |
getScopedRules()
|
static void |
main(String[] args)
Prints the patterns and substitutions found in the configuration file. |
String |
normalize(String urlString, String scope)
|
String |
regexNormalize(String urlString, String scope)
Performs the replacements by iterating through all the regex patterns. |
void |
setConf(Configuration conf)
|
Methods inherited from class org.apache.hadoop.conf.Configured |
---|
getConf |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Methods inherited from interface org.apache.hadoop.conf.Configurable |
---|
getConf |
Constructor Detail |
---|
public RegexURLNormalizer()
public RegexURLNormalizer(Configuration conf)
public RegexURLNormalizer(Configuration conf, String filename) throws IOException, PatternSyntaxException
Throws:
IOException
PatternSyntaxException
Method Detail |
---|
public HashMap<String,List<org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer.Rule>> getScopedRules()
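The return type above suggests a map from scope name to an ordered rule list. The sketch below shows one way such a map could drive scope-dependent normalization. Everything in it is an assumption made for illustration: the ScopedRulesSketch class, its Rule type, the scope names "default" and "inject", and in particular the fallback to the default scope when a scope has no rules of its own. Consult the URLNormalizers Javadoc for the behaviour the class really has.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

public class ScopedRulesSketch {

    /** Assumed name of the fallback scope. */
    public static final String DEFAULT_SCOPE = "default";

    /** Hypothetical rule type: a compiled pattern plus its replacement text. */
    public static final class Rule {
        final Pattern pattern;
        final String substitution;

        public Rule(String regex, String substitution) {
            this.pattern = Pattern.compile(regex);
            this.substitution = substitution;
        }
    }

    /** Picks the rules for a scope, falling back to the default scope (assumed behaviour). */
    public static List<Rule> rulesFor(Map<String, List<Rule>> scoped, String scope) {
        List<Rule> rules = scoped.get(scope);
        if (rules != null) {
            return rules;
        }
        List<Rule> fallback = scoped.get(DEFAULT_SCOPE);
        return fallback != null ? fallback : Collections.<Rule>emptyList();
    }

    /** Applies the selected scope's rules in order. */
    public static String normalize(Map<String, List<Rule>> scoped, String url, String scope) {
        String result = url;
        for (Rule rule : rulesFor(scoped, scope)) {
            result = rule.pattern.matcher(result).replaceAll(rule.substitution);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, List<Rule>> scoped = new HashMap<>();
        // Illustrative rules only: strip session IDs by default, strip fragments for "inject".
        scoped.put(DEFAULT_SCOPE, Arrays.asList(new Rule("(?i);jsessionid=[0-9a-z]+", "")));
        scoped.put("inject", Arrays.asList(new Rule("#.*$", "")));

        // "inject" has its own rules: prints "http://example.com/a"
        System.out.println(normalize(scoped, "http://example.com/a#top", "inject"));
        // "fetcher" has none, so the default rules apply: prints "http://example.com/a#top"
        System.out.println(normalize(scoped, "http://example.com/a;jsessionid=1#top", "fetcher"));
    }
}
```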
public void setConf(Configuration conf)
Specified by:
setConf in interface Configurable
Overrides:
setConf in class Configured
public String regexNormalize(String urlString, String scope)
public String normalize(String urlString, String scope) throws MalformedURLException
Specified by:
normalize in interface URLNormalizer
Throws:
MalformedURLException
public static void main(String[] args) throws PatternSyntaxException, IOException
Throws:
PatternSyntaxException
IOException