org.apache.nutch.net.urlnormalizer.regex
Class RegexURLNormalizer

java.lang.Object
  extended by org.apache.hadoop.conf.Configured
      extended by org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer
All Implemented Interfaces:
Configurable, URLNormalizer

public class RegexURLNormalizer
extends Configured
implements URLNormalizer

Applies regex substitutions to every URL that is encountered. This is useful, for example, for stripping session IDs from URLs.

This class uses the urlnormalizer.regex.file property, which should be set to the name of an XML file containing the patterns and substitutions to apply to encountered URLs.
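
As an illustration of the kind of file this property points to, the following sketch reads pattern/substitution pairs from an XML string using only the JDK's DOM parser. The element names (regex-normalize, regex, pattern, substitution) are an assumption about the file layout, and RuleFileSketch and readRules are hypothetical names for this example; none of this is the class's own loading code.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class RuleFileSketch {
    // Assumed layout: a <regex-normalize> root holding <regex> entries,
    // each with one <pattern> and one <substitution> child.
    static final String SAMPLE =
        "<regex-normalize>"
      + "<regex><pattern>;jsessionid=[^?#]*</pattern><substitution></substitution></regex>"
      + "<regex><pattern>&amp;amp;</pattern><substitution>&amp;</substitution></regex>"
      + "</regex-normalize>";

    /** Returns the rules in file order as {pattern, substitution} pairs. */
    static List<String[]> readRules(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList entries = doc.getDocumentElement().getElementsByTagName("regex");
            List<String[]> rules = new ArrayList<>();
            for (int i = 0; i < entries.getLength(); i++) {
                Element entry = (Element) entries.item(i);
                String pattern =
                    entry.getElementsByTagName("pattern").item(0).getTextContent();
                String substitution =
                    entry.getElementsByTagName("substitution").item(0).getTextContent();
                rules.add(new String[] { pattern, substitution });
            }
            return rules;
        } catch (Exception e) {
            // Wrap parser and I/O failures so callers need no checked exceptions.
            throw new IllegalStateException("Could not read rules", e);
        }
    }

    public static void main(String[] args) {
        for (String[] rule : readRules(SAMPLE)) {
            System.out.println("pattern=" + rule[0] + " substitution=" + rule[1]);
        }
    }
}
```

An empty substitution element yields an empty string, so a matching rule simply deletes the matched text.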

This class also supports different rules depending on the scope. Please see the javadoc in URLNormalizers for more details.
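
One way to picture scope-dependent rules is a map from scope name to an ordered rule list, which matches the shape returned by getScopedRules(). The sketch below is a standalone approximation: the scope names, the fallback to a "default" scope for unknown scopes, and the ScopedRulesSketch class itself are assumptions made for illustration, not behavior confirmed by this page.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ScopedRulesSketch {
    // Assumed name of the fallback scope (hypothetical for this sketch).
    static final String DEFAULT_SCOPE = "default";

    // Scope name -> ordered {pattern, substitution} pairs.
    static final Map<String, List<String[]>> SCOPED_RULES = new HashMap<>();
    static {
        SCOPED_RULES.put(DEFAULT_SCOPE, rules(";jsessionid=[^?#]*", ""));
        SCOPED_RULES.put("inject", rules("#.*$", ""));
    }

    /** Builds a rule list from alternating pattern/substitution arguments. */
    static List<String[]> rules(String... flat) {
        List<String[]> out = new ArrayList<>();
        for (int i = 0; i + 1 < flat.length; i += 2) {
            out.add(new String[] { flat[i], flat[i + 1] });
        }
        return out;
    }

    /** Applies the rules for the given scope, or the default rules if none exist. */
    static String normalize(String url, String scope) {
        List<String[]> selected = SCOPED_RULES.get(scope);
        if (selected == null) {
            selected = SCOPED_RULES.get(DEFAULT_SCOPE); // assumed fallback
        }
        String result = url;
        for (String[] rule : selected) {
            result = result.replaceAll(rule[0], rule[1]);
        }
        return result;
    }

    public static void main(String[] args) {
        String url = "http://example.com/a;jsessionid=1#top";
        System.out.println(normalize(url, "inject"));  // uses the "inject" rules
        System.out.println(normalize(url, "fetcher")); // no such scope: default rules
    }
}
```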

Author:
Luke Baker, Andrzej Bialecki

Field Summary
 
Fields inherited from interface org.apache.nutch.net.URLNormalizer
X_POINT_ID
 
Constructor Summary
RegexURLNormalizer()
          Default constructor, called reflectively (normalizerClass.newInstance()) from UrlNormalizerFactory.getNormalizer().
RegexURLNormalizer(Configuration conf)
           
RegexURLNormalizer(Configuration conf, String filename)
          Constructor that takes the rules file name directly instead of looking it up in the configuration files.
 
Method Summary
 HashMap<String,List<org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer.Rule>> getScopedRules()
           
static void main(String[] args)
          Prints the patterns and substitutions found in the configuration file.
 String normalize(String urlString, String scope)
           
 String regexNormalize(String urlString, String scope)
          Performs the replacements by iterating through all the regex patterns.
 void setConf(Configuration conf)
           
 
Methods inherited from class org.apache.hadoop.conf.Configured
getConf
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 
Methods inherited from interface org.apache.hadoop.conf.Configurable
getConf
 

Constructor Detail

RegexURLNormalizer

public RegexURLNormalizer()
Default constructor, called reflectively (normalizerClass.newInstance()) from UrlNormalizerFactory.getNormalizer().


RegexURLNormalizer

public RegexURLNormalizer(Configuration conf)

RegexURLNormalizer

public RegexURLNormalizer(Configuration conf,
                          String filename)
                   throws IOException,
                          PatternSyntaxException
Constructor that takes the rules file name directly instead of looking it up in the configuration files.

Throws:
IOException
PatternSyntaxException
Method Detail

getScopedRules

public HashMap<String,List<org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer.Rule>> getScopedRules()

setConf

public void setConf(Configuration conf)
Specified by:
setConf in interface Configurable
Overrides:
setConf in class Configured

regexNormalize

public String regexNormalize(String urlString,
                             String scope)
Performs the replacements by iterating through all the regex patterns. Takes a URL string as input and returns the altered string.
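
The iteration described above can be sketched with plain java.util.regex, independent of this class. The sketch assumes each rule is applied in order with replaceAll and that later rules see the output of earlier ones; RegexNormalizeSketch, its Rule type, and the two sample rules are hypothetical stand-ins, not the class's internals.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class RegexNormalizeSketch {
    /** One compiled pattern plus its substitution (hypothetical stand-in type). */
    static final class Rule {
        final Pattern pattern;
        final String substitution;

        Rule(String regex, String substitution) {
            this.pattern = Pattern.compile(regex);
            this.substitution = substitution;
        }
    }

    /** Applies every rule in order; each rule works on the previous rule's result. */
    static String applyRules(String url, List<Rule> rules) {
        String result = url;
        for (Rule rule : rules) {
            result = rule.pattern.matcher(result).replaceAll(rule.substitution);
        }
        return result;
    }

    /** Sample rules: strip a jsessionid path parameter, then a dangling '?' or '&'. */
    static List<Rule> sampleRules() {
        List<Rule> rules = new ArrayList<>();
        rules.add(new Rule("(?i);jsessionid=[^?#]*", ""));
        rules.add(new Rule("[?&]$", ""));
        return rules;
    }

    public static void main(String[] args) {
        String url = "http://example.com/page;jsessionid=ABC123?x=1";
        System.out.println(applyRules(url, sampleRules()));
        // prints http://example.com/page?x=1
    }
}
```

Because the rules run sequentially, their order in the list matters: a cleanup rule such as the trailing '?' remover only makes sense after the rules that may leave one behind.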


normalize

public String normalize(String urlString,
                        String scope)
                 throws MalformedURLException
Specified by:
normalize in interface URLNormalizer
Throws:
MalformedURLException

main

public static void main(String[] args)
                 throws PatternSyntaxException,
                        IOException
Prints the patterns and substitutions found in the configuration file.

Throws:
PatternSyntaxException
IOException


Copyright © 2012 The Apache Software Foundation