org.apache.nutch.parse
Class OutlinkExtractor

java.lang.Object
  extended by org.apache.nutch.parse.OutlinkExtractor

public class OutlinkExtractor
extends Object

Extracts Outlinks (URLs) from plain text using regular expressions.

Since:
0.7
Version:
1.0
Author:
Stephan Strittmatter - http://www.sybit.de
See Also:
Comparison of different regexp-Implementations, Overview about Java Regexp APIs

Constructor Summary
OutlinkExtractor()
           
 
Method Summary
static Outlink[] getOutlinks(String plainText, Configuration conf)
          Extracts Outlinks from the given plain text.
static Outlink[] getOutlinks(String plainText, String anchor, Configuration conf)
          Extracts Outlinks from the given plain text and adds the anchor to each extracted Outlink.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

OutlinkExtractor

public OutlinkExtractor()
Method Detail

getOutlinks

public static Outlink[] getOutlinks(String plainText,
                                    Configuration conf)
Extracts Outlinks from the given plain text. Applying this method to input that is not plain text can result in extremely long runtimes in pathological cases (PostScript is a known example).

Parameters:
plainText - the plain text from which URLs should be extracted.
Returns:
Array of Outlinks found in plainText
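
The regular-expression extraction described above can be illustrated with a small standalone sketch. It is not the Nutch implementation: the class name PlainTextUrlSketch, the method extractUrls, and the URL pattern are all assumptions made for illustration, and the pattern is deliberately simpler than anything a production crawler would use.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch only; this is NOT the Nutch OutlinkExtractor source.
class PlainTextUrlSketch {

    // Hypothetical, deliberately simple pattern: a known scheme followed by
    // characters that are not whitespace, angle brackets, or double quotes.
    private static final Pattern URL =
        Pattern.compile("\\b(?:https?|ftp)://[^\\s<>\"]+");

    /** Returns every URL-like substring of plainText, in order of appearance. */
    static List<String> extractUrls(String plainText) {
        List<String> urls = new ArrayList<>();
        if (plainText == null) {
            return urls; // nothing to scan
        }
        Matcher m = URL.matcher(plainText);
        while (m.find()) {
            urls.add(m.group());
        }
        return urls;
    }

    public static void main(String[] args) {
        String text =
            "See http://nutch.apache.org/ and ftp://example.org/file.txt for details";
        // prints [http://nutch.apache.org/, ftp://example.org/file.txt]
        System.out.println(extractUrls(text));
    }
}
```

Because the scan is a plain regular-expression search over the whole input, feeding it large non-text content means scanning all of it, which is one reason to restrict such extraction to genuine plain text.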

getOutlinks

public static Outlink[] getOutlinks(String plainText,
                                    String anchor,
                                    Configuration conf)
Extracts Outlinks from the given plain text and adds the anchor to each extracted Outlink.

Parameters:
plainText - the plain text from which URLs should be extracted.
anchor - the anchor text to attach to each extracted URL
Returns:
Array of Outlinks found in plainText
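
As a rough picture of what attaching an anchor to every extracted link could look like, the following standalone sketch pairs each matched URL with the caller-supplied anchor text. The Link class is a hypothetical stand-in for Outlink, and the class name AnchoredLinkSketch, the method extractLinks, the URL pattern, and the handling of a null anchor are all assumptions of this sketch rather than documented Nutch behavior.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch only; this is NOT the Nutch OutlinkExtractor source.
class AnchoredLinkSketch {

    /** Hypothetical stand-in for Outlink: a target URL plus anchor text. */
    static final class Link {
        final String toUrl;
        final String anchor;

        Link(String toUrl, String anchor) {
            this.toUrl = toUrl;
            this.anchor = anchor;
        }

        @Override
        public String toString() {
            return "toUrl: " + toUrl + " anchor: " + anchor;
        }
    }

    // Hypothetical, deliberately simple URL pattern (an assumption of this sketch).
    private static final Pattern URL =
        Pattern.compile("\\b(?:https?|ftp)://[^\\s<>\"]+");

    /** Extracts URL-like substrings and gives each one the same anchor text. */
    static Link[] extractLinks(String plainText, String anchor) {
        // Assumption of this sketch: a missing anchor becomes an empty string.
        String safeAnchor = (anchor == null) ? "" : anchor;
        List<Link> links = new ArrayList<>();
        if (plainText != null) {
            Matcher m = URL.matcher(plainText);
            while (m.find()) {
                links.add(new Link(m.group(), safeAnchor));
            }
        }
        return links.toArray(new Link[0]);
    }

    public static void main(String[] args) {
        Link[] links = extractLinks(
            "Mirrors: http://example.org/a http://example.org/b", "mirror list");
        for (Link link : links) {
            System.out.println(link);
        }
    }
}
```

Every link found in one call shares the same anchor, which mirrors the signature above: the anchor is a single String argument, not something derived per URL from the text.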


Copyright © 2012 The Apache Software Foundation