org.apache.nutch.parse
Class OutlinkExtractor
java.lang.Object
org.apache.nutch.parse.OutlinkExtractor
public class OutlinkExtractor
- extends Object
Extracts Outlinks (URLs) from plain text using regular expressions.
- Since:
- 0.7
- Version:
- 1.0
- Author:
- Stephan Strittmatter - http://www.sybit.de
- See Also:
- Comparison of different regexp-Implementations, Overview about Java Regexp APIs
Methods inherited from class java.lang.Object:
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
OutlinkExtractor
public OutlinkExtractor()
getOutlinks
public static Outlink[] getOutlinks(String plainText,
Configuration conf)
- Extracts Outlinks from the given plain text.
Applying this method to input that is not plain text can result in extremely
long runtimes in pathological cases (PostScript is a known example).
- Parameters:
plainText - the plain text from which URLs should be extracted.
- Returns:
- Array of Outlinks found in plainText
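The regular-expression extraction described for getOutlinks can be illustrated with a minimal, self-contained sketch. It does not use the Nutch classes. The class name PlainTextUrlSketch, the method extractUrls, and the pattern itself are assumptions made for illustration, and the expression Nutch actually applies may differ.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PlainTextUrlSketch {

    // Illustrative pattern only (an assumption, not the expression Nutch uses):
    // a scheme, "://", then a run of characters that are not whitespace
    // and not one of a few common delimiters.
    private static final Pattern URL_PATTERN = Pattern.compile(
            "\\b(?:https?|ftp)://[^\\s<>\"')]+", Pattern.CASE_INSENSITIVE);

    /** Returns every substring of plainText that matches URL_PATTERN, in order. */
    public static List<String> extractUrls(String plainText) {
        List<String> urls = new ArrayList<>();
        if (plainText == null) {
            return urls; // nothing to scan
        }
        Matcher matcher = URL_PATTERN.matcher(plainText);
        while (matcher.find()) {
            urls.add(matcher.group());
        }
        return urls;
    }

    public static void main(String[] args) {
        String text = "See http://nutch.apache.org/ and ftp://example.org/file.txt for details";
        for (String url : extractUrls(text)) {
            System.out.println(url);
        }
    }
}
```

A simple pattern like this one scans the input in a single pass. More permissive expressions with nested quantifiers can backtrack heavily on binary or markup-heavy input, which is one plausible reason for the long runtimes noted above.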
getOutlinks
public static Outlink[] getOutlinks(String plainText,
String anchor,
Configuration conf)
- Extracts Outlinks from the given plain text and adds anchor
to each extracted Outlink.
- Parameters:
plainText - the plain text from which URLs should be extracted.
anchor - the anchor of the URL
- Returns:
- Array of Outlinks found in plainText
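The anchor variant can be pictured as pairing the same anchor text with every URL found. The sketch below is a stand-alone illustration under stated assumptions: the Link class is a hypothetical stand-in for Outlink, and extractLinks, the pattern, and the choice to treat a null anchor as an empty string are all assumptions rather than documented Nutch behavior.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AnchoredLinkSketch {

    /** Hypothetical stand-in for an outlink: a target URL plus anchor text. */
    public static final class Link {
        public final String toUrl;
        public final String anchor;

        Link(String toUrl, String anchor) {
            this.toUrl = toUrl;
            this.anchor = anchor;
        }
    }

    // Illustrative pattern only (an assumption): http or https followed by non-whitespace.
    private static final Pattern URL_PATTERN = Pattern.compile("\\bhttps?://\\S+");

    /** Finds URLs in plainText and attaches the same anchor to each one. */
    public static Link[] extractLinks(String plainText, String anchor) {
        List<Link> links = new ArrayList<>();
        // Assumption for this sketch: a missing anchor becomes an empty string.
        String safeAnchor = (anchor == null) ? "" : anchor;
        if (plainText != null) {
            Matcher matcher = URL_PATTERN.matcher(plainText);
            while (matcher.find()) {
                links.add(new Link(matcher.group(), safeAnchor));
            }
        }
        return links.toArray(new Link[0]);
    }

    public static void main(String[] args) {
        Link[] links = extractLinks("a http://a.example/x b https://b.example/y", "docs");
        for (Link link : links) {
            System.out.println(link.toUrl + " [" + link.anchor + "]");
        }
    }
}
```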
Copyright © 2012 The Apache Software Foundation