OutlinkExtractor (apache-nutch 2.0 API)

org.apache.nutch.parse
Class OutlinkExtractor

java.lang.Object
  org.apache.nutch.parse.OutlinkExtractor

public class OutlinkExtractor
extends Object

Extractor to extract Outlinks / URLs from plain text using Regular Expressions.

Since: 0.7
Version: 1.0
Author: Stephan Strittmatter - http://www.sybit.de
See Also: Comparison of different regexp-Implementations, Overview about Java Regexp APIs

Constructor Summary
`OutlinkExtractor()`

Method Summary
`static Outlink[]`	`getOutlinks(String plainText, Configuration conf)` Extracts `Outlink` from given plain text.
`static Outlink[]`	`getOutlinks(String plainText, String anchor, Configuration conf)` Extracts `Outlink` from given plain text and adds anchor to the extracted `Outlink`s

Methods inherited from class java.lang.Object
`clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait`

Constructor Detail

OutlinkExtractor

public OutlinkExtractor()

Method Detail

getOutlinks

public static Outlink[] getOutlinks(String plainText,
                                    Configuration conf)

Extracts Outlinks from the given plain text. Applying this method to input that is not plain text can result in extremely long runtimes in pathological cases (PostScript is a known example).

Parameters: plainText - the plain text from which URLs should be extracted.
Returns: Array of Outlinks found in plainText
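
The following is a minimal sketch of regular-expression-based URL extraction from plain text, included only to illustrate the technique this method is described as using. It is not the Nutch implementation: the class name `PlainTextUrlSketch`, the URL pattern, the trailing-punctuation rule, and the `budgetMillis` parameter are all assumptions made for this example, and it returns plain strings rather than `Outlink` objects so that it runs without Nutch on the classpath.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative sketch only; not the Nutch OutlinkExtractor. */
final class PlainTextUrlSketch {

    // Assumed pattern: an http, https or ftp scheme, "://", then a run of
    // characters that are not whitespace or common enclosing delimiters.
    private static final Pattern URL_PATTERN =
        Pattern.compile("(?i)\\b(?:https?|ftp)://[^\\s<>\"')\\]]+");

    /**
     * Returns the URLs found in plainText, in order of appearance.
     * budgetMillis is a hypothetical guard: the scan stops once the budget
     * is exceeded. It is checked only between matches, so it cannot
     * interrupt a single slow match attempt.
     */
    static List<String> extractUrls(String plainText, long budgetMillis) {
        List<String> urls = new ArrayList<>();
        if (plainText == null) {
            return urls;
        }
        long start = System.nanoTime();
        Matcher matcher = URL_PATTERN.matcher(plainText);
        while (matcher.find()) {
            // Drop sentence punctuation that directly follows the URL.
            urls.add(matcher.group().replaceAll("[.,;:!?]+$", ""));
            if ((System.nanoTime() - start) / 1_000_000L > budgetMillis) {
                break; // give up on input that takes too long to scan
            }
        }
        return urls;
    }

    public static void main(String[] args) {
        String text = "See http://nutch.apache.org/ and (https://example.org/a?b=1).";
        System.out.println(extractUrls(text, 1000L));
    }
}
```

With Nutch and Hadoop on the classpath, the corresponding call to the documented API would be `OutlinkExtractor.getOutlinks(plainText, conf)`, which returns an `Outlink[]`.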

getOutlinks

public static Outlink[] getOutlinks(String plainText,
                                    String anchor,
                                    Configuration conf)

Extracts Outlinks from the given plain text and adds the anchor to each extracted Outlink.

Parameters: plainText - the plain text from which URLs should be extracted.; anchor - the anchor of the URL
Returns: Array of Outlinks found in plainText
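
As a rough illustration of the anchor variant, the sketch below pairs every URL found in the text with the same caller-supplied anchor. The nested `Link` class is a hypothetical stand-in for `Outlink`, and the class name `AnchoredLinkSketch`, the URL pattern, and the choice to map a null anchor to the empty string are assumptions for this example rather than documented Nutch behavior.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative sketch only; not the Nutch OutlinkExtractor. */
final class AnchoredLinkSketch {

    /** Hypothetical stand-in for Outlink: a target URL plus anchor text. */
    static final class Link {
        final String toUrl;
        final String anchor;

        Link(String toUrl, String anchor) {
            this.toUrl = toUrl;
            this.anchor = anchor;
        }

        @Override
        public String toString() {
            return toUrl + " [" + anchor + "]";
        }
    }

    // Assumed pattern: scheme, "://", then characters up to whitespace or a quote.
    private static final Pattern URL_PATTERN =
        Pattern.compile("(?i)\\b(?:https?|ftp)://[^\\s<>\"']+");

    /** Returns one Link per URL found in plainText, each carrying the given anchor. */
    static Link[] extractLinks(String plainText, String anchor) {
        List<Link> links = new ArrayList<>();
        if (plainText != null) {
            String label = (anchor == null) ? "" : anchor; // assumption: null becomes ""
            Matcher matcher = URL_PATTERN.matcher(plainText);
            while (matcher.find()) {
                links.add(new Link(matcher.group(), label));
            }
        }
        return links.toArray(new Link[0]);
    }

    public static void main(String[] args) {
        String text = "Mirrors: http://a.example/x ftp://b.example/y";
        for (Link link : extractLinks(text, "mirror")) {
            System.out.println(link);
        }
    }
}
```

Returning an array mirrors the `Outlink[]` return type documented above; a list would work equally well in code that does not need to match that signature.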
