java.lang.Object
  org.apache.nutch.parse.tika.DOMContentUtils
public class DOMContentUtils
A collection of methods for extracting content from DOM trees. This class holds a few utility methods for pulling content out of DOM nodes, such as getOutlinks, getText, etc.
| Constructor Summary | |
|---|---|
| DOMContentUtils(Configuration conf) | |
| Method Summary | |
|---|---|
| void | getOutlinks(URL base, ArrayList outlinks, Node node) This method finds all anchors below the supplied DOM node, and creates appropriate Outlink records for each (relative to the supplied base URL), and adds them to the outlinks ArrayList. |
| void | getText(StringBuffer sb, Node node) This is a convenience method, equivalent to getText(sb, node, false). |
| boolean | getTitle(StringBuffer sb, Node node) This method takes a StringBuffer and a DOM Node, and will append the content text found beneath the first title node to the StringBuffer. |
| void | setConf(Configuration conf) |
| Methods inherited from class java.lang.Object |
|---|
| clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
| Constructor Detail |
|---|
public DOMContentUtils(Configuration conf)
| Method Detail |
|---|
public void setConf(Configuration conf)
public void getText(StringBuffer sb,
Node node)
This is a convenience method, equivalent to getText(sb, node, false).
public boolean getTitle(StringBuffer sb,
Node node)
This method takes a StringBuffer and a DOM Node,
and will append the content text found beneath the first
title node to the StringBuffer.
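The behaviour described above can be illustrated with a minimal sketch that uses only the JDK's DOM classes. It is not the Nutch implementation: the class name `TitleSketch` and the `parse` helper are hypothetical, and the meaning of the boolean return value (true when a title node was found) is an assumption, since this page does not state it.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical stand-in for illustration; not part of Nutch.
class TitleSketch {

    /** Parses well-formed markup into a DOM tree (hypothetical helper). */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("markup is not well-formed", e);
        }
    }

    /**
     * Appends the text beneath the first title element found at or below
     * node. Returning true when a title was found is an assumption.
     */
    static boolean getTitle(StringBuffer sb, Node node) {
        if (node.getNodeType() == Node.ELEMENT_NODE
                && "title".equalsIgnoreCase(node.getNodeName())) {
            sb.append(node.getTextContent().trim());
            return true;
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            if (getTitle(sb, children.item(i))) {
                return true; // stop at the first title in document order
            }
        }
        return false;
    }
}
```

The depth-first walk stops as soon as one title has been appended, so any later title elements in the tree are ignored.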
public void getOutlinks(URL base,
ArrayList outlinks,
Node node)
This method finds all anchors below the supplied DOM node, and creates appropriate Outlink
records for each (relative to the supplied base
URL), and adds them to the outlinks ArrayList.
Links without inner structure (tags, text, etc) are discarded, as are links which contain only single nested links and empty text nodes (this is a common DOM-fixup artifact, at least with nekohtml).
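As a rough illustration of the anchor walk described above, the following sketch resolves each `href` against a base URL using only JDK classes. It is not the Nutch implementation: `OutlinkSketch`, `addLinks`, `linksIn` and `parse` are hypothetical names, plain URL strings stand in for Outlink records, and only the first discard rule (anchors with no inner structure) is modelled.

```java
import java.io.StringReader;
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical stand-in for illustration; not part of Nutch.
class OutlinkSketch {

    /** Parses well-formed markup into a DOM tree (hypothetical helper). */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("markup is not well-formed", e);
        }
    }

    /** Adds the resolved target of every non-empty anchor at or below node. */
    static void addLinks(URL base, List<String> outlinks, Node node) {
        if (node.getNodeType() == Node.ELEMENT_NODE
                && "a".equalsIgnoreCase(node.getNodeName())) {
            Element anchor = (Element) node;
            String href = anchor.getAttribute("href");
            // Anchors with no children (no tags, no text) are discarded.
            if (!href.isEmpty() && anchor.hasChildNodes()) {
                try {
                    outlinks.add(base.toURI().resolve(href).toString());
                } catch (URISyntaxException | IllegalArgumentException e) {
                    // Skip targets that cannot be resolved against the base.
                }
            }
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            addLinks(base, outlinks, children.item(i));
        }
    }

    /** Convenience wrapper: parses markup and collects its links. */
    static List<String> linksIn(String base, String markup) {
        List<String> outlinks = new ArrayList<>();
        try {
            addLinks(URI.create(base).toURL(), outlinks, parse(markup));
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("bad base URL: " + base, e);
        }
        return outlinks;
    }
}
```

For example, with the base `http://example.com/dir/page.html`, an anchor pointing at `a.html` resolves to `http://example.com/dir/a.html`, while an empty `<a href="empty.html"/>` element is skipped.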