java.lang.Object
  org.apache.nutch.parse.html.DOMContentUtils
public class DOMContentUtils
A collection of methods for extracting content from DOM trees. This class holds a few utility methods for pulling content out of DOM nodes, such as getOutlinks, getText, etc.
| Nested Class Summary | |
|---|---|
static class |
DOMContentUtils.LinkParams
|
| Constructor Summary | |
|---|---|
DOMContentUtils(Configuration conf)
|
|
| Method Summary | |
|---|---|
URL |
getBase(Node node)
If Node contains a BASE tag then its HREF is returned. |
void |
getOutlinks(URL base,
ArrayList<Outlink> outlinks,
Node node)
This method finds all anchors below the supplied DOM node, and creates appropriate Outlink
records for each (relative to the supplied base
URL), and adds them to the outlinks ArrayList. |
void |
getText(StringBuilder sb,
Node node)
This is a convenience method, equivalent to getText(sb, node, false). |
boolean |
getText(StringBuilder sb,
Node node,
boolean abortOnNestedAnchors)
This method takes a StringBuilder and a DOM Node,
and will append all the content text found beneath the DOM node to
the StringBuilder. |
boolean |
getTitle(StringBuilder sb,
Node node)
This method takes a StringBuilder and a DOM Node,
and will append the content text found beneath the first
title node to the StringBuilder. |
void |
setConf(Configuration conf)
|
| Methods inherited from class java.lang.Object |
|---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
| Constructor Detail |
|---|
public DOMContentUtils(Configuration conf)
| Method Detail |
|---|
public void setConf(Configuration conf)
public boolean getText(StringBuilder sb,
Node node,
boolean abortOnNestedAnchors)
This method takes a StringBuilder and a DOM Node,
and will append all the content text found beneath the DOM node to
the StringBuilder.
If abortOnNestedAnchors is true, DOM traversal will
be aborted and the StringBuilder will not contain
any text encountered after a nested anchor is found.
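The traversal described above can be illustrated with a minimal sketch that uses only the JDK's `org.w3c.dom` API. This is not the Nutch implementation: the class `TextWalkSketch`, the helpers `parse` and `collectText`, the whitespace handling, and the meaning of the boolean result (true when traversal stopped at a nested anchor) are all assumptions made for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class TextWalkSketch {

    // Hypothetical helper: parse a well-formed XHTML fragment into a DOM tree.
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    // Appends the text found beneath node to sb.
    // Assumption: returns true when traversal was aborted at a nested anchor.
    static boolean collectText(StringBuilder sb, Node node, boolean abortOnNestedAnchors) {
        return walk(sb, node, abortOnNestedAnchors, 0);
    }

    private static boolean walk(StringBuilder sb, Node node, boolean abort, int anchorDepth) {
        if (node.getNodeType() == Node.ELEMENT_NODE
                && "a".equalsIgnoreCase(node.getNodeName())) {
            anchorDepth++;
            // An anchor inside another anchor: stop before collecting its text.
            if (abort && anchorDepth > 1) {
                return true;
            }
        }
        if (node.getNodeType() == Node.TEXT_NODE) {
            String text = node.getNodeValue().trim();
            if (!text.isEmpty()) {
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(text);
            }
        }
        // Depth-first walk in document order; propagate an abort upwards.
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            if (walk(sb, child, abort, anchorDepth)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<a href=\"x\">outer <a href=\"y\">inner</a> tail</a>");

        StringBuilder all = new StringBuilder();
        boolean abortedAll = collectText(all, doc.getDocumentElement(), false);
        System.out.println(all + " | aborted=" + abortedAll);

        StringBuilder cut = new StringBuilder();
        boolean abortedCut = collectText(cut, doc.getDocumentElement(), true);
        System.out.println(cut + " | aborted=" + abortedCut);
    }
}
```

In this sketch the second call stops as soon as it reaches the inner anchor, so the builder holds only the text seen before that point.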
public void getText(StringBuilder sb,
Node node)
This is a convenience method, equivalent to getText(sb, node, false).
public boolean getTitle(StringBuilder sb,
Node node)
This method takes a StringBuilder and a DOM Node,
and will append the content text found beneath the first
title node to the StringBuilder.
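As a rough illustration of the title lookup, the sketch below finds the first `title` element in document order and appends its text. It relies only on the JDK DOM API; the class `TitleSketch`, the helpers `parse`, `firstTitle` and `appendTitle`, the trimming of the text, and the choice to return true when a title is found are assumptions, not the documented behavior of `getTitle`.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class TitleSketch {

    // Hypothetical helper: parse a well-formed XHTML string into a DOM tree.
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    // Depth-first search for the first <title> element at or beneath node.
    static Node firstTitle(Node node) {
        if (node.getNodeType() == Node.ELEMENT_NODE
                && "title".equalsIgnoreCase(node.getNodeName())) {
            return node;
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            Node found = firstTitle(child);
            if (found != null) {
                return found;
            }
        }
        return null;
    }

    // Appends the first title's text to sb.
    // Assumption: returns true only when a title element was found.
    static boolean appendTitle(StringBuilder sb, Node node) {
        Node title = firstTitle(node);
        if (title == null) {
            return false;
        }
        sb.append(title.getTextContent().trim());
        return true;
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(
                "<html><head><title> First </title></head>"
                + "<body><title>Second</title></body></html>");
        StringBuilder sb = new StringBuilder();
        boolean found = appendTitle(sb, doc);
        System.out.println(found + ": " + sb);
    }
}
```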
public URL getBase(Node node)
public void getOutlinks(URL base,
ArrayList<Outlink> outlinks,
Node node)
This method finds all anchors below the supplied DOM node, and creates appropriate Outlink
records for each (relative to the supplied base
URL), and adds them to the outlinks ArrayList.
Links without inner structure (tags, text, etc) are discarded, as are links which contain only single nested links and empty text nodes (this is a common DOM-fixup artifact, at least with nekohtml).
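The anchor collection described above can be sketched with the JDK alone. Several things here are assumptions made for illustration: the `Link` class stands in for Nutch's `Outlink`, `java.net.URI` is used instead of `URL` for resolving relative references, `example.com` is a placeholder host, and only the "no inner structure" rule is implemented (the nested-link rule is left out).

```java
import java.io.StringReader;
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class OutlinkSketch {

    // Hypothetical stand-in for an outlink record: resolved URL plus anchor text.
    static final class Link {
        final String url;
        final String anchor;

        Link(String url, String anchor) {
            this.url = url;
            this.anchor = anchor;
        }
    }

    // Hypothetical helper: parse a well-formed XHTML fragment into a DOM tree.
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    // Walks the tree below node and adds one Link per usable anchor.
    static void collectLinks(URI base, List<Link> out, Node node) {
        if (node.getNodeType() == Node.ELEMENT_NODE
                && "a".equalsIgnoreCase(node.getNodeName())) {
            Element a = (Element) node;
            String href = a.getAttribute("href"); // empty string when absent
            // Skip anchors with no href or no inner structure at all.
            if (!href.isEmpty() && a.hasChildNodes()) {
                String resolved = base.resolve(href).toString();
                out.add(new Link(resolved, a.getTextContent().trim()));
            }
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            collectLinks(base, out, child);
        }
    }

    public static void main(String[] args) throws Exception {
        URI base = URI.create("http://example.com/docs/index.html");
        Document doc = parse(
                "<body><a href=\"about.html\">About</a>"
                + "<a href=\"empty.html\"/>"
                + "<p><a href=\"/top\">Top</a></p></body>");
        List<Link> links = new ArrayList<>();
        collectLinks(base, links, doc.getDocumentElement());
        for (Link link : links) {
            System.out.println(link.url + " -> " + link.anchor);
        }
    }
}
```

The empty `empty.html` anchor is dropped because it has no child nodes, while the other two are resolved against the base.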