java.lang.Object org.apache.nutch.parse.html.DOMContentUtils
public class DOMContentUtils
A collection of methods for extracting content from DOM trees. This class holds a few utility methods for pulling content out of DOM nodes, such as getOutlinks, getText, etc.
Nested Class Summary | |
---|---|
static class | DOMContentUtils.LinkParams |
Constructor Summary | |
---|---|
DOMContentUtils(Configuration conf) | |
Method Summary | |
---|---|
URL | getBase(Node node): If Node contains a BASE tag then its HREF is returned. |
void | getOutlinks(URL base, ArrayList<Outlink> outlinks, Node node): This method finds all anchors below the supplied DOM node, creates appropriate Outlink records for each (relative to the supplied base URL), and adds them to the outlinks ArrayList. |
void | getText(StringBuilder sb, Node node): This is a convenience method, equivalent to getText(sb, node, false). |
boolean | getText(StringBuilder sb, Node node, boolean abortOnNestedAnchors): This method takes a StringBuilder and a DOM Node, and will append all the content text found beneath the DOM node to the StringBuilder. |
boolean | getTitle(StringBuilder sb, Node node): This method takes a StringBuilder and a DOM Node, and will append the content text found beneath the first title node to the StringBuilder. |
void | setConf(Configuration conf) |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Constructor Detail |
---|
public DOMContentUtils(Configuration conf)
Method Detail |
---|
public void setConf(Configuration conf)
public boolean getText(StringBuilder sb, Node node, boolean abortOnNestedAnchors)
This method takes a StringBuilder and a DOM Node, and will append all the content text found beneath the DOM node to the StringBuilder.
If abortOnNestedAnchors is true, DOM traversal will be aborted and the StringBuilder will not contain any text encountered after a nested anchor is found.
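The traversal described above can be sketched with nothing but the JDK's DOM API. This is an illustration, not the Nutch implementation: the class `TextSketch`, its `parse` helper, and `collectText` are hypothetical names, whitespace normalization is a choice made here, and the meaning of the boolean result (true when traversal was aborted) is an assumption, since the documentation does not state it.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class TextSketch {

  /** Parses a well-formed XML/XHTML string; checked exceptions are wrapped. */
  public static Document parse(String xml) {
    try {
      return DocumentBuilderFactory.newInstance()
          .newDocumentBuilder()
          .parse(new InputSource(new StringReader(xml)));
    } catch (Exception e) {
      throw new IllegalStateException(e);
    }
  }

  /**
   * Appends the text found beneath node to sb.
   * Assumption: returns true if traversal stopped at a nested anchor.
   */
  public static boolean collectText(StringBuilder sb, Node node,
                                    boolean abortOnNestedAnchors) {
    return walk(sb, node, abortOnNestedAnchors, 0);
  }

  private static boolean walk(StringBuilder sb, Node node,
                              boolean abortOnNestedAnchors, int anchorDepth) {
    if (node.getNodeType() == Node.TEXT_NODE) {
      // Collapse runs of whitespace and separate chunks with one space.
      String text = node.getNodeValue().replaceAll("\\s+", " ").trim();
      if (!text.isEmpty()) {
        if (sb.length() > 0) {
          sb.append(' ');
        }
        sb.append(text);
      }
      return false;
    }
    if (abortOnNestedAnchors
        && node.getNodeType() == Node.ELEMENT_NODE
        && "a".equalsIgnoreCase(node.getNodeName())) {
      anchorDepth++;
      if (anchorDepth > 1) {
        return true; // an anchor inside an anchor: stop collecting text
      }
    }
    for (Node child = node.getFirstChild(); child != null;
         child = child.getNextSibling()) {
      if (walk(sb, child, abortOnNestedAnchors, anchorDepth)) {
        return true; // propagate the abort to the top of the recursion
      }
    }
    return false;
  }
}
```

In this sketch, text that precedes the nested anchor is kept and everything after it is dropped, which matches the behavior described for abortOnNestedAnchors.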
public void getText(StringBuilder sb, Node node)
This is a convenience method, equivalent to getText(sb, node, false).
public boolean getTitle(StringBuilder sb, Node node)
This method takes a StringBuilder and a DOM Node, and will append the content text found beneath the first title node to the StringBuilder.
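A first-title lookup of this kind might look roughly like the following JDK-only sketch. The names `TitleSketch`, `parse`, and `appendTitle` are hypothetical, and treating the boolean result as "a title node was found" is an assumption; the documentation does not say what the return value means.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class TitleSketch {

  /** Parses a well-formed XML/XHTML string; checked exceptions are wrapped. */
  public static Document parse(String xml) {
    try {
      return DocumentBuilderFactory.newInstance()
          .newDocumentBuilder()
          .parse(new InputSource(new StringReader(xml)));
    } catch (Exception e) {
      throw new IllegalStateException(e);
    }
  }

  /**
   * Appends the text of the first title element (in document order) to sb.
   * Assumption: returns true if a title element was found.
   */
  public static boolean appendTitle(StringBuilder sb, Node node) {
    if (node.getNodeType() == Node.ELEMENT_NODE
        && "title".equalsIgnoreCase(node.getNodeName())) {
      // getTextContent() concatenates all descendant text nodes.
      sb.append(node.getTextContent().replaceAll("\\s+", " ").trim());
      return true;
    }
    for (Node child = node.getFirstChild(); child != null;
         child = child.getNextSibling()) {
      if (appendTitle(sb, child)) {
        return true; // stop at the first match, ignore later titles
      }
    }
    return false;
  }
}
```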
public URL getBase(Node node)
If Node contains a BASE tag then its HREF is returned.
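As a rough illustration of reading the HREF of a BASE tag from a DOM tree, here is a JDK-only sketch. `BaseSketch`, `parse`, and `findBase` are hypothetical names. Returning null when no usable BASE tag exists is an assumption; the documentation does not describe that case.

```java
import java.io.StringReader;
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class BaseSketch {

  /** Parses a well-formed XML/XHTML string; checked exceptions are wrapped. */
  public static Document parse(String xml) {
    try {
      return DocumentBuilderFactory.newInstance()
          .newDocumentBuilder()
          .parse(new InputSource(new StringReader(xml)));
    } catch (Exception e) {
      throw new IllegalStateException(e);
    }
  }

  /**
   * Returns the href of the first usable base element beneath node.
   * Assumption: null is returned when no such element exists.
   */
  public static URL findBase(Node node) {
    if (node.getNodeType() == Node.ELEMENT_NODE
        && "base".equalsIgnoreCase(node.getNodeName())) {
      String href = ((Element) node).getAttribute("href");
      try {
        return new URI(href).toURL();
      } catch (URISyntaxException | MalformedURLException
               | IllegalArgumentException e) {
        return null; // missing, relative, or malformed href: treat as absent
      }
    }
    for (Node child = node.getFirstChild(); child != null;
         child = child.getNextSibling()) {
      URL found = findBase(child);
      if (found != null) {
        return found;
      }
    }
    return null;
  }
}
```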
public void getOutlinks(URL base, ArrayList<Outlink> outlinks, Node node)
This method finds all anchors below the supplied DOM node, creates appropriate Outlink records for each (relative to the supplied base URL), and adds them to the outlinks ArrayList.
Links without inner structure (tags, text, etc) are discarded, as are links which contain only single nested links and empty text nodes (this is a common DOM-fixup artifact, at least with nekohtml).
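The discard rules above can be sketched with the JDK's DOM and URI classes. This is an illustration under stated assumptions, not the Nutch code: `OutlinkSketch`, `Link` (a stand-in for Nutch's Outlink), `toUrl`, `collectLinks`, and `shouldDiscard` are hypothetical names, and whitespace-only text is treated here as an empty text node.

```java
import java.io.StringReader;
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class OutlinkSketch {

  /** Hypothetical stand-in for an outlink record: target URL plus anchor text. */
  public static final class Link {
    public final String url;
    public final String anchor;
    Link(String url, String anchor) { this.url = url; this.anchor = anchor; }
  }

  /** Parses a well-formed XML/XHTML string; checked exceptions are wrapped. */
  public static Document parse(String xml) {
    try {
      return DocumentBuilderFactory.newInstance().newDocumentBuilder()
          .parse(new InputSource(new StringReader(xml)));
    } catch (Exception e) {
      throw new IllegalStateException(e);
    }
  }

  /** Builds an absolute URL from a string; failures are wrapped. */
  public static URL toUrl(String s) {
    try {
      return new URI(s).toURL();
    } catch (Exception e) {
      throw new IllegalArgumentException(e);
    }
  }

  private static boolean isAnchor(Node n) {
    return n.getNodeType() == Node.ELEMENT_NODE
        && "a".equalsIgnoreCase(n.getNodeName());
  }

  /** True for anchors with no content, or only blank text and at most one nested anchor. */
  static boolean shouldDiscard(Node anchor) {
    int nestedAnchors = 0;
    for (Node c = anchor.getFirstChild(); c != null; c = c.getNextSibling()) {
      if (c.getNodeType() == Node.TEXT_NODE) {
        if (!c.getNodeValue().trim().isEmpty()) return false; // real text
      } else if (isAnchor(c)) {
        nestedAnchors++;
      } else {
        return false; // some other inner structure
      }
    }
    return nestedAnchors <= 1;
  }

  /** Adds a Link for every retained anchor beneath node, resolved against base. */
  public static void collectLinks(URL base, List<Link> out, Node node) {
    if (isAnchor(node) && !shouldDiscard(node)) {
      String href = ((Element) node).getAttribute("href");
      if (!href.isEmpty()) {
        try {
          URL target = base.toURI().resolve(href).toURL();
          String text = node.getTextContent().replaceAll("\\s+", " ").trim();
          out.add(new Link(target.toString(), text));
        } catch (URISyntaxException | MalformedURLException
                 | IllegalArgumentException e) {
          // unresolvable href: skip this anchor
        }
      }
    }
    // Always descend, so a nested anchor is still collected on its own.
    for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
      collectLinks(base, out, c);
    }
  }
}
```

With this shape, an outer anchor that merely wraps one nested anchor is dropped while the nested anchor itself is still recorded, which is the fixup artifact the description mentions.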