org.apache.nutch.parse
Class ParseUtil

java.lang.Object
  extended by org.apache.hadoop.conf.Configured
      extended by org.apache.nutch.parse.ParseUtil
All Implemented Interfaces:
Configurable

public class ParseUtil
extends Configured

A utility class containing helper methods for parsing, such as iterating through a preferred list of Parsers to obtain Parse objects.

Author:
mattmann, Jérôme Charron, Sébastien Le Callonnec

Field Summary
static org.slf4j.Logger LOG
           
 
Constructor Summary
ParseUtil(Configuration conf)
           
 
Method Summary
 Configuration getConf()
           
 Parse parse(String url, WebPage page)
          Performs a parse by iterating through a List of preferred Parsers until a successful parse is performed and a Parse object is returned.
 URLWebPage process(String key, WebPage page)
          Parses given web page and stores parsed content within page.
 void setConf(Configuration conf)
           
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

LOG

public static final org.slf4j.Logger LOG
Constructor Detail

ParseUtil

public ParseUtil(Configuration conf)
Parameters:
conf -
Method Detail

getConf

public Configuration getConf()
Specified by:
getConf in interface Configurable
Overrides:
getConf in class Configured

setConf

public void setConf(Configuration conf)
Specified by:
setConf in interface Configurable
Overrides:
setConf in class Configured

parse

public Parse parse(String url,
                   WebPage page)
            throws ParserNotFound,
                   ParseException
Performs a parse by iterating through a List of preferred Parsers until a successful parse is performed and a Parse object is returned. If the parse is unsuccessful, a message is logged at the WARNING level, and an empty parse is returned.

Throws:
ParserNotFound - If there is no suitable parser found.
ParseException - If there is an error parsing.
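The fallback behavior described above can be illustrated with a minimal, self-contained sketch. It does not use the Nutch API: SimpleParser, NoParserException, EMPTY_PARSE, and parseWithFallback are hypothetical stand-ins invented for this example, and the stand-in exception is unchecked for brevity, whereas parse declares ParserNotFound in its throws clause. The real ParseUtil may differ in how it selects parsers and reports failures.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

final class ParseFallbackSketch {

    /** Hypothetical stand-in for a parser; this is NOT the Nutch Parser interface. */
    interface SimpleParser {
        /** Returns the parsed text, or empty if this parser could not handle the content. */
        Optional<String> tryParse(String url, String content);
    }

    /** Hypothetical stand-in for ParserNotFound (unchecked here only to keep the sketch short). */
    static final class NoParserException extends RuntimeException {
        NoParserException(String message) {
            super(message);
        }
    }

    /** Assumed representation of an "empty parse" in this sketch. */
    static final String EMPTY_PARSE = "";

    /** Tries each preferred parser in order and returns the first successful result. */
    static String parseWithFallback(String url, String content, List<SimpleParser> preferred) {
        if (preferred.isEmpty()) {
            // No candidate parser at all: analogous to the "no suitable parser" case.
            throw new NoParserException("No suitable parser for " + url);
        }
        for (SimpleParser parser : preferred) {
            Optional<String> result = parser.tryParse(url, content);
            if (result.isPresent()) {
                return result.get(); // first success wins
            }
        }
        // Every parser failed: log a warning and fall back to an empty parse.
        System.err.println("WARNING: unable to successfully parse content " + url);
        return EMPTY_PARSE;
    }

    public static void main(String[] args) {
        SimpleParser htmlOnly = (url, content) ->
                content.startsWith("<html>") ? Optional.of("html:" + content.length())
                                             : Optional.<String>empty();
        SimpleParser plainText = (url, content) -> Optional.of("text:" + content.trim());

        // htmlOnly declines the input, so plainText produces the result.
        System.out.println(
                parseWithFallback("http://example.com/", " hello ", Arrays.asList(htmlOnly, plainText)));
    }
}
```

Ordering the list by preference means a specialized parser is consulted before a generic one, and a failure of the first candidate does not abort the whole parse.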

process

public URLWebPage process(String key,
                          WebPage page)
Parses the given web page and stores the parsed content within the page. Returns a URL and WebPage pair (a URLWebPage) if a meta-redirect is discovered.

Parameters:
key -
page -
Returns:
newly-discovered webpage (via a meta-redirect)
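As a rough illustration of the contract described for process, the sketch below stores a parse result in the page and returns a (url, page) pair only when a meta-redirect is found. Everything here is a hypothetical stand-in: Page and UrlPage are not the Nutch WebPage and URLWebPage classes, the regular expression is a simplification, and returning null when no redirect is found is an assumption, since the documentation above does not state what is returned in that case.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MetaRedirectSketch {

    /** Hypothetical stand-in for WebPage: raw content plus a slot for the parsed text. */
    static final class Page {
        final String content;
        String parsedText; // filled in by process()

        Page(String content) {
            this.content = content;
        }
    }

    /** Hypothetical stand-in for URLWebPage: a (url, page) pair. */
    static final class UrlPage {
        final String url;
        final Page page;

        UrlPage(String url, Page page) {
            this.url = url;
            this.page = page;
        }
    }

    /** Simplified matcher for <meta http-equiv="refresh" content="N; url=..."> tags. */
    private static final Pattern META_REFRESH = Pattern.compile(
            "<meta\\s+http-equiv=[\"']refresh[\"']\\s+content=[\"']\\s*\\d+\\s*;\\s*url=([^\"']+)[\"']",
            Pattern.CASE_INSENSITIVE);

    /**
     * Stores a crude parse of the page within the page itself and returns a new
     * (url, page) pair if a meta-redirect is present. The key parameter mirrors
     * the documented signature and is not otherwise used in this sketch.
     */
    static UrlPage process(String key, Page page) {
        // "Parse" by stripping tags; a real parser does far more than this.
        page.parsedText = page.content.replaceAll("<[^>]*>", " ").trim();

        Matcher m = META_REFRESH.matcher(page.content);
        if (m.find()) {
            // The redirect target becomes a newly discovered, not yet fetched page.
            return new UrlPage(m.group(1).trim(), new Page(""));
        }
        return null; // assumption: no redirect means nothing new to report
    }

    public static void main(String[] args) {
        Page page = new Page("<html><head><meta http-equiv=\"refresh\" "
                + "content=\"0; url=http://example.com/new\"></head><body>Hello</body></html>");
        UrlPage redirected = process("com.example:http/", page);
        System.out.println(page.parsedText);
        System.out.println(redirected == null ? "no redirect" : redirected.url);
    }
}
```

A caller would typically check the returned pair and, when it is present, schedule the redirect target for fetching alongside the original page.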


Copyright © 2012 The Apache Software Foundation