org.apache.nutch.parse.js
Class JSParseFilter

java.lang.Object
  extended by org.apache.nutch.parse.js.JSParseFilter
All Implemented Interfaces:
Configurable, ParseFilter, Parser, FieldPluggable, Pluggable

public class JSParseFilter
extends Object
implements ParseFilter, Parser

A heuristic link extractor for JavaScript files and code snippets. The general idea of two-pass regex matching comes from Heritrix. Parts of the code come from OutlinkExtractor.java by Stephan Strittmatter.
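
To make the two-pass idea concrete, the following is a minimal sketch and not the code used by JSParseFilter. It assumes that the first pass collects every quoted string literal and the second pass keeps only literals that look like a URL or an absolute path. The class name TwoPassLinkSketch, both patterns, and the method extractLinks are hypothetical and chosen for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of two-pass regex link extraction.
// Neither pattern is taken from JSParseFilter or Heritrix.
class TwoPassLinkSketch {

    // Pass 1 (assumed): any single- or double-quoted literal on one line.
    // Escaped quotes inside a literal are deliberately not handled.
    private static final Pattern STRING_LITERAL =
        Pattern.compile("\"([^\"\\r\\n]*)\"|'([^'\\r\\n]*)'");

    // Pass 2 (assumed): the literal starts with http://, https:// or "/"
    // and contains no whitespace.
    private static final Pattern URI_LIKE =
        Pattern.compile("(?:https?://|/)\\S+");

    static List<String> extractLinks(String js) {
        List<String> links = new ArrayList<>();
        Matcher literals = STRING_LITERAL.matcher(js);
        while (literals.find()) {
            // Exactly one of the two groups is non-null for each match.
            String literal = literals.group(1) != null
                ? literals.group(1)
                : literals.group(2);
            // matches() requires the whole literal to look like a link.
            if (URI_LIKE.matcher(literal).matches()) {
                links.add(literal);
            }
        }
        return links;
    }

    public static void main(String[] args) {
        String js = "var a = \"http://example.com/a.html\"; "
                  + "go('/b/c.js'); alert('hello');";
        // prints [http://example.com/a.html, /b/c.js]
        System.out.println(extractLinks(js));
    }
}
```

Splitting the work this way keeps each pattern simple: the first pass only has to find candidate strings, and the heuristic that decides what counts as a link lives entirely in the second pass.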

Author:
Andrzej Bialecki <ab@getopt.org>

Field Summary
static org.slf4j.Logger LOG
           
 
Fields inherited from interface org.apache.nutch.parse.ParseFilter
X_POINT_ID
 
Fields inherited from interface org.apache.nutch.parse.Parser
X_POINT_ID
 
Constructor Summary
JSParseFilter()
           
 
Method Summary
 Parse filter(String url, WebPage page, Parse parse, HTMLMetaTags metaTags, DocumentFragment doc)
          Adds metadata or otherwise modifies a parse, given the DOM tree of a page.
 Configuration getConf()
           
 Collection<WebPage.Field> getFields()
           
 Parse getParse(String url, WebPage page)
           Parses the content of the given WebPage instance.
static void main(String[] args)
           
 void setConf(Configuration conf)
           
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

LOG

public static final org.slf4j.Logger LOG
Constructor Detail

JSParseFilter

public JSParseFilter()
Method Detail

filter

public Parse filter(String url,
                    WebPage page,
                    Parse parse,
                    HTMLMetaTags metaTags,
                    DocumentFragment doc)
Description copied from interface: ParseFilter
Adds metadata or otherwise modifies a parse, given the DOM tree of a page.

Specified by:
filter in interface ParseFilter
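
As an illustration of working from the DOM tree that filter receives, the sketch below walks a DocumentFragment and collects the text of script elements, which could then be handed to a link extractor. This is an assumption about one plausible approach, not a description of what JSParseFilter.filter does. It uses only the JDK's org.w3c.dom and javax.xml.parsers classes; the class ScriptSnippetWalkSketch and its methods collectScriptText and demoFragment are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentFragment;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical illustration: gather inline JavaScript from a DOM fragment.
class ScriptSnippetWalkSketch {

    // Returns the text content of every script element under the node.
    static List<String> collectScriptText(Node root) {
        List<String> snippets = new ArrayList<>();
        walk(root, snippets);
        return snippets;
    }

    private static void walk(Node node, List<String> snippets) {
        if (node.getNodeType() == Node.ELEMENT_NODE
                && "script".equalsIgnoreCase(node.getNodeName())) {
            snippets.add(node.getTextContent());
            return; // no need to descend into the script's text nodes
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            walk(children.item(i), snippets);
        }
    }

    // Builds a tiny fragment (one p, one script) purely for demonstration.
    static DocumentFragment demoFragment() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
            DocumentFragment fragment = doc.createDocumentFragment();
            Element paragraph = doc.createElement("p");
            paragraph.setTextContent("plain text");
            Element script = doc.createElement("script");
            script.setTextContent("location.href='/next.html';");
            fragment.appendChild(paragraph);
            fragment.appendChild(script);
            return fragment;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // prints [location.href='/next.html';]
        System.out.println(collectScriptText(demoFragment()));
    }
}
```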

getParse

public Parse getParse(String url,
                      WebPage page)
Description copied from interface: Parser

Parses the content of the given WebPage instance.

Specified by:
getParse in interface Parser
Parameters:
url - Page's URL

main

public static void main(String[] args)
                 throws Exception
Throws:
Exception

setConf

public void setConf(Configuration conf)
Specified by:
setConf in interface Configurable

getConf

public Configuration getConf()
Specified by:
getConf in interface Configurable

getFields

public Collection<WebPage.Field> getFields()
Specified by:
getFields in interface FieldPluggable


Copyright © 2012 The Apache Software Foundation