org.apache.nutch.analysis.lang
Class HTMLLanguageParser
java.lang.Object
org.apache.nutch.analysis.lang.HTMLLanguageParser
- All Implemented Interfaces:
- Configurable, ParseFilter, FieldPluggable, Pluggable
public class HTMLLanguageParser
- extends Object
- implements ParseFilter
Adds metadata identifying the language of the document, if found. We could also run
statistical analysis here, but we'd miss all other formats.
Field Summary
static org.slf4j.Logger LOG
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
LOG
public static final org.slf4j.Logger LOG
HTMLLanguageParser
public HTMLLanguageParser()
filter
public Parse filter(String url,
WebPage page,
Parse parse,
HTMLMetaTags metaTags,
DocumentFragment doc)
- Scan the HTML document looking for possible indications of the content language:
- 1. html lang attribute
(http://www.w3.org/TR/REC-html40/struct/dirlang.html#h-8.1)
- 2. meta dc.language
(http://dublincore.org/documents/2000/07/16/usageguide/qualified-html.shtml#language)
- 3. meta http-equiv (content-language)
(http://www.w3.org/TR/REC-html40/struct/global.html#h-7.4.4.2)
- Specified by:
filter
in interface ParseFilter
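The three checks listed for filter can be illustrated with a minimal, standalone sketch. This is not the Nutch implementation: the class LanguageHintSketch and its methods detect and parse are hypothetical names, the input is assumed to be well-formed XHTML parsed with the JDK's own DOM parser, and treating the list order as a priority order is an assumption.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical illustration only; not part of Nutch.
final class LanguageHintSketch {

    /** Returns the first language hint found, or null if none is present. */
    static String detect(Node root) {
        // Assumption: the order of the documented list is used as priority.
        String lang = findHtmlLang(root);                                  // 1. <html lang="...">
        if (lang == null) lang = findMeta(root, "name", "dc.language");    // 2. <meta name="dc.language">
        if (lang == null) lang = findMeta(root, "http-equiv", "content-language"); // 3. <meta http-equiv>
        return lang;
    }

    // Depth-first search for an <html> element carrying a lang attribute.
    private static String findHtmlLang(Node node) {
        if (node.getNodeType() == Node.ELEMENT_NODE
                && "html".equalsIgnoreCase(node.getNodeName())) {
            String lang = attr(node, "lang");
            if (lang != null) return lang;
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            String found = findHtmlLang(c);
            if (found != null) return found;
        }
        return null;
    }

    // Depth-first search for <meta keyAttr="keyValue" content="...">.
    private static String findMeta(Node node, String keyAttr, String keyValue) {
        if (node.getNodeType() == Node.ELEMENT_NODE
                && "meta".equalsIgnoreCase(node.getNodeName())) {
            String key = attr(node, keyAttr);
            String content = attr(node, "content");
            if (key != null && key.equalsIgnoreCase(keyValue) && content != null) {
                return content;
            }
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            String found = findMeta(c, keyAttr, keyValue);
            if (found != null) return found;
        }
        return null;
    }

    // Case-insensitive attribute lookup; blank values are treated as absent.
    private static String attr(Node node, String name) {
        NamedNodeMap attrs = node.getAttributes();
        if (attrs == null) return null;
        for (int i = 0; i < attrs.getLength(); i++) {
            Node a = attrs.item(i);
            if (name.equalsIgnoreCase(a.getNodeName())) {
                String value = a.getNodeValue().trim();
                return value.isEmpty() ? null : value;
            }
        }
        return null;
    }

    /** Parses well-formed XHTML into a DOM tree (helper for trying the sketch). */
    static Node parse(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XHTML", e);
        }
    }
}
```

In this sketch, a page that declares both an html lang attribute and a dc.language meta tag resolves to the lang attribute, and a page with none of the three hints yields null. The real filter receives a DocumentFragment and HTMLMetaTags from the parser rather than a raw string, so its lookups may differ.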
setConf
public void setConf(Configuration conf)
- Specified by:
setConf
in interface Configurable
getConf
public Configuration getConf()
- Specified by:
getConf
in interface Configurable
getFields
public Collection<WebPage.Field> getFields()
- Specified by:
getFields
in interface FieldPluggable
Copyright © 2012 The Apache Software Foundation