org.apache.nutch.protocol.http.api
Class RobotRulesParser

java.lang.Object
  extended by org.apache.nutch.protocol.http.api.RobotRulesParser
All Implemented Interfaces:
Configurable

public class RobotRulesParser
extends Object
implements Configurable

This class handles the parsing of robots.txt files. It emits RobotRulesParser.RobotRuleSet objects, which hold the rules parsed from a robots.txt file and can test paths against those rules to decide whether a download is permitted.
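
To illustrate what "testing paths against parsed rules" can look like, here is a minimal, self-contained sketch. It is not the Nutch implementation: the class name SketchRuleSet, its methods, and the first-match-wins prefix policy are assumptions made for illustration, and only rules for the wildcard agent "*" are handled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical, simplified rule set; not the Nutch RobotRuleSet implementation. */
final class SketchRuleSet {

    /** One Allow/Disallow entry: a path prefix and whether it permits access. */
    private static final class Entry {
        final String prefix;
        final boolean allowed;

        Entry(String prefix, boolean allowed) {
            this.prefix = prefix;
            this.allowed = allowed;
        }
    }

    private final List<Entry> entries = new ArrayList<>();

    /**
     * Collects Allow/Disallow lines that follow a "User-agent: *" line.
     * Simplification: only the most recent User-agent line decides whether
     * the following rules apply, and empty rule values are ignored.
     */
    static SketchRuleSet parse(String robotsTxt) {
        SketchRuleSet rules = new SketchRuleSet();
        boolean applies = false;
        for (String raw : robotsTxt.split("\\r?\\n")) {
            // Drop trailing comments, then surrounding whitespace.
            int hash = raw.indexOf('#');
            String line = (hash >= 0 ? raw.substring(0, hash) : raw).trim();
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // blank line or not a "key: value" directive
            }
            String key = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            if (key.equals("user-agent")) {
                applies = value.equals("*");
            } else if (applies && !value.isEmpty()) {
                if (key.equals("disallow")) {
                    rules.entries.add(new Entry(value, false));
                } else if (key.equals("allow")) {
                    rules.entries.add(new Entry(value, true));
                }
            }
        }
        return rules;
    }

    /** First matching prefix wins; a path that matches no rule is allowed. */
    boolean isAllowed(String path) {
        for (Entry e : entries) {
            if (path.startsWith(e.prefix)) {
                return e.allowed;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String robotsTxt = "User-agent: *\n"
                + "Allow: /private/public\n"
                + "Disallow: /private\n";
        SketchRuleSet rules = SketchRuleSet.parse(robotsTxt);
        System.out.println(rules.isAllowed("/index.html"));
        System.out.println(rules.isAllowed("/private/report.html"));
        System.out.println(rules.isAllowed("/private/public/a.html"));
    }
}
```

A real parser also has to match rules to the crawler's own agent names and handle malformed input, which this sketch leaves out.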

Author:
Tom Pierce, Mike Cafarella, Doug Cutting

Nested Class Summary
static class RobotRulesParser.RobotRuleSet
          This class holds the rules which were parsed from a robots.txt file, and can test paths against those rules.
 
Field Summary
static org.slf4j.Logger LOG
           
 
Constructor Summary
RobotRulesParser(Configuration conf)
           
 
Method Summary
 Configuration getConf()
           
 long getCrawlDelay(HttpBase http, URL url)
           
 RobotRulesParser.RobotRuleSet getRobotRulesSet(HttpBase http, String url)
           
 boolean isAllowed(HttpBase http, URL url)
           
static void main(String[] argv)
          Command-line entry point for testing.
 void setConf(Configuration conf)
           
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

LOG

public static final org.slf4j.Logger LOG
Constructor Detail

RobotRulesParser

public RobotRulesParser(Configuration conf)
Method Detail

setConf

public void setConf(Configuration conf)
Specified by:
setConf in interface Configurable

getConf

public Configuration getConf()
Specified by:
getConf in interface Configurable

getRobotRulesSet

public RobotRulesParser.RobotRuleSet getRobotRulesSet(HttpBase http,
                                                      String url)

isAllowed

public boolean isAllowed(HttpBase http,
                         URL url)
                  throws ProtocolException,
                         IOException
Throws:
ProtocolException
IOException

getCrawlDelay

public long getCrawlDelay(HttpBase http,
                          URL url)
                   throws ProtocolException,
                          IOException
Throws:
ProtocolException
IOException
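
This page does not state the unit or the "no delay specified" value of the returned long. As a rough sketch of how a crawl delay can be derived from robots.txt text, the following hypothetical helper reads a Crawl-delay directive given in seconds and converts it to milliseconds, returning -1 when none is found. The class name, method name, unit, and sentinel value are all assumptions for illustration, not the documented Nutch behavior.

```java
import java.util.Locale;

/** Hypothetical helper; unit (milliseconds) and the -1 sentinel are assumptions. */
final class CrawlDelaySketch {

    /**
     * Returns the first Crawl-delay value found, converted from seconds to
     * milliseconds, or -1 if the directive is absent or not a number.
     * Simplification: User-agent grouping is ignored.
     */
    static long parseCrawlDelayMillis(String robotsTxt) {
        for (String raw : robotsTxt.split("\\r?\\n")) {
            // Drop trailing comments, then surrounding whitespace.
            int hash = raw.indexOf('#');
            String line = (hash >= 0 ? raw.substring(0, hash) : raw).trim();
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue;
            }
            String key = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            if (!key.equals("crawl-delay")) {
                continue;
            }
            try {
                double seconds = Double.parseDouble(line.substring(colon + 1).trim());
                return Math.round(seconds * 1000.0);
            } catch (NumberFormatException e) {
                return -1L; // malformed value: treat as unspecified
            }
        }
        return -1L;
    }

    public static void main(String[] args) {
        String robotsTxt = "User-agent: *\nCrawl-delay: 2.5\nDisallow: /tmp\n";
        System.out.println(parseCrawlDelayMillis(robotsTxt));
    }
}
```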

main

public static void main(String[] argv)
Command-line entry point for testing.



Copyright © 2012 The Apache Software Foundation