org.apache.nutch.crawl
Class DefaultFetchSchedule

java.lang.Object
  extended by org.apache.hadoop.conf.Configured
      extended by org.apache.nutch.crawl.AbstractFetchSchedule
          extended by org.apache.nutch.crawl.DefaultFetchSchedule
All Implemented Interfaces:
Configurable, FetchSchedule

public class DefaultFetchSchedule
extends AbstractFetchSchedule

This class implements the default re-fetch schedule. Regardless of whether the page has changed, the fetchInterval remains unchanged, and the page's new fetchTime is always set to fetchTime + fetchInterval * 1000. The factor of 1000 converts fetchInterval, which is in seconds, to the milliseconds used by fetchTime.
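The arithmetic above can be sketched in isolation. This is an illustrative, stand-alone example and not the Nutch source: the class name FetchTimeMath, the method name nextFetchTime, and the parameter types (a long time in milliseconds, an int interval in seconds) are assumptions made for the sketch. It also shows why the multiplication is best done in long arithmetic, since a 30-day interval times 1000 does not fit in an int.

```java
// Illustrative sketch only; FetchTimeMath is a hypothetical helper, not a Nutch class.
final class FetchTimeMath {

    /**
     * Returns the next fetch time in milliseconds, given the time of the
     * latest fetch (milliseconds) and the fetch interval (seconds).
     */
    static long nextFetchTime(long fetchTimeMs, int fetchIntervalSec) {
        // 1000L forces the product to be computed as a long,
        // so large intervals do not overflow int.
        return fetchTimeMs + fetchIntervalSec * 1000L;
    }

    public static void main(String[] args) {
        int thirtyDays = 30 * 24 * 60 * 60;   // 2,592,000 seconds
        long fetchedAt = 1_000_000L;          // arbitrary fetch time in ms

        // prints 2593000000
        System.out.println(nextFetchTime(fetchedAt, thirtyDays));

        // The same product computed in int arithmetic wraps around
        // to a negative number, which would schedule a fetch in the past.
        System.out.println(thirtyDays * 1000);
    }
}
```

Any interval longer than roughly 24.8 days (2,147,483 seconds) exceeds the int range once multiplied by 1000, so the long literal matters for typical multi-week intervals.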

Author:
Andrzej Bialecki

Field Summary
 
Fields inherited from class org.apache.nutch.crawl.AbstractFetchSchedule
defaultInterval, maxInterval
 
Fields inherited from interface org.apache.nutch.crawl.FetchSchedule
SECONDS_PER_DAY, STATUS_MODIFIED, STATUS_NOTMODIFIED, STATUS_UNKNOWN
 
Constructor Summary
DefaultFetchSchedule()
           
 
Method Summary
 void setFetchSchedule(String url, WebPage page, long prevFetchTime, long prevModifiedTime, long fetchTime, long modifiedTime, int state)
          Sets the fetchInterval and fetchTime on a successfully fetched page.
 
Methods inherited from class org.apache.nutch.crawl.AbstractFetchSchedule
calculateLastFetchTime, forceRefetch, getFields, initializeSchedule, setConf, setPageGoneSchedule, setPageRetrySchedule, shouldFetch
 
Methods inherited from class org.apache.hadoop.conf.Configured
getConf
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 
Methods inherited from interface org.apache.hadoop.conf.Configurable
getConf
 

Constructor Detail

DefaultFetchSchedule

public DefaultFetchSchedule()

Method Detail

setFetchSchedule

public void setFetchSchedule(String url,
                             WebPage page,
                             long prevFetchTime,
                             long prevModifiedTime,
                             long fetchTime,
                             long modifiedTime,
                             int state)
Description copied from class: AbstractFetchSchedule
Sets the fetchInterval and fetchTime on a successfully fetched page. NOTE: this implementation resets the retry counter - extending classes should call super.setFetchSchedule() to preserve this behavior.

Specified by:
setFetchSchedule in interface FetchSchedule
Overrides:
setFetchSchedule in class AbstractFetchSchedule
Parameters:
url - URL of the page
page - the page whose fetchInterval and fetchTime are set
prevFetchTime - previous value of fetch time, or -1 if not available
prevModifiedTime - previous value of modifiedTime, or -1 if not available
fetchTime - the time at which the page was most recently fetched. Most FetchSchedule implementations should set the page's fetch time to something greater than this value.
modifiedTime - last time the content was modified. This information comes from the protocol implementations, or is set to < 0 if not available. Most FetchSchedule implementations should set the page's modified time to this value.
state - if FetchSchedule.STATUS_MODIFIED, the content is considered to have changed before the fetchTime; if FetchSchedule.STATUS_NOTMODIFIED, the content is known to be unchanged. This information may be obtained by comparing page signatures before and after fetching. If this is set to FetchSchedule.STATUS_UNKNOWN, it is unknown whether the page was changed; implementations are free to follow a sensible default behavior.
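As a rough model of the contract described above, the following stand-alone sketch shows a fixed-interval schedule that resets the retry counter, advances the fetch time by the unchanged interval, records the modified time, and ignores the state argument. It does not use Nutch classes: PageStub, FixedIntervalSchedule, the field names, and the numeric values of the status constants are all hypothetical and chosen only for illustration.

```java
// Stand-alone model; PageStub and FixedIntervalSchedule are hypothetical
// and are not part of Nutch.
final class PageStub {
    long fetchTime;         // next fetch time, milliseconds
    long modifiedTime;      // last known modification time, milliseconds
    int fetchInterval;      // seconds
    int retriesSinceFetch;  // retry counter
}

final class FixedIntervalSchedule {
    // Illustrative values only; the real constants live in FetchSchedule.
    static final int STATUS_UNKNOWN = 0;
    static final int STATUS_MODIFIED = 1;
    static final int STATUS_NOTMODIFIED = 2;

    static void setFetchSchedule(PageStub page, long fetchTime,
                                 long modifiedTime, int state) {
        // Mirrors the documented base behavior: a successful fetch
        // resets the retry counter.
        page.retriesSinceFetch = 0;
        // The interval itself is never adjusted; only the next fetch
        // time moves forward. 1000L keeps the product in long range.
        page.fetchTime = fetchTime + page.fetchInterval * 1000L;
        page.modifiedTime = modifiedTime;
        // 'state' is deliberately unused: the result is the same whether
        // the page was modified, unmodified, or unknown.
    }

    public static void main(String[] args) {
        PageStub page = new PageStub();
        page.fetchInterval = 60;
        page.retriesSinceFetch = 3;

        setFetchSchedule(page, 10_000L, -1L, STATUS_MODIFIED);

        System.out.println(page.fetchTime);          // prints 70000
        System.out.println(page.fetchInterval);      // prints 60
        System.out.println(page.retriesSinceFetch);  // prints 0
    }
}
```

An adaptive schedule would differ at the point where state is ignored here, for example by shortening the interval for modified pages, while still performing the retry-counter reset.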


Copyright © 2012 The Apache Software Foundation