org.apache.nutch.crawl
Class AbstractFetchSchedule

java.lang.Object
  extended by org.apache.hadoop.conf.Configured
      extended by org.apache.nutch.crawl.AbstractFetchSchedule
All Implemented Interfaces:
Configurable, FetchSchedule
Direct Known Subclasses:
AdaptiveFetchSchedule, DefaultFetchSchedule

public abstract class AbstractFetchSchedule
extends Configured
implements FetchSchedule

This class provides common methods for implementations of FetchSchedule.

Author:
Andrzej Bialecki

Field Summary
protected  int defaultInterval
           
protected  int maxInterval
           
 
Fields inherited from interface org.apache.nutch.crawl.FetchSchedule
SECONDS_PER_DAY, STATUS_MODIFIED, STATUS_NOTMODIFIED, STATUS_UNKNOWN
 
Constructor Summary
AbstractFetchSchedule()
           
AbstractFetchSchedule(Configuration conf)
           
 
Method Summary
 long calculateLastFetchTime(WebPage page)
          Returns the last fetch time of the page.
 void forceRefetch(String url, WebPage page, boolean asap)
          This method resets fetchTime, fetchInterval, modifiedTime, retriesSinceFetch and page signature, so that it forces refetching.
 Set<WebPage.Field> getFields()
           
 void initializeSchedule(String url, WebPage page)
          Initialize fetch schedule related data.
 void setConf(Configuration conf)
           
 void setFetchSchedule(String url, WebPage page, long prevFetchTime, long prevModifiedTime, long fetchTime, long modifiedTime, int state)
          Sets the fetchInterval and fetchTime on a successfully fetched page.
 void setPageGoneSchedule(String url, WebPage page, long prevFetchTime, long prevModifiedTime, long fetchTime)
          This method specifies how to schedule refetching of pages marked as GONE.
 void setPageRetrySchedule(String url, WebPage page, long prevFetchTime, long prevModifiedTime, long fetchTime)
          This method adjusts the fetch schedule if fetching needs to be re-tried due to transient errors.
 boolean shouldFetch(String url, WebPage page, long curTime)
          This method provides information whether the page is suitable for selection in the current fetchlist.
 
Methods inherited from class org.apache.hadoop.conf.Configured
getConf
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 
Methods inherited from interface org.apache.hadoop.conf.Configurable
getConf
 

Field Detail

defaultInterval

protected int defaultInterval

maxInterval

protected int maxInterval
Constructor Detail

AbstractFetchSchedule

public AbstractFetchSchedule()

AbstractFetchSchedule

public AbstractFetchSchedule(Configuration conf)
Method Detail

setConf

public void setConf(Configuration conf)
Specified by:
setConf in interface Configurable
Overrides:
setConf in class Configured

initializeSchedule

public void initializeSchedule(String url,
                               WebPage page)
Initialize fetch schedule related data. Implementations should at least set the fetchTime and fetchInterval. The default implementation sets the fetchTime to now, using the default fetchInterval.

Specified by:
initializeSchedule in interface FetchSchedule
Parameters:
url - URL of the page.
page -

setFetchSchedule

public void setFetchSchedule(String url,
                             WebPage page,
                             long prevFetchTime,
                             long prevModifiedTime,
                             long fetchTime,
                             long modifiedTime,
                             int state)
Sets the fetchInterval and fetchTime on a successfully fetched page. NOTE: this implementation resets the retry counter - extending classes should call super.setFetchSchedule() to preserve this behavior.

Specified by:
setFetchSchedule in interface FetchSchedule
Parameters:
url - url of the page
prevFetchTime - previous value of fetch time, or -1 if not available
prevModifiedTime - previous value of modifiedTime, or -1 if not available
fetchTime - the time at which the page was most recently fetched. Most FetchSchedule implementations should update the value in page to something greater than this value.
modifiedTime - last time the content was modified. This information comes from the protocol implementations, or is set to < 0 if not available. Most FetchSchedule implementations should update the value in page to this value.
state - if FetchSchedule.STATUS_MODIFIED, then the content is considered to be "changed" before the fetchTime, if FetchSchedule.STATUS_NOTMODIFIED then the content is known to be unchanged. This information may be obtained by comparing page signatures before and after fetching. If this is set to FetchSchedule.STATUS_UNKNOWN, then it is unknown whether the page was changed; implementations are free to follow a sensible default behavior.
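
The note about calling super.setFetchSchedule() can be illustrated with a minimal, self-contained sketch. It does not use the Nutch classes: Page, BaseSchedule and FixedIntervalSchedule are hypothetical stand-ins, the parameter list is shortened, and the units (fetchTime in epoch milliseconds, fetchInterval in seconds) are assumptions made for the example.

```java
// Illustration only: stand-in types, not the Nutch WebPage or AbstractFetchSchedule.
final class SuperCallSketch {
    /** Hypothetical page record holding only the schedule fields used here. */
    static final class Page {
        long fetchTime;         // epoch milliseconds (assumed unit)
        int fetchInterval;      // seconds (assumed unit)
        int retriesSinceFetch;  // retry counter
    }

    /** Stand-in base class: models only the documented retry-counter reset. */
    static class BaseSchedule {
        void setFetchSchedule(Page page, long fetchTime, long modifiedTime, int state) {
            page.retriesSinceFetch = 0;
        }
    }

    /** Hypothetical subclass that keeps a fixed interval between fetches. */
    static class FixedIntervalSchedule extends BaseSchedule {
        @Override
        void setFetchSchedule(Page page, long fetchTime, long modifiedTime, int state) {
            // Calling super first preserves the retry-counter reset.
            super.setFetchSchedule(page, fetchTime, modifiedTime, state);
            // Schedule the next fetch one interval after the latest fetch.
            page.fetchTime = fetchTime + page.fetchInterval * 1000L;
        }
    }
}
```

Omitting the super call in such a subclass would leave retriesSinceFetch at its old value after a successful fetch.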

setPageGoneSchedule

public void setPageGoneSchedule(String url,
                                WebPage page,
                                long prevFetchTime,
                                long prevModifiedTime,
                                long fetchTime)
This method specifies how to schedule refetching of pages marked as GONE. The default implementation increases fetchInterval by 50%, and if it exceeds maxInterval it calls forceRefetch(String, WebPage, boolean).

Specified by:
setPageGoneSchedule in interface FetchSchedule
Parameters:
url - URL of the page
page -
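
The arithmetic behind the 50% increase can be sketched as follows. This is a standalone illustration rather than the Nutch implementation: the class and method names are hypothetical, intervals are assumed to be in seconds, and integer division is used, so odd values round down.

```java
// Illustration only: hypothetical helpers, not part of the Nutch API.
final class GoneScheduleSketch {
    /** Returns the interval increased by 50% (integer arithmetic, rounds down). */
    static int grownInterval(int fetchInterval) {
        return fetchInterval + fetchInterval / 2;
    }

    /** True when the increased interval exceeds maxInterval, the documented trigger for a forced refetch. */
    static boolean exceedsMax(int fetchInterval, int maxInterval) {
        return grownInterval(fetchInterval) > maxInterval;
    }
}
```

For example, a 30-day interval (2592000 seconds) grows to 45 days (3888000 seconds), which stays below a 90-day maxInterval (7776000 seconds).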

setPageRetrySchedule

public void setPageRetrySchedule(String url,
                                 WebPage page,
                                 long prevFetchTime,
                                 long prevModifiedTime,
                                 long fetchTime)
This method adjusts the fetch schedule if fetching needs to be re-tried due to transient errors. The default implementation sets the next fetch time 1 day in the future and increases the retry counter.

Specified by:
setPageRetrySchedule in interface FetchSchedule
Parameters:
url - URL of the page
page -
prevFetchTime - previous fetch time
prevModifiedTime - previous modified time
fetchTime - current fetch time
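
A minimal sketch of the documented retry behavior, using a hypothetical Page stand-in instead of WebPage. It assumes fetch times are epoch milliseconds and defines its own SECONDS_PER_DAY constant rather than referencing FetchSchedule.SECONDS_PER_DAY.

```java
// Illustration only: stand-in types, not the Nutch WebPage or AbstractFetchSchedule.
final class RetryScheduleSketch {
    /** Local constant for the example: number of seconds in a day. */
    static final long SECONDS_PER_DAY = 24L * 60L * 60L;

    /** Hypothetical page record holding only the fields touched on retry. */
    static final class Page {
        long fetchTime;         // epoch milliseconds (assumed unit)
        int retriesSinceFetch;  // retry counter
    }

    /** Moves the next fetch one day past fetchTime and counts the retry. */
    static void setPageRetrySchedule(Page page, long fetchTime) {
        page.fetchTime = fetchTime + SECONDS_PER_DAY * 1000L;
        page.retriesSinceFetch++;
    }
}
```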

calculateLastFetchTime

public long calculateLastFetchTime(WebPage page)
Returns the last fetch time of the page.

Specified by:
calculateLastFetchTime in interface FetchSchedule
Returns:
the date as a long.

shouldFetch

public boolean shouldFetch(String url,
                           WebPage page,
                           long curTime)
This method provides information whether the page is suitable for selection in the current fetchlist. NOTE: a true return value does not guarantee that the page will be fetched; it only allows the page to be included in the further selection process based on scores. The default implementation checks fetchTime: if it is higher than curTime it returns false, and true otherwise. It also checks that fetchTime is not too remote (more than maxInterval), in which case it lowers the interval.

Specified by:
shouldFetch in interface FetchSchedule
Parameters:
url - URL of the page
page -
curTime - reference time (usually set to the time when the fetchlist generation process was started).
Returns:
true, if the page should be considered for inclusion in the current fetchlist, otherwise false.
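
The two checks described for the default implementation can be sketched as plain predicates. The names below are hypothetical, and the units are assumptions: fetchTime and curTime in epoch milliseconds, maxInterval in seconds.

```java
// Illustration only: hypothetical helpers, not part of the Nutch API.
final class ShouldFetchSketch {
    /** A page is a candidate once its scheduled fetchTime is not later than curTime. */
    static boolean isDue(long fetchTime, long curTime) {
        return fetchTime <= curTime;
    }

    /** True when fetchTime lies more than maxInterval seconds past curTime. */
    static boolean isTooRemote(long fetchTime, long curTime, int maxInterval) {
        return fetchTime - curTime > maxInterval * 1000L;
    }
}
```

Passing the fetchlist generation start time as curTime keeps the decision consistent for every page evaluated in the same run.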

forceRefetch

public void forceRefetch(String url,
                         WebPage page,
                         boolean asap)
This method resets fetchTime, fetchInterval, modifiedTime, retriesSinceFetch and page signature, so that it forces refetching.

Specified by:
forceRefetch in interface FetchSchedule
Parameters:
url - URL of the page
page -
asap - if true, force refetch as soon as possible - this sets the fetchTime to now. If false, force refetch whenever the next fetch time is set.
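
The effect of the asap flag can be sketched with a hypothetical Page stand-in. The reset values are assumptions chosen for the example, since the description above does not state them: the interval returns to a caller-supplied default, modifiedTime and the retry counter go to 0, and the signature is cleared to null.

```java
// Illustration only: stand-in types and assumed reset values, not the Nutch implementation.
final class ForceRefetchSketch {
    /** Hypothetical page record holding the fields named in the description. */
    static final class Page {
        long fetchTime;         // epoch milliseconds (assumed unit)
        int fetchInterval;      // seconds (assumed unit)
        long modifiedTime;      // epoch milliseconds (assumed unit)
        int retriesSinceFetch;  // retry counter
        byte[] signature;       // content signature
    }

    /** Resets the schedule state; with asap the page becomes due at 'now'. */
    static void forceRefetch(Page page, boolean asap, long now, int defaultInterval) {
        page.fetchInterval = defaultInterval; // assumed reset target
        page.modifiedTime = 0L;               // assumed reset value
        page.retriesSinceFetch = 0;
        page.signature = null;                // forces the content to be treated as unseen
        if (asap) {
            page.fetchTime = now;
        }
        // Without asap, fetchTime is left for the next scheduling call to set.
    }
}
```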

getFields

public Set<WebPage.Field> getFields()
Specified by:
getFields in interface FetchSchedule


Copyright © 2012 The Apache Software Foundation