org.apache.nutch.crawl
Class AdaptiveFetchSchedule
java.lang.Object
org.apache.hadoop.conf.Configured
org.apache.nutch.crawl.AbstractFetchSchedule
org.apache.nutch.crawl.AdaptiveFetchSchedule
- All Implemented Interfaces:
- Configurable, FetchSchedule
public class AdaptiveFetchSchedule
- extends AbstractFetchSchedule
This class implements an adaptive re-fetch algorithm. This works as follows:
- for pages that have changed since the last fetchTime, decrease their
fetchInterval by a factor of DEC_FACTOR (default value is 0.2f).
- for pages that haven't changed since the last fetchTime, increase their
fetchInterval by a factor of INC_FACTOR (default value is 0.2f).
If the SYNC_DELTA property is true, then:
- calculate delta = fetchTime - modifiedTime
- try to synchronize with the time of change by shifting the next fetchTime
by a fraction of the difference between the last modification time and the last
fetch time, i.e. the next fetch time will be set to
fetchTime + fetchInterval - delta * SYNC_DELTA_RATE
- if the adjusted fetch interval is bigger than the delta, then
fetchInterval = delta.
- the minimum value of fetchInterval may not be smaller than MIN_INTERVAL
(default is 1 minute).
- the maximum value of fetchInterval may not be bigger than MAX_INTERVAL
(default is 365 days).
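The rules above can be sketched as a small standalone method. This is a minimal illustration of the description on this page, not the Nutch source. It assumes that "by a factor of" means multiplying by (1 - DEC_FACTOR) or (1 + INC_FACTOR), that all times are in seconds, and that synchronization is skipped when modifiedTime is negative. The class name, the method name and the SYNC_DELTA_RATE value of 0.3 are hypothetical.

```java
// Hypothetical sketch of the adaptive re-fetch rules; not the Nutch implementation.
final class AdaptiveIntervalSketch {
    static final double DEC_FACTOR = 0.2;               // default stated in the text
    static final double INC_FACTOR = 0.2;               // default stated in the text
    static final double SYNC_DELTA_RATE = 0.3;          // assumed value, no default given in the text
    static final long MIN_INTERVAL = 60L;               // 1 minute, in seconds
    static final long MAX_INTERVAL = 365L * 24 * 3600;  // 365 days, in seconds

    /**
     * Returns { newFetchInterval, nextFetchTime }, both in seconds.
     * changed is true when the page was modified since the last fetch.
     */
    static long[] schedule(long interval, long fetchTime, long modifiedTime,
                           boolean changed, boolean syncDelta) {
        // Shrink the interval for changed pages, grow it for unchanged ones.
        double adjusted = changed
                ? interval * (1.0 - DEC_FACTOR)
                : interval * (1.0 + INC_FACTOR);
        long newInterval = Math.round(adjusted);
        long refTime = fetchTime;

        // Assumption: skip synchronization when the modification time is unknown.
        if (syncDelta && modifiedTime >= 0) {
            long delta = fetchTime - modifiedTime;
            // As described in the text: cap the adjusted interval at delta.
            if (newInterval > delta) {
                newInterval = delta;
            }
            // Shift the reference time back by a fraction of delta.
            refTime = fetchTime - Math.round(delta * SYNC_DELTA_RATE);
        }

        // Keep the interval within [MIN_INTERVAL, MAX_INTERVAL].
        newInterval = Math.max(MIN_INTERVAL, Math.min(MAX_INTERVAL, newInterval));
        return new long[] { newInterval, refTime + newInterval };
    }

    public static void main(String[] args) {
        long[] r = schedule(1000L, 100000L, 99500L, false, true);
        System.out.println("interval=" + r[0] + "s, nextFetchTime=" + r[1] + "s");
    }
}
```

Applying the MIN_INTERVAL and MAX_INTERVAL bounds last means that neither the factor adjustment nor the delta cap can push the interval outside the allowed range.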
NOTE: values of DEC_FACTOR and INC_FACTOR higher than 0.4f may destabilize the algorithm,
so that the fetch interval either increases or decreases without bound, with little
relevance to the page changes. Please use the main(String[]) method to
test the values before applying them in a production system.
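One way to get a feel for candidate factor values is a small simulation. The sketch below is not the main(String[]) method mentioned above. It is a hypothetical standalone model that assumes a page changing at a fixed period, times in seconds, and the same multiplicative reading of DEC_FACTOR and INC_FACTOR. The class name, the method name and all parameters are illustrative.

```java
// Hypothetical simulation of interval evolution; not the Nutch test harness.
final class FetchIntervalSimulation {
    static final long MIN_INTERVAL = 60L;               // 1 minute, in seconds
    static final long MAX_INTERVAL = 365L * 24 * 3600;  // 365 days, in seconds

    /**
     * Simulates repeated fetches of a page that changes every changePeriod
     * seconds and returns the fetch interval chosen after each fetch.
     */
    static long[] simulate(double decFactor, double incFactor,
                           long startInterval, long changePeriod, int fetches) {
        long[] history = new long[fetches];
        long interval = startInterval;
        long now = 0L;
        long changesSeen = 0L; // number of page changes observed so far

        for (int i = 0; i < fetches; i++) {
            now += interval;
            // Changes happen at changePeriod, 2 * changePeriod, ...
            long changesSoFar = now / changePeriod;
            boolean changed = changesSoFar > changesSeen;
            changesSeen = changesSoFar;

            double adjusted = changed
                    ? interval * (1.0 - decFactor)
                    : interval * (1.0 + incFactor);
            interval = Math.max(MIN_INTERVAL,
                    Math.min(MAX_INTERVAL, Math.round(adjusted)));
            history[i] = interval;
        }
        return history;
    }

    public static void main(String[] args) {
        // Compare a default-sized factor with one above the 0.4 threshold.
        for (double factor : new double[] { 0.2, 0.6 }) {
            long[] h = simulate(factor, factor, 3600L, 86400L, 50);
            System.out.println("factor=" + factor + " final interval=" + h[h.length - 1] + "s");
        }
    }
}
```

Comparing the interval history against the known change period shows whether a given pair of factors settles near the page's real change rate or drifts toward one of the bounds.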
- Author:
- Andrzej Bialecki
Method Summary
void setConf(Configuration conf)
void setFetchSchedule(String url, WebPage page, long prevFetchTime, long prevModifiedTime, long fetchTime, long modifiedTime, int state)
Sets the fetchInterval and fetchTime on a successfully fetched page.
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
AdaptiveFetchSchedule
public AdaptiveFetchSchedule()
setConf
public void setConf(Configuration conf)
- Specified by:
- setConf in interface Configurable
- Overrides:
- setConf in class AbstractFetchSchedule
setFetchSchedule
public void setFetchSchedule(String url,
WebPage page,
long prevFetchTime,
long prevModifiedTime,
long fetchTime,
long modifiedTime,
int state)
- Description copied from class: AbstractFetchSchedule
- Sets the fetchInterval and fetchTime on a successfully fetched page.
NOTE: this implementation resets the retry counter - extending classes
should call super.setFetchSchedule() to preserve this behavior.
- Specified by:
- setFetchSchedule in interface FetchSchedule
- Overrides:
- setFetchSchedule in class AbstractFetchSchedule
- Parameters:
url - url of the page
prevFetchTime - previous value of fetch time, or -1 if not available
prevModifiedTime - previous value of modifiedTime, or -1 if not available
fetchTime - the most recent time the page was re-fetched. Most FetchSchedule
implementations should update the stored value to something greater than this value.
modifiedTime - last time the content was modified. This information comes from
the protocol implementations, or is set to < 0 if not available. Most FetchSchedule
implementations should update the stored value to this value.
state - if FetchSchedule.STATUS_MODIFIED, then the content is considered to be "changed" before the
fetchTime; if FetchSchedule.STATUS_NOTMODIFIED, then the content is known to be unchanged.
This information may be obtained by comparing page signatures before and after fetching. If this
is set to FetchSchedule.STATUS_UNKNOWN, then it is unknown whether the page was changed; implementations
are free to follow a sensible default behavior.
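The state parameter can be derived by comparing content signatures, as the description above suggests. The sketch below is a hypothetical illustration using only the JDK. The STATUS_* constants are local stand-ins whose numeric values are assumptions, and MD5 is used purely as an example digest. It does not call any Nutch API.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

// Hypothetical helper that maps a signature comparison to a fetch state.
final class SignatureStateSketch {
    // Local stand-ins for FetchSchedule.STATUS_*; the numeric values are assumptions.
    static final int STATUS_UNKNOWN = 0;
    static final int STATUS_MODIFIED = 1;
    static final int STATUS_NOTMODIFIED = 2;

    /** Computes a digest of the page content to use as its signature. */
    static byte[] signature(String content) {
        try {
            return MessageDigest.getInstance("MD5")
                    .digest(content.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            // MD5 is required on every Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    /** Compares the signatures from before and after the fetch. */
    static int state(byte[] previous, byte[] current) {
        if (previous == null || current == null) {
            return STATUS_UNKNOWN; // nothing to compare against
        }
        return Arrays.equals(previous, current) ? STATUS_NOTMODIFIED : STATUS_MODIFIED;
    }

    public static void main(String[] args) {
        byte[] before = signature("<html>old</html>");
        byte[] after = signature("<html>new</html>");
        System.out.println("state=" + state(before, after));
    }
}
```

Returning the unknown state when either signature is missing leaves the choice of default behavior to the schedule implementation, in line with the parameter description.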
Copyright © 2012 The Apache Software Foundation