java.lang.Object
org.apache.hadoop.conf.Configured
org.apache.nutch.crawl.AbstractFetchSchedule
public abstract class AbstractFetchSchedule
This class provides common methods for implementations of
FetchSchedule.
| Field Summary | |
|---|---|
| protected int | defaultInterval |
| protected int | maxInterval |
| Fields inherited from interface org.apache.nutch.crawl.FetchSchedule |
|---|
| SECONDS_PER_DAY, STATUS_MODIFIED, STATUS_NOTMODIFIED, STATUS_UNKNOWN |
| Constructor Summary |
|---|
| AbstractFetchSchedule() |
| AbstractFetchSchedule(Configuration conf) |
| Method Summary | | |
|---|---|---|
| long | calculateLastFetchTime(WebPage page) | This method returns the last fetch time of the page. |
| void | forceRefetch(String url, WebPage page, boolean asap) | This method resets fetchTime, fetchInterval, modifiedTime, retriesSinceFetch and page signature, so that it forces refetching. |
| Set<WebPage.Field> | getFields() | |
| void | initializeSchedule(String url, WebPage page) | Initialize fetch schedule related data. |
| void | setConf(Configuration conf) | |
| void | setFetchSchedule(String url, WebPage page, long prevFetchTime, long prevModifiedTime, long fetchTime, long modifiedTime, int state) | Sets the fetchInterval and fetchTime on a successfully fetched page. |
| void | setPageGoneSchedule(String url, WebPage page, long prevFetchTime, long prevModifiedTime, long fetchTime) | This method specifies how to schedule refetching of pages marked as GONE. |
| void | setPageRetrySchedule(String url, WebPage page, long prevFetchTime, long prevModifiedTime, long fetchTime) | This method adjusts the fetch schedule if fetching needs to be re-tried due to transient errors. |
| boolean | shouldFetch(String url, WebPage page, long curTime) | This method provides information whether the page is suitable for selection in the current fetchlist. |
| Methods inherited from class org.apache.hadoop.conf.Configured |
|---|
| getConf |
| Methods inherited from class java.lang.Object |
|---|
| clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
| Methods inherited from interface org.apache.hadoop.conf.Configurable |
|---|
| getConf |
| Field Detail |
|---|
protected int defaultInterval
protected int maxInterval
| Constructor Detail |
|---|
public AbstractFetchSchedule()
public AbstractFetchSchedule(Configuration conf)
| Method Detail |
|---|
public void setConf(Configuration conf)
Specified by: setConf in interface Configurable
Overrides: setConf in class Configured
public void initializeSchedule(String url,
WebPage page)
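Initialize fetch schedule related data. Implementations should at least set the
fetchTime and fetchInterval. The default
implementation sets the fetchTime to now, using the
default fetchInterval.
Specified by: initializeSchedule in interface FetchSchedule
Parameters:
url - URL of the page.
page -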
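The default behavior described above can be sketched in a few lines. `PageRecord` and `InitializeScheduleSketch` are hypothetical stand-ins, not Nutch types, and passing the current time in as a parameter is an assumption made only to keep the sketch deterministic.

```java
// Hypothetical stand-in for the page record; not the Nutch WebPage class.
class PageRecord {
    long fetchTime;     // next scheduled fetch time (assumed milliseconds)
    int fetchInterval;  // interval between fetches (assumed seconds)
}

final class InitializeScheduleSketch {
    private InitializeScheduleSketch() {}

    // Sets fetchTime to "now" and fetchInterval to the default interval,
    // mirroring the default behavior described in the text.
    static void initialize(PageRecord page, int defaultInterval, long nowMs) {
        page.fetchTime = nowMs;
        page.fetchInterval = defaultInterval;
    }
}
```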
public void setFetchSchedule(String url,
WebPage page,
long prevFetchTime,
long prevModifiedTime,
long fetchTime,
long modifiedTime,
int state)
Sets the fetchInterval and fetchTime on a
successfully fetched page. NOTE: this implementation resets the
retry counter; extending classes should call super.setFetchSchedule() to
preserve this behavior.
Specified by: setFetchSchedule in interface FetchSchedule
Parameters:
url - URL of the page
prevFetchTime - previous value of fetch time, or -1 if not available
prevModifiedTime - previous value of modifiedTime, or -1 if not available
fetchTime - the most recent time at which the page was fetched. Most FetchSchedule implementations should update the value in the page to something greater than this value.
modifiedTime - last time the content was modified. This information comes from the protocol implementations, or is set to < 0 if not available. Most FetchSchedule implementations should update the value in the page to this value.
state - if FetchSchedule.STATUS_MODIFIED, the content is considered to have changed before fetchTime; if FetchSchedule.STATUS_NOTMODIFIED, the content is known to be unchanged. This information may be obtained by comparing page signatures before and after fetching. If this is set to FetchSchedule.STATUS_UNKNOWN, it is unknown whether the page was changed; implementations are free to follow a sensible default behavior.
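The note about calling super.setFetchSchedule() can be illustrated with a minimal, self-contained sketch. The types below (`PageStub`, `BaseSchedule`, `FixedIntervalSchedule`) are hypothetical stand-ins rather than Nutch classes, and the method takes fewer arguments than the real one. Only the pattern matters here: the base class resets the retry counter, and a subclass keeps that behavior by delegating to super before applying its own interval logic. The units (milliseconds for times, seconds for intervals) are also an assumption.

```java
// Hypothetical stand-in for the page record; not the Nutch WebPage class.
class PageStub {
    long fetchTime;          // next fetch time (assumed milliseconds)
    int fetchInterval;       // interval between fetches (assumed seconds)
    int retriesSinceFetch;   // retry counter for transient errors
}

// Stand-in for the abstract base: resets the retry counter on success.
abstract class BaseSchedule {
    void setFetchSchedule(PageStub page, long fetchTime) {
        page.retriesSinceFetch = 0;
    }
}

// A subclass that preserves the reset by delegating to super first.
class FixedIntervalSchedule extends BaseSchedule {
    @Override
    void setFetchSchedule(PageStub page, long fetchTime) {
        super.setFetchSchedule(page, fetchTime);
        // Schedule the next fetch one interval after this fetch.
        page.fetchTime = fetchTime + page.fetchInterval * 1000L;
    }
}
```

If the subclass omitted the super call, the retry counter in this sketch would keep its old value after a successful fetch.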
public void setPageGoneSchedule(String url,
WebPage page,
long prevFetchTime,
long prevModifiedTime,
long fetchTime)
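This method specifies how to schedule refetching of pages marked as GONE.
If the fetch interval exceeds maxInterval, it calls
forceRefetch(String, WebPage, boolean).
Specified by: setPageGoneSchedule in interface FetchSchedule
Parameters:
url - URL of the page
page -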
public void setPageRetrySchedule(String url,
WebPage page,
long prevFetchTime,
long prevModifiedTime,
long fetchTime)
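This method adjusts the fetch schedule if fetching needs to be re-tried due to transient errors.
Specified by: setPageRetrySchedule in interface FetchSchedule
Parameters:
url - URL of the page
page -
prevFetchTime - previous fetch time
prevModifiedTime - previous modified time
fetchTime - current fetch time
public long calculateLastFetchTime(WebPage page)
This method returns the last fetch time of the page.
Specified by: calculateLastFetchTime in interface FetchSchedule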
public boolean shouldFetch(String url,
WebPage page,
long curTime)
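This method provides information whether the page is suitable for selection in the current fetchlist.
The default implementation checks fetchTime: if it is higher than
curTime it returns false, and true otherwise. It will also
check that fetchTime is not too remote (more than maxInterval).
Specified by: shouldFetch in interface FetchSchedule
Parameters:
url - URL of the page
page -
curTime - reference time (usually set to the time when the fetchlist generation process was started).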
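The two checks described for shouldFetch can be sketched as plain comparisons. This is a minimal illustration, not the Nutch implementation: `ShouldFetchSketch` and its method names are hypothetical, times are assumed to be in milliseconds and maxInterval in seconds, and what the real method does once it finds a too-remote fetchTime is not stated in this page, so the sketch only detects that condition.

```java
final class ShouldFetchSketch {
    private ShouldFetchSketch() {}

    // Returns false when fetchTime is later than the reference time,
    // and true otherwise (including when the two are equal).
    static boolean isDue(long fetchTimeMs, long curTimeMs) {
        return fetchTimeMs <= curTimeMs;
    }

    // Returns true when fetchTime lies more than maxInterval seconds
    // after the reference time (assumed units; see the lead-in).
    static boolean isTooRemote(long fetchTimeMs, long curTimeMs, int maxIntervalSec) {
        return fetchTimeMs - curTimeMs > maxIntervalSec * 1000L;
    }
}
```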
public void forceRefetch(String url,
WebPage page,
boolean asap)
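This method resets fetchTime, fetchInterval, modifiedTime, retriesSinceFetch and page signature, so that it forces refetching.
Specified by: forceRefetch in interface FetchSchedule
Parameters:
url - URL of the page
page -
asap - if true, force refetch as soon as possible; this sets the fetchTime to now. If false, force refetch whenever the next fetch time is set.
public Set<WebPage.Field> getFields()
Specified by: getFields in interface FetchSchedule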