org.apache.nutch.scoring
Interface ScoringFilter

All Superinterfaces:
Configurable, FieldPluggable, Pluggable
All Known Implementing Classes:
LinkAnalysisScoringFilter, OPICScoringFilter, ScoringFilters, TLDScoringFilter

public interface ScoringFilter
extends Configurable, FieldPluggable

A contract defining behavior of scoring plugins. A scoring filter will manipulate scoring variables in CrawlDatum and in resulting search indexes. Filters can be chained in a specific order, to provide multi-stage scoring adjustments.
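
The chaining described above can be pictured with a minimal sketch. It uses a hypothetical SortStage interface and ChainSketch class as stand-ins, not the Nutch types; it only shows how a value returned by one stage becomes the input of the next.

```java
import java.util.ArrayList;
import java.util.List;

final class ChainSketch {
    // Hypothetical stand-in for one stage of a scoring chain; NOT the Nutch interface.
    interface SortStage {
        float apply(String url, float current);
    }

    // Feed the output of each stage into the next one, in list order.
    static float runChain(List<SortStage> stages, String url, float init) {
        float value = init;
        for (SortStage stage : stages) {
            value = stage.apply(url, value);
        }
        return value;
    }

    public static void main(String[] args) {
        List<SortStage> stages = new ArrayList<>();
        stages.add((url, v) -> v * 2.0f);                              // stage 1: double the value
        stages.add((url, v) -> url.endsWith(".org") ? v + 1.0f : v);   // stage 2: bonus for .org
        System.out.println(runChain(stages, "http://example.org", 1.0f)); // prints 3.0
    }
}
```

Because each stage sees the previous stage's result, changing the order of the stages can change the final value.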

Author:
Andrzej Bialecki

Field Summary
static String X_POINT_ID
          The name of the extension point.
 
Method Summary
 void distributeScoreToOutlinks(String fromUrl, WebPage page, Collection<ScoreDatum> scoreData, int allCount)
          Distribute score value from the current page to all its outlinked pages.
 float generatorSortValue(String url, WebPage page, float initSort)
          This method prepares a sort value for the purpose of sorting and selecting top N scoring pages during fetchlist generation.
 float indexerScore(String url, NutchDocument doc, WebPage page, float initScore)
          This method calculates a Lucene document boost.
 void initialScore(String url, WebPage page)
          Set an initial score for newly discovered pages.
 void injectedScore(String url, WebPage page)
          Set an initial score for newly injected pages.
 void updateScore(String url, WebPage page, List<ScoreDatum> inlinkedScoreData)
          This method calculates a new score during table update, based on the values contributed by inlinked pages.
 
Methods inherited from interface org.apache.hadoop.conf.Configurable
getConf, setConf
 
Methods inherited from interface org.apache.nutch.plugin.FieldPluggable
getFields
 

Field Detail

X_POINT_ID

static final String X_POINT_ID
The name of the extension point.

Method Detail

injectedScore

void injectedScore(String url,
                   WebPage page)
                   throws ScoringFilterException
Set an initial score for newly injected pages. Note: newly injected pages may have no inlinks, so filter implementations may wish to set this score to a non-zero value, to give newly injected pages some initial credit.

Parameters:
url - url of the page
page - new page. Filters will modify it in-place.
Throws:
ScoringFilterException

initialScore

void initialScore(String url,
                  WebPage page)
                  throws ScoringFilterException
Set an initial score for newly discovered pages. Note: newly discovered pages have at least one inlink with its score contribution, so filter implementations may choose to set initial score to zero (unknown value), and then the inlink score contribution will set the "real" value of the new page.

Parameters:
url - url of the page
page - new page. Filters will modify it in-place.
Throws:
ScoringFilterException

generatorSortValue

float generatorSortValue(String url,
                         WebPage page,
                         float initSort)
                         throws ScoringFilterException
This method prepares a sort value for the purpose of sorting and selecting top N scoring pages during fetchlist generation.

Parameters:
url - url of the page
page - page row. Modifications will be persisted.
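initSort - initial sort value, or a value from previous filters in chain
Returns:
sort value for the page. This value is passed as initSort to the next scoring filter in chain.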
Throws:
ScoringFilterException
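
As an illustration of how sort values could drive top N selection, here is a minimal sketch. The TopNSketch class and its map of URL to sort value are hypothetical; the actual fetchlist generation in Nutch is not shown on this page and may work differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class TopNSketch {
    // Return up to n URLs, ordered by descending sort value.
    // sortValues is an assumed input: URL -> value produced by a scoring chain.
    static List<String> topN(Map<String, Float> sortValues, int n) {
        List<Map.Entry<String, Float>> entries = new ArrayList<>(sortValues.entrySet());
        entries.sort((a, b) -> Float.compare(b.getValue(), a.getValue()));
        List<String> selected = new ArrayList<>();
        for (int i = 0; i < Math.min(n, entries.size()); i++) {
            selected.add(entries.get(i).getKey());
        }
        return selected;
    }
}
```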

distributeScoreToOutlinks

void distributeScoreToOutlinks(String fromUrl,
                               WebPage page,
                               Collection<ScoreDatum> scoreData,
                               int allCount)
                               throws ScoringFilterException
Distribute score value from the current page to all its outlinked pages.

Parameters:
fromUrl - url of the source page
page - page row
scoreData - a collection of ScoreDatum objects, one for every outlink. These ScoreDatum objects will be passed to updateScore(String, WebPage, List) for every outlinked URL.
allCount - number of all collected outlinks from the source page
Throws:
ScoringFilterException
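
One possible distribution policy is an even split of the page score over all collected outlinks. The sketch below assumes that policy and uses a hypothetical Contribution class in place of ScoreDatum; it is not taken from any bundled filter.

```java
import java.util.Collection;

final class DistributeSketch {
    // Hypothetical stand-in for a per-outlink score record; NOT the Nutch ScoreDatum.
    static final class Contribution {
        final String toUrl;
        float score;
        Contribution(String toUrl) { this.toUrl = toUrl; }
    }

    // Give every record an equal share of pageScore, dividing by allCount
    // (the number of all collected outlinks), which may exceed scoreData.size().
    static void distribute(float pageScore, Collection<Contribution> scoreData, int allCount) {
        if (allCount <= 0) {
            return; // nothing to distribute to
        }
        float share = pageScore / allCount;
        for (Contribution c : scoreData) {
            c.score = share;
        }
    }
}
```

Dividing by allCount rather than by the size of scoreData means that outlinks dropped before this call still consume their share of the score.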

updateScore

void updateScore(String url,
                 WebPage page,
                 List<ScoreDatum> inlinkedScoreData)
                 throws ScoringFilterException
This method calculates a new score during table update, based on the values contributed by inlinked pages.

Parameters:
url - url of the page
page - page row
inlinkedScoreData - list of ScoreDatum objects for all inlinks pointing to this URL.
Throws:
ScoringFilterException
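
A simple update rule is to add the inlinked contributions to the page's current score. The sketch below assumes that additive rule and represents contributions as plain floats; UpdateSketch is a hypothetical name, and real filters may combine values differently.

```java
import java.util.List;

final class UpdateSketch {
    // Add every inlinked contribution to the current score and return the result.
    static float updated(float currentScore, List<Float> inlinkedScores) {
        float total = currentScore;
        for (float contribution : inlinkedScores) {
            total += contribution;
        }
        return total;
    }
}
```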

indexerScore

float indexerScore(String url,
                   NutchDocument doc,
                   WebPage page,
                   float initScore)
                   throws ScoringFilterException
This method calculates a Lucene document boost.

Parameters:
url - url of the page
doc - document. NOTE: this already contains all information collected by indexing filters. Implementations may modify this instance, in order to store/remove some information.
page - page row
initScore - initial boost value for the Lucene document.
Returns:
boost value for the Lucene document. This value is passed as the initScore argument to the next scoring filter in the chain. NOTE: implementations may also express other scoring strategies by modifying the Lucene document directly.
Throws:
ScoringFilterException
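
As a rough illustration, a boost could be derived by scaling the incoming boost with a power of the page score. The formula, the BoostSketch class, and the power parameter below are assumptions made for this sketch, not a statement of what any particular filter computes.

```java
final class BoostSketch {
    // Assumed formula: boost = pageScore^power * initScore.
    // A power below 1 dampens the effect of large page scores.
    static float boost(float initScore, float pageScore, float power) {
        return (float) Math.pow(pageScore, power) * initScore;
    }
}
```

Since the result is handed to the next filter as its initScore, a filter that returns initScore unchanged leaves the boost to the other filters in the chain.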


Copyright © 2012 The Apache Software Foundation