Packages that use WebPage | |
---|---|
org.apache.nutch.analysis.lang | Text document language identifier. |
org.apache.nutch.crawl | Crawl control code. |
org.apache.nutch.fetcher | The Nutch robot. |
org.apache.nutch.host | |
org.apache.nutch.indexer | Maintain Lucene full-text indexes. |
org.apache.nutch.indexer.anchor | An indexing plugin for inbound anchor text. |
org.apache.nutch.indexer.basic | A basic indexing plugin. |
org.apache.nutch.indexer.more | A more indexing plugin. |
org.apache.nutch.indexer.subcollection | |
org.apache.nutch.indexer.tld | Top Level Domain Indexing plugin. |
org.apache.nutch.microformats.reltag | A microformats Rel-Tag Parser/Indexer/Querier plugin. |
org.apache.nutch.parse | |
org.apache.nutch.parse.html | An HTML document parsing plugin. |
org.apache.nutch.parse.js | |
org.apache.nutch.parse.tika | |
org.apache.nutch.protocol | |
org.apache.nutch.protocol.file | Protocol plugin which supports retrieving local file resources. |
org.apache.nutch.protocol.ftp | Protocol plugin which supports retrieving documents via the ftp protocol. |
org.apache.nutch.protocol.http | Protocol plugin which supports retrieving documents via the http protocol. |
org.apache.nutch.protocol.http.api | Common API used by HTTP plugins (http, httpclient). |
org.apache.nutch.protocol.httpclient | Protocol plugin which supports retrieving documents via the HTTP and HTTPS protocols, optionally with Basic, Digest and NTLM authentication schemes for web server as well as proxy server. |
org.apache.nutch.protocol.sftp | Protocol plugin which supports retrieving documents via the sftp protocol. |
org.apache.nutch.scoring | |
org.apache.nutch.scoring.link | |
org.apache.nutch.scoring.opic | |
org.apache.nutch.scoring.tld | Top Level Domain Scoring plugin. |
org.apache.nutch.storage | |
org.apache.nutch.util | |
org.apache.nutch.util.domain | org.apache.nutch.util.domain |
org.creativecommons.nutch | Sample plugins that parse and index Creative Commons metadata. |
Uses of WebPage in org.apache.nutch.analysis.lang |
---|
Methods in org.apache.nutch.analysis.lang with parameters of type WebPage | |
---|---|
NutchDocument |
LanguageIndexingFilter.filter(NutchDocument doc,
String url,
WebPage page)
|
Parse |
HTMLLanguageParser.filter(String url,
WebPage page,
Parse parse,
HTMLMetaTags metaTags,
DocumentFragment doc)
Scan the HTML document looking for possible indications of content language. |
Uses of WebPage in org.apache.nutch.crawl |
---|
Methods in org.apache.nutch.crawl that return WebPage | |
---|---|
WebPage |
URLWebPage.getDatum()
|
Methods in org.apache.nutch.crawl with parameters of type WebPage | |
---|---|
byte[] |
MD5Signature.calculate(WebPage page)
|
byte[] |
TextProfileSignature.calculate(WebPage page)
|
abstract byte[] |
Signature.calculate(WebPage page)
|
long |
AbstractFetchSchedule.calculateLastFetchTime(WebPage page)
This method returns the last fetch time of the CrawlDatum. |
long |
FetchSchedule.calculateLastFetchTime(WebPage page)
Calculates last fetch time of the given CrawlDatum. |
void |
AbstractFetchSchedule.forceRefetch(String url,
WebPage page,
boolean asap)
This method resets fetchTime, fetchInterval, modifiedTime, retriesSinceFetch and page signature, so that it forces refetching. |
void |
FetchSchedule.forceRefetch(String url,
WebPage row,
boolean asap)
This method resets fetchTime, fetchInterval, modifiedTime and page signature, so that it forces refetching. |
int |
URLPartitioner.SelectorEntryPartitioner.getPartition(GeneratorJob.SelectorEntry selectorEntry,
WebPage page,
int numReduces)
|
void |
AbstractFetchSchedule.initializeSchedule(String url,
WebPage page)
Initialize fetch schedule related data. |
void |
FetchSchedule.initializeSchedule(String url,
WebPage page)
Initialize fetch schedule related data. |
void |
GeneratorMapper.map(String reversedUrl,
WebPage page,
Mapper.Context context)
|
protected void |
InjectorJob.InjectorMapper.map(String key,
WebPage row,
Mapper.Context context)
|
protected void |
WebTableReader.WebTableStatMapper.map(String key,
WebPage value,
Mapper.Context context)
|
protected void |
WebTableReader.WebTableRegexMapper.map(String key,
WebPage value,
Mapper.Context context)
|
void |
DbUpdateMapper.map(String key,
WebPage page,
Mapper.Context context)
|
void |
URLWebPage.setDatum(WebPage datum)
|
void |
AbstractFetchSchedule.setFetchSchedule(String url,
WebPage page,
long prevFetchTime,
long prevModifiedTime,
long fetchTime,
long modifiedTime,
int state)
Sets the fetchInterval and fetchTime on a successfully fetched page. |
void |
FetchSchedule.setFetchSchedule(String url,
WebPage page,
long prevFetchTime,
long prevModifiedTime,
long fetchTime,
long modifiedTime,
int state)
Sets the fetchInterval and fetchTime on a successfully fetched page. |
void |
AdaptiveFetchSchedule.setFetchSchedule(String url,
WebPage page,
long prevFetchTime,
long prevModifiedTime,
long fetchTime,
long modifiedTime,
int state)
|
void |
DefaultFetchSchedule.setFetchSchedule(String url,
WebPage page,
long prevFetchTime,
long prevModifiedTime,
long fetchTime,
long modifiedTime,
int state)
|
void |
AbstractFetchSchedule.setPageGoneSchedule(String url,
WebPage page,
long prevFetchTime,
long prevModifiedTime,
long fetchTime)
This method specifies how to schedule refetching of pages marked as GONE. |
void |
FetchSchedule.setPageGoneSchedule(String url,
WebPage page,
long prevFetchTime,
long prevModifiedTime,
long fetchTime)
This method specifies how to schedule refetching of pages marked as GONE. |
void |
AbstractFetchSchedule.setPageRetrySchedule(String url,
WebPage page,
long prevFetchTime,
long prevModifiedTime,
long fetchTime)
This method adjusts the fetch schedule if fetching needs to be re-tried due to transient errors. |
void |
FetchSchedule.setPageRetrySchedule(String url,
WebPage page,
long prevFetchTime,
long prevModifiedTime,
long fetchTime)
This method adjusts the fetch schedule if fetching needs to be re-tried due to transient errors. |
boolean |
AbstractFetchSchedule.shouldFetch(String url,
WebPage page,
long curTime)
This method indicates whether the page is suitable for selection in the current fetchlist. |
boolean |
FetchSchedule.shouldFetch(String url,
WebPage page,
long curTime)
This method indicates whether the page is suitable for selection in the current fetchlist. |
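As a rough illustration of the scheduling contract described in the table above, the sketch below models `shouldFetch` and `forceRefetch` in plain Java. `PageStub`, `DEFAULT_INTERVAL_SECONDS`, the explicit `now` argument, and the `fetchTime <= curTime` comparison are assumptions made for this example; they are not taken from the Nutch sources, where these methods operate on `WebPage`.

```java
/** Illustrative only: a plain-Java model of the scheduling contract, not Nutch code. */
final class FetchScheduleSketch {

    // Assumed default re-fetch interval (30 days, in seconds); hypothetical value.
    static final int DEFAULT_INTERVAL_SECONDS = 30 * 24 * 60 * 60;

    /** Hypothetical stand-in for the scheduling fields of a WebPage. */
    static final class PageStub {
        long fetchTime;          // next scheduled fetch, epoch milliseconds
        int fetchInterval;       // seconds between fetches
        long modifiedTime;       // last known modification time
        int retriesSinceFetch;   // failed attempts since the last success
        byte[] signature;        // content signature, null when unknown
    }

    /** Assumption: a page is eligible once its scheduled fetch time has been reached. */
    static boolean shouldFetch(PageStub page, long curTime) {
        return page.fetchTime <= curTime;
    }

    /** Resets the scheduling state; with asap set, the page becomes due at 'now'. */
    static void forceRefetch(PageStub page, boolean asap, long now) {
        page.fetchInterval = DEFAULT_INTERVAL_SECONDS;
        page.modifiedTime = 0L;
        page.retriesSinceFetch = 0;
        page.signature = null;
        if (asap) {
            page.fetchTime = now;
        }
    }

    public static void main(String[] args) {
        PageStub page = new PageStub();
        page.fetchTime = 2_000L;
        System.out.println(shouldFetch(page, 1_000L)); // false: not yet due
        forceRefetch(page, true, 1_000L);
        System.out.println(shouldFetch(page, 1_000L)); // true: due immediately
    }
}
```

Passing the current time in explicitly keeps the sketch deterministic; a real implementation would more likely read the system clock.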
Method parameters in org.apache.nutch.crawl with type arguments of type WebPage | |
---|---|
protected void |
GeneratorReducer.reduce(GeneratorJob.SelectorEntry key,
Iterable<WebPage> values,
Reducer.Context context)
|
Constructors in org.apache.nutch.crawl with parameters of type WebPage | |
---|---|
URLWebPage(String url,
WebPage datum)
|
Uses of WebPage in org.apache.nutch.fetcher |
---|
Methods in org.apache.nutch.fetcher that return WebPage | |
---|---|
WebPage |
FetchEntry.getWebPage()
|
Methods in org.apache.nutch.fetcher with parameters of type WebPage | |
---|---|
protected void |
FetcherJob.FetcherMapper.map(String key,
WebPage page,
Mapper.Context context)
|
Constructors in org.apache.nutch.fetcher with parameters of type WebPage | |
---|---|
FetchEntry(Configuration conf,
String key,
WebPage page)
|
Uses of WebPage in org.apache.nutch.host |
---|
Methods in org.apache.nutch.host with parameters of type WebPage | |
---|---|
protected void |
HostDbUpdateJob.Mapper.map(String key,
WebPage value,
Mapper.Context context)
|
Method parameters in org.apache.nutch.host with type arguments of type WebPage | |
---|---|
protected void |
HostDbUpdateReducer.reduce(Text key,
Iterable<WebPage> values,
Reducer.Context context)
|
Uses of WebPage in org.apache.nutch.indexer |
---|
Fields in org.apache.nutch.indexer with type parameters of type WebPage | |
---|---|
org.apache.gora.store.DataStore<String,WebPage> |
IndexerJob.IndexerMapper.store
|
Methods in org.apache.nutch.indexer with parameters of type WebPage | |
---|---|
NutchDocument |
IndexingFilter.filter(NutchDocument doc,
String url,
WebPage page)
Adds fields or otherwise modifies the document that will be indexed for a parse. |
NutchDocument |
IndexingFilters.filter(NutchDocument doc,
String url,
WebPage page)
Run all defined filters. |
NutchDocument |
IndexUtil.index(String key,
WebPage page)
Index a webpage. |
void |
IndexerJob.IndexerMapper.map(String key,
WebPage page,
Mapper.Context context)
|
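The "run all defined filters" pattern listed for `IndexingFilters.filter` can be pictured as a simple chain. The following is a minimal sketch under stated assumptions: a document is modelled as a `Map<String, String>` rather than a `NutchDocument`, each filter is a `UnaryOperator`, and a `null` result is treated as "drop the document". All names here are hypothetical and none of this is the Nutch implementation.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;

/** Illustrative only: a filter chain over a map-based document, not Nutch code. */
final class IndexingChainSketch {

    /** Runs every filter in order; a null result drops the document (assumed convention). */
    static Map<String, String> runAll(List<UnaryOperator<Map<String, String>>> filters,
                                      Map<String, String> doc) {
        for (UnaryOperator<Map<String, String>> filter : filters) {
            doc = filter.apply(doc);
            if (doc == null) {
                return null; // later filters never see a dropped document
            }
        }
        return doc;
    }

    /** Builds a two-filter chain and returns the resulting "title" field. */
    static String demo(String url) {
        List<UnaryOperator<Map<String, String>>> filters = new ArrayList<>();
        filters.add(d -> { d.put("url", url); return d; });
        filters.add(d -> { d.put("title", "Page at " + d.get("url")); return d; });
        Map<String, String> doc = runAll(filters, new HashMap<>());
        return doc == null ? null : doc.get("title");
    }

    /** Returns true when a rejecting filter stops the chain. */
    static boolean dropsOnNull() {
        List<UnaryOperator<Map<String, String>>> filters = new ArrayList<>();
        filters.add(d -> null);
        filters.add(d -> { d.put("unreachable", "yes"); return d; });
        return runAll(filters, new HashMap<>()) == null;
    }

    public static void main(String[] args) {
        System.out.println(demo("http://example.org/")); // Page at http://example.org/
        System.out.println(dropsOnNull());               // true
    }
}
```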
Uses of WebPage in org.apache.nutch.indexer.anchor |
---|
Methods in org.apache.nutch.indexer.anchor with parameters of type WebPage | |
---|---|
NutchDocument |
AnchorIndexingFilter.filter(NutchDocument doc,
String url,
WebPage page)
|
Uses of WebPage in org.apache.nutch.indexer.basic |
---|
Methods in org.apache.nutch.indexer.basic with parameters of type WebPage | |
---|---|
NutchDocument |
BasicIndexingFilter.filter(NutchDocument doc,
String url,
WebPage page)
|
Uses of WebPage in org.apache.nutch.indexer.more |
---|
Methods in org.apache.nutch.indexer.more with parameters of type WebPage | |
---|---|
NutchDocument |
MoreIndexingFilter.filter(NutchDocument doc,
String url,
WebPage page)
|
Uses of WebPage in org.apache.nutch.indexer.subcollection |
---|
Methods in org.apache.nutch.indexer.subcollection with parameters of type WebPage | |
---|---|
NutchDocument |
SubcollectionIndexingFilter.filter(NutchDocument doc,
String url,
WebPage page)
|
Uses of WebPage in org.apache.nutch.indexer.tld |
---|
Methods in org.apache.nutch.indexer.tld with parameters of type WebPage | |
---|---|
NutchDocument |
TLDIndexingFilter.filter(NutchDocument doc,
String url,
WebPage page)
|
Uses of WebPage in org.apache.nutch.microformats.reltag |
---|
Methods in org.apache.nutch.microformats.reltag with parameters of type WebPage | |
---|---|
NutchDocument |
RelTagIndexingFilter.filter(NutchDocument doc,
String url,
WebPage page)
|
Parse |
RelTagParser.filter(String url,
WebPage page,
Parse parse,
HTMLMetaTags metaTags,
DocumentFragment doc)
|
Uses of WebPage in org.apache.nutch.parse |
---|
Methods in org.apache.nutch.parse with parameters of type WebPage | |
---|---|
Parse |
ParseFilter.filter(String url,
WebPage page,
Parse parse,
HTMLMetaTags metaTags,
DocumentFragment doc)
Adds metadata or otherwise modifies a parse, given the DOM tree of a page. |
Parse |
ParseFilters.filter(String url,
WebPage page,
Parse parse,
HTMLMetaTags metaTags,
DocumentFragment doc)
Run all defined filters. |
Parse |
Parser.getParse(String url,
WebPage page)
This method parses the content of a WebPage instance. |
static boolean |
ParserJob.isTruncated(String url,
WebPage page)
Checks if the page's content is truncated. |
void |
ParserJob.ParserMapper.map(String key,
WebPage page,
Mapper.Context context)
|
Parse |
ParseUtil.parse(String url,
WebPage page)
Performs a parse by iterating through a List of preferred Parsers until a successful parse is performed and a Parse object is returned. |
URLWebPage |
ParseUtil.process(String key,
WebPage page)
Parses given web page and stores parsed content within page. |
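The description of `ParseUtil.parse` above (try each preferred parser until one succeeds) can be sketched as a fallback loop. This is a hedged illustration only: parsers are modelled as `Function<String, Optional<String>>`, the `htmlOnly` and `plainText` parsers are invented for the example, and treating a thrown `RuntimeException` as a failed attempt is an assumption rather than documented Nutch behaviour.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.function.Function;

/** Illustrative only: first-success parser fallback, not Nutch code. */
final class ParserFallbackSketch {

    /** Tries each parser in order and returns the first successful result. */
    static Optional<String> parse(List<Function<String, Optional<String>>> parsers,
                                  String content) {
        for (Function<String, Optional<String>> parser : parsers) {
            try {
                Optional<String> result = parser.apply(content);
                if (result.isPresent()) {
                    return result;
                }
            } catch (RuntimeException e) {
                // A failing parser is skipped so the next one can be tried.
            }
        }
        return Optional.empty();
    }

    /** Hypothetical parsers: one accepts only HTML-looking input, one accepts anything. */
    static String demo(String content) {
        Function<String, Optional<String>> htmlOnly = c ->
                c.startsWith("<html")
                        ? Optional.of("html:" + c.length())
                        : Optional.<String>empty();
        Function<String, Optional<String>> plainText = c -> Optional.of("text:" + c);
        return parse(Arrays.asList(htmlOnly, plainText), content).orElse("unparsed");
    }

    public static void main(String[] args) {
        System.out.println(demo("<html></html>")); // html:13
        System.out.println(demo("hello"));         // text:hello
    }
}
```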
Uses of WebPage in org.apache.nutch.parse.html |
---|
Methods in org.apache.nutch.parse.html with parameters of type WebPage | |
---|---|
Parse |
HtmlParser.getParse(String url,
WebPage page)
|
Uses of WebPage in org.apache.nutch.parse.js |
---|
Methods in org.apache.nutch.parse.js with parameters of type WebPage | |
---|---|
Parse |
JSParseFilter.filter(String url,
WebPage page,
Parse parse,
HTMLMetaTags metaTags,
DocumentFragment doc)
|
Parse |
JSParseFilter.getParse(String url,
WebPage page)
|
Uses of WebPage in org.apache.nutch.parse.tika |
---|
Methods in org.apache.nutch.parse.tika with parameters of type WebPage | |
---|---|
Parse |
TikaParser.getParse(String url,
WebPage page)
|
Uses of WebPage in org.apache.nutch.protocol |
---|
Methods in org.apache.nutch.protocol with parameters of type WebPage | |
---|---|
ProtocolOutput |
Protocol.getProtocolOutput(String url,
WebPage page)
Returns the Content for a fetchlist entry. |
RobotRules |
Protocol.getRobotRules(String url,
WebPage page)
Retrieve robot rules applicable for this url. |
Uses of WebPage in org.apache.nutch.protocol.file |
---|
Methods in org.apache.nutch.protocol.file with parameters of type WebPage | |
---|---|
ProtocolOutput |
File.getProtocolOutput(String url,
WebPage page)
|
RobotRules |
File.getRobotRules(String url,
WebPage page)
|
Constructors in org.apache.nutch.protocol.file with parameters of type WebPage | |
---|---|
FileResponse(URL url,
WebPage page,
File file,
Configuration conf)
|
Uses of WebPage in org.apache.nutch.protocol.ftp |
---|
Methods in org.apache.nutch.protocol.ftp with parameters of type WebPage | |
---|---|
ProtocolOutput |
Ftp.getProtocolOutput(String url,
WebPage page)
|
RobotRules |
Ftp.getRobotRules(String url,
WebPage page)
|
Constructors in org.apache.nutch.protocol.ftp with parameters of type WebPage | |
---|---|
FtpResponse(URL url,
WebPage page,
Ftp ftp,
Configuration conf)
|
Uses of WebPage in org.apache.nutch.protocol.http |
---|
Methods in org.apache.nutch.protocol.http with parameters of type WebPage | |
---|---|
protected Response |
Http.getResponse(URL url,
WebPage page,
boolean redirect)
|
Constructors in org.apache.nutch.protocol.http with parameters of type WebPage | |
---|---|
HttpResponse(HttpBase http,
URL url,
WebPage page)
|
Uses of WebPage in org.apache.nutch.protocol.http.api |
---|
Methods in org.apache.nutch.protocol.http.api with parameters of type WebPage | |
---|---|
ProtocolOutput |
HttpBase.getProtocolOutput(String url,
WebPage page)
|
protected abstract Response |
HttpBase.getResponse(URL url,
WebPage page,
boolean followRedirects)
|
RobotRules |
HttpBase.getRobotRules(String url,
WebPage page)
|
Uses of WebPage in org.apache.nutch.protocol.httpclient |
---|
Methods in org.apache.nutch.protocol.httpclient with parameters of type WebPage | |
---|---|
protected Response |
Http.getResponse(URL url,
WebPage page,
boolean redirect)
Fetches the url with a configured HTTP client and gets the response. |
Uses of WebPage in org.apache.nutch.protocol.sftp |
---|
Methods in org.apache.nutch.protocol.sftp with parameters of type WebPage | |
---|---|
ProtocolOutput |
Sftp.getProtocolOutput(String url,
WebPage page)
|
RobotRules |
Sftp.getRobotRules(String url,
WebPage page)
|
Uses of WebPage in org.apache.nutch.scoring |
---|
Methods in org.apache.nutch.scoring with parameters of type WebPage | |
---|---|
void |
ScoringFilter.distributeScoreToOutlinks(String fromUrl,
WebPage page,
Collection<ScoreDatum> scoreData,
int allCount)
Distribute score value from the current page to all its outlinked pages. |
void |
ScoringFilters.distributeScoreToOutlinks(String fromUrl,
WebPage row,
Collection<ScoreDatum> scoreData,
int allCount)
|
float |
ScoringFilter.generatorSortValue(String url,
WebPage page,
float initSort)
This method prepares a sort value for the purpose of sorting and selecting top N scoring pages during fetchlist generation. |
float |
ScoringFilters.generatorSortValue(String url,
WebPage row,
float initSort)
Calculate a sort value for Generate. |
float |
ScoringFilter.indexerScore(String url,
NutchDocument doc,
WebPage page,
float initScore)
This method calculates a Lucene document boost. |
float |
ScoringFilters.indexerScore(String url,
NutchDocument doc,
WebPage row,
float initScore)
|
void |
ScoringFilter.initialScore(String url,
WebPage page)
Set an initial score for newly discovered pages. |
void |
ScoringFilters.initialScore(String url,
WebPage row)
Calculate a new initial score, used when adding newly discovered pages. |
void |
ScoringFilter.injectedScore(String url,
WebPage page)
Set an initial score for newly injected pages. |
void |
ScoringFilters.injectedScore(String url,
WebPage row)
Calculate a new initial score, used when injecting new pages. |
void |
ScoringFilter.updateScore(String url,
WebPage page,
List<ScoreDatum> inlinkedScoreData)
This method calculates a new score during table update, based on the values contributed by inlinked pages. |
void |
ScoringFilters.updateScore(String url,
WebPage row,
List<ScoreDatum> inlinkedScoreData)
|
Uses of WebPage in org.apache.nutch.scoring.link |
---|
Methods in org.apache.nutch.scoring.link with parameters of type WebPage | |
---|---|
void |
LinkAnalysisScoringFilter.distributeScoreToOutlinks(String fromUrl,
WebPage page,
Collection<ScoreDatum> scoreData,
int allCount)
|
float |
LinkAnalysisScoringFilter.generatorSortValue(String url,
WebPage page,
float initSort)
|
float |
LinkAnalysisScoringFilter.indexerScore(String url,
NutchDocument doc,
WebPage page,
float initScore)
|
void |
LinkAnalysisScoringFilter.initialScore(String url,
WebPage page)
|
void |
LinkAnalysisScoringFilter.injectedScore(String url,
WebPage page)
|
void |
LinkAnalysisScoringFilter.updateScore(String url,
WebPage page,
List<ScoreDatum> inlinkedScoreData)
|
Uses of WebPage in org.apache.nutch.scoring.opic |
---|
Methods in org.apache.nutch.scoring.opic with parameters of type WebPage | |
---|---|
void |
OPICScoringFilter.distributeScoreToOutlinks(String fromUrl,
WebPage row,
Collection<ScoreDatum> scoreData,
int allCount)
Get cash on hand, divide it by the number of outlinks and apply. |
float |
OPICScoringFilter.generatorSortValue(String url,
WebPage row,
float initSort)
Use getScore(). |
float |
OPICScoringFilter.indexerScore(String url,
NutchDocument doc,
WebPage row,
float initScore)
Dampen the boost value by scorePower. |
void |
OPICScoringFilter.initialScore(String url,
WebPage row)
Set to 0.0f (unknown value) - inlink contributions will bring it to a correct level. |
void |
OPICScoringFilter.injectedScore(String url,
WebPage row)
|
void |
OPICScoringFilter.updateScore(String url,
WebPage row,
List<ScoreDatum> inlinkedScoreData)
Increase the score by a sum of inlinked scores. |
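The one-line descriptions of the OPIC filter above translate into simple arithmetic. The sketch below works through three of them on plain `float` values; it assumes an even split of cash across outlinks and a boost of `score^scorePower * initScore`, and the method names and parameters are illustrative rather than the `OPICScoringFilter` signatures.

```java
import java.util.Arrays;

/** Illustrative only: the arithmetic behind the OPIC descriptions, not Nutch code. */
final class OpicSketch {

    /** Divides the cash on hand evenly among the outlinks; no outlinks, nothing to hand out. */
    static float[] distributeCash(float cash, int outlinkCount) {
        if (outlinkCount <= 0) {
            return new float[0];
        }
        float[] shares = new float[outlinkCount];
        Arrays.fill(shares, cash / outlinkCount);
        return shares;
    }

    /** Increases the score by the sum of the scores contributed by inlinks. */
    static float updateScore(float score, float[] inlinkedScores) {
        float sum = 0f;
        for (float s : inlinkedScores) {
            sum += s;
        }
        return score + sum;
    }

    /** Dampens the boost by raising the score to scorePower (assumed formula). */
    static float indexerScore(float score, float scorePower, float initScore) {
        return (float) Math.pow(score, scorePower) * initScore;
    }

    public static void main(String[] args) {
        System.out.println(distributeCash(1.0f, 4)[0]);                    // 0.25
        System.out.println(updateScore(0.5f, new float[] {0.25f, 0.25f})); // 1.0
        System.out.println(indexerScore(4.0f, 0.5f, 1.0f));
    }
}
```

With `scorePower` below 1, large scores are compressed, which is one way to read "dampen" in the description.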
Uses of WebPage in org.apache.nutch.scoring.tld |
---|
Methods in org.apache.nutch.scoring.tld with parameters of type WebPage | |
---|---|
void |
TLDScoringFilter.distributeScoreToOutlinks(String fromUrl,
WebPage page,
Collection<ScoreDatum> scoreData,
int allCount)
|
float |
TLDScoringFilter.generatorSortValue(String url,
WebPage page,
float initSort)
|
float |
TLDScoringFilter.indexerScore(String url,
NutchDocument doc,
WebPage page,
float initScore)
|
void |
TLDScoringFilter.initialScore(String url,
WebPage page)
|
void |
TLDScoringFilter.injectedScore(String url,
WebPage page)
|
void |
TLDScoringFilter.updateScore(String url,
WebPage page,
List<ScoreDatum> inlinkedScoreData)
|
Uses of WebPage in org.apache.nutch.storage |
---|
Methods in org.apache.nutch.storage that return WebPage | |
---|---|
WebPage |
WebPage.newInstance(org.apache.gora.persistency.StateManager stateManager)
|
Methods in org.apache.nutch.storage with parameters of type WebPage | |
---|---|
org.apache.avro.util.Utf8 |
Mark.checkMark(WebPage page)
|
void |
Mark.putMark(WebPage page,
String markValue)
|
void |
Mark.putMark(WebPage page,
org.apache.avro.util.Utf8 markValue)
|
org.apache.avro.util.Utf8 |
Mark.removeMark(WebPage page)
|
org.apache.avro.util.Utf8 |
Mark.removeMarkIfExist(WebPage page)
Remove the mark only if the mark is present on the page. |
Method parameters in org.apache.nutch.storage with type arguments of type WebPage | |
---|---|
static <K,V> void |
StorageUtils.initMapperJob(Job job,
Collection<WebPage.Field> fields,
Class<K> outKeyClass,
Class<V> outValueClass,
Class<? extends org.apache.gora.mapreduce.GoraMapper<String,WebPage,K,V>> mapperClass)
|
static <K,V> void |
StorageUtils.initMapperJob(Job job,
Collection<WebPage.Field> fields,
Class<K> outKeyClass,
Class<V> outValueClass,
Class<? extends org.apache.gora.mapreduce.GoraMapper<String,WebPage,K,V>> mapperClass,
boolean reuseObjects)
|
static <K,V> void |
StorageUtils.initMapperJob(Job job,
Collection<WebPage.Field> fields,
Class<K> outKeyClass,
Class<V> outValueClass,
Class<? extends org.apache.gora.mapreduce.GoraMapper<String,WebPage,K,V>> mapperClass,
Class<? extends Partitioner<K,V>> partitionerClass)
|
static <K,V> void |
StorageUtils.initMapperJob(Job job,
Collection<WebPage.Field> fields,
Class<K> outKeyClass,
Class<V> outValueClass,
Class<? extends org.apache.gora.mapreduce.GoraMapper<String,WebPage,K,V>> mapperClass,
Class<? extends Partitioner<K,V>> partitionerClass,
boolean reuseObjects)
|
static <K,V> void |
StorageUtils.initReducerJob(Job job,
Class<? extends org.apache.gora.mapreduce.GoraReducer<K,V,String,WebPage>> reducerClass)
|
Uses of WebPage in org.apache.nutch.util |
---|
Methods in org.apache.nutch.util that return WebPage | |
---|---|
WebPage |
WebPageWritable.getWebPage()
|
Methods in org.apache.nutch.util with parameters of type WebPage | |
---|---|
void |
EncodingDetector.autoDetectClues(WebPage page,
boolean filter)
|
String |
EncodingDetector.guessEncoding(WebPage page,
String defaultValue)
Guess the encoding with the previously specified list of clues. |
void |
WebPageWritable.setWebPage(WebPage webPage)
|
Method parameters in org.apache.nutch.util with type arguments of type WebPage | |
---|---|
protected void |
IdentityPageReducer.reduce(String key,
Iterable<WebPage> values,
Reducer.Context context)
|
Constructors in org.apache.nutch.util with parameters of type WebPage | |
---|---|
WebPageWritable(Configuration conf,
WebPage webPage)
|
Uses of WebPage in org.apache.nutch.util.domain |
---|
Methods in org.apache.nutch.util.domain with parameters of type WebPage | |
---|---|
protected void |
DomainStatistics.DomainStatisticsMapper.map(String key,
WebPage value,
Mapper.Context context)
|
Uses of WebPage in org.creativecommons.nutch |
---|
Methods in org.creativecommons.nutch with parameters of type WebPage | |
---|---|
NutchDocument |
CCIndexingFilter.filter(NutchDocument doc,
String url,
WebPage page)
|
Parse |
CCParseFilter.filter(String url,
WebPage page,
Parse parse,
HTMLMetaTags metaTags,
DocumentFragment doc)
Adds metadata or otherwise modifies a parse of an HTML document, given the DOM tree of a page. |
static void |
CCParseFilter.Walker.walk(Node doc,
URL base,
WebPage page,
Configuration conf)
Scan the document adding attributes to metadata. |