Uses of Class org.apache.nutch.storage.WebPage (apache-nutch 2.0 API)


Uses of Class
org.apache.nutch.storage.WebPage

Packages that use WebPage
org.apache.nutch.analysis.lang	Text document language identifier.
org.apache.nutch.crawl	Crawl control code.
org.apache.nutch.fetcher	The Nutch robot.
org.apache.nutch.host
org.apache.nutch.indexer	Maintain Lucene full-text indexes.
org.apache.nutch.indexer.anchor	An indexing plugin for inbound anchor text.
org.apache.nutch.indexer.basic	A basic indexing plugin.
org.apache.nutch.indexer.more	A more indexing plugin.
org.apache.nutch.indexer.subcollection
org.apache.nutch.indexer.tld	Top Level Domain Indexing plugin.
org.apache.nutch.microformats.reltag	A microformats Rel-Tag Parser/Indexer/Querier plugin.
org.apache.nutch.parse
org.apache.nutch.parse.html	An HTML document parsing plugin.
org.apache.nutch.parse.js
org.apache.nutch.parse.tika
org.apache.nutch.protocol
org.apache.nutch.protocol.file	Protocol plugin which supports retrieving local file resources.
org.apache.nutch.protocol.ftp	Protocol plugin which supports retrieving documents via the ftp protocol.
org.apache.nutch.protocol.http	Protocol plugin which supports retrieving documents via the http protocol.
org.apache.nutch.protocol.http.api	Common API used by HTTP plugins (`http`, `httpclient`)
org.apache.nutch.protocol.httpclient	Protocol plugin which supports retrieving documents via the HTTP and HTTPS protocols, optionally with Basic, Digest and NTLM authentication schemes for web server as well as proxy server.
org.apache.nutch.protocol.sftp	Protocol plugin which supports retrieving documents via the sftp protocol.
org.apache.nutch.scoring
org.apache.nutch.scoring.link
org.apache.nutch.scoring.opic
org.apache.nutch.scoring.tld	Top Level Domain Scoring plugin.
org.apache.nutch.storage
org.apache.nutch.util
org.apache.nutch.util.domain	org.apache.nutch.util.domain
org.creativecommons.nutch	Sample plugins that parse and index Creative Commons metadata.

Uses of WebPage in org.apache.nutch.analysis.lang

Methods in org.apache.nutch.analysis.lang with parameters of type WebPage
`NutchDocument`	`LanguageIndexingFilter.filter(NutchDocument doc, String url, WebPage page)`
`Parse`	`HTMLLanguageParser.filter(String url, WebPage page, Parse parse, HTMLMetaTags metaTags, DocumentFragment doc)` Scan the HTML document looking at possible indications of content language.

Uses of WebPage in org.apache.nutch.crawl

Methods in org.apache.nutch.crawl that return WebPage
`WebPage`	`URLWebPage.getDatum()`

Methods in org.apache.nutch.crawl with parameters of type WebPage
`byte[]`	`MD5Signature.calculate(WebPage page)`
`byte[]`	`TextProfileSignature.calculate(WebPage page)`
`abstract byte[]`	`Signature.calculate(WebPage page)`
`long`	`AbstractFetchSchedule.calculateLastFetchTime(WebPage page)` This method returns the last fetch time of the CrawlDatum.
`long`	`FetchSchedule.calculateLastFetchTime(WebPage page)` Calculates last fetch time of the given CrawlDatum.
`void`	`AbstractFetchSchedule.forceRefetch(String url, WebPage page, boolean asap)` This method resets fetchTime, fetchInterval, modifiedTime, retriesSinceFetch and page signature, so that it forces refetching.
`void`	`FetchSchedule.forceRefetch(String url, WebPage row, boolean asap)` This method resets fetchTime, fetchInterval, modifiedTime and page signature, so that it forces refetching.
`int`	`URLPartitioner.SelectorEntryPartitioner.getPartition(GeneratorJob.SelectorEntry selectorEntry, WebPage page, int numReduces)`
`void`	`AbstractFetchSchedule.initializeSchedule(String url, WebPage page)` Initialize fetch schedule related data.
`void`	`FetchSchedule.initializeSchedule(String url, WebPage page)` Initialize fetch schedule related data.
`void`	`GeneratorMapper.map(String reversedUrl, WebPage page, Mapper.Context context)`
`protected void`	`InjectorJob.InjectorMapper.map(String key, WebPage row, Mapper.Context context)`
`protected void`	`WebTableReader.WebTableStatMapper.map(String key, WebPage value, Mapper.Context context)`
`protected void`	`WebTableReader.WebTableRegexMapper.map(String key, WebPage value, Mapper.Context context)`
`void`	`DbUpdateMapper.map(String key, WebPage page, Mapper.Context context)`
`void`	`URLWebPage.setDatum(WebPage datum)`
`void`	`AbstractFetchSchedule.setFetchSchedule(String url, WebPage page, long prevFetchTime, long prevModifiedTime, long fetchTime, long modifiedTime, int state)` Sets the `fetchInterval` and `fetchTime` on a successfully fetched page.
`void`	`FetchSchedule.setFetchSchedule(String url, WebPage page, long prevFetchTime, long prevModifiedTime, long fetchTime, long modifiedTime, int state)` Sets the `fetchInterval` and `fetchTime` on a successfully fetched page.
`void`	`AdaptiveFetchSchedule.setFetchSchedule(String url, WebPage page, long prevFetchTime, long prevModifiedTime, long fetchTime, long modifiedTime, int state)`
`void`	`DefaultFetchSchedule.setFetchSchedule(String url, WebPage page, long prevFetchTime, long prevModifiedTime, long fetchTime, long modifiedTime, int state)`
`void`	`AbstractFetchSchedule.setPageGoneSchedule(String url, WebPage page, long prevFetchTime, long prevModifiedTime, long fetchTime)` This method specifies how to schedule refetching of pages marked as GONE.
`void`	`FetchSchedule.setPageGoneSchedule(String url, WebPage page, long prevFetchTime, long prevModifiedTime, long fetchTime)` This method specifies how to schedule refetching of pages marked as GONE.
`void`	`AbstractFetchSchedule.setPageRetrySchedule(String url, WebPage page, long prevFetchTime, long prevModifiedTime, long fetchTime)` This method adjusts the fetch schedule if fetching needs to be re-tried due to transient errors.
`void`	`FetchSchedule.setPageRetrySchedule(String url, WebPage page, long prevFetchTime, long prevModifiedTime, long fetchTime)` This method adjusts the fetch schedule if fetching needs to be re-tried due to transient errors.
`boolean`	`AbstractFetchSchedule.shouldFetch(String url, WebPage page, long curTime)` This method provides information whether the page is suitable for selection in the current fetchlist.
`boolean`	`FetchSchedule.shouldFetch(String url, WebPage page, long curTime)` This method provides information whether the page is suitable for selection in the current fetchlist.
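The scheduling methods listed above can be illustrated with a minimal, self-contained sketch. It does not use any Nutch class: `PageState` is a hypothetical stand-in for the scheduling fields of `WebPage`, the explicit `now` argument and the 30-day default interval are assumptions made for illustration, and the logic is not taken from `AbstractFetchSchedule` or `DefaultFetchSchedule`.

```java
// Hypothetical stand-in for the scheduling fields of a page; not the Nutch WebPage class.
class PageState {
    long fetchTime;        // when the page is next due, in milliseconds since the epoch
    int fetchInterval;     // seconds between fetches
    int retriesSinceFetch; // failed attempts since the last successful fetch
}

class SimpleFetchScheduleSketch {
    // Assumed default interval (30 days, in seconds); chosen only for illustration.
    static final int DEFAULT_INTERVAL_SECONDS = 30 * 24 * 60 * 60;

    // A page qualifies for the current fetchlist once its scheduled time has passed.
    static boolean shouldFetch(PageState page, long curTime) {
        return page.fetchTime <= curTime;
    }

    // After a successful fetch, schedule the next fetch one interval later.
    static void setFetchSchedule(PageState page, long fetchTime) {
        page.retriesSinceFetch = 0;
        page.fetchTime = fetchTime + page.fetchInterval * 1000L;
    }

    // Reset the schedule; when asap is true the page becomes due immediately.
    static void forceRefetch(PageState page, boolean asap, long now) {
        page.fetchInterval = DEFAULT_INTERVAL_SECONDS;
        page.retriesSinceFetch = 0;
        if (asap) {
            page.fetchTime = now;
        }
    }

    public static void main(String[] args) {
        PageState page = new PageState();
        page.fetchInterval = 60;                        // one minute
        setFetchSchedule(page, 1_000L);                 // next fetch due at 61_000 ms
        System.out.println(shouldFetch(page, 30_000L)); // false: not yet due
        System.out.println(shouldFetch(page, 61_000L)); // true: due exactly now
    }
}
```

Passing the clock value in as a parameter keeps the sketch deterministic. The listed `forceRefetch(String url, WebPage page, boolean asap)` signature has no such argument.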

Method parameters in org.apache.nutch.crawl with type arguments of type WebPage
`protected void`	`GeneratorReducer.reduce(GeneratorJob.SelectorEntry key, Iterable<WebPage> values, Reducer.Context context)`

Constructors in org.apache.nutch.crawl with parameters of type WebPage
`URLWebPage(String url, WebPage datum)`

Uses of WebPage in org.apache.nutch.fetcher

Methods in org.apache.nutch.fetcher that return WebPage
`WebPage`	`FetchEntry.getWebPage()`

Methods in org.apache.nutch.fetcher with parameters of type WebPage
`protected void`	`FetcherJob.FetcherMapper.map(String key, WebPage page, Mapper.Context context)`

Constructors in org.apache.nutch.fetcher with parameters of type WebPage
`FetchEntry(Configuration conf, String key, WebPage page)`

Uses of WebPage in org.apache.nutch.host

Methods in org.apache.nutch.host with parameters of type WebPage
`protected void`	`HostDbUpdateJob.Mapper.map(String key, WebPage value, Mapper.Context context)`

Method parameters in org.apache.nutch.host with type arguments of type WebPage
`protected void`	`HostDbUpdateReducer.reduce(Text key, Iterable<WebPage> values, Reducer.Context context)`

Uses of WebPage in org.apache.nutch.indexer

Fields in org.apache.nutch.indexer with type parameters of type WebPage
`org.apache.gora.store.DataStore<String,WebPage>`	`IndexerJob.IndexerMapper.store`

Methods in org.apache.nutch.indexer with parameters of type WebPage
`NutchDocument`	`IndexingFilter.filter(NutchDocument doc, String url, WebPage page)` Adds fields or otherwise modifies the document that will be indexed for a parse.
`NutchDocument`	`IndexingFilters.filter(NutchDocument doc, String url, WebPage page)` Run all defined filters.
`NutchDocument`	`IndexUtil.index(String key, WebPage page)` Index a webpage.
`void`	`IndexerJob.IndexerMapper.map(String key, WebPage page, Mapper.Context context)`
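The "run all defined filters" behaviour of `IndexingFilters.filter` can be pictured as a simple chain. The sketch below is self-contained and uses hypothetical types (`DocSketch`, `DocFilterSketch`, `FilterChainSketch`) in place of `NutchDocument` and `IndexingFilter`. The convention that a `null` result drops the document is an assumption of this sketch, not something stated on this page.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for an indexable document: field name -> value.
class DocSketch {
    final Map<String, String> fields = new LinkedHashMap<>();
}

// Hypothetical filter contract: return the (possibly modified) document,
// or null to signal that the document should not be indexed.
interface DocFilterSketch {
    DocSketch filter(DocSketch doc, String url);
}

class FilterChainSketch {
    private final List<DocFilterSketch> filters = new ArrayList<>();

    FilterChainSketch add(DocFilterSketch f) {
        filters.add(f);
        return this;
    }

    // Run every filter in registration order; stop early if one rejects the document.
    DocSketch filter(DocSketch doc, String url) {
        for (DocFilterSketch f : filters) {
            doc = f.filter(doc, url);
            if (doc == null) {
                return null;
            }
        }
        return doc;
    }

    public static void main(String[] args) {
        FilterChainSketch chain = new FilterChainSketch()
            .add((doc, url) -> { doc.fields.put("url", url); return doc; })
            .add((doc, url) -> url.endsWith(".tmp") ? null : doc);

        System.out.println(chain.filter(new DocSketch(), "http://example.com/a.html").fields);
        System.out.println(chain.filter(new DocSketch(), "http://example.com/a.tmp"));
    }
}
```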

Uses of WebPage in org.apache.nutch.indexer.anchor

Methods in org.apache.nutch.indexer.anchor with parameters of type WebPage
`NutchDocument`	`AnchorIndexingFilter.filter(NutchDocument doc, String url, WebPage page)`

Uses of WebPage in org.apache.nutch.indexer.basic

Methods in org.apache.nutch.indexer.basic with parameters of type WebPage
`NutchDocument`	`BasicIndexingFilter.filter(NutchDocument doc, String url, WebPage page)`

Uses of WebPage in org.apache.nutch.indexer.more

Methods in org.apache.nutch.indexer.more with parameters of type WebPage
`NutchDocument`	`MoreIndexingFilter.filter(NutchDocument doc, String url, WebPage page)`

Uses of WebPage in org.apache.nutch.indexer.subcollection

Methods in org.apache.nutch.indexer.subcollection with parameters of type WebPage
`NutchDocument`	`SubcollectionIndexingFilter.filter(NutchDocument doc, String url, WebPage page)`

Uses of WebPage in org.apache.nutch.indexer.tld

Methods in org.apache.nutch.indexer.tld with parameters of type WebPage
`NutchDocument`	`TLDIndexingFilter.filter(NutchDocument doc, String url, WebPage page)`

Uses of WebPage in org.apache.nutch.microformats.reltag

Methods in org.apache.nutch.microformats.reltag with parameters of type WebPage
`NutchDocument`	`RelTagIndexingFilter.filter(NutchDocument doc, String url, WebPage page)`
`Parse`	`RelTagParser.filter(String url, WebPage page, Parse parse, HTMLMetaTags metaTags, DocumentFragment doc)`

Uses of WebPage in org.apache.nutch.parse

Methods in org.apache.nutch.parse with parameters of type WebPage
`Parse`	`ParseFilter.filter(String url, WebPage page, Parse parse, HTMLMetaTags metaTags, DocumentFragment doc)` Adds metadata or otherwise modifies a parse, given the DOM tree of a page.
`Parse`	`ParseFilters.filter(String url, WebPage page, Parse parse, HTMLMetaTags metaTags, DocumentFragment doc)` Run all defined filters.
`Parse`	`Parser.getParse(String url, WebPage page)` This method parses the content in a WebPage instance.
`static boolean`	`ParserJob.isTruncated(String url, WebPage page)` Checks if the page's content is truncated.
`void`	`ParserJob.ParserMapper.map(String key, WebPage page, Mapper.Context context)`
`Parse`	`ParseUtil.parse(String url, WebPage page)` Performs a parse by iterating through a List of preferred `Parser`s until a successful parse is performed and a `Parse` object is returned.
`URLWebPage`	`ParseUtil.process(String key, WebPage page)` Parses given web page and stores parsed content within page.
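The description of `ParseUtil.parse` (iterate through preferred parsers until one succeeds) corresponds to a first-success fallback loop. The following self-contained sketch shows that pattern only. `ParserSketch` and `ParserFallbackSketch` are hypothetical names, and treating a `null` result or a runtime exception as "this parser failed" is an assumption of the sketch.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical parser contract: return parsed text, or null when the input cannot be handled.
interface ParserSketch {
    String parse(String url, String content);
}

class ParserFallbackSketch {
    // Try each preferred parser in order and return the first successful result.
    static String parse(List<ParserSketch> preferred, String url, String content) {
        for (ParserSketch parser : preferred) {
            try {
                String result = parser.parse(url, content);
                if (result != null) {
                    return result;
                }
            } catch (RuntimeException e) {
                // A failing parser is skipped so that the next candidate gets a chance.
            }
        }
        return null; // no parser succeeded
    }

    public static void main(String[] args) {
        ParserSketch htmlOnly = (url, c) -> c.startsWith("<html") ? "html:" + c.length() : null;
        ParserSketch plainText = (url, c) -> "text:" + c.trim();
        List<ParserSketch> preferred = Arrays.asList(htmlOnly, plainText);

        // The HTML parser declines this input, so the plain-text parser handles it.
        System.out.println(parse(preferred, "http://example.com/", " hello "));
    }
}
```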

Uses of WebPage in org.apache.nutch.parse.html

Methods in org.apache.nutch.parse.html with parameters of type WebPage
`Parse`	`HtmlParser.getParse(String url, WebPage page)`

Uses of WebPage in org.apache.nutch.parse.js

Methods in org.apache.nutch.parse.js with parameters of type WebPage
`Parse`	`JSParseFilter.filter(String url, WebPage page, Parse parse, HTMLMetaTags metaTags, DocumentFragment doc)`
`Parse`	`JSParseFilter.getParse(String url, WebPage page)`

Uses of WebPage in org.apache.nutch.parse.tika

Methods in org.apache.nutch.parse.tika with parameters of type WebPage
`Parse`	`TikaParser.getParse(String url, WebPage page)`

Uses of WebPage in org.apache.nutch.protocol

Methods in org.apache.nutch.protocol with parameters of type WebPage
`ProtocolOutput`	`Protocol.getProtocolOutput(String url, WebPage page)` Returns the `Content` for a fetchlist entry.
`RobotRules`	`Protocol.getRobotRules(String url, WebPage page)` Retrieve robot rules applicable for this url.

Uses of WebPage in org.apache.nutch.protocol.file

Methods in org.apache.nutch.protocol.file with parameters of type WebPage
`ProtocolOutput`	`File.getProtocolOutput(String url, WebPage page)`
`RobotRules`	`File.getRobotRules(String url, WebPage page)`

Constructors in org.apache.nutch.protocol.file with parameters of type WebPage
`FileResponse(URL url, WebPage page, File file, Configuration conf)`

Uses of WebPage in org.apache.nutch.protocol.ftp

Methods in org.apache.nutch.protocol.ftp with parameters of type WebPage
`ProtocolOutput`	`Ftp.getProtocolOutput(String url, WebPage page)`
`RobotRules`	`Ftp.getRobotRules(String url, WebPage page)`

Constructors in org.apache.nutch.protocol.ftp with parameters of type WebPage
`FtpResponse(URL url, WebPage page, Ftp ftp, Configuration conf)`

Uses of WebPage in org.apache.nutch.protocol.http

Methods in org.apache.nutch.protocol.http with parameters of type WebPage
`protected Response`	`Http.getResponse(URL url, WebPage page, boolean redirect)`

Constructors in org.apache.nutch.protocol.http with parameters of type WebPage
`HttpResponse(HttpBase http, URL url, WebPage page)`

Uses of WebPage in org.apache.nutch.protocol.http.api

Methods in org.apache.nutch.protocol.http.api with parameters of type WebPage
`ProtocolOutput`	`HttpBase.getProtocolOutput(String url, WebPage page)`
`protected abstract Response`	`HttpBase.getResponse(URL url, WebPage page, boolean followRedirects)`
`RobotRules`	`HttpBase.getRobotRules(String url, WebPage page)`

Uses of WebPage in org.apache.nutch.protocol.httpclient

Methods in org.apache.nutch.protocol.httpclient with parameters of type WebPage
`protected Response`	`Http.getResponse(URL url, WebPage page, boolean redirect)` Fetches the `url` with a configured HTTP client and gets the response.

Uses of WebPage in org.apache.nutch.protocol.sftp

Methods in org.apache.nutch.protocol.sftp with parameters of type WebPage
`ProtocolOutput`	`Sftp.getProtocolOutput(String url, WebPage page)`
`RobotRules`	`Sftp.getRobotRules(String url, WebPage page)`

Uses of WebPage in org.apache.nutch.scoring

Methods in org.apache.nutch.scoring with parameters of type WebPage
`void`	`ScoringFilter.distributeScoreToOutlinks(String fromUrl, WebPage page, Collection<ScoreDatum> scoreData, int allCount)` Distribute score value from the current page to all its outlinked pages.
`void`	`ScoringFilters.distributeScoreToOutlinks(String fromUrl, WebPage row, Collection<ScoreDatum> scoreData, int allCount)`
`float`	`ScoringFilter.generatorSortValue(String url, WebPage page, float initSort)` This method prepares a sort value for the purpose of sorting and selecting top N scoring pages during fetchlist generation.
`float`	`ScoringFilters.generatorSortValue(String url, WebPage row, float initSort)` Calculate a sort value for Generate.
`float`	`ScoringFilter.indexerScore(String url, NutchDocument doc, WebPage page, float initScore)` This method calculates a Lucene document boost.
`float`	`ScoringFilters.indexerScore(String url, NutchDocument doc, WebPage row, float initScore)`
`void`	`ScoringFilter.initialScore(String url, WebPage page)` Set an initial score for newly discovered pages.
`void`	`ScoringFilters.initialScore(String url, WebPage row)` Calculate a new initial score, used when adding newly discovered pages.
`void`	`ScoringFilter.injectedScore(String url, WebPage page)` Set an initial score for newly injected pages.
`void`	`ScoringFilters.injectedScore(String url, WebPage row)` Calculate a new initial score, used when injecting new pages.
`void`	`ScoringFilter.updateScore(String url, WebPage page, List<ScoreDatum> inlinkedScoreData)` This method calculates a new score during table update, based on the values contributed by inlinked pages.
`void`	`ScoringFilters.updateScore(String url, WebPage row, List<ScoreDatum> inlinkedScoreData)`

Uses of WebPage in org.apache.nutch.scoring.link

Methods in org.apache.nutch.scoring.link with parameters of type WebPage
`void`	`LinkAnalysisScoringFilter.distributeScoreToOutlinks(String fromUrl, WebPage page, Collection<ScoreDatum> scoreData, int allCount)`
`float`	`LinkAnalysisScoringFilter.generatorSortValue(String url, WebPage page, float initSort)`
`float`	`LinkAnalysisScoringFilter.indexerScore(String url, NutchDocument doc, WebPage page, float initScore)`
`void`	`LinkAnalysisScoringFilter.initialScore(String url, WebPage page)`
`void`	`LinkAnalysisScoringFilter.injectedScore(String url, WebPage page)`
`void`	`LinkAnalysisScoringFilter.updateScore(String url, WebPage page, List<ScoreDatum> inlinkedScoreData)`

Uses of WebPage in org.apache.nutch.scoring.opic

Methods in org.apache.nutch.scoring.opic with parameters of type WebPage
`void`	`OPICScoringFilter.distributeScoreToOutlinks(String fromUrl, WebPage row, Collection<ScoreDatum> scoreData, int allCount)` Get cash on hand, divide it by the number of outlinks and apply.
`float`	`OPICScoringFilter.generatorSortValue(String url, WebPage row, float initSort)` Use `getScore()`.
`float`	`OPICScoringFilter.indexerScore(String url, NutchDocument doc, WebPage row, float initScore)` Dampen the boost value by scorePower.
`void`	`OPICScoringFilter.initialScore(String url, WebPage row)` Set to 0.0f (unknown value) - inlink contributions will bring it to a correct level.
`void`	`OPICScoringFilter.injectedScore(String url, WebPage row)`
`void`	`OPICScoringFilter.updateScore(String url, WebPage row, List<ScoreDatum> inlinkedScoreData)` Increase the score by a sum of inlinked scores.
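The one-line summaries above describe an OPIC-style flow: a page splits its cash among its outlinks, receiving pages add up what they were given, and the index-time boost is dampened by a power. The sketch below is a rough, self-contained illustration of that arithmetic. The `Page` class, the handling of zero outlinks, resetting the cash after distribution, and crediting incoming contributions to both score and cash are all assumptions of the sketch rather than confirmed details of `OPICScoringFilter`.

```java
import java.util.ArrayList;
import java.util.List;

class OpicSketch {
    // Hypothetical page state: an accumulated score plus cash waiting to be handed out.
    static class Page {
        float score;
        float cash;
    }

    // Split the page's cash evenly among its outlinks, then empty the cash.
    static List<Float> distributeScoreToOutlinks(Page page, int outlinkCount) {
        List<Float> contributions = new ArrayList<>();
        if (outlinkCount <= 0) {
            return contributions; // nothing to distribute to
        }
        float share = page.cash / outlinkCount;
        for (int i = 0; i < outlinkCount; i++) {
            contributions.add(share);
        }
        page.cash = 0.0f;
        return contributions;
    }

    // Increase the score (and, in this sketch, the cash) by the sum of inlinked contributions.
    static void updateScore(Page page, List<Float> inlinkedScores) {
        float sum = 0.0f;
        for (float contribution : inlinkedScores) {
            sum += contribution;
        }
        page.score += sum;
        page.cash += sum;
    }

    // Dampen the boost by raising the score to scorePower before applying it.
    static float indexerScore(Page page, float initScore, float scorePower) {
        return (float) Math.pow(page.score, scorePower) * initScore;
    }

    public static void main(String[] args) {
        Page source = new Page();
        source.cash = 1.0f;
        List<Float> shares = distributeScoreToOutlinks(source, 4);

        Page target = new Page();                  // starts at 0.0f, like a newly discovered page
        updateScore(target, shares.subList(0, 2)); // target receives two of the four shares
        System.out.println(target.score);
        System.out.println(indexerScore(target, 1.0f, 0.5f));
    }
}
```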

Uses of WebPage in org.apache.nutch.scoring.tld

Methods in org.apache.nutch.scoring.tld with parameters of type WebPage
`void`	`TLDScoringFilter.distributeScoreToOutlinks(String fromUrl, WebPage page, Collection<ScoreDatum> scoreData, int allCount)`
`float`	`TLDScoringFilter.generatorSortValue(String url, WebPage page, float initSort)`
`float`	`TLDScoringFilter.indexerScore(String url, NutchDocument doc, WebPage page, float initScore)`
`void`	`TLDScoringFilter.initialScore(String url, WebPage page)`
`void`	`TLDScoringFilter.injectedScore(String url, WebPage page)`
`void`	`TLDScoringFilter.updateScore(String url, WebPage page, List<ScoreDatum> inlinkedScoreData)`

Uses of WebPage in org.apache.nutch.storage

Methods in org.apache.nutch.storage that return WebPage
`WebPage`	`WebPage.newInstance(org.apache.gora.persistency.StateManager stateManager)`

Methods in org.apache.nutch.storage with parameters of type WebPage
`org.apache.avro.util.Utf8`	`Mark.checkMark(WebPage page)`
`void`	`Mark.putMark(WebPage page, String markValue)`
`void`	`Mark.putMark(WebPage page, org.apache.avro.util.Utf8 markValue)`
`org.apache.avro.util.Utf8`	`Mark.removeMark(WebPage page)`
`org.apache.avro.util.Utf8`	`Mark.removeMarkIfExist(WebPage page)` Remove the mark only if the mark is present on the page.

Method parameters in org.apache.nutch.storage with type arguments of type WebPage
`static <K,V> void`	`StorageUtils.initMapperJob(Job job, Collection<WebPage.Field> fields, Class<K> outKeyClass, Class<V> outValueClass, Class<? extends org.apache.gora.mapreduce.GoraMapper<String,WebPage,K,V>> mapperClass)`
`static <K,V> void`	`StorageUtils.initMapperJob(Job job, Collection<WebPage.Field> fields, Class<K> outKeyClass, Class<V> outValueClass, Class<? extends org.apache.gora.mapreduce.GoraMapper<String,WebPage,K,V>> mapperClass, boolean reuseObjects)`
`static <K,V> void`	`StorageUtils.initMapperJob(Job job, Collection<WebPage.Field> fields, Class<K> outKeyClass, Class<V> outValueClass, Class<? extends org.apache.gora.mapreduce.GoraMapper<String,WebPage,K,V>> mapperClass, Class<? extends Partitioner<K,V>> partitionerClass)`
`static <K,V> void`	`StorageUtils.initMapperJob(Job job, Collection<WebPage.Field> fields, Class<K> outKeyClass, Class<V> outValueClass, Class<? extends org.apache.gora.mapreduce.GoraMapper<String,WebPage,K,V>> mapperClass, Class<? extends Partitioner<K,V>> partitionerClass, boolean reuseObjects)`
`static <K,V> void`	`StorageUtils.initReducerJob(Job job, Class<? extends org.apache.gora.mapreduce.GoraReducer<K,V,String,WebPage>> reducerClass)`

Uses of WebPage in org.apache.nutch.util

Methods in org.apache.nutch.util that return WebPage
`WebPage`	`WebPageWritable.getWebPage()`

Methods in org.apache.nutch.util with parameters of type WebPage
`void`	`EncodingDetector.autoDetectClues(WebPage page, boolean filter)`
`String`	`EncodingDetector.guessEncoding(WebPage page, String defaultValue)` Guess the encoding with the previously specified list of clues.
`void`	`WebPageWritable.setWebPage(WebPage webPage)`

Method parameters in org.apache.nutch.util with type arguments of type WebPage
`protected void`	`IdentityPageReducer.reduce(String key, Iterable<WebPage> values, Reducer.Context context)`

Constructors in org.apache.nutch.util with parameters of type WebPage
`WebPageWritable(Configuration conf, WebPage webPage)`

Uses of WebPage in org.apache.nutch.util.domain

Methods in org.apache.nutch.util.domain with parameters of type WebPage
`protected void`	`DomainStatistics.DomainStatisticsMapper.map(String key, WebPage value, Mapper.Context context)`

Uses of WebPage in org.creativecommons.nutch

Methods in org.creativecommons.nutch with parameters of type WebPage
`NutchDocument`	`CCIndexingFilter.filter(NutchDocument doc, String url, WebPage page)`
`Parse`	`CCParseFilter.filter(String url, WebPage page, Parse parse, HTMLMetaTags metaTags, DocumentFragment doc)` Adds metadata or otherwise modifies a parse of an HTML document, given the DOM tree of a page.
`static void`	`CCParseFilter.Walker.walk(Node doc, URL base, WebPage page, Configuration conf)` Scan the document adding attributes to metadata.
