QueryAutoStopWordAnalyzer (Lucene 3.6.0 API)

org.apache.lucene.analysis.query
Class QueryAutoStopWordAnalyzer

java.lang.Object
  org.apache.lucene.analysis.Analyzer
      org.apache.lucene.analysis.query.QueryAutoStopWordAnalyzer

All Implemented Interfaces:: Closeable

public final class QueryAutoStopWordAnalyzer
extends Analyzer

An Analyzer used primarily at query time to wrap another analyzer and provide a layer of protection which prevents very common words from being passed into queries.

For very large indexes the cost of reading TermDocs for a very common word can be high. This analyzer was created after experience with a 38 million doc index which had a term in around 50% of docs and was causing TermQueries for this term to take 2 seconds.

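Stop words are identified automatically from an already existing index. The various "addStopWords" methods in this class are deprecated in this release; calculate the stop words at instantiation instead, by passing an IndexReader to one of the constructors.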
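To make the idea concrete, here is a minimal sketch of document-frequency based stop word detection in plain Java. It does not use Lucene: the class `DocFreqStopWords`, its methods, and the toy corpus are hypothetical names introduced only for illustration, and the strict "greater than" comparison follows the wording of the constructor descriptions rather than a reading of the actual implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical illustration class; it is not part of Lucene.
final class DocFreqStopWords {

    /** Counts, for each term, the number of documents that contain it. */
    static Map<String, Integer> docFreq(List<Set<String>> docs) {
        Map<String, Integer> df = new TreeMap<>();
        for (Set<String> doc : docs) {
            for (String term : doc) {
                df.merge(term, 1, Integer::sum);
            }
        }
        return df;
    }

    /** Returns the terms whose document frequency is strictly above maxDocFreq. */
    static Set<String> stopWords(List<Set<String>> docs, int maxDocFreq) {
        Set<String> stop = new TreeSet<>();
        for (Map.Entry<String, Integer> e : docFreq(docs).entrySet()) {
            if (e.getValue() > maxDocFreq) {
                stop.add(e.getKey());
            }
        }
        return stop;
    }

    /** A four-document toy corpus; each document is reduced to its set of terms. */
    static List<Set<String>> sampleDocs() {
        List<Set<String>> docs = new ArrayList<>();
        docs.add(new HashSet<>(Arrays.asList("the", "quick", "fox")));
        docs.add(new HashSet<>(Arrays.asList("the", "lazy", "dog")));
        docs.add(new HashSet<>(Arrays.asList("the", "fox", "jumps")));
        docs.add(new HashSet<>(Arrays.asList("a", "dog", "barks")));
        return docs;
    }

    public static void main(String[] args) {
        // "the" occurs in 3 of 4 documents, so it is the only term above 2.
        System.out.println(stopWords(sampleDocs(), 2)); // prints [the]
    }
}
```

Lowering the threshold to 1 in this toy corpus would also flag "dog" and "fox", which shows how sensitive the stop set is to the chosen cutoff.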

Field Summary
`static float`	`defaultMaxDocFreqPercent`

Constructor Summary
`QueryAutoStopWordAnalyzer(Version matchVersion, Analyzer delegate)` Deprecated. Stopwords should be calculated at instantiation using one of the other constructors
`QueryAutoStopWordAnalyzer(Version matchVersion, Analyzer delegate, IndexReader indexReader)` Creates a new QueryAutoStopWordAnalyzer with stopwords calculated for all indexed fields from terms with a document frequency percentage greater than `defaultMaxDocFreqPercent`
`QueryAutoStopWordAnalyzer(Version matchVersion, Analyzer delegate, IndexReader indexReader, Collection<String> fields, float maxPercentDocs)` Creates a new QueryAutoStopWordAnalyzer with stopwords calculated for the given selection of fields from terms with a document frequency percentage greater than the given maxPercentDocs
`QueryAutoStopWordAnalyzer(Version matchVersion, Analyzer delegate, IndexReader indexReader, Collection<String> fields, int maxDocFreq)` Creates a new QueryAutoStopWordAnalyzer with stopwords calculated for the given selection of fields from terms with a document frequency greater than the given maxDocFreq
`QueryAutoStopWordAnalyzer(Version matchVersion, Analyzer delegate, IndexReader indexReader, float maxPercentDocs)` Creates a new QueryAutoStopWordAnalyzer with stopwords calculated for all indexed fields from terms with a document frequency percentage greater than the given maxPercentDocs
`QueryAutoStopWordAnalyzer(Version matchVersion, Analyzer delegate, IndexReader indexReader, int maxDocFreq)` Creates a new QueryAutoStopWordAnalyzer with stopwords calculated for all indexed fields from terms with a document frequency greater than the given maxDocFreq

Method Summary
`int`	`addStopWords(IndexReader reader)` Deprecated. Stopwords should be calculated at instantiation using `QueryAutoStopWordAnalyzer(Version, Analyzer, IndexReader)`
`int`	`addStopWords(IndexReader reader, float maxPercentDocs)` Deprecated. Stopwords should be calculated at instantiation using `QueryAutoStopWordAnalyzer(Version, Analyzer, IndexReader, float)`
`int`	`addStopWords(IndexReader reader, int maxDocFreq)` Deprecated. Stopwords should be calculated at instantiation using `QueryAutoStopWordAnalyzer(Version, Analyzer, IndexReader, int)`
`int`	`addStopWords(IndexReader reader, String fieldName, float maxPercentDocs)` Deprecated. Stopwords should be calculated at instantiation using `QueryAutoStopWordAnalyzer(Version, Analyzer, IndexReader, Collection, float)`
`int`	`addStopWords(IndexReader reader, String fieldName, int maxDocFreq)` Deprecated. Stopwords should be calculated at instantiation using `QueryAutoStopWordAnalyzer(Version, Analyzer, IndexReader, Collection, int)`
`Term[]`	`getStopWords()` Provides information on which stop words have been identified for all fields
`String[]`	`getStopWords(String fieldName)` Provides information on which stop words have been identified for a field
`TokenStream`	`reusableTokenStream(String fieldName, Reader reader)` Creates a TokenStream that is allowed to be re-used from the previous time that the same thread called this method.
`TokenStream`	`tokenStream(String fieldName, Reader reader)` Creates a TokenStream which tokenizes all the text in the provided Reader.

Methods inherited from class org.apache.lucene.analysis.Analyzer
`close, getOffsetGap, getPositionIncrementGap, getPreviousTokenStream, setPreviousTokenStream`

Methods inherited from class java.lang.Object
`clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait`

Field Detail

defaultMaxDocFreqPercent

public static final float defaultMaxDocFreqPercent

See Also:: Constant Field Values

Constructor Detail

QueryAutoStopWordAnalyzer

@Deprecated
public QueryAutoStopWordAnalyzer(Version matchVersion,
                                 Analyzer delegate)

Deprecated. Stopwords should be calculated at instantiation using one of the other constructors

Initializes this analyzer with the Analyzer object that actually produces the tokens

Parameters:: matchVersion - Version to be used in StopFilter; delegate - The choice of Analyzer that is used to produce the token stream which needs filtering

QueryAutoStopWordAnalyzer

public QueryAutoStopWordAnalyzer(Version matchVersion,
                                 Analyzer delegate,
                                 IndexReader indexReader)
                          throws IOException

Creates a new QueryAutoStopWordAnalyzer with stopwords calculated for all indexed fields from terms with a document frequency percentage greater than defaultMaxDocFreqPercent

Parameters:: matchVersion - Version to be used in StopFilter; delegate - Analyzer whose TokenStream will be filtered; indexReader - IndexReader to identify the stopwords from
Throws:: IOException - Can be thrown while reading from the IndexReader

QueryAutoStopWordAnalyzer

public QueryAutoStopWordAnalyzer(Version matchVersion,
                                 Analyzer delegate,
                                 IndexReader indexReader,
                                 int maxDocFreq)
                          throws IOException

Creates a new QueryAutoStopWordAnalyzer with stopwords calculated for all indexed fields from terms with a document frequency greater than the given maxDocFreq

Parameters:: matchVersion - Version to be used in StopFilter; delegate - Analyzer whose TokenStream will be filtered; indexReader - IndexReader to identify the stopwords from; maxDocFreq - Document frequency terms should be above in order to be stopwords
Throws:: IOException - Can be thrown while reading from the IndexReader

QueryAutoStopWordAnalyzer

public QueryAutoStopWordAnalyzer(Version matchVersion,
                                 Analyzer delegate,
                                 IndexReader indexReader,
                                 float maxPercentDocs)
                          throws IOException

Creates a new QueryAutoStopWordAnalyzer with stopwords calculated for all indexed fields from terms with a document frequency percentage greater than the given maxPercentDocs

Parameters:: matchVersion - Version to be used in StopFilter; delegate - Analyzer whose TokenStream will be filtered; indexReader - IndexReader to identify the stopwords from; maxPercentDocs - The maximum percentage (between 0.0 and 1.0) of index documents which contain a term, after which the word is considered to be a stop word
Throws:: IOException - Can be thrown while reading from the IndexReader
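The relationship between the percentage form and the absolute form of the threshold can be sketched in plain Java as follows. This is an illustration under stated assumptions, not the library's code: `PercentThreshold` and its methods are hypothetical names, and truncating the product toward zero is an assumption about how a fraction would be turned into a document count.

```java
// Hypothetical illustration class; it is not part of Lucene.
final class PercentThreshold {

    /**
     * Converts a fraction of the index (between 0.0 and 1.0) into an absolute
     * document count. Assumption: the product is truncated toward zero.
     */
    static int toMaxDocFreq(int numDocs, float maxPercentDocs) {
        return (int) (numDocs * maxPercentDocs);
    }

    /** A term counts as a stop word only if its document frequency is strictly above the cutoff. */
    static boolean isStopWord(int docFreq, int numDocs, float maxPercentDocs) {
        return docFreq > toMaxDocFreq(numDocs, maxPercentDocs);
    }

    public static void main(String[] args) {
        // A 38 million document index with a 25% cutoff.
        System.out.println(toMaxDocFreq(38000000, 0.25f));         // prints 9500000
        // A term present in half of those documents is above the cutoff.
        System.out.println(isStopWord(19000000, 38000000, 0.25f)); // prints true
        // Truncation on a small index: 10 * 0.25 = 2.5 becomes 2.
        System.out.println(toMaxDocFreq(10, 0.25f));               // prints 2
        System.out.println(isStopWord(2, 10, 0.25f));              // prints false
    }
}
```

On small indexes the truncation matters: with 10 documents and a cutoff of 0.25, a term in 3 documents is already treated as a stop word in this sketch.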

QueryAutoStopWordAnalyzer

public QueryAutoStopWordAnalyzer(Version matchVersion,
                                 Analyzer delegate,
                                 IndexReader indexReader,
                                 Collection<String> fields,
                                 float maxPercentDocs)
                          throws IOException

Creates a new QueryAutoStopWordAnalyzer with stopwords calculated for the given selection of fields from terms with a document frequency percentage greater than the given maxPercentDocs

Parameters:: matchVersion - Version to be used in StopFilter; delegate - Analyzer whose TokenStream will be filtered; indexReader - IndexReader to identify the stopwords from; fields - Selection of fields to calculate stopwords for; maxPercentDocs - The maximum percentage (between 0.0 and 1.0) of index documents which contain a term, after which the word is considered to be a stop word
Throws:: IOException - Can be thrown while reading from the IndexReader

QueryAutoStopWordAnalyzer

public QueryAutoStopWordAnalyzer(Version matchVersion,
                                 Analyzer delegate,
                                 IndexReader indexReader,
                                 Collection<String> fields,
                                 int maxDocFreq)
                          throws IOException

Creates a new QueryAutoStopWordAnalyzer with stopwords calculated for the given selection of fields from terms with a document frequency greater than the given maxDocFreq

Parameters:: matchVersion - Version to be used in StopFilter; delegate - Analyzer whose TokenStream will be filtered; indexReader - IndexReader to identify the stopwords from; fields - Selection of fields to calculate stopwords for; maxDocFreq - Document frequency terms should be above in order to be stopwords
Throws:: IOException - Can be thrown while reading from the IndexReader
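The effect of restricting the calculation to a selection of fields can be sketched in plain Java as below. The class `PerFieldStopWords`, the field names "title" and "body", and the sample documents are all hypothetical; the sketch only models the idea that each field gets its own stop set based on document frequency within that field.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical illustration class; it is not part of Lucene.
final class PerFieldStopWords {

    /**
     * Each document maps a field name to the set of terms in that field.
     * For every requested field, collects the terms whose document frequency
     * within that field is strictly above maxDocFreq.
     */
    static Map<String, Set<String>> compute(List<Map<String, Set<String>>> docs,
                                            Collection<String> fields,
                                            int maxDocFreq) {
        Map<String, Set<String>> result = new TreeMap<>();
        for (String field : fields) {
            Map<String, Integer> df = new HashMap<>();
            for (Map<String, Set<String>> doc : docs) {
                Set<String> terms = doc.get(field);
                if (terms == null) {
                    continue; // this document has no such field
                }
                for (String term : terms) {
                    df.merge(term, 1, Integer::sum);
                }
            }
            Set<String> stop = new TreeSet<>();
            for (Map.Entry<String, Integer> e : df.entrySet()) {
                if (e.getValue() > maxDocFreq) {
                    stop.add(e.getKey());
                }
            }
            result.put(field, stop);
        }
        return result;
    }

    /** Builds a document with whitespace-separated "title" and "body" fields. */
    static Map<String, Set<String>> doc(String title, String body) {
        Map<String, Set<String>> d = new HashMap<>();
        d.put("title", new HashSet<>(Arrays.asList(title.split(" "))));
        d.put("body", new HashSet<>(Arrays.asList(body.split(" "))));
        return d;
    }

    static List<Map<String, Set<String>>> sampleDocs() {
        List<Map<String, Set<String>>> docs = new ArrayList<>();
        docs.add(doc("lucene guide", "the index is fast"));
        docs.add(doc("lucene tips", "the query is slow"));
        docs.add(doc("search guide", "the analyzer is simple"));
        return docs;
    }

    public static void main(String[] args) {
        // Only "body" is requested, so "title" terms are never considered.
        System.out.println(compute(sampleDocs(), Arrays.asList("body"), 2));
        // prints {body=[is, the]}
    }
}
```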

Method Detail

addStopWords

@Deprecated
public int addStopWords(IndexReader reader)
                 throws IOException

Deprecated. Stopwords should be calculated at instantiation using QueryAutoStopWordAnalyzer(Version, Analyzer, IndexReader)

Automatically adds stop words for all fields with terms exceeding the defaultMaxDocFreqPercent

Parameters:: reader - The IndexReader which will be consulted to identify potential stop words that exceed the required document frequency
Returns:: The number of stop words identified.
Throws:: IOException

addStopWords

@Deprecated
public int addStopWords(IndexReader reader,
                        int maxDocFreq)
                 throws IOException

Deprecated. Stopwords should be calculated at instantiation using QueryAutoStopWordAnalyzer(Version, Analyzer, IndexReader, int)

Automatically adds stop words for all fields with terms exceeding the maxDocFreq

Parameters:: reader - The IndexReader which will be consulted to identify potential stop words that exceed the required document frequency; maxDocFreq - The maximum number of index documents which can contain a term, after which the term is considered to be a stop word
Returns:: The number of stop words identified.
Throws:: IOException

addStopWords

@Deprecated
public int addStopWords(IndexReader reader,
                        float maxPercentDocs)
                 throws IOException

Deprecated. Stopwords should be calculated at instantiation using QueryAutoStopWordAnalyzer(Version, Analyzer, IndexReader, float)

Automatically adds stop words for all fields with terms exceeding the maxPercentDocs

Parameters:: reader - The IndexReader which will be consulted to identify potential stop words that exceed the required document frequency; maxPercentDocs - The maximum percentage (between 0.0 and 1.0) of index documents which contain a term, after which the word is considered to be a stop word.
Returns:: The number of stop words identified.
Throws:: IOException

addStopWords

@Deprecated
public int addStopWords(IndexReader reader,
                        String fieldName,
                        float maxPercentDocs)
                 throws IOException

Deprecated. Stopwords should be calculated at instantiation using QueryAutoStopWordAnalyzer(Version, Analyzer, IndexReader, Collection, float)

Automatically adds stop words for the given field with terms exceeding the maxPercentDocs

Parameters:: reader - The IndexReader which will be consulted to identify potential stop words that exceed the required document frequency; fieldName - The field for which stopwords will be added; maxPercentDocs - The maximum percentage (between 0.0 and 1.0) of index documents which contain a term, after which the word is considered to be a stop word.
Returns:: The number of stop words identified.
Throws:: IOException

addStopWords

@Deprecated
public int addStopWords(IndexReader reader,
                        String fieldName,
                        int maxDocFreq)
                 throws IOException

Deprecated. Stopwords should be calculated at instantiation using QueryAutoStopWordAnalyzer(Version, Analyzer, IndexReader, Collection, int)

Automatically adds stop words for the given field with terms exceeding the maxDocFreq

Parameters:: reader - The IndexReader which will be consulted to identify potential stop words that exceed the required document frequency; fieldName - The field for which stopwords will be added; maxDocFreq - The maximum number of index documents which can contain a term, after which the term is considered to be a stop word.
Returns:: The number of stop words identified.
Throws:: IOException

tokenStream

public TokenStream tokenStream(String fieldName,
                               Reader reader)

Description copied from class: Analyzer

Creates a TokenStream which tokenizes all the text in the provided Reader. Must be able to handle null field name for backward compatibility.

Specified by:: tokenStream in class Analyzer
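Conceptually, the stream produced for a field is the delegate's output with that field's stop words removed. The plain-Java sketch below models this on lists of token strings rather than on a real TokenStream. `FieldStopFilterSketch` and its methods are hypothetical, and passing tokens through unchanged for a null or unknown field name is an assumption made for the illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration class; it is not part of Lucene.
final class FieldStopFilterSketch {

    /**
     * Returns the tokens of a field with that field's stop words removed.
     * Assumption: a null field name, or a field without stop words, leaves
     * the tokens untouched.
     */
    static List<String> filter(String fieldName,
                               List<String> tokens,
                               Map<String, Set<String>> stopWordsPerField) {
        Set<String> stop = (fieldName == null) ? null : stopWordsPerField.get(fieldName);
        if (stop == null || stop.isEmpty()) {
            return new ArrayList<>(tokens);
        }
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (!stop.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    /** Stop words exist only for the hypothetical "body" field. */
    static Map<String, Set<String>> sampleStopWords() {
        Map<String, Set<String>> m = new HashMap<>();
        m.put("body", new HashSet<>(Arrays.asList("the", "is")));
        return m;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> stop = sampleStopWords();
        System.out.println(filter("body", Arrays.asList("the", "index", "is", "fast"), stop));
        // prints [index, fast]
        System.out.println(filter("title", Arrays.asList("the", "guide"), stop));
        // prints [the, guide]
    }
}
```

Because the stop sets are kept per field, "the" is dropped from "body" but survives in "title" in this example.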

reusableTokenStream

public TokenStream reusableTokenStream(String fieldName,
                                       Reader reader)
                                throws IOException

Description copied from class: Analyzer

Creates a TokenStream that is allowed to be re-used from the previous time that the same thread called this method. Callers that do not need to use more than one TokenStream at the same time from this analyzer should use this method for better performance.

Overrides:: reusableTokenStream in class Analyzer

Throws:: IOException

getStopWords

public String[] getStopWords(String fieldName)

Provides information on which stop words have been identified for a field

Parameters:: fieldName - The field for which stop words identified in "addStopWords" method calls will be returned
Returns:: the stop words identified for a field

getStopWords

public Term[] getStopWords()

Provides information on which stop words have been identified for all fields

Returns:: the stop words (as terms)
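The two getStopWords variants can be thought of as two views over the same per-field stop sets: one field's words as strings, or all fields flattened into (field, text) pairs. The sketch below models this in plain Java. `StopWordViews` and `FieldTerm` are hypothetical stand-ins (in particular, `FieldTerm` is not Lucene's Term class), and returning an empty array for an unknown field is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical illustration class; it is not part of Lucene.
final class StopWordViews {

    /** Minimal stand-in for a (field, text) pair. */
    static final class FieldTerm {
        final String field;
        final String text;

        FieldTerm(String field, String text) {
            this.field = field;
            this.text = text;
        }

        @Override
        public String toString() {
            return field + ":" + text;
        }
    }

    /** Stop words of one field; assumption: an unknown field yields an empty array. */
    static String[] forField(Map<String, Set<String>> perField, String fieldName) {
        Set<String> stop = perField.get(fieldName);
        return stop == null ? new String[0] : stop.toArray(new String[0]);
    }

    /** Stop words of all fields, flattened into (field, text) pairs. */
    static FieldTerm[] all(Map<String, Set<String>> perField) {
        List<FieldTerm> terms = new ArrayList<>();
        for (Map.Entry<String, Set<String>> e : perField.entrySet()) {
            for (String text : e.getValue()) {
                terms.add(new FieldTerm(e.getKey(), text));
            }
        }
        return terms.toArray(new FieldTerm[0]);
    }

    static Map<String, Set<String>> sample() {
        Map<String, Set<String>> m = new TreeMap<>();
        m.put("title", new TreeSet<>(Arrays.asList("lucene")));
        m.put("body", new TreeSet<>(Arrays.asList("the", "is")));
        return m;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(forField(sample(), "body"))); // prints [is, the]
        System.out.println(Arrays.toString(all(sample())));
        // prints [body:is, body:the, title:lucene]
    }
}
```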
