Packages that use Tokenizer

| Package | Description |
|---|---|
| org.apache.lucene.analysis | API and code to convert text into indexable/searchable tokens. |
| org.apache.lucene.analysis.ar | Analyzer for Arabic. |
| org.apache.lucene.analysis.cjk | Analyzer for Chinese, Japanese, and Korean, which indexes bigrams (overlapping groups of two adjacent Han characters). |
| org.apache.lucene.analysis.cn | Analyzer for Chinese, which indexes unigrams (individual Chinese characters). |
| org.apache.lucene.analysis.cn.smart | Analyzer for Simplified Chinese, which indexes words. |
| org.apache.lucene.analysis.icu.segmentation | Tokenizer that breaks text into words with the Unicode Text Segmentation algorithm. |
| org.apache.lucene.analysis.in | Analysis components for Indian languages. |
| org.apache.lucene.analysis.ja | Analyzer for Japanese. |
| org.apache.lucene.analysis.ngram | Character n-gram tokenizers and filters. |
| org.apache.lucene.analysis.path | Analysis components for path-like strings such as filenames. |
| org.apache.lucene.analysis.ru | Analyzer for Russian. |
| org.apache.lucene.analysis.standard | Standards-based analyzers implemented with JFlex. |
| org.apache.lucene.analysis.wikipedia | Tokenizer that is aware of Wikipedia syntax. |
Uses of Tokenizer in org.apache.lucene.analysis

Subclasses of Tokenizer in org.apache.lucene.analysis

| Class | Description |
|---|---|
| CharTokenizer | An abstract base class for simple, character-oriented tokenizers. |
| EmptyTokenizer | Emits no tokens. |
| KeywordTokenizer | Emits the entire input as a single token. |
| LetterTokenizer | A tokenizer that divides text at non-letters. |
| LowerCaseTokenizer | Performs the function of LetterTokenizer and LowerCaseFilter together. |
| MockTokenizer | A tokenizer for testing. |
| WhitespaceTokenizer | A tokenizer that divides text at whitespace. |
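All of the concrete tokenizers above follow the TokenStream contract: register the attributes you want to read, then iterate with incrementToken(). A minimal sketch, assuming a Lucene 3.x classpath; the demo class name is made up for illustration:

```java
import java.io.StringReader;

import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.util.Version;

public class WhitespaceTokenizerDemo {
    public static void main(String[] args) throws Exception {
        // Split the input on whitespace; each run of non-whitespace becomes one token.
        WhitespaceTokenizer tokenizer =
            new WhitespaceTokenizer(Version.LUCENE_36, new StringReader("Lucene in Action"));
        CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
        OffsetAttribute offset = tokenizer.addAttribute(OffsetAttribute.class);

        tokenizer.reset();
        while (tokenizer.incrementToken()) {
            // Expected output: Lucene [0,6], in [7,9], Action [10,16]
            System.out.println(term.toString()
                + " [" + offset.startOffset() + "," + offset.endOffset() + "]");
        }
        tokenizer.end();
        tokenizer.close();
    }
}
```

The same read loop works for any Tokenizer subclass listed on this page; only the construction differs.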
Fields in org.apache.lucene.analysis declared as Tokenizer

| Modifier and Type | Field |
|---|---|
| protected Tokenizer | ReusableAnalyzerBase.TokenStreamComponents.source |
Constructors in org.apache.lucene.analysis with parameters of type Tokenizer

| Constructor | Description |
|---|---|
| ReusableAnalyzerBase.TokenStreamComponents(Tokenizer source) | Creates a new ReusableAnalyzerBase.TokenStreamComponents instance. |
| ReusableAnalyzerBase.TokenStreamComponents(Tokenizer source, TokenStream result) | Creates a new ReusableAnalyzerBase.TokenStreamComponents instance. |
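These constructors are normally invoked from ReusableAnalyzerBase.createComponents(String, Reader), where the Tokenizer is the source of the chain and the outermost TokenFilter is passed as the result. A rough sketch under that assumption, with a hypothetical analyzer name:

```java
import java.io.Reader;

import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.ReusableAnalyzerBase;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.util.Version;

// Hypothetical analyzer: whitespace tokenization followed by lowercasing.
public final class LowercaseWhitespaceAnalyzer extends ReusableAnalyzerBase {
    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        // The Tokenizer is the source of the analysis chain ...
        Tokenizer source = new WhitespaceTokenizer(Version.LUCENE_36, reader);
        // ... and the last TokenFilter wrapped around it is the result.
        TokenStream result = new LowerCaseFilter(Version.LUCENE_36, source);
        return new TokenStreamComponents(source, result);
    }
}
```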
Uses of Tokenizer in org.apache.lucene.analysis.ar

Subclasses of Tokenizer in org.apache.lucene.analysis.ar

| Class | Description |
|---|---|
| ArabicLetterTokenizer | Deprecated. (3.1) Use StandardTokenizer instead. |
Uses of Tokenizer in org.apache.lucene.analysis.cjk

Subclasses of Tokenizer in org.apache.lucene.analysis.cjk

| Class | Description |
|---|---|
| CJKTokenizer | Deprecated. Use StandardTokenizer, CJKWidthFilter, CJKBigramFilter, and LowerCaseFilter instead. |
Uses of Tokenizer in org.apache.lucene.analysis.cn

Subclasses of Tokenizer in org.apache.lucene.analysis.cn

| Class | Description |
|---|---|
| ChineseTokenizer | Deprecated. Use StandardTokenizer instead, which has the same functionality. This tokenizer will be removed in Lucene 5.0. |
Uses of Tokenizer in org.apache.lucene.analysis.cn.smart

Subclasses of Tokenizer in org.apache.lucene.analysis.cn.smart

| Class | Description |
|---|---|
| SentenceTokenizer | Tokenizes input text into sentences. |
Uses of Tokenizer in org.apache.lucene.analysis.icu.segmentation

Subclasses of Tokenizer in org.apache.lucene.analysis.icu.segmentation

| Class | Description |
|---|---|
| ICUTokenizer | Breaks text into words according to UAX #29: Unicode Text Segmentation (http://www.unicode.org/reports/tr29/). |
Uses of Tokenizer in org.apache.lucene.analysis.in

Subclasses of Tokenizer in org.apache.lucene.analysis.in

| Class | Description |
|---|---|
| IndicTokenizer | Deprecated. (3.6) Use StandardTokenizer instead. |
Uses of Tokenizer in org.apache.lucene.analysis.ja

Subclasses of Tokenizer in org.apache.lucene.analysis.ja

| Class | Description |
|---|---|
| JapaneseTokenizer | Tokenizer for Japanese that uses morphological analysis. |
Uses of Tokenizer in org.apache.lucene.analysis.ngram

Subclasses of Tokenizer in org.apache.lucene.analysis.ngram

| Class | Description |
|---|---|
| EdgeNGramTokenizer | Tokenizes the input from an edge into n-grams of given size(s). |
| NGramTokenizer | Tokenizes the input into n-grams of the given size(s). |
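As a rough illustration of the size parameters, the sketch below (assuming the Lucene 3.x analyzers module is on the classpath) emits every character 2-gram and then every 3-gram of a short input:

```java
import java.io.StringReader;

import org.apache.lucene.analysis.ngram.NGramTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class NGramTokenizerDemo {
    public static void main(String[] args) throws Exception {
        // minGram = 2, maxGram = 3: "lu", "uc", "ce", ..., then "luc", "uce", ...
        NGramTokenizer tokenizer = new NGramTokenizer(new StringReader("lucene"), 2, 3);
        CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
        tokenizer.reset();
        while (tokenizer.incrementToken()) {
            System.out.println(term.toString());
        }
        tokenizer.end();
        tokenizer.close();
    }
}
```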
Uses of Tokenizer in org.apache.lucene.analysis.path

Subclasses of Tokenizer in org.apache.lucene.analysis.path

| Class | Description |
|---|---|
| PathHierarchyTokenizer | Tokenizer for path-like hierarchies. |
| ReversePathHierarchyTokenizer | Tokenizer for domain-like hierarchies. |
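Each token emitted by PathHierarchyTokenizer is a successively longer prefix of the path, which is what makes it useful for hierarchy drill-down. A small sketch, assuming the Lucene 3.x analyzers module and the default '/' delimiter:

```java
import java.io.StringReader;

import org.apache.lucene.analysis.path.PathHierarchyTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class PathHierarchyTokenizerDemo {
    public static void main(String[] args) throws Exception {
        // Expected tokens: "/usr", "/usr/local", "/usr/local/bin"
        PathHierarchyTokenizer tokenizer =
            new PathHierarchyTokenizer(new StringReader("/usr/local/bin"));
        CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
        tokenizer.reset();
        while (tokenizer.incrementToken()) {
            System.out.println(term.toString());
        }
        tokenizer.end();
        tokenizer.close();
    }
}
```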
Uses of Tokenizer in org.apache.lucene.analysis.ru

Subclasses of Tokenizer in org.apache.lucene.analysis.ru

| Class | Description |
|---|---|
| RussianLetterTokenizer | Deprecated. Use StandardTokenizer instead, which has the same functionality. This tokenizer will be removed in Lucene 5.0. |
Uses of Tokenizer in org.apache.lucene.analysis.standard

Subclasses of Tokenizer in org.apache.lucene.analysis.standard

| Class | Description |
|---|---|
| ClassicTokenizer | A grammar-based tokenizer constructed with JFlex. |
| StandardTokenizer | A grammar-based tokenizer constructed with JFlex. |
| UAX29URLEmailTokenizer | Implements the Word Break rules from the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29. URLs and email addresses are also tokenized according to the relevant RFCs. |
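StandardTokenizer is the usual general-purpose choice; besides the term text it also sets a token type such as <ALPHANUM> or <NUM> on each token. A minimal sketch, assuming Lucene 3.x; the printed types in the comment are indicative:

```java
import java.io.StringReader;

import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;
import org.apache.lucene.util.Version;

public class StandardTokenizerDemo {
    public static void main(String[] args) throws Exception {
        StandardTokenizer tokenizer =
            new StandardTokenizer(Version.LUCENE_36, new StringReader("Lucene 3.6 rocks"));
        CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
        TypeAttribute type = tokenizer.addAttribute(TypeAttribute.class);
        tokenizer.reset();
        while (tokenizer.incrementToken()) {
            // e.g. "Lucene" <ALPHANUM>, "3.6" <NUM>, "rocks" <ALPHANUM>
            System.out.println(term.toString() + " " + type.type());
        }
        tokenizer.end();
        tokenizer.close();
    }
}
```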
Uses of Tokenizer in org.apache.lucene.analysis.wikipedia

Subclasses of Tokenizer in org.apache.lucene.analysis.wikipedia

| Class | Description |
|---|---|
| WikipediaTokenizer | Extension of StandardTokenizer that is aware of Wikipedia syntax. |