| Interface Summary | |
|---|---|
| StandardTokenizerInterface | Internal interface for supporting versioned grammars. |
| Class Summary | |
|---|---|
| ClassicAnalyzer | Filters ClassicTokenizer with ClassicFilter, LowerCaseFilter and StopFilter, using a list of English stop words. |
| ClassicFilter | Normalizes tokens extracted with ClassicTokenizer. |
| ClassicFilterFactory | Factory for ClassicFilter. |
| ClassicTokenizer | A grammar-based tokenizer constructed with JFlex. |
| ClassicTokenizerFactory | Factory for ClassicTokenizer. |
| StandardAnalyzer | Filters StandardTokenizer with StandardFilter, LowerCaseFilter and StopFilter, using a list of English stop words. |
| StandardFilter | Normalizes tokens extracted with StandardTokenizer. |
| StandardFilterFactory | Factory for StandardFilter. |
| StandardTokenizer | A grammar-based tokenizer constructed with JFlex. |
| StandardTokenizerFactory | Factory for StandardTokenizer. |
| StandardTokenizerImpl | This class implements the Word Break rules from the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29. |
| UAX29URLEmailAnalyzer | Filters UAX29URLEmailTokenizer with StandardFilter, LowerCaseFilter and StopFilter, using a list of English stop words. |
| UAX29URLEmailTokenizer | This class implements the Word Break rules from the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29; URLs and email addresses are also tokenized according to the relevant RFCs. |
| UAX29URLEmailTokenizerFactory | Factory for UAX29URLEmailTokenizer. |
| UAX29URLEmailTokenizerImpl | This class implements the Word Break rules from the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29; URLs and email addresses are also tokenized according to the relevant RFCs. |
Fast, general-purpose grammar-based tokenizers.
The org.apache.lucene.analysis.standard package contains three
fast grammar-based tokenizers constructed with JFlex:
StandardTokenizer:
as of Lucene 3.1, implements the Word Break rules from the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29. Unlike UAX29URLEmailTokenizer, URLs and email addresses are not tokenized as single tokens, but are instead split up into tokens according to the UAX#29 word break rules. StandardAnalyzer includes StandardTokenizer, StandardFilter, LowerCaseFilter and StopFilter. When the Version specified in the constructor is lower than 3.1, the ClassicTokenizer implementation is invoked.
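As a rough illustration (not taken from the package documentation itself): the sketch below runs StandardAnalyzer over a short string and prints the resulting terms. It assumes a Lucene 3.x/4.x-era API in which the analyzer takes a Version constant and tokens are read through the attribute API; the field name, sample text, and the Version.LUCENE_36 constant are placeholders, not requirements.

```java
import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class StandardAnalyzerDemo {
  public static void main(String[] args) throws IOException {
    // Version.LUCENE_36 is illustrative; pass the version your index targets.
    Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_36);
    TokenStream ts = analyzer.tokenStream("body",
        new StringReader("The Quick Brown Fox jumped over the lazy dog"));
    CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
    ts.reset();  // must be called before the first incrementToken()
    while (ts.incrementToken()) {
      // Lower-cased terms with English stop words removed,
      // e.g. "quick", "brown", "fox", ...
      System.out.println(term.toString());
    }
    ts.end();
    ts.close();
  }
}
```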
ClassicTokenizer:
this class was formerly (prior to Lucene 3.1) named StandardTokenizer. (Its tokenization rules are not based on the Unicode Text Segmentation algorithm.) ClassicAnalyzer includes ClassicTokenizer, StandardFilter, LowerCaseFilter and StopFilter.
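To make that composition concrete, here is a sketch of roughly the filter chain ClassicAnalyzer assembles, assuming 3.x-era constructors and import paths (LowerCaseFilter and StopFilter later moved to org.apache.lucene.analysis.core, and exact signatures vary by release). The classicChain helper name is made up for illustration.

```java
import java.io.Reader;

import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.standard.ClassicTokenizer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.util.Version;

public class ClassicChainSketch {
  // Roughly what ClassicAnalyzer does internally: tokenize with the
  // classic grammar, normalize, lower-case, then drop English stop words.
  static TokenStream classicChain(Version matchVersion, Reader reader) {
    Tokenizer source = new ClassicTokenizer(matchVersion, reader);
    TokenStream result = new StandardFilter(matchVersion, source);
    result = new LowerCaseFilter(matchVersion, result);
    return new StopFilter(matchVersion, result, StandardAnalyzer.STOP_WORDS_SET);
  }
}
```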
UAX29URLEmailTokenizer:
implements the Word Break rules from the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29. URLs and email addresses are also tokenized according to the relevant RFCs. UAX29URLEmailAnalyzer includes UAX29URLEmailTokenizer, StandardFilter, LowerCaseFilter and StopFilter.
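A sketch of the tokenizer on input containing a URL and an email address, assuming the (Version, Reader) constructor from the 3.6/4.x line; the sample text is arbitrary and the printed token types are indicative.

```java
import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.standard.UAX29URLEmailTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;
import org.apache.lucene.util.Version;

public class UrlEmailTokenizerDemo {
  public static void main(String[] args) throws IOException {
    UAX29URLEmailTokenizer tok = new UAX29URLEmailTokenizer(Version.LUCENE_36,
        new StringReader("Mail admin@example.com or see https://example.com/docs"));
    CharTermAttribute term = tok.addAttribute(CharTermAttribute.class);
    TypeAttribute type = tok.addAttribute(TypeAttribute.class);
    tok.reset();
    while (tok.incrementToken()) {
      // URLs and email addresses come through as single tokens carrying
      // their own token types (e.g. "<URL>", "<EMAIL>").
      System.out.println(term + "\t" + type.type());
    }
    tok.end();
    tok.close();
  }
}
```

With StandardTokenizer, by contrast, the same input would be split at the punctuation inside the URL and the email address, per the UAX#29 word break rules described above.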