| Interface Summary | |
|---|---|
| StandardTokenizerInterface | Internal interface for supporting versioned grammars. |
| Class Summary | |
|---|---|
| ClassicAnalyzer | Filters ClassicTokenizer with ClassicFilter, LowerCaseFilter and StopFilter, using a list of English stop words. |
| ClassicFilter | Normalizes tokens extracted with ClassicTokenizer. |
| ClassicFilterFactory | Factory for ClassicFilter. |
| ClassicTokenizer | A grammar-based tokenizer constructed with JFlex. |
| ClassicTokenizerFactory | Factory for ClassicTokenizer. |
| StandardAnalyzer | Filters StandardTokenizer with StandardFilter, LowerCaseFilter and StopFilter, using a list of English stop words. |
| StandardFilter | Normalizes tokens extracted with StandardTokenizer. |
| StandardFilterFactory | Factory for StandardFilter. |
| StandardTokenizer | A grammar-based tokenizer constructed with JFlex. |
| StandardTokenizerFactory | Factory for StandardTokenizer. |
| StandardTokenizerImpl | This class implements the Word Break rules from the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29. |
| UAX29URLEmailAnalyzer | Filters UAX29URLEmailTokenizer with StandardFilter, LowerCaseFilter and StopFilter, using a list of English stop words. |
| UAX29URLEmailTokenizer | This class implements the Word Break rules from the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29; URLs and email addresses are also tokenized according to the relevant RFCs. |
| UAX29URLEmailTokenizerFactory | Factory for UAX29URLEmailTokenizer. |
| UAX29URLEmailTokenizerImpl | This class implements the Word Break rules from the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29; URLs and email addresses are also tokenized according to the relevant RFCs. |
Fast, general-purpose grammar-based tokenizers.
The org.apache.lucene.analysis.standard package contains three
fast grammar-based tokenizers constructed with JFlex:
StandardTokenizer:
as of Lucene 3.1, implements the Word Break rules from the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29. Unlike UAX29URLEmailTokenizer, URLs and email addresses are not tokenized as single tokens, but are instead split up into tokens according to the UAX#29 word break rules. StandardAnalyzer includes StandardTokenizer, StandardFilter, LowerCaseFilter and StopFilter. When the Version specified in the constructor is lower than 3.1, the ClassicTokenizer implementation is invoked.
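As a rough illustration (not taken from the package documentation itself): the sketch below runs StandardAnalyzer over a short string and prints the resulting terms. It assumes a Lucene 3.x/4.x-era API in which the analyzer takes a Version constant and tokens are read through the attribute API; the field name, sample text, and the Version.LUCENE_36 constant are placeholders, not requirements.

```java
import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class StandardAnalyzerDemo {
  public static void main(String[] args) throws IOException {
    // Version.LUCENE_36 is illustrative; pass the version your index targets.
    Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_36);
    TokenStream ts = analyzer.tokenStream("body",
        new StringReader("The Quick Brown Fox jumped over the lazy dog"));
    CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
    ts.reset();  // must be called before the first incrementToken()
    while (ts.incrementToken()) {
      // Lower-cased terms with English stop words removed,
      // e.g. "quick", "brown", "fox", ...
      System.out.println(term.toString());
    }
    ts.end();
    ts.close();
  }
}
```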
ClassicTokenizer:
this class was formerly (prior to Lucene 3.1) named StandardTokenizer. (Its tokenization rules are not based on the Unicode Text Segmentation algorithm.) ClassicAnalyzer includes ClassicTokenizer, StandardFilter, LowerCaseFilter and StopFilter.
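To make that composition concrete, here is a sketch of roughly the filter chain ClassicAnalyzer assembles, assuming 3.x-era constructors and import paths (LowerCaseFilter and StopFilter later moved to org.apache.lucene.analysis.core, and exact signatures vary by release). The classicChain helper name is made up for illustration.

```java
import java.io.Reader;

import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.standard.ClassicTokenizer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.util.Version;

public class ClassicChainSketch {
  // Roughly what ClassicAnalyzer does internally: tokenize with the
  // classic grammar, normalize, lower-case, then drop English stop words.
  static TokenStream classicChain(Version matchVersion, Reader reader) {
    Tokenizer source = new ClassicTokenizer(matchVersion, reader);
    TokenStream result = new StandardFilter(matchVersion, source);
    result = new LowerCaseFilter(matchVersion, result);
    return new StopFilter(matchVersion, result, StandardAnalyzer.STOP_WORDS_SET);
  }
}
```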
UAX29URLEmailTokenizer:
implements the Word Break rules from the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29. URLs and email addresses are also tokenized according to the relevant RFCs. UAX29URLEmailAnalyzer includes UAX29URLEmailTokenizer, StandardFilter, LowerCaseFilter and StopFilter.
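A sketch of the tokenizer on input containing a URL and an email address, assuming the (Version, Reader) constructor from the 3.6/4.x line; the sample text is arbitrary and the printed token types are indicative.

```java
import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.standard.UAX29URLEmailTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;
import org.apache.lucene.util.Version;

public class UrlEmailTokenizerDemo {
  public static void main(String[] args) throws IOException {
    UAX29URLEmailTokenizer tok = new UAX29URLEmailTokenizer(Version.LUCENE_36,
        new StringReader("Mail admin@example.com or see https://example.com/docs"));
    CharTermAttribute term = tok.addAttribute(CharTermAttribute.class);
    TypeAttribute type = tok.addAttribute(TypeAttribute.class);
    tok.reset();
    while (tok.incrementToken()) {
      // URLs and email addresses come through as single tokens carrying
      // their own token types (e.g. "<URL>", "<EMAIL>").
      System.out.println(term + "\t" + type.type());
    }
    tok.end();
    tok.close();
  }
}
```

With StandardTokenizer, by contrast, the same input would be split at the punctuation inside the URL and the email address, per the UAX#29 word break rules described above.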