Package org.apache.lucene.analysis.standard

Standards-based analyzers implemented with JFlex.

Interface Summary
StandardTokenizerInterface: Internal interface for supporting versioned grammars.

Class Summary
ClassicAnalyzer: Filters ClassicTokenizer with ClassicFilter, LowerCaseFilter and StopFilter, using a list of English stop words.
ClassicFilter: Normalizes tokens extracted with ClassicTokenizer.
ClassicTokenizer: A grammar-based tokenizer constructed with JFlex.
StandardAnalyzer: Filters StandardTokenizer with StandardFilter, LowerCaseFilter and StopFilter, using a list of English stop words.
StandardFilter: Normalizes tokens extracted with StandardTokenizer.
StandardTokenizer: A grammar-based tokenizer constructed with JFlex.
StandardTokenizerImpl: This class implements the Word Break rules from the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29. Tokens produced are of the following types:
    <ALPHANUM>: A sequence of alphabetic and numeric characters
    <NUM>: A number
    <SOUTHEAST_ASIAN>: A sequence of characters from South and Southeast Asian languages, including Thai, Lao, Myanmar, and Khmer
    <IDEOGRAPHIC>: A single CJKV ideographic character
    <HIRAGANA>: A single hiragana character
UAX29URLEmailAnalyzer: Filters UAX29URLEmailTokenizer with StandardFilter, LowerCaseFilter and StopFilter, using a list of English stop words.
UAX29URLEmailTokenizer: This class implements the Word Break rules from the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29. URLs and email addresses are also tokenized according to the relevant RFCs.
UAX29URLEmailTokenizerImpl: This class implements the Word Break rules from the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29. URLs and email addresses are also tokenized according to the relevant RFCs.
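The UAX #29 word-break behavior referenced above can be previewed without Lucene: the JDK's java.text.BreakIterator also performs Unicode boundary analysis. This is only a rough approximation of the JFlex grammars in this package (it has no <URL> or <EMAIL> token types, and edge cases differ), but it shows the same style of rule-based segmentation:

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class WordBreakDemo {
    // Collect word tokens using the JDK's Unicode boundary analysis.
    // Approximates UAX #29 word breaking; it is NOT this package's grammar.
    static List<String> words(String text) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.ROOT);
        it.setText(text);
        List<String> tokens = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String candidate = text.substring(start, end);
            // Keep only spans containing a letter or digit (skip spaces/punctuation).
            if (candidate.codePoints().anyMatch(Character::isLetterOrDigit)) {
                tokens.add(candidate);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // "4.0" stays one token: UAX #29 does not break Numeric around MidNum ".".
        System.out.println(words("Lucene 4.0 tokenizes text."));
    }
}
```

Note that, like <NUM> above, a decimal such as "4.0" survives as a single token because "." between digits is a MidNum character under the word-break rules.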
 

Package org.apache.lucene.analysis.standard Description

Standards-based analyzers implemented with JFlex.
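The analyzers in this package all follow the same shape: a tokenizer feeding a chain of filters (for example, StandardAnalyzer applies StandardFilter, LowerCaseFilter and StopFilter). A minimal plain-Java sketch of that tokenize/lowercase/stop-filter idea, using a toy whitespace tokenizer and a made-up stop list rather than Lucene's real streaming TokenStream API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class AnalyzerChainSketch {
    // Hypothetical stand-in for a stop-word list; Lucene ships its own
    // English defaults, which are longer than this.
    static final Set<String> STOP_WORDS = Set.of("a", "an", "and", "the");

    // Sketch of Tokenizer -> LowerCaseFilter -> StopFilter. Real Lucene
    // analyzers stream tokens incrementally, and StandardTokenizer uses a
    // JFlex grammar, not split().
    static List<String> analyze(String text) {
        List<String> out = new ArrayList<>();
        for (String token : text.split("\\s+")) {            // toy tokenizer
            String lowered = token.toLowerCase(Locale.ROOT); // LowerCaseFilter
            if (!lowered.isEmpty() && !STOP_WORDS.contains(lowered)) { // StopFilter
                out.add(lowered);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(analyze("The Quick and THE Dead"));
        // -> [quick, dead]
    }
}
```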

The org.apache.lucene.analysis.standard package contains three fast grammar-based tokenizers constructed with JFlex: