Package org.apache.lucene.analysis.cjk

Analyzer for Chinese, Japanese, and Korean, which indexes bigrams (overlapping groups of two adjacent Han characters).

See:
          Description

Class Summary
CJKAnalyzer An Analyzer that tokenizes text with StandardTokenizer, normalizes content with CJKWidthFilter, folds case with LowerCaseFilter, forms bigrams of CJK with CJKBigramFilter, and filters stopwords with StopFilter
CJKBigramFilter Forms bigrams of CJK terms that are generated from StandardTokenizer or ICUTokenizer.
CJKTokenizer Deprecated. Use StandardTokenizer, CJKWidthFilter, CJKBigramFilter, and LowerCaseFilter instead.
CJKWidthFilter A TokenFilter that normalizes CJK width differences: Folds fullwidth ASCII variants into the equivalent basic latin Folds halfwidth Katakana variants into the equivalent kana
 

Package org.apache.lucene.analysis.cjk Description

Analyzer for Chinese, Japanese, and Korean, which indexes bigrams (overlapping groups of two adjacent Han characters).

Three analyzers are provided for Chinese, each of which treats Chinese text in a different way.

Example phrase: "我是中国人"
  1. ChineseAnalyzer: 我-是-中-国-人
  2. CJKAnalyzer: 我是-是中-中国-国人
  3. SmartChineseAnalyzer: 我-是-中国-人