DefaultICUTokenizerConfig (Lucene 3.6.0 API)

org.apache.lucene.analysis.icu.segmentation
Class DefaultICUTokenizerConfig

java.lang.Object
  org.apache.lucene.analysis.icu.segmentation.ICUTokenizerConfig
      org.apache.lucene.analysis.icu.segmentation.DefaultICUTokenizerConfig

public class DefaultICUTokenizerConfig
extends ICUTokenizerConfig

Default ICUTokenizerConfig that is generally applicable to many languages.

Generally tokenizes Unicode text according to UAX#29 (BreakIterator.getWordInstance(ULocale.ROOT)), but with the following tailorings:

Thai text is broken into words with a DictionaryBasedBreakIterator.
Lao, Myanmar, and Khmer text is broken into syllables based on custom BreakIterator rules.
Hebrew text has custom tailorings to handle special cases involving punctuation.

WARNING: This API is experimental and might change in incompatible ways in the next release.
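
As a rough illustration of the UAX#29-style word segmentation described above, the following sketch walks word boundaries with the JDK's `java.text.BreakIterator`. This is an assumption made so the example runs without dependencies: the class documented here uses ICU's `com.ibm.icu.text.BreakIterator` and adds the tailorings listed above, none of which are reproduced here. The class name `WordBreakSketch` and the rule of keeping only segments that contain a letter or digit are hypothetical.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Lucene or ICU.
final class WordBreakSketch {

    // Splits text at word boundaries and keeps only segments that
    // contain at least one letter or digit (drops spaces, punctuation).
    static List<String> words(String text) {
        // JDK stand-in for ICU's BreakIterator.getWordInstance(ULocale.ROOT).
        BreakIterator it = BreakIterator.getWordInstance(Locale.ROOT);
        it.setText(text);

        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String piece = text.substring(start, end);
            if (piece.codePoints().anyMatch(Character::isLetterOrDigit)) {
                out.add(piece);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(words("Hello, world 42"));
    }
}
```

For simple Latin text the JDK and ICU iterators can be expected to agree; for Thai, Lao, Myanmar, Khmer, and Hebrew the tailored behavior described above would differ from this sketch.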

Field Summary
`static String`	`WORD_HANGUL` Token type for words containing Korean hangul
`static String`	`WORD_HIRAGANA` Token type for words containing Japanese hiragana
`static String`	`WORD_IDEO` Token type for words containing ideographic characters
`static String`	`WORD_KATAKANA` Token type for words containing Japanese katakana
`static String`	`WORD_LETTER` Token type for words that contain letters
`static String`	`WORD_NUMBER` Token type for words that appear to be numbers

Constructor Summary
`DefaultICUTokenizerConfig()`

Method Summary
`com.ibm.icu.text.BreakIterator`	`getBreakIterator(int script)` Return a BreakIterator capable of processing a given script.
`String`	`getType(int script, int ruleStatus)` Return a token type value for a given script and BreakIterator rule status.
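
To make the idea behind `getType` more concrete, here is a minimal sketch of assigning a token type from the script of a token's characters. It is not the implementation of this class: the real method takes an ICU script code and a BreakIterator rule status, whereas this sketch inspects the token text with the JDK's `Character.UnicodeScript`. The class `TokenTypeSketch`, the `Type` enum, and the classification rule (first letter decides, digits-only tokens are numbers) are all assumptions for illustration, and the enum names are placeholders rather than the actual values of the `WORD_*` constants.

```java
import java.lang.Character.UnicodeScript;

// Hypothetical helper, not part of Lucene or ICU.
final class TokenTypeSketch {

    // Placeholder labels mirroring the six WORD_* categories, plus OTHER.
    enum Type { IDEO, HIRAGANA, KATAKANA, HANGUL, LETTER, NUMBER, OTHER }

    // Classifies a token by the script of its first letter; a token with
    // digits but no letters is treated as a number.
    static Type classify(String token) {
        boolean sawDigit = false;
        for (int i = 0; i < token.length(); ) {
            int cp = token.codePointAt(i);
            i += Character.charCount(cp);
            if (Character.isLetter(cp)) {
                switch (UnicodeScript.of(cp)) {
                    case HAN:      return Type.IDEO;
                    case HIRAGANA: return Type.HIRAGANA;
                    case KATAKANA: return Type.KATAKANA;
                    case HANGUL:   return Type.HANGUL;
                    default:       return Type.LETTER;
                }
            } else if (Character.isDigit(cp)) {
                sawDigit = true;
            }
        }
        return sawDigit ? Type.NUMBER : Type.OTHER;
    }

    public static void main(String[] args) {
        System.out.println(classify("lucene"));
        System.out.println(classify("2012"));
        System.out.println(classify("\u6f22\u5b57")); // two Han ideographs
    }
}
```

Deciding on the first letter keeps the sketch short; a tokenizer that already splits text at script boundaries would not normally hand mixed-script tokens to such a function.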

Methods inherited from class java.lang.Object
`clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait`

Field Detail

WORD_IDEO

public static final String WORD_IDEO

Token type for words containing ideographic characters

WORD_HIRAGANA

public static final String WORD_HIRAGANA

Token type for words containing Japanese hiragana

WORD_KATAKANA

public static final String WORD_KATAKANA

Token type for words containing Japanese katakana

WORD_HANGUL

public static final String WORD_HANGUL

Token type for words containing Korean hangul

WORD_LETTER

public static final String WORD_LETTER

Token type for words that contain letters

WORD_NUMBER

public static final String WORD_NUMBER

Token type for words that appear to be numbers

Constructor Detail