org.apache.lucene.analysis.icu.segmentation
Class DefaultICUTokenizerConfig

java.lang.Object
  extended by org.apache.lucene.analysis.icu.segmentation.ICUTokenizerConfig
      extended by org.apache.lucene.analysis.icu.segmentation.DefaultICUTokenizerConfig

public class DefaultICUTokenizerConfig
extends ICUTokenizerConfig

Default ICUTokenizerConfig that is generally applicable to many languages.

Generally tokenizes Unicode text according to UAX#29 (BreakIterator.getWordInstance(ULocale.ROOT)), but with the following tailorings:

WARNING: This API is experimental and might change in incompatible ways in the next release.

Field Summary
static String WORD_HANGUL
          Token type for words containing Korean hangul
static String WORD_HIRAGANA
          Token type for words containing Japanese hiragana
static String WORD_IDEO
          Token type for words containing ideographic characters
static String WORD_KATAKANA
          Token type for words containing Japanese katakana
static String WORD_LETTER
          Token type for words that contain letters
static String WORD_NUMBER
          Token type for words that appear to be numbers
 
Constructor Summary
DefaultICUTokenizerConfig()
           
 
Method Summary
 com.ibm.icu.text.BreakIterator getBreakIterator(int script)
          Return a BreakIterator capable of processing a given script.
 String getType(int script, int ruleStatus)
          Return a token type value for a given script and BreakIterator rule status.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

WORD_IDEO

public static final String WORD_IDEO
Token type for words containing ideographic characters


WORD_HIRAGANA

public static final String WORD_HIRAGANA
Token type for words containing Japanese hiragana


WORD_KATAKANA

public static final String WORD_KATAKANA
Token type for words containing Japanese katakana


WORD_HANGUL

public static final String WORD_HANGUL
Token type for words containing Korean hangul


WORD_LETTER

public static final String WORD_LETTER
Token type for words that contain letters


WORD_NUMBER

public static final String WORD_NUMBER
Token type for words that appear to be numbers

Constructor Detail

DefaultICUTokenizerConfig

public DefaultICUTokenizerConfig()

Method Detail

getBreakIterator

public com.ibm.icu.text.BreakIterator getBreakIterator(int script)
Description copied from class: ICUTokenizerConfig
Return a BreakIterator capable of processing a given script.

Specified by:
getBreakIterator in class ICUTokenizerConfig
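
To illustrate what a word-level BreakIterator does with text, the following minimal sketch walks the boundaries it reports and keeps only the segments that contain a letter or digit. It is an assumption-laden stand-in: it uses the JDK's java.text.BreakIterator rather than com.ibm.icu.text.BreakIterator so that it runs without extra libraries, and the class name WordBreakSketch and the method words are hypothetical. It does not reproduce the per-script selection performed by getBreakIterator(int).

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper: uses the JDK's BreakIterator as a stand-in for ICU's.
class WordBreakSketch {

    // Walks the word boundaries of the text and returns the segments
    // that contain at least one letter or digit.
    static List<String> words(String text) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.ROOT);
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String segment = text.substring(start, end);
            // Skip segments made only of whitespace or punctuation.
            if (segment.codePoints().anyMatch(Character::isLetterOrDigit)) {
                out.add(segment);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(words("Hello, world 42"));
    }
}
```

With ICU4J on the classpath, the same boundary-walking loop should apply to the iterator returned by getBreakIterator(int), since ICU's BreakIterator exposes the same first(), next(), and DONE members.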

getType

public String getType(int script,
                      int ruleStatus)
Description copied from class: ICUTokenizerConfig
Return a token type value for a given script and BreakIterator rule status.

Specified by:
getType in class ICUTokenizerConfig
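
As a rough illustration of how a (script, rule status) pair could be mapped to one of the six token types listed under Field Summary, here is a minimal sketch. Everything in it is an assumption made for the example: the class TokenTypeSketch, the STATUS_* codes, and the returned labels are hypothetical, and the JDK's Character.UnicodeScript stands in for the int script codes that getType actually receives. It is not the implementation of DefaultICUTokenizerConfig.getType.

```java
import java.lang.Character.UnicodeScript;

// Hypothetical sketch of a (script, rule status) to token type mapping.
class TokenTypeSketch {

    // Assumed status codes for this example only; a real BreakIterator
    // defines its own rule status values.
    static final int STATUS_NUMBER = 100;
    static final int STATUS_LETTER = 200;
    static final int STATUS_KANA = 300;
    static final int STATUS_IDEO = 400;

    // Returns an illustrative label; the actual String values of the
    // WORD_* constants are not shown in this page.
    static String typeFor(UnicodeScript script, int ruleStatus) {
        switch (ruleStatus) {
            case STATUS_IDEO:
                return "IDEO";
            case STATUS_KANA:
                // Kana status covers both Japanese syllabaries; the script decides.
                return script == UnicodeScript.HIRAGANA ? "HIRAGANA" : "KATAKANA";
            case STATUS_LETTER:
                // Hangul gets its own type; other letter scripts share one.
                return script == UnicodeScript.HANGUL ? "HANGUL" : "LETTER";
            case STATUS_NUMBER:
                return "NUMBER";
            default:
                return "OTHER";
        }
    }

    public static void main(String[] args) {
        System.out.println(typeFor(UnicodeScript.HANGUL, STATUS_LETTER));
        System.out.println(typeFor(UnicodeScript.LATIN, STATUS_NUMBER));
    }
}
```

The design point the sketch tries to convey is that the rule status alone cannot distinguish hiragana from katakana or Hangul from other letters, which is presumably why the method takes the script as a second input.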