JapaneseTokenizer (Lucene 4.0.0 API)

org.apache.lucene.analysis.ja
Class JapaneseTokenizer

java.lang.Object
  org.apache.lucene.util.AttributeSource
      org.apache.lucene.analysis.TokenStream
          org.apache.lucene.analysis.Tokenizer
              org.apache.lucene.analysis.ja.JapaneseTokenizer

All Implemented Interfaces: Closeable

public final class JapaneseTokenizer
extends Tokenizer

Tokenizer for Japanese that uses morphological analysis.

This tokenizer sets a number of additional attributes:

BaseFormAttribute containing base form for inflected adjectives and verbs.
PartOfSpeechAttribute containing part-of-speech.
ReadingAttribute containing reading and pronunciation.
InflectionAttribute containing additional part-of-speech information for inflected forms.
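The attributes listed above can be read from the token stream in the usual TokenStream consume loop. The following is a minimal sketch, not a definitive usage pattern: it assumes the lucene-core and lucene-analyzers-kuromoji 4.0.0 jars are on the classpath, that the attribute interfaces live in `org.apache.lucene.analysis.ja.tokenattributes`, and that passing `null` for the user dictionary is acceptable. The class name and sample text are illustrative only.

```java
import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.ja.JapaneseTokenizer;
import org.apache.lucene.analysis.ja.tokenattributes.BaseFormAttribute;
import org.apache.lucene.analysis.ja.tokenattributes.InflectionAttribute;
import org.apache.lucene.analysis.ja.tokenattributes.PartOfSpeechAttribute;
import org.apache.lucene.analysis.ja.tokenattributes.ReadingAttribute;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Illustrative class name; not part of Lucene.
public class JapaneseTokenizerAttributesExample {
    public static void main(String[] args) throws IOException {
        // Assumption: no user dictionary (null), punctuation discarded, default mode.
        JapaneseTokenizer tokenizer = new JapaneseTokenizer(
                new StringReader("関西国際空港に行きました"),
                null,
                true,
                JapaneseTokenizer.DEFAULT_MODE);

        // Obtain attribute instances once, before consuming the stream.
        CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
        BaseFormAttribute baseForm = tokenizer.addAttribute(BaseFormAttribute.class);
        PartOfSpeechAttribute pos = tokenizer.addAttribute(PartOfSpeechAttribute.class);
        ReadingAttribute reading = tokenizer.addAttribute(ReadingAttribute.class);
        InflectionAttribute inflection = tokenizer.addAttribute(InflectionAttribute.class);

        try {
            tokenizer.reset();
            while (tokenizer.incrementToken()) {
                // Base form and inflection values may be null for uninflected tokens.
                System.out.println(term.toString()
                        + "\tbaseForm=" + baseForm.getBaseForm()
                        + "\tpos=" + pos.getPartOfSpeech()
                        + "\treading=" + reading.getReading()
                        + "\tpronunciation=" + reading.getPronunciation()
                        + "\tinflectionType=" + inflection.getInflectionType()
                        + "\tinflectionForm=" + inflection.getInflectionForm());
            }
            tokenizer.end();
        } finally {
            tokenizer.close();
        }
    }
}
```

The exact tokens and attribute values printed depend on the bundled dictionary, so no expected output is shown here.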

This tokenizer uses a rolling Viterbi search to find the least-cost segmentation (path) of the incoming characters. For tokens that appear to be compounds (longer than 2 characters if all Kanji, or longer than 7 characters otherwise), it checks whether a second-best segmentation of that token exists after penalties are applied to the long tokens. If one exists and the Mode is JapaneseTokenizer.Mode.SEARCH, the alternate segmentation is output as well.
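The idea of a least-cost segmentation, and of how a length penalty can surface an alternate split of a compound, can be sketched with a small dynamic program. This is a simplified illustration, not the tokenizer's implementation: it processes the whole string at once rather than in a rolling window, ignores connection costs between adjacent tokens, and uses a hypothetical toy dictionary, unknown-character cost, and per-character penalty. All names and cost values below are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch; class, method, and constant names are hypothetical.
public class ViterbiSegmentationSketch {
    // Hypothetical cost for a single character that is not in the dictionary.
    static final int UNKNOWN_CHAR_COST = 10000;

    // Hypothetical toy dictionary: surface form -> word cost (lower is better).
    static Map<String, Integer> toyDictionary() {
        Map<String, Integer> dict = new HashMap<String, Integer>();
        dict.put("関西国際空港", 3000);
        dict.put("関西", 1200);
        dict.put("国際", 1200);
        dict.put("空港", 1200);
        return dict;
    }

    // Returns the least-cost segmentation of text. Words longer than
    // longThreshold pay penaltyPerChar for each character over the threshold.
    static List<String> segment(String text, Map<String, Integer> dict,
                                int longThreshold, int penaltyPerChar) {
        int n = text.length();
        long[] best = new long[n + 1]; // best[i]: least cost to segment text[0, i)
        int[] back = new int[n + 1];   // back[i]: start of the last word on that path
        Arrays.fill(best, Long.MAX_VALUE);
        best[0] = 0;

        for (int end = 1; end <= n; end++) {
            for (int start = 0; start < end; start++) {
                if (best[start] == Long.MAX_VALUE) {
                    continue; // no path reaches this start position
                }
                String word = text.substring(start, end);
                Integer wordCost = dict.get(word);
                if (wordCost == null) {
                    if (word.length() > 1) {
                        continue; // unknown words are only allowed as single characters
                    }
                    wordCost = UNKNOWN_CHAR_COST;
                }
                long cost = best[start] + wordCost;
                if (word.length() > longThreshold) {
                    cost += (long) penaltyPerChar * (word.length() - longThreshold);
                }
                if (cost < best[end]) {
                    best[end] = cost;
                    back[end] = start;
                }
            }
        }

        // Walk the back pointers from the end of the text to recover the path.
        List<String> tokens = new ArrayList<String>();
        for (int end = n; end > 0; end = back[end]) {
            tokens.add(text.substring(back[end], end));
        }
        Collections.reverse(tokens);
        return tokens;
    }

    public static void main(String[] args) {
        Map<String, Integer> dict = toyDictionary();
        // No penalty: the single compound entry (cost 3000) beats the split (cost 3600).
        System.out.println(segment("関西国際空港", dict, 2, 0));
        // Penalty of 500 per character over length 2: compound costs 5000, split still 3600.
        System.out.println(segment("関西国際空港", dict, 2, 500));
    }
}
```

With these toy costs, the unpenalized run keeps the compound as one token, while the penalized run prefers the three-way split. In the sketch each call returns only one path; the tokenizer described above can additionally emit the alternate segmentation alongside the original token in search mode.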

Nested Class Summary
`static class`	`JapaneseTokenizer.Mode` Tokenization mode: this determines how the tokenizer handles compound and unknown words.
`static class`	`JapaneseTokenizer.Type` Token type reflecting the original source of this token.

Nested classes/interfaces inherited from class org.apache.lucene.util.AttributeSource
`AttributeSource.AttributeFactory, AttributeSource.State`

Field Summary
`static JapaneseTokenizer.Mode`	`DEFAULT_MODE` Default tokenization mode.

Fields inherited from class org.apache.lucene.analysis.Tokenizer
`input`

Constructor Summary
`JapaneseTokenizer(Reader input, UserDictionary userDictionary, boolean discardPunctuation, JapaneseTokenizer.Mode mode)` Create a new JapaneseTokenizer.

Method Summary
`void`	`end()`
`boolean`	`incrementToken()`
`void`	`reset()`
`void`	`setGraphvizFormatter(GraphvizFormatter dotOut)` Expert: set this to produce Graphviz (dot) output of the Viterbi lattice.

Methods inherited from class org.apache.lucene.analysis.Tokenizer
`close, correctOffset, setReader`

Methods inherited from class org.apache.lucene.util.AttributeSource
`addAttribute, addAttributeImpl, captureState, clearAttributes, cloneAttributes, copyTo, equals, getAttribute, getAttributeClassesIterator, getAttributeFactory, getAttributeImplsIterator, hasAttribute, hasAttributes, hashCode, reflectAsString, reflectWith, restoreState`

Methods inherited from class java.lang.Object
`clone, finalize, getClass, notify, notifyAll, toString, wait, wait, wait`

Field Detail

DEFAULT_MODE

public static final JapaneseTokenizer.Mode DEFAULT_MODE

Default tokenization mode. Currently this is JapaneseTokenizer.Mode.SEARCH.

Constructor Detail